In [16]:
# hide
from nbdev.showdoc import *

# Synthetic Data Generation with NumerBlox

This example notebook covers ways to generate synthetic data using `numerblox` components. Synthetic data can improve model performance by providing additional, more representative training data. We cover ways to generate both synthetic target variables and synthetic features.

## 0. Download and load

In [17]:
from numerblox.download import NumeraiClassicDownloader
from numerblox.numerframe import create_numerframe, NumerFrame

In [18]:
dl = NumeraiClassicDownloader(directory_path="synth_test")
dl.download_training_data(version=3)

2022-04-19 12:30:26,420 INFO numerapi.utils: target file already exists
2022-04-19 12:30:26,421 INFO numerapi.utils: download complete


2022-04-19 12:30:27,681 INFO numerapi.utils: target file already exists
2022-04-19 12:30:27,683 INFO numerapi.utils: download complete


In [19]:
dataf = create_numerframe("synth_test/numerai_training_data.parquet")

In [20]:
dataf.head(2)

id,era,data_type,feature_dichasial_hammier_spawner,feature_rheumy_epistemic_prancer,feature_pert_performative_hormuz,feature_hillier_unpitied_theobromine,feature_perigean_bewitching_thruster,feature_renegade_undomestic_milord,feature_koranic_rude_corf,feature_demisable_expiring_millepede,...,target_paul_20,target_paul_60,target_george_20,target_george_60,target_william_20,target_william_60,target_arthur_20,target_arthur_60,target_thomas_20,target_thomas_60
n003bba8a98662e4,1,train,1.0,0.5,1.0,1.0,0.0,0.0,1.0,1.0,...,0.25,0.25,0.25,0.0,0.166667,0.0,0.166667,0.0,0.166667,0.0
n003bee128c2fcfc,1,train,0.5,1.0,0.25,0.75,0.0,0.75,0.5,0.75,...,1.0,1.0,1.0,1.0,0.833333,0.666667,0.833333,0.666667,0.833333,0.666667


## 1. Synthetic target (Bayesian GMM)

First we tackle creating a synthetic target column to improve model performance. `BayesianGMMTargetProcessor` generates a new target variable based on a given target. The preprocessor samples the target from a [Bayesian Gaussian Mixture model](https://scikit-learn.org/stable/modules/generated/sklearn.mixture.BayesianGaussianMixture.html) which is fitted on coefficients from a [regularized linear model (Ridge regression)](https://scikit-learn.org/stable/modules/generated/sklearn.linear_model.Ridge.html).

This implementation is based on a [GitHub Gist by Michael Oliver (mdo)](https://gist.github.com/the-moliver/dcdd2862dc2c78dda600f1b449071c93).
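To make the idea concrete, here is a minimal sketch of the general recipe described above, using scikit-learn directly on toy data. It is not the `BayesianGMMTargetProcessor` source: the toy data, the per-era Ridge fit, the number of components, and the `to_buckets` helper that maps scores to five buckets are all assumptions made for illustration.

```python
import numpy as np
from sklearn.linear_model import Ridge
from sklearn.mixture import BayesianGaussianMixture

rng = np.random.default_rng(0)
n_eras, n_rows, n_features = 30, 60, 4

# Toy data (assumption): one (features, target) pair per era.
eras = []
for _ in range(n_eras):
    features = rng.choice([0.0, 0.25, 0.5, 0.75, 1.0], size=(n_rows, n_features))
    signal = (features - 0.5) @ rng.normal(size=n_features)
    target = signal + rng.normal(scale=0.1, size=n_rows)
    eras.append((features, target))

# Step 1: fit one Ridge model per era and keep its coefficient vector.
era_coefs = np.array([
    Ridge(alpha=1.0).fit(features, target).coef_
    for features, target in eras
])

# Step 2: fit a Bayesian Gaussian Mixture on the per-era coefficients.
bgmm = BayesianGaussianMixture(n_components=3, max_iter=500, random_state=0)
bgmm.fit(era_coefs)

# Step 3: sample one coefficient vector per era from the mixture.
sampled_coefs, _ = bgmm.sample(n_eras)


def to_buckets(scores, n_buckets=5):
    """Rank scores and map them to evenly spaced buckets in [0, 1] (hypothetical helper)."""
    ranks = scores.argsort().argsort()
    return np.floor(ranks / len(scores) * n_buckets) / (n_buckets - 1)


# Step 4: project each era's features onto its sampled coefficients and bucket the result.
fake_targets = [
    to_buckets(features @ coef)
    for (features, _), coef in zip(eras, sampled_coefs)
]
```

Each entry in `fake_targets` is a per-era synthetic target drawn from a linear relationship that resembles, but does not copy, the relationships observed across eras.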

In [21]:
from numerblox.preprocessing import BayesianGMMTargetProcessor

In [22]:
show_doc(BayesianGMMTargetProcessor)

<h2 id="BayesianGMMTargetProcessor" class="doc_header"><code>class</code> <code>BayesianGMMTargetProcessor</code><a href="https://github.com/crowdcent/numerblox/tree/master/numerblox/preprocessing.py#L258" class="source_link" style="float:right">[source]</a></h2>

> <code>BayesianGMMTargetProcessor</code>(**`target_col`**:`str`=*`'target'`*, **`n_components`**:`int`=*`6`*) :: [`BaseProcessor`](/numerbloxpreprocessing.html#BaseProcessor)

Generate synthetic (fake) target using a Bayesian Gaussian Mixture model. 

Based on Michael Oliver's GitHub Gist implementation: 

https://gist.github.com/the-moliver/dcdd2862dc2c78dda600f1b449071c93

:param target_col: Column from which to create fake target. 

:param n_components: Number of components for fitting Bayesian Gaussian Mixture Model.

In [23]:
dataf.head()

id,era,data_type,feature_dichasial_hammier_spawner,feature_rheumy_epistemic_prancer,feature_pert_performative_hormuz,feature_hillier_unpitied_theobromine,feature_perigean_bewitching_thruster,feature_renegade_undomestic_milord,feature_koranic_rude_corf,feature_demisable_expiring_millepede,...,target_paul_20,target_paul_60,target_george_20,target_george_60,target_william_20,target_william_60,target_arthur_20,target_arthur_60,target_thomas_20,target_thomas_60
n003bba8a98662e4,1,train,1.0,0.5,1.0,1.0,0.0,0.0,1.0,1.0,...,0.25,0.25,0.25,0.0,0.166667,0.0,0.166667,0.0,0.166667,0.0
n003bee128c2fcfc,1,train,0.5,1.0,0.25,0.75,0.0,0.75,0.5,0.75,...,1.0,1.0,1.0,1.0,0.833333,0.666667,0.833333,0.666667,0.833333,0.666667
n0048ac83aff7194,1,train,0.5,0.25,0.75,0.0,0.75,0.0,0.75,0.75,...,0.5,0.25,0.25,0.25,0.5,0.333333,0.5,0.333333,0.5,0.333333
n00691bec80d3e02,1,train,1.0,0.5,0.5,0.75,0.0,1.0,0.25,1.0,...,0.5,0.5,0.5,0.5,0.666667,0.5,0.5,0.5,0.666667,0.5
n00b8720a2fdc4f2,1,train,1.0,0.75,1.0,1.0,0.0,0.0,1.0,0.5,...,0.5,0.5,0.5,0.5,0.666667,0.5,0.5,0.5,0.666667,0.5


In [24]:
# Create a processor that derives a fake target from `target_nomi_20`.
bgmm = BayesianGMMTargetProcessor(target_col="target_nomi_20")
test_columns = ['era', 'data_type', 'feature_dichasial_hammier_spawner',
                'feature_rheumy_epistemic_prancer', 'target',
                'target_nomi_20', 'target_paul_20']
# Work on a random sample of 100 rows and fill missing values with 0.5.
sample_dataf = NumerFrame(dataf[test_columns].sample(100).fillna(0.5))
fake_dataf = bgmm(sample_dataf)

Generating fake target:   0%|          | 0/91 [00:00<?, ?it/s]

In [25]:
sample_dataf.head()

id,era,data_type,feature_dichasial_hammier_spawner,feature_rheumy_epistemic_prancer,target,target_nomi_20,target_paul_20,target_nomi_20_fake
ne87a484dba54b55,278,train,0.5,0.75,0.5,0.5,0.5,0.5
n7e246122dd0b76f,562,train,1.0,0.75,0.5,0.5,0.5,0.5
n9f667bd2c4387e2,557,train,0.0,0.0,0.75,0.75,0.5,0.5
nd4ff4b599aa37ab,378,train,0.25,0.5,0.25,0.25,0.25,0.75
n4afc6dfde83eb35,498,train,0.75,0.75,0.5,0.5,0.5,0.5


The new target will be suffixed by `_fake` to distinguish it from the original targets.
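Because of this naming convention, fake targets can be picked out by suffix with plain pandas. The sketch below uses a small toy `DataFrame` with made-up values (an assumption, not output from the processor) to pair each `_fake` column with its original and measure how often the two agree.

```python
import pandas as pd

# Toy frame mirroring the naming convention; the values are made up.
toy = pd.DataFrame({
    "era": [1, 1, 2, 2],
    "target_nomi_20": [0.25, 0.5, 0.75, 0.5],
    "target_nomi_20_fake": [0.5, 0.5, 0.75, 0.25],
})

# Select fake target columns by suffix and recover the original column names.
fake_cols = [col for col in toy.columns if col.endswith("_fake")]
original_cols = [col[: -len("_fake")] for col in fake_cols]

# Fraction of rows where the fake target equals the original target.
agreement = {
    fake: float((toy[fake] == toy[orig]).mean())
    for fake, orig in zip(fake_cols, original_cols)
}
```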

In [26]:
fake_dataf.get_target_data.head(2)

id,target,target_nomi_20,target_paul_20,target_nomi_20_fake
ne87a484dba54b55,0.5,0.5,0.5,0.5
n7e246122dd0b76f,0.5,0.5,0.5,0.5


Note that you can easily generate multiple fake targets in a loop.

In [27]:
for target_col in sample_dataf.target_cols:
    bgmm = BayesianGMMTargetProcessor(target_col=target_col)
    sample_dataf = bgmm(sample_dataf)
sample_dataf.get_target_data.head(2)

Generating fake target:   0%|          | 0/91 [00:00<?, ?it/s]

id,target,target_nomi_20,target_paul_20,target_nomi_20_fake,target_fake,target_paul_20_fake
ne87a484dba54b55,0.5,0.5,0.5,0.5,0.5,0.5
n7e246122dd0b76f,0.5,0.5,0.5,0.5,0.5,0.5


## 2. DeepDream Generator

In [28]:
from numerblox.preprocessing import DeepDreamGenerator

In [29]:
show_doc(DeepDreamGenerator)

<h2 id="DeepDreamGenerator" class="doc_header"><code>class</code> <code>DeepDreamGenerator</code><a href="https://github.com/crowdcent/numerblox/tree/master/numerblox/preprocessing.py#L179" class="source_link" style="float:right">[source]</a></h2>

> <code>DeepDreamGenerator</code>(**`model_path`**:`str`, **`batch_size`**:`int`=*`200000`*, **`steps`**:`int`=*`5`*, **`step_size`**:`float`=*`0.01`*, **`feature_names`**:`list`=*`None`*) :: [`BaseProcessor`](/numerbloxpreprocessing.html#BaseProcessor)

Generate synthetic eras using DeepDream technique. 

Based on implementation by nemethpeti: 

https://github.com/nemethpeti/numerai/blob/main/DeepDream/deepdream.py

:param model_path: Path to trained DeepDream model. Example can be downloaded from 

https://github.com/nemethpeti/numerai/blob/main/DeepDream/model.h5

For our example we will use the model open sourced by [nemethpeti](https://github.com/nemethpeti) which you can download [here](https://github.com/nemethpeti/numerai/blob/main/DeepDream/model.h5). This model works on the v3 medium feature set. We therefore use v3 data in this example. The v3 medium feature set can be easily retrieved using `NumeraiClassicDownloader`.
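The `steps` and `step_size` parameters in the signature above point to an iterative, gradient-based update of the input features. The following is a generic NumPy sketch of that idea, not the `DeepDreamGenerator` implementation: the logistic `predict` function, its random `weights`, the squared-error loss, and the gradient normalization are all assumptions made for illustration.

```python
import numpy as np

rng = np.random.default_rng(42)
n_rows, n_features = 100, 8

# Hypothetical stand-in for a trained model: a fixed logistic regression.
weights = rng.normal(size=n_features)


def predict(features):
    """Toy model output in (0, 1)."""
    return 1.0 / (1.0 + np.exp(-((features - 0.5) @ weights)))


def loss_gradient(features, targets):
    """Gradient of the per-row squared error with respect to the input features."""
    preds = predict(features)
    dloss_dz = 2.0 * (preds - targets) * preds * (1.0 - preds)
    return dloss_dz[:, None] * weights[None, :]


def dream(features, targets, steps=5, step_size=0.01):
    """Nudge features for a few steps so the toy model fits the targets better."""
    dreamed = features.copy()
    for _ in range(steps):
        grad = loss_gradient(dreamed, targets)
        # Normalize so that step_size controls the typical update magnitude.
        grad /= np.abs(grad).mean() + 1e-8
        # Move against the gradient and keep features in the valid [0, 1] range.
        dreamed = np.clip(dreamed - step_size * grad, 0.0, 1.0)
    return dreamed


features = rng.uniform(size=(n_rows, n_features))
targets = rng.choice([0.0, 0.25, 0.5, 0.75, 1.0], size=n_rows)
dreamed = dream(features, targets)
```

In this sketch only the features change; `targets` is left untouched, so each synthetic row would reuse the target of the row it was derived from.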

In [30]:
#hide_output
feature_set = dl.get_classic_features(filename="v3/features.json")
feature_names = feature_set['feature_sets']['medium']

2022-04-19 12:30:43,836 INFO numerapi.utils: starting download
synth_test/features.json: 441kB [00:00, 644kB/s]                            


In [31]:
ddg = DeepDreamGenerator(model_path="../test_assets/deepdream_model.h5",
                         feature_names=feature_names)

2022-04-19 12:30:44.640007: I tensorflow/core/platform/cpu_feature_guard.cc:151] This TensorFlow binary is optimized with oneAPI Deep Neural Network Library (oneDNN) to use the following CPU instructions in performance-critical operations:  AVX2 FMA
To enable them in other operations, rebuild TensorFlow with the appropriate compiler flags.


Let's try to generate features from a small subset of 100 rows.

In [32]:
sample_dataf_2 = NumerFrame(dataf.sample(100))

In [33]:
dreamed_dataf = ddg.transform(sample_dataf_2)

Deepdreaming Synthetic Batches:   0%|          | 0/5 [00:00<?, ?it/s]

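The new dreamed `NumerFrame` consists of the original data followed by the newly generated synthetic rows. In the run shown here that gives 199 rows in total for the 100 sampled rows. Note that the targets are the same as in the original data.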

Also, `era`, `data_type` and any other columns besides features and targets will contain `NaN` for the synthetic rows.
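The `NaN` pattern is what pandas produces when frames with different columns are stacked. The toy example below (column names and values are assumptions, and this is not necessarily how `DeepDreamGenerator` builds its output) concatenates original rows with synthetic rows that only carry features and targets.

```python
import pandas as pd

# Original rows carry metadata columns as well as features and targets.
original = pd.DataFrame(
    {
        "era": ["0001", "0001"],
        "data_type": ["train", "train"],
        "feature_a": [0.25, 1.0],
        "target": [0.5, 0.75],
    },
    index=["id_1", "id_2"],
)

# Synthetic rows only have features and targets, with a fresh integer index.
synthetic = pd.DataFrame({"feature_a": [0.31, 0.94], "target": [0.5, 0.75]})

# Columns missing from the synthetic rows are filled with NaN.
combined = pd.concat([original, synthetic])
n_missing_era = int(combined["era"].isna().sum())
```

If a downstream step groups by `era`, such rows need an era assigned (or need to be handled separately) before they are used.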

In [34]:
print(dreamed_dataf.shape)
dreamed_dataf.tail()

(199, 1073)


Unnamed: 0,era,data_type,feature_dichasial_hammier_spawner,feature_rheumy_epistemic_prancer,feature_pert_performative_hormuz,feature_hillier_unpitied_theobromine,feature_perigean_bewitching_thruster,feature_renegade_undomestic_milord,feature_koranic_rude_corf,feature_demisable_expiring_millepede,...,target_paul_20,target_paul_60,target_george_20,target_george_60,target_william_20,target_william_60,target_arthur_20,target_arthur_60,target_thomas_20,target_thomas_60
94,,,0.236611,0.721029,0.707977,0.02525,0.53145,0.493006,0.506379,1.0,...,0.5,0.5,0.75,0.75,0.5,0.5,0.5,0.5,0.5,0.666667
95,,,0.0,0.688268,0.717093,0.212227,0.759024,0.633137,0.996057,0.568541,...,0.5,0.0,0.5,0.25,0.5,0.333333,0.5,0.333333,0.5,0.166667
96,,,0.987918,0.760779,0.8138,0.965248,0.300323,0.495363,0.998792,0.229803,...,0.5,0.5,0.0,0.25,0.5,0.5,0.5,0.5,0.0,0.333333
97,,,0.764998,1.0,0.766135,1.0,0.774427,0.420843,0.777349,0.463642,...,0.5,0.25,0.5,0.25,0.5,0.333333,0.5,0.333333,0.5,0.333333
98,,,1.0,0.960024,0.998135,0.95222,0.498495,0.476565,0.032347,1.0,...,0.75,0.5,0.75,0.5,0.666667,0.833333,0.666667,0.5,0.833333,0.666667


To get only the new synthetic data, use `.get_synthetic_batch`.


In [35]:
synth_dataf = ddg.get_synthetic_batch(sample_dataf_2)

Deepdreaming Synthetic Batches:   0%|          | 0/5 [00:00<?, ?it/s]

In [36]:
print(synth_dataf.shape)
synth_dataf.head()

(99, 441)


Unnamed: 0,feature_abstersive_emotional_misinterpreter,feature_accessorial_aroused_crochet,feature_acerb_venusian_piety,feature_affricative_bromic_raftsman,feature_agile_unrespited_gaucho,feature_agronomic_cryptal_advisor,feature_alkaline_pistachio_sunstone,feature_altern_unnoticed_impregnation,feature_ambisexual_boiled_blunderer,feature_amoebaean_wolfish_heeler,...,target_paul_20,target_paul_60,target_george_20,target_george_60,target_william_20,target_william_60,target_arthur_20,target_arthur_60,target_thomas_20,target_thomas_60
0,0.920705,0.776099,0.252246,1.0,0.255734,0.467267,0.444772,0.478586,0.16981,0.51349,...,0.5,0.75,0.75,0.75,0.5,0.666667,0.5,0.5,0.5,0.5
1,0.0,0.954657,0.827482,0.93986,1.0,1.0,0.990011,0.023633,0.94953,1.0,...,0.5,0.75,0.5,0.75,0.666667,0.833333,0.666667,0.833333,0.666667,1.0
2,0.981136,1.0,0.361227,0.251239,0.228109,0.738564,0.648518,0.02011,0.0,0.0306,...,0.25,0.5,0.75,0.75,0.0,0.5,0.166667,0.5,0.5,0.666667
3,0.936477,0.498917,0.807439,1.0,1.0,0.0,0.0,0.533342,0.673116,0.219802,...,0.5,0.5,0.5,0.5,0.5,0.5,0.5,0.333333,0.5,0.5
4,0.480438,0.73003,0.5091,0.128314,0.24705,0.318486,0.267071,0.277129,0.966291,0.812651,...,0.5,0.75,0.75,0.75,0.833333,0.833333,0.666667,0.833333,0.666667,0.666667


## 3. UMAP

In [37]:
# Clean up environment
dl.remove_base_directory()