In [None]:
#| include: false
import pandas as pd
from nbdev.showdoc import *

This example notebook covers ways to generate synthetic data using `numerblox` components. Synthetic data can improve model performance by providing more data to train on. We will cover ways to generate both synthetic target variables and synthetic features.

## 0. Download and load

In [None]:
from numerblox.download import NumeraiClassicDownloader
from numerblox.numerframe import create_numerframe, NumerFrame

In [None]:
dl = NumeraiClassicDownloader(directory_path="synth_test")
dl.download_training_data(version="4.2", int8=True)

2023-09-07 17:49:58,502 INFO numerapi.utils: target file already exists
2023-09-07 17:49:58,503 INFO numerapi.utils: download complete


2023-09-07 17:49:59,171 INFO numerapi.utils: target file already exists
2023-09-07 17:49:59,172 INFO numerapi.utils: download complete


In [None]:
dataf = create_numerframe("synth_test/train_int8.parquet")

In [None]:
dataf.head(2)

id,era,data_type,feature_honoured_observational_balaamite,feature_polaroid_vadose_quinze,feature_untidy_withdrawn_bargeman,feature_genuine_kyphotic_trehala,feature_unenthralled_sportful_schoolhouse,feature_divulsive_explanatory_ideologue,feature_ichthyotic_roofed_yeshiva,feature_waggly_outlandish_carbonisation,...,target_bravo_v4_20,target_bravo_v4_60,target_charlie_v4_20,target_charlie_v4_60,target_delta_v4_20,target_delta_v4_60,target_echo_v4_20,target_echo_v4_60,target_jeremy_v4_20,target_jeremy_v4_60
n003bba8a98662e4,1,train,4,2,4,4,0,0,4,4,...,0.25,0.0,0.5,0.25,0.25,0.0,0.25,0.0,0.25,0.25
n003bee128c2fcfc,1,train,2,4,1,3,0,3,2,3,...,0.75,1.0,0.75,0.75,0.75,0.75,0.75,0.75,0.75,1.0


## 1. Synthetic target (Bayesian GMM)

First we will tackle the problem of creating a synthetic target column to improve model performance. `BayesianGMMTargetProcessor` generates a new target variable based on a given target. The preprocessor samples the target from a [Bayesian Gaussian Mixture model](https://scikit-learn.org/stable/modules/generated/sklearn.mixture.BayesianGaussianMixture.html) that is fitted on coefficients from a [regularized linear model (Ridge regression)](https://scikit-learn.org/stable/modules/generated/sklearn.linear_model.Ridge.html).

This implementation is based on a [GitHub Gist by Michael Oliver (mdo)](https://gist.github.com/the-moliver/dcdd2862dc2c78dda600f1b449071c93).
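
To make the idea concrete, here is a minimal sketch on toy data using only `numpy` and `scikit-learn`. It is not the `numerblox` implementation: the per-era fitting, the toy sizes, `alpha`, the number of mixture components, and the final quantile binning are all illustrative assumptions. It fits one Ridge model per era, fits a Bayesian Gaussian Mixture on the resulting coefficient vectors, samples new coefficients from the mixture, and projects the features onto them to obtain a fake target.

```python
import numpy as np
from sklearn.linear_model import Ridge
from sklearn.mixture import BayesianGaussianMixture

rng = np.random.default_rng(0)
n_eras, rows_per_era, n_features = 30, 200, 3  # toy sizes (assumption)

# Toy data: integer-binned features in {0..4} and a 5-level target.
eras = np.repeat(np.arange(n_eras), rows_per_era)
X = rng.integers(0, 5, size=(n_eras * rows_per_era, n_features)).astype(float)
y = rng.choice([0.0, 0.25, 0.5, 0.75, 1.0], size=len(X))

# Step 1: one Ridge fit per era, keeping only the coefficient vector.
coefs = np.vstack([
    Ridge(alpha=1.0).fit(X[eras == era], y[eras == era]).coef_
    for era in range(n_eras)
])  # shape: (n_eras, n_features)

# Step 2: model the distribution of coefficient vectors with a Bayesian GMM.
bgmm = BayesianGaussianMixture(n_components=3, max_iter=500, random_state=0)
bgmm.fit(coefs)

# Step 3: draw one new coefficient vector per era from the fitted mixture.
sampled_coefs, _ = bgmm.sample(n_eras)

# Step 4: project features onto the sampled coefficients and bin the scores
# into five levels per era (the binning scheme is an assumption).
levels = np.array([0.0, 0.25, 0.5, 0.75, 1.0])
fake_target = np.empty(len(X))
for era in range(n_eras):
    mask = eras == era
    scores = X[mask] @ sampled_coefs[era]
    edges = np.quantile(scores, [0.2, 0.4, 0.6, 0.8])
    fake_target[mask] = levels[np.digitize(scores, edges)]
```

Because the coefficients are sampled rather than copied from the per-era fits, the fake target follows a plausible feature-to-target relationship without reproducing the original target.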

In [None]:
from numerblox.preprocessing import BayesianGMMTargetProcessor

In [None]:
show_doc(BayesianGMMTargetProcessor)

---

[source](https://github.com/crowdcent/numerblox/blob/master/numerblox/preprocessing.py#LNone){target="_blank" style="float:right; font-size:smaller"}

### BayesianGMMTargetProcessor

>      BayesianGMMTargetProcessor (target_col:str='target',
>                                  feature_names:list=None, n_components:int=6)

Generate synthetic (fake) target using a Bayesian Gaussian Mixture model. 

Based on Michael Oliver's GitHub Gist implementation: 

https://gist.github.com/the-moliver/dcdd2862dc2c78dda600f1b449071c93

:param target_col: Column from which to create fake target. 

:param feature_names: Selection of features used for Bayesian GMM. All features by default.
:param n_components: Number of components for fitting Bayesian Gaussian Mixture Model.

In [None]:
dataf.head()

id,era,data_type,feature_honoured_observational_balaamite,feature_polaroid_vadose_quinze,feature_untidy_withdrawn_bargeman,feature_genuine_kyphotic_trehala,feature_unenthralled_sportful_schoolhouse,feature_divulsive_explanatory_ideologue,feature_ichthyotic_roofed_yeshiva,feature_waggly_outlandish_carbonisation,...,target_bravo_v4_20,target_bravo_v4_60,target_charlie_v4_20,target_charlie_v4_60,target_delta_v4_20,target_delta_v4_60,target_echo_v4_20,target_echo_v4_60,target_jeremy_v4_20,target_jeremy_v4_60
n003bba8a98662e4,1,train,4,2,4,4,0,0,4,4,...,0.25,0.0,0.5,0.25,0.25,0.0,0.25,0.0,0.25,0.25
n003bee128c2fcfc,1,train,2,4,1,3,0,3,2,3,...,0.75,1.0,0.75,0.75,0.75,0.75,0.75,0.75,0.75,1.0
n0048ac83aff7194,1,train,2,1,3,0,3,0,3,3,...,0.5,0.25,0.5,0.25,0.5,0.25,0.5,0.25,0.5,0.25
n00691bec80d3e02,1,train,4,2,2,3,0,4,1,4,...,0.75,0.5,0.75,0.75,0.5,0.5,0.75,0.5,0.5,0.5
n00b8720a2fdc4f2,1,train,4,3,4,4,0,0,4,2,...,0.75,0.5,0.75,0.5,0.75,0.5,0.75,0.5,0.5,0.5


In [None]:
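# Generate a fake target from the "target" column.
bgmm = BayesianGMMTargetProcessor(target_col="target")
# Keep a handful of columns and sample 1000 rows so the example runs quickly.
test_columns = [
    "era", "data_type",
    "feature_polaroid_vadose_quinze", "feature_genuine_kyphotic_trehala",
    "feature_unenthralled_sportful_schoolhouse", "target",
]
# Missing values are filled with 0.5 before the processor is applied.
sample_dataf = NumerFrame(dataf[test_columns].sample(1000).fillna(0.5))
fake_dataf = bgmm(sample_dataf)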

Generating fake target:   0%|          | 0/475 [00:00<?, ?it/s]

In [None]:
sample_dataf.head()

id,era,data_type,feature_polaroid_vadose_quinze,feature_genuine_kyphotic_trehala,feature_unenthralled_sportful_schoolhouse,target,target_fake
nd9f11e36aea10ba,60,train,1,3,3,0.5,0.25
n9c50438e437c15b,160,train,2,0,0,0.75,0.5
n99c943dd412f83a,225,train,0,3,1,0.5,0.5
na5308787ae2aa35,140,train,0,1,1,0.5,0.75
n77ff7ba852e7a10,419,train,0,4,0,0.5,0.75


The new target column is suffixed with `_fake` to distinguish it from the original target, so `target` yields `target_fake`.

In [None]:
fake_dataf.get_target_data.head(2)

id,target,target_fake
nd9f11e36aea10ba,0.5,0.25
n9c50438e437c15b,0.75,0.5
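
One way to inspect a generated target is to compare it with the original per era. The sketch below does this on a small hand-made `pandas` frame rather than on a `NumerFrame`; the toy values and the choice of a rank correlation are assumptions made for illustration and are not part of `numerblox`.

```python
import pandas as pd

# Toy frame mimicking the column layout above (values are made up).
toy = pd.DataFrame(
    {
        "era": [1, 1, 1, 1, 2, 2, 2, 2],
        "target": [0.0, 0.25, 0.75, 1.0, 0.0, 0.5, 0.5, 1.0],
        "target_fake": [0.25, 0.0, 1.0, 0.75, 1.0, 0.5, 0.5, 0.0],
    }
)

# Select generated targets by their "_fake" suffix.
fake_cols = [col for col in toy.columns if col.endswith("_fake")]

# Rank correlation between original and fake target within each era.
# Ranking first and then taking the Pearson correlation gives Spearman's rho.
per_era_corr = {
    era: group["target"].rank().corr(group["target_fake"].rank())
    for era, group in toy.groupby("era")
}
print(fake_cols)
print(per_era_corr)
```

A fake target that correlates almost perfectly with the original in every era adds little new information, while one that varies across eras behaves more like an additional training signal.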


Note that you can easily generate multiple fake targets in a loop.

In [None]:
for target_col in sample_dataf.target_cols:
    bgmm = BayesianGMMTargetProcessor(target_col=target_col)
    sample_dataf = bgmm(sample_dataf)
sample_dataf.get_target_data.head(2)

Generating fake target:   0%|          | 0/475 [00:00<?, ?it/s]

id,target,target_fake
nd9f11e36aea10ba,0.5,0.5
n9c50438e437c15b,0.75,0.25


In [None]:
# Clean up environment
dl.remove_base_directory()