This example notebook covers ways to generate synthetic data using `numerblox` components. Synthetic data can help improve model performance by providing more data to train on. We will cover ways to generate both synthetic target variables and synthetic features.

## 0. Download and load

In [1]:
import pandas as pd
from uuid import uuid4

from numerblox.download import NumeraiClassicDownloader

In [2]:
unique_id = uuid4()

dl = NumeraiClassicDownloader(directory_path=f"synth_test_{unique_id}")
dl.download_training_data(version="4.2", int8=True)

No existing directory found at 'synth_test_ecc9c7f1-e9e8-475e-b8e4-2f15ad7cd41f'. Creating directory...
Downloading 'v4.2/train_int8.parquet'.

2023-09-25 14:35:28,180 INFO numerapi.utils: starting download
synth_test_ecc9c7f1-e9e8-475e-b8e4-2f15ad7cd41f/train_int8.parquet: 1.88GB [02:20, 13.4MB/s]

Downloading 'v4.2/validation_int8.parquet'.

2023-09-25 14:37:49,315 INFO numerapi.utils: starting download
synth_test_ecc9c7f1-e9e8-475e-b8e4-2f15ad7cd41f/validation_int8.parquet: 2.17GB [02:22, 15.2MB/s]

In [3]:
dataf = pd.read_parquet(f"synth_test_{unique_id}/train_int8.parquet")

In [4]:
dataf.head(2)

| id | era | data_type | feature_honoured_observational_balaamite | feature_polaroid_vadose_quinze | feature_untidy_withdrawn_bargeman | feature_genuine_kyphotic_trehala | feature_unenthralled_sportful_schoolhouse | feature_divulsive_explanatory_ideologue | feature_ichthyotic_roofed_yeshiva | feature_waggly_outlandish_carbonisation | ... | target_bravo_v4_20 | target_bravo_v4_60 | target_charlie_v4_20 | target_charlie_v4_60 | target_delta_v4_20 | target_delta_v4_60 | target_echo_v4_20 | target_echo_v4_60 | target_jeremy_v4_20 | target_jeremy_v4_60 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| n003bba8a98662e4 | 1 | train | 4 | 2 | 4 | 4 | 0 | 0 | 4 | 4 | ... | 0.25 | 0.0 | 0.5 | 0.25 | 0.25 | 0.0 | 0.25 | 0.0 | 0.25 | 0.25 |
| n003bee128c2fcfc | 1 | train | 2 | 4 | 1 | 3 | 0 | 3 | 2 | 3 | ... | 0.75 | 1.0 | 0.75 | 0.75 | 0.75 | 0.75 | 0.75 | 0.75 | 0.75 | 1.0 |


## 1. Synthetic target (Bayesian GMM)

First, we tackle the problem of creating a synthetic target column to improve model performance. `BayesianGMMTargetProcessor` generates a new target variable based on a given target. The preprocessor samples the target from a [Bayesian Gaussian Mixture model](https://scikit-learn.org/stable/modules/generated/sklearn.mixture.BayesianGaussianMixture.html) that is fitted on coefficients from a [regularized linear model (Ridge regression)](https://scikit-learn.org/stable/modules/generated/sklearn.linear_model.Ridge.html).

This implementation is based on a [GitHub Gist by Michael Oliver (mdo)](https://gist.github.com/the-moliver/dcdd2862dc2c78dda600f1b449071c93).
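
To make the idea concrete, here is a minimal sketch of the general technique on toy data, using scikit-learn directly. It is not the `numerblox` implementation: the toy data, the Ridge `alpha`, the number of mixture components, the tie-breaking noise, and the 5/20/50/20/5% bucket split are all assumptions chosen for illustration.

```python
import numpy as np
from sklearn.linear_model import Ridge
from sklearn.mixture import BayesianGaussianMixture

rng = np.random.default_rng(0)

# Toy data (assumption): 20 eras, 200 rows each, 3 features with values 0..4.
n_eras, n_rows, n_features = 20, 200, 3
X_by_era = [rng.integers(0, 5, size=(n_rows, n_features)).astype(float) for _ in range(n_eras)]
y_by_era = [rng.choice([0.0, 0.25, 0.5, 0.75, 1.0], size=n_rows) for _ in range(n_eras)]


def center(a):
    """Subtract the mean (column-wise for 2D input)."""
    return a - a.mean(axis=0)


# Step 1: fit one Ridge regression per era and keep its coefficient vector.
coefs = np.vstack([
    Ridge(alpha=1.0, fit_intercept=False).fit(center(X), center(y)).coef_
    for X, y in zip(X_by_era, y_by_era)
])  # shape: (n_eras, n_features)

# Step 2: fit a Bayesian Gaussian mixture on the per-era coefficients.
mixture = BayesianGaussianMixture(n_components=3, max_iter=500, random_state=0).fit(coefs)

# Step 3: sample one coefficient vector per era from the mixture.
sampled_coefs, _ = mixture.sample(n_eras)


def bin_to_buckets(raw):
    """Map raw scores to {0, 0.25, 0.5, 0.75, 1} using within-era quantiles."""
    edges = np.quantile(raw, [0.05, 0.25, 0.75, 0.95])
    return np.digitize(raw, edges) / 4.0


# Step 4: build the fake target per era as features times sampled coefficients.
fake_by_era = []
for X, beta in zip(X_by_era, sampled_coefs):
    raw = center(X) @ beta + rng.normal(scale=1e-6, size=len(X))  # tiny noise breaks ties
    fake_by_era.append(bin_to_buckets(raw))

fake_target = np.concatenate(fake_by_era)
print(np.unique(fake_target))
```

The resulting target is linear in the features by construction, with per-era coefficients that resemble those observed for the real target.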

In [5]:
from numerblox.targets import BayesianGMMTargetProcessor

In [6]:
dataf.head()

| id | era | data_type | feature_honoured_observational_balaamite | feature_polaroid_vadose_quinze | feature_untidy_withdrawn_bargeman | feature_genuine_kyphotic_trehala | feature_unenthralled_sportful_schoolhouse | feature_divulsive_explanatory_ideologue | feature_ichthyotic_roofed_yeshiva | feature_waggly_outlandish_carbonisation | ... | target_bravo_v4_20 | target_bravo_v4_60 | target_charlie_v4_20 | target_charlie_v4_60 | target_delta_v4_20 | target_delta_v4_60 | target_echo_v4_20 | target_echo_v4_60 | target_jeremy_v4_20 | target_jeremy_v4_60 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| n003bba8a98662e4 | 1 | train | 4 | 2 | 4 | 4 | 0 | 0 | 4 | 4 | ... | 0.25 | 0.0 | 0.5 | 0.25 | 0.25 | 0.0 | 0.25 | 0.0 | 0.25 | 0.25 |
| n003bee128c2fcfc | 1 | train | 2 | 4 | 1 | 3 | 0 | 3 | 2 | 3 | ... | 0.75 | 1.0 | 0.75 | 0.75 | 0.75 | 0.75 | 0.75 | 0.75 | 0.75 | 1.0 |
| n0048ac83aff7194 | 1 | train | 2 | 1 | 3 | 0 | 3 | 0 | 3 | 3 | ... | 0.5 | 0.25 | 0.5 | 0.25 | 0.5 | 0.25 | 0.5 | 0.25 | 0.5 | 0.25 |
| n00691bec80d3e02 | 1 | train | 4 | 2 | 2 | 3 | 0 | 4 | 1 | 4 | ... | 0.75 | 0.5 | 0.75 | 0.75 | 0.5 | 0.5 | 0.75 | 0.5 | 0.5 | 0.5 |
| n00b8720a2fdc4f2 | 1 | train | 4 | 3 | 4 | 4 | 0 | 0 | 4 | 2 | ... | 0.75 | 0.5 | 0.75 | 0.5 | 0.75 | 0.5 | 0.75 | 0.5 | 0.5 | 0.5 |


In [7]:
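bgmm = BayesianGMMTargetProcessor()
# Return a pandas DataFrame instead of a NumPy array
bgmm.set_output(transform="pandas")

# Work on a small random sample with two features to keep the example fast
sample = dataf.sample(1000)
X = sample[["feature_polaroid_vadose_quinze", "feature_genuine_kyphotic_trehala"]].fillna(0.5)
y = sample["target"]
eras = sample["era"]

# Fit on the real target, then generate a fake target for the same rows
bgmm.fit(X, y, eras=eras)
fake_target = bgmm.transform(X, eras=eras)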

Generating fake target: 100%|██████████| 479/479 [00:00<00:00, 683.11it/s]


In [8]:
fake_target.head(10)

| id | fake_target |
|---|---|
| n7630c37f7eac63e | 0.5 |
| n2822c11fdda3a05 | 0.5 |
| n4bbcc611330e8c9 | 0.5 |
| n7cb19c5d5278030 | 0.5 |
| n183d83b326a9661 | 0.25 |
| nd00493791012fd3 | 0.5 |
| na9415225ee2acad | 0.75 |
| n8570f6cb08a8ae3 | 0.5 |
| n8a1beeaceea745b | 0.75 |
| n0d75fc8a7c87b75 | 0.5 |
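
A quick sanity check on a generated target is to compare how often each bucket occurs in the real and fake targets. The sketch below uses a hypothetical helper, `compare_targets`, which is not part of `numerblox`, and demonstrates it on small made-up Series so that it runs on its own.

```python
import pandas as pd


def compare_targets(real: pd.Series, fake: pd.Series) -> pd.DataFrame:
    """Return the relative frequency of each bucket in the real and fake targets."""
    return (
        pd.DataFrame({
            "real": real.value_counts(normalize=True),
            "fake": fake.value_counts(normalize=True),
        })
        .fillna(0.0)   # a bucket missing from one target has frequency 0
        .sort_index()
    )


# Made-up values for illustration only
real = pd.Series([0.5, 0.25, 0.5, 0.75, 0.5, 1.0])
fake = pd.Series([0.5, 0.5, 0.5, 0.25, 0.75, 0.5])

summary = compare_targets(real, fake)
print(summary)
```

In this notebook, the same helper could be called as `compare_targets(y, fake_target["fake_target"])`, assuming the output column is named `fake_target` as in the table above.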


In [9]:
# Clean up environment
dl.remove_base_directory()

Path: '/home/clepelaars/numerblox/examples/synth_test_ecc9c7f1-e9e8-475e-b8e4-2f15ad7cd41f'
