In [None]:
#| include: false
import pandas as pd
from nbdev.showdoc import *

This example notebook covers ways to generate synthetic data using `numerblox` components. Synthetic data can improve model performance by providing more data to train on. We will cover ways to generate both synthetic target variables and synthetic features.

## 0. Download and load

In [None]:
from numerblox.download import NumeraiClassicDownloader
from numerblox.numerframe import create_numerframe, NumerFrame

In [None]:
dl = NumeraiClassicDownloader(directory_path="synth_test")
dl.download_training_data(version=3)

In [None]:
dataf = create_numerframe("synth_test/numerai_training_data.parquet")

In [None]:
dataf.head(2)

## 1. Synthetic target (Bayesian GMM)

First, we tackle the problem of creating a synthetic target column to improve model performance. `BayesianGMMTargetProcessor` generates a new target variable based on a given target. The preprocessor samples the target from a [Bayesian Gaussian Mixture model](https://scikit-learn.org/stable/modules/generated/sklearn.mixture.BayesianGaussianMixture.html), which is fitted on coefficients from a [regularized linear model (Ridge regression)](https://scikit-learn.org/stable/modules/generated/sklearn.linear_model.Ridge.html).

This implementation is based on a [GitHub Gist by Michael Oliver (mdo)](https://gist.github.com/the-moliver/dcdd2862dc2c78dda600f1b449071c93).
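
To make the idea concrete, here is a minimal sketch of the general technique described above, written directly against scikit-learn rather than `numerblox`. It is an illustration under stated assumptions, not the library's actual implementation: the toy data, the per-era fitting loop, the centering by 0.5, and all hyperparameters (`alpha`, `n_components`) are hypothetical choices.

```python
import numpy as np
from sklearn.linear_model import Ridge
from sklearn.mixture import BayesianGaussianMixture

rng = np.random.default_rng(0)

# Toy data (hypothetical): 20 "eras" of 50 rows each, 3 features in [0, 1].
n_eras, rows_per_era, n_features = 20, 50, 3
X = rng.uniform(0, 1, size=(n_eras * rows_per_era, n_features))
eras = np.repeat(np.arange(n_eras), rows_per_era)
y = X @ np.array([0.3, -0.2, 0.1]) + rng.normal(0, 0.05, size=len(X))

# Step 1: fit one Ridge model per era and collect its coefficients.
coefs = []
for era in range(n_eras):
    mask = eras == era
    model = Ridge(alpha=1.0, fit_intercept=False)
    # Center features and target so no intercept is needed (an assumption).
    model.fit(X[mask] - 0.5, y[mask] - y[mask].mean())
    coefs.append(model.coef_)
era_coefs = np.vstack(coefs)  # shape: (n_eras, n_features)

# Step 2: fit a Bayesian Gaussian mixture on the per-era coefficient vectors.
bgmm = BayesianGaussianMixture(n_components=3, random_state=0)
bgmm.fit(era_coefs)

# Step 3: sample one coefficient vector and build a synthetic target from it.
sampled_coef, _ = bgmm.sample(1)  # sampled_coef has shape (1, n_features)
fake_target = (X - 0.5) @ sampled_coef[0]
```

Each sampled coefficient vector yields a different synthetic target, so repeated sampling gives as many variations as needed. How `BayesianGMMTargetProcessor` groups rows, scales the output, or chooses hyperparameters may differ from this sketch.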

In [None]:
from numerblox.preprocessing import BayesianGMMTargetProcessor

In [None]:
show_doc(BayesianGMMTargetProcessor)

In [None]:
dataf.head()

In [None]:
bgmm = BayesianGMMTargetProcessor(target_col="target_nomi_20")
test_columns = ['era', 'data_type', 'feature_dichasial_hammier_spawner',
                'feature_rheumy_epistemic_prancer', 'target',
                'target_nomi_20', 'target_paul_20']
sample_dataf = NumerFrame(dataf[test_columns].sample(100).fillna(0.5))
fake_dataf = bgmm(sample_dataf)

In [None]:
sample_dataf.head()

The new target will be suffixed by `_fake` to distinguish it from the original targets.

In [None]:
fake_dataf.get_target_data.head(2)

Note that you can easily generate multiple fake targets in a loop.

In [None]:
for target_col in sample_dataf.target_cols:
    bgmm = BayesianGMMTargetProcessor(target_col=target_col)
    sample_dataf = bgmm(sample_dataf)
sample_dataf.get_target_data.head(2)

## 2. UMAPFeatureGenerator

UMAP is a dimensionality reduction technique that can be used to generate synthetic features. In other words, we create new low-dimensional representations of the existing features and add them to our dataset.

We will perform UMAP on the training and validation data combined. Note that the data created with `DeepDreamGenerator` is included in this dataset. Then, once again we train a model on it and evaluate results.

In [None]:
from numerblox.preprocessing import UMAPFeatureGenerator

`n_components` denotes the number of additional features we are generating.

In [None]:
n_components = 3
umap_gen = UMAPFeatureGenerator(n_components=n_components, n_neighbors=9)

In [None]:
test_data = create_numerframe("../test_assets/mini_numerai_version_2_data.parquet")

In [None]:
test_data = umap_gen(test_data)


The new features follow the naming convention `f"feature_umap_{i}"`. All new components are scaled between 0 and 1.
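
As a rough illustration of the naming and scaling described above, the sketch below applies column-wise min-max scaling to a stand-in embedding and builds the matching column names. It assumes min-max scaling is how the components are mapped to the range 0 to 1; the random `embedding` array is hypothetical and is not real UMAP output.

```python
import numpy as np

rng = np.random.default_rng(42)

# Stand-in for a raw 3-component embedding (hypothetical values).
embedding = rng.normal(size=(5, 3))

# Min-max scale each column: its smallest value maps to 0, its largest to 1.
col_min = embedding.min(axis=0)
col_max = embedding.max(axis=0)
scaled = (embedding - col_min) / (col_max - col_min)

# Name the columns following the `feature_umap_{i}` convention.
columns = [f"feature_umap_{i}" for i in range(scaled.shape[1])]
```

Under this assumption, every new column contains at least one exact 0 and one exact 1 over the rows it was fitted on.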

In [None]:
umap_features = [f"feature_umap_{i}" for i in range(n_components)]
test_data[umap_features].head(3)

| id | feature_umap_0 | feature_umap_1 | feature_umap_2 |
|---|---|---|---|
| n559bd06a8861222 | 0.887313 | 0.365509 | 1.0 |
| n9d39dea58c9e3cf | 1.0 | 0.779677 | 0.732083 |
| nb64f06d3a9fc9f1 | 0.879256 | 0.073302 | 0.174605 |


Contrast this with the deep dream results.

After you're done, all downloaded files can be cleaned up with `.remove_base_directory()`.

In [None]:
# Clean up environment
dl.remove_base_directory()