This example notebook covers ways to generate synthetic data using `numerblox` components. Synthetic data can improve model performance by giving you more data to train on. We will cover ways to generate both synthetic target variables and synthetic features.

## 0. Download and load

In [None]:
import pandas as pd
from uuid import uuid4

from numerblox.download import NumeraiClassicDownloader

In [None]:
unique_id = uuid4()

dl = NumeraiClassicDownloader(directory_path=f"synth_test_{unique_id}")
dl.download_training_data(version="5.0")

In [None]:
dataf = pd.read_parquet(f"synth_test_{unique_id}/train.parquet")

In [None]:
dataf.head(2)

## 1. Synthetic target (Bayesian GMM)

First we will tackle the problem of creating a synthetic target column to improve model performance. `BayesianGMMTargetProcessor` generates a new target variable based on a given target. The preprocessor samples the target from a [Bayesian Gaussian Mixture model](https://scikit-learn.org/stable/modules/generated/sklearn.mixture.BayesianGaussianMixture.html) that is fitted on coefficients from a [regularized linear model (Ridge regression)](https://scikit-learn.org/stable/modules/generated/sklearn.linear_model.Ridge.html).

This implementation is based on a [GitHub Gist by Michael Oliver (mdo)](https://gist.github.com/the-moliver/dcdd2862dc2c78dda600f1b449071c93).
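
To make the idea concrete, here is a minimal sketch of the general technique using only scikit-learn on toy data. It is not the `numerblox` implementation: the toy features, the number of mixture components, the Ridge `alpha`, and the per-era rank scaling at the end are all assumptions chosen for illustration.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import Ridge
from sklearn.mixture import BayesianGaussianMixture

rng = np.random.default_rng(0)

# Toy data (hypothetical): 20 eras with 50 rows each and 3 features in [0, 1].
n_eras, rows_per_era, n_features = 20, 50, 3
X = pd.DataFrame(
    rng.random((n_eras * rows_per_era, n_features)),
    columns=[f"feature_{i}" for i in range(n_features)],
)
eras = pd.Series(np.repeat(np.arange(n_eras), rows_per_era), name="era")
y = pd.Series(X.to_numpy() @ np.array([0.5, -0.2, 0.1]) + rng.normal(0, 0.1, len(X)))

# Step 1: fit one Ridge regression per era and collect the coefficient vectors.
coefs = np.vstack(
    [Ridge(alpha=1.0).fit(X[eras == era], y[eras == era]).coef_ for era in eras.unique()]
)

# Step 2: fit a Bayesian Gaussian Mixture on the per-era coefficients.
# n_components=3 is an arbitrary choice for this sketch.
bgmm = BayesianGaussianMixture(n_components=3, max_iter=500, random_state=0).fit(coefs)

# Step 3: sample one coefficient vector per era from the mixture.
sampled_coefs, _ = bgmm.sample(n_eras)

# Step 4: build a synthetic score per era from the sampled coefficients,
# then rank-scale within each era (the scaling is an assumption of this sketch).
scores = pd.Series(index=X.index, dtype=float)
for coef, era in zip(sampled_coefs, eras.unique()):
    mask = eras == era
    scores[mask] = X[mask].to_numpy() @ coef
synthetic_target = scores.groupby(eras).rank(pct=True)
```

Sampling coefficients from the mixture, rather than reusing the fitted per-era coefficients directly, gives targets that follow plausible feature-target relationships without copying any single era.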

In [None]:
from numerblox.targets import BayesianGMMTargetProcessor

In [None]:
dataf.head()

In [None]:
bgmm = BayesianGMMTargetProcessor()
bgmm.set_output(transform="pandas")

# Use a small random sample and two features to keep the example fast.
sample = dataf.sample(1000)
X = sample[["feature_polaroid_vadose_quinze", "feature_genuine_kyphotic_trehala"]].fillna(0.5)
y = sample["target"]
eras = sample["era"]

# Fit on the real target, then generate the synthetic target for the same rows.
bgmm.fit(X, y, eras=eras)
fake_target = bgmm.transform(X, eras=eras)

In [None]:
fake_target.head(10)

In [None]:
# Clean up environment
dl.remove_base_directory()