# End To End

This notebook shows several advanced end-to-end pipelines that leverage NumerBlox's components. Consider it a testing ground for how well NumerBlox integrates with scikit-learn and associated libraries.

In [1]:
from sklearn.compose import ColumnTransformer, make_column_transformer
from sklearn.model_selection import TimeSeriesSplit
from sklearn.pipeline import make_pipeline
from sklearn.tree import DecisionTreeClassifier, DecisionTreeRegressor
from xgboost import XGBRegressor

from numerblox.ensemble import NumeraiEnsemble, PredictionReducer
from numerblox.meta import CrossValEstimator, make_meta_pipeline
from numerblox.neutralizers import FeatureNeutralizer
from numerblox.preprocessing import GroupStatsPreProcessor



## 0. Get data

In [2]:
from numerblox.numerframe import create_numerframe

df = create_numerframe("../tests/test_assets/val_3_eras.parquet")

In [3]:
X, y = df.get_feature_target_pair(multi_target=False)
fncv3_cols = df.get_fncv3_feature_data.columns.tolist()
era_series = df.get_era_data
features = df.get_feature_data

## 1. Weighted XGBoost ensemble pipeline with feature neutralization

This first pipeline preprocesses the data by creating group features for the `sunshine` and `rain` feature groups. Additionally, we add the FNCv3 features using [sklego's ColumnSelector](https://scikit-lego.netlify.app/api/preprocessing.html#sklego.preprocessing.ColumnSelector). Sklego is another library that follows scikit-learn conventions, so it integrates well with NumerBlox. The two feature sets are concatenated with [scikit-learn's ColumnTransformer](https://scikit-learn.org/stable/modules/generated/sklearn.compose.ColumnTransformer.html). [make_column_transformer](https://scikit-learn.org/stable/modules/generated/sklearn.compose.make_column_transformer.html) is a convenience function to initialize a `ColumnTransformer`.

The preprocessed data is used to train five folds following the `TimeSeriesSplit` strategy. `NumeraiEnsemble` standardizes the predictions of each fold by era and ensembles them. If `donate_weighted=True`, it creates a weighted ensemble in which the folds trained on the most recent data get a higher weight. Alternatively, you can set your own weights, for example `weights=[0.02, 0.04, 0.04, 0.3, 0.6]`. If `donate_weighted=False` and no weights are set, it creates a simple average ensemble.
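To make the difference between the weighted and the simple average concrete, here is a minimal NumPy sketch. It is not NumerBlox's implementation: the `fold_preds` values are made up for illustration, and the per-era standardization step is left out.

```python
import numpy as np

# Hypothetical predictions: 4 rows, one column per fold (oldest fold first).
fold_preds = np.array([
    [0.1, 0.2, 0.3, 0.4, 0.5],
    [0.5, 0.5, 0.5, 0.5, 0.5],
    [0.9, 0.8, 0.7, 0.6, 0.5],
    [0.0, 0.0, 0.0, 0.0, 1.0],
])

# Custom weights from the text; folds trained on more recent data count more.
weights = np.array([0.02, 0.04, 0.04, 0.3, 0.6])
assert np.isclose(weights.sum(), 1.0)

weighted = fold_preds @ weights    # weighted ensemble, shape (4,)
simple = fold_preds.mean(axis=1)   # simple average ensemble, shape (4,)
```

In the last row only the most recent fold predicts a high value, so the weighted ensemble follows it much more closely than the simple average does.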

Lastly, the final prediction column is neutralized.
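As a rough illustration of what neutralization means, the sketch below removes a proportion of the part of the predictions that is linearly explained by the features. This is only the core linear algebra under simplifying assumptions: the `neutralize` helper and the random data are hypothetical, and the sketch ignores the per-era handling and any rescaling that `FeatureNeutralizer` may apply.

```python
import numpy as np


def neutralize(preds, features, proportion=0.5):
    """Subtract `proportion` of the linear feature exposure from `preds`."""
    # Least-squares fit of preds on features gives the linearly explained part.
    coefs, *_ = np.linalg.lstsq(features, preds, rcond=None)
    return preds - proportion * (features @ coefs)


rng = np.random.default_rng(0)
features = rng.normal(size=(100, 3))  # hypothetical feature matrix
preds = features @ np.array([0.5, -0.2, 0.1]) + rng.normal(size=100)

half = neutralize(preds, features, proportion=0.5)
full = neutralize(preds, features, proportion=1.0)
```

With `proportion=1.0` the result is orthogonal to every feature column; with `proportion=0.5` the exposure to each feature is halved.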

To run a postprocessing step after a model like `DecisionTreeRegressor`, we need a wrapper that adds a `transform` method to the model. `MetaPipeline` does this automatically for you and works in the same way as [scikit-learn's Pipeline object](https://scikit-learn.org/stable/modules/generated/sklearn.pipeline.Pipeline.html#sklearn.pipeline.Pipeline). `make_meta_pipeline` is a convenience function that works in the same way as [scikit-learn's make_pipeline](https://scikit-learn.org/stable/modules/generated/sklearn.pipeline.make_pipeline.html).
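The idea behind such a wrapper can be sketched in plain Python. The class names `PredictAsTransform` and `MeanModel` are hypothetical and exist only for this illustration; this is not how `MetaPipeline` is implemented.

```python
class PredictAsTransform:
    """Hypothetical wrapper that exposes a model's predictions via `transform`."""

    def __init__(self, model):
        self.model = model

    def fit(self, X, y=None):
        self.model.fit(X, y)
        return self

    def transform(self, X):
        # Later pipeline steps receive the predictions as their input.
        return self.model.predict(X)


class MeanModel:
    """Toy stand-in for an estimator such as DecisionTreeRegressor."""

    def fit(self, X, y):
        self.mean_ = sum(y) / len(y)
        return self

    def predict(self, X):
        return [self.mean_ for _ in X]


step = PredictAsTransform(MeanModel()).fit([[0], [1], [2]], [0.0, 0.5, 1.0])
out = step.transform([[3], [4]])
```

Because the wrapped model now has both `fit` and `transform`, it can sit in the middle of a pipeline and hand its predictions to steps such as an ensembler or a neutralizer.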

In [4]:
# !pip install scikit-lego

In [5]:
from sklego.preprocessing import ColumnSelector

In [6]:
# Preprocessing
gpp = GroupStatsPreProcessor(groups=["sunshine", "rain"])
fncv3_selector = ColumnSelector(fncv3_cols)

preproc_pipe = ColumnTransformer([("gpp", gpp, X.columns.tolist()), ("fncv3_selector", fncv3_selector, fncv3_cols)])

# Model
# Note: a DecisionTreeRegressor is used here; swap in XGBRegressor() for an XGBoost ensemble.
xgb = DecisionTreeRegressor()
cve = CrossValEstimator(estimator=xgb, cv=TimeSeriesSplit(n_splits=5))
ens = NumeraiEnsemble(donate_weighted=True)
fn = FeatureNeutralizer(proportion=0.5)
full_pipe = make_meta_pipeline(preproc_pipe, cve, ens, fn)
full_pipe

In [7]:
# Train full model
full_pipe.fit(X, y, era_series=era_series);

In [8]:
# End to end predictions
preds = full_pipe.predict(X=X, features=features, era_series=era_series)
preds[:5]

Processing feature neutralizations: 100%|██████████| 1/1 [00:00<00:00, 26886.56it/s]


array([[0.28655201],
       [0.63724474],
       [0.27848242],
       [0.55815509],
       [0.47477194]])

## 2. Multiclass Classification Ensemble

This example treats the Numerai target as a multiclass classification problem by transforming it into integers (`[0, 0.25, 0.5, 0.75, 1.0] -> [0, 1, 2, 3, 4]`).

When we call `predict_proba` on a classifier, the result is a probability for every class, for example `[0.1, 0.2, 0.3, 0.2, 0.2]`. To reduce these to one number we use `PredictionReducer`, which takes the probabilities of every model and reduces them with a vector multiplication (for example, `[0.1, 0.2, 0.3, 0.2, 0.2] @ [0, 1, 2, 3, 4] = 2.2`). It does this for every model, so with three folds the output of `PredictionReducer` has 3 columns.
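The reduction described above can be reproduced with a few lines of NumPy. This is a sketch of the arithmetic only: the probability values are made up, and the column layout (all classes of model 0, then model 1, and so on) is an assumption rather than a documented guarantee of `PredictionReducer`.

```python
import numpy as np

n_models, n_classes = 3, 5
class_values = np.arange(n_classes)  # [0, 1, 2, 3, 4]

# Hypothetical class probabilities for one sample, one row per model.
probs = np.array([
    [0.1, 0.2, 0.3, 0.2, 0.2],
    [0.2, 0.2, 0.2, 0.2, 0.2],
    [0.0, 0.0, 0.0, 0.0, 1.0],
])

# Assumed concatenated layout: shape (n_samples, n_models * n_classes).
flat = probs.reshape(1, n_models * n_classes)

# Expected class value per model: shape (n_samples, n_models).
reduced = flat.reshape(-1, n_models, n_classes) @ class_values
```

Each of the 3 output columns is the expected class value of one model, which is why 15 probability columns collapse into 3.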

Because we set `donate_weighted=True` in `NumeraiEnsemble`, these 3 columns are reduced to one column using a weighted ensemble in which the most recent fold gets the highest weight. Lastly, the final prediction column is neutralized.

In [9]:
model = DecisionTreeClassifier()
crossval1 = CrossValEstimator(estimator=model, cv=TimeSeriesSplit(n_splits=3), predict_func="predict_proba")
pred_rud = PredictionReducer(n_models=3, n_classes=5)
ens2 = NumeraiEnsemble(donate_weighted=True)
neut2 = FeatureNeutralizer(proportion=0.5)
full_pipe = make_meta_pipeline(preproc_pipe, crossval1, pred_rud, ens2, neut2)

In [10]:
full_pipe

In [11]:
y_int = (y * 4).astype(int)
full_pipe.fit(X, y_int, era_series=era_series)

In [12]:
preds = full_pipe.predict(X, era_series=era_series, features=features)
preds[:5]

Processing feature neutralizations: 100%|██████████| 1/1 [00:00<00:00, 1893.59it/s]


array([[0.27212312],
       [0.61574058],
       [0.2635116 ],
       [0.53971591],
       [0.46098369]])

## 3. Ensemble of ensembles of regressors

This example introduces a `ColumnTransformer` that contains 3 pipelines. Each pipeline can have a different set of arguments. Here we simplify by passing the same columns to every pipeline.
The output from all pipelines is concatenated and ensembled with `NumeraiEnsemble`, and the final ensembled column is neutralized. Note that every fold here is equally weighted. If you want to give recent folds more weight, set `weights` in `NumeraiEnsemble` for all `ColumnTransformer` output columns.
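With 3 pipelines of 5 folds each, the `ColumnTransformer` produces 15 prediction columns, so a custom weight vector needs 15 entries. One possible way to build it is sketched below. It assumes the columns are ordered pipeline by pipeline with the oldest fold first, which is an assumption and not something this notebook verifies; the per-fold weights are reused from the example in section 1.

```python
import numpy as np

n_pipelines, n_folds = 3, 5

# Per-fold weights, oldest fold first (reused from section 1).
fold_weights = np.array([0.02, 0.04, 0.04, 0.3, 0.6])
assert len(fold_weights) == n_folds

# Repeat the fold weights for every pipeline and renormalize to sum to 1.
weights = np.tile(fold_weights, n_pipelines) / n_pipelines
weights_list = weights.tolist()
```

The resulting list could then be passed as `NumeraiEnsemble(weights=weights_list)` in place of the default equal weighting.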

In [13]:
pipes = []
for i in range(3):
    model = XGBRegressor()
    crossval = CrossValEstimator(estimator=model, cv=TimeSeriesSplit(n_splits=5), predict_func="predict")
    pipe = make_pipeline(crossval)
    pipes.append(pipe)

models = make_column_transformer(*[(pipe, features.columns.tolist()) for pipe in pipes])
ens_end = NumeraiEnsemble()
neut = FeatureNeutralizer(proportion=0.5)
full_pipe = make_meta_pipeline(models, ens_end, neut)

In [14]:
full_pipe

In [15]:
full_pipe.fit(X, y, era_series=era_series);

In [16]:
preds = full_pipe.predict(X, era_series=era_series, features=features)
preds[:5]

Processing feature neutralizations: 100%|██████████| 1/1 [00:00<00:00, 11214.72it/s]


array([[0.38385137],
       [0.65767811],
       [0.39945052],
       [0.61573322],
       [0.64903178]])