In [2]:
from numerblox.download import NumeraiClassicDownloader
from numerblox.numerframe import create_numerframe
from numerblox.model_pipeline import ModelPipeline, ModelPipelineCollection

## Why bother using ModelPipeline?

`ModelPipeline` lets you compose a full data pipeline from Preprocessors, Models and Postprocessors. This Lego-block style improves the readability and scalability of your Numerai inference pipelines. Because many components are reusable, it is also likely to speed up putting Numerai models into production.

In order to make predictions, `ModelPipeline` takes a `NumerFrame` as input and outputs a `NumerFrame` with prediction columns added.

`ModelPipeline` ensures that all processing steps are performed in the correct order and gives you a concise overview of your full pipeline. This simplifies your weekly inference setup and lets you scale more comfortably to multiple models.
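
To make the ordering concrete, here is a minimal sketch of the idea in plain Python. It is not the numerblox implementation: `TinyPipeline`, `fill_missing`, `toy_model` and `clip_predictions` are hypothetical names, and rows are plain dictionaries rather than a `NumerFrame`. It only illustrates that preprocessors run first, then models, then postprocessors.

```python
from typing import Callable, Dict, List, Sequence

Row = Dict[str, float]
Step = Callable[[List[Row]], List[Row]]


class TinyPipeline:
    """Hypothetical stand-in for ModelPipeline: runs steps in a fixed order."""

    def __init__(self, models: Sequence[Step],
                 preprocessors: Sequence[Step] = (),
                 postprocessors: Sequence[Step] = ()):
        self.preprocessors = list(preprocessors)
        self.models = list(models)
        self.postprocessors = list(postprocessors)

    def __call__(self, rows: List[Row]) -> List[Row]:
        # The order is fixed: preprocessing, model prediction, postprocessing.
        for step in [*self.preprocessors, *self.models, *self.postprocessors]:
            rows = step(rows)
        return rows


def fill_missing(rows: List[Row]) -> List[Row]:
    # Preprocessor: fill a missing feature value with a neutral 0.5.
    return [{**r, "feature_a": r.get("feature_a", 0.5)} for r in rows]


def toy_model(rows: List[Row]) -> List[Row]:
    # Model: adds a prediction column derived from the feature.
    return [{**r, "prediction_toy": 1.0 - r["feature_a"]} for r in rows]


def clip_predictions(rows: List[Row]) -> List[Row]:
    # Postprocessor: keep predictions inside [0.1, 0.9].
    return [{**r, "prediction_toy": min(max(r["prediction_toy"], 0.1), 0.9)}
            for r in rows]


pipeline = TinyPipeline(models=[toy_model],
                        preprocessors=[fill_missing],
                        postprocessors=[clip_predictions])
result = pipeline([{"feature_a": 0.25}, {}, {"feature_a": 1.0}])
```

Because the model step reads `feature_a`, it would fail on the second row if the preprocessor had not run first, which is the kind of ordering mistake a pipeline object is meant to rule out.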

To make a run easier to follow, many components of a typical pipeline also perform data integrity checks and print helpful console output. This output helps you identify slow implementations and other data bottlenecks.

## 0. Download live data

In [3]:
# Download most recent live data
downloader = NumeraiClassicDownloader("pipeline_test")
downloader.download_live_data()

# Initialize NumerFrame from parquet file path
file_path = 'pipeline_test/numerai_live_data.parquet'
dataf = create_numerframe(file_path, metadata={"version": 2,
                                               "type": "live",
                                               "model_name": "test",
                                               "original_path": file_path
                                               }
                          )

2022-02-18 13:38:44,899 INFO numerapi.utils: target file already exists
2022-02-18 13:38:44,900 INFO numerapi.utils: download complete


------------------------------------------------------------------
## Example 1. CatBoost model (.joblib) with 0.5 feature neutralization

A very common use case is to predict with a single model on all features and then apply some feature neutralization. This can be set up with a few lines of code.

1. Use `SingleModel`, which handles prediction logic for several formats (`.joblib`, `.cbm`, `.pickle`, `.pkl`, `.lgb` and `.h5`).

In [4]:
from numerblox.model import SingleModel
from numerblox.postprocessing import FeatureNeutralizer

In [5]:
joblib_model = SingleModel("../nbs/test_assets/joblib_v2_example_model.joblib",
                    model_name="joblib")
neutralizer = FeatureNeutralizer(pred_name="prediction_joblib",
                                 proportion=0.5)

In [6]:
pipeline1 = ModelPipeline(models=[joblib_model],
                          pipeline_name="joblib_pipeline",
                          postprocessors=[neutralizer])

In [7]:
prediction_dataf = pipeline1(dataf)

joblib_pipeline Preprocessing:: 0it [00:00, ?it/s]

joblib_pipeline Model prediction:   0%|          | 0/1 [00:00<?, ?it/s]

2022-02-18 13:38:45,496 INFO numexpr.utils: Note: NumExpr detected 16 cores but "NUMEXPR_MAX_THREADS" not set, so enforcing safe limit of 8.
2022-02-18 13:38:45,497 INFO numexpr.utils: NumExpr defaulting to 8 threads.


joblib_pipeline Postprocessing:   0%|          | 0/1 [00:00<?, ?it/s]

In [8]:
prediction_dataf.get_prediction_data.head(2)

id,prediction_joblib,prediction_joblib_neutralized_0.5
n0001e4a82d5531c,0.480704,0.466076
n000ace6d1f6367e,0.834582,0.583167
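
For intuition about what a neutralization proportion of 0.5 means, the sketch below removes half of the linear exposure of a prediction vector to one feature. It is a simplified illustration under stated assumptions, not the `FeatureNeutralizer` implementation: it uses a single toy feature and plain Python lists, whereas the library works on the real feature set and may apply additional steps such as per-era processing and rescaling. The names `dot` and `neutralize` are hypothetical helpers.

```python
from typing import List


def dot(a: List[float], b: List[float]) -> float:
    """Inner product of two equally long vectors."""
    return sum(x * y for x, y in zip(a, b))


def neutralize(preds: List[float], feature: List[float],
               proportion: float = 0.5) -> List[float]:
    """Subtract `proportion` of the part of `preds` explained linearly by `feature`."""
    # Least-squares coefficient of preds regressed on the single feature.
    beta = dot(feature, preds) / dot(feature, feature)
    return [p - proportion * beta * f for p, f in zip(preds, feature)]


preds = [0.2, 0.4, 0.6, 0.8]
feature = [-1.0, -0.5, 0.5, 1.0]  # toy, already centered feature (assumption)

half = neutralize(preds, feature, proportion=0.5)
full = neutralize(preds, feature, proportion=1.0)

# Exposure = inner product between predictions and the feature.
exposure_before = dot(preds, feature)
exposure_half = dot(half, feature)
exposure_full = dot(full, feature)
```

With `proportion=0.5` the exposure is halved, and with `proportion=1.0` it is removed entirely, so the parameter trades off keeping the model's original signal against reducing its dependence on the feature.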


--------------------------------------------------
## Example 2. Ensembling multiple models

In [9]:
from numerblox.model import RandomModel
from numerblox.postprocessing import MeanEnsembler

In [10]:
random_model = RandomModel()

In [11]:
pipeline2 = ModelPipeline(models=[joblib_model, random_model],
                          pipeline_name="joblib_and_random",
                          postprocessors=[MeanEnsembler(cols=['prediction_joblib',
                                                              'prediction_random'],
                                                        final_col_name="prediction_ensemble")]
                         )

In [12]:
multi_model_dataf = pipeline2(dataf)

joblib_and_random Preprocessing:: 0it [00:00, ?it/s]

joblib_and_random Model prediction:   0%|          | 0/2 [00:00<?, ?it/s]

joblib_and_random Postprocessing:   0%|          | 0/1 [00:00<?, ?it/s]

In [13]:
multi_model_dataf.get_prediction_data.head(3)

id,prediction_joblib,prediction_random,prediction_ensemble
n0001e4a82d5531c,0.480704,0.278381,0.379543
n000ace6d1f6367e,0.834582,0.472462,0.653522
n000ae61e2b11e0a,0.459723,0.039903,0.249813
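
The ensemble column in the output above is consistent with a plain row-wise arithmetic mean of the two prediction columns. The sketch below reproduces those three rows with the standard library only. It illustrates the arithmetic and is not the `MeanEnsembler` source; the variable names simply mirror the `cols` and `final_col_name` arguments used earlier.

```python
from statistics import fmean

# Prediction values copied from the first three output rows above.
rows = [
    {"id": "n0001e4a82d5531c", "prediction_joblib": 0.480704, "prediction_random": 0.278381},
    {"id": "n000ace6d1f6367e", "prediction_joblib": 0.834582, "prediction_random": 0.472462},
    {"id": "n000ae61e2b11e0a", "prediction_joblib": 0.459723, "prediction_random": 0.039903},
]

cols = ["prediction_joblib", "prediction_random"]
final_col_name = "prediction_ensemble"

for row in rows:
    # Average the selected prediction columns for this row.
    row[final_col_name] = fmean([row[c] for c in cols])
```

Each resulting `prediction_ensemble` value agrees with the displayed output to six decimal places.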


## Example 3. ModelPipelineCollection use case

Multiple `ModelPipeline` objects can be combined into a `ModelPipelineCollection`. This is convenient if you use the same starting dataset but have multiple pipelines with different Preprocessors, Models and/or Postprocessors.

We will reuse the pipelines from Example 1 and Example 2 for illustration.

In [14]:
pipeline_collection = ModelPipelineCollection([pipeline1, pipeline2])

Pipelines can be retrieved by name. If no `pipeline_name` is specified, a [UUID4](https://docs.python.org/3/library/uuid.html#example) is generated for that pipeline.
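
A name fallback of this kind can be sketched with the standard `uuid` module. The helper `resolve_pipeline_name` is hypothetical, and whether the library stores the hex form or the dashed string form of the UUID is an assumption made here for illustration.

```python
import uuid
from typing import Optional


def resolve_pipeline_name(pipeline_name: Optional[str] = None) -> str:
    """Return the given name, or a random UUID4 hex string when none is given."""
    # Assumption: the hex form is used; the library may format the UUID differently.
    return pipeline_name if pipeline_name else uuid.uuid4().hex


named = resolve_pipeline_name("joblib_pipeline")
generated = resolve_pipeline_name()
```

Generated names are unique but not readable, so passing an explicit `pipeline_name` makes later lookups by name much easier.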

In [15]:
print(f"Pipeline names in collection: {pipeline_collection.pipeline_names}")
pipeline_collection.get_pipeline("joblib_pipeline")

Pipeline names in collection: ['joblib_pipeline', 'joblib_and_random']


<numerai_blocks.model_pipeline.ModelPipeline at 0x7fbe5a35df10>

All pipelines in the collection can be run with a single `NumerFrame` as input. The pipeline collection returns a dictionary that maps each pipeline name to the `NumerFrame` result of that pipeline.
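
The name-to-result mapping can be pictured with a small stand-in written in plain Python. `NamedPipeline` and `TinyCollection` are hypothetical classes operating on lists of numbers. They mimic only the interface described here (`pipeline_names`, `get_pipeline`, and calling the collection) and say nothing about how `ModelPipelineCollection` is implemented internally.

```python
from typing import Callable, Dict, List, Sequence

Data = List[float]


class NamedPipeline:
    """Hypothetical pipeline: a name plus a function applied to the data."""

    def __init__(self, name: str, fn: Callable[[Data], Data]):
        self.name = name
        self.fn = fn

    def __call__(self, data: Data) -> Data:
        return self.fn(data)


class TinyCollection:
    """Hypothetical collection keyed by pipeline name."""

    def __init__(self, pipelines: Sequence[NamedPipeline]):
        self.pipelines: Dict[str, NamedPipeline] = {p.name: p for p in pipelines}

    @property
    def pipeline_names(self) -> List[str]:
        return list(self.pipelines)

    def get_pipeline(self, name: str) -> NamedPipeline:
        return self.pipelines[name]

    def __call__(self, data: Data) -> Dict[str, Data]:
        # Every pipeline receives its own copy of the same input.
        return {name: p(list(data)) for name, p in self.pipelines.items()}


double = NamedPipeline("double", lambda xs: [2 * x for x in xs])
negate = NamedPipeline("negate", lambda xs: [-x for x in xs])

collection = TinyCollection([double, negate])
results = collection([1, 2, 3])
```

Passing each pipeline a copy of the input keeps one pipeline's added columns from leaking into another's result, which is why each output table in this example contains only the columns of its own pipeline.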

In [16]:
prediction_results = pipeline_collection(dataf)

Processing Pipeline Collection:   0%|          | 0/2 [00:00<?, ?it/s]

joblib_pipeline Preprocessing:: 0it [00:00, ?it/s]

joblib_pipeline Model prediction:   0%|          | 0/1 [00:00<?, ?it/s]

joblib_pipeline Postprocessing:   0%|          | 0/1 [00:00<?, ?it/s]

joblib_and_random Preprocessing:: 0it [00:00, ?it/s]

joblib_and_random Model prediction:   0%|          | 0/2 [00:00<?, ?it/s]

joblib_and_random Postprocessing:   0%|          | 0/1 [00:00<?, ?it/s]

Now `prediction_results` can be used to easily retrieve the results.

In [17]:
prediction_results.keys()

dict_keys(['joblib_pipeline', 'joblib_and_random'])

In [18]:
prediction_results['joblib_pipeline'].get_prediction_data.head(2)

id,prediction_joblib,prediction_joblib_neutralized_0.5
n0001e4a82d5531c,0.480704,0.466076
n000ace6d1f6367e,0.834582,0.583167


In [19]:
prediction_results['joblib_and_random'].get_prediction_data.head(2)

id,prediction_joblib,prediction_random,prediction_ensemble
n0001e4a82d5531c,0.480704,0.682091,0.581398
n000ace6d1f6367e,0.834582,0.275009,0.554796


All metadata saved in the original `NumerFrame` will still be available in all resulting `NumerFrame` objects.

In [20]:
prediction_results['joblib_pipeline'].meta

{'era_col': 'era',
 'era_col_verified': True,
 'version': 2,
 'type': 'live',
 'model_name': 'test',
 'original_path': 'pipeline_test/numerai_live_data.parquet'}

In [21]:
prediction_results['joblib_and_random'].meta

{'era_col': 'era',
 'era_col_verified': True,
 'version': 2,
 'type': 'live',
 'model_name': 'test',
 'original_path': 'pipeline_test/numerai_live_data.parquet'}

