In [1]:
%load_ext autoreload
%autoreload 2
%load_ext nb_black
%load_ext lab_black


In [2]:
# default_exp preprocessing


# Preprocessing

This section provides functionality for all data manipulation steps that are needed before data is passed into a model for prediction. We group all these steps under Preprocessing. This includes feature/target selection, feature/target engineering and feature/target manipulation.

Some preprocessors work with both Pandas DataFrames and NumerFrames. Most preprocessors use specific `NumerFrame` functionality.

In the last section we explain how you can implement your own Preprocessor that integrates well with the rest of this framework.

In [3]:
# hide
from nbdev.showdoc import *


In [4]:
#export
import time
import numpy as np
import pandas as pd
import datetime as dt
from typing import Union
from functools import wraps
from typeguard import typechecked
from abc import ABC, abstractmethod
from rich import print as rich_print

from numerai_blocks.numerframe import NumerFrame, create_numerframe


## 0. Base

These objects provide a solid base for all processing (pre and post) and log relevant information about the data pipeline.

### 0.1. BaseProcessor

`BaseProcessor` defines common functionality for Preprocessors and Postprocessors (Section 5).

Every Preprocessor should inherit from `BaseProcessor` and implement the `.transform` method.

In [5]:
#export
class BaseProcessor(ABC):
    """ Common functionality for preprocessors and postprocessors. """
    def __init__(self):
        ...

    @abstractmethod
    def transform(self, dataf: Union[pd.DataFrame, NumerFrame], *args, **kwargs) -> NumerFrame:
        ...

    def __call__(self, dataf: Union[pd.DataFrame, NumerFrame], *args, **kwargs) -> NumerFrame:
        # Pass dataf positionally so extra positional args do not collide with it.
        return self.transform(dataf, *args, **kwargs)

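The two behaviours that `BaseProcessor` relies on, abstract-method enforcement and `__call__` delegating to `transform`, can be tried out without any Numerai data. The following is a minimal sketch under stated assumptions: `SketchProcessor` and `DoubleProcessor` are hypothetical stand-ins that operate on plain lists instead of `NumerFrame` objects and are not part of `numerai-blocks`.

```python
from abc import ABC, abstractmethod


class SketchProcessor(ABC):
    """Stand-in for BaseProcessor; works on plain lists instead of NumerFrames."""

    @abstractmethod
    def transform(self, data: list, *args, **kwargs) -> list:
        ...

    def __call__(self, data: list, *args, **kwargs) -> list:
        # Calling the instance delegates to transform.
        return self.transform(data, *args, **kwargs)


class DoubleProcessor(SketchProcessor):
    """Hypothetical processor that multiplies every value by a factor."""

    def transform(self, data: list, factor: int = 2) -> list:
        return [value * factor for value in data]


processor = DoubleProcessor()
doubled = processor([1, 2, 3])            # same as processor.transform([1, 2, 3])
tripled = processor([1, 2, 3], factor=3)  # extra keyword arguments are forwarded

try:
    SketchProcessor()  # a class with unimplemented abstract methods cannot be instantiated
    abstract_error = None
except TypeError as err:
    abstract_error = err
```

A subclass that forgets to implement `transform` fails at instantiation time with a `TypeError`, which surfaces the mistake before any data is processed.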

### 0.2. Logging

We would like to keep an overview of which steps are done in a data pipeline and where processing bottlenecks occur.
The decorator below will display:
1. When a step has finished.
2. What the output shape of the data is.
3. How long the step took to finish.

To use this functionality, simply add `@display_processor_info` as a decorator to the function/method you want to track.

We will use this decorator throughout the pipeline (preprocessing, model and postprocessing).

Inspiration for this decorator: [Calmcode Pandas Pipe Logs](https://calmcode.io/pandas-pipe/logs.html)

In [6]:
#export
def display_processor_info(func):
    """ Fancy console output for data processing. """
    @wraps(func)
    def wrapper(*args, **kwargs):
        tic = dt.datetime.now()
        result = func(*args, **kwargs)
        time_taken = str(dt.datetime.now() - tic)
        class_name = func.__qualname__.split('.')[0]
        rich_print(f":white_check_mark: Finished step [bold]{class_name}[/bold]. Output shape={result.shape}. Time taken for step: [blue]{time_taken}[/blue]. :white_check_mark:")
        return result
    return wrapper

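The step name in the log message comes from `func.__qualname__`: for a method this is `ClassName.method`, so the first part is the class name, while for a plain function it is just the function name. A simplified sketch of the same pattern using only the standard library is shown below; `log_step`, `Squarer` and `drop_negatives` are hypothetical names, and the output length is reported instead of `.shape` because plain lists are used.

```python
import time
from functools import wraps


def log_step(func):
    """Simplified stand-in for display_processor_info using plain print."""
    @wraps(func)
    def wrapper(*args, **kwargs):
        tic = time.perf_counter()
        result = func(*args, **kwargs)
        elapsed = time.perf_counter() - tic
        # 'ClassName.method' for methods, just the name for plain functions.
        step_name = func.__qualname__.split(".")[0]
        print(f"Finished step {step_name}. Output length={len(result)}. Time taken: {elapsed:.4f}s")
        return result
    return wrapper


class Squarer:
    @log_step
    def transform(self, values: list) -> list:
        return [v * v for v in values]


@log_step
def drop_negatives(values: list) -> list:
    return [v for v in values if v >= 0]


squared = Squarer().transform([1, -2, 3])
filtered = drop_negatives([1, -2, 3])
```

Because the wrapper is created with `functools.wraps`, the decorated callables keep their original `__name__` and docstring.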

In [7]:
class TestDisplay:
    """
    Small test for logging.
    Output should mention 'TestDisplay',
    Return output shape of (10, 314) and
    time taken for step should be close to 2 seconds.
    """
    def __init__(self, dataf: NumerFrame):
        self.dataf = dataf
    @display_processor_info
    def test(self) -> NumerFrame:
        time.sleep(2)
        return self.dataf

dataf = create_numerframe("test_assets/mini_numerai_version_1_data.csv")
TestDisplay(dataf).test();


## 1. Common preprocessing steps


We invite the Numerai community to develop new preprocessors. This section implements commonly used preprocessing for Numerai.

### 1.0 Tournament agnostic

This section covers all the available Preprocessors that can be applied to both Numerai Classic and Numerai Signals.

#### 1.0.1. CopyPreProcessor

The first and most obvious preprocessor is copying, which `ModelPipeline` (Section 4) applies by default so that the original dataset you load is not modified.

In [8]:
#export
@typechecked
class CopyPreProcessor(BaseProcessor):
    """Copy DataFrame to avoid manipulation of original DataFrame. """
    def __init__(self):
        super().__init__()

    @display_processor_info
    def transform(self, dataf: Union[pd.DataFrame, NumerFrame]) -> NumerFrame:
        return NumerFrame(dataf.copy())


In [9]:
dataset = create_numerframe("test_assets/mini_numerai_version_1_data.csv", metadata={"version": 1})
copied_dataset = CopyPreProcessor().transform(dataset)
assert np.array_equal(copied_dataset.values, dataset.values)
assert dataset.meta == copied_dataset.meta


#### 1.0.2. FeatureSelectionPreProcessor

`FeatureSelectionPreProcessor` keeps the features you pass, plus all columns that are not features (targets, predictions and auxiliary columns).

In [10]:
#export
@typechecked
class FeatureSelectionPreProcessor(BaseProcessor):
    """
    Keep only features given + all target, predictions and aux columns.
    """
    def __init__(self, feature_cols: Union[str, list]):
        super().__init__()
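        # Wrap a single column name in a list so it can be concatenated in transform.
        self.feature_cols = [feature_cols] if isinstance(feature_cols, str) else feature_cols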

    @display_processor_info
    def transform(self, dataf: NumerFrame) -> NumerFrame:
        keep_cols = self.feature_cols + dataf.target_cols + dataf.prediction_cols + dataf.aux_cols
        dataf = dataf.loc[:, keep_cols]
        return NumerFrame(dataf)


In [11]:
selected_dataset = FeatureSelectionPreProcessor(feature_cols=['feature_wisdom1']).transform(dataset)
assert selected_dataset.get_feature_data.shape[1] == 1
assert dataset.meta == selected_dataset.meta


In [12]:
selected_dataset.head(2)

|  | feature_wisdom1 | target | id | era | data_type |
| --- | --- | --- | --- | --- | --- |
| 0 | 0.25 | 0.5 | n000315175b67977 | era1 | train |
| 1 | 0.5 | 0.25 | n0014af834a96cdd | era1 | train |


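The selection logic reduces to building one list of columns to keep: the selected features first, followed by every non-feature column. A minimal sketch with plain Python lists is given below. The helper `build_keep_cols` is hypothetical and not part of `numerai-blocks`; the column names mirror the output shown above. It also illustrates why a single column name passed as a string has to be wrapped in a list before concatenation.

```python
from typing import List, Union


def build_keep_cols(feature_cols: Union[str, List[str]],
                    target_cols: List[str],
                    prediction_cols: List[str],
                    aux_cols: List[str]) -> List[str]:
    """Hypothetical helper: selected features first, then all non-feature columns."""
    # Wrap a single name so that list concatenation works; str + list raises TypeError.
    if isinstance(feature_cols, str):
        feature_cols = [feature_cols]
    return list(feature_cols) + target_cols + prediction_cols + aux_cols


keep_cols = build_keep_cols(
    feature_cols="feature_wisdom1",
    target_cols=["target"],
    prediction_cols=[],
    aux_cols=["id", "era", "data_type"],
)
```

The resulting list can then be used for label-based column selection, as `dataf.loc[:, keep_cols]` does in `transform`.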

#### 1.0.3. TargetSelectionPreProcessor

`TargetSelectionPreProcessor` keeps the targets you pass, plus all columns that are not targets.

This is not relevant for an inference pipeline, but it is especially convenient for Numerai Classic training if you train on a subset of the available targets. It can also be applied to Signals if you are using multiple engineered targets in your pipeline.


In [13]:
#export
@typechecked
class TargetSelectionPreProcessor(BaseProcessor):
    """
    Keep only targets given + all feature, predictions and aux columns.
    """
    def __init__(self, target_cols: Union[str, list]):
        super().__init__()
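        # Wrap a single column name in a list so it can be concatenated in transform.
        self.target_cols = [target_cols] if isinstance(target_cols, str) else target_cols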

    @display_processor_info
    def transform(self, dataf: NumerFrame) -> NumerFrame:
        keep_cols = self.target_cols + dataf.feature_cols + dataf.prediction_cols + dataf.aux_cols
        dataf = dataf.loc[:, keep_cols]
        return NumerFrame(dataf)


In [14]:
dataset = create_numerframe("test_assets/mini_numerai_version_2_data.parquet", metadata={"version": 2})
target_cols = ['target', 'target_nomi_20', 'target_nomi_60']
selected_dataset = TargetSelectionPreProcessor(target_cols=target_cols).transform(dataset)
assert selected_dataset.get_target_data.shape[1] == len(target_cols)
selected_dataset.head(2)

| id | target | target_nomi_20 | target_nomi_60 | feature_dichasial_hammier_spawner | feature_rheumy_epistemic_prancer | feature_pert_performative_hormuz | feature_hillier_unpitied_theobromine | feature_perigean_bewitching_thruster | feature_renegade_undomestic_milord | feature_koranic_rude_corf | ... | feature_drawable_exhortative_dispersant | feature_metabolic_minded_armorist | feature_investigatory_inerasable_circumvallation | feature_centroclinal_incentive_lancelet | feature_unemotional_quietistic_chirper | feature_behaviorist_microbiological_farina | feature_lofty_acceptable_challenge | feature_coactive_prefatorial_lucy | era | data_type |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| n559bd06a8861222 | 0.25 | 0.25 | 0.5 | 0.25 | 0.75 | 0.25 | 0.75 | 0.25 | 0.5 | 1.0 | ... | 1.0 | 0.0 | 0.0 | 0.25 | 0.0 | 0.0 | 1.0 | 0.25 | 297 | train |
| n9d39dea58c9e3cf | 0.5 | 0.5 | 0.75 | 0.75 | 0.5 | 0.75 | 1.0 | 0.5 | 0.25 | 0.5 | ... | 0.25 | 0.5 | 0.0 | 0.25 | 0.75 | 1.0 | 0.75 | 1.0 | 3 | train |



### 1.1. Numerai Classic

The Numerai Classic dataset has a certain structure that you may not encounter in the Numerai Signals tournament.
Therefore, this section has all preprocessors that can only be applied to Numerai Classic.

#### 1.1.0 Numerai Classic: Version agnostic

Preprocessors that work for all Numerai Classic versions.

#### 1.1.1. Numerai Classic: Version 1 specific

Preprocessors that only work for version 1 (legacy data).
When using a version 1 preprocessor, we recommend that the input `NumerFrame` has `version` in its metadata.
This prevents version 1 preprocessors from being applied to version 2 data, which would otherwise lead to confusing error messages.

If you are a new user, we recommend starting with the version 2 data and avoiding version 1.
The preprocessors below are there for legacy and compatibility reasons.

##### 1.1.1.1. GroupStatsPreProcessor

The version 1 legacy data has 6 groups of features, which allows us to calculate aggregate features (mean, standard deviation and skew) per group.

In [15]:
#export
class GroupStatsPreProcessor(BaseProcessor):
    """
    WARNING: Only supported for Version 1 (legacy) data.
    Calculate group statistics for all data groups.
    :param groups: Groups to create features for. All groups by default.
    """
    def __init__(self, groups: list = None):
        super().__init__()
        self.all_groups = ["intelligence", "wisdom", "charisma",
                           "dexterity", "strength", "constitution"]
        self.group_names = groups if groups else self.all_groups

    @display_processor_info
    def transform(self, dataf: NumerFrame, *args, **kwargs) -> NumerFrame:
        """ Check validity and add group features. """
        self._check_data_validity(dataf=dataf)
        dataf = dataf.pipe(self._add_group_features)
        return NumerFrame(dataf)

    def _add_group_features(self, dataf: pd.DataFrame) -> pd.DataFrame:
        """ Mean, standard deviation and skew for each group. """
        for group in self.group_names:
            cols = [col for col in dataf.columns if group in col]
            dataf[f"feature_{group}_mean"] = dataf[cols].mean(axis=1)
            dataf[f"feature_{group}_std"] = dataf[cols].std(axis=1)
            dataf[f"feature_{group}_skew"] = dataf[cols].skew(axis=1)
        return dataf

    def _check_data_validity(self, dataf: NumerFrame):
        """ Make sure this is only used for version 1 data. """
        assert hasattr(dataf.meta, 'version'), f"Version should be specified for '{self.__class__.__name__}'. This Preprocessor will only work on version 1 data."
        assert getattr(dataf.meta, 'version') == 1, f"'{self.__class__.__name__}' only works on version 1 data. Got version: '{getattr(dataf.meta, 'version')}'."


In [16]:
dataset = create_numerframe("test_assets/mini_numerai_version_1_data.csv", metadata={"version": 1})
group_features_dataset = GroupStatsPreProcessor().transform(dataset)
group_features_dataset.head(2)
assert group_features_dataset.meta.version == 1


In [17]:
new_cols =  ['feature_intelligence_mean', 'feature_intelligence_std', 'feature_intelligence_skew',
             'feature_wisdom_mean', 'feature_wisdom_std', 'feature_wisdom_skew',
             'feature_charisma_mean', 'feature_charisma_std', 'feature_charisma_skew',
             'feature_dexterity_mean', 'feature_dexterity_std', 'feature_dexterity_skew',
             'feature_strength_mean', 'feature_strength_std', 'feature_strength_skew',
             'feature_constitution_mean', 'feature_constitution_std', 'feature_constitution_skew']
# Every new column must be present; a non-empty intersection would pass with just one.
assert set(new_cols).issubset(group_features_dataset.columns)
group_features_dataset.get_feature_data[new_cols].head(2)

|  | feature_intelligence_mean | feature_intelligence_std | feature_intelligence_skew | feature_wisdom_mean | feature_wisdom_std | feature_wisdom_skew | feature_charisma_mean | feature_charisma_std | feature_charisma_skew | feature_dexterity_mean | feature_dexterity_std | feature_dexterity_skew | feature_strength_mean | feature_strength_std | feature_strength_skew | feature_constitution_mean | feature_constitution_std | feature_constitution_skew |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 0 | 0.333333 | 0.246183 | 0.558528 | 0.668478 | 0.236022 | -0.115082 | 0.438953 | 0.25991 | -0.004783 | 0.696429 | 0.200446 | -0.60762 | 0.480263 | 0.292829 | -0.372064 | 0.427632 | 0.27572 | 0.276155 |
| 1 | 0.208333 | 0.234359 | 0.382554 | 0.559783 | 0.358177 | -0.062362 | 0.485465 | 0.252501 | -0.021737 | 0.267857 | 0.249312 | 0.382267 | 0.407895 | 0.309866 | 0.220625 | 0.644737 | 0.33408 | -0.794938 |


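Each group statistic is computed per row, across the columns that belong to the group. To make the three numbers concrete, here is a standard-library sketch for a single row. It assumes the pandas defaults: `std` as the sample standard deviation (`ddof=1`) and `skew` as the bias-corrected (adjusted Fisher-Pearson) sample skewness. `row_stats` and the example values are hypothetical.

```python
import math
import statistics


def row_stats(values: list) -> dict:
    """Mean, sample standard deviation and bias-corrected skew of one row."""
    n = len(values)
    mean = statistics.fmean(values)
    std = statistics.stdev(values)  # sample standard deviation (n - 1 in the denominator)
    m2 = sum((v - mean) ** 2 for v in values)
    m3 = sum((v - mean) ** 3 for v in values)
    # Adjusted Fisher-Pearson skewness; needs n >= 3 and non-constant values.
    skew = n * math.sqrt(n - 1) / (n - 2) * m3 / m2 ** 1.5
    return {"mean": mean, "std": std, "skew": skew}


# Hypothetical row: values of one feature group for a single observation.
stats = row_stats([0.0, 0.25, 0.25, 0.5, 1.0])
symmetric = row_stats([0.0, 0.5, 1.0])
```

A row that is symmetric around its mean has zero skew, while a row with a long tail towards high values, such as the first example, has positive skew.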

`GroupStatsPreProcessor` should raise an `AssertionError` if `version != 1`.

In [18]:
def test_invalid_version(dataset: NumerFrame):
    copied_dataset = dataset.copy()
    copied_dataset.version = 2
    try:
        GroupStatsPreProcessor().transform(copied_dataset)
    except AssertionError:
        return True
    return False

test_invalid_version(dataset);

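The version check in `GroupStatsPreProcessor` relies on `assert`, which Python skips when it runs with the `-O` flag. One possible alternative, sketched here as an illustration rather than as the library's approach, is to raise an explicit exception. `check_version` is a hypothetical helper and `SimpleNamespace` is used as a stand-in for the metadata object of a `NumerFrame`.

```python
from types import SimpleNamespace


def check_version(meta, expected: int = 1) -> None:
    """Hypothetical guard: raise ValueError unless meta.version equals expected."""
    version = getattr(meta, "version", None)
    if version is None:
        raise ValueError("No 'version' found in metadata.")
    if version != expected:
        raise ValueError(f"Expected version {expected} data. Got version: '{version}'.")


check_version(SimpleNamespace(version=1))  # passes silently

try:
    check_version(SimpleNamespace(version=2))
    error_message = None
except ValueError as err:
    error_message = str(err)
```

With this design a test would catch `ValueError` instead of `AssertionError`, and the guard stays active regardless of interpreter flags.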

#### 1.1.2. Numerai Classic: Version 2 specific

Preprocessors that are only compatible with version 2 data. If a preprocessor is agnostic to the Numerai Classic version, implement it under heading 1.1.0.

### 1.2. Signals specific

Preprocessors that are specific to Numerai Signals.

## 2. Custom preprocessors

There is an almost unlimited number of ways to preprocess data (selection, engineering and manipulation). We have only scratched the surface with the preprocessors currently implemented in `numerai-blocks`. We invite the Numerai community to develop Numerai Classic and Signals preprocessors for `numerai-blocks`.

A new Preprocessor should inherit from `BaseProcessor` and implement a `transform` method. For efficient implementation we recommend you use `NumerFrame` functionality for preprocessing, but if this is not relevant for your application, the Preprocessor can also support a Pandas DataFrame as input as long as it returns a `NumerFrame`. This ensures that the Preprocessor still works within a full `numerai-blocks` pipeline. A template for new Preprocessors is given below.

To enable fancy logging output, add the `@display_processor_info` decorator to the `transform` method.

Note that arbitrary metadata stored by `NumerFrame` can be added or changed in a preprocessing step.

In [19]:
# export
class AwesomePreProcessor(BaseProcessor):
    """
    - TEMPLATE -
    Do some awesome preprocessing.
    """
    def __init__(self, *args, **kwargs):
        super().__init__()

    @display_processor_info
    def transform(self, dataf: NumerFrame, *args, **kwargs) -> NumerFrame:
        # Do processing
        ...
        # Pass all contents of NumerFrame to the next pipeline step
        return NumerFrame(dataf)


-------------------------------------------

In [20]:
# hide
# Run this cell to sync all changes with library
from nbdev.export import notebook2script

notebook2script()
