In [24]:
%load_ext autoreload
%autoreload 2
%load_ext nb_black
%load_ext lab_black




In [25]:
# default_exp model


# Model

This section implements models that generate Numerai predictions from preprocessed data.

Currently supported frameworks and formats:
1. `.joblib` (Common format to save Python objects. These models should have a `.predict` method. Especially convenient for [sklearn models](https://scikit-learn.org/stable/supervised_learning.html).)
2. `.pickle`/`.pkl` (Arbitrary Python objects. All pickled models should have a `.predict` method.)
3. `.cbm` (Easy format to load [CatBoost](https://catboost.ai/en/docs/) models.)
4. `.lgb` (Format to load [LightGBM](https://lightgbm.readthedocs.io/en/latest/) models.)
5. Baseline models for which loading from files is not relevant (i.e. `ConstantModel` and `RandomModel`.)


It is recommended to use models within `ModelPipeline`s (section 6), but they can also be used on their own.

The last section of this notebook explains two ways to implement your own models for `numerai-blocks`:
1. Inherit from `BaseModel` (custom prediction logic).
2. Inherit from `DirectoryModel` (make predictions for all models in a directory with a given file suffix). Prediction logic is already implemented, so you only write the model loading logic.

In [26]:
# hide
from nbdev.showdoc import *


In [27]:
#export
import gc
import uuid
import joblib
import pickle
import numpy as np
import pandas as pd
import lightgbm as lgb
from pathlib import Path
from typing import Union
from tqdm.auto import tqdm
from catboost import CatBoost
from typeguard import typechecked
from abc import ABC, abstractmethod
from sklearn.dummy import DummyRegressor

from numerai_blocks.numerframe import NumerFrame, create_numerframe
from numerai_blocks.preprocessing import display_processor_info


## 0. Base

### 0.1. BaseModel

The `BaseModel` is an abstract base class that handles some directory logic and naming conventions. All models should inherit from `BaseModel` and be sure to implement the `.predict` method.

In general, models are loaded from disk. If your model does not involve any model files, pass an empty string (`""`) as the `model_directory` argument.

Note that a new prediction column will have the column name `prediction_{MODEL_NAME}`.
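
The naming convention can be sketched in isolation. The helper below is a hypothetical stand-in (it is not part of `numerai-blocks`) that mirrors how `BaseModel.__init__` derives the column name, including the fallback to a random hex name when no `model_name` is given.

```python
import uuid
from typing import Optional


def prediction_column(model_name: Optional[str] = None) -> str:
    """Hypothetical helper mirroring BaseModel's column naming."""
    # Fall back to a random 32-character hex string when no name is given.
    name = model_name if model_name else uuid.uuid4().hex
    return f"prediction_{name}"


print(prediction_column("test"))
```

Passing an explicit `model_name` is usually preferable, since the random fallback produces a different column name on every instantiation.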

In [28]:
#export
class BaseModel(ABC):
    """
    Setup for model prediction on a Dataset.

    :param model_directory: Main directory from which to read in models.
    :param model_name: Name that will be used to create column names and for display purposes.
    """
    def __init__(self, model_directory: str, model_name: str = None, *args, **kwargs):
        self.model_directory = Path(model_directory)
        self.__dict__.update(*args, **kwargs)
        self.model_name = model_name if model_name else uuid.uuid4().hex
        self.prediction_col_name = f"prediction_{self.model_name}"
        self.description = f"{self.__class__.__name__}: '{self.model_name}' prediction"

    @abstractmethod
    def predict(self, dataf: Union[pd.DataFrame, NumerFrame]) -> NumerFrame:
        """ Return NumerFrame with column added for prediction. """
        ...
        return NumerFrame(dataf)

    def __call__(self, dataf: Union[pd.DataFrame, NumerFrame]) -> NumerFrame:
        return self.predict(dataf=dataf)


### 0.2. DirectoryModel

A `DirectoryModel` assumes that you have a directory of models and you want to load + predict for all models with a certain `file_suffix` (for example, `.joblib`, `.cbm` or `.lgb`). This base class handles prediction logic for this situation.

If you are thinking of implementing your own model and this is your use case, inherit from `DirectoryModel` and be sure to implement the `load_models` method. You then don't have to implement any prediction logic in the `.predict` method.

When inheriting from `DirectoryModel` the only mandatory method implementation is for `load_models`. It should instantiate all models and return them as a `list`.

Currently only single target predictions are supported for `DirectoryModel`. Future versions will also support Directory models for multiple targets. If your use case concerns multiple target predictions consider using `SingleModel`s or inheriting from `BaseModel`.
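
The averaging performed by `DirectoryModel.predict` can be illustrated without any model files. The sketch below is a simplified, standalone approximation: `ScaledModel` and `average_predictions` are hypothetical names, and plain lists stand in for the feature data of a `NumerFrame`.

```python
from typing import List, Sequence


class ScaledModel:
    """Hypothetical stand-in for a loaded model with a .predict method."""

    def __init__(self, factor: float):
        self.factor = factor

    def predict(self, features: Sequence[float]) -> List[float]:
        return [self.factor * x for x in features]


def average_predictions(models, features: Sequence[float]) -> List[float]:
    """Add each model's share of the mean to a running total."""
    total = [0.0] * len(features)
    for model in models:
        preds = model.predict(features)
        # Dividing by the model count up front yields the mean after the loop.
        total = [t + p / len(models) for t, p in zip(total, preds)]
    return total


models = [ScaledModel(0.25), ScaledModel(0.75)]
averaged = average_predictions(models, [0.0, 1.0, 2.0])
print(averaged)
```

Accumulating `prediction / total_models` per model avoids keeping all individual prediction arrays in memory at once.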

In [29]:
#export
class DirectoryModel(BaseModel):
    """
    Base class implementation for JoblibModel, CatBoostModel, LGBMModel, etc.
    Walks through every file with given file_suffix in a directory.
    :param model_directory: Main directory from which to read in models.
    :param file_suffix: File extension to load, given without the leading dot (for example, joblib, pkl, cbm or lgb).
    :param model_name: Name that will be used to create column names and for display purposes.
    """
    def __init__(self, model_directory: str, file_suffix: str, model_name: str = None, *args, **kwargs):
        super().__init__(model_directory=model_directory,
                         model_name=model_name,
                         *args, **kwargs
                         )
        self.file_suffix = file_suffix
        self.model_paths = list(self.model_directory.glob(f'*.{self.file_suffix}'))
        if self.file_suffix:
            assert self.model_paths, f"No {self.file_suffix} files found in {self.model_directory}."
        self.total_models = len(self.model_paths)

    @display_processor_info
    def predict(self, dataf: NumerFrame, *args, **kwargs) -> NumerFrame:
        """
        Use all recognized models to make predictions and average them out.
        :param dataf: A preprocessed DataFrame whose features can all be passed to the model predict method.
        *args, **kwargs will be passed on to the model.predict method.
        :return: A new dataset with prediction column added.
        """
        dataf.loc[:, self.prediction_col_name] = np.zeros(len(dataf))
        models = self.load_models()
        for model in tqdm(models, desc=self.description, position=1):
            predictions = model.predict(dataf.get_feature_data, *args, **kwargs)
            dataf.loc[:, self.prediction_col_name] += predictions / self.total_models
        del models; gc.collect()
        return NumerFrame(dataf)

    @abstractmethod
    def load_models(self) -> list:
        """ Instantiate all models detected in self.model_paths. """
        ...



## 1. Standard model formats

Implementations for common Numerai model prediction situations.

### 1.1. Single Model file

In many cases you just want to load a single model file and create predictions for that model. `SingleModel` supports this.

This class supports multiple model formats for easy use. All models should have a `.predict` method.
Currently, the `.joblib`, `.cbm`, `.pkl` and `.pickle` formats are supported.
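
The suffix-based dispatch can be sketched with the standard library alone. This is a reduced, hypothetical version (only the pickle formats, with made-up names `LOADERS` and `load_by_suffix`), and a plain dict stands in for a fitted model. Note that `pickle.load` expects an open binary file handle rather than a path, so the loader opens the file itself.

```python
import pickle
import tempfile
from pathlib import Path


def load_pickle(path):
    """pickle.load needs an open binary file handle, not a path."""
    with open(path, "rb") as f:
        return pickle.load(f)


# Hypothetical, reduced mapping: file suffix -> loader function.
LOADERS = {".pkl": load_pickle, ".pickle": load_pickle}


def load_by_suffix(path):
    """Pick the loader that matches the file suffix, or fail with a clear message."""
    path = Path(path)
    if path.suffix not in LOADERS:
        raise NotImplementedError(
            f"Format '{path.suffix}' is not available. "
            f"Available formats are {list(LOADERS)}"
        )
    return LOADERS[path.suffix](path)


with tempfile.TemporaryDirectory() as tmp:
    model_path = Path(tmp) / "example.pkl"
    with open(model_path, "wb") as f:
        pickle.dump({"constant": 0.5}, f)  # stand-in for a fitted model
    loaded = load_by_suffix(model_path)

print(loaded)
```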

**Things to keep in mind**
- This model will use all available features in the `NumerFrame` for prediction. Make sure to apply proper feature selection first if the model does not use all features.
- If you have XGBoost models we recommend saving them as `.joblib`.
- The added prediction column will have the column name `prediction_{MODEL_NAME}` if 1 target is predicted.
For multiple targets the new column names will be `prediction_{MODEL_NAME}_{i}` for each target number i (starting with 0).
- We welcome the Numerai community to extend `SingleModel` for more file formats. See the Contributing section in `README.md` for more information on contributing.
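
The column naming for single and multiple targets follows from the shape of the prediction array. The standalone function below is a sketch that mirrors `SingleModel.get_prediction_col_names`; the name `prediction_col_names` and the example shapes are illustrative assumptions.

```python
from typing import List, Tuple


def prediction_col_names(base_name: str, pred_shape: Tuple[int, ...]) -> List[str]:
    """Return one column name per predicted target."""
    if len(pred_shape) > 1:
        # Multi-target: predictions have shape (n_rows, n_targets).
        return [f"{base_name}_{i}" for i in range(pred_shape[1])]
    # Single target: predictions have shape (n_rows,).
    return [base_name]


single = prediction_col_names("prediction_test", (100,))
multi = prediction_col_names("prediction_test", (100, 3))
print(single)
print(multi)
```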

In [30]:
#export
@typechecked
class SingleModel(BaseModel):
    """
    Load single model from file and perform prediction logic.
    :param model_file_path: Path to the single model file to load.
    :param model_name: Name that will be used to create column names and for display purposes.
    """
    def __init__(self, model_file_path: str, model_name: str = None, *args, **kwargs):
        self.model_file_path = Path(model_file_path)
        assert self.model_file_path.exists(), f"File path '{self.model_file_path}' does not exist."
        assert self.model_file_path.is_file(), f"File path must point to file. Not valid for '{self.model_file_path}'."
        super().__init__(model_directory=str(self.model_file_path.parent),
                         model_name=model_name,
                         *args, **kwargs
                         )
        self.model_suffix = self.model_file_path.suffix
        self.suffix_to_model_mapping = {".joblib": joblib.load,
                                        ".cbm": CatBoost().load_model,
                                        ".pkl": pickle.load,
                                        ".pickle": pickle.load}
        self.__check_valid_suffix()

    def predict(self, dataf: NumerFrame, *args, **kwargs) -> NumerFrame:
        model = self._load_model()
        predictions = model.predict(dataf.get_feature_data, *args, **kwargs)
        prediction_cols = self.get_prediction_col_names(predictions.shape)
        dataf.loc[:, prediction_cols] = predictions
        del model; gc.collect()
        return NumerFrame(dataf)

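    def _load_model(self, *args, **kwargs):
        """ Load arbitrary model from path using suffix to model mapping. """
        loader = self.suffix_to_model_mapping[self.model_suffix]
        if loader is pickle.load:
            # pickle.load expects an open binary file handle, not a path.
            with open(self.model_file_path, "rb") as f:
                return loader(f, *args, **kwargs)
        return loader(str(self.model_file_path), *args, **kwargs)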

    def get_prediction_col_names(self, pred_shape: tuple) -> list:
        """ Create multiple columns if predictions are multi-target. """
        if len(pred_shape) > 1:
            # Multi target
            prediction_cols = [f"{self.prediction_col_name}_{i}" for i in range(pred_shape[1])]
        else:
            # Single target
            prediction_cols = [self.prediction_col_name]
        return prediction_cols

    def __check_valid_suffix(self):
        """ Detailed message if model is not supported in this class. """
        try:
            self.suffix_to_model_mapping[self.model_suffix]
        except KeyError:
            raise NotImplementedError(
                f"Format '{self.model_suffix}' is not available. Available formats are {list(self.suffix_to_model_mapping.keys())}"
            )



In [31]:
dataset = create_numerframe("test_assets/mini_numerai_version_2_data.parquet")
test_paths = ["test_assets/joblib_v2_example_model.joblib"]
for path in test_paths:
    model = SingleModel(path, model_name="test")
    print(model.predict(dataset).get_prediction_data.head(2))

                  prediction_test
id                               
n559bd06a8861222         0.506948
n9d39dea58c9e3cf         0.492578



In [32]:
model = SingleModel(test_paths[0], model_name="test")
model.suffix_to_model_mapping

{'.joblib': <function joblib.numpy_pickle.load(filename, mmap_mode=None)>,
 '.cbm': <bound method CatBoost.load_model of <catboost.core.CatBoost object at 0x7f8b3093acd0>>,
 '.pkl': <function _pickle.load(file, *, fix_imports=True, encoding='ASCII', errors='strict')>,
 '.pickle': <function _pickle.load(file, *, fix_imports=True, encoding='ASCII', errors='strict')>}


### 1.2. Joblib directory

Many models, such as those from `scikit-learn`, can conveniently be saved as `.joblib` files. This class automatically loads all `.joblib` files in a given folder and generates (averaged out) predictions.

In [33]:
#export
@typechecked
class JoblibModel(DirectoryModel):
    """
    Load and predict for arbitrary models in directory saved as .joblib.
    All loaded models should have a .predict method and accept the features present in the data.
    :param model_directory: Main directory from which to read in models.
    :param model_name: Name that will be used to create column names and for display purposes.
    """
    def __init__(self, model_directory: str, model_name: str = None, *args, **kwargs):
        file_suffix = 'joblib'
        super().__init__(model_directory=model_directory,
                         file_suffix=file_suffix,
                         model_name=model_name,
                         *args, **kwargs
                         )

    def load_models(self) -> list:
        return [joblib.load(path) for path in self.model_paths]


In [34]:
dataset = create_numerframe("test_assets/mini_numerai_version_2_data.parquet", metadata={"version": 2})
model = JoblibModel("test_assets", model_name="Joblib_LGB")
predictions = model.predict(dataset).get_prediction_data
assert predictions['prediction_Joblib_LGB'].between(0, 1).all()
predictions.head(3)

JoblibModel: 'Joblib_LGB' prediction:   0%|          | 0/1 [00:00<?, ?it/s]

                  prediction_Joblib_LGB
id
n559bd06a8861222               0.506948
n9d39dea58c9e3cf               0.492578
nb64f06d3a9fc9f1               0.490879



### 1.3. CatBoost directory (.cbm)

This model setup loads in all `CatBoost` (`.cbm`) models present in a given directory and makes (averaged out) predictions.

In [35]:
#export
@typechecked
class CatBoostModel(DirectoryModel):
    """
    Load and predict with all .cbm models (CatBoostRegressor) in directory.
    :param model_directory: Main directory from which to read in models.
    :param model_name: Name that will be used to create column names and for display purposes.
    """
    def __init__(self, model_directory: str, model_name: str = None, *args, **kwargs):
        file_suffix = 'cbm'
        super().__init__(model_directory=model_directory,
                         file_suffix=file_suffix,
                         model_name=model_name,
                         *args, **kwargs
                         )

    def load_models(self) -> list:
        # Pass paths as strings, consistent with the other loaders.
        return [CatBoost().load_model(str(path)) for path in self.model_paths]


In [36]:
from numerai_blocks.preprocessing import GroupStatsPreProcessor
dataset = create_numerframe("test_assets/mini_numerai_version_1_data.csv", {"version": 1})
processed_dataset = GroupStatsPreProcessor()(dataset)
model = CatBoostModel("test_assets", model_name="CB")


In [37]:
predictions = model.predict(processed_dataset).get_prediction_data
assert predictions['prediction_CB'].between(0, 1).all()
predictions.head(3)

CatBoostModel: 'CB' prediction:   0%|          | 0/1 [00:00<?, ?it/s]

   prediction_CB
0       0.492046
1       0.499881
2       0.485325



### 1.4. LightGBM directory (.lgb)

This model setup loads in all `LightGBM` (`.lgb`) models present in a given directory and makes (averaged out) predictions.

In [38]:
#export
@typechecked
class LGBMModel(DirectoryModel):
    """ Load and predict with all .lgb models (LightGBM) in directory. """
    def __init__(self, model_directory: str, model_name: str = None, *args, **kwargs):
        file_suffix = 'lgb'
        super().__init__(model_directory=model_directory,
                         file_suffix=file_suffix,
                         model_name=model_name,
                         *args, **kwargs
                         )

    def load_models(self) -> list:
        return [lgb.Booster(model_file=str(path)) for path in self.model_paths]


In [39]:
dataset = create_numerframe("test_assets/mini_numerai_version_2_data.parquet")
model = LGBMModel("test_assets", model_name="LGB")
predictions = model.predict(dataset).get_prediction_data
assert predictions['prediction_LGB'].between(0, 1).all()
predictions.head(3)

LGBMModel: 'LGB' prediction:   0%|          | 0/1 [00:00<?, ?it/s]

                  prediction_LGB
id
n559bd06a8861222        0.506948
n9d39dea58c9e3cf        0.492578
nb64f06d3a9fc9f1        0.490879



## 2. Baseline models

Setting a baseline is always an important step for data science problems. This section introduces models that should only be used as baselines.

### 2.1. ConstantModel

This model simply outputs a constant of your choice. Convenient for setting classification baselines.

In [40]:
#export
class ConstantModel(BaseModel):
    """
    WARNING: Only use this Model for testing purposes.
    Create constant prediction.
    :param constant: Value for constant prediction.
    :param model_name: Name that will be used to create column names and for display purposes.
    """
    def __init__(self, constant: float = 0.5, model_name: str = None):
        self.constant = constant
        model_name = model_name if model_name else f"constant_{self.constant}"
        super().__init__(model_directory="",
                         model_name=model_name
                         )
        self.clf = DummyRegressor(strategy='constant', constant=constant).fit([0.], [0.])

    def predict(self, dataf: NumerFrame) -> NumerFrame:
        dataf.loc[:, self.prediction_col_name] = self.clf.predict(dataf.get_feature_data)
        return NumerFrame(dataf)


In [41]:
constant = 0.85
dataset = create_numerframe("test_assets/mini_numerai_version_1_data.csv")
constant_model = ConstantModel(constant=constant)
predictions = constant_model.predict(dataset).get_prediction_data
assert (predictions.to_numpy() == constant).all()
predictions.head(3)

   prediction_constant_0.85
0                      0.85
1                      0.85
2                      0.85



### 2.2. RandomModel

This model returns uniformly distributed predictions in the range $[0, 1)$. It is a good baseline for regression models.
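
When comparing against a random baseline it can help to make the draws reproducible. The sketch below uses only the standard library and a hypothetical `uniform_baseline` helper with a `seed` argument. This is an illustrative assumption: `RandomModel` itself draws from `np.random.uniform` and does not take a seed.

```python
import random
from typing import List, Optional


def uniform_baseline(n_rows: int, seed: Optional[int] = None) -> List[float]:
    """Draw n_rows values uniformly from [0, 1)."""
    # A dedicated generator keeps the draws independent of global random state.
    rng = random.Random(seed)
    return [rng.random() for _ in range(n_rows)]


first = uniform_baseline(5, seed=42)
second = uniform_baseline(5, seed=42)
print(first == second)  # the same seed gives the same draws
```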

In [42]:
#export
class RandomModel(BaseModel):
    """
    WARNING: Only use this Model for testing purposes.
    Create uniformly distributed predictions.
    :param model_name: Name that will be used to create column names and for display purposes.
    """
    def __init__(self, model_name: str = None):
        model_name = model_name if model_name else "random"
        super().__init__(model_directory="",
                         model_name=model_name
                         )

    def predict(self, dataf: Union[pd.DataFrame, NumerFrame]) -> NumerFrame:
        dataf.loc[:, self.prediction_col_name] = np.random.uniform(size=len(dataf))
        return NumerFrame(dataf)


In [43]:
dataset = create_numerframe("test_assets/mini_numerai_version_1_data.csv", metadata={})
random_model = RandomModel()
predictions = random_model.predict(dataset).get_prediction_data
assert predictions['prediction_random'].between(0, 1).all()
predictions.head(3)

   prediction_random
0           0.570822
1           0.086813
2           0.452903



## 3. Custom Model

There are two different ways to implement new models. Both have their own conveniences and use cases.
1. Inherit from `BaseModel` (custom prediction logic).
2. Inherit from `DirectoryModel` (make predictions for all models in a directory with a given file suffix). Prediction logic is already implemented, so you only write the model loading logic.

**Option 1** works well when you have no or only a single file that you use for generating predictions.
Examples:
1. Loading a model is not relevant or your model is already loaded in memory.
2. You would like predictions for one model loaded from disk.
3. The object you are loading already aggregates multiple models and transformation steps (such as [scikit-learn FeatureUnion](https://scikit-learn.org/stable/modules/generated/sklearn.pipeline.FeatureUnion.html)).

**Option 2** is convenient when you have a lot of similar models in a directory and want to generate predictions for all of them.
Examples:
1. You have multiple similar models saved through a cross validation process.
2. You have a bagging strategy where you have a lot of models trained on slightly different data or with different initializations.


### 3.1. From BaseModel

Arbitrary models can be instantiated and used for prediction generation by inheriting from `BaseModel`. Arbitrary logic (model loading, prediction, etc.) can be defined in `.predict`, as long as the method takes a `NumerFrame` as input and returns a `NumerFrame`. To enable runtime type checking, add the `@typeguard.typechecked` decorator to the class.

For clear console output we recommend adding the `@display_processor_info` decorator to the `.predict` method.

In [44]:
#export
@typechecked
class AwesomeModel(BaseModel):
    """
    - TEMPLATE -
    Predict with arbitrary prediction logic and model formats.
    """
    def __init__(self, model_directory: str, model_name: str = None, *args, **kwargs):
        super().__init__(model_directory=model_directory,
                         model_name=model_name,
                         *args, **kwargs
                         )

    @display_processor_info
    def predict(self, dataf: NumerFrame) -> NumerFrame:
        """ Return NumerFrame with column(s) added for prediction(s). """
        # Get all features
        feature_df = dataf.get_feature_data
        # Predict and add to new column
        ...
        # Pass all contents of NumerFrame to the next pipeline step
        return NumerFrame(dataf)


### 3.2. From DirectoryModel

You may want to implement a setup similar to `JoblibModel` and `CatBoostModel`. Namely, load in all models of a certain type from a directory, predict for all and take the average. If this is your use case, inherit from `DirectoryModel` and be sure to implement the `load_models` method.

For a `DirectoryModel` you should specify a `file_suffix` without the leading dot (like `joblib` or `cbm`). It is used to collect the paths of all matching model files in `self.model_paths`.
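
The file discovery can be sketched as follows, assuming the same `*.{file_suffix}` glob pattern that `DirectoryModel.__init__` uses. `find_model_paths` is a hypothetical helper and the file names are made up for illustration.

```python
import tempfile
from pathlib import Path


def find_model_paths(model_directory, file_suffix):
    """Collect all files in model_directory that end in .<file_suffix>."""
    # The pattern adds the dot itself, so the suffix is given without it.
    return sorted(Path(model_directory).glob(f"*.{file_suffix}"))


with tempfile.TemporaryDirectory() as tmp:
    for name in ["fold_0.cbm", "fold_1.cbm", "notes.txt"]:
        (Path(tmp) / name).touch()
    found = [p.name for p in find_model_paths(tmp, "cbm")]
    # A leading dot produces the pattern "*..cbm", which matches nothing here.
    with_dot = find_model_paths(tmp, ".cbm")

print(found)
print(with_dot)
```

Because the pattern prepends the dot, a suffix such as `.cbm` finds no files, which would trigger the "No files found" assertion in `DirectoryModel`.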

The `.predict` method will in this case already be implemented, but can be overridden if the prediction logic is more complex. For example, if you want to apply weighted averaging or a geometric mean for models within a given directory.
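
The two aggregation schemes mentioned above can be sketched independently of `numerai-blocks`. The function names, weights and example predictions below are illustrative assumptions. Each inner list holds one model's predictions, and the geometric mean assumes strictly positive predictions.

```python
import math
from typing import List, Sequence


def weighted_average(
    predictions: Sequence[Sequence[float]], weights: Sequence[float]
) -> List[float]:
    """Per-row weighted mean over models; weights need not sum to 1."""
    total_weight = sum(weights)
    return [
        sum(w * p for w, p in zip(weights, row)) / total_weight
        for row in zip(*predictions)  # one tuple of model outputs per row
    ]


def geometric_mean(predictions: Sequence[Sequence[float]]) -> List[float]:
    """Per-row geometric mean over models, computed in log space."""
    n_models = len(predictions)
    return [
        math.exp(sum(math.log(p) for p in row) / n_models)
        for row in zip(*predictions)
    ]


model_a = [0.2, 0.5]
model_b = [0.8, 0.5]
weighted = weighted_average([model_a, model_b], weights=[3, 1])
geometric = geometric_mean([model_a, model_b])
print(weighted)
print(geometric)
```

In an overridden `.predict`, logic like this would replace the running average over the models returned by `load_models`.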


As with inheriting from `BaseModel`, add the `@typeguard.typechecked` decorator to the class to enable runtime type checking.

In [45]:
#export
@typechecked
class AwesomeDirectoryModel(DirectoryModel):
    """
    - TEMPLATE -
    Load in all models of arbitrary file format and predict for all.
    """
    def __init__(self, model_directory: str, model_name: str = None, *args, **kwargs):
        # Given without a leading dot; DirectoryModel globs for '*.{file_suffix}'.
        file_suffix = 'anything'
        super().__init__(model_directory=model_directory,
                         file_suffix=file_suffix,
                         model_name=model_name,
                         *args, **kwargs
                         )

    def load_models(self) -> list:
        """ Instantiate all models and return as a list. (abstract method) """
        ...


----------------------------------------------

In [46]:
# hide
# Run this cell to sync all changes with library
from nbdev.export import notebook2script

notebook2script()

Converted 01_download.ipynb.
Converted 02_numerframe.ipynb.
Converted 03_preprocessing.ipynb.
Converted 04_model.ipynb.
Converted 05_postprocessing.ipynb.
Converted 06_modelpipeline.ipynb.
Converted 07_evaluation.ipynb.
Converted 08_key.ipynb.
Converted 09_submission.ipynb.
Converted 10_staking.ipynb.
Converted index.ipynb.

