<a href="https://colab.research.google.com/github/djliden/numerai/blob/main/Era_Cross_Validation.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Introduction

[Jump Straight to Cross Validation Section](https://colab.research.google.com/drive/19HnPt-IMBan4uddMjfM2RLO8f9YkSCHP#scrollTo=dFwt8jJvaRWW&line=7&uniqifier=1)

This notebook introduces an era-wise cross-validation scheme. Conceptually, it works as follows:

1. The class is initialized with the `era` column of the training data (which includes the labeled `validation` data in this example).
2. The user chooses the number of training eras ("N"), the number of validation eras ("M"), and the first validation era ("X"). The indices of all observations in the M eras starting at X are returned as the test set; the indices of all observations in the N eras immediately preceding X are returned as the training set.
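
The windowing described in step 2 can be sketched on toy data. This is a minimal, standalone illustration rather than the class defined later in the notebook: the helper name `era_window_split`, its parameter names, and the toy era labels are assumptions made for this example, and it assumes era labels of the form `era<number>`.

```python
import numpy as np
import pandas as pd


def era_window_split(eras, first_valid_era, n_valid, n_train=None):
    """Return (train_idx, valid_idx) as positional row indices.

    Hypothetical helper for illustration only. Assumes labels such as
    "era12" and that first_valid_era is present in the data.
    """
    era_int = eras.str[3:].astype(int).to_numpy()   # "era12" -> 12
    unique = np.unique(era_int)                     # sorted era numbers
    start = int(np.searchsorted(unique, first_valid_era))
    valid_eras = unique[start:start + n_valid]      # the M eras starting at X
    earlier = unique[:start]                        # every era before X
    if n_train is None:
        train_eras = earlier                        # use all earlier eras
    else:
        # the N eras immediately before X
        train_eras = earlier[max(len(earlier) - n_train, 0):]
    train_idx = np.flatnonzero(np.isin(era_int, train_eras))
    valid_idx = np.flatnonzero(np.isin(era_int, valid_eras))
    return train_idx, valid_idx


toy_eras = pd.Series(["era1", "era1", "era2", "era3", "era3", "era4", "era5", "era5"])
train_idx, valid_idx = era_window_split(toy_eras, first_valid_era=4, n_valid=2, n_train=2)
print(train_idx)  # rows belonging to era2 and era3
print(valid_idx)  # rows belonging to era4 and era5
```

With N = 2, M = 2 and X = 4, the training rows come from eras 2 and 3 and the validation rows from eras 4 and 5, so no training row comes from an era at or after X.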

# Setup
We run through the setup without documenting our steps. See my `Numerai Starter Kit` notebook for details about these steps: [![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/djliden/numerai_starter_kit/blob/main/Numerai_Starter_Kit.ipynb)

## Imports, installations, etc.

In [1]:
%%capture
# install
!pip install --upgrade python-dotenv numerapi

# import dependencies
import gc
import os
from dotenv import load_dotenv, find_dotenv
from getpass import getpass
import pandas as pd
import numpy as np
import numerapi
from pathlib import Path
from scipy.stats import spearmanr
import sklearn.linear_model
from tqdm import tqdm

## NumerAPI setup

In [2]:
# Load the numerapi credentials from .env or prompt for them if not available
def credential():
    dotenv_path = find_dotenv()
    load_dotenv(dotenv_path)

    if os.getenv("NUMERAI_PUBLIC_KEY"):
        print("Loaded Numerai Public Key into Global Environment!")
    else:
        os.environ["NUMERAI_PUBLIC_KEY"] = getpass("Please enter your Numerai Public Key. You can find your key here: https://numer.ai/submit -> ")
    
    if os.getenv("NUMERAI_SECRET_KEY"):
        print("Loaded Numerai Secret Key into Global Environment!")
    else:
        os.environ["NUMERAI_SECRET_KEY"] = getpass("Please enter your Numerai Secret Key. You can find your key here: https://numer.ai/submit -> ")
    
    if os.getenv("NUMERAI_MODEL_ID_REGRESSIONS"):
        print("Loaded Numerai Model ID into Global Environment!")
    else:
        os.environ["NUMERAI_MODEL_ID_REGRESSIONS"] = getpass("Please enter your Numerai Model ID. You can find your key here: https://numer.ai/submit -> ")

credential()
public_key = os.environ.get("NUMERAI_PUBLIC_KEY")
secret_key = os.environ.get("NUMERAI_SECRET_KEY")
model_id = os.environ.get("NUMERAI_MODEL_ID_REGRESSIONS")
napi = numerapi.NumerAPI(verbosity="info", public_id=public_key, secret_key=secret_key)

Loaded Numerai Public Key into Global Environment!
Loaded Numerai Secret Key into Global Environment!
Loaded Numerai Model ID into Global Environment!


## Data Setup

In [3]:
napi.download_current_dataset()
tourn_file = Path(f'./numerai_dataset_{napi.get_current_round()}/numerai_tournament_data.csv')
train_file = Path(f'./numerai_dataset_{napi.get_current_round()}/numerai_training_data.csv')
processed_train_file = Path('./training_processed.csv')

if processed_train_file.exists():
    print("Loading the processed training data from file\n")
    training_data = pd.read_csv(processed_train_file)
else:
    tourn_iter_csv = pd.read_csv(tourn_file, iterator=True, chunksize=1_000_000)
    val_df = pd.concat([chunk[chunk['data_type'] == 'validation'] for chunk in tqdm(tourn_iter_csv)])
    tourn_iter_csv.close()
    training_data = pd.read_csv(train_file)
    training_data = pd.concat([training_data, val_df])
    training_data.reset_index(drop=True, inplace=True)
    print("Training Dataset Generated! Saving to file ...")
    training_data.to_csv(processed_train_file, index=False)


feature_cols = training_data.columns[training_data.columns.str.startswith('feature')]
target_cols = ['target']

train_idx = training_data.index[training_data.data_type=='train'].tolist()
val_idx = training_data.index[training_data.data_type=='validation'].tolist()

2021-03-13 16:38:51,496 INFO numerapi.base_api: target file already exists


Loading the processed training data from file



## Metrics Setup

In [4]:
def corr(df: pd.DataFrame) -> np.float32:
    """
    Calculate the correlation by using grouped per-era data
    :param df: A Pandas DataFrame containing the columns "era", "target" and "prediction"
    :return: The average per-era correlations.
    """
    def _score(sub_df: pd.DataFrame) -> np.float32:
        """ Calculate Spearman correlation for Pandas' apply method """
        return spearmanr(sub_df["target"],  sub_df["prediction"])[0]
    corrs = df.groupby("era").apply(_score)
    return corrs.mean() 

def sharpe(df: pd.DataFrame) -> np.float32:
    """
    Calculate the Sharpe ratio by using grouped per-era data
    :param df: A Pandas DataFrame containing the columns "era", "target" and "prediction"
    :return: The Sharpe ratio for your predictions.
    """
    def _score(sub_df: pd.DataFrame) -> np.float32:
        """ Calculate Spearman correlation for Pandas' apply method """
        return spearmanr(sub_df["target"],  sub_df["prediction"])[0]
    corrs = df.groupby("era").apply(_score)
    return corrs.mean() / corrs.std()
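
To make the two metrics concrete, the sketch below recomputes them on a tiny hand-made frame. The data and the helper name `per_era_spearman` are invented for illustration; the helper loops over the groups explicitly instead of using `groupby(...).apply`, but it computes the same per-era Spearman correlations.

```python
import pandas as pd
from scipy.stats import spearmanr


def per_era_spearman(df):
    """Spearman correlation of target vs. prediction within each era."""
    return pd.Series({
        era: spearmanr(group["target"], group["prediction"])[0]
        for era, group in df.groupby("era")
    })


toy = pd.DataFrame({
    "era": ["era1"] * 4 + ["era2"] * 4,
    "target": [0, 1, 2, 3, 0, 1, 2, 3],
    # era1 is ranked perfectly; era2 has its top two predictions swapped
    "prediction": [0.1, 0.2, 0.3, 0.4, 0.1, 0.2, 0.4, 0.3],
})

era_corrs = per_era_spearman(toy)
mean_corr = era_corrs.mean()                 # the quantity corr() returns
sharpe_ratio = mean_corr / era_corrs.std()   # the quantity sharpe() returns
print(era_corrs)
print(mean_corr, sharpe_ratio)
```

Here the per-era correlations are 1.0 and 0.8, so the mean is 0.9. `Series.std` uses the sample standard deviation (`ddof=1`) by default, so the ratio is undefined when only one era is scored.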

# Era "Time Series" Cross Validation

The goal of this section is to set up a "group time series" approach in which we specify a set of eras for training and hold out the era or eras that immediately follow for validation. Because the labeled validation data was merged into the training data above, some splits train on segments of the validation set.

There are a few ways to do this; I want to write a class that can take the "eras to test on" as input and return CV folds as output. Perhaps a future refinement would include an argument for the number of eras to validate on. That may be unnecessary, given that the "real" task is testing on a single era. Or four eras?

## Usage

1. Initialize the class with the eras column: `cv = EraCV(training_data.era)`
2. Get splits: `train, valid = cv.get_splits(valid_start = 80, valid_n_eras = 4, train_n_eras = None)`

The `valid_start` argument identifies the first validation era; it takes an integer value (for example, `80` for `era80`). `valid_n_eras` is the number of eras to include in the validation set, starting at `valid_start`. `train_n_eras` is the number of eras to include in the training set: the `train_n_eras` eras immediately before `valid_start` are used. If `train_n_eras` is omitted (or `None`), all eras before `valid_start` are included. The method returns two arrays of positional row indices, one for training and one for validation.

A single instance of this class can be used in a loop to generate multiple train/test splits. Assuming you want to keep the number of train and test eras constant, you can just iterate over a list of validation starting eras.

Features such as checking if a given validation era actually exists have not yet been implemented.
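
One way the missing existence check could look is sketched below. This is not part of the class in this notebook: the function name `check_valid_window` and the toy era list are hypothetical, and the function works on a plain, ordered list of integer era numbers.

```python
def check_valid_window(unique_eras, valid_start, valid_n_eras):
    """Return the validation eras, or raise ValueError if they are unavailable.

    Hypothetical helper: unique_eras is an ordered list of integer era numbers.
    """
    if valid_start not in unique_eras:
        raise ValueError(f"era {valid_start} is not present in the data")
    start = unique_eras.index(valid_start)
    window = unique_eras[start:start + valid_n_eras]
    if len(window) < valid_n_eras:
        raise ValueError(
            f"only {len(window)} eras available from era {valid_start}, "
            f"but {valid_n_eras} were requested"
        )
    return window


toy_unique_eras = [1, 2, 3, 4, 5, 10, 11]   # note the gap between 5 and 10
print(check_valid_window(toy_unique_eras, valid_start=4, valid_n_eras=2))

# Both of these requests are rejected: era 7 does not exist, and only
# two eras are available from era 10 onward.
for bad_start, n in [(7, 1), (10, 3)]:
    try:
        check_valid_window(toy_unique_eras, bad_start, n)
    except ValueError as err:
        print("rejected:", err)
```

The window is positional, so with gaps in the era numbering a request for two eras starting at 5 yields eras 5 and 10.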

In [5]:
class EraCV:
    """Select validation eras and train on previous eras

    provides train/test indices to split data in train/test splits. In
    each split, one or more eras are used as a validation set while the
    specified number of immediately preceding eras are used as a
    training set.
    """

    def __init__(self, eras):
        # eras: pandas Series of era labels such as "era12", one per row
        self.eras = eras
        # Unique era numbers, in order of first appearance
        self.unique_eras = self._era_to_int(eras.unique())
        # Era number for every row
        self.eras_int = self._era_to_int(eras)
    
    def _era_to_int(self, eras):
        # Strip the "era" prefix and convert the remainder to an int
        return [int(era[3:]) for era in eras]

    def get_valid_indices(self, valid_start, valid_n_eras):
        # Take valid_n_eras consecutive entries of unique_eras, beginning
        # with valid_start (raises ValueError if valid_start is absent)
        start = self.unique_eras.index(valid_start)
        self.valid_eras = self.unique_eras[start:start + valid_n_eras]
        self.valid_indices = np.where(np.isin(self.eras_int, self.valid_eras))

    def get_train_indices(self, valid_start:int, train_n_eras:int):
        # A slice of [-0:] keeps the whole list, so None (mapped to 0)
        # selects every era before valid_start
        train_n_eras = 0 if (train_n_eras is None) else train_n_eras
        self.train_eras = [era for era in self.unique_eras
                           if era < valid_start][-train_n_eras:]
        self.train_indices = np.where(np.isin(self.eras_int, self.train_eras))

    def get_splits(self, valid_start:int, valid_n_eras:int,
                   train_n_eras:int = None):
        self.get_valid_indices(valid_start, valid_n_eras)
        self.get_train_indices(valid_start, train_n_eras)
        return self.train_indices[0], self.valid_indices[0]

# Linear Regression Demonstration

In [6]:
corrs = []
sharpes = []
era_split = EraCV(eras = training_data.era)
X, y, era = training_data[feature_cols], training_data.target, training_data.era
for valid_era in tqdm(range(200,209)):
    train, test = era_split.get_splits(valid_start = valid_era,
                           valid_n_eras = 4,
                           train_n_eras = 50)
    model = sklearn.linear_model.LinearRegression(n_jobs = -1)
    model.fit(X.iloc[train], y.iloc[train])
    val_preds = model.predict(X.iloc[test])
    eval_df = pd.DataFrame({'prediction':val_preds,
                        'target':y.iloc[test],
                        'era':era.iloc[test]}).reset_index()
    corrs.append(corr(eval_df))
    sharpes.append(sharpe(eval_df))

print(corrs)
print(sharpes)

100%|██████████| 9/9 [01:11<00:00,  7.99s/it]

[0.007221323160113864, 0.019747589012373216, 0.014591688589274144, -0.002209343140336509, 0.003436434304452429, 0.0017170660155478596, 0.009927853627001944, 0.0175896213761404, 0.012326817375057541]
[0.2880704062851405, 0.8254960106953122, 0.5365416338781753, -0.05678378916415171, 0.0903576886278792, 0.05125240876655709, 0.2972006561378662, 1.3761530359055556, 0.4027527038637555]



