# Introduction

Previous iterations of this project attempted to reduce everything to a single command-line script that could read a configuration file, run the corresponding model or models, and output the results.

This notebook takes a different approach. It loads and uses some of the utilities (for loading data, making submissions, computing metrics, and handling configurations), but the training loop is defined fully in the notebook. This allows for faster iteration and more transparent model comparison. Models that are reused consistently can be repackaged to run from the command line if that proves beneficial.

## Imports and Setup

In [29]:
import os
from pathlib import Path

# Run from the repository root (assumed to be named 'numerai') so that
# the `src` package and the relative input/output paths resolve.
if Path().resolve().name != 'numerai':
    os.chdir(Path().resolve().parent)

!pip install git+https://github.com/djliden/config-nest.git

import gc
import time
from importlib import import_module

import confignest.confignest
import numpy as np
import pandas as pd
from tqdm import tqdm

import src.utils.setup
import src.utils.cross_val
import src.utils.eval
import src.utils.metrics

np.random.seed(623)  # fix the NumPy seed for reproducibility

Collecting git+https://github.com/djliden/config-nest.git
  Cloning https://github.com/djliden/config-nest.git to /tmp/pip-req-build-qdwvjz3y
  Running command git clone -q https://github.com/djliden/config-nest.git /tmp/pip-req-build-qdwvjz3y


## NumerAPI setup and Data Download

In [16]:
src.utils.setup.credential()
napi = src.utils.setup.init_numerapi()

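# Named current_round rather than round to avoid shadowing the built-in round().
current_round = napi.get_current_round()
train = Path(f"./input/numerai_dataset_{current_round}/numerai_training_data.csv")
tourn = Path(f"./input/numerai_dataset_{current_round}/numerai_tournament_data.csv")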
processed = Path('./input/training_processed.csv')
processed_pkl = Path('./input/training_processed.pkl')
output = Path("./output/")

src.utils.setup.download_current(napi=napi)
training_data, feature_cols, target_cols = src.utils.setup.process_current(
    processed, processed_pkl, train, tourn
)

Loaded Numerai Public Key into Global Environment!
Loaded Numerai Secret Key into Global Environment!


input/numerai_dataset_258.zip: 100%|█████████▉| 398M/398M [01:15<00:00, 6.49MB/s]
2021-04-06 14:56:16,038 INFO numerapi.base_api: unzipping file...

Loading the pickled training data from file

## Initial Configuration Setup

In [15]:
default_config = Path("./src/config/default_config.yaml")
cfg = confignest.confignest.Config(default_config)
cfg

Config Object with Keys:
CV:
  GAP: 0
  TRAIN_START: 0
  TRAIN_STOP: null
  VAL_END: 210
  VAL_N_ERAS: 4
  VAL_START: 206
DATA:
  REFRESH: false
  SAVE_PROCESSED_TRAIN: true
EVAL:
  CHUNK_SIZE: 1000000
  SAVE_PREDS: true
  SUBMIT_PREDS: false
SYSTEM:
  DEBUG: false

# Model Definition
Defining a model involves two components: the "default" model configuration and the definition of the model itself. In this case, we'll look at a Lasso model, to continue exploring the unreasonable effectiveness of these models.

In [36]:
# Model Configuration
mod_cfg = {'MODEL': {
    'mod': 'Lasso',
    'alpha': .0005
}}
cfg.update_config(mod_cfg)
cfg.MODEL

Config Object with Keys:
alpha: 0.0005
mod: Lasso
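
The `update_config` call above adds a new `MODEL` section while leaving the existing sections in place. A minimal sketch of that kind of nested merge follows, assuming a recursive dictionary update. `deep_update` is a hypothetical helper written for illustration; it is not the confignest implementation, and the dictionary only reproduces a subset of the default configuration printed earlier.

```python
from copy import deepcopy


def deep_update(base: dict, updates: dict) -> dict:
    """Return a copy of `base` with `updates` merged in recursively.

    Hypothetical helper for illustration; not the confignest implementation.
    """
    merged = deepcopy(base)
    for key, value in updates.items():
        if isinstance(value, dict) and isinstance(merged.get(key), dict):
            # Both sides are mappings: merge key by key instead of replacing.
            merged[key] = deep_update(merged[key], value)
        else:
            # New key or non-mapping value: overwrite.
            merged[key] = deepcopy(value)
    return merged


# Subset of the default configuration shown earlier in this notebook.
defaults = {
    "CV": {"GAP": 0, "TRAIN_START": 0, "TRAIN_STOP": None, "VAL_START": 206},
    "SYSTEM": {"DEBUG": False},
}

# Adding a whole new section keeps the existing ones.
merged = deep_update(defaults, {"MODEL": {"mod": "Lasso", "alpha": 0.0005}})
# A partial update to an existing section only touches the named key.
merged = deep_update(merged, {"CV": {"TRAIN_STOP": 120}})

print(merged["MODEL"])
print(merged["CV"])
```

Returning a copy rather than mutating `defaults` keeps the default configuration available for comparison across experiments.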

In [47]:
# Model Definition
import sklearn.linear_model
mod = getattr(sklearn.linear_model, cfg.MODEL.mod)(alpha=cfg.MODEL.alpha)
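
The `getattr` call above turns a class name stored in the configuration into an instance. A misspelled name surfaces as a bare `AttributeError`, so a small wrapper can make the failure easier to read. The sketch below is an assumption about how one might do this: `build_from_name` is a hypothetical helper, and the standard-library `statistics` module stands in for `sklearn.linear_model` so the example runs without extra dependencies.

```python
import statistics


def build_from_name(module, name: str, **params):
    """Look up `name` in `module` and call it with `params`.

    Hypothetical helper: raises a readable error on a misspelled name.
    """
    try:
        factory = getattr(module, name)
    except AttributeError:
        raise ValueError(f"{module.__name__} has no attribute {name!r}") from None
    if not callable(factory):
        raise TypeError(f"{module.__name__}.{name} is not callable")
    return factory(**params)


# Stand-in for getattr(sklearn.linear_model, 'Lasso')(alpha=0.0005):
# a class is looked up by name and built from keyword parameters.
dist = build_from_name(statistics, "NormalDist", mu=0.0, sigma=1.0)
print(type(dist).__name__)

# A typo now fails with a message that names the module and the attribute.
try:
    build_from_name(statistics, "NormalDistt")
except ValueError as err:
    print(err)
```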

# Training Loop
Similarly, the training loop can have its own configuration, though depending on the experiment it may be more convenient to keep these settings outside the configuration object.

In [43]:
cv_config = {'CV': {
    'TRAIN_STOP': 120
}}
cfg.update_config(cv_config)
cfg.CV

Config Object with Keys:
GAP: 0
TRAIN_START: 0
TRAIN_STOP: 120
VAL_END: 210
VAL_N_ERAS: 4
VAL_START: 206
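
The `CV` keys above describe an era-based split: a training range, a validation window, and an optional gap between them. The sketch below shows one way such keys could translate into row positions. The semantics are assumptions made for illustration (half-open era ranges, with the gap trimming the end of the training range); `era_split_indices` is a hypothetical function and is not taken from `src.utils.cross_val.EraCV`.

```python
def era_split_indices(eras, train_start, train_stop, valid_start,
                      valid_n_eras, gap=0):
    """Return (train_idx, valid_idx) row positions for an era-based split.

    Assumed semantics (hypothetical, not taken from EraCV):
      training eras:   train_start <= era < train_stop, and era < valid_start - gap
      validation eras: valid_start <= era < valid_start + valid_n_eras
    """
    valid_stop = valid_start + valid_n_eras
    if train_stop is None:
        # No explicit stop: train on everything before validation.
        train_stop = valid_start
    # Never let training eras run into the gap before validation.
    train_stop = min(train_stop, valid_start - gap)
    train_idx = [i for i, e in enumerate(eras) if train_start <= e < train_stop]
    valid_idx = [i for i, e in enumerate(eras) if valid_start <= e < valid_stop]
    return train_idx, valid_idx


# Toy era column: three rows per era, eras 0 through 9.
eras = [e for e in range(10) for _ in range(3)]
train_idx, valid_idx = era_split_indices(
    eras, train_start=0, train_stop=None, valid_start=6, valid_n_eras=2, gap=1
)
print(sorted({eras[i] for i in train_idx}))  # eras used for training
print(sorted({eras[i] for i in valid_idx}))  # eras used for validation
```

Splitting by era rather than by row keeps all rows from one era on the same side of the split, which matters when rows within an era are correlated.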

In [62]:
# Sweep
from sklearn.model_selection import ParameterGrid
param_grid = {
    'alpha': np.linspace(.0001, .001, 10),
    'mod': ['Lasso', 'Ridge'] 
}
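
`ParameterGrid` expands `param_grid` into every combination of its values, 20 in this case. The standard-library sketch below mimics that expansion under the assumption that keys are sorted and the last key varies fastest, which is consistent with the order of the results printed in this notebook. `expand_grid` is a hypothetical stand-in, not scikit-learn code, and the alphas are written as plain divisions instead of `np.linspace`.

```python
from itertools import product


def expand_grid(grid: dict):
    """Yield one dict per combination; keys sorted, last key varying fastest."""
    keys = sorted(grid)
    for values in product(*(grid[k] for k in keys)):
        yield dict(zip(keys, values))


# Same shape as param_grid above: ten alphas crossed with two model names.
grid = {
    "alpha": [i / 10000 for i in range(1, 11)],
    "mod": ["Lasso", "Ridge"],
}
combos = list(expand_grid(grid))
print(len(combos))
for combo in combos[:3]:
    print(combo)
```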

In [108]:
era_split = src.utils.cross_val.EraCV(eras=training_data.era)

X, y, era = training_data[feature_cols], training_data.target, training_data.era

for params in ParameterGrid(param_grid):
    corrs = []
    sharpes = []
    cfg.MODEL.alpha = params['alpha']
    cfg.MODEL.mod = params['mod']
    mod = getattr(sklearn.linear_model, cfg.MODEL.mod)(alpha=cfg.MODEL.alpha)

    # Only one validation window (starting at era 208) is evaluated here;
    # widen the range to average the metrics over several windows.
    for valid_era in tqdm(range(208, 209)):
        # Note: valid_n_eras and train_stop are hard-coded rather than read from cfg.CV.
        # train_idx/val_idx avoid shadowing the `train` path defined earlier.
        train_idx, val_idx = era_split.get_splits(valid_start=valid_era,
                                                  valid_n_eras=4,
                                                  train_start=cfg.CV.TRAIN_START,
                                                  train_stop=valid_era - 150)
        mod.fit(X.iloc[train_idx], y.iloc[train_idx])
        val_preds = mod.predict(X.iloc[val_idx])
        eval_df = pd.DataFrame({'prediction': val_preds,
                                'target': y.iloc[val_idx],
                                'era': era.iloc[val_idx]}).reset_index()
        corrs.append(src.utils.metrics.val_corr(eval_df))
        sharpes.append(src.utils.metrics.sharpe(eval_df))

    # Report the model and its hyperparameters; the LinearRegression and
    # ElasticNet checks only matter if those models are added to param_grid.
    print(f'\nmodel: {mod.__class__.__name__}')
    if mod.__class__.__name__ != "LinearRegression":
        print(f'alpha: {mod.alpha}')
    if mod.__class__.__name__ == "ElasticNet":
        print(f'L1 Ratio: {mod.l1_ratio}')
    print(f'mean validation corr: {np.array(corrs).mean()}')
    print(f'mean validation sharpe: {np.array(sharpes).mean()}')

100%|██████████| 1/1 [00:06<00:00,  6.25s/it]
  0%|          | 0/1 [00:00<?, ?it/s]


model: Lasso
alpha: 0.0001
mean validation corr: 0.003767937955538562
mean validation sharpe: 0.08156784526645254


100%|██████████| 1/1 [00:01<00:00,  1.74s/it]
  0%|          | 0/1 [00:00<?, ?it/s]


model: Ridge
alpha: 0.0001
mean validation corr: 0.005278566696775655
mean validation sharpe: 0.13240229973041368


100%|██████████| 1/1 [00:04<00:00,  4.78s/it]
  0%|          | 0/1 [00:00<?, ?it/s]


model: Lasso
alpha: 0.00019999999999999998
mean validation corr: 0.003857938886777868
mean validation sharpe: 0.07693783097693367


100%|██████████| 1/1 [00:01<00:00,  1.68s/it]
  0%|          | 0/1 [00:00<?, ?it/s]


model: Ridge
alpha: 0.00019999999999999998
mean validation corr: 0.005278566696775655
mean validation sharpe: 0.13240229973041368


100%|██████████| 1/1 [00:04<00:00,  4.41s/it]
  0%|          | 0/1 [00:00<?, ?it/s]


model: Lasso
alpha: 0.0003
mean validation corr: 0.006476573489574136
mean validation sharpe: 0.13253236270839683


100%|██████████| 1/1 [00:01<00:00,  1.67s/it]
  0%|          | 0/1 [00:00<?, ?it/s]


model: Ridge
alpha: 0.0003
mean validation corr: 0.005278566696775655
mean validation sharpe: 0.13240229973041368


100%|██████████| 1/1 [00:04<00:00,  4.42s/it]
  0%|          | 0/1 [00:00<?, ?it/s]


model: Lasso
alpha: 0.00039999999999999996
mean validation corr: 0.00824856587997525
mean validation sharpe: 0.17570614001928694


100%|██████████| 1/1 [00:01<00:00,  1.79s/it]
  0%|          | 0/1 [00:00<?, ?it/s]


model: Ridge
alpha: 0.00039999999999999996
mean validation corr: 0.005278566696775655
mean validation sharpe: 0.13240229973041368


100%|██████████| 1/1 [00:04<00:00,  4.44s/it]
  0%|          | 0/1 [00:00<?, ?it/s]


model: Lasso
alpha: 0.0005
mean validation corr: 0.009094310924760311
mean validation sharpe: 0.19224294034779724


100%|██████████| 1/1 [00:01<00:00,  1.73s/it]
  0%|          | 0/1 [00:00<?, ?it/s]


model: Ridge
alpha: 0.0005
mean validation corr: 0.005278566696775655
mean validation sharpe: 0.13240229973041368


100%|██████████| 1/1 [00:04<00:00,  4.10s/it]
  0%|          | 0/1 [00:00<?, ?it/s]


model: Lasso
alpha: 0.0006000000000000001
mean validation corr: 0.008701314275919737
mean validation sharpe: 0.18406534489031698


100%|██████████| 1/1 [00:01<00:00,  1.76s/it]
  0%|          | 0/1 [00:00<?, ?it/s]


model: Ridge
alpha: 0.0006000000000000001
mean validation corr: 0.005278610014463591
mean validation sharpe: 0.1324032907428277


100%|██████████| 1/1 [00:03<00:00,  3.99s/it]
  0%|          | 0/1 [00:00<?, ?it/s]


model: Lasso
alpha: 0.0007
mean validation corr: 0.008115213283697102
mean validation sharpe: 0.17353177602904765


100%|██████████| 1/1 [00:01<00:00,  1.73s/it]
  0%|          | 0/1 [00:00<?, ?it/s]


model: Ridge
alpha: 0.0007
mean validation corr: 0.005278566672569067
mean validation sharpe: 0.1324022991766202


100%|██████████| 1/1 [00:04<00:00,  4.07s/it]
  0%|          | 0/1 [00:00<?, ?it/s]


model: Lasso
alpha: 0.0007999999999999999
mean validation corr: 0.006853109283287171
mean validation sharpe: 0.14457702043028617


100%|██████████| 1/1 [00:01<00:00,  1.83s/it]
  0%|          | 0/1 [00:00<?, ?it/s]


model: Ridge
alpha: 0.0007999999999999999
mean validation corr: 0.005278566672569067
mean validation sharpe: 0.1324022991766202


100%|██████████| 1/1 [00:04<00:00,  4.05s/it]
  0%|          | 0/1 [00:00<?, ?it/s]


model: Lasso
alpha: 0.0009
mean validation corr: 0.005624869644309747
mean validation sharpe: 0.116456060346192


100%|██████████| 1/1 [00:01<00:00,  1.82s/it]
  0%|          | 0/1 [00:00<?, ?it/s]


model: Ridge
alpha: 0.0009
mean validation corr: 0.005278566672569067
mean validation sharpe: 0.1324022991766202


100%|██████████| 1/1 [00:04<00:00,  4.34s/it]
  0%|          | 0/1 [00:00<?, ?it/s]


model: Lasso
alpha: 0.001
mean validation corr: 0.0054332948510329
mean validation sharpe: 0.11787188761595681


100%|██████████| 1/1 [00:01<00:00,  1.78s/it]


model: Ridge
alpha: 0.001
mean validation corr: 0.005278582129013011
mean validation sharpe: 0.13240301790409006





In [99]:
output

PosixPath('output')

In [103]:
cfg.LOG.dump_config(path = './output/logs/test.yaml')

MEAN_CORR: 0.005278582129013011
MEAN_SHARPE: 0.13240301790409006
alpha: 0.001
mod: Ridge
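
The dumped log above holds a single parameter combination, which matches the last one printed by the sweep. If the goal is to compare combinations, one option is to collect a record per combination and rank the records afterwards. The sketch below assumes that workflow; the record layout is hypothetical, and the four records are copied from the sweep output earlier in this notebook.

```python
import json

# A few results copied verbatim from the sweep output above.
results = [
    {"mod": "Lasso", "alpha": 0.0004,
     "MEAN_CORR": 0.00824856587997525, "MEAN_SHARPE": 0.17570614001928694},
    {"mod": "Lasso", "alpha": 0.0005,
     "MEAN_CORR": 0.009094310924760311, "MEAN_SHARPE": 0.19224294034779724},
    {"mod": "Lasso", "alpha": 0.0006,
     "MEAN_CORR": 0.008701314275919737, "MEAN_SHARPE": 0.18406534489031698},
    {"mod": "Ridge", "alpha": 0.001,
     "MEAN_CORR": 0.005278582129013011, "MEAN_SHARPE": 0.13240301790409006},
]

# Rank by validation Sharpe, highest first, and keep the top record.
ranked = sorted(results, key=lambda r: r["MEAN_SHARPE"], reverse=True)
best = ranked[0]
print(json.dumps(best, indent=2))
```

In the loop itself, the same idea amounts to appending one such dictionary per parameter combination instead of overwriting a single log entry.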

