# Testing for Distribution Shift

This short exploration was inspired by the discussion [here](http://www.d2l.ai/chapter_multilayer-perceptrons/environment.html#correction-of-distribution-shift), and by my own curiosity about the gap between the training and live eras. The idea is simple: label the training rows "0" and the test/live rows "1", then see whether logistic regression can tell the two groups apart from the features alone. If it can, that is some evidence of distribution shift, and working to correct it could be valuable.
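
To make the idea concrete before touching the real data, here is a minimal sketch on synthetic arrays (not the Numerai data). The helper name `domain_classifier_auc`, the array shapes, and the choice of cross-validated ROC AUC instead of plain accuracy are all my own assumptions for illustration. An AUC near 0.5 means the classifier cannot separate the two samples; an AUC well above 0.5 suggests the feature distributions differ.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score


def domain_classifier_auc(source, target, cv=5):
    """Cross-validated ROC AUC for telling `source` rows (0) from `target` rows (1)."""
    features = np.vstack([source, target])
    labels = np.concatenate([
        np.zeros(len(source), dtype=int),  # "0" = source / training-like rows
        np.ones(len(target), dtype=int),   # "1" = target / live-like rows
    ])
    model = LogisticRegression(max_iter=1000)
    return cross_val_score(model, features, labels, cv=cv, scoring="roc_auc").mean()


rng = np.random.default_rng(0)
same_a = rng.normal(size=(500, 5))            # reference sample
same_b = rng.normal(size=(500, 5))            # same distribution as same_a
shifted = rng.normal(loc=1.0, size=(500, 5))  # every feature mean moved by 1.0

auc_same = domain_classifier_auc(same_a, same_b)
auc_shifted = domain_classifier_auc(same_a, shifted)
print(f"no shift: AUC = {auc_same:.3f}")
print(f"shifted:  AUC = {auc_shifted:.3f}")
```

AUC is used here only because it is insensitive to class balance; the cells below stick with accuracy via `mod.score`.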

This exploration could also show whether eras that are numerically closer are more similar to each other than eras that are further apart.
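
One way to check the era-distance question would be to repeat the same classifier test for several eras against a fixed reference era and compare the scores. The sketch below does this on made-up data in which the feature means drift steadily from era to era; the era names, the drift size, and the helper `auc_vs_reference` are hypothetical and only illustrate the loop, not the actual Numerai eras.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

rng = np.random.default_rng(42)
feature_names = [f"f{i}" for i in range(4)]

# Build five synthetic eras whose feature means drift by 0.3 per era
frames = []
for k in range(1, 6):
    block = pd.DataFrame(rng.normal(loc=0.3 * k, size=(400, 4)), columns=feature_names)
    block["era"] = f"era{k}"
    frames.append(block)
eras = pd.concat(frames, ignore_index=True)

# Treat the last era as the reference (playing the role of the live data)
reference = eras.loc[eras["era"] == "era5", feature_names]


def auc_vs_reference(era_name):
    """Cross-validated ROC AUC for separating one era (0) from the reference era (1)."""
    candidate = eras.loc[eras["era"] == era_name, feature_names]
    features = pd.concat([candidate, reference], ignore_index=True)
    labels = np.r_[np.zeros(len(candidate), dtype=int), np.ones(len(reference), dtype=int)]
    model = LogisticRegression(max_iter=1000)
    return cross_val_score(model, features, labels, cv=5, scoring="roc_auc").mean()


auc_by_era = {name: auc_vs_reference(name) for name in ["era1", "era2", "era3", "era4"]}
for name, auc in auc_by_era.items():
    print(f"{name} vs era5: AUC = {auc:.3f}")
```

If closer eras really are more alike, the scores should fall toward 0.5 as the compared era approaches the reference era, which is what this synthetic drift is built to produce.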

In [1]:
%cd /home/djliden91/git/projects/numerai/
from src.utils import setup
from pathlib import Path
import pandas as pd
from tqdm import tqdm

/home/djliden91/git/projects/numerai


In [2]:
setup.credential()
napi = setup.init_numerapi()
setup.download_current(napi=napi)

# Define paths
current_round = napi.get_current_round()  # named to avoid shadowing the built-in round()
#current_file = Path(data_dir/f"numerai_dataset_{current_round}.zip")
train = Path(f"./input/numerai_dataset_{current_round}/numerai_training_data.csv")
tourn = Path(f"./input/numerai_dataset_{current_round}/numerai_tournament_data.csv")
processed = Path('./input/training_processed.csv')
output = Path("./output/")

# process data
training_data, feature_cols, target_cols = setup.process_current(processed, train, tourn)

Loaded Numerai Public Key into Global Environment!
Loaded Numerai Secret Key into Global Environment!
The dataset has already been downloaded.
You can re-download it with refresh = True
Loading the processed training data from file



In [3]:
training_data['target'] = 0
training_data['target'].head()

0    0
1    0
2    0
3    0
4    0
Name: target, dtype: int64

In [5]:
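live = Path('./input/live.csv')
if live.exists():
    # Reuse the cached live rows from an earlier run
    live_df = pd.read_csv(live)
else:
    # Stream the large tournament file in chunks, keeping only the live rows
    # (chunksize must be an integer, not the float 1e6)
    tourn_iter_csv = pd.read_csv(tourn, iterator=True, chunksize=1_000_000)
    live_df = pd.concat([chunk[chunk['data_type'] == 'live']
                         for chunk in tqdm(tourn_iter_csv)])
    tourn_iter_csv.close()
live_df['target'] = 1  # live rows are labeled "1"; training rows are "0"
live_df.to_csv(live, index=False)  # cache for later runs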

In [70]:
train_sub = training_data[training_data.era=='era211']

In [71]:
data = pd.concat((train_sub, live_df))
data.reset_index(drop=True, inplace=True)
X,y = data[feature_cols], data['target']

In [72]:
from sklearn.model_selection import StratifiedShuffleSplit
strat = StratifiedShuffleSplit(n_splits=1, test_size=0.25, random_state=11235)

# n_splits=1, so take the single stratified train/test split directly
train_idx, test_idx = next(strat.split(X, y))
X_train, X_test = X.iloc[train_idx], X.iloc[test_idx]
y_train, y_test = y.iloc[train_idx], y.iloc[test_idx]

In [73]:
from sklearn.linear_model import LogisticRegression
mod = LogisticRegression(n_jobs=-1, max_iter=1000)
mod.fit(X_train, y_train)

LogisticRegression(max_iter=1000, n_jobs=-1)

In [74]:
mod.score(X_test, y_test)

0.3976674191121144

In [69]:
sum(y_test==1)/len(y_test)

0.5112951807228916

In [77]:
mod.score(X_train, y_train)

0.5492975413948821
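
Raw accuracies like these are hard to judge without knowing what chance performance looks like. One option is a permutation test: refit the classifier many times with the 0/1 labels shuffled and see where the real score falls among the shuffled ones. Below is a minimal sketch using scikit-learn's `permutation_test_score` on synthetic data with a deliberate shift; the arrays, the fold count, and the number of permutations are assumptions chosen for illustration, and none of the numbers it prints relate to the results above.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold, permutation_test_score

rng = np.random.default_rng(7)
# Synthetic stand-ins for the "train" (0) and "live" (1) feature matrices
train_like = rng.normal(loc=0.0, size=(300, 3))
live_like = rng.normal(loc=1.0, size=(300, 3))  # deliberately shifted

X_demo = np.vstack([train_like, live_like])
y_demo = np.r_[np.zeros(300, dtype=int), np.ones(300, dtype=int)]

cv = StratifiedKFold(n_splits=3, shuffle=True, random_state=0)
score, perm_scores, p_value = permutation_test_score(
    LogisticRegression(max_iter=1000),
    X_demo,
    y_demo,
    cv=cv,
    scoring="accuracy",
    n_permutations=50,  # small on purpose so the sketch runs quickly
    random_state=0,
)
print(f"accuracy = {score:.3f}")
print(f"mean accuracy with shuffled labels = {perm_scores.mean():.3f}")
print(f"p-value = {p_value:.3f}")
```

A small p-value says the classifier separates the two groups better than it does under shuffled labels, which is the evidence of shift this notebook is looking for. On the real data, `X` and `y` from the cells above would take the place of `X_demo` and `y_demo`.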