<a href="https://colab.research.google.com/github/djliden/numerai/blob/main/notebooks/regressions.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# 1 Introduction
This notebook will walk you through the entire process of making a [numerai](https://numer.ai) submission, from downloading the data to submitting final predictions, all in a Google Colab notebook. In particular, it addresses two challenges:
- handling API keys in a remote environment (Colab)
- parsing the large CSV files, which will exceed Colab's memory and crash the notebook if read all at once.

This notebook will implement two models: a basic tabular neural network using `fastai` and a linear regression model using `scikit-learn`.

## 1.1 Installing and Importing Dependencies
First, we install and import the necessary packages. This cell is currently set *not* to print any output; if you run into any issues and need to check for error messages, comment out the `%%capture` line.

In [1]:
%%capture
# install
!pip install --upgrade python-dotenv numerapi

# import dependencies
import gc
import os
from dotenv import load_dotenv, find_dotenv
from getpass import getpass
import pandas as pd
import numpy as np
import numerapi
from pathlib import Path
from scipy.stats import spearmanr
import sklearn.linear_model
from tqdm import tqdm

## 1.2 Setting Up numerapi
We will use the [numerapi](https://github.com/uuazed/numerapi) package to access the data and make submissions. For this to work, numerapi needs to use your API keys (which can be obtained [here](https://numer.ai/submit)). We will set up two main ways of passing these API keys to a numerapi instance:
1. Read a `.env` file using the `python-dotenv` package. This will require you to upload a `.env` file (which contains your secret key and should *not* be kept under version control). Using this method means you will not have to directly enter your keys each time you use this notebook, though you will need to re-upload the `.env` file.
2. Enter the API keys manually, if you don't have access to, or don't want to mess with, your `.env` file.

If you have a `.env` file, upload it to the default working directory, `/content`, now. In either case, run the cell below to set up the numerapi instance. See [Appendix A](#app_a) for instructions on generating and downloading a `.env` file.

In [2]:
# Load the numerapi credentials from .env, prompting for any that are missing
def credential():
    load_dotenv(find_dotenv())

    # Environment variable name -> human-readable label used in the messages
    keys = {
        "NUMERAI_PUBLIC_KEY": "Public Key",
        "NUMERAI_SECRET_KEY": "Secret Key",
        "NUMERAI_MODEL_ID": "Model ID",
    }
    for env_var, label in keys.items():
        if os.getenv(env_var):
            print(f"Loaded Numerai {label} into Global Environment!")
        else:
            # getpass hides the value as it is typed
            os.environ[env_var] = getpass(
                f"Please enter your Numerai {label}. "
                "You can find it here: https://numer.ai/submit -> ")

credential()
public_key = os.environ.get("NUMERAI_PUBLIC_KEY")
secret_key = os.environ.get("NUMERAI_SECRET_KEY")
model_id = os.environ.get("NUMERAI_MODEL_ID")
napi = numerapi.NumerAPI(verbosity="info", public_id=public_key, secret_key=secret_key)

Loaded Numerai Public Key into Global Environment!
Loaded Numerai Secret Key into Global Environment!
Loaded Numerai Model ID into Global Environment!


You can read up on the functionality of numerapi [here](https://github.com/uuazed/numerapi). You can use it to download the competition data, view other numerai users' public profiles, check submission status, manage your stake, and much more. In this case, we'll only be using it to download competition data and submit predictions.

## 1.3 Downloading Competition Data
In a more structured project, you'll probably want to keep the data in a separate directory from your scripts. You could also link Google Colab to your Google Drive and store the data there, which avoids downloading and processing the data every time. In this case, however, we'll keep everything in `/content` and download the data fresh each time.

In [46]:
napi.download_current_dataset()

./numerai_dataset_254.zip: 394MB [00:35, 21.9MB/s]

'./numerai_dataset_254.zip'

## 1.4 Generating the Training Sample

If you look at the files we downloaded above, you'll see a `numerai_tournament_data.csv` file and a `numerai_training_data.csv` file. The "tournament" file contains many rows with targets which we can use for validation, so let's extract those and combine them with our training set. Note that this cell saves a new `csv` after combining the training and validation data, so we can avoid the time-consuming parsing process if we run this cell again in the same session.
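The chunk-and-filter idea described above can be tried on a tiny in-memory CSV before running it on the real files. This is a minimal sketch: the rows, ids, and the `read_validation_rows` helper are made up for illustration; only the `data_type` column and its `validation` value are taken from the tournament file.

```python
import io
import pandas as pd

# Toy stand-in for numerai_tournament_data.csv (the real file is far larger).
toy_csv = io.StringIO(
    "id,era,data_type,feature_a,target\n"
    "n1,era121,validation,0.25,0.50\n"
    "n2,era121,validation,0.75,0.25\n"
    "n3,era900,test,0.50,\n"
    "n4,era900,test,1.00,\n"
    "n5,era122,validation,0.00,0.75\n"
)

def read_validation_rows(csv_source, chunksize):
    """Read a CSV chunk by chunk, keeping only the rows marked as validation."""
    kept = []
    for chunk in pd.read_csv(csv_source, chunksize=chunksize):
        # Only the filtered rows are retained, so memory use stays bounded
        # by the chunk size plus the rows kept so far.
        kept.append(chunk[chunk["data_type"] == "validation"])
    return pd.concat(kept, ignore_index=True)

val_rows = read_validation_rows(toy_csv, chunksize=2)
print(val_rows[["id", "era", "target"]])
```

Because each chunk is filtered before the next one is read, the full file never has to be held in memory at once.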

In [47]:
# Query the current round once and build both paths from the same directory
data_dir = Path(f'./numerai_dataset_{napi.get_current_round()}')
tourn_file = data_dir / 'numerai_tournament_data.csv'
train_file = data_dir / 'numerai_training_data.csv'
processed_train_file = Path('./training_processed.csv')

if processed_train_file.exists():
    print("Loading the processed training data from file\n")
    training_data = pd.read_csv(processed_train_file)
else:
    # Read the tournament file in chunks and keep only the validation rows
    tourn_iter_csv = pd.read_csv(tourn_file, iterator=True, chunksize=1_000_000)
    val_df = pd.concat([chunk[chunk['data_type'] == 'validation'] for chunk in tqdm(tourn_iter_csv)])
    tourn_iter_csv.close()
    training_data = pd.read_csv(train_file)
    training_data = pd.concat([training_data, val_df])
    training_data.reset_index(drop=True, inplace=True)
    print("Training Dataset Generated! Saving to file ...")
    training_data.to_csv(processed_train_file, index=False)


feature_cols = training_data.columns[training_data.columns.str.startswith('feature')]
target_cols = ['target']

train_idx = training_data.index[training_data.data_type=='train'].tolist()
val_idx = training_data.index[training_data.data_type=='validation'].tolist()

Loading the processed training data from file




# 2 Modeling the Data

In this section, we will define our evaluation metrics; run two different models (a linear regression model from `scikit-learn` and a neural network from `fastai`); and generate submission dataframes from those models.

## 2.1 Evaluation Metrics

In this section, we will define two key evaluation metrics used to assess the performance of models before submitting to the tournament. These metrics are:
- Average Spearman Correlation per era: The sum of each era's Spearman correlation divided by the number of eras.
- Sharpe Ratio: The average correlation per era divided by the standard deviation of the correlations per era.

Both are defined in reasonable detail [here](https://wandb.ai/carlolepelaars/numerai_tutorial/reports/How-to-get-Started-With-Numerai--VmlldzoxODU0NTQ). The methods defined below are modified versions of the methods described in that post.

In [5]:
def corr(df: pd.DataFrame) -> np.float32:
    """
    Calculate the correlation by using grouped per-era data
    :param df: A Pandas DataFrame containing the columns "era", "target" and "prediction"
    :return: The average per-era correlations.
    """
    def _score(sub_df: pd.DataFrame) -> np.float32:
        """ Calculate Spearman correlation for Pandas' apply method """
        return spearmanr(sub_df["target"],  sub_df["prediction"])[0]
    corrs = df.groupby("era").apply(_score)
    return corrs.mean() 

def sharpe(df: pd.DataFrame) -> np.float32:
    """
    Calculate the Sharpe ratio by using grouped per-era data
    :param df: A Pandas DataFrame containing the columns "era", "target" and "prediction"
    :return: The Sharpe ratio for your predictions.
    """
    def _score(sub_df: pd.DataFrame) -> np.float32:
        """ Calculate Spearman correlation for Pandas' apply method """
        return spearmanr(sub_df["target"],  sub_df["prediction"])[0]
    corrs = df.groupby("era").apply(_score)
    return corrs.mean() / corrs.std()
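To make the two metrics concrete, here is a small worked example on invented data. It is a sketch rather than a replacement for the functions above: the `per_era_spearman` helper and the toy values are hypothetical, and the Spearman correlation is computed as the Pearson correlation of ranks (an equivalent definition) so that only `pandas` is needed.

```python
import pandas as pd

def per_era_spearman(df):
    """Spearman correlation per era, computed as the Pearson correlation of ranks."""
    scores = {}
    for era, group in df.groupby("era"):
        scores[era] = group["target"].rank().corr(group["prediction"].rank())
    return pd.Series(scores)

# Three made-up eras of four rows each.
toy = pd.DataFrame({
    "era": ["era1"] * 4 + ["era2"] * 4 + ["era3"] * 4,
    "target": [0.0, 0.25, 0.5, 0.75] * 3,
    "prediction": [0.1, 0.2, 0.3, 0.4,   # same ordering as the target
                   0.4, 0.3, 0.2, 0.1,   # reversed ordering
                   0.2, 0.3, 0.6, 0.9],  # same ordering as the target
})

per_era = per_era_spearman(toy)
mean_corr = per_era.mean()
sharpe_ratio = mean_corr / per_era.std()  # pandas' std is the sample std (ddof=1)
print(per_era)
print(mean_corr, sharpe_ratio)
```

Two eras rank the rows perfectly and one ranks them in reverse, so the per-era correlations are roughly 1, -1, and 1, giving a mean of about 1/3. The Sharpe ratio is much lower than a model with the same mean but steadier per-era scores would earn, which is the point of the metric.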

## 2.2 Cross Validation Setup

In [6]:
from sklearn.model_selection import GroupKFold
train_sample = training_data.iloc[train_idx]
X, y, era = train_sample[feature_cols], train_sample.target, train_sample.era
groups = train_sample.era
group_kfold = GroupKFold(n_splits = 5)
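Passing the era as the group means every row of an era lands in the same fold, so a model is never scored on an era it was trained on. The sketch below illustrates this on toy data; the arrays and era labels are invented, and `GroupKFold` is the only piece taken from the cell above.

```python
import numpy as np
from sklearn.model_selection import GroupKFold

# Toy data: 12 rows spread over 6 made-up eras, two rows per era.
toy_X = np.arange(24).reshape(12, 2)
toy_y = np.linspace(0, 1, 12)
toy_eras = np.repeat(["era1", "era2", "era3", "era4", "era5", "era6"], 2)

folds = []
splitter = GroupKFold(n_splits=3)
for train_rows, test_rows in splitter.split(toy_X, toy_y, groups=toy_eras):
    train_eras = set(toy_eras[train_rows])
    test_eras = set(toy_eras[test_rows])
    folds.append((train_eras, test_eras))
    print(f"held-out eras: {sorted(test_eras)}")
```

In every fold the held-out eras are disjoint from the training eras, and across the folds each era is held out exactly once.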

## 2.3 Linear Regression Model
This model closely follows the tutorial example [here](https://colab.research.google.com/github/numerai/example-scripts/blob/master/making-your-first-submission-on-numerai.ipynb). We will use the `scikit-learn` package, with which we can implement and fit our regression model in just a couple of lines of code.

#### Fitting the Linear Regression Model

In [29]:
%%time
corrs = []
sharpes = []

for train, test in group_kfold.split(X, y, groups=groups):
    model = sklearn.linear_model.LinearRegression(n_jobs = -1)
    model.fit(X.iloc[train], y.iloc[train])
    val_preds = model.predict(X.iloc[test])
    eval_df = pd.DataFrame({'prediction':val_preds,
                        'target':y.iloc[test],
                        'era':groups.iloc[test]}).reset_index()
    corrs.append(corr(eval_df))
    sharpes.append(sharpe(eval_df))

print(corrs)
print(sharpes)

[0.0413988755478284, 0.03116761404394065, 0.03960886988203633, 0.03559675523170117, 0.031243561514435816]
[1.5032811853317405, 0.7154594471580004, 1.8264734992258658, 1.0864696284517816, 1.0273427745585324]
CPU times: user 1min 18s, sys: 6.92 s, total: 1min 25s
Wall time: 48.6 s
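The same cross-validation loop is reused for each family of models that follows. One possible way to factor it out is sketched below on synthetic data. The helper names `era_scores` and `cross_validate_by_era` are hypothetical, the data is randomly generated, and the Spearman correlation is computed as the Pearson correlation of ranks instead of calling `spearmanr`.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import GroupKFold

def era_scores(eval_df):
    """Per-era Spearman correlation (Pearson correlation of ranks)."""
    return pd.Series({
        era: g["target"].rank().corr(g["prediction"].rank())
        for era, g in eval_df.groupby("era")
    })

def cross_validate_by_era(model, X, y, groups, n_splits=5):
    """Fit `model` on each era-grouped fold; return per-fold mean corr and Sharpe."""
    corrs, sharpes = [], []
    for train, test in GroupKFold(n_splits=n_splits).split(X, y, groups=groups):
        model.fit(X.iloc[train], y.iloc[train])
        eval_df = pd.DataFrame({
            "prediction": model.predict(X.iloc[test]),
            "target": y.iloc[test].to_numpy(),
            "era": groups.iloc[test].to_numpy(),
        })
        scores = era_scores(eval_df)
        corrs.append(scores.mean())
        sharpes.append(scores.mean() / scores.std())
    return corrs, sharpes

# Synthetic stand-in data: 10 made-up eras of 50 rows each,
# with a target partly driven by feature_0 and partly by noise.
rng = np.random.default_rng(0)
toy_X = pd.DataFrame(rng.random((500, 3)),
                     columns=["feature_0", "feature_1", "feature_2"])
toy_y = pd.Series(0.5 * toy_X["feature_0"] + 0.5 * rng.random(500), name="target")
toy_groups = pd.Series(np.repeat([f"era{i}" for i in range(10)], 50), name="era")

cv_corrs, cv_sharpes = cross_validate_by_era(
    LinearRegression(), toy_X, toy_y, toy_groups)
print(np.mean(cv_corrs), np.mean(cv_sharpes))
```

With a helper like this, comparing estimators reduces to calling the function once per model and printing the returned lists.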


In [34]:
models = [
          # sklearn.linear_model.LinearRegression(n_jobs = -1),
          # sklearn.linear_model.Lasso(alpha=0.00006), # good! Takes pretty long time.
          # sklearn.linear_model.Lasso(alpha=0.00001), # Takes a long time, worse than .00005
          # sklearn.linear_model.Lasso(alpha=0.000005), # Takes a really long time, worse than .00001
          # sklearn.linear_model.Lasso(alpha=0.01), # fails
          # sklearn.linear_model.Ridge(alpha=0.0001),
          # sklearn.linear_model.Ridge(alpha=0.01),
          # sklearn.linear_model.Ridge(alpha = 0.1),
          sklearn.linear_model.ElasticNet(alpha=.0001, l1_ratio=0.5),  # Good!
          # sklearn.linear_model.ElasticNet(alpha=.0002, l1_ratio=0.5),
          # sklearn.linear_model.ElasticNet(alpha=.0001, l1_ratio=0.4),
          sklearn.linear_model.ElasticNet(alpha=.0001, l1_ratio=0.6), # Good! Best?
          sklearn.linear_model.ElasticNet(alpha=.0001, l1_ratio=0.7) # Good! Best?

          # sklearn.linear_model.ElasticNet(alpha=.00005, l1_ratio=0.5),
          # sklearn.linear_model.ElasticNet(alpha=.0005, l1_ratio=0.5),
          # sklearn.linear_model.ElasticNet(alpha=.00001, l1_ratio=0.25) # Takes very long time
          # sklearn.linear_model.ElasticNet(alpha=.01, l1_ratio=0.5), # fails
          # sklearn.linear_model.ElasticNet(alpha=.001, l1_ratio=0.25),
          # sklearn.linear_model.ElasticNet(alpha=.001, l1_ratio=0.75) # bad result
          ]

In [36]:
for model in models:
    corrs = []
    sharpes = []
    for train, test in tqdm(group_kfold.split(X, y, groups=groups), total=5):
        model.fit(X.iloc[train], y.iloc[train])
        val_preds = model.predict(X.iloc[test])
        eval_df = pd.DataFrame({'prediction':val_preds,
                            'target':y.iloc[test],
                            'era':groups.iloc[test]}).reset_index()
        corrs.append(corr(eval_df))
        sharpes.append(sharpe(eval_df))
    print(f'\nmodel: {model.__class__.__name__}')
    if model.__class__.__name__!="LinearRegression":
        print(f'alpha: {model.alpha}')
    if model.__class__.__name__=="ElasticNet":
        print(f'L1 Ratio: {model.l1_ratio}')
    print(f'validation correlations: {corrs}, mean: {np.array(corrs).mean()}')
    print(f'validation sharpes: {sharpes}, mean: {np.array(sharpes).mean()}')



100%|██████████| 5/5 [05:17<00:00, 63.49s/it]

model: Lasso
alpha: 6e-05
validation correlations: [0.04332792912546912, 0.031434612619569154, 0.047669425200041284, 0.037039767734678784, 0.033323948348046604], mean: 0.03855913660556099
validation sharpes: [1.3982035171750715, 0.663410919086826, 2.0610600121461347, 0.9854986745433042, 0.9674671028213477], mean: 1.215128045154537





100%|██████████| 5/5 [03:28<00:00, 41.75s/it]

model: ElasticNet
alpha: 0.0001
L1 Ratio: 0.5
validation correlations: [0.04347474676845641, 0.03146511092526201, 0.047107530862073116, 0.03730122819900296, 0.03341224276698343], mean: 0.03855217190435559
validation sharpes: [1.416323394305138, 0.6677888412291337, 2.0469916387510954, 1.0044445499519083, 0.9793182577424634], mean: 1.2229733363959476





100%|██████████| 5/5 [04:27<00:00, 53.56s/it]

model: ElasticNet
alpha: 0.0001
L1 Ratio: 0.6
validation correlations: [0.04333048511430412, 0.0314392864379456, 0.04768023730348006, 0.037036624347257956, 0.033322480554536356], mean: 0.03856182275150481
validation sharpes: [1.398431288545246, 0.6634818817752716, 2.061180211216678, 0.985120477558037, 0.9673392231760424], mean: 1.2151106164542549





100%|██████████| 5/5 [04:35<00:00, 55.07s/it]

model: ElasticNet
alpha: 0.0001
L1 Ratio: 0.7
validation correlations: [0.04306784559555605, 0.031391661906845475, 0.047987359704428266, 0.03667405756155036, 0.033069786367601856], mean: 0.038438142227196395
validation sharpes: [1.375155679465384, 0.6616136201502878, 2.0557192796959773, 0.965906404669539, 0.9535450170076238], mean: 1.2023880001977623






In [38]:
%%script false --no-raise-error
# Takes a very long time
import sklearn.neighbors
models = [        
          sklearn.neighbors.KNeighborsRegressor(n_neighbors = 500, n_jobs = -1),
          sklearn.neighbors.KNeighborsRegressor(n_neighbors = 500, leaf_size = 150, n_jobs = -1),
          sklearn.neighbors.KNeighborsRegressor(n_neighbors = 500, leaf_size = 10, n_jobs = -1)
                    ]

for model in models:
    corrs = []
    sharpes = []
    for train, test in tqdm(group_kfold.split(X, y, groups=groups), total=5):
        model.fit(X.iloc[train], y.iloc[train])
        val_preds = model.predict(X.iloc[test])
        eval_df = pd.DataFrame({'prediction':val_preds,
                            'target':y.iloc[test],
                            'era':groups.iloc[test]}).reset_index()
        corrs.append(corr(eval_df))
        sharpes.append(sharpe(eval_df))
    print(f'model: {model.__class__.__name__}')
    print(f'neighbors: {model.n_neighbors}')
    print(f'leaf size: {model.leaf_size}')
    print(f'validation correlations: {corrs}, mean: {np.array(corrs).mean()}')
    print(f'validation sharpes: {sharpes}, mean: {np.array(sharpes).mean()}\n')

In [None]:
%%script false --no-raise-error
# Takes a very long time
import sklearn.ensemble
models = [
          sklearn.ensemble.ExtraTreesRegressor(n_estimators=5, max_features = 0.8, n_jobs=-1),
          sklearn.ensemble.ExtraTreesRegressor(n_estimators=10, max_features = 0.8, n_jobs=-1),
          sklearn.ensemble.ExtraTreesRegressor(n_estimators=5, max_features = 0.2, n_jobs=-1),
          sklearn.ensemble.ExtraTreesRegressor(n_estimators=10, max_features = 0.2, n_jobs=-1),
          # sklearn.ensemble.ExtraTreesRegressor(n_estimators=100, max_features = 0.8, n_jobs=-1),
          # sklearn.ensemble.ExtraTreesRegressor(n_estimators=1000, max_features = 0.8, n_jobs=-1)
                    ]

for model in models:
    corrs = []
    sharpes = []
    for train, test in tqdm(group_kfold.split(X, y, groups=groups), total=5):
        model.fit(X.iloc[train], y.iloc[train])
        val_preds = model.predict(X.iloc[test])
        eval_df = pd.DataFrame({'prediction':val_preds,
                            'target':y.iloc[test],
                            'era':groups.iloc[test]}).reset_index()
        corrs.append(corr(eval_df))
        sharpes.append(sharpe(eval_df))
    print(f'model: {model.__class__.__name__} with {model.n_estimators} estimators and {model.max_features} max features.')
    print(f'validation correlations: {corrs}, mean: {np.array(corrs).mean()}')
    print(f'validation sharpes: {sharpes}, mean: {np.array(sharpes).mean()}\n')

In [45]:
import sklearn.ensemble
models = [
          sklearn.ensemble.GradientBoostingRegressor(n_estimators=50, max_features = 0.8, subsample=0.1),
          sklearn.ensemble.GradientBoostingRegressor(n_estimators=100, max_features = 0.8),
          sklearn.ensemble.GradientBoostingRegressor(n_estimators=50, max_features = 0.2),
          sklearn.ensemble.GradientBoostingRegressor(n_estimators=100, max_features = 0.2)
                    ]

for model in models:
    corrs = []
    sharpes = []
    for train, test in tqdm(group_kfold.split(X, y, groups=groups), total=5):
        model.fit(X.iloc[train], y.iloc[train])
        val_preds = model.predict(X.iloc[test])
        eval_df = pd.DataFrame({'prediction':val_preds,
                            'target':y.iloc[test],
                            'era':groups.iloc[test]}).reset_index()
        corrs.append(corr(eval_df))
        sharpes.append(sharpe(eval_df))
    print(f'model: {model.__class__.__name__} with {model.n_estimators} estimators and {model.max_features} max features.')
    print(f'validation correlations: {corrs}, mean: {np.array(corrs).mean()}')
    print(f'validation sharpes: {sharpes}, mean: {np.array(sharpes).mean()}\n')


#### Assessing Regression Model Performance

Here we apply the `corr` and `sharpe` methods defined above to predictions made on the validation sample in order to estimate our model's tournament performance.

In [None]:
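# Refit a linear regression on the full training sample. After the loops above,
# `model` would otherwise be whichever estimator was last fitted on a single CV fold.
model = sklearn.linear_model.LinearRegression(n_jobs=-1)
model.fit(X, y)

val_sample = training_data.iloc[val_idx]
val_preds = model.predict(val_sample[feature_cols])
eval_df = pd.DataFrame({'prediction':val_preds,
                        'target':val_sample.target,
                        'era':val_sample.era}).reset_index()
val_sharpe = sharpe(eval_df)
val_corr = corr(eval_df)

print((f'The linear regression model\'s validation correlation is {val_corr}. '
       f'Its validation sharpe is {val_sharpe}'))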

#### Making Predictions with the Regression Model

In [None]:
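ids = []
preds = []

chunksize = 50_000  # rows read per chunk, to keep memory use bounded

# Predict chunk by chunk so the full tournament file is never held in memory
tourn_iter_csv = pd.read_csv(tourn_file, iterator=True, chunksize=chunksize)
for chunk in tourn_iter_csv:
    df = chunk[feature_cols]
    out = model.predict(df)
    ids.extend(chunk["id"])
    preds.extend(out)
tourn_iter_csv.close()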

In [None]:
linear_regression_predictions_df = pd.DataFrame({
    'id':ids,
    'prediction':preds
})
linear_regression_predictions_df.head()

In [None]:
linear_regression_predictions_df.to_csv("lr_predictions.csv", index=False)
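Before uploading, it can be worth running a few sanity checks on the submission dataframe. The sketch below is illustrative only: `check_submission` is a hypothetical helper, and the rules it enforces (exactly the columns `id` and `prediction`, unique ids, no missing values, predictions between 0 and 1) are assumptions about the expected format, not something taken from the numerapi documentation.

```python
import pandas as pd

def check_submission(df):
    """Return a list of problems found in a submission dataframe (empty if none)."""
    problems = []
    if list(df.columns) != ["id", "prediction"]:
        problems.append(f"unexpected columns: {list(df.columns)}")
        return problems
    if df["id"].duplicated().any():
        problems.append("duplicate ids")
    preds = df["prediction"]
    if preds.isna().any():
        problems.append("missing predictions")
    # Assumption: the tournament expects predictions between 0 and 1.
    if not preds.dropna().between(0, 1).all():
        problems.append("predictions outside [0, 1]")
    return problems

# Made-up examples: one clean dataframe and one with two problems.
good = pd.DataFrame({"id": ["n1", "n2", "n3"], "prediction": [0.48, 0.51, 0.50]})
bad = pd.DataFrame({"id": ["n1", "n1", "n3"], "prediction": [0.48, 1.50, 0.50]})

print(check_submission(good))
print(check_submission(bad))
```

Calling `check_submission(linear_regression_predictions_df)` and confirming that the result is empty would be one way to catch a malformed file before it is uploaded.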

#### Submitting Predictions from the Linear Regression Model

We can use `numerapi` to submit these predictions as follows:

In [None]:
napi.upload_predictions("lr_predictions.csv", model_id=os.environ.get("NUMERAI_MODEL_ID"))