<a href="https://colab.research.google.com/github/djliden/numerai/blob/main/notebooks/regressions_new_CV.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# 1 Introduction
This notebook walks you through the entire process of making a [Numerai](https://numer.ai) submission, from downloading the data to submitting final predictions, all in a Google Colab notebook. In particular, it addresses two challenges:
- handling API keys in a remote environment (Colab)
- parsing the large CSV files which, if read all at once, will exceed Colab's memory and cause the notebook to crash.

This notebook will implement two models: a basic tabular neural network using `fastai` and a linear regression model using `scikit-learn`.

## 1.1 Installing and Importing Dependencies
First, we install and import the necessary packages. This cell is currently set *not* to print any output; if you run into any issues and need to check for error messages, comment out the `%%capture` line.

In [1]:
%%capture
# install
!pip install --upgrade python-dotenv numerapi

# import dependencies
import gc
import os
from dotenv import load_dotenv, find_dotenv
from getpass import getpass
import pandas as pd
import numpy as np
import numerapi
from pathlib import Path
from scipy.stats import spearmanr
import sklearn.linear_model
from tqdm import tqdm

## 1.2 Setting Up numerapi
We will use the [numerapi](https://github.com/uuazed/numerapi) package to access the data and make submissions. For this to work, numerapi needs to use your API keys (which can be obtained [here](https://numer.ai/submit)). We will set up two main ways of passing these API keys to a numerapi instance:
1. Read a `.env` file using the `python-dotenv` package. This requires you to upload a `.env` file (which contains your secret key and should *not* be kept under version control). With this method you do not have to enter your keys each time you use this notebook, though you will need to re-upload the `.env` file in each new Colab session.
2. Enter the API keys manually, if you don't have access to, or don't want to mess with, your `.env` file.

If you have a `.env` file, upload it to the default working directory, `/content`, now. In either case, run the cell below to set up the numerapi instance. See [Appendix A](#app_a) for instructions on generating and downloading a `.env` file.

In [2]:
# Load the numerapi credentials from .env or prompt for them if not available
def credential():
    dotenv_path = find_dotenv()
    load_dotenv(dotenv_path)

    if os.getenv("NUMERAI_PUBLIC_KEY"):
        print("Loaded Numerai Public Key into Global Environment!")
    else:
        os.environ["NUMERAI_PUBLIC_KEY"] = getpass("Please enter your Numerai Public Key. You can find your key here: https://numer.ai/submit -> ")
    
    if os.getenv("NUMERAI_SECRET_KEY"):
        print("Loaded Numerai Secret Key into Global Environment!")
    else:
        os.environ["NUMERAI_SECRET_KEY"] = getpass("Please enter your Numerai Secret Key. You can find your key here: https://numer.ai/submit -> ")
    
    if os.getenv("NUMERAI_MODEL_ID_REGRESSIONS"):
        print("Loaded Numerai Model ID into Global Environment!")
    else:
        os.environ["NUMERAI_MODEL_ID_REGRESSIONS"] = getpass("Please enter your Numerai Model ID. You can find your key here: https://numer.ai/submit -> ")

credential()
public_key = os.environ.get("NUMERAI_PUBLIC_KEY")
secret_key = os.environ.get("NUMERAI_SECRET_KEY")
model_id = os.environ.get("NUMERAI_MODEL_ID_REGRESSIONS")
napi = numerapi.NumerAPI(verbosity="info", public_id=public_key, secret_key=secret_key)

Loaded Numerai Public Key into Global Environment!
Loaded Numerai Secret Key into Global Environment!
Loaded Numerai Model ID into Global Environment!
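A quick way to confirm that all three variables ended up in the environment is to list any that are still unset. The following is a minimal sketch using only the standard library; the helper name `missing_credentials` is hypothetical (it is not part of numerapi or python-dotenv), and the variable names are the ones used in the cell above.

```python
import os

# Environment variable names used by the credential cell in this notebook.
REQUIRED_VARS = (
    "NUMERAI_PUBLIC_KEY",
    "NUMERAI_SECRET_KEY",
    "NUMERAI_MODEL_ID_REGRESSIONS",
)

def missing_credentials(env=None):
    """Return the required variable names that are unset or empty.

    `missing_credentials` is a hypothetical helper for illustration only.
    """
    env = os.environ if env is None else env
    return [name for name in REQUIRED_VARS if not env.get(name)]

# Example with a plain dict standing in for os.environ (placeholder values).
example_env = {"NUMERAI_PUBLIC_KEY": "public-placeholder", "NUMERAI_SECRET_KEY": ""}
print(missing_credentials(example_env))
```

Calling `missing_credentials()` with no argument checks the real environment; an empty list means `credential()` found or collected everything it needs.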


You can read up on the functionality of numerapi [here](https://github.com/uuazed/numerapi). You can use it to download the competition data, view other numerai users' public profiles, check submission status, manage your stake, and much more. In this case, we'll only be using it to download competition data and submit predictions.

## 1.3 Downloading Competition Data
In a more structured project, you'll probably want to keep the data in a separate directory from your scripts. You could also link Google Colab to your Google Drive and store the data there to avoid downloading and processing the data every time. In this case, however, we'll keep everything in `/content` and download the data fresh each time.

In [3]:
napi.download_current_dataset()

./numerai_dataset_255.zip: 395MB [00:08, 44.0MB/s]                           
2021-03-13 19:33:57,733 INFO numerapi.base_api: unzipping file...


'./numerai_dataset_255.zip'

## 1.4 Generating the Training Sample

If you look at the files we downloaded above, you'll see a `numerai_tournament_data.csv` file and a `numerai_training_data.csv` file. The "tournament" file contains many rows with targets which we can use for validation, so let's extract those and combine them with our training set. Note that this cell saves a new `csv` after combining the training and validation data, so we can avoid the time-consuming parsing process if we run this cell again in the same session.

In [4]:
tourn_file = Path(f'./numerai_dataset_{napi.get_current_round()}/numerai_tournament_data.csv')
train_file = Path(f'./numerai_dataset_{napi.get_current_round()}/numerai_training_data.csv')
processed_train_file = Path('./training_processed.csv')

if processed_train_file.exists():
    print("Loading the processed training data from file\n")
    training_data = pd.read_csv(processed_train_file)
else:
    tourn_iter_csv = pd.read_csv(tourn_file, iterator=True, chunksize=1_000_000)
    val_df = pd.concat([chunk[chunk['data_type'] == 'validation'] for chunk in tqdm(tourn_iter_csv)])
    tourn_iter_csv.close()
    training_data = pd.read_csv(train_file)
    training_data = pd.concat([training_data, val_df])
    training_data.reset_index(drop=True, inplace=True)
    print("Training Dataset Generated! Saving to file ...")
    training_data.to_csv(processed_train_file, index=False)


feature_cols = training_data.columns[training_data.columns.str.startswith('feature')]
target_cols = ['target']

train_idx = training_data.index[training_data.data_type=='train'].tolist()
val_idx = training_data.index[training_data.data_type=='validation'].tolist()

2it [00:54, 27.41s/it]


Training Dataset Generated! Saving to file ...
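The chunked read used above can be tried on a small scale without the real files. This is a minimal sketch under stated assumptions: the CSV text is made up, the column `feature_a` is a placeholder, and the chunk size of 2 is chosen only so the tiny example spans several chunks.

```python
import io
import pandas as pd

# Tiny in-memory stand-in for numerai_tournament_data.csv (made-up rows).
csv_text = """id,era,data_type,feature_a,target
n1,era121,validation,0.25,0.5
n2,eraX,test,0.75,
n3,era121,validation,0.50,0.25
n4,eraX,live,0.00,
n5,era122,validation,1.00,0.75
"""

# chunksize makes read_csv return an iterator of DataFrames instead of
# loading the whole file; here each chunk holds at most 2 rows.
with pd.read_csv(io.StringIO(csv_text), chunksize=2) as reader:
    # Keep only the validation rows of each chunk, then stitch them together.
    val_df = pd.concat(
        [chunk[chunk["data_type"] == "validation"] for chunk in reader]
    )

val_df = val_df.reset_index(drop=True)
print(val_df[["id", "era", "data_type"]])
```

Only one chunk is held in memory at a time, so peak memory depends on the chunk size and the number of rows kept rather than on the size of the file.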


# 2 Modeling the Data

In this section, we will define our evaluation metrics; run two different models (a linear regression model from `scikit-learn` and a neural network from `fastai`); and generate submission dataframes from those models.

## 2.1 Evaluation Metrics

In this section, we will define two key evaluation metrics used to assess the performance of models before submitting to the tournament. These metrics are:
- Average Spearman Correlation per era: The sum of each era's Spearman correlation divided by the number of eras.
- Sharpe Ratio: The average correlation per era divided by the standard deviation of the correlations per era.

Both are defined in reasonable detail [here](https://wandb.ai/carlolepelaars/numerai_tutorial/reports/How-to-get-Started-With-Numerai--VmlldzoxODU0NTQ). The methods defined below are modified versions of the methods described in that post.

In [5]:
def corr(df: pd.DataFrame) -> np.float32:
    """
    Calculate the correlation by using grouped per-era data
    :param df: A Pandas DataFrame containing the columns "era", "target" and "prediction"
    :return: The average per-era correlations.
    """
    def _score(sub_df: pd.DataFrame) -> np.float32:
        """ Calculate Spearman correlation for Pandas' apply method """
        return spearmanr(sub_df["target"],  sub_df["prediction"])[0]
    corrs = df.groupby("era").apply(_score)
    return corrs.mean() 

def sharpe(df: pd.DataFrame) -> np.float32:
    """
    Calculate the Sharpe ratio by using grouped per-era data
    :param df: A Pandas DataFrame containing the columns "era", "target" and "prediction"
    :return: The Sharpe ratio for your predictions.
    """
    def _score(sub_df: pd.DataFrame) -> np.float32:
        """ Calculate Spearman correlation for Pandas' apply method """
        return spearmanr(sub_df["target"],  sub_df["prediction"])[0]
    corrs = df.groupby("era").apply(_score)
    return corrs.mean() / corrs.std()
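To make the two metrics concrete, here is a small worked example on made-up data. It is a sketch rather than the notebook's own code: it computes the per-era Spearman correlations in a plain loop, then derives the mean and the Sharpe ratio the same way `corr` and `sharpe` do.

```python
import pandas as pd
from scipy.stats import spearmanr

# Toy predictions for three eras (made-up numbers, five rows per era).
targets = [0.0, 0.25, 0.5, 0.75, 1.0]
toy = pd.DataFrame({
    "era": ["era1"] * 5 + ["era2"] * 5 + ["era3"] * 5,
    "target": targets * 3,
    "prediction": [0.1, 0.2, 0.3, 0.4, 0.5,   # era1: same ordering as target
                   0.5, 0.4, 0.3, 0.2, 0.1,   # era2: reversed ordering
                   0.1, 0.2, 0.3, 0.4, 0.5],  # era3: same ordering as target
})

# One Spearman correlation per era.
per_era = pd.Series({
    era: spearmanr(group["target"], group["prediction"])[0]
    for era, group in toy.groupby("era")
})

mean_corr = per_era.mean()                # average correlation per era
sharpe_ratio = mean_corr / per_era.std()  # pandas std uses ddof=1 by default

print(per_era)
print(f"mean correlation: {mean_corr:.4f}, sharpe: {sharpe_ratio:.4f}")
```

The per-era correlations are 1, -1 and 1, so the mean is 1/3. Their sample standard deviation is about 1.155, which gives a Sharpe ratio of about 0.289: a model that is right in most eras but badly wrong in one is penalized for its inconsistency.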

## 2.2 Cross Validation Setup

### 2.2.1 Custom Cross Validation Setup

The goal of this section is to set up a "group time series" approach where we specify a certain set of "eras" for training with the last era for validation. We will be training on segments of the validation set.

There are a few ways to do this; I want to write a class that takes the "eras to test on" as input and returns CV folds as output. A future refinement might add an argument for the number of eras to validate on, though that may be unnecessary given that the "real" task is testing on a single era. Or four eras?

#### Usage

1. Initialize the class with the eras column: `cv = EraCV(training_data.era)`
2. Get splits: `train, test = cv.get_splits(valid_start=80, valid_n_eras=4, train_n_eras=None)`

The `valid_start` argument identifies the first validation era; it takes an integer value. `valid_n_eras` is the number of eras to include in the validation set, and `train_n_eras` is the number of eras immediately before `valid_start` to include in the training set. If `train_n_eras` is `None`, all eras before `valid_start` are included.

A single instance of this class can be used in a loop to generate multiple train/test splits. Assuming you want to keep the number of train and test eras constant, you can just iterate over a list of validation starting eras.

Features such as checking if a given validation era actually exists have not yet been implemented.

In [9]:
class EraCV:
    """Select validation eras and train on previous eras

    provides train/test indices to split data in train/test splits. In
    each split, one or more eras are used as a validation set while the
    specified number of immediately preceding eras are used as a
    training set.
    """

    def __init__(self, eras):
        self.eras = eras
        self.unique_eras = self._era_to_int(eras.unique())
        self.eras_int = self._era_to_int(eras)
        #self.valid_start = valid_start
        #self.valid_n_eras = valid_n_eras
        #self.train_n_eras = 0 if (train_n_eras is None) else train_n_eras
    
    def _era_to_int(self, eras):
        return [int(era[3:]) for era in eras]

    def get_valid_indices(self, valid_start, valid_n_eras):
        self.valid_eras = self.unique_eras[self.unique_eras.index(valid_start):\
                                      self.unique_eras.index(valid_start)+\
                                      valid_n_eras]
        valid_bool = [era in self.valid_eras for era in self.eras_int] 
        self.valid_indices = np.where(valid_bool)

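    def get_train_indices(self, valid_start:int, train_n_eras:int):
        # With train_n_eras=None the slice below becomes [-0:], i.e. [0:],
        # which keeps every era before valid_start.
        train_n_eras = 0 if (train_n_eras is None) else train_n_eras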
        self.train_eras = [era for era in self.unique_eras if era <\
                           valid_start][-train_n_eras:]
        train_bool = [era in self.train_eras for era in self.eras_int]
        self.train_indices = np.where(train_bool)

    def get_splits(self, valid_start:int, valid_n_eras:int,
                   train_n_eras:int = None):
        self.get_valid_indices(valid_start, valid_n_eras)
        self.get_train_indices(valid_start, train_n_eras)
        return self.train_indices[0], self.valid_indices[0]

    def __repr__(self):
       return (f'{self.__class__.__name__}('
               f'last era:{max(self.eras_int)})')
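The splitting logic and the loop described under Usage can be sketched on toy data. The function `era_split` below is a hypothetical stand-alone helper, not the `EraCV` class itself: it works on integer era numbers, sorts the unique eras (whereas `EraCV` keeps them in order of appearance), and adds a check for a missing validation era that `EraCV` does not perform.

```python
import numpy as np

def era_split(eras_int, valid_start, valid_n_eras=1, train_n_eras=None):
    """Return (train_idx, valid_idx) as positional index arrays.

    Hypothetical stand-alone helper mirroring the logic of EraCV.get_splits;
    `eras_int` holds one integer era number per row.
    """
    eras_int = np.asarray(eras_int)
    unique_eras = np.unique(eras_int)  # sorted ascending
    if valid_start not in unique_eras:
        # Extra check that EraCV itself does not perform.
        raise ValueError(f"era {valid_start} not found")
    # Validation: `valid_n_eras` consecutive eras starting at valid_start.
    start = int(np.searchsorted(unique_eras, valid_start))
    valid_eras = unique_eras[start:start + valid_n_eras]
    # Training: eras before valid_start, optionally only the last few.
    train_eras = unique_eras[unique_eras < valid_start]
    if train_n_eras is not None:
        train_eras = train_eras[-train_n_eras:]
    train_idx = np.flatnonzero(np.isin(eras_int, train_eras))
    valid_idx = np.flatnonzero(np.isin(eras_int, valid_eras))
    return train_idx, valid_idx

# Six toy eras with two rows each: [1, 1, 2, 2, ..., 6, 6].
toy_eras = np.repeat(np.arange(1, 7), 2)
for valid_start in (4, 5):
    train_idx, valid_idx = era_split(toy_eras, valid_start,
                                     valid_n_eras=2, train_n_eras=2)
    print(valid_start, train_idx, valid_idx)
```

Each iteration slides the window forward by one era, so the training rows always precede the validation rows in time.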


## 2.3 Linear Regression Model
This model closely follows the tutorial example [here](https://colab.research.google.com/github/numerai/example-scripts/blob/master/making-your-first-submission-on-numerai.ipynb). We will use the `scikit-learn` package, with which we can implement and fit our regression model in just a couple of lines of code.

#### Fitting the Linear Regression Model

In [None]:
%%time
corrs = []
sharpes = []
era_split = EraCV(eras = training_data.era)
X, y, era = training_data[feature_cols], training_data.target, training_data.era
for valid_era in tqdm(range(200,209)):
    train, test = era_split.get_splits(valid_start = valid_era,
                           valid_n_eras = 4,
                           train_n_eras = 50)
    model = sklearn.linear_model.LinearRegression(n_jobs = -1)
    model.fit(X.iloc[train], y.iloc[train])
    val_preds = model.predict(X.iloc[test])
    eval_df = pd.DataFrame({'prediction':val_preds,
                        'target':y.iloc[test],
                        'era':era.iloc[test]}).reset_index()
    corrs.append(corr(eval_df))
    sharpes.append(sharpe(eval_df))

print(corrs)
print(sharpes)

In [11]:
models = [
          # sklearn.linear_model.LinearRegression(n_jobs = -1),
          # sklearn.linear_model.Lasso(alpha=0.00006), # good! Takes pretty long time.
          # sklearn.linear_model.Lasso(alpha=0.00001), # Takes a long time, worse than .00005
          # sklearn.linear_model.Lasso(alpha=0.000005), # Takes a really long time, worse than .00001
          # sklearn.linear_model.Lasso(alpha=0.01), # fails
          # sklearn.linear_model.Ridge(alpha=0.0001),
          # sklearn.linear_model.Ridge(alpha=0.01),
          #sklearn.linear_model.Ridge(alpha = 0.1),
          #sklearn.linear_model.ElasticNet(alpha=.0001, l1_ratio=0.5),  # Good! .0145 mean
          # sklearn.linear_model.ElasticNet(alpha=.0001, l1_ratio=0.4),
          #sklearn.linear_model.ElasticNet(alpha=.001, l1_ratio=0.5),  # .0162
          #sklearn.linear_model.ElasticNet(alpha=.001, l1_ratio=0.4),  # .0167
          #sklearn.linear_model.ElasticNet(alpha=.001, l1_ratio=0.6)  # .0146
          sklearn.linear_model.ElasticNet(alpha=.001, l1_ratio=0.3), #.0167
          sklearn.linear_model.ElasticNet(alpha=.001, l1_ratio=0.4), #  .015
          sklearn.linear_model.ElasticNet(alpha=.001, l1_ratio=0.5) # .015
          # sklearn.linear_model.ElasticNet(alpha=.0002, l1_ratio=0.5),
          # sklearn.linear_model.ElasticNet(alpha=.0001, l1_ratio=0.4),
          #sklearn.linear_model.ElasticNet(alpha=.0001, l1_ratio=0.6), # Good! Best?
          #sklearn.linear_model.ElasticNet(alpha=.0001, l1_ratio=0.7) # Good! Best?

          # sklearn.linear_model.ElasticNet(alpha=.00005, l1_ratio=0.5),
          # sklearn.linear_model.ElasticNet(alpha=.0005, l1_ratio=0.5),
          # sklearn.linear_model.ElasticNet(alpha=.00001, l1_ratio=0.25) # Takes very long time
          # sklearn.linear_model.ElasticNet(alpha=.01, l1_ratio=0.5), # fails
          # sklearn.linear_model.ElasticNet(alpha=.001, l1_ratio=0.25),
          # sklearn.linear_model.ElasticNet(alpha=.001, l1_ratio=0.75) # bad result
          ]

In [14]:
era_split = EraCV(eras = training_data.era)
X, y, era = training_data[feature_cols], training_data.target, training_data.era
for model in models:
    corrs = []
    sharpes = []
    for valid_era in tqdm(range(200,209)):
        train, test = era_split.get_splits(valid_start = valid_era,
                                           valid_n_eras = 4,
                                           train_n_eras = None)
        model.fit(X.iloc[train], y.iloc[train])
        val_preds = model.predict(X.iloc[test])
        eval_df = pd.DataFrame({'prediction':val_preds,
                            'target':y.iloc[test],
                            'era':era.iloc[test]}).reset_index()
        corrs.append(corr(eval_df))
        sharpes.append(sharpe(eval_df))
    print(f'\nmodel: {model.__class__.__name__}')
    if model.__class__.__name__!="LinearRegression":
        print(f'alpha: {model.alpha}')
    if model.__class__.__name__=="ElasticNet":
        print(f'L1 Ratio: {model.l1_ratio}')
    print(f'validation correlations: {corrs}, mean: {np.array(corrs).mean()}')
    print(f'validation sharpes: {sharpes}, mean: {np.array(sharpes).mean()}')

100%|██████████| 9/9 [01:20<00:00,  8.92s/it]
  0%|          | 0/9 [00:00<?, ?it/s]


model: ElasticNet
alpha: 0.001
L1 Ratio: 0.3
validation correlations: [0.021791328956349696, 0.012926227931999653, 0.006615510537843738, -0.0033322783873920955, 0.008991823348009909, 0.01856060461003374, 0.03769456393412631, 0.03559889821075727, 0.007053140490382437], mean: 0.016211091070234517
validation sharpes: [0.7976902981366837, 0.593099839105777, 0.20437078815039159, -0.12058073667515777, 0.1940444061704515, 0.3712213403459524, 1.1356237295529281, 0.9698592191332843, 0.1418193707681109], mean: 0.4763498060764914


100%|██████████| 9/9 [01:23<00:00,  9.26s/it]
  0%|          | 0/9 [00:00<?, ?it/s]


model: ElasticNet
alpha: 0.001
L1 Ratio: 0.4
validation correlations: [0.02295339224341187, 0.014346934913335952, 0.009229280041262805, -0.0016668459595145886, 0.009630041411502654, 0.019515629597780856, 0.036910893616761004, 0.033116379302891644, 0.006815274149046641], mean: 0.016761219924053203
validation sharpes: [0.7840950515264731, 0.5819336205803927, 0.2785531688352436, -0.05992444257140876, 0.21629206971476153, 0.4009135524762245, 1.158232058073494, 0.9034112756394792, 0.13876037039460667], mean: 0.48914074718547407


100%|██████████| 9/9 [01:17<00:00,  8.59s/it]


model: ElasticNet
alpha: 0.001
L1 Ratio: 0.5
validation correlations: [0.02230521479144073, 0.014752141201887118, 0.0117495049707727, 0.0002690304184231572, 0.010571216651782173, 0.019421253725138258, 0.033041033840800954, 0.02807527924414685, 0.0060239184973200945], mean: 0.016245399260190228
validation sharpes: [0.7137527819708085, 0.5565901878508427, 0.3768262499261184, 0.010272633523150138, 0.2649419257026714, 0.4371966927910981, 1.1069030548203225, 0.8109670081976144, 0.1285047421047309], mean: 0.48955058632081744





#### Regression with Interaction Terms


In [None]:
%%script false --no-raise-error
# DOESN'T COMPLETE
from sklearn.preprocessing import PolynomialFeatures
models = [
          sklearn.linear_model.LinearRegression(n_jobs = -1),
          # sklearn.linear_model.Lasso(alpha=0.00006), # good! Takes pretty long time.
          # sklearn.linear_model.Ridge(alpha=0.01),
          #sklearn.linear_model.ElasticNet(alpha=.001, l1_ratio=0.4),  # .0167
          ]


poly = PolynomialFeatures(degree = 2, interaction_only=True)
era_split = EraCV(eras = training_data.era)
X, y, era = training_data[feature_cols], training_data.target, training_data.era
for model in models:
    corrs = []
    sharpes = []
    for valid_era in tqdm(range(200,209)):
        train, test = era_split.get_splits(valid_start = valid_era,
                                           valid_n_eras = 4,
                                           train_n_eras = None)
        # Fit the expansion on the training rows, then reuse it on the test rows
        model.fit(poly.fit_transform(X.iloc[train]), y.iloc[train])
        val_preds = model.predict(poly.transform(X.iloc[test]))
        eval_df = pd.DataFrame({'prediction':val_preds,
                            'target':y.iloc[test],
                            'era':era.iloc[test]}).reset_index()
        corrs.append(corr(eval_df))
        sharpes.append(sharpe(eval_df))
    print(f'\nmodel: {model.__class__.__name__}')
    if model.__class__.__name__!="LinearRegression":
        print(f'alpha: {model.alpha}')
    if model.__class__.__name__=="ElasticNet":
        print(f'L1 Ratio: {model.l1_ratio}')
    print(f'validation correlations: {corrs}, mean: {np.array(corrs).mean()}')
    print(f'validation sharpes: {sharpes}, mean: {np.array(sharpes).mean()}')


  0%|          | 0/9 [00:00<?, ?it/s]
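One plausible reason the interaction-term cell does not complete is the size of the expanded feature matrix. With `degree=2` and `interaction_only=True`, `PolynomialFeatures` produces a bias column, the original features, and one column per pair of distinct features. The sketch below counts those columns; the feature count of 310 is an assumption for illustration (check `len(feature_cols)` for the real number), and `n_interaction_columns` is a hypothetical helper.

```python
from math import comb

def n_interaction_columns(n_features, include_bias=True):
    """Columns produced by degree-2, interaction-only polynomial expansion."""
    return int(include_bias) + n_features + comb(n_features, 2)

# 310 is an assumed feature count; substitute len(feature_cols).
n_features = 310
n_cols = n_interaction_columns(n_features)
# Memory for a dense float64 matrix with 100,000 rows (8 bytes per value).
gib_per_100k_rows = n_cols * 100_000 * 8 / 2**30
print(n_cols, round(gib_per_100k_rows, 1))
```

Under that assumption the expansion has 48,206 columns, and a dense float64 matrix needs roughly 36 GiB per 100,000 rows, far more than a standard Colab runtime is likely to offer.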

In [None]:
%%script false --no-raise-error
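# Takes a very long time
import sklearn.neighbors
from sklearn.model_selection import GroupKFold

# Assumption: `group_kfold` and `groups` are not defined anywhere in this
# notebook; 5-fold grouping by era matches the `total=5` used below.
group_kfold = GroupKFold(n_splits=5)
groups = training_data.era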
models = [        
          sklearn.neighbors.KNeighborsRegressor(n_neighbors = 500, n_jobs = -1),
          sklearn.neighbors.KNeighborsRegressor(n_neighbors = 500, leaf_size = 150, n_jobs = -1),
          sklearn.neighbors.KNeighborsRegressor(n_neighbors = 500, leaf_size = 10, n_jobs = -1)
                    ]

for model in models:
    corrs = []
    sharpes = []
    for train, test in tqdm(group_kfold.split(X, y, groups=groups), total=5):
        model.fit(X.iloc[train], y.iloc[train])
        val_preds = model.predict(X.iloc[test])
        eval_df = pd.DataFrame({'prediction':val_preds,
                            'target':y.iloc[test],
                            'era':groups.iloc[test]}).reset_index()
        corrs.append(corr(eval_df))
        sharpes.append(sharpe(eval_df))
    print(f'model: {model.__class__.__name__}')
    print(f'neighbors: {model.n_neighbors}')
    print(f'leaf size: {model.leaf_size}')
    print(f'validation correlations: {corrs}, mean: {np.array(corrs).mean()}')
    print(f'validation sharpes: {sharpes}, mean: {np.array(sharpes).mean()}\n')

In [None]:
%%script false --no-raise-error
# Takes a very long time
import sklearn.ensemble
models = [
          sklearn.ensemble.ExtraTreesRegressor(n_estimators=5, max_features = 0.8, n_jobs=-1),
          sklearn.ensemble.ExtraTreesRegressor(n_estimators=10, max_features = 0.8, n_jobs=-1),
          sklearn.ensemble.ExtraTreesRegressor(n_estimators=5, max_features = 0.2, n_jobs=-1),
          sklearn.ensemble.ExtraTreesRegressor(n_estimators=10, max_features = 0.2, n_jobs=-1),
          # sklearn.ensemble.ExtraTreesRegressor(n_estimators=100, max_features = 0.8, n_jobs=-1),
          # sklearn.ensemble.ExtraTreesRegressor(n_estimators=1000, max_features = 0.8, n_jobs=-1)
                    ]

for model in models:
    corrs = []
    sharpes = []
    for train, test in tqdm(group_kfold.split(X, y, groups=groups), total=5):
        model.fit(X.iloc[train], y.iloc[train])
        val_preds = model.predict(X.iloc[test])
        eval_df = pd.DataFrame({'prediction':val_preds,
                            'target':y.iloc[test],
                            'era':groups.iloc[test]}).reset_index()
        corrs.append(corr(eval_df))
        sharpes.append(sharpe(eval_df))
    print(f'model: {model.__class__.__name__} with {model.n_estimators} estimators and {model.max_features} max features.')
    print(f'validation correlations: {corrs}, mean: {np.array(corrs).mean()}')
    print(f'validation sharpes: {sharpes}, mean: {np.array(sharpes).mean()}\n')

In [None]:
%%script false --no-raise-error

import sklearn.ensemble
models = [
          sklearn.ensemble.GradientBoostingRegressor(n_estimators=50, max_features = 0.8, subsample=0.1),
          sklearn.ensemble.GradientBoostingRegressor(n_estimators=100, max_features = 0.8),
          sklearn.ensemble.GradientBoostingRegressor(n_estimators=50, max_features = 0.2),
          sklearn.ensemble.GradientBoostingRegressor(n_estimators=100, max_features = 0.2)
                    ]

for model in models:
    corrs = []
    sharpes = []
    for train, test in tqdm(group_kfold.split(X, y, groups=groups), total=5):
        model.fit(X.iloc[train], y.iloc[train])
        val_preds = model.predict(X.iloc[test])
        eval_df = pd.DataFrame({'prediction':val_preds,
                            'target':y.iloc[test],
                            'era':groups.iloc[test]}).reset_index()
        corrs.append(corr(eval_df))
        sharpes.append(sharpe(eval_df))
    print(f'model: {model.__class__.__name__} with {model.n_estimators} estimators and {model.max_features} max features.')
    print(f'validation correlations: {corrs}, mean: {np.array(corrs).mean()}')
    print(f'validation sharpes: {sharpes}, mean: {np.array(sharpes).mean()}\n')

#### Assessing Regression Model Performance

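Here we apply the `corr` and `sharpe` functions defined above to predictions made on the validation sample in order to estimate our model's tournament performance. Note that the final model below is fit on all of `training_data`, which includes the validation rows, so these scores are in-sample and likely optimistic.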

In [15]:
# Fit final model
X, y, era = training_data[feature_cols], training_data.target, training_data.era
model = sklearn.linear_model.ElasticNet(alpha=.001, l1_ratio=0.4)
model.fit(X, y)
val_sample = training_data.iloc[val_idx]
val_preds = model.predict(val_sample[feature_cols])
eval_df = pd.DataFrame({'prediction':val_preds,
                        'target':val_sample.target,
                        'era':val_sample.era}).reset_index()
val_sharpe = sharpe(eval_df)
val_corr = corr(eval_df)

print((f'The linear regression model\'s validation correlation is {val_corr}. '
       f'Its validation sharpe is {val_sharpe}'))

The linear regression model's validation correlation is 0.021927411934820517. Its validation sharpe is 0.547569078760028


#### Making Predictions with the Regression Model

In [16]:
ids = []
preds = []

chunksize = 1_000_000  # rows per chunk; lower this if memory is tight

tourn_iter_csv = pd.read_csv(tourn_file, iterator=True, chunksize=chunksize)
for chunk in tourn_iter_csv:
    df = chunk[feature_cols]
    out = model.predict(df)
    ids.extend(chunk["id"])
    preds.extend(out)
tourn_iter_csv.close()

In [17]:
linear_regression_predictions_df = pd.DataFrame({
    'id':ids,
    'prediction':preds
})
linear_regression_predictions_df.head()

| | id | prediction |
|---|---|---|
| 0 | n0003aa52cab36c2 | 0.494364 |
| 1 | n000920ed083903f | 0.497263 |
| 2 | n0038e640522c4a6 | 0.506464 |
| 3 | n004ac94a87dc54b | 0.49951 |
| 4 | n0052fe97ea0c05f | 0.499483 |


In [18]:
linear_regression_predictions_df.to_csv("lr_predictions.csv", index=False)
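Before uploading, a few cheap checks on the predictions frame can catch problems early. The checks below are illustrative assumptions about what a sensible submission looks like, not Numerai's official validation rules, and `check_predictions` is a hypothetical helper.

```python
import pandas as pd

def check_predictions(df):
    """Return a list of problems found in an id/prediction DataFrame.

    Hypothetical helper; these checks are illustrative, not Numerai's rules.
    """
    problems = []
    if list(df.columns) != ["id", "prediction"]:
        problems.append("columns should be exactly ['id', 'prediction']")
        return problems
    if df["id"].duplicated().any():
        problems.append("duplicate ids")
    if df["prediction"].isna().any():
        problems.append("missing predictions")
    return problems

# Toy frames with made-up ids.
good = pd.DataFrame({"id": ["n1", "n2"], "prediction": [0.49, 0.51]})
bad = pd.DataFrame({"id": ["n1", "n1"], "prediction": [0.49, None]})
print(check_predictions(good))
print(check_predictions(bad))
```

In this notebook the call would be `check_predictions(linear_regression_predictions_df)`; an empty list means none of these particular problems were found.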

#### Submitting Predictions from the Linear Regression Model

We can use `numerapi` to submit these predictions as follows:

In [19]:
napi.upload_predictions("lr_predictions.csv", model_id=os.environ.get("NUMERAI_MODEL_ID_REGRESSIONS"))

2021-03-13 19:48:56,546 INFO numerapi.base_api: uploading predictions...


'f0d64837-ad84-49c9-ac04-2d0b8353d8c9'