# Introduction

In this notebook, we implement the full pipeline end to end: downloading data from [numer.ai](https://numer.ai), fitting a model, and making a submission to the Numerai tournament. We mostly follow the approach laid out in [this colab notebook](https://colab.research.google.com/github/numerai/example-scripts/blob/master/making-your-first-submission-on-numerai.ipynb).

## Dependencies
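Make sure to install `pandas`, `numpy`, `scikit-learn`, `python-dotenv`, and `numerapi` before running this notebook. (`scikit-learn` and `python-dotenv` are the package names that provide the `sklearn` and `dotenv` imports.)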

In [1]:
# Import Dependencies
import os
from dotenv import load_dotenv, find_dotenv
import pandas as pd
import numpy as np
import numerapi
import sklearn.linear_model

In [17]:
# Secrets setup
dotenv_path = find_dotenv()
load_dotenv(dotenv_path)
public_key = os.environ.get("NUMERAI_PUBLIC_KEY")
private_key = os.environ.get("NUMERAI_PRIVATE_KEY")

napi = numerapi.NumerAPI(verbosity="info", public_id=public_key, secret_key=private_key)
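
The secrets cell above reads both API keys with `os.environ.get`, which quietly returns `None` when a variable is not set. A minimal sketch of an early check follows; the helper name `require_env` is hypothetical and not part of `numerapi` or `python-dotenv`.

```python
import os


def require_env(names, environ=None):
    """Return {name: value} for the requested variables.

    Raises RuntimeError listing every variable that is unset or empty.
    `environ` defaults to os.environ; passing a dict makes the check testable.
    """
    environ = os.environ if environ is None else environ
    missing = [name for name in names if not environ.get(name)]
    if missing:
        raise RuntimeError("Missing environment variables: " + ", ".join(missing))
    return {name: environ[name] for name in names}


# Demonstration with a stand-in mapping rather than real credentials.
demo_env = {"NUMERAI_PUBLIC_KEY": "pub", "NUMERAI_PRIVATE_KEY": "priv"}
keys = require_env(["NUMERAI_PUBLIC_KEY", "NUMERAI_PRIVATE_KEY"], environ=demo_env)
```

In the notebook, calling `require_env(["NUMERAI_PUBLIC_KEY", "NUMERAI_PRIVATE_KEY"])` right after `load_dotenv` would surface a missing `.env` entry before any API call is made.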

# Data Download and Setup

In this step, we download the data and take a quick look at what we're working with. A full exploratory analysis will come later.

In [3]:
napi.download_current_dataset(dest_path="../input/", unzip=True)

2021-02-11 11:41:46,571 INFO numerapi.base_api: target file already exists


'../input/numerai_dataset_250.zip'

In [4]:
training_data = pd.read_csv("../input/numerai_dataset_250/numerai_training_data.csv")
training_data.head()

,id,era,data_type,feature_intelligence1,feature_intelligence2,feature_intelligence3,feature_intelligence4,feature_intelligence5,feature_intelligence6,feature_intelligence7,...,feature_wisdom38,feature_wisdom39,feature_wisdom40,feature_wisdom41,feature_wisdom42,feature_wisdom43,feature_wisdom44,feature_wisdom45,feature_wisdom46,target
0,n000315175b67977,era1,train,0.0,0.5,0.25,0.0,0.5,0.25,0.25,...,1.0,1.0,0.75,0.5,0.75,0.5,1.0,0.5,0.75,0.5
1,n0014af834a96cdd,era1,train,0.0,0.0,0.0,0.25,0.5,0.0,0.0,...,1.0,1.0,0.0,0.0,0.75,0.25,0.0,0.25,1.0,0.25
2,n001c93979ac41d4,era1,train,0.25,0.5,0.25,0.25,1.0,0.75,0.75,...,0.25,0.5,0.0,0.0,0.5,1.0,0.0,0.25,0.75,0.25
3,n0034e4143f22a13,era1,train,1.0,0.0,0.0,0.5,0.5,0.25,0.25,...,1.0,1.0,0.75,0.75,1.0,1.0,0.75,1.0,1.0,0.25
4,n00679d1a636062f,era1,train,0.25,0.25,0.25,0.25,0.0,0.25,0.5,...,0.75,0.75,0.25,0.5,0.75,0.0,0.5,0.25,0.75,0.75


The tournament data is very large, so we load only the first 10,000 rows to take a quick look at it.

In [5]:
tournament_data = pd.read_csv("../input/numerai_dataset_250/numerai_tournament_data.csv", nrows=10_000)
tournament_data.head()

,id,era,data_type,feature_intelligence1,feature_intelligence2,feature_intelligence3,feature_intelligence4,feature_intelligence5,feature_intelligence6,feature_intelligence7,...,feature_wisdom38,feature_wisdom39,feature_wisdom40,feature_wisdom41,feature_wisdom42,feature_wisdom43,feature_wisdom44,feature_wisdom45,feature_wisdom46,target
0,n0003aa52cab36c2,era121,validation,0.25,0.75,0.5,0.5,0.0,0.75,0.5,...,0.75,0.75,1.0,0.75,0.5,0.5,1.0,0.0,0.0,0.25
1,n000920ed083903f,era121,validation,0.75,0.5,0.75,1.0,0.5,0.0,0.0,...,0.5,0.5,0.75,1.0,0.75,0.5,0.5,0.5,0.5,0.5
2,n0038e640522c4a6,era121,validation,1.0,0.0,0.0,1.0,1.0,1.0,1.0,...,0.0,0.0,0.5,0.25,0.0,0.0,0.5,0.5,0.0,1.0
3,n004ac94a87dc54b,era121,validation,0.75,1.0,1.0,0.5,0.0,0.0,0.0,...,0.0,0.0,0.0,0.25,0.0,0.0,0.0,0.25,0.25,0.5
4,n0052fe97ea0c05f,era121,validation,0.25,0.5,0.5,0.25,1.0,0.5,0.5,...,0.5,0.75,0.0,0.0,0.75,1.0,0.0,0.25,1.0,0.75


In [6]:
training_data.describe()

,feature_intelligence1,feature_intelligence2,feature_intelligence3,feature_intelligence4,feature_intelligence5,feature_intelligence6,feature_intelligence7,feature_intelligence8,feature_intelligence9,feature_intelligence10,...,feature_wisdom38,feature_wisdom39,feature_wisdom40,feature_wisdom41,feature_wisdom42,feature_wisdom43,feature_wisdom44,feature_wisdom45,feature_wisdom46,target
count,501808.0,501808.0,501808.0,501808.0,501808.0,501808.0,501808.0,501808.0,501808.0,501808.0,...,501808.0,501808.0,501808.0,501808.0,501808.0,501808.0,501808.0,501808.0,501808.0,501808.0
mean,0.499981,0.499979,0.499979,0.499981,0.499977,0.499977,0.499977,0.499981,0.49998,0.49998,...,0.499982,0.499982,0.499974,0.49998,0.499982,0.49998,0.499974,0.499979,0.499971,0.499997
std,0.353596,0.353593,0.353593,0.353596,0.353587,0.353587,0.353587,0.353596,0.352099,0.352099,...,0.353139,0.353139,0.351328,0.350662,0.352151,0.352965,0.351328,0.347689,0.353419,0.223268
min,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
25%,0.25,0.25,0.25,0.25,0.25,0.25,0.25,0.25,0.25,0.25,...,0.25,0.25,0.25,0.25,0.25,0.25,0.25,0.25,0.25,0.5
50%,0.5,0.5,0.5,0.5,0.5,0.5,0.5,0.5,0.5,0.5,...,0.5,0.5,0.5,0.5,0.5,0.5,0.5,0.5,0.5,0.5
75%,0.75,0.75,0.75,0.75,0.75,0.75,0.75,0.75,0.75,0.75,...,0.75,0.75,0.75,0.75,0.75,0.75,0.75,0.75,0.75,0.5
max,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,...,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0


So we have a lot of data: more than 500,000 rows of tabular data in the training set. All of the inputs appear to be scaled to lie between 0 and 1, but again, we will defer the EDA until later.
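
The previews above suggest that the features take only a handful of distinct values (0, 0.25, 0.5, 0.75, 1) rather than arbitrary values in that range. A minimal sketch of a quick check is shown below, run here on a toy frame; the helper name `feature_levels` is hypothetical, and whether the full dataset really contains only these five levels is something the check would confirm, not something assumed.

```python
import pandas as pd


def feature_levels(df, prefix="feature"):
    """Return the sorted distinct values found across all columns starting with `prefix`."""
    cols = [c for c in df.columns if c.startswith(prefix)]
    # Collect unique values column by column to avoid copying the whole frame at once.
    seen = set()
    for col in cols:
        seen.update(float(v) for v in df[col].unique())
    return sorted(seen)


# Toy frame standing in for training_data.
toy = pd.DataFrame({
    "id": ["a", "b", "c"],
    "feature_x1": [0.0, 0.5, 1.0],
    "feature_x2": [0.25, 0.75, 0.5],
    "target": [0.5, 0.25, 0.75],
})
levels = feature_levels(toy)
```

On the real data this would be called as `feature_levels(training_data)`; the `id`, `era`, `data_type`, and `target` columns are skipped because they do not start with `feature`.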

# Train a Basic Regression Model

First, we divide our training data into a training set and a validation set.

In [7]:
from sklearn.model_selection import train_test_split

In [8]:
feature_cols = training_data.columns[training_data.columns.str.startswith('feature')]
X = training_data[feature_cols]
y = training_data.target

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42)
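
The split above shuffles individual rows, so rows from the same `era` can land on both sides. If rows within an era are related (an assumption here, since the notebook has not yet examined the eras), a group-aware split keeps each era entirely in one set. The sketch below swaps in scikit-learn's `GroupShuffleSplit` in place of `train_test_split`; the helper name `split_by_era` and the toy frame are illustrative only.

```python
import pandas as pd
from sklearn.model_selection import GroupShuffleSplit


def split_by_era(df, test_size=0.2, random_state=42):
    """Split rows so that every era falls entirely in train or entirely in test."""
    splitter = GroupShuffleSplit(n_splits=1, test_size=test_size, random_state=random_state)
    train_idx, test_idx = next(splitter.split(df, groups=df["era"]))
    return df.iloc[train_idx], df.iloc[test_idx]


# Toy frame standing in for training_data: 5 eras with 2 rows each.
toy = pd.DataFrame({
    "era": [f"era{i}" for i in range(1, 6) for _ in range(2)],
    "feature_x1": [0.0, 0.25, 0.5, 0.75, 1.0, 0.0, 0.25, 0.5, 0.75, 1.0],
    "target": [0.5, 0.25, 0.75, 0.5, 0.5, 0.0, 1.0, 0.5, 0.25, 0.75],
})
train_part, test_part = split_by_era(toy)
```

With the real data, `train_part[feature_cols]` and `train_part["target"]` would play the roles of `X_train` and `y_train`, and likewise for the held-out part.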

Next, we train the linear regression model on our training set.

In [9]:
model = sklearn.linear_model.LinearRegression()
model.fit(X_train, y_train)

LinearRegression()

Now we do a quick check of performance, using mean squared error on the held-out set.

In [10]:
y_pred = model.predict(X_test)

In [11]:
from sklearn.metrics import mean_squared_error
mean_squared_error(y_test, y_pred)

0.049993155518750824
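
An MSE is easier to read next to a reference point. As a rough guide, `describe()` reported a target standard deviation of 0.223268 on the full training data, i.e. a variance of about 0.0498, which is approximately what always predicting the mean would score. The sketch below computes that baseline on the same held-out rows as the model; `mse_vs_mean_baseline` is an illustrative helper name and the arrays are toy values, not results from this notebook.

```python
import numpy as np
from sklearn.metrics import mean_squared_error


def mse_vs_mean_baseline(y_train, y_test, y_pred):
    """Return (model_mse, baseline_mse).

    The baseline ignores the features and always predicts the training-set mean.
    """
    baseline_pred = np.full(len(y_test), np.mean(y_train))
    model_mse = mean_squared_error(y_test, y_pred)
    baseline_mse = mean_squared_error(y_test, baseline_pred)
    return model_mse, baseline_mse


# Toy values only, to show the call shape.
y_train_toy = np.array([0.25, 0.5, 0.75])
y_test_toy = np.array([0.0, 0.5, 1.0])
y_pred_toy = np.array([0.25, 0.5, 0.75])
model_mse, baseline_mse = mse_vs_mean_baseline(y_train_toy, y_test_toy, y_pred_toy)
```

In the notebook the call would be `mse_vs_mean_baseline(y_train, y_test, y_pred)`; a model MSE close to the baseline indicates that the regression adds little over a constant prediction by this metric.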

# Make Predictions on the Tournament Data
Our tournament CSV is very large, so we process it in chunks of 10,000 rows. For each chunk, we make predictions on the feature columns and collect the ids.

In [18]:
ids = []
preds = []

filename = "../input/numerai_dataset_250/numerai_tournament_data.csv"
chunksize = 10000
with pd.read_csv(filename, chunksize=chunksize) as reader:
    for chunk in reader:
        df = chunk[feature_cols]
        out = model.predict(df)
        ids.extend(chunk["id"])
        preds.extend(out)

In [19]:
preds[0:10], ids[0:10], len(preds)

([0.4858435099137928,
  0.500381886809774,
  0.5324541513431267,
  0.49535029730187624,
  0.49798473762874546,
  0.5067866682075791,
  0.5076158840825137,
  0.4998479690408145,
  0.48889199898887953,
  0.4769361715435418],
 ['n0003aa52cab36c2',
  'n000920ed083903f',
  'n0038e640522c4a6',
  'n004ac94a87dc54b',
  'n0052fe97ea0c05f',
  'n00a5ccf3b6b2870',
  'n00bf78d0bbbc1b6',
  'n00c6fd95ff0c83e',
  'n00cd56868258aec',
  'n00e7d6fb71ef69f'],
 1644415)
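
The chunked loop above can also be wrapped in a small function and exercised on a tiny in-memory CSV. This is a sketch under stated assumptions, not the notebook's method: the helper name `predict_in_chunks` and the toy data are hypothetical, and it additionally passes `usecols` so that only the `id` and feature columns are parsed.

```python
import io

import pandas as pd
from sklearn.linear_model import LinearRegression


def predict_in_chunks(source, model, feature_cols, chunksize=10_000):
    """Stream a CSV in chunks and return a DataFrame of ids and predictions."""
    feature_cols = list(feature_cols)
    parts = []
    # usecols skips columns the model does not need (e.g. era, data_type, target).
    for chunk in pd.read_csv(source, usecols=["id", *feature_cols], chunksize=chunksize):
        parts.append(pd.DataFrame({
            "id": chunk["id"].to_numpy(),
            # Reorder columns to match the order the model was fitted with.
            "prediction": model.predict(chunk[feature_cols]),
        }))
    return pd.concat(parts, ignore_index=True)


# Toy data standing in for the tournament file.
toy = pd.DataFrame({
    "id": [f"n{i}" for i in range(5)],
    "era": ["era1"] * 5,
    "feature_a": [0.0, 0.25, 0.5, 0.75, 1.0],
    "feature_b": [1.0, 0.5, 0.0, 0.5, 1.0],
    "target": [0.25, 0.5, 0.5, 0.75, 0.5],
})
toy_features = ["feature_a", "feature_b"]
toy_model = LinearRegression().fit(toy[toy_features], toy["target"])
result = predict_in_chunks(io.StringIO(toy.to_csv(index=False)), toy_model, toy_features, chunksize=2)
```

Collecting one small frame per chunk and concatenating at the end keeps ids and predictions aligned by construction, at the cost of one extra copy in `pd.concat`.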

# Submit to the Competition
First, we format our predictions according to the requirements:

In [28]:
# predictions must have an `id` column and a `prediction_kazutsugi` column
predictions_df = pd.DataFrame({
    'id': ids,
    'prediction_kazutsugi': preds,
})
predictions_df.head()

,id,prediction_kazutsugi
0,n0003aa52cab36c2,0.485844
1,n000920ed083903f,0.500382
2,n0038e640522c4a6,0.532454
3,n004ac94a87dc54b,0.49535
4,n0052fe97ea0c05f,0.497985


In [29]:
predictions_df.to_csv("../output/predictions.csv", index=False)
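
Before uploading, it can be worth checking the frame against the format noted in the formatting cell (an `id` column and a `prediction_kazutsugi` column). The sketch below is a hypothetical helper, `check_predictions`, run on toy frames. The duplicate-id, missing-value, and [0, 1] range checks are assumptions based on the scale of the target values shown earlier, not rules quoted in this notebook.

```python
import pandas as pd


def check_predictions(df, pred_col="prediction_kazutsugi"):
    """Return a list of human-readable problems; an empty list means no issue was found."""
    expected = ["id", pred_col]
    if list(df.columns) != expected:
        return [f"expected columns {expected}, got {list(df.columns)}"]
    problems = []
    if df["id"].duplicated().any():
        problems.append("duplicate ids")
    if df[pred_col].isna().any():
        problems.append("missing predictions")
    # Assumed range: predictions on the same 0-to-1 scale as the target.
    if not df[pred_col].dropna().between(0, 1).all():
        problems.append("predictions outside [0, 1]")
    return problems


good = pd.DataFrame({"id": ["n1", "n2"], "prediction_kazutsugi": [0.48, 0.51]})
bad = pd.DataFrame({"id": ["n1", "n1", "n3"], "prediction_kazutsugi": [0.48, None, 1.5]})
good_problems = check_predictions(good)
bad_problems = check_predictions(bad)
```

In the notebook, `check_predictions(predictions_df)` returning an empty list would be a reasonable precondition for the upload step.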

In [30]:
# Upload predictions
submission_id = napi.upload_predictions("../output/predictions.csv", model_id=os.environ.get("NUMERAI_MODEL_ID"))

2021-02-11 11:54:16,101 INFO numerapi.base_api: uploading predictions...
