<a href="https://colab.research.google.com/github/djliden/numerai_starter_kit/blob/main/Numerai_Starter_Kit.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Introduction
This notebook walks you through the entire process of making a [Numerai](https://numer.ai) submission, from downloading the data to submitting final predictions, all in a Google Colab notebook. In particular, it addresses two challenges:
- handling API keys in a remote environment (Colab)
- parsing the large CSV files which, if read all at once, will exceed Colab's memory and crash the notebook.

This notebook will implement two models: a basic tabular neural network using `fastai` and a linear regression model using `scikit-learn`.

# Initial Setup
First, we install and import the necessary packages. If you don't care about the resulting messages, uncomment `# %%capture` at the top of the cell; this will hide all of the output.

In [9]:
# %%capture
# install
!pip install --upgrade python-dotenv fastai numerapi

# import dependencies
import os
from dotenv import load_dotenv, find_dotenv
from getpass import getpass
import pandas as pd
import numpy as np
import numerapi
from fastai.tabular.all import *

Requirement already up-to-date: python-dotenv in /usr/local/lib/python3.7/dist-packages (0.15.0)
Requirement already up-to-date: fastai in /usr/local/lib/python3.7/dist-packages (2.2.7)
Requirement already up-to-date: numerapi in /usr/local/lib/python3.7/dist-packages (2.4.3)


# Setting Up numerapi
We will use the [numerapi](https://github.com/uuazed/numerapi) package to access the data and make submissions. For this to work, numerapi needs to use your API keys (which can be obtained [here](https://numer.ai/submit)). We will set up two main ways of passing these API keys to a numerapi instance:
1. Read a `.env` file using the `python-dotenv` package. This will require you to upload a `.env` file (which contains your secret key and should *not* be kept under version control). Using this method means you will not have to directly enter your keys each time you use this notebook, though you will need to re-upload the `.env` file.
2. Enter the API keys manually, if you don't have access to, or don't want to mess with, a `.env` file.

If you have a `.env` file, upload it now (to the default working directory, `content`). In either case, run the cell below to set up the numerapi instance. See [Appendix A](#app_a) for instructions on generating and downloading a `.env` file.

In [18]:
# Load the numerapi credentials from .env or prompt for them if not available
def credential():
    dotenv_path = find_dotenv()
    load_dotenv(dotenv_path)

    if os.getenv("NUMERAI_PUBLIC_KEY"):
        print("Loaded Numerai Public Key into Global Environment!")
    else:
        os.environ["NUMERAI_PUBLIC_KEY"] = getpass("Please enter your Numerai Public Key. You can find your key here: https://numer.ai/submit -> ")
    
    if os.getenv("NUMERAI_SECRET_KEY"):
        print("Loaded Numerai Secret Key into Global Environment!")
    else:
        os.environ["NUMERAI_SECRET_KEY"] = getpass("Please enter your Numerai Secret Key. You can find your key here: https://numer.ai/submit -> ")
    
    if os.getenv("NUMERAI_MODEL_ID"):
        print("Loaded Numerai Model ID into Global Environment!")
    else:
        os.environ["NUMERAI_MODEL_ID"] = getpass("Please enter your Numerai Model ID. You can find your key here: https://numer.ai/submit -> ")

credential()
public_key = os.environ.get("NUMERAI_PUBLIC_KEY")
secret_key = os.environ.get("NUMERAI_SECRET_KEY")
napi = numerapi.NumerAPI(verbosity="info", public_id=public_key, secret_key=secret_key)

Loaded Numerai Public Key into Global Environment!
Loaded Numerai Secret Key into Global Environment!
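
The three branches of `credential()` follow the same "use the environment variable if it is set, otherwise prompt" pattern. As a minimal sketch of how that pattern could be factored out, the helper below handles one variable at a time. The name `ensure_env` and its `prompt_fn` parameter are hypothetical and not part of numerapi or python-dotenv; `prompt_fn` exists only so the prompt can be swapped out when testing.

```python
import os
from getpass import getpass


def ensure_env(var_name, label, prompt_fn=getpass):
    """Return the value of `var_name`, prompting for it only if it is unset or empty.

    `ensure_env` is an illustrative helper, not a library function.
    """
    value = os.getenv(var_name)
    if value:
        print(f"Loaded {label} into Global Environment!")
    else:
        # Hidden input, so the key is not echoed into the notebook output
        value = prompt_fn(
            f"Please enter your {label}. You can find it here: https://numer.ai/submit -> "
        )
        os.environ[var_name] = value
    return value
```

After `load_dotenv(find_dotenv())`, calls such as `ensure_env("NUMERAI_PUBLIC_KEY", "Numerai Public Key")` would then cover each of the three variables with a correctly labelled message and prompt.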


You can read up on the functionality of numerapi [here](https://github.com/uuazed/numerapi). You can use it to download the competition data, view other numerai users' public profiles, check submission status, manage your stake, and much more. In this case, we'll only be using it to download competition data and submit predictions.

# Downloading Competition Data
In a more structured project, you'll probably want to keep the data in a separate directory from your scripts. In this case, however, we'll keep everything in `./content`.

In [20]:
napi.download_current_dataset()

./numerai_dataset_253.zip: 100%|█████████▉| 393M/393M [00:11<00:00, 40.6MB/s]2021-03-03 16:33:45,600 INFO numerapi.base_api: unzipping file...
./numerai_dataset_253.zip: 393MB [00:30, 40.6MB/s]                           

'./numerai_dataset_253.zip'

# Generating our Training Set

If you look at the files we downloaded above, you'll see a `numerai_tournament_data.csv` file and a `numerai_training_data.csv` file. The "tournament" file contains many rows with targets which we can use for validation, so let's extract those and combine them with our training set:

In [21]:
napi.get_current_round()

253

In [None]:
from pathlib import Path
from tqdm.auto import tqdm

# Toggle for quick experiments; not defined in the original cell, so False is an assumed default
debug = False

# Look up the round once and build all paths from the unzipped dataset directory
data_dir = Path(f'./numerai_dataset_{napi.get_current_round()}')
tourn_file = data_dir / 'numerai_tournament_data.csv'
train_file = data_dir / 'numerai_training_data.csv'
processed_file = data_dir / 'training_processed.csv'

if processed_file.exists():
    # Reuse the combined file from an earlier run instead of re-parsing the raw CSVs
    print("Loading the processed training data from file\n")
    training_data = pd.read_csv(processed_file)
else:
    # Read the tournament file in chunks, keeping only the validation rows,
    # so the full file is never held in memory at once
    iter_csv = pd.read_csv(tourn_file, iterator=True, chunksize=1_000_000)
    val_df = pd.concat([chunk[chunk['data_type'] == 'validation'] for chunk in tqdm(iter_csv)])
    iter_csv.close()

    # Append the validation rows to the training rows and cache the result
    training_data = pd.read_csv(train_file)
    training_data = pd.concat([training_data, val_df])
    training_data.reset_index(drop=True, inplace=True)
    training_data.to_csv(processed_file, index=False)

if debug:
    print("using a debugging sample of 1500 rows\n")
    training_data = training_data.sample(1500)
    training_data.reset_index(drop=True, inplace=True)

feature_cols = training_data.columns[training_data.columns.str.startswith('feature')]
target_cols = ['target']

# Row labels for the train/validation split
train_idx = training_data.index[training_data.data_type == 'train'].tolist()
test_idx = training_data.index[training_data.data_type == 'validation'].tolist()
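
To see the chunked-filtering idea in isolation, here is a minimal sketch that runs on a tiny in-memory CSV instead of the real tournament file. The data values and the helper name `filter_csv_in_chunks` are made up for illustration; only the `data_type` column and the `'validation'` label mirror the cell above.

```python
import io

import pandas as pd

# Tiny synthetic stand-in for numerai_tournament_data.csv (all values are made up)
csv_text = """id,data_type,feature_a,target
r1,validation,0.25,0.5
r2,test,0.50,
r3,validation,0.75,1.0
r4,live,0.00,
r5,test,1.00,
r6,validation,0.50,0.0
"""


def filter_csv_in_chunks(source, column, value, chunksize):
    """Read `source` chunk by chunk, keeping only rows where `column` equals `value`."""
    reader = pd.read_csv(source, chunksize=chunksize)
    try:
        # Only the filtered rows of each chunk are retained, so peak memory
        # is roughly one chunk plus the rows kept so far
        kept = [chunk[chunk[column] == value] for chunk in reader]
    finally:
        reader.close()
    return pd.concat(kept, ignore_index=True)


val_df = filter_csv_in_chunks(io.StringIO(csv_text), "data_type", "validation", chunksize=2)
print(val_df)
```

With `chunksize=2` the six rows are read in three passes, and only the three validation rows survive. On the real file, a smaller chunk size lowers peak memory at the cost of more passes through the loop.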

<a name="app_a"></a>
# Appendix A
## Generating and Saving a `.env` file
I recommend filling out this section to generate a `.env` file and then downloading that file for future use. Then, the next time you want to run this notebook, upload the `.env` file and you will not need to enter your credentials manually.

In [14]:
# Write lines to file

with open('./.env', 'w') as dotenv:
    dotenv.write(f'NUMERAI_PUBLIC_KEY = {getpass("Please enter your Numerai Public Key. You can find your key here: https://numer.ai/submit -> ")}\n')
    dotenv.write(f'NUMERAI_SECRET_KEY = {getpass("Please enter your Numerai Secret Key. You can find your key here: https://numer.ai/submit -> ")}\n')
    dotenv.write(f'NUMERAI_MODEL_ID = {getpass("Please enter your Numerai Model ID. You can find your key here: https://numer.ai/submit -> ")}\n')

Please enter your Numerai Public Key. You can find your key here: https://numer.ai/submit -> ··········
Please enter your Numerai Secret Key. You can find your key here: https://numer.ai/submit -> ··········
Please enter your Numerai Model ID. You can find your key here: https://numer.ai/submit -> ··········


To confirm that this worked, check the output and compare it against the values on https://numer.ai/submit. Make sure no one is looking!

In [15]:
!cat .env

NUMERAI_PUBLIC_KEY = X5ACOQPPFFV7SK2H2CBVX53GHWLYPDCB
NUMERAI_SECRET_KEY = KKID7IKZMUXTL4EGLRIA43ZWV4ELAIVNBZ77XFCPG7257OQHVSW5GPLISU5HUZOP
NUMERAI_MODEL_ID = 51961039-e9f9-493c-b97b-f6f4a3785d90
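
If you would rather not print full keys to the screen, one option is to show only the first few characters of each value. The sketch below uses only the standard library. The function name `masked_env_lines` and the `visible` parameter are hypothetical, and the parser assumes simple `KEY = value` lines like the ones written above (no quoting or multi-line values).

```python
from pathlib import Path


def masked_env_lines(path, visible=4):
    """Return the KEY = value lines of a .env file with each value mostly masked."""
    lines = []
    for raw in Path(path).read_text().splitlines():
        line = raw.strip()
        if not line or line.startswith("#") or "=" not in line:
            continue  # skip blanks, comments, and anything that is not KEY=value
        key, value = (part.strip() for part in line.split("=", 1))
        # Keep the first `visible` characters and replace the rest with asterisks
        masked = value[:visible] + "*" * max(len(value) - visible, 0)
        lines.append(f"{key} = {masked}")
    return lines
```

Running `print("\n".join(masked_env_lines("./.env")))` would then show enough of each key to compare it against https://numer.ai/submit without exposing the whole value.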


If you want to download this file for future use, run the next cell:

In [19]:
from google.colab import files
files.download("./.env")
