Hello and welcome to the Numerai Data Science Tournament!

This notebook is designed to help you build your first machine learning model and start competing in the tournament.

In this notebook we will:
1. Download and explore the Numerai dataset
2. Train our first machine learning model
3. Upload our model to start making live submissions

In [1]:
# Install dependencies
!pip install -q numerapi pandas matplotlib lightgbm cloudpickle pyarrow

# Inline plots
%matplotlib inline

## 1. Dataset  

At a high level, the Numerai dataset is a tabular dataset that describes the stock market over time.

Each row represents a stock at a specific point in time: `id` is the stock ID and `era` is the date. The `features` describe attributes of the stock (e.g. P/E ratio) known on that date, and the `target` is a measure of the stock's future returns relative to that date.

The unique thing about Numerai's dataset is that it is obfuscated: the underlying stock IDs, feature names, and target definitions are anonymized. This lets us give the data away for free, and lets you model it without any financial domain knowledge (or bias).
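
To make the row layout concrete, here is a minimal sketch of a table with the same column roles. The ids, feature names, and values are all made up for illustration; only the roles (`id` index, `era`, `feature_*` columns, `target`) follow the description above.

```python
import pandas as pd

# Toy stand-in for the Numerai layout (every value here is invented):
# one row per stock per era, indexed by an anonymized id.
toy = pd.DataFrame(
    {
        "era": ["0001", "0001", "0002", "0002"],
        "feature_a": [0.00, 0.50, 1.00, 0.25],
        "feature_b": [0.75, 0.25, 0.50, 0.00],
        "target": [0.25, 0.75, 0.50, 0.50],
    },
    index=pd.Index(["n0001", "n0002", "n0003", "n0004"], name="id"),
)

# Split the columns by role: model inputs (X) and the value to predict (y)
feature_names = [c for c in toy.columns if c.startswith("feature_")]
X = toy[feature_names]
y = toy["target"]

print(feature_names)  # ['feature_a', 'feature_b']
print(X.shape, y.shape)  # (4, 2) (4,)
```

Because the names carry no meaning, a model can only rely on the numeric patterns in `X`, which is the point of the obfuscation.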

Let's download the historical training data and take a closer look.

In [2]:
# Initialize NumerAPI - the official Python API client for Numerai
from numerapi import NumerAPI
napi = NumerAPI()

# Print all files available for download in the latest v4.1 dataset
[f for f in napi.list_datasets() if f.startswith("v4.1")]

['v4.1/features.json',
 'v4.1/live.parquet',
 'v4.1/live_example_preds.csv',
 'v4.1/live_example_preds.parquet',
 'v4.1/live_int8.parquet',
 'v4.1/meta_model.parquet',
 'v4.1/train.parquet',
 'v4.1/train_int8.parquet',
 'v4.1/validation.parquet',
 'v4.1/validation_example_preds.csv',
 'v4.1/validation_example_preds.parquet',
 'v4.1/validation_int8.parquet']
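
If you only want some of these files, you can filter the list further. The sketch below works on a hard-coded subset of the names printed above, so it runs without network access; `pick_files` is a hypothetical helper, not part of NumerAPI.

```python
# A subset of the file names from the listing above
available = [
    "v4.1/features.json",
    "v4.1/live.parquet",
    "v4.1/train.parquet",
    "v4.1/train_int8.parquet",
    "v4.1/validation.parquet",
    "v4.1/validation_int8.parquet",
]


def pick_files(files, version, suffix):
    """Return the files under `version/` whose names end with `suffix`."""
    prefix = version + "/"  # the trailing slash avoids matching e.g. "v4.10/"
    return [f for f in files if f.startswith(prefix) and f.endswith(suffix)]


int8_files = pick_files(available, "v4.1", "_int8.parquet")
print(int8_files)  # ['v4.1/train_int8.parquet', 'v4.1/validation_int8.parquet']
```

Matching on `version + "/"` rather than the bare version string is a small safeguard in case a later version shares the same prefix.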

In [5]:
import pandas as pd
import json

# Download the training data and feature metadata
# This may take a few minutes 🍵
napi.download_dataset("v4.1/train.parquet");
napi.download_dataset("v4.1/features.json");

# Load the training data but only the "small" subset of features to save time and memory
# In practice you will want to use all the features to maximize your model's performance
with open("v4.1/features.json") as f:
    feature_metadata = json.load(f)
feature_cols = feature_metadata["feature_sets"]["small"]  # TODO: switch to the full feature set
training_data = pd.read_parquet("v4.1/train.parquet", columns=["era"] + feature_cols + ["target"])

2023-08-09 13:52:05,335 INFO numerapi.utils: target file already exists
2023-08-09 13:52:05,336 INFO numerapi.utils: download complete
2023-08-09 13:52:06,305 INFO numerapi.utils: target file already exists
2023-08-09 13:52:06,307 INFO numerapi.utils: download complete
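
The column selection in the cell above can be wrapped in a small helper so that switching feature sets is a one-word change. This is a sketch under assumptions: `toy_metadata` is a made-up, heavily trimmed stand-in for `features.json` (only the `"feature_sets"` and `"small"` keys come from the cell above; `"medium"` and the feature names are invented), and `columns_for` is a hypothetical helper.

```python
import json

# Invented miniature of features.json; the real file is much larger.
toy_metadata = json.loads("""
{
  "feature_sets": {
    "small": ["feature_a", "feature_b"],
    "medium": ["feature_a", "feature_b", "feature_c"]
  }
}
""")


def columns_for(metadata, set_name):
    """Build the column list to pass to pd.read_parquet(columns=...)."""
    feature_sets = metadata["feature_sets"]
    if set_name not in feature_sets:
        raise KeyError(
            f"unknown feature set {set_name!r}; available: {sorted(feature_sets)}"
        )
    return ["era"] + feature_sets[set_name] + ["target"]


cols = columns_for(toy_metadata, "small")
print(cols)  # ['era', 'feature_a', 'feature_b', 'target']
```

Reading only these columns keeps memory use proportional to the chosen feature set rather than to the whole file.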




In [7]:
# Print the training data
pd.set_option('display.max_rows', 10)
pd.set_option('display.max_columns', 20)
print(training_data.shape)  # expect 34 columns for the small feature set (era + features + target)
training_data

id,era,feature_bijou_penetrant_syringa,feature_burning_phrygian_axinomancy,feature_coraciiform_sciurine_reef,feature_corporatist_seborrheic_hopi,feature_cyclopedic_maestoso_daguerreotypist,feature_distressed_bloated_disquietude,feature_ecstatic_foundational_crinoidea,feature_elaborate_intimate_bor,feature_entopic_interpreted_subsidiary,...,feature_tragical_rainbowy_seafarer,feature_ugrian_schizocarpic_skulk,feature_undisguised_unenviable_stamen,feature_undrilled_wheezier_countermand,feature_unpainted_censual_pinacoid,feature_unreproved_cultish_glioma,feature_unsizable_ancestral_collocutor,feature_unswaddled_inenarrable_goody,feature_unventilated_sollar_bason,target
n003bba8a98662e4,0001,0.00,0.00,0.50,1.00,0.50,0.50,0.00,0.50,0.50,...,0.00,0.00,1.00,0.75,1.00,1.00,0.50,0.25,0.00,0.25
n003bee128c2fcfc,0001,0.50,0.75,0.50,0.25,0.50,0.75,1.00,0.75,1.00,...,1.00,0.25,0.25,0.25,1.00,0.25,0.50,0.75,0.25,0.75
n0048ac83aff7194,0001,0.25,0.25,1.00,0.75,0.75,0.75,0.75,0.75,0.00,...,0.50,1.00,1.00,0.75,0.00,0.25,1.00,0.75,1.00,0.25
n00691bec80d3e02,0001,0.75,0.75,0.75,0.25,0.25,0.00,0.00,0.00,0.25,...,0.75,0.75,0.25,0.50,1.00,0.00,0.50,0.50,0.75,0.75
n00b8720a2fdc4f2,0001,0.00,0.00,0.00,1.00,0.25,0.25,0.75,0.25,0.00,...,0.75,0.00,1.00,0.25,1.00,0.00,0.00,1.00,0.00,0.50
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
nffcc1dbdf2212e6,0574,1.00,0.25,0.75,1.00,0.75,1.00,0.25,1.00,0.00,...,0.50,1.00,0.75,0.75,0.75,0.25,1.00,0.50,1.00,0.75
nffd71b7f6a128df,0574,0.75,0.00,0.50,0.50,0.00,0.25,0.00,0.00,0.00,...,0.75,0.50,0.00,0.00,0.00,1.00,0.25,0.25,0.50,0.25
nffde3b371d67394,0574,0.75,0.00,1.00,1.00,0.75,1.00,0.00,1.00,0.25,...,0.25,1.00,1.00,1.00,0.50,0.50,1.00,1.00,1.00,0.25
nfff1a1111b35e84,0574,0.25,0.75,0.00,0.00,1.00,1.00,0.50,1.00,0.75,...,1.00,0.25,0.00,0.00,0.25,0.25,0.00,1.00,0.00,0.50
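
In the preview above, eras run from `0001` to `0574` and every visible value sits on a 0.25 grid. A few quick summaries make such patterns easy to check. The sketch below runs on a made-up miniature table (`toy`, `feature_x`, and all values are invented) so it is self-contained; the same calls should work on `training_data`, assuming it has the `era` and `target` columns loaded earlier.

```python
import pandas as pd

# Invented miniature with the same column roles as the table above
toy = pd.DataFrame(
    {
        "era": ["0001", "0001", "0001", "0574", "0574"],
        "feature_x": [0.00, 0.50, 0.25, 1.00, 0.75],
        "target": [0.25, 0.75, 0.25, 0.75, 0.25],
    },
    index=pd.Index(["n1", "n2", "n3", "n4", "n5"], name="id"),
)

# How many rows (stocks) does each era contain?
rows_per_era = toy.groupby("era").size()

# Which distinct values does a feature take?
feature_levels = sorted(toy["feature_x"].unique().tolist())

# Average target within each era
target_mean_by_era = toy.groupby("era")["target"].mean()

print(rows_per_era.to_dict())  # {'0001': 3, '0574': 2}
print(feature_levels)  # [0.0, 0.25, 0.5, 0.75, 1.0]
print(target_mean_by_era.round(3).to_dict())
```

Grouping by `era` treats each date as its own cross-section of stocks, which matches how the dataset is described earlier in this section.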
