## First Numerai Submission

This notebook follows the Numerai [example script](https://github.com/numerai/example-scripts/blob/master/example_model.py).

In [4]:
# import dependencies
import pandas as pd
from lightgbm import LGBMRegressor
import gc
import json
from pathlib import Path
from numerapi import NumerAPI

## 1. Download Tournament Items and Data

### Setup API and Access Tokens

In [18]:
# access api tokens
with open('/Users/akg/.secret/numerai/numerai-keys.json', 'r') as f:
    keys = json.load(f)
    
# pull out public and secret key
pub_key = keys['public-key']
secret = keys['secret-key']

# setup api
napi = NumerAPI(pub_key, secret)

### Details on Current Round

In [12]:
# get current round details
current_round = napi.get_current_round()
print(f'Current round: {current_round}')

Current round: 411


In [13]:
# get current leaderboard
leaderboard = napi.get_leaderboard()

In [14]:
# check if a new round has started
if napi.check_new_round():
    print('New round has started within the last 12 hours!')
else: 
    print('No new round within last 12 hours')

New round has started within the last 12 hours!


### Download Data Files

Tournament data changes every week, so we include the round number in the name of the live file. Training and validation data change only periodically, so there is no need to download them every time.
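One way to avoid re-downloading the large training and validation files is to check whether the local copy already exists. The sketch below is a minimal illustration, not part of the example script: `download_if_missing` is a hypothetical helper, and it takes the download function as an argument so it can be tried without API keys. Depending on the numerapi version, `download_dataset` may already skip files that exist, in which case this only makes the check explicit.

```python
from pathlib import Path
from typing import Callable


def download_if_missing(download: Callable[[str, str], object],
                        remote_name: str,
                        local_path: str) -> bool:
    """Call download(remote_name, local_path) only if local_path is absent.

    Returns True when a download was triggered, False when it was skipped.
    (Hypothetical helper, not part of numerapi.)
    """
    target = Path(local_path)
    if target.exists():
        # A local copy is already present, so skip the download.
        return False
    # Make sure the destination folder exists before downloading.
    target.parent.mkdir(parents=True, exist_ok=True)
    download(remote_name, local_path)
    return True
```

With the `napi` object created earlier, a call could look like `download_if_missing(napi.download_dataset, "v4.1/train.parquet", "v4.1/train.parquet")`. The live file would still be downloaded under its round-specific name each round.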

In [19]:
# download datasets to specific subfolder
print('Downloading dataset files...')
Path("./v4.1").mkdir(parents=False, exist_ok=True)

napi.download_dataset("v4.1/train.parquet")
napi.download_dataset("v4.1/validation.parquet")
napi.download_dataset("v4.1/live.parquet", f"v4.1/live_{current_round}.parquet")
napi.download_dataset("v4.1/validation_example_preds.parquet")
napi.download_dataset("v4.1/features.json")

napi.download_dataset("v4.1/live_example_preds.parquet")
napi.download_dataset("v4.1/meta_model.parquet")

Downloading dataset files...


2023-02-01 13:47:26,828 INFO numerapi.utils: starting download
v4.1/train.parquet:  15%|█▌        | 218M/1.45G [00:08<00:44, 27.6MB/s] 

KeyboardInterrupt: 

v4.1/train.parquet:  15%|█▌        | 221M/1.45G [00:20<00:44, 27.6MB/s]

## 2. Read Feature Metadata

In [4]:
# load the feature metadata so that only a subset of columns is read later
# (choose one of the predefined feature sets, or all features)
with open('v4.1/features.json', 'r') as f: 
    feature_metadata = json.load(f)

# all_features = list(feature_metadata["feature_stats"].keys()) 
# small_feature_set = feature_metadata["feature_sets"]["small"]
# medium_feature_set = feature_metadata["feature_sets"]["medium"]
features = feature_metadata['feature_sets']['medium']

# read in just those features along with the era, data_type and target columns
read_columns = features + ['era', 'data_type', 'target_nomi_v4_20']
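The column selection above can be wrapped in a small function so that switching between feature sets is a one-argument change. This is only a sketch: `build_read_columns` is a hypothetical helper, and the toy dictionary in the usage note stands in for the real `features.json`, with made-up feature names.

```python
def build_read_columns(feature_metadata: dict,
                       feature_set: str = "medium",
                       target: str = "target_nomi_v4_20") -> list:
    """Return the chosen feature names plus the era, data_type and target columns.

    (Hypothetical helper; assumes the metadata has a "feature_sets" mapping
    as used in the cell above.)
    """
    available = feature_metadata["feature_sets"]
    if feature_set not in available:
        raise KeyError(
            f"unknown feature set {feature_set!r}; choose from {sorted(available)}"
        )
    # Copy the list so the metadata itself is never modified.
    return list(available[feature_set]) + ["era", "data_type", target]
```

For example, with `toy = {"feature_sets": {"small": ["feature_a"], "medium": ["feature_a", "feature_b"]}}`, calling `build_read_columns(toy, "small")` gives the single toy feature followed by `era`, `data_type` and the target column.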

## 3. Read Data from Downloaded Parquet Files

Reading a downloaded file sometimes raises an error about invalid Parquet magic bytes, which usually means the file is corrupted or was only partially downloaded. If this occurs, delete the file and rerun `napi.download_dataset` to fetch a fresh copy.
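A quick way to spot a truncated download before calling `pd.read_parquet` is to check the magic bytes directly: a Parquet file starts and ends with the 4 bytes `PAR1`. The sketch below uses a hypothetical helper, `looks_like_parquet`, and only checks those markers. It does not prove the file is fully valid, but an interrupted download will normally fail the check on the trailing bytes.

```python
import os
from pathlib import Path

PARQUET_MAGIC = b"PAR1"  # Parquet files begin and end with these 4 bytes


def looks_like_parquet(path: str) -> bool:
    """Return True if the file starts and ends with the Parquet magic bytes.

    (Hypothetical helper; a cheap sanity check, not a full validation.)
    """
    file = Path(path)
    # A missing file, or one too short to hold both markers, cannot be valid.
    if not file.is_file() or file.stat().st_size < 2 * len(PARQUET_MAGIC):
        return False
    with file.open("rb") as f:
        head = f.read(len(PARQUET_MAGIC))
        f.seek(-len(PARQUET_MAGIC), os.SEEK_END)
        tail = f.read(len(PARQUET_MAGIC))
    return head == PARQUET_MAGIC and tail == PARQUET_MAGIC
```

If `looks_like_parquet("v4.1/train.parquet")` returns `False`, deleting the file and downloading it again is the fix described above.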

In [7]:
# read data from parquet files
training_data = pd.read_parquet('v4.1/train.parquet', columns=read_columns)
validation_data = pd.read_parquet('v4.1/validation.parquet', columns=read_columns)
live_data = pd.read_parquet(f'v4.1/live_{current_round}.parquet', columns=read_columns)

## 4. Get Per Era Correlations