# QuickStarter for the DataCrunch tournament 

## Basic steps and workflow:

0. Using this notebook

1. Download data

2. Explore data

3. Select a model

4. Scoring

5. Train / validation split

6. Train your model 

7. Make prediction

8. Submit

---

## 0. Using this notebook 

To execute a cell, press `shift+enter`.

Follow the steps and log in with your Google account.

In [None]:
# Install the crunchDAO package - credit to @uuazed. Check here: https://github.com/uuazed/crunchdao
!pip install --upgrade crunchdao

In [None]:
# Lib & Dependencies
import pandas as pd
import numpy as np
import xgboost as xgb
import scipy

import seaborn as sns
import matplotlib.pyplot as plt

import requests
import gc

import crunchdao

Paste <u>your</u> API key here. If you don't have one, go to the API management section of your account: https://account.crunchdao.com/account/api

In [None]:
client = crunchdao.Client(apikey="") # <= Your API key here

In [None]:
# Get the configuration of the current dataset
client.dataset_config()

---

## 1. Download data

Each week we will provide you with four DataFrames:

- X_train contains the features used for training;
- y_train contains the targets used for training;
- X_test contains the features for which you will submit your predictions;
- example_submission contains an example of the expected submission.

There are 4 targets you need to predict: target_w, target_r, target_g, and target_b.

You can download the data in either *.csv* or *.parquet* format.
The *.csv* files take longer to download and use more RAM.

In [None]:
# Choose a file format: parquet or csv
file_format = 'parquet'

# Download current dataset
client.download_data(directory=".", file_format=file_format)

In [None]:
if file_format == 'parquet':
  # Data for training
  train_features = pd.read_parquet(f'./X_train.{file_format}')
  # Data for which you will submit your predictions
  test_data = pd.read_parquet(f'./X_test.{file_format}')
  # Targets used for your supervised training
  train_targets = pd.read_parquet(f'./y_train.{file_format}')
  # Example of an expected submission
  example_submission = pd.read_parquet(f'./example_submission.{file_format}')
elif file_format == 'csv':
  # Data for training
  train_features = pd.read_csv(f'./X_train.{file_format}')
  # Data for which you will submit your predictions
  test_data = pd.read_csv(f'./X_test.{file_format}')
  # Targets used for your supervised training
  train_targets = pd.read_csv(f'./y_train.{file_format}')
  # Example of an expected submission
  example_submission = pd.read_csv(f'./example_submission.{file_format}')

In [None]:
# Merge train_features and train_targets for ease of use
train_data = pd.merge(train_features, train_targets, on=['id', 'Moons'], how='inner')

del train_features, train_targets
gc.collect()

In [None]:
# Get the features columns name and the targets columns name
features = [col for col in train_data.columns if 'Feature' in col]
targets = [col for col in train_data.columns if 'target' in col]

In [None]:
display(train_data.head())
display(test_data.head())

---

## 2. Explore Data

Observe your data and carefully prepare what you will give to your model for training.

In [None]:
display(train_data.describe())

### 2.1. Correlation matrix

In [None]:
def show_corr(x):
    # Absolute Pearson correlation between the numeric columns
    corr = x.select_dtypes('number').corr().abs().to_numpy()
    plt.figure(figsize=(15, 12))
    sns.heatmap(corr, vmin=0, vmax=1, cmap="cividis")
    plt.show()

# Show the correlation matrix of the last moon
train_data_last = train_data[train_data['Moons'] == train_data['Moons'].max()]
show_corr(train_data_last)
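
A heatmap gives the overall picture, but it can be hard to read off which columns are most strongly related. The sketch below is one possible complement, not part of the original notebook: it lists the pairs of columns with the highest absolute Pearson correlation. The helper name `top_correlated_pairs` and the toy columns `a`, `b`, `c` are illustrative assumptions.

```python
import numpy as np
import pandas as pd

def top_correlated_pairs(df, n=5):
    """Return the n column pairs with the highest absolute Pearson correlation."""
    corr = df.corr().abs()
    # Keep only the strict upper triangle so each pair appears once
    upper = np.triu(np.ones(corr.shape, dtype=bool), k=1)
    pairs = corr.where(upper).stack().dropna()
    return pairs.sort_values(ascending=False).head(n)

# Toy data with illustrative column names: b is almost a multiple of a
toy = pd.DataFrame({
    "a": [1.0, 2.0, 3.0, 4.0, 5.0],
    "b": [2.0, 4.0, 6.0, 8.0, 10.5],
    "c": [5.0, 1.0, 4.0, 2.0, 3.0],
})
top_pairs = top_correlated_pairs(toy, n=2)
print(top_pairs)
```

On the tournament data, something like `top_correlated_pairs(train_data_last[features])` should work, assuming all feature columns are numeric.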

### 2.2. Data distribution 

In [None]:
# Plot the distribution of the first 9 features (30 bins each)
hist = train_data[features].iloc[:, :9].hist(bins=30, figsize=(20,20))

In [None]:
# Plot the distribution of each target (30 bins each)
hist = train_data[targets].hist(bins=30, figsize=(20,20))

### 2.3. Number of stocks over time

In [None]:
# Show the number of stocks in each moon 
train_data.groupby('Moons', group_keys=False).id.count().plot(figsize=(20, 5), grid=True)

---

## 3. Model

In [None]:
def xg_boost_template(X_train, y_train, X_val, y_val, val_refs, target):
    model = xgb.XGBRegressor(objective='reg:squarederror', max_depth=3, learning_rate=0.01, n_estimators=50, n_jobs=-1, colsample_bytree=0.5)
    model.fit(X_train, y_train[target], verbose=True)

    # Evaluate the Spearman score of your model on the validation data
    preds = pd.DataFrame(model.predict(X_val), columns=[target])
    get_spearman_results(preds, y_val, val_refs)

    return model

---

## 4. Scoring: Spearman's rank correlation of your predictions vs the targets

In [None]:
def get_spearman_results(preds, y_val, val_refs):
    preds.rename({f'{target}':f'pred_{target.split("_")[1]}' for target in preds.columns}, axis=1, inplace=True)
    preds_ref = pd.concat([preds.reset_index(drop=True), val_refs.reset_index(drop=True), y_val.reset_index(drop=True)], axis=1)
    spearman = pd.DataFrame()
    
    target_suffixes = [col.split('_')[-1] for col in preds.columns if 'pred' in col]
    for suffix in target_suffixes:
        spearman[f'target_{suffix}'] = preds_ref.groupby('Moons')[[f'pred_{suffix}', f'target_{suffix}']].corr(method='spearman').unstack().iloc[:,1]

    print(f'\nSpearman score over the period :\n{spearman.describe()}\n')
    return spearman
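
To make the metric concrete, here is a minimal sketch on toy data, separate from the scoring function above. It relies on the fact that Spearman's rank correlation equals the Pearson correlation of the ranks, and computes it once per moon. The helper name `spearman_per_moon` and the column `pred_w` in the toy frame are illustrative assumptions.

```python
import pandas as pd

def spearman_per_moon(df, pred_col, target_col, moon_col="Moons"):
    """Spearman correlation per moon, computed as the Pearson correlation of ranks."""
    return df.groupby(moon_col)[[pred_col, target_col]].apply(
        lambda g: g[pred_col].rank().corr(g[target_col].rank())
    )

toy_scores = pd.DataFrame({
    "Moons":    [0, 0, 0, 0, 1, 1, 1, 1, 2, 2, 2, 2],
    # Moon 0: same ordering as the target, moon 1: reversed, moon 2: one swap
    "pred_w":   [0.1, 0.2, 0.3, 0.4, 0.4, 0.3, 0.2, 0.1, 0.1, 0.3, 0.2, 0.4],
    "target_w": [0.0, 0.25, 0.5, 0.75, 0.0, 0.25, 0.5, 0.75, 0.0, 0.25, 0.5, 0.75],
})
per_moon = spearman_per_moon(toy_scores, "pred_w", "target_w")
print(per_moon)  # moon 0 -> 1.0, moon 1 -> -1.0, moon 2 -> 0.8
```

Only the ordering of the predictions within a moon matters: any strictly increasing transformation of `pred_w` leaves these scores unchanged.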

---

## 5. Embargoed Train / Test split

In [None]:
def train_test_split(data):
    # NOTE: the thresholds below assume Moons are consecutive integers starting at 0
    number_of_moons = len(data['Moons'].unique())
    embargo = 13 # Number of moons dropped between the train and test sets
    proportion = 0.8

    # Train on the first 80% of the moons (minus the embargo) and test on the last 20%
    train_set = data[data['Moons'] < int(number_of_moons * proportion) - embargo]
    test_set = data[data['Moons'] > int(number_of_moons * proportion)]

    X_train = train_set[features]
    y_train = train_set[targets]
    X_test = test_set[features]
    y_test = test_set[targets]
    test_refs = test_set.iloc[:, :2]

    return X_train, y_train, X_test, y_test, test_refs
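
The function above compares moon values directly with a count-based threshold. As a hedged alternative, the sketch below splits on the sorted list of distinct moons, so it does not depend on how the moons are numbered. It is a variant for illustration, not the notebook's own split: the helper name `embargoed_split` is an assumption, and the moon at the split index is placed in the validation set here.

```python
import numpy as np
import pandas as pd

def embargoed_split(data, proportion=0.8, embargo=13, moon_col="Moons"):
    """Split by moon, leaving `embargo` unused moons between train and validation."""
    moons = np.sort(data[moon_col].unique())
    split_idx = int(len(moons) * proportion)
    train_moons = moons[: max(split_idx - embargo, 0)]
    val_moons = moons[split_idx:]
    train = data[data[moon_col].isin(train_moons)]
    val = data[data[moon_col].isin(val_moons)]
    return train, val

# Toy data: 100 moons numbered 100..199, three rows per moon
toy_moons = pd.DataFrame({"Moons": np.repeat(np.arange(100, 200), 3)})
toy_moons["Feature_1"] = np.linspace(0.0, 1.0, len(toy_moons))

train_part, val_part = embargoed_split(toy_moons)
gap = val_part["Moons"].min() - train_part["Moons"].max() - 1
print(train_part["Moons"].max(), val_part["Moons"].min(), gap)
```

The embargo keeps a buffer of moons between the two sets, so validation rows are not immediately adjacent in time to the rows used for training.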

---

## 6. Supervised training of a simple XGBoost

In [None]:
# Split your data to validate your model
X_train, y_train, X_val, y_val, val_refs = train_test_split(train_data)

In [None]:
# Run your model on the different targets
model = {}
for target in targets:
  model[f'xgb_model_{target}'] = xg_boost_template(X_train, y_train, X_val, y_val, val_refs, target)

---

## 7. Make prediction on the 4 targets

When you feel that your model is accurate enough, it's time to predict the targets and submit your results.

Predict on the 4 targets, concatenate the answers and submit.

1. **WARNING** Make sure that your columns are named id, Moons, target_w, target_r, etc.

2. **WARNING** Your predictions must be in [0, 1].

3. **WARNING** Don't submit constant values.

4. **WARNING** Include the id and Moons columns in your submission.

In [None]:
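# Copy the id and Moons columns so that adding the targets does not modify a slice of test_data
prediction = test_data.iloc[:, :2].copy()
for target in targets:
    # Predict on the same feature columns the models were trained on
    prediction.loc[:, target] = model[f'xgb_model_{target}'].predict(test_data[features])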
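
The output of a regressor is not guaranteed to fall in [0, 1]. One possible way to enforce the range, sketched below as an assumption rather than as the tournament's recommended method, is to replace each prediction by its percentile rank within its moon. This keeps the ordering of the predictions inside each moon, which is what a rank correlation such as Spearman's depends on. The helper name `rank_scale` and the toy values are illustrative.

```python
import pandas as pd

def rank_scale(pred, target_cols, moon_col="Moons"):
    """Replace each target column by its within-moon percentile rank, in (0, 1]."""
    scaled = pred.copy()
    for col in target_cols:
        scaled[col] = scaled.groupby(moon_col)[col].rank(pct=True)
    return scaled

# Toy predictions, some of them outside [0, 1]
toy_pred = pd.DataFrame({
    "id": ["a", "b", "c", "d", "e"],
    "Moons": [0, 0, 0, 1, 1],
    "target_w": [-0.2, 1.5, 0.3, 0.7, 0.1],
})
scaled_pred = rank_scale(toy_pred, target_cols=["target_w"])
print(scaled_pred)
```

With the notebook's variables this would be `prediction = rank_scale(prediction, targets)`. Note that a moon whose predictions are all identical stays constant after this transform, so it does not fix the constant-values problem.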

**Check your submission file**

In [None]:
prediction

In [None]:
def check_columns_name(df, sub):
    if sub.columns.tolist() != df.columns.tolist():
        raise Exception('Column names are different from what is expected')

def check_nans(sub):
    if sub.isna().sum().sum() > 0:
        raise Exception('NaNs detected')

def check_values(sub, targets):
    for target in targets:
        if (sub.loc[:, target].values > 1).any() or (sub.loc[:, target].values < 0).any():
            raise Exception('Values are not between 0 and 1')

def check_moons(df, sub):
    if set(df['Moons'].unique()) != set(sub['Moons'].unique()):
        raise Exception('Moons are different from what is expected')

def check_ids(df, sub, moon):
    if not set(sub[sub['Moons'] == moon]['id'].unique()) == set(df[df['Moons'] == moon]['id'].unique()):
        raise Exception('The ids differ from what is expected on at least one moon')

def check_constants(sub, moon, targets):
    for target in targets:
        if sub[sub['Moons'] == moon][target].nunique() == 1:
            raise Exception('Constant values have been detected on a moon')

try:
    check_columns_name(example_submission, prediction)
    check_nans(prediction)
    check_values(prediction, targets)
    check_moons(example_submission, prediction)
    for moon in prediction['Moons'].unique():
        check_ids(example_submission, prediction, moon)
        check_constants(prediction, moon, targets)
    print('Submission: OK')
except Exception as e:
    print(f'Error: {e}')

---

## 8. Submit predictions

In [None]:
# Upload predictions
submission_id = client.upload(prediction)

In [None]:
# Set a comment on the submission, e.g. to remember which model produced it
client.set_comment(submission_id, "quickstart model")

In [None]:
# Download your prediction file if you prefer submitting through the website (Google Colab only)
from google.colab import files
prediction.to_csv("prediction.csv", index=False)
files.download('prediction.csv')

In [None]:
# Check your past submissions
client.submissions(
    user_id=None, # None defaults to your own user id
    round_num=None # None returns the submissions of all rounds
    )

---

## Useful links

**Website**

- https://www.crunchdao.com/

**Social media**
- discord : https://discord.gg/9wvzxS7A (come say hi! 😉)
- twitter : https://twitter.com/CrunchDAO
- linkedin : https://www.linkedin.com/company/crunchdao-com/
- reddit : https://www.reddit.com/r/crunchdao/

**Documentation**

- https://docs.crunchdao.com/tournament/getting-started

**Github**

- https://github.com/crunchdao/
- https://github.com/uuazed/crunchdao

**DeSci - research framework**

- https://desci.crunchdao.com/projects/crunchdao