[![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://githubtocolab.com/crunchdao/datacrunch-notebooks/blob/master/python/quickstarter_notebook.ipynb)

# QuickStarter for the DataCrunch tournament 

## Basic steps and workflow:

0. Using this notebook

1. Download data

2. Explore data

3. Select a model

4. Scoring

5. Train / validation split

6. Train your model 

7. Make prediction

8. Submit

---

## 0. Using this notebook 

To execute the cell press `shift+enter`. 

Follow the steps and login with your Google account.

In [1]:
# Install the crunchDAO package - credit to @uuazed. Check here: https://github.com/uuazed/crunchdao
!pip install --upgrade crunchdao



In [111]:
# Lib & Dependencies
import pandas as pd
import numpy as np
import xgboost as xgb
import scipy

import seaborn as sns
import matplotlib.pyplot as plt

import requests
import gc

import crunchdao
import lightgbm as lgb
from sklearn.model_selection import GridSearchCV

Paste <u>your</u> API key here. If you don't have one, go to the API management section of your account: https://account.crunchdao.com/account/api

In [3]:
client = crunchdao.Client(apikey="TFmGZ3pLNRllYyl8wGj0gO0IzskBY8v10kjVedWJBe2s9hSstzbyDwcDsrkpIqYMxO7pULSNsVIAvQeKUoxZwQAnzLVGqhYg3uEO5UYQ5oMJMAe0G18vwT4ksdh3AewF") # <= Your API key here

In [4]:
# Get the configuration of the current dataset
client.dataset_config()

{'round': {'id': 156,
  'number': 2,
  'absoluteNumber': 131,
  'start': '2023-06-23T17:00:00',
  'end': '2023-06-27T09:00:00',
  'batch': {'id': 50,
   'number': 32,
   'absoluteNumber': 34,
   'start': '2023-06-18',
   'end': '2023-07-12',
   'hackathon': False,
   'open': True,
   'over': False},
  'dataset': {'id': 11,
   'name': 'master',
   'description': 'all in one',
   'hidden': False,
   'leaderboardDisabled': False},
  'updated': True,
  'periods': {'white': 'P8D', 'red': 'P30D', 'green': 'P60D', 'blue': 'P90D'},
  'inception': '2023-06-22',
  'forcedStart': None,
  'moonsDuration': 'P7D',
  'negativePrevented': False,
  'published': True,
  'threadPoolSize': 4,
  'minimumDaysForUsableTarget': 5,
  'universeFile': 'hash_table',
  'offset': 'P0D',
  'benchmark': '^RUI',
  'columnSuffix': '',
  'metric': 'SPEARMAN',
  'targetType': 'ALPHA_V5',
  'open': False,
  'over': True,
  'batchId': 50,
  'datasetId': 11,
  'scoringStart': '2023-06-27'},
 'live': True,
 'forced_start': N

---

## 1. Download data

Each week we will provide you with four DataFrames:

- X_train contains the features;
- y_train contains the targets;
- X_test contains the features you can use in your models;
- example_prediction contains an example of the submission awaited.

There are 4 targets you need to predict: target_w, target_r, target_g, and target_b.

You can either download the data in the *.csv* or *.parquet* extension.
The *.csv* will take longer to download and take up more space in the RAM.

In [5]:
# Chose a file format between parquet and csv 
file_format = 'parquet'

# Download current dataset
client.download_data(directory=".", file_format=file_format)

['./X_train.parquet',
 './y_train.parquet',
 './X_test.parquet',
 './example_submission.parquet']

In [6]:
if file_format == 'parquet':
  # Data for training
  train_features = pd.read_parquet(f'./X_train.{file_format}')
  # Data for which you will submit your prediction
  test_data = pd.read_parquet(f'./X_test.{file_format}')
  # Targets use for your supervised training
  train_targets = pd.read_parquet(f'./y_train.{file_format}')
  # Exemple of an awaited submission
  example_submission = pd.read_parquet(f'./example_submission.{file_format}')
elif file_format == 'csv':
  # Data for training
  train_features = pd.read_csv(f'./X_train.{file_format}')
  # Data for which you will submit your prediction
  test_data = pd.read_csv(f'./X_test.{file_format}')
  # Targets use for your supervised training
  train_targets = pd.read_csv(f'./y_train.{file_format}')
  # Exemple of an awaited submission
  example_submission = pd.read_csv(f'./example_submission.{file_format}')

In [7]:
# Merge train_features and train_targets for ease of use
train_data = pd.merge(train_features, train_targets, on=['id', 'Moons'], how='inner')

del train_features, train_targets
gc.collect()

0

In [8]:
# Get the features columns name and the targets columns name
features = [col for col in train_data.columns if 'Feature' in col]
targets = [col for col in train_data.columns if 'target' in col]

---

## 2. Explore Data

For a discussion on this, see the exploratory data analysis notebook [here](https://github.com/crunchdao/datacrunch-notebooks/blob/master/python/advanced_exploratory_data_analysis.ipynb).

---

## 3. Model

In [109]:


def xg_boost_template(X_train, y_train, X_val, y_val, val_refs, target):
    model = xgb.XGBRegressor(objective='reg:squarederror', max_depth=9, learning_rate=0.02, n_estimators=11, n_jobs=-1, colsample_bytree=1, subsample=1, min_child_weight=14, colsample_bylevel= 1, gamma=0)
    model.fit(X_train, y_train[target], verbose=True)

    # Test the spearman of your model on the X_test data
    preds = pd.DataFrame(model.predict(X_val), columns=[target])
    get_spearman_results(preds, y_val, val_refs)

    return model

def xg_boost_template_ex(X_train, y_train, X_val, y_val, val_refs, target, estimator):
    model = xgb.XGBRegressor(objective='reg:squarederror', max_depth=9, learning_rate=0.02, n_estimators=estimator, n_jobs=-1, colsample_bytree=1, subsample=1, min_child_weight=14, colsample_bylevel= 1, gamma=0)
    model.fit(X_train, y_train[target], verbose=True)

    # Test the spearman of your model on the X_test data
    preds = pd.DataFrame(model.predict(X_val), columns=[target])
    get_spearman_results_ex(preds, y_val, val_refs, estimator)

    return model

def LGBM(X_train, y_train, X_val, y_val, val_refs, target):
    model = lgb.LGBMRegressor(learning_rate=0.01,max_depth=-5,random_state=42)
    model.fit(X_train,y_train[target],eval_metric='rmse')
    
    preds = pd.DataFrame(model.predict(X_val), columns=[target])
    get_spearman_results(preds, y_val, val_refs)

    return model

---

## 4. Scoring: Spearman's rank correlation of your predictions vs the targets

In [108]:
def get_spearman_results(preds, y_val, val_refs):
    preds.rename({f'{target}':f'pred_{target.split("_")[1]}' for target in preds.columns}, axis=1, inplace=True)
    preds_ref = pd.concat([preds.reset_index(drop=True), val_refs.reset_index(drop=True), y_val.reset_index(drop=True)], axis=1)
    spearman = pd.DataFrame()
    
    target_suffixes = [col.split('_')[-1] for col in preds.columns if 'pred' in col]
    for suffix in target_suffixes:
        spearman[f'target_{suffix}'] = preds_ref.groupby('Moons')[[f'pred_{suffix}', f'target_{suffix}']].corr(method='spearman').unstack().iloc[:,1]

    print(f'\nSpearman score over the period :\n{spearman.describe()}\n')
    return spearman

def get_spearman_results_ex(preds, y_val, val_refs, estimator):
    preds.rename({f'{target}':f'pred_{target.split("_")[1]}' for target in preds.columns}, axis=1, inplace=True)
    preds_ref = pd.concat([preds.reset_index(drop=True), val_refs.reset_index(drop=True), y_val.reset_index(drop=True)], axis=1)
    spearman = pd.DataFrame()
    
    target_suffixes = [col.split('_')[-1] for col in preds.columns if 'pred' in col]
    for suffix in target_suffixes:
        spearman[f'target_{suffix}'] = preds_ref.groupby('Moons')[[f'pred_{suffix}', f'target_{suffix}']].corr(method='spearman').unstack().iloc[:,1]
    print(f'\nSpearman for {estimator} estimators score over the period :\n{spearman.describe()}\n')
    return spearman

---

## 5. Embargoed Train / Test split

In [81]:
def train_test_split(data):
    number_of_moons = len(data['Moons'].unique())
    embargo = 13 # Embargo between train and test set
    proportion = 0.8

    # Train on 80% of the first moons and test on 20% of the last moons
    train_set = data[data['Moons'] < int(number_of_moons * proportion) - embargo]
    test_set = data[data['Moons'] > int(number_of_moons * proportion)]

    X_train = train_set[features]
    y_train = train_set[targets]
    X_test = test_set[features]
    y_test = test_set[targets]
    test_refs = test_set.iloc[:, :2]

    return X_train, y_train, X_test, y_test, test_refs

---

## 6. Supervised training of a simple XGBoost

In [73]:
# Split your data to validate your model
X_train, y_train, X_val, y_val, val_refs = train_test_split(train_data)

In [110]:
# Run your model on the different targets
"""model = {}
for target in targets:
  model[f'xgb_model_{target}'] = xg_boost_template(X_train, y_train, X_val, y_val, val_refs, target)"""

estimators=[2]
models=[]
for estimator in estimators:
    models.append(LGBM(X_train, y_train, X_val, y_val, val_refs, target))




Spearman score over the period :
        target_b
count  73.000000
mean   -0.022150
std     0.144805
min    -0.275744
25%    -0.141551
50%    -0.028717
75%     0.098617
max     0.299959



---

## 7. Make prediction on the 4 targets

When you feel like your model is accurate enough it's time to predict the targets and submit your results.

Predict on the 4 targets, concatenate the answers and submit.

1. **WARNING**  Be sure that your columns are named id, Moons, target_w, target_r, etc.

2. **WARNING** Your prediction must be in [0, 1].

3. **WARNING** Don't submit constant values.

4. **WARNING** Submit the id and the moon columns.

In [22]:
prediction = test_data.iloc[:, :2]
for target in targets:
    prediction.loc[:, target] = model[f'xgb_model_{target}'].predict(test_data.iloc[:, 2:])

**Check your submission file**

In [30]:
prediction

Unnamed: 0,id,Moons,target_w,target_r,target_g,target_b
0,53174,370,0.497363,0.493858,0.500583,0.500254
1,167191,370,0.500289,0.499882,0.499862,0.501058
2,127094,370,0.500491,0.501243,0.501118,0.502040
3,115077,370,0.499218,0.499130,0.500716,0.500025
4,109783,370,0.500922,0.497900,0.503553,0.504149
...,...,...,...,...,...,...
11394,66866,383,0.500284,0.498289,0.499796,0.496772
11395,3476,383,0.501302,0.500216,0.509875,0.513133
11396,211774,383,0.499775,0.498610,0.499987,0.500254
11397,231760,383,0.503613,0.500763,0.497054,0.497529


In [16]:
def check_columns_name(df, sub):
    if sub.columns.tolist() != df.columns.tolist():
        raise Exception('Columns name are different from what is expected')

def check_nans(sub):
    if sub.isna().sum().sum() > 0:
        raise Exception('NaNs detected')

def check_values(sub, targets):
    for target in targets:
      if (sub.loc[:, target].values > 1).any() or (sub.loc[:, target].values < 0).any():
          raise Exception('Values are not between 0 and 1')

def check_moons(df, sub):
    if set(df['Moons'].unique()) != set(sub['Moons'].unique()):
        raise Exception('Moons are different from what is expected')

def check_ids(df, sub, moon):
    if not set(sub[sub['Moons'] == moon]['id'].unique()) == set(df[df['Moons'] == moon]['id'].unique()):
        raise Exception('At least an id is missing')

def check_constants(sub, moon, targets):
    for target in targets:
        if sub[sub['Moons'] == moon][target].nunique() == 1:
            raise Exception('Constant values have been detected on a moon')

try:
    check_columns_name(example_submission, prediction)
    check_nans(prediction)
    check_values(prediction, targets)
    check_moons(example_submission, prediction)
    for moon in prediction['Moons'].unique():
        check_ids(example_submission, prediction, moon)
        check_constants(prediction, moon, targets)
    print(f'Submission: OK')
except Exception as e:
    print(f'Error: {e}')

Submission: OK


---

## 8. Submit predictions

In [66]:
prediction.to_csv('prediction.csv')

In [67]:
# Upload predictions
submission_id = client.upload(prediction)

submissions are closed


In [68]:
# Set a comment for the submission, to remember which model that is, etc...
client.set_comment(submission_id, "quickstart model")

setting comment failed
b'{"code":"ARGUMENT_TYPE_MISMATCH","message":"Failed to convert value of type \'java.lang.String\' to required type \'long\'; For input string: \\"None\\"","expectedType":"long","property":"id","rejectedValue":"None"}'


In [69]:
# Download your prediction file if you prefer submitting through the website
from google.colab import files
with open("prediction.csv", "wb") as f:
    f.write(prediction.to_csv(index=False).encode('ascii'))
files.download('prediction.csv')

ModuleNotFoundError: No module named 'google.colab'

In [31]:
# Check your past submissions
client.submissions(
    user_id=None, # None is your id by default 
    round_num=None # None shows all the round by default
    )

KeyError: "None of ['id'] are in the columns"

---

## Useful links

**Website**

- https://www.crunchdao.com/

**Social media**
- discord : https://discord.gg/9wvzxS7A (come say hi! 😉)
- twitter : https://twitter.com/CrunchDAO
- linkedin : https://www.linkedin.com/company/crunchdao-com/
- reddit : https://www.reddit.com/r/crunchdao/

**Documentation**

- https://docs.crunchdao.com/tournament/getting-started

**Github**

- https://github.com/crunchdao/
- https://github.com/uuazed/crunchdao

**DeSci - research framework**

- https://desci.crunchdao.com/projects/crunchdao