[![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://githubtocolab.com/crunchdao/adialab-notebooks/blob/main/quickstarter_notebook.ipynb)

# ![title](https://cdn.discordapp.com/attachments/692035498625204245/1090596888857813062/banner.png)

# Set up your crunch workspace

#### STEP 1
Run this cell to install the crunch library in your workspace.

In [None]:
!pip3 install crunch-cli --upgrade

#### STEP 2 
(Temporary - will be removed once the pip package is public)

In [None]:
# temporary command that will be removed once the public announcement is made
%env API_BASE_URL=http://api.adialab.staging.crunchdao.com
%env WEB_BASE_URL=https://adialab.staging.crunchdao.com/

#### STEP 3
Import the crunch package and instantiate it to access its functionality.

In [None]:
import crunch
crunch = crunch.load_notebook(__name__) # will allow you to access the crunch commands from this notebook

#### STEP 4

In [None]:
# go to your submit page, then copy and paste your setup command to access the data
# https://adialab.staging.crunchdao.com/submit
!crunch setup happy-mike --token l65Za9SsJiBi8pH8xPvwSfGuRY8ChynyvD2DVxoAtWOosW6p6SNtwnci6conlWW4

# The Adialab x CrunchDAO competition

## A code competition

This competition is divided into two phases:

- Submission phase - 12 weeks

- Out-of-Sample phase - 12 weeks

During the first phase, participants submit notebooks or Python scripts that build the best possible model on the data provided by the organizers. In the second phase, also called the [Out-of-Sample](https://en.wikipedia.org/wiki/Cross-validation_(statistics)) (OOS) phase, the platform automatically runs each participant's code on live market data, so that it is evaluated on unseen data. During this phase, participants cannot modify their code.

There are two main benefits to proceeding this way:

- Participants cannot game or cheat the evaluation, which is a frequent problem in traditional data-science competitions.

- [Overfitting](https://deliverypdf.ssrn.com/delivery.php?ID=634087103098022017102089127026118070055022030067038035066070070118003108076075122073107013020035005031116084117030102014013119017036066065011126115081078006004108029033051020066006092025091103065117104075029100098011096065096065079019015002101078070&EXT=pdf&INDEX=TRUE) the training data will lead to very poor OOS performance.

To ensure reproducibility of your work, you will need to follow certain guidelines to participate in the competition. These guidelines will also allow our scoring system to run your code in the cloud during the OOS period without any issues.

CrunchDAO acts as a third-party intermediary in this competition and will, of course, never share participants' code with the organizers in any way.

## The user interface

User Interfaces are recurring solutions that solve common problems. In the world of data-science and modeling, the typical interface is covered by the following functions:

1. **import**: As in any script, if your solution depends on external packages, make sure to import them. The system will automatically install your dependencies. Make sure that you use only packages that are whitelisted on the overview >> Libraries page.

2. **data processing**: In the data processing step, users apply whatever transformations they deem necessary before training a model. This step includes feature selection, data transformations, creation of new synthetic features, etc. It must return the x_train, y_train and x_test data samples.

3. **train**: In the training phase, users build the model and train it so that it can perform inference on the testing data. This function should save a trained model, ready to perform inference on the testing data, in the model directory.

4. **infer**: In the inference function, the model trained in the previous step is used to perform inference on a data sample that matches the characteristics of the training set.
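As a minimal sketch of how these steps chain together, the toy example below runs them in order on synthetic data. The column layout (`date`, `id`, then features), the helper `make_toy_data`, the least-squares stand-in for a model, and the use of `pickle` for saving are all assumptions made to keep the example small; they are not the platform's implementation.

```python
import os
import pickle
import tempfile

import numpy as np
import pandas as pd


def data_process(x_train, y_train, x_test):
    # Hypothetical transformation: center each feature using training means only.
    feature_cols = x_train.columns[2:]
    means = x_train[feature_cols].mean()
    x_train = x_train.copy()
    x_test = x_test.copy()
    x_train[feature_cols] = x_train[feature_cols] - means
    x_test[feature_cols] = x_test[feature_cols] - means
    return x_train, y_train, x_test


def train(x_train, y_train, model_directory_path):
    # Ordinary least squares as a stand-in for a real model.
    features = x_train.iloc[:, 2:].to_numpy()
    target = y_train.iloc[:, 2].to_numpy()
    weights, *_ = np.linalg.lstsq(features, target, rcond=None)
    with open(os.path.join(model_directory_path, "model.pkl"), "wb") as f:
        pickle.dump(weights, f)


def infer(model_directory_path, x_test):
    # Reload the saved state and predict one value per test row.
    with open(os.path.join(model_directory_path, "model.pkl"), "rb") as f:
        weights = pickle.load(f)
    return x_test.iloc[:, 2:].to_numpy() @ weights


def make_toy_data(n_dates=6, n_ids=20, n_features=3, seed=0):
    # Synthetic data with an assumed layout: date, id, then feature columns.
    rng = np.random.default_rng(seed)
    rows = [(d, i) for d in range(n_dates) for i in range(n_ids)]
    x = pd.DataFrame(rows, columns=["date", "id"])
    feats = rng.normal(size=(len(x), n_features))
    for k in range(n_features):
        x[f"f{k}"] = feats[:, k]
    y = x[["date", "id"]].copy()
    y["y"] = feats @ np.arange(1, n_features + 1) + rng.normal(scale=0.1, size=len(x))
    return x, y


x, y = make_toy_data()
is_test = x["date"] == x["date"].max()  # hold out the last date as a local test set
x_train, y_train, x_test = data_process(x[~is_test], y[~is_test], x[is_test])

with tempfile.TemporaryDirectory() as model_directory_path:
    train(x_train, y_train, model_directory_path)
    pred = infer(model_directory_path, x_test)

print(pred.shape)  # one prediction per test row
```

Note that `train` returns nothing and communicates with `infer` only through the files in `model_directory_path`, which mirrors the way the two functions are called separately by the platform.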

## Scoring on the public leaderboard

To keep the public leaderboard robust, you do not have access to the full test data on which you will be scored.
The x_test data downloaded to your workspace contains only 5 dates, so that you can test your code locally.
Once you have pushed your solution, the system will run your code on a private test set of around 30 dates.
It is up to you to decide how many retrains you can afford within the allowance of 5 hours of resources per week per user to predict the 30 dates (moons) of the private test set.
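To reason about this budget, a rough back-of-the-envelope calculation can help. The sketch below is only illustrative: the function `max_retrains` is hypothetical, the training and inference timings are made-up numbers, and it assumes the whole 5-hour allowance is spent on a single run with data processing time ignored.

```python
import math


def max_retrains(n_dates, budget_hours, train_minutes, infer_minutes):
    """Largest number of retrains that fits in the budget once inference on every date is paid for."""
    budget_minutes = budget_hours * 60
    remaining = budget_minutes - n_dates * infer_minutes
    if remaining < 0:
        raise ValueError("inference alone exceeds the budget")
    return min(n_dates, int(remaining // train_minutes))


# Hypothetical timings: 20 minutes per training, 1 minute per inference.
retrains = max_retrains(n_dates=30, budget_hours=5, train_minutes=20, infer_minutes=1)
retrain_every = math.ceil(30 / retrains)

print(retrains)       # 13
print(retrain_every)  # 3, i.e. retrain on every third date
```

With these assumed timings, retraining on every third date uses 10 retrains and stays inside the allowance, whereas retraining on every date would not.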

```python
# Illustrative pseudocode of the backend loop. The names dates, embargo, retrain,
# log_threshold, model_directory_path and the *_full data frames are provided by the platform.
predictions = []

for date in dates:  # Loop over the private test set dates to avoid leaking the x of future periods

    # The wrapper blocks the logging of the user's code after the first 5 dates
    if date >= log_threshold:
        log = False

    # Cut the sample so that the user's code only accesses the permitted part of the data
    x_train = x_train_full[x_train_full.date < date - embargo]
    y_train = y_train_full[y_train_full.date < date - embargo]
    x_test = x_test_full[x_test_full.date == date]  # Only the current date

    # Call the user's data processing
    x_train, y_train, x_test = data_process(x_train, y_train, x_test)

    # The backend decides whether train is called, for ALL users
    if retrain:
        train(x_train, y_train, model_directory_path)  # This function saves the new state of the model

    # The backend calls the inference
    prediction_current = infer(model_directory_path, x_test)

    # Keep the current prediction only once the logs are deactivated, so that scoring
    # only covers dates for which the user's code could not log anything
    if date >= log_threshold:
        predictions.append(prediction_current)

prediction = pd.concat(predictions)

# Backend uploads the predictions and the content of model_directory_path
# Backend scores the predictions
```
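The embargo cut in the loop above can be tried on toy data. This is a minimal sketch, assuming integer dates and a hypothetical helper `cut_for_date`; the real date format and embargo length are set by the platform and are not stated here.

```python
import pandas as pd


def cut_for_date(x_train_full, y_train_full, x_test_full, date, embargo):
    """Return the slices the user's code is allowed to see for one test date."""
    cutoff = date - embargo
    x_train = x_train_full[x_train_full["date"] < cutoff]
    y_train = y_train_full[y_train_full["date"] < cutoff]
    x_test = x_test_full[x_test_full["date"] == date]
    return x_train, y_train, x_test


# Toy data: dates are plain integers here, which is an assumption.
x_train_full = pd.DataFrame({"date": [0, 1, 2, 3, 4, 5], "f0": [0.1, 0.4, 0.2, 0.9, 0.5, 0.7]})
y_train_full = pd.DataFrame({"date": [0, 1, 2, 3, 4, 5], "y": [1.0, 0.0, 1.0, 0.0, 1.0, 0.0]})
x_test_full = pd.DataFrame({"date": [6, 6, 7], "f0": [0.3, 0.8, 0.6]})

x_train, y_train, x_test = cut_for_date(x_train_full, y_train_full, x_test_full, date=6, embargo=2)

print(x_train["date"].tolist())  # [0, 1, 2, 3]
print(len(x_test))               # 2
```

Dates 4 and 5 are dropped from training because they fall inside the embargo window before the test date, and the rows for date 7 stay hidden until the loop reaches that date.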

## Scoring on the out-of-sample phase

During the out-of-sample phase, the backend will call your code 3 times every week on live data points.

The mean Spearman score after 12 weeks of OOS will determine the winners of the tournament.
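To get a feel for the metric, a mean Spearman score can be estimated locally. The sketch below assumes the score is a Spearman rank correlation computed per date and then averaged, with hypothetical column names `prediction` and `target`; the platform's exact scoring procedure may differ.

```python
import pandas as pd


def spearman(a, b):
    # Spearman correlation is the Pearson correlation of the ranks.
    return a.rank().corr(b.rank())


def mean_spearman(frame):
    # One score per date, then the plain average across dates.
    scores = [spearman(g["prediction"], g["target"]) for _, g in frame.groupby("date")]
    return sum(scores) / len(scores)


# Toy data: a perfect ranking on date 0, one swapped pair on date 1.
frame = pd.DataFrame({
    "date":       [0, 0, 0, 1, 1, 1],
    "prediction": [0.1, 0.2, 0.3, 0.1, 0.3, 0.2],
    "target":     [1.0, 2.0, 3.0, 1.0, 2.0, 3.0],
})

score = mean_spearman(frame)
print(round(score, 2))  # 0.75
```

Because the metric only depends on ranks, any monotonic rescaling of the predictions leaves the score unchanged.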

In [None]:
# imports
import os

import joblib
import pandas as pd
import xgboost as xgb
from sklearn.model_selection import train_test_split

# The 3 stages of the crunchdao user interface

# STAGE 1: Data processing
def data_process(x_train, y_train, x_test):
    """
    Do your data processing here.
    """

    return x_train, y_train, x_test

# STAGE 2: Training
def train(x_train, y_train, model_directory_path):
    """
    Do your model training here.
    """

    # split off a validation set without shuffling, so the row order is preserved;
    # x_val and y_val are not used below but are available to evaluate the model
    x_train, x_val, y_train, y_val = train_test_split(x_train, y_train, test_size=0.2, shuffle=False)

    # choosing a model
    model = xgb.XGBRegressor(objective='reg:squarederror', max_depth=5, learning_rate=0.01, n_estimators=20, n_jobs=-1, colsample_bytree=0.5)

    # training: the first two columns are skipped in both x and y
    model.fit(x_train.iloc[:, 2:], y_train.iloc[:, 2:])

    # make sure that the train function correctly saves the trained model in model_directory_path
    joblib.dump(model, os.path.join(model_directory_path, "model.joblib"))

# STAGE 3: Inference
def infer(model_directory_path, x_test):
    """
    Load the model and infer here.
    """

    # load the file written by train(), not the directory itself
    model = joblib.load(os.path.join(model_directory_path, "model.joblib"))
    pred = model.predict(x_test.iloc[:, 2:])

    return pred

# Next step: build a basic submission
# https://github.dev/crunchdao/adialab-notebooks/blob/main/basic_submission.ipynb