[![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://githubtocolab.com/crunchdao/adialab-notebooks/blob/main/quickstarter_notebook.ipynb)

# ![title](https://cdn.discordapp.com/attachments/692035498625204245/1090596888857813062/banner.png)

# Set up your crunch workspace

#### STEP 1
Run this cell to install the crunch library in your workspace.

In [None]:
!pip3 install crunch-cli --upgrade

#### STEP 2 
(Temporary - will be removed once the pip package is public)

In [None]:
# temporary command that will be removed once the public announcement is made
%env API_BASE_URL=http://api.adialab.staging.crunchdao.com
%env WEB_BASE_URL=https://adialab.staging.crunchdao.com/

#### STEP 3
Import the crunch package and instantiate it to access its functionality.

In [None]:
import crunch
import sys
crunch = crunch.load_notebook(sys.modules[__name__])

# The Adialab x CrunchDAO competition

## A code competition

This competition is divided into two phases.

Submission phase - 12 weeks

Out-of-Sample phase - 12 weeks

During the first phase, participants submit notebooks or Python scripts that build the best possible model on the data provided by the organizers. In the second phase, also called the [Out-of-Sample](https://en.wikipedia.org/wiki/Cross-validation_(statistics)) (OOS) phase, the platform automatically runs each participant's code on live market data, so the model is evaluated on unseen data.

There are two main benefits to proceeding this way:

- Participants cannot game or cheat the evaluation, which is often possible in traditional data-science competitions.

- [Overfitting](https://deliverypdf.ssrn.com/delivery.php?ID=634087103098022017102089127026118070055022030067038035066070070118003108076075122073107013020035005031116084117030102014013119017036066065011126115081078006004108029033051020066006092025091103065117104075029100098011096065096065079019015002101078070&EXT=pdf&INDEX=TRUE) the training data leads to poor OOS performance.

To ensure reproducibility of your work, you will need to follow certain guidelines to participate in the competition. These guidelines will also allow our scoring system to run your code in the cloud during the OOS period without any issues.

CrunchDAO acts as a third-party intermediary in this competition and will, of course, never share your code with the organizers.

## The user interface

User interfaces are recurring solutions to common problems. In data science and modeling, the typical interface is covered by the following steps:

1. **import**: As in any script, if your solution depends on external packages, make sure to import them. The system will automatically install your dependencies. Use only packages that are whitelisted on the Overview >> Libraries page.

2. **data processing**: In this step, you apply whatever transformations you deem necessary before training a model: feature selection, data transformations, creation of new synthetic features, and so on. This step must return the x_train, y_train and x_test data samples.

3. **train**: In the training step, you build the model and train it. This function should return a trained model ready to perform inference on the test data.

4. **infer**: In the inference function, the model trained in the previous step is used to perform inference on a data sample with the same characteristics as the training set.
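
The steps above can be sketched end to end as follows. This is a minimal sketch under assumptions: the call order (process, then train, then infer) is inferred from the descriptions above rather than from platform documentation, `run_pipeline` is a hypothetical driver that is not part of the crunch library, and the toy functions use plain lists and a mean predictor in place of real data frames and a real model.

```python
from statistics import mean

# Toy stand-ins for the three functions; plain lists replace the real data frames.
def data_process(x_train, y_train, x_test):
    # No-op processing: return the three samples unchanged.
    return x_train, y_train, x_test

def train(x_train, y_train):
    # The "model" is just the mean of the training targets.
    return {"mean_target": mean(y_train)}

def infer(model, x_test):
    # Predict the stored mean for every test row.
    return [model["mean_target"]] * len(x_test)

def run_pipeline(x_train, y_train, x_test):
    """Hypothetical driver: chain the stages in the order described above."""
    x_train, y_train, x_test = data_process(x_train, y_train, x_test)
    model = train(x_train, y_train)
    return infer(model, x_test)

predictions = run_pipeline(
    x_train=[[0.1], [0.4], [0.7]],
    y_train=[1.0, 2.0, 3.0],
    x_test=[[0.2], [0.9]],
)
print(predictions)
```

Running a dry run like this locally on a small sample is a cheap way to check that the three functions agree on their inputs and outputs before submitting.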

In [None]:
# !! Make sure to run this cell to be able to run the rest of the notebook.

# imports
import xgboost as xgb
import sklearn
from sklearn.feature_selection import VarianceThreshold
from sklearn.model_selection import train_test_split
from scipy import stats
import matplotlib.pyplot as plt
import pandas as pd

# The 3 stages of the crunchdao user interface

# STAGE 1: Data processing
def data_process(x_train, y_train, x_test):
    """
    Do your data processing here.
    """

    return x_train, y_train, x_test

# STAGE 2: Training
def train(x_train, y_train):
    """
    Do your model training here.
    """
    
    # splitting the data into training and hold-out sets (no shuffling, so row order is kept)
    x_train, x_test, y_train, y_test = train_test_split(x_train, y_train, test_size=0.2, shuffle=False)

    # choosing a model
    model = xgb.XGBRegressor(objective='reg:squarederror', max_depth=5, learning_rate=0.01, n_estimators=2000, n_jobs=-1, colsample_bytree=0.5)

    # training 
    model.fit(x_train, y_train)

    return model

# STAGE 3: Inference
def infer(model, x_test):
    """
    load model and infer here.
    """
    
    pred = model.predict(x_test)

    return pred
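
The `train` function above splits with `shuffle=False`, so rows keep their original order and the last 20% form the hold-out set. Assuming the rows are sorted by date, this means the model is never validated on observations that precede its training data. The same idea can be sketched in plain Python; `chronological_split` is a hypothetical helper written for illustration, not part of scikit-learn or the crunch library.

```python
import math

def chronological_split(rows, test_size=0.2):
    """Hypothetical helper: keep the first rows for fitting, the last ones for hold-out."""
    # Round the hold-out size up so a small sample still gets at least one hold-out row.
    n_test = math.ceil(len(rows) * test_size)
    cut = len(rows) - n_test
    return rows[:cut], rows[cut:]

days = list(range(10))  # stand-in for ten time-ordered observations
fit_days, holdout_days = chronological_split(days)
print(fit_days, holdout_days)
```

A shuffled split would mix later observations into the training portion, which tends to make the hold-out score look better than what an out-of-sample run would give.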

# Construction of a basic submission

### Submission process

1- Make sure to put all your code in the code interface inside your notebook. The system parses these functions to execute them in the cloud. You can work outside the code interface, but to submit you need to fill in the functions of the code interface with the code you want to submit.

2- Once you are satisfied with your work, download this notebook (File -> Download -> Download .ipynb).

3- Then upload the notebook at https://adialab.staging.crunchdao.com/submit#



In [None]:
import xgboost as xgb
import sklearn
from sklearn.feature_selection import VarianceThreshold
from sklearn.model_selection import train_test_split
from scipy import stats
import pandas as pd

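def scorer(y_test, y_pred):
    # spearmanr returns (correlation, p-value); take the correlation first, then scale it to a percentage
    score = stats.spearmanr(y_test, y_pred)[0] * 100
    print(f"In sample spearman correlation {score}")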

def data_process(x_train, y_train, x_test):
    """
    Do your data processing here.
    """
    print("Starting - data_process")
    
    # selecting some features (columns 2 to 14) and casting them to float
    x_train = x_train.iloc[:, 2:15].astype(float)
    y_train = y_train.iloc[:, 2:]

    # keeping the first two columns of x_test (assumed to be "date" and "id"),
    # because infer() reads them and predicts on x_test.iloc[:, 2:]
    x_test = x_test.iloc[:, :15].copy()

    # typing
    for col in x_train.columns:
        x_test[col] = x_test[col].astype(float)

    return x_train, y_train, x_test

def train(x_train, y_train):
    """
    Do your model training here.
    """
    
    print("Starting - training")
    
    # splitting the data into training and hold-out sets (no shuffling, so row order is kept)
    X_train, X_test, y_train, y_test = train_test_split(x_train, y_train, test_size=0.2, shuffle=False)
    
    # choosing a model
    model = xgb.XGBRegressor(objective='reg:squarederror', max_depth=4, learning_rate=0.01, n_estimators=2, n_jobs=-1, colsample_bytree=0.5)

    # training 
    model.fit(X_train, y_train)

    # testing model's Spearman score
    pred = model.predict(X_test)
    scorer(y_test, pred)

    return model
  
# def infer(model, x_test):
#     """
#     load model and infer here.
#     """

#     print("Starting - infer")
#     return pd.DataFrame(model.predict(x_test))


def infer(model, x_test):
    """
    Do your inference here.
    """

    predicted = x_test[["date", "id"]].copy()
    predicted["value"] = model.predict(x_test.iloc[:, 2:])

    return predicted
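
The `scorer` helper above reports the Spearman rank correlation between the hold-out targets and the predictions. The metric only looks at ordering: both series are replaced by their ranks, and the Pearson correlation of those ranks is returned. Below is a minimal sketch of that computation in plain Python; `average_ranks` and `spearman` are hypothetical names used for illustration, and in practice `scipy.stats.spearmanr` should be preferred.

```python
from statistics import mean

def average_ranks(values):
    """Rank values from 1 to n, giving tied values the average of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    start = 0
    while start < len(order):
        end = start
        # Extend the group while the next value is tied with the current one.
        while end + 1 < len(order) and values[order[end + 1]] == values[order[start]]:
            end += 1
        shared = (start + end) / 2 + 1  # average of the 1-based positions in the group
        for k in range(start, end + 1):
            ranks[order[k]] = shared
        start = end + 1
    return ranks

def spearman(y_true, y_pred):
    """Pearson correlation of the two rank vectors (undefined if either input is constant)."""
    rx, ry = average_ranks(y_true), average_ranks(y_pred)
    mx, my = mean(rx), mean(ry)
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    var_x = sum((a - mx) ** 2 for a in rx)
    var_y = sum((b - my) ** 2 for b in ry)
    return cov / (var_x * var_y) ** 0.5

y_true = [0.1, 0.5, 0.3, 0.9]
y_pred = [10.0, 30.0, 20.0, 500.0]  # very different scale, same ordering
print(spearman(y_true, y_pred))
```

Because only ranks matter, predictions on a different scale from the targets still score well as long as they order the observations correctly.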