[![Open In Colab](https://raw.githubusercontent.com/crunchdao/competitions/refs/heads/master/documentation/badge/open-in-colab.svg)](https://colab.research.google.com/github/crunchdao/quickstarters/blob/feat/datacrunch-2/competitions/datacrunch-2/quickstarters/quickstarter/quickstarter.ipynb)

![Banner](https://raw.githubusercontent.com/crunchdao/competitions/refs/heads/feat/synth/documentation/assets/generic/banner.webp)

# DataCrunch 2

## Challenge Overview

DataCrunch uses the quantitative research of CrunchDAO to manage its systematic market-neutral portfolio. DataCrunch built a dataset covering thousands of publicly traded U.S. companies.

The long-term strategic goal of the fund is capital appreciation by capturing idiosyncratic return at low volatility.

To achieve this goal, DataCrunch needs the community to assess the relative performance of all assets in a subset of the [Russell 3000](https://www.investopedia.com/terms/r/russell_3000.asp) universe. In other words, DataCrunch expects your model's predictions to be as highly correlated as possible with the performance of the constituents of its investment universe.

# Setup

The first steps to get started are:
1. Get the setup command
2. Execute it in the cell below

### >> https://hub.crunchdao.io/competitions/datacrunch-2/submit/notebook

![Reveal token](https://raw.githubusercontent.com/crunchdao/competitions/refs/heads/master/documentation/animations/reveal-token.gif)

In [None]:
# Install the Crunch CLI
%pip install --upgrade crunch-cli

# Setup your local environment
!crunch setup --notebook datacrunch hello --token aaaabbbbccccddddeeeeffff

In [None]:
%env API_BASE_URL=https://api.hub.crunchdao.io
%env WEB_BASE_URL=https://hub.crunchdao.io
%env CRUNCH_COMPETITIONS_BRANCH=feat/datacrunch-2

# Your model

## Setup

In [None]:
# Imports
import os
import typing

# Specify the library version with the `==` operator.
import joblib # == 1.3.2
import pandas as pd # == 2.1.0
import numpy as np # == 1.24.3

# Import sklearn linear model
import sklearn # == 1.1.3
from sklearn.linear_model import LinearRegression

In [None]:
import crunch

# Load the Crunch Toolings
crunch_tools = crunch.load_notebook()

## Understanding the Data

Each row of the dataset describes a stock at a certain date.

In [4]:
# Load the data simply

X_train, y_train, X_test = crunch_tools.load_data()

### Understanding `X_train`

**Columns:**
- `moon`: A sequentially increasing integer representing a date. Time between subsequent dates is constant, denoting a weekly fixed frequency at which the data is sampled.
- `id`: A unique identifier representing a stock at a given `moon`. Note that the same asset has a different `id` at each `moon`.
- `(Feature_1, …, Feature_n)`: Anonymised features that describe the state of assets on a given date. They are grouped into several families, or ways of assessing the relative performance of each stock at a given `moon`.

**Note:**
- All features have the string "Feature" in their name.
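
As a quick sanity check of the structure described above, the sketch below counts rows per `moon` and confirms that each `id` appears only once. It runs on a tiny synthetic frame so that it is self-contained; `summarize_moons` and `toy` are hypothetical names, and on the real data you would pass `X_train` instead.

```python
import pandas as pd


def summarize_moons(X: pd.DataFrame) -> pd.DataFrame:
    """Count rows and unique ids for each moon (hypothetical helper)."""
    return (
        X.groupby("moon")["id"]
        .agg(rows="count", unique_ids="nunique")
        .reset_index()
    )


# Tiny synthetic stand-in for X_train (values are made up)
toy = pd.DataFrame({
    "id": [1, 2, 3, 4, 5],
    "moon": [635, 635, 635, 636, 636],
    "Feature_1": [0.83, 0.67, 0.17, 0.50, 0.33],
})

summary = summarize_moons(toy)
print(summary)

# An id identifies one (asset, moon) pair, so it should never repeat
assert toy["id"].is_unique
```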

In [5]:
X_train

Unnamed: 0,id,moon,Feature_1,Feature_2,Feature_3,Feature_4,Feature_5,Feature_6,Feature_7,Feature_8,...,Feature_1141,Feature_1142,Feature_1143,Feature_1144,Feature_1145,Feature_1146,Feature_1147,Feature_1148,Feature_1149,Feature_1150
0,1354416,635,0.83,0.83,0.83,0.83,0.83,0.83,0.83,0.67,...,1.00,0.83,0.67,0.50,0.67,0.67,0.17,0.17,0.17,0.50
1,1354414,635,0.67,0.67,0.67,0.67,0.67,0.67,0.50,0.83,...,0.50,0.67,0.67,0.67,0.67,0.67,0.17,0.67,0.67,0.50
2,1354415,635,0.67,0.67,0.67,0.67,0.67,0.67,0.67,0.33,...,0.50,0.67,0.67,0.83,0.83,0.67,0.33,0.17,0.17,0.83
3,1354417,635,0.17,0.17,0.17,0.17,0.17,0.17,0.17,0.33,...,0.33,0.33,0.00,0.00,0.00,0.00,0.50,0.67,0.83,0.83
4,1354423,635,0.67,0.67,0.67,0.83,0.83,0.83,0.83,0.50,...,0.50,0.50,0.50,0.50,0.50,0.50,0.00,0.17,0.33,0.50
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
276631,1628473,777,0.50,0.50,0.50,0.50,0.50,0.50,0.50,0.33,...,0.50,0.50,0.33,0.33,0.33,0.33,0.33,0.33,0.33,0.83
276632,1628474,777,0.50,0.50,0.50,0.50,0.50,0.50,0.50,0.50,...,0.17,0.33,0.67,0.50,0.50,0.67,0.33,0.50,0.67,0.50
276633,1628475,777,0.50,0.50,0.50,0.50,0.50,0.33,0.33,0.50,...,0.67,1.00,1.00,1.00,1.00,1.00,0.67,0.33,0.33,0.17
276634,1628476,777,0.67,0.67,0.67,0.67,0.67,0.67,0.67,0.33,...,0.33,0.17,0.00,0.00,0.00,0.00,0.33,0.67,0.50,1.00


### Understanding `y_train`

**Columns:**
- `moon`: Same as in `X_train`.
- `id`: Same as in `X_train`.
- `target`: The target used to build your models, based on 28-day (4-moon) compounded returns.
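
The exact construction of `target` is not detailed here, so the following is only an illustration of what "compounded returns" over 4 moons means arithmetically. The weekly returns are made up, `compound_returns` is a hypothetical helper, and the real target may be transformed further.

```python
import math


def compound_returns(period_returns: list[float]) -> float:
    """Compound simple per-period returns into one multi-period return."""
    growth = math.prod(1.0 + r for r in period_returns)
    return growth - 1.0


# Four hypothetical weekly returns, one per moon
weekly = [0.01, -0.02, 0.03, 0.005]
four_moon_return = compound_returns(weekly)
print(round(four_moon_return, 6))
```

Compounding multiplies the growth factors `(1 + r)` instead of summing the returns, which is why the result differs slightly from the plain sum of the four weekly values.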

In [6]:
y_train

Unnamed: 0,id,moon,target
0,1354416,635,0.0
1,1354414,635,0.0
2,1354415,635,0.0
3,1354417,635,0.0
4,1354423,635,0.0
...,...,...,...
276631,1628473,777,0.0
276632,1628474,777,0.0
276633,1628475,777,0.0
276634,1628476,777,0.0


### Understanding `X_test`

`X_test` has the same structure as `X_train` but comprises only a few moons, the ones you must predict.

These files are used to simulate the submission process locally via `crunch_tools.test()`. <br />
The aim is to help participants debug their code and have successful submissions. <br />
A successful local test usually means no errors during execution on the submission platform.
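
Since `X_test` is expected to share the structure of `X_train`, a small check of the feature columns can catch a mismatch before running the local test. The sketch below uses toy frames; `feature_mismatch` is a hypothetical helper, not part of the Crunch tooling.

```python
import pandas as pd


def feature_mismatch(
    X_a: pd.DataFrame,
    X_b: pd.DataFrame,
) -> tuple[set[str], set[str]]:
    """Return the feature columns found only in X_a and only in X_b."""
    features_a = {c for c in X_a.columns if c.startswith("Feature_")}
    features_b = {c for c in X_b.columns if c.startswith("Feature_")}
    return features_a - features_b, features_b - features_a


# Toy frames standing in for X_train and X_test (values are made up)
toy_train = pd.DataFrame(
    {"id": [1], "moon": [635], "Feature_1": [0.5], "Feature_2": [0.17]}
)
toy_test = pd.DataFrame({"id": [9], "moon": [782], "Feature_1": [0.33]})

only_in_train, only_in_test = feature_mismatch(toy_train, toy_test)
print(only_in_train, only_in_test)
```

On the real data, both returned sets should be empty.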

In [7]:
X_test

Unnamed: 0,id,moon,Feature_1,Feature_2,Feature_3,Feature_4,Feature_5,Feature_6,Feature_7,Feature_8,...,Feature_1141,Feature_1142,Feature_1143,Feature_1144,Feature_1145,Feature_1146,Feature_1147,Feature_1148,Feature_1149,Feature_1150
0,1638541,782,0.50,0.50,0.50,0.50,0.50,0.50,0.50,0.33,...,0.67,0.33,0.50,0.67,0.67,0.67,0.33,0.50,0.50,0.83
1,1638542,782,0.17,0.17,0.17,0.17,0.33,0.33,0.33,0.67,...,0.33,0.50,0.83,0.83,0.67,0.83,0.83,0.83,0.17,1.00
2,1638543,782,0.33,0.33,0.33,0.33,0.33,0.33,0.33,0.67,...,0.67,0.50,0.67,0.67,0.67,0.67,0.33,0.33,0.33,0.50
3,1638544,782,0.67,0.67,0.67,0.67,0.67,0.67,0.67,0.50,...,0.50,0.33,0.33,0.33,0.33,0.33,0.00,0.67,0.33,0.50
4,1638546,782,0.67,0.67,0.67,0.67,0.67,0.67,0.67,0.33,...,0.33,0.17,0.00,0.00,0.00,0.00,0.83,0.50,0.33,0.83
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
17031,1653032,790,0.67,0.67,0.67,0.67,0.67,0.67,0.67,0.50,...,0.33,0.17,0.17,0.17,0.33,0.17,0.50,0.67,0.67,0.67
17032,1653033,790,0.33,0.33,0.33,0.33,0.33,0.33,0.33,0.67,...,0.00,0.17,0.17,0.00,0.00,0.17,0.17,0.33,0.33,0.50
17033,1653035,790,0.50,0.50,0.50,0.50,0.50,0.50,0.50,0.33,...,0.50,0.50,0.50,0.50,0.50,0.50,0.50,0.33,0.50,0.17
17034,1653034,790,0.17,0.17,0.17,0.17,0.33,0.33,0.33,0.67,...,0.67,0.50,0.17,0.17,0.33,0.17,0.33,0.17,0.17,0.67


## Strategy Implementation

### Utilities

Functions used in both `train()` and `infer()`.

In [8]:
def get_model_path(
    model_directory_path: str,
):
    return os.path.join(
        model_directory_path,
        "model.joblib"
    )

get_model_path("resources")

'resources\\model.joblib'

In [9]:
def get_feature_columns(
    X: pd.DataFrame,
):
    return [
        column
        for column in X.columns
        if column.startswith("Feature_")
    ]

get_feature_columns(X_train)[:10]

['Feature_1',
 'Feature_2',
 'Feature_3',
 'Feature_4',
 'Feature_5',
 'Feature_6',
 'Feature_7',
 'Feature_8',
 'Feature_9',
 'Feature_10']

### The `train()` Function

In this function, you build and train your model for making inferences on the test data. Your model must be stored in the `model_directory_path`.

This function is called at a frequency set by the `train frequency` parameter, which you define when deploying your model on the Crunch platform.
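
To make the idea of a training frequency concrete, here is a simplified model of which moons would trigger a call to `train()`. This is an assumption about how such a schedule could work, not the Crunch runner's actual logic; `planned_train_flags` is a hypothetical helper, and it treats a frequency of `0` as "never retrain".

```python
def planned_train_flags(
    moon_count: int,
    train_frequency: int,
    force_first_train: bool,
) -> list[bool]:
    """Return one flag per moon: True if train() would run (assumed schedule)."""
    flags = []
    for index in range(moon_count):
        if index == 0 and force_first_train:
            # Always train once at the start when forced
            flags.append(True)
        elif train_frequency > 0:
            # Retrain every `train_frequency` moons
            flags.append(index % train_frequency == 0)
        else:
            # Frequency of 0: no periodic retraining
            flags.append(False)
    return flags


print(planned_train_flags(4, train_frequency=0, force_first_train=True))
print(planned_train_flags(4, train_frequency=2, force_first_train=True))
```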

In [10]:
# Uncomment what you need!
def train(
    X_train: pd.DataFrame,
    y_train: pd.DataFrame,
    model_directory_path: str,
    # moon: int,
    # current_moon: int, # same as "moon"
    # embargo: int,
    # has_gpu: bool,
    # has_trained: bool,
) -> None:
    """
    Do your model training here.
    At each retrain, this function must save an updated version of
    the model under model_directory_path, as in the example below.
    Note: You can use serialization methods other than joblib.dump(), as
    long as they match what reads the model in infer().

    Args:
        X_train, y_train: the data to train the model.
        model_directory_path: the path to the directory where your updated model is saved
        moon, current_moon: the moon currently being processed
        embargo: data embargo
        has_gpu: if the runner has a gpu
        has_trained: if the moon will train

    Returns:
        None
    """

    model = LinearRegression()

    feature_columns = get_feature_columns(X_train)
    model.fit(X_train[feature_columns], y_train["target"])

    model_path = get_model_path(model_directory_path)
    joblib.dump(model, model_path)
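
Before relying on a model, it can help to hold out the most recent moons for validation. Because the target spans 4 moons, rows just before the validation window overlap with it in time, so the sketch below leaves a gap between the two sets. `split_by_moon`, `validation_moons` and `gap` are assumptions made for illustration; they are not part of the Crunch API, and the `embargo` value passed by the runner may differ.

```python
import pandas as pd


def split_by_moon(
    X: pd.DataFrame,
    y: pd.DataFrame,
    validation_moons: int,
    gap: int,
):
    """Hold out the last `validation_moons` moons, skipping `gap` moons before them."""
    if validation_moons < 1 or gap < 0:
        raise ValueError("validation_moons must be >= 1 and gap >= 0")

    moons = sorted(X["moon"].unique())
    train_count = len(moons) - validation_moons - gap
    if train_count < 1:
        raise ValueError("not enough moons for this split")

    train_moons = moons[:train_count]
    held_out_moons = moons[-validation_moons:]

    return (
        X[X["moon"].isin(train_moons)],
        y[y["moon"].isin(train_moons)],
        X[X["moon"].isin(held_out_moons)],
        y[y["moon"].isin(held_out_moons)],
    )


# Synthetic data: 10 moons with 2 rows each (values are made up)
toy_X = pd.DataFrame({
    "id": range(20),
    "moon": [m for m in range(1, 11) for _ in range(2)],
    "Feature_1": [0.5] * 20,
})
toy_y = toy_X[["id", "moon"]].assign(target=0.0)

X_fit, y_fit, X_val, y_val = split_by_moon(toy_X, toy_y, validation_moons=2, gap=4)
print(sorted(X_fit["moon"].unique()), sorted(X_val["moon"].unique()))
```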

### The `infer()` Function

In the inference function, your trained model (if any) is loaded and used to make predictions on test data.

This function is called for every `moon` of the `Out-Of-Sample` period.

In [11]:
# Uncomment what you need!
def infer(
    X_test: pd.DataFrame,
    model_directory_path: str,
    # moon: int,
    # current_moon: int, # same as "moon"
    # embargo: int,
    # has_gpu: bool,
    # has_trained: bool,
) -> pd.DataFrame:
    """
    Do your inference here.
    This function will load the model saved at the previous iteration and use
    it to produce your inference on the current date.
    It is mandatory to send your inferences with the ids so the system
    can match it correctly.

    Args:
        X_test: the independent variables of the current date passed to your model.
        model_directory_path: the path to the directory in which your trained model is saved.
        moon, current_moon: the moon currently being processed
        embargo: data embargo
        has_gpu: if the runner has a gpu
        has_trained: if the moon will train

    Returns:
        A dataframe (id, moon, prediction) with the inferences of your model for the current date.
    """

    # Creating the predicted label dataframe with correct dates and ids
    prediction = X_test[["id", "moon"]].copy()
    
    # Loading the model saved by the train function at previous iteration
    model_path = get_model_path(model_directory_path)
    model = joblib.load(model_path)

    feature_columns = get_feature_columns(X_test)
    prediction["prediction"] = model.predict(X_test[feature_columns])

    return prediction
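
Because predictions are matched back by `id`, a few structural checks on the returned frame can save a failed run. The sketch below is a hypothetical helper (`validate_prediction` is not part of the Crunch tooling); the `[-1, 1]` bound comes from the scoring notes in this notebook, and the toy frames are made up.

```python
import pandas as pd


def validate_prediction(
    prediction: pd.DataFrame,
    X_test: pd.DataFrame,
) -> list[str]:
    """Return a list of problems found in a prediction frame (empty if none)."""
    problems = [
        f"missing column: {column}"
        for column in ("id", "moon", "prediction")
        if column not in prediction.columns
    ]
    if problems:
        return problems

    if prediction["id"].duplicated().any():
        problems.append("duplicated ids")
    if set(prediction["id"]) != set(X_test["id"]):
        problems.append("ids do not match X_test")
    if prediction["prediction"].isna().any():
        problems.append("NaN predictions")
    elif not prediction["prediction"].between(-1, 1).all():
        problems.append("predictions outside [-1, 1]")
    return problems


# Toy frames standing in for X_test and the output of infer()
toy_X_test = pd.DataFrame({"id": [10, 11, 12], "moon": [782, 782, 782]})
good = toy_X_test.assign(prediction=[0.1, -0.2, 0.0])
bad = toy_X_test.assign(prediction=[0.1, 1.5, 0.0])

print(validate_prediction(good, toy_X_test))
print(validate_prediction(bad, toy_X_test))
```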

## Local testing

To make sure your `train()` and `infer()` functions work properly, you can call `crunch_tools.test()`, which reproduces the cloud environment locally. <br />
Although the reproduction is not perfect, it should quickly show whether your model works.

In [None]:
# Uncomment to clear up a bit of RAM by unloading some data

# import gc

# X_train.drop(X_train.index, inplace=True)
# del X_train

# y_train.drop(y_train.index, inplace=True)
# del y_train

# X_test.drop(X_test.index, inplace=True)
# del X_test

# gc.collect()

In [12]:
crunch_tools.test(
    # Uncomment to disable the forced first train
    # force_first_train=False,
    force_first_train=True,

    # Uncomment to set the training frequency
    # train_frequency=2,  # train every 2 moons
    train_frequency=0,

    # Uncomment to disable the determinism check
    # no_determinism_check=True,
)

17:39:29
17:39:29 started
17:39:29 running local test
17:39:29 internet access isn't restricted, no check will be done
17:39:29
17:39:29 starting unstructured loop...
17:39:29 looping moon=782 train=True (1/9)
17:39:29 executing - command=train
17:39:44 executing - command=infer
17:39:45 looping moon=783 train=False (2/9)
17:39:45 executing - command=infer
17:39:46 looping moon=784 train=False (3/9)
17:39:46 executing - command=infer
17:39:47 looping moon=785 train=False (4/9)
17:39:47 executing - command=infer
17:39:47 looping moon=786 train=False (5/9)
17:39:47 executing - command=infer
17:39:48 looping moon=787 train=False (6/9)
17:39:48 executing - command=infer
17:39:49 looping moon=788 train=False (7/9)
17:39:49 executing - command=infer
17:39:49 looping moon=789 tr

## Results

Once the local tester is done, you can preview the result stored in `prediction/prediction.parquet`.

In [13]:
prediction = pd.read_parquet("prediction/prediction.parquet")
prediction

Unnamed: 0,id,moon,prediction
0,1638541,782,-7.899920e-05
1,1638542,782,3.025327e-04
2,1638543,782,1.909996e-04
3,1638544,782,-9.198941e-04
4,1638546,782,-4.871050e-05
...,...,...,...
17031,1653032,790,7.075723e-07
17032,1653033,790,1.589209e-04
17033,1653035,790,-4.385489e-04
17034,1653034,790,1.135529e-04


### Local scoring

You can call the function that the system uses to estimate your score locally.

A [Pearson correlation](https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.corr.html) will be computed against the **targets**.

The final score is the mean of the per-moon correlations divided by the standard deviation of those correlations.

**Note**:
- If all predictions are constant, the correlation will be undefined. In this case, the score will be set to `0`.
- Predictions must be between `-1` and `1`.
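
The scoring rule above can be traced on a tiny synthetic example. This is a sketch of the described formula, not the platform's scoring code; `mean_over_std_score` is a hypothetical name, and the three toy moons are built so that the per-moon correlations are 1, -1 and 0.5.

```python
import pandas as pd


def mean_over_std_score(merged: pd.DataFrame) -> float:
    """Mean of per-moon Pearson correlations divided by their standard deviation."""
    per_moon = (
        merged.groupby("moon")[["prediction", "target"]]
        .apply(lambda g: g["prediction"].corr(g["target"], method="pearson"))
        .fillna(0)  # undefined correlations (constant inputs) count as 0
    )
    spread = per_moon.std()
    if pd.isna(spread) or spread == 0:
        # Covers a single moon or identical correlations on every moon
        return 0.0
    return float(per_moon.mean() / spread)


# Three toy moons with three assets each (values are made up)
toy_merged = pd.DataFrame({
    "moon": [1, 1, 1, 2, 2, 2, 3, 3, 3],
    "prediction": [1.0, 2.0, 3.0] * 3,
    "target": [1.0, 2.0, 3.0, 3.0, 2.0, 1.0, 1.0, 3.0, 2.0],
})

toy_score = mean_over_std_score(toy_merged)
print(round(toy_score, 4))
```

A model that is strongly right on some moons and strongly wrong on others gets a low score here, because the denominator penalises unstable correlations.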

In [14]:
# Load the targets
y_test = pd.read_parquet(
    "data/y.reduced.parquet",
    filters=[
        ("moon", "in", prediction["moon"].unique())
    ]
)

y_test

Unnamed: 0,id,moon,target
0,1638541,782,0.00
1,1638542,782,0.02
2,1638543,782,0.02
3,1638544,782,-0.02
4,1638546,782,0.00
...,...,...,...
17031,1653032,790,0.00
17032,1653033,790,0.00
17033,1653035,790,0.00
17034,1653034,790,0.00


In [None]:
# Define the scoring function
def score(
    group: pd.DataFrame,
):
    prediction_column_name = "prediction"
    target_column_name = "target"

    return group[prediction_column_name].corr(
        group[target_column_name],
        method="pearson"
    )

# Merge the prediction with the targets with moon and id
merged = y_test.merge(
    prediction,
    on=["moon", "id"],
)

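# Compute the pearson for each moon
# (columns are selected explicitly because the `include_groups`
# argument only exists in pandas >= 2.2)
pearson_values = merged\
    .groupby("moon")[["prediction", "target"]]\
    .apply(score)\
    .fillna(0)  # map constants to zero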

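# NumPy floats return inf/nan instead of raising ZeroDivisionError,
# so check the spread explicitly
pearson_std = pearson_values.std()
if np.isnan(pearson_std) or pearson_std == 0:
    # if prediction is only constants
    score_value = 0
else:
    score_value = pearson_values.mean() / pearson_std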

score_value

# Submit your Notebook

To submit your work, you must:
1. Download your Notebook from Colab
2. Upload it to the platform
3. Create a run to validate it

### >> https://hub.crunchdao.com/competitions/datacrunch/submit/notebook

![Download and Submit Notebook](https://raw.githubusercontent.com/crunchdao/competitions/refs/heads/master/documentation/animations/download-and-submit-notebook.gif)