[![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/crunchdao/quickstarters/blob/master/competitions/datacrunch/quickstarters/quickstarter/quickstarter.ipynb)

![Banner](https://raw.githubusercontent.com/crunchdao/quickstarters/refs/heads/master/competitions/datacrunch/assets/banner.webp)

# DataCrunch

## Challenge Overview

DataCrunch uses the quantitative research of CrunchDAO to manage its systematic market-neutral portfolio. DataCrunch built a dataset covering thousands of publicly traded U.S. companies.

The long-term strategic goal of the fund is capital appreciation by capturing idiosyncratic return at low volatility.

To achieve this goal, DataCrunch needs the community to assess the relative performance of all assets in a subset of the [Russell 3000](https://www.investopedia.com/terms/r/russell_3000.asp) universe. In other words, your model is expected to rank the constituents of DataCrunch's investment universe.

### Evaluation Metric

A [Spearman rank correlation](https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.corr.html) will be computed against the **live targets**.

Predictions must be between `0` and `1`.
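
As a minimal sketch of what a rank-based metric implies, the example below computes a Spearman correlation on made-up numbers as the Pearson correlation of the ranks. The frame and the `spearman` helper are illustrative assumptions, not part of the Crunch tooling. Because only the ordering matters, any strictly increasing rescaling of the predictions leaves the score unchanged.

```python
import pandas as pd

# Made-up predictions and targets for a single moon.
frame = pd.DataFrame({
    "prediction": [0.10, 0.40, 0.35, 0.80, 0.65],
    "target":     [0.00, 0.25, 0.50, 1.00, 0.75],
})


def spearman(prediction: pd.Series, target: pd.Series) -> float:
    # Spearman correlation is the Pearson correlation computed on ranks.
    return prediction.rank().corr(target.rank())


raw_score = spearman(frame["prediction"], frame["target"])

# A strictly increasing transform keeps the ranks, hence the score.
rescaled = 0.25 + 0.5 * frame["prediction"]
rescaled_score = spearman(rescaled, frame["target"])

print(raw_score, rescaled_score)
```

This is why a model only needs to order the stocks correctly within a moon; the absolute level of its outputs matters only for staying inside the `0` to `1` range.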

# Setup

The first steps to get started are:
1. Get the setup command
2. Execute it in the cell below

### >> https://hub.crunchdao.com/competitions/datacrunch/submit/notebook

![Reveal token](https://raw.githubusercontent.com/crunchdao/competitions/refs/heads/master/documentation/animations/reveal-token.gif)

In [None]:
# Install the Crunch CLI
%pip install --upgrade crunch-cli

# Setup your local environment
!crunch setup --notebook datacrunch hello --token aaaabbbbccccddddeeeeffff

# Your model

## Setup

In [13]:
# Imports
import os
import typing

# Specify the library version with the `==` operator.
import joblib # == 1.3.2
import pandas as pd # == 2.1.0
import numpy as np # == 1.24.3

# Import sklearn linear model
import sklearn # == 1.1.3
from sklearn.preprocessing import MinMaxScaler
from sklearn.linear_model import LinearRegression

In [None]:
import crunch

# Load the Crunch Toolings
crunch = crunch.load_notebook()

## Understanding the Data

Each row of the dataset describes a stock at a certain date.

In [15]:
# Load the data simply
X_train, y_train, X_test = crunch.load_data()

### Understanding `X_train`

**Columns:**
- `moon`: A sequentially increasing integer representing a date. The time between consecutive moons is constant: the data is sampled at a fixed weekly frequency.
- `id`: A unique identifier representing a stock at a given `moon`. Note that the same asset has a different `id` on each `moon`.
- `(gordon_Feature_1, …, dolly_Feature_30)`: Anonymised features that describe the state of assets on a given date. They are grouped into several families, each offering a different way of assessing the relative performance of each stock on that date.

**Note:**
- All features have the string "Feature" in their name, prefixed by a code name for the feature family.
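
The naming rule above makes it straightforward to group columns by family. The following sketch assumes the family code is everything before `_Feature_`; the tiny `X_example` frame is a stand-in that only borrows a few column names from the preview in this notebook.

```python
from collections import defaultdict

import pandas as pd

# Stand-in frame with no rows; only the column names matter here.
X_example = pd.DataFrame(columns=[
    "id", "moon",
    "vratios_Feature_1", "vratios_Feature_2",
    "fdriver_Feature_149_v2", "fdriver_Feature_150_v2",
])

# Every feature column contains the string "Feature".
feature_columns = [name for name in X_example.columns if "Feature" in name]

# Assumption: the family code is the prefix before "_Feature_".
families = defaultdict(list)
for name in feature_columns:
    family = name.split("_Feature_")[0]
    families[family].append(name)

for family, names in families.items():
    print(family, len(names))
```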

In [16]:
X_train

Unnamed: 0,id,moon,vratios_Feature_6,vratios_Feature_1,vratios_Feature_2,vratios_Feature_3,vratios_Feature_4,vratios_Feature_5,vratios_Feature_7,vratios_Feature_8,...,fdriver_Feature_149_v2,fdriver_Feature_150_v2,fdriver_Feature_151_v2,fdriver_Feature_152_v2,fdriver_Feature_153_v2,fdriver_Feature_154_v2,fdriver_Feature_155_v2,fdriver_Feature_156_v2,fdriver_Feature_157_v2,fdriver_Feature_158_v2
0,0,0,0.50,0.83,0.83,0.50,0.67,0.17,0.33,0.50,...,0.67,0.83,0.67,0.83,0.33,0.33,0.67,0.33,0.50,0.67
1,561,0,0.33,0.17,0.67,0.67,0.33,0.17,0.50,0.50,...,0.33,0.33,0.33,0.50,0.83,0.83,0.67,0.33,0.33,0.50
2,562,0,0.33,0.50,0.33,0.33,0.67,0.67,0.33,0.33,...,0.33,0.50,0.33,0.33,0.50,0.17,0.17,0.83,0.50,0.33
3,563,0,0.50,0.67,0.50,0.83,0.17,0.33,0.17,0.33,...,0.83,0.83,0.67,0.67,1.00,0.67,0.67,0.50,0.67,0.83
4,564,0,0.17,0.00,0.00,0.83,0.33,0.00,0.33,1.00,...,1.00,0.83,0.67,0.67,0.17,0.33,0.67,0.50,0.67,0.67
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
465309,464653,468,0.50,0.83,0.50,0.50,0.33,0.67,0.50,0.17,...,0.67,0.50,0.67,0.50,0.50,0.67,0.67,0.50,0.67,0.50
465310,464652,468,0.17,0.67,0.33,0.83,1.00,0.67,0.67,0.50,...,0.33,0.33,0.50,0.33,0.50,0.50,0.33,0.17,0.50,0.17
465311,464651,468,0.50,0.50,0.00,0.83,0.50,1.00,0.50,0.50,...,0.83,0.67,0.67,0.50,0.50,0.50,0.67,0.00,0.67,0.33
465312,464650,468,0.83,0.67,0.50,0.67,0.67,0.50,0.50,0.50,...,0.83,0.67,0.67,0.17,0.00,0.00,0.17,0.67,0.67,0.50


### Understanding `y_train`

**Columns:**
- `moon`: Same as in `X_train`.
- `id`: Same as in `X_train`.
- `(target_w, …, target_b)`: the targets that may help you build your models.

**Targets:**
- `w` refers to returns compounded over 7 days.
- `r` refers to returns compounded over 28 days.
- `g` refers to returns compounded over 63 days.
- `b` refers to returns compounded over 91 days.
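
"Compounding" can be read as chaining simple returns multiplicatively. The sketch below uses made-up daily returns purely to illustrate that arithmetic. How the anonymised targets are actually derived from such returns is not described in this notebook, so treat it as background rather than as the dataset's construction.

```python
import numpy as np

# Seven made-up daily simple returns (one week).
daily_returns = np.array([0.01, -0.02, 0.005, 0.0, 0.015, -0.01, 0.02])

# Compounded return: multiply the growth factors, then subtract 1.
compounded = np.prod(1.0 + daily_returns) - 1.0

# For comparison, the plain sum ignores the compounding effect.
simple_sum = daily_returns.sum()

print(compounded, simple_sum)
```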

In [17]:
y_train

Unnamed: 0,id,moon,target_w,target_r,target_g,target_b
0,0,0,0.67,0.50,0.67,0.50
1,561,0,0.50,0.83,0.67,0.83
2,562,0,0.17,0.33,0.33,0.33
3,563,0,0.00,0.17,0.33,0.33
4,564,0,0.33,0.00,0.00,0.17
...,...,...,...,...,...,...
465309,464653,468,0.33,0.50,0.33,0.33
465310,464652,468,0.50,0.50,0.50,0.50
465311,464651,468,0.50,0.17,0.33,0.67
465312,464650,468,0.50,0.67,0.33,0.50


### Understanding `X_test`

`X_test` has the same structure as `X_train` but comprises only 13 moons.

These files are used to simulate the submission process locally via `crunch.test()`. <br />
The aim is to help participants debug their code and have successful submissions. <br />
A successful local test usually means no errors during execution on the submission platform.

The data in these files covers the 13 moons for which the longest target (`target_b`) is not yet resolved. <br />
The missing data for each target were replaced by `-1` values.
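
Since unresolved targets are encoded as `-1`, it can help to turn them into missing values before computing any statistic on them. The sketch below does so on a made-up frame; `y_example` and its values are assumptions used only for illustration.

```python
import numpy as np
import pandas as pd

# Made-up targets: -1 marks a target that is not resolved yet.
y_example = pd.DataFrame({
    "moon": [479, 479, 480, 480],
    "id": [1, 2, 3, 4],
    "target_w": [0.33, 0.67, -1.0, -1.0],
    "target_b": [-1.0, -1.0, -1.0, -1.0],
})
target_columns = ["target_w", "target_b"]

# Replace the -1 placeholder with NaN so it is ignored by pandas statistics.
masked = y_example.copy()
masked[target_columns] = masked[target_columns].replace(-1, np.nan)

# Number of resolved targets per moon and per target.
resolved_counts = masked.groupby("moon")[target_columns].count()
print(resolved_counts)
```

A moon with zero resolved values for a target cannot produce a meaningful correlation for that target.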

**Note:** <br />
The features are split into two groups: the legacy features and the v2 features, which are suffixed with "`_v2`".
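
The suffix makes the two groups easy to separate. This is a small sketch on a hand-written list of names taken from the preview; in practice the list would come from the `feature_column_names` argument.

```python
# A few column names copied from the data preview.
feature_column_names = [
    "vratios_Feature_1",
    "vratios_Feature_2",
    "fdriver_Feature_149_v2",
    "fdriver_Feature_150_v2",
]

# v2 features carry the "_v2" suffix; everything else is treated as legacy.
v2_features = [name for name in feature_column_names if name.endswith("_v2")]
legacy_features = [name for name in feature_column_names if not name.endswith("_v2")]

print(len(legacy_features), len(v2_features))
```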

In [18]:
X_test

Unnamed: 0,id,moon,vratios_Feature_6,vratios_Feature_1,vratios_Feature_2,vratios_Feature_3,vratios_Feature_4,vratios_Feature_5,vratios_Feature_7,vratios_Feature_8,...,fdriver_Feature_149_v2,fdriver_Feature_150_v2,fdriver_Feature_151_v2,fdriver_Feature_152_v2,fdriver_Feature_153_v2,fdriver_Feature_154_v2,fdriver_Feature_155_v2,fdriver_Feature_156_v2,fdriver_Feature_157_v2,fdriver_Feature_158_v2
0,465978,469,0.50,0.33,0.50,0.33,0.17,0.33,0.67,0.83,...,0.17,0.17,0.17,0.17,0.00,0.17,0.17,0.83,0.83,0.17
1,465977,469,0.83,1.00,0.17,0.33,0.67,0.50,0.67,0.17,...,0.17,0.33,0.50,0.33,0.17,0.17,0.33,1.00,0.83,0.33
2,465976,469,1.00,1.00,0.67,0.50,0.83,0.83,0.50,0.83,...,0.50,0.50,0.50,0.67,0.83,0.83,0.67,0.67,1.00,0.83
3,465975,469,0.50,0.67,0.50,0.50,0.33,0.67,0.67,0.33,...,0.50,0.50,0.67,0.67,0.67,0.67,0.67,0.50,0.33,0.17
4,465974,469,0.33,0.83,0.17,0.33,0.67,0.17,0.50,0.50,...,0.83,0.67,0.50,0.50,0.33,0.50,0.50,0.50,0.67,0.50
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
11900,476559,480,0.83,0.17,0.33,0.33,0.67,0.33,0.67,0.67,...,0.50,0.67,0.67,0.67,0.50,0.50,0.50,0.83,0.67,0.50
11901,476558,480,0.33,0.50,0.33,0.33,0.67,0.50,0.67,1.00,...,0.33,0.50,0.50,0.33,0.50,0.33,0.33,0.50,0.67,0.50
11902,476557,480,0.33,0.50,0.33,0.33,0.33,0.67,0.67,0.33,...,0.67,0.67,0.67,0.33,0.00,0.17,0.33,0.50,0.67,0.50
11903,476556,480,0.50,0.33,0.33,0.67,0.17,0.67,0.00,0.33,...,0.67,0.67,0.50,0.33,0.50,0.50,0.33,0.83,0.67,0.50


## Strategy Implementation

### Utilities

Function used in both `train()` and `infer()`.

In [19]:
def get_model_path(
    model_directory_path: str,
    target_column_name: str,
):
    return os.path.join(
        model_directory_path,
        f"model.{target_column_name}.joblib"
    )

### The `train()` Function

In this function, you build and train your model for making inferences on the test data. Your model must be stored in the `model_directory_path`.

This function is called at a frequency set by the `train frequency` parameter, which you define when deploying your model on the Crunch platform.
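
To build intuition for the training frequency, here is one plausible reading of it as a schedule over moons. The `should_train` helper is hypothetical and is not the Crunch runner's code; the actual scheduling rule may differ. It assumes that a frequency of `0` means "never retrain" and that a forced first train always runs on the first moon.

```python
def should_train(index: int, train_frequency: int, force_first_train: bool) -> bool:
    """Hypothetical rule: decide whether the moon at position `index` retrains."""
    if index == 0 and force_first_train:
        return True
    if train_frequency <= 0:
        # Assumption: a frequency of 0 disables periodic retraining.
        return False
    return index % train_frequency == 0


moons = list(range(469, 481))

# Train only once, on the first moon.
train_once = [should_train(i, 0, True) for i in range(len(moons))]

# Retrain every 2 moons.
train_every_two = [should_train(i, 2, True) for i in range(len(moons))]

for moon, once, every_two in zip(moons, train_once, train_every_two):
    print(moon, once, every_two)
```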

In [20]:
# Uncomment what you need!
def train(
    X_train: pd.DataFrame,
    y_train: pd.DataFrame,
    # number_of_features: int,
    model_directory_path: str,
    # id_column_name: str,
    # moon_column_name: str,
    target_column_names: typing.List[str],
    prediction_column_names: typing.List[str],
    feature_column_names: typing.List[str],
    # moon: int,
    # current_moon: int, # same as "moon"
    # embargo: int,
    # has_gpu: bool,
    # has_trained: bool,
) -> None:
    """
    Do your model training here.
    At each retrain, this function must save an updated version of
    the model under model_directory_path, as in the example below.
    Note: You can use serialization methods other than joblib.dump(), as
    long as they match what reads the model in infer().

    Args:
        X_train, y_train: the data to train the model.
        number_of_features: the number of features of the dataset
        model_directory_path: the path to save your updated model
        id_column_name: the name of the id column
        moon_column_name: the name of the moon column
        target_column_names: the names of the target columns
        prediction_column_names: the names of the prediction columns
        feature_column_names: the names of the feature columns
        moon, current_moon: the moon currently being processed
        embargo: data embargo
        has_gpu: if the runner has a gpu
        has_trained: if the moon will train

    Returns:
        None
    """

    for target_column_name, prediction_column_name in zip(target_column_names, prediction_column_names):
        model = LinearRegression()
        
        model.fit(X_train[feature_column_names], y_train[target_column_name])

        model_path = get_model_path(model_directory_path, target_column_name)
        joblib.dump(model, model_path)

### The `infer()` Function

In the inference function, your trained model (if any) is loaded and used to make predictions on test data.

This function will be called on every `moon` of the `Out-Of-Sample`.

In [21]:
# Uncomment what you need!
def infer(
    X_test: pd.DataFrame,
    # number_of_features: int,
    model_directory_path: str,
    id_column_name: str,
    moon_column_name: str,
    target_column_names: typing.List[str],
    prediction_column_names: typing.List[str],
    feature_column_names: typing.List[str],
    # moon: int,
    # current_moon: int, # same as "moon"
    # embargo: int,
    # has_gpu: bool,
    # has_trained: bool,
) -> pd.DataFrame:
    """
    Do your inference here.
    This function will load the model saved at the previous iteration and use
    it to produce your inference on the current date.
    It is mandatory to send your inferences with the ids so the system
    can match them correctly.

    Args:
        X_test: the independent variables of the current date passed to your model.
        number_of_features: the number of features of the dataset
        model_directory_path: the path to the directory in which your trained model was saved.
        id_column_name: the name of the id column
        moon_column_name: the name of the moon column
        target_column_names: the names of the target columns
        prediction_column_names: the names of the prediction columns
        feature_column_names: the names of the feature columns
        moon, current_moon: the moon currently being processed
        embargo: data embargo
        has_gpu: if the runner has a gpu
        has_trained: if the moon will train

    Returns:
        A dataframe (moon, id, predictions) with the inferences of your model for the current date.
    """

    # Creating the predicted label dataframe with correct dates and ids
    prediction = X_test[[moon_column_name, id_column_name]].copy()
    
    for target_column_name, prediction_column_name in zip(target_column_names, prediction_column_names):
        # Loading the model saved by the train function at previous iteration
        model_path = get_model_path(model_directory_path, target_column_name)
        model = joblib.load(model_path)

        prediction[prediction_column_name] = model.predict(X_test[feature_column_names])

    # Predictions must be between 0 and 1. If you need to scale them, uncomment the following lines.
    # columns = list(prediction_column_names)
    # scaler = MinMaxScaler(feature_range=(0.01, 0.99))
    # scaled_array = scaler.fit_transform(prediction[columns])
    # prediction[columns] = scaled_array

    return prediction
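
The optional scaling step commented out in `infer()` can also be written with pandas alone. The sketch below applies the same affine min-max mapping to the range `(0.01, 0.99)`, but separately for each moon; scaling per moon and sending a constant group to the midpoint are design assumptions of this example, not requirements stated by the platform. The input values are made up.

```python
import pandas as pd

# Made-up raw model outputs for two moons; some fall outside [0, 1].
prediction = pd.DataFrame({
    "moon": [469, 469, 469, 470, 470, 470],
    "id": [1, 2, 3, 4, 5, 6],
    "prediction_w": [-0.2, 0.5, 1.3, 0.1, 0.2, 0.4],
})
columns = ["prediction_w"]
LOW, HIGH = 0.01, 0.99


def min_max_scale(values: pd.Series) -> pd.Series:
    # Map the smallest value to LOW and the largest to HIGH.
    span = values.max() - values.min()
    if span == 0:
        # Assumption: a constant group is sent to the middle of the range.
        return pd.Series((LOW + HIGH) / 2, index=values.index)
    return LOW + (values - values.min()) / span * (HIGH - LOW)


scaled = prediction.copy()
for column in columns:
    scaled[column] = prediction.groupby("moon")[column].transform(min_max_scale)

print(scaled)
```

Because the mapping is strictly increasing within each moon, the ranking of the stocks, and therefore a rank-based score, is unaffected.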

## Local testing

To make sure your `train()` and `infer()` functions work properly, you can call `crunch.test()`, which reproduces the cloud environment locally. <br />
The reproduction is not perfect, but it should quickly tell you whether your model works.

In [22]:
crunch.test(
    # Uncomment to disable the forced first train
    # force_first_train=False,
    force_first_train=True,

    # Uncomment to set the training frequency
    # train_frequency=2,  # train every 2 moons
    train_frequency=0,

    # Uncomment to disable the determinism check
    # no_determinism_check=True,
)

18:18:45 no forbidden library found
18:18:45
18:18:46 started
18:18:46 running local test
18:18:46 internet access isn't restricted, no check will be done
18:18:46


data\X_train.parquet: download from https:crunchdao--competition--production.s3.eu-west-1.amazonaws.com/data-releases/152/X_train.parquet (313462484 bytes)
data\X_train.parquet: already exists, file length match
data\X_test.parquet: download from https:crunchdao--competition--production.s3.eu-west-1.amazonaws.com/data-releases/152/X_test_reduced.parquet (9163839 bytes)
data\X_test.parquet: already exists, file length match
data\y_train.parquet: download from https:crunchdao--competition--production.s3.eu-west-1.amazonaws.com/data-releases/152/y_train.parquet (2853097 bytes)
data\y_train.parquet: already exists, file length match
data\y_test.parquet: download from https:crunchdao--competition--production.s3.eu-west-1.amazonaws.com/data-releases/152/y_test_reduced.parquet (81171 bytes)
data\y_test.parquet: already exists, file length match
data\example_prediction.parquet: download from https:crunchdao--competition--production.s3.eu-west-1.amazonaws.com/data-releases/152/example_predictio

18:18:51 starting timeseries loop...
18:18:51 looping moon=469 train=True (1/12)
18:18:51 call: train
18:19:28 call: infer
18:19:28 call: infer
18:19:28 deterministic: true
18:19:28 looping moon=470 train=False (2/12)
18:19:28 call: infer
18:19:28 call: infer
18:19:28 deterministic: true
18:19:28 looping moon=471 train=False (3/12)
18:19:28 call: infer
18:19:28 call: infer
18:19:28 deterministic: true
18:19:28 looping moon=472 train=False (4/12)
18:19:28 call: infer
18:19:28 call: infer
18:19:29 deterministic: true
18:19:29 looping moon=473 train=False (5/12)
18:19:29 call: infer
18:19:29 call: infer
18:19:29 deterministic: true
18:19:29 looping moon=474 train=False (6/12)
18:19:2

## Results

Once the local tester is done, you can preview the result stored in `data/prediction.parquet`.

In [23]:
prediction = pd.read_parquet("data/prediction.parquet")
prediction

Unnamed: 0,moon,id,prediction_w,prediction_r,prediction_g,prediction_b
0,469,465978,0.491969,0.485709,0.526433,0.506473
1,469,465977,0.507537,0.540807,0.557808,0.533293
2,469,465976,0.530685,0.563406,0.555870,0.546687
3,469,465975,0.468521,0.449328,0.424102,0.417711
4,469,465974,0.464691,0.452561,0.439337,0.434361
...,...,...,...,...,...,...
11900,480,476559,0.519221,0.515320,0.487000,0.491303
11901,480,476558,0.486348,0.468434,0.480508,0.468444
11902,480,476557,0.541200,0.547296,0.531897,0.526216
11903,480,476556,0.480370,0.453427,0.452859,0.463767
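
Before scoring, a quick sanity check of the prediction frame can catch common mistakes. The `find_problems` helper below is a hypothetical sketch: the range check reflects the requirement that predictions lie between `0` and `1`, while the duplicate and missing-value checks are extra assumptions about what a clean submission looks like. The `good` rows reuse the first two rows of the preview above; the `bad` rows are made up.

```python
import pandas as pd


def find_problems(prediction: pd.DataFrame, prediction_columns: list) -> list:
    """Return human-readable descriptions of issues found in the frame."""
    problems = []
    if prediction.duplicated(subset=["moon", "id"]).any():
        problems.append("duplicate (moon, id) pairs")
    for column in prediction_columns:
        values = prediction[column]
        if values.isna().any():
            problems.append(f"{column}: missing values")
        if not values.dropna().between(0, 1).all():
            problems.append(f"{column}: values outside [0, 1]")
    return problems


good = pd.DataFrame({
    "moon": [469, 469],
    "id": [465978, 465977],
    "prediction_w": [0.491969, 0.507537],
})
bad = pd.DataFrame({
    "moon": [469, 469],
    "id": [1, 1],
    "prediction_w": [1.2, None],
})

good_problems = find_problems(good, ["prediction_w"])
bad_problems = find_problems(bad, ["prediction_w"])
print(good_problems)
print(bad_problems)
```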


### Local scoring

You can call the function that the system uses to estimate your score locally.

In [24]:
# Load the targets
y_test = pd.read_parquet("./data/y_test.parquet")

# Define the scoring function
def score(
    group: pd.DataFrame,
    target_name: str,
):
    prediction_column_name = f"prediction_{target_name}"
    target_column_name = f"target_{target_name}"

    return group[prediction_column_name].corr(
        group[target_column_name],
        method="spearman"
    )

# Merge the predictions with the targets on moon and id
merged = y_test.merge(
    prediction,
    on=["moon", "id"],
)

# Compute the scores for each moon and for each target
scores = pd.DataFrame([
    {
        "moon": key,
        **{
            f"score_{target_name}": score(group, target_name)
            for target_name in "wrgb"
        }
    }
    for key, group in merged.groupby("moon")
])

scores


Unnamed: 0,moon,score_w,score_r,score_g,score_b
0,469,-0.062167,-0.078989,-0.02659,
1,470,0.135609,-0.027103,0.0584,
2,471,0.017189,-0.050042,-0.00391,
3,472,-0.100541,-0.073094,-0.023074,
4,473,-0.096647,-0.010683,,
5,474,0.076487,0.062836,,
6,475,0.000512,0.056292,,
7,476,,,,
8,477,0.031451,-0.006316,,
9,478,0.034088,,,
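
Per-moon scores are noisy, so a per-target summary is often easier to read. The sketch below aggregates a small frame whose values are copied from the first three rows of the table above; how the platform aggregates scores across moons is not stated here, so the mean is only an illustrative choice.

```python
import numpy as np
import pandas as pd

# First three rows of the scores table; score_b is not available (NaN).
scores = pd.DataFrame({
    "moon": [469, 470, 471],
    "score_w": [-0.062167, 0.135609, 0.017189],
    "score_r": [-0.078989, -0.027103, -0.050042],
    "score_g": [-0.02659, 0.0584, -0.00391],
    "score_b": [np.nan, np.nan, np.nan],
})

# Mean, standard deviation and number of available moons per target.
# NaN scores are skipped by pandas when computing these statistics.
summary = scores.drop(columns="moon").agg(["mean", "std", "count"]).T
print(summary)
```

A target with a `count` of zero has no resolved moons yet, which is why its mean stays undefined.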


# Submit your Notebook

To submit your work, you must:
1. Download your Notebook from Colab
2. Upload it to the platform
3. Create a run to validate it

### >> https://hub.crunchdao.com/competitions/datacrunch/submit/notebook

![Download and Submit Notebook](https://raw.githubusercontent.com/crunchdao/competitions/refs/heads/master/documentation/animations/download-and-submit-notebook.gif)