[![Open In Colab](https://raw.githubusercontent.com/crunchdao/competitions/refs/heads/master/documentation/badge/open-in-colab.svg)](https://colab.research.google.com/github/crunchdao/quickstarters/blob/master/competitions/structural-break-open-benchmark/quickstarters/baseline/baseline.ipynb)
[![Open In Kaggle](https://raw.githubusercontent.com/crunchdao/competitions/refs/heads/master/documentation/badge/open-in-kaggle.svg)](https://www.kaggle.com/code/crunchdao/structural-break-open-benchmark-baseline)

![Banner](https://raw.githubusercontent.com/crunchdao/quickstarters/refs/heads/master/competitions/structural-break/assets/banner.webp)

# ADIA Lab Structural Break Open Benchmark Challenge

## Challenge Overview

Welcome to the ADIA Lab Structural Break Open Benchmark Challenge! In this challenge, you will analyze univariate time series data to determine whether a structural break has occurred at a specified boundary point.

### What is a Structural Break?

A structural break occurs when the process governing the data generation changes at a certain point in time. These changes can be subtle or dramatic, and detecting them accurately is crucial across various domains such as climatology, industrial monitoring, finance, and healthcare.

![Structural Break Example](https://raw.githubusercontent.com/crunchdao/competitions/refs/heads/master/competitions/structural-break/quickstarters/baseline/images/example.png)

### Your Task

For each time series in the test set, you need to predict a score between `0` and `1`:
- Values closer to `0` indicate no structural break at the specified boundary point;
- Values closer to `1` indicate a structural break did occur.

### Evaluation Metric

The evaluation metric is [ROC AUC (Area Under the Receiver Operating Characteristic Curve)](https://scikit-learn.org/stable/modules/generated/sklearn.metrics.roc_auc_score.html), which measures the performance of detection algorithms regardless of their specific calibration.

- ROC AUC around `0.5`: No better than random chance;
- ROC AUC approaching `1.0`: Perfect detection.
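
Because ROC AUC depends only on how the scores rank the series, any order-preserving rescaling of the scores leaves it unchanged. The sketch below illustrates this with a hand-rolled pairwise AUC. The `roc_auc` helper, the labels, and the p-values are all made up for illustration; this is not the platform's scoring code.

```python
def roc_auc(labels, scores):
    """Probability that a random positive is scored above a random negative."""
    positives = [s for s, y in zip(scores, labels) if y]
    negatives = [s for s, y in zip(scores, labels) if not y]
    wins = 0.0
    for p in positives:
        for n in negatives:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5  # ties count as half a win
    return wins / (len(positives) * len(negatives))


# Hypothetical labels and p-values for five time series
labels = [False, False, True, False, True]
p_values = [0.80, 0.35, 0.01, 0.60, 0.40]

negated = [-p for p in p_values]       # scores in [-1, 0]
flipped = [1.0 - p for p in p_values]  # scores in [0, 1]

auc_negated = roc_auc(labels, negated)
auc_flipped = roc_auc(labels, flipped)
print(auc_negated, auc_flipped)  # both rankings give the same AUC
```

Both score lists order the series identically, so the two AUC values match even though only one of them lies in the `0` to `1` range.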

# Setup

The first steps to get started are:
1. Get the setup command
2. Execute it in the cell below

### >> https://hub.crunchdao.com/competitions/structural-break-open-benchmark/submit/notebook

![Reveal token](https://raw.githubusercontent.com/crunchdao/competitions/refs/heads/master/documentation/animations/reveal-token.gif)

In [18]:
# Install the Crunch CLI
%pip install --upgrade crunch-cli

# Setup your local environment
!crunch setup-notebook structural-break-open-benchmark ou1T3M4Vv9vSIUOJNlSU5j2O

crunch-cli, version 10.10.2
you appear to have never submitted code before
data/X_train.parquet: download from https:crunchdao--competition--production.s3-accelerate.amazonaws.com/data-releases/211/X_train.parquet (209821819 bytes)
data/X_test.reduced.parquet: download from https:crunchdao--competition--production.s3-accelerate.amazonaws.com/data-releases/211/X_test.reduced.parquet (2267868 bytes)
data/y_train.parquet: download from https:crunchdao--competition--production.s3-accelerate.amazonaws.com/data-releases/211/y_train.parquet (60884 bytes)
data/y_test.reduced.parquet: download from https:crunchdao--competition--production.s3-accelerate.amazonaws.com/data-releases/211/y_test.reduced.parquet (2549 bytes)
                                
---
Success! Your environment has been correctly setup.
Next recommended actions:
1. Load the Crunch Toolings: `crunch_tools = crunch.load_notebook()`
2. Execute the cells with your code
3. Run a test: `crunch_tools.test()`
4. Download and submit 

# Your model

## Setup

In [19]:
import os
import typing

# Import your dependencies
import joblib
import pandas as pd
import scipy
import sklearn.metrics  # <1.8

In [20]:
import crunch

# Load the Crunch Toolings
crunch_tools = crunch.load_notebook()

loaded inline runner with module: <module '__main__'>

cli version: 10.10.2
available ram: 12.67 gb
available cpu: 2 core
----


## Understanding the Data

The dataset consists of univariate time series, each containing ~2,000-5,000 values with a designated boundary point. For each time series, you need to determine whether a structural break occurred at this boundary point.

The data was downloaded when you set up your local environment and is now available in the `data/` directory.

In [21]:
# Load the data simply
X_train, y_train, X_test = crunch_tools.load_data()

X_train = typing.cast(pd.DataFrame, X_train)
y_train = typing.cast(pd.Series, y_train)
X_test = typing.cast(pd.DataFrame, X_test)

data/X_train.parquet: download from https:crunchdao--competition--production.s3-accelerate.amazonaws.com/data-releases/211/X_train.parquet (209821819 bytes)
data/X_train.parquet: already exists, file length match
data/X_test.reduced.parquet: download from https:crunchdao--competition--production.s3-accelerate.amazonaws.com/data-releases/211/X_test.reduced.parquet (2267868 bytes)
data/X_test.reduced.parquet: already exists, file length match
data/y_train.parquet: download from https:crunchdao--competition--production.s3-accelerate.amazonaws.com/data-releases/211/y_train.parquet (60884 bytes)
data/y_train.parquet: already exists, file length match
data/y_test.reduced.parquet: download from https:crunchdao--competition--production.s3-accelerate.amazonaws.com/data-releases/211/y_test.reduced.parquet (2549 bytes)
data/y_test.reduced.parquet: already exists, file length match


### Understanding `X_train`

The training data is structured as a pandas DataFrame with a MultiIndex:

**Index Levels:**
- `id`: Identifies the unique time series
- `time`: (arbitrary) The time step within each time series, which is regularly sampled

**Columns:**
- `value`: The values of the time series at each given time step
- `period`: `0` for the first part of the time series, before the presumed break point; `1` for the second part, after the break point
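
As a minimal sketch of how this layout can be navigated, the snippet below builds a tiny synthetic frame with the same index levels and columns (the values are made up) and splits one series at its boundary point. The variable names are illustrative only.

```python
import pandas as pd

# Tiny synthetic frame mimicking the documented layout (values are made up)
index = pd.MultiIndex.from_tuples(
    [(0, 0), (0, 1), (0, 2), (0, 3), (1, 0), (1, 1), (1, 2)],
    names=["id", "time"],
)
toy = pd.DataFrame(
    {
        "value": [0.1, 0.2, 0.9, 1.1, -0.3, -0.2, -0.1],
        "period": [0, 0, 1, 1, 0, 1, 1],
    },
    index=index,
)

# Number of time steps in each series
lengths = toy.groupby(level="id").size()

# Select one series by its id; the result is indexed by time only
series = toy.xs(0, level="id")

# Split the series at the boundary using the period column
before = series.loc[series["period"] == 0, "value"]
after = series.loc[series["period"] == 1, "value"]

# The boundary is the first time step that belongs to period 1
boundary = after.index.min()
```

The same selection works on `X_train` and `X_test`, since both share this structure.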

In [22]:
X_train

id,time,value,period
0,0,-0.005564,0
0,1,0.003705,0
0,2,0.013164,0
0,3,0.007151,0
0,4,-0.009979,0
...,...,...,...
10000,2134,0.001137,1
10000,2135,0.003526,1
10000,2136,0.000687,1
10000,2137,0.001640,1


### Understanding `y_train`

This is a simple `pandas.Series` that indicates, for each time series `id`, whether a structural break occurs at the presumed break point.

**Index:**
- `id`: the ID of the time series

**Value:**
- `structural_breakpoint`: Boolean indicating whether a structural break occurred (`True`) or not (`False`)

In [23]:
y_train

id,structural_breakpoint
0,False
1,False
2,True
3,False
4,False
...,...
9996,False
9997,False
9998,False
9999,False


### Understanding `X_test`

The test data is provided with the same format as [`X_train`](#understanding-X_train).

In [24]:
X_test

id,time,value,period
10001,0,0.004353,0
10001,1,0.003517,0
10001,2,0.006206,0
10001,3,-0.000983,0
10001,4,-0.002316,0
...,...,...,...
10101,3069,-0.000151,1
10101,3070,0.000264,1
10101,3071,0.000712,1
10101,3072,-0.001793,1


In [25]:
dataset_ids = X_test.index.get_level_values("id").unique()
print("Number of datasets:", len(dataset_ids))

Number of datasets: 101


In [26]:
X_test[X_test.index.get_level_values("id") == dataset_ids.min()]

id,time,value,period
10001,0,0.004353,0
10001,1,0.003517,0
10001,2,0.006206,0
10001,3,-0.000983,0
10001,4,-0.002316,0
10001,...,...,...
10001,2105,0.003565,1
10001,2106,0.048444,1
10001,2107,0.003224,1
10001,2108,0.001674,1


## Strategy Implementation

There are multiple approaches you can take to detect structural breaks:

1. **Statistical Tests**: Compare distributions before and after the boundary point;
2. **Feature Engineering**: Extract features from both segments for comparison;
3. **Time Series Modeling**: Detect deviations from expected patterns;
4. **Machine Learning**: Train models to recognize break patterns from labeled examples.

The baseline implementation below uses a simple statistical approach: a two-sample t-test that compares the means of the values before and after the boundary point.

### The `train()` Function

In this function, you build and train your model for making inferences on the test data. Your model must be stored in the `model_directory_path`.

The baseline implementation below doesn't require a pre-trained model, as it uses a statistical test that will be computed at inference time.

In [27]:
def train(
    X_train: pd.DataFrame,
    y_train: pd.Series,
    model_directory_path: str,
):
    # For our baseline t-test approach, we don't need to train a model
    # This is essentially an unsupervised approach calculated at inference time
    model = None

    # You could enhance this by training an actual model, for example:
    # 1. Extract features from before/after segments of each time series
    # 2. Train a classifier using these features and y_train labels
    # 3. Save the trained model

    joblib.dump(model, os.path.join(model_directory_path, 'model.joblib'))
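
The enhancement outlined in the comments above could look roughly like the following sketch. Everything here is an assumption made for illustration: the function names `extract_features`, `train_with_classifier`, and `infer_with_classifier`, the two hand-picked features, the choice of logistic regression, and the synthetic data. It is not part of the official baseline.

```python
import os
import tempfile

import joblib
import numpy as np
import pandas as pd
from sklearn.linear_model import LogisticRegression


def extract_features(X: pd.DataFrame) -> pd.DataFrame:
    """Build one row of (hypothetical) features per time series id."""
    def one_series(group: pd.DataFrame) -> pd.Series:
        before = group.loc[group["period"] == 0, "value"]
        after = group.loc[group["period"] == 1, "value"]
        return pd.Series({
            "abs_mean_diff": abs(after.mean() - before.mean()),
            "abs_log_std_ratio": abs(np.log(after.std() / before.std())),
        })
    return X.groupby(level="id").apply(one_series)


def train_with_classifier(X_train, y_train, model_directory_path):
    # 1. Extract features, 2. fit a classifier on the labels, 3. save it
    features = extract_features(X_train)
    model = LogisticRegression()
    model.fit(features, y_train.loc[features.index])
    joblib.dump(model, os.path.join(model_directory_path, "model.joblib"))


def infer_with_classifier(X_test, model_directory_path):
    model = joblib.load(os.path.join(model_directory_path, "model.joblib"))
    features = extract_features(X_test)
    # Column 1 holds the probability of the True (break) class
    proba = model.predict_proba(features)[:, 1]
    return pd.Series(proba, index=features.index, name="prediction")


# Synthetic data: odd ids get a mean shift after the boundary (made up)
rng = np.random.default_rng(0)
frames, labels = [], {}
for series_id in range(20):
    has_break = series_id % 2 == 1
    n = 100
    values = rng.normal(size=n)
    if has_break:
        values[n // 2:] += 2.0
    period = (np.arange(n) >= n // 2).astype(int)
    index = pd.MultiIndex.from_product(
        [[series_id], range(n)], names=["id", "time"]
    )
    frames.append(pd.DataFrame({"value": values, "period": period}, index=index))
    labels[series_id] = has_break

X_toy = pd.concat(frames)
y_toy = pd.Series(labels, name="structural_breakpoint").rename_axis("id")

model_dir = tempfile.mkdtemp()
train_with_classifier(X_toy, y_toy, model_dir)
predictions = infer_with_classifier(X_toy, model_dir)
```

The sketch scores the same synthetic series it was trained on, which only checks that the pieces fit together; a real evaluation would need held-out data.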

### The `infer()` Function

In the inference function, your trained model (if any) is loaded and used to make predictions on test data.

**Changes from the previous edition**:
- Now, you can receive the full `X_test` dataframe and make predictions all at once.
- To ensure backward compatibility, you can still use the `yield`-based approach (see the [previous quickstarter](https://colab.research.google.com/github/crunchdao/quickstarters/blob/master/competitions/structural-break/quickstarters/baseline/baseline.ipynb#scrollTo=7n-jboJH-0fU)).

In [28]:
def infer(
    X_test: pd.DataFrame,
    model_directory_path: str,
):
    model = joblib.load(os.path.join(model_directory_path, 'model.joblib'))

    # Baseline approach: run a two-sample t-test on the values before and after the boundary point.
    # The negated p-value is used as the score: a smaller p-value (a score closer to 0)
    # means more evidence against the null hypothesis that both segments share the same mean,
    # suggesting a structural break

    def t_test(u: pd.DataFrame):
        return -scipy.stats.ttest_ind(
            u["value"][u["period"] == 0],  # Values before boundary point
            u["value"][u["period"] == 1],  # Values after boundary point
        ).pvalue

    prediction = X_test.groupby("id").apply(t_test)

    # Note: ROC AUC depends only on the ranking of the scores, so the negated p-values
    # (which lie in [-1, 0]) can be returned as-is without rescaling them to [0, 1].

    return prediction
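
To make the direction of the score concrete, here is a small sketch on synthetic data (the samples, seed, and helper names are made up). It also hints at a limitation of the t-test, which targets a change in mean: a change in variance alone may go unnoticed, whereas a two-sample Kolmogorov-Smirnov test, shown purely as one possible alternative, compares the whole distributions.

```python
import numpy as np
import scipy.stats

rng = np.random.default_rng(42)  # fixed seed so the sketch is reproducible

before = rng.normal(loc=0.0, scale=1.0, size=500)
after_same = rng.normal(loc=0.0, scale=1.0, size=500)     # no break
after_shifted = rng.normal(loc=1.0, scale=1.0, size=500)  # mean shift
after_wider = rng.normal(loc=0.0, scale=3.0, size=500)    # variance change only


def t_score(a, b):
    # Negated t-test p-value, as in the baseline
    return -scipy.stats.ttest_ind(a, b).pvalue


def ks_score(a, b):
    # Negated Kolmogorov-Smirnov p-value (illustrative alternative)
    return -scipy.stats.ks_2samp(a, b).pvalue


score_no_break = t_score(before, after_same)
score_mean_shift = t_score(before, after_shifted)

t_variance = t_score(before, after_wider)
ks_variance = ks_score(before, after_wider)

print("t-test, no break:     ", score_no_break)
print("t-test, mean shift:   ", score_mean_shift)
print("t-test, wider spread: ", t_variance)
print("KS test, wider spread:", ks_variance)
```

A score closer to `0` means a smaller p-value, so the mean-shift case ranks above the no-break case, which is the ordering ROC AUC rewards.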

## Local testing

To make sure your `train()` and `infer()` functions are working properly, you can call the `crunch_tools.test()` function, which reproduces the cloud environment locally. <br />
Even if the reproduction is not perfect, it should quickly tell you whether your model is working properly.

In [29]:
crunch_tools.test(
    # Uncomment to disable the train
    # force_first_train=False,

    # Uncomment to disable the determinism check
    # no_determinism_check=True,
)

00:47:27 
00:47:27 started
00:47:27 running local test
00:47:27 internet access isn't restricted, no check will be done
00:47:27 
00:47:28 starting unstructured loop...
00:47:28 executing - command=train


data/X_train.parquet: download from https:crunchdao--competition--production.s3-accelerate.amazonaws.com/data-releases/211/X_train.parquet (209821819 bytes)
data/X_train.parquet: already exists, file length match
data/X_test.reduced.parquet: download from https:crunchdao--competition--production.s3-accelerate.amazonaws.com/data-releases/211/X_test.reduced.parquet (2267868 bytes)
data/X_test.reduced.parquet: already exists, file length match
data/y_train.parquet: download from https:crunchdao--competition--production.s3-accelerate.amazonaws.com/data-releases/211/y_train.parquet (60884 bytes)
data/y_train.parquet: already exists, file length match
data/y_test.reduced.parquet: download from https:crunchdao--competition--production.s3-accelerate.amazonaws.com/data-releases/211/y_test.reduced.parquet (2549 bytes)
data/y_test.reduced.parquet: already exists, file length match


00:47:30 executing - command=infer


detected regular function, passing the full dataframe


00:47:31 checking determinism by executing the inference again with 30% of the data (tolerance: 1e-08)
00:47:31 executing - command=infer


detected regular function, passing the full dataframe


00:47:31 save prediction - path=prediction
00:47:31 determinism check: passed
00:47:31 ended
00:47:31 duration - time=00:00:03
00:47:31 memory - before="1.57 GB" after="1.69 GB" consumed="124.96 MB"


## Results

Once the local tester is done, you can preview the result stored in `prediction/prediction.parquet`.

In [30]:
prediction = pd.read_parquet("prediction/prediction.parquet")
prediction

id,prediction
10001,-0.177695
10002,-0.580514
10003,-0.021564
10004,-0.834794
10005,-0.358398
...,...
10097,-0.013969
10098,-0.559414
10099,-0.807586
10100,-0.059172


### Local scoring

You can call the function that the system uses to estimate your score locally.

In [31]:
# Load the targets
target = pd.read_parquet("data/y_test.reduced.parquet")["structural_breakpoint"]

# Call the scoring function
sklearn.metrics.roc_auc_score(
    target,
    prediction,
)

np.float64(0.6568627450980392)

# Submit your Notebook

To submit your work, you must:
1. Download your Notebook from Colab
2. Upload it to the platform
3. Create a run to validate it

### >> https://hub.crunchdao.com/competitions/structural-break-open-benchmark/submit/notebook

![Download and Submit Notebook](https://raw.githubusercontent.com/crunchdao/competitions/refs/heads/master/documentation/animations/download-and-submit-notebook.gif)