[![Open In Colab](https://raw.githubusercontent.com/crunchdao/competitions/refs/heads/master/documentation/badge/open-in-colab.svg)](https://colab.research.google.com/github/crunchdao/quickstarters/blob/master/competitions/structural-break/quickstarters/baseline/baseline.ipynb)
[![Open In Kaggle](https://raw.githubusercontent.com/crunchdao/competitions/refs/heads/master/documentation/badge/open-in-kaggle.svg)](https://www.kaggle.com/code/crunchdao/structural-break-baseline)

![Banner](https://raw.githubusercontent.com/crunchdao/quickstarters/refs/heads/master/competitions/structural-break/assets/banner.webp)

# ADIA Lab Structural Break Challenge

## Challenge Overview

Welcome to the ADIA Lab Structural Break Challenge! In this challenge, you will analyze univariate time series data to determine whether a structural break has occurred at a specified boundary point.

### What is a Structural Break?

A structural break occurs when the process governing the data generation changes at a certain point in time. These changes can be subtle or dramatic, and detecting them accurately is crucial across various domains such as climatology, industrial monitoring, finance, and healthcare.

![Structural Break Example](https://raw.githubusercontent.com/crunchdao/competitions/refs/heads/master/competitions/structural-break/quickstarters/baseline/images/example.png)

### Your Task

For each time series in the test set, you need to predict a score between `0` and `1`:
- Values closer to `0` indicate no structural break at the specified boundary point;
- Values closer to `1` indicate a structural break did occur.

### Evaluation Metric

The evaluation metric is [ROC AUC (Area Under the Receiver Operating Characteristic Curve)](https://scikit-learn.org/stable/modules/generated/sklearn.metrics.roc_auc_score.html), which measures the performance of detection algorithms regardless of their specific calibration.

- ROC AUC around `0.5`: No better than random chance;
- ROC AUC approaching `1.0`: Perfect detection.

# Setup

The first steps to get started are:
1. Get the setup command
2. Execute it in the cell below

### >> https://hub.crunchdao.com/competitions/structural-break/submit/notebook

![Reveal token](https://raw.githubusercontent.com/crunchdao/competitions/refs/heads/master/documentation/animations/reveal-token.gif)

In [3]:
print("hi")

hi


In [4]:
print("hellow world")

hellow world


In [1]:
# Install the Crunch CLI
%pip install --upgrade crunch-cli

# Setup your local environment
!crunch setup --notebook structural-break hello --token YGJHU3Wm5YfrhXHC7HKL1Uqw

Collecting crunch-cli
  Downloading crunch_cli-7.4.0-py3-none-any.whl.metadata (3.4 kB)
Collecting coloredlogs (from crunch-cli)
  Downloading coloredlogs-15.0.1-py2.py3-none-any.whl.metadata (12 kB)
Collecting dataclasses_json (from crunch-cli)
  Downloading dataclasses_json-0.6.7-py3-none-any.whl.metadata (25 kB)
Collecting inquirer (from crunch-cli)
  Downloading inquirer-3.4.1-py3-none-any.whl.metadata (6.8 kB)
Collecting libcst (from crunch-cli)
  Downloading libcst-1.8.2-cp311-cp311-manylinux_2_28_x86_64.whl.metadata (15 kB)
Collecting python-dotenv (from crunch-cli)
  Downloading python_dotenv-1.1.1-py3-none-any.whl.metadata (24 kB)
Collecting requirements-parser>=0.11.0 (from crunch-cli)
  Downloading requirements_parser-0.13.0-py3-none-any.whl.metadata (4.7 kB)
Collecting humanfriendly>=9.1 (from coloredlogs->crunch-cli)
  Downloading humanfriendly-10.0-py2.py3-none-any.whl.metadata (9.2 kB)
Collecting marshmallow<4.0.0,>=3.18.0 (from dataclasses_json->crunch-cli)
  Downloadin

# Your model

## Setup

In [7]:
import os
import typing

# Import your dependencies
import joblib
import pandas as pd
import scipy
import sklearn.metrics

In [8]:
import crunch

# Load the Crunch Toolings
crunch = crunch.load_notebook()

loaded inline runner with module: <module '__main__'>

cli version: 7.4.0
available ram: 12.67 gb
available cpu: 2 core
----


## Understanding the Data

The dataset consists of univariate time series, each containing ~2,000-5,000 values with a designated boundary point. For each time series, you need to determine whether a structural break occurred at this boundary point.

The data was downloaded when you setup your local environment and is now available in the `data/` directory.

In [10]:
# Load the data simply
X_train, y_train, X_test = crunch.load_data()

data/X_train.parquet: download from https:crunchdao--competition--production.s3-accelerate.amazonaws.com/data-releases/146/X_train.parquet (204327238 bytes)
data/X_train.parquet: already exists, file length match
data/X_test.reduced.parquet: download from https:crunchdao--competition--production.s3-accelerate.amazonaws.com/data-releases/146/X_test.reduced.parquet (2380918 bytes)
data/X_test.reduced.parquet: already exists, file length match
data/y_train.parquet: download from https:crunchdao--competition--production.s3-accelerate.amazonaws.com/data-releases/146/y_train.parquet (61003 bytes)
data/y_train.parquet: already exists, file length match
data/y_test.reduced.parquet: download from https:crunchdao--competition--production.s3-accelerate.amazonaws.com/data-releases/146/y_test.reduced.parquet (2655 bytes)
data/y_test.reduced.parquet: already exists, file length match


### Understanding `X_train`

The training data is structured as a pandas DataFrame with a MultiIndex:

**Index Levels:**
- `id`: Identifies the unique time series
- `time`: The timestep within each time series

**Columns:**
- `value`: The actual time series value at each timestep
- `period`: A binary indicator where `0` represents the **period before** the boundary point, and `1` represents the **period after** the boundary point

In [23]:
import scipy.stats as stats

In [38]:
idx = X_train.loc[0].period==0
idx = idx.astype(int)

In [43]:
X_train.loc[0].iloc[~idx]

Unnamed: 0_level_0,value,period
time,Unnamed: 1_level_1,Unnamed: 2_level_1
1643,0.001089,1
1643,0.001089,1
1643,0.001089,1
1643,0.001089,1
1643,0.001089,1
...,...,...
1644,0.002160,1
1644,0.002160,1
1644,0.002160,1
1644,0.002160,1


In [45]:
X_train.loc[0].iloc[idx]

Unnamed: 0_level_0,value,period
time,Unnamed: 1_level_1,Unnamed: 2_level_1
1,0.003705,0
1,0.003705,0
1,0.003705,0
1,0.003705,0
1,0.003705,0
...,...,...
0,-0.005564,0
0,-0.005564,0
0,-0.005564,0
0,-0.005564,0


In [55]:
import numpy as np

In [56]:
print(np.array(stats.ks_2samp(X_train.loc[0].iloc[idx].value,X_train.loc[0].iloc[~idx].value,alternative="two_sided")))

[0.8212766 0.       ]


In [52]:
type(X_train)

In [112]:
def ks_stat_f(df):
  #print(df.head())
  idx = df.period==0
  idx = idx.astype(int) #creates index where 0 is before structed break

  sample1 = df.iloc[idx].value
  sample2 = df.iloc[~idx].value
  res = stats.ks_2samp(sample1,sample2, alternative="two_sided" )
  stat = res.statistic
  p = res.pvalue


  return stat

def ks_p_f(df):
  #print(df.head())
  idx = df.period==0
  idx = idx.astype(int) #creates index where 0 is before structed break

  sample1 = df.iloc[idx].value
  sample2 = df.iloc[~idx].value
  res = stats.ks_2samp(sample1,sample2, alternative="two_sided" )
  stat = res.statistic
  p = res.pvalue


  return p




In [101]:
pd.DataFrame(data_train_sub.groupby("id").apply(ks_testing), columns = ["col"])

[np.float64(0.8212765957446808), np.float64(0.0)]
[np.float64(0.8884934756820878), np.float64(0.0)]
[np.float64(0.8191713483146067), np.float64(0.0)]
[np.float64(1.0), np.float64(0.0)]
[np.float64(0.7841930903928065), np.float64(0.0)]
[np.float64(1.0), np.float64(0.0)]
[np.float64(1.0), np.float64(0.0)]
[np.float64(0.7167106752168992), np.float64(0.0)]
[np.float64(0.8580159849553362), np.float64(0.0)]
[np.float64(0.7508261731658956), np.float64(0.0)]
[np.float64(0.8365231259968102), np.float64(0.0)]


Unnamed: 0_level_0,col
id,Unnamed: 1_level_1
0,"(0.8212765957446808, 0.0)"
1,"(0.8884934756820878, 0.0)"
2,"(0.8191713483146067, 0.0)"
3,"(1.0, 0.0)"
4,"(0.7841930903928065, 0.0)"
5,"(1.0, 0.0)"
6,"(1.0, 0.0)"
7,"(0.7167106752168992, 0.0)"
8,"(0.8580159849553362, 0.0)"
9,"(0.7508261731658956, 0.0)"


In [115]:
data_train_sub = X_train.loc[0:500]
ks_stat = pd.DataFrame(data_train_sub.groupby("id").apply(ks_stat_f), columns=['ks_stat'])
ks_p = pd.DataFrame(data_train_sub.groupby("id").apply(ks_p_f), columns=['ks_p'])

ks_test_results = pd.concat([ks_stat,ks_p], axis=1)
ks_test_results #This returns, for each sample data the results of a ks test

Unnamed: 0_level_0,ks_stat,ks_p
id,Unnamed: 1_level_1,Unnamed: 2_level_1
0,0.821277,0.0
1,0.888493,0.0
2,0.819171,0.0
3,1.000000,0.0
4,0.784193,0.0
...,...,...
496,0.652374,0.0
497,0.700608,0.0
498,0.680825,0.0
499,1.000000,0.0


In [116]:
data_train_sub

Unnamed: 0_level_0,Unnamed: 1_level_0,value,period
id,time,Unnamed: 2_level_1,Unnamed: 3_level_1
0,0,-0.005564,0
0,1,0.003705,0
0,2,0.013164,0
0,3,0.007151,0
0,4,-0.009979,0
...,...,...,...
500,2326,-0.020000,1
500,2327,-0.001276,1
500,2328,0.019157,1
500,2329,-0.002506,1


In [107]:
ks_test_results

Unnamed: 0_level_0,ks_stat__ks_p
id,Unnamed: 1_level_1
0,"(0.8212765957446808, 0.0)"
1,"(0.8884934756820878, 0.0)"
2,"(0.8191713483146067, 0.0)"
3,"(1.0, 0.0)"
4,"(0.7841930903928065, 0.0)"
5,"(1.0, 0.0)"
6,"(1.0, 0.0)"
7,"(0.7167106752168992, 0.0)"
8,"(0.8580159849553362, 0.0)"
9,"(0.7508261731658956, 0.0)"


### Understanding `y_train`

This is a simple `pandas.Series` that tells if a dataset id has a structural breakpoint or not.

**Index:**
- `id`: the ID of the dataset

**Value:**
- `structural_breakpoint`: Boolean indicating whether a structural break occurred (`True`) or not (`False`)

In [12]:
y_train

Unnamed: 0_level_0,structural_breakpoint
id,Unnamed: 1_level_1
0,False
1,False
2,True
3,False
4,False
...,...
9996,False
9997,False
9998,False
9999,False


### Understanding `X_test`

The test data is provided as a **`list` of `pandas.DataFrame`s** with the same format as [`X_train`](#understanding-X_test).

It is structured as a list to encourage processing records one by one, which will be mandatory in the `infer()` function.

In [13]:
print("Number of datasets:", len(X_test))

Number of datasets: 101


In [14]:
X_test[0]

Unnamed: 0_level_0,Unnamed: 1_level_0,value,period
id,time,Unnamed: 2_level_1,Unnamed: 3_level_1
10001,0,0.010753,0
10001,1,-0.031915,0
10001,2,-0.010989,0
10001,3,-0.011111,0
10001,4,0.011236,0
10001,...,...,...
10001,2774,-0.013937,1
10001,2775,-0.015649,1
10001,2776,-0.009744,1
10001,2777,0.025375,1


## Strategy Implementation

There are multiple approaches you can take to detect structural breaks:

1. **Statistical Tests**: Compare distributions before and after the boundary point;
2. **Feature Engineering**: Extract features from both segments for comparison;
3. **Time Series Modeling**: Detect deviations from expected patterns;
4. **Machine Learning**: Train models to recognize break patterns from labeled examples.

The baseline implementation below uses a simple statistical approach: a t-test to compare the distributions before and after the boundary point.

### The `train()` Function

In this function, you build and train your model for making inferences on the test data. Your model must be stored in the `model_directory_path`.

The baseline implementation below doesn't require a pre-trained model, as it uses a statistical test that will be computed at inference time.

In [15]:
def train(
    X_train: pd.DataFrame,
    y_train: pd.Series,
    model_directory_path: str,
):

    # For our baseline t-test approach, we don't need to train a model
    # This is essentially an unsupervised approach calculated at inference time
    model = None

    # You could enhance this by training an actual model, for example:
    # 1. Extract features from before/after segments of each time series
    # 2. Train a classifier using these features and y_train labels
    # 3. Save the trained model

    joblib.dump(model, os.path.join(model_directory_path, 'model.joblib'))

### The `infer()` Function

In the inference function, your trained model (if any) is loaded and used to make predictions on test data.

**Important workflow:**
1. Load your model;
2. Use the `yield` statement to signal readiness to the runner;
3. Process each dataset one by one within the for loop;
4. For each dataset, use `yield prediction` to return your prediction.

**Note:** The datasets can only be iterated once!

In [16]:
def infer(
    X_test: typing.Iterable[pd.DataFrame],
    model_directory_path: str,
):
    model = joblib.load(os.path.join(model_directory_path, 'model.joblib'))

    yield  # Mark as ready

    # X_test can only be iterated once.
    # Before getting the next dataset, you must predict the current one.
    for dataset in X_test:
        # Baseline approach: Compute t-test between values before and after boundary point
        # The negative p-value is used as our score - smaller p-values (larger negative numbers)
        # indicate more evidence against the null hypothesis that distributions are the same,
        # suggesting a structural break
        def t_test(u: pd.DataFrame):
            return -scipy.stats.ttest_ind(
                u["value"][u["period"] == 0],  # Values before boundary point
                u["value"][u["period"] == 1],  # Values after boundary point
            ).pvalue

        prediction = t_test(dataset)
        yield prediction  # Send the prediction for the current dataset

        # Note: This baseline approach uses a t-test to compare the distributions
        # before and after the boundary point. A smaller p-value (larger negative number)
        # suggests stronger evidence that the distributions are different,
        # indicating a potential structural break.

## Local testing

To make sure your `train()` and `infer()` function are working properly, you can call the `crunch.test()` function that will reproduce the cloud environment locally. <br />
Even if it is not perfect, it should give you a quick idea if your model is working properly.

In [17]:
crunch.test(
    # Uncomment to disable the train
    # force_first_train=False,

    # Uncomment to disable the determinism check
    # no_determinism_check=True,
)

11:34:05 no forbidden library found
11:34:05 
11:34:05 started
11:34:05 running local test
11:34:05 internet access isn't restricted, no check will be done
11:34:05 
11:34:06 starting unstructured loop...
11:34:06 executing - command=train


data/X_train.parquet: download from https:crunchdao--competition--production.s3-accelerate.amazonaws.com/data-releases/146/X_train.parquet (204327238 bytes)
data/X_train.parquet: already exists, file length match
data/X_test.reduced.parquet: download from https:crunchdao--competition--production.s3-accelerate.amazonaws.com/data-releases/146/X_test.reduced.parquet (2380918 bytes)
data/X_test.reduced.parquet: already exists, file length match
data/y_train.parquet: download from https:crunchdao--competition--production.s3-accelerate.amazonaws.com/data-releases/146/y_train.parquet (61003 bytes)
data/y_train.parquet: already exists, file length match
data/y_test.reduced.parquet: download from https:crunchdao--competition--production.s3-accelerate.amazonaws.com/data-releases/146/y_test.reduced.parquet (2655 bytes)
data/y_test.reduced.parquet: already exists, file length match


11:34:08 executing - command=infer
11:34:08 checking determinism by executing the inference again with 30% of the data (tolerance: 1e-08)
11:34:08 executing - command=infer
11:34:08 determinism check: passed
11:34:08 save prediction - path=data/prediction.parquet
11:34:08 ended
11:34:08 duration - time=00:00:03
11:34:08 memory - before="811.96 MB" after="823.27 MB" consumed="11.31 MB"


## Results

Once the local tester is done, you can preview the result stored in `data/prediction.parquet`.

In [18]:
prediction = pd.read_parquet("data/prediction.parquet")
prediction

Unnamed: 0_level_0,prediction
id,Unnamed: 1_level_1
10001,-0.590381
10002,-0.363831
10003,-0.731208
10004,-0.762609
10005,-0.527371
...,...
10097,-0.539917
10098,-0.843084
10099,-0.203762
10100,-0.612978


### Local scoring

You can call the function that the system uses to estimate your score locally.

In [19]:
# Load the targets
target = pd.read_parquet("data/y_test.reduced.parquet")["structural_breakpoint"]

# Call the scoring function
sklearn.metrics.roc_auc_score(
    target,
    prediction,
)

np.float64(0.48450704225352115)

# Submit your Notebook

To submit your work, you must:
1. Download your Notebook from Colab
2. Upload it to the platform
3. Create a run to validate it

### >> https://hub.crunchdao.com/competitions/structural-break/submit/notebook

![Download and Submit Notebook](https://raw.githubusercontent.com/crunchdao/competitions/refs/heads/master/documentation/animations/download-and-submit-notebook.gif)