[![Open In Colab](https://raw.githubusercontent.com/crunchdao/competitions/refs/heads/master/documentation/badge/open-in-colab.svg)](https://colab.research.google.com/github/crunchdao/quickstarters/blob/master/competitions/structural-break/quickstarters/unsupervised-baseline/unsupervised-baseline.ipynb)

![Banner](https://raw.githubusercontent.com/crunchdao/quickstarters/refs/heads/master/competitions/structural-break/assets/banner.webp)

# ADIA Lab Structural Break Challenge

## Challenge Overview

Welcome to the ADIA Lab Structural Break Challenge! In this challenge, you will analyze univariate time series data to determine whether a structural break has occurred at a specified boundary point.

### What is a Structural Break?

A structural break occurs when the process governing the data generation changes at a certain point in time. These changes can be subtle or dramatic, and detecting them accurately is crucial across various domains such as climatology, industrial monitoring, finance, and healthcare.

![Structural Break Example](https://raw.githubusercontent.com/crunchdao/competitions/refs/heads/master/competitions/structural-break/quickstarters/baseline/images/example.png)

### Your Task

For each time series in the test set, you need to predict a score between `0` and `1`:
- Values closer to `0` indicate no structural break at the specified boundary point;
- Values closer to `1` indicate a structural break did occur.

### Evaluation Metric

The evaluation metric is [ROC AUC (Area Under the Receiver Operating Characteristic Curve)](https://scikit-learn.org/stable/modules/generated/sklearn.metrics.roc_auc_score.html), which measures the performance of detection algorithms regardless of their specific calibration.

- ROC AUC around `0.5`: No better than random chance;
- ROC AUC approaching `1.0`: Perfect detection.
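
Because ROC AUC depends only on how the scores rank the series, a strictly increasing transform of the scores leaves it unchanged. The sketch below illustrates this with made-up labels and scores; the arrays and the squashing function are assumptions for illustration, not competition data.

```python
import numpy as np
from sklearn.metrics import roc_auc_score

# Toy labels and scores (made up for illustration only)
y_true = np.array([0, 0, 1, 1, 0, 1])
scores = np.array([0.10, 0.40, 0.35, 0.80, 0.20, 0.90])

auc_raw = roc_auc_score(y_true, scores)

# A strictly increasing transform changes the values but not their order
squashed = 1.0 / (1.0 + np.exp(-10.0 * (scores - 0.5)))
auc_squashed = roc_auc_score(y_true, squashed)

print(auc_raw, auc_squashed)
```

Both values are identical, which is why the exact scale of your predictions matters less than their ordering.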

# Setup

The first steps to get started are:
1. Turn on Internet in Kaggle

![Turn on Internet](https://raw.githubusercontent.com/crunchdao/competitions/refs/heads/master/documentation/animations/kaggle-turn-on-internet.gif)

2. Get the setup command
3. Execute it in the cell below

### >> https://hub.crunchdao.com/competitions/structural-break/submit/notebook

![Reveal token](https://raw.githubusercontent.com/crunchdao/competitions/refs/heads/master/documentation/animations/reveal-token.gif)

In [None]:
# Install the Crunch CLI
%pip install --upgrade crunch-cli

# Setup your local environment
!crunch setup --notebook structural-break hello --token aaaabbbbccccddddeeeeffff

# Your model

## Setup

In [3]:
import os
import typing

# Import your dependencies
import joblib
import numpy as np
import pandas as pd
from tqdm.auto import tqdm
import sklearn.metrics
from scipy.stats import wasserstein_distance  # 1D Earth Mover's Distance

In [None]:
import crunch

# Load the Crunch Toolings
crunch = crunch.load_notebook()

## Understanding the Data

The dataset consists of univariate time series, each containing ~2,000-5,000 values with a designated boundary point. For each time series, you need to determine whether a structural break occurred at this boundary point.

The data was downloaded when you set up your local environment and is now available in the `data/` directory.

In [None]:
# Load the data simply
X_train, y_train, X_test = crunch.load_data()

### Understanding `X_train`

The training data is structured as a pandas DataFrame with a MultiIndex:

**Index Levels:**
- `id`: Identifies the unique time series
- `time`: (arbitrary) The time step within each time series, which is regularly sampled

**Columns:**
- `value`: The values of the time series at each given time step
- `period`: Whether the observation belongs to the first part of the time series (`0`), before the presumed break point, or to the second part (`1`), after it
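
As a minimal sketch of how this layout can be consumed, the snippet below builds a tiny synthetic frame with the same index levels and columns, then splits each series into its two segments. The frame `toy` and the dictionary `segments` are hypothetical names used only for this illustration.

```python
import pandas as pd

# Synthetic stand-in for X_train: two short series indexed by (id, time)
index = pd.MultiIndex.from_tuples(
    [(0, 0), (0, 1), (0, 2), (0, 3), (1, 0), (1, 1), (1, 2)],
    names=["id", "time"],
)
toy = pd.DataFrame(
    {
        "value": [0.1, 0.2, 0.3, 0.4, 1.0, 1.1, 1.2],
        "period": [0, 0, 1, 1, 0, 1, 1],
    },
    index=index,
)

segments = {}
for series_id, df in toy.groupby(level="id", sort=False):
    # Values before (period == 0) and after (period == 1) the boundary
    before = df.loc[df["period"] == 0, "value"].to_numpy()
    after = df.loc[df["period"] == 1, "value"].to_numpy()
    segments[series_id] = (before, after)
    print(series_id, len(before), len(after))
```

The same `groupby(level="id")` pattern applies to the real `X_train`, where each group is one complete time series.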

In [7]:
X_train

| id | time | value | period |
|---:|---:|---:|---:|
| 0 | 0 | -0.005564 | 0 |
| 0 | 1 | 0.003705 | 0 |
| 0 | 2 | 0.013164 | 0 |
| 0 | 3 | 0.007151 | 0 |
| 0 | 4 | -0.009979 | 0 |
| ... | ... | ... | ... |
| 10000 | 2134 | 0.001137 | 1 |
| 10000 | 2135 | 0.003526 | 1 |
| 10000 | 2136 | 0.000687 | 1 |
| 10000 | 2137 | 0.001640 | 1 |


### Understanding `y_train`

This is a `pandas.Series` that indicates, for each time series `id`, whether a structural break occurs at the presumed break point.

**Index:**
- `id`: the ID of the time series

**Value:**
- `structural_breakpoint`: Boolean indicating whether a structural break occurred (`True`) or not (`False`)

In [8]:
y_train

id
0         True
1         True
2        False
3         True
4        False
         ...  
9996     False
9997      True
9998     False
9999     False
10000     True
Name: structural_breakpoint, Length: 10001, dtype: bool

### Understanding `X_test`

The test data is provided as a **`list` of `pandas.DataFrame`s** with the same format as [`X_train`](#understanding-X_train).

It is structured as a list to encourage processing records one by one, which will be mandatory in the `infer()` function.

In [11]:
print("Number of datasets:", len(X_test))

Number of datasets: 101


In [12]:
X_test[0]

| id | time | value | period |
|---:|---:|---:|---:|
| 10001 | 0 | 0.010753 | 0 |
| 10001 | 1 | -0.031915 | 0 |
| 10001 | 2 | -0.010989 | 0 |
| 10001 | 3 | -0.011111 | 0 |
| 10001 | 4 | 0.011236 | 0 |
| 10001 | ... | ... | ... |
| 10001 | 2774 | -0.013937 | 1 |
| 10001 | 2775 | -0.015649 | 1 |
| 10001 | 2776 | -0.009744 | 1 |
| 10001 | 2777 | 0.025375 | 1 |


## Strategy Implementation

There are multiple approaches you can take to detect structural breaks:

1. **Statistical Tests**: Compare distributions before and after the boundary point;
2. **Feature Engineering**: Extract features from both segments for comparison;
3. **Time Series Modeling**: Detect deviations from expected patterns;
4. **Machine Learning**: Train models to recognize break patterns from labeled examples.

The baseline implementation below uses a simple unsupervised approach: it computes the Wasserstein distance (1D Earth Mover's Distance) between the values before and after the boundary point, then rescales that distance to a score in (0, 1).
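
The `train()` and `infer()` functions in this notebook score each series with `scipy.stats.wasserstein_distance`. To get a feel for that quantity, here is a small sketch on synthetic data; the seed, sample sizes, and shift values are arbitrary assumptions. For two equal-sized samples, the distance is the mean absolute difference between their sorted values, so a pure shift by `0.5` yields a distance of `0.5`.

```python
import numpy as np
from scipy.stats import wasserstein_distance

rng = np.random.default_rng(0)  # fixed seed, arbitrary choice

before = rng.normal(loc=0.0, scale=1.0, size=1000)

# Case 1: same distribution on both sides of the boundary
after_same = rng.normal(loc=0.0, scale=1.0, size=1000)

# Case 2: the second segment is the first one shifted by a constant
after_shifted = before + 0.5

# Case 3: same mean, doubled spread (a change a mean comparison would miss)
after_wide = before * 2.0

d_same = wasserstein_distance(before, after_same)
d_shifted = wasserstein_distance(before, after_shifted)
d_wide = wasserstein_distance(before, after_wide)

print(f"no break: {d_same:.3f}  mean shift: {d_shifted:.3f}  wider: {d_wide:.3f}")
```

The distance stays small when both segments come from the same distribution and grows with changes in either location or scale, which makes it a reasonable raw break signal.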

### The `train()` Function

In this function, you build and train your model for making inferences on the test data. Your model must be stored in the `model_directory_path`.

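The baseline implementation below does not train a predictive model and ignores `y_train`. It only computes the median and interquartile range (IQR) of the Wasserstein distances over the training series and saves them, so that `infer()` can rescale raw distances into scores.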

In [13]:
def train(
    X_train: pd.DataFrame,
    y_train: pd.Series,  # not used here (unsupervised baseline)
    model_directory_path: str,
):
    """
    SIMPLE TRAINING:
    - For each training time series (grouped by 'id'), compute a raw difference signal:
        raw = Wasserstein distance between period==0 and period==1 values.
    - Learn a robust scaling (median + IQR) to map raw distances into (0,1) via a logistic curve.
    - Save just two numbers (median, iqr) for use in inference.
    """

    # Build a list of raw EMD scores across all training ids
    raw_scores = []
    for _, df in X_train.groupby(level="id", sort=False):
        df = df[["value", "period"]].dropna()

        before = df.loc[df["period"] == 0, "value"].to_numpy(dtype=float, copy=False)
        after =  df.loc[df["period"] == 1, "value"].to_numpy(dtype=float, copy=False)

        # Safe fallback if any segment is empty
        if before.size == 0 or after.size == 0:
            raw = 0.0
        else:
            raw = float(wasserstein_distance(before, after))

        raw_scores.append(raw)

    # Fit robust parameters: median and IQR of the raw scores
    arr = np.array(raw_scores, dtype=float)
    if arr.size == 0:
        median = 0.0
        iqr = 1.0
    else:
        q25, median, q75 = np.percentile(arr, [25, 50, 75])
        iqr = float(q75 - q25)

        # If IQR is tiny, fall back to std, then to 1.0, to avoid a near-zero scale
        if iqr < 1e-8:
            std = float(np.std(arr))
            iqr = std if std >= 1e-8 else 1.0

    # Save the tiny "model": just the scaling params
    params = {
        "median": float(median),
        "iqr": float(iqr),
    }

    joblib.dump(params, os.path.join(model_directory_path, "model.joblib"))

### The `infer()` Function

In the inference function, your trained model (if any) is loaded and used to make predictions on test data.

**Important workflow:**
1. Load your model;
2. Use the `yield` statement to signal readiness to the runner;
3. Process each dataset one by one within the for loop;
4. For each dataset, use `yield prediction` to return your prediction.

**Note:** The datasets can only be iterated once!
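
The workflow above can be sketched with a toy generator and a toy driver. Both `toy_infer` and `toy_runner` are hypothetical stand-ins written for this illustration; the real Crunch runner may behave differently in its details.

```python
import typing


def toy_infer(datasets: typing.Iterable[list]) -> typing.Iterator:
    # Stand-in for loading a model
    threshold = 2.0

    yield  # first bare yield: signals readiness, carries no prediction

    for values in datasets:
        # One prediction per dataset, produced before the next one is requested
        yield 1.0 if max(values) > threshold else 0.0


def toy_runner(infer_fn, datasets):
    """Hypothetical driver; not the actual Crunch runner."""
    generator = infer_fn(iter(datasets))  # an iterator can be consumed only once
    ready = next(generator)               # consume the readiness signal
    assert ready is None
    return list(generator)                # collect one prediction per dataset


predictions = toy_runner(toy_infer, [[0.1, 0.5], [1.0, 3.0], [2.5]])
print(predictions)  # [0.0, 1.0, 1.0]
```

Because the datasets arrive through a one-shot iterator, any statistic you need from a dataset has to be computed while you hold it.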

In [14]:
def infer(
    X_test: typing.Iterable[pd.DataFrame],
    model_directory_path: str,
):
    """
    SIMPLE INFERENCE:
    - Load (median, iqr).
    - Crunch protocol: yield once to signal readiness.
    - For each dataset:
        * compute raw EMD between period 0 and 1,
        * map to (0,1) with a robust logistic transform,
        * yield the score.
    """

    params = joblib.load(os.path.join(model_directory_path, "model.joblib"))
    median = params["median"]
    iqr = params["iqr"]
    eps = 1e-8  # tiny constant for numerical stability

    # Signal readiness to the Crunch runner
    yield

    # X_test can only be iterated once.
    # Before getting the next dataset, you must predict the current one.
    for df in tqdm(X_test):
        df = df[["value", "period"]].dropna()
        before = df.loc[df["period"] == 0, "value"].to_numpy(dtype=float, copy=False)
        after =  df.loc[df["period"] == 1, "value"].to_numpy(dtype=float, copy=False)

        if before.size == 0 or after.size == 0:
            raw = 0.0
        else:
            raw = float(wasserstein_distance(before, after))

        # Robust logistic mapping to (0,1)
        # z = distance centered on the training median and scaled by 1.35 * IQR.
        # Any positive scale keeps the ranking of scores, so ROC AUC is unaffected.
        z = (raw - median) / (1.35 * iqr + eps)
        score = float(1.0 / (1.0 + np.exp(-z)))

        yield score   # send the prediction for the current dataset
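
To see what the scaling step in `infer()` does, the sketch below applies the same formula to a handful of made-up distances. The helper name `robust_logistic` and the numbers in `raw_train` are assumptions for illustration only.

```python
import numpy as np


def robust_logistic(raw: float, median: float, iqr: float, eps: float = 1e-8) -> float:
    # Same formula as in infer(): center on the median, scale by 1.35 * IQR
    z = (raw - median) / (1.35 * iqr + eps)
    return float(1.0 / (1.0 + np.exp(-z)))


# Made-up training distances
raw_train = np.array([0.01, 0.02, 0.03, 0.05, 0.20])
q25, median, q75 = np.percentile(raw_train, [25, 50, 75])
iqr = q75 - q25

for raw in (0.0, float(median), 0.20):
    print(raw, round(robust_logistic(raw, median, iqr), 3))
```

A distance equal to the training median maps to `0.5`, smaller distances map below it, and larger ones above it. The mapping is strictly increasing, so it changes the scale of the scores without changing their order.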

## Local testing

To make sure your `train()` and `infer()` functions are working properly, you can call the `crunch.test()` function, which reproduces the cloud environment locally. <br />
Even if it is not perfect, it should give you a quick idea if your model is working properly.

In [None]:
crunch.test(
    # Uncomment to disable the train
    # force_first_train=False,

    # Uncomment to disable the determinism check
    # no_determinism_check=True,
)

## Results

Once the local tester is done, you can preview the result stored in `data/prediction.parquet`.

In [None]:
prediction = pd.read_parquet("data/prediction.parquet")
prediction

### Local scoring

You can call the function that the system uses to estimate your score locally.

In [None]:
# Load the targets
target = pd.read_parquet("data/y_test.reduced.parquet")["structural_breakpoint"]

# Call the scoring function
sklearn.metrics.roc_auc_score(
    target,
    prediction,
)

# Submit your Notebook

To submit your work, you must:
1. Download your Notebook from Kaggle
2. Upload it to the platform
3. Create a run to validate it

### >> https://hub.crunchdao.com/competitions/structural-break/submit/notebook

![Download and Submit Notebook](https://raw.githubusercontent.com/crunchdao/competitions/refs/heads/master/documentation/animations/download-and-submit-notebook-on-kaggle.gif)