[![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/crunchdao/quickstarters/blob/master/competitions/structural-break/quickstarters/baseline/baseline.ipynb)

![Banner](https://raw.githubusercontent.com/crunchdao/quickstarters/refs/heads/master/competitions/structural-break/assets/banner.webp)

# ADIA Lab Structural Break Challenge

## Challenge Overview

Welcome to the ADIA Lab Structural Break Challenge! In this challenge, you will analyze univariate time series data to determine whether a structural break has occurred at a specified boundary point.

### What is a Structural Break?

A structural break occurs when the process governing the data generation changes at a certain point in time. These changes can be subtle or dramatic, and detecting them accurately is crucial across various domains such as climatology, industrial monitoring, finance, and healthcare.

![Structural Break Example](https://raw.githubusercontent.com/crunchdao/competitions/refs/heads/master/competitions/structural-break/quickstarters/baseline/images/example.png)

### Your Task

For each time series in the test set, you need to predict a score between `0` and `1`:
- Values closer to `0` indicate no structural break at the specified boundary point;
- Values closer to `1` indicate a structural break did occur.

### Evaluation Metric

The evaluation metric is [ROC AUC (Area Under the Receiver Operating Characteristic Curve)](https://scikit-learn.org/stable/modules/generated/sklearn.metrics.roc_auc_score.html), which measures the performance of detection algorithms regardless of their specific calibration.

- ROC AUC around `0.5`: No better than random chance;
- ROC AUC approaching `1.0`: Perfect detection.
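
Because ROC AUC depends only on how the scores rank the series, it can be read as the fraction of (break, no-break) pairs that are ordered correctly. The snippet below is a minimal pure-Python sketch of that pairwise definition, not the scorer used by the platform; `pairwise_auc` is a hypothetical helper introduced here for illustration only.

```python
def pairwise_auc(labels, scores):
    """Fraction of (positive, negative) pairs ranked correctly; ties count as half."""
    pos = [s for y, s in zip(labels, scores) if y]
    neg = [s for y, s in zip(labels, scores) if not y]
    if not pos or not neg:
        raise ValueError("need at least one positive and one negative label")
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos) * len(neg))


labels = [False, False, True, True]
scores = [0.1, 0.4, 0.35, 0.8]

auc = pairwise_auc(labels, scores)                             # 3 of 4 pairs ordered correctly
auc_rescaled = pairwise_auc(labels, [s / 10 for s in scores])  # same ranking, same AUC
auc_constant = pairwise_auc(labels, [1.0] * len(labels))       # every pair is a tie
```

Rescaling the scores leaves the result unchanged, which is what "regardless of calibration" means in practice, while a constant prediction ties every pair and lands at `0.5`.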

# Setup

The first steps to get started are:
1. Get the setup command
2. Execute it in the cell below

### >> https://hub.crunchdao.com/competitions/structural-break/submit/notebook

![Reveal token](https://raw.githubusercontent.com/crunchdao/competitions/refs/heads/master/documentation/animations/reveal-token.gif)

In [40]:
%pip install crunch-cli --upgrade --quiet --progress-bar off
!crunch setup-notebook structural-break CTPsuKof1y0hjxwoCgcJ74hB

crunch-cli, version 6.5.0

---
Your token seems to have expired or is invalid.

Please follow this link to copy and paste your new setup command:
https://hub.crunchdao.com/competitions/structural-break/submit

If you think that is an error, please contact an administrator.


# Your model

## Setup

In [3]:
import os
import typing

# Import your dependencies
import joblib
import pandas as pd
import scipy
import sklearn.metrics

In [4]:
import crunch

# Load the Crunch Toolings
crunch = crunch.load_notebook()

loaded inline runner with module: <module '__main__'>

cli version: 6.5.0
available ram: 12.67 gb
available cpu: 2 core
----


## Understanding the Data

The dataset consists of univariate time series, each containing ~2,000-5,000 values with a designated boundary point. For each time series, you need to determine whether a structural break occurred at this boundary point.

The data was downloaded when you set up your local environment and is now available in the `data/` directory.

In [5]:
# Load the data simply
X_train, y_train, X_test = crunch.load_data()

data/X_train.parquet: download from https:crunchdao--competition--production.s3.eu-west-1.amazonaws.com/data-releases/146/X_train.parquet (204327238 bytes)
data/X_train.parquet: already exists, file length match
data/X_test.reduced.parquet: download from https:crunchdao--competition--production.s3.eu-west-1.amazonaws.com/data-releases/146/X_test.reduced.parquet (2380918 bytes)
data/X_test.reduced.parquet: already exists, file length match
data/y_train.parquet: download from https:crunchdao--competition--production.s3.eu-west-1.amazonaws.com/data-releases/146/y_train.parquet (61003 bytes)
data/y_train.parquet: already exists, file length match
data/y_test.reduced.parquet: download from https:crunchdao--competition--production.s3.eu-west-1.amazonaws.com/data-releases/146/y_test.reduced.parquet (2655 bytes)
data/y_test.reduced.parquet: already exists, file length match


### Understanding `X_train`

The training data is structured as a pandas DataFrame with a MultiIndex:

**Index Levels:**
- `id`: Identifies the unique time series
- `time`: The timestep within each time series

**Columns:**
- `value`: The actual time series value at each timestep
- `period`: A binary indicator where `0` represents the **period before** the boundary point, and `1` represents the **period after** the boundary point
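
To make this layout concrete, here is a small sketch that builds a toy frame with the same index levels and columns and splits one series at its boundary. The values are made up for illustration and are not taken from the competition data.

```python
import pandas as pd

# Toy frame mimicking the described layout (values are invented).
index = pd.MultiIndex.from_tuples(
    [(0, 0), (0, 1), (0, 2), (0, 3), (1, 0), (1, 1), (1, 2)],
    names=["id", "time"],
)
toy = pd.DataFrame(
    {
        "value": [0.1, -0.2, 0.3, 0.5, 1.0, 1.1, 0.9],
        "period": [0, 0, 1, 1, 0, 1, 1],
    },
    index=index,
)

# Select one series by its `id`; the remaining index level is `time`.
series = toy.loc[0]

# Split it at the boundary using the `period` column.
before = series.loc[series["period"] == 0, "value"]
after = series.loc[series["period"] == 1, "value"]

# Segment lengths for every series in the frame.
segment_sizes = toy.groupby(["id", "period"]).size()
```

The same `toy.loc[some_id]` selection pattern should apply to `X_train`, assuming it follows the structure described above.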

### Understanding `y_train`

This is a simple `pandas.Series` that indicates, for each time series `id`, whether a structural break occurs at its boundary point.

**Index:**
- `id`: the ID of the dataset

**Value:**
- `structural_breakpoint`: Boolean indicating whether a structural break occurred (`True`) or not (`False`)

In [6]:
y_train

Unnamed: 0_level_0,structural_breakpoint
id,Unnamed: 1_level_1
0,False
1,False
2,True
3,False
4,False
...,...
9996,False
9997,False
9998,False
9999,False


### Understanding `X_test`

The test data is provided as a **`list` of `pandas.DataFrame`s** with the same format as [`X_train`](#understanding-X_train).

It is structured as a list to encourage processing records one by one, which will be mandatory in the `infer()` function.

In [7]:
print("Number of datasets:", len(X_test))

Number of datasets: 101


In [8]:
X_test[0]

Unnamed: 0_level_0,Unnamed: 1_level_0,value,period
id,time,Unnamed: 2_level_1,Unnamed: 3_level_1
10001,0,0.010753,0
10001,1,-0.031915,0
10001,2,-0.010989,0
10001,3,-0.011111,0
10001,4,0.011236,0
10001,...,...,...
10001,2774,-0.013937,1
10001,2775,-0.015649,1
10001,2776,-0.009744,1
10001,2777,0.025375,1


## Strategy Implementation

There are multiple approaches you can take to detect structural breaks:

1. **Statistical Tests**: Compare distributions before and after the boundary point;
2. **Feature Engineering**: Extract features from both segments for comparison;
3. **Time Series Modeling**: Detect deviations from expected patterns;
4. **Machine Learning**: Train models to recognize break patterns from labeled examples.
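
As an illustration of the first approach in the list, the sketch below turns a Welch t-statistic on the two segments into a score between `0` and `1`. It is a minimal pure-Python example under stated assumptions: `welch_t_score` is a hypothetical helper, the squashing `t / (1 + t)` is an arbitrary monotone choice, and the statistic only reacts to shifts in the mean, not to other kinds of breaks.

```python
import math
import statistics


def welch_t_score(before, after):
    """Map the absolute Welch t-statistic to a score in [0, 1]."""
    mean_b, mean_a = statistics.fmean(before), statistics.fmean(after)
    var_b, var_a = statistics.variance(before), statistics.variance(after)
    # Standard error of the difference in means, without assuming equal variances.
    std_err = math.sqrt(var_b / len(before) + var_a / len(after))
    if std_err == 0.0:
        # Degenerate case: both segments are constant.
        return 0.0 if mean_b == mean_a else 1.0
    t_stat = abs(mean_a - mean_b) / std_err
    # Monotone squashing: larger |t| gives a score closer to 1.
    return t_stat / (1.0 + t_stat)


same = welch_t_score([0.0, 1.0, 0.0, 1.0], [1.0, 0.0, 1.0, 0.0])
shifted = welch_t_score([0.0, 1.0, 0.0, 1.0], [5.0, 6.0, 5.0, 6.0])
```

Since the metric is ROC AUC, only the ordering of these scores across series matters, so any monotone mapping of the statistic would rank the series identically.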

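The implementation below follows the machine-learning route: each segment is smoothed with a scalar Kalman filter, the raw and filtered values are stacked as two input channels, and a shared Transformer encoder feeds a classifier that compares the segments before and after the boundary point.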

### The `train()` Function

In this function, you build and train your model for making inferences on the test data. Your model must be stored in the `model_directory_path`.

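The implementation below builds a Transformer-based classifier on the raw and Kalman-filtered segments and saves its weights to `model_directory_path` as `model_transformer.pt`, so that `infer()` can load them. Note that `num_epochs` is set to `0` in the `train()` cell, so no optimization step runs and the saved weights are the random initialization.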

In [9]:
import os
import pandas as pd
import numpy as np
from multiprocessing import Pool, cpu_count
import joblib
import torch
import torch.nn as nn
from torch.utils.data import Dataset, DataLoader
from torch.nn.utils.rnn import pad_sequence

In [32]:
import numpy as np
import torch
import time

# Set device once
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

# Kalman filter constants as tensors (still used outside JIT)
F = torch.tensor(1.0, device=device)
H = torch.tensor(1.0, device=device)
R = torch.tensor(1.0, device=device)
Q = torch.tensor(1e-5, device=device)

# TorchScript-compatible Kalman filter
@torch.jit.script
def run_kalman_filter_script(
    measurements: torch.Tensor,
    F: float,
    H: float,
    R: float,
    Q: float,
    device: torch.device
) -> torch.Tensor:

    x = torch.tensor(0.0, device=measurements.device)
    P = torch.tensor(1.0, device=measurements.device)
    est = []

    for z in measurements:
        x = F * x
        P = F * P * F + Q

        K = P * H / (H * P * H + R)
        x = x + K * (z - H * x)
        P = (1 - K * H) * P

        est.append(x)

    return torch.stack(est)


def extract_kalman(X_train,device = torch.device("cuda" if torch.cuda.is_available() else "cpu")):
    # Main loop
    kalman_series0 = []
    kalman_series1 = []
    raw_series0 = []
    raw_series1 = []

    # Assuming X_train is already defined with MultiIndex
    index_x = X_train.index.get_level_values(0).unique()

    start_time = time.time()
    for i in index_x:
        if i % 100 == 0:
            print(f"Estimating Kalman series for {i}-th series, time taken: {time.time() - start_time:.2f}s")
            start_time = time.time()

        series = X_train.loc[i]

        t0 = torch.tensor(series[series.period == 0].value.values, dtype=torch.float32, device=device)
        t1 = torch.tensor(series[series.period == 1].value.values, dtype=torch.float32, device=device)
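        # Store raw segments as CPU NumPy arrays so np.stack in RegimePairDataset also works on CUDA
        raw_series0.append(t0.cpu().numpy())
        raw_series1.append(t1.cpu().numpy())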

        est0 = run_kalman_filter_script(t0, F.item(), H.item(), R.item(), Q.item(),device).cpu().numpy()
        est1 = run_kalman_filter_script(t1, F.item(), H.item(), R.item(), Q.item(),device).cpu().numpy()

        kalman_series0.append(est0)
        kalman_series1.append(est1)

    return kalman_series0, kalman_series1, raw_series0, raw_series1

def extract_kalman_test_single(X_test,device = torch.device("cuda" if torch.cuda.is_available() else "cpu")):
    # Main loop
    kalman_series0 = []
    kalman_series1 = []
    raw_series0 = []
    raw_series1 = []


    start_time = time.time()
    single = [X_test]
    for i,series in enumerate(single):

        t0 = torch.tensor(series[series.period == 0].value.values, dtype=torch.float32, device=device)
        t1 = torch.tensor(series[series.period == 1].value.values, dtype=torch.float32, device=device)
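        # Store raw segments as CPU NumPy arrays so np.stack in RegimePairDataset also works on CUDA
        raw_series0.append(t0.cpu().numpy())
        raw_series1.append(t1.cpu().numpy())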

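        # `device` is a required argument of the scripted function, so it must be passed here too
        est0 = run_kalman_filter_script(t0, F.item(), H.item(), R.item(), Q.item(), device).cpu().numpy()
        est1 = run_kalman_filter_script(t1, F.item(), H.item(), R.item(), Q.item(), device).cpu().numpy()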

        kalman_series0.append(est0)
        kalman_series1.append(est1)

    return kalman_series0, kalman_series1, raw_series0, raw_series1
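
To make the filter recursion in this cell easier to inspect, here is a minimal pure-Python sketch of the same scalar predict and update steps, using the same constants (`F = H = R = 1`, `Q = 1e-5`) and the same initial state (`x = 0`, `P = 1`). The name `kalman_smooth` is hypothetical and the function is only meant for experimenting outside of PyTorch.

```python
def kalman_smooth(measurements, F=1.0, H=1.0, R=1.0, Q=1e-5):
    """Scalar Kalman filter: returns the filtered state after each measurement."""
    x, P = 0.0, 1.0  # initial state estimate and its variance
    estimates = []
    for z in measurements:
        # Predict step: propagate the state and its uncertainty.
        x = F * x
        P = F * P * F + Q
        # Update step: blend prediction and measurement via the Kalman gain.
        K = P * H / (H * P * H + R)
        x = x + K * (z - H * x)
        P = (1.0 - K * H) * P
        estimates.append(x)
    return estimates


# A constant signal: the estimate starts near 0.5 and climbs toward 1.0.
filtered = kalman_smooth([1.0] * 50)
```

With such a small process noise `Q`, the gain shrinks quickly, so the filtered series behaves roughly like a running mean of the segment that is pulled toward the initial state of `0`.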



In [11]:
def _make_pad_mask(self, lengths, max_len):
    # Ensure both tensors are on the same device
    device = lengths.device
    return (torch.arange(max_len, device=device)[None, :] >= lengths[:, None])

In [12]:
class TransformerEncoder(nn.Module):
    def __init__(self, d_model=64, nhead=4, num_layers=2, dropout=0.1):
        super().__init__()
        self.input_proj = nn.Linear(2, d_model)
        encoder_layer = nn.TransformerEncoderLayer(d_model, nhead, dropout=dropout, batch_first=True)
        self.encoder = nn.TransformerEncoder(encoder_layer, num_layers)
        self.pool = nn.AdaptiveAvgPool1d(1)

    def forward(self, x, lengths):
        x = self.input_proj(x)
        mask = self._make_pad_mask(lengths, x.size(1))
        encoded = self.encoder(x, src_key_padding_mask=mask)
        pooled = self.pool(encoded.transpose(1, 2)).squeeze(-1)
        return pooled

    def _make_pad_mask(self, lengths, max_len):
        device = lengths.device
        return (torch.arange(max_len, device=device)[None, :] >= lengths[:, None])
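
The padding mask built by `_make_pad_mask` marks positions at or beyond each sequence's true length. The NumPy sketch below reproduces the same broadcasting comparison and adds a length-aware mean purely as an illustration. The `masked_mean` helper is hypothetical: the encoder above, as written, appears to average over every position of the padded batch, and whether a masked mean would help is an assumption left to experiment.

```python
import numpy as np


def make_pad_mask(lengths, max_len):
    """True where a position is padding (index >= true length)."""
    return np.arange(max_len)[None, :] >= np.asarray(lengths)[:, None]


def masked_mean(x, lengths):
    """Mean over the time axis that ignores padded positions. x: [B, T, D]."""
    valid = ~make_pad_mask(lengths, x.shape[1])   # [B, T]
    summed = (x * valid[:, :, None]).sum(axis=1)  # [B, D]
    return summed / np.asarray(lengths, dtype=float)[:, None]


lengths = [3, 1]
mask = make_pad_mask(lengths, max_len=4)

x = np.ones((2, 4, 2))
x[0, 3:] = 0.0  # zero padding after 3 valid steps
x[1, 1:] = 0.0  # zero padding after 1 valid step
pooled = masked_mean(x, lengths)
```

Here both rows pool to `1.0` even though they contain different amounts of padding, whereas a plain mean over all four positions would give `0.75` and `0.25`.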


In [13]:
def collate_fn(batch):
    if len(batch[0]) == 3:
        s0, s1, labels = zip(*batch)
        labels = torch.stack(labels)
    else:
        s0, s1 = zip(*batch)
        labels = None

    len0 = torch.tensor([len(x) for x in s0])
    len1 = torch.tensor([len(x) for x in s1])
    pad0 = pad_sequence(s0, batch_first=True)
    pad1 = pad_sequence(s1, batch_first=True)

    if labels is not None:
        return pad0, len0, pad1, len1, labels
    else:
        return pad0, len0, pad1, len1


In [14]:
class RegimeClassifier(nn.Module):
    def __init__(self, encoder: nn.Module, d_model=64):
        super().__init__()
        self.encoder = encoder
        self.classifier = nn.Sequential(
            nn.Linear(2 * d_model, 64),
            nn.ReLU(),
            nn.Linear(64, 1),
            nn.Sigmoid()
        )

    def forward(self, x1, len1, x2, len2):
        h1 = self.encoder(x1, len1)
        h2 = self.encoder(x2, len2)
        combined = torch.cat([h1, h2], dim=1)
        return self.classifier(combined).squeeze(1)


In [15]:
class AutoencoderPretrain(nn.Module):
    def __init__(self, encoder: nn.Module, d_model=64):
        super().__init__()
        self.encoder = encoder
        self.decoder = nn.Sequential(
            nn.Linear(d_model, d_model),
            nn.ReLU(),
            nn.Linear(d_model, 2)  # reconstruct 2 input channels
        )

    def forward(self, x, lengths):
        B, T, _ = x.size()
        encoded = self.encoder(x, lengths)  # [B, d_model]
        recon = self.decoder(encoded).unsqueeze(1).repeat(1, T, 1)  # [B, T, 2]
        return recon


In [16]:
def train_epoch(model, loader, optimizer, criterion, device):
    model.train()
    total_loss = 0
    for x0, len0, x1, len1, labels in loader:
        x0, len0 = x0.to(device), len0.to(device)
        x1, len1 = x1.to(device), len1.to(device)
        labels = labels.to(device)

        optimizer.zero_grad()
        outputs = model(x0, len0, x1, len1)
        loss = criterion(outputs, labels)
        loss.backward()
        optimizer.step()
        total_loss += loss.item()
    return total_loss / len(loader)


In [17]:
class RegimePairDataset(Dataset):
    def __init__(self, raw_0, kalman_0, raw_1, kalman_1, labels=None):
        self.data = []
        self.has_labels = labels is not None
        if self.has_labels:
            for r0, k0, r1, k1, label in zip(raw_0, kalman_0, raw_1, kalman_1, labels):
                ts0 = np.stack([r0, k0], axis=-1)
                ts1 = np.stack([r1, k1], axis=-1)
                self.data.append((ts0, ts1, label))
        else:
            for r0, k0, r1, k1 in zip(raw_0, kalman_0, raw_1, kalman_1):
                ts0 = np.stack([r0, k0], axis=-1)
                ts1 = np.stack([r1, k1], axis=-1)
                self.data.append((ts0, ts1))

    def __len__(self):
        return len(self.data)

    def __getitem__(self, idx):
        if self.has_labels:
            ts0, ts1, label = self.data[idx]
            return (
                torch.tensor(ts0, dtype=torch.float32),
                torch.tensor(ts1, dtype=torch.float32),
                torch.tensor(label, dtype=torch.float32)  # or torch.long if classification
            )
        else:
            ts0, ts1 = self.data[idx]
            return (
                torch.tensor(ts0, dtype=torch.float32),
                torch.tensor(ts1, dtype=torch.float32)
            )


In [18]:
def evaluate(model, loader, device):
    model.eval()
    preds, truths = [], []
    with torch.no_grad():
        for x0, len0, x1, len1, labels in loader:
            x0, len0 = x0.to(device), len0.to(device)
            x1, len1 = x1.to(device), len1.to(device)
            labels = labels.to(device)
            outputs = model(x0, len0, x1, len1)
            preds.extend((outputs > 0.5).int().cpu().tolist())
            truths.extend(labels.int().cpu().tolist())
    acc = sum([p == t for p, t in zip(preds, truths)]) / len(preds)
    return acc

In [33]:

def train(
    X_train: pd.DataFrame,
    y_train: pd.Series,
    model_directory_path: str,
):
    os.makedirs(model_directory_path, exist_ok=True)
    device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

    k0,k1,r0,r1 = extract_kalman(X_train,device)

    # Create the dataset with 10k total pairs
    pair_dataset = RegimePairDataset(r0, k0, r1, k1,y_train)

    # DataLoader
    train_loader = DataLoader(pair_dataset, batch_size=32, shuffle=True, collate_fn=collate_fn)

    # Model
    encoder = TransformerEncoder(d_model=64, nhead=4, num_layers=2, dropout=0.1)
    model = RegimeClassifier(encoder).to(device)

    # Optimizer & Loss
    optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)
    criterion = nn.BCELoss()

    # Training
    num_epochs = 0  # NOTE: with 0 epochs the loop below never runs, so the saved weights are untrained
    for epoch in range(num_epochs):
        print(f"running {epoch} and device is {device}")
        loss = train_epoch(model, train_loader, optimizer, criterion, device)
        acc = evaluate(model, train_loader, device)
        print(f"Epoch {epoch+1} - Loss: {loss:.4f}, Accuracy: {acc:.4f}")

    torch.save(model.state_dict(), os.path.join(model_directory_path, 'model_transformer.pt'))

    return model

### The `infer()` Function

In the inference function, your trained model (if any) is loaded and used to make predictions on test data.

**Important workflow:**
1. Load your model;
2. Use the `yield` statement to signal readiness to the runner;
3. Process each dataset one by one within the for loop;
4. For each dataset, use `yield prediction` to return your prediction.

**Note:** The datasets can only be iterated once!
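
The workflow above can be sketched with a plain Python generator. This is a toy illustration, not the Crunch runner: `toy_infer` uses a placeholder score instead of a model, and `drive` is a hypothetical stand-in that only mimics the order of calls described in the list.

```python
def toy_infer(datasets):
    """Generator following the workflow above, with a placeholder 'model'."""
    threshold = 0.0  # stand-in for loading a model from disk

    yield  # signal readiness before any dataset is consumed

    for values in datasets:  # one dataset at a time, in order
        above = sum(1 for v in values if v > threshold)
        yield above / len(values)  # a score between 0 and 1


def drive(generator_fn, datasets):
    """Hypothetical runner: advance to the first yield, then collect predictions."""
    gen = generator_fn(iter(datasets))  # iter(): the data can be consumed only once
    ready = next(gen)                   # the first bare `yield` produces None
    assert ready is None
    return list(gen)


predictions = drive(toy_infer, [[1.0, 2.0], [-1.0, -3.0], [0.5, -0.5]])
```

Because the datasets arrive as a one-shot iterator, any per-dataset work has to happen inside the loop; collecting them into a list first and iterating twice would not fit the constraint stated in the note.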

In [36]:
def infer(X_test: typing.Iterable[pd.DataFrame], model_directory_path: str):
    # === 1. Load model ===
    device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
    encoder = TransformerEncoder(d_model=64, nhead=4, num_layers=2, dropout=0.1)
    model = RegimeClassifier(encoder).to(device)
    model.load_state_dict(torch.load(os.path.join(model_directory_path, 'model_transformer.pt'), map_location=device))
    model.eval()

    # === 2. Yield to start Crunch ===
    yield

    # === 3. Process each test row ===
    with torch.no_grad():
        for test_row in X_test:
            k0, k1, r0, r1 = extract_kalman_test_single(test_row,device)

            dataset = RegimePairDataset(r0, k0, r1, k1, None)
            loader = DataLoader(dataset, batch_size=1, shuffle=False, collate_fn=collate_fn)

            for x0, len0, x1, len1 in loader:
                x0, len0, x1, len1 = x0.to(device), len0.to(device), x1.to(device), len1.to(device)
                prob = model(x0, len0, x1, len1).item()
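                # NOTE: thresholding yields hard 0/1 labels; since scoring uses ROC AUC,
                # yielding `prob` itself would preserve the ranking information.
                yield float(prob > 0.5)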


## Local testing

To make sure your `train()` and `infer()` functions are working properly, you can call `crunch.test()`, which reproduces the cloud environment locally. <br />
Even if it is not perfect, it should give you a quick idea if your model is working properly.

In [37]:
crunch.test(
    # Uncomment to disable the train
    # force_first_train=False,

    # Uncomment to disable the determinism check
    # no_determinism_check=True,
)

03:42:42 no forbidden library found
03:42:42 
03:42:42 started
03:42:42 running local test
03:42:42 internet access isn't restricted, no check will be done
03:42:42 
03:42:43 starting unstructured loop...
03:42:43 executing - command=train


data/X_train.parquet: download from https:crunchdao--competition--production.s3.eu-west-1.amazonaws.com/data-releases/146/X_train.parquet (204327238 bytes)
data/X_train.parquet: already exists, file length match
data/X_test.reduced.parquet: download from https:crunchdao--competition--production.s3.eu-west-1.amazonaws.com/data-releases/146/X_test.reduced.parquet (2380918 bytes)
data/X_test.reduced.parquet: already exists, file length match
data/y_train.parquet: download from https:crunchdao--competition--production.s3.eu-west-1.amazonaws.com/data-releases/146/y_train.parquet (61003 bytes)
data/y_train.parquet: already exists, file length match
data/y_test.reduced.parquet: download from https:crunchdao--competition--production.s3.eu-west-1.amazonaws.com/data-releases/146/y_test.reduced.parquet (2655 bytes)
data/y_test.reduced.parquet: already exists, file length match
Estimating Kalman series for 0-th series, time taken: 0.00s


03:42:55 executing - command=infer


Estimating Kalman series for 0-th series, time taken: 0.00s


  output = torch._nested_tensor_from_mask(


Estimating Kalman series for 0-th series, time taken: 0.00s

03:43:51 checking determinism by executing the inference again with 30% of the data (tolerance: 1e-08)
03:43:51 executing - command=infer


Estimating Kalman series for 0-th series, time taken: 0.00s

03:44:09 determinism check: passed
03:44:09 save prediction - path=data/prediction.parquet
03:44:09 ended
03:44:09 duration - time=00:01:26
03:44:09 memory - before="934.61 MB" after="1.02 GB" consumed="89.85 MB"


## Results

Once the local tester is done, you can preview the result stored in `data/prediction.parquet`.

In [38]:
prediction = pd.read_parquet("data/prediction.parquet")
prediction

Unnamed: 0_level_0,prediction
id,Unnamed: 1_level_1
10001,1.0
10002,1.0
10003,1.0
10004,1.0
10005,1.0
...,...
10097,1.0
10098,1.0
10099,1.0
10100,1.0


### Local scoring

You can call the function that the system uses to estimate your score locally.

In [39]:
# Load the targets
target = pd.read_parquet("data/y_test.reduced.parquet")["structural_breakpoint"]

# Call the scoring function
sklearn.metrics.roc_auc_score(
    target,
    prediction,
)

np.float64(0.5)

# Submit your Notebook

To submit your work, you must:
1. Download your Notebook from Colab
2. Upload it to the platform
3. Create a run to validate it

### >> https://hub.crunchdao.com/competitions/structural-break/submit/notebook

![Download and Submit Notebook](https://raw.githubusercontent.com/crunchdao/competitions/refs/heads/master/documentation/animations/download-and-submit-notebook.gif)