[![Open In Colab](https://raw.githubusercontent.com/crunchdao/competitions/refs/heads/master/documentation/badge/open-in-colab.svg)](https://colab.research.google.com/github/crunchdao/quickstarters/blob/master/competitions/structural-break/quickstarters/baseline/baseline.ipynb)
[![Open In Kaggle](https://raw.githubusercontent.com/crunchdao/competitions/refs/heads/master/documentation/badge/open-in-kaggle.svg)](https://www.kaggle.com/code/crunchdao/structural-break-baseline)

In [6]:
%pip install hmmlearn

Collecting hmmlearn
  Downloading hmmlearn-0.3.3-cp312-cp312-manylinux_2_17_x86_64.manylinux2014_x86_64.whl.metadata (3.0 kB)
Downloading hmmlearn-0.3.3-cp312-cp312-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (165 kB)
[2K   [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m166.0/166.0 kB[0m [31m3.7 MB/s[0m eta [36m0:00:00[0m
[?25hInstalling collected packages: hmmlearn
Successfully installed hmmlearn-0.3.3


In [1]:
%pip install crunch-cli --upgrade --quiet --progress-bar off
!crunch setup-notebook structural-break BMYVWHdIAaaPrmCMo7VtBD4Y

crunch-cli, version 7.5.0
main.py: download from https:crunchdao--competition--production.s3-accelerate.amazonaws.com/submissions/25492/main.py (17703 bytes)
notebook.ipynb: download from https:crunchdao--competition--production.s3-accelerate.amazonaws.com/submissions/25492/notebook.ipynb (51157 bytes)
requirements.txt: download from https:crunchdao--competition--production.s3-accelerate.amazonaws.com/submissions/25492/requirements.original.txt (194 bytes)
resources/xgb_model1.joblib: download from https:crunchdao--competition--production.s3-accelerate.amazonaws.com/models/26814/xgb_model1.joblib (332638 bytes)
data/X_train.parquet: download from https:crunchdao--competition--production.s3-accelerate.amazonaws.com/data-releases/146/X_train.parquet (204327238 bytes)
data/X_test.reduced.parquet: download from https:crunchdao--competition--production.s3-accelerate.amazonaws.com/data-releases/146/X_test.reduced.parquet (2380918 bytes)
data/y_train.parquet: download from https:crunchdao--co

In [2]:
import os
import typing

# Import your dependencies
import joblib
import pandas as pd
import scipy
import sklearn.metrics

In [3]:
import crunch

# Load the Crunch Toolings
crunch = crunch.load_notebook()

loaded inline runner with module: <module '__main__'>

cli version: 7.5.0
available ram: 12.67 gb
available cpu: 2 core
----


In [4]:
# Load the data simply
X_train, y_train, X_test = crunch.load_data()

data/X_train.parquet: download from https:crunchdao--competition--production.s3-accelerate.amazonaws.com/data-releases/146/X_train.parquet (204327238 bytes)
data/X_train.parquet: already exists, file length match
data/X_test.reduced.parquet: download from https:crunchdao--competition--production.s3-accelerate.amazonaws.com/data-releases/146/X_test.reduced.parquet (2380918 bytes)
data/X_test.reduced.parquet: already exists, file length match
data/y_train.parquet: download from https:crunchdao--competition--production.s3-accelerate.amazonaws.com/data-releases/146/y_train.parquet (61003 bytes)
data/y_train.parquet: already exists, file length match
data/y_test.reduced.parquet: download from https:crunchdao--competition--production.s3-accelerate.amazonaws.com/data-releases/146/y_test.reduced.parquet (2655 bytes)
data/y_test.reduced.parquet: already exists, file length match


In [7]:
import os
import typing
import joblib
import pandas as pd
import numpy as np
from hmmlearn import hmm
import warnings

# Suppress warnings from hmmlearn for cleaner output
warnings.filterwarnings("ignore", category=DeprecationWarning)
warnings.filterwarnings("ignore", category=RuntimeWarning)

def train(
    X_train: pd.DataFrame,
    y_train: pd.Series,
    model_directory_path: str,
):
    """
    Trains the model. For this HMM approach, "training" involves setting and
    saving the model's hyperparameters. A more advanced version could use
    X_train and y_train to find the optimal number of hidden states.
    """
    # We hypothesize that a break involves a shift between two primary regimes.
    # Therefore, we fix the number of hidden states to 2.
    config = {
        'n_states': 2,
        'n_iter': 100,
        'covariance_type': 'diag',
        'random_state': 42  # Add a fixed seed for determinism
    }

    # Save the configuration object to be loaded during inference.
    joblib.dump(config, os.path.join(model_directory_path, 'model.joblib'))
    print("Model configuration saved.")

In [8]:
def infer(
    X_test: typing.Iterable[pd.DataFrame],
    model_directory_path: str,
) -> typing.Generator[float, None, None]:
    """
    Makes predictions on the test data using the HMM-based structural break detection.
    """
    # Load the model configuration saved during the training phase.
    config = joblib.load(os.path.join(model_directory_path, 'model.joblib'))
    n_states = config['n_states']
    n_iter = config['n_iter']

    yield  # Mark as ready to receive data

    # X_test can only be iterated once.
    for dataset in X_test:
        try:
            # 1. Pre-process the data: Use log returns to stabilize variance.
            # Using price directly can be problematic if it's not stationary.
            log_returns = np.log(dataset['value']).diff().dropna()

            # Align the period labels with the log returns
            periods = dataset['period'].iloc[1:]

            series_pre = log_returns[periods == 0].values.reshape(-1, 1)
            series_post = log_returns[periods == 1].values.reshape(-1, 1)

            # 2. Check for sufficient data in each period to fit a model.
            if len(series_pre) < n_states * 2 or len(series_post) < n_states * 2:
                yield 0.5  # Not enough data, yield a neutral score
                continue

            # 3. Fit HMMs to each period.
            model_pre = hmm.GaussianHMM(n_components=n_states, covariance_type=config['covariance_type'], n_iter=n_iter)
            model_pre.fit(series_pre)

            model_post = hmm.GaussianHMM(n_components=n_states, covariance_type=config['covariance_type'], n_iter=n_iter)
            model_post.fit(series_post)

            # 4. Calculate log-likelihoods, normalized by series length.
            # This measures how well each model explains its own data vs. the other's data.
            len_pre, len_post = len(series_pre), len(series_post)
            ll_pre_pre = model_pre.score(series_pre) / len_pre
            ll_post_post = model_post.score(series_post) / len_post
            ll_pre_post = model_post.score(series_pre) / len_pre
            ll_post_pre = model_pre.score(series_post) / len_post

            # 5. Compute the symmetric log-likelihood ratio as the break score.
            # A large positive value indicates the models are very different.
            raw_score = (ll_pre_pre + ll_post_post) - (ll_pre_post + ll_post_pre)

            # 6. Normalize the score to a probability [0, 1] using the sigmoid function.
            prediction = 1 / (1 + np.exp(-raw_score))

            yield prediction

        except Exception as e:
            # If any error occurs (e.g., HMM fails to converge), yield a neutral score.
            print(f"An error occurred for a dataset: {e}. Yielding neutral score.")
            yield 0.5

In [9]:
crunch.test(
    # Uncomment to disable the train
    # force_first_train=False,

    # Uncomment to disable the determinism check
    # no_determinism_check=True,
)

08:42:06 no forbidden library found
08:42:06 
08:42:06 started
08:42:06 running local test
08:42:06 internet access isn't restricted, no check will be done
08:42:06 
08:42:07 starting unstructured loop...
08:42:07 executing - command=train


data/X_train.parquet: download from https:crunchdao--competition--production.s3-accelerate.amazonaws.com/data-releases/146/X_train.parquet (204327238 bytes)
data/X_train.parquet: already exists, file length match
data/X_test.reduced.parquet: download from https:crunchdao--competition--production.s3-accelerate.amazonaws.com/data-releases/146/X_test.reduced.parquet (2380918 bytes)
data/X_test.reduced.parquet: already exists, file length match
data/y_train.parquet: download from https:crunchdao--competition--production.s3-accelerate.amazonaws.com/data-releases/146/y_train.parquet (61003 bytes)
data/y_train.parquet: already exists, file length match
data/y_test.reduced.parquet: download from https:crunchdao--competition--production.s3-accelerate.amazonaws.com/data-releases/146/y_test.reduced.parquet (2655 bytes)
data/y_test.reduced.parquet: already exists, file length match


08:42:09 executing - command=infer


Model configuration saved.


08:42:18 checking determinism by executing the inference again with 30% of the data (tolerance: 1e-08)
08:42:18 executing - command=infer
08:42:20 determinism check: failed
08:42:20 save prediction - path=data/prediction.parquet
08:42:20 ended
08:42:20 duration - time=00:00:13
08:42:20 memory - before="905.8 MB" after="939.49 MB" consumed="33.69 MB"


## Results

Once the local tester is done, you can preview the result stored in `data/prediction.parquet`.

In [None]:
prediction = pd.read_parquet("data/prediction.parquet")
prediction

### Local scoring

You can call the function that the system uses to estimate your score locally.

In [None]:
# Load the targets
target = pd.read_parquet("data/y_test.reduced.parquet")["structural_breakpoint"]

# Call the scoring function
sklearn.metrics.roc_auc_score(
    target,
    prediction,
)

# Submit your Notebook

To submit your work, you must:
1. Download your Notebook from Colab
2. Upload it to the platform
3. Create a run to validate it

### >> https://hub.crunchdao.com/competitions/structural-break/submit/notebook

![Download and Submit Notebook](https://raw.githubusercontent.com/crunchdao/competitions/refs/heads/master/documentation/animations/download-and-submit-notebook.gif)