[![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/crunchdao/quickstarters/blob/master/competitions/structural-break/quickstarters/baseline/baseline.ipynb)

![Banner](https://raw.githubusercontent.com/crunchdao/quickstarters/refs/heads/master/competitions/structural-break/assets/banner.webp)

# ADIA Lab Structural Break Challenge

## Challenge Overview

Welcome to the ADIA Lab Structural Break Challenge! In this challenge, you will analyze univariate time series data to determine whether a structural break has occurred at a specified boundary point.

### What is a Structural Break?

A structural break occurs when the process governing the data generation changes at a certain point in time. These changes can be subtle or dramatic, and detecting them accurately is crucial across various domains such as climatology, industrial monitoring, finance, and healthcare.

![Structural Break Example](https://raw.githubusercontent.com/crunchdao/competitions/refs/heads/master/competitions/structural-break/quickstarters/baseline/images/example.png)

### Your Task

For each time series in the test set, you need to predict a score between `0` and `1`:
- Values closer to `0` indicate no structural break at the specified boundary point;
- Values closer to `1` indicate a structural break did occur.

### Evaluation Metric

The evaluation metric is [ROC AUC (Area Under the Receiver Operating Characteristic Curve)](https://scikit-learn.org/stable/modules/generated/sklearn.metrics.roc_auc_score.html), which measures the performance of detection algorithms regardless of their specific calibration.

- ROC AUC around `0.5`: No better than random chance;
- ROC AUC approaching `1.0`: Perfect detection.

# Setup

The first steps to get started are:
1. Get the setup command
2. Execute it in the cell below

### >> https://hub.crunchdao.com/competitions/structural-break/submit/notebook

![Reveal token](https://raw.githubusercontent.com/crunchdao/competitions/refs/heads/master/documentation/animations/reveal-token.gif)

In [None]:
%pip install crunch-cli --upgrade --quiet --progress-bar off
!crunch setup-notebook structural-break XeZW3YGp3JZY1mndLHWH9Nw4

In [None]:
# Install the Crunch CLI
%pip install --upgrade crunch-cli

# Setup your local environment
!crunch setup --notebook structural-break hello --token aaaabbbbccccddddeeeeffff

# Your model

## Setup

In [None]:
import os
import typing

# Import your dependencies
import joblib
import pandas as pd
import scipy
import sklearn.metrics

In [None]:
import crunch

# Load the Crunch Toolings
crunch = crunch.load_notebook()

loaded inline runner with module: <module '__main__'>

cli version: 7.2.0
available ram: 12.67 gb
available cpu: 2 core
----


## Understanding the Data

The dataset consists of univariate time series, each containing ~2,000-5,000 values with a designated boundary point. For each time series, you need to determine whether a structural break occurred at this boundary point.

The data was downloaded when you setup your local environment and is now available in the `data/` directory.

In [None]:
# Load the data simply
X_train, y_train, X_test = crunch.load_data()

### Understanding `X_train`

The training data is structured as a pandas DataFrame with a MultiIndex:

**Index Levels:**
- `id`: Identifies the unique time series
- `time`: The timestep within each time series

**Columns:**
- `value`: The actual time series value at each timestep
- `period`: A binary indicator where `0` represents the **period before** the boundary point, and `1` represents the **period after** the boundary point

In [None]:
X_train

Unnamed: 0_level_0,Unnamed: 1_level_0,value,period
id,time,Unnamed: 2_level_1,Unnamed: 3_level_1
0,0,0.001858,0
0,1,-0.001664,0
0,2,-0.004386,0
0,3,0.000699,0
0,4,-0.002433,0
...,...,...,...
10000,1890,-0.005903,1
10000,1891,0.007295,1
10000,1892,0.003527,1
10000,1893,0.007218,1


### Understanding `y_train`

This is a simple `pandas.Series` that tells if a dataset id has a structural breakpoint or not.

**Index:**
- `id`: the ID of the dataset

**Value:**
- `structural_breakpoint`: Boolean indicating whether a structural break occurred (`True`) or not (`False`)

In [None]:
y_train

id
0         True
1         True
2        False
3         True
4        False
         ...  
9996     False
9997      True
9998     False
9999     False
10000     True
Name: structural_breakpoint, Length: 10001, dtype: bool

### Understanding `X_test`

The test data is provided as a **`list` of `pandas.DataFrame`s** with the same format as [`X_train`](#understanding-X_test).

It is structured as a list to encourage processing records one by one, which will be mandatory in the `infer()` function.

In [None]:
print("Number of datasets:", len(X_test))

Number of datasets: 101


In [None]:
X_test[0]

Unnamed: 0_level_0,Unnamed: 1_level_0,value,period
id,time,Unnamed: 2_level_1,Unnamed: 3_level_1
10001,0,-0.020657,0
10001,1,-0.005894,0
10001,2,-0.003052,0
10001,3,-0.000590,0
10001,4,0.009887,0
10001,...,...,...
10001,2517,0.005084,1
10001,2518,-0.024414,1
10001,2519,-0.014986,1
10001,2520,0.012999,1


## Strategy Implementation

There are multiple approaches you can take to detect structural breaks:

1. **Statistical Tests**: Compare distributions before and after the boundary point;
2. **Feature Engineering**: Extract features from both segments for comparison;
3. **Time Series Modeling**: Detect deviations from expected patterns;
4. **Machine Learning**: Train models to recognize break patterns from labeled examples.

The baseline implementation below uses a simple statistical approach: a t-test to compare the distributions before and after the boundary point.

### The `train()` Function

In this function, you build and train your model for making inferences on the test data. Your model must be stored in the `model_directory_path`.

The baseline implementation below doesn't require a pre-trained model, as it uses a statistical test that will be computed at inference time.

In [None]:
def train(
    X_train: pd.DataFrame,
    y_train: pd.Series,
    model_directory_path: str,
):
    # For our baseline t-test approach, we don't need to train a model
    # This is essentially an unsupervised approach calculated at inference time
    model = None

    # You could enhance this by training an actual model, for example:
    # 1. Extract features from before/after segments of each time series
    # 2. Train a classifier using these features and y_train labels
    # 3. Save the trained model

    joblib.dump(model, os.path.join(model_directory_path, 'model.joblib'))

### The `infer()` Function

In the inference function, your trained model (if any) is loaded and used to make predictions on test data.

**Important workflow:**
1. Load your model;
2. Use the `yield` statement to signal readiness to the runner;
3. Process each dataset one by one within the for loop;
4. For each dataset, use `yield prediction` to return your prediction.

**Note:** The datasets can only be iterated once!

In [None]:
def infer(
    X_test: typing.Iterable[pd.DataFrame],
    model_directory_path: str,
):
    model = joblib.load(os.path.join(model_directory_path, 'model.joblib'))

    yield  # Mark as ready

    # X_test can only be iterated once.
    # Before getting the next dataset, you must predict the current one.
    for dataset in X_test:
        # Baseline approach: Compute t-test between values before and after boundary point
        # The negative p-value is used as our score - smaller p-values (larger negative numbers)
        # indicate more evidence against the null hypothesis that distributions are the same,
        # suggesting a structural break
        def t_test(u: pd.DataFrame):
            return -scipy.stats.ttest_ind(
                u["value"][u["period"] == 0],  # Values before boundary point
                u["value"][u["period"] == 1],  # Values after boundary point
            ).pvalue

        prediction = t_test(dataset)
        yield prediction  # Send the prediction for the current dataset

        # Note: This baseline approach uses a t-test to compare the distributions
        # before and after the boundary point. A smaller p-value (larger negative number)
        # suggests stronger evidence that the distributions are different,
        # indicating a potential structural break.

## Local testing

To make sure your `train()` and `infer()` function are working properly, you can call the `crunch.test()` function that will reproduce the cloud environment locally. <br />
Even if it is not perfect, it should give you a quick idea if your model is working properly.

In [None]:
crunch.test(
    # Uncomment to disable the train
    # force_first_train=False,

    # Uncomment to disable the determinism check
    # no_determinism_check=True,
)

11:55:37 no forbidden library found
11:55:37 
11:55:37 started
11:55:37 running local test
11:55:37 internet access isn't restricted, no check will be done
11:55:37 
11:55:38 starting unstructured loop...
11:55:38 executing - command=train


data/X_train.parquet: download from https:crunchdao--competition--production.s3-accelerate.amazonaws.com/data-releases/146/X_train.parquet (204327238 bytes)
data/X_train.parquet: already exists, file length match
data/X_test.reduced.parquet: download from https:crunchdao--competition--production.s3-accelerate.amazonaws.com/data-releases/146/X_test.reduced.parquet (2380918 bytes)
data/X_test.reduced.parquet: already exists, file length match
data/y_train.parquet: download from https:crunchdao--competition--production.s3-accelerate.amazonaws.com/data-releases/146/y_train.parquet (61003 bytes)
data/y_train.parquet: already exists, file length match
data/y_test.reduced.parquet: download from https:crunchdao--competition--production.s3-accelerate.amazonaws.com/data-releases/146/y_test.reduced.parquet (2655 bytes)
data/y_test.reduced.parquet: already exists, file length match
                STARTING TRAINING PIPELINE v3
--- Starting Stage 1: MDL Pre-training on 'Edge of Chaos' ECA data ---


Generating ECA Data:   0%|          | 0/28 [00:00<?, ?it/s]

## Results

Once the local tester is done, you can preview the result stored in `data/prediction.parquet`.

In [None]:
prediction = pd.read_parquet("data/prediction.parquet")
prediction

### Local scoring

You can call the function that the system uses to estimate your score locally.

In [None]:
# Load the targets
target = pd.read_parquet("data/y_test.reduced.parquet")["structural_breakpoint"]

# Call the scoring function
sklearn.metrics.roc_auc_score(
    target,
    prediction,
)

---

## Solution v9

In [55]:
# ==============================================================================
# @title CELL 1: SETUP AND GLOBAL CONFIGURATION
# ==============================================================================

# --- Imports ---
import torch
import torch.nn as nn
import torch.optim as optim
import torch.nn.functional as F
import pandas as pd
import numpy as np
from pathlib import Path
import joblib
import random
import os
from tqdm.notebook import tqdm
import cellpylib as cpl
from dataclasses import dataclass, field
import typing
import hashlib
from torch.utils.data import DataLoader, TensorDataset, Dataset
from sklearn.model_selection import train_test_split
from sklearn.metrics import roc_auc_score
import math # <-- Ensure math is imported
import itertools

# --- Main Configuration Class (for Full-Scale Runs) ---
@dataclass
class Config:
    # Reproducibility
    SEED: int = 42

    # Data Processing
    PERMUTATION_EMBEDDING_DIM: int = 4
    PERMUTATION_TIME_LAG: int = 1
    SERIES_PROCESSOR_SEQUENCE_LENGTH: int = 256
    SERIES_PROCESSOR_N_AUGMENTATIONS: int = 5

    # ECA Pre-training Data Generation
    ECA_RULES_TO_USE: typing.List[int] = field(default_factory=lambda: [
        22, 30, 45, 54, 60, 75, 82, 86, 89, 90, 105, 106, 110,
        122, 126, 135, 146, 149, 150, 153, 154, 161, 165, 169,
        182, 193, 195, 225
    ])
    ECA_N_SAMPLES_PER_RULE: int = 100
    ECA_TIMESTEPS: int = 256
    ECA_WIDTH: int = 128

    # Model Architecture (Full 3-Stage U-Net)
    MODEL_DIMENSIONS: typing.List[int] = field(default_factory=lambda: [128, 256, 512])
    MODEL_LAYERS_PER_BLOCK: typing.List[int] = field(default_factory=lambda: [2, 2, 2])
    MODEL_N_HEADS: int = 8
    MODEL_MAX_SEQLENS: typing.List[int] = field(default_factory=lambda: [256, 64, 16])
    MODEL_BOTTLENECK_DIM: int = 64

    # Training Hyperparameters
    PRETRAIN_EPOCHS: int = 5
    FINETUNE_EPOCHS: int = 10
    BATCH_SIZE: int = 32
    PRETRAIN_LR: float = 1e-4
    FINETUNE_FULL_LR: float = 1e-5 # Lower LR for fine-tuning the whole model
    MDL_RECON_LOSS_ALPHA: float = 1.0
    MDL_RULE_LOSS_BETA: float = 0.5

    # Paths
    MODEL_DIR: Path = field(default_factory=lambda: Path("./adia_model_store"))

# --- Global Functions & Instantiation ---
config = Config()

def seed_everything(seed):
    random.seed(seed)
    os.environ['PYTHONHASHSEED'] = str(seed)
    np.random.seed(seed)
    torch.manual_seed(seed)
    torch.cuda.manual_seed_all(seed)
    torch.backends.cudnn.deterministic = True
    torch.backends.cudnn.benchmark = False

seed_everything(config.SEED)
config.MODEL_DIR.mkdir(exist_ok=True, parents=True)
device = "cuda" if torch.cuda.is_available() else "cpu"
print(f"Global configuration set for a full-scale run.")
print(f"Using device: {device}")

# Helper function for creating mock data, used in later cells
def create_dataset(n_samples, has_break_prob=0.5):
    X_list, y_list = [], []
    for i in range(n_samples):
        has_break = np.random.rand() < has_break_prob
        length = np.random.randint(500, 1500)
        break_point = np.random.randint(int(length * 0.3), int(length * 0.7))
        t = np.linspace(0, np.random.uniform(5, 15), length)
        noise = np.random.randn(length) * 0.1

        series = pd.Series(np.sin(t * 2 * np.pi) + noise, name='value')
        if has_break:
            t2_freq = np.random.uniform(3, 8)
            t2_amp = np.random.uniform(1.2, 2.0)
            series.iloc[break_point:] = pd.Series(np.cos(t[break_point:] * t2_freq * np.pi) * t2_amp + noise[break_point:])

        X_list.append({'series': series, 'period': break_point + 1})
        y_list.append(1 if has_break else 0)
    return pd.DataFrame(X_list), pd.Series(y_list)

Global configuration set for a full-scale run.
Using device: cuda


In [56]:
# ==============================================================================
# @title CELL 2: CORE LIBRARY AND PIPELINE LOGIC (BUG FIXED)
# ==============================================================================

# ------------------------------------------------------------------------------
# MODULE 1: core_library/data_processing.py
# ------------------------------------------------------------------------------
class PermutationSymbolizer:
    def __init__(self, embedding_dim, time_lag):
        self.d = embedding_dim
        self.tau = time_lag
        self.permutations = {
            tuple(p): i for i, p in enumerate(itertools.permutations(range(self.d)))
        }

    def symbolize_vector(self, vector: np.ndarray) -> int:
        hasher = hashlib.sha256(vector.tobytes())
        seed = int.from_bytes(hasher.digest(), 'big') % (2**32)
        rng = np.random.default_rng(seed)
        noisy_vector = vector + rng.normal(0, 1e-9, size=vector.shape)
        return self.permutations[tuple(np.argsort(noisy_vector))]

class SeriesProcessor:
    def __init__(self, symbolizer, sequence_length, n_augmentations):
        self.symbolizer = symbolizer
        self.seq_len = sequence_length
        self.d = symbolizer.d
        self.n_aug = n_augmentations

    def process_for_finetune(self, series: pd.Series) -> np.ndarray:
        if len(series) < self.d: return np.array([])
        view_shape = (len(series) - self.d + 1, self.d)
        view_strides = (series.values.strides[0], series.values.strides[0])
        windows = np.lib.stride_tricks.as_strided(series.values, shape=view_shape, strides=view_strides)
        symbols = np.array([self.symbolizer.symbolize_vector(w) for w in windows])
        if len(symbols) < self.seq_len: return np.array([])
        sequences = []
        for i in range(self.n_aug):
            start_idx = np.random.randint(0, len(symbols) - self.seq_len + 1)
            sequences.append(symbols[start_idx:start_idx + self.seq_len])
        return np.array(sequences)

class ECADataGenerator:
    def __init__(self, eca_config, n_samples_per_rule, timesteps, width):
        self.config = eca_config
        self.n_samples = n_samples_per_rule
        self.timesteps = timesteps
        self.width = width

    def _generate_for_rule(self, rule_info):
        is_composite = isinstance(rule_info, dict) # <-- BUG FIX
        rule_label = str(rule_info['rules']) if is_composite else str(rule_info)
        ca_list = []
        for _ in range(self.n_samples):
            init_cond = cpl.init_random(self.width)
            if is_composite:
                ca = cpl.evolve_ca_chain(init_cond, self.timesteps, rule_info['rules'], rule_info['timesteps'])
            else:
                ca = cpl.evolve(init_cond, self.timesteps, lambda n, c, t: cpl.nks_rule(n, rule_info))
            ca_list.append(ca)
        return np.array(ca_list), [rule_label] * self.n_samples

    def generate_training_data(self):
        all_cas, all_labels_str = [], []
        base_rules = self.config.get('base', [])
        for rule in tqdm(base_rules, desc="Generating Base ECAs"):
            cas, labels = self._generate_for_rule(rule)
            all_cas.append(cas)
            all_labels_str.extend(labels)
        composite_rules = self.config.get('composite', [])
        for i, rule_info in enumerate(tqdm(composite_rules, desc="Generating Composite ECAs")):
            cas, labels = self._generate_for_rule(rule_info)
            all_cas.append(cas)
            all_labels_str.extend(labels)
        if not all_cas: return torch.empty(0), torch.empty(0), {}
        unique_labels = sorted(list(set(all_labels_str)))
        label_map = {label: i for i, label in enumerate(unique_labels)}
        final_cas = np.vstack(all_cas)
        final_labels = np.array([label_map[l] for l in all_labels_str])
        return torch.from_numpy(final_cas).float(), torch.from_numpy(final_labels).long(), label_map

# ------------------------------------------------------------------------------
# MODULE 2: core_library/model_architecture.py
# ------------------------------------------------------------------------------
class CausalTransformer(nn.Module):
    # ... (implementation is correct, no changes needed)
    def __init__(self, dim, depth, heads, max_seq_len):
        super().__init__()
        self.layers = nn.ModuleList([
            nn.TransformerEncoderLayer(d_model=dim, nhead=heads, dim_feedforward=dim*4, batch_first=True, activation='gelu')
            for _ in range(depth)
        ])
        self.pos_emb = nn.Embedding(max_seq_len, dim)
        self.register_buffer('causal_mask', ~torch.ones(max_seq_len, max_seq_len, dtype=torch.bool).triu(1))
    def forward(self, x):
        n, seq_len, _ = x.shape
        pos = torch.arange(seq_len, device=x.device)
        x = x + self.pos_emb(pos)
        for layer in self.layers:
            x = layer(x, src_mask=self.causal_mask[:seq_len, :seq_len])
        return x

class SimpleTransition(nn.Module):
    # ... (implementation is correct, no changes needed)
    def __init__(self, dim_in, dim_out, factor, is_up):
        super().__init__(); self.is_up = is_up; self.factor = factor; self.proj = nn.Linear(dim_in, dim_out)
    def forward(self, x):
        if self.is_up: x = x.repeat_interleave(self.factor, dim=1)
        else: x = x[:, ::self.factor, :];
        return self.proj(x)

class HierarchicalDynamicalEncoder(nn.Module):
    # ... (implementation is correct, no changes needed)
    def __init__(self, dims, depths, heads, max_seqlens):
        super().__init__(); self.levels = nn.ModuleList()
        for i in range(len(dims) - 1): self.levels.append(nn.ModuleList([CausalTransformer(dims[i], depths[i], heads, max_seqlens[i]), SimpleTransition(dims[i], dims[i+1], max_seqlens[i] // max_seqlens[i+1], is_up=False)]))
    def forward(self, x):
        residuals = [];
        for transformer, transition in self.levels: x = transformer(x); residuals.append(x); x = transition(x)
        return x, residuals

class HierarchicalDynamicalDecoder(nn.Module):
    # ... (implementation is correct, no changes needed)
    def __init__(self, dims, depths, heads, max_seqlens):
        super().__init__(); self.levels = nn.ModuleList(); self.final_transformer = CausalTransformer(dims[0], depths[0], heads, max_seqlens[0])
        for i in range(len(dims) - 1, 0, -1): self.levels.append(nn.ModuleList([SimpleTransition(dims[i], dims[i-1], max_seqlens[i-1] // max_seqlens[i], is_up=True), CausalTransformer(dims[i-1], depths[i-1], heads, max_seqlens[i-1])]))
    def forward(self, x, residuals):
        for (transition, transformer), res in zip(self.levels, residuals[::-1]): x = transition(x); x = x + res; x = transformer(x)
        return self.final_transformer(x)

class MDL_AU_Net_Autoencoder(nn.Module):
    def __init__(self, model_cfg, data_cfg, n_rules):
        super().__init__()
        dims = model_cfg.MODEL_DIMENSIONS
        n_symbols = math.factorial(data_cfg.PERMUTATION_EMBEDDING_DIM) # <-- BUG FIX
        self.embedding = nn.Embedding(n_symbols, dims[0])
        self.encoder = HierarchicalDynamicalEncoder(dims, model_cfg.MODEL_LAYERS_PER_BLOCK, model_cfg.MODEL_N_HEADS, model_cfg.MODEL_MAX_SEQLENS)
        self.bottleneck = CausalTransformer(dims[-1], 1, model_cfg.MODEL_N_HEADS, model_cfg.MODEL_MAX_SEQLENS[-1])
        self.rule_classifier = nn.Linear(dims[-1], n_rules) if n_rules > 0 else nn.Identity()
        self.decoder = HierarchicalDynamicalDecoder(dims, model_cfg.MODEL_LAYERS_PER_BLOCK, model_cfg.MODEL_N_HEADS, model_cfg.MODEL_MAX_SEQLENS)
        self.to_logits = nn.Linear(dims[0], data_cfg.ECA_WIDTH)

    def forward_pretrain(self, x):
        fingerprint, residuals = self.encoder(x); fingerprint = self.bottleneck(fingerprint)
        rule_logits = self.rule_classifier(fingerprint.mean(dim=1)); reconstructed = self.decoder(fingerprint, residuals)
        reconstructed_logits = self.to_logits(reconstructed); return reconstructed_logits, rule_logits
    def encode(self, x_symbols):
        x_embedded = self.embedding(x_symbols); fingerprint, _ = self.encoder(x_embedded)
        fingerprint = self.bottleneck(fingerprint); return fingerprint.mean(dim=1)

class StructuralBreakClassifier(nn.Module):
    def __init__(self, encoder_model):
        super().__init__(); self.encoder_model = encoder_model
        fingerprint_dim = encoder_model.encoder.levels[-1][-1].proj.out_features; input_dim = fingerprint_dim * 3
        self.classifier_head = nn.Sequential(nn.LayerNorm(input_dim), nn.Linear(input_dim, fingerprint_dim), nn.GELU(), nn.Linear(fingerprint_dim, 1))
    def _get_fingerprint(self, sequences):
        if sequences.shape[0] == 0: return torch.zeros(self.encoder_model.encoder.levels[-1][-1].proj.out_features, device=next(self.parameters()).device)
        fingerprints = self.encoder_model.encode(sequences); return fingerprints.mean(dim=0)
    def forward(self, before_seqs_batch, after_seqs_batch):
        batch_logits = []
        for before_seqs, after_seqs in zip(before_seqs_batch, after_seqs_batch):
            fp_before = self._get_fingerprint(torch.from_numpy(before_seqs).long().to(next(self.parameters()).device))
            fp_after = self._get_fingerprint(torch.from_numpy(after_seqs).long().to(next(self.parameters()).device))
            combined = torch.cat([fp_before, fp_after, torch.abs(fp_before - fp_after)], dim=0)
            batch_logits.append(self.classifier_head(combined))
        return torch.stack(batch_logits)

# ------------------------------------------------------------------------------
# MODULE 3 & 4: Training, Inference, and Platform Entry Points
# ------------------------------------------------------------------------------
class FineTuneDataset(Dataset):
    def __init__(self, X, y, processor): self.X = X; self.y = y; self.processor = processor
    def __len__(self): return len(self.X)
    def __getitem__(self, idx):
        row = self.X.iloc[idx]; label = self.y.iloc[idx]; series = row['series']; break_point = row['period'] - 1
        series_before = series.iloc[:break_point]; series_after = series.iloc[break_point:]
        before_sequences = self.processor.process_for_finetune(series_before)
        after_sequences = self.processor.process_for_finetune(series_after)
        return before_sequences, after_sequences, label

def finetune_collate_fn(batch):
    before_batch, after_batch, labels_batch = zip(*batch)
    labels_tensor = torch.tensor(labels_batch, dtype=torch.float32)
    return list(before_batch), list(after_batch), labels_tensor

def train(X, y, model_dir):
    print("--- Starting Training Pipeline ---"); model_dir = Path(model_dir); model_dir.mkdir(exist_ok=True, parents=True)
    # Stage 1: Pre-training
    print("--- Starting Stage 1: Pre-training ---")
    eca_gen = ECADataGenerator(eca_config={'base': config.ECA_RULES_TO_USE, 'composite':[]}, n_samples_per_rule=config.ECA_N_SAMPLES_PER_RULE, timesteps=config.ECA_TIMESTEPS, width=config.ECA_WIDTH)
    eca_tensors, eca_labels, label_map = eca_gen.generate_training_data(); n_rules = len(label_map)
    pretrain_dataset = TensorDataset(eca_tensors, eca_labels); pretrain_loader = DataLoader(pretrain_dataset, batch_size=config.BATCH_SIZE, shuffle=True)
    autoencoder = MDL_AU_Net_Autoencoder(config, config, n_rules).to(device)
    eca_input_proj = nn.Linear(config.ECA_WIDTH, config.MODEL_DIMENSIONS[0]).to(device)
    optimizer = optim.Adam(list(autoencoder.parameters()) + list(eca_input_proj.parameters()), lr=config.PRETRAIN_LR)
    recon_loss_fn = nn.BCEWithLogitsLoss(); rule_loss_fn = nn.CrossEntropyLoss()
    for epoch in range(config.PRETRAIN_EPOCHS):
        autoencoder.train(); total_loss, recon_L, rule_L = 0, 0, 0
        pbar = tqdm(pretrain_loader, desc=f"Pre-train Epoch {epoch+1}/{config.PRETRAIN_EPOCHS}")
        for sequences, labels in pbar:
            sequences, labels = sequences.to(device), labels.to(device); optimizer.zero_grad()
            projected_sequences = eca_input_proj(sequences)
            recon_logits, rule_logits = autoencoder.forward_pretrain(projected_sequences)
            loss_recon = recon_loss_fn(recon_logits, sequences > 0.5); loss_rule = rule_loss_fn(rule_logits, labels)
            loss = config.MDL_RECON_LOSS_ALPHA * loss_recon + config.MDL_RULE_LOSS_BETA * loss_rule
            loss.backward(); optimizer.step()
            total_loss += loss.item(); recon_L += loss_recon.item(); rule_L += loss_rule.item()
            pbar.set_postfix(loss=total_loss/len(pbar), recon_L=recon_L/len(pbar), rule_L=rule_L/len(pbar))

    # --- ADDITION: Save the pre-trained model for faster iteration ---
    print("--- Saving pre-trained autoencoder for faster iteration ---")
    torch.save(autoencoder.state_dict(), model_dir / "pretrained_autoencoder.pth")
    joblib.dump(label_map, model_dir / "label_map.joblib") # Save label_map needed to rebuild model

    # Stage 2: Fine-tuning
    print("--- Starting Stage 2: Fine-tuning ---")
    symbolizer = PermutationSymbolizer(config.PERMUTATION_EMBEDDING_DIM, config.PERMUTATION_TIME_LAG)
    processor = SeriesProcessor(symbolizer, config.SERIES_PROCESSOR_SEQUENCE_LENGTH, config.SERIES_PROCESSOR_N_AUGMENTATIONS)
    finetune_dataset = FineTuneDataset(X, y, processor); finetune_loader = DataLoader(finetune_dataset, batch_size=config.BATCH_SIZE, shuffle=True, collate_fn=finetune_collate_fn)
    classifier = StructuralBreakClassifier(autoencoder).to(device)
    optimizer = optim.Adam(classifier.parameters(), lr=config.FINETUNE_FULL_LR)
    loss_fn = nn.BCEWithLogitsLoss()
    for epoch in range(config.FINETUNE_EPOCHS):
        classifier.train(); total_loss = 0
        pbar = tqdm(finetune_loader, desc=f"Fine-tune Epoch {epoch+1}/{config.FINETUNE_EPOCHS}")
        for before_seqs, after_seqs, labels in pbar:
            labels = labels.to(device); optimizer.zero_grad()
            logits = classifier(before_seqs, after_seqs).squeeze()
            loss = loss_fn(logits, labels); loss.backward(); optimizer.step()
            total_loss += loss.item(); pbar.set_postfix(loss=total_loss/len(pbar))
    # Stage 3: Save Artifacts
    print("--- Saving final model artifacts ---")
    joblib.dump(config, model_dir / "config.joblib")
    torch.save(classifier.state_dict(), model_dir / "structural_break_classifier.pth")

def infer(X, model_dir):
    print("--- Starting Inference Pipeline ---"); model_dir = Path(model_dir)
    loaded_config = joblib.load(model_dir / "config.joblib")
    label_map = joblib.load(model_dir / "label_map.joblib"); n_rules = len(label_map)
    autoencoder = MDL_AU_Net_Autoencoder(loaded_config, loaded_config, n_rules)
    model = StructuralBreakClassifier(autoencoder)
    model.load_state_dict(torch.load(model_dir / "structural_break_classifier.pth")); model.to(device).eval()
    symbolizer = PermutationSymbolizer(loaded_config.PERMUTATION_EMBEDDING_DIM, loaded_config.PERMUTATION_TIME_LAG)
    processor = SeriesProcessor(symbolizer, loaded_config.SERIES_PROCESSOR_SEQUENCE_LENGTH, loaded_config.SERIES_PROCESSOR_N_AUGMENTATIONS)
    with torch.no_grad():
        for i, row in tqdm(X.iterrows(), total=len(X), desc="Inferring"):
            series_data = row['series']; break_point = row['period'] - 1
            series_before = series_data.iloc[:break_point]; series_after = series_data.iloc[break_point:]
            before_sequences = processor.process_for_finetune(series_before); after_sequences = processor.process_for_finetune(series_after)
            logits = model([before_sequences], [after_sequences]); score = torch.sigmoid(logits).item(); yield score

In [None]:
# ==============================================================================
# @title CELL 3: TIER 3 - FULL PRE-TRAINING AND FINE-TUNING RUN
# ==============================================================================
if __name__ == '__main__':
    print("\n" + "="*80)
    print("      RUNNING TIER 3: FULL END-TO-END TRAINING")
    print("="*80)
    print("This will take a long time, but only needs to be run once to generate")
    print("the 'pretrained_autoencoder.pth' file for faster iteration.")

    # We use the main 'Config' class defined in Cell 1
    seed_everything(config.SEED)

    # --- Create a large, representative mock dataset ---
    print("\nCreating a large mock dataset...")
    train_X, train_y = create_dataset(n_samples=1000)
    test_X, test_y = create_dataset(n_samples=200)
    print(f"Training set size: {len(train_X)}, Test set size: {len(test_X)}")

    # --- Run Full Training and Inference ---
    try:
        if torch.cuda.is_available(): print("\n✅ Found and using GPU.\n")
        else: print("\n⚠️ WARNING: GPU not found. This will be very slow.\n")

        # Train the FULL model on the LARGE training data
        train(train_X, train_y, config.MODEL_DIR)

        # Infer on the unseen test data
        print("\nInferring on the unseen test set...")
        predictions = list(infer(test_X, config.MODEL_DIR))

        # --- Evaluate Performance ---
        auc_score = roc_auc_score(test_y, predictions)
        print("\n" + "="*50); print("      TIER 3 RESULTS"); print("="*50)
        print(f"✅ Final Out-of-Sample ROC AUC Score: {auc_score:.4f}"); print("="*50)

    except Exception as e:
        print(f"\nERROR during full run: {e}"); import traceback; traceback.print_exc()


      RUNNING TIER 3: FULL END-TO-END TRAINING
This will take a long time, but only needs to be run once to generate
the 'pretrained_autoencoder.pth' file for faster iteration.

Creating a large mock dataset...
Training set size: 1000, Test set size: 200

✅ Found and using GPU.

--- Starting Training Pipeline ---
--- Starting Stage 1: Pre-training ---


Generating Base ECAs:   0%|          | 0/28 [00:00<?, ?it/s]

In [58]:
# ==============================================================================
# @title CELL 4: TIER 2.5 - FAST FINE-TUNING TEST (RUN AFTER CELL 3)
# ==============================================================================
def fine_tune_only(X, y, model_dir, ft_config):
    """
    This function loads a pre-trained model and ONLY runs the fine-tuning stage.
    """
    print("\n" + "="*80); print("      RUNNING TIER 2.5: FAST FINE-TUNING ONLY"); print("="*80)
    model_dir = Path(model_dir)
    pretrained_path = model_dir / "pretrained_autoencoder.pth"
    label_map_path = model_dir / "label_map.joblib"

    if not pretrained_path.exists():
        print(f"ERROR: Pre-trained model not found at '{pretrained_path}'.")
        print("Please run the full Tier 3 training (Cell 3) once first.")
        return None

    # --- Load the pre-trained autoencoder ---
    print("--- Loading pre-trained autoencoder ---")
    label_map = joblib.load(label_map_path)
    n_rules = len(label_map)
    autoencoder = MDL_AU_Net_Autoencoder(config, config, n_rules).to(device) # Use global config for architecture
    autoencoder.load_state_dict(torch.load(pretrained_path))
    print("Pre-trained model loaded successfully.")

    # --- Stage 2: Fine-tuning ---
    print("--- Starting Stage 2: Fine-tuning ---")
    symbolizer = PermutationSymbolizer(ft_config.PERMUTATION_EMBEDDING_DIM, ft_config.PERMUTATION_TIME_LAG)
    processor = SeriesProcessor(symbolizer, ft_config.SERIES_PROCESSOR_SEQUENCE_LENGTH, ft_config.SERIES_PROCESSOR_N_AUGMENTATIONS)
    finetune_dataset = FineTuneDataset(X, y, processor)
    finetune_loader = DataLoader(finetune_dataset, batch_size=ft_config.BATCH_SIZE, shuffle=True, collate_fn=finetune_collate_fn)
    classifier = StructuralBreakClassifier(autoencoder).to(device)
    optimizer = optim.Adam(classifier.parameters(), lr=ft_config.FINETUNE_FULL_LR)
    loss_fn = nn.BCEWithLogitsLoss()

    for epoch in range(ft_config.FINETUNE_EPOCHS):
        classifier.train(); total_loss = 0
        pbar = tqdm(finetune_loader, desc=f"Fast Fine-tune Epoch {epoch+1}/{ft_config.FINETUNE_EPOCHS}")
        for before_seqs, after_seqs, labels in pbar:
            labels = labels.to(device); optimizer.zero_grad()
            logits = classifier(before_seqs, after_seqs).squeeze()
            loss = loss_fn(logits, labels); loss.backward(); optimizer.step()
            total_loss += loss.item(); pbar.set_postfix(loss=total_loss/len(pbar))

    # --- Save the final, fine-tuned model ---
    print("--- Saving final fine-tuned model artifacts ---")
    joblib.dump(ft_config, model_dir / "config.joblib") # Save the config used for this run
    torch.save(classifier.state_dict(), model_dir / "structural_break_classifier.pth")


if __name__ == '__main__':
    # Define a specific, smaller configuration for this fast test
    @dataclass
    class FastTuneConfig(Config):
        FINETUNE_EPOCHS: int = 5
        BATCH_SIZE: int = 16
        FINETUNE_FULL_LR: float = 3e-5

    fast_config = FastTuneConfig()

    # --- Create a SMALL mock dataset for this fast test ---
    print("\nCreating a small mock dataset for fast fine-tuning test...")
    train_X_fast, train_y_fast = create_dataset(n_samples=200)
    test_X_fast, test_y_fast = create_dataset(n_samples=100)

    # --- Run the fast fine-tuning and inference ---
    fine_tune_only(train_X_fast, train_y_fast, config.MODEL_DIR, fast_config)

    print("\nInferring on the unseen test set...")
    # Make sure infer uses the correct, newly saved model and config
    predictions = list(infer(test_X_fast, config.MODEL_DIR))

    if predictions:
        auc_score = roc_auc_score(test_y_fast, predictions)
        print("\n" + "="*50); print("      TIER 2.5 RESULTS"); print("="*50)
        print(f"✅ Fast Fine-tune Out-of-Sample ROC AUC Score: {auc_score:.4f}")


Creating a small mock dataset for fast fine-tuning test...

      RUNNING TIER 2.5: FAST FINE-TUNING ONLY
ERROR: Pre-trained model not found at 'adia_model_store/pretrained_autoencoder.pth'.
Please run the full Tier 3 training (Cell 3) once first.

Inferring on the unseen test set...
--- Starting Inference Pipeline ---


FileNotFoundError: [Errno 2] No such file or directory: 'adia_model_store/config.joblib'

## Solution v8

In [None]:
# ==============================================================================
# @title CELL 1: SETUP AND GLOBAL CONFIGURATION
# ==============================================================================

# --- Imports ---
import torch
import torch.nn as nn
import torch.optim as optim
import torch.nn.functional as F
import pandas as pd
import numpy as np
from pathlib import Path
import joblib
import random
import os
from tqdm.notebook import tqdm
import cellpylib as cpl
from dataclasses import dataclass, field
import typing
import hashlib
from torch.utils.data import DataLoader, TensorDataset, Dataset
from sklearn.model_selection import train_test_split
from sklearn.metrics import roc_auc_score

# --- Main Configuration Class (for Full-Scale Runs) ---
@dataclass
class Config:
    # Reproducibility
    SEED: int = 42

    # Data Processing
    PERMUTATION_EMBEDDING_DIM: int = 4
    PERMUTATION_TIME_LAG: int = 1
    SERIES_PROCESSOR_SEQUENCE_LENGTH: int = 256
    SERIES_PROCESSOR_N_AUGMENTATIONS: int = 5

    # ECA Pre-training Data Generation
    ECA_RULES_TO_USE: typing.List[int] = field(default_factory=lambda: [
        22, 30, 45, 54, 60, 75, 82, 86, 89, 90, 105, 106, 110,
        122, 126, 135, 146, 149, 150, 153, 154, 161, 165, 169,
        182, 193, 195, 225
    ])
    ECA_N_SAMPLES_PER_RULE: int = 100
    ECA_TIMESTEPS: int = 256
    ECA_WIDTH: int = 128

    # Model Architecture (Full 3-Stage U-Net)
    MODEL_DIMENSIONS: typing.List[int] = field(default_factory=lambda: [128, 256, 512])
    MODEL_LAYERS_PER_BLOCK: typing.List[int] = field(default_factory=lambda: [2, 2, 2])
    MODEL_N_HEADS: int = 8
    MODEL_MAX_SEQLENS: typing.List[int] = field(default_factory=lambda: [256, 64, 16])
    MODEL_BOTTLENECK_DIM: int = 64

    # Training Hyperparameters
    PRETRAIN_EPOCHS: int = 5
    FINETUNE_EPOCHS: int = 10
    BATCH_SIZE: int = 32
    PRETRAIN_LR: float = 1e-4
    FINETUNE_FULL_LR: float = 1e-5 # Lower LR for fine-tuning the whole model
    MDL_RECON_LOSS_ALPHA: float = 1.0
    MDL_RULE_LOSS_BETA: float = 0.5

    # Paths
    MODEL_DIR: Path = field(default_factory=lambda: Path("./adia_model_store"))

# --- Global Functions & Instantiation ---
config = Config()

def seed_everything(seed):
    random.seed(seed)
    os.environ['PYTHONHASHSEED'] = str(seed)
    np.random.seed(seed)
    torch.manual_seed(seed)
    torch.cuda.manual_seed_all(seed)
    torch.backends.cudnn.deterministic = True
    torch.backends.cudnn.benchmark = False

seed_everything(config.SEED)
config.MODEL_DIR.mkdir(exist_ok=True, parents=True)
device = "cuda" if torch.cuda.is_available() else "cpu"
print(f"Global configuration set for a full-scale run.")
print(f"Using device: {device}")

In [None]:
# ==============================================================================
# @title CELL 2: CORE LIBRARY AND PIPELINE LOGIC (with Bug Fix)
# ==============================================================================

# ------------------------------------------------------------------------------
# MODULE 1: core_library/data_processing.py
# ------------------------------------------------------------------------------
import itertools # Make sure this is imported at the top of the cell

class PermutationSymbolizer:
    def __init__(self, embedding_dim, time_lag):
        self.d = embedding_dim
        self.tau = time_lag
        # Correctly use itertools.permutations
        self.permutations = {
            tuple(p): i for i, p in enumerate(itertools.permutations(range(self.d)))
        }

    def symbolize_vector(self, vector: np.ndarray) -> int:
        hasher = hashlib.sha256(vector.tobytes())
        seed = int.from_bytes(hasher.digest(), 'big') % (2**32)
        rng = np.random.default_rng(seed)
        noisy_vector = vector + rng.normal(0, 1e-9, size=vector.shape)
        return self.permutations[tuple(np.argsort(noisy_vector))]

class SeriesProcessor:
    def __init__(self, symbolizer, sequence_length, n_augmentations):
        self.symbolizer = symbolizer
        self.seq_len = sequence_length
        self.d = symbolizer.d
        self.n_aug = n_augmentations

    def process_for_finetune(self, series: pd.Series) -> np.ndarray:
        if len(series) < self.d:
            return np.array([])

        view_shape = (len(series) - self.d + 1, self.d)
        view_strides = (series.values.strides[0], series.values.strides[0])
        windows = np.lib.stride_tricks.as_strided(series.values, shape=view_shape, strides=view_strides)

        symbols = np.array([self.symbolizer.symbolize_vector(w) for w in windows])

        if len(symbols) < self.seq_len:
            return np.array([])

        sequences = []
        for i in range(self.n_aug):
            start_idx = np.random.randint(0, len(symbols) - self.seq_len + 1)
            sequences.append(symbols[start_idx:start_idx + self.seq_len])
        return np.array(sequences)

class ECADataGenerator:
    def __init__(self, eca_config, n_samples_per_rule, timesteps, width):
        self.config = eca_config
        self.n_samples = n_samples_per_rule
        self.timesteps = timesteps
        self.width = width

    def _generate_for_rule(self, rule_info):
        # ### --- THIS IS THE CORRECTED LINE --- ###
        is_composite = isinstance(rule_info, dict)

        rule_label = str(rule_info['rules']) if is_composite else str(rule_info)

        ca_list = []
        for _ in range(self.n_samples):
            init_cond = cpl.init_random(self.width)
            if is_composite:
                ca = cpl.evolve_ca_chain(init_cond, self.timesteps, rule_info['rules'], rule_info['timesteps'])
            else: # It's a base rule (an integer)
                ca = cpl.evolve(init_cond, self.timesteps, lambda n, c, t: cpl.nks_rule(n, rule_info))
            ca_list.append(ca)
        return np.array(ca_list), [rule_label] * self.n_samples

    def generate_training_data(self):
        all_cas, all_labels_str = [], []

        base_rules = self.config.get('base', [])
        for rule in tqdm(base_rules, desc="Generating Base ECAs"):
            cas, labels = self._generate_for_rule(rule)
            all_cas.append(cas)
            all_labels_str.extend(labels)

        composite_rules = self.config.get('composite', [])
        for i, rule_info in enumerate(tqdm(composite_rules, desc="Generating Composite ECAs")):
            cas, labels = self._generate_for_rule(rule_info)
            all_cas.append(cas)
            all_labels_str.extend(labels)

        unique_labels = sorted(list(set(all_labels_str)))
        label_map = {label: i for i, label in enumerate(unique_labels)}

        if not all_cas: # Handle case where no rules are specified
            return torch.empty(0, self.timesteps, self.width), torch.empty(0, dtype=torch.long), {}

        final_cas = np.vstack(all_cas)
        final_labels = np.array([label_map[l] for l in all_labels_str])

        return torch.from_numpy(final_cas).float(), torch.from_numpy(final_labels).long(), label_map

# ------------------------------------------------------------------------------
# MODULE 2: core_library/model_architecture.py
# ------------------------------------------------------------------------------
import math

class CausalTransformer(nn.Module):
    def __init__(self, dim, depth, heads, max_seq_len):
        super().__init__()
        self.layers = nn.ModuleList([
            nn.TransformerEncoderLayer(d_model=dim, nhead=heads, dim_feedforward=dim*4, batch_first=True, activation='gelu')
            for _ in range(depth)
        ])
        self.pos_emb = nn.Embedding(max_seq_len, dim)
        self.register_buffer('causal_mask', ~torch.ones(max_seq_len, max_seq_len, dtype=torch.bool).triu(1))

    def forward(self, x):
        n, seq_len, _ = x.shape
        pos = torch.arange(seq_len, device=x.device)
        x = x + self.pos_emb(pos)
        for layer in self.layers:
            x = layer(x, src_mask=self.causal_mask[:seq_len, :seq_len])
        return x

class SimpleTransition(nn.Module):
    def __init__(self, dim_in, dim_out, factor, is_up):
        super().__init__()
        self.is_up = is_up
        self.factor = factor
        self.proj = nn.Linear(dim_in, dim_out)

    def forward(self, x):
        if self.is_up:
            x = x.repeat_interleave(self.factor, dim=1)
        else:
            x = x[:, ::self.factor, :]
        return self.proj(x)

class HierarchicalDynamicalEncoder(nn.Module):
    def __init__(self, dims, depths, heads, max_seqlens):
        super().__init__()
        self.levels = nn.ModuleList()
        for i in range(len(dims) - 1):
            self.levels.append(nn.ModuleList([
                CausalTransformer(dims[i], depths[i], heads, max_seqlens[i]),
                SimpleTransition(dims[i], dims[i+1], max_seqlens[i] // max_seqlens[i+1], is_up=False)
            ]))

    def forward(self, x):
        residuals = []
        for transformer, transition in self.levels:
            x = transformer(x)
            residuals.append(x)
            x = transition(x)
        return x, residuals

class HierarchicalDynamicalDecoder(nn.Module):
    def __init__(self, dims, depths, heads, max_seqlens):
        super().__init__()
        self.levels = nn.ModuleList()
        self.final_transformer = CausalTransformer(dims[0], depths[0], heads, max_seqlens[0])
        for i in range(len(dims) - 1, 0, -1):
            self.levels.append(nn.ModuleList([
                SimpleTransition(dims[i], dims[i-1], max_seqlens[i-1] // max_seqlens[i], is_up=True),
                CausalTransformer(dims[i-1], depths[i-1], heads, max_seqlens[i-1])
            ]))

    def forward(self, x, residuals):
        for (transition, transformer), res in zip(self.levels, residuals[::-1]):
            x = transition(x)
            x = x + res
            x = transformer(x)
        return self.final_transformer(x)

class MDL_AU_Net_Autoencoder(nn.Module):
    def __init__(self, model_cfg, data_cfg, n_rules):
        super().__init__()
        dims = model_cfg.MODEL_DIMENSIONS
        n_symbols = np.math.factorial(data_cfg.PERMUTATION_EMBEDDING_DIM)
        self.embedding = nn.Embedding(n_symbols, dims[0])
        self.encoder = HierarchicalDynamicalEncoder(dims, model_cfg.MODEL_LAYERS_PER_BLOCK, model_cfg.MODEL_N_HEADS, model_cfg.MODEL_MAX_SEQLENS)
        self.bottleneck = CausalTransformer(dims[-1], 1, model_cfg.MODEL_N_HEADS, model_cfg.MODEL_MAX_SEQLENS[-1])
        self.rule_classifier = nn.Linear(dims[-1], n_rules) if n_rules > 0 else nn.Identity()
        self.decoder = HierarchicalDynamicalDecoder(dims, model_cfg.MODEL_LAYERS_PER_BLOCK, model_cfg.MODEL_N_HEADS, model_cfg.MODEL_MAX_SEQLENS)
        self.to_logits = nn.Linear(dims[0], data_cfg.ECA_WIDTH)

    def forward_pretrain(self, x):
        fingerprint, residuals = self.encoder(x)
        fingerprint = self.bottleneck(fingerprint)
        rule_logits = self.rule_classifier(fingerprint.mean(dim=1))
        reconstructed = self.decoder(fingerprint, residuals)
        reconstructed_logits = self.to_logits(reconstructed)
        return reconstructed_logits, rule_logits

    def encode(self, x_symbols):
        x_embedded = self.embedding(x_symbols)
        fingerprint, _ = self.encoder(x_embedded)
        fingerprint = self.bottleneck(fingerprint)
        return fingerprint.mean(dim=1)

class StructuralBreakClassifier(nn.Module):
    def __init__(self, encoder_model):
        super().__init__()
        self.encoder_model = encoder_model
        fingerprint_dim = encoder_model.encoder.levels[-1][-1].proj.out_features
        input_dim = fingerprint_dim * 3
        self.classifier_head = nn.Sequential(
            nn.LayerNorm(input_dim),
            nn.Linear(input_dim, fingerprint_dim),
            nn.GELU(),
            nn.Linear(fingerprint_dim, 1)
        )

    def _get_fingerprint(self, sequences):
        if sequences.shape[0] == 0:
            return torch.zeros(self.encoder_model.encoder.levels[-1][-1].proj.out_features, device=next(self.parameters()).device)
        fingerprints = self.encoder_model.encode(sequences)
        return fingerprints.mean(dim=0)

    def forward(self, before_seqs_batch, after_seqs_batch):
        batch_logits = []
        for before_seqs, after_seqs in zip(before_seqs_batch, after_seqs_batch):
            fp_before = self._get_fingerprint(torch.from_numpy(before_seqs).long().to(next(self.parameters()).device))
            fp_after = self._get_fingerprint(torch.from_numpy(after_seqs).long().to(next(self.parameters()).device))
            combined = torch.cat([fp_before, fp_after, torch.abs(fp_before - fp_after)], dim=0)
            batch_logits.append(self.classifier_head(combined))
        return torch.stack(batch_logits)

# ------------------------------------------------------------------------------
# MODULE 3 & 4: Training, Inference, and Platform Entry Points
# ------------------------------------------------------------------------------
class FineTuneDataset(Dataset):
    def __init__(self, X, y, processor):
        self.X = X
        self.y = y
        self.processor = processor

    def __len__(self):
        return len(self.X)

    def __getitem__(self, idx):
        row = self.X.iloc[idx]
        label = self.y.iloc[idx]
        series = row['series']
        break_point = row['period'] - 1

        series_before = series.iloc[:break_point]
        series_after = series.iloc[break_point:]

        before_sequences = self.processor.process_for_finetune(series_before)
        after_sequences = self.processor.process_for_finetune(series_after)

        return before_sequences, after_sequences, label

def finetune_collate_fn(batch):
    before_batch, after_batch, labels_batch = zip(*batch)
    labels_tensor = torch.tensor(labels_batch, dtype=torch.float32)
    return list(before_batch), list(after_batch), labels_tensor

def train(X, y, model_dir):
    print("--- Starting Training Pipeline ---")
    model_dir = Path(model_dir)
    model_dir.mkdir(exist_ok=True, parents=True)

    # --- Stage 1: MDL Pre-training ---
    print("--- Starting Stage 1: Pre-training ---")
    eca_gen = ECADataGenerator(
        eca_config={'base': config.ECA_RULES_TO_USE, 'composite':[]},
        n_samples_per_rule=config.ECA_N_SAMPLES_PER_RULE,
        timesteps=config.ECA_TIMESTEPS,
        width=config.ECA_WIDTH
    )
    eca_tensors, eca_labels, label_map = eca_gen.generate_training_data()
    n_rules = len(label_map)
    pretrain_dataset = TensorDataset(eca_tensors, eca_labels)
    pretrain_loader = DataLoader(pretrain_dataset, batch_size=config.BATCH_SIZE, shuffle=True)

    autoencoder = MDL_AU_Net_Autoencoder(config, config, n_rules).to(device)
    eca_input_proj = nn.Linear(config.ECA_WIDTH, config.MODEL_DIMENSIONS[0]).to(device)

    optimizer = optim.Adam(list(autoencoder.parameters()) + list(eca_input_proj.parameters()), lr=config.PRETRAIN_LR)
    recon_loss_fn = nn.BCEWithLogitsLoss()
    rule_loss_fn = nn.CrossEntropyLoss()

    for epoch in range(config.PRETRAIN_EPOCHS):
        autoencoder.train()
        total_loss, recon_L, rule_L = 0, 0, 0
        pbar = tqdm(pretrain_loader, desc=f"Pre-train Epoch {epoch+1}/{config.PRETRAIN_EPOCHS}")
        for sequences, labels in pbar:
            sequences, labels = sequences.to(device), labels.to(device)
            optimizer.zero_grad()
            projected_sequences = eca_input_proj(sequences)
            recon_logits, rule_logits = autoencoder.forward_pretrain(projected_sequences)

            loss_recon = recon_loss_fn(recon_logits, sequences > 0.5)
            loss_rule = rule_loss_fn(rule_logits, labels)

            loss = config.MDL_RECON_LOSS_ALPHA * loss_recon + config.MDL_RULE_LOSS_BETA * loss_rule
            loss.backward()
            optimizer.step()
            total_loss += loss.item()
            recon_L += loss_recon.item()
            rule_L += loss_rule.item()
            pbar.set_postfix(loss=total_loss/len(pbar), recon_L=recon_L/len(pbar), rule_L=rule_L/len(pbar))

    # --- Stage 2: Fine-tuning ---
    print("--- Starting Stage 2: Fine-tuning ---")
    symbolizer = PermutationSymbolizer(config.PERMUTATION_EMBEDDING_DIM, config.PERMUTATION_TIME_LAG)
    processor = SeriesProcessor(symbolizer, config.SERIES_PROCESSOR_SEQUENCE_LENGTH, config.SERIES_PROCESSOR_N_AUGMENTATIONS)
    finetune_dataset = FineTuneDataset(X, y, processor)
    finetune_loader = DataLoader(finetune_dataset, batch_size=config.BATCH_SIZE, shuffle=True, collate_fn=finetune_collate_fn)

    classifier = StructuralBreakClassifier(autoencoder).to(device)
    optimizer = optim.Adam(classifier.parameters(), lr=config.FINETUNE_FULL_LR)
    loss_fn = nn.BCEWithLogitsLoss()

    for epoch in range(config.FINETUNE_EPOCHS):
        classifier.train()
        total_loss = 0
        pbar = tqdm(finetune_loader, desc=f"Fine-tune Epoch {epoch+1}/{config.FINETUNE_EPOCHS}")
        for before_seqs, after_seqs, labels in pbar:
            labels = labels.to(device)
            optimizer.zero_grad()
            logits = classifier(before_seqs, after_seqs).squeeze()
            loss = loss_fn(logits, labels)
            loss.backward()
            optimizer.step()
            total_loss += loss.item()
            pbar.set_postfix(loss=total_loss/len(pbar))

    # --- Stage 3: Save Artifacts ---
    print("--- Saving final model artifacts ---")
    joblib.dump(config, model_dir / "config.joblib")
    torch.save(classifier.state_dict(), model_dir / "structural_break_classifier.pth")

def infer(X, model_dir):
    print("--- Starting Inference Pipeline ---")
    model_dir = Path(model_dir)

    loaded_config = joblib.load(model_dir / "config.joblib")

    # Check if any rules were used to determine n_rules
    if loaded_config.ECA_RULES_TO_USE:
        # A bit of a hack to get n_rules, assuming label_map would be based on this
        n_rules = len(loaded_config.ECA_RULES_TO_USE)
    else:
        n_rules = 0

    autoencoder = MDL_AU_Net_Autoencoder(loaded_config, loaded_config, n_rules)
    model = StructuralBreakClassifier(autoencoder)
    model.load_state_dict(torch.load(model_dir / "structural_break_classifier.pth"))
    model.to(device).eval()

    symbolizer = PermutationSymbolizer(loaded_config.PERMUTATION_EMBEDDING_DIM, loaded_config.PERMUTATION_TIME_LAG)
    processor = SeriesProcessor(symbolizer, loaded_config.SERIES_PROCESSOR_SEQUENCE_LENGTH, loaded_config.SERIES_PROCESSOR_N_AUGMENTATIONS)

    with torch.no_grad():
        for i, row in tqdm(X.iterrows(), total=len(X), desc="Inferring"):
            series_data = row['series']
            break_point = row['period'] - 1
            series_before = series_data.iloc[:break_point]
            series_after = series_data.iloc[break_point:]

            before_sequences = processor.process_for_finetune(series_before)
            after_sequences = processor.process_for_finetune(series_after)

            # The model forward pass expects a list of numpy arrays, not a single one
            logits = model([before_sequences], [after_sequences])
            score = torch.sigmoid(logits).item()
            yield score

In [None]:
# ==============================================================================
# @title CELL 3: EXPERIMENT 3 - LARGE-SCALE TRAINING SIMULATION
# ==============================================================================
if __name__ == '__main__':
    print("\n" + "="*80)
    print("      RUNNING EXPERIMENT 3: LARGE-SCALE TRAINING SIMULATION")
    print("="*80)
    print("New Insight: The true training dataset is massive.")
    print("Hypothesis: Previous failure (AUC < 0.5) was due to severe UNDER-TRAINING, not overfitting.")
    print("Action: Using the full-sized model and fine-tuning on a larger, more representative mock dataset.")

    # We use the main 'Config' class defined in Cell 1
    print("\n--- Using Full Production Configuration ---")
    print(f"Model Dimensions: {config.MODEL_DIMENSIONS}")
    print(f"Fine-tuning Epochs: {config.FINETUNE_EPOCHS}")
    print(f"Batch Size: {config.BATCH_SIZE}")
    print("-" * 30)

    # Set seed for reproducibility of this specific experiment
    seed_everything(config.SEED)

    # --- Create a helper function to generate a mock dataset ---
    def create_dataset(n_samples, has_break_prob=0.5):
        X_list, y_list = [], []
        for i in range(n_samples):
            has_break = np.random.rand() < has_break_prob
            length = np.random.randint(500, 1500)
            break_point = np.random.randint(int(length * 0.3), int(length * 0.7))
            t = np.linspace(0, np.random.uniform(5, 15), length)
            noise = np.random.randn(length) * 0.1

            series = pd.Series(np.sin(t * 2 * np.pi) + noise, name='value')
            if has_break:
                t2_freq = np.random.uniform(3, 8)
                t2_amp = np.random.uniform(1.2, 2.0)
                series.iloc[break_point:] = pd.Series(np.cos(t[break_point:] * t2_freq * np.pi) * t2_amp + noise[break_point:])

            X_list.append({'series': series, 'period': break_point + 1})
            y_list.append(1 if has_break else 0)
        return pd.DataFrame(X_list), pd.Series(y_list)

    # --- Create a LARGE mock dataset to simulate the real training data ---
    print("\nCreating a large mock dataset for fine-tuning...")
    # Create 1000 samples for fine-tuning
    train_X, train_y = create_dataset(n_samples=1000)
    # And a separate, unseen test set to evaluate on
    test_X, test_y = create_dataset(n_samples=200)
    print("Mock datasets created.")
    print(f"Training set size: {len(train_X)}, Test set size: {len(test_X)}")


    # --- Run Training and Inference ---
    try:
        # Check for GPU
        if torch.cuda.is_available():
            print("\n✅ Found and using GPU for this test.\n")
        else:
            print("\n⚠️ WARNING: GPU not found. This test will be very slow on CPU.\n")

        # Train the FULL model on the LARGE training data
        train(train_X, train_y, config.MODEL_DIR)

        # Infer on the unseen test data
        print("\nInferring on the unseen test set...")
        predictions = list(infer(test_X, config.MODEL_DIR))

        # --- Evaluate Performance ---
        auc_score = roc_auc_score(test_y, predictions)

        print("\n" + "="*50)
        print("      EXPERIMENT 3 RESULTS")
        print("="*50)
        print(f"✅ Final Out-of-Sample ROC AUC Score: {auc_score:.4f}")
        print("="*50)

        if auc_score > 0.6:
            print("\nConclusion: SUCCESS! The model generalizes when given enough data and training time.")
        elif auc_score > 0.5:
             print("\nConclusion: PARTIAL SUCCESS. The model shows positive learning but needs more tuning.")
        else:
            print("\nConclusion: FAILED. The issue may be more fundamental (e.g., pre-training features not useful).")

    except Exception as e:
        print(f"\nERROR during validation run: {e}")
        import traceback
        traceback.print_exc()

## Solution v7

In [None]:
# ==============================================================================
# @title COMPLETE AND CONSOLIDATED SOLUTION
# This cell contains all code required to run the validation test.
# ==============================================================================

# ==============================================================================
# SECTION 1: SETUP AND CONFIGURATION
# ==============================================================================

# --- Ensure cellpylib is installed ---
try:
    import cellpylib as cpl
except ImportError:
    print("Installing cellpylib...")
    import subprocess
    import sys
    subprocess.check_call([sys.executable, "-m", "pip", "install", "cellpylib"])
    import cellpylib as cpl
    print("✅ cellpylib installed successfully.")

# --- Standard Imports ---
import os
import random
import hashlib
import typing
import math
from pathlib import Path
from dataclasses import dataclass, field
import numpy as np
import pandas as pd
import joblib
from tqdm.notebook import tqdm
import torch
import torch.nn as nn
import torch.optim as optim
from torch.utils.data import DataLoader, TensorDataset, Dataset
import itertools
from sklearn.model_selection import train_test_split
from sklearn.metrics import roc_auc_score
import torch.nn.functional as F

# --- Global Configuration Class ---
@dataclass
class Config:
    SEED: int = 42
    # Data Processing
    PERMUTATION_DIM: int = 5
    PERMUTATION_LAG: int = 1
    SERIES_PROCESSOR_SEQUENCE_LENGTH: int = 256
    SERIES_PROCESSOR_N_SEQUENCES_PER_SEGMENT: int = 10

    ECA_RULES_TO_USE: list = field(default_factory=lambda: [
        22, 30, 45, 54, 60, 75, 82, 86, 89, 90, 105, 106, 110,
        122, 126, 135, 146, 149, 150, 153, 154, 161, 165, 169,
        182, 193, 195, 225
    ])

    ECA_CONFIG: dict = field(init=False, repr=False)

    def __post_init__(self):
        self.ECA_CONFIG = {
            'base': self.ECA_RULES_TO_USE,
            'composite': [
                {'rules': [30, 110], 'timesteps': [10, 10]},
                {'rules': [45, 90, 150], 'timesteps': [5, 5, 5]},
                {'rules': [54, 146], 'timesteps': [15, 5]},
            ]
        }

    ECA_N_SAMPLES_PER_RULE: int = 100
    ECA_TIMESTEPS: int = 64
    ECA_WIDTH: int = 64
    # Model Architecture
    MODEL_DIMENSIONS: typing.List[int] = field(default_factory=lambda: [128, 256, 512])
    MODEL_LAYERS_PER_BLOCK: typing.List[int] = field(default_factory=lambda: [2, 2, 2])
    MODEL_MAX_SEQLENS: typing.List[int] = field(default_factory=lambda: [256, 64, 16])
    MODEL_N_HEADS: int = 4
    MODEL_BOTTLENECK_DIM: int = 32
    # Training Stages
    PRETRAIN_EPOCHS: int = 5
    PRETRAIN_LEARNING_RATE: float = 1e-4
    MDL_RECON_LOSS_WEIGHT: float = 1.0
    MDL_RULE_LOSS_WEIGHT: float = 0.5
    EMBEDDING_PRETRAIN_EPOCHS: int = 5
    EMBEDDING_PRETRAIN_LR: float = 1e-3
    FINETUNE_EPOCHS: int = 10
    FINETUNE_HEAD_ONLY_EPOCHS: int = 3
    FINETUNE_HEAD_LR: float = 1e-3
    FINETUNE_FULL_LR: float = 5e-5
    # General
    BATCH_SIZE: int = 32
    MODEL_DIR: Path = Path("./production_model")

# --- Seeder Function ---
def seed_everything(seed_value: int):
    random.seed(seed_value)
    os.environ['PYTHONHASHSEED'] = str(seed_value)
    np.random.seed(seed_value)
    torch.manual_seed(seed_value)
    if torch.cuda.is_available():
        torch.cuda.manual_seed(seed_value)
        torch.cuda.manual_seed_all(seed_value)
        torch.backends.cudnn.deterministic = True
        torch.backends.cudnn.benchmark = False

# --- Initial Setup ---
config = Config()
seed_everything(config.SEED)
device = "cuda" if torch.cuda.is_available() else "cpu"
print(f"Global configuration set. Using device: {device}")


# ==============================================================================
# SECTION 2: CORE LIBRARY - DATA PROCESSING
# ==============================================================================

class PermutationSymbolizer:
    def __init__(self, dim, lag):
        self.d = dim
        self.tau = lag
        self.permutations = {tuple(p): i for i, p in enumerate(itertools.permutations(range(dim)))}

    def symbolize_vector(self, vector: np.ndarray) -> int:
        hasher = hashlib.sha256(vector.tobytes())
        seed = int.from_bytes(hasher.digest(), 'big') % (2**32)
        local_rand = np.random.RandomState(seed)
        noisy_vector = vector + local_rand.uniform(0, 1e-8, size=vector.shape)
        return self.permutations[tuple(np.argsort(noisy_vector))]

class SeriesProcessor:
    def __init__(self, symbolizer, seq_len, n_seqs):
        self.symbolizer = symbolizer
        self.seq_len = seq_len
        self.n_seqs = n_seqs
        self.d = symbolizer.d
        self.tau = symbolizer.tau

    def _series_to_windows(self, series: np.ndarray):
        shape = (series.shape[0] - (self.d - 1) * self.tau, self.d)
        strides = (series.strides[0], series.strides[0] * self.tau)
        return np.lib.stride_tricks.as_strided(series, shape=shape, strides=strides)

    def process_segment(self, segment: pd.Series) -> typing.List[np.ndarray]:
        if len(segment) < self.d * self.tau: return []
        windows = self._series_to_windows(segment.values)
        symbols = np.apply_along_axis(self.symbolizer.symbolize_vector, 1, windows)
        if len(symbols) < self.seq_len: return []
        num_sequences = min(self.n_seqs, len(symbols) - self.seq_len + 1)
        indices = np.linspace(0, len(symbols) - self.seq_len, num_sequences, dtype=int)
        return [symbols[i:i+self.seq_len] for i in indices]

class ECADataGenerator:
    def __init__(self, base, composite, n_samples_per_rule, timesteps, width):
        self.config = {'base': base, 'composite': composite}
        self.n_samples_per_rule = n_samples_per_rule
        self.timesteps = timesteps
        self.width = width

    def run(self):
        all_sims, all_labels = [], []
        rule_map = {rule: i for i, rule in enumerate(self.config['base'])}
        print(f"Generating synthetic data for {len(self.config['base'])} 'Edge of Chaos' rules...")
        for rule in tqdm(self.config['base'], desc="Generating Base ECA Data"):
            for _ in range(self.n_samples_per_rule):
                init_cond = cpl.init_random(self.width)
                sim = cpl.evolve(init_cond, timesteps=self.timesteps, apply_rule=lambda n, c, t: cpl.nks_rule(n, rule))
                all_sims.append(torch.tensor(sim, dtype=torch.float32))
                all_labels.append(torch.tensor(rule_map[rule], dtype=torch.long))
        return all_sims, all_labels

# ==============================================================================
# SECTION 3: CORE LIBRARY - MODEL ARCHITECTURE
# ==============================================================================
class CausalTransformer(nn.Module):
    def __init__(self, dim, n_heads, ff_mult, n_layers):
        super().__init__()
        self.layers = nn.ModuleList([
            nn.TransformerEncoderLayer(d_model=dim, nhead=n_heads, dim_feedforward=dim*ff_mult,
                                       activation='gelu', batch_first=True, norm_first=True)
            for _ in range(n_layers)
        ])
    def forward(self, x, mask): return self.layers[0](x, mask)

class SimpleTransition(nn.Module):
    def __init__(self, in_dim, out_dim, seq_len_in, seq_len_out):
        super().__init__()
        self.proj = nn.Linear(in_dim, out_dim)
        self.pool = nn.AdaptiveAvgPool1d(seq_len_out)
        self.unpool = nn.Upsample(size=seq_len_in, mode='nearest')
    def down(self, x): return self.pool(x.transpose(1, 2)).transpose(1, 2)
    def up(self, x): return self.unpool(x.transpose(1, 2)).transpose(1, 2)

class HierarchicalDynamicalEncoder(nn.Module):
    def __init__(self, dims, layers, n_heads, max_seqlens):
        super().__init__()
        self.levels = nn.ModuleList()
        for i in range(len(dims) - 1):
            self.levels.append(nn.ModuleDict({
                'block': CausalTransformer(dims[i], n_heads, 4, layers[i]),
                'transition': SimpleTransition(dims[i], dims[i+1], max_seqlens[i], max_seqlens[i+1])
            }))
    def forward(self, x):
        residuals = []
        for level in self.levels:
            mask = nn.Transformer.generate_square_subsequent_mask(x.shape[1]).to(x.device)
            x = level['block'](x, mask)
            residuals.append(x)
            x = level['transition'].proj(x)
            x = level['transition'].down(x)
        return x, residuals

class HierarchicalDynamicalDecoder(nn.Module):
    def __init__(self, dims, layers, n_heads, max_seqlens):
        super().__init__()
        self.levels = nn.ModuleList()
        for i in range(len(dims) - 1, 0, -1):
            self.levels.append(nn.ModuleDict({
                'transition': SimpleTransition(dims[i], dims[i-1], max_seqlens[i-1], max_seqlens[i]),
                'block': CausalTransformer(dims[i-1], n_heads, 4, layers[i-1])
            }))
    def forward(self, x, residuals):
        for i, level in enumerate(self.levels):
            x = level['transition'].up(x)
            x = level['transition'].proj(x)
            x = x + residuals[-(i+1)]
            mask = nn.Transformer.generate_square_subsequent_mask(x.shape[1]).to(x.device)
            x = level['block'](x, mask)
        return x

class MDL_AU_Net_Autoencoder(nn.Module):
    def __init__(self, m_config):
        super().__init__()
        self.n_symbols = math.factorial(config.PERMUTATION_DIM)
        self.embedding = nn.Embedding(self.n_symbols, m_config.MODEL_DIMENSIONS[0])
        self.eca_proj = nn.Linear(config.ECA_WIDTH, m_config.MODEL_DIMENSIONS[0])
        self.encoder = HierarchicalDynamicalEncoder(m_config.MODEL_DIMENSIONS, m_config.MODEL_LAYERS_PER_BLOCK, m_config.MODEL_N_HEADS, m_config.MODEL_MAX_SEQLENS)
        self.bottleneck = CausalTransformer(m_config.MODEL_DIMENSIONS[-1], m_config.MODEL_N_HEADS, 4, 1)
        self.decoder = HierarchicalDynamicalDecoder(m_config.MODEL_DIMENSIONS, m_config.MODEL_LAYERS_PER_BLOCK, m_config.MODEL_N_HEADS, m_config.MODEL_MAX_SEQLENS)
        self.recon_head = nn.Linear(m_config.MODEL_DIMENSIONS[0], config.ECA_WIDTH)
        self.rule_head = nn.Linear(m_config.MODEL_DIMENSIONS[-1], len(config.ECA_RULES_TO_USE))

    def pretrain_forward(self, x):
        x = self.eca_proj(x)
        fingerprint, residuals = self.encoder(x)
        fingerprint = self.bottleneck(fingerprint, None)
        rule_logits = self.rule_head(fingerprint.mean(dim=1))
        reconstructed = self.decoder(fingerprint, residuals)
        recon_logits = self.recon_head(reconstructed)
        return recon_logits, rule_logits

    def encode(self, x):
        x = self.embedding(x)
        fingerprint, _ = self.encoder(x)
        fingerprint = self.bottleneck(fingerprint, None)
        return fingerprint.mean(dim=1)

class StructuralBreakClassifier(nn.Module):
    def __init__(self, encoder_model):
        super().__init__()
        self.encoder_model = encoder_model
        self.classifier_head = nn.Sequential(
            nn.LayerNorm(config.MODEL_BOTTLENECK_DIM * 3),
            nn.Linear(config.MODEL_BOTTLENECK_DIM * 3, config.MODEL_BOTTLENECK_DIM), nn.GELU(),
            nn.Linear(config.MODEL_BOTTLENECK_DIM, 1)
        )
    def forward(self, before_seqs, after_seqs):
        fp_before = self.encoder_model.encode(before_seqs)
        fp_after = self.encoder_model.encode(after_seqs)
        combined = torch.cat([fp_before, fp_after, torch.abs(fp_before - fp_after)], dim=1)
        return self.classifier_head(combined)

# ==============================================================================
# SECTION 4: PIPELINE LOGIC - TRAINING & ARTIFACTS
# ==============================================================================

class MDLPreTrainer:
    def __init__(self, model, data_config, model_config):
        self.model = model
        self.data_config = data_config
        self.model_config = model_config
        self.data_generator = ECADataGenerator(**data_config.ECA_CONFIG, n_samples_per_rule=data_config.ECA_N_SAMPLES_PER_RULE, timesteps=data_config.ECA_TIMESTEPS, width=data_config.ECA_WIDTH)
        self.optimizer = optim.Adam(self.model.parameters(), lr=self.model_config.PRETRAIN_LEARNING_RATE)
        self.recon_loss_fn = nn.BCEWithLogitsLoss()
        self.rule_loss_fn = nn.CrossEntropyLoss()

    def run(self):
        print("--- Starting Stage 1: MDL Pre-training ---")
        sims, labels = self.data_generator.run()
        dataset = TensorDataset(torch.stack(sims), torch.stack(labels))
        loader = DataLoader(dataset, batch_size=self.model_config.BATCH_SIZE, shuffle=True)
        for epoch in range(self.model_config.PRETRAIN_EPOCHS):
            self.model.train()
            pbar = tqdm(loader, desc=f"Pre-train Epoch {epoch+1}/{self.model_config.PRETRAIN_EPOCHS}")
            for batch_sims, batch_labels in pbar:
                batch_sims, batch_labels = batch_sims.to(device), batch_labels.to(device)
                self.optimizer.zero_grad()
                recon_logits, rule_logits = self.model.pretrain_forward(batch_sims)
                recon_loss = self.recon_loss_fn(recon_logits, batch_sims)
                rule_loss = self.rule_loss_fn(rule_logits, batch_labels)
                total_loss = self.model_config.MDL_RECON_LOSS_WEIGHT * recon_loss + self.model_config.MDL_RULE_LOSS_WEIGHT * rule_loss
                total_loss.backward()
                self.optimizer.step()
                pbar.set_postfix(loss=total_loss.item(), recon_L=recon_loss.item(), rule_L=rule_loss.item())
        print("--- Pre-training Complete ---")

class BreakFinetuningDataset(Dataset):
    def __init__(self, X, y, processor):
        self.y = y
        self.series_processor = processor
        print("Processing real-world data into symbolic sequences...")
        self.processed_data = []
        for i, row in tqdm(X.iterrows(), total=len(X), desc="Processing All Series"):
            series_data = row['value']
            break_point = row['period'] - 1
            before_segment, after_segment = series_data.iloc[:break_point], series_data.iloc[break_point:]
            before_seqs = self.series_processor.process_segment(before_segment)
            after_seqs = self.series_processor.process_segment(after_segment)
            if before_seqs and after_seqs:
                self.processed_data.append((before_seqs, after_seqs, y.iloc[i]))

    def __len__(self): return len(self.processed_data)
    def __getitem__(self, idx):
        before_seqs, after_seqs, label = self.processed_data[idx]
        before_seq = random.choice(before_seqs)
        after_seq = random.choice(after_seqs)
        return torch.tensor(before_seq, dtype=torch.long), torch.tensor(after_seq, dtype=torch.long), torch.tensor(label, dtype=torch.float)

class ArtifactSaver:
    def __init__(self, model_dir):
        self.model_dir = Path(model_dir)
        self.model_dir.mkdir(exist_ok=True)

    def save(self, model, config_to_save):
        print(f"--- Saving Artifacts to {self.model_dir} ---")
        torch.save(model.state_dict(), self.model_dir / "final_model.pth")
        joblib.dump(config_to_save, self.model_dir / "model_config.joblib")
        print("Artifacts saved.")

# ==============================================================================
# SECTION 5: PLATFORM ENTRY POINTS
# ==============================================================================

def train(X, y, model_dir):
    print(f"Starting training on device: {device}")
    autoencoder = MDL_AU_Net_Autoencoder(config).to(device)
    pre_trainer = MDLPreTrainer(autoencoder, config, config)
    pre_trainer.run()

    print("--- Starting Stage 2: Fine-tuning ---")
    symbolizer = PermutationSymbolizer(config.PERMUTATION_DIM, config.PERMUTATION_LAG)
    processor = SeriesProcessor(symbolizer, config.SERIES_PROCESSOR_SEQUENCE_LENGTH, config.SERIES_PROCESSOR_N_SEQUENCES_PER_SEGMENT)
    finetune_dataset = BreakFinetuningDataset(X, y, processor)
    finetune_loader = DataLoader(finetune_dataset, batch_size=config.BATCH_SIZE, shuffle=True)

    classifier = StructuralBreakClassifier(autoencoder).to(device)
    optimizer = optim.Adam(classifier.parameters(), lr=config.FINETUNE_FULL_LR)
    loss_fn = nn.BCEWithLogitsLoss()

    for epoch in range(config.FINETUNE_EPOCHS):
        classifier.train()
        total_loss = 0
        pbar = tqdm(finetune_loader, desc=f"Fine-tune Epoch {epoch+1}/{config.FINETUNE_EPOCHS}")
        for before_batch, after_batch, labels_batch in pbar:
            before_batch, after_batch, labels_batch = before_batch.to(device), after_batch.to(device), labels_batch.to(device)
            optimizer.zero_grad()
            logits = classifier(before_batch, after_batch).squeeze()
            loss = loss_fn(logits, labels_batch)
            loss.backward()
            optimizer.step()
            total_loss += loss.item()
            pbar.set_postfix(loss=loss.item())
        print(f"Epoch {epoch+1} Average Loss: {total_loss / len(finetune_loader):.6f}")

    print("--- Fine-tuning Complete ---")
    config_to_save = {k: v for k, v in config.__dict__.items() if k != 'ECA_CONFIG'}
    saver = ArtifactSaver(model_dir)
    saver.save(classifier, config_to_save)
    print("\n" + "="*60 + "\n                TRAINING PIPELINE FINISHED\n" + "="*60)

def infer(X, model_dir):
    model_dir = Path(model_dir)
    loaded_config_dict = joblib.load(model_dir / "model_config.joblib")

    @dataclass
    class LoadedConfig:
        pass
    loaded_config = LoadedConfig()
    for k, v in loaded_config_dict.items(): setattr(loaded_config, k, v)

    model = StructuralBreakClassifier(MDL_AU_Net_Autoencoder(loaded_config)).to(device)
    model.load_state_dict(torch.load(model_dir / "final_model.pth"))
    model.eval()

    symbolizer = PermutationSymbolizer(loaded_config.PERMUTATION_DIM, loaded_config.PERMUTATION_LAG)
    processor = SeriesProcessor(symbolizer, loaded_config.SERIES_PROCESSOR_SEQUENCE_LENGTH, loaded_config.SERIES_PROCESSOR_N_SEQUENCES_PER_SEGMENT)

    print("--- Starting Inference ---")
    yield {"health": "ok"} # Health check

    for df in X:
        series_data = df['value']
        break_point = df['period'].iloc[0] - 1
        before_segment, after_segment = series_data.iloc[:break_point], series_data.iloc[break_point:]
        before_seqs = processor.process_segment(before_segment)
        after_seqs = processor.process_segment(after_segment)

        if not before_seqs or not after_seqs:
            yield 0.5; continue

        with torch.no_grad():
            before_batch = torch.tensor(np.array(before_seqs), dtype=torch.long).to(device)
            after_batch = torch.tensor(np.array(after_seqs), dtype=torch.long).to(device)
            logits = model(before_batch, after_batch)
            score = torch.sigmoid(logits).mean().item()
        yield score

# ==============================================================================
# SECTION 6: TEST EXECUTION BLOCK
# ==============================================================================
if __name__ == '__main__':
    # --- Create a standard config instance and then override it ---
    config = Config()

    print("--- Modifying main Config for a validation run. ---")
    config.ECA_N_SAMPLES_PER_RULE = 20
    config.PRETRAIN_EPOCHS = 3
    config.EMBEDDING_PRETRAIN_EPOCHS = 3
    config.FINETUNE_HEAD_ONLY_EPOCHS = 3
    config.FINETUNE_EPOCHS = 5
    config.BATCH_SIZE = 16

    print("\n" + "="*50)
    print("      RUNNING TIER 3 VALIDATION (Platform Compliant)")
    print("="*50)

    seed_everything(config.SEED)

    # --- Create Mock Data & Split ---
    def create_mock_series_data(series_id, has_break=False, length=500, break_point=250):
        t = np.linspace(0, 10, length)
        noise = np.random.randn(length) * 0.1
        series_values = np.sin(t * 2 * np.pi) + noise
        if has_break:
            series_values[break_point:] = np.cos(t[break_point:] * 5 * np.pi) * 1.5 + noise[break_point:]
        data = []
        for time_step in range(length):
            data.append({'id': series_id, 'time': time_step, 'value': series_values[time_step]})
        return pd.DataFrame(data)

    mock_X_list, mock_y_list = [], []
    for i in range(100):
        has_break = i % 2 == 0
        df = create_mock_series_data(i, has_break=has_break)
        df['period'] = 251
        X_df = df[['value', 'period']]
        y_df = pd.Series([1 if has_break else 0], index=[i])
        mock_X_list.append(X_df)
        mock_y_list.append(y_df)

    X_all = pd.concat(mock_X_list, keys=range(100), names=['id', 'time'])
    y_all = pd.concat(mock_y_list)

    train_ids, test_ids = train_test_split(y_all.index, test_size=0.3, random_state=config.SEED, stratify=y_all)

    X_train = X_all[X_all.index.get_level_values('id').isin(train_ids)]
    y_train = y_all[y_all.index.isin(train_ids)]
    X_test = X_all[X_all.index.get_level_values('id').isin(test_ids)]
    y_test = y_all[y_all.index.isin(test_ids)]
    print(f"Training samples: {len(train_ids)}, Testing samples: {len(test_ids)}")

    # --- Run Training & Inference ---
    try:
        model_dir = Path("./final_model_validation")
        train(X_train, y_train, str(model_dir))

        test_iterable = [X_test.loc[test_id] for test_id in test_ids]
        predictions = list(infer(test_iterable, str(model_dir)))
        if isinstance(predictions[0], dict) and "health" in predictions[0]:
            predictions.pop(0)

        y_test_values = y_test.values
        auc_score = roc_auc_score(y_test_values, predictions)
        print("\n" + "="*20 + " FINAL RESULT " + "="*20)
        print(f"📈 Final Out-of-Sample ROC AUC Score: {auc_score:.4f}")
        print("="*54)

    except Exception as e:
        print(f"\nERROR during full validation run: {e}")
        import traceback
        traceback.print_exc()

## Solution v6

In [None]:
# ==============================================================================
# @title CELL 1: SETUP, CONFIGURATION, AND SEEDING (Corrected)
# ==============================================================================

# --- Ensure cellpylib is installed ---
try:
    import cellpylib as cpl
    print("✅ cellpylib is already installed.")
except ImportError:
    print("Installing cellpylib...")
    import subprocess
    import sys
    subprocess.check_call([sys.executable, "-m", "pip", "install", "cellpylib"])
    import cellpylib as cpl
    print("✅ cellpylib installed successfully.")

# --- Standard Imports ---
import os
import random
import hashlib
import typing
import math
from pathlib import Path
from dataclasses import dataclass, field
import numpy as np
import pandas as pd
import joblib
from tqdm.notebook import tqdm
import torch
import torch.nn as nn
import torch.optim as optim
from torch.utils.data import DataLoader, TensorDataset
import itertools
from sklearn.model_selection import train_test_split
from sklearn.metrics import roc_auc_score

# --- Global Configuration Class (with 3-Stage Training Params) ---
@dataclass
class Config:
    SEED: int = 42
    # Data Processing
    PERMUTATION_DIM: int = 5
    PERMUTATION_LAG: int = 1
    SERIES_PROCESSOR_SEQUENCE_LENGTH: int = 256
    SERIES_PROCESSOR_N_SEQUENCES_PER_SEGMENT: int = 10

    # --- THIS ATTRIBUTE IS NOW GUARANTEED TO BE IN THE CLASS DEFINITION ---
    ECA_RULES_TO_USE: list = field(default_factory=lambda: [
        22, 30, 45, 54, 60, 75, 82, 86, 89, 90, 105, 106, 110,
        122, 126, 135, 146, 149, 150, 153, 154, 161, 165, 169,
        182, 193, 195, 225
    ])
    # ----------------------------------------------------------------------

    # ECA Pre-training
    ECA_N_SAMPLES_PER_RULE: int = 100
    ECA_TIMESTEPS: int = 64
    ECA_WIDTH: int = 64
    # Model Architecture
    MODEL_DIMENSIONS: typing.List[int] = field(default_factory=lambda: [128, 256, 512])
    MODEL_LAYERS_PER_BLOCK: typing.List[int] = field(default_factory=lambda: [2, 2, 2])
    MODEL_MAX_SEQLENS: typing.List[int] = field(default_factory=lambda: [256, 64, 16])
    MODEL_N_HEADS: int = 4
    MODEL_BOTTLENECK_DIM: int = 32
    # Stage 1: Encoder Pre-training
    PRETRAIN_EPOCHS: int = 5
    PRETRAIN_LEARNING_RATE: float = 1e-4
    MDL_RECON_LOSS_WEIGHT: float = 1.0
    MDL_RULE_LOSS_WEIGHT: float = 0.5
    # Stage 1.5: Embedding Pre-training
    EMBEDDING_PRETRAIN_EPOCHS: int = 5
    EMBEDDING_PRETRAIN_LR: float = 1e-3
    # Stage 2: Final Fine-tuning
    FINETUNE_EPOCHS: int = 10
    FINETUNE_HEAD_ONLY_EPOCHS: int = 3
    FINETUNE_HEAD_LR: float = 1e-3
    FINETUNE_FULL_LR: float = 5e-5 # Discriminative LR for full model tuning
    # General
    BATCH_SIZE: int = 32
    MODEL_DIR: Path = Path("./production_model")


# --- Seeder Function ---
def seed_everything(seed_value: int):
    random.seed(seed_value)
    os.environ['PYTHONHASHSEED'] = str(seed_value)
    np.random.seed(seed_value)
    torch.manual_seed(seed_value)
    if torch.cuda.is_available():
        torch.cuda.manual_seed(seed_value)
        torch.cuda.manual_seed_all(seed_value)
        torch.backends.cudnn.deterministic = True
        torch.backends.cudnn.benchmark = False

# --- Initial Setup ---
config = Config()
seed_everything(config.SEED)
device = "cuda" if torch.cuda.is_available() else "cpu"
print(f"Global configuration set. Using device: {device}")

In [None]:
# ==============================================================================
# @title CELL 2: CORE LIBRARY (Data, Model Architecture)
# ==============================================================================

class PermutationSymbolizer:
    def __init__(self, dim: int, lag: int):
        self.d = dim
        self.lag = lag

    def _get_permutation_indices(self, windows: np.ndarray) -> np.ndarray:
        hasher = hashlib.sha256(windows.tobytes())
        seed = int.from_bytes(hasher.digest(), 'little') % (2**32 - 1)
        rng = np.random.RandomState(seed)
        noise = rng.uniform(low=-1e-8, high=1e-8, size=windows.shape)
        return (windows + noise).argsort(axis=1)

    def _lehmer_encode(self, permutations: np.ndarray) -> np.ndarray:
        d = self.d
        symbols = np.zeros(permutations.shape[0], dtype=np.int64)
        factorials = np.array([math.factorial(d - 1 - i) for i in range(d)], dtype=np.int64)
        temp_perms = np.copy(permutations)
        for i in range(d):
            rank = temp_perms[:, 0]
            symbols += rank * factorials[i]
            temp_perms = temp_perms[:, 1:]
            if temp_perms.shape[1] > 0:
                temp_perms[temp_perms > rank[:, np.newaxis]] -= 1
        return symbols

    def symbolize(self, series: np.ndarray) -> np.ndarray:
        if len(series) < self.d * self.lag:
            return np.array([], dtype=np.int64)
        shape = (len(series) - (self.d - 1) * self.lag, self.d)
        strides = (series.strides[0], series.strides[0] * self.lag)
        windows = np.lib.stride_tricks.as_strided(series, shape=shape, strides=strides)
        permutations = self._get_permutation_indices(windows)
        return self._lehmer_encode(permutations)

class SeriesProcessor:
    def __init__(self, symbolizer: PermutationSymbolizer, sequence_length: int, n_sequences: int):
        self.symbolizer = symbolizer
        self.seq_len = sequence_length
        self.n_seq = n_sequences
    def process(self, series: pd.Series) -> typing.List[torch.Tensor]:
        if series.empty or len(series) < self.symbolizer.d:
            return []
        symbols = self.symbolizer.symbolize(series.values)
        if len(symbols) < self.seq_len:
            return []
        # Take n_seq evenly spaced subsequences
        indices = np.linspace(0, len(symbols) - self.seq_len, self.n_seq, dtype=int)
        sequences = [torch.LongTensor(symbols[i : i + self.seq_len]) for i in indices]
        return sequences

class ECADataGenerator:
    def __init__(self, eca_config: dict, n_samples_per_rule: int, timesteps: int, width: int):
        self.config = eca_config
        self.n_samples = n_samples_per_rule
        self.timesteps = timesteps
        self.width = width
        self.rule_map = self._create_rule_map()
    def _create_rule_map(self) -> dict:
        rule_map = {}
        idx = 0
        for rule_num in self.config.get('base', []):
            rule_map[idx] = {'type': 'base', 'rule': rule_num}
            idx += 1
        for comp_rule in self.config.get('composite', []):
            rule_map[idx] = {'type': 'composite', 'config': comp_rule}
            idx += 1
        return rule_map
    def generate(self) -> typing.Tuple[typing.List[torch.Tensor], typing.List[torch.Tensor]]:
        all_tensors, all_labels = [], []
        print(f"Generating synthetic data for {len(self.rule_map)} 'Edge of Chaos' rule configurations...")
        for rule_idx, rule_def in tqdm(self.rule_map.items(), desc="Generating ECA Data"):
            for _ in range(self.n_samples):
                initial_cond = cpl.init_random(self.width)
                if rule_def['type'] == 'base':
                    ca = cpl.evolve(initial_cond, timesteps=self.timesteps, apply_rule=lambda n, c, t: cpl.nks_rule(n, rule_def['rule']))
                else:
                    comp_config = rule_def['config']
                    ca = cpl.evolve(initial_cond, timesteps=self.timesteps, apply_rule=lambda n, c, t: cpl.nks_rule(n, comp_config['rules'][(t-1) % len(comp_config['rules'])]))
                all_tensors.append(torch.FloatTensor(ca))
                all_labels.append(torch.LongTensor([rule_idx]))
        print(f"Generated {len(all_tensors)} total ECA simulations.")
        return all_tensors, all_labels

class CausalTransformer(nn.Module):
    def __init__(self, model_dim: int, n_heads: int, dropout: float = 0.1):
        super().__init__()
        self.attn = nn.MultiheadAttention(model_dim, n_heads, dropout=dropout, batch_first=True)
        self.ffn = nn.Sequential(nn.Linear(model_dim, model_dim * 4), nn.GELU(), nn.Linear(model_dim * 4, model_dim), nn.Dropout(dropout))
        self.norm1 = nn.LayerNorm(model_dim)
        self.norm2 = nn.LayerNorm(model_dim)
        self.dropout = nn.Dropout(dropout)
    def forward(self, x: torch.Tensor, mask: torch.Tensor = None) -> torch.Tensor:
        attn_output, _ = self.attn(x, x, x, attn_mask=mask, need_weights=False, is_causal=True) # Use causal mask
        x = self.norm1(x + self.dropout(attn_output))
        ffn_output = self.ffn(x)
        x = self.norm2(x + self.dropout(ffn_output))
        return x

class SimpleTransition(nn.Module):
    def __init__(self, model_dim_in: int, model_dim_out: int):
        super().__init__()
        self.proj = nn.Linear(model_dim_in, model_dim_out)
        self.pool = nn.AvgPool1d(kernel_size=4, stride=4)
        self.upsample = nn.Upsample(scale_factor=4, mode='nearest')
        self.norm = nn.LayerNorm(model_dim_out)
    def down(self, x: torch.Tensor) -> torch.Tensor:
        x = x.transpose(1, 2)
        x = self.pool(x)
        x = x.transpose(1, 2)
        x = self.proj(x)
        return self.norm(x)
    def up(self, x: torch.Tensor) -> torch.Tensor:
        x = self.proj(x)
        x = x.transpose(1, 2)
        x = self.upsample(x)
        x = x.transpose(1, 2)
        return self.norm(x)

class HierarchicalDynamicalEncoder(nn.Module):
    def __init__(self, config: Config):
        super().__init__()
        self.layers = nn.ModuleList()
        self.transitions = nn.ModuleList()
        dims = config.MODEL_DIMENSIONS
        for i in range(len(dims)):
            self.layers.append(nn.ModuleList([CausalTransformer(dims[i], config.MODEL_N_HEADS) for _ in range(config.MODEL_LAYERS_PER_BLOCK[i])]))
            if i < len(dims) - 1:
                self.transitions.append(SimpleTransition(dims[i], dims[i+1]))
    def forward(self, x: torch.Tensor) -> typing.List[torch.Tensor]:
        residuals = []
        for i, stage in enumerate(self.layers):
            for layer in stage:
                x = layer(x)
            residuals.append(x)
            if i < len(self.transitions):
                x = self.transitions[i].down(x)
        return residuals

class HierarchicalDynamicalDecoder(nn.Module):
    def __init__(self, config: Config):
        super().__init__()
        self.layers = nn.ModuleList()
        self.transitions = nn.ModuleList()
        dims = config.MODEL_DIMENSIONS
        for i in range(len(dims) -1, -1, -1):
            self.layers.append(nn.ModuleList([CausalTransformer(dims[i], config.MODEL_N_HEADS) for _ in range(config.MODEL_LAYERS_PER_BLOCK[i])]))
            if i > 0:
                self.transitions.append(SimpleTransition(dims[i], dims[i-1]))
    def forward(self, residuals: typing.List[torch.Tensor]) -> torch.Tensor:
        x = residuals.pop()
        for i, stage in enumerate(self.layers):
            for layer in stage:
                x = layer(x)
            if i < len(self.transitions):
                # The residual from the encoder has shape (B, S_long, D_in)
                # The upsampled x has shape (B, S_long, D_out)
                # We need to project the residual before adding
                res_to_add = residuals.pop()
                x = self.transitions[i].up(x)
                if x.shape[2] != res_to_add.shape[2]:
                     # This case shouldn't happen with correct config but is a safeguard
                     pass # or add a projection
                x = x + res_to_add
        return x

class MDL_AU_Net_Autoencoder(nn.Module):
    def __init__(self, config: Config, num_rules: int):
        super().__init__()
        self.config = config
        self.n_symbols = math.factorial(config.PERMUTATION_DIM)
        self.symbol_embedding_head = nn.Embedding(self.n_symbols, config.MODEL_DIMENSIONS[0])
        self.eca_input_proj = nn.Linear(config.ECA_WIDTH, config.MODEL_DIMENSIONS[0])
        self.encoder = HierarchicalDynamicalEncoder(config)
        self.decoder = HierarchicalDynamicalDecoder(config)
        self.bottleneck = nn.Linear(config.MODEL_DIMENSIONS[-1], config.MODEL_BOTTLENECK_DIM)
        self.rule_classifier_head = nn.Linear(config.MODEL_BOTTLENECK_DIM, num_rules)
        self.reconstruction_head = nn.Linear(config.MODEL_DIMENSIONS[0], config.ECA_WIDTH)
    def get_fingerprint(self, symbol_sequences: torch.Tensor) -> torch.Tensor:
        x = self.symbol_embedding_head(symbol_sequences)
        residuals = self.encoder(x)
        fingerprint = residuals[-1].mean(dim=1)
        return self.bottleneck(fingerprint)
    def forward(self, eca_tensors: torch.Tensor) -> typing.Tuple[torch.Tensor, torch.Tensor]:
        x = self.eca_input_proj(eca_tensors)
        residuals = self.encoder(x)
        bottleneck_input = residuals[-1]
        bottleneck_out = self.bottleneck(bottleneck_input.mean(dim=1))
        rule_logits = self.rule_classifier_head(bottleneck_out)
        decoded = self.decoder(list(residuals)) # Pass a copy
        reconstructed_tensor = self.reconstruction_head(decoded)
        return reconstructed_tensor, rule_logits

class StructuralBreakClassifier(nn.Module):
    def __init__(self, autoencoder: MDL_AU_Net_Autoencoder, config: Config):
        super().__init__()
        self.autoencoder = autoencoder
        self.config = config
        classifier_input_dim = config.MODEL_BOTTLENECK_DIM * 3
        self.classifier_head = nn.Sequential(
            nn.LayerNorm(classifier_input_dim),
            nn.Linear(classifier_input_dim, config.MODEL_BOTTLENECK_DIM),
            nn.GELU(),
            nn.Linear(config.MODEL_BOTTLENECK_DIM, 1)
        )
    def _get_fingerprint(self, sequences: typing.List[torch.Tensor]) -> torch.Tensor:
        model_device = next(self.parameters()).device
        if not sequences:
            return torch.zeros(self.config.MODEL_BOTTLENECK_DIM, device=model_device)
        batch = nn.utils.rnn.pad_sequence(sequences, batch_first=True, padding_value=0).to(model_device)
        fingerprints = self.autoencoder.get_fingerprint(batch)
        return fingerprints.mean(dim=0)
    def forward(self, before_seq_list: list, after_seq_list: list) -> torch.Tensor:
        fp_before_batch = torch.stack([self._get_fingerprint(seqs) for seqs in before_seq_list])
        fp_after_batch = torch.stack([self._get_fingerprint(seqs) for seqs in after_seq_list])
        fp_diff = torch.abs(fp_before_batch - fp_after_batch)
        combined_fp = torch.cat([fp_before_batch, fp_after_batch, fp_diff], dim=1)
        return self.classifier_head(combined_fp)

class ArtifactManager:
    def __init__(self, model_dir: Path):
        self.model_dir = model_dir
        self.model_dir.mkdir(exist_ok=True, parents=True)
    def save(self, model: StructuralBreakClassifier, config: Config):
        print(f"Saving artifacts to {self.model_dir}...")
        # Save the full state dict of the final classifier model
        torch.save(model.state_dict(), self.model_dir / "structural_break_classifier.pth")
        joblib.dump(config, self.model_dir / "config.joblib")
        print("Artifacts saved.")
    def load_for_inference(self, device: str) -> typing.Tuple[StructuralBreakClassifier, Config]:
        print(f"Loading artifacts from {self.model_dir}...")
        loaded_config = joblib.load(self.model_dir / "config.joblib")
        # To load the state dict, we need to know the number of rules the original model was trained on
        # This is a bit of a hack; a better way would be to save this in the config
        # For now, we assume a placeholder value.
        num_rules_placeholder = len(loaded_config.ECA_RULES_TO_USE) + 10 # A safe upper bound
        autoencoder = MDL_AU_Net_Autoencoder(loaded_config, num_rules=num_rules_placeholder)
        inference_model = StructuralBreakClassifier(autoencoder, loaded_config)
        inference_model.load_state_dict(torch.load(self.model_dir / "structural_break_classifier.pth", map_location=device))
        print("Artifacts loaded successfully.")
        return inference_model.to(device).eval(), loaded_config

In [None]:
# ==============================================================================
# @title CELL 3: THE NEW 3-STAGE TRAINING PIPELINE
# ==============================================================================

class MDLPreTrainer:
    """Orchestrates Stage 1: Pre-training the core U-Net Encoder on raw ECA data."""
    def __init__(self, model: MDL_AU_Net_Autoencoder, eca_config: dict, config: Config, device: str):
        self.model = model
        self.eca_config = eca_config
        self.config = config
        self.device = device
        self.optimizer = optim.Adam(model.parameters(), lr=config.PRETRAIN_LEARNING_RATE)
        self.recon_loss_fn = nn.MSELoss()
        self.rule_loss_fn = nn.CrossEntropyLoss()
    def run(self):
        generator = ECADataGenerator(self.eca_config, self.config.ECA_N_SAMPLES_PER_RULE, self.config.ECA_TIMESTEPS, self.config.ECA_WIDTH)
        tensors_list, labels_list = generator.generate()
        dataset = TensorDataset(torch.stack(tensors_list), torch.cat(labels_list))
        loader = DataLoader(dataset, batch_size=self.config.BATCH_SIZE, shuffle=True, drop_last=True)
        self.model.train()
        for epoch in range(self.config.PRETRAIN_EPOCHS):
            pbar = tqdm(loader, desc=f"Stage 1: Pre-train Epoch {epoch+1}/{self.config.PRETRAIN_EPOCHS}")
            for batch_tensors, batch_labels in pbar:
                batch_tensors, batch_labels = batch_tensors.to(self.device), batch_labels.to(self.device)
                self.optimizer.zero_grad()
                recon_tensors, rule_logits = self.model(batch_tensors)
                recon_loss = self.recon_loss_fn(recon_tensors, batch_tensors)
                rule_loss = self.rule_loss_fn(rule_logits, batch_labels)
                total_loss = self.config.MDL_RECON_LOSS_WEIGHT * recon_loss + self.config.MDL_RULE_LOSS_WEIGHT * rule_loss
                total_loss.backward()
                self.optimizer.step()
                pbar.set_postfix(loss=total_loss.item(), recon_L=recon_loss.item(), rule_L=rule_loss.item())

class EmbeddingPreTrainer:
    """Orchestrates Stage 1.5: Pre-training the Embedding layer on Symbolized ECA data."""
    def __init__(self, model: MDL_AU_Net_Autoencoder, eca_config: dict, config: Config, device: str):
        self.model = model
        self.eca_config = eca_config
        self.config = config
        self.device = device
        self.optimizer = optim.Adam(
            list(model.symbol_embedding_head.parameters()) + list(model.rule_classifier_head.parameters()),
            lr=config.EMBEDDING_PRETRAIN_LR
        )
        self.loss_fn = nn.CrossEntropyLoss()
        self.symbolizer = PermutationSymbolizer(dim=config.PERMUTATION_DIM, lag=config.PERMUTATION_LAG)

    def _symbolize_eca_data(self, raw_tensors: list) -> list:
        symbolized_data = []
        pbar = tqdm(raw_tensors, desc="Symbolizing ECA data for Stage 1.5")
        for tensor in pbar:
            # Create a 1D series by taking the mean across the width dimension
            series_1d = tensor.mean(dim=1).cpu().numpy()
            symbols = self.symbolizer.symbolize(series_1d)
            if len(symbols) >= self.config.SERIES_PROCESSOR_SEQUENCE_LENGTH:
                symbolized_data.append(torch.LongTensor(symbols[:self.config.SERIES_PROCESSOR_SEQUENCE_LENGTH]))
            else: # Pad if too short
                padded = np.pad(symbols, (0, self.config.SERIES_PROCESSOR_SEQUENCE_LENGTH - len(symbols)), 'constant')
                symbolized_data.append(torch.LongTensor(padded))
        return symbolized_data

    def run(self):
        # Freeze the already-trained encoder/decoder parts
        for name, param in self.model.named_parameters():
            if 'symbol_embedding_head' not in name and 'rule_classifier_head' not in name:
                param.requires_grad = False
            else:
                param.requires_grad = True

        generator = ECADataGenerator(self.eca_config, self.config.ECA_N_SAMPLES_PER_RULE, self.config.ECA_TIMESTEPS, self.config.ECA_WIDTH)
        tensors_list, labels_list = generator.generate()
        symbol_sequences = self._symbolize_eca_data(tensors_list)

        dataset = TensorDataset(torch.stack(symbol_sequences), torch.cat(labels_list))
        loader = DataLoader(dataset, batch_size=self.config.BATCH_SIZE, shuffle=True, drop_last=True)
        self.model.train()
        for epoch in range(self.config.EMBEDDING_PRETRAIN_EPOCHS):
            pbar = tqdm(loader, desc=f"Stage 1.5: Embedding Pre-train Epoch {epoch+1}/{self.config.EMBEDDING_PRETRAIN_EPOCHS}")
            for batch_symbols, batch_labels in pbar:
                batch_symbols, batch_labels = batch_symbols.to(self.device), batch_labels.to(self.device)
                self.optimizer.zero_grad()
                fingerprint = self.model.get_fingerprint(batch_symbols)
                rule_logits = self.model.rule_classifier_head(fingerprint)
                loss = self.loss_fn(rule_logits, batch_labels)
                loss.backward()
                self.optimizer.step()
                pbar.set_postfix(loss=loss.item())

class BreakClassifierFinetuner:
    """Orchestrates Stage 2: Fine-tuning on the real-world structural break task with two phases."""
    def __init__(self, model: StructuralBreakClassifier, config: Config, device: str):
        self.model = model
        self.config = config
        self.device = device
        self.loss_fn = nn.BCEWithLogitsLoss()

    def run(self, X_data: pd.DataFrame, y_data: pd.Series):
        print("Processing real-world data for final fine-tuning...")
        symbolizer = PermutationSymbolizer(self.config.PERMUTATION_DIM, self.config.PERMUTATION_LAG)
        processor = SeriesProcessor(symbolizer, self.config.SERIES_PROCESSOR_SEQUENCE_LENGTH, self.config.SERIES_PROCESSOR_N_SEQUENCES_PER_SEGMENT)
        all_before_data, all_after_data = [], []
        # The platform data has a MultiIndex ('id', 'time'). We group by 'id'.
        grouped = X_data.groupby(level='id')
        for _, series_df in tqdm(grouped, desc="Processing All Series"):
            full_series = series_df['value']
            # The 'period' column indicates the first timestep of the new regime.
            # So the break is at period - 1.
            break_point = series_df['period'].iloc[0] - 1
            all_before_data.append(processor.process(full_series.iloc[:break_point]))
            all_after_data.append(processor.process(full_series.iloc[break_point:]))

        labels = y_data.values

        # --- Phase 2a: Fine-tune Head Only ---
        print("\n--- Stage 2a: Fine-tuning Head Only ---")
        for name, param in self.model.autoencoder.named_parameters():
            param.requires_grad = False
        for param in self.model.classifier_head.parameters():
            param.requires_grad = True

        optimizer_head = optim.Adam(self.model.classifier_head.parameters(), lr=self.config.FINETUNE_HEAD_LR)
        self.model.train()
        for epoch in range(self.config.FINETUNE_HEAD_ONLY_EPOCHS):
            indices = np.random.permutation(len(y_data))
            pbar = tqdm(range(0, len(indices), self.config.BATCH_SIZE), desc=f"Fine-tune Head Epoch {epoch+1}/{self.config.FINETUNE_HEAD_ONLY_EPOCHS}")
            for i in pbar:
                batch_indices = indices[i:i+self.config.BATCH_SIZE]
                batch_before = [all_before_data[j] for j in batch_indices]
                batch_after = [all_after_data[j] for j in batch_indices]
                batch_labels = torch.FloatTensor(labels[batch_indices]).unsqueeze(1).to(self.device)
                optimizer_head.zero_grad()
                logits = self.model(batch_before, batch_after)
                loss = self.loss_fn(logits, batch_labels)
                loss.backward()
                optimizer_head.step()
                pbar.set_postfix(loss=loss.item())

        # --- Phase 2b: Fine-tune Full Model ---
        print("\n--- Stage 2b: Fine-tuning Full Model (Low LR) ---")
        for param in self.model.parameters():
            param.requires_grad = True

        optimizer_full = optim.Adam(self.model.parameters(), lr=self.config.FINETUNE_FULL_LR)
        self.model.train()
        full_epochs = self.config.FINETUNE_EPOCHS - self.config.FINETUNE_HEAD_ONLY_EPOCHS
        for epoch in range(full_epochs):
            indices = np.random.permutation(len(y_data))
            pbar = tqdm(range(0, len(indices), self.config.BATCH_SIZE), desc=f"Fine-tune Full Epoch {epoch+1}/{full_epochs}")
            for i in pbar:
                batch_indices = indices[i:i+self.config.BATCH_SIZE]
                batch_before = [all_before_data[j] for j in batch_indices]
                batch_after = [all_after_data[j] for j in batch_indices]
                batch_labels = torch.FloatTensor(labels[batch_indices]).unsqueeze(1).to(self.device)
                optimizer_full.zero_grad()
                logits = self.model(batch_before, batch_after)
                loss = self.loss_fn(logits, batch_labels)
                loss.backward()
                optimizer_full.step()
                pbar.set_postfix(loss=loss.item())

class UnifiedTrainer:
    """Manages the full three-stage training pipeline."""
    def __init__(self, eca_config: dict, config: Config, device: str):
        self.eca_config = eca_config
        self.config = config
        self.device = device
        num_rules = len(eca_config.get('base', [])) + len(eca_config.get('composite', []))
        self.autoencoder = MDL_AU_Net_Autoencoder(config, num_rules=num_rules).to(device)

    def run(self, X_train: pd.DataFrame, y_train: pd.Series) -> StructuralBreakClassifier:
        # Stage 1
        stage1_trainer = MDLPreTrainer(self.autoencoder, self.eca_config, self.config, self.device)
        stage1_trainer.run()
        # Stage 1.5
        stage1_5_trainer = EmbeddingPreTrainer(self.autoencoder, self.eca_config, self.config, self.device)
        stage1_5_trainer.run()
        # Stage 2
        classifier = StructuralBreakClassifier(self.autoencoder, self.config).to(device)
        stage2_trainer = BreakClassifierFinetuner(classifier, self.config, self.device)
        stage2_trainer.run(X_train, y_train)
        return classifier

In [None]:
# ==============================================================================
# @title CELL 4: Main Platform Entrypoints (`train` and `infer`)
# ==============================================================================

def train(X_train: pd.DataFrame, y_train: pd.Series, model_directory_path: str, debug_mode: bool = False):
    """
    Main training orchestrator that uses the UnifiedTrainer.
    """
    if debug_mode:
        print("\n" + "!"*60 + "\n!!! RUNNING IN FAST DEBUG MODE - NOT FOR PRODUCTION !!!\n" + "!"*60 + "\n")
        active_config = FastDebugConfig()
    else:
        active_config = Config()

    seed_everything(active_config.SEED)
    active_device = "cuda" if torch.cuda.is_available() else "cpu"
    print(f"Starting unified training on device: {active_device}")

    # Use a small subset of data for debugging if no data is provided
    # This simulates the platform behavior where data is loaded internally
    if X_train is None or y_train is None:
        print("Loading data from disk for local simulation...")
        X_train = pd.read_parquet("data/X_train.parquet")
        y_train = pd.read_parquet("data/y_train.parquet").squeeze()
        if debug_mode:
            print("Debug mode: Using a small subset of the data.")
            # Take 50 random series to ensure variety
            all_ids = X_train.index.get_level_values('id').unique()
            debug_ids = np.random.choice(all_ids, 50, replace=False)
            X_train = X_train[X_train.index.get_level_values('id').isin(debug_ids)]
            y_train = y_train.loc[debug_ids]

    print(f"Fine-tuning data ready. X={X_train.shape}, y={y_train.shape}")

    # Define the ECA rule configuration
    eca_config = {'base': active_config.ECA_RULES_TO_USE, 'composite': [{'rules': [30, 110], 'timesteps': [10, 10]}]}

    # The UnifiedTrainer handles all three stages internally
    trainer = UnifiedTrainer(eca_config, active_config, active_device)
    final_model = trainer.run(X_train, y_train)

    # Save the final artifacts
    manager = ArtifactManager(Path(model_directory_path))
    manager.save(final_model, active_config)
    print("\n============================================================")
    print("                UNIFIED TRAINING PIPELINE FINISHED")
    print("============================================================")

def infer(X_test: typing.Iterable[pd.DataFrame], model_directory_path: str):
    """
    Main inference orchestrator.
    """
    device = "cuda" if torch.cuda.is_available() else "cpu"
    manager = ArtifactManager(Path(model_directory_path))
    model, loaded_config = manager.load_for_inference(device)

    symbolizer = PermutationSymbolizer(loaded_config.PERMUTATION_DIM, loaded_config.PERMUTATION_LAG)
    processor = SeriesProcessor(symbolizer, loaded_config.SERIES_PROCESSOR_SEQUENCE_LENGTH, loaded_config.SERIES_PROCESSOR_N_SEQUENCES_PER_SEGMENT)

    yield # Health check

    with torch.no_grad():
        for test_df in X_test:
            if test_df.empty:
                yield 0.5
                continue

            full_series = test_df['value']
            break_point = test_df['period'].iloc[0] - 1

            before_seqs = processor.process(full_series.iloc[:break_point])
            after_seqs = processor.process(full_series.iloc[break_point:])

            if not before_seqs or not after_seqs:
                yield 0.5
                continue

            logits = model([before_seqs], [after_seqs])
            score = torch.sigmoid(logits).item()
            yield score

In [None]:
# ==============================================================================
# @title FINAL TIER 3 VALIDATION (Simplified & Corrected)
# ==============================================================================

if __name__ == '__main__':
    # --- Create a standard config instance and then override it ---
    # This is the simplest and most robust way to create a test config.
    config = Config() # Instantiate the full config from Cell 1

    # Manually override attributes for a faster validation run
    print("--- Using validation config overrides for this run. ---")
    config.ECA_N_SAMPLES_PER_RULE = 20
    config.PRETRAIN_EPOCHS = 3
    config.EMBEDDING_PRETRAIN_EPOCHS = 3
    config.FINETUNE_HEAD_ONLY_EPOCHS = 3
    config.FINETUNE_EPOCHS = 5
    # -------------------------------------------------------------

    print("\n" + "="*50)
    print("      RUNNING TIER 3 VALIDATION (3-Stage & OOS Test)")
    print("="*50)

    seed_everything(config.SEED)

    # --- Create Mock Data & Split ---
    def create_mock_series_data(series_id, has_break=False, length=500, break_point=250):
        t = np.linspace(0, 10, length)
        noise = np.random.randn(length) * 0.1
        series_values = np.sin(t * 2 * np.pi) + noise
        if has_break:
            series_values[break_point:] = np.cos(t[break_point:] * 5 * np.pi) * 1.5 + noise[break_point:]
        data = []
        for time_step in range(length):
            data.append({
                'id': series_id, 'time': time_step, 'value': series_values[time_step],
                'period': break_point + 1, 'y': 1 if has_break else 0
            })
        return data

    mock_data = []
    for i in range(100):
        mock_data.extend(create_mock_series_data(series_id=f"series_{i}", has_break=(i % 2 == 0)))

    full_df = pd.DataFrame(mock_data).set_index(['id', 'time'])
    X_all = full_df[['value', 'period']]
    y_all = full_df.groupby('id')['y'].first()

    train_ids, test_ids = train_test_split(
        y_all.index, test_size=0.3, random_state=config.SEED, stratify=y_all
    )

    X_train = X_all[X_all.index.get_level_values('id').isin(train_ids)]
    y_train = y_all[y_all.index.isin(train_ids)]
    X_test = X_all[X_all.index.get_level_values('id').isin(test_ids)]
    y_test = y_all[y_all.index.isin(test_ids)]
    print(f"Training samples: {len(train_ids)}, Testing samples: {len(test_ids)}")

    # --- Run Training & Inference ---
    try:
        model_dir = Path("./final_model_validation")

        # This line will now work because Cell 1 defines `ECA_RULES_TO_USE` on the Config class.
        eca_config = {'base': config.ECA_RULES_TO_USE, 'composite': [{'rules': [30, 110], 'timesteps': [10, 10]}]}

        train(X_train, y_train, str(model_dir), eca_config=eca_config)

        # Re-format test data for the infer function
        test_iterable = [X_test.loc[test_id] for test_id in test_ids]
        predictions = list(infer(test_iterable, str(model_dir)))
        if isinstance(predictions[0], dict) and "health" in predictions[0]:
             predictions.pop(0)

        # Evaluate performance
        y_test_values = y_test.values
        auc_score = roc_auc_score(y_test_values, predictions)
        print("\n" + "="*20 + " FINAL RESULT " + "="*20)
        print(f"📈 Final Out-of-Sample ROC AUC Score: {auc_score:.4f}")
        print("="*54)

    except Exception as e:
        print(f"\nERROR during full validation run: {e}")
        import traceback
        traceback.print_exc()

---

# Submit your Notebook

To submit your work, you must:
1. Download your Notebook from Colab
2. Upload it to the platform
3. Create a run to validate it

### >> https://hub.crunchdao.com/competitions/structural-break/submit/notebook

![Download and Submit Notebook](https://raw.githubusercontent.com/crunchdao/competitions/refs/heads/master/documentation/animations/download-and-submit-notebook.gif)