# **AN2DL First Challenge: Preprocessing & Modeling**
> ## ***Ibuprofen*** **Team**
>
> **Team Members:**
> * Angelo Notarnicola (279710)
> * Daniele Piano (249385)
> * Luca Spreafico (303871)
> * Michele Leggieri (244615)
>
> This notebook contains the final pipeline that achieved our top score (0.9425). It implements the preprocessing, feature engineering, and the hybrid CNN-LSTM model architecture derived from our data analysis.

In [1]:
### 1.1. Environment Setup (Colab & Kaggle)
import os
import pandas as pd
import numpy as np
import torch
import torch.nn as nn
import torch.nn.functional as F
import torch.optim as optim
from torch.utils.data import Dataset, DataLoader, WeightedRandomSampler
from sklearn.model_selection import StratifiedKFold
from sklearn.preprocessing import StandardScaler, LabelEncoder
from sklearn.metrics import f1_score
import re
import warnings
import math
import random

# --- Reproducibility & Hardware ---
SEED = 42
random.seed(SEED)
np.random.seed(SEED)
torch.manual_seed(SEED)
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
if torch.cuda.is_available():
    torch.cuda.manual_seed_all(SEED)
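    # Note: benchmark mode speeds up cuDNN but may pick non-deterministic kernels,
    # so GPU runs are not guaranteed to be bit-for-bit reproducible.
    torch.backends.cudnn.benchmark = True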

# --- Suppress Warnings ---
warnings.filterwarnings('ignore')

# --- Environment Paths ---
DATASET_NAME = "piratt"  # Kaggle dataset name
COLAB_PATH = "/gdrive/My Drive/Colab Notebooks/[2025-2026] AN2DL/First Challenge - Data Analysis" # Adjust this path

if 'google.colab' in str(get_ipython()):
    print("Running on Google Colab...")
    from google.colab import drive
    drive.mount("/gdrive", force_remount=True)
    WORKING_DIR = COLAB_PATH
    INPUT_DIR = WORKING_DIR  # Assuming data is in the same Colab folder
    %cd $WORKING_DIR
    
elif 'KAGGLE_KERNEL_RUN_TYPE' in os.environ:
    print("Running on Kaggle...")
    WORKING_DIR = "/kaggle/working"
    # The "golden script" used /kaggle/input/pirate/
    # We will try a common path first, then fallback to that
    if os.path.exists(f"/kaggle/input/{DATASET_NAME}"):
        INPUT_DIR = f"/kaggle/input/{DATASET_NAME}"
    else:
        INPUT_DIR = "/kaggle/input/pirate" # Fallback to the golden script's path
    
else:
    print("Running locally...")
    WORKING_DIR = os.getcwd()
    INPUT_DIR = "./" # Adjust as needed

print(f"Using Device: {device}")
print(f"Working Directory: {WORKING_DIR}")
print(f"Input Directory: {INPUT_DIR}")

Running on Kaggle...
Using Device: cuda
Working Directory: /kaggle/working
Input Directory: /kaggle/input/piratt


## 1.2. Global Parameters & Design Choices

Based on the `data_analysis.ipynb` notebook, we define our global hyperparameters.

#### INSIGHT 1: CLASS IMBALANCE
* **Analysis:** Severe imbalance (`no_pain` ~77%, `high_pain` ~4.5%).
* **Action:** Implement `WeightedRandomSampler` (balance batches) and `FocalLoss` (focus on hard examples).
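
As a rough illustration of the sampling idea in Insight 1, the sketch below computes inverse-frequency weights in plain Python. The label list is hypothetical: it only mirrors the quoted proportions, and the `low_pain` share (18.5%) is inferred as the remainder rather than taken from the analysis.

```python
from collections import Counter

# Hypothetical per-sample labels mirroring the quoted proportions
# (77% no_pain, 4.5% high_pain; low_pain is assumed to be the remainder).
labels = ["no_pain"] * 154 + ["low_pain"] * 37 + ["high_pain"] * 9

counts = Counter(labels)

# Inverse-frequency weight per class: rarer classes get larger weights.
class_weight = {cls: 1.0 / n for cls, n in counts.items()}

# Each sample inherits the weight of its class.
sample_weights = [class_weight[lbl] for lbl in labels]

# Expected share of each class when drawing with these weights.
total = sum(sample_weights)
expected_share = {cls: counts[cls] * class_weight[cls] / total for cls in counts}
print(expected_share)
```

With weights of this form, each class is expected to fill about one third of a batch, regardless of how rare it is in the raw data.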

#### INSIGHT 2: STATIC FEATURES RELIABILITY
* **Analysis:** Static features (`n_legs`, `n_hands`, `n_eyes`) showed spurious correlations (e.g., <1% 'pirates' in train/test).
* **Action:** **Drop all static features**. They are noise. `STATIC_COLS` is empty.

#### INSIGHT 3: TEMPORAL DYNAMICS & WINDOWING
* **Analysis:** Autocorrelation (ACF) plots showed a signal "memory" of ~30-40 time steps.
* **Action:** Set **`WINDOW_SIZE = 40`** to capture these long-term patterns.
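
One way to read a "memory" length off an autocorrelation function is to look for the first lag at which the ACF drops below some cutoff. The sketch below is a toy version of that idea, not the analysis from `data_analysis.ipynb`: the sine signal, the 0.2 cutoff, and the helper names `autocorr` and `first_lag_below` are all assumptions made for illustration.

```python
import math

def autocorr(x, lag):
    """Sample autocorrelation of x at the given lag."""
    n = len(x)
    mean = sum(x) / n
    denom = sum((v - mean) ** 2 for v in x)
    num = sum((x[t] - mean) * (x[t + lag] - mean) for t in range(n - lag))
    return num / denom

def first_lag_below(x, threshold, max_lag):
    """First lag whose autocorrelation falls below the threshold, or None."""
    for lag in range(1, max_lag + 1):
        if autocorr(x, lag) < threshold:
            return lag
    return None

# Toy signal (hypothetical): a sine wave with a period of 20 steps.
signal = [math.sin(2 * math.pi * t / 20) for t in range(200)]
memory = first_lag_below(signal, threshold=0.2, max_lag=60)
print(memory)
```

Applied to a real joint signal, the lag returned by a helper like this gives a starting point for the window length; the cutoff itself remains a judgment call.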

In [2]:
# 1. Columns definition
JOINT_COLS = [f'joint_{i:02d}' for i in range(30)] # joint_30 is excluded
SURVEY_COLS = ['pain_survey_1', 'pain_survey_2', 'pain_survey_3', 'pain_survey_4']
STATIC_COLS = []  # Dropped based on EDA
TIME_COL = 'time'

# 2. Hyperparameters
WINDOW_SIZE = 40
STRIDE = 10
BATCH_SIZE = 64
LEARNING_RATE = 1e-3
WEIGHT_DECAY = 1e-4
EPOCHS = 200
GRADIENT_CLIP_VALUE = 1.0
K_FOLDS = 5
LABEL_SMOOTHING = 0.1
EARLY_STOPPING_PATIENCE = 20

## 2. Data Loading & Initial Cleaning

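We load the raw data and perform initial cleaning: the first string-typed column is normalized and label-encoded so that the model can learn an embedding for it. The code was written with a text-based 'Team Name' column in mind; note that in the recorded run below the detected column was `n_legs`.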

In [3]:
print("--- 2. Loading & Initial Cleaning ---")
try:
    df_features_raw = pd.read_csv(os.path.join(INPUT_DIR, 'pirate_pain_train.csv'))
    df_labels_raw = pd.read_csv(os.path.join(INPUT_DIR, 'pirate_pain_train_labels.csv'))
    df_test_raw = pd.read_csv(os.path.join(INPUT_DIR, 'pirate_pain_test.csv'))
    sample_sub_df = pd.read_csv(os.path.join(INPUT_DIR, 'sample_submission.csv'))
    print("Data loaded successfully.")
except Exception as e:
    print(f"Error loading data: {e}")
    print("Please check INPUT_DIR path.")
    raise  # Stop here: the cells below cannot run without the data

# --- Text Column (Team Name) Handling ---
# We treat the Team Name as a categorical entity via Label Encoding
exclude_cols = ['label', 'sample_index']
string_cols = df_features_raw.select_dtypes(include=['object']).columns.tolist()
string_cols = [c for c in string_cols if c not in exclude_cols]

TEXT_COL = None
TEXT_VOCAB_SIZE = 0

if len(string_cols) > 0:
    TEXT_COL = string_cols[0] 
    print(f"Found text column: {TEXT_COL}")
    
    def clean_team_name(text):
        if pd.isna(text): return "unknown"
        return re.sub(r'[^a-z0-9]', '', str(text).lower())

    df_features_raw[TEXT_COL] = df_features_raw[TEXT_COL].apply(clean_team_name)
    df_test_raw[TEXT_COL] = df_test_raw[TEXT_COL].apply(clean_team_name)
    
    le_text = LabelEncoder()
    all_text = pd.concat([df_features_raw[TEXT_COL], df_test_raw[TEXT_COL]], axis=0)
    le_text.fit(all_text)
    
    df_features_raw[TEXT_COL] = le_text.transform(df_features_raw[TEXT_COL])
    df_test_raw[TEXT_COL] = le_text.transform(df_test_raw[TEXT_COL])
    
    TEXT_VOCAB_SIZE = len(le_text.classes_)
else:
    print("No text column (Team Name) found.")

--- 2. Loading & Initial Cleaning ---
Data loaded successfully.
Found text column: n_legs


## 3. Feature Engineering

We transform the raw data into model-ready features based on our analysis.
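
The two main transformations can be sketched on a single toy sequence. This is a simplified, dependency-free version: the joint values are hypothetical, and the notebook itself computes the difference per `sample_index` group with pandas.

```python
import math

MAX_TIME = 160  # period used by the notebook's cyclic encoding

def delta(series):
    """First difference; the first step has no predecessor, so it is set to 0."""
    return [0.0] + [b - a for a, b in zip(series, series[1:])]

def cyclic_time(t, period=MAX_TIME):
    """Map a time step to a point on the unit circle."""
    angle = 2 * math.pi * t / period
    return math.sin(angle), math.cos(angle)

# Hypothetical readings of one joint over five time steps.
joint = [0.50, 0.52, 0.55, 0.55, 0.51]
velocity = delta(joint)
print(velocity)

for t in (0, 40, 80):
    print(t, cyclic_time(t))
```

The delta keeps the sequence length unchanged, which is why the continuous feature set can simply concatenate raw joints and their velocities.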

In [4]:
def engineer_features(df):
    """
    Transforms raw data into model-ready features.
    """
    df_eng = df.copy()
    grouped = df_eng.groupby('sample_index')
    
    # 1. DELTA FEATURES (Velocity)
    # Raw joint data is non-stationary. We compute the first difference (velocity)
    # to help the model learn motion patterns instead of absolute positions.
    for col in JOINT_COLS:
        df_eng[f'd_{col}'] = grouped[col].diff().fillna(0)
    
    # 2. CYCLIC TIME ENCODING
    # 'time' is encoded cyclically to help the model understand
    # the start vs. end of a sequence without linear bias.
    max_time_val = 160 # Hardcoded to 160 (max steps)
    df_eng['sin_time'] = np.sin(2 * np.pi * df_eng[TIME_COL] / max_time_val)
    df_eng['cos_time'] = np.cos(2 * np.pi * df_eng[TIME_COL] / max_time_val)

    # 3. DROP CONSTANT FEATURE
    # 'joint_30' was found to be constant (0.5), so it carries no information.
    if 'joint_30' in df_eng.columns:
        df_eng = df_eng.drop(columns=['joint_30'])
        
    return df_eng

print("--- 3. Applying Feature Engineering ---")
df_features_engineered = engineer_features(df_features_raw)
df_test_engineered = engineer_features(df_test_raw)

# --- Define final feature sets ---
DELTA_JOINT_COLS = [f'd_{col}' for col in JOINT_COLS]
CONTINUOUS_COLS = JOINT_COLS + DELTA_JOINT_COLS + ['sin_time', 'cos_time']
print(f"Total continuous features: {len(CONTINUOUS_COLS)}")

# --- Prepare Categorical Vocabularies for Embeddings ---
# Pain surveys are categorical (0,1,2), not continuous scalars.
survey_vocab_sizes = [int(df_features_engineered[c].max() + 1) for c in SURVEY_COLS]
time_vocab_size = int(df_features_engineered[TIME_COL].max() + 1)

# --- Map Targets ---
label_mapping = {'no_pain': 0, 'low_pain': 1, 'high_pain': 2}
df_labels_raw['label_encoded'] = df_labels_raw['label'].map(label_mapping)

print("Feature engineering complete.")

--- 3. Applying Feature Engineering ---
Total continuous features: 62
Feature engineering complete.


## 4. Dataset & Sampling Strategy

We define the custom `Dataset` class to handle windowing and a `WeightedRandomSampler` to address class imbalance.
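
To make the windowing concrete, the sketch below lists the `(start, end)` bounds produced for one sequence, using the same `range` expression as the dataset class. The helper name `window_bounds` is hypothetical, and the 160-step length is an assumption based on the maximum number of steps mentioned in the feature engineering cell.

```python
def window_bounds(n_timesteps, window_size=40, stride=10):
    """Return (start, end) pairs for every full window in a sequence."""
    return [
        (start, start + window_size)
        for start in range(0, n_timesteps - window_size + 1, stride)
    ]

bounds = window_bounds(160)
print(len(bounds))            # 13 windows for a 160-step sequence
print(bounds[0], bounds[-1])  # (0, 40) (120, 160)

# A sequence shorter than the window yields no windows at all.
print(window_bounds(39))      # []
```

Because consecutive windows overlap by 30 steps, each sample contributes several correlated training examples, which is one reason predictions are later aggregated per sample.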

In [5]:
class PiratePainDataset(Dataset):
    """
    Custom Dataset to handle windowing of time-series data.
    - Applies windowing (size 40, stride 10).
    - Separates Continuous inputs (for Scaler) from Categorical inputs (for Embeddings).
    - Applies Gaussian noise augmentation during training.
    """
    def __init__(self, features_df, labels_df, sample_indices, window_size, stride, text_col=None, augment=False):
        self.features_df = features_df
        self.labels_df = labels_df.set_index('sample_index') if labels_df is not None else None
        self.sample_indices = sample_indices
        self.window_size = window_size
        self.stride = stride
        self.text_col = text_col
        self.augment = augment 
        
        # Grouping for O(1) access
        self.grouped_features = dict(tuple(features_df.groupby('sample_index')))
        self.indices = self._create_indices()

    def _create_indices(self):
        # Creates a list of valid (sample_idx, start, end) tuples
        indices = []
        for sample_idx in self.sample_indices:
            if sample_idx not in self.grouped_features: continue
            data = self.grouped_features[sample_idx]
            n_timesteps = len(data)
            # Create windows
            for start in range(0, n_timesteps - self.window_size + 1, self.stride):
                indices.append((sample_idx, start, start + self.window_size))
        return indices

    def __len__(self):
        return len(self.indices)

    def __getitem__(self, idx):
        sample_idx, start, end = self.indices[idx]
        window_data = self.grouped_features[sample_idx].iloc[start:end]

        # 1. Continuous Data + Gaussian Noise Jittering (Augmentation)
        vals = window_data[CONTINUOUS_COLS].values
        if self.augment:
            noise = np.random.normal(0, 0.02, vals.shape) 
            vals = vals + noise
        x_cont = torch.tensor(vals, dtype=torch.float)
        
        # 2. Categorical Data (Surveys + Time)
        # We add +1 to reserve 0 for padding/unknown
        x_survey = torch.tensor((window_data[SURVEY_COLS].values + 1), dtype=torch.long)
        x_time = torch.tensor((window_data[TIME_COL].values + 1), dtype=torch.long)
        
        # 3. Text Data
        x_text = torch.tensor(0, dtype=torch.long)
        if self.text_col:
            # Text is static, so we take the first value
            val = window_data[self.text_col].iloc[0]
            # Fix: bug corrected, original logic restored
            x_text = torch.tensor(val, dtype=torch.long) 
            # End of fix

        # 4. Target
        label = torch.tensor(-1, dtype=torch.long) # For test set
        if self.labels_df is not None:
            label = torch.tensor(self.labels_df.loc[sample_idx, 'label_encoded'], dtype=torch.long)

        return x_cont, x_survey, x_time, x_text, label

In [6]:
def get_weighted_sampler(dataset, labels_df):
    """
    Creates a WeightedRandomSampler to handle class imbalance.
    Rare classes are sampled more frequently to balance the batches.
    """
    sample_to_label = labels_df.set_index('sample_index')['label_encoded'].to_dict()
    label_counts = labels_df['label_encoded'].value_counts().sort_index()
    
    # Calculate weights: 1 / (count)
    class_weights = 1.0 / label_counts
    
    weights = []
    for idx_tuple in dataset.indices:
        s_idx = idx_tuple[0] # Get sample_index from the window tuple
        if s_idx in sample_to_label:
            l = sample_to_label[s_idx]
            weights.append(class_weights[l])
        else:
            weights.append(0) # Should not happen in training
            
    return WeightedRandomSampler(weights, num_samples=len(weights), replacement=True)

## 5. Loss Function

We use **Focal Loss** to combat class imbalance. It multiplies the standard cross-entropy by a modulating factor $(1 - p_t)^\gamma$, where $p_t$ is the predicted probability of the true class. Well-classified examples (mostly the `no_pain` majority) are down-weighted, so training concentrates on hard or misclassified examples, which often belong to the minority classes. We also use **Label Smoothing** to discourage overconfident predictions.
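
A small worked example may help show the effect of the focal term. It evaluates the per-example loss $-(1 - p_t)^\gamma \log(p_t)$ for one easy and one hard example; it deliberately ignores label smoothing and class weights, and the probabilities 0.9 and 0.1 are arbitrary illustrative values.

```python
import math

def cross_entropy(p_t):
    """Cross-entropy for one example, given the true-class probability."""
    return -math.log(p_t)

def focal_loss(p_t, gamma=2.0):
    """Focal loss for one example: cross-entropy scaled by (1 - p_t)^gamma."""
    return (1.0 - p_t) ** gamma * cross_entropy(p_t)

easy, hard = 0.9, 0.1  # illustrative true-class probabilities
for p in (easy, hard):
    print(f"p_t={p}: CE={cross_entropy(p):.4f}, focal={focal_loss(p):.4f}")

# How much more the hard example weighs than the easy one under each loss.
ce_ratio = cross_entropy(hard) / cross_entropy(easy)
focal_ratio = focal_loss(hard) / focal_loss(easy)
print(f"CE ratio: {ce_ratio:.1f}, focal ratio: {focal_ratio:.1f}")
```

With $\gamma = 2$ the hard example outweighs the easy one by a factor of roughly 1770, compared with roughly 22 under plain cross-entropy. Setting `gamma=0` recovers the cross-entropy.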

In [7]:
class FocalLoss(nn.Module):
    """
    Focal Loss with Label Smoothing.
    - Focal term (gamma): Focuses learning on hard-to-classify examples.
    - Label Smoothing: Prevents over-confidence (e.g., 1.0 probability).
    """
    def __init__(self, alpha=None, gamma=2.0, reduction='mean', label_smoothing=0.0):
        super(FocalLoss, self).__init__()
        self.alpha = alpha
        self.gamma = gamma
        self.reduction = reduction
        self.label_smoothing = label_smoothing

    def forward(self, inputs, targets):
        ce_loss = F.cross_entropy(
            inputs, targets, reduction='none', weight=self.alpha, 
            label_smoothing=self.label_smoothing
        )
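        # exp(-CE) equals the true-class probability p_t only without class
        # weights and label smoothing; otherwise it is an approximation.
        pt = torch.exp(-ce_loss)
        focal_loss = ((1 - pt) ** self.gamma) * ce_loss

        if self.reduction == 'mean':
            return focal_loss.mean()
        if self.reduction == 'sum':
            return focal_loss.sum()
        return focal_loss  # reduction='none': per-example losses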

## 6. Model Architecture (Hybrid CNN-LSTM)

This is the final model that achieved the `0.9425` score.

* **Rationale:** A hybrid approach to capture both local patterns and long-term dependencies.
* **Embeddings:** All categorical inputs (`surveys`, `time`, and the detected text column) are embedded.
* **CNN Block:** A `Conv1d` layer (with `BatchNorm1d` for normalization) acts as a learnable feature extractor. It scans the sequence for salient *local patterns* (e.g., spikes, tremors).
* **LSTM Block:** A 2-layer `LSTM` processes the *sequence of features* extracted by the CNN, allowing it to model long-term temporal dependencies between these patterns.
* **Classifier:** A final `Linear` layer classifies the last hidden state of the LSTM.
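
The tensor shapes through the network can be checked with simple arithmetic. The sketch below assumes the values from the recorded run (62 continuous features, 4 surveys with 4-dimensional embeddings, 8-dimensional time and text embeddings, windows of 40 steps) and uses the standard output-length formula for a 1-D convolution; the helper `conv1d_out_len` is hypothetical.

```python
def conv1d_out_len(length, kernel_size, padding=0, stride=1, dilation=1):
    """Output length of a 1-D convolution over a sequence of the given length."""
    return (length + 2 * padding - dilation * (kernel_size - 1) - 1) // stride + 1

# Per-time-step feature width fed to the first Conv1d layer.
n_continuous = 62                  # from the feature engineering output
n_surveys, survey_emb_dim = 4, 4
time_emb_dim = 8
text_emb_dim = 8                   # present because a text column was detected
input_dim = n_continuous + n_surveys * survey_emb_dim + time_emb_dim + text_emb_dim

seq_len = 40                       # WINDOW_SIZE
after_conv1 = conv1d_out_len(seq_len, kernel_size=3, padding=1)
after_conv2 = conv1d_out_len(after_conv1, kernel_size=3, padding=1)

print(f"input to CNN : (B, {input_dim}, {seq_len})")
print(f"after conv 1 : (B, 64, {after_conv1})")
print(f"after conv 2 : (B, 128, {after_conv2})")
print(f"input to LSTM: (B, {after_conv2}, 128)")
print("classifier in: (B, 128)  # last time step of the LSTM output")
```

Because `kernel_size=3` with `padding=1` preserves the sequence length, the LSTM still sees one feature vector per original time step.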

In [8]:
class PiratePainModel(nn.Module):
    def __init__(self, n_continuous, survey_vocab_sizes, time_vocab_size, text_vocab_size, lstm_hidden=128, n_classes=3):
        super().__init__()
        
        # 1. EMBEDDING LAYERS (For Categorical Inputs)
        # We add +2 to vocab size: +1 for the 0-padding, +1 for potential OOV
        self.emb_surveys = nn.ModuleList([nn.Embedding(v+2, 4) for v in survey_vocab_sizes])
        self.emb_time = nn.Embedding(time_vocab_size+2, 8)
        
        self.use_text = (text_vocab_size > 0)
        text_dim = 8 if self.use_text else 0
        if self.use_text:
            self.emb_text = nn.Embedding(text_vocab_size+2, 8)
            
        # Calculate total input dimension for the CNN
        total_survey_dim = len(survey_vocab_sizes) * 4
        input_dim = n_continuous + total_survey_dim + 8 + text_dim
        
        # 2. CNN BLOCK (Local Feature Extraction)
        self.cnn = nn.Sequential(
            nn.Conv1d(in_channels=input_dim, out_channels=64, kernel_size=3, padding=1),
            nn.BatchNorm1d(64),
            nn.ReLU(),
            nn.Dropout(0.2),
            nn.Conv1d(in_channels=64, out_channels=128, kernel_size=3, padding=1),
            nn.BatchNorm1d(128),
            nn.ReLU(),
            nn.Dropout(0.2)
        )
        
        # 3. LSTM BLOCK (Long-Term Memory)
        self.lstm = nn.LSTM(
            input_size=128, # Input is the output channels from CNN
            hidden_size=lstm_hidden, 
            num_layers=2, 
            batch_first=True, 
            dropout=0.3 # Dropout between LSTM layers
        )
        
        # 4. CLASSIFIER
        self.dropout = nn.Dropout(0.3)
        self.classifier = nn.Linear(lstm_hidden, n_classes)

    def forward(self, x_cont, x_survey, x_time, x_text):
        batch_size, seq_len, _ = x_cont.shape
        
        # Process Embeddings
        e_surv = [emb(x_survey[:,:,i]) for i, emb in enumerate(self.emb_surveys)]
        e_time = self.emb_time(x_time)
        
        # Concatenate all features
        features = [x_cont] + e_surv + [e_time]
        if self.use_text:
            # Repeat static text embedding across all time steps
            e_txt = self.emb_text(x_text).unsqueeze(1).repeat(1, seq_len, 1)
            features.append(e_txt)
            
        full_input = torch.cat(features, dim=2) # Shape: (B, Seq, Feat)
        
        # CNN Pass
        # Conv1d expects (B, Channels, Seq)
        x = full_input.permute(0, 2, 1)
        x = self.cnn(x)
        
        # LSTM Pass
        # LSTM expects (B, Seq, Channels)
        x = x.permute(0, 2, 1)
        out, _ = self.lstm(x)
        
        # Final Classification
        # We use the output of the last time step
        last_hidden_state = out[:, -1, :]
        logits = self.classifier(self.dropout(last_hidden_state))
        
        return logits

## 7. Training & Validation Strategy

We use 5-fold `StratifiedKFold` cross-validation. The split is made at the sample level, so all windows of a given sample fall in the same fold.

* **Per-Fold Scaling:** The `StandardScaler` is fit *only* on the training data for each fold to prevent data leakage.
* **Ensemble:** The 5 models (one from each fold) are saved and used as an ensemble during inference.
* **OOF Predictions:** We store the validation predictions (Out-of-Fold) from the *best epoch* of each fold.
* **Window Aggregation:** Since we predict on windows, we aggregate window-level logits for a single sample by taking their `mean` (Soft Voting).
* **Thresholding:** The OOF predictions are used to find the optimal probability thresholds to maximize the F1-score, as `argmax` is suboptimal for imbalanced classes.
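
The window aggregation step can be sketched without pandas as follows. The sample ids and logits are made up for illustration, and the helper names `softmax` and `aggregate_windows` are hypothetical; the notebook performs the same averaging with a `groupby('sample_index').mean()`.

```python
import math
from collections import defaultdict

def softmax(logits):
    """Numerically stable softmax over a list of logits."""
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]

def aggregate_windows(sample_ids, window_logits):
    """Average window-level logits for each sample."""
    grouped = defaultdict(list)
    for sid, logits in zip(sample_ids, window_logits):
        grouped[sid].append(logits)
    return {
        sid: [sum(col) / len(rows) for col in zip(*rows)]
        for sid, rows in grouped.items()
    }

# Hypothetical output: three windows for sample 7, one window for sample 9.
sample_ids = [7, 7, 7, 9]
window_logits = [[2.0, 0.0, -1.0], [1.0, 1.0, -1.0], [3.0, -1.0, -1.0], [0.0, 0.0, 3.0]]

agg = aggregate_windows(sample_ids, window_logits)
probs = {sid: softmax(logits) for sid, logits in agg.items()}
print(agg)
print(probs)
```

Averaging logits before the softmax means a single very confident window can dominate the sample-level prediction more than it would if probabilities were averaged instead.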

In [9]:
def train_epoch(model, loader, optimizer, criterion):
    model.train()
    total_loss = 0
    for xc, xs, xt, xtxt, y in loader:
        xc, xs, xt, xtxt, y = xc.to(device), xs.to(device), xt.to(device), xtxt.to(device), y.to(device)
        
        optimizer.zero_grad()
        logits = model(xc, xs, xt, xtxt)
        loss = criterion(logits, y)
        loss.backward()
        
        # Gradient Clipping (Prevents exploding gradients)
        torch.nn.utils.clip_grad_norm_(model.parameters(), GRADIENT_CLIP_VALUE)
        optimizer.step()
        
        total_loss += loss.item()
    return total_loss / len(loader)

In [10]:
# --- MAIN EXECUTION: K-FOLD ENSEMBLE ---
print("\n--- 7. Starting Stratified K-Fold Training (Ensemble) ---")

all_sample_indices = df_labels_raw['sample_index'].unique()
all_labels_strat = df_labels_raw.set_index('sample_index').loc[all_sample_indices]['label_encoded'].values

# Storage for Out-Of-Fold (OOF) predictions and models
oof_probs = np.zeros((len(all_sample_indices), 3))
oof_targets = np.zeros(len(all_sample_indices))
sample_to_idx = {s: i for i, s in enumerate(all_sample_indices)}
models_list = [] 

skf = StratifiedKFold(n_splits=K_FOLDS, shuffle=True, random_state=SEED)

for fold, (train_idx, val_idx) in enumerate(skf.split(all_sample_indices, all_labels_strat)):
    print(f"\n--- Fold {fold+1}/{K_FOLDS} ---")
    
    train_samples = all_sample_indices[train_idx]
    val_samples = all_sample_indices[val_idx]
    
    # 1. Standard Scaling (fitted ONLY on this fold's training data)
    scaler = StandardScaler()
    train_subset = df_features_engineered[df_features_engineered['sample_index'].isin(train_samples)]
    scaler.fit(train_subset[CONTINUOUS_COLS])
    
    # Apply scaler to a copy of the full dataset for this fold
    df_fold = df_features_engineered.copy()
    df_fold[CONTINUOUS_COLS] = scaler.transform(df_fold[CONTINUOUS_COLS])
    
    # 2. Datasets & Loaders
    train_ds = PiratePainDataset(df_fold, df_labels_raw, train_samples, WINDOW_SIZE, STRIDE, TEXT_COL, augment=True)
    val_ds = PiratePainDataset(df_fold, df_labels_raw, val_samples, WINDOW_SIZE, STRIDE, TEXT_COL, augment=False)
    
    # Weighted Sampler for Training
    sampler = get_weighted_sampler(train_ds, df_labels_raw)
    train_loader = DataLoader(train_ds, batch_size=BATCH_SIZE, sampler=sampler, shuffle=False, drop_last=True)
    val_loader = DataLoader(val_ds, batch_size=BATCH_SIZE, shuffle=False)
    
    # 3. Model Initialization
    model = PiratePainModel(
        n_continuous=len(CONTINUOUS_COLS), 
        survey_vocab_sizes=survey_vocab_sizes, 
        time_vocab_size=time_vocab_size,
        text_vocab_size=TEXT_VOCAB_SIZE
    ).to(device)
    
    optimizer = optim.AdamW(model.parameters(), lr=LEARNING_RATE, weight_decay=WEIGHT_DECAY)
    criterion = FocalLoss(alpha=None, gamma=2.0, label_smoothing=LABEL_SMOOTHING)

    # Early-stopping
    best_v_f1 = 0
    best_epoch = 0
    patience_counter = 0
    # Define a path for saving this fold's best model
    model_path = os.path.join(WORKING_DIR, f"model_fold_{fold+1}_best.pt")
    
    # 4. Epoch Loop
    for ep in range(EPOCHS):
        t_loss = train_epoch(model, train_loader, optimizer, criterion)
        
        # Validation & OOF Prediction
        model.eval()
        val_logits_list = []
        window_sample_map_val = [x[0] for x in val_ds.indices] # Map window back to sample
        
        with torch.no_grad():
            for xc, xs, xt, xtxt, y in val_loader:
                xc, xs, xt, xtxt = xc.to(device), xs.to(device), xt.to(device), xtxt.to(device)
                logits = model(xc, xs, xt, xtxt)
                val_logits_list.extend(logits.cpu().numpy())
        
        # 5. Window Aggregation (Soft Voting)
        df_val_logits = pd.DataFrame(val_logits_list, columns=[0, 1, 2])
        df_val_logits['sample_index'] = window_sample_map_val
        df_val_agg_logits = df_val_logits.groupby('sample_index').mean()
        
        # Calculate F1 on aggregated predictions
        current_val_probs = torch.softmax(torch.tensor(df_val_agg_logits.values), dim=1).numpy()
        current_val_preds = np.argmax(current_val_probs, axis=1)
        
        current_val_indices = df_val_agg_logits.index
        current_val_labels = df_labels_raw.set_index('sample_index').loc[current_val_indices]['label_encoded'].values
        
        v_f1 = f1_score(current_val_labels, current_val_preds, average='weighted')
        
        if (ep + 1) % 10 == 0:
            print(f"Epoch {ep+1}/{EPOCHS}, Train Loss: {t_loss:.4f}, Val F1: {v_f1:.4f}, Patience: {patience_counter}/{EARLY_STOPPING_PATIENCE}")
            
        # 6. Model Checkpoint & Early Stopping (as in the lab)
        if v_f1 > best_v_f1:
            best_v_f1 = v_f1
            best_epoch = ep + 1
            patience_counter = 0 # Reset patience
            # Save the best model to disk
            torch.save(model.state_dict(), model_path)
            
            # Store OOF Probabilities for Threshold Optimization
            for idx, s_idx in enumerate(current_val_indices):
                global_idx = sample_to_idx[s_idx]
                oof_probs[global_idx] = current_val_probs[idx]
                oof_targets[global_idx] = current_val_labels[idx]
        else:
            patience_counter += 1 # Increment patience
            
        if patience_counter >= EARLY_STOPPING_PATIENCE:
            print(f"--- Early stopping triggered at epoch {ep + 1} ---")
            break

    print(f"Fold {fold+1} Best Val F1: {best_v_f1:.4f} at epoch {best_epoch}")
    
    print(f"Loading best model from: {model_path}")
    model.load_state_dict(torch.load(model_path))
    models_list.append(model)


--- 7. Starting Stratified K-Fold Training (Ensemble) ---

--- Fold 1/5 ---
Epoch 10/200, Train Loss: 0.0394, Val F1: 0.9424, Patience: 0/20
Epoch 20/200, Train Loss: 0.0315, Val F1: 0.9544, Patience: 0/20
Epoch 30/200, Train Loss: 0.0284, Val F1: 0.9467, Patience: 4/20
Epoch 40/200, Train Loss: 0.0275, Val F1: 0.9513, Patience: 14/20
--- Early stopping triggered at epoch 45 ---
Fold 1 Best Val F1: 0.9700 at epoch 25
Loading best model from: /kaggle/working/model_fold_1_best.pt

--- Fold 2/5 ---
Epoch 10/200, Train Loss: 0.0409, Val F1: 0.9313, Patience: 0/20
Epoch 20/200, Train Loss: 0.0332, Val F1: 0.9398, Patience: 1/20
Epoch 30/200, Train Loss: 0.0246, Val F1: 0.9255, Patience: 8/20
Epoch 40/200, Train Loss: 0.0242, Val F1: 0.9398, Patience: 18/20
--- Early stopping triggered at epoch 41 ---
Fold 2 Best Val F1: 0.9614 at epoch 21
Loading best model from: /kaggle/working/model_fold_2_best.pt

--- Fold 3/5 ---
Epoch 10/200, Train Loss: 0.0391, Val F1: 0.9179, Patience: 0/20
Epoch 20

## 8. Threshold Optimization

We use the OOF (Out-of-Fold) probabilities collected during training to find the optimal decision thresholds. This matters for imbalanced classification: picking the most probable class (`argmax`) does not necessarily maximize the F1-score, because a minority class can be the correct answer even when its probability is not the largest.
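
The decision rule that the thresholds feed into can be written as a small function. The sketch below uses the thresholds from the recorded run (0.43 and 0.49); the probability vectors are hypothetical, and `apply_thresholds` is an illustrative helper rather than a function defined in the notebook.

```python
def apply_thresholds(probs, t_low, t_high):
    """Cascade rule: check high_pain first, then low_pain, else no_pain."""
    if probs[2] > t_high:
        return 2  # high_pain
    if probs[1] > t_low:
        return 1  # low_pain
    return 0      # no_pain

def argmax(probs):
    return max(range(len(probs)), key=lambda i: probs[i])

T_LOW, T_HIGH = 0.43, 0.49  # values reported by the recorded run

# Hypothetical probabilities (no_pain, low_pain, high_pain).
borderline = (0.50, 0.44, 0.06)
print(argmax(borderline), apply_thresholds(borderline, T_LOW, T_HIGH))

clear_high = (0.30, 0.20, 0.50)
print(argmax(clear_high), apply_thresholds(clear_high, T_LOW, T_HIGH))
```

For the borderline case, `argmax` picks `no_pain` while the threshold rule picks `low_pain`, which is exactly the kind of shift toward minority classes that the search is looking for.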

In [11]:
print("\n--- 8. Optimizing Decision Thresholds on OOF Data ---")

best_thresh = (0.0, 0.0)
best_score = 0.0

# Search space for thresholds
for t_high in np.arange(0.15, 0.50, 0.01):
    for t_low in np.arange(0.20, 0.55, 0.01):
        if t_low >= t_high: continue # Ensure low < high
        
        preds = []
        for p in oof_probs:
            # Apply thresholds
            if p[2] > t_high: preds.append(2)     # high_pain
            elif p[1] > t_low: preds.append(1)    # low_pain
            else: preds.append(0)                 # no_pain
            
        s = f1_score(oof_targets, preds, average='weighted')
        if s > best_score:
            best_score = s
            best_thresh = (t_low, t_high)

print(f"Optimal Thresholds Found: Low > {best_thresh[0]:.2f}, High > {best_thresh[1]:.2f}")
print(f"Best OOF F1 Score: {best_score:.4f}")


--- 8. Optimizing Decision Thresholds on OOF Data ---
Optimal Thresholds Found: Low > 0.43, High > 0.49
Best OOF F1 Score: 0.9574


## 9. Final Inference & Submission

We now generate the final `submission.csv` file.

1.  A final `StandardScaler` is fit on **all** training data.
2.  The test data is processed through the *full* pipeline (FE, Scaling, Windowing).
3.  We perform inference with **all 5 models** in our ensemble.
4.  The logits from the 5 models are **averaged** (Ensemble Soft Voting).
5.  The final averaged probabilities are converted to predictions using our **optimized thresholds**.
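
Step 1 relies on the test data being standardized with statistics learned from the training data only. A minimal sketch of that idea for a single feature is shown below; the numbers are hypothetical, and the population standard deviation is used because that is what `StandardScaler` computes.

```python
import math
import statistics

def fit_scaler(values):
    """Learn the mean and population standard deviation of one feature."""
    return statistics.fmean(values), statistics.pstdev(values)

def transform(values, mean, std):
    """Standardize values using previously fitted statistics."""
    return [(v - mean) / std for v in values]

# Hypothetical values of one continuous feature.
train_values = [1.0, 2.0, 3.0, 4.0, 5.0]
test_values = [3.0, 6.0]

mean, std = fit_scaler(train_values)             # fitted on training data only
scaled_train = transform(train_values, mean, std)
scaled_test = transform(test_values, mean, std)  # reuses the training statistics
print(scaled_train)
print(scaled_test)
```

The scaled test values need not have zero mean or unit variance; what matters is that they are expressed in the same units the models saw during training.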

In [12]:
print("\n--- 9. Generating Submission File (Ensemble) ---")

# 1. Prepare Test Data
# Fit the final scaler on ALL training data
final_scaler = StandardScaler()
final_scaler.fit(df_features_engineered[CONTINUOUS_COLS])

df_test_scaled = df_test_engineered.copy()
df_test_scaled[CONTINUOUS_COLS] = final_scaler.transform(df_test_scaled[CONTINUOUS_COLS])

# Get sample indices from the official sample_submission.csv
sub_indices = sample_sub_df['sample_index'].unique()

test_ds_final = PiratePainDataset(df_test_scaled, None, sub_indices, WINDOW_SIZE, STRIDE, TEXT_COL, augment=False)
test_loader_final = DataLoader(test_ds_final, batch_size=BATCH_SIZE*2, shuffle=False)
# Map windows back to sample_index
window_sample_map_test = [x[0] for x in test_ds_final.indices]

# 2. Ensemble Inference
ensemble_logits_sum = None

for i, model in enumerate(models_list):
    model.eval()
    fold_logits = []
    print(f"Running inference with Model {i+1}/{K_FOLDS}...")
    with torch.no_grad():
        for xc, xs, xt, xtxt, _ in test_loader_final:
            xc, xs, xt, xtxt = xc.to(device), xs.to(device), xt.to(device), xtxt.to(device)
            logits = model(xc, xs, xt, xtxt)
            fold_logits.extend(logits.cpu().numpy())
    
    # Aggregate windows for this fold
    df_tmp = pd.DataFrame(fold_logits, columns=[0, 1, 2])
    df_tmp['sample_index'] = window_sample_map_test
    df_avg = df_tmp.groupby('sample_index').mean()
    
    # Add to the ensemble sum
    if ensemble_logits_sum is None:
        ensemble_logits_sum = df_avg
    else:
        ensemble_logits_sum = ensemble_logits_sum.add(df_avg, fill_value=0)

# 3. Average Ensemble & Apply Thresholds
ensemble_logits_avg = ensemble_logits_sum / K_FOLDS
final_probs = torch.softmax(torch.tensor(ensemble_logits_avg.values), dim=1).numpy()

final_preds_list = []
thr_l, thr_h = best_thresh

for p in final_probs:
    if p[2] > thr_h: final_preds_list.append(2)    # high_pain
    elif p[1] > thr_l: final_preds_list.append(1)  # low_pain
    else: final_preds_list.append(0)               # no_pain

final_series = pd.Series(final_preds_list, index=ensemble_logits_avg.index)

# 4. Format & Save Submission
inv_map = {v: k for k, v in label_mapping.items()}
submission = final_series.map(inv_map).reset_index()
submission.columns = ['sample_index', 'label']

# Re-order to match sample_submission.csv
submission_final = submission.set_index('sample_index').reindex(sample_sub_df['sample_index']).reset_index()

submission_path = os.path.join(WORKING_DIR, 'submission.csv')
submission_final.to_csv(submission_path, index=False)
print(f"Submission file created successfully at: {submission_path}")
display(submission_final.head())


--- 9. Generating Submission File (Ensemble) ---
Running inference with Model 1/5...
Running inference with Model 2/5...
Running inference with Model 3/5...
Running inference with Model 4/5...
Running inference with Model 5/5...
Submission file created successfully at: /kaggle/working/submission.csv


|   | sample_index | label |
|---|---|---|
| 0 | 0 | no_pain |
| 1 | 1 | no_pain |
| 2 | 2 | no_pain |
| 3 | 3 | no_pain |
| 4 | 4 | no_pain |
