<a href="https://colab.research.google.com/github/armanfeili/novartis_datathon_2025/blob/Arman/notebooks/colab/main.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

In [None]:
# ==============================================================================
# 🔌 Mount Google Drive (Run this cell FIRST!)
# ==============================================================================
# This cell must be run before any other cells in Colab

from google.colab import drive
drive.mount('/content/drive', force_remount=True)
print("✅ Google Drive mounted at /content/drive")

# 🧬 Novartis Datathon 2025 - Complete Training Pipeline

This notebook provides a **complete end-to-end pipeline** for the Novartis Datathon 2025.

## ⚠️ First Steps (Colab Users)
1. **Run the first cell above** to mount Google Drive
2. Then run cells sequentially to clone the repo and set up the environment

## Configuration Structure

Each model has **one consolidated config file** supporting all training modes:

| Config File | Model | Priority | Description |
|-------------|-------|----------|-------------|
| `configs/model_xgb.yaml` | XGBoost | 1 (Primary) | Best performance on official metric, GPU support |
| `configs/model_lgbm.yaml` | LightGBM | 2 (Secondary) | Fast training, good for ensemble with XGBoost |
| `configs/model_cat.yaml` | CatBoost | 3 (Tertiary) | Native categorical handling, ensemble diversity |
| `configs/model_linear.yaml` | Linear | 4 | Ridge/Lasso/ElasticNet/Huber - baseline models |
| `configs/model_nn.yaml` | Neural Network | 5 | PyTorch MLP - experimental |
| `configs/model_hybrid.yaml` | Hybrid | 2 | Physics-based decay + ML residual learning |
| `configs/model_arihow.yaml` | ARIHOW | 4 | ARIMA + Holt-Winters time series hybrid |

Each config includes: `model`, `sweep`, `sweep_configs`, `scenario_best_params`, `validation`, `gpu`, `training`
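
A quick way to catch an incomplete config before training is to compare its top-level keys against the section list above. The following is a minimal sketch, not part of the project: `missing_sections` and `example_config` are hypothetical names, and the example values are placeholders rather than real settings.

```python
# Hypothetical helper: report which expected top-level sections a model config lacks.
EXPECTED_SECTIONS = (
    "model", "sweep", "sweep_configs", "scenario_best_params",
    "validation", "gpu", "training",
)


def missing_sections(config: dict) -> list:
    """Return the expected section names that are absent from a loaded config dict."""
    return [name for name in EXPECTED_SECTIONS if name not in config]


# Made-up, partially filled config used only to demonstrate the check.
example_config = {
    "model": {"name": "xgboost", "priority": 1},
    "sweep": {"selection_metric": "official_metric"},
    "gpu": {},
}
print(missing_sections(example_config))
```

In the notebook, the same check could be applied to each entry of `model_configs` after the configs are loaded.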

---

## Pipeline Sections

1. **🔌 Mount Drive** - Mount Google Drive (run first!)
2. **🔧 Environment Setup** - Clone repo, install dependencies
3. **📊 Data Loading** - Load raw data and build panels
4. **🔬 Feature Engineering** - Build scenario-specific features
5. **🏋️ Model Training** - Train with GPU acceleration (multiple modes)
6. **🔄 Hyperparameter Sweep** - Grid search with K-fold CV, select by **official_metric**
7. **🤝 Ensemble** - XGBoost + LightGBM weighted ensemble
8. **📤 Submission** - Generate competition submission files

---

## Available Models

| Model | Config File | GPU Support | Best For |
|-------|-------------|-------------|----------|
| `xgboost` | `model_xgb.yaml` | ✅ Yes | Primary model, best official_metric |
| `lightgbm` | `model_lgbm.yaml` | ✅ Yes | Fast training, ensemble partner |
| `catboost` | `model_cat.yaml` | ✅ Yes | Categorical features, diversity |
| `linear` | `model_linear.yaml` | ❌ No | Baseline, interpretability |
| `nn` | `model_nn.yaml` | ✅ Yes (PyTorch) | Experimental deep learning |
| `hybrid` | `model_hybrid.yaml` | ✅ Yes | Physics + ML hybrid |
| `arihow` | `model_arihow.yaml` | ❌ No | Time series (ARIMA + Holt-Winters) |

---

## Key Principles
- ✅ **Selection by official_metric** (PE), not RMSE
- ✅ **K-fold CV** (3-5 folds) for robust hyperparameter selection
- ✅ **XGB+LGBM ensemble** with optimized weights
- ✅ **GPU acceleration** for tree models on Colab
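
The "optimized weights" principle can be illustrated with a one-dimensional grid search over the blend weight on validation predictions. This is a sketch under stated assumptions: `best_blend_weight` is a hypothetical helper, the data is made up, and `mean_abs_error` is only a stand-in because the competition's official metric is not reproduced here.

```python
def mean_abs_error(y_true, y_pred):
    """Stand-in metric; swap in the official metric when selecting real weights."""
    return sum(abs(t - p) for t, p in zip(y_true, y_pred)) / len(y_true)


def best_blend_weight(y_true, pred_xgb, pred_lgbm, metric=mean_abs_error, steps=20):
    """Grid-search w in [0, 1] for the blend w * xgb + (1 - w) * lgbm (lower is better)."""
    best_w, best_score = 0.0, float("inf")
    for i in range(steps + 1):
        w = i / steps
        blend = [w * a + (1 - w) * b for a, b in zip(pred_xgb, pred_lgbm)]
        score = metric(y_true, blend)
        if score < best_score:
            best_w, best_score = w, score
    return best_w, best_score


# Toy validation data (made up): one model overshoots, the other undershoots.
y_val = [1.0, 0.8, 0.6, 0.4]
xgb_val = [1.1, 0.9, 0.7, 0.5]
lgbm_val = [0.9, 0.7, 0.5, 0.3]
w, score = best_blend_weight(y_val, xgb_val, lgbm_val)
print(f"weight on XGBoost: {w:.2f}, validation score: {score:.4f}")
```

Weights chosen this way should come from out-of-fold predictions, otherwise the blend is tuned on data the models have already seen.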

---

## 1. Environment Setup

In [None]:
# ==============================================================================
# 1.1 Detect Environment (Drive should already be mounted from cell above)
# ==============================================================================
import sys
import os
from pathlib import Path

# Detect if running in Colab
IN_COLAB = 'google.colab' in sys.modules

print(f"🖥️  Environment: {'Google Colab' if IN_COLAB else 'Local'}")
print(f"🐍 Python: {sys.version.split()[0]}")

if IN_COLAB:
    # Verify Drive is mounted (should have been mounted in the first cell)
    if os.path.exists('/content/drive/MyDrive'):
        print("✅ Google Drive is mounted at /content/drive")
    else:
        print("⚠️ Google Drive not mounted! Please run the first cell to mount Drive.")
        print("   Run: drive.mount('/content/drive', force_remount=True)")
else:
    print("⚠️ Not running in Colab - using local paths")

Drive already mounted at /content/drive; to attempt to forcibly remount, call drive.mount("/content/drive", force_remount=True).
✅ Google Drive mounted successfully


In [None]:
# ==============================================================================
# 1.2 Clone Repository and Set Paths
# ==============================================================================
import os

# --- Configuration (MODIFY THESE) ---
REPO_URL = "https://github.com/armanfeili/novartis_datathon_2025.git"
BRANCH = "Arman"  # Change to your working branch

# Google Drive folder ID from your shared link
# https://drive.google.com/drive/folders/1_qUAkFZPx1psU0Gc0tOtf2EHS25U2X4H
DRIVE_FOLDER_ID = "1_qUAkFZPx1psU0Gc0tOtf2EHS25U2X4H"

# Paths depend on environment
if IN_COLAB:
    DRIVE_BASE = "/content/drive/MyDrive"
    PROJECT_PATH = "/content/novartis_datathon_2025"  # Clone to /content for speed
    
    # Data path - adjust based on your Drive structure
    # Option 1: If data is in a specific folder on Drive
    DATA_PATH = f"{DRIVE_BASE}/novartis-datathon-2025/data"
    
    # Option 2: If using the shared folder directly (uncomment if needed)
    # DATA_PATH = f"{DRIVE_BASE}/Colab Notebooks/novartis_data"
    
    ARTIFACTS_PATH = f"{DRIVE_BASE}/novartis-datathon-2025/artifacts"
    SUBMISSIONS_PATH = f"{DRIVE_BASE}/novartis-datathon-2025/submissions"
else:
    # Local paths (relative to notebook location)
    PROJECT_PATH = str(Path.cwd().parent.parent)
    DATA_PATH = os.path.join(PROJECT_PATH, "data")
    ARTIFACTS_PATH = os.path.join(PROJECT_PATH, "artifacts")
    SUBMISSIONS_PATH = os.path.join(PROJECT_PATH, "submissions")
# --------------------------------

if IN_COLAB:
    # Clone or update repository
    if not os.path.exists(PROJECT_PATH):
        print(f"📥 Cloning repository...")
        !git clone --branch {BRANCH} {REPO_URL} {PROJECT_PATH}
    else:
        print(f"📂 Repository exists. Pulling latest changes...")
        %cd {PROJECT_PATH}
        !git fetch origin {BRANCH}
        !git reset --hard origin/{BRANCH}
    
    %cd {PROJECT_PATH}
    
    # Check if data directory exists on Drive
    if os.path.exists(DATA_PATH):
        print(f"✅ Data directory found at: {DATA_PATH}")
        # Create symlink to Drive data
        local_data = os.path.join(PROJECT_PATH, "data")
        if os.path.islink(local_data):
            os.unlink(local_data)
        elif os.path.exists(local_data):
            import shutil
            shutil.rmtree(local_data)
        os.symlink(DATA_PATH, local_data)
        print(f"🔗 Linked data directory from Drive")
    else:
        print(f"⚠️ Data directory not found at: {DATA_PATH}")
        print(f"   Please ensure your data is uploaded to Google Drive")
        print(f"   Expected structure:")
        print(f"   {DATA_PATH}/")
        print(f"     ├── raw/")
        print(f"     │   ├── TRAIN/")
        print(f"     │   └── TEST/")
        print(f"     └── processed/ (optional, for cached features)")

# Create required directories
for path in [ARTIFACTS_PATH, SUBMISSIONS_PATH]:
    os.makedirs(path, exist_ok=True)

# Print paths
print(f"\n📁 Project: {PROJECT_PATH}")
print(f"📁 Data: {DATA_PATH}")
print(f"📁 Artifacts: {ARTIFACTS_PATH}")
print(f"📁 Submissions: {SUBMISSIONS_PATH}")

# List data directory contents if it exists
if os.path.exists(DATA_PATH):
    print(f"\n📂 Data directory contents:")
    for item in os.listdir(DATA_PATH):
        item_path = os.path.join(DATA_PATH, item)
        if os.path.isdir(item_path):
            print(f"  📁 {item}/")
        else:
            print(f"  📄 {item}")

📂 Repository exists at /content/drive/MyDrive/novartis_datathon_2025. Pulling latest changes...
/content/drive/MyDrive/novartis_datathon_2025
remote: Enumerating objects: 19, done.
remote: Counting objects: 100% (19/19), done.
remote: Compressing objects: 100% (3/3), done.
remote: Total 10 (delta 6), reused 10 (delta 6), pack-reused 0 (from 0)
Unpacking objects: 100% (10/10), 8.06 KiB | 26.00 KiB/s, done.
From https://github.com/armanfeili/novartis_datathon_2025
 * branch            Arman      -> FETCH_HEAD
   67c14aa..5c33709  Arman      -> origin/Arman
HEAD is now at 5c33709 project setup - 3
/content/drive/MyDrive/novartis_datathon_2025

📁 Project Path: /content/drive/MyDrive/novartis_datathon_2025
📁 Data Path: /content/drive/MyDrive/novartis-datathon-2025/data
📁 Artifacts Path: /content/drive/MyDrive/novartis-datathon-2025/artifacts
📁 Submissions Path: /content/drive/MyDrive/novartis-datathon-2025/submissions


In [None]:
# ==============================================================================
# 1.3 Install Dependencies
# ==============================================================================
import subprocess

print("📦 Installing dependencies...")

# Install from colab requirements
!pip install -q -r env/colab_requirements.txt

# For GPU support, ensure CUDA-compatible versions
if IN_COLAB:
    # XGBoost with GPU
    !pip install -q xgboost --upgrade
    
    # LightGBM with GPU (requires OpenCL)
    !pip install -q lightgbm --upgrade
    
    # CatBoost with GPU
    !pip install -q catboost --upgrade
    
    # PyTorch for neural network model
    !pip install -q torch --upgrade

# Verify key packages
import importlib

packages = [
    ('numpy', 'numpy'),
    ('pandas', 'pandas'),
    ('sklearn', 'scikit-learn'),
    ('yaml', 'pyyaml'),
    ('tqdm', 'tqdm'),
    ('catboost', 'catboost'),
    ('lightgbm', 'lightgbm'),
    ('xgboost', 'xgboost'),
    ('pyarrow', 'pyarrow'),
    ('scipy', 'scipy'),
    ('torch', 'pytorch'),
    ('joblib', 'joblib'),
]

print("\n📋 Package Status:")
for import_name, pkg_name in packages:
    try:
        mod = importlib.import_module(import_name)
        version = getattr(mod, '__version__', 'installed')
        print(f"  ✅ {pkg_name}: {version}")
    except ImportError:
        print(f"  ❌ {pkg_name}: not installed")

# Check GPU availability
print("\n🖥️ GPU Status:")
try:
    import torch
    if torch.cuda.is_available():
        print(f"  ✅ CUDA available: {torch.cuda.get_device_name(0)}")
        print(f"  ✅ CUDA version: {torch.version.cuda}")
    elif hasattr(torch.backends, 'mps') and torch.backends.mps.is_available():
        print(f"  ✅ MPS (Apple Silicon) available")
    else:
        print("  ⚠️ CUDA not available - using CPU")
except ImportError:
    print("  ⚠️ PyTorch not installed")

# Check via nvidia-smi
!nvidia-smi --query-gpu=name,memory.total --format=csv,noheader 2>/dev/null || echo "  ℹ️ nvidia-smi not available"

print("\n✅ Dependencies installed!")

📦 Installing dependencies...
  ✅ torch
  ✅ numpy
  ✅ pandas
  ✅ lightgbm
  ✅ xgboost
  ✅ catboost
  ✅ sklearn
  ✅ yaml

✅ All dependencies installed!


## 2. Import Modules and Verify Environment

Import project modules and verify GPU availability.

In [None]:
# ==============================================================================
# 2.1 Import Project Modules
# ==============================================================================
import sys
import os
import gc
import warnings
warnings.filterwarnings('ignore')

# Ensure project root is in path
if PROJECT_PATH not in sys.path:
    sys.path.insert(0, PROJECT_PATH)

# Standard imports
import numpy as np
import pandas as pd
from pathlib import Path
from datetime import datetime

# Project imports
from src.utils import (
    load_config, set_seed, setup_logging, timer, 
    get_device, get_gpu_info, print_environment_info,
    clear_memory, get_memory_usage, optimize_dataframe_memory
)
from src.data import (
    get_panel, load_raw_data, prepare_base_panel, 
    compute_pre_entry_stats, handle_missing_values,
    META_COLS
)
from src.features import (
    get_features, make_features, split_features_target_meta,
    get_feature_columns, SCENARIO_CONFIG
)
from src.train import (
    train_scenario_model, run_cross_validation, 
    run_experiment, compute_sample_weights, split_features_target_meta
)
from src.evaluate import compute_metric1, compute_metric2, compute_per_series_error
from src.inference import (
    generate_submission, detect_test_scenarios, 
    validate_submission_format, save_submission_with_versioning
)
from src.models import get_model_class

print("✅ All modules imported successfully!")

# ==============================================================================
# 2.2 Display Environment Information
# ==============================================================================
print_environment_info()

# ==============================================================================
# 2.3 List Available Models
# ==============================================================================
print("\n🤖 Available Models:")
available_models = [
    ('xgboost', 'XGBoost gradient boosting (GPU)'),
    ('lightgbm', 'LightGBM fast gradient boosting'),
    ('catboost', 'CatBoost with native categorical'),
    ('linear', 'Ridge/Lasso/ElasticNet/Huber'),
    ('nn', 'Neural Network (MLP)'),
    ('hybrid', 'Physics + ML hybrid'),
    ('arihow', 'ARIMA + Holt-Winters'),
]
for model_name, desc in available_models:
    try:
        _ = get_model_class(model_name)
        print(f"  ✅ {model_name}: {desc}")
    except Exception as e:
        print(f"  ⚠️ {model_name}: {desc} (may need dependencies)")

🖥️  Device: cpu

✅ All modules imported successfully!


## 3. Load Configuration and Set Seed

Load all configuration files and set random seed for reproducibility.
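
The project's `set_seed` is not shown in this notebook, so the following is only a sketch of what such a helper typically does. `seed_everything` is a hypothetical stand-in that seeds Python's `random` module and, if it is installed, NumPy. A PyTorch seed (`torch.manual_seed`) would be added the same way.

```python
import os
import random


def seed_everything(seed: int) -> None:
    """Hypothetical stand-in for set_seed: seed the RNGs that are available."""
    # Affects only subprocesses started later; the running interpreter's hash seed is already fixed.
    os.environ["PYTHONHASHSEED"] = str(seed)
    random.seed(seed)
    try:
        import numpy as np
        np.random.seed(seed)
    except ImportError:
        pass  # NumPy is optional for this sketch


# Re-seeding with the same value reproduces the same random draw.
seed_everything(42)
first = random.random()
seed_everything(42)
second = random.random()
print(first == second)
```

Seeding alone does not guarantee identical results for GPU training, where some operations are non-deterministic.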

In [None]:
# ==============================================================================
# 3.1 Load Configurations
# ==============================================================================
data_config = load_config('configs/data.yaml')
features_config = load_config('configs/features.yaml')
run_config = load_config('configs/run_defaults.yaml')

# Load all model configs (one file per model)
model_configs = {}
model_config_files = {
    'xgboost': 'configs/model_xgb.yaml',
    'lightgbm': 'configs/model_lgbm.yaml',
    'catboost': 'configs/model_cat.yaml',
    'linear': 'configs/model_linear.yaml',
    'nn': 'configs/model_nn.yaml',
    'hybrid': 'configs/model_hybrid.yaml',
    'arihow': 'configs/model_arihow.yaml',
}

print("📋 Loading model configurations:")
for model_name, config_path in model_config_files.items():
    try:
        model_configs[model_name] = load_config(config_path)
        sweep_configs = model_configs[model_name].get('sweep_configs', [])
        n_configs = len(sweep_configs) if sweep_configs else 0
        print(f"  ✅ {model_name}: {config_path} ({n_configs} sweep configs)")
    except Exception as e:
        print(f"  ⚠️ {model_name}: Could not load {config_path} - {e}")

# Set random seed for reproducibility
SEED = run_config['reproducibility']['seed']
set_seed(SEED)

# Setup logging
setup_logging(level=run_config.get('logging', {}).get('level', 'INFO'))

print(f"\n🎲 Random seed: {SEED}")
print(f"📅 Scenarios: {list(run_config['scenarios'].keys())}")

# Display model priorities
print(f"\n🏆 Model Priorities:")
for name, cfg in model_configs.items():
    priority = cfg.get('model', {}).get('priority', 99)
    sweep_metric = cfg.get('sweep', {}).get('selection_metric', 'official_metric')
    print(f"  {priority}. {name.upper()} - selection: {sweep_metric}")

# Display scenario details
print(f"\n📅 Scenario Configuration:")
for s_name, s_config in run_config['scenarios'].items():
    print(f"  {s_name}:")
    print(f"    Forecast: months {s_config['forecast_start']} to {s_config['forecast_end']}")
    print(f"    Feature cutoff: month {s_config['feature_cutoff']}")

📋 Configurations loaded:
  - Data config: ['drive', 'local', 'files', 'keys', 'dates', 'columns', 'validation']
  - Features config: ['feature_groups', 'lags', 'rolling', 'diff', 'time_features', 'interactions', 'selection', 'encoding']
  - Run config: ['experiment', 'run', 'reproducibility', 'cv', 'paths', 'output', 'metrics', 'logging', 'drive', 'hardware']
  - Model configs: ['lightgbm', 'xgboost', 'catboost', 'linear', 'neural_network']

🎲 Random seed: 42


## 4. Load and Explore Data

Load the training and test data panels.

In [None]:
# ==============================================================================
# 4.1 Load Training Panel
# ==============================================================================
print("📂 Loading training data...")

try:
    with timer("Load train panel"):
        train_panel = get_panel(split='train', config=data_config, use_cache=True)
    
    # Display statistics
    n_series = train_panel[['country', 'brand_name']].drop_duplicates().shape[0]
    print(f"\n📊 Training Panel Statistics:")
    print(f"  Shape: {train_panel.shape[0]:,} rows × {train_panel.shape[1]} columns")
    print(f"  Unique series: {n_series:,}")
    print(f"  Time range: {train_panel['months_postgx'].min()} to {train_panel['months_postgx'].max()}")
    
    # Bucket distribution
    if 'bucket' in train_panel.columns:
        bucket_dist = train_panel[['country', 'brand_name', 'bucket']].drop_duplicates()['bucket'].value_counts()
        print(f"\n🪣 Bucket Distribution:")
        for bucket, count in bucket_dist.items():
            pct = count / n_series * 100
            print(f"  Bucket {bucket}: {count:,} series ({pct:.1f}%)")
    
    # Memory usage
    mem_mb = train_panel.memory_usage(deep=True).sum() / (1024**2)
    print(f"\n💾 Memory: {mem_mb:.1f} MB")
    
except FileNotFoundError as e:
    print(f"\n❌ Data not found: {e}")
    print("\n📋 Please ensure your data is in the correct location:")
    print(f"   Expected: {DATA_PATH}/raw/TRAIN/")
    print("\n💡 If using Google Colab:")
    print("   1. Upload data to Google Drive")
    print("   2. Run the 'setup_data_from_drive()' function in Section 8.2")
    print("   3. Re-run this cell")
    train_panel = None
    
except Exception as e:
    print(f"\n❌ Error loading data: {e}")
    import traceback
    traceback.print_exc()
    train_panel = None

📂 Data Directories:
  Raw: /content/drive/MyDrive/novartis-datathon-2025/data/raw (exists: True)
  Interim: /content/drive/MyDrive/novartis-datathon-2025/data/interim (exists: True)
  Processed: /content/drive/MyDrive/novartis-datathon-2025/data/processed (exists: True)

📄 Available raw files (0):


In [None]:
# ==============================================================================
# 4.2 Load Test Panel
# ==============================================================================
print("📂 Loading test data...")

try:
    with timer("Load test panel"):
        test_panel = get_panel(split='test', config=data_config, use_cache=True)
    
    # Detect scenarios
    test_scenarios = detect_test_scenarios(test_panel)
    n_test_series = test_panel[['country', 'brand_name']].drop_duplicates().shape[0]
    
    print(f"\n📊 Test Panel Statistics:")
    print(f"  Shape: {test_panel.shape[0]:,} rows × {test_panel.shape[1]} columns")
    print(f"  Unique series: {n_test_series:,}")
    print(f"  Scenario 1 series: {len(test_scenarios[1]):,}")
    print(f"  Scenario 2 series: {len(test_scenarios[2]):,}")
    
    # Clear memory
    clear_memory()
    print(f"\n🧹 Memory cleared")

except FileNotFoundError as e:
    print(f"\n❌ Test data not found: {e}")
    print("   This is OK if you only want to train - test data is only needed for submission")
    test_panel = None
    test_scenarios = {1: [], 2: []}
    
except Exception as e:
    print(f"\n❌ Error loading test data: {e}")
    test_panel = None
    test_scenarios = {1: [], 2: []}

AttributeError: 'NoneType' object has no attribute 'items'

In [None]:
# ==============================================================================
# 4.3 Quick Data Exploration
# ==============================================================================
if train_panel is not None:
    import matplotlib.pyplot as plt
    
    # Set up plotting
    plt.style.use('default')
    fig, axes = plt.subplots(2, 2, figsize=(14, 10))
    
    # 1. y_norm distribution
    ax = axes[0, 0]
    if 'y_norm' in train_panel.columns:
        train_panel['y_norm'].hist(bins=50, ax=ax, color='steelblue', edgecolor='white')
        ax.axvline(x=1.0, color='red', linestyle='--', label='No erosion (1.0)')
        ax.axvline(x=0.25, color='orange', linestyle='--', label='Bucket 1 threshold')
        ax.set_xlabel('Normalized Volume (y_norm)')
        ax.set_ylabel('Frequency')
        ax.set_title('Distribution of y_norm')
        ax.legend()
    else:
        ax.text(0.5, 0.5, 'y_norm not available', ha='center', va='center', transform=ax.transAxes)
        ax.set_title('y_norm Distribution (N/A)')
    
    # 2. Mean erosion curve by bucket
    ax = axes[0, 1]
    if 'bucket' in train_panel.columns and 'y_norm' in train_panel.columns:
        for bucket in [1, 2]:
            bucket_data = train_panel[train_panel['bucket'] == bucket]
            if len(bucket_data) > 0:
                erosion_by_month = bucket_data.groupby('months_postgx')['y_norm'].mean()
                ax.plot(erosion_by_month.index, erosion_by_month.values, 
                        label=f'Bucket {bucket}', linewidth=2)
        ax.axhline(y=1.0, color='gray', linestyle=':', alpha=0.7)
        ax.set_xlabel('Months Post Generic Entry')
        ax.set_ylabel('Mean Normalized Volume')
        ax.set_title('Erosion Curves by Bucket')
        ax.legend()
        ax.grid(True, alpha=0.3)
    else:
        ax.text(0.5, 0.5, 'bucket/y_norm not available', ha='center', va='center', transform=ax.transAxes)
        ax.set_title('Erosion Curves (N/A)')
    
    # 3. Number of generics over time
    ax = axes[1, 0]
    if 'n_gxs' in train_panel.columns:
        ngxs_by_month = train_panel.groupby('months_postgx')['n_gxs'].mean()
        ax.bar(ngxs_by_month.index, ngxs_by_month.values, color='forestgreen', alpha=0.7)
        ax.set_xlabel('Months Post Generic Entry')
        ax.set_ylabel('Mean Number of Generics')
        ax.set_title('Average Generic Competition Over Time')
        ax.grid(True, alpha=0.3, axis='y')
    else:
        ax.text(0.5, 0.5, 'n_gxs not available', ha='center', va='center', transform=ax.transAxes)
        ax.set_title('Generic Competition (N/A)')
    
    # 4. Hospital rate distribution
    ax = axes[1, 1]
    if 'hospital_rate' in train_panel.columns:
        hr_by_series = train_panel.groupby(['country', 'brand_name'])['hospital_rate'].first()
        hr_by_series.hist(bins=30, ax=ax, color='purple', edgecolor='white', alpha=0.7)
        ax.set_xlabel('Hospital Rate (%)')
        ax.set_ylabel('Number of Series')
        ax.set_title('Hospital Rate Distribution')
    else:
        # Try alternative columns
        if 'country' in train_panel.columns:
            country_dist = train_panel.groupby(['country', 'brand_name']).size().reset_index()['country'].value_counts()
            country_dist.plot(kind='bar', ax=ax, color='purple', alpha=0.7)
            ax.set_xlabel('Country')
            ax.set_ylabel('Number of Series')
            ax.set_title('Series by Country')
            ax.tick_params(axis='x', rotation=45)
        else:
            ax.text(0.5, 0.5, 'No distribution data available', ha='center', va='center', transform=ax.transAxes)
            ax.set_title('Distribution (N/A)')
    
    plt.tight_layout()
    plt.show()
    
    print("✅ Data exploration complete")
else:
    print("⚠️ Training data not loaded - skipping exploration")

## 5. Feature Engineering

Build scenario-specific features for training.

In [None]:
# ==============================================================================
# 5.1 Build Features for Both Scenarios
# ==============================================================================

if train_panel is None:
    print("❌ Training data not loaded - cannot build features")
    print("   Please fix data loading issues first (Section 4)")
    # Define the outputs so later cells can test for None instead of raising NameError
    panel_features_s1 = panel_features_s2 = None
else:
    # Build Scenario 1 features (forecast months 0-23 using pre-entry only)
    print("🔬 Building Scenario 1 features...")
    try:
        with timer("Scenario 1 features"):
            panel_s1 = get_panel(split='train', config=data_config, use_cache=True)
            panel_features_s1 = make_features(panel_s1, scenario=1, mode='train', config=features_config)
        
        n_features_s1 = len([c for c in panel_features_s1.columns if c not in META_COLS])
        print(f"  Panel shape: {panel_features_s1.shape}")
        print(f"  Features: {n_features_s1}")
        
    except Exception as e:
        print(f"  ❌ Error building S1 features: {e}")
        panel_features_s1 = None
    
    # Build Scenario 2 features (forecast months 6-23 using pre-entry + months 0-5)
    print("\n🔬 Building Scenario 2 features...")
    try:
        with timer("Scenario 2 features"):
            panel_s2 = get_panel(split='train', config=data_config, use_cache=True)
            panel_features_s2 = make_features(panel_s2, scenario=2, mode='train', config=features_config)
        
        n_features_s2 = len([c for c in panel_features_s2.columns if c not in META_COLS])
        print(f"  Panel shape: {panel_features_s2.shape}")
        print(f"  Features: {n_features_s2}")
        
    except Exception as e:
        print(f"  ❌ Error building S2 features: {e}")
        panel_features_s2 = None
    
    # Display some feature examples
    if panel_features_s1 is not None:
        feature_cols_s1 = [c for c in panel_features_s1.columns if c not in META_COLS]
        print(f"\n📋 Sample Features (Scenario 1):")
        print(f"  {feature_cols_s1[:10]}...")
    
    # Check for early erosion features in S2 only
    if panel_features_s2 is not None:
        feature_cols_s2 = [c for c in panel_features_s2.columns if c not in META_COLS]
        s2_only_features = [c for c in feature_cols_s2 if 'erosion_0' in c or 'early_' in c or 'month_0' in c]
        if s2_only_features:
            print(f"\n📋 Scenario 2 Specific Features (early erosion):")
            print(f"  {s2_only_features[:5]}...")
    
    clear_memory()
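
The scenario comments in the cell above imply a leakage rule: features may use only months before the forecast window begins. The sketch below makes that rule explicit on a toy series. `split_history_and_targets` and the constant names are hypothetical, and the cutoffs are read from those comments rather than from `make_features`, whose internals are not shown here.

```python
# Cutoffs inferred from the cell comments: Scenario 1 uses pre-entry months only,
# Scenario 2 additionally uses months 0-5. Both forecast through month 23.
FEATURE_CUTOFF = {1: 0, 2: 6}
FORECAST_START = {1: 0, 2: 6}
FORECAST_END = 23


def split_history_and_targets(rows, scenario):
    """Split one series' rows into feature history and forecast targets."""
    cutoff = FEATURE_CUTOFF[scenario]
    history = [r for r in rows if r["months_postgx"] < cutoff]
    targets = [
        r for r in rows
        if FORECAST_START[scenario] <= r["months_postgx"] <= FORECAST_END
    ]
    return history, targets


# Toy series covering months -3..23 (values are made up).
series = [{"months_postgx": m, "y_norm": 1.0} for m in range(-3, 24)]
for scenario in (1, 2):
    history, targets = split_history_and_targets(series, scenario)
    print(f"Scenario {scenario}: {len(history)} history months, {len(targets)} target months")
```

In the real pipeline the same bounds appear as `feature_cutoff`, `forecast_start`, and `forecast_end` in the scenario section of the run config.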

## 6. Model Training

Train the selected model (`MODEL_TYPE`, XGBoost by default) for both scenarios. The default training mode is K-fold cross-validation.
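
With panel data, folds are usually built from whole series so that months of one country-brand pair never appear in both the training and validation sets. Whether `run_cross_validation` splits this way is an assumption. The sketch below shows the idea using a hypothetical `series_kfold` helper and made-up series keys.

```python
import random


def series_kfold(series_ids, n_folds=5, seed=42):
    """Yield (train_ids, val_ids); each series lands in exactly one validation fold."""
    ids = sorted(set(series_ids))
    random.Random(seed).shuffle(ids)
    folds = [ids[i::n_folds] for i in range(n_folds)]
    for k in range(n_folds):
        val = set(folds[k])
        train = [s for s in ids if s not in val]
        yield train, sorted(val)


# Toy (country, brand_name) keys, made up for illustration.
keys = [(f"COUNTRY_{c}", f"BRAND_{b}") for c in range(3) for b in range(4)]
for fold, (train_ids, val_ids) in enumerate(series_kfold(keys, n_folds=3)):
    print(f"fold {fold}: {len(train_ids)} train series, {len(val_ids)} validation series")
```

Rows are then selected per fold by matching each row's series key against `train_ids` or `val_ids`.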

In [None]:
# ==============================================================================
# 6.1 Training Configuration
# ==============================================================================

# ============ CONFIGURE YOUR TRAINING HERE ============
# Model options: 'xgboost', 'lightgbm', 'catboost', 'linear', 'nn', 'hybrid', 'arihow'
MODEL_TYPE = 'xgboost'  # Primary model (best performing)

# Training mode options:
# - 'quick'    : Use best known params (from config scenario_best_params)
# - 'cv'       : Train with K-fold cross-validation using best params
# - 'sweep'    : Run hyperparameter sweep with holdout validation
# - 'sweep_cv' : Run sweep with K-fold cross-validation (most robust)
# - 'ensemble' : Train XGBoost + LightGBM ensemble
# - 'compare'  : Compare all models and select best by official_metric
TRAINING_MODE = 'cv'

N_FOLDS = 5  # Number of CV folds
USE_GPU = True  # Enable GPU acceleration (for tree models)
# =====================================================

# Create run ID
RUN_ID = datetime.now().strftime("%Y%m%d_%H%M%S")
RUN_DIR = Path(ARTIFACTS_PATH) / RUN_ID
RUN_DIR.mkdir(parents=True, exist_ok=True)

print(f"🏃 Training Configuration:")
print(f"  Run ID: {RUN_ID}")
print(f"  Model: {MODEL_TYPE}")
print(f"  Mode: {TRAINING_MODE}")
print(f"  CV Folds: {N_FOLDS}")
print(f"  Artifacts: {RUN_DIR}")

# Check GPU availability and configure
gpu_info = get_gpu_info()
# bool(...) keeps GPU_AVAILABLE a real boolean rather than the CUDA version string or None
GPU_AVAILABLE = bool(gpu_info.get('gpu_available', False) and gpu_info.get('cuda_version'))

if GPU_AVAILABLE and USE_GPU:
    print(f"  🚀 GPU: {gpu_info.get('device_name', 'Available')} - GPU training enabled")
else:
    print(f"  💻 Using CPU training")
    USE_GPU = False

# Set environment variable for thread safety (important for XGBoost/LightGBM)
os.environ['OMP_NUM_THREADS'] = '1'

# Show sweep configuration if using sweep mode
if TRAINING_MODE in ['sweep', 'sweep_cv'] and MODEL_TYPE in model_configs:
    sweep_config = model_configs[MODEL_TYPE].get('sweep', {})
    print(f"\n🔍 Sweep Configuration:")
    print(f"  Selection metric: {sweep_config.get('selection_metric', 'official_metric')}")
    
    # Show sweep configs if available
    sweep_configs_list = model_configs[MODEL_TYPE].get('sweep_configs', [])
    if sweep_configs_list:
        print(f"  Named configurations: {len(sweep_configs_list)}")
        for cfg in sweep_configs_list[:5]:
            print(f"    - {cfg.get('id', 'unnamed')}: {cfg.get('description', 'No description')}")
        if len(sweep_configs_list) > 5:
            print(f"    ... and {len(sweep_configs_list) - 5} more")
    
    # Show grid if available
    grid = sweep_config.get('grid', {})
    if grid:
        total_combos = 1
        for param, values in grid.items():
            if isinstance(values, list):
                print(f"  - {param}: {values}")
                total_combos *= len(values)
        print(f"  Total grid combinations: {total_combos}")
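
The cell above only counts the grid combinations. A sweep also needs to enumerate them, which `itertools.product` handles directly. This is a minimal sketch with a hypothetical `expand_grid` helper and made-up grid values. Like the counting loop, it skips entries that are not lists.

```python
from itertools import product


def expand_grid(grid: dict) -> list:
    """Return one params dict per combination of the list-valued entries in grid."""
    names = [k for k, v in grid.items() if isinstance(v, list)]
    return [dict(zip(names, combo)) for combo in product(*(grid[k] for k in names))]


# Made-up grid; the real values live under sweep.grid in the model config.
example_grid = {"max_depth": [4, 6], "learning_rate": [0.03, 0.1], "subsample": [0.8]}
combos = expand_grid(example_grid)
print(len(combos))
print(combos[0])
```

The length of the returned list matches `total_combos` as computed in the cell, including the edge case of an empty grid, which yields a single empty parameter set.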

In [None]:
# ==============================================================================
# 6.2 Configure GPU-Enabled Model Parameters
# ==============================================================================

def get_gpu_model_config(model_type: str, base_config: dict, use_gpu: bool = True) -> dict:
    """
    Get model configuration with GPU settings enabled.
    Uses gpu section from consolidated model config files.
    
    Args:
        model_type: 'xgboost', 'lightgbm', 'catboost', 'nn', etc.
        base_config: Base model configuration dict
        use_gpu: Whether to enable GPU
        
    Returns:
        Updated model configuration
    """
    import copy
    config = copy.deepcopy(base_config)
    
    # Get params from config - could be in 'model.params' or 'params' directly
    if 'model' in config and 'params' in config['model']:
        params = config['model']['params'].copy()
    elif 'params' in config:
        params = config['params'].copy()
    else:
        params = {}
    
    gpu_config = config.get('gpu', {})
    
    if use_gpu and GPU_AVAILABLE:
        # Apply GPU settings from config
        if model_type in ['xgboost', 'xgb']:
            params['tree_method'] = gpu_config.get('tree_method', 'gpu_hist')
            params['gpu_id'] = gpu_config.get('gpu_id', 0)
            params['predictor'] = gpu_config.get('predictor', 'gpu_predictor')
            print("  üöÄ XGBoost GPU mode enabled (tree_method='gpu_hist')")
            
        elif model_type in ['lightgbm', 'lgbm']:
            params['device'] = 'gpu'
            params['gpu_platform_id'] = gpu_config.get('gpu_platform_id', 0)
            params['gpu_device_id'] = gpu_config.get('gpu_device_id', 0)
            print("  ‚ö° LightGBM GPU mode enabled (device='gpu')")
            
        elif model_type in ['catboost', 'cat']:
            params['task_type'] = 'GPU'
            params['devices'] = str(gpu_config.get('device_id', 0))
            print("  🐱 CatBoost GPU mode enabled (task_type='GPU')")
            
        elif model_type in ['nn', 'neural', 'mlp']:
            # Neural network automatically uses GPU via PyTorch
            print("  🧠 Neural Network will use GPU via PyTorch")
        else:
            print(f"  💻 {model_type} using CPU (no GPU config)")
    else:
        # Use CPU settings
        if model_type in ['xgboost', 'xgb']:
            params['tree_method'] = 'hist'
        elif model_type in ['lightgbm', 'lgbm']:
            params['device'] = 'cpu'
        elif model_type in ['catboost', 'cat']:
            params['task_type'] = 'CPU'
        print(f"  💻 {model_type} CPU mode")
    
    # Update config with modified params
    if 'model' in config:
        config['model']['params'] = params
    else:
        config['params'] = params
    
    return config

# Get GPU-enabled config for selected model
print(f"\n🔧 Configuring {MODEL_TYPE}...")

if MODEL_TYPE in model_configs:
    current_model_config = get_gpu_model_config(MODEL_TYPE, model_configs[MODEL_TYPE], USE_GPU)
else:
    print(f"  ⚠️ No config found for {MODEL_TYPE}, using empty config")
    current_model_config = {}

# Get model class
try:
    ModelClass = get_model_class(MODEL_TYPE)
    print(f"  Model class: {ModelClass.__name__}")
except Exception as e:
    print(f"  ❌ Could not load model class: {e}")
    ModelClass = None

# Display selected params
if 'model' in current_model_config and 'params' in current_model_config['model']:
    selected_params = current_model_config['model']['params']
elif 'params' in current_model_config:
    selected_params = current_model_config['params']
else:
    selected_params = {}

if selected_params:
    print(f"\n📋 Model Parameters:")
    for k, v in list(selected_params.items())[:10]:
        print(f"  {k}: {v}")
    if len(selected_params) > 10:
        print(f"  ... and {len(selected_params) - 10} more")
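
The GPU flags set in `get_gpu_model_config` follow the pre-2.0 XGBoost convention (`tree_method='gpu_hist'`, `gpu_id`, `predictor`). XGBoost 2.0 and later deprecate these in favor of a single `device` parameter combined with `tree_method='hist'`. A minimal sketch of a version-aware mapping follows, assuming the installed version string is passed in. The helper name `xgb_device_params` is hypothetical and not part of this repo.

```python
def xgb_device_params(xgb_version: str, use_gpu: bool, gpu_id: int = 0) -> dict:
    """Return XGBoost device parameters suited to the given version string.

    Hypothetical helper for illustration; not part of the project code.
    """
    major = int(xgb_version.split(".")[0])
    if major >= 2:
        # XGBoost >= 2.0: a single `device` parameter, 'hist' on both CPU and GPU
        device = f"cuda:{gpu_id}" if use_gpu else "cpu"
        return {"tree_method": "hist", "device": device}
    # XGBoost < 2.0: legacy GPU-specific parameters, as used in the cell above
    if use_gpu:
        return {"tree_method": "gpu_hist", "gpu_id": gpu_id, "predictor": "gpu_predictor"}
    return {"tree_method": "hist"}


print(xgb_device_params("2.0.3", use_gpu=True))
print(xgb_device_params("1.7.6", use_gpu=True))
```

In the notebook, the version string could come from `xgboost.__version__`, and the returned dict could be merged into `params` in place of the three legacy keys.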

### 6.3 Train Scenario 1 Model

In [None]:
# ==============================================================================
# 6.3 Train Scenario 1 Model
# ==============================================================================
print(f"🏋️ Training Scenario 1 - {MODEL_TYPE.upper()}")
print("=" * 60)

# Build features for Scenario 1
print("Building Scenario 1 features...")
with timer("Feature engineering S1"):
    panel_s1 = get_panel(split='train', config=data_config, use_cache=True)
    panel_features_s1 = make_features(panel_s1, scenario=1, mode='train', config=features_config)

# Run cross-validation
models_s1, s1_cv_results, oof_s1 = run_cross_validation(
    panel_features=panel_features_s1,
    scenario=1,
    model_type=MODEL_TYPE,
    model_config=current_model_config,
    run_config=run_config,
    n_folds=N_FOLDS,
    save_oof=True,
    artifacts_dir=RUN_DIR / 'models_s1',
    run_id=RUN_ID,
)

print(f"\n✅ Scenario 1 Training Complete")
print(f"  Mean CV Official Metric: {s1_cv_results['cv_official_mean']:.6f} ± {s1_cv_results['cv_official_std']:.6f}")
print(f"  Mean CV RMSE: {s1_cv_results['cv_rmse_mean']:.6f} ± {s1_cv_results['cv_rmse_std']:.6f}")

# Save S1 OOF predictions
if len(oof_s1) > 0:
    oof_s1.to_csv(RUN_DIR / 'oof_s1.csv', index=False)
    print(f"  OOF predictions saved: {len(oof_s1)} rows")

clear_memory()

In [None]:
# ==============================================================================
# 6.4 Train Scenario 2 Model
# ==============================================================================
print(f"🏋️ Training Scenario 2 - {MODEL_TYPE.upper()}")
print("=" * 60)

# Build features for Scenario 2
print("Building Scenario 2 features...")
with timer("Feature engineering S2"):
    panel_s2 = get_panel(split='train', config=data_config, use_cache=True)
    panel_features_s2 = make_features(panel_s2, scenario=2, mode='train', config=features_config)

# Run cross-validation
models_s2, s2_cv_results, oof_s2 = run_cross_validation(
    panel_features=panel_features_s2,
    scenario=2,
    model_type=MODEL_TYPE,
    model_config=current_model_config,
    run_config=run_config,
    n_folds=N_FOLDS,
    save_oof=True,
    artifacts_dir=RUN_DIR / 'models_s2',
    run_id=RUN_ID,
)

print(f"\n✅ Scenario 2 Training Complete")
print(f"  Mean CV Official Metric: {s2_cv_results['cv_official_mean']:.6f} ± {s2_cv_results['cv_official_std']:.6f}")
print(f"  Mean CV RMSE: {s2_cv_results['cv_rmse_mean']:.6f} ± {s2_cv_results['cv_rmse_std']:.6f}")

# Save S2 OOF predictions
if len(oof_s2) > 0:
    oof_s2.to_csv(RUN_DIR / 'oof_s2.csv', index=False)
    print(f"  OOF predictions saved: {len(oof_s2)} rows")

clear_memory()

# Summary
print("\n" + "=" * 60)
print("📊 TRAINING SUMMARY")
print("=" * 60)
print(f"  Model: {MODEL_TYPE.upper()}")
print(f"  GPU: {'Enabled' if USE_GPU and GPU_AVAILABLE else 'Disabled'}")
print(f"  Scenario 1 CV: {s1_cv_results['cv_official_mean']:.6f} ± {s1_cv_results['cv_official_std']:.6f}")
print(f"  Scenario 2 CV: {s2_cv_results['cv_official_mean']:.6f} ± {s2_cv_results['cv_official_std']:.6f}")
print(f"  Models saved to: {RUN_DIR}")
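
The summary reports each scenario as a mean ± standard deviation over the CV folds. A minimal sketch of that aggregation is shown here, assuming `run_cross_validation` collects one score per fold and uses the population standard deviation (NumPy's default). The helper name `summarize_folds` and the fold values are illustrative assumptions, not project code or real results.

```python
from statistics import mean, pstdev


def summarize_folds(fold_scores):
    """Collapse per-fold scores into a mean/std pair like the one printed above."""
    if not fold_scores:
        raise ValueError("need at least one fold score")
    # pstdev is the population standard deviation (divides by n, like np.std)
    return {"mean": mean(fold_scores), "std": pstdev(fold_scores)}


# Illustrative per-fold values only
example = summarize_folds([0.50, 0.54, 0.52])
print(f"{example['mean']:.6f} ± {example['std']:.6f}")
```

A large standard deviation relative to the mean suggests the score depends heavily on which series land in the validation fold, which is worth checking before comparing models on the mean alone.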

### 6.5 Advanced Training Options

The cells below provide advanced training options:
- **Multi-Model Training** (6.5a): Train XGBoost, LightGBM, and CatBoost and compare their CV scores
- **Hyperparameter Sweep** (6.5b): Evaluate the sweep configurations defined in `configs/model_*.yaml` with K-fold CV
- **Ensemble** (6.5c): Blend XGBoost and LightGBM predictions using weights fitted on out-of-fold predictions

⚠️ These cells are optional and computationally intensive. Skip to Section 7 for a basic submission.

In [None]:
# ==============================================================================
# 6.5a Train All Models (Compare Performance)
# ==============================================================================
# Set RUN_ALL_MODELS = True to train all three models and compare

RUN_ALL_MODELS = False  # ⚠️ Set to True to run (takes ~15-30 min with GPU)

if RUN_ALL_MODELS:
    all_model_results = {}
    
    # Models to compare (tree-based models with GPU support)
    models_to_compare = ['xgboost', 'lightgbm', 'catboost']
    
    for model_name in models_to_compare:
        if model_name not in model_configs:
            print(f"⚠️ Config not found for {model_name}, skipping")
            continue
            
        print(f"\n{'='*60}")
        print(f"🏋️ Training {model_name.upper()}")
        print(f"{'='*60}")
        
        # Get GPU-enabled config
        model_cfg = get_gpu_model_config(model_name, model_configs[model_name], USE_GPU)
        
        # Train S1
        panel_s1 = get_panel(split='train', config=data_config, use_cache=True)
        panel_features_s1 = make_features(panel_s1, scenario=1, mode='train', config=features_config)
        
        s1_models, s1_results, _ = run_cross_validation(
            panel_features=panel_features_s1,
            scenario=1,
            model_type=model_name,
            model_config=model_cfg,
            run_config=run_config,
            n_folds=N_FOLDS,
            artifacts_dir=RUN_DIR / f'{model_name}_s1',
            run_id=RUN_ID
        )
        
        # Train S2
        panel_s2 = get_panel(split='train', config=data_config, use_cache=True)
        panel_features_s2 = make_features(panel_s2, scenario=2, mode='train', config=features_config)
        
        s2_models, s2_results, _ = run_cross_validation(
            panel_features=panel_features_s2,
            scenario=2,
            model_type=model_name,
            model_config=model_cfg,
            run_config=run_config,
            n_folds=N_FOLDS,
            artifacts_dir=RUN_DIR / f'{model_name}_s2',
            run_id=RUN_ID
        )
        
        all_model_results[model_name] = {
            's1_mean': s1_results['cv_official_mean'],
            's1_std': s1_results['cv_official_std'],
            's2_mean': s2_results['cv_official_mean'],
            's2_std': s2_results['cv_official_std'],
        }
        
        print(f"  S1: {s1_results['cv_official_mean']:.4f} ± {s1_results['cv_official_std']:.4f}")
        print(f"  S2: {s2_results['cv_official_mean']:.4f} ± {s2_results['cv_official_std']:.4f}")
        
        clear_memory()
    
    # Display comparison table
    print("\n" + "="*60)
    print("📊 MODEL COMPARISON")
    print("="*60)
    comparison_df = pd.DataFrame([
        {
            'Model': name.upper(),
            'S1 Mean': f"{r['s1_mean']:.4f}",
            'S1 Std': f"±{r['s1_std']:.4f}",
            'S2 Mean': f"{r['s2_mean']:.4f}",
            'S2 Std': f"±{r['s2_std']:.4f}",
        }
        for name, r in all_model_results.items()
    ])
    display(comparison_df)
    
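    # Find best model (lowest combined S1 + S2 score); skip if nothing was trained
    if all_model_results:
        best_model = min(all_model_results.keys(), 
                         key=lambda m: all_model_results[m]['s1_mean'] + all_model_results[m]['s2_mean'])
        print(f"\n🏆 Best overall model: {best_model.upper()}")
    else:
        print("⚠️ No models were trained; check that model_configs contains the expected entries")
else:
    print("ℹ️ Set RUN_ALL_MODELS = True to train and compare all models")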

In [None]:
# ==============================================================================
# 6.5b Hyperparameter Sweep with K-Fold CV (using consolidated configs)
# ==============================================================================
# Run a grid search over hyperparameters with cross-validation
# Sweep parameters are defined in configs/model_*.yaml files

RUN_SWEEP = False  # ⚠️ Set to True to run (takes ~30-60 min)

if RUN_SWEEP:
    from src.config_sweep import generate_sweep_runs, get_config_by_id
    
    # Select model for sweep (use consolidated config files)
    SWEEP_MODEL = 'xgboost'  # 'xgboost', 'lightgbm', 'catboost'
    SWEEP_FOLDS = 3  # Fewer folds for faster sweep
    
    if SWEEP_MODEL not in model_configs:
        print(f"❌ Config not found for {SWEEP_MODEL}")
    else:
        # Get sweep configuration from consolidated model config
        sweep_model_config = model_configs[SWEEP_MODEL]
        sweep_config = sweep_model_config.get('sweep', {})
        sweep_configs_list = sweep_model_config.get('sweep_configs', [])
        selection_metric = sweep_config.get('selection_metric', 'official_metric')
        
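        print(f"🔍 Running {SWEEP_MODEL.upper()} hyperparameter sweep...")
        # Map model names to config file suffixes (model_xgb.yaml, model_lgbm.yaml, model_cat.yaml)
        config_suffix = {'xgboost': 'xgb', 'lightgbm': 'lgbm', 'catboost': 'cat'}.get(SWEEP_MODEL, SWEEP_MODEL)
        print(f"  Config file: configs/model_{config_suffix}.yaml")
        print(f"  Selection metric: {selection_metric}")
        print(f"  Folds: {SWEEP_FOLDS}")
        print(f"  GPU: {'Enabled' if USE_GPU and GPU_AVAILABLE else 'Disabled'}")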
        
        if sweep_configs_list:
            print(f"\n  Named configurations to sweep: {len(sweep_configs_list)}")
            for cfg in sweep_configs_list:
                print(f"    - {cfg.get('id', 'unnamed')}: {cfg.get('description', '')}")
        
        # Run sweep for both scenarios
        sweep_all_results = {}
        
        for scenario in [1, 2]:
            print(f"\n{'='*60}")
            print(f"Scenario {scenario} Sweep")
            print(f"{'='*60}")
            
            # Build features for this scenario
            panel = get_panel(split='train', config=data_config, use_cache=True)
            panel_features = make_features(panel, scenario=scenario, mode='train', config=features_config)
            
            config_results = []
            
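            # Fall back to the base parameters when no named sweep configs are defined,
            # so the best-config selection below never receives an empty list
            for cfg in (sweep_configs_list or [{'id': 'default', 'params': {}}]):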
                config_id = cfg.get('id', 'default')
                config_params = cfg.get('params', {})
                
                print(f"\n  Testing config: {config_id}")
                
                # Merge params with base config
                test_config = get_gpu_model_config(SWEEP_MODEL, sweep_model_config, USE_GPU)
                if 'model' in test_config and 'params' in test_config['model']:
                    test_config['model']['params'].update(config_params)
                elif 'params' in test_config:
                    test_config['params'].update(config_params)
                
                # Run CV for this config
                _, cv_results, _ = run_cross_validation(
                    panel_features=panel_features,
                    scenario=scenario,
                    model_type=SWEEP_MODEL,
                    model_config=test_config,
                    run_config=run_config,
                    n_folds=SWEEP_FOLDS,
                    save_oof=False,
                )
                
                config_results.append({
                    'config_id': config_id,
                    'official_mean': cv_results['cv_official_mean'],
                    'official_std': cv_results['cv_official_std'],
                    'rmse_mean': cv_results['cv_rmse_mean'],
                })
                
                print(f"    official: {cv_results['cv_official_mean']:.4f} ± {cv_results['cv_official_std']:.4f} | rmse: {cv_results['cv_rmse_mean']:.4f}")
            
            # Find best config
            if selection_metric == 'official_metric':
                best_config = min(config_results, key=lambda x: x['official_mean'])
            else:
                best_config = min(config_results, key=lambda x: x['rmse_mean'])
            
            sweep_all_results[scenario] = {
                'best_config': best_config['config_id'],
                'best_mean_metric': best_config['official_mean'],
                'best_std_metric': best_config['official_std'],
                'all_results': config_results,
            }
            
            print(f"\n✅ Best config (selected by {selection_metric}): {best_config['config_id']}")
            print(f"   Mean official metric: {best_config['official_mean']:.4f} ± {best_config['official_std']:.4f}")
            
            clear_memory()
        
        # Summary
        print("\n" + "="*60)
        print(f"📊 SWEEP SUMMARY ({SWEEP_MODEL.upper()})")
        print("="*60)
        for s in [1, 2]:
            r = sweep_all_results[s]
            print(f"  Scenario {s}: {r['best_mean_metric']:.4f}")
            print(f"    Best config: {r['best_config']}")
else:
    print("ℹ️ Set RUN_SWEEP = True to run hyperparameter sweep")
    print("   Sweep configuration is in configs/model_*.yaml files")
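
Each sweep entry overrides a few parameters on top of the base model config before cross-validation runs. A minimal sketch of that merge step is given here, under the assumption that configs keep parameters under `model.params` or a top-level `params` key as in the cells above. The helper name `apply_overrides` and the example values are hypothetical.

```python
import copy


def apply_overrides(base_config, overrides):
    """Return a deep copy of base_config with overrides merged into its params."""
    merged = copy.deepcopy(base_config)  # never mutate the shared base config
    if "model" in merged and "params" in merged["model"]:
        merged["model"]["params"].update(overrides)
    else:
        # Create a top-level params dict if none exists, so overrides are not lost
        merged.setdefault("params", {}).update(overrides)
    return merged


# Hypothetical base config and named sweep entries
base = {"model": {"params": {"max_depth": 6, "learning_rate": 0.05}}}
sweep_entries = [
    {"id": "shallow", "params": {"max_depth": 4}},
    {"id": "fast", "params": {"learning_rate": 0.1}},
]

for entry in sweep_entries:
    cfg = apply_overrides(base, entry.get("params", {}))
    print(entry["id"], cfg["model"]["params"])
```

Copying before merging matters because the same base config is reused for every entry; updating it in place would let one entry's overrides leak into the next.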

In [None]:
# ==============================================================================
# 6.5c XGBoost + LightGBM Ensemble
# ==============================================================================
# Train both XGBoost and LightGBM and combine predictions
# Ensemble settings are defined in configs/model_xgb.yaml and configs/model_lgbm.yaml

RUN_ENSEMBLE = False  # ⚠️ Set to True to run (takes ~10-20 min)

if RUN_ENSEMBLE:
    print("🤝 Training XGBoost + LightGBM Ensemble...")
    print("=" * 60)
    
    ensemble_results = {}
    
    for scenario in [1, 2]:
        print(f"\n{'='*60}")
        print(f"Scenario {scenario} Ensemble")
        print(f"{'='*60}")
        
        # Build features
        panel = get_panel(split='train', config=data_config, use_cache=True)
        panel_features = make_features(panel, scenario=scenario, mode='train', config=features_config)
        
        # Train XGBoost
        print("\n  Training XGBoost...")
        xgb_cfg = get_gpu_model_config('xgboost', model_configs['xgboost'], USE_GPU)
        xgb_models, xgb_results, xgb_oof = run_cross_validation(
            panel_features=panel_features,
            scenario=scenario,
            model_type='xgboost',
            model_config=xgb_cfg,
            run_config=run_config,
            n_folds=N_FOLDS,
            save_oof=True,
            artifacts_dir=RUN_DIR / f'ensemble_xgb_s{scenario}',
        )
        
        # Train LightGBM
        print("\n  Training LightGBM...")
        lgbm_cfg = get_gpu_model_config('lightgbm', model_configs['lightgbm'], USE_GPU)
        lgbm_models, lgbm_results, lgbm_oof = run_cross_validation(
            panel_features=panel_features,
            scenario=scenario,
            model_type='lightgbm',
            model_config=lgbm_cfg,
            run_config=run_config,
            n_folds=N_FOLDS,
            save_oof=True,
            artifacts_dir=RUN_DIR / f'ensemble_lgbm_s{scenario}',
        )
        
        # Optimize ensemble weights using OOF predictions
        print("\n  Optimizing ensemble weights...")
        from src.models.ensemble import EnsembleBlender
        
        # Align OOF predictions
        oof_merged = xgb_oof.merge(
            lgbm_oof[['country', 'brand_name', 'months_postgx', 'y_pred']],
            on=['country', 'brand_name', 'months_postgx'],
            suffixes=('_xgb', '_lgbm')
        )
        
        if 'y_pred_xgb' not in oof_merged.columns:
            oof_merged['y_pred_xgb'] = oof_merged['y_pred']
        
        # Fit blender
        blender = EnsembleBlender(constrain_weights=True)
        blender.fit(
            predictions={
                'xgboost': oof_merged['y_pred_xgb'].values,
                'lightgbm': oof_merged['y_pred_lgbm'].values,
            },
            y_true=oof_merged['y_true'].values
        )
        
        weights = blender.get_weights()
        
        # Compute ensemble OOF metric
        ensemble_preds = (
            weights.get('xgboost', 0.5) * oof_merged['y_pred_xgb'].values +
            weights.get('lightgbm', 0.5) * oof_merged['y_pred_lgbm'].values
        )
        ensemble_rmse = np.sqrt(np.mean((ensemble_preds - oof_merged['y_true'].values) ** 2))
        
        ensemble_results[scenario] = {
            'xgb_metric': xgb_results['cv_official_mean'],
            'lgbm_metric': lgbm_results['cv_official_mean'],
            'xgb_rmse': xgb_results['cv_rmse_mean'],
            'lgbm_rmse': lgbm_results['cv_rmse_mean'],
            'ensemble_rmse': ensemble_rmse,
            'weights': weights,
            'xgb_models': xgb_models,
            'lgbm_models': lgbm_models,
        }
        
        print(f"\n✅ Scenario {scenario} Ensemble Results:")
        print(f"   XGBoost:  {xgb_results['cv_official_mean']:.4f} (RMSE: {xgb_results['cv_rmse_mean']:.4f})")
        print(f"   LightGBM: {lgbm_results['cv_official_mean']:.4f} (RMSE: {lgbm_results['cv_rmse_mean']:.4f})")
        print(f"   Ensemble RMSE: {ensemble_rmse:.4f}")
        print(f"   Weights:  XGB={weights.get('xgboost', 0.5):.2f}, LGBM={weights.get('lightgbm', 0.5):.2f}")
        
        clear_memory()
    
    # Save ensemble configuration for inference
    import json
    ensemble_output = {
        's1_weights': ensemble_results[1]['weights'],
        's2_weights': ensemble_results[2]['weights'],
        's1_xgb_metric': ensemble_results[1]['xgb_metric'],
        's1_lgbm_metric': ensemble_results[1]['lgbm_metric'],
        's2_xgb_metric': ensemble_results[2]['xgb_metric'],
        's2_lgbm_metric': ensemble_results[2]['lgbm_metric'],
    }
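    with open(RUN_DIR / 'ensemble_config.json', 'w') as f:
        # default=float converts NumPy scalar weights/metrics that json cannot serialize natively
        json.dump(ensemble_output, f, indent=2, default=float)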
    
    print("\n" + "="*60)
    print("📊 ENSEMBLE SUMMARY")
    print("="*60)
    for s in [1, 2]:
        r = ensemble_results[s]
        print(f"  Scenario {s}:")
        print(f"    XGB: {r['xgb_metric']:.4f}, LGBM: {r['lgbm_metric']:.4f}")
        w = r['weights']
        print(f"    Weights: XGB={w.get('xgboost', 0.5):.0%}, LGBM={w.get('lightgbm', 0.5):.0%}")
else:
    print("ℹ️ Set RUN_ENSEMBLE = True to train XGBoost + LightGBM ensemble")
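
The ensemble cell fits blend weights on out-of-fold predictions. How `EnsembleBlender` does this internally is not shown in this notebook. For two models with weights constrained to sum to one, the RMSE-optimal weight has a closed form, sketched here as one possible approach rather than the project's implementation. The function names and toy values are assumptions.

```python
def blend_weight(pred_a, pred_b, y_true):
    """Least-squares weight w for w*pred_a + (1-w)*pred_b, clipped to [0, 1]."""
    # Minimizing sum((y - b - w*(a - b))^2) over w gives num / den
    num = sum((a - b) * (y - b) for a, b, y in zip(pred_a, pred_b, y_true))
    den = sum((a - b) ** 2 for a, b in zip(pred_a, pred_b))
    if den == 0:
        return 0.5  # identical predictions: every weight gives the same blend
    return min(1.0, max(0.0, num / den))


def rmse(pred, y_true):
    """Root mean squared error between two equal-length sequences."""
    return (sum((p - y) ** 2 for p, y in zip(pred, y_true)) / len(y_true)) ** 0.5


# Toy OOF predictions (illustrative values only)
y = [1.0, 0.8, 0.6, 0.5]
xgb = [1.1, 0.9, 0.7, 0.6]   # biased high
lgbm = [0.9, 0.7, 0.5, 0.4]  # biased low

w = blend_weight(xgb, lgbm, y)
blend = [w * a + (1 - w) * b for a, b in zip(xgb, lgbm)]
print(f"w_xgb={w:.2f}, blend RMSE={rmse(blend, y):.4f}")
```

Because the weight is fitted on out-of-fold predictions, the blended RMSE is not an inflated in-sample estimate, though it is still slightly optimistic since the weight itself was tuned on those rows.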

## 7. Generate Submission

Generate predictions on test data and create submission files.

In [None]:
# ==============================================================================
# 7.1 Build Test Features and Generate Predictions
# ==============================================================================
import joblib
from glob import glob

print("📤 Generating submission...")

# Build test features for Scenario 1
print("  Building S1 test features...")
test_panel = get_panel(split='test', config=data_config, use_cache=True)
test_panel_s1 = make_features(test_panel.copy(), scenario=1, mode='test', config=features_config)

# Build test features for Scenario 2  
print("  Building S2 test features...")
test_panel_s2 = make_features(test_panel.copy(), scenario=2, mode='test', config=features_config)

# Detect which test samples belong to which scenario
test_scenarios = detect_test_scenarios(test_panel)
print(f"  Scenario 1 test series: {len(test_scenarios[1])}")
print(f"  Scenario 2 test series: {len(test_scenarios[2])}")

# Split features and meta for predictions
from src.train import split_features_target_meta

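# Test data has no y_norm column, so split_features_target_meta does not apply;
# a small local helper separates features from meta columns instead.
def get_test_features_meta(panel_features, scenario_series):
    """Extract features and meta for test prediction."""
    # Filter to scenario-relevant series, given as (country, brand_name) pairs
    if scenario_series is not None and len(scenario_series) > 0:
        # Index.isin returns a boolean NumPy array aligned with the rows,
        # so it can be used directly as a mask (it has no reset_index method)
        mask = panel_features.set_index(['country', 'brand_name']).index.isin(scenario_series)
        panel_subset = panel_features[mask].copy()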
    else:
        panel_subset = panel_features.copy()
    
    # Separate features from meta
    from src.data import META_COLS
    feature_cols = [c for c in panel_subset.columns if c not in META_COLS]
    meta_cols = [c for c in META_COLS if c in panel_subset.columns]
    
    X = panel_subset[feature_cols].copy()
    meta = panel_subset[meta_cols].copy()
    
    return X, meta

# Get S1 test features
X_test_s1, meta_test_s1 = get_test_features_meta(test_panel_s1, test_scenarios[1])
print(f"  S1 test features: {X_test_s1.shape}")

# Get S2 test features
X_test_s2, meta_test_s2 = get_test_features_meta(test_panel_s2, test_scenarios[2])
print(f"  S2 test features: {X_test_s2.shape}")

print(f"\n  Loading {MODEL_TYPE.upper()} models and predicting...")

# Scenario 1 predictions (average across folds)
s1_preds_list = []
s1_model_dir = RUN_DIR / 'models_s1'
if ModelClass is not None and s1_model_dir.exists():
    for model_path in sorted(s1_model_dir.glob('model_fold*.bin')):
        model = ModelClass.load(str(model_path), current_model_config)
        preds = model.predict(X_test_s1)
        s1_preds_list.append(preds)
        print(f"    Loaded {model_path.name}")

if s1_preds_list:
    s1_test_preds = np.mean(s1_preds_list, axis=0)
    print(f"  ✅ S1: Averaged {len(s1_preds_list)} fold predictions")
else:
    print("  ⚠️ No S1 models found, using baseline predictions (y_norm=1.0)")
    s1_test_preds = np.ones(len(X_test_s1))

# Scenario 2 predictions (average across folds)
s2_preds_list = []
s2_model_dir = RUN_DIR / 'models_s2'
if ModelClass is not None and s2_model_dir.exists():
    for model_path in sorted(s2_model_dir.glob('model_fold*.bin')):
        model = ModelClass.load(str(model_path), current_model_config)
        preds = model.predict(X_test_s2)
        s2_preds_list.append(preds)
        print(f"    Loaded {model_path.name}")

if s2_preds_list:
    s2_test_preds = np.mean(s2_preds_list, axis=0)
    print(f"  ✅ S2: Averaged {len(s2_preds_list)} fold predictions")
else:
    print("  ⚠️ No S2 models found, using baseline predictions (y_norm=1.0)")
    s2_test_preds = np.ones(len(X_test_s2))

print(f"\n  S1 predictions: {len(s1_test_preds):,}")
print(f"  S2 predictions: {len(s2_test_preds):,}")
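
Test predictions are the element-wise average of the per-fold models, with a constant baseline of `y_norm=1.0` when no saved model is found. A minimal pure-Python sketch of that logic follows. The function name `average_fold_predictions` is hypothetical, and the notebook itself uses `np.mean(..., axis=0)` for the same purpose.

```python
def average_fold_predictions(fold_preds, n_rows, baseline=1.0):
    """Average per-fold prediction lists element-wise; fall back to a constant."""
    if not fold_preds:
        # No fold models available: predict the baseline for every row
        return [baseline] * n_rows
    if any(len(p) != n_rows for p in fold_preds):
        raise ValueError("every fold must predict the same number of rows")
    # zip(*fold_preds) groups the predictions for each row across folds
    return [sum(col) / len(fold_preds) for col in zip(*fold_preds)]


print(average_fold_predictions([[1.0, 0.5], [0.5, 0.25]], n_rows=2))
print(average_fold_predictions([], n_rows=3))
```

The length check guards against a silent misalignment if one fold model was applied to a different feature matrix than the others.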

### 7.2 Create Submission File

In [None]:
# ==============================================================================
# 7.2 Create and Save Submission
# ==============================================================================

# Create submission dataframes
# Convert y_norm predictions back to absolute volume using avg_vol_12m

# Scenario 1 submission
submission_s1 = meta_test_s1[['country', 'brand_name', 'months_postgx']].copy()
if 'avg_vol_12m' in meta_test_s1.columns:
    submission_s1['volume'] = s1_test_preds * meta_test_s1['avg_vol_12m'].values
else:
    # avg_vol_12m is unavailable: use predictions as-is (assumes they are already in volume scale)
    submission_s1['volume'] = s1_test_preds
    print("  ⚠️ S1: avg_vol_12m not found, using predictions as-is")

# Scenario 2 submission
submission_s2 = meta_test_s2[['country', 'brand_name', 'months_postgx']].copy()
if 'avg_vol_12m' in meta_test_s2.columns:
    submission_s2['volume'] = s2_test_preds * meta_test_s2['avg_vol_12m'].values
else:
    submission_s2['volume'] = s2_test_preds
    print("  ⚠️ S2: avg_vol_12m not found, using predictions as-is")

# Combine submissions
submission = pd.concat([submission_s1, submission_s2], ignore_index=True)

# Clip negative volumes to 0
submission['volume'] = submission['volume'].clip(lower=0)

# Validate submission format
try:
    is_valid, issues = validate_submission_format(submission)
    if is_valid:
        print("✅ Submission format validated")
    else:
        print(f"⚠️ Validation issues: {issues}")
except Exception as e:
    print(f"⚠️ Could not validate format: {e}")

# Save submission
submission_path = Path(SUBMISSIONS_PATH) / f"submission_{RUN_ID}.csv"
submission.to_csv(submission_path, index=False)

print(f"\n📄 Submission saved to: {submission_path}")
print(f"  Shape: {submission.shape}")
print(f"  Columns: {list(submission.columns)}")

# Statistics
print(f"\n📊 Submission Statistics:")
print(f"  Volume min: {submission['volume'].min():.2f}")
print(f"  Volume max: {submission['volume'].max():.2f}")
print(f"  Volume mean: {submission['volume'].mean():.2f}")
print(f"  Volume median: {submission['volume'].median():.2f}")

# Count by scenario
n_s1 = len(submission_s1)
n_s2 = len(submission_s2)
print(f"\n  Scenario 1 rows: {n_s1:,}")
print(f"  Scenario 2 rows: {n_s2:,}")
print(f"  Total rows: {len(submission):,}")

# Preview
print(f"\n📋 Preview:")
display(submission.head(10))
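
The submission step rescales normalized predictions to absolute volume and floors negative values at zero. A minimal sketch of that conversion is shown here, assuming `y_norm` is volume divided by `avg_vol_12m`, as the comments in the cell above imply. The helper name `to_volume` and the numbers are illustrative.

```python
def to_volume(y_norm_preds, avg_vols):
    """Convert normalized predictions to absolute volume, floored at zero."""
    if len(y_norm_preds) != len(avg_vols):
        raise ValueError("predictions and avg_vol_12m must align row by row")
    # Multiply each prediction by its series' average volume, then clip negatives
    return [max(0.0, p * v) for p, v in zip(y_norm_preds, avg_vols)]


print(to_volume([0.5, -0.25, 1.0], [1000.0, 400.0, 250.0]))
```

The row-alignment check mirrors the implicit requirement in the notebook that `s1_test_preds` and `meta_test_s1` come from the same filtered panel in the same order.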

In [None]:
# ==============================================================================
# 7.3 Download Submission (Colab only)
# ==============================================================================
if IN_COLAB:
    print("📥 Downloading submission file...")
    try:
        from google.colab import files
        files.download(str(submission_path))
        print("✅ Download started!")
    except Exception as e:
        print(f"⚠️ Could not start download: {e}")
        print(f"   File is saved at: {submission_path}")
else:
    print(f"📄 Submission available at: {submission_path}")

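# Also sync to Drive if in Colab
if IN_COLAB:
    try:
        # Copy to Drive submissions folder; skip the copy when the file is already
        # there, since shutil.copy raises SameFileError for identical source and target
        import shutil
        drive_submission_path = Path(SUBMISSIONS_PATH) / f"submission_{RUN_ID}.csv"
        if drive_submission_path.resolve() != Path(submission_path).resolve():
            shutil.copy(str(submission_path), str(drive_submission_path))
        print(f"☁️ Saved to Google Drive: {drive_submission_path}")
    except Exception as e:
        print(f"⚠️ Could not save to Drive: {e}")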

# Also save run summary
try:
    summary = {
        'run_id': RUN_ID,
        'model_type': MODEL_TYPE,
        'training_mode': TRAINING_MODE,
        'n_folds': N_FOLDS,
        'gpu_used': USE_GPU and GPU_AVAILABLE,
        'timestamp': datetime.now().isoformat(),
        'submission_path': str(submission_path),
        'submission_rows': len(submission),
    }
    
    if 's1_cv_results' in dir():
        summary['s1_cv_official_mean'] = s1_cv_results.get('cv_official_mean')
        summary['s1_cv_official_std'] = s1_cv_results.get('cv_official_std')
    
    if 's2_cv_results' in dir():
        summary['s2_cv_official_mean'] = s2_cv_results.get('cv_official_mean')
        summary['s2_cv_official_std'] = s2_cv_results.get('cv_official_std')
    
    summary_path = RUN_DIR / 'run_summary.json'
    import json
    with open(summary_path, 'w') as f:
        json.dump(summary, f, indent=2, default=str)
    print(f"📄 Run summary saved: {summary_path}")
    
except Exception as e:
    print(f"⚠️ Could not save run summary: {e}")

## 8. Utilities

Helper functions for common operations.

In [None]:
# ==============================================================================
# 8.1 Utility Functions
# ==============================================================================

def show_memory():
    """Display current memory usage."""
    mem = get_memory_usage()
    print(f"💾 Memory Usage:")
    print(f"  Process: {mem.get('process_rss_gb', 0):.2f} GB")
    print(f"  System: {mem.get('system_used_percent', 0):.1f}% used")
    if 'gpu_allocated_gb' in mem:
        print(f"  GPU: {mem['gpu_allocated_gb']:.2f} GB allocated")

def free_memory():
    """Free unused memory."""
    before = get_memory_usage().get('process_rss_gb', 0)
    clear_memory()
    after = get_memory_usage().get('process_rss_gb', 0)
    print(f"🧹 Freed {max(0, before - after):.2f} GB")

def download_artifacts():
    """Download all artifacts as a zip file (Colab only)."""
    if not IN_COLAB:
        print(f"📁 Artifacts at: {RUN_DIR}")
        return
    
    import shutil
    zip_path = f"/content/artifacts_{RUN_ID}.zip"
    shutil.make_archive(zip_path.replace('.zip', ''), 'zip', str(RUN_DIR))
    
    from google.colab import files
    files.download(zip_path)
    print(f"📦 Downloaded: artifacts_{RUN_ID}.zip")

def restart_runtime():
    """Restart Colab runtime to free memory."""
    if IN_COLAB:
        import os
        os.kill(os.getpid(), 9)

def check_gpu():
    """Check GPU status and memory."""
    print("🖥️ GPU Status:")
    if IN_COLAB:
        # The fallback after `||` runs in the shell, so it must be a shell command (echo), not Python print()
        !nvidia-smi --query-gpu=name,memory.used,memory.total,utilization.gpu --format=csv,noheader 2>/dev/null || echo "  GPU not available"
    else:
        gpu_info = get_gpu_info()
        if gpu_info.get('gpu_available'):
            print(f"  Device: {gpu_info.get('device_name', 'Unknown')}")
            print(f"  CUDA: {gpu_info.get('cuda_version', 'N/A')}")
        else:
            print("  GPU not available")

def list_available_configs():
    """List all available model configurations."""
    print("📋 Available Model Configurations:")
    for model_name, cfg in model_configs.items():
        sweep_configs = cfg.get('sweep_configs', [])
        n_configs = len(sweep_configs) if sweep_configs else 0
        active = cfg.get('active_config_id', 'none')
        print(f"\n  {model_name.upper()}:")
        print(f"    Active config: {active}")
        print(f"    Sweep configs: {n_configs}")
        if sweep_configs:
            for sc in sweep_configs[:3]:
                print(f"      - {sc.get('id', 'unnamed')}: {sc.get('description', '')[:40]}")
            if len(sweep_configs) > 3:
                print(f"      ... and {len(sweep_configs) - 3} more")

def save_run_summary():
    """Save a summary of the current run."""
    summary = {
        'run_id': RUN_ID,
        'model_type': MODEL_TYPE,
        'training_mode': TRAINING_MODE,
        'n_folds': N_FOLDS,
        'gpu_used': USE_GPU and GPU_AVAILABLE,
        'timestamp': datetime.now().isoformat(),
    }
    
    # dir() inside a function lists only local names, so check the notebook globals instead
    if 's1_cv_results' in globals():
        summary['s1_cv_official_mean'] = s1_cv_results.get('cv_official_mean')
        summary['s1_cv_official_std'] = s1_cv_results.get('cv_official_std')
    
    if 's2_cv_results' in globals():
        summary['s2_cv_official_mean'] = s2_cv_results.get('cv_official_mean')
        summary['s2_cv_official_std'] = s2_cv_results.get('cv_official_std')
    
    summary_path = RUN_DIR / 'run_summary.json'
    import json
    with open(summary_path, 'w') as f:
        json.dump(summary, f, indent=2, default=str)
    
    print(f"📄 Run summary saved to: {summary_path}")
    return summary

print("🛠️ Utility functions available:")
print("  - show_memory(): Display memory usage")
print("  - free_memory(): Free unused memory")
print("  - check_gpu(): Check GPU status and memory")
print("  - download_artifacts(): Download all run artifacts")
print("  - restart_runtime(): Restart Colab runtime")
print("  - list_available_configs(): List model configurations")
print("  - save_run_summary(): Save run summary to file")
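
`download_artifacts` relies on `shutil.make_archive`, which appends the archive extension itself, so the base name must be passed without `.zip`. A minimal sketch of the same step with `pathlib` is shown here, run against a throwaway directory. The helper name `zip_run_dir` and the demo paths are assumptions for illustration.

```python
import shutil
import tempfile
from pathlib import Path


def zip_run_dir(run_dir, out_dir, run_id):
    """Zip run_dir into out_dir/artifacts_<run_id>.zip and return the archive path."""
    # No ".zip" suffix here: make_archive adds the extension for the chosen format
    base_name = Path(out_dir) / f"artifacts_{run_id}"
    return Path(shutil.make_archive(str(base_name), "zip", root_dir=str(run_dir)))


# Demonstration on a temporary directory standing in for RUN_DIR
with tempfile.TemporaryDirectory() as tmp:
    run_dir = Path(tmp) / "run"
    run_dir.mkdir()
    (run_dir / "run_summary.json").write_text("{}")
    archive = zip_run_dir(run_dir, tmp, run_id="demo")
    archive_name, archive_exists = archive.name, archive.exists()
    print(archive_name, archive_exists)
```

Writing the archive outside the directory being zipped, as done here and in `download_artifacts`, avoids the archive trying to include itself.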

In [None]:
# ==============================================================================
# 8.2 Data Setup Helper (if data is not automatically found)
# ==============================================================================
# Run this cell if your data is not in the expected location

def setup_data_from_drive():
    """
    Helper to set up data from Google Drive.
    
    This function helps locate and link your data if it's in a different location.
    """
    if not IN_COLAB:
        print("ℹ️ Not in Colab - using local data paths")
        return
    
    print("🔍 Searching for data directories on Google Drive...")
    
    # Common locations to check
    search_paths = [
        "/content/drive/MyDrive/novartis-datathon-2025/data",
        "/content/drive/MyDrive/Colab Notebooks/novartis_data",
        "/content/drive/MyDrive/data",
        "/content/drive/MyDrive/novartis/data",
    ]
    
    found_path = None
    for path in search_paths:
        if os.path.exists(path):
            # Check if it has the expected structure
            raw_train = os.path.join(path, "raw", "TRAIN")
            raw_test = os.path.join(path, "raw", "TEST")
            
            if os.path.exists(raw_train) or os.path.exists(raw_test):
                found_path = path
                print(f"  ✅ Found data at: {path}")
                break
            else:
                print(f"  ⚠️ Found {path} but it contains neither raw/TRAIN nor raw/TEST")
    
    if found_path:
        # Create symlink
        local_data = os.path.join(PROJECT_PATH, "data")
        if os.path.islink(local_data):
            os.unlink(local_data)
        elif os.path.exists(local_data):
            # Caution: this permanently deletes the existing local data directory
            import shutil
            shutil.rmtree(local_data)
        
        os.symlink(found_path, local_data)
        print(f"  🔗 Linked data directory")
        
        # List contents
        print(f"\n📂 Data contents:")
        for item in os.listdir(found_path):
            item_path = os.path.join(found_path, item)
            if os.path.isdir(item_path):
                subitems = os.listdir(item_path)[:5]
                print(f"  📁 {item}/ ({len(os.listdir(item_path))} items)")
                for si in subitems:
                    print(f"      - {si}")
            else:
                print(f"  📄 {item}")
    else:
        print("\n❌ Data not found. Please:")
        print("   1. Upload your data to Google Drive")
        print("   2. Expected structure:")
        print("      /content/drive/MyDrive/novartis-datathon-2025/data/")
        print("        ├── raw/")
        print("        │   ├── TRAIN/")
        print("        │   │   ├── volume.parquet")
        print("        │   │   ├── generics.parquet")
        print("        │   │   └── medicine_info.parquet")
        print("        │   └── TEST/")
        print("        │       ├── volume.parquet")
        print("        │       ├── generics.parquet")
        print("        │       └── medicine_info.parquet")
        print("        └── processed/ (optional)")

# Uncomment to run:
# setup_data_from_drive()

## 📊 Summary

This notebook provides a complete training and submission pipeline for the Novartis Datathon 2025 competition.

### ✅ Features Implemented
- **Multi-Model Support**: 7 models available (XGBoost, LightGBM, CatBoost, Linear, Neural Network, Hybrid, ARIHOW)
- **Training Modes**: Cross-validation, full training, hyperparameter sweep
- **GPU Acceleration**: Automatic detection and utilization for tree models and neural networks
- **Google Drive Integration**: Seamless data loading from shared Drive folders
- **Memory Management**: Garbage collection and memory monitoring
- **Robust Error Handling**: Graceful degradation and informative error messages
- **Artifact Management**: Saves models, predictions, and metrics to organized directories

### 🤖 Available Models (Priority Order)

| Priority | Model | Config File | Description |
|----------|-------|-------------|-------------|
| 1 | XGBoost | `model_xgb.yaml` | Primary model - best official_metric performance |
| 2 | LightGBM | `model_lgbm.yaml` | Secondary - fast, good for ensemble |
| 2 | Hybrid | `model_hybrid.yaml` | Physics decay + ML residual learning |
| 3 | CatBoost | `model_cat.yaml` | Tertiary - ensemble diversity |
| 4 | Linear | `model_linear.yaml` | Baseline (Ridge/Lasso/ElasticNet/Huber) |
| 4 | ARIHOW | `model_arihow.yaml` | ARIMA + Holt-Winters hybrid |
| 5 | Neural Network | `model_nn.yaml` | Experimental PyTorch MLP |

### 🎯 Competition Metrics
- **Scenario 1**: Forecast months 0-23 (pre-entry data only)
- **Scenario 2**: Forecast months 6-23 (has early months 0-5)
- **Official Metric**: PE (Prediction Error) - lower is better
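
The two forecast windows listed above can be written as month ranges. This is a minimal sketch only: `forecast_months` is a hypothetical helper, not a function from this repository, and it assumes months are the integer indices used in the list.

```python
def forecast_months(scenario: int) -> range:
    """Return the month indices to forecast for a scenario (hypothetical helper)."""
    if scenario == 1:
        return range(0, 24)  # months 0-23; only pre-entry data is available
    if scenario == 2:
        return range(6, 24)  # months 6-23; months 0-5 are already observed
    raise ValueError(f"Unknown scenario: {scenario!r}")


# Scenario 1 covers 24 months, Scenario 2 covers 18
horizon_s1 = len(forecast_months(1))
horizon_s2 = len(forecast_months(2))
print(horizon_s1, horizon_s2)
```

Keeping the windows in one place like this may help avoid off-by-one errors when slicing predictions per scenario.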

### 📁 Output Files
All outputs are saved to:
- `artifacts/{run_id}/` - Models, metrics, and predictions
- `submissions/` - Final submission CSV files
- Google Drive (if in Colab): `/content/drive/MyDrive/novartis-datathon-2025/`
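
As an illustration of the layout above, the sketch below builds the `artifacts/{run_id}/` and `submissions/` directories under a project root. The helper name `run_output_dirs` and the `demo_run` ID are assumptions made for this example; the notebook's own path handling may differ.

```python
import tempfile
from pathlib import Path


def run_output_dirs(project_root, run_id):
    """Create and return the artifact and submission directories (hypothetical helper)."""
    root = Path(project_root)
    artifacts_dir = root / "artifacts" / run_id
    submissions_dir = root / "submissions"
    for directory in (artifacts_dir, submissions_dir):
        # parents=True creates artifacts/ if needed; exist_ok=True makes reruns safe
        directory.mkdir(parents=True, exist_ok=True)
    return artifacts_dir, submissions_dir


# Demonstrate against a throwaway directory rather than the real project
with tempfile.TemporaryDirectory() as tmp:
    artifacts_dir, submissions_dir = run_output_dirs(tmp, "demo_run")
    created = artifacts_dir.is_dir() and submissions_dir.is_dir()
    print(created)
```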

### 🔧 Configuration Options
| Option | Description | Default |
|--------|-------------|---------|
| `MODEL_TYPE` | Model architecture | `xgboost` |
| `TRAINING_MODE` | cv, quick, sweep, sweep_cv, ensemble, compare | `cv` |
| `N_FOLDS` | Cross-validation folds | `5` |
| `USE_GPU` | Enable GPU acceleration | `True` |
| `SEED` | Reproducibility seed | `42` |
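
One way to catch a mistyped option early is to validate the values before training starts. The sketch below mirrors the table's defaults in a small dataclass; `RunOptions` and `VALID_TRAINING_MODES` are hypothetical names for illustration, and the notebook itself sets these options as plain variables.

```python
from dataclasses import dataclass

# Training modes as listed in the options table
VALID_TRAINING_MODES = {"cv", "quick", "sweep", "sweep_cv", "ensemble", "compare"}


@dataclass
class RunOptions:
    """Defaults taken from the table above (hypothetical container)."""
    model_type: str = "xgboost"
    training_mode: str = "cv"
    n_folds: int = 5
    use_gpu: bool = True
    seed: int = 42

    def __post_init__(self):
        if self.training_mode not in VALID_TRAINING_MODES:
            raise ValueError(
                f"TRAINING_MODE must be one of {sorted(VALID_TRAINING_MODES)}, "
                f"got {self.training_mode!r}"
            )
        if self.n_folds < 2:
            raise ValueError("N_FOLDS must be at least 2 for cross-validation")


opts = RunOptions(training_mode="sweep", n_folds=3)
print(opts)
```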

### 📋 Config Files Structure
Each model config (`configs/model_*.yaml`) contains:
- `model`: Name, task type, priority
- `gpu`: GPU settings (enabled, device_id)
- `sweep`: Hyperparameter sweep configuration
- `sweep_configs`: Named parameter presets
- `scenario_best_params`: Best params per scenario from previous sweeps
- `params`: Default model parameters
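
A quick check that a loaded config has the sections listed above might look like the sketch below. It assumes the YAML file has already been parsed into a dictionary (for example with `yaml.safe_load`); `missing_sections` is a hypothetical helper, and the example values are placeholders rather than real settings from the repository.

```python
# Top-level sections named in the list above
EXPECTED_SECTIONS = (
    "model",
    "gpu",
    "sweep",
    "sweep_configs",
    "scenario_best_params",
    "params",
)


def missing_sections(config: dict) -> list:
    """Return the expected top-level sections absent from a parsed config."""
    return [name for name in EXPECTED_SECTIONS if name not in config]


# Placeholder config with only three of the six sections present
example_config = {
    "model": {"name": "xgboost"},
    "gpu": {"enabled": True, "device_id": 0},
    "params": {},
}
missing = missing_sections(example_config)
print(missing)
```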