<a href="https://colab.research.google.com/github/armanfeili/novartis_datathon_2025/blob/Arman/notebooks/colab/main.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# 🧬 Novartis Datathon 2025 - Complete Training Pipeline

This notebook provides a **complete end-to-end pipeline** for the Novartis Datathon 2025.

## Configuration Structure

Each model has **one consolidated config file** supporting all training modes:
- `configs/model_xgb.yaml` - XGBoost (primary model)
- `configs/model_lgbm.yaml` - LightGBM (secondary model)  
- `configs/model_cat.yaml` - CatBoost (tertiary model)

Each config includes: `model`, `sweep`, `scenario_best_params`, `validation`, `gpu`, `training`, `categorical_features`, `tuning`, `ensemble`
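
As a rough illustration of how such a consolidated file might be consumed, the sketch below models a config as a plain dictionary and layers scenario-specific overrides from `scenario_best_params` on top of `model.params`. The `scenario_1` key, every parameter value, and the helper `resolve_params` are assumptions made for illustration; they are not the repository's actual schema.

```python
from copy import deepcopy

# Hypothetical config mirroring some of the top-level sections listed above.
# All values are placeholders, not tuned parameters from the repository.
example_config = {
    "model": {"priority": 1, "params": {"max_depth": 6, "learning_rate": 0.05}},
    "sweep": {"selection_metric": "official_metric", "axes": {"max_depth": [4, 6]}},
    "scenario_best_params": {"scenario_1": {"max_depth": 4}},
}


def resolve_params(config, scenario):
    """Layer scenario-specific overrides on top of the base model params."""
    params = deepcopy(config.get("model", {}).get("params", {}))
    overrides = config.get("scenario_best_params", {}).get(f"scenario_{scenario}", {})
    params.update(overrides)
    return params


print(resolve_params(example_config, 1))  # scenario with an override
print(resolve_params(example_config, 2))  # scenario falling back to base params
```

Copying before updating keeps the base parameters untouched, so the same config object can be reused across scenarios.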

---

## Pipeline Sections

1. **🔧 Environment Setup** - Mount Drive, clone repo, install dependencies
2. **📊 Data Loading** - Load raw data and build panels
3. **🔬 Feature Engineering** - Build scenario-specific features  
4. **🏋️ Model Training** - Train with GPU acceleration (multiple modes)
5. **🔄 Hyperparameter Sweep** - Grid search with K-fold CV, select by **official_metric**
6. **🤝 Ensemble** - XGBoost + LightGBM weighted ensemble
7. **📤 Submission** - Generate competition submission files

---

## Training Modes

| Mode | Description | Use Case |
|------|-------------|----------|
| `quick` | Use best known params from `scenario_best_params` | Fast baseline |
| `cv` | K-fold CV with best params | Robust single model |
| `sweep` | Grid search with holdout | Find optimal params |
| `sweep_cv` | Grid search with K-fold CV | Most robust tuning |
| `ensemble` | XGB + LGBM weighted average | Best submission |
| `compare` | Compare all models and select the best by `official_metric` | Model selection |

---

## Key Principles
- ✅ **Selection by official_metric** (PE), not RMSE
- ✅ **K-fold CV** (3-5 folds) for robust hyperparameter selection
- ✅ **XGB+LGBM ensemble** with optimized weights
- ✅ **GPU acceleration** for all models on Colab

---

## 1. Environment Setup

In [None]:
# ==============================================================================
# 1.1 Detect Environment and Mount Google Drive
# ==============================================================================
import sys
import os
from pathlib import Path

# Detect if running in Colab
IN_COLAB = 'google.colab' in sys.modules

print(f"🖥️  Environment: {'Google Colab' if IN_COLAB else 'Local'}")
print(f"🐍 Python: {sys.version.split()[0]}")

if IN_COLAB:
    from google.colab import drive
    drive.mount('/content/drive')
    print("✅ Google Drive mounted successfully")
else:
    print("⚠️ Not running in Colab - using local paths")

Drive already mounted at /content/drive; to attempt to forcibly remount, call drive.mount("/content/drive", force_remount=True).
✅ Google Drive mounted successfully


In [None]:
# ==============================================================================
# 1.2 Clone Repository and Set Paths
# ==============================================================================
import os

# --- Configuration (MODIFY THESE) ---
REPO_URL = "https://github.com/armanfeili/novartis_datathon_2025.git"
BRANCH = "Arman"  # Change to your working branch

# Paths depend on environment
if IN_COLAB:
    DRIVE_BASE = "/content/drive/MyDrive"
    PROJECT_PATH = "/content/novartis_datathon_2025"  # Clone to /content for speed
    DATA_PATH = f"{DRIVE_BASE}/novartis-datathon-2025/data"  # Data on Drive
    ARTIFACTS_PATH = f"{DRIVE_BASE}/novartis-datathon-2025/artifacts"
    SUBMISSIONS_PATH = f"{DRIVE_BASE}/novartis-datathon-2025/submissions"
else:
    # Local paths (relative to notebook location)
    PROJECT_PATH = str(Path.cwd().parent.parent)
    DATA_PATH = os.path.join(PROJECT_PATH, "data")
    ARTIFACTS_PATH = os.path.join(PROJECT_PATH, "artifacts")
    SUBMISSIONS_PATH = os.path.join(PROJECT_PATH, "submissions")
# --------------------------------

if IN_COLAB:
    # Clone or update repository
    if not os.path.exists(PROJECT_PATH):
        print("📥 Cloning repository...")
        !git clone --branch {BRANCH} {REPO_URL} {PROJECT_PATH}
    else:
        print("📂 Repository exists. Pulling latest changes...")
        %cd {PROJECT_PATH}
        !git fetch origin {BRANCH}
        !git reset --hard origin/{BRANCH}
    
    %cd {PROJECT_PATH}
    
    # Create symlinks to Drive data (if data is on Drive)
    if os.path.exists(DATA_PATH):
        local_data = os.path.join(PROJECT_PATH, "data")
        if not os.path.exists(local_data):
            !ln -s {DATA_PATH} {local_data}
            print("🔗 Linked data directory from Drive")

# Create required directories
for path in [DATA_PATH, ARTIFACTS_PATH, SUBMISSIONS_PATH]:
    os.makedirs(path, exist_ok=True)

# Print paths
print(f"\n📁 Project: {PROJECT_PATH}")
print(f"📁 Data: {DATA_PATH}")
print(f"📁 Artifacts: {ARTIFACTS_PATH}")
print(f"📁 Submissions: {SUBMISSIONS_PATH}")

📂 Repository exists at /content/drive/MyDrive/novartis_datathon_2025. Pulling latest changes...
/content/drive/MyDrive/novartis_datathon_2025
remote: Enumerating objects: 19, done.
remote: Counting objects: 100% (19/19), done.
remote: Compressing objects: 100% (3/3), done.
remote: Total 10 (delta 6), reused 10 (delta 6), pack-reused 0 (from 0)
Unpacking objects: 100% (10/10), 8.06 KiB | 26.00 KiB/s, done.
From https://github.com/armanfeili/novartis_datathon_2025
 * branch            Arman      -> FETCH_HEAD
   67c14aa..5c33709  Arman      -> origin/Arman
HEAD is now at 5c33709 project setup - 3
/content/drive/MyDrive/novartis_datathon_2025

📁 Project Path: /content/drive/MyDrive/novartis_datathon_2025
📁 Data Path: /content/drive/MyDrive/novartis-datathon-2025/data
📁 Artifacts Path: /content/drive/MyDrive/novartis-datathon-2025/artifacts
📁 Submissions Path: /content/drive/MyDrive/novartis-datathon-2025/submissions


In [None]:
# ==============================================================================
# 1.3 Install Dependencies
# ==============================================================================
import subprocess

print("📦 Installing dependencies...")

# Install from colab requirements
!pip install -q -r env/colab_requirements.txt

# For GPU support, ensure CUDA-compatible versions
if IN_COLAB:
    # XGBoost with GPU
    !pip install -q xgboost --upgrade
    
    # LightGBM with GPU (requires OpenCL)
    !pip install -q lightgbm --upgrade
    
    # CatBoost with GPU
    !pip install -q catboost --upgrade

# Verify key packages
import importlib

packages = [
    ('numpy', 'numpy'),
    ('pandas', 'pandas'),
    ('sklearn', 'scikit-learn'),
    ('yaml', 'pyyaml'),
    ('tqdm', 'tqdm'),
    ('catboost', 'catboost'),
    ('lightgbm', 'lightgbm'),
    ('xgboost', 'xgboost'),
    ('pyarrow', 'pyarrow'),
    ('scipy', 'scipy'),
]

print("\n📋 Package Status:")
for import_name, pkg_name in packages:
    try:
        mod = importlib.import_module(import_name)
        version = getattr(mod, '__version__', 'installed')
        print(f"  ✅ {pkg_name}: {version}")
    except ImportError:
        print(f"  ❌ {pkg_name}: not installed")

# Check GPU availability
print("\n🖥️ GPU Status:")
try:
    import torch
    if torch.cuda.is_available():
        print(f"  ✅ CUDA available: {torch.cuda.get_device_name(0)}")
        print(f"  ✅ CUDA version: {torch.version.cuda}")
    else:
        print("  ⚠️ CUDA not available - using CPU")
except ImportError:
    pass

# Check via nvidia-smi
!nvidia-smi --query-gpu=name,memory.total --format=csv,noheader 2>/dev/null || echo "  ℹ️ nvidia-smi not available"

print("\n✅ Dependencies installed!")

📦 Installing dependencies...
  ✅ torch
  ✅ numpy
  ✅ pandas
  ✅ lightgbm
  ✅ xgboost
  ✅ catboost
  ✅ sklearn
  ✅ yaml

✅ All dependencies installed!


## 2. Import Modules and Verify Environment

Import project modules and verify GPU availability.

In [None]:
# ==============================================================================
# 2.1 Import Project Modules
# ==============================================================================
import sys
import os
import gc
import warnings
warnings.filterwarnings('ignore')

# Ensure project root is in path
if PROJECT_PATH not in sys.path:
    sys.path.insert(0, PROJECT_PATH)

# Standard imports
import numpy as np
import pandas as pd
from pathlib import Path
from datetime import datetime

# Project imports
from src.utils import (
    load_config, set_seed, setup_logging, timer, 
    get_device, get_gpu_info, print_environment_info,
    clear_memory, get_memory_usage, optimize_dataframe_memory
)
from src.data import (
    get_panel, load_raw_data, prepare_base_panel, 
    compute_pre_entry_stats, handle_missing_values,
    META_COLS
)
from src.features import (
    get_features, make_features, split_features_target_meta,
    get_feature_columns, SCENARIO_CONFIG
)
from src.train import train_scenario_model, run_cross_validation
from src.evaluate import compute_metric1, compute_metric2, compute_per_series_error
from src.inference import (
    generate_submission, detect_test_scenarios, 
    validate_submission_format, save_submission_with_versioning
)

print("✅ All modules imported successfully!")

# ==============================================================================
# 2.2 Display Environment Information
# ==============================================================================
print_environment_info()

🖥️  Device: cpu

✅ All modules imported successfully!


## 3. Load Configuration and Set Seed

Load all configuration files and set random seed for reproducibility.

In [None]:
# ==============================================================================
# 3.1 Load Configurations
# ==============================================================================
data_config = load_config('configs/data.yaml')
features_config = load_config('configs/features.yaml')
run_config = load_config('configs/run_defaults.yaml')

# Load all model configs (one file per model)
model_configs = {
    'xgboost': load_config('configs/model_xgb.yaml'),    # Primary model
    'lightgbm': load_config('configs/model_lgbm.yaml'),  # Secondary model
    'catboost': load_config('configs/model_cat.yaml'),   # Tertiary (ensemble diversity)
}

# Set random seed for reproducibility
SEED = run_config['reproducibility']['seed']
set_seed(SEED)

# Setup logging
setup_logging(level=run_config.get('logging', {}).get('level', 'INFO'))

print("📋 Configurations loaded:")
print(f"  - Random seed: {SEED}")
print(f"  - Scenarios: {list(run_config['scenarios'].keys())}")

# Display model priorities
print("\n🏆 Model Priorities (by official_metric):")
for name, cfg in model_configs.items():
    priority = cfg.get('model', {}).get('priority', 99)
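    # Fall back to 'official_metric' so the default matches the sweep cells and the stated selection principle
    sweep_metric = cfg.get('sweep', {}).get('selection_metric', 'official_metric')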
    print(f"  {priority}. {name.upper()} - selection: {sweep_metric}")

# Display scenario details
print("\n📅 Scenario Configuration:")
for s_name, s_config in run_config['scenarios'].items():
    print(f"  {s_name}:")
    print(f"    Forecast: months {s_config['forecast_start']} to {s_config['forecast_end']}")
    print(f"    Feature cutoff: month {s_config['feature_cutoff']}")

📋 Configurations loaded:
  - Data config: ['drive', 'local', 'files', 'keys', 'dates', 'columns', 'validation']
  - Features config: ['feature_groups', 'lags', 'rolling', 'diff', 'time_features', 'interactions', 'selection', 'encoding']
  - Run config: ['experiment', 'run', 'reproducibility', 'cv', 'paths', 'output', 'metrics', 'logging', 'drive', 'hardware']
  - Model configs: ['lightgbm', 'xgboost', 'catboost', 'linear', 'neural_network']

🎲 Random seed: 42
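
The body of `set_seed` is not shown in this notebook. As a minimal sketch of what seeding for reproducibility typically involves, the hypothetical helper `set_seed_sketch` below seeds the standard-library RNG and, when NumPy is installed, NumPy's global RNG. It is an assumption-based illustration, not the project's implementation.

```python
import os
import random


def set_seed_sketch(seed):
    """Seed the stdlib RNG and, if installed, NumPy's global RNG."""
    # Recorded for subprocesses started later; it does not change hashing
    # in the interpreter that is already running.
    os.environ["PYTHONHASHSEED"] = str(seed)
    random.seed(seed)
    try:
        import numpy as np
        np.random.seed(seed)
    except ImportError:
        pass  # NumPy is optional for this sketch


set_seed_sketch(42)
first = [random.random() for _ in range(3)]
set_seed_sketch(42)
second = [random.random() for _ in range(3)]
print(first == second)  # re-seeding reproduces the same draws
```

Gradient-boosting libraries also take their own seed parameters, so a global seed alone may not make GPU training fully deterministic.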


## 4. Load and Explore Data

Load the training and test data panels.

In [None]:
# ==============================================================================
# 4.1 Load Training Panel
# ==============================================================================
print("📂 Loading training data...")

with timer("Load train panel"):
    train_panel = get_panel(split='train', config=data_config, use_cache=True)

# Display statistics
n_series = train_panel[['country', 'brand_name']].drop_duplicates().shape[0]
print("\n📊 Training Panel Statistics:")
print(f"  Shape: {train_panel.shape[0]:,} rows × {train_panel.shape[1]} columns")
print(f"  Unique series: {n_series:,}")
print(f"  Time range: {train_panel['months_postgx'].min()} to {train_panel['months_postgx'].max()}")

# Bucket distribution
bucket_dist = train_panel[['country', 'brand_name', 'bucket']].drop_duplicates()['bucket'].value_counts()
print("\n🪣 Bucket Distribution:")
for bucket, count in bucket_dist.items():
    pct = count / n_series * 100
    print(f"  Bucket {bucket}: {count:,} series ({pct:.1f}%)")

# Memory usage
mem_mb = train_panel.memory_usage(deep=True).sum() / (1024**2)
print(f"\n💾 Memory: {mem_mb:.1f} MB")

📂 Data Directories:
  Raw: /content/drive/MyDrive/novartis-datathon-2025/data/raw (exists: True)
  Interim: /content/drive/MyDrive/novartis-datathon-2025/data/interim (exists: True)
  Processed: /content/drive/MyDrive/novartis-datathon-2025/data/processed (exists: True)

📄 Available raw files (0):


In [None]:
# ==============================================================================
# 4.2 Load Test Panel
# ==============================================================================
print("📂 Loading test data...")

with timer("Load test panel"):
    test_panel = get_panel(split='test', config=data_config, use_cache=True)

# Detect scenarios
test_scenarios = detect_test_scenarios(test_panel)
n_test_series = test_panel[['country', 'brand_name']].drop_duplicates().shape[0]

print("\n📊 Test Panel Statistics:")
print(f"  Shape: {test_panel.shape[0]:,} rows × {test_panel.shape[1]} columns")
print(f"  Unique series: {n_test_series:,}")
print(f"  Scenario 1 series: {len(test_scenarios[1]):,}")
print(f"  Scenario 2 series: {len(test_scenarios[2]):,}")

# Clear memory
clear_memory()
print("\n🧹 Memory cleared")

AttributeError: 'NoneType' object has no attribute 'items'

In [None]:
# ==============================================================================
# 4.3 Quick Data Exploration
# ==============================================================================
import matplotlib.pyplot as plt

# Set up plotting
plt.style.use('default')
fig, axes = plt.subplots(2, 2, figsize=(14, 10))

# 1. y_norm distribution
ax = axes[0, 0]
train_panel['y_norm'].hist(bins=50, ax=ax, color='steelblue', edgecolor='white')
ax.axvline(x=1.0, color='red', linestyle='--', label='No erosion (1.0)')
ax.axvline(x=0.25, color='orange', linestyle='--', label='Bucket 1 threshold')
ax.set_xlabel('Normalized Volume (y_norm)')
ax.set_ylabel('Frequency')
ax.set_title('Distribution of y_norm')
ax.legend()

# 2. Mean erosion curve by bucket
ax = axes[0, 1]
for bucket in [1, 2]:
    bucket_data = train_panel[train_panel['bucket'] == bucket]
    erosion_by_month = bucket_data.groupby('months_postgx')['y_norm'].mean()
    ax.plot(erosion_by_month.index, erosion_by_month.values, 
            label=f'Bucket {bucket}', linewidth=2)
ax.axhline(y=1.0, color='gray', linestyle=':', alpha=0.7)
ax.set_xlabel('Months Post Generic Entry')
ax.set_ylabel('Mean Normalized Volume')
ax.set_title('Erosion Curves by Bucket')
ax.legend()
ax.grid(True, alpha=0.3)

# 3. Number of generics over time
ax = axes[1, 0]
ngxs_by_month = train_panel.groupby('months_postgx')['n_gxs'].mean()
ax.bar(ngxs_by_month.index, ngxs_by_month.values, color='forestgreen', alpha=0.7)
ax.set_xlabel('Months Post Generic Entry')
ax.set_ylabel('Mean Number of Generics')
ax.set_title('Average Generic Competition Over Time')
ax.grid(True, alpha=0.3, axis='y')

# 4. Hospital rate distribution
ax = axes[1, 1]
if 'hospital_rate' in train_panel.columns:
    hr_by_series = train_panel.groupby(['country', 'brand_name'])['hospital_rate'].first()
    hr_by_series.hist(bins=30, ax=ax, color='purple', edgecolor='white', alpha=0.7)
    ax.set_xlabel('Hospital Rate (%)')
    ax.set_ylabel('Number of Series')
    ax.set_title('Hospital Rate Distribution')

plt.tight_layout()
plt.show()

print("✅ Data exploration complete")

## 5. Feature Engineering

Build scenario-specific features for training.

In [None]:
# ==============================================================================
# 5.1 Build Features for Both Scenarios
# ==============================================================================

# Build Scenario 1 features (forecast months 0-23 using pre-entry only)
print("🔬 Building Scenario 1 features...")
with timer("Scenario 1 features"):
    X_train_s1, y_train_s1, meta_train_s1 = get_features(
        split='train', scenario=1, mode='train',
        data_config=data_config, features_config=features_config,
        use_cache=True
    )
print(f"  X shape: {X_train_s1.shape}")
print(f"  y shape: {y_train_s1.shape}")
print(f"  Features: {len(X_train_s1.columns)}")

# Build Scenario 2 features (forecast months 6-23 using pre-entry + months 0-5)
print("\n🔬 Building Scenario 2 features...")
with timer("Scenario 2 features"):
    X_train_s2, y_train_s2, meta_train_s2 = get_features(
        split='train', scenario=2, mode='train',
        data_config=data_config, features_config=features_config,
        use_cache=True
    )
print(f"  X shape: {X_train_s2.shape}")
print(f"  y shape: {y_train_s2.shape}")
print(f"  Features: {len(X_train_s2.columns)}")

# Display some feature examples
print("\n📋 Sample Features (Scenario 1):")
print(f"  {list(X_train_s1.columns[:10])}...")

# Check for early erosion features in S2 only
s2_only_features = [c for c in X_train_s2.columns if 'erosion_0' in c or 'avg_vol_0' in c]
if s2_only_features:
    print("\n📋 Scenario 2 Specific Features:")
    print(f"  {s2_only_features[:5]}...")

clear_memory()
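
The two scenarios differ in which months may feed the features and which months are forecast. The sketch below spells out those windows for a single series using plain lists. The window values are read off the cell comments above; the names `SCENARIO_WINDOWS` and `split_months` are hypothetical and do not come from `src.features`.

```python
# Assumed windows, taken from the cell comments above:
#   Scenario 1: features from months_postgx < 0, targets are months 0-23
#   Scenario 2: features from months_postgx < 6, targets are months 6-23
SCENARIO_WINDOWS = {
    1: {"feature_cutoff": 0, "forecast_start": 0, "forecast_end": 23},
    2: {"feature_cutoff": 6, "forecast_start": 6, "forecast_end": 23},
}


def split_months(months, scenario):
    """Return (feature_months, target_months) for one series."""
    w = SCENARIO_WINDOWS[scenario]
    feature_months = [m for m in months if m < w["feature_cutoff"]]
    target_months = [m for m in months if w["forecast_start"] <= m <= w["forecast_end"]]
    return feature_months, target_months


# One hypothetical series: 12 pre-entry months plus 24 post-entry months.
months = list(range(-12, 24))
for scenario in (1, 2):
    feats, targets = split_months(months, scenario)
    # No month may serve as both an input and a target.
    assert not set(feats) & set(targets)
    print(scenario, feats[-1], targets[0], len(targets))
```

Keeping the feature cutoff at or below the forecast start is what prevents target months from leaking into the inputs.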

## 6. Model Training

Train the selected model (`MODEL_TYPE`, XGBoost by default) for both scenarios using cross-validation.

In [None]:
# ==============================================================================
# 6.1 Training Configuration
# ==============================================================================

# ============ CONFIGURE YOUR TRAINING HERE ============
# Model options: 'xgboost', 'lightgbm', 'catboost'
MODEL_TYPE = 'xgboost'  # Primary model (best performing)

# Training mode options:
# - 'quick'    : Use best known params (from config scenario_best_params)
# - 'cv'       : Train with K-fold cross-validation using best params
# - 'sweep'    : Run hyperparameter sweep with holdout validation
# - 'sweep_cv' : Run sweep with K-fold cross-validation (most robust)
# - 'ensemble' : Train XGBoost + LightGBM ensemble
# - 'compare'  : Compare all models and select best by official_metric
TRAINING_MODE = 'cv'

N_FOLDS = 5  # Number of CV folds
USE_GPU = True  # Enable GPU acceleration
# =====================================================

# Create run ID
RUN_ID = datetime.now().strftime("%Y%m%d_%H%M%S")
RUN_DIR = Path(ARTIFACTS_PATH) / RUN_ID
RUN_DIR.mkdir(parents=True, exist_ok=True)

print("🏃 Training Configuration:")
print(f"  Run ID: {RUN_ID}")
print(f"  Model: {MODEL_TYPE}")
print(f"  Mode: {TRAINING_MODE}")
print(f"  CV Folds: {N_FOLDS}")
print(f"  Artifacts: {RUN_DIR}")

# Check GPU availability and configure
gpu_info = get_gpu_info()
# Coerce to a real boolean; a bare `and` would leave a string or None here
GPU_AVAILABLE = bool(gpu_info.get('gpu_available') and gpu_info.get('cuda_version'))

if GPU_AVAILABLE and USE_GPU:
    print(f"  🚀 GPU: {gpu_info.get('device_name', 'Available')} - GPU training enabled")
else:
    print("  💻 Using CPU training")
    USE_GPU = False

# Set environment variable for thread safety (important for XGBoost/LightGBM)
import os
os.environ['OMP_NUM_THREADS'] = '1'

# Show sweep configuration if using sweep mode
if TRAINING_MODE in ['sweep', 'sweep_cv']:
    sweep_config = model_configs[MODEL_TYPE].get('sweep', {})
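    # Map model names to config file suffixes ('xgboost'.replace('boost', '') would give 'xg', not 'xgb')
    config_suffix = {'xgboost': 'xgb', 'lightgbm': 'lgbm', 'catboost': 'cat'}[MODEL_TYPE]
    print(f"\n🔍 Sweep Configuration (from configs/model_{config_suffix}.yaml):")
    print(f"  Selection metric: {sweep_config.get('selection_metric', 'official_metric')}")
    sweep_axes = sweep_config.get('axes', {})  # named sweep_axes to avoid shadowing the matplotlib `axes`
    total_combos = 1
    for param, values in sweep_axes.items():
        print(f"  - {param}: {values}")
        total_combos *= len(values)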
    print(f"  Total combinations: {total_combos}")
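
The combination count printed above is the product of the axis lengths. A minimal sketch of expanding such axes into individual parameter dictionaries follows; the axis names and values are placeholders, and `expand_grid` is a hypothetical helper rather than part of `src.train`.

```python
from itertools import product


def expand_grid(axes):
    """Yield one parameter dict per combination of the sweep axes."""
    names = list(axes)
    for values in product(*(axes[name] for name in names)):
        yield dict(zip(names, values))


# Hypothetical axes; the real ones live under `sweep.axes` in the model config.
example_axes = {"max_depth": [4, 6, 8], "learning_rate": [0.03, 0.1]}
grid = list(expand_grid(example_axes))
print(len(grid))  # 3 * 2 combinations
print(grid[0])
```

With K-fold CV, each combination is trained once per fold, so the number of model fits is the grid size times the fold count.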

In [None]:
# ==============================================================================
# 6.2 Configure GPU-Enabled Model Parameters
# ==============================================================================
from src.models import get_model_class

def get_gpu_model_config(model_type, base_config, use_gpu=True):
    """
    Get model configuration with GPU settings enabled.
    Uses gpu section from consolidated model config files.
    
    Args:
        model_type: 'xgboost', 'lightgbm', or 'catboost'
        base_config: Base model configuration dict
        use_gpu: Whether to enable GPU
        
    Returns:
        Updated model configuration
    """
    config = base_config.copy()
    params = config.get('model', {}).get('params', {}).copy()
    gpu_config = config.get('gpu', {})
    
    if use_gpu and GPU_AVAILABLE:
        # Apply GPU settings from config
        if model_type == 'xgboost':
            params['tree_method'] = gpu_config.get('tree_method', 'gpu_hist')
            params['gpu_id'] = gpu_config.get('gpu_id', 0)
            params['predictor'] = gpu_config.get('predictor', 'gpu_predictor')
            print("  🚀 XGBoost GPU mode enabled (tree_method='gpu_hist')")
            
        elif model_type == 'lightgbm':
            params['device'] = 'gpu'
            params['gpu_platform_id'] = gpu_config.get('gpu_platform_id', 0)
            params['gpu_device_id'] = gpu_config.get('gpu_device_id', 0)
            print("  ⚡ LightGBM GPU mode enabled (device='gpu')")
            
        elif model_type == 'catboost':
            params['task_type'] = 'GPU'
            params['devices'] = str(gpu_config.get('device_id', 0))
            print("  🐱 CatBoost GPU mode enabled (task_type='GPU')")
    else:
        # Use CPU settings from config
        if model_type == 'xgboost':
            params['tree_method'] = config.get('model', {}).get('params', {}).get('tree_method', 'hist')
        elif model_type == 'lightgbm':
            params['device'] = 'cpu'
        elif model_type == 'catboost':
            params['task_type'] = 'CPU'
        print(f"  💻 {model_type} CPU mode")
    
    # Update config with modified params
    config_copy = config.copy()
    if 'model' in config_copy:
        config_copy['model'] = config_copy['model'].copy()
        config_copy['model']['params'] = params
    else:
        config_copy['params'] = params
    
    return config_copy

# Get GPU-enabled config for selected model
print(f"\n🔧 Configuring {MODEL_TYPE}...")
current_model_config = get_gpu_model_config(MODEL_TYPE, model_configs[MODEL_TYPE], USE_GPU)

# Get model class
ModelClass = get_model_class(MODEL_TYPE)
print(f"  Model class: {ModelClass.__name__}")

# Display selected params
selected_params = current_model_config.get('model', current_model_config).get('params', {})
print("\n📋 Model Parameters:")
for k, v in list(selected_params.items())[:8]:
    print(f"  {k}: {v}")
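
Because section 1.3 upgrades XGBoost to the latest release, note that XGBoost 2.0 and later deprecate `tree_method='gpu_hist'` and `gpu_id` in favor of `device='cuda'` combined with `tree_method='hist'`. The sketch below shows one possible way to choose the spelling from a version string. The helper `xgb_gpu_params` is hypothetical and is not part of the project's `gpu` config section.

```python
def xgb_gpu_params(xgb_version, gpu_id=0):
    """Return GPU-related XGBoost params appropriate for the given version string."""
    major = int(xgb_version.split(".")[0])
    if major >= 2:
        # XGBoost 2.0+ selects the device separately from the tree method.
        return {"tree_method": "hist", "device": f"cuda:{gpu_id}"}
    # Older releases use the dedicated GPU tree method.
    return {"tree_method": "gpu_hist", "gpu_id": gpu_id}


print(xgb_gpu_params("2.1.1"))
print(xgb_gpu_params("1.7.6"))
```

In the notebook the version string could come from `xgboost.__version__`, with the returned dictionary merged into `params` in place of the hard-coded GPU keys.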

### 6.3 Train Scenario 1 Model

In [None]:
# ==============================================================================
# 6.3 Train Scenario 1 Model
# ==============================================================================
print(f"🏋️ Training Scenario 1 - {MODEL_TYPE.upper()}")
print("=" * 60)

# Train with cross-validation
s1_cv_results = run_cross_validation(
    X=X_train_s1,
    y=y_train_s1,
    meta_df=meta_train_s1,
    scenario=1,
    model_config=current_model_config,
    run_config=run_config,
    n_folds=N_FOLDS,
    save_dir=RUN_DIR / 'models_s1',
    run_id=RUN_ID,
)

print("\n✅ Scenario 1 Training Complete")
print(f"  Mean CV Metric: {s1_cv_results['mean_score']:.6f} ± {s1_cv_results['std_score']:.6f}")

# Save S1 OOF predictions
oof_s1 = pd.DataFrame({
    'y_true': y_train_s1,
    'y_pred': s1_cv_results['oof_predictions'],
})
oof_s1.to_csv(RUN_DIR / 'oof_s1.csv', index=False)

clear_memory()

In [None]:
# ==============================================================================
# 6.4 Train Scenario 2 Model
# ==============================================================================
print(f"🏋️ Training Scenario 2 - {MODEL_TYPE.upper()}")
print("=" * 60)

# Train with cross-validation
s2_cv_results = run_cross_validation(
    X=X_train_s2,
    y=y_train_s2,
    meta_df=meta_train_s2,
    scenario=2,
    model_config=current_model_config,
    run_config=run_config,
    n_folds=N_FOLDS,
    save_dir=RUN_DIR / 'models_s2',
    run_id=RUN_ID,
)

print("\n✅ Scenario 2 Training Complete")
print(f"  Mean CV Metric: {s2_cv_results['mean_score']:.6f} ± {s2_cv_results['std_score']:.6f}")

# Save S2 OOF predictions
oof_s2 = pd.DataFrame({
    'y_true': y_train_s2,
    'y_pred': s2_cv_results['oof_predictions'],
})
oof_s2.to_csv(RUN_DIR / 'oof_s2.csv', index=False)

clear_memory()

# Summary
print("\n" + "=" * 60)
print("📊 TRAINING SUMMARY")
print("=" * 60)
print(f"  Model: {MODEL_TYPE.upper()}")
print(f"  GPU: {'Enabled' if USE_GPU and GPU_AVAILABLE else 'Disabled'}")
print(f"  Scenario 1 CV: {s1_cv_results['mean_score']:.6f} ± {s1_cv_results['std_score']:.6f}")
print(f"  Scenario 2 CV: {s2_cv_results['mean_score']:.6f} ± {s2_cv_results['std_score']:.6f}")
print(f"  Models saved to: {RUN_DIR}")
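
The out-of-fold (OOF) predictions saved above are produced by models that never saw the corresponding rows during training. The internals of `run_cross_validation` are not shown here, so the sketch below is only an illustration of the idea. It assumes folds are split by series so that all months of one series stay together, and it uses the training mean as a stand-in for a real regressor.

```python
from statistics import mean


def grouped_oof(rows, n_folds):
    """Out-of-fold predictions with all rows of a series kept in one fold.

    rows: list of (series_id, y) tuples. The "model" is just the training
    mean, a stand-in for a real regressor.
    """
    series_ids = sorted({sid for sid, _ in rows})
    fold_of = {sid: i % n_folds for i, sid in enumerate(series_ids)}
    oof = [None] * len(rows)
    for fold in range(n_folds):
        train_y = [y for sid, y in rows if fold_of[sid] != fold]
        prediction = mean(train_y)  # "fit" on the other folds only
        for i, (sid, _) in enumerate(rows):
            if fold_of[sid] == fold:
                oof[i] = prediction
    return oof, fold_of


rows = [("A", 1.0), ("A", 0.8), ("B", 0.5), ("B", 0.4), ("C", 0.9), ("C", 0.7)]
oof, fold_of = grouped_oof(rows, n_folds=3)
print(oof)
```

Each row receives exactly one prediction, which is what lets the OOF vector be scored against `y_true` or reused later for ensemble weight tuning.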

### 6.5 Advanced Training Options

The cells below provide advanced training options:
- **Hyperparameter Sweep**: Grid search with K-fold CV to find optimal parameters
- **Multi-Model Training**: Train all models (XGBoost, LightGBM, CatBoost)
- **Ensemble**: Combine XGBoost + LightGBM predictions for better performance

⚠️ These cells are optional and computationally intensive. Skip to Section 7 for a basic submission.
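
To make the weighted ensemble concrete, the sketch below grid-searches a single XGBoost weight `w` over a range and gives LightGBM the remainder `1 - w`. The default range `(0.4, 0.8)` mirrors the fallback used in the ensemble cell. Mean absolute error is only a stand-in for the official metric, and both the data and the helper `best_blend_weight` are hypothetical.

```python
def mae(y_true, y_pred):
    """Mean absolute error, used here only as a stand-in for the official metric."""
    return sum(abs(t - p) for t, p in zip(y_true, y_pred)) / len(y_true)


def best_blend_weight(y_true, pred_xgb, pred_lgbm, weight_range=(0.4, 0.8), steps=9):
    """Grid-search the XGBoost weight w; LightGBM gets 1 - w."""
    lo, hi = weight_range
    best_w, best_score = None, float("inf")
    for i in range(steps):
        w = lo + (hi - lo) * i / (steps - 1)
        blended = [w * a + (1 - w) * b for a, b in zip(pred_xgb, pred_lgbm)]
        score = mae(y_true, blended)
        if score < best_score:
            best_w, best_score = w, score
    return best_w, best_score


# Hypothetical out-of-fold predictions for four rows.
y_true = [1.0, 0.8, 0.6, 0.4]
pred_xgb = [1.1, 0.9, 0.7, 0.5]    # biased high by 0.1
pred_lgbm = [0.9, 0.7, 0.5, 0.3]   # biased low by 0.1
w, score = best_blend_weight(y_true, pred_xgb, pred_lgbm)
print(round(w, 2), round(score, 4))
```

Tuning the weight on out-of-fold predictions rather than in-sample fits keeps the chosen blend from simply favoring whichever model overfits more.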

In [None]:
# ==============================================================================
# 6.5a Train All Models (XGBoost, LightGBM, CatBoost)
# ==============================================================================
# Set RUN_ALL_MODELS = True to train all three models and compare

RUN_ALL_MODELS = False  # ⚠️ Set to True to run (takes ~15-30 min with GPU)

if RUN_ALL_MODELS:
    all_model_results = {}
    
    for model_name in ['xgboost', 'lightgbm', 'catboost']:
        print(f"\n{'='*60}")
        print(f"🏋️ Training {model_name.upper()}")
        print(f"{'='*60}")
        
        # Get GPU-enabled config
        model_cfg = get_gpu_model_config(model_name, model_configs[model_name], USE_GPU)
        
        # Train S1
        s1_results = run_cross_validation(
            X=X_train_s1, y=y_train_s1, meta_df=meta_train_s1,
            scenario=1, model_config=model_cfg, run_config=run_config,
            n_folds=N_FOLDS, save_dir=RUN_DIR / f'{model_name}_s1', run_id=RUN_ID
        )
        
        # Train S2
        s2_results = run_cross_validation(
            X=X_train_s2, y=y_train_s2, meta_df=meta_train_s2,
            scenario=2, model_config=model_cfg, run_config=run_config,
            n_folds=N_FOLDS, save_dir=RUN_DIR / f'{model_name}_s2', run_id=RUN_ID
        )
        
        all_model_results[model_name] = {
            's1_mean': s1_results['mean_score'],
            's1_std': s1_results['std_score'],
            's2_mean': s2_results['mean_score'],
            's2_std': s2_results['std_score'],
            's1_oof': s1_results['oof_predictions'],
            's2_oof': s2_results['oof_predictions'],
        }
        
        print(f"  S1: {s1_results['mean_score']:.4f} ± {s1_results['std_score']:.4f}")
        print(f"  S2: {s2_results['mean_score']:.4f} ± {s2_results['std_score']:.4f}")
        
        clear_memory()
    
    # Display comparison table
    print("\n" + "="*60)
    print("📊 MODEL COMPARISON")
    print("="*60)
    comparison_df = pd.DataFrame([
        {
            'Model': name.upper(),
            'S1 Mean': f"{r['s1_mean']:.4f}",
            'S1 Std': f"±{r['s1_std']:.4f}",
            'S2 Mean': f"{r['s2_mean']:.4f}",
            'S2 Std': f"±{r['s2_std']:.4f}",
        }
        for name, r in all_model_results.items()
    ])
    display(comparison_df)
else:
    print("ℹ️ Set RUN_ALL_MODELS = True to train all models")

In [None]:
# ==============================================================================
# 6.5b Hyperparameter Sweep with K-Fold CV (using consolidated configs)
# ==============================================================================
# Run a grid search over hyperparameters with cross-validation
# Sweep parameters are defined in configs/model_xgb.yaml and configs/model_lgbm.yaml

RUN_SWEEP = False  # ⚠️ Set to True to run (takes ~30-60 min)

if RUN_SWEEP:
    from src.train import run_sweep_with_cv
    
    # Select model for sweep (use consolidated config files)
    SWEEP_MODEL = 'xgboost'  # 'xgboost' or 'lightgbm'
    SWEEP_FOLDS = 3  # Fewer folds for faster sweep
    
    # Get sweep configuration from consolidated model config
    sweep_model_config = model_configs[SWEEP_MODEL]
    sweep_axes = sweep_model_config.get('sweep', {}).get('axes', {})
    selection_metric = sweep_model_config.get('sweep', {}).get('selection_metric', 'official_metric')
    
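    print(f"🔍 Running {SWEEP_MODEL.upper()} hyperparameter sweep...")
    # Spell out the suffix: 'xgboost'.replace('boost', '') yields 'xg', and 'lightgbm' is left unchanged
    print(f"  Config file: configs/model_{'xgb' if SWEEP_MODEL == 'xgboost' else 'lgbm'}.yaml")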
    print(f"  Selection metric: {selection_metric}")
    print(f"  Folds: {SWEEP_FOLDS}")
    print(f"  GPU: {'Enabled' if USE_GPU else 'Disabled'}")
    print(f"\n  Sweep axes:")
    
    total_combos = 1
    for param, values in sweep_axes.items():
        print(f"    {param}: {values}")
        total_combos *= len(values)
    print(f"  Total combinations: {total_combos}")
    
    # Config file path for sweep
    config_path = 'configs/model_xgb.yaml' if SWEEP_MODEL == 'xgboost' else 'configs/model_lgbm.yaml'
    
    # Run sweep for both scenarios
    sweep_all_results = {}
    for scenario in [1, 2]:
        print(f"\n{'='*60}")
        print(f"Scenario {scenario} Sweep")
        print(f"{'='*60}")
        
        sweep_results = run_sweep_with_cv(
            scenario=scenario,
            model_type=SWEEP_MODEL,
            model_config_path=config_path,
            run_config_path='configs/run_defaults.yaml',
            data_config_path='configs/data.yaml',
            features_config_path='configs/features.yaml',
            base_run_name=f"{RUN_ID}_{SWEEP_MODEL}_s{scenario}",
            n_folds=SWEEP_FOLDS,
            use_cached_features=True
        )
        
        sweep_all_results[scenario] = sweep_results
        
        print(f"\n✅ Best config: {sweep_results['best_config']}")
        print(f"   Mean {selection_metric}: {sweep_results['best_mean_metric']:.4f} ± {sweep_results['best_std_metric']:.4f}")
        
        # Display results table
        if 'summary_df' in sweep_results:
            display(sweep_results['summary_df'])
    
    # Summary
    print("\n" + "="*60)
    print(f"📊 SWEEP SUMMARY ({SWEEP_MODEL.upper()})")
    print("="*60)
    for s in [1, 2]:
        r = sweep_all_results[s]
        print(f"  Scenario {s}: {r['best_mean_metric']:.4f}")
        print(f"    Best params: {r['best_config']}")
    
    clear_memory()
else:
    print("ℹ️ Set RUN_SWEEP = True to run hyperparameter sweep")
    print("   Sweep configuration is in configs/model_xgb.yaml and configs/model_lgbm.yaml")
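
The sweep cell above reports the total number of combinations as the product of the number of values on each axis. As a minimal sketch of how such a grid can be enumerated, assuming `sweep_axes` is a plain dict mapping parameter names to lists of values (the axis names and values below are illustrative and not taken from the configs, and `expand_grid` is a hypothetical helper, not part of `src.train`):

```python
from itertools import product
from math import prod

# Hypothetical sweep axes; the real ones come from the model config's `sweep` section.
sweep_axes = {
    "max_depth": [4, 6, 8],
    "learning_rate": [0.03, 0.1],
}

def expand_grid(axes):
    """Return one dict per combination of axis values (Cartesian product)."""
    names = list(axes)
    return [dict(zip(names, combo)) for combo in product(*axes.values())]

grid = expand_grid(sweep_axes)
total_combos = prod(len(values) for values in sweep_axes.values())
print(f"{len(grid)} combinations, first: {grid[0]}")
```

With K-fold CV each combination is trained once per fold, so the number of model fits is `total_combos * SWEEP_FOLDS` per scenario.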

In [None]:
# ==============================================================================
# 6.5c XGBoost + LightGBM Ensemble (using consolidated configs)
# ==============================================================================
# Train both XGBoost and LightGBM and combine predictions
# Ensemble settings are defined in configs/model_xgb.yaml and configs/model_lgbm.yaml

RUN_ENSEMBLE = False  # ⚠️ Set to True to run (takes ~10-20 min)

if RUN_ENSEMBLE:
    from src.train import train_xgb_lgbm_ensemble
    
    print("🤝 Training XGBoost + LightGBM Ensemble...")
    print("=" * 60)
    
    # Get ensemble settings from configs
    xgb_ensemble_cfg = model_configs['xgboost'].get('ensemble', {})
    lgbm_ensemble_cfg = model_configs['lightgbm'].get('ensemble', {})
    
    # Determine weight optimization method
    optimize_weights = xgb_ensemble_cfg.get('optimize_weights', True)
    weight_search = xgb_ensemble_cfg.get('weight_search', 'grid')
    
    print(f"  XGBoost weight range: {xgb_ensemble_cfg.get('weight_range', [0.4, 0.8])}")
    print(f"  LightGBM weight range: {lgbm_ensemble_cfg.get('weight_range', [0.2, 0.6])}")
    print(f"  Optimize weights: {optimize_weights}")
    print(f"  Search method: {weight_search}")
    
    # Get GPU configs for both models
    xgb_cfg = get_gpu_model_config('xgboost', model_configs['xgboost'], USE_GPU)
    lgbm_cfg = get_gpu_model_config('lightgbm', model_configs['lightgbm'], USE_GPU)
    
    ensemble_results = {}
    
    for scenario in [1, 2]:
        print(f"\n{'='*60}")
        print(f"Scenario {scenario} Ensemble")
        print(f"{'='*60}")
        
        # Select data
        if scenario == 1:
            X_train, y_train, meta_train = X_train_s1, y_train_s1, meta_train_s1
        else:
            X_train, y_train, meta_train = X_train_s2, y_train_s2, meta_train_s2
        
        result = train_xgb_lgbm_ensemble(
            X_train=X_train,
            y_train=y_train,
            meta_train=meta_train,
            scenario=scenario,
            xgb_config=xgb_cfg,
            lgbm_config=lgbm_cfg,
            run_config=run_config,
            n_folds=N_FOLDS,
            optimize_weights=optimize_weights,
            save_dir=RUN_DIR / f'ensemble_s{scenario}'
        )
        
        ensemble_results[scenario] = result
        
        print(f"\n✅ Scenario {scenario} Ensemble Results:")
        print(f"   XGBoost:  {result['xgb_metric']:.4f}")
        print(f"   LightGBM: {result['lgbm_metric']:.4f}")
        print(f"   Ensemble: {result['ensemble_metric']:.4f}")
        print(f"   Weights:  XGB={result['weights'][0]:.2f}, LGBM={result['weights'][1]:.2f}")
        
        clear_memory()
    
    # Save ensemble configuration for inference
    import json
    ensemble_output = {
        's1_weights': list(ensemble_results[1]['weights']),
        's2_weights': list(ensemble_results[2]['weights']),
        's1_xgb_metric': ensemble_results[1]['xgb_metric'],
        's1_lgbm_metric': ensemble_results[1]['lgbm_metric'],
        's1_ensemble_metric': ensemble_results[1]['ensemble_metric'],
        's2_xgb_metric': ensemble_results[2]['xgb_metric'],
        's2_lgbm_metric': ensemble_results[2]['lgbm_metric'],
        's2_ensemble_metric': ensemble_results[2]['ensemble_metric'],
    }
    with open(RUN_DIR / 'ensemble_config.json', 'w') as f:
        json.dump(ensemble_output, f, indent=2)
    
    print("\n" + "="*60)
    print("📊 ENSEMBLE SUMMARY")
    print("="*60)
    for s in [1, 2]:
        r = ensemble_results[s]
        # Signed difference; for an error metric such as PE, a negative value means the ensemble is better
        delta_vs_xgb = r['ensemble_metric'] - r['xgb_metric']
        print(f"  Scenario {s}: {r['ensemble_metric']:.4f}")
        print(f"    XGB: {r['xgb_metric']:.4f}, LGBM: {r['lgbm_metric']:.4f}")
        print(f"    Weights: XGB={r['weights'][0]:.0%}, LGBM={r['weights'][1]:.0%}")
        print(f"    Ensemble minus XGB alone: {delta_vs_xgb:+.4f}")
else:
    print("ℹ️ Set RUN_ENSEMBLE = True to train XGBoost + LightGBM ensemble")
    print("   Ensemble settings are in configs/model_xgb.yaml and configs/model_lgbm.yaml")
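
The ensemble cell reads `optimize_weights`, `weight_search` (default `'grid'`) and a weight range from the configs, but the search itself happens inside `train_xgb_lgbm_ensemble` and is not shown in this notebook. As a rough sketch of what a grid search over a two-model blend can look like, assuming the LightGBM weight is `1 - w` and using mean absolute error as a stand-in for the official metric (`search_ensemble_weight` and `mae` are hypothetical helpers, and the arrays are toy data):

```python
import numpy as np

def mae(y_true, y_pred):
    """Stand-in metric (lower is better); the official metric is not reproduced here."""
    return float(np.mean(np.abs(y_true - y_pred)))

def search_ensemble_weight(y_true, pred_xgb, pred_lgbm, metric,
                           weight_range=(0.4, 0.8), step=0.05):
    """Grid-search the XGBoost weight w; LightGBM receives 1 - w."""
    lo, hi = weight_range
    # Half a step of slack so the upper bound is included despite float rounding.
    candidates = np.round(np.arange(lo, hi + step / 2, step), 4)
    scores = [metric(y_true, w * pred_xgb + (1 - w) * pred_lgbm) for w in candidates]
    best = int(np.argmin(scores))
    w_best = float(candidates[best])
    return (w_best, 1.0 - w_best), float(scores[best])

# Toy out-of-fold predictions: one model biased high, the other biased low.
y_true = np.array([1.0, 0.8, 0.6, 0.5])
pred_xgb = y_true + 0.1
pred_lgbm = y_true - 0.1

weights, score = search_ensemble_weight(y_true, pred_xgb, pred_lgbm, mae)
print(f"XGB={weights[0]:.2f}, LGBM={weights[1]:.2f}, metric={score:.4f}")
```

Searching weights on out-of-fold predictions rather than on training-set predictions keeps the selected weights from simply favoring the model that overfits more.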

## 7. Generate Submission

Generate predictions on test data and create submission files.
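
The prediction cell below loads one model per fold, predicts on the test features, and averages the fold predictions row by row, falling back to a constant prediction of 1.0 when no fold model is found. A minimal sketch of that averaging step with made-up numbers (`average_fold_predictions` is a hypothetical helper, not a function from the repository):

```python
import numpy as np

def average_fold_predictions(preds_list, n_rows, fallback=1.0):
    """Average per-fold predictions; use a constant baseline if the list is empty."""
    if not preds_list:
        return np.full(n_rows, fallback)
    return np.mean(preds_list, axis=0)

# Toy predictions from three fold models for three test rows.
fold_preds = [
    np.array([0.9, 0.7, 0.5]),
    np.array([1.1, 0.6, 0.4]),
    np.array([1.0, 0.8, 0.6]),
]

avg = average_fold_predictions(fold_preds, n_rows=3)
baseline = average_fold_predictions([], n_rows=3)
print(avg, baseline)
```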

In [None]:
# ==============================================================================
# 7.1 Build Test Features and Generate Predictions
# ==============================================================================
import joblib
from glob import glob

print("📤 Generating submission...")

# Build test features for Scenario 1
print("  Building S1 test features...")
X_test_s1, _, meta_test_s1 = get_features(
    split='test', scenario=1, mode='test',
    data_config=data_config, features_config=features_config,
    use_cache=True
)

# Build test features for Scenario 2  
print("  Building S2 test features...")
X_test_s2, _, meta_test_s2 = get_features(
    split='test', scenario=2, mode='test',
    data_config=data_config, features_config=features_config,
    use_cache=True
)

# Determine model file extension based on model type
MODEL_EXTENSIONS = {
    'catboost': 'model.cbm',
    'xgboost': 'model.json',
    'lightgbm': 'model.txt',
}
model_ext = MODEL_EXTENSIONS.get(MODEL_TYPE, 'model.bin')

print(f"  Loading {MODEL_TYPE.upper()} models and predicting...")

# Scenario 1 predictions (average across folds)
s1_preds_list = []
s1_model_dir = RUN_DIR / 'models_s1'
for fold_path in sorted(s1_model_dir.glob('fold_*')):
    model_files = list(fold_path.glob('model.*'))
    if model_files:
        model_path = model_files[0]
        model = ModelClass.load(str(model_path), current_model_config)
        preds = model.predict(X_test_s1)
        s1_preds_list.append(preds)
        print(f"    Loaded {model_path.name} from fold_{fold_path.name.split('_')[-1]}")

if s1_preds_list:
    s1_test_preds = np.mean(s1_preds_list, axis=0)
else:
    print("  ⚠️ No S1 models found, using baseline predictions")
    s1_test_preds = np.ones(len(X_test_s1))

# Scenario 2 predictions (average across folds)
s2_preds_list = []
s2_model_dir = RUN_DIR / 'models_s2'
for fold_path in sorted(s2_model_dir.glob('fold_*')):
    model_files = list(fold_path.glob('model.*'))
    if model_files:
        model_path = model_files[0]
        model = ModelClass.load(str(model_path), current_model_config)
        preds = model.predict(X_test_s2)
        s2_preds_list.append(preds)
        print(f"    Loaded {model_path.name} from fold_{fold_path.name.split('_')[-1]}")

if s2_preds_list:
    s2_test_preds = np.mean(s2_preds_list, axis=0)
else:
    print("  ⚠️ No S2 models found, using baseline predictions")
    s2_test_preds = np.ones(len(X_test_s2))

print(f"\n  S1 predictions: {len(s1_test_preds):,}")
print(f"  S2 predictions: {len(s2_test_preds):,}")

### 7.2 Create Submission File
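
The models predict a normalized target, so the cell below multiplies each prediction by the row's `avg_vol_12m` to recover volume and then clips negatives to zero. A small sketch of that conversion on a toy frame (the rows and values are invented for illustration; only the column names come from the notebook):

```python
import numpy as np
import pandas as pd

# Toy metadata with the columns used by the submission cell.
meta = pd.DataFrame({
    "country": ["C1", "C1", "C2"],
    "brand_name": ["B1", "B1", "B2"],
    "months_postgx": [0, 1, 0],
    "avg_vol_12m": [1000.0, 1000.0, 250.0],
})
y_norm_pred = np.array([0.8, -0.05, 0.4])  # normalized predictions, one per row

submission = meta[["country", "brand_name", "months_postgx"]].copy()
# volume = y_norm * avg_vol_12m, with negative volumes clipped to 0
submission["volume"] = (y_norm_pred * meta["avg_vol_12m"].to_numpy()).clip(min=0)

# Simple sanity check: each (country, brand_name, months_postgx) key appears once.
has_duplicates = submission.duplicated(["country", "brand_name", "months_postgx"]).any()
print(submission)
```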

In [None]:
# ==============================================================================
# 7.2 Create and Save Submission
# ==============================================================================

# Create submission dataframes
submission_s1 = meta_test_s1[['country', 'brand_name', 'months_postgx']].copy()
submission_s1['volume'] = s1_test_preds * meta_test_s1['avg_vol_12m'].values  # Convert y_norm to volume

submission_s2 = meta_test_s2[['country', 'brand_name', 'months_postgx']].copy()
submission_s2['volume'] = s2_test_preds * meta_test_s2['avg_vol_12m'].values

# Combine submissions
submission = pd.concat([submission_s1, submission_s2], ignore_index=True)

# Clip negative volumes to 0
submission['volume'] = submission['volume'].clip(lower=0)

# Validate submission format
is_valid, issues = validate_submission_format(submission)
if is_valid:
    print("✅ Submission format validated")
else:
    print(f"⚠️ Validation issues: {issues}")

# Save submission
submission_path = Path(SUBMISSIONS_PATH) / f"submission_{RUN_ID}.csv"
submission.to_csv(submission_path, index=False)

print(f"\n📄 Submission saved to: {submission_path}")
print(f"  Shape: {submission.shape}")
print(f"  Columns: {list(submission.columns)}")

# Statistics
print("\n📊 Submission Statistics:")
print(f"  Volume min: {submission['volume'].min():.2f}")
print(f"  Volume max: {submission['volume'].max():.2f}")
print(f"  Volume mean: {submission['volume'].mean():.2f}")
print(f"  Volume median: {submission['volume'].median():.2f}")

# Preview
print("\n📋 Preview:")
display(submission.head(10))

In [None]:
# ==============================================================================
# 7.3 Download Submission (Colab only)
# ==============================================================================
if IN_COLAB:
    print("📥 Downloading submission file...")
    from google.colab import files
    files.download(str(submission_path))
    print("✅ Download complete!")
else:
    print(f"📄 Submission available at: {submission_path}")

# Also sync to Drive if in Colab
if IN_COLAB:
    # Copy to Drive submissions folder. Skip the copy when the file was already
    # written there: shutil.copy raises SameFileError if source and destination match.
    import shutil
    drive_submission_path = Path(SUBMISSIONS_PATH) / f"submission_{RUN_ID}.csv"
    if drive_submission_path.resolve() != Path(submission_path).resolve():
        shutil.copy(str(submission_path), str(drive_submission_path))
    print(f"☁️ Saved to Google Drive: {drive_submission_path}")

## 8. Utilities

Helper functions for common operations.

In [None]:
# ==============================================================================
# 8.1 Utility Functions
# ==============================================================================

def show_memory():
    """Display current memory usage."""
    mem = get_memory_usage()
    nan = float('nan')  # numeric fallback so the format specs below never fail on a missing key
    print("💾 Memory Usage:")
    print(f"  Process: {mem.get('process_rss_gb', nan):.2f} GB")
    print(f"  System: {mem.get('system_used_percent', nan):.1f}% used")
    if 'gpu_allocated_gb' in mem:
        print(f"  GPU: {mem['gpu_allocated_gb']:.2f} GB allocated")

def free_memory():
    """Free unused memory."""
    before = get_memory_usage().get('process_rss_gb', 0)
    clear_memory()
    after = get_memory_usage().get('process_rss_gb', 0)
    print(f"🧹 Freed {before - after:.2f} GB")

def download_artifacts():
    """Download all artifacts as a zip file (Colab only)."""
    if not IN_COLAB:
        print(f"📁 Artifacts at: {RUN_DIR}")
        return
    
    import shutil
    zip_path = f"/content/artifacts_{RUN_ID}.zip"
    shutil.make_archive(zip_path.replace('.zip', ''), 'zip', str(RUN_DIR))
    
    from google.colab import files
    files.download(zip_path)
    print(f"📦 Downloaded: artifacts_{RUN_ID}.zip")

def restart_runtime():
    """Restart Colab runtime to free memory."""
    if IN_COLAB:
        import os
        os.kill(os.getpid(), 9)

def check_gpu():
    """Check GPU status and memory."""
    print("🖥️ GPU Status:")
    # The fallback runs in the shell, so it must be a shell command (echo), not Python's print()
    !nvidia-smi --query-gpu=name,memory.used,memory.total,utilization.gpu --format=csv,noheader 2>/dev/null || echo "  GPU not available"

print("🛠️ Utility functions available:")
print("  - show_memory(): Display memory usage")
print("  - free_memory(): Free unused memory")
print("  - check_gpu(): Check GPU status and memory")
print("  - download_artifacts(): Download all run artifacts")
print("  - restart_runtime(): Restart Colab runtime")