<a href="https://colab.research.google.com/github/arvindsuresh-math/Fall-2025-Team-Big-Data/blob/main/notebooks/modeling_oct_19_additive.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Experiment 1: Additive Model for Price-Per-Person (Baseline)

## 1. Objective

The goal of this notebook is to establish a performance baseline for predicting the `price_per_person` of an Airbnb listing. This experiment uses a purely additive, multi-axis neural network, which is designed for high explainability.

## 2. Methodology

*   **Model Architecture:** An additive neural network with 6 parallel sub-networks, one for each of the following feature axes:
    1.  Location
    2.  Size & Capacity
    3.  Quality & Reputation
    4.  Amenities (Text)
    5.  Description (Text)
    6.  Seasonality
*   **Target Variable:** `price_per_person` (untransformed, in dollars).
*   **Loss Function:** Mean Squared Error (MSE) on the untransformed target, so training optimizes dollar-space error directly. Validation error is reported as its square root (RMSE). This model does not use sample weighting.
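
Written out, the additive structure described above amounts to the following. The symbols $b$, $f_k$, and $x_k$ are notation introduced here for illustration and are not names taken from the code: $b$ stands for a global bias term, $f_k$ for the sub-network of axis $k$, and $x_k$ for that axis's features.

```latex
\hat{y}(x) = b + \sum_{k=1}^{6} f_k(x_k),
\qquad
\mathrm{MSE} = \frac{1}{N}\sum_{i=1}^{N}\bigl(\hat{y}(x^{(i)}) - y^{(i)}\bigr)^2,
\qquad
\mathrm{RMSE} = \sqrt{\mathrm{MSE}}
```

Because the square root is monotonic, MSE and RMSE over the same data share a minimiser, which is why a squared-error training loss and an RMSE report are compatible.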

## 3. Key Features

*   **Modular Code:** The notebook is structured with distinct classes for configuration, feature processing, data loading, and modeling to ensure clarity and reusability.
*   **End-to-End Pipeline:** A single `main()` function orchestrates the entire workflow from data loading to model training and artifact saving.
*   **Robust Training:** The training loop includes early stopping to prevent overfitting and saves the best model based on validation performance.
*   **Performance Analysis:** After training, the notebook reports the overall Mean Absolute Error (MAE), a true-vs-predicted scatter plot, and the Mean Absolute Percentage Error (MAPE) for each 10-dollar price-per-person bracket.

## 4. Expected Outcome

This notebook will produce a trained model artifact (`.pt` file) and a detailed performance report. This baseline will serve as the benchmark against which more complex models (e.g., those using log-transformed targets) will be compared.

### **0. Setup and Installations**

In [None]:
# --- Hugging Face Authentication (using Colab Secrets) ---
from google.colab import userdata
from huggingface_hub import login
print("Attempting Hugging Face login...")
try:
    HF_TOKEN = userdata.get('HF_TOKEN')
    login(token=HF_TOKEN)
    print("Hugging Face login successful.")
except Exception as e:
    print(f"Could not log in. Please ensure 'HF_TOKEN' is a valid secret. Error: {e}")

Attempting Hugging Face login...
Hugging Face login successful.


In [None]:
# --- Mount Google Drive ---
from google.colab import drive
print("Mounting Google Drive...")
try:
    drive.mount('/content/drive')
    print("Google Drive mounted successfully.")
except Exception as e:
    print(f"Could not mount Google Drive. Error: {e}")

Mounting Google Drive...
Mounted at /content/drive
Google Drive mounted successfully.


In [None]:
# --- Install Dependencies ---
!pip install pandas
!pip install pyarrow
!pip install sentence-transformers
!pip install scikit-learn
!pip install torch
!pip install tqdm
!pip install transformers
!pip install matplotlib
!pip install seaborn



### **1. Configuration and Helper Functions**

This section defines the experiment's configuration as a plain Python dictionary, which is easy to read and modify, along with a `set_seed` utility for reproducibility.
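
As a small illustration of what seeding buys, the sketch below reseeds two generators and checks that the same draws come back. The helper name `seed_everything_sketch` is hypothetical, and torch is left out to keep the example dependency-light.

```python
import random

import numpy as np


def seed_everything_sketch(seed):
    """Seed the stdlib and NumPy global generators (hypothetical helper)."""
    random.seed(seed)
    np.random.seed(seed)


# Draw once, reseed with the same value, and draw again.
seed_everything_sketch(42)
first = (random.random(), np.random.rand(3).tolist())

seed_everything_sketch(42)
second = (random.random(), np.random.rand(3).tolist())

print(first == second)
```

Seeding alone does not guarantee bit-identical GPU runs. That is why the notebook's own utility also adjusts the cuDNN flags.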

In [None]:
import pandas as pd
import numpy as np
import os
import torch
import torch.nn as nn
import torch.optim as optim
from torch.utils.data import Dataset, DataLoader
from sklearn.model_selection import train_test_split
from transformers import AutoTokenizer
import time
from tqdm.notebook import tqdm
import matplotlib.pyplot as plt
import seaborn as sns

In [None]:
# --- Seeding function for reproducibility ---
def set_seed(seed: int):
    """Sets random seeds for numpy and torch for reproducible results."""
    import random
    random.seed(seed)
    np.random.seed(seed)
    torch.manual_seed(seed)
    if torch.cuda.is_available():
        torch.cuda.manual_seed(seed)
        torch.cuda.manual_seed_all(seed)
        torch.backends.cudnn.deterministic = True
        torch.backends.cudnn.benchmark = False
    print(f"All random seeds set to {seed}.")

In [None]:
# --- Simplified Configuration using a Dictionary ---
config = {
    # --- Data and Environment ---
    "CITY": "nyc",
    "DEVICE": "cuda" if torch.cuda.is_available() else "cpu",
    "DRIVE_SAVE_PATH": "/content/drive/MyDrive/Colab_Notebooks/Airbnb_Project/",
    "TEXT_MODEL_NAME": 'BAAI/bge-small-en-v1.5',

    # --- Data Pre-processing ---
    "VAL_SIZE": 0.2,

    # --- Reproducibility ---
    "SEED": 42,

    # --- Model Training ---
    "BATCH_SIZE": 4096,
    "LEARNING_RATE": 1e-3,
    "N_EPOCHS": 30,

    # --- Early Stopping ---
    "EARLY_STOPPING_PATIENCE": 5,
    "EARLY_STOPPING_MIN_DELTA": 0.005, # Min improvement in MAPE is 0.5%

    # --- Learning Rate Scheduler ---
    "SCHEDULER_PATIENCE": 2,
    "SCHEDULER_FACTOR": 0.5,

    # --- Logging ---
    "LOG_EVERY_N_STEPS": 1,
}

### **2. Data Loading and Splitting**

This function loads the dataset, drops rows with a non-positive price, and performs a **stratified** train/validation split. Stratifying on neighbourhood, month, and price decile keeps the validation set representative of the training data, which is needed for a reliable measure of the model's performance.
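
To make the idea concrete, here is a minimal hand-rolled sketch of splitting by a composite key and dropping strata that are too small to split. It is not scikit-learn's algorithm (which allocates the validation share across the whole dataset), and the function name and sample keys are made up for illustration.

```python
import random
from collections import defaultdict


def stratified_split_sketch(keys, val_size=0.2, seed=42):
    """Split row indices so every kept stratum appears on both sides.

    Strata with fewer than two rows are dropped, since they cannot be split.
    """
    groups = defaultdict(list)
    for idx, key in enumerate(keys):
        groups[key].append(idx)

    rng = random.Random(seed)
    train_idx, val_idx = [], []
    for key in sorted(groups):
        members = groups[key]
        if len(members) < 2:
            continue  # singleton stratum: cannot be represented on both sides
        rng.shuffle(members)
        n_val = max(1, round(len(members) * val_size))
        n_val = min(n_val, len(members) - 1)  # keep at least one training row
        val_idx.extend(members[:n_val])
        train_idx.extend(members[n_val:])
    return sorted(train_idx), sorted(val_idx)


# Composite keys in the spirit of "<neighbourhood>_<month>_<price decile>".
keys = ["soho_1_9"] * 10 + ["harlem_7_2"] * 5 + ["astoria_3_0"]  # last one is a singleton
train_idx, val_idx = stratified_split_sketch(keys)
print(len(train_idx), len(val_idx))
```

The singleton row (index 15) ends up in neither split, which mirrors the "Removed small strata" step in the function below.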

In [None]:
def load_and_split_data(config: dict):
    """
    Loads data, drops rows with a non-positive price, and performs a stratified
    split on neighbourhood, month, and price decile.
    """
    dataset_filename = f"{config['CITY']}_dataset_oct_17.parquet"
    dataset_path = f"./{dataset_filename}"
    if not os.path.exists(dataset_path):
        raise FileNotFoundError(f"'{dataset_filename}' not found. Please upload the file.")

    print(f"Loading dataset from: {dataset_path}")
    df = pd.read_parquet(dataset_path)
    df = df[df["price_per_person"] > 0].copy()

    # Create bins for stratifying continuous price_per_person
    df['price_bin'] = pd.qcut(df['price_per_person'], q=10, labels=False, duplicates='drop')
    stratify_key = (
        df['neighbourhood_cleansed'].astype(str) + '_' +
        df['month'].astype(str) + '_' +
        df['price_bin'].astype(str)
    )

    strata_counts = stratify_key.value_counts()
    valid_strata = strata_counts[strata_counts >= 2].index
    df_filtered = df[stratify_key.isin(valid_strata)].copy()
    print(f"Removed small strata. New size: {len(df_filtered):,} records.")

    train_indices, val_indices = train_test_split(
        df_filtered.index,
        test_size=config['VAL_SIZE'],
        random_state=config['SEED'],
        stratify=stratify_key[df_filtered.index]
    )

    train_df = df_filtered.loc[train_indices].reset_index(drop=True)
    val_df = df_filtered.loc[val_indices].reset_index(drop=True)

    print(f"Split complete. Training: {len(train_df):,}, Validation: {len(val_df):,}")
    return train_df, val_df

### **3. Feature Processor**

The `FeatureProcessor` follows the `fit`/`transform` pattern: vocabularies and scaling statistics are learned from the training set only and then applied unchanged to the validation set. This prevents **data leakage**, where information from the validation set influences the training process. The transformations use vectorized NumPy operations for performance.
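
The pattern can be sketched on a single numeric column as follows. `StandardizerSketch` is a hypothetical stand-in, not the notebook's class, and the bed counts are made-up values.

```python
import numpy as np


class StandardizerSketch:
    """Hypothetical one-column illustration of fit/transform."""

    def fit(self, values):
        arr = np.asarray(values, dtype=np.float64)
        self.impute_ = float(np.nanmean(arr))        # raw mean used to fill gaps
        filled = np.where(np.isnan(arr), self.impute_, arr)
        self.mean_ = float(filled.mean())
        std = float(filled.std(ddof=1))              # sample std, the pandas default
        self.std_ = std if std > 0 else 1.0
        return self

    def transform(self, values):
        arr = np.asarray(values, dtype=np.float64)
        filled = np.where(np.isnan(arr), self.impute_, arr)
        return (filled - self.mean_) / self.std_


train_beds = [1.0, 2.0, 3.0, np.nan]
val_beds = [10.0, np.nan]

scaler = StandardizerSketch().fit(train_beds)   # statistics come from training rows only
val_scaled = scaler.transform(val_beds)         # validation rows reuse those statistics
print(scaler.mean_, val_scaled)
```

The extreme validation value (10 beds) has no effect on the stored mean, and the missing validation value is filled with the training mean, so it standardizes to zero.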

In [None]:
class FeatureProcessor:
    """
    The fit/transform pattern is a fundamental ML best practice to prevent data leakage.
    It is retained to ensure the model's evaluation is valid.
    """
    def __init__(self, embedding_dim_geo: int = 32):
        self.vocabs, self.scalers = {}, {}
        self.embedding_dim_geo = embedding_dim_geo
        self.categorical_cols = ["neighbourhood_cleansed", "property_type", "room_type"]
        self.numerical_cols = ["accommodates", "review_scores_rating", "review_scores_cleanliness",
                               "review_scores_checkin", "review_scores_communication", "review_scores_location",
                               "review_scores_value", "bedrooms", "beds", "bathrooms"]
        self.log_transform_cols = ["total_reviews"]

    def fit(self, df: pd.DataFrame):
        print("Fitting FeatureProcessor...")
        for col in self.categorical_cols:
            valid_uniques = df[col].dropna().unique().tolist()
            self.vocabs[col] = {val: i for i, val in enumerate(["<UNK>"] + sorted(valid_uniques))}
        for col in self.numerical_cols + self.log_transform_cols:
            numeric_series = pd.to_numeric(df[col], errors='coerce')
            mean_val = numeric_series.mean()
            filled_series = numeric_series.fillna(mean_val)
            vals = np.log1p(filled_series) if col in self.log_transform_cols else filled_series
            self.scalers[col] = {'mean': vals.mean(), 'std': vals.std() if vals.std() > 0 else 1.0, 'impute_raw': float(mean_val)}
        print("Fit complete.")

    def transform(self, df: pd.DataFrame) -> dict:
        df = df.copy()
        half_dim = self.embedding_dim_geo // 2
        lat = pd.to_numeric(df["latitude"], errors="coerce").fillna(0).to_numpy(dtype=np.float32)
        lon = pd.to_numeric(df["longitude"], errors="coerce").fillna(0).to_numpy(dtype=np.float32)

        def pe(arr, max_val, d):
            pos = (arr / max_val) * 10000.0
            idx = np.arange(0, d, 2, dtype=np.float32)
            div = np.exp(-(np.log(10000.0) / d) * idx).astype(np.float32)
            s = np.sin(pos[:, None] * div[None, :]).astype(np.float32)
            c = np.cos(pos[:, None] * div[None, :]).astype(np.float32)
            out = np.empty((arr.shape[0], d), dtype=np.float32)
            out[:, 0::2], out[:, 1::2] = s, c
            return out

        geo_position = np.hstack([pe(lat, 90.0, half_dim), pe(lon, 180.0, half_dim)]).astype(np.float32)
        neighbourhood = df["neighbourhood_cleansed"].map(self.vocabs["neighbourhood_cleansed"]).fillna(0).astype(np.int64)

        size_features = {
            "property_type": df["property_type"].map(self.vocabs["property_type"]).fillna(0).astype(np.int64),
            "room_type": df["room_type"].map(self.vocabs["room_type"]).fillna(0).astype(np.int64)
        }
        for col in ["accommodates", "bedrooms", "beds", "bathrooms"]:
            x = pd.to_numeric(df[col], errors='coerce').fillna(self.scalers[col]['impute_raw'])
            size_features[col] = ((x - self.scalers[col]["mean"]) / self.scalers[col]["std"]).astype(np.float32)

        quality_features = {}
        quality_num_cols = set(self.numerical_cols) - set(size_features.keys()) - set(["property_type", "room_type"])
        for col in quality_num_cols:
            x = pd.to_numeric(df[col], errors='coerce').fillna(self.scalers[col]['impute_raw'])
            quality_features[col] = ((x - self.scalers[col]["mean"]) / self.scalers[col]["std"]).astype(np.float32)

        raw = pd.to_numeric(df["total_reviews"], errors="coerce").fillna(self.scalers["total_reviews"]["impute_raw"])
        tr_log = np.log1p(raw).astype(np.float32)
        quality_features["total_reviews"] = (tr_log - self.scalers["total_reviews"]["mean"]) / self.scalers["total_reviews"]["std"]
        quality_features["host_is_superhost"] = df["host_is_superhost"].astype(np.float32)

        month = pd.to_numeric(df["month"], errors="coerce").fillna(1).to_numpy(np.float32)
        season_cyc = np.stack([np.sin(2 * np.pi * month / 12), np.cos(2 * np.pi * month / 12)], axis=1).astype(np.float32)

        return {
            "location": {"geo_position": geo_position, "neighbourhood": neighbourhood.to_numpy()},
            "size_capacity": {k: v.to_numpy() for k, v in size_features.items()},
            "quality": {k: v.to_numpy() for k, v in quality_features.items()},
            "amenities_text": df["amenities"].fillna("").tolist(),
            "description_text": df["description"].fillna("").tolist(),
            "seasonality": {"cyclical": season_cyc},
            # UPDATED: Provide both the original price and the log-transformed price
            "target_price": df["price_per_person"].to_numpy(dtype=np.float32),
            "target_log": np.log1p(df["price_per_person"]).to_numpy(dtype=np.float32),
        }

### **4. PyTorch Dataset and DataLoader**

The data pipeline in this section is deliberately simple.
*   The `Dataset` class tokenizes the text fields internally, padding or truncating each one to a fixed length.
*   The `create_dataloaders` function is a thin wrapper that creates standard `DataLoader` instances. Because every tokenized item has the same shape, default collation works and no custom collate function is needed.
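
The reason fixed-length padding removes the need for custom collation can be shown without a tokenizer. In the sketch below, `pad_or_truncate` is a hypothetical helper and the token IDs are made up. A real tokenizer also preserves special tokens when truncating, which this plain slice does not.

```python
import numpy as np


def pad_or_truncate(token_ids, max_length, pad_id=0):
    """Return (ids, attention_mask), both exactly max_length long."""
    ids = list(token_ids)[:max_length]
    mask = [1] * len(ids)
    n_pad = max_length - len(ids)
    return ids + [pad_id] * n_pad, mask + [0] * n_pad


# Three "tokenized" texts of different lengths.
samples = [[101, 7, 8, 102], [101, 102], [101, 5, 6, 7, 8, 9, 10, 102]]
padded = [pad_or_truncate(s, max_length=6) for s in samples]

# Every row now has the same length, so plain stacking forms the batch.
input_ids = np.stack([np.array(ids) for ids, _ in padded])
attention_mask = np.stack([np.array(mask) for _, mask in padded])
print(input_ids.shape, attention_mask.sum(axis=1))
```

The trade-off is wasted computation on padding for short texts, in exchange for a simpler loader.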

In [None]:
class AirbnbPriceDataset(Dataset):
    def __init__(self, features: dict, tokenizer):
        self.features = features
        self.tokenizer = tokenizer
        # transform() returns 'target_price' and 'target_log'; there is no 'target' key
        self.n_samples = len(features['target_price'])

    def __len__(self):
        return self.n_samples

    def __getitem__(self, index: int) -> dict:
        item = {
            'loc_geo_position': torch.tensor(self.features['location']['geo_position'][index], dtype=torch.float32),
            'loc_neighbourhood': torch.tensor(self.features['location']['neighbourhood'][index], dtype=torch.long),
            'season_cyclical': torch.tensor(self.features['seasonality']['cyclical'][index], dtype=torch.float32),
            # UPDATED: Pass both targets through
            'target_price': torch.tensor(self.features['target_price'][index], dtype=torch.float32),
            'target_log': torch.tensor(self.features['target_log'][index], dtype=torch.float32)
        }
        for k, v in self.features['size_capacity'].items():
            dtype = torch.long if k in ['property_type', 'room_type'] else torch.float32
            item[f'size_{k}'] = torch.tensor(v[index], dtype=dtype)
        for k, v in self.features['quality'].items():
            item[f'qual_{k}'] = torch.tensor(v[index], dtype=torch.float32)

        # Tokenization is now done here, inside the dataset
        item['amenities_tokens'] = self.tokenizer(
            self.features['amenities_text'][index],
            padding='max_length',
            truncation=True,
            max_length=128,
            return_tensors="pt"
        )
        item['description_tokens'] = self.tokenizer(
            self.features['description_text'][index],
            padding='max_length',
            truncation=True,
            max_length=256,
            return_tensors="pt"
        )
        return item

def create_dataloaders(train_features: dict, val_features: dict, config: dict):
    """Simplified DataLoader creation."""
    tokenizer = AutoTokenizer.from_pretrained(config['TEXT_MODEL_NAME'], use_fast=True)
    train_dataset = AirbnbPriceDataset(train_features, tokenizer)
    val_dataset = AirbnbPriceDataset(val_features, tokenizer)

    train_loader = DataLoader(
        train_dataset,
        batch_size=config['BATCH_SIZE'],
        shuffle=True,
        num_workers=2
    )
    val_loader = DataLoader(
        val_dataset,
        batch_size=config['BATCH_SIZE'],
        shuffle=False,
        num_workers=2
    )

    print("DataLoaders created.")
    return train_loader, val_loader

### **5. Model Architecture**

The model exposes two forward methods.
*   `forward_with_decomposition` computes the output of every subnet and returns each contribution alongside the final prediction.
*   `forward` calls that method and returns only the final price. This keeps the training loop clean and avoids code duplication while still exposing the detailed output needed for analysis.
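
The relationship between the two methods can be sketched with plain arrays. The numbers below are random stand-ins rather than model output, and the function names carry a `_sketch` suffix to mark them as illustrative.

```python
import numpy as np

rng = np.random.default_rng(0)
n_rows = 4
global_bias = 50.0  # stand-in for a mean training price

# Stand-ins for per-axis subnet outputs (three axes shown for brevity).
parts = {
    "p_location": rng.normal(0.0, 5.0, n_rows),
    "p_size_capacity": rng.normal(0.0, 5.0, n_rows),
    "p_seasonality": rng.normal(0.0, 1.0, n_rows),
}


def forward_with_decomposition_sketch(bias, parts):
    """Return the prediction together with every additive term."""
    total = bias + sum(parts.values())
    return {"predicted": total, "global_bias": np.full_like(total, bias), **parts}


def forward_sketch(bias, parts):
    """Thin wrapper: same computation, prediction only."""
    return forward_with_decomposition_sketch(bias, parts)["predicted"]


out = forward_with_decomposition_sketch(global_bias, parts)
reconstructed = out["global_bias"] + sum(out[name] for name in parts)
print(np.allclose(reconstructed, out["predicted"]))
```

Because the model is additive, the per-axis terms plus the bias always sum back to the prediction, which is what makes each contribution directly interpretable in dollars.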

In [None]:
from sentence_transformers import SentenceTransformer

class AdditiveAxisModel(nn.Module):
    def __init__(self, processor: FeatureProcessor, config: dict, mean_target_price: float):
        super().__init__()
        self.device = config['DEVICE']
        self.embed_neighbourhood = nn.Embedding(len(processor.vocabs['neighbourhood_cleansed']), 16)
        self.embed_property_type = nn.Embedding(len(processor.vocabs['property_type']), 8)
        self.embed_room_type = nn.Embedding(len(processor.vocabs['room_type']), 4)
        self.text_transformer = SentenceTransformer(config['TEXT_MODEL_NAME'], device=self.device)
        for param in self.text_transformer.parameters():
            param.requires_grad = False
        self.loc_subnet = nn.Sequential(nn.Linear(48, 32), nn.ReLU(), nn.Linear(32, 1))
        self.size_subnet = nn.Sequential(nn.Linear(16, 32), nn.ReLU(), nn.Linear(32, 1))
        self.qual_subnet = nn.Sequential(nn.Linear(8, 32), nn.ReLU(), nn.Linear(32, 1))
        self.amenities_subnet = nn.Sequential(nn.Linear(384, 32), nn.ReLU(), nn.Linear(32, 1))
        self.desc_subnet = nn.Sequential(nn.Linear(384, 32), nn.ReLU(), nn.Linear(32, 1))
        self.season_subnet = nn.Sequential(nn.Linear(2, 16), nn.ReLU(), nn.Linear(16, 1))
        self.register_buffer('global_bias', torch.tensor([mean_target_price], dtype=torch.float32))

        self.to(self.device)

    # NEW: A method that returns the full decomposition for analysis
    def forward_with_decomposition(self, batch: dict) -> dict:
        loc_input = torch.cat([batch['loc_geo_position'], self.embed_neighbourhood(batch['loc_neighbourhood'])], dim=1)
        size_input = torch.cat([
            self.embed_property_type(batch['size_property_type']), self.embed_room_type(batch['size_room_type']),
            batch['size_accommodates'].unsqueeze(1), batch['size_bedrooms'].unsqueeze(1),
            batch['size_beds'].unsqueeze(1), batch['size_bathrooms'].unsqueeze(1)], dim=1)
        qual_cols = ["review_scores_rating", "review_scores_cleanliness", "review_scores_checkin",
                     "review_scores_communication", "review_scores_location", "review_scores_value",
                     "total_reviews", "host_is_superhost"]
        qual_input = torch.cat([batch[f'qual_{c}'].unsqueeze(1) for c in qual_cols], dim=1)

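        # no_grad rather than inference_mode: inference tensors cannot be saved for
        # backward, which the trainable subnets need when autocast is disabled (CPU).
        with torch.no_grad():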
            amenities_tokens = {k: v.squeeze(1) for k, v in batch['amenities_tokens'].items()}
            desc_tokens = {k: v.squeeze(1) for k, v in batch['description_tokens'].items()}
            amenities_embed = self.text_transformer(amenities_tokens)['sentence_embedding']
            desc_embed = self.text_transformer(desc_tokens)['sentence_embedding']

        p_loc = self.loc_subnet(loc_input)
        p_size = self.size_subnet(size_input)
        p_qual = self.qual_subnet(qual_input)
        p_amenities = self.amenities_subnet(amenities_embed)
        p_desc = self.desc_subnet(desc_embed)
        p_season = self.season_subnet(batch['season_cyclical'])

        total_price = (self.global_bias + p_loc + p_size + p_qual + p_amenities + p_desc + p_season)

        return {
            'predicted': total_price.squeeze(-1),
            'global_bias': self.global_bias.expand(total_price.size(0)),
            'p_location': p_loc.squeeze(-1),
            'p_size_capacity': p_size.squeeze(-1),
            'p_quality': p_qual.squeeze(-1),
            'p_amenities': p_amenities.squeeze(-1),
            'p_description': p_desc.squeeze(-1),
            'p_seasonality': p_season.squeeze(-1),
        }

    # The standard forward pass now uses the decomposition method for clean code
    def forward(self, batch: dict) -> torch.Tensor:
        return self.forward_with_decomposition(batch)['predicted']

### **6. Training and Evaluation Functions**

This section defines the evaluation and training loops.
*   The `evaluate_model` function calculates and returns both the validation loss (MSE) and the **validation MAPE**.
*   The `train_model` function uses **validation MAPE for early stopping and learning-rate scheduling**, so model selection tracks relative error even though the training loss is squared dollar error.
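
The early-stopping rule (a patience counter plus a minimum improvement threshold) can be isolated in a few lines. The function name and the MAPE curve below are made up for illustration, and MAPE is expressed as a fraction rather than a percentage.

```python
def early_stopping_sketch(val_scores, patience=5, min_delta=0.005):
    """Return (best_epoch, stop_epoch) for per-epoch validation scores.

    Lower is better. stop_epoch is None if patience never runs out.
    """
    best, best_epoch, waited = float("inf"), None, 0
    for epoch, score in enumerate(val_scores):
        if score < best - min_delta:
            # Improvement large enough to count: record it and reset the counter.
            best, best_epoch, waited = score, epoch, 0
        else:
            waited += 1
            if waited >= patience:
                return best_epoch, epoch
    return best_epoch, None


# Improves quickly, then changes by less than min_delta per epoch.
mape_curve = [0.57, 0.55, 0.50, 0.499, 0.498, 0.497, 0.498, 0.499]
best_epoch, stop_epoch = early_stopping_sketch(mape_curve, patience=5)
print(best_epoch, stop_epoch)
```

Epochs 3 to 5 do lower the score slightly, but by less than `min_delta`, so they count toward the patience budget rather than resetting it.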

In [None]:
def evaluate_model(model, data_loader, device):
    """UPDATED: This function now returns both MSE and MAPE."""
    model.eval()
    total_loss = 0.0
    total_mape = 0.0
    with torch.no_grad():
        for batch in data_loader:
            for k, v in batch.items():
                batch[k] = v.to(device) if isinstance(v, torch.Tensor) else {sk: sv.to(device) for sk, sv in v.items()}
            targets = batch['target_price']  # dollar-space target; the dataset has no 'target' key

            # CORRECTED: Use the modern torch.amp.autocast API with device_type as the first argument.
            with torch.amp.autocast(device_type=device, dtype=torch.float16, enabled=(device=="cuda")):
                predictions = model(batch)
                loss = torch.mean((predictions - targets).float().pow(2))
                mape = (torch.abs(predictions - targets) / (targets + 1e-6)).mean()

            total_loss += loss.item()
            total_mape += mape.item()

    return total_loss / len(data_loader), total_mape / len(data_loader)


def train_model(model, train_loader, val_loader, optimizer, scheduler, config):
    """UPDATED: Early stopping is now based on validation MAPE."""
    print("\n--- Starting Model Training ---")
    history, best_val_mape = [], float('inf')
    best_model_state, patience_counter = None, 0

    # CORRECTED: The modern GradScaler does not take a 'device_type' argument.
    # It only needs to know if it's enabled.
    scaler = torch.amp.GradScaler(enabled=(config['DEVICE'] == "cuda"))

    for epoch in range(config['N_EPOCHS']):
        model.train()
        for batch in tqdm(train_loader, desc=f"Epoch {epoch+1}/{config['N_EPOCHS']}"):
            for k, v in batch.items():
                batch[k] = v.to(config['DEVICE']) if isinstance(v, torch.Tensor) else {sk: sv.to(config['DEVICE']) for sk, sv in v.items()}
            targets = batch['target_price']  # dollar-space target; the dataset has no 'target' key

            # CORRECTED: Use the modern torch.amp.autocast API with device_type as the first argument.
            with torch.amp.autocast(device_type=config['DEVICE'], dtype=torch.float16, enabled=(config['DEVICE']=="cuda")):
                preds = model(batch)
                loss = torch.mean((preds - targets).float().pow(2))

            optimizer.zero_grad()
            scaler.scale(loss).backward()
            scaler.step(optimizer)
            scaler.update()

        val_mse, val_mape = evaluate_model(model, val_loader, config['DEVICE'])
        val_rmse = np.sqrt(val_mse)
        history.append({'epoch': epoch, 'val_rmse': val_rmse, 'val_mape': val_mape})
        print(f"Epoch {epoch+1}/{config['N_EPOCHS']}, Val RMSE: ${val_rmse:.2f}, Val MAPE: {val_mape*100:.2f}%")

        scheduler.step(val_mape)

        if val_mape < best_val_mape - config['EARLY_STOPPING_MIN_DELTA']:
            best_val_mape = val_mape
            patience_counter = 0
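            # Snapshot the weights: state_dict() alone returns references to live tensors,
            # which later epochs would overwrite.
            best_model_state = {k: v.detach().clone() for k, v in model.state_dict().items()}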
        else:
            patience_counter += 1
            if patience_counter >= config['EARLY_STOPPING_PATIENCE']:
                print(f"--- Early Stopping Triggered (MAPE did not improve) ---")
                break

    print("\n--- Training Complete ---")
    if best_model_state:
        model.load_state_dict(best_model_state)
    return model, pd.DataFrame(history)

### **7. Inference, Saving, and Analysis**

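This section covers inference, artifact saving, and analysis.
*   The inference function, `run_inference_with_decomposition`, calls the model's `forward_with_decomposition` method. It returns a dictionary of NumPy arrays: one for the final prediction, one for the global bias, and one for each subnet.
*   The analysis function, `analyze_results`, reports the overall MAE and creates a true-vs-predicted scatter plot and the **binned MAPE visualization**.
*   The saving function stores the model weights, the fitted feature processor, and the configuration in a single `.pt` file.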
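
The binned MAPE calculation can be sketched with NumPy alone. The function name and the sample prices are illustrative, and `np.digitize` is used here as a stand-in for left-closed bucketing with `pd.cut(..., right=False)`.

```python
import numpy as np


def binned_mape_sketch(y_true, y_pred, bin_width=10.0, max_price=200.0):
    """Mean absolute percentage error per [lo, lo + bin_width) price bracket."""
    y_true = np.asarray(y_true, dtype=np.float64)
    y_pred = np.asarray(y_pred, dtype=np.float64)
    ape = np.abs(y_pred - y_true) / y_true * 100.0

    edges = np.arange(0.0, max_price + bin_width, bin_width)
    which = np.digitize(y_true, edges) - 1  # index of the bracket containing each price

    report = {}
    for b in range(len(edges) - 1):  # prices at or above max_price fall outside every bracket
        in_bin = which == b
        if in_bin.any():
            report[(float(edges[b]), float(edges[b + 1]))] = float(ape[in_bin].mean())
    return report


prices = [5.0, 8.0, 25.0, 40.0]
preds = [10.0, 8.0, 20.0, 50.0]
report = binned_mape_sketch(prices, preds)
for bracket, value in report.items():
    print(bracket, round(value, 2))
```

Low-priced brackets tend to show large percentage errors for small dollar errors, as the first bracket does here. This is worth keeping in mind when reading the bar chart.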

In [None]:
# NEW: A dedicated inference function to get the decomposition.
def run_inference_with_decomposition(model, data_loader, device):
    """Runs inference and returns a dictionary of all decomposed predictions."""
    model.eval()
    outputs = []
    with torch.no_grad():
        for batch in tqdm(data_loader, desc="Running Inference with Decomposition"):
            for k, v in batch.items():
                batch[k] = v.to(device) if isinstance(v, torch.Tensor) else {sk: sv.to(device) for sk, sv in v.items()}
            # Call the new forward method
            batch_outputs = model.forward_with_decomposition(batch)
            # Move results to CPU and append
            outputs.append({k: v.cpu() for k, v in batch_outputs.items()})

    # Concatenate results from all batches
    final_outputs = {}
    for key in outputs[0].keys():
        final_outputs[key] = torch.cat([o[key] for o in outputs]).numpy()
    return final_outputs

def save_artifacts_simple(model, processor, config):
    """Simplified artifact saving."""
    timestamp = pd.Timestamp.now().strftime('%Y%m%d_%H%M%S')
    filename = f"{config['CITY']}_model_artifacts_{timestamp}.pt"
    save_path = os.path.join(config['DRIVE_SAVE_PATH'], filename)
    os.makedirs(config['DRIVE_SAVE_PATH'], exist_ok=True)

    torch.save({
        'model_state_dict': model.state_dict(),
        'feature_processor': processor,
        'config': config
    }, save_path)
    print(f"\nArtifacts saved to {save_path}")

# UPDATED: The analysis function now includes binned MAPE plots.
def analyze_results(df_with_preds):
    """UPDATED: Simplified performance analysis with MAE, scatter plot, and binned MAPE."""
    mae = (df_with_preds['predicted'] - df_with_preds['price_per_person']).abs().mean()
    print(f"\n--- Final Performance (Validation Set) ---")
    print(f"Mean Absolute Error: ${mae:.2f}")

    # --- Scatter Plot ---
    plt.figure(figsize=(8, 8))
    sns.scatterplot(
        x='price_per_person', y='predicted', data=df_with_preds.sample(min(len(df_with_preds), 2000)),
        alpha=0.5
    )
    plt.plot([0, df_with_preds['price_per_person'].max()], [0, df_with_preds['price_per_person'].max()], color='red', linestyle='--')
    plt.title("True vs. Predicted Prices")
    plt.xlabel("True Price per Person ($)")
    plt.ylabel("Predicted Price per Person ($)")
    plt.grid(True)
    plt.show()

    # --- Binned MAPE Visualization ---
    df_analysis = df_with_preds.copy()
    df_analysis['mape'] = (df_analysis['predicted'] - df_analysis['price_per_person']).abs() / (df_analysis['price_per_person'] + 1e-6) * 100

    # Create $10 price buckets
    max_price = min(200, df_analysis['price_per_person'].max()) # Cap at $200 for readability
    bins = np.arange(0, max_price + 10, 10)
    df_analysis['price_bin'] = pd.cut(df_analysis['price_per_person'], bins=bins, right=False)

    binned_mape = df_analysis.groupby('price_bin')['mape'].mean().reset_index()

    plt.figure(figsize=(12, 6))
    sns.barplot(x='price_bin', y='mape', data=binned_mape, palette='viridis')
    plt.title("Mean Absolute Percentage Error (MAPE) by Price Bracket")
    plt.xlabel("Price per Person Bracket ($)")
    plt.ylabel("MAPE (%)")
    plt.xticks(rotation=45)
    plt.grid(axis='y')
    plt.show()

### **8. Main Orchestration and Execution**

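The `main` function ties the pipeline together: it loads and splits the data, fits the feature processor, trains the model, runs decomposition inference on the validation set, adds the prediction and per-axis contribution columns to the validation DataFrame, and passes the result to the analysis function.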

In [None]:
def main(config: dict):
    """
    Orchestrates the simplified experiment pipeline.
    """
    # 1. Load and split data
    train_df, val_df = load_and_split_data(config)
    mean_target = train_df['price_per_person'].mean()

    # 2. Setup processor and dataloaders
    processor = FeatureProcessor()
    processor.fit(train_df)
    train_features = processor.transform(train_df)
    val_features = processor.transform(val_df)
    train_loader, val_loader = create_dataloaders(train_features, val_features, config)

    # 3. Initialize model and optimizer
    model = AdditiveAxisModel(processor, config, mean_target_price=mean_target)
    optimizer = optim.AdamW(model.parameters(), lr=config['LEARNING_RATE'])
    scheduler = optim.lr_scheduler.ReduceLROnPlateau(optimizer, mode='min', factor=config['SCHEDULER_FACTOR'], patience=config['SCHEDULER_PATIENCE'])

    # 4. Train the model
    trained_model, history = train_model(model, train_loader, val_loader, optimizer, scheduler, config)

    # 5. Run final inference on the validation set to get decomposition
    decomposition_outputs = run_inference_with_decomposition(trained_model, val_loader, config['DEVICE'])

    # Add all decomposition columns to the validation dataframe
    val_df_with_preds = val_df.copy()
    for key, value in decomposition_outputs.items():
        val_df_with_preds[key] = value

    # 6. Save artifacts and analyze results
    save_artifacts_simple(trained_model, processor, config)
    analyze_results(val_df_with_preds)

In [None]:
# --- Run Full Experiment ---
set_seed(config['SEED'])
main(config)

All random seeds set to 42.
Loading dataset from: ./nyc_dataset_oct_17.parquet
Removed small strata. New size: 144,449 records.
Split complete. Training: 115,559, Validation: 28,890
Fitting FeatureProcessor...
Fit complete.
DataLoaders created.

--- Starting Model Training ---
Epoch 1/30, Val RMSE: $42.90, Val MAPE: 57.09%
Epoch 2/30, Val RMSE: $42.46, Val MAPE: 56.29%
Epoch 3/30, Val RMSE: $41.75, Val MAPE: 54.97%
Epoch 4/30, Val RMSE: $40.71, Val MAPE: 53.24%
Epoch 5/30, Val RMSE: $39.52, Val MAPE: 50.10%
Epoch 6/30, Val RMSE: $38.45, Val MAPE: 48.11%
Epoch 7/30, Val RMSE: $37.56, Val MAPE: 46.07%
Epoch 8/30, Val RMSE: $36.85, Val MAPE: 44.39%

KeyboardInterrupt: 