<a href="https://colab.research.google.com/github/arvindsuresh-math/Fall-2025-Team-Big-Data/blob/main/final_notebooks/nn_models_toronto.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# 1. Baseline Neural Network Model

### Objective

In this notebook, we train and evaluate a fully-connected deep learning model. It serves as a performance baseline for our more complex, interpretable `AdditiveModel`, which is trained in Part 2 on the same data.

The architecture is a standard Multi-Layer Perceptron (MLP). It takes all available features (location, size, quality, and text embeddings), concatenates them into a single vector, and passes that vector through several fully-connected layers to predict the final log-price deviation. Dropout, Batch Normalization, and Weight Decay are used to limit overfitting.
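The concatenate-then-MLP idea can be sketched as follows. This is a minimal illustration, not the project's `BaselineModel`: the class name `TinyMLP`, the hidden widths, and the 10-dimensional size/quality block are assumptions chosen for the example. The 32-dimensional location block mirrors `GEO_EMBEDDING_DIM`, the dropout rate mirrors `DROPOUT_RATE`, and 384 is the embedding width of `BAAI/bge-small-en-v1.5`.

```python
import torch
import torch.nn as nn


class TinyMLP(nn.Module):
    """Hypothetical stand-in for a concatenate-then-MLP baseline."""

    def __init__(self, input_dim: int, hidden_dims: list[int], dropout: float):
        super().__init__()
        layers = []
        prev = input_dim
        for width in hidden_dims:
            # Linear -> BatchNorm -> ReLU -> Dropout for each hidden layer
            layers += [nn.Linear(prev, width), nn.BatchNorm1d(width), nn.ReLU(), nn.Dropout(dropout)]
            prev = width
        layers.append(nn.Linear(prev, 1))  # single regression output
        self.net = nn.Sequential(*layers)

    def forward(self, feature_blocks: list[torch.Tensor]) -> torch.Tensor:
        # Concatenate all feature blocks along the feature axis
        x = torch.cat(feature_blocks, dim=1)
        return self.net(x).squeeze(-1)


torch.manual_seed(0)
batch = 8
location = torch.randn(batch, 32)        # assumed location block
size_quality = torch.randn(batch, 10)    # assumed size/quality block
text_embedding = torch.randn(batch, 384)  # sentence-embedding block

model = TinyMLP(input_dim=32 + 10 + 384, hidden_dims=[128, 64], dropout=0.2)
model.eval()  # disables dropout and uses BatchNorm running statistics
with torch.no_grad():
    prediction = model([location, size_quality, text_embedding])
print(prediction.shape)
```

Weight decay does not appear in the module itself; it is applied by the optimizer, as in the `AdamW` setup later in the notebook.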

In [1]:
# --- Mount Google Drive ---
from google.colab import drive
drive.mount('/content/drive')

# --- Change Directory to Project Folder ---
import os

# IMPORTANT: Make sure this path matches the location of your project folder in Google Drive
PROJECT_PATH = '/content/drive/MyDrive/Airbnb_Price_Project'
os.chdir(PROJECT_PATH)
print(f"Current working directory: {os.getcwd()}")

Drive already mounted at /content/drive; to attempt to forcibly remount, call drive.mount("/content/drive", force_remount=True).
Current working directory: /content/drive/MyDrive/Airbnb_Price_Project


In [2]:
# --- Hugging Face Authentication ---
from google.colab import userdata
from huggingface_hub import login
print("\nAttempting Hugging Face login...")
try:
    HF_TOKEN = userdata.get('HF_TOKEN')
    login(token=HF_TOKEN)
    print("Hugging Face login successful.")
except Exception as e:
    print(f"Could not log in. Please ensure 'HF_TOKEN' is a valid secret. Error: {e}")


Attempting Hugging Face login...
Hugging Face login successful.


In [3]:
print("--- Installing required packages ---")
!pip install -q pandas pyarrow sentence-transformers scikit-learn torch tqdm transformers matplotlib seaborn

print("Package installation complete.")

--- Installing required packages ---
Package installation complete.


In [4]:
import os
import pickle
import random
import pandas as pd
import numpy as np
import torch
import torch.optim as optim
from torch.utils.data import DataLoader
from transformers import AutoTokenizer

# --- Custom Project Scripts ---
from config import config
from data_processing import load_and_split_data, FeatureProcessor, create_dataloaders
# Import BOTH model classes and the dataset
from model import BaselineModel, AdditiveModel, AirbnbPriceDataset
from train import train_model
# Import the inference function
from inference import run_inference
from build_app_dataset import build_dataset, create_full_panel_dataset

In [5]:
def set_seed(seed: int):
    """
    Sets random seeds for numpy, torch, and Python's random module to ensure
    reproducible results across runs.
    """
    random.seed(seed)
    np.random.seed(seed)
    torch.manual_seed(seed)
    if torch.cuda.is_available():
        torch.cuda.manual_seed(seed)
        torch.cuda.manual_seed_all(seed)
        # These settings are needed for full determinism with CUDA
        torch.backends.cudnn.deterministic = True
        torch.backends.cudnn.benchmark = False
    print(f"All random seeds set to {seed}.")

set_seed(config["SEED"])
print("-"*60)
print("Current configuration:")
for key, value in config.items():
    print(f"{key}: {value}")

All random seeds set to 42.
------------------------------------------------------------
Current configuration:
CITY: toronto
DEVICE: cuda
DRIVE_SAVE_PATH: /content/drive/MyDrive/Airbnb_Price_Project/artifacts/
TEXT_MODEL_NAME: BAAI/bge-small-en-v1.5
VAL_SIZE: 0.05
SEED: 42
BATCH_SIZE: 256
VALIDATION_BATCH_SIZE: 512
LEARNING_RATE: 0.001
TRANSFORMER_LEARNING_RATE: 1e-05
N_EPOCHS: 100
WEIGHT_DECAY: 0.0001
DROPOUT_RATE: 0.2
GEO_EMBEDDING_DIM: 32
HIDDEN_LAYERS_LOCATION: [32, 16]
HIDDEN_LAYERS_SIZE_CAPACITY: [32, 16]
HIDDEN_LAYERS_QUALITY: [32, 16]
HIDDEN_LAYERS_AMENITIES: [64, 32]
HIDDEN_LAYERS_DESCRIPTION: [64, 32]
HIDDEN_LAYERS_SEASONALITY: [16]
EARLY_STOPPING_PATIENCE: 3
EARLY_STOPPING_MIN_DELTA: 0.0
SCHEDULER_PATIENCE: 2
SCHEDULER_FACTOR: 0.5


## Data Loading and Preprocessing

We begin by loading the dataset and performing our custom stratified group split. This method ensures that all records for a single listing (`listing_id`) are confined to either the training or the validation set, which is crucial for preventing data leakage and obtaining a reliable performance estimate.
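The grouping constraint can be illustrated with a small standard-library sketch. It is not the project's `load_and_split_data`: the helper `group_split` and the toy records are hypothetical, and the stratification step is omitted for brevity. The point is only that the split is drawn over listing IDs, not over rows.

```python
import random


def group_split(records, group_key, val_fraction, seed):
    """Split records so that each group lands entirely in train or validation."""
    group_ids = sorted({r[group_key] for r in records})
    rng = random.Random(seed)
    rng.shuffle(group_ids)
    # Sample whole groups for validation, never individual rows
    n_val = max(1, round(len(group_ids) * val_fraction))
    val_groups = set(group_ids[:n_val])
    train = [r for r in records if r[group_key] not in val_groups]
    val = [r for r in records if r[group_key] in val_groups]
    return train, val


# Toy panel: 20 listings with 12 monthly rows each (hypothetical data)
records = [{"listing_id": lid, "month": m} for lid in range(20) for m in range(1, 13)]
toy_train_rows, toy_val_rows = group_split(records, "listing_id", val_fraction=0.05, seed=42)

toy_train_ids = {r["listing_id"] for r in toy_train_rows}
toy_val_ids = {r["listing_id"] for r in toy_val_rows}
print(len(toy_train_ids), len(toy_val_ids), toy_train_ids & toy_val_ids)
# prints: 19 1 set()
```

A row-level random split of the same panel would almost certainly place some months of a listing in training and others in validation, which is the leakage the grouped split avoids.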

Once split, we instantiate and `fit` our `FeatureProcessor` exclusively on the training data. This learns the necessary vocabularies and scaling parameters, which are then used to `transform` both the training and validation sets into numerical tensors ready for the model.
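The fit-on-train, transform-both pattern looks roughly like this. `ToyProcessor` is a hypothetical, much-reduced analogue of `FeatureProcessor`, and the column names `room_type` and `accommodates` are assumptions made for the example.

```python
import statistics


class ToyProcessor:
    """Hypothetical, much-reduced analogue of a fit/transform feature processor."""

    def fit(self, rows):
        # Vocabulary: index 0 is reserved for categories unseen during fit
        categories = sorted({r["room_type"] for r in rows})
        self.vocab = {c: i + 1 for i, c in enumerate(categories)}
        # Scaling parameters are learned from the training rows only
        values = [r["accommodates"] for r in rows]
        self.mean = statistics.fmean(values)
        self.std = statistics.pstdev(values) or 1.0
        return self

    def transform(self, rows):
        return [
            {
                "room_type_idx": self.vocab.get(r["room_type"], 0),
                "accommodates_scaled": (r["accommodates"] - self.mean) / self.std,
            }
            for r in rows
        ]


toy_train = [
    {"room_type": "Entire home/apt", "accommodates": 2},
    {"room_type": "Private room", "accommodates": 4},
    {"room_type": "Entire home/apt", "accommodates": 6},
]
toy_val = [{"room_type": "Shared room", "accommodates": 4}]

toy_processor = ToyProcessor().fit(toy_train)
toy_val_features = toy_processor.transform(toy_val)
print(toy_val_features)
# prints: [{'room_type_idx': 0, 'accommodates_scaled': 0.0}]
```

The validation row contains a category the processor never saw, so it maps to the reserved index 0, and its numeric value is scaled with the training mean and standard deviation. No statistic is ever computed from validation data.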

In [6]:
# Load and split the data
train_df, val_df, neighborhood_log_means, train_ids, val_ids = load_and_split_data(config)

# Instantiate and fit the feature processor on the training data
processor = FeatureProcessor(config)
processor.fit(train_df)

# Transform both datasets into feature dictionaries
train_features = processor.transform(train_df, neighborhood_log_means)
val_features = processor.transform(val_df, neighborhood_log_means)

# Create the PyTorch DataLoaders
train_loader, val_loader = create_dataloaders(train_features, val_features, config)

print("\nData pipeline complete. DataLoaders are ready for training.")

Data split: 82,065 train records, 4,327 validation records.

Data pipeline complete. DataLoaders are ready for training.


---
# Part 1: Baseline Model
---

First, we train the `BaselineModel`. This involves initializing the model and a standard optimizer, running the training loop, and saving all the necessary artifacts for later analysis.

In [7]:
# Instantiate the baseline model
baseline_model = BaselineModel(processor, config)
baseline_model.to(config['DEVICE'])

# Instantiate the optimizer with weight decay for regularization
baseline_optimizer = optim.AdamW(
    baseline_model.parameters(),
    lr=config['LEARNING_RATE'],
    weight_decay=config['WEIGHT_DECAY']
)

# Instantiate the learning rate scheduler
baseline_scheduler = optim.lr_scheduler.ReduceLROnPlateau(
    baseline_optimizer,
    mode='min',
    factor=config['SCHEDULER_FACTOR'],
    patience=config['SCHEDULER_PATIENCE']
)

print("BaselineModel and its optimizer/scheduler have been initialized.")

BaselineModel and its optimizer/scheduler have been initialized.
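To see what `ReduceLROnPlateau` does with `SCHEDULER_PATIENCE = 2` and `SCHEDULER_FACTOR = 0.5`, the scheduler can be driven by hand on a dummy parameter. The metric sequence below is made up for illustration; which metric `train_model` actually passes to `scheduler.step(...)` is not shown in this notebook.

```python
import torch
from torch.optim import AdamW
from torch.optim.lr_scheduler import ReduceLROnPlateau

# A single dummy parameter is enough to observe the schedule
dummy_param = torch.nn.Parameter(torch.zeros(1))
demo_optimizer = AdamW([dummy_param], lr=1e-3, weight_decay=1e-4)
demo_scheduler = ReduceLROnPlateau(demo_optimizer, mode="min", factor=0.5, patience=2)

# Hypothetical monitored metric: improves once, then stalls
metric_per_epoch = [0.40, 0.35, 0.35, 0.35, 0.35]
lrs = []
for metric in metric_per_epoch:
    demo_scheduler.step(metric)
    lrs.append(demo_optimizer.param_groups[0]["lr"])
print(lrs)
```

The scheduler tolerates two stalled epochs and halves the learning rate on the third, so the rate drops from 0.001 to 0.0005 on the fifth call.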


In [8]:
trained_baseline_model, baseline_history_df = train_model(
    model=baseline_model,
    train_loader=train_loader,
    val_loader=val_loader,
    optimizer=baseline_optimizer,
    scheduler=baseline_scheduler,
    config=config
)


--- Starting Training for BaselineModel on TORONTO ---
Epoch |     Time |   Train RMSE | Train MAPE (%) |   Val RMSE | Val MAPE (%) | MAPE Gap (%) | Patience
------------------------------------------------------------------------------------------------------
    1 | 00:01:16 |       0.3695 |          31.44 |     0.3327 |        26.41 |        -5.03 |        0
    2 | 00:02:31 |       0.3195 |          26.48 |     0.3314 |        26.66 |         0.18 |        0
    3 | 00:03:45 |       0.3015 |          24.87 |     0.3318 |        28.18 |         3.31 |        0
    4 | 00:05:00 |       0.2876 |          23.61 |     0.3319 |        26.71 |         3.11 |        0
    5 | 00:06:15 |       0.2735 |          22.35 |     0.3302 |        26.76 |         4.41 |        1
    6 | 00:07:29 |       0.2661 |          21.66 |     0.3310 |        26.99 |         5.33 |        2
    7 | 00:08:43 |       0.2621 |          21.34 |     0.3296 |        27.23 |         5.89 |        3
--- Early Stopping Triggered (MAPE Gap exceeded 4% for 3 epochs) ---

--- Training Complete ---
Loading best model state from file with Train MAPE: 23.61% (and MAPE Gap: 3.11%)


## Final Performance Metrics

With training complete, `train_model` has returned the model with the weights from its best epoch under our custom criterion: the lowest `train_mape` among epochs whose MAPE gap (validation MAPE minus training MAPE) is under 4 percentage points.
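The criterion can be replayed on the numbers from the BaselineModel training log above. This is a sketch of the rule as the log suggests it, not the code inside `train_model`: in particular, resetting the patience counter whenever a new best epoch is found is an assumption that happens to reproduce the logged `Patience` column.

```python
# (epoch, train MAPE %, val MAPE %) copied from the BaselineModel training log
history = [
    (1, 31.44, 26.41),
    (2, 26.48, 26.66),
    (3, 24.87, 28.18),
    (4, 23.61, 26.71),
    (5, 22.35, 26.76),
    (6, 21.66, 26.99),
    (7, 21.34, 27.23),
]
GAP_LIMIT = 4.0  # percentage points, from the early-stopping log message
PATIENCE = 3     # EARLY_STOPPING_PATIENCE in the printed config

best_epoch, best_train_mape = None, float("inf")
bad_epochs, stopped_at = 0, None
for epoch, train_mape, val_mape in history:
    gap = val_mape - train_mape
    if gap < GAP_LIMIT and train_mape < best_train_mape:
        # New best: lowest train MAPE seen so far while the gap is acceptable
        best_epoch, best_train_mape = epoch, train_mape
        bad_epochs = 0
    else:
        bad_epochs += 1
    if bad_epochs >= PATIENCE:
        stopped_at = epoch
        break

print(best_epoch, best_train_mape, stopped_at)
# prints: 4 23.61 7
```

The replay selects epoch 4 with a train MAPE of 23.61% and stops after epoch 7, which agrees with the log. Later epochs have a lower train MAPE but are excluded because their gap exceeds the limit.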

To get the definitive performance scores for our saved model, we now query the `history_df` using the **exact same logic** to identify the best epoch and extract its corresponding metrics.

In [10]:
# 1. Filter the history to find all epochs that satisfy the gap constraint (< 4%)
valid_epochs_df = baseline_history_df[baseline_history_df['mape_gap'] < 0.04]

if not valid_epochs_df.empty:
    # 2. From these valid epochs, find the index of the one with the lowest train MAPE
    best_epoch_idx = valid_epochs_df['train_mape'].idxmin()

    # 3. Select the entire row of metrics from that best epoch
    best_metrics_series = baseline_history_df.loc[best_epoch_idx]

    # 4. Convert the pandas Series to a dictionary
    final_baseline_metrics = best_metrics_series.to_dict()

    # --- Print a clear summary for verification ---
    print("\n" + "="*60)
    print(f"{'Final Baseline Model Performance Metrics':^60}")
    print(f"(Extracted from Best Epoch: {int(final_baseline_metrics['epoch']) + 1})")
    print("="*60)
    print(f"Train RMSE:      {final_baseline_metrics['train_rmse']:.4f}")
    print(f"Validation RMSE: {final_baseline_metrics['val_rmse']:.4f}")
    print("-" * 60)
    print(f"Train MAPE:      {final_baseline_metrics['train_mape'] * 100:.2f}%")
    print(f"Validation MAPE: {final_baseline_metrics['val_mape'] * 100:.2f}%")
    print(f"MAPE Gap:        {final_baseline_metrics['mape_gap'] * 100:.2f}%")
    print("=" * 60)
else:
    print("ERROR: No valid epochs found that met the <4% MAPE gap criterion.")
    # Create a dummy dictionary to prevent the next cell from crashing
    final_baseline_metrics = {}


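============================================================
          Final Baseline Model Performance Metrics          
(Extracted from Best Epoch: 4)
============================================================
Train RMSE:      0.2876
Validation RMSE: 0.3319
------------------------------------------------------------
Train MAPE:      23.61%
Validation MAPE: 26.71%
MAPE Gap:        3.11%
============================================================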


In [11]:
print("--- Preparing full panel dataset for baseline inference ---")
raw_df = pd.read_parquet(f"./{config['CITY']}_dataset_oct_20.parquet")
panel_df = create_full_panel_dataset(raw_df, train_ids, val_ids)
panel_features = processor.transform(panel_df, neighborhood_log_means)
tokenizer = AutoTokenizer.from_pretrained(config['TEXT_MODEL_NAME'], use_fast=True)
panel_dataset = AirbnbPriceDataset(panel_features, tokenizer)
panel_loader = DataLoader(panel_dataset, batch_size=config['VALIDATION_BATCH_SIZE'], shuffle=False)
predictions_df = run_inference(trained_baseline_model, panel_loader, config['DEVICE'])
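# Reset both indexes so the column-wise concat aligns rows by position
final_predictions_df = pd.concat([panel_df.reset_index(drop=True), predictions_df.reset_index(drop=True)], axis=1)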

print("\n--- Saving all baseline artifacts ---")
timestamp = pd.Timestamp.now().strftime('%Y%m%d_%H%M%S')
# The directory name includes the city and a timestamp
artifacts_dir = os.path.join(config['DRIVE_SAVE_PATH'], f"{config['CITY']}_baseline_{timestamp}")
os.makedirs(artifacts_dir, exist_ok=True)

# All filenames are prefixed with the city name
model_save_path = os.path.join(artifacts_dir, f"{config['CITY']}_baseline_model.pt")
processor_save_path = os.path.join(artifacts_dir, f"{config['CITY']}_feature_processor.pkl")
predictions_save_path = os.path.join(artifacts_dir, f"{config['CITY']}_baseline_model_predictions.parquet")

# Save model, processor, and predictions
torch.save({
    'model_state_dict': trained_baseline_model.state_dict(),
    'final_metrics': final_baseline_metrics
}, model_save_path)
with open(processor_save_path, 'wb') as f:
    pickle.dump(processor, f)
final_predictions_df.to_parquet(predictions_save_path, index=False)

print(f"Baseline artifacts for {config['CITY'].upper()} successfully saved in folder: {artifacts_dir}")

--- Preparing full panel dataset for baseline inference ---


Running Inference: 100%|██████████| 188/188 [04:26<00:00,  1.42s/it]



--- Saving all baseline artifacts ---
Baseline artifacts for TORONTO successfully saved in folder: /content/drive/MyDrive/Airbnb_Price_Project/artifacts/toronto_baseline_20251105_164700


---
# Part 2: Additive Model
---

Next, we train the `AdditiveModel`. We reuse the same data loaders so that the comparison with the baseline is fair. The key difference in this section is the optimizer setup: the pre-trained text transformer gets a lower learning rate (`TRANSFORMER_LEARNING_RATE`, 1e-05) than the rest of the model (`LEARNING_RATE`, 0.001) so that fine-tuning does not overwrite its pre-trained weights too quickly. After training, we call `build_dataset` to generate the enriched data artifact for our Streamlit application.

In [12]:
additive_model = AdditiveModel(processor, config)
additive_model.to(config['DEVICE'])

# Create parameter groups for differential learning rates
transformer_params = additive_model.text_transformer.parameters()
other_params = [p for n, p in additive_model.named_parameters() if 'text_transformer' not in n]

# Instantiate the optimizer with two parameter groups
additive_optimizer = optim.AdamW([
    {'params': other_params, 'lr': config['LEARNING_RATE'], 'weight_decay': config['WEIGHT_DECAY']},
    {'params': transformer_params, 'lr': config['TRANSFORMER_LEARNING_RATE']}
])

additive_scheduler = optim.lr_scheduler.ReduceLROnPlateau(
    additive_optimizer, mode='min', factor=config['SCHEDULER_FACTOR'], patience=config['SCHEDULER_PATIENCE']
)

print("AdditiveModel and its optimizer/scheduler have been initialized.")

AdditiveModel and its optimizer/scheduler have been initialized.
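A quick way to check a two-group optimizer like the one above is to rebuild it on a toy module and inspect `param_groups`. `ToyAdditive` and its `location_head` layer are hypothetical; only the `text_transformer` attribute name and the name-based filter are taken from the cell above.

```python
import torch.nn as nn
from torch.optim import AdamW


class ToyAdditive(nn.Module):
    """Hypothetical stand-in with a `text_transformer` submodule and one other head."""

    def __init__(self):
        super().__init__()
        self.text_transformer = nn.Linear(8, 8)
        self.location_head = nn.Linear(4, 1)


toy_model = ToyAdditive()
toy_transformer_params = list(toy_model.text_transformer.parameters())
toy_other_params = [p for n, p in toy_model.named_parameters() if "text_transformer" not in n]

toy_optimizer = AdamW([
    {"params": toy_other_params, "lr": 1e-3, "weight_decay": 1e-4},
    {"params": toy_transformer_params, "lr": 1e-5},
])

# Every parameter should sit in exactly one group
n_grouped = sum(len(g["params"]) for g in toy_optimizer.param_groups)
n_total = len(list(toy_model.parameters()))

for group in toy_optimizer.param_groups:
    print(len(group["params"]), group["lr"], group["weight_decay"])
# prints:
# 2 0.001 0.0001
# 2 1e-05 0.01
```

The second line of output shows a detail worth knowing: a group that omits `weight_decay` falls back to the optimizer-level default, which for `AdamW` is 0.01 unless a different value is passed to the constructor. If the transformer is meant to share `WEIGHT_DECAY`, the key would need to be set on that group as well.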


In [13]:
trained_additive_model, additive_history_df = train_model(
    model=additive_model,
    train_loader=train_loader,
    val_loader=val_loader,
    optimizer=additive_optimizer,
    scheduler=additive_scheduler,
    config=config
)


--- Starting Training for AdditiveModel on TORONTO ---
Epoch |     Time |   Train RMSE | Train MAPE (%) |   Val RMSE | Val MAPE (%) | MAPE Gap (%) | Patience
------------------------------------------------------------------------------------------------------
    1 | 00:01:12 |       0.4100 |          34.55 |     0.3478 |        29.84 |        -4.71 |        0
    2 | 00:02:25 |       0.3287 |          27.41 |     0.3449 |        29.37 |         1.96 |        0
    3 | 00:03:38 |       0.3100 |          25.74 |     0.3440 |        28.91 |         3.17 |        0
    4 | 00:04:50 |       0.2991 |          24.74 |     0.3441 |        29.40 |         4.67 |        1
    5 | 00:06:03 |       0.2917 |          24.01 |     0.3506 |        28.42 |         4.42 |        2
    6 | 00:07:15 |       0.2849 |          23.44 |     0.3490 |        29.88 |         6.44 |        3
--- Early Stopping Triggered (MAPE Gap exceeded 4% for 3 epochs) ---

--- Training Complete ---
Loading best model state from file with Train MAPE: 25.74% (and MAPE Gap: 3.17%)


In [14]:
# 1. Filter the history to find all epochs that satisfy the gap constraint (< 4%)
valid_epochs_df = additive_history_df[additive_history_df['mape_gap'] < 0.04]

if not valid_epochs_df.empty:
    # 2. From these valid epochs, find the index of the one with the lowest train MAPE
    best_epoch_idx = valid_epochs_df['train_mape'].idxmin()

    # 3. Select the entire row of metrics from that best epoch
    best_metrics_series = additive_history_df.loc[best_epoch_idx]

    # 4. Convert the pandas Series to a dictionary
    final_additive_metrics = best_metrics_series.to_dict()

    # --- Print a clear summary for verification ---
    print("\n" + "="*60)
    print(f"{'Final Additive Model Performance Metrics':^60}")
    print(f"(Extracted from Best Epoch: {int(final_additive_metrics['epoch']) + 1})")
    print("="*60)
    print(f"Train RMSE:      {final_additive_metrics['train_rmse']:.4f}")
    print(f"Validation RMSE: {final_additive_metrics['val_rmse']:.4f}")
    print("-" * 60)
    print(f"Train MAPE:      {final_additive_metrics['train_mape'] * 100:.2f}%")
    print(f"Validation MAPE: {final_additive_metrics['val_mape'] * 100:.2f}%")
    print(f"MAPE Gap:        {final_additive_metrics['mape_gap'] * 100:.2f}%")
    print("=" * 60)
else:
    print("ERROR: No valid epochs found for the Additive Model that met the <4% MAPE gap criterion.")
    # Create a dummy dictionary to prevent the next cell from crashing
    final_additive_metrics = {}


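============================================================
          Final Additive Model Performance Metrics          
(Extracted from Best Epoch: 3)
============================================================
Train RMSE:      0.3100
Validation RMSE: 0.3440
------------------------------------------------------------
Train MAPE:      25.74%
Validation MAPE: 28.91%
MAPE Gap:        3.17%
============================================================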


In [15]:
print("--- Building and saving final application dataset ---")
build_dataset(
    model=trained_additive_model,
    processor=processor,
    config=config,
    train_ids=train_ids,
    val_ids=val_ids
)

--- Building and saving final application dataset ---


Running Detailed Inference: 100%|██████████| 188/188 [01:50<00:00,  1.70it/s]



Successfully created application database at: /content/drive/MyDrive/Airbnb_Price_Project/artifacts/app_data/toronto_app_database.parquet


In [16]:
# We also save the core model artifacts for reproducibility.
print("\n--- Saving core additive model artifacts ---")
timestamp = pd.Timestamp.now().strftime('%Y%m%d_%H%M%S')
# The directory name includes the city and a timestamp
artifacts_dir = os.path.join(config['DRIVE_SAVE_PATH'], f"{config['CITY']}_additive_{timestamp}")
os.makedirs(artifacts_dir, exist_ok=True)

# All filenames are prefixed with the city name
model_save_path = os.path.join(artifacts_dir, f"{config['CITY']}_additive_model.pt")
processor_save_path = os.path.join(artifacts_dir, f"{config['CITY']}_feature_processor.pkl")

# Save model and processor
torch.save({
    'model_state_dict': trained_additive_model.state_dict(),
    'final_metrics': final_additive_metrics
}, model_save_path)
with open(processor_save_path, 'wb') as f:
    pickle.dump(processor, f)

print(f"Core additive artifacts for {config['CITY'].upper()} successfully saved in folder: {artifacts_dir}")
print("\n\nNotebook complete.")


--- Saving core additive model artifacts ---
Core additive artifacts for TORONTO successfully saved in folder: /content/drive/MyDrive/Airbnb_Price_Project/artifacts/toronto_additive_20251105_165611


Notebook complete.
