### **Step 1: Setup, Data Loading, and Stratified Split**

We will load the dataset as before. The key change is the splitting logic. We create a stratification key by combining `neighbourhood_cleansed` and `month`, so that each neighborhood-month combination appears in approximately the same proportion in the training and validation sets.

A crucial edge case is a stratum with only one member. When `stratify` is set, `train_test_split` raises a `ValueError` if any class has fewer than two samples, so we identify and drop these rare records before performing the split.
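
As a minimal sketch of this edge case, the snippet below builds combined keys for a handful of made-up records and separates singleton strata from splittable ones using only the standard library. The neighborhood names and months are illustrative and not taken from the dataset; the notebook cell itself does the equivalent with pandas `value_counts`.

```python
from collections import Counter

# Hypothetical (neighbourhood, month) pairs standing in for dataset rows.
records = [
    ("Harlem", 1), ("Harlem", 1), ("Harlem", 2),
    ("Astoria", 1), ("Astoria", 1), ("Astoria", 1),
]

# Combined stratification key, mirroring the "<neighbourhood>_<month>" format.
keys = [f"{hood}_{month}" for hood, month in records]
counts = Counter(keys)

# A stratum needs at least 2 members so it can appear in both splits.
kept = [k for k in keys if counts[k] >= 2]
dropped = [k for k in keys if counts[k] < 2]

print("kept:", kept)
print("dropped:", dropped)
```

Here only `Harlem_2` is dropped, because it is the sole record in its stratum.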

In [1]:
import pandas as pd
import numpy as np
import os
from sklearn.model_selection import train_test_split

# --- Configuration ---
CITY = "nyc"
DATA_DIR = os.path.expanduser(f"~/Downloads/insideairbnb/{CITY}")
DATASET_PATH = os.path.join(DATA_DIR, f"{CITY}_final_modeling_dataset.parquet")
VAL_SIZE = 0.2  # Using 20% of the data for validation
RANDOM_STATE = 42

# --- Load Data ---
print(f"Loading final dataset from: {DATASET_PATH}")
df = pd.read_parquet(DATASET_PATH)
print("Dataset loaded successfully.")

# --- Stratified Train/Validation Split ---
# To stratify by both neighborhood and month, we create a combined key.
stratify_col = df['neighbourhood_cleansed'].astype(str) + '_' + df['month'].astype(str)
print(f"\nCreated a combined stratification key with {stratify_col.nunique()} unique strata.")

# -- Handle small strata --
# train_test_split requires at least 2 members per stratum. We identify and
# filter out strata with only one sample.
strata_counts = stratify_col.value_counts()
valid_strata = strata_counts[strata_counts >= 2].index
df_filtered = df[stratify_col.isin(valid_strata)].copy()

print(f"Original dataset size: {len(df):,}")
print(f"Removed {len(df) - len(df_filtered):,} records belonging to strata with only 1 member.")
print(f"Filtered dataset size for splitting: {len(df_filtered):,}")

# Perform the stratified split on the filtered dataset
# We split the indices first to avoid copying data in memory unnecessarily
train_indices, val_indices = train_test_split(
    df_filtered.index,
    test_size=VAL_SIZE,
    random_state=RANDOM_STATE,
    stratify=stratify_col.loc[df_filtered.index]  # reuse the key built above
)

train_df = df_filtered.loc[train_indices].copy()
val_df = df_filtered.loc[val_indices].copy()

# Reset index for clean processing later
train_df.reset_index(drop=True, inplace=True)
val_df.reset_index(drop=True, inplace=True)

print("\n--- Data Split Summary ---")
print(f"Training records: {len(train_df):,}")
print(f"Validation records: {len(val_df):,}")

# --- Verification ---
# Check the distribution of 'month' in each set to verify stratification
print("\nMonth distribution in Training Set:")
print(train_df['month'].value_counts(normalize=True).sort_index())
print("\nMonth distribution in Validation Set:")
print(val_df['month'].value_counts(normalize=True).sort_index())

Loading final dataset from: /Users/arvindsuresh/Downloads/insideairbnb/nyc/nyc_final_modeling_dataset.parquet
Dataset loaded successfully.

Created a combined stratification key with 2203 unique strata.
Original dataset size: 83,218
Removed 230 records belonging to strata with only 1 member.
Filtered dataset size for splitting: 82,988

--- Data Split Summary ---
Training records: 66,390
Validation records: 16,598

Month distribution in Training Set:
month
1     0.078461
2     0.074695
3     0.067842
4     0.067826
5     0.074108
6     0.079425
7     0.079485
8     0.067601
10    0.159256
11    0.161380
12    0.089923
Name: proportion, dtype: float64

Month distribution in Validation Set:
month
1     0.078202
2     0.074407
3     0.068201
4     0.067659
5     0.074105
6     0.079528
7     0.079467
8     0.066996
10    0.159658
11    0.161827
12    0.089951
Name: proportion, dtype: float64
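
One quick way to quantify how well the split preserved the month distribution is to compare the two printed tables directly. The sketch below copies the proportions from the output above into plain dictionaries (the variable names are illustrative) and reports the month with the largest train/validation gap.

```python
# Month proportions copied from the printed output (month 9 does not appear in it).
train = {1: 0.078461, 2: 0.074695, 3: 0.067842, 4: 0.067826, 5: 0.074108,
         6: 0.079425, 7: 0.079485, 8: 0.067601, 10: 0.159256, 11: 0.161380,
         12: 0.089923}
val = {1: 0.078202, 2: 0.074407, 3: 0.068201, 4: 0.067659, 5: 0.074105,
       6: 0.079528, 7: 0.079467, 8: 0.066996, 10: 0.159658, 11: 0.161827,
       12: 0.089951}

# Absolute difference in proportion for every month present in both sets.
gaps = {m: abs(train[m] - val[m]) for m in train}
worst_month = max(gaps, key=gaps.get)

print(f"largest gap: month {worst_month}, {gaps[worst_month]:.6f}")
```

The largest gap is in month 8 and is well under 0.001, so the two sets agree on the month distribution to within a tenth of a percentage point.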


### **Step 2: The `FeatureProcessor` Class**

This is a critical stage where we translate the feature representation strategy outlined in `EMBEDDINGS.md` into code. The `FeatureProcessor` will be a reusable class, similar to a `scikit-learn` transformer. It will first learn all necessary transformations from the training data (`fit` method) and then apply these transformations to any dataset (`transform` method).

This approach ensures two things:
1.  **No Data Leakage:** Information from the validation set (like its mean or unique categories) never influences the transformations.
2.  **Consistency:** The exact same transformations are applied to both training and validation data, which is essential for the model to work correctly.

The processor will handle all transformations except for the `amenities` embedding, which will be done inside the PyTorch model itself.
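
The fit/transform separation can be illustrated on a toy scale. The sketch below is not the `FeatureProcessor` itself: the function names and the three training rows are hypothetical, and it uses the standard library instead of pandas. It shows the two behaviors that matter: statistics come from training rows only, and a category never seen during fitting maps to the `<UNK>` index 0.

```python
import statistics

def fit(train_rows):
    """Learn a vocabulary and scaling statistics from training rows only."""
    cats = sorted({r["room_type"] for r in train_rows})
    vocab = {val: i for i, val in enumerate(["<UNK>"] + cats)}
    nums = [r["accommodates"] for r in train_rows]
    # statistics.stdev is the sample standard deviation, like pandas' default.
    scaler = {"mean": statistics.mean(nums), "std": statistics.stdev(nums)}
    return vocab, scaler

def transform(rows, vocab, scaler):
    """Apply the learned mapping; unseen categories fall back to index 0."""
    return [
        (vocab.get(r["room_type"], 0),
         (r["accommodates"] - scaler["mean"]) / scaler["std"])
        for r in rows
    ]

train_rows = [
    {"room_type": "Entire home/apt", "accommodates": 2},
    {"room_type": "Private room", "accommodates": 4},
    {"room_type": "Entire home/apt", "accommodates": 6},
]
# "Shared room" never appears in the training rows.
val_rows = [{"room_type": "Shared room", "accommodates": 4}]

vocab, scaler = fit(train_rows)
print(transform(val_rows, vocab, scaler))
```

The validation row comes out as `(0, 0.0)`: its unseen room type maps to `<UNK>`, and its `accommodates` value equals the training mean.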

In [2]:
import torch
import json
import pandas as pd
import numpy as np

class FeatureProcessor:
    """
    A class to handle all feature transformations for the Airbnb pricing model.
    It learns transformations from the training data and applies them consistently.
    """
    def __init__(self, embedding_dim_geo: int = 32):  # total width of the lat/lon encoding
        self.vocabs = {}
        self.scalers = {}
        self.embedding_dim_geo = embedding_dim_geo
        
        # --- Define feature groups based on EMBEDDINGS.md ---
        self.categorical_cols = [
            "neighbourhood_cleansed", "property_type", "room_type", 
            "bathrooms_type", "bedrooms", "beds", "bathrooms_numeric"
        ]
        self.numerical_cols = [
            "accommodates", "review_scores_rating", "review_scores_cleanliness",
            "review_scores_checkin", "review_scores_communication", 
            "review_scores_location", "review_scores_value", "host_response_rate",
            "host_acceptance_rate"
        ]
        self.log_transform_cols = ["number_of_reviews_ltm"]
        self.boolean_cols = [
            "host_is_superhost", "host_identity_verified", "instant_bookable"
        ]

    def _create_positional_encoding(self, value, max_val):
        """Creates a positional encoding for a scalar value."""
        d = self.embedding_dim_geo
        if d % 2 != 0:
            raise ValueError("embedding_dim_geo must be an even number.")
            
        pe = np.zeros(d)
        position = (value / max_val) * 10000  # Scale position
        div_term = np.exp(np.arange(0, d, 2) * -(np.log(10000.0) / d))
        pe[0::2] = np.sin(position * div_term)
        pe[1::2] = np.cos(position * div_term)
        return pe

    def fit(self, df: pd.DataFrame):
        """Learns vocabularies and scaling parameters from the training data."""
        print("Fitting FeatureProcessor on training data...")
        
        # 1. Learn vocabularies for categorical features
        for col in self.categorical_cols:
            # Drop NA values before collecting unique elements; NaN mixed with
            # real values makes sorted() unreliable or raises a TypeError.
            valid_uniques = df[col].dropna().unique().tolist()
            unique_vals = ["<UNK>"] + sorted(valid_uniques)
            self.vocabs[col] = {val: i for i, val in enumerate(unique_vals)}
            
        # 2. Learn scaling parameters for numerical features
        for col in self.numerical_cols + self.log_transform_cols:
            if col in self.log_transform_cols:
                vals = np.log1p(df[col])
            else:
                vals = df[col]
            self.scalers[col] = {'mean': vals.mean(), 'std': vals.std()}
        
        print("Fit complete. Vocabularies and scalers have been learned.")

    def transform(self, df: pd.DataFrame) -> dict:
        """Applies learned transformations to a dataframe."""
        df = df.copy()
        
        # --- AXIS 1: Location ---
        df["neighbourhood_cleansed_idx"] = df["neighbourhood_cleansed"].apply(
            lambda x: self.vocabs["neighbourhood_cleansed"].get(x, 0) # 0 is '<UNK>'
        )
        # EMBEDDINGS.md specifies a 32-dim positional (cyclical) encoding for lat/lon.
        # We read that as the combined width, so each coordinate gets half of
        # embedding_dim_geo (16 dims for latitude, 16 for longitude by default).
        if self.embedding_dim_geo % 2 != 0:
            raise ValueError("embedding_dim_geo must be an even number.")
        half_dim = self.embedding_dim_geo // 2
        lat_enc = df['latitude'].apply(lambda x: self._create_positional_encoding(x, 90)[:half_dim])
        lon_enc = df['longitude'].apply(lambda x: self._create_positional_encoding(x, 180)[:half_dim])
        loc_geo = np.hstack([np.stack(lat_enc), np.stack(lon_enc)])

        location_features = {
            "geo_position": loc_geo,
            "neighbourhood": df["neighbourhood_cleansed_idx"].values
        }
        
        # --- AXIS 2: Size & Capacity ---
        size_capacity_features = {}
        for col in ["property_type", "room_type", "bathrooms_type", "bedrooms", "beds", "bathrooms_numeric"]:
            # Handle potential NaNs by mapping them to the '<UNK>' token
            df[f"{col}_idx"] = df[col].apply(lambda x: self.vocabs[col].get(x, 0) if pd.notna(x) else 0)
            size_capacity_features[col] = df[f"{col}_idx"].values
        df["accommodates_std"] = (df["accommodates"] - self.scalers["accommodates"]["mean"]) / self.scalers["accommodates"]["std"]
        size_capacity_features["accommodates"] = df["accommodates_std"].values

        # --- AXIS 3: Quality & Reputation ---
        quality_features = {}
        numerical_quality_cols = [c for c in self.numerical_cols if c != "accommodates"]
        for col in numerical_quality_cols:
            df[f"{col}_std"] = (df[col] - self.scalers[col]["mean"]) / self.scalers[col]["std"]
            quality_features[col] = df[f"{col}_std"].values
        df["number_of_reviews_ltm_log_std"] = (np.log1p(df["number_of_reviews_ltm"]) - self.scalers["number_of_reviews_ltm"]["mean"]) / self.scalers["number_of_reviews_ltm"]["std"]
        quality_features["number_of_reviews_ltm"] = df["number_of_reviews_ltm_log_std"].values
        # Booleans are passed through as 0.0/1.0 floats
        for col in self.boolean_cols:
            quality_features[col] = df[col].astype(float).values

        # --- AXIS 4: Amenities ---
        amenities_features = {"text": df["amenities"].tolist()}
        
        # --- AXIS 5: Seasonality ---
        month_sin = np.sin(2 * np.pi * df["month"] / 12)
        month_cos = np.cos(2 * np.pi * df["month"] / 12)
        seasonality_features = {"cyclical": np.vstack([month_sin, month_cos]).T}
        
        # --- Return final dictionary ---
        return {
            "location": location_features,
            "size_capacity": size_capacity_features,
            "quality": quality_features,
            "amenities": amenities_features,
            "seasonality": seasonality_features,
            "target_price": df["target_price"].values,
            "sample_weight": df["estimated_occupancy_rate"].values
        }

In [3]:
# --- Fit on the training set only, then transform both splits ---

processor = FeatureProcessor()
processor.fit(train_df)
train_features = processor.transform(train_df)
val_features = processor.transform(val_df)
print("\n--- Transformation Output Verification ---")
print("Transformed data is a dictionary with keys:", list(train_features.keys()))
print("Sample from 'location' axis features:")
print("Geo Position Shape:", train_features['location']['geo_position'].shape)

Fitting FeatureProcessor on training data...
Fit complete. Vocabularies and scalers have been learned.

--- Transformation Output Verification ---
Transformed data is a dictionary with keys: ['location', 'size_capacity', 'quality', 'amenities', 'seasonality', 'target_price', 'sample_weight']
Sample from 'location' axis features:
Geo Position Shape: (66390, 32)
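
The seasonality axis relies on the sine/cosine month encoding computed in `transform`. The standalone sketch below (helper names are illustrative) shows why this encoding is preferred over the raw month number: December and January end up as close together as any other pair of adjacent months, while months half a year apart are maximally distant.

```python
import math

def month_to_cyclical(month):
    """Map a month in 1..12 to a (sin, cos) point on the unit circle."""
    angle = 2 * math.pi * month / 12
    return math.sin(angle), math.cos(angle)

def distance(a, b):
    """Euclidean distance between two months in the encoded space."""
    return math.dist(month_to_cyclical(a), month_to_cyclical(b))

print(f"Dec vs Jan: {distance(12, 1):.3f}")
print(f"Jun vs Jul: {distance(6, 7):.3f}")
print(f"Jan vs Jul: {distance(1, 7):.3f}")
```

With the raw integer month, December (12) and January (1) would be 11 units apart, which misrepresents their seasonal similarity.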


### **Step 3: PyTorch `Dataset` and `DataLoader`**

Now that our features are processed into NumPy arrays, we need a standard way to feed them into a PyTorch model. This is accomplished in two stages:

1.  **`Dataset`:** A custom class that organizes our features and tells PyTorch how to retrieve a single data point (`__getitem__`). It returns the raw `amenities` text untouched.
2.  **`DataLoader`:** A PyTorch utility that wraps our `Dataset` and serves batches of data, handling shuffling automatically. We supply a custom `collate_fn`, which is where the just-in-time tokenization of the `amenities` text occurs, once per batch.

This approach avoids holding tokenized text for the whole dataset in memory and follows the usual PyTorch `Dataset`/`DataLoader` pattern. Our design is informed by the specifications in `EMBEDDINGS.md` regarding the use of a pre-trained transformer for amenities.
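
To make the collation step concrete, here is a torch-free sketch of the same idea using NumPy arrays in place of tensors. The `collate` function and the two sample dictionaries are hypothetical stand-ins: numeric fields are stacked along a new batch dimension, while the raw text is kept as a plain list so a tokenizer could pad it to the longest string in that batch only.

```python
import numpy as np

def collate(batch):
    """Merge a list of per-sample dicts into one dict of batched arrays."""
    # Pull the raw text out first; it cannot be stacked like numeric data.
    texts = [item["amenities_text"] for item in batch]
    numeric_keys = [k for k in batch[0] if k != "amenities_text"]
    out = {k: np.stack([item[k] for item in batch]) for k in numeric_keys}
    out["amenities_text"] = texts
    return out

# Two made-up samples with a subset of the keys a Dataset item would carry.
samples = [
    {"season_cyclical": np.array([0.5, 0.87]), "target_price": np.array(120.0),
     "amenities_text": "Wifi, Kitchen"},
    {"season_cyclical": np.array([1.0, 0.0]), "target_price": np.array(95.0),
     "amenities_text": "Wifi, Heating, Washer"},
]

batch = collate(samples)
print(batch["season_cyclical"].shape)  # per-sample vectors gain a batch axis
print(batch["target_price"].shape)     # scalars become a 1-D batch
print(batch["amenities_text"])
```

Tokenizing per batch rather than per dataset means padding length is set by the longest amenities string in each batch, not the longest in the whole dataset.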

In [7]:
import torch
from torch.utils.data import Dataset, DataLoader
from sentence_transformers import SentenceTransformer

# --- Configuration ---
# As per EMBEDDINGS.md, we use a specific pre-trained model for amenities.
TOKENIZER_MODEL = 'BAAI/bge-small-en-v1.5'
BATCH_SIZE = 256

# Device selection
if torch.cuda.is_available():
    DEVICE = "cuda"
elif torch.backends.mps.is_available():
    DEVICE = "mps"
    print("Apple MPS device found. Using MPS for acceleration.")
else:
    DEVICE = "cpu"

print(f"Using device: {DEVICE}")

# --- PyTorch Dataset Class ---

class AirbnbPriceDataset(Dataset):
    """
    Custom PyTorch Dataset to handle the transformed Airbnb feature dictionary.
    """
    def __init__(self, features: dict):
        self.features = features
        self.n_samples = len(features['target_price'])

    def __len__(self):
        """Returns the total number of samples in the dataset."""
        return self.n_samples

    def __getitem__(self, index: int) -> dict:
        """
        Retrieves all features for a single sample and converts them to Tensors.
        The raw amenities text is also returned for batch tokenization.
        """
        item = {}
        # --- Get features for each axis ---
        # Note: We keep features as separate dictionary entries for clarity in the model's forward pass.
        
        # Location
        item['loc_geo_position'] = torch.tensor(self.features['location']['geo_position'][index], dtype=torch.float32)
        item['loc_neighbourhood'] = torch.tensor(self.features['location']['neighbourhood'][index], dtype=torch.long)
        
        # Size & Capacity
        for key, val in self.features['size_capacity'].items():
            item[f'size_{key}'] = torch.tensor(val[index], dtype=torch.float32 if key == 'accommodates' else torch.long)
            
        # Quality & Reputation
        for key, val in self.features['quality'].items():
            item[f'qual_{key}'] = torch.tensor(val[index], dtype=torch.float32)

        # Amenities (pass raw text)
        item['amenities_text'] = self.features['amenities']['text'][index]

        # Seasonality
        item['season_cyclical'] = torch.tensor(self.features['seasonality']['cyclical'][index], dtype=torch.float32)
        
        # --- Get labels and weights ---
        item['target_price'] = torch.tensor(self.features['target_price'][index], dtype=torch.float32)
        item['sample_weight'] = torch.tensor(self.features['sample_weight'][index], dtype=torch.float32)
        
        return item

# --- Custom Collate Function for Batch Tokenization ---

# Instantiate the tokenizer model once
tokenizer_model = SentenceTransformer(TOKENIZER_MODEL, device=DEVICE)

def custom_collate_fn(batch: list) -> dict:
    """
    Custom collate function to handle batching and on-the-fly tokenization.
    """
    collated_batch = {}
    
    # 1. Separate raw text from other features
    amenities_texts = [item.pop('amenities_text') for item in batch]
    
    # 2. Stack the remaining tensor features key by key
    first_item = batch[0]
    for key in first_item.keys():
        collated_batch[key] = torch.stack([d[key] for d in batch])
        
    # 3. Tokenize the batch of amenities texts
    # The tokenizer handles padding and creates 'input_ids' and 'attention_mask'
    tokenized_amenities = tokenizer_model.tokenizer(
        amenities_texts, 
        padding=True, 
        truncation=True, 
        return_tensors='pt',
        max_length=128 # A reasonable max length for amenities lists
    )
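    # Store a plain dict: the tokenizer returns a BatchEncoding, which is not a
    # dict subclass, so isinstance(value, dict) checks (such as the device-transfer
    # loop in the model's forward pass) would otherwise skip it.
    collated_batch['amenities_tokens'] = dict(tokenized_amenities)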
    
    return collated_batch

# --- Instantiate Datasets and DataLoaders ---

# Create Dataset objects
train_dataset = AirbnbPriceDataset(train_features)
val_dataset = AirbnbPriceDataset(val_features)

# Create DataLoader objects
train_loader = DataLoader(
    dataset=train_dataset,
    batch_size=BATCH_SIZE,
    shuffle=True, # Shuffle training data
    collate_fn=custom_collate_fn
)
val_loader = DataLoader(
    dataset=val_dataset,
    batch_size=BATCH_SIZE,
    shuffle=False, # No need to shuffle validation data
    collate_fn=custom_collate_fn
)

print("\n--- DataLoader Verification ---")
print(f"Number of training batches: {len(train_loader)}")
print(f"Number of validation batches: {len(val_loader)}")

# --- Inspect a single batch to verify shapes and types ---
try:
    sample_batch = next(iter(train_loader))
    print("\nSample batch loaded. Verifying contents...")
    print("Batch keys:", list(sample_batch.keys()))
    
    print("\nShape of 'loc_geo_position':", sample_batch['loc_geo_position'].shape)
    print("Shape of 'size_property_type':", sample_batch['size_property_type'].shape)
    print("Shape of 'amenities_tokens.input_ids':", sample_batch['amenities_tokens']['input_ids'].shape)
    print("Shape of 'target_price':", sample_batch['target_price'].shape)
    print("Data type of 'loc_neighbourhood':", sample_batch['loc_neighbourhood'].dtype) # Should be torch.long
    print("Data type of 'target_price':", sample_batch['target_price'].dtype)     # Should be torch.float32

except Exception as e:
    print(f"\nAn error occurred while inspecting the DataLoader batch: {e}")

Apple MPS device found. Using MPS for acceleration.
Using device: mps

--- DataLoader Verification ---
Number of training batches: 260
Number of validation batches: 65

Sample batch loaded. Verifying contents...
Batch keys: ['loc_geo_position', 'loc_neighbourhood', 'size_property_type', 'size_room_type', 'size_bathrooms_type', 'size_bedrooms', 'size_beds', 'size_bathrooms_numeric', 'size_accommodates', 'qual_review_scores_rating', 'qual_review_scores_cleanliness', 'qual_review_scores_checkin', 'qual_review_scores_communication', 'qual_review_scores_location', 'qual_review_scores_value', 'qual_host_response_rate', 'qual_host_acceptance_rate', 'qual_number_of_reviews_ltm', 'qual_host_is_superhost', 'qual_host_identity_verified', 'qual_instant_bookable', 'season_cyclical', 'target_price', 'sample_weight', 'amenities_tokens']

Shape of 'loc_geo_position': torch.Size([256, 32])
Shape of 'size_property_type': torch.Size([256])
Shape of 'amenities_tokens.input_ids': torch.Size([256, 128])
Sha

### **Step 4: The Additive Model Architecture (`nn.Module`)**

We will now define the neural network. The architecture is composed of several distinct parts that mirror the feature axes:

1.  **Embedding Layers:** For all categorical features.
2.  **Pre-trained Transformer:** For the `amenities` text. We will use the `SentenceTransformer` model directly and freeze its weights for this baseline version.
3.  **Sub-Networks:** Five small, independent sub-networks, one for each axis. Four are two-layer Multi-Layer Perceptrons (MLPs); the amenities axis uses a single linear layer on top of the frozen sentence embedding. Each takes the processed features for its axis and outputs a single scalar value representing that axis's price contribution.
4.  **Aggregation:** A final summation of the outputs from the five sub-networks plus a single, learnable `Global_Bias` term.

This structure allows us to inspect the output of each sub-network individually, providing the desired explainability.
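
Because the prediction is a plain sum, an explanation for one listing is just the list of terms that were added. The sketch below uses made-up contribution values, not real model output, and a hypothetical `explain` helper to show what such a breakdown could look like.

```python
def explain(contributions, global_bias):
    """Return the predicted price together with the terms that sum to it."""
    price = global_bias + sum(contributions.values())
    return {"predicted_price": price, "global_bias": global_bias, **contributions}

# Hypothetical per-axis sub-network outputs for a single listing.
axes = {
    "location": 40.0,
    "size_capacity": 25.0,
    "quality": 5.0,
    "amenities": 10.0,
    "seasonality": -5.0,
}

report = explain(axes, global_bias=100.0)
for name, value in report.items():
    print(f"{name:>16}: {value:+.2f}")
```

A negative term, such as the seasonality value here, reads directly as a discount relative to the global bias.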

In [15]:
import torch
import torch.nn as nn
from sentence_transformers import SentenceTransformer

class AdditiveAxisModel(nn.Module):
    """
    Implements the 5-axis additive model for Airbnb price prediction.
    - Architecture is based on MODELING.md (Phase 1: Simple Additive Baseline).
    - Feature embedding strategy is based on EMBEDDINGS.md.
    """
    def __init__(self, processor: FeatureProcessor, device: str = 'cpu'):
        super().__init__()
        self.vocabs = processor.vocabs
        self.device = device
        
        # --- 1. Embedding Layers ---
        # Based on output dimensions in EMBEDDINGS.md
        self.embed_neighbourhood = nn.Embedding(len(self.vocabs['neighbourhood_cleansed']), 16)
        self.embed_property_type = nn.Embedding(len(self.vocabs['property_type']), 8)
        self.embed_room_type = nn.Embedding(len(self.vocabs['room_type']), 4)
        self.embed_bathrooms_type = nn.Embedding(len(self.vocabs['bathrooms_type']), 2)
        
        # For ordinal features treated as categorical
        self.embed_bedrooms = nn.Embedding(len(self.vocabs['bedrooms']), 4)
        self.embed_beds = nn.Embedding(len(self.vocabs['beds']), 4)
        self.embed_bathrooms_numeric = nn.Embedding(len(self.vocabs['bathrooms_numeric']), 4)
        
        # --- 2. Pre-trained Amenities Model ---
        self.amenities_transformer = SentenceTransformer('BAAI/bge-small-en-v1.5')
        # FREEZE the transformer weights for the baseline model
        for param in self.amenities_transformer.parameters():
            param.requires_grad = False
            
        # --- 3. Sub-Network MLPs ---
        # Input dimensions are calculated from the concatenated features for each axis.
        # See EMBEDDINGS.md "Final Sub-Network Inputs" section for these values.
        
        # Location: 32 (Geo) + 16 (Neighbourhood) = 48
        self.loc_subnet = nn.Sequential(
            nn.Linear(48, 32),
            nn.ReLU(),
            nn.Linear(32, 1)
        )
        
        # Size & Capacity: 8+4+2+4+4+4 (Embeds) + 1 (Accommodates) = 27
        self.size_subnet = nn.Sequential(
            nn.Linear(27, 32),
            nn.ReLU(),
            nn.Linear(32, 1)
        )
        
        # Quality: 8 (Numericals) + 1 (Reviews LTM) + 3 (Booleans) = 12
        self.qual_subnet = nn.Sequential(
            nn.Linear(12, 32),
            nn.ReLU(),
            nn.Linear(32, 1)
        )

        # Amenities: 384 (Transformer Output Dim for bge-small)
        self.amenities_subnet = nn.Linear(384, 1)
        
        # Seasonality: 2 (Cyclical)
        self.season_subnet = nn.Sequential(
            nn.Linear(2, 16),
            nn.ReLU(),
            nn.Linear(16, 1)
        )
        
        # --- 4. Global Bias Term ---
        self.global_bias = nn.Parameter(torch.randn(1))

        self.to(self.device)

    def forward(self, batch: dict) -> torch.Tensor:
        # Move all tensors in the batch to the correct device
        for key, value in batch.items():
            if isinstance(value, torch.Tensor):
                batch[key] = value.to(self.device)
            elif isinstance(value, dict):  # For the amenities tokenizer output
                for sub_key, sub_value in value.items():
                    batch[key][sub_key] = sub_value.to(self.device)

        # --- Process Each Axis ---

        # 1. Location Axis
        loc_geo = batch['loc_geo_position']
        loc_hood_embed = self.embed_neighbourhood(batch['loc_neighbourhood'])
        loc_input = torch.cat([loc_geo, loc_hood_embed], dim=1)
        p_location = self.loc_subnet(loc_input)

        # 2. Size & Capacity Axis
        size_prop_embed = self.embed_property_type(batch['size_property_type'])
        size_room_embed = self.embed_room_type(batch['size_room_type'])
        size_bath_type_embed = self.embed_bathrooms_type(batch['size_bathrooms_type'])
        size_beds_embed = self.embed_beds(batch['size_beds'])
        size_bedrooms_embed = self.embed_bedrooms(batch['size_bedrooms'])
        size_bath_num_embed = self.embed_bathrooms_numeric(batch['size_bathrooms_numeric'])
        size_accommodates = batch['size_accommodates'].unsqueeze(1)
        
        size_input = torch.cat([
            size_prop_embed, size_room_embed, size_bath_type_embed,
            size_beds_embed, size_bedrooms_embed, size_bath_num_embed,
            size_accommodates
        ], dim=1)
        p_size = self.size_subnet(size_input)

        # 3. Quality & Reputation Axis
        qual_input = torch.cat([
            batch['qual_review_scores_rating'].unsqueeze(1),
            batch['qual_review_scores_cleanliness'].unsqueeze(1),
            batch['qual_review_scores_checkin'].unsqueeze(1),
            batch['qual_review_scores_communication'].unsqueeze(1),
            batch['qual_review_scores_location'].unsqueeze(1),
            batch['qual_review_scores_value'].unsqueeze(1),
            batch['qual_host_response_rate'].unsqueeze(1),
            batch['qual_host_acceptance_rate'].unsqueeze(1),
            batch['qual_number_of_reviews_ltm'].unsqueeze(1),
            batch['qual_host_is_superhost'].unsqueeze(1),
            batch['qual_host_identity_verified'].unsqueeze(1),
            batch['qual_instant_bookable'].unsqueeze(1)
        ], dim=1)
        p_quality = self.qual_subnet(qual_input)

        # 4. Amenities Axis
        # The sentence_transformers library expects a dictionary for the forward pass
        amenities_tokens = batch['amenities_tokens']
        amenities_output = self.amenities_transformer(amenities_tokens)
        amenities_embed = amenities_output['sentence_embedding']
        p_amenities = self.amenities_subnet(amenities_embed)

        # 5. Seasonality Axis
        season_input = batch['season_cyclical']
        p_seasonality = self.season_subnet(season_input)

        # --- Final Aggregation ---
        predicted_price = (
            self.global_bias +
            p_location +
            p_size +
            p_quality +
            p_amenities +
            p_seasonality
        ).squeeze(-1) # Squeeze to remove the trailing dimension of size 1

        return predicted_price

# --- Instantiate the model and verify ---
# We pass the fitted processor to give the model access to vocab sizes
model = AdditiveAxisModel(processor, device='cpu')

# Perform a forward pass with the sample batch to ensure all dimensions are correct
try:
    with torch.no_grad(): # No need to calculate gradients for this test
        model.eval() # Set model to evaluation mode
        sample_output = model(sample_batch)
    
    print("--- Model Verification ---")
    print(f"Model instantiated successfully on device: {model.device}")
    print(f"Output shape of a sample forward pass: {sample_output.shape}") # Should be [BATCH_SIZE]
    print(f"Sample prediction value: {sample_output[0].item():.2f}")

except Exception as e:
    print(f"\nAn error occurred during the model's forward pass test: {e}")

--- Model Verification ---
Model instantiated successfully on device: cpu
Output shape of a sample forward pass: torch.Size([256])
Sample prediction value: -0.58
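
The `nn.Linear` input widths in the model are hard-coded, so they must stay in sync with the embedding sizes and feature lists. The short check below recomputes them from the values declared in the model and processor; the constant names are illustrative and not part of the notebook.

```python
# Embedding widths as declared in the model's __init__.
embed_dims = {
    "neighbourhood": 16, "property_type": 8, "room_type": 4,
    "bathrooms_type": 2, "bedrooms": 4, "beds": 4, "bathrooms_numeric": 4,
}
GEO_DIM = 32           # lat/lon positional encoding (FeatureProcessor default)
N_QUALITY_NUMERIC = 8  # numerical_cols without "accommodates"
N_BOOLEAN = 3          # host_is_superhost, host_identity_verified, instant_bookable

# Location: geo encoding + neighbourhood embedding.
loc_in = GEO_DIM + embed_dims["neighbourhood"]
# Size & capacity: all other embeddings + standardized "accommodates".
size_in = sum(v for k, v in embed_dims.items() if k != "neighbourhood") + 1
# Quality: numericals + log-scaled number_of_reviews_ltm + booleans.
qual_in = N_QUALITY_NUMERIC + 1 + N_BOOLEAN

print(loc_in, size_in, qual_in)
```

These come out as 48, 27, and 12, matching the first layer of `loc_subnet`, `size_subnet`, and `qual_subnet`. Changing any embedding width or `embedding_dim_geo` requires updating the corresponding layer.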


### **Step 5: Model Training Loop**

We will now write a standard PyTorch training loop. The key component here is the implementation of the **weighted loss function**, which is central to the project's methodology as described in `TARGET-PRICE.md`.

Instead of a plain Mean Squared Error (MSE), we compute a weighted MSE: each sample's squared error is multiplied by its `estimated_occupancy_rate` (the `sample_weight`) before averaging over the batch. Listings with high market activity therefore contribute more to the gradient than rarely booked ones.

The loop will:
1.  Define the optimizer (AdamW is a robust choice).
2.  Iterate for a specified number of epochs.
3.  For each epoch:
    *   Run a **training phase** where we iterate through the `train_loader`, calculate the weighted loss, and update the model's weights via backpropagation.
    *   Run a **validation phase** where we iterate through the `val_loader` with gradients turned off (`torch.no_grad()`) to calculate the loss on the validation set.
4.  Print the training and validation loss after each epoch to monitor progress.
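
The weighted loss can be checked by hand on a tiny batch. The NumPy sketch below uses made-up predictions, targets, and weights. It computes the form used in the loop (weighted squared errors averaged over the batch size) next to a common alternative that divides by the sum of the weights; the second function is shown only for comparison and is not what the notebook uses.

```python
import numpy as np

def weighted_mse_batch_mean(pred, target, w):
    """Form used in the loop: weighted squared errors averaged over batch size."""
    return float(np.mean(w * (pred - target) ** 2))

def weighted_mse_normalized(pred, target, w):
    """Alternative: divide by the sum of weights instead of the batch size."""
    return float(np.sum(w * (pred - target) ** 2) / np.sum(w))

pred = np.array([100.0, 200.0, 300.0])
target = np.array([110.0, 200.0, 280.0])
weights = np.array([0.5, 0.25, 0.25])  # made-up occupancy rates

# Squared errors are 100, 0, 400; weighted they become 50, 0, 100.
print(weighted_mse_batch_mean(pred, target, weights))
print(weighted_mse_normalized(pred, target, weights))
```

The two values are 50.0 and 150.0. Both forms give the same gradient direction within a batch, but the batch-mean form scales with the average weight, so its magnitude is not directly comparable to an unweighted MSE.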


In [16]:
import torch.optim as optim
from tqdm import tqdm # For a nice progress bar

# --- Training Configuration ---
N_EPOCHS = 5
LEARNING_RATE = 1e-3

# --- Instantiate Model, Optimizer, and Loss ---
# Note: Re-instantiating the model to ensure we start with fresh, random weights
model = AdditiveAxisModel(processor, device='cpu') 
optimizer = optim.AdamW(model.parameters(), lr=LEARNING_RATE)

# We will implement the weighted MSE loss directly in the loop.

# --- Training and Validation Loop ---
print("\n--- Starting Model Training ---")

for epoch in range(N_EPOCHS):
    
    # --- Training Phase ---
    model.train() # Set model to training mode
    model.amenities_transformer.eval() # Keep the frozen transformer free of dropout noise
    train_loss = 0.0
    
    # Use tqdm for a progress bar over the training batches
    train_iterator = tqdm(train_loader, desc=f"Epoch {epoch+1}/{N_EPOCHS} [Training]")
    
    for batch in train_iterator:
        # 1. Get predictions
        predictions = model(batch)
        
        # 2. Get targets and weights
        targets = batch['target_price'].to(model.device)
        weights = batch['sample_weight'].to(model.device)
        
        # 3. Calculate weighted loss
        # The core of the weighted learning strategy
        loss = (weights * (predictions - targets)**2).mean()
        
        # 4. Backpropagation
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()
        
        train_loss += loss.item()
        
        # Update progress bar description
        train_iterator.set_postfix({'train_loss': train_loss / (train_iterator.n + 1)})

    avg_train_loss = train_loss / len(train_loader)

    # --- Validation Phase ---
    model.eval() # Set model to evaluation mode
    val_loss = 0.0
    
    with torch.no_grad(): # Disable gradient calculation
        val_iterator = tqdm(val_loader, desc=f"Epoch {epoch+1}/{N_EPOCHS} [Validation]")
        for batch in val_iterator:
            predictions = model(batch)
            targets = batch['target_price'].to(model.device)
            weights = batch['sample_weight'].to(model.device)
            
            loss = (weights * (predictions - targets)**2).mean()
            val_loss += loss.item()
            val_iterator.set_postfix({'val_loss': val_loss / (val_iterator.n + 1)})

    avg_val_loss = val_loss / len(val_loader)
    
    # --- Print Epoch Summary ---
    print(f"Epoch {epoch+1}/{N_EPOCHS} -> "
          f"Train Loss: {avg_train_loss:.4f}, "
          f"Validation Loss: {avg_val_loss:.4f}")

print("\n--- Training Complete ---")


--- Starting Model Training ---


Epoch 1/5 [Training]:  13%|█▎        | 33/260 [05:26<37:28,  9.90s/it, train_loss=2.31e+4]


KeyboardInterrupt: 