# Assignment 1: Airbnb Pricing Model

**Objective**: Build a prediction model for Airbnb listing prices  
**Data**: Oslo (training & temporal validation) and Copenhagen (spatial validation)  
**Course**: CEU Data Analysis 3 - Prediction and Machine Learning

In [None]:
import warnings
warnings.filterwarnings('ignore')

import pandas as pd
import numpy as np
import os
import time
import matplotlib.pyplot as plt
import seaborn as sns

from sklearn.model_selection import train_test_split, cross_val_score, GridSearchCV
from sklearn.linear_model import LinearRegression, LassoCV, RidgeCV
from sklearn.ensemble import RandomForestRegressor, GradientBoostingRegressor
from sklearn.preprocessing import StandardScaler, OneHotEncoder
from sklearn.compose import ColumnTransformer
from sklearn.pipeline import Pipeline
from sklearn.metrics import mean_squared_error, r2_score, mean_absolute_error

%matplotlib inline

## Part I: Task 1 - Data Acquisition & Preparation

### Data Loading
- **Oslo**: Training and temporal validation (≥10,000 listings)
- **Copenhagen**: Spatial validation (≥3,000 observations)

In [None]:
# Load data using cross-platform paths
path_oslo = os.path.join(os.pardir, os.pardir, 'data', 'raw', 'oslo_listings.csv')
path_copenhagen = os.path.join(os.pardir, os.pardir, 'data', 'raw', 'copenhagen_listings.csv')

df_oslo = pd.read_csv(path_oslo)
df_copenhagen = pd.read_csv(path_copenhagen)

print(f"Oslo listings: {df_oslo.shape[0]:,} rows, {df_oslo.shape[1]} columns")
print(f"Copenhagen listings: {df_copenhagen.shape[0]:,} rows, {df_copenhagen.shape[1]} columns")

### Exploratory Data Analysis

In [None]:
# Check basic info about Oslo data
df_oslo.info()

In [None]:
# Examine target variable (price)
# Price is stored as string with $ and commas, need to clean
print("Price column sample:")
print(df_oslo['price'].head(10))

In [None]:
def clean_price(price_str):
    """Convert price string like '$1,234.00' to float"""
    if pd.isna(price_str):
        return np.nan
    return float(str(price_str).replace('$', '').replace(',', ''))

# Apply to both datasets
df_oslo['price_clean'] = df_oslo['price'].apply(clean_price)
df_copenhagen['price_clean'] = df_copenhagen['price'].apply(clean_price)

print("Oslo price statistics:")
print(df_oslo['price_clean'].describe())
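
The row-wise `apply` above can also be written as a single vectorized operation. The following is a minimal sketch, assuming prices look like `'$1,234.00'`. The helper name `clean_price_vectorized` and the sample values are made up for illustration. Unlike `clean_price`, it turns unparseable strings into `NaN` instead of raising.

```python
import pandas as pd

def clean_price_vectorized(prices: pd.Series) -> pd.Series:
    """Strip '$' and ',' from a price column and convert it to float.

    Entries that cannot be parsed (including missing values) become NaN
    because of errors='coerce'.
    """
    stripped = prices.astype(str).str.replace(r"[$,]", "", regex=True)
    return pd.to_numeric(stripped, errors="coerce")

# Illustrative values only, not taken from the Oslo data
sample = pd.Series(["$1,234.00", "$85.00", None, "n/a"])
cleaned = clean_price_vectorized(sample)
print(cleaned.tolist())
```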

In [None]:
# Price distribution
fig, axes = plt.subplots(1, 2, figsize=(14, 5))

axes[0].hist(df_oslo['price_clean'].dropna(), bins=50, edgecolor='black')
axes[0].set_title('Oslo Price Distribution')
axes[0].set_xlabel('Price')

# Log-transformed price (log1p = log(1 + price), which stays finite for zero prices)
axes[1].hist(np.log1p(df_oslo['price_clean'].dropna()), bins=50, edgecolor='black')
axes[1].set_title('Oslo Log(1 + Price) Distribution')
axes[1].set_xlabel('Log(1 + Price)')

plt.tight_layout()
plt.show()
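
The visual comparison of the two histograms can be backed by a number. The sketch below uses a synthetic right-skewed sample (not the Oslo prices) and compares skewness before and after taking logs. The distribution parameters are arbitrary assumptions.

```python
import numpy as np
import pandas as pd

# Synthetic right-skewed "prices"; parameters are arbitrary, for illustration only
rng = np.random.default_rng(0)
prices = pd.Series(rng.lognormal(mean=7.0, sigma=0.6, size=5000))

raw_skew = prices.skew()          # strongly positive for a long right tail
log_skew = np.log(prices).skew()  # close to zero if log prices are roughly symmetric

print(f"Skewness of price:      {raw_skew:.2f}")
print(f"Skewness of log(price): {log_skew:.2f}")
```

On the real data the same two lines could be run on `df_oslo['price_clean'].dropna()`, restricted to positive prices.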

### Data Wrangling Function

A reusable function keeps preprocessing consistent across the training and validation sets.

In [None]:
def preprocess_airbnb(df, is_train=True, reference_df=None):
    """
    Preprocess Airbnb data for modeling.
    
    Parameters:
    -----------
    df : DataFrame - Raw Airbnb data
    is_train : bool - Whether this is training data (affects imputation)
    reference_df : DataFrame - Reference for imputation values (use training data)
    
    Returns:
    --------
    DataFrame with cleaned and engineered features
    """
    df = df.copy()
    
    # 1. Clean price (target variable)
    df['price'] = df['price'].apply(clean_price)
    
    # 2. Drop rows with missing target
    df = df.dropna(subset=['price'])
    
    # 3. Filter extreme prices (keep reasonable range)
    df = df[(df['price'] >= 100) & (df['price'] <= 10000)]
    
    # 4. Log transform price
    df['ln_price'] = np.log(df['price'])
    
    # 5. Select and clean numeric features
    numeric_cols = ['accommodates', 'bedrooms', 'beds', 'minimum_nights', 
                    'maximum_nights', 'number_of_reviews', 'review_scores_rating',
                    'review_scores_accuracy', 'review_scores_cleanliness',
                    'review_scores_checkin', 'review_scores_communication',
                    'review_scores_location', 'review_scores_value',
                    'reviews_per_month', 'availability_30', 'availability_60',
                    'availability_90', 'availability_365',
                    'calculated_host_listings_count']
    
    # 6. Handle missing values in numeric columns
    for col in numeric_cols:
        if col in df.columns:
            if is_train or reference_df is None:
                median_val = df[col].median()
            else:
                median_val = reference_df[col].median()
            
            # Create flag for important missing patterns
            if col.startswith('review_scores'):
                df[f'flag_{col}_missing'] = df[col].isna().astype(int)
            
            df[col] = df[col].fillna(median_val)
    
    # 7. Categorical features
    cat_cols = ['room_type', 'property_type', 'neighbourhood_cleansed']
    for col in cat_cols:
        if col in df.columns:
            df[col] = df[col].fillna('Missing')
    
    # 8. Feature engineering - polynomial terms
    if 'accommodates' in df.columns:
        df['accommodates_sq'] = df['accommodates'] ** 2
    
    if 'bedrooms' in df.columns:
        df['bedrooms_sq'] = df['bedrooms'] ** 2
        
    # 9. Extract number of bathrooms from text
    if 'bathrooms_text' in df.columns:
        df['n_bathrooms'] = df['bathrooms_text'].str.extract(r'(\d+\.?\d*)').astype(float)
        df['n_bathrooms'] = df['n_bathrooms'].fillna(1)
        df['is_shared_bath'] = df['bathrooms_text'].str.contains('shared', case=False, na=False).astype(int)
    
    # 10. Host features ('t'/'f' strings -> 1/0; missing values become 0)
    for col in ['host_is_superhost', 'instant_bookable']:
        if col in df.columns:
            df[col] = (df[col] == 't').astype(int)
    
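    # 11. Extract amenities count
    if 'amenities' in df.columns:
        amenities_lower = df['amenities'].str.lower()
        # Approximate count: commas + 1; empty lists ('[]') and missing values count as 0
        has_any = df['amenities'].notna() & (df['amenities'].str.strip() != '[]')
        df['n_amenities'] = (df['amenities'].str.count(',') + 1).where(has_any, 0)
        
        # Key amenities as binary features (plain substring match, so 'dryer' also matches 'hair dryer')
        key_amenities = ['wifi', 'kitchen', 'washer', 'dryer', 'parking', 
                        'air conditioning', 'heating', 'tv', 'pool', 'gym']
        for amenity in key_amenities:
            df[f'has_{amenity.replace(" ", "_")}'] = amenities_lower.str.contains(amenity, regex=False, na=False).astype(int)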
    
    return df

In [None]:
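# Apply preprocessing
# Note: with is_train=True the imputation medians are computed on all Oslo rows,
# i.e. before the train/holdout split below
df_oslo_clean = preprocess_airbnb(df_oslo, is_train=True)
print(f"Oslo after preprocessing: {df_oslo_clean.shape[0]:,} rows")
print("\nPrice statistics after cleaning:")
print(df_oslo_clean['price'].describe())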
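
`preprocess_airbnb` computes imputation medians on whichever frame it receives. If strict separation between training and holdout rows is wanted, the medians can be learned on the training rows and then reused. A minimal sketch follows. `fit_medians` and `impute_with` are hypothetical helpers, not part of the notebook, and the toy frames are made up.

```python
import numpy as np
import pandas as pd

def fit_medians(train_df: pd.DataFrame, cols: list) -> pd.Series:
    """Compute imputation medians from the training rows only."""
    return train_df[cols].median()

def impute_with(df: pd.DataFrame, medians: pd.Series) -> pd.DataFrame:
    """Fill missing values using medians learned elsewhere (e.g. on train)."""
    out = df.copy()
    cols = list(medians.index)
    out[cols] = out[cols].fillna(medians)
    return out

# Toy data, for illustration only
train = pd.DataFrame({"beds": [1.0, 2.0, np.nan, 3.0]})
holdout = pd.DataFrame({"beds": [np.nan, 10.0]})

medians = fit_medians(train, ["beds"])
train_filled = impute_with(train, medians)
holdout_filled = impute_with(holdout, medians)
print(holdout_filled["beds"].tolist())
```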

### Train/Test Split

**Strategy:**
- Training set: 70% of Oslo data
- Temporal validation: 30% of Oslo data (a random holdout from the same file, used as a stand-in for a later-period sample)
- Spatial validation: Copenhagen data

In [None]:
# Split Oslo data for training and temporal validation
df_train, df_temporal = train_test_split(df_oslo_clean, train_size=0.7, random_state=20250224)

# Preprocess Copenhagen for spatial validation
df_spatial = preprocess_airbnb(df_copenhagen, is_train=False, reference_df=df_train)

print(f"Training set: {df_train.shape[0]:,} rows")
print(f"Temporal validation (Oslo holdout): {df_temporal.shape[0]:,} rows")
print(f"Spatial validation (Copenhagen): {df_spatial.shape[0]:,} rows")
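
The Oslo split above is random, so the holdout comes from the same period as the training rows. If the data carried a date column, an ordered split would test performance on later observations instead. A minimal sketch, assuming a column named `last_scraped` (an assumption; the notebook does not show that this column exists) and a hypothetical helper `temporal_split`:

```python
import pandas as pd

def temporal_split(df: pd.DataFrame, date_col: str, train_frac: float = 0.7):
    """Put the earliest `train_frac` of rows in train and the rest in holdout."""
    ordered = (
        df.assign(_date=pd.to_datetime(df[date_col]))
          .sort_values("_date", kind="stable")
    )
    cut = int(len(ordered) * train_frac)
    train = ordered.iloc[:cut].drop(columns="_date")
    holdout = ordered.iloc[cut:].drop(columns="_date")
    return train, holdout

# Toy frame; 'last_scraped' is an assumed column name and the values are made up
toy = pd.DataFrame({
    "last_scraped": ["2024-03-01", "2024-01-01", "2024-04-01", "2024-02-01"],
    "price": [900, 700, 1000, 800],
})
early, late = temporal_split(toy, "last_scraped", train_frac=0.5)
print(early["price"].tolist(), late["price"].tolist())
```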

### Feature Selection

**Variable selection rationale:**
- **Property characteristics**: accommodates, bedrooms, beds, bathrooms (core drivers of price)
- **Review metrics**: scores and counts (signal of quality)
- **Availability**: indicates demand patterns
- **Host features**: superhost status, listing count (professionalism)
- **Amenities**: key amenities as binary features

In [None]:
# Define feature sets for modeling
numeric_features = [
    'accommodates', 'accommodates_sq', 'bedrooms', 'bedrooms_sq', 'beds',
    'n_bathrooms', 'minimum_nights', 'number_of_reviews',
    'review_scores_rating', 'reviews_per_month',
    'availability_30', 'availability_365',
    'calculated_host_listings_count', 'n_amenities',
    'host_is_superhost', 'instant_bookable', 'is_shared_bath',
    'has_wifi', 'has_kitchen', 'has_washer', 'has_parking',
    'has_air_conditioning', 'has_tv', 'has_pool', 'has_gym',
    'flag_review_scores_rating_missing'
]

categorical_features = ['room_type', 'property_type']

target = 'ln_price'  # Log-transformed price

# Verify all features exist
all_features = numeric_features + categorical_features
missing_features = [f for f in all_features if f not in df_train.columns]
if missing_features:
    print(f"Warning: Missing features: {missing_features}")
    numeric_features = [f for f in numeric_features if f in df_train.columns]
    categorical_features = [f for f in categorical_features if f in df_train.columns]

print(f"Numeric features: {len(numeric_features)}")
print(f"Categorical features: {len(categorical_features)}")

In [None]:
# Prepare feature matrices
def prepare_features(df, numeric_features, categorical_features, columns=None):
    """Create feature matrix with one-hot encoding for categoricals.

    With columns=None (training data) the first level of each categorical is
    dropped as the reference category. Otherwise all levels are encoded and the
    result is reindexed to `columns`, so every dataset shares the training
    layout and the same reference categories.
    """
    X_numeric = df[numeric_features].copy()
    
    # One-hot encode categorical features
    X_cat = pd.get_dummies(df[categorical_features], drop_first=(columns is None))
    
    X = pd.concat([X_numeric, X_cat], axis=1)
    if columns is not None:
        # Dummies absent from this dataset become 0; categories unseen in training are dropped
        X = X.reindex(columns=columns, fill_value=0)
    return X

# Prepare train, temporal, and spatial datasets
X_train = prepare_features(df_train, numeric_features, categorical_features)
y_train = df_train[target]

X_temporal = prepare_features(df_temporal, numeric_features, categorical_features, columns=X_train.columns)
y_temporal = df_temporal[target]

X_spatial = prepare_features(df_spatial, numeric_features, categorical_features, columns=X_train.columns)
y_spatial = df_spatial[target]

print(f"Feature matrix shape: {X_train.shape}")
print(f"Features: {list(X_train.columns)}")
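
The notebook imports `ColumnTransformer`, `OneHotEncoder` and `Pipeline` but builds the feature matrix by hand. As an alternative, the encoding can be learned once on the training data and reused unchanged on other datasets. The sketch below uses toy data and two columns only; it is not the notebook's pipeline. `handle_unknown='ignore'` encodes categories unseen in training as all zeros.

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.linear_model import LinearRegression
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

numeric = ["accommodates"]
categorical = ["room_type"]

preprocess = ColumnTransformer(
    [
        ("num", StandardScaler(), numeric),
        ("cat", OneHotEncoder(handle_unknown="ignore"), categorical),
    ],
    sparse_threshold=0,  # always return a dense array
)
model = Pipeline([("prep", preprocess), ("ols", LinearRegression())])

# Toy data, for illustration only
train = pd.DataFrame({
    "accommodates": [1, 2, 4, 6],
    "room_type": ["Private room", "Private room", "Entire home/apt", "Entire home/apt"],
    "ln_price": [6.0, 6.2, 7.0, 7.4],
})
other_city = pd.DataFrame({
    "accommodates": [2, 3],
    "room_type": ["Hotel room", "Entire home/apt"],  # 'Hotel room' is unseen in train
})

model.fit(train[numeric + categorical], train["ln_price"])
preds = model.predict(other_city[numeric + categorical])
encoded = model.named_steps["prep"].transform(other_city[numeric + categorical])
print(encoded.shape)
```

Because the encoder is fitted inside the pipeline, the column layout is fixed by the training data and no manual alignment step is needed.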

## Part I: Task 2 - Build 5 Predictive Models

In [None]:
# Helper function to evaluate models
def evaluate_model(model, X, y, dataset_name=""):
    """Calculate RMSE, R², MAE for a model"""
    y_pred = model.predict(X)
    rmse = np.sqrt(mean_squared_error(y, y_pred))
    r2 = r2_score(y, y_pred)
    mae = mean_absolute_error(y, y_pred)
    return {'Dataset': dataset_name, 'RMSE': rmse, 'R²': r2, 'MAE': mae}

# Store results
results = []
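
`evaluate_model` reports errors on the log-price scale. To express an error in price units, predictions can be mapped back with `exp()`. A minimal sketch with made-up numbers; `rmse_in_levels` is a hypothetical helper, and it applies no retransformation correction.

```python
import numpy as np

def rmse_in_levels(y_log_true, y_log_pred):
    """RMSE after mapping log prices back to price units with exp().

    Note: exp(predicted log price) tends to underestimate the conditional
    mean price; no correction for that is applied here.
    """
    price_true = np.exp(np.asarray(y_log_true, dtype=float))
    price_pred = np.exp(np.asarray(y_log_pred, dtype=float))
    return float(np.sqrt(np.mean((price_true - price_pred) ** 2)))

# Made-up values: true prices 1000 and 2000, predicted 1100 and 1800
y_true = np.log([1000.0, 2000.0])
y_pred = np.log([1100.0, 1800.0])
level_rmse = rmse_in_levels(y_true, y_pred)
print(round(level_rmse, 2))
```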

### Model a: OLS (Baseline)

In [None]:
# Scale features for linear models
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_temporal_scaled = scaler.transform(X_temporal)
X_spatial_scaled = scaler.transform(X_spatial)

# OLS model
start_time = time.time()
ols_model = LinearRegression()
ols_model.fit(X_train_scaled, y_train)
ols_train_time = time.time() - start_time

# Evaluate
start_time = time.time()
ols_train_metrics = evaluate_model(ols_model, X_train_scaled, y_train, 'Train')
ols_temporal_metrics = evaluate_model(ols_model, X_temporal_scaled, y_temporal, 'Temporal')
ols_spatial_metrics = evaluate_model(ols_model, X_spatial_scaled, y_spatial, 'Spatial')
ols_inference_time = time.time() - start_time

results.append({
    'Model': 'OLS',
    'Train_RMSE': ols_train_metrics['RMSE'],
    'Train_R2': ols_train_metrics['R²'],
    'Temporal_RMSE': ols_temporal_metrics['RMSE'],
    'Temporal_R2': ols_temporal_metrics['R²'],
    'Spatial_RMSE': ols_spatial_metrics['RMSE'],
    'Spatial_R2': ols_spatial_metrics['R²'],
    'Train_Time': ols_train_time,
    'Inference_Time': ols_inference_time
})

print(f"OLS Model trained in {ols_train_time:.4f}s")
print(f"Train RMSE: {ols_train_metrics['RMSE']:.4f}, R²: {ols_train_metrics['R²']:.4f}")
print(f"Temporal RMSE: {ols_temporal_metrics['RMSE']:.4f}, R²: {ols_temporal_metrics['R²']:.4f}")
print(f"Spatial RMSE: {ols_spatial_metrics['RMSE']:.4f}, R²: {ols_spatial_metrics['R²']:.4f}")
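
The fit, time, evaluate and append pattern used for OLS is repeated for every model in this notebook. One way to reduce the repetition is a small helper that returns one row of the results table. This is a sketch on synthetic data; `fit_and_score` is a hypothetical name, and it records training time only.

```python
import time

import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error, r2_score

def fit_and_score(name, model, train, validation_sets):
    """Fit `model` on train=(X, y) and score it on train and each named (X, y) pair."""
    X_tr, y_tr = train
    start = time.time()
    model.fit(X_tr, y_tr)
    row = {"Model": name, "Train_Time": time.time() - start}
    for label, (X, y) in {"Train": train, **validation_sets}.items():
        pred = model.predict(X)
        row[f"{label}_RMSE"] = float(np.sqrt(mean_squared_error(y, pred)))
        row[f"{label}_R2"] = float(r2_score(y, pred))
    return row

# Synthetic data standing in for the real feature matrices
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 3))
y = X @ np.array([0.5, -0.2, 0.1]) + rng.normal(scale=0.1, size=200)

row = fit_and_score(
    "OLS",
    LinearRegression(),
    (X[:140], y[:140]),
    {"Temporal": (X[140:], y[140:])},
)
print(sorted(row))
```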

### Model b: LASSO (L1 Regularization)

In [None]:
# LASSO with cross-validation to find optimal alpha
start_time = time.time()
lasso_model = LassoCV(cv=5, random_state=42, max_iter=10000)
lasso_model.fit(X_train_scaled, y_train)
lasso_train_time = time.time() - start_time

print(f"Optimal alpha: {lasso_model.alpha_:.6f}")
print(f"Non-zero coefficients: {np.sum(lasso_model.coef_ != 0)} out of {len(lasso_model.coef_)}")

# Evaluate
start_time = time.time()
lasso_train_metrics = evaluate_model(lasso_model, X_train_scaled, y_train, 'Train')
lasso_temporal_metrics = evaluate_model(lasso_model, X_temporal_scaled, y_temporal, 'Temporal')
lasso_spatial_metrics = evaluate_model(lasso_model, X_spatial_scaled, y_spatial, 'Spatial')
lasso_inference_time = time.time() - start_time

results.append({
    'Model': 'LASSO',
    'Train_RMSE': lasso_train_metrics['RMSE'],
    'Train_R2': lasso_train_metrics['R²'],
    'Temporal_RMSE': lasso_temporal_metrics['RMSE'],
    'Temporal_R2': lasso_temporal_metrics['R²'],
    'Spatial_RMSE': lasso_spatial_metrics['RMSE'],
    'Spatial_R2': lasso_spatial_metrics['R²'],
    'Train_Time': lasso_train_time,
    'Inference_Time': lasso_inference_time
})

print(f"\nLASSO Model trained in {lasso_train_time:.4f}s")
print(f"Train RMSE: {lasso_train_metrics['RMSE']:.4f}, R²: {lasso_train_metrics['R²']:.4f}")
print(f"Temporal RMSE: {lasso_temporal_metrics['RMSE']:.4f}, R²: {lasso_temporal_metrics['R²']:.4f}")
print(f"Spatial RMSE: {lasso_spatial_metrics['RMSE']:.4f}, R²: {lasso_spatial_metrics['R²']:.4f}")
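
Beyond counting non-zero coefficients, it can be useful to list which features LASSO kept and which it dropped. A minimal sketch on synthetic data, where only `f1` and `f2` drive the outcome; the feature names are placeholders.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LassoCV
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
X = pd.DataFrame(rng.normal(size=(300, 5)), columns=["f1", "f2", "f3", "f4", "f5"])
# Only f1 and f2 enter the outcome; f3 to f5 are pure noise
y = 2.0 * X["f1"] - 1.0 * X["f2"] + rng.normal(scale=0.5, size=300)

X_scaled = StandardScaler().fit_transform(X)
lasso = LassoCV(cv=5, random_state=42, max_iter=10000).fit(X_scaled, y)

# Attach feature names to the coefficients and split by zero / non-zero
coefs = pd.Series(lasso.coef_, index=X.columns)
kept = coefs[coefs != 0].index.tolist()
dropped = coefs[coefs == 0].index.tolist()
print("Kept:", kept)
print("Dropped:", dropped)
```

In the notebook the same idea would use `pd.Series(lasso_model.coef_, index=X_train.columns)`.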

### Model c: Random Forest

In [None]:
# Random Forest (using unscaled features - RF doesn't require scaling)
start_time = time.time()
rf_model = RandomForestRegressor(
    n_estimators=200,
    max_depth=15,
    min_samples_split=10,
    min_samples_leaf=5,
    random_state=42,
    n_jobs=-1
)
rf_model.fit(X_train, y_train)
rf_train_time = time.time() - start_time

# Evaluate
start_time = time.time()
rf_train_metrics = evaluate_model(rf_model, X_train, y_train, 'Train')
rf_temporal_metrics = evaluate_model(rf_model, X_temporal, y_temporal, 'Temporal')
rf_spatial_metrics = evaluate_model(rf_model, X_spatial, y_spatial, 'Spatial')
rf_inference_time = time.time() - start_time

results.append({
    'Model': 'Random Forest',
    'Train_RMSE': rf_train_metrics['RMSE'],
    'Train_R2': rf_train_metrics['R²'],
    'Temporal_RMSE': rf_temporal_metrics['RMSE'],
    'Temporal_R2': rf_temporal_metrics['R²'],
    'Spatial_RMSE': rf_spatial_metrics['RMSE'],
    'Spatial_R2': rf_spatial_metrics['R²'],
    'Train_Time': rf_train_time,
    'Inference_Time': rf_inference_time
})

print(f"Random Forest trained in {rf_train_time:.4f}s")
print(f"Train RMSE: {rf_train_metrics['RMSE']:.4f}, R²: {rf_train_metrics['R²']:.4f}")
print(f"Temporal RMSE: {rf_temporal_metrics['RMSE']:.4f}, R²: {rf_temporal_metrics['R²']:.4f}")
print(f"Spatial RMSE: {rf_spatial_metrics['RMSE']:.4f}, R²: {rf_spatial_metrics['R²']:.4f}")
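
The Random Forest hyperparameters above are fixed by hand. `GridSearchCV`, which the notebook imports but does not use, could select them by cross-validation. The sketch below runs on small synthetic data with a deliberately tiny grid so that it finishes quickly; the grid values are arbitrary assumptions, not recommendations.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import GridSearchCV

rng = np.random.default_rng(0)
X = rng.normal(size=(150, 4))
y = X[:, 0] ** 2 + X[:, 1] + rng.normal(scale=0.1, size=150)

param_grid = {"max_depth": [3, 8], "min_samples_leaf": [1, 5]}
search = GridSearchCV(
    RandomForestRegressor(n_estimators=30, random_state=42),
    param_grid,
    scoring="neg_root_mean_squared_error",  # scores are negated RMSE
    cv=3,
)
search.fit(X, y)

best_cv_rmse = -search.best_score_
print(search.best_params_, round(best_cv_rmse, 4))
```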

### Model d: Gradient Boosting

In [None]:
# Gradient Boosting
start_time = time.time()
gb_model = GradientBoostingRegressor(
    n_estimators=200,
    max_depth=5,
    learning_rate=0.1,
    min_samples_split=10,
    min_samples_leaf=5,
    random_state=42
)
gb_model.fit(X_train, y_train)
gb_train_time = time.time() - start_time

# Evaluate
start_time = time.time()
gb_train_metrics = evaluate_model(gb_model, X_train, y_train, 'Train')
gb_temporal_metrics = evaluate_model(gb_model, X_temporal, y_temporal, 'Temporal')
gb_spatial_metrics = evaluate_model(gb_model, X_spatial, y_spatial, 'Spatial')
gb_inference_time = time.time() - start_time

results.append({
    'Model': 'Gradient Boosting',
    'Train_RMSE': gb_train_metrics['RMSE'],
    'Train_R2': gb_train_metrics['R²'],
    'Temporal_RMSE': gb_temporal_metrics['RMSE'],
    'Temporal_R2': gb_temporal_metrics['R²'],
    'Spatial_RMSE': gb_spatial_metrics['RMSE'],
    'Spatial_R2': gb_spatial_metrics['R²'],
    'Train_Time': gb_train_time,
    'Inference_Time': gb_inference_time
})

print(f"Gradient Boosting trained in {gb_train_time:.4f}s")
print(f"Train RMSE: {gb_train_metrics['RMSE']:.4f}, R²: {gb_train_metrics['R²']:.4f}")
print(f"Temporal RMSE: {gb_temporal_metrics['RMSE']:.4f}, R²: {gb_temporal_metrics['R²']:.4f}")
print(f"Spatial RMSE: {gb_spatial_metrics['RMSE']:.4f}, R²: {gb_spatial_metrics['R²']:.4f}")
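
With gradient boosting, the number of trees is itself a tuning parameter. `staged_predict` yields predictions after each boosting stage, so a validation curve can be traced from a single fit. A minimal sketch on synthetic data; the split sizes and settings are arbitrary.

```python
import numpy as np
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.metrics import mean_squared_error

rng = np.random.default_rng(0)
X = rng.normal(size=(400, 4))
y = np.sin(X[:, 0]) + 0.5 * X[:, 1] + rng.normal(scale=0.3, size=400)
X_fit, y_fit, X_val, y_val = X[:300], y[:300], X[300:], y[300:]

gb = GradientBoostingRegressor(
    n_estimators=200, max_depth=3, learning_rate=0.1, random_state=42
)
gb.fit(X_fit, y_fit)

# One validation RMSE per stage: after 1, 2, ..., n_estimators trees
val_rmse = [
    float(np.sqrt(mean_squared_error(y_val, pred)))
    for pred in gb.staged_predict(X_val)
]
best_n_trees = int(np.argmin(val_rmse)) + 1
print(f"Lowest validation RMSE at {best_n_trees} trees")
```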

### Model e: Ridge Regression (Custom Choice)

**Rationale**: Ridge regression (L2 regularization) is a natural comparison to LASSO. It shrinks coefficients toward zero without setting any of them exactly to zero, and it tends to be more stable when predictors are correlated.
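
The point about correlated predictors can be illustrated with two near-duplicate features. In the sketch below (synthetic data, arbitrary penalty values), Ridge is expected to give the two copies similar weights, while LASSO may concentrate the weight on one of them.

```python
import numpy as np
from sklearn.linear_model import Lasso, Ridge

rng = np.random.default_rng(0)
z = rng.normal(size=500)
# x1 and x2 are near-duplicates of the same underlying signal z
X = np.column_stack([
    z + rng.normal(scale=0.01, size=500),
    z + rng.normal(scale=0.01, size=500),
])
y = 2.0 * z + rng.normal(scale=0.1, size=500)

ridge = Ridge(alpha=10.0).fit(X, y)
lasso = Lasso(alpha=0.1).fit(X, y)

print("Ridge coefficients:", ridge.coef_)
print("LASSO coefficients:", lasso.coef_)
```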

In [None]:
# Ridge Regression with cross-validation
start_time = time.time()
ridge_model = RidgeCV(cv=5, alphas=np.logspace(-3, 3, 50))
ridge_model.fit(X_train_scaled, y_train)
ridge_train_time = time.time() - start_time

print(f"Optimal alpha: {ridge_model.alpha_:.6f}")

# Evaluate
start_time = time.time()
ridge_train_metrics = evaluate_model(ridge_model, X_train_scaled, y_train, 'Train')
ridge_temporal_metrics = evaluate_model(ridge_model, X_temporal_scaled, y_temporal, 'Temporal')
ridge_spatial_metrics = evaluate_model(ridge_model, X_spatial_scaled, y_spatial, 'Spatial')
ridge_inference_time = time.time() - start_time

results.append({
    'Model': 'Ridge',
    'Train_RMSE': ridge_train_metrics['RMSE'],
    'Train_R2': ridge_train_metrics['R²'],
    'Temporal_RMSE': ridge_temporal_metrics['RMSE'],
    'Temporal_R2': ridge_temporal_metrics['R²'],
    'Spatial_RMSE': ridge_spatial_metrics['RMSE'],
    'Spatial_R2': ridge_spatial_metrics['R²'],
    'Train_Time': ridge_train_time,
    'Inference_Time': ridge_inference_time
})

print(f"\nRidge Model trained in {ridge_train_time:.4f}s")
print(f"Train RMSE: {ridge_train_metrics['RMSE']:.4f}, R²: {ridge_train_metrics['R²']:.4f}")
print(f"Temporal RMSE: {ridge_temporal_metrics['RMSE']:.4f}, R²: {ridge_temporal_metrics['R²']:.4f}")
print(f"Spatial RMSE: {ridge_spatial_metrics['RMSE']:.4f}, R²: {ridge_spatial_metrics['R²']:.4f}")

## Part I: Task 3 - Model Comparison (Horserace Table)

In [None]:
# Create horserace table
results_df = pd.DataFrame(results)
results_df = results_df.round(4)

# Display formatted table
print("=" * 100)
print("HORSERACE TABLE: Model Comparison")
print("=" * 100)
display(results_df.style.highlight_min(subset=['Train_RMSE', 'Temporal_RMSE', 'Spatial_RMSE'], color='lightgreen')
                        .highlight_max(subset=['Train_R2', 'Temporal_R2', 'Spatial_R2'], color='lightgreen'))

In [None]:
# Visualization of model performance
fig, axes = plt.subplots(1, 3, figsize=(15, 5))

# RMSE comparison
x = np.arange(len(results_df))
width = 0.25

axes[0].bar(x - width, results_df['Train_RMSE'], width, label='Train', color='steelblue')
axes[0].bar(x, results_df['Temporal_RMSE'], width, label='Temporal', color='darkorange')
axes[0].bar(x + width, results_df['Spatial_RMSE'], width, label='Spatial', color='green')
axes[0].set_ylabel('RMSE (log-price)')
axes[0].set_title('RMSE by Model and Dataset')
axes[0].set_xticks(x)
axes[0].set_xticklabels(results_df['Model'], rotation=45, ha='right')
axes[0].legend()

# R² comparison
axes[1].bar(x - width, results_df['Train_R2'], width, label='Train', color='steelblue')
axes[1].bar(x, results_df['Temporal_R2'], width, label='Temporal', color='darkorange')
axes[1].bar(x + width, results_df['Spatial_R2'], width, label='Spatial', color='green')
axes[1].set_ylabel('R²')
axes[1].set_title('R² by Model and Dataset')
axes[1].set_xticks(x)
axes[1].set_xticklabels(results_df['Model'], rotation=45, ha='right')
axes[1].legend()

# Training time
axes[2].bar(results_df['Model'], results_df['Train_Time'], color='purple')
axes[2].set_ylabel('Time (seconds)')
axes[2].set_title('Training Time')
axes[2].tick_params(axis='x', rotation=45)

plt.tight_layout()
plt.savefig('model_comparison.png', dpi=150, bbox_inches='tight')
plt.show()

### Discussion: Model Performance

**Key observations:**
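1. **Training performance**: Tree-based models (RF, GB) typically show lower training RMSE because they can fit non-linearities and interactions; a low training RMSE on its own says little about out-of-sample fit
2. **Generalization**: Compare the temporal and spatial columns with the training column; the spatial (Copenhagen) gap is usually the larger one, since both the price level and the feature-price relationship can shift between cities
3. **Training time**: Linear models are fastest; Random Forest grows its trees in parallel (`n_jobs=-1`), whereas `GradientBoostingRegressor` builds trees sequentially
4. **Trade-offs**: Ridge/LASSO coefficients can be read directly; ensemble methods usually predict better but need importance measures to interpret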
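
The generalization comparison in point 2 can be made explicit by adding gap columns to the horserace table. The numbers in the sketch below are placeholders for illustration only; they are not results from this notebook. `add_gaps` is a hypothetical helper.

```python
import pandas as pd

# Placeholder numbers for illustration only; NOT results from this notebook
demo = pd.DataFrame({
    "Model": ["A", "B"],
    "Train_RMSE": [0.30, 0.20],
    "Temporal_RMSE": [0.32, 0.35],
    "Spatial_RMSE": [0.50, 0.60],
})

def add_gaps(table: pd.DataFrame) -> pd.DataFrame:
    """Add columns with the rise in RMSE from train to each validation set."""
    out = table.copy()
    out["Temporal_Gap"] = out["Temporal_RMSE"] - out["Train_RMSE"]
    out["Spatial_Gap"] = out["Spatial_RMSE"] - out["Train_RMSE"]
    return out

gaps = add_gaps(demo)
print(gaps.to_string(index=False))
```

Applied to the real table, this would be `add_gaps(results_df)`.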

## Part I: Task 4 - Feature Importance Analysis

In [None]:
# Random Forest feature importance
rf_importance = pd.DataFrame({
    'Feature': X_train.columns,
    'Importance': rf_model.feature_importances_
}).sort_values('Importance', ascending=False)

print("Random Forest - Top 10 Features:")
print(rf_importance.head(10).to_string(index=False))

In [None]:
# Gradient Boosting feature importance
gb_importance = pd.DataFrame({
    'Feature': X_train.columns,
    'Importance': gb_model.feature_importances_
}).sort_values('Importance', ascending=False)

print("Gradient Boosting - Top 10 Features:")
print(gb_importance.head(10).to_string(index=False))

In [None]:
# Compare top 10 features side by side
fig, axes = plt.subplots(1, 2, figsize=(14, 6))

# Random Forest
top_rf = rf_importance.head(10)
axes[0].barh(top_rf['Feature'], top_rf['Importance'], color='steelblue')
axes[0].set_xlabel('Importance')
axes[0].set_title('Random Forest - Top 10 Features')
axes[0].invert_yaxis()

# Gradient Boosting
top_gb = gb_importance.head(10)
axes[1].barh(top_gb['Feature'], top_gb['Importance'], color='darkorange')
axes[1].set_xlabel('Importance')
axes[1].set_title('Gradient Boosting - Top 10 Features')
axes[1].invert_yaxis()

plt.tight_layout()
plt.savefig('feature_importance.png', dpi=150, bbox_inches='tight')
plt.show()

In [None]:
# Compare overlap in top 10 features
rf_top10 = set(rf_importance.head(10)['Feature'])
gb_top10 = set(gb_importance.head(10)['Feature'])

overlap = rf_top10.intersection(gb_top10)
rf_only = rf_top10 - gb_top10
gb_only = gb_top10 - rf_top10

print(f"Features in BOTH top 10: {overlap}")
print(f"\nRF only: {rf_only}")
print(f"GB only: {gb_only}")

### Discussion: Feature Importance

**Interpretation:**
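- Both models typically agree on key pricing drivers: accommodates, bedrooms, room_type
- Both rankings are impurity-based (`feature_importances_`), so differences reflect how the trees are built:
  - RF averages variance reduction over trees grown independently on bootstrap samples
  - GB fits shallow trees sequentially to the residuals of earlier trees, so features that explain what earlier trees missed can rank higher
- Review scores and availability metrics provide quality/demand signals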
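
Impurity-based importances are computed on the training data. Permutation importance on a holdout set is a common cross-check: it measures how much the score drops when one feature is shuffled. A minimal sketch on synthetic data with placeholder feature names; this is an addition, not something the notebook computes.

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestRegressor
from sklearn.inspection import permutation_importance

rng = np.random.default_rng(0)
X = pd.DataFrame(rng.normal(size=(300, 3)), columns=["signal", "weak", "noise"])
y = 3.0 * X["signal"] + 0.3 * X["weak"] + rng.normal(scale=0.2, size=300)
X_fit, y_fit = X.iloc[:200], y.iloc[:200]
X_val, y_val = X.iloc[200:], y.iloc[200:]

rf = RandomForestRegressor(n_estimators=50, random_state=42).fit(X_fit, y_fit)

# Shuffle each column of the holdout set several times and record the score drop
result = permutation_importance(rf, X_val, y_val, n_repeats=5, random_state=42)
ranking = pd.Series(result.importances_mean, index=X.columns).sort_values(ascending=False)
print(ranking)
```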

## Part II: Task 5 & 6 - External Validation Summary

The external validation results are already incorporated in the horserace table above.

### Key Findings:

**Temporal Validation (Oslo holdout):**
- Performance is expected to be close to training performance, because the holdout is a random split from the same distribution
- All models generalize reasonably well within the same city

**Spatial Validation (Copenhagen):**
- Larger performance degradation expected due to:
  - Different market dynamics
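  - Different price levels (if prices are recorded in local currency, NOK vs DKK, log prices differ by a constant offset that a model trained on Oslo cannot know)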
  - Different neighborhood structures
- Models with simpler structure (linear) may generalize better across cities
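
The currency point follows from the log target: multiplying every price by a rate `r` adds `ln(r)` to every log price. The sketch below illustrates the algebra with a hypothetical rate. The value 1.5 is a placeholder, not an actual NOK/DKK exchange rate.

```python
import numpy as np

# Hypothetical exchange rate, used only to illustrate the algebra
rate = 1.5
prices_a = np.array([500.0, 1000.0, 2000.0])
prices_b = prices_a * rate  # the same listings expressed in another currency

# ln(r * p) = ln(p) + ln(r): every log price moves by the same constant
shift = np.log(prices_b) - np.log(prices_a)

# A model that is perfect in currency A is off by exactly ln(rate) in currency B
rmse_from_shift = float(np.sqrt(np.mean(shift ** 2)))
print(shift, rmse_from_shift)
```

Converting prices to a common currency before taking logs would remove this offset.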

In [None]:
# Final summary table
print("\n" + "=" * 80)
print("FINAL MODEL COMPARISON SUMMARY")
print("=" * 80)
print(f"\nTraining data: Oslo ({len(X_train):,} observations)")
print(f"Temporal validation: Oslo holdout ({len(X_temporal):,} observations)")
print(f"Spatial validation: Copenhagen ({len(X_spatial):,} observations)")
print(f"\nNumber of features: {X_train.shape[1]}")
print("Target variable: log(price)")
print("\n")
print(results_df.to_string(index=False))

In [None]:
# Save results to CSV
results_df.to_csv('model_results.csv', index=False)
print("Results saved to model_results.csv")