# Notebook 05: Machine Learning Modeling

## Purpose
This notebook applies machine learning algorithms to predict Airbnb listing prices:
- Define the prediction problem
- Train multiple regression models
- Compare model performance
- Generate predictions

## Learning Objectives
- Formulate a machine learning problem
- Apply different ML algorithms
- Justify model selection
- Document modeling process

---
## 1. Import Libraries and Load Data

In [None]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.linear_model import LinearRegression, Ridge, Lasso
from sklearn.ensemble import RandomForestRegressor, GradientBoostingRegressor
from sklearn.tree import DecisionTreeRegressor
from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score
import pickle

# Display settings
pd.set_option('display.max_columns', None)
plt.style.use('seaborn-v0_8-whitegrid')

print("Libraries imported successfully!")

In [None]:
# Load engineered data
X_train = pd.read_csv('../data/X_train.csv')
X_test = pd.read_csv('../data/X_test.csv')
y_train = pd.read_csv('../data/y_train.csv')['price']
y_test = pd.read_csv('../data/y_test.csv')['price']

print("DATA LOADED:")
print("="*80)
print(f"Training set: {X_train.shape[0]:,} samples, {X_train.shape[1]} features")
print(f"Testing set: {X_test.shape[0]:,} samples, {X_test.shape[1]} features")
print("\nTarget variable range:")
print(f"  Training: ${y_train.min():.2f} - ${y_train.max():.2f}")
print(f"  Testing: ${y_test.min():.2f} - ${y_test.max():.2f}")

---
## 2. Problem Formulation

### Problem Type: Regression

**Objective**: Predict the price of Airbnb listings based on their characteristics.

**Why Regression?**
- The target variable (price) is continuous
- We want to predict actual dollar values, not categories
- Regression models can capture the relationship between features and price

**Features Used**:
- Location (latitude, longitude, neighbourhood)
- Property type (room type)
- Availability metrics
- Review statistics
- Host information

**Evaluation Metrics**:
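- Mean Absolute Error (MAE): Average absolute prediction error, in dollars
- Root Mean Squared Error (RMSE): Square root of the average squared error, also in dollars; squaring penalizes large errors more heavily than MAE does
- R² Score: Proportion of variance in price explained by the model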
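
To make the three metrics concrete, the following minimal sketch computes them by hand on four made-up prices (the values are illustrative and not taken from the Airbnb data). It uses only the standard library so the arithmetic is visible; the notebook itself relies on `sklearn.metrics` for the same quantities.

```python
import math

# Toy values for illustration only (not from the dataset)
y_true = [100.0, 150.0, 200.0, 250.0]
y_pred = [110.0, 140.0, 210.0, 230.0]

n = len(y_true)
errors = [t - p for t, p in zip(y_true, y_pred)]

# MAE: mean of the absolute errors, in the same units as price
mae = sum(abs(e) for e in errors) / n

# RMSE: square root of the mean squared error
rmse = math.sqrt(sum(e ** 2 for e in errors) / n)

# R²: 1 - (residual sum of squares / total sum of squares around the mean)
mean_true = sum(y_true) / n
ss_res = sum(e ** 2 for e in errors)
ss_tot = sum((t - mean_true) ** 2 for t in y_true)
r2 = 1 - ss_res / ss_tot

print(f"MAE:  ${mae:.2f}")    # MAE:  $12.50
print(f"RMSE: ${rmse:.2f}")   # RMSE: $13.23
print(f"R²:   {r2:.3f}")      # R²:   0.944
```

The single $20 miss pulls RMSE above MAE. RMSE is never smaller than MAE, and a wide gap between the two points to a few large errors.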

---
## 3. Model Selection and Training

We'll train multiple models and compare their performance:
1. **Linear Regression** - Baseline model
2. **Ridge Regression** - Linear with L2 regularization
3. **Lasso Regression** - Linear with L1 regularization
4. **Decision Tree** - Non-linear model
5. **Random Forest** - Ensemble of decision trees
6. **Gradient Boosting** - Ensemble of trees built sequentially

### 3.1 Linear Regression (Baseline)

**Explanation**: Linear regression assumes a linear relationship between features and target. It's simple and interpretable.
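
As a rough illustration of what "fitting a line" means, here is a minimal one-feature least-squares sketch in plain Python. The `guests` feature and the prices are hypothetical toy values, and `fit_line` is a helper written for this example; the notebook uses scikit-learn's `LinearRegression`, which handles many features at once.

```python
def fit_line(x, y):
    """Ordinary least squares for one feature: returns (slope, intercept)."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    # slope = covariance(x, y) / variance(x)
    cov_xy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    var_x = sum((xi - mean_x) ** 2 for xi in x)
    slope = cov_xy / var_x
    intercept = mean_y - slope * mean_x
    return slope, intercept

# Hypothetical toy data: price rises by 20 per extra guest from a base of 30
guests = [1, 2, 3, 4]
price = [50, 70, 90, 110]

slope, intercept = fit_line(guests, price)
print(slope, intercept)  # 20.0 30.0
```

The slope reads directly as "dollars per additional guest", which is the interpretability advantage of a linear model.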

In [None]:
# Train Linear Regression
print("Training Linear Regression...")
lr_model = LinearRegression()
lr_model.fit(X_train, y_train)

# Make predictions
y_pred_lr_train = lr_model.predict(X_train)
y_pred_lr_test = lr_model.predict(X_test)

print("✓ Linear Regression trained")

### 3.2 Ridge Regression

**Explanation**: Ridge adds an L2 penalty on the squared size of the coefficients, which shrinks them toward zero and reduces overfitting. The `alpha` parameter sets the penalty strength.
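
The shrinkage effect can be seen in a minimal sketch with one feature and no intercept, where the ridge solution has a simple closed form. The helper `ridge_slope` and the toy data are assumptions made for this illustration, not part of the notebook's pipeline.

```python
def ridge_slope(x, y, alpha):
    """Ridge solution for one feature and no intercept.

    Minimizes sum((y - w*x)^2) + alpha * w^2, whose closed form is
    w = sum(x*y) / (sum(x^2) + alpha).
    """
    sum_xy = sum(xi * yi for xi, yi in zip(x, y))
    sum_xx = sum(xi * xi for xi in x)
    return sum_xy / (sum_xx + alpha)

# Toy data that follows y = 2x exactly
x = [1.0, 2.0, 3.0]
y = [2.0, 4.0, 6.0]

for alpha in (0.0, 1.0, 14.0):
    print(alpha, round(ridge_slope(x, y, alpha), 4))
# 0.0 2.0
# 1.0 1.8667
# 14.0 1.0
```

With `alpha=0` the unpenalized slope of 2 is recovered; larger values pull the coefficient toward zero but never make it exactly zero.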

In [None]:
# Train Ridge Regression
print("Training Ridge Regression...")
ridge_model = Ridge(alpha=1.0, random_state=42)
ridge_model.fit(X_train, y_train)

# Make predictions
y_pred_ridge_train = ridge_model.predict(X_train)
y_pred_ridge_test = ridge_model.predict(X_test)

print("✓ Ridge Regression trained")

### 3.3 Lasso Regression

**Explanation**: Lasso adds an L1 penalty, which can set some coefficients exactly to zero and so acts as a form of feature selection.
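
The reason an L1 penalty can zero out a coefficient is the soft-thresholding step, sketched below for one feature with no intercept. The objective in the docstring follows the form scikit-learn documents for `Lasso`; the helper names and toy data are assumptions for this illustration.

```python
def soft_threshold(rho, alpha):
    """Shrink rho toward zero by alpha; return 0 if |rho| <= alpha."""
    if rho > alpha:
        return rho - alpha
    if rho < -alpha:
        return rho + alpha
    return 0.0

def lasso_slope(x, y, alpha):
    """Lasso solution for one feature and no intercept.

    Minimizes (1 / (2n)) * sum((y - w*x)^2) + alpha * |w|.
    """
    n = len(x)
    rho = sum(xi * yi for xi, yi in zip(x, y)) / n
    z = sum(xi * xi for xi in x) / n
    return soft_threshold(rho, alpha) / z

# Toy data that follows y = 2x exactly
x = [1.0, 2.0, 3.0]
y = [2.0, 4.0, 6.0]

for alpha in (0.0, 1.0, 10.0):
    print(alpha, round(lasso_slope(x, y, alpha), 4))
# 0.0 2.0
# 1.0 1.7857
# 10.0 0.0
```

Once `alpha` exceeds the feature's correlation with the target (`rho`), the coefficient becomes exactly zero, which is how the model drops features.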

In [None]:
# Train Lasso Regression
print("Training Lasso Regression...")
lasso_model = Lasso(alpha=1.0, random_state=42)
lasso_model.fit(X_train, y_train)

# Make predictions
y_pred_lasso_train = lasso_model.predict(X_train)
y_pred_lasso_test = lasso_model.predict(X_test)

print("✓ Lasso Regression trained")

### 3.4 Decision Tree Regressor

**Explanation**: Decision trees can capture non-linear relationships and interactions between features.
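
A regression tree is built from splits like the one sketched below: for a feature, try each candidate threshold and keep the one that minimizes the summed squared error of the two sides. This is a simplified single-feature, single-split illustration with hypothetical data and helper names; scikit-learn's `DecisionTreeRegressor` repeats the search over all features and recursively down the tree.

```python
def sse(values):
    """Sum of squared deviations from the mean (0 for an empty list)."""
    if not values:
        return 0.0
    mean = sum(values) / len(values)
    return sum((v - mean) ** 2 for v in values)

def best_split(x, y):
    """Return (threshold, total_sse) of the best single split on one feature."""
    pairs = sorted(zip(x, y))
    best = (None, sse(y))  # no split: predict the overall mean
    for i in range(1, len(pairs)):
        if pairs[i - 1][0] == pairs[i][0]:
            continue  # cannot split between identical feature values
        threshold = (pairs[i - 1][0] + pairs[i][0]) / 2
        left = [yi for xi, yi in pairs if xi <= threshold]
        right = [yi for xi, yi in pairs if xi > threshold]
        total = sse(left) + sse(right)
        if total < best[1]:
            best = (threshold, total)
    return best

# Hypothetical toy data: price jumps once a listing sleeps more than 2 guests
guests = [1, 2, 3, 4]
price = [50.0, 50.0, 120.0, 120.0]

print(best_split(guests, price))  # (2.5, 0.0)
```

A straight line cannot fit this step-shaped pattern, whereas a single split fits it with zero error.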

In [None]:
# Train Decision Tree
print("Training Decision Tree Regressor...")
dt_model = DecisionTreeRegressor(max_depth=10, random_state=42)
dt_model.fit(X_train, y_train)

# Make predictions
y_pred_dt_train = dt_model.predict(X_train)
y_pred_dt_test = dt_model.predict(X_test)

print("✓ Decision Tree Regressor trained")

### 3.5 Random Forest Regressor

**Explanation**: Random Forest averages the predictions of many decision trees, each trained on a bootstrap sample of the training data, which reduces the variance (overfitting) of a single tree.
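
The averaging idea can be sketched with one-split trees ("stumps") trained on bootstrap samples. This is a deliberately reduced illustration: it shows only the resample-and-average part, uses hypothetical toy data and helper names, and omits the deeper trees and per-split feature sampling that a real random forest can use.

```python
import random

def fit_stump(x, y):
    """Fit a one-split regression tree; return a predict function."""
    def mean(v):
        return sum(v) / len(v)

    best = None  # (sse, threshold, left_mean, right_mean)
    for t in sorted(set(x))[:-1]:  # split as x <= t versus x > t
        left = [yi for xi, yi in zip(x, y) if xi <= t]
        right = [yi for xi, yi in zip(x, y) if xi > t]
        lm, rm = mean(left), mean(right)
        err = sum((v - lm) ** 2 for v in left) + sum((v - rm) ** 2 for v in right)
        if best is None or err < best[0]:
            best = (err, t, lm, rm)
    if best is None:  # all x identical in this sample: predict the mean
        overall = mean(y)
        return lambda xi: overall
    _, t, lm, rm = best
    return lambda xi: lm if xi <= t else rm

def bagged_stumps(x, y, n_estimators=50, seed=42):
    """Train stumps on bootstrap samples; predict with their average."""
    rng = random.Random(seed)
    n = len(x)
    stumps = []
    for _ in range(n_estimators):
        idx = [rng.randrange(n) for _ in range(n)]  # sample with replacement
        stumps.append(fit_stump([x[i] for i in idx], [y[i] for i in idx]))
    return lambda xi: sum(s(xi) for s in stumps) / len(stumps)

# Hypothetical toy data
guests = [1, 2, 3, 4, 5, 6]
price = [50.0, 55.0, 60.0, 110.0, 120.0, 125.0]

predict = bagged_stumps(guests, price)
print(round(predict(1), 1), round(predict(6), 1))
```

Each stump sees a slightly different sample and so picks a slightly different split; averaging them smooths out the quirks of any single tree.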

In [None]:
# Train Random Forest
print("Training Random Forest Regressor...")
rf_model = RandomForestRegressor(n_estimators=100, max_depth=10, random_state=42, n_jobs=-1)
rf_model.fit(X_train, y_train)

# Make predictions
y_pred_rf_train = rf_model.predict(X_train)
y_pred_rf_test = rf_model.predict(X_test)

print("✓ Random Forest Regressor trained")

### 3.6 Gradient Boosting Regressor

**Explanation**: Gradient Boosting builds trees sequentially, each correcting errors from previous trees.
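
For squared-error loss, "correcting errors" means each new tree is fit to the residuals of the current prediction. The sketch below shows this with one-split trees on hypothetical toy data; the helper names and the small `n_estimators` and large `learning_rate` are choices made for this illustration so the numbers are easy to follow, and they differ from the settings used in the notebook.

```python
def fit_stump(x, y):
    """Fit a one-split regression tree; return a predict function.

    Assumes x contains at least two distinct values.
    """
    def mean(v):
        return sum(v) / len(v)

    best = None  # (sse, threshold, left_mean, right_mean)
    for t in sorted(set(x))[:-1]:  # split as x <= t versus x > t
        left = [yi for xi, yi in zip(x, y) if xi <= t]
        right = [yi for xi, yi in zip(x, y) if xi > t]
        lm, rm = mean(left), mean(right)
        err = sum((v - lm) ** 2 for v in left) + sum((v - rm) ** 2 for v in right)
        if best is None or err < best[0]:
            best = (err, t, lm, rm)
    _, t, lm, rm = best
    return lambda xi: lm if xi <= t else rm

def boost(x, y, n_estimators=3, learning_rate=0.5):
    """Gradient boosting for squared error: each stump fits the residuals."""
    base = sum(y) / len(y)  # start from the mean of the target
    preds = [base] * len(y)
    stumps = []
    for _ in range(n_estimators):
        residuals = [yi - pi for yi, pi in zip(y, preds)]
        stump = fit_stump(x, residuals)
        stumps.append(stump)
        # Move the predictions a fraction of the way toward the residual fit
        preds = [pi + learning_rate * stump(xi) for pi, xi in zip(preds, x)]
    return lambda xi: base + learning_rate * sum(s(xi) for s in stumps)

# Hypothetical toy data
guests = [1, 2, 3, 4]
price = [10.0, 10.0, 20.0, 20.0]

predict = boost(guests, price)
print(predict(1), predict(4))  # 10.625 19.375
```

The prediction starts at the mean (15) and the remaining error halves each round (5, 2.5, 1.25, 0.625), which is why more estimators or a larger learning rate fit the training data more closely.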

In [None]:
# Train Gradient Boosting
print("Training Gradient Boosting Regressor...")
gb_model = GradientBoostingRegressor(n_estimators=100, max_depth=5, random_state=42)
gb_model.fit(X_train, y_train)

# Make predictions
y_pred_gb_train = gb_model.predict(X_train)
y_pred_gb_test = gb_model.predict(X_test)

print("✓ Gradient Boosting Regressor trained")

---
## 4. Model Evaluation

### 4.1 Calculate Metrics for All Models

In [None]:
# Function to calculate metrics
def calculate_metrics(y_true, y_pred, dataset_name, model_name):
    """Return MAE, RMSE and R² for one model on one dataset as a dict."""
    mae = mean_absolute_error(y_true, y_pred)
    mse = mean_squared_error(y_true, y_pred)
    rmse = np.sqrt(mse)  # RMSE is in the same units as price
    r2 = r2_score(y_true, y_pred)

    return {
        'Model': model_name,
        'Dataset': dataset_name,
        'MAE': mae,
        'RMSE': rmse,
        'R2_Score': r2
    }

# Calculate metrics for all models
results = []

models = [
    ('Linear Regression', y_pred_lr_train, y_pred_lr_test),
    ('Ridge Regression', y_pred_ridge_train, y_pred_ridge_test),
    ('Lasso Regression', y_pred_lasso_train, y_pred_lasso_test),
    ('Decision Tree', y_pred_dt_train, y_pred_dt_test),
    ('Random Forest', y_pred_rf_train, y_pred_rf_test),
    ('Gradient Boosting', y_pred_gb_train, y_pred_gb_test)
]

for model_name, y_pred_train, y_pred_test in models:
    results.append(calculate_metrics(y_train, y_pred_train, 'Training', model_name))
    results.append(calculate_metrics(y_test, y_pred_test, 'Testing', model_name))

# Create results dataframe
results_df = pd.DataFrame(results)

print("MODEL EVALUATION RESULTS:")
print("="*80)
print(results_df.to_string(index=False))

### 4.2 Visualize Model Comparison

In [None]:
# Keep only the test-set results
test_results = results_df[results_df['Dataset'] == 'Testing'].copy()

# One horizontal bar chart per metric: (column, x-axis label, color, title)
panels = [
    ('MAE', 'Mean Absolute Error ($)', 'coral', 'MAE Comparison (Lower is Better)'),
    ('RMSE', 'Root Mean Squared Error ($)', 'skyblue', 'RMSE Comparison (Lower is Better)'),
    ('R2_Score', 'R² Score', 'lightgreen', 'R² Score Comparison (Higher is Better)'),
]

# Create comparison plots
fig, axes = plt.subplots(1, 3, figsize=(16, 5))
for ax, (column, xlabel, color, title) in zip(axes, panels):
    ax.barh(test_results['Model'], test_results[column], color=color, edgecolor='black')
    ax.set_xlabel(xlabel, fontsize=11)
    ax.set_title(title, fontsize=12, fontweight='bold')
    ax.grid(axis='x', alpha=0.3)

plt.tight_layout()
plt.show()

### 4.3 Identify Best Model

In [None]:
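# Find best model based on test R² score.
# Note: choosing the model on the test set makes its reported test score
# slightly optimistic; cross-validation on the training set avoids this.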
best_model_row = test_results.loc[test_results['R2_Score'].idxmax()]

print("BEST MODEL:")
print("="*80)
print(f"Model: {best_model_row['Model']}")
print(f"MAE: ${best_model_row['MAE']:.2f}")
print(f"RMSE: ${best_model_row['RMSE']:.2f}")
print(f"R² Score: {best_model_row['R2_Score']:.4f}")
print(f"\nThis model explains {best_model_row['R2_Score']*100:.2f}% of the variance in price.")

---
## 5. Feature Importance (Random Forest)

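This section identifies which features contribute most to the Random Forest's predictions. The values are scikit-learn's impurity-based importances, which are normalized to sum to 1.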

In [None]:
# Get feature importance from Random Forest
feature_importance = pd.DataFrame({
    'Feature': X_train.columns,
    'Importance': rf_model.feature_importances_
}).sort_values('Importance', ascending=False)

# Plot top 15 features
plt.figure(figsize=(10, 8))
plt.barh(feature_importance.head(15)['Feature'], 
         feature_importance.head(15)['Importance'], 
         color='teal', edgecolor='black')
plt.xlabel('Importance', fontsize=11)
plt.ylabel('Feature', fontsize=11)
plt.title('Top 15 Most Important Features (Random Forest)', fontsize=13, fontweight='bold')
plt.gca().invert_yaxis()
plt.grid(axis='x', alpha=0.3)
plt.tight_layout()
plt.show()

print("\nTOP 10 MOST IMPORTANT FEATURES:")
print("="*80)
print(feature_importance.head(10).to_string(index=False))

---
## 6. Save Models

Save the trained models and the test-set predictions for use in the evaluation notebook.

In [None]:
# Save all models
models_to_save = {
    'linear_regression': lr_model,
    'ridge_regression': ridge_model,
    'lasso_regression': lasso_model,
    'decision_tree': dt_model,
    'random_forest': rf_model,
    'gradient_boosting': gb_model
}

for model_name, model in models_to_save.items():
    with open(f'../data/{model_name}_model.pkl', 'wb') as f:
        pickle.dump(model, f)
    print(f"✓ Saved {model_name}")

# Save predictions
predictions_df = pd.DataFrame({
    'actual': y_test,
    'lr_pred': y_pred_lr_test,
    'ridge_pred': y_pred_ridge_test,
    'lasso_pred': y_pred_lasso_test,
    'dt_pred': y_pred_dt_test,
    'rf_pred': y_pred_rf_test,
    'gb_pred': y_pred_gb_test
})

predictions_df.to_csv('../data/predictions.csv', index=False)
print("\n✓ Saved predictions")
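
For reference, a saved model can be read back with `pickle.load`. The sketch below shows the save/load round trip on a stand-in object written to a temporary directory, so it runs without the trained models; the file name and the dictionary are placeholders for this illustration.

```python
import os
import pickle
import tempfile

# Stand-in object for illustration; a fitted scikit-learn estimator
# is saved and loaded the same way.
model = {"name": "random_forest", "max_depth": 10}

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "random_forest_model.pkl")

    # Save: binary write mode, as in the cell above
    with open(path, "wb") as f:
        pickle.dump(model, f)

    # Load: binary read mode
    with open(path, "rb") as f:
        restored = pickle.load(f)

print(restored == model)  # True
```

Only unpickle files from a source you trust, since loading a pickle can execute arbitrary code. For scikit-learn models, it is safest to load them with the same library version that saved them.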

---
## 7. Summary

### Models Trained:

1. **Linear Regression** ✅ - Simple baseline
2. **Ridge Regression** ✅ - L2 regularization
3. **Lasso Regression** ✅ - L1 regularization with feature selection
4. **Decision Tree** ✅ - Non-linear relationships
5. **Random Forest** ✅ - Ensemble method
6. **Gradient Boosting** ✅ - Sequential ensemble

### Key Findings:

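- Tree-based ensembles (Random Forest, Gradient Boosting) generally outperform the linear models on this data; the table in Section 4.1 gives the exact figures
- Location features (latitude, longitude) rank among the most important predictors in the Random Forest importance plot (Section 5)
- Room type and neighbourhood also have a strong effect on price
- The best model by test R² is reported in Section 4.3, together with its MAE and RMSE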

### Model Selection Justification:

We chose to train multiple models because:
- Different algorithms have different strengths
- Comparison helps identify the best approach for this data
- Ensemble methods often outperform single models
- Linear models provide interpretability while tree-based models capture complexity

### Next Steps:

In the next notebook, we will:
- Perform detailed evaluation of all models
- Visualize predictions vs actual values
- Analyze prediction errors
- Discuss limitations and improvements
- Draw final conclusions

---
**Next Notebook**: [06_evaluation_and_insights.ipynb](06_evaluation_and_insights.ipynb)