### __Predicting Airbnb Listing Prices in Sydney__

---

## Task 4: Visualizing Model Performance

After training the best-performing Airbnb price prediction model, the next step is to analyze how well it performs. This notebook covers:

✅ 1. Feature Importance Visualization - Identify which features impact price the most
- Bar chart of top 15 most important features
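- Uses `feature_importances_` for tree-based models or the absolute value of `coef_` for linear models (coefficient magnitudes are comparable only when features share the same scale)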

✅ 2. Residual Analysis - Check where the model makes large errors
- Scatter plot of Actual vs. Predicted prices (good predictions should be close to the red diagonal line)
- Points above the line are prices the model overestimates; points below it are prices it underestimates

✅ 3. Residual Distribution - Check how the errors are spread
- Histogram of residuals (actual minus predicted) to check whether errors are roughly symmetric, without a long tail on one side
- A well-performing model should have a distribution centered around zero

✅ 4. Model Evaluation Metrics
- Mean Absolute Error (MAE) – Average absolute error
- Root Mean Squared Error (RMSE) – Heavily penalizes large errors
- R² Score – How well the model explains price variations

✅ 5. Comparing Multiple Models - See which model has the lowest error
- Bar chart of RMSE values for different models
- Helps compare the performance of Random Forest, XGBoost, LightGBM, etc.

In [None]:
# Import libraries 
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns       
import shap
import joblib
import folium

from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score

In [None]:
# Load model predictions
df_results = pd.read_csv("results/predictions.csv")  # Contains actual vs predicted prices
df_train = pd.read_csv("data/processed/processed_train.csv")
df_test = pd.read_csv("data/processed/processed_test.csv")

In [None]:
# Load trained model
model = joblib.load("results/best_model.pkl")

### 1. Feature Importance Visualization

In [None]:
feature_names = df_train.drop(columns=["price"]).columns  # Features used in training

In [None]:
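# Retrieve feature importances from the trained model
if hasattr(model, "feature_importances_"):  # Tree-based models
    feature_importance = model.feature_importances_
elif hasattr(model, "coef_"):  # Linear models: use coefficient magnitudes
    feature_importance = np.abs(np.ravel(model.coef_))  # ravel handles a (1, n_features) coef_
else:
    # No built-in importances (e.g. a Pipeline); warn instead of silently plotting zeros
    print("Warning: model exposes neither feature_importances_ nor coef_; importances set to zero.")
    feature_importance = np.zeros(len(feature_names))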

feature_importance_df = pd.DataFrame({"Feature": feature_names, "Importance": feature_importance})
feature_importance_df = feature_importance_df.sort_values(by="Importance", ascending=False)

In [None]:
# Visualize the most important features
plt.figure(figsize=(12, 6))
sns.barplot(data=feature_importance_df.head(15), x="Importance", y="Feature", palette="viridis")
plt.title("Top 15 Feature Importances")
plt.xlabel("Importance")
plt.ylabel("Feature")
plt.show()
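
When the loaded model exposes neither `feature_importances_` nor `coef_`, the cell above falls back to zeros and the chart carries no information. One model-agnostic alternative is permutation importance, which shuffles one column at a time and measures how much the score drops. The sketch below is not part of the original workflow: it uses scikit-learn's `permutation_importance` on synthetic data, and the column names (`accommodates`, `bedrooms`, `noise`), `demo_model`, and the data itself are assumptions made up for illustration.

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestRegressor
from sklearn.inspection import permutation_importance
from sklearn.model_selection import train_test_split

# Synthetic stand-in data; column names are illustrative, not the notebook's real features
rng = np.random.default_rng(0)
X_demo = pd.DataFrame({
    "accommodates": rng.integers(1, 9, size=400),
    "bedrooms": rng.integers(1, 5, size=400),
    "noise": rng.normal(size=400),
})
# Price depends strongly on accommodates, weakly on bedrooms, and not at all on noise
y_demo = 40 * X_demo["accommodates"] + 10 * X_demo["bedrooms"] + rng.normal(scale=5, size=400)

X_fit, X_val, y_fit, y_val = train_test_split(X_demo, y_demo, random_state=0)
demo_model = RandomForestRegressor(n_estimators=50, random_state=0).fit(X_fit, y_fit)

# Shuffle each column on held-out data and record the average drop in score
perm = permutation_importance(demo_model, X_val, y_val, n_repeats=5, random_state=0)

perm_df = (
    pd.DataFrame({"Feature": X_demo.columns, "Importance": perm.importances_mean})
    .sort_values(by="Importance", ascending=False)
    .reset_index(drop=True)
)
print(perm_df)
```

The resulting `perm_df` has the same `Feature` / `Importance` layout as `feature_importance_df`, so it could be passed to the same bar chart, assuming the real model and a held-out feature matrix are substituted for the synthetic ones.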

### 2. Residual Analysis (Actual vs. Predicted Prices)

In [None]:
# Add a residual column (actual minus predicted); positive values are underestimates
df_results["Residual"] = df_results["Actual"] - df_results["Predicted"]

In [None]:
# Visualize residual analysis results
plt.figure(figsize=(8, 6))
sns.scatterplot(data=df_results, x="Actual", y="Predicted", alpha=0.5)
# Reference line y = x spanning the full range of both actual and predicted prices
lims = [df_results[["Actual", "Predicted"]].min().min(), df_results[["Actual", "Predicted"]].max().max()]
plt.plot(lims, lims, color="red", linestyle="dashed")
plt.xlabel("Actual Price")
plt.ylabel("Predicted Price")
plt.title("Actual vs. Predicted Prices")
plt.show()
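
The scatter plot shows over- and underestimation visually; a small table can put numbers on it. The sketch below splits listings into four price bands and reports the mean residual and the share of underestimated listings in each. It runs on synthetic data with the same `Actual` / `Predicted` / `Residual` column names as `df_results`; the simulated shrinkage toward 200 and the band labels are assumptions chosen only to make the pattern visible.

```python
import numpy as np
import pandas as pd

# Synthetic stand-in for df_results; the real notebook reads results/predictions.csv
rng = np.random.default_rng(1)
actual = rng.uniform(50, 500, size=300)
# Simulated model that pulls predictions toward 200 (an assumption for illustration)
predicted = 200 + 0.7 * (actual - 200) + rng.normal(scale=10, size=300)
demo = pd.DataFrame({"Actual": actual, "Predicted": predicted})
demo["Residual"] = demo["Actual"] - demo["Predicted"]

# Split listings into four equally sized bands by actual price
demo["PriceBand"] = pd.qcut(demo["Actual"], q=4, labels=["low", "mid-low", "mid-high", "high"])

# Negative mean residual = overestimated on average; positive = underestimated
band_summary = demo.groupby("PriceBand", observed=True)["Residual"].agg(
    mean_residual="mean",
    share_underestimated=lambda r: (r > 0).mean(),
)
print(band_summary)
```

In this simulated case the low band has a negative mean residual and the high band a positive one, which is the numeric counterpart of points sitting above the diagonal at low prices and below it at high prices.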

### 3. Residual Distribution

In [None]:
# Visualize residual distribution 
plt.figure(figsize=(8, 5))
sns.histplot(df_results["Residual"], bins=50, kde=True, color="blue")
plt.axvline(0, color="red", linestyle="dashed")
plt.xlabel("Residual (Error)")
plt.ylabel("Count")
plt.title("Distribution of Residuals")
plt.show()
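
A histogram can be backed up with a few summary numbers: a mean and median near zero suggest little systematic bias, and a skew far from zero points to a long tail on one side. The following is a minimal sketch on synthetic residuals (a made-up mix of small errors and a few large underestimates), not output from this project's model; the 50-unit tolerance is likewise an arbitrary assumption.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(2)
# Synthetic residuals: mostly small errors plus a few large underestimates
residuals = pd.Series(np.concatenate([
    rng.normal(loc=0, scale=20, size=950),
    rng.normal(loc=300, scale=50, size=50),
]))

summary = {
    "mean": residuals.mean(),        # pulled upward by the large underestimates
    "median": residuals.median(),    # stays close to zero
    "skew": residuals.skew(),        # positive = long right tail
    "share_within_50": (residuals.abs() <= 50).mean(),  # arbitrary tolerance
}
for name, value in summary.items():
    print(f"{name}: {value:.3f}")
```

With the real data, `df_results["Residual"]` would take the place of the synthetic `residuals` series.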

### 4. Model Evaluation Metrics

In [None]:
# Compute evaluation metrics on the saved predictions
mae = mean_absolute_error(df_results["Actual"], df_results["Predicted"])
rmse = np.sqrt(mean_squared_error(df_results["Actual"], df_results["Predicted"]))
r2 = r2_score(df_results["Actual"], df_results["Predicted"])

In [None]:
# Print the evaluation metrics
print(f"📌 Mean Absolute Error (MAE): {mae:.2f}")   
print(f"📌 Root Mean Squared Error (RMSE): {rmse:.2f}")
print(f"📌 R² Score: {r2:.3f}")
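
To see why RMSE penalizes large errors more heavily than MAE, consider two sets of predictions with the same total absolute error. The numbers below are a toy example invented for illustration, not results from the model.

```python
import numpy as np
from sklearn.metrics import mean_absolute_error, mean_squared_error

actual = np.array([100.0, 100.0, 100.0, 100.0])
cases = {
    "even errors": np.array([110.0, 90.0, 110.0, 90.0]),      # every prediction off by 10
    "one big error": np.array([100.0, 100.0, 100.0, 140.0]),  # a single prediction off by 40
}

scores = {}
for label, predicted in cases.items():
    case_mae = mean_absolute_error(actual, predicted)
    case_rmse = np.sqrt(mean_squared_error(actual, predicted))
    scores[label] = (case_mae, case_rmse)
    print(f"{label}: MAE={case_mae:.1f}, RMSE={case_rmse:.1f}")

# even errors: MAE=10.0, RMSE=10.0
# one big error: MAE=10.0, RMSE=20.0
```

Both cases have an MAE of 10, but concentrating the error in one listing doubles the RMSE, so a large gap between the two metrics hints at a few badly mispriced listings.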

### 5. Comparing Multiple Models

In [None]:
df_model_results = pd.read_csv("outputs/model_comparison.csv")  # Stores scores for different models

In [None]:
# Compare RMSE across the candidate models
plt.figure(figsize=(8, 5))
sns.barplot(data=df_model_results, x="Model", y="RMSE", palette="coolwarm")
plt.xlabel("Model")
plt.ylabel("RMSE (Lower is Better)")
plt.title("Comparison of Different Models")
plt.xticks(rotation=45)
plt.show()
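
Alongside the bar chart, a ranked table makes small differences between models easier to read. The sketch below sorts models by RMSE and adds each model's percentage gap to the best one. The scores are placeholders invented for illustration, not results from this project; it assumes only the `Model` and `RMSE` columns used in the plot above.

```python
import pandas as pd

# Placeholder scores, invented for illustration only
demo_scores = pd.DataFrame({
    "Model": ["Random Forest", "XGBoost", "LightGBM"],
    "RMSE": [80.0, 75.0, 78.0],
})

# Best (lowest) RMSE first
ranked = demo_scores.sort_values(by="RMSE").reset_index(drop=True)
# Gap to the best RMSE, as a percentage
ranked["RMSE_gap_pct"] = (ranked["RMSE"] / ranked["RMSE"].min() - 1) * 100
print(ranked.round(2))
```

Replacing `demo_scores` with `df_model_results` would apply the same ranking to the real comparison file.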