# 🔧 Hyperparameter Tuning with GridSearchCV

This notebook performs hyperparameter tuning on the Random Forest Regressor using `GridSearchCV`.  
The goal is to find the optimal combination of `max_depth` and `n_estimators` to minimize MAE.

In [None]:
import pandas as pd
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.pipeline import Pipeline
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import OneHotEncoder
from sklearn.model_selection import GridSearchCV, KFold
from sklearn.metrics import mean_absolute_error
import matplotlib.pyplot as plt

In [None]:
df = pd.read_csv("../data/housing.csv")

X = df.drop("Price", axis=1).copy()
y = df["Price"]

X["TotalRooms"] = X["Bedroom2"] + X["Bathroom"] + X["Rooms"]
X["HouseAge"] = 2025 - X["YearBuilt"]
X["PricePerSqm"] = df["Price"] / (X["BuildingArea"] + 1)

In [None]:
numeric_cols = X.select_dtypes(include=["int64", "float64"]).columns.tolist()
categorical_cols = X.select_dtypes(include=["object"]).columns.tolist()

numeric_transformer = Pipeline([
    ("imputer", SimpleImputer(strategy="median"))
])

categorical_transformer = Pipeline([
    ("imputer", SimpleImputer(strategy="most_frequent")),
    ("encoder", OneHotEncoder(handle_unknown="ignore", sparse_output=False))
])

preprocessor = ColumnTransformer([
    ("num", numeric_transformer, numeric_cols),
    ("cat", categorical_transformer, categorical_cols)
])

In [None]:
model = RandomForestRegressor(random_state=42)

pipeline = Pipeline([
    ("preprocessor", preprocessor),
    ("model", model)
])

param_grid = {
    "model__n_estimators": [50, 100, 200],
    "model__max_depth": [5, 10, 15, None]
}

cv = KFold(n_splits=5, shuffle=True, random_state=42)

grid_search = GridSearchCV(
    pipeline,
    param_grid,
    scoring="neg_mean_absolute_error",
    cv=cv,
    n_jobs=-1,
    verbose=2
)

In [None]:
grid_search.fit(X, y)

In [None]:
print("Best Parameters:", grid_search.best_params_)

best_mae = -grid_search.best_score_
print(f"Best MAE: ${best_mae:,.0f} AUD")

In [None]:
results = pd.DataFrame(grid_search.cv_results_)
results = results.sort_values("rank_test_score")
results[["param_model__n_estimators", "param_model__max_depth", "mean_test_score"]].head()

In [None]:
pivot = results.pivot(index="param_model__max_depth", columns="param_model__n_estimators", values="mean_test_score")
pivot = -pivot  # convert to positive MAE

pivot.plot(kind="line", marker="o", title="MAE vs. max_depth and n_estimators", figsize=(10,6))
plt.ylabel("Mean MAE (AUD)")
plt.grid(True)
plt.show()

## 🔧 Model Tuning Summary: GridSearchCV on Random Forest

This notebook conducted a hyperparameter search over 12 configurations of a `RandomForestRegressor` using 5-fold cross-validation and **Mean Absolute Error (MAE)** as the evaluation metric.

---

### 📊 Grid Search Configuration

- **Model:** RandomForestRegressor
- **Evaluation Metric:** `neg_mean_absolute_error`
- **Cross-Validation:** 5-fold, shuffled with `random_state=42`
- **Hyperparameters Tuned:**
  - `n_estimators ∈ {50, 100, 200}`
  - `max_depth ∈ {5, 10, 15, None}`

This results in 12 hyperparameter combinations tested for model performance.

---

### 🥇 Best Model Performance

| Hyperparameter        | Value      |
|-----------------------|------------|
| `n_estimators`        | 200        |
| `max_depth`           | None       |
| **Mean CV MAE**       | \$136,362 AUD |

The best-performing configuration used **200 estimators** and no constraint on tree depth. This model achieved the lowest MAE across folds, improving upon earlier results from Day 4 and Day 5.

---

### 📈 Observations

- Increasing both `n_estimators` and `max_depth` generally led to better performance, with diminishing returns beyond `max_depth=15`.
- The most complex models (deep trees + many estimators) did not overfit under cross-validation but may be computationally heavier for deployment.
- Shallower models (`max_depth=5`) exhibited underfitting, as reflected in MAE values > \$175,000 AUD.

---

### 📉 MAE Trend Visualization

The MAE surface plotted against `max_depth` and `n_estimators` showed:

- A **monotonic decrease** in error as `max_depth` increases.
- **Marginal improvements** moving from 100 to 200 estimators, suggesting that ensemble size is helpful but saturates.

---

### 🧠 Interpretation

The combination of deep trees and large ensemble size gave the most accurate predictions, reducing MAE by nearly **\$26,000 AUD** compared to the untuned model from Day 5 (MAE ≈ \$162,391 AUD).

This improvement confirms that careful hyperparameter optimization can meaningfully enhance model accuracy even in robust learners like Random Forests.

---

### ✅ Next Steps

- Use this tuned configuration for final model training and evaluation in Day 7.
- Optionally explore tuning `min_samples_leaf`, `min_samples_split`, or `max_features`.
- Consider using `RandomizedSearchCV` for larger hyperparameter spaces to improve compute efficiency.