# =========================================
# Housing Price Prediction Capstone Project
# =========================================

# %% [markdown]
# # Housing Price Prediction Capstone Project
# **Author:** Your Name  
# **Date:** 2025-10-24
# 
# ---
# 
# ## Summary
# End-to-end regression modeling for housing prices: data preprocessing, feature engineering, EDA, baseline & multiple models, hyperparameter tuning, SVR with kernel trick, bias-variance analysis, and final evaluation visualizations.

# %% [code]
# Import Libraries
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
import time
import warnings
import sys
from math import pi
from sklearn.exceptions import ConvergenceWarning
warnings.filterwarnings("ignore", category=ConvergenceWarning)
from sklearn.datasets import fetch_california_housing
from sklearn.model_selection import train_test_split, RandomizedSearchCV, GridSearchCV, learning_curve
from sklearn.preprocessing import StandardScaler, OneHotEncoder
from sklearn.metrics import mean_squared_error, mean_absolute_error, r2_score
from sklearn.linear_model import LinearRegression, Ridge, SGDRegressor
from sklearn.tree import DecisionTreeRegressor
from sklearn.ensemble import RandomForestRegressor, GradientBoostingRegressor, AdaBoostRegressor, ExtraTreesRegressor
from sklearn.neighbors import KNeighborsRegressor
from sklearn.svm import SVR
from datetime import datetime


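# Inline plots
%matplotlib inline
sns.set(style="whitegrid")
plt.rcParams['figure.figsize'] = (10,6)

# Figures are saved under images/, so create the directory if it is missing
import os
os.makedirs("images", exist_ok=True)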

# %% [markdown]
# ## Load and Inspect Dataset

# %% [code]
# Load the dataset
housing = fetch_california_housing()

# Convert to DataFrame
housing_df = pd.DataFrame(housing.data,
                             columns=housing.feature_names)
housing_df['MedHouseValue'] = pd.Series(housing.target)
display(housing_df.head())
#print(housing_df.info())
#print(housing_df.describe())
#print(housing_df.isnull().sum())

# %% [markdown]
# ## Feature & Task Understanding

# %% [code]
target = 'MedHouseValue'
predictors = [col for col in housing_df.columns if col != target]
numerical_features = housing_df[predictors].select_dtypes(include=['int64','float64']).columns.tolist()
categorical_features = housing_df[predictors].select_dtypes(include=['object','category']).columns.tolist()
print("Numerical Features:", numerical_features)
print("Categorical Features:", categorical_features)

plt.figure()
sns.heatmap(housing_df.corr(), annot=True, cmap='coolwarm', fmt=".2f")
plt.title("Correlation Heatmap")
plt.savefig("images/Correlation Heatmap.png")
plt.close()
#plt.show()


# %% [markdown]
# ## Preprocessing

# %% [code]
X = housing_df[predictors]
y = housing_df[target]

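X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Work on explicit copies so the in-place column assignments below do not
# trigger pandas' SettingWithCopyWarning
X_train, X_test = X_train.copy(), X_test.copy()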

# Scale numerical features
scaler = StandardScaler()
X_train[numerical_features] = scaler.fit_transform(X_train[numerical_features])
X_test[numerical_features] = scaler.transform(X_test[numerical_features])

# Encode categorical features
if categorical_features:
    # `sparse_output` replaced `sparse` in scikit-learn 1.2 (the old name was removed in 1.4)
    encoder = OneHotEncoder(drop='first', sparse_output=False)
    X_train_encoded = pd.DataFrame(encoder.fit_transform(X_train[categorical_features]),
                                   columns=encoder.get_feature_names_out(categorical_features),
                                   index=X_train.index)
    X_test_encoded = pd.DataFrame(encoder.transform(X_test[categorical_features]),
                                  columns=encoder.get_feature_names_out(categorical_features),
                                  index=X_test.index)
    X_train = X_train.drop(categorical_features, axis=1).join(X_train_encoded)
    X_test = X_test.drop(categorical_features, axis=1).join(X_test_encoded)
    
# %% [markdown]
# ## Feature Engineering

# %% [code]
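# NOTE: these ratios are computed on the already standardized columns, not on
# the raw values. A denominator close to zero (a row near the column mean)
# therefore produces a very large magnitude, and the derived columns are not
# re-standardized afterwards.
X_train['Rooms_per_Household'] = X_train['AveRooms'] / X_train['AveOccup']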
X_test['Rooms_per_Household'] = X_test['AveRooms'] / X_test['AveOccup']
X_train['Bedrooms_per_Room'] = X_train['AveBedrms'] / X_train['AveRooms']
X_test['Bedrooms_per_Room'] = X_test['AveBedrms'] / X_test['AveRooms']
X_train['Population_per_Household'] = X_train['Population'] / X_train['AveOccup']
X_test['Population_per_Household'] = X_test['Population'] / X_test['AveOccup']
#all_features = X_train.columns.tolist()
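
# %% [markdown]
# A minimal sketch of how the order of scaling and ratio construction affects the
# derived columns. It uses synthetic, strictly positive stand-ins for `AveRooms`
# and `AveOccup` (hypothetical data, not the California housing set) and compares
# "ratio first, then standardize" with "standardize first, then divide".

```python
import numpy as np
import pandas as pd
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)

# Synthetic stand-ins for two strictly positive raw columns (assumed ranges)
raw = pd.DataFrame({
    "AveRooms": rng.uniform(3.0, 8.0, size=1000),
    "AveOccup": rng.uniform(2.0, 4.0, size=1000),
})

# Option A: build the ratio from raw values, then standardize all columns
ratio_first = raw.assign(Rooms_per_Household=raw["AveRooms"] / raw["AveOccup"])
ratio_first_scaled = pd.DataFrame(
    StandardScaler().fit_transform(ratio_first), columns=ratio_first.columns
)

# Option B: standardize first, then divide the standardized columns
scaled = pd.DataFrame(StandardScaler().fit_transform(raw), columns=raw.columns)
ratio_after = scaled["AveRooms"] / scaled["AveOccup"]

max_ratio_first = ratio_first_scaled["Rooms_per_Household"].abs().max()
max_ratio_after = ratio_after.abs().max()
print(f"Largest |value|, ratio then scale: {max_ratio_first:.2f}")
print(f"Largest |value|, scale then ratio: {max_ratio_after:.2f}")
```

# In option B the denominator is centered on zero, so rows close to the column
# mean dominate the range of the derived feature.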

# %% [markdown]
# ## Exploratory Data Analysis (EDA)

# %% [code]
sns.set(style="whitegrid")
palette = sns.color_palette("Set2")

# Histograms
housing_df['Rooms_per_Household'] = housing_df['AveRooms'] / housing_df['AveOccup']
housing_df['Bedrooms_per_Room'] = housing_df['AveBedrms'] / housing_df['AveRooms']
housing_df['Population_per_Household'] = housing_df['Population'] / housing_df['AveOccup']

derived_features = ['Rooms_per_Household', 'Bedrooms_per_Room', 'Population_per_Household']
all_features = numerical_features + derived_features
fig, axes = plt.subplots(4, 3, figsize=(18, 16))
axes = axes.flatten()

for i, feature in enumerate(all_features):
    sns.histplot(X_train[feature], bins=30, kde=True, ax=axes[i], color=palette[i % len(palette)])
    axes[i].set_title(f'Distribution of {feature}')
    axes[i].set_xlabel('')
    axes[i].set_ylabel('Count')

# Remove the last empty subplot (if features < 12)
if len(all_features) < len(axes):
    for j in range(len(all_features), len(axes)):
        fig.delaxes(axes[j])

plt.tight_layout()
#plt.savefig("housing_feature_histograms.png", dpi=300, bbox_inches='tight')
plt.savefig("images/housing_feature_histograms.png")
plt.close()
#plt.show()

# Scatter plots
plt.figure(figsize=(10,6))
sns.scatterplot(x=X_train['MedInc'], y=y_train, alpha=0.5, color='royalblue')
plt.title("Median Income vs Median House Value")
plt.xlabel("Median Income (scaled)")
plt.ylabel("Median House Value")
plt.tight_layout()
plt.savefig("images/median_income_vs_housevalue.png")
#plt.show()
plt.close()

plt.figure(figsize=(10,6))
sns.scatterplot(x=X_train['Longitude'], y=X_train['Latitude'], hue=y_train, palette='viridis', alpha=0.5)
plt.title("Geographic Distribution of House Values")
plt.xlabel("Longitude")
plt.ylabel("Latitude")
plt.legend(title="MedHouseValue", bbox_to_anchor=(1.05,1), loc='upper left')
plt.tight_layout()
plt.savefig("images/geographic_distribution.png", dpi=300, bbox_inches='tight')
#plt.show()
plt.close()

# %% [markdown]
# ## Outlier Handling

# %% [code]
def cap_outliers(df, features):
    """Clip each feature to its IQR fences [Q1 - 1.5*IQR, Q3 + 1.5*IQR].

    The fences are computed from `df` itself, so the train and test frames
    below are each capped with their own quantiles.
    """
    capped = df.copy()
    for feature in features:
        Q1 = df[feature].quantile(0.25)
        Q3 = df[feature].quantile(0.75)
        IQR = Q3 - Q1
        lower = Q1 - 1.5*IQR
        upper = Q3 + 1.5*IQR
        capped[feature] = df[feature].clip(lower=lower, upper=upper)
    return capped

X_train_capped = cap_outliers(X_train, all_features)
X_test_capped = cap_outliers(X_test, all_features)
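
# %% [markdown]
# A common alternative is to derive the IQR fences from the training frame only
# and reuse them on the test frame, so both splits go through the same
# transformation. The sketch below illustrates that idea on toy data; the helper
# names `fit_iqr_bounds` and `apply_bounds` are hypothetical and are not used
# elsewhere in this notebook.

```python
import pandas as pd

def fit_iqr_bounds(df, features, k=1.5):
    """Return {feature: (lower, upper)} IQR fences computed from `df`."""
    bounds = {}
    for feature in features:
        q1 = df[feature].quantile(0.25)
        q3 = df[feature].quantile(0.75)
        iqr = q3 - q1
        bounds[feature] = (q1 - k * iqr, q3 + k * iqr)
    return bounds

def apply_bounds(df, bounds):
    """Clip each feature in `df` to previously fitted fences."""
    capped = df.copy()
    for feature, (lower, upper) in bounds.items():
        capped[feature] = df[feature].clip(lower=lower, upper=upper)
    return capped

# Toy frames: the fences come from `train` and are reused unchanged on `test`
train = pd.DataFrame({"x": [1.0, 2.0, 3.0, 4.0, 100.0]})
test = pd.DataFrame({"x": [0.0, 50.0, 1000.0]})

bounds = fit_iqr_bounds(train, ["x"])   # Q1 = 2, Q3 = 4, fences (-1, 7)
train_capped = apply_bounds(train, bounds)
test_capped = apply_bounds(test, bounds)
print(bounds)
print(test_capped["x"].tolist())
```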

# %% [markdown]
# ## Baseline Model

# %% [code]
baseline_pred = np.median(y_train)
baseline_mse = mean_squared_error(y_test, [baseline_pred]*len(y_test))
baseline_mae = mean_absolute_error(y_test, [baseline_pred]*len(y_test))
baseline_r2 = r2_score(y_test, [baseline_pred]*len(y_test))
print("Baseline Model - MSE:", baseline_mse, "MAE:", baseline_mae, "R2:", baseline_r2)
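
# %% [markdown]
# A constant prediction can score a slightly negative R², as the baseline does
# here. R² compares a model against predicting the mean of the evaluated targets,
# so any other constant (such as a median) scores below zero. The toy example
# below uses made-up numbers to show this; it is only an illustration.

```python
import numpy as np
from sklearn.metrics import r2_score

# Right-skewed toy target: mean = 2.0, median = 1.0
y = np.array([1.0, 1.0, 1.0, 1.0, 6.0])

# Predicting the mean reproduces the reference model behind R²
r2_mean = r2_score(y, np.full_like(y, y.mean()))

# Predicting the median has a larger squared error than the mean
r2_median = r2_score(y, np.full_like(y, np.median(y)))

print(f"R² (mean predictor):   {r2_mean:.2f}")
print(f"R² (median predictor): {r2_median:.2f}")
```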

# %% [markdown]
# ## Regression Model Comparison (Existing Models)

# %% [code]
models = {
    "Linear Regression": LinearRegression(),
    "Ridge Regression": Ridge(alpha=1.0),
    "Decision Tree": DecisionTreeRegressor(random_state=42),
    "Random Forest": RandomForestRegressor(n_estimators=100, random_state=42),
    "Gradient Boosting": GradientBoostingRegressor(n_estimators=100, learning_rate=0.1, random_state=42),
    "KNN Regressor": KNeighborsRegressor(n_neighbors=5)
}

results = []
for name, model in models.items():
    model.fit(X_train_capped, y_train)
    y_pred = model.predict(X_test_capped)
    results.append([name,
                    mean_squared_error(y_test, y_pred),
                    mean_absolute_error(y_test, y_pred),
                    r2_score(y_test, y_pred)])

results_df = pd.DataFrame(results, columns=['Model','MSE','MAE','R2'])
results_df.sort_values('R2', ascending=False, inplace=True)
display(results_df)



# %% [markdown]
# ## Random Forest Regressor – Hyperparameter Optimization (RandomizedSearchCV)

# RandomForest is a robust ensemble method that averages multiple decision trees to reduce overfitting.  
# However, exhaustive `GridSearchCV` on large datasets can be very time-consuming.  
# To optimize runtime without compromising accuracy, we switch to **`RandomizedSearchCV`**, which samples
# from a defined parameter space to find strong hyperparameters efficiently.
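#
# As a rough illustration of the difference in cost, the sketch below counts the
# candidates an exhaustive grid would evaluate over a search space like the one
# used in the next cell, and compares that with a 6-candidate random sample.
# `param_space` is a standalone copy made for this example, and `cv = 3` is
# assumed to match the search settings.

```python
from sklearn.model_selection import ParameterGrid, ParameterSampler

# Standalone copy of the Random Forest search space (for illustration only)
param_space = {
    'n_estimators': [50, 100, 150, 200],
    'max_depth': [None, 10, 20, 30],
    'min_samples_split': [2, 5, 10],
    'min_samples_leaf': [1, 2, 4],
    'max_features': ['sqrt', 'log2'],
}
cv = 3

# Exhaustive grid: every combination, 4 * 4 * 3 * 3 * 2 candidates
n_grid = len(ParameterGrid(param_space))

# Randomized search: a fixed number of sampled combinations
sampled = list(ParameterSampler(param_space, n_iter=6, random_state=42))

print(f"Exhaustive grid:   {n_grid} candidates -> {n_grid * cv} fits")
print(f"Randomized search: {len(sampled)} candidates -> {len(sampled) * cv} fits")
```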

# %% [code]
# Track runtime
start_time = time.time()

# ----------------------------
# Stage 1: Broad Randomized Search
# ----------------------------
param_dist_rf = {
    'n_estimators': [50, 100, 150, 200],
    'max_depth': [None, 10, 20, 30],
    'min_samples_split': [2, 5, 10],
    'min_samples_leaf': [1, 2, 4],
    'max_features': ['sqrt', 'log2']
}

random_search_rf = RandomizedSearchCV(
    RandomForestRegressor(random_state=42),
    param_distributions=param_dist_rf,
    n_iter=6,                   
    cv=3,
    scoring='r2',
    n_jobs=-1,
    verbose=1,
    random_state=42
)

random_search_rf.fit(X_train_capped, y_train)
best_params_stage1 = random_search_rf.best_params_

print("\n✅ Broad Randomized Search Complete")
print("Best Parameters:", best_params_stage1)
print("Best CV R²:", random_search_rf.best_score_)

# ----------------------------
# Stage 2: Narrow Grid Search
# ----------------------------
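# Keep the Stage 1 winner fixed and only re-check max_depth against an
# unrestricted depth (None). dict.fromkeys drops the duplicate candidate
# when the Stage 1 winner is already None.
param_grid_rf = {
    'n_estimators': [best_params_stage1['n_estimators']],
    'max_depth': list(dict.fromkeys([best_params_stage1['max_depth'], None])),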
    'min_samples_split': [best_params_stage1['min_samples_split']],
    'min_samples_leaf': [best_params_stage1['min_samples_leaf']],
    'max_features': [best_params_stage1['max_features']]
}

grid_search_rf = GridSearchCV(
    RandomForestRegressor(random_state=42),
    param_grid=param_grid_rf,
    cv=3,
    scoring='r2',
    n_jobs=-1,
    verbose=1
)

grid_search_rf.fit(X_train_capped, y_train)
best_rf = grid_search_rf.best_estimator_

print("\n✅ Narrow Grid Search Complete")
print("Best Parameters:", grid_search_rf.best_params_)
print("Best CV R²:", grid_search_rf.best_score_)

# ----------------------------
# Evaluate Final Tuned Model
# ----------------------------
y_pred_best_rf = best_rf.predict(X_test_capped)
print("\n🏆 Final Tuned Random Forest Performance")
print("---------------------------------------")
print(f"R² Score (Test): {r2_score(y_test, y_pred_best_rf):.4f}")
print(f"MAE (Test): {mean_absolute_error(y_test, y_pred_best_rf):.4f}")
print(f"MSE (Test): {mean_squared_error(y_test, y_pred_best_rf):.4f}")

print(f"\nTotal Training Time: {(time.time() - start_time)/60:.2f} min")

# ----------------------------
# Visualization 1: Actual vs Predicted
# ----------------------------
plt.figure(figsize=(8,6))
sns.scatterplot(x=y_test, y=y_pred_best_rf, alpha=0.6, color='royalblue')
plt.plot([y_test.min(), y_test.max()],
         [y_test.min(), y_test.max()],
         'r--', lw=2, label='Perfect Prediction')
plt.xlabel("Actual Median House Value")
plt.ylabel("Predicted Median House Value")
plt.title("Final Random Forest (Optimized) - Actual vs Predicted Prices")
plt.legend()
plt.grid(True)
plt.tight_layout()
plt.savefig("images/Random Forest (Optimized) - Actual vs Predicted Prices.png")
plt.close()
#plt.show()

# ----------------------------
# Visualization 2: Feature Importances
# ----------------------------
feat_imp = (
    pd.DataFrame({
        'Feature': X_train_capped.columns,
        'Importance': best_rf.feature_importances_
    })
    .sort_values('Importance', ascending=False)
)

plt.figure(figsize=(10,6))
sns.barplot(x='Importance', y='Feature', hue='Feature',data=feat_imp.head(15), palette='bright')
plt.title("Top 15 Feature Importances – Final Random Forest")
plt.xlabel("Importance Score")
plt.ylabel("Feature")
plt.tight_layout()
plt.savefig("images/Top 15 Feature Importances.png")
plt.close()
#plt.show()

# %% [markdown]
# ## Additional Models + Optimized SVR Tuning + Bias–Variance Analysis
# Includes: SGDRegressor, SVR with kernel trick, randomized hyper-parameter search,
# bias–variance learning curves, and epsilon-tube visualization.

# %% [code]
# ------------------------------------------------------
# SGD Regressor (Baseline)
# ------------------------------------------------------
print("\n⚙️ Training SGD Regressor...")
sgd = SGDRegressor(max_iter=1000, tol=1e-3, random_state=42)
sgd.fit(X_train_capped, y_train)
y_pred_sgd = sgd.predict(X_test_capped)

print(f"✅ SGD R²: {r2_score(y_test, y_pred_sgd):.4f}")
print(f"MAE: {mean_absolute_error(y_test, y_pred_sgd):.4f}")
print(f"MSE: {mean_squared_error(y_test, y_pred_sgd):.4f}")

# ------------------------------------------------------
# Baseline SVR (with subsample to reduce runtime)
# ------------------------------------------------------
print("\n⚙️ Training baseline SVR (RBF kernel) on 15% data for efficiency...")

sample_frac = 0.15
X_train_svr = X_train_capped.sample(frac=sample_frac, random_state=42)
y_train_svr = y_train.loc[X_train_svr.index]
X_test_svr = X_test_capped.sample(frac=sample_frac, random_state=42)
y_test_svr = y_test.loc[X_test_svr.index]

svr = SVR(kernel='rbf', C=100, gamma=0.1, epsilon=0.1)
svr.fit(X_train_svr, y_train_svr)
y_pred_svr = svr.predict(X_test_svr)
print(f"✅ SVR (Baseline, subsampled) R²: {r2_score(y_test_svr, y_pred_svr):.4f}")
print(f"MAE: {mean_absolute_error(y_test_svr, y_pred_svr):.4f}")
print(f"MSE: {mean_squared_error(y_test_svr, y_pred_svr):.4f}")

# ------------------------------------------------------
# RandomizedSearchCV for SVR Hyperparameter Optimization
# ------------------------------------------------------
print("\n⚙️ RandomizedSearchCV: Optimizing SVR Hyperparameters...")

param_dist_svr = {
    'C': np.logspace(1, 3, 5),         # [10, 31.6, 100, 316, 1000]
    'gamma': [0.01, 0.05, 0.1, 0.2],
    'epsilon': [0.01, 0.05, 0.1, 0.2]
}

random_search_svr = RandomizedSearchCV(
    SVR(kernel='rbf'),
    param_distributions=param_dist_svr,
    n_iter=8,                # small and fast
    cv=3,
    scoring='r2',
    n_jobs=-1,
    verbose=1,
    random_state=42
)

random_search_svr.fit(X_train_svr, y_train_svr)
best_svr = random_search_svr.best_estimator_
print("\n✅ SVR Randomized Search Complete")
print("Best Parameters:", random_search_svr.best_params_)
print(f"Best CV R²: {random_search_svr.best_score_:.4f}")

# Evaluate tuned SVR
y_pred_best_svr = best_svr.predict(X_test_svr)
print(f"🏆 Tuned SVR R² (test subset): {r2_score(y_test_svr, y_pred_best_svr):.4f}")

# ------------------------------------------------------
# Bias–Variance (Learning Curves)
# ------------------------------------------------------
print("\n📈 Generating learning curves (Bias–Variance Analysis)...")

def plot_learning_curve(model, X, y, title):
    train_sizes, train_scores, test_scores = learning_curve(
        model, X, y, cv=3, scoring='r2', n_jobs=-1,
        train_sizes=np.linspace(0.1, 1.0, 5)
    )
    plt.figure()
    plt.plot(train_sizes, np.mean(train_scores, axis=1), 'o-', color='blue', label='Train R²')
    plt.plot(train_sizes, np.mean(test_scores, axis=1), 'o-', color='green', label='CV R²')
    plt.xlabel("Training Examples")
    plt.ylabel("R² Score")
    plt.title(title)
    plt.legend()
    plt.grid(True)
    plt.tight_layout()
    plt.savefig(f"images/{title}.png")
    plt.close()
    #plt.show()

plot_learning_curve(sgd, X_train_capped, y_train, title="Learning Curve – SGD Regressor")
plot_learning_curve(best_svr, X_train_svr, y_train_svr, title="Learning Curve – Tuned SVR (Subsampled)")

# %% [markdown]
# ## SVR Margin / Maximum Margin Concept

# %% [code]
print("\n📊 Visualizing SVR Margin (Epsilon-Tube)...")

plt.figure(figsize=(8,5))
plt.scatter(X_train_svr['MedInc'], y_train_svr, alpha=0.3, label='Training Points')

# Create 500 evenly spaced values of MedInc
X_plot = np.linspace(X_train_svr['MedInc'].min(), X_train_svr['MedInc'].max(), 500)

# Create DataFrame with same columns as training, other features filled with median
X_plot_df = pd.DataFrame(np.tile(X_train_svr.median().values, (500, 1)), columns=X_train_svr.columns)
X_plot_df['MedInc'] = X_plot

# Predict using SVR
y_svr_plot = best_svr.predict(X_plot_df)

plt.plot(X_plot, y_svr_plot, color='red', label='SVR Prediction')
plt.fill_between(
    X_plot,
    y_svr_plot - best_svr.epsilon,
    y_svr_plot + best_svr.epsilon,
    color='orange', alpha=0.3, label='Epsilon Tube (Margin)'
)
plt.xlabel("Median Income (Scaled)")
plt.ylabel("Median House Value")
plt.title("SVR Epsilon Tube – Maximum Margin Concept (Subsampled)")
plt.legend()
plt.tight_layout()
plt.savefig("images/SVR-Episilon Tube_Maximum_margin_concept.png")
plt.close()
#plt.show()
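
# %% [markdown]
# The shaded band in the plot corresponds to SVR's epsilon-insensitive loss:
# residuals smaller than epsilon cost nothing, and larger residuals are
# penalized only by the amount that exceeds epsilon. A minimal NumPy sketch of
# that loss follows; the function name and the sample numbers are illustrative
# and not part of scikit-learn.

```python
import numpy as np

def epsilon_insensitive_loss(y_true, y_pred, epsilon=0.1):
    """Per-sample loss max(0, |y_true - y_pred| - epsilon)."""
    residual = np.abs(np.asarray(y_true) - np.asarray(y_pred))
    return np.maximum(0.0, residual - epsilon)

y_true = np.array([1.0, 1.0, 1.0])
y_pred = np.array([1.05, 1.30, 0.50])   # inside, outside, far outside the tube

loss = epsilon_insensitive_loss(y_true, y_pred, epsilon=0.1)
inside_tube = loss == 0.0
print(loss)          # approximately [0.0, 0.2, 0.4]
print(inside_tube)   # only the first point lies inside the tube
```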


# %% [markdown]
# ## Model Comparison & Dynamic Summary (Professional)

# %% [code]
import matplotlib.colors as mcolors

print("\n📊 Generating full model comparison with professional visuals...")

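# Recalculate predictions for key models on the full test set
# (the tuned SVR was trained on the 15% subsample but is scored on all test rows here)
y_pred_best_svr_full = best_svr.predict(X_test_capped)
y_pred_best_rf = best_rf.predict(X_test_capped)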

def score_row(name, y_pred):
    """Return [name, MSE, MAE, R2] for predictions on the full test set."""
    return [name,
            mean_squared_error(y_test, y_pred),
            mean_absolute_error(y_test, y_pred),
            r2_score(y_test, y_pred)]

# Prepare comparison data (exclude SGD initially).
# The untuned Random Forest from `models` is replaced by the tuned one.
comparison_data_main = [
    score_row(name, models[name].predict(X_test_capped))
    for name in ["Linear Regression", "Ridge Regression", "Decision Tree",
                 "Gradient Boosting", "KNN Regressor"]
]
comparison_data_main.append(score_row("Random Forest (Tuned)", y_pred_best_rf))
comparison_data_main.append(score_row("Tuned SVR", y_pred_best_svr_full))

# SGD Regressor (extreme scale)
comparison_sgd = [score_row("SGD Regressor", y_pred_sgd)]

# Convert to DataFrames
df_main = pd.DataFrame(comparison_data_main, columns=['Model','MSE','MAE','R2']).sort_values('R2', ascending=False).reset_index(drop=True)
df_sgd = pd.DataFrame(comparison_sgd, columns=['Model','MSE','MAE','R2'])

all_results_full_sorted = pd.concat([df_main, df_sgd]).sort_values('R2', ascending=False).reset_index(drop=True)

# ----------------------------
# 1) Display comparison table
# ----------------------------
print("📊 Full Model Comparison Table:")
display(all_results_full_sorted)

# Normalize error metrics for visualization (excluding SGD)
df_main_norm = df_main.copy()
df_main_norm['MSE_norm'] = df_main_norm['MSE'] / df_main_norm['MSE'].max()
df_main_norm['MAE_norm'] = df_main_norm['MAE'] / df_main_norm['MAE'].max()
df_main_norm['R2_norm'] = df_main_norm['R2']  # R² is plotted unscaled: it is at most 1 and can be negative for poor fits

# ----------------------------
# Professional custom palette
# ----------------------------
custom_palette = list(mcolors.TABLEAU_COLORS.values())[:len(df_main_norm)]

# ----------------------------
# Horizontal bar plots
# ----------------------------
plt.figure(figsize=(10,6))
sns.barplot(x='R2_norm', y='Model', hue='Model',data=df_main_norm, palette=custom_palette)
plt.title("Model Comparison – R² Score (Normalized)")
plt.xlabel("R² Score")
plt.ylabel("Model")
plt.xlim(0,1)
plt.tight_layout()
plt.savefig("images/Model Comparison - Normalized - R2 score.png")
plt.close()

plt.figure(figsize=(10,6))
sns.barplot(x='MAE_norm', y='Model',hue='Model', data=df_main_norm, palette=custom_palette)
plt.title("Model Comparison – MAE (Normalized)")
plt.xlabel("Normalized MAE")
plt.ylabel("Model")
plt.tight_layout()
plt.savefig("images/Model Comparison - Normalized - MAE.png")
plt.close()


plt.figure(figsize=(10,6))
sns.barplot(x='MSE_norm', y='Model',hue='Model', data=df_main_norm, palette=custom_palette)
plt.title("Model Comparison – MSE (Normalized)")
plt.xlabel("Normalized MSE")
plt.ylabel("Model")
plt.tight_layout()
plt.savefig("images/Model Comparison - Normalized - MSE.png")
plt.close()
#plt.show()

# ----------------------------
# Display SGD separately
# ----------------------------
print("\n⚠️ SGD Regressor is extremely unstable; metrics are on a very different scale:")
display(df_sgd)

# ----------------------------
# Dynamic Summary
# ----------------------------
best_r2_model = df_main.loc[df_main['R2'].idxmax()]['Model']
best_r2 = df_main['R2'].max()

best_mae_model = df_main.loc[df_main['MAE'].idxmin()]['Model']
best_mae = df_main['MAE'].min()

best_mse_model = df_main.loc[df_main['MSE'].idxmin()]['Model']
best_mse = df_main['MSE'].min()

print("📌 Dynamic Summary & Recommendation:")
print(f"- The top-performing model based on R² is **{best_r2_model}** (R² = {best_r2:.3f}).")
print(f"- Lowest MAE: **{best_mae_model}** (MAE = {best_mae:.3f}), indicating best average prediction accuracy.")
print(f"- Lowest MSE: **{best_mse_model}** (MSE = {best_mse:.3f}), indicating best overall error minimization.")
print("- Random Forest and Gradient Boosting are strong classical models; Tuned SVR provides a competitive non-linear alternative.")
print("- SGD Regressor is not recommended due to extreme instability and poor performance.")

#formatted_datetime = datetime.now().strftime("%Y-%m-%d %H:%M:%S")
#print("Execution completed at", formatted_datetime)

,MedInc,HouseAge,AveRooms,AveBedrms,Population,AveOccup,Latitude,Longitude,MedHouseValue
0,8.3252,41.0,6.984127,1.02381,322.0,2.555556,37.88,-122.23,4.526
1,8.3014,21.0,6.238137,0.97188,2401.0,2.109842,37.86,-122.22,3.585
2,7.2574,52.0,8.288136,1.073446,496.0,2.80226,37.85,-122.24,3.521
3,5.6431,52.0,5.817352,1.073059,558.0,2.547945,37.85,-122.25,3.413
4,3.8462,52.0,6.281853,1.081081,565.0,2.181467,37.85,-122.25,3.422


Numerical Features: ['MedInc', 'HouseAge', 'AveRooms', 'AveBedrms', 'Population', 'AveOccup', 'Latitude', 'Longitude']
Categorical Features: []
Baseline Model - MSE: 1.3762028164626938 MAE: 0.8740399224806202 R2: -0.050208628996205595


,Model,MSE,MAE,R2
3,Random Forest,0.271148,0.344458,0.793081
4,Gradient Boosting,0.298926,0.377162,0.771884
1,Ridge Regression,0.467507,0.50483,0.643236
0,Linear Regression,0.467509,0.50506,0.643235
2,Decision Tree,0.528135,0.477494,0.596969
5,KNN Regressor,0.674144,0.619152,0.485547


Fitting 3 folds for each of 6 candidates, totalling 18 fits

✅ Broad Randomized Search Complete
Best Parameters: {'n_estimators': 100, 'min_samples_split': 2, 'min_samples_leaf': 1, 'max_features': 'log2', 'max_depth': 20}
Best CV R²: 0.7985705307275951
Fitting 3 folds for each of 2 candidates, totalling 6 fits

✅ Narrow Grid Search Complete
Best Parameters: {'max_depth': 20, 'max_features': 'log2', 'min_samples_leaf': 1, 'min_samples_split': 2, 'n_estimators': 100}
Best CV R²: 0.7985705307275951

🏆 Final Tuned Random Forest Performance
---------------------------------------
R² Score (Test): 0.7923
MAE (Test): 0.3514
MSE (Test): 0.2722

Total Training Time: 1.07 min

⚙️ Training SGD Regressor...
✅ SGD R²: -72906.2561
MAE: 245.0493
MSE: 95538.3229

⚙️ Training baseline SVR (RBF kernel) on 15% data for efficiency...
✅ SVR (Baseline, subsampled) R²: 0.3821
MAE: 0.6584
MSE: 0.7549

⚙️ RandomizedSearchCV: Optimizing SVR Hyperparameters...
Fitting 3 folds for each of 8 candidates, totalling

,Model,MSE,MAE,R2
0,Random Forest (Tuned),0.272186,0.351409,0.79229
1,Gradient Boosting,0.298926,0.377162,0.771884
2,Ridge Regression,0.467507,0.50483,0.643236
3,Linear Regression,0.467509,0.50506,0.643235
4,Tuned SVR,0.484519,0.50064,0.630254
5,Decision Tree,0.528135,0.477494,0.596969
6,KNN Regressor,0.674144,0.619152,0.485547
7,SGD Regressor,95538.322926,245.0493,-72906.256065



⚠️ SGD Regressor is extremely unstable; metrics are on a very different scale:


,Model,MSE,MAE,R2
0,SGD Regressor,95538.322926,245.0493,-72906.256065


📌 Dynamic Summary & Recommendation:
- The top-performing model based on R² is **Random Forest (Tuned)** (R² = 0.792).
- Lowest MAE: **Random Forest (Tuned)** (MAE = 0.351), indicating best average prediction accuracy.
- Lowest MSE: **Random Forest (Tuned)** (MSE = 0.272), indicating best overall error minimization.
- Random Forest and Gradient Boosting are strong classical models; Tuned SVR provides a competitive non-linear alternative.
- SGD Regressor is not recommended due to extreme instability and poor performance.
