# Predicting House Prices — Regression Modeling & Feature Selection

This notebook builds and compares several regression models to predict **median house prices (MEDV)** using the IBM Skills Network version of the Boston Housing dataset.

We focus on:

- Linear Regression
- Ridge Regression (L2 regularization)
- Lasso Regression (L1 regularization)
- Random Forest Regressor (non-linear model)
- Feature importance analysis
- Model selection using evaluation metrics (RMSE, MAE, R²)
- Saving the final model as a deployable artifact

This notebook is designed as a professional portfolio piece for **Model Developer / ML Engineer** roles.
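
As a reference for the three evaluation metrics named in the list above, here is a minimal pure-Python sketch of what RMSE, MAE and R² compute. The function name `regression_metrics` is illustrative and not part of the notebook; the modeling cells rely on the scikit-learn implementations instead.

```python
import math

def regression_metrics(y_true, y_pred):
    """Return RMSE, MAE and R² for two equal-length sequences of numbers."""
    n = len(y_true)
    errors = [t - p for t, p in zip(y_true, y_pred)]

    mse = sum(e * e for e in errors) / n
    mae = sum(abs(e) for e in errors) / n

    # R² compares the model's squared error with that of always predicting the mean.
    mean_true = sum(y_true) / n
    ss_res = sum(e * e for e in errors)
    ss_tot = sum((t - mean_true) ** 2 for t in y_true)

    return {"RMSE": math.sqrt(mse), "MAE": mae, "R2": 1 - ss_res / ss_tot}
```

RMSE and MAE are in the same units as the target (thousands of dollars for MEDV), while R² is unitless. RMSE penalizes large errors more heavily than MAE because errors are squared before averaging.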

In [None]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

from sklearn.model_selection import train_test_split, GridSearchCV
from sklearn.linear_model import LinearRegression, Ridge, Lasso
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_squared_error, mean_absolute_error, r2_score

import joblib

plt.style.use("seaborn-v0_8")
plt.rcParams["figure.figsize"] = (10, 6)

In [None]:
url = "https://cf-courses-data.s3.us.cloud-object-storage.appdomain.cloud/IBMDeveloperSkillsNetwork-ST0151EN-SkillsNetwork/labs/boston_housing.csv"

df = pd.read_csv(url)
df.head()

## Dataset Description

This is a tabular housing dataset that closely follows the classic Boston Housing dataset.

The target variable is:

- **MEDV**: median value of homes, in thousands of dollars.

It contains 506 samples and 14 columns, one of which is the target **MEDV**, so at most 13 columns are available as predictors.
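
A quick way to confirm the sample and feature counts is to compute them from the loaded frame. The sketch below assumes that an index column exported with the CSV, if there is one, carries a name starting with `Unnamed`; `summarize_frame` is a hypothetical helper, not part of the original notebook.

```python
import pandas as pd

def summarize_frame(df, target="MEDV"):
    """Report basic shape facts, ignoring exported index columns such as 'Unnamed: 0'."""
    # Assumption: an index column written by DataFrame.to_csv shows up with an "Unnamed" name.
    index_like = [c for c in df.columns if str(c).startswith("Unnamed")]
    features = [c for c in df.columns if c != target and c not in index_like]
    return {
        "n_rows": len(df),
        "n_features": len(features),
        "index_like_columns": index_like,
        "missing_values": int(df[features + [target]].isna().sum().sum()),
    }
```

Calling `summarize_frame(df)` right after loading the CSV shows how many columns actually remain as predictors and whether any values are missing.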

In [None]:
df.info()
df.describe()

In [None]:
plt.figure(figsize=(12, 10))
sns.heatmap(df.corr(), cmap="coolwarm", annot=False)
plt.title("Correlation Heatmap")
plt.show()

In [None]:
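# Separate features from the target.
X = df.drop(columns=["MEDV"])
# If the CSV carries an exported index column (e.g. "Unnamed: 0"), drop it so a
# row counter is not used as a feature. This is a no-op when no such column exists.
X = X.loc[:, ~X.columns.str.startswith("Unnamed")]
y = df["MEDV"]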

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

X_train.shape, X_test.shape

In [None]:
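def evaluate_model(name, model):
    """Score a fitted model on the train/test split defined above."""
    y_pred_train = model.predict(X_train)
    y_pred_test = model.predict(X_test)

    return {
        "Model": name,
        "Train R2": r2_score(y_train, y_pred_train),
        "Test R2": r2_score(y_test, y_pred_test),
        # `squared=False` was deprecated in scikit-learn 1.4 and later removed;
        # taking the square root of the MSE works on every version.
        "Test RMSE": np.sqrt(mean_squared_error(y_test, y_pred_test)),
        "Test MAE": mean_absolute_error(y_test, y_pred_test),
    }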

In [None]:
results = []

lin = LinearRegression()
lin.fit(X_train, y_train)

results.append(evaluate_model("Linear Regression", lin))

pd.DataFrame(results)

In [None]:
ridge_params = {"alpha": [0.1, 1, 10, 100]}

ridge = Ridge()
ridge_cv = GridSearchCV(
    ridge, ridge_params, cv=5,
    scoring="neg_root_mean_squared_error"
)

ridge_cv.fit(X_train, y_train)

results.append(evaluate_model("Ridge (best)", ridge_cv.best_estimator_))

pd.DataFrame(results)

In [None]:
lasso_params = {"alpha": [0.001, 0.01, 0.1, 1]}

lasso = Lasso(max_iter=10000)
lasso_cv = GridSearchCV(
    lasso, lasso_params, cv=5,
    scoring="neg_root_mean_squared_error"
)

lasso_cv.fit(X_train, y_train)

results.append(evaluate_model("Lasso (best)", lasso_cv.best_estimator_))

pd.DataFrame(results)
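
Ridge and Lasso penalize coefficient magnitude, so the penalty a feature receives depends on the scale it is measured on. One common remedy, sketched here as an assumption rather than as part of the original workflow, is to standardize features inside a `Pipeline` so that the scaler is fit on the training folds only. The helper name `scaled_alpha_search` is hypothetical.

```python
from sklearn.linear_model import Lasso, Ridge
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

def scaled_alpha_search(estimator, alphas, cv=5):
    """Grid-search `alpha` for a linear model that sees standardized features."""
    pipe = Pipeline([
        ("scale", StandardScaler()),  # refit on each training fold during the search
        ("model", estimator),
    ])
    return GridSearchCV(
        pipe,
        {"model__alpha": list(alphas)},  # "model__" routes the parameter to the estimator step
        cv=cv,
        scoring="neg_root_mean_squared_error",
    )
```

For example, `scaled_alpha_search(Ridge(), [0.1, 1, 10, 100]).fit(X_train, y_train)` could be compared against the unscaled search above by passing its `best_estimator_` to `evaluate_model`. The same call with `Lasso(max_iter=10000)` covers the L1 case.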

In [None]:
rf = RandomForestRegressor(
    n_estimators=300,
    max_depth=None,
    random_state=42,
    n_jobs=-1
)

rf.fit(X_train, y_train)

results.append(evaluate_model("Random Forest", rf))

results_df = pd.DataFrame(results)
results_df
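
Rather than reading the winner off the table by eye, the final model could be picked programmatically. This sketch assumes a hypothetical `fitted_models` dictionary that maps each name in `results` to its fitted estimator; the notebook does not build one, and `pick_best` is an illustrative name.

```python
import pandas as pd

def pick_best(results, fitted_models, metric="Test RMSE"):
    """Return (name, model) for the row with the lowest value of `metric`."""
    table = pd.DataFrame(results).sort_values(metric)
    best_name = table.iloc[0]["Model"]
    return best_name, fitted_models[best_name]
```

A call might look like `pick_best(results, {"Linear Regression": lin, "Ridge (best)": ridge_cv.best_estimator_, "Lasso (best)": lasso_cv.best_estimator_, "Random Forest": rf})`. Because the choice is made on the test split, the winning model's test score is likely to be somewhat optimistic.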

In [None]:
importances = pd.Series(rf.feature_importances_, index=X.columns)
importances_sorted = importances.sort_values(ascending=False)

plt.figure(figsize=(10, 6))
sns.barplot(x=importances_sorted.values[:10], y=importances_sorted.index[:10])
plt.title("Top 10 Feature Importances — Random Forest")
plt.xlabel("Importance")
plt.ylabel("Feature")
plt.show()
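
The impurity-based values in `feature_importances_` are derived from the training data and can favor features with many distinct values. As a cross-check, permutation importance on the held-out split measures how much the score drops when a single column is shuffled. A hedged sketch follows; `permutation_ranking` is an illustrative name, not part of the notebook.

```python
import pandas as pd
from sklearn.inspection import permutation_importance

def permutation_ranking(model, X_eval, y_eval, n_repeats=10, random_state=42):
    """Rank features by the mean drop in R² when each column is shuffled."""
    result = permutation_importance(
        model, X_eval, y_eval,
        n_repeats=n_repeats, random_state=random_state, scoring="r2",
    )
    ranking = pd.Series(result.importances_mean, index=X_eval.columns)
    return ranking.sort_values(ascending=False)
```

Running `permutation_ranking(rf, X_test, y_test)` and comparing the top entries with the bar chart above indicates whether both methods agree on the leading features.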

In [None]:
lasso_best = lasso_cv.best_estimator_
coef = pd.Series(lasso_best.coef_, index=X.columns)

non_zero_coef = coef[coef != 0].sort_values(ascending=False)
non_zero_coef
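
Lasso's L1 penalty can shrink coefficients to exactly zero, which is what makes it usable for feature selection. The sketch below separates the features a fitted Lasso kept from those it dropped; `split_by_lasso` is a hypothetical helper.

```python
import pandas as pd

def split_by_lasso(coef, tol=0.0):
    """Split a coefficient Series into features Lasso kept and features it zeroed out."""
    # Kept features are ordered by absolute coefficient size, largest first.
    kept = coef[coef.abs() > tol].sort_values(key=abs, ascending=False)
    dropped = coef[coef.abs() <= tol].index.tolist()
    return kept, dropped
```

With the `coef` Series from the cell above, `kept, dropped = split_by_lasso(coef)` lists the eliminated features explicitly. Coefficient magnitudes are only comparable across features when the inputs share a common scale.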

In [None]:
# Persist the selected model to disk for later reuse.
best_model = rf
joblib.dump(best_model, "house_price_model.pkl")
print("Model saved to house_price_model.pkl")

# Conclusion

- Several regression models were built to predict **MEDV**.
- The best-performing model on the test set, by RMSE and R², was the **Random Forest Regressor**.
- Ridge and Lasso add regularization to the linear baseline; Lasso can also perform feature selection by shrinking coefficients to exactly zero.
- Random Forest feature importances ranked **LSTAT** and **RM** among the strongest predictors.
- The final model was saved as a `.pkl` file.
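
Before the saved `.pkl` file is used elsewhere, it is worth confirming that it reloads and reproduces the in-memory model's predictions. This is a minimal sketch of such a check; `load_and_check` is a hypothetical helper, not part of the notebook.

```python
import joblib
import numpy as np

def load_and_check(path, reference_model, X_sample):
    """Reload a saved model and confirm it reproduces the in-memory predictions."""
    restored = joblib.load(path)
    # allclose rather than exact equality, to tolerate floating-point summation order.
    same = bool(np.allclose(restored.predict(X_sample), reference_model.predict(X_sample)))
    return restored, same
```

For this notebook the call would be `load_and_check("house_price_model.pkl", rf, X_test)`. Files written by `joblib.dump` are pickle-based, so they should only be loaded from trusted sources, ideally with the same scikit-learn version that produced them.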

This project demonstrates skills in:
- Regression modeling
- Hyperparameter tuning
- Feature importance analysis
- Model artifact creation
- ML model development workflows