
# ml01 — California Housing Baseline (Brandon J)

**Purpose.** Establish a transparent, defensible baseline for predicting California median house values using linear regression. The goal is *reproducibility, interpretability, and evaluation discipline*, not raw leaderboard performance.

**Design choices (why this is defensible):**
- Use the built‑in California Housing dataset (clean, standard benchmark).
- Small, interpretable feature set (**MedInc**, **AveRooms**) so each coefficient is easy to read and sanity-check.
- Constant baseline (median) to contextualize metrics; without it, an R² value has no reference point.
- Stratified split by **target quantiles** (deciles) so the test set's target distribution closely mirrors the training set's.
- Metrics: R², MAE, RMSE, plus residual diagnostics and a coefficient table.


## 0. Imports (dependencies installed via `uv sync`)

In [1]:
from __future__ import annotations

from typing import cast, List
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

from sklearn.datasets import fetch_california_housing
from sklearn.dummy import DummyRegressor
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_absolute_error, r2_score, root_mean_squared_error
from sklearn.model_selection import StratifiedShuffleSplit

## 1. Load Dataset and Verify Integrity

In [None]:
# Load as DataFrame/Series for immediate analysis
X_df, y_ser = fetch_california_housing(as_frame=True, return_X_y=True)
X: pd.DataFrame = cast(pd.DataFrame, X_df)
y: pd.Series = cast(pd.Series, y_ser)

# Combine for quick inspection / plotting
df: pd.DataFrame = X.copy()
df["MedHouseVal"] = y

# Sanity checks that matter in practice
assert not df.isnull().any().any(), "Dataset unexpectedly contains missing values."
assert df.select_dtypes("number").shape[1] == df.shape[1], "Non-numeric columns found."

# Distribution check confirms we loaded what we think we loaded
df.hist(bins=30, figsize=(12, 8))
plt.suptitle("Feature Distributions — California Housing", fontsize=12)
plt.tight_layout()
plt.show()

## 2. Stratified Train/Test Split (by Target Deciles)

In [None]:
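# Stratify on target deciles so train and test share a similar target
# distribution; this reduces split-to-split variance in the metrics.
# duplicates="drop" merges bins if two quantile edges coincide.
qbins = pd.qcut(df["MedHouseVal"], q=10, duplicates="drop")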
splitter = StratifiedShuffleSplit(n_splits=1, test_size=0.2, random_state=42)
train_idx, test_idx = next(splitter.split(df, qbins))

train_df, test_df = df.iloc[train_idx].copy(), df.iloc[test_idx].copy()

features: List[str] = ["MedInc", "AveRooms"]
target: str = "MedHouseVal"

X_train, y_train = train_df[features], train_df[target]
X_test, y_test = test_df[features], test_df[target]

X_train.shape, X_test.shape
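
One way to check that a stratified split did what it was meant to do is to compare the share of each target bin in the two partitions. The sketch below uses synthetic, right-skewed data rather than the housing frame, so `y_demo`, `bins_demo`, and `max_gap` are illustrative names and not part of the notebook.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import StratifiedShuffleSplit

# Synthetic right-skewed target (an assumption for illustration only)
rng = np.random.default_rng(0)
y_demo = pd.Series(rng.lognormal(mean=0.5, sigma=0.5, size=2_000))

# Decile labels 0..9, then a single stratified 80/20 split on those labels
bins_demo = pd.qcut(y_demo, q=10, labels=False)
splitter_demo = StratifiedShuffleSplit(n_splits=1, test_size=0.2, random_state=42)
train_i, test_i = next(splitter_demo.split(np.zeros((len(y_demo), 1)), bins_demo))

# Share of each decile in train vs. test; the largest gap should be tiny
train_share = bins_demo.iloc[train_i].value_counts(normalize=True).sort_index()
test_share = bins_demo.iloc[test_i].value_counts(normalize=True).sort_index()
max_gap = float((train_share - test_share).abs().max())
print(f"largest per-decile share gap: {max_gap:.4f}")
```

The same comparison could be run on `qbins` with `train_idx` and `test_idx` from the cell above.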

## 3. Baseline (Median) vs. Linear Regression

In [None]:
# Baseline: "do nothing smart" model — essential for context
baseline = DummyRegressor(strategy="median")
baseline.fit(X_train, y_train)
y_base = baseline.predict(X_test)

# Linear model: simple, interpretable, and a good pedagogical baseline
model = LinearRegression()
model.fit(X_train, y_train)
y_pred = model.predict(X_test)
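
To make the role of the constant baseline concrete, the sketch below checks two textbook properties on a toy target: predicting the mean gives an R² of 0 on the data the mean was computed from, and the median is the constant with the lower MAE. The array `y_toy` and the helper `r2` are made up for illustration. On a held-out set the numbers shift slightly, because the constant comes from the training data.

```python
import numpy as np


def r2(y_true: np.ndarray, y_hat: np.ndarray) -> float:
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    ss_res = np.sum((y_true - y_hat) ** 2)
    ss_tot = np.sum((y_true - y_true.mean()) ** 2)
    return float(1.0 - ss_res / ss_tot)


# Toy target with one large value, so mean and median differ noticeably
y_toy = np.array([1.0, 2.0, 2.0, 3.0, 10.0])
mean_pred = np.full_like(y_toy, y_toy.mean())
median_pred = np.full_like(y_toy, np.median(y_toy))

r2_mean = r2(y_toy, mean_pred)      # zero by definition of R²
r2_median = r2(y_toy, median_pred)  # negative: the median is not the MSE-optimal constant
mae_mean = float(np.mean(np.abs(y_toy - mean_pred)))
mae_median = float(np.mean(np.abs(y_toy - median_pred)))

print(f"R²  mean={r2_mean:.3f}  median={r2_median:.3f}")
print(f"MAE mean={mae_mean:.3f}  median={mae_median:.3f}")
```

A slightly negative R² for the median baseline is therefore expected and is not a sign of a bug.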

## 4. Metric Reporting (R², MAE, RMSE)

In [None]:
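def report_metrics(name: str, y_true: pd.Series, y_hat: np.ndarray) -> dict[str, float]:
    """Print and return R², MAE, and RMSE for one set of predictions."""
    r2 = r2_score(y_true, y_hat)
    mae = mean_absolute_error(y_true, y_hat)
    # root_mean_squared_error requires scikit-learn >= 1.4
    rmse = root_mean_squared_error(y_true, y_hat)
    print(f"{name:>10} | R²={r2:0.3f} | MAE={mae:0.3f} | RMSE={rmse:0.3f}")
    return {"r2": r2, "mae": mae, "rmse": rmse}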


print("Model Evaluation (Lower MAE/RMSE is better; Higher R² is better)")
m_base = report_metrics("Baseline", y_test, y_base)
m_lin = report_metrics("LinearRG", y_test, y_pred)

if m_base["mae"] > 0:
    mae_impr = 100 * (m_base["mae"] - m_lin["mae"]) / m_base["mae"]
    print(f"→ MAE improvement vs baseline: {mae_impr:0.1f}%")
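
For readers who want to see what the three metrics compute, here is a small NumPy-only sketch on made-up numbers. The arrays `y_true_demo` and `y_hat_demo` are invented for illustration; the formulas are the standard definitions of MAE, RMSE, and R².

```python
import numpy as np

# Invented values, chosen so the arithmetic is easy to follow by hand
y_true_demo = np.array([1.0, 2.0, 3.0, 4.0])
y_hat_demo = np.array([1.5, 2.0, 2.5, 5.0])

err = y_true_demo - y_hat_demo                                # [-0.5, 0.0, 0.5, -1.0]
mae_demo = float(np.mean(np.abs(err)))                        # mean absolute error
rmse_demo = float(np.sqrt(np.mean(err ** 2)))                 # root of mean squared error
ss_res = float(np.sum(err ** 2))                              # residual sum of squares
ss_tot = float(np.sum((y_true_demo - y_true_demo.mean()) ** 2))  # total sum of squares
r2_demo = 1.0 - ss_res / ss_tot

print(f"MAE={mae_demo:.3f} RMSE={rmse_demo:.3f} R²={r2_demo:.3f}")
# prints: MAE=0.500 RMSE=0.612 R²=0.700
```

RMSE is larger than MAE here because squaring weights the single error of 1.0 more heavily than the two errors of 0.5.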

## 5. Residual Diagnostics

In [None]:
residuals = y_test - y_pred

plt.figure(figsize=(5, 4))
plt.scatter(y_pred, residuals, s=10, alpha=0.6)
plt.axhline(0, color="red", linestyle="--", linewidth=1)
plt.xlabel("Predicted Values")
plt.ylabel("Residuals (y - ŷ)")
plt.title("Residuals vs Predicted Values")
plt.tight_layout()
plt.show()

plt.figure(figsize=(5, 4))
plt.hist(residuals, bins=30)
plt.title("Residual Distribution")
plt.xlabel("Residual")
plt.ylabel("Frequency")
plt.tight_layout()
plt.show()
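
The plots can be complemented with a few numbers. For ordinary least squares with an intercept, the training residuals have zero mean and are uncorrelated with the fitted values by construction, so a visible pattern in the residual plot suggests a misspecified model rather than a fitting problem. The sketch below illustrates this on synthetic data with a deliberately curved relationship; the data-generating process and every `_demo` name are assumptions for illustration.

```python
import numpy as np

# Synthetic data: the true relationship is quadratic, but we fit a straight line
rng = np.random.default_rng(0)
x_demo = rng.uniform(0.0, 10.0, size=500)
y_demo = 0.5 * x_demo**2 + rng.normal(0.0, 1.0, size=500)

# Least-squares line with an intercept column
A = np.column_stack([np.ones_like(x_demo), x_demo])
beta, *_ = np.linalg.lstsq(A, y_demo, rcond=None)
fitted = A @ beta
resid = y_demo - fitted

# Both are ~0 on the training data by construction of least squares
mean_resid = float(resid.mean())
corr_fitted = float(np.corrcoef(fitted, resid)[0, 1])

# A strong correlation with a curvature term exposes the missed nonlinearity
corr_curvature = float(np.corrcoef((x_demo - x_demo.mean()) ** 2, resid)[0, 1])

print(f"mean residual        : {mean_resid:.2e}")
print(f"corr(fitted, resid)  : {corr_fitted:.2e}")
print(f"corr(curvature, resid): {corr_curvature:.3f}")
```

On a held-out set such as `y_test`, the mean residual need not be exactly zero, so a small offset there is not alarming by itself.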

## 6. Coefficient Table (Interpretability)

In [None]:
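# np is already imported in section 0.
# The intercept is a single scalar; assign() repeats it on every row for convenience.
# Rows are ordered by absolute coefficient size (largest first).
coef_df = (
    pd.DataFrame({"Feature": features, "Coefficient": model.coef_})
    .assign(Intercept=model.intercept_)
    .sort_values("Coefficient", key=np.abs, ascending=False)
    .reset_index(drop=True)
)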
coef_df
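
Raw coefficients are expressed in target units per unit of feature, so their magnitudes are not directly comparable when features sit on different scales. One common remedy is to multiply each coefficient by its feature's standard deviation, which gives the change in the target per one-standard-deviation change in that feature. The sketch below uses synthetic data; `X_demo`, the true weights, and the feature scales are made up for illustration.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Two synthetic features on very different scales (assumed for illustration)
rng = np.random.default_rng(0)
n = 1_000
x_small = rng.normal(0.0, 1.0, size=n)    # standard deviation near 1
x_large = rng.normal(0.0, 100.0, size=n)  # standard deviation near 100
X_demo = np.column_stack([x_small, x_large])
y_demo = 2.0 * x_small + 0.05 * x_large + rng.normal(0.0, 0.5, size=n)

model_demo = LinearRegression().fit(X_demo, y_demo)

raw = model_demo.coef_               # target change per raw unit of each feature
per_sd = raw * X_demo.std(axis=0)    # target change per one standard deviation

print("raw coefficients   :", np.round(raw, 3))
print("per-SD coefficients:", np.round(per_sd, 3))
```

In this toy setup the raw coefficients rank `x_small` first, while the per-SD view ranks `x_large` first, which is why sorting by raw magnitude should be read with the feature units in mind.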

## 7. Visual Validation — Actual vs Predicted (MedInc slice)

In [None]:
plt.figure(figsize=(5, 4))
plt.scatter(test_df["MedInc"], y_test, s=8, alpha=0.4, label="Actual")
plt.scatter(test_df["MedInc"], y_pred, s=8, alpha=0.7, label="Predicted")
plt.xlabel("MedInc")
plt.ylabel("MedHouseVal")
plt.title("MedInc vs MedHouseVal — Actual vs Predicted")
plt.legend()
plt.tight_layout()
plt.show()