# AML Assignment 1 — Full Solution Notebook

*Generated: 2025-11-12 20:40:03*

This notebook provides a compact but complete solution for a standard first assignment in Applied Machine Learning (AML):

1. **Linear Regression (closed‑form / MLE)** on the California Housing dataset (with an offline synthetic fallback if the dataset cannot be fetched).
2. **Gradient Descent from scratch** for linear regression with Mean Squared Error, including a convergence plot.
3. **Regularization**: Ridge and Lasso with standardization and simple cross‑validation.
4. **Evaluation**: RMSE/MAE/$R^2$, residual analysis, and feature importance visualization.

**Note on data:** If the California Housing dataset is not cached locally and the environment has no internet, the loader below transparently falls back to a synthetic regression dataset with the same number of samples and features. The rest of the pipeline remains identical.


## 0. Setup
We import the required libraries. All plots use **matplotlib** (one chart per figure, no custom colors) per the constraints.

In [None]:
import numpy as np
import matplotlib.pyplot as plt
from dataclasses import dataclass
from typing import Tuple, Dict

from sklearn.model_selection import train_test_split, KFold
from sklearn.preprocessing import StandardScaler
from sklearn.metrics import mean_squared_error, mean_absolute_error, r2_score
from sklearn.linear_model import Ridge, Lasso
from sklearn.datasets import fetch_california_housing, make_regression

np.random.seed(42)
plt.rcParams['figure.figsize'] = (7, 4)


## 1. Data Loading (California Housing with offline fallback)
If the California Housing dataset cannot be fetched, we create a synthetic dataset of the same shape (20,640 samples, 8 features).

In [None]:
def load_california_or_synthetic(random_state: int = 42) -> Tuple[np.ndarray, np.ndarray, Dict]:
    try:
        data = fetch_california_housing()
        X = data.data
        y = data.target
        meta = {"source": "CaliforniaHousing", "feature_names": list(data.feature_names)}
        return X, y, meta
    except Exception:  # e.g. no cached copy and no network access
        # Synthetic fallback: same n_samples and n_features
        X, y = make_regression(n_samples=20640, n_features=8, noise=12.0, random_state=random_state)
        # Rescale target roughly to typical CaliforniaHousing scale
        y = (y - y.min()) / (y.max() - y.min()) * 5.0
        meta = {"source": "SyntheticFallback", "feature_names": [f"f{i}" for i in range(8)]}
        return X, y, meta

X, y, meta = load_california_or_synthetic()
meta
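
The min-max rescaling in the fallback branch can be checked in isolation. The sketch below is illustrative only: it uses a smaller sample than the loader (1,000 rows instead of 20,640) and the names `X_syn`, `y_syn`, and `y_scaled` are hypothetical. It shows that the rescaled target spans the interval from 0 to 5.

```python
import numpy as np
from sklearn.datasets import make_regression

# Smaller sample than the notebook's 20,640 rows, purely to keep the check fast.
X_syn, y_syn = make_regression(n_samples=1000, n_features=8, noise=12.0, random_state=42)

# Same min-max rescaling as in the fallback branch of the loader.
y_scaled = (y_syn - y_syn.min()) / (y_syn.max() - y_syn.min()) * 5.0

print(X_syn.shape, float(y_scaled.min()), float(y_scaled.max()))
```

Because the rescaling is a monotone affine map, it changes the target's units but not the ordering of samples or the linear structure of the problem.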


## 2. Train/Validation/Test Split
We use a 60/20/20 split. Targets are not standardized; features are standardized where needed (e.g., for regularization).

In [None]:
X_train, X_temp, y_train, y_temp = train_test_split(X, y, test_size=0.4, random_state=42)
X_val, X_test, y_val, y_test = train_test_split(X_temp, y_temp, test_size=0.5, random_state=42)
X_train.shape, X_val.shape, X_test.shape
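
The two-stage split can be sanity-checked on row indices alone. This is a minimal sketch, assuming a hypothetical row count `n = 1000` chosen only for illustration: holding out 40% and then halving the held-out part should give roughly 60/20/20, with no row appearing in more than one part.

```python
import numpy as np
from sklearn.model_selection import train_test_split

n = 1000  # hypothetical row count, for illustration only
idx = np.arange(n)

# Stage 1: hold out 40% of the rows. Stage 2: split the held-out rows in half.
train_idx, temp_idx = train_test_split(idx, test_size=0.4, random_state=42)
val_idx, test_idx = train_test_split(temp_idx, test_size=0.5, random_state=42)

fractions = [len(part) / n for part in (train_idx, val_idx, test_idx)]
n_unique = len(set(train_idx) | set(val_idx) | set(test_idx))
print(fractions, n_unique)
```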


## 3. Linear Regression (Closed‑Form / MLE)
For the Gaussian noise model, maximizing the likelihood is equivalent to minimizing the MSE. The MLE solution is:
$$\hat{\theta} = (X^\top X)^{-1} X^\top y,$$
where $X$ includes a column of ones for the intercept. We implement this explicitly below.
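
The equivalence can be sketched in a few lines. The symbols $x_i$ (the $i$-th row of $X$), $\varepsilon_i$, $\sigma^2$, and $n$ are introduced here for illustration, and the noise is assumed to be i.i.d. Suppose $y_i = x_i^\top \theta + \varepsilon_i$ with $\varepsilon_i \sim \mathcal{N}(0, \sigma^2)$. The log-likelihood is then
$$\log p(y \mid X, \theta) = -\frac{n}{2}\log(2\pi\sigma^2) - \frac{1}{2\sigma^2}\sum_{i=1}^{n}\left(y_i - x_i^\top\theta\right)^2.$$
The first term does not depend on $\theta$, so maximizing the log-likelihood over $\theta$ is the same as minimizing the sum of squared errors, which is $n$ times the MSE. Setting the gradient of that sum to zero gives the normal equations
$$X^\top X\,\hat{\theta} = X^\top y,$$
which reduce to the formula above whenever $X^\top X$ is invertible.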

In [None]:
def add_intercept(X: np.ndarray) -> np.ndarray:
    return np.hstack([np.ones((X.shape[0], 1)), X])

def closed_form_linreg(X: np.ndarray, y: np.ndarray) -> np.ndarray:
    # Add intercept
    Xb = add_intercept(X)
    # pinv instead of inv: still yields the minimum-norm least-squares solution if Xb.T @ Xb is singular
    theta = np.linalg.pinv(Xb.T @ Xb) @ (Xb.T @ y)
    return theta  # shape (d+1,)

theta_hat = closed_form_linreg(X_train, y_train)
theta_hat[:5], theta_hat.shape
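
As a cross-check on the normal-equation route, the same coefficients can be obtained from a least-squares solver that works on the design matrix directly. The sketch below uses hypothetical synthetic data (`X_demo`, `theta_true`) rather than the notebook's split, and compares the two solutions.

```python
import numpy as np

rng = np.random.default_rng(0)
X_demo = rng.normal(size=(200, 3))              # hypothetical features
theta_true = np.array([1.5, -2.0, 0.5, 3.0])    # intercept first, then weights
Xb_demo = np.hstack([np.ones((200, 1)), X_demo])
y_demo = Xb_demo @ theta_true + rng.normal(scale=0.1, size=200)

# Normal equations, mirroring closed_form_linreg.
theta_ne = np.linalg.pinv(Xb_demo.T @ Xb_demo) @ (Xb_demo.T @ y_demo)

# Least-squares solver applied to Xb_demo itself, without forming Xb.T @ Xb.
theta_ls, *_ = np.linalg.lstsq(Xb_demo, y_demo, rcond=None)

print(np.max(np.abs(theta_ne - theta_ls)))
```

Forming $X^\top X$ squares the condition number of the problem, so on poorly conditioned data a solver such as `np.linalg.lstsq` may be the more robust choice; on well-conditioned data the two should agree to numerical precision.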


### Evaluation Helper

In [None]:
def predict_with_theta(theta: np.ndarray, X: np.ndarray) -> np.ndarray:
    Xb = add_intercept(X)
    return Xb @ theta

def eval_regression(y_true, y_pred) -> dict:
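    # np.sqrt of the MSE works on every scikit-learn version; the squared=False argument was removed in 1.6
    rmse = float(np.sqrt(mean_squared_error(y_true, y_pred)))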
    mae = mean_absolute_error(y_true, y_pred)
    r2 = r2_score(y_true, y_pred)
    return {"RMSE": rmse, "MAE": mae, "R2": r2}

pred_val_cf = predict_with_theta(theta_hat, X_val)
pred_test_cf = predict_with_theta(theta_hat, X_test)
metrics_cf_val = eval_regression(y_val, pred_val_cf)
metrics_cf_test = eval_regression(y_test, pred_test_cf)
metrics_cf_val, metrics_cf_test
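
For reference, the three metrics can be written out from their definitions in plain NumPy. The function name `regression_metrics` and the toy inputs are hypothetical; this is a sketch of the definitions, not a replacement for the scikit-learn calls above.

```python
import numpy as np

def regression_metrics(y_true, y_pred):
    """RMSE, MAE and R^2 computed directly from their definitions."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    err = y_pred - y_true
    rmse = float(np.sqrt(np.mean(err ** 2)))
    mae = float(np.mean(np.abs(err)))
    ss_res = float(np.sum(err ** 2))                        # residual sum of squares
    ss_tot = float(np.sum((y_true - y_true.mean()) ** 2))   # total sum of squares
    return {"RMSE": rmse, "MAE": mae, "R2": 1.0 - ss_res / ss_tot}

demo = regression_metrics([3.0, 5.0, 7.0], [2.0, 5.0, 9.0])
print(demo)
```

For this toy input the errors are -1, 0, and 2, so the MAE is 1, the RMSE is $\sqrt{5/3}$, and $R^2 = 1 - 5/8 = 0.375$.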


### Residual Plot (Closed‑Form)

In [None]:
residuals = y_val - pred_val_cf
plt.figure()
plt.scatter(pred_val_cf, residuals, s=6)
plt.axhline(0)
plt.xlabel('Predicted')
plt.ylabel('Residual')
plt.title('Residuals vs Predicted (Closed-form)')
plt.show()
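
A residual plot can be complemented with a crude numerical check for non-constant variance. The sketch below is one possible approach on synthetic data that is heteroscedastic by construction (all names are hypothetical): it compares the residual spread below and above the median prediction.

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.uniform(1.0, 10.0, size=2000)
# Synthetic target whose noise standard deviation grows with x.
y_demo = 2.0 * x + rng.normal(scale=0.3 * x)

# Fit a line with an intercept and compute residuals.
A = np.column_stack([np.ones_like(x), x])
coef, *_ = np.linalg.lstsq(A, y_demo, rcond=None)
pred = A @ coef
resid = y_demo - pred

# Compare residual spread below and above the median prediction.
median_pred = np.median(pred)
low = resid[pred <= median_pred]
high = resid[pred > median_pred]
print(low.std(), high.std())
```

A clearly larger spread in one half suggests heteroscedasticity; roughly equal spreads are consistent with, but do not prove, constant variance.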


## 4. Gradient Descent for Linear Regression (from scratch)
We optimize $\theta$ by minimizing MSE with batch gradient descent. We report the loss across iterations and compare to the closed‑form solution.

In [None]:
def mse_loss(theta: np.ndarray, X: np.ndarray, y: np.ndarray) -> float:
    y_pred = predict_with_theta(theta, X)
    return float(np.mean((y_pred - y) ** 2))

def gradient_descent_linreg(X: np.ndarray, y: np.ndarray, lr=1e-3, iters=2000):
    Xb = add_intercept(X)
    n, d = Xb.shape
    theta = np.zeros(d)
    history = []
    for t in range(iters):
        y_pred = Xb @ theta
        grad = (2.0/n) * (Xb.T @ (y_pred - y))
        theta -= lr * grad
        if t % 10 == 0:
            history.append(mse_loss(theta, X, y))
    return theta, history

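# Raw California Housing features differ in scale by orders of magnitude, so batch GD
# diverges on them at this kind of learning rate. Fit on standardized features
# (training statistics only), then map the coefficients back to the original scale
# so that theta_gd can be applied to unscaled inputs.
mu_gd, sigma_gd = X_train.mean(axis=0), X_train.std(axis=0)
theta_std, hist = gradient_descent_linreg((X_train - mu_gd) / sigma_gd, y_train, lr=5e-2, iters=3000)
w_gd = theta_std[1:] / sigma_gd
theta_gd = np.concatenate([[theta_std[0] - w_gd @ mu_gd], w_gd])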
pred_val_gd = predict_with_theta(theta_gd, X_val)
metrics_gd_val = eval_regression(y_val, pred_val_gd)
metrics_gd_val, len(hist)
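
Whether batch gradient descent converges at a given learning rate depends on the feature scales. For the MSE loss the Hessian is $\frac{2}{n} X^\top X$ (with the intercept column included), and a fixed step size is stable only below $2/\lambda_{\max}$ of that matrix. The sketch below, with a hypothetical helper `max_stable_lr` and made-up features on very different scales, compares the bound before and after standardization.

```python
import numpy as np

def max_stable_lr(X):
    """Upper bound 2 / lambda_max on the step size for batch GD on MSE."""
    Xb = np.hstack([np.ones((X.shape[0], 1)), X])
    hessian = (2.0 / Xb.shape[0]) * (Xb.T @ Xb)
    return 2.0 / np.linalg.eigvalsh(hessian).max()

rng = np.random.default_rng(0)
# Hypothetical features: one of order 1, one of order 1000 with a large offset.
X_raw = rng.normal(size=(500, 2)) * np.array([1.0, 1000.0]) + np.array([0.0, 1500.0])
X_std = (X_raw - X_raw.mean(axis=0)) / X_raw.std(axis=0)

lr_raw = max_stable_lr(X_raw)
lr_std = max_stable_lr(X_std)
print(lr_raw, lr_std)
```

On the raw features the admissible step size is tiny, which is why a learning rate such as `1e-3` can diverge there while working well after standardization.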


### Convergence Plot (MSE vs Iterations)

In [None]:
plt.figure()
plt.plot(np.arange(len(hist))*10, hist)
plt.xlabel('Iteration')
plt.ylabel('Training MSE')
plt.title('Gradient Descent Convergence')
plt.show()


## 5. Ridge and Lasso (with Standardization)
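We fit the standardizer on the training set and apply the same transform to the validation and test sets. Then we run simple K‑fold CV (K=5) on the training split to select $\alpha$ for Ridge and for Lasso, refit each model on the full training split with the selected $\alpha$, and report performance.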

In [None]:
scaler = StandardScaler().fit(X_train)
Xs_train = scaler.transform(X_train)
Xs_val = scaler.transform(X_val)
Xs_test = scaler.transform(X_test)

def cv_model_and_alpha(Model, alphas, X, y, k=5):
    kf = KFold(n_splits=k, shuffle=True, random_state=42)
    alpha_scores = []
    for a in alphas:
        rmses = []
        for tr, va in kf.split(X):
            m = Model(alpha=a)
            m.fit(X[tr], y[tr])
            pred = m.predict(X[va])
            # np.sqrt of the MSE instead of squared=False, which was removed in scikit-learn 1.6
            rmses.append(float(np.sqrt(mean_squared_error(y[va], pred))))
        alpha_scores.append((a, float(np.mean(rmses))))
    alpha_scores.sort(key=lambda t: t[1])
    best_alpha = alpha_scores[0][0]
    model = Model(alpha=best_alpha).fit(X, y)
    return model, best_alpha, alpha_scores

alphas = np.logspace(-3, 2, 12)
ridge_model, ridge_alpha, ridge_scores = cv_model_and_alpha(Ridge, alphas, Xs_train, y_train)
lasso_model, lasso_alpha, lasso_scores = cv_model_and_alpha(Lasso, alphas, Xs_train, y_train)
ridge_alpha, lasso_alpha
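
scikit-learn also ships estimators with built-in cross-validation, `RidgeCV` and `LassoCV`, which can serve as a cross-check on the manual loop. The sketch below runs them on hypothetical synthetic data over the same grid. Their fold construction and scoring defaults differ from the loop above (for example, integer `cv` does not shuffle), so the selected $\alpha$ need not match exactly.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import LassoCV, RidgeCV
from sklearn.preprocessing import StandardScaler

# Hypothetical data standing in for the standardized training split.
X_demo, y_demo = make_regression(n_samples=400, n_features=8, noise=12.0, random_state=0)
Xs_demo = StandardScaler().fit_transform(X_demo)
alphas_demo = np.logspace(-3, 2, 12)

ridge_cv = RidgeCV(alphas=alphas_demo, cv=5).fit(Xs_demo, y_demo)
lasso_cv = LassoCV(alphas=alphas_demo, cv=5).fit(Xs_demo, y_demo)

print(ridge_cv.alpha_, lasso_cv.alpha_)
```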


### Ridge/Lasso Evaluation

In [None]:
pred_val_ridge = ridge_model.predict(Xs_val)
pred_val_lasso = lasso_model.predict(Xs_val)
metrics_ridge_val = eval_regression(y_val, pred_val_ridge)
metrics_lasso_val = eval_regression(y_val, pred_val_lasso)

pred_test_ridge = ridge_model.predict(Xs_test)
pred_test_lasso = lasso_model.predict(Xs_test)
metrics_ridge_test = eval_regression(y_test, pred_test_ridge)
metrics_lasso_test = eval_regression(y_test, pred_test_lasso)

metrics_ridge_val, metrics_lasso_val, metrics_ridge_test, metrics_lasso_test


### Coefficients (Magnitude) — Ridge vs Lasso
We visualize coefficient magnitudes to illustrate regularization effects.

In [None]:
feature_names = meta.get("feature_names", [f"f{i}" for i in range(X.shape[1])])

def bar_coeffs(model, title):
    coefs = model.coef_
    plt.figure()
    plt.bar(np.arange(len(coefs)), np.abs(coefs))
    plt.xticks(np.arange(len(coefs)), feature_names, rotation=45, ha='right')
    plt.title(title)
    plt.tight_layout()
    plt.show()

bar_coeffs(ridge_model, f'Ridge | alpha={ridge_alpha:.3g}')
bar_coeffs(lasso_model, f'Lasso | alpha={lasso_alpha:.3g}')
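
The qualitative difference between the two penalties can also be shown by counting exact zeros. The sketch below is illustrative only: it uses hypothetical synthetic data in which just 3 of 8 features carry signal, and an arbitrary `alpha=5.0` for both models.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import Lasso, Ridge
from sklearn.preprocessing import StandardScaler

# Hypothetical data: only 3 of the 8 features are informative.
X_demo, y_demo = make_regression(
    n_samples=500, n_features=8, n_informative=3, noise=10.0, random_state=0
)
Xs_demo = StandardScaler().fit_transform(X_demo)

ridge = Ridge(alpha=5.0).fit(Xs_demo, y_demo)
lasso = Lasso(alpha=5.0).fit(Xs_demo, y_demo)

# Count coefficients that are exactly zero under each penalty.
n_zero_ridge = int(np.sum(ridge.coef_ == 0.0))
n_zero_lasso = int(np.sum(lasso.coef_ == 0.0))
print(n_zero_ridge, n_zero_lasso)
```

Ridge shrinks coefficients toward zero without typically reaching it, whereas Lasso sets some of them exactly to zero, which is what makes it usable for feature selection.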


## 6. Final Comparison on Test Set
We compare closed‑form linear regression, the gradient descent solution, Ridge, and Lasso on the test split.

In [None]:
summary = {
    'ClosedForm': metrics_cf_test,
    'GradDescent': eval_regression(y_test, predict_with_theta(theta_gd, X_test)),
    'Ridge': metrics_ridge_test,
    'Lasso': metrics_lasso_test,
}
summary


### Notes & Takeaways
- **Closed‑form** and **Gradient Descent** should agree closely if GD converged.
- **Ridge** tends to reduce variance and often improves generalization when features are correlated.
- **Lasso** encourages sparsity, potentially performing feature selection.
- Always **standardize** features before L2/L1 regularization: the penalty depends on coefficient size, so features on different scales would otherwise be penalized unevenly.
- Inspect **residuals** to validate assumptions (homoscedasticity, linearity) and identify failure modes.
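
On the standardization point, one refinement is to refit the scaler inside each CV fold so that held-out rows never influence the scaling statistics. The sketch below is one way to do this with a scikit-learn `Pipeline`. The data are hypothetical synthetic data and `alpha=1.0` is an arbitrary choice, not a value taken from the results above.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import Ridge
from sklearn.model_selection import KFold, cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

X_demo, y_demo = make_regression(n_samples=400, n_features=8, noise=12.0, random_state=0)

# The scaler is refit on the training part of every fold.
pipe = make_pipeline(StandardScaler(), Ridge(alpha=1.0))
cv = KFold(n_splits=5, shuffle=True, random_state=42)
scores = cross_val_score(pipe, X_demo, y_demo, cv=cv, scoring="neg_root_mean_squared_error")

# scikit-learn reports negated RMSE so that larger is better; flip the sign back.
rmse_per_fold = -scores
print(rmse_per_fold.mean())
```

With a dataset of this size the difference from scaling once on the full training split is usually small, but the pipeline form keeps the CV estimate free of that leakage by construction.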