Name: Dev Patel 

Course: DS4400 Data Mining and Machine Learning 1

Prof: Silvio Amir

University: Northeastern University

Problem 6: Ridge Regularization

1. Derive the closed-form solution for ridge regression.
2. Modify gradient descent from Problem 5 to implement ridge regression.
3. Simulate data, fit with linear and ridge regression for $\lambda \in \{1, 10, 100, 1000, 10000\}$. Report slope, MSE, R².
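
For item 1, a sketch of the standard derivation, assuming the same cost that the gradient-descent code in Part 2 uses (intercept unpenalized, $1/2n$ scaling). Here $X$ is the design matrix with a leading column of ones, and $D$ is notation introduced only for this sketch:

$$
J(\theta) = \frac{1}{2n}\lVert X\theta - y\rVert^2 + \frac{\lambda}{2n}\,\theta^\top D\,\theta,
\qquad D = \mathrm{diag}(0, 1, \dots, 1)
$$

$$
\nabla J(\theta) = \frac{1}{n}X^\top (X\theta - y) + \frac{\lambda}{n} D\theta = 0
\;\;\Longrightarrow\;\;
(X^\top X + \lambda D)\,\theta = X^\top y
$$

$$
\hat{\theta}_{\text{ridge}} = (X^\top X + \lambda D)^{-1} X^\top y
$$

For $\lambda > 0$ the matrix $X^\top X + \lambda D$ is invertible, and setting $\lambda = 0$ recovers the ordinary least-squares normal equations.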

### Part 2: Ridge Regression with Gradient Descent

In [1]:
import pandas as pd
import numpy as np
from sklearn.metrics import mean_squared_error, r2_score

In [2]:
def ridge_gradient_descent(X, y, lam, alpha, n_iters):
    """
    Gradient descent for ridge regression.
    
    Cost: J(θ) = (1/2n) * ||Xθ - y||² + (λ/2n) * ||θ_{-0}||²
    Gradient: ∇J = (1/n) * X^T(Xθ - y) + (λ/n) * [0, θ_1, ..., θ_d]
    
    The intercept (θ_0) is NOT regularized.
    """
    n = len(y)
    X_design = np.column_stack([np.ones(n), X])
    p = X_design.shape[1]
    theta = np.zeros(p)
    
    for i in range(n_iters):
        residual = X_design @ theta - y
        gradient = (1 / n) * (X_design.T @ residual)
        
        # Add ridge penalty gradient (skip intercept at index 0)
        reg_term = (lam / n) * theta
        reg_term[0] = 0  # do not regularize intercept
        gradient += reg_term
        
        theta = theta - alpha * gradient
    
    return theta

def predict(X, theta):
    """Predict using theta. X should NOT include intercept."""
    X_design = np.column_stack([np.ones(len(X)), X])
    return X_design @ theta
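
A closed-form solver is a convenient cross-check for the gradient-descent result. The following is a minimal sketch, assuming the same convention as above (intercept column prepended, intercept not penalized). The name `ridge_closed_form` and the demo data are hypothetical and not part of the assignment code.

```python
import numpy as np

def ridge_closed_form(X, y, lam):
    """Solve (X^T X + lam * D) theta = X^T y with D = diag(0, 1, ..., 1)."""
    n = len(y)
    X_design = np.column_stack([np.ones(n), X])
    D = np.eye(X_design.shape[1])
    D[0, 0] = 0.0  # leave the intercept unpenalized
    # Solve the linear system directly instead of forming an explicit inverse
    return np.linalg.solve(X_design.T @ X_design + lam * D, X_design.T @ y)

# Small made-up example (assumption: not the assignment data)
rng = np.random.default_rng(0)
X_demo = rng.uniform(-2, 2, size=(200, 1))
y_demo = 1 + 2 * X_demo[:, 0] + rng.normal(0, 2, size=200)

theta_ols = ridge_closed_form(X_demo, y_demo, lam=0.0)
theta_ridge = ridge_closed_form(X_demo, y_demo, lam=100.0)
print("OLS  :", theta_ols)
print("Ridge:", theta_ridge)
```

If gradient descent has converged, its `theta` should agree with this solution to several decimal places for the same `lam`.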

### Part 3: Simulated Data Experiment

In [3]:
# Simulate data: Y = 1 + 2X + e, X ~ Uniform(-2, 2), e ~ Normal(mean 0, std 2), i.e. noise variance 4
np.random.seed(42)
N = 1000
X_sim = np.random.uniform(-2, 2, size=N)
e = np.random.normal(0, 2, size=N)
y_sim = 1 + 2 * X_sim + e

print("True model: Y = 1 + 2X + e")
print(f"N = {N}, X range: [{X_sim.min():.2f}, {X_sim.max():.2f}]")
print(f"y range: [{y_sim.min():.2f}, {y_sim.max():.2f}]")

True model: Y = 1 + 2X + e
N = 1000, X range: [-1.98, 2.00]
y range: [-7.16, 9.75]


In [4]:
# Fit linear regression (lambda=0) and ridge for different lambdas
X_sim_2d = X_sim.reshape(-1, 1)  # (N, 1) for our functions

alpha = 0.3
n_iters = 200
lambdas = [0, 1, 10, 100, 1000, 10000]

rows = []
for lam in lambdas:
    theta = ridge_gradient_descent(X_sim_2d, y_sim, lam, alpha, n_iters)
    y_pred = predict(X_sim_2d, theta)
    
    mse = mean_squared_error(y_sim, y_pred)
    r2 = r2_score(y_sim, y_pred)
    
    label = "Linear (λ=0)" if lam == 0 else f"Ridge (λ={lam})"
    rows.append({
        'Model': label,
        'λ': lam,
        'Intercept (θ₀)': round(theta[0], 4),
        'Slope (θ₁)': round(theta[1], 4),
        'MSE': round(mse, 4),
        'R²': round(r2, 4)
    })

results_df = pd.DataFrame(rows)
print("True parameters: intercept = 1, slope = 2\n")
results_df

True parameters: intercept = 1, slope = 2



|   | Model | λ | Intercept (θ₀) | Slope (θ₁) | MSE | R² |
|---|-------|---|----------------|------------|-----|----|
| 0 | Linear (λ=0) | 0 | 1.1948 | 1.9226 | 3.8999 | 0.5639 |
| 1 | Ridge (λ=1) | 1 | 1.1947 | 1.9212 | 3.8999 | 0.5639 |
| 2 | Ridge (λ=10) | 10 | 1.1942 | 1.9086 | 3.9001 | 0.5639 |
| 3 | Ridge (λ=100) | 100 | 1.1897 | 1.7913 | 3.9234 | 0.5613 |
| 4 | Ridge (λ=1000) | 1000 | 1.1631 | 1.1094 | 4.8021 | 0.4630 |
| 5 | Ridge (λ=10000) | 10000 | 2.110753e+73 | -5.613943e+75 | 4.305035e+151 | -4.814229e+150 |


**Observations:**

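- **Linear regression (λ = 0):** Recovers the true parameters reasonably well (intercept 1.19 vs. 1, slope 1.92 vs. 2). MSE ≈ 3.90 is close to the irreducible noise variance (σ² = 4). R² ≈ 0.56 looks modest, but it is near the ceiling set by the noise: the signal variance is 4·Var(X) = 4·(16/12) ≈ 5.33, so the best achievable R² is about 5.33 / (5.33 + 4) ≈ 0.57.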

- **Small λ (1, 10):** The slope shrinks slightly (1.9212 and 1.9086, versus 1.9226 for λ = 0). MSE is unchanged to four decimals at λ = 1 and rises only to 3.9001 at λ = 10. The regularization has a mild effect because the penalty is small relative to the data-fit term.

- **Moderate λ (100):** The slope is noticeably shrunk toward 0 (1.7913). MSE rises to 3.9234 and R² falls to 0.5613 because the model is forced away from the OLS solution toward a simpler (flatter) model.

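- **Large λ (1000, 10000):** At λ = 1000 the slope is pulled down to about 1.11 and the fit degrades clearly (MSE ≈ 4.80, R² ≈ 0.46). At λ = 10000 the reported values (parameters on the order of 10⁷³ to 10⁷⁵) are not ridge estimates: gradient descent diverged. The penalty adds λ/n = 10 to the curvature in the slope direction, so the largest Hessian eigenvalue is roughly 11 and a stable step needs α < 2/11 ≈ 0.18, while α = 0.3 was used for every λ. With a smaller α the run should instead converge to a heavily shrunk slope, that is, a nearly flat model.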

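- **Key takeaway:** Ridge regression trades bias for variance. As λ increases, (1) coefficients shrink toward zero (higher bias), (2) the model becomes simpler and more stable (lower variance), and (3) if λ is too large the model underfits and fails to capture the true relationship. The metrics above are computed on the training data, where OLS is optimal by construction, so they can only worsen as λ grows; any variance benefit would show up on held-out data. The best λ balances this bias-variance trade-off.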
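
The link between λ and the learning rate can be checked numerically. The sketch below assumes the cost used in `ridge_gradient_descent`, whose Hessian is `(X_design.T @ X_design + lam * D) / n` with `D = diag(0, 1, ..., 1)`. Gradient descent on this quadratic is stable only for `alpha < 2 / L`, where `L` is the largest Hessian eigenvalue. The helper name `max_stable_step` is hypothetical.

```python
import numpy as np

def max_stable_step(X, lam):
    """Return 2 / L, where L is the largest eigenvalue of the ridge Hessian."""
    n = X.shape[0]
    X_design = np.column_stack([np.ones(n), X])
    D = np.eye(X_design.shape[1])
    D[0, 0] = 0.0  # intercept is not penalized
    hessian = (X_design.T @ X_design + lam * D) / n
    L = np.linalg.eigvalsh(hessian)[-1]  # eigvalsh returns eigenvalues in ascending order
    return 2.0 / L

# Regenerate the same X as in the experiment (same seed, same first draw)
np.random.seed(42)
X_sim = np.random.uniform(-2, 2, size=1000).reshape(-1, 1)

for lam in [0, 1, 10, 100, 1000, 10000]:
    bound = max_stable_step(X_sim, lam)
    print(f"lambda={lam:>5}: need alpha < {bound:.3f}; alpha=0.3 stable: {0.3 < bound}")
```

One possible design choice is to set `alpha` per λ, for example to `1 / L` (half the bound), rather than reusing a single value across all penalties.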