# Lesson 02 - Linear Regression (GD, SGD, Normal Equation)


## Objectives
- Implement batch gradient descent and stochastic gradient descent.
- Solve linear regression with the normal equation.
- Compare convergence behavior and sensitivity to feature scaling.


## From the notes

**Notation**
- $h_\theta(x) = \theta^T x$ with $x_0 = 1$.
- Cost: $J(\theta) = \frac{1}{2m} \sum_{i=1}^m (h_\theta(x^{(i)}) - y^{(i)})^2$.
- Gradient descent: $\theta := \theta - \alpha \nabla_\theta J(\theta)$.
- Normal equation: $\theta = (X^T X)^{-1} X^T y$.
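
Writing the cost in matrix form makes the link between the gradient descent update and the normal equation explicit. The short derivation below is a sketch under two assumptions not stated above: $X$ is the design matrix whose rows are $x^{(i)T}$, and $X^T X$ is invertible.

$$
J(\theta) = \frac{1}{2m} (X\theta - y)^T (X\theta - y),
\qquad
\nabla_\theta J(\theta) = \frac{1}{m} X^T (X\theta - y).
$$

Setting the gradient to zero gives

$$
X^T X \theta = X^T y
\quad\Longrightarrow\quad
\theta = (X^T X)^{-1} X^T y.
$$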

_TODO: Verify the exact notation matches the official CS229 notes PDF._


## Intuition
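Linear regression finds the line (or hyperplane) that minimizes the mean squared error over the training set. Batch GD computes the gradient over all $m$ examples before each update, SGD updates $\theta$ after every single example, and the normal equation solves for the minimizer in closed form with no iteration.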
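
To make the relation between the two gradient methods concrete, the sketch below checks that the batch gradient is the average of the per-example gradients that SGD uses, so each SGD step follows a noisy estimate of the batch direction. The data, the seed, the evaluation point `theta`, and the variable names are illustrative assumptions, not part of the lesson's cells.

```python
import numpy as np

# Illustrative data (assumed): same shape of problem as the lesson, separate RNG
rng = np.random.default_rng(0)
m = 50
x = np.linspace(0, 10, m)
y = 3.0 * x + 1.5 + rng.normal(scale=2.5, size=m)
X = np.c_[np.ones(m), x]

theta = np.array([0.5, -1.0])  # arbitrary point at which to compare gradients

residuals = X @ theta - y               # shape (m,)
per_example = residuals[:, None] * X    # row i is the SGD gradient for example i
batch_grad = X.T @ residuals / m        # full-batch gradient of J(theta)

# The mean of the per-example gradients equals the batch gradient
print(np.allclose(per_example.mean(axis=0), batch_grad))
# The spread across examples is the source of SGD's noise
print(per_example.std(axis=0))
```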


## Data
We reuse a 1D synthetic regression dataset ($m = 80$ points drawn from $y = 3x + 1.5$ plus Gaussian noise with standard deviation 2.5) so that optimization behavior is easy to visualize.


In [None]:
import numpy as np
import matplotlib.pyplot as plt

np.random.seed(42)

# Data
m = 80
x = np.linspace(0, 10, m)
y = 3.0 * x + 1.5 + np.random.normal(scale=2.5, size=m)
X = np.c_[np.ones(m), x]

def compute_cost(X, y, theta):
    # J(theta) = (1 / 2m) * sum of squared residuals
    errors = X @ theta - y
    return (errors @ errors) / (2 * len(y))

def batch_gd(X, y, alpha=0.01, iters=2000):
    theta = np.zeros(X.shape[1])
    history = []
    for _ in range(iters):
        # Full-batch gradient: (1/m) * X^T (X theta - y)
        grad = (X.T @ (X @ theta - y)) / len(y)
        theta -= alpha * grad
        history.append(compute_cost(X, y, theta))  # cost after each update
    return theta, history

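def sgd(X, y, alpha=0.01, iters=30):
    # Here `iters` counts epochs (full passes); each epoch makes len(y) updates.
    theta = np.zeros(X.shape[1])
    history = []
    for _ in range(iters):
        # Examples are visited in a fixed order; shuffling each epoch is a common variant.
        for i in range(len(y)):
            grad = (X[i] @ theta - y[i]) * X[i]  # gradient of the single-example loss
            theta -= alpha * grad
        history.append(compute_cost(X, y, theta))  # cost recorded once per epoch
    return theta, history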

theta_gd, hist_gd = batch_gd(X, y)
theta_sgd, hist_sgd = sgd(X, y)
# Normal equation; pinv (pseudo-inverse) still returns a solution if X^T X is singular
theta_ne = np.linalg.pinv(X.T @ X) @ X.T @ y

theta_gd, theta_sgd, theta_ne
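
As a sanity check on the closed-form solution, the sketch below compares three ways of solving the same least-squares problem and confirms that the gradient vanishes at the result. It rebuilds a dataset with the lesson's parameters but uses its own generator and seed (an assumption), so the numbers will not match the cell above exactly. The names `theta_pinv`, `theta_solve`, and `theta_lstsq` are illustrative.

```python
import numpy as np

# Assumed stand-in for the lesson's data (separate RNG, so values differ)
rng = np.random.default_rng(42)
m = 80
x = np.linspace(0, 10, m)
y = 3.0 * x + 1.5 + rng.normal(scale=2.5, size=m)
X = np.c_[np.ones(m), x]

# 1) Normal equation via the pseudo-inverse, as in the cell above
theta_pinv = np.linalg.pinv(X.T @ X) @ X.T @ y
# 2) Solve the linear system X^T X theta = X^T y without forming an inverse
theta_solve = np.linalg.solve(X.T @ X, X.T @ y)
# 3) Least-squares solver applied directly to X and y
theta_lstsq, *_ = np.linalg.lstsq(X, y, rcond=None)

print(np.allclose(theta_pinv, theta_lstsq), np.allclose(theta_solve, theta_lstsq))

# At the minimizer the gradient (1/m) X^T (X theta - y) should be numerically zero
grad_at_min = X.T @ (X @ theta_lstsq - y) / m
print(np.allclose(grad_at_min, 0.0, atol=1e-6))
```

In practice `np.linalg.solve` or `np.linalg.lstsq` is usually preferred over forming an explicit inverse, since both avoid some of the numerical error that inversion introduces.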


## Experiments


In [None]:
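# Feature scaling experiment
# Standardize the non-bias feature to zero mean and unit variance
X_scaled = X.copy()
X_scaled[:, 1] = (X_scaled[:, 1] - X_scaled[:, 1].mean()) / X_scaled[:, 1].std()

# Note: this run also uses a larger step size (0.05 vs 0.01),
# so the comparison reflects both the scaling and the change in alpha.
theta_gd_scaled, hist_gd_scaled = batch_gd(X_scaled, y, alpha=0.05)
# Final cost: unscaled vs scaled
hist_gd[-1], hist_gd_scaled[-1]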
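
One way to see why scaling helps: for this quadratic cost the Hessian is $X^T X / m$, and gradient descent with a fixed step size converges more slowly as the Hessian's condition number grows. The sketch below computes that condition number before and after standardizing the feature. It rebuilds the design matrix from the lesson's `x` grid, and the helper name `hessian` is an illustrative assumption.

```python
import numpy as np

# Same x grid as the lesson; y is not needed because the Hessian depends only on X
m = 80
x = np.linspace(0, 10, m)
X = np.c_[np.ones(m), x]

X_scaled = X.copy()
X_scaled[:, 1] = (x - x.mean()) / x.std()  # zero mean, unit variance

def hessian(X):
    """Hessian of J(theta) = (1/2m) ||X theta - y||^2 (hypothetical helper)."""
    return X.T @ X / len(X)

cond_raw = np.linalg.cond(hessian(X))
cond_scaled = np.linalg.cond(hessian(X_scaled))

print(f"condition number, raw features:    {cond_raw:.2f}")
print(f"condition number, scaled features: {cond_scaled:.2f}")
```

After standardization the bias column and the feature column are orthogonal with equal norm, so the scaled Hessian is the identity and its condition number is 1 up to rounding. The cost surface then has circular contours and a single step size suits both parameters.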


## Visualizations


In [None]:
plt.figure(figsize=(6,4))
plt.plot(hist_gd, label="Batch GD")
plt.plot(hist_sgd, label="SGD")
plt.title("Convergence of GD vs SGD")
plt.xlabel("Batch GD iteration / SGD epoch")  # the two histories use different units
plt.ylabel("J(θ)")
plt.legend()
plt.show()

preds = X @ theta_gd
plt.figure(figsize=(6,4))
plt.scatter(x, y, alpha=0.6, label="data")
plt.plot(x, preds, color="black", label="GD fit")
plt.title("Linear regression fit")
plt.xlabel("x")
plt.ylabel("y")
plt.legend()
plt.show()


## Takeaways
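- Batch GD converges smoothly and stably, but every step requires a full pass over the data; SGD updates are noisy but cheap, so it often reaches a good solution in fewer passes.
- The normal equation provides a closed-form solution when $X^T X$ is invertible; otherwise the pseudo-inverse gives the minimum-norm least-squares solution.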


## Explain it in an interview
- Explain the difference between GD, SGD, and the normal equation.
- Describe why feature scaling accelerates gradient descent.


## Exercises
- Implement mini-batch gradient descent and compare to GD/SGD.
- Show what happens if $X^T X$ is singular and use the pseudo-inverse.
- Add L2 regularization and derive the closed-form ridge solution.
