# Lesson 02: Linear Regression with GD + SGD (Refined)

## Objectives

- Implement batch gradient descent (BGD) and stochastic gradient descent (SGD).
- Compare convergence behavior and learning curves.
- Visualize optimization dynamics and prediction error.

## From the notes: notation + objective

We use \(m\) training examples \(\{(x^{(i)}, y^{(i)})\}_{i=1}^m\), with \(x_0=1\). The hypothesis is \(h_\theta(x)=\theta^T x\).

Objective:

\[J(\theta) = \frac{1}{2m}\sum_{i=1}^m (h_\theta(x^{(i)}) - y^{(i)})^2.\]

Gradient:

\[\nabla_\theta J(\theta) = \frac{1}{m} X^T(X\theta - y).\]

Here \(X\) is the design matrix whose \(i\)-th row is \((x^{(i)})^T\), and \(y\) is the vector of targets \(y^{(i)}\).
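One quick way to gain confidence in the gradient formula is to compare it against central finite differences of the cost. The sketch below does that on a small toy dataset; the function names (`cost`, `analytic_grad`, `numeric_grad`), the toy data, and the test point `theta_check` are illustrative assumptions, not part of the lesson's code.

```python
import numpy as np

def cost(X, y, theta):
    """J(theta) = 1/(2m) * sum of squared residuals."""
    residual = X @ theta - y
    return 0.5 / len(y) * np.sum(residual ** 2)

def analytic_grad(X, y, theta):
    """Gradient from the closed-form expression (1/m) X^T (X theta - y)."""
    return X.T @ (X @ theta - y) / len(y)

def numeric_grad(X, y, theta, eps=1e-6):
    """Central finite differences, perturbing one coordinate at a time."""
    grad = np.zeros_like(theta)
    for j in range(len(theta)):
        step = np.zeros_like(theta)
        step[j] = eps
        grad[j] = (cost(X, y, theta + step) - cost(X, y, theta - step)) / (2 * eps)
    return grad

# Hypothetical toy data, generated only for this check.
rng = np.random.default_rng(0)
x = rng.uniform(0, 12, size=20)
X_check = np.c_[np.ones(20), x]          # bias column x_0 = 1, then the feature
y_check = 4.0 * x - 3.0 + rng.normal(0, 3.0, size=20)
theta_check = np.array([0.5, -1.0])      # arbitrary point at which to compare

max_diff = np.max(np.abs(
    analytic_grad(X_check, y_check, theta_check)
    - numeric_grad(X_check, y_check, theta_check)
))
print(f"max |analytic - numeric| = {max_diff:.2e}")
```

Because \(J\) is quadratic in \(\theta\), the two gradients should agree up to floating-point round-off; a large gap usually points to a missing \(1/m\) factor or a sign error.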

## Intuition

Gradient descent iteratively nudges \(\theta\) in the direction that reduces the error. SGD updates on a single example at a time, so each update is noisier but much cheaper than a full-batch step.
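To make the "noisy but cheap" idea concrete, the sketch below computes the gradient contributed by each example separately and checks that their average equals the full-batch gradient. The toy data and the names `X_demo`, `y_demo`, `theta_demo`, and `per_example` are assumptions made for illustration.

```python
import numpy as np

# Hypothetical toy data, generated only for this illustration.
rng = np.random.default_rng(1)
m_demo = 50
x = rng.uniform(0, 12, size=m_demo)
X_demo = np.c_[np.ones(m_demo), x]
y_demo = 4.0 * x - 3.0 + rng.normal(0, 3.0, size=m_demo)
theta_demo = np.zeros(2)

residual = X_demo @ theta_demo - y_demo       # shape (m,)
per_example = X_demo * residual[:, None]      # row i is the gradient from example i alone
batch_grad = X_demo.T @ residual / m_demo     # full-batch gradient

print("batch gradient:           ", batch_grad)
print("mean of per-example grads:", per_example.mean(axis=0))
print("std of per-example grads: ", per_example.std(axis=0))
```

The mean matches the batch gradient, so a single-example step points in the right direction on average; the standard deviation gives a rough sense of how far any one step can stray from it.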

## Data

We generate a noisy linear dataset similar to the lecture example.

In [None]:
import numpy as np
import matplotlib.pyplot as plt

np.random.seed(42)

In [None]:
# Synthetic data
m = 80
X_raw = np.linspace(0, 12, m)
y = 4.0 * X_raw - 3.0 + np.random.normal(0, 3.0, size=m)
X = np.c_[np.ones(m), X_raw]  # prepend the bias column x_0 = 1

## Implementation: batch and stochastic gradient descent

In [None]:
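def compute_cost(X, y, theta):
    """Least-squares cost J(theta) = 1/(2m) * sum((X @ theta - y)^2)."""
    preds = X @ theta
    return 0.5 / len(y) * np.sum((preds - y) ** 2)


def batch_gd(X, y, alpha=0.02, num_iters=200):
    """Batch gradient descent: one update per pass over all m examples."""
    # alpha must stay below 2 / lambda_max(X.T @ X / m), which is about 0.04
    # for the unscaled data above; larger steps make the iteration diverge.
    theta = np.zeros(X.shape[1])
    history = []
    for _ in range(num_iters):
        grad = (X.T @ (X @ theta - y)) / len(y)
        theta -= alpha * grad
        history.append(compute_cost(X, y, theta))
    return theta, np.array(history)


def sgd(X, y, alpha=0.01, num_iters=10):
    """Stochastic gradient descent: one update per example.

    num_iters counts passes (epochs) over the data, so history holds
    num_iters * m entries. Examples are visited in their stored order.
    """
    theta = np.zeros(X.shape[1])
    history = []
    for _ in range(num_iters):
        for i in range(len(y)):
            xi = X[i]
            yi = y[i]
            grad = (xi @ theta - yi) * xi  # gradient from example i alone
            theta -= alpha * grad
            # Full-data cost after every update, tracked for plotting only.
            history.append(compute_cost(X, y, theta))
    return theta, np.array(history)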

## Experiments

We compare BGD vs SGD and inspect convergence.

In [None]:
theta_bgd, hist_bgd = batch_gd(X, y)
theta_sgd, hist_sgd = sgd(X, y)

pred_bgd = X @ theta_bgd
pred_sgd = X @ theta_sgd
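When judging how well either optimizer has converged, it can help to have the exact least-squares minimizer as a reference. The sketch below rebuilds a dataset with the same recipe as the Data section and solves it with `np.linalg.lstsq`. The `_ref` names are illustrative, and the block uses its own `RandomState` so it does not disturb the notebook's global seed; whether the noise draws match the notebook's exactly depends on cell execution order.

```python
import numpy as np

# Same recipe as the Data section, with a private random state.
rng = np.random.RandomState(42)
m_ref = 80
x_ref = np.linspace(0, 12, m_ref)
y_ref = 4.0 * x_ref - 3.0 + rng.normal(0, 3.0, size=m_ref)
X_ref = np.c_[np.ones(m_ref), x_ref]

# Exact least-squares solution: the minimizer of J(theta).
theta_ref, *_ = np.linalg.lstsq(X_ref, y_ref, rcond=None)

# At the minimizer the gradient (1/m) X^T (X theta - y) should vanish.
grad_at_ref = X_ref.T @ (X_ref @ theta_ref - y_ref) / m_ref
cost_ref = 0.5 / m_ref * np.sum((X_ref @ theta_ref - y_ref) ** 2)

print("theta_ref:", theta_ref)
print("J(theta_ref):", cost_ref)
print("gradient norm at theta_ref:", np.linalg.norm(grad_at_ref))
```

In the notebook, a distance such as `np.linalg.norm(theta_bgd - theta_ref)` then gives a single number for how far each optimizer still is from the optimum, and `cost_ref` is the floor that both convergence curves approach.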

## Visualizations

In [None]:
# Fitted lines over the data
plt.figure(figsize=(6, 4))
plt.scatter(X_raw, y, alpha=0.6, label="data")
plt.plot(X_raw, pred_bgd, color="C1", label="BGD")
plt.plot(X_raw, pred_sgd, color="C2", label="SGD")
plt.xlabel("x")
plt.ylabel("y")
plt.title("Linear regression fits")
plt.legend()
plt.show()

# Cost after each update (BGD: one per pass; SGD: one per example)
plt.figure(figsize=(6, 4))
plt.plot(hist_bgd, label="BGD")
plt.plot(hist_sgd, label="SGD", alpha=0.7)
plt.xlabel("update step")
plt.ylabel("J(theta)")
plt.title("Convergence curves")
plt.legend()
plt.show()

## Takeaways

- BGD converges smoothly; SGD introduces noise but often reaches a good solution in fewer passes over the data.
- Learning rate tuning is crucial for stability.
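The stability point can be made quantitative for batch gradient descent. On a quadratic cost, the iteration converges only if the step size is below \(2/\lambda_{\max}\), where \(\lambda_{\max}\) is the largest eigenvalue of \(X^T X / m\). The sketch below computes that threshold for a design matrix built like the one in this lesson; the names `X_lr`, `H`, and `alpha_max` are illustrative assumptions.

```python
import numpy as np

# Design matrix built with the same recipe as the lesson's data
# (the targets y do not affect the stability threshold).
m_lr = 80
x_lr = np.linspace(0, 12, m_lr)
X_lr = np.c_[np.ones(m_lr), x_lr]

H = X_lr.T @ X_lr / m_lr              # Hessian of J(theta)
eigvals = np.linalg.eigvalsh(H)       # ascending order for symmetric matrices
lam_min, lam_max = eigvals[0], eigvals[-1]

alpha_max = 2.0 / lam_max             # BGD diverges for step sizes above this
condition_number = lam_max / lam_min  # large values mean slow convergence

print(f"lambda_max = {lam_max:.3f}, lambda_min = {lam_min:.3f}")
print(f"largest stable alpha for BGD = {alpha_max:.4f}")
print(f"condition number = {condition_number:.1f}")
```

The large condition number is also why the unscaled problem converges slowly along one direction even with a stable step size, which is what feature scaling is meant to address.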

## Explain it in an interview

- Start with the least-squares objective and its gradient.
- Emphasize why SGD is used for large datasets.

## Exercises

1. Add feature scaling and compare convergence.
2. Implement mini-batch GD and compare to BGD/SGD.
3. Visualize the cost surface for \(\theta_0, \theta_1\).