## Gradient Descent Update Rule with L2 Regularization
**Front:** For logistic regression with L2 regularization, the cost function is $J(\theta) = -\frac{1}{m}l(\theta) + \frac{\lambda}{2m}\sum_{j=1}^n \theta_j^2$, where $l(\theta)$ is the log-likelihood over the $m$ training examples. What are the gradient descent update rules for $\theta_0$ and $\theta_j$ ($j \geq 1$)? <br/>
**Back:** 
- For $\theta_0$ (bias term): $\theta_0 := \theta_0 - \alpha \frac{1}{m} \sum_{i=1}^m (h_\theta(x^{(i)}) - y^{(i)})$
- For $\theta_j$ ($j \geq 1$): $\theta_j := \theta_j - \alpha \left[ \frac{1}{m} \sum_{i=1}^m (h_\theta(x^{(i)}) - y^{(i)}) x_j^{(i)} + \frac{\lambda}{m} \theta_j \right]$
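
The two rules above can be sketched in NumPy as a single step function. This is a minimal illustration, not a reference implementation: the function name `gradient_step`, the tiny dataset, and the convention that `X` carries a leading column of ones for $\theta_0$ are all assumptions made for the example.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def gradient_step(theta, X, y, alpha, lam):
    """One gradient descent step; X is assumed to have a leading column of ones."""
    m = X.shape[0]
    error = sigmoid(X @ theta) - y      # h_theta(x^(i)) - y^(i), shape (m,)
    grad = (X.T @ error) / m            # unregularized gradient, one entry per theta_j
    new_theta = theta.copy()
    # Bias term: plain gradient step, no penalty
    new_theta[0] = theta[0] - alpha * grad[0]
    # Weights j >= 1: gradient step plus the (lambda / m) * theta_j penalty term
    new_theta[1:] = theta[1:] - alpha * (grad[1:] + (lam / m) * theta[1:])
    return new_theta

# Made-up data: 4 examples, intercept column plus 1 feature
X = np.array([[1.0, 0.5], [1.0, 1.5], [1.0, -1.0], [1.0, 2.0]])
y = np.array([0.0, 1.0, 0.0, 1.0])
theta = np.array([0.0, 1.0])

theta_next = gradient_step(theta, X, y, alpha=0.1, lam=1.0)
theta_noreg = gradient_step(theta, X, y, alpha=0.1, lam=0.0)
```

Comparing `theta_next` with `theta_noreg` shows that the bias entry is identical, while the weight entry differs by exactly $\alpha\frac{\lambda}{m}\theta_j$.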

## L2 Regularization Effect on Update
**Front:** In the update rule $\theta_j := \theta_j - \alpha \left[ \frac{1}{m} \sum (h-y)x_j + \frac{\lambda}{m} \theta_j \right]$, what does the term $\frac{\lambda}{m} \theta_j$ do? <br/>
**Back:** It adds **weight decay**: at each step, $\theta_j$ is pulled toward zero by an amount $\alpha\frac{\lambda}{m}\theta_j$, which is proportional to both its current value and $\lambda$. This discourages large weights, which reduces overfitting.

## Bias Term Exclusion from Regularization
**Front:** Why is $\theta_0$ typically excluded from the L2 regularization penalty in the update rule? <br/>
**Back:** The bias term $\theta_0$ controls the intercept/position of the decision boundary, not the complexity of the model. Regularizing it would unnecessarily constrain where the boundary can be placed.

## Rearranged Update Form
**Front:** How can the update rule for $\theta_j$ ($j \geq 1$) with L2 regularization be rewritten to show explicit weight shrinkage? <br/>
**Back:** $\theta_j := \theta_j \left(1 - \alpha\frac{\lambda}{m}\right) - \alpha \frac{1}{m} \sum_{i=1}^m (h_\theta(x^{(i)}) - y^{(i)}) x_j^{(i)}$. The factor $(1 - \alpha\frac{\lambda}{m})$, which is slightly less than 1 for typical values of $\alpha$, $\lambda$, and $m$, multiplicatively shrinks $\theta_j$; the usual data-gradient step is then applied.
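
A quick numerical check that the rearranged form matches the original update. The values are arbitrary, and `data_grad` is a hypothetical stand-in for $\frac{1}{m}\sum_i (h_\theta(x^{(i)}) - y^{(i)}) x_j^{(i)}$, so this is only a sanity check of the algebra:

```python
import numpy as np

rng = np.random.default_rng(0)
m, alpha, lam = 50, 0.1, 2.0
theta_j = rng.normal(size=5)      # weights with j >= 1 (arbitrary values)
data_grad = rng.normal(size=5)    # stand-in for the data-gradient term

# Original form: gradient and penalty inside one bracket
original = theta_j - alpha * (data_grad + (lam / m) * theta_j)
# Rearranged form: shrink first, then take the data-gradient step
rearranged = theta_j * (1 - alpha * lam / m) - alpha * data_grad

same = np.allclose(original, rearranged)
print(same)
```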

## Regularization Strength $\lambda$
**Front:** What happens in the update rule when $\lambda = 0$? What happens when $\lambda$ is very large? <br/>
**Back:** 
- $\lambda = 0$: Recovers standard gradient descent without regularization.
- $\lambda$ large: The term $\frac{\lambda}{m}\theta_j$ dominates the data gradient, driving $\theta_j$ toward zero almost regardless of the data (high bias/underfitting).
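
The effect of $\lambda$ can be seen in a small experiment. This is a sketch on synthetic data: the helper `fit`, the data-generating weights `true_theta`, and the chosen values of $\alpha$, $\lambda$, and the step count are all assumptions for illustration.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def fit(X, y, alpha, lam, steps):
    """Run regularized gradient descent from theta = 0 (bias left unpenalized)."""
    m, n = X.shape
    theta = np.zeros(n)
    reg = np.full(n, lam, dtype=float)
    reg[0] = 0.0                                  # no penalty on theta_0
    for _ in range(steps):
        grad = X.T @ (sigmoid(X @ theta) - y) / m
        theta = theta - alpha * (grad + reg * theta / m)
    return theta

# Synthetic data: intercept column plus 2 features, labels drawn from a logistic model
rng = np.random.default_rng(0)
m = 200
X = np.column_stack([np.ones(m), rng.normal(size=(m, 2))])
true_theta = np.array([0.5, 2.0, -1.0])
y = (rng.random(m) < sigmoid(X @ true_theta)).astype(float)

for lam in (0.0, 1.0, 100.0):
    theta = fit(X, y, alpha=0.5, lam=lam, steps=500)
    print(lam, np.linalg.norm(theta[1:]))         # weight norm shrinks as lambda grows
```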

## Learning Rate Interaction
**Front:** How does the learning rate $\alpha$ interact with the regularization parameter $\lambda$ in the update? <br/>
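**Back:** They appear together as the product $\alpha\frac{\lambda}{m}$ in the weight shrinkage factor, so increasing either $\alpha$ or $\lambda$ increases the shrinkage per iteration. Note that $\alpha$ only changes how fast the weights move; the minimizer of $J(\theta)$ depends on $\lambda$ alone. Tune them together, keeping $\alpha\frac{\lambda}{m}$ well below 1.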
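
A short arithmetic sketch of the per-step shrinkage, using made-up values for $\alpha$, $\lambda$, and $m$. It ignores the data-gradient term, so it shows only the decay part of the update:

```python
alpha, lam, m = 0.1, 10.0, 100

# Per-step multiplicative factor applied to each theta_j (j >= 1)
shrink = 1 - alpha * lam / m

# Fraction of a weight left after 100 steps if the data gradient were zero
remaining = shrink ** 100

# Doubling alpha or doubling lambda changes the factor in the same way
shrink_double_alpha = 1 - (2 * alpha) * lam / m
shrink_double_lam = 1 - alpha * (2 * lam) / m

print(shrink, remaining, shrink_double_alpha, shrink_double_lam)
```

With these values the factor is 0.99 per step, which leaves roughly 37% of a weight after 100 steps.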

## Practical Implementation Note
**Front:** In code, how can you implement the update rules for all $\theta_j$ (including $\theta_0$) uniformly despite $\theta_0$ having no regularization? <br/>
**Back:** By creating a regularization vector where element $j=0$ is 0 and all other elements are $\lambda$, then multiplying it element-wise with $\theta$: $\theta := \theta - \alpha\left(\frac{1}{m}X^T(h-y) + \frac{1}{m}\,\text{reg\_vector} \odot \theta\right)$, where $\odot$ denotes element-wise multiplication.
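
One possible NumPy version of this masked update is sketched below. The function name `vectorized_step` and the sample data are hypothetical; `X` is assumed to include a leading column of ones so that `theta[0]` is the bias.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def vectorized_step(theta, X, y, alpha, lam):
    """One update for all theta_j at once, with the penalty masked out for theta_0."""
    m = X.shape[0]
    reg_vector = np.full(theta.shape, lam, dtype=float)
    reg_vector[0] = 0.0                          # element j = 0 carries no penalty
    h = sigmoid(X @ theta)
    # reg_vector * theta is an element-wise product, not a matrix product
    return theta - alpha * (X.T @ (h - y) / m + reg_vector * theta / m)

# Made-up data: 5 examples, intercept column plus 2 features
X = np.array([[1.0, 0.2, -1.0],
              [1.0, 1.1, 0.3],
              [1.0, -0.7, 0.8],
              [1.0, 2.0, -0.5],
              [1.0, 0.0, 1.5]])
y = np.array([0.0, 1.0, 0.0, 1.0, 1.0])
theta = np.array([0.3, -0.4, 0.9])

with_reg = vectorized_step(theta, X, y, alpha=0.1, lam=2.0)
without_reg = vectorized_step(theta, X, y, alpha=0.1, lam=0.0)
```

The mask keeps the code free of special cases: the bias entries of `with_reg` and `without_reg` match, and only the remaining entries receive the extra shrinkage.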

## Connection to MAP Estimation
**Front:** What is the Bayesian interpretation of adding L2 regularization to the gradient descent update? <br/>
**Back:** It's equivalent to Maximum A Posteriori (MAP) estimation with an independent Gaussian prior $\theta_j \sim \mathcal{N}(0, \frac{1}{\lambda})$ for $j \geq 1$ and a flat prior on $\theta_0$. The update encodes the prior belief that weights should be small.
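
A sketch of why the Gaussian prior reproduces the penalty, assuming independent priors on $\theta_1, \dots, \theta_n$ and a flat prior on $\theta_0$:

$$
\begin{aligned}
\hat{\theta}_{\text{MAP}} &= \arg\max_\theta \left[ l(\theta) + \sum_{j=1}^n \log p(\theta_j) \right], \qquad \log p(\theta_j) = -\frac{\lambda}{2}\theta_j^2 + \text{const} \\
&= \arg\min_\theta \left[ -l(\theta) + \frac{\lambda}{2}\sum_{j=1}^n \theta_j^2 \right] \\
&= \arg\min_\theta \left[ -\frac{1}{m}l(\theta) + \frac{\lambda}{2m}\sum_{j=1}^n \theta_j^2 \right] = \arg\min_\theta J(\theta)
\end{aligned}
$$

The last step divides by $m$, which does not change the minimizer. A larger $\lambda$ corresponds to a smaller prior variance $\frac{1}{\lambda}$, that is, a stronger belief that the weights are near zero.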