# Gradient Descent - Interview Q&A

Interview-level questions and answers about Gradient Descent.

## Basic Concepts

### Q1: What is Gradient Descent? Explain the intuition.

**Answer:**

Gradient Descent is an **iterative optimization algorithm** used to minimize a function by moving in the direction of steepest descent (negative gradient).

**Intuition:** Imagine standing on a hill in fog. You can't see the valley, but you can feel the slope under your feet. Gradient Descent says: take a step in the steepest downhill direction. Repeat until you reach the bottom.

**Update Rule:**
$$\theta := \theta - \alpha \nabla J(\theta)$$

Where:
- $\theta$ = parameters
- $\alpha$ = learning rate (step size)
- $\nabla J(\theta)$ = gradient of cost function

**For Linear Regression:**
$$\theta := \theta - \frac{\alpha}{m} X^T(X\theta - y)$$

**Key idea:** The gradient tells us the direction of steepest **ascent**, so we subtract it to go **downhill**.
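The update rule can be tried on a toy problem. The sketch below is illustrative only: the function name `gradient_descent_1d`, the cost $J(x) = (x - 3)^2$, and the step count are assumptions chosen for the example.

```python
def gradient_descent_1d(grad, x0, lr=0.1, n_steps=100):
    """Repeatedly step against the gradient, starting from x0."""
    x = x0
    for _ in range(n_steps):
        x = x - lr * grad(x)  # theta := theta - alpha * dJ/dtheta
    return x

# Toy cost J(x) = (x - 3)^2 has gradient 2 * (x - 3) and its minimum at x = 3
x_min = gradient_descent_1d(lambda x: 2 * (x - 3), x0=0.0)
print(x_min)  # very close to 3
```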

### Q2: What is the learning rate? What happens if it's too large or too small?

**Answer:**

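The **learning rate** ($\alpha$) controls how big each step is during gradient descent. The specific values below are illustrative only; what counts as too small or too large depends on the cost function and on how the features are scaled.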

**Too small ($\alpha$ = 0.0001):**
- Very slow convergence
- May take thousands of iterations
- More computation time
- But will likely converge

**Too large ($\alpha$ = 10):**
- Overshoots the minimum
- Oscillates around the minimum
- May diverge (cost increases)
- Algorithm fails to converge

**Just right ($\alpha$ = 0.01):**
- Steady decrease in cost
- Converges in reasonable iterations
- Smooth convergence curve

**How to choose:**
1. Start small (0.001) and increase
2. Try values on log scale: 0.001, 0.003, 0.01, 0.03, 0.1, 0.3
3. Plot cost vs iterations - look for smooth decrease
4. Use learning rate schedules (decay over time)
5. Use adaptive methods (Adam, RMSProp)
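The three regimes are easy to reproduce on a one-dimensional quadratic. This is a minimal sketch, assuming the toy cost $J(x) = x^2$ (so each update multiplies $x$ by $1 - 2\alpha$); the helper `run_gd` and the three learning rates are chosen for illustration and do not carry over to other problems.

```python
def run_gd(lr, x0=1.0, n_steps=50):
    """Run GD on J(x) = x^2 and return the cost after each step."""
    x, costs = x0, []
    for _ in range(n_steps):
        x = x - lr * 2 * x      # gradient of x^2 is 2x
        costs.append(x ** 2)
    return costs

# On this cost: 0.001 crawls, 0.1 converges quickly, 1.1 diverges
for lr in (0.001, 0.1, 1.1):
    print(f"lr={lr}: final cost = {run_gd(lr)[-1]:.3e}")
```

Plotting the returned list against the iteration index gives the cost-vs-iterations curve described in step 3.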

### Q3: What are the three types of Gradient Descent?

**Answer:**

**1. Batch Gradient Descent:**
- Uses ALL training samples per update
- Computes exact gradient
- Smooth convergence
- Slow for large datasets

```python
# Batch GD
gradient = (1/m) * X.T @ (X @ theta - y)  # all m samples
theta = theta - lr * gradient
```

**2. Stochastic Gradient Descent (SGD):**
- Uses ONE random sample per update
- Noisy gradient estimate
- Faster updates, but noisy convergence
- Can escape local minima (noise helps)
- Good for online learning

```python
# SGD
i = np.random.randint(m)  # pick one sample
gradient = X[i:i+1].T @ (X[i:i+1] @ theta - y[i:i+1])
theta = theta - lr * gradient
```

**3. Mini-Batch Gradient Descent:**
- Uses a small batch (32, 64, 128 samples) per update
- Best of both worlds
- Most commonly used in practice
- Leverages GPU parallelism

```python
# Mini-Batch GD
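batch_size = 32
idx = np.random.choice(m, size=batch_size, replace=False)  # sample a batch without replacement
X_batch, y_batch = X[idx], y[idx]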
gradient = (1/batch_size) * X_batch.T @ (X_batch @ theta - y_batch)
theta = theta - lr * gradient
```

**Comparison:**

| Type | Speed | Stability | Memory | Usage |
|------|-------|-----------|--------|-------|
| Batch | Slow | Stable | High | Small data |
| SGD | Fast | Noisy | Low | Online |
| Mini-Batch | Medium | Medium | Medium | Most common |
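In practice the mini-batch variant is usually run in epochs, reshuffling the data once per pass. The following is a sketch under stated assumptions: the function name `minibatch_gd`, the synthetic noise-free data, and all hyperparameters are made up for the example.

```python
import numpy as np

def minibatch_gd(X, y, lr=0.1, batch_size=32, n_epochs=50, seed=0):
    """Mini-batch GD for linear regression; X must already include a bias column."""
    rng = np.random.default_rng(seed)
    m, n = X.shape
    theta = np.zeros(n)
    for _ in range(n_epochs):
        order = rng.permutation(m)                 # reshuffle every epoch
        for start in range(0, m, batch_size):
            idx = order[start:start + batch_size]  # last batch may be smaller
            X_b, y_b = X[idx], y[idx]
            grad = X_b.T @ (X_b @ theta - y_b) / len(idx)
            theta -= lr * grad
    return theta

# Synthetic, noise-free data generated from y = 4 + 3x
rng = np.random.default_rng(42)
x = rng.uniform(-1, 1, size=200)
X = np.c_[np.ones(200), x]
y = 4 + 3 * x
theta = minibatch_gd(X, y)
print(theta)  # close to [4, 3]
```

Setting `batch_size=1` gives SGD and `batch_size=m` gives batch GD, so the same loop covers all three rows of the table.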

### Q4: Why is feature scaling important for Gradient Descent?

**Answer:**

**Without scaling:**
- Features on different scales create elongated contours in the cost function
- Gradient descent oscillates (zigzags) across the narrow dimension
- Takes many more iterations to converge
- May need very small learning rate

**With scaling:**
- Contours become more circular
- Gradient points more directly toward the minimum
- Converges much faster
- Same learning rate works well for all features

**Example:**
- Feature 1: House size (0 - 5000 sq ft)
- Feature 2: Number of bedrooms (1 - 5)
- Without scaling, the gradient is dominated by the larger scale feature

**Common scaling methods:**
1. **Standardization (Z-score):** $x' = \frac{x - \mu}{\sigma}$ (mean=0, std=1)
2. **Min-Max normalization:** $x' = \frac{x - x_{min}}{x_{max} - x_{min}}$ (range 0-1)

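**Note:** Feature scaling is NOT needed for the Normal Equation to reach the correct solution, although badly scaled features can still make $X^TX$ numerically ill-conditioned.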
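Standardization takes only a few lines of NumPy. In this sketch the helper name `standardize` and the house data are hypothetical; computing the statistics on the training set only and reusing them for new data is common practice rather than something stated above.

```python
import numpy as np

def standardize(X_train, X_test):
    """Z-score both sets using statistics from the training set only."""
    mu = X_train.mean(axis=0)
    sigma = X_train.std(axis=0)
    sigma = np.where(sigma == 0, 1.0, sigma)   # guard against constant columns
    return (X_train - mu) / sigma, (X_test - mu) / sigma

# Hypothetical house data: [size in sq ft, number of bedrooms]
X_train = np.array([[1000.0, 2], [2000.0, 3], [3000.0, 4], [4000.0, 5]])
X_test = np.array([[2500.0, 3]])
X_train_s, X_test_s = standardize(X_train, X_test)
print(X_train_s.mean(axis=0), X_train_s.std(axis=0))  # about 0 and 1 per column
```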

## Intermediate Questions

### Q5: How do you know if Gradient Descent has converged?

**Answer:**

**Methods to check convergence:**

1. **Plot cost vs iterations:**
   - Cost should decrease with each iteration
   - Curve flattens when converged
   - If cost increases, learning rate is too high

2. **Automatic convergence test:**
   - Stop when cost decreases by less than a threshold $\epsilon$
   - Example: Stop if $J(\theta^{(t-1)}) - J(\theta^{(t)}) < 10^{-6}$

3. **Gradient magnitude:**
   - At minimum, gradient approaches zero
   - Stop when $||\nabla J(\theta)|| < \epsilon$

4. **Parameter change:**
   - Stop when $||\theta^{(t)} - \theta^{(t-1)}|| < \epsilon$

**Red flags:**
- Cost increases -> learning rate too high
- Cost fluctuates wildly -> learning rate too high or SGD noise
- Cost decreases very slowly -> learning rate too low
- Cost plateaus early -> may be stuck in local minimum (not an issue for linear regression since MSE is convex)
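The automatic convergence test can be folded directly into the training loop. This is one possible sketch, not a canonical implementation: `gd_with_stopping`, the tolerance, and the synthetic data are assumptions for illustration.

```python
import numpy as np

def gd_with_stopping(X, y, lr=0.1, tol=1e-8, max_iter=10_000):
    """Batch GD that stops once the per-iteration cost decrease drops below tol."""
    m, n = X.shape
    theta = np.zeros(n)
    prev_cost = np.inf
    for it in range(max_iter):
        errors = X @ theta - y
        cost = (errors @ errors) / (2 * m)
        if cost > prev_cost:                 # red flag: cost went up
            raise RuntimeError("cost increased; try a smaller learning rate")
        if prev_cost - cost < tol:           # J(t-1) - J(t) < epsilon
            return theta, it
        prev_cost = cost
        theta -= lr * (X.T @ errors) / m
    return theta, max_iter

# Synthetic, noise-free data generated from y = 1 + 2x
rng = np.random.default_rng(0)
x = rng.uniform(-1, 1, size=100)
X = np.c_[np.ones(100), x]
y = 1 + 2 * x
theta, n_used = gd_with_stopping(X, y)
print(theta, n_used)
```

Swapping the cost test for `np.linalg.norm(gradient) < tol` gives the gradient-magnitude variant.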

### Q6: What is the difference between convex and non-convex cost functions? Why does it matter?

**Answer:**

**Convex function:**
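- Every local minimum is a global minimum (a strictly convex function has at most one minimizer)
- No suboptimal local minima
- Bowl-shaped (the line segment between any two points on the graph lies on or above the graph)
- **Linear regression MSE is convex**
- Gradient descent with a suitable learning rate converges to a global minimum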

**Non-convex function:**
- Has multiple local minima
- Gradient descent may get stuck in a local minimum
- Example: Neural network cost functions
- Need techniques like momentum, random restarts, simulated annealing

**Why it matters for linear regression:**
- MSE with linear model is **always convex**
- No local minima to worry about
- GD will always converge to global minimum (with proper learning rate)
- Both Normal Equation and GD give same result

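**Mathematical proof of convexity:**
- Second derivative (Hessian) of $J(\theta) = \frac{1}{2m}\|X\theta - y\|^2$ is $\frac{1}{m}X^TX$ (the factor becomes $\frac{2}{m}$ if the cost is defined without the $\frac{1}{2}$)
- $X^TX$ is positive semi-definite, because $v^TX^TXv = \|Xv\|^2 \ge 0$ for every vector $v$
- Therefore MSE is convex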
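The positive semi-definiteness claim can be checked numerically. Because the Hessian is a positive multiple of $X^TX$, it is enough to inspect $X^TX$ itself. The random matrix and the variable names below are assumptions for the sake of the example.

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(50, 3))
G = X.T @ X   # the Hessian is a positive multiple of this matrix

# eigvalsh is intended for symmetric matrices and returns real eigenvalues
eigenvalues = np.linalg.eigvalsh(G)
print("eigenvalues:", eigenvalues)
print("PSD:", bool(np.all(eigenvalues >= -1e-10)))

# The quadratic form v^T G v equals ||Xv||^2, so it cannot be negative
v = rng.normal(size=3)
print(v @ G @ v, np.sum((X @ v) ** 2))
```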

### Q7: Explain the gradient computation step-by-step for linear regression.

**Answer:**

**Cost function:**
$$J(\theta) = \frac{1}{2m}\sum_{i=1}^{m}(h_\theta(x^{(i)}) - y^{(i)})^2$$

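Where $h_\theta(x^{(i)}) = \theta^T x^{(i)}$ for a single sample; stacking all samples as the rows of $X$ gives the prediction vector $X\theta$.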

**Step 1 - Compute predictions:**
$$\hat{y} = X\theta$$

**Step 2 - Compute errors:**
$$e = \hat{y} - y = X\theta - y$$

**Step 3 - Compute gradient:**

Take the partial derivative of $J$ with respect to $\theta_j$:

$$\frac{\partial J}{\partial \theta_j} = \frac{1}{m}\sum_{i=1}^{m}(h_\theta(x^{(i)}) - y^{(i)})x_j^{(i)}$$

In vectorized form:
$$\nabla J(\theta) = \frac{1}{m}X^T(X\theta - y)$$

**Step 4 - Update parameters:**
$$\theta := \theta - \alpha \cdot \nabla J(\theta)$$

**Python code:**
```python
predictions = X @ theta           # Step 1
errors = predictions - y          # Step 2
gradient = (1/m) * X.T @ errors   # Step 3
theta = theta - alpha * gradient  # Step 4
```
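A common way to validate a hand-derived gradient is to compare it against central finite differences. The sketch below assumes random data; the helper names `cost`, `analytic_gradient`, and `numerical_gradient` are introduced here for illustration.

```python
import numpy as np

def cost(theta, X, y):
    errors = X @ theta - y
    return (errors @ errors) / (2 * len(y))

def analytic_gradient(theta, X, y):
    return X.T @ (X @ theta - y) / len(y)

def numerical_gradient(theta, X, y, eps=1e-6):
    """Central differences: perturb one coordinate of theta at a time."""
    grad = np.zeros_like(theta)
    for j in range(len(theta)):
        step = np.zeros_like(theta)
        step[j] = eps
        grad[j] = (cost(theta + step, X, y) - cost(theta - step, X, y)) / (2 * eps)
    return grad

rng = np.random.default_rng(1)
X = rng.normal(size=(20, 3))
y = rng.normal(size=20)
theta = rng.normal(size=3)
diff = np.max(np.abs(analytic_gradient(theta, X, y) - numerical_gradient(theta, X, y)))
print(diff)  # should be tiny
```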

### Q8: What is the computational complexity of Gradient Descent vs Normal Equation?

**Answer:**

**Normal Equation:**
- $X^TX$ computation: $O(mn^2)$
- Solving the $n \times n$ system (or inverting $X^TX$): $O(n^3)$
- Total: $O(mn^2 + n^3)$; the $n^3$ term dominates only when $n > m$
- Where n = number of features, m = number of samples
- **Bottleneck: n** (cost grows quadratically to cubically in n)

**Gradient Descent (one iteration):**
- Matrix multiplication $X\theta$: $O(mn)$
- Gradient $X^T \cdot errors$: $O(mn)$
- Total per iteration: $O(mn)$
- Total for k iterations: $O(kmn)$
- **Bottleneck: number of iterations**

**Comparison:**
- When n is small: Normal Equation wins (one-shot, no learning rate to tune)
- When n is large (>10,000): GD usually wins ($O(kmn)$ total vs $O(mn^2 + n^3)$)
- When m is very large: GD can use mini-batches

**Memory:**
- Normal Equation: Must store $X^TX$ ($n \times n$ matrix)
- GD: Only needs current batch in memory
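Both approaches can be run side by side on a small problem to confirm they agree. This is a sketch with made-up data and hyperparameters. It uses `np.linalg.solve` rather than an explicit inverse, a common numerical choice that is not prescribed above.

```python
import numpy as np

rng = np.random.default_rng(0)
m, n = 500, 3
X = np.c_[np.ones(m), rng.normal(size=(m, n))]   # bias column + n features
true_theta = np.array([1.0, 2.0, -3.0, 0.5])     # hypothetical ground truth
y = X @ true_theta + 0.1 * rng.normal(size=m)

# Normal Equation: solve (X^T X) theta = X^T y without forming the inverse
theta_ne = np.linalg.solve(X.T @ X, X.T @ y)

# Batch gradient descent: k iterations at O(mn) each
theta_gd = np.zeros(n + 1)
for _ in range(2000):
    theta_gd -= 0.1 * X.T @ (X @ theta_gd - y) / m

print(np.max(np.abs(theta_ne - theta_gd)))  # should be tiny
```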

## Advanced Questions

### Q9: What are learning rate schedules? Name a few.

**Answer:**

Learning rate schedules **reduce the learning rate during training** to improve convergence.

**Why?** Start with large steps (fast progress), end with small steps (precision).

**Common schedules:**

1. **Step Decay:**
   - Reduce by factor every N epochs
   - $\alpha_t = \alpha_0 \cdot \gamma^{\lfloor t/N \rfloor}$

2. **Exponential Decay:**
   - $\alpha_t = \alpha_0 \cdot e^{-kt}$

3. **1/t Decay:**
   - $\alpha_t = \frac{\alpha_0}{1 + kt}$

4. **Cosine Annealing:**
   - $\alpha_t = \alpha_{min} + \frac{1}{2}(\alpha_{max} - \alpha_{min})(1 + \cos(\frac{t\pi}{T}))$

**Adaptive optimizers** (automatically adjust learning rate):
- **AdaGrad**: Adapts per-parameter based on historical gradients
- **RMSProp**: Uses exponential moving average of squared gradients
- **Adam**: Combines momentum + RMSProp (most popular)

**For linear regression:** Usually a fixed learning rate with feature scaling is sufficient.
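The four schedule formulas translate directly into small functions. The function names and default constants in this sketch are assumptions; only the formulas come from the list above.

```python
import math

def step_decay(t, lr0=0.1, gamma=0.5, N=10):
    return lr0 * gamma ** (t // N)            # floor(t / N) drops so far

def exponential_decay(t, lr0=0.1, k=0.05):
    return lr0 * math.exp(-k * t)

def inverse_time_decay(t, lr0=0.1, k=0.05):
    return lr0 / (1 + k * t)

def cosine_annealing(t, T=100, lr_min=0.0, lr_max=0.1):
    return lr_min + 0.5 * (lr_max - lr_min) * (1 + math.cos(math.pi * t / T))

for t in (0, 10, 50, 100):
    print(t, step_decay(t), exponential_decay(t),
          inverse_time_decay(t), cosine_annealing(t))
```

Inside a training loop, the fixed `lr` would be replaced by a call such as `step_decay(epoch)`.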

### Q10: Implement Gradient Descent from scratch in a coding interview.

**Answer:**

```python
import numpy as np

class LinearRegressionGD:
    def __init__(self, learning_rate=0.01, n_iterations=1000):
        self.lr = learning_rate
        self.n_iter = n_iterations
        self.theta = None
        self.cost_history = []
    
    def fit(self, X, y):
        m, n = X.shape
        X_b = np.c_[np.ones((m, 1)), X]   # add intercept
        self.theta = np.zeros((n + 1, 1))  # init to zeros
        y = y.reshape(-1, 1)
        
        for _ in range(self.n_iter):
            predictions = X_b @ self.theta
            errors = predictions - y
            gradient = (1/m) * X_b.T @ errors
            self.theta -= self.lr * gradient
            cost = (1/(2*m)) * np.sum(errors**2)  # cost at theta before this update
            self.cost_history.append(cost)
        
        return self
    
    def predict(self, X):
        m = X.shape[0]
        X_b = np.c_[np.ones((m, 1)), X]
        return (X_b @ self.theta).flatten()
    
    def score(self, X, y):
        y = np.asarray(y).ravel()  # match predict()'s 1-D output; avoids (m,1)-(m,) broadcasting
        y_pred = self.predict(X)
        ss_res = np.sum((y - y_pred)**2)
        ss_tot = np.sum((y - y.mean())**2)
        return 1 - (ss_res / ss_tot)
```

**Key points to mention:**
- Feature scaling before training
- Initialize weights to zeros
- Vectorized operations (no loops over features)
- Track cost history for debugging
- Could add early stopping for efficiency

### Q11: What is momentum in Gradient Descent?

**Answer:**

**Momentum** accelerates gradient descent by adding a fraction of the previous update to the current one.

**Standard GD:**
$$\theta := \theta - \alpha \nabla J(\theta)$$

**GD with Momentum:**
$$v_t = \beta v_{t-1} + \alpha \nabla J(\theta)$$
$$\theta := \theta - v_t$$

Where $\beta$ is the momentum coefficient (typically 0.9).

**Analogy:** A ball rolling downhill accumulates velocity. Momentum lets the update accumulate speed in consistent directions.

**Benefits:**
- Faster convergence (especially in narrow valleys)
- Dampens oscillations across steep dimensions
- Helps escape shallow local minima
- Reduces zigzagging with SGD

**Used in:** SGD with momentum, Adam optimizer, Nesterov Accelerated Gradient
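The effect in a narrow valley can be seen on an ill-conditioned quadratic. The sketch below follows the two momentum equations above; the function name `gd_momentum`, the cost $J = \frac{1}{2}(x^2 + 25y^2)$, and the hyperparameters are assumptions chosen so the difference is visible.

```python
import numpy as np

def gd_momentum(grad_fn, theta0, lr=0.01, beta=0.9, n_steps=200):
    """GD with momentum; beta=0 reduces to standard GD."""
    theta = np.asarray(theta0, dtype=float).copy()
    v = np.zeros_like(theta)
    for _ in range(n_steps):
        v = beta * v + lr * grad_fn(theta)   # v_t = beta * v_{t-1} + alpha * grad
        theta = theta - v                    # theta := theta - v_t
    return theta

# Narrow valley: J = 0.5 * (x^2 + 25 * y^2), gradient = [x, 25 * y]
scales = np.array([1.0, 25.0])

def grad_fn(theta):
    return scales * theta

theta_plain = gd_momentum(grad_fn, [10.0, 1.0], beta=0.0)
theta_mom = gd_momentum(grad_fn, [10.0, 1.0], beta=0.9)
# Distance from the minimum at the origin after the same number of steps
print(np.linalg.norm(theta_plain), np.linalg.norm(theta_mom))
```

With the same learning rate and step budget, the momentum run ends much closer to the minimum because velocity builds up along the shallow $x$ direction.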

### Q12: What is the difference between Gradient Descent and Stochastic Gradient Descent?

**Answer:**

| Aspect | Batch GD | SGD |
|--------|----------|-----|
| **Samples per update** | All m samples | 1 sample |
| **Gradient quality** | Exact | Noisy estimate |
| **Convergence path** | Smooth | Noisy / Zigzag |
| **Cost per update** | $O(mn)$ (slow) | $O(n)$ (fast) |
| **Memory** | High | Low |
| **Online learning** | No | Yes |
| **Local minima** | Can get stuck | Noise helps escape |

**SGD advantages:**
- Much faster per iteration
- Can handle datasets that don't fit in memory
- Noise acts as implicit regularization
- Can update model as new data arrives

**SGD challenges:**
- Noisy updates (may not converge smoothly)
- Need learning rate schedule for convergence
- Final solution oscillates around minimum

**In practice:** Mini-batch GD (batch_size = 32-256) is most common. It balances the benefits of both approaches.

### Q13: Can Gradient Descent get stuck in local minima for Linear Regression?

**Answer:**

**No**, not for linear regression.

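**Reason:** The MSE cost function for linear regression is **convex** (bowl-shaped). For a convex function every local minimum is also a global minimum, so there are no suboptimal local minima to get stuck in. (If $X^TX$ is singular, for example with perfectly collinear features, there can be many minimizers, but they all attain the same minimum cost.)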

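**Proof:** With the cost $J(\theta) = \frac{1}{2m}\|X\theta - y\|^2$, the Hessian matrix (second derivative) is:
$$H = \frac{1}{m}X^TX$$

$X^TX$ is always positive semi-definite (since $v^TX^TXv = \|Xv\|^2 \ge 0$ for every $v$), which means the cost function is convex. If the cost is defined without the $\frac{1}{2}$, the factor is $\frac{2}{m}$, which does not change the conclusion.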

**However, GD can still fail to converge if:**
- Learning rate is too large (diverges)
- Not enough iterations
- Numerical issues (overflow/underflow)

**Note:** For neural networks and other non-linear models, the cost function IS non-convex, and getting stuck in local minima (or saddle points) is a real concern.
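A quick empirical check for the linear regression case is to start batch GD from two unrelated random initializations and compare the results. The data, the helper `batch_gd`, and the hyperparameters in this sketch are assumptions for illustration.

```python
import numpy as np

def batch_gd(X, y, theta0, lr=0.1, n_iter=2000):
    """Plain batch GD on the MSE cost, starting from theta0."""
    theta = theta0.copy()
    m = len(y)
    for _ in range(n_iter):
        theta -= lr * X.T @ (X @ theta - y) / m
    return theta

rng = np.random.default_rng(3)
X = np.c_[np.ones(100), rng.normal(size=(100, 2))]
y = rng.normal(size=100)

# Two very different starting points
theta_a = batch_gd(X, y, rng.normal(scale=10, size=3))
theta_b = batch_gd(X, y, rng.normal(scale=10, size=3))
print(np.max(np.abs(theta_a - theta_b)))  # should be tiny
```

Both runs land on the same parameters here because $X$ has full column rank, so the minimizer is unique.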

### Q14: What is the vanishing/exploding gradient problem? Is it relevant to Linear Regression?

**Answer:**

**Vanishing Gradient:**
- Gradients become extremely small during backpropagation
- Parameters barely update
- Training stalls

**Exploding Gradient:**
- Gradients become extremely large
- Parameters update too aggressively
- Numerical overflow, NaN values

**Relevant to Linear Regression?**

**Vanishing gradients: No.** Linear regression has no activation functions or deep layers that cause gradients to diminish.

**Exploding gradients: Partially.** Can happen if:
- Features are not scaled (very large values)
- Learning rate is too high

**Solution for linear regression:**
- Feature scaling (standardization)
- Appropriate learning rate
- Gradient clipping (cap gradient magnitude)

**This problem is mainly a deep learning concern** (deep neural networks with many layers).
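Gradient clipping by norm, mentioned in the solutions above, can be sketched in a few lines. The helper name `clip_by_norm` and the threshold are assumptions; deep learning libraries ship their own clipping utilities.

```python
import numpy as np

def clip_by_norm(grad, max_norm):
    """Rescale grad so its L2 norm is at most max_norm; direction is unchanged."""
    norm = np.linalg.norm(grad)
    if norm > max_norm:
        return grad * (max_norm / norm)
    return grad

g = np.array([300.0, -400.0])             # L2 norm is 500
clipped = clip_by_norm(g, max_norm=5.0)
print(clipped, np.linalg.norm(clipped))   # same direction, norm capped at 5
```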

### Q15: Explain the Adam optimizer. Why is it so popular?

**Answer:**

**Adam (Adaptive Moment Estimation)** combines two ideas:
1. **Momentum** (first moment: running average of gradients)
2. **RMSProp** (second moment: running average of squared gradients)

**Algorithm:**
$$m_t = \beta_1 m_{t-1} + (1-\beta_1)g_t$$
$$v_t = \beta_2 v_{t-1} + (1-\beta_2)g_t^2$$
$$\hat{m}_t = \frac{m_t}{1-\beta_1^t}, \quad \hat{v}_t = \frac{v_t}{1-\beta_2^t}$$
$$\theta := \theta - \frac{\alpha}{\sqrt{\hat{v}_t} + \epsilon} \hat{m}_t$$

**Default hyperparameters:** $\beta_1 = 0.9$, $\beta_2 = 0.999$, $\epsilon = 10^{-8}$

**Why popular:**
- Adaptive per-parameter learning rates
- Works well out-of-the-box with default settings
- Combines benefits of momentum and RMSProp
- Handles sparse gradients well
- Bias correction for initial iterations
- Fast convergence

**For linear regression:** Adam is overkill. Simple batch GD works fine. Adam shines in deep learning.
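The four update equations map almost line by line onto NumPy. This is a minimal sketch rather than a reference implementation: the function name `adam`, the toy quadratic, the learning rate of 0.1, and the step count are assumptions for the example.

```python
import numpy as np

def adam(grad_fn, theta0, lr=0.1, beta1=0.9, beta2=0.999, eps=1e-8, n_steps=500):
    theta = np.asarray(theta0, dtype=float).copy()
    m = np.zeros_like(theta)   # first moment (running mean of gradients)
    v = np.zeros_like(theta)   # second moment (running mean of squared gradients)
    for t in range(1, n_steps + 1):
        g = grad_fn(theta)
        m = beta1 * m + (1 - beta1) * g
        v = beta2 * v + (1 - beta2) * g ** 2
        m_hat = m / (1 - beta1 ** t)   # bias correction for the zero init
        v_hat = v / (1 - beta2 ** t)
        theta = theta - lr * m_hat / (np.sqrt(v_hat) + eps)
    return theta

# Toy quadratic J = ||theta - target||^2 with minimum at (1, -2)
target = np.array([1.0, -2.0])
theta = adam(lambda th: 2 * (th - target), theta0=[0.0, 0.0])
print(theta)  # close to target
```

Note that `t` starts at 1, so the bias-correction denominators are never zero.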