# Theory Questions (Total: 12)

[MSQ]

### 1. You train two models A and B on the same dataset. Model A uses an optimizer that reaches lower empirical risk faster but exhibits a sharp minimum around its final parameters. Model B converges more slowly to a slightly higher training loss but ends in a wider, flatter basin. Assume identical architectures, data, and regularization. Which statements are most consistent with an expert understanding of optimization strategies and generalization?

- [ ] A. Model A is guaranteed to generalize better because its training loss is lower.
- [x] B. The wider basin around Model B’s solution can correlate with better generalization due to robustness to small parameter perturbations.
- [x] C. Minimizing empirical risk efficiently is not the same as minimizing true risk; optimization is a means to the generalization end.
- [ ] D. If two solutions achieve similar validation loss, the solution with the sharper Hessian spectrum is typically preferred.

Ans: Optimization targets *empirical risk*; deep learning cares about *true risk*. Flatter minima (wider basins, smaller leading Hessian eigenvalues) often correlate with better generalization, whereas sharp minima can be brittle. Lower training loss alone does not guarantee better generalization.


[MCQ]

### 2. Consider high-dimensional nonconvex loss landscapes typical in deep learning. At stationary points $ (∇f(x)=0) $, which is most likely to occur as dimensionality grows?

- [ ] A. Local minima dominate because second derivatives vanish.
- [x] B. Saddle points proliferate since mixed-sign Hessian eigenvalues are statistically more common than all-positive spectra.
- [ ] C. Global minima dominate because of the concentration of measure.
- [ ] D. Points with negative semidefinite Hessians dominate due to overparameterization.

Ans: In high dimensions, random Hessian spectra are more likely to have mixed signs (saddles) than be strictly positive definite (local minima), making saddle points far more common and a practical obstacle for first-order methods.


[MCQ]

### 3. For a differentiable convex objective f over a convex domain, which statement is most accurate?

- [ ] A. A global minimum must have strictly positive eigenvalues for the Hessian.
- [x] B. Any local minimum is also a global minimum.
- [ ] C. A local maximum is also a global maximum.
- [ ] D. If $ ∇²f(x) $ has a single negative eigenvalue at a stationary point, it remains a global minimum by continuity.

Ans: Convexity guarantees that any local minimum is global. Positive semidefiniteness (not strictly positive) of the Hessian is the differential test for convexity; any negative eigenvalue contradicts convexity.


[MCQ]

### 4. Suppose f is differentiable and you take a small step $ xₜ₊₁ = xₜ − η ∇f(xₜ) $. Using a first-order Taylor approximation, what is the approximate change $ Δf = f(xₜ₊₁) − f(xₜ) $ expressed in terms of $ η $ and the gradient norm at $ xₜ $? (Give the leading-order term; ignore higher-order terms.)

- [ ] $ \eta \nabla f(x_t)^2 $
- [x] $ -\eta \|\nabla f(x_t)\|^2 $
- [ ] $ \eta \|\nabla f(x_t)\|^2 $
- [ ] $ -\eta \nabla f(x_t)^2 $

Ans: First-order expansion: $ f(x+\epsilon)\approx f(x)+\epsilon^\top \nabla f(x) $. With $ \epsilon=-\eta \nabla f(x) $, we get $ Δf \approx -\eta \|\nabla f(x)\|^2 $.
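The approximation can be checked numerically. The sketch below uses a hypothetical quadratic `f` (not part of the question) and a small step size, and compares the actual change in `f` with the first-order prediction.

```python
# Hypothetical test function, chosen only for illustration:
# f(x) = x0^2 + 3*x1^2, with gradient (2*x0, 6*x1).
def f(x):
    return x[0] ** 2 + 3.0 * x[1] ** 2

def grad_f(x):
    return [2.0 * x[0], 6.0 * x[1]]

x = [1.0, -0.5]
eta = 1e-3  # small step, so higher-order terms are negligible

g = grad_f(x)
x_next = [xi - eta * gi for xi, gi in zip(x, g)]

actual_delta = f(x_next) - f(x)                     # true change in f
predicted_delta = -eta * sum(gi * gi for gi in g)   # -eta * ||grad||^2

print("actual:   ", actual_delta)
print("predicted:", predicted_delta)
```

The two values agree closely for small `eta`; the gap grows with the step size because the second-order term scales with `eta` squared.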


[MSQ]

### 5. You train a network with saturating nonlinearities and observe that training stalls far from any known minimum. Which interventions directly address vanishing gradients from an optimization perspective?

- [x] A. Re-parameterize or initialize so pre-activations fall in non-saturating regions (e.g., careful init, normalization).
- [x] B. Use learning-rate schedules or warmup to traverse flatter regions with controlled steps.
- [ ] C. Increase model size to reduce training loss variance.
- [x] D. Switch to non-saturating activations where appropriate to enlarge gradient-carrying regions.

Ans: When derivatives approach zero (e.g., saturating tanh/sigmoid regimes), gradient signals vanish. Better initialization/normalization, suitable LR schedules, and non-saturating activations help maintain gradient flow and progress.


[MSQ]

### 6. You must minimize a convex $ f(x) $ subject to convex inequality constraints $ c_i(x) ≤ 0 $. Which methods are mathematically standard approaches to respect constraints during optimization?

- [x] A. Form a Lagrangian $ L(x,\alpha)=f(x)+\sum_i \alpha_i c_i(x) $ with $ \alpha_i\ge 0 $ and solve the saddle-point problem.
- [x] B. Add penalties $ \sum_i \alpha_i c_i(x) $ to the objective (carefully tuned) to discourage violations.
- [x] C. After each unconstrained step, project back to the feasible set via $  \mathrm{Proj}_X(x) $.
- [ ] D. Replace f with any strongly concave surrogate; constraints then become irrelevant.

Ans: Lagrangian duality, penalty methods, and projected updates are canonical strategies for convex constrained optimization; option D is incorrect.


[MCQ]

### 7. Let f be convex and X a random variable. Which statement is correct?

- [ ] A. $ \mathbb{E}[f(X)] \le f(\mathbb{E}[X]) $
- [x] B. $ \mathbb{E}[f(X)] \ge f(\mathbb{E}[X]) $
- [ ] C. The inequality direction depends on the variance of X.
- [ ] D. The inequality holds only if f is twice differentiable and strongly convex.

Ans: Jensen’s inequality states that for convex f, the expectation of f(X) is at least f of the expectation. Differentiability or strong convexity is not required.
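A quick numeric illustration, assuming a hypothetical five-point distribution for X and the convex function f(x) = x²:

```python
# Hypothetical, equally likely values of X (illustration only).
samples = [-2.0, -1.0, 0.0, 1.0, 4.0]

def f(x):
    return x * x  # convex

mean_x = sum(samples) / len(samples)                 # E[X]
mean_fx = sum(f(s) for s in samples) / len(samples)  # E[f(X)]

print("E[f(X)] =", mean_fx)
print("f(E[X]) =", f(mean_x))
```

For this particular f, the gap between the two quantities is the variance of X, which is why it is never negative.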


[MSQ]

### 8. You must choose among learning-rate schedules (constant LR, step decay, cosine decay, warmup) and among optimizers (vanilla SGD, momentum/Nesterov, and adaptive methods such as RMSProp/Adam). Which statements best reflect a sound understanding of optimization for large-scale training?

- [x] A. Warmup can stabilize early training when gradients are noisy or scales are poorly calibrated.
- [x] B. Cosine or step decays can help escape plateaus and improve late-stage convergence compared to a fixed LR.
- [x] C. Momentum/Nesterov can mitigate curvature-induced zig-zagging and accelerate along low-curvature valleys.
- [ ] D. Adaptive methods strictly dominate SGD with momentum for final generalization across all tasks.

Ans: Schedules shape training dynamics across regimes; momentum-like methods accelerate and smooth updates. No method dominates universally; adaptive methods sometimes underperform well-tuned SGD with momentum in final generalization.


[MCQ]

### 9. Which statement about batch size and noise in updates is most accurate?

- [x] A. Smaller batches introduce gradient noise that can act as an implicit regularizer and may help avoid sharp minima.
- [ ] B. Full-batch GD is always better because it uses the exact gradient.
- [ ] C. Mini-batch noise guarantees convergence to the global minimum in nonconvex problems.
- [ ] D. Larger batches always generalize better due to higher signal-to-noise ratio.

Ans: Stochasticity can help exploration and avoid sharp or poor basins; there is no universal guarantee of global optimality in nonconvex settings, and larger batches do not always generalize better.


[MSQ]

### 10. Training loss keeps decreasing; validation loss decreases initially but then rises and stays high. Which interventions are consistent with sound optimization practice?

- [x] A. Increase data augmentation and/or apply dropout/weight decay to narrow the generalization gap.
- [ ] B. Increase the learning rate late in training to force convergence to a better basin.
- [x] C. Use LR decay to refine solutions after initial progress, complementing regularization.
- [ ] D. Remove all regularization to let the optimizer fully minimize training loss.

Ans: The pattern indicates overfitting; regularization and appropriate LR schedules help generalization. Raising LR late or removing regularization typically worsens generalization.


[MCQ]

### 11. For gradient descent on a differentiable scalar objective f, what vector direction (in terms of $ ∇f(x) $) yields the steepest first-order decrease in f at x? (Give the unit vector direction.)

- [ ] $ -\frac{\|\nabla f(x)\|}{\nabla f(x)} $
- [ ] $ \frac{\|\nabla f(x)\|}{\nabla f(x)} $
- [x] $ -\frac{\nabla f(x)}{\|\nabla f(x)\|} $
- [ ] $ \frac{\nabla f(x)}{\|\nabla f(x)\|} $

Ans: The inner product $ \epsilon^\top \nabla f(x) $ is minimized by moving opposite to the gradient with unit norm, giving steepest descent in first-order approximation.
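As a sanity check, the sketch below brute-forces over unit vectors in 2D for a hypothetical gradient `g = (3, 4)` and compares the best direction found with the normalized negative gradient.

```python
import math

g = [3.0, 4.0]  # hypothetical gradient, for illustration only
g_norm = math.hypot(*g)
steepest = [-gi / g_norm for gi in g]  # -grad / ||grad||

def directional_change(d):
    """First-order change in f for a unit step along direction d."""
    return sum(di * gi for di, gi in zip(d, g))

# Brute-force search over 3600 unit vectors on the circle.
angles = (2 * math.pi * k / 3600 for k in range(3600))
best = min(([math.cos(a), math.sin(a)] for a in angles), key=directional_change)

print("closed form:", steepest, directional_change(steepest))
print("brute force:", best, directional_change(best))
```

Both directions give a first-order change close to the negative gradient norm, which is the smallest value any unit vector can achieve (Cauchy-Schwarz).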


[MCQ]

### 12. Which statement is most reasonable for guiding optimizer choice, based on curvature?

- [x] A. If updates oscillate across narrow valleys, momentum/Nesterov or smaller LR can reduce zig-zagging due to anisotropic curvature.
- [ ] B. If gradients are small and flat everywhere, increasing LR indefinitely is best.
- [ ] C. If the landscape is flat, switching to saturating activations will improve gradients.
- [ ] D. If the Hessian is indefinite, full-batch gradients remove saddle points.

Ans: Zig-zagging indicates strong curvature anisotropy; momentum and careful LR help. Large LR in flat regions risks instability; saturating activations worsen gradients; full-batch does not remove saddles.


# Coding questions (Total: 12)

[MSQ]

```python
import torch
import torch.nn as nn
import torch.optim as optim

torch.manual_seed(0)

class Block(nn.Module):
    def __init__(self, d_in, d_hidden, d_out):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(d_in, d_hidden, bias=True),
            nn.ReLU(inplace=True),
            nn.LayerNorm(d_hidden),
            nn.Linear(d_hidden, d_out, bias=True)
        )
    def forward(self, x):
        return self.net(x)

model1 = Block(32, 128, 10)
model2 = Block(32, 128, 10)

X = torch.randn(512, 32)
y = torch.randint(0, 10, (512,))

crit = nn.CrossEntropyLoss()

opt_sgd = optim.SGD(model1.parameters(), lr=0.1, momentum=0.9, weight_decay=1e-4, nesterov=True)
opt_adamw = optim.AdamW(model2.parameters(), lr=3e-3, betas=(0.9, 0.999), weight_decay=1e-2)

def train_step(model, opt):
    opt.zero_grad(set_to_none=True)
    logits = model(X)
    loss = crit(logits, y)
    loss.backward()
    torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm=5.0)
    opt.step()
    return loss.item()

losses_sgd, losses_adamw = [], []
for t in range(20):
    losses_sgd.append(train_step(model1, opt_sgd))
    losses_adamw.append(train_step(model2, opt_adamw))

print(f"SGD last 3 losses: {losses_sgd[-3:]}")
print(f"AdamW last 3 losses: {losses_adamw[-3:]}")
```

### 1. Considering the code and standard optimizer behaviors, which statements are most accurate?

- [x] A. AdamW’s weight decay is decoupled from the gradient of the loss, unlike classic L2 penalty implemented via `weight_decay` in SGD.
- [x] B. With Nesterov momentum, the update uses the projected “lookahead” gradient, often reducing zig-zagging across narrow valleys.
- [ ] C. For equal effective step sizes, SGD with momentum is guaranteed to achieve a lower training loss than AdamW after the same number of steps.
- [x] D. Gradient clipping by global norm can mitigate occasional instability for both optimizers but does not replace a good learning-rate schedule.

Ans: AdamW applies decay in a decoupled manner; Nesterov uses a lookahead gradient; clipping stabilizes but doesn’t replace LR scheduling. No general guarantee that SGD+m dominates AdamW in steps-to-loss.
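To make the coupled/decoupled distinction concrete, here is a minimal scalar sketch of a single Adam-style update at t = 1. The helper `adam_first_step` and its inputs are hypothetical and simplified; this is not PyTorch's implementation, only the arithmetic the two conventions imply.

```python
import math

def adam_first_step(w, g, lr, wd, decoupled, beta1=0.9, beta2=0.999, eps=1e-8):
    """One Adam update at t = 1 for a scalar parameter (illustrative only)."""
    if decoupled:
        w = w * (1.0 - lr * wd)   # AdamW-style: shrink the weight directly
    else:
        g = g + wd * w            # L2-style: decay enters through the gradient
    m = (1 - beta1) * g
    v = (1 - beta2) * g * g
    m_hat = m / (1 - beta1)       # bias correction at t = 1
    v_hat = v / (1 - beta2)
    return w - lr * m_hat / (math.sqrt(v_hat) + eps)

w_l2 = adam_first_step(1.0, 0.5, lr=0.1, wd=0.01, decoupled=False)
w_adamw = adam_first_step(1.0, 0.5, lr=0.1, wd=0.01, decoupled=True)

print("coupled L2 :", w_l2)
print("decoupled  :", w_adamw)
```

In the coupled variant the decay term passes through Adam's normalization, so at this step it leaves the update size essentially unchanged (the weight moves by about `lr`). In the decoupled variant the shrinkage `lr * wd * w` is applied on top of the normalized step.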

[MCQ]

```python
import torch
import torch.nn as nn

torch.manual_seed(0)

w = torch.tensor([[1.0, -2.0],[3.0, -4.0]], requires_grad=True)
x = torch.tensor([[2.0, -1.0]], requires_grad=False)
target = torch.tensor([[1.0, -3.0]], requires_grad=False)

y = x @ w
loss = ((y - target)**2).sum()
loss.backward()

eta = 0.05
with torch.no_grad():
    grad_copy = w.grad.clone()
    w_new = w - eta * grad_copy

first_order_delta = (-eta * (grad_copy * grad_copy)).sum().item()
print("grad:\n", grad_copy)
print("first_order_delta:", first_order_delta)
```

### 2. If we interpret `first_order_delta` as a first-order Taylor approximation of Δloss due to the step, which statement is most correct?

- [ ] A. It should be positive because we subtract a positive step times positive gradients.
- [x] B. It should be non-positive, approximately $ -\eta \|\nabla \ell\|^2 $, reflecting a local decrease to first order.
- [ ] C. Its sign is undefined at first order; second-order terms dominate the sign.
- [ ] D. It equals exactly the true change in loss regardless of step size.

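Ans: The first-order approximation for a step $ -\eta \nabla \ell $ is $ -\eta \|\nabla \ell\|^2 $, which is non-positive. It equals the true change only when the loss is affine along the step. Here the loss is quadratic in `w`, so second-order terms make the true change differ, and the approximation is accurate only for small $ \eta $.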
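The gap between the first-order estimate and the true change can be seen with the same numbers as the snippet above. The sketch below re-does the computation in plain Python, using the analytic gradient of the squared error (an assumption that holds for this specific loss).

```python
x = [2.0, -1.0]
w = [[1.0, -2.0], [3.0, -4.0]]
target = [1.0, -3.0]
eta = 0.05

def predict(weights):
    # y = x @ weights for a single row vector x
    return [sum(x[i] * weights[i][j] for i in range(2)) for j in range(2)]

def loss_of(weights):
    y = predict(weights)
    return sum((y[j] - target[j]) ** 2 for j in range(2))

# Analytic gradient of the squared error: dL/dw[i][j] = 2 * x[i] * (y[j] - target[j])
y = predict(w)
grad = [[2.0 * x[i] * (y[j] - target[j]) for j in range(2)] for i in range(2)]

w_new = [[w[i][j] - eta * grad[i][j] for j in range(2)] for i in range(2)]

first_order = -eta * sum(gij * gij for row in grad for gij in row)
actual = loss_of(w_new) - loss_of(w)

print("first-order estimate:", first_order)
print("actual change:       ", actual)
```

With these values the estimate is about −13 while the loss actually drops by about 9.75 (from 13 to 3.25): the sign is right, but at `eta = 0.05` the curvature term is far from negligible.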


[NAT]

```python
import math

base_lr = 0.1
min_lr  = 0.0
warmup_steps = 5
total_steps  = 25

def lr_at(step):
    if step < warmup_steps:
        return base_lr * (step + 1) / warmup_steps
    t = (step - warmup_steps) / (total_steps - warmup_steps)
    return min_lr + 0.5 * (base_lr - min_lr) * (1 + math.cos(math.pi * t))

lr_12 = lr_at(12)
print("LR at step 12:", lr_12)
```

### 3. What learning rate does the code above print for step = 12? (Round to 3 decimal places.)

Ans: 0.073. After warmup (steps 0–4), cosine decay applies. With step = 12, t = (12 − 5)/(25 − 5) = 7/20 = 0.35. LR = 0.5 · 0.1 · (1 + cos(0.35π)) ≈ 0.05 · 1.45399 ≈ 0.072700.

[MSQ]

```python
import torch
import torch.nn as nn
import torch.optim as optim

torch.manual_seed(0)

class SmallNet(nn.Module):
    def __init__(self):
        super().__init__()
        self.l1 = nn.Linear(16, 64)
        self.act = nn.GELU()
        self.l2 = nn.Linear(64, 4)
    def forward(self, x):
        return self.l2(self.act(self.l1(x)))

net = SmallNet()
opt = optim.SGD(net.parameters(), lr=0.05, momentum=0.9)

x = torch.randn(32, 16)
y = torch.randint(0, 4, (32,))
loss_fn = nn.CrossEntropyLoss()

opt.zero_grad(set_to_none=True)
out = net(x)
loss = loss_fn(out, y)
loss.backward()

gnorm = torch.nn.utils.clip_grad_norm_(net.parameters(), max_norm=1.0)

for p in net.parameters():
    if p.grad is not None:
        p.grad.mul_(0.5)  

opt.step()

print("Global grad-norm observed before clamp (reported):", float(gnorm))
```

### 4. Which statements about the above gradient handling are correct?

- [x] A. `clip_grad_norm_` rescales all parameter gradients by a common factor if the global norm exceeds `max_norm`.
- [x] B. Post-clipping rescaling (multiplying by 0.5) reduces the effective step further and can change the intended clipping magnitude.
- [ ] C. `clip_grad_norm_` normalizes each parameter tensor independently to have unit norm.
- [x] D. The reported value from `clip_grad_norm_` is the norm before clipping was applied.

Ans: Global-norm clipping applies one scale factor; extra rescaling changes effective update; the API returns the pre-clipping global norm.
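The idea behind global-norm clipping can be sketched without PyTorch. The helper `clip_global_norm` below is hypothetical and only mirrors the concept (one shared scale factor, pre-clipping norm returned); the real `clip_grad_norm_` operates in place on tensors.

```python
import math

def global_norm(grads):
    """L2 norm over all entries of all gradient 'tensors' (lists of floats)."""
    return math.sqrt(sum(g * g for tensor in grads for g in tensor))

def clip_global_norm(grads, max_norm, eps=1e-6):
    """Return (clipped_grads, norm_before_clipping). Illustrative sketch only."""
    total = global_norm(grads)
    scale = min(1.0, max_norm / (total + eps))  # one factor shared by all tensors
    return [[g * scale for g in tensor] for tensor in grads], total

grads = [[3.0, 0.0], [0.0, 4.0]]          # global norm is 5
clipped, reported = clip_global_norm(grads, max_norm=1.0)
halved = [[0.5 * g for g in tensor] for tensor in clipped]  # extra post-clip rescale

print("reported (pre-clip) norm:", reported)
print("norm after clipping:     ", global_norm(clipped))
print("norm after extra 0.5x:   ", global_norm(halved))
```

The extra 0.5 factor means the optimizer sees a gradient of norm about 0.5, not the `max_norm` of 1.0 that the clipping call suggests.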

[MCQ]

```python
import torch

beta1, beta2, eps = 0.9, 0.999, 1e-8
g1, g2 = 0.3, -0.1  
m = 0.0
v = 0.0

# t = 1
m = beta1*m + (1-beta1)*g1            # m1
v = beta2*v + (1-beta2)*(g1*g1)       # v1

m_hat1 = m/(1-beta1**1)
v_hat1 = v/(1-beta2**1)

# t = 2
m = beta1*m + (1-beta1)*g2            # m2
v = beta2*v + (1-beta2)*(g2*g2)       # v2

m_hat2 = m/(1-beta1**2)
v_hat2 = v/(1-beta2**2)

print("m_hat1, v_hat1:", m_hat1, v_hat1)
print("m_hat2, v_hat2:", m_hat2, v_hat2)
```

### 5. Which statement best captures Adam’s bias-correction in the code?

- [ ] A. m_hat equals the raw exponential moving average m because correction cancels when beta1=0.9.
- [x] B. m_hat and v_hat divide by `(1 - beta^t)` to correct initialization bias toward zero at early steps.
- [ ] C. Bias correction increases the decay of m and v, making steps smaller than using raw m and v.
- [ ] D. Bias correction only affects v but not m.

Ans: Adam corrects the zero-initialization bias via division by 1−β^t for both first and second moments.
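As a worked check using the values in the code ($ \beta_1 = 0.9 $, $ \beta_2 = 0.999 $, $ g_1 = 0.3 $, $ g_2 = -0.1 $), the hand calculation below is rounded to the digits shown:

$$
\begin{aligned}
m_1 &= (1-\beta_1)\,g_1 = 0.03, & \hat m_1 &= \frac{m_1}{1-\beta_1} = 0.3 = g_1,\\
v_1 &= (1-\beta_2)\,g_1^2 = 9\times 10^{-5}, & \hat v_1 &= \frac{v_1}{1-\beta_2} = 0.09 = g_1^2,\\
m_2 &= 0.9\cdot 0.03 + 0.1\cdot(-0.1) = 0.017, & \hat m_2 &= \frac{0.017}{1-0.9^2} \approx 0.0895,\\
v_2 &= 0.999\cdot 9\times 10^{-5} + 0.001\cdot 0.01 = 9.991\times 10^{-5}, & \hat v_2 &= \frac{9.991\times 10^{-5}}{1-0.999^2} \approx 0.04998.
\end{aligned}
$$

At $ t = 1 $ the raw averages are 10 and 1000 times smaller than $ g_1 $ and $ g_1^2 $ respectively; the correction restores them exactly, and its effect fades as $ \beta^t \to 0 $.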

[NAT]

```python
import torch
import torch.nn as nn
import torch.optim as optim

torch.manual_seed(0)

net = nn.Sequential(nn.Linear(64, 128), nn.ReLU(), nn.Linear(128, 2))
X = torch.randn(256, 64)
y = torch.randint(0, 2, (256,))
crit = nn.CrossEntropyLoss()

opt = optim.SGD(net.parameters(), lr=0.2, momentum=0.9, weight_decay=1e-3)
scheduler = optim.lr_scheduler.StepLR(opt, step_size=5, gamma=0.5)

def train_epoch(use_explicit_l2=False, l2_lambda=0.0):
    opt.zero_grad(set_to_none=True)
    logits = net(X)
    base_loss = crit(logits, y)
    loss = base_loss
    if use_explicit_l2:
        l2 = sum((p**2).sum() for p in net.parameters())
        loss = loss + l2_lambda * l2
    loss.backward()
    opt.step()

for epoch in range(10):
    if epoch < 5:
        train_epoch(use_explicit_l2=False)
    else:
        train_epoch(use_explicit_l2=True, l2_lambda=1e-4)
    scheduler.step()

print("Final LR:", scheduler.get_last_lr()[0])
```

### 7. Which statements best reflect the code’s regularization and scheduling?

- [x] A. weight_decay in SGD applies L2-like shrinkage coupled to the gradient step; adding explicit l2_lambda * ||θ||² augments loss-based gradients in later epochs.
- [x] B. Using both decay and explicit L2 increases total shrinkage pressure on parameters in the second phase.
- [ ] C. StepLR modifies opt.param_groups before opt.step(), so the first five epochs already use a decayed LR.
- [x] D. The learning rate halves after the 5th and 10th `scheduler.step()` calls (given `step_size=5`), so epochs 5–9 train at 0.1 and the final reported LR is 0.05.

Ans: Weight decay is coupled to the update; explicit L2 adds to loss gradient; StepLR.step() is called after opt.step() here, so LR changes take effect from the next epoch boundary.
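The schedule can be traced by hand with the closed form of step decay. The helper `step_lr` below is a hypothetical stand-in for illustration, assuming the scheduler is stepped once at the end of every epoch as in the code above.

```python
def step_lr(base_lr, gamma, step_size, num_scheduler_steps):
    """LR in effect after `num_scheduler_steps` calls to scheduler.step()."""
    return base_lr * gamma ** (num_scheduler_steps // step_size)

# During epoch k (0-indexed) the scheduler has been stepped k times.
lrs_used = [step_lr(0.2, 0.5, 5, epoch) for epoch in range(10)]
final_lr = step_lr(0.2, 0.5, 5, 10)  # value after the 10th scheduler.step()

print("LR used per epoch:", lrs_used)
print("Final LR:", final_lr)
```

Epochs 0–4 run at 0.2, epochs 5–9 at 0.1, and the value left after the last `scheduler.step()` is 0.05.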

[MCQ]

```python
import torch

x = torch.tensor([3.0, 4.0, 0.0])  
g = torch.tensor([1.0, -2.0, 2.0])  
eta = 1.0
R = 4.0

x_new = x - eta * g
norm = torch.linalg.vector_norm(x_new)
if norm > R:
    x_proj = (R / norm) * x_new
else:
    x_proj = x_new

print("x_new:", x_new.tolist())
print("x_proj:", x_proj.tolist())
```

### 8. Which is the correct x_proj after the projection?

- [ ] A. `[2.0,6.0,−2.0]`
- [x] B. `(4/√44)·[2, 6, −2]`
- [ ] C. `(5/√44)·[2, 6, −2]`
- [ ] D. `[3.2,9.6,−3.2]`

Ans: Since $ \|x_{\text{new}}\| = \sqrt{44} \approx 6.63 > 4 $, the point is projected by scaling it to radius 4 along the same direction.
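For reference, the same projection in plain Python. The helper name `project_to_ball` is an assumption made for this sketch.

```python
import math

def project_to_ball(x, radius):
    """Euclidean projection of x onto the L2 ball of the given radius."""
    norm = math.sqrt(sum(xi * xi for xi in x))
    if norm <= radius:
        return list(x)                       # already feasible
    return [radius / norm * xi for xi in x]  # rescale onto the boundary

x_new = [2.0, 6.0, -2.0]                     # x - eta * g from the snippet
x_proj = project_to_ball(x_new, 4.0)

print("x_proj:", x_proj)
print("norm:  ", math.sqrt(sum(v * v for v in x_proj)))
```

Numerically, the projected point is roughly `[1.206, 3.618, -1.206]`, which has norm 4.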

[MCQ]

```python
import torch
import torch.nn as nn
import torch.optim as optim

torch.manual_seed(0)

net = nn.Sequential(nn.Linear(32, 64), nn.ReLU(), nn.Linear(64, 2))
crit = nn.CrossEntropyLoss()
opt = optim.SGD(net.parameters(), lr=0.05)

X = torch.randn(2048, 32)
y = torch.randint(0, 2, (2048,))

def step(batch_size):
    perm = torch.randperm(X.size(0))
    bidx = perm[:batch_size]
    xb, yb = X[bidx], y[bidx]
    opt.zero_grad(set_to_none=True)
    logits = net(xb)
    loss = crit(logits, yb)
    loss.backward()
    opt.step()
    return float(loss)

loss_16  = step(16)
loss_256 = step(256)

print(loss_16, loss_256)
```

### 9. Which choice most likely reduces stochastic gradient variance in this setup (all else equal)?

- [ ] A. Use smaller batch sizes, as that always reduces variance.
- [x] B. Use larger batch sizes, although it may affect generalization dynamics.
- [ ] C. Keep batch size constant but double the learning rate.
- [ ] D. Replace SGD with momentum while keeping batch size tiny; variance strictly vanishes.

Ans: Larger batches reduce minibatch noise variance, though this can interact with generalization and may require LR retuning.
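A small simulation illustrates the variance reduction. It treats per-example gradients as hypothetical scalars drawn from a Gaussian (an assumption made purely for illustration) and measures how much the minibatch mean fluctuates at two batch sizes.

```python
import random
import statistics

random.seed(0)

# Hypothetical per-example "gradients": scalars with mean 1 and std 2.
population = [random.gauss(1.0, 2.0) for _ in range(2048)]

def minibatch_mean_variance(batch_size, trials=500):
    """Variance of the minibatch-mean gradient across random minibatches."""
    means = [
        statistics.fmean(random.sample(population, batch_size))
        for _ in range(trials)
    ]
    return statistics.pvariance(means)

var_16 = minibatch_mean_variance(16)
var_256 = minibatch_mean_variance(256)

print("variance with batch size 16: ", var_16)
print("variance with batch size 256:", var_256)
```

The variance of the minibatch mean shrinks roughly in proportion to 1/batch_size, so the 256-sample estimate is much less noisy than the 16-sample one.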

[MSQ]

```python
import torch
import torch.nn as nn
import torch.optim as optim

torch.manual_seed(1)

net = nn.Sequential(nn.Linear(16, 32), nn.ReLU(), nn.Linear(32, 4))
opt = optim.Adam(net.parameters(), lr=1e-3)
crit = nn.CrossEntropyLoss()

Xtr = torch.randn(256, 16); ytr = torch.randint(0, 4, (256,))
Xva = torch.randn(128, 16); yva = torch.randint(0, 4, (128,))

best_val = float("inf")
patience, bad = 3, 0

for epoch in range(20):
    opt.zero_grad(set_to_none=True)
    logits = net(Xtr)
    loss = crit(logits, ytr)
    loss.backward(); opt.step()

    with torch.no_grad():
        vloss = crit(net(Xva), yva).item()

    if vloss < best_val - 1e-4:
        best_val, bad = vloss, 0
    else:
        bad += 1
        if bad >= patience:
            break

print("Stopped at epoch:", epoch)
print("Best val:", best_val)
```

### 10. Which statements about the stopping logic are correct? (choose all that apply)

- [x] A. Validation loss must improve by at least 1e-4 to reset patience; otherwise bad increases.
- [x] B. Training continues while bad < patience; once bad reaches patience, loop breaks.
- [ ] C. This guarantees stopping at a global minimum of validation loss.
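- [ ] D. A smaller tolerance (like 1e-6) would make improvements harder to register, potentially increasing early stops.

Ans: The tolerance gate controls what counts as improvement; patience counts consecutive non-improving epochs; there is no guarantee of reaching a global minimum. D is false as stated: a smaller tolerance makes improvements easier to register (the check `vloss < best_val - tol` passes more often), so it tends to delay early stopping rather than trigger it sooner.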
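To see how the tolerance and patience interact, the sketch below replays the same stopping rule on a hypothetical, slowly improving sequence of validation losses. The function name `early_stop_epoch` and the loss values are made up for illustration.

```python
def early_stop_epoch(val_losses, patience, min_delta):
    """Return (epoch at which the loop ends, best validation loss seen)."""
    best, bad = float("inf"), 0
    for epoch, vloss in enumerate(val_losses):
        if vloss < best - min_delta:
            best, bad = vloss, 0          # improvement large enough: reset counter
        else:
            bad += 1                      # too small (or no) improvement
            if bad >= patience:
                return epoch, best        # early stop
    return len(val_losses) - 1, best      # ran through every epoch

# Hypothetical losses that improve by 2e-5 per epoch.
val_losses = [1.0, 0.99998, 0.99996, 0.99994, 0.99992, 0.9999]

strict = early_stop_epoch(val_losses, patience=3, min_delta=1e-4)
loose = early_stop_epoch(val_losses, patience=3, min_delta=1e-6)

print("min_delta=1e-4:", strict)
print("min_delta=1e-6:", loose)
```

With the larger tolerance, none of the small gains count and the loop stops at epoch 3; with the smaller tolerance, every epoch registers as an improvement and training runs to the end.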

[NAT]

```python
import torch
import torch.nn as nn

class F(nn.Module):
    def __init__(self):
        super().__init__()
        self.conv1 = nn.Conv2d(3, 8, kernel_size=5, stride=1, padding=2)
        self.pool  = nn.MaxPool2d(kernel_size=2, stride=2)
        self.conv2 = nn.Conv2d(8, 16, kernel_size=3, stride=1, padding=1)
    def forward(self, x):
        x = self.conv1(x).relu()
        x = self.pool(x)
        x = self.conv2(x).relu()
        return x

net = F()
x = torch.randn(4, 3, 32, 32)
z = net(x)
flat_dim = z[0].numel()  
print("Per-sample flattened dim:", flat_dim)
```

### 11. What exact integer does `Per-sample flattened dim` print?

Ans: 4096. conv1 keeps 32x32 → pool halves to 16x16 → conv2 keeps 16x16 with 16 channels ⇒ 16∗16∗16=4096.
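The shape arithmetic follows the usual output-size formula for convolution and pooling layers. The helper `conv_out` below is a hypothetical name, and the sketch assumes square inputs and dilation 1.

```python
def conv_out(n, kernel, stride=1, padding=0):
    """Spatial output size of a conv/pool layer (dilation 1, floor rounding)."""
    return (n + 2 * padding - kernel) // stride + 1

h = conv_out(32, kernel=5, stride=1, padding=2)   # conv1: 32 -> 32
h = conv_out(h, kernel=2, stride=2)               # pool:  32 -> 16
h = conv_out(h, kernel=3, stride=1, padding=1)    # conv2: 16 -> 16

out_channels = 16                                 # from conv2
flat_dim = out_channels * h * h

print("Per-sample flattened dim:", flat_dim)
```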

[NAT]

```python
# We compute deterministic scalar updates: w is 1D parameter.
# SGD with momentum (classic):
#   v_{t+1} = mu * v_t + g_t
#   w_{t+1} = w_t - lr * v_{t+1}

w0 = 1.0
lr = 0.1
mu = 0.9

# Gradients at two consecutive steps (given deterministically)
g1 = 2.0
g2 = -1.0

v0 = 0.0

# Step 1
v1 = mu * v0 + g1      
w1 = w0 - lr * v1      

# Step 2
v2 = mu * v1 + g2      
w2 = w1 - lr * v2      

print("w2_exact:", w2)
```

### 12. What exact numeric value is printed for w2_exact?

Ans: 0.72. With v0 = 0: v1 = 0.9 * 0 + 2 = 2, w1 = 1 - 0.1 * 2 = 0.8, v2 = 0.9 * 2 - 1 = 0.8, w2 = 0.8 - 0.1 * 0.8 = 0.72.