# Theory Questions (Total: 12)

[MCQ]

### 1. In a Vanilla GAN setup, if the discriminator becomes an optimal classifier too early in training, which of the following outcomes is most theoretically accurate in terms of the generator’s gradient behavior and adversarial learning objective?

- [ ] A. The generator receives stronger gradients that accelerate convergence due to discriminator confidence
- [x] B. The generator receives vanishing gradients, causing training stagnation even if its outputs are random
- [ ] C. The discriminator’s optimality guarantees Nash equilibrium is reached earlier
- [ ] D. The generator's updates become similar to maximum likelihood estimation

Ans: When the discriminator becomes optimal too early, it outputs values near 1 for real samples and near 0 for fake samples. The minimax generator loss $\log(1 - D(G(z)))$ then saturates, its gradient vanishes, and the generator stops learning no matter how poor its outputs are.
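The vanishing gradient can be checked numerically. The sketch below assumes the discriminator ends in a sigmoid applied to a logit `a` (an assumption, since the question does not fix the architecture) and compares the gradient of the minimax loss with that of the commonly used non-saturating alternative $-\log D(G(z))$, which is included here only for contrast. The function names are illustrative.

```python
import math


def sigmoid(a: float) -> float:
    """Numerically stable logistic function."""
    if a >= 0:
        return 1.0 / (1.0 + math.exp(-a))
    e = math.exp(a)
    return e / (1.0 + e)


def grad_saturating(a: float) -> float:
    """d/da of log(1 - sigmoid(a)), the minimax generator loss."""
    return -sigmoid(a)


def grad_non_saturating(a: float) -> float:
    """d/da of -log(sigmoid(a)), the non-saturating generator loss."""
    return -(1.0 - sigmoid(a))


# a is the discriminator logit on a fake sample; very negative means "confidently fake".
for a in (0.0, -5.0, -10.0):
    print(
        f"logit={a:6.1f}  D(G(z))={sigmoid(a):.5f}  "
        f"saturating={grad_saturating(a):+.5f}  "
        f"non-saturating={grad_non_saturating(a):+.5f}"
    )
```

As the logit becomes more negative, the saturating gradient shrinks toward zero while the non-saturating one stays close to -1.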

[MSQ]

### 2. Consider mode collapse in GANs. Which mechanisms directly contribute to its emergence at an optimization level?

- [x] A. Generator exploiting local minima where discriminator response is predictable
- [x] B. Discriminator loss saturating due to Jensen-Shannon divergence properties
- [ ] C. Incorrect initialization of batch normalization parameters only
- [ ] D. Generator learning effectively but discriminator underfitting

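Ans: Mode collapse arises when the generator exploits a locally predictable discriminator response and concentrates on the few outputs that reliably fool it. In addition, the JS divergence saturates when the real and generated distributions do not overlap, and the resulting vanishing gradients leave the generator little signal to recover the missing modes.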

[MCQ]

### 3. In the minimax GAN objective using Jensen-Shannon divergence, why is the divergence theoretically unsuitable for disjoint real/generated distributions during early training?

- [ ] A. JS divergence becomes negative
- [x] B. JS divergence saturates at log(2), resulting in zero gradients
- [ ] C. It causes the discriminator to output uniform probabilities
- [ ] D. It forces generator to switch to KL divergence automatically

Ans: When the real and generated distributions have disjoint supports, the JS divergence is constant at $\log 2$, so its gradient with respect to the generator is zero and the generator cannot learn.
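A small discrete example makes the saturation concrete. This is a minimal sketch using hand-picked toy distributions over four bins (not data from any real model): whenever the two supports are disjoint, the JS divergence equals $\log 2$ regardless of how far apart the mass sits.

```python
import math


def kl(p, q):
    """KL(p || q) in nats for discrete distributions; terms with p_i = 0 contribute 0."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)


def js(p, q):
    """Jensen-Shannon divergence between two discrete distributions."""
    m = [(pi + qi) / 2 for pi, qi in zip(p, q)]
    return 0.5 * kl(p, m) + 0.5 * kl(q, m)


real = [1.0, 0.0, 0.0, 0.0]
fake_near = [0.0, 1.0, 0.0, 0.0]  # disjoint from real, one bin away
fake_far = [0.0, 0.0, 0.0, 1.0]   # disjoint from real, three bins away

print(js(real, fake_near), js(real, fake_far), math.log(2))
```

Because both values coincide, moving the generated mass closer to the real mass does not change the divergence, which is why it provides no gradient.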

[NAT]

### 4. If the discriminator perfectly distinguishes real and fake samples such that $ D(x)=1 $ and $ D(G(z))=0 $, what numerical value does the generator’s gradient term in the original minimax formulation approach (provide the numerical limit)?

Ans: 0.

If the discriminator is perfect, the generator's loss term $\log(1 - D(G(z)))$ saturates: its gradient with respect to the discriminator's pre-sigmoid output approaches zero, so the generator receives no learning signal.
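The limit can be derived in one line, assuming (as in the standard setup, though the question does not state it) that the discriminator output is a sigmoid of a logit $a$, i.e. $D(G(z)) = \sigma(a)$:

$$
\frac{\partial}{\partial a}\log\bigl(1 - \sigma(a)\bigr)
= -\frac{\sigma(a)\bigl(1 - \sigma(a)\bigr)}{1 - \sigma(a)}
= -\sigma(a)
= -D(G(z)) \;\longrightarrow\; 0
\quad \text{as } D(G(z)) \to 0 .
$$

Every gradient reaching the generator's parameters is multiplied by this factor through the chain rule, so it vanishes as well.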

[MSQ]

### 5. Which of the following architectural or training strategies fundamentally alter the divergence measure implicitly optimized by GANs?

- [x] A. WGAN replacing sigmoid activation with linear output
- [x] B. Using gradient penalty in WGAN-GP
- [x] C. Feature matching in the discriminator loss
- [ ] D. Increasing batch size during training

Ans: A: Dropping the sigmoid and training a linear-output critic with the WGAN loss replaces the JS divergence with the Wasserstein distance. B: The gradient penalty enforces the Lipschitz constraint that the Wasserstein estimate relies on, modifying the distance being approximated. C: Feature matching makes the generator match discriminator feature statistics rather than solely fool the discriminator. D: Batch size has no fundamental effect on the divergence.

[MCQ]

### 6. What is the theoretical reason that Wasserstein distance enables meaningful gradients even when generator and real distributions do not overlap?

- [ ] A. It introduces a logarithmic divergence behavior
- [ ] B. It computes distance over the support boundary only
- [x] C. It measures the cost of optimal transport between distributions
- [ ] D. It forces discriminator to approximate entropy

Ans: Wasserstein distance measures the minimum cost of transporting the generated distribution onto the real distribution. That cost changes smoothly with how far apart the two distributions are, so it gives meaningful gradients even when their supports do not overlap.
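For one-dimensional samples of equal size, the optimal transport plan simply pairs sorted points, which allows a tiny illustration. The sketch below uses made-up toy samples and an illustrative helper name; it shows the distance shrinking as the generated samples move toward the real ones, even though the two sets never overlap.

```python
def wasserstein_1d(xs, ys):
    """W1 distance between two equal-size 1-D empirical distributions."""
    if len(xs) != len(ys):
        raise ValueError("this sketch assumes equal sample counts")
    # In 1-D the optimal transport plan matches the i-th smallest points.
    return sum(abs(a - b) for a, b in zip(sorted(xs), sorted(ys))) / len(xs)


real = [0.0, 0.1, 0.2, 0.3]
for shift in (5.0, 2.0, 0.5):
    fake = [x + shift for x in real]  # same shape, shifted away from real
    print(f"shift={shift}  W1={wasserstein_1d(real, fake):.3f}")
```

Unlike the JS divergence, the value tracks the shift, so a generator that reduces the shift sees its loss decrease.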

[MSQ]

### 7. Which of the following indicate a failure of the GAN game to converge to a Nash equilibrium, based on training signal behavior?

- [x] A. Oscillations in generator loss with no trend
- [x] B. Discriminator loss converging to zero permanently
- [ ] C. Generator loss reaching negative infinity
- [ ] D. Constant discriminator accuracy near 50%

Ans: A: Oscillations imply dynamic instability and lack of equilibrium. B: If discriminator loss stays at zero, it overpowers the generator, halting progress. C is theoretically impossible, and D indicates convergence to equilibrium rather than failure.
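The oscillation in option A can be reproduced on a toy two-player game. The sketch below is not a GAN: it assumes the bilinear objective $f(x, y) = xy$, whose only equilibrium is $(0, 0)$, and applies simultaneous gradient descent on `x` and ascent on `y`. The function name and step size are illustrative.

```python
import math


def simultaneous_gda(x, y, lr=0.1, steps=100):
    """Simultaneous gradient descent (on x) and ascent (on y) for f(x, y) = x * y."""
    norms = []
    for _ in range(steps):
        gx, gy = y, x  # df/dx = y, df/dy = x
        x, y = x - lr * gx, y + lr * gy
        norms.append(math.hypot(x, y))  # distance from the equilibrium (0, 0)
    return norms


norms = simultaneous_gda(1.0, 1.0)
print(f"distance after 1 step: {norms[0]:.4f}, after 100 steps: {norms[-1]:.4f}")
```

The iterates rotate around the equilibrium while drifting outward, so neither player's loss settles even though an equilibrium exists.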

[MCQ]

### 8. Why is Lipschitz continuity critical in WGAN theory?

- [ ] A. It bounds the discriminator weights to prevent saturation
- [x] B. It ensures the discriminator approximates a K-Lipschitz function for valid Wasserstein estimation
- [ ] C. It increases variance in gradient flow
- [ ] D. It allows generator to minimize KL divergence directly

Ans: The WGAN critic must be 1-Lipschitz (or K-Lipschitz, which only rescales the estimate by a factor of K) for its objective to be a valid estimate of the Wasserstein distance. This theoretical constraint is essential for stable training.
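The constraint comes from the Kantorovich-Rubinstein dual form of the Wasserstein-1 distance, shown here in the notation commonly used for WGAN (the symbols $P_r$ for the real distribution, $P_g$ for the generated one, and $f$ for the critic are assumptions, since the question does not define them):

$$
W(P_r, P_g) \;=\; \sup_{\|f\|_L \le 1} \; \mathbb{E}_{x \sim P_r}[f(x)] \;-\; \mathbb{E}_{x \sim P_g}[f(x)]
$$

Taking the supremum over $K$-Lipschitz functions instead yields $K \cdot W(P_r, P_g)$. Without any Lipschitz bound the supremum is unbounded whenever $P_r \neq P_g$, and the critic's objective no longer measures a distance.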

[MCQ]

### 9. Batch normalization in GAN generators primarily helps in:

- [x] A. Stabilizing gradient magnitudes across layers
- [ ] B. Preventing discriminator from overpowering the generator
- [ ] C. Changing divergence measure
- [ ] D. Reducing Lipschitz constant directly

Ans: Batch normalization reduces internal covariate shift, stabilizing training and improving gradient flow in the generator.

[MSQ]

### 10. Which regularization techniques help mitigate discriminator overfitting in GANs?

- [x] A. Label smoothing
- [x] B. Dropout in discriminator
- [ ] C. Increasing learning rate exponentially
- [x] D. Spectral normalization

Ans: Label smoothing reduces discriminator overconfidence. Dropout prevents overfitting. Spectral normalization constrains the Lipschitz constant to maintain stability. Option C destabilizes training.

[MCQ]

### 11. What is the core principle behind GANs?

- [ ] A. Generator tries to maximize classification accuracy
- [x] B. Generator and discriminator play a two-player minimax game
- [ ] C. Discriminator generates samples while generator evaluates them
- [ ] D. Generator minimizes logistic regression loss directly

Ans: GANs are adversarial models where generator and discriminator optimize opposing objectives in a minimax framework.

[NAT]

### 12. In a GAN context, if the discriminator outputs 0.5 consistently for both real and fake samples, what is the ideal theoretical interpretation of this output (numerical answer for the discriminator’s confidence level)?

Ans: 0.5

A discriminator output of 0.5 indicates maximum uncertainty, meaning it cannot distinguish real from fake — the ideal point of equilibrium.
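The value 0.5 follows from the optimal discriminator for a fixed generator in the original minimax formulation. In the notation below, $p_{\text{data}}$ and $p_g$ denote the real and generated densities (symbols assumed here, not defined in the question):

$$
D^*(x) = \frac{p_{\text{data}}(x)}{p_{\text{data}}(x) + p_g(x)},
\qquad
p_g = p_{\text{data}} \;\Rightarrow\; D^*(x) = \tfrac{1}{2},
\qquad
V(D^*, G) = \log\tfrac{1}{2} + \log\tfrac{1}{2} = -\log 4 .
$$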

# Coding Questions (Total: 12)

[MSQ]

```python
netD.train(); netG.train()

optimG.zero_grad(set_to_none=True)
z = torch.randn(B, 100, device=device)
fake = netG(z)
pred_fake = netD(fake)
g_loss = criterion(pred_fake, torch.ones_like(pred_fake))
g_loss.backward()
optimG.step()

optimD.zero_grad(set_to_none=True)
pred_real = netD(x)
loss_real = criterion(pred_real, torch.ones_like(pred_real))
pred_fake_detached = netD(fake.detach())
loss_fake = criterion(pred_fake_detached, torch.zeros_like(pred_fake_detached))
d_loss = (loss_real + loss_fake) * 0.5
d_loss.backward()
optimD.step()
```

### 1. Which lines/practices are critical to avoid incorrect gradient flow or graph reuse problems in alternating updates?

- [x] A. Using `fake.detach()` before the discriminator update
- [x] B. Calling `optimG.zero_grad(set_to_none=True)` and `optimD.zero_grad(set_to_none=True)` separately
- [x] C. Computing `pred_fake = netD(fake)` again for D instead of reusing `pred_fake` from G step
- [x] D. Calling `netD.train(); netG.train()` at the start of the step

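Ans: A: `fake.detach()` cuts the graph back to G, so the D update does not backpropagate into the generator. B: Zeroing gradients per optimizer keeps the D gradients produced by `g_loss.backward()` from carrying over into the D update. C: A fresh forward pass through D is required because the graph behind the G step's `pred_fake` was already consumed by `g_loss.backward()`. D: `train()` keeps layers such as Dropout and BatchNorm in training mode so they use the correct statistics.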

[MCQ]

```python
for _ in range(5):
    netD.train(); netG.train()
    for p in netD.parameters(): p.requires_grad = True

    optimD.zero_grad(set_to_none=True)
    z = torch.randn(B, 128, device=device)
    fake = netG(z).detach()
    d_real = netD(x)
    d_fake = netD(fake)

    eps = torch.rand(B, 1, 1, 1, device=device)
    x_hat = eps * x + (1 - eps) * fake
    x_hat.requires_grad_(True)
    d_hat = netD(x_hat)
    grad = torch.autograd.grad(
        outputs=d_hat.sum(), inputs=x_hat, create_graph=True
    )[0].view(B, -1)
    gp = ((grad.norm(2, dim=1) - 1.0) ** 2).mean()

    d_loss = (d_fake.mean() - d_real.mean()) + 10.0 * gp
    d_loss.backward()
    optimD.step()

for p in netD.parameters(): p.requires_grad = False
optimG.zero_grad(set_to_none=True)
z = torch.randn(B, 128, device=device)
g_loss = -netD(netG(z)).mean()
g_loss.backward()
optimG.step()
```

### 2. What is the primary reason we set `requires_grad=False` on the critic parameters before the G step?

- [ ] A. Reduces memory overhead only
- [x] B. Prevents accidental gradient flow into critic during G update
- [ ] C. Avoids weight clipping
- [ ] D. Enables batch norm statistics to freeze

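Ans: Setting `requires_grad = False` freezes the critic, so `g_loss.backward()` accumulates gradients only in the generator's parameters. Gradients still flow through the critic's activations to reach G.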
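The `gp` term in the snippet above can be understood on a critic simple enough to differentiate by hand. The sketch below assumes a hypothetical linear critic $f(x) = w \cdot x$, for which the input gradient is `w` at every point, so no autograd is needed; the helper name is illustrative.

```python
import math


def gradient_penalty_linear(w):
    """Gradient penalty (||grad_x f(x)||_2 - 1)^2 for a linear critic f(x) = w . x.

    For a linear critic the input gradient equals w everywhere, so the
    penalty does not depend on the interpolated point x_hat.
    """
    grad_norm = math.sqrt(sum(wi * wi for wi in w))
    return (grad_norm - 1.0) ** 2


print(gradient_penalty_linear([0.6, 0.8]))  # unit-norm gradient
print(gradient_penalty_linear([3.0, 4.0]))  # gradient norm of 5
```

The penalty is near zero only when the critic's gradient norm is close to 1, which is how WGAN-GP softly encourages the 1-Lipschitz condition.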

[MSQ]

```python
for p in netD1.parameters():
    p.data.clamp_(-0.01, 0.01)

from torch.nn.utils import spectral_norm
netD2.conv1 = spectral_norm(netD2.conv1)
netD2.conv2 = spectral_norm(netD2.conv2)
netD2.fc = spectral_norm(netD2.fc)
```

### 3. Which statements about training behavior are correct?

- [x] A. Spectral norm enforces a tighter Lipschitz control than naive clipping in practice
- [x] B. Weight clipping can reduce critic capacity and lead to underfitting
- [x] C. Spectral norm typically stabilizes gradients for both WGAN and non-WGAN losses
- [ ] D. Spectral norm guarantees convergence for any generator

Ans: SN provides smoother Lipschitz control and stabilizes training; clipping is crude and harms capacity. No method guarantees convergence universally.
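Spectral normalization divides a weight matrix by an estimate of its largest singular value, usually obtained by power iteration. The following is a minimal pure-Python sketch of that idea on a toy 2x2 matrix. It is not the `torch.nn.utils.spectral_norm` implementation, and the helper names are illustrative.

```python
import math


def matvec(W, v):
    """Multiply matrix W (list of rows) by vector v."""
    return [sum(wij * vj for wij, vj in zip(row, v)) for row in W]


def l2_normalize(v, eps=1e-12):
    """Scale v to unit Euclidean length."""
    n = math.sqrt(sum(x * x for x in v))
    return [x / (n + eps) for x in v]


def spectral_norm_estimate(W, n_iters=50):
    """Estimate the largest singular value of W by power iteration."""
    Wt = [list(col) for col in zip(*W)]  # transpose
    v = l2_normalize([1.0] * len(W[0]))
    for _ in range(n_iters):
        u = l2_normalize(matvec(W, v))
        v = l2_normalize(matvec(Wt, u))
    # sigma is approximately u^T W v once u and v have converged.
    return sum(ui * x for ui, x in zip(u, matvec(W, v)))


W = [[3.0, 0.0], [0.0, 1.0]]
sigma = spectral_norm_estimate(W)
W_sn = [[w / sigma for w in row] for row in W]  # normalized weights
print(sigma, spectral_norm_estimate(W_sn))
```

After normalization the largest singular value is about 1, which bounds the layer's Lipschitz constant without clamping individual weights the way clipping does.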

[NAT]

```python
scaler = torch.cuda.amp.GradScaler()
for it in range(T):
    optimD.zero_grad(set_to_none=True)
    with torch.cuda.amp.autocast():
        d_real = netD(x)
        d_fake = netD(netG(z).detach())
        d_loss = (d_fake.mean() - d_real.mean())

    scaler.scale(d_loss).backward()
    scaler.step(optimD)

    optimG.zero_grad(set_to_none=True)
    with torch.cuda.amp.autocast():
        g_loss = -netD(netG(z)).mean()
    scaler.scale(g_loss).backward()
    scaler.step(optimG)
    scaler.update()
```

### 4. After many iterations, you observe unstable/plateaued critic updates due to a repeated AMP bug. What single integer count of missing calls per iteration causes this?

Ans: 1

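One `scaler.update()` call is missing after `scaler.step(optimD)`, so the scaler's state (the scale factor and its inf/NaN bookkeeping) does not advance between the D step and the G step. Using a separate `GradScaler` per optimizer is a common way to keep the two updates independent.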

[MCQ]

```python
# EMA - Exponential Moving Average
ema_decay = 0.999
for it in range(T):
    with torch.no_grad():
        for p, p_ema in zip(netG.parameters(), netG_EMA.parameters()):
            p_ema.copy_(p_ema * ema_decay + (1 - ema_decay) * p)
```

### 5. Why is evaluating FID/IS with `netG_EMA` often preferred?

- [x] A. EMA reduces instantaneous noise in weights, giving more stable sample quality
- [ ] B. EMA changes the loss to Wasserstein distance
- [ ] C. EMA regularizes the discriminator
- [ ] D. EMA guarantees lower FID

Ans: EMA smooths the parameter trajectory, yielding more stable generations; it offers no guarantee on the metrics themselves.
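The smoothing effect is easy to see on a single noisy scalar standing in for one generator weight. This sketch uses synthetic Gaussian noise and a decay of 0.9 (smaller than the 0.999 above, purely so the effect shows up in a short sequence); all names are illustrative.

```python
import random


def ema_track(values, decay=0.9):
    """Return the EMA trajectory of a scalar sequence, initialised at values[0]."""
    ema = values[0]
    out = []
    for v in values:
        ema = decay * ema + (1.0 - decay) * v
        out.append(ema)
    return out


def variance(xs):
    """Population variance of a list of floats."""
    m = sum(xs) / len(xs)
    return sum((x - m) ** 2 for x in xs) / len(xs)


rng = random.Random(0)  # fixed seed so the run is repeatable
raw = [1.0 + rng.gauss(0.0, 0.5) for _ in range(2000)]  # noisy "weight"
smoothed = ema_track(raw, decay=0.9)

print(f"raw variance: {variance(raw[200:]):.4f}")
print(f"EMA variance: {variance(smoothed[200:]):.4f}")
```

The averaged sequence fluctuates far less than the raw one, which is the property that makes EMA weights attractive for evaluation.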


[MSQ]

```python
# This is happening after each epoch.
netG.eval()
z_fixed = torch.randn(64, 100, device=device)
with torch.no_grad():
    samples = netG(z_fixed)

# Training resumes:
netG.train()
```

### 6. Pick all correct statements.

- [x] A. Using `.eval()` avoids updating BatchNorm running stats during sampling
- [x] B. `torch.no_grad()` prevents building graphs and saves memory during sampling
- [x] C. Omitting `.eval()` risks sampling with training-time BN/Dropout behavior
- [ ] D. Sampling without `no_grad()` can incorrectly update EMA

Ans: `.eval()` and `no_grad()` are both needed for correct, efficient evaluation. EMA isn’t updated here (we didn’t change EMA logic).

[MCQ]

```python
logits_real, feat_real = netD(x, return_features=True)
logits_fake, feat_fake = netD(netG(z), return_features=True)

# Standard adversarial portion
g_adv = F.binary_cross_entropy_with_logits(logits_fake, torch.ones_like(logits_fake))

# Feature matching
g_fm = F.l1_loss(feat_fake.mean(dim=0), feat_real.mean(dim=0))

g_total = g_adv + 10.0 * g_fm
g_total.backward()
optimG.step()
```

### 7. What’s the main purpose of adding `g_fm` here?

- [ ] A. To change divergence from JS to KL
- [x] B. To discourage mode collapse by matching higher-level statistics
- [ ] C. To regularize the discriminator’s weights
- [ ] D. To enforce Lipschitz constraint

Ans: Feature matching shapes generator outputs toward real feature statistics, reducing collapse.

[MSQ]

```python
import random

import numpy as np
import torch


def seed_all(seed=42):
    random.seed(seed); np.random.seed(seed)
    torch.manual_seed(seed); torch.cuda.manual_seed_all(seed)
    torch.backends.cudnn.deterministic = True
    torch.backends.cudnn.benchmark = False

seed_all(123)

loader = torch.utils.data.DataLoader(
    dataset, batch_size=64, shuffle=True,
    num_workers=4, pin_memory=True, drop_last=True,
    worker_init_fn=lambda wid: np.random.seed(123 + wid)
)
```

### 8. Which choices best ensure run-to-run determinism given this setup?

- [x] A. Keep `benchmark=False` and `deterministic=True` for cuDNN
- [x] B. Fix `worker_init_fn` seeds per worker
- [ ] C. Set `shuffle=False`
- [x] D. Use `torch.use_deterministic_algorithms(True)` when available

Ans: Deterministic cuDNN, deterministic algos, and seeded workers help reproducibility. `shuffle=False` is unnecessary (and often undesirable) for training.

[MCQ]

```python
real_labels = torch.empty_like(pred_real).uniform_(0.8, 1.0)
fake_labels = torch.zeros_like(pred_fake)

loss_real = F.binary_cross_entropy_with_logits(pred_real, real_labels)
loss_fake = F.binary_cross_entropy_with_logits(pred_fake, fake_labels)
d_loss = 0.5 * (loss_real + loss_fake)
```

### 9. What is the expected training effect of one-sided smoothing for real labels?

- [x] A. Reduces D overconfidence, providing better gradients for G
- [ ] B. Forces D to underfit by construction
- [ ] C. Changes the GAN objective to WGAN
- [ ] D. Eliminates mode collapse

Ans: Softer targets regularize D and prevent saturated, unhelpful gradients.
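The effect on D's confidence follows from the gradient of binary cross-entropy with respect to the logit, which is $\sigma(a) - y$ for target $y$. The sketch below assumes a fixed smoothed target of 0.9 for simplicity (the snippet above samples targets uniformly from [0.8, 1.0]); the function names are illustrative.

```python
import math


def sigmoid(a):
    """Logistic function."""
    return 1.0 / (1.0 + math.exp(-a))


def bce_logit_grad(logit, target):
    """d/d(logit) of binary cross-entropy with logits: sigmoid(logit) - target."""
    return sigmoid(logit) - target


# Gradient on a real sample's logit under a hard target (1.0) and a smoothed one (0.9).
for logit in (0.0, 2.0, 6.0):
    print(
        f"logit={logit:4.1f}  hard={bce_logit_grad(logit, 1.0):+.4f}  "
        f"smoothed={bce_logit_grad(logit, 0.9):+.4f}"
    )
```

With a hard target the gradient is always negative, so gradient descent keeps pushing the logit upward. With the smoothed target it changes sign at $\log 9 \approx 2.2$, which pulls overconfident logits back down.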

[MSQ]

```python
# Consider the following two options to clear the grads -
optimD.zero_grad()                   # (1)
# or
optimD.zero_grad(set_to_none=True)   # (2)
```

### 10. Which statements are true?

- [x] A. (2) can reduce memory writes and may be faster
- [x] B. (2) sets grads to None, which can be treated differently by some optimizers
- [ ] C. (1) and (2) are always identical in performance
- [x] D. (2) is commonly recommended for large models

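Ans: `set_to_none=True` releases the gradient tensors instead of filling them with zeros, which avoids memory writes and often improves performance. Since PyTorch 2.0 it is the default for `zero_grad()`, so (1) and (2) differ only on older versions or when `set_to_none=False` is passed explicitly.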

[MCQ]

```python
netG.to(device); netD.to(device)
z = torch.randn(64, 100)
x = x.to(device)

fake = netG(z)
```

### 11. What is the bug and fix?

- [x] A. `z` is on CPU; move it to device → `z = torch.randn(64, 100, device=device)`
- [ ] B. `x` must be on CPU; move it back
- [ ] C. `netG` must be on CPU
- [ ] D. Use `float16` by default

Ans: Inputs must be on the same device as the model; otherwise the forward pass raises a device-mismatch `RuntimeError`.

[NAT]

```python
with torch.no_grad():
    z = torch.randn(32, 100, device=device)
    fake = netG(z)
# later we compute a metric requiring CPU numpy
m = fake.mean().item()
```

### 12. What is the scalar dtype of `m` returned by `.item()` (write the Python type name)?

Ans: float

`.item()` converts a single-element tensor to a native Python scalar; for a floating-point tensor such as `fake.mean()`, that is a `float`.