# Quiz 1 - Mock Test Questions

## Theory (10)

[MCQ]

### 1. A single-layer perceptron with a hard step activation is trained on a dataset that is not linearly separable. Which of the following is the most accurate statement about its representational limitation?

- [ ] A. It will converge to a solution that minimizes mean squared error on the training set.
- [ ] B. It can represent any Boolean function if trained long enough.
- [x] C. It cannot represent the XOR function due to linear separability limits.
- [ ] D. It can represent XOR only if the learning rate is sufficiently small.

Ans: A single-layer perceptron implements a linear decision boundary, which cannot represent XOR. Training time or learning rate does not change the hypothesis class.
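
The limitation can be checked empirically. The sketch below is an illustration, not part of the quiz: `train_perceptron` is a hypothetical helper that runs the classic perceptron rule and reports whether any epoch finishes with zero mistakes.

```python
def step(v):
    """Hard step activation."""
    return 1 if v > 0 else 0

def train_perceptron(data, epochs=100, lr=1.0):
    """Return True if some epoch classifies every sample correctly."""
    w1 = w2 = b = 0.0
    for _ in range(epochs):
        errors = 0
        for (x1, x2), t in data:
            y = step(w1 * x1 + w2 * x2 + b)
            if y != t:
                errors += 1
                w1 += lr * (t - y) * x1
                w2 += lr * (t - y) * x2
                b += lr * (t - y)
        if errors == 0:
            return True
    return False

AND = [((0, 0), 0), ((0, 1), 0), ((1, 0), 0), ((1, 1), 1)]
XOR = [((0, 0), 0), ((0, 1), 1), ((1, 0), 1), ((1, 1), 0)]

and_ok = train_perceptron(AND)  # linearly separable
xor_ok = train_perceptron(XOR)  # not linearly separable
print(and_ok, xor_ok)
```

Raising `epochs` or shrinking `lr` does not help on XOR, because no single line separates its two classes.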

[MSQ]

### 2. Consider a 3-layer MLP with ReLU activations trained with cross-entropy loss. Which of the following choices increase the risk of vanishing/exploding gradients during training?

- [x] A. Using very deep networks without skip connections.
- [x] B. Using sigmoid activations in hidden layers with poor weight initialization.
- [ ] C. Using ReLU activations with He (Kaiming) initialization.
- [ ] D. Using Batch Normalization between linear and activation layers.

Ans: Depth + saturating activations + poor initialization can cause vanishing/exploding gradients. ReLU with He init and BatchNorm mitigate these issues.
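
A rough way to see the vanishing effect: the sigmoid derivative is at most 0.25, so a chain of sigmoid layers multiplies the gradient by at most 0.25 per layer. The sketch assumes a simplified scalar chain with unit weights and every pre-activation at `z`; `chain_gradient` is a hypothetical helper.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def chain_gradient(depth, z=0.0, weight=1.0):
    """Product of local derivatives through `depth` scalar sigmoid layers."""
    s = sigmoid(z)
    local = weight * s * (1.0 - s)  # d/dz sigmoid(z), scaled by the layer weight
    grad = 1.0
    for _ in range(depth):
        grad *= local
    return grad

for depth in (3, 10, 30):
    print(depth, chain_gradient(depth))
```

Even in the best case (`z = 0`), the factor shrinks geometrically with depth, which is what He initialization with ReLU, BatchNorm, and skip connections are meant to counteract.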

[NAT]

### 3. You train a classifier with softmax + cross-entropy. For a single sample, the model logits are $[2, 0, -1]$ for classes $[0, 1, 2]$, and the true class is 0. Compute the cross-entropy loss (natural log) to 4 decimal places.

Ans: 0.1698. Loss $= \log(e^{2} + e^{0} + e^{-1}) - 2 = \log(8.7569) - 2 \approx 0.1698$.
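
The value can be reproduced without PyTorch. A minimal sketch, assuming the usual log-sum-exp form of softmax cross-entropy; `cross_entropy` here is a hypothetical helper, not the library function.

```python
import math

def cross_entropy(logits, target):
    """Softmax cross-entropy for one sample: logsumexp(logits) - logits[target]."""
    m = max(logits)  # shift by the max for numerical stability
    lse = m + math.log(sum(math.exp(z - m) for z in logits))
    return lse - logits[target]

loss = cross_entropy([2.0, 0.0, -1.0], target=0)
print(f"{loss:.4f}")
```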

[MCQ]

### 4. Which statement best captures the role of an activation function in an MLP?

- [ ] A. It guarantees convexity of the loss landscape.
- [x] B. It introduces nonlinearity, enabling composition of linear transforms to model complex functions.
- [ ] C. It primarily serves as a learning rate schedule.
- [ ] D. It regularizes the model by dropping activations.

Ans: Nonlinear activations allow stacked linear layers to represent nonlinear mappings. They do not guarantee convexity or directly schedule learning rates.

[MSQ]

### 5. You apply L2 weight decay, Dropout(p=0.5), and early stopping. Which statements are true about generalization?

- [x] A. L2 penalizes large weights, typically smoothing decision boundaries.
- [x] B. Dropout behaves like model averaging over sub-networks at train time.
- [x] C. Early stopping often selects a point with higher training loss but lower test loss than later epochs.
- [ ] D. Dropout increases the effective width at test time.

Ans: L2 shrinks weights; Dropout approximates ensemble effects; early stopping is a practical regularizer. At test time Dropout is disabled (inverted dropout already rescales activations during training), so it does not “increase width.”
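
Statement C can be illustrated with a patience-based rule. This is a sketch under stated assumptions: the loss curves are made-up numbers and `early_stopping_epoch` is a hypothetical helper.

```python
def early_stopping_epoch(val_losses, patience=2):
    """Return the best epoch, stopping after `patience` epochs without improvement."""
    best_epoch, best_loss, waited = 0, float("inf"), 0
    for epoch, loss in enumerate(val_losses):
        if loss < best_loss:
            best_epoch, best_loss, waited = epoch, loss, 0
        else:
            waited += 1
            if waited >= patience:
                break
    return best_epoch

# Hypothetical curves: training loss keeps falling, validation loss turns upward.
train_losses = [0.90, 0.60, 0.45, 0.35, 0.28, 0.22, 0.18]
val_losses = [0.95, 0.70, 0.58, 0.55, 0.57, 0.61, 0.66]

best = early_stopping_epoch(val_losses)
print(best, train_losses[best], val_losses[best])
```

The selected epoch has a higher training loss than the final epoch but a lower validation loss.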

[MCQ]

### 6. Batch Normalization primarily helps by:

- [ ] A. Eliminating the need for nonlinear activations.
- [x] B. Reducing internal covariate shift and stabilizing gradients across layers.
- [ ] C. Guaranteeing faster test-time inference.
- [ ] D. Replacing the optimizer’s momentum term.

Ans: BN normalizes intermediate activations to stabilize training and gradient flow. It doesn’t remove nonlinearities or replace optimizer features.
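
The normalization step itself is short. A minimal sketch for a single feature in training mode, assuming the standard formula with biased batch variance; the input values are hypothetical.

```python
import math

def batch_norm(column, gamma=1.0, beta=0.0, eps=1e-5):
    """Normalize one feature across the batch, then apply the affine scale and shift."""
    n = len(column)
    mean = sum(column) / n
    var = sum((v - mean) ** 2 for v in column) / n  # biased batch variance
    return [gamma * (v - mean) / math.sqrt(var + eps) + beta for v in column]

activations = [10.0, 12.0, 14.0, 20.0]  # hypothetical pre-activations for one feature
normalized = batch_norm(activations)
out_mean = sum(normalized) / len(normalized)
out_var = sum((v - out_mean) ** 2 for v in normalized) / len(normalized)
print(round(out_mean, 6), round(out_var, 4))
```

Whatever the scale of the incoming activations, the next layer sees roughly zero mean and unit variance before `gamma` and `beta` are applied.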

[NAT]

### 7. You initialize a fully-connected layer with He normal initialization for ReLU: weights W ~ N(0, 2/fan_in). If fan_in = 50, what is the standard deviation (up to 2 decimal places)?

Ans: 0.20. Variance = 2/50 = 0.04, so the standard deviation is $\sqrt{0.04} = 0.20$.
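
A quick check of the arithmetic, with an empirical estimate from sampled weights. `he_std` is a hypothetical helper, and the sample size is an arbitrary choice.

```python
import math
import random

def he_std(fan_in):
    """Standard deviation for He normal initialization: sqrt(2 / fan_in)."""
    return math.sqrt(2.0 / fan_in)

std = he_std(50)

random.seed(0)
samples = [random.gauss(0.0, std) for _ in range(100_000)]
empirical = math.sqrt(sum(s * s for s in samples) / len(samples))
print(f"{std:.2f}", f"{empirical:.2f}")
```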

[MSQ]

### 8. Comparing SGD with momentum vs Adam, which statements are correct in typical deep learning practice?

- [x] A. Adam adapts learning rates per-parameter using first and second moment estimates.
- [ ] B. Momentum SGD cannot converge without weight decay.
- [x] C. Adam is often more robust to poorly scaled gradients at initialization.
- [ ] D. Adam always outperforms SGD on final generalization.

Ans: Adam uses adaptive moments; momentum SGD can converge without weight decay; Adam can be robust early on; generalization superiority is task-dependent.
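
Statements A and C can be seen in the first update step. The sketch below assumes the textbook update rules, zero-initialized state, and two hypothetical gradients six orders of magnitude apart.

```python
import math

def momentum_sgd_step(grad, velocity=0.0, lr=0.01, mu=0.9):
    """One heavy-ball step; returns (parameter change, new velocity)."""
    velocity = mu * velocity + grad
    return -lr * velocity, velocity

def adam_step(grad, m=0.0, v=0.0, t=1, lr=0.01, b1=0.9, b2=0.999, eps=1e-8):
    """One Adam step with bias correction; returns (parameter change, m, v)."""
    m = b1 * m + (1 - b1) * grad
    v = b2 * v + (1 - b2) * grad * grad
    m_hat = m / (1 - b1 ** t)
    v_hat = v / (1 - b2 ** t)
    return -lr * m_hat / (math.sqrt(v_hat) + eps), m, v

grads = [1e-3, 1e3]  # hypothetical, badly scaled gradients
sgd_deltas = [momentum_sgd_step(g)[0] for g in grads]
adam_deltas = [adam_step(g)[0] for g in grads]
print(sgd_deltas)
print(adam_deltas)
```

The momentum SGD steps differ by the same factor as the gradients, while both Adam steps have magnitude close to `lr`.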

[MCQ]

### 9. For a binary classifier with logits $z$ and sigmoid outputs $\sigma(z)$, which loss formulation is numerically stable?

- [ ] A. $-y \log \sigma(z) - (1 - y) \log(1 - \sigma(z))$ computed directly from $\sigma(z)$.
- [x] B. $\max(z, 0) - z \cdot y + \log(1 + e^{-|z|})$.
- [ ] C. Mean squared error between $y$ and $\sigma(z)$.
- [ ] D. Hinge loss on $\sigma(z)$.

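Ans: Working from logits avoids evaluating $\log \sigma(z)$ or $\log(1 - \sigma(z))$ after $\sigma(z)$ has rounded to exactly 0 or 1 for extreme $z$, which yields $\log 0$. The form in B only exponentiates $-|z| \le 0$, so it cannot overflow.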
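
The difference shows up already at $z = 40$ in double precision. A minimal sketch in plain Python; `bce_naive` and `bce_with_logits` are hypothetical helper names for options A and B.

```python
import math

def bce_naive(z, y):
    """Option A: compute sigmoid first, then take logs."""
    p = 1.0 / (1.0 + math.exp(-z))
    return -y * math.log(p) - (1 - y) * math.log(1 - p)

def bce_with_logits(z, y):
    """Option B: max(z, 0) - z*y + log(1 + exp(-|z|))."""
    return max(z, 0.0) - z * y + math.log1p(math.exp(-abs(z)))

z, y = 40.0, 0.0
stable = bce_with_logits(z, y)
try:
    naive = bce_naive(z, y)
except ValueError:
    naive = None  # sigmoid rounded to 1.0, so log(1 - p) is log(0)
print(stable, naive)
```

For moderate logits the two functions agree; they only diverge once the sigmoid saturates.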

[NAT]

### 10. A two-layer MLP (no bias) is $f(x) = W_2 \, \text{ReLU}(W_1 x)$. With  
$W_1 = \begin{bmatrix} 1 & -2 \\ -3 & 4 \end{bmatrix}, \; W_2 = [2 \;\; -1], \; \text{and } x = \begin{bmatrix} 2 \\ 1 \end{bmatrix}.$  
Compute $f(x)$.

Ans: 0.

$W_1 x = [1 \cdot 2 + (-2) \cdot 1,\; -3 \cdot 2 + 4 \cdot 1] = [0,\; -2].$ ReLU $\rightarrow [0, 0].$ Then $W_2 [0, 0]^T = 0.$
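
The same forward pass in plain Python, alongside the result with the ReLU removed for contrast. `matvec` and `relu` are small hypothetical helpers.

```python
def matvec(W, x):
    """Matrix-vector product with W given as a list of rows."""
    return [sum(w * v for w, v in zip(row, x)) for row in W]

def relu(v):
    return [max(0.0, e) for e in v]

W1 = [[1.0, -2.0], [-3.0, 4.0]]
W2 = [[2.0, -1.0]]
x = [2.0, 1.0]

hidden = matvec(W1, x)
with_relu = matvec(W2, relu(hidden))[0]
without_relu = matvec(W2, hidden)[0]  # purely linear network, equal to (W2 W1) x
print(hidden, with_relu, without_relu)
```

Without the activation the network collapses to the single linear map $W_2 W_1 = [5 \;\; -8]$, which gives 2 on this input instead of 0.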


## Coding (10)

[NAT]

```python
import torch
torch.manual_seed(0)

x = torch.tensor([[0.5, -1.0, 2.0]])
W = torch.randn(3, 2) * (2/3)**0.5  # He-like for ReLU (fan_in=3)
b = torch.zeros(2)
h = torch.relu(x @ W + b)

V = torch.randn(2, 1) * (2/2)**0.5
y = h @ V
print(y.item())
```

### 11. Report the scalar printed (to 4 decimal places).

Ans: 0.3822. With a fixed seed and fixed shapes, the forward pass is deterministic; the He-like scaling keeps activations in a reasonable range. The exact value depends on the PyTorch version and RNG implementation, so verify it by execution.

[MCQ]

```python
import torch
torch.manual_seed(1)
logits = torch.tensor([[3.0, -1.0, 0.5]])
target = torch.tensor([0])
loss = torch.nn.functional.cross_entropy(logits, target)
```

### 12. Which statement is correct?

- [x] A. loss equals `-torch.log_softmax(logits, dim=1)[0,0]`.
- [ ] B. loss equals `-torch.log(torch.softmax(logits, dim=0)[0,0])`.
- [ ] C. loss is the same as MSE between logits and one-hot vector.
- [ ] D. loss depends on temperature scaling by default.

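Ans: `F.cross_entropy` is `log_softmax` followed by the negative log-likelihood loss. The softmax must be taken across `dim=1` (classes); over `dim=0`, every entry of this single-row batch becomes 1.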

[NAT]

```python
import torch
torch.manual_seed(2)

W = torch.tensor([1.0], requires_grad=True)   # scalar weight
x = torch.tensor([2.0])
y = torch.tensor([5.0])

# model: y_hat = W*x
y_hat = W * x
loss = ((y_hat - y)**2).mean()   
loss.backward()

eta = 0.1
with torch.no_grad():
    W -= eta * W.grad
print(W.item())
```

### 13. Compute the updated weight after one SGD step.

Ans: 2.2.

Loss gradient w.r.t. $W$: $2(xW - y)x = 2(2W - 5) \cdot 2 = 8W - 20$. With $W = 1$, the gradient is $-12$. Update: $W \leftarrow 1 - 0.1 \cdot (-12) = 2.2$.
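
The derivative can be cross-checked with a central finite difference. A sketch without autograd; the step size `h` is an arbitrary choice.

```python
def loss(w, x=2.0, y=5.0):
    return (w * x - y) ** 2

def analytic_grad(w, x=2.0, y=5.0):
    """d/dw of (w*x - y)**2."""
    return 2.0 * (w * x - y) * x

w, eta, h = 1.0, 0.1, 1e-6
numeric_grad = (loss(w + h) - loss(w - h)) / (2 * h)
w_new = w - eta * analytic_grad(w)
print(analytic_grad(w), round(numeric_grad, 4), round(w_new, 4))
```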

[MSQ]

```python
import torch
torch.manual_seed(3)
x = torch.randn(64, 100)  
lin1 = torch.nn.Linear(100, 100)
lin2 = torch.nn.Linear(100, 100)

# Case A: Sigmoid activations
hA = torch.sigmoid(lin1(x))
hA = torch.sigmoid(lin2(hA)).mean().item()

# Case B: ReLU activations
hB = torch.relu(lin1(x))
hB = torch.relu(lin2(hB)).mean().item()
```

### 14. Which are likely true with default PyTorch init?

- [ ] A. Case A tends to have smaller mean activations due to saturation around 0–1.
- [x] B. Case B activations are sparser but maintain larger variance than Case A.
- [x] C. Case A gradients are at higher risk of vanishing.
- [ ] D. Case B will necessarily explode.

Ans: Sigmoid squashes values into (0, 1), so with default init Case A's outputs cluster near 0.5: their mean is larger than Case B's, not smaller, and their variance is small. ReLU zeroes roughly half the units but keeps a wider spread and better gradient flow than sigmoid; explosion is not guaranteed.
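
One way to check these claims empirically is to measure the mean, variance, and sparsity of both outputs. The sketch below is a plain-Python stand-in, not the PyTorch code above: it assumes weights and biases drawn from U(-1/sqrt(fan_in), 1/sqrt(fan_in)), which mirrors the default `nn.Linear` range, and uses a smaller batch to stay fast.

```python
import math
import random
import statistics

random.seed(0)
BATCH, WIDTH = 16, 100
BOUND = 1.0 / math.sqrt(WIDTH)  # assumed default init range for a 100-input layer

def make_linear():
    weights = [[random.uniform(-BOUND, BOUND) for _ in range(WIDTH)] for _ in range(WIDTH)]
    bias = [random.uniform(-BOUND, BOUND) for _ in range(WIDTH)]
    return weights, bias

def forward(layer, rows, act):
    """Apply a linear layer followed by an elementwise activation."""
    weights, bias = layer
    return [[act(sum(w * v for w, v in zip(w_row, row)) + b)
             for w_row, b in zip(weights, bias)] for row in rows]

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def relu(z):
    return max(0.0, z)

x = [[random.gauss(0.0, 1.0) for _ in range(WIDTH)] for _ in range(BATCH)]
lin1, lin2 = make_linear(), make_linear()

out_a = [v for row in forward(lin2, forward(lin1, x, sigmoid), sigmoid) for v in row]
out_b = [v for row in forward(lin2, forward(lin1, x, relu), relu) for v in row]

mean_a, var_a = statistics.fmean(out_a), statistics.pvariance(out_a)
mean_b, var_b = statistics.fmean(out_b), statistics.pvariance(out_b)
zero_frac_b = sum(v == 0.0 for v in out_b) / len(out_b)
print(f"sigmoid: mean={mean_a:.3f} var={var_a:.4f}")
print(f"relu:    mean={mean_b:.3f} var={var_b:.4f} zeros={zero_frac_b:.2f}")
```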

[NAT]

```python
import torch
logits = torch.tensor([[1.5, 0.2]])
target = torch.tensor([1])
loss = torch.nn.functional.cross_entropy(logits, target)
print(loss.item())
```

### 15. Fill the numeric value (to 4 decimals).

Ans: 1.5410. Softmax over 2 classes with logits [1.5, 0.2]; CE = logsumexp(z) − z_y = log(e^{1.5} + e^{0.2}) − 0.2 ≈ 1.5410.
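
For two classes the loss reduces to a softplus of the logit gap, $\log(1 + e^{z_{\text{other}} - z_y})$. A minimal sketch of that identity; `softplus` is a hypothetical helper.

```python
import math

def softplus(t):
    """log(1 + exp(t)), written to stay stable for large |t|."""
    return max(t, 0.0) + math.log1p(math.exp(-abs(t)))

z_other, z_true = 1.5, 0.2  # logits of the wrong class and the target class
loss = softplus(z_other - z_true)
print(f"{loss:.4f}")
```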

[MCQ]

### 16. Suppose you optimize with `torch.optim.SGD(params, lr=0.1, weight_decay=1e-4)`. This weight_decay is equivalent to:

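- [ ] A. Adding $\lambda \|W\|_1$ to the loss and differentiating.
- [x] B. Adding $\lambda \|W\|_2^2$ to the loss and differentiating.
- [ ] C. Clipping weights by $\lambda$ every step.
- [ ] D. Decaying gradients by $\lambda$.

Ans: `weight_decay` in SGD is the classical L2 penalty. It adds $\lambda W$ to each gradient, which is the gradient of $\frac{\lambda}{2}\|W\|_2^2$, so B holds up to a constant factor of 2 in $\lambda$.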
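
The equivalence can be checked numerically for a scalar weight. The sketch assumes a made-up quadratic data loss and compares the weight-decay style update (adding `lam * w` to the gradient) with a step on the explicitly penalized loss, whose gradient is taken by central differences.

```python
def data_loss(w):
    return (w - 1.0) ** 2  # hypothetical data loss

def data_grad(w):
    return 2.0 * (w - 1.0)

def penalized_loss(w, lam):
    return data_loss(w) + 0.5 * lam * w ** 2

lr, lam, w, h = 0.1, 1e-4, 3.0, 1e-6

# Weight-decay form: add lam * w to the data gradient, then take an SGD step.
w_decay = w - lr * (data_grad(w) + lam * w)

# Penalty form: numerical gradient of the penalized loss, then the same SGD step.
num_grad = (penalized_loss(w + h, lam) - penalized_loss(w - h, lam)) / (2 * h)
w_penalty = w - lr * num_grad

print(w_decay, w_penalty)
```

The match relies on the factor 1/2 in the penalty; with $\lambda \|W\|_2^2$ exactly, the effective decay coefficient would be $2\lambda$.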

[MSQ]

### 17. With Adam (β₁=0.9, β₂=0.999, ε=1e-8), which are correct?

- [x] A. It scales updates inversely with the square root of second moments.
- [x] B. Bias correction counteracts initialization of moments at zero.
- [x] C. Using too large a learning rate can still destabilize training.
- [ ] D. Adam guarantees the global optimum in nonconvex settings.

Ans: Adam adapts per-parameter step sizes; bias correction is crucial early; no global optimality guarantee in deep nets.
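
Statement B is easiest to see with a constant gradient. A minimal sketch of the first-moment estimate only, assuming a hypothetical gradient of 1.0 at every step.

```python
b1 = 0.9
grad = 1.0  # hypothetical constant gradient
m = 0.0     # first moment starts at zero
rows = []
for t in range(1, 6):
    m = b1 * m + (1 - b1) * grad
    m_hat = m / (1 - b1 ** t)  # bias-corrected estimate
    rows.append((t, m, m_hat))
    print(t, round(m, 4), round(m_hat, 4))
```

The raw estimate `m` starts near 0.1 and only creeps toward the true mean, while `m_hat` recovers it from the first step.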

[NAT]

```python
import torch
torch.manual_seed(4)
x = torch.randn(1, 3)
W1 = torch.randn(3, 2, requires_grad=True)
W2 = torch.randn(2, 1, requires_grad=True)

h = torch.relu(x @ W1)
y = h @ W2
loss = (y**2).mean()
loss.backward()

print(W1.grad.norm().item())
```

### 18. Give the value of printed scalar (to 4 decimals).

Ans: 0.9486. Deterministic with a fixed seed. Tests that gradients propagate through the ReLU gates and the squared loss.

[MSQ]

```python
import torch, torch.nn as nn
torch.manual_seed(5)
drop = nn.Dropout(p=0.5)
x = torch.ones(10, 4)
y_train = drop(x)  # modules start in training mode
drop.eval()
y_eval = drop(x)
```

### 19. Which statements are correct?

- [ ] A. y_train will be all zeros.
- [ ] B. y_eval equals x scaled by (1-p).
- [x] C. y_train keeps a random ~50% of units and scales survivors by 1/(1-p).
- [x] D. y_eval equals exactly x.

Ans: In training mode, each unit is dropped independently with probability p and survivors are scaled by 1/(1-p) to preserve the expectation, so about half the entries of y_train are 0 and the rest are 2. In eval mode, Dropout is the identity, so y_eval equals x exactly.
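
The train/eval behavior can be mimicked in a few lines. This is a sketch of inverted dropout in plain Python, not the `nn.Dropout` implementation; `dropout` is a hypothetical helper and its random stream differs from PyTorch's.

```python
import random

def dropout(values, p=0.5, training=True):
    """Inverted dropout: zero each value with probability p, scale survivors by 1 / (1 - p)."""
    if not training:
        return list(values)  # identity in eval mode
    scale = 1.0 / (1.0 - p)
    return [0.0 if random.random() < p else v * scale for v in values]

random.seed(0)
x = [1.0] * 40
y_train = dropout(x, training=True)
y_eval = dropout(x, training=False)
print(sorted(set(y_train)), y_eval == x)
```

Every training-mode output is either 0 or 1/(1-p) = 2, and the eval-mode output is unchanged.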

[MCQ]

```python
import torch, torch.nn as nn
torch.manual_seed(6)
bn = nn.BatchNorm1d(3, momentum=0.1, affine=True, track_running_stats=True)

bn.train()
x1 = torch.randn(4, 3)
bn(x1)

x2 = torch.randn(4, 3)
bn(x2)

print(bn.running_mean)
```

### 20. What will be the running mean printed as?

- [ ] A. `tensor([-0.6533, -0.1270, -0.1000])`
- [x] B. `tensor([-0.0593, -0.1270, -0.0098])`
- [ ] C. `tensor([-0.0593, -0.2000, -0.1008])`
- [ ] D. `tensor([-0.0600, -0.1370, -0.0098])`

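Ans: BatchNorm updates running statistics with momentum. Starting from zeros, each training-mode call sets `running_mean` to `(1 - momentum) * running_mean + momentum * batch_mean`, so after two batches it equals `0.09 * mean(x1) + 0.1 * mean(x2)`. The exact numbers depend on the seeded RNG stream.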
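
The update rule can be traced by hand. A minimal sketch with made-up per-feature batch means (not the values produced by seed 6); `update_running_mean` is a hypothetical helper.

```python
def update_running_mean(running, batch_mean, momentum=0.1):
    """Exponential moving average used for BatchNorm running statistics."""
    return [(1 - momentum) * r + momentum * b for r, b in zip(running, batch_mean)]

running = [0.0, 0.0, 0.0]    # running mean starts at zero
mean_x1 = [0.5, -1.0, 0.2]   # hypothetical batch means of x1
mean_x2 = [-0.3, 0.4, 0.1]   # hypothetical batch means of x2

running = update_running_mean(running, mean_x1)
running = update_running_mean(running, mean_x2)
expected = [0.09 * a + 0.1 * b for a, b in zip(mean_x1, mean_x2)]
print(running)
```

After two updates the first batch contributes with weight 0.09 and the second with weight 0.1, so the running mean stays much smaller in magnitude than either batch mean.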