# Theory Questions (Total: 15)

[MSQ]

### 1. You build a 3-layer fully connected neural network in PyTorch using `torch.nn.Sequential`, with ReLU activations in the hidden layers and a Sigmoid activation in the output layer. Which of the following statements about its training dynamics are correct?

- [ ] A. Gradient vanishing is impossible because ReLU completely avoids it.
- [x] B. The Sigmoid at the output layer can cause gradients to vanish if the inputs are large in magnitude.
- [ ] C. Using ReLU ensures both vanishing and exploding gradients are eliminated.
- [x] D. Combining ReLU hidden layers with a Sigmoid output may still result in vanishing gradients in earlier layers.
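
The saturation behind options B and D can be illustrated with a minimal sketch in plain Python (no PyTorch needed). It evaluates the local derivative of the sigmoid, $\sigma'(z) = \sigma(z)(1 - \sigma(z))$, at a few hypothetical pre-activation values chosen only for illustration; the derivative peaks at 0.25 for $z = 0$ and shrinks rapidly as $|z|$ grows.

```python
import math

def sigmoid(z):
    """Logistic sigmoid."""
    return 1.0 / (1.0 + math.exp(-z))

def sigmoid_grad(z):
    """Local derivative of the sigmoid: s * (1 - s)."""
    s = sigmoid(z)
    return s * (1.0 - s)

# Hypothetical pre-activation values, chosen only for illustration.
for z in (0.0, 2.0, 5.0, 10.0):
    print(f"z = {z:5.1f}   sigmoid'(z) = {sigmoid_grad(z):.6f}")
```

Any upstream gradient is multiplied by this local factor during backpropagation, which is why a saturated sigmoid can shrink the signal reaching earlier layers.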

[MSQ]

### 2. In PyTorch, you define a `torch.nn.Linear` layer but forget to add a non-linear activation after it. Which statements are correct?

- [ ] A. PyTorch will throw a runtime error.
- [x] B. The layer behaves as a purely linear transformation.
- [x] C. The overall network may degenerate into a linear function if no other nonlinearities exist.
- [x] D. The model still learns, but its expressive capacity is severely limited.
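
The collapse described in option C can be checked with a small sketch. The weights, biases, and input below are hypothetical values picked for illustration; the point is only that two stacked affine layers without an activation equal a single affine layer with $W = W_2 W_1$ and $b = W_2 b_1 + b_2$.

```python
def matvec(W, x):
    """Matrix-vector product for nested lists."""
    return [sum(w * xi for w, xi in zip(row, x)) for row in W]

def matmul(A, B):
    """Matrix-matrix product for nested lists."""
    cols = list(zip(*B))
    return [[sum(a * b for a, b in zip(row, col)) for col in cols] for row in A]

# Hypothetical parameters and input (illustration only).
W1, b1 = [[1.0, -2.0], [0.0, 1.0]], [0.0, 1.0]
W2, b2 = [[2.0, -1.0]], [0.5]
x = [3.0, 4.0]

# Two linear layers applied one after the other, with no activation between.
h = [z + b for z, b in zip(matvec(W1, x), b1)]
stacked = [z + b for z, b in zip(matvec(W2, h), b2)]

# The same map written as one linear layer.
W = matmul(W2, W1)
b = [z + c for z, c in zip(matvec(W2, b1), b2)]
collapsed = [z + c for z, c in zip(matvec(W, x), b)]

print(stacked, collapsed)
```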

[NAT]

Consider a **2-layer feedforward neural network** in PyTorch:

- Input vector:  
  $$
  x = \begin{bmatrix} 1 \\ -1 \end{bmatrix}
  $$

- First layer: `Linear(2,2)` with weights  
  $$
  W^{[1]} = \begin{bmatrix} 1 & -2 \\ 0 & 1 \end{bmatrix}, \quad b^{[1]} = \begin{bmatrix} 0 \\ 1 \end{bmatrix}
  $$

- Activation: **tanh**

- Second layer: `Linear(2,1)` with weights  
  $$
  W^{[2]} = \begin{bmatrix} 2 & -1 \end{bmatrix}, \quad b^{[2]} = [0]
  $$

- Output activation: **Sigmoid**

- Target: $y = 1$, loss = **binary cross-entropy**.

### 3. Compute the gradient of the loss with respect to the weights of the second layer, $\frac{\partial L}{\partial W^{[2]}}$. Finally, report the sum of all elements of this gradient vector.

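Ans: −0.120 (with $z^{[1]} = W^{[1]}x + b^{[1]} = [3, 0]^\top$, $a^{[1]} = [\tanh 3, 0]^\top \approx [0.995, 0]^\top$, and $\hat{y} = \sigma(1.990) \approx 0.880$, the gradient is $\frac{\partial L}{\partial W^{[2]}} = (\hat{y} - y)\,a^{[1]\top} \approx [-0.120, 0]$)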
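
One way to check the hand computation is a short sketch in plain Python. It assumes the usual convention $z = Wx + b$ for each layer and binary cross-entropy with the natural logarithm, for which the gradient at the output pre-activation simplifies to $\hat{y} - y$.

```python
import math

x = [1.0, -1.0]
W1, b1 = [[1.0, -2.0], [0.0, 1.0]], [0.0, 1.0]
W2, b2 = [2.0, -1.0], 0.0
y = 1.0

# Forward pass: linear -> tanh -> linear -> sigmoid.
z1 = [sum(w * xi for w, xi in zip(row, x)) + b for row, b in zip(W1, b1)]
a1 = [math.tanh(z) for z in z1]
z2 = sum(w * a for w, a in zip(W2, a1)) + b2
y_hat = 1.0 / (1.0 + math.exp(-z2))

# Backward pass: for sigmoid + BCE, dL/dz2 = y_hat - y.
delta = y_hat - y
grad_W2 = [delta * a for a in a1]  # dL/dW2 = delta * a1^T

print("z1 =", z1)
print("grad_W2 =", grad_W2)
print("sum =", sum(grad_W2))
```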

[MCQ]

### 4. Which PyTorch optimizer is generally the least sensitive to learning rate tuning?

- [ ] A. `torch.optim.SGD`
- [ ] B. `torch.optim.RMSprop`
- [x] C. `torch.optim.Adam`
- [ ] D. `torch.optim.Adagrad`

[MSQ]

### 5. Consider two models trained on the same dataset: one with `torch.optim.SGD`, another with `torch.optim.Adam`. Which statements are correct?

- [x] A. Adam adapts the learning rate for each parameter based on historical gradients.
- [x] B. SGD with momentum can approximate Adam’s performance in some tasks.
- [ ] C. Adam always converges faster than SGD in all scenarios.
- [x] D. SGD may generalize better than Adam in large-scale classification tasks.
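
Option A can be made concrete with a sketch of a single update for one scalar parameter. It follows the standard Adam update rule with the common default hyperparameters (`beta1=0.9`, `beta2=0.999`, `eps=1e-8`); the gradient values and the learning rate of 0.01 are hypothetical. On the very first step, bias correction makes Adam's update roughly `lr` in magnitude whatever the gradient scale, while the SGD update scales linearly with the gradient.

```python
import math

def sgd_step(grad, lr=0.01):
    """Plain SGD update for one scalar parameter."""
    return -lr * grad

def adam_first_step(grad, lr=0.01, beta1=0.9, beta2=0.999, eps=1e-8):
    """Adam update at step t = 1, starting from zero moment estimates."""
    m = (1 - beta1) * grad        # first-moment estimate
    v = (1 - beta2) * grad ** 2   # second-moment estimate
    m_hat = m / (1 - beta1)       # bias correction for t = 1
    v_hat = v / (1 - beta2)
    return -lr * m_hat / (math.sqrt(v_hat) + eps)

# Hypothetical gradient magnitudes spanning several orders of magnitude.
for g in (0.001, 1.0, 100.0):
    print(f"grad = {g:7.3f}   SGD: {sgd_step(g):+.5f}   Adam: {adam_first_step(g):+.5f}")
```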

[MCQ]

### 6. You experiment with different hidden-layer activations in PyTorch. Which is most prone to dead neurons?

- [ ] A. `Sigmoid`
- [ ] B. `Tanh`
- [x] C. `ReLU`
- [ ] D. `LeakyReLU`
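
The difference between options C and D comes down to the local gradient for negative pre-activations, sketched below. The pre-activation values are hypothetical, and the slope of 0.01 matches the default `negative_slope` of `torch.nn.LeakyReLU`.

```python
def relu_grad(z):
    """Derivative of ReLU: zero for non-positive inputs."""
    return 1.0 if z > 0 else 0.0

def leaky_relu_grad(z, negative_slope=0.01):
    """Derivative of LeakyReLU: a small non-zero slope for non-positive inputs."""
    return 1.0 if z > 0 else negative_slope

# Hypothetical pre-activations of a single neuron across four samples.
pre_activations = [-3.0, -0.5, 0.5, 3.0]
relu_grads = [relu_grad(z) for z in pre_activations]
leaky_grads = [leaky_relu_grad(z) for z in pre_activations]

print("ReLU     :", relu_grads)
print("LeakyReLU:", leaky_grads)
```

If a ReLU neuron's pre-activation is negative for every training sample, its gradient is zero everywhere and its incoming weights stop updating, whereas LeakyReLU still passes a small gradient.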

[MSQ]

### 7. Suppose you implement a classifier with `torch.nn.Softmax(dim=1)` at the output. Which of the following are true?

- [x] A. The output probabilities always sum to 1.
- [x] B. This is suitable for multi-class single-label problems.
- [ ] C. This is suitable for multi-class multi-label problems.
- [x] D. The most commonly used loss function for such problems is `torch.nn.CrossEntropyLoss`.
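
Options A and D can be sketched in plain Python with hypothetical logits. The cross-entropy below is computed directly from the logits via log-sum-exp, which mirrors how `torch.nn.CrossEntropyLoss` expects unnormalized logits as input; a model that ends in `Softmax` would therefore normally pass the pre-softmax scores to that loss.

```python
import math

def softmax(logits):
    """Numerically stable softmax over a list of logits."""
    m = max(logits)
    exps = [math.exp(v - m) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]

def cross_entropy(logits, target):
    """Cross-entropy from raw logits: log-sum-exp minus the target logit."""
    m = max(logits)
    log_sum_exp = m + math.log(sum(math.exp(v - m) for v in logits))
    return log_sum_exp - logits[target]

# Hypothetical logits for one sample with three classes.
logits = [2.0, 0.5, -1.0]
probs = softmax(logits)
loss = cross_entropy(logits, target=0)

print("probabilities:", probs, "sum:", sum(probs))
print("loss:", loss, " -log(p_target):", -math.log(probs[0]))
```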

[NAT]

Consider a **3-hidden-layer PyTorch network** with width 2 and **sigmoid activation** in every hidden layer.

- Input:  
  $$
  x = \begin{bmatrix} 2 \\ -2 \end{bmatrix}
  $$

- All weights initialized as identity matrices, biases = 0.

- Output: a single linear neuron.

- Target: $y = 3$, loss = **MSE**.

### 8. Compute the gradient of the loss with respect to the input, $\frac{\partial L}{\partial x}$. Finally, report the sum of all elements of this gradient vector.

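Ans: −0.037 (assuming output weights $[1 \;\; 1]$, zero output bias, and $L = (\hat{y} - y)^2$; then $\hat{y} \approx 1.299$ and $\frac{\partial L}{\partial x} \approx [-0.016, -0.021]^\top$)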
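
The backward pass can be traced with a short sketch. The question does not specify the output neuron's weights, so the code assumes they are all ones with zero bias, and it assumes $L = (\hat{y} - y)^2$ without a factor of $\tfrac{1}{2}$; both are assumptions, and a different choice changes the number.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

x = [2.0, -2.0]
y = 3.0
w_out = [1.0, 1.0]  # ASSUMPTION: output weights are all ones, bias is zero

# Forward pass: identity weights and zero biases make each hidden layer
# an element-wise sigmoid of the previous activations.
activations = [x]
for _ in range(3):
    activations.append([sigmoid(v) for v in activations[-1]])
y_hat = sum(w * a for w, a in zip(w_out, activations[-1]))

# Backward pass: start from dL/da3, then multiply by each layer's
# sigmoid derivative a * (1 - a), from the last hidden layer to the first.
grad = [2.0 * (y_hat - y) * w for w in w_out]
for a in reversed(activations[1:]):
    grad = [g * ai * (1.0 - ai) for g, ai in zip(grad, a)]

print("y_hat =", y_hat)
print("dL/dx =", grad, "sum =", sum(grad))
```

Each hidden layer contributes a factor of at most 0.25, which is why the input gradient is much smaller than the gradient at the output.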

[MSQ]

### 9. Why is proper weight initialization critical for deep PyTorch models?

- [x] A. Poor initialization may cause exploding/vanishing gradients.
- [ ] B. Initialization only affects convergence speed, not final performance.
- [x] C. Xavier/Glorot initialization stabilizes variance of activations across layers.
- [ ] D. Random unscaled initialization is sufficient for deep networks.

[MSQ]

### 10. Which of the following directly mitigate vanishing gradients in deep PyTorch networks?

- [ ] A. Replace ReLU with Sigmoid.
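- [x] B. Apply batch normalization (e.g., `torch.nn.BatchNorm1d` or `torch.nn.BatchNorm2d`).
- [x] C. Use residual (skip) connections, e.g., `out = x + block(x)` in a custom `nn.Module`.
- [x] D. Use He (Kaiming) initialization (`torch.nn.init.kaiming_normal_`).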
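
The effect of skip connections can be illustrated with a toy scalar sketch. It compares a plain stack of sigmoid layers, $x_{k+1} = \sigma(x_k)$, against a residual stack, $x_{k+1} = x_k + \sigma(x_k)$, and accumulates the derivative of the final output with respect to the input. The depth and starting value are hypothetical, and real networks also involve weight matrices that this sketch omits.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def plain_chain_grad(x, depth):
    """d(output)/d(input) for x_{k+1} = sigmoid(x_k)."""
    grad = 1.0
    for _ in range(depth):
        s = sigmoid(x)
        grad *= s * (1.0 - s)        # each factor is at most 0.25
        x = s
    return grad

def residual_chain_grad(x, depth):
    """d(output)/d(input) for x_{k+1} = x_k + sigmoid(x_k)."""
    grad = 1.0
    for _ in range(depth):
        s = sigmoid(x)
        grad *= 1.0 + s * (1.0 - s)  # the identity path adds 1 to each factor
        x = x + s
    return grad

# Hypothetical depth and input, for illustration only.
print("plain   :", plain_chain_grad(0.5, depth=10))
print("residual:", residual_chain_grad(0.5, depth=10))
```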

[MCQ]

### 11. Which of the following does not dynamically adjust the learning rate during PyTorch training?

- [ ] A. `torch.optim.lr_scheduler.ReduceLROnPlateau`
- [ ] B. `torch.optim.lr_scheduler.StepLR`
- [ ] C. `torch.optim.lr_scheduler.ExponentialLR`
- [x] D. Fixed `lr` in `torch.optim.Adam`

[MSQ]

### 12. What is the effect of different activations on optimization geometry?

- [x] A. ReLU partitions space into piecewise linear regions, simplifying optimization.
- [x] B. Sigmoid compresses outputs into (0, 1) and saturates for large |input|, often leading to near-zero gradients.
- [x] C. Tanh centers outputs at 0, often helping convergence vs sigmoid.
- [ ] D. Softmax in hidden layers improves optimization geometry.

[MSQ]

### 13. Two otherwise identical PyTorch models are trained, but one uses `torch.nn.BatchNorm1d`. Which statements are true?

- [x] A. BatchNorm reduces internal covariate shift.
- [ ] B. BatchNorm eliminates the need for activations.
- [x] C. BatchNorm allows training with higher learning rates.
- [ ] D. BatchNorm guarantees higher test accuracy.

[MSQ]

### 14. During training, you notice oscillations in loss values across epochs. Which remedies are valid?

- [x] A. Reduce learning rate in the optimizer.
- [x] B. Switch from SGD to Adam.
- [x] C. Add momentum to SGD.
- [ ] D. Remove non-linear activations to simplify optimization.
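
Remedy A can be illustrated with gradient descent on a one-dimensional quadratic, $f(w) = \tfrac{1}{2} c\, w^2$. The curvature, starting point, and learning rates below are hypothetical. With a learning rate close to $2/c$ the iterate overshoots the minimum and flips sign on every step; a smaller learning rate approaches it smoothly.

```python
def gradient_descent(lr, curvature=10.0, w0=1.0, steps=5):
    """Run plain gradient descent on f(w) = 0.5 * curvature * w**2."""
    w = w0
    trajectory = [w]
    for _ in range(steps):
        grad = curvature * w   # f'(w)
        w = w - lr * grad
        trajectory.append(w)
    return trajectory

# Hypothetical learning rates: one near the stability limit 2 / curvature = 0.2,
# one well below it.
oscillating = gradient_descent(lr=0.19)
smooth = gradient_descent(lr=0.05)

print("lr = 0.19:", [round(w, 4) for w in oscillating])
print("lr = 0.05:", [round(w, 4) for w in smooth])
```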

[NAT]

You train a **3-layer MLP** in PyTorch:

- Input dimension = 2
- Hidden Layer 1: 2 neurons, activation = ReLU
- Hidden Layer 2: 2 neurons, activation = ReLU
- Output: 1 neuron, activation = Sigmoid

Weights/biases:

$$
W^{[1]} = \begin{bmatrix} 1 & -1 \\ 2 & 0 \end{bmatrix}, \quad b^{[1]} = \begin{bmatrix} 0 \\ 1 \end{bmatrix}
$$

$$
W^{[2]} = \begin{bmatrix} 1 & 0 \\ -1 & 2 \end{bmatrix}, \quad b^{[2]} = \begin{bmatrix} 0 \\ 0 \end{bmatrix}
$$

$$
W^{[3]} = \begin{bmatrix} 1 & -2 \end{bmatrix}, \quad b^{[3]} = [0]
$$

Input:  
$$
x = \begin{bmatrix} 1 \\ 2 \end{bmatrix}, \quad \text{Target: } y = 0
$$

### 15. Compute the final output prediction $\hat{y}$ of the network after the forward pass.

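Ans: ≈ 6.14 × 10⁻⁶ (with $a^{[1]} = \mathrm{ReLU}([-1, 3]^\top) = [0, 3]^\top$, $a^{[2]} = \mathrm{ReLU}([0, 6]^\top) = [0, 6]^\top$, and $z^{[3]} = -12$, the output is $\hat{y} = \sigma(-12)$, which rounds to 0.000 at three decimal places)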
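
The forward pass can be traced layer by layer with a short sketch, assuming the usual convention $z = Wx + b$ for each layer.

```python
import math

def linear(W, b, x):
    """Affine map W x + b for nested-list weights."""
    return [sum(w * xi for w, xi in zip(row, x)) + bi for row, bi in zip(W, b)]

def relu(v):
    return [max(0.0, t) for t in v]

W1, b1 = [[1.0, -1.0], [2.0, 0.0]], [0.0, 1.0]
W2, b2 = [[1.0, 0.0], [-1.0, 2.0]], [0.0, 0.0]
W3, b3 = [[1.0, -2.0]], [0.0]
x = [1.0, 2.0]

a1 = relu(linear(W1, b1, x))          # hidden layer 1
a2 = relu(linear(W2, b2, a1))         # hidden layer 2
z3 = linear(W3, b3, a2)[0]            # output pre-activation
y_hat = 1.0 / (1.0 + math.exp(-z3))   # sigmoid output

print("a1 =", a1)
print("a2 =", a2)
print("z3 =", z3)
print("y_hat =", y_hat)
```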

# Coding Questions (Total: 6)

[MSQ]

```python
import torch
import torch.nn as nn
import torch.optim as optim

X = torch.tensor([[0.,0.],[0.,1.],[1.,0.],[1.,1.]])
y = torch.tensor([[0.],[1.],[1.],[0.]])

class MLP(nn.Module):
    def __init__(self):
        super().__init__()
        self.layers = nn.Sequential(
            nn.Linear(2,4),
            nn.Tanh(),
            nn.Linear(4,1),
            nn.Sigmoid()
        )
    def forward(self,x):
        return self.layers(x)

model = MLP()
criterion = nn.BCELoss()
optimizer = optim.SGD(model.parameters(), lr=0.1)

for epoch in range(200):
    optimizer.zero_grad()
    output = model(X)
    loss = criterion(output,y)
    loss.backward()
    optimizer.step()

print(loss.item())
```

### 1. Which of the following are true?

- [ ] A. Without non-linear activation, this model can solve XOR.  
- [x] B. If we replace Tanh with ReLU, convergence might be faster.
- [x] C. If we omit `optimizer.zero_grad()`, gradients accumulate across iterations and the weights may diverge.
- [ ] D. Sigmoid in the final layer is necessary for mean squared error loss calculation.

[NAT]

```python
import torch
import torch.nn as nn

torch.manual_seed(42)

X = torch.tensor([[1.,-1.]])
y = torch.tensor([[1.]])

model = nn.Sequential(
    nn.Linear(2,1),
    nn.Sigmoid()
)

with torch.no_grad():
    for p in model.parameters():
        p.fill_(1.0)

criterion = nn.BCELoss()
output = model(X)
loss = criterion(output,y)
print(output.item(), loss.item())
```

### 2. What is the loss value printed? (Round to 3 decimal places)

Ans: 0.313
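
The printed values can be reproduced by hand with a short sketch. Because every parameter is filled with 1.0, the weights are `[1, 1]` and the bias is `1`, so the random seed has no effect on the result.

```python
import math

w, b = [1.0, 1.0], 1.0   # all parameters filled with 1.0
x, y = [1.0, -1.0], 1.0

z = sum(wi * xi for wi, xi in zip(w, x)) + b   # 1 - 1 + 1 = 1
output = 1.0 / (1.0 + math.exp(-z))            # sigmoid(1)
loss = -(y * math.log(output) + (1.0 - y) * math.log(1.0 - output))  # BCE

print(round(output, 3), round(loss, 3))  # 0.731 0.313
```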

[MSQ]

```python
import torch
import torch.nn as nn
import torch.optim as optim
import torch.nn.functional as F

X = torch.randn(5,10)
y = torch.randint(0,3,(5,))

model = nn.Linear(10,3)
optimizer = optim.Adam(model.parameters(), lr=0.01)

for epoch in range(50):
    optimizer.zero_grad()
    logits = model(X)
    loss = F.cross_entropy(logits,y)
    loss.backward()
    optimizer.step()
```

### 3. Which of the following statements are incorrect?

- [ ] A. `cross_entropy` internally applies softmax to logits.
- [ ] B. Changing labels `y` to integer (`long`) one-hot vectors of shape `(5, 3)` would break this code.
- [ ] C. The final linear layer must have 3 outputs for this setup.
- [x] D. Replacing Adam with SGD(lr=0.01) would guarantee identical convergence.

[NAT]

```python
import torch
import torch.nn as nn

torch.manual_seed(0)

X = torch.tensor([[2., -3.]])
layer = nn.Linear(2,1)
with torch.no_grad():
    layer.weight[:] = torch.tensor([[1.,-2.]])
    layer.bias[:] = torch.tensor([0.5])

output = layer(X)
print(output.item())
```

### 4. What is the numerical output value printed?

Ans: 8.5

[MSQ]

```python
# Reuses `model`, `X`, `y`, `optimizer`, `optim`, and `F` from the previous snippet.
scheduler = optim.lr_scheduler.StepLR(optimizer, step_size=10, gamma=0.1)

for epoch in range(30):
    optimizer.zero_grad()
    logits = model(X)
    loss = F.cross_entropy(logits,y)
    loss.backward()
    optimizer.step()
    scheduler.step()
```

### 5. Which statements are true?

- [x] A. Learning rate decreases every 10 epochs by a factor of 0.1.
- [x] B. If base LR = 0.01, then the LR used during epoch 21 is 0.0001.
- [ ] C. Scheduler must be stepped before `optimizer.step()` to function correctly.
- [x] D. This helps avoid getting stuck in sharp local minima.
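
The schedule can be tabulated with a closed-form sketch. The helper `step_lr` is hypothetical, not part of PyTorch; it assumes `scheduler.step()` is called once per epoch after `optimizer.step()`, as in the loop above, so that `epoch` equals the number of scheduler steps already taken. The base learning rate of 0.01 is taken from option B.

```python
def step_lr(base_lr, epoch, step_size=10, gamma=0.1):
    """Learning rate in effect during `epoch` (0-indexed) under a step decay."""
    return base_lr * gamma ** (epoch // step_size)

# Learning rate at a few epochs around the decay boundaries.
schedule = {epoch: step_lr(0.01, epoch) for epoch in (0, 9, 10, 19, 20, 21, 29)}
for epoch, lr in schedule.items():
    print(f"epoch {epoch:2d}: lr = {lr:.6f}")
```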

[NAT]

```python
import torch
import torch.nn as nn
import torch.optim as optim

torch.manual_seed(7)

X = torch.tensor([[1.,2.,3.]])
y = torch.tensor([[1.]])

model = nn.Sequential(
    nn.Linear(3,1),
    nn.Sigmoid()
)

criterion = nn.BCELoss()
optimizer = optim.SGD(model.parameters(), lr=0.1)

output = model(X)
loss = criterion(output,y)
loss.backward()

grads = []
for p in model.parameters():
    grads.append(p.grad.sum().item())

print(sum(grads))
```

### 6. What is the sum of all gradient elements printed? (Round to 3 decimal places)

Ans: -0.578
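
The structure of this result can be sketched without PyTorch. For a sigmoid output with binary cross-entropy, $\frac{\partial L}{\partial w} = (\hat{y} - y)\,x$ and $\frac{\partial L}{\partial b} = \hat{y} - y$, so with $x = [1, 2, 3]$ the printed sum equals $(1 + 2 + 3 + 1)(\hat{y} - y) = 7(\hat{y} - 1)$. The parameters below are hypothetical and are not the values produced by `torch.manual_seed(7)`; the exact number depends on that random initialization.

```python
import math

def grad_sum(w, b, x=(1.0, 2.0, 3.0), y=1.0):
    """Sum of all weight and bias gradients for Linear(3, 1) + Sigmoid + BCE."""
    z = sum(wi * xi for wi, xi in zip(w, x)) + b
    y_hat = 1.0 / (1.0 + math.exp(-z))
    delta = y_hat - y                   # dL/dz for sigmoid + BCE
    grad_w = [delta * xi for xi in x]   # dL/dw
    grad_b = delta                      # dL/db
    return sum(grad_w) + grad_b, y_hat

# Hypothetical parameters (NOT the seed-7 initialization).
total, y_hat = grad_sum(w=[0.1, -0.2, 0.3], b=0.05)
print(total, 7.0 * (y_hat - 1.0))
```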