# Theory Questions (Total: 7)

[MSQ]

### 1. We train a fully connected PyTorch network with 3 hidden layers of ReLU activations. During training, it is observed that:
- The first layer’s gradients are nearly zero.
- The later layers still update normally.

Which of the following could explain this?

- [x] A. Improper weight initialization (e.g., large negative values) causing dead ReLUs in early layers.
- [ ] B. Exploding gradients in deeper layers cancel out earlier layer updates.
- [ ] C. The non-linearity of ReLU creates a non-symmetric optimization landscape.
- [x] D. Backpropagated gradients shrink as they pass through multiple layers with zeros from ReLU.

Ans:

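If early-layer pre-activations are pushed negative (for example by large negative initial weights), many ReLU units output 0. Their local derivative is then 0, so almost no gradient reaches the first layer's weights: the units are dead or near-dead.
During backpropagation, gradients are multiplied by the ReLU gates (0 or 1) and by the weight matrices. Every layer with many closed gates zeroes out part of the signal, so the gradient arriving at the first layer can be much smaller than the one seen by later layers. Exploding gradients in deeper layers (B) would enlarge earlier gradients rather than cancel them, and (C) describes a general property that does not explain why only the first layer stalls.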
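
The dead-ReLU effect can be reproduced with a deliberately exaggerated setup. The sketch below is only illustrative: the layer sizes, the squared-output loss, and the bias value of -100 are assumptions chosen so that every first-layer unit is dead, which is more extreme than a typical bad initialization.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)

# Three hidden ReLU layers, as in the question (sizes are illustrative).
model = nn.Sequential(
    nn.Linear(10, 16), nn.ReLU(),
    nn.Linear(16, 16), nn.ReLU(),
    nn.Linear(16, 16), nn.ReLU(),
    nn.Linear(16, 1),
)

# Assumption: mimic a bad initialization by pushing every first-layer
# pre-activation far below zero.
with torch.no_grad():
    model[0].bias.fill_(-100.0)

x = torch.randn(32, 10)
loss = model(x).pow(2).mean()  # arbitrary loss, only used to get gradients
loss.backward()

first_layer_grad = model[0].weight.grad.abs().max().item()
last_layer_grad = model[6].bias.grad.abs().max().item()
print(f"first layer max |grad|: {first_layer_grad:.3e}")
print(f"last layer  max |grad|: {last_layer_grad:.3e}")
```

With all first-layer gates closed, the first layer's weight gradient is exactly zero while the output layer still receives a gradient.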

[MCQ]

### 2. Suppose we have a binary classification task using sigmoid output with BCELoss. It is observed that the model predicts values very close to 0 or 1 even in the first few epochs. What does this suggest?

- [ ] A. The learning rate is too low.
- [x] B. The weights are initialized with too large magnitudes.
- [ ] C. The optimizer is not updating parameters.
- [ ] D. The loss function is unsuitable for sigmoid outputs.

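Ans: Very large initial weights produce large-magnitude logits, which push the sigmoid into saturation. Predictions then look “confident” from the first epochs, while the sigmoid's local gradient is close to 0 (not exactly 0). A low learning rate (A) causes tiny updates but does not create saturated outputs. If the optimizer were not updating parameters (C), outputs would stay near their random initial values. (D) is false: BCE with a sigmoid output is the standard pairing.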
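
A rough way to see the saturation effect is to compare how many sigmoid outputs land near 0 or 1 at initialization for small and large weight scales. This is a sketch under assumptions: the input size, the two standard deviations, and the 0.01/0.99 thresholds are illustrative choices, and `saturated_fraction` is a hypothetical helper.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
x = torch.randn(256, 20)  # stand-in for normalized inputs

def saturated_fraction(weight_std):
    """Fraction of sigmoid outputs below 0.01 or above 0.99 at initialization."""
    layer = nn.Linear(20, 1)
    with torch.no_grad():
        layer.weight.normal_(0.0, weight_std)
        layer.bias.zero_()
        p = torch.sigmoid(layer(x))
    return ((p < 0.01) | (p > 0.99)).float().mean().item()

small_init = saturated_fraction(0.05)
large_init = saturated_fraction(10.0)
print(f"std=0.05 -> {small_init:.2f} saturated, std=10 -> {large_init:.2f} saturated")
```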

[MSQ]

In PyTorch, we have the following model:

```python
model = nn.Sequential(
    nn.Linear(100, 50),
    nn.ReLU(),
    nn.Linear(50, 1),
    nn.Sigmoid()
)
```

### 3. Suppose input features are normalized and mean-centered. Which of the following would happen if you replaced ReLU with Sigmoid in the hidden layer?

- [x] A. Risk of vanishing gradients increases.
- [x] B. Effective capacity of the network decreases.
- [x] C. The model may still fit training data but require significantly more epochs.
- [ ] D. Exploding gradients become more common.

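Ans: A sigmoid hidden layer reintroduces saturation, so vanishing gradients become more likely (A): the sigmoid's derivative is at most 0.25 and approaches 0 in both tails. Saturated hidden units are nearly constant and units near the origin are nearly linear, so the network uses less of its non-linearity and its effective capacity drops (B). The model can still fit the training data, but slower gradient flow means it may need significantly more epochs (C). Sigmoid does not make exploding gradients more common (D); those are more typically associated with unbounded activations such as ReLU, poor initialization, or gating issues in recurrent models such as LSTMs.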
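
The difference in local gradients can be checked directly with autograd. The snippet below is a small sketch; the evaluation range of -6 to 6 is an arbitrary choice for illustration.

```python
import torch

z = torch.linspace(-6.0, 6.0, 121, requires_grad=True)

# d/dz sigmoid(z), obtained with autograd.
torch.sigmoid(z).sum().backward()
sigmoid_grad = z.grad.clone()

# d/dz relu(z), for comparison (reset the stored gradient first).
z.grad = None
torch.relu(z).sum().backward()
relu_grad = z.grad.clone()

print(f"max sigmoid'(z): {sigmoid_grad.max().item():.4f}")
print(f"sigmoid'(6):     {sigmoid_grad[-1].item():.4f}")
print(f"max relu'(z):    {relu_grad.max().item():.1f}")
```

The sigmoid's slope peaks at 0.25 around z = 0 and is already tiny at z = 6, whereas ReLU passes a gradient of 1 for every positive input.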

[NAT]

### 4. You train a 2-layer MLP with tanh activations on normalized data. During training, you observe that the gradients of the first layer’s weights are consistently around 10^-5. What is the most likely long-term effect on model convergence?

Ans: Stagnation

If first-layer gradients stay around 10^-5, the corresponding parameter updates are negligible. Training will stall unless you change the activation, initialization, or learning rate, or add normalization or residual connections.
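
To get a feel for the scale, here is a back-of-the-envelope calculation for plain SGD. The learning rate and step count are assumptions for illustration; the question specifies only the gradient magnitude.

```python
lr = 0.01        # assumed learning rate, not given in the question
grad = 1e-5      # gradient magnitude from the question
steps = 10_000   # assumed number of SGD steps

update_per_step = lr * grad
# Upper bound on drift: assumes every update pushes in the same direction.
total_drift = update_per_step * steps

print(f"update per step: {update_per_step:.1e}")
print(f"max drift after {steps} steps: {total_drift:.1e}")
```

Under these assumptions, each weight moves by about 1e-7 per step, and by at most about 1e-3 after ten thousand steps.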

[MCQ]

### 5. True or false: using torch.nn.Softmax with CrossEntropyLoss is redundant because CrossEntropyLoss already applies log-softmax internally.

- [x] A. True, because PyTorch expects raw logits.
- [ ] B. False, because CrossEntropyLoss expects probabilities as input.

Ans: CrossEntropyLoss expects raw logits and internally applies LogSoftmax followed by NLLLoss. Passing already-softmaxed probabilities applies softmax twice, which flattens the loss signal and harms training.
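
The LogSoftmax + NLLLoss decomposition can be verified numerically. This is a minimal check on random logits; the shapes and targets are arbitrary.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
logits = torch.randn(4, 3)
targets = torch.tensor([0, 1, 2, 1])

# Built-in loss applied to raw logits.
ce = nn.CrossEntropyLoss()(logits, targets)

# The same quantity assembled by hand: log-softmax, then negative log-likelihood.
manual = nn.NLLLoss()(nn.LogSoftmax(dim=1)(logits), targets)

print(ce.item(), manual.item())
```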


[NAT]

Suppose you have a 2-layer network:

- Layer 1: Linear(2,2) with weights = identity matrix, no bias, activation = ReLU
- Layer 2: Linear(2,1) with weights = [1,1], bias = 0, activation = Sigmoid

With input x = [1, −2].

### 6. What is the final predicted output (rounded to 3 decimals)?

Ans: 0.731

- Layer1: z1=x=[1,−2], ReLU → a1=[1,0].
- Layer2 pre-act: 1⋅1+1⋅0=1.
- Sigmoid(1) = 0.731
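
The hand computation above can be reproduced in PyTorch. The sketch below builds the two layers with the stated weights; the variable names are illustrative.

```python
import torch
import torch.nn as nn

layer1 = nn.Linear(2, 2, bias=False)
layer2 = nn.Linear(2, 1)
with torch.no_grad():
    layer1.weight.copy_(torch.eye(2))                 # identity weights
    layer2.weight.copy_(torch.tensor([[1.0, 1.0]]))   # weights [1, 1]
    layer2.bias.zero_()                               # bias 0

x = torch.tensor([[1.0, -2.0]])
with torch.no_grad():
    prediction = torch.sigmoid(layer2(torch.relu(layer1(x)))).item()

print(round(prediction, 3))  # 0.731
```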

[MSQ]

### 7. You experiment with a custom optimizer that skips gradient updates every alternate step. What issues could arise?

- [x] A. Convergence slows down compared to normal optimizers.
- [x] B. Gradient information may decay before being used.
- [x] C. Loss curve oscillations may increase.
- [ ] D. The model is mathematically guaranteed to diverge.

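Ans: Skipping every other update halves the effective update frequency, so convergence slows down (A). Gradient information from skipped steps can become stale before it is used (B), and the uneven update pattern can make the loss curve oscillate more (C). None of this guarantees divergence (D).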
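
The slowdown in option A can be illustrated on a toy problem. The sketch below runs plain gradient descent on f(w) = w², with and without skipping every other update. The function, learning rate, and step count are assumptions, `final_loss` is a hypothetical helper, and the example does not model staleness or oscillation.

```python
def final_loss(skip_alternate, steps=20, lr=0.1, w=1.0):
    """Run gradient descent on f(w) = w**2 and return the final loss."""
    for step in range(steps):
        grad = 2.0 * w
        if skip_alternate and step % 2 == 1:
            continue  # gradient is computed but never applied
        w -= lr * grad
    return w ** 2

normal = final_loss(skip_alternate=False)
skipping = final_loss(skip_alternate=True)
print(f"normal: {normal:.2e}, skipping alternate steps: {skipping:.2e}")
```

With the same step budget, the skipping variate applies only half the updates and ends at a noticeably higher loss, but it still converges.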

# Coding Questions (Total: 6)

[MSQ]

```python
import torch
import torch.nn as nn
import torch.optim as optim

X = torch.randn(32, 10)
y = torch.randint(0, 2, (32,1)).float()

model = nn.Sequential(
    nn.Linear(10, 16),
    nn.ReLU(),
    nn.Linear(16, 1),
    nn.Sigmoid()
)

criterion = nn.BCELoss()
optimizer = optim.SGD(model.parameters(), lr=0.01)

for epoch in range(5):
    for i in range(0, 32, 8):
        xb, yb = X[i:i+8], y[i:i+8]
        output = model(xb)
        loss = criterion(output, yb)
        loss.backward()
        optimizer.step()
```

### 1. Which of the following are true?

- [ ] A. The final parameter updates will be equivalent to full-batch training.  
- [x] B. Gradients will accumulate across mini-batches, making updates unstable.
- [x] C. The model may still train but with unpredictable dynamics.
- [x] D. Adding optimizer.zero_grad() fixes this training loop.

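Ans: The loop never calls optimizer.zero_grad(), so PyTorch adds each mini-batch's gradients to the ones already stored in .grad. Every step therefore uses the running sum of all previous gradients, which makes updates unstable (B). The model may still train, but its dynamics are unpredictable (C). Calling optimizer.zero_grad() before loss.backward() in each iteration fixes the loop (D). A is false: the accumulated updates are not equivalent to full-batch training.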
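
A corrected version of the loop might look like the following. It reuses the setup from the question; the seed and the `losses` list are additions for reproducibility and inspection.

```python
import torch
import torch.nn as nn
import torch.optim as optim

torch.manual_seed(0)
X = torch.randn(32, 10)
y = torch.randint(0, 2, (32, 1)).float()

model = nn.Sequential(nn.Linear(10, 16), nn.ReLU(), nn.Linear(16, 1), nn.Sigmoid())
criterion = nn.BCELoss()
optimizer = optim.SGD(model.parameters(), lr=0.01)

losses = []
for epoch in range(5):
    for i in range(0, 32, 8):
        xb, yb = X[i:i+8], y[i:i+8]
        optimizer.zero_grad()   # clear gradients left over from the previous batch
        loss = criterion(model(xb), yb)
        loss.backward()
        optimizer.step()
        losses.append(loss.item())

print(losses[0], losses[-1])
```

Calling `zero_grad()` right before `backward()` is one common placement; calling it right after `optimizer.step()` works equally well.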

[NAT]

```python
optimizer = optim.Adam(model.parameters(), lr=0.01)
scheduler = optim.lr_scheduler.StepLR(optimizer, step_size=5, gamma=0.5)

lrs = []
for epoch in range(12):
    lrs.append(optimizer.param_groups[0]['lr'])
    scheduler.step()

```

### 2. What is the learning rate after epoch 10? (Round to 4 decimal places)

Ans: 0.0025

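The learning rate starts at 0.01 and is multiplied by 0.5 after every 5 calls to scheduler.step().
- Epochs 0–4 log 0.01; the step at the end of epoch 4 sets LR = 0.005.
- Epochs 5–9 log 0.005; the step at the end of epoch 9 sets LR = 0.0025.
- Epoch 10 logs 0.0025 (logged before the step), and the step at the end of epoch 10 leaves it unchanged because the next decay is due after epoch 14.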
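
The schedule can be confirmed by running the loop. The snippet is self-contained, so it introduces a placeholder `nn.Linear` model (an assumption, since the question does not define one) and calls `optimizer.step()` before `scheduler.step()` to follow the order PyTorch expects.

```python
import torch.nn as nn
import torch.optim as optim

model = nn.Linear(2, 1)  # placeholder model, only needed to build the optimizer
optimizer = optim.Adam(model.parameters(), lr=0.01)
scheduler = optim.lr_scheduler.StepLR(optimizer, step_size=5, gamma=0.5)

lrs = []
for epoch in range(12):
    lrs.append(optimizer.param_groups[0]["lr"])
    optimizer.step()   # no gradients exist, so nothing is updated
    scheduler.step()

print([round(lr, 4) for lr in lrs])
```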

[MCQ]

```python
model = nn.Sequential(
    nn.Linear(20, 3),
    nn.Softmax(dim=1)
)
criterion = nn.CrossEntropyLoss()

X = torch.randn(4,20)
y = torch.tensor([0,1,2,1])

output = model(X)
loss = criterion(output,y)

```

### 3. What is wrong with this code?

- [x] A. CrossEntropyLoss expects raw logits, not softmax probabilities.
- [ ] B. The target y must be one-hot encoded for CrossEntropyLoss.
- [ ] C. The number of classes in y does not match the model output.
- [ ] D. Nothing is wrong, it should work fine.

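Ans: CrossEntropyLoss expects raw logits. Feeding it softmax probabilities applies softmax twice and loses the log-sum-exp stabilization, so the loss signal is compressed. The fix is to remove nn.Softmax from the model. B is wrong because class-index targets are the default format, and C is wrong because labels 0 to 2 match the 3 outputs.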
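
One visible consequence of the extra softmax is that the loss can no longer approach zero, even for a confident and correct prediction. The sketch below uses a hand-picked logit vector as an illustrative assumption.

```python
import torch
import torch.nn as nn

criterion = nn.CrossEntropyLoss()
target = torch.tensor([0])

# A confident, correct prediction expressed as raw logits.
confident_logits = torch.tensor([[20.0, 0.0, 0.0]])
loss_from_logits = criterion(confident_logits, target).item()

# The same prediction after an extra softmax: values are squeezed into [0, 1].
probs = torch.softmax(confident_logits, dim=1)
loss_from_probs = criterion(probs, target).item()

print(f"loss on logits: {loss_from_logits:.4f}")
print(f"loss on softmax output: {loss_from_probs:.4f}")
```

For three classes, the loss on softmax outputs is bounded below by log(e + 2) − 1 ≈ 0.551, so the model is penalized even when it is right.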

[NAT]

```python
torch.manual_seed(0)

X = torch.tensor([[2., -3.]])
layer = nn.Linear(2,1)
with torch.no_grad():
    layer.weight[:] = torch.tensor([[1.,-2.]])
    layer.bias[:] = torch.tensor([0.5])

output = layer(X)
print(output.item())
```

### 4. What is the numerical output value printed?

Ans: 8.5

```text
w1*x1 + w2*x2 + b
= 1*2 + (-2)*(-3) + 0.5
= 2 + 6 + 0.5 = 8.5
```

[MSQ]

```python
model = nn.Sequential(
    nn.Linear(10, 10),
    nn.BatchNorm1d(10),
    nn.ReLU(),
    nn.Linear(10,1)
)

model.eval()
X = torch.randn(4,10)
print(model(X))
```

### 5. What happens when the model is in .eval() mode here?

- [x] A. BatchNorm uses running statistics instead of batch statistics.
- [ ] B. Gradients will not be computed for BatchNorm parameters.
- [ ] C. The output is identical to training mode in all cases.
- [x] D. Dropout, if present, would also behave differently in eval mode.


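Ans: In .eval() mode, BatchNorm normalizes with its running mean and variance rather than the current batch statistics (A), so the output generally differs from training mode (C is false). eval() does not disable autograd: gradients for BatchNorm parameters are still computed if .backward() is called (B is false). Dropout layers, if present, would also change behavior by becoming a no-op (D).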
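
The train/eval difference is easy to observe on a single BatchNorm layer. In this sketch the batch is deliberately shifted away from zero mean (an illustrative assumption) so that batch statistics and running statistics disagree clearly.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
bn = nn.BatchNorm1d(3)
x = torch.randn(8, 3) * 5 + 10   # batch with mean far from 0

bn.train()
out_train = bn(x)                # normalized with this batch's mean/var
bn.eval()
out_eval = bn(x)                 # normalized with running_mean/running_var

print(out_train.mean(dim=0))     # close to 0 per feature
print(out_eval.mean(dim=0))      # clearly non-zero
print(bn.weight.requires_grad)   # eval() does not turn off gradients
```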

[NAT]

```python
model = nn.Sequential(
    nn.Linear(10, 20),
    nn.ReLU(),
    nn.Linear(20, 5),
    nn.ReLU(),
    nn.Linear(5, 1)
)
```

### 6. How many trainable parameters (including bias) does this model have in total?

Ans: 331

- Layer 1: 10*20 + 20 = 220
- Layer 2: 20*5 + 5 = 105
- Layer 3: 5*1 + 1 = 6
- Total: 220 + 105 + 6 = 331
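
The count can also be checked programmatically by summing `numel()` over the model's parameters. The `per_tensor` dictionary is an addition for inspecting each weight and bias separately.

```python
import torch.nn as nn

model = nn.Sequential(
    nn.Linear(10, 20), nn.ReLU(),
    nn.Linear(20, 5), nn.ReLU(),
    nn.Linear(5, 1),
)

# Element count of every weight and bias tensor, keyed by parameter name.
per_tensor = {name: p.numel() for name, p in model.named_parameters()}
total = sum(p.numel() for p in model.parameters() if p.requires_grad)

print(per_tensor)
print(total)
```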