# Theory Questions (Total: 10)

[MCQ]

A CNN trained on CIFAR-10 achieves 95 % training accuracy but only 65 % validation accuracy. The researcher tries the following remedies:

1. Adding more convolutional layers
2. Increasing kernel size in each convolutional layer
3. Using dropout and batch normalization
4. Performing extensive data augmentation

### 1. Which of the following will most likely reduce overfitting?

- [ ] A. 1 and 2
- [ ] B. 1 and 3
- [x] C. 3 and 4
- [ ] D. 1 and 4

Ans: Adding layers or larger kernels (1 and 2) may further overfit; dropout and batch norm regularize activations, while augmentation expands the dataset distribution, combating overfitting.

[NAT]

### 2. Consider a 2-D convolution with a 5×5 kernel, stride = 2, and padding = 1 on a 64×64×3 input. How many learnable parameters (including biases) exist in this single convolution layer with 32 filters?

Ans: 2432

Each filter has 5×5×3=75 weights + 1 bias = 76 parameters.
Total = 32 × 76 = 2432 learnable parameters.
Stride and padding affect output size, not parameter count.
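
The count above can be checked with a few lines of Python. This is a minimal sketch: `conv2d_param_count` is a hypothetical helper name (not a library function), and it assumes a square kernel.

```python
def conv2d_param_count(in_channels, out_channels, kernel_size, bias=True):
    """Learnable parameters of a 2-D convolution with a square kernel."""
    weights_per_filter = kernel_size * kernel_size * in_channels
    per_filter = weights_per_filter + (1 if bias else 0)
    return out_channels * per_filter

# 5x5 kernel, 3 input channels, 32 filters, biases included
print(conv2d_param_count(3, 32, 5))  # 2432
```

Stride and padding do not appear among the arguments, which matches the remark that they change only the output size.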

[MSQ]

### 3. Which of the following statements about Batch Normalization (BN) in CNNs are correct?

- [x] A. BN mitigates internal covariate shift by stabilizing feature distribution.
- [x] B. BN allows higher learning rates, accelerating convergence.
- [ ] C. BN completely removes the need for dropout.
- [x] D. BN normalizes activations across spatial dimensions and batch samples.

Ans: BN standardizes intermediate activations and enables faster, more stable training.
Dropout may still be needed for regularization; BN does not replace it entirely.
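
Statement D can be made concrete with a pure-Python sketch of the training-time normalization. `batchnorm2d_train` is a hypothetical helper, and it assumes γ = 1 and β = 0; a real BN layer also learns the affine transform and tracks running statistics.

```python
def batchnorm2d_train(x, eps=1e-5):
    """Normalize x[n][c][h][w] per channel using batch statistics.

    Assumes gamma = 1 and beta = 0 (no learned affine transform).
    """
    n_batch, n_ch = len(x), len(x[0])
    out = [[None] * n_ch for _ in range(n_batch)]
    for c in range(n_ch):
        # Pool every value of channel c across batch and spatial positions.
        vals = [v for n in range(n_batch) for row in x[n][c] for v in row]
        mean = sum(vals) / len(vals)
        var = sum((v - mean) ** 2 for v in vals) / len(vals)  # biased variance
        std = (var + eps) ** 0.5
        for n in range(n_batch):
            out[n][c] = [[(v - mean) / std for v in row] for row in x[n][c]]
    return out

# Batch of 2 samples, 1 channel, 1x2 feature maps
x = [[[[1.0, 2.0]]], [[[3.0, 4.0]]]]
y = batchnorm2d_train(x)
flat = [v for sample in y for row in sample[0] for v in row]
print(abs(sum(flat) / len(flat)) < 1e-9)  # True: the channel mean is ~0
```

One mean and one variance are computed per channel, from all samples and all spatial positions together.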

[MCQ]

### 4. In a CNN for CIFAR-10, early convolutional filters tend to capture edges and textures, while deeper filters respond to object parts. This hierarchy emerges primarily due to —

- [ ] A. Weight sharing across spatial locations
- [x] B. Local receptive fields expanding through layer stacking
- [ ] C. ReLU introducing sparsity in activations
- [ ] D. Pooling enforcing translation invariance

Ans: As layers stack, each neuron’s receptive field covers a larger portion of the input, allowing progressively complex feature compositions—from edges → shapes → semantic parts.

[MCQ]

A CNN has three convolutional layers, each with kernel 3, stride 1, and padding 1, followed by one max-pooling layer of size 2 × 2 and stride 2.

### 5. What is the receptive field (in input pixels) of one neuron in the final feature map?

- [ ] A. 5 x 5
- [ ] B. 7 x 7
- [x] C. 8 x 8
- [ ] D. 11 x 11

Ans: Receptive field after three 3×3 stride-1 layers: 3+2+2=7.
The 2×2 pooling layer adds (2−1) × jump, where the jump (the product of the preceding strides) is 1: 7+(2−1)×1=8.
Hence, each final-layer neuron “sees” 8×8 pixels of the input.

[MCQ]

### 6. While training a CNN from scratch on MNIST, accuracy stagnates at 90 % despite sufficient data and no overfitting. Which is the most plausible underlying reason?

- [x] A. Vanishing gradients due to sigmoid activations
- [ ] B. Exploding gradients due to large initialization
- [ ] C. Improper shuffling of mini-batches
- [ ] D. Over-regularization by dropout

Ans: Sigmoids saturate for large inputs, producing near-zero derivatives that halt weight updates.
ReLU or Leaky ReLU alleviate this by maintaining gradient flow.
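
The saturation argument can be illustrated numerically. This is a minimal sketch; the 10-layer figure is an illustrative assumption, and weight magnitudes are ignored.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def sigmoid_grad(z):
    s = sigmoid(z)
    return s * (1.0 - s)

print(sigmoid_grad(0.0))         # 0.25, the largest value the derivative takes
print(sigmoid_grad(6.0) < 0.01)  # True: the unit is saturated

# Upper bound on the activation-derivative factor after 10 sigmoid layers
# (weights ignored; the depth of 10 is an illustrative assumption).
print(0.25 ** 10 < 1e-6)         # True
```

Since each sigmoid contributes a factor of at most 0.25 to the backpropagated gradient, the product shrinks quickly with depth.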

[MSQ]

Consider two CNN architectures trained on identical data:
1. Model A: no pooling, but stride 2 in convolutions
2. Model B: uses pooling layers for down-sampling

### 7. Which statements are true?

- [x] A. Model A learns down-sampling through convolutional weights.
- [x] B. Model B’s pooling provides parameter-free translation invariance.
- [x] C. Model A may retain more task-specific spatial detail.
- [ ] D. Model B will always outperform Model A in generalization.

Ans: Strided convolutions subsample via learnable kernels (A), pooling adds invariance without parameters (B), and convolutional down-sampling can preserve fine spatial cues (C).
Performance superiority (D) is not guaranteed; it depends on data and regularization.

[MCQ]

### 8. Which of the following is a key reason that deeper CNNs like ResNet outperform shallow ones on large-scale image tasks?

- [ ] A. Deeper CNNs always have fewer parameters.
- [x] B. Residual connections mitigate vanishing gradients and preserve representational richness.
- [ ] C. Batch normalization replaces all non-linearities.
- [ ] D. Deeper CNNs avoid overfitting by having more layers.

Ans: ResNets introduce skip (identity) connections that ease gradient flow and enable learning of residual mappings, solving degradation in very deep nets.

[NAT]

An image of size 128×128 is processed by a sequence of layers:
- Conv1: 3×3 kernel, stride 1, padding 1
- MaxPool: 2×2, stride 2
- Conv2: 3×3 kernel, stride 1, padding 1
- MaxPool: 2×2, stride 2

### 9. What is the final spatial size (H×W) of the feature map (ignore channels)? Give the value of H.

Ans: 32

Each pooling halves resolution: 128→64→32. Convs preserve it due to padding 1.
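
The size bookkeeping can be sketched with the usual output-size formula. `out_size` is a hypothetical helper; it assumes dilation 1 and floor rounding.

```python
def out_size(n, kernel, stride=1, padding=0):
    """Output length along one spatial axis for a conv or pooling layer."""
    return (n + 2 * padding - kernel) // stride + 1

h = 128
h = out_size(h, kernel=3, stride=1, padding=1)  # Conv1   -> 128
h = out_size(h, kernel=2, stride=2)             # MaxPool -> 64
h = out_size(h, kernel=3, stride=1, padding=1)  # Conv2   -> 64
h = out_size(h, kernel=2, stride=2)             # MaxPool -> 32
print(h)  # 32
```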

[MSQ]

### 10. During CNN training, you observe the validation accuracy stagnates while training accuracy continues to improve. Which of the following remedies are conceptually justified?

- [x] A. Adding dropout layers between fully connected blocks.
- [ ] B. Increasing the learning rate.
- [x] C. Using data augmentation such as rotations and flips.
- [x] D. Reducing the number of convolutional filters.

Ans: Classic overfitting symptom. Regularization (dropout), augmentation, or reduced capacity can help.
Raising the learning rate does not address overfitting and may destabilize optimization.

# Coding Questions (Total: 12)

[MCQ]

```python
import torch
import torch.nn as nn

conv = nn.Conv2d(in_channels=3, out_channels=6, kernel_size=5, stride=1, padding=2)
x = torch.randn(8, 3, 32, 32)
y = conv(x)
print(y.shape)
```

### 1. What will be the output shape of y?

- [ ] A. (8, 6, 28, 28)
- [ ] B. (8, 6, 30, 30)
- [x] C. (8, 6, 32, 32)
- [ ] D. (8, 6, 34, 34)

Ans: Padding = 2 and stride = 1 keep the spatial dimensions the same:
out = ⌊(32 + 2×2 − 5)/1⌋ + 1 = 32.
Hence (8, 6, 32, 32).

[MSQ]

```python
class SmallCNN(nn.Module):
    def __init__(self):
        super().__init__()
        self.conv1 = nn.Conv2d(3, 8, 3, padding=1)
        self.conv2 = nn.Conv2d(8, 16, 3, padding=1)
        self.pool = nn.MaxPool2d(2, 2)
    def forward(self, x):
        x = torch.relu(self.conv1(x))
        x = torch.relu(self.conv2(x))
        x = self.pool(x)
        return x
```

### 2. Which statements are true about the above network?

- [x] A. After pooling, the spatial size becomes half in each dimension.
- [x] B. The receptive field of each conv2 output neuron (before pooling) covers 5×5 pixels of input.
- [x] C. Removing padding would shrink the output feature map even before pooling.
- [ ] D. Pooling increases the number of parameters.

Ans: Padding keeps intermediate sizes the same; pooling halves them.
Two stacked 3×3 kernels ⇒ receptive field = 5×5 at the conv2 output; the 2×2 pooling then widens it to 6×6.
Pooling has no learnable parameters.


[MCQ]

```python
import torch
import torch.nn.functional as F

x = torch.tensor([[[-1.0, 2.0], [3.0, -4.0]]])
y = F.relu(x)
print(y)
```

### 3. What will be printed as the output?

- [x] A. `[[[0., 2.], [3., 0.]]]`
- [ ] B. `[[[-1., 2.], [3., -4.]]]`
- [ ] C. `[[[1., 2.], [3., 4.]]]`
- [ ] D. Parsing Error

Ans: ReLU sets negatives to 0; positives unchanged.

[MSQ]

```python
class Net(nn.Module):
    def __init__(self):
        super().__init__()
        self.conv = nn.Conv2d(3, 3, 3, bias=False)
        self.bn = nn.BatchNorm2d(3)
    def forward(self, x):
        x = self.conv(x)
        x = self.bn(x)
        return x
```

### 4. During training, which are true about BatchNorm2d here?

- [x] A. Maintains running estimates of mean and variance per channel.
- [x] B. Has trainable γ (scale) and β (shift).
- [x] C. Normalizes across batch and spatial dimensions.
- [ ] D. Has no effect during inference.

Ans: BN learns per-channel affine params (γ, β) and normalizes with batch statistics while training, updating its running estimates as it goes; during inference it normalizes with the running mean and variance, so it still affects the output.
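
The parameters and running statistics of a BN layer can be inspected directly. This is a minimal sketch that assumes PyTorch is installed; the input shape and the +5.0 shift are arbitrary choices for illustration.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
bn = nn.BatchNorm2d(3)

param_names = [name for name, _ in bn.named_parameters()]
buffer_names = [name for name, _ in bn.named_buffers()]
print(param_names)   # ['weight', 'bias']  (gamma and beta)
print(buffer_names)  # ['running_mean', 'running_var', 'num_batches_tracked']

x = torch.randn(8, 3, 4, 4) + 5.0  # shifted so the batch mean is clearly non-zero

bn.train()
bn(x)  # a training forward pass updates the running estimates
mean_after_train = bn.running_mean.clone()

bn.eval()
bn(x)  # an evaluation forward pass reads the running estimates, leaving them unchanged
print(torch.equal(mean_after_train, bn.running_mean))  # True
```

The running estimates are buffers rather than parameters: they are saved with the model but receive no gradient updates.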

[MSQ]

```python
import torch.optim as optim

model = SmallCNN()
criterion = nn.CrossEntropyLoss()
optimizer = optim.SGD(model.parameters(), lr=0.1, momentum=0.9)

for epoch in range(10):
    outputs = model(torch.randn(4, 3, 32, 32))
    loss = criterion(outputs.view(4, -1), torch.randint(0, 10, (4,)))
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
```

### 5. Which statements are correct?

- [x] A. The .zero_grad() prevents gradient accumulation.
- [x] B. .backward() computes gradients using automatic differentiation.
- [x] C. .step() updates parameters using gradients stored in each tensor.
- [ ] D. Momentum = 0.9 reduces overfitting directly.

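Ans: `zero_grad()` clears gradients left over from the previous iteration, `backward()` fills each parameter's `.grad` via autograd, and `step()` applies the update from those gradients. Momentum speeds up and smooths optimization; it is not a regularizer, so it does not directly reduce overfitting.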
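
The update that `.step()` applies with momentum can be sketched on a single scalar parameter. `sgd_momentum_step` is a hypothetical helper following the form documented for `torch.optim.SGD` with no dampening, no weight decay, and Nesterov disabled; the quadratic objective is an assumption for illustration.

```python
def sgd_momentum_step(param, grad, velocity, lr=0.1, momentum=0.9):
    """One scalar SGD-with-momentum update (no dampening, no weight decay)."""
    velocity = momentum * velocity + grad
    param = param - lr * velocity
    return param, velocity

# Minimize f(w) = w**2, whose gradient is 2*w.
w, v = 1.0, 0.0
trajectory = []
for _ in range(3):
    w, v = sgd_momentum_step(w, 2.0 * w, v)
    trajectory.append(round(w, 4))
print(trajectory)  # [0.8, 0.46, 0.062]
```

The velocity term changes only how fast the parameter moves along the loss surface; nothing in the update penalizes model complexity.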

[MSQ]

```python
import torchvision.transforms as T
transform = T.Compose([
    T.RandomHorizontalFlip(p=0.5),
    T.RandomRotation(15),
    T.ToTensor()
])
```

### 6. Select true statements regarding this transform pipeline.

- [x] A. Each image has 50 % chance to be flipped horizontally.
- [ ] B. Rotation preserves label semantics for classification tasks like CIFAR-10.
- [x] C. ToTensor() converts PIL Image → [0, 1] tensor.
- [x] D. It can reduce overfitting by enriching training diversity.

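Ans: Rotation is not guaranteed to preserve label semantics: for orientation-sensitive classes (e.g., digits 6 vs. 9) it can change the label. Random flips and rotations enrich the training distribution, which combats overfitting.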

[NAT]

```python
def receptive_field(kernel_sizes, strides):
    rf, jump = 1, 1
    for k, s in zip(kernel_sizes, strides):
        rf = rf + (k - 1) * jump
        jump *= s
    return rf

print(receptive_field([3,3,3], [1,1,2]))
```

### 7. What is the receptive field size of the final layer?

Ans: 7

After the first two 3×3 stride-1 layers → rf = 5, jump = 1.
Third layer (stride 2): rf = 5 + (3−1)×1 = 7; the jump is updated to 2 only after rf has been computed.
The stride of the last layer affects later layers only, so the function prints 7.

[MSQ]

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class TinyCNN(nn.Module):
    def __init__(self):
        super().__init__()
        self.conv1 = nn.Conv2d(3, 8, kernel_size=5, stride=2, padding=2)
        self.bn1   = nn.BatchNorm2d(8)
        self.conv2 = nn.Conv2d(8, 16, kernel_size=3, stride=1, padding=1)
        self.fc1   = nn.Linear(16 * 8 * 8, 10)

    def forward(self, x):
        x = F.relu(self.bn1(self.conv1(x)))
        x = F.max_pool2d(F.relu(self.conv2(x)), 2)
        x = torch.flatten(x, 1)
        return self.fc1(x)

model = TinyCNN()
inp = torch.randn(4, 3, 32, 32)
out = model(inp)
```

### 8. Which of the following are true?

- [x] A. The feature map size before flattening is 16×8×8.
- [x] B. BatchNorm introduces extra learnable parameters (γ, β).
- [ ] C. The receptive field of each neuron in the pooled feature map covers the whole image.
- [x] D. max_pool2d halves spatial dimensions.

Ans: Stride 2 in conv1 → 16×16; pool 2×2 → 8×8. BN adds 2×C parameters. Each pooled neuron sees only 11×11 of the 32×32 input (5 → 9 → 11 through conv1, conv2, and the pool).

[MSQ]

```python
class Block(nn.Module):
    def __init__(self):
        super().__init__()
        # in_channels equals out_channels so that the "+ x" in forward is shape-valid
        self.conv1 = nn.Conv2d(8, 8, 3, padding=1)
        self.conv2 = nn.Conv2d(8, 8, 3, padding=1)
    def forward(self, x):
        return F.relu(self.conv2(F.relu(self.conv1(x))) + x)
```

### 9. For this residual block:

- [x] A. Shapes of input and output must match.
- [x] B. Skip connection helps gradient flow.
- [ ] C. Number of learnable parameters decreases compared to two independent conv layers.
- [x] D. The block can model identity mapping easily.

Ans: Residual adds input to output → same shape; skip improves training and permits identity.

[MSQ]

```python
class NormedCNN(nn.Module):
    def __init__(self):
        super().__init__()
        self.conv = nn.Conv2d(3, 6, 3)
        self.bn   = nn.BatchNorm2d(6)
    def forward(self, x):
        x = self.conv(x)
        if self.training:
            x = self.bn(x)
        return F.relu(x)
```

### 10. Which statements hold true?

- [x] A. During evaluation mode, BN uses running mean/var.
- [x] B. self.training flag must be False for inference.
- [x] C. BN normalizes over spatial + batch dims per channel.
- [ ] D. Turning off BN has no effect on output distribution.

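Ans: model.eval() sets training=False; a BN layer called in that mode normalizes with its stored running statistics, per channel. Here the `if self.training` guard skips BN altogether at evaluation time, so activations are left unnormalized and their distribution shifts relative to training, which is why D is false.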

[MSQ]

```python
class GlobalPoolNet(nn.Module):
    def __init__(self):
        super().__init__()
        self.conv = nn.Conv2d(3, 64, 3, padding=1)
        self.gap  = nn.AdaptiveAvgPool2d((1,1))
        self.fc   = nn.Linear(64, 10)
    def forward(self, x):
        x = F.relu(self.conv(x))
        x = self.gap(x)
        x = torch.flatten(x, 1)
        return self.fc(x)
```

### 11. Choose the correct options.

- [x] A. Adaptive pooling outputs size (1×1) regardless of input dimension.
- [ ] B. This architecture is translation invariant to object position.
- [x] C. fc expects 64 inputs for each sample.
- [x] D. Removing gap requires changing fc input size.

Ans: GAP → 1×1 maps; fc needs 64 inputs; zero padding makes responses near the border position-dependent, so there is no strict translation invariance (B is false).
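
The behaviour of the pooling layer can be checked on feature maps of two sizes. This is a minimal sketch that assumes PyTorch is installed; the batch size and the spatial sizes are arbitrary.

```python
import torch
import torch.nn as nn

gap = nn.AdaptiveAvgPool2d((1, 1))

shapes = []
for size in (32, 64):
    feat = torch.randn(2, 64, size, size)  # (batch, channels, H, W)
    pooled = gap(feat)
    shapes.append(tuple(pooled.shape))
    # Global average pooling equals the mean over the two spatial axes.
    assert torch.allclose(pooled.flatten(1), feat.mean(dim=(2, 3)), atol=1e-6)

print(shapes)  # [(2, 64, 1, 1), (2, 64, 1, 1)]
```

After flattening, each sample is a 64-dimensional vector whatever the input resolution, which is why `fc` can stay at `nn.Linear(64, 10)`.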

[MSQ]

```python
class DeeperCNN(nn.Module):
    def __init__(self):
        super().__init__()
        self.layers = nn.Sequential(
            nn.Conv2d(3, 32, 3, padding=1),
            nn.ReLU(),
            nn.Conv2d(32, 64, 3, padding=1),
            nn.ReLU(),
            nn.Conv2d(64, 128, 3, padding=1),
            nn.ReLU()
        )
    def forward(self, x):
        return self.layers(x)

m = DeeperCNN()
x = torch.randn(1,3,32,32)
y = m(x)
```

### 12. Which statements are correct?

- [x] A. Depth increases effective receptive field.
- [x] B. All conv layers preserve spatial size.
- [x] C. Gradient vanishing is more likely than in a 2-layer CNN.
- [ ] D. Each layer’s activation map has 128 channels.

Ans: Padding 1 keeps 32×32; RF grows with depth; a longer chain of layers makes vanishing gradients more likely; the channel counts are 32, 64, and 128, so only the last map has 128 channels.