# Theory Questions (Total: 12)

[MCQ]

### 1. Consider a vanilla RNN processing variable-length sequences for next-token prediction in language modeling. Which of the following most accurately describes the relationship between the hidden state at a given time step and the probability estimation of the current token, especially regarding long-term dependencies and memory limitations?

- [ ] A) The hidden state captures all previous context regardless of sequence length, ensuring perfect modeling of dependencies.
- [x] B) The hidden state dynamically summarizes past tokens, but its effectiveness in modeling long-term dependencies diminishes rapidly with increased sequence length.
- [ ] C) The hidden state is only influenced by the last token input, making it suitable for short-term dependencies.
- [ ] D) The hidden state guarantees that gradient propagation remains stable across very long sequences.

Ans: The hidden state summarizes prior context, but as sequence length increases, its ability to capture long-term dependencies weakens. This is due to vanishing gradients and limited memory, leading to loss of old information over long sequences, a key limitation of vanilla RNNs.
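
The fading influence of early tokens can be made concrete with a scalar toy model. The sketch below is illustrative only: it assumes a one-unit RNN `h_t = tanh(w * h_{t-1} + x_t)` with a hypothetical weight `w = 0.9` and constant inputs, and tracks how the derivative of the final hidden state with respect to the initial one shrinks as the sequence grows.

```python
import math

def gradient_through_time(w, inputs, h0=0.0):
    """Return d h_T / d h_0 for the scalar RNN h_t = tanh(w * h_{t-1} + x_t)."""
    h, grad = h0, 1.0
    for x in inputs:
        h = math.tanh(w * h + x)
        # Chain rule: d h_t / d h_{t-1} = w * (1 - h_t^2)
        grad *= w * (1.0 - h * h)
    return grad

# Hypothetical weight and inputs, chosen only for illustration.
short_grad = gradient_through_time(0.9, [0.5] * 5)
long_grad = gradient_through_time(0.9, [0.5] * 50)
print(f"5 steps:  {short_grad:.3e}")
print(f"50 steps: {long_grad:.3e}")
```

Each step multiplies the gradient by a factor below 1 here, so the 50-step value is many orders of magnitude smaller than the 5-step value.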

[MCQ]

### 2. Someone applies an RNN to time-series data where each observation depends on all previous ones, but computational resources are limited. Which modeling strategy related to state dependencies and training method would most likely result in loss of critical information for prediction?

- [x] A) Utilizing a windowed approach with fixed-length input sequences using only the most recent tokens
- [ ] B) Encoding all historical observations in a compressed feature vector before input to the RNN
- [ ] C) Using a vanilla RNN with random token sampling and non-overlapping sequence partitioning during training
- [ ] D) Applying backpropagation through time across the full historical data sequence

Ans: Using a fixed-length window (windowed approach) discards older context, making it impossible for the model to access dependencies beyond the window. This loss is critical in time series where earlier observations influence later ones.
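
A minimal sketch of the windowed approach, assuming a plain Python list as the series and a hypothetical helper `make_windows`, shows which observations each prediction can still see:

```python
def make_windows(series, lookback):
    """Pair each target with only the `lookback` observations preceding it."""
    return [
        (series[t - lookback:t], series[t])
        for t in range(lookback, len(series))
    ]

series = [10, 11, 12, 13, 14, 15]
windows = make_windows(series, lookback=3)
for context, target in windows:
    print(context, "->", target)
# The last window is ([12, 13, 14], 15): observations 10 and 11 are no longer visible.
```

Anything older than `lookback` steps is simply absent from the model's input, whatever its relevance to the target.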

[MSQ]

### 3. Suppose you are tasked with improving an RNN language model’s ability to handle rare long-term dependencies without increasing the number of model parameters. Which approaches and considerations, based on the properties of vanilla RNNs, would best address this issue?

- [x] A) Introducing gated mechanisms such as LSTM or GRU units
- [ ] B) Increasing the number of hidden state neurons linearly with input sequence length
- [x] C) Implementing truncated backpropagation through time with an optimally chosen window size
- [x] D) Applying gradient clipping to stabilize parameter updates in long sequences

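Ans: Gated units (LSTM/GRU) let the network retain information over long spans; truncated BPTT with a well-chosen window keeps training tractable while still covering the dependencies that fit inside the window; and gradient clipping prevents exploding gradients from destabilizing updates. Note that at equal hidden size a gated layer has more parameters than a vanilla RNN layer (about 4x for LSTM and 3x for GRU), so staying within a fixed parameter budget requires a smaller hidden size. Scaling the number of hidden units with sequence length (B) adds parameters and does not address the underlying vanishing-gradient problem.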
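
One caveat is worth checking when the parameter budget is fixed. The sketch below counts the parameters of a single recurrent layer, assuming the PyTorch layout (one input-to-hidden matrix, one hidden-to-hidden matrix, and two bias vectors per gate). The helper `recurrent_param_count` and the sizes are hypothetical.

```python
def recurrent_param_count(input_size, hidden_size, gates=1):
    """Parameters of one recurrent layer in the PyTorch layout.

    Each gate has an input-to-hidden matrix, a hidden-to-hidden matrix,
    and two bias vectors. gates=1 is a vanilla RNN, 3 a GRU, 4 an LSTM.
    """
    per_gate = hidden_size * input_size + hidden_size * hidden_size + 2 * hidden_size
    return gates * per_gate

for name, gates in [("RNN", 1), ("GRU", 3), ("LSTM", 4)]:
    print(name, recurrent_param_count(input_size=32, hidden_size=64, gates=gates))
# RNN 6272
# GRU 18816
# LSTM 25088
```

At equal hidden size a gated layer is several times larger, so matching a vanilla RNN's parameter count means shrinking the hidden size.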

[MCQ]

### 4. Someone wishes to quantify a language model's effectiveness using perplexity. Which property of perplexity, for token-level evaluation, best explains its utility as a metric when comparing models trained on the same corpus?

- [ ] A) Perplexity directly measures the model’s vocabulary size
- [x] B) Lower perplexity implies the model better predicts upcoming tokens, indicating stronger language modeling performance
- [ ] C) High perplexity always corresponds to overfitting the training data
- [ ] D) Perplexity cannot be used to compare models; it only reports randomness present in the token sequences

Ans: Lower perplexity means the model assigns higher probability to real next tokens in the dataset. It reflects the model’s predictive accuracy in sequence modeling and is the standard for evaluating language models.
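
As a rough illustration, perplexity can be computed as the exponential of the average negative log-probability the model assigns to the true tokens. The probabilities below are made up for the example and do not come from any real model.

```python
import math

def perplexity(token_probs):
    """exp of the average negative log-probability assigned to the true tokens."""
    nll = -sum(math.log(p) for p in token_probs) / len(token_probs)
    return math.exp(nll)

# Hypothetical probabilities two models assign to the same four reference tokens.
model_a = [0.50, 0.25, 0.40, 0.10]
model_b = [0.20, 0.10, 0.15, 0.05]
uniform_20 = [1 / 20] * 4  # a model that spreads probability evenly over 20 tokens

print(f"model A:    {perplexity(model_a):.2f}")
print(f"model B:    {perplexity(model_b):.2f}")
print(f"uniform 20: {perplexity(uniform_20):.2f}")
```

Model A assigns higher probability to the reference tokens and therefore gets the lower perplexity. The uniform case scores a perplexity of 20, which is why perplexity is often read as an effective number of equally likely choices.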

[MSQ]

### 5. When utilizing RNNs for sequence modeling tasks in domains like DNA analysis or video activity recognition, what challenges arise due to the inherent structure of such data and the architecture of vanilla RNNs?

- [x] A) Memory limitations for capturing dependencies across spatially separated sequence elements
- [x] B) Potential for gradient vanishing or explosion during training of long sequences
- [ ] C) Inability to process variable-length sequences without substantial preprocessing
- [ ] D) Loss of key sequential information due to parameter sharing across time steps

Ans: Vanilla RNNs struggle with long-range dependencies (memory limitations) and are prone to vanishing/exploding gradients during training. Parameter sharing and variable-length handling are strengths, not weaknesses of RNNs.

[MCQ]

### 6. You apply random sampling and sequential partitioning to create training/validation splits for an RNN language model on a large text corpus. What is the primary risk of using random sampling if token order is disrupted?

- [ ] A) Overrepresentation of frequent tokens in each split
- [x] B) Loss of meaningful sequential context, harming the model’s ability to capture dependencies
- [ ] C) Increase in computational cost due to redundant training examples
- [ ] D) Imbalance in class labels for prediction tasks

Ans: Random sampling can disrupt the sequence order, causing the model to lose crucial context and dependencies, which harms learning for sequential tasks where order matters.
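
The effect can be sketched on a toy corpus. The example below is a simplification that uses adjacent token pairs (bigrams) as a crude proxy for sequential context. It compares a contiguous split with a split drawn by shuffling individual tokens. The corpus, seed, and 80% cut are arbitrary.

```python
import random

def bigrams(tokens):
    """Set of adjacent token pairs, a crude proxy for local sequential context."""
    return set(zip(tokens, tokens[1:]))

tokens = "the cat sat on the mat and the dog sat on the rug".split()
cut = int(0.8 * len(tokens))

train_contiguous = tokens[:cut]            # order preserved
rng = random.Random(0)                     # fixed seed, illustrative only
train_shuffled = rng.sample(tokens, cut)   # same size, order destroyed

original = bigrams(tokens)
for name, split in [("contiguous", train_contiguous), ("shuffled", train_shuffled)]:
    pairs = bigrams(split)
    print(name, f"{len(pairs & original)}/{len(pairs)} bigrams also occur in the corpus")
```

Every bigram in the contiguous split is a genuine bigram of the corpus. The shuffled split offers no such guarantee.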

[MCQ]

### 7. During backpropagation through time for an RNN trained on variable-length sequences, why might truncating the unrolled computational graph to a fixed number of steps be both beneficial and detrimental?

- [x] A) It reduces computation time but risks forgetting dependencies beyond the truncation length
- [ ] B) It prevents exploding gradients but always reduces model expressivity
- [ ] C) It optimizes memory use and increases effective parameter sharing
- [ ] D) It eliminates vanishing gradients for all sequences regardless of length

Ans: Truncating BPTT makes training computationally feasible, but can cause the network to forget influences from steps beyond the truncation point, losing long-term dependencies.

[MSQ]

### 8. In the context of vanilla RNNs applied to stock market prediction, which technical design and data handling choices most strongly influence the model’s ability to generalize across unseen time periods?

- [x] A) Lookback period selection and train-test split strategy
- [ ] B) Choice of tokenization method for numerical financial data
- [x] C) Activation function type used for hidden state computation
- [ ] D) Use of one-hot encoding for input sequences

Ans: Selecting an appropriate lookback period and a train/test split that respects time order makes the model more likely to generalize to unseen periods. The choice of activation function also affects which patterns the hidden state can capture, especially the nonlinearities that matter for stock prediction. One-hot encoding and tokenization are less critical than these architectural and data-handling choices.

[MCQ]

### 9. Why are feedforward neural networks and CNNs generally not ideal for modeling sequential data such as language or time series without major architectural modifications?

- [x] A) Lack of ability to share parameters across time steps and insufficient memory for retaining sequence context
- [ ] B) Limited to processing only fixed-length feature vectors
- [ ] C) Unable to apply nonlinear transformations to input data
- [ ] D) Restricted to image processing tasks only

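Ans: A plain feedforward network has no recurrence: it neither reuses the same weights at every time step nor carries state from one step to the next. Standard CNNs do share weights across positions, but only within a fixed receptive field, so context outside that window is not retained. Both therefore need major architectural changes to model sequential data well.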

[NAT]

### 10. If an RNN language model achieves a perplexity of 20 on a test set, what is the model’s implied uncertainty about the next token, expressed as an equivalent number of equally likely choices?

Ans: 20

A perplexity of 20 means the model is, on average, as uncertain as if it were choosing uniformly among 20 equally likely tokens at each step. Lower perplexity indicates more confident and more accurate prediction.

[MCQ]

### 11. Which step comes first in preparing text data for input into an RNN-based language model according to standard practice?

- [ ] A) Building the vocabulary dictionary
- [x] B) Tokenizing the raw text into smaller sequence units
- [ ] C) Converting each token to a unique integer index
- [ ] D) Calculating the model’s perplexity

Ans: Tokenization—the process of splitting raw text into words, characters, or other units—is always the first required step before vocabulary building in language modeling workflows.
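
The ordering of the preprocessing steps can be sketched in a few lines of plain Python. Whitespace word-level tokenization and the `<unk>` entry at index 0 are assumptions made for the example. Real pipelines often use character or subword tokenizers.

```python
from collections import Counter

text = "the cat sat on the mat"

# 1. Tokenize the raw text (here: naive whitespace, word-level).
tokens = text.split()

# 2. Build the vocabulary from the tokens, most frequent first.
counts = Counter(tokens)
idx_to_token = ["<unk>"] + [tok for tok, _ in counts.most_common()]
token_to_idx = {tok: i for i, tok in enumerate(idx_to_token)}

# 3. Convert each token to its integer index.
indices = [token_to_idx.get(tok, 0) for tok in tokens]

print(tokens)
print(indices)
```

The vocabulary cannot be built before the text is split into tokens, and the integer indices cannot be assigned before the vocabulary exists.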

[MCQ]

### 12. What is the primary role of the hidden state in a vanilla RNN when modeling sequences?

- [ ] A) It stores the network’s parameters for training
- [x] B) It serves as a summary of all information processed so far in the sequence
- [ ] C) It generates the output prediction for each time step independently
- [ ] D) It maintains the model’s hyperparameter settings

Ans: The hidden state in a vanilla RNN continuously encodes a summary of all information seen so far in the sequence, making it fundamental for sequence learning.

# Coding Questions (Total: 12)

[MCQ]

```python
import torch
import torch.nn as nn

class TestRNN(nn.Module):
    def __init__(self):
        super().__init__()
        self.rnn = nn.RNN(10, 8, 2, batch_first=True)
        self.fc = nn.Linear(8, 1)
    def forward(self, x):
        h0 = torch.randn(2, x.size(0), 8)
        out, _ = self.rnn(x, h0)
        return self.fc(out[:, -1, :])

x = torch.randn(16, 20, 10)
model = TestRNN()
out = model(x)
print(out.shape)
```

### 1. What is the shape of `out` printed above?

- [ ] A) (16, 20, 1)
- [ ] B) (16, 8)
- [x] C) (16, 1)
- [ ] D) (16, 2)

Ans: `out[:, -1, :]` selects the last time step of the top layer, giving shape (16, 8); the fully connected layer then maps 8 features to 1, so the output shape is (batch_size, 1) = (16, 1).

[MCQ]

```python
import torch
import torch.nn as nn

class MyRNN(nn.Module):
    def __init__(self):
        super().__init__()
        self.rnn = nn.RNN(4, 6, 1)
    def forward(self, x):
        h0 = torch.zeros(1, x.size(1), 6)
        out, hn = self.rnn(x, h0)
        return out, hn

seq = torch.randn(7, 5, 4)
model = MyRNN()
out, hn = model(seq)
print(out.shape, hn.shape)
```

### 2. What will be printed for `out.shape` and `hn.shape`?

- [ ] A) torch.Size([3, 5, 4]) torch.Size([2, 5, 2])
- [x] B) torch.Size([7, 5, 6]) torch.Size([1, 5, 6])
- [ ] C) torch.Size([3, 5, 4]) torch.Size([1, 5, 6])
- [ ] D) torch.Size([7, 5, 6]) torch.Size([2, 5, 2])

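Ans: Without `batch_first=True`, PyTorch RNNs expect input of shape (seq_len, batch, input_size) and return output of shape (seq_len, batch, hidden_size), so `out` is (7, 5, 6); `hn` is (num_layers, batch, hidden_size) = (1, 5, 6).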

[MSQ]

```python
import torch
import torch.nn as nn
rnn = nn.RNN(6, 5, 3, bidirectional=True)
x = torch.randn(9, 2, 6)
h0 = torch.zeros(3 * 2, 2, 5)
out, hn = rnn(x, h0)
```

### 3. Which are true about out and hn?

- [x] A) `out` has shape (9, 2, 10)
- [x] B) `hn` has shape (6, 2, 5)
- [x] C) The number 10 in output shape comes from doubling the hidden size due to bidirectionality
- [ ] D) Initial hidden state must always be zeros

Ans: Bidirectional RNNs double the number of output features, so `out` is `(seq_len, batch, num_directions * hidden_size)` = (9, 2, 10). `hn` is `(num_layers * num_directions, batch, hidden_size)` = (6, 2, 5).
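
The shape rules used throughout these questions can be collected into a small helper. `rnn_shapes` is a hypothetical study aid, not a PyTorch function. It encodes the documented shapes of `out` and `h_n` for a batched input.

```python
def rnn_shapes(seq_len, batch, hidden_size, num_layers=1,
               bidirectional=False, batch_first=False):
    """Expected (out, h_n) shapes for a batched input, per the PyTorch shape rules."""
    directions = 2 if bidirectional else 1
    features = directions * hidden_size
    out = (batch, seq_len, features) if batch_first else (seq_len, batch, features)
    # h_n keeps the same layout regardless of batch_first.
    h_n = (num_layers * directions, batch, hidden_size)
    return out, h_n

print(rnn_shapes(9, 2, 5, num_layers=3, bidirectional=True))
# ((9, 2, 10), (6, 2, 5))
```

Note that `batch_first` only changes the layout of the input and `out`. The hidden state `h_n` always has the layer/direction axis first.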

[MCQ]

```python
import torch
import torch.nn as nn

class LoopRNN(nn.Module):
    def __init__(self):
        super().__init__()
        self.rnn = nn.RNN(3, 4, 1, batch_first=True)
    def forward(self, x):
        outputs = []
        h = torch.zeros(1, x.size(0), 4)
        for t in range(x.size(1)):
            out, h = self.rnn(x[:, t:t+1, :], h)
            outputs.append(out)
        return torch.cat(outputs, dim=1)

x = torch.randn(5, 8, 3)
model = LoopRNN()
out = model(x)
print(out.shape)
```

### 4. What is the shape of out?

- [ ] A) (5, 8, 3)
- [x] B) (5, 8, 4)
- [ ] C) (8, 5, 4)
- [ ] D) (5, 1, 4)

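Ans: The loop feeds one time step at a time and carries `h` forward; each call returns an output of shape (5, 1, 4), and concatenating the 8 per-step outputs along dim 1 gives (batch, seq_len, hidden_size) = (5, 8, 4).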

[MSQ]

```python
import torch
import torch.nn as nn

class CustomRNNCell(nn.Module):
    def __init__(self):
        super().__init__()
        self.rnn = nn.RNNCell(4, 2)
    def forward(self, x):
        h = torch.zeros(x.size(0), 2)
        out_seq = []
        for i in range(x.size(1)):
            h = self.rnn(x[:, i, :], h)
            out_seq.append(h)
        return torch.stack(out_seq, dim=1)

x = torch.randn(10, 7, 4)
model = CustomRNNCell()
out = model(x)
print(out.shape)
```

### 5. Which statements are correct about the above code snippet?

- [x] A) Shape of output is (10, 7, 2)
- [x] B) The model processes the input one time step at a time (for loop)
- [x] C) Parameter sharing occurs across all time steps
- [ ] D) This code supports variable-length sequences without modification

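Ans: The same `nn.RNNCell` is applied at every time step, so its parameters are shared across steps, and stacking the per-step hidden states along dim 1 gives (batch, seq_len, hidden_size) = (10, 7, 2). The loop adapts to whatever length `x.size(1)` is, but every sequence in a batch must have that same length: there is no padding, masking, or packing, so batches of differing-length sequences are not handled without modification (D is false).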

[MCQ]

```python
import torch
import torch.nn as nn

class MultiRNN(nn.Module):
    def __init__(self):
        super().__init__()
        self.rnn1 = nn.RNN(2, 6, 1, batch_first=True)
        self.rnn2 = nn.RNN(6, 3, 1, batch_first=True)
    def forward(self, x):
        o1, _ = self.rnn1(x)
        o2, _ = self.rnn2(o1)
        return o2

x = torch.randn(12, 10, 2)
model = MultiRNN()
out = model(x)
print(out.shape)
```

### 6. What is the dimension of model output?

- [ ] A) (12, 10, 2)
- [ ] B) (12, 10, 6)
- [x] C) (12, 10, 3)
- [ ] D) (10, 12, 6)

Ans: The first RNN maps (12, 10, 2) to (12, 10, 6), and the second maps that to (batch, seq_len, hidden_size of the last RNN) = (12, 10, 3).

[MSQ]

```python
import torch
x = torch.tensor([[1.,2.],[3.,4.]], requires_grad=True)
y = x * 2
z = y.sum()
z.backward()
```

### 7. Which statements are true?

- [x] A) The gradient of x will be all 2s
- [x] B) x.grad will have shape (2,2)
- [ ] C) z.backward() will throw an error since y is not a leaf tensor
- [x] D) The computation graph supports backward calls on scalar outputs

Ans: The gradient w.r.t x is 2 everywhere; x.grad matches x's shape; backward on scalar outputs works; y not being leaf is okay.
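
The value of the gradient can be sanity-checked without autograd. The sketch below uses central finite differences on the same function, `z = sum(2 * x)`, in plain Python. The helper names are hypothetical.

```python
def z(x):
    """z = sum(2 * x_ij), the same function as in the snippet."""
    return sum(2.0 * v for row in x for v in row)

def numerical_grad(f, x, eps=1e-6):
    """Central-difference estimate of df/dx for a nested-list matrix x."""
    grad = [[0.0] * len(row) for row in x]
    for i, row in enumerate(x):
        for j in range(len(row)):
            plus = [r[:] for r in x]
            minus = [r[:] for r in x]
            plus[i][j] += eps
            minus[i][j] -= eps
            grad[i][j] = (f(plus) - f(minus)) / (2 * eps)
    return grad

grad = numerical_grad(z, [[1.0, 2.0], [3.0, 4.0]])
print([[round(g, 4) for g in row] for row in grad])
```

The estimate has the same 2x2 shape as the input, and every entry is approximately 2.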

[MCQ]

```python
import torch
import torch.nn as nn

class TinyRNN(nn.Module):
    def __init__(self):
        super().__init__()
        self.rnn = nn.RNN(1, 1, 1, nonlinearity='relu', batch_first=True)
    def forward(self, x, h):
        out, h_n = self.rnn(x, h)
        return out, h_n

x = torch.ones(1, 3, 1)
h = torch.zeros(1, 1, 1)
model = TinyRNN()
with torch.no_grad():
    # Zero both biases and the recurrent weight; fix the input weight to 1.
    model.rnn.bias_ih_l0.fill_(0.0)
    model.rnn.bias_hh_l0.fill_(0.0)
    model.rnn.weight_ih_l0.fill_(1.0)
    model.rnn.weight_hh_l0.fill_(0.0)
out, h_n = model(x, h)
print(out.squeeze().tolist())
```

### 8. What value is printed?

- [x] A) [1.0, 1.0, 1.0]
- [ ] B) [1.0, 2.0, 3.0]
- [ ] C) [1.0, 1.0, -1.0]
- [ ] D) [0.0, 0.0, 0.0]

Ans: The input weight is 1 and the recurrent weight and both biases are 0, so the previous hidden state never contributes and each step computes ReLU(1 · 1) = 1. With the default tanh nonlinearity, each output would instead be tanh(1) ≈ 0.76.
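
A scalar version of the recurrence, written in plain Python, makes the role of the nonlinearity explicit. `nn.RNN` applies tanh unless `nonlinearity='relu'` is passed, and the two give different values for these weights. The helper `run_scalar_rnn` is hypothetical and assumes zero biases.

```python
import math

def run_scalar_rnn(xs, w_ih, w_hh, activation, h=0.0):
    """One-unit RNN with zero biases: h_t = activation(w_ih * x_t + w_hh * h_{t-1})."""
    outputs = []
    for x in xs:
        h = activation(w_ih * x + w_hh * h)
        outputs.append(h)
    return outputs

def relu(v):
    return max(0.0, v)

xs = [1.0, 1.0, 1.0]
relu_out = run_scalar_rnn(xs, w_ih=1.0, w_hh=0.0, activation=relu)
tanh_out = run_scalar_rnn(xs, w_ih=1.0, w_hh=0.0, activation=math.tanh)
print(relu_out)                           # [1.0, 1.0, 1.0]
print([round(v, 4) for v in tanh_out])    # [0.7616, 0.7616, 0.7616]
```

Because the recurrent weight is zero, every step sees only the current input. The activation alone decides whether the printed values are exactly 1 or tanh(1).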

[MCQ]

```python
import torch
import torch.nn as nn
lstm = nn.LSTM(2, 4, batch_first=True)
x = torch.randn(3, 5, 2)
h0 = torch.zeros(1, 3, 4)
c0 = torch.zeros(1, 3, 4)
y, (hn, cn) = lstm(x, (h0, c0))
print(y.shape, hn.shape, cn.shape)
```

### 9. What are the shapes of y, hn, cn?

- [ ] A) (3, 5, 2), (1, 3, 4), (1, 3, 4)
- [x] B) (3, 5, 4), (1, 3, 4), (1, 3, 4)
- [ ] C) (5, 3, 4), (1, 3, 4), (1, 3, 4)
- [ ] D) (3, 2, 4), (1, 5, 4), (1, 3, 4)

Ans: y is (batch, seq_len, hidden_size); hn/cn are (num_layers, batch, hidden_size).

[MSQ]

```python
import torch
import torch.nn as nn

class CustomEncoder(nn.Module):
    def __init__(self):
        super().__init__()
        self.embedding = nn.Embedding(10, 3)
        self.rnn = nn.RNN(3, 5, 1, batch_first=True)

    def forward(self, x):
        emb = self.embedding(x)
        seq_out, _ = self.rnn(emb)
        return seq_out

x = torch.tensor([[1,9,3],[2,2,6]])
model = CustomEncoder()
y = model(x)
```

### 10. Which are true?

- [x] A) The embedding layer converts each index into a 3-dimensional vector
- [x] B) Output shape of model is (2, 3, 5)
- [ ] C) RNN input is of type torch.long
- [x] D) Model would fail if x contained an index >= 10

Ans: Embedding transforms long indices to float vectors; model output matches (batch, seq_len, hidden_size); index out-of-bounds triggers runtime error.
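
Conceptually, an embedding layer is a table lookup. The sketch below stands in for `nn.Embedding(10, 3)` with a nested Python list filled with random values (a hypothetical substitute, not the PyTorch implementation). It shows both the resulting shape and the failure on an out-of-range index.

```python
import random

rng = random.Random(0)
num_embeddings, embedding_dim = 10, 3
# Hypothetical stand-in for the weight matrix of nn.Embedding(10, 3).
table = [[rng.gauss(0, 1) for _ in range(embedding_dim)]
         for _ in range(num_embeddings)]

def embed(batch):
    """Replace every integer index with its row of the table."""
    return [[table[i] for i in seq] for seq in batch]

emb = embed([[1, 9, 3], [2, 2, 6]])
print(len(emb), len(emb[0]), len(emb[0][0]))  # 2 3 3

try:
    embed([[10]])  # valid indices are 0..9
except IndexError as err:
    print("index 10 is out of range:", err)
```

The integer indices never reach the RNN itself. It only sees the float vectors that the lookup returns.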

[MCQ]

```python
import torch
import torch.nn as nn
rnn = nn.RNN(1, 2, 1)
x = torch.randn(7, 3, 1)
out, hn = rnn(x)
print(hn.shape)
```

### 11. What is the shape of hn?

- [x] A) (1, 3, 2)
- [ ] B) (3, 2, 7)
- [ ] C) (7, 3, 2)
- [ ] D) (1, 2, 3)

Ans: Shape is (num_layers, batch_size, hidden_size).

[NAT]

```python
import torch
import torch.nn as nn
rnn = nn.RNN(5, 7, 1, batch_first=True)
x = torch.randn(4, 13, 5)
out, hn = rnn(x)
print(out.shape[2])
```

### 12. What integer is printed?

Ans: 7

The third dimension of output corresponds to the hidden_size passed to RNN.