# Theory Questions (Total: 11)

[MCQ]

### 1. In the context of memory-based architectures, which statement best summarizes the difference between explicit and implicit memory mechanisms in recurrent neural networks, as described for LSTM and GRU?

- [x] A. LSTM retains prior cell calculations via explicit memory slots, whereas GRU's memory is internally managed by update/reset gates without separate storage.
- [ ] B. Both LSTM and GRU use explicit memory, but LSTM uses more gates.
- [ ] C. GRU manages memory with attention layers, whereas LSTM does not use gates for retention.
- [ ] D. LSTM relies on weight matrices for state recall, GRU uses activation functions alone.

Ans: LSTM features explicit memory cells (the cell state) that persist over time and are controlled by gates. GRU combines gating and state management, lacking a dedicated memory cell, so its memory is considered implicit.
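
The distinction shows up directly in what PyTorch's recurrent layers return. The snippet below is a minimal sketch with arbitrary, illustrative sizes: `nn.LSTM` hands back a hidden state and a separate cell state, while `nn.GRU` hands back only a hidden state.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
x = torch.randn(4, 10, 8)  # (batch, seq_len, features); sizes are illustrative

lstm = nn.LSTM(input_size=8, hidden_size=16, batch_first=True)
gru = nn.GRU(input_size=8, hidden_size=16, batch_first=True)

# LSTM returns the hidden state h_n AND a separate cell state c_n.
lstm_out, (h_n, c_n) = lstm(x)

# GRU returns only a hidden state; there is no dedicated cell state.
gru_out, gru_h_n = gru(x)

print(h_n.shape, c_n.shape)  # hidden and cell state have the same shape
print(gru_h_n.shape)
```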

[MCQ]

### 2. Bidirectional RNNs are introduced to overcome the limitations of standard RNNs in tasks where future context is essential. Which task would most likely highlight the advantage of bidirectional processing?

- [x] A. Named entity recognition, where predicting the label for a word may depend on both previous and subsequent words in the sentence.
- [ ] B. Naive sentiment aggregation, where only historical input matters.
- [ ] C. Time series forecasting for financial data.
- [ ] D. Image classification using CNN layers.

Ans: Bidirectional RNNs consider information from both left and right context, crucial for tasks like named entity recognition where context isn't limited to previous words.

[MSQ]

### 3. In a typical encoder-decoder architecture for sequence-to-sequence tasks (e.g., translation or summarization), which statements are valid?

- [x] A. The encoder compresses the input sequence into a context vector for the decoder.
- [ ] B. The decoder utilizes only the final token from the encoder to generate output.
- [x] C. The model is trained with teacher forcing, feeding ground-truth tokens during training.
- [ ] D. Output sequence and input sequence lengths must match.

Ans: The encoder compresses the source into a context vector (A), and teacher forcing supplies ground-truth tokens during training (C). The decoder conditions on that context vector, which summarizes the whole input sequence rather than only its final token (B is incorrect), and input and output sequence lengths need not match (D is incorrect).

[MCQ]

### 4. Which mechanism in LSTM helps mitigate the vanishing gradient problem that often plagues vanilla RNNs?

- [x] A. The introduction of gates that regulate the flow of information and gradients.
- [ ] B. The use of dropout within recurrent units.
- [ ] C. Data normalization during pre-processing.
- [ ] D. Gradient clipping during training.

Ans: LSTM's gating architecture controls the flow of signals and gradients, preserving information and mitigating vanishing gradients.
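
The role of the gates can be sketched with the standard LSTM cell-state update. The notation is the common textbook one and is an assumption here, since the question does not define symbols: $f_t$ is the forget gate, $i_t$ the input gate, and $\tilde{c}_t$ the candidate cell state.

$$
c_t = f_t \odot c_{t-1} + i_t \odot \tilde{c}_t
\quad\Longrightarrow\quad
\frac{\partial c_t}{\partial c_{t-1}} \approx \operatorname{diag}(f_t)
$$

The approximation keeps only the direct path through the cell state and ignores the gates' own dependence on $h_{t-1}$. When the forget gate stays close to 1, the gradient along this path passes back almost unchanged, whereas a vanilla RNN multiplies by the recurrent weight matrix and an activation derivative at every step.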

[MSQ]

### 5. BLEU (Bilingual Evaluation Understudy) is described as a metric for evaluating sequence generation. Which of the following directly affect the BLEU score?

- [x] A. High n-gram precision between candidate and reference sequence.
- [x] B. Overly short candidate sequences via brevity penalties.
- [ ] C. Use of rare vocabulary words in the output.
- [ ] D. Matching sequence lengths alone.

Ans: BLEU rewards n-gram precision between candidate and reference (A) and penalizes overly short candidates with a brevity penalty (B). Rare vocabulary is not scored on its own (C), and matching the reference length alone earns nothing without n-gram overlap (D).
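
The two scoring components can be illustrated with a simplified, single-reference sketch. This is not a full BLEU implementation (it omits the geometric mean over n-gram orders and smoothing), and the helper names `clipped_precision` and `brevity_penalty` are hypothetical.

```python
import math
from collections import Counter


def ngrams(tokens, n):
    """Return all contiguous n-grams of a token list as tuples."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


def clipped_precision(candidate, reference, n):
    """Fraction of candidate n-grams found in the reference, with counts clipped."""
    cand_counts = Counter(ngrams(candidate, n))
    ref_counts = Counter(ngrams(reference, n))
    total = sum(cand_counts.values())
    if total == 0:
        return 0.0
    matched = sum(min(count, ref_counts[gram]) for gram, count in cand_counts.items())
    return matched / total


def brevity_penalty(candidate_len, reference_len):
    """1.0 for candidates longer than the reference, exp(1 - r/c) otherwise."""
    if candidate_len == 0:
        return 0.0
    if candidate_len > reference_len:
        return 1.0
    return math.exp(1 - reference_len / candidate_len)


reference = "the cat sat on the mat".split()
candidate = "the cat sat".split()

p1 = clipped_precision(candidate, reference, 1)  # every candidate unigram matches
bp = brevity_penalty(len(candidate), len(reference))
print(p1, round(bp, 3))  # 1.0 0.368
```

The candidate has perfect unigram precision, yet the brevity penalty scales its score down because it is only half as long as the reference.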

[MCQ]

### 6. Teacher forcing in seq2seq training helps stabilize learning. What is a potential risk associated with teacher forcing when performing inference?

- [x] A. During inference, the model generates sequences without access to ground-truth tokens, so errors can accumulate step-by-step.
- [ ] B. During inference, training and inference loss values remain identical.
- [ ] C. The approach delays gradient computation until evaluation.
- [ ] D. Teacher forcing skips backpropagation in recurrent layers.

Ans: During inference, the model must rely on its own prior predictions, so initial mistakes can propagate without ground-truth corrections.

[MSQ]

### 7. Which limitations of standard RNNs are specifically addressed by LSTM and GRU architectures?

- [x] A. Difficulty with long-term dependency retention.
- [ ] B. Fixed input-output lengths for all sequence tasks.
- [x] C. Gradient instability due to vanishing and exploding gradients.
- [x] D. Lack of selective memory updates.

Ans: LSTM/GRU address long-term dependency issues (A), mitigate gradient instability, chiefly vanishing gradients (C), and introduce selective memory updates through gating (D). Fixed input-output lengths (B) are not addressed by these architectures.

[MCQ]

### 8. In the preprocessing pipeline described for sentiment analysis, padding is utilized for what purpose in batch-based training?

- [x] A. To allow processing of variable-length sequences by making their lengths uniform.
- [ ] B. To enhance n-gram scoring.
- [ ] C. To increase vocabulary diversity in the batch.
- [ ] D. To simplify the labeling process.

Ans: Padding makes all input sequences the same length, enabling efficient batch training in frameworks that require consistent tensor shapes.
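
A minimal sketch of the idea in plain Python. Right-padding with 0 is an assumption; the pipeline the question refers to may pad on the left or truncate to a fixed length instead, and `pad_sequences` here is a hypothetical helper, not a library call.

```python
def pad_sequences(sequences, pad_value=0):
    """Right-pad every sequence with pad_value up to the longest length in the batch."""
    max_len = max(len(seq) for seq in sequences)
    return [seq + [pad_value] * (max_len - len(seq)) for seq in sequences]


batch = [[4, 9, 2], [7], [3, 5]]
padded = pad_sequences(batch)
print(padded)  # [[4, 9, 2], [7, 0, 0], [3, 5, 0]]
```

After padding, every row has the same length, so the batch can be stored as one rectangular tensor.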

[MCQ]

### 9. When encoding reviews as integer sequences for the RNN/LSTM model, which role does the integer '0' play in vocabulary mapping?

- [x] A. It acts as a reserved index for padding tokens in the encoded sequence.
- [ ] B. It is assigned to unknown or out-of-vocabulary words.
- [ ] C. It indicates the start of the sequence.
- [ ] D. It marks sentiment class boundaries.

Ans: Padding is required to standardize sequence lengths, and 0 is typically reserved for this role.
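
One way to keep index 0 free is to start real word indices at 1 when building the vocabulary. The sketch below assumes a frequency-ordered vocabulary and whitespace tokenization; `build_vocab` and `encode` are hypothetical helpers, not the original code.

```python
from collections import Counter

PAD_INDEX = 0  # reserved for padding, never assigned to a real word


def build_vocab(texts):
    """Map each word to an integer, most frequent first, starting at 1."""
    counts = Counter(word for text in texts for word in text.split())
    return {word: idx for idx, (word, _) in enumerate(counts.most_common(), start=1)}


def encode(text, vocab):
    """Convert a text to word indices; unknown words are skipped in this sketch."""
    return [vocab[word] for word in text.split() if word in vocab]


reviews = ["good movie", "bad movie", "good good plot"]
vocab = build_vocab(reviews)
encoded = [encode(review, vocab) for review in reviews]
print(vocab)
print(encoded)
```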

[MCQ]

### 10. What is the primary class distribution for the labeled sentiment analysis dataset?

- [x] A. 50% positive and 50% negative reviews
- [ ] B. 75% positive and 25% negative reviews
- [ ] C. 60% positive and 40% negative reviews
- [ ] D. 90% positive and 10% negative reviews

Ans: The dataset is typically balanced in class distribution for effective binary classification.

[MSQ]

### 11. Which common preprocessing step(s) are performed before tokenization in the provided code?

- [x] A. Case normalization
- [x] B. Punctuation removal
- [ ] C. Sequence encoding
- [ ] D. Vocabulary expansion

Ans: Processing steps like lowercasing and removing punctuation help standardize input before tokenization.
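
The provided code is not reproduced here, so the following is only a sketch of what such a step might look like, assuming ASCII punctuation and whitespace tokenization; `preprocess` is a hypothetical name.

```python
import string


def preprocess(text):
    """Lowercase the text, strip ASCII punctuation, then split on whitespace."""
    text = text.lower()
    text = text.translate(str.maketrans("", "", string.punctuation))
    return text.split()


print(preprocess("Great movie, LOVED it!"))  # ['great', 'movie', 'loved', 'it']
```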

# Coding Questions (Total: 12)

[MCQ]

```python
import torch
import torch.nn as nn
class CustomGRU(nn.Module):
    def __init__(self, input_size=10, hidden_size=32, num_layers=2, dropout=0.2):
        super().__init__()
        self.gru = nn.GRU(input_size, hidden_size, num_layers, dropout=dropout, batch_first=True)
        self.fc = nn.Linear(hidden_size, 1)
    def forward(self, x, h0=None):
        output, hn = self.gru(x, h0)
        out = self.fc(output[:, -1, :])
        return out, hn
x = torch.randn(8, 20, 10)
h0 = torch.zeros(2, 8, 32)
model = CustomGRU()
y, hn = model(x, h0)
```

### 1. What is the shape of the hidden state `hn` returned from the forward method?

- [x] A. (2, 8, 32)
- [ ] B. (8, 32)
- [ ] C. (1, 8, 32)
- [ ] D. (8, 20, 1)

Ans: For GRU, `hn` shape is (num_layers, batch, hidden_size). Here: 2 layers, batch=8, hidden_size=32.

[MSQ]

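```python
import torch
from torch import nn
# batch_first=True so the layer reads x as (batch, seq_len, features)
lstm = nn.LSTM(input_size=7, hidden_size=16, num_layers=3, bidirectional=True, batch_first=True)
x = torch.randn(5, 12, 7)  # batch=5, seq_len=12, features=7
# Initial states are always (num_layers * num_directions, batch, hidden_size)
h0 = torch.zeros(6, 5, 16)
c0 = torch.zeros(6, 5, 16)
out, (hn, cn) = lstm(x, (h0, c0))
```

### 2. Given the above setup, which of the following statements are correct?

- [ ] A. The LSTM output tensor `out` is shaped (12, 5, 32).
- [x] B. The number of layers in `hn` is 6 because of bidirectional LSTM.
- [x] C. The hidden state is doubled due to bidirectionality.
- [ ] D. The first axis of `hn` always matches the batch size.

Ans: With `bidirectional=True`, the first axis of `hn`/`cn` is num_layers * 2 = 3 * 2 = 6, giving shape (6, 5, 16) (B, C). The output feature dimension also doubles to 32, but with `batch_first=True` the tensor `out` is (5, 12, 32), not (12, 5, 32), so A is incorrect. The first axis of `hn` is layers times directions, never the batch size, so D is incorrect.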
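
The stacking of layers and directions can be checked directly. The sketch below assumes a batch-first layout and relies on PyTorch's layer-major ordering of `hn` (layer 0 forward, layer 0 backward, layer 1 forward, and so on).

```python
import torch
from torch import nn

torch.manual_seed(0)
lstm = nn.LSTM(input_size=7, hidden_size=16, num_layers=3,
               bidirectional=True, batch_first=True)
x = torch.randn(5, 12, 7)  # (batch, seq_len, features)
out, (hn, cn) = lstm(x)

print(out.shape)  # torch.Size([5, 12, 32]): feature dim is 2 * hidden_size
print(hn.shape)   # torch.Size([6, 5, 16]): num_layers * 2 on the first axis

# Split the first axis into (num_layers, num_directions).
hn_by_layer = hn.view(3, 2, 5, 16)
last_fwd = hn_by_layer[-1, 0]  # top layer, forward direction
last_bwd = hn_by_layer[-1, 1]  # top layer, backward direction

# The forward final state sits at the last time step of `out`,
# the backward final state at the first time step.
print(torch.allclose(last_fwd, out[:, -1, :16]))
print(torch.allclose(last_bwd, out[:, 0, 16:]))
```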

[MCQ]

```python
import torch
import torch.nn as nn
class SequenceClassifier(nn.Module):
    def __init__(self):
        super().__init__()
        self.rnn = nn.RNN(4, 8, 1, batch_first=True)
        self.fc = nn.Linear(8, 2)
    def forward(self, x):
        out, _ = self.rnn(x)
        return self.fc(out[:, -1, :])
x = torch.randn(32, 15, 4)
model = SequenceClassifier()
y = model(x)
```

### 3. What does `out[:, -1, :]` select from the RNN output?

- [x] A. Last time-step output vector for each sequence in the batch.
- [ ] B. Random sample from the batch.
- [ ] C. All outputs across time steps.
- [ ] D. Initial hidden state before sequence starts.

Ans: Index `-1` picks the last element along the time dimension for each batch sample. This is a common way to summarize a sequence before classification.

[MSQ]

```python
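import torch
import torch.nn as nn
class DeepLSTM(nn.Module):
    def __init__(self):
        super().__init__()
        self.lstm = nn.LSTM(6, 12, num_layers=2, batch_first=True)
        self.fc = nn.Linear(12, 1)
    def forward(self, x):
        # out: all time steps of the top layer; hn, cn: final states of every layer
        out, (hn, cn) = self.lstm(x)
        return self.fc(out[:, -1, :]), hn, cn
x = torch.randn(10, 16, 6)  # batch=10, seq_len=16, features=6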
```

### 4. Which are true about the shapes of `out`, `hn`, and the output of `fc`?

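- [x] A. `out` is (10, 16, 12)
- [x] B. `hn` is (2, 10, 12)
- [x] C. Output of `fc` is (10, 1)
- [ ] D. `hn` has batch size and seq_len swapped.

Ans: With `batch_first=True`, `out` is (batch, seq_len, hidden) = (10, 16, 12). `hn` is (num_layers, batch, hidden) = (2, 10, 12); it has no seq_len axis, so D is incorrect. `fc` projects the last time step of each sample to a single value, giving (10, 1).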

[MCQ]

```python
import torch
import torch.nn as nn
import torch.nn.functional as F
x = torch.randn(24, 30, 15)
model = nn.LSTM(15, 64, 1, batch_first=True)
out, _ = model(x)
prob = F.softmax(out[:, -1, :], dim=-1)
```

### 5. What does `prob` represent?

- [x] A. Normalized probabilities computed over features of last time step per batch.
- [ ] B. Probabilities for entire sequence length.
- [ ] C. Output logits for all time steps.
- [ ] D. Raw hidden state values.

Ans: Softmax is applied only to the last time step's features, so `prob` is a (24, 64) batch of distributions over the hidden features.

[MSQ]

```python
import torch
import torch.nn as nn
cell = nn.LSTMCell(9, 17)
x = torch.randn(5, 9)
hx = torch.zeros(5, 17)
cx = torch.zeros(5, 17)
for i in range(11):
    hx, cx = cell(x, (hx, cx))
```

### 6. Which are true regarding this LSTMCell usage?

- [x] A. Each iteration simulates a time step in sequence processing.
- [ ] B. Inputs to the cell stay constant here, so output doesn't change meaningfully.
- [x] C. This pattern is suitable for custom sequence iteration.
- [ ] D. Output batch size would change if cell parameters differed.

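Ans: Each loop iteration acts as one time step (A), and stepping manually gives full control over what is fed at each step (C). Although the input `x` is the same in every iteration, `hx` and `cx` are updated each step, so the output still changes (B is incorrect). Batch size is set by the input tensors, not by the cell's parameters (D is incorrect).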
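
In practice the loop usually feeds a different slice at each step. The sketch below reuses the cell sizes from the question; the 3-D input `seq` is a hypothetical sequence added for illustration.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
cell = nn.LSTMCell(9, 17)
seq = torch.randn(5, 11, 9)  # (batch, seq_len, features); hypothetical input
hx = torch.zeros(5, 17)
cx = torch.zeros(5, 17)

outputs = []
for t in range(seq.size(1)):
    # Feed the t-th time step of every sequence in the batch.
    hx, cx = cell(seq[:, t, :], (hx, cx))
    outputs.append(hx)

outputs = torch.stack(outputs, dim=1)  # (batch, seq_len, hidden_size)
print(outputs.shape)  # torch.Size([5, 11, 17])
```

Collecting `hx` at every step reproduces the per-step output that `nn.LSTM` would return for a single-layer, unidirectional model.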

[MCQ]

```python
import torch
import torch.nn as nn
class BidirectionalLSTM(nn.Module):
    def __init__(self):
        super().__init__()
        self.lstm = nn.LSTM(14, 20, batch_first=True, bidirectional=True)
    def forward(self, x):
        o, _ = self.lstm(x)
        return o
x = torch.randn(9, 50, 14)
model = BidirectionalLSTM()
out = model(x)
```

### 7. What is the shape of `out`?

- [x] A. (9, 50, 40)
- [ ] B. (9, 50, 20)
- [ ] C. (9, 100, 14)
- [ ] D. (9, 50, 14)

Ans: Bidirectionality concatenates the forward and backward outputs, doubling the feature dimension. The output shape is (batch, seq_len, 2 * hidden_size) = (9, 50, 40).

[MCQ]

```python
import torch
import torch.nn as nn
class TimeSeriesRNN(nn.Module):
    def __init__(self):
        super().__init__()
        self.rnn = nn.RNN(3, 4, 1, nonlinearity='relu', batch_first=True)
        self.fc = nn.Linear(4, 1)
    def forward(self, x):
        o, hn = self.rnn(x)
        y = self.fc(o[:, -1, :])
        return y
x = torch.randn(7, 29, 3)
```

### 8. We have specified `nonlinearity='relu'`. How does it affect the `TimeSeriesRNN` model?

- [x] A. Activations inside RNN will use ReLU instead of tanh for hidden state updates.
- [ ] B. Output shape changes to account for ReLU.
- [ ] C. Model will always converge faster.
- [ ] D. RNN input size must match output size.

Ans: ReLU replaces tanh as the activation in the hidden-state update. Tensor shapes are unchanged, faster convergence is not guaranteed, and input size is independent of output size.

[MCQ]

```python
import torch
import torch.nn as nn
model = nn.LSTM(input_size=8, hidden_size=16, batch_first=True)
x = torch.randn(10, 25, 8)
out, (hn, cn) = model(x)
```

### 9. What happens if you change the LSTM to `batch_first=False`?

- [x] A. Input should be (seq_len, batch, input_size).
- [ ] B. Input should be (batch, seq_len, input_size).
- [ ] C. Hidden state has incorrect shape.
- [ ] D. Only bidirectional LSTM is affected.

Ans: Setting `batch_first=False` expects input with sequence dimension leading, (seq_len, batch, input_size).
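
A small sketch of the layout change, assuming the data is still stored batch-first and is transposed before being passed to a sequence-first layer:

```python
import torch
import torch.nn as nn

x = torch.randn(10, 25, 8)  # (batch, seq_len, input_size)
lstm = nn.LSTM(input_size=8, hidden_size=16)  # batch_first defaults to False

x_seq_first = x.transpose(0, 1)  # (seq_len, batch, input_size)
out, (hn, cn) = lstm(x_seq_first)

print(out.shape)  # torch.Size([25, 10, 16]): sequence axis leads
print(hn.shape)   # torch.Size([1, 10, 16]): same layout with or without batch_first
```

Only the input and `out` follow the `batch_first` flag; `hn` and `cn` keep the (num_layers * num_directions, batch, hidden_size) layout either way.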

[NAT]

```python
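import torch

def create_sequences(data, seq_length):
    xs, ys = [], []
    for i in range(len(data) - seq_length):
        x = data[i:i+seq_length]  # input window of seq_length values
        y = data[i+seq_length]    # the value right after the window is the target
        xs.append(x)
        ys.append(y)
    # torch.stack joins the tensor slices; torch.tensor() cannot convert a list of multi-element tensors
    return torch.stack(xs).float(), torch.stack(ys).float()
series = torch.arange(31)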
X, y = create_sequences(series, seq_length=5)
```

### 10. What is the length of the target tensor `y` produced?

Ans: 26

For a series of length 31 and seq_length=5, you get (31-5) = 26 targets.
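
The count can be verified with a plain-Python version of the same sliding window (lists stand in for tensors here purely for illustration):

```python
series = list(range(31))
seq_length = 5

# One (window, target) pair per valid start index.
pairs = [(series[i:i + seq_length], series[i + seq_length])
         for i in range(len(series) - seq_length)]

print(len(pairs))  # 26
print(pairs[0])    # ([0, 1, 2, 3, 4], 5)
print(pairs[-1])   # ([25, 26, 27, 28, 29], 30)
```

The last valid window starts at index 25, because its target must still exist at index 30.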

[MCQ]

```python
import torch
from torch import nn
model = nn.RNN(10, 20, batch_first=True)
x = torch.randn(4, 6, 10)
out, hn = model(x)
```

### 11. What will be the shape of `out`?

- [x] A. (4, 6, 20)
- [ ] B. (6, 4, 20)
- [ ] C. (4, 20, 6)
- [ ] D. (4, 6)

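Ans: With `batch_first=True`, the output is (batch, seq_len, hidden_size) = (4, 6, 20): the batch and sequence dimensions match the input, and the last dimension is the hidden size.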

[MCQ]

```python
import torch
import torch.nn as nn
x = torch.randn(32, 10, 5)
lstm = nn.LSTM(5, 7, batch_first=True)
out, (hn, cn) = lstm(x)
```

### 12. Which parameter controls the output feature dimension for each time step?

- [x] A. hidden_size
- [ ] B. batch_first
- [ ] C. input_size
- [ ] D. num_layers

Ans: `hidden_size` sets the dimension of each time step's output for LSTM/RNN/GRU layers.