# Theory Questions (Total: 12)

[MCQ]

### 1. In static word embeddings (e.g., Word2Vec or GloVe), consider the phrase "bank deposit near the river bank." Which of the following statements best explains the inadequacy of static embeddings in this scenario?

- [x] A. Static embeddings treat both meanings of "bank" as identical, failing to account for context.
- [ ] B. Static embeddings assign different vectors for identical words in different sentences.
- [ ] C. Static embeddings use randomly assigned vectors for each occurrence of a word.
- [ ] D. Static embeddings rely solely on syntactic information when representing word meaning.

Ans: Static (non-contextual) embeddings map each unique word to a single fixed vector, regardless of sentence context. Both senses of "bank" (financial institution and riverbank) therefore share one vector, so a model cannot tell them apart from the embedding alone.
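
A minimal sketch of why a static lookup cannot separate the two senses. The tiny vocabulary and the 3-dimensional vectors are invented for illustration; they are not real Word2Vec or GloVe values.

```python
# Toy static embedding table: one fixed vector per word type.
# All numbers are made up for illustration only.
static_embeddings = {
    "bank": [0.2, -0.5, 0.7],
    "deposit": [0.9, 0.1, -0.3],
    "near": [0.0, 0.4, 0.4],
    "the": [0.1, 0.1, 0.1],
    "river": [-0.6, 0.8, 0.2],
}

tokens = "bank deposit near the river bank".split()
vectors = [static_embeddings[t] for t in tokens]

# The first "bank" (financial sense) and the last "bank" (river sense)
# get exactly the same vector, because the lookup ignores context.
same_vector = vectors[0] == vectors[-1]
print(same_vector)  # True
```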

[MCQ]

### 2. A neural network has been trained to learn embeddings for a medical vocabulary. If the training data frequently contained "hypertension" and "high blood pressure" together, what would you observe in the vector space?

- [ ] A. The vectors representing "hypertension" and "high blood pressure" will diverge significantly.
- [x] B. The vectors will become closely clustered, capturing semantic similarity.
- [ ] C. The vectors will collapse to a zero vector.
- [ ] D. The learning process will fail for infrequent words.

Ans: Word embeddings capture semantic relationships based on co-occurrence patterns in data. If "hypertension" and "high blood pressure" often appear together, their vectors become similar, reflecting their related meanings.

[MSQ]

### 3. Which of the following are true about the bottleneck in an RNN encoder-decoder architecture for machine translation?

- [x] A. Information from early tokens can be diluted or lost in the fixed-size context vector.
- [ ] B. The decoder generates every output token using attention over all encoder outputs.
- [x] C. Vanishing gradients may limit long-term information retention.
- [x] D. The final hidden state of the encoder summarizes the whole input sequence.

Ans: Early token information can be diluted or lost in the fixed-size context vector, which is the core limitation. RNNs are susceptible to vanishing gradients, making it harder to capture long-range dependencies. The encoder's final hidden state is a compressed summary of the whole sequence, which causes information loss. B describes an attention-based decoder, which is a remedy for the bottleneck rather than a property of it.

[MCQ]

### 4. Consider the chain rule in backpropagation over the unrolled computation graph of a simple RNN. Which mathematical problem primarily arises as the sequence length increases?

- [ ] A. Numerical instability in softmax computation for large vocabularies.
- [x] B. Gradients either vanish or explode due to repeated multiplications.
- [ ] C. Convergence of embeddings to trivial solutions.
- [ ] D. Redundant representations of context in each hidden state.

Ans: The chain rule applied through many time steps can cause gradients to shrink (vanish) or grow (explode) exponentially, leading to unstable or ineffective training in long sequences.
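
The repeated multiplication can be illustrated with a scalar toy model. This sketch assumes every time step contributes the same derivative factor, which is a simplification of the real per-step Jacobians; the helper name `gradient_scale` is hypothetical.

```python
def gradient_scale(per_step_factor, num_steps):
    """Multiply `num_steps` identical per-step derivative factors together."""
    scale = 1.0
    for _ in range(num_steps):
        scale *= per_step_factor
    return scale

# A factor slightly below 1 shrinks the gradient towards zero over 50 steps
# (about 0.005); a factor slightly above 1 blows it up (about 117).
vanishing = gradient_scale(0.9, 50)
exploding = gradient_scale(1.1, 50)
print(vanishing, exploding)
```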

[MSQ]

### 5. In word embeddings, which observations can typically be captured through vector arithmetic in the embedding space?

- [x] A. Gender relationships (e.g., King - Man + Woman ≈ Queen)
- [x] B. Syntactic roles (e.g., verb vs. noun)
- [ ] C. Context-specific meaning in static embeddings
- [x] D. Analogical relationships between pairs of words

Ans: Gender relationships can often be represented as consistent vector offsets. Some syntactic roles also show structure in embedding spaces. Analogical relationships (like "Paris is to France as Tokyo is to Japan") are modeled via vector arithmetic. C is incorrect because static embeddings do not capture context-specific meanings.
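
A toy illustration of the analogy arithmetic. The 2-dimensional vectors are hand-constructed so that one axis loosely stands for "royalty" and the other for "gender"; they are assumptions for this sketch, not learned embeddings.

```python
import math

# Hand-made toy vectors: [royalty, gender]. Not learned from data.
toy = {
    "king": [1.0, 1.0],
    "queen": [1.0, -1.0],
    "man": [0.0, 1.0],
    "woman": [0.0, -1.0],
    "apple": [-1.0, 0.0],
}

def cosine(u, v):
    """Cosine similarity between two 2-d vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.hypot(*u) * math.hypot(*v))

# king - man + woman
target = [k - m + w for k, m, w in zip(toy["king"], toy["man"], toy["woman"])]

# Search the remaining words, excluding the three query words themselves.
candidates = {w: v for w, v in toy.items() if w not in {"king", "man", "woman"}}
best = max(candidates, key=lambda w: cosine(target, candidates[w]))
print(best)  # queen
```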

[MCQ]

### 6. Suppose you have a deep learning model trained to generate word embeddings solely using co-occurrence frequencies in text corpora. Which of the following limitations might it exhibit for rare words?

- [ ] A. Rare words will have highly robust and distinct representations.
- [x] B. Embeddings for rare words may be poorly estimated, limiting downstream performance.
- [ ] C. Rare words are ignored completely in static models.
- [ ] D. Rare words always receive identical vectors.

Ans: Rare words have fewer examples, so their vectors are less reliable, harming performance in downstream tasks.

[MCQ]

### 7. When training an RNN on long sequences for language modeling, what architectural modification specifically addresses the problem of long-term dependency retention?

- [ ] A. Transitioning to feed-forward networks.
- [x] B. Incorporating LSTM or GRU cells for improved memory persistence.
- [ ] C. Increasing the vocabulary size.
- [ ] D. Using static word embeddings.

Ans: LSTM and GRU units have gating mechanisms designed to retain information over long sequences, directly addressing the limitations of vanilla RNNs in modeling long-term dependencies.

[MSQ]

### 8. Regarding the computation of contextualized embeddings, which statements are accurate?

- [x] A. The representation of a word depends on its surrounding words.
- [x] B. The embedding of "bank" in "river bank" differs from "bank account."
- [ ] C. Contextualized embeddings are always static after training.
- [x] D. Models like BERT generate embeddings by considering the whole sentence.

Ans: Contextualized embeddings depend on surrounding words (this phenomenon is referred to as context awareness). The vector for "bank" adapts in "river bank" vs. "bank account". Models such as BERT dynamically generate embeddings based on sentence context.

[MCQ]

### 9. Suppose you are using an encoder-decoder RNN model for translating paragraphs instead of sentences. Which of the following factors would most affect translation quality of long inputs?

- [ ] A. Fixed vocabulary during tokenization
- [x] B. Compression of the entire input in the final encoder state
- [ ] C. Masking padded tokens during training
- [ ] D. Increasing batch size without changing model depth

Ans: The limitation arises because the encoder squashes all input information into a single fixed vector, leading to significant loss of details, especially for long paragraphs.

[MSQ]

### 10. Which mechanisms help alleviate vanishing gradient problems in sequence models?

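- [x] A. Use of LSTM cells
- [ ] B. Gradient clipping during optimizer updates
- [ ] C. Increasing the depth of vanilla RNN layers
- [x] D. Incorporation of gating functions in recurrent units

Ans: LSTM cells keep an internal cell state with additive updates and gates, which lets gradients flow across many time steps. Gating functions in general (as in GRUs or LSTMs) mitigate vanishing gradients for the same reason. Gradient clipping only bounds the gradient norm from above, so it addresses exploding gradients rather than vanishing ones. Stacking more vanilla RNN layers adds depth without fixing the recurrent path.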
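
A minimal pure-Python sketch of norm-based clipping, to show what it does to an oversized gradient and to a tiny one. The helper `clip_by_norm` is hypothetical and written for illustration; in PyTorch the corresponding utility is `torch.nn.utils.clip_grad_norm_`.

```python
import math

def clip_by_norm(grad, max_norm):
    """Rescale `grad` so that its L2 norm is at most `max_norm`."""
    norm = math.sqrt(sum(g * g for g in grad))
    if norm > max_norm:
        return [g * max_norm / norm for g in grad]
    return list(grad)  # already within the bound: returned unchanged

large = clip_by_norm([30.0, 40.0], max_norm=5.0)  # norm 50 is rescaled to norm 5
small = clip_by_norm([3e-6, 4e-6], max_norm=5.0)  # norm 5e-6 is left as it is
print(large, small)
```

Clipping caps the large gradient and leaves the tiny one unchanged.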

[MCQ]

### 11. While generating embeddings from scratch, what is the typical first step before meaningful learning occurs?

- [ ] A. Assigning fixed vectors based on human intuition
- [x] B. Initializing vectors randomly and updating during network training
- [ ] C. Averaging co-occurrence counts
- [ ] D. Clustering words by frequency

Ans: Embedding vectors are usually initialized randomly and then refined as part of the network's training process through backpropagation.

[NAT]

### 12. If a neural network uses an attention mechanism with the softmax function to normalize attention scores across 10 input tokens, what is the sum of all attention weights produced for this input? [Enter a single numerical value]

Ans: 1

The softmax function outputs a probability distribution that sums to 1 over all inputs; thus, the sum of all attention weights for 10 tokens is 1.
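
The normalization can be checked numerically. This is a small standard-library sketch; the 10 scores are arbitrary values chosen for illustration.

```python
import math

def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

# Ten arbitrary attention scores, one per input token.
scores = [0.5, -1.2, 3.0, 0.0, 2.1, -0.7, 1.4, 0.3, -2.5, 0.9]
weights = softmax(scores)
total_weight = sum(weights)
print(total_weight)  # 1.0 up to floating-point rounding
```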

# Coding Questions (Total: 12)

[MCQ]

```python
import torch
import torch.nn as nn

class CustomEmbedder(nn.Module):
    def __init__(self, vocab_size, embedding_dim):
        super().__init__()
        self.embedding = nn.Embedding(vocab_size, embedding_dim)
        self.linear = nn.Linear(embedding_dim, embedding_dim)
    def forward(self, input_ids):
        x = self.embedding(input_ids)
        return self.linear(x)
emb = CustomEmbedder(1000, 128)
input_ids = torch.randint(0, 1000, (32, 10))
output = emb(input_ids)
print(output.shape)
```

### 1. Which is the correct shape of `output`?

- [ ] A. `(32, 128)`
- [ ] B. `(128, 32)`
- [x] C. `(32, 10, 128)`
- [ ] D. `(32, 10)`

Ans: The embedding looks up a 128-dimensional vector for each of the 10 tokens per sequence, for 32 sequences, so the output is `(batch, seq_len, dim)`.

[MSQ]

```python
import torch
import torch.nn as nn

batch = [{'text': [2, 5, 7, 8]}, {'text': [4, 2]}]
padded = nn.utils.rnn.pad_sequence(
    [torch.tensor(x['text']) for x in batch], padding_value=0)
```

### 2. Which statements are true after execution?

- [x] A. The padded tensor's `shape[0]` equals the length of the longest sequence.
- [ ] B. Padding is added at the start of sequences.
- [x] C. By default (`batch_first=False`), the output is laid out as `(max_len, batch)`.
- [x] D. Padding value 0 is used for shorter sequences.

Ans: `pad_sequence` pads shorter sequences at the end with the given `padding_value` (0 here). With the default `batch_first=False`, the result has shape `(max_len, batch)`, here `(4, 2)`, so `shape[0]` equals the length of the longest sequence.
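
The two layouts of `pad_sequence` can be compared side by side. This sketch reuses the two sequences from the question and only toggles the `batch_first` argument.

```python
import torch
from torch.nn.utils.rnn import pad_sequence

seqs = [torch.tensor([2, 5, 7, 8]), torch.tensor([4, 2])]

# Default layout: (max_len, batch)
time_major = pad_sequence(seqs, padding_value=0)
# Alternative layout: (batch, max_len)
batch_major = pad_sequence(seqs, batch_first=True, padding_value=0)

print(time_major.shape)         # torch.Size([4, 2])
print(batch_major.shape)        # torch.Size([2, 4])
print(batch_major[1].tolist())  # [4, 2, 0, 0] -> zeros appended at the end
```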

[MCQ]

```python
import torch.nn as nn

class SimpleModel(nn.Module):
    def __init__(self):
        super().__init__()
        self.linear = nn.Linear(64, 64)
    def forward(self, x):
        return self.linear(x)
model = SimpleModel()
model.apply(lambda m: nn.init.constant_(m.weight, 0.5) if hasattr(m, 'weight') else None)
```

### 3. What are the values of the model's weights after this initialization?

- [x] A. All weights become exactly 0.5 where possible.
- [ ] B. Only the last layer receives constant weights, others are untouched.
- [ ] C. All weights are zeros.
- [ ] D. Initialization fails if no 'weight' attribute exists.

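Ans: `apply` calls the function on every submodule and on the model itself. Wherever a `weight` attribute exists (here the `nn.Linear` layer), `nn.init.constant_` sets every entry to 0.5; biases are left unchanged.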
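
An alternative way to write the same initialization is a named function with an explicit type check. This is a sketch, not the question's code: the `nn.Sequential` model and the helper name `init_constant` are assumptions chosen for illustration.

```python
import torch
import torch.nn as nn

# Hypothetical two-layer model used only for this illustration.
model = nn.Sequential(nn.Linear(64, 64), nn.ReLU(), nn.Linear(64, 8))

def init_constant(m):
    # Only touch Linear layers; ReLU and the container have no weight.
    if isinstance(m, nn.Linear):
        nn.init.constant_(m.weight, 0.5)

model.apply(init_constant)

linear_layers = [m for m in model if isinstance(m, nn.Linear)]
all_half = all(torch.all(m.weight == 0.5).item() for m in linear_layers)
# Biases keep their default random initialization.
bias_untouched = not torch.all(model[0].bias == 0.5).item()
print(all_half, bias_untouched)  # True True
```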

[MCQ]

```python
import torch

def my_collate(batch):
    xs = [torch.tensor(item) for item in batch]
    return torch.stack(xs, dim=1)
loader = torch.utils.data.DataLoader([[1,2], [3,4]], batch_size=2, collate_fn=my_collate)
for batch in loader:
    print(batch.shape)
```

### 4. What will be printed?

- [x] A. `torch.Size([2, 2])`
- [ ] B. `torch.Size([2])`
- [ ] C. `torch.Size([2, 2, 1])`
- [ ] D. `torch.Size([2, 1])`

Ans: Stacking 2 tensors of shape `(2,)` along `dim=1` gives `(2, 2)` shape.

[MSQ]

```python
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)
scheduler = torch.optim.lr_scheduler.ReduceLROnPlateau(optimizer, mode='min', patience=2)
for epoch in range(5):
    train_loss = ... # returns a float
    scheduler.step(train_loss)
```

### 5. Which statements accurately describe what can occur during training?

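- [x] A. Learning rate is reduced if `train_loss` fails to improve for more than 2 consecutive epochs.
- [x] B. The optimizer's learning rate may decrease multiple times.
- [ ] C. The scheduler is unaffected by validation loss.
- [x] D. If loss decreases every epoch, learning rate stays the same.

Ans: The scheduler tracks whichever metric is passed to `step()`, so it would react to a validation loss if that were supplied instead. It reduces the LR once the metric has gone more than `patience` consecutive epochs without improving, can do so more than once in a longer run, and leaves the LR unchanged while the loss keeps improving.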
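
The patience behaviour can be observed by feeding the scheduler a flat loss. This sketch assumes a single dummy parameter in place of a real model and a hand-written list of loss values; both are placeholders for illustration.

```python
import torch

# Dummy parameter standing in for a real model's parameters.
param = torch.nn.Parameter(torch.zeros(1))
optimizer = torch.optim.Adam([param], lr=1e-3)
scheduler = torch.optim.lr_scheduler.ReduceLROnPlateau(optimizer, mode='min', patience=2)

lrs = []
for loss in [1.0, 1.0, 1.0, 1.0]:  # the loss never improves after the first epoch
    scheduler.step(loss)
    lrs.append(optimizer.param_groups[0]['lr'])

# The first epoch sets the best value; the next two non-improving epochs are
# tolerated (patience=2); the third one triggers the default 0.1x reduction.
print(lrs)
```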

[MCQ]

```python
import torch
import torch.nn as nn

rnn = nn.GRU(256, 512, bidirectional=True)
x = torch.randn(15, 32, 256)
out, hidden = rnn(x)
```

### 6. What is the shape of `hidden`?

- [x] A. (2, 32, 512)
- [ ] B. (4, 32, 512)
- [ ] C. (6, 32, 512)
- [ ] D. (8, 32, 512)

Ans: For a single-layer, bidirectional GRU, the hidden state has shape (2, batch, hidden_size).
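
For comparison, the sketch below prints both return values and repeats the call with two layers. The second GRU (`rnn2`) is an addition for illustration and is not part of the question.

```python
import torch
import torch.nn as nn

rnn = nn.GRU(256, 512, bidirectional=True)
x = torch.randn(15, 32, 256)  # (seq_len, batch, input_size)
out, hidden = rnn(x)

print(out.shape)     # torch.Size([15, 32, 1024]): both directions concatenated
print(hidden.shape)  # torch.Size([2, 32, 512]): num_layers * num_directions = 1 * 2

# With two layers, the leading dimension of the hidden state becomes 2 * 2 = 4.
rnn2 = nn.GRU(256, 512, num_layers=2, bidirectional=True)
_, hidden2 = rnn2(x)
print(hidden2.shape)  # torch.Size([4, 32, 512])
```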

[MCQ]

```python
def numericalize(tokens, vocab):
    return [vocab.get(t, 0) for t in tokens]
tokens = ['the', 'cat', 'sat']
vocab = {'the':3, 'cat':6, 'dog':8}
print(numericalize(tokens, vocab))
```

### 7. What will be printed?

- [ ] A. `[3, 6, 8]`
- [x] B. `[3, 6, 0]`
- [ ] C. `[0, 0, 0]`
- [ ] D. An error will be raised.

Ans: 'sat' is not in vocab, so 0 is substituted. Others are mapped correctly to their values.

[MSQ]

```python
import torch.nn as nn

class Decoder(nn.Module):
    def __init__(self, hidden_dim, output_dim):
        super().__init__()
        self.rnn = nn.GRU(hidden_dim, hidden_dim)
        self.fc = nn.Linear(hidden_dim, output_dim)
    def forward(self, x, hidden):
        out, hidden_new = self.rnn(x, hidden)
        pred = self.fc(out)
        return pred, hidden_new
```

### 8. Which statements are always true?

- [x] A. Output shape of `pred` depends on input sequence length.
- [x] B. Decoder's hidden state must be updated at every step.
- [x] C. Output and hidden have same batch dimension.
- [x] D. The forward pass can run for multiple time-steps.

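Ans: All four hold. `pred` has shape `(seq_len, batch, output_dim)`, so it follows the input length. The GRU returns a new hidden state on every call, which has to be passed back in for the next step. `out` and `hidden_new` share the batch dimension. Finally, `x` may contain one time step or many, so the forward pass can cover multiple steps.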

[MCQ]

```python
output = model(src, trg, teacher_forcing_ratio)
output_dim = output.shape[-1]
preds = output[1:].view(-1, output_dim)
print(preds.shape)
```

### 9. What is the shape of `preds` if `output` has shape `(max_len, batch, vocab_size)` with `max_len=20`, `batch=32`, and `vocab_size=10000`?

- [ ] A. (19, 32, 10000)
- [x] B. (608, 10000)
- [ ] C. (32, 10000)
- [ ] D. (20, 32, 10000)

Ans: Removing the first time-step gives shape (19, 32, 10000), then flattening over (19, 32) gives (608, 10000).

[MCQ]

```python
import torch

a = torch.softmax(torch.randn(32, 10), dim=1)
b = torch.randn(32, 10, 64)
weighted = torch.bmm(a.unsqueeze(1), b)
print(weighted.shape)
```

### 10. What is the resulting shape of `weighted`?

- [ ] A. (32, 10, 64)
- [x] B. (32, 1, 64)
- [ ] C. (1, 10, 64)
- [ ] D. (32, 64)

Ans: `torch.bmm` with (32,1,10) times (32,10,64) => (32,1,64) (batch matrix multiplication).
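
The batched product is equivalent to an explicit weighted sum over the 10 positions. The sketch below checks this; the fixed seed and the name `manual` are additions for illustration.

```python
import torch

torch.manual_seed(0)
a = torch.softmax(torch.randn(32, 10), dim=1)  # attention weights per batch item
b = torch.randn(32, 10, 64)                    # one 64-d vector per position

# (32, 1, 10) @ (32, 10, 64) -> (32, 1, 64)
weighted = torch.bmm(a.unsqueeze(1), b)
# Explicit weighted sum over the position dimension -> (32, 64)
manual = (a.unsqueeze(-1) * b).sum(dim=1)

matches = torch.allclose(weighted.squeeze(1), manual, atol=1e-5)
print(weighted.shape, matches)
```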

[MCQ]

### 11. A PyTorch DataLoader uses a batch size of 64 and a dataset of 1000 examples. With `drop_last=True`, how many batches are created?

- [x] A. 15
- [ ] B. 16
- [ ] C. 14
- [ ] D. 10

Ans: 1000 // 64 = 15 full batches; remainder is dropped.
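
The batch counts with and without `drop_last` follow from integer arithmetic, as this standard-library sketch shows (variable names are chosen for illustration).

```python
import math

num_examples, batch_size = 1000, 64

batches_drop_last = num_examples // batch_size            # only full batches
batches_keep_last = math.ceil(num_examples / batch_size)  # includes the partial batch
leftover = num_examples % batch_size                       # examples in the partial batch

print(batches_drop_last, batches_keep_last, leftover)  # 15 16 40
```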

[MCQ]

```python
import torch
import torch.nn as nn
import torch.nn.functional as F
class SelfAttentionLayer(nn.Module):
    def __init__(self, feature_size):
        super().__init__()
        self.key = nn.Linear(feature_size, feature_size)
        self.query = nn.Linear(feature_size, feature_size)
        self.value = nn.Linear(feature_size, feature_size)
    def forward(self, x, mask=None):
        keys = self.key(x)
        queries = self.query(x)
        values = self.value(x)
        scores = torch.matmul(queries, keys.transpose(-2, -1)) / torch.sqrt(torch.tensor(self.key.out_features, dtype=torch.float32))
        if mask is not None:
            scores = scores.masked_fill(mask == 0, -1e9)
        attention_weights = F.softmax(scores, dim=-1)
        output = torch.matmul(attention_weights, values)
        return output, attention_weights
sa = SelfAttentionLayer(64)
x = torch.randn(8, 20, 64)  
out, attention_weights = sa(x)
print(out.shape, attention_weights.shape)
```

### 12. What are the shapes of `out` and `attention_weights`?

- [x] A. (8, 20, 64), (8, 20, 20)
- [ ] B. (8, 64), (8, 20, 64)
- [ ] C. (8, 20, 64), (8, 64, 20)
- [ ] D. (8, 20, 20), (8, 20, 64)

Ans: `out` keeps the input shape `(batch, seq, feature)` = `(8, 20, 64)`. `attention_weights` has shape `(batch, seq, seq)` = `(8, 20, 20)`: one row of weights per query position, covering all 20 key positions.