> Note: the "Other answer" entries come from the [fast.ai forums questionnaire wiki](https://forums.fast.ai/t/fastbook-chapter-12-questionnaire-wiki/70516)

### Q1: If the dataset for your project is so big and complicated that working with it takes a significant amount of time, what should you do?

I'd take a subset of the big dataset and use it for prototyping and testing at first.

### Q2: Why do we concatenate the documents in our dataset before creating a language model?

Because it helps the language model learn what the bos and eos tokens mean. The "hidden state", "cell state" or "attention layer" should learn to forget the text that came before a bos token.

> Other answer: To create a continuous stream of input/target words, to be able to split it up in batches of significant size

### Q3: To use a standard fully connected network to predict the fourth word given the previous three words, what tweaks do we need to make to our model?

Combining the hidden state with the input embedding amounts to stacking both into one big tensor, so we can stack the 3 input embeddings in the same way. The stacked tensor is then fed through the fully connected layers to predict the fourth word.
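The stacking idea above can be sketched as follows. This is only a minimal illustration, not the model from the book: the class name `ConcatThreeWords` and the sizes used are made up for the example.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class ConcatThreeWords(nn.Module):
    """Hypothetical model: predicts word 4 from the stacked embeddings of words 1-3."""
    def __init__(self, vocab_sz, n_emb, n_hidden):
        super().__init__()
        self.emb = nn.Embedding(vocab_sz, n_emb)
        self.hidden = nn.Linear(3 * n_emb, n_hidden)  # input is 3 embeddings side by side
        self.out = nn.Linear(n_hidden, vocab_sz)

    def forward(self, x):
        # x: (batch, 3) token ids -> (batch, 3, n_emb) -> (batch, 3 * n_emb)
        e = self.emb(x).flatten(start_dim=1)
        return self.out(F.relu(self.hidden(e)))

model = ConcatThreeWords(vocab_sz=30, n_emb=8, n_hidden=16)
logits = model(torch.randint(0, 30, (4, 3)))  # one score per vocab word
```

The output has one row per example and one column per word in the vocabulary, so it can go straight into a cross-entropy loss.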

### Q4: How can we share a weight matrix across multiple layers in PyTorch?

> Other answer: Define one layer in the PyTorch model class, and use it multiple times in the forward method.
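A small sketch of that pattern, assuming a made-up module called `SharedLayerNet`: the layer is created once in `__init__` and called three times in `forward`, so all three steps use the same weights.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class SharedLayerNet(nn.Module):
    """Hypothetical example: one Linear layer reused for three steps."""
    def __init__(self, n_hidden):
        super().__init__()
        self.h_h = nn.Linear(n_hidden, n_hidden)  # defined once

    def forward(self, x):
        for _ in range(3):
            x = F.relu(self.h_h(x))  # same weight matrix at every step
        return x

net = SharedLayerNet(n_hidden=16)
out = net(torch.randn(4, 16))
# Only one layer's worth of parameters exists: 16*16 weights + 16 biases
n_params = sum(p.numel() for p in net.parameters())
```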

### Q5: Write a module that predicts the third word given the previous two words of a sentence, without peeking:

In [1]:
from fastai.text.all import *

In [2]:
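class LMModel8(Module):
    def __init__(self, num_embeddings, num_hidden, num_rnn_layers):
        # Assumption: the embedding size is set equal to num_hidden, since no separate size is passed in
        self.embed_layer = nn.Embedding(num_embeddings, num_hidden)
        self.embed2hidden = nn.RNN(input_size=num_hidden, hidden_size=num_hidden, num_layers=num_rnn_layers,
                                   nonlinearity='tanh', batch_first=True)
        self.output_layer = nn.Linear(num_hidden, num_embeddings)

    def forward(self, word1, word2):
        # word1, word2: token ids of shape (batch,)
        word1_embedding = self.embed_layer(word1)
        word2_embedding = self.embed_layer(word2)
        # Only the two previous words go in, so the model never sees the target
        seq = torch.stack([word1_embedding, word2_embedding], dim=1)
        out, _ = self.embed2hidden(seq)
        # Predict the third word from the RNN output after the second word
        return self.output_layer(out[:, -1])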

In [4]:
random_embed_layer = nn.Embedding(num_embeddings=500, embedding_dim=256)

### Q6: What is a recurrent neural network?

It is a neural network in which the same layer is applied over and over in a loop: at each step, the input to the layer is its own previous output combined with the embedding of the current input token.

### Q7: What is hidden state?

It's the tensor of activations that is updated at every step and carried over to the next one, which allows the neural net to remember previous iterations. It is useful for long sequences in which the output at one step is related to earlier steps.

### Q8: What is the equivalent of hidden state in LMModel1?

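The variable `h`: the activations that are updated after each word and passed on to the next layer. (In LMModel2 the same `h` starts at 0 before the loop.)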

### Q9: To maintain the state in an RNN, why is it important to pass the text to the model in order?

Since the hidden state carries information from earlier text into later predictions, feeding the text out of order would fill the hidden state with context unrelated to the current input. That noise would make the RNN's predictions worse.

### Q10: What is an *"unrolled"* representation of an RNN?

A diagram of the RNN in which every layer is drawn explicitly, one per time step, instead of folding the repeated pattern into a loop.

### Q11: Why can maintaining the hidden state in an RNN lead to memory and performance problems? How do we fix this problem?

The issue with maintaining the hidden state over long sequences is the number of derivatives needed to compute the gradients during backpropagation. At token number 10,000 we'd have to backpropagate through all 10,000 previous steps, and PyTorch would have to keep every intermediate activation in memory to do so. That is too slow and too memory-hungry to be practical during training.

In order to fix it, we use the *".detach()"* method from PyTorch on the hidden state. It keeps the values but drops the gradient history, so backprop only covers the current sequence.
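Here is a minimal sketch of a model that keeps its hidden state between batches and detaches it each time. The class `StatefulRNN` and its sizes are invented for this example; it is not one of the LMModel classes from the chapter.

```python
import torch
import torch.nn as nn

class StatefulRNN(nn.Module):
    """Hypothetical model that carries its hidden state from batch to batch."""
    def __init__(self, vocab_sz, n_hidden):
        super().__init__()
        self.emb = nn.Embedding(vocab_sz, n_hidden)
        self.rnn = nn.RNN(n_hidden, n_hidden, batch_first=True)
        self.out = nn.Linear(n_hidden, vocab_sz)
        self.h = None  # hidden state kept between batches (None means zeros)

    def forward(self, x):
        res, h = self.rnn(self.emb(x), self.h)
        # Keep the values but cut the graph, so backprop stops at this batch
        self.h = h.detach()
        return self.out(res)

    def reset(self):
        self.h = None  # forget everything, e.g. at the start of an epoch

model = StatefulRNN(vocab_sz=30, n_hidden=16)
for _ in range(2):
    x = torch.randint(0, 30, (4, 5))
    loss = model(x).sum()
    loss.backward()  # only goes back through the current batch
```

Without the `detach()` call, the second `backward()` would also try to go back through the first batch's graph, and that graph keeps growing with every batch.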

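### Q12: What is BPTT?

Backpropagation Through Time means treating the unrolled RNN (one layer per time step) as one big network and computing its gradients in the usual way. In practice we use *truncated* BPTT: backprop runs **only** through the most recent sequence instead of the whole stream, because the hidden state is detached every few time steps.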

### Q13: Write code to print out the first few batches of the validation set, including converting the token IDs back into English strings, as we showed for batches of IMDb data in Chapter 10

### Q14: What does the ModelResetter callback do? Why do we need it?

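It calls the model's reset method at the beginning of each epoch and before each validation phase. We need it because the text starts over at those points, so the hidden state left over from the previous pass should be cleared. The same applies, for example, if we train using 2 uncorrelated datasets and don't want the hidden state to remember anything about the first dataset once we switch to the second.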

### Q15: What are the downsides of predicting just one output word for each three input words?

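- Wasted training signal: the words in between are never used as targets, so each step teaches the model less than it could.
- Wasted context: three words are often not enough to predict the next one.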

### Q16: Why do we need a custom loss function for LMModel4?

Because it predicts a word after every input word, so its output has shape batch size x seq_length x vocab size, and the targets have shape batch size x seq_length. The standard cross-entropy loss expects one row of predictions per target, so we need to flatten both before computing the loss.
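A sketch of such a loss, under the assumption that the predictions have the vocabulary as their last dimension. The shapes and the random data below are made up for illustration.

```python
import torch
import torch.nn.functional as F

def loss_func(inp, targ):
    # inp: (bs, sl, vocab) -> (bs*sl, vocab); targ: (bs, sl) -> (bs*sl,)
    return F.cross_entropy(inp.reshape(-1, inp.shape[-1]), targ.reshape(-1))

bs, sl, vocab_sz = 4, 16, 30
preds = torch.randn(bs, sl, vocab_sz)             # fake model output
targs = torch.randint(0, vocab_sz, (bs, sl))      # fake target token ids
loss = loss_func(preds, targs)                    # a single scalar
```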

### Q17: Why is the training of LMModel4 unstable?

Because the unrolled network is very deep (one layer per token in the sequence), and very deep networks suffer from exploding or vanishing gradients.

### Q18: In the unrolled representation, we can see that a recurrent neural network has many layers. So why do we need to stack RNNs to get better results?

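In the unrolled representation all of those layers share a single weight matrix, so the extra depth comes from repetition and not from extra parameters. Stacking RNNs gives each level its own weights, which effectively creates a deeper network that can learn richer features.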

### Q19: Drawing.

### Q20: Why should we get better results in a RNN if we call detach less often? Why might this not happen in practice with a simple RNN?

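Because we would be backpropagating over a longer horizon of events, which lets the RNN learn longer-range dependencies in its hidden state. With a simple RNN this might not happen in practice, because gradients sent back through that many steps tend to vanish or explode, so the longer horizon has little impact or makes training unstable.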

### Q21: Why can a deep network result in very large or very small activations? Why does this matter?

Because a deep network is effectively a sequence of matrix multiplications. Repeating them makes big numbers tend to infinity, while those close to 0 get closer and closer to 0. This matters because neurons with such large or small activations are effectively "stupid": they no longer contribute anything useful to the computation. Also, depending on the nonlinearity, many neurons end up outputting the same value. For example, tanh saturates at 1 when its input is very large.
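A toy illustration of the effect, using a single number as a stand-in for a weight matrix. The helper `repeat_scale` is made up for this example.

```python
import math

def repeat_scale(x, factor, n_layers):
    """Multiply x by the same factor n_layers times (a 1-D stand-in for a weight matrix)."""
    for _ in range(n_layers):
        x *= factor
    return x

big = repeat_scale(1.0, 1.5, 100)    # grows to a huge number
small = repeat_scale(1.0, 0.5, 100)  # shrinks towards 0
saturated = math.tanh(20.0)          # tanh of a large input is pinned at ~1
```

Even factors this close to 1 blow up or die out after 100 layers, which is why long unrolled RNNs are so sensitive.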

### Q22: In a computer's floating-point representation of numbers, which are the most precise?

Those closer to 0
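This can be checked with `math.ulp` (Python 3.9+), which returns the gap between a float and the next representable one. The three sample values are arbitrary.

```python
import math

# The smaller the gap, the more precisely numbers around x can be stored
gap_near_zero = math.ulp(1e-3)
gap_near_one = math.ulp(1.0)
gap_far = math.ulp(1e6)
```

The gap grows with the magnitude of the number, so very large activations are stored with much coarser precision.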

### Q23: Why do vanishing gradients prevent training?

Because weights with a "vanished" gradient receive updates that are essentially 0, so they stop learning. Those neurons are no longer "alive", and the net is effectively smaller and dumber because of that loss.

### Q24: Why does it help to have two hidden states in the LSTM architecture? What is the purpose of each one?

It splits two jobs that a single state struggles to do at once. The *hidden state* focuses on the information needed to predict the next token, while the *cell state* keeps the long short-term memory of what came earlier in the text.

### Q25: What are these two states called in an LSTM?

Cell state and hidden state

### Q26: What is tanh and how is it related to sigmoid?

Basically a sigmoid rescaled to the range -1 to 1, so it is centered on 0: tanh(x) = 2 * sigmoid(2x) - 1.
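A quick numerical check of that relationship, using plain Python. The helper names and sample points are made up for the example.

```python
import math

def sigmoid(x):
    return 1 / (1 + math.exp(-x))

def tanh_from_sigmoid(x):
    # Stretch sigmoid from (0, 1) to (-1, 1) and double the input
    return 2 * sigmoid(2 * x) - 1

xs = [-3.0, -0.5, 0.0, 0.5, 3.0]
# Largest disagreement with the built-in tanh over the sample points
max_diff = max(abs(tanh_from_sigmoid(x) - math.tanh(x)) for x in xs)
```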

### Q27: What is the purpose of this code in LSTMCell:
```python
h = torch.cat([h, input], dim=1)
```

It just concatenates the hidden state and the input along the feature dimension, so that each gate's linear layer can act on both at once.
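A shape check makes this concrete. The sizes below are arbitrary and only for illustration.

```python
import torch

bs, n_hidden, n_in = 4, 16, 8
h = torch.zeros(bs, n_hidden)   # hidden state: one row per example
inp = torch.randn(bs, n_in)     # current input: one row per example
# dim=1 joins the features side by side and leaves the batch dimension alone
h = torch.cat([h, inp], dim=1)
```

After the call, each row holds the 16 hidden-state values followed by the 8 input values.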

### Q28: What does chunk do in PyTorch?

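Splits a tensor into a given number of equally sized pieces along one dimension. If the size is not divisible, the last piece is smaller.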
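A short example on a 1-D tensor; the numbers are arbitrary.

```python
import torch

t = torch.arange(10)
even = t.chunk(2)     # two chunks of 5 elements each
uneven = t.chunk(3)   # 10 is not divisible by 3, so the last chunk is shorter
sizes = [len(c) for c in uneven]
```

`chunk` returns a tuple of tensors, which is why its result can be unpacked directly into several variables.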

### Q29: Study the refactored version of LSTMCell carefully to ensure you understand how and why it does the same thing as the nonrefactored version.

### Q30: Why can we use a higher learning rate for LMModel6?

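Because the LSTM partially solves the exploding/vanishing gradients problem. Training is more stable, so we can get away with a higher learning rate without diverging. The extra depth itself is kinda unrelated.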

### Q31: What are the three regularization techniques used in AWD-LSTM model?

Dropout, activation regularization and temporal activation regularization

### Q32: What is dropout?

Randomly setting the output of some neurons to 0 during training. Each activation is dropped with some probability *p*.

### Q33: Why do we scale the activations with dropout? Is this applied during training, inference or both?

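Because zeroing a fraction *p* of the activations makes their sum (and average) lower than without dropout. To compensate, the surviving activations are divided by 1 - *p*. This is only applied during training; at inference nothing is dropped, so nothing is rescaled.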
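The effect on the average can be sketched by hand with a random mask. This is an illustration only, not the Dropout class from the chapter; the value of `p` and the tensor size are arbitrary.

```python
import torch

torch.manual_seed(0)
p = 0.5
x = torch.ones(100_000)
# Each entry is kept (1) with probability 1 - p and dropped (0) with probability p
mask = torch.bernoulli(torch.full_like(x, 1 - p))
dropped = x * mask              # average falls to roughly 1 - p
rescaled = dropped / (1 - p)    # average is back to roughly 1
```

With the rescaling, the next layer sees activations on the same scale in training and at inference.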

### Q34: What is the purpose of the line from Dropout:
```python
if not self.training: return x
```

To avoid dropout during inference: when the module is not in training mode, the input is returned unchanged.

### Q35: Experiment with bernoulli_ to understand how it works.

### Q36: How do you set your model in training mode in PyTorch? In evaluation mode?

### Q37: Write the equation for activation regularization (in math or code, as you prefer). How is it different from weight decay?

### Q38: Write the equation for temporal activation regularization (in math or code, as you prefer). Why wouldn't we use this for computer vision problems?

### Q39: What is weight tying in a LM?