# Decoder

<img src="images/transformer-architecture-6.png" width="500">

## Transformer Neural Network Architecture Overview (Decoder)

In the decoder architecture, you need to pass in the sentence that you're translating to. For example, to translate an English sentence to French, the French sentence is passed into the decoder. The first token passed into the decoder is the `<START>` token. Then the word tokens for *My name is John* in French (*Mon nom est John*) are passed in. Finally, an `<END>` token signifies that the sentence has ended, followed by `<PADDING>` tokens. Once again, we pad the tokens because we want a fixed-size input for any sentence that we want to translate. The dimension of the input is then batch size $\times$ the maximum length of the input sequence $\times$ 512. The batch size is the number of samples we pass into the decoder at any one time, the maximum length of the input sequence is the maximum number of words allowed in a sentence, and 512 is the dimension that each word is encoded into.
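As a rough sketch of the padding step described above, the snippet below wraps a tokenised sentence in the special tokens and pads it to a fixed length. The helper name `pad_target` and the toy maximum length of 10 are assumptions made for illustration; they are not part of the `Decoder` class used later.

```python
# Special tokens as named in the text; the helper itself is hypothetical.
START, END, PADDING = "<START>", "<END>", "<PADDING>"

def pad_target(words, max_sequence_length):
    """Wrap a tokenised sentence in <START>/<END> and pad it to a fixed length."""
    tokens = [START] + list(words) + [END]
    if len(tokens) > max_sequence_length:
        raise ValueError("sentence is longer than max_sequence_length")
    return tokens + [PADDING] * (max_sequence_length - len(tokens))

padded = pad_target(["Mon", "nom", "est", "John"], max_sequence_length=10)
print(padded)
```

Each of these tokens would then be mapped to a 512-dimensional vector, which is where the final dimension of the input comes from.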

Because all words are passed into the Transformer in parallel, there is no inherent sense of ordering. However, the words in a sentence appear in a specific order. So, we pass in some [positional encoding](3-Positional_Encoding_in_Transformer_Neural_Network.ipynb) to encode that order, using the $\sin$ and $\cos$ functions to generate the encodings. We then add the encoding to the input to get the positionally encoded vectors, which form matrix $X^{1}$ of size batch size $\times$ the maximum length of the input sequence $\times$ 512.
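A minimal sketch of such an encoding is shown below. It assumes the standard sinusoidal formulation from the original Transformer paper (base 10000, $\sin$ on even feature indices and $\cos$ on odd ones); the function name and the toy sizes are illustrative and may differ from the linked notebook.

```python
import torch

def positional_encoding(max_sequence_length, d_model):
    """Sinusoidal positional encoding of shape (max_sequence_length, d_model)."""
    # Even feature indices 0, 2, 4, ... (d_model is assumed to be even).
    even_i = torch.arange(0, d_model, 2).float()
    denominator = 10000.0 ** (even_i / d_model)
    position = torch.arange(max_sequence_length).float().unsqueeze(1)
    even_pe = torch.sin(position / denominator)  # values for even indices
    odd_pe = torch.cos(position / denominator)   # values for odd indices
    # Interleave so that index 2i holds the sine and index 2i + 1 the cosine.
    stacked = torch.stack([even_pe, odd_pe], dim=2)
    return torch.flatten(stacked, start_dim=1)

pe = positional_encoding(max_sequence_length=6, d_model=8)
embedded = torch.randn(2, 6, 8)   # toy batch of embedded tokens
x1 = embedded + pe                # broadcasts over the batch dimension
print(x1.size())
```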

Each 512-dimensional vector in the $X^{1}$ matrix is then represented by query, key and value vectors, each of which is also a 512-dimensional vector. The resulting matrix has a size of batch size $\times$ the maximum length of the input sequence $\times$ 512 $\times$ 3, or batch size $\times$ the maximum length of the input sequence $\times$ 1536.

Intuitively, the query vector is what I am looking for, the key vector is what I can offer, and the value vector is what is actually offered when computing the attention mechanism.

**Note:** In many implementations, the $X^{1}$ matrix can be repeated three times to create the query, key and value vectors. However, in other implementations, there can be a feed forward layer that adds learnable parameters to the network, giving it extra capacity to learn if required. This can potentially improve performance down the road.
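The second option in the note can be sketched with a single linear layer that maps each 512-dimensional vector to a 1536-dimensional one. Using `nn.Linear` for this and the name `qkv_layer` are assumptions here; the sizes follow the parameters used later in this notebook.

```python
import torch
import torch.nn as nn

batch_size, max_sequence_length, d_model = 30, 200, 512

x1 = torch.randn(batch_size, max_sequence_length, d_model)  # stand-in for X^1
qkv_layer = nn.Linear(d_model, 3 * d_model)  # learnable projection to q, k and v
qkv = qkv_layer(x1)
print(qkv.size())  # torch.Size([30, 200, 1536])
```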

At this point, we are going to perform multi-head self-attention (8 heads, to be specific). For each of these 8 heads, we divide the query, key and value vectors into 8 equal parts along the feature dimension. This results in a matrix of size batch size $\times$ the maximum length of the input sequence $\times$ 8 attention heads $\times$ 192 (i.e. $\frac{1536}{8} = 192$), which is then split into query, key and value vectors of 64 dimensions each (i.e. $\frac{512}{8} = 64$) for each attention head.
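The reshaping can be sketched as follows. This is one plausible way to do the split (reshape, move the head axis forward, then chunk the last axis into three), chosen because it reproduces the shapes printed in the test output further down; the actual `Decoder` code may differ in detail.

```python
import torch

batch_size, max_sequence_length, d_model, num_heads = 30, 200, 512, 8
head_dim = d_model // num_heads  # 64

qkv = torch.randn(batch_size, max_sequence_length, 3 * d_model)  # (30, 200, 1536)
# Give every head its own slice of 192 features.
qkv = qkv.reshape(batch_size, max_sequence_length, num_heads, 3 * head_dim)
# Move the head axis before the sequence axis: (30, 8, 200, 192).
qkv = qkv.permute(0, 2, 1, 3)
# Split the 192 features into query, key and value: each (30, 8, 200, 64).
q, k, v = qkv.chunk(3, dim=-1)
print(q.size(), k.size(), v.size())
```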

We then multiply the query vector by the transposed key vector to get a matrix of size maximum length of input sequence $\times$ maximum length of input sequence. This matrix is also called the attention matrix, although it is rudimentary at this point. Intuitively, this matrix multiplication compares every word's query vector with the key vectors of all the words in that sentence. If the product for two words is high, they are similar to each other; if it is small, they are dissimilar.

In self-attention, we want to pay more attention to the words that are similar to each other. So, we typically end up with a self-attention matrix where the diagonal has very high affinity, because each word is attending to itself. There will be other attention values for other pairs of words as well. This is the kind of affinity we're trying to capture in the matrix multiplication between the query and key vectors. Note that this attention matrix is unnormalised. We scale its values because the variance of the unscaled matrix can be large, which can lead to erratic behaviour during training.
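The scoring and scaling steps can be sketched as below. The text does not state the scaling factor, so dividing by $\sqrt{d_k}$ (with $d_k = 64$, the per-head dimension) is an assumption taken from the standard scaled dot-product attention formulation.

```python
import math
import torch

q = torch.randn(30, 8, 200, 64)  # stand-in query vectors for 8 heads
k = torch.randn(30, 8, 200, 64)  # stand-in key vectors for 8 heads

d_k = q.size(-1)
# Compare every query with every key, then scale to keep the variance in check.
scaled = torch.matmul(q, k.transpose(-2, -1)) / math.sqrt(d_k)
print(scaled.size())  # torch.Size([30, 8, 200, 200])
```

With unit-variance inputs, the raw dot products have a variance of roughly $d_k$, so dividing by $\sqrt{d_k}$ brings it back to roughly 1.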

Next, we apply some masking. Since this is a decoder architecture, masking is necessary. The mask is created such that the lower triangle (including the diagonal) is filled with zeros and the upper triangle is filled with negative infinity (see image below). In other words, the lower half is not masked and the upper half is masked. We do this because we don't want the decoder to cheat during training by looking at words it will not have at inference time. During inference, we are not going to pass every translated word into the decoder. We will start with a `<START>` token and everything else will be padded with the `<PADDING>` token, because at inference we do not have any information at the beginning. So, if the attention mechanism were allowed to draw context from words that aren't yet available to the network, that would be considered cheating.

After adding the mask to our rudimentary attention matrix, we get the final attention matrix. Every word can only attend to itself and the words that come before it (the lower triangular part), while the words that come after are all negative infinity (the upper triangular part). We use negative infinity because we will be performing the $\text{softmax}$ operation and the $\text{softmax}$ of negative infinity is zero. For example, after the $\text{softmax}$, the first row will have zeros for every position except the first. This essentially means that the first word (probably the `<START>` token) can only pay attention to itself and no words after it.
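A small sketch of the masking step, using a toy sequence length of 4 so the result is easy to inspect. The mask is built the same way as in the test cell later in this notebook; the random scores are only a stand-in for the scaled attention matrix.

```python
import torch
import torch.nn.functional as F

seq_len = 4
scaled = torch.randn(seq_len, seq_len)  # stand-in for the scaled attention scores

# Zeros on and below the diagonal, negative infinity above it.
mask = torch.triu(torch.full((seq_len, seq_len), float("-inf")), diagonal=1)

# Masked positions become exactly zero after the softmax.
attention = F.softmax(scaled + mask, dim=-1)
print(attention)
```

Every row sums to 1, and the first row puts all of its weight on the first position.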

Then we perform a matrix multiplication between our attention matrix and the value vector. This gives us a batch size $\times$ the maximum length of input sequence $\times$ 64 matrix. What this means is that we get a 64-dimensional vector for every single word. These 64-dimensional vectors are now context aware because of attention, so they are much higher quality tensors, so to speak.

Remember, we have 8 attention heads. So, we concatenate these context-aware vectors with the vectors from the other 7 heads into one output matrix. We then add matrix $X^{1}$ to it through a skip/residual connection. Residual connections are required to propagate the signal through very deep networks, so that back propagation does not fail due to vanishing gradients. This avoids the scenario where the model stops learning because the gradients during back propagation are so small (near zero) that no meaningful update happens.
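The value multiplication, head concatenation and residual addition can be sketched together as below. Toy batch and sequence sizes are used, and any output projection applied after the concatenation is left out to keep the sketch short; both are simplifications rather than a description of the `Decoder` class.

```python
import torch

batch_size, num_heads, seq_len, head_dim = 2, 8, 5, 64
d_model = num_heads * head_dim  # 512

attention = torch.softmax(torch.randn(batch_size, num_heads, seq_len, seq_len), dim=-1)
v = torch.randn(batch_size, num_heads, seq_len, head_dim)
x1 = torch.randn(batch_size, seq_len, d_model)  # stand-in for X^1

# One context-aware 64-dimensional vector per word and per head.
values = torch.matmul(attention, v)                  # (2, 8, 5, 64)
# Put the head axis next to the feature axis and merge them: 8 * 64 = 512.
values = values.permute(0, 2, 1, 3).reshape(batch_size, seq_len, d_model)
# Skip/residual connection with the input of the attention block.
out = values + x1
print(out.size())
```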

We then apply layer normalisation which, like the name suggests, normalises the values along the layer (feature) dimension, independently for each word in each batch. So for the 512-dimensional vector of every word, we compute a mean and standard deviation, normalise the values, and then scale and shift them with learnable parameters $\gamma$ and $\beta$ such that the values become more comparable to each other and training becomes more stable.
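A minimal sketch of this computation is given below, assuming the usual formulation with a small `eps` added to the variance for numerical stability. The function name `layer_norm` is illustrative; the result is compared against PyTorch's built-in `F.layer_norm` as a sanity check.

```python
import torch
import torch.nn.functional as F

def layer_norm(x, gamma, beta, eps=1e-5):
    """Normalise over the last (feature) dimension, then scale and shift."""
    mean = x.mean(dim=-1, keepdim=True)                  # (..., 1)
    var = ((x - mean) ** 2).mean(dim=-1, keepdim=True)   # (..., 1)
    std = (var + eps).sqrt()
    return gamma * (x - mean) / std + beta

d_model = 512
x = torch.randn(30, 200, d_model)
gamma = torch.ones(d_model)   # learnable scale, initialised to 1
beta = torch.zeros(d_model)   # learnable shift, initialised to 0

out = layer_norm(x, gamma, beta)
reference = F.layer_norm(x, (d_model,), gamma, beta, eps=1e-5)
print(torch.allclose(out, reference, atol=1e-5))
```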

Next, we perform multi-head cross attention. The attention mechanism takes two vectors from the encoder and one vector from the decoder. The vector coming from the decoder is the query vector, while the two vectors coming from the encoder are the key and value vectors. The query vector from the decoder can be fed directly into the second attention mechanism (see figure above), or it can first be passed through another linear layer that outputs a matrix of the same size, which is then used as the query vector.

The query vector from the decoder is then combined with the key and value vectors from the encoder to give a matrix of size batch size $\times$ the maximum length of the input sequence $\times$ 512 $\times$ 3, or batch size $\times$ the maximum length of the input sequence $\times$ 1536.

We then follow exactly the same process as we did with self-attention for 8 attention heads. In the attention mechanism, the query vector is multiplied with the transposed key vector. This time, for every word of the translated sentence in the query vector (which comes from the decoder), we compute its affinity for every word of the English sentence in the key vector (which comes from the encoder). This matrix multiplication results in a rudimentary attention matrix similar to the one described above. We apply $\text{softmax}$ to get the final attention matrix where the values will be between 0 and 1, akin to probabilities.

**Note:** Notice that right here, we are not doing any masking. This is because during inference time, each French word will always have access to every English word in the sentence from beginning to end.
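Cross attention can be sketched as follows. To make the roles of the two sequences visible, the sketch uses different toy lengths for the decoder and encoder sides; in this notebook both are padded to the same maximum length. The $\sqrt{d_k}$ scaling is again assumed from the standard formulation.

```python
import math
import torch

batch_size, num_heads, head_dim = 2, 8, 64
tgt_len, src_len = 5, 7  # toy lengths: decoder (French) and encoder (English)

q = torch.randn(batch_size, num_heads, tgt_len, head_dim)  # from the decoder
k = torch.randn(batch_size, num_heads, src_len, head_dim)  # from the encoder
v = torch.randn(batch_size, num_heads, src_len, head_dim)  # from the encoder

# No mask is added: every decoder position may look at every encoder position.
scores = torch.matmul(q, k.transpose(-2, -1)) / math.sqrt(head_dim)  # (2, 8, 5, 7)
attention = torch.softmax(scores, dim=-1)
values = torch.matmul(attention, v)                                  # (2, 8, 5, 64)
print(attention.size(), values.size())
```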

We then perform a matrix multiplication between our attention matrix and the value vector, which comes from the encoder. The resulting matrix is now more contextually aware. Every row in the matrix corresponds to a word in the translated sentence (the rows follow the query vectors from the decoder), and each word is represented by a 64-dimensional vector with some contextual awareness about English and also French.

We concatenate these matrices from all attention heads and add the query vector (from the decoder, just before the second attention mechanism) through a skip/residual connection. Then we perform layer normalisation.

We then apply a feed forward network to compute the final matrix. However, a single pass may not be enough to capture the complexities and intricacies of the English and French languages, so we can take this matrix and feed it back into the beginning of the decoder, repeating the process several times.

Once we have fed it back into the decoder multiple times, the final matrix goes through another feed forward neural network and gets expanded so that each position now has one value for every word in the French vocabulary. We then represent these values as probabilities by applying the $\text{softmax}$ function.
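This final expansion can be sketched with a linear layer followed by a softmax. The vocabulary size of 1000 and the name `to_vocab` are hypothetical; the text does not specify the size of the French vocabulary.

```python
import torch
import torch.nn as nn

d_model = 512
vocab_size = 1000  # hypothetical size of the French vocabulary

decoder_out = torch.randn(2, 5, d_model)   # toy decoder output: (batch, words, 512)
to_vocab = nn.Linear(d_model, vocab_size)  # expands each word to vocabulary size
probs = torch.softmax(to_vocab(decoder_out), dim=-1)
print(probs.size())  # torch.Size([2, 5, 1000])
```

Each row of `probs` is non-negative and sums to 1, so it can be read as a probability distribution over the vocabulary.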

During training, each row in the final output matrix is like a probability distribution. For example, suppose the first row is supposed to predict *Mon* and assigns it a probability of 0.66. From that, the network computes the loss. We can compute the losses for the other words in parallel. To do so, we use a cross entropy loss function, ignore all the `<PADDING>` tokens, combine the remaining losses and back propagate the combined loss through the entire network. This is one network update.
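The loss computation can be sketched with PyTorch's `F.cross_entropy`, which takes the values before the softmax (logits) and can skip a given class via `ignore_index`. The token ids and the tiny vocabulary below are made up for illustration, with id 0 standing in for `<PADDING>`.

```python
import torch
import torch.nn.functional as F

vocab_size, seq_len = 10, 8
PADDING_ID = 0  # hypothetical id of the <PADDING> token

logits = torch.randn(1, seq_len, vocab_size)  # stand-in for the pre-softmax output
# Hypothetical ids for: Mon, nom, est, John, <END>, then three <PADDING> tokens.
targets = torch.tensor([[3, 4, 5, 6, 2, 0, 0, 0]])

# One loss per position is computed in parallel; padded positions are ignored
# and the remaining losses are averaged into a single number.
loss = F.cross_entropy(
    logits.view(-1, vocab_size), targets.view(-1), ignore_index=PADDING_ID
)
print(loss.item())
```

Calling `loss.backward()` on a real model would then back propagate this single value through the whole network.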

In doing so, we are able to feed an entire sentence at a time. In fact, not just an entire sentence but we can do this for an entire batch of sentences. Despite the Transformer being an extremely complex architecture with millions of parameters, training can happen pretty fast because we're taking advantage of batches.

If you want to infer using the network, you would pass in only the `<START>` token at the beginning of the network. We then let it propagate and compute the self-attention, and get the key and value vectors from the encoder, which encapsulate the context of all words from the input sentence. We then feed these into the second attention mechanism and perform cross attention. At this point, we infer just the first translated word, because we don't yet have the information needed for the second word (in this case, that information is the first translated word itself). Now that we have inferred the first translated word, we can use it to infer the second translated word, and so on.

In this way, the Transformer decoder during the inference phase is going to generate one word at a time. 
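The word-by-word loop can be sketched as below. Picking the highest-scoring word at each step (greedy decoding) is an assumption, since the text does not say how the next word is chosen. `toy_model` is a made-up stand-in for the trained encoder-decoder so that the loop can run on its own.

```python
import torch

def greedy_decode(next_token_logits, start_id, end_id, max_sequence_length):
    """Generate token ids one at a time until <END> or the length limit."""
    tokens = [start_id]
    while len(tokens) < max_sequence_length:
        logits = next_token_logits(tokens)    # scores over the vocabulary
        next_id = int(torch.argmax(logits))   # greedy choice (assumption)
        tokens.append(next_id)
        if next_id == end_id:
            break
    return tokens

def toy_model(tokens, vocab_size=6):
    """Hypothetical stand-in: always favours the id after the last token."""
    logits = torch.zeros(vocab_size)
    logits[min(tokens[-1] + 1, vocab_size - 1)] = 1.0
    return logits

decoded = greedy_decode(toy_model, start_id=0, end_id=4, max_sequence_length=10)
print(decoded)  # [0, 1, 2, 3, 4]
```

In a real setup, `next_token_logits` would run the decoder on the tokens generated so far, together with the encoder output, and return the scores for the next position.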





![diagram](images/decoder.png)

## Import `Decoder` Class and Functions 

In [1]:
import torch
from decoder import Decoder

### Set parameters and variables

In [2]:
batch_size = 30
max_sequence_length = 200
d_model = 512
ffn_hidden = 2048
num_heads = 8
drop_prob = 0.1
num_layers = 5

### Instantiate a `Decoder` object with parameters set above

In [3]:
decoder = Decoder(d_model, ffn_hidden, num_heads, drop_prob, num_layers)

### Test

In [4]:
# English sentence positionally encoded
x = torch.randn((batch_size, max_sequence_length, d_model))
# French sentence positionally encoded
y = torch.randn((batch_size, max_sequence_length, d_model))

mask = torch.full([max_sequence_length, max_sequence_length], float("-inf"))
mask = torch.triu(mask, diagonal=1)

In [5]:
out = decoder(x, y, mask)

------- MASKED ATTENTION ------
x.size(): torch.Size([30, 200, 512])
qkv.size(): torch.Size([30, 200, 1536])
qkv.size(): torch.Size([30, 200, 8, 192])
qkv.size(): torch.Size([30, 8, 200, 192])
q.size(): torch.Size([30, 8, 200, 64]), k.size(): torch.Size([30, 8, 200, 64]), v.size(): torch.Size([30, 8, 200, 64])
scaled.size(): torch.Size([30, 8, 200, 200])
------- ADDING MASK of shape torch.Size([200, 200]) ------
values.size(): torch.Size([30, 8, 200, 64]), attention.size():torch.Size([30, 8, 200, 200])
values.size(): torch.Size([30, 200, 512])
out.size(): torch.Size([30, 200, 512])
------- ADD AND LAYER NORMALIZATION 1 ------
mean.size(): (torch.Size([30, 200, 1]))
var.size(): (torch.Size([30, 200, 1]))
std.size(): (torch.Size([30, 200, 1]))
y.size(): torch.Size([30, 200, 512])
out.size(): torch.Size([30, 200, 512])
------- DROPOUT 1 ------
------- CROSS ATTENTION ------
x.size(): torch.Size([30, 200, 512])
kv.size(): torch.Size([30, 200, 1024])
q.size(): torch.Size([30, 200, 512])
kv.

In [6]:
mask

tensor([[0., -inf, -inf,  ..., -inf, -inf, -inf],
        [0., 0., -inf,  ..., -inf, -inf, -inf],
        [0., 0., 0.,  ..., -inf, -inf, -inf],
        ...,
        [0., 0., 0.,  ..., 0., -inf, -inf],
        [0., 0., 0.,  ..., 0., 0., -inf],
        [0., 0., 0.,  ..., 0., 0., 0.]])

In [7]:
mask.size()

torch.Size([200, 200])