# The Illustrated Transformer

In this guide, we will attempt to oversimplify things a bit and introduce the concepts of Transformers one by one to hopefully make it easier to understand.

<a href="https://jalammar.github.io/illustrated-transformer/"><b>Original article</b></a>

## A high-level look

Let’s begin by looking at the model as a single black box. In a machine translation application, it would take a sentence in one language, and output its translation in another.

<img title="" src="images_it/The_transformer_encoders_decoders.png" alt="" width="500" data-align="center">

The encoding component is a stack of encoders (the paper stacks 6 of them on top of each other, but there is nothing magical about the number 6; we could experiment with other arrangements). The decoding component is a stack of the same number of decoders.

<img title="" src="images_it/The_transformer_encoder_decoder_stack.png" alt="" width="500" data-align="center">

Each encoder/decoder is identical in structure (yet they do not share weights). They can be broken down into the following sub-layers:

<img title="" src="images_it/Transformer_encoder_decoder_simple_architecture.png" alt="" width="500" data-align="center">

* The encoder's inputs first flow through a self-attention layer, which helps the encoder look at other words in the input sequence as it encodes a specific word. The outputs of the self-attention layer are fed to a feed-forward neural network. The exact same feed-forward network is independently applied to each position.

* The decoder has both those layers, but between them is an attention layer that helps the decoder focus on relevant parts of the input sequence (similar to what attention does in seq2seq models).

## Bringing the tensors into the picture

Now that we have seen the major components of the model, let's start to look at the various vectors/tensors and how they flow between these components to turn the input of a trained model into an output.

As is the case in NLP applications in general, we begin by turning each input word into a vector using an embedding. The embedding only happens in the bottom-most encoder. The abstraction that is common to all encoders is that they receive a list of vectors, each of size 512. **In the bottom encoder that would be the word embeddings, but in other encoders, it would be the output of the encoder that is directly below.** The size of this list is a hyperparameter we can set; basically, it would be the length of the longest sentence in our training dataset.
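As a rough illustration of the embedding step, the sketch below looks up one 512-dimensional row per word. The three-word vocabulary and the randomly filled table are hypothetical stand-ins for a trained embedding, not values from the paper.

```python
import numpy as np

rng = np.random.default_rng(0)
d_model = 512

# Hypothetical toy vocabulary mapping each word to a row index.
vocab = {"the": 0, "animal": 1, "it": 2}

# Stand-in for a trained embedding table: one row of size d_model per word.
embedding_table = rng.normal(size=(len(vocab), d_model))

tokens = ["the", "animal"]
ids = [vocab[t] for t in tokens]
X = embedding_table[ids]   # list of vectors fed to the bottom encoder, shape (2, 512)
```

Only the bottom encoder receives `X`; every encoder above it receives a matrix of the same shape produced by the encoder below.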

After embedding the words in our input sequence, each of them flows through each of the two layers of the encoder:

<img title="" src="images_it/encoder_with_tensors_2.png" alt="" width="500" data-align="center">

Here we begin to see one key property of the Transformer, which is that the word in each position flows through its own path in the encoder. There are dependencies between these paths in the self-attention layer. However, the feed-forward layer does not have those dependencies, and thus the various paths can be executed in parallel while flowing through the feed-forward layer.
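The independence of the feed-forward paths can be checked with a small NumPy sketch: applying the network to the whole sequence at once gives the same result as applying it position by position. The toy sizes and random weights are assumptions for illustration; the two-layer shape with a ReLU in between follows the paper.

```python
import numpy as np

rng = np.random.default_rng(0)
d_model, d_ff, seq_len = 8, 32, 3   # toy sizes (the paper uses 512 and 2048)

# Hypothetical, randomly initialised weights shared by all positions.
W1, b1 = rng.normal(size=(d_model, d_ff)), np.zeros(d_ff)
W2, b2 = rng.normal(size=(d_ff, d_model)), np.zeros(d_model)

def feed_forward(x):
    """Two linear layers with a ReLU in between, applied along the last axis."""
    return np.maximum(0.0, x @ W1 + b1) @ W2 + b2

x = rng.normal(size=(seq_len, d_model))                    # one vector per position
all_at_once = feed_forward(x)                              # whole sequence in one call
one_by_one = np.stack([feed_forward(row) for row in x])    # one position at a time
```

Because no row depends on any other row, the per-position calls could run in parallel.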

## Self-attention at a high level

Say the following sentence is an input sentence we want to translate:

`The animal didn't cross the street because it was too tired`

What does "it" in this sentence refer to? Is it referring to the street or to the animal? It's a simple question to a human, but not as simple to an algorithm. When the model is processing the word "it", self-attention allows it to associate "it" with "animal".

As the model processes each word (each position in the input sequence), self-attention allows it to look at other positions in the input sequence for clues that can help lead to a better encoding for this word.

Similar to the hidden state in RNNs that allows them to incorporate a representation of previously processed words/vectors, self-attention is the method the Transformer uses to bake the "understanding" of other relevant words into the one we are currently processing. For example, in the following image we can see that as we are encoding the word "it" in the top encoder of the stack, part of the attention mechanism was focusing on "The animal".

<img title="" src="images_it/transformer_self-attention_visualization.png" alt="" width="400" data-align="center">

## Self-attention in detail

Let's first look at how we can calculate self-attention using vectors, then proceed to look at how it's actually implemented (using matrices).

**The first step** in calculating self-attention is to create three vectors from each of the encoder's input vectors (in this case, the embedding of each word). So for each word, we create a <span style="color:purple"><b>Query</b></span> (<span style="color:purple"><b>Q</b></span>) vector, a <span style="color:orange"><b>Key</b></span> (<span style="color:orange"><b>K</b></span>) vector, and a <span style="color:blue"><b>Value</b></span> (<span style="color:blue"><b>V</b></span>) vector. These vectors are created by multiplying the embedding by three matrices that we trained during the training process.

Notice that these new vectors are smaller in dimension than the embedding vector. Their dimensionality is 64, while the input/output vectors of an encoder have a dimensionality of 512 (this includes the embedding dimension, which is the input of the first encoder of the stack). The <span style="color:purple"><b>Q</b></span>, <span style="color:orange"><b>K</b></span>, <span style="color:blue"><b>V</b></span> vectors don't have to be smaller; this is an architecture choice that keeps the computational cost of multi-headed attention (mostly) constant.

<img title="" src="images_it/transformer_self_attention_vectors.png" alt="" width="500" data-align="center">

Multiplying the first input $X_{1}$ by the $W^Q$ matrix produces the $q_{1}$ vector, i.e., the "query" vector associated with that word. The same process is required to generate $q_{2}$ from $X_{2}$. We end up creating a "query", a "key", and a "value" projection of each word in the input sequence.
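A minimal sketch of this projection step, assuming randomly filled matrices in place of the trained $W^Q$, $W^K$, and $W^V$:

```python
import numpy as np

rng = np.random.default_rng(0)
d_model, d_k = 512, 64

# Hypothetical stand-ins for the trained projection matrices.
W_Q = rng.normal(size=(d_model, d_k))
W_K = rng.normal(size=(d_model, d_k))
W_V = rng.normal(size=(d_model, d_k))

x1 = rng.normal(size=d_model)   # embedding of the first word (made-up values)

# Each projection maps the 512-dimensional input to a 64-dimensional vector.
q1 = x1 @ W_Q
k1 = x1 @ W_K
v1 = x1 @ W_V
```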

----

<b>What is the meaning of the <span style="color:purple">Q</span>, <span style="color:orange">K</span>, and <span style="color:blue">V</span> vectors?</b>

They are simply abstractions that are useful for calculating and thinking about attention.

----

**The second step** in calculating self-attention is to calculate the "attention score". Say we are calculating the self-attention for the first word in this example, "Thinking". We need to score each word of the input sentence against this word. The score determines how much focus to place on the other parts of the input sentence as we encode a word at a certain position.

The score is calculated by taking the dot product of the <span style="color:purple"><b>query vector</b></span> with the <span style="color:orange"><b>key vector</b></span> of the respective word we're scoring. So, if we're processing the self-attention for the word in position #1, the first score would be the dot product of <span style="color:purple">q1</span> and <span style="color:orange">k1</span>. The second score would be the dot product of <span style="color:purple">q1</span> and <span style="color:orange">k2</span>. 

Now, since we are interested in estimating the attention that word puts on the rest, we will repeat this process for each query.


**The third step** is to divide the scores by 8 (the square root of the dimension of the key vectors used in the paper, 64). This leads to more stable gradients. Other values are possible here, but this is the default.

**The fourth step** is to pass the result through a softmax operation. Softmax normalizes the scores so they're all positive and add up to 1. This softmax score determines how much each word will be expressed at this position. Often the word at this position will have the highest softmax score, but sometimes it is useful to attend to another word that is relevant to the current word.

**The fifth step** is to multiply each value vector by the softmax score (in preparation to sum them up). The intuition here is to keep intact the values of the word(s) we want to focus on, and drown out irrelevant words (by multiplying them by tiny numbers like 0.001, for example).

**The sixth step** is to sum up the weighted value vectors. This produces the output of the self-attention layer at this position (for the first word).
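Steps two through six can be sketched for the first word as follows. The query, key, and value vectors are random placeholders for a two-word input; only the arithmetic follows the steps described above.

```python
import numpy as np

rng = np.random.default_rng(0)
d_k = 64

# Hypothetical query/key/value vectors for a two-word input (one row per word).
q = rng.normal(size=(2, d_k))
k = rng.normal(size=(2, d_k))
v = rng.normal(size=(2, d_k))

scores = np.array([q[0] @ k[0], q[0] @ k[1]])   # step 2: dot products with q1
scaled = scores / np.sqrt(d_k)                  # step 3: divide by 8
exp = np.exp(scaled - scaled.max())             # step 4: softmax (shifted for stability)
weights = exp / exp.sum()
weighted_values = weights[:, None] * v          # step 5: scale each value vector
z1 = weighted_values.sum(axis=0)                # step 6: sum the weighted values
```

Repeating the same computation with `q[1]` would give the output for the second word.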

<img title="" src="images_it/self-attention-output.png" alt="" width="500" data-align="center">

The resulting vector is one we can send along to the feed-forward neural network. However, in the actual implementation this calculation is done in matrix form for faster processing. So let's look at that, now that we've seen the intuition of the calculation on the word level.

## Matrix calculation of self-attention

**The first step** is to calculate the <span style="color:purple"><b>Query</b></span>, <span style="color:orange"><b>Key</b></span>, and <span style="color:blue"><b>Value</b></span> matrices. We do that by packing our inputs (e.g., the embeddings in the first encoder of the stack) into a matrix <span style="color:green"><b>X</b></span>, and multiplying it by the weight matrices <span style="color:purple"><b>WQ</b></span>, <span style="color:orange"><b>WK</b></span>, <span style="color:blue"><b>WV</b></span>.

<img title="" src="images_it/self-attention-matrix-calculation.png" alt="" width="250" data-align="center">

Every row in the <span style="color:green"><b>X</b></span> matrix corresponds to a word in the input sequence. We again see the difference in size of the embedding vector (512, or 4 boxes in the figure), and the <span style="color:purple"><b>Q</b></span>, <span style="color:orange"><b>K</b></span>, <span style="color:blue"><b>V</b></span> vectors (64, or 3 boxes in the figure).

Finally, since we are dealing with matrices, **we can condense steps two through six** in one formula to calculate the outputs of the self-attention layer:

<img title="" src="images_it/self-attention-matrix-calculation-2.png" alt="" width="500" data-align="center">
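In matrix form the whole computation is $\mathrm{softmax}\left(QK^{T}/\sqrt{d_k}\right)V$. The sketch below implements that formula in NumPy; the function names and the small random weights are illustrative assumptions, not part of any library.

```python
import numpy as np

def softmax(x, axis=-1):
    """Numerically stable softmax along the given axis."""
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def self_attention(X, W_Q, W_K, W_V):
    """Scaled dot-product self-attention for one head."""
    Q, K, V = X @ W_Q, X @ W_K, X @ W_V
    scores = Q @ K.T / np.sqrt(K.shape[-1])   # (seq_len, seq_len)
    return softmax(scores, axis=-1) @ V       # (seq_len, d_k)

rng = np.random.default_rng(0)
X = rng.normal(size=(2, 512))                 # two words, 512 features each
# Hypothetical weights; the small scale just keeps the scores in a tame range.
W_Q, W_K, W_V = (rng.normal(scale=0.02, size=(512, 64)) for _ in range(3))
Z = self_attention(X, W_Q, W_K, W_V)
```

Each row of `Z` is the attention output for the word in the matching row of `X`.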

## Multi-headed self-attention

The paper further refined the self-attention layer by adding a mechanism called "multi-headed" attention. This improves the performance of the attention layer in two ways:

1. **It expands the model's ability to focus on different positions**. Yes, in the example above, z1 contains a little bit of every other encoding, but it could be dominated by the actual word itself. If we’re translating a sentence like “The animal didn’t cross the street because it was too tired”, it would be useful to know which word “it” refers to.

2. **It gives the attention layer multiple "representation subspaces"**. With multi-headed attention we have not only one but multiple sets of Query/Key/Value weight matrices (the Transformer uses eight attention heads, so we end up with eight sets for each encoder/decoder). Each of these sets is randomly initialized. Then, after training, each set is used to project the input embeddings (or vectors from lower encoders/decoders) into a different representation subspace.

<img title="" src="images_it/transformer_attention_heads_qkv.png" alt="" width="500" data-align="center">

If we do the same self-attention calculation we outlined in the above example, just eight different times with different weight matrices, we end up with eight different output (**Z**) matrices.

However, this leaves us with a bit of a challenge. The feed-forward layer is not expecting eight matrices; it is expecting a single matrix (a vector for each word). So we need a way to condense these eight down into a single matrix. How do we do that? We concatenate the matrices and then multiply them by an additional weight matrix WO.
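A hedged sketch of the eight-head computation, assuming random weights in place of trained ones: each head produces its own Z matrix, the eight are concatenated along the feature axis, and the result is projected back to 512 dimensions by WO.

```python
import numpy as np

def softmax(x, axis=-1):
    """Numerically stable softmax along the given axis."""
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

rng = np.random.default_rng(0)
seq_len, d_model, n_heads = 2, 512, 8
d_k = d_model // n_heads                      # 64 features per head

X = rng.normal(size=(seq_len, d_model))
heads = []
for _ in range(n_heads):
    # One hypothetical (W_Q, W_K, W_V) set per head.
    W_Q, W_K, W_V = (rng.normal(scale=0.02, size=(d_model, d_k)) for _ in range(3))
    Q, K, V = X @ W_Q, X @ W_K, X @ W_V
    heads.append(softmax(Q @ K.T / np.sqrt(d_k)) @ V)   # one Z matrix per head

Z_concat = np.concatenate(heads, axis=-1)     # (seq_len, 8 * 64)
W_O = rng.normal(scale=0.02, size=(n_heads * d_k, d_model))
Z = Z_concat @ W_O                            # (seq_len, d_model), ready for the FFN
```

With 8 heads of size 64, the concatenated width is again 512, which is why the per-head vectors are made smaller than the embedding.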

<img title="" src="images_it/transformer_attention_heads_weight_matrix_o.png" alt="" width="500" data-align="center">

We can summarize multi-headed self-attention into the following picture:

<img title="" src="images_it/transformer_multi-headed_self-attention-recap.png" alt="" width="800" data-align="center">

To finish attention, let's revisit our example from before to see where the different attention heads are focusing as we encode the word "it" in our example sentence (showing 2/8 heads):

<img title="" src="images_it/transformer_self-attention_visualization_2.png" alt="" width="400" data-align="center">

As we encode the word "it", we can see that one attention head is focusing most on "the animal", while another is focusing on "tired". In a sense, the model's representation of the word "it"  bakes in some of the representation of both "animal" and "tired". Of course, if we add all the attention heads to the picture things can be harder to interpret...

<img title="" src="images_it/transformer_self-attention_visualization_3.png" alt="" width="400" data-align="center">

## Representing the order of the sequence using positional embedding

One thing that is missing from the model as we have described it so far is a way to account for the order of the words in the input sequence.

To address this, the transformer adds a vector to each input embedding. **These vectors could be a fully learnable embedding** (i.e.,  <span style="color:blue"><b>learnable positional embeddings</b></span>) or **follow a specific pattern that the model uses during learning** (<span style="color:blue"><b>absolute positional embeddings</b></span>), which helps it determine the position of each word, or the distance between different words in the sequence. The intuition here is that adding these values to the embeddings provides meaningful distances between the embedding vectors once they are projected into Q/K/V vectors and during dot-product attention.

<img title="" src="images_it/transformer_positional_encoding_vectors.png" alt="" width="500" data-align="center">

If we assumed the embedding has a dimensionality of 4, the actual positional embeddings would look like this:

<img title="" src="images_it/transformer_positional_encoding_example.png" alt="" width="500" data-align="center">

The formula for absolute positional embeddings is described in the paper (section 3.5). This is not the only possible method for positional encodings. It does, however, have the advantage of being able to scale to unseen sequence lengths (e.g., if our trained model is asked to translate a sentence longer than any of those in our training set). In the following figure, each row corresponds to the positional encoding of a vector. So the first row would be the vector we would add to the embedding of the first word in an input sequence. Each row would contain 512 values, each between 1 and -1 (in this image, there are only 60 values because it is from a different example):

<img title="" src="images_it/attention-is-all-you-need-positional-encoding.png" alt="" width="500" data-align="center">

**The specific pattern comes from interweaving the sine and cosine signals**. Other implementations instead apply the sine function to half of the embedding columns and the cosine function to the other half, and concatenate the result:

<img title="" src="images_it/transformer_positional_encoding_large_example.png" alt="" width="500" data-align="center">
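The interleaved variant can be sketched as below, following the formula in section 3.5 of the paper: sine on even columns, cosine on odd columns, with wavelengths growing geometrically. The function name is a hypothetical choice, and the sketch assumes an even `d_model`.

```python
import numpy as np

def sinusoidal_positions(n_positions, d_model):
    """Interleaved sinusoidal encoding: sine on even columns, cosine on odd ones."""
    pos = np.arange(n_positions)[:, None]            # (n_positions, 1)
    even_cols = np.arange(0, d_model, 2)[None, :]    # column indices 0, 2, 4, ...
    angles = pos / np.power(10000.0, even_cols / d_model)
    pe = np.zeros((n_positions, d_model))
    pe[:, 0::2] = np.sin(angles)
    pe[:, 1::2] = np.cos(angles)
    return pe

pe = sinusoidal_positions(n_positions=10, d_model=512)
# Row 0 is the vector added to the embedding of the first word, and so on.
```

Since the values depend only on the position index, the function can be evaluated for positions longer than anything seen in training.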

## The residuals

Each sub-layer inside encoders and decoders has a residual connection around it, which is followed by a layer-normalization step. If we are to visualize the vectors and the layer-normalization operation associated with self-attention, it would look like this:

<img title="" src="images_it/transformer_resideual_layer_norm_2.png" alt="" width="400" data-align="center">

The same applies to the sub-layers of the decoder:

<img title="" src="images_it/transformer_resideual_layer_norm_3.png" alt="" width="650" data-align="center">
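The "add and normalize" step around a sub-layer can be sketched as follows. This is a simplified illustration: the learned gain and bias of a real layer normalization are omitted, and the sub-layer is a placeholder function rather than attention or a feed-forward network.

```python
import numpy as np

def layer_norm(x, eps=1e-6):
    """Normalise each position's vector to zero mean and unit variance.
    The learned gain and bias of a full layer norm are left out of this sketch."""
    mean = x.mean(axis=-1, keepdims=True)
    var = x.var(axis=-1, keepdims=True)
    return (x - mean) / np.sqrt(var + eps)

def add_and_norm(x, sublayer):
    """Residual connection around a sub-layer, followed by layer normalization."""
    return layer_norm(x + sublayer(x))

rng = np.random.default_rng(0)
x = rng.normal(size=(2, 512))
out = add_and_norm(x, lambda h: 0.5 * h)   # placeholder sub-layer for illustration
```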

-----

<span style="color:red"><b>NOTE:</b></span> From here, we focus more on the "translation example" and thus introduce a classification head (linear + softmax).

----

## The decoder side 

The encoder starts by processing the input sequence. The output of the top encoder is then transformed into a set of attention vectors <span style="color:orange"><b>K</b></span> and <span style="color:blue"><b>V</b></span>. These are to be used by each decoder in its “encoder-decoder attention” layer, which helps the decoder focus on appropriate places in the input sequence.

The self-attention layer in the decoder is only allowed to attend to earlier positions in the output sequence. This is done by masking future positions (setting them to `-inf`) before the softmax step in the self-attention calculation.
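A small sketch of the masking step, using placeholder scores of zero so the effect of the mask is easy to see: after the softmax, each position spreads its attention only over itself and earlier positions.

```python
import numpy as np

def softmax(x, axis=-1):
    """Numerically stable softmax along the given axis."""
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

seq_len = 4
scores = np.zeros((seq_len, seq_len))      # placeholder attention scores

# True above the diagonal marks "future" positions (column j > row i).
future = np.triu(np.ones((seq_len, seq_len), dtype=bool), k=1)
masked = np.where(future, -np.inf, scores)

# exp(-inf) is 0, so masked positions receive exactly zero weight.
weights = softmax(masked, axis=-1)
```

The diagonal is never masked, so every row keeps at least one finite score and the softmax stays well defined.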

The "Encoder-Decoder Attention" layer works just like multi-headed self-attention, except it creates the <span style="color:purple"><b>Queries</b></span> matrix from the layer below it, and takes the <span style="color:orange"><b>Keys</b></span> and <span style="color:blue"><b>Values</b></span> matrix from the output of the encoder stack.
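The difference from self-attention is only in where the three matrices come from, as the sketch below tries to show for a single head. The sequence lengths and random weights are made-up values for illustration.

```python
import numpy as np

def softmax(x, axis=-1):
    """Numerically stable softmax along the given axis."""
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

rng = np.random.default_rng(0)
d_model, d_k = 512, 64
encoder_out = rng.normal(size=(5, d_model))   # 5 source positions (encoder stack output)
decoder_h = rng.normal(size=(3, d_model))     # 3 target positions (decoder layer below)

# Hypothetical stand-ins for trained projection matrices.
W_Q, W_K, W_V = (rng.normal(scale=0.02, size=(d_model, d_k)) for _ in range(3))

Q = decoder_h @ W_Q          # queries come from the decoder
K = encoder_out @ W_K        # keys come from the encoder output
V = encoder_out @ W_V        # values come from the encoder output

weights = softmax(Q @ K.T / np.sqrt(d_k))     # (3, 5): target rows, source columns
Z = weights @ V                               # (3, d_k)
```

No causal mask is needed here, because the whole input sequence is available to every decoder position.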

<img title="" src="images_it/transformer_decoding_2.gif" alt="" width="650" data-align="center">

## The final Linear and Softmax layer

The decoder stack outputs a vector of floats. How do we turn that into a word? That is the job of the final Linear layer which is followed by a Softmax Layer.

* The linear layer is a simple fully connected neural network that projects the vector produced by the stack of decoders into a much, much larger vector called a logits vector. Let's assume that our model knows 10,000 unique English words (our model's "output vocabulary") that it has learned from its training dataset. This would make the logits vector 10,000 cells wide, each cell corresponding to the score of a unique word. That is how we interpret the output of the model followed by the Linear layer.

* The softmax layer then turns those scores into probabilities (all positive, all add up to 1.0). The cell with the highest probability is chosen, and the word associated with it is produced as the output for this time step.
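The two layers can be sketched as follows, assuming a hypothetical six-word vocabulary and random weights instead of the 10,000-word trained model described above.

```python
import numpy as np

rng = np.random.default_rng(0)
d_model = 512
vocab = ["a", "am", "i", "thanks", "student", "<eos>"]   # hypothetical toy vocabulary

# Stand-in for the trained linear layer: one column of weights per vocabulary word.
W = rng.normal(scale=0.02, size=(d_model, len(vocab)))
b = np.zeros(len(vocab))

decoder_out = rng.normal(size=d_model)      # vector from the top decoder (made up)
logits = decoder_out @ W + b                # one score per vocabulary word
probs = np.exp(logits - logits.max())
probs /= probs.sum()                        # softmax: positive values that sum to 1
word = vocab[int(np.argmax(probs))]         # pick the most probable word
```

Picking the arg-max at every step is the greedy strategy the text describes; softmax does not change which cell is largest, it only turns scores into probabilities.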

<img title="" src="images_it/transformer_decoder_output_softmax.png" alt="" width="500" data-align="center">

## The classification loss function

Say we are training our translation model. Say it is our first step in the training phase, and we are training it on a simple example - translating "merci" to "thanks". What this means is that we want the output to be a probability distribution indicating the word "thanks". But since the model parameters (weights) are all initialized randomly, the (untrained) model produces a probability distribution with arbitrary values for each word. We can compare it with the actual output, then tweak all the model's weights using backpropagation to make the output closer to the desired output:

<img title="" src="images_it/transformer_logits_output_and_label.png" alt="" width="500" data-align="center">

To compare two probability distributions, we can measure how far one is from the other using the cross-entropy loss, which is closely related to the Kullback-Leibler divergence (for a one-hot target, the two give the same value).
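As a small numerical check, the sketch below computes both quantities for a one-hot target and a made-up model output over a six-word vocabulary. With a one-hot target the entropy of the target is zero, so the cross-entropy and the Kullback-Leibler divergence come out equal.

```python
import numpy as np

# One-hot label for the desired word (index 3 stands for "thanks" in this toy setup).
target = np.array([0.0, 0.0, 0.0, 1.0, 0.0, 0.0])
# Made-up output of an untrained model; the values sum to 1.
predicted = np.array([0.2, 0.2, 0.1, 0.2, 0.2, 0.1])

eps = 1e-12   # guards against log(0)
cross_entropy = -np.sum(target * np.log(predicted + eps))

# KL(target || predicted); cells where target is 0 contribute nothing.
mask = target > 0
kl = np.sum(target[mask] * np.log(target[mask] / predicted[mask]))
```

Both values equal `-log(0.2)` here, and both shrink toward 0 as the model puts more probability on the correct word.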

Note that this is an oversimplified example. More realistically, we will use a sentence longer than one word. For example – input: “je suis étudiant” and expected output: “i am a student”. What this really means, is that we want our model to successively output probability distributions where:

* Each probability distribution is represented by a vector of width vocab_size (6 in our toy example, but more realistically a number like 30,000 or 50,000)
* The first probability distribution has the highest probability at the cell associated with the word “i”
* The second probability distribution has the highest probability at the cell associated with the word “am”
* And so on, until the fifth output distribution indicates the `<end of sentence>` symbol, which also has a cell associated with it in the 10,000-element vocabulary.

After enough training, we would hope that the model outputs the translation we expect.

<img title="" src="images_it/output_trained_model_probability_distributions.png" alt="" width="500" data-align="center">
