<img width="928" alt="image" src="https://user-images.githubusercontent.com/28102493/168345918-a3860aea-fff8-4f32-a2c2-dc435e0571dc.png">

# Introduction


The Transformer is an architecture that uses Attention to significantly improve the performance of deep learning NLP translation models. It was first introduced in the paper [Attention is all you need](https://arxiv.org/abs/1706.03762) and was quickly established as the leading architecture for most text data applications.

Since then, numerous projects including Google’s BERT and OpenAI’s GPT series have built on this foundation and published performance results that handily beat existing state-of-the-art benchmarks.

Attention is a concept that helped improve the performance of neural machine translation applications. In this post, we will look at The Transformer – a model that uses attention to boost the speed with which these models can be trained. The Transformer outperforms the Google Neural Machine Translation model in specific tasks. The biggest benefit, however, comes from how The Transformer lends itself to parallelization. It is in fact Google Cloud’s recommendation to use The Transformer as a reference model to use their Cloud TPU offering. So let’s try to break the model apart and look at how it functions.

The Transformer was proposed in the paper Attention is All You Need. A TensorFlow implementation of it is available as a part of the [Tensor2Tensor](https://github.com/tensorflow/tensor2tensor) package. Harvard’s NLP group created [a guide annotating the paper with PyTorch implementation](http://nlp.seas.harvard.edu/2018/04/03/attention.html). 


## The Attention Mechanism's Origin Story: Machine Translation

To understand the hype around Transformer NLP models and their real-world implications, it’s worth taking a step back and looking into the architecture and inner workings behind these models. In this blog post, we’ll walk you through the rise of the Transformer NLP architecture, starting with its key component: the Attention paradigm.

The Attention paradigm made its grand entrance into the NLP landscape back in 2014, before the deep learning hype, and was first applied to the problem of machine translation.

Typically, a machine translation system follows a basic encoder-decoder architecture (as shown in the image below), where both the encoder and decoder are generally variants of recurrent neural networks (RNNs), such as LSTMs or GRUs. To understand how an RNN works, it helps to imagine it as a succession of cells, one for each timestep. The encoder RNN receives an input sentence and reads it one token at a time. At each timestep t, the RNN cell produces a hidden state h(t) based on the input word X(t) and the previous hidden state h(t-1). This hidden state is then fed as input to the next RNN cell, until all the words in the sentence are processed.


<img width="1222" alt="image" src="https://user-images.githubusercontent.com/28102493/166567143-102a0d6a-fa7c-407f-9762-bdde92acde49.png">



Eventually, when the whole sentence has been processed, the last-generated hidden state will hopefully capture the gist of all the information contained in every word of the input sentence. This vector, called the **context vector**, will then be fed as input to the decoder RNN, which will produce the translated sentence one word at a time.
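
The encoder loop described above can be sketched in a few lines of NumPy. This is a minimal illustration, assuming a plain tanh RNN cell; the function name `rnn_encode`, the weight names, and all sizes are hypothetical, and the random inputs stand in for real word embeddings.

```python
import numpy as np

def rnn_encode(X, W_xh, W_hh, b_h):
    """Run a plain tanh RNN over X of shape (seq_len, d_in); return all hidden states."""
    h = np.zeros(W_hh.shape[0])                    # initial hidden state h(0)
    states = []
    for x_t in X:                                  # strictly one token per timestep
        h = np.tanh(x_t @ W_xh + h @ W_hh + b_h)   # h(t) from X(t) and h(t-1)
        states.append(h)
    return np.stack(states)                        # shape (seq_len, d_hid)

rng = np.random.default_rng(0)
seq_len, d_in, d_hid = 5, 8, 16                    # toy sizes (assumptions)
X = rng.normal(size=(seq_len, d_in))               # stand-in for word embeddings
W_xh = rng.normal(scale=0.1, size=(d_in, d_hid))
W_hh = rng.normal(scale=0.1, size=(d_hid, d_hid))
b_h = np.zeros(d_hid)

hidden_states = rnn_encode(X, W_xh, W_hh, b_h)
context_vector = hidden_states[-1]                 # the single vector handed to the decoder
print(hidden_states.shape, context_vector.shape)
```

Note that the loop cannot be vectorized over timesteps, since each `h` depends on the previous one, and that only the final row survives as the context vector.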

<img width="920" alt="image" src="https://user-images.githubusercontent.com/28102493/167252038-e6be0691-9b40-463a-b165-951a4fd89312.png">

But is it reasonable to assume that the context vector can retain ALL the needed information of the input sentence? What if the sentence is, say, 50 words long? Because of the inherent sequential structure of RNNs, each input cell only produces one output hidden state vector for each word in the sentence, one by one. Due to the sequential order of word processing, it’s harder for the context vector to capture all the information contained in a long sentence with complicated dependencies between words. This is referred to as **“the bottleneck problem”.**

**Solving the Bottleneck Problem With Attention**

To address this bottleneck issue, researchers created a technique for paying attention to specific words. When translating a sentence or transcribing an audio recording, a human agent would pay special attention to the word they are presently translating or transcribing.

Neural networks can achieve this same behavior using Attention, focusing on a subset of the information they are given. Remember that each input RNN cell produces one hidden state vector for each input word. We can then concatenate these vectors, average them, or (even better!) weight them so as to give higher importance to the words of the input sentence that are most relevant for decoding the next word of the output sentence. This is what the Attention technique is all about. **With an attention mechanism, we no longer try to encode the full source sentence into a fixed-length vector. Rather, we allow the decoder to “attend” to different parts of the source sentence at each step of the output generation.** Importantly, we let the model learn what to attend to based on the input sentence and what it has produced so far.


**How can we parallelize sequential data? (I will get back to this question.)**

For now, we are dealing with two issues:

1. Vanishing gradient
1. Slow training

Each has its own remedy:

- Solving the vanishing gradient issue: Attention
- How can we parallelize sequential data: Transformers


**Enter Attention**

So how can we avoid this bottleneck? Why not feed the decoder not only the last hidden state vector, but all the hidden state vectors? Remember that each input RNN cell produces one such vector for each input word. We can then concatenate these vectors, average them, or (even better!) weight them so as to give higher importance to the words of the input sentence that are most relevant for decoding the next word of the output sentence. This is what attention is all about.

<img width="910" alt="image" src="https://user-images.githubusercontent.com/28102493/167252358-0b89d86f-f675-418e-8f86-db93e85df7f6.png">

As is now tradition, this paradigm was in fact first leveraged on images before being replicated on text. The idea was to shift the focus of the model onto specific areas of the image (that is, specific pixels) to better help it in its task.

<img width="924" alt="image" src="https://user-images.githubusercontent.com/28102493/167252511-5fd72197-7c4a-41ea-934a-c33fd28d8d94.png">

The same idea applies to translating text. In order for the decoder to generate the next word, it will first weigh the input words (encoded by their hidden states) according to their relevance at the current phase of the decoding process.

<img width="967" alt="image" src="https://user-images.githubusercontent.com/28102493/167315773-ebcd79ce-23dd-4e54-ba4f-4310cf872ca4.png">


The model needs to weight all the words in the input sentence, giving more importance to words that relate the most to the word the model is about to predict and using the information it has on hand, which in this case is the last decoder hidden state s(3). This vector represents a summary of all the words decoded so far and can be seen as the closest thing to the word the model is about to predict. Following this intuition, the weight for input word j is computed as a “similarity” measure between its hidden state vector and the vector s(3):


<img width="845" alt="image" src="https://user-images.githubusercontent.com/28102493/167315927-300436a1-f11a-473c-8367-c05347667aeb.png">

The resulting context vector c(i) is then used in the decoder (along with the previous decoder hidden state and the last predicted word) in order to generate the next word.

This is only one type of attention, which is called **additive attention**. Other forms of attention have been proposed since. The Luong multiplicative attention mechanism is one example worth noting.
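
To make the weighting step concrete, here is a minimal NumPy sketch of additive attention. It assumes the commonly used form in which the score for input word j is `v · tanh(W_s s + W_h h_j)`, followed by a softmax; the exact formula in the figure above may differ in notation, and all names (`additive_attention`, `W_s`, `W_h`, `v`) and sizes are illustrative.

```python
import numpy as np

def softmax(z):
    z = z - z.max()                 # subtract the max for numerical stability
    e = np.exp(z)
    return e / e.sum()

def additive_attention(s_prev, H, W_s, W_h, v):
    """s_prev: last decoder state (d_dec,); H: encoder hidden states (n, d_enc)."""
    # One "similarity" score per input word, from a small feed-forward net.
    scores = np.tanh(s_prev @ W_s + H @ W_h) @ v    # shape (n,)
    weights = softmax(scores)                        # attention weights, sum to 1
    context = weights @ H                            # weighted sum of hidden states
    return weights, context

rng = np.random.default_rng(0)
n, d_enc, d_dec, d_att = 6, 16, 16, 10               # toy sizes (assumptions)
H = rng.normal(size=(n, d_enc))                      # stand-in encoder states
s_prev = rng.normal(size=d_dec)                      # stand-in decoder state
W_s = rng.normal(size=(d_dec, d_att))
W_h = rng.normal(size=(d_enc, d_att))
v = rng.normal(size=d_att)

weights, context = additive_attention(s_prev, H, W_s, W_h, v)
print(weights.round(3), context.shape)
```

The context vector is recomputed at every decoding step with the current decoder state, so each output word gets its own weighting over the input words.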

**The Blessing of Soft Alignment**

Here is a cool property that you get when using attention: the model learns by itself the alignment of words between the input words and the output words. Which also makes for a great inspection tool:

<img width="985" alt="image" src="https://user-images.githubusercontent.com/28102493/167315979-4d25a504-1fb5-4901-ace6-8324ddff7fb2.png">

As you can see, the alignment is mostly monotonic, with the exception of certain cases such as the expression “European Economic Area”, where the order of the words in the French translation is reversed (“zone économique européenne”). In that case the model correctly attends in the reverse order, looking back and forth to focus on the right word.

**Such an ability allows attention to learn long range dependencies.**



**Towards Transformer NLP Models**

As you now understand, Attention was a revolutionary idea in sequence-to-sequence systems such as translation models. Transformer NLP models are based on the Attention mechanism, taking its core idea even further: In addition to using Attention to compute representations (i.e., context vectors) out of the encoder’s hidden state vectors, why not use Attention to compute the encoder’s hidden state vectors themselves? **The immediate advantage of this is getting rid of the inherent sequential structure of RNNs, which hinders the parallelization of models.**

Removing this sequential structure is what lets Attention boost the speed at which a model can be trained to translate from one sequence to another. **Thus, the main advantage of Transformer NLP models is that they are not sequential, which means that, unlike RNNs, they can be more easily parallelized, and that bigger and bigger models can be trained by parallelizing the training.**

What’s more, Transformer NLP models have so far displayed better performance and speed than RNN models. Due to all these factors, **a lot of the NLP research in the past couple of years has been focused on Transformer NLP models, and we can expect this to translate into exciting new business use cases as well.**

## How are they better than RNNs?

You might ask yourself, why would I need this? I already know about RNN and LSTM. Isn’t that good enough?

**The issue with those two models is that long term information tends to be forgotten by the model, the longer the sequence gets.** Theoretically, the information from a token can propagate far down the sequence but in practice, the probability that we keep the information diminishes exponentially, the further away we get from a specific word.

This is closely related to the vanishing gradient problem. LSTMs do better than plain RNNs thanks to gating mechanisms such as the “forget gate”, but they still don’t do so great with much longer sequences.

RNNs and their cousins, LSTMs and GRUs, were the de facto architecture for all NLP applications until Transformers came along and dethroned them.

RNN-based sequence-to-sequence models performed well, and when the Attention mechanism was first introduced, it was used to enhance their performance.

However, they had two limitations:

1. It was challenging to deal with **long-range dependencies** between words that were spread far apart in a long sentence.

2. They **process the input sequence sequentially one word at a time**, which means that they cannot do the computation for time-step t until they have completed the computation for time-step t - 1. This slows down training and inference.

As an aside, with **CNNs, all of the outputs can be computed in parallel**, which makes convolutions much faster. However, they also have limitations in dealing with long-range dependencies:

1. In a convolutional layer, only parts of the image (or words if applied to text data) that are close enough to fit within the kernel size can interact with each other. For items that are further apart, you need a much deeper network with many layers.

**The Transformer architecture addresses both of these limitations. It got rid of RNNs altogether and relied exclusively on the benefits of Attention.**

<img width="933" alt="image" src="https://user-images.githubusercontent.com/28102493/166163882-ad420101-2cd3-4b38-b33d-e8786c0927e1.png">


1. They process all the words in the sequence in parallel, thus greatly speeding up computation.

2. The distance between words in the input sequence does not matter. The model is equally good at computing dependencies between adjacent words and words that are far apart.

When arranging our calendar for the day, we prioritize our appointments. If there is anything important, we can cancel some of the meetings and accommodate what is important.

RNNs don’t do that. Whenever an RNN adds new information, it transforms the existing information completely by applying a function. The entire information is modified, and there is no consideration of what is important and what is not.

LSTMs make small modifications to the information by multiplications and additions. With LSTMs, the information flows through a mechanism known as cell states. In this way, LSTMs can selectively remember or forget things that are important and not so important.

**The problem with LSTMs**

The same problem that affects RNNs in general also affects LSTMs: when sentences are too long, LSTMs still don’t do too well. The reason is that the probability of keeping the context from a word that is far away from the current word being processed decreases exponentially with the distance from it.

That means that when sentences are long, the model often forgets the content of distant positions in the sequence. Another problem with RNNs and LSTMs is that it’s hard to parallelize the work of processing sentences, since you have to process them word by word. Not only that, but there is no explicit model of long- and short-range dependencies. To summarize, LSTMs and RNNs present 3 problems:

1. Sequential computation inhibits parallelization

1. No explicit modeling of long and short range dependencies

1. **“Distance” between positions is linear**


## CNN

The main issue with RNNs lies in their inability to parallelize processing. An RNN processes its input sequentially, i.e. we cannot compute the value of the next timestep until we have the output of the current one. This makes RNN-based approaches slow.

This issue, however, was addressed by Facebook Research, who suggested a convolution-based approach that allows parallelization on GPUs. These models establish a hierarchical representation between words, *where the words that occur closer in sequences interact at lower levels while the ones appearing far from each other operate at higher levels in the hierarchy.* ConvS2S and ByteNet are two such models. **The hierarchy is introduced to address long-term dependencies.**


Although this achieves parallelization, it is still computationally expensive. The number of operations per layer incurred by RNNs and CNNs is high relative to the quality of results they offer. The original Transformer paper puts forth a comparison of these costs for the competing models:

<img width="920" alt="image" src="https://user-images.githubusercontent.com/28102493/168291954-a16d2ce1-c3f4-4d58-b34d-431ddd41d7cc.png">

Here, d (or d_model) is the representation dimension or embedding dimension of a word (usually in the range 128–512), n is the sequence length (usually in the range 40–70), k is the kernel size of the convolution and r is the attention window-size for restricted self-attention. From the table, we can infer the following:

- The per-layer computational complexity of self-attention, O(n²·d), is lower than that of a recurrent layer, O(n·d²), whenever the sequence length n is smaller than the representation dimension d, which is the typical case for the ranges given above.

- With respect to sequential operations, all approaches except RNNs offer parallelization, hence their complexity is O(1), whereas an RNN needs O(n) sequential steps.

- The final metric is the maximum path length, which loosely measures how many steps a signal has to travel to connect two distant words, and hence how hard it is to learn long-term dependencies. Since convolutional models use hierarchical representations, their path length is O(log_k(n)), while self-attention models attend to all the words in the same step, hence their path length is O(1).
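
To get a feel for the per-layer costs, the asymptotic terms can be plugged into a small helper. This is only a back-of-the-envelope sketch: it uses the per-layer complexity terms as given in the paper's comparison table with all constant factors dropped, and the function name `per_layer_ops` and the sample values of n, d, and k are assumptions.

```python
def per_layer_ops(n, d, k=3, r=None):
    """Rough per-layer operation counts (constants dropped) for sequence length n,
    representation dimension d, kernel size k, and attention window r."""
    ops = {
        "self_attention": n * n * d,       # O(n^2 * d)
        "recurrent": n * d * d,            # O(n * d^2)
        "convolutional": k * n * d * d,    # O(k * n * d^2)
    }
    if r is not None:
        ops["restricted_self_attention"] = r * n * d   # O(r * n * d)
    return ops

short = per_layer_ops(n=50, d=512)     # typical sentence: n < d
long_ = per_layer_ops(n=1000, d=512)   # long document: n > d

for name, cost in short.items():
    print(f"{name:>16}: {cost:,}")
```

With n = 50 and d = 512, self-attention is the cheapest of the three. With n = 1000 the ordering between self-attention and the recurrent layer flips, which is why restricted self-attention is listed as an option for very long sequences.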


The Transformer uses the self-attention mechanism where attention weights are calculated using all the words in the input sequence at once, hence it facilitates parallelization. **In addition to that, since the per-layer operations in the Transformer are among words of the same sequence, the complexity does not exceed O(n²d).**

***Hence, the transformer proves to be effective (since it uses attention) and at the same time, a computationally efficient model.***


Convolutional Neural Networks help solve these problems. Their key properties are:

- **Trivial to parallelize (per layer)**

- **Exploits local dependencies**

- **Distance between positions is logarithmic**

Some of the most popular neural networks for sequence transduction, WaveNet and ByteNet, are Convolutional Neural Networks.

<img width="914" alt="image" src="https://user-images.githubusercontent.com/28102493/168487261-1f345a57-1630-423b-81cb-e85004db81ff.png">


The reason why Convolutional Neural Networks can work in parallel is that each word of the input can be processed at the same time and does not necessarily depend on the previous words to be translated. Not only that, but the “distance” between the output word and any input for a CNN is on the order of log(N), which is the height of the tree generated from the output to the input (you can see it in the GIF above). That is much better than the distance between the output of an RNN and an input, which is on the order of N.

**The problem is that Convolutional Neural Networks do not necessarily help with figuring out dependencies when translating sentences. That’s why Transformers were created: they combine the parallelism of CNNs with attention.**


# Google states...

Neural networks, in particular recurrent neural networks (RNNs), are now at the core of the leading approaches to language understanding tasks such as language modeling, machine translation and question answering. In “Attention Is All You Need”, we introduce the Transformer, a novel neural network architecture based on a self-attention mechanism that we believe to be particularly well suited for language understanding.

In our paper, we show that the Transformer outperforms both recurrent and convolutional models on academic English to German and English to French translation benchmarks. On top of **higher translation quality**, the Transformer requires **less computation to train** and is a much **better fit for modern machine learning hardware**, speeding up training by up to an order of magnitude.

# Transformers Overview
Let's do a quick recap of the attention-based model. Its core idea was that it took an input sequence and all the hidden states associated with it and, at every step of the output, decided which part of the input was useful and subsequently decided the output based on that. The sequential nature was captured by using RNNs or LSTMs in both the encoder and the decoder.


In the Attention Is All You Need paper, the authors showed that this sequential nature can be captured using only the attention mechanism, without any use of LSTMs or RNNs.

Let us first understand the basic similarities and differences between the attention and the transformer models. Both aim to achieve the same result using an encoder-decoder approach. The encoder converts the original input sequence into its latent representation in the form of hidden state vectors. The decoder tries to predict the output sequence using this latent representation. But the RNN-based approach has an inherent flaw. Due to the fundamental constraint of sequential computation, it is not possible to parallelize the network (and so leverage GPUs), which makes it hard to train on long sequences. This, in turn, puts a constraint on the batch size that can be used while training. This has been alleviated by the transformer and we’ll soon learn how. So let’s just dive right into it.

The transformer architecture continues with the Encoder-Decoder framework that was a part of the original Attention networks — given an input sequence, create an encoding of it based on the context and decode that context-based encoding to the output sequence.

**Besides the issue of not being able to parallelize, another important reason for working on an improvement was that the attention-based model would inadvertently give a higher weight to the elements in the sequence closer to a position.** Though this might make sense for understanding the grammatical formation of various parts of the sentence, it makes it hard to find relations between words far apart in the sentence.

As seen, it follows the encoder-decoder design while replacing the LSTMs with Self-Attention layers, with the sequential order being captured by Positional Encodings. ***One important point to remember is that all these components are only made of fully connected (FC) layers. Since the whole architecture is FC layers, it’s easy to parallelize it.***

As the saying goes “You don’t change a winning team.” **Consequently, the core concepts of the previously discussed model were kept: the transformer leverages an encoder-decoder architecture as well as an attention mechanism between these two components.** But not only that: in the following, we are going to discuss the self-attention layer, the multi-head concept, and the magic of positional encodings.



## A High-Level Look

The Transformer architecture excels at handling text data, which is inherently sequential. It takes a text sequence as input and produces another text sequence as output, e.g., to translate an input English sentence to Spanish.

<img width="951" alt="image" src="https://user-images.githubusercontent.com/28102493/166140875-09f7bef3-88ee-4c1c-81f5-03eef084a47a.png">

Let’s begin by looking at the model as a single black box. In a machine translation application, it would take a sentence in one language, and output its translation in another.

<img width="1303" alt="image" src="https://user-images.githubusercontent.com/28102493/166317044-ae6742d5-ec45-430a-8832-0013ad785c31.png">


Popping open that Optimus Prime goodness, we see an encoding component, a decoding component, and connections between them. At its core, it contains a stack of Encoder layers and Decoder layers. 

<img width="1311" alt="image" src="https://user-images.githubusercontent.com/28102493/166317157-bd590d6b-fcf6-4a88-8fb3-44b797b8a803.png">

This is the architecture of the transformer we have until now. **What we need to note is that the output of the encoder is an improved version of the original embeddings.** So we should be able to improve it further by stacking more encoders. This is the point that is leveraged in the final design of the transformer network.


The encoding component is a stack of encoders (the paper stacks six of them on top of each other – there’s nothing magical about the number six, one can definitely experiment with other arrangements). The decoding component is a stack of decoders of the same number. To avoid confusion we will refer to the individual layer as an Encoder or a Decoder and will use Encoder stack or Decoder stack for a group of Encoder layers.

<img width="1410" alt="image" src="https://user-images.githubusercontent.com/28102493/166317344-59343ccf-789f-4278-8811-d267004dc3b8.png">

All the Encoders are identical to one another in structure, and similarly all the Decoders are identical. **They do not share weights, however: each Encoder and Decoder has its own set of weights.** Each Encoder is broken down into two sub-layers.

<img width="1264" alt="image" src="https://user-images.githubusercontent.com/28102493/166317580-57dce359-b5dd-44c6-8632-2821686b2e00.png">

The Encoder stack and the Decoder stack each have their corresponding Embedding layers for their respective inputs. Finally, there is an Output layer to generate the final output.


<img width="938" alt="image" src="https://user-images.githubusercontent.com/28102493/166140916-ec793c9a-c396-4020-b7d3-63b51f24ae55.png">


<img width="908" alt="image" src="https://user-images.githubusercontent.com/28102493/166140961-b17ba727-e828-4432-a74e-64d3cbec5345.png">

- The Encoder contains the all-important Self-attention layer that computes the relationship between different words in the sequence, as well as a Feed-forward layer.
    - The encoder’s inputs first flow through a self-attention layer – a layer that helps the encoder look at other words in the input sentence as it encodes a specific word. We’ll look closer at self-attention later in the post. 
    - The outputs of the self-attention layer are fed to a feed-forward neural network. The exact same feed-forward network is independently applied to each position.


- The Decoder contains the Self-attention layer and the Feed-forward layer, as well as a second Encoder-Decoder attention layer between them. This extra attention layer helps the decoder focus on relevant parts of the input sentence (similar to what attention does in seq2seq models).


<img width="1267" alt="image" src="https://user-images.githubusercontent.com/28102493/166317920-7aec12b5-da3f-4fc7-9b98-6a9a9d33bb2c.png">

The Encoder is a reusable module that is the defining component of all Transformer architectures. In addition to the above two layers, it also has Residual skip connections around both layers along with two LayerNorm layers. There are many variations of the Transformer architecture. Some Transformer architectures have no Decoder at all and rely only on the Encoder.
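
As a rough illustration of how one of these sub-layers is wired, the sketch below wraps a position-wise feed-forward network in a residual skip connection followed by LayerNorm. It assumes a ReLU feed-forward network and the "add, then normalize" ordering; the learned LayerNorm gain and bias are left out for brevity, and all names and sizes are illustrative.

```python
import numpy as np

def layer_norm(x, eps=1e-6):
    # Normalize each position's vector; learned gain/bias omitted for brevity.
    mean = x.mean(axis=-1, keepdims=True)
    var = x.var(axis=-1, keepdims=True)
    return (x - mean) / np.sqrt(var + eps)

def feed_forward(x, W1, b1, W2, b2):
    # Two linear maps with a ReLU in between, applied to each row (position).
    return np.maximum(0.0, x @ W1 + b1) @ W2 + b2

def ffn_sublayer(x, W1, b1, W2, b2):
    # Residual skip connection around the sub-layer, then LayerNorm.
    return layer_norm(x + feed_forward(x, W1, b1, W2, b2))

rng = np.random.default_rng(0)
seq_len, d_model, d_ff = 3, 8, 32                  # toy sizes (assumptions)
x = rng.normal(size=(seq_len, d_model))            # one vector per position
W1, b1 = rng.normal(size=(d_model, d_ff)), np.zeros(d_ff)
W2, b2 = rng.normal(size=(d_ff, d_model)), np.zeros(d_model)

out_all = ffn_sublayer(x, W1, b1, W2, b2)          # all positions at once
out_each = np.vstack([ffn_sublayer(x[i:i + 1], W1, b1, W2, b2)
                      for i in range(seq_len)])    # one position at a time
print(np.allclose(out_all, out_each))
```

Processing all positions in one matrix multiplication gives the same result as processing them one by one, which is what "independently applied to each position" means in practice. The self-attention sub-layer would be wrapped in the same residual-plus-LayerNorm pattern.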

<img width="968" alt="image" src="https://user-images.githubusercontent.com/28102493/166141012-f13a1f00-d37c-4ab9-ab01-99d2b07512f4.png">


## Architecture Overview

We can now look under the hood and study exactly how they work in detail. We’ll see how data flows through the system with their actual matrix representations and shapes and understand the computations performed at each stage.
 
To understand what each component does, let’s walk through the working of the Transformer while we are training it to solve a translation problem. We’ll use one sample of our training data which consists of an input sequence (‘You are welcome’ in English) and a target sequence (‘De nada’ in Spanish).

<img width="917" alt="image" src="https://user-images.githubusercontent.com/28102493/166318579-045e381a-80e4-47eb-b1db-81fb05fe4ac0.png">


**Bringing The Tensors Into The Picture:**

Now that we’ve seen the major components of the model, let’s start to look at the various vectors/tensors and how they flow between these components to turn the input of a trained model into an output.


###  Embedding

As is the case in NLP applications in general, we begin by turning each input word into a vector using an embedding algorithm. The Transformer has two Embedding layers. The input sequence is fed to the first Embedding layer, known as the Input Embedding.


<img width="927" alt="image" src="https://user-images.githubusercontent.com/28102493/166319355-16e6720d-f9ec-4d46-8020-51d17bc41b63.png">


The target sequence is fed to the second Embedding layer after shifting the targets right by one position and inserting a Start token in the first position. Note that, during Inference, we have no target sequence and we feed the output sequence to this second layer in a loop. That is why it is called the Output Embedding.
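
The shift-right step can be sketched as follows. The token IDs and the helper name `make_decoder_io` are hypothetical, and the End token appended to the expected output is an assumption that follows common practice rather than something stated above.

```python
START_ID, END_ID = 1, 2            # hypothetical special-token IDs

def make_decoder_io(target_ids):
    """Build the decoder input (shifted right) and the expected output."""
    decoder_input = [START_ID] + target_ids    # Start token in the first position
    decoder_target = target_ids + [END_ID]     # what the decoder should predict
    return decoder_input, decoder_target

target_ids = [17, 42]              # hypothetical IDs for "De", "nada"
dec_in, dec_out = make_decoder_io(target_ids)

# At step t the decoder sees dec_in[: t + 1] and should predict dec_out[t].
for t, expected in enumerate(dec_out):
    print(dec_in[: t + 1], "->", expected)
```

During training the whole shifted sequence is fed in at once. During inference the same layer receives the tokens generated so far, starting from just the Start token.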

<img width="1006" alt="image" src="https://user-images.githubusercontent.com/28102493/166319965-ec4dbc77-f569-4219-a8b1-135135fd2843.png">


**The embedding only happens in the bottom-most encoder. The abstraction that is common to all the encoders is that they receive a list of vectors, each of size 512.** In the bottom encoder that would be the word embeddings, but in other encoders, it would be the output of the encoder that’s directly below. **The size of this list is a hyperparameter we can set; basically it would be the length of the longest sentence in our training dataset.**

The text sequence is mapped to numeric word IDs using our vocabulary. The embedding layer then maps each input word into an embedding vector, which is a richer representation of the meaning of that word.
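
In code, the two steps (words to IDs, IDs to vectors) amount to a dictionary lookup followed by row indexing into a table. The sketch below uses a toy vocabulary and a randomly initialized table as stand-ins; in a real model the table is a learned parameter, and the names `vocab`, `embedding_table`, and `embed` are illustrative.

```python
import numpy as np

vocab = {"<pad>": 0, "you": 1, "are": 2, "welcome": 3}    # toy vocabulary (assumption)
d_model = 512                                              # embedding size used in this post

rng = np.random.default_rng(0)
embedding_table = rng.normal(size=(len(vocab), d_model))   # learned during training in practice

def embed(sentence):
    """Map a whitespace-tokenized sentence to word IDs, then to embedding vectors."""
    ids = np.array([vocab[w] for w in sentence.lower().split()])
    return ids, embedding_table[ids]                       # row lookup, one row per word

ids, vectors = embed("You are welcome")
print(ids, vectors.shape)
```

Each word ends up as one row of size `d_model`, so a three-word sentence becomes a 3 x 512 matrix that is passed to the bottom-most encoder.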


<img width="958" alt="image" src="https://user-images.githubusercontent.com/28102493/166320138-cc459a27-6583-4172-9137-15d71981cb0a.png">

After embedding the words in our input sequence, each of them flows through each of the two layers of the encoder.


<img width="1357" alt="image" src="https://user-images.githubusercontent.com/28102493/166320354-bdf9f872-9bd6-432a-b249-f1eee3dbd8a4.png">


Here we begin to see one key property of the Transformer, which is that the word in each position flows through its own path in the encoder. There are dependencies between these paths in the self-attention layer. The feed-forward layer does not have those dependencies, however, and thus the various paths can be executed in parallel while flowing through the feed-forward layer.

Next, we’ll switch up the example to a shorter sentence and we’ll look at what happens in each sub-layer of the encoder.

### Position Encoding

Representing The Order of The Sequence Using Positional Encoding

One thing that’s missing from the model as we have described it so far is a way to account for the order of the words in the input sequence. Since an RNN implements a loop where each word is input sequentially, it implicitly knows the position of each word.

However, Transformers don’t use RNNs and all words in a sequence are input in parallel. This is its major advantage over the RNN architecture, but it means that the position information is lost, and has to be added back in separately.

To address this, the transformer adds a vector to each input embedding. These vectors follow a specific pattern that the model learns to use, which helps it determine the position of each word, or the distance between different words in the sequence. **The intuition here is that adding these values to the embeddings provides meaningful distances between the embedding vectors once they’re projected into Q/K/V vectors and during dot-product attention.** Just like the two Embedding layers, there are two Position Encoding layers. **The Position Encoding is computed independently of the input sequence. These are fixed values that depend only on the position index and the embedding dimension, computed for every position up to the max length of the sequence.**

The positional encoding allows the model to know the position of a word in its sequence, also considering the overall length of the sequence, in order to have a relative position.

**That makes sense, since where the word is used in a sentence (beginning or end) can change its meaning.**

For instance:

- the first item is a constant code that indicates the first position
- the second item is a constant code that indicates the second position,
- and so on.

These constants are computed using the formula below, where pos is the position of the word and i indexes the embedding dimension:

<img width="965" alt="image" src="https://user-images.githubusercontent.com/28102493/166324925-d2265b4d-32b4-4858-999f-d0c720849227.png">


**To give the model a sense of the order of the words, we add positional encoding vectors -- the values of which follow a specific pattern.**

<img width="1325" alt="image" src="https://user-images.githubusercontent.com/28102493/166325536-55862203-cde0-467a-a542-d37e4bb2b786.png">

If we assumed the embedding has a dimensionality of 4, the actual positional encodings would look like this:


<img width="1195" alt="image" src="https://user-images.githubusercontent.com/28102493/166325789-88ae4839-10e7-4633-a097-9ff2197107aa.png">


What might this pattern look like?

In the following figure, each row corresponds to a positional encoding of a vector. So the first row would be the vector we’d add to the embedding of the first word in an input sequence. **Each row contains 512 values – each with a value between 1 and -1.** We’ve color-coded them so the pattern is visible.


<img width="1074" alt="image" src="https://user-images.githubusercontent.com/28102493/166326096-0da354a7-153b-4e80-89c7-9736e434c8c4.png">

Here, pos is the position of the word and i indexes the dimensions. We use sine for the even dimensions (2i) and cosine for the odd dimensions (2i + 1). **There are several choices for positional encodings: learned or fixed. This is the fixed way; the paper states that the learned and fixed methods achieved nearly identical results.**
The general idea behind this is that, for a fixed offset k, PEₚₒₛ₊ₖ can be represented as a linear function of PEₚₒₛ.
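The sine/cosine rule described above can be sketched in a few lines of plain Python. This is a minimal illustration, assuming the base of 10000 from the paper's formula; the function name `positional_encoding` and the tiny sizes are made up for the example.

```python
import math

def positional_encoding(max_len, d_model, base=10000.0):
    """Return a max_len x d_model table of sinusoidal position encodings."""
    table = []
    for pos in range(max_len):
        row = []
        for j in range(d_model):
            # Dimensions 2i and 2i+1 share the frequency 1 / base^(2i / d_model).
            i = j // 2
            angle = pos / (base ** (2 * i / d_model))
            # Even dimensions use sine, odd dimensions use cosine.
            row.append(math.sin(angle) if j % 2 == 0 else math.cos(angle))
        table.append(row)
    return table

pe = positional_encoding(max_len=3, d_model=4)
for pos, row in enumerate(pe):
    print(pos, [round(v, 4) for v in row])
```

Position 0 always encodes to alternating 0 and 1, because sin(0) = 0 and cos(0) = 1; every other value stays between -1 and 1.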



*A real example of positional encoding for 20 words (rows) with an embedding size of 512 (columns). You can see that it appears split in half down the center. That's because the values of the left half are generated by one function (which uses sine), and the right half is generated by another function (which uses cosine). They're then concatenated to form each of the positional encoding vectors.*

The formula for positional encoding is described in the paper (section 3.5). You can see the code for generating positional encodings in get_timing_signal_1d(). This is not the only possible method for positional encoding. It, however, gives the advantage of being able to scale to unseen lengths of sequences (e.g. if our trained model is asked to translate a sentence longer than any of those in our training set).

The positional encoding shown above is from the Tensor2Tensor implementation of the Transformer. The method shown in the paper is slightly different in that it doesn’t directly concatenate, but interweaves the two signals. The following figure shows what that looks like. Here’s the [code to generate it](https://github.com/jalammar/jalammar.github.io/blob/master/notebookes/transformer/transformer_positional_encoding_graph.ipynb):


<img width="978" alt="image" src="https://user-images.githubusercontent.com/28102493/166326854-f7bbd358-0705-498a-9552-ad4e23930842.png">

In other words, it interleaves a sine curve and a cos curve, with sine values for all even indexes and cos values for all odd indexes. As an example, if we encode a sequence of 40 words, we can see below the encoding values for a few (word position, encoding_index) combinations.


<img width="877" alt="image" src="https://user-images.githubusercontent.com/28102493/166326986-91d3d523-2297-474d-aeb4-f0d4deb5f2ee.png">

The blue curve shows the encoding of the 0th index for all 40 word-positions and the orange curve shows the encoding of the 1st index for all 40 word-positions. There will be similar curves for the remaining index values.
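To make the difference between the two layouts concrete, here is a small sketch that builds one encoding row both ways. It assumes the same frequencies (from the paper's formula) for both layouts purely for comparison; the actual Tensor2Tensor code may space its frequencies differently, and the helper names are hypothetical.

```python
import math

def angles(pos, d_model, base=10000.0):
    # One angle per sine/cosine pair.
    return [pos / (base ** (2 * i / d_model)) for i in range(d_model // 2)]

def interleaved(pos, d_model):
    # Paper layout: sin, cos, sin, cos, ...
    row = []
    for a in angles(pos, d_model):
        row.extend([math.sin(a), math.cos(a)])
    return row

def concatenated(pos, d_model):
    # Split layout: all sines in the left half, all cosines in the right half.
    a = angles(pos, d_model)
    return [math.sin(x) for x in a] + [math.cos(x) for x in a]

row_a = interleaved(5, 8)
row_b = concatenated(5, 8)
print([round(v, 3) for v in row_a])
print([round(v, 3) for v in row_b])
```

Under this assumption the two rows contain the same values in a different order, which is why the concatenated figure looks split down the middle while the interleaved one does not.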


As we know, deep learning models process a batch of training samples at a time. The Embedding and Position Encoding layers operate on matrices representing a batch of sequence samples. The Embedding takes a (samples, sequence length) shaped matrix of word IDs. It encodes each word ID into a word vector whose length is the embedding size, resulting in a (samples, sequence length, embedding size) shaped output matrix. The Position Encoding uses an encoding size that is equal to the embedding size. So it produces a similarly shaped matrix that can be added to the embedding matrix.
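The shape bookkeeping in this paragraph can be checked with a toy example. The sketch below uses nested Python lists instead of tensors, a randomly filled embedding table, and made-up word IDs, so every number in it is an assumption for illustration only.

```python
import math
import random

random.seed(0)
vocab_size, d_model = 10, 6

# Hypothetical embedding table: one d_model-sized vector per word ID.
embedding = [[random.uniform(-1, 1) for _ in range(d_model)] for _ in range(vocab_size)]

def position_row(pos):
    # Sinusoidal encoding with the same size as the embedding.
    return [math.sin(pos / 10000 ** (2 * (j // 2) / d_model)) if j % 2 == 0
            else math.cos(pos / 10000 ** (2 * (j // 2) / d_model))
            for j in range(d_model)]

batch_ids = [[1, 5, 2, 7], [3, 3, 9, 0]]   # (samples, sequence length)

# Look up each word ID, then add the encoding for its position.
embedded = [[embedding[w] for w in sample] for sample in batch_ids]
encoded = [[[e + p for e, p in zip(vec, position_row(pos))]
            for pos, vec in enumerate(sample)]
           for sample in embedded]

def shape(x):
    dims = []
    while isinstance(x, list):
        dims.append(len(x))
        x = x[0]
    return tuple(dims)

print(shape(batch_ids), shape(embedded), shape(encoded))  # (2, 4) (2, 4, 6) (2, 4, 6)
```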


<img width="945" alt="image" src="https://user-images.githubusercontent.com/28102493/166328111-867c5faa-07ed-4876-962e-74143a63dd82.png">

**The (samples, sequence length, embedding size) shape produced by the Embedding and Position Encoding layers is preserved all through the Transformer, as the data flows through the Encoder and Decoder Stacks until it is reshaped by the final Output layers.**


This gives a sense of the 3D matrix dimensions in the Transformer. However, to simplify the visualization, from here on we will drop the first dimension (for the samples) and use the 2D representation for a single sample.


<img width="914" alt="image" src="https://user-images.githubusercontent.com/28102493/166328498-045422b5-6843-4085-9c3e-06e487270c75.png">

The Input Embedding sends its outputs into the Encoder. Similarly, the Output Embedding feeds into the Decoder.


To preserve the positional information, the transformer adds a vector to each input embedding (for example, the word embedding of the corresponding input word). These vectors follow a specific periodic pattern (for example, a combination of sines and cosines with different frequencies, so that they are not in sync with each other) that **the model learns and uses to determine the position of each word with respect to the others based on the values**.

This injected vector is called “positional encoding” and is added to the input embeddings at the bottoms of both encoder and decoder stacks.


**Positional encoding is added to the model to inject information about the relative or absolute position of the words in the sentence.**

Positional encoding has the same dimension as the input embedding so that the two can be summed.

<img width="895" alt="image" src="https://user-images.githubusercontent.com/28102493/168377456-f4c64865-2b7e-4acc-8f70-9cb120e03266.png">

<img width="975" alt="image" src="https://user-images.githubusercontent.com/28102493/168377520-60ceceea-216e-4963-9449-72d057f87382.png">


So how can we make sense of these vectors? In order to understand the logic behind them, let’s analyze two fundamental properties that these vectors need to follow:

**…it may allow the model to extrapolate to sequence lengths longer than the ones encountered during training.** [Attention is all you need]

This means that even if the model was trained on sentences no longer than N words, it should be able to correctly score longer sentences. This steered the choice of these encodings towards periodic functions and helped settle on this family of functions rather than training these positional encoding vectors from scratch.

**we hypothesized it would allow the model to easily learn to attend by relative positions, since for any fixed offset k, PE(pos+k) can be represented as a linear function of PE(pos). [Attention is all you need]**

One fundamental property that these vectors need to have is that they should not encode the intrinsic position of a word within a sentence (“The word took is at position 4”), but rather the position of a word relative to other words in the sentence (“The word took is at position 3, relative to the word the”).

In other words, if the distance between two words A and B in a sentence is the same as between words C and D, their positional encoding vectors should reflect that fact. It turns out that the aforementioned function family follows this property.
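The “linear function” claim quoted above can be verified numerically for a single sine/cosine pair. In this sketch the frequency `w` and the offset `k` are arbitrary assumed values; the point is that the 2x2 transformation in `shift` depends only on the offset, never on the position itself.

```python
import math

def pair(pos, w):
    # One (sine, cosine) pair of the positional encoding at frequency w.
    return (math.sin(w * pos), math.cos(w * pos))

def shift(vec, k, w):
    s, c = vec
    # Rotation whose entries depend on the offset k only, not on pos.
    return (s * math.cos(w * k) + c * math.sin(w * k),
            c * math.cos(w * k) - s * math.sin(w * k))

w, k = 0.01, 3
errors = []
for pos in (0, 7, 50):
    shifted = shift(pair(pos, w), k, w)
    direct = pair(pos + k, w)
    errors.append(max(abs(a - b) for a, b in zip(shifted, direct)))

print(max(errors))  # tiny floating-point error only
```

Because the same transformation maps position 0 to 3, 7 to 10, and 50 to 53, equal distances between words are reflected consistently in the encodings.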

### Encoder

The Encoder and Decoder Stacks consist of several (usually six) Encoders and Decoders respectively, connected sequentially.

<img width="886" alt="image" src="https://user-images.githubusercontent.com/28102493/166329233-749bf3e6-70be-4092-af3a-3d94b05ddcc7.png">


The first Encoder in the stack receives its input from the Embedding and Position Encoding. The other Encoders in the stack receive their input from the previous Encoder. As we’ve mentioned already, an encoder receives a list of vectors as input. It processes this list by passing these vectors into a ‘self-attention’ layer, then into a feed-forward neural network, then sends out the output upwards to the next encoder.

<img width="1220" alt="image" src="https://user-images.githubusercontent.com/28102493/166329635-c3f2cf89-5a94-4138-9138-7bb4891ada09.png">

**The word at each position passes through a self-attention process. Then, they each pass through a feed-forward neural network -- the exact same network with each vector flowing through it separately.**


The Encoder passes its input into a Multi-head Self-attention layer. The Self-attention output is passed into a Feed-forward layer, which then sends its output upwards to the next Encoder.


<img width="917" alt="image" src="https://user-images.githubusercontent.com/28102493/166329374-cb01954b-1dfe-4f44-a7ee-63d881b41979.png">

**Both the Self-attention and Feed-forward sub-layers have a residual skip-connection around them, followed by a Layer-Normalization.**

The output of the last Encoder is fed into each Decoder in the Decoder Stack as explained below.

### Decoder
The Decoder’s structure is very similar to the Encoder’s but with a couple of differences.
Like the Encoder, the first Decoder in the stack receives its input from the Output Embedding and Position Encoding. The other Decoders in the stack receive their input from the previous Decoder.

The Decoder passes its input into a Multi-head Self-attention layer. This operates in a slightly different way than the one in the Encoder. It is only allowed to attend to earlier positions in the sequence. This is done by masking future positions, which we’ll talk about shortly.
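One common way to implement this restriction is to set the scores of future positions to negative infinity before the softmax, so that they receive zero weight. The sketch below assumes that approach and uses made-up scores; `causal_mask` and `masked_softmax` are illustrative names.

```python
import math

def causal_mask(n):
    # mask[i][j] is True when position i may attend to position j (j <= i).
    return [[j <= i for j in range(n)] for i in range(n)]

def masked_softmax(scores, allowed):
    # Disallowed positions get -inf, so they end up with exactly zero weight.
    masked = [s if ok else float("-inf") for s, ok in zip(scores, allowed)]
    top = max(masked)
    exps = [math.exp(s - top) for s in masked]
    total = sum(exps)
    return [e / total for e in exps]

scores = [[1.0, 2.0, 3.0],
          [1.0, 2.0, 3.0],
          [1.0, 2.0, 3.0]]
weights = [masked_softmax(row, allowed)
           for row, allowed in zip(scores, causal_mask(3))]

for row in weights:
    print([round(w, 3) for w in row])
```

The first position can only attend to itself, so its row is `[1.0, 0.0, 0.0]`; only the last row spreads weight over all three positions.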

<img width="903" alt="image" src="https://user-images.githubusercontent.com/28102493/166330314-700f6205-28cb-4dee-9f2f-aad5b944ebf0.png">

Unlike the Encoder, the Decoder has a second Multi-head attention layer, known as the Encoder-Decoder attention layer. The Encoder-Decoder attention layer works like Self-attention, except that it combines two sources of inputs — the Self-attention layer below it as well as the output of the Encoder stack.

The Encoder-Decoder attention output is passed into a Feed-forward layer, which then sends its output upwards to the next Decoder.

Each of these sub-layers, Self-attention, Encoder-Decoder attention, and Feed-forward, has a residual skip-connection around it, followed by a Layer-Normalization.


Each decoder has three sub-layers.

- A **masked multi-head** self attention mechanism on the output vectors of the previous iteration.
- A **multi-head attention mechanism on the output from encoder and masked multi-headed attention in decoder**.
- A simple, position-wise fully connected feed-forward network (think post-processing).

A few additional points:

- In the original paper, 6 layers were present in the encoder stack (2 sub-layer version) and 6 in the decoder stack (3 sub-layer version).

- **All sub-layers in the model, as well as the embedding layers, produce outputs of the same dimension. This is done to facilitate the residual connections.**


### Layer Normalization

<img width="870" alt="image" src="https://user-images.githubusercontent.com/28102493/168285793-63da3de8-cab1-4718-952b-99ed9271ab3a.png">


The key feature of layer normalization is that it **normalizes the inputs across the features**, unlike **batch normalization which normalizes each feature across a batch**. Batch norm has the flaw that it imposes a lower bound on the batch size. In layer norm, the statistics are computed across the features of each example and **are independent of other examples**. It has been seen to perform better experimentally.

<img width="857" alt="image" src="https://user-images.githubusercontent.com/28102493/168285934-cc8f9a06-a3d6-46e1-b9f3-323fbda17415.png">

In the transformers, layer normalization is done with residuals, allowing it to retain some form of information from the previous layer.
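The difference between the two normalizations comes down to which axis the mean and variance are taken over. The following sketch shows that on a tiny made-up batch; it leaves out the learned gain and bias parameters that real layers usually include, and the small `eps` value is an assumption.

```python
import math

def normalize(values, eps=1e-5):
    mean = sum(values) / len(values)
    var = sum((v - mean) ** 2 for v in values) / len(values)
    return [(v - mean) / math.sqrt(var + eps) for v in values]

batch = [[1.0, 2.0, 3.0],       # example 0
         [10.0, 20.0, 60.0]]    # example 1

# Layer norm: statistics over the features of each example (row-wise).
layer_normed = [normalize(row) for row in batch]

# Batch norm: statistics over the batch for each feature (column-wise).
columns = [normalize(list(col)) for col in zip(*batch)]
batch_normed = [list(row) for row in zip(*columns)]

print(layer_normed)
print(batch_normed)
```

Changing example 1 would alter the batch-normalized values of example 0, but not its layer-normalized values.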

**How do residual connection and layer normalization help?**

- **Residual connections are “skip connections”** that allow gradients to flow through the network without passing through the non-linear activation function. Residual connections help avoid vanishing or exploding gradient issues. For residual connections to work, the output dimension of each sub-layer in the model should be the same. All sub-layers in the Transformer produce an output of dimension 512.

- **Layer Normalization:** normalizes the inputs across the features of each example and is independent of other examples. **Layer normalization reduces the training time in feed-forward neural networks.** In Layer normalization, we compute mean and variance from all of the summed inputs to the neurons in a layer on a single training case.
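Putting the two ideas together, each sub-layer is wrapped as “add the input back, then normalize”. Here is a minimal sketch of that wrapper; `toy_sublayer` is a hypothetical stand-in for attention or the feed-forward network, and the learned gain and bias of layer normalization are omitted.

```python
import math

def layer_norm(x, eps=1e-5):
    mean = sum(x) / len(x)
    var = sum((v - mean) ** 2 for v in x) / len(x)
    return [(v - mean) / math.sqrt(var + eps) for v in x]

def add_and_norm(x, sublayer):
    # Residual connection: add the sub-layer output back onto its input,
    # which only works if both have the same dimension.
    out = sublayer(x)
    assert len(out) == len(x), "sub-layer must preserve the dimension"
    return layer_norm([a + b for a, b in zip(x, out)])

def toy_sublayer(x):
    # Hypothetical sub-layer used only for this example.
    return [0.5 * v for v in x]

y = add_and_norm([1.0, 2.0, 3.0, 4.0], toy_sublayer)
print([round(v, 4) for v in y])
```

The length check inside `add_and_norm` is the code-level version of the requirement that every sub-layer produces outputs of the same dimension.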

### Feed-Forward Neural Net

Each encoder and decoder block contains a feed-forward neural net. **It consists of two Linear layers with a ReLU activation between them.** It is applied to **each position separately and identically**. Hence the input to it is a set of embeddings x’1, x’2, etc. and the output is another set of embeddings x’’1, x’’2, etc. of the same dimensions mapped to another latent space which is common to the whole language.

Now, the second step is the Feed Forward Neural Network. This is a simple feed-forward Neural Network that is applied to every attention vector. Its main purpose is to transform the attention vectors into a form that is acceptable by the next encoder or decoder layer.
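As a rough sketch of “two Linear layers with a ReLU between them, applied to each position separately”, consider the toy example below. The weights are made-up values with d_model = 2 and an inner size of 3, chosen so the arithmetic is easy to follow by hand.

```python
def linear(x, weight, bias):
    # weight has one row per output unit.
    return [sum(w * v for w, v in zip(row, x)) + b for row, b in zip(weight, bias)]

def relu(x):
    return [max(0.0, v) for v in x]

def feed_forward(x, w1, b1, w2, b2):
    return linear(relu(linear(x, w1, b1)), w2, b2)

# Toy weights (assumed values): d_model = 2, inner size = 3.
w1, b1 = [[1.0, 0.0], [0.0, 1.0], [-1.0, -1.0]], [0.0, 0.0, 0.0]
w2, b2 = [[1.0, 1.0, 1.0], [1.0, -1.0, 0.0]], [0.0, 0.0]

sequence = [[1.0, 2.0], [-1.0, 3.0], [1.0, 2.0]]
# The same weights are applied to every position independently.
outputs = [feed_forward(x, w1, b1, w2, b2) for x in sequence]
print(outputs)  # [[3.0, -1.0], [3.0, -3.0], [3.0, -1.0]]
```

Positions 0 and 2 hold the same input vector and therefore get the same output, regardless of what sits between them; that independence is what allows the positions to be processed in parallel.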

<img width="907" alt="image" src="https://user-images.githubusercontent.com/28102493/168361303-ae359285-92da-424a-bcfb-20a3034f971a.png">


The Feed Forward Network accepts attention vectors “one at a time”. And the best thing here is that, unlike the case of RNN, each of these attention vectors is independent of the others. So, **parallelization** can be applied here, and that makes all the difference.

<img width="903" alt="image" src="https://user-images.githubusercontent.com/28102493/168370141-cc8ae695-9b48-4d35-b29b-649631587e52.png">

**Here we begin to see one key property of the Transformer, which is that the word in each position flows through its own path in the encoder.** There are dependencies between these paths in the self-attention layer. The feed-forward layer does not have those dependencies, however, and thus the various paths can be executed in parallel while flowing through the feed-forward layer.


Now we can pass all the words at the same time into the encoder block, and get the set of Encoded Vectors for every word **simultaneously**.

**A linear layer is another feed-forward layer. It is used to expand the dimensions to the number of words in the target language vocabulary (French in this example).**

Its output is then passed through a Softmax layer, which transforms it into a probability distribution that is human-interpretable.

The word with the highest probability is produced as the translated output.
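These last steps (linear projection to vocabulary size, softmax, pick the most probable word) can be sketched as follows. The four-word vocabulary, the projection weights, and the decoder output vector are all invented for the example.

```python
import math

def softmax(logits):
    top = max(logits)
    exps = [math.exp(v - top) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]

# Hypothetical 4-word target vocabulary and toy projection weights.
vocab = ["je", "suis", "étudiant", "<eos>"]
weight = [[0.2, -0.1], [1.5, 0.3], [-0.4, 0.9], [0.0, 0.0]]   # (vocab size, d_model)

decoder_output = [2.0, 1.0]   # one position, d_model = 2

# Linear layer: one logit per vocabulary word.
logits = [sum(w * v for w, v in zip(row, decoder_output)) for row in weight]
probs = softmax(logits)
predicted = vocab[probs.index(max(probs))]
print(predicted)  # suis
```

Picking the single most probable word at each step is the simplest decoding strategy; it is used here only because it matches the description above.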



# Attention

## What does Attention Do?

The key to the Transformer’s ground-breaking performance is its use of Attention.

While processing a word, Attention **enables the model to focus on other words in the input that are closely related to that word.**

eg. ‘Ball’ is closely related to ‘blue’ and ‘holding’. On the other hand, ‘blue’ is not related to ‘boy’.

<img width="898" alt="image" src="https://user-images.githubusercontent.com/28102493/166141067-5fa8d01f-ca07-4d66-8b7f-7e88b65430e3.png">


The Transformer architecture uses self-attention by **relating every word in the input sequence to every other word.**

eg. Consider two sentences:

- The cat drank the milk because it was hungry.

- The cat drank the milk because it was sweet.

In the first sentence, the word ‘it’ refers to ‘cat’, while in the second it refers to ‘milk’. When the model processes the word ‘it’, self-attention gives the model more information about its meaning so that it can associate ‘it’ with the correct word.

<img width="916" alt="image" src="https://user-images.githubusercontent.com/28102493/166141126-033c5715-f7d1-4050-9c52-319d73f4e2a4.png">


**To enable it to handle more nuances about the intent and semantics of the sentence, Transformers include multiple attention scores for each word.**

eg. While processing the word ‘it’, the first score highlights ‘cat’, while the second score highlights ‘hungry’. So when it decodes the word ‘it’, by translating it into a different language, for instance, it will incorporate some aspect of both ‘cat’ and ‘hungry’ into the translated word.

<img width="888" alt="image" src="https://user-images.githubusercontent.com/28102493/166141171-f234478a-ceab-43f5-8641-ec86fcea7fe9.png">

<img width="931" alt="image" src="https://user-images.githubusercontent.com/28102493/168376472-5d7eece6-bbea-4d41-851a-f77f329bf2b4.png">


- First, what do the X(i) vectors look like? What do you feed the self-attention layer, i.e., what are the initial representations of each word in the sentence? Well, these can be computed in several ways. For instance, you can use the good old tf-idf, or you can keep up with NLP fashion and use word embeddings (GloVe, Fasttext, or Word2vec, for example).

- Second, what does “related words” really mean? Lots of things! For example, two words in a sentence can be related with a subject-to-verb relation (the words “bird” and “has” share this relation). Ideally, we would like the dot product between the embeddings of these two words (and of all the words that share the same grammatical relation) to be high, and inversely, to be low for words that don’t share this grammatical relation. The bad news is that your out-of-the-box pre-computed word embeddings were not trained to answer this specific question. The good news is that we can still leverage these pre-computed embeddings and fine-tune them as part of the whole transformer training.

## Self-Attention at a High Level in Transformers


In the Transformer, Attention is used in three places:

- **Self-attention in the Encoder** — the input sequence pays attention to itself
- **Self-attention in the Decoder** — the target sequence pays attention to itself
- **Encoder-Decoder-attention in the Decoder** — the target sequence pays attention to the input sequence

The Attention layer takes its input in the form of three parameters, known as the Query, Key, and Value.

All three parameters are similar in structure, with each word in the sequence represented by a vector.

**Keys, Values, and Queries:**

The three random words I just threw at you in this heading are vectors created as abstractions that are useful for calculating self-attention; more details on each follow below. These are calculated by multiplying your input vector (X) **with weight matrices that are learnt while training.**


<img width="1006" alt="image" src="https://user-images.githubusercontent.com/28102493/166559647-2a30d137-bbe7-4a4b-a720-64670d949236.png">


What we want to do is take query q and find the most similar key k, by doing a dot product for q and k. The closest query-key product will have the highest value, followed by a softmax that will drive the q.k with smaller values close to 0 and q.k with larger values towards 1. This softmax distribution is multiplied with v. The value vectors multiplied with ~1 will get more attention while the ones ~0 will get less. The sizes of these q, k and v vectors are referred to as “hidden size” by various implementations.
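For a single query, the procedure described here (dot products with every key, softmax, weighted sum of values) fits in a few lines. The vectors below are toy values picked so that the first key clearly matches the query; the scaling step is left out to keep the sketch short.

```python
import math

def softmax(xs):
    top = max(xs)
    exps = [math.exp(x - top) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def attend(q, keys, values):
    # Similarity of the query with every key, turned into weights.
    weights = softmax([dot(q, k) for k in keys])
    # Weighted sum of the value vectors.
    dim = len(values[0])
    output = [sum(w * v[d] for w, v in zip(weights, values)) for d in range(dim)]
    return weights, output

q = [1.0, 0.0]
keys = [[10.0, 0.0], [0.0, 10.0], [-10.0, 0.0]]
values = [[1.0, 0.0], [0.0, 1.0], [5.0, 5.0]]
weights, output = attend(q, keys, values)
print([round(w, 4) for w in weights])
print([round(o, 4) for o in output])
```

The first weight is close to 1 and the others are close to 0, so the output is almost exactly the first value vector.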

<img width="911" alt="image" src="https://user-images.githubusercontent.com/28102493/166559976-ea15c4c8-a4c3-42d6-ad4a-f0fcc1f0b8f0.png">

**All these matrices Wq, Wk and Wv are learnt while being jointly trained during the model training.**

<img width="935" alt="image" src="https://user-images.githubusercontent.com/28102493/166565852-196f8e55-0038-4b1f-8138-c1408e399d7a.png">

<img width="906" alt="image" src="https://user-images.githubusercontent.com/28102493/166565875-14902a36-5483-4be3-87ab-a429800d0a39.png">


<img width="911" alt="image" src="https://user-images.githubusercontent.com/28102493/168283544-831f4bfc-800e-4075-882e-93672389c564.png">


Say x1 wants to know its value with respect to x2. So it’ll ‘query’ x2. x2 will provide the answer in the form of its own ‘key’, which can then be used to get a score representing how much it values x1 by taking a dot product with the query. **Since both have the same size, this will be a single number. This step will be performed with every word.**

Now, x1 will take all these scores and perform softmax to ensure that the score is bounded while also ensuring that the relative difference between the scores is maintained. (There is also the step of dividing the score before softmax by the square root of d_k, the dimension of the key vectors, to ensure stable gradients in case the score is too large, which can happen when d_k is a large number.)
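The effect of the scaling step is easiest to see on numbers. The sketch below follows the paper's formula, which divides by the square root of d_k (the key dimension); the raw scores are assumed values, not taken from a real model.

```python
import math

def softmax(xs):
    top = max(xs)
    exps = [math.exp(x - top) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

d_k = 64
raw_scores = [16.0, 8.0, 0.0]   # assumed unscaled query-key dot products
scaled = [s / math.sqrt(d_k) for s in raw_scores]

unscaled_weights = softmax(raw_scores)
scaled_weights = softmax(scaled)
print([round(w, 4) for w in unscaled_weights])  # nearly one-hot
print([round(w, 4) for w in scaled_weights])    # softer distribution
```

Without scaling, almost all the weight lands on one word and the softmax saturates, which is the regime where gradients become very small.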

This scoring and softmax task is performed by every word against all other words. The above diagram paints a picture of this whole explanation and will be more easily understood now.


<img width="907" alt="image" src="https://user-images.githubusercontent.com/28102493/168283727-0a256624-ebed-4001-b690-899d58f70caf.png">


x1 will now use this score and the ‘value’ of the corresponding word to get a new value of itself with respect to that word. If the word is not relevant to x1 then the score will be small and the corresponding value will be reduced as a factor of that score and similarly the significant words will get their values bolstered by the score.

<img width="974" alt="image" src="https://user-images.githubusercontent.com/28102493/168283918-d8dfcfa8-bb99-4788-9ac4-1b6e592765d8.png">

Finally, the word x1 will create a new ‘value’ for itself by summing up the values received. This will be the new embedding of the word.

So at the end of this whole section on self-attention, we see that the self-attention layer takes as input a position injected naïve form of embeddings and outputs more context-aware embeddings.

<img width="947" alt="image" src="https://user-images.githubusercontent.com/28102493/168315439-ad60a986-3e4b-46a8-9316-3a8d75f1c3fa.png">


<img width="948" alt="image" src="https://user-images.githubusercontent.com/28102493/168482128-a158a572-bf82-4fc7-82fb-51ca2ff142b6.png">

As we can see from the formula, the first step within Attention is to do a matrix multiply (ie. dot product) between the Query (Q) matrix and a transpose of the Key (K) matrix. Watch what happens to each word.

We produce an intermediate matrix (let’s call it a ‘factor’ matrix) where each cell is a matrix multiplication between two words.


<img width="941" alt="image" src="https://user-images.githubusercontent.com/28102493/168482139-d8eb7a20-5019-498e-b704-cb56d2db3145.png">

The next step is a matrix multiply between this intermediate ‘factor’ matrix and the Value (V) matrix, to produce the attention score that is output by the attention module. Here we can see that the fourth row corresponds to the fourth Query word matrix multiplied with all other Key and Value words.

<img width="970" alt="image" src="https://user-images.githubusercontent.com/28102493/168482346-8e30c61e-3837-49c9-82f6-ed76b83f8d46.png">

This produces the Attention Score vector (Z) that is output by the Attention Module.

The way to think about the output score is that, for each word, it is the encoded value of every word from the “Value” matrix, weighted by the “factor” matrix. The factor matrix is the dot product of the Query value for that specific word with the Key value of all words.
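The whole computation can be written in matrix form as softmax(Q·Kᵀ / √d_k)·V. Below is a minimal list-based sketch of it; here the “factor” matrix is taken after scaling and softmax, and the Q, K, V values are toy numbers.

```python
import math

def matmul(a, b):
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]

def transpose(m):
    return [list(col) for col in zip(*m)]

def softmax_rows(m):
    out = []
    for row in m:
        top = max(row)
        exps = [math.exp(v - top) for v in row]
        total = sum(exps)
        out.append([e / total for e in exps])
    return out

def attention(Q, K, V):
    d_k = len(K[0])
    scores = matmul(Q, transpose(K))   # (seq, seq): every query against every key
    factor = softmax_rows([[s / math.sqrt(d_k) for s in row] for row in scores])
    return factor, matmul(factor, V)   # (seq, seq) and (seq, d_v)

Q = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
K = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
V = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]]
factor, Z = attention(Q, K, V)
print(len(factor), len(factor[0]))   # 3 3
print(len(Z), len(Z[0]))             # 3 2
```

Each row of `Z` is a mix of all rows of `V`, weighted by the corresponding row of `factor`.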

<img width="969" alt="image" src="https://user-images.githubusercontent.com/28102493/168482373-b30b7549-d7af-4671-829a-ff466055103d.png">

Remember that the Query, Key, and Value rows are actually vectors with an Embedding dimension. Let’s zoom in on how the matrix multiplication between those vectors is calculated.

<img width="951" alt="image" src="https://user-images.githubusercontent.com/28102493/168483500-5a46c57d-5f2c-48b2-805b-ac5c9d09ee61.png">

When we do a dot product between two vectors, we multiply pairs of numbers and then sum them up.

- If the two paired numbers (eg. ‘a’ and ‘d’ above) are both positive or both negative, then the product will be positive. The product will increase the final summation.

- If one number is positive and the other negative, then the product will be negative. The product will reduce the final summation.

- If the product is positive, the larger the two numbers, the more they contribute to the final summation.

This means that if the signs of the corresponding numbers in the two vectors are aligned, the final sum will be larger.

This notion of the Dot Product applies to the attention score as well. If the vectors for two words are more aligned, the attention score will be higher.

So what is the behavior we want for the Transformer?

We want the attention score to be high for two words that are relevant to each other in the sentence. And we want the score to be low for two words that are unrelated to one another.

For example, for the sentence, “The black cat drank the milk”, the word “milk” is very relevant to “drank”, perhaps slightly less relevant to “cat”, and irrelevant to “black”. We want “milk” and “drank” to produce a high attention score, for “milk” and “cat” to produce a slightly lower score, and for “milk” and “black”, to produce a negligible score.

This is the output we want the model to learn to produce.

For this to happen, the word vectors for “milk” and “drank” must be aligned. The vectors for “milk” and “cat” will diverge somewhat. And they will be quite different for “milk” and “black”.
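A toy calculation shows what “aligned” means in terms of the dot product. The three-dimensional vectors below are invented for illustration; real word vectors are learned and much larger.

```python
def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

# Assumed toy vectors, not taken from a trained model.
milk = [0.9, -0.4, 0.7]
drank = [0.8, -0.5, 0.6]    # signs match milk in every position
cat = [0.5, 0.2, 0.3]       # signs match in two of three positions
black = [-0.7, 0.3, 0.1]    # signs mostly oppose milk

for name, vec in [("drank", drank), ("cat", cat), ("black", black)]:
    print(name, round(dot(milk, vec), 2))
```

With these numbers the score for “drank” is the highest, “cat” is lower, and “black” is negative, mirroring the ordering we want the model to learn.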

Let’s go back to the point we had kept at the back of our minds — how does the Transformer figure out what set of weights will give it the best results?

**The word vectors are generated based on the word embeddings and the weights of the Linear layers. Therefore the Transformer can learn those embeddings, Linear weights, and so on to produce the word vectors as required above.**

In other words, **it will learn those embeddings and weights in such a way that if two words in a sentence are relevant to each other, then their word vectors will be aligned.** And hence produce a higher attention score. For words that are not relevant to each other, the word vectors will not be aligned and will produce a lower attention score.

Therefore the embeddings for “milk” and “drank” will be very aligned and produce a high attention score. They will diverge somewhat for “milk” and “cat” to produce a slightly lower score and will be quite different for “milk” and “black”, to produce a low score.

This then is the principle behind the Attention module.

**The Transformer learns embeddings etc, in such a way that words that are relevant to one another are more aligned.**

**This is one reason for introducing the three Linear layers and making three versions of the input sequence, for the Query, Key, and Value. That gives the Attention module some more parameters that it is able to learn to tune the creation of the word vectors.**



### Encoder’s Self-attention

In the Encoder’s Self-attention, the Encoder’s input is passed to all three parameters, Query, Key, and Value.

The input sequence is fed to the Input Embedding and Position Encoding, which produces an encoded representation for each word in the input sequence that captures the meaning and position of each word. This is fed to all three parameters, Query, Key, and Value in the Self-Attention in the first Encoder which then also produces an encoded representation for each word in the input sequence, that now incorporates the attention scores for each word as well. As this passes through all the Encoders in the stack, each Self-Attention module also adds its own attention scores into each word’s representation.



<img width="941" alt="image" src="https://user-images.githubusercontent.com/28102493/166523038-7250819e-0d1e-40bd-ab87-bb4821280504.png">

<img width="858" alt="image" src="https://user-images.githubusercontent.com/28102493/166524636-a3e5c858-0156-49d4-87c8-fe45230b8c0c.png">



As an example, let’s say that we’re working on an English-to-Spanish translation problem, where one sample source sequence is “The ball is blue”. The target sequence is “La bola es azul”.

The source sequence is first passed through the Embedding and Position Encoding layer, which generates embedding vectors for each word in the sequence. The embedding is passed to the Encoder where it first reaches the Attention module.

**Within Attention, the embedded sequence is passed through three Linear layers which produce three separate matrices — known as the Query, Key, and Value.** These are the three matrices that are used to compute the Attention Score.

The important thing to keep in mind is that each ‘row’ of these matrices corresponds to one word in the source sequence.
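To illustrate how the three Linear layers turn one embedded sequence into Query, Key, and Value matrices, here is a small sketch. The embeddings and weight matrices are made-up values with d_model = 3 and an output size of 2, and bias terms are omitted for brevity.

```python
def matmul(a, b):
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]

# Four embedded words ("The ball is blue"), d_model = 3; values are made up.
X = [[0.1, 0.2, 0.3],
     [0.4, 0.5, 0.6],
     [0.7, 0.8, 0.9],
     [1.0, 1.1, 1.2]]

# Toy weight matrices of shape (d_model, 2); in training these are learned.
W_q = [[1.0, 0.0], [0.0, 1.0], [0.0, 0.0]]
W_k = [[0.0, 1.0], [1.0, 0.0], [0.0, 0.0]]
W_v = [[0.0, 0.0], [0.0, 1.0], [1.0, 0.0]]

Q, K, V = matmul(X, W_q), matmul(X, W_k), matmul(X, W_v)
print(len(Q), len(K), len(V))   # 4 4 4: one row per source word
print(Q[1], K[1], V[1])         # three different views of the word "ball"
```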

<img width="969" alt="image" src="https://user-images.githubusercontent.com/28102493/168481542-8c4830f0-081b-42bf-b9d6-76964a829578.png">

So to simplify the explanation and the visualization, let’s ignore the embedding dimension and track just the rows for each word.

<img width="936" alt="image" src="https://user-images.githubusercontent.com/28102493/168481688-d5a0b599-bb64-457c-a2c3-a7d9aaf31e86.png">

Each such row has been generated from its corresponding source word by a series of transformations:

- **embedding**,
- **position encoding**,
- and **linear layer**

**The embedding and linear-layer transformations are trainable operations.** This means that the weights used in those operations are not pre-decided but are learned by the model in such a way that they produce the desired output predictions. The position encoding, as described earlier, uses fixed values.

<img width="942" alt="image" src="https://user-images.githubusercontent.com/28102493/168481804-695fc6f5-d8bc-4a6c-81f6-9f8a2afb8720.png">

The key question is, how does the Transformer figure out what set of weights will give it the best results? Keep this point in the back of your mind as we will come back to it a little later.




### Decoder’s Self-attention

In the Decoder’s Self-attention, the Decoder’s input is passed to all three parameters, Query, Key, and Value.

Coming to the Decoder stack, the target sequence is fed to the Output Embedding and Position Encoding, which produces an encoded representation for each word in the target sequence that captures the meaning and position of each word. This is fed to all three parameters, Query, Key, and Value in the Self-Attention in the first Decoder which then also produces an encoded representation for each word in the target sequence, which now incorporates the attention scores for each word as well.

After passing through the Layer Norm, this is fed to the Query parameter in the Encoder-Decoder Attention in the first Decoder.

The Decoder Self-Attention works just like the Encoder Self-Attention, except that it operates on each word of the target sequence.

<img width="938" alt="image" src="https://user-images.githubusercontent.com/28102493/166556077-74144de5-6405-4a91-8c6f-0773511fda7c.png">

Similarly, the Masking masks out the Padding words in the target sequence.

Now we multiply the attention weights with the Value vectors (derived from the original token embeddings) to get the Contextualized Embeddings. **Every token in this embedding will have some amount of context diffused from other tokens.** The magnitude is given by the attention map. Take a little while and imagine how this happens.



### Decoder’s Encoder-Decoder attention

In the Encoder-Decoder Attention, the Query is obtained from the target sentence and the Key/Value from the source sentence. Thus it computes the relevance of each word in the target sentence to each word in the source sentence.

In the Decoder’s Encoder-Decoder attention, the output of the final Encoder in the stack is passed to the Value and Key parameters. The output of the Self-attention (and Layer Norm) module below it is passed to the Query parameter.


<img width="933" alt="image" src="https://user-images.githubusercontent.com/28102493/166523276-14820f11-03c0-4552-b013-0b959783795b.png">

Along with that, the output of the final Encoder in the stack is passed to the Value and Key parameters in the Encoder-Decoder Attention.

The Encoder-Decoder Attention is therefore getting a representation of both the target sequence (from the Decoder Self-Attention) and a representation of the input sequence (from the Encoder stack). It, therefore, produces a representation with the attention scores for each target sequence word that captures the influence of the attention scores from the input sequence as well.

As this passes through all the Decoders in the stack, each Self-Attention and each Encoder-Decoder Attention also add their own attention scores into each word’s representation.

The Encoder-Decoder Attention takes its input from two sources. Therefore, unlike the Encoder Self-Attention, which computes the interaction between each input word with other input words, and Decoder Self-Attention which computes the interaction between each target word with other target words, **the Encoder-Decoder Attention computes the interaction between each target word with each input word.**


<img width="769" alt="image" src="https://user-images.githubusercontent.com/28102493/166556276-8f000ddb-6458-49bb-8a6f-9213d6507a9b.png">

Therefore each cell in the resulting Attention Score corresponds to the interaction between one Q (i.e. target sequence word) and all K (i.e. input sequence) words and all V (i.e. input sequence) words.
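The shapes involved can be sketched in NumPy as follows. This is an illustration under simplifying assumptions: the learned Linear projections are omitted, a single head is used, and random numbers stand in for the Encoder and Decoder outputs.

```python
import numpy as np

def softmax(x, axis=-1):
    # Subtract the row maximum for numerical stability.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

rng = np.random.default_rng(0)
src_len, tgt_len, d_k = 3, 2, 4  # e.g. "You are welcome" -> "De nada"

encoder_output = rng.normal(size=(src_len, d_k))  # stand-in for the Encoder stack output
decoder_state = rng.normal(size=(tgt_len, d_k))   # stand-in for Decoder Self-Attention + Layer Norm

# Query comes from the target side, Key and Value from the source side.
Q = decoder_state
K = V = encoder_output

weights = softmax(Q @ K.T / np.sqrt(d_k))  # (tgt_len, src_len): one row per target word
output = weights @ V                       # (tgt_len, d_k): one vector per target word
print(weights.shape, output.shape)
```

Each row of `weights` tells us how much one target word attends to every input word, and each row of `output` mixes all the input-side Value vectors accordingly.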

**Similarly, the Masking masks out the later words in the target output, as was explained in detail.**


## Multiple Attention Heads

In the Transformer, the Attention module repeats its computations multiple times in parallel. Each of these is called an Attention Head. **The Attention module splits its Query, Key, and Value parameters N-ways and passes each split independently through a separate Head.** All of these similar Attention calculations are then combined together to produce a final Attention score. This is called Multi-head attention and gives the Transformer greater power to encode multiple relationships and nuances for each word.

<img width="966" alt="image" src="https://user-images.githubusercontent.com/28102493/166525211-b17a6e00-1d37-43a0-83cb-1e874e176d7e.png">


To understand exactly how the data is processed internally, let’s walk through the working of the Attention module while we are training the Transformer to solve a translation problem. We’ll use one sample of our training data which consists of an input sequence (‘You are welcome’ in English) and a target sequence (‘De nada’ in Spanish).

The idea behind Multi-head attention is that whenever you are translating a word, you may pay different attention to each word based on the type of question that you are asking. The image below shows what that means. For example, when translating “kicked” in the sentence “I kicked the ball”, you may ask “Who kicked?”. Depending on the answer, the translation of the word into another language can change. You may also ask other questions, such as “Did what?”.

<img width="949" alt="image" src="https://user-images.githubusercontent.com/28102493/168488854-e39be394-d654-4842-a49a-9b04843493ac.png">


### Input Layers
The Input Embedding and Position Encoding layers produce a matrix of shape (Number of Samples, Sequence Length, Embedding Size) which is fed to the Query, Key, and Value of the first Encoder in the stack.


<img width="922" alt="image" src="https://user-images.githubusercontent.com/28102493/166525804-b8d8fd48-eb4c-45ac-890c-2446324044b6.png">

To make it simple to visualize, we will drop the Batch dimension in our pictures and focus on the remaining dimensions.

<img width="925" alt="image" src="https://user-images.githubusercontent.com/28102493/166525861-b1ee5a7d-e930-4578-aab2-9f75199eef4e.png">

### Linear Layers
There are three separate Linear layers for the Query, Key, and Value. Each Linear layer has its own weights. The input is passed through these Linear layers to produce the Q, K, and V matrices.
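As a rough sketch, the three Linear layers amount to three independent matrix multiplications on the same input. Biases are omitted and random weights stand in for learned ones; the names `W_q`, `W_k`, and `W_v` are chosen here for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
seq_len, embed_size = 3, 6  # sizes used in the running example

# Output of the Input Embedding + Position Encoding (Batch dimension dropped).
x = rng.normal(size=(seq_len, embed_size))

# Three separate weight matrices, one per Linear layer.
W_q = rng.normal(size=(embed_size, embed_size))
W_k = rng.normal(size=(embed_size, embed_size))
W_v = rng.normal(size=(embed_size, embed_size))

# The same input is projected three different ways.
Q = x @ W_q
K = x @ W_k
V = x @ W_v
print(Q.shape, K.shape, V.shape)
```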

<img width="944" alt="image" src="https://user-images.githubusercontent.com/28102493/166526233-793af2a8-eb58-44a5-ba10-5828e2905499.png">


### Splitting data across Attention heads
Now the data gets split across the multiple Attention heads so that each can process it independently.

However, the important thing to understand is that this is a logical split only. The Query, Key, and Value are not physically split into separate matrices, one for each Attention head. A single data matrix is used for the Query, Key, and Value, respectively, with logically separate sections of the matrix for each Attention head. Similarly, there are not separate Linear layers, one for each Attention head. All the Attention heads share the same Linear layer but simply operate on their ‘own’ logical section of the data matrix.

**Linear layer weights are logically partitioned per head**

This logical split is done by partitioning the input data as well as the Linear layer weights uniformly across the Attention heads. We can achieve this by choosing the Query Size as below:

**Query Size = Embedding Size / Number of heads**


<img width="963" alt="image" src="https://user-images.githubusercontent.com/28102493/166538648-04ad185c-317d-4025-ad4b-232e84d395ac.png">

In our example, that is why the Query Size = 6/2 = 3. Even though the layer weight (and input data) is a single matrix we can think of it as ‘stacking together’ the separate layer weights for each head.

<img width="981" alt="image" src="https://user-images.githubusercontent.com/28102493/166539674-0f123799-80cb-44e7-9b57-9197cbc25109.png">

**The computations for all Heads can therefore be achieved via a single matrix operation rather than requiring N separate operations.** This makes the computations more efficient and keeps the model simple because fewer Linear layers are required, while still achieving the power of the independent Attention heads.
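A small NumPy check can make this equivalence tangible. The sketch below assumes each head owns a contiguous block of columns in the Query weight matrix (called `W_q` here for illustration) and compares one big multiplication against per-head multiplications.

```python
import numpy as np

rng = np.random.default_rng(0)
seq_len, embed_size, num_heads = 3, 6, 2
query_size = embed_size // num_heads  # 6 / 2 = 3, as in the example

x = rng.normal(size=(seq_len, embed_size))
W_q = rng.normal(size=(embed_size, embed_size))

# One matrix operation that serves all heads at once.
Q_all = x @ W_q  # (seq_len, embed_size)

# The same result computed head by head, each head using its own
# block of columns of the shared weight matrix.
Q_per_head = [
    x @ W_q[:, h * query_size:(h + 1) * query_size]
    for h in range(num_heads)
]
Q_stacked = np.concatenate(Q_per_head, axis=-1)

assert np.allclose(Q_all, Q_stacked)
```

The assertion passes because multiplying by a block of columns only produces the matching block of output columns, which is why the split can stay purely logical.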

**Reshaping the Q, K, and V matrices**

The Q, K, and V matrices output by the Linear layers are reshaped to include an explicit Head dimension. Now each ‘slice’ corresponds to a matrix per head.

This matrix is reshaped again by swapping the Head and Sequence dimensions. Although the Batch dimension is not drawn, the dimensions of Q are now (**Batch, Head, Sequence, Query size**).


<img width="965" alt="image" src="https://user-images.githubusercontent.com/28102493/166544783-7f8ae092-3ff4-4991-a71b-baaaf70109b1.png">

In the picture below, we can see the complete process of splitting our example Q matrix, after coming out of the Linear layer.

**The final stage is for visualization only — although the Q matrix is a single matrix, we can think of it as a logically separate Q matrix per head.**
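The two reshaping steps can be sketched with NumPy's `reshape` and `transpose`. The Q matrix below is filled with consecutive numbers purely so that the result is easy to inspect.

```python
import numpy as np

batch, seq_len, num_heads, query_size = 1, 3, 2, 3
embed_size = num_heads * query_size  # 6

# Stand-in for the Q matrix coming out of the Linear layer.
Q = np.arange(batch * seq_len * embed_size, dtype=float).reshape(batch, seq_len, embed_size)

# Step 1: expose an explicit Head dimension.
Q_heads = Q.reshape(batch, seq_len, num_heads, query_size)

# Step 2: swap the Sequence and Head dimensions.
Q_heads = Q_heads.transpose(0, 2, 1, 3)  # (Batch, Head, Sequence, Query size)

print(Q_heads.shape)
print(Q_heads[0, 0])  # head 0 sees the first query_size columns of every row
print(Q_heads[0, 1])  # head 1 sees the remaining columns
```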



<img width="909" alt="image" src="https://user-images.githubusercontent.com/28102493/166548448-a7073f50-5f29-4082-b5b9-0941d89a61d6.png">


<img width="914" alt="image" src="https://user-images.githubusercontent.com/28102493/168314338-4ab2b82e-9bc0-4a17-84f6-1242cff8f882.png">

**This is the reason why d_model needs to be evenly divisible by h.** While splitting, each d_model-sized vector is split into h vectors of size depth = d_model / h. These vectors are passed as Q, K, and V to the scaled dot-product attention, and the outputs are concatenated (‘Concat’) by reshaping the h vectors back into one vector of size d_model. This merged vector is then passed through a final Linear (feed-forward) layer.


### Compute the Attention Score for each head
We now have the 3 matrices, Q, K, and V, split across the heads. These are used to compute the Attention Score.

We will show the computations for a single head using just the last two dimensions (Sequence and Query size) and skip the first two dimensions (Batch and Head). **Essentially, we can imagine that the computations we’re looking at are getting ‘repeated’ for each head and for each sample in the batch (although, obviously, they are happening as a single matrix operation, and not as a loop).**

The first step is to do a matrix multiplication between Q and the transpose of K.


<img width="918" alt="image" src="https://user-images.githubusercontent.com/28102493/166551458-ff4cf09d-8c2b-4d43-8cf2-52fc50bc1381.png">

A Mask value is now added to the result. **In the Encoder Self-attention, the mask is used to mask out the Padding values so that they don’t participate in the Attention Score**.

Different masks are applied in the Decoder Self-attention and in the Decoder’s Encoder-Decoder Attention.

- **Step 2:** Then divide this product by the square root of the dimension of the key vector. This step improves gradient flow, which is especially important when the dot products from the previous step are large, since using them directly might push the softmax into regions with very small gradients.

- **Step 3:** Once we have scores for all positions j, we pass them through a softmax, which gives a normalized weight for each j.

- **Step 4:** Multiply the softmax weight for each j with the corresponding value vector vⱼ and sum the results. The purpose, as in earlier attention mechanisms, is to preserve the values v of the input word(s) we want to focus on by multiplying them with softmax weights close to 1, and to suppress the rest by multiplying them with softmax weights close to 0.
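The steps above can be sketched as a single NumPy function for one head. This is a minimal illustration, not a reference implementation: it applies the scaling before adding the mask (the ordering used in the original paper), and it assumes the mask is a boolean array where `True` marks positions that may be attended to.

```python
import numpy as np

def scaled_dot_product_attention(Q, K, V, mask=None):
    d_k = Q.shape[-1]
    # Q times K-transpose, scaled by sqrt(d_k).
    scores = Q @ np.swapaxes(K, -1, -2) / np.sqrt(d_k)
    if mask is not None:
        # Add a large negative number where attention is not allowed,
        # so the softmax drives those weights to 0.
        scores = scores + np.where(mask, 0.0, -1e9)
    # Numerically stable softmax over the key positions j.
    scores = scores - scores.max(axis=-1, keepdims=True)
    weights = np.exp(scores)
    weights = weights / weights.sum(axis=-1, keepdims=True)
    # Weighted sum of the value vectors.
    return weights @ V, weights

rng = np.random.default_rng(0)
Q = K = V = rng.normal(size=(3, 3))     # (Sequence, Query size)
mask = np.array([[True, True, False]])  # treat the last position as padding
output, weights = scaled_dot_product_attention(Q, K, V, mask)
print(weights.round(3))
```

In the printed weights, the last column is 0 for every row, which shows that the masked position does not take part in the Attention Score.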



<img width="951" alt="image" src="https://user-images.githubusercontent.com/28102493/166552594-a2fd0753-55bf-4170-a3fb-a7ecebd04726.png">
<img width="974" alt="image" src="https://user-images.githubusercontent.com/28102493/166552637-6da703b3-873b-4ac0-8ac2-8b49fb92509b.png">
<img width="923" alt="image" src="https://user-images.githubusercontent.com/28102493/166552697-2adb1157-b65f-471c-aa62-f2313a7af2b0.png">

### Merge each Head’s Attention Scores together
We now have separate Attention Scores for each head, which need to be combined together into a single score. This Merge operation is essentially the reverse of the Split operation.

It is done by simply reshaping the result matrix to eliminate the Head dimension. The steps are:

1. **Reshape the Attention Score matrix by swapping the Head and Sequence dimensions.** In other words, the matrix shape goes from (Batch, Head, Sequence, Query size) to (Batch, Sequence, Head, Query size).

2. **Collapse the Head dimension by reshaping to (Batch, Sequence, Head * Query size).** This effectively concatenates the Attention Score vectors for each head into a single merged Attention Score. Since Embedding size = Head * Query size, the merged Score is (Batch, Sequence, Embedding size). In the picture below, we can see the complete process of merging for the example Score matrix.


<img width="954" alt="image" src="https://user-images.githubusercontent.com/28102493/166553635-240610e4-6bad-48ef-9a8f-f51bd1d80576.png">
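The merge can be sketched in NumPy as the mirror image of the split. The per-head scores below are filled with consecutive numbers so that the concatenation is easy to see.

```python
import numpy as np

batch, num_heads, seq_len, query_size = 1, 2, 3, 3

# Stand-in for the per-head Attention Scores: (Batch, Head, Sequence, Query size).
scores = np.arange(batch * num_heads * seq_len * query_size, dtype=float)
scores = scores.reshape(batch, num_heads, seq_len, query_size)

# Step 1: swap the Head and Sequence dimensions.
merged = scores.transpose(0, 2, 1, 3)  # (Batch, Sequence, Head, Query size)

# Step 2: collapse the Head dimension.
merged = merged.reshape(batch, seq_len, num_heads * query_size)

print(merged.shape)
print(merged[0, 0])  # head 0's vector for word 0 followed by head 1's vector for word 0
```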

### End-to-end Multi-head Attention
Putting it all together, this is the end-to-end flow of the Multi-head Attention.


<img width="966" alt="image" src="https://user-images.githubusercontent.com/28102493/166554176-f8d0be8c-9109-48fe-91af-2253039c6edf.png">
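The whole flow can be condensed into one short NumPy function. This is a sketch under stated assumptions: biases are omitted, random weights stand in for learned ones, the mask is boolean with `True` meaning "may attend", and `W_o` is the final Linear layer applied after merging (as in the original paper). Passing the same tensor as `x_q` and `x_kv` gives Self-Attention; passing the Encoder output as `x_kv` gives Encoder-Decoder Attention.

```python
import numpy as np

def softmax(x):
    x = x - x.max(axis=-1, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=-1, keepdims=True)

def multi_head_attention(x_q, x_kv, W_q, W_k, W_v, W_o, num_heads, mask=None):
    batch, q_len, embed_size = x_q.shape
    kv_len = x_kv.shape[1]
    query_size = embed_size // num_heads

    def split(t, length):
        # (Batch, Sequence, Embedding) -> (Batch, Head, Sequence, Query size)
        return t.reshape(batch, length, num_heads, query_size).transpose(0, 2, 1, 3)

    # Linear layers followed by the logical split across heads.
    Q = split(x_q @ W_q, q_len)
    K = split(x_kv @ W_k, kv_len)
    V = split(x_kv @ W_v, kv_len)

    # Scaled dot-product attention for all heads in one operation.
    scores = Q @ K.transpose(0, 1, 3, 2) / np.sqrt(query_size)
    if mask is not None:
        scores = scores + np.where(mask, 0.0, -1e9)
    heads = softmax(scores) @ V  # (Batch, Head, q_len, Query size)

    # Merge the heads and apply the final Linear layer.
    merged = heads.transpose(0, 2, 1, 3).reshape(batch, q_len, embed_size)
    return merged @ W_o

rng = np.random.default_rng(0)
x = rng.normal(size=(1, 3, 6))  # (Batch, Sequence, Embedding size)
W_q, W_k, W_v, W_o = (rng.normal(size=(6, 6)) for _ in range(4))
out = multi_head_attention(x, x, W_q, W_k, W_v, W_o, num_heads=2)
print(out.shape)
```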

**Multi-head split captures richer interpretations**

An Embedding vector captures the meaning of a word. In the case of Multi-head Attention, as we have seen, the Embedding vectors for the input (and target) sequence get logically split across multiple heads. What is the significance of this?

<img width="946" alt="image" src="https://user-images.githubusercontent.com/28102493/166554352-c16daf2e-496c-4e32-94a2-f3fd84f0bda9.png">

This means that separate sections of the Embedding can learn different aspects of the meanings of each word, as it relates to other words in the sequence. This allows the Transformer to capture richer interpretations of the sequence.

This may not be a realistic example, but it might help to build intuition. For instance, one section might capture the ‘gender-ness’ (male, female, neuter) of a noun while another might capture the ‘cardinality’ (singular vs plural) of a noun. This might be important during translation because, in many languages, the verb that needs to be used depends on these factors.

## Attention Hyperparameters
There are three hyperparameters that determine the data dimensions:

1. **Embedding Size:** the width of the embedding vector (we use a width of 6 in our example). This dimension is carried forward throughout the Transformer model and hence is sometimes referred to by other names, such as ‘model size’.

2. **Query Size** (equal to Key and Value size): the size of the weights used by the three Linear layers to produce the Query, Key, and Value matrices respectively (we use a Query size of 3 in our example).

3. **Number of Attention heads** (we use 2 heads in our example).

In addition, we also have the **Batch size**, giving us one dimension for the number of samples.
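One way to keep these dimensions consistent in code is a small configuration object. The class below is a hypothetical sketch (the name `AttentionConfig` and its fields are not from any library); it assumes Query Size is derived from the other two values using Query Size = Embedding Size / Number of heads.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class AttentionConfig:
    embedding_size: int = 6  # width of the embedding vector in the example
    num_heads: int = 2       # number of Attention heads in the example
    batch_size: int = 1      # number of samples

    def __post_init__(self):
        # The logical split only works if the embedding divides evenly.
        if self.embedding_size % self.num_heads != 0:
            raise ValueError("embedding_size must be divisible by num_heads")

    @property
    def query_size(self) -> int:
        # Equal to the Key and Value size.
        return self.embedding_size // self.num_heads

config = AttentionConfig()
print(config.query_size)
```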


## Masking

There are two kinds of masks used in the multi-head attention mechanism of the Transformer.

<img width="953" alt="image" src="https://user-images.githubusercontent.com/28102493/168313080-cd24a1d8-6f9e-44b5-8a5c-7785d5f75cbc.png">


- **Padding Mask:** The input sequences are supposed to be fixed in length. Hence, a max_length parameter defines the maximum length of a sequence that the Transformer can accept. Sequences longer than max_length are truncated, while shorter sequences are padded with zeros. The zero-paddings, however, are not supposed to contribute to the attention calculation or to the target sequence generation. The working of the padding mask is explained in the adjacent figure. This is an optional operation in the Transformer.


<img width="952" alt="image" src="https://user-images.githubusercontent.com/28102493/168313710-55cd9b78-143f-457a-a224-4f824c8cf0d1.png">


- **Look-ahead Mask:** While generating target sequences at the decoder, self-attention would by default attend to all the words in the decoder input, including those that come after the current position. In practice this is incorrect: only the words preceding the current word may contribute to the generation of the next word. Masked Multi-Head Attention ensures this. The working of the look-ahead mask is explained in the adjacent figure.
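Both masks can be sketched in a few lines of NumPy. The sketch assumes that padding uses token id 0 (`PAD_ID` is a name chosen here) and that masks are boolean arrays in which `True` means "may attend".

```python
import numpy as np

PAD_ID = 0  # assumption: sequences are padded with zeros

def padding_mask(token_ids):
    """True where a real token is present; shape (Batch, 1, 1, Sequence)."""
    return (token_ids != PAD_ID)[:, None, None, :]

def look_ahead_mask(seq_len):
    """Lower-triangular mask: position i may attend to positions j <= i."""
    return np.tril(np.ones((seq_len, seq_len), dtype=bool))

tokens = np.array([[5, 7, 0, 0]])  # one sequence of length 4 with 2 padding tokens
pad = padding_mask(tokens)         # (1, 1, 1, 4)
causal = look_ahead_mask(4)        # (4, 4)

# In the Decoder Self-Attention both restrictions apply at once;
# broadcasting combines them into a (1, 1, 4, 4) mask.
decoder_mask = pad & causal
print(decoder_mask[0, 0].astype(int))
```

The extra singleton dimensions let the same mask broadcast across all heads and all query positions of a (Batch, Head, Sequence, Sequence) score matrix.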


# Training the Transformer

The Transformer works slightly differently during Training and while doing Inference.
Let’s first look at the flow of data during Training. Training data consists of two parts:

1. The source or input sequence (eg. “You are welcome” in English, for a translation problem)

2. The destination or target sequence (eg. “De nada” in Spanish)

The Transformer’s goal is to learn how to output the target sequence, by using both the input and target sequence.


The Transformer processes the data like this:

1. The input sequence is converted into Embeddings (with Position Encoding) and fed to the Encoder.

2. The stack of Encoders processes this and produces an encoded representation of the input sequence.


3. The target sequence is prepended with a **start-of-sentence token**, converted into Embeddings (with Position Encoding), and fed to the Decoder.

4. The stack of Decoders processes this along with the Encoder stack’s encoded representation to produce an encoded representation of the target sequence.

5. The Output layer converts it into word probabilities and the final output sequence.

6. The Transformer’s Loss function compares this output sequence with the target sequence from the training data. This loss is used to generate gradients to train the Transformer during **back-propagation**.


<img width="961" alt="image" src="https://user-images.githubusercontent.com/28102493/166141306-262fe571-a7c0-4a05-b46f-3c6fc53cfa40.png">


## Teacher Forcing

The approach of feeding the target sequence to the Decoder during training is known as Teacher Forcing. Why do we do this and what does that term mean?

During training, we could have used the same approach that is used during inference. In other words, run the Transformer in a loop, take the last word from the output sequence, append it to the Decoder input and feed it to the Decoder for the next iteration. Finally, when the end-of-sentence token is predicted, the Loss function would compare the generated output sequence to the target sequence in order to train the network.

Not only would this looping cause training to take much longer, but it would also make the model harder to train. The model would have to predict the second word based on a potentially erroneous first predicted word, and so on.

Instead, by feeding the target sequence to the Decoder, we are giving it a hint, so to speak, just like a Teacher would. Even if it predicted an erroneous first word, it can use the correct first word to predict the second word, so that those errors don’t keep compounding.

**In addition, the Transformer is able to output all the words in parallel without looping, which greatly speeds up training.**
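A minimal sketch of how Teacher Forcing shapes the training data is shown below. The token ids are hypothetical, random logits stand in for the model's output, and appending an end-of-sentence token to the labels is a common convention assumed here rather than something stated above.

```python
import numpy as np

SOS, EOS = 1, 2    # hypothetical ids for the start/end-of-sentence tokens
target = [11, 12]  # hypothetical ids for "De", "nada"

# The Decoder sees the target shifted right by one position ...
decoder_input = np.array([SOS] + target)  # [SOS, De, nada]
# ... and is trained to predict the next word at every position.
labels = np.array(target + [EOS])         # [De, nada, EOS]

vocab_size = 20
rng = np.random.default_rng(0)
# Stand-in for the model output: one row of logits per Decoder position,
# all produced in a single pass rather than in a loop.
logits = rng.normal(size=(len(decoder_input), vocab_size))

# Cross-entropy loss over all positions at once (log-softmax, then pick the label).
logits = logits - logits.max(axis=-1, keepdims=True)
log_probs = logits - np.log(np.exp(logits).sum(axis=-1, keepdims=True))
loss = -log_probs[np.arange(len(labels)), labels].mean()
print(decoder_input, labels, float(loss))
```

Because every position already has its correct previous words available, the loss for all positions is computed from one forward pass, combined with the look-ahead mask so that no position can peek at later words.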

# Inference
During Inference, we have only the input sequence and don’t have the target sequence to pass as input to the Decoder. The goal of the Transformer is to produce the target sequence from the input sequence alone.

So, as in a Seq2Seq model, we generate the output in a **loop** and feed the output sequence from the previous timestep to the Decoder in the next timestep until we come across an **end-of-sentence token.**

The difference from the Seq2Seq model is that, **at each timestep, we re-feed the entire output sequence generated thus far, rather than just the last word.**


The flow of data during Inference is:

1. The input sequence is converted into Embeddings (with Position Encoding) and fed to the Encoder.

2. The stack of Encoders processes this and produces an encoded representation of the input sequence.

3. Instead of the target sequence, we use an empty sequence with only a start-of-sentence token. This is converted into Embeddings (with Position Encoding) and fed to the Decoder.

4. The stack of Decoders processes this along with the Encoder stack’s encoded representation to produce an encoded representation of the target sequence.

5. The Output layer converts it into word probabilities and produces an output sequence.

6. We take the last word of the output sequence as the predicted word. That word is now filled into the second position of our Decoder input sequence, which now contains a start-of-sentence token and the first word.

7. Go back to step #3. As before, feed the new Decoder sequence into the model. Then take the second word of the output and append it to the Decoder sequence. Repeat this until it predicts an end-of-sentence token. Note that since the Encoder sequence does not change for each iteration, **we do not have to repeat steps #1 and #2 each time (Thanks to Michal Kučírka for pointing this out).**
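The loop described in the steps above can be sketched as a greedy decoding function. Everything here is illustrative: `encode` and `decode_step` are hypothetical callables standing in for the Encoder stack and the Decoder stack plus Output layer, and the toy versions below simply replay a fixed script so that the example runs on its own.

```python
import numpy as np

SOS, EOS = 1, 2  # hypothetical ids for the start/end-of-sentence tokens

def greedy_decode(encode, decode_step, source_ids, max_len=10):
    memory = encode(source_ids)  # steps 1-2: run the Encoder only once
    output = [SOS]               # step 3: start with just the start token
    for _ in range(max_len):
        # Steps 4-5: feed the entire sequence generated so far.
        probs = decode_step(memory, output)  # (len(output), vocab_size)
        # Step 6: the last position holds the prediction for the next word.
        next_id = int(np.argmax(probs[-1]))
        output.append(next_id)
        if next_id == EOS:       # step 7: stop at end-of-sentence
            break
    return output

# Toy stand-ins so the sketch is runnable; a real model would go here.
def toy_encode(source_ids):
    return np.asarray(source_ids)

def toy_decode_step(memory, output_so_far):
    script = [11, 12, EOS]  # pretend the model "knows" the translation
    vocab_size = 20
    probs = np.zeros((len(output_so_far), vocab_size))
    for pos in range(len(output_so_far)):
        probs[pos, script[min(pos, len(script) - 1)]] = 1.0
    return probs

result = greedy_decode(toy_encode, toy_decode_step, [3, 4, 5])
print(result)
```

Picking the highest-probability word at each step is the simplest strategy; the same loop structure also accommodates alternatives such as beam search.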


<img width="982" alt="image" src="https://user-images.githubusercontent.com/28102493/166142185-da340f39-9c2e-434c-9b14-9fd439ee9c75.png">



# Use 

What are Transformers used for?

Transformers are very versatile and are used for most NLP tasks such as language models and text classification. They are frequently used in sequence-to-sequence models for applications such as:

1. Machine Translation
2. Text Summarization
3. Question-Answering
4. Named Entity Recognition
5. Speech Recognition

There are different flavors of the Transformer architecture for different problems. **The basic Encoder Layer is used as a common building block for these architectures, with different application-specific ‘heads’ depending on the problem being solved.**

## Transformer Classification architecture

A Sentiment Analysis application, for instance, would take a text document as input. A Classification head takes the Transformer’s output and generates predictions of the class labels such as a positive or negative sentiment.
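As a rough illustration, a Classification head can be as small as a pooling step followed by one Linear layer and a softmax. The sketch below assumes mean-pooling over the sequence (one of several common choices) and uses random numbers in place of a real Transformer output and learned weights.

```python
import numpy as np

def classification_head(encoder_output, W, b):
    # Pool the per-word representations into one vector for the document
    # (mean-pooling is an assumption made for this sketch).
    pooled = encoder_output.mean(axis=0)
    logits = pooled @ W + b
    # Softmax turns the logits into class probabilities.
    logits = logits - logits.max()
    exp = np.exp(logits)
    return exp / exp.sum()

rng = np.random.default_rng(0)
seq_len, embed_size, num_classes = 3, 6, 2  # e.g. negative / positive

encoder_output = rng.normal(size=(seq_len, embed_size))  # stand-in for the Transformer output
W = rng.normal(size=(embed_size, num_classes))
b = np.zeros(num_classes)

probs = classification_head(encoder_output, W, b)
print(probs, int(np.argmax(probs)))
```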

<img width="955" alt="image" src="https://user-images.githubusercontent.com/28102493/166142289-e3604dc0-ab27-42e8-bcfc-1b241af81871.png">


## Transformer Language Model architecture
A Language Model architecture would take the initial part of an input sequence, such as a text sentence, as input and generate new text by predicting the words that would follow. A Language Model head takes the Transformer’s output and generates a probability for every word in the vocabulary. The highest probability word becomes the predicted output for the next word in the sentence.
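A Language Model head can be sketched as one Linear layer that maps the last position's representation to vocabulary-sized logits, followed by a softmax. The tiny hand-picked matrices below are purely illustrative, and `W_vocab` is a name chosen for this sketch.

```python
import numpy as np

def language_model_head(transformer_output, W_vocab):
    # Use the representation of the last position to predict the next word.
    logits = transformer_output[-1] @ W_vocab
    logits = logits - logits.max()
    exp = np.exp(logits)
    probs = exp / exp.sum()           # one probability per vocabulary word
    return probs, int(np.argmax(probs))

# Two positions with a 2-dimensional representation, vocabulary of 3 words.
transformer_output = np.array([[1.0, 0.0],
                               [0.0, 1.0]])
W_vocab = np.array([[0.0, 5.0, 0.0],
                    [0.0, 0.0, 5.0]])

probs, next_word_id = language_model_head(transformer_output, W_vocab)
print(probs.round(3), next_word_id)
```

Here the last position's vector selects the second row of `W_vocab`, so the word with id 2 receives the highest probability and becomes the predicted next word.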

<img width="957" alt="image" src="https://user-images.githubusercontent.com/28102493/166142315-8063bd0f-17b5-4aba-8d89-ae92574d8440.png">


# Conclusion


Transformers have a simple network architecture based on the Self-Attention mechanism and dispense with recurrence and convolutions entirely. The computations are executed in parallel, which makes Transformers efficient and reduces training time.

