# TRANSFORMER ANATOMY

**Note:** [To fully understand self-attention, I recommend looking at Karpathy's video on building GPT from scratch.](https://www.youtube.com/watch?v=kCc8FmEb1nY)

In this chapter, we'll first focus on building the attention mechanism, and then add the bits and pieces necessary to make a transformer encoder work. We'll also take a brief look at the architectural differences between the encoder and decoder modules.

This chapter also introduces a taxonomy of transformers to help you understand the zoo of models that have emerged in recent years.

# 1 - The Transformer architecture

The original Transformer is based on the encoder-decoder architecture that is widely used for tasks like machine translation, where a sequence of words is translated from one language to another. This architecture consists of two components:

* *Encoder*. Converts an input sequence of tokens into a sequence of embedding vectors and then compiles those vectors into others that are called the hidden state or context.
* *Decoder*. Uses the encoder's hidden state to iteratively generate an output sequence of tokens, one token at a time.

<table>
    <tr>
        <td><img title="" src="images_ch3/seq2seq.png" alt="" width="500" data-align="center"></td>
    </tr>
</table>

The Transformer architecture was originally designed for sequence-to-sequence tasks like machine translation, but both the encoder and decoder blocks were soon adapted as standalone models. Although there are hundreds of different transformer models, most of them belong to one of three types:

#### Encoder-only

These models convert an input sequence of text into a rich numerical representation that is well suited for tasks like text classification or named entity recognition. *BERT* and its variants, like *RoBERTa* and *DistilBERT*, belong to this class of architectures. The representation computed for a given token in this architecture depends both on the left (before the token) and the right (after the token) contexts. This is often called **bidirectional attention**.

#### Decoder-only

Given a prompt of text like "Thanks for lunch, I had a..." these models will auto-complete the sequence by iteratively predicting the most probable next word. The family of GPT models belong to this class. The representation computed for a given token in this architecture depends only on the left context. This is often called **causal** or **autoregressive attention**.

#### Encoder-decoder

These are used for modeling complex mappings from one sequence of text to another; they are suitable for machine translation and summarization tasks. In addition to the original Transformer architecture, the BART and T5 models belong to this class.

---

<mark><b>Note:</b></mark> In reality, the distinction between applications for decoder-only versus encoder-only architectures is a bit blurry. For example, decoder-only models like those in the GPT family can be primed for tasks like translation that are conventionally thought of as sequence-to-sequence tasks. Similarly, encoder-only models like BERT can be applied to summarization tasks that are usually associated with encoder-decoder or decoder-only models.

---

# 2 - The encoder

The transformer's encoder usually consists of many encoder blocks stacked next to each other (encoder stack). Each encoder block receives a sequence of embeddings and feeds them through the following sublayers:

* A multi-head **self-attention layer**

* A fully connected **feed-forward layer** that is applied to each input embedding

The output embeddings of each encoder block have the same size as the inputs. We'll soon see that the main role of the encoder stack is to "update" the input embeddings to produce representations that encode some contextual information in the sequence. For example, the word "apple" will be updated to be more "company-like" and less "fruit-like" if the words "keynote" or "phone" are close to it.

<table>
    <tr>
        <td><img src="images_ch3/encoder_block.png" title="" alt="" width="500" data-align="center"></td>
    </tr>
</table>

Each of these sublayers also uses skip connections and layer normalization, which are standard tricks to train deep neural networks effectively. But, to truly understand what makes a transformer work, we have to go deeper. Let's start with the most important building block: the **self-attention layer**.


## 2.1 - Self-attention

Attention is a mechanism that allows neural networks to assign a different amount of weight or "attention" to each element in a sequence.

For text sequences, the elements are token embeddings, where each token is mapped to a vector of some fixed dimension. For example, in BERT, each token is represented as a 768-dimensional vector. 

----

<span style="color:red"><b>IMPORTANT:</b></span> The "self" part in "self-attention" refers to the fact that these weights are computed for all hidden states in the same set; for example, all the hidden states of the encoder. By contrast, the attention mechanism associated with recurrent models involves computing the relevance of each encoder hidden state to the decoder hidden state at a given decoding timestep.

----

The main idea behind self-attention is that instead of using a **fixed** embedding for each token, we can use the whole sequence to compute a **weighted average** of each embedding. Another way to formulate this is to say that given a sequence of token embeddings $x_{1}, \dots, x_{n}$, self-attention produces a sequence of new embeddings $x_{1}', \dots, x_{n}'$ where each $x_{i}'$ is a linear combination of all the $x_{j}$:

${x_{i}' = \sum_{j=1}^{n} w_{ji}x_{j}}$

The coefficients ${w_{ji}}$ are called attention weights and are normalized so that $\sum_{j} w_{ji} = 1$.
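As a minimal sketch of this formula, assuming three toy two-dimensional embeddings and arbitrary random scores (both are made-up illustrative values, not taken from a trained model), the weighted average can be written as:

```python
import torch

torch.manual_seed(0)

# Three toy token embeddings (n = 3) of dimension 2; the values are made up.
x = torch.tensor([[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]])

# Arbitrary raw scores. A softmax over j (dim=0) makes each column i sum to 1,
# which matches the normalization sum_j w_ji = 1.
raw_scores = torch.randn(3, 3)
w = torch.softmax(raw_scores, dim=0)

# x_new[i] = sum_j w[j, i] * x[j]
x_new = w.T @ x

print(w.sum(dim=0))
print(x_new.shape)
```

Here `w[j, i]` plays the role of $w_{ji}$, so each new embedding is a mixture of all the original ones.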

To see why averaging the token embeddings might be a good idea, consider what comes to mind when you see the word "flies". You might think of insects, but given more context, as in "time flies like an arrow", you would realize that "flies" refers to a verb instead. We can create a representation for "flies" that incorporates this context by combining all the token embeddings in different proportions, for example by assigning a larger weight $w_{ji}$ to the token embeddings of "time" and "arrow". Embeddings that are generated in this way are called **contextualized embeddings** and predate the invention of transformers in language models like ELMo ([Peters et al., 2017](https://arxiv.org/abs/1802.05365)).

<table>
    <tr>
        <td><img src="images_ch3/self-attention.png" title="" alt="" width="500" data-align="center"></td>
    </tr>
</table>

### 2.1.1 - Scaled dot-product attention

There are several ways to implement a self-attention block, but the most common one is scaled dot-product attention, from the original Transformer article ([Vaswani et al., 2017](https://arxiv.org/abs/1706.03762)). There are four main steps required to implement this mechanism:

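1. **Project each token embedding** into three vectors called *query* ($\mathbf{q}$), *key* ($\mathbf{k}$), and *value* ($\mathbf{v}$). The query and key vectors must share the same dimension ($dim_k$ in the formula below) so that their dot product is defined.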

2. **Compute attention scores**. We determine how much the *query* and *key* vectors relate to each other using a *similarity function*. As the name suggests, the similarity function for scaled dot-product attention is the dot product, efficiently computed using matrix multiplication. Similar queries and keys will have a large dot product, while those that don't share much in common will have little to no overlap. The outputs from this step are called the attention scores. For a sequence with *n* input tokens, there is a corresponding $n \times n$ matrix of attention scores.

3. **Compute attention weights**. Dot products can in general produce arbitrarily large numbers, which can destabilize the training process. To handle this, the attention scores are first multiplied by a scaling factor to normalize their variance and then passed through a softmax to ensure all the column values sum to 1. The resulting $n \times n$ matrix corresponds to the attention weights.

4. **Update the token embeddings**. Once the attention weights are computed, we multiply them by the value vectors $v_{1}, \dots, v_{n}$ to obtain an updated representation for each embedding: $x_{i}' = \sum_{j} w_{ji} v_{j}$.

<table>
    <tr>
        <td><img src="images_ch3/scaled_dot_multiplication.png" title="" alt="" width="500" data-align="center"></td>
    </tr>
</table>

${\text{Attention}}(Q, K, V) = \text{softmax}\left(\frac{QK^{T}}{\sqrt{dim_k}}\right)V$

As an example, we can visualize how attention weights are calculated using the [BertViz library for Jupyter](https://pypi.org/project/bertviz/):

In [None]:
from transformers import AutoTokenizer
from bertviz.transformers_neuron_view import BertModel
from bertviz.neuron_view import show

model_ckpt = "bert-base-uncased"
tokenizer = AutoTokenizer.from_pretrained(model_ckpt)
model = BertModel.from_pretrained(model_ckpt)
text = "time flies like an arrow"
show(model, "bert", tokenizer, text, display_mode="light", layer=0, head=8)

From the visualization, we can see the values of the *query* and *key* vectors are represented as vertical bands, where the intensity of each band corresponds to the magnitude. From the scaled dot multiplication of $\mathbf{q}$ and $\mathbf{k}$, we can see that the query vector for "flies" has the strongest overlap with the key vector for "arrow".

#### **Simple implementation**

We will use PyTorch to implement scaled dot-product attention.

The first thing we need to do is tokenize the text, so let's use our tokenizer (from BERT-uncased) to extract the input vocabulary IDs:

In [None]:
inputs = tokenizer(text, return_tensors="pt", add_special_tokens=False)
inputs.input_ids

To keep things simple, we have excluded the `[CLS]` and `[SEP]` tokens by setting `add_special_tokens=False`.

Next, we need to create some dense embeddings. *Dense* in this context means that each entry in the embeddings contains a nonzero value, in contrast to one-hot encodings, where all entries but one are zero.

In [None]:
from torch import nn
from transformers import AutoConfig

config = AutoConfig.from_pretrained(model_ckpt)
token_emb = nn.Embedding(config.vocab_size, config.hidden_size)

Here we have used the `AutoConfig` class to load the `config.json` file associated with the `bert-base-uncased` checkpoint (i.e., model). In 🤗 Transformers, every checkpoint is assigned a configuration file that specifies various hyperparameters like `vocab_size` and `hidden_size`.

**Notes:**

* Token embeddings are at this point random because we have not learned them.
* Token embeddings are at this point independent of their context. This means that homonyms (words that have the same spelling but different meaning), like "flies" in the previous example, have the same representation. The role of the subsequent attention layers will be to mix these token embeddings to disambiguate and inform the representation of each token with the content of its context.
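To make the second note concrete, here is a small sketch with a hypothetical toy vocabulary (the word-to-ID mapping and the embedding size of 8 are made up for illustration and are unrelated to BERT's real vocabulary). It shows that a plain embedding lookup returns the same vector for "flies" in both sentences:

```python
import torch
from torch import nn

torch.manual_seed(0)

# Hypothetical toy vocabulary; the IDs are made up for illustration.
vocab = {"time": 0, "flies": 1, "like": 2, "an": 3, "arrow": 4,
         "fruit": 5, "a": 6, "banana": 7}
toy_emb = nn.Embedding(len(vocab), 8)

ids_a = torch.tensor([[vocab[w] for w in "time flies like an arrow".split()]])
ids_b = torch.tensor([[vocab[w] for w in "fruit flies like a banana".split()]])

emb_a = toy_emb(ids_a)
emb_b = toy_emb(ids_b)

# "flies" sits at position 1 in both sentences and receives the same vector,
# because the lookup depends only on the token ID, not on the neighbours.
print(torch.equal(emb_a[0, 1], emb_b[0, 1]))
```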

Now that we have our "lookup table", we can generate the embeddings of `text` by feeding in its corresponding vocabulary IDs:

In [None]:
inputs_embeds = token_emb(inputs.input_ids)
inputs_embeds.size()

The resulting tensor has a shape of `[batch_size, seq_len, hidden_dim]`, where `seq_len` corresponds to the number of tokens in the input text and `hidden_dim` corresponds to the number of embedding dimensions.

The next step is to create the *query*, *key*, and *value* vectors and to calculate the attention scores using the scaled dot product as the similarity function.

In [None]:
import torch
from math import sqrt

query = key = value = inputs_embeds
dim_k = key.size(-1)
scores = torch.bmm(query, key.transpose(1,2)) / sqrt(dim_k)
scores.size()

This has created a $5 \times 5$ matrix of attention scores per sample in the batch. We'll see later that the query, key, and value vectors are generated by applying independent weight matrices $W_{Q, K,V}$ to the embeddings, but for now we've kept them equal for simplicity. We have scaled the dot product by $\sqrt{dim_k}$ so that we don’t get too many large numbers during training that can cause the softmax we will apply next to saturate.
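To see why the scaling helps, here is a rough sketch under the assumption that query and key entries are independent with unit variance (real projections are learned, so this is only an approximation). In that case the variance of an unscaled dot product grows with `dim_k`, while the scaled version stays close to 1:

```python
import torch

torch.manual_seed(0)

dim_k = 768
# Random queries and keys with unit-variance entries (an assumption made
# for illustration only).
q = torch.randn(10_000, dim_k)
k = torch.randn(10_000, dim_k)

raw = (q * k).sum(dim=-1)     # unscaled dot products, one per (q, k) pair
scaled = raw / dim_k ** 0.5   # scaled as in the attention formula

print(raw.var().item(), scaled.var().item())
```

Scores with a variance in the hundreds would push the softmax towards a one-hot output, which is the saturation mentioned above.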

----

<mark><b>Note:</b></mark> The `torch.bmm` function performs a batch matrix-matrix product. That is, it takes two batches of matrices and multiplies each matrix from the first batch with the corresponding matrix in the second batch. The dimensions of the two input tensors are `[b, n, m]` and `[b, m, p]` and the resulting tensor has a dimension of `[b, n, p]`.

An alternative to `torch.bmm` is to use `torch.einsum`. Einsum allows computing many common multi-dimensional linear algebraic array operations by representing them in a short-hand format based on the Einstein summation convention. For example, we can do batch multiplication as follows ([see this article for more information](https://theaisummer.com/einsum-attention/)):

```python
a = torch.randn(10, 20, 30)  # b -> 10, i -> 20, k -> 30
c = torch.randn(10, 50, 30)  # b -> 10, j -> 50, k -> 30

y1 = torch.einsum("bik,bjk->bij", a, c)  # shape [10, 20, 50]
```
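As a quick sanity check, the following sketch (using its own random tensors) compares the einsum result with `torch.bmm` after transposing the last two dimensions of the second operand:

```python
import torch

torch.manual_seed(0)
a = torch.randn(10, 20, 30)  # [b, i, k]
c = torch.randn(10, 50, 30)  # [b, j, k]

y_einsum = torch.einsum("bik,bjk->bij", a, c)
y_bmm = torch.bmm(a, c.transpose(1, 2))  # [b, i, k] x [b, k, j] -> [b, i, j]

print(y_einsum.shape, torch.allclose(y_einsum, y_bmm, atol=1e-4))
```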

----

Let's apply the **softmax** now:

In [None]:
import torch.nn.functional as F

weights = F.softmax(scores, dim=-1)
weights.sum(dim=-1)

The final step is to multiply the attention weights by the values:

In [None]:
print("Attention weights shape:\n" + str(weights.shape))
print("\nValue tensor shape:\n" + str(value.shape))

attn_outputs = torch.bmm(weights, value)
print("\nAttention output shape:\n" + str(attn_outputs.shape))

Let's wrap these steps into a function that we can use later:

In [None]:
def scaled_dot_product_attention(query, key, value):
    dim_k = query.size(-1)
    scores = torch.bmm(query, key.transpose(1, 2)) / sqrt(dim_k)
    weights = F.softmax(scores, dim=-1)
    return torch.bmm(weights, value)

### 2.1.2 -  Multi-headed attention

In our simple example, we only used the embeddings "as is" to compute the attention scores and weights. In practice, the self-attention block applies three independent linear transformations to each embedding to generate the *query*, *key*, and *value* vectors. <span style="color:blue"><b>These transformations project the embeddings and each projection carries its own set of learnable parameters, which allows the self-attention layer to focus on different semantic aspects of the sequence</b></span>.

Each set of linear transformations represents a so-called attention head, resulting in a multi-head attention block. But <span style="color:blue"><b>why do we need more than one attention head?</b></span> The reason is that the softmax of one head tends to focus on mostly one aspect of similarity. <span style="color:blue">Having several heads allows the model to focus on several aspects at once</span>. For instance, one head can focus on subject-verb interaction, whereas another finds nearby adjectives. We don't handcraft these relations into the model; they are fully learned from the data. This is *similar* to filters in convolutional neural networks, where one filter can be responsible for detecting faces and another one finds wheels of cars in images.

<table>
    <tr>
        <td><img src="images_ch3/multi_headed_attention.png" title="" alt="" width="500" data-align="center"></td>
    </tr>
</table>

Let's implement this block by first coding up a single attention head:

In [None]:
class AttentionHead(nn.Module):
    
    def __init__(self, embed_dim, head_dim):
        super().__init__()
        self.q = nn.Linear(embed_dim, head_dim)
        self.k = nn.Linear(embed_dim, head_dim)
        self.v = nn.Linear(embed_dim, head_dim)
    
    def forward(self, hidden_state):
        attn_outputs = scaled_dot_product_attention(
            self.q(hidden_state), self.k(hidden_state), self.v(hidden_state)
        )
        return attn_outputs

Here we have initialized three independent layers that project embedding vectors from shape `[batch_size, seq_len, embed_dim]` to `[batch_size, seq_len, head_dim]`, where `head_dim` is the number of dimensions we are projecting into.

Although `head_dim` does not have to be smaller than the number of embedding dimensions of the tokens (`embed_dim`), in practice it is chosen so that `embed_dim` is a multiple of `head_dim` (that is, `head_dim = embed_dim // num_attention_heads`), which keeps the total computation across all heads roughly constant. For example, BERT has 12 attention heads, so the dimension of each head is 768/12 = 64.
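The arithmetic can be checked with a short sketch. The random tensors below are stand-ins for the per-head outputs and are used for shape checking only:

```python
import torch

embed_dim, num_heads = 768, 12    # BERT-base values quoted above
head_dim = embed_dim // num_heads

# Stand-ins for the outputs of each head: [batch_size, seq_len, head_dim].
head_outputs = [torch.randn(1, 5, head_dim) for _ in range(num_heads)]

# Concatenating along the last dimension recovers the embedding dimension.
concatenated = torch.cat(head_outputs, dim=-1)

print(head_dim, concatenated.shape)
```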

Now that we have implemented the attention head, we can <span style="color:blue"><b>concatenate the outputs of multiple attention heads to implement the full multi-head attention block</b></span>:

In [None]:
class MultiHeadAttention(nn.Module):
    def __init__(self, embed_dim, num_attention_heads):
        super().__init__()
        head_dim = embed_dim // num_attention_heads
        self.heads = nn.ModuleList(
            [AttentionHead(embed_dim, head_dim) for _ in range(num_attention_heads)]
        )
        self.output_linear = nn.Linear(embed_dim, embed_dim)
        
    def forward(self, hidden_state):
        x = torch.cat([h(hidden_state) for h in self.heads], dim=-1)
        x = self.output_linear(x)
        return x

Notice that the concatenated output from the attention heads is also fed through a final linear layer to produce an output tensor of shape `[batch_size, seq_len, hidden_dim]`. In this case, we consider the same dimension as the embedding. To confirm, let's see if the multi-head attention layer produces the expected shape of our inputs. We will consider the embeddings dimensions from our pretrained BERT model:

In [None]:
multihead_attn = MultiHeadAttention(config.hidden_size, config.num_attention_heads)
attn_output = multihead_attn(inputs_embeds)
attn_output.size()

To wrap up this section on attention, let's use BertViz again to visualize the attention for two different uses of the word "flies". 

In [None]:
from bertviz import head_view
from transformers import AutoModel

model = AutoModel.from_pretrained(model_ckpt, output_attentions=True)

sentence_a = "time flies like an arrow"
sentence_b = "fruit flies like a banana"

viz_inputs = tokenizer(sentence_a, sentence_b, return_tensors='pt')
attention = model(**viz_inputs).attentions
sentence_b_start = (viz_inputs.token_type_ids == 0).sum(dim=1)
tokens = tokenizer.convert_ids_to_tokens(viz_inputs.input_ids[0])

head_view(attention, tokens, sentence_b_start, heads=[8])

This visualization shows the attention weights as lines connecting the token whose embedding is getting updated (left) with every word that is being attended to (right). The intensity of the lines indicates the strength of the attention weights, with dark lines representing values close to 1, and faint lines representing values close to 0.

In this example, the input consists of two sentences and the `[CLS]` and `[SEP]` tokens are the special tokens in BERT’s tokenizer that indicate the start and end of the sequence (the name of these tokens usually differs from model to model).

One thing we can see from the visualization is that the attention weights are strongest between words that belong to the same sentence, which suggests BERT can tell that it should attend to words in the same sentence. We can also see that the attention weights allow the model to distinguish the use of "flies" as a verb or noun depending on the context in which it occurs.

## 2.2 - The feed-forward layer

Now that we have covered attention, let's take a look at implementing the missing piece of the encoder layer: **position-wise feed forward networks**.

The feed-forward sublayer in the encoder and decoder is just a simple two-layer fully connected neural network, but with a twist: instead of processing the whole sequence of embeddings as a single vector, it processes each embedding *independently*. For this reason, this layer is often referred to as a *position-wise feed-forward layer*. <span style="color:blue">You may also see it referred to as a one-dimensional convolution with a kernel size of one</span> (e.g., the OpenAI GPT codebase uses this nomenclature).

A rule of thumb from the literature is for the hidden size of the first layer to be four times the size of the embeddings, and a Gaussian Error Linear Unit (GELU) activation function is most commonly used. This is where most of the capacity and memorization is hypothesized to happen, and it's the part that is most often scaled when scaling up the models. We can implement it as follows:

In [None]:
class FeedForward(nn.Module):
    
    def __init__(self, hidden_size, intermediate_size, hidden_dropout_prob):
        super().__init__()
        self.linear_1 = nn.Linear(hidden_size, intermediate_size)
        self.linear_2 = nn.Linear(intermediate_size, hidden_size)
        self.gelu = nn.GELU()
        self.dropout = nn.Dropout(hidden_dropout_prob)
        
    def forward(self, x):
        x = self.linear_1(x)
        x = self.gelu(x)
        x = self.linear_2(x)
        x = self.dropout(x)
        return x

Note that a feed-forward layer such as `nn.Linear` is usually applied to a tensor of shape `[batch_size, input_dim]`, where it acts on each element of the batch dimension independently. In this case, we pass a tensor of shape `[batch_size, seq_len, input_dim]`, where it acts on all token embeddings of the batch and sequence independently, which is exactly what we want.
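The following sketch illustrates this position-wise behaviour with a small `nn.Linear` layer (the sizes 8 and 16 are arbitrary choices for the example). Applying the layer to the full tensor gives the same result as applying it to every position separately:

```python
import torch
from torch import nn

torch.manual_seed(0)
linear = nn.Linear(8, 16)
x = torch.randn(2, 5, 8)  # [batch_size, seq_len, input_dim]

# Apply the layer to the full tensor in one call.
out_full = linear(x)

# Apply the same layer to each position separately and stack the results.
out_per_position = torch.stack(
    [linear(x[:, t, :]) for t in range(x.size(1))], dim=1
)

print(out_full.shape, torch.allclose(out_full, out_per_position, atol=1e-5))
```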

In [None]:
feed_forward = FeedForward(config.hidden_size, config.intermediate_size, config.hidden_dropout_prob)
ff_outputs = feed_forward(attn_outputs)
ff_outputs.size()

## 2.3 - Adding layer normalization

We now have all the ingredients to create a fully fledged transformer encoder layer. The only decision left to make is where to place the skip connections and layer normalization.

* [**Layer normalization**](https://arxiv.org/abs/1607.06450) normalizes each input in the batch to have zero mean and unity variance. For more information about the different normalization approaches, see this [Medium article](https://towardsdatascience.com/different-normalization-layers-in-deep-learning-1a7214ff71d6).
* [**Skip connections**](https://theaisummer.com/skip-connections/), as the name suggests, skip some layer in the neural network: they pass a tensor on to a later layer without processing it and combine it there with the processed tensor. In general, there are two fundamental ways that one could use skip connections through different non-sequential layers:
    * addition, as in residual architectures (e.g., ResNet)
    * concatenation, as in densely connected architectures (e.g., DenseNet)
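Both ingredients can be sketched in a few lines. The example assumes PyTorch's `nn.LayerNorm` with a single feature size, which computes the statistics over the last dimension (i.e., per token), and uses a stand-in linear sublayer with arbitrary sizes:

```python
import torch
from torch import nn

torch.manual_seed(0)
x = torch.randn(2, 5, 16) * 3.0 + 7.0  # deliberately shifted and scaled

layer_norm = nn.LayerNorm(16)
normed = layer_norm(x)

# After normalization, each token vector has roughly zero mean and unit variance.
print(normed.mean(dim=-1).abs().max().item())
print(normed.var(dim=-1, unbiased=False).mean().item())

# Additive skip connection: the input bypasses the sublayer and is added back.
sublayer = nn.Linear(16, 16)  # stand-in for attention or feed-forward
out = x + sublayer(x)
print(out.shape)
```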

When it comes to placing the layer normalization in the encoder or decoder blocks of a transformer, there are two main choices adopted in the literature:

**Post-layer normalization**

This is the arrangement used in the original Transformer paper; it places layer normalization in between the skip connections. This arrangement is tricky to train from scratch as the gradients can diverge. For this reason, you will often see a concept known as <span style="color:blue"><b>learning rate warm-up</b></span>, where the learning rate is gradually increased from a small value to some maximum value during training.
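A learning rate warm-up can be sketched with PyTorch's `LambdaLR` scheduler. This is a simplified linear warm-up followed by a constant rate, not the exact schedule of the original paper; the stand-in model, the peak learning rate, and `warmup_steps` are hypothetical values chosen for illustration:

```python
import torch
from torch import nn

model = nn.Linear(4, 4)  # stand-in model
optimizer = torch.optim.SGD(model.parameters(), lr=1e-3)  # lr is the peak value

warmup_steps = 10  # hypothetical value chosen for illustration

def linear_warmup(step):
    # Multiplicative factor applied to the peak learning rate.
    return min(1.0, (step + 1) / warmup_steps)

scheduler = torch.optim.lr_scheduler.LambdaLR(optimizer, lr_lambda=linear_warmup)

lrs = []
for _ in range(15):
    lrs.append(optimizer.param_groups[0]["lr"])
    optimizer.step()   # no gradients here; a real loop would call backward() first
    scheduler.step()

print(lrs[0], lrs[9], lrs[14])
```

The learning rate rises over the first `warmup_steps` updates and then stays at its peak.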

**Pre-layer normalization**

This is the most common arrangement found in the literature; it places layer normalization within the span of the skip connections. This tends to be much more stable during training, and it does not usually require any learning rate warm-up.
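The difference between the two arrangements comes down to where the normalization sits relative to the skip connection. A minimal sketch, using a linear layer as a stand-in for the attention or feed-forward sublayer and an arbitrary hidden size:

```python
import torch
from torch import nn

torch.manual_seed(0)
hidden_size = 16
sublayer = nn.Linear(hidden_size, hidden_size)  # stand-in sublayer
layer_norm = nn.LayerNorm(hidden_size)
x = torch.randn(1, 5, hidden_size)

# Post-layer normalization: normalize after adding the skip connection.
post_ln_out = layer_norm(x + sublayer(x))

# Pre-layer normalization: normalize the sublayer input, then add the skip connection.
pre_ln_out = x + sublayer(layer_norm(x))

print(post_ln_out.shape, pre_ln_out.shape)
```

In the pre-layer arrangement the skip path carries `x` through unchanged, which is one intuition for why it tends to train more stably.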

<table>
    <tr>
        <td><img src="images_ch3/layer_normalization.png" title="" alt="" width="500" data-align="center"></td>
    </tr>
</table>

We'll use the second arrangement to build our encoder block as follows:

In [None]:
class TransformerEncoderBlock(nn.Module):
    
    def __init__(self, hidden_size, intermediate_size, hidden_dropout_prob, num_attention_heads):
        super().__init__()
        self.layer_norm_1 = nn.LayerNorm(hidden_size)
        self.layer_norm_2 = nn.LayerNorm(hidden_size)
        self.attention = MultiHeadAttention(hidden_size, num_attention_heads)
        self.feed_forward = FeedForward(hidden_size, intermediate_size, hidden_dropout_prob)
    
    def forward(self, x):
        # Apply layer normalization and then copy input into query, key, value
        hidden_state = self.layer_norm_1(x)
        # Apply attention with a skip connection
        x = x + self.attention(hidden_state)
        # Apply feed-forward layer with a skip connection
        x = x + self.feed_forward(self.layer_norm_2(x))
        return x

In [None]:
encoder_block = TransformerEncoderBlock(config.hidden_size, config.intermediate_size, config.hidden_dropout_prob, config.num_attention_heads)
inputs_embeds.shape, encoder_block(inputs_embeds).size()

We have now implemented our first transformer encoder block from scratch! However, <span style="color:blue"><b>there is a caveat with the way we set up the encoder block: it is completely invariant to the position of the tokens</b></span>. Since the multi-head attention layer is effectively a fancy weighted sum, the information on token position is lost. In fancier terminology, the self-attention and feed-forward layers are said to be permutation equivariant; if the input is permuted then the corresponding output of the layer is permuted in exactly the same way.
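Permutation equivariance is easy to verify numerically. The sketch below uses a self-contained toy attention function with `query = key = value`, as in our simple implementation, and an arbitrary reordering of five positions:

```python
import torch
import torch.nn.functional as F
from math import sqrt

torch.manual_seed(0)

def toy_self_attention(x):
    # Scaled dot-product attention with query = key = value = x.
    scores = torch.bmm(x, x.transpose(1, 2)) / sqrt(x.size(-1))
    weights = F.softmax(scores, dim=-1)
    return torch.bmm(weights, x)

x = torch.randn(1, 5, 8)
perm = torch.tensor([4, 2, 0, 3, 1])  # an arbitrary reordering of the 5 positions

out_then_permute = toy_self_attention(x)[:, perm, :]
permute_then_out = toy_self_attention(x[:, perm, :])

# Shuffling the input only shuffles the output in the same way.
print(torch.allclose(out_then_permute, permute_then_out, atol=1e-5))
```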

Luckily, there is an easy trick to incorporate positional information using <span style="color:blue"><b>positional embeddings</b></span>.

## 2.4 - Positional embeddings

Positional embeddings are based on a simple, yet very effective, idea: augment the token embeddings with a position-dependent pattern of values arranged in a vector. If the pattern is characteristic for each position, the attention heads and feed-forward layers in each stack can learn to incorporate positional information into their transformations.

There are several ways to achieve this, and one of the most popular approaches is to use a learnable pattern, especially when the pretraining dataset is sufficiently large. This works exactly the same way as the token embeddings, but using the position index instead of the token ID as input. With that approach, an efficient way of encoding the positions of tokens is learned during pretraining.

In [None]:
seq_length = inputs.input_ids.size(1)
position_ids = torch.arange(seq_length, dtype=torch.long).unsqueeze(0)
position_ids

Let's create a custom `Embeddings` module that combines token embeddings and positional embeddings by summing them:

In [None]:
class Embeddings(nn.Module):
    
    def __init__(self, vocab_size, max_position_embeddings, hidden_size ):
        super().__init__()
        self.token_embeddings = nn.Embedding(vocab_size, hidden_size)
        self.position_embeddings = nn.Embedding(max_position_embeddings, hidden_size)
        self.layer_norm = nn.LayerNorm(hidden_size, eps=1e-12)
        self.dropout = nn.Dropout()
    
    def forward(self, input_ids):
        # Create position IDs for input sequence
        seq_length = input_ids.size(1)
        position_ids = torch.arange(seq_length, dtype=torch.long, device=input_ids.device).unsqueeze(0)
        # Create token and position embeddings
        token_embeddings = self.token_embeddings(input_ids)
        position_embeddings = self.position_embeddings(position_ids)
        # Combine token and position embeddings
        embeddings = token_embeddings + position_embeddings
        embeddings = self.layer_norm(embeddings)
        embeddings = self.dropout(embeddings)
        return embeddings

embedding_layer = Embeddings(config.vocab_size, config.max_position_embeddings, config.hidden_size)
embedding_layer(inputs.input_ids).size()

While learnable position embeddings are easy to implement and widely used, they have drawbacks for long sequences: the model can only represent positions up to `max_position_embeddings`, and positions that rarely occur in the pretraining data are learned poorly. For this reason, [there are alternative approaches](https://rutvik-trivedi.github.io/blog/nlp/positional-embeddings.html).

**Absolute positional representations**

Transformer models can use static patterns consisting of modulated sine and cosine signals to encode the positions of the tokens. Why use sinusoidal functions? Because an appropriate encoding function should satisfy the following criteria:

1. It should output a unique encoding for each position. The first token (the second, the third, etc.) in any sequence should have the same encoding irrespective of the sequence length.
2. The distance between any two positions should be the same irrespective of the length of the sequence. Thus, the difference between the first and second token in a sequence of length 5 should be the same as the difference between the first and second token in a sequence of length 100.
3. The values of the function should be bounded, so using the raw indices 1, 2, 3, ... will not work for long sequences.
4. It must be deterministic.
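A sketch of such a sinusoidal pattern, following the sine/cosine formulation of Vaswani et al. (2017), is shown below. The helper name `sinusoidal_positions` is our own, and the function assumes an even `dim`:

```python
import math
import torch

def sinusoidal_positions(seq_len, dim):
    # Assumes an even `dim`. Even feature indices use sine, odd ones cosine,
    # with wavelengths forming a geometric progression.
    position = torch.arange(seq_len, dtype=torch.float32).unsqueeze(1)  # [seq_len, 1]
    div_term = torch.exp(
        torch.arange(0, dim, 2, dtype=torch.float32) * (-math.log(10000.0) / dim)
    )  # [dim / 2]
    pe = torch.zeros(seq_len, dim)
    pe[:, 0::2] = torch.sin(position * div_term)
    pe[:, 1::2] = torch.cos(position * div_term)
    return pe

pe_short = sinusoidal_positions(5, 16)
pe_long = sinusoidal_positions(100, 16)

print(pe_short.shape, pe_long.abs().max().item())
```

The encoding of the first five positions is the same for both sequence lengths, all values stay within $[-1, 1]$, and calling the function twice gives identical results, in line with the criteria listed above.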

**Relative positional representations**

Although absolute positions are important, one can argue that when computing an embedding, the surrounding tokens are most important. Relative positional representations follow that intuition and encode the relative positions between tokens. <span style="color:blue">This cannot be set up by just introducing a new relative embedding layer at the beginning, since the relative embedding changes for each token depending on where in the sequence we are attending from.</span> <span style="color:blue"><b>Instead the attention mechanism itself is modified with additional terms that take the relative position between tokens into account</b></span>.
    
[More information on relative positional embeddings](https://theaisummer.com/positional-embeddings/).
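As a rough illustration, here is one simple variant: a learned scalar bias per clipped relative offset, added to the attention scores before the softmax. This is only one of several schemes and is not claimed to match any specific model; the sizes and the names `max_distance` and `relative_bias` are hypothetical:

```python
import torch
from torch import nn
import torch.nn.functional as F
from math import sqrt

torch.manual_seed(0)
seq_len, dim_k, max_distance = 5, 8, 3  # hypothetical sizes

# Relative offset j - i between key position j and query position i,
# clipped to [-max_distance, max_distance] and shifted to a valid index.
positions = torch.arange(seq_len)
relative = positions[None, :] - positions[:, None]  # [seq_len, seq_len]
relative_index = relative.clamp(-max_distance, max_distance) + max_distance

# One learnable scalar bias per clipped offset (randomly initialized here).
relative_bias = nn.Embedding(2 * max_distance + 1, 1)
bias = relative_bias(relative_index).squeeze(-1)  # [seq_len, seq_len]

query = key = value = torch.randn(1, seq_len, dim_k)
scores = torch.bmm(query, key.transpose(1, 2)) / sqrt(dim_k)
weights = F.softmax(scores + bias, dim=-1)  # bias broadcasts over the batch
outputs = torch.bmm(weights, value)

print(relative_index)
print(outputs.shape)
```

Pairs of tokens with the same offset share the same bias, regardless of where they sit in the sequence.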

## 2.5 - Putting the encoder together

In [None]:
class TransformerEncoder(nn.Module):
    
    def __init__(self, 
                 vocab_size, 
                 max_position_embeddings, 
                 hidden_size, 
                 intermediate_size, 
                 hidden_dropout_prob, 
                 num_attention_heads):
        super().__init__()
        self.embeddings = Embeddings(vocab_size, max_position_embeddings, hidden_size)
        self.layers = nn.ModuleList([TransformerEncoderBlock(hidden_size, 
                                                             intermediate_size, 
                                                             hidden_dropout_prob, 
                                                             num_attention_heads)
                                     for _ in range(config.num_hidden_layers)])

    def forward(self, x):
        x = self.embeddings(x)
        for layer in self.layers:
            x = layer(x)
        return x
    
encoder = TransformerEncoder(config.vocab_size,
                             config.max_position_embeddings,
                             config.hidden_size,
                             config.intermediate_size,
                             config.hidden_dropout_prob,
                             config.num_attention_heads)
encoder(inputs.input_ids).size()

## 2.6 - Adding a classification head

Transformer models are usually divided into a task-independent body and a task-specific head. What we have built so far is the body, so if we wish to build a text classifier, we will need to attach a classification head to that body.

The issue is that we have a hidden state for each token in the sequence, but we only need to make one prediction. To solve this, the first token in such models is traditionally used for the prediction. In addition, we can attach a dropout and a linear layer to make the classification prediction. The following class extends the existing encoder for sequence classification:

In [None]:
class TransformerForSequenceClassification(nn.Module):
    
    def __init__(self, 
                 vocab_size, 
                 max_position_embeddings, 
                 hidden_size, 
                 intermediate_size, 
                 hidden_dropout_prob, 
                 num_attention_heads,
                 num_labels):
        super().__init__()
        self.encoder = TransformerEncoder(vocab_size,
                                          max_position_embeddings,
                                          hidden_size,
                                          intermediate_size,
                                          hidden_dropout_prob,
                                          num_attention_heads)
        self.dropout = nn.Dropout(hidden_dropout_prob)
        self.classifier = nn.Linear(hidden_size, num_labels)
    
    def forward(self, x):
        x = self.encoder(x)[:, 0, :] # select hidden state of [CLS] token
        x = self.dropout(x)
        x = self.classifier(x)
        return x
    
config.num_labels = 3
encoder_classifier = TransformerForSequenceClassification(config.vocab_size,
                                                          config.max_position_embeddings,
                                                          config.hidden_size,
                                                          config.intermediate_size,
                                                          config.hidden_dropout_prob,
                                                          config.num_attention_heads,
                                                          config.num_labels)
encoder_classifier(inputs.input_ids).size()

# 3 - The decoder

The main difference between the decoder and the encoder is that the decoder has **two** attention sublayers:

#### Masked multi-head self-attention layer

<span style="color:blue">Ensures that the tokens we generate at each timestep are based only on the past outputs and the current token being predicted</span>. Without this, the decoder could cheat during training by simply copying the target translations; masking the inputs ensures the task is not trivial.

----

<mark>Note:</mark> Remember that in a traditional encoder-decoder architecture (e.g., translation task), the decoder receives the "target sequence".

----

#### Encoder-decoder attention

<span style="color:blue">Performs multi-head attention over the output *key* and *value* vectors of the encoder stack, with the intermediate representations of the decoder acting as the *query* vectors</span>. This way the encoder-decoder attention layer learns how to relate tokens from two different sequences, such as two different languages.

----

<mark>Note:</mark> Unlike the self-attention layer, the *key* and *query* vectors in encoder-decoder attention can have different lengths. This is because the encoder and decoder inputs will generally involve sequences of differing length. As a result, the matrix of attention scores in this layer is rectangular, not square.

----
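
To make the rectangular score matrix concrete, here is a minimal sketch. The sequence lengths, the head dimension, and the random tensors are assumptions chosen for illustration only; in a real model the queries would come from the decoder and the keys and values from the encoder output.

```python
import torch
import torch.nn.functional as F
from math import sqrt

torch.manual_seed(0)

batch_size, head_dim = 1, 8
src_len, tgt_len = 7, 4  # hypothetical encoder and decoder sequence lengths

# Queries stand in for decoder states; keys and values stand in for encoder outputs
query = torch.randn(batch_size, tgt_len, head_dim)
key = torch.randn(batch_size, src_len, head_dim)
value = torch.randn(batch_size, src_len, head_dim)

scores = torch.bmm(query, key.transpose(1, 2)) / sqrt(head_dim)
weights = F.softmax(scores, dim=-1)
context = torch.bmm(weights, value)

print(scores.shape)   # torch.Size([1, 4, 7]): one row per decoder token, one column per encoder token
print(context.shape)  # torch.Size([1, 4, 8]): one context vector per decoder token
```

Each decoder position ends up with a weighted average over all encoder positions, so the output has the decoder's length regardless of how long the source sequence is.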

<img src="images/decoder_block.png" title="" alt="" width="500" data-align="center">

Let's take a look at the modifications we need to make to include masking in our self-attention layer. The trick with masked self-attention is to introduce a mask matrix with ones on and below the diagonal and zeros above. For that, we can use the `torch.tril()` function:

In [None]:
seq_len = inputs.input_ids.size(-1)
mask = torch.tril(torch.ones(seq_len, seq_len)).unsqueeze(0)
mask[0]

We can prevent each attention head from peeking at future tokens by using `Tensor.masked_fill()` to replace the scores at every position where the mask is zero with negative infinity. By setting these upper values to negative infinity, we guarantee that the attention weights for future tokens are all zero once we take the softmax over the scores because $e^{-\infty} = 0$ (recall that softmax calculates the normalized exponential).

In [None]:
scores.masked_fill(mask == 0, -float("inf"))
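
As a quick sanity check of that claim, the following self-contained sketch applies the mask to a random score matrix and takes the softmax. The sequence length and the random scores are stand-ins for illustration, not values from the chapter.

```python
import torch
import torch.nn.functional as F

torch.manual_seed(0)

seq_len = 4  # hypothetical sequence length
scores = torch.randn(1, seq_len, seq_len)  # stand-in for real attention scores
mask = torch.tril(torch.ones(seq_len, seq_len)).unsqueeze(0)

masked_scores = scores.masked_fill(mask == 0, float("-inf"))
weights = F.softmax(masked_scores, dim=-1)

# Everything above the diagonal is exactly zero, and each row still sums to one
print(weights[0])
```

The first row contains a single weight of 1.0, since the first token can only attend to itself.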

We can easily include this masking behaviour with a small change to our scaled dot-product attention function that we implemented earlier:

----

```python
def scaled_dot_product_attention(query, key, value):
    dim_k = query.size(-1)
    scores = torch.bmm(query, key.transpose(1, 2)) / sqrt(dim_k)
    weights = F.softmax(scores, dim=-1)
    return torch.bmm(weights, value)

```

----

In [None]:
def scaled_dot_product_attention(query, key, value, mask=None):
    dim_k = query.size(-1)
    scores = torch.bmm(query, key.transpose(1, 2)) / sqrt(dim_k)
    if mask is not None:
        scores = scores.masked_fill(mask == 0, float("-inf"))
    weights = F.softmax(scores, dim=-1)
    return weights.bmm(value)

From here it is a simple matter to build up the decoder block. We can take inspiration from the excellent implementations of [minGPT](https://github.com/karpathy/minGPT) or [nanoGPT](https://github.com/karpathy/nanoGPT) by Andrej Karpathy. [He has a great video explaining the process](https://www.youtube.com/watch?v=kCc8FmEb1nY).
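
As a rough sketch of how the pieces could fit together, the block below stacks masked self-attention, encoder-decoder attention, and a feed-forward layer. It is not the implementation from minGPT or nanoGPT: the class name `DecoderBlockSketch` is hypothetical, it uses PyTorch's built-in `nn.MultiheadAttention` instead of the attention classes built in this chapter, and it assumes pre-layer normalization with residual connections.

```python
import torch
from torch import nn


class DecoderBlockSketch(nn.Module):
    """Hypothetical decoder block: masked self-attention, cross-attention, feed-forward."""

    def __init__(self, hidden_size, intermediate_size, num_attention_heads, dropout_prob=0.1):
        super().__init__()
        self.norm_1 = nn.LayerNorm(hidden_size)
        self.norm_2 = nn.LayerNorm(hidden_size)
        self.norm_3 = nn.LayerNorm(hidden_size)
        self.self_attention = nn.MultiheadAttention(
            hidden_size, num_attention_heads, dropout=dropout_prob, batch_first=True)
        self.cross_attention = nn.MultiheadAttention(
            hidden_size, num_attention_heads, dropout=dropout_prob, batch_first=True)
        self.feed_forward = nn.Sequential(
            nn.Linear(hidden_size, intermediate_size),
            nn.GELU(),
            nn.Linear(intermediate_size, hidden_size),
            nn.Dropout(dropout_prob),
        )

    def forward(self, x, encoder_output):
        tgt_len = x.size(1)
        # Boolean mask: True marks positions that must NOT be attended to (future tokens)
        causal_mask = torch.triu(
            torch.ones(tgt_len, tgt_len, dtype=torch.bool, device=x.device), diagonal=1)
        # 1. Masked self-attention over the decoder inputs
        h = self.norm_1(x)
        x = x + self.self_attention(h, h, h, attn_mask=causal_mask, need_weights=False)[0]
        # 2. Encoder-decoder attention: queries from the decoder, keys/values from the encoder
        h = self.norm_2(x)
        x = x + self.cross_attention(h, encoder_output, encoder_output, need_weights=False)[0]
        # 3. Position-wise feed-forward layer
        x = x + self.feed_forward(self.norm_3(x))
        return x


torch.manual_seed(0)
block = DecoderBlockSketch(hidden_size=32, intermediate_size=64, num_attention_heads=4).eval()
decoder_inputs = torch.randn(1, 5, 32)  # (batch, target length, hidden size)
encoder_output = torch.randn(1, 7, 32)  # (batch, source length, hidden size)
output = block(decoder_inputs, encoder_output)
print(output.size())  # torch.Size([1, 5, 32])
```

Note that `nn.MultiheadAttention` uses the opposite mask convention from the `torch.tril()` mask above: in its boolean `attn_mask`, `True` means "blocked", which is why the sketch builds the mask with `torch.triu(..., diagonal=1)`.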

# 4 - Demystifying Encoder-Decoder attention

Let's propose a simple analogy to help understand the <span style="color:blue"><b>encoder-decoder attention process</b></span>. 

Imagine you (<span style="color:blue">the decoder</span>) are in class taking an exam. Given a sentence, your task is to predict the next word based on the previous words (decoder inputs), which sounds simple but is incredibly hard. Fortunately, your mate (<span style="color:blue">the encoder</span>) has the full text. Unfortunately, they are a foreign student and the text is in their mother tongue. Cunning student that you are, you figure out a way to cheat anyway. You draw a little cartoon illustrating the text you already have (<span style="color:blue">the query</span>) and give it to your mate. They try to figure out which sentence matches that description (<span style="color:blue">the key</span>), draw a cartoon describing the word following that sentence (<span style="color:blue">the value</span>), and pass that back to you. With this system in place, you ace the exam.

# 5 - Transformer architectures

Over time, each of the three main architectures (i.e., encoders, decoders, and encoder-decoders) has undergone an evolution of its own. With over 50 different architectures included in 🤗 Transformers, this family tree by no means provides a complete overview of all the ones that exist: it simply highlights a few of the architectural milestones:

<table>
    <tr>
        <td><img src="images_ch3/transformers_tree.png" title="" alt="" width="500" data-align="center"></td>
    </tr>
</table>

We have covered the original Transformer architecture, so let's take a closer look at some of the key descendants:

## 5.1 - The encoder branch


The first encoder-only model based on the Transformer architecture was BERT. At the time it was published, it outperformed all the state-of-the-art models on the popular [GLUE benchmark](https://arxiv.org/abs/1804.07461), which measures natural language understanding (NLU) across several tasks of varying difficulty. Subsequently, the pretraining objective and the architecture of BERT have been adapted to further improve performance. As of 2022, encoder-only models still dominate research and industry on NLU tasks such as text classification, named entity recognition, and question answering. Let's take a brief look at the BERT model and its variants:

**BERT** ([Devlin et al., 2018](https://arxiv.org/abs/1810.04805))

BERT is pretrained with the two objectives of predicting masked tokens in texts and determining if one text passage is likely to follow another. The former task is called *masked language modeling* (MLM) and the latter *next sentence prediction* (NSP).

**DistilBERT** ([Sanh et al., 2019](https://arxiv.org/abs/1910.01108))

Although BERT delivers great results, its size can make it tricky to deploy in environments where low latencies are required. By using a technique known as knowledge distillation during pretraining, DistilBERT achieves 97% of BERT's performance while using 40% less memory and being 60% faster.
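
To give a feel for what knowledge distillation optimizes, here is a minimal sketch of a generic soft-target loss, in which the student is trained to match the teacher's softened output distribution. This is an illustration of the general technique, not DistilBERT's full training objective; the function name, the temperature value, and the random logits are all assumptions.

```python
import torch
import torch.nn.functional as F


def distillation_loss(student_logits, teacher_logits, temperature=2.0):
    """Generic soft-target loss: KL divergence between softened teacher and student distributions."""
    student_log_probs = F.log_softmax(student_logits / temperature, dim=-1)
    teacher_probs = F.softmax(teacher_logits / temperature, dim=-1)
    # Scaling by temperature**2 keeps gradient magnitudes comparable across temperatures
    return F.kl_div(student_log_probs, teacher_probs, reduction="batchmean") * temperature ** 2


torch.manual_seed(0)
teacher_logits = torch.randn(8, 100)  # hypothetical: 8 positions, vocabulary of 100 tokens
student_logits = torch.randn(8, 100)

print(distillation_loss(student_logits, teacher_logits))  # positive when the distributions differ
print(distillation_loss(teacher_logits, teacher_logits))  # approximately zero when they match
```

A higher temperature flattens both distributions, so the student also learns from the relative probabilities the teacher assigns to unlikely tokens.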

**RoBERTa** ([Liu et al., 2019](https://arxiv.org/abs/1907.11692))

A study following the release of BERT revealed that its performance can be further improved by modifying the pretraining scheme. RoBERTa is trained longer, on larger batches with more training data, and it drops the NSP task. Together these changes significantly improve its performance compared to the original BERT model.

**XLM** ([Lample and Conneau, 2019](https://arxiv.org/abs/1901.07291))

Several pretraining objectives for building **multilingual models** were explored in the work on the cross-lingual language model (XLM), including the autoregressive language modeling from GPT-like models and MLM from BERT. In addition, the authors introduced translation language modeling (TLM), which is an extension of MLM to multiple language inputs. Experimenting with these pretraining tasks, they achieved state-of-the-art results on several multilingual NLU benchmarks as well as on translation tasks.

**XLM-RoBERTa** ([Conneau et al., 2019](https://arxiv.org/abs/1911.02116))

Following the work of XLM and RoBERTa, the XLM-RoBERTa or XLM-R model takes **multilingual** pretraining one step further by massively upscaling the training data. Using the Common Crawl corpus, its developers created a dataset with 2.5 terabytes of text; they then trained an encoder with MLM on this dataset. Since the dataset only contains data without parallel texts (i.e., translations), the TLM objective of XLM was dropped. This approach beats XLM and multilingual BERT variants by a large margin, especially on low-resource languages.

**ALBERT** ([Lan et al., 2020](https://arxiv.org/abs/1909.11942))

The ALBERT model introduced three changes to make the encoder architecture more efficient. First, it decouples the token embedding dimension from the hidden dimension, thus allowing the embedding dimension to be small and thereby saving parameters, especially when the vocabulary gets large. Second, all layers share the same parameters, which decreases the number of effective parameters even further. Finally, the NSP objective is replaced with a sentence-ordering prediction: the model needs to predict whether or not the order of two consecutive sentences was swapped rather than predicting if they belong together at all. These changes make it possible to train even larger models with fewer parameters and reach superior performance on NLU tasks.
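
The parameter saving from decoupling the embedding dimension can be checked with simple arithmetic. The sizes below are assumptions picked for illustration (a 30,000-token vocabulary, hidden size 768, embedding size 128), not figures quoted from this chapter.

```python
vocab_size = 30_000   # assumed vocabulary size
hidden_size = 768     # assumed hidden dimension
embedding_size = 128  # assumed (smaller) embedding dimension

# Standard embedding: one hidden_size vector per vocabulary entry
standard_params = vocab_size * hidden_size
# Factorized embedding: a small lookup table followed by a projection up to hidden_size
factorized_params = vocab_size * embedding_size + embedding_size * hidden_size

print(standard_params)    # 23040000
print(factorized_params)  # 3938304
```

Under these assumptions the embedding layer shrinks by roughly a factor of six, and the gap widens as the vocabulary grows because only the `vocab_size * embedding_size` term depends on it.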

**ELECTRA** ([Clark et al., 2020](https://arxiv.org/abs/2003.10555))

One limitation of the standard MLM pretraining objective is that at each training step only the representations of the masked tokens are updated, while the other input tokens are not. To address this issue, ELECTRA uses a two-model approach: the first model (which is typically small) works like a standard masked language model and predicts masked tokens. The second model, called the *discriminator*, is then tasked to predict which of the tokens in the first model's output were replaced, that is, which ones differ from the original input. Therefore the discriminator needs to make a binary classification for every token, so every position contributes to the training signal, which makes training more efficient. For downstream tasks, the discriminator is fine-tuned like a standard BERT model.
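
The per-token binary targets for the discriminator can be sketched in a couple of lines. In ELECTRA's replaced token detection setup, a token is labeled 1 if the first model's output differs from the original input at that position and 0 otherwise. The token ids below are made up for illustration.

```python
import torch

# Hypothetical token ids for the original input sequence
original_ids = torch.tensor([[101, 2023, 3185, 2001, 2307, 102]])
# Hypothetical output of the first model: positions 2 and 4 were masked;
# it restored position 4 correctly but put a different token at position 2
generator_ids = torch.tensor([[101, 2023, 2143, 2001, 2307, 102]])

# One binary label per token: 1 = replaced, 0 = same as the original
labels = (generator_ids != original_ids).long()
print(labels)  # tensor([[0, 0, 1, 0, 0, 0]])
```

Because every position gets a label, the discriminator receives a learning signal from the whole sequence rather than only from the masked positions.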

**DeBERTa** ([He et al., 2020](https://arxiv.org/abs/2006.03654))

The DeBERTa model introduces two architectural changes. First, each token is represented as two vectors: one for the content, the other for relative position. By disentangling the tokens' content from their relative positions, the self-attention layers can better model the dependency of nearby token pairs. On the other hand, the absolute position of a word is also important, especially for decoding. For this reason, an absolute position embedding is added just before the softmax layer of the token decoding head. DeBERTa is the first model (as an ensemble) to beat the human baseline on the [SuperGLUE benchmark](https://arxiv.org/abs/1905.00537), a more difficult version of GLUE consisting of several subtasks used to measure NLU performance.

## 5.2 - The decoder branch

The progress on transformer decoder models has been spearheaded to a large extent by OpenAI. These models are exceptionally good at predicting the next word in a sequence and are thus mostly used for text generation tasks. Their progress has been fueled by using larger datasets and scaling the language models to larger and larger sizes. Let's take a look at some of these models:

**GPT** ([Radford et al., 2018](https://openai.com/blog/language-unsupervised/))

The introduction of GPT combined two key ideas in NLP: the novel and efficient transformer decoder architecture, and transfer learning. In that setup, the model was pretrained by predicting the next word based on the previous ones. The model was trained on the BookCorpus and achieved great results on downstream tasks such as classification.

**GPT-2** ([Radford et al., 2019](https://openai.com/blog/better-language-models/))

Inspired by the success of the simple and scalable pretraining approach, the original model and training set were upscaled to produce GPT-2. This model is able to produce long sequences of coherent text. Due to concerns about possible misuse (ha!), the model was released in a staged fashion, with smaller models being published first and the full model later.

**CTRL** ([Keskar et al., 2019](https://arxiv.org/abs/1909.05858))

Models like GPT-2 can continue an input sequence (also called a prompt). However, the user has little control over the style of the generated sequence. The Conditional Transformer Language (CTRL) model addresses this issue by adding "control tokens" at the beginning of the sequence. These allow the style of the generated text to be controlled, which allows for diverse generation.

**GPT-3** ([Brown et al., 2020](https://arxiv.org/abs/2005.14165))

Following the success of scaling GPT up to GPT-2, a thorough analysis of the behaviour of language models at different scales revealed that [there are simple "power laws" that govern the relation between compute, data size, model size, and the performance of the language model](https://arxiv.org/abs/2001.08361). Inspired by these insights, GPT-2 was upscaled by a factor of 100 to yield GPT-3, with 175 billion parameters. Besides being able to generate impressively realistic text passages, the model also exhibits few-shot learning capabilities: with a few examples of a novel task such as translating text to code, the model is able to accomplish the task on new examples. OpenAI has not open-sourced this model, but provides an interface through the OpenAI API.

**GPT-Neo/GPT-J-6B** ([Black et al., 2021](https://zenodo.org/record/5297715) / [Wang and Komatsuzaki, 2021](https://github.com/kingoflolz/mesh-transformer-jax))

GPT-Neo and GPT-J-6B are GPT-like models that were trained by [EleutherAI](https://www.eleuther.ai/), a collective of researchers who aim to re-create and release GPT-3 scale models. The current models are smaller variants of the full 175-billion-parameter model, with 1.3, 2.7, 6, and 20 billion parameters, and are competitive with the smaller GPT-3 models OpenAI offers.

## 5.3 - The encoder-decoder branch

Although it has become common to build models using a single encoder or decoder stack, there are several encoder-decoder variants of the Transformer architecture that have novel applications across both NLU and NLG domains:

**T5** ([Raffel et al., 2019](https://arxiv.org/abs/1910.10683))

The T5 model unifies all NLU and NLG tasks by converting them into text-to-text tasks. All tasks are framed as sequence-to-sequence tasks, where adopting an encoder-decoder architecture is natural. For example, for text classification problems, this means that text is used as the encoder input and the decoder has to generate the label as normal text instead of a class. We will look at this in more detail in Chapter 6. The T5 architecture uses the original Transformer architecture. Using the large crawled C4 dataset, the model is pretrained with masked language modeling as well as the SuperGLUE tasks by translating all of them to text-to-text tasks. The largest model with 11 billion parameters yielded state-of-the-art results on several benchmarks.

**BART** ([Lewis et al., 2019](https://arxiv.org/abs/1910.13461))

BART combines the pretraining procedures of BERT and GPT within the encoder-decoder architecture. The input sequences undergo one of several possible transformations, from simple masking to sentence permutation, token deletion, and document rotation. These modified inputs are passed through the encoder, and the decoder has to reconstruct the original texts. This makes the model more flexible as it is possible to use it for NLU as well as NLG tasks, and it achieves state-of-the-art performance at both.

**M2M-100** ([Fan et al., 2020](https://arxiv.org/abs/2010.11125))

A conventional translation model is built for one language pair and translation direction. Naturally, this does not scale to many languages, and in addition there might be shared knowledge between language pairs that could be leveraged for translation between rare languages. M2M-100 is the first translation model that can translate between any of 100 languages. This allows for high-quality translations between rare and underrepresented languages. The model uses prefix tokens (similar to the special `[CLS]` token) to indicate the source and target language.

**BigBird** ([Zaheer et al., 2020](https://arxiv.org/abs/2007.14062))

One main limitation of transformer models is the maximum context size, due to the quadratic memory requirements of the attention mechanism. BigBird addresses this issue by using a sparse form of attention that scales linearly. This allows for the drastic scaling of context from 512 tokens in most BERT models to 4096 in BigBird. This is especially useful in cases where long dependencies need to be conserved, such as in text summarization.
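
To see why sparsity changes the scaling, the sketch below counts how many query-key pairs are computed under full attention versus a sliding-window pattern. This covers only a local-window component; BigBird combines it with global and random attention, which are omitted here. The helper name and the window size are assumptions for illustration.

```python
import torch


def sliding_window_mask(seq_len, window):
    """Boolean mask where each token may attend to neighbors within `window` positions."""
    positions = torch.arange(seq_len)
    return (positions[None, :] - positions[:, None]).abs() <= window


for seq_len in (128, 256, 512, 1024):
    full_entries = seq_len * seq_len  # grows quadratically with sequence length
    sparse_entries = int(sliding_window_mask(seq_len, window=3).sum())  # grows linearly
    print(seq_len, full_entries, sparse_entries)
```

Doubling the sequence length quadruples the number of entries in the full score matrix but only about doubles the number of allowed pairs in the windowed pattern.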