<h1>Table of Contents<span class="tocSkip"></span></h1>
<div class="toc"><ul class="toc-item"><li><span><a href="#Implement-Encoder" data-toc-modified-id="Implement-Encoder-1"><span class="toc-item-num">1&nbsp;&nbsp;</span>Implement Encoder</a></span><ul class="toc-item"><li><span><a href="#Load-embedding-matrix-from-bert-base-uncased" data-toc-modified-id="Load-embedding-matrix-from-bert-base-uncased-1.1"><span class="toc-item-num">1.1&nbsp;&nbsp;</span>Load embedding matrix from bert-base-uncased</a></span></li><li><span><a href="#Run-the-input-text-to-this-embedding-matrix" data-toc-modified-id="Run-the-input-text-to-this-embedding-matrix-1.2"><span class="toc-item-num">1.2&nbsp;&nbsp;</span>Run the input text to this embedding matrix</a></span></li><li><span><a href="#Q@K.T-/-sqrt(dim_k)" data-toc-modified-id="Q@K.T-/-sqrt(dim_k)-1.3"><span class="toc-item-num">1.3&nbsp;&nbsp;</span>Q@K.T / sqrt(dim_k)</a></span></li><li><span><a href="#Add-softmax" data-toc-modified-id="Add-softmax-1.4"><span class="toc-item-num">1.4&nbsp;&nbsp;</span>Add softmax</a></span></li><li><span><a href="#Multiply-Value-matrix" data-toc-modified-id="Multiply-Value-matrix-1.5"><span class="toc-item-num">1.5&nbsp;&nbsp;</span>Multiply Value matrix</a></span></li><li><span><a href="#Single-attention-head" data-toc-modified-id="Single-attention-head-1.6"><span class="toc-item-num">1.6&nbsp;&nbsp;</span>Single attention head</a></span></li><li><span><a href="#Multihead-attention" data-toc-modified-id="Multihead-attention-1.7"><span class="toc-item-num">1.7&nbsp;&nbsp;</span>Multihead attention</a></span></li><li><span><a href="#Feed-forward" data-toc-modified-id="Feed-forward-1.8"><span class="toc-item-num">1.8&nbsp;&nbsp;</span>Feed forward</a></span></li><li><span><a href="#Layer-norm" data-toc-modified-id="Layer-norm-1.9"><span class="toc-item-num">1.9&nbsp;&nbsp;</span>Layer norm</a></span></li><li><span><a href="#Put-things-together" 
data-toc-modified-id="Put-things-together-1.10"><span class="toc-item-num">1.10&nbsp;&nbsp;</span>Put things together</a></span></li><li><span><a href="#Positional-encoding" data-toc-modified-id="Positional-encoding-1.11"><span class="toc-item-num">1.11&nbsp;&nbsp;</span>Positional encoding</a></span></li><li><span><a href="#Put-everything-together-(with-learnable-positional-encoding)" data-toc-modified-id="Put-everything-together-(with-learnable-positional-encoding)-1.12"><span class="toc-item-num">1.12&nbsp;&nbsp;</span>Put everything together (with learnable positional encoding)</a></span></li><li><span><a href="#Classification-head" data-toc-modified-id="Classification-head-1.13"><span class="toc-item-num">1.13&nbsp;&nbsp;</span>Classification head</a></span></li></ul></li><li><span><a href="#Implement-Decoder" data-toc-modified-id="Implement-Decoder-2"><span class="toc-item-num">2&nbsp;&nbsp;</span>Implement Decoder</a></span><ul class="toc-item"><li><span><a href="#Mask-matrix" data-toc-modified-id="Mask-matrix-2.1"><span class="toc-item-num">2.1&nbsp;&nbsp;</span>Mask matrix</a></span></li></ul></li><li><span><a href="#Bertviz-visualization-(again)" data-toc-modified-id="Bertviz-visualization-(again)-3"><span class="toc-item-num">3&nbsp;&nbsp;</span>Bertviz visualization (again)</a></span></li><li><span><a href="#Transformer-family-tree" data-toc-modified-id="Transformer-family-tree-4"><span class="toc-item-num">4&nbsp;&nbsp;</span>Transformer family tree</a></span><ul class="toc-item"><li><span><a href="#Encoder-branch" data-toc-modified-id="Encoder-branch-4.1"><span class="toc-item-num">4.1&nbsp;&nbsp;</span>Encoder branch</a></span></li><li><span><a href="#The-decoder-branch" data-toc-modified-id="The-decoder-branch-4.2"><span class="toc-item-num">4.2&nbsp;&nbsp;</span>The decoder branch</a></span></li><li><span><a href="#Both-encoder-and-decoder" data-toc-modified-id="Both-encoder-and-decoder-4.3"><span class="toc-item-num">4.3&nbsp;&nbsp;</span>Both 
encoder and decoder</a></span></li></ul></li></ul></div>

In [1]:
from transformers import AutoTokenizer
from bertviz.transformers_neuron_view import BertModel
from bertviz.neuron_view import show

Since BertViz needs to tap into the attention layers of the model, we’ll instantiate our BERT checkpoint with the model class from BertViz and then use the show() function to generate the interactive visualization.

In [2]:
model_ckpt = "bert-base-uncased"
tokenizer = AutoTokenizer.from_pretrained(model_ckpt)
model = BertModel.from_pretrained(model_ckpt)


In [4]:
# text = "time flies like an arrow"
text = 'I had an enjoyable two weeks in London'
show(model, "bert", tokenizer, text, display_mode="light", layer=0, head=8)


In this visualization, the values of the query and key vectors are represented as vertical bands, where the intensity of each band corresponds to the magnitude. The connecting lines are weighted according to the attention between the tokens.

# Implement Encoder

In [3]:
text = "time flies like an arrow"


In [4]:
inputs = tokenizer(text, return_tensors="pt", add_special_tokens=False)
# add_special_tokens=False excludes the [CLS] and [SEP] tokens
inputs.input_ids

tensor([[ 2051, 10029,  2066,  2019,  8612]])

## Load embedding matrix from bert-base-uncased

In [5]:
from torch import nn
from transformers import AutoConfig

config = AutoConfig.from_pretrained(model_ckpt)
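# Note: this nn.Embedding is randomly initialized; only the vocabulary size and
# hidden size are taken from the bert-base-uncased config, not the trained weights.
token_emb = nn.Embedding(config.vocab_size, config.hidden_size)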
token_emb

Embedding(30522, 768)

## Run the input text to this embedding matrix

In [6]:
inputs_embeds = token_emb(inputs.input_ids)
inputs_embeds.size() # [batch_size, seq_len, hidden_dim]

torch.Size([1, 5, 768])

## Q@K.T / sqrt(dim_k)

In [7]:
import torch
from math import sqrt

query = key = value = inputs_embeds
dim_k = key.size(-1)

In [8]:
query.shape, key.transpose(1,2).shape

(torch.Size([1, 5, 768]), torch.Size([1, 768, 5]))

In [9]:
scores = torch.bmm(query, key.transpose(1,2)) / sqrt(dim_k)
scores.size()

torch.Size([1, 5, 5])

## Add softmax

In [10]:
import torch.nn.functional as F

weights = F.softmax(scores, dim=-1)
weights.sum(dim=-1)

tensor([[1., 1., 1., 1., 1.]], grad_fn=<SumBackward1>)

## Multiply Value matrix

In [11]:
attn_outputs = torch.bmm(weights, value)
attn_outputs.shape

torch.Size([1, 5, 768])

Putting these steps into a single function:

In [12]:
def scaled_dot_product_attention(query, key, value):
    dim_k = query.size(-1)
    scores = torch.bmm(query, key.transpose(1, 2)) / sqrt(dim_k)
    weights = F.softmax(scores, dim=-1)
    return torch.bmm(weights, value)

## Single attention head

In [13]:
class AttentionHead(nn.Module):
    def __init__(self, embed_dim, head_dim):
        super().__init__()
        self.q = nn.Linear(embed_dim, head_dim)
        self.k = nn.Linear(embed_dim, head_dim)
        self.v = nn.Linear(embed_dim, head_dim)

    def forward(self, hidden_state):
        attn_outputs = scaled_dot_product_attention(
            self.q(hidden_state), self.k(hidden_state), self.v(hidden_state))
        return attn_outputs

head_dim is the number of dimensions we are projecting into. Although head_dim does not have to be smaller than the number of embedding dimensions of the tokens (embed_dim), in practice it is chosen so that embed_dim is a multiple of head_dim, typically head_dim = embed_dim // num_heads, which keeps the total computation across all heads constant.

For example, BERT has 12 attention heads, so the dimension of each head is 768 / 12 = 64.
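
As a quick sanity check of this arithmetic, the sketch below splits the hidden size evenly across heads and confirms that concatenating the head outputs recovers the original width. The helper name `split_head_dim` is introduced here for illustration only.

```python
def split_head_dim(embed_dim: int, num_heads: int) -> int:
    """Return the per-head dimension, requiring an even split."""
    if embed_dim % num_heads != 0:
        raise ValueError("embed_dim must be divisible by num_heads")
    return embed_dim // num_heads

# Values quoted above for BERT: hidden size 768, 12 attention heads
num_heads = 12
head_dim = split_head_dim(768, num_heads)
print(head_dim)              # 64
print(head_dim * num_heads)  # 768, the width after concatenating all heads
```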

## Multihead attention

In [14]:
class MultiHeadAttention(nn.Module):
    def __init__(self, config):
        super().__init__()
        embed_dim = config.hidden_size
        num_heads = config.num_attention_heads
        head_dim = embed_dim // num_heads
        self.heads = nn.ModuleList(
            [AttentionHead(embed_dim, head_dim) for _ in range(num_heads)]
        )
        self.output_linear = nn.Linear(embed_dim, embed_dim)

    def forward(self, hidden_state):
        x = torch.cat([h(hidden_state) for h in self.heads], dim=-1)
        x = self.output_linear(x)
        return x

In [17]:
config = AutoConfig.from_pretrained(model_ckpt)
config

BertConfig {
  "_name_or_path": "bert-base-uncased",
  "architectures": [
    "BertForMaskedLM"
  ],
  "attention_probs_dropout_prob": 0.1,
  "classifier_dropout": null,
  "gradient_checkpointing": false,
  "hidden_act": "gelu",
  "hidden_dropout_prob": 0.1,
  "hidden_size": 768,
  "initializer_range": 0.02,
  "intermediate_size": 3072,
  "layer_norm_eps": 1e-12,
  "max_position_embeddings": 512,
  "model_type": "bert",
  "num_attention_heads": 12,
  "num_hidden_layers": 12,
  "pad_token_id": 0,
  "position_embedding_type": "absolute",
  "transformers_version": "4.19.2",
  "type_vocab_size": 2,
  "use_cache": true,
  "vocab_size": 30522
}

In [18]:
inputs_embeds.shape

torch.Size([1, 5, 768])

In [15]:
multihead_attn = MultiHeadAttention(config)
attn_output = multihead_attn(inputs_embeds)
attn_output.size()

torch.Size([1, 5, 768])

## Feed forward

In [38]:
class FeedForward(nn.Module):
    def __init__(self, config):
        super().__init__()
        self.linear_1 = nn.Linear(config.hidden_size, config.intermediate_size)
        self.linear_2 = nn.Linear(config.intermediate_size, config.hidden_size)
        self.gelu = nn.GELU()
        self.dropout = nn.Dropout(config.hidden_dropout_prob)

    def forward(self, x):
        x = self.linear_1(x)
        x = self.gelu(x)
        x = self.linear_2(x)
        x = self.dropout(x)
        return x

Note that a feed-forward layer such as nn.Linear is usually applied to a tensor of shape (batch_size, input_dim), where it acts on each element of the batch dimension independently. This is actually true for any dimension except the last one, so when we pass a tensor of shape (batch_size, seq_len, hidden_dim) the layer is applied to all token embeddings of the batch and sequence independently, which is exactly what we want.
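
The claim above can be checked directly. The following is a minimal sketch, with an arbitrary batch of random inputs as an assumption, that compares one call on the full `(batch_size, seq_len, hidden_dim)` tensor against applying the same `nn.Linear` to each token embedding separately.

```python
import torch
from torch import nn

torch.manual_seed(0)
linear = nn.Linear(768, 3072)
x = torch.randn(2, 5, 768)  # (batch_size, seq_len, hidden_dim)

# Apply the layer to the whole tensor at once
out_all = linear(x)

# Apply it to one token embedding at a time and stack the results back up
out_per_token = torch.stack(
    [torch.stack([linear(x[b, t]) for t in range(x.size(1))])
     for b in range(x.size(0))]
)

print(out_all.shape)  # torch.Size([2, 5, 3072])
print(torch.allclose(out_all, out_per_token, atol=1e-4))
```

Both paths should agree up to floating-point tolerance, since the layer only mixes values along the last dimension.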

In [39]:
feed_forward = FeedForward(config)
# Feed the multi-head attention output (attn_output) through the feed-forward layer
ff_outputs = feed_forward(attn_output)
ff_outputs.size()

torch.Size([1, 5, 768])

## Layer norm

![](https://learning.oreilly.com/api/v2/epubs/urn:orm:book:9781098136789/files/assets/nlpt_0306.png)

- Post layer normalization

This is the arrangement used in the Transformer paper; it places layer normalization in between the skip connections. This arrangement is tricky to train from scratch as the gradients can diverge. For this reason, you will often see a concept known as learning rate warm-up, where the learning rate is gradually increased from a small value to some maximum value during training.

- Pre layer normalization

**This is the most common arrangement found in the literature**; it places layer normalization within the span of the skip connections. This tends to be much more stable during training, and it does not usually require any learning rate warm-up.
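
The difference between the two arrangements comes down to where the normalization sits relative to the residual addition. The sketch below illustrates this with a plain `nn.Linear` standing in for the attention or feed-forward sublayer; the function names `post_ln_step` and `pre_ln_step` are illustrative and not from any library.

```python
import torch
from torch import nn

def post_ln_step(x, sublayer, norm):
    # Post-LN: add the skip connection first, then normalize the sum
    return norm(x + sublayer(x))

def pre_ln_step(x, sublayer, norm):
    # Pre-LN: normalize only the sublayer input; the skip path is left untouched
    return x + sublayer(norm(x))

torch.manual_seed(0)
hidden_size = 768
x = torch.randn(1, 5, hidden_size)
sublayer = nn.Linear(hidden_size, hidden_size)  # stand-in for attention or feed-forward
norm = nn.LayerNorm(hidden_size)

post_out = post_ln_step(x, sublayer, norm)
pre_out = pre_ln_step(x, sublayer, norm)
print(post_out.shape, pre_out.shape)
```

In the pre-LN variant the input `x` reaches the output through an unnormalized identity path, which is one common explanation for its more stable gradients.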

## Put things together

In [40]:
class TransformerEncoderLayer(nn.Module):
    def __init__(self, config):
        super().__init__()
        self.layer_norm_1 = nn.LayerNorm(config.hidden_size)
        self.layer_norm_2 = nn.LayerNorm(config.hidden_size)
        self.attention = MultiHeadAttention(config)
        self.feed_forward = FeedForward(config)

    def forward(self, x):
        # Apply layer normalization and then copy input into query, key, value
        hidden_state = self.layer_norm_1(x)
        # Apply attention with a skip connection
        x = x + self.attention(hidden_state)
        # Apply feed-forward layer with a skip connection
        x = x + self.feed_forward(self.layer_norm_2(x))
        return x

In [42]:
encoder_layer = TransformerEncoderLayer(config)
inputs_embeds.shape, encoder_layer(inputs_embeds).size()
# input and output

(torch.Size([1, 5, 768]), torch.Size([1, 5, 768]))

## Positional encoding

Attention by itself does not encode token order, so position information has to be injected into the embeddings. There are several ways to achieve this, and one of the most popular approaches is to use a learnable pattern, especially when the pretraining dataset is sufficiently large. This works exactly the same way as the token embeddings, but using the position index instead of the token ID as input. With that approach, an efficient way of encoding the positions of tokens is learned during pretraining.

In [43]:
class Embeddings(nn.Module):
    def __init__(self, config):
        super().__init__()
        self.token_embeddings = nn.Embedding(config.vocab_size,
                                             config.hidden_size)
        self.position_embeddings = nn.Embedding(config.max_position_embeddings,
                                                config.hidden_size)
        self.layer_norm = nn.LayerNorm(config.hidden_size, eps=1e-12)
        self.dropout = nn.Dropout()

    def forward(self, input_ids):
        # Create position IDs for input sequence
        seq_length = input_ids.size(1)
        # Create the position IDs on the same device as the input IDs
        position_ids = torch.arange(seq_length, dtype=torch.long,
                                    device=input_ids.device).unsqueeze(0)
        # Create token and position embeddings
        token_embeddings = self.token_embeddings(input_ids)
        position_embeddings = self.position_embeddings(position_ids)
        # Combine token and position embeddings
        embeddings = token_embeddings + position_embeddings
        embeddings = self.layer_norm(embeddings)
        embeddings = self.dropout(embeddings)
        return embeddings

In [44]:
embedding_layer = Embeddings(config)
embedding_layer(inputs.input_ids).size()

torch.Size([1, 5, 768])

While learnable position embeddings are easy to implement and widely used, there are some alternatives:

- Absolute positional representations

Transformer models can use static patterns consisting of **modulated sine and cosine signals** to encode the positions of the tokens. This works especially well when there are not large volumes of data available.

- Relative positional representations

Although absolute positions are important, one can argue that when computing an embedding, the surrounding tokens are most important. Relative positional representations follow that intuition and encode the relative positions between tokens. This cannot be set up by just introducing a new relative embedding layer at the beginning, since the relative embedding changes for each token depending on where in the sequence we are attending from. Instead, **the attention mechanism itself is modified** with **additional terms that take the relative position between tokens** into account. Models such as **DeBERTa** use such representations.
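
For the static sine and cosine patterns mentioned above, a minimal sketch could look as follows. It uses the formulation from the original Transformer paper, with base 10000 and an even embedding dimension; the function name `sinusoidal_position_encoding` is an illustrative choice, not a library API.

```python
import math
import torch

def sinusoidal_position_encoding(seq_len: int, dim: int) -> torch.Tensor:
    """Return a (seq_len, dim) matrix of fixed position encodings; dim must be even."""
    position = torch.arange(seq_len, dtype=torch.float32).unsqueeze(1)
    # Frequencies 10000^(-2i/dim) for i = 0, 1, ..., dim/2 - 1
    div_term = torch.exp(
        torch.arange(0, dim, 2, dtype=torch.float32) * (-math.log(10000.0) / dim)
    )
    pe = torch.zeros(seq_len, dim)
    pe[:, 0::2] = torch.sin(position * div_term)  # even indices use sine
    pe[:, 1::2] = torch.cos(position * div_term)  # odd indices use cosine
    return pe

pe = sinusoidal_position_encoding(5, 768)
print(pe.shape)  # torch.Size([5, 768])
```

Because the pattern is fixed, it adds no trainable parameters and could be added to the token embeddings in place of the learned `position_embeddings` table.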

## Put everything together (with learnable positional encoding)

In [45]:
class TransformerEncoder(nn.Module):
    def __init__(self, config):
        super().__init__()
        self.embeddings = Embeddings(config)
        self.layers = nn.ModuleList([TransformerEncoderLayer(config)
                                     for _ in range(config.num_hidden_layers)])

    def forward(self, x):
        x = self.embeddings(x)
        for layer in self.layers:
            x = layer(x)
        return x

In [46]:
encoder = TransformerEncoder(config)
encoder(inputs.input_ids).size()

torch.Size([1, 5, 768])

## Classification head

Traditionally, the first token in such models is used for the prediction and we can attach a dropout and a linear layer to make a classification prediction. The following class extends the existing encoder for sequence classification:

In [47]:
class TransformerForSequenceClassification(nn.Module):
    def __init__(self, config):
        super().__init__()
        self.encoder = TransformerEncoder(config)
        self.dropout = nn.Dropout(config.hidden_dropout_prob)
        self.classifier = nn.Linear(config.hidden_size, config.num_labels)

    def forward(self, x):
        # Select the hidden state of the first token ([CLS] when special tokens are added)
        x = self.encoder(x)[:, 0, :]
        x = self.dropout(x)
        x = self.classifier(x)
        return x

In [48]:
# Let's say we want to predict one of 3 labels
config.num_labels = 3
encoder_classifier = TransformerForSequenceClassification(config)
encoder_classifier(inputs.input_ids).size()

torch.Size([1, 3])

In [49]:
encoder_classifier(inputs.input_ids)

tensor([[-0.1851,  1.2831, -0.3977]], grad_fn=<AddmmBackward0>)

For each example in the batch we **get the unnormalized logits for each class in the output**. This corresponds to the BERT model that we used.
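
To turn such unnormalized logits into class probabilities and a prediction, a softmax followed by an argmax is the usual step. The sketch below does this in plain Python on the logits printed above; since the model here is randomly initialized, the resulting prediction carries no meaning beyond illustrating the mechanics.

```python
import math

def softmax(logits):
    # Subtract the maximum for numerical stability before exponentiating
    m = max(logits)
    exps = [math.exp(v - m) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]

logits = [-0.1851, 1.2831, -0.3977]  # values printed by the cell above
probs = softmax(logits)
predicted_class = max(range(len(probs)), key=probs.__getitem__)

print([round(p, 3) for p in probs])
print(predicted_class)  # 1, the index of the largest logit
```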

# Implement Decoder

## Mask matrix

In [50]:
seq_len = inputs.input_ids.size(-1)
mask = torch.tril(torch.ones(seq_len, seq_len)).unsqueeze(0)
mask[0]

tensor([[1., 0., 0., 0., 0.],
        [1., 1., 0., 0., 0.],
        [1., 1., 1., 0., 0.],
        [1., 1., 1., 1., 0.],
        [1., 1., 1., 1., 1.]])

Once we have this mask matrix, we can prevent each attention head from peeking at future tokens by using Tensor.masked_fill() to replace all the zeros with negative infinity:

In [51]:
scores

tensor([[[28.8223,  1.3862, -1.3654, -1.1751,  0.4624],
         [ 1.3862, 29.6709,  1.5041,  0.6674,  0.3451],
         [-1.3654,  1.5041, 27.3643, -0.4077, -0.4536],
         [-1.1751,  0.6674, -0.4077, 27.2289, -0.7336],
         [ 0.4624,  0.3451, -0.4536, -0.7336, 28.1667]]],
       grad_fn=<DivBackward0>)

In [52]:
scores.masked_fill(mask == 0, -float("inf"))

tensor([[[28.8223,    -inf,    -inf,    -inf,    -inf],
         [ 1.3862, 29.6709,    -inf,    -inf,    -inf],
         [-1.3654,  1.5041, 27.3643,    -inf,    -inf],
         [-1.1751,  0.6674, -0.4077, 27.2289,    -inf],
         [ 0.4624,  0.3451, -0.4536, -0.7336, 28.1667]]],
       grad_fn=<MaskedFillBackward0>)

By setting the upper-triangular values to negative infinity, we guarantee that the attention weights for future tokens are all zero once we take the softmax over the scores, because exp(-inf) = 0.

Rewrite the scaled dot-product attention function so that it accepts an optional mask:

In [53]:
def scaled_dot_product_attention(query, key, value, mask=None):
    dim_k = query.size(-1)
    scores = torch.bmm(query, key.transpose(1, 2)) / sqrt(dim_k)
    if mask is not None:
        scores = scores.masked_fill(mask == 0, float("-inf"))
    weights = F.softmax(scores, dim=-1)
    return weights.bmm(value)
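
As a usage sketch for the masked version, the example below repeats the function so that it runs on its own, then checks the causal property: changing the last token should leave the outputs at all earlier positions unchanged. The random inputs and the small dimensions are assumptions chosen only to keep the example quick.

```python
import torch
import torch.nn.functional as F
from math import sqrt

def scaled_dot_product_attention(query, key, value, mask=None):
    dim_k = query.size(-1)
    scores = torch.bmm(query, key.transpose(1, 2)) / sqrt(dim_k)
    if mask is not None:
        scores = scores.masked_fill(mask == 0, float("-inf"))
    weights = F.softmax(scores, dim=-1)
    return weights.bmm(value)

torch.manual_seed(0)
seq_len, dim = 5, 8
x = torch.randn(1, seq_len, dim)
mask = torch.tril(torch.ones(seq_len, seq_len)).unsqueeze(0)

out = scaled_dot_product_attention(x, x, x, mask=mask)

# Perturb only the last token and recompute
x_changed = x.clone()
x_changed[0, -1] += 1.0
out_changed = scaled_dot_product_attention(x_changed, x_changed, x_changed, mask=mask)

# Earlier positions cannot attend to the last token, so their outputs match
print(torch.allclose(out[0, :-1], out_changed[0, :-1]))
```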

See https://github.com/karpathy/minGPT for a more complete decoder implementation.

# Bertviz visualization (again)

In [20]:
from bertviz import head_view
from transformers import AutoModel

model = AutoModel.from_pretrained(model_ckpt, output_attentions=True)

sentence_a = "time flies like an arrow"
sentence_b = "fruit flies like a banana"

viz_inputs = tokenizer(sentence_a, sentence_b, return_tensors='pt')

Some weights of the model checkpoint at bert-base-uncased were not used when initializing BertModel: ['cls.seq_relationship.bias', 'cls.predictions.bias', 'cls.seq_relationship.weight', 'cls.predictions.decoder.weight', 'cls.predictions.transform.dense.weight', 'cls.predictions.transform.LayerNorm.bias', 'cls.predictions.transform.dense.bias', 'cls.predictions.transform.LayerNorm.weight']
- This IS expected if you are initializing BertModel from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing BertModel from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).


In [22]:
viz_inputs

{'input_ids': tensor([[  101,  2051, 10029,  2066,  2019,  8612,   102,  5909, 10029,  2066,
          1037, 15212,   102]]), 'token_type_ids': tensor([[0, 0, 0, 0, 0, 0, 0, 1, 1, 1, 1, 1, 1]]), 'attention_mask': tensor([[1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1]])}

In [23]:
attention = model(**viz_inputs).attentions
sentence_b_start = (viz_inputs.token_type_ids == 0).sum(dim=1) # index of first token from sentence b
tokens = tokenizer.convert_ids_to_tokens(viz_inputs.input_ids[0])

In [27]:
len(attention),attention[0].shape 
# 12 hidden layers; each layer's attention tensor has shape (batch_size, num_heads, seq_len, seq_len)
# seq_len is 13 here, and the entries are attention weights (already softmaxed)

(12, torch.Size([1, 12, 13, 13]))

In [34]:
attention[0][0][0].sum(dim=1)

tensor([1.0000, 1.0000, 1.0000, 1.0000, 1.0000, 1.0000, 1.0000, 1.0000, 1.0000,
        1.0000, 1.0000, 1.0000, 1.0000], grad_fn=<SumBackward1>)

In [36]:
tokens

['[CLS]',
 'time',
 'flies',
 'like',
 'an',
 'arrow',
 '[SEP]',
 'fruit',
 'flies',
 'like',
 'a',
 'banana',
 '[SEP]']

In [19]:
head_view(attention, tokens, sentence_b_start, heads=[8])


# Transformer family tree

![](https://learning.oreilly.com/api/v2/epubs/urn:orm:book:9781098136789/files/assets/nlpt_0308.png)

## Encoder branch

BERT

BERT is pretrained with the two objectives of predicting masked tokens in texts and determining if one text passage is likely to follow another. **The former task is called masked language modeling (MLM) and the latter next sentence prediction (NSP).**

DistilBERT

Although BERT delivers great results, its size can make it tricky to deploy in environments where low latencies are required. By using a technique known as **knowledge distillation during pretraining**, DistilBERT achieves **97% of BERT’s performance** while **using 40% less memory and being 60% faster**.
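
To give a feel for knowledge distillation, the sketch below shows a generic soft-target loss in which a student is trained to match a teacher's softened output distribution. This is an illustration under stated assumptions, not DistilBERT's exact training recipe: the function name `distillation_loss`, the temperature of 2.0, and the random logits are all hypothetical.

```python
import torch
import torch.nn.functional as F

def distillation_loss(student_logits, teacher_logits, temperature=2.0):
    # Soften both distributions with the same temperature
    student_log_probs = F.log_softmax(student_logits / temperature, dim=-1)
    teacher_probs = F.softmax(teacher_logits / temperature, dim=-1)
    # KL(teacher || student), averaged over the batch; the T^2 factor keeps
    # gradient magnitudes comparable across temperatures
    kl = F.kl_div(student_log_probs, teacher_probs, reduction="batchmean")
    return kl * temperature ** 2

torch.manual_seed(0)
teacher_logits = torch.randn(4, 10)
student_logits = torch.randn(4, 10)

loss_different = distillation_loss(student_logits, teacher_logits)
loss_identical = distillation_loss(teacher_logits, teacher_logits)
print(loss_different.item())
print(loss_identical.item())  # close to 0 when the student matches the teacher
```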

RoBERTa

A study following the release of BERT revealed that its performance can be further improved by **modifying the pretraining scheme**. RoBERTa is **trained longer**, on **larger batches** with **more training data**, and it **drops the NSP task.** Together, these changes significantly improve its performance compared to the original BERT model.

XLM

Several pretraining objectives for **building multilingual models** were explored in the work on the cross-lingual language model (XLM) including the **autoregressive language modeling from GPT-like** models and **MLM from BERT**. In addition, the authors of the paper on XLM pretraining introduced **translation language modeling (TLM), which is an extension of MLM to multiple language inputs.** Experimenting with these pretraining tasks, they achieved state-of-the-art results on several multilingual NLU benchmarks as well as on translation tasks.

XLM-RoBERTa

Following the work of XLM and RoBERTa, the XLM-RoBERTa or XLM-R model takes multilingual pretraining one step further by **massively upscaling the training data**. Using the Common Crawl corpus, its developers created a dataset with 2.5 terabytes of text; they then **trained an encoder with MLM** on this dataset. Since the dataset only contains data without parallel texts (i.e., translations), the **TLM objective of XLM was dropped**. This approach beats XLM and multilingual BERT variants by a large margin, especially on low-resource languages.

ALBERT

The ALBERT model introduced three changes to **make the encoder architecture more efficient**. First, it **decouples the token embedding dimension from the hidden dimension**, thus **allowing the embedding dimension** to be small and thereby saving parameters, especially when the vocabulary gets large. Second, **all layers share the same parameters**, which decreases the number of effective parameters even further. Finally, the **NSP objective is replaced with a sentence-ordering prediction**: the model **needs to predict whether or not the order of two consecutive sentences was swapped rather than predicting if they belong together at all**. These changes make it possible to train even larger models with fewer parameters and reach superior performance on NLU tasks.
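
The parameter saving from decoupling the embedding dimension can be made concrete with a little arithmetic. The sketch below compares a single vocabulary-by-hidden embedding matrix with a factorized version that uses a small embedding plus a projection. The embedding size of 128 is an assumed value for illustration, bias terms are ignored, and the function names are hypothetical.

```python
def embedding_params(vocab_size: int, hidden_size: int) -> int:
    # Single embedding matrix mapping tokens directly to the hidden size
    return vocab_size * hidden_size

def factorized_embedding_params(vocab_size: int, embed_size: int, hidden_size: int) -> int:
    # Small embedding matrix plus a projection up to the hidden size
    return vocab_size * embed_size + embed_size * hidden_size

vocab_size, hidden_size = 30522, 768  # from the bert-base-uncased config
embed_size = 128                      # assumed value, for illustration only

full = embedding_params(vocab_size, hidden_size)
factorized = factorized_embedding_params(vocab_size, embed_size, hidden_size)
print(full)        # 23440896
print(factorized)  # 4005120
```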

ELECTRA

One **limitation of the standard MLM pretraining** objective is that at each training step **only the representations of the masked tokens are updated**, while the other input tokens are not, because only the masked positions contribute to the loss. To address this issue, **ELECTRA uses a two-model approach**: the **first model (which is typically small) works like a standard masked language model and predicts masked tokens.** The **second model**, called the **discriminator**, is then tasked **to predict which of the tokens in the first model’s output were originally masked**. Therefore, the **discriminator needs to make a binary classification for every token**, which makes training 30 times more efficient. For downstream tasks the discriminator is fine-tuned like a standard BERT model.

DeBERTa

The DeBERTa model introduces two architectural changes. First, **each token is represented as two vectors: one for the content, the other for relative position**. By disentangling the tokens’ content from their relative positions, the **self-attention layers can better model the dependency of nearby token pairs**. On the other hand, the absolute position of a word is also important, especially for decoding. For this reason, **an absolute position embedding is added just before the softmax layer of the token decoding head**. DeBERTa is the first model (as an ensemble) to beat the human baseline on the SuperGLUE benchmark, a more difficult version of GLUE consisting of several subtasks used to measure NLU performance.

## The decoder branch


These models are exceptionally good at predicting the next word in a sequence and are thus mostly used for text generation tasks.

GPT

The introduction of GPT combined two key ideas in NLP: **the novel and efficient transformer decoder architecture**, and **transfer learning**. In that setup, the model was pretrained by predicting the next word based on the previous ones. The model was trained on the BookCorpus and **achieved great results on downstream tasks such as classification.**

GPT-2

Inspired by the success of the simple and scalable pretraining approach, **the original model and training set were upscaled to produce GPT-2**. This model is able to produce long sequences of coherent text.

CTRL

Models like **GPT-2 can continue an input sequence (also called a prompt)**. However, the user has **little control over the style** of the generated sequence. **The Conditional Transformer Language (CTRL) model addresses this issue by adding “control tokens” at the beginning of the sequence**. These allow the style of the generated text to be controlled, which **allows for diverse generation.**

GPT-3

Following the success of scaling GPT up to GPT-2, a thorough analysis on the behavior of language models at different scales revealed that **there are simple power laws that govern the relation between compute, dataset size, model size, and the performance of a language model**. Inspired by these insights, **GPT-2 was upscaled by a factor of 100 to yield GPT-3**, with 175 billion parameters. Besides being able to generate impressively realistic text passages, the model also **exhibits few-shot learning capabilities: with a few examples of a novel task such as translating text to code, the model is able to accomplish the task on new examples**. OpenAI has not open-sourced this model, but provides an interface through the OpenAI API.

GPT-Neo/GPT-J-6B

GPT-Neo and GPT-J-6B are GPT-like models that were trained by EleutherAI, a collective of researchers who aim to re-create and release GPT-3 scale models. The current models are smaller variants of the full 175-billion-parameter model, with 1.3, 2.7, and 6 billion parameters, and **are competitive with the smaller GPT-3 models OpenAI offers.**

## Both encoder and decoder

T5

The T5 model **unifies all NLU and NLG tasks by converting them into text-to-text tasks**. All tasks are framed as **sequence-to-sequence** tasks, where adopting an encoder-decoder architecture is natural. For **text classification problems, for example, this means that the text is used as the encoder input and the decoder has to generate the label as normal text instead of a class**. The **T5 architecture uses the original Transformer architecture**. Using the large crawled C4 dataset, the model is **pretrained with masked language modeling as well as the SuperGLUE tasks by translating all of them to text-to-text tasks**. The largest model with 11 billion parameters yielded state-of-the-art results on several benchmarks.
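
To illustrate the text-to-text framing for classification, the sketch below builds an encoder input and a decoder target as plain strings. The helper `to_text_to_text`, the task prefix "classify sentiment", and the label "positive" are hypothetical choices for illustration and are not taken from the T5 codebase.

```python
def to_text_to_text(task_prefix: str, text: str, label: str) -> dict:
    # Encoder input: a task prefix followed by the raw text.
    # Decoder target: the label written out as ordinary text.
    return {"input_text": f"{task_prefix}: {text}", "target_text": label}

example = to_text_to_text(
    "classify sentiment",                      # hypothetical task prefix
    "I had an enjoyable two weeks in London",
    "positive",                                # label as text, not a class index
)
print(example["input_text"])
print(example["target_text"])
```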

BART

BART **combines the pretraining procedures of BERT and GPT within the encoder-decoder architecture**. The **input sequences undergo one of several possible transformations**, from simple masking to sentence permutation, token deletion, and document rotation. These modified inputs are **passed through the encoder**, and **the decoder has to reconstruct the original texts**. This makes the model **more flexible** as it is possible to use it for NLU as well as NLG tasks, and it achieves state-of-the-art performance on both.

M2M-100

Conventionally a translation model is built for one language pair and translation direction. Naturally, this does not scale to many languages, and in addition there might be shared knowledge between language pairs that could be leveraged for translation between rare languages. **M2M-100 is the first translation model that can translate between any of 100 languages**. This allows for **high-quality translations between rare and underrepresented languages**. The model uses prefix tokens (similar to the special [CLS] token) to indicate the source and target language.

BigBird

One main **limitation of transformer models is the maximum context size (512 in BERT)**, **due to the quadratic memory requirements of the attention mechanism**. **BigBird** addresses this issue by **using a sparse form of attention that scales linearly**. This allows for the drastic scaling of contexts **from 512 tokens in most BERT models to 4,096 in BigBird**. This is especially **useful in cases where long dependencies need to be conserved, such as in text summarization**.
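
The quadratic versus linear scaling can be seen by counting attention-score entries per head. In the sketch below, full attention stores one score per query-key pair, while the sparse variant is modeled simply as a fixed number of keys per query. That budget of 256 is a hypothetical figure for illustration and is not BigBird's actual configuration.

```python
def full_attention_entries(seq_len: int) -> int:
    # Every token attends to every token: seq_len^2 scores
    return seq_len * seq_len

def sparse_attention_entries(seq_len: int, keys_per_query: int) -> int:
    # Every token attends to a fixed number of positions: linear in seq_len
    return seq_len * keys_per_query

keys_per_query = 256  # hypothetical budget, for illustration only
for seq_len in (512, 4096):
    print(seq_len,
          full_attention_entries(seq_len),
          sparse_attention_entries(seq_len, keys_per_query))
```

Going from 512 to 4,096 tokens multiplies the full attention matrix by 64, but the fixed-budget sparse variant only by 8.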