In [1]:
# Uncomment and run this cell if you're on Colab or Kaggle
!git clone https://github.com/nlp-with-transformers/notebooks.git
%cd notebooks
from install import *
install_requirements()

Cloning into 'notebooks'...
remote: Enumerating objects: 526, done.
remote: Counting objects: 100% (173/173), done.
remote: Compressing objects: 100% (45/45), done.
remote: Total 526 (delta 143), reused 137 (delta 128), pack-reused 353
Receiving objects: 100% (526/526), 28.62 MiB | 18.34 MiB/s, done.
Resolving deltas: 100% (250/250), done.
/content/notebooks
⏳ Installing base requirements ...
✅ Base requirements installed!
⏳ Installing Git LFS ...
✅ Git LFS installed!


In [None]:
#hide
from utils import *
setup_chapter()

Using transformers v4.11.3
Using datasets v1.16.1


# Transformer Anatomy

## The Transformer Architecture

**Encoder**
Converts an input sequence of tokens into a sequence of embedding vectors,
often called the hidden state or context

**Decoder**
Uses the encoder’s hidden state to iteratively generate an output sequence of
tokens, one token at a time

**Three main types of transformers**
1.   Encoder-only - text classification, named entity recognition (e.g., BERT)
2.   Decoder-only - autocompletion (e.g., GPT)
3.   Encoder-decoder - machine translation and summarization (e.g., BART, T5)

**How it works**
*   The input text is tokenized and converted to token embeddings, which are combined with positional embeddings (so that attention also has access to each token's position)
*   Encoder stack - stack of encoder layers or “blocks,” which is analogous
to stacking convolutional layers in computer vision
*   The encoder’s output is fed to each decoder layer, and the decoder then generates a prediction for the most probable next token in the sequence. The output of this step is then fed back into the decoder to generate the next token.
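
The feedback loop in the last bullet can be sketched in a few lines of plain Python. This is only a toy illustration under stated assumptions: `toy_next_token` and the `CONTINUATIONS` table are hypothetical stand-ins for a real decoder, which would score the whole vocabulary using the encoder's hidden state.

```python
# Toy sketch of autoregressive (token-by-token) generation.
# `toy_next_token` is a hypothetical stand-in for a real decoder: it ignores
# the encoder state and reads the next token from a hard-coded table.
CONTINUATIONS = {"<s>": "time", "time": "flies", "flies": "like",
                 "like": "an", "an": "arrow", "arrow": "</s>"}

def toy_next_token(encoder_state, generated):
    """Return the 'most probable' next token given the tokens so far."""
    return CONTINUATIONS[generated[-1]]

def greedy_decode(encoder_state, max_len=10):
    generated = ["<s>"]                      # start-of-sequence token
    for _ in range(max_len):
        next_token = toy_next_token(encoder_state, generated)
        generated.append(next_token)         # feed the prediction back in
        if next_token == "</s>":             # stop at end-of-sequence
            break
    return generated

output_tokens = greedy_decode(encoder_state=None)
print(output_tokens)
```

The important part is the loop shape: every new token is appended to the decoder input before the next prediction is made.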

**Main difference between encoder and decoder layers**
* Encoders use *bidirectional attention*, i.e., each token representation depends on both the left and the right (before and after) context.
* Decoders use *causal* or *autoregressive attention*, i.e., each token representation depends only on the left context.



<img alt="transformer-encoder-decoder" caption="Encoder-decoder architecture of the transformer, with the encoder shown in the upper half of the figure and the decoder in the lower half" src="https://github.com/nlp-with-transformers/notebooks/blob/main/images/chapter03_transformer-encoder-decoder.png?raw=1" id="transformer-encoder-decoder"/>

## The Encoder

Each encoder layer receives a sequence of embeddings and feeds them through the following sublayers:
* A multi-head self-attention layer
* Fully connected feed-forward layer that is applied to each input embedding

**The main role of the encoder stack is to “update” the input embeddings to 
produce representations that encode some contextual information in the
sequence.**

<img alt="encoder-zoom" caption="Zooming into the encoder layer" src="https://github.com/nlp-with-transformers/notebooks/blob/main/images/chapter03_encoder-zoom.png?raw=1" id="encoder-zoom"/>

### Self-Attention

The main idea behind self-attention is that instead of using a fixed embedding for each token, we can use the whole sequence to compute a weighted average of each embedding.

**Embeddings that are generated in this way are called
contextualized embeddings.**

<img alt="Contextualized embeddings" caption="Diagram showing how self-attention updates raw token embeddings (upper) into contextualized embeddings (lower) to create representations that incorporate information from the whole sequence" src="https://github.com/nlp-with-transformers/notebooks/blob/main/images/chapter03_contextualized-embedding.png?raw=1" id="contextualized-embeddings"/>

#### Scaled dot-product attention
A. Vaswani et al., “Attention Is All You Need”, (2017).

1. Project each token embedding into three vectors called query (ingredient), key (label on the shelf), and value (chosen product).
2. **Compute attention scores** using a similarity function, here the dot product. Queries and keys that are similar will have a large dot product, while those that don’t share much in common will have little to no overlap. The outputs from this step are called the attention scores, and for a sequence with n input tokens there is a corresponding n × n matrix of attention scores.
3. **Compute attention weights.** Divide the scores by a scaling factor to keep them from growing too large, then normalize them with a softmax so that every weight lies in the 0-1 range and each row sums to 1.
4. **Update the token embeddings.** Once the attention weights are computed, we
multiply them by the value vectors to obtain the updated (contextualized) embedding of each token.
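
In matrix form, the four steps above collapse into the single formula from Vaswani et al. Here $Q$, $K$, and $V$ stack the query, key, and value vectors of all tokens as rows, and $d_k$ is the dimension of the key vectors (notation assumed from the paper, not defined elsewhere in these notes):

$$
\mathrm{Attention}(Q, K, V) = \mathrm{softmax}\!\left(\frac{Q K^{\top}}{\sqrt{d_k}}\right) V
$$

The product $Q K^{\top}$ is the n × n score matrix of step 2, the division and softmax are step 3, and the final multiplication by $V$ is step 4.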



Copy and execute the following cell magic in a new cell to use `bertviz` in JupyterLab:

```python
%%javascript
require.config({
  paths: {
      d3: '//cdnjs.cloudflare.com/ajax/libs/d3/3.4.8/d3.min',
      jquery: '//ajax.googleapis.com/ajax/libs/jquery/2.0.0/jquery.min',
  }
});
```

In [27]:
#hide_output
from transformers import AutoTokenizer
from bertviz.transformers_neuron_view import BertModel
from bertviz.neuron_view import show

model_ckpt = "bert-base-uncased"
tokenizer = AutoTokenizer.from_pretrained(model_ckpt)
model = BertModel.from_pretrained(model_ckpt)
text = "time flies like an arrow. fruit flies like a banana"
#text = 'apple is my favorite company, cause i had a crush on steve jobs'
#text = "apple is my favourite fruit, unrelatedly i don't like steve jobs"
show(model, "bert", tokenizer, text, display_mode="light", layer=0, head=8)

<IPython.core.display.Javascript object>

<IPython.core.display.Javascript object>

### Sidebar: Demystifying Queries, Keys, and Values

---
The notion of query, key, and value vectors may seem a bit cryptic the first time you
encounter them. Their names were inspired by information retrieval systems, but we
can motivate their meaning with a simple analogy. Imagine that you’re at the supermarket
buying all the ingredients you need for your dinner. You have the dish’s recipe,
and each of the required ingredients can be thought of as a query. As you scan the
shelves, you look at the labels (keys) and check whether they match an ingredient on
your list (similarity function). If you have a match, then you take the item (value)
from the shelf.
In this analogy, you only get one grocery item for every label that matches the ingredient.
Self-attention is a more abstract and “smooth” version of this: every label in the
supermarket matches the ingredient to the extent to which each key matches the
query. So if your list includes a dozen eggs, then you might end up grabbing 10 eggs,
an omelette, and a chicken wing.
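
The contrast between an exact lookup and the "smooth" version can be sketched with a toy shelf. Everything here is made up for illustration: the item names, the two-dimensional key vectors, and the numeric values are assumptions, not data from the book.

```python
import math

# Hypothetical toy "supermarket": each item has a key vector and a numeric value.
keys = {"eggs": [1.0, 0.0], "omelette": [0.8, 0.6], "chicken wing": [0.0, 1.0]}
values = {"eggs": 12.0, "omelette": 2.0, "chicken wing": 1.0}

def hard_lookup(name):
    """Ordinary dictionary lookup: exactly one item, or nothing."""
    return values.get(name)

def soft_lookup(query):
    """Attention-style lookup: a softmax-weighted average of all values."""
    scores = {n: sum(q * k for q, k in zip(query, vec)) for n, vec in keys.items()}
    total = sum(math.exp(s) for s in scores.values())
    weights = {n: math.exp(s) / total for n, s in scores.items()}
    blended = sum(weights[n] * values[n] for n in keys)
    return weights, blended

weights, blended = soft_lookup(query=[1.0, 0.0])  # a query that "looks like" eggs
print(weights, blended)
```

Every item contributes to the result, but items whose keys are closer to the query contribute more.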



### End sidebar

<img alt="Operations in scaled dot-product attention" height="125" caption="Operations in scaled dot-product attention" src="https://github.com/nlp-with-transformers/notebooks/blob/main/images/chapter03_attention-ops.png?raw=1" id="attention-ops"/>

**Example code with PyTorch**

**TOKENIZE THE TEXT**

In [17]:
# hide
from transformers import AutoTokenizer
model_ckpt = "bert-base-uncased"
text = "time flies like an arrow"  # five tokens, matching the outputs shown below
#text = 'apple is my favorite company, cause i had a crush on steve jobs'
tokenizer = AutoTokenizer.from_pretrained(model_ckpt)

In [18]:
# add_special_tokens = False excludes [CLS] and [SEP] tokens
inputs = tokenizer(text, return_tensors="pt", add_special_tokens=False)
inputs.input_ids

tensor([[ 2051, 10029,  2066,  2019,  8612]])

**DENSE EMBEDDINGS**

In [19]:
# dense means that each entry in the embedding contains a nonzero value
# sparse (ch. 2) - e.g., one-hot encodings - all entries except one are zero

from torch import nn
from transformers import AutoConfig

# what exactly is happening here?
# Here we’ve used the AutoConfig class to load the config.json file associated with the
# bert-base-uncased checkpoint. In Transformers, every checkpoint is assigned a
# configuration file that specifies various hyperparameters like vocab_size and
# hidden_size, which in our example shows us that each input ID will be mapped to
# one of the 30,522 embedding vectors stored in nn.Embedding, each with a size of 768.
# The AutoConfig class also stores additional metadata, such as the label names, which
# are used to format the model’s predictions.
config = AutoConfig.from_pretrained(model_ckpt)
token_emb = nn.Embedding(config.vocab_size, config.hidden_size)
token_emb

Embedding(30522, 768)

In [20]:
inputs_embeds = token_emb(inputs.input_ids)
# AT THIS POINT THE EMBEDDINGS ARE INDEPENDENT FROM THE CONTEXT!!
# i.e., homonyms have the same representation
inputs_embeds.size()
# This has given us a tensor of shape [batch_size, seq_len, hidden_dim], just like we saw in Chapter 2.

torch.Size([1, 5, 768])
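
To make the context independence concrete, here is a minimal sketch with a tiny embedding table. The vocabulary size, embedding width, and token IDs are toy assumptions chosen for readability, not the BERT values used above.

```python
import torch
from torch import nn

torch.manual_seed(0)
toy_emb = nn.Embedding(num_embeddings=10, embedding_dim=4)  # toy sizes, assumed

# Token ID 3 appears at positions 1 and 4, surrounded by different neighbours.
toy_ids = torch.tensor([[7, 3, 5, 2, 3]])
toy_vectors = toy_emb(toy_ids)

# A plain embedding lookup ignores the neighbours, so both vectors are identical.
same = torch.equal(toy_vectors[0, 1], toy_vectors[0, 4])
print(same)
```

This is exactly the limitation that self-attention is meant to remove.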

**ATTENTION SCORES**

In [21]:
# create the query, key, and value vectors and calculate the attention scores using the dot product as the similarity function
import torch
from math import sqrt 

query = key = value = inputs_embeds
# We’ll see later that the query, key, and value vectors are generated by applying independent
# weight matrices W_Q, W_K, and W_V to the embeddings, but for now we’ve kept them equal for simplicity.

dim_k = key.size(-1)
scores = torch.bmm(query, key.transpose(1,2)) / sqrt(dim_k) 
# The torch.bmm() function performs a batch matrix-matrix product
# that simplifies the computation of the attention scores where the
# query and key vectors have the shape [batch_size, seq_len, hidden_dim].

scores.size()
scores
# This has created an N × N matrix of attention scores per sample in the batch.
# In scaled dot-product attention, the dot products are divided by the square root of
# the key dimension so that we don’t get too many large numbers during training
# that can cause the softmax we will apply next to saturate.

tensor([[[27.9746, -0.6173,  0.9864,  1.3337,  1.9457],
         [-0.6173, 28.0223, -2.0465,  0.6572,  1.2274],
         [ 0.9864, -2.0465, 26.7267, -0.3166,  0.1571],
         [ 1.3337,  0.6572, -0.3166, 30.0338,  0.9170],
         [ 1.9457,  1.2274,  0.1571,  0.9170, 25.6732]]],
       grad_fn=<DivBackward0>)
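
The effect of the scaling factor on the softmax can be seen with a few hand-picked numbers. This is a sketch in plain Python; the key dimension of 64 and the three raw scores are made-up values, not taken from the tensors above.

```python
import math

def softmax(xs):
    top = max(xs)                              # subtract the max for stability
    exps = [math.exp(x - top) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

dim_k = 64                                     # assumed key dimension
raw_scores = [16.0, 8.0, 0.0]                  # made-up unscaled dot products
scaled_scores = [s / math.sqrt(dim_k) for s in raw_scores]

unscaled_weights = softmax(raw_scores)         # almost all mass on one entry
scaled_weights = softmax(scaled_scores)        # mass spread more evenly
print(unscaled_weights, scaled_weights)
```

Without scaling, the softmax is close to a hard argmax, which leaves very small gradients for the other entries.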

In [23]:
import torch.nn.functional as F

# turn the attention scores into weights in the 0-1 range (each row sums to 1) using softmax
weights = F.softmax(scores, dim=-1)
weights.sum(dim=-1)
weights

tensor([[[1.0000e+00, 3.8257e-13, 1.9019e-12, 2.6916e-12, 4.9639e-12],
         [3.6474e-13, 1.0000e+00, 8.7355e-14, 1.3047e-12, 2.3074e-12],
         [6.6245e-12, 3.1913e-13, 1.0000e+00, 1.7999e-12, 2.8905e-12],
         [3.4332e-13, 1.7455e-13, 6.5915e-14, 1.0000e+00, 2.2633e-13],
         [4.9580e-11, 2.4174e-11, 8.2889e-12, 1.7723e-11, 1.0000e+00]]],
       grad_fn=<SoftmaxBackward0>)

In [24]:
# multiply the attention weights by the values:
attn_outputs = torch.bmm(weights, value)
attn_outputs.shape

# Notice that the whole process is just two matrix multiplications and a softmax,
# so you can think of “self-attention” as just a fancy form of averaging.

torch.Size([1, 5, 768])

**FUNCTION FOR SCALED DOT-PRODUCT ATTENTION**

In [25]:
# all the steps in a single function
def scaled_dot_product_attention(query, key, value):
    dim_k = query.size(-1)
    scores = torch.bmm(query, key.transpose(1, 2)) / sqrt(dim_k)
    weights = F.softmax(scores, dim=-1)
    return torch.bmm(weights, value)

**Our attention mechanism with equal query and key vectors will assign a very large
score to identical words in the context, and in particular to the current word itself**: the
dot product of a vector with itself is its squared norm, which is why the diagonal entries of the score matrix above dominate every row.

But in practice, the meaning of a word
will be better informed by complementary words in the context than by identical
words—for example, the meaning of “flies” is better defined by incorporating information
from “time” and “arrow” than by another mention of “flies”. How can we promote
this behavior?

**Let’s allow the model to create a different set of vectors for the query, key, and value of
a token by using three different linear projections to project our initial token vector
into three different spaces.**

#### Multi-headed attention

It's beneficial to have multiple sets of linear projections, each one
representing a so-called attention head. 

WHY? The softmax of one head tends to focus mostly on one aspect of similarity.

Having several heads allows the model to focus on several aspects at once. For
instance, one head can focus on subject-verb interaction, whereas another finds
nearby adjectives.

<img alt="Multi-head attention" height="125" caption="Multi-head attention" src="https://github.com/nlp-with-transformers/notebooks/blob/main/images/chapter03_multihead-attention.png?raw=1" id="multihead-attention"/>

SINGLE HEAD

In [None]:
class AttentionHead(nn.Module):
    def __init__(self, embed_dim, head_dim):
        super().__init__()
        self.q = nn.Linear(embed_dim, head_dim) # linear weights/coefs!
        self.k = nn.Linear(embed_dim, head_dim)
        self.v = nn.Linear(embed_dim, head_dim)

    def forward(self, hidden_state):
        attn_outputs = scaled_dot_product_attention(
            self.q(hidden_state), self.k(hidden_state), self.v(hidden_state))
        return attn_outputs

# Here we’ve initialized three independent linear layers that apply matrix multiplication
# to the embedding vectors to produce tensors of shape [batch_size, seq_len,
# head_dim], where head_dim is the number of dimensions we are projecting into.
# Although head_dim does not have to be smaller than the number of embedding
# dimensions of the tokens (embed_dim), in practice it is chosen as embed_dim divided
# by the number of heads (so embed_dim is a multiple of head_dim), which keeps the
# computation across each head constant. For example, BERT
# has 12 attention heads, so the dimension of each head is 768/12 = 64.
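
The arithmetic behind that choice can be checked directly. The sketch below uses the BERT-base numbers quoted above; `qkv_params` is a hypothetical helper introduced only for this illustration.

```python
embed_dim = 768          # BERT-base hidden size (from the text above)
num_heads = 12           # BERT-base attention heads (from the text above)

assert embed_dim % num_heads == 0, "embed_dim must be divisible by num_heads"
head_dim = embed_dim // num_heads

# Concatenating all head outputs restores the original embedding width.
concat_dim = num_heads * head_dim

def qkv_params(embed_dim, head_dim):
    """Weights plus biases of the query, key, and value projections of one head."""
    return 3 * (embed_dim * head_dim + head_dim)

one_wide_head = qkv_params(embed_dim, embed_dim)
many_narrow_heads = num_heads * qkv_params(embed_dim, head_dim)
print(head_dim, concat_dim, one_wide_head, many_narrow_heads)
```

Splitting into 12 narrow heads therefore costs the same number of projection parameters as a single head of full width.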

MULTI-HEAD ATTENTION LAYER

In [None]:
# Now that we have a single attention head, we can concatenate the outputs of each one
# to implement the full multi-head attention layer:

class MultiHeadAttention(nn.Module):
    def __init__(self, config):
        super().__init__()
        embed_dim = config.hidden_size
        num_heads = config.num_attention_heads
        head_dim = embed_dim // num_heads
        self.heads = nn.ModuleList(
            [AttentionHead(embed_dim, head_dim) for _ in range(num_heads)]
        )
        self.output_linear = nn.Linear(embed_dim, embed_dim)

    def forward(self, hidden_state):
        x = torch.cat([h(hidden_state) for h in self.heads], dim=-1)
        x = self.output_linear(x)
        return x

In [None]:
multihead_attn = MultiHeadAttention(config)
attn_output = multihead_attn(inputs_embeds)    
attn_output.size() 

torch.Size([1, 5, 768])

**VISUALIZATION WITH BERTVIZ**

In [26]:
#hide_output
from bertviz import head_view
from transformers import AutoModel

model = AutoModel.from_pretrained(model_ckpt, output_attentions=True)

sentence_a = "time flies like an arrow"
sentence_b = "fruit flies like a banana"

viz_inputs = tokenizer(sentence_a, sentence_b, return_tensors='pt')
attention = model(**viz_inputs).attentions
sentence_b_start = (viz_inputs.token_type_ids == 0).sum(dim=1)
tokens = tokenizer.convert_ids_to_tokens(viz_inputs.input_ids[0])

head_view(attention, tokens, sentence_b_start, heads=[8])


Some weights of the model checkpoint at bert-base-uncased were not used when initializing BertModel: ['cls.predictions.transform.LayerNorm.bias', 'cls.seq_relationship.bias', 'cls.seq_relationship.weight', 'cls.predictions.bias', 'cls.predictions.transform.dense.bias', 'cls.predictions.decoder.weight', 'cls.predictions.transform.LayerNorm.weight', 'cls.predictions.transform.dense.weight']
- This IS expected if you are initializing BertModel from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing BertModel from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).


<IPython.core.display.Javascript object>

### The Feed-Forward Layer

The feed-forward sublayer is a simple two-layer fully connected neural network, but with a twist: instead of processing the whole sequence of embeddings as a single vector, it processes each embedding independently.

Often referred to as a **position-wise feed-forward layer**.

A rule of thumb from the literature is for the hidden size of the first layer to be four times the size of the embeddings, and a GELU activation function is most commonly used. **This is where most of the capacity and memorization is hypothesized to happen, and it’s the part that is most often scaled when scaling up the models.**
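
A quick parameter count gives a feel for why this sublayer carries so much of the capacity. This is a back-of-the-envelope sketch for one BERT-base-sized layer; `linear_params` is a hypothetical helper, and the attention count assumes four `hidden_size × hidden_size` projections (query, key, value, output) with biases.

```python
hidden_size = 768                      # BERT-base embedding size
intermediate_size = 4 * hidden_size    # the "four times" rule of thumb

def linear_params(in_features, out_features):
    """Weights plus biases of one fully connected layer."""
    return in_features * out_features + out_features

ffn_params = (linear_params(hidden_size, intermediate_size)
              + linear_params(intermediate_size, hidden_size))
# Query, key, value, and output projections of the attention sublayer.
attn_params = 4 * linear_params(hidden_size, hidden_size)
print(ffn_params, attn_params, ffn_params / attn_params)
```

Under these assumptions the feed-forward sublayer holds roughly twice as many parameters as the attention sublayer.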

In [None]:
# We can implement this as a simple nn.Module as follows:
class FeedForward(nn.Module):
    def __init__(self, config):
        super().__init__()
        self.linear_1 = nn.Linear(config.hidden_size, config.intermediate_size)
        self.linear_2 = nn.Linear(config.intermediate_size, config.hidden_size)
        self.gelu = nn.GELU()
        self.dropout = nn.Dropout(config.hidden_dropout_prob)
        
    def forward(self, x):
        x = self.linear_1(x)
        x = self.gelu(x)
        x = self.linear_2(x)
        x = self.dropout(x)
        return x

In [None]:
# Note that a feed-forward layer such as nn.Linear is usually applied to a tensor of
# shape (batch_size, input_dim), where it acts on each element of the batch dimension
# independently. This is actually true for any dimension except the last one, so
# when we pass a tensor of shape (batch_size, seq_len, hidden_dim) the layer is
# applied to all token embeddings of the batch and sequence independently, which is
# exactly what we want. Let’s test this by passing the attention outputs:

feed_forward = FeedForward(config)
ff_outputs = feed_forward(attn_outputs)
ff_outputs.size()

torch.Size([1, 5, 768])
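
The "position-wise" claim can be verified numerically. The sketch below uses a small stand-in network with toy sizes (an assumption, chosen to keep it fast) and compares one pass over the full tensor with a separate pass per position.

```python
import torch
from torch import nn

torch.manual_seed(0)
toy_ff = nn.Sequential(nn.Linear(8, 32), nn.GELU(), nn.Linear(32, 8))  # toy sizes
x = torch.randn(2, 5, 8)               # [batch_size, seq_len, hidden_dim]

with torch.no_grad():
    all_at_once = toy_ff(x)
    # Apply the same network to each position separately, then restack.
    one_by_one = torch.stack(
        [toy_ff(x[:, i, :]) for i in range(x.size(1))], dim=1
    )

matches = torch.allclose(all_at_once, one_by_one, atol=1e-5)
print(matches)
```

Because no information flows between positions here, all mixing across tokens has to happen in the attention sublayer.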

### Adding Layer Normalization
The Transformer architecture makes use of layer normalization and skip connections. 
* layer normalization - normalizes each input in the batch to have zero mean and unit variance
* skip connection - passes a tensor to the next layer of the model without processing it and adds it to the processed tensor
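
Both operations are small enough to write out by hand. The following is a plain-Python sketch on a single made-up embedding vector; unlike `nn.LayerNorm`, it omits the learnable scale and shift parameters, and the `lambda` sublayer is a hypothetical placeholder.

```python
import math

def layer_norm(vector, eps=1e-5):
    """Normalize one embedding vector to zero mean and unit variance."""
    mean = sum(vector) / len(vector)
    var = sum((v - mean) ** 2 for v in vector) / len(vector)
    return [(v - mean) / math.sqrt(var + eps) for v in vector]

def with_skip(x, sublayer):
    """Skip connection: add the unprocessed input to the sublayer output."""
    return [xi + yi for xi, yi in zip(x, sublayer(x))]

embedding = [2.0, 4.0, 6.0, 8.0]          # made-up 4-dimensional embedding
normed = layer_norm(embedding)
skipped = with_skip(embedding, lambda v: [0.1 * vi for vi in v])
print(normed, skipped)
```

The statistics are computed over the features of one token, so the result does not depend on the other tokens or on the batch.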

<img alt="Transformer layer normalization" height="500" caption="Different arrangements of layer normalization in a transformer encoder layer" src="https://github.com/nlp-with-transformers/notebooks/blob/main/images/chapter03_layer-norm.png?raw=1" id="layer-norm"/>

In [None]:
# We’ll use the second arrangement, which applies layer normalization before each
# sublayer (inside the skip connection), so we can simply stick together our building
# blocks as follows:
class TransformerEncoderLayer(nn.Module):
    def __init__(self, config):
        super().__init__()
        self.layer_norm_1 = nn.LayerNorm(config.hidden_size)
        self.layer_norm_2 = nn.LayerNorm(config.hidden_size)
        self.attention = MultiHeadAttention(config)
        self.feed_forward = FeedForward(config)

    def forward(self, x):
        # Apply layer normalization and then copy input into query, key, value
        hidden_state = self.layer_norm_1(x)
        # Apply attention with a skip connection
        x = x + self.attention(hidden_state)
        # Apply feed-forward layer with a skip connection
        x = x + self.feed_forward(self.layer_norm_2(x))
        return x

In [None]:
encoder_layer = TransformerEncoderLayer(config)
inputs_embeds.shape, encoder_layer(inputs_embeds).size()

(torch.Size([1, 5, 768]), torch.Size([1, 5, 768]))

### Positional Embeddings

* Learnable position embeddings
* Absolute positional embeddings
* Relative positional embeddings
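
As one example of the absolute variant, the fixed sine and cosine pattern from Vaswani et al. (cited earlier) can be sketched in plain Python. This is shown only for comparison: the function name is hypothetical, and the `Embeddings` class in these notes uses learnable position embeddings instead.

```python
import math

def sinusoidal_position_encoding(max_positions, dim):
    """Fixed (non-learned) encodings in the style of Vaswani et al. (2017)."""
    table = []
    for pos in range(max_positions):
        row = []
        for i in range(0, dim, 2):
            angle = pos / (10000 ** (i / dim))
            row.append(math.sin(angle))      # even feature index
            row.append(math.cos(angle))      # odd feature index
        table.append(row[:dim])
    return table

pos_table = sinusoidal_position_encoding(max_positions=5, dim=8)
print(pos_table[0])
print(pos_table[1])
```

Each position gets a distinct vector that can be added to the token embedding, with no parameters to train.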

In [None]:
class Embeddings(nn.Module):
    def __init__(self, config):
        super().__init__()
        self.token_embeddings = nn.Embedding(config.vocab_size, 
                                             config.hidden_size)
        self.position_embeddings = nn.Embedding(config.max_position_embeddings,
                                                config.hidden_size)
        self.layer_norm = nn.LayerNorm(config.hidden_size, eps=1e-12)
        self.dropout = nn.Dropout()

    def forward(self, input_ids):
        # Create position IDs for input sequence
        seq_length = input_ids.size(1)
        position_ids = torch.arange(seq_length, dtype=torch.long).unsqueeze(0)
        # Create token and position embeddings
        token_embeddings = self.token_embeddings(input_ids)
        position_embeddings = self.position_embeddings(position_ids)
        # Combine token and position embeddings
        embeddings = token_embeddings + position_embeddings
        embeddings = self.layer_norm(embeddings)
        embeddings = self.dropout(embeddings)
        return embeddings

In [None]:
embedding_layer = Embeddings(config)
embedding_layer(inputs.input_ids).size()

torch.Size([1, 5, 768])

In [None]:
class TransformerEncoder(nn.Module):
    def __init__(self, config):
        super().__init__()
        self.embeddings = Embeddings(config)
        self.layers = nn.ModuleList([TransformerEncoderLayer(config) 
                                     for _ in range(config.num_hidden_layers)])

    def forward(self, x):
        x = self.embeddings(x)
        for layer in self.layers:
            x = layer(x)
        return x

In [None]:
encoder = TransformerEncoder(config)
encoder(inputs.input_ids).size()

torch.Size([1, 5, 768])

### Adding a Classification Head

**The head is TASK-SPECIFIC**

We have a hidden state for each token, but we only need to make one prediction. There are several options to approach this. Traditionally, the first token in such models is used for the prediction and we can attach a dropout and a linear layer to make a classification prediction.

In [None]:
class TransformerForSequenceClassification(nn.Module):
    def __init__(self, config):
        super().__init__()
        self.encoder = TransformerEncoder(config)
        self.dropout = nn.Dropout(config.hidden_dropout_prob)
        self.classifier = nn.Linear(config.hidden_size, config.num_labels)
        
    def forward(self, x):
        x = self.encoder(x)[:, 0, :] # select hidden state of [CLS] token
        x = self.dropout(x)
        x = self.classifier(x)
        return x

In [None]:
config.num_labels = 3
encoder_classifier = TransformerForSequenceClassification(config)
encoder_classifier(inputs.input_ids).size()

torch.Size([1, 3])

## The Decoder

Let’s see if we can shed some light on the mysteries of encoder-decoder attention.
Imagine you (the decoder) are in class taking an exam. Your task is to predict the next
word based on the previous words (decoder inputs), which sounds simple but is
incredibly hard (try it yourself and predict the next words in a passage of this book).
Fortunately, your neighbor (the encoder) has the full text. Unfortunately, they’re a
foreign exchange student and the text is in their mother tongue. Cunning students
that you are, you figure out a way to cheat anyway. You draw a little cartoon illustrating
the text you already have (the query) and give it to your neighbor. They try to
figure out which passage matches that description (the key), draw a cartoon describing
the word following that passage (the value), and pass that back to you. With this
system in place, you ace the exam.
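
In tensor terms, the scheme above means that the queries come from the decoder while the keys and values come from the encoder. The sketch below shows only the resulting shapes; the sequence lengths and hidden size are toy assumptions, and the learned linear projections of a real layer are left out.

```python
import torch
import torch.nn.functional as F
from math import sqrt

torch.manual_seed(0)
hidden_dim = 16                                   # toy size, assumed
encoder_states = torch.randn(1, 5, hidden_dim)    # 5 source tokens
decoder_states = torch.randn(1, 3, hidden_dim)    # 3 tokens generated so far

# Queries come from the decoder; keys and values come from the encoder.
cross_query = decoder_states
cross_key = cross_value = encoder_states

cross_scores = torch.bmm(cross_query, cross_key.transpose(1, 2)) / sqrt(hidden_dim)
cross_weights = F.softmax(cross_scores, dim=-1)            # [1, 3, 5]
cross_attn_output = torch.bmm(cross_weights, cross_value)  # [1, 3, hidden_dim]
print(cross_weights.shape, cross_attn_output.shape)
```

The score matrix is no longer square: it has one row per decoder token and one column per encoder token.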

<img alt="Transformer decoder zoom" caption="Zooming into the transformer decoder layer" src="https://github.com/nlp-with-transformers/notebooks/blob/main/images/chapter03_decoder-zoom.png?raw=1" id="decoder-zoom"/> 

In [None]:
seq_len = inputs.input_ids.size(-1)
mask = torch.tril(torch.ones(seq_len, seq_len)).unsqueeze(0)
mask[0]

tensor([[1., 0., 0., 0., 0.],
        [1., 1., 0., 0., 0.],
        [1., 1., 1., 0., 0.],
        [1., 1., 1., 1., 0.],
        [1., 1., 1., 1., 1.]])

In [None]:
scores.masked_fill(mask == 0, -float("inf"))

tensor([[[28.9853,    -inf,    -inf,    -inf,    -inf],
         [ 0.3560, 26.3663,    -inf,    -inf,    -inf],
         [ 0.1355, -0.0577, 26.8274,    -inf,    -inf],
         [ 0.5204, -0.9433, -0.6811, 26.7159,    -inf],
         [-0.4255,  0.6048, -0.6670, -0.1803, 29.3061]]],
       grad_fn=<MaskedFillBackward0>)

In [None]:
def scaled_dot_product_attention(query, key, value, mask=None):
    dim_k = query.size(-1)
    scores = torch.bmm(query, key.transpose(1, 2)) / sqrt(dim_k)
    if mask is not None:
        scores = scores.masked_fill(mask == 0, float("-inf"))
    weights = F.softmax(scores, dim=-1)
    return weights.bmm(value)
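
Why `-inf` is the right fill value becomes clear when one row is pushed through the softmax by hand. This is a plain-Python sketch; the scores are made up, and the mask is the third row of the lower-triangular matrix shown above.

```python
import math

def masked_softmax(scores, mask):
    """Softmax over one row of scores; positions with mask == 0 get -inf first."""
    masked = [s if m == 1 else float("-inf") for s, m in zip(scores, mask)]
    top = max(masked)
    exps = [math.exp(s - top) for s in masked]   # exp(-inf) evaluates to 0.0
    total = sum(exps)
    return [e / total for e in exps]

row_scores = [0.5, 1.5, -0.3, 2.0, 0.1]   # made-up scores for the third token
row_mask = [1, 1, 1, 0, 0]                # third row of the lower-triangular mask
row_weights = masked_softmax(row_scores, row_mask)
print(row_weights)
```

The future positions receive a weight of exactly zero, and the remaining weights still sum to 1, so the token attends only to itself and to earlier tokens.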

### Sidebar: Demystifying Encoder-Decoder Attention

### End sidebar

## Meet the Transformers

### The Transformer Tree of Life

<img alt="Transformer family tree" caption="An overview of some of the most prominent transformer architectures" src="https://github.com/nlp-with-transformers/notebooks/blob/main/images/chapter03_transformers-compact.png?raw=1" id="family-tree"/>

### The Encoder Branch

### The Decoder Branch

### The Encoder-Decoder Branch

## Conclusion