In [1]:
# Uncomment and run this cell if you're on Colab or Kaggle
!git clone https://github.com/nlp-with-transformers/notebooks.git
%cd notebooks
from install import *
install_requirements()

/bin/bash: /opt/conda/lib/libtinfo.so.6: no version information available (required by /bin/bash)
Cloning into 'notebooks'...
remote: Enumerating objects: 515, done.
remote: Counting objects: 100% (515/515), done.
remote: Compressing objects: 100% (278/278), done.
remote: Total 515 (delta 245), reused 479 (delta 231), pack-reused 0
Receiving objects: 100% (515/515), 29.39 MiB | 22.72 MiB/s, done.
Resolving deltas: 100% (245/245), done.
/kaggle/working/notebooks
⏳ Installing base requirements ...
✅ Base requirements installed!
⏳ Installing Git LFS ...
✅ Git LFS installed!


In [2]:
#hide
from utils import *
setup_chapter()

No GPU was detected! This notebook can be *very* slow without a GPU 🐢
Go to Settings > Accelerator and select GPU.
Using transformers v4.11.3
Using datasets v1.16.1


# Transformer Anatomy

Encoder-decoder architecture of the transformer, with the encoder shown in the upper half of the figure and the decoder in the lower half.

<img alt="transformer-encoder-decoder" caption="Encoder-decoder architecture of the transformer, with the encoder shown in the upper half of the figure and the decoder in the lower half" src="https://raw.githubusercontent.com/nlp-with-transformers/notebooks/48e4a5e5c44b86e1593c0945a49af9675cfd7158//images/chapter03_transformer-encoder-decoder.png" id="transformer-encoder-decoder"/>

## The Encoder

The transformer's encoder consists of many encoder layers stacked next to each other. As illustrated in the figure below, each encoder layer receives a sequence of embeddings and feeds them through the following sublayers:
* A multi-head self-attention layer
* A fully connected feed-forward layer that is applied to each input embedding

> The main role of the encoder stack is to "update" the input embeddings to produce representations that encode some contextual information in the sequence.

<img alt="encoder-zoom" caption="Zooming into the encoder layer" src="https://raw.githubusercontent.com/nlp-with-transformers/notebooks/48e4a5e5c44b86e1593c0945a49af9675cfd7158//images/chapter03_encoder-zoom.png" id="encoder-zoom"/>

### Self-Attention

* The main idea behind self-attention is that instead of using a fixed embedding for each token, we can use the whole sequence to compute a weighted average of each embedding. In other words, given a sequence of embeddings, self-attention produces a sequence of new embeddings where each new embedding is a linear combination of all original embedding vectors.
* The following diagram shows how self-attention updates raw token embeddings (upper) into contextualized embeddings (lower) that incorporate information from the whole sequence.
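
The weighted average can also be written as an equation. The symbols are assumptions of this note rather than notation defined earlier: $x_j$ are the original embeddings, $x'_i$ is the updated embedding for position $i$, $w_{ji}$ are the attention weights, and $n$ is the sequence length.

$$
x'_i = \sum_{j=1}^{n} w_{ji}\, x_j, \qquad \sum_{j=1}^{n} w_{ji} = 1
$$

The second condition says that the weights for each updated embedding are normalized, which is what makes the sum a weighted average.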

<img alt="Contextualized embeddings" caption="Diagram showing how self-attention updates raw token embeddings (upper) into contextualized embeddings (lower) to create representations that incorporate information from the whole sequence" src="https://raw.githubusercontent.com/nlp-with-transformers/notebooks/48e4a5e5c44b86e1593c0945a49af9675cfd7158//images/chapter03_contextualized-embedding.png" id="contextualized-embeddings"/>

Let's now take a look at how we can calculate the attention weights.

### Scaled dot-product attention

* There are several ways to implement a self-attention layer, but the most common one is scaled dot-product attention. It includes four main steps:
1. Project each token embedding into three vectors: query, key, and value.
2. Compute attention scores by taking the dot product of each query with every key.
3. Compute attention weights by applying softmax to the scores.
4. For a specific token, multiply each value vector by its attention weight and sum the results to get the updated token embedding.
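
Before the PyTorch walk-through, the steps above can be sketched in plain Python on a toy input. This is only an illustration, not the implementation used in this notebook: it skips the projections of step 1 (query, key, and value are the embeddings themselves), and the names `softmax`, `toy_self_attention`, and `toy_embeddings` are made up for this example.

```python
import math

def softmax(xs):
    # subtract the maximum for numerical stability before exponentiating
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

def toy_self_attention(embeddings):
    dim_k = len(embeddings[0])
    updated = []
    for q in embeddings:
        # step 2: scaled dot product of this query with every key
        scores = [sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(dim_k)
                  for k in embeddings]
        # step 3: normalize the scores into attention weights
        weights = softmax(scores)
        # step 4: weighted sum of the value vectors
        updated.append([sum(w * v[d] for w, v in zip(weights, embeddings))
                        for d in range(dim_k)])
    return updated

# three hypothetical 2-dimensional token embeddings
toy_embeddings = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
updated = toy_self_attention(toy_embeddings)
```

Each row of `updated` is a convex combination of the rows of `toy_embeddings`, so every coordinate stays between the smallest and largest input value in that dimension.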

* The following diagram represents the operations to compute scaled dot-product attention. Let's implement it:

<img alt="Operations in scaled dot-product attention" height="125" caption="Operations in scaled dot-product attention" src="https://raw.githubusercontent.com/nlp-with-transformers/notebooks/48e4a5e5c44b86e1593c0945a49af9675cfd7158//images/chapter03_attention-ops.png" id="attention-ops"/>

#### 1. Tokenize the text and get input IDs:

In [3]:
# import the tokenizer
from transformers import AutoTokenizer
model_ckpt = "bert-base-uncased"
text = "time flies like an arrow"
tokenizer = AutoTokenizer.from_pretrained(model_ckpt)

In [4]:
# use our tokenizer to get the input IDs
# for simplicity, exclude the [CLS] and [SEP] tokens
inputs = tokenizer(text, return_tensors="pt", add_special_tokens=False)
inputs.input_ids

tensor([[ 2051, 10029,  2066,  2019,  8612]])

#### 2. Generate Raw Embeddings (non-contextual embedding):
* First, let's instantiate an embedding layer. We can do this with a `torch.nn.Embedding` layer, which acts as a lookup table for each input ID:

In [5]:
from torch import nn
from transformers import AutoConfig
# load config.json associated with bert-base-uncased checkpoint
config = AutoConfig.from_pretrained(model_ckpt)

# instantiate the embedding layer
token_emb = nn.Embedding(config.vocab_size, config.hidden_size)
token_emb

Embedding(30522, 768)

* Now, we can generate our raw embedding vectors:
> NOTE: These vectors are raw, i.e., non-contextual and the position is NOT considered (no positional encoding)

````python
text = "time flies like an arrow"
````

In [6]:
# feed input IDs into the embedding layer to generate non-contextual embedding
inputs_embeds = token_emb(inputs.input_ids)
inputs_embeds.size()
# [batch_size, seq_len, hidden_dim]

torch.Size([1, 5, 768])

In [7]:
inputs_embeds

tensor([[[ 0.8675, -0.7600, -0.8446,  ..., -0.3658,  1.4398,  0.5560],
         [-0.4498,  0.7731, -1.3458,  ...,  1.3209, -0.8076,  0.9622],
         [-0.1648, -1.5417, -1.4514,  ..., -0.5486, -0.9568, -0.7579],
         [-0.9910,  2.3353,  2.0438,  ..., -1.5116, -0.0046, -0.9475],
         [-1.7626, -1.9409,  0.3452,  ..., -0.2773, -0.6623, -0.6120]]],
       grad_fn=<EmbeddingBackward0>)

#### 3. Create query, key, and value vectors:
* Now, let's create the query, key, and value vectors and calculate the attention scores using the dot product as the similarity function:
> **NOTE**: For simplicity, the projection step for **K**, **Q**, and **V** is neglected for now. That is, **K**, **Q**, and **V** are the same as the **raw embedding vector**. Later we'll generate three variants of this **raw vector** by projecting it with different matrices (the weights of those matrices are updated during training).

> **NOTE 2**: `tensor.size()` returns a tuple-like object of dimensions. `tensor.shape` is an attribute of the tensor whereas `size()` is a function. They both return the same value. `.shape` is an alias for `.size()`, and was added to more closely match numpy.

In [8]:
import torch
from math import sqrt 

# K, Q, V are the same
query = key = value = inputs_embeds # size (1, 5, 768)
query.shape, query.size()

(torch.Size([1, 5, 768]), torch.Size([1, 5, 768]))

In [9]:
query.size(-1)

768

#### 4. Calculate attention scores:

In [10]:
# calculate attention scores 
dim_k = key.size(-1) # 768
# torch.bmm takes two batches of matrices and multiplies each matrix from
# the first batch with the corresponding matrix in the second batch
scores = torch.bmm(query, key.transpose(1,2)) / sqrt(dim_k)
scores.size()

torch.Size([1, 5, 5])

In [11]:
query.size(), key.transpose(1,2).size()

(torch.Size([1, 5, 768]), torch.Size([1, 768, 5]))

In [13]:
scores.dim()

3

In [14]:
scores

tensor([[[29.2163, -1.6721,  0.4072,  0.2107,  0.4030],
         [-1.6721, 27.1864,  0.5380,  0.1811,  0.2229],
         [ 0.4072,  0.5380, 27.5311,  1.1563,  1.0918],
         [ 0.2107,  0.1811,  1.1563, 26.4008, -0.3041],
         [ 0.4030,  0.2229,  1.0918, -0.3041, 24.5834]]],
       grad_fn=<DivBackward0>)
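
The division by `sqrt(dim_k)` is what puts the "scaled" in scaled dot-product attention: dot products of high-dimensional vectors grow with the dimension, and large scores would push the softmax toward saturation. The sketch below illustrates this with random vectors. It assumes unit-variance components, which matches the default initialization of `nn.Embedding`; the variable names are made up for this example.

```python
import torch
from math import sqrt

torch.manual_seed(0)
dim_k = 768

# 10,000 random query/key pairs with unit-variance components
q = torch.randn(10_000, dim_k)
k = torch.randn(10_000, dim_k)

raw = (q * k).sum(dim=-1)          # unscaled dot products
scaled = raw / sqrt(dim_k)         # scaled dot products

raw_std = raw.std().item()         # grows roughly like sqrt(dim_k)
scaled_std = scaled.std().item()   # stays close to 1
```

The same reasoning suggests why the diagonal entries of the scores above fall between roughly 24 and 29: the dot product of a randomly initialized embedding with itself is close to `dim_k`, and 768 / sqrt(768) is about 27.7.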

#### 5. Calculate attention weights by applying the softmax function on the attention scores:

<img alt="Operations in scaled dot-product attention" height="125" caption="Operations in scaled dot-product attention" src="https://github.com/ahmad-alismail/NLP-with-Transformers/blob/master/imges/softmax-pytroch.jpg?raw=true" id="attention-ops"/>

In [15]:
import torch.nn.functional as F

weights = F.softmax(scores, dim=-1) # apply softmax along the last dimension, i.e., over the keys, so the weights for each query sum to 1
weights.size()#.sum(dim=-1)

torch.Size([1, 5, 5])
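
A quick sanity check on the softmax step is that the weights for each query sum to 1. The snippet below is a self-contained sketch that uses a random stand-in tensor (`toy_scores`, a hypothetical name) instead of the `scores` computed above.

```python
import torch
import torch.nn.functional as F

torch.manual_seed(0)
toy_scores = torch.randn(1, 5, 5)             # stand-in for a (batch, query, key) scores tensor
toy_weights = F.softmax(toy_scores, dim=-1)   # normalize over the key dimension
row_sums = toy_weights.sum(dim=-1)            # one sum per query, shape (1, 5)
```

Running `weights.sum(dim=-1)` on the real tensor in the cell above performs the same check.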

#### 6. Update the raw token embeddings: multiply the attention weights by the values:

In [16]:
attn_outputs = torch.bmm(weights, value)
attn_outputs.size()

torch.Size([1, 5, 768])

#### Wrap the previous steps in a function:

In [17]:
def scaled_dot_product_attention(query, key, value):
    dim_k = query.size(-1)
    scores = torch.bmm(query, key.transpose(1, 2)) / sqrt(dim_k)
    weights = F.softmax(scores, dim=-1) # apply softmax along the last dimension, i.e., over the keys
    return torch.bmm(weights, value)

#### Multi-headed attention

* We just used the **embedding** "as is" (in the form of key, query, value) to compute the attention scores and weights. In practice, the self-attention layer applies three independent linear transformations to each embedding to generate different variations of the **embedding** in the form of *key, query, and value*. These transformations project the **embeddings**, and each projection carries its own set of learnable parameters.
* It is also beneficial to have multiple sets of linear projections, each one representing a so-called ***attention head***. The resulting **multi-head attention layer** is illustrated below.
> Why do we need more than one attention head? Each head tends to focus on one aspect of similarity. For instance, one head can focus on subject-verb interaction, whereas another finds nearby adjectives.

<img alt="Multi-head attention" height="125" caption="Multi-head attention" src="https://raw.githubusercontent.com/nlp-with-transformers/notebooks/48e4a5e5c44b86e1593c0945a49af9675cfd7158//images/chapter03_multihead-attention.png" id="multihead-attention"/>

---
<img alt="Multi-head attention" height="150" caption="Multi-head attention" src="https://github.com/ahmad-alismail/NLP-with-Transformers/blob/master/imges/1-s2.0-S0926580521000595-gr5.jpg?raw=true" id="multihead-attention"/>

* In the `AttentionHead` class defined below, `self.q(hidden_state)` calls the linear transformation layer `q` with the input `hidden_state`. The `q` layer is an instance of `nn.Linear` and is defined in the `__init__` method of the `AttentionHead` class as follows:
````python
self.q = nn.Linear(embed_dim, head_dim)
````
* This layer takes an input tensor of shape `(batch_size, sequence_length, embed_dim)` and applies a linear transformation to it to produce an output tensor of shape `(batch_size, sequence_length, head_dim)`. 
* So when `self.q(hidden_state)` is called in the forward method, it applies the linear transformation to the input `hidden_state` and produces an output tensor of shape `(batch_size, sequence_length, head_dim)` which is used in further calculations to compute the attention scores.

* In the forward pass of the `AttentionHead` class, the input `hidden_state` is used to calculate the scaled dot-product attention for one attention head.
* The attention head projects the input `hidden_state` from the original `embed_dim` to a lower `head_dim` using three linear transformation layers: `q`, `k`, and `v`.
* The output of the attention head is the weighted sum of the value projections of `hidden_state`, based on the attention weights calculated by the head.

In [18]:
# create a class for one attention head
class AttentionHead(nn.Module):
    # initialize the class
    def __init__(self, embed_dim, head_dim): # embed_dim=768, head_dim=64 number of dimensions we want to project the hidden state to
        # call the parent class
        super().__init__()
        # create the linear transformation layers
        self.q = nn.Linear(embed_dim, head_dim) # project (1, 5, 768) -> (1, 5, 64)
        self.k = nn.Linear(embed_dim, head_dim) # project (1, 5, 768) -> (1, 5, 64)
        self.v = nn.Linear(embed_dim, head_dim) # project (1, 5, 768) -> (1, 5, 64)

    # define the forward pass of the class
    def forward(self, hidden_state):                         # hidden_state.size() = inputs_embeds.size() = (1, 5, 768) = (batch_size, seq_len, hidden_size) 
        # project the hidden state once per role; each result has size (1, 5, 64)
        q, k, v = self.q(hidden_state), self.k(hidden_state), self.v(hidden_state)
        dim_k = k.size(-1)                                      # last dimension of the projected key: 64
        scores = torch.bmm(q, k.transpose(1, 2)) / sqrt(dim_k)  # attention scores (1, 5, 64) * (1, 64, 5) = (1, 5, 5)
        attn_weights = F.softmax(scores, dim=-1)                # attention weights (1, 5, 5)
        return torch.bmm(attn_weights, v)                       # attention outputs (1, 5, 5) * (1, 5, 64) = (1, 5, 64)

            

* Now that we have a **single attention head**, we can **concatenate** the outputs of several heads to implement the full **multi-head attention** layer:

> Notice that the concatenated output from the attention heads is also fed through a final linear layer to produce an output tensor of shape `[batch_size, seq_len, hidden_dim]` that is suitable for the feed-forward network downstream.

In [20]:
# create a class for the multi-head attention
class MultiHeadAttention(nn.Module):
    # initialize the class
    def __init__(self, config):
        # call the parent class
        super().__init__()
        
        # derive the size of each head from the config
        embed_dim = config.hidden_size         # 768
        num_heads = config.num_attention_heads # 12
        head_dim = embed_dim // num_heads      # 64

        # create the attention heads
        self.heads = nn.ModuleList(
            [AttentionHead(embed_dim, head_dim) for _ in range(num_heads)] # create 12 attention heads where each K,V,Q is of size 64
        )
        # create the output linear transformation
        self.output_linear = nn.Linear(embed_dim, embed_dim)
        
    # define the forward pass
    def forward(self, hidden_state):           # hidden_state.size() = inputs_embeds.size() = (1, 5, 768) = (batch_size, seq_len, hidden_size)
        # concatenate the attention outputs of the 12 attention heads
        x = torch.cat([h(hidden_state) for h in self.heads], dim=-1)
        # apply the output linear transformation
        x = self.output_linear(x)
        return x

In [21]:
multihead_attn = MultiHeadAttention(config)
attn_output = multihead_attn(inputs_embeds)    
attn_output.size() 

torch.Size([1, 5, 768])
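
As a rough check on the size of this layer, the number of learnable parameters can be worked out by hand. The arithmetic below assumes the values used in this notebook (`embed_dim=768`, 12 heads, every `nn.Linear` with a bias); `linear_params` is a helper made up for this example.

```python
embed_dim, num_heads = 768, 12
head_dim = embed_dim // num_heads            # 64

def linear_params(in_features, out_features):
    # weight matrix plus bias vector of one nn.Linear layer
    return in_features * out_features + out_features

per_head = 3 * linear_params(embed_dim, head_dim)    # q, k, and v projections
output_proj = linear_params(embed_dim, embed_dim)    # final output_linear layer
total = num_heads * per_head + output_proj           # 2,362,368 parameters
```

Because 12 heads of size 64 add up to 768, the total equals that of four 768-to-768 linear layers. It can be compared against `sum(p.numel() for p in multihead_attn.parameters())`.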

In [22]:
multihead_attn

MultiHeadAttention(
  (heads): ModuleList(
    (0): AttentionHead(
      (q): Linear(in_features=768, out_features=64, bias=True)
      (k): Linear(in_features=768, out_features=64, bias=True)
      (v): Linear(in_features=768, out_features=64, bias=True)
    )
    (1): AttentionHead(
      (q): Linear(in_features=768, out_features=64, bias=True)
      (k): Linear(in_features=768, out_features=64, bias=True)
      (v): Linear(in_features=768, out_features=64, bias=True)
    )
    (2): AttentionHead(
      (q): Linear(in_features=768, out_features=64, bias=True)
      (k): Linear(in_features=768, out_features=64, bias=True)
      (v): Linear(in_features=768, out_features=64, bias=True)
    )
    (3): AttentionHead(
      (q): Linear(in_features=768, out_features=64, bias=True)
      (k): Linear(in_features=768, out_features=64, bias=True)
      (v): Linear(in_features=768, out_features=64, bias=True)
    )
    (4): AttentionHead(
      (q): Linear(in_features=768, out_features=64, bias=

* It works! Let's use `BertViz` to visualize the attention for two different uses of the word "flies".
* The visualization shows the **attention weights** as lines connecting the token whose embedding is getting updated **(left)** with every word that is being attended to **(right)**.
* The **intensity** of the lines indicates the strength of the attention weights.

In [23]:
#hide_output
from bertviz import head_view
from transformers import AutoModel

model = AutoModel.from_pretrained(model_ckpt, output_attentions=True)

sentence_a = "time flies like an arrow"
sentence_b = "fruit flies like a banana"

viz_inputs = tokenizer(sentence_a, sentence_b, return_tensors='pt')
attention = model(**viz_inputs).attentions
sentence_b_start = (viz_inputs.token_type_ids == 0).sum(dim=1)
tokens = tokenizer.convert_ids_to_tokens(viz_inputs.input_ids[0])

head_view(attention, tokens, sentence_b_start, heads=[8])

### The Feed-Forward Layer (position-wise feed-forward layer):

<img alt="Transformer layer normalization" height="500" caption="Different arrangements of layer normalization in a transformer encoder layer" src="https://raw.githubusercontent.com/nlp-with-transformers/notebooks/48e4a5e5c44b86e1593c0945a49af9675cfd7158//images/chapter03_layer-norm.png" id="layer-norm"/>

* The feed-forward sublayer in the encoder and decoder is just a simple two-layer fully connected neural network, but with a twist: **instead** of processing the **whole sequence** of embeddings as a single vector, it processes **each embedding** independently. For this reason, this layer is often referred to as a **position-wise feed-forward layer**.
* A rule of thumb from the literature is for the **hidden size** of the first layer to be four times the size of the **embeddings**, and a **GELU** activation function is most commonly used. We can implement this as a simple `nn.Module` as follows:

````python
config.hidden_size, config.intermediate_size
(768, 3072)
````

In [24]:
# create a class for the position-wise feed-forward network
class FeedForward(nn.Module):
    # initialize the class
    def __init__(self, config):                                  
        # call the parent class
        super().__init__()
        self.linear_1 = nn.Linear(config.hidden_size, config.intermediate_size) # project (1, 5, 768) -> (1, 5, 3072)
        self.linear_2 = nn.Linear(config.intermediate_size, config.hidden_size) # project (1, 5, 3072) -> (1, 5, 768)
        self.gelu = nn.GELU()                                                   # use the GELU activation function
        self.dropout = nn.Dropout(config.hidden_dropout_prob)                   # use the dropout probability from the config
    
    # define the forward pass of the FeedForward class
    def forward(self, x):
        x = self.linear_1(x)
        x = self.gelu(x)
        x = self.linear_2(x)
        x = self.dropout(x)
        return x

````python
attn_outputs.size()
> torch.Size([1, 5, 768])
````

In [25]:
feed_forward = FeedForward(config)
ff_outputs = feed_forward(attn_outputs)
ff_outputs.size()

torch.Size([1, 5, 768])
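
The "position-wise" property can be checked directly: applying the network to the whole sequence gives the same result as applying it to each position separately. The sketch below uses a small stand-in network built with `nn.Sequential` rather than the `FeedForward` class above, with toy sizes and no dropout; these choices are assumptions made to keep the example self-contained and deterministic.

```python
import torch
from torch import nn

torch.manual_seed(0)
hidden_size, intermediate_size = 8, 32        # toy sizes, same 4x ratio as 768 -> 3072

ffn = nn.Sequential(
    nn.Linear(hidden_size, intermediate_size),
    nn.GELU(),
    nn.Linear(intermediate_size, hidden_size),
)

x = torch.randn(1, 5, hidden_size)            # (batch_size, seq_len, hidden_size)
full = ffn(x)                                 # process the whole sequence at once
# process each position on its own and stack the results along the sequence axis
per_position = torch.stack([ffn(x[:, i, :]) for i in range(x.size(1))], dim=1)
```

The two tensors agree up to floating-point error, which means no information flows between positions in this sublayer; mixing across positions happens only in the attention sublayer.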

### Adding Layer Normalization

The transformer architecture makes use of layer normalization and skip connections:
* **Layer Normalization**: normalizes each input in the batch to have zero mean and unit variance.
* **Skip Connections**: pass a **tensor** to the next layer of the model **without** processing and add it to the **processed tensor**.
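
To make the normalization concrete, the sketch below applies `nn.LayerNorm` to a random tensor that is deliberately shifted and scaled. With `nn.LayerNorm(hidden_size)`, the statistics are computed over the last (hidden) dimension, separately for every token. The input tensor and variable names are assumptions made for this example.

```python
import torch
from torch import nn

torch.manual_seed(0)
hidden_size = 768
layer_norm = nn.LayerNorm(hidden_size)

# random hidden states with mean 2 and standard deviation 3
x = 3.0 * torch.randn(1, 5, hidden_size) + 2.0
normed = layer_norm(x)

means = normed.mean(dim=-1)                        # one mean per token, close to 0
variances = normed.var(dim=-1, unbiased=False)     # one variance per token, close to 1
```

The learnable scale and shift of `nn.LayerNorm` are initialized to 1 and 0, so a freshly created layer only normalizes.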


When it comes to placing the layer normalization and skip connections, there are two choices:
1. **Post** layer normalization: It places layer normalization **in between** the skip connections.
2. **Pre** layer normalization: It places layer normalization **within the span** of the skip connections (more stable during training).

The pre layer normalization arrangement is used in the following code:

In [27]:
# create a class for an encoder layer (with pre layer normalization)
class TransformerEncoderLayer(nn.Module):
    # initialize the class
    def __init__(self, config): # config = BertConfig
        # call the parent class
        super().__init__()
        self.layer_norm_1 = nn.LayerNorm(config.hidden_size) # layer normalization
        self.layer_norm_2 = nn.LayerNorm(config.hidden_size) # layer normalization
        self.attention = MultiHeadAttention(config)          # multi-head attention
        self.feed_forward = FeedForward(config)              # feed-forward layer

    def forward(self, x):
        # Apply layer normalization and then copy input into query, key, value
        hidden_state = self.layer_norm_1(x)
        # Apply attention with a skip connection
        x = x + self.attention(hidden_state)
        # Apply feed-forward layer with a skip connection
        x = x + self.feed_forward(self.layer_norm_2(x))
        return x

In [28]:
encoder_layer = TransformerEncoderLayer(config)
inputs_embeds.shape, encoder_layer(inputs_embeds).size()

(torch.Size([1, 5, 768]), torch.Size([1, 5, 768]))
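
For comparison, the two placements of layer normalization can be written side by side for a single sublayer. This is a simplified sketch: a lone `nn.Linear` stands in for the attention or feed-forward sublayer, and the function names `pre_ln` and `post_ln` are made up for this example.

```python
import torch
from torch import nn

torch.manual_seed(0)
hidden_size = 8
layer_norm = nn.LayerNorm(hidden_size)
sublayer = nn.Linear(hidden_size, hidden_size)   # stand-in for attention or feed-forward

def pre_ln(x):
    # normalize first, then add the sublayer output to the untouched input
    return x + sublayer(layer_norm(x))

def post_ln(x):
    # add the sublayer output to the input first, then normalize the sum
    return layer_norm(x + sublayer(x))

x = torch.randn(1, 5, hidden_size)
pre_out, post_out = pre_ln(x), post_ln(x)
```

With pre layer normalization the skip path carries `x` through unchanged, whereas with post layer normalization every output passes through the normalization.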

We've implemented our first transformer encoder layer from scratch. However, the information on **token position is not available** yet, since the **multi-head self-attention** is just a fancy weighted sum **without** any positional information. Let's see a trick to **incorporate** positional information on tokens using **positional embeddings**.
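
The lack of positional information can be demonstrated with a small experiment: if the input tokens are shuffled, the attention outputs are shuffled in exactly the same way, so the layer cannot tell one ordering from another. The sketch below uses projection-free attention on random embeddings; the helper `attention` and the chosen permutation are assumptions made for this example.

```python
import torch
import torch.nn.functional as F
from math import sqrt

def attention(x):
    # projection-free scaled dot-product self-attention
    scores = torch.bmm(x, x.transpose(1, 2)) / sqrt(x.size(-1))
    return torch.bmm(F.softmax(scores, dim=-1), x)

torch.manual_seed(0)
x = torch.randn(1, 5, 16)                  # (batch_size, seq_len, hidden_size)
perm = torch.tensor([4, 2, 0, 3, 1])       # an arbitrary reordering of the 5 tokens

out_then_perm = attention(x)[:, perm, :]   # attend first, then reorder the outputs
perm_then_out = attention(x[:, perm, :])   # reorder the inputs, then attend
```

Both results match up to floating-point error, so "time flies like an arrow" and any shuffled version of it would produce the same set of output vectors.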

### Positional Embeddings

In [58]:
# create a class for embedding layer that combines token and position embeddings
# position embeddings are learnable
class Embeddings(nn.Module):
    # initialize the class
    def __init__(self, config):
        # call the parent class
        super().__init__()

        # create the token and position embeddings 
        self.token_embeddings = nn.Embedding(config.vocab_size,     # 30522, the size of the vocabulary
                                             config.hidden_size)    # 768, the hidden size of the model
        self.position_embeddings = nn.Embedding(config.max_position_embeddings, # 512, maximum sequence length that this model might ever be used with
                                                config.hidden_size)             # 768, the hidden size of the model
        self.layer_norm = nn.LayerNorm(config.hidden_size, eps=1e-12)            
        self.dropout = nn.Dropout()

    # define the forward pass of the embedding layer
    def forward(self, input_ids):                                              # tensor([[ 2051, 10029,  2066,  2019,  8612]])
        # Create position IDs for input sequence
        seq_length = input_ids.size(1)                                         # 5
        position_ids = torch.arange(seq_length, dtype=torch.long).unsqueeze(0) # tensor([[0, 1, 2, 3, 4]])
        # Create token and position embeddings
        token_embeddings = self.token_embeddings(input_ids)
        position_embeddings = self.position_embeddings(position_ids)
        # Combine token and position embeddings
        embeddings = token_embeddings + position_embeddings
        embeddings = self.layer_norm(embeddings)
        embeddings = self.dropout(embeddings)
        return embeddings

````python
text
> 'time flies like an arrow'
inputs.input_ids
> tensor([[ 2051, 10029,  2066,  2019,  8612]])

````

In [59]:
# instantiate the embedding layer
embedding_layer = Embeddings(config)
# apply the embedding layer to the input_ids
embedding_layer(inputs.input_ids).size()

torch.Size([1, 5, 768])

* Let's put all of this together now by building the **full transformer encoder** that combines the **embeddings** with the **encoder** layers:

> `nn.ModuleList()` is used to hold a list of `nn.Module` objects in PyTorch. It allows us to define a list of `nn.Module` objects and registers them as sub-modules of the parent `nn.Module`. This means that any parameters of these sub-modules will be registered with the parent module and can be accessed during training.
> 
> In this case, `self.layers` is a list of `TransformerEncoderLayer` objects, which are sub-modules of the `TransformerEncoder` class. This allows the `forward()` method of the `TransformerEncoder` class to iterate over each `TransformerEncoderLayer` in `self.layers` and apply them to the input `x` in a loop.
> 
> Using `nn.ModuleList()` is particularly useful when the number of sub-modules in a module is dynamic and not known beforehand, as it allows for easy iteration over the sub-modules during the forward pass.
---
> NOTE: By using `_` as the variable name, we indicate to the reader of the code that the loop variable is **not** actually used **in the loop body**. This makes the code more concise and easier to read by indicating that the variable is simply a placeholder for iterating over the range.
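
The registration behaviour of `nn.ModuleList` can be seen by comparing it with a plain Python list. The two classes below are hypothetical and exist only for this comparison.

```python
from torch import nn

class WithPlainList(nn.Module):
    def __init__(self):
        super().__init__()
        # layers kept in a plain list are NOT registered as sub-modules
        self.layers = [nn.Linear(4, 4) for _ in range(3)]

class WithModuleList(nn.Module):
    def __init__(self):
        super().__init__()
        # layers kept in nn.ModuleList are registered as sub-modules
        self.layers = nn.ModuleList([nn.Linear(4, 4) for _ in range(3)])

plain_count = sum(p.numel() for p in WithPlainList().parameters())        # 0
registered_count = sum(p.numel() for p in WithModuleList().parameters())  # 3 * (16 + 4) = 60
```

With the plain list, an optimizer built from `model.parameters()` would see no parameters for these layers, and calls such as `model.to(device)` would skip them.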

In [60]:
# create a class for the full transformer encoder
class TransformerEncoder(nn.Module):
    # initialize the class
    def __init__(self, config):
        # call the parent class
        super().__init__()
        self.embeddings = Embeddings(config)                                # create the embedding layer
        self.layers = nn.ModuleList([TransformerEncoderLayer(config)        # create the encoder layers
                                     for _ in range(config.num_hidden_layers)])
    
    # define the forward pass for the transformer class
    def forward(self, x):
        x = self.embeddings(x)
        for layer in self.layers:
            x = layer(x)
        return x

In [61]:
# check the output shapes of the encoder
encoder = TransformerEncoder(config)
encoder(inputs.input_ids).size()

torch.Size([1, 5, 768])

* So far, we have a **hidden state** for **each token** in the **batch**, but what if we only need to make **one prediction**? Traditionally, the first token in such models is used for the prediction. Let's see how we could do it in the following section:

### Adding a Classification Head

There are different ways to make a prediction based on the hidden states of the transformer encoder part. Traditionally, the first token in such models is used for the prediction and we can attach a dropout and a linear layer to make a classification prediction. The following class extends the existing encoder for sequence classification:

In [62]:
# create a class for the transformer with classification head
class TransformerForSequenceClassification(nn.Module):
    # initialize the class
    def __init__(self, config):
        # call the parent class
        super().__init__()
        self.encoder = TransformerEncoder(config) 
        self.dropout = nn.Dropout(config.hidden_dropout_prob) 
        self.classifier = nn.Linear(config.hidden_size, config.num_labels) # 768, num_labels number of labels will be defined later
    
    # define the forward pass of the model
    # note that no activation function is applied to the output of the classifier
    def forward(self, x):
        x = self.encoder(x)[:, 0, :] # select hidden state of the first token ([CLS] when the tokenizer adds special tokens)
                                     # : all elements along the first axis (batch size)
                                     # 0 the first element along the second axis (sequence length, i.e., the hidden state of [CLS])
                                     # : all elements along the third axis (hidden size). see the tensor image above
        x = self.dropout(x)
        x = self.classifier(x)       # returns unnormalized logits for each class
        return x

In [63]:
# define how many classes we would like to predict
config.num_labels = 3
encoder_classifier = TransformerForSequenceClassification(config)
encoder_classifier(inputs.input_ids).size()

torch.Size([1, 3])

In [64]:
# returns unnormalized logits for each class
encoder_classifier(inputs.input_ids)

tensor([[-0.4565, -1.1210, -0.4670]], grad_fn=<AddmmBackward0>)
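
The logits above are unnormalized, so a softmax is needed to read them as class probabilities. The sketch below copies the three logit values from the output above into a new tensor; since the model is untrained, the resulting probabilities carry no meaning beyond illustrating the conversion.

```python
import torch
import torch.nn.functional as F

# logits copied from the output above (untrained model, random weights)
logits = torch.tensor([[-0.4565, -1.1210, -0.4670]])

probs = F.softmax(logits, dim=-1)          # probabilities over the 3 classes
predicted_class = probs.argmax(dim=-1)     # index of the most likely class
```

During training, the raw logits would typically be passed to a loss such as `nn.CrossEntropyLoss`, which applies the softmax internally.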

> In the previous code, applying ``nn.Dropout`` on the hidden state of the ``[CLS]`` token makes sense because the ``[CLS]`` token is typically used as a representation of the entire input sequence in various natural language processing tasks, such as text classification, sentiment analysis, and question answering.
> 
> By applying dropout to the hidden state of the ``[CLS]`` token, we encourage the model to learn more robust and generalizable representations of the input sequence. Dropout has the effect of reducing the co-adaptation between neurons, which can help to prevent overfitting and encourage the model to learn more diverse and informative features.
> 
> Moreover, applying dropout to the ``[CLS]`` token is a commonly used practice in transformer-based models for text classification tasks. Note that the ``[CLS]`` token was introduced with BERT *(Devlin et al., 2018)* rather than in the original transformer paper by *Vaswani et al. (2017)*, which addressed machine translation and has no such token. Implementations such as `BertForSequenceClassification` in Hugging Face Transformers apply dropout to the pooled ``[CLS]`` representation before the linear classifier.
> 
> Therefore, applying dropout on the hidden state of the ``[CLS]`` token can be a useful technique to regularize the model and improve its generalization performance.
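
One practical detail is that dropout only acts in training mode. The sketch below assumes `p=0.1`, the default `hidden_dropout_prob` of a BERT config, and uses a tensor of ones as a stand-in for the `[CLS]` hidden state.

```python
import torch
from torch import nn

torch.manual_seed(0)
dropout = nn.Dropout(p=0.1)
x = torch.ones(1, 768)       # stand-in for the hidden state of the first token

dropout.train()              # training mode: elements are zeroed with probability p
train_out = dropout(x)       # surviving elements are scaled by 1 / (1 - p)

dropout.eval()               # evaluation mode: dropout is a no-op
eval_out = dropout(x)
```

Calling `model.eval()` before inference therefore makes the classifier's predictions deterministic.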


## The Decoder

<img alt="Transformer decoder zoom" caption="Zooming into the transformer decoder layer" src="images/chapter03_decoder-zoom.png" id="decoder-zoom"/> 

In [None]:
seq_len = inputs.input_ids.size(-1)
mask = torch.tril(torch.ones(seq_len, seq_len)).unsqueeze(0)
mask[0]

tensor([[1., 0., 0., 0., 0.],
        [1., 1., 0., 0., 0.],
        [1., 1., 1., 0., 0.],
        [1., 1., 1., 1., 0.],
        [1., 1., 1., 1., 1.]])

In [None]:
scores.masked_fill(mask == 0, -float("inf"))

tensor([[[26.8082,    -inf,    -inf,    -inf,    -inf],
         [-0.6981, 26.9043,    -inf,    -inf,    -inf],
         [-2.3190,  1.2928, 27.8710,    -inf,    -inf],
         [-0.5897,  0.3497, -0.3807, 27.5488,    -inf],
         [ 0.5275,  2.0493, -0.4869,  1.6100, 29.0893]]],
       grad_fn=<MaskedFillBackward0>)

In [None]:
def scaled_dot_product_attention(query, key, value, mask=None):
    dim_k = query.size(-1)
    scores = torch.bmm(query, key.transpose(1, 2)) / sqrt(dim_k)
    if mask is not None:
        scores = scores.masked_fill(mask == 0, float("-inf"))
    weights = F.softmax(scores, dim=-1)
    return weights.bmm(value)
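
The effect of the mask on the attention weights can be inspected directly. The sketch below repeats the masking steps on random embeddings, without projections, and keeps the weights instead of multiplying them by the values; the tensor sizes are arbitrary choices for this example.

```python
import torch
import torch.nn.functional as F
from math import sqrt

torch.manual_seed(0)
seq_len, dim_k = 5, 16
x = torch.randn(1, seq_len, dim_k)

# lower-triangular mask: position i may attend only to positions <= i
mask = torch.tril(torch.ones(seq_len, seq_len)).unsqueeze(0)

scores = torch.bmm(x, x.transpose(1, 2)) / sqrt(dim_k)
masked_scores = scores.masked_fill(mask == 0, float("-inf"))
weights = F.softmax(masked_scores, dim=-1)   # -inf scores become weights of exactly 0
```

Every entry above the diagonal of `weights` is zero, and each row still sums to 1, so a token's updated representation depends only on itself and the tokens before it.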

### Sidebar: Demystifying Encoder-Decoder Attention

### End sidebar

## Meet the Transformers

### The Transformer Tree of Life

<img alt="Transformer family tree" caption="An overview of some of the most prominent transformer architectures" src="images/chapter03_transformers-compact.png" id="family-tree"/>

### The Encoder Branch

### The Decoder Branch

### The Encoder-Decoder Branch

## Conclusion