# Encoders & Decoders
This notebook demonstrates building an encoder-decoder model from scratch.

Specifically, it walks through the following steps:
- Generate tokens (tokenization) and token embeddings (embedding)
- Build the encoder structure
    - Multi-head self-attention layer
    - Fully connected feed-forward layer
- Build the decoder structure
    - Masked multi-head self-attention layer
    - Encoder-decoder attention layer
    - Fully connected feed-forward layer

<BR>
<div style="border:1px solid black; padding-top: 10px; padding-bottom: 10px; border-radius: 50px;">
    <center>
        <b>** Full disclosure **</b><BR>
        This notebook is based on chapter 3 of the following book<BR><BR>
        <b>Natural Language Processing with Transformers</b><BR><i>Lewis Tunstall, Leandro von Werra & Thomas Wolf</i>
    </center>
</div>
<BR>

## Setup
Setting things up!

In [1]:
import logging
import sys
from textwrap import TextWrapper

import datasets
import huggingface_hub
import matplotlib.font_manager as font_manager
import matplotlib.pyplot as plt
import matplotlib_inline
import torch
import transformers

is_gpu_available = torch.cuda.is_available()

# Give visibility on versions of the core libraries
def display_library_version(library):
    print(f"Using {library.__name__} v{library.__version__}")
display_library_version(transformers)
display_library_version(datasets)

# Disable all info / warning messages
transformers.logging.set_verbosity_error()
datasets.logging.set_verbosity_error()

# Use O'Reilly style for plots
def install_mpl_fonts():
    font_dir = ["./orm_fonts/"]
    for font in font_manager.findSystemFonts(font_dir):
        font_manager.fontManager.addfont(font)
install_mpl_fonts()
matplotlib_inline.backend_inline.set_matplotlib_formats("pdf", "svg")
plt.style.use("plotting.mplstyle")
logging.getLogger("matplotlib").setLevel(level=logging.ERROR)

Using transformers v4.49.0
Using datasets v3.6.0


In [2]:
%%javascript
(function() {
    var script = document.createElement('script');
    script.src = 'https://cdnjs.cloudflare.com/ajax/libs/require.js/2.3.6/require.min.js';
    script.onload = function() {
        require.config({
            paths: {
                d3: '//cdnjs.cloudflare.com/ajax/libs/d3/3.4.8/d3.min',
                jquery: '//ajax.googleapis.com/ajax/libs/jquery/2.0.0/jquery.min',
            }
        });

        // Now you can use require
        require(['d3', 'jquery'], function(d3, $) {
            console.log('Loaded D3 and jQuery:', d3, $);
        });
    };
    document.head.appendChild(script);
})();



## Encoder
We'll start with the encoder, whose layers each consist of:
- Multi-head self-attention layer
- Fully connected feed-forward layer

### Self-attention

In [3]:
# retrieve BERT tokenizer
from transformers import AutoTokenizer
model_ckpt = "bert-base-uncased"
tokenizer = AutoTokenizer.from_pretrained(model_ckpt)
text = "time flies like an arrow"

In [4]:
# convert text to tokens
inputs = tokenizer(text, return_tensors="pt", add_special_tokens=False)
inputs.input_ids

tensor([[ 2051, 10029,  2066,  2019,  8612]])

In [5]:
# retrieve BERT embedding layer
from torch import nn
from transformers import AutoConfig

config = AutoConfig.from_pretrained(model_ckpt)
token_emb = nn.Embedding(config.vocab_size, config.hidden_size)
token_emb

Embedding(30522, 768)

In [6]:
# convert tokens to embeddings
inputs_embeds = token_emb(inputs.input_ids)
inputs_embeds.size()

torch.Size([1, 5, 768])

With the embeddings in hand, it is time to implement the self-attention layer. Its goal is to compute attention weights that mix context from the other tokens into each embedding.

First, each embedding is projected into 3 vectors called the *query*, *key* and *value*. In the code below, this is done naively by letting all 3 vectors be identical to the original embedding, which has some drawbacks that the multi-head section addresses.

In [7]:
# project each embedding into 3 vectors: query, key & value
import torch
from math import sqrt 
query = key = value = inputs_embeds
dim_k = key.size(-1)

Second, attention scores are obtained via a similarity function that computes a pairwise relevance score for every query-key pair. The similarity function used below is the **scaled dot product**: the dot product of query and key, divided by the square root of the key dimension.

In [8]:
# get attention scores
scores = torch.bmm(query, key.transpose(1,2)) / sqrt(dim_k)
scores.size()

torch.Size([1, 5, 5])

Dot products can be arbitrarily large, which is why the scores were divided by `sqrt(dim_k)` above. We now apply the softmax function to each row to obtain attention weights that sum up to 1.

In [9]:
# get attention weights
import torch.nn.functional as F
weights = F.softmax(scores, dim=-1)
weights.sum(dim=-1)

tensor([[1., 1., 1., 1., 1.]], grad_fn=<SumBackward1>)
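
To see why the division by `sqrt(dim_k)` matters, here is a minimal sketch in plain Python (no PyTorch, so the arithmetic stays visible) that compares softmax on raw and on scaled scores. The score values and the helper name `softmax` are made up for illustration; only the scaling rule comes from the cells above.

```python
from math import exp, sqrt

def softmax(xs):
    """Numerically stable softmax over a list of floats."""
    m = max(xs)
    exps = [exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

dim_k = 64
# Hypothetical raw dot products between one query and three keys
raw_scores = [16.0, 8.0, 0.0]
scaled_scores = [s / sqrt(dim_k) for s in raw_scores]

raw_weights = softmax(raw_scores)
scaled_weights = softmax(scaled_scores)

print([round(w, 4) for w in raw_weights])
print([round(w, 4) for w in scaled_weights])
```

With the raw scores, almost all of the weight lands on the first key, whereas the scaled scores spread the weight over all three keys. Both weight vectors still sum to 1.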

Finally, we multiply the attention weights by the *value* vectors to get updated embeddings.

In [10]:
# multiply attention weights by the values
attn_outputs = torch.bmm(weights, value)
attn_outputs.shape

torch.Size([1, 5, 768])

The steps above can be wrapped into a single function that takes the *query*, *key* and *value* tensors.

In [11]:
def scaled_dot_product_attention(query, key, value):
    dim_k = query.size(-1)
    scores = torch.bmm(query, key.transpose(1, 2)) / sqrt(dim_k)
    weights = F.softmax(scores, dim=-1)
    return torch.bmm(weights, value)
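
The function above can be traced by hand on a toy input. The following sketch re-implements the same three steps (scaled dot products, softmax, weighted average of the values) with plain Python lists for a single sequence. The name `toy_attention` and the 2-dimensional embeddings are hypothetical and chosen only to keep the numbers readable.

```python
from math import exp, sqrt

def softmax(xs):
    """Numerically stable softmax over a list of floats."""
    m = max(xs)
    exps = [exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

def toy_attention(query, key, value):
    """Scaled dot-product attention for one sequence given as lists of vectors."""
    dim_k = len(query[0])
    outputs = []
    for q in query:
        # One score per key: dot product scaled by sqrt(dim_k)
        scores = [sum(qi * ki for qi, ki in zip(q, k)) / sqrt(dim_k) for k in key]
        weights = softmax(scores)
        # Weighted average of the value vectors
        out = [sum(w * v[j] for w, v in zip(weights, value))
               for j in range(len(value[0]))]
        outputs.append(out)
    return outputs

# Hypothetical 3-token sequence with 2-dimensional embeddings
embeds = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
attn = toy_attention(embeds, embeds, embeds)
for row in attn:
    print([round(x, 3) for x in row])
```

Each output row is a weighted average of the value vectors, so it has the same dimension as the inputs. The first token leans toward the first coordinate because it scores higher against itself and the third token than against the second.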

Note that we cannot consider the updated embeddings to be fully contextualized just yet. The encoder layer still needs a feed-forward layer, and positional information is added separately through positional embeddings (both are covered later in this notebook). But before that, let's spend some more time on the self-attention layer to make it even better.

### Multi-head self-attention layer
Why **multiple** attention heads? Well, in a single head the softmax tends to focus mostly on one aspect of similarity. With several heads, the model can attend to several aspects of similarity at the same time.

For that to work though, there has to be some variation between the heads. So instead of `q = k = v = inputs_embeds`, each head applies 3 learned linear projections (randomly initialized) to `inputs_embeds` and uses the results as `q`, `k` and `v`.

In [12]:
class AttentionHead(nn.Module):
    def __init__(self, embed_dim, head_dim):
        super().__init__()
        self.q = nn.Linear(embed_dim, head_dim)
        self.k = nn.Linear(embed_dim, head_dim)
        self.v = nn.Linear(embed_dim, head_dim)

    def forward(self, hidden_state):
        attn_outputs = scaled_dot_product_attention(
            self.q(hidden_state), self.k(hidden_state), self.v(hidden_state))
        return attn_outputs

In [13]:
class MultiHeadAttention(nn.Module):
    def __init__(self, config):
        super().__init__()
        embed_dim = config.hidden_size
        num_heads = config.num_attention_heads
        head_dim = embed_dim // num_heads
        self.heads = nn.ModuleList(
            [AttentionHead(embed_dim, head_dim) for _ in range(num_heads)]
        )
        self.output_linear = nn.Linear(embed_dim, embed_dim)

    def forward(self, hidden_state):
        x = torch.cat([h(hidden_state) for h in self.heads], dim=-1)
        x = self.output_linear(x)
        return x

In [14]:
multihead_attn = MultiHeadAttention(config)
attn_output = multihead_attn(inputs_embeds)    
attn_output.size() 

torch.Size([1, 5, 768])
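
The bookkeeping behind that output shape can be checked with a few lines of arithmetic. This sketch assumes `config.num_attention_heads` is 12 for `bert-base-uncased` (the value is not printed anywhere in this notebook); `linear_params` is a hypothetical helper that counts the weights and biases of one `nn.Linear`-style layer.

```python
hidden_size = 768   # matches the Embedding(30522, 768) output above
num_heads = 12      # assumed value of config.num_attention_heads

# Each head works in a smaller space; concatenation restores the full width
head_dim = hidden_size // num_heads
concat_dim = head_dim * num_heads

def linear_params(in_features, out_features):
    """Weights plus biases of one fully connected layer."""
    return in_features * out_features + out_features

# q, k and v projections for every head, plus the final output projection
per_head = 3 * linear_params(hidden_size, head_dim)
total = num_heads * per_head + linear_params(hidden_size, hidden_size)

print(head_dim, concat_dim, total)  # 64 768 2362368
```

Splitting the embedding dimension across heads keeps the cost of multi-head attention close to that of a single full-width head: under these assumptions, the 12 heads together hold as many projection parameters as three 768-to-768 linear layers.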

In [15]:
#hide_output
from bertviz import head_view
from transformers import AutoModel

model = AutoModel.from_pretrained(model_ckpt, output_attentions=True)

sentence_a = "time flies like an arrow"
sentence_b = "fruit flies like a banana"

viz_inputs = tokenizer(sentence_a, sentence_b, return_tensors='pt')
attention = model(**viz_inputs).attentions
sentence_b_start = (viz_inputs.token_type_ids == 0).sum(dim=1)
tokens = tokenizer.convert_ids_to_tokens(viz_inputs.input_ids[0])

head_view(attention, tokens, sentence_b_start, heads=[8])


### Feed-Forward Layer
Now, we will build a two-layer fully connected neural network. However, instead of processing the whole sequence of embeddings as a single vector, it processes each embedding independently, which is why it is often referred to as a *position-wise feed-forward layer*.

It is common in the literature for the hidden size of the first layer to be 4 times the size of the embeddings and to use a GELU activation function.

In [16]:
class FeedForward(nn.Module):
    def __init__(self, config):
        super().__init__()
        self.linear_1 = nn.Linear(config.hidden_size, config.intermediate_size)
        self.linear_2 = nn.Linear(config.intermediate_size, config.hidden_size)
        self.gelu = nn.GELU()
        self.dropout = nn.Dropout(config.hidden_dropout_prob)
        
    def forward(self, x):
        x = self.linear_1(x)
        x = self.gelu(x)
        x = self.linear_2(x)
        x = self.dropout(x)
        return x

In [17]:
feed_forward = FeedForward(config)
ff_outputs = feed_forward(attn_output)  # output of the multi-head attention cell
ff_outputs.size()

torch.Size([1, 5, 768])
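
What "position-wise" means can be illustrated with a small sketch in plain Python: the same two linear maps and GELU are applied to each embedding separately, so reordering the input only reorders the output. The weights, sizes and function names below are hypothetical, and dropout is left out to keep the result deterministic.

```python
from math import erf, sqrt

def gelu(x):
    """Exact GELU: x times the standard normal CDF at x."""
    return 0.5 * x * (1.0 + erf(x / sqrt(2.0)))

def linear(vec, weight, bias):
    """y = W x + b, with weight given as a list of rows."""
    return [sum(w * x for w, x in zip(row, vec)) + b
            for row, b in zip(weight, bias)]

def feed_forward_one(vec, w1, b1, w2, b2):
    hidden = [gelu(h) for h in linear(vec, w1, b1)]
    return linear(hidden, w2, b2)

def position_wise(sequence, w1, b1, w2, b2):
    # Same weights at every position; nothing flows between positions
    return [feed_forward_one(vec, w1, b1, w2, b2) for vec in sequence]

# Hypothetical weights: embedding size 2, intermediate size 4
w1 = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [1.0, -1.0]]
b1 = [0.0, 0.0, 0.0, 0.0]
w2 = [[0.5, 0.0, 0.5, 0.0], [0.0, 0.5, 0.0, 0.5]]
b2 = [0.0, 0.0]

sequence = [[1.0, 2.0], [3.0, -1.0], [0.0, 0.5]]
outputs = position_wise(sequence, w1, b1, w2, b2)
reversed_outputs = position_wise(sequence[::-1], w1, b1, w2, b2)

print(reversed_outputs == outputs[::-1])  # True
```

Because no information is exchanged between positions here, all mixing across tokens has to come from the attention layer.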

### Layer Normalization
The Transformer architecture makes use of *layer normalization* (normalizing each input to zero mean and unit variance) and *skip connections* (which pass a tensor on to the next layer unprocessed and add it to the processed tensor).

Two placements of layer normalization are common:
- Post layer normalization  -->  layer normalization between the skip connections; usually requires learning rate warm-up.
- Pre layer normalization  -->  layer normalization within the span of the skip connections; tends to be more stable during training and usually does not require learning rate warm-up. This is the arrangement used in the code below.

In [18]:
class TransformerEncoderLayer(nn.Module):
    def __init__(self, config):
        super().__init__()
        self.layer_norm_1 = nn.LayerNorm(config.hidden_size)
        self.layer_norm_2 = nn.LayerNorm(config.hidden_size)
        self.attention = MultiHeadAttention(config)
        self.feed_forward = FeedForward(config)

    def forward(self, x):
        # Apply layer normalization and then copy input into query, key, value
        hidden_state = self.layer_norm_1(x)
        # Apply attention with a skip connection
        x = x + self.attention(hidden_state)
        # Apply feed-forward layer with a skip connection
        x = x + self.feed_forward(self.layer_norm_2(x))
        return x

In [19]:
encoder_layer = TransformerEncoderLayer(config)
inputs_embeds.shape, encoder_layer(inputs_embeds).size()

(torch.Size([1, 5, 768]), torch.Size([1, 5, 768]))
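
The difference between the two placements can be sketched in plain Python. `layer_norm` below normalizes a single embedding vector using the biased variance and omits the learnable scale and shift that `nn.LayerNorm` also has; `double` is a hypothetical stand-in for an attention or feed-forward sublayer.

```python
from math import sqrt

def layer_norm(vec, eps=1e-5):
    """Normalize one embedding vector to zero mean and unit variance."""
    mean = sum(vec) / len(vec)
    var = sum((x - mean) ** 2 for x in vec) / len(vec)  # biased variance
    return [(x - mean) / sqrt(var + eps) for x in vec]

def pre_ln(x, sublayer):
    # Pre layer normalization: normalize inside the skip connection
    return [a + b for a, b in zip(x, sublayer(layer_norm(x)))]

def post_ln(x, sublayer):
    # Post layer normalization: normalize after the skip connection is added
    return layer_norm([a + b for a, b in zip(x, sublayer(x))])

def double(vec):
    """Hypothetical stand-in for an attention or feed-forward sublayer."""
    return [2.0 * v for v in vec]

x = [1.0, 2.0, 3.0, 4.0]
normed = layer_norm(x)
pre_out = pre_ln(x, double)
post_out = post_ln(x, double)

print([round(v, 3) for v in normed])
print([round(v, 3) for v in pre_out])
print([round(v, 3) for v in post_out])
```

In the pre-LN variant the unnormalized input `x` travels along the skip path untouched, so the output keeps the mean of `x`. In the post-LN variant the sum itself is normalized, so the output has zero mean.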

### Positional Embeddings

In [20]:
class Embeddings(nn.Module):
    def __init__(self, config):
        super().__init__()
        self.token_embeddings    = nn.Embedding(config.vocab_size,              config.hidden_size)
        self.position_embeddings = nn.Embedding(config.max_position_embeddings, config.hidden_size)
        self.layer_norm = nn.LayerNorm(config.hidden_size, eps=1e-12)
        self.dropout = nn.Dropout(config.hidden_dropout_prob)

    def forward(self, input_ids):
        # Create position IDs for input sequence
        seq_length = input_ids.size(1)
        position_ids = torch.arange(seq_length, dtype=torch.long, device=input_ids.device).unsqueeze(0)
        
        # Create token and position embeddings
        token_embeddings    = self.token_embeddings(input_ids)
        position_embeddings = self.position_embeddings(position_ids)
        
        # Combine token and position embeddings
        embeddings = token_embeddings + position_embeddings
        embeddings = self.layer_norm(embeddings)
        embeddings = self.dropout(embeddings)
        
        return embeddings

In [21]:
embedding_layer = Embeddings(config)
embedding_layer(inputs.input_ids).size()

torch.Size([1, 5, 768])
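
The `Embeddings` module above adds a position vector to each token vector. Without it, the same token would receive the same embedding wherever it appears, and attention on its own is insensitive to token order. The sketch below shows the lookup-and-add step with plain lists; the two tables and the function name `embed` are hypothetical, and layer normalization and dropout are left out.

```python
def embed(input_ids, token_table, position_table):
    """Sum of token and position vectors for each position in one sequence."""
    position_ids = list(range(len(input_ids)))
    return [
        [t + p for t, p in zip(token_table[tok], position_table[pos])]
        for tok, pos in zip(input_ids, position_ids)
    ]

# Hypothetical tables: vocabulary of 3 tokens, 4 positions, embedding size 2
token_table = [[0.1, 0.2], [0.3, 0.4], [0.5, 0.6]]
position_table = [[0.0, 0.0], [1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]

# The same token id at two different positions
same_token_twice = embed([2, 2], token_table, position_table)
print(same_token_twice)  # [[0.5, 0.6], [1.5, 0.6]]
```

The two occurrences of token `2` end up with different vectors, which is what lets later layers tell positions apart.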

In [22]:
class TransformerEncoder(nn.Module):
    def __init__(self, config):
        super().__init__()
        self.embeddings = Embeddings(config)
        self.layers = nn.ModuleList([TransformerEncoderLayer(config) 
                                     for _ in range(config.num_hidden_layers)])

    def forward(self, x):
        x = self.embeddings(x)
        for layer in self.layers:
            x = layer(x)
        return x

In [23]:
encoder = TransformerEncoder(config)
encoder(inputs.input_ids).size()

torch.Size([1, 5, 768])

### Adding a Classification Head
Traditionally, to add a classification head:
- The first token is used for prediction
- A dropout and a linear layer are attached to make a classification prediction.

In [24]:
class TransformerForSequenceClassification(nn.Module):
    def __init__(self, config):
        super().__init__()
        self.encoder = TransformerEncoder(config)
        self.dropout = nn.Dropout(config.hidden_dropout_prob)
        self.classifier = nn.Linear(config.hidden_size, config.num_labels)
        
    def forward(self, x):
        x = self.encoder(x)[:, 0, :] # select hidden state of the first token ([CLS] when special tokens are added)
        x = self.dropout(x)
        x = self.classifier(x)
        return x

In [25]:
config.num_labels = 3
encoder_classifier = TransformerForSequenceClassification(config)
encoder_classifier(inputs.input_ids).size()

torch.Size([1, 3])
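
The head itself is only a slice followed by a linear map. The plain-Python sketch below takes the hidden state at position 0, computes one logit per label, and converts the logits to probabilities with softmax. All numbers and the function name `classify` are hypothetical, and dropout is left out.

```python
from math import exp

def softmax(xs):
    """Numerically stable softmax over a list of floats."""
    m = max(xs)
    exps = [exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

def classify(hidden_states, weight, bias):
    """Logits computed from the hidden state at position 0 of one sequence."""
    first = hidden_states[0]
    return [sum(w * h for w, h in zip(row, first)) + b
            for row, b in zip(weight, bias)]

# Hypothetical encoder output: 3 positions, hidden size 2
hidden_states = [[1.0, -1.0], [0.5, 0.5], [2.0, 0.0]]
# Hypothetical classifier weights for 3 labels
weight = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
bias = [0.0, 0.0, 0.0]

logits = classify(hidden_states, weight, bias)
probs = softmax(logits)
predicted_label = probs.index(max(probs))

print(logits, predicted_label)  # [1.0, -1.0, 0.0] 0
```

Only the first position feeds the classifier, so during training the encoder has to learn to gather sequence-level information into that position.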

Encoder done!

## Decoder

https://colab.research.google.com/github/nlp-with-transformers/notebooks/blob/main/03_transformer-anatomy.ipynb#scrollTo=kdUSrC-ImDgB
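
As a hedged sketch of the masked self-attention layer listed in the overview at the top of this notebook, the following plain-Python code applies a causal (lower-triangular) mask to a matrix of attention scores before the softmax. The function names and the uniform scores are hypothetical; the idea is only that position `i` may attend to positions `0..i`.

```python
from math import exp, inf

def softmax(xs):
    """Numerically stable softmax over a list of floats."""
    m = max(xs)
    exps = [exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

def causal_mask(seq_len):
    """Lower-triangular mask: position i may attend to positions 0..i."""
    return [[1 if j <= i else 0 for j in range(seq_len)] for i in range(seq_len)]

def masked_weights(scores):
    mask = causal_mask(len(scores))
    # Masked positions get -inf, which softmax turns into a weight of 0
    masked = [
        [s if keep else -inf for s, keep in zip(row, mask_row)]
        for row, mask_row in zip(scores, mask)
    ]
    return [softmax(row) for row in masked]

# Hypothetical uniform scores for a 3-token sequence
scores = [[0.0, 0.0, 0.0] for _ in range(3)]
weights = masked_weights(scores)
for row in weights:
    print([round(w, 3) for w in row])
```

Each row still sums to 1, but no weight is placed on later positions, so a token cannot look at the tokens that come after it.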