# DI 725: Transformers and Attention-Based Deep Networks

## An End-to-End Tutorial for Implementing Transformers

### Authors: ttemizel@metu.edu.tr atemizel@metu.edu.tr mecaglar@metu.edu.tr

Main resources: https://huggingface.co/

# Introduction

<div>
<img src="https://github.com/caglarmert/DI725/blob/main/src/attention_research_1.png?raw=true" width="400"/>
</div>

## Imports
In this part we import the required libraries. Running this code on the Colab servers is recommended. It is advised to check the associated python requirements.txt, that is frozen at the time of preparation of this notebook, in case of any library or version error occurs while running this notebook. Mind that installing everything locally via pip install -r "requirements.txt" is not advised though, mainly because of the discrepancies between Colab and local machine.

In [62]:
from transformers import pipeline
import math
import torch
from torch import nn
import torch.nn.functional as F

After importing the main libraries, we can continue with the transformers. First lets check what does the above import does. We have imported pipeline from transformers library, from huggingface 🤗.

The [documentation](https://huggingface.co/docs/transformers) for the Transformers library.

The [pipelines](https://huggingface.co/docs/transformers/main_classes/pipelines) is a class of the Transformers library. It is used for easy inference, abstracts most of the complexity and offers simple API for some dedicated tasks.

Lets try some tasks that we can do with pre-trained models.

## Introduction

### Classifying a restaurant's customer review

Let's practice loading an LLM from the Hugging Face hub into a pipeline to perform sentiment classification of customer restaurant reviews.

Specifying the target language task when calling the pipeline() function is enough often to load a "default" model from Hugging Face. Nonetheless, it is usually a good practice to specify the name of the model we want to use. This is done by adding the model argument to the pipeline() function.

The model_name variable, has been already instantiated for you with the name of a BERT-based LLM particularly suited for classifying reviews in a 1-to-5 star rating scale.

#### Instructions
* Import the necessary function from the transformers library to load Hugging Face LLMs as pipelines.
* Load the model specified in model_name into a suitable pipeline for sentiment classification in text.
* Pass the customer review defined in prompt to the pipeline to get a sentiment prediction.

In [53]:
task_name = "text-classification"
model_name = "lxyuan/distilbert-base-multilingual-cased-sentiments-student"
# We can change the model name to "mrm8488/distilroberta-finetuned-financial-news-sentiment-analysis"

classifier = pipeline(task = task_name, model = model_name)

prompt = "The food was good, but service at the restaurant was a bit slow"

prediction = classifier(prompt)
print(prediction)

[{'label': 'positive', 'score': 0.8415129780769348}]


### Using a pipeline for summarization
In this exercise, you'll practice loading a Hugging Face LLM into a pipeline for text summarization. This is a remarkable but challenging language task that requires sequence-to-sequence LLMs -such as T5 models- to output a summarized sequence given an original text sequence.

The pipeline import has been made for you. The text to be summarized has also been defined in the long_text variable. The beginning of the text looks like this:

The tower is 324 meters (1,063 ft) tall, about the same height as an 81-storey building, and the tallest structure in Paris. Its base is square, measuring 125 metres (410 ft) on each side …

#### Instructions

* Load the model, based on the T5 transformer architecture and specified in model_name, into a text summarization pipeline.
* Pass long_text to the model pipeline, to produce a summary limited to 50 tokens length.
* Access and print the summarized text in outputs.

In [54]:
model_name = "cnicu/t5-small-booksum"

long_text = "The tower is 324 metres (1,063 ft) tall, about the same height as an 81-storey building, and the tallest structure in Paris. Its base is square, measuring 125 metres (410 ft) on each side. During its construction, the Eiffel Tower surpassed the Washington Monument to become the tallest man-made structure in the world, a title it held for 41 years until the Chrysler Building in New York City was finished in 1930. It was the first structure to reach a height of 300 metres. Due to the addition of a broadcasting aerial at the top of the tower in 1957, it is now taller than the Chrysler Building by 5.2 metres (17 ft). Excluding transmitters, the Eiffel Tower is the second tallest free-standing structure in France after the Millau Viaduct."

# Load the model pipeline for text summarization
summarizer = pipeline(task="summarization", model=model_name)

# Pass the long text to the model to summarize it
outputs = summarizer(long_text, max_length=50)

# Access and print the summarized text in the outputs variable
print(outputs[0]['summary_text'])

the Eiffel Tower is 324 metres (1,063 ft) tall, about the same height as an 81-storey building, and the tallest structure in Paris. Its base is square, measuring 125 metres


In [55]:
# Set transformer model hyperparameters
d_model = 512
n_heads = 8
num_encoder_layers = 6
num_decoder_layers = 6

# Create the transformer model and assign hyperparameters
model = nn.Transformer(
    d_model=d_model,
    nhead=n_heads,
    num_encoder_layers=num_encoder_layers,
    num_decoder_layers=num_decoder_layers
)

print(model)



Transformer(
  (encoder): TransformerEncoder(
    (layers): ModuleList(
      (0-5): 6 x TransformerEncoderLayer(
        (self_attn): MultiheadAttention(
          (out_proj): NonDynamicallyQuantizableLinear(in_features=512, out_features=512, bias=True)
        )
        (linear1): Linear(in_features=512, out_features=2048, bias=True)
        (dropout): Dropout(p=0.1, inplace=False)
        (linear2): Linear(in_features=2048, out_features=512, bias=True)
        (norm1): LayerNorm((512,), eps=1e-05, elementwise_affine=True)
        (norm2): LayerNorm((512,), eps=1e-05, elementwise_affine=True)
        (dropout1): Dropout(p=0.1, inplace=False)
        (dropout2): Dropout(p=0.1, inplace=False)
      )
    )
    (norm): LayerNorm((512,), eps=1e-05, elementwise_affine=True)
  )
  (decoder): TransformerDecoder(
    (layers): ModuleList(
      (0-5): 6 x TransformerDecoderLayer(
        (self_attn): MultiheadAttention(
          (out_proj): NonDynamicallyQuantizableLinear(in_features=512, o

In [56]:
model_name = "Helsinki-NLP/opus-mt-es-en"

input_text = "Este curso sobre LLMs se está poniendo muy interesante"

# Define pipeline for Spanish-to-English translation
translator = pipeline("translation_es_to_en", model=model_name)

# Translate the input text
translations = translator(input_text)

# Access the output to print the translated text in English
print(translations[0]['translation_text'])



This course on LLMs is getting very interesting.


In [57]:
# Load the model pipeline for question-answering
context = "The tower is 324 metres (1,063 ft) tall, about the same height as an 81-storey building, and the tallest structure in Paris. Its base is square, measuring 125 metres (410 ft) on each side. During its construction, the Eiffel Tower surpassed the Washington Monument to become the tallest man-made structure in the world, a title it held for 41 years until the Chrysler Building in New York City was finished in 1930. It was the first structure to reach a height of 300 metres. Due to the addition of a broadcasting aerial at the top of the tower in 1957, it is now taller than the Chrysler Building by 5.2 metres (17 ft). Excluding transmitters, the Eiffel Tower is the second tallest free-standing structure in France after the Millau Viaduct."

qa_model = pipeline("question-answering")
question = "For how long was the Eiffel Tower the tallest man-made structure in the world?"

# Pass the necessary inputs to the LLM pipeline for question-answering
outputs = qa_model(question=question, context=context)

# Access and print the answer
print(outputs['answer'])

No model was supplied, defaulted to distilbert-base-cased-distilled-squad and revision 626af31 (https://huggingface.co/distilbert-base-cased-distilled-squad).
Using a pipeline without specifying a model name and revision in production is not recommended.


41 years


### Generating replies to customer reviews
In this exercise, you'll practice using an LLM pipeline for text generation.

A text variable has been defined, containing a customer review for Riverview Hotel:

I had a wonderful stay at the Riverview Hotel! The staff were incredibly attentive and the amenities were top-notch. The only hiccup was a slight delay in room service, but that didn't overshadow the fantastic experience I had

The language task consists in generating a hotel reply to the customer review. The initial sentence for the reply is defined in the response variable so that the LLM gets it prompted along with the customer review to continue generating the reply.

Note: the pad_token_id=generator.tokenizer.eos_token_id argument sets the tokenizer padding token ID as the EOS (End Of Speech) token ID
#### Instructions
Instantiate the generator variable as a pipeline that loads the "gpt2" pre-trained text generation model.
Build a prompt for the LLM that concatenates the customer review with the hotel response's initial sentence.
Pass the prompt to the previously defined pipeline to generate (inference) the following text in the hotel response, specifying a maximum length of 150 tokens for the generated output.
Print the generated output.

In [60]:
# Create a pipeline for text generation using the gpt2 model
generator = pipeline("text-generation", model="gpt2")

text = "I had a wonderful stay at the Riverview Hotel! The staff were incredibly attentive and the amenities were top-notch. The only hiccup was a slight delay in room service, but that didn't overshadow the fantastic experience I had."

response = "Dear valued customer, I am glad to hear you had a good stay with us."

# Build the prompt for the text generation LLM
prompt = f"Customer review:\n{text}\n\nHotel reponse to the customer:\n{response}"

# Pass the prompt to the model pipeline
outputs = generator(prompt, max_length=150, pad_token_id=generator.tokenizer.eos_token_id)

# Print the augmented sequence generated by the model
print(outputs[0]['generated_text'])

Customer review:
I had a wonderful stay at the Riverview Hotel! The staff were incredibly attentive and the amenities were top-notch. The only hiccup was a slight delay in room service, but that didn't overshadow the fantastic experience I had.

Hotel reponse to the customer:
Dear valued customer, I am glad to hear you had a good stay with us. We had a great time having your business. We went out to dinner and it felt great to have a wonderful relaxed dining experience and to discover a world of amazing food and service. I am very glad you are here to keep us going at Riverview. I would recommend your experience to others. We enjoy traveling, dining out and traveling with our


## Second Chapter

### Hands-on positional encoding
In this exercise you'll complete the class implementation for a positional encoding mechanism.

The necessary imports have been done for you, namely import torch.nn as nn.

#### Instructions
* Specify the PyTorch class that the positional encoder should subclass from.
* Initialize a positional encoding matrix for token positions in sequences up to max_length.
* Assign unique position encodings to the matrix pe by alternating the use of sine and cosine functions.
* Update the input embeddings tensor x to add position information about the sequence using the positional encodings matrix.

In [70]:
# Subclass an appropriate PyTorch class
class PositionalEncoder(nn.Module):
    def __init__(self, d_model, max_length):
        super(PositionalEncoder, self).__init__()
        self.d_model = d_model
        self.max_length = max_length

        # Initialize the positional encoding matrix
        pe = torch.zeros(max_length, d_model)

        position = torch.arange(0, max_length, dtype=torch.float).unsqueeze(1)
        div_term = torch.exp(torch.arange(0, d_model, 2, dtype=torch.float) * -(math.log(10000.0) / d_model))

        # Calculate and assign position encodings to the matrix
        pe[:, 0::2] = torch.sin(position * div_term)
        pe[:, 1::2] = torch.cos(position * div_term)
        pe = pe.unsqueeze(0)
        self.register_buffer('pe', pe)

    # Update the embeddings tensor adding the positional encodings
    def forward(self, x):
        x = x + self.pe[:, :x.size(1)]
        return x

### Implementing multi-headed self-attention
Now it's the turn of the multi-headed self-attention mechanism implementation.

Besides the necessary imports, including this time torch.nn.functional as F, the __init__() method is also provided.
#### Instructions
* Split the sequence embeddings x across the multiple attention heads.
* Compute dot-product based attention scores between the project query and key.
* Normalize the attention scores to obtain attention weights.
* Multiply the attention weights by the values and linearly transform the concatenated outputs per head.

In [71]:
class MultiHeadAttention(nn.Module):
    def __init__(self, d_model, num_heads):
        super(MultiHeadAttention, self).__init__()
        self.num_heads = num_heads
        self.d_model = d_model
        self.head_dim = d_model // num_heads

        self.query_linear = nn.Linear(d_model, d_model)
        self.key_linear = nn.Linear(d_model, d_model)
        self.value_linear = nn.Linear(d_model, d_model)
        self.output_linear = nn.Linear(d_model, d_model)
    def split_heads(self, x, batch_size):
        # Split the sequence embeddings in x across the attention heads
        x = x.view(batch_size, -1, self.num_heads, self.head_dim)
        return x.permute(0, 2, 1, 3).contiguous().view(batch_size * self.num_heads, -1, self.head_dim)

    def compute_attention(self, query, key, mask=None):
        # Compute dot-product attention scores
        scores = torch.matmul(query, key.permute(1, 2, 0))
        if mask is not None:
            scores = scores.masked_fill(mask == 0, float("-1e20"))
        # Normalize attention scores into attention weights
        attention_weights = F.softmax(scores, dim=-1)
        return attention_weights

    def forward(self, query, key, value, mask=None):
        batch_size = query.size(0)

        query = self.split_heads(self.query_linear(query), batch_size)
        key = self.split_heads(self.key_linear(key), batch_size)
        value = self.split_heads(self.value_linear(value), batch_size)

        attention_weights = self.compute_attention(query, key, mask)

        # Multiply attention weights by values and linearly project concatenated outputs
        output = torch.matmul(attention_weights, value)
        output = output.view(batch_size, self.num_heads, -1, self.head_dim).permute(0, 2, 1, 3).contiguous().view(batch_size, -1, self.d_model)
        return self.output_linear(output)

### Post-attention feed-forward layer
Let's assemble some of the pieces of an encoder transformer, starting with the feed-forward sublayer that follows multi-headed self-attention in every encoder layer.

#### Instructions
* Specify in the __init__() method the sizes of the two linear fully connected layers.
* Apply a forward pass through the two linear layers, using the ReLU() activation in between.

In [72]:
class FeedForwardSubLayer(nn.Module):
    # Specify the two linear layers' input and output sizes
    def __init__(self, d_model, d_ff):
        super(FeedForwardSubLayer, self).__init__()
        self.fc1 = nn.Linear(d_model, d_ff)
        self.fc2 = nn.Linear(d_ff, d_model)
        self.relu = nn.ReLU()

    # Apply a forward pass
    def forward(self, x):
        return self.fc2(self.relu(self.fc1(x)))

### encoder layer
You've made it quite far in building your own skeleton transformer architecture! Now you are ready to assemble a full encoder layer containing:

* A multi-headed self-attention mechanism.
* A feed-forward sublayer.
* A combined layer normalization and dropout to be applied after each of the above two stages.
* Complete the implementation of the EncoderLayer class to initialize all its inner elements one by one.

In [73]:
# Complete the initialization of elements in the encoder layer
class EncoderLayer(nn.Module):
    def __init__(self, d_model, num_heads, d_ff, dropout):
        super(EncoderLayer, self).__init__()
        self.self_attn = MultiHeadAttention(d_model, num_heads)
        self.feed_forward = FeedForwardSubLayer(d_model, d_ff)
        self.norm1 = nn.LayerNorm(d_model)
        self.norm2 = nn.LayerNorm(d_model)
        self.dropout = nn.Dropout(dropout)

    def forward(self, x, mask):
        attn_output = self.self_attn(x, x, x, mask)
        x = self.norm1(x + self.dropout(attn_output))
        ff_output = self.feed_forward(x)
        return self.norm2(x + self.dropout(ff_output))

### Encoder transformer body and head
Almost there! Now that the encoder layer implementation has been completed, all that remains is:

Implementing the transformer body, namely a stack of multiple encoder layers.
Appending a task-specific transformer head to process the encoder's resulting hidden states and produce the final outputs for the language task at hand!
#### Instructions
* Define a stack of multiple encoder layers in the __init__() method.
* Complete the forward() method. Note that the process starts by converting the original sequence tokens in x into embeddings.
* Add final linear layer to project encoder results into raw classification outputs.
* Apply the necessary function to map raw classification outputs into log class probabilities.

In [74]:
class TransformerEncoder(nn.Module):
    def __init__(self, vocab_size, d_model, num_layers, num_heads, d_ff, dropout, max_sequence_length):
        super(TransformerEncoder, self).__init__()
        self.embedding = nn.Embedding(vocab_size, d_model)
        self.positional_encoding = PositionalEncoder(d_model, max_sequence_length)
        # Define a stack of multiple encoder layers
        self.layers = nn.ModuleList([EncoderLayer(d_model, num_heads, d_ff, dropout) for _ in range(num_layers)])

    # Complete the forward pass method
    def forward(self, x, mask):
        x = self.embedding(x)
        x = self.positional_encoding(x)
        for layer in self.layers:
            x = layer(x, mask)
        return x

class ClassifierHead(nn.Module):
    def __init__(self, d_model, num_classes):
        super(ClassifierHead, self).__init__()
        # Add linear layer for multiple-class classification
        self.fc = nn.Linear(d_model, num_classes)

    def forward(self, x):
        logits = self.fc(x[:, 0, :])
        # Obtain log class probabilities upon raw outputs
        return F.log_softmax(logits, dim=-1)

### Testing the encoder transformer
In this exercise, you'll practice creating some instructions to pass an example random sequence throughout the encoder transformer you just defined to obtain and print the classification output. The following variables and model hyperparameters are defined for you:
```
num_classes = 3
vocab_size = 10000
batch_size = 8
d_model = 512
num_heads = 8
num_layers = 6
d_ff = 2048
sequence_length = 256
dropout = 0.1
```
The PositionalEncoder, MultiHeadAttention, FeedForwardSublayer,EncoderLayer, TransformerEncoder, and ClassifierHead classes are also implemented.

Note: although a random input sequence and mask are being used here, in practice, the mask should correspond to the actual location of padding tokens in the input sequences to ensure all of them are the same length.

#### Instructions
* Instantiate the body and head of the encoder transformer.
* Complete the forward pass throughout the entire transformer body and head to obtain and print classification outputs.

In [76]:
num_classes = 3
vocab_size = 10000
batch_size = 8
d_model = 512
num_heads = 8
num_layers = 6
d_ff = 2048
sequence_length = 64
dropout = 0.1

input_sequence = torch.randint(0, vocab_size, (batch_size, sequence_length))
mask = torch.randint(0, 2, (sequence_length, sequence_length))

# Instantiate the encoder transformer's body and head
encoder = TransformerEncoder(vocab_size, d_model, num_layers, num_heads, d_ff, dropout, max_sequence_length=sequence_length)
classifier = ClassifierHead(d_model, num_classes)

# Complete the forward pass
output = encoder(input_sequence, mask)
classification = classifier(output)
print("Classification outputs for a batch of ", batch_size, "sequences:")
print(classification)

Classification outputs for a batch of  8 sequences:
tensor([[-1.2418, -1.3817, -0.7765],
        [-1.1628, -0.9391, -1.2161],
        [-0.4997, -2.4064, -1.1936],
        [-1.2461, -1.9649, -0.5583],
        [-0.9535, -1.8489, -0.7826],
        [-0.8685, -1.0750, -1.4307],
        [-1.0314, -1.6768, -0.7841],
        [-0.9121, -2.1297, -0.7351]], grad_fn=<LogSoftmaxBackward0>)


### Building a decoder body and head
Time to design a high-level architecture for a decoder-only transformer! On this occasion, instead of building the model body and the model head in two separate classes, the model head will be incorporated as part of the model body class that contains the stack of decoder layers.

As usual, the necessary imports for this exercise have been done for you.

#### Instructions
* Add the linear layer for the model head inside the TransformerDecoder class.
* Apply the last stage of the forward pass, through the model head.

In [77]:
class TransformerDecoder(nn.Module):
    def __init__(self, vocab_size, d_model, num_layers, num_heads, d_ff, dropout, max_sequence_length):
        super(TransformerDecoder, self).__init__()
        self.embedding = nn.Embedding(vocab_size, d_model)
        self.positional_encoding = PositionalEncoding(d_model, max_sequence_length)
        self.layers = nn.ModuleList([DecoderLayer(d_model, num_heads, d_ff, dropout) for _ in range(num_layers)])

        # Add a linear layer (head) for next-word prediction
        self.fc = nn.Linear(d_model, vocab_size)

    def forward(self, x, self_mask):
        x = self.embedding(x)
        x = self.positional_encoding(x)
        for layer in self.layers:
            x = layer(x, self_mask)

        # Apply the forward pass through the model head
        x = self.fc(x)
        return F.log_softmax(x, dim=-1)

### Testing the decoder transformer
In this exercise, you'll practice creating some instructions to pass an example random sequence throughout a decoder transformer architecture to obtain outputs in the form of next-token probabilities across the vocabulary.

The following variables and model hyperparameters are defined for you:


The PositionalEncoder, MultiHeadAttention, PositionWiseFeedForward,DecoderLayer, and TransformerDecoder classes are also implemented, the last of which integrates the model body and head.

#### Instructions
Create a triangular mask for enabling causal attention so that every token in the sequence only attends to the previous ones on its left-hand side.
Instantiate the decoder transformer model.

In [78]:
num_classes = 3
vocab_size = 10000
batch_size = 8
d_model = 512
num_heads = 8
num_layers = 6
d_ff = 2048
sequence_length = 256
dropout = 0.1

input_sequence = torch.randint(0, vocab_size, (batch_size, sequence_length))

# Create a triangular attention mask for causal attention
self_attention_mask = (1 - torch.triu(torch.ones(1, sequence_length, sequence_length), diagonal=1)).bool()

# Instantiate the decoder transformer
decoder = TransformerDecoder(vocab_size, d_model, num_layers, num_heads, d_ff, dropout, max_sequence_length=sequence_length)

output = decoder(input_sequence, self_attention_mask)
print(output.shape)
print(output)

NameError: name 'PositionalEncoding' is not defined