<a href="https://colab.research.google.com/github/caglarmert/DI725/blob/main/labs/DI725_Lab_0.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# DI 725: Transformers and Attention-Based Deep Networks

## An End-to-End Tutorial for Implementing Transformers

### Authors:
* ttemizel@metu.edu.tr
* atemizel@metu.edu.tr
* mecaglar@metu.edu.tr

# Introduction

<div>
<img src="https://github.com/caglarmert/DI725/blob/main/src/attention_research_1.png?raw=true" width="400"/>
</div>

---

### Transformer Architecture Overview

The Transformer architecture revolutionized the field of natural language processing (NLP) by introducing a model that relies entirely on self-attention mechanisms, eliminating the need for recurrent or convolutional layers. Here's a brief overview of the main components of the Transformer architecture:

#### 1. Input Embeddings
- The input sequence, typically a sequence of word embeddings, is passed into the model. Each word is represented as a high-dimensional vector, often initialized randomly or pre-trained on a large corpus.

#### 2. Positional Encoding
- Since the Transformer doesn't inherently understand the order of tokens in a sequence, positional encodings are added to the input embeddings to provide information about token positions. These are usually sinusoidal functions of different frequencies and phases.

#### 3. Encoder
- The Encoder consists of multiple identical layers (usually 6-12). Each layer consists of two main sub-components:
  - **Multi-Head Self-Attention Mechanism**: Computes attention weights between all pairs of words in the input sequence to capture relationships and dependencies among them.
  - **Feedforward Neural Network**: Applies a fully connected feedforward network to each position separately and identically. It processes the output of the attention mechanism in a position-wise manner.

#### 4. Decoder
- The Decoder also consists of multiple identical layers (the same number as in the Encoder). Each layer in the Decoder has three main sub-components:
  - **Masked Multi-Head Self-Attention Mechanism**: Similar to the Encoder's attention mechanism, but with a mask applied to prevent positions from attending to subsequent positions, ensuring that the model attends only to previous positions during generation.
  - **Encoder-Decoder Attention Mechanism**: Allows the Decoder to focus on different parts of the input sequence (Encoder's output) by computing attention scores between the current position in the Decoder and all positions in the Encoder's output.
  - **Feedforward Neural Network**: Similar to the Encoder, a fully connected feedforward network is applied to each position separately and identically.

#### 5. Output Layer
- The output of the final Decoder layer is passed through a linear layer followed by a softmax function to produce the probability distribution over the output vocabulary. During training, this distribution is compared to the actual target sequence using cross-entropy loss.

#### 6. Loss Computation
- The model's output is compared to the actual target sequence using cross-entropy loss. This comparison drives the learning process through backpropagation.

The Transformer architecture's key innovation lies in its ability to capture long-range dependencies in sequences efficiently through self-attention mechanisms, making it highly parallelizable and scalable compared to traditional recurrent neural networks.

---

The Transformer architecture, introduced in the paper "Attention is All You Need", revolutionized natural language processing by relying solely on attention mechanisms instead of recurrent connections. Here's a breakdown of its key components:

Overall Structure:

An encoder processes the input sequence to capture its meaning.
A decoder generates the output sequence based on the encoded representation and any additional context.
Encoder and Decoder Blocks:

Both the encoder and decoder consist of multiple identical encoder blocks and decoder blocks, respectively.
Each block has two sub-blocks:
Multi-head Self-attention: Captures relationships between elements within the sequence (encoder) or within the previously generated output (decoder).
Feed-forward network: Adds non-linearity and complexity to the model.
Residual connection and Layer Norm: Improve training stability and gradient flow.
Key Details of Each Block:

1. Multi-head Self-attention:

Splits the input into queries, keys, and values.
Computes attention scores based on the similarity between queries and keys.
Masks out padded elements using attention masks.
Aggregates values weighted by the attention scores, resulting in a context vector for each element.
The "multi-head" part refers to performing this self-attention mechanism multiple times with different query and key projections, capturing diverse relationships.
2. Feed-forward network:

A two-layer network with ReLU activation for non-linearity.
Adds complexity and allows the model to learn more intricate relationships.
3. Residual connection and Layer Norm:

Shortcuts around each sub-block are added to ensure the gradients can flow easily through the network.
Layer normalization rescales and shifts the output of each sub-block, stabilizing the training process.
Additional Components:

Encoder-decoder attention: In a sequence-to-sequence setting, the decoder attends to the encoded representation in each block to incorporate context into the generated output.
Positional encoding: Since the Transformer doesn't have inherent positional information, additional embeddings are added to encode the relative positions of elements in the sequence.
Output layer: In the decoder, a final layer converts the internal representation into the final vocabulary probabilities for output generation.
Benefits of Transformers:

Parallelization: Attention allows for better parallelization during training compared to recurrent models.
Long-range dependencies: Can capture long-range dependencies in sequences without relying on sequential processing.
Adaptability: Can be applied to various NLP tasks with minor modifications.
Drawbacks of Transformers:

Computational cost: Attention can be computationally expensive, especially for long sequences.
Memory intensive: Requires storing the entire input sequence for attention computations.

## Imports
In this part we import the required libraries. Running this part on the Colab servers is required for later parts. It is advised to check the associated python requirements.txt, that is frozen at the time of preparation of this notebook, in case of any library or version error occurs while running this notebook. Mind that installing everything locally via pip install -r "requirements.txt" is not advised though, mainly because of the discrepancies between Colab and locally available machine.

In [1]:
# Uncomment any install if needed. It is recommended that these installations
# are performed prior to any notebook runs and imports

# !pip install datasets # Huggingface dataset library
# !pip install evaluate # Used for evaluation metrics
# !pip install rouge_score # Is a text evaluation metric
# !pip install trl #Transformers Reinforcement Learning framework
# !pip install sacremoses # Used for specific characters, useful for languages like Turkish

In [2]:
from transformers import pipeline
import math
import torch
from torch import nn
import torch.nn.functional as F
import evaluate

from transformers import AutoTokenizer
from transformers import AutoModelForSequenceClassification, AutoModelForSeq2SeqLM
from datasets import load_dataset


from trl import PPOConfig, PPOTrainer, AutoModelForCausalLMWithValueHead, create_reference_model
from trl.core import respond_to_batch
from transformers import AutoModelForCausalLM, AutoTokenizer


After importing the main libraries, we can continue with the transformers. First lets check what does the above import does. We have imported pipeline from transformers library, from huggingface 🤗.

The [documentation](https://huggingface.co/docs/transformers) for the Transformers library.

The [pipelines](https://huggingface.co/docs/transformers/main_classes/pipelines) is a class of the Transformers library. It is used for easy inference, abstracts most of the complexity and offers simple API for some dedicated tasks.

The [torch](https://pytorch.org/) is a popular and diverse machine learning framework, enabling low level implementation (as low as it gets with Python anyway). The Neural Networks (nn) is a library within PyTorch that enables operations with neural network structures.

The [Auto](https://huggingface.co/docs/transformers/model_doc/auto) classes contain many high level methods and models for various specific tasks, sometimes required for a pre-processing step such as tokenizers.

The [datasets](https://huggingface.co/docs/datasets) is the 🤗 library used for datasets (who would have guess?). Tabular, Audio, Computer Vision, and Text data can be loaded or shared via this library.

The [TRL](https://huggingface.co/docs/trl) (standing for: Transformer Reinforcement Learning) is the comprehensive toolkit designed for training transformer language models using Reinforcement Learning. It encompasses a range of tools, starting from the initial Supervised Fine-tuning (SFT) phase, through Reward Modeling (RM), up to the Proximal Policy Optimization (PPO) stage.

# Chapter 1: Introduction

## Introduction to Transformers

In this first introductory section, we begin with experiencing basic and very high level usage of transformers.

### Exercise 1.1: Classifying a text

Huggingface Hub is an open-source public colaboration of various models. Large Language Models, require tremendous amount of training data and time, thus once trained they are invaluable and their inference can be adapted to various use-cases.

This first practice will be about loading a model from the huggingface hub, into a pipeline, to perform a task.

It is important to note that model loading with specific model name is advised or else it will opt to defaults.

#### Instructions
* Import the necessary function from the transformers library to load Hugging Face LLMs as pipelines.
* Load the model specified in model_name into a suitable pipeline for sentiment classification in text.
* Pass the customer review defined in prompt to the pipeline to get a sentiment prediction.

In [3]:
# Specify the task name
task_name = "text-classification"
# Specify the model to be loaded
model_name = "lxyuan/distilbert-base-multilingual-cased-sentiments-student"
# We can change the model name to
# "mrm8488/distilroberta-finetuned-financial-news-sentiment-analysis"
# "lxyuan/distilbert-base-multilingual-cased-sentiments-student"
classifier = pipeline(task = task_name, model = model_name)

# Clearly this is a positive sentiment from a 5 star review tripadvisor for Atakule
prompt = "I liked Atakule, very much so because of the excellent location in the midst of the botanical park and city center."
prediction = classifier(prompt)
print(prompt, "\nSentiment:", prediction[0]["label"], "Score:",prediction[0]["score"],)

# And a negative one, 1 star review from the time it is off-limits.
prompt = "There was nothing to see at Atakule, the building is under construction, you can't go into building, wasting my afternoon time in ankara."
prediction = classifier(prompt)
print(prompt, "\nSentiment:", prediction[0]["label"], "Score:",prediction[0]["score"],)



I liked Atakule, very much so because of the excellent location in the midst of the botanical park and city center. 
Sentiment: positive Score: 0.8559050559997559
There was nothing to see at Atakule, the building is under construction, you can't go into building, wasting my afternoon time in ankara. 
Sentiment: negative Score: 0.4448523223400116


### Exercise 1.2: Summarizing a text

Summarization is a challanging language task, requires sequence-to-sequence models, such as the one we are using here. The task is about summarizing a given long text.

#### Instructions

* Load the model, based on the T5 transformer architecture and specified in model_name, into a text summarization pipeline.
* Pass long_text to the model pipeline, to produce a summary limited to 50 tokens length.
* Access and print the summarized text in outputs.

In [4]:
# Specify a model name, note that we are using a small version so don't expect much
model_name = "cnicu/t5-small-booksum"
# Provide the long text
long_text = "Tunali hilmi, which is a bustling street, is a hub for various commercial activities as it extends southwards toward Kugulu Park. Tunali Hilmi Avenue is regarded as one of the city's most charming streets, adorned with a variety of shops, boutiques, and souvenir stores. The neighborhood exudes a sense of luxury and offers a wide range of goods, albeit at slightly higher prices compared to other areas. However, the elevated cost is justified by the high-quality shopping experience, particularly appealing to those who enjoy outdoor retail therapy."

# Load the model pipeline for text summarization
summarizer = pipeline(task="summarization", model=model_name)

# Pass the long text to the model to summarize it
outputs = summarizer(long_text, max_length=50)

# Access and print the summarized text in the outputs variable
print("Original Text: ", long_text, "\nSummary Text: ", outputs[0]['summary_text'])

Original Text:  Tunali hilmi, which is a bustling street, is a hub for various commercial activities as it extends southwards toward Kugulu Park. Tunali Hilmi Avenue is regarded as one of the city's most charming streets, adorned with a variety of shops, boutiques, and souvenir stores. The neighborhood exudes a sense of luxury and offers a wide range of goods, albeit at slightly higher prices compared to other areas. However, the elevated cost is justified by the high-quality shopping experience, particularly appealing to those who enjoy outdoor retail therapy. 
Summary Text:  Tunali hilmi is regarded as one of the city's most charming streets, adorned with shops, boutiques, and souvenir stores. The neighborhood offers a wide range of goods, albeit at slightly higher prices


### Exercise 1.3: Translating a text

Translation is another challanging language task, requiring models trained specifically for source and target languages.

#### Instructions

* Define a pipeline for Turkish-to-English translation, specifying the source and target languages in the pipeline task argument.
* Translate the text in input_text using the pipeline.
* Access and print the translated text in the outputs variable: translations.

In [5]:
# Specify the model name, from Turkish (tr) to English (en)
model_name = "Helsinki-NLP/opus-mt-tr-en"

# A short intro about METU
input_text = "Orta Doğu Teknik Üniversitesi, Türkiye ve Orta Doğu ülkelerinin kalkınmalarına katkıda bulunmak, özellikle fen bilimleri ve sosyal bilimler alanlarında uzman yetiştirmek üzere 15 Kasım 1956 tarihinde Orta Doğu Yüksek Teknoloji Enstitüsü adıyla eğitime başlamıştır. "

# Define pipeline for Spanish-to-English translation
translator = pipeline("translation_tr_to_en", model=model_name)

# Translate the input text
translations = translator(input_text)

# Access the output to print the translated text in English
print("Original text: ", input_text)
print("Translated text:", translations[0]['translation_text'])

Original text:  Orta Doğu Teknik Üniversitesi, Türkiye ve Orta Doğu ülkelerinin kalkınmalarına katkıda bulunmak, özellikle fen bilimleri ve sosyal bilimler alanlarında uzman yetiştirmek üzere 15 Kasım 1956 tarihinde Orta Doğu Yüksek Teknoloji Enstitüsü adıyla eğitime başlamıştır. 
Translated text: The Middle East Technical University began training as the Middle East Institute of Technology on 15 November 1956 to contribute to the development of Turkey and Middle East countries, especially to develop experts in science and social sciences.


### Exercise 1.4: Question-Answering
Next, let's practice loading a Hugging Face LLM into a pipeline for question-answering (QA, for short). This time, you will use the default model supplied by Hugging Face transformers library for QA pipelines.

#### Instructions
* Instantiate a pipeline for question-answering.
* Pass the necessary pieces of information as inputs to the pipeline.
* Access and print the extracted answer in the outputs variable.

In [6]:
# Load the model pipeline for question-answering
model_name = "distilbert-base-cased-distilled-squad"

qa_model = pipeline("question-answering",model=model_name)

# Provide the context
context = "The history of Ankara Castle, one of the symbols of the province, is as old as the history of the city. It remains to be determined when the castle, which existed when the Galatians settled in Ankara and was repaired during the Roman period, was built. Next to the hill on which it was founded, that is, Hatip Stream, is 110 m above the Bent Stream. The castle has more than 20 towers. The outer castle surrounds Ankara in the shape of a heart. The four-storey inner castle is made of Ankara Stone and partly of collected stones. The inner castle has two large gates, one is called the Outer Gate and the other is the Citadel Gate. There is a book belonging to the Ilkhanate on this door. The inner castles consist of a total of 42 pentagonal towers with a length of 14-16 m. There is an inscription in the northwestern part showing the repairs made by the Seljuk ruler."

# Provide the questions
questions = ["How many towers does the Ankara castle have?",
             "When did the Ankara castle was build?",
             "How long are the towers in the inner castle?",
             "Who repaired the Ankara castle and inscribed?",
             "What are the materials of the Ankara castle?"]

# Pass the necessary inputs to the LLM pipeline for question-answering
outputs = qa_model(question=questions, context=context)

# Access and print the answer
for i in range(len(questions)):
  print("Question: ", questions[i], "\nAnswer:", outputs[i]['answer'])

Question:  How many towers does the Ankara castle have? 
Answer: more than 20
Question:  When did the Ankara castle was build? 
Answer: Roman period
Question:  How long are the towers in the inner castle? 
Answer: 14-16 m
Question:  Who repaired the Ankara castle and inscribed? 
Answer: the Seljuk ruler
Question:  What are the materials of the Ankara castle? 
Answer: Ankara Stone and partly of collected stones


### Exercise 1.5: Text Generation

Text generation, is the most famous application of transformers, namely ChatGPT (standing for Generative Pre-Trained). Here we will use an older version (GPT-2) to generate text for customers leaving reviews for our business on a public website.

#### Instructions
* Instantiate the generator variable as a pipeline that loads the "gpt2" pre-trained text generation model.
* Build a prompt for the LLM that concatenates the customer review with the hotel response's initial sentence.
* Pass the prompt to the previously defined pipeline to generate (inference) the following text in the hotel response, specifying a maximum length of 150 tokens for the generated output.
* Print the generated output.

In [7]:
# Create a pipeline for text generation using the gpt2 model
generator = pipeline("text-generation", model="gpt2")

customer_text = "The Divan is a very comfortable and professionally run hotel in Ankara. The staff are extremely helpful and friendly. Rooms and beds are very comfortable, with all the facilities that you would expect in a four star hotel. The breakfast buffet is very extensive (open 6.30AM to 10.30AM). The only down-side is the hotels location, a ten to fifteen minute taxi ride away from the city centre, embassies and government buildings, but is located within a very quiet residential area."

response = "Dear Our Valuable Guest, Thank you for taking the time to leave us a review."

# Build the prompt for the text generation LLM
prompt = f"Customer review:\n{customer_text}\n\nHotel reponse to the customer:\n{response}"

# Pass the prompt to the model pipeline
outputs = generator(prompt, max_length=150, pad_token_id=generator.tokenizer.eos_token_id)

# Print the augmented sequence generated by the model
print(outputs[0]['generated_text'])

Customer review:
The Divan is a very comfortable and professionally run hotel in Ankara. The staff are extremely helpful and friendly. Rooms and beds are very comfortable, with all the facilities that you would expect in a four star hotel. The breakfast buffet is very extensive (open 6.30AM to 10.30AM). The only down-side is the hotels location, a ten to fifteen minute taxi ride away from the city centre, embassies and government buildings, but is located within a very quiet residential area.

Hotel reponse to the customer:
Dear Our Valuable Guest, Thank you for taking the time to leave us a review. This is our first time meeting with you (and we don't expect to see you again


# Chapter 2: Building a Transformer Architecture

## Building Blocks

### Exercise 2.1: PyTorch Transformer

Transformer class from PyTorch Neural Networks enables building a full transformer architecture with encoder and decoder.

The code example below can be used to build a very simple Transformer model. It is required to specifcy the main structural components:
* Embedding size
* Number of attention heads
* Number of encoder layers
* Number of decoder layers

In [8]:
# Set transformer model hyperparameters
d_model = 512
n_heads = 8
num_encoder_layers = 6
num_decoder_layers = 6

# Create the transformer model and assign hyperparameters
model = nn.Transformer(
    d_model=d_model,
    nhead=n_heads,
    num_encoder_layers=num_encoder_layers,
    num_decoder_layers=num_decoder_layers
)

print(model)



Transformer(
  (encoder): TransformerEncoder(
    (layers): ModuleList(
      (0-5): 6 x TransformerEncoderLayer(
        (self_attn): MultiheadAttention(
          (out_proj): NonDynamicallyQuantizableLinear(in_features=512, out_features=512, bias=True)
        )
        (linear1): Linear(in_features=512, out_features=2048, bias=True)
        (dropout): Dropout(p=0.1, inplace=False)
        (linear2): Linear(in_features=2048, out_features=512, bias=True)
        (norm1): LayerNorm((512,), eps=1e-05, elementwise_affine=True)
        (norm2): LayerNorm((512,), eps=1e-05, elementwise_affine=True)
        (dropout1): Dropout(p=0.1, inplace=False)
        (dropout2): Dropout(p=0.1, inplace=False)
      )
    )
    (norm): LayerNorm((512,), eps=1e-05, elementwise_affine=True)
  )
  (decoder): TransformerDecoder(
    (layers): ModuleList(
      (0-5): 6 x TransformerDecoderLayer(
        (self_attn): MultiheadAttention(
          (out_proj): NonDynamicallyQuantizableLinear(in_features=512, o

### Exercise 2.2: Building positional encoding

Building the positional encoding can be observed from the implementation provided below.

#### Instructions
* Specify the PyTorch class that the positional encoder should subclass from.
* Initialize a positional encoding matrix for token positions in sequences up to max_length.
* Assign unique position encodings to the matrix pe by alternating the use of sine and cosine functions.
* Update the input embeddings tensor x to add position information about the sequence using the positional encodings matrix.

In [9]:
# Subclass an appropriate PyTorch class
class PositionalEncoder(nn.Module):
    def __init__(self, d_model, max_length):
        super(PositionalEncoder, self).__init__()
        self.d_model = d_model
        self.max_length = max_length

        # Initialize the positional encoding matrix
        pe = torch.zeros(max_length, d_model)

        position = torch.arange(0, max_length, dtype=torch.float).unsqueeze(1)
        div_term = torch.exp(torch.arange(0, d_model, 2, dtype=torch.float) * -(math.log(10000.0) / d_model))

        # Calculate and assign position encodings to the matrix
        pe[:, 0::2] = torch.sin(position * div_term)
        pe[:, 1::2] = torch.cos(position * div_term)
        pe = pe.unsqueeze(0)
        self.register_buffer('pe', pe)

    # Update the embeddings tensor adding the positional encodings
    def forward(self, x):
        x = x + self.pe[:, :x.size(1)]
        return x

### Exercise 2.3: Implementing multi-headed self-attention

Building the multi-headed self-attention can be observed from the implementation provided below.

#### Instructions
* Split the sequence embeddings x across the multiple attention heads.
* Compute dot-product based attention scores between the project query and key.
* Normalize the attention scores to obtain attention weights.
* Multiply the attention weights by the values and linearly transform the concatenated outputs per head.

In [10]:
class MultiHeadAttention(nn.Module):
    def __init__(self, d_model, num_heads):
        super(MultiHeadAttention, self).__init__()
        self.num_heads = num_heads
        self.d_model = d_model
        self.head_dim = d_model // num_heads

        self.query_linear = nn.Linear(d_model, d_model)
        self.key_linear = nn.Linear(d_model, d_model)
        self.value_linear = nn.Linear(d_model, d_model)
        self.output_linear = nn.Linear(d_model, d_model)
    def split_heads(self, x, batch_size):
        # Split the sequence embeddings in x across the attention heads
        x = x.view(batch_size, -1, self.num_heads, self.head_dim)
        return x.permute(0, 2, 1, 3).contiguous().view(batch_size * self.num_heads, -1, self.head_dim)

    def compute_attention(self, query, key, mask=None):
        # Compute dot-product attention scores
        scores = torch.matmul(query, key.permute(1, 2, 0))
        if mask is not None:
            scores = scores.masked_fill(mask == 0, float("-1e20"))
        # Normalize attention scores into attention weights
        attention_weights = F.softmax(scores, dim=-1)
        return attention_weights

    def forward(self, query, key, value, mask=None):
        batch_size = query.size(0)

        query = self.split_heads(self.query_linear(query), batch_size)
        key = self.split_heads(self.key_linear(key), batch_size)
        value = self.split_heads(self.value_linear(value), batch_size)

        attention_weights = self.compute_attention(query, key, mask)

        # Multiply attention weights by values and linearly project concatenated outputs
        output = torch.matmul(attention_weights, value)
        output = output.view(batch_size, self.num_heads, -1, self.head_dim).permute(0, 2, 1, 3).contiguous().view(batch_size, -1, self.d_model)
        return self.output_linear(output)

### Exercise 2.4: Post-attention feed-forward layer

Feed-forward sublayer following multi-head self-attention for every encoder layer is built as an example below:

#### Instructions
* Specify in the __init__() method the sizes of the two linear fully connected layers.
* Apply a forward pass through the two linear layers, using the ReLU() activation in between.

In [11]:
class FeedForwardSubLayer(nn.Module):
    # Specify the two linear layers' input and output sizes
    def __init__(self, d_model, d_ff):
        super(FeedForwardSubLayer, self).__init__()
        self.fc1 = nn.Linear(d_model, d_ff)
        self.fc2 = nn.Linear(d_ff, d_model)
        self.relu = nn.ReLU()

    # Apply a forward pass
    def forward(self, x):
        return self.fc2(self.relu(self.fc1(x)))

## Encoder Transformer

### Exercise 2.5: Encoder layer

Assembling a full encoder layer containing:

* A multi-headed self-attention mechanism.
* A feed-forward sublayer.
* A combined layer normalization and dropout to be applied after each of the above two stages.

In [12]:
# Complete the initialization of elements in the encoder layer
class EncoderLayer(nn.Module):
    def __init__(self, d_model, num_heads, d_ff, dropout):
        super(EncoderLayer, self).__init__()
        # Multi-head self-attention
        self.self_attn = MultiHeadAttention(d_model, num_heads)
        # Feedforward neural network
        self.feed_forward = FeedForwardSubLayer(d_model, d_ff)
        self.norm1 = nn.LayerNorm(d_model)
        self.norm2 = nn.LayerNorm(d_model)
        self.dropout = nn.Dropout(dropout)

    def forward(self, x, mask):
      # Multi-head self-attention
        attn_output = self.self_attn(x, x, x, mask)
        x = self.norm1(x + self.dropout(attn_output))
        # Feedforward neural network
        ff_output = self.feed_forward(x)
        return self.norm2(x + self.dropout(ff_output))

### Exercise 2.6: Encoder transformer body and head

Implementing the transformer body, that is consisting of a stack of multiple encoder layers and a task specific transformer head that is used to process the encoder's hidden states.

#### Instructions
* Define a stack of multiple encoder layers in the __init__() method.
* Complete the forward() method. Note that the process starts by converting the original sequence tokens in x into embeddings.
* Add final linear layer to project encoder results into raw classification outputs.
* Apply the necessary function to map raw classification outputs into log class probabilities.

In [13]:
class TransformerEncoder(nn.Module):
    def __init__(self, vocab_size, d_model, num_layers, num_heads, d_ff, dropout, max_sequence_length):
        super(TransformerEncoder, self).__init__()
        self.embedding = nn.Embedding(vocab_size, d_model)
        self.positional_encoding = PositionalEncoder(d_model, max_sequence_length)
        # Define a stack of multiple encoder layers
        self.layers = nn.ModuleList([EncoderLayer(d_model, num_heads, d_ff, dropout) for _ in range(num_layers)])

    # Complete the forward pass method
    def forward(self, x, mask):
        x = self.embedding(x)
        x = self.positional_encoding(x)
        for layer in self.layers:
            x = layer(x, mask)
        return x

class ClassifierHead(nn.Module):
    def __init__(self, d_model, num_classes):
        super(ClassifierHead, self).__init__()
        # Add linear layer for multiple-class classification
        self.fc = nn.Linear(d_model, num_classes)

    def forward(self, x):
        logits = self.fc(x[:, 0, :])
        # Obtain log class probabilities upon raw outputs
        return F.log_softmax(logits, dim=-1)

### Exercise 2.7: Testing the encoder transformer

A random and simple sequence will be used as an input to the encoder transformer. Obtaining the output (that is not even human-readable) without any errors is sufficient for this exercise.

The following components are adequate to form a full encoder transformer:
* PositionalEncoder
* MultiHeadAttention
* FeedForwardSublayer
* EncoderLayer
* TransformerEncoder
* ClassifierHead

Note: although a random input sequence and mask are being used here, in practice, the mask should correspond to the actual location of padding tokens in the input sequences to ensure all of them are the same length.

#### Instructions
* Instantiate the body and head of the encoder transformer.
* Complete the forward pass throughout the entire transformer body and head to obtain and print classification outputs.

In [14]:
num_classes = 3
vocab_size = 10000
batch_size = 8
d_model = 512
num_heads = 8
num_layers = 6
d_ff = 2048
sequence_length = 64
dropout = 0.1

input_sequence = torch.randint(0, vocab_size, (batch_size, sequence_length))
mask = torch.randint(0, 2, (sequence_length, sequence_length))

# Instantiate the encoder transformer's body and head
encoder = TransformerEncoder(vocab_size, d_model, num_layers, num_heads, d_ff, dropout, max_sequence_length=sequence_length)
classifier = ClassifierHead(d_model, num_classes)

# Complete the forward pass
output = encoder(input_sequence, mask)
classification = classifier(output)
print("Classification outputs for a batch of ", batch_size, "sequences:")
print(classification)

Classification outputs for a batch of  8 sequences:
tensor([[-0.4401, -2.3582, -1.3416],
        [-0.7164, -1.9729, -0.9877],
        [-0.6334, -1.6363, -1.2927],
        [-1.3165, -1.2192, -0.8290],
        [-0.6048, -1.9647, -1.1595],
        [-1.0914, -1.7756, -0.7034],
        [-0.8631, -1.6719, -0.9409],
        [-1.3094, -1.6752, -0.6111]], grad_fn=<LogSoftmaxBackward0>)


## Decoder Transformer

### Exercise 2.8: Decoder Layer

Encoder layer was built similarly, what is the main difference between these two structures?



#### Instructions
* A multi-headed self-attention mechanism.
* A feed-forward sublayer.
* Normalization and dropout to be applied.

In [15]:
class DecoderLayer(nn.Module):
    def __init__(self, d_model, num_heads, d_ff, dropout):
        super(DecoderLayer, self).__init__()
        # Multi-head self-attention
        self.self_attention = MultiHeadAttention(d_model, num_heads)
        self.norm1 = nn.LayerNorm(d_model)
        # Feedforward neural network
        self.feed_forward = FeedForwardSubLayer(d_model, d_ff)
        self.norm2 = nn.LayerNorm(d_model)
        self.dropout = nn.Dropout(dropout)

    def forward(self, x, self_mask):
        # Multi-head self-attention
        attention_output = self.self_attention(x, x, x, self_mask)
        x = x + self.dropout(attention_output)
        x = self.norm1(x)

        # Feedforward neural network
        ff_output = self.feed_forward(x)
        x = x + self.dropout(ff_output)
        x = self.norm2(x)

        return x

### Exercise 2.9: Building a decoder body and head

A high-level structure for a decoder only transformer will be implemented in this exercise. Different than the encoder transformer, the model body and head is not seperated in decoder transformer. Instead decoder transformer contains the model head and body. The model body is a stack of decoder layers.

#### Instructions
* Add the linear layer for the model head inside the TransformerDecoder class.
* Apply the last stage of the forward pass, through the model head.

In [16]:
class TransformerDecoder(nn.Module):
    def __init__(self, vocab_size, d_model, num_layers, num_heads, d_ff, dropout, max_sequence_length):
        super(TransformerDecoder, self).__init__()
        self.embedding = nn.Embedding(vocab_size, d_model)
        self.positional_encoding = PositionalEncoder(d_model, max_sequence_length)
        self.layers = nn.ModuleList([DecoderLayer(d_model, num_heads, d_ff, dropout) for _ in range(num_layers)])

        # Add a linear layer (head) for next-word prediction
        self.fc = nn.Linear(d_model, vocab_size)

    def forward(self, x, self_mask):
        x = self.embedding(x)
        x = self.positional_encoding(x)
        for layer in self.layers:
            x = layer(x, self_mask)

        # Apply the forward pass through the model head
        x = self.fc(x)
        return F.log_softmax(x, dim=-1)

### Exercise 2.10: Testing the decoder transformer

A random and simple sequence will be used as an input to the decoder transformer. Obtaining the output without any errors is sufficient for this exercise.

The following components are adequate to form a full decoder transformer:
* PositionalEncoder
* MultiHeadAttention
* FeedForwardSublayer
* DecoderLayer
* TransformerDecoder

#### Instructions
* Implement the decoder transformer with methods and classes defined before.
* Complete the forward pass throughout the entire transformer body and head to obtain and print  outputs.

In [17]:
num_classes = 3
vocab_size = 10000
batch_size = 8
d_model = 512
num_heads = 8
num_layers = 6
d_ff = 2048
sequence_length = 64
dropout = 0.1

input_sequence = torch.randint(0, vocab_size, (batch_size, sequence_length))

# Create a triangular attention mask for causal attention
self_attention_mask = (1 - torch.triu(torch.ones(1, sequence_length, sequence_length), diagonal=1)).bool()

# Instantiate the decoder transformer
decoder = TransformerDecoder(vocab_size, d_model, num_layers, num_heads, d_ff, dropout, max_sequence_length=sequence_length)

output = decoder(input_sequence, self_attention_mask)
print(output.shape)
print(output)

torch.Size([8, 64, 10000])
tensor([[[-10.1458,  -9.0383,  -9.5453,  ..., -10.1285,  -9.8935, -10.2319],
         [-10.0734,  -9.3788,  -9.6022,  ..., -10.7163,  -9.6544, -10.1705],
         [-10.7714,  -8.7524,  -8.8882,  ..., -10.0500,  -9.5250,  -9.5619],
         ...,
         [ -9.6249,  -9.1726,  -8.1047,  ..., -10.0399,  -8.4906,  -8.2246],
         [ -9.5591, -10.2290,  -9.5116,  ..., -10.5957, -10.2454,  -9.5035],
         [ -8.6030, -10.3418,  -9.0410,  ..., -10.2603,  -9.2483,  -9.6365]],

        [[-10.5391,  -9.1149,  -9.9728,  ...,  -9.3613,  -8.3935,  -8.7078],
         [-10.2205,  -9.6266,  -9.8232,  ...,  -9.6392,  -8.7367,  -9.1300],
         [-10.5028,  -9.0210,  -8.9935,  ...,  -9.6244,  -9.7951,  -9.6008],
         ...,
         [ -9.5705,  -8.7897,  -8.7925,  ...,  -9.4114,  -8.7086,  -9.6372],
         [ -9.9898,  -9.5243,  -9.5373,  ...,  -9.5448,  -9.4836,  -9.2993],
         [ -9.5879,  -9.0400,  -8.7634,  ...,  -9.5965,  -9.6398,  -9.1627]],

        [[-10.133

## Encoder-Decoder Transformer

### Exercise 2.11: Incorporating cross-attention in a decoder

In an encoder-decoder transformer, decoder layers incorporate two attention mechanisms: the causal attention inherent to any transformer decoder, plus a cross-attention that integrates source sequence information processed by the encoder with the target sequence information being processed through the decoder.

Modify the DecoderLayer class to incorporate this twofold attention scheme.

#### Instructions
* Initialize the two attention mechanisms used in an encoder-decoder transformers' decoder layer: causal (masked) self-attention and cross-attention.
* Pass the necessary input arguments (query, key, values, and mask) to the two attention stages in the forward pass.

In [18]:
class DecoderLayer(nn.Module):
    def __init__(self, d_model, num_heads, d_ff, dropout):
        super(DecoderLayer, self).__init__()

        # Initialize the causal (masked) self-attention and cross-attention
        self.self_attn = MultiHeadAttention(d_model, num_heads)
        self.cross_attn = MultiHeadAttention(d_model, num_heads)
        self.feed_forward = FeedForwardSubLayer(d_model, d_ff)
        self.norm1 = nn.LayerNorm(d_model)
        self.norm2 = nn.LayerNorm(d_model)
        self.norm3 = nn.LayerNorm(d_model)
        self.dropout = nn.Dropout(dropout)

    def forward(self, x, causal_mask, encoder_output, cross_mask):
        # Pass the necessary arguments to the causal self-attention and cross-attention
        self_attn_output = self.self_attn(x, x, x, causal_mask)
        x = self.norm1(x + self.dropout(self_attn_output))
        cross_attn_output = self.cross_attn(x, encoder_output, encoder_output, cross_mask)
        x = self.norm2(x + self.dropout(cross_attn_output))
        ff_output = self.feed_forward(x)
        x = self.norm3(x + self.dropout(ff_output))
        return x

### Exercise 2.12: Updating Decoder Transformer

In [19]:
class TransformerDecoder(nn.Module):
    def __init__(self, vocab_size, d_model, num_layers, num_heads, d_ff, dropout, max_sequence_length):
        super(TransformerDecoder, self).__init__()
        self.embedding = nn.Embedding(vocab_size, d_model)
        self.positional_encoding = PositionalEncoder(d_model, max_sequence_length)
        self.layers = nn.ModuleList([DecoderLayer(d_model, num_heads, d_ff, dropout) for _ in range(num_layers)])

    def forward(self, x, causal_mask, encoder_output, cross_mask):
        x = self.embedding(x)
        x = self.positional_encoding(x)
        for layer in self.layers:
            x = layer(x, causal_mask, encoder_output, cross_mask)
        return x

### Exercise 2.13: Trying out an encoder-decoder transformer
Your next task is complete the following piece of code to define and forward-pass an example batch of randomly generated input sequences through an encoder-decoder transformer.

Remember that we are only testing a yet-to-be-trained transformer architecture, hence the use of random input sequences.

The following components are required to form a full encoder-decoder transformer:
* MultiHeadAttention
* FeedForwardSubLayer
* PositionalEncoding
* EncoderLayer
* DecoderLayer
* TransformerEncoder
* TransformerDecoder
* ClassifierHead


#### Instructions

* Create a batch of random input sequences of size batch_size X sequence_length.
* Instantiate the two transformer bodies using the appropriate class names.
* Pass the necessary masks as arguments to the encoder and the decoder for their underlying attention mechanisms; each mask argument should be added in the same order they are utilized inside the encoder or decoder layer.

In [20]:
vocab_size = 10000
batch_size = 16
d_model = 512
num_heads = 8
num_layers = 6
d_ff = 2048
sequence_length = 128
dropout = 0.1


# Create a batch of random input sequences
input_sequence = torch.randint(0, vocab_size, (batch_size, sequence_length))
padding_mask = torch.randint(0, 2, (sequence_length, sequence_length))
causal_mask = torch.triu(torch.ones(sequence_length, sequence_length), diagonal=1)

# Instantiate the two transformer bodies
encoder = TransformerEncoder(vocab_size, d_model, num_layers, num_heads, d_ff, dropout, max_sequence_length=sequence_length)
decoder = TransformerDecoder(vocab_size, d_model, num_layers, num_heads, d_ff, dropout, max_sequence_length=sequence_length)

# Pass the necessary masks as arguments to the encoder and the decoder
encoder_output = encoder(input_sequence, padding_mask)
decoder_output = decoder(input_sequence, causal_mask, encoder_output, padding_mask)
print("Batch's output shape: ", decoder_output.shape)

Batch's output shape:  torch.Size([16, 128, 512])


## Chapter 3: Pre-trained Transformers and AutoModels

### Exercise 3.1: Classifying two opposing opinions

Previously we have built basic transformers and tested with sample sequences. For this exercise, we will suppy a pre-trained transformer (distilbert) with two opposing reviews.

#### Instructions
* Use the necessary task-specific classes and methods to load the tokenizer and pre-trained model.
* Tokenize the inputs and pass them to the LLM to perform classification inference.

In [21]:
model_name = "textattack/distilbert-base-uncased-SST-2"

# Load the tokenizer and pre-trained model
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForSequenceClassification.from_pretrained(
  model_name, num_labels=2)

text = ["The best movie I've ever watched!", "What an awful movie. I regret watching it."]

# Tokenize inputs and pass them to the model for inference
inputs = tokenizer(text, return_tensors="pt", padding=True)
outputs = model(**inputs)
logits = outputs.logits

predicted_classes = torch.argmax(logits, dim=1).tolist()
for idx, predicted_class in enumerate(predicted_classes):
    print(f"Predicted class for \"{text[idx]}\": {predicted_class}")

Predicted class for "The best movie I've ever watched!": 1
Predicted class for "What an awful movie. I regret watching it.": 0


### Exercise 3.2: Summarizing an opinion

Opinosis dataset contains product reviews and associated summaries. Using this dataset, and a transformer model, we can generate and experiment with summarization from input sequence.

#### Instructions
* Display the names of the features in the data, by accessing the downloaded 'train' fold.
* Use the necessary variables and methods to encode the input example, pass it to the model to generate a summary, and decode the summary.

In [22]:
dataset = load_dataset("opinosis")
model_name = "t5-small"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForSeq2SeqLM.from_pretrained(model_name)

print(f"Number of instances: {len(dataset['train'])}")

# Show the names of features in the training fold of the dataset
print(f"Feature names: {dataset['train'].column_names}")

# Encode the input example, obtain the summary, and decode it
example = dataset['train'][6]['review_sents']
input_ids = tokenizer.encode("summarize: " + example, return_tensors="pt", max_length=512, truncation=True)
summary_ids = model.generate(input_ids, max_length=150)
summary = tokenizer.decode(
  summary_ids[0], skip_special_tokens=True)

print("\nOriginal Text (first 400 characters): \n", example[:400])
print("\nGenerated Summary: \n", summary)

Number of instances: 51
Feature names: ['review_sents', 'summaries']

Original Text (first 400 characters): 
  Drivers seat not comfortable, the car itself compared to other models of similar class .
 It's very comfortable, remarkably large inside and just an overall great vehicle .
 Front seats are very uncomfortable .
 I'm 6' tall, and find the driving position pretty comfortable .
 However, there are a couple of things that kill it for me 1 terrible driver seat comfort, kills my back 2 lack luster 

Generated Summary: 
 Drivers seat not comfortable, the car itself compared to other models of similar class. it's very comfortable, remarkably large inside and just an overall great vehicle. Front seats are very uncomfortable. I'm 6' tall, and find the driving position pretty comfortable. but there are a couple of things that kill it for me 1 terrible driver seat comfort, kills my back 2 lack luster interior design.


### Exercise 3.3: Filling a Turkish phrasebook


Using Automodels from huggingfaces, and loading pre-trained transformers, we can quickly create a phrase translation from one language to a multitude of other languages.

#### Instructions
* Use the appropriate task-specific classes and methods to load the tokenizer and the model (the classes needed have been already imported for you, as usual!).
* Complete the instructions to encode the input sequences, generate translations, and decode them. For encodings, use an extra argument to return them as PyTorch tensors.

In [33]:
# Define model name, here it is specified as Turkish to English
model_name = "Helsinki-NLP/opus-mt-tr-en"

# Load the tokenizer and the model checkpoint
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForSeq2SeqLM.from_pretrained(model_name)

# Some phrases do not have any direct translation in English.
turkish_inputs = ["Merhaba", "Teşekkür ederim", "Nasılsın?", "Özür dilerim",
                  "Güle güle", "Afiyet olsun", "Çok yaşa!", "Kolay gelsin"]

# Encode the inputs, generate translations, decode, and print them
for turkish_input in turkish_inputs:
    input_ids = tokenizer.encode(turkish_input, return_tensors="pt")
    translated_ids = model.generate(input_ids)
    translated_text = tokenizer.decode(translated_ids[0], skip_special_tokens=True)
    print(f"Turkish: {turkish_input} | English: {translated_text}")

Turkish: Merhaba | English: Hello.
Turkish: Teşekkür ederim | English: Thank you.
Turkish: Nasılsın? | English: How are you?
Turkish: Özür dilerim | English: I'm sorry.
Turkish: Güle güle | English: Bye.
Turkish: Afiyet olsun | English: Enjoy your meal.
Turkish: Çok yaşa! | English: Long live!
Turkish: Kolay gelsin | English: Good luck with that.


### Exercise 3.4: Load and inspect a QA dataset

Loading a question answering dataset, inspecting data, extracting an answer from the given context and tokenizing the output is the task in this exercise.

#### Instructions
* Load the dataset "xtreme" and subset "MLQA.en.en"
* Initialize tokenizer using the "deepset/minilm-uncased-squad2" model checkpoint.
* Tokenize the example question and context retrieved, ensuring the results are returned as PyTorch tensors.

In [24]:
# Load a specific subset of the dataset
mlqa = load_dataset("xtreme", name="MLQA.en.en")

question = mlqa["test"]["question"][3]
context = mlqa["test"]["context"][3]
print("Question: ", question)
print("Context: ", context)

# Initialize the tokenizer using the model checkpoint
tokenizer = AutoTokenizer.from_pretrained("deepset/minilm-uncased-squad2")

# Tokenize the inputs returning the result as tensors
inputs = tokenizer(question, context, return_tensors="pt")
print("First five encoded tokens: ", inputs["input_ids"][0][:5])

Question:  what did the complainants alleged happen to them?
Context:  In 1994, five unnamed civilian contractors and the widows of contractors Walter Kasza and Robert Frost sued the USAF and the United States Environmental Protection Agency. Their suit, in which they were represented by George Washington University law professor Jonathan Turley, alleged they had been present when large quantities of unknown chemicals had been burned in open pits and trenches at Groom. Biopsies taken from the complainants were analyzed by Rutgers University biochemists, who found high levels of dioxin, dibenzofuran, and trichloroethylene in their body fat. The complainants alleged they had sustained skin, liver, and respiratory injuries due to their work at Groom, and that this had contributed to the deaths of Frost and Kasza. The suit sought compensation for the injuries they had sustained, claiming the USAF had illegally handled toxic materials, and that the EPA had failed in its duty to enforce the 

# Chapter 4: Evaluation


## Performance Metrics

### Exercise 4.1: Basic Metrics, Accuracy, Precision, Recall, F1 Score

Using the sentiment classification pipeline, here we will demonstrate how to calculate basic metrics: Accuracy, Precision, Recall, F1 Score.

#### Instructions
* Pass a list containing the four input reviews to the sentiment classification pipeline.
* Load the score metric from the evaluate library

In [25]:
test_examples = [
    {"text": "I am making a good use of this product!", "label": 1},
    {"text": "The service was disappointing.", "label": 0},
    {"text": "I learned a lot from this book.", "label": 1},
    {"text": "The book cover broke after two days of use.", "label": 0},
]
sentiment_analysis = pipeline("sentiment-analysis")

# Pass the four input texts (without labels) to the pipeline
predictions = sentiment_analysis([example["text"] for example in test_examples])

true_labels = [example["label"] for example in test_examples]
predicted_labels = [1 if pred["label"] == "POSITIVE" else 0 for pred in predictions]

# Load the accuracy metric
accuracy = evaluate.load("accuracy")

result = accuracy.compute(references=true_labels, predictions=predicted_labels)
print(result)


# Load the accuracy, precision, recall and F1 score .metrics
accuracy = evaluate.load("accuracy")
precision = evaluate.load("precision")
recall = evaluate.load("recall")
f1 = evaluate.load("f1")

# Obtain a description of each metric
print(accuracy.description)
print(precision.description)
print(recall.description)
print(f1.description)

test_examples = [
    "Fantastic hotel, exceeded expectations!",
    "Quiet despite central location, great stay.",
    "Friendly staff, welcoming atmosphere.",
    "Spacious, comfy room—a perfect retreat.",
    "Cleanliness could improve, overall decent stay.",
      "Disappointing stay, noisy and unclean room.",
    "Terrible service, unfriendly staff, won't return."
]
test_labels = [1, 1, 1, 1, 0, 0, 0]

# Pass the examples to the pipeline, and obtain a list of predicted labels
sentiment_analysis = pipeline("sentiment-analysis")
predictions = sentiment_analysis([example for example in test_examples])
predicted_labels = [1 if pred["label"] == "POSITIVE" else 0 for pred in predictions]

# Compute the metrics by comparing real and predicted labels
print(precision.compute(references=test_labels, predictions=predicted_labels))
print(recall.compute(references=test_labels, predictions=predicted_labels))
print(f1.compute(references=test_labels, predictions=predicted_labels))

No model was supplied, defaulted to distilbert-base-uncased-finetuned-sst-2-english and revision af0f99b (https://huggingface.co/distilbert-base-uncased-finetuned-sst-2-english).
Using a pipeline without specifying a model name and revision in production is not recommended.


{'accuracy': 1.0}


No model was supplied, defaulted to distilbert-base-uncased-finetuned-sst-2-english and revision af0f99b (https://huggingface.co/distilbert-base-uncased-finetuned-sst-2-english).
Using a pipeline without specifying a model name and revision in production is not recommended.



Accuracy is the proportion of correct predictions among the total number of cases processed. It can be computed with:
Accuracy = (TP + TN) / (TP + TN + FP + FN)
 Where:
TP: True positive
TN: True negative
FP: False positive
FN: False negative


Precision is the fraction of correctly labeled positive examples out of all of the examples that were labeled as positive. It is computed via the equation:
Precision = TP / (TP + FP)
where TP is the True positives (i.e. the examples correctly labeled as positive) and FP is the False positive examples (i.e. the examples incorrectly labeled as positive).


Recall is the fraction of the positive examples that were correctly labeled by the model as positive. It can be computed with the equation:
Recall = TP / (TP + FN)
Where TP is the true positives and FN is the false negatives.


The F1 score is the harmonic mean of the precision and recall. It can be computed with the equation:
F1 = 2 * (precision * recall) / (precision + recall)

{'precision': 

### Exercise 4.2: Perplexity

In general, perplexity is a measurement of how well a probability model predicts a sample. In the context of Natural Language Processing, perplexity is one way to evaluate language models. Here we will compute the perplexity score with Huggingface library's evaluate.

#### Instructions
* Encode the text prompt, pass it to the GPT2 model for text generation, and decode the generated text.
* Load and compute the mean perplexity score on the generated text.

In [26]:
# Define the model name
model_name = "gpt2"

# Initialize the tokenizer
tokenizer = AutoTokenizer.from_pretrained(model_name)

# Initialize the model
model = AutoModelForCausalLM.from_pretrained(model_name)

prompt = "Current trends show that by 2030 "

# Encode the prompt, generate text and decode it
prompt_ids = tokenizer.encode(prompt, return_tensors="pt")
output = model.generate(prompt_ids, max_length=20)
generated_text = tokenizer.decode(
  output[0], skip_special_tokens=True)

print("Generated Text: ", generated_text)

# Load and compute the perplexity score
perplexity = evaluate.load("perplexity", module_type="metric")
results = perplexity.compute(model_id='gpt2',
                             predictions=generated_text)
print("Perplexity: ", results['mean_perplexity'])

The attention mask and the pad token id were not set. As a consequence, you may observe unexpected behavior. Please pass your input's `attention_mask` to obtain reliable results.
Setting `pad_token_id` to `eos_token_id`:50256 for open-end generation.


Generated Text:  Current trends show that by 2030  the number of people living in poverty will be at its lowest


  0%|          | 0/6 [00:00<?, ?it/s]

Perplexity:  3514.5176167589552


### Exercise 4.3: Rouge, Meteor and Exact Match (EM)

Rouge, Meteor and Exact Match are some more advanced metrics used specifically in NLP tasks. Short description for each metric is provided below:

1. ROUGE (Recall-Oriented Understudy for Gisting Evaluation):
   - ROUGE is a set of metrics used for evaluating automatic summarization and machine translation tasks.
   - It measures the overlap between the model-generated summary (or translation) and the reference summaries (or translations).
   - ROUGE includes various variants like ROUGE-N, ROUGE-L, and ROUGE-W. ROUGE-N measures n-gram overlap, ROUGE-L measures the longest common subsequence, and ROUGE-W measures weighted LCS-based statistics.
   - ROUGE typically reports precision, recall, and F1-score for the overlap between the model output and the reference.

2. METEOR (Metric for Evaluation of Translation with Explicit ORdering):
   - METEOR is another metric used in machine translation and automatic summarization tasks.
   - It evaluates the quality of machine translation by considering not only exact word matches but also synonyms and paraphrases.
   - METEOR computes a score based on precision, recall, and alignment between words in the reference and system output. It also considers the WordNet synonymy and stem overlap.
   - METEOR has been shown to correlate well with human judgments of translation quality.

3. Exact Match (EM):
   - EM is a metric commonly used in question answering tasks to evaluate the accuracy of the model's responses.
   - It measures whether the model's output exactly matches the reference answer. If the generated answer matches the reference answer exactly, it gets a score of 1; otherwise, it gets a score of 0.
   - EM is a binary metric, indicating whether the model's output is an exact match to the ground truth answer.

Each of these metrics provides different perspectives on the quality and performance of NLP models. While ROUGE and METEOR are often used in text generation tasks like summarization and translation, EM is more commonly used in question answering and dialogue systems where exact answers are expected. Choosing the appropriate metric depends on the specific task and the desired evaluation criteria.

In [27]:
# Load the rouge metric
rouge = evaluate.load("rouge")

predictions = ["""Pluto is a dwarf planet in our solar system, located in the Kuiper Belt beyond Neptune, and was formerly considered the ninth planet until its reclassification in 2006."""]
references = ["""Pluto is a dwarf planet in the solar system, located in the Kuiper Belt beyond Neptune, and was previously deemed as a planet until it was reclassified in 2006."""]

# Calculate the rouge scores between the predicted and reference summaries
results = rouge.compute(predictions=predictions, references=references)
print("ROUGE results: ", results)

meteor = evaluate.load("meteor")

predictions = ["He thought it right and necessary to become a knight-errant, roaming the world in armor, seeking adventures and practicing the deeds he had read about in chivalric tales."]
references = ["He believed it was proper and essential to transform into a knight-errant, traveling the world in armor, pursuing adventures, and enacting the heroic deeds he had encountered in tales of chivalry."]

# Compute and print the METEOR score
results = meteor.compute(predictions=predictions, references=references)
print("Meteor: ", results['meteor'])


exact_match = evaluate.load("exact_match")

predictions = ["The cat sat on the mat.", "Theaters are great.", "It's like comparing oranges and apples."]
references = ["The cat sat on the mat?", "Theaters are great.", "It's like comparing apples and oranges."]

# Compute the exact match and print the results
results = exact_match.compute(predictions=predictions, references=references)
print("EM results: ", results)


ROUGE results:  {'rouge1': 0.7719298245614034, 'rouge2': 0.6181818181818182, 'rougeL': 0.736842105263158, 'rougeLsum': 0.736842105263158}


[nltk_data] Downloading package wordnet to /root/nltk_data...
[nltk_data]   Package wordnet is already up-to-date!
[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Package punkt is already up-to-date!
[nltk_data] Downloading package omw-1.4 to /root/nltk_data...
[nltk_data]   Package omw-1.4 is already up-to-date!


Meteor:  0.5350702240481536
EM results:  {'exact_match': 0.3333333333333333}


### Exercise 4.4: BLEU Score

BLEU (BiLingual Evaluation Understudy) is a metric for automatically evaluating machine-translated text. The BLEU score is a number between zero and one that measures the similarity of the machine-translated text to a set of high quality reference translations.

A pipeline based on the Helsinki-NLP Turkish-English translation model and the BLEU metric has been loaded, use evaluate.load("bleu") from the evaluate library.

#### Instructions
* Pass the input sentence in input_sentence to the translator, then calculate the BLEU metric using reference.

In [28]:
bleu = evaluate.load("bleu")

input_sentence_1 = "Merhaba, nasılsın?"

reference_1 = [
     ["Hello, how are you?", "Hi, how are you?"]
     ]

translator = pipeline("translation", model="Helsinki-NLP/opus-mt-tr-en")

# Translate the first input sentence
translated_output = translator(input_sentence_1)

translated_sentence = translated_output[0]['translation_text']

print("Translated:", translated_sentence)

# Calculate BLEU metric
results = bleu.compute(predictions=[translated_sentence], references=reference_1)
print(results)


input_sentences_2 = ["Merhaba, nasılsın?", "Çok iyiyim, teşekkür ederim."]

references_2 = [
     ["Hello, how are you?", "Hi, how are you?"],
     ["I'm great, thanks.", "I'm great, thank you."]
     ]

# Translate the input sentences, extract the translated text, and compute BLEU score
translator = pipeline("translation", model="Helsinki-NLP/opus-mt-tr-en")

translated_outputs = translator(input_sentences_2)

predictions = [translated_output['translation_text'] for translated_output in translated_outputs]
print(predictions)

results = bleu.compute(predictions=predictions, references=references_2)
print(results)

Translated: Hi, how are you?
{'bleu': 1.0, 'precisions': [1.0, 1.0, 1.0, 1.0], 'brevity_penalty': 1.0, 'length_ratio': 1.0, 'translation_length': 6, 'reference_length': 6}
['Hi, how are you?', 'Very well, thank you.']
{'bleu': 0.7598356856515925, 'precisions': [0.8333333333333334, 0.8, 0.75, 0.6666666666666666], 'brevity_penalty': 1.0, 'length_ratio': 1.0909090909090908, 'translation_length': 12, 'reference_length': 11}


## Reinforcement Learning

### Exercise 4.5: Setting up an RLHF loop
The Proximal Policy Optimization (PPO) algorithm is popularly used in Reinforcement Learning from Human Feedback (RLHF) loops to fine-tune an LLM. The algorithm facilitates the iterative updating of model parameters based on a reward model derived from human feedback, ensuring the model's behavior is adapted predicated on human preferences.

In this example, you will set up a simple RLHF loop based on PPO and a "dummy" reward model.

#### Instructions
* Instantiate a reference LLM to be used in the optimization process.
* Initialize a trainer configuration object assigning it to ppo_config.
* Create a PPOTrainer instance, assigning it the required arguments.
* Train the LLM for one step using the PPO instance.

In [29]:
model = AutoModelForCausalLMWithValueHead.from_pretrained('sshleifer/tiny-gpt2')

# Instantiate a reference model
model_ref = create_reference_model(model)

tokenizer = AutoTokenizer.from_pretrained('sshleifer/tiny-gpt2')

if tokenizer._pad_token is None:
    tokenizer.add_special_tokens({'pad_token': '[PAD]'})

# Initialize trainer configuration
ppo_config = PPOConfig(mini_batch_size = 1, batch_size=1)

prompt = "Next year, I "
input = tokenizer.encode(prompt, return_tensors="pt")
response  = respond_to_batch(model, input)

# Create a PPOTrainer instance
ppo_trainer = PPOTrainer(ppo_config, model, model_ref, tokenizer)
reward = [torch.tensor(1.0)]

# Train LLM for one step with PPO
train_stats = ppo_trainer.step([input[0]], [response[0]], reward)

print(train_stats)

You're using a GPT2TokenizerFast tokenizer. Please note that with a fast tokenizer, using the `__call__` method is faster than using a method to encode the text followed by a call to the `pad` method to get a padded encoding.


{'objective/kl': 0.0, 'objective/kl_dist': 0.0, 'objective/logprobs': array([[-10.803115 , -10.834713 , -10.795107 , -10.808006 , -10.840378 ,
        -10.826831 , -10.880978 , -10.816012 , -10.838393 , -10.82634  ,
        -10.808245 , -10.82328  , -10.839426 , -10.849914 , -10.801011 ,
        -10.830755 , -10.835113 , -10.814901 , -10.81929  , -10.8222685,
        -10.853684 , -10.8385725, -10.800554 , -10.846208 ]],
      dtype=float32), 'objective/ref_logprobs': array([[-10.803115 , -10.834713 , -10.795107 , -10.808006 , -10.840378 ,
        -10.826831 , -10.880978 , -10.816012 , -10.838393 , -10.82634  ,
        -10.808245 , -10.82328  , -10.839426 , -10.849914 , -10.801011 ,
        -10.830755 , -10.835113 , -10.814901 , -10.81929  , -10.8222685,
        -10.853684 , -10.8385725, -10.800554 , -10.846208 ]],
      dtype=float32), 'objective/kl_coef': 0.2, 'objective/entropy': 216.61215209960938, 'ppo/mean_non_score_reward': 0.0, 'ppo/mean_scores': 1.0, 'ppo/std_scores': nan, 'tok

## Use-Case

### Exercise 4.6: Toxicity assessment

Assess the toxicity level of given reviews.

#### Instructions
* Calculate the individual toxicity of each sequence, the maximum toxicity and toxicity ratio per employee.

In [30]:
emp_1 = ["Everyone in the team adores him",
           "He is a true genius, pure talent"]
emp_2 = ["Nobody in the team likes him",
           "He is a useless 'good-for-nothing'"]

toxicity_metric = evaluate.load("toxicity")

# Calculate the individual toxicities, maximum toxicities, and toxicity ratios
toxicity_1 = toxicity_metric.compute(predictions=emp_1)
toxicity_2 = toxicity_metric.compute(predictions=emp_2)
print("Toxicities (emp. 1):", toxicity_1['toxicity'])
print("Toxicities (emp. 2): ", toxicity_2['toxicity'])

toxicity_1_max = toxicity_metric.compute(predictions=emp_1, aggregation="maximum")
toxicity_2_max = toxicity_metric.compute(predictions=emp_2, aggregation="maximum")
print("Maximum toxicity (emp. 1):", toxicity_1_max['max_toxicity'])
print("Maximum toxicity (emp. 2): ", toxicity_2_max['max_toxicity'])

toxicity_1_ratio = toxicity_metric.compute(predictions=emp_1, aggregation="ratio")
toxicity_2_ratio = toxicity_metric.compute(predictions=emp_2, aggregation="ratio")
print("Toxicity ratio (emp. 1):", toxicity_1_ratio['toxicity_ratio'])
print("Toxicity ratio (emp. 2): ", toxicity_2_ratio['toxicity_ratio'])



Toxicities (emp. 1): [0.0001386617950629443, 0.00013368591316975653]
Toxicities (emp. 2):  [0.00014245195779949427, 0.010071253404021263]
Maximum toxicity (emp. 1): 0.0001386617950629443
Maximum toxicity (emp. 2):  0.010071253404021263
Toxicity ratio (emp. 1): 0.0
Toxicity ratio (emp. 2):  0.0


### Exercise 4.7: Best "regard"!

In this wrap-up exercise, you will use the regard metric to assess polarity in two lists of sample texts associated to two groups of employees in two branches of a company. Each branch is located in a different country and geographical region of the planet.

(It was reported a priori that due to naturally occurring cultural differences, both groups exhibit differences in the way they perform their work.)

#### Instructions
* Load the regard and regard-comparison metrics.
* Compute the regard (polarity scores) for each group separately.
* Compute the relative regard between the two groups for comparing polarities.

In [31]:
group1 = ['abc are described as loyal employees',
          'abc are ambitious in their career expectations']
group2 = ['abc are known for causing lots of team conflicts',
          'abc are verbally violent']

# Load the regard and regard-comparison metrics
regard = evaluate.load("regard")
regard_comp = evaluate.load("regard", "compare")

# Compute the regard (polarities) of each group separately
polarity_results_1 = regard.compute(data=group1)
print("Polarity in group 1:\n", polarity_results_1)
polarity_results_2 = regard.compute(data=group2)
print("Polarity in group 2:\n", polarity_results_2)

# Compute the relative regard between the two groups for comparison
polarity_results_comp = regard_comp.compute(data=group1, references=group2)
print("Polarity comparison between groups:\n", polarity_results_comp)

Polarity in group 1:
 {'regard': [[{'label': 'positive', 'score': 0.9098386764526367}, {'label': 'neutral', 'score': 0.059396952390670776}, {'label': 'other', 'score': 0.026468101888895035}, {'label': 'negative', 'score': 0.004296252969652414}], [{'label': 'positive', 'score': 0.7809812426567078}, {'label': 'neutral', 'score': 0.18085983395576477}, {'label': 'other', 'score': 0.030492952093482018}, {'label': 'negative', 'score': 0.007666013203561306}]]}
Polarity in group 2:
 {'regard': [[{'label': 'negative', 'score': 0.9658734202384949}, {'label': 'other', 'score': 0.021555885672569275}, {'label': 'neutral', 'score': 0.012026479467749596}, {'label': 'positive', 'score': 0.0005441228277049959}], [{'label': 'negative', 'score': 0.9774736166000366}, {'label': 'other', 'score': 0.012994581833481789}, {'label': 'neutral', 'score': 0.008945506066083908}, {'label': 'positive', 'score': 0.0005862844991497695}]]}
Polarity comparison between groups:
 {'regard_difference': {'positive': 0.8448447

## Vision Transformers (IN PROGRESS)

In [32]:
break

import numpy as np
import torch
import torch.nn as nn
from torch.nn import CrossEntropyLoss
from torch.optim import Adam
from torch.utils.data import DataLoader
from torchvision.datasets.mnist import MNIST
from torchvision.transforms import ToTensor
from tqdm import tqdm, trange

np.random.seed(0)
torch.manual_seed(0)


def patchify(images, n_patches):
    n, c, h, w = images.shape

    assert h == w, "Patchify method is implemented for square images only"

    patches = torch.zeros(n, n_patches**2, h * w * c // n_patches**2)
    patch_size = h // n_patches

    for idx, image in enumerate(images):
        for i in range(n_patches):
            for j in range(n_patches):
                patch = image[
                    :,
                    i * patch_size : (i + 1) * patch_size,
                    j * patch_size : (j + 1) * patch_size,
                ]
                patches[idx, i * n_patches + j] = patch.flatten()
    return patches


class MyMSA(nn.Module):
    def __init__(self, d, n_heads=2):
        super(MyMSA, self).__init__()
        self.d = d
        self.n_heads = n_heads

        assert d % n_heads == 0, f"Can't divide dimension {d} into {n_heads} heads"

        d_head = int(d / n_heads)
        self.q_mappings = nn.ModuleList(
            [nn.Linear(d_head, d_head) for _ in range(self.n_heads)]
        )
        self.k_mappings = nn.ModuleList(
            [nn.Linear(d_head, d_head) for _ in range(self.n_heads)]
        )
        self.v_mappings = nn.ModuleList(
            [nn.Linear(d_head, d_head) for _ in range(self.n_heads)]
        )
        self.d_head = d_head
        self.softmax = nn.Softmax(dim=-1)

    def forward(self, sequences):
        # Sequences has shape (N, seq_length, token_dim)
        # We go into shape    (N, seq_length, n_heads, token_dim / n_heads)
        # And come back to    (N, seq_length, item_dim)  (through concatenation)
        result = []
        for sequence in sequences:
            seq_result = []
            for head in range(self.n_heads):
                q_mapping = self.q_mappings[head]
                k_mapping = self.k_mappings[head]
                v_mapping = self.v_mappings[head]

                seq = sequence[:, head * self.d_head : (head + 1) * self.d_head]
                q, k, v = q_mapping(seq), k_mapping(seq), v_mapping(seq)

                attention = self.softmax(q @ k.T / (self.d_head**0.5))
                seq_result.append(attention @ v)
            result.append(torch.hstack(seq_result))
        return torch.cat([torch.unsqueeze(r, dim=0) for r in result])


class MyViTBlock(nn.Module):
    def __init__(self, hidden_d, n_heads, mlp_ratio=4):
        super(MyViTBlock, self).__init__()
        self.hidden_d = hidden_d
        self.n_heads = n_heads

        self.norm1 = nn.LayerNorm(hidden_d)
        self.mhsa = MyMSA(hidden_d, n_heads)
        self.norm2 = nn.LayerNorm(hidden_d)
        self.mlp = nn.Sequential(
            nn.Linear(hidden_d, mlp_ratio * hidden_d),
            nn.GELU(),
            nn.Linear(mlp_ratio * hidden_d, hidden_d),
        )

    def forward(self, x):
        out = x + self.mhsa(self.norm1(x))
        out = out + self.mlp(self.norm2(out))
        return out


class MyViT(nn.Module):
    def __init__(self, chw, n_patches=7, n_blocks=2, hidden_d=8, n_heads=2, out_d=10):
        # Super constructor
        super(MyViT, self).__init__()

        # Attributes
        self.chw = chw  # ( C , H , W )
        self.n_patches = n_patches
        self.n_blocks = n_blocks
        self.n_heads = n_heads
        self.hidden_d = hidden_d

        # Input and patches sizes
        assert (
            chw[1] % n_patches == 0
        ), "Input shape not entirely divisible by number of patches"
        assert (
            chw[2] % n_patches == 0
        ), "Input shape not entirely divisible by number of patches"
        self.patch_size = (chw[1] / n_patches, chw[2] / n_patches)

        # 1) Linear mapper
        self.input_d = int(chw[0] * self.patch_size[0] * self.patch_size[1])
        self.linear_mapper = nn.Linear(self.input_d, self.hidden_d)

        # 2) Learnable classification token
        self.class_token = nn.Parameter(torch.rand(1, self.hidden_d))

        # 3) Positional embedding
        self.register_buffer(
            "positional_embeddings",
            get_positional_embeddings(n_patches**2 + 1, hidden_d),
            persistent=False,
        )

        # 4) Transformer encoder blocks
        self.blocks = nn.ModuleList(
            [MyViTBlock(hidden_d, n_heads) for _ in range(n_blocks)]
        )

        # 5) Classification MLPk
        self.mlp = nn.Sequential(nn.Linear(self.hidden_d, out_d), nn.Softmax(dim=-1))

    def forward(self, images):
        # Dividing images into patches
        n, c, h, w = images.shape
        patches = patchify(images, self.n_patches).to(self.positional_embeddings.device)

        # Running linear layer tokenization
        # Map the vector corresponding to each patch to the hidden size dimension
        tokens = self.linear_mapper(patches)

        # Adding classification token to the tokens
        tokens = torch.cat((self.class_token.expand(n, 1, -1), tokens), dim=1)

        # Adding positional embedding
        out = tokens + self.positional_embeddings.repeat(n, 1, 1)

        # Transformer Blocks
        for block in self.blocks:
            out = block(out)

        # Getting the classification token only
        out = out[:, 0]

        return self.mlp(out)  # Map to output dimension, output category distribution


def get_positional_embeddings(sequence_length, d):
    result = torch.ones(sequence_length, d)
    for i in range(sequence_length):
        for j in range(d):
            result[i][j] = (
                np.sin(i / (10000 ** (j / d)))
                if j % 2 == 0
                else np.cos(i / (10000 ** ((j - 1) / d)))
            )
    return result


def main():
    # Loading data
    transform = ToTensor()

    train_set = MNIST(
        root="./../datasets", train=True, download=True, transform=transform
    )
    test_set = MNIST(
        root="./../datasets", train=False, download=True, transform=transform
    )

    train_loader = DataLoader(train_set, shuffle=True, batch_size=128)
    test_loader = DataLoader(test_set, shuffle=False, batch_size=128)

    # Defining model and training options
    device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
    print(
        "Using device: ",
        device,
        f"({torch.cuda.get_device_name(device)})" if torch.cuda.is_available() else "",
    )
    model = MyViT(
        (1, 28, 28), n_patches=7, n_blocks=2, hidden_d=8, n_heads=2, out_d=10
    ).to(device)
    N_EPOCHS = 5
    LR = 0.005

    # Training loop
    optimizer = Adam(model.parameters(), lr=LR)
    criterion = CrossEntropyLoss()
    for epoch in trange(N_EPOCHS, desc="Training"):
        train_loss = 0.0
        for batch in tqdm(
            train_loader, desc=f"Epoch {epoch + 1} in training", leave=False
        ):
            x, y = batch
            x, y = x.to(device), y.to(device)
            y_hat = model(x)
            loss = criterion(y_hat, y)

            train_loss += loss.detach().cpu().item() / len(train_loader)

            optimizer.zero_grad()
            loss.backward()
            optimizer.step()

        print(f"Epoch {epoch + 1}/{N_EPOCHS} loss: {train_loss:.2f}")

    # Test loop
    with torch.no_grad():
        correct, total = 0, 0
        test_loss = 0.0
        for batch in tqdm(test_loader, desc="Testing"):
            x, y = batch
            x, y = x.to(device), y.to(device)
            y_hat = model(x)
            loss = criterion(y_hat, y)
            test_loss += loss.detach().cpu().item() / len(test_loader)

            correct += torch.sum(torch.argmax(y_hat, dim=1) == y).detach().cpu().item()
            total += len(x)
        print(f"Test loss: {test_loss:.2f}")
        print(f"Test accuracy: {correct / total * 100:.2f}%")


if __name__ == "__main__":
    main()

SyntaxError: 'break' outside loop (<ipython-input-32-b01cba63fab9>, line 1)