<a href="https://colab.research.google.com/github/caglarmert/DI725/blob/main/DI725_Lab_0_d20.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# DI 725: Transformers and Attention-Based Deep Networks

## An End-to-End Tutorial for Implementing Transformers

### Authors:
* ttemizel@metu.edu.tr
* atemizel@metu.edu.tr
* mecaglar@metu.edu.tr

# Introduction

<div>
<img src="https://github.com/caglarmert/DI725/blob/main/src/attention_research_1.png?raw=true" width="400"/>
</div>

---

### Transformer Architecture Overview

The Transformer architecture revolutionized the field of natural language processing (NLP) by introducing a model that relies entirely on self-attention mechanisms, eliminating the need for recurrent or convolutional layers. Here's a brief overview of the main components of the Transformer architecture:

#### 1. Input Embeddings
- The input sequence, typically a sequence of word embeddings, is passed into the model. Each word is represented as a high-dimensional vector, often initialized randomly or pre-trained on a large corpus.

#### 2. Positional Encoding
- Since the Transformer doesn't inherently understand the order of tokens in a sequence, positional encodings are added to the input embeddings to provide information about token positions. These are usually sinusoidal functions of different frequencies and phases.

#### 3. Encoder
- The Encoder consists of multiple identical layers (usually 6-12). Each layer consists of two main sub-components:
  - **Multi-Head Self-Attention Mechanism**: Computes attention weights between all pairs of words in the input sequence to capture relationships and dependencies among them.
  - **Feedforward Neural Network**: Applies a fully connected feedforward network to each position separately and identically. It processes the output of the attention mechanism in a position-wise manner.

#### 4. Decoder
- The Decoder also consists of multiple identical layers (the same number as in the Encoder). Each layer in the Decoder has three main sub-components:
  - **Masked Multi-Head Self-Attention Mechanism**: Similar to the Encoder's attention mechanism, but with a mask applied to prevent positions from attending to subsequent positions, ensuring that the model attends only to previous positions during generation.
  - **Encoder-Decoder Attention Mechanism**: Allows the Decoder to focus on different parts of the input sequence (Encoder's output) by computing attention scores between the current position in the Decoder and all positions in the Encoder's output.
  - **Feedforward Neural Network**: Similar to the Encoder, a fully connected feedforward network is applied to each position separately and identically.

#### 5. Output Layer
- The output of the final Decoder layer is passed through a linear layer followed by a softmax function to produce the probability distribution over the output vocabulary. During training, this distribution is compared to the actual target sequence using cross-entropy loss.

#### 6. Loss Computation
- The model's output is compared to the actual target sequence using cross-entropy loss. This comparison drives the learning process through backpropagation.

The Transformer architecture's key innovation lies in its ability to capture long-range dependencies in sequences efficiently through self-attention mechanisms, making it highly parallelizable and scalable compared to traditional recurrent neural networks.

---

The Transformer architecture, introduced in the paper "Attention is All You Need", revolutionized natural language processing by relying solely on attention mechanisms instead of recurrent connections. Here's a breakdown of its key components:

Overall Structure:

An encoder processes the input sequence to capture its meaning.
A decoder generates the output sequence based on the encoded representation and any additional context.
Encoder and Decoder Blocks:

Both the encoder and decoder consist of multiple identical encoder blocks and decoder blocks, respectively.
Each block has two sub-blocks:
Multi-head Self-attention: Captures relationships between elements within the sequence (encoder) or within the previously generated output (decoder).
Feed-forward network: Adds non-linearity and complexity to the model.
Residual connection and Layer Norm: Improve training stability and gradient flow.
Key Details of Each Block:

1. Multi-head Self-attention:

Splits the input into queries, keys, and values.
Computes attention scores based on the similarity between queries and keys.
Masks out padded elements using attention masks.
Aggregates values weighted by the attention scores, resulting in a context vector for each element.
The "multi-head" part refers to performing this self-attention mechanism multiple times with different query and key projections, capturing diverse relationships.
2. Feed-forward network:

A two-layer network with ReLU activation for non-linearity.
Adds complexity and allows the model to learn more intricate relationships.
3. Residual connection and Layer Norm:

Shortcuts around each sub-block are added to ensure the gradients can flow easily through the network.
Layer normalization rescales and shifts the output of each sub-block, stabilizing the training process.
Additional Components:

Encoder-decoder attention: In a sequence-to-sequence setting, the decoder attends to the encoded representation in each block to incorporate context into the generated output.
Positional encoding: Since the Transformer doesn't have inherent positional information, additional embeddings are added to encode the relative positions of elements in the sequence.
Output layer: In the decoder, a final layer converts the internal representation into the final vocabulary probabilities for output generation.
Benefits of Transformers:

Parallelization: Attention allows for better parallelization during training compared to recurrent models.
Long-range dependencies: Can capture long-range dependencies in sequences without relying on sequential processing.
Adaptability: Can be applied to various NLP tasks with minor modifications.
Drawbacks of Transformers:

Computational cost: Attention can be computationally expensive, especially for long sequences.
Memory intensive: Requires storing the entire input sequence for attention computations.

## Imports
In this part we import the required libraries. Running this part on the Colab servers is required for later parts. It is advised to check the associated python requirements.txt, that is frozen at the time of preparation of this notebook, in case of any library or version error occurs while running this notebook. Mind that installing everything locally via pip install -r "requirements.txt" is not advised though, mainly because of the discrepancies between Colab and locally available machine.

In [78]:
# Uncomment any install if needed. It is recommended that these installations
# are performed prior to any notebook runs and imports

# !pip install datasets # Huggingface dataset library
# !pip install evaluate # Used for evaluation metrics
# !pip install rouge_score # Is a text evaluation metric
# !pip install trl #Transformers Reinforcement Learning framework
# !pip install sacremoses # Used for specific characters, useful for languages like Turkish

Collecting sacremoses
  Downloading sacremoses-0.1.1-py3-none-any.whl (897 kB)
[2K     [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m897.5/897.5 kB[0m [31m7.4 MB/s[0m eta [36m0:00:00[0m
Installing collected packages: sacremoses
Successfully installed sacremoses-0.1.1


In [2]:
from transformers import pipeline
import math
import torch
from torch import nn
import torch.nn.functional as F
import evaluate

from transformers import AutoTokenizer
from transformers import AutoModelForSequenceClassification, AutoModelForSeq2SeqLM
from datasets import load_dataset


from trl import PPOConfig, PPOTrainer, AutoModelForCausalLMWithValueHead, create_reference_model
from trl.core import respond_to_batch
from transformers import AutoModelForCausalLM, AutoTokenizer


After importing the main libraries, we can continue with the transformers. First lets check what does the above import does. We have imported pipeline from transformers library, from huggingface 🤗.

The [documentation](https://huggingface.co/docs/transformers) for the Transformers library.

The [pipelines](https://huggingface.co/docs/transformers/main_classes/pipelines) is a class of the Transformers library. It is used for easy inference, abstracts most of the complexity and offers simple API for some dedicated tasks.

The [torch](https://pytorch.org/) is a popular and diverse machine learning framework, enabling low level implementation (as low as it gets with Python anyway). The Neural Networks (nn) is a library within PyTorch that enables operations with neural network structures.

The [Auto](https://huggingface.co/docs/transformers/model_doc/auto) classes contain many high level methods and models for various specific tasks, sometimes required for a pre-processing step such as tokenizers.

The [datasets](https://huggingface.co/docs/datasets) is the 🤗 library used for datasets (who would have guess?). Tabular, Audio, Computer Vision, and Text data can be loaded or shared via this library.

The [TRL](https://huggingface.co/docs/trl) (standing for: Transformer Reinforcement Learning) is the comprehensive toolkit designed for training transformer language models using Reinforcement Learning. It encompasses a range of tools, starting from the initial Supervised Fine-tuning (SFT) phase, through Reward Modeling (RM), up to the Proximal Policy Optimization (PPO) stage.

## Chapter 1: Basic Transformers

In this first introductory section, we begin with experiencing basic and very high level usage of transformers.

### Exercise 1.1: Classifying a text

Huggingface Hub is an open-source public colaboration of various models. Large Language Models, require tremendous amount of training data and time, thus once trained they are invaluable and their inference can be adapted to various use-cases.

This first practice will be about loading a model from the huggingface hub, into a pipeline, to perform a task.

It is important to note that model loading with specific model name is advised or else it will opt to defaults.

#### Instructions
* Import the necessary function from the transformers library to load Hugging Face LLMs as pipelines.
* Load the model specified in model_name into a suitable pipeline for sentiment classification in text.
* Pass the customer review defined in prompt to the pipeline to get a sentiment prediction.

In [57]:
# Specify the task name
task_name = "text-classification"
# Specify the model to be loaded
model_name = "lxyuan/distilbert-base-multilingual-cased-sentiments-student"
# We can change the model name to
# "mrm8488/distilroberta-finetuned-financial-news-sentiment-analysis"
# "lxyuan/distilbert-base-multilingual-cased-sentiments-student"
classifier = pipeline(task = task_name, model = model_name)

# Clearly this is a positive sentiment from tripadvisor, 5 star review
prompt = "I like this place, very much so because of the excellent location in the midst of the botanical park and city center."
prediction = classifier(prompt)
print(prompt, "\nSentiment:", prediction[0]["label"], "Score:",prediction[0]["score"],)

# And a negative one, 1 star review.
prompt = "There was nothing to see, the building is under construction, you can't go into building, wasting my afternoon time in ankara."
prediction = classifier(prompt)
print(prompt, "\nSentiment:", prediction[0]["label"], "Score:",prediction[0]["score"],)



I like this place, very much so because of the excellent location in the midst of the botanical park and city center. 
Sentiment: positive Score: 0.8754111528396606
There was nothing to see, the building is under construction, you can't go into building, wasting my afternoon time in ankara. 
Sentiment: negative Score: 0.4693423807621002


### Exercise 1.2: Summarizing a text

Summarization is a challanging language task, requires sequence-to-sequence models, such as the one we are using here. The task is about summarizing a given long text.

#### Instructions

* Load the model, based on the T5 transformer architecture and specified in model_name, into a text summarization pipeline.
* Pass long_text to the model pipeline, to produce a summary limited to 50 tokens length.
* Access and print the summarized text in outputs.

In [74]:
# Specify a model name, note that we are using a small version so don't expect much
model_name = "cnicu/t5-small-booksum"
# Provide the long text
long_text = "Tunali hilmi, which is a bustling street, is a hub for various commercial activities as it extends southwards toward Kugulu Park. Tunali Hilmi Avenue is regarded as one of the city's most charming streets, adorned with a variety of shops, boutiques, and souvenir stores. The neighborhood exudes a sense of luxury and offers a wide range of goods, albeit at slightly higher prices compared to other areas. However, the elevated cost is justified by the high-quality shopping experience, particularly appealing to those who enjoy outdoor retail therapy."

# Load the model pipeline for text summarization
summarizer = pipeline(task="summarization", model=model_name)

# Pass the long text to the model to summarize it
outputs = summarizer(long_text, max_length=50)

# Access and print the summarized text in the outputs variable
print(outputs[0]['summary_text'])

Tunali hilmi is regarded as one of the city's most charming streets, adorned with shops, boutiques, and souvenir stores. The neighborhood offers a wide range of goods, albeit at slightly higher prices


### Exercise 1.3: Translating a text

Translation is another challanging language task, requiring models trained specifically for source and target languages.

#### Instructions

* Define a pipeline for Turkish-to-English translation, specifying the source and target languages in the pipeline task argument.
* Translate the text in input_text using the pipeline.
* Access and print the translated text in the outputs variable: translations.

In [81]:
# Specify the model name, from Turkish (tr) to English (en)
model_name = "Helsinki-NLP/opus-mt-tr-en"

# A short intro about METU
input_text = "Orta Doğu Teknik Üniversitesi, Türkiye ve Orta Doğu ülkelerinin kalkınmalarına katkıda bulunmak, özellikle fen bilimleri ve sosyal bilimler alanlarında uzman yetiştirmek üzere 15 Kasım 1956 tarihinde Orta Doğu Yüksek Teknoloji Enstitüsü adıyla eğitime başlamıştır. "

# Define pipeline for Spanish-to-English translation
translator = pipeline("translation_tr_to_en", model=model_name)

# Translate the input text
translations = translator(input_text)

# Access the output to print the translated text in English
print("Original text: ", input_text)
print("Translated text:", translations[0]['translation_text'])

Original text:  Orta Doğu Teknik Üniversitesi, Türkiye ve Orta Doğu ülkelerinin kalkınmalarına katkıda bulunmak, özellikle fen bilimleri ve sosyal bilimler alanlarında uzman yetiştirmek üzere 15 Kasım 1956 tarihinde Orta Doğu Yüksek Teknoloji Enstitüsü adıyla eğitime başlamıştır. 
Translated text: The Middle East Technical University began training as the Middle East Institute of Technology on 15 November 1956 to contribute to the development of Turkey and Middle East countries, especially to develop experts in science and social sciences.


### Exercise 1.4: Question-Answering
Next, let's practice loading a Hugging Face LLM into a pipeline for question-answering (QA, for short). This time, you will use the default model supplied by Hugging Face transformers library for QA pipelines.

#### Instructions
* Instantiate a pipeline for question-answering.
* Pass the necessary pieces of information as inputs to the pipeline.
* Access and print the extracted answer in the outputs variable.

In [98]:
# Load the model pipeline for question-answering
model_name = "distilbert-base-cased-distilled-squad"

qa_model = pipeline("question-answering",model=model_name)

# Provide the context
context = "The history of Ankara Castle, one of the symbols of the province, is as old as the history of the city. It remains to be determined when the castle, which existed when the Galatians settled in Ankara and was repaired during the Roman period, was built. Next to the hill on which it was founded, that is, Hatip Stream, is 110 m above the Bent Stream. The castle has more than 20 towers. The outer castle surrounds Ankara in the shape of a heart. The four-storey inner castle is made of Ankara Stone and partly of collected stones. The inner castle has two large gates, one is called the Outer Gate and the other is the Citadel Gate. There is a book belonging to the Ilkhanate on this door. The inner castles consist of a total of 42 pentagonal towers with a length of 14-16 m. There is an inscription in the northwestern part showing the repairs made by the Seljuk ruler."

# Provide the questions
questions = ["How many towers does the castle have?",
             "When did the castle was build?",
             "How long are the towers in the inner castle?",
             "Who repaired the castle and inscribed?",
             "What is the material of the castle?"]

# Pass the necessary inputs to the LLM pipeline for question-answering
outputs = qa_model(question=questions, context=context)

# Access and print the answer
for i in range(len(questions)):
  print("Question: ", questions[i], "\nAnswer:", outputs[i]['answer'])

Question:  How many towers does the castle have? 
Answer: more than 20
Question:  When did the castle was build? 
Answer: Roman period
Question:  How long are the towers in the inner castle? 
Answer: 14-16 m
Question:  Who repaired the castle and inscribed? 
Answer: the Seljuk ruler
Question:  What is the material of the castle? 
Answer: Ankara Stone and partly of collected stones


### Exercise 1.5: Text Generation

Text generation, is the most famous application of transformers, namely ChatGPT (standing for Generative Pre-Trained). Here we will use an older version (GPT-2) to generate text for customers leaving reviews for our business on a public website.

#### Instructions
* Instantiate the generator variable as a pipeline that loads the "gpt2" pre-trained text generation model.
* Build a prompt for the LLM that concatenates the customer review with the hotel response's initial sentence.
* Pass the prompt to the previously defined pipeline to generate (inference) the following text in the hotel response, specifying a maximum length of 150 tokens for the generated output.
* Print the generated output.

In [104]:
# Create a pipeline for text generation using the gpt2 model
generator = pipeline("text-generation", model="gpt2")

customer_text = "The Divan is a very comfortable and professionally run hotel in Ankara. The staff are extremely helpful and friendly. Rooms and beds are very comfortable, with all the facilities that you would expect in a four star hotel. The breakfast buffet is very extensive (open 6.30AM to 10.30AM). The only down-side is the hotels location, a ten to fifteen minute taxi ride away from the city centre, embassies and government buildings, but is located within a very quiet residential area."

response = "Dear Our Valuable Guest, Thank you for taking the time to leave us a review."

# Build the prompt for the text generation LLM
prompt = f"Customer review:\n{customer_text}\n\nHotel reponse to the customer:\n{response}"

# Pass the prompt to the model pipeline
outputs = generator(prompt, max_length=200, pad_token_id=generator.tokenizer.eos_token_id)

# Print the augmented sequence generated by the model
print(outputs[0]['generated_text'])

Customer review:
The Divan is a very comfortable and professionally run hotel in Ankara. The staff are extremely helpful and friendly. Rooms and beds are very comfortable, with all the facilities that you would expect in a four star hotel. The breakfast buffet is very extensive (open 6.30AM to 10.30AM). The only down-side is the hotels location, a ten to fifteen minute taxi ride away from the city centre, embassies and government buildings, but is located within a very quiet residential area.

Hotel reponse to the customer:
Dear Our Valuable Guest, Thank you for taking the time to leave us a review. In addition to the wonderful amenities that we now enjoy, the hotel has its many amenities, as there many others. In particular is a very efficient restaurant and a very well constructed and well done dining experience. Our experience in other parts of the country and across the world is very positive as well.

Our Customer satisfaction rating:




## Building a Transformer Architecture

### Exercise 2.1: PyTorch Transformer
PyTorch's nn.Transformer class provides a full transformer architecture with pre-built encoder and decoder stacks.

The simplest way to manually create a skeleton nn.Transformer model is by specifying its main structural hyperparameters: model dimensionality (embedding size), number of attention heads, number of encoder layers, and number of decoder layers. PyTorch does the rest of the job for you, assigning default modules inside the encoder and decoder layers.

In [None]:
# Set transformer model hyperparameters
d_model = 512
n_heads = 8
num_encoder_layers = 6
num_decoder_layers = 6

# Create the transformer model and assign hyperparameters
model = nn.Transformer(
    d_model=d_model,
    nhead=n_heads,
    num_encoder_layers=num_encoder_layers,

    num_decoder_layers=num_decoder_layers
)

print(model)

### Hands-on positional encoding
In this exercise you'll complete the class implementation for a positional encoding mechanism.

The necessary imports have been done for you, namely import torch.nn as nn.

#### Instructions
* Specify the PyTorch class that the positional encoder should subclass from.
* Initialize a positional encoding matrix for token positions in sequences up to max_length.
* Assign unique position encodings to the matrix pe by alternating the use of sine and cosine functions.
* Update the input embeddings tensor x to add position information about the sequence using the positional encodings matrix.

In [9]:
# Subclass an appropriate PyTorch class
class PositionalEncoder(nn.Module):
    def __init__(self, d_model, max_length):
        super(PositionalEncoder, self).__init__()
        self.d_model = d_model
        self.max_length = max_length

        # Initialize the positional encoding matrix
        pe = torch.zeros(max_length, d_model)

        position = torch.arange(0, max_length, dtype=torch.float).unsqueeze(1)
        div_term = torch.exp(torch.arange(0, d_model, 2, dtype=torch.float) * -(math.log(10000.0) / d_model))

        # Calculate and assign position encodings to the matrix
        pe[:, 0::2] = torch.sin(position * div_term)
        pe[:, 1::2] = torch.cos(position * div_term)
        pe = pe.unsqueeze(0)
        self.register_buffer('pe', pe)

    # Update the embeddings tensor adding the positional encodings
    def forward(self, x):
        x = x + self.pe[:, :x.size(1)]
        return x

### Implementing multi-headed self-attention
Now it's the turn of the multi-headed self-attention mechanism implementation.

Besides the necessary imports, including this time torch.nn.functional as F, the __init__() method is also provided.
#### Instructions
* Split the sequence embeddings x across the multiple attention heads.
* Compute dot-product based attention scores between the project query and key.
* Normalize the attention scores to obtain attention weights.
* Multiply the attention weights by the values and linearly transform the concatenated outputs per head.

In [10]:
class MultiHeadAttention(nn.Module):
    def __init__(self, d_model, num_heads):
        super(MultiHeadAttention, self).__init__()
        self.num_heads = num_heads
        self.d_model = d_model
        self.head_dim = d_model // num_heads

        self.query_linear = nn.Linear(d_model, d_model)
        self.key_linear = nn.Linear(d_model, d_model)
        self.value_linear = nn.Linear(d_model, d_model)
        self.output_linear = nn.Linear(d_model, d_model)
    def split_heads(self, x, batch_size):
        # Split the sequence embeddings in x across the attention heads
        x = x.view(batch_size, -1, self.num_heads, self.head_dim)
        return x.permute(0, 2, 1, 3).contiguous().view(batch_size * self.num_heads, -1, self.head_dim)

    def compute_attention(self, query, key, mask=None):
        # Compute dot-product attention scores
        scores = torch.matmul(query, key.permute(1, 2, 0))
        if mask is not None:
            scores = scores.masked_fill(mask == 0, float("-1e20"))
        # Normalize attention scores into attention weights
        attention_weights = F.softmax(scores, dim=-1)
        return attention_weights

    def forward(self, query, key, value, mask=None):
        batch_size = query.size(0)

        query = self.split_heads(self.query_linear(query), batch_size)
        key = self.split_heads(self.key_linear(key), batch_size)
        value = self.split_heads(self.value_linear(value), batch_size)

        attention_weights = self.compute_attention(query, key, mask)

        # Multiply attention weights by values and linearly project concatenated outputs
        output = torch.matmul(attention_weights, value)
        output = output.view(batch_size, self.num_heads, -1, self.head_dim).permute(0, 2, 1, 3).contiguous().view(batch_size, -1, self.d_model)
        return self.output_linear(output)

### Post-attention feed-forward layer
Let's assemble some of the pieces of an encoder transformer, starting with the feed-forward sublayer that follows multi-headed self-attention in every encoder layer.

#### Instructions
* Specify in the __init__() method the sizes of the two linear fully connected layers.
* Apply a forward pass through the two linear layers, using the ReLU() activation in between.

In [11]:
class FeedForwardSubLayer(nn.Module):
    # Specify the two linear layers' input and output sizes
    def __init__(self, d_model, d_ff):
        super(FeedForwardSubLayer, self).__init__()
        self.fc1 = nn.Linear(d_model, d_ff)
        self.fc2 = nn.Linear(d_ff, d_model)
        self.relu = nn.ReLU()

    # Apply a forward pass
    def forward(self, x):
        return self.fc2(self.relu(self.fc1(x)))

### encoder layer
You've made it quite far in building your own skeleton transformer architecture! Now you are ready to assemble a full encoder layer containing:

* A multi-headed self-attention mechanism.
* A feed-forward sublayer.
* A combined layer normalization and dropout to be applied after each of the above two stages.
* Complete the implementation of the EncoderLayer class to initialize all its inner elements one by one.

In [12]:
# Complete the initialization of elements in the encoder layer
class EncoderLayer(nn.Module):
    def __init__(self, d_model, num_heads, d_ff, dropout):
        super(EncoderLayer, self).__init__()
        self.self_attn = MultiHeadAttention(d_model, num_heads)
        self.feed_forward = FeedForwardSubLayer(d_model, d_ff)
        self.norm1 = nn.LayerNorm(d_model)
        self.norm2 = nn.LayerNorm(d_model)
        self.dropout = nn.Dropout(dropout)

    def forward(self, x, mask):
        attn_output = self.self_attn(x, x, x, mask)
        x = self.norm1(x + self.dropout(attn_output))
        ff_output = self.feed_forward(x)
        return self.norm2(x + self.dropout(ff_output))

### Encoder transformer body and head
Almost there! Now that the encoder layer implementation has been completed, all that remains is:

Implementing the transformer body, namely a stack of multiple encoder layers.
Appending a task-specific transformer head to process the encoder's resulting hidden states and produce the final outputs for the language task at hand!
#### Instructions
* Define a stack of multiple encoder layers in the __init__() method.
* Complete the forward() method. Note that the process starts by converting the original sequence tokens in x into embeddings.
* Add final linear layer to project encoder results into raw classification outputs.
* Apply the necessary function to map raw classification outputs into log class probabilities.

In [13]:
class TransformerEncoder(nn.Module):
    def __init__(self, vocab_size, d_model, num_layers, num_heads, d_ff, dropout, max_sequence_length):
        super(TransformerEncoder, self).__init__()
        self.embedding = nn.Embedding(vocab_size, d_model)
        self.positional_encoding = PositionalEncoder(d_model, max_sequence_length)
        # Define a stack of multiple encoder layers
        self.layers = nn.ModuleList([EncoderLayer(d_model, num_heads, d_ff, dropout) for _ in range(num_layers)])

    # Complete the forward pass method
    def forward(self, x, mask):
        x = self.embedding(x)
        x = self.positional_encoding(x)
        for layer in self.layers:
            x = layer(x, mask)
        return x

class ClassifierHead(nn.Module):
    def __init__(self, d_model, num_classes):
        super(ClassifierHead, self).__init__()
        # Add linear layer for multiple-class classification
        self.fc = nn.Linear(d_model, num_classes)

    def forward(self, x):
        logits = self.fc(x[:, 0, :])
        # Obtain log class probabilities upon raw outputs
        return F.log_softmax(logits, dim=-1)

### Testing the encoder transformer
In this exercise, you'll practice creating some instructions to pass an example random sequence throughout the encoder transformer you just defined to obtain and print the classification output. The following variables and model hyperparameters are defined for you:
```
num_classes = 3
vocab_size = 10000
batch_size = 8
d_model = 512
num_heads = 8
num_layers = 6
d_ff = 2048
sequence_length = 256
dropout = 0.1
```
The PositionalEncoder, MultiHeadAttention, FeedForwardSublayer,EncoderLayer, TransformerEncoder, and ClassifierHead classes are also implemented.

Note: although a random input sequence and mask are being used here, in practice, the mask should correspond to the actual location of padding tokens in the input sequences to ensure all of them are the same length.

#### Instructions
* Instantiate the body and head of the encoder transformer.
* Complete the forward pass throughout the entire transformer body and head to obtain and print classification outputs.

### Building a decoder body and head
Time to design a high-level architecture for a decoder-only transformer! On this occasion, instead of building the model body and the model head in two separate classes, the model head will be incorporated as part of the model body class that contains the stack of decoder layers.

As usual, the necessary imports for this exercise have been done for you.

#### Instructions
* Add the linear layer for the model head inside the TransformerDecoder class.
* Apply the last stage of the forward pass, through the model head.

In [14]:
class TransformerDecoder(nn.Module):
    def __init__(self, vocab_size, d_model, num_layers, num_heads, d_ff, dropout, max_sequence_length):
        super(TransformerDecoder, self).__init__()
        self.embedding = nn.Embedding(vocab_size, d_model)
        self.positional_encoding = PositionalEncoder(d_model, max_sequence_length)
        self.layers = nn.ModuleList([DecoderLayer(d_model, num_heads, d_ff, dropout) for _ in range(num_layers)])

        # Add a linear layer (head) for next-word prediction
        self.fc = nn.Linear(d_model, vocab_size)

    def forward(self, x, self_mask):
        x = self.embedding(x)
        x = self.positional_encoding(x)
        for layer in self.layers:
            x = layer(x, self_mask)

        # Apply the forward pass through the model head
        x = self.fc(x)
        return F.log_softmax(x, dim=-1)

### Testing the decoder transformer
In this exercise, you'll practice creating some instructions to pass an example random sequence throughout a decoder transformer architecture to obtain outputs in the form of next-token probabilities across the vocabulary.

The following variables and model hyperparameters are defined for you:


The PositionalEncoder, MultiHeadAttention, PositionWiseFeedForward,DecoderLayer, and TransformerDecoder classes are also implemented, the last of which integrates the model body and head.

#### Instructions
Create a triangular mask for enabling causal attention so that every token in the sequence only attends to the previous ones on its left-hand side.
Instantiate the decoder transformer model.

### Incorporating cross-attention in a decoder
In an encoder-decoder transformer, decoder layers incorporate two attention mechanisms: the causal attention inherent to any transformer decoder, plus a cross-attention that integrates source sequence information processed by the encoder with the target sequence information being processed through the decoder.

In this exercise you'll modify the DecoderLayer class to incorporate this twofold attention scheme.

#### Instructions
* Initialize the two attention mechanisms used in an encoder-decoder transformers' decoder layer: causal (masked) self-attention and cross-attention.
* Pass the necessary input arguments (query, key, values, and mask) to the two attention stages in the forward pass.

In [15]:
class DecoderLayer(nn.Module):
    def __init__(self, d_model, num_heads, d_ff, dropout):
        super(DecoderLayer, self).__init__()

        # Initialize the causal (masked) self-attention and cross-attention
        self.self_attn = MultiHeadAttention(d_model, num_heads)
        self.cross_attn = MultiHeadAttention(d_model, num_heads)
        self.feed_forward = FeedForwardSubLayer(d_model, d_ff)
        self.norm1 = nn.LayerNorm(d_model)
        self.norm2 = nn.LayerNorm(d_model)
        self.norm3 = nn.LayerNorm(d_model)
        self.dropout = nn.Dropout(dropout)

    def forward(self, x, causal_mask, encoder_output, cross_mask):
        # Pass the necessary arguments to the causal self-attention and cross-attention
        self_attn_output = self.self_attn(x, x, x, causal_mask)
        x = self.norm1(x + self.dropout(self_attn_output))
        cross_attn_output = self.cross_attn(x, encoder_output, encoder_output, cross_mask)
        x = self.norm2(x + self.dropout(cross_attn_output))
        ff_output = self.feed_forward(x)
        x = self.norm3(x + self.dropout(ff_output))
        return x

### Trying out an encoder-decoder transformer
Your next task is complete the following piece of code to define and forward-pass an example batch of randomly generated input sequences through an encoder-decoder transformer.

Remember that we are only testing a yet-to-be-trained transformer architecture, hence the use of random input sequences.

These are the model hyperparameters and variables used:
```
vocab_size = 10000
batch_size = 16
d_model = 512
num_heads = 8
num_layers = 6
d_ff = 2048
sequence_length = 64
dropout = 0.1
```
The example assumes the necessary imports and the following transformer architecture classes have been defined for you: MultiHeadAttention, FeedForwardSubLayer, PositionalEncoding, EncoderLayer, DecoderLayer, TransformerEncoder, TransformerDecoder, and ClassifierHead.

#### Instructions

* Create a batch of random input sequences of size batch_size X sequence_length.
* Instantiate the two transformer bodies using the appropriate class names.
* Pass the necessary masks as arguments to the encoder and the decoder for their underlying attention mechanisms; each mask argument should be added in the same order they are utilized inside the encoder or decoder layer.

In [16]:
vocab_size = 10000
batch_size = 16
d_model = 512
num_heads = 8
num_layers = 6
d_ff = 2048
sequence_length = 128
dropout = 0.1


# Create a batch of random input sequences
input_sequence = torch.randint(0, vocab_size, (batch_size, sequence_length))
padding_mask = torch.randint(0, 2, (sequence_length, sequence_length))
causal_mask = torch.triu(torch.ones(sequence_length, sequence_length), diagonal=1)

# Instantiate the two transformer bodies
encoder = TransformerEncoder(vocab_size, d_model, num_layers, num_heads, d_ff, dropout, max_sequence_length=sequence_length)
decoder = TransformerDecoder(vocab_size, d_model, num_layers, num_heads, d_ff, dropout, max_sequence_length=sequence_length)

# Pass the necessary masks as arguments to the encoder and the decoder
encoder_output = encoder(input_sequence, padding_mask)
#decoder_output = decoder(input_sequence, causal_mask, encoder_output, padding_mask)
#print("Batch's output shape: ", decoder_output.shape)

### Transformer assembly bottom-up
This exercise focuses on putting together the main building blocks of an encoder-only transformer architecture, using a bottom-up approach.

The following classes, their attributes, and their core functions have been defined for you:

PositionalEncoding(nn.Module): positional encoding for input embeddings.

MultiHeadAttention(nn.Module): multi-head attention layer.

FeedForward(nn.Module): feed-forward layer.

EncoderLayer(nn.Module): a replicable encoder layer that glues together multi-head attention and feed-forward layers, along with layer normalizations and dropouts.

Your next task is to finalize assembling the highest-level components of the encoder transformer: the TransformerEncoder and Transformer classes.

#### Instructions
* Initialize a positional encoding layer for the initial sequence processing of the TransformerEncoder class, as well as a stack of num_layers encoder layers.
* Complete the implementation of the encoder stack's .forward() method by iteratively passing the processed sequence through the stacked encoder layers.
* Add the whole stack of components and layers into a Transformer class object: You'll need to initialize an attribute containing the whole encoder stack.

In [17]:
# Initialize positional encoding layer and stack of EncoderLayer modules
class TransformerEncoder(nn.Module):
    def __init__(self, vocab_size, d_model, num_heads, num_layers, d_ff, max_seq_len, dropout):
        super(TransformerEncoder, self).__init__()
        self.embedding = nn.Embedding(vocab_size, d_model)
        self.positional_encoding = PositionalEncoding(d_model, max_seq_len)
        self.layers = nn.ModuleList([EncoderLayer(d_model, num_heads, d_ff, dropout) for _ in range(num_layers)])
        self.dropout = nn.Dropout(dropout)

    def forward(self, x, mask):
        x = self.embedding(x)
        x = self.positional_encoding(x)
        x = self.dropout(x)

        # Pass the sequence through each layer in the encoder
        for layer in self.layers:
            x = layer(x, mask)

        return x

class Transformer(nn.Module):
    def __init__(self, vocab_size, d_model, num_heads, num_layers, d_ff, max_seq_len, dropout):
        super(Transformer, self).__init__()
        # Initialize the encoder stack of the Transformer
        self.encoder = TransformerEncoder(vocab_size, d_model, num_heads, num_layers, d_ff, max_seq_len, dropout)

    def forward(self, src, src_mask):
        encoder_output = self.encoder(src, src_mask)
        return encoder_output

## Harnessing Pre-trained LLMs

### Classifying two movie opinions
We have seen how to pass one example sequence to a pre-trained text classification LLM for inference. In this exercise you will practice passing two example sequences simultaneously, describing two rather opposite opinions of a movie.

All the necessary imports have been made for you, including the auto classes specific to using pre-trained classification LLMs. The variable model_name has been also set with the name of the BERT-based model to use: "textattack/distilbert-base-uncased-SST-2".

#### Instructions
* Use the necessary task-specific classes and methods to load the tokenizer and pre-trained model.
* Tokenize the inputs and pass them to the LLM to perform classification inference.

In [18]:
model_name = "textattack/distilbert-base-uncased-SST-2"

# Load the tokenizer and pre-trained model
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForSequenceClassification.from_pretrained(
  model_name, num_labels=2)

text = ["The best movie I've ever watched!", "What an awful movie. I regret watching it."]

# Tokenize inputs and pass them to the model for inference
inputs = tokenizer(text, return_tensors="pt", padding=True)
outputs = model(**inputs)
logits = outputs.logits

predicted_classes = torch.argmax(logits, dim=1).tolist()
for idx, predicted_class in enumerate(predicted_classes):
    print(f"Predicted class for \"{text[idx]}\": {predicted_class}")

Predicted class for "The best movie I've ever watched!": 1
Predicted class for "What an awful movie. I regret watching it.": 0


### Summarizing a product opinion
In this text summarization exercise, we will examine different aspects of the "opinosis" dataset containing product reviews and summaries, as well as showing an example input sequence and its generated summarization.

The necessary imports have been made for you, including the AutoTokenizer class and the specific auto class for handling sequence-to-sequence models: AutoModelForSeq2SeqLM.

#### Instructions
* Display the names of the features in the data, by accessing the downloaded 'train' fold.
* Use the necessary variables and methods to encode the input example, pass it to the model to generate a summary, and decode the summary.

In [19]:
dataset = load_dataset("opinosis")
model_name = "t5-small"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForSeq2SeqLM.from_pretrained(model_name)

print(f"Number of instances: {len(dataset['train'])}")

# Show the names of features in the training fold of the dataset
print(f"Feature names: {dataset['train'].column_names}")

# Encode the input example, obtain the summary, and decode it
example = dataset['train'][-2]['review_sents']
input_ids = tokenizer.encode("summarize: " + example, return_tensors="pt", max_length=512, truncation=True)
summary_ids = model.generate(input_ids, max_length=150)
summary = tokenizer.decode(
  summary_ids[0], skip_special_tokens=True)

print("\nOriginal Text (first 400 characters): \n", example[:400])
print("\nGenerated Summary: \n", summary)

Number of instances: 51
Feature names: ['review_sents', 'summaries']

Original Text (first 400 characters): 
 I bought the 8, gig Ipod Nano that has the built, in video camera .
  Itunes has an on, line store, where you may purchase and download music and videos which will install onto the ipod .
I have lots of music cd's and dvd's, so currently I'm just interested in storing some of my music and videos on the ipod so I can enjoy them on my vacation, and while at work .
There's a right way and wrong wa

Generated Summary: 
 I bought the 8, gig Ipod Nano that has the built, in video camera. Itunes has an on, line store, where you may purchase and download music and videos which will install onto the ipod.


### The Spanish phrasebook mission
You are a content writer at a reputable travel guide publisher. The next title to be published is a Spain travel guide for English speakers, but due to high demand and limited human resources, they assigned you the urgent task of drafting a "Spanish phrasebook" page, covering some essential survival Spanish words and phrases.

Luckily, LLMs are here to help! In this exercise, you'll try using a pre-trained LLM for English-to-Spanish translation, and start this important mission by translating the first five common English phrases into Spanish.

#### Instructions
* Use the appropriate task-specific classes and methods to load the tokenizer and the model (the classes needed have been already imported for you, as usual!).
* Complete the instructions to encode the input sequences, generate translations, and decode them. For encodings, use an extra argument to return them as PyTorch tensors.

In [20]:
model_name = "Helsinki-NLP/opus-mt-en-es"

# Load the tokenizer and the model checkpoint
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForSeq2SeqLM.from_pretrained(model_name)

english_inputs = ["Hello", "Thank you", "How are you?", "Sorry", "Goodbye"]

# Encode the inputs, generate translations, decode, and print them
for english_input in english_inputs:
    input_ids = tokenizer.encode(english_input, return_tensors="pt")
    translated_ids = model.generate(input_ids)
    translated_text = tokenizer.decode(translated_ids[0], skip_special_tokens=True)
    print(f"English: {english_input} | Spanish: {translated_text}")

English: Hello | Spanish: Hola.
English: Thank you | Spanish: Gracias.
English: How are you? | Spanish: ¿Cómo estás?
English: Sorry | Spanish: Lo siento.
English: Goodbye | Spanish: Adiós.


### Load and inspect a QA dataset
In this exercise, you will load a dataset for extractive QA, inspect some data, and tokenize a question-context example into a suitable format for feeding it to an LLM for QA.

The necessary libraries, classes, and functions have been imported for you.

#### Instructions
* Load the dataset "xtreme" and subset "MLQA.en.en" using the variables already defined.
* Initialize tokenizer using the "deepset/minilm-uncased-squad2" model checkpoint.
Tokenize the example question and context retrieved, ensuring the results are returned as PyTorch tensors.

In [21]:
# Load a specific subset of the dataset
mlqa = load_dataset("xtreme", name="MLQA.en.en")

question = mlqa["test"]["question"][0]
context = mlqa["test"]["context"][0]
print("Question: ", question)
print("Context: ", context)

# Initialize the tokenizer using the model checkpoint
tokenizer = AutoTokenizer.from_pretrained("deepset/minilm-uncased-squad2")

# Tokenize the inputs returning the result as tensors
inputs = tokenizer(question, context, return_tensors="pt")
print("First five encoded tokens: ", inputs["input_ids"][0][:5])

Question:  Who analyzed the biopsies?
Context:  In 1994, five unnamed civilian contractors and the widows of contractors Walter Kasza and Robert Frost sued the USAF and the United States Environmental Protection Agency. Their suit, in which they were represented by George Washington University law professor Jonathan Turley, alleged they had been present when large quantities of unknown chemicals had been burned in open pits and trenches at Groom. Biopsies taken from the complainants were analyzed by Rutgers University biochemists, who found high levels of dioxin, dibenzofuran, and trichloroethylene in their body fat. The complainants alleged they had sustained skin, liver, and respiratory injuries due to their work at Groom, and that this had contributed to the deaths of Frost and Kasza. The suit sought compensation for the injuries they had sustained, claiming the USAF had illegally handled toxic materials, and that the EPA had failed in its duty to enforce the Resource Conservation a

### Calculating accuracy
In this exercise you will use a sentiment classification pipeline to classify four short reviews with known labels, and then calculate the accuracy of predictions using the evaluate library.

The necessary imports have been made for you. The test_examples variable contains the text reviews and their ground-truth labels:



#### Instructions
* Pass a list containing the four input reviews to the sentiment classification pipeline.
* Load the accuracy score metric from the evaluate library

In [22]:
test_examples = [
    {"text": "I am making a good use of this product!", "label": 1},
    {"text": "The service was disappointing.", "label": 0},
    {"text": "I learned a lot from this book.", "label": 1},
    {"text": "The book cover broke after two days of use.", "label": 0},
]
sentiment_analysis = pipeline("sentiment-analysis")

# Pass the four input texts (without labels) to the pipeline
predictions = sentiment_analysis([example["text"] for example in test_examples])

true_labels = [example["label"] for example in test_examples]
predicted_labels = [1 if pred["label"] == "POSITIVE" else 0 for pred in predictions]

# Load the accuracy metric
accuracy = evaluate.load("accuracy")

result = accuracy.compute(references=true_labels, predictions=predicted_labels)
print(result)


# Load the accuracy, precision, recall and F1 score .metrics
accuracy = evaluate.load("accuracy")
precision = evaluate.load("precision")
recall = evaluate.load("recall")
f1 = evaluate.load("f1")

# Obtain a description of each metric
print(accuracy.description)
print(precision.description)
print(recall.description)
print(f1.description)

test_examples = [
    "Fantastic hotel, exceeded expectations!",
    "Quiet despite central location, great stay.",
    "Friendly staff, welcoming atmosphere.",
    "Spacious, comfy room—a perfect retreat.",
    "Cleanliness could improve, overall decent stay.",
      "Disappointing stay, noisy and unclean room.",
    "Terrible service, unfriendly staff, won't return."
]
test_labels = [1, 1, 1, 1, 0, 0, 0]

# Pass the examples to the pipeline, and obtain a list of predicted labels
sentiment_analysis = pipeline("sentiment-analysis")
predictions = sentiment_analysis([example for example in test_examples])
predicted_labels = [1 if pred["label"] == "POSITIVE" else 0 for pred in predictions]

# Compute the metrics by comparing real and predicted labels
print(precision.compute(references=test_labels, predictions=predicted_labels))
print(recall.compute(references=test_labels, predictions=predicted_labels))
print(f1.compute(references=test_labels, predictions=predicted_labels))

No model was supplied, defaulted to distilbert-base-uncased-finetuned-sst-2-english and revision af0f99b (https://huggingface.co/distilbert-base-uncased-finetuned-sst-2-english).
Using a pipeline without specifying a model name and revision in production is not recommended.


{'accuracy': 1.0}


No model was supplied, defaulted to distilbert-base-uncased-finetuned-sst-2-english and revision af0f99b (https://huggingface.co/distilbert-base-uncased-finetuned-sst-2-english).
Using a pipeline without specifying a model name and revision in production is not recommended.



Accuracy is the proportion of correct predictions among the total number of cases processed. It can be computed with:
Accuracy = (TP + TN) / (TP + TN + FP + FN)
 Where:
TP: True positive
TN: True negative
FP: False positive
FN: False negative


Precision is the fraction of correctly labeled positive examples out of all of the examples that were labeled as positive. It is computed via the equation:
Precision = TP / (TP + FP)
where TP is the True positives (i.e. the examples correctly labeled as positive) and FP is the False positive examples (i.e. the examples incorrectly labeled as positive).


Recall is the fraction of the positive examples that were correctly labeled by the model as positive. It can be computed with the equation:
Recall = TP / (TP + FN)
Where TP is the true positives and FN is the false negatives.


The F1 score is the harmonic mean of the precision and recall. It can be computed with the equation:
F1 = 2 * (precision * recall) / (precision + recall)

{'precision': 

### Perplexed about 2030
This exercise gives you the chance to generate some text and calculate its perplexity, based on the following prompt:

#### Instructions
Encode the text prompt, pass it to the GPT2 model for text generation, and decode the generated text.
Load and compute the mean perplexity score on the generated text.

In [23]:
# Define the model name
model_name = "gpt2"

# Initialize the tokenizer
tokenizer = AutoTokenizer.from_pretrained(model_name)

# Initialize the model
model = AutoModelForCausalLM.from_pretrained(model_name)

prompt = "Current trends show that by 2030 "

# Encode the prompt, generate text and decode it
prompt_ids = tokenizer.encode(prompt, return_tensors="pt")
output = model.generate(prompt_ids, max_length=20)
generated_text = tokenizer.decode(
  output[0], skip_special_tokens=True)

print("Generated Text: ", generated_text)

# Load and compute the perplexity score
perplexity = evaluate.load("perplexity", module_type="metric")
results = perplexity.compute(model_id='gpt2',
                             predictions=generated_text)
print("Perplexity: ", results['mean_perplexity'])

The attention mask and the pad token id were not set. As a consequence, you may observe unexpected behavior. Please pass your input's `attention_mask` to obtain reliable results.
Setting `pad_token_id` to `eos_token_id`:50256 for open-end generation.


Generated Text:  Current trends show that by 2030  the number of people living in poverty will be at its lowest


  0%|          | 0/6 [00:00<?, ?it/s]

Perplexity:  3514.5176167589552


## Evaluating and Leveraging LLMs in the Real World


In [24]:
# Load the rouge metric
rouge = evaluate.load("rouge")

predictions = ["""Pluto is a dwarf planet in our solar system, located in the Kuiper Belt beyond Neptune, and was formerly considered the ninth planet until its reclassification in 2006."""]
references = ["""Pluto is a dwarf planet in the solar system, located in the Kuiper Belt beyond Neptune, and was previously deemed as a planet until it was reclassified in 2006."""]

# Calculate the rouge scores between the predicted and reference summaries
results = rouge.compute(predictions=predictions, references=references)
print("ROUGE results: ", results)

meteor = evaluate.load("meteor")

llm_outputs = ["He thought it right and necessary to become a knight-errant, roaming the world in armor, seeking adventures and practicing the deeds he had read about in chivalric tales."]
references = ["He believed it was proper and essential to transform into a knight-errant, traveling the world in armor, pursuing adventures, and enacting the heroic deeds he had encountered in tales of chivalry."]

# Compute and print the METEOR score
results = meteor.compute(predictions=llm_outputs, references=references)
print("Meteor: ", results['meteor'])


exact_match = evaluate.load("exact_match")

predictions = ["The cat sat on the mat.", "Theaters are great.", "It's like comparing oranges and apples."]
references = ["The cat sat on the mat?", "Theaters are great.", "It's like comparing apples and oranges."]

# Compute the exact match and print the results
results = exact_match.compute(references=references, predictions=predictions)
print("EM results: ", results)


ROUGE results:  {'rouge1': 0.7719298245614034, 'rouge2': 0.6181818181818182, 'rougeL': 0.736842105263158, 'rougeLsum': 0.736842105263158}


[nltk_data] Downloading package wordnet to /root/nltk_data...
[nltk_data]   Package wordnet is already up-to-date!
[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Package punkt is already up-to-date!
[nltk_data] Downloading package omw-1.4 to /root/nltk_data...
[nltk_data]   Package omw-1.4 is already up-to-date!


Meteor:  0.5350702240481536
EM results:  {'exact_match': 0.3333333333333333}


### BLEU-proof translations
Let's get familiar with the BLEU translation metric.

A pipeline based on the Helsinki-NLP Spanish-English translation model and the BLEU metric has been loaded for you, using evaluate.load("bleu") from the evaluate library.

#### Instructions
Pass the input sentence in input_sentence_1 to the translator, then calculate the BLEU metric using reference_1.

In [25]:
bleu = evaluate.load("bleu")

input_sentence_1 = "Hola, ¿cómo estás?"

reference_1 = [
     ["Hello, how are you?", "Hi, how are you?"]
     ]

input_sentences_2 = ["Hola, ¿cómo estás?", "Estoy genial, gracias."]

references_2 = [
     ["Hello, how are you?", "Hi, how are you?"],
     ["I'm great, thanks.", "I'm great, thank you."]
     ]

translator = pipeline("translation", model="Helsinki-NLP/opus-mt-es-en")

# Translate the first input sentence
translated_output = translator(input_sentence_1)

translated_sentence = translated_output[0]['translation_text']

print("Translated:", translated_sentence)

# Calculate BLEU metric
results = bleu.compute(predictions=[translated_sentence], references=reference_1)
print(results)


# Translate the input sentences, extract the translated text, and compute BLEU score
translator = pipeline("translation", model="Helsinki-NLP/opus-mt-es-en")

translated_outputs = translator(input_sentences_2)

predictions = [translated_output['translation_text'] for translated_output in translated_outputs]
print(predictions)

results = bleu.compute(predictions=predictions, references=references_2)
print(results)



Translated: Hey, how are you?
{'bleu': 0.7598356856515925, 'precisions': [0.8333333333333334, 0.8, 0.75, 0.6666666666666666], 'brevity_penalty': 1.0, 'length_ratio': 1.0, 'translation_length': 6, 'reference_length': 6}
['Hey, how are you?', "I'm great, thanks."]
{'bleu': 0.8627788640890415, 'precisions': [0.9090909090909091, 0.8888888888888888, 0.8571428571428571, 0.8], 'brevity_penalty': 1.0, 'length_ratio': 1.0, 'translation_length': 11, 'reference_length': 11}


### Setting up an RLHF loop
The Proximal Policy Optimization (PPO) algorithm is popularly used in Reinforcement Learning from Human Feedback (RLHF) loops to fine-tune an LLM. The algorithm facilitates the iterative updating of model parameters based on a reward model derived from human feedback, ensuring the model's behavior is adapted predicated on human preferences.

In this example, you will set up a simple RLHF loop based on PPO and a "dummy" reward model.

#### Instructions
* Instantiate a reference LLM to be used in the optimization process.
* Initialize a trainer configuration object assigning it to ppo_config.
* Create a PPOTrainer instance, assigning it the required arguments.
* Train the LLM for one step using the PPO instance.

In [26]:
model = AutoModelForCausalLMWithValueHead.from_pretrained('sshleifer/tiny-gpt2')

# Instantiate a reference model
model_ref = create_reference_model(model)

tokenizer = AutoTokenizer.from_pretrained('sshleifer/tiny-gpt2')

if tokenizer._pad_token is None:
    tokenizer.add_special_tokens({'pad_token': '[PAD]'})

# Initialize trainer configuration
ppo_config = PPOConfig(mini_batch_size = 1, batch_size=1)

prompt = "Next year, I "
input = tokenizer.encode(prompt, return_tensors="pt")
response  = respond_to_batch(model, input)

# Create a PPOTrainer instance
ppo_trainer = PPOTrainer(ppo_config, model, model_ref, tokenizer)
reward = [torch.tensor(1.0)]

# Train LLM for one step with PPO
train_stats = ppo_trainer.step([input[0]], [response[0]], reward)

print(train_stats)

You're using a GPT2TokenizerFast tokenizer. Please note that with a fast tokenizer, using the `__call__` method is faster than using a method to encode the text followed by a call to the `pad` method to get a padded encoding.


{'objective/kl': 0.0, 'objective/kl_dist': 0.0, 'objective/logprobs': array([[-10.803115 , -10.834713 , -10.795107 , -10.808006 , -10.808904 ,
        -10.826435 , -10.831235 , -10.821535 , -10.797018 , -10.794669 ,
        -10.85922  , -10.858709 , -10.852925 , -10.786781 , -10.841546 ,
        -10.8537445, -10.887673 , -10.8651905, -10.833255 , -10.840333 ,
        -10.85099  , -10.822063 , -10.795376 , -10.833819 ]],
      dtype=float32), 'objective/ref_logprobs': array([[-10.803115 , -10.834713 , -10.795107 , -10.808006 , -10.808904 ,
        -10.826435 , -10.831235 , -10.821535 , -10.797018 , -10.794669 ,
        -10.85922  , -10.858709 , -10.852925 , -10.786781 , -10.841546 ,
        -10.8537445, -10.887673 , -10.8651905, -10.833255 , -10.840333 ,
        -10.85099  , -10.822063 , -10.795376 , -10.833819 ]],
      dtype=float32), 'objective/kl_coef': 0.2, 'objective/entropy': 216.6614227294922, 'ppo/mean_non_score_reward': 0.0, 'ppo/mean_scores': 1.0, 'ppo/std_scores': nan, 'toke

### Toxic employee reviews?
You have just joined a new company as a team lead. Two of your team members send thorough employee reviews on each other. To have a first, quick glimpse, you ask a pre-trained summarization LLM for help to get some concise points about each employee, as shown below:

Your task is to carefully assess the toxicity level of these suggested responses.

#### Instructions
* Calculate the individual toxicity of each sequence, the maximum toxicity and toxicity ratio per employee.

In [27]:
emp_1 = ["Everyone in the team adores him",
           "He is a true genius, pure talent"]
emp_2 = ["Nobody in the team likes him",
           "He is a useless 'good-for-nothing'"]

toxicity_metric = evaluate.load("toxicity")

# Calculate the individual toxicities, maximum toxicities, and toxicity ratios
toxicity_1 = toxicity_metric.compute(predictions=emp_1)
toxicity_2 = toxicity_metric.compute(predictions=emp_2)
print("Toxicities (emp. 1):", toxicity_1['toxicity'])
print("Toxicities (emp. 2): ", toxicity_2['toxicity'])

toxicity_1_max = toxicity_metric.compute(predictions=emp_1, aggregation="maximum")
toxicity_2_max = toxicity_metric.compute(predictions=emp_2, aggregation="maximum")
print("Maximum toxicity (emp. 1):", toxicity_1_max['max_toxicity'])
print("Maximum toxicity (emp. 2): ", toxicity_2_max['max_toxicity'])

toxicity_1_ratio = toxicity_metric.compute(predictions=emp_1, aggregation="ratio")
toxicity_2_ratio = toxicity_metric.compute(predictions=emp_2, aggregation="ratio")
print("Toxicity ratio (emp. 1):", toxicity_1_ratio['toxicity_ratio'])
print("Toxicity ratio (emp. 2): ", toxicity_2_ratio['toxicity_ratio'])



Toxicities (emp. 1): [0.0001386617950629443, 0.00013368591316975653]
Toxicities (emp. 2):  [0.00014245195779949427, 0.010071253404021263]
Maximum toxicity (emp. 1): 0.0001386617950629443
Maximum toxicity (emp. 2):  0.010071253404021263
Toxicity ratio (emp. 1): 0.0
Toxicity ratio (emp. 2):  0.0


### Best "regard"!
In this wrap-up exercise, you will use the regard metric to assess polarity in two lists of sample texts associated to two groups of employees in two branches of a company. Each branch is located in a different country and geographical region of the planet.

(It was reported a priori that due to naturally occurring cultural differences, both groups exhibit differences in the way they perform their work.)

#### Instructions
* Load the regard and regard-comparison metrics.
* Compute the regard (polarity scores) for each group separately.
* Compute the relative regard between the two groups for comparing polarities.

In [28]:
group1 = ['abc are described as loyal employees',
          'abc are ambitious in their career expectations']
group2 = ['abc are known for causing lots of team conflicts',
          'abc are verbally violent']

# Load the regard and regard-comparison metrics
regard = evaluate.load("regard")
regard_comp = evaluate.load("regard", "compare")

# Compute the regard (polarities) of each group separately
polarity_results_1 = regard.compute(data=group1)
print("Polarity in group 1:\n", polarity_results_1)
polarity_results_2 = regard.compute(data=group2)
print("Polarity in group 2:\n", polarity_results_2)

# Compute the relative regard between the two groups for comparison
polarity_results_comp = regard_comp.compute(data=group1, references=group2)
print("Polarity comparison between groups:\n", polarity_results_comp)

Polarity in group 1:
 {'regard': [[{'label': 'positive', 'score': 0.9098386764526367}, {'label': 'neutral', 'score': 0.059396952390670776}, {'label': 'other', 'score': 0.026468101888895035}, {'label': 'negative', 'score': 0.004296252969652414}], [{'label': 'positive', 'score': 0.7809812426567078}, {'label': 'neutral', 'score': 0.18085983395576477}, {'label': 'other', 'score': 0.030492952093482018}, {'label': 'negative', 'score': 0.007666013203561306}]]}
Polarity in group 2:
 {'regard': [[{'label': 'negative', 'score': 0.9658734202384949}, {'label': 'other', 'score': 0.021555885672569275}, {'label': 'neutral', 'score': 0.012026479467749596}, {'label': 'positive', 'score': 0.0005441228277049959}], [{'label': 'negative', 'score': 0.9774736166000366}, {'label': 'other', 'score': 0.012994581833481789}, {'label': 'neutral', 'score': 0.008945506066083908}, {'label': 'positive', 'score': 0.0005862844991497695}]]}
Polarity comparison between groups:
 {'regard_difference': {'positive': 0.8448447

## Vision Transformers

In [29]:
import numpy as np
import torch
import torch.nn as nn
from torch.nn import CrossEntropyLoss
from torch.optim import Adam
from torch.utils.data import DataLoader
from torchvision.datasets.mnist import MNIST
from torchvision.transforms import ToTensor
from tqdm import tqdm, trange

np.random.seed(0)
torch.manual_seed(0)


def patchify(images, n_patches):
    n, c, h, w = images.shape

    assert h == w, "Patchify method is implemented for square images only"

    patches = torch.zeros(n, n_patches**2, h * w * c // n_patches**2)
    patch_size = h // n_patches

    for idx, image in enumerate(images):
        for i in range(n_patches):
            for j in range(n_patches):
                patch = image[
                    :,
                    i * patch_size : (i + 1) * patch_size,
                    j * patch_size : (j + 1) * patch_size,
                ]
                patches[idx, i * n_patches + j] = patch.flatten()
    return patches


class MyMSA(nn.Module):
    def __init__(self, d, n_heads=2):
        super(MyMSA, self).__init__()
        self.d = d
        self.n_heads = n_heads

        assert d % n_heads == 0, f"Can't divide dimension {d} into {n_heads} heads"

        d_head = int(d / n_heads)
        self.q_mappings = nn.ModuleList(
            [nn.Linear(d_head, d_head) for _ in range(self.n_heads)]
        )
        self.k_mappings = nn.ModuleList(
            [nn.Linear(d_head, d_head) for _ in range(self.n_heads)]
        )
        self.v_mappings = nn.ModuleList(
            [nn.Linear(d_head, d_head) for _ in range(self.n_heads)]
        )
        self.d_head = d_head
        self.softmax = nn.Softmax(dim=-1)

    def forward(self, sequences):
        # Sequences has shape (N, seq_length, token_dim)
        # We go into shape    (N, seq_length, n_heads, token_dim / n_heads)
        # And come back to    (N, seq_length, item_dim)  (through concatenation)
        result = []
        for sequence in sequences:
            seq_result = []
            for head in range(self.n_heads):
                q_mapping = self.q_mappings[head]
                k_mapping = self.k_mappings[head]
                v_mapping = self.v_mappings[head]

                seq = sequence[:, head * self.d_head : (head + 1) * self.d_head]
                q, k, v = q_mapping(seq), k_mapping(seq), v_mapping(seq)

                attention = self.softmax(q @ k.T / (self.d_head**0.5))
                seq_result.append(attention @ v)
            result.append(torch.hstack(seq_result))
        return torch.cat([torch.unsqueeze(r, dim=0) for r in result])


class MyViTBlock(nn.Module):
    def __init__(self, hidden_d, n_heads, mlp_ratio=4):
        super(MyViTBlock, self).__init__()
        self.hidden_d = hidden_d
        self.n_heads = n_heads

        self.norm1 = nn.LayerNorm(hidden_d)
        self.mhsa = MyMSA(hidden_d, n_heads)
        self.norm2 = nn.LayerNorm(hidden_d)
        self.mlp = nn.Sequential(
            nn.Linear(hidden_d, mlp_ratio * hidden_d),
            nn.GELU(),
            nn.Linear(mlp_ratio * hidden_d, hidden_d),
        )

    def forward(self, x):
        out = x + self.mhsa(self.norm1(x))
        out = out + self.mlp(self.norm2(out))
        return out


class MyViT(nn.Module):
    def __init__(self, chw, n_patches=7, n_blocks=2, hidden_d=8, n_heads=2, out_d=10):
        # Super constructor
        super(MyViT, self).__init__()

        # Attributes
        self.chw = chw  # ( C , H , W )
        self.n_patches = n_patches
        self.n_blocks = n_blocks
        self.n_heads = n_heads
        self.hidden_d = hidden_d

        # Input and patches sizes
        assert (
            chw[1] % n_patches == 0
        ), "Input shape not entirely divisible by number of patches"
        assert (
            chw[2] % n_patches == 0
        ), "Input shape not entirely divisible by number of patches"
        self.patch_size = (chw[1] / n_patches, chw[2] / n_patches)

        # 1) Linear mapper
        self.input_d = int(chw[0] * self.patch_size[0] * self.patch_size[1])
        self.linear_mapper = nn.Linear(self.input_d, self.hidden_d)

        # 2) Learnable classification token
        self.class_token = nn.Parameter(torch.rand(1, self.hidden_d))

        # 3) Positional embedding
        self.register_buffer(
            "positional_embeddings",
            get_positional_embeddings(n_patches**2 + 1, hidden_d),
            persistent=False,
        )

        # 4) Transformer encoder blocks
        self.blocks = nn.ModuleList(
            [MyViTBlock(hidden_d, n_heads) for _ in range(n_blocks)]
        )

        # 5) Classification MLPk
        self.mlp = nn.Sequential(nn.Linear(self.hidden_d, out_d), nn.Softmax(dim=-1))

    def forward(self, images):
        # Dividing images into patches
        n, c, h, w = images.shape
        patches = patchify(images, self.n_patches).to(self.positional_embeddings.device)

        # Running linear layer tokenization
        # Map the vector corresponding to each patch to the hidden size dimension
        tokens = self.linear_mapper(patches)

        # Adding classification token to the tokens
        tokens = torch.cat((self.class_token.expand(n, 1, -1), tokens), dim=1)

        # Adding positional embedding
        out = tokens + self.positional_embeddings.repeat(n, 1, 1)

        # Transformer Blocks
        for block in self.blocks:
            out = block(out)

        # Getting the classification token only
        out = out[:, 0]

        return self.mlp(out)  # Map to output dimension, output category distribution


def get_positional_embeddings(sequence_length, d):
    result = torch.ones(sequence_length, d)
    for i in range(sequence_length):
        for j in range(d):
            result[i][j] = (
                np.sin(i / (10000 ** (j / d)))
                if j % 2 == 0
                else np.cos(i / (10000 ** ((j - 1) / d)))
            )
    return result


def main():
    # Loading data
    transform = ToTensor()

    train_set = MNIST(
        root="./../datasets", train=True, download=True, transform=transform
    )
    test_set = MNIST(
        root="./../datasets", train=False, download=True, transform=transform
    )

    train_loader = DataLoader(train_set, shuffle=True, batch_size=128)
    test_loader = DataLoader(test_set, shuffle=False, batch_size=128)

    # Defining model and training options
    device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
    print(
        "Using device: ",
        device,
        f"({torch.cuda.get_device_name(device)})" if torch.cuda.is_available() else "",
    )
    model = MyViT(
        (1, 28, 28), n_patches=7, n_blocks=2, hidden_d=8, n_heads=2, out_d=10
    ).to(device)
    N_EPOCHS = 5
    LR = 0.005

    # Training loop
    optimizer = Adam(model.parameters(), lr=LR)
    criterion = CrossEntropyLoss()
    for epoch in trange(N_EPOCHS, desc="Training"):
        train_loss = 0.0
        for batch in tqdm(
            train_loader, desc=f"Epoch {epoch + 1} in training", leave=False
        ):
            x, y = batch
            x, y = x.to(device), y.to(device)
            y_hat = model(x)
            loss = criterion(y_hat, y)

            train_loss += loss.detach().cpu().item() / len(train_loader)

            optimizer.zero_grad()
            loss.backward()
            optimizer.step()

        print(f"Epoch {epoch + 1}/{N_EPOCHS} loss: {train_loss:.2f}")

    # Test loop
    with torch.no_grad():
        correct, total = 0, 0
        test_loss = 0.0
        for batch in tqdm(test_loader, desc="Testing"):
            x, y = batch
            x, y = x.to(device), y.to(device)
            y_hat = model(x)
            loss = criterion(y_hat, y)
            test_loss += loss.detach().cpu().item() / len(test_loader)

            correct += torch.sum(torch.argmax(y_hat, dim=1) == y).detach().cpu().item()
            total += len(x)
        print(f"Test loss: {test_loss:.2f}")
        print(f"Test accuracy: {correct / total * 100:.2f}%")


if __name__ == "__main__":
    main()

Downloading http://yann.lecun.com/exdb/mnist/train-images-idx3-ubyte.gz
Downloading http://yann.lecun.com/exdb/mnist/train-images-idx3-ubyte.gz to ./../datasets/MNIST/raw/train-images-idx3-ubyte.gz


100%|██████████| 9912422/9912422 [00:00<00:00, 66639382.91it/s]


Extracting ./../datasets/MNIST/raw/train-images-idx3-ubyte.gz to ./../datasets/MNIST/raw

Downloading http://yann.lecun.com/exdb/mnist/train-labels-idx1-ubyte.gz
Downloading http://yann.lecun.com/exdb/mnist/train-labels-idx1-ubyte.gz to ./../datasets/MNIST/raw/train-labels-idx1-ubyte.gz


100%|██████████| 28881/28881 [00:00<00:00, 11607483.12it/s]

Extracting ./../datasets/MNIST/raw/train-labels-idx1-ubyte.gz to ./../datasets/MNIST/raw

Downloading http://yann.lecun.com/exdb/mnist/t10k-images-idx3-ubyte.gz
Downloading http://yann.lecun.com/exdb/mnist/t10k-images-idx3-ubyte.gz to ./../datasets/MNIST/raw/t10k-images-idx3-ubyte.gz



100%|██████████| 1648877/1648877 [00:00<00:00, 21027785.32it/s]


Extracting ./../datasets/MNIST/raw/t10k-images-idx3-ubyte.gz to ./../datasets/MNIST/raw

Downloading http://yann.lecun.com/exdb/mnist/t10k-labels-idx1-ubyte.gz
Downloading http://yann.lecun.com/exdb/mnist/t10k-labels-idx1-ubyte.gz to ./../datasets/MNIST/raw/t10k-labels-idx1-ubyte.gz


100%|██████████| 4542/4542 [00:00<00:00, 20011059.63it/s]

Extracting ./../datasets/MNIST/raw/t10k-labels-idx1-ubyte.gz to ./../datasets/MNIST/raw






Using device:  cpu 


Training:   0%|          | 0/5 [00:00<?, ?it/s]
Epoch 1 in training:   0%|          | 0/469 [00:00<?, ?it/s][A
Epoch 1 in training:   0%|          | 1/469 [00:00<05:21,  1.46it/s][A
Epoch 1 in training:   0%|          | 2/469 [00:01<05:25,  1.44it/s][A
Epoch 1 in training:   1%|          | 3/469 [00:02<05:27,  1.42it/s][A
Epoch 1 in training:   1%|          | 4/469 [00:02<05:24,  1.43it/s][A
Epoch 1 in training:   1%|          | 5/469 [00:04<06:48,  1.13it/s][A
Epoch 1 in training:   1%|▏         | 6/469 [00:05<08:29,  1.10s/it][A
Epoch 1 in training:   1%|▏         | 7/469 [00:06<08:24,  1.09s/it][A
Epoch 1 in training:   2%|▏         | 8/469 [00:07<07:06,  1.08it/s][A
Epoch 1 in training:   2%|▏         | 9/469 [00:07<05:55,  1.30it/s][A
Epoch 1 in training:   2%|▏         | 10/469 [00:08<05:07,  1.49it/s][A
Epoch 1 in training:   2%|▏         | 11/469 [00:08<04:35,  1.66it/s][A
Epoch 1 in training:   3%|▎         | 12/469 [00:08<04:10,  1.82it/s][A
Epoch 1 in training: 

Epoch 1/5 loss: 2.11



Epoch 2 in training:   0%|          | 0/469 [00:00<?, ?it/s][A
Epoch 2 in training:   0%|          | 1/469 [00:00<03:41,  2.11it/s][A
Epoch 2 in training:   0%|          | 2/469 [00:00<03:36,  2.16it/s][A
Epoch 2 in training:   1%|          | 3/469 [00:01<03:34,  2.18it/s][A
Epoch 2 in training:   1%|          | 4/469 [00:01<03:30,  2.21it/s][A
Epoch 2 in training:   1%|          | 5/469 [00:02<03:31,  2.19it/s][A
Epoch 2 in training:   1%|▏         | 6/469 [00:02<03:31,  2.19it/s][A
Epoch 2 in training:   1%|▏         | 7/469 [00:03<03:31,  2.19it/s][A
Epoch 2 in training:   2%|▏         | 8/469 [00:03<03:30,  2.19it/s][A
Epoch 2 in training:   2%|▏         | 9/469 [00:04<03:28,  2.21it/s][A
Epoch 2 in training:   2%|▏         | 10/469 [00:04<03:30,  2.18it/s][A
Epoch 2 in training:   2%|▏         | 11/469 [00:05<03:46,  2.02it/s][A
Epoch 2 in training:   3%|▎         | 12/469 [00:05<04:08,  1.84it/s][A
Epoch 2 in training:   3%|▎         | 13/469 [00:06<04:27,  1.71it/s

Epoch 2/5 loss: 1.86



Epoch 3 in training:   0%|          | 0/469 [00:00<?, ?it/s][A
Epoch 3 in training:   0%|          | 1/469 [00:00<05:21,  1.46it/s][A
Epoch 3 in training:   0%|          | 2/469 [00:01<04:41,  1.66it/s][A
Epoch 3 in training:   1%|          | 3/469 [00:01<04:08,  1.87it/s][A
Epoch 3 in training:   1%|          | 4/469 [00:02<03:53,  1.99it/s][A
Epoch 3 in training:   1%|          | 5/469 [00:02<03:41,  2.10it/s][A
Epoch 3 in training:   1%|▏         | 6/469 [00:03<03:36,  2.13it/s][A
Epoch 3 in training:   1%|▏         | 7/469 [00:03<03:30,  2.19it/s][A
Epoch 3 in training:   2%|▏         | 8/469 [00:03<03:31,  2.18it/s][A
Epoch 3 in training:   2%|▏         | 9/469 [00:04<03:28,  2.21it/s][A
Epoch 3 in training:   2%|▏         | 10/469 [00:04<03:26,  2.23it/s][A
Epoch 3 in training:   2%|▏         | 11/469 [00:05<03:27,  2.20it/s][A
Epoch 3 in training:   3%|▎         | 12/469 [00:05<03:26,  2.21it/s][A
Epoch 3 in training:   3%|▎         | 13/469 [00:06<03:27,  2.19it/s

Epoch 3/5 loss: 1.78



Epoch 4 in training:   0%|          | 0/469 [00:00<?, ?it/s][A
Epoch 4 in training:   0%|          | 1/469 [00:00<05:22,  1.45it/s][A
Epoch 4 in training:   0%|          | 2/469 [00:01<04:14,  1.83it/s][A
Epoch 4 in training:   1%|          | 3/469 [00:01<03:52,  2.01it/s][A
Epoch 4 in training:   1%|          | 4/469 [00:02<03:40,  2.11it/s][A
Epoch 4 in training:   1%|          | 5/469 [00:02<03:35,  2.16it/s][A
Epoch 4 in training:   1%|▏         | 6/469 [00:02<03:30,  2.20it/s][A
Epoch 4 in training:   1%|▏         | 7/469 [00:03<03:29,  2.21it/s][A
Epoch 4 in training:   2%|▏         | 8/469 [00:03<03:28,  2.21it/s][A
Epoch 4 in training:   2%|▏         | 9/469 [00:04<03:27,  2.22it/s][A
Epoch 4 in training:   2%|▏         | 10/469 [00:04<03:27,  2.21it/s][A
Epoch 4 in training:   2%|▏         | 11/469 [00:05<03:43,  2.05it/s][A
Epoch 4 in training:   3%|▎         | 12/469 [00:05<03:38,  2.09it/s][A
Epoch 4 in training:   3%|▎         | 13/469 [00:06<03:33,  2.14it/s

Epoch 4/5 loss: 1.74



Epoch 5 in training:   0%|          | 0/469 [00:00<?, ?it/s][A
Epoch 5 in training:   0%|          | 1/469 [00:00<03:43,  2.09it/s][A
Epoch 5 in training:   0%|          | 2/469 [00:00<03:31,  2.21it/s][A
Epoch 5 in training:   1%|          | 3/469 [00:01<03:32,  2.20it/s][A
Epoch 5 in training:   1%|          | 4/469 [00:01<03:32,  2.19it/s][A
Epoch 5 in training:   1%|          | 5/469 [00:02<03:32,  2.19it/s][A
Epoch 5 in training:   1%|▏         | 6/469 [00:02<03:30,  2.20it/s][A
Epoch 5 in training:   1%|▏         | 7/469 [00:03<03:27,  2.22it/s][A
Epoch 5 in training:   2%|▏         | 8/469 [00:03<03:27,  2.22it/s][A
Epoch 5 in training:   2%|▏         | 9/469 [00:04<03:26,  2.23it/s][A
Epoch 5 in training:   2%|▏         | 10/469 [00:04<03:28,  2.20it/s][A
Epoch 5 in training:   2%|▏         | 11/469 [00:04<03:27,  2.20it/s][A
Epoch 5 in training:   3%|▎         | 12/469 [00:05<03:25,  2.22it/s][A
Epoch 5 in training:   3%|▎         | 13/469 [00:05<03:26,  2.21it/s

Epoch 5/5 loss: 1.71


Testing: 100%|██████████| 79/79 [00:20<00:00,  3.81it/s]

Test loss: 1.70
Test accuracy: 76.35%



