<a href="https://colab.research.google.com/github/caglarmert/DI725/blob/main/DI725_Lab_2.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# DI 725: Transformers and Attention-Based Deep Networks

## Employing Transformers for Various Tasks
We will be observing various use-case scenarios for Transformer models in this lab. Text classification, text summarization, text translation, question answering, and text generation are some important domains where Transformer models shine brilliantly (as they do in different domains).

We will also observe how we can evaluate our models, specifically for NLP tasks. We will compare basic metrics such as accuracy, precision, and recall with perplexity, Rouge, Meteor and BLEU metrics.

### Author:
* Ümit Mert Çağlar

### References

This notebook is based on the Datacamp short course [Introduction to LLMs in Python](https://app.datacamp.com/learn/courses/introduction-to-llms-in-python). You can experiment with the code provided in this notebook and adapt a model to a task as you desire.

# Introduction

<div>
<img src="https://github.com/caglarmert/DI725/blob/main/src/attention_research_1.png?raw=true" width="400"/>
</div>


## Imports
In this part we import the required libraries. Running this part on the Colab servers is required for later parts. It is advised to check the associated python requirements.txt, that is frozen at the time of preparation of this notebook, in case of any library or version error occurs while running this notebook. Mind that installing everything locally via pip install -r "requirements.txt" is not advised though, mainly because of the discrepancies between Colab and locally available machine.

In [1]:
# Uncomment any install if needed. It is recommended that these installations
# are performed prior to any notebook runs and imports

!pip install datasets # Huggingface dataset library
!pip install evaluate # Used for evaluation metrics
!pip install rouge_score # Is a text evaluation metric
!pip install trl #Transformers Reinforcement Learning framework
!pip install sacremoses # Used for specific characters, useful for languages like Turkish

Collecting datasets
  Downloading datasets-2.18.0-py3-none-any.whl (510 kB)
[2K     [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m510.5/510.5 kB[0m [31m7.9 MB/s[0m eta [36m0:00:00[0m
Collecting dill<0.3.9,>=0.3.0 (from datasets)
  Downloading dill-0.3.8-py3-none-any.whl (116 kB)
[2K     [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m116.3/116.3 kB[0m [31m13.0 MB/s[0m eta [36m0:00:00[0m
Collecting xxhash (from datasets)
  Downloading xxhash-3.4.1-cp310-cp310-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (194 kB)
[2K     [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m194.1/194.1 kB[0m [31m9.9 MB/s[0m eta [36m0:00:00[0m
[?25hCollecting multiprocess (from datasets)
  Downloading multiprocess-0.70.16-py310-none-any.whl (134 kB)
[2K     [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m134.8/134.8 kB[0m [31m12.0 MB/s[0m eta [36m0:00:00[0m
Installing collected packages: xxhash, dill, multiprocess, datasets
Successfully installed datasets

In [2]:
from transformers import pipeline
import math
import torch
from torch import nn
import torch.nn.functional as F
import evaluate

from transformers import AutoTokenizer
from transformers import AutoModelForSequenceClassification, AutoModelForSeq2SeqLM
from datasets import load_dataset

from transformers import AutoModelForCausalLM, AutoTokenizer


After importing the main libraries, we can continue with the transformers. First lets check what the above import does. We have imported pipelines, automodels, evaluate, datasets, transformers reinforcement learning from huggingface 🤗 library. Also torch and other essentials were imported too.

The [documentation](https://huggingface.co/docs/transformers) for the Transformers library.

The [pipelines](https://huggingface.co/docs/transformers/main_classes/pipelines) is a class of the Transformers library. It is used for easy inference, abstracts most of the complexity and offers simple API for some dedicated tasks.

The [torch](https://pytorch.org/) is a popular and diverse machine learning framework, enabling low level implementation (as low as it gets with Python anyway). The Neural Networks (nn) is a library within PyTorch that enables operations with neural network structures.

The [Auto](https://huggingface.co/docs/transformers/model_doc/auto) classes contain many high level methods and models for various specific tasks, sometimes required for a pre-processing step such as tokenizers.

The [datasets](https://huggingface.co/docs/datasets) is the 🤗 library used for datasets (who would have guess?). Tabular, Audio, Computer Vision, and Text data can be loaded or shared via this library.

# Introduction

In this first introductory section, we begin with experiencing basic and very high level usage of transformers.

## Part 1: Classifying a Text


Here we will be using the DistilBERT model for sentiment classification. DistilBert is a modification (distillation) over the original BERT model, which is short for: Bidirectional Encoder Representations from Transformers. As we can understand from the name, the model is an Encoder transformer, which is reasonable as an encoder is primarily tasked with "understanding" the context, and here we will be trying to understand the sentiment. For details about the model, you can check the model [documentation](https://huggingface.co/docs/transformers/model_doc/distilbert), [blog](https://medium.com/huggingface/distilbert-8cf3380435b5) or [paper](https://arxiv.org/abs/1910.01108). The model we are going to use can be accessed from this [repository](https://huggingface.co/lxyuan/distilbert-base-multilingual-cased-sentiments-student).


Huggingface Hub is an open-source public collaboration of various models. Large Language Models, require a tremendous amount of training data and time; thus, once trained, they are invaluable, and their inferences can be adapted to various use cases.


This first practice will be about loading a model from the huggingface hub into a pipeline to perform a task.


It is important to note that model loading with a specific model name is advised, or else it will opt for defaults.


#### Instructions
* Import the necessary function from the transformers library to load Hugging Face LLMs as pipelines.
* Load the model specified in model_name into a suitable pipeline for sentiment classification in text.
* Pass the customer review defined in the prompt to the pipeline to get a sentiment prediction.

In [3]:
# Specify the task name
task_name = "text-classification"
# Specify the model to be loaded
model_name = "lxyuan/distilbert-base-multilingual-cased-sentiments-student"
# We can change the model name to
# "mrm8488/distilroberta-finetuned-financial-news-sentiment-analysis"
# "lxyuan/distilbert-base-multilingual-cased-sentiments-student"
classifier = pipeline(task = task_name, model = model_name)

# Clearly this is a positive sentiment from a 5 star review tripadvisor for Atakule
prompt = "I liked Atakule, very much so because of the excellent location in the midst of the botanical park and city center."
prediction = classifier(prompt)
print(prompt, "\nSentiment:", prediction[0]["label"], "Score:",prediction[0]["score"],)

# And a negative one, 1 star review from the time it is off-limits.
prompt = "There was nothing to see at Atakule, the building is under construction, you can't go into building, wasting my afternoon time in ankara."
prediction = classifier(prompt)
print(prompt, "\nSentiment:", prediction[0]["label"], "Score:",prediction[0]["score"],)



config.json:   0%|          | 0.00/759 [00:00<?, ?B/s]

model.safetensors:   0%|          | 0.00/541M [00:00<?, ?B/s]

tokenizer_config.json:   0%|          | 0.00/373 [00:00<?, ?B/s]

vocab.txt:   0%|          | 0.00/996k [00:00<?, ?B/s]

tokenizer.json:   0%|          | 0.00/2.92M [00:00<?, ?B/s]

special_tokens_map.json:   0%|          | 0.00/125 [00:00<?, ?B/s]

I liked Atakule, very much so because of the excellent location in the midst of the botanical park and city center. 
Sentiment: positive Score: 0.8559050559997559
There was nothing to see at Atakule, the building is under construction, you can't go into building, wasting my afternoon time in ankara. 
Sentiment: negative Score: 0.4448523223400116


## Part 2: Summarizing a Text

Text summarization requires context understanding and generation, which is possible with an encoder-decoder model. The T5 model is an encoder-decoder transformer model that is short for Ts in "Exploring the Limits of Transfer Learning with a Unified Text-to-Text Transformer".  You can also check the [paper](https://arxiv.org/pdf/1910.10683.pdf), [huggingface summary](https://huggingface.co/docs/transformers/model_doc/t5). The model we are going to use is a small implementation of this model, and its card can be accessed from this [repository](https://huggingface.co/cnicu/t5-small-booksum).


Summarization is a challenging language task that requires sequence-to-sequence models, such as the one we are using here. The task is about summarizing a given long text.


#### Instructions

* Load the model, based on the T5 transformer architecture and specified in model_name, into a text summarization pipeline.
* Pass long_text to the model pipeline to produce a summary limited to 50 tokens in length.
* Access and print the summarized text in outputs.

In [4]:
# Specify a model name, note that we are using a small version so don't expect much
model_name = "cnicu/t5-small-booksum"
# Provide the long text
long_text = "Tunali hilmi, which is a bustling street, is a hub for various commercial activities as it extends southwards toward Kugulu Park. Tunali Hilmi Avenue is regarded as one of the city's most charming streets, adorned with a variety of shops, boutiques, and souvenir stores. The neighborhood exudes a sense of luxury and offers a wide range of goods, albeit at slightly higher prices compared to other areas. However, the elevated cost is justified by the high-quality shopping experience, particularly appealing to those who enjoy outdoor retail therapy."

# Load the model pipeline for text summarization
summarizer = pipeline(task="summarization", model=model_name)

# Pass the long text to the model to summarize it
outputs = summarizer(long_text, max_length=50)

# Access and print the summarized text in the outputs variable
print("Original Text: ", long_text, "\nSummary Text: ", outputs[0]['summary_text'])

config.json:   0%|          | 0.00/1.38k [00:00<?, ?B/s]

pytorch_model.bin:   0%|          | 0.00/242M [00:00<?, ?B/s]

  return self.fget.__get__(instance, owner)()


tokenizer_config.json:   0%|          | 0.00/1.92k [00:00<?, ?B/s]

spiece.model:   0%|          | 0.00/792k [00:00<?, ?B/s]

tokenizer.json:   0%|          | 0.00/2.42M [00:00<?, ?B/s]

special_tokens_map.json:   0%|          | 0.00/1.79k [00:00<?, ?B/s]

Original Text:  Tunali hilmi, which is a bustling street, is a hub for various commercial activities as it extends southwards toward Kugulu Park. Tunali Hilmi Avenue is regarded as one of the city's most charming streets, adorned with a variety of shops, boutiques, and souvenir stores. The neighborhood exudes a sense of luxury and offers a wide range of goods, albeit at slightly higher prices compared to other areas. However, the elevated cost is justified by the high-quality shopping experience, particularly appealing to those who enjoy outdoor retail therapy. 
Summary Text:  Tunali hilmi is regarded as one of the city's most charming streets, adorned with shops, boutiques, and souvenir stores. The neighborhood offers a wide range of goods, albeit at slightly higher prices


## Part 3: Translating a text

Further information about the OPUS dataset and transformer model training can be accessed from the official GitHub [repository](https://github.com/Helsinki-NLP/OPUS-MT-train). And the model that we are gonig to use is detailed in the following huggingface model [card](https://huggingface.co/Helsinki-NLP/opus-mt-tr-en). The model

Translation is another challanging language task, requiring models trained specifically for source and target languages.

#### Instructions

* Define a pipeline for Turkish-to-English translation, specifying the source and target languages in the pipeline task argument.
* Translate the text in input_text using the pipeline.
* Access and print the translated text in the outputs variable: translations.

In [5]:
# Specify the model name, from Turkish (tr) to English (en)
model_name = "Helsinki-NLP/opus-mt-tr-en"

# A short intro about METU
input_text = "Orta Doğu Teknik Üniversitesi, Türkiye ve Orta Doğu ülkelerinin kalkınmalarına katkıda bulunmak, özellikle fen bilimleri ve sosyal bilimler alanlarında uzman yetiştirmek üzere 15 Kasım 1956 tarihinde Orta Doğu Yüksek Teknoloji Enstitüsü adıyla eğitime başlamıştır. "

# Define pipeline for Spanish-to-English translation
translator = pipeline("translation_tr_to_en", model=model_name)

# Translate the input text
translations = translator(input_text)

# Access the output to print the translated text in English
print("Original text: ", input_text)
print("Translated text:", translations[0]['translation_text'])

config.json:   0%|          | 0.00/1.38k [00:00<?, ?B/s]

pytorch_model.bin:   0%|          | 0.00/307M [00:00<?, ?B/s]

generation_config.json:   0%|          | 0.00/293 [00:00<?, ?B/s]

tokenizer_config.json:   0%|          | 0.00/42.0 [00:00<?, ?B/s]

source.spm:   0%|          | 0.00/840k [00:00<?, ?B/s]

target.spm:   0%|          | 0.00/797k [00:00<?, ?B/s]

vocab.json:   0%|          | 0.00/1.56M [00:00<?, ?B/s]

Original text:  Orta Doğu Teknik Üniversitesi, Türkiye ve Orta Doğu ülkelerinin kalkınmalarına katkıda bulunmak, özellikle fen bilimleri ve sosyal bilimler alanlarında uzman yetiştirmek üzere 15 Kasım 1956 tarihinde Orta Doğu Yüksek Teknoloji Enstitüsü adıyla eğitime başlamıştır. 
Translated text: The Middle East Technical University began training as the Middle East Institute of Technology on 15 November 1956 to contribute to the development of Turkey and Middle East countries, especially to develop experts in science and social sciences.


## Part 4: Question-Answering

The Stanford Question Answering Dataset ([SQUAD](https://rajpurkar.github.io/SQuAD-explorer/)) is one of the most popular question answering datasets, and the following example is again a distilbert model, that is fine-tuned for question answering with SQUAD, as detailed in this [repository](https://huggingface.co/distilbert/distilbert-base-cased-distilled-squad).

Next, let's practice loading a Hugging Face LLM into a pipeline for question-answering (QA, for short). This time, you will use the default model supplied by Hugging Face transformers library for QA pipelines.

#### Instructions
* Instantiate a pipeline for question-answering.
* Pass the necessary pieces of information as inputs to the pipeline.
* Access and print the extracted answer in the outputs variable.

In [6]:
# Load the model pipeline for question-answering
model_name = "distilbert-base-cased-distilled-squad"

qa_model = pipeline("question-answering",model=model_name)

# Provide the context
context = "The history of Ankara Castle, one of the symbols of the province, is as old as the history of the city. It remains to be determined when the castle, which existed when the Galatians settled in Ankara and was repaired during the Roman period, was built. Next to the hill on which it was founded, that is, Hatip Stream, is 110 m above the Bent Stream. The castle has more than 20 towers. The outer castle surrounds Ankara in the shape of a heart. The four-storey inner castle is made of Ankara Stone and partly of collected stones. The inner castle has two large gates, one is called the Outer Gate and the other is the Citadel Gate. There is a book belonging to the Ilkhanate on this door. The inner castles consist of a total of 42 pentagonal towers with a length of 14-16 m. There is an inscription in the northwestern part showing the repairs made by the Seljuk ruler."

# Provide the questions
questions = ["How many towers does the Ankara castle have?",
             "When did the Ankara castle was build?",
             "How long are the towers in the inner castle?",
             "Who repaired the Ankara castle and inscribed?",
             "What are the materials of the Ankara castle?"]

# Pass the necessary inputs to the LLM pipeline for question-answering
outputs = qa_model(question=questions, context=context)

# Access and print the answer
for i in range(len(questions)):
  print("Question: ", questions[i], "\nAnswer:", outputs[i]['answer'])

config.json:   0%|          | 0.00/473 [00:00<?, ?B/s]

model.safetensors:   0%|          | 0.00/261M [00:00<?, ?B/s]

tokenizer_config.json:   0%|          | 0.00/29.0 [00:00<?, ?B/s]

vocab.txt:   0%|          | 0.00/213k [00:00<?, ?B/s]

tokenizer.json:   0%|          | 0.00/436k [00:00<?, ?B/s]

Question:  How many towers does the Ankara castle have? 
Answer: more than 20
Question:  When did the Ankara castle was build? 
Answer: Roman period
Question:  How long are the towers in the inner castle? 
Answer: 14-16 m
Question:  Who repaired the Ankara castle and inscribed? 
Answer: the Seljuk ruler
Question:  What are the materials of the Ankara castle? 
Answer: Ankara Stone and partly of collected stones


## Part 5: Text Generation

Text generation, is the most famous application of transformers, namely ChatGPT (standing for Generative Pre-Trained). Here we will use an older version (GPT-2) to generate text for customers leaving reviews for our business on a public website. You can check the model [card](https://huggingface.co/openai-community/gpt2) on Huggingface about GPT-2, follow the demonstration at the official OpenAI [website](https://openai.com/research/better-language-models) or read more about GPT-2 [paper](https://d4mucfpksywv.cloudfront.net/better-language-models/language_models_are_unsupervised_multitask_learners.pdf).

#### Instructions
* Instantiate the generator variable as a pipeline that loads the "gpt2" pre-trained text generation model.
* Build a prompt for the LLM that concatenates the customer review with the hotel response's initial sentence.
* Pass the prompt to the previously defined pipeline to generate (inference) the following text in the hotel response, specifying a maximum length of 150 tokens for the generated output.
* Print the generated output.

In [7]:
# Create a pipeline for text generation using the gpt2 model
generator = pipeline("text-generation", model="gpt2")

customer_text = "The Divan is a very comfortable and professionally run hotel in Ankara. The staff are extremely helpful and friendly. Rooms and beds are very comfortable, with all the facilities that you would expect in a four star hotel. The breakfast buffet is very extensive (open 6.30AM to 10.30AM). The only down-side is the hotels location, a ten to fifteen minute taxi ride away from the city centre, embassies and government buildings, but is located within a very quiet residential area."

response = "Dear Our Valuable Guest, Thank you for taking the time to leave us a review."

# Build the prompt for the text generation LLM
prompt = f"Customer review:\n{customer_text}\n\nHotel reponse to the customer:\n{response}"

# Pass the prompt to the model pipeline
outputs = generator(prompt, max_length=150, pad_token_id=generator.tokenizer.eos_token_id)

# Print the augmented sequence generated by the model
print(outputs[0]['generated_text'])

config.json:   0%|          | 0.00/665 [00:00<?, ?B/s]

model.safetensors:   0%|          | 0.00/548M [00:00<?, ?B/s]

generation_config.json:   0%|          | 0.00/124 [00:00<?, ?B/s]

tokenizer_config.json:   0%|          | 0.00/26.0 [00:00<?, ?B/s]

vocab.json:   0%|          | 0.00/1.04M [00:00<?, ?B/s]

merges.txt:   0%|          | 0.00/456k [00:00<?, ?B/s]

tokenizer.json:   0%|          | 0.00/1.36M [00:00<?, ?B/s]

Truncation was not explicitly activated but `max_length` is provided a specific value, please use `truncation=True` to explicitly truncate examples to max length. Defaulting to 'longest_first' truncation strategy. If you encode pairs of sequences (GLUE-style) with the tokenizer you can select this strategy more precisely by providing a specific strategy to `truncation`.


Customer review:
The Divan is a very comfortable and professionally run hotel in Ankara. The staff are extremely helpful and friendly. Rooms and beds are very comfortable, with all the facilities that you would expect in a four star hotel. The breakfast buffet is very extensive (open 6.30AM to 10.30AM). The only down-side is the hotels location, a ten to fifteen minute taxi ride away from the city centre, embassies and government buildings, but is located within a very quiet residential area.

Hotel reponse to the customer:
Dear Our Valuable Guest, Thank you for taking the time to leave us a review. Due to a customer service issue that may have changed the price, the hotel will no longer


Transfer Learning / Fine Tuning Models

# Evaluation

Evaluating transformer models is just as important as building and training them. From most simplest ones such as accuracy to advanced metrics like Rouge or BLEU, there are a wide range of different performance metrics to be considered for a given task.

## Part 6: Basic Metrics, Accuracy, Precision, Recall, F1 Score

Using the sentiment classification pipeline, here we will demonstrate how to calculate basic metrics: Accuracy, Precision, Recall, F1 Score.

#### Instructions
* Pass a list containing the four input reviews to the sentiment classification pipeline.
* Load the score metric from the evaluate library

In [8]:
test_examples = [
    {"text": "I am making a good use of this product!", "label": 1},
    {"text": "The service was disappointing.", "label": 0},
    {"text": "I learned a lot from this book.", "label": 1},
    {"text": "The book cover broke after two days of use.", "label": 0},
]
sentiment_analysis = pipeline("sentiment-analysis")

# Pass the four input texts (without labels) to the pipeline
predictions = sentiment_analysis([example["text"] for example in test_examples])

true_labels = [example["label"] for example in test_examples]
predicted_labels = [1 if pred["label"] == "POSITIVE" else 0 for pred in predictions]

# Load the accuracy metric
accuracy = evaluate.load("accuracy")

result = accuracy.compute(references=true_labels, predictions=predicted_labels)
print(result)


# Load the accuracy, precision, recall and F1 score .metrics
accuracy = evaluate.load("accuracy")
precision = evaluate.load("precision")
recall = evaluate.load("recall")
f1 = evaluate.load("f1")

# Obtain a description of each metric
print(accuracy.description)
print(precision.description)
print(recall.description)
print(f1.description)

test_examples = [
    "Fantastic hotel, exceeded expectations!",
    "Quiet despite central location, great stay.",
    "Friendly staff, welcoming atmosphere.",
    "Spacious, comfy room—a perfect retreat.",
    "Cleanliness could improve, overall decent stay.",
      "Disappointing stay, noisy and unclean room.",
    "Terrible service, unfriendly staff, won't return."
]
test_labels = [1, 1, 1, 1, 0, 0, 0]

# Pass the examples to the pipeline, and obtain a list of predicted labels
sentiment_analysis = pipeline("sentiment-analysis")
predictions = sentiment_analysis([example for example in test_examples])
predicted_labels = [1 if pred["label"] == "POSITIVE" else 0 for pred in predictions]

# Compute the metrics by comparing real and predicted labels
print(precision.compute(references=test_labels, predictions=predicted_labels))
print(recall.compute(references=test_labels, predictions=predicted_labels))
print(f1.compute(references=test_labels, predictions=predicted_labels))

No model was supplied, defaulted to distilbert/distilbert-base-uncased-finetuned-sst-2-english and revision af0f99b (https://huggingface.co/distilbert/distilbert-base-uncased-finetuned-sst-2-english).
Using a pipeline without specifying a model name and revision in production is not recommended.


config.json:   0%|          | 0.00/629 [00:00<?, ?B/s]

model.safetensors:   0%|          | 0.00/268M [00:00<?, ?B/s]

tokenizer_config.json:   0%|          | 0.00/48.0 [00:00<?, ?B/s]

vocab.txt:   0%|          | 0.00/232k [00:00<?, ?B/s]

Downloading builder script:   0%|          | 0.00/4.20k [00:00<?, ?B/s]

{'accuracy': 1.0}


Downloading builder script:   0%|          | 0.00/7.55k [00:00<?, ?B/s]

Downloading builder script:   0%|          | 0.00/7.36k [00:00<?, ?B/s]

Downloading builder script:   0%|          | 0.00/6.77k [00:00<?, ?B/s]

No model was supplied, defaulted to distilbert/distilbert-base-uncased-finetuned-sst-2-english and revision af0f99b (https://huggingface.co/distilbert/distilbert-base-uncased-finetuned-sst-2-english).
Using a pipeline without specifying a model name and revision in production is not recommended.



Accuracy is the proportion of correct predictions among the total number of cases processed. It can be computed with:
Accuracy = (TP + TN) / (TP + TN + FP + FN)
 Where:
TP: True positive
TN: True negative
FP: False positive
FN: False negative


Precision is the fraction of correctly labeled positive examples out of all of the examples that were labeled as positive. It is computed via the equation:
Precision = TP / (TP + FP)
where TP is the True positives (i.e. the examples correctly labeled as positive) and FP is the False positive examples (i.e. the examples incorrectly labeled as positive).


Recall is the fraction of the positive examples that were correctly labeled by the model as positive. It can be computed with the equation:
Recall = TP / (TP + FN)
Where TP is the true positives and FN is the false negatives.


The F1 score is the harmonic mean of the precision and recall. It can be computed with the equation:
F1 = 2 * (precision * recall) / (precision + recall)

{'precision': 

## Part 7: Perplexity

In general, perplexity is a measurement of how well a probability model predicts a sample. In the context of Natural Language Processing, perplexity is one way to evaluate language models. Here we will compute the perplexity score with Huggingface library's evaluate.

#### Instructions
* Encode the text prompt, pass it to the GPT2 model for text generation, and decode the generated text.
* Load and compute the mean perplexity score on the generated text.

In [9]:
# Define the model name
model_name = "gpt2"

# Initialize the tokenizer
tokenizer = AutoTokenizer.from_pretrained(model_name)

# Initialize the model
model = AutoModelForCausalLM.from_pretrained(model_name)

prompt = "Current trends show that by 2030 "

# Encode the prompt, generate text and decode it
prompt_ids = tokenizer.encode(prompt, return_tensors="pt")
output = model.generate(prompt_ids, max_length=20)
generated_text = tokenizer.decode(
  output[0], skip_special_tokens=True)

print("Generated Text: ", generated_text)

# Load and compute the perplexity score
perplexity = evaluate.load("perplexity", module_type="metric")
results = perplexity.compute(model_id='gpt2',
                             predictions=generated_text)
print("Perplexity: ", results['mean_perplexity'])

The attention mask and the pad token id were not set. As a consequence, you may observe unexpected behavior. Please pass your input's `attention_mask` to obtain reliable results.
Setting `pad_token_id` to `eos_token_id`:50256 for open-end generation.


Generated Text:  Current trends show that by 2030  the number of people living in poverty will be at its lowest


Downloading builder script:   0%|          | 0.00/8.46k [00:00<?, ?B/s]

  0%|          | 0/6 [00:00<?, ?it/s]

Perplexity:  3514.5176167589552


## Part 8: Rouge, Meteor and Exact Match (EM)

Rouge, Meteor and Exact Match are some more advanced metrics used specifically in NLP tasks. Short description for each metric is provided below:

1. ROUGE (Recall-Oriented Understudy for Gisting Evaluation):
   - ROUGE is a set of metrics used for evaluating automatic summarization and machine translation tasks.
   - It measures the overlap between the model-generated summary (or translation) and the reference summaries (or translations).
   - ROUGE includes various variants like ROUGE-N, ROUGE-L, and ROUGE-W. ROUGE-N measures n-gram overlap, ROUGE-L measures the longest common subsequence, and ROUGE-W measures weighted LCS-based statistics.
   - ROUGE typically reports precision, recall, and F1-score for the overlap between the model output and the reference.

2. METEOR (Metric for Evaluation of Translation with Explicit ORdering):
   - METEOR is another metric used in machine translation and automatic summarization tasks.
   - It evaluates the quality of machine translation by considering not only exact word matches but also synonyms and paraphrases.
   - METEOR computes a score based on precision, recall, and alignment between words in the reference and system output. It also considers the WordNet synonymy and stem overlap.
   - METEOR has been shown to correlate well with human judgments of translation quality.

3. Exact Match (EM):
   - EM is a metric commonly used in question answering tasks to evaluate the accuracy of the model's responses.
   - It measures whether the model's output exactly matches the reference answer. If the generated answer matches the reference answer exactly, it gets a score of 1; otherwise, it gets a score of 0.
   - EM is a binary metric, indicating whether the model's output is an exact match to the ground truth answer.

Each of these metrics provides different perspectives on the quality and performance of NLP models. While ROUGE and METEOR are often used in text generation tasks like summarization and translation, EM is more commonly used in question answering and dialogue systems where exact answers are expected. Choosing the appropriate metric depends on the specific task and the desired evaluation criteria.

In [10]:
# Load the rouge metric
rouge = evaluate.load("rouge")

predictions = ["""Pluto is a dwarf planet in our solar system, located in the Kuiper Belt beyond Neptune, and was formerly considered the ninth planet until its reclassification in 2006."""]
references = ["""Pluto is a dwarf planet in the solar system, located in the Kuiper Belt beyond Neptune, and was previously deemed as a planet until it was reclassified in 2006."""]

# Calculate the rouge scores between the predicted and reference summaries
results = rouge.compute(predictions=predictions, references=references)
print("ROUGE results: ", results)

meteor = evaluate.load("meteor")

predictions = ["He thought it right and necessary to become a knight-errant, roaming the world in armor, seeking adventures and practicing the deeds he had read about in chivalric tales."]
references = ["He believed it was proper and essential to transform into a knight-errant, traveling the world in armor, pursuing adventures, and enacting the heroic deeds he had encountered in tales of chivalry."]

# Compute and print the METEOR score
results = meteor.compute(predictions=predictions, references=references)
print("Meteor: ", results['meteor'])


exact_match = evaluate.load("exact_match")

predictions = ["The cat sat on the mat.", "Theaters are great.", "It's like comparing oranges and apples."]
references = ["The cat sat on the mat?", "Theaters are great.", "It's like comparing apples and oranges."]

# Compute the exact match and print the results
results = exact_match.compute(predictions=predictions, references=references)
print("EM results: ", results)


Downloading builder script:   0%|          | 0.00/6.27k [00:00<?, ?B/s]

ROUGE results:  {'rouge1': 0.7719298245614034, 'rouge2': 0.6181818181818182, 'rougeL': 0.736842105263158, 'rougeLsum': 0.736842105263158}


Downloading builder script:   0%|          | 0.00/6.93k [00:00<?, ?B/s]

[nltk_data] Downloading package wordnet to /root/nltk_data...
[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Unzipping tokenizers/punkt.zip.
[nltk_data] Downloading package omw-1.4 to /root/nltk_data...


Meteor:  0.5350702240481536


Downloading builder script:   0%|          | 0.00/5.67k [00:00<?, ?B/s]

EM results:  {'exact_match': 0.3333333333333333}


## Part 9: BLEU Score

BLEU (BiLingual Evaluation Understudy) is a metric for automatically evaluating machine-translated text. The BLEU score is a number between zero and one that measures the similarity of the machine-translated text to a set of high quality reference translations.

A pipeline based on the Helsinki-NLP Turkish-English translation model and the BLEU metric has been loaded, use evaluate.load("bleu") from the evaluate library.

#### Instructions
* Pass the input sentence in input_sentence to the translator, then calculate the BLEU metric using reference.

In [11]:
bleu = evaluate.load("bleu")

input_sentence_1 = "Merhaba, nasılsın?"

reference_1 = [
     ["Hello, how are you?", "Hi, how are you?"]
     ]

translator = pipeline("translation", model="Helsinki-NLP/opus-mt-tr-en")

# Translate the first input sentence
translated_output = translator(input_sentence_1)

translated_sentence = translated_output[0]['translation_text']

print("Translated:", translated_sentence)

# Calculate BLEU metric
results = bleu.compute(predictions=[translated_sentence], references=reference_1)
print(results)


input_sentences_2 = ["Merhaba, nasılsın?", "Çok iyiyim, teşekkür ederim."]

references_2 = [
     ["Hello, how are you?", "Hi, how are you?"],
     ["I'm great, thanks.", "I'm great, thank you."]
     ]

# Translate the input sentences, extract the translated text, and compute BLEU score
translator = pipeline("translation", model="Helsinki-NLP/opus-mt-tr-en")

translated_outputs = translator(input_sentences_2)

predictions = [translated_output['translation_text'] for translated_output in translated_outputs]
print(predictions)

results = bleu.compute(predictions=predictions, references=references_2)
print(results)

Downloading builder script:   0%|          | 0.00/5.94k [00:00<?, ?B/s]

Downloading extra modules:   0%|          | 0.00/1.55k [00:00<?, ?B/s]

Downloading extra modules:   0%|          | 0.00/3.34k [00:00<?, ?B/s]

Translated: Hi, how are you?
{'bleu': 1.0, 'precisions': [1.0, 1.0, 1.0, 1.0], 'brevity_penalty': 1.0, 'length_ratio': 1.0, 'translation_length': 6, 'reference_length': 6}
['Hi, how are you?', 'Very well, thank you.']
{'bleu': 0.7598356856515925, 'precisions': [0.8333333333333334, 0.8, 0.75, 0.6666666666666666], 'brevity_penalty': 1.0, 'length_ratio': 1.0909090909090908, 'translation_length': 12, 'reference_length': 11}


# Conclusion

We have seen various tasks and performance metrics for model evaluation in this lab. Huggingface is one of the prime repositories for high-level model training, testing, and inference. With the knowledge of the previous lab and the accessibility of various high-level libraries, employing transformers for customized tasks becomes easier.