# Model Evaluation

This notebook contains the steps and code for quantitatively evaluating the models fine-tuned in this project. It covers loading a model, generating summaries with it, saving those summaries, and computing quantitative metrics of its performance: ROUGE-1, ROUGE-2, ROUGE-L, BLEU, and BERTScore (average precision, average recall and average F1).

At any one time, this notebook holds the flow for evaluating a single model. To evaluate a different model, the same notebook was rerun with that model loaded. The flow is therefore set up so that it applies to every model this project investigates.

In [1]:
# Installing required packages
!pip install -U datasets transformers sentencepiece accelerate tensorflow torch torchvision peft nltk rouge_score arabert evaluate bert-score > /dev/null 2>&1

In [2]:
# Loading packages
import random
from rouge_score import rouge_scorer
import torch
from torch.utils.data import DataLoader
from tqdm import tqdm
from datasets import load_dataset, Dataset, DatasetDict, concatenate_datasets
from peft import LoraConfig, TaskType, get_peft_model, PeftModel
from transformers import (
    AutoModelForSeq2SeqLM,
    AutoTokenizer,
    default_data_collator,
    MT5Tokenizer,
)
from nltk.translate.bleu_score import corpus_bleu
from nltk.translate.bleu_score import SmoothingFunction
from google.colab import drive
from bert_score import score

In [3]:
# Mounting Google Drive to the current Colab session for accessing files stored in the Drive
drive.mount('/content/drive')

Mounted at /content/drive


In [4]:
# Path to all the models and datasets
folder_path = '/content/drive/My Drive/CPSC_490_Data/'

In [5]:
# Loading the mT5 tokenizer
tokenizer = MT5Tokenizer.from_pretrained('google/mt5-small')

You are using the default legacy behaviour of the <class 'transformers.models.t5.tokenization_t5.T5Tokenizer'>. This is expected, and simply means that the `legacy` (previous) behavior will be used so nothing changes for you. If you want to use the new behaviour, set `legacy=False`. This should only be set if you understand what it means, and thouroughly read the reason why this was added as explained in https://github.com/huggingface/transformers/pull/24565


In [6]:
# Setting the device
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

In [7]:
# Loading a model and its PEFT components and moving it to the current device
model = AutoModelForSeq2SeqLM.from_pretrained(folder_path + "Improved_Amharic_FT_2")
model = PeftModel.from_pretrained(model, folder_path + "Improved_Amharic_FT_2")
model.to(device)

# Printing the number of trainable parameters of the model
model.print_trainable_parameters() # Should be 0% trainable, since PeftModel.from_pretrained loads the adapter for inference (is_trainable=False by default)

trainable params: 0 || all params: 300,434,816 || trainable%: 0.0


In [12]:
# Loading processed data, which includes the preprocessed and tokenized versions of the text and summary entries
tokenized_datasets = DatasetDict.load_from_disk(folder_path + 'brand_new_further_cleaned_Amharic_mT5_tokenized_datasets') # This specific file is the preprocessed and tokenized version of Amharic-2

In [9]:
# Evaluating the model
model = model.to(device)
model.eval() # Setting the model to evaluation mode
max_target_length = 128 # Setting the maximum target length of the summaries to be generated

# Loading the test data
test_dataloader = DataLoader(
    tokenized_datasets["test"],
    collate_fn = default_data_collator,
    batch_size = 64,
    pin_memory = True
)

summaries = [] # List to store generated summaries

# Evaluating the model and saving its results in the summaries list
for step, batch in enumerate(tqdm(test_dataloader)):
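    # Only the inputs are needed for generation; the labels are not used here
    batch = {k: v.to(device) for k, v in batch.items() if k in ['input_ids', 'attention_mask']}
    with torch.no_grad():
        outputs = model.generate(
            input_ids=batch["input_ids"],
            attention_mask=batch["attention_mask"], # Masking the padding tokens in the batch
            max_new_tokens=max_target_length
        )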

    summaries.extend(
        tokenizer.batch_decode(outputs.detach().cpu().numpy(), skip_special_tokens=True)
    )

100%|██████████| 143/143 [05:57<00:00,  2.50s/it]


In [None]:
# Getting the reference summaries to compare the generated summaries with
references = [tokenized_datasets["test"][i]["summary"] for i in range(len(summaries))]

In [10]:
# Saving summaries
# Path can be updated depending on where the generated summaries are to be stored
output_file_path = folder_path + 'improved_amharic_FT_2_generated_summaries.txt'

with open(output_file_path, 'w', encoding='utf-8') as file:
    for summary in summaries:
        file.write(summary + '\n') # newline after each summary

print(f"Summaries saved to {output_file_path}")

Summaries saved to /content/drive/My Drive/CPSC_490_Data/improved_amharic_FT_2_generated_summaries.txt


In [11]:
# If the model has already been evaluated on a given test set, then there is no need to rerun the model evaluations.
# Can simply load the generated summaries as follows and evaluate them

# Path of the file containing the summaries
input_file_path = folder_path + 'improved_amharic_FT_2_generated_summaries.txt'

# List to hold the summaries
summaries = []

# Reading the file and splitting the text into summaries based on newlines
with open(input_file_path, 'r', encoding='utf-8') as file:
    summaries = file.read().split('\n')

# Removing only the empty string produced by the final trailing newline.
# Empty summaries elsewhere are kept so that indices stay aligned with the references
if summaries and summaries[-1] == '':
    summaries.pop()

print(f"Loaded {len(summaries)} summaries from {input_file_path}")

Loaded 9141 summaries from /content/drive/My Drive/CPSC_490_Data/improved_amharic_FT_2_generated_summaries.txt
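
The save and load cells above rely on a one-summary-per-line file, which only stays aligned with the references if every summary occupies exactly one line. The following is a minimal sketch of such a round trip, not the code used to produce the results in this notebook. The helper names `save_lines` and `load_lines` and the temporary file are hypothetical, introduced only for illustration.

```python
import os
import tempfile


def save_lines(path, texts):
    """Write one text per line, flattening embedded newlines to spaces."""
    with open(path, "w", encoding="utf-8") as f:
        for text in texts:
            f.write(" ".join(text.split("\n")) + "\n")


def load_lines(path):
    """Read one text per line, keeping empty lines so indices stay aligned."""
    with open(path, "r", encoding="utf-8") as f:
        lines = f.read().split("\n")
    # Drop only the single empty string produced by the final trailing newline.
    if lines and lines[-1] == "":
        lines.pop()
    return lines


# Toy data: one empty summary and one summary containing a newline.
demo = ["first summary", "", "third\nsummary"]
with tempfile.TemporaryDirectory() as tmp:
    demo_path = os.path.join(tmp, "summaries.txt")
    save_lines(demo_path, demo)
    restored = load_lines(demo_path)

print(len(demo), len(restored))
```

Keeping empty entries matters because the metrics below pair `summaries[i]` with `references[i]`; dropping one empty summary would shift every later pair by one position.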


## Quantitative Evaluations

### ROUGE Scores

In [13]:
# Initializing Rouge scorer
scorer = rouge_scorer.RougeScorer(["rouge1", "rouge2", "rougeLsum"], use_stemmer=True)

# Calculating ROUGE scores; RougeScorer.score expects (target, prediction)
rouge_scores = [
    scorer.score(reference, summary) for summary, reference in zip(summaries, references)
]

rouge1 = sum([score["rouge1"].fmeasure for score in rouge_scores]) / len(rouge_scores) # Average ROUGE-1
rouge2 = sum([score["rouge2"].fmeasure for score in rouge_scores]) / len(rouge_scores) # Average ROUGE-2
rougeLsum = sum([score["rougeLsum"].fmeasure for score in rouge_scores]) / len(rouge_scores) # Average ROUGE-L

print(f"{rouge1 = } {rouge2 = } {rougeLsum = }")

rouge1 = 0.06047660722750977 rouge2 = 0.008381910908986719 rougeLsum = 0.06028516209021626
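
One likely contributor to the low ROUGE values is tokenization. As far as can be told from the `rouge_score` source, its default tokenizer lowercases the text and replaces every character outside `a-z0-9` with a space, so Ge'ez-script words are discarded before matching and `use_stemmer=True` (an English Porter stemmer) has nothing to act on. The sketch below first mimics that normalization with a plain regular expression, then computes a ROUGE-1 F1 over whitespace tokens. It is an illustration under those assumptions, not the `rouge_score` implementation; `rouge1_f1` is a hypothetical helper and the text pair is a toy example adapted from this notebook's outputs.

```python
import re
from collections import Counter


def rouge1_f1(reference, candidate):
    """Unigram-overlap F1 over whitespace tokens (no stemming, no normalization)."""
    ref_counts = Counter(reference.split())
    cand_counts = Counter(candidate.split())
    # Clipped overlap: a token counts at most as often as it appears in both texts.
    overlap = sum((ref_counts & cand_counts).values())
    if overlap == 0:
        return 0.0
    precision = overlap / sum(cand_counts.values())
    recall = overlap / sum(ref_counts.values())
    return 2 * precision * recall / (precision + recall)


reference = "የፌዴራል ቤቶች ኮርፖሬሽን 16 ሺሕ ቤቶች ሊገነባ ነው"
candidate = "የፌዴራል ቤቶች ኮርፖሬሽን 16 ሺህ ቤቶች"

# Assumed equivalent of the default normalization: only ASCII letters and digits survive.
ascii_only = re.sub(r"[^a-z0-9]+", " ", reference.lower()).split()
print(ascii_only)

# Whitespace tokens keep the Amharic words, so the overlap becomes meaningful.
print(rouge1_f1(reference, candidate))
```

If this assumption holds for the installed version, passing a custom tokenizer to `RougeScorer` (recent releases accept a `tokenizer` argument) would be one way to score Amharic text on its actual words.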


### BLEU Score

In [14]:
# Prepare the reference and candidate sentences for BLEU
list_of_references = [[ref.split()] for ref in references]  # BLEU expects a list of lists of tokens
candidates = [summary.split() for summary in summaries]

# Compute the BLEU-4 score using the NLTK function
bleu_score = corpus_bleu(list_of_references, candidates,
                         smoothing_function=SmoothingFunction().method1)

print(f"BLEU Score: {bleu_score}")

BLEU Score: 0.09536761472524893
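
`corpus_bleu` combines clipped n-gram precisions (for n = 1 to 4 by default) with a brevity penalty that punishes candidates shorter than their references. The sketch below shows those two ingredients on a single whitespace-tokenized pair. It is a simplified illustration, not NLTK's implementation: it omits smoothing and corpus-level pooling, and the helper names are hypothetical.

```python
import math
from collections import Counter


def ngrams(tokens, n):
    """Return all contiguous n-grams of a token list as tuples."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


def clipped_precision(reference, candidate, n):
    """Modified n-gram precision: candidate counts are clipped by reference counts."""
    cand = Counter(ngrams(candidate, n))
    ref = Counter(ngrams(reference, n))
    total = sum(cand.values())
    if total == 0:
        return 0.0
    return sum((cand & ref).values()) / total


def brevity_penalty(ref_len, cand_len):
    """1.0 if the candidate is longer than the reference, exp(1 - r/c) otherwise."""
    if cand_len == 0:
        return 0.0
    return 1.0 if cand_len > ref_len else math.exp(1 - ref_len / cand_len)


reference = "the cat sat on the mat".split()
candidate = "the the the cat".split()

p1 = clipped_precision(reference, candidate, 1)  # "the" is credited at most twice
p2 = clipped_precision(reference, candidate, 2)
bp = brevity_penalty(len(reference), len(candidate))
print(p1, p2, bp)
```

Clipping is what stops a candidate from scoring well by repeating a common word, and the brevity penalty is what stops very short candidates from scoring well on precision alone.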


### BERTScores (Precision, Recall and F1)

In [15]:
# The mT5-small model and its embeddings are used for the BERTScore calculations
model_name = "google/mt5-small"

# Calculating BERTScore; score expects (candidates, references)
P, R, F1 = score(summaries, references, model_type=model_name)

# Computing the average scores across all summaries as plain floats
avg_precision = P.mean().item()
avg_recall = R.mean().item()
avg_f1 = F1.mean().item()

print(f"Average Precision: {avg_precision}")
print(f"Average Recall: {avg_recall}")
print(f"Average F1 Score: {avg_f1}")

You are using a model of type mt5 to instantiate a model of type t5. This is not supported for all configurations of models and can yield errors.


Average Precision: 0.6450784802436829
Average Recall: 0.681578516960144
Average F1 Score: 0.6599487662315369
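
BERTScore matches each token embedding greedily to its most similar token in the other text by cosine similarity: precision averages the best matches over candidate tokens, recall over reference tokens. The sketch below illustrates this with toy 2-D vectors standing in for contextual embeddings. It is a simplified illustration, not the `bert_score` implementation: it omits IDF weighting and baseline rescaling, and `greedy_prf` is a hypothetical helper.

```python
import math


def cosine(u, v):
    """Cosine similarity between two non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm


def greedy_prf(cand_vecs, ref_vecs):
    """Greedy-matching precision, recall and F1 over token vectors."""
    precision = sum(max(cosine(c, r) for r in ref_vecs) for c in cand_vecs) / len(cand_vecs)
    recall = sum(max(cosine(r, c) for c in cand_vecs) for r in ref_vecs) / len(ref_vecs)
    if precision + recall == 0:
        return precision, recall, 0.0
    return precision, recall, 2 * precision * recall / (precision + recall)


# Toy case: the candidate has one token matching the reference and one unrelated token.
cand = [(1.0, 0.0), (0.0, 1.0)]
ref = [(1.0, 0.0)]
p, r, f = greedy_prf(cand, ref)
print(p, r, f)
```

In this toy case the extra candidate token lowers precision while recall stays at its maximum, which is the same pattern expected when a generated summary is much longer than its reference.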


In [19]:
# Displaying some example summaries along with their corresponding BERTScores
random.seed(16)

for i in range(5):
    index = random.randint(0, len(summaries) - 1)

    print(f"Reference Summary: {references[index]}")
    print(f"Generated Summary: {summaries[index]}")
    print(f"Precision: {P[index].item()}, Recall: {R[index].item()}, F1 Score: {F1[index].item()}") # .item() converts each tensor entry to a plain float
    print("========================================")

Reference Summary: አዲስ ራዕይ የተሰኘ መጽሄት በሁሉም የመንግስት መቤት ሰራተኞች ዘንድ በግዳጅ እየተሸጠ ነው
Generated Summary: የስራ ታሪክ ዙሪያ የሚያጠነጥን አዲስ ራእይ የተሰኘ መፅሄት በሁሉም የመንግስት መቤት ሰራተኞች ዘንድ በግዳጅ በ ብር እየተሸጠ ነው
Precision: 0.8412916660308838, Recall: 0.9492299556732178, F1 Score: 0.8920074105262756
Reference Summary:              የሰ ኮርያው ም ጠቅላይ ሚኒስትር በፕሬዝዳንቱ ትዕዛዝ መገደላቸው ተነገረ       
Generated Summary: የፕሬዚዳንቱ ያስገደሏቸው ባለስልጣናት ደርሰዋል ተብሏል
Precision: 0.6628365516662598, Recall: 0.6070913076400757, F1 Score: 0.6337404251098633
Reference Summary: የፌዴራል ቤቶች ኮርፖሬሽን 16 ሺሕ ቤቶች ሊገነባ ነው

Generated Summary: የፌዴራል ቤቶች ኮርፖሬሽን በሶስት አመታት 16 ሺህ ቤቶች መገንባት የሚያስችለውን እቅድ ለማሳካት የአፈር ምርመራ የዲዛይን ክለሳና አዳዲስ ዲዛይን ከሚሰሩለት ኩባንያዎች ጋር ስምምነት ፈፀመ
Precision: 0.5892134308815002, Recall: 0.8598308563232422, F1 Score: 0.6992524266242981
Reference Summary: የወረታ ወደብና ተርሚናል ባለፉት 10 ወራት 5 ሺህ 58 ቲኢዩ ኮንቴነር አስተናገደ
Generated Summary: የኢትዮጵያ ክፍል የገቢና ወጪ እቃዎች የተሟላ የሎጅስቲክስ አገልግሎት ለመስጠት በአማራ ክልል ደቡብ ጎንደር ፎገራ ወረዳ የሚገኘው የወረታ ወደብና ተርሚናል የተሻለ ውጤት እያስመዘገበ ይገኛል
Precision: 0.