# Evaluation metrics



## References:
https://github.com/huggingface/evaluate

https://github.com/huggingface/evaluate/tree/main/metrics/bleu

https://github.com/huggingface/evaluate/tree/main/metrics/rouge

https://github.com/huggingface/evaluate/tree/main/metrics/meteor

https://www.nltk.org/index.html


### In this notebook:
Predictions are read from the `.json` files in the following Google Drive folder:

https://drive.google.com/drive/folders/1XVaIzVrzgTUyn2L8Au24zISSxWVGzO8z?usp=drive_link

Ground truths are taken from:

https://drive.google.com/drive/folders/1jo8cml09c-7V6QAaGPIYwvK1yWiFQzM0?usp=drive_link

The following metrics are computed against our manually annotated references:
- BLEU
- ROUGE
- METEOR

# Packages and Imports

In [None]:
# Install required packages
!pip install evaluate
!pip install nltk     # usually preinstalled; version 3.9.1 works
# nltk >= 3.9.0 is recommended so that METEOR works properly (see the meteor implementation in evaluate)
# To check the installed version: !pip show nltk
!pip install rouge_score

In [None]:
# Download the tokenizer data required by word_tokenize
import nltk
nltk.download('punkt_tab')  # word_tokenize is passed to BLEU and ROUGE below
from nltk.tokenize import word_tokenize

import evaluate

# Data preparation

In [22]:
# Read the file and process it line by line
with open('ground_truth_3_versions.txt', 'r') as file:
    lines = file.readlines()

# Build the references structure
references = []

for line in lines:
    # Split the line on semicolons;
    # strip() removes leading and trailing whitespace
    versions = [ref.strip() for ref in line.split(';')]

    # For each line, build a list with the three versions,
    # each version wrapped in its own single-element list (see the example below)
    reference_group = [[version] for version in versions]

    # Append the reference group to the main list
    references.append(reference_group)

# At this point, references has a structure like:
# [
#     [["version1_line1"], ["version2_line1"], ["version3_line1"]],
#     [["version1_line2"], ["version2_line2"], ["version3_line2"]],
#     ...
# ]

print(references)

[[['I saw the socket on the wall in the corridor next to the heater'], ['You saw the socket on the wall in the corridor next to the heater'], ['on the wall in the corridor next to the heater']], [['I chopped broccoli'], ['You chopped broccoli'], ['broccoli']], [['I saw the carton on the second shelf next to the plastic parts organizer'], ['You saw the carton on the second shelf next to the plastic parts organizer'], ['on the second shelf next to the plastic parts organizer']], [['I put the spice in the pot'], ['You put the spice in the pot'], ['spice']], [['I put eight green peppers in the minced meat'], ['You put eight green peppers in the minced meat'], ['eight green peppers']], [['I opened six drawers'], ['You opened six drawers'], ['six drawers']], [['I first used an impact wrench on the machine'], ['You first used an impact wrench on the machine'], ['impact wrench']], [['The genre book was in the upper second shelf'], ['The genre book was in the upper second shelf'], ['upper secon

In [23]:
# Build a list with the first version of every group
first_version = [group[0][0] for group in references]
print(first_version)
print(len(first_version))

# Build a list with the second version of every group
second_version = [group[1][0] for group in references]
print(second_version)
print(len(second_version))

# Build a list with the third version of every group
third_version = [group[2][0] for group in references]
print(third_version)
print(len(third_version))

['I saw the socket on the wall in the corridor next to the heater', 'I chopped broccoli', 'I saw the carton on the second shelf next to the plastic parts organizer', 'I put the spice in the pot', 'I put eight green peppers in the minced meat', 'I opened six drawers', 'I first used an impact wrench on the machine', 'The genre book was in the upper second shelf', 'I first hit the tent stick into the ground', 'I put oil in the dough in the mixer', 'I spread six slices of bread with peanut butter', 'The Sofa is beige', 'Yes, I wiped the kitchen counter', 'No, I closed the car door', 'I washed the white chopping board', 'The plate was in the lower second shelf of the cupboard', 'The egg was in the fridge', 'I took a coke', 'I talked to a man', 'I took two plates', 'I saw it on a shelf next to the window', 'I washed socks', 'The transparent keg was on the shelf', 'I rolled 2 doughs', 'The round brush was on the floor', 'I arranged some kitchen tools in the dishwasher racks', 'The phone was i

In [24]:
# Check the sizes at each level
# Total number of reference groups
numero_gruppi = len(references)
print(f"Total number of reference groups: {numero_gruppi}")

# Check that every group has 3 references
numero_reference_per_gruppo = [len(gruppo) for gruppo in references]
print("\nNumber of references in each group:")
print(numero_reference_per_gruppo)

# Count how many groups have exactly 3 references
gruppi_con_tre_reference = sum(1 for gruppo in references if len(gruppo) == 3)
print(f"\nGroups with exactly 3 references: {gruppi_con_tre_reference}")

Total number of reference groups: 50

Number of references in each group:
[3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3]

Groups with exactly 3 references: 50
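
The checks above confirm 50 groups of three references, but each reference is still wrapped in its own single-element list. As far as the `evaluate` metric cards describe it, the multi-reference input is a list of lists of strings, one inner list per prediction. The combined METEOR score printed later in this notebook (0.315) is lower than the second-version score (0.538), which may indicate that the nested lists are not being read as three separate references. The following is a minimal sketch of a flattening step; `flatten_reference_groups` is a hypothetical helper, not part of the original notebook.

```python
from typing import List


def flatten_reference_groups(groups: List[List[List[str]]]) -> List[List[str]]:
    """Turn [[["a"], ["b"], ["c"]], ...] into [["a", "b", "c"], ...]."""
    return [[version[0] for version in group] for group in groups]


# Two groups copied from the printed `references` output above
nested = [
    [["I chopped broccoli"], ["You chopped broccoli"], ["broccoli"]],
    [["I opened six drawers"], ["You opened six drawers"], ["six drawers"]],
]

flat_references = flatten_reference_groups(nested)
print(flat_references)
```

With the real data, `flatten_reference_groups(references)` could be passed as `references=` in the metric cells instead of the nested structure; whether this changes the combined scores would need to be verified by rerunning them.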


In [54]:
import json

# Read the JSON file
with open('LongVU_predictions.json', 'r') as file:
    # Parse the JSON content into a Python object
    data = json.load(file)

# Build the list of predictions:
# for each element in the JSON, extract the "prediction" field
predictions = [item['prediction'] for item in data]

# At this point, predictions is a list of strings, where each string
# is one prediction from the JSON file

# To check that the structure is correct, print
# the first few predictions as an example
print("First 3 predictions:")
for i, pred in enumerate(predictions[:3], 1):
    print(f"{i}. {pred}")

# Also print the total length of the list as a check
print(f"\nTotal number of predictions: {len(predictions)}")

print(predictions)

First 3 predictions:
1. You saw the socket on the wall.
2. You chopped spring onions as the vegetables in the video.
3. You saw a carton on the table.

Total number of predictions: 50
['You saw the socket on the wall.', 'You chopped spring onions as the vegetables in the video.', 'You saw a carton on the table.', 'You added a sachet of seasoning to the pot.', 'You put minced meat in a total of five green peppers.', 'You opened a total of five drawers.', 'The first tool that you used on the machine was a drill.', 'Before I picked it up, the genre book was on the shelf.', 'The first object that you hit into the ground was a tent.', 'You added oil to the dough in the mixer.', 'You spread peanut butter on two slices of bread.', 'The sofa in the video appears to be blue.', 'You did wipe the kitchen counter.', 'It appears that you did leave the car door open.', 'You washed the chopping board that was placed on the sink.', 'Before you took it, the plate was placed on the cabine
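
The metrics below pair predictions and references by position, so both lists need the same length and the same order. A small sketch of an explicit check follows; `check_alignment` is a hypothetical helper, and the sample strings are copied from the outputs above.

```python
def check_alignment(predictions, references):
    """Return (prediction, reference) pairs, or raise if the lengths differ."""
    if len(predictions) != len(references):
        raise ValueError(
            f"{len(predictions)} predictions but {len(references)} references"
        )
    return list(zip(predictions, references))


sample_predictions = [
    "You saw the socket on the wall.",
    "You chopped spring onions as the vegetables in the video.",
]
sample_references = [
    "You saw the socket on the wall in the corridor next to the heater",
    "You chopped broccoli",
]

pairs = check_alignment(sample_predictions, sample_references)
for pred, ref in pairs:
    print(f"prediction: {pred}\nreference:  {ref}\n")
```

Printing a few pairs side by side also makes it easy to spot an ordering mismatch that a length check alone would miss.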

# BLEU

In [55]:
bleu = evaluate.load("bleu")
# On all 3 versions of the ground truths
results_nltk = bleu.compute(predictions=predictions, references=references, tokenizer=word_tokenize)
print(f"All 3 versions results with nltk tokenizer: {results_nltk}")

results = bleu.compute(predictions=predictions, references=references)
print(f"All 3 versions results with default tokenizer: {results}")

# Only first version
results_nltk = bleu.compute(predictions=predictions, references=first_version, tokenizer=word_tokenize)
print(f"1st version with nltk tokenizer: {results_nltk}")

results = bleu.compute(predictions=predictions, references=first_version)
print(f"1st version with default tokenizer: {results}")

# Only second version
results_nltk = bleu.compute(predictions=predictions, references=second_version, tokenizer=word_tokenize)
print(f"2nd version with nltk tokenizer: {results_nltk}")

results = bleu.compute(predictions=predictions, references=second_version)
print(f"2nd version with default tokenizer: {results}")

# Only third version
results_nltk = bleu.compute(predictions=predictions, references=third_version, tokenizer=word_tokenize)
print(f"3rd version with nltk tokenizer: {results_nltk}")

results = bleu.compute(predictions=predictions, references=third_version)
print(f"3rd version with default tokenizer: {results}")

All 3 versions results with nltk tokenizer: {'bleu': 0.1091015437297222, 'precisions': [0.35892514395393477, 0.18683651804670912, 0.07838479809976247, 0.026954177897574125], 'brevity_penalty': 1.0, 'length_ratio': 1.6806451612903226, 'translation_length': 521, 'reference_length': 310}
All 3 versions results with default tokenizer: {'bleu': 0.0786332523025659, 'precisions': [0.30518234165067176, 0.13588110403397027, 0.04275534441805225, 0.0215633423180593], 'brevity_penalty': 1.0, 'length_ratio': 2.0193798449612403, 'translation_length': 521, 'reference_length': 258}
1st version with nltk tokenizer: {'bleu': 0.1088086510373205, 'precisions': [0.3550863723608445, 0.18683651804670912, 0.07838479809976247, 0.026954177897574125], 'brevity_penalty': 1.0, 'length_ratio': 1.3567708333333333, 'translation_length': 521, 'reference_length': 384}
1st version with default tokenizer: {'bleu': 0.1088086510373205, 'precisions': [0.3550863723608445, 0.18683651804670912, 0.07838479809976247, 0.026954177
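
To make the fields in these dictionaries easier to read, the sketch below recomputes the headline score from the other reported values, using the standard BLEU definition (brevity penalty times the geometric mean of the n-gram precisions). The numbers are copied from the "1st version with nltk tokenizer" line; the function names are illustrative and not taken from `evaluate`.

```python
import math

# Values copied from the "1st version with nltk tokenizer" output above
precisions = [
    0.3550863723608445,
    0.18683651804670912,
    0.07838479809976247,
    0.026954177897574125,
]
translation_length = 521
reference_length = 384


def brevity_penalty(candidate_len: int, reference_len: int) -> float:
    """1.0 when the candidate is longer than the reference, else exp(1 - r/c)."""
    if candidate_len > reference_len:
        return 1.0
    if candidate_len == 0:
        return 0.0
    return math.exp(1.0 - reference_len / candidate_len)


def bleu_from_parts(precisions, candidate_len, reference_len) -> float:
    """Brevity penalty times the geometric mean of the n-gram precisions."""
    if min(precisions) <= 0:
        return 0.0
    geo_mean = math.exp(sum(math.log(p) for p in precisions) / len(precisions))
    return brevity_penalty(candidate_len, reference_len) * geo_mean


bp = brevity_penalty(translation_length, reference_length)
length_ratio = translation_length / reference_length
bleu_score = bleu_from_parts(precisions, translation_length, reference_length)
print(f"brevity_penalty={bp}, length_ratio={length_ratio}, bleu={bleu_score}")
```

Because the predictions are longer than the references in total (521 tokens against 384), the brevity penalty stays at 1.0 and the score is determined by the n-gram precisions alone.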

# ROUGE

In [56]:
rouge = evaluate.load('rouge')
# On all 3 versions of the ground truths
results_nltk = rouge.compute(predictions=predictions, references=references, tokenizer=word_tokenize)
print(f"All 3 versions results with nltk tokenizer: {results_nltk}")

results = rouge.compute(predictions=predictions, references=references)
print(f"All 3 versions results with default tokenizer: {results}")

# Only first version
results_nltk = rouge.compute(predictions=predictions, references=first_version, tokenizer=word_tokenize)
print(f"1st version with nltk tokenizer: {results_nltk}")

results = rouge.compute(predictions=predictions, references=first_version)
print(f"1st version with default tokenizer: {results}")

# Only second version
results_nltk = rouge.compute(predictions=predictions, references=second_version, tokenizer=word_tokenize)
print(f"2nd version with nltk tokenizer: {results_nltk}")

results = rouge.compute(predictions=predictions, references=second_version)
print(f"2nd version with default tokenizer: {results}")

# Only third version
results_nltk = rouge.compute(predictions=predictions, references=third_version, tokenizer=word_tokenize)
print(f"3rd version with nltk tokenizer: {results_nltk}")

results = rouge.compute(predictions=predictions, references=third_version)
print(f"3rd version with default tokenizer: {results}")

All 3 versions results with nltk tokenizer: {'rouge1': 0.34215760013008645, 'rouge2': 0.1735231361584976, 'rougeL': 0.320302858024503, 'rougeLsum': 0.31965724773822773}
All 3 versions results with default tokenizer: {'rouge1': 0.5373922434746571, 'rouge2': 0.2985051668317923, 'rougeL': 0.5096856600795894, 'rougeLsum': 0.508220382386791}
1st version with nltk tokenizer: {'rouge1': 0.3935362978259531, 'rouge2': 0.2034134440476142, 'rougeL': 0.3684354221831472, 'rougeLsum': 0.36745269854657375}
1st version with default tokenizer: {'rouge1': 0.44108238025864865, 'rouge2': 0.2449081012726524, 'rougeL': 0.41716375757548857, 'rougeLsum': 0.4166496593175151}
2nd version with nltk tokenizer: {'rouge1': 0.47298329806775086, 'rouge2': 0.254158756866591, 'rougeL': 0.44730131552986563, 'rougeLsum': 0.44547697424096666}
2nd version with default tokenizer: {'rouge1': 0.5373922434746571, 'rouge2': 0.2985051668317923, 'rougeL': 0.5096856600795894, 'rougeLsum': 0.508220382386791}
3rd version with nltk t
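
The scores above differ between the two tokenizers. One plausible contributor is punctuation: a tokenizer that keeps the final period as its own token lengthens the prediction and lowers unigram precision. The sketch below illustrates this with a simplified ROUGE-1 F1 on the first prediction and its second-version reference. Both regex tokenizers are stand-ins written for this example (assumptions, not the `rouge_score` or NLTK implementations), and no stemming or aggregation is applied.

```python
import re
from collections import Counter


def alnum_tokens(text: str) -> list:
    """Lowercase and keep only alphanumeric runs (punctuation is dropped)."""
    return re.findall(r"[a-z0-9]+", text.lower())


def punct_tokens(text: str) -> list:
    """Lowercase and keep punctuation marks as separate tokens."""
    return re.findall(r"\w+|[^\w\s]", text.lower())


def rouge1_f1(pred_tokens: list, ref_tokens: list) -> float:
    """Unigram-overlap F1 with clipped counts."""
    overlap = sum((Counter(pred_tokens) & Counter(ref_tokens)).values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(pred_tokens)
    recall = overlap / len(ref_tokens)
    return 2 * precision * recall / (precision + recall)


prediction = "You saw the socket on the wall."
reference = "You saw the socket on the wall in the corridor next to the heater"

f1_without_punct = rouge1_f1(alnum_tokens(prediction), alnum_tokens(reference))
f1_with_punct = rouge1_f1(punct_tokens(prediction), punct_tokens(reference))
print(f1_without_punct, f1_with_punct)
```

In this example the unmatched period adds one token to the prediction, so precision drops from 7/7 to 7/8 while recall stays at 7/14.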

# METEOR

In [57]:
# METEOR uses the NLTK tokenizer by default
meteor = evaluate.load('meteor')

# On all 3 versions of the ground truths
results = meteor.compute(predictions=predictions, references=references)
print(f"All 3 versions results: {results}")

# Only first version
results = meteor.compute(predictions=predictions, references=first_version)
print(f"1st version: {results}")

# Only second version
results = meteor.compute(predictions=predictions, references=second_version)
print(f"2nd version: {results}")

# Only third version
results = meteor.compute(predictions=predictions, references=third_version)
print(f"3rd version: {results}")

[nltk_data] Downloading package wordnet to /root/nltk_data...
[nltk_data]   Package wordnet is already up-to-date!
[nltk_data] Downloading package punkt_tab to /root/nltk_data...
[nltk_data]   Package punkt_tab is already up-to-date!
[nltk_data] Downloading package omw-1.4 to /root/nltk_data...
[nltk_data]   Package omw-1.4 is already up-to-date!


All 3 versions results: {'meteor': 0.31492430032446866}
1st version: {'meteor': 0.42624108846913894}
2nd version: {'meteor': 0.537726312518075}
3rd version: {'meteor': 0.2222870905478511}
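
The BLEU, ROUGE, and METEOR cells each repeat the same call once per reference set. A sketch of a loop that could replace the repetition is shown below. `score_all` is a hypothetical helper, and `exact_match` is a toy stand-in metric so that the example runs without downloading anything; the strings are toy data, not the notebook's.

```python
from typing import Callable, Dict, List


def score_all(
    compute: Callable[..., dict],
    predictions: List[str],
    reference_sets: Dict[str, list],
) -> Dict[str, dict]:
    """Call compute(predictions=..., references=...) once per named reference set."""
    return {
        name: compute(predictions=predictions, references=refs)
        for name, refs in reference_sets.items()
    }


def exact_match(predictions, references):
    """Toy metric: fraction of predictions identical to their reference."""
    hits = sum(p == r for p, r in zip(predictions, references))
    return {"exact_match": hits / len(predictions)}


toy_predictions = ["You chopped broccoli", "You opened five drawers"]
toy_reference_sets = {
    "1st version": ["I chopped broccoli", "I opened six drawers"],
    "2nd version": ["You chopped broccoli", "You opened six drawers"],
}

scores = score_all(exact_match, toy_predictions, toy_reference_sets)
for name, result in scores.items():
    print(f"{name}: {result}")
```

In the notebook, the same helper could be called as `score_all(meteor.compute, predictions, {"1st version": first_version, "2nd version": second_version, "3rd version": third_version})`, since the metric objects loaded above accept `predictions` and `references` as keyword arguments.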
