# Evaluation metrics

## References:
https://github.com/huggingface/evaluate

https://github.com/huggingface/evaluate/tree/main/metrics/bleu

https://github.com/huggingface/evaluate/tree/main/metrics/rouge

https://github.com/huggingface/evaluate/tree/main/metrics/meteor

https://www.nltk.org/index.html


### In this notebook:
Predictions are read from the .json files in the following Google Drive folder:

https://drive.google.com/drive/folders/1XVaIzVrzgTUyn2L8Au24zISSxWVGzO8z?usp=drive_link

The predictions are scored against our manually annotated references using the following metrics:
- BLEU
- ROUGE
- METEOR

In [1]:
# Read the file and process it line by line
with open('ground_truth_3versions.txt', 'r') as file:
    lines = file.readlines()

# Build the references structure
references = []

for line in lines:
    # Split the line on semicolons;
    # strip() removes leading and trailing whitespace
    versions = [ref.strip() for ref in line.split(';')]

    # For each line, build a list with the three versions;
    # each version is wrapped in its own single-item list, as in the example below
    reference_group = [[version] for version in versions]

    # Add the reference group to the main list
    references.append(reference_group)

# At this point, references has a structure like:
# [
#     [["version1_line1"], ["version2_line1"], ["version3_line1"]],
#     [["version1_line2"], ["version2_line2"], ["version3_line2"]],
#     ...
# ]

print(references)

[[['The socket is on the wall in the corridor next to the heater'], ['I saw the socket on the wall in the corridor next to the heater'], ['on the wall in the corridor next to the heater']], [['You chopped broccoli'], ['I chopped broccoli'], ['broccoli']], [['You saw the carton on the second shelf next to the plastic parts organizer'], ['I saw the carton on the second shelf next to the plastic parts organizer'], ['on the second shelf next to the plastic parts organizer']], [['You put the spice in the pot'], ['I put the spice in the pot'], ['spice']], [['You put eight green peppers in the minced meat'], ['I put eight green peppers in the minced meat'], ['eight green peppers']], [['You opened six drawers'], ['I opened six drawers'], ['six drawers']], [['You first used an impact wrench on the machine'], ['I first used an impact wrench on the machine'], ['impact wrench']], [['The genre book was in the upper second shelf'], ['The genre book was in the upper second shelf'], ['upper second she
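
The `evaluate` metric cards for BLEU, ROUGE and METEOR show multi-reference input as one list of plain strings per prediction. The following is a minimal sketch, assuming that flat layout is wanted, of converting the nested groups built above. The names `nested_references` and `flat_references` are illustrative and do not appear in the notebook, and the two sample groups are copied from the printed output.

```python
# Nested structure as built above: each version sits in its own single-item list.
nested_references = [
    [["You chopped broccoli"], ["I chopped broccoli"], ["broccoli"]],
    [["You opened six drawers"], ["I opened six drawers"], ["six drawers"]],
]

# Unwrap each single-item list so every group becomes a list of plain strings.
flat_references = [[version[0] for version in group] for group in nested_references]

print(flat_references)
```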

In [2]:
# Check the sizes at each level
# Level 1: total number of reference groups
numero_gruppi = len(references)
print(f"Total number of reference groups: {numero_gruppi}")

# We can also check that every group has 3 references
# using a list comprehension
numero_reference_per_gruppo = [len(gruppo) for gruppo in references]
print("\nNumber of references in each group:")
print(numero_reference_per_gruppo)

# To check that the structure is consistent, count
# how many groups have exactly 3 references
gruppi_con_tre_reference = sum(1 for gruppo in references if len(gruppo) == 3)
print(f"\nGroups with exactly 3 references: {gruppi_con_tre_reference}")

Total number of reference groups: 50

Number of references in each group:
[3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3]

Groups with exactly 3 references: 50


In [3]:
import json

# Read the JSON file
with open('VideoLlava_predictions.json', 'r') as file:
    # Parse the JSON content into a Python object
    data = json.load(file)

# Build the list of predictions
# For each item in the JSON, extract the "prediction" field
predictions = []
for item in data:
    # Extract the prediction and append it to the list
    prediction = item['prediction']
    predictions.append(prediction)

# At this point, predictions is a list of strings, where each string
# is a prediction from the JSON file

# To check that the structure is correct, print
# the first few predictions as an example
print("Example of the first 3 predictions:")
for i, pred in enumerate(predictions[:3], 1):
    print(f"{i}. {pred}")

# Also print the total length of the list as a check
print(f"\nTotal number of predictions: {len(predictions)}")

print(predictions)

Example of the first 3 predictions:
1. I saw a socket in the image.
2. You chopped a head of broccoli.
3. I saw a carton in the kitchen.

Total number of predictions: 50
['I saw a socket in the image.', 'You chopped a head of broccoli.', 'I saw a carton in the kitchen.', 'You put a piece of bread in the pot.', 'There are three green peppers in the image.', 'You opened two drawers.', 'The tool you used first on the machine was a screwdriver.', 'The genre book was on a shelf in a bookstore before you picked it up.', 'The first object you hit into the ground is a tent stake.', 'You put flour in the dough in the mixer.', 'I spread peanut butter on two slices of bread.', 'The sofa is blue.', 'Yes, you wiped the kitchen counter.', 'Yes, the car door is open.', 'You washed a chopping board in the sink.', 'The plate was in the refrigerator before you took it out.', 'The egg was in the refrigerator before you picked it.', 'You took a can of soda.', 'In the house, I talked to a woman who was sit
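
Both lists hold 50 entries, and the metrics pair them by position. The following is a minimal sketch of an explicit alignment check that could run before scoring. The helper `check_alignment` and its `expected_refs` parameter are hypothetical names introduced for illustration, and the sample data is copied from the printed outputs.

```python
def check_alignment(predictions, references, expected_refs=3):
    """Return (prediction, reference_group) pairs, or raise if the inputs do not line up."""
    if len(predictions) != len(references):
        raise ValueError(
            f"{len(predictions)} predictions but {len(references)} reference groups"
        )
    # Collect the positions of groups that do not have the expected number of versions.
    bad = [i for i, group in enumerate(references) if len(group) != expected_refs]
    if bad:
        raise ValueError(f"groups without {expected_refs} references: {bad}")
    return list(zip(predictions, references))


sample_predictions = ["You chopped a head of broccoli.", "You opened two drawers."]
sample_references = [
    [["You chopped broccoli"], ["I chopped broccoli"], ["broccoli"]],
    [["You opened six drawers"], ["I opened six drawers"], ["six drawers"]],
]

pairs = check_alignment(sample_predictions, sample_references)
for prediction, group in pairs:
    print(prediction, "<->", group)
```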

In [None]:
# Install required packages
!pip install evaluate
!pip install nltk     # usually preinstalled already; version 3.9.1 works
# METEOR works best with nltk >= 3.9.0 (see the meteor implementation); to check the installed version:
#!python -c "import nltk; print(nltk.__version__)"
!pip install rouge_score
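
The version note above can also be checked from Python. The following is a minimal sketch using only the standard library. The helpers `version_tuple` and `installed_version` and the constant `MINIMUM_NLTK` are illustrative names, and the 3.9.0 threshold is taken from the comment in the install cell.

```python
import re
from importlib import metadata

MINIMUM_NLTK = "3.9.0"  # threshold taken from the comment in the install cell


def version_tuple(text):
    """Turn a version string such as '3.9.1' into a comparable tuple (3, 9, 1)."""
    numbers = [int(n) for n in re.findall(r"\d+", text)[:3]]
    # Pad short versions such as '3.9' with zeros so tuples compare correctly.
    return tuple(numbers + [0] * (3 - len(numbers)))


def installed_version(package):
    """Return the installed version string of a package, or None if it is missing."""
    try:
        return metadata.version(package)
    except metadata.PackageNotFoundError:
        return None


found = installed_version("nltk")
if found is None:
    print("nltk is not installed")
elif version_tuple(found) >= version_tuple(MINIMUM_NLTK):
    print(f"nltk {found} meets the minimum {MINIMUM_NLTK}")
else:
    print(f"nltk {found} is older than {MINIMUM_NLTK}")
```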

In [6]:
# Download the tokenizer data
import nltk
nltk.download('punkt_tab')  # required by word_tokenize, used below for BLEU and ROUGE
from nltk.tokenize import word_tokenize

import evaluate

[nltk_data] Downloading package punkt_tab to /root/nltk_data...
[nltk_data]   Unzipping tokenizers/punkt_tab.zip.


# BLEU

In [7]:
bleu = evaluate.load("bleu")
results_nltk = bleu.compute(predictions=predictions, references=references, tokenizer=word_tokenize)
print(f"results with nltk tokenizer: {results_nltk}")

results = bleu.compute(predictions=predictions, references=references)
print(f"results with default tokenizer: {results}")


results with nltk tokenizer: {'bleu': 0.1768464560302683, 'precisions': [0.39215686274509803, 0.23478260869565218, 0.13658536585365855, 0.07777777777777778], 'brevity_penalty': 1.0, 'length_ratio': 1.6451612903225807, 'translation_length': 510, 'reference_length': 310}
results with default tokenizer: {'bleu': 0.1316224824072968, 'precisions': [0.3241650294695481, 0.17429193899782136, 0.09535452322738386, 0.055710306406685235], 'brevity_penalty': 1.0, 'length_ratio': 1.9728682170542635, 'translation_length': 509, 'reference_length': 258}
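
The fields in the result dictionary relate to each other through the standard BLEU definition: the score is the brevity penalty times the geometric mean of the four n-gram precisions. The following is a minimal sketch that recomputes the NLTK-tokenizer score from the reported values. The numbers are copied from the output above, while the function name `brevity_penalty` is illustrative.

```python
import math

# Values copied from the NLTK-tokenizer result above.
precisions = [0.39215686274509803, 0.23478260869565218,
              0.13658536585365855, 0.07777777777777778]
translation_length = 510
reference_length = 310


def brevity_penalty(candidate_len, reference_len):
    """Penalize candidates shorter than the reference; longer ones get no penalty."""
    if candidate_len > reference_len:
        return 1.0
    return math.exp(1 - reference_len / candidate_len)


# Geometric mean of the 1- to 4-gram precisions, with uniform weights.
geometric_mean = math.exp(sum(math.log(p) for p in precisions) / len(precisions))

bleu_score = brevity_penalty(translation_length, reference_length) * geometric_mean
print(round(bleu_score, 4))  # 0.1768
```

Because the predictions are longer than the references (length ratio about 1.65), the brevity penalty is 1.0 and the score equals the geometric mean of the precisions.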


# ROUGE

In [8]:
rouge = evaluate.load('rouge')
results_nltk = rouge.compute(predictions=predictions, references=references, tokenizer=word_tokenize)
print(f"results with nltk tokenizer: {results_nltk}")

results = rouge.compute(predictions=predictions, references=references)
print(f"results with default tokenizer: {results}")

results with nltk tokenizer: {'rouge1': 0.37532963186298296, 'rouge2': 0.22339822222766087, 'rougeL': 0.35875905354628557, 'rougeLsum': 0.36077900800251705}
results with default tokenizer: {'rouge1': 0.5692449474180864, 'rouge2': 0.38879436731139516, 'rougeL': 0.5432874935223164, 'rougeLsum': 0.5452823830459783}
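
To make the `rouge1` figure more concrete, the following is a simplified sketch of a ROUGE-1 F1 score for a single prediction and a single reference. It assumes lowercased whitespace tokenization, which is not the tokenization the `rouge_score` package applies, so it will not reproduce the numbers above. The function name `rouge1_f1` is illustrative.

```python
from collections import Counter


def rouge1_f1(prediction, reference):
    """Unigram-overlap F1 between two strings (simplified: lowercase, split on spaces)."""
    pred_counts = Counter(prediction.lower().split())
    ref_counts = Counter(reference.lower().split())
    # Counter intersection keeps, for each token, the smaller of the two counts.
    overlap = sum((pred_counts & ref_counts).values())
    if overlap == 0:
        return 0.0
    precision = overlap / sum(pred_counts.values())
    recall = overlap / sum(ref_counts.values())
    return 2 * precision * recall / (precision + recall)


# Three of the four tokens match, so precision, recall and F1 are all 0.75.
print(rouge1_f1("You opened two drawers", "You opened six drawers"))
```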


# METEOR

In [9]:
# METEOR uses the NLTK tokenizer by default
meteor = evaluate.load('meteor')
results = meteor.compute(predictions=predictions, references=references)
print(f"results: {results}")

[nltk_data] Downloading package wordnet to /root/nltk_data...
[nltk_data] Downloading package punkt_tab to /root/nltk_data...
[nltk_data]   Package punkt_tab is already up-to-date!
[nltk_data] Downloading package omw-1.4 to /root/nltk_data...


results: {'meteor': 0.3510532754509735}
