# Metrics of the models

This notebook builds a table that compares evaluation metrics across the models that have been benchmarked inside `coco_dataset`.
To run this notebook successfully, we advise using a `Conda` virtual environment so the notebook has access to the required dependencies.

### Importing results files 

In [153]:
files_path = './coco_dataset'

In [154]:
import pandas as pd
import os

def load_results_to_dataframe(directory_path=files_path):
    # Initialize an empty list to store the dataframes
    df_list = []
    
    # Iterate over all files in the specified directory
    for filename in os.listdir(directory_path):
        if filename.endswith('_results.csv'):
            # Extract the model name from the filename
            model_name = filename.replace('_results.csv', '')
            
            # Construct the full file path
            file_path = os.path.join(directory_path, filename)
            
            # Read the CSV file into a dataframe
            df = pd.read_csv(file_path)
            
            # Check if the expected columns are present in the DataFrame
            if set(['image_id', 'time_in_microseconds', 'prediction']).issubset(df.columns):
                # Add the model_name column
                df['model_name'] = model_name
                
                # Keep only the required columns in the specified order
                df = df[['image_id', 'model_name', 'time_in_microseconds', 'prediction']]
                
                # Append the dataframe to the list
                df_list.append(df)
            else:
                print(f"Warning: File {filename} does not contain the required columns.")
    
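    # Fail early with a clear message if no valid results file was found
    if not df_list:
        raise FileNotFoundError(f"No valid '*_results.csv' files found in {directory_path}")

    # Concatenate all dataframes in the list into a single dataframe
    results_df = pd.concat(df_list)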
    
    # Reset the index of the resulting dataframe
    results_df.reset_index(drop=True, inplace=True)
    
    return results_df

# Call the function and assign the result to a variable
df = load_results_to_dataframe()

### Adding information from the COCO dataset captions

In [155]:
# Load the captions.csv file into a DataFrame
captions_df = pd.read_csv(os.path.join(files_path, 'captions.csv'))

# Rename the 'caption' column to 'original_caption'
captions_df.rename(columns={'caption': 'original_caption'}, inplace=True)

# Merge the two DataFrames on the 'image_id' column
df = pd.merge(df, captions_df, on='image_id', how='left')
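
Because COCO provides several reference captions per image, the left merge above repeats each prediction row once per caption. The following is a minimal sketch of that effect; the image ids, model name and captions are made up for illustration and are not taken from the dataset.

```python
import pandas as pd

# Hypothetical predictions: one row per (image, model)
predictions = pd.DataFrame({
    "image_id": [1, 2],
    "model_name": ["model-a", "model-a"],
    "prediction": ["a dog on a couch", "a red bus"],
})

# Hypothetical reference captions: several rows per image
captions = pd.DataFrame({
    "image_id": [1, 1, 2],
    "original_caption": ["a dog lying on a sofa", "dog on couch", "a bus on a street"],
})

# Same kind of merge as in the cell above: every prediction is paired
# with each reference caption of its image
merged = pd.merge(predictions, captions, on="image_id", how="left")
print(merged)
print(len(merged))  # 3 rows: two for image 1, one for image 2
```

This is why the scores computed later are aggregated per image: each prediction is scored once against every reference caption.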

### Cleaning up the data
Let's make both the `predicted caption` and the `original caption` lower case and formatted the same way so the metrics that we measure are more reliable.

In [156]:
# Define a function to clean the text according to the specified rules
def clean_text(text):
    # Make the text lowercase
    text = text.lower()
    # Trim whitespace, remove any surrounding quotation marks, then trim again
    # (trimming whitespace first ensures quotes padded with spaces are removed too)
    text = text.strip().strip('"').strip()
    # Remove the period at the end if there is one
    if text.endswith('.'):
        text = text[:-1]
    return text

# Apply the clean_text function to the 'original_caption' and 'prediction' columns
df['original_caption'] = df['original_caption'].apply(clean_text)
df['prediction'] = df['prediction'].apply(clean_text)

# Adding metrics evaluation

## ROUGE Score

Among the available evaluation metrics, the `ROUGE Score` is one of the most prominent. Standing for ["Recall-Oriented Understudy for Gisting Evaluation"](https://en.wikipedia.org/wiki/ROUGE_(metric)), ROUGE is a standard family of metrics for evaluating automatic text summarization.

The `ROUGE Score` has three main components: **`ROUGE-N`**, **`ROUGE-L`**, and **`ROUGE-S`**. 

Each ROUGE score component offers a different perspective on the quality of the system-generated summary, considering different aspects of language and sentence structure. This is why a combination of these measures is usually used in evaluating system outputs in NLP tasks.


#### ROUGE-N
`ROUGE-N` is a component of the ROUGE score that quantifies the overlap of [N-grams](https://en.wikipedia.org/wiki/N-gram) (contiguous sequences of N items - typically words or characters) between the system-generated summary and the reference summary. It provides insights into the [precision and recall](https://en.wikipedia.org/wiki/Precision_and_recall) of the system's output by considering the matching N-gram sequences.

`ROUGE-N` essentially refers to the overlap of `n-grams`. Its two most common variants are `ROUGE-1` (the overlap of **unigrams**, i.e. single words, between the system and reference summaries) and `ROUGE-2` (the overlap of **bigrams** between the system and reference summaries).

#### ROUGE-L 
`ROUGE-L`, another component of the `ROUGE Score`, calculates the [Longest Common Subsequence (LCS)](https://en.wikipedia.org/wiki/Longest_common_subsequence) between the system and reference summaries. Unlike N-grams, LCS measures the maximum sequence of words (not necessarily contiguous) that appear in both summaries. It offers a more flexible similarity measure and helps capture shared information beyond strict word-for-word matches.

#### ROUGE-S
`ROUGE-S` focuses on [skip-bigrams](https://towardsdatascience.com/skip-gram-nlp-context-words-prediction-algorithm-5bbf34f84e0c). A skip-bigram is a pair of words in a sentence that allows for gaps or words in between. This component identifies the skip-bigram overlap between the system and reference summaries, enabling the assessment of sentence-level structure similarity. It can capture paraphrasing relationships between sentences and provide insights into the system's ability to convey information with flexible word ordering.

> the text above was taken from https://thepythoncode.com/article/calculate-rouge-score-in-python.
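
To make the three ROUGE components described above more concrete, here is a small hand-rolled sketch. It is not how `rouge-score` is implemented: it assumes plain whitespace tokenization, no stemming, and skip-bigrams with an unlimited gap. The helper names (`ngrams`, `overlap_f1`, `lcs_length`) and the two sentences are hypothetical.

```python
from collections import Counter
from itertools import combinations


def ngrams(tokens, n):
    """Return the list of contiguous n-grams in a token list."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


def overlap_f1(ref_items, hyp_items):
    """F-measure of the clipped (multiset) overlap between two item lists."""
    overlap = sum((Counter(ref_items) & Counter(hyp_items)).values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(hyp_items)
    recall = overlap / len(ref_items)
    return 2 * precision * recall / (precision + recall)


def lcs_length(a, b):
    """Length of the longest common subsequence, via dynamic programming."""
    table = [[0] * (len(b) + 1) for _ in range(len(a) + 1)]
    for i, x in enumerate(a, start=1):
        for j, y in enumerate(b, start=1):
            if x == y:
                table[i][j] = table[i - 1][j - 1] + 1
            else:
                table[i][j] = max(table[i - 1][j], table[i][j - 1])
    return table[-1][-1]


reference = "a dog sits on the couch".split()
hypothesis = "a dog on the red couch".split()

# ROUGE-N: overlap of contiguous unigrams and bigrams
rouge1 = overlap_f1(ngrams(reference, 1), ngrams(hypothesis, 1))
rouge2 = overlap_f1(ngrams(reference, 2), ngrams(hypothesis, 2))

# ROUGE-L: precision and recall are based on the LCS length
lcs = lcs_length(reference, hypothesis)
lcs_precision, lcs_recall = lcs / len(hypothesis), lcs / len(reference)
rouge_l = 2 * lcs_precision * lcs_recall / (lcs_precision + lcs_recall)

# ROUGE-S: overlap of skip-bigrams (ordered word pairs with any gap in between)
rouge_s = overlap_f1(list(combinations(reference, 2)), list(combinations(hypothesis, 2)))

print(round(rouge1, 3), round(rouge2, 3), round(rouge_l, 3), round(rouge_s, 3))
# 0.833 0.4 0.833 0.667
```

Note how `ROUGE-2` drops sharply because a single inserted or missing word breaks two contiguous bigrams, while `ROUGE-L` and the skip-bigram overlap are more tolerant of gaps.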

We'll now focus on adding the `ROUGE Score` in our `df` dataframe.

In [157]:
from rouge_score import rouge_scorer

# Initialize the scorer once for ROUGE-1, ROUGE-2, and ROUGE-L
# (creating it outside the function avoids rebuilding it for every row)
scorer = rouge_scorer.RougeScorer(['rouge1', 'rouge2', 'rougeL'], use_stemmer=True)

# Function to calculate ROUGE scores for a single row
def calculate_rouge_scores(row):
    # Calculate the scores: the reference comes first, the prediction second
    scores = scorer.score(row['original_caption'], row['prediction'])
    
    # Extract and return the scores
    return {
        'rouge1': scores['rouge1'].fmeasure,
        'rouge2': scores['rouge2'].fmeasure,
        'rougeL': scores['rougeL'].fmeasure
    }

# Apply the calculate_rouge_scores function to each row in df
# The result will be a new DataFrame with the ROUGE scores
rouge_scores_df = df.apply(calculate_rouge_scores, axis=1, result_type='expand')

# Concatenate the original df with the new DataFrame containing the ROUGE scores
df = pd.concat([df, rouge_scores_df], axis=1)

Running the code above will create three new columns: `rouge1`, `rouge2` and `rougeL`.

For each ROUGE variant, `rouge-scorer` returns a `Score` object with three fields (`precision`, `recall` and `fmeasure`). For example:

```python
rouge1:
[Score(precision=0.8571428571428571, recall=1.0, fmeasure=0.923076923076923), Score(precision=0.7142857142857143, recall=0.8333333333333334, fmeasure=0.7692307692307692)]
rouge2:
[Score(precision=0.6666666666666666, recall=0.8, fmeasure=0.7272727272727272), Score(precision=0.3333333333333333, recall=0.4, fmeasure=0.3636363636363636)]
rougeL:
[Score(precision=0.8571428571428571, recall=1.0, fmeasure=0.923076923076923), Score(precision=0.5714285714285714, recall=0.6666666666666666, fmeasure=0.6153846153846153)]
```

The choice between using `fmeasure`, `precision`, or `recall` depends on what aspect of the summary's quality you want to emphasize:

- **Precision** measures the fraction of relevant instances among the retrieved instances. In the context of ROUGE, it calculates how many of the words in the predicted summary (generated caption) are also found in the reference summary (original caption).

- **Recall** measures the fraction of relevant instances that were retrieved. In the context of ROUGE, it calculates how much of the reference summary is captured by the predicted summary.

- **F-measure** (or F1 score) is the harmonic mean of precision and recall. It provides a single score that balances both precision and recall. An F1 score reaches its best value at 1 (perfect precision and recall) and worst at 0.

We are choosing the `F-measure` because it strikes a nice balance between `precision` and `recall`.
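
As a quick check of the harmonic-mean definition above, the sketch below recomputes the `fmeasure` of the first `rouge1` entry in the example output from its `precision` and `recall`. The function name `f_measure` is ours, not part of `rouge-score`.

```python
def f_measure(precision, recall):
    """Harmonic mean of precision and recall (0 when both are 0)."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# precision=0.857142... (which equals 6/7) and recall=1.0, as in the example output
print(round(f_measure(6 / 7, 1.0), 6))  # 0.923077
```

The result matches the `fmeasure=0.923076...` value shown in the example, and it sits closer to the lower of the two inputs, which is the point of using a harmonic mean.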


## BLEU 

`BLEU` stands for ["Bilingual Evaluation Understudy"](https://en.wikipedia.org/wiki/BLEU) and it's a metric commonly used in NLP. Besides machine translation, it can be used in text summarization, paraphrasing tasks and even *image captioning*.

The BLEU score is based on a simple idea: comparing the machine-generated translations with human-generated translations that are considered correct.
Here's how it works:

- The machine translation system generates translations for a set of sentences.
- These machine-generated translations are compared to the reference translations.
- The comparison is done by counting how many words or phrases from the machine-generated translations match the words or phrases in the reference translations.
- The more matches there are, the higher the BLEU score will be.

The BLEU score considers the precision of matching words or phrases. It also considers the length of the translations to avoid favoring shorter translations that may have an advantage in matching words by chance.

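The BLEU score is typically represented as a value between 0 and 1, with 1 being a perfect match and 0 meaning no overlap with the reference translations. Some libraries, including the `sacrebleu` package used in this notebook, report the same score scaled to the range 0 to 100.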

> the text above was taken from https://thepythoncode.com/article/bleu-score-in-python.
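
The two ingredients mentioned above, n-gram precision and the length correction, can be sketched as follows. This is a simplified, unsmoothed, single-reference version on a 0 to 1 scale that only looks at unigrams and bigrams by default (standard BLEU goes up to 4-grams); it is not `sacrebleu`'s implementation, and the function names are hypothetical.

```python
import math
from collections import Counter


def clipped_precision(reference, hypothesis, n):
    """Modified n-gram precision: matches are clipped to the reference counts."""
    ref_counts = Counter(tuple(reference[i:i + n]) for i in range(len(reference) - n + 1))
    hyp_counts = Counter(tuple(hypothesis[i:i + n]) for i in range(len(hypothesis) - n + 1))
    total = sum(hyp_counts.values())
    if total == 0:
        return 0.0
    matched = sum((ref_counts & hyp_counts).values())
    return matched / total


def brevity_penalty(ref_len, hyp_len):
    """1.0 if the hypothesis is at least as long as the reference, else exp(1 - r/c)."""
    if hyp_len == 0:
        return 0.0
    return 1.0 if hyp_len >= ref_len else math.exp(1 - ref_len / hyp_len)


def simple_bleu(reference, hypothesis, max_n=2):
    """Unsmoothed BLEU with uniform weights over 1..max_n grams (single reference)."""
    precisions = [clipped_precision(reference, hypothesis, n) for n in range(1, max_n + 1)]
    if min(precisions) == 0:
        return 0.0
    geo_mean = math.exp(sum(math.log(p) for p in precisions) / max_n)
    return brevity_penalty(len(reference), len(hypothesis)) * geo_mean


reference = "a dog sits on the couch".split()

# Repeating a common word does not help: only one "the" can be matched
print(clipped_precision(reference, "the the the".split(), 1))

# A very short hypothesis has perfect precision but is hit by the brevity penalty
print(simple_bleu(reference, "a dog".split()))

# An exact match scores 1.0
print(simple_bleu(reference, reference))
```

In this sketch, `"a dog"` has perfect unigram and bigram precision, so its low score comes entirely from the brevity penalty `exp(1 - 6/2)`.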

We'll now focus on adding the `BLEU` score in our `df` dataframe.

In [158]:
import sacrebleu

# Function to calculate BLEU score for a single row
def calculate_bleu_score(row):
    # Prepare the reference and hypothesis
    reference = [row['original_caption']]
    hypothesis = row['prediction']
    # Calculate BLEU score
    bleu = sacrebleu.corpus_bleu([hypothesis], [reference])
    # Return the BLEU score
    return bleu.score

# Apply the calculate_bleu_score function to each row in df
df['BLEU_score'] = df.apply(calculate_bleu_score, axis=1)

#### Some notes when interpreting the `BLEU` score

The BLEU score reported by `sacrebleu` ranges from 0 to 100:

- `0` indicates a complete lack of overlap between the candidate translation (in our case, the predicted caption) and the reference translations (the original captions), which implies very poor quality.
- `100` indicates a perfect match with the reference translations, signifying an ideal result.
  
In practice, you'll rarely see a `BLEU` score of `100`, especially in tasks other than translation, because it would require the candidate text to match the reference exactly, including word choice and order. 
Even human translators don't often achieve a perfect score because there are many possible ways to correctly translate or summarize a text!

When interpreting `BLEU` scores, consider the following:

- **Higher scores are better**, as they indicate more `n-gram` overlap with the reference text and, by extension, better quality text generation.
- `BLEU` uses `n-gram precision`, which does not capture semantics or meaning. It only measures how many n-grams (up to a certain size) match between the candidate and the reference texts.
- **`BLEU` is sensitive to the length of the text.** Very short or very long texts may produce misleading scores.
- `BLEU` includes a brevity penalty to penalize overly short generated text, as short candidates can have high precision by just including common n-grams.
  

In summary, a higher BLEU score *suggests better resemblance to the reference text at the surface level* (in terms of the exact words and their order), but it does not necessarily mean that the candidate text is more accurate or appropriate. 

You can find a small text by Google on how to interpret the `BLEU` score in https://cloud.google.com/translate/automl/docs/evaluate#bleu.



## METEOR

`METEOR`, short for ["Metric for Evaluation of Translation with Explicit Ordering"](https://en.wikipedia.org/wiki/METEOR), is an automatic metric for evaluating machine translation output that addresses some of the shortcomings of the `BLEU` score. While `BLEU` focuses on precision by measuring how many words in the machine translation output appear in the reference translation, `METEOR` **also accounts for recall by considering how many words in the reference are captured in the translation**.

Overall, METEOR is designed to correlate better with human judgment of translation quality than `BLEU`. 
It does this by considering a wider range of linguistic phenomena and by balancing precision and recall. Because it aligns words between the candidate and reference texts and accounts for synonyms and stemming, `METEOR` is often seen as providing a more nuanced evaluation of translation outputs.
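
To illustrate how `METEOR` balances precision and recall and penalizes scrambled word order, here is a sketch of the scoring formula applied to already-computed alignment counts. The alignment step itself (exact, stem and synonym matching) is left out. The function name and the example counts are hypothetical, and the default parameter values (`alpha=0.9`, `beta=3.0`, `gamma=0.5`) are an assumption about commonly used settings rather than something stated in this notebook.

```python
def meteor_from_counts(matches, chunks, hyp_len, ref_len, alpha=0.9, beta=3.0, gamma=0.5):
    """METEOR-style score from alignment counts.

    matches: number of aligned words between hypothesis and reference
    chunks:  number of contiguous, same-order runs the aligned words form
    """
    if matches == 0:
        return 0.0
    precision = matches / hyp_len
    recall = matches / ref_len
    # Weighted harmonic mean; alpha close to 1 puts most of the weight on recall
    f_mean = precision * recall / (alpha * precision + (1 - alpha) * recall)
    # Fragmentation penalty: more chunks means the matched words are more scattered
    penalty = gamma * (chunks / matches) ** beta
    return (1 - penalty) * f_mean


# Hypothetical case: 5 of 6 words are matched in both sentences
in_order = meteor_from_counts(matches=5, chunks=2, hyp_len=6, ref_len=6)
scrambled = meteor_from_counts(matches=5, chunks=5, hyp_len=6, ref_len=6)
print(round(in_order, 4), round(scrambled, 4))  # 0.8067 0.4167
```

Both cases have identical precision and recall; only the word order differs, which is exactly what the fragmentation penalty is meant to capture.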

We'll now focus on adding the `METEOR` score in our `df` dataframe.

In [159]:
import nltk

from nltk.translate.meteor_score import meteor_score
from nltk import word_tokenize

# Ensure that the Punkt tokenizer models are downloaded
nltk.download('punkt')

# Function to calculate METEOR score for a single row
def calculate_meteor(row):
    # Assuming 'original_caption' is the reference and 'prediction' is the hypothesis
    reference = row['original_caption']
    hypothesis = row['prediction']
    # Tokenize both the reference and the hypothesis
    reference_tokens = word_tokenize(reference)
    hypothesis_tokens = word_tokenize(hypothesis)
    # Calculate the METEOR score
    score = meteor_score([reference_tokens], hypothesis_tokens)
    return score

# Apply the calculate_meteor function to each row in df
df['METEOR_score'] = df.apply(calculate_meteor, axis=1)

[nltk_data] Downloading package punkt to /Users/lucho/nltk_data...
[nltk_data]   Package punkt is already up-to-date!


To explain the code above a bit further:

- both the `reference` and the `hypothesis` are tokenized using `word_tokenize`.
- the `meteor_score` function takes a list of **tokenized reference sentences** (even if there's only one reference) and a **tokenized hypothesis**.
- `df.apply` then calculates the `METEOR` score for each row and adds the scores to a new column named `'METEOR_score'` in the `df` dataframe.

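Before all of this, we download [`punkt`](https://www.nltk.org/api/nltk.tokenize.punkt.html), a tokenizer model. It is used to divide a text into a list of sentences by using an unsupervised algorithm to build a model for abbreviation words, collocations, and words that start sentences. Note that `meteor_score` also relies on the WordNet corpus for synonym matching, so you may need to run `nltk.download('wordnet')` if it is not already installed.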

## Word Error Rate

The [**Word Error Rate (WER)**](https://en.wikipedia.org/wiki/Word_error_rate) is a common metric for evaluating the performance of a speech recognition or machine translation system. It compares a reference text to a hypothesis text, and it is calculated as the number of substitutions, insertions, and deletions needed to change the hypothesis into the reference, divided by the number of words in the reference.

We'll now focus on adding the `Word Error Rate` in our `df` dataframe.

In [160]:
import jiwer

# Function to calculate WER for a single row
def calculate_wer(row):
    # Assuming 'original_caption' is the reference and 'prediction' is the hypothesis
    reference = row['original_caption']
    hypothesis = row['prediction']
    # Calculate WER using jiwer
    wer_score = jiwer.wer(reference, hypothesis)
    return wer_score

# Apply the calculate_wer function to each row in df
df['Word_error_rate'] = df.apply(calculate_wer, axis=1)

Here's how we can interpret the WER score:

- **`WER = 0`**: This means that the hypothesis (the generated text) matches the reference (the target text) perfectly. There are no errors at all.
- **`0 < WER < 1`**: The hypothesis has errors, but the number of errors is less than the number of words in the reference. The closer the value is to `0`, the fewer edits are needed to turn the hypothesis into the reference.
- **`WER = 1`**: The number of errors is equal to the number of words in the reference. This happens, for example, when the hypothesis has the same length as the reference but every word is different, or when the hypothesis is empty.
- **`WER > 1`**: The hypothesis is so inaccurate that the number of errors exceeds the number of words in the reference. This can happen if the hypothesis is longer than the reference and contains many incorrect words.
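
The four cases above can be reproduced with a small word-level edit distance. This is a minimal sketch that assumes whitespace tokenization and a non-empty reference; it is meant as an illustration of the definition, not as a replacement for `jiwer`, and the sentences are made up.

```python
def word_error_rate(reference, hypothesis):
    """Word-level Levenshtein distance divided by the number of reference words."""
    ref, hyp = reference.split(), hypothesis.split()
    # previous[j]: edits between the reference prefix processed so far
    # and the first j hypothesis words
    previous = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        current = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            cost = 0 if ref_word == hyp_word else 1
            current.append(min(
                previous[j] + 1,         # deletion (reference word missing)
                current[j - 1] + 1,      # insertion (extra hypothesis word)
                previous[j - 1] + cost,  # substitution or match
            ))
        previous = current
    return previous[-1] / len(ref)


reference = "a dog on a couch"
print(word_error_rate(reference, "a dog on a couch"))      # 0.0
print(word_error_rate(reference, "a cat on a couch"))      # 0.2
print(word_error_rate(reference, "two birds in the sky"))  # 1.0
print(word_error_rate(
    reference,
    "there is a very large brown dog sleeping on top of a couch",
))                                                          # 1.6
```

The last example contains every reference word in the right order, yet its score is above `1` because eight extra words have to be removed, which is why long predictions are penalized heavily by this metric.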

## Aggregating data 

Now that we have the scores of different metrics for each caption, it's time to aggregate the data!
We are going to condense the data to have an evaluation of each model with the execution time and the precision/accuracy of the predictions.

First, we'll perform aggregation for each `image_id`. Because there are 5 captions describing each image in the COCO Dataset, we are getting the **best results of each score for each image**. Since we have a level of redundancy when describing images, it's fair to give the *best score for the prediction at a given image* instead of the average of the scores of each image. 

This is what we're doing in the next block of code.

In [161]:
# Define the aggregation dictionary for the scores
aggregations = {
    'rouge1': 'max',
    'rouge2': 'max',
    'rougeL': 'max',
    'BLEU_score': 'max',
    'METEOR_score': 'max',
    'Word_error_rate': 'min' # the lower the error rate, the better
}

# Group by the specified columns and aggregate using the specified functions
condensed_df = df.groupby(['image_id', 'model_name', 'time_in_microseconds', 'prediction']).agg(aggregations).reset_index()

# The resulting condensed_df will have one row per group with the highest or minimum scores as specified
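
As a toy illustration of the "best score per image" aggregation above, the sketch below condenses three hypothetical rows (one prediction scored against three reference captions) into a single row. The values and the model name are made up.

```python
import pandas as pd

# Hypothetical scores of one prediction against three reference captions
toy = pd.DataFrame({
    "image_id": [1, 1, 1],
    "model_name": ["model-a"] * 3,
    "rouge1": [0.40, 0.75, 0.60],
    "Word_error_rate": [0.90, 0.50, 0.35],
})

# Keep the highest ROUGE-1 and the lowest WER for each (image, model) pair
best = (
    toy.groupby(["image_id", "model_name"])
    .agg({"rouge1": "max", "Word_error_rate": "min"})
    .reset_index()
)
print(best)
```

Note that in this toy data the best `rouge1` (0.75) and the best `Word_error_rate` (0.35) come from different reference captions, so a condensed row is the best case per metric rather than the scores against one single caption.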

Now we can aggregate the **average** and the **median** of the scores for each model.

In [162]:
# Define the aggregation dictionary for calculating mean and median
aggregations = {
    'time_in_microseconds': ['mean', 'median'],
    'rouge1': ['mean', 'median'],
    'rouge2': ['mean', 'median'],
    'rougeL': ['mean', 'median'],
    'BLEU_score': ['mean', 'median'],
    'METEOR_score': ['mean', 'median'],
    'Word_error_rate': ['mean', 'median']
}

# Group by the model_name and calculate the specified aggregations
final_df = condensed_df.groupby('model_name').agg(aggregations)

# Flatten the MultiIndex columns by combining the level 0 and level 1 column names
final_df.columns = ['_'.join(col).strip() for col in final_df.columns.values]

# Reset the index to turn the model_name index back into a column
final_df = final_df.reset_index()

# The resulting final_df will have the average and median values for each model
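
Aggregating with a list of functions produces two-level column names, which is why the cell above flattens them. A minimal sketch with made-up data shows the column names before and after:

```python
import pandas as pd

# Hypothetical per-image scores for two models
toy = pd.DataFrame({
    "model_name": ["model-a", "model-a", "model-b"],
    "rouge1": [0.2, 0.6, 0.5],
})

summary = toy.groupby("model_name").agg({"rouge1": ["mean", "median"]})
print(summary.columns.tolist())  # [('rouge1', 'mean'), ('rouge1', 'median')]

# Join the two levels with an underscore to get flat column names
summary.columns = ["_".join(col) for col in summary.columns]
print(summary.columns.tolist())  # ['rouge1_mean', 'rouge1_median']
```

These flattened names are the ones used later when dropping and renaming columns.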

Now we have the **median** and the **average** of each score for a given model.
But, to keep the final table simple to read, we should show only one of them. Which one should we pick?

### Median or average?

You don't *have* to read this small section. We're just doing this so we know we're providing statistically-correct results. 

We have chosen a sample size of **50 images** *on purpose* because of the [`Central Limit Theorem`](https://www.investopedia.com/terms/c/central_limit_theorem.asp). The CLT states that **the distribution of sample means will be approximately normal, regardless of the distribution of the population from which the samples are drawn, as long as the sample size is large enough.** A sample size of 30 is often used as a rule of thumb for "large enough", since that is roughly where the normal approximation is usually considered adequate.

Conducting normality tests will help us decide whether we can show the `average` or should stick with the `median`. Basically, if the data is **approximately normally distributed**, we can safely use the `average` aggregator. Otherwise, we should use the `median`, which is much less sensitive to outliers.
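
To see why outliers matter for this choice, consider the following sketch with made-up execution times, where a single slow run (for example, a cold start) is enough to pull the mean far away from the typical value:

```python
import statistics

# Hypothetical execution times in seconds; the last value is an outlier
times = [4, 4, 5, 5, 4, 30]

print(round(statistics.mean(times), 2))  # 8.67
print(statistics.median(times))          # 4.5
```

The median stays close to what most runs look like, while the mean is almost doubled by one observation.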

Because our sample size is `<= 50`, we can perform a [**Shapiro-Wilk Normality test**](https://en.wikipedia.org/wiki/Shapiro%E2%80%93Wilk_test) to check whether our sample is normally distributed.

In [163]:
from scipy import stats

# List of the score columns to test for normality
score_columns = ['time_in_microseconds', 'rouge1', 'rouge2', 'rougeL', 'BLEU_score', 'METEOR_score', 'Word_error_rate']

# Significance level used to interpret the p-value
alpha = 0.05

# Perform the Shapiro-Wilk test for each score column
for column in score_columns:
    stat, p_value = stats.shapiro(condensed_df[column])
    print(f'Column: {column}')
    print('Test statistic:', stat)
    print('p-value:', p_value)

    # Interpret the p-value
    if p_value > alpha:
        print('Sample looks Gaussian (fail to reject H0)\n')
    else:
        print('Sample does not look Gaussian (reject H0)\n')

Running the test above with the data of a single model yielded the following output.

```
Column: time_in_microseconds
Test statistic: 0.40061622858047485
p-value: 5.369669552751644e-13
Sample does not look Gaussian (reject H0)

Column: rouge1
Test statistic: 0.9838210344314575
p-value: 0.7199809551239014
Sample looks Gaussian (fail to reject H0)

Column: rouge2
Test statistic: 0.9751147031784058
p-value: 0.3686423897743225
Sample looks Gaussian (fail to reject H0)

Column: rougeL
Test statistic: 0.987608015537262
p-value: 0.8752310872077942
Sample looks Gaussian (fail to reject H0)

Column: BLEU_score
Test statistic: 0.8733219504356384
p-value: 7.060639472911134e-05
Sample does not look Gaussian (reject H0)

Column: METEOR_score
Test statistic: 0.987629771232605
p-value: 0.8760008811950684
Sample looks Gaussian (fail to reject H0)

Column: Word_error_rate
Test statistic: 0.9659003019332886
p-value: 0.1569066196680069
Sample looks Gaussian (fail to reject H0)
```

As you can see above, the test does not reject normality for any column **except `time_in_microseconds` and `BLEU_score`**.

We could use the average/mean on all of the columns that pass the test. But, for simplicity's sake, we'll use the median on every single column.

> **NOTE**:
>
> If you run this with multiple models, the distribution might be different. This test should be done **for each model**, not with the dataframe that has the results for multiple models. 
> You don't need to worry about this though, we are sticking with the `median` regardless. The output above pertains to the data of a single model.

In [164]:
# Columns to drop: the mean of every column, since we are keeping only the medians
columns_to_drop = ['time_in_microseconds_mean', 'rouge1_mean', 'rouge2_mean', 'rougeL_mean', 
                   'METEOR_score_mean', 'Word_error_rate_mean', 'BLEU_score_mean']

# Drop the specified columns from final_df
final_df.drop(columns_to_drop, axis=1, inplace=True)

Awesome! 🎉

Now let's clean up some of our columns and convert the execution time from `microseconds` to **`seconds`**.

In [165]:
# Convert time from microseconds to seconds
final_df['time_in_seconds_median'] = final_df['time_in_microseconds_median'] / 1e6

# Drop the original 'time_in_microseconds' column if you no longer need it
final_df.drop('time_in_microseconds_median', axis=1, inplace=True)

# Round all score columns to five decimal places
score_columns = ['time_in_seconds_median', 'rouge1_median', 'rouge2_median', 'rougeL_median', 'BLEU_score_median', 'METEOR_score_median', 'Word_error_rate_median']
for column in score_columns:
    final_df[column] = final_df[column].round(5)

# Dictionary mapping old column names to new ones
new_column_names = {
    'model_name': 'Model',
    'rouge1_median': 'ROUGE-1',
    'rouge2_median': 'ROUGE-2',
    'rougeL_median': 'ROUGE-L',
    'BLEU_score_median': 'BLEU',
    'METEOR_score_median': 'METEOR',
    'Word_error_rate_median': 'Word Error Rate',
    'time_in_seconds_median': 'Time (s)'
}

# Rename the columns using the dictionary
final_df.rename(columns=new_column_names, inplace=True)

# Now final_df will have the new column names

## We're done! 🎉

Awesome!

Congratulations, we now have a table that shows accuracy scores and the execution time for each model!
You can expand this table by running this notebook (assuming you have a `modelName_results.csv` file created).

Hurray!

Let's get this table into `Markdown` so we can post it in our `README`.


In [166]:
# Convert the DataFrame to Markdown
markdown_table = final_df.to_markdown(index=False)

# Print the Markdown table
print(markdown_table)

| Model                       |   ROUGE-1 |   ROUGE-2 |   ROUGE-L |    BLEU |   METEOR |   Word Error Rate |   Time (s) |
|:----------------------------|----------:|----------:|----------:|--------:|---------:|------------------:|-----------:|
| blip-image-captioning-base  |   0.6     |   0.36364 |   0.57983 | 20.0762 |  0.45953 |           0.58333 |    4.16365 |
| blip-image-captioning-large |   0.59167 |   0.33333 |   0.55844 | 19.0449 |  0.53777 |           0.72381 |   11.878   |
| resnet-50                   |   0       |   0       |   0       |  0      |  0.03953 |           1       |    0.32517 |
