# Metrics of the models

This notebook builds a table that compares metrics across the models that have been benchmarked, using the result files stored in `coco_dataset`.
To run it successfully, we recommend using a `Conda` virtual environment so the notebook has access to the dependencies it needs, such as `pandas`.

### Importing results files 

In [15]:
files_path = './coco_dataset'

In [19]:
import pandas as pd
import os

def load_results_to_dataframe(directory_path=files_path):
    # Initialize an empty list to store the dataframes
    df_list = []
    
    # Iterate over all files in the specified directory
    for filename in os.listdir(directory_path):
        if filename.endswith('_results.csv'):
            # Extract the model name from the filename
            model_name = filename.replace('_results.csv', '')
            
            # Construct the full file path
            file_path = os.path.join(directory_path, filename)
            
            # Read the CSV file into a dataframe
            df = pd.read_csv(file_path)
            
            # Check if the expected columns are present in the DataFrame
            if set(['image_id', 'time_in_microseconds', 'prediction']).issubset(df.columns):
                # Add the model_name column
                df['model_name'] = model_name
                
                # Keep only the required columns in the specified order
                df = df[['image_id', 'model_name', 'time_in_microseconds', 'prediction']]
                
                # Append the dataframe to the list
                df_list.append(df)
            else:
                print(f"Warning: File {filename} does not contain the required columns.")
    
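    # Fail early with a clear message if no valid results file was found
    # (pd.concat raises a less helpful ValueError on an empty list)
    if not df_list:
        raise ValueError(f"No valid '*_results.csv' files found in {directory_path}")
    
    # Concatenate all dataframes into a single dataframe with a fresh index
    results_df = pd.concat(df_list, ignore_index=True)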
    
    return results_df

# Call the function and assign the result to a variable
df = load_results_to_dataframe()

### Adding information from the COCO dataset captions

In [20]:
# Load the captions.csv file into a DataFrame
captions_df = pd.read_csv(os.path.join(files_path, 'captions.csv'))

# Rename the 'caption' column to 'original_caption'
captions_df.rename(columns={'caption': 'original_caption'}, inplace=True)

# Merge the two DataFrames on the 'image_id' column
df = pd.merge(df, captions_df, on='image_id', how='left')

### Cleaning up the data
Let's lowercase both the predicted caption (`prediction`) and the original caption (`original_caption`) and format them the same way, so that the metrics we compute are not skewed by casing, surrounding quotation marks, or trailing periods.

In [21]:
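# Define a function to clean the text according to the specified rules
def clean_text(text):
    # Treat missing values (e.g. an image_id without a caption after the left merge) as empty text
    if pd.isna(text):
        return ''
    # Make the text lowercase and trim whitespace first, so surrounding quotes sit at the edges
    text = str(text).lower().strip()
    # Remove any surrounding quotation marks, then any whitespace that was inside them
    text = text.strip('"').strip()
    # Remove the period at the end if there is one
    if text.endswith('.'):
        text = text[:-1].rstrip()
    return text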

# Apply the clean_text function to the 'original_caption' and 'prediction' columns
df['original_caption'] = df['original_caption'].apply(clean_text)
df['prediction'] = df['prediction'].apply(clean_text)

# Adding metrics evaluation

## ROUGE Score

Among the metrics used to evaluate generated text, the `ROUGE Score` is one of the most widely used. Standing for ["Recall-Oriented Understudy for Gisting Evaluation"](https://en.wikipedia.org/wiki/ROUGE_(metric)), it is a standard family of metrics for evaluating automatic text summarization. Here we use it to compare each predicted caption against its original caption.

The `ROUGE Score` has three main components: **`ROUGE-N`**, **`ROUGE-L`**, and **`ROUGE-S`**. 

Each ROUGE score component offers a different perspective on the quality of the system-generated summary, considering different aspects of language and sentence structure. This is why a combination of these measures is usually used in evaluating system outputs in NLP tasks.


#### ROUGE-N
`ROUGE-N` is a component of the ROUGE score that quantifies the overlap of [N-grams](https://en.wikipedia.org/wiki/N-gram) (contiguous sequences of N items - typically words or characters) between the system-generated summary and the reference summary. It provides insights into the [precision and recall](https://en.wikipedia.org/wiki/Precision_and_recall) of the system's output by considering the matching N-gram sequences.
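
To make the N-gram overlap concrete, here is a minimal sketch of `ROUGE-N` in plain Python. It is only an illustration, not the implementation used in this notebook: the helper names `ngrams` and `rouge_n` are hypothetical, and we assume simple whitespace tokenization with no stemming.

```python
from collections import Counter

def ngrams(tokens, n):
    """Count the contiguous n-grams in a list of tokens."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))

def rouge_n(reference, prediction, n=1):
    """Return (precision, recall, f1) for ROUGE-N on whitespace tokens (illustrative only)."""
    ref_counts = ngrams(reference.split(), n)
    pred_counts = ngrams(prediction.split(), n)
    # Clipped overlap: an n-gram is matched at most as often as it occurs in both texts
    overlap = sum((ref_counts & pred_counts).values())
    ref_total = sum(ref_counts.values())
    pred_total = sum(pred_counts.values())
    recall = overlap / ref_total if ref_total else 0.0
    precision = overlap / pred_total if pred_total else 0.0
    f1 = 2 * precision * recall / (precision + recall) if overlap else 0.0
    return precision, recall, f1

# Hypothetical captions, already cleaned (lowercase, no trailing period)
reference = "a cat sits on the mat"
prediction = "a cat is on the mat"
print(rouge_n(reference, prediction, n=1))
print(rouge_n(reference, prediction, n=2))
```

For these two made-up captions, 5 of the 6 unigrams match (precision and recall of 5/6), while only 3 of the 5 bigrams match (0.6), which shows how larger values of N penalize local word changes more heavily.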

#### ROUGE-L 
`ROUGE-L`, another component of the `ROUGE Score`, is based on the [Longest Common Subsequence (LCS)](https://en.wikipedia.org/wiki/Longest_common_subsequence) between the system and reference summaries. Unlike N-gram matching, the LCS is the longest sequence of words that appears in both summaries in the same order, though not necessarily contiguously. It offers a more flexible similarity measure and helps capture shared information beyond strict word-for-word matches.
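
The LCS idea can be sketched with a small dynamic-programming routine. Again, this is an illustrative sketch under assumptions (whitespace tokens, one reference per prediction), and the names `lcs_length` and `rouge_l` are hypothetical rather than taken from any library.

```python
def lcs_length(a, b):
    """Length of the longest common subsequence of two token lists."""
    prev = [0] * (len(b) + 1)  # LCS lengths for the previous row of the DP table
    for x in a:
        curr = [0]
        for j, y in enumerate(b, start=1):
            if x == y:
                curr.append(prev[j - 1] + 1)            # extend a common subsequence
            else:
                curr.append(max(prev[j], curr[j - 1]))  # carry over the best so far
        prev = curr
    return prev[-1]

def rouge_l(reference, prediction):
    """Return (precision, recall, f1) based on the LCS of whitespace tokens (illustrative only)."""
    ref_tokens, pred_tokens = reference.split(), prediction.split()
    lcs = lcs_length(ref_tokens, pred_tokens)
    recall = lcs / len(ref_tokens) if ref_tokens else 0.0
    precision = lcs / len(pred_tokens) if pred_tokens else 0.0
    f1 = 2 * precision * recall / (precision + recall) if lcs else 0.0
    return precision, recall, f1

# Same words, different order: every unigram matches, but the LCS is only 3 words long
print(rouge_l("a cat sits on the mat", "on the mat a cat sits"))
```

In this made-up example both captions contain exactly the same words, yet the longest in-order match is three words, so precision and recall are both 0.5. This is the sense in which `ROUGE-L` is sensitive to word order.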

#### ROUGE-S
`ROUGE-S` focuses on [skip-bigrams](https://towardsdatascience.com/skip-gram-nlp-context-words-prediction-algorithm-5bbf34f84e0c). A skip-bigram is any pair of words taken in their sentence order, with any number of other words allowed in between. This component measures the skip-bigram overlap between the system and reference summaries, enabling the assessment of sentence-level structure similarity. Because matching pairs do not need to be adjacent, it can give credit to paraphrases that insert or drop words while keeping the overall word order.
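
A skip-bigram count is easy to sketch with `itertools.combinations`, which yields pairs in their original order. This sketch assumes whitespace tokens and no limit on the gap between the two words; the names `skip_bigrams` and `rouge_s` are hypothetical.

```python
from collections import Counter
from itertools import combinations

def skip_bigrams(tokens):
    """Count every in-order pair of tokens, whatever the gap between them."""
    return Counter(combinations(tokens, 2))

def rouge_s(reference, prediction):
    """Return (precision, recall, f1) on skip-bigram overlap (illustrative only)."""
    ref_pairs = skip_bigrams(reference.split())
    pred_pairs = skip_bigrams(prediction.split())
    # Clipped overlap, as for ROUGE-N
    overlap = sum((ref_pairs & pred_pairs).values())
    ref_total = sum(ref_pairs.values())
    pred_total = sum(pred_pairs.values())
    recall = overlap / ref_total if ref_total else 0.0
    precision = overlap / pred_total if pred_total else 0.0
    f1 = 2 * precision * recall / (precision + recall) if overlap else 0.0
    return precision, recall, f1

# Hypothetical captions: one word differs, the order of the others is preserved
print(rouge_s("a cat sits on the mat", "a cat is on the mat"))
```

Each six-word caption here has 15 skip-bigrams, and the 10 pairs formed by the five shared words appear in both, giving a precision and recall of 10/15.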

> The descriptions of `ROUGE-N`, `ROUGE-L`, and `ROUGE-S` above were taken from https://thepythoncode.com/article/calculate-rouge-score-in-python.

We'll now focus on adding the `ROUGE Score` to our `df` dataframe.