## **Performance Metrics for Evaluating Text-Generating LLMs:**

- **ROUGE: measures the similarity between a reference summary and a generated summary using n-gram overlap**

This notebook focuses on the ROUGE metric (see the readme.md file for more detail).
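To make the n-gram idea concrete, here is a minimal sketch of a ROUGE-1 F-score computed by hand. It assumes plain whitespace tokenization and lowercasing, so its numbers will not match the `rouge-score` library exactly (that library applies its own tokenizer and, optionally, stemming). The function name `rouge1_f1` is hypothetical and exists only for illustration.

```python
from collections import Counter


def rouge1_f1(reference: str, candidate: str) -> float:
    """Unigram-overlap F1 between a reference and a candidate summary."""
    ref_counts = Counter(reference.lower().split())
    cand_counts = Counter(candidate.lower().split())
    # Clipped overlap: a word counts only as often as it appears in both texts.
    overlap = sum((ref_counts & cand_counts).values())
    if overlap == 0:
        return 0.0
    precision = overlap / sum(cand_counts.values())  # share of candidate words that match
    recall = overlap / sum(ref_counts.values())      # share of reference words recovered
    return 2 * precision * recall / (precision + recall)


print(rouge1_f1("the cat sat on the mat", "the cat lay on the mat"))
```

In this example five of the six unigrams overlap, so precision and recall are both 5/6 and the F-score is about 0.83.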

### **Steps to run this Notebook:**

- **Step 1:** Install the libraries
- **Step 2:** Prompt the text-generating LLM using the prompt given below
- **Step 3:** Execute the cells
- **Step 4:** Download the resulting CSV
- **Step 5:** Repeat for the other text-generating LLMs

### **Loading the Data**

In [1]:
# Install datasets and rouge-score
!pip install datasets
!pip install rouge-score

In [2]:
# Import libraries (the unused transformers import is dropped; it is not installed above)
from datasets import load_dataset
from rouge_score import rouge_scorer
import pandas as pd

In [None]:
# Load the dataset
xsum_dataset = load_dataset("xsum", version="1.2.0")

In [32]:
xsum_sample = xsum_dataset["train"].select(range(5))
display(xsum_sample.to_pandas())
print(xsum_sample.shape)

Unnamed: 0,document,summary,id
0,"The full cost of damage in Newton Stewart, one...",Clean-up operations are continuing across the ...,35232142
1,A fire alarm went off at the Holiday Inn in Ho...,Two tourist buses have been destroyed by fire ...,40143035
2,Ferrari appeared in a position to challenge un...,Lewis Hamilton stormed to pole position at the...,35951548
3,"John Edward Bates, formerly of Spalding, Linco...",A former Lincolnshire Police officer carried o...,36266422
4,Patients and staff were evacuated from Cerahpa...,An armed man who locked himself into a room at...,38826984


(5, 3)


In [33]:
document_array = xsum_sample['document']
print(document_array)



### **From here on, re-run the cells for each text-generating model:**

**Query the text-generating LLM with the following prompt:** (replace PASTE_DOCUMENTS_HERE with the documents printed above)

```
Please generate a summary in one line (max 25 words) for each of the following documents: PASTE_DOCUMENTS_HERE, please just return the answer as the following: results={"generated_summary":["","","","",""]}
```

You may have to submit the documents one at a time: pasting all of them at once sometimes produces an error (at least in ChatGPT).
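Instead of pasting documents by hand, the prompt can be assembled in code. The sketch below is one possible approach, not part of the original workflow: the helper name `build_prompt` and the numbered-list layout of the documents are assumptions. It fills the template above and emits one empty placeholder per document, so it works for a single document as well as for a batch.

```python
def build_prompt(documents, max_words=25):
    """Fill the prompt template with a list of documents (hypothetical helper)."""
    # One empty string per document, matching the requested answer format.
    placeholders = ",".join('""' for _ in documents)
    # Numbering the documents is an assumption; the original prompt pastes them as-is.
    numbered = "\n".join(f"{i + 1}. {doc}" for i, doc in enumerate(documents))
    return (
        f"Please generate a summary in one line (max {max_words} words) "
        f"for each of the following documents:\n{numbered}\n"
        "please just return the answer as the following: "
        f'results={{"generated_summary":[{placeholders}]}}'
    )


# One prompt per document, for models that fail on a large batch.
for doc in ["First document text.", "Second document text."]:
    print(build_prompt([doc]))
    print("-" * 30)
```

In the notebook, `build_prompt(document_array)` would produce the batch version of the prompt.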



In [12]:
# Example output from ChatGPT; repeat this step for every generative model being tested

results={"generated_summary":["Newton Stewart and Hawick face flood aftermath, Lamington Viaduct disrupts trains, First Minister inspects, and more preventative measures needed.","Fire alarm at Holiday Inn prompts evacuation; two tour buses, belonging to German and Chinese/Taiwanese groups, were deliberately set ablaze in Northern Ireland.","Mercedes dominates Bahrain GP qualifying with Hamilton securing pole, Vandoorne impresses on debut, controversial qualifying system retained.","John Edward Bates faces sexual abuse charges dating back to 1970s, denies allegations, trial ongoing.","Cerahpasa hospital evacuated after patient threatens violence, no hostages, Istanbul tensions rise amid recent attacks."]}
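The prompt asks for one summary per document with at most 25 words each, but a model will not always comply. Before scoring, it may be worth checking the pasted output. The following is a minimal sketch under that assumption; `check_summaries` is a hypothetical helper, and counting words by splitting on whitespace is a simplification.

```python
def check_summaries(results, expected_count, max_words=25):
    """Return a list of problems found in the pasted model output."""
    summaries = results.get("generated_summary", [])
    problems = []
    if len(summaries) != expected_count:
        problems.append(f"expected {expected_count} summaries, got {len(summaries)}")
    for i, text in enumerate(summaries):
        n_words = len(text.split())  # whitespace word count (a simplification)
        if n_words == 0:
            problems.append(f"summary {i} is empty")
        elif n_words > max_words:
            problems.append(f"summary {i} has {n_words} words (limit {max_words})")
    return problems


example = {"generated_summary": ["A short summary.", ""]}
print(check_summaries(example, expected_count=2))
```

In the notebook this would be called as `check_summaries(results, expected_count=len(xsum_sample))`; an empty list means the output has the expected shape.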

**Adding the generated summaries to the pandas DataFrame**

In [16]:
# Join the generated summaries with the sampled rows on the row index and keep the columns of interest
opt_result = pd.DataFrame(results).join(xsum_sample.to_pandas())[["generated_summary", "summary", "document"]]
display(opt_result.head())

Unnamed: 0,generated_summary,summary,document
0,Newton Stewart and Hawick face flood aftermath...,Clean-up operations are continuing across the ...,"The full cost of damage in Newton Stewart, one..."
1,Fire alarm at Holiday Inn prompts evacuation; ...,Two tourist buses have been destroyed by fire ...,A fire alarm went off at the Holiday Inn in Ho...
2,Mercedes dominates Bahrain GP qualifying with ...,Lewis Hamilton stormed to pole position at the...,Ferrari appeared in a position to challenge un...
3,John Edward Bates faces sexual abuse charges d...,A former Lincolnshire Police officer carried o...,"John Edward Bates, formerly of Spalding, Linco..."
4,Cerahpasa hospital evacuated after patient thr...,An armed man who locked himself into a room at...,Patients and staff were evacuated from Cerahpa...


In [19]:
print("Generated Summary : ",opt_result.iloc[0]["generated_summary"])
print(30*"-")
print("Summary : ",opt_result.iloc[0]["summary"])

Generated Summary :  Newton Stewart and Hawick face flood aftermath, Lamington Viaduct disrupts trains, First Minister inspects, and more preventative measures needed.
------------------------------
Summary :  Clean-up operations are continuing across the Scottish Borders and Dumfries and Galloway after flooding caused by Storm Frank.


#### **Calculating the ROUGE score:**

In [25]:
def calculate_rouge(data):
    scorer = rouge_scorer.RougeScorer(['rouge1', 'rouge2', 'rougeL'], use_stemmer=True)
    # scorer.score(target, prediction) returns one Score(precision, recall, fmeasure) per metric;
    # score each row once and keep the F-measure of each metric
    scores = [scorer.score(ref, gen) for ref, gen in zip(data["summary"], data["generated_summary"])]
    data["r1_fscore"] = [s['rouge1'].fmeasure for s in scores]
    data["r2_fscore"] = [s['rouge2'].fmeasure for s in scores]
    data["rl_fscore"] = [s['rougeL'].fmeasure for s in scores]
    return data

In [26]:
score_ret=calculate_rouge(opt_result)

In [40]:
print("ROUGE - 1 : ",score_ret["r1_fscore"].mean())
# print("ROUGE - 2 : ",score_ret["r2_fscore"].mean())
print("ROUGE - L : ",score_ret["rl_fscore"].mean()) # longest common subsequence between the model-generated summary and the reference summary

ROUGE - 1 :  0.13631762332660918
ROUGE - L :  0.09317806711901706
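ROUGE-L is lower than ROUGE-1 here because it rewards only words that appear in the same order in both summaries. As an illustration, the sketch below computes a ROUGE-L F-score from the longest common subsequence (LCS) of the two token lists. It assumes whitespace tokenization without stemming, so it approximates rather than reproduces the `rouge-score` values; the names `lcs_length` and `rouge_l_f1` are hypothetical.

```python
def lcs_length(a, b):
    """Length of the longest common subsequence of two token lists."""
    prev = [0] * (len(b) + 1)  # LCS lengths for the previous row of the DP table
    for x in a:
        curr = [0]
        for j, y in enumerate(b, start=1):
            if x == y:
                curr.append(prev[j - 1] + 1)
            else:
                curr.append(max(prev[j], curr[j - 1]))
        prev = curr
    return prev[-1]


def rouge_l_f1(reference: str, candidate: str) -> float:
    """LCS-based F1 between a reference and a candidate summary."""
    ref, cand = reference.lower().split(), candidate.lower().split()
    lcs = lcs_length(ref, cand)
    if lcs == 0:
        return 0.0
    precision, recall = lcs / len(cand), lcs / len(ref)
    return 2 * precision * recall / (precision + recall)


# Same words, different order: every unigram matches, but the LCS covers only 3 of 4 tokens.
print(rouge_l_f1("a b c d", "b a c d"))
```

For the reordered pair in the example, all four unigrams match, yet the longest in-order match has three tokens, so the LCS-based score is 0.75.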


#### **Exporting the data to a clean CSV of individual model results**

In [43]:
score_ret

Unnamed: 0,generated_summary,summary,document,r1_fscore,r2_fscore,rl_fscore
0,Newton Stewart and Hawick face flood aftermath...,Clean-up operations are continuing across the ...,"The full cost of damage in Newton Stewart, one...",0.157895,0.0,0.105263
1,Fire alarm at Holiday Inn prompts evacuation; ...,Two tourist buses have been destroyed by fire ...,A fire alarm went off at the Holiday Inn in Ho...,0.195122,0.0,0.146341
2,Mercedes dominates Bahrain GP qualifying with ...,Lewis Hamilton stormed to pole position at the...,Ferrari appeared in a position to challenge un...,0.228571,0.0,0.114286
3,John Edward Bates faces sexual abuse charges d...,A former Lincolnshire Police officer carried o...,"John Edward Bates, formerly of Spalding, Linco...",0.0,0.0,0.0
4,Cerahpasa hospital evacuated after patient thr...,An armed man who locked himself into a room at...,Patients and staff were evacuated from Cerahpa...,0.1,0.0,0.1


In [47]:
model_name = "Chat GPT"
output_filename = "chat_gpt_rouge.csv"

In [48]:
# Summarize the run as a single row: the model name plus the mean
# ROUGE-1 and ROUGE-L F-scores, rounded to two decimals
df = pd.DataFrame([{
    "model_name": model_name,
    "metric_1_unigram": round(score_ret["r1_fscore"].mean(), 2),
    "metric_2_longest": round(score_ret["rl_fscore"].mean(), 2),
}])

In [49]:
df.to_csv(output_filename, index=False)

In [50]:
# Read the exported file back to verify it (on Colab the working directory is /content)
df = pd.read_csv(output_filename)
df

Unnamed: 0,model_name,metric_1_unigram,metric_2_longest
0,Chat GPT,0.14,0.09
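Once the notebook has been repeated for several models, the per-model CSV files can be merged into one ranking. The sketch below is one way to do that, assuming every file follows the `*_rouge.csv` naming used above and has the same three columns. It uses only the standard library so that it also runs outside the notebook; `combine_results` is a hypothetical helper.

```python
import csv
import glob


def combine_results(pattern="*_rouge.csv"):
    """Read every per-model CSV matching `pattern`, sorted by ROUGE-1 (best first)."""
    rows = []
    for path in sorted(glob.glob(pattern)):
        with open(path, newline="") as handle:
            rows.extend(csv.DictReader(handle))
    # CSV values are read as strings, so convert before comparing.
    return sorted(rows, key=lambda r: float(r["metric_1_unigram"]), reverse=True)


for row in combine_results():
    print(row["model_name"], row["metric_1_unigram"], row["metric_2_longest"])
```

With pandas, the same result could be obtained by concatenating the files with `pd.concat` and sorting on `metric_1_unigram`.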
