## **ROUGE:** Measures the similarity between a reference summary and a generated summary

This notebook focuses on the ROUGE performance metric (see the readme.md file for more details).
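As a rough illustration of what ROUGE-1 measures, the sketch below computes unigram precision, recall and F1 by hand. It is a simplification: it lowercases the text and splits on non-alphanumeric characters but applies no stemming, so its numbers can differ from those of the `rouge-score` package used in this notebook. The function name `rouge1_by_hand` and the sample sentences are assumptions made for illustration only.

```python
import re
from collections import Counter

def rouge1_by_hand(reference, generated):
    """Unigram-overlap precision, recall and F1 (illustrative, no stemming)."""
    ref_counts = Counter(re.findall(r"[a-z0-9]+", reference.lower()))
    gen_counts = Counter(re.findall(r"[a-z0-9]+", generated.lower()))
    # Clipped overlap: a token counts at most as often as it occurs in both texts
    overlap = sum((ref_counts & gen_counts).values())
    if overlap == 0:
        return 0.0, 0.0, 0.0
    precision = overlap / sum(gen_counts.values())
    recall = overlap / sum(ref_counts.values())
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1

# Hypothetical sentences, not taken from the dataset
p, r, f = rouge1_by_hand("the cat sat on the mat", "the cat lay on the mat")
print(round(p, 3), round(r, 3), round(f, 3))
```

Five of the six tokens are shared in this toy pair, so precision, recall and F1 all come out at 5/6.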

### **Steps to run this Notebook:**

- **Step 1:** Download the libraries & Load the data
- **Step 2:** Prompt the text generative LLM - using the prompt given below
- **Step 3:** Adding the summary to the pandas df to execute results & download
- **Step 4:** Compress all in 1 function

### **Step 1:** Download the libraries & Load the data

In [None]:
# Install rouge-score
# !pip install datasets  # only needed to generate new sample summaries
!pip install rouge-score

In [2]:
# Import libraries
# from datasets import load_dataset  # only needed to regenerate the sample (requires the datasets package)
from rouge_score import rouge_scorer
import pandas as pd

In [13]:
# Keep the lines below to generate other samples
# xsum_dataset = load_dataset("xsum", version="1.2.0")
# xsum_sample = xsum_dataset["train"].select(range(10))
# xsum_sample.to_csv("dataset_sample_summaries.csv")

In [14]:
# Load the dataset
xsum_sample = pd.read_csv("/content/dataset_sample_summaries.csv")

In [15]:
xsum_sample

Unnamed: 0,document,summary,id
0,"The full cost of damage in Newton Stewart, one...",Clean-up operations are continuing across the ...,35232142
1,A fire alarm went off at the Holiday Inn in Ho...,Two tourist buses have been destroyed by fire ...,40143035
2,Ferrari appeared in a position to challenge un...,Lewis Hamilton stormed to pole position at the...,35951548
3,"John Edward Bates, formerly of Spalding, Linco...",A former Lincolnshire Police officer carried o...,36266422
4,Patients and staff were evacuated from Cerahpa...,An armed man who locked himself into a room at...,38826984
5,Simone Favaro got the crucial try with the las...,Defending Pro12 champions Glasgow Warriors bag...,34540833
6,"Veronica Vanessa Chango-Alverez, 31, was kille...",A man with links to a car that was involved in...,20836172
7,Belgian cyclist Demoitie died after a collisio...,Welsh cyclist Luke Rowe says changes to the sp...,35932467
8,"Gundogan, 26, told BBC Sport he ""can see the f...",Manchester City midfielder Ilkay Gundogan says...,40758845
9,The crash happened about 07:20 GMT at the junc...,A jogger has been hit by an unmarked police ca...,30358490


In [18]:
print(xsum_sample.shape)

(10, 3)



In [19]:
document_array = xsum_sample['document']
print(document_array)

0    The full cost of damage in Newton Stewart, one...
1    A fire alarm went off at the Holiday Inn in Ho...
2    Ferrari appeared in a position to challenge un...
3    John Edward Bates, formerly of Spalding, Linco...
4    Patients and staff were evacuated from Cerahpa...
5    Simone Favaro got the crucial try with the las...
6    Veronica Vanessa Chango-Alverez, 31, was kille...
7    Belgian cyclist Demoitie died after a collisio...
8    Gundogan, 26, told BBC Sport he "can see the f...
9    The crash happened about 07:20 GMT at the junc...
Name: document, dtype: object


### **Step 2:** Prompt the text generative LLM - using the prompt given below


**Query the text-generating LLM with the following prompt,** replacing PASTE_DOCUMENTS_HERE with the copied documents:

```
Please generate a summary in one line (max 25 words) for each of the following documents: PASTE_DOCUMENTS_HERE, please just return the answer as the following: results={"generated_summary":["","","","",""]}
```
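Instead of pasting the documents by hand, the prompt could be assembled in code. The sketch below is one possible way to do that, not part of the original workflow: `PROMPT_TEMPLATE` and `build_prompt` are hypothetical names, and the two documents are placeholders. It serialises the documents as a JSON list and emits one empty slot per document in the requested answer format.

```python
import json

# Same wording as the prompt above, with the document list and answer slots left open
PROMPT_TEMPLATE = (
    "Please generate a summary in one line (max 25 words) for each of the "
    "following documents: {documents}, please just return the answer as the "
    'following: results={{"generated_summary":[{placeholders}]}}'
)

def build_prompt(documents):
    """Fill the prompt with the documents and one empty answer slot per document."""
    placeholders = ",".join('""' for _ in documents)
    return PROMPT_TEMPLATE.format(
        documents=json.dumps(list(documents), ensure_ascii=False),
        placeholders=placeholders,
    )

# Placeholder documents; in the notebook this would be the `document` column
prompt = build_prompt(["First document text.", "Second document text."])
print(prompt)
```

Generating the empty slots from the number of documents keeps the requested answer length in step with the input.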

In [44]:
# Example output from ChatGPT (repeat this step for every generative model under test)
results={"generated_summary":["Newton Stewart and Hawick face flood aftermath, Lamington Viaduct disrupts trains, First Minister inspects, and more preventative measures needed.","Fire alarm at Holiday Inn prompts evacuation; two tour buses, belonging to German and Chinese/Taiwanese groups, were deliberately set ablaze in Northern Ireland.","Mercedes dominates Bahrain GP qualifying with Hamilton securing pole, Vandoorne impresses on debut, controversial qualifying system retained.","John Edward Bates faces sexual abuse charges dating back to 1970s, denies allegations, trial ongoing.","Cerahpasa hospital evacuated after patient threatens violence, no hostages, Istanbul tensions rise amid recent attacks."]}
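If the model's reply is captured as text rather than pasted into a cell, it can be parsed without executing it. The following sketch assumes the reply contains a single Python-style dictionary literal in the requested `results={...}` shape; `parse_results` is a hypothetical helper and the reply string is a placeholder.

```python
import ast

def parse_results(response_text):
    """Extract the dict from a reply shaped like results={"generated_summary": [...]}."""
    start = response_text.find("{")
    end = response_text.rfind("}")
    if start == -1 or end == -1:
        raise ValueError("No dictionary literal found in the response")
    # literal_eval only accepts literals, so no code in the reply is executed
    parsed = ast.literal_eval(response_text[start:end + 1])
    if not isinstance(parsed, dict):
        raise ValueError("Expected a dictionary")
    summaries = parsed.get("generated_summary")
    if not isinstance(summaries, list) or not all(isinstance(s, str) for s in summaries):
        raise ValueError("Expected a list of strings under 'generated_summary'")
    return parsed

# Placeholder reply, not real model output
reply = 'results={"generated_summary":["First summary.","Second summary."]}'
results = parse_results(reply)
print(len(results["generated_summary"]))
```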

### **Step 3:** Adding the summary to the pandas df to execute results & download


In [45]:
# Pair each generated summary with the reference summary and document at the same row position
opt_result = pd.DataFrame(results).join(xsum_sample)[["generated_summary", "summary", "document"]]
display(opt_result.head())

Unnamed: 0,generated_summary,summary,document
0,Newton Stewart and Hawick face flood aftermath...,Clean-up operations are continuing across the ...,"The full cost of damage in Newton Stewart, one..."
1,Fire alarm at Holiday Inn prompts evacuation; ...,Two tourist buses have been destroyed by fire ...,A fire alarm went off at the Holiday Inn in Ho...
2,Mercedes dominates Bahrain GP qualifying with ...,Lewis Hamilton stormed to pole position at the...,Ferrari appeared in a position to challenge un...
3,John Edward Bates faces sexual abuse charges d...,A former Lincolnshire Police officer carried o...,"John Edward Bates, formerly of Spalding, Linco..."
4,Cerahpasa hospital evacuated after patient thr...,An armed man who locked himself into a room at...,Patients and staff were evacuated from Cerahpa...
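The join above pairs rows purely by position, so it is only correct when the i-th generated summary belongs to the i-th document. A minimal sketch of a more defensive variant follows; `join_generated` is a hypothetical helper and the toy frame stands in for `xsum_sample`.

```python
import pandas as pd

def join_generated(sample_df, generated_summaries):
    """Attach generated summaries to the first rows of the sample, in order."""
    if len(generated_summaries) > len(sample_df):
        raise ValueError("More generated summaries than documents")
    # Keep only as many rows as there are summaries, preserving their order
    subset = sample_df.iloc[:len(generated_summaries)].reset_index(drop=True)
    subset.insert(0, "generated_summary", list(generated_summaries))
    return subset[["generated_summary", "summary", "document"]]

# Toy data standing in for the real sample
toy_sample = pd.DataFrame({
    "document": ["doc A", "doc B", "doc C"],
    "summary": ["ref A", "ref B", "ref C"],
})
joined = join_generated(toy_sample, ["gen A", "gen B"])
print(joined.shape)
```

This still assumes the summaries arrive in document order; it cannot detect a summary that was skipped in the middle of the list.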


In [46]:
print("Generated Summary : ",opt_result.iloc[0]["generated_summary"])
print(30*"-")
print("Summary : ",opt_result.iloc[0]["summary"])

Generated Summary :  Newton Stewart and Hawick face flood aftermath, Lamington Viaduct disrupts trains, First Minister inspects, and more preventative measures needed.
------------------------------
Summary :  Clean-up operations are continuing across the Scottish Borders and Dumfries and Galloway after flooding caused by Storm Frank.
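To see why this pair will score low, it helps to list the tokens the two texts actually share. The sketch below does so with a simple lowercase, alphanumeric tokenisation and no stemming, which is an assumption and may not match the `rouge-score` tokenizer exactly.

```python
import re
from collections import Counter

generated = ("Newton Stewart and Hawick face flood aftermath, Lamington Viaduct "
             "disrupts trains, First Minister inspects, and more preventative measures needed.")
reference = ("Clean-up operations are continuing across the Scottish Borders and "
             "Dumfries and Galloway after flooding caused by Storm Frank.")

def tokens(text):
    """Lowercase the text and split it on non-alphanumeric characters."""
    return re.findall(r"[a-z0-9]+", text.lower())

# Multiset intersection: tokens present in both texts, with clipped counts
shared = Counter(tokens(generated)) & Counter(tokens(reference))
print(dict(shared))
print(len(tokens(generated)), len(tokens(reference)))
```

Without stemming, only the word "and" overlaps. With a stemmer, "flood" and "flooding" would presumably also match, which is consistent with the ROUGE-1 value of about 0.158 (3 of 19 tokens) reported for this row further down.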


#### **Calculating the ROUGE score:**

In [49]:
def calculate_rouge(data):
    # scorer.score(target, prediction) returns, per metric, a (precision, recall, fmeasure) tuple
    scorer = rouge_scorer.RougeScorer(['rouge1', 'rouge2', 'rougeL'], use_stemmer=True)
    # Store the F-measure of each metric; all three columns appear in the score table below
    data["r1_fscore"] = data.apply(lambda row: scorer.score(row["summary"], row["generated_summary"])['rouge1'].fmeasure, axis=1)
    data["r2_fscore"] = data.apply(lambda row: scorer.score(row["summary"], row["generated_summary"])['rouge2'].fmeasure, axis=1)
    data["rl_fscore"] = data.apply(lambda row: scorer.score(row["summary"], row["generated_summary"])['rougeL'].fmeasure, axis=1)
    return data

In [50]:
score_ret=calculate_rouge(opt_result)

In [51]:
print("ROUGE - 1 : ",score_ret["r1_fscore"].mean())
# print("ROUGE - 2 : ",score_ret["r2_fscore"].mean())
# print("ROUGE - L : ",score_ret["rl_fscore"].mean()) # longest common subsequence between the model-generated summary and the reference summary

ROUGE - 1 :  0.13631762332660918
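ROUGE-L, mentioned in the commented-out line above, is based on the longest common subsequence (LCS) of the two token sequences, so unlike ROUGE-1 it is sensitive to word order. The sketch below is a hand-rolled illustration with simple tokenisation and no stemming; `lcs_length` and `rouge_l_by_hand` are hypothetical names and the sentences are made up.

```python
import re

def lcs_length(a, b):
    """Length of the longest common subsequence of two token lists."""
    prev = [0] * (len(b) + 1)
    for token_a in a:
        curr = [0]
        for j, token_b in enumerate(b, start=1):
            if token_a == token_b:
                curr.append(prev[j - 1] + 1)
            else:
                curr.append(max(prev[j], curr[j - 1]))
        prev = curr
    return prev[-1]

def rouge_l_by_hand(reference, generated):
    """F1 based on the LCS of the two token sequences (illustrative only)."""
    ref = re.findall(r"[a-z0-9]+", reference.lower())
    gen = re.findall(r"[a-z0-9]+", generated.lower())
    lcs = lcs_length(ref, gen)
    if lcs == 0:
        return 0.0
    precision, recall = lcs / len(gen), lcs / len(ref)
    return 2 * precision * recall / (precision + recall)

# Same words in a different order: every unigram matches, but the LCS is shorter
score = rouge_l_by_hand("the cat sat on the mat", "the mat sat on the cat")
print(round(score, 3))
```

Here the two sentences contain identical words, yet the longest common subsequence ("the sat on the") covers only four of six tokens, so the LCS-based score is 4/6.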


In [52]:
score_ret

Unnamed: 0,generated_summary,summary,document,r1_fscore,r2_fscore,rl_fscore
0,Newton Stewart and Hawick face flood aftermath...,Clean-up operations are continuing across the ...,"The full cost of damage in Newton Stewart, one...",0.157895,0.0,0.105263
1,Fire alarm at Holiday Inn prompts evacuation; ...,Two tourist buses have been destroyed by fire ...,A fire alarm went off at the Holiday Inn in Ho...,0.195122,0.0,0.146341
2,Mercedes dominates Bahrain GP qualifying with ...,Lewis Hamilton stormed to pole position at the...,Ferrari appeared in a position to challenge un...,0.228571,0.0,0.114286
3,John Edward Bates faces sexual abuse charges d...,A former Lincolnshire Police officer carried o...,"John Edward Bates, formerly of Spalding, Linco...",0.0,0.0,0.0
4,Cerahpasa hospital evacuated after patient thr...,An armed man who locked himself into a room at...,Patients and staff were evacuated from Cerahpa...,0.1,0.0,0.1


In [53]:
model_name = "chat_gpt"

In [57]:
# Build a one-row summary: model name and mean ROUGE-1 F-score, rounded to 2 decimals
mean_rouge_1 = round(score_ret["r1_fscore"].mean(), 2)
df = pd.DataFrame({"model_name": [model_name], "rouge_1": [mean_rouge_1]})

In [58]:
df.to_csv(f"{model_name}.csv", index=False)

In [59]:
df = pd.read_csv(f"/content/{model_name}.csv")
print(df)


  model_name  rouge_1
0   chat_gpt     0.14
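Since one CSV is written per model, the per-model files can later be gathered into a single comparison table. The sketch below shows one way this might be done, assuming every file in a folder has the `model_name` and `rouge_1` columns produced above; `combine_model_scores`, the model names and the scores are placeholders.

```python
import glob
import os
import tempfile
import pandas as pd

def combine_model_scores(folder):
    """Concatenate every per-model CSV in `folder`, sorted by ROUGE-1 (best first)."""
    paths = sorted(glob.glob(os.path.join(folder, "*.csv")))
    frames = [pd.read_csv(path) for path in paths]
    combined = pd.concat(frames, ignore_index=True)
    return combined.sort_values("rouge_1", ascending=False).reset_index(drop=True)

# Demo with placeholder scores written to a temporary folder
with tempfile.TemporaryDirectory() as folder:
    for name, score in [("model_a", 0.10), ("model_b", 0.20)]:
        pd.DataFrame({"model_name": [name], "rouge_1": [score]}).to_csv(
            os.path.join(folder, f"{name}.csv"), index=False)
    leaderboard = combine_model_scores(folder)

print(leaderboard)
```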


### **Step 4:** Compress all in 1 function

In [16]:
import pandas as pd
from rouge_score import rouge_scorer

def calculate_and_export_rouge(model_name, results):
    """Score generated summaries against references and export the mean ROUGE-1 F-score.

    `results` must provide "summary" (reference) and "generated_summary" columns.
    """
    scorer = rouge_scorer.RougeScorer(['rouge1'], use_stemmer=True)
    # Create DataFrame from the joined results
    data = pd.DataFrame(results)
    # Calculate the ROUGE-1 F-measure for each row
    data["r1_fscore"] = data.apply(lambda row: scorer.score(row["summary"], row["generated_summary"])['rouge1'].fmeasure, axis=1)
    # Calculate mean ROUGE-1 score
    mean_rouge_1 = round(data["r1_fscore"].mean(), 2)
    # Create DataFrame with mean ROUGE-1 score and model name
    df = pd.DataFrame({
        "model_name": [model_name],
        "rouge_1": [mean_rouge_1]
    })
    # Export to CSV
    df.to_csv(f"{model_name}.csv", index=False)

In [19]:
xsum_sample = pd.read_csv("/content/dataset_sample_summaries.csv")
model_name = "chat_gpt"
# Generate the results by copying the documents displayed by the expression below into the prompt:
xsum_sample[['document']]
# Click the icon next to *document* (convert this dataframe to an interactive table), then choose
# "copy table", select JSON, and paste the result into the prompt below in place of PASTE_DOCUMENTS_HERE.
# Then copy the entire prompt and send it to the LLM.

Unnamed: 0,document
0,"The full cost of damage in Newton Stewart, one..."
1,A fire alarm went off at the Holiday Inn in Ho...
2,Ferrari appeared in a position to challenge un...
3,"John Edward Bates, formerly of Spalding, Linco..."
4,Patients and staff were evacuated from Cerahpa...
5,Simone Favaro got the crucial try with the las...
6,"Veronica Vanessa Chango-Alverez, 31, was kille..."
7,Belgian cyclist Demoitie died after a collisio...
8,"Gundogan, 26, told BBC Sport he ""can see the f..."
9,The crash happened about 07:20 GMT at the junc...


In [20]:
# Please generate a summary in one line (max 25 words) for each of the following documents: PASTE_DOCUMENTS_HERE, please just return the answer as the following: results={"generated_summary":["","","","",""]}

In [21]:
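# Example usage:
# Caution: the join below pairs rows by position, so the list needs one summary per document, in document order.
# This example list has 8 entries for 10 documents (none for documents 5 and 6), so entries 5-7 end up
# paired with the references of documents 5-7 rather than with their own documents (7-9).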
results={"generated_summary":[
"The cost of flood damage in Newton Stewart is being assessed; repair work is ongoing in Hawick and Peeblesshire. Disruption on the west coast mainline due to damage at Lamington Viaduct. Businesses and householders affected by flooding in Newton Stewart. Nicola Sturgeon inspected the damage. A retaining wall breached, flooding commercial properties on Victoria Street. More preventative work is suggested. Flood alert remains in Borders; Peebles badly hit. Scottish Borders Council lists worst affected roads.",
"A fire alarm at the Holiday Inn in Hope Street caused guests to evacuate. Two buses parked in the car park were engulfed by flames. Tour groups from Germany, China, and Taiwan were affected. Police appeal for information; fire believed to be deliberate.",
"Mercedes secured pole position in Bahrain Grand Prix. Sebastian Vettel starts third ahead of Kimi Raikkonen. Stoffel Vandoorne out-qualified Jenson Button in his F1 debut. Hamilton escaped punishment for reversing in pit lane. Controversial elimination qualifying system retained. Mercedes favorites despite Ferrari's pace.",
"John Edward Bates faces 22 charges including indecency with a child. Allegations made by four male complainants relate to his time as a scout leader. Bates denies all charges. Prosecutor claims sexual abuse incidents involving minors occurred in Lincolnshire and Cambridgeshire.",
"Cerahpasa hospital evacuated after man threatens to shoot himself and others. Officers negotiate with the man, a young police officer. No hostages taken. Gunman receiving psychiatric treatment; previously deemed unfit to carry a firearm. Incident adds to tension in Istanbul following recent attacks.",
"Belgian cyclist Demoitie dies after collision with motorbike during Gent-Wevelgem race. UCI to co-operate in investigation. Incident raises questions about race safety. Separate incident sees Belgian cyclist Daan Myngheer die after heart attack during Criterium International.",
"Ilkay Gundogan discusses recovery from knee injury. Faces mental challenge after missing World Cup and Euros due to injury. Recovery now measured in weeks. City optimistic for upcoming season. Tottenham seen as a major title contender.",
"Man airlifted to hospital after crash in Leigh-on-Sea, Essex. Occurred at junction of A127 and Progress Road. Man in his 20s treated for head injury and suspected fractures. A127 Southend-bound carriageway closed for six hours for police investigation. IPCC conducting further investigation."
]}
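The prompt asks for at most 25 words per summary, but a model may not comply, and longer outputs can shift the ROUGE scores. A small check such as the following sketch can flag offending entries before scoring; `over_word_limit` is a hypothetical helper, words are counted by whitespace splitting, and the example strings are placeholders.

```python
def over_word_limit(summaries, max_words=25):
    """Return (index, word_count) for every summary longer than max_words."""
    return [
        (i, len(s.split()))
        for i, s in enumerate(summaries)
        if len(s.split()) > max_words
    ]

# Placeholder summaries: one short, one with 30 words
example = ["A short summary.", " ".join(["word"] * 30)]
violations = over_word_limit(example)
print(violations)
```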

In [23]:
# Pair each generated summary with the reference summary and document at the same row position
opt_result = pd.DataFrame(results).join(xsum_sample)[["generated_summary", "summary", "document"]]
calculate_and_export_rouge(model_name, opt_result)

In [24]:
df = pd.read_csv(f"/content/{model_name}.csv")
df

Unnamed: 0,model_name,rouge_1
0,chat_gpt,0.15
