### **Notebook 2: Compute the results:**

This notebook contains the code needed to compute all the metrics for one LLM at a time. The final results are saved to a folder and merged at the end.

**Important:** Do not push any modifications (e.g., pasted prompts or added results); keep the notebook clean.

In [448]:
model_name = "Meta_Lama_3_70B"

In [449]:
# !pip install bert-score
# !pip install sentence-transformers
# !pip install rouge-score

In [450]:
# import all libraries here for clarity
import pandas as pd
import bert_score
from rouge_score import rouge_scorer
import nltk
from nltk.translate.bleu_score import corpus_bleu, SmoothingFunction
import numpy as np
from sentence_transformers import SentenceTransformer, util

In [451]:
# Import all datasets here for clarity
xsum_sample = pd.read_csv("/content/data/dataset_sample_summaries_v2.csv") # Adjust path if necessary
sentiment = pd.read_csv("/content/data/dataset_sample_movie_reviews_v2.csv") # Adjust path if necessary
fact_checking = pd.read_csv("/content/data/dataset_sample_fact_checking_v2.csv") # Adjust path if necessary

#### **1: Generate all the prompts:**

##### 1.1: Prompt for the ROUGE, BLEU, BERTScore and RoBERTa metrics:

In [452]:
# ROUGE, BLEU, BERTScore and RoBERTa: generate the summaries by prompting the LLM as described below.
xsum_sample[['document']]
# Click the icon next to *document* ("Convert this dataframe to an interactive table"), then choose "Copy table", select JSON, and copy.
# Paste the result into the prompt below, replacing **PASTE_DOCUMENTS_HERE**, then copy the entire prompt and send it to the LLM.

# PROMPT:
# Please generate a summary in one line (max 25 words) for each of the following documents: PASTE_DOCUMENTS_HERE, please just return the answer as the following: results_rouge_bleu_bert_roberta={"generated_summary":["","","","",""]}

Unnamed: 0,document
0,"Prison Link Cymru had 1,099 referrals in 2015-..."
1,Officers searched properties in the Waterfront...
2,"Jordan Hill, Brittany Covington and Tesfaye Co..."
3,The 48-year-old former Arsenal goalkeeper play...
4,Restoring the function of the organ - which he...
5,But there certainly should be.\nThese are two ...
6,Media playback is not supported on this device...
7,It's no joke. But Kareem Badr says people did ...
8,Relieved that the giant telecoms company would...
9,"""I'm really looking forward to it - the home o..."
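
The copy-and-paste step above can also be scripted. The following is a minimal sketch, assuming the documents are available as a plain Python list (for example `xsum_sample["document"].tolist()`); `build_summary_prompt` is a hypothetical helper, not part of the notebook, and its wording only approximates the prompt shown above.

```python
import json

def build_summary_prompt(documents, max_words=25):
    """Return a summarization prompt with the documents embedded as JSON."""
    docs_json = json.dumps(documents, ensure_ascii=False)
    # One empty slot per document, so the answer template matches the input size.
    empty_slots = json.dumps({"generated_summary": [""] * len(documents)})
    return (
        f"Please generate a summary in one line (max {max_words} words) for each "
        f"of the following documents: {docs_json}. Please return only the answer, "
        f"in the following format: results_rouge_bleu_bert_roberta={empty_slots}"
    )

# Toy documents stand in for xsum_sample["document"].tolist()
prompt = build_summary_prompt(["First article text.", "Second article text."])
print(prompt)
```

Building the answer template from `len(documents)` keeps the number of empty slots in line with the number of documents sent.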


In [455]:
## Paste RESULTS here
# Example usage:
results_rouge_bleu_bert_roberta ={"generated_summary":[
    "Prison Link Cymru claims investment in housing would be cheaper than jailing homeless repeat offenders, with 1,099 referrals in 2015-16.",
    "Police searched properties in Waterfront Park and Colonsay View areas of the city on Wednesday and found three firearms, ammunition, and a large sum of money (over $10,000). A 26-year-old man was arrested, charged, and appeared in court on Thursday.",
    "Officers in Edinburgh recovered three firearms, ammunition, and a five-figure sum of money in a raid on properties.",
    "Former Arsenal goalkeeper Nigel Gibbs leaves West Brom after 16 years, having played a key role in the club's two promotions to the Premier League.",
    "A modified fasting-mimicking diet has been shown to reverse symptoms of diabetes in animal experiments by regenerating cells in the pancreas.",
    "The proposed merger between Essilor and Luxottica raises concerns about reduced competition and choice for consumers in the eyewear industry.",
    "Olympic silver medallist Wendy Houvenaghel accuses British Cycling of   ageism  and having zero regard for her welfare.",
    " Kareem Badr turned around a failing comedy club in Austin, Texas, and now employs 25 part-time and contract workers.",
    "BT dodged a breakup after Ofcom decided against forcing the company to spin off its Openreach division due to practical obstacles.",
    " Celtic manager Brendan Rodgers is looking forward to his maiden visit to Hampden Park for the Scottish League Cup semi-final against Rangers.",
]}
#{"generated_summary":["","","","",""]}
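
Before scoring, it may be worth checking that the pasted answer has one summary per document and respects the 25-word limit requested in the prompt (the second summary pasted above, for instance, is noticeably longer). A minimal sketch, where `check_summaries` is a hypothetical helper and words are counted by whitespace splitting:

```python
def check_summaries(results, expected_count, max_words=25):
    """Return a list of human-readable problems found in the pasted summaries."""
    summaries = results["generated_summary"]
    problems = []
    if len(summaries) != expected_count:
        problems.append(f"expected {expected_count} summaries, got {len(summaries)}")
    for i, summary in enumerate(summaries):
        n_words = len(summary.split())
        if n_words > max_words:
            problems.append(f"summary {i} has {n_words} words (limit {max_words})")
    return problems

# Toy input: the second summary is deliberately 30 words long.
toy = {"generated_summary": ["A short summary.", "word " * 30]}
problems = check_summaries(toy, expected_count=2)
print(problems)
```

In the notebook this could be called as `check_summaries(results_rouge_bleu_bert_roberta, len(xsum_sample))`.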

##### 1.2: Prompt for sentiment analysis:

In [456]:
sentiment[["Review"]]
# Click the icon next to *Review* ("Convert this dataframe to an interactive table"), then choose "Copy table", select JSON, and copy.
# Paste the result into the prompt in the cell below, replacing **PASTE_SENTENCES_HERE**, then copy the entire prompt and send it to the LLM.


Unnamed: 0,Review
0,"Just got through watching this version of ""Sam..."
1,"In this forgettable trifle, the 40-ish Norma S..."
2,Peter O'Toole is a treat to watch in roles whe...
3,This is one of the greatest sports movies ever...
4,First of all this movie is not a comedy; unles...
5,Without a doubt this is one of the worst films...
6,'Deliverance' is a brilliant condensed epic of...
7,The arrival of White Men in Arctic Canada chal...
8,"Curiously, Season 6 of the Columbo series cont..."
9,The year 2005 saw no fewer than 3 filmed produ...


In [457]:

# PROMPT:
#
# Please classify the following 10 sentences: positive or negative. Here are the sentences please provide an array for answer i.e: result_sentiment_analysis = ['positive','negative',...]:: PASTE_SENTENCES_HERE. please return the answers as an array
result_sentiment_analysis =  ['negative', 'negative', 'positive', 'positive', 'negative', 'negative', 'positive', 'positive', 'negative', 'positive']
# predicted_labels = ['', '', '', '', '', '', '', '', '', '']

##### 1.3: Prompt for fact checking:

In [458]:
fact_checking[['Prompt']]
# Click the icon next to *Prompt* ("Convert this dataframe to an interactive table"), then choose "Copy table", select JSON, and copy.
# Paste the result into the prompt below, replacing **COPY_QUESTIONS**, then copy the entire prompt and send it to the LLM.

# PROMPT:
# Please answer the following questions:COPY_QUESTIONS please answer in the following format: result_fact_checking= [answer_1,answer_2]
#arr=[]
result_fact_checking =  [76.0, 8, "Naypyidaw", 8, "Kuwait City", "Dacca", "Pretoria", 12, 83, "Tokyo"]

#### **2: Generate all the results:**

##### 2.1: Results for the ROUGE, BLEU, BERTScore and RoBERTa metrics:

In [459]:
def calculate_and_export_rouge(model_name, results):

    scorer = rouge_scorer.RougeScorer(['rouge1', 'rougeL'], use_stemmer=True)
    # Create DataFrame from results dictionary
    data = pd.DataFrame(results)
    # Calculate ROUGE F-measures; scorer.score takes (reference, generated summary)
    data["r1_fscore"] = data.apply(lambda row: scorer.score(row["summary"], row["generated_summary"])['rouge1'].fmeasure, axis=1)
    data["rl_fscore"] = data.apply(lambda row: scorer.score(row["summary"], row["generated_summary"])['rougeL'].fmeasure, axis=1)
    # Calculate mean ROUGE-1 and ROUGE-L (longest common subsequence) scores
    mean_rouge_1 = round(data["r1_fscore"].mean(), 2)
    mean_rouge_LSC = round(data["rl_fscore"].mean(), 2)
    # Create DataFrame with the mean ROUGE scores and model name
    df = pd.DataFrame({
        "model_name": [model_name],
        "ROUGE_1": [mean_rouge_1],
        "ROUGE_LSC": [mean_rouge_LSC]
    })
    # Return the one-row result DataFrame
    return df

In [460]:
opt_result_rouge = pd.DataFrame.from_dict(results_rouge_bleu_bert_roberta).rename({"summary_text": "generated_summary"}, axis=1).join(pd.DataFrame.from_dict(xsum_sample))[["generated_summary", "summary", "document"]]
df_rouge_result = calculate_and_export_rouge(model_name, opt_result_rouge)
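
To make the ROUGE-1 F-measure concrete, here is a simplified worked example in plain Python. It is a sketch, not the `rouge_score` implementation: it lowercases and splits on whitespace, with no stemming, so its numbers will differ somewhat from the library's on real text. `rouge1_f` is a hypothetical name.

```python
from collections import Counter

def rouge1_f(reference, candidate):
    """Unigram-overlap F-measure on lowercase whitespace tokens (no stemming)."""
    ref_counts = Counter(reference.lower().split())
    cand_counts = Counter(candidate.lower().split())
    # Counter intersection keeps the minimum count of each shared token.
    overlap = sum((ref_counts & cand_counts).values())
    if overlap == 0:
        return 0.0
    precision = overlap / sum(cand_counts.values())
    recall = overlap / sum(ref_counts.values())
    return 2 * precision * recall / (precision + recall)

# 4 shared tokens (the, cat, on, mat) out of 6 on each side: P = R = F = 2/3
score = rouge1_f("the cat sat on the mat", "the cat lay on a mat")
print(round(score, 3))
```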

In [461]:
def calculate_and_export_bleu(model_name, results):
    # Extract reference summaries and generated summaries
    references = results["summary"].tolist()
    hypotheses = results["generated_summary"].tolist()

    # Calculate BLEU score with smoothing.
    # NOTE: corpus_bleu expects tokenized input (each hypothesis as a list of tokens,
    # and for each hypothesis a list of reference token lists). Raw strings, as
    # passed here, are iterated character by character.
    smoothie = SmoothingFunction().method7
    bleu_score = corpus_bleu(references, hypotheses, smoothing_function=smoothie)

    # Create DataFrame with BLEU score and model name
    df = pd.DataFrame({
        "model_name": [model_name],
        "BLEU": [bleu_score]
    })

    # Return the one-row result DataFrame
    return df

In [462]:
opt_result_rouge = pd.DataFrame.from_dict(results_rouge_bleu_bert_roberta).rename({"summary_text": "generated_summary"}, axis=1).join(pd.DataFrame.from_dict(xsum_sample))[["generated_summary", "summary", "document"]]
df_blue_result = calculate_and_export_bleu(model_name, opt_result_rouge)
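
BLEU is built from clipped n-gram precisions, so the choice of tokens matters: NLTK's `corpus_bleu` works on token lists, which means word-level scores require splitting the strings first. The sketch below illustrates only the unigram building block, on whitespace tokens and then on characters for contrast. It is not the NLTK implementation, and `clipped_unigram_precision` is a hypothetical helper.

```python
from collections import Counter

def clipped_unigram_precision(reference_tokens, hypothesis_tokens):
    """Share of hypothesis tokens found in the reference, with counts clipped."""
    ref_counts = Counter(reference_tokens)
    hyp_counts = Counter(hypothesis_tokens)
    # A token is credited at most as many times as it occurs in the reference.
    clipped = sum(min(count, ref_counts[tok]) for tok, count in hyp_counts.items())
    return clipped / max(len(hypothesis_tokens), 1)

reference = "police found three firearms in the raid"
hypothesis = "officers found three firearms"

# Word level: split both strings into tokens first (3 of 4 words match).
word_level = clipped_unigram_precision(reference.split(), hypothesis.split())
# Character level: the same measure on single characters, for contrast.
char_level = clipped_unigram_precision(list(reference), list(hypothesis))
print(word_level, char_level)
```

Character-level overlap is typically much higher than word-level overlap, which is why scores computed on raw strings and on token lists are not comparable.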

In [463]:
# Define function to calculate BERTScore precision
def calculate_bert_precision(model_name, data):
    # Extract reference summaries and generated summaries
    references = data["summary"].tolist()
    hypotheses = data["generated_summary"].tolist()

    # Compute BERTScore; score() returns (precision, recall, F1) and only precision is kept
    precision, _, _ = bert_score.score(hypotheses, references, lang="en")

    BertScore = precision.mean().item()
    # Create DataFrame with BERTScore precision and model name
    df = pd.DataFrame({
        "model_name": [model_name],
        "BERTSCORE": [BertScore]
    })

    return df

In [464]:
opt_result = pd.DataFrame.from_dict(results_rouge_bleu_bert_roberta).rename({"summary_text": "generated_summary"}, axis=1).join(pd.DataFrame.from_dict(xsum_sample))[["generated_summary", "summary", "document"]]
df_bert_score_results = calculate_bert_precision(model_name, opt_result)

Some weights of RobertaModel were not initialized from the model checkpoint at roberta-large and are newly initialized: ['roberta.pooler.dense.bias', 'roberta.pooler.dense.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


In [465]:
def calculate_semantic_similarity(data, model_name):
    # Load a pre-trained Sentence Transformer model
    model = SentenceTransformer('roberta-base-nli-stsb-mean-tokens')

    # Define function to calculate semantic similarity score
    def compute_similarity(generated_sentence, reference_sentence):
        # Encode sentences into embeddings
        generated_embedding = model.encode(generated_sentence, convert_to_tensor=True)
        reference_embedding = model.encode(reference_sentence, convert_to_tensor=True)

        # Compute cosine similarity between embeddings
        cosine_similarity = util.pytorch_cos_sim(generated_embedding, reference_embedding)
        return cosine_similarity.item()

    # Ensure both lists have the same length
    assert len(data["generated_summary"]) == len(data["summary"]), "Lengths of generated_summary and summary lists must be the same"

    # Calculate semantic similarity for each pair of generated and reference sentences in the data
    similarity_scores = []
    for generated_sentence, reference_sentence in zip(data["generated_summary"], data["summary"]):
        similarity_score = compute_similarity(generated_sentence, reference_sentence)
        similarity_scores.append(similarity_score)

    # Compute the mean of the similarity scores
    mean_similarity = sum(similarity_scores) / len(similarity_scores)

    # Create DataFrame with model name and mean similarity score
    df = pd.DataFrame({
        "model_name": [model_name],
        "ROBERTA": [mean_similarity]
    })
    return df

In [466]:
df_robert_results = calculate_semantic_similarity(opt_result,model_name)
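
The ROBERTA column is the mean cosine similarity between the embedding of each generated summary and that of its reference. As a reminder of what that measure does, here is a plain-Python sketch on toy vectors; real sentence embeddings have hundreds of dimensions, and `cosine_similarity` is a hypothetical stand-in for `util.pytorch_cos_sim`.

```python
import math

def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        return 0.0  # convention chosen for this sketch: zero vectors score 0
    return dot / (norm_u * norm_v)

# Vectors pointing in the same direction score about 1, orthogonal ones score 0.
same_direction = cosine_similarity([1.0, 2.0, 3.0], [2.0, 4.0, 6.0])
orthogonal = cosine_similarity([1.0, 0.0], [0.0, 1.0])
print(same_direction, orthogonal)
```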



##### 2.2: Result for Sentiment Analysis

In [467]:
def calculate_and_export_sent_analysis(model_name, opt_result):
    # Convert both predicted and ground truth labels to lowercase
    # (note: this writes into the global `sentiment` DataFrame)
    predicted_labels = [label.lower() for label in opt_result]
    sentiment['Predicted_Labels'] = predicted_labels
    sentiment['Ground_Truth_Label'] = sentiment['Ground_Truth_Label'].str.lower()
    # Count correct predictions
    correct_predictions = sum(sentiment['Ground_Truth_Label'] == sentiment['Predicted_Labels'])
    total_reviews = len(sentiment)
    accuracy = correct_predictions / total_reviews
    # Create DataFrame with accuracy and model name
    new_data = {
        'model_name': [model_name],
        'SENTIMENT': [accuracy]
    }
    new_df = pd.DataFrame(new_data)
    return new_df

In [468]:
df_sentiment_result = calculate_and_export_sent_analysis(model_name, result_sentiment_analysis)
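
The SENTIMENT score is plain accuracy: the share of reviews whose predicted label equals the ground-truth label. A minimal sketch that does not touch the `sentiment` DataFrame is shown below; `label_accuracy` is a hypothetical helper, and the toy labels are made up for illustration.

```python
def label_accuracy(predicted, ground_truth):
    """Fraction of positions where the normalized labels agree."""
    if len(predicted) != len(ground_truth):
        raise ValueError("predicted and ground_truth must have the same length")

    def normalize(label):
        # Ignore case and stray whitespace in pasted labels.
        return label.strip().lower()

    matches = sum(normalize(p) == normalize(g) for p, g in zip(predicted, ground_truth))
    return matches / len(ground_truth)

predicted = ["negative", "Positive", "positive ", "negative"]
truth = ["negative", "positive", "negative", "negative"]
accuracy = label_accuracy(predicted, truth)
print(accuracy)  # 3 of 4 labels match
```

Raising on a length mismatch catches the case where the LLM returns fewer labels than there are reviews.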

##### 2.3: Result for Fact checking

In [469]:
def calculate_and_export_fact_checking(model_name, results):
    def similarity_metric(array1, array2):
        # Fraction of ground-truth answers that appear anywhere in the predictions.
        # NOTE: this is a membership test, so it ignores the position of each answer.
        count = 0
        for elem1 in array1:
            if elem1 in array2:
                count += 1
        similarity_score = count / len(array1)
        return similarity_score

    # Ground-truth answers from the dataset
    ground_truth = fact_checking["Answer"].to_numpy()
    # Convert predictions to a numpy array of strings (e.g. 76.0 becomes "76.0")
    array1 = np.array(ground_truth)
    array2 = np.array(results, dtype=str)

    # Compute similarity score
    similarity_score = round(similarity_metric(array1, array2), 2)

    # Create DataFrame
    new_data = {
        'model_name': [model_name],
        'FACTCHECK': [similarity_score]
    }
    df = pd.DataFrame(new_data)
    return df

In [470]:
df_fact_checking = calculate_and_export_fact_checking(model_name,  result_fact_checking)
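
The membership test used above credits an answer if it appears anywhere in the predictions. A stricter alternative compares answer *i* with ground truth *i* after normalizing numbers and case. The sketch below is one possible variant, not the method used for the results in this notebook; both helper names and the toy answers are hypothetical.

```python
def normalize_answer(value):
    """Lowercase text; render whole numbers such as 76.0 as '76'."""
    try:
        number = float(value)
    except (TypeError, ValueError):
        return str(value).strip().lower()
    return str(int(number)) if number.is_integer() else str(number)

def positional_accuracy(predicted, ground_truth):
    """Fraction of positions where prediction and ground truth agree."""
    pairs = list(zip(predicted, ground_truth))
    hits = sum(normalize_answer(p) == normalize_answer(g) for p, g in pairs)
    return hits / len(pairs)

# Toy data: the last answer is wrong, the first three match after normalization.
predicted = [76.0, 8, "Naypyidaw", "Tokyo"]
truth = ["76", "8", "naypyidaw", "Kyoto"]
fact_score = positional_accuracy(predicted, truth)
print(fact_score)
```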

#### **3: Concatenate the results of all the metrics into one CSV:**

In [471]:
print(df_rouge_result)
print(df_blue_result)
print(df_sentiment_result)
print(df_fact_checking)
print(df_bert_score_results)
print(df_robert_results)

        model_name  ROUGE_1  ROUGE_LSC
0  Meta_Lama_3_70B      0.3        0.2
        model_name      BLEU
0  Meta_Lama_3_70B  0.085263
        model_name  SENTIMENT
0  Meta_Lama_3_70B        0.9
        model_name  FACTCHECK
0  Meta_Lama_3_70B        0.8
        model_name  BERTSCORE
0  Meta_Lama_3_70B   0.875584
        model_name   ROBERTA
0  Meta_Lama_3_70B  0.488014


In [472]:
# Merge the per-metric DataFrames on 'model_name'
merged_df = df_rouge_result.merge(df_blue_result, on='model_name')
merged_df = merged_df.merge(df_sentiment_result, on='model_name')
merged_df = merged_df.merge(df_fact_checking, on='model_name')
merged_df = merged_df.merge(df_bert_score_results, on='model_name')
merged_df = merged_df.merge(df_robert_results, on='model_name')

# Extract the model name
model_name = merged_df['model_name'].iloc[0]

# Concatenate values into a single row
single_row = [model_name] + merged_df.values.flatten().tolist()[1:]

# Column order after merging: 'model_name', 'ROUGE_1', 'ROUGE_LSC', 'BLEU', 'SENTIMENT', 'FACTCHECK', 'BERTSCORE', 'ROBERTA'


In [473]:
print(single_row)

['Meta_Lama_3_70B', 0.3, 0.2, 0.08526326883881205, 0.9, 0.8, 0.8755841255187988, 0.4880135789513588]


In [474]:
# Convert single_row to a DataFrame
# Note: the first column header is the model name itself, so it differs between the saved CSV files
columns_list = [model_name, 'Rouge1', 'RougeL','Bleu', 'Sentiment', 'Fact checking', 'Bert Score', 'RoberTa']
single_row_df = pd.DataFrame([single_row], columns=columns_list)

In [475]:
# Export to CSV
single_row_df.to_csv(f'/content/all_results/{model_name}.csv', index=False)

In [476]:
df = pd.read_csv(f"/content/all_results/{model_name}.csv")

In [477]:
df

Unnamed: 0,Meta_Lama_3_70B,Rouge1,RougeL,Bleu,Sentiment,Fact checking,Bert Score,RoberTa
0,Meta_Lama_3_70B,0.3,0.2,0.085263,0.9,0.8,0.875584,0.488014


##### All results

In [293]:
df1 = pd.read_csv("/content/all_results/ChatGPT_4-o.csv")
df1

Unnamed: 0,ChatGPT_4-o,Rouge1,RougeL,Bleu,Sentiment,Fact checking,Bert Score,RoberTa
0,ChatGPT_4-o,0.33,0.21,0.090846,0.9,0.8,0.890361,0.649441


In [294]:
df2 = pd.read_csv("/content/all_results/ChatGPT_4.csv")
df2

Unnamed: 0,ChatGPT_4,Rouge1,RougeL,Bleu,Sentiment,Fact checking,Bert Score,RoberTa
0,ChatGPT_4,0.27,0.16,0.090277,0.9,0.8,0.879445,0.510433


In [295]:
df3 = pd.read_csv("/content/all_results/ChatGPT_3.5.csv")
df3

Unnamed: 0,ChatGPT_3.5,Rouge1,RougeL,Bleu,Sentiment,Fact checking,Bert Score,RoberTa
0,ChatGPT_3.5,0.32,0.2,0.086001,0.9,1.0,0.879172,0.618968


In [325]:
df4 = pd.read_csv("/content/all_results/Mistral_small.csv")
df4

Unnamed: 0,Mistral_small,Rouge1,RougeL,Bleu,Sentiment,Fact checking,Bert Score,RoberTa
0,Mistral_small,0.26,0.17,0.077971,0.8,0.7,0.852955,0.535463


In [376]:
df5 = pd.read_csv("/content/all_results/Mistral_Next.csv")
df5

Unnamed: 0,Mistral_Next,Rouge1,RougeL,Bleu,Sentiment,Fact checking,Bert Score,RoberTa
0,Mistral_Next,0.32,0.18,0.080926,0.8,0.8,0.872035,0.623032


In [417]:
df6 = pd.read_csv("/content/all_results/Mistral_Large.csv")
df6

Unnamed: 0,Mistral_Large,Rouge1,RougeL,Bleu,Sentiment,Fact checking,Bert Score,RoberTa
0,Mistral_Large,0.21,0.15,0.075708,0.8,0.6,0.856638,0.558638


In [479]:
df7 = pd.read_csv("/content/all_results/Blackbox_ai.csv")
df7

Unnamed: 0,Blackbox_ai,Rouge1,RougeL,Bleu,Sentiment,Fact checking,Bert Score,RoberTa
0,Blackbox_ai,0.24,0.19,0.089409,0.9,0.8,0.872544,0.601503


In [478]:
df8 = pd.read_csv("/content/all_results/Meta_Lama_3_70B.csv")
df8

Unnamed: 0,Meta_Lama_3_70B,Rouge1,RougeL,Bleu,Sentiment,Fact checking,Bert Score,RoberTa
0,Meta_Lama_3_70B,0.3,0.2,0.085263,0.9,0.8,0.875584,0.488014
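
The per-model files can be combined into a single table instead of being read one by one. The sketch below uses only the standard-library `csv` module so that it runs without extra dependencies (`pd.concat` on the loaded DataFrames would work equally well). It assumes every file has the layout shown above, with the model name as the first column header; `merge_result_files` is a hypothetical helper, and the demo writes toy files to a temporary folder rather than reading `/content/all_results`.

```python
import csv
import glob
import os
import tempfile

def merge_result_files(folder):
    """Read every per-model CSV in `folder` and return one list of row dicts."""
    rows = []
    for path in sorted(glob.glob(os.path.join(folder, "*.csv"))):
        with open(path, newline="") as handle:
            reader = csv.reader(handle)
            header = next(reader, None)
            if header is None:
                continue  # skip empty files
            # Each file uses the model name as its first column header,
            # so give that column a common name before combining.
            header[0] = "model_name"
            for values in reader:
                rows.append(dict(zip(header, values)))
    return rows

# Demo with two toy files (made-up model names and scores).
with tempfile.TemporaryDirectory() as folder:
    for name, rouge1 in [("model_a", "0.30"), ("model_b", "0.33")]:
        with open(os.path.join(folder, f"{name}.csv"), "w", newline="") as handle:
            csv.writer(handle).writerows([[name, "Rouge1"], [name, rouge1]])
    merged = merge_result_files(folder)
print(merged)
```

Values are returned as strings here; convert them with `float` before sorting or plotting.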
