### **Notebook 2: Compute the results:**

This notebook contains the code needed to compute all the metrics for one LLM at a time. The results for each model are saved in a folder and merged at the end.

!! Do not push any modifications (i.e. added prompt results); keep the notebook clean.

**The notebook should run without problems in VS Code, EXCEPT for the BLEU library (`normalize_` function error); please try to fix it. If the problem persists, run the notebook on Colab, where it works fine.**


In [25]:
model_name = "MODEL_NAME"

In [None]:
!pip install bert-score
!pip install sentence-transformers

In [111]:
# import all libraries here for clarity
import pandas as pd
import bert_score
from rouge_score import rouge_scorer
import nltk
from nltk.translate.bleu_score import corpus_bleu, SmoothingFunction
import numpy as np
from sentence_transformers import SentenceTransformer, util

In [152]:
# Import all datasets here for clarity
xsum_sample = pd.read_csv("./data/dataset_sample_summaries.csv") # delete . if on colab
sentiment = pd.read_csv("./data/dataset_sample_movie_reviews.csv") # Adjust path if necessary
fact_checking = pd.read_csv("./data/dataset_sample_fact_checking.csv") # Adjust path if necessary

#### **1: Generate all the prompts:**

##### 1.1: Prompt for ROUGE, BLEU, BERT and ROBERTA metrics:

In [153]:
# ROUGE: generate the results by copy-pasting the prompt below:
xsum_sample[['document']]
# Click on the icon next to *document* ("Convert this dataframe to an interactive table"), then select "Copy table" (on the right), choose JSON and copy.
# Paste the copied JSON into the prompt below, replacing **PASTE_DOCUMENTS_HERE**, then copy the whole prompt and send it to the LLM.

# PROMPT:
# Please generate a summary in one line (max 25 words) for each of the following documents: PASTE_DOCUMENTS_HERE, please just return the answer as the following: results={"generated_summary":["","","","",""]}

Unnamed: 0,document
0,"The full cost of damage in Newton Stewart, one..."
1,A fire alarm went off at the Holiday Inn in Ho...
2,Ferrari appeared in a position to challenge un...
3,"John Edward Bates, formerly of Spalding, Linco..."
4,Patients and staff were evacuated from Cerahpa...
5,Simone Favaro got the crucial try with the las...
6,"Veronica Vanessa Chango-Alverez, 31, was kille..."
7,Belgian cyclist Demoitie died after a collisio...
8,"Gundogan, 26, told BBC Sport he ""can see the f..."
9,The crash happened about 07:20 GMT at the junc...
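
The copy-and-paste step above can also be done in code. The following is a minimal sketch, not part of the original workflow: `build_summary_prompt` is a hypothetical helper that reuses the prompt wording from the cell above, and it assumes one empty slot per document in the answer template.

```python
import json

# Hypothetical helper (an assumption, not in the original notebook): fills the
# PASTE_DOCUMENTS_HERE slot of the summary prompt with a JSON list of documents.
def build_summary_prompt(documents):
    # One empty string per document in the requested answer template
    slots = ",".join('""' for _ in documents)
    return (
        "Please generate a summary in one line (max 25 words) for each of "
        "the following documents: "
        + json.dumps(documents, ensure_ascii=False)
        + ", please just return the answer as the following: "
        + 'results={"generated_summary":[' + slots + "]}"
    )

example_prompt = build_summary_prompt(["First document text.", "Second document text."])
print(example_prompt)
```

In the notebook, the call would presumably be `build_summary_prompt(xsum_sample["document"].tolist())`.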


In [210]:
## Paste RESULTS here
# Example usage:
results_rouge_bleu_bert_roberta={"generated_summary":["","","","",""]}
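
Because the reply is pasted by hand, a quick sanity check before scoring can catch a truncated or malformed answer. This is a sketch under assumptions: `validate_summaries` is a hypothetical helper, and the expected count is whatever number of documents was sent in the prompt.

```python
# Hypothetical check (an assumption, not in the original notebook): confirm the
# pasted reply has the expected key and one string summary per source document.
def validate_summaries(results, expected_count):
    if "generated_summary" not in results:
        raise ValueError('missing key "generated_summary"')
    summaries = results["generated_summary"]
    if len(summaries) != expected_count:
        raise ValueError(
            f"expected {expected_count} summaries, got {len(summaries)}"
        )
    if not all(isinstance(s, str) for s in summaries):
        raise ValueError("every summary must be a string")
    return summaries

checked = validate_summaries({"generated_summary": ["a", "b"]}, expected_count=2)
```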


##### 1.2: Prompt for sentiment analysis:

In [211]:
sentiment[["Review"]]
# Click on the icon next to *Review* ("Convert this dataframe to an interactive table"), then select "Copy table" (on the right), choose JSON and copy.
# Paste the copied JSON into the prompt below, replacing **PASTE_SENTENCES_HERE**, then copy the whole prompt and send it to the LLM.

# PROMPT:
# Please classify the following 10 sentences: positive, negative or neutral. Here are the sentences please provide an array for answer i.e: predicted_labels = ['positive','negative',...]: PASTE_SENTENCES_HERE. please return the answers as an array

Unnamed: 0,Review
0,I wanted to like this movie. But it falls apar...
1,"Even if it were remotely funny, this mouldy wa..."
2,This is an excellent film and one should not b...
3,Despite having known people who are either gre...
4,One word: suPURRRRb! I don't think I have see ...
5,The early career of Abe Lincoln is beautifully...
6,I bought this at tower records after seeing th...
7,"Steven Spielberg produced, wrote, came up with..."
8,When I took my seat in the cinema I was in a c...
9,I wonder how the actors acted in this movie. A...


In [212]:
## Paste RESULTS here
# Example usage:
result_sentiment_analysis = ['negative', 'negative', 'positive', 'negative', 'positive', 'positive', 'negative', 'neutral', 'positive', 'negative']

##### 1.3: Prompt for fact checking:

In [213]:
fact_checking[['Prompt']]
# Click on the icon next to *Prompt* ("Convert this dataframe to an interactive table"), then select "Copy table" (on the right), choose JSON and copy.
# Paste the copied JSON into the prompt below, replacing **COPY_QUESTIONS**, then copy the whole prompt and send it to the LLM.

# PROMPT:
# Please answer the following questions:COPY_QUESTIONS please answer in the following format: arr= [answer_1,answer_2]

Unnamed: 0,Prompt
0,"How old is Barack Obama, please give just the ..."
1,"What is the height of the Eiffel Tower, please..."
2,What is the capital of France?
3,What is the population of China?
4,What is the capital of Australia?
5,What is the capital of Italy?
6,Where was Albert Einstein born?
7,How many films have been directed by Steven Sp...
8,What is the population of India?
9,What is the capital of Japan?


In [214]:
## Paste RESULTS here
# Example usage:
result_fact_checking = [61, 324, "Paris", 1403500365, "Canberra", "Rome", "Ulm, Kingdom of Württemberg, German Empire", 35, 1380004385, "Tokyo"]

#### **2: Generate all the results:**

##### 2.1: Result for ROUGE, BLEU, BERT and ROBERTA metrics:

In [215]:
def calculate_and_export_rouge(model_name, results):

    scorer = rouge_scorer.RougeScorer(['rouge1'], use_stemmer=True)
    # Create DataFrame from results dictionary
    data = pd.DataFrame(results)
    # Calculate ROUGE scores
    data["r1_fscore"] = data.apply(lambda row: scorer.score(row["summary"], row["generated_summary"])['rouge1'][2], axis=1)
    # Calculate mean ROUGE-1 score
    mean_rouge_1 = round(data["r1_fscore"].mean(), 2)
    # Create DataFrame with mean ROUGE-1 score and model name
    df = pd.DataFrame({
        "model_name": [model_name],
        "ROUGE": [mean_rouge_1]
    })
    # Return the one-row DataFrame (the CSV export happens at the end of the notebook)
    return df

In [216]:
opt_result_rouge = pd.DataFrame.from_dict(results_rouge_bleu_bert_roberta).rename({"summary_text": "generated_summary"}, axis=1).join(pd.DataFrame.from_dict(xsum_sample))[["generated_summary", "summary", "document"]]
df_rouge_result = calculate_and_export_rouge(model_name, opt_result_rouge)
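
To make the `r1_fscore` column easier to interpret, here is a simplified sketch of what a ROUGE-1 F-measure computes. It assumes lowercase whitespace tokenisation and no stemming, so its values can differ from those of `rouge_scorer.RougeScorer(['rouge1'], use_stemmer=True)`; `rouge1_f` is a hypothetical name.

```python
from collections import Counter

# Simplified ROUGE-1 F-measure: clipped unigram overlap between a reference
# and a candidate. Illustration only; the rouge_score package additionally
# normalises and stems tokens.
def rouge1_f(reference, candidate):
    ref_counts = Counter(reference.lower().split())
    cand_counts = Counter(candidate.lower().split())
    # Counter intersection keeps the minimum count of each shared token
    overlap = sum((ref_counts & cand_counts).values())
    if overlap == 0:
        return 0.0
    precision = overlap / sum(cand_counts.values())
    recall = overlap / sum(ref_counts.values())
    return 2 * precision * recall / (precision + recall)

score = rouge1_f("the cat sat on the mat", "the cat lay on the mat")
empty_score = rouge1_f("the cat sat on the mat", "")
```

In this sketch an empty candidate scores 0.0, since there is no unigram overlap to count.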

In [217]:
def calculate_and_export_bleu(model_name, results):
    # corpus_bleu expects tokenised input: for each hypothesis, a list of
    # reference token lists, plus one token list per hypothesis
    references = [[ref.split()] for ref in results["summary"].tolist()]
    hypotheses = [hyp.split() for hyp in results["generated_summary"].tolist()]

    # Calculate BLEU score with smoothing
    smoothie = SmoothingFunction().method7
    bleu_score = corpus_bleu(references, hypotheses, smoothing_function=smoothie)

    # Create DataFrame with BLEU score and model name
    df = pd.DataFrame({
        "model_name": [model_name],
        "BLEU": [bleu_score]
    })

    # Return the one-row DataFrame (the CSV export happens at the end of the notebook)
    return df

In [227]:
opt_result_rouge = pd.DataFrame.from_dict(results_rouge_bleu_bert_roberta).rename({"summary_text": "generated_summary"}, axis=1).join(pd.DataFrame.from_dict(xsum_sample))[["generated_summary", "summary", "document"]]
df_blue_result = calculate_and_export_bleu(model_name, opt_result_rouge)
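
NLTK's `corpus_bleu` takes tokenised input: a list holding, for each hypothesis, a list of reference token lists, and a matching list of hypothesis token lists. The sketch below only shows that shape. `to_bleu_inputs` is a hypothetical helper, and plain whitespace splitting is an assumption (a proper tokeniser may be preferable).

```python
# Hypothetical helper: reshape raw strings into the nested token lists that
# nltk.translate.bleu_score.corpus_bleu expects. Whitespace tokenisation is
# an assumption made for this illustration.
def to_bleu_inputs(reference_texts, hypothesis_texts):
    if len(reference_texts) != len(hypothesis_texts):
        raise ValueError("need exactly one reference per hypothesis")
    # One reference per hypothesis here, hence the extra list level
    list_of_references = [[ref.split()] for ref in reference_texts]
    hypotheses = [hyp.split() for hyp in hypothesis_texts]
    return list_of_references, hypotheses

refs, hyps = to_bleu_inputs(["the cat sat"], ["the cat sits"])
```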

In [None]:
# Define function to calculate BERTScore precision
def calculate_bert_precision(model_name,data):
    # Extract reference summaries and generated summaries
    references = data["summary"].tolist()
    hypotheses = data["generated_summary"].tolist()

    # Compute BERTScore precision
    precision, _, _ = bert_score.score(hypotheses, references, lang="en")

    # Average the per-summary precision values into a single score
    BertScore = precision.mean().item()
    # Create DataFrame with BERTScore precision and model name
    df = pd.DataFrame({
        "model_name": [model_name],
        "BERTSCORE": [BertScore]
    })

    return df

In [None]:
opt_result = pd.DataFrame.from_dict(results_rouge_bleu_bert_roberta).rename({"summary_text": "generated_summary"}, axis=1).join(pd.DataFrame.from_dict(xsum_sample))[["generated_summary", "summary", "document"]]
df_bert_score_results = calculate_bert_precision(model_name, opt_result)

Some weights of RobertaModel were not initialized from the model checkpoint at roberta-large and are newly initialized: ['roberta.pooler.dense.bias', 'roberta.pooler.dense.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


In [None]:
def calculate_semantic_similarity(data, model_name):
    # Load a pre-trained Sentence Transformer model
    model = SentenceTransformer('roberta-base-nli-stsb-mean-tokens')

    # Define function to calculate semantic similarity score
    def compute_similarity(generated_sentence, reference_sentence):
        # Encode sentences into embeddings
        generated_embedding = model.encode(generated_sentence, convert_to_tensor=True)
        reference_embedding = model.encode(reference_sentence, convert_to_tensor=True)

        # Compute cosine similarity between embeddings
        cosine_similarity = util.pytorch_cos_sim(generated_embedding, reference_embedding)
        return cosine_similarity.item()

    # Ensure both lists have the same length
    assert len(data["generated_summary"]) == len(data["summary"]), "Lengths of generated_summary and summary lists must be the same"

    # Calculate semantic similarity for each pair of generated and reference sentences in the data
    similarity_scores = []
    for generated_sentence, reference_sentence in zip(data["generated_summary"], data["summary"]):
        similarity_score = compute_similarity(generated_sentence, reference_sentence)
        similarity_scores.append(similarity_score)

    # Compute the mean of the similarity scores
    mean_similarity = sum(similarity_scores) / len(similarity_scores)

    # Create DataFrame with model name and mean similarity score
    df = pd.DataFrame({
        "model_name": [model_name],
        "ROBERTA": [mean_similarity]
    })
    return df

In [None]:
df_robert_results = calculate_semantic_similarity(opt_result,model_name)
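
The ROBERTA column is the mean cosine similarity between the embeddings of each generated summary and its reference. As a reminder of what that number means, here is a small pure-Python sketch of cosine similarity for two vectors. It illustrates the formula only and is not how `util.pytorch_cos_sim` is implemented.

```python
import math

# Cosine similarity between two equal-length vectors: dot product divided by
# the product of the vector norms. Returns 0.0 for a zero vector by convention
# (an assumption made for this sketch).
def cosine_similarity(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)

same_direction = cosine_similarity([1.0, 2.0], [2.0, 4.0])
orthogonal = cosine_similarity([1.0, 0.0], [0.0, 1.0])
```

Values close to 1 indicate embeddings pointing in nearly the same direction, while values near 0 indicate unrelated ones.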

##### 2.2: Result for Sentiment Analysis

In [219]:
def calculate_and_export_sent_analysis(model_name, opt_result):
    # Convert both predicted and ground truth labels to lowercase
    predicted_labels = [label.lower() for label in opt_result]
    sentiment['Predicted_Labels'] = predicted_labels
    sentiment['Ground_Truth_Label'] = sentiment['Ground_Truth_Label'].str.lower()
    # Count correct predictions
    correct_predictions = sum(sentiment['Ground_Truth_Label'] == sentiment['Predicted_Labels'])
    total_reviews = len(sentiment)
    accuracy = correct_predictions / total_reviews
    # Create DataFrame with accuracy and model name
    new_data = {
        'model_name': [model_name],
        'SENTIMENT': [accuracy]
    }
    new_df = pd.DataFrame(new_data)
    return new_df

In [220]:
df_sentiment_result = calculate_and_export_sent_analysis(model_name, result_sentiment_analysis)
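
The accuracy computed above is the share of predicted labels that match the ground truth after lowercasing. A standalone sketch of that calculation, without touching the `sentiment` DataFrame, could look as follows. `label_accuracy` is a hypothetical name, and stripping surrounding whitespace is an extra assumption not made by the original function.

```python
# Hypothetical standalone version of the accuracy calculation: compares labels
# position by position, ignoring case and (as an added assumption) whitespace.
def label_accuracy(predicted, ground_truth):
    if len(predicted) != len(ground_truth):
        raise ValueError("predicted and ground_truth must have the same length")
    if not ground_truth:
        raise ValueError("need at least one label")
    matches = sum(
        p.strip().lower() == g.strip().lower()
        for p, g in zip(predicted, ground_truth)
    )
    return matches / len(ground_truth)

accuracy = label_accuracy(["Positive", "negative"], ["positive", "positive"])
```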

##### 2.3: Result for Fact checking

In [221]:
def calculate_and_export_fact_checking(model_name, results):
    def similarity_metric(array1, array2):
        count = 0
        for elem1 in array1:
            if elem1 in array2:
                count += 1
        similarity_score = count / len(array1)
        return similarity_score

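    # Ground-truth answers from the dataset
    ground_truth = fact_checking["Answer"].to_numpy()
    # Convert both sides to numpy arrays; the LLM answers are cast to strings
    # Note: the metric counts a ground-truth answer as correct if it appears
    # anywhere in the results, regardless of position
    array1 = np.array(ground_truth)
    array2 = np.array(results, dtype=str)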

    # Compute similarity score
    similarity_score = round(similarity_metric(array1, array2), 2)

    # Create DataFrame
    new_data = {
        'model_name': [model_name],
        'FACTCHECK': [similarity_score]
    }
    df = pd.DataFrame(new_data)
    return df

In [222]:
df_fact_checking = calculate_and_export_fact_checking(model_name,  result_fact_checking)
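
The metric above checks membership: a ground-truth answer counts as correct if it appears anywhere in the list of LLM answers. A stricter alternative compares the answers question by question. The sketch below shows that variant. It is not the method used in this notebook, and `positional_match_rate` is a hypothetical name.

```python
# Hypothetical stricter variant (not the notebook's metric): an answer only
# counts if it matches the ground truth at the same position, after casting
# to string, trimming whitespace and lowercasing.
def positional_match_rate(ground_truth, answers):
    if len(ground_truth) != len(answers):
        raise ValueError("ground_truth and answers must have the same length")
    if not ground_truth:
        raise ValueError("need at least one question")
    hits = sum(
        str(truth).strip().lower() == str(answer).strip().lower()
        for truth, answer in zip(ground_truth, answers)
    )
    return hits / len(ground_truth)

rate = positional_match_rate(["Paris", 61, "Rome"], ["paris", "61", "Tokyo"])
swapped = positional_match_rate(["Paris", "Rome"], ["Rome", "Paris"])
```

With swapped answers, the membership-based metric would still report a perfect score, while this variant reports 0.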

#### **3: Concat all the results of all the metrics to one CSV:**

In [223]:
print(df_rouge_result)
print(df_blue_result) 
print(df_sentiment_result) 
print(df_fact_checking) 
print(df_bert_score_results) 
print(df_robert_results) 

   model_name  ROUGE
0  MODEL_NAME    0.0
   model_name  BLEU
0  MODEL_NAME  0.14
   model_name  SENTIMENT
0  MODEL_NAME        0.9
   model_name  FACTCHECK
0  MODEL_NAME        0.4
   model_name  BERTSCORE
0  MODEL_NAME   0.852724
   model_name   ROBERTA
0  MODEL_NAME  0.583766


In [224]:
# Assuming df_rouge_result, df_blue_result, df_sentiment_result, df_fact_checking, df_bert_score_results, df_robert_results are your DataFrames

# Merge DataFrames on 'model_name'
merged_df = df_rouge_result.merge(df_blue_result, on='model_name')
merged_df = merged_df.merge(df_sentiment_result, on='model_name')
merged_df = merged_df.merge(df_fact_checking, on='model_name')
merged_df = merged_df.merge(df_bert_score_results, on='model_name')
merged_df = merged_df.merge(df_robert_results, on='model_name')

# Extract the model name
model_name = merged_df['model_name'].iloc[0]

# Concatenate values into a single row
single_row = [model_name] + merged_df.values.flatten().tolist()[1:]

# Assuming the columns are in the order of 'model_name', 'ROUGE', 'BLEU', 'SENTIMENT', 'FACTCHECK', 'BERTSCORE', 'ROBERTA'


In [225]:
print(single_row)

['MODEL_NAME', 0.0, 0.14, 0.9, 0.4, 0.8527240753173828, 0.5837660133838654]


In [226]:
# Convert single_row to a DataFrame
columns_list = [model_name, 'Rouge', 'Bleu', 'Sentiment', 'Fact checking', 'Bert Score', 'RoberTa']
single_row_df = pd.DataFrame([single_row], columns=columns_list)

# Export to CSV
single_row_df.to_csv('./all_results/results_3.csv', index=False)