## Exercise for Lecture 8: Evaluation Protocol II -- Automatic
### Profilierungsmodul CL I, MSc Computational Linguistics: Trustworthy Data-Centric AI

__Context__: \
You are evaluating human-written and model-generated summaries. \
You received the following 10 summaries: five written by humans (__human_summaries__) and five generated by language models (__model_summaries__).

In [None]:
human_summaries = [
"Gemma, 23, has left her children to be raised by her mother. Handed them to her 52-year-old parent when they were four months old. Spends her time partying and has 'missed out' on seeing them grow up. Is now at risk of being banned from seeing them thanks to feckless ways. Mother Debbie also says the unemployed 23-year-old has stolen from them.",
"23-year-old mother Gemma may be forbidden to see her children by her mother Debbie, who cares for the children. Debbie says this is because of Gemma's stealing and selling property belonging to the children, and prioritizing drinking, drugs and partying over the children.",
"A mother of two is about to be denied access to her children. The woman is an alcoholic and a drug addict who refuses to seek help.  She has stolen from her mother and her children to afford her nasty habits.",
"Gemma is at risk of losing her children because she does not want to stop drinking and partying. Gemma's mother Debbie is caring for her children. Gemma needs to grow up and leave the drugs, parties and drinking behind.",
"Gemma is going to lose access to her children because her mother cannot trust her to stay sober and stop stealing from her. Gemma's mother, Debbie, is taking care of the kids who are her grandchildren. According to Debbie, Gemma needs to get sober and act mature in order to see her kids again."]
model_summaries = [
"the woman , named only as gemma , has two children under five by two different fathers and handed both infants over to her 52-year-old mother debbie when they were four months old . a 23-year-old mother-of-two is at risk of being banned from seeing her own children - because she refuses to stop drinking and partying . now debbie is threatening to ban gemma from seeing the children at all , after discovering her daughter 's penchant for legal high , mkat , and because she suspects her of stealing .",
"Gemma, a mother-of-two, is at risk of being banned from seeing her own children because she refuses to stop drinking and partying 'I don't want to be responsible for her having children,' says Debbie. 'I don't want to be involved with her. I'm not going to let her do that to my children.", 
"a 23-year-old mother-of-two is at risk of being banned from seeing her own children - because she refuses to stop drinking and partying . the woman , named only as gemma , has two children under five by two different fathers and handed both infants over to her 52-year-old mother debbie when they were four months old . now debbie is threatening to ban gemma from seeing the children at all , after discovering her daughter 's penchant for legal high , mkat , and because she suspects her of stealing .",
"Gemma , 23 , has two children under five by two different fathers . Handed both infants over to her 52-year-old mother Debbie when they were four months old . Now Debbie is threatening to ban Gemma from seeing the children at all . Says her daughter 's constant drinking and partying is getting out of hand .",
"the woman has two children under five by two different fathers . debbie is threatening to ban gemma from seeing children at all . she says her daughter 's constant drinking is getting out of hand . she also claims gemma , who is unemployed , stole an ipad from one of the children ."
]   

### Exercise 1: Inter-Annotator Agreement
Your first task is to assess whether humans can reliably distinguish human-written from model-generated summaries. \
You shuffled the 10 summaries into a random order and asked three annotators, John, Mary, and Bill, to label each summary as human-written ("H") or model-generated ("M"). \
Here are the gold labels and the responses from the three annotators.

In [None]:
gold = ["H", "H", "M", "M", "H", "M", "H", "M", "M", "H"]
john = ["H", "H", "M", "M", "H", "H", "H", "M", "M", "H"]
mary = ["H", "H", "M", "M", "H", "M", "M", "M", "H", "H"]
bill = ["M", "H", "H", "M", "H", "M", "H", "M", "M", "H"]

#### Exercise 1.1: Accuracy, Precision, Recall, F1
Implement Python code that compares John's, Mary's, and Bill's annotations to the gold labels in terms of accuracy and macro-averaged precision, recall, and F1.

In [None]:
# Install scikit-learn, pandas 
!pip install scikit-learn 
!pip install pandas

In [None]:
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score
import pandas as pd
def acc_p_r_f1(gold, predictions):
    accuracy = accuracy_score(gold, predictions)
    precision = precision_score(gold, predictions, average='macro')
    recall = recall_score(gold, predictions, average='macro')
    f1 = f1_score(gold, predictions, average='macro')
    return accuracy, float(precision), float(recall), float(f1)

# Evaluation
john_metrics = acc_p_r_f1(gold, john)
mary_metrics = acc_p_r_f1(gold, mary)
bill_metrics = acc_p_r_f1(gold, bill)

print(john_metrics)
print(mary_metrics)
print(bill_metrics)
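
To make the macro-averaged numbers less of a black box, here is a minimal from-scratch sketch that computes precision, recall, and F1 per class and then averages them with equal weight. The helper name `per_class_prf` is hypothetical (not part of scikit-learn), and the sketch only covers John's annotations.

```python
# Gold labels and John's annotations, copied from the exercise above.
gold = ["H", "H", "M", "M", "H", "M", "H", "M", "M", "H"]
john = ["H", "H", "M", "M", "H", "H", "H", "M", "M", "H"]

def per_class_prf(gold_labels, predicted_labels, label):
    """Precision, recall, and F1 when `label` is treated as the positive class."""
    pairs = list(zip(gold_labels, predicted_labels))
    tp = sum(g == label and p == label for g, p in pairs)  # true positives
    fp = sum(g != label and p == label for g, p in pairs)  # false positives
    fn = sum(g == label and p != label for g, p in pairs)  # false negatives
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1

# Macro averaging: compute each metric per class, then take the unweighted mean.
per_class = {label: per_class_prf(gold, john, label) for label in ("H", "M")}
macro_p, macro_r, macro_f1 = (
    sum(scores[k] for scores in per_class.values()) / len(per_class) for k in range(3)
)
accuracy = sum(g == p for g, p in zip(gold, john)) / len(gold)

print(per_class)
print(accuracy, macro_p, macro_r, macro_f1)
```

Because both classes have five gold items here, macro recall coincides with accuracy; with imbalanced classes the two would generally differ.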

#### Exercise 1.2: Pair-wise Raw & Kappa agreement 
Compute the average pair-wise raw agreement and the average pair-wise Cohen's kappa among the three annotators. \
How much does raw agreement differ from Cohen's kappa?

In [None]:
from sklearn.metrics import cohen_kappa_score
from statistics import mean
def raw(A,B):
    return 1.0*sum([A[i]==B[i] for i in range(len(A))])/len(A)
kappas = []
raws = []
annotations = [john, mary, bill]
for i in range(len(annotations)):
    for j in range(i+1, len(annotations)):
        kappas.append(float(cohen_kappa_score(annotations[i], annotations[j])))
        raws.append(raw(annotations[i], annotations[j]))
print(kappas)
print(raws)
print(mean(kappas))
print(mean(raws))
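
The gap between raw agreement and kappa comes from the chance correction, kappa = (p_o - p_e) / (1 - p_e). The following is a minimal sketch of that formula for one annotator pair; the function name `cohen_kappa_by_hand` is hypothetical and the code is meant as an illustration, not a replacement for `cohen_kappa_score`.

```python
from collections import Counter

# John's and Mary's annotations, copied from the exercise above.
john = ["H", "H", "M", "M", "H", "H", "H", "M", "M", "H"]
mary = ["H", "H", "M", "M", "H", "M", "M", "M", "H", "H"]

def cohen_kappa_by_hand(a, b):
    """Cohen's kappa for two annotators who labelled the same items."""
    n = len(a)
    # Observed agreement: fraction of items with identical labels.
    p_o = sum(x == y for x, y in zip(a, b)) / n
    # Expected agreement: chance that both pick the same label,
    # given each annotator's own label distribution.
    counts_a, counts_b = Counter(a), Counter(b)
    p_e = sum((counts_a[l] / n) * (counts_b[l] / n) for l in set(a) | set(b))
    return p_o, p_e, (p_o - p_e) / (1 - p_e)

p_o, p_e, kappa = cohen_kappa_by_hand(john, mary)
print(p_o, p_e, kappa)
```

For this pair, raw agreement is 0.7 but half of that is expected by chance, so kappa drops to about 0.4.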

#### Exercise 1.3: Krippendorff Alpha
Since Krippendorff's alpha handles more than two annotators at once, can you also calculate it for the three annotators? \
How does Krippendorff's alpha compare to the averaged Cohen's kappa?
- Note that the `krippendorff` Python package expects `reliability_data` with one row per annotator and one column per instance.

In [None]:
# Install krippendorff 
!pip install krippendorff 

In [None]:
from krippendorff import alpha

# Prepare data for Krippendorff's alpha
# Each row represents an annotator, and each column represents an item,
# which is already the layout of `annotations` (no transpose needed)
reliability_data = annotations
print(reliability_data)

# Calculate Krippendorff's alpha
krippendorff_alpha = alpha(reliability_data=reliability_data, level_of_measurement="nominal")

print("Krippendorff's Alpha:", krippendorff_alpha)
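
For intuition, Krippendorff's alpha is 1 - D_o / D_e, the ratio of observed to expected disagreement. Below is a from-scratch sketch for the simplest case only: nominal labels, no missing annotations, and at least two distinct labels in the data. The function name `nominal_alpha` is hypothetical and not part of the `krippendorff` package.

```python
from collections import Counter
from itertools import permutations

# Annotations copied from the exercise above.
john = ["H", "H", "M", "M", "H", "H", "H", "M", "M", "H"]
mary = ["H", "H", "M", "M", "H", "M", "M", "M", "H", "H"]
bill = ["M", "H", "H", "M", "H", "M", "H", "M", "M", "H"]

def nominal_alpha(units):
    """Krippendorff's alpha for nominal data; `units` holds the labels given to each item."""
    units = [u for u in units if len(u) > 1]  # items need at least two labels to be pairable
    n = sum(len(u) for u in units)            # total number of pairable labels

    # Observed disagreement: ordered label pairs within an item that differ,
    # each item weighted by 1 / (number of labels on that item - 1).
    d_o = sum(
        sum(a != b for a, b in permutations(u, 2)) / (len(u) - 1) for u in units
    ) / n

    # Expected disagreement: differing label pairs when labels are drawn
    # from the pooled label distribution without replacement.
    totals = Counter(label for u in units for label in u)
    d_e = sum(totals[c] * totals[k] for c in totals for k in totals if c != k) / (n * (n - 1))
    return 1.0 - d_o / d_e

units = list(zip(john, mary, bill))  # one tuple of three labels per item
alpha_value = nominal_alpha(units)
print(alpha_value)
```

Unlike the averaged pair-wise kappa, this pools all annotators into a single label distribution when estimating chance disagreement.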

### Exercise 2: NLG Evaluation
Now we are interested in how the model summaries differ from the human summaries.

#### Exercise 2.1: Tokenize your summaries 
As pre-processing, tokenize the summaries before feeding them to BLEU; the ROUGE and BERTScore packages used in the later exercises take raw strings and tokenize them internally.

In [None]:
# install nltk
!pip install nltk

In [None]:
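import nltk
from nltk.tokenize import word_tokenize

# word_tokenize needs the Punkt tokenizer data; newer NLTK releases look for "punkt_tab"
nltk.download("punkt", quiet=True)
nltk.download("punkt_tab", quiet=True)

# Lowercase before tokenizing so that casing differences do not count as mismatches
tokenized_human_summaries = [word_tokenize(summary.lower()) for summary in human_summaries]
tokenized_model_summaries = [word_tokenize(summary.lower()) for summary in model_summaries]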

#### Exercise 2.2: BLEU
Compare each of the five model summaries against each human summary, one pair at a time, using the precision-based BLEU score. \
Which model summary scores best?

In [None]:
import numpy as np
from nltk.translate.bleu_score import sentence_bleu, SmoothingFunction
# Compute BLEU score with smoothing
# method1 replaces zero n-gram match counts with a small epsilon, so one missing
# higher-order n-gram does not force the whole score to zero
smooth_fn = SmoothingFunction().method1

bleu_matrix = np.zeros((len(tokenized_human_summaries), len(tokenized_model_summaries)))

for i in range(len(tokenized_human_summaries)):
    for j in range(len(tokenized_model_summaries)):
        score = sentence_bleu([tokenized_human_summaries[i]], tokenized_model_summaries[j], smoothing_function=smooth_fn)
        bleu_matrix[i, j] = score
        # print(f"human {i} vs model {j}: BLEU Score = {score:.4f}")

bleu_df = pd.DataFrame(
    bleu_matrix,
    index=[f"Human {i+1}" for i in range(len(tokenized_human_summaries))],
    columns=[f"Model {j+1}" for j in range(len(tokenized_model_summaries))]
)

print(bleu_df)
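
The core of BLEU is the clipped (modified) n-gram precision. The sketch below illustrates only that component on toy sentences; it omits the brevity penalty, the geometric mean over n = 1..4, and smoothing, so its numbers are not comparable to `sentence_bleu`. The function names and example sentences are made up for illustration.

```python
from collections import Counter

def ngrams(tokens, n):
    """All contiguous n-grams of a token list, as tuples."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

def modified_precision(reference, candidate, n):
    """Fraction of candidate n-grams found in the reference, with clipped counts."""
    cand_counts = Counter(ngrams(candidate, n))
    ref_counts = Counter(ngrams(reference, n))
    # Clipping: an n-gram is credited at most as often as it occurs in the reference.
    clipped = sum(min(count, ref_counts[gram]) for gram, count in cand_counts.items())
    total = sum(cand_counts.values())
    return clipped / total if total else 0.0

# Toy example (hypothetical sentences, already tokenized by whitespace).
reference = "debbie is caring for her children".split()
candidate = "debbie is threatening to ban gemma from seeing her children".split()
p1 = modified_precision(reference, candidate, 1)
p2 = modified_precision(reference, candidate, 2)

# Clipping in action: repeating a matching word does not inflate precision.
p_clip = modified_precision("the cat".split(), "the the the".split(), 1)
print(p1, p2, p_clip)
```

Because the score is normalised by the candidate length, a long model summary that contains extra material is penalised even if it covers the whole reference.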


#### Exercise 2.3: ROUGE
Unlike BLEU, ROUGE is recall-oriented: it asks how much of the reference is covered by the prediction. \
Compare each of the five model summaries against each human summary using ROUGE-1, ROUGE-2, and ROUGE-L. \
Which model summary scores best?

In [None]:
# Install rouge-score
!pip install rouge-score

In [None]:
from rouge_score import rouge_scorer
r_scorer = rouge_scorer.RougeScorer(['rouge1', 'rouge2', 'rougeL'], use_stemmer=True)

rouge_1 = np.zeros((len(human_summaries), len(model_summaries)))
rouge_2 = np.zeros((len(human_summaries), len(model_summaries)))
rouge_l = np.zeros((len(human_summaries), len(model_summaries)))

for i in range(len(human_summaries)):
    for j in range(len(model_summaries)):
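        # score(target, prediction): the human summary is the reference, the model summary the prediction
        r_score = r_scorer.score(human_summaries[i], model_summaries[j])
        # print(r_score)
        # Store the F-measure; the .recall attribute holds the recall-only variant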
        rouge_1[i, j] = r_score["rouge1"].fmeasure
        rouge_2[i, j] = r_score["rouge2"].fmeasure
        rouge_l[i, j] = r_score["rougeL"].fmeasure
        
def to_df(matrix):
    # Label rows by human (reference) summary and columns by model summary
    return pd.DataFrame(
        matrix,
        index=[f"Human {i+1}" for i in range(len(human_summaries))],
        columns=[f"Model {j+1}" for j in range(len(model_summaries))]
    )

rouge_1_df = to_df(rouge_1)
print(rouge_1_df)

rouge_2_df = to_df(rouge_2)
print(rouge_2_df)

rouge_l_df = to_df(rouge_l)
print(rouge_l_df)
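
As a rough illustration of what the scorer computes, here is a from-scratch sketch of ROUGE-1 (unigram overlap) and ROUGE-L (longest common subsequence) on toy sentences. It skips the tokenization and stemming that `rouge_score` applies, so it will not reproduce the package's numbers exactly; the function names and sentences are hypothetical.

```python
from collections import Counter

def prf(overlap, ref_len, cand_len):
    """Turn an overlap count into (precision, recall, F-measure)."""
    precision = overlap / cand_len if cand_len else 0.0
    recall = overlap / ref_len if ref_len else 0.0
    f = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f

def rouge_1(reference, candidate):
    # Overlap = shared unigrams, counting each word at most as often as it appears in both.
    overlap = sum((Counter(reference) & Counter(candidate)).values())
    return prf(overlap, len(reference), len(candidate))

def lcs_length(a, b):
    # Classic dynamic programme for the longest common subsequence.
    table = [[0] * (len(b) + 1) for _ in range(len(a) + 1)]
    for i, x in enumerate(a, start=1):
        for j, y in enumerate(b, start=1):
            table[i][j] = table[i - 1][j - 1] + 1 if x == y else max(table[i - 1][j], table[i][j - 1])
    return table[len(a)][len(b)]

def rouge_l(reference, candidate):
    return prf(lcs_length(reference, candidate), len(reference), len(candidate))

# Toy example (hypothetical sentences, already tokenized by whitespace).
reference = "debbie is caring for her children".split()
candidate = "debbie is threatening to ban gemma from seeing her children".split()
r1_p, r1_r, r1_f = rouge_1(reference, candidate)
rl_p, rl_r, rl_f = rouge_l(reference, candidate)
print(r1_p, r1_r, r1_f)
print(rl_p, rl_r, rl_f)
```

Recall is normalised by the reference length and precision by the candidate length, which is why the same overlap of four words gives a recall of 4/6 but a precision of only 4/10.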

#### Exercise 2.5 placeholder

In [None]:
# Install bert-score
!pip install bert_score

In [None]:
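from bert_score import score as b_score

# bert_score's score() takes the candidates first, then the references.
# Scoring all pairs in one call avoids reloading the model for every pair.
pairs = [(i, j) for i in range(len(human_summaries)) for j in range(len(model_summaries))]
candidates = [model_summaries[j] for i, j in pairs]
references = [human_summaries[i] for i, j in pairs]
P, R, F1 = b_score(candidates, references, lang="en", verbose=False)

# F1 is a 1-D tensor with one entry per pair, ordered by human summary first
bert_matrix = F1.reshape(len(human_summaries), len(model_summaries)).numpy()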

bert_df = pd.DataFrame(
    bert_matrix,
    index=[f"Human {i+1}" for i in range(len(human_summaries))],
    columns=[f"Model {j+1}" for j in range(len(model_summaries))]
)

print(bert_df)
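
The matching step behind BERTScore can be sketched without a language model: each token is greedily matched to its most similar token on the other side by cosine similarity. The 2-D vectors below are hypothetical stand-ins for contextual embeddings, and the sketch leaves out optional extras of the real package such as idf weighting and baseline rescaling.

```python
import math

def cosine(u, v):
    """Cosine similarity between two vectors given as lists of floats."""
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v)))

def greedy_match_scores(ref_vecs, cand_vecs):
    """Precision, recall, and F1 from greedy cosine matching of token vectors."""
    # Recall: every reference token is matched to its most similar candidate token.
    recall = sum(max(cosine(r, c) for c in cand_vecs) for r in ref_vecs) / len(ref_vecs)
    # Precision: every candidate token is matched to its most similar reference token.
    precision = sum(max(cosine(c, r) for r in ref_vecs) for c in cand_vecs) / len(cand_vecs)
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1

# Toy "embeddings" (hypothetical): two reference tokens, two candidate tokens.
ref_vecs = [[1.0, 0.0], [0.0, 1.0]]
cand_vecs = [[1.0, 0.0], [1.0, 1.0]]
toy_p, toy_r, toy_f1 = greedy_match_scores(ref_vecs, cand_vecs)
print(toy_p, toy_r, toy_f1)
```

Because similarity is graded rather than exact-match, a paraphrase can earn partial credit here, whereas BLEU and ROUGE would count it as a miss.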

#### Exercise 2.5: Multi-reference evaluation
Metrics like BLEU allow comparing one prediction to multiple references. \
Now compare each of the five model summaries to all five human summaries at once using BLEU. \
What do you observe? \
Do your observations differ from the pair-wise comparisons?

In [None]:
for j in range(len(tokenized_model_summaries)):
    score = sentence_bleu(tokenized_human_summaries, tokenized_model_summaries[j], smoothing_function=smooth_fn)
    print(f"Model {j+1}", score)

print("\nSingle-reference BLEU for comparison:")
print(bleu_df)
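
One reason multi-reference scores tend to be higher is how clipping works: a candidate n-gram gets credit if it appears in any reference, up to the maximum count seen in a single reference. The following sketch shows this for unigrams only, on made-up sentences; the function name is hypothetical and the brevity penalty and higher-order n-grams are left out.

```python
from collections import Counter

def multi_ref_unigram_precision(references, candidate):
    """Clipped unigram precision of a candidate against one or more references."""
    # For every word, keep the largest count found in any single reference.
    max_ref_counts = Counter()
    for reference in references:
        for token, count in Counter(reference).items():
            max_ref_counts[token] = max(max_ref_counts[token], count)
    cand_counts = Counter(candidate)
    clipped = sum(min(count, max_ref_counts[token]) for token, count in cand_counts.items())
    return clipped / len(candidate) if candidate else 0.0

# Toy example (hypothetical sentences, already tokenized by whitespace).
candidate = "gemma is at risk of losing her children".split()
ref_a = "gemma is at risk of being banned".split()
ref_b = "gemma may lose her children".split()

p_a = multi_ref_unigram_precision([ref_a], candidate)
p_b = multi_ref_unigram_precision([ref_b], candidate)
p_multi = multi_ref_unigram_precision([ref_a, ref_b], candidate)
print(p_a, p_b, p_multi)
```

Here the two references cover different parts of the candidate, so the multi-reference precision exceeds both single-reference values.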