## Evaluating Evidence Outputs


This Jupyter notebook analyzes a dataset of text summaries and the evidence annotations attached to them by two human annotators (Deidamea and Aaron) and by a model. It measures token-level agreement between the model and each human annotator, and then between the two human annotators, which gives a human-human reference point for the model scores. The comparison indicates how consistently the model identifies evidence of LLM limitations.

**Process**

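1. **Load data:** Import the dataset from Excel.
2. **Tokenize:** Lowercase each summary and split it into word tokens.
3. **Convert evidence:** Transform each annotation into a set of tokens.
4. **BIO tagging:** Label each summary token as `B-EVID`, `I-EVID`, or `O` depending on whether it occurs in the evidence set.
5. **Evaluate:** Compute precision, recall, and F1 between each pair of BIO tag sequences.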
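
Steps 2 to 4 can be illustrated on a toy sentence. The sketch below is only an approximation of the notebook's pipeline: it assumes a plain whitespace split in place of `nltk.word_tokenize` so that it runs without NLTK data, and the example sentence and highlight are made up for illustration.

```python
def tokenize_text(text):
    # Simplified stand-in for nltk.word_tokenize: lowercase, split on whitespace.
    return text.lower().split()


def generate_bio_tags(tokens, evidence_tokens_set):
    # Same tagging rule as the notebook: a token is evidence if it is in the set.
    tags = ['O'] * len(tokens)
    for i, token in enumerate(tokens):
        if token in evidence_tokens_set:
            tags[i] = 'B-EVID' if (i == 0 or tags[i - 1] == 'O') else 'I-EVID'
    return tags


# Hypothetical summary and highlighted evidence, for illustration only.
summary = "LLMs excel at translation but LLMs struggle with arithmetic"
evidence = "LLMs struggle with arithmetic"

tokens = tokenize_text(summary)
evidence_set = set(tokenize_text(evidence))
tags = generate_bio_tags(tokens, evidence_set)

for token, tag in zip(tokens, tags):
    print(f"{token:<12}{tag}")
```

The last four tokens are tagged `B-EVID`, `I-EVID`, `I-EVID`, `I-EVID`, as expected. The first `llms` is also tagged `B-EVID` even though it lies outside the highlighted span, because the evidence is matched as a bag of tokens rather than as a contiguous span. Frequent words in a highlight can therefore inflate the tagged region.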

In [43]:
import pandas as pd
from sklearn.metrics import precision_score, recall_score, f1_score
# word_tokenize needs NLTK's Punkt tokenizer data to be downloaded beforehand.
from nltk.tokenize import word_tokenize

file_path = 'limitation_evidences_evaluation.xlsx'
data = pd.read_excel(file_path)

def tokenize_text(text):
    """ Tokenize text into words while handling punctuation. """
    return word_tokenize(text.lower()) 

def generate_bio_tags(tokens, evidence_tokens_set):
    """ Generate BIO tags; a token counts as evidence if it appears anywhere in the evidence token set (bag-of-tokens match, not span match). """
    tags = ['O'] * len(tokens) 
    for i, token in enumerate(tokens):
        if token in evidence_tokens_set:
            tags[i] = 'B-EVID' if (i == 0 or tags[i-1] == 'O') else 'I-EVID'
    return tags

def evaluate_annotations(tokens, bio1, bio2):
    """ Compare two BIO tag sequences (bio1 is passed as y_true, bio2 as y_pred) and return macro-averaged precision, recall, F1 over the evidence tags. """
    precision = precision_score(bio1, bio2, labels=['B-EVID', 'I-EVID'], average='macro', zero_division=0)
    recall = recall_score(bio1, bio2, labels=['B-EVID', 'I-EVID'], average='macro', zero_division=0)
    f1 = f1_score(bio1, bio2, labels=['B-EVID', 'I-EVID'], average='macro', zero_division=0)
    return precision, recall, f1

def convert_highlights_to_set(highlights):
    """ Convert highlights to a set of tokens for comparison. """
    return set(word_tokenize(highlights.lower())) if pd.notna(highlights) else set()

results = []
for _, row in data.iterrows():
    text = row['summary']
    tokens = tokenize_text(text)
    # Token sets for each annotation source (empty set when the cell is blank).
    highlight_model_set = convert_highlights_to_set(row['Model Evidences'])
    highlight_deida_set = convert_highlights_to_set(row['Deida Evidences'])
    highlight_aaron_set = convert_highlights_to_set(row['Aaron Evidences'])
    # One BIO tag sequence per source, aligned with the summary tokens.
    bio_model = generate_bio_tags(tokens, highlight_model_set)
    bio_deida = generate_bio_tags(tokens, highlight_deida_set)
    bio_aaron = generate_bio_tags(tokens, highlight_aaron_set)
    # Pairwise comparisons; the first sequence is used as y_true, the second as y_pred.
    precision_ma, recall_ma, f1_ma = evaluate_annotations(tokens, bio_model, bio_aaron)
    precision_ay, recall_ay, f1_ay = evaluate_annotations(tokens, bio_aaron, bio_deida)
    precision_my, recall_my, f1_my = evaluate_annotations(tokens, bio_model, bio_deida)

    results.append({
        'summary': text,
        'Model vs Aaron - Precision': precision_ma,
        'Model vs Aaron - Recall': recall_ma,
        'Model vs Aaron - F1 Score': f1_ma,
        'Aaron vs Deidamea - Precision': precision_ay,
        'Aaron vs Deidamea - Recall': recall_ay,
        'Aaron vs Deidamea - F1 Score': f1_ay,
        'Model vs Deidamea - Precision': precision_my,
        'Model vs Deidamea - Recall': recall_my,
        'Model vs Deidamea - F1 Score': f1_my
    })

results_df = pd.DataFrame(results)
print(results_df)
results_df.to_excel('evaluation_results.xlsx', index=False)


                                              summary  \
0   Large Language Models (LLMs) excel in various ...   
1   We introduce Codex, a GPT language model fine-...   
2   Recent advancements in the field of natural la...   
3   Large language models (LLMs) have exhibited an...   
4   Large language models (LLMs) implicitly learn ...   
5   Large language models (LLMs) are competitive w...   
6   Large language models of code (Code-LLMs) have...   
7   The prevalence and strong capability of large ...   
8   The rise of algorithmic pricing raises concern...   
9   Large Language Models (LLMs) exhibit remarkabl...   
10  Businesses and software platforms are increasi...   
11  Language models struggle with handling numeric...   
12  Large Language Models (LLMs) have demonstrated...   
13  Large Language Models (LLMs) still face challe...   
14  Large Language Models (LLMs) have shown profic...   
15  Large Language Models (LLMs) have become incre...   
16  The widespread adoption of 



In [45]:
import pandas as pd

average_f1_ma = results_df['Model vs Aaron - F1 Score'].mean()
average_f1_ay = results_df['Aaron vs Deidamea - F1 Score'].mean()
average_f1_my = results_df['Model vs Deidamea - F1 Score'].mean()

print("Average F1 Scores for Annotation Comparisons:")
print("------------------------------------------------")
print(f"F1 Score for Model vs Aaron: {average_f1_ma:.3f}")
print(f"F1 Score for Aaron vs Deidamea: {average_f1_ay:.3f}")
print(f"F1 Score for Model vs Deidamea: {average_f1_my:.3f}\n")



Average F1 Scores for Annotation Comparisons:
------------------------------------------------
F1 Score for Model vs Aaron: 0.619
F1 Score for Aaron vs Deidamea: 0.753
F1 Score for Model vs Deidamea: 0.730
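
The per-summary scores averaged here are macro-averages over the `B-EVID` and `I-EVID` tags. To make clear what that means, the sketch below recomputes the three metrics in plain Python on two made-up tag sequences. The helper `macro_prf` is hypothetical and not part of the notebook. It is meant to mirror the scikit-learn calls with `average='macro'` and `zero_division=0`, not to replace them.

```python
def macro_prf(reference, candidate, labels=("B-EVID", "I-EVID")):
    """Macro-averaged precision, recall, and F1 over the given labels."""
    precisions, recalls, f1s = [], [], []
    for label in labels:
        # Positions where both sequences carry this label.
        tp = sum(r == label and c == label for r, c in zip(reference, candidate))
        n_candidate = sum(c == label for c in candidate)
        n_reference = sum(r == label for r in reference)
        # Undefined ratios fall back to 0, like zero_division=0.
        p = tp / n_candidate if n_candidate else 0.0
        r = tp / n_reference if n_reference else 0.0
        f = 2 * p * r / (p + r) if (p + r) else 0.0
        precisions.append(p)
        recalls.append(r)
        f1s.append(f)
    n = len(labels)
    return sum(precisions) / n, sum(recalls) / n, sum(f1s) / n


# Made-up tag sequences for two annotators over the same six tokens.
annotator_a = ["O", "B-EVID", "I-EVID", "I-EVID", "O", "O"]
annotator_b = ["O", "B-EVID", "I-EVID", "O", "O", "O"]

p_ab, r_ab, f_ab = macro_prf(annotator_a, annotator_b)
p_ba, r_ba, f_ba = macro_prf(annotator_b, annotator_a)

print(f"A as reference: P={p_ab:.3f} R={r_ab:.3f} F1={f_ab:.3f}")
print(f"B as reference: P={p_ba:.3f} R={r_ba:.3f} F1={f_ba:.3f}")
```

Swapping which sequence is the reference exchanges precision and recall but leaves F1 unchanged. The averaged F1 scores therefore do not depend on the argument order used in `evaluate_annotations`, while the precision and recall columns in `results_df` do.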

