# Prompt Engineering - Evaluation

This notebook evaluates the accuracy and performance of any set of LLM-generated labels under `./data/labels_llm/{tag}` against the ground truth labels under `./data/labels/`.

The approach is to start with standard evaluation metrics (precision, recall, and F1 score), computed both overall and per entity category. For this project, the categories are `names`, `phone_numbers`, `email_addresses`, and `physical_addresses`.
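For reference, all three metrics can be derived from true positive (TP), false positive (FP), and false negative (FN) counts. The snippet below is a minimal sketch of that arithmetic, not the implementation in `src.eval`. The helper name `prf_from_counts`, the zero-division handling, and the rounding to four decimals are assumptions made for illustration.

```python
def prf_from_counts(tp: int, fp: int, fn: int) -> dict:
    """Hypothetical helper: precision, recall and F1 from raw counts."""
    # precision: share of predicted entities that are correct
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    # recall: share of ground truth entities that were found
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    # F1: harmonic mean of precision and recall (0 when both are 0)
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    return {
        "Precision": round(precision, 4),
        "Recall": round(recall, 4),
        "F1 Score": round(f1, 4),
    }


# counts taken from the phone_numbers entry in the metrics output of this notebook
print(prf_from_counts(tp=21, fp=1, fn=36))
```

With the `phone_numbers` counts reported at the end of this notebook (TP 21, FP 1, FN 36), this arithmetic gives a precision of 0.9545, a recall of 0.3684, and an F1 score of 0.5316, matching the reported values.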

In [103]:
import pandas as pd
import pprint as pp
import json
from src.utils import get_files, clean_message
from src.eval import create_labels_dict, merge_labels, calculate_metrics, calculate_overall_metrics

In [104]:
# params

# tag we want to evaluate against the ground truth labels
tag = "dev_gpt4_1106_preview"
data_path = "./data/emails_train_small.csv"
dataset = data_path.split("/")[-1].split(".")[0]

In [105]:
# let's read the training data in case we want to look at any specific message
df = pd.read_csv(data_path)
print(df.shape)

(10000, 2)


In [106]:
# get a list of all files in the labels folder(s)

# ground truth labels
files_labels = get_files("./data/labels/")
print(f"Number of files in the labels folder: {len(files_labels)}")

# labels from the LLM
files_labels_llm = get_files(f"./data/labels_llm/{tag}/")
print(f"Number of files in the labels_llm/{tag} folder: {len(files_labels_llm)}")

Number of files in the labels folder: 74
Number of files in the labels_llm/dev_gpt4_1106_preview folder: 45


In [107]:
# wrangle data structures a little bit
labels_llm = create_labels_dict(files_labels_llm, tag=tag)
labels = create_labels_dict(files_labels)

In [108]:
# for all files common to both labels and labels_llm, get the actual labels
# from the files and merge them into a single dictionary that is easier to work with
labels_final = merge_labels(labels, labels_llm)

Number of common labels: 43
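The count above is smaller than either folder's file count, which suggests that only files present in both sets are kept. A minimal sketch of that kind of merge follows, assuming both inputs are plain dictionaries keyed by file name. The function `merge_by_common_keys` and the toy data are hypothetical and are not the `merge_labels` from `src.eval`; the `labels` and `predicted_labels` keys mirror the ones this notebook reads from `labels_final`.

```python
def merge_by_common_keys(truth: dict, predicted: dict) -> dict:
    """Hypothetical helper: keep only files that appear in both dictionaries."""
    # dict key views support set intersection directly
    common = sorted(truth.keys() & predicted.keys())
    return {
        file: {"labels": truth[file], "predicted_labels": predicted[file]}
        for file in common
    }


# toy data, for illustration only
truth = {"a/1.": {"names": ["Vince"]}, "a/2.": {"names": []}}
predicted = {"a/1.": {"names": ["Vince", "Waters"]}, "a/3.": {"names": []}}

merged = merge_by_common_keys(truth, predicted)
print(f"Number of common labels: {len(merged)}")
```

Files that exist on only one side (`a/2.` and `a/3.` in the toy data) are dropped, so they contribute to neither false positives nor false negatives in the metrics.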


In [109]:
# let's look at an example of what we have
file = 'kaminski-v/all_documents/9240.'

# let's get the raw message, cleaned in our standardized way
print("Message:\n")
print(clean_message(df.query("file == @file").message.values[0]))

# let's look at the ground truth labels
print("\nGround truth labels:\n")
pp.pprint(labels_final[file]['labels'])

# let's look at the predicted labels from the LLM
print("\nLLM generated labels:\n")
pp.pprint(labels_final[file]['predicted_labels'])

Message:

Hi Vince


Just wanted to thank you for your participation at POWER 2000  last week and 
for contributing to the success of the conference. The feedback we  received 
was absolutely glowing and we were delighted with the smooth-running of  the 
event. Thank you for being a key part of that. As always, your presentations  
went down extremely well and your presence at our events makes a big 
difference,  as people are alwyas keen to hear both form you personally and 
from ENRON as a  company.

As I mentioned to you, I have recently been given the  responsibility of 
creating and developing a new conference stream in the  financial technology 
sector under the Waters brand, so I would like to take this  opportunity to 
say how much I have enjoyed working with you in the past couple  of years and 
to wish you the best of luck in the future. Please stay in touch  and if you 
come to New York, please let me know so I can take you out for a  drink!

Best regards and thank you again

In [110]:
# Calculate metrics
entity_metrics = calculate_metrics(labels_final)
overall_metrics = calculate_overall_metrics(entity_metrics)
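The internals of `calculate_metrics` and `calculate_overall_metrics` are not shown in this notebook. One plausible way to obtain the per-category TP/FP/FN counts is an exact-match set comparison per file, and one way to get an overall figure is to pool the counts across categories (micro-averaging). The sketch below assumes both choices; `count_matches`, `pool_counts`, and the toy data are hypothetical, and the real functions may normalize strings or match more loosely.

```python
CATEGORIES = ["names", "phone_numbers", "email_addresses", "physical_addresses"]


def count_matches(merged: dict) -> dict:
    """Hypothetical helper: exact-match TP/FP/FN counts per entity category."""
    counts = {c: {"TP": 0, "FP": 0, "FN": 0} for c in CATEGORIES}
    for entry in merged.values():
        for c in CATEGORIES:
            truth = set(entry["labels"].get(c, []))
            pred = set(entry["predicted_labels"].get(c, []))
            counts[c]["TP"] += len(truth & pred)  # predicted and correct
            counts[c]["FP"] += len(pred - truth)  # predicted but not in ground truth
            counts[c]["FN"] += len(truth - pred)  # in ground truth but missed
    return counts


def pool_counts(counts: dict) -> dict:
    """Hypothetical helper: sum counts over all categories (micro-averaging)."""
    return {k: sum(c[k] for c in counts.values()) for k in ("TP", "FP", "FN")}


# toy data, for illustration only
toy = {
    "a/1.": {
        "labels": {"names": ["Vince"], "email_addresses": ["v@example.com"]},
        "predicted_labels": {"names": ["Vince", "Waters"], "email_addresses": []},
    }
}
toy_counts = count_matches(toy)
toy_overall = pool_counts(toy_counts)
print(toy_counts["names"], toy_overall)
```

Pooling counts weights each category by how many entities it contains, so frequent categories such as `names` dominate the overall score. Averaging the per-category scores instead (macro-averaging) would weight all four categories equally.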

In [111]:
# save the metrics to a json file
metrics = {
    "dataset": dataset,
    "entity_metrics": entity_metrics,
    "overall_metrics": overall_metrics
}
pp.pprint(metrics)
with open(f"./data/labels_llm/{tag}/evaluation_metrics_{dataset}.json", "w") as f:
    json.dump(metrics, f)


{'dataset': 'emails_train_small',
 'entity_metrics': {'email_addresses': {'F1 Score': 0.4051,
                                        'FN': 48,
                                        'FP': 46,
                                        'Precision': 0.4103,
                                        'Recall': 0.4,
                                        'TP': 32},
                    'names': {'F1 Score': 0.5669,
                              'FN': 52,
                              'FP': 165,
                              'Precision': 0.4625,
                              'Recall': 0.732,
                              'TP': 142},
                    'phone_numbers': {'F1 Score': 0.5316,
                                      'FN': 36,
                                      'FP': 1,
                                      'Precision': 0.9545,
                                      'Recall': 0.3684,
                                      'TP': 21},
                    'physical_addresses': {'F1 Scor