# Prompt Engineering

This notebook loads the training data and randomly samples messages to send to the OpenAI API.

The API returns a response, which is saved to `./data/labels_llm/{tag}/` as LLM-generated labels. These are evaluated later by comparison with the ground-truth human labels that live in `./data/labels/`.
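
To make the later comparison concrete, here is a minimal sketch of scoring one LLM label file against its human counterpart. The label layout (`{entity_type: [values]}`), the `people` key, and the toy labels are assumptions made for illustration; the real files written by the labeling app may be structured differently, and the real evaluation may use a different metric.

```python
def entity_set(labels):
    """Flatten {entity_type: [values]} into a set of (type, normalized value) pairs."""
    pairs = set()
    for entity_type, values in labels.items():
        if not isinstance(values, list):
            continue  # skip metadata such as "file_id"
        for value in values:
            pairs.add((entity_type, str(value).strip().lower()))
    return pairs


def precision_recall(human, llm):
    """Exact-match precision and recall of LLM entities against human entities."""
    truth, predicted = entity_set(human), entity_set(llm)
    true_positives = len(truth & predicted)
    precision = true_positives / len(predicted) if predicted else 0.0
    recall = true_positives / len(truth) if truth else 0.0
    return precision, recall


# Toy labels invented for this example (not real annotation output).
human_labels = {"people": ["Janet Place", "Christopher Sullivan"]}
llm_labels = {"people": ["janet place", "Drew Fossum"], "file_id": "example"}

print(precision_recall(human_labels, llm_labels))  # (0.5, 0.5)
```

Exact matching after lowercasing is the strictest reasonable choice; partial or fuzzy matching would be a separate design decision.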

The approach here is to keep prompt engineering minimal and rely on OpenAI function calling to get back structured data similar to what the labeling app generates.
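
For readers unfamiliar with function calling, the sketch below shows the general shape of a `tools` definition that the Chat Completions API accepts. Only the function name `extract_pii_entities` comes from this notebook; the entity fields (`people`, `email_addresses`, `phone_numbers`) and the helper name `get_tools_sketch` are hypothetical and are not the actual contents of `src.openai.get_tools`.

```python
import json


def get_tools_sketch():
    """Return a tools list in the shape the Chat Completions API expects.

    The property names below are illustrative assumptions only.
    """
    entity_list = {"type": "array", "items": {"type": "string"}}
    return [
        {
            "type": "function",
            "function": {
                "name": "extract_pii_entities",
                "description": "Record PII entities found in the message(s).",
                # JSON Schema describing the arguments the model should return
                "parameters": {
                    "type": "object",
                    "properties": {
                        "people": entity_list,
                        "email_addresses": entity_list,
                        "phone_numbers": entity_list,
                    },
                    "required": ["people", "email_addresses", "phone_numbers"],
                },
            },
        }
    ]


tools_sketch = get_tools_sketch()
print(json.dumps(tools_sketch[0]["function"]["parameters"], indent=2))
```

Because the model is asked to fill in arguments matching this schema, the response can be parsed as JSON rather than scraped from free text.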

In [37]:
import pandas as pd
from openai import OpenAI
from dotenv import load_dotenv
import os
import json
import pprint as pp
from src.utils import clean_file_id, clean_message
from src.openai import get_tools


load_dotenv()

client = OpenAI(api_key=os.environ.get("OPENAI_API_KEY"))


def make_prompt(text):
    """Helper function to make the prompt for OpenAI."""
    prompt = f"""
    perform PII entity extraction from the below email message(s) using the provided `extract_pii_entities` function.
    
    do not make up any entities or parts of entities that are not present in the message(s).

    message(s):
    ```
    {text}
    ```
    """
    return prompt

In [38]:
# params

# "tag" acts as an experiment id: it keeps track of different experiments/models/approaches
tag = "dev"  # use "dev" for initial work and the baseline
openai_model = "gpt-3.5-turbo"  # start with a simpler model as a lower bound on performance
data_path = "./data/emails_train_small.csv"
# nrows = 1000
nrows = None

In [39]:
# read data
df = pd.read_csv(data_path, nrows=nrows)
print(df.shape)

(10000, 2)


In [40]:
display(df.head())

Unnamed: 0,file,message
0,germany-c/calp_hopewell/4.,Message-ID: <17014999.1075853725448.JavaMail.e...
1,campbell-l/all_documents/247.,Message-ID: <23887281.1075851883486.JavaMail.e...
2,kitchen-l/_americas/mrha/ooc/270.,Message-ID: <3290028.1075840876828.JavaMail.ev...
3,zufferli-j/sent_items/124.,Message-ID: <7771939.1075842030615.JavaMail.ev...
4,lokay-m/all_documents/906.,Message-ID: <19991611.1075844044421.JavaMail.e...


In [41]:
# sample a random message
df_sample = df.sample(1)

# some data wrangling
file_id = df_sample.file.values[0]
file_id_clean = clean_file_id(file_id)
text = df_sample.message.values[0]
text_clean = clean_message(text)

# print what we have
print("=" * 100)
print(file_id)
print(file_id_clean)
print("." * 100)
print(text_clean)
print("=" * 100)

# call openai
prompt = make_prompt(text_clean)
tools = get_tools()
chat_completion = client.chat.completions.create(
    messages=[{"role": "user", "content": prompt}],
    model=openai_model,
    tools=tools,
    tool_choice={
        "type": "function",
        "function": {"name": "extract_pii_entities"},
    },
)

# extract response
chat_completion_message = chat_completion.choices[0].message
tool_call = chat_completion_message.tool_calls[0]
extracted_data = json.loads(tool_call.function.arguments)

# print response
pp.pprint(extracted_data)

derrick-j/deleted_items/148.
derrick_j_deleted_items_148_
....................................................................................................
Janet just left me a voice mail and believes this is an ENA deal after all.  Barbara--you may want to get ahold of Janet to verify ownership on this one.  Please let this group know what the final answer is.  DF

 -----Original Message-----
From: 	Fossum, Drew  
Sent:	Tuesday, October 23, 2001 3:44 PM
To:	Sanders, Richard B.
Cc:	'James Derrick (Business Fax)'; Edison, Andrew; Gray, Barbara N.; Place, Janet
Subject:	Crescendo

The Crescendo project is a Northern Border project, and the appropriate contact for Christopher Sullivan (the fellow that left Jim the voice mail) is probably Janet Place.  I've left Janet a voice mail with the contact information on Sullivan and asked her to get ahold of him regarding his letter on the helium issue.  In case you access email ahead of voice mail Janet, the guy is Chistopher Sullivan, Rocky M
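
The extraction step above assumes the response always contains a tool call with valid JSON arguments. A more defensive version might look like the sketch below. The helper name `parse_tool_arguments` is hypothetical, and the `SimpleNamespace` object only mimics the `tool_calls[i].function.name` / `.arguments` attributes of the SDK's message object so the example runs without an API call; the `people` key is likewise an assumption.

```python
import json
from types import SimpleNamespace


def parse_tool_arguments(message, expected_name="extract_pii_entities"):
    """Return the tool-call arguments as a dict, or None if they are unusable."""
    tool_calls = getattr(message, "tool_calls", None) or []
    for call in tool_calls:
        if call.function.name != expected_name:
            continue
        try:
            parsed = json.loads(call.function.arguments)
        except json.JSONDecodeError:
            return None  # arguments were not valid JSON (e.g. cut off mid-string)
        return parsed if isinstance(parsed, dict) else None
    return None  # no matching tool call in the response


# Stand-in for chat_completion.choices[0].message, for illustration only.
fake_message = SimpleNamespace(
    tool_calls=[
        SimpleNamespace(
            function=SimpleNamespace(
                name="extract_pii_entities",
                arguments='{"people": ["Janet Place"]}',
            )
        )
    ]
)

print(parse_tool_arguments(fake_message))  # {'people': ['Janet Place']}
```

Returning `None` instead of raising lets a longer labeling run log the failure and move on to the next message.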

In [42]:
# save llm extracted data in ./data/labels_llm/{tag}/{file_id_clean}__{tag}.json
output_path = f"./data/labels_llm/{tag}/{file_id_clean}__{tag}.json"
os.makedirs(os.path.dirname(output_path), exist_ok=True)
print(f"Saving to {output_path}")
extracted_data["file_id"] = file_id
with open(output_path, "w") as f:
    json.dump(extracted_data, f)
print("Done.")

Saving to ./data/labels_llm/dev/derrick_j_deleted_items_148___dev.json
Done.
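
The cells above label a single sampled message. To label several messages in one run, the same steps could be wrapped in a loop, as in the sketch below. The function `label_messages`, its parameters, and the stub extractor are assumptions made so the example runs without an API key; in the notebook, the extractor would wrap the `client.chat.completions.create` call and the records would come from `df`.

```python
import json
import os
import random
import tempfile


def label_messages(records, extract_fn, tag, out_dir, n=5, seed=0):
    """Sample up to n (file_id, file_id_clean, text) records and save one JSON file each."""
    rng = random.Random(seed)  # fixed seed so the sample is reproducible
    sample = rng.sample(records, min(n, len(records)))
    os.makedirs(out_dir, exist_ok=True)
    saved = []
    for file_id, file_id_clean, text in sample:
        output_path = os.path.join(out_dir, f"{file_id_clean}__{tag}.json")
        if os.path.exists(output_path):
            continue  # already labeled under this tag; avoid a repeat API call
        extracted = extract_fn(text)
        extracted["file_id"] = file_id
        with open(output_path, "w") as f:
            json.dump(extracted, f)
        saved.append(output_path)
    return saved


def fake_extract(text):
    """Stub standing in for the OpenAI call; returns a dict like the parsed arguments."""
    return {"n_chars": len(text)}


# Made-up records for illustration only.
demo_records = [("a/b/1.", "a_b_1_", "hello"), ("a/b/2.", "a_b_2_", "hi there")]
demo_dir = os.path.join(tempfile.mkdtemp(), "labels_llm", "dev")
saved_paths = label_messages(demo_records, fake_extract, tag="dev", out_dir=demo_dir, n=2)
print(len(saved_paths))  # 2
```

Skipping files that already exist means an interrupted run can be restarted without paying for the same messages twice.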
