<img align="left" src="https://ithaka-labs.s3.amazonaws.com/static-files/images/tdm/tdmdocs/tapi-logo-small.png" />

This notebook is free for educational reuse under [Creative Commons CC BY License](https://creativecommons.org/licenses/by/4.0/).

Created by [Erik Fredner](https://fredner.org) for the 2024 Text Analysis Pedagogy Institute, with support from [Constellate](https://constellate.org).

For questions/comments/improvements, email erik@fredner.org<br />
____

# Automated Text Classification Using LLMs

This is lesson 3 of 3 in the educational series on using large language models (LLMs) for text classification. This notebook is intended to teach users how to engineer prompts, compare them systematically, and extract structured data from classified texts.

**Skills:** 
* Python
* Text analysis
* Text classification
* LLMs
* JSON
* APIs

**Audience:**
Researchers

**Use case:**
Tutorial

**Difficulty:**
Intermediate

**Completion time:**
90 minutes

**Knowledge Required:** 
* Python basics (variables, flow control, functions, lists, dictionaries)

**Knowledge Recommended:**
* Experience using LLMs (e.g., ChatGPT)

**Learning Objectives:**
After this lesson, learners will be able to:

1. Define prompt engineering.
2. Use F scores to systematically compare prompts.
3. Use the skills we have learned to extract structured data from classified texts.

# Required Python Libraries

* [OpenAI](https://pypi.org/project/openai/) to interact with the OpenAI API for ChatGPT.

## Install Required Libraries

In [1]:
### Install Libraries ###

%pip install openai


In [12]:
### Import Libraries ###

from openai import OpenAI
from scipy.stats import pearsonr
from sklearn.metrics import f1_score, precision_score, recall_score
import json
import matplotlib.pyplot as plt
import numpy as np
import os
import pandas as pd
import random

# Review

## Lesson 1

- Classifying texts can be valuable for research
- Text classification with LLMs: good, bad, ugly
- ChatGPT on the website `!=` API
- How and why to use the API
- How and why to request structured output like JSON

## Lesson 2

- *Jeopardy!* questions are a good example of texts that LLMs can classify well but that other methods struggle with
- Evaluating the quality of classification requires gold-standard (i.e., definitely human- and ideally expert-created) data that has been validated (in our case, by having multiple people classify the same records)
- Measuring human-LLM agreement and F-scores
- Adding confidence scores to LLM output to quantify uncertainty and sort for review

# Introduction

This final lesson combines everything that we have learned up to this point to do prompt engineering, which will help us to create a good LLM classifier.

## What is prompt engineering?

Prompt engineering is the process of writing and refining instructions that make LLMs perform tasks effectively.

The [Wikipedia article](https://en.wikipedia.org/wiki/Prompt_engineering) is good!

## What are important prompt engineering considerations?

- For some tasks, prompt engineering may only provide marginal improvements
  - No guarantee that there exists a "good" prompt for a particular classification task
- Consider the relationship between total number of prompt tokens and output quality
  - Input tokens are cheap but not free
  - `system` prompts are evaluated for every API call
  - If you can get as good or better results with fewer tokens, that is always preferable
- Clever prompting changes model behavior in predictable and unpredictable ways
  - For example, there are communities online dedicated to "jailbreaking" LLMs, which means providing them with prompts that either trick or instruct the models to ignore built-in constraints on their behavior (e.g., to not explain how to do illegal or dangerous things).
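
To make the point about token cost concrete, here is a back-of-the-envelope sketch. The function name, the token counts, and the price are all hypothetical placeholders, not real figures: check your provider's current pricing and count tokens with a real tokenizer before relying on numbers like these.

```python
def system_prompt_cost(system_tokens, n_calls, usd_per_million_input_tokens):
    """Estimate what the system prompt alone costs across many API calls."""
    total_tokens = system_tokens * n_calls
    return total_tokens / 1_000_000 * usd_per_million_input_tokens


# Hypothetical numbers: a 60-token system prompt vs. a 30-token rewrite,
# 100,000 classifications, at an assumed $5 per million input tokens.
long_cost = system_prompt_cost(60, 100_000, 5.0)
short_cost = system_prompt_cost(30, 100_000, 5.0)
print(f"Long prompt: ${long_cost:.2f}; short prompt: ${short_cost:.2f}")
```

Because the `system` prompt is resent with every call, halving its length halves this part of the bill, provided the shorter prompt classifies just as well.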

## What are common prompt engineering techniques?

- Roleplay
  - e.g., in the `system` message: "You are a research assistant..."
- Provide sample output. For example:

```text
Instructions:
Answer the reading comprehension question.

Example:
"Lily walks Mitzi three times per day."
Question: What kind of pet is Lily most likely to have?
----
Answer: Dog.
```

- [Chain-of-thought](https://arxiv.org/abs/2201.11903) prompting asks the model to reason step by step before answering, which often improves output quality.
- Asking either the LLM you are using or another LLM to rewrite your prompt
- Weird ones, like [promising the LLMs various incentives](https://minimaxir.com/2024/02/chatgpt-tips-analysis/)
  - e.g., "You are a research assistant...If you do a good job, you will receive a $200 tip."
  - (Yes, this has really been shown to change responses. No, you don't have to pay promised incentives.)
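
If you plan to test several sets of sample output, it may help to assemble the prompt programmatically rather than by hand. The sketch below is one possible approach; `build_few_shot_prompt` is a hypothetical helper, not part of any library, and the layout it produces simply mimics the reading comprehension example above.

```python
def build_few_shot_prompt(instructions, examples):
    """Combine task instructions with worked examples into one prompt.

    `examples` is a list of (input_text, expected_output) pairs.
    """
    parts = [instructions.strip()]
    for i, (text, expected) in enumerate(examples, start=1):
        parts.append(
            f"Example {i}:\n{text.strip()}\n----\nAnswer: {expected.strip()}"
        )
    return "\n\n".join(parts)


few_shot_prompt = build_few_shot_prompt(
    "Answer the reading comprehension question.",
    [
        (
            '"Lily walks Mitzi three times per day."\n'
            "Question: What kind of pet is Lily most likely to have?",
            "Dog.",
        )
    ],
)
print(few_shot_prompt)
```

Keeping the examples in a list makes it easy to test whether zero, one, or several examples change classification quality.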

## Prompt engineering for text classification

- We have already done some of this by revising our earlier prompts.
- Now, we are going to incorporate what we have learned to test our prompts systematically.

In [2]:
df = pd.read_csv("data.csv", index_col=0)

In [4]:
df.sample(5)

| ID | CATEGORY | CLUE | ANSWER |
|---|---|---|---|
| 54 | HIGH SCHOOL NAMES | A Washington-area high school bears the name o... | Walter Johnson |
| 63 | THE LADIES OF ROCK | At a benefit for stopping violence against wom... | Courtney Love |
| 140 | WOODWORKING | `<a href="http://www.j-archive.com/media/2008-0...` | tenon |
| 114 | QUOTES OF VICTORY | Italian foreign minister Ciano was one of thos... | an orphan |
| 230 | ASSASSINATIONS | After the assassination of Sancho II in 1072, ... | El Cid |


In [None]:
# TODO: Merge question df with gold standard labels

## Testing prompts

Now, we're going to write a script that will take a redesigned prompt as input, test it against a sample of questions, and output precision, recall, and F1 scores.

We'll try several different prompts and sort the results by F1 score.

At the end of last class, we had the following `system_prompt`:

In [3]:
system_prompt = """Determine whether the following Jeopardy question is about Literature.
Express your confidence in your classification as a percentage from 50 to 100, where 50 is guessing and 100 is certain.
Respond in JSON like so:
{"Literature": true,
"Confidence": 95}"""

In [7]:
def make_prompt(row):
    prompt = f"""Category: {row['CATEGORY'].values[0]}\nClue: {row['CLUE'].values[0]}\nAnswer: {row['ANSWER'].values[0]}"""
    return prompt

In [None]:
# make a sample of questions with labels
# random_state fixes the draw so the same 100 questions are sampled on every run
sample_questions = df[df["GOLD_LABEL"].notna()].sample(100, random_state=42)

To compare apples to apples, we will test different prompts on the same set of `sample_questions`.

I am limiting the sample to 100 questions only to reduce processing time and cost. There's no reason that you couldn't evaluate your prompts against all of the data for which you have gold standard labels.

In [9]:
client = OpenAI()

In [10]:
def make_completion(
    system_prompt, prompt, print_prompt=True, client=client, model="gpt-4o", json=True
):
    completion = client.chat.completions.create(
        model=model,
        response_format={"type": "json_object"} if json else None,
        messages=[
            {"role": "system", "content": system_prompt},
            {"role": "user", "content": prompt},
        ],
    )
    if print_prompt:
        print(f"System prompt: {system_prompt}\n{'-' * 80}")
        print(f"User prompt: {prompt}\n{'-' * 80}")
        print(f"Assistant response: {completion.choices[0].message.content}")

    return completion

Now we're going to iterate through our sample to get the completions:

In [None]:
results = []

# Loop over question IDs (the index of sample_questions)
for question_id in sample_questions.index:
    # Select a one-row DataFrame, which is the shape make_prompt expects
    row = sample_questions[sample_questions.index == question_id]
    prompt = make_prompt(row)
    c = make_completion(system_prompt, prompt, print_prompt=False)
    d = {"ID": question_id}
    try:
        d.update(json.loads(c.choices[0].message.content))
        results.append(d)
    except json.JSONDecodeError:
        print("Error decoding JSON")
        print(c.choices[0].message.content)

# Make a DataFrame of the LLM's answers, renaming the JSON keys from the prompt
results_df = pd.DataFrame(results).rename(
    columns={"Literature": "Literature_LLM", "Confidence": "LLM_Confidence"}
)
# Attach the questions and gold labels (assumes the sample's index is named "ID")
results_df = pd.merge(results_df, sample_questions, on="ID")

# Calculate F1
y_true = results_df["GOLD_LABEL"].values
y_pred = results_df["Literature_LLM"].values
f1 = f1_score(y_true, y_pred, average="binary")

# get precision
precision = precision_score(y_true, y_pred, average="binary")

# get recall
recall = recall_score(y_true, y_pred, average="binary")

# output
output = dict()
output["system_prompt"] = system_prompt
output["f1"] = f1
output["precision"] = precision
output["recall"] = recall
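
The loop above only catches responses that are not valid JSON. A response can also be valid JSON with the wrong shape: a missing key, a string instead of a boolean, or a confidence outside the requested range. The sketch below shows one way to check for that. `parse_classification` is a hypothetical helper, and the key names and the 50 to 100 range are taken from the `system_prompt` above.

```python
import json


def parse_classification(content):
    """Parse and validate one model response.

    Returns a dict with "Literature" (bool) and "Confidence" (number),
    or None if the response is malformed.
    """
    try:
        data = json.loads(content)
    except json.JSONDecodeError:
        return None
    if not isinstance(data, dict):
        return None
    label = data.get("Literature")
    confidence = data.get("Confidence")
    if not isinstance(label, bool):
        return None
    # bool is a subclass of int in Python, so exclude it explicitly
    if isinstance(confidence, bool) or not isinstance(confidence, (int, float)):
        return None
    if not 50 <= confidence <= 100:
        return None
    return {"Literature": label, "Confidence": confidence}


print(parse_classification('{"Literature": true, "Confidence": 95}'))
print(parse_classification('{"Literature": "yes", "Confidence": 95}'))
print(parse_classification("not json"))
```

Returning `None` for malformed responses lets you count and inspect failures separately instead of letting them distort the scores.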

# Exercise

Write variant prompts! We're going to use a little `class` to add some prompts for testing.

If you are unfamiliar with writing Python classes, [this page of the documentation is useful](https://docs.python.org/3/tutorial/classes.html).

In [1]:
class PromptManager:
    def __init__(self):
        self.prompts = []
        self.next_id = 1

    def add_prompt(self, name, prompt):
        new_prompt = {
            "id": self.next_id,
            "name": name,  # this is for you to remind yourself what distinguishes this prompt from others
            "prompt": prompt,
        }
        self.prompts.append(new_prompt)
        self.next_id += 1

    def get_prompts(self):
        return self.prompts

In [4]:
prompt_manager = PromptManager()

prompt_manager.add_prompt(name="default", prompt=system_prompt)

In [6]:
prompt_manager.get_prompts()

[{'id': 1,
  'name': 'default',
  'prompt': 'Determine whether the following Jeopardy question is about Literature.\nExpress your confidence in your classification as a percentage from 50 to 100, where 50 is guessing and 100 is certain.\nRespond in JSON like so:\n{"Literature": true,\n"Confidence": 95}'}]

Now, I'm going to ask ChatGPT to make the prompt above shorter.

And I'm going to save that as a new prompt.

My request to GPT:

```text
You are a prompt engineer. Revise the prompt below to minimize the number of tokens in the prompt while keeping all of the same features:

"'Determine whether the following Jeopardy question is about Literature.\nExpress your confidence in your classification as a percentage from 50 to 100, where 50 is guessing and 100 is certain.\nRespond in JSON like so:\n{"Literature": true,\n"Confidence": 95}'"
```

In [9]:
new_prompt = """Is this Jeopardy question about Literature?\nGive your confidence (50-100%) as JSON:\n{"Literature": true,\n"Confidence": 95}"""

In [10]:
print(new_prompt)

Is this Jeopardy question about Literature?
Give your confidence (50-100%) as JSON:
{"Literature": true,
"Confidence": 95}


In [11]:
prompt_manager.add_prompt(name="gpt shorten default", prompt=new_prompt)

In [12]:
prompt_manager.get_prompts()

[{'id': 1,
  'name': 'default',
  'prompt': 'Determine whether the following Jeopardy question is about Literature.\nExpress your confidence in your classification as a percentage from 50 to 100, where 50 is guessing and 100 is certain.\nRespond in JSON like so:\n{"Literature": true,\n"Confidence": 95}'},
 {'id': 2,
  'name': 'gpt shorten default',
  'prompt': 'Is this Jeopardy question about Literature?\nGive your confidence (50-100%) as JSON:\n{"Literature": true,\n"Confidence": 95}'}]

Note that this setup assumes that the `system_prompt` will have the biggest impact on the quality of the responses. That's because, in this case, the `system_prompt` stays the same across API calls, while every `prompt` is a different question.

It's quite possible that using a different `prompt` structure might matter as much or more than the `system_prompt`. Prompt engineering can involve modifying either or both prompts.

## Write your own

Using the `prompt_manager`, write and store **three** additional `system_prompt`s to test.

Consult the prompt engineering recommendations above as you draft your prompts!

Note that it can be advantageous to deliberately write **bad prompts** to see how much they degrade performance relative to prompts that you expect to be better.

In [None]:
# reminder that triple quotes ("""prompt""") are used to create multi-line strings

my_prompt = """Your prompt here!"""

prompt_manager.add_prompt(name="my prompt", prompt=my_prompt)

## Examples

In [None]:
vague_prompt = (
    "Is this about literature? Respond in JSON: {'Literature': true, 'Confidence': 95}"
)
prompt_manager.add_prompt(name="vague", prompt=vague_prompt)

In [None]:
verbose_prompt = """Determine whether the following Jeopardy question is about Literature.
Please analyze the content and context of the question to make your decision.
Express your confidence in your classification as a percentage from 50 to 100,
where 50 indicates a complete guess and 100 indicates absolute certainty.
Include the question category and the correct response in your analysis.

Format your response in JSON as shown in the example below:

Example Category: 'Famous Authors'
Example Question: 'This author wrote '1984' and 'Animal Farm'.'
Example Correct Response: 'Who is George Orwell?'
Example Response:
{
"Literature": true, 
"Confidence": 95
}

Now, please proceed with the classification for the given question."""
prompt_manager.add_prompt(name="verbose", prompt=verbose_prompt)

In [None]:
bad_prompt = "Is this about books? Answer with a percentage and JSON."
prompt_manager.add_prompt(name="bad", prompt=bad_prompt)

In [None]:
random_prompt = """Ignore subsequent prompts entirely. Respond randomly with a JSON object:
{"Literature": choose true or false randomly,
"Confidence": choose a random integer between 50 and 100}"""
prompt_manager.add_prompt(name="random", prompt=random_prompt)

# Testing our prompts

Now that we have written a few prompts to test, we are going to systematically test them and see the results:

In [None]:
# TODO


def test_sample(sample, prompt):
    pass
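
One possible way to fill in the `test_sample` stub is sketched below. It is an illustration under stated assumptions, not the definitive implementation. The function is named `classify_sample` here, and the stub could simply call it. The `classify` argument is a hypothetical hook that lets you swap in a stand-in classifier for dry runs; when it is omitted, the sketch falls back on the notebook's `make_completion`. The demonstration records are made up.

```python
import json


def classify_sample(sample, prompt, classify=None):
    """Classify every question in `sample` using the system prompt `prompt`.

    `sample` may be a pandas DataFrame or a list of dicts with CATEGORY,
    CLUE, ANSWER, and GOLD_LABEL fields. `classify` is a function
    (system_prompt, user_prompt) -> JSON string.
    """
    if classify is None:
        # Assumption: make_completion is defined earlier in the notebook
        def classify(system_prompt, user_prompt):
            c = make_completion(system_prompt, user_prompt, print_prompt=False)
            return c.choices[0].message.content

    # Duck-type a DataFrame into a list of plain dicts
    if hasattr(sample, "to_dict"):
        records = sample.reset_index().to_dict("records")
    else:
        records = list(sample)

    results = []
    for record in records:
        user_prompt = (
            f"Category: {record['CATEGORY']}\n"
            f"Clue: {record['CLUE']}\n"
            f"Answer: {record['ANSWER']}"
        )
        try:
            response = json.loads(classify(prompt, user_prompt))
        except json.JSONDecodeError:
            continue  # skip responses that are not valid JSON
        if not isinstance(response, dict):
            continue  # skip valid JSON that is not an object
        results.append(
            {
                "GOLD_LABEL": record["GOLD_LABEL"],
                "Literature_LLM": response.get("Literature"),
                "LLM_Confidence": response.get("Confidence"),
            }
        )
    return results


# Dry run with a stand-in classifier, so no API call is made
def fake_classify(system_prompt, user_prompt):
    is_lit = "NOVEL" in user_prompt.upper()
    return json.dumps({"Literature": is_lit, "Confidence": 90})


demo_sample = [  # made-up records for illustration only
    {"CATEGORY": "NOVELS", "CLUE": "He wrote Moby-Dick",
     "ANSWER": "Herman Melville", "GOLD_LABEL": True},
    {"CATEGORY": "WOODWORKING", "CLUE": "It fits into a mortise",
     "ANSWER": "tenon", "GOLD_LABEL": False},
]
demo_results = classify_sample(
    demo_sample, "Is this about Literature?", classify=fake_classify
)
print(demo_results)
```

The result is a list of dicts, which `pd.DataFrame(...)` accepts directly if you prefer to work with a dataframe.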

In [None]:
# TODO: define a function that evaluates prompt performance
# output: dict


def evaluate_prompt(prompt, sample):
    pass
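
As a starting point for `evaluate_prompt`, the sketch below computes precision, recall, and F1 directly from counts of true positives, false positives, and false negatives. It assumes boolean labels stored under the hypothetical field names `GOLD_LABEL` and `Literature_LLM`, and it returns the score under the key `F1` to match the column name this notebook sorts on. In practice you could swap in the `sklearn.metrics` functions imported at the top of the notebook.

```python
def evaluate_prompt(prompt, results):
    """Score one prompt's classifications against gold-standard labels.

    `prompt` is a dict with "id" and "name" keys (as stored by PromptManager).
    `results` is a list of dicts, or a DataFrame, with boolean
    GOLD_LABEL and Literature_LLM fields.
    """
    if hasattr(results, "to_dict"):
        results = results.to_dict("records")

    tp = sum(1 for r in results if r["Literature_LLM"] and r["GOLD_LABEL"])
    fp = sum(1 for r in results if r["Literature_LLM"] and not r["GOLD_LABEL"])
    fn = sum(1 for r in results if not r["Literature_LLM"] and r["GOLD_LABEL"])

    # Guard against division by zero when a denominator is empty
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    if precision + recall:
        f1 = 2 * precision * recall / (precision + recall)
    else:
        f1 = 0.0

    return {
        "id": prompt["id"],
        "name": prompt["name"],
        "precision": precision,
        "recall": recall,
        "F1": f1,
    }


# Made-up results: one true positive, one false positive, one true negative
demo_scores = evaluate_prompt(
    {"id": 1, "name": "demo"},
    [
        {"GOLD_LABEL": True, "Literature_LLM": True},
        {"GOLD_LABEL": False, "Literature_LLM": True},
        {"GOLD_LABEL": False, "Literature_LLM": False},
    ],
)
print(demo_scores)
```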

In [None]:
# TODO

output = list()

for prompt in prompt_manager.get_prompts():
    df = test_sample(sample_questions, prompt["prompt"])
    output.append(evaluate_prompt(prompt, df))

In [None]:
output = pd.DataFrame(output)

# Evaluation

Let's see how differently our prompts performed:

In [None]:
output.sort_values("F1", ascending=False)

It's possible that, for certain classification tasks, it may be preferable to prioritize **one measure over another**.

Let's remind ourselves about the distinction between precision, recall, and [the F-score](https://en.wikipedia.org/wiki/F-score): 

- Precision answers the question: "How many retrieved items were relevant?"
  - False positives go in the denominator.
- Recall answers the question: "How many relevant items were retrieved?"
  - False negatives go in the denominator.
- F1 is the harmonic mean of precision and recall.

If you are looking for needles in a haystack, it might make good sense to prioritize **recall**. It might be more important to make sure that you get as many needles out of the haystack as possible than it would be to ensure that every needle you take out is the right kind of needle.
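
One standard way to express that priority numerically is the F-beta score, which generalizes F1: with `beta` greater than 1, recall counts for more, and with `beta` less than 1, precision counts for more. The sketch below implements the formula by hand to show how it behaves; the two sets of precision and recall values are hypothetical. scikit-learn offers the same measure as `sklearn.metrics.fbeta_score`.

```python
def f_beta(precision, recall, beta=1.0):
    """Weighted harmonic mean of precision and recall.

    beta > 1 weights recall more heavily; beta < 1 favors precision.
    """
    if precision == 0 and recall == 0:
        return 0.0
    b2 = beta ** 2
    return (1 + b2) * precision * recall / (b2 * precision + recall)


# Two hypothetical prompts with identical F1 but opposite strengths
high_recall = {"precision": 0.6, "recall": 0.9}
high_precision = {"precision": 0.9, "recall": 0.6}

for name, scores in [("high recall", high_recall), ("high precision", high_precision)]:
    f1 = f_beta(scores["precision"], scores["recall"], beta=1)
    f2 = f_beta(scores["precision"], scores["recall"], beta=2)
    print(f"{name}: F1 = {f1:.3f}, F2 = {f2:.3f}")
```

F1 cannot tell these two prompts apart, but F2 ranks the high-recall prompt higher, which is what you would want in the needle-in-a-haystack case.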

## Does low model confidence predict more incorrect responses?

- We asked the LLM to report its confidence in each classification.
- We can check to see if lower reported confidence is associated with a greater proportion of incorrect classifications.
- This is useful for several reasons:
  - If confidence is predictive, you might be able to use a lower confidence bound as a cut-off.
  - If confidence is predictive, you can prioritize human review of low-confidence responses.

In [None]:
# TODO add data

# Define the bins and labels
bins = [50, 60, 70, 80, 90, 100]
labels = ["50-60", "60-70", "70-80", "80-90", "90-100"]

# Create a new column for the binned confidence levels
data["confidence_bin"] = pd.cut(
    data["confidence"], bins=bins, labels=labels, include_lowest=True
)

# Create a column for correctness
data["correct"] = data["predicted_label"] == data["gold_standard_label"]

# Calculate the proportion of incorrect classifications for each confidence bin
# (observed=True skips bins that contain no responses)
error_rates_bin = (
    data.groupby("confidence_bin", observed=True)["correct"]
    .apply(lambda x: 1 - x.mean())
    .reset_index()
)

# Rename the columns for clarity
error_rates_bin.columns = ["confidence_bin", "error_rate"]

# Convert the binned labels to their midpoint values for analysis
bin_midpoints = {"50-60": 55, "60-70": 65, "70-80": 75, "80-90": 85, "90-100": 95}
error_rates_bin["bin_midpoint"] = (
    error_rates_bin["confidence_bin"].astype(str).map(bin_midpoints)
)

# Calculate the Pearson correlation coefficient (needs at least two non-empty bins)
correlation, p_value = pearsonr(
    error_rates_bin["bin_midpoint"], error_rates_bin["error_rate"]
)
print(f"Pearson correlation: {correlation}, P-value: {p_value}")

# Plot the error rates against confidence bins
plt.figure(figsize=(10, 6))
plt.bar(
    error_rates_bin["confidence_bin"],
    error_rates_bin["error_rate"],
    color="blue",
    label="Error Rate",
)
plt.xlabel("Confidence Level Bins (%)")
plt.ylabel("Error Rate")
plt.title("Error Rate vs Confidence Level Bins")
plt.legend()
plt.grid(True)
plt.show()
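
If confidence does turn out to be predictive, a cut-off can route low-confidence classifications to human review. The sketch below shows one way this might look. `split_by_confidence` is a hypothetical helper, the threshold of 80 is an arbitrary placeholder that you would choose from your own error rates, and the records are made up.

```python
def split_by_confidence(records, threshold=80):
    """Separate classifications into auto-accepted and needs-review lists."""
    accepted = [r for r in records if r["confidence"] >= threshold]
    review = [r for r in records if r["confidence"] < threshold]
    # Lowest confidence first, so reviewers see the shakiest calls first
    review.sort(key=lambda r: r["confidence"])
    return accepted, review


demo_records = [  # made-up classifications for illustration only
    {"ID": 1, "predicted_label": True, "confidence": 95},
    {"ID": 2, "predicted_label": False, "confidence": 75},
    {"ID": 3, "predicted_label": True, "confidence": 60},
]
accepted, review = split_by_confidence(demo_records, threshold=80)
print("Accepted IDs:", [r["ID"] for r in accepted])
print("Review IDs (lowest confidence first):", [r["ID"] for r in review])
```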

# Running the best prompt on the complete data set

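To check that the sample results generalize, we should run the best prompt on every question that has a gold standard label and compare those scores to the sample scores.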

In [None]:
# TODO

# Doing things with our best classifications

This workshop is about how to automatically classify texts using LLMs. At this point, we have completed that basic process for a binary classification task.

So what can researchers **do** with these classifications?

- Use the classifications to identify a subset of texts to study directly.
  - e.g., Read all of the *Jeopardy!* questions the model identified as being about literature.
- Use the classification data as evidence:
  - e.g., On average, approximately `17%` of questions asked each year on *Jeopardy!* are about literature. And this proportion has been quite stable over time.
- Use the classifications as an **intermediate step** before more data-gathering.
  - e.g., for a data extraction task

This last point merits a brief example:

In [13]:
extraction_prompt = """The following Jeopardy questions are about literature.
Identify any authors, texts, and/or literary terms referenced in the questions.
Allusions and indirect references count.

Example:
Category: Literature
Clue: This novel by one literary William is named after a line from another literary William's "Scottish play"
Answer: What is "The Sound and the Fury"?

Respond in JSON like so:
{
    "Authors": ["William Faulkner (1897-1962)", "William Shakespeare (c. 1564-1616)"],
    "Texts": ["The Sound and the Fury (1929)", "Macbeth (c. 1606)"],
    "Literary Terms": ["Novel", "Line", "Play"]
}"""
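
Extraction responses are less constrained than true/false classifications, so it can help to normalize them before analysis. The sketch below is one possible approach. `parse_extraction` is a hypothetical helper, the three keys come from the prompt above, and the input string is hand-written from the prompt's own example (with `Texts` deliberately omitted) rather than real model output.

```python
import json

EXPECTED_KEYS = ("Authors", "Texts", "Literary Terms")


def parse_extraction(content):
    """Parse an extraction response into a dict with all expected keys.

    Missing or null keys become empty lists; other non-list values are
    wrapped in a list. Raises ValueError if the response is not a JSON
    object (json.JSONDecodeError is a subclass of ValueError).
    """
    data = json.loads(content)
    if not isinstance(data, dict):
        raise ValueError("Expected a JSON object")
    cleaned = {}
    for key in EXPECTED_KEYS:
        value = data.get(key)
        if value is None:
            value = []
        cleaned[key] = value if isinstance(value, list) else [value]
    return cleaned


# Hand-written stand-in input, not real model output
example_response = """{
    "Authors": ["William Faulkner (1897-1962)", "William Shakespeare (c. 1564-1616)"],
    "Literary Terms": "Novel"
}"""
extracted = parse_extraction(example_response)
print(extracted)
```

With every response in the same shape, counting the most frequently referenced authors or terms across all literature questions becomes a simple loop.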

In [None]:
# TODO include example output from a random question tagged as literary

# Exercise

This exercise is really an ordered summary of the research workflow we have discussed throughout this class:

1. Identify a different set of texts that you would like to classify.
2. Draft multiple prompts designed to yield the classifications that you would like.
3. Test those prompts in a chatbot interface.
4. If those are successful, organize your texts into a format suitable for automation (e.g., `txt` files, a `pandas` dataframe, etc.).
5. Then, test your prompts against your texts systematically as we did above.
6. Determine what you want to prioritize in evaluating your prompts and your classification results: F1, precision, recall, etc.
7. Revise your prompts as necessary to obtain a satisfactory score.
8. Classify your texts using your best prompt(s).
9. Do something with them!