<img align="left" src="https://ithaka-labs.s3.amazonaws.com/static-files/images/tdm/tdmdocs/tapi-logo-small.png" />

This notebook is free for educational reuse under [Creative Commons CC BY License](https://creativecommons.org/licenses/by/4.0/).

Created by [Erik Fredner](https://fredner.org) for the 2024 Text Analysis Pedagogy Institute, with support from [Constellate](https://constellate.org).

For questions/comments/improvements, email erik@fredner.org<br />
____

# Automated Text Classification Using LLMs

This is lesson 2 of 3 in the educational series on using large language models (LLMs) for text classification. This notebook is intended to teach users how to create gold-standard data and use it to evaluate an LLM's classifications with the F-score.

**Skills:** 
* Python
* Text analysis
* Text classification
* LLMs
* JSON
* APIs

**Audience:**
Researchers

**Use case:**
Tutorial

**Difficulty:**
Intermediate. This assumes users are familiar with Python and have been programming for 6+ months. Code makes up a larger part of the notebook and basic concepts related to Python are not explained.

**Completion time:**
90 minutes

**Knowledge Required:** 
* Python basics (variables, flow control, functions, lists, dictionaries)

**Knowledge Recommended:**
* Experience using LLMs (e.g., ChatGPT)

**Learning Objectives:**
After this lesson, learners will be able to:

1. Describe how to evaluate automated LLM classifications.
2. Create data to evaluate LLM classifications.
3. Characterize the [F-score](https://en.wikipedia.org/wiki/F-score).
4. Combine the ideas above to evaluate multiple prompts.

**Research Pipeline:**
1. Play with LLMs if you have not already.
2. Test using a chatbot interface for an LLM (like ChatGPT) to perform relevant classifications for your research.
3. Evaluate initial results.
4. Learn how to interact with an API through this notebook.
5. Modify your initial experiments based on what we cover.

# Required Python Libraries

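* [OpenAI](https://pypi.org/project/openai/) to interact with the OpenAI API for ChatGPT.
* [pandas](https://pypi.org/project/pandas/) to load and reshape the data.
* [scikit-learn](https://pypi.org/project/scikit-learn/) to calculate the F-score.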

In [1]:
### Import Libraries ###

from openai import OpenAI
import pandas as pd

# Required Data

**Data Format:** 
* Comma-separated values (.csv)

**Data Source:**
* 500 randomly sampled *Jeopardy!* questions, including their category, clue, and answer
* Questions transcribed from episodes of the show by archivists at the [*J-Archive!*](https://j-archive.com)
* Questions extracted and posted publicly [on GitHub](https://github.com/amwagner19/jarchive-clues)
* Extraneous columns for the course dropped and IDs reindexed

**Data Quality/Bias:**
* This data reproduces a small random subset of the questions recorded by the [*J-Archive!*](https://j-archive.com) archivists
* *J-Archive!* is a well-regarded fan site, but it has not recorded every clue on every game (e.g., unasked questions)
* Any biases reflected in the form and content of the questions reflect those of the *Jeopardy!* writers

## Download Required Data

The dataset for this class is small enough to distribute with [the git repository for the course](https://github.com/erikfredner/tap-2024), which you can clone.

In [2]:
df = pd.read_csv("data.csv", index_col=0)

In [3]:
df.sample(5)

| ID | CATEGORY | CLUE | ANSWER |
|---|---|---|---|
| 132 | FARMING | The main type of farming on the western grassl... | cattle ranching |
| 416 | GEOGRAPHY | Founded by Chinese miners in 1857, it's Malays... | Kuala Lumpur |
| 275 | THUMB ENCHANTED EVENING | The "thumb" on the geographic mitten that is M... | Lake Huron |
| 224 | U.S. CITIES | The Smith Brothers made their first cough drop... | Poughkeepsie |
| 381 | MOVIE PAIRS | Comic strip characters played on screen by Art... | Dagwood & Blondie |


# Review of Lesson 1

1. Why classify texts?
   1. Examples from business
   2. Examples from scholarship
2. Good, bad, and ugly of using LLMs for text classification
3. ChatGPT website vs. API
4. Calling the API
5. Model options
6. Model costs
7. JSON mode

# Introduction

- Last time, we talked about text classification.
- But we didn't have any texts to classify.
- Today, we're going to change that with a type of text that non-LLM methods would struggle to classify: *Jeopardy!* questions.

## Why *Jeopardy!*?

- I just finished [research](https://fredner.org/jeopardy/) for an essay I am writing about *Jeopardy!* questions and the literary canon.
  - I'm teaching the methods used for that project here.
- Non-LLM methods struggle with short, dense, allusive texts like quiz questions, which makes them a good test case for classifications that LLMs can perform and other methods cannot easily handle.

# Types of classifications

Classification problems generally fall into one of several categories. Here are some of the most common:

- Binary classification: Classifying texts into one of two categories.
  - e.g., classifying emails as Spam or Not Spam
- Multi-class classification: Classifying texts into one of three or more categories.
  - e.g., classifying newspaper articles as politics, business, arts, etc.
- Multi-label classification: Labeling texts with one or more classifications.
  - e.g., classifying novels with one or more genre labels: `['fantasy', 'romance']`, for example
- Hierarchical classification: Classifying texts as part of both classes and subclasses
  - e.g., classifying research papers:
```python
{
    "field": "literary studies",
    "subfields": ["american literature", "nineteenth-century"],
}
```
- Ordinal classification: Classifying a text in a way that ranks or orders it.
  - e.g., attempting to infer star ratings (from 1 to 5) from unstarred movie reviews
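
To make these categories concrete, here is a minimal sketch of what a structured label for each type might look like as a Python dictionary. The keys and values are illustrative assumptions, not a format required by any library.

```python
import json

# Hypothetical label formats, one per classification type (all keys are illustrative).
example_labels = {
    "binary": {"spam": True},
    "multi_class": {"section": "business"},
    "multi_label": {"genres": ["fantasy", "romance"]},
    "hierarchical": {
        "field": "literary studies",
        "subfields": ["american literature", "nineteenth-century"],
    },
    "ordinal": {"stars": 4},
}

# Each label serializes to JSON, the format we will ask the LLM to return.
for kind, label in example_labels.items():
    print(f"{kind}: {json.dumps(label)}")
```

Note that `json.dumps` writes Python's `True` as JSON's lowercase `true`.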


## Which are we going to do?

In this brief class, we're going to focus on the simplest category---**binary classification**---with a little bit of **ordinal classification**, too.

# How do you evaluate an LLM's classifications?

- Neither humans nor LLMs classify texts perfectly.
- How well do humans agree with each other?
- How well do the LLM's judgments align with researcher judgments?

The first thing that we need to do is create [**gold-standard data**](https://simmering.dev/blog/gold-data/) that we can use to evaluate the model. In our case, this is going to be the results of human judgments classifying our questions.

In some cases, there might already exist classification labels that you could use (e.g., librarians' categorizations of books).
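
One common way to answer the question of how well two human labelers agree is Cohen's kappa, which corrects raw agreement for the agreement expected by chance. The sketch below computes it by hand for two hypothetical annotators; the labels are made up for illustration, and the function name `cohen_kappa` is my own. (scikit-learn also provides `sklearn.metrics.cohen_kappa_score`.)

```python
from collections import Counter


def cohen_kappa(labels_a, labels_b):
    """Cohen's kappa for two annotators who labeled the same items."""
    n = len(labels_a)
    # Observed agreement: share of items given the same label by both annotators.
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    # Expected agreement: chance that both pick the same label independently.
    counts_a, counts_b = Counter(labels_a), Counter(labels_b)
    expected = sum(counts_a[c] * counts_b[c] for c in counts_a) / (n * n)
    # Note: kappa is undefined (division by zero) if expected agreement is 1.
    return (observed - expected) / (1 - expected)


# Made-up labels for six questions (not real course data).
annotator_a = ["History", "Geography", "Literature", "Science", "Other", "History"]
annotator_b = ["History", "Geography", "Literature", "Science", "Other", "Other"]

kappa = cohen_kappa(annotator_a, annotator_b)
print(f"Cohen's kappa: {kappa:.3f}")
```

A kappa of 1 means perfect agreement, and a kappa near 0 means agreement no better than chance.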

## Creating data to evaluate the classification

We're going to get a sense of the challenge of classifying *Jeopardy!* questions by doing it ourselves.

In [4]:
# make sure you have loaded the data
df.sample(5)

| ID | CATEGORY | CLUE | ANSWER |
|---|---|---|---|
| 24 | DIRECTORS | Yilmaz Guney, accused of murder, directed his ... | in prison |
| 199 | 5-LETTER WORDS | An archaic interjection used to express surpri... | marry |
| 356 | ONE GIANT LEAP | His important experiments included launching t... | Robert Goddard |
| 428 | TOP 40 MATH | Tennessee Ernie Ford's "Tons" times Eddie Mone... | 32 |
| 406 | ROCK ART | She's painted the covers for many of her own a... | Joni Mitchell |


- Everyone gets their own sample of 50 questions from our 500-question sample.
- Our samples will overlap, which is what we want: overlapping labels let us measure agreement.
- To keep the list of possible labels small, we are going to label questions as belonging to one of five categories that correspond to the most frequent topics in *Jeopardy!*:
  - History
  - Geography
  - Literature
  - Science
  - Other
- You can look up information about questions if you're unsure how to classify them. (Just don't ask ChatGPT.)



In [5]:
my_df = df.sample(50).copy()
my_df.reset_index(inplace=True)

In [6]:
my_df.head()

|  | ID | CATEGORY | CLUE | ANSWER |
|---|---|---|---|---|
| 0 | 410 | A SPECIAL TRAIN CAR | `<a href="https://www.j-archive.com/media/2023-...` | a dome car |
| 1 | 440 | THE WORLD OF SPORTS | The Canadian version of this sport has 12 men ... | football |
| 2 | 276 | `<a href="https://www.j-archive.com/media/2022-...` | `(<a href="https://www.j-archive.com/media/2022...` | the aviator |
| 3 | 116 | TOUGH HISTORY | Name for the style seen here after a king who ... | Louis XIV |
| 4 | 34 | THE 18th CENTURY | In 1789 General Nguyen Hue led an attack on Ch... | Tet |


In [7]:
import pandas as pd
from IPython.display import clear_output


def classify_jeopardy_questions(my_df):
    # Initialize a list to store the categorizations
    categorizations = []

    # List of valid categories
    categories = ["History", "Geography", "Literature", "Science", "Other"]

    for index, row in my_df.iterrows():
        # Print the CATEGORY, CLUE, and ANSWER for the current row
        print(f"ID: {row['ID']}")
        print(f"CATEGORY: {row['CATEGORY']}")
        print(f"CLUE: {row['CLUE']}")
        print(f"ANSWER: {row['ANSWER']}")

        # Display category options with corresponding numbers
        print("Please classify the question into one of the following categories:")
        for i, category in enumerate(categories, 1):
            print(f"{i}: {category}")

        # Ask the user for a valid categorization
        while True:
            try:
                category_index = int(
                    input("Enter the number corresponding to the category: ")
                )
                if 1 <= category_index <= len(categories):
                    selected_category = categories[category_index - 1]
                    # Save the categorization along with the row ID
                    categorizations.append(
                        {"ID": row["ID"], "Category": selected_category}
                    )
                    break
                else:
                    print("Invalid number. Please try again.")
            except ValueError:
                print("Invalid input. Please enter a number.")

        # Clear the output in the Jupyter notebook
        clear_output(wait=True)

    # Convert the categorizations to a DataFrame for further use if needed
    categorizations_df = pd.DataFrame(categorizations)

    # Return the categorizations DataFrame
    return categorizations_df

In [8]:
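# Run this cell to label your 50 questions interactively.
# The next cell needs `categorized_df`, so this line must not stay commented out.
categorized_df = classify_jeopardy_questions(my_df)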

In [None]:
import random

# generate a random id to distinguish your data from others'
random_id = random.randint(1, 10000)
categorized_df.to_csv(f"classified_jeopardy_{random_id}.csv", index=False)

Now, we have saved your categorizations to a CSV file in your Constellate workspace.

## Next steps

1. Right-click your classification file (`classified_jeopardy_...csv`), and select `Download`. That will download the file to the `~/Downloads` folder on your computer.
2. Navigate to the `Downloads` folder on your computer and find your `.csv` file.
3. Upload your `.csv` to [this Dropbox folder](https://www.dropbox.com/request/ryB9Bh9QefqASXRaZfPU) so that we can combine our data:

<https://www.dropbox.com/request/ryB9Bh9QefqASXRaZfPU>

## Discussion of classification experience

- Was this more difficult than you anticipated?
- How did you decide how to categorize a given question?

# Evaluating our classifications

We will assign each question the label chosen by the majority of the human classifiers who labeled it.

**N.B.:** The code below will only work on my machine for now. I will add the classifications to the course repository ASAP.

In [10]:
import os
import pandas as pd

PATH = "/Users/erik/Dropbox/File requests/2024 TAPI classifications"
csvs = [f for f in os.listdir(PATH) if f.endswith(".csv")]
# create a stacked dataframe with the csvs
df = pd.concat([pd.read_csv(os.path.join(PATH, f)) for f in csvs], ignore_index=True)

In [11]:
df_vals = df.groupby("ID").value_counts().unstack().fillna(0)
df_vals

| ID | Geography | Literature | Other | Science |
|---|---|---|---|---|
| 43 | 0.0 | 0.0 | 1.0 | 0.0 |
| 66 | 2.0 | 0.0 | 0.0 | 0.0 |
| 112 | 0.0 | 0.0 | 1.0 | 0.0 |
| 208 | 0.0 | 0.0 | 1.0 | 0.0 |
| 216 | 0.0 | 1.0 | 0.0 | 0.0 |
| 236 | 0.0 | 0.0 | 0.0 | 1.0 |
| 267 | 1.0 | 0.0 | 0.0 | 0.0 |
| 344 | 0.0 | 0.0 | 0.0 | 1.0 |
| 364 | 1.0 | 0.0 | 0.0 | 0.0 |


For simplicity's sake, we are going to focus on questions where one categorization was the clear "winner" (i.e., exclude ties).

In [12]:
def check_row_ties(row):
    # A row is tied when more than one category shares the highest vote count
    max_value = row.max()
    max_count = (row == max_value).sum()
    return max_count > 1


# Apply the function to each row and create a new column for the results
df_vals["Tie?"] = df_vals.apply(check_row_ties, axis=1)

In [13]:
# How many ties did we have?
df_vals["Tie?"].sum()

0
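
The majority-vote-with-ties logic used here can also be sketched without pandas. The example below uses made-up votes rather than the course data, and `majority_label` is a hypothetical helper name, so treat it as an illustration of the idea rather than the notebook's method.

```python
from collections import Counter


def majority_label(votes):
    """Return the most common label, or None if the top count is tied."""
    ranked = Counter(votes).most_common()
    if len(ranked) > 1 and ranked[0][1] == ranked[1][1]:
        return None  # tie: no clear winner
    return ranked[0][0]


# Made-up votes keyed by question ID (not real course data).
votes_by_id = {
    1: ["Geography", "Geography", "Other"],
    2: ["Science", "Literature"],
    3: ["History"],
}

labels = {qid: majority_label(votes) for qid, votes in votes_by_id.items()}
print(labels)
```

Question 2 has one vote for each of two categories, so it gets no label and would be excluded from the gold-standard data.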

In [14]:
# get the ids for rows without ties
gold_ids = df_vals[~df_vals["Tie?"]].index.tolist()

In [15]:
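# keep rows without ties, drop the tie column, and turn the ID index back into a column
# (chaining avoids modifying a slice of df_vals in place, which triggers a pandas warning)
gold_df = df_vals[df_vals.index.isin(gold_ids)].drop(columns="Tie?").reset_index()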

In [16]:
gold_melted = gold_df.melt(id_vars=["ID"], var_name="Category", value_name="Value")

In [17]:
gold_labels = gold_melted.loc[
    gold_melted.groupby("ID")["Value"].idxmax(), ["ID", "Category"]
]

In [18]:
gold_labels.set_index("ID", inplace=True)
gold_labels

| ID | Category |
|---|---|
| 43 | Other |
| 66 | Geography |
| 112 | Other |
| 208 | Other |
| 216 | Literature |
| 236 | Science |
| 267 | Geography |
| 344 | Science |
| 364 | Geography |


# Using our gold-standard data

- We created human-labeled data indicating correct results for our classifications.
  - (In a real research setting, this would be a more meticulous and expert-driven process. For this class, it's fine.)
- Now we need to test the LLM on these classification tasks.
- We will evaluate its performance using the data we created and a simple statistic called the [F-score](https://en.wikipedia.org/wiki/F-score).

## Getting one basic classification result

- We talked earlier about types of classification.
- The data we have just created is suitable for either binary or multi-class classification.
- The type of classification we get back from the LLM will be determined by the prompt we give the model.
- Let's start with a simple prompt:

In [19]:
# binary classification prompt
system_prompt = """Determine whether the following Jeopardy question is about Literature.
Respond in JSON like so: {"Literature": True}"""

> **N.B.** It is important to instruct the model to respond in JSON *even if you activate JSON mode*.

In [20]:
# load questions
data = pd.read_csv("data.csv", index_col=0)

In [21]:
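def make_prompt(row):
    # `row` is a one-row DataFrame (e.g., from data.sample(1)),
    # so .values[0] pulls the single value out of each column
    prompt = f"""Category: {row['CATEGORY'].values[0]}\nClue: {row['CLUE'].values[0]}\nAnswer: {row['ANSWER'].values[0]}"""
    return prompt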

In [22]:
prompt = make_prompt(data.sample(1))
print(prompt)

Category: DON QUIZ-OTE
Clue: In 1967 this man's aircraft company merged with McDonnell Aircraft
Answer: (Donald) Douglas


In [23]:
from openai import OpenAI

# remember, this uses the API key in your .env file:
client = OpenAI()

# if you didn't run the code in lesson 1, you probably don't have an .env file

In [28]:
def make_completion(
    system_prompt, prompt, print_prompt=True, client=client, model="gpt-4o", json=True
):
    completion = client.chat.completions.create(
        model=model,
        response_format={"type": "json_object"} if json else None,
        messages=[
            {"role": "system", "content": system_prompt},
            {"role": "user", "content": prompt},
        ],
    )
    if print_prompt:
        print(f"System prompt: {system_prompt}\n{'-' * 80}")
        print(f"User prompt: {prompt}\n{'-' * 80}")
        print(f"Assistant response: {completion.choices[0].message.content}")

    return completion

In [29]:
c = make_completion(system_prompt, prompt)

System prompt: Determine whether the following Jeopardy question is about Literature.
Respond in JSON like so: {"Literature": True}
--------------------------------------------------------------------------------
User prompt: Category: DON QUIZ-OTE
Clue: In 1967 this man's aircraft company merged with McDonnell Aircraft
Answer: (Donald) Douglas
--------------------------------------------------------------------------------
Assistant response: {"Literature": false}


Ok, that returns a binary response.
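
Before turning responses into data, it can help to check that each response has the expected shape. The sketch below is one way to do that with the standard-library `json` module; the helper name `parse_binary_response` and its checks are my own assumptions, not part of the OpenAI library. It also shows that JSON booleans are lowercase (`true`/`false`), so a Python-style `True` does not parse.

```python
import json


def parse_binary_response(content, key="Literature"):
    """Parse a JSON response and return the boolean stored under `key`.

    Raises ValueError if the response is malformed. This helper is
    illustrative and not part of the OpenAI library.
    """
    try:
        data = json.loads(content)
    except json.JSONDecodeError as err:
        raise ValueError(f"Response is not valid JSON: {content!r}") from err
    if not isinstance(data, dict) or not isinstance(data.get(key), bool):
        raise ValueError(f"Expected a boolean under {key!r}: {content!r}")
    return data[key]


# The response text shown above parses to a Python bool.
print(parse_binary_response('{"Literature": false}'))

# JSON booleans are lowercase, so a Python-style literal is rejected.
try:
    parse_binary_response('{"Literature": True}')
except ValueError as err:
    print(err)
```

In the notebook, the string passed in would be `c.choices[0].message.content`.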

Now we can put these things together and see how text classifications can quickly become data.

We'll start with one row that we know has a gold-standard label:

In [30]:
import random
import json

output = dict()

id = random.choice(gold_ids)
row = data[data.index == id]
prompt = make_prompt(row)
c = make_completion(system_prompt, prompt)
output[id] = json.loads(c.choices[0].message.content)
output

System prompt: Determine whether the following Jeopardy question is about Literature.
Respond in JSON like so: {"Literature": True}
--------------------------------------------------------------------------------
User prompt: Category: LEADERS
Clue: Viktor Orban is the anti-immigrant leader of this European country
Answer: Hungary
--------------------------------------------------------------------------------
Assistant response: {"Literature": false}


{66: {'Literature': False}}

Now we can put all of this together to automatically process a batch of questions using the `gold_ids`:

In [None]:
l = list()

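# sample up to 20 gold-standard questions (fewer if we have fewer gold labels)
for id in random.sample(gold_ids, min(20, len(gold_ids))):
    d = dict()
    row = data[data.index == id]
    prompt = make_prompt(row)
    # `print_prompt` is the keyword defined in make_completion
    c = make_completion(system_prompt, prompt, print_prompt=False)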
    d["ID"] = id
    d.update(json.loads(c.choices[0].message.content))
    l.append(d)

In [None]:
output = pd.DataFrame(l)
output.columns = ["ID", "Literature_LLM"]
output.set_index("ID", inplace=True)
output

| ID | Literature_LLM |
|---|---|
| 43 | False |
| 66 | False |
| 112 | True |
| 208 | False |
| 216 | True |


## Comparing LLM classifications to human classifications

In [None]:
gold_labels.head()

| ID | Category |
|---|---|
| 43 | Other |
| 66 | Geography |
| 112 | Other |
| 208 | Other |
| 216 | Literature |


How often do humans and the LLM agree on binary classification (Literature vs. Not Literature)?

In [None]:
agree_df = output.merge(gold_labels, left_index=True, right_index=True)
agree_df["Category_Literature"] = agree_df["Category"] == "Literature"
agree_df["Human and LLM Agree"] = (
    agree_df["Category_Literature"] == agree_df["Literature_LLM"]
)
agree_df

| ID | Literature_LLM | Category | Category_Literature | Human and LLM Agree |
|---|---|---|---|---|
| 43 | False | Other | False | True |
| 66 | False | Geography | False | True |
| 112 | True | Other | False | False |
| 208 | False | Other | False | True |
| 216 | True | Literature | True | True |


This would be the most basic way of measuring the success rate of your classification: How often does the model output match gold-standard data?
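
As a minimal sketch, the agreement rate for the five rows shown in the table above can be computed like this. The two lists are copied from the table's `Literature_LLM` and `Category_Literature` columns; the variable names are my own.

```python
# Values copied from the five rows of the agreement table above.
llm_says_literature = [False, False, True, False, True]
human_says_literature = [False, False, False, False, True]

matches = [
    llm == human
    for llm, human in zip(llm_says_literature, human_says_literature)
]
agreement_rate = sum(matches) / len(matches)
print(f"Agreement rate: {agreement_rate:.0%}")  # 4 of 5 rows match
```

With the DataFrame from the cell above, `agree_df["Human and LLM Agree"].mean()` should give the same figure for whichever rows it contains.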

However, there is a more sophisticated and widely used solution for binary classification: the F-score.

## The F-Score

To understand the F-score, you need to understand two related concepts: precision and recall.

### Precision

This measures the share of items the model labeled `True` that are also `True` in the gold-standard data. Items the model labeled `True` that are `False` in the gold-standard data are called "false positives," and they lower precision.

> Precision answers the question: "How many retrieved items were relevant?"

$Precision = \frac{True \ Positives}{True \ Positives + False \ Positives}$

### Recall

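This measures the share of items that are `True` in the gold-standard data that the model also labeled `True`. Items that are `True` in the gold-standard data but that the model labeled `False` are called "false negatives," and they lower recall.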

> Recall answers the question: "How many relevant items were retrieved?"

$Recall = \frac{True \ Positives}{True \ Positives + False \ Negatives}$

### F-score (aka F1)

The F-score is the harmonic mean of precision and recall. It is high only when both precision and recall are high.

$F_{1}= 2 \times \frac{Precision \times Recall}{Precision + Recall}$
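
To see how the three formulas fit together, here is a small sketch that computes them by hand from counts of true positives, false positives, and false negatives. The labels are made up, the function name is my own, and returning `0.0` when a denominator is zero is a convention chosen for this sketch.

```python
def precision_recall_f1(y_true, y_pred):
    """Compute precision, recall, and F1 for binary labels (1 = positive)."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)  # true positives
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)  # false positives
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)  # false negatives
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    total = precision + recall
    f1 = 2 * precision * recall / total if total else 0.0
    return precision, recall, f1


# Made-up labels: 3 true positives, 1 false positive, 0 false negatives.
y_true = [0, 1, 0, 0, 1, 1]
y_pred = [0, 1, 1, 0, 1, 1]
precision, recall, f1 = precision_recall_f1(y_true, y_pred)
print(precision, recall, round(f1, 4))  # 0.75 1.0 0.8571
```

The single false positive lowers precision to 3/4 while recall stays at 3/3, and the harmonic mean lands between the two.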

## How to calculate

There is an easy way to calculate this score using [sklearn](https://scikit-learn.org/stable/modules/generated/sklearn.metrics.f1_score.html):

In [None]:
from sklearn.metrics import f1_score

# Example with fake data
y_true = [0, 1, 0, 0, 1, 1]
y_pred = [0, 1, 1, 0, 1, 1]
f1_score(y_true, y_pred)

0.8571428571428571

In [None]:
# Because Python stores True as 1 and False as 0, we can directly use the columns:
f1_score(agree_df["Category_Literature"], agree_df["Literature_LLM"])

0.6666666666666666

You can also use the F-score to evaluate multi-class classifications.

We'll stick with binary for now for simplicity's sake.
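
For reference, one common way to extend the F-score to multiple classes is macro-averaging: compute F1 for each class in turn, treating that class as the positive one, then take the unweighted mean. The sketch below does this by hand on made-up labels; the function names are my own. In practice you would likely pass `average="macro"` to scikit-learn's `f1_score` instead.

```python
def f1_for_class(y_true, y_pred, positive):
    """F1 when `positive` is treated as the positive class (one-vs-rest)."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(t == positive and p == positive for t, p in pairs)
    fp = sum(t != positive and p == positive for t, p in pairs)
    fn = sum(t == positive and p != positive for t, p in pairs)
    # 2*TP / (2*TP + FP + FN) is algebraically the same as the harmonic
    # mean of precision and recall.
    denominator = 2 * tp + fp + fn
    return 2 * tp / denominator if denominator else 0.0


def macro_f1(y_true, y_pred):
    """Unweighted mean of the per-class F1 scores."""
    classes = sorted(set(y_true) | set(y_pred))
    return sum(f1_for_class(y_true, y_pred, c) for c in classes) / len(classes)


# Made-up gold and predicted labels (not real course data).
gold = ["Literature", "Literature", "Geography", "Geography", "Other", "Other"]
pred = ["Literature", "Geography", "Geography", "Geography", "Other", "Literature"]
print(round(macro_f1(gold, pred), 4))
```

Macro-averaging weights every class equally, so a rare class counts as much as a common one.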

# Putting it all together

Here is how the things we learned today come together:

1. We have our texts to classify (*Jeopardy!* questions)
2. We created gold-standard (i.e., human-labeled, and, ideally, expert-labeled) classifications for testing.
3. We wrote prompts and set up API calls that output structured classifications as JSON.
4. We can systematically evaluate the quality of our classification using the F-score.

## Next steps

There are two big remaining steps for our classification:

1. Quantifying uncertainty

As you experienced doing classifications by hand, some judgments were easier than others. We can ask the LLM to express its confidence in its judgments numerically.

This is useful because it allows us to sort automatically labeled data by confidence for review. Low-confidence classifications deserve higher priority for manual review (and possible correction) by researchers. (You could also adapt the F-score by penalizing confident but wrong classifications more heavily than low-confidence wrong classifications.)

2. Prompt engineering

We can use the F-score to systematically test and evaluate variations of our `system` and `user` prompts to see which prompts produce the most accurate classifications.
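
A minimal sketch of that comparison, assuming you have already collected each prompt's predictions for the same gold-standard questions. The prompt names, predictions, and helper functions below are all hypothetical; no API calls are made.

```python
def f1(gold, predicted):
    """Binary F1 from two equal-length lists of booleans."""
    pairs = list(zip(gold, predicted))
    tp = sum(g and p for g, p in pairs)
    fp = sum((not g) and p for g, p in pairs)
    fn = sum(g and (not p) for g, p in pairs)
    denominator = 2 * tp + fp + fn
    return 2 * tp / denominator if denominator else 0.0


def rank_prompts(predictions_by_prompt, gold):
    """Return (prompt_name, F1) pairs sorted from best to worst."""
    scores = {name: f1(gold, preds) for name, preds in predictions_by_prompt.items()}
    return sorted(scores.items(), key=lambda item: item[1], reverse=True)


# Made-up gold labels and made-up outputs for two hypothetical prompt variants.
# In practice, each list would come from running a prompt over the gold-standard IDs.
gold = [True, False, False, True, False]
predictions_by_prompt = {
    "prompt_a": [True, True, False, True, True],
    "prompt_b": [True, False, False, True, True],
}

ranking = rank_prompts(predictions_by_prompt, gold)
for name, score in ranking:
    print(f"{name}: F1 = {score:.2f}")
```

Here `prompt_b` ranks first because it produces one fewer false positive than `prompt_a`.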

# Quantifying uncertainty

This one is surprisingly easy. It requires a small modification of the system prompt:

In [31]:
system_prompt = """Determine whether the following Jeopardy question is about Literature.
Express your confidence in your classification as a percentage from 50 to 100, where 50 is guessing and 100 is certain.
Respond in JSON like so:
{"Literature": True,
"Confidence": 95}"""

In [32]:
prompt = make_prompt(data.sample(1))

In [33]:
c = make_completion(system_prompt, prompt)

System prompt: Determine whether the following Jeopardy question is about Literature.
Express your confidence in your classification as a percentage from 50 to 100, where 50 is guessing and 100 is certain.
Respond in JSON like so:
{"Literature": True,
"Confidence": 95}
--------------------------------------------------------------------------------
User prompt: Category: COMPOSERS
Clue: In 1943 American composer William Schuman became the first composer to win one of these prizes
Answer: Pulitzer
--------------------------------------------------------------------------------
Assistant response: {
"Literature": false,
"Confidence": 100
}


We can use confidence scores to prioritize review of low-confidence (i.e., `50%`-`75%`) responses.
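
As a rough sketch of that prioritization, the snippet below filters parsed responses to those at or below a review threshold and sorts them so the least confident come first. The responses and the `REVIEW_THRESHOLD` value of 75 are assumptions for illustration, not real model output.

```python
# Made-up parsed responses keyed by question ID (not real model output).
results = [
    {"ID": 1, "Literature": True, "Confidence": 95},
    {"ID": 2, "Literature": False, "Confidence": 60},
    {"ID": 3, "Literature": True, "Confidence": 75},
    {"ID": 4, "Literature": False, "Confidence": 100},
]

REVIEW_THRESHOLD = 75  # assumption: review anything at or below 75%

# Keep low-confidence rows and put the least confident first.
review_queue = sorted(
    (r for r in results if r["Confidence"] <= REVIEW_THRESHOLD),
    key=lambda r: r["Confidence"],
)
print([r["ID"] for r in review_queue])
```

Questions 2 and 3 land in the queue, with question 2 reviewed first.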

# Exercises

1. Using what we have learned today, try writing a multi-class classification `system` prompt for *Jeopardy!* questions that will output structured JSON with the predefined options we used to create the gold-standard data.
2. Explain how the F-score differs from merely calculating the percentage of the time that the gold-standard data and the classification model agree.
3. Write a paragraph explaining how you could apply these techniques to a different set of texts than *Jeopardy!* questions.

# Next lesson!

[Proceed to next lesson ->](./lesson-3.ipynb)