<img align="left" src="https://ithaka-labs.s3.amazonaws.com/static-files/images/tdm/tdmdocs/tapi-logo-small.png" />

This notebook is free for educational reuse under [Creative Commons CC BY License](https://creativecommons.org/licenses/by/4.0/).

Created by [Erik Fredner](https://fredner.org) for the 2024 Text Analysis Pedagogy Institute, with support from [Constellate](https://constellate.org).

For questions/comments/improvements, email erik@fredner.org<br />
____

# Automated Text Classification Using LLMs

This is lesson 2 of 3 in the educational series on using large language models (LLMs) for text classification. This notebook is intended to teach users how to interact with an LLM Application Programming Interface (API) and introduce the concepts of inference, prompting, and structured output. 

**Skills:** 
* Python
* Text analysis
* Text classification
* LLMs
* JSON
* APIs

**Audience:**
Researchers

**Use case:**
Tutorial

**Difficulty:**
Intermediate. This assumes users are familiar with Python and have been programming for 6+ months. Code makes up a larger part of the notebook and basic concepts related to Python are not explained.

**Completion time:**
90 minutes

**Knowledge Required:** 
* Python basics (variables, flow control, functions, lists, dictionaries)
* `pandas` basics

**Knowledge Recommended:**
* Experience using LLMs (e.g., ChatGPT)

**Learning Objectives:**
After this lesson, learners will be able to:

1. Describe how to evaluate automated LLM classifications.
2. Create data to evaluate LLM classifications.
3. Characterize the [F-score](https://en.wikipedia.org/wiki/F-score).
4. Combine the ideas above to evaluate multiple prompts.

**Research Pipeline:**
1. Play with LLMs if you have not already.
2. Test using a chatbot interface for an LLM (like ChatGPT) to perform relevant classifications for your research.
3. Evaluate initial results.
4. Learn how to interact with an API through this notebook.
5. Modify your initial experiments based on what we cover.

# Required Python Libraries

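* [OpenAI](https://pypi.org/project/openai/) to interact with the OpenAI API for ChatGPT.
* [pandas](https://pypi.org/project/pandas/) to load and manipulate the question data.
* [python-dotenv](https://pypi.org/project/python-dotenv/) to load your API key from an `.env` file.
* [scikit-learn](https://pypi.org/project/scikit-learn/) to calculate the F-score.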

In [35]:
### Import Libraries ###

from openai import OpenAI
import pandas as pd
from IPython.display import clear_output
import random
from dotenv import load_dotenv
import json
import numpy as np
from sklearn.metrics import f1_score

# Required Data

**Data Format:** 
* Comma-separated values (.csv)

**Data Source:**
* 500 randomly sampled *Jeopardy!* questions, including their category, clue, and answer
* Questions transcribed from episodes of the show by archivists at the [*J-Archive!*](https://j-archive.com)
* Questions extracted and posted publicly [on GitHub](https://github.com/amwagner19/jarchive-clues)
* Extraneous columns for the course dropped and IDs reindexed

**Data Quality/Bias:**
* This data reproduces a small random subset of the questions recorded by the [*J-Archive!*](https://j-archive.com) archivists
* *J-Archive!* is a well-regarded fan site, but it has not recorded every clue on every game (e.g., unasked questions)
* Any biases reflected in the form and content of the questions reflect those of the *Jeopardy!* writers

## Download Required Data

The dataset for this class is small enough to distribute with [the git repository for the course](https://github.com/erikfredner/tap-2024), which you can clone.

In [11]:
df = pd.read_csv("data.csv", index_col=0)

In [12]:
df.sample(5)

ID,CATEGORY,CLUE,ANSWER
385,MAKES SENSE--MOVIES EDITION,"In the ""Toy Story"" franchise, the Tyrannosauru...",Rex
114,QUOTES OF VICTORY,Italian foreign minister Ciano was one of thos...,an orphan
188,"SHOOTING ""BB""s",A sheath for a sword,a scabbard
498,HUNGARY?,"Meaning ""famous ruler"", it's the first name of...",Laszlo
170,TRASH TALKING AT THE MEDIEVAL JOUST,The lady whose favor thou sportest will sup wi...,a tournament


# Review of Lesson 1

1. Why classify texts?
   1. Examples from business
   2. Examples from scholarship
2. Good, bad, and ugly of using LLMs for text classification
3. ChatGPT website vs. API
4. Calling the API
5. Model options
6. Model costs
7. JSON mode

# Introduction

- Last time, we talked about text classification.
- But we didn't have any texts to classify.
- Today, we're going to change that with a type of text that non-LLM methods would struggle to classify: *Jeopardy!* questions.

## Why *Jeopardy*?

- I just finished [research](https://fredner.org/jeopardy/) for an essay I am writing about *Jeopardy!* questions and the literary canon.
  - I'm teaching the methods used for that project here.
- Non-LLM methods struggle with short, dense, allusive texts like quiz questions, which makes them a good test of classifications that LLMs can perform and that other methods handle poorly.

# Types of classifications

Classification problems generally fall into one of several categories. Here are some of the most common:

- Binary classification: Classifying texts into one of two categories.
  - e.g., classifying emails as Spam or Not Spam
- Multi-class classification: Classifying texts into one of three or more categories.
  - e.g., classifying newspaper articles as politics, business, arts, etc.
- Multi-label classification: Labeling texts with one or more classifications.
  - e.g., classifying novels with one or more genre labels, such as `['fantasy', 'romance']`
- Hierarchical classification: Classifying texts as part of both classes and subclasses
  - e.g., classifying research papers:
```python
{
    "field": "literary studies",
    "subfields": ["american literature", "nineteenth-century"],
}
```
- Ordinal classification: Classifying a text in a way that ranks or orders it.
  - e.g., attempting to infer star ratings (i.e., scores from 1 to 5) from unstarred movie reviews


## Which are we going to do?

In this brief class, we're going to focus on the simplest category---**binary classification**---with a little bit of **ordinal classification**, too.

# How do you evaluate an LLM's classifications?

- Neither humans nor LLMs classify texts perfectly.
- How well do humans agree with each other?
- How well do the LLM's judgments align with researcher judgments?

The first thing that we need to do is create [**gold-standard data**](https://simmering.dev/blog/gold-data/) that we can use to evaluate the model. In our case, this is going to be the results of human judgments classifying our questions.

In some cases, there might already exist classification labels that you could use (e.g., librarians' categorizations of books).
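
One simple way to approach the question "How well do humans agree with each other?" is to compute, for each question labeled by more than one person, the share of annotator pairs that chose the same label. The sketch below illustrates this with made-up labels. The `annotations` dictionary and the `pairwise_agreement` helper are assumptions for this example, not part of the course data.

```python
from itertools import combinations

# Hypothetical annotations: question ID -> labels from different annotators
annotations = {
    163: ["Literature", "Literature", "Literature"],
    466: ["Geography", "Other", "Geography"],
    328: ["History", "Other"],
}


def pairwise_agreement(labels):
    """Return the fraction of annotator pairs that chose the same label."""
    pairs = list(combinations(labels, 2))
    if not pairs:
        return None  # a single annotator has no one to agree with
    return sum(a == b for a, b in pairs) / len(pairs)


agreement = {qid: pairwise_agreement(labels) for qid, labels in annotations.items()}
print(agreement)
```

Raw agreement does not account for agreement that would happen by chance. Statistics such as Cohen's kappa correct for that, but this simpler measure is often enough for a first look.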

## Creating data to evaluate the classification

We're going to get a sense for the challenge of classifying *Jeopardy* questions by doing it ourselves.

In [13]:
# make sure you have loaded the data
df.sample(5)

ID,CATEGORY,CLUE,ANSWER
87,PEOPLE & PLACES,The majority of this country's people are Bhut...,Bhutan
359,EARLY AMERICA,In 1626 this Dutchman arrived in New York Harb...,Peter Minuit
420,HOW TO MARRY A MILLIONAIRE,"Like Manilow's ""Lola"", <a href=""http://www.j-...",showgirl
461,T-N-C,Instability in the atmosphere that means it's ...,turbulence
128,BRITISH AUTHORS,"Her futuristic novel ""The Last Man"" is conside...",Mary Shelley


- Everyone gets their own sample of questions from our 500 question sample.
- Our samples will overlap, which we want so that we can measure agreement.
- To keep the list of possible labels small, we are going to label questions as belonging to one of five categories that correspond to the most frequent topics in *Jeopardy*:
  - History
  - Geography
  - Literature
  - Science
  - Other
- You can look up information about questions if you're unsure how to classify them.



In [28]:
my_df = df.sample(40).copy()
my_df.reset_index(inplace=True)

In [29]:
def classify_jeopardy_questions(my_df):
    # Initialize a list to store the categorizations
    categorizations = []

    # List of valid categories
    categories = ["History", "Geography", "Literature", "Science", "Other"]

    for index, row in my_df.iterrows():
        # Print the CATEGORY, CLUE, and ANSWER for the current row
        print(f"ID: {row['ID']}")
        print(f"CATEGORY: {row['CATEGORY']}")
        print(f"CLUE: {row['CLUE']}")
        print(f"ANSWER: {row['ANSWER']}")

        # Display category options with corresponding numbers
        print("Please classify the question into one of the following categories:")
        for i, category in enumerate(categories, 1):
            print(f"{i}: {category}")

        # Ask the user for a valid categorization
        while True:
            try:
                category_index = int(
                    input("Enter the number corresponding to the category: ")
                )
                if 1 <= category_index <= len(categories):
                    selected_category = categories[category_index - 1]
                    # Save the categorization along with the row ID
                    categorizations.append(
                        {
                            "ID": row["ID"],
                            "CATEGORY": row["CATEGORY"],
                            "CLUE": row["CLUE"],
                            "ANSWER": row["ANSWER"],
                            "CLASSIFICATION": selected_category,
                        }
                    )
                    break
                else:
                    print("Invalid number. Please try again.")
            except ValueError:
                print("Invalid input. Please enter a number.")

        # Clear the output in the Jupyter notebook
        clear_output(wait=True)

    # Convert the categorizations to a DataFrame for further use if needed
    categorizations_df = pd.DataFrame(categorizations)

    # Return the categorizations DataFrame
    return categorizations_df

# Classification time!

- Running the cell below will ask you to categorize *Jeopardy* questions.
- Occasionally, you may not see the box to enter your classifications.
- If you run into trouble with the script, you may need to interrupt the kernel. You can do that:
    - by clicking the Stop ⬛️ icon in the ribbon above,
    - by pressing `i` twice (`ii`) on your keyboard while in command mode, or
    - by going to the menu bar: `Kernel > Interrupt kernel`.

In [32]:
clear_output()
categorized_df = classify_jeopardy_questions(my_df)

ID: 466
CATEGORY: STATE FLAGS
CLUE: Ship accessory pictured on flag of Wisconsin & Rhode Island, it's also found on Popeye's arm
ANSWER: anchor
Please classify the question into one of the following categories:
1: History
2: Geography
3: Literature
4: Science
5: Other


Enter the number corresponding to the category:  2


In [33]:
# here's what you made
categorized_df

,ID,CATEGORY,CLUE,ANSWER,CLASSIFICATION
0,495,ART & ARTISTS,"Before returning to France in 1923, he designe...",Marc Chagall,Other
1,328,"STARTING ""EN""",It's a letter from the pope to Catholic church...,encyclical,History
2,246,YOU'LL NEED SOME BACKUP,& the Revolution,Prince,Other
3,163,1920s GOOD READS,"A contemporary romantic drama, her ""The Glimps...",Wharton,Literature
4,323,WORLD LITERATURE,"Born Jean-Baptiste Poquelin in 1622, he wrote ...",Moli&egrave;re,Literature
5,109,AROUND THE HOUSE,During summer you can help conserve power by u...,a fan,Science
6,95,"LET THERE BE ""LIGHT""","An old saying goes, ""Many hands make"" this",light work,Other
7,50,THIS MUST BE BELGIAN,It's the Belgian town in the title of the artw...,Waterloo,Geography
8,361,SHAKESPEAREAN ROUND-UP,Duke Senior is banished to the Forest of Arden...,<i>As You Like It</i>,Literature
9,466,STATE FLAGS,Ship accessory pictured on flag of Wisconsin &...,anchor,Geography


In [34]:
# generate a random id to distinguish your data from others'
random_id = random.randint(1, 100000)
categorized_df.to_csv(f"classified_jeopardy_{random_id}.csv", index=False)

Now, we have saved your categorizations to a CSV file in your Constellate workspace.

## Next steps

1. Right-click your classification file (`classified_jeopardy_...csv`), and select `Download`. That will download the file to the `~/Downloads` folder on your computer.
2. Navigate to your `Downloads` on your computer, and find your `.csv` file
3. Upload your `.csv` to [this Dropbox folder](https://www.dropbox.com/request/ryB9Bh9QefqASXRaZfPU) for us to combine our data together:

<https://www.dropbox.com/request/ryB9Bh9QefqASXRaZfPU>

## Discussion of classification experience

- Was this more difficult than you anticipated?
- How did you decide how to categorize a given question?

# Evaluating our classifications

We will evaluate our classifications by majority vote: for each question, the label chosen most often becomes the gold-standard label.

**N.B.:** The code below will only work on my machine. I will add the classifications to the course repository ASAP.

In [24]:
import os
import pandas as pd

PATH = "/Users/erik/Dropbox/File requests/2024 TAPI classifications"
csvs = [f for f in os.listdir(PATH) if f.endswith(".csv")]
# create a stacked dataframe with the csvs
df = pd.concat([pd.read_csv(os.path.join(PATH, f)) for f in csvs], ignore_index=True)

In [28]:
# drop everything except ID and CLASSIFICATION
df = df[["ID", "CLASSIFICATION"]]
df.head()

,ID,CLASSIFICATION
0,495,Other
1,328,History
2,246,Other
3,163,Literature
4,323,Literature


In [27]:
df_vals = df.groupby("ID").value_counts().unstack().fillna(0)
df_vals.head()

ID,Geography,History,Literature,Other,Science
4,0.0,0.0,0.0,1.0,0.0
50,1.0,0.0,0.0,0.0,0.0
77,0.0,1.0,0.0,0.0,0.0
95,0.0,0.0,0.0,1.0,0.0
100,0.0,0.0,0.0,0.0,1.0


For simplicity's sake, I am going to focus on questions where one categorization was the clear "winner" (i.e., exclude ties).

In [29]:
def check_row_ties(row):
    # A tie means more than one category shares the row's highest count
    max_value = row.max()
    max_count = (row == max_value).sum()
    return bool(max_count > 1)


# Apply the function to each row and create a new column for the results
df_vals["Tie?"] = df_vals.apply(check_row_ties, axis=1)

In [30]:
# How many ties did we have?
df_vals["Tie?"].sum()

0

In [33]:
# get the ids for rows without ties
gold_ids = df_vals[~df_vals["Tie?"]].index.tolist()

In [35]:
gold_df = df_vals[df_vals.index.isin(gold_ids)]

In [37]:
# drop the tie column (reassigning avoids modifying a slice of df_vals in place)
gold_df = gold_df.drop(columns="Tie?")

In [40]:
gold_melted = gold_df.reset_index().melt(
    id_vars=["ID"], var_name="CLASSIFICATION", value_name="Value"
)

In [46]:
gold_labels = gold_melted.loc[
    gold_melted.groupby("ID")["Value"].idxmax(), ["ID", "CLASSIFICATION"]
]

In [47]:
gold_labels.set_index("ID", inplace=True)
gold_labels.to_csv("gold_labels.csv")
gold_labels.head()

ID,CLASSIFICATION
4,Other
50,Geography
77,History
95,Other
100,Science
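
The same majority-vote logic can be sketched without `pandas`. This is an illustrative alternative to the cells above, not a replacement for them: the `votes` data and the `majority_label` helper are hypothetical, and the sketch assumes that a question with a tie for first place should be excluded.

```python
from collections import Counter

# Hypothetical votes: question ID -> every label submitted for it
votes = {
    50: ["Geography", "Geography", "History"],
    95: ["Other"],
    328: ["History", "Other"],  # a tie
}


def majority_label(labels):
    """Return the most common label, or None when the top spot is tied."""
    ranked = Counter(labels).most_common()
    if len(ranked) > 1 and ranked[0][1] == ranked[1][1]:
        return None
    return ranked[0][0]


gold = {qid: majority_label(labels) for qid, labels in votes.items()}
# Keep only questions with a clear winner
gold = {qid: label for qid, label in gold.items() if label is not None}
print(gold)
```

Note that a question with a single vote counts as a clear winner here, just as it does in the `pandas` version.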


> In your menu bar, you may run Git / Pull from Remote to download `gold_labels.csv` 
> 
> Note that this will overwrite *all* changes you have made to the files here.

# Using our gold-standard data

- We created human-labeled data indicating correct results for our classifications.
  - (In a real research setting, this would be a more meticulous and expert-driven process. For this class, it's fine.)
- Now we need to test the LLM on these classification tasks.
- We will evaluate its performance using the data we created and a simple statistic called the [F-score](https://en.wikipedia.org/wiki/F-score).

## Getting one basic classification result

- We talked earlier about different types of classification.
- The data we have just created is suitable for either binary or multi-class classification.
- The type of classification we get back from the LLM will be determined by the prompt we give the model.
- Let's start with a simple prompt:

In [36]:
# binary classification prompt
system_prompt = """Determine whether the following Jeopardy question is about Literature.
Respond in JSON like so: {"Literature": True}"""

Because we created multi-class labels, we can treat everything that is not labeled literature as `False`.

> **N.B.** Remember to instruct the model to respond in JSON *even if you activate JSON mode*.
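
A detail worth knowing when reading the replies: JSON spells booleans in lowercase (`true`/`false`), so a valid JSON reply will use that spelling, and Python's `json.loads` converts it to `True`/`False`. The short check below illustrates this with hand-written strings rather than real API output.

```python
import json

# Valid JSON uses lowercase booleans; json.loads maps them to Python bools
parsed = json.loads('{"Literature": true}')
print(parsed)

# A capitalized, Python-style boolean is not valid JSON
try:
    json.loads('{"Literature": True}')
    capitalized_ok = True
except json.JSONDecodeError:
    capitalized_ok = False

print(capitalized_ok)
```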

In [58]:
# load questions
data = pd.read_csv("data.csv")
# load labels
gold_labels = pd.read_csv("gold_labels.csv")

In [45]:
def make_prompt(row):
    # `row` is a one-row DataFrame, so .values[0] pulls out the single value in each column
    prompt = f"""Category: {row['CATEGORY'].values[0]}\nClue: {row['CLUE'].values[0]}\nAnswer: {row['ANSWER'].values[0]}"""
    return prompt

In [46]:
prompt = make_prompt(data.sample(1))
print(prompt)

Category: COMPOSERS
Clue: In 1943 American composer William Schuman became the first composer to win one of these prizes
Answer: Pulitzer


In [47]:
# remember, this uses the API key in your .env file:
# if you didn't run the code in lesson 1, you probably don't have an .env file
load_dotenv()
client = OpenAI()

In [95]:
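def make_completion(
    system_prompt, prompt, print_prompt=True, client=client, model="gpt-4o", json_mode=True
):
    """Send one system + user prompt pair to the chat API and return the completion."""
    # the flag is named json_mode so it does not shadow the imported `json` module
    completion = client.chat.completions.create(
        model=model,
        # request JSON mode unless the caller turns it off
        response_format={"type": "json_object"} if json_mode else None,
        messages=[
            {"role": "system", "content": system_prompt},
            {"role": "user", "content": prompt},
        ],
    )
    # optionally echo the prompts and the model's reply for inspection
    if print_prompt:
        print(f"System prompt: {system_prompt}\n{'-' * 80}")
        print(f"User prompt: {prompt}\n{'-' * 80}")
        print(f"Assistant response: {completion.choices[0].message.content}\n{'*' * 80}")

    return completion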

In [94]:
c = make_completion(system_prompt, prompt)

System prompt: Determine whether the following Jeopardy question is about Literature.
Respond in JSON like so: {"Literature": True}
--------------------------------------------------------------------------------
User prompt: Category: TO BE PRECISE
Clue: Any pigeon can properly be called this, though usually only a small or pretty one is
Answer: a dove
--------------------------------------------------------------------------------
Assistant response: {"Literature": false}
********************************************************************************


Ok, that returns a binary response.

Now we can put these things together and see how text classifications can quickly become data.

We'll start with one row that we know has a gold-standard label:

In [89]:
output = dict()

test_id = gold_labels.sample(1)['ID'].values

In [90]:
row = data[data['ID'].isin(test_id)]
row

,ID,CATEGORY,CLUE,ANSWER
160,160,TO BE PRECISE,"Any pigeon can properly be called this, though...",a dove


In [91]:
prompt = make_prompt(row)
c = make_completion(system_prompt, prompt)

System prompt: Determine whether the following Jeopardy question is about Literature.
Respond in JSON like so: {"Literature": True}
--------------------------------------------------------------------------------
User prompt: Category: TO BE PRECISE
Clue: Any pigeon can properly be called this, though usually only a small or pretty one is
Answer: a dove
--------------------------------------------------------------------------------
Assistant response: {
  "Literature": false
}


In [92]:
output[test_id[0]] = json.loads(c.choices[0].message.content)
output

{160: {'Literature': False}}

Now we can put all of this together to automatically process a batch of questions using the IDs in `gold_labels`:

In [127]:
l = list()

for idx, gold_row in gold_labels.sample(3).iterrows():
    # get row values
    gold_id = gold_row["ID"]
    gold_classification = gold_row["CLASSIFICATION"]
    
    question_row = data[data['ID'] == gold_id]
    prompt = make_prompt(question_row)
    c = make_completion(system_prompt, prompt)

    # make output
    d = dict()
    d["ID"] = gold_id
    d["CLASSIFICATION"] = gold_classification
    d.update(json.loads(c.choices[0].message.content))
    l.append(d)

System prompt: Determine whether the following Jeopardy question is about Literature.
Respond in JSON like so: {"Literature": True}
--------------------------------------------------------------------------------
User prompt: Category: 1920s GOOD READS
Clue: A contemporary romantic drama, her "The Glimpses of the Moon" isn't as well-known as "The Age of Innocence"
Answer: Wharton
--------------------------------------------------------------------------------
Assistant response: {"Literature": true}
********************************************************************************
System prompt: Determine whether the following Jeopardy question is about Literature.
Respond in JSON like so: {"Literature": True}
--------------------------------------------------------------------------------
User prompt: Category: TRAVEL & TOURISM
Clue: The Centers for Disease Control in this city will fax information on health risks abroad to travelers
Answer: Atlanta
-------------------------------------

For demonstration purposes, the loop above only `sample`s 3 rows.

The same loop would work for every question that has a gold-standard label: iterate over all of `gold_labels` instead of a sample.
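
When running the loop over many rows, it can help to validate each reply before storing it, since a malformed or unexpected reply would otherwise raise an error in `json.loads` or produce a missing column. The helper below is a sketch under assumptions: `parse_literature_reply` is a hypothetical name, and the replies are hand-written stand-ins for `c.choices[0].message.content`.

```python
import json


def parse_literature_reply(reply_text):
    """Return the boolean under "Literature", or None if the reply is unusable."""
    try:
        parsed = json.loads(reply_text)
    except json.JSONDecodeError:
        return None
    if not isinstance(parsed, dict):
        return None
    value = parsed.get("Literature")
    # Only accept real booleans, not strings like "yes"
    return value if isinstance(value, bool) else None


# Hand-written stand-ins for model replies
replies = ['{"Literature": true}', '{"Literature": "yes"}', "not json"]
results = [parse_literature_reply(r) for r in replies]
print(results)
```

Rows that come back as `None` could be retried or set aside for manual review.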

In [131]:
output = pd.DataFrame(l)

In [132]:
output.columns = ["ID", "GOLD_CLASSIFICATION", "LLM_IS_LITERATURE"]
output.set_index("ID", inplace=True)
output

ID,GOLD_CLASSIFICATION,LLM_IS_LITERATURE
163,Literature,True
229,Geography,False
77,History,False


## Comparing LLM classifications to human classifications

The most basic comparison asks: how often do humans and the LLM agree on the binary classification (Literature vs. Not Literature)?

In [133]:
output["GOLD_IS_LITERATURE"] = output["GOLD_CLASSIFICATION"] == "Literature"

ID,GOLD_CLASSIFICATION,LLM_IS_LITERATURE,GOLD_IS_LITERATURE
163,Literature,True,True
229,Geography,False,False
77,History,False,False


In [135]:
output["GOLD_LLM_AGREE"] = (output["GOLD_IS_LITERATURE"] == output["LLM_IS_LITERATURE"])

In [136]:
output

ID,GOLD_CLASSIFICATION,LLM_IS_LITERATURE,GOLD_IS_LITERATURE,GOLD_LLM_AGREE
163,Literature,True,True,True
229,Geography,False,False,True
77,History,False,False,True


This would be the most basic way of measuring the success rate of your classification: How often does the model output match gold-standard data?
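
This agreement rate is usually called accuracy. With the table above, `output["GOLD_LLM_AGREE"].mean()` would give it directly, since `True` counts as 1. The sketch below shows the same arithmetic in plain Python on made-up labels (not our results), including one disagreement.

```python
# Made-up labels for five questions: is each one about Literature?
gold_is_literature = [True, False, False, True, False]
llm_is_literature = [True, False, True, True, False]

# Mark the positions where the two lists agree
agreements = [g == p for g, p in zip(gold_is_literature, llm_is_literature)]

# Accuracy is the share of positions that agree (4 of 5 here)
accuracy = sum(agreements) / len(agreements)
print(accuracy)
```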

However, there is a more sophisticated and widely used solution for binary classification: the F-score.

## The F-Score

To understand the F-score, you need to understand two related concepts: precision and recall.

### Precision

Precision is the share of items the model labeled `True` that are also `True` in the gold-standard data. Items the model labeled `True` that are `False` in the gold-standard data are known as "false positives."

> Precision answers the question: "How many retrieved items were relevant?"

$Precision = \frac{True \ Positives}{True \ Positives + False \ Positives}$

### Recall

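Recall is the share of items that are `True` in the gold-standard data that the model also labeled `True`. Items that are `True` in the gold-standard data but that the model labeled `False` are known as "false negatives."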

> Recall answers the question: "How many relevant items were retrieved?"

$Recall = \frac{True \ Positives}{True \ Positives + False \ Negatives}$

### F-score (aka F1)

The F score is the harmonic mean of precision and recall.

$F_{1}= 2 \times \frac{Precision \times Recall}{Precision + Recall}$

## How to calculate

There is an easy way to calculate this score using [sklearn](https://scikit-learn.org/stable/modules/generated/sklearn.metrics.f1_score.html):

In [137]:
# Example with fake data
y_true = [0, 1, 0, 0, 1, 1]
y_pred = [0, 1, 1, 0, 1, 1]
f1_score(y_true, y_pred)

0.8571428571428571
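
To connect this result to the formulas above, the following sketch recomputes precision, recall, and F1 by hand for the same fake data using only the standard library. The variable names are illustrative.

```python
y_true = [0, 1, 0, 0, 1, 1]
y_pred = [0, 1, 1, 0, 1, 1]

pairs = list(zip(y_true, y_pred))
true_positives = sum(t == 1 and p == 1 for t, p in pairs)   # 3
false_positives = sum(t == 0 and p == 1 for t, p in pairs)  # 1
false_negatives = sum(t == 1 and p == 0 for t, p in pairs)  # 0

precision = true_positives / (true_positives + false_positives)  # 3 / 4
recall = true_positives / (true_positives + false_negatives)     # 3 / 3
f1 = 2 * precision * recall / (precision + recall)
print(round(f1, 4))  # 0.8571
```

The single false positive lowers precision to 0.75 while recall stays at 1.0, and the harmonic mean of the two lands at about 0.857.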

In [139]:
# Because Python treats True as 1 and False as 0, we can pass the boolean columns directly:
f1_score(output["GOLD_IS_LITERATURE"], output["LLM_IS_LITERATURE"])

1.0

You can also use the F-score to evaluate multi-class classifications.

We'll stick with binary for now for simplicity's sake.
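
For multi-class labels, one common approach is to compute F1 separately for each class and then average the results (a "macro" average). In sklearn this corresponds to passing `average="macro"` to `f1_score`. The sketch below does the calculation by hand on made-up labels so that each step is visible. Scoring a class with no true positives as 0 is an assumption of this sketch.

```python
def f1_for_class(y_true, y_pred, label):
    """F1 for one class, treating that class as the positive label."""
    tp = sum(t == label and p == label for t, p in zip(y_true, y_pred))
    fp = sum(t != label and p == label for t, p in zip(y_true, y_pred))
    fn = sum(t == label and p != label for t, p in zip(y_true, y_pred))
    if tp == 0:
        return 0.0  # assumption: a class with no true positives scores 0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


# Made-up gold labels and predictions
y_true = ["Literature", "History", "Other", "Literature", "Science"]
y_pred = ["Literature", "History", "Literature", "Literature", "Science"]

labels = sorted(set(y_true) | set(y_pred))
per_class = {label: f1_for_class(y_true, y_pred, label) for label in labels}

# Macro average: every class counts equally, however rare it is
macro_f1 = sum(per_class.values()) / len(per_class)
print(per_class)
print(round(macro_f1, 3))  # 0.7
```

Because every class counts equally, the one missed "Other" question pulls the macro average down sharply even though four of five predictions are correct.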

# Putting it all together

Here is how the things we learned today will come together:

1. We have our texts to classify (*Jeopardy!* questions, though you could use any texts in any format, as long as you associate an ID with the text.)
2. We created gold-standard (i.e., human-labeled, and, ideally, expert-labeled) classifications for testing.
3. We wrote prompts and set up API calls that output structured classifications as JSON.
4. We import that JSON into a more familiar data structure (a `pandas` dataframe or another two-dimensional data structure).
5. We can systematically evaluate the quality of our classifications using the F score.

## Next steps

There are two big remaining steps we will do for our classification. The steps above are necessary; these are nice to have, and they demonstrate some principles of interacting with LLMs.

1. Quantifying uncertainty

As you experienced doing classifications by hand, some judgments were easier to make than others. We can ask the LLM to express its confidence in its judgments numerically.

This is useful because it allows us to sort automatically labeled data by confidence for review. Low confidence classifications deserve higher priority for manual review (and possible correction) by researchers. (You could also do some clever massaging of the F score by penalizing confident but wrong classifications, while lessening the penalty for low confidence wrong classifications.)

2. Prompt engineering

We can use the F score to systematically test and evaluate variations of our `system` and `user` prompts to see which prompts produce the most accurate classifications.
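
A sketch of what that comparison might look like: collect each prompt variant's predictions for the same gold-labeled questions, score each set with F1, and keep the best. The predictions below are invented for illustration (no API calls are made), and `binary_f1` and the prompt names are hypothetical.

```python
def binary_f1(y_true, y_pred):
    """F1 for boolean labels, returning 0.0 when there are no true positives."""
    tp = sum(t and p for t, p in zip(y_true, y_pred))
    fp = sum((not t) and p for t, p in zip(y_true, y_pred))
    fn = sum(t and (not p) for t, p in zip(y_true, y_pred))
    if tp == 0:
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


# Gold-standard answers for six questions: is it about Literature?
gold = [True, True, False, False, True, False]

# Invented predictions from two hypothetical prompt variants
predictions = {
    "prompt_a": [True, False, False, False, True, False],
    "prompt_b": [True, True, True, False, True, False],
}

scores = {name: binary_f1(gold, preds) for name, preds in predictions.items()}
best_prompt = max(scores, key=scores.get)
print(scores)
print(best_prompt)  # prompt_b
```

In practice each list of predictions would come from running the classification loop once per prompt over the same set of gold-labeled questions.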

# Quantifying uncertainty

There is a very easy way and a slightly more complex way to quantify our uncertainty in these classifications.

## The very easy way

This requires a small modification of the system prompt:

In [140]:
system_prompt = """Determine whether the following Jeopardy question is about Literature.
Express your confidence in your classification as a percentage from 50 to 100, where 50 is guessing and 100 is certain.
Respond in JSON like so:
{"Literature": True,
"Confidence": 95}"""

In [141]:
prompt = make_prompt(data.sample(1))

In [142]:
c = make_completion(system_prompt, prompt)

System prompt: Determine whether the following Jeopardy question is about Literature.
Express your confidence in your classification as a percentage from 50 to 100, where 50 is guessing and 100 is certain.
Respond in JSON like so:
{"Literature": True,
"Confidence": 95}
--------------------------------------------------------------------------------
User prompt: Category: WRITERS OF COLONIAL AMERICA
Clue: Diarist William Byrd portrayed life in this colony where he owned a large James River plantation
Answer: Virginia
--------------------------------------------------------------------------------
Assistant response: {"Literature": true,
"Confidence": 95}
********************************************************************************
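
Once each reply includes a `Confidence` value, the parsed results can be sorted so that the least confident classifications are reviewed first. The replies below are hand-written examples in the format requested by the prompt, not real model output, and the range check simply mirrors the 50 to 100 scale given in the system prompt.

```python
import json

# Hand-written stand-ins for model replies, keyed by question ID
replies = {
    163: '{"Literature": true, "Confidence": 99}',
    466: '{"Literature": false, "Confidence": 60}',
    328: '{"Literature": false, "Confidence": 85}',
}

rows = []
for question_id, reply_text in replies.items():
    parsed = json.loads(reply_text)
    confidence = parsed["Confidence"]
    # The prompt asked for a value from 50 to 100; flag anything else
    if not 50 <= confidence <= 100:
        raise ValueError(f"Unexpected confidence for {question_id}: {confidence}")
    rows.append({"ID": question_id, **parsed})

# Lowest confidence first: these deserve manual review soonest
review_queue = sorted(rows, key=lambda row: row["Confidence"])
print([row["ID"] for row in review_queue])  # [466, 328, 163]
```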


## The not-very-hard way

One feature of the API we did not discuss last time is called `logprobs`. [Here](https://platform.openai.com/docs/api-reference/chat/create) is the documentation.

[OpenAI recommends using `logprobs` to assess the confidence of text classifications.](https://cookbook.openai.com/examples/using_logprobs)

When `logprobs` (which refers to the log probabilities of each *generated* token) is switched on, the API returns the log probability of each token it generated, like so:

In [146]:
system_prompt = """Determine whether the following Jeopardy question is about Literature.
Respond in JSON like so:
{"Literature": True}"""

In [147]:
prompt

'Category: WRITERS OF COLONIAL AMERICA\nClue: Diarist William Byrd portrayed life in this colony where he owned a large James River plantation\nAnswer: Virginia'

In [None]:
completion = client.chat.completions.create(
  model="gpt-4o",
  logprobs=True, # new
  top_logprobs=2, # new: ask the API to return the two most probable tokens for each token generated
  messages=[
    {"role": "system", "content": system_prompt},
    {"role": "user", "content": prompt}
  ]
)

In [166]:
print(completion.choices[0].logprobs.to_json())

{
  "content": [
    {
      "token": "{\"",
      "bytes": [
        123,
        34
      ],
      "logprob": -0.4132006,
      "top_logprobs": [
        {
          "token": "{\"",
          "bytes": [
            123,
            34
          ],
          "logprob": -0.4132006
        },
        {
          "token": "{\n",
          "bytes": [
            123,
            10
          ],
          "logprob": -1.1632006
        }
      ]
    },
    {
      "token": "Liter",
      "bytes": [
        76,
        105,
        116,
        101,
        114
      ],
      "logprob": 0.0,
      "top_logprobs": [
        {
          "token": "Liter",
          "bytes": [
            76,
            105,
            116,
            101,
            114
          ],
          "logprob": 0.0
        },
        {
          "token": "Lit",
          "bytes": [
            76,
            105,
            116
          ],
          "logprob": -17.25
        }
      ]
    },
    {
      "token":

The closer a `logprob` is to 0, the higher the probability the model assigned to that token, and so the more confident it is in its response. As a reminder, the natural log of 1 is 0:

In [153]:
import math

math.log(1)

0.0

In [154]:
# i.e., a 95% probability
math.log(0.95)

-0.05129329438755058

In the example used above, the `logprob` of the token `"True"` was:

In [167]:
logprob = -0.026302502

print(f"The model was {round((math.exp(logprob) * 100), 2)}% confident in this classification.")

The model was 97.4% confident in this classification.


So, the model's stated confidence in its classification (95% in the example above) may differ from the probability implied by the `logprob` of the classification token (97.4%).

For the purposes of this class, we are going to use the model's stated confidence in the response as it is simpler to extract and manipulate.

But if you are doing a project where precise probabilities matter, `logprob` is the way to go.
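
As a sketch of how one might pull that probability out automatically, the function below scans a list of token entries shaped like the `content` list printed above and converts the classification token's `logprob` to a percentage. The entries here are a hand-built, abbreviated stand-in (the `true` entry reuses the `logprob` quoted earlier). The helper name is hypothetical, and the assumption that the boolean arrives as a single `true` or `false` token may not hold for every model or tokenizer.

```python
import math


def classification_confidence(token_entries):
    """Return (token, percent) for the first true/false token, else None."""
    for entry in token_entries:
        token = entry["token"].strip().lower()
        if token in ("true", "false"):
            return token, round(math.exp(entry["logprob"]) * 100, 2)
    return None


# Abbreviated, hand-built stand-in for completion.choices[0].logprobs.content
token_entries = [
    {"token": '{"', "logprob": -0.4132006},
    {"token": "Liter", "logprob": 0.0},
    {"token": " true", "logprob": -0.026302502},
]

result = classification_confidence(token_entries)
print(result)  # ('true', 97.4)
```

With a real response, the entries are objects rather than dictionaries, so you would likely read `entry.token` and `entry.logprob` instead of using square brackets.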

# Exercises

1. Using what we have learned today, try writing a multi-class classification `system` prompt for *Jeopardy* questions that will output structured JSON with the predefined options we used to create the gold standard data.
2. How is the F score different from merely calculating the percentage of the time that the gold-standard data and the classification model agree?
3. Write a paragraph explaining how you could apply these techniques to a set of texts other than *Jeopardy* questions.