<img align="left" src="https://ithaka-labs.s3.amazonaws.com/static-files/images/tdm/tdmdocs/tapi-logo-small.png" />

This notebook is free for educational reuse under [Creative Commons CC BY License](https://creativecommons.org/licenses/by/4.0/).

Created by [Erik Fredner](https://fredner.org) for the 2024 Text Analysis Pedagogy Institute, with support from [Constellate](https://constellate.org).

For questions/comments/improvements, email erik@fredner.org<br />
____

# Automated Text Classification Using LLMs

This is lesson 3 of 3 in the educational series on using large language models (LLMs) for text classification. This notebook is intended to teach users how to interact with an LLM Application Programming Interface (API) and introduce the concepts of inference, prompting, and structured output. 

**Skills:** 
* Python
* Text analysis
* Text classification
* LLMs
* JSON
* APIs

**Audience:**
Researchers

**Use case:**
Tutorial

**Difficulty:**
Intermediate

**Completion time:**
90 minutes

**Knowledge Required:** 
* Python basics (variables, flow control, functions, lists, dictionaries)
* `pandas` basics

**Knowledge Recommended:**
* Experience using LLMs (e.g., ChatGPT)

**Learning Objectives:**
After this lesson, learners will be able to:

1. Discuss garbage in, garbage out (GIGO).
2. Define prompt engineering and review common techniques.
3. Use precision, recall, and F-scores to systematically evaluate prompts.
4. Use classification results to extract structured data from classified texts.

# Required Python Libraries

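* [OpenAI](https://pypi.org/project/openai/) to interact with the OpenAI API for ChatGPT.
* [pandas](https://pypi.org/project/pandas/) to load and merge the data and labels.
* [scikit-learn](https://pypi.org/project/scikit-learn/) to compute precision, recall, and F1 scores.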

## Install Required Libraries

In [None]:
### Install Libraries ###

%pip install --upgrade openai tiktoken python-dotenv pexpect==4.9.0

In [None]:
### Import Libraries ###

from openai import OpenAI
from sklearn.metrics import f1_score, precision_score, recall_score
import json
import pandas as pd

## Set OpenAI key

Since we're pulling in a fresh copy of the repo...

In [None]:
OPENAI_API_KEY = ""  # copy-paste the class key here

In [None]:
client = OpenAI(api_key=OPENAI_API_KEY)

- Once you get your own API key, it's good practice to store the key in an `.env` file that is **not** tracked by `git`.
- [OpenAI's page on API key safety](https://help.openai.com/en/articles/5112595-best-practices-for-api-key-safety) is useful.
- Remember that if your key is exposed, anyone with the key will be able to bill your account up to whatever limit is set!
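
A minimal sketch of that practice, assuming the key is stored under the conventional variable name `OPENAI_API_KEY` and that `python-dotenv` (installed above) is available to read a local `.env` file. The helper name `load_api_key` is illustrative and not part of any library.

```python
import os


def load_api_key(var_name="OPENAI_API_KEY"):
    """Return the API key from the environment, or raise a clear error."""
    try:
        # python-dotenv copies KEY=value pairs from a local .env file into os.environ
        from dotenv import load_dotenv

        load_dotenv()
    except ImportError:
        # without python-dotenv, fall back to variables already set in the shell
        pass

    key = os.environ.get(var_name, "")
    if not key:
        raise RuntimeError(f"{var_name} is not set; add it to your .env file")
    return key
```

With a helper like this, `client = OpenAI(api_key=load_api_key())` could replace the copy-pasted key. Listing `.env` in `.gitignore` keeps the file out of version control.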

# Review

## Lesson 1

- Why classify texts?
- Text classification with LLMs: good, bad, ugly
- ChatGPT's website is not the same as the API
- How and why to use the API
- How and why to request structured output using JSON mode

## Lesson 2

- *Jeopardy!* questions are a good example of texts that LLMs classify well where other methods struggle
- Evaluating the quality of classification requires gold-standard (i.e., definitely human- and ideally expert-created) data that has been validated.
- We created "gold-standard" data as a group (i.e., most labels for a given question wins)
- Measuring gold-LLM agreement with precision, recall, and F-scores
- Adding confidence scores to LLM output to quantify uncertainty and sort for review

# Lesson 3 introduction

- This final lesson combines everything that we have learned to do prompt engineering.
- Prompt engineering will help us evaluate our input prompts.
- The best prompt will get us the best classification results *for our purposes*.

## GIGO: Garbage in, garbage out

- We will be measuring the quality of our various prompts with reference to our **gold-standard data**.
- If our gold-standard data is not actually golden, we will be creating a [**garbage in, garbage out**](https://en.wikipedia.org/wiki/Garbage_in,_garbage_out) or GIGO system.

## Avoiding GIGO

- We will be treating the labels we created last time as our gold-standard data.
- But, in a real research project using a similar system to create gold-standard labels (i.e., having multiple experts label examples, and using the labels that a majority of labellers chose), you would need to test for [**inter-rater reliability**](https://en.wikipedia.org/wiki/Inter-rater_reliability).
- Other considerations for improving gold-standard label quality:
  - Creating guidelines and definitions that labellers agree upon
  - Discussing the labeling process and edge cases, and revising the guidelines in response
  - Doing blind peer review of labels
  - Testing your gold-standard labels against other gold labels (if they exist)

For our purposes in this class, remember that we are measuring the LLM's outputs against gold labels that are a **rough draft**.
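
As a rough illustration of the inter-rater reliability check mentioned above, here is a sketch of Cohen's kappa, one common agreement statistic for exactly two raters. The labels are hypothetical. With more than two raters, as in our class, a different statistic (e.g., Fleiss' kappa or Krippendorff's alpha) would be needed.

```python
from collections import Counter


def cohens_kappa(labels_a, labels_b):
    """Cohen's kappa for two raters who labelled the same items."""
    if len(labels_a) != len(labels_b) or not labels_a:
        raise ValueError("Both raters must label the same non-empty set of items")
    n = len(labels_a)
    # observed agreement: share of items where the raters chose the same label
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    # expected agreement: chance that both pick the same label independently
    counts_a, counts_b = Counter(labels_a), Counter(labels_b)
    expected = sum(counts_a[label] * counts_b[label] for label in counts_a) / (n * n)
    if expected == 1:
        return 1.0  # both raters used a single identical label throughout
    return (observed - expected) / (1 - expected)


# hypothetical labels from two raters for six questions
rater_1 = ["Literature", "Other", "Other", "Literature", "Other", "Other"]
rater_2 = ["Literature", "Other", "Literature", "Literature", "Other", "Other"]
kappa = cohens_kappa(rater_1, rater_2)
print(round(kappa, 3))
```

A kappa of 1 means perfect agreement and 0 means agreement no better than chance. `sklearn.metrics.cohen_kappa_score` offers a tested implementation of the same statistic.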

# What is prompt engineering?

Prompt engineering is the process of writing and refining instructions that make LLMs perform tasks effectively.

The [Wikipedia article](https://en.wikipedia.org/wiki/Prompt_engineering) is good!

## What are important prompt engineering considerations?

- For some tasks, prompt engineering may only provide marginal improvements
  - No guarantee that there exists a "good" prompt for a particular classification task
  - There are some things that current LLMs can't do
    - e.g., multi-modal LLMs can transcribe text from images well, but they don't reliably distinguish formatting (italics, bold)
- Consider the relationship between total number of prompt tokens and output quality
  - Input tokens are cheap but not free
  - `system` prompts are evaluated for every API call
  - If you can get as good or better results with fewer tokens, that is always preferable
- Clever prompting changes model behavior in predictable and unpredictable ways
  - For example, there are communities online dedicated to "jailbreaking" LLMs, which means providing them with prompts that either trick or instruct the models to ignore built-in constraints on their behavior (e.g., to not explain how to do illegal or dangerous things).

## What are common prompt engineering techniques?

- Roleplay
  - e.g., in the `system` message: "You are a research assistant..."
- Provide sample output. For example:

```text
Instructions:
Answer the reading comprehension question.

Example:
"Lily walks Mitzi three times per day."
Question: What kind of pet is Lily most likely to have?
----
Answer: Dog.
```

- [Chain-of-thought](https://arxiv.org/abs/2201.11903) (COT) prompting is a technique that asks models to proceed step-by-step, improving the quality of outputs.
  - Appending "Work step by step. Show your work." to other prompts can achieve this result.
  - One downside (if using this technique for API calls) is that COT responses generate (far) more tokens, because the model writes out its "thought-process."
- Asking either the LLM you are using or another LLM to rewrite your prompt
  - Models can write good prompts for themselves, assuming instructions are clear.
- Weird ones, like [promising the LLMs various incentives](https://minimaxir.com/2024/02/chatgpt-tips-analysis/)
  - e.g., "You are a research assistant...If you do a good job, you will receive a $200 tip."
  - (Yes, this has really been shown to change responses. No, you don't have to pay promised incentives.)
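
Several of these techniques can be combined in one prompt. The sketch below assembles a `system` prompt from an optional role, sample output, and a chain-of-thought suffix. The function `build_system_prompt` and its parameters are hypothetical names chosen for illustration.

```python
def build_system_prompt(instructions, role=None, examples=None, step_by_step=False):
    """Assemble a system prompt from optional role, example, and chain-of-thought parts."""
    parts = []
    if role:
        parts.append(f"You are {role}.")
    parts.append(instructions)
    for example_input, example_output in examples or []:
        parts.append(f"Example:\n{example_input}\n----\n{example_output}")
    if step_by_step:
        parts.append("Work step by step. Show your work.")
    return "\n\n".join(parts)


prompt = build_system_prompt(
    instructions="Answer the reading comprehension question.",
    role="a research assistant",
    examples=[
        (
            '"Lily walks Mitzi three times per day."\n'
            "Question: What kind of pet is Lily most likely to have?",
            "Answer: Dog.",
        )
    ],
    step_by_step=True,
)
print(prompt)
```

Building prompts from parts like this makes it easier to switch one technique on or off and measure its effect in isolation.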

## Prompt engineering for text classification

- We have already done some of this by revising our earlier prompts.
- Now, we are going to incorporate what we have learned to test our prompts systematically.

## Testing prompts

Now, we're going to write a script that will take a redesigned prompt as input, test it against a sample of questions, and output precision, recall, and F1 scores.

We'll try several different prompts and sort the results by F1 score.

At the end of last class, we had the following `system_prompt`:

In [None]:
system_prompt = """Determine whether the following Jeopardy question is about Literature.
Express your confidence in your classification as a percentage from 50 to 100, where 50 is guessing and 100 is certain.
Respond in JSON like so:
{"Literature": true,
"Confidence": 95}"""

As a reminder, we are asking the model for two discrete data points:

- a binary classification result: Is this *Jeopardy* question about literature?
- an expression of confidence in that classification between 50 and 100
  - (50 rather than 0 is the lower bound because you have a 50/50 chance guessing randomly.)
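
Because the model may not always return exactly these two data points, it can help to validate each reply before using it. The sketch below checks for a boolean `Literature` value and a numeric `Confidence` between 50 and 100. The function name `parse_classification` is hypothetical, and rejecting out-of-range values (rather than clipping them) is one design choice among several.

```python
import json


def parse_classification(raw_text):
    """Parse the model's JSON reply; return None if it does not match the expected shape."""
    try:
        data = json.loads(raw_text)
    except json.JSONDecodeError:
        return None
    if not isinstance(data, dict):
        return None
    label = data.get("Literature")
    confidence = data.get("Confidence")
    # the label must be a real boolean, not a string such as "true"
    if not isinstance(label, bool):
        return None
    # bool is a subclass of int in Python, so exclude it explicitly
    if isinstance(confidence, bool) or not isinstance(confidence, (int, float)):
        return None
    if not 50 <= confidence <= 100:
        return None
    return {"Literature": label, "Confidence": confidence}


print(parse_classification('{"Literature": true, "Confidence": 95}'))
print(parse_classification('{"Literature": true, "Confidence": 20}'))
```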

In [None]:
data = pd.read_csv("data.csv")
labels = pd.read_csv("gold_labels.csv")

In [None]:
# merge the labels
df = data.merge(labels, left_on="ID", right_on="ID")
df.sample(5)

# Making sample data

- As we're testing binary classification on literature vs. not literature, we want the distribution of the sample data to include a mix of expected `True` and `False` values.
- We will be working with a sample that is **too small** on purpose.
  - Why? This is to demonstrate *process*, and the size of the sample could just be increased while using the same code.
  - (Also, we would have to wait a long time for the results.)

## How big a sample is big enough?

- This depends on different factors, but here are some considerations:
  - What is the primary measure (precision, recall, F1) you will be evaluating?
  - How good a score would you consider "good enough" on that metric?
  - Based on other methods (e.g., other text classification approaches), how well do you expect to do?
- You can use formulae to [determine the recommended sample size](https://en.wikipedia.org/wiki/Sample_size_determination)
  - For simple random samples, there are [online calculators](https://www.abs.gov.au/websitedbs/D3310114.nsf/home/Sample+Size+Calculator).

For example, I was working with about `530,000` *Jeopardy!* questions, about `20%` of which were about literature. The calculation linked above yields a recommended sample size of `246`.
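
For readers curious about the arithmetic, here is a sketch using Cochran's formula with a finite population correction. It assumes a 95% confidence level (`z = 1.96`) and a 5% margin of error, neither of which is stated above. Under those assumptions the result agrees with the figure of `246`, though the linked calculator may use a different method internally.

```python
import math


def recommended_sample_size(population, proportion, z=1.96, margin_of_error=0.05):
    """Cochran's sample-size formula with a finite population correction."""
    # sample size for an effectively infinite population
    n0 = (z**2) * proportion * (1 - proportion) / margin_of_error**2
    # shrink slightly to account for the finite population
    n = n0 / (1 + (n0 - 1) / population)
    return math.ceil(n)


n = recommended_sample_size(population=530_000, proportion=0.20)
print(n)  # 246
```

Tightening the margin of error increases the recommended sample quickly, because it appears squared in the denominator.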

In [None]:
# make sample questions (again, *way* too small. just for demonstration.)
# Jeopardy as a whole is about 20% literature questions:
lit_sample = df[df["CLASSIFICATION"] == "Literature"].sample(2)
# not literature:
non_lit_sample = df[df["CLASSIFICATION"] != "Literature"].sample(8)
sample_df = pd.concat([lit_sample, non_lit_sample]).sort_values("ID")

In [None]:
# quick check to see if the sampled questions seem to be classified well
sample_df[sample_df["CLASSIFICATION"] == "Literature"]

To compare apples to apples, we will test **different prompts** on the same `sample_df`.

**Your `sample_df` will be different than mine!**

The code below will work for `sample_df`s of any size.

# Testing one prompt

Here's our existing prompt:

In [None]:
print(system_prompt)

We are going to write a series of functions that will be combined to evaluate one `system_prompt` against a given `sample_df`:

In [None]:
# same as last time:


def make_prompt(row):
    prompt = (
        f"Category: {row['CATEGORY']}\nClue: {row['CLUE']}\nAnswer: {row['ANSWER']}"
    )
    return prompt

In [None]:
# same as last time, except switched to gpt-3.5-turbo as default to reduce testing cost


def make_completion(
    system_prompt,
    prompt,
    print_prompt=True,
    client=client,
    model="gpt-3.5-turbo",
    json=True,
):
    completion = client.chat.completions.create(
        model=model,
        response_format={"type": "json_object"} if json else None,
        messages=[
            {"role": "system", "content": system_prompt},
            {"role": "user", "content": prompt},
        ],
    )
    if print_prompt:
        print(f"System prompt: {system_prompt}\n{'-' * 80}")
        print(f"User prompt: {prompt}\n{'-' * 80}")
        print(f"Assistant response: {completion.choices[0].message.content}")

    return completion

Now we're going to iterate through the rows of `sample_df` to get the completions:

In [None]:
def get_sample_completions(sample_df, system_prompt=system_prompt):
    completions = []
    for i, row in sample_df.iterrows():
        prompt = make_prompt(row)
        completion = make_completion(system_prompt, prompt, print_prompt=False)
        # output:
        d = {
            "ID": row["ID"],
            "completion": completion,
        }
        completions.append(d)
    return completions

In [None]:
# NOTE: commented out because this line makes multiple API calls
# completions = get_sample_completions(sample_df)

In [None]:
def load_completion_json(completion):
    try:
        j = json.loads(completion.choices[0].message.content)
        return j
    except json.JSONDecodeError:
        print("Error decoding JSON:")
        print(completion.choices[0].message.content)
        return None

In [None]:
def get_completions_classifications(completions):
    classifications = []
    for completion in completions:
        j = load_completion_json(completion["completion"])
        d = {
            "ID": completion["ID"],
            # .get() returns None rather than raising KeyError if a key is missing
            "LLM_LITERATURE": j.get("Literature") if j else None,
            "LLM_LITERATURE_CONFIDENCE": j.get("Confidence") if j else None,
        }
        classifications.append(d)
    return classifications

In [None]:
def merge_classifications(sample_df, completions):
    completions_df = pd.DataFrame(get_completions_classifications(completions))
    merged = sample_df.merge(completions_df, on="ID")
    return merged

In [None]:
merged_df = merge_classifications(sample_df, completions)

In [None]:
# let's take a peek at low confidence results:
merged_df.sort_values("LLM_LITERATURE_CONFIDENCE")

- Remember that one argument in favor of confidence scores is that they can be used to prioritize human review of LLM classifications.
- This is true when we are testing like this, and especially true when we retrieve results from the complete data set.
- Spot-checking randomly makes less sense when the model reports different degrees of confidence in its results.
- Researchers may also observe patterns in low-confidence results, which could influence prompt design.
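
One way to turn this into a review queue is sketched below. It works on a list of dictionaries shaped like the output of `get_completions_classifications()`. The function name `review_queue` and the threshold of `80` are hypothetical choices, not recommendations.

```python
def review_queue(classifications, threshold=80):
    """Return classifications below a confidence threshold, least confident first."""
    flagged = [
        c
        for c in classifications
        # treat a missing confidence (e.g., unparseable JSON) as needing review
        if c["LLM_LITERATURE_CONFIDENCE"] is None
        or c["LLM_LITERATURE_CONFIDENCE"] < threshold
    ]
    return sorted(
        flagged,
        key=lambda c: (
            c["LLM_LITERATURE_CONFIDENCE"] is not None,  # missing values sort first
            c["LLM_LITERATURE_CONFIDENCE"] or 0,
        ),
    )


# hypothetical classifications, for illustration only
example = [
    {"ID": 1, "LLM_LITERATURE": True, "LLM_LITERATURE_CONFIDENCE": 95},
    {"ID": 2, "LLM_LITERATURE": False, "LLM_LITERATURE_CONFIDENCE": 60},
    {"ID": 3, "LLM_LITERATURE": None, "LLM_LITERATURE_CONFIDENCE": None},
    {"ID": 4, "LLM_LITERATURE": True, "LLM_LITERATURE_CONFIDENCE": 75},
]
print([c["ID"] for c in review_queue(example)])  # [3, 2, 4]
```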

In [None]:
def add_gold_classification(merged_df):
    merged_df["GOLD_LITERATURE"] = merged_df["CLASSIFICATION"] == "Literature"
    return merged_df

In [None]:
merged_df = add_gold_classification(merged_df)

In [None]:
merged_df.sample(5)

In [None]:
def get_f(df):
    y_true = df["GOLD_LITERATURE"].values
    y_pred = df["LLM_LITERATURE"].values

    # get f score
    f1 = f1_score(y_true, y_pred, average="binary")

    # get precision
    precision = precision_score(y_true, y_pred, average="binary")

    # get recall
    recall = recall_score(y_true, y_pred, average="binary")

    # output
    d = {
        "precision": precision,
        "recall": recall,
        "f1": f1,
    }
    return d

In [None]:
get_f(merged_df)

A reminder about how to interpret these results:

- Precision: Out of all the items that the model identified as `True`, how many were really `True`?
  - False positives go in the denominator.
- Recall: Out of all of the really `True` items, how many did the model identify correctly?
  - False negatives go in the denominator.
- F1: Harmonic mean of precision and recall

**Remember: we have far too few texts in this sample for these numbers to be meaningful in this case!**

But, run on a sufficiently large sample (see above), they might meaningfully differentiate between prompts.
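
To make the definitions above concrete, here is a sketch that computes the same three metrics by counting true positives, false positives, and false negatives directly. The labels are hypothetical. In practice, the `sklearn` functions used in `get_f()` are the better choice.

```python
def precision_recall_f1(y_true, y_pred):
    """Compute precision, recall, and F1 for boolean labels by counting outcomes."""
    tp = sum(t and p for t, p in zip(y_true, y_pred))  # true positives
    fp = sum((not t) and p for t, p in zip(y_true, y_pred))  # false positives
    fn = sum(t and (not p) for t, p in zip(y_true, y_pred))  # false negatives
    # return 0.0 when a denominator would be zero
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    f1 = (
        2 * precision * recall / (precision + recall) if (precision + recall) else 0.0
    )
    return {"precision": precision, "recall": recall, "f1": f1}


# hypothetical sample: 2 literature questions out of 10
gold = [True, True, False, False, False, False, False, False, False, False]
llm = [True, False, True, False, False, False, False, False, False, False]
scores = precision_recall_f1(gold, llm)
print(scores)  # {'precision': 0.5, 'recall': 0.5, 'f1': 0.5}
```

Here the model found one of the two literature questions (recall of 0.5), and one of its two `True` labels was wrong (precision of 0.5).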

## Putting all the functions above together

In [None]:
def evaluate_system_prompt(df, system_prompt, system_prompt_name):
    # pass the prompt under test explicitly; otherwise the default prompt is used
    completions = get_sample_completions(df, system_prompt=system_prompt)
    merged_df = merge_classifications(df, completions)
    merged_df = add_gold_classification(merged_df)
    scores = get_f(merged_df)  # named `scores` to avoid shadowing the built-in eval()

    # output:
    d = {
        "system_prompt": system_prompt,
        "system_prompt_name": system_prompt_name,
        "precision": scores["precision"],
        "recall": scores["recall"],
        "f1": scores["f1"],
    }

    return d

In [None]:
# reminding ourselves what we're testing:
print(system_prompt)

In [None]:
# NOTE: lines below are commented out as they will re-run all of the api calls
# example usage:
# result = evaluate_system_prompt(sample_df, system_prompt, system_prompt_name="default")
# result

## Summary thus far

- We wrote a series of functions that take a `sample_df`, `system_prompt`, and a name for that prompt as input.
- Then, the functions get `completions` from the API, organize the resulting data, and return metrics evaluating the performance of that `system_prompt`.

> Note that the functions above are not entirely generalizable; they have hard-coded values that are specific to these data sets. But they can be modified to work on other kinds of texts (e.g., `.txt` files) and with other variables or classification types.

### Why did we wrap all this into one function?

- Because we are going to test **multiple** prompts now.
- Our prompt engineering process will involve balancing precision, recall, F score (or another factor). 
  - We also need to consider that in relation to the costs of running the classification.
- The goal is to maximize the metric(s) we care about while minimizing cost.
- (For instance, I have pre-emptively switched the above code to use `gpt-3.5-turbo`, which was a tenth of the cost of `gpt-4o`, the newest model at the time of writing, since there may be many people running multiple classifications simultaneously.)


# Prompt engineering exercise

Now that we have a function, `evaluate_system_prompt()`, that will test one prompt, we need to write some more prompts!

We're going to use a little `class` to save prompts for testing.

(If you are unfamiliar with writing Python classes, [this page of the documentation is useful](https://docs.python.org/3/tutorial/classes.html).)

In [None]:
class PromptManager:
    def __init__(self):
        self.prompts = []
        self.next_id = 1

    def add_prompt(self, name, prompt):
        new_prompt = {
            "ID": self.next_id,
            "NAME": name,  # this is for you to remind yourself what distinguishes this prompt from others
            "PROMPT": prompt,
        }
        self.prompts.append(new_prompt)
        self.next_id += 1

    def get_prompts(self):
        return self.prompts

In [None]:
# example:
prompt_manager = PromptManager()

prompt_manager.add_prompt(name="default", prompt=system_prompt)

In [None]:
prompt_manager.get_prompts()

## Asking LLMs to rewrite your prompt

- I asked ChatGPT to shorten the prompt above.
- Asking LLMs to rewrite a prompt is a common and surprisingly effective prompt engineering strategy.
- I'm going to save the result as a new prompt in the `PromptManager()`.

My request to GPT:

```text
You are a prompt engineer. Revise the prompt below to minimize the number of tokens in the prompt while keeping all of the same features:

"'Determine whether the following Jeopardy question is about Literature.
Express your confidence in your classification as a percentage from 50 to 100, where 50 is guessing and 100 is certain.
Respond in JSON like so:
{"Literature": true,
"Confidence": 95}'"
```

What it wrote:

```text
Is this Jeopardy question about Literature?
Give your confidence (50-100%) as JSON:
{"Literature": true,
"Confidence": 95}
```

In [None]:
gpt_prompt = """Is this Jeopardy question about Literature?\nGive your confidence (50-100%) as JSON:\n{"Literature": true,\n"Confidence": 95}"""

In [None]:
prompt_manager.add_prompt(name="gpt shorten default", prompt=gpt_prompt)

In [None]:
prompt_manager.get_prompts()

> Note that the testing system we're making here assumes that the `system_prompt` will have the biggest impact on the quality of the responses. In this case, the `system_prompt` is the same, and every `user` `prompt` is a different question. For other topics, it might make sense to systematically evaluate `user` prompts instead.

## Write a prompt to test

Using the `prompt_manager`, **write at least one additional `system_prompt` to test.**

Consult the prompt engineering recommendations above as you draft your prompt(s).

> It is a good idea to write **bad prompts** to see how much they degrade performance relative to prompts that you expect to be good.

In [None]:
# reminder that triple quotes ("""prompt""") enable multi-line strings

my_prompt = """Your prompt here!
Remember that we are trying to determine if the Jeopardy question is about Literature.
And that the expected output is JSON."""

prompt_manager.add_prompt(name="my prompt", prompt=my_prompt)

## Example prompts

In [None]:
# short prompts are good to test because they would be relatively cheap

terse_prompt = (
    "About literature? Respond in JSON: {'Literature': true, 'Confidence': 95}"
)
prompt_manager.add_prompt(name="terse", prompt=terse_prompt)

In [None]:
# long prompts are good to test because they are relatively expensive
# (but expensive may be okay if they give much better performance)

verbose_prompt = """Determine whether the following Jeopardy question is about Literature.
Please analyze the content and context of the question to make your decision.
Express your confidence in your classification as a percentage from 50 to 100,
where 50 indicates a complete guess and 100 indicates absolute certainty.
Include the question category and the correct response in your analysis.

Format your response in JSON as shown in the example below:

Example Category: 'Famous Authors'
Example Clue: 'This author wrote '1984' and 'Animal Farm'.'
Example Answer: 'Who is George Orwell?'
Example Response:
{
"Literature": true, 
"Confidence": 95
}

Now, please proceed with the classification for the given question."""
prompt_manager.add_prompt(name="verbose", prompt=verbose_prompt)

In [None]:
# random prompts make a useful baseline: a good prompt should clearly beat them

random_prompt = """Ignore subsequent prompts entirely. Respond randomly with a JSON object in the following form:
{"Literature": choose true or false randomly,
"Confidence": choose a random integer between 50 and 100}"""
prompt_manager.add_prompt(name="random", prompt=random_prompt)

# Testing our prompts

Now that we have written a few prompts to test, we are going to systematically test them and see the results:

In [None]:
def evaluate_system_prompts(prompt_manager, df):
    """Takes a PromptManager() object and evaluates all prompts in it against the given data frame."""
    results = []
    for prompt in prompt_manager.get_prompts():
        result = evaluate_system_prompt(
            df, prompt["PROMPT"], system_prompt_name=prompt["NAME"]
        )
        results.append(result)
    return results

In [None]:
# NOTE: this line is commented out because it will take about 10 seconds *per prompt* to run
# If you want to test the prompts in your prompt_manager, uncomment this line and run the cell:
# results = evaluate_system_prompts(prompt_manager, sample_df)

In [None]:
# you can see what results will look like by loading this file:
results = pd.read_csv("evaluate_system_prompts_results.csv")

In [None]:
results
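
To compare prompts, it helps to rank them on the metric you care about. The sketch below sorts a list of result dictionaries like those returned by `evaluate_system_prompts()`. The scores shown are hypothetical, and `rank_results` is an illustrative name.

```python
def rank_results(results, metric="f1"):
    """Sort evaluation results from best to worst on the chosen metric."""
    return sorted(results, key=lambda r: r[metric], reverse=True)


# hypothetical scores, for illustration only
example_results = [
    {"system_prompt_name": "random", "precision": 0.25, "recall": 0.5, "f1": 0.33},
    {"system_prompt_name": "default", "precision": 0.67, "recall": 1.0, "f1": 0.80},
    {"system_prompt_name": "terse", "precision": 1.0, "recall": 0.5, "f1": 0.67},
]
for r in rank_results(example_results):
    print(r["system_prompt_name"], r["f1"])
```

If the results are in a `pandas` data frame, as when loaded from the CSV above, `results.sort_values("f1", ascending=False)` does the same job.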

# Evaluation

**Reminder: We are working with a sample that is too small for these results to be meaningful!**

For certain classification tasks, it may be preferable to prioritize **one measure over another**.

Let's remind ourselves one last time about the distinction between precision, recall, and [the F-score](https://en.wikipedia.org/wiki/F-score): 

- Precision answers this question: "How many items labeled `True` were really `True`?"
- Recall answers this question: "How many really `True` items were labeled `True`?"
- F1 is the harmonic mean of precision and recall.

## When to prioritize each metric?

### Precision

If the cost of a false positive is high, maximize precision.

Spam emails are a good text classification example: Labeling a message from a legitimate sender as spam is bad because it makes it much more likely that someone will miss that email. Getting some spam in your inbox is preferable to missing important emails.

### Recall

If the cost of a false negative is high, maximize recall.

Detecting hate speech on social media is a good text classification example: Failing to identify an instance of hate speech as hate speech (false negative) might cause harm. Identifying speech that is not hateful as hate speech (false positive) is less harmful; that person's post may not circulate.

**If you are looking for needles in a haystack,** it might also make good sense to prioritize recall to make sure that you don't miss examples.

### F score

Prioritize the F score when you want to balance precision and recall. (If you're unsure or don't care, choose the F score.)
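
F1 weights precision and recall equally. If you want a single number that still leans toward one of them, the more general F-beta score is an option: a `beta` above 1 favors recall, and a `beta` below 1 favors precision. This is a sketch of the standard formula with hypothetical inputs. `sklearn.metrics.fbeta_score` computes the same score from labels.

```python
def f_beta(precision, recall, beta=1.0):
    """Weighted harmonic mean of precision and recall; beta > 1 favors recall."""
    denominator = beta**2 * precision + recall
    if denominator == 0:
        return 0.0
    return (1 + beta**2) * precision * recall / denominator


# hypothetical scores: a prompt with perfect recall but modest precision
print(round(f_beta(0.5, 1.0, beta=1), 3))  # F1
print(round(f_beta(0.5, 1.0, beta=2), 3))  # F2 weights recall more heavily
```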

# Running the best prompt on your complete data set

- Using the prompt evaluation system above, we have tested several possible prompts.
- Based on our research goals, we optimized for a particular metric.
  - In the case of my work on *Jeopardy!* questions, **recall** was the most important for reasons that will become obvious momentarily.

## Estimating costs *before* running on the whole data set

- Especially if you have a large set of texts to evaluate, don't forget to estimate your total costs before running.
- You can do so using functions from `lesson_1.ipynb`.
- See the discussion of `tiktoken` to count tokens
- And see the `calculate_cost` function
  - Note that OpenAI's pricing may have changed!
  - The prices in `calculate_cost` were accurate as of 2024-07-11.
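
As a rough companion to `calculate_cost`, here is a self-contained sketch of the arithmetic. Every number in the example is a placeholder: the token counts are guesses, and the per-million-token prices must be replaced with current values from OpenAI's pricing page. The function `estimate_cost` is hypothetical and is not the function from `lesson_1.ipynb`.

```python
def estimate_cost(
    n_texts,
    system_prompt_tokens,
    avg_user_tokens,
    avg_output_tokens,
    input_price_per_million,
    output_price_per_million,
):
    """Estimate the total cost in dollars of classifying n_texts, one API call each."""
    # the system prompt is sent with every call, so it counts n_texts times
    input_tokens = n_texts * (system_prompt_tokens + avg_user_tokens)
    output_tokens = n_texts * avg_output_tokens
    return (
        input_tokens * input_price_per_million
        + output_tokens * output_price_per_million
    ) / 1_000_000


# placeholder values only; substitute your own token counts and current prices
cost = estimate_cost(
    n_texts=530_000,
    system_prompt_tokens=60,
    avg_user_tokens=50,
    avg_output_tokens=15,
    input_price_per_million=0.50,
    output_price_per_million=1.50,
)
print(f"${cost:,.2f}")
```

Because the `system` prompt is repeated in every call, shortening it reduces cost in proportion to the number of texts.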

# Doing things with our best classifications

- This workshop is about how to automatically classify texts using LLMs. 
- At this point, we have completed that process for a binary classification task.
- We could approach additional tasks (e.g., multi-class) with our gold labels, but the approach is quite similar to what we have already gone through.

## What can researchers do with these classifications once they have them?

- Use the classifications to identify a subset of texts to study directly.
  - e.g., Read all of the *Jeopardy!* questions the model identified as being about literature.
  - (A lot of scholarship is looking for needles in haystacks, after all.)
- Use the classification data as evidence:
  - e.g., On average, approximately `17%` of questions asked each year on *Jeopardy!* are about literature.
  - And it turns out that this proportion has been quite stable between 1985 and 2023.
  - That's a finding in itself.
- Use the classifications as an **intermediate step** before more data-gathering.
  - e.g., for a data extraction task using large language models

This last point merits a brief example.

## Data extraction

- We have classified some questions as being about literature.
- Now, we can prompt the model to extract specific features from the questions we have classified.
- If the model can do this, why not do this task directly and skip the classification step entirely?
  - That would make sense if *most of the texts in your collection have data you want to extract*.
  - Filtering will likely reduce costs and processing time, though.

Here's an example of how this works:

In [None]:
extraction_prompt = """The following Jeopardy questions are about literature.
Identify any authors, texts, and/or literary terms referenced in the questions.
List texts and authors if they are directly mentioned, quoted, or alluded to in the question.

Example:
Category: Literature
Clue: This novel by one literary William is named after a line from another literary William's "Scottish play"
Answer: What is "The Sound and the Fury?"

Respond in JSON like so:
{
    "Authors": ["William Faulkner (1897-1962)", "William Shakespeare (c. 1564-1616)"],
    "Texts": ["The Sound and the Fury (1929)", "Macbeth (c. 1606)"]
}"""

- You will note that there are *way* more assumptions in this prompt than in the binary classification step.
- Evaluating this kind of task is trickier than binary classification. You need more complex evaluation data.
- But LLMs can perform this type of data extraction task well:

In [None]:
sample_prompt = """Category: World Literature
Clue: It says, "'O Poet... I beg you, that I may flee this evil & worse evils, to lead me... that I may see the gateway of Saint Peter'"
Answer: Dante's Inferno"""

In [None]:
c = make_completion(extraction_prompt, sample_prompt, model="gpt-4o", print_prompt=True)

Note that the model provides information *not* directly mentioned in the question (e.g., `"Dante Alighieri (c. 1265-1321)"`, not just `"Dante"` as given in the question).

LLMs can be effective for these kinds of information extraction and normalization tasks, especially on a curated set of similar texts.
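
Extraction replies are less predictable than binary classifications, so it is worth normalizing them before analysis. The sketch below parses a reply into a dictionary of lists, assuming the `Authors` and `Texts` keys requested in `extraction_prompt`. The function name `parse_extraction` and the sample reply are hypothetical.

```python
import json


def parse_extraction(raw_text, keys=("Authors", "Texts")):
    """Parse an extraction reply into a dict of lists, tolerating missing keys."""
    try:
        data = json.loads(raw_text)
    except json.JSONDecodeError:
        return None
    if not isinstance(data, dict):
        return None
    result = {}
    for key in keys:
        value = data.get(key, [])
        if isinstance(value, str):
            value = [value]  # wrap a lone string so every field is a list
        # anything that is still not a list is discarded as malformed
        result[key] = value if isinstance(value, list) else []
    return result


# hypothetical reply, for illustration only
reply = '{"Authors": ["Dante Alighieri (c. 1265-1321)"], "Texts": "Inferno"}'
print(parse_extraction(reply))
```

A consistent shape makes it simpler to load many replies into a single `pandas` data frame later.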

# Exercise

These exercises are really more like an ordered list of the approach to research we have discussed throughout this class:

1. Identify a different set of texts that you would like to classify using LLMs.
2. Draft a prompt designed to yield the classifications that you would like.
3. Test that prompt in a chatbot interface.
4. Revise as necessary.
5. Once you have a prompt that appears to work, organize your texts into a format suitable for automation (e.g., `txt` files, a `pandas` dataframe, etc.)
6. Identify or create gold-standard classification data for your texts.
7. Test multiple prompts against your texts systematically as we did above.
8. Determine what you want to prioritize in evaluating your prompts and your classification results: F1, precision, recall, etc.
9. Revise your prompts as necessary to obtain satisfactory scores.
10. Classify your texts using your best prompt(s).
11. Do something with them!