# Text analysis with AI
A workshop by UMN LATIS and the Libraries.

## What we'll cover in this session
- Understand text classification with LLMs, and how it can be useful for text-based research.
- Interact with the ChatGPT API.
- Structure API calls using different models and prompts.
- Set up classification tasks with ChatGPT.
- Understand and parse API JSON responses.
- Understand risks in using generative AI for classification.

### Install libraries
If you're working from your own machine, you can use `pip install` to make sure you have all of the Python packages you'll need today. The install commands in the cell below are commented out; remove the leading `#` to run them.

If you're working on notebooks.latis.umn.edu, there's no need to install any of these, since they're included in the virtual environment.

In [None]:
### Install Libraries ###

#!pip install --upgrade openai python-dotenv
#!pip install spacy

# This command downloads the medium-sized English language model for spaCy.
# It uses the Python module-running option to run spaCy's download command for the "en_core_web_md" model.
#!python -m spacy download en_core_web_md 

In [None]:
### Import Libraries ###

from openai import OpenAI
from dotenv import load_dotenv

# ChatGPT: website vs. API

The [chat interface on the website](https://chatgpt.com) is the most familiar way of interacting with these models.

But we will be working with the [application programming interface (API)](https://en.wikipedia.org/wiki/API) to automatically send and receive messages from the model using some features that are not accessible via the web.

## OpenAI's API

Many applications and websites offer APIs. For example, many weather apps use [the National Weather Service API](https://www.weather.gov/documentation/services-web-api) to automatically retrieve weather data.

Unlike the National Weather Service, OpenAI charges for the use of its API, which means that calls to the API require a special string called a `key`.

### Getting a key

After installing the Python bindings above (`openai`), you need to get an API key to send requests. The key is a unique identifier that performs a number of functions (including allowing OpenAI to bill you).

For the purposes of this class, I have created a fresh key with a spending limit of `$10` that I will share with the group, which should be more than enough to satisfy all of the requests in this class.

When you want to run your own queries in the future, you will need to register for an account and create an API key.

See [this page of the documentation](https://platform.openai.com/docs/quickstart) for details of how to create your own key.

### Setting the key

You need to include the key with every call to the API.

One way to do this is by setting the `OPENAI_API_KEY=...` variable in a `.env` file in your working directory.

You can also do this by setting a local variable, like so:

In [None]:
OPENAI_API_KEY = ""  # copy-paste the class key here

We're also going to write this to your `.env` so you don't have to repeat the process next time:

In [None]:
with open(".env", "w") as f:
    # the with-block closes the file automatically, so no f.close() is needed
    f.write(f"OPENAI_API_KEY={OPENAI_API_KEY}\n")

The next time you restart this notebook kernel (or open a new notebook), `load_dotenv()` will read the API key from your `.env` file into the environment, where the `openai` library finds it. There is no need to pass the `api_key=` argument to `OpenAI()`.
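Before making a call, it can help to confirm that the key is visible to Python without printing it in full. The sketch below is one way to do that. `describe_key` is a hypothetical helper written for this illustration (it is not part of `openai` or `python-dotenv`), and it assumes the key is stored in the `OPENAI_API_KEY` environment variable once `load_dotenv()` has run.

```python
import os


def describe_key(env=None, name="OPENAI_API_KEY", visible=4):
    """Report whether an API key is set, showing only its last few characters.

    `env` defaults to the real process environment; passing a plain dict
    makes the helper easy to try out without a real key.
    """
    env = os.environ if env is None else env
    key = env.get(name, "").strip()
    if not key:
        return f"{name} is not set"
    # Mask everything except the last `visible` characters.
    return f"{name} is set (ends in ...{key[-visible:]})"


print(describe_key())
```

If this reports that the key is not set, check that `.env` is in your working directory and that `load_dotenv()` has been called.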

## Making your first API call

You installed and imported the `openai` library above, so now you can run the example completion below, which is part of [OpenAI's tutorial](https://platform.openai.com/docs/api-reference/chat/create):

In [None]:
# this will load your saved .env variable
load_dotenv()
client = OpenAI()

In [None]:
completion = client.chat.completions.create(
    model="gpt-4o-mini",
    messages=[
        {
            "role": "system",
            "content": "You are an academic researcher with machine learning expertise.",
        },
        {
            "role": "user",
            "content": "Explain what text classification is in ten or fewer words.",
        },
    ],
)

print(completion.choices[0].message.content)

Let's break down the elements from the code above:
- `client.chat.completions.create()` calls the REST API chat completions endpoint.
- `model="gpt-4o-mini"` - you can choose from a variety of [OpenAI models](https://platform.openai.com/docs/models). We're using the lightweight (and affordable) `gpt-4o-mini` model. To get somewhat more capable responses you could switch to `gpt-4o`.
- `messages` is a list of dictionaries containing the messages sent to the `model`.
  - There are two different values given for `role` in this example: `system` and `user`.
  - `system` refers to the system message given to the LLM, which conditions its responses.
  - `user` refers to the message from the user, i.e. the prompt the model responds to.
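Since every call in this notebook uses the same two-message structure, the list can be built by a small function. This is only a sketch: `build_messages` is a name made up for this example, not something provided by the `openai` library.

```python
def build_messages(system_message, user_message):
    """Return a two-message list: one system message, then one user message."""
    return [
        {"role": "system", "content": system_message},
        {"role": "user", "content": user_message},
    ]


messages = build_messages(
    "You are an academic researcher with machine learning expertise.",
    "Explain what text classification is in ten or fewer words.",
)

# Show each role next to its content.
for message in messages:
    print(f"{message['role']:>6}: {message['content']}")
```

The resulting list can be passed as the `messages=` argument to `client.chat.completions.create()`.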

Note how the output below differs from the output above, even though we only change the `system` message:

In [None]:
system_message = "You are a French tutor. Respond to all prompts in French followed by English in parentheses."
user_message = "Explain what text classification is in ten or fewer words."

In [None]:
completion = client.chat.completions.create(
    model="gpt-4o-mini",
    messages=[
        {
            "role": "system",
            "content": system_message,
        },
        {
            "role": "user",
            "content": user_message,
        },
    ],
)

print(completion.choices[0].message.content)

You can explore the response object by pressing Tab after typing `completion.`.
- `completion.usage` shows you how many tokens you sent and received. OpenAI bills per token, at a rate that depends on the model, so these counts determine what you will be charged. See [OpenAI's pricing page](https://openai.com/api/pricing/) for more info.

In [None]:
completion.usage

- `completion.choices[0]` refers to the first response from the API. In our case we only have one response, but a request can ask for several alternatives (for example, with the `n` parameter), which come back at further indices (e.g., `completion.choices[1]`).
- `completion.choices[0].message.content` has the response content that we're interested in here.

In [None]:
completion.choices[0].message.content
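The token counts in `completion.usage` (`prompt_tokens` and `completion_tokens`) can be turned into a rough cost estimate. The sketch below shows the arithmetic only. `estimate_cost` is a hypothetical helper, and the prices in the example are placeholders, not OpenAI's actual rates, so look up the current per-million-token prices for your model on the pricing page.

```python
def estimate_cost(prompt_tokens, completion_tokens,
                  input_price_per_million, output_price_per_million):
    """Estimate the cost of one call in dollars from token counts and rates."""
    input_cost = prompt_tokens / 1_000_000 * input_price_per_million
    output_cost = completion_tokens / 1_000_000 * output_price_per_million
    return input_cost + output_cost


# Placeholder numbers for illustration only; they are NOT real prices.
example_cost = estimate_cost(
    prompt_tokens=1_000,
    completion_tokens=500,
    input_price_per_million=1.0,
    output_price_per_million=2.0,
)
print(f"Estimated cost: ${example_cost:.6f}")
```

With a real response you would pass `completion.usage.prompt_tokens` and `completion.usage.completion_tokens` as the first two arguments.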

### Exercise
Create a new system message and user message to send to the API.

In [None]:
system_message = ""
user_message = ""

completion = client.chat.completions.create(
    model="gpt-4o-mini",
    max_completion_tokens=150,  # caps the length of the response to keep our API costs down
    messages=[
        {
            "role": "system",
            "content": system_message,
        },
        {
            "role": "user",
            "content": user_message,
        },
    ],
)

print(completion.choices[0].message.content)

### Using LLMs for classification 

We can write a simple system prompt that asks the model to classify text into various categories. First let's create a function for our API call to make it easier to reuse.


In [None]:
def api_call(system_message, user_message):
    completion = client.chat.completions.create(
        model="gpt-4o-mini",
        max_completion_tokens=100,
        messages=[
            {
                "role": "system",
                "content": system_message,
            },
            {
                "role": "user",
                "content": user_message,
            },
        ],
    )
    response = completion.choices[0].message.content
    print(f"Headline: {user_message} \nClassification: {response}")
    print()
    return user_message, response

In [None]:
system_message = """Classes: 
['U.S.', 'World', 'Business', 'Arts', 'Lifestyle', 'Opinion', 'Sports', 'Science', 'Other']
Classify the user input (newspaper headlines) into one of the above classes. 
If the headline doesn't match a category, respond 'Other'."""

user_message = "After a decade, scientists unveil fly brain in stunning detail"

In [None]:
user_message, response = api_call(system_message, user_message)
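The model is not guaranteed to answer with exactly one of the listed classes; it may add quotes, a trailing period, or a label of its own. One way to guard against that is to normalize each reply against the class list from the prompt. This is a minimal sketch, and `normalize_label` is a hypothetical helper written for this example.

```python
CLASSES = ['U.S.', 'World', 'Business', 'Arts', 'Lifestyle',
           'Opinion', 'Sports', 'Science', 'Other']


def _key(text):
    """Lowercase a label and drop surrounding whitespace, quotes, and periods."""
    return text.strip(" \t\r\n'\".").lower()


def normalize_label(response, classes=CLASSES, fallback='Other'):
    """Map a raw model reply onto one of the allowed classes.

    Comparison ignores case, surrounding quotes, and surrounding periods.
    Anything that still doesn't match becomes `fallback`.
    """
    lookup = {_key(name): name for name in classes}
    return lookup.get(_key(response), fallback)


for raw in ["Science", " 'science'. ", "U.S.", "Politics"]:
    print(f"{raw!r} -> {normalize_label(raw)}")
```

Applying this to the `response` returned by `api_call` keeps unexpected labels out of later counts or tables.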

#### Newspaper headlines
Let's import a list of newspaper headlines from the US, [collected on Kaggle](https://www.kaggle.com/datasets/felixludos/babel-briefings).

The dataset is in JSON format, so we'll import Python's `json` module and open the file much as we would a text file. `json.load` converts the JSON data into a Python list of dictionaries, one for each headline "item".

In [None]:
import json

# US headlines from https://www.kaggle.com/datasets/felixludos/babel-briefings?resource=download

with open('data/babel-briefings-v1-us.json') as json_data:
    headlines = json.load(json_data)

Let's take a look at the dictionary for a single item in the headlines list. We want to work with the title of each item, which is accessible via the `'title'` key.

In [None]:
headlines[0]

In [None]:
headlines[0]['title']
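To work with all of the titles at once, a list comprehension can pull them out of the list of dictionaries. The sketch below uses a few made-up records in place of the real `headlines` list, and it assumes (without having checked the dataset) that some items might have an empty or missing title.

```python
# Made-up records standing in for the loaded `headlines` list (illustration only).
sample_headlines = [
    {"title": "Example headline one"},
    {"title": ""},
    {"title": "  Example headline two "},
    {},
]

# Keep only non-empty titles, with surrounding whitespace removed.
titles = [
    item["title"].strip()
    for item in sample_headlines
    if (item.get("title") or "").strip()
]
print(titles)
```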

In [None]:
# let's print out classifications for the first ten items
for headline in headlines[0:10]:
    api_call(system_message, headline['title'])
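Because `api_call` returns the headline and the model's response, the results can be collected and tallied instead of only printed. The sketch below shows the tallying step with `collections.Counter`. The pairs are made up to stand in for real API results, so no API call is needed to run it.

```python
from collections import Counter

# Made-up (headline, label) pairs standing in for real api_call() results.
results = [
    ("Example headline A", "Science"),
    ("Example headline B", "Sports"),
    ("Example headline C", "Science"),
]

# Count how often each label was assigned.
label_counts = Counter(label for _, label in results)
for label, count in label_counts.most_common():
    print(f"{label}: {count}")
```

In the notebook, `results` could be built with `results = [api_call(system_message, h['title']) for h in headlines[0:10]]`.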

### Using LLMs for Named Entity Recognition (NER)
We can use the same technique, with a different system prompt, to ask for named entities (people, places, and other proper nouns) from each headline.

In [None]:
system_message = """For each user input (newspaper headlines), give me a list of:
- organization named entity
- location named entity
- person named entity
Format the output in valid json with the following keys:
- Organizations
- Locations
- Persons
"""

Instead of just printing our results, let's save them to a new Python dictionary. 

In [None]:
headline_ner = {}
for headline in headlines[20:30]:
    # use a separate name so we don't overwrite the loop variable `headline`
    title, response = api_call(system_message, headline['title'])
    headline_ner[title] = response

Some of the JSON is not valid, despite our prompt!

In [None]:
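for k, v in headline_ner.items():
    print(k)
    try:
        print(json.loads(v))
    except json.JSONDecodeError as err:
        # the reply could not be parsed as JSON; report it and keep going
        print(f"Invalid JSON: {err}")
    print()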
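Besides adjusting the prompt, another option is to clean up the reply before parsing it. A common reason for invalid JSON is that the model wraps its answer in a Markdown code fence (three backticks, often followed by the word `json`). The sketch below strips such a fence and returns `None` when the text still can't be parsed. `parse_model_json` is a hypothetical helper, and it only handles this one failure mode.

```python
import json
import re

# Matches an opening fence (optionally followed by "json") at the start of the
# text, or a closing fence at the end of the text.
FENCE_PATTERN = re.compile(r"^\s*`{3}(?:json)?\s*|\s*`{3}\s*$", re.IGNORECASE)


def parse_model_json(text):
    """Parse a model reply as JSON, tolerating a surrounding Markdown code fence.

    Returns the parsed object, or None if the cleaned text is not valid JSON.
    """
    cleaned = FENCE_PATTERN.sub("", text).strip()
    try:
        return json.loads(cleaned)
    except json.JSONDecodeError:
        return None


fence = "`" * 3  # three backticks, built this way to keep the example readable
reply = f'{fence}json\n{{"Organizations": [], "Locations": [], "Persons": []}}\n{fence}'
print(parse_model_json(reply))
```

In the loop above, `parse_model_json(v)` could be used in place of `json.loads(v)`.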

### Exercise: Fix invalid JSON
Some of the JSON that ChatGPT returned isn't valid! Can you edit the system prompt and re-run the API call to pull in valid JSON?

In [None]:
system_message = """For each user input (newspaper headlines), give me a list of:
- organization named entity
- location named entity
- person named entity
Format the output in valid json with the following keys:
- Organizations
- Locations
- Persons
Do not wrap the response in backticks or include the term json in the response.
"""

### Using LLMs for Sentiment Analysis

## spacy-llm wrapper