# Text analysis with AI
A workshop by UMN LATIS and the Libraries.

## What we'll cover in this session
- Understand text classification with LLMs, and how it can be useful for text-based research.
- Interact with the ChatGPT API.
- Structure API calls using different models and prompts.
- Set up classification tasks with ChatGPT.
- Understand and parse API JSON responses.
- Understand risks in using generative AI for classification.

### Install libraries
If you're working from your own machine, you can use `pip install` to make sure you have all of the Python packages you'll need today.

If you're working on notebooks.latis.umn.edu, there's no need to install any of these, since they're included in the virtual environment.

In [None]:
### Install Libraries ###

#!pip install --upgrade openai python-dotenv
#!pip install spacy
#!pip install spacy-llm

# This command downloads the medium-sized English language model for spaCy.
# It uses the Python module-running option to run spaCy's download command for the "en_core_web_md" model.
#!python -m spacy download en_core_web_md 

In [None]:
### Import Libraries ###

from openai import OpenAI
from dotenv import load_dotenv

# ChatGPT: website vs. API

The [chat interface on the website](https://chatgpt.com) is the most familiar way of interacting with these models.

But we will be working with the [application programming interface (API)](https://en.wikipedia.org/wiki/API) to automatically send and receive messages from the model using some features that are not accessible via the web.

## OpenAI's API

Many applications and websites offer APIs. For example, nearly every weather app uses [the National Weather Service API](https://www.weather.gov/documentation/services-web-api) to automatically retrieve weather data.

Unlike the National Weather Service, OpenAI charges for the use of its API, which means that calls to the API require a special string called a `key`.

### Getting a key

After installing the Python bindings above (`openai`), you need to get an API key to send requests. The key is a unique identifier that performs a number of functions (including allowing OpenAI to bill you).

For the purposes of this class, I have created a fresh key with a spending limit of `$10` that I will share with the group, which should be more than enough to satisfy all of the requests in this class.

When you want to run your own queries in the future, you will need to register for an account and create an API key.

See [this page of the documentation](https://platform.openai.com/docs/quickstart) for details of how to create your own key.

### Setting the key

You need to include the key with every call to the API.

One way to do this is by setting the `OPENAI_API_KEY=...` variable in a `.env` file in your working directory.

You can also do this by setting a local variable, like so:

In [None]:
OPENAI_API_KEY = ""  # copy-paste the class key here

We're also going to write this to your `.env` so you don't have to repeat the process next time:

In [None]:
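# Note: mode "w" overwrites any existing .env file in this directory.
with open(".env", "w") as f:
    f.write(f"OPENAI_API_KEY={OPENAI_API_KEY}\n")  # the with block closes the file for us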

Next time you restart this notebook kernel (or open a new notebook), `load_dotenv()` will load the key from your `.env` file into the environment, where the `openai` library looks for it. There's no need to pass the `api_key=` argument to `OpenAI()`.
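If an API call later fails with an authentication error, it helps to confirm that the key actually made it into the environment. Here is a minimal sketch of such a check; `describe_key` is a hypothetical helper name (not part of `openai` or `python-dotenv`), and it only reports the key's length and last four characters rather than printing the whole key.

```python
import os

def describe_key(env=None, name="OPENAI_API_KEY"):
    """Report whether the key is set, without revealing the full value."""
    # Default to the real process environment; a plain dict can be passed for testing.
    env = os.environ if env is None else env
    key = env.get(name, "")
    if not key:
        return f"{name} is not set"
    return f"{name} is set ({len(key)} characters, ends in ...{key[-4:]})"

print(describe_key())
```

Run this after `load_dotenv()`. If it reports that the key is not set, check that your `.env` file is in the working directory.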

## Making your first API call

You installed and imported the `openai` library above, so now you can run the example completion below, which is part of [OpenAI's tutorial](https://platform.openai.com/docs/api-reference/chat/create):

In [None]:
# this will load your saved .env variable
load_dotenv()
client = OpenAI()

In [None]:
completion = client.chat.completions.create(
    model="gpt-4o-mini",
    messages=[
        {
            "role": "system",
            "content": "You are an expert in qualitative data analysis. Your task is to analyze open-ended survey responses and categorize them into positive, negative, or neutral sentiments.",
        },
        {
            "role": "user",
            "content": "Survey Response: I feel overwhelmed by the workload but am excited about the learning opportunities.",
        },
    ],
)

print(completion.choices[0].message.content)

Let's break down the elements from the code above:
- `client.chat.completions.create()` calls the REST API chat completions endpoint.
- `model="gpt-4o-mini"` - you can choose from a variety of [ChatGPT models](https://platform.openai.com/docs/models). We're using the lightweight (affordable) `gpt-4o-mini` model. To get slightly more intelligent responses you could switch to `gpt-4o`.
- `messages` is a list of dictionaries containing the messages sent to the `model`.
  - There are two different values given for `role` in this example: `system` and `user`.
  - `system` refers to the system message given to the LLM that conditions its responses.
  - `user` refers to the message we want the model to respond to (here, the survey response).

Note how the output below differs from the output above when we change only the `system` message:

In [None]:
system_message = "You are an HR employee. Your task is to analyze open-ended survey responses and assess how employees are feeling."
user_message = "Survey Response: I feel overwhelmed by the workload but am excited about the learning opportunities."

In [None]:
completion = client.chat.completions.create(
    model="gpt-4o-mini",
    messages=[
        {
            "role": "system",
            "content": system_message,
        },
        {
            "role": "user",
            "content": user_message,
        },
    ],
)

print(completion.choices[0].message.content)

You can explore the response by hitting tab after `completion.`. 
- `completion.usage` shows you how many tokens you sent and received. You are charged per token, at a rate that depends on the model. See [OpenAI's pricing page](https://openai.com/api/pricing/) for more info.

In [None]:
completion.usage
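To get a feel for what a call costs, we can combine the token counts with per-token prices. The sketch below is only an illustration: `estimate_cost` is a hypothetical helper, and the prices passed in are placeholders, not OpenAI's actual rates. Look up the current numbers on the pricing page before relying on it.

```python
def estimate_cost(prompt_tokens, completion_tokens,
                  input_price_per_million, output_price_per_million):
    """Estimate the cost of one call in dollars from token counts and prices."""
    input_cost = prompt_tokens * input_price_per_million
    output_cost = completion_tokens * output_price_per_million
    return (input_cost + output_cost) / 1_000_000

# Placeholder numbers for illustration only (not real prices or real usage).
cost = estimate_cost(prompt_tokens=1000, completion_tokens=500,
                     input_price_per_million=1.0, output_price_per_million=2.0)
print(f"${cost:.4f}")
```

With a real response, the two token counts are available as `completion.usage.prompt_tokens` and `completion.usage.completion_tokens`.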

- `completion.choices[0]` refers to the first response from the API. In our case we only have one response, but a request can ask for several alternatives (with the `n` parameter), which are then returned at different indices (e.g., `completion.choices[1]`).
- `completion.choices[0].message.content` has the response content that we're interested in here.

In [None]:
completion.choices[0].message.content

### Exercise
Create a new system message and user message to send to the API.

In [None]:
system_message = ""
user_message = ""

completion = client.chat.completions.create(
    model="gpt-4o-mini",
    max_completion_tokens = 150, # this sets a maximum length of the response to keep our API costs down
    messages=[
        {
            "role": "system",
            "content": system_message,
        },
        {
            "role": "user",
            "content": user_message,
        },
    ],
)

print(completion.choices[0].message.content)

### Temperature

Let's see how we can adjust the creativity and randomness of ChatGPT's response using the `temperature` and `top_p` parameters. 
- Temperature: We can turn the temperature down to get more deterministic responses, or turn it up to get more creative ones.
- Top-p: Also known as "nucleus sampling," `top_p` limits the next words considered to the most probable ones whose combined probability adds up to `top_p`. Lower `top_p` values reduce creativity, while a `top_p` of 1 (the default) keeps 100% of probable next words.

We can copy and paste the code above as a starting place, and then add a new system and user prompt, along with our temperature parameter.

See how the response changes when you adjust the `temperature` and `top_p` values between 0 and 1. 

In [None]:
system_message = "Provide one short innovative idea to address survey responses about public needs."
user_message = "Survey Response: I worry about global warming."

completion = client.chat.completions.create(
    model="gpt-4o",
    max_completion_tokens = 150, # this sets a maximum length of the response to keep our API costs down
    temperature=.9,
    top_p = .9,
    messages=[
        {
            "role": "system",
            "content": system_message,
        },
        {
            "role": "user",
            "content": user_message,
        },
    ],
)

print(completion.choices[0].message.content)
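To build some intuition for what `temperature` and `top_p` do, here is a toy sketch that applies both to a made-up list of scores for three candidate next words. This is a simplified illustration of the general technique, not OpenAI's actual implementation. The function names and the example scores are our own, and the sketch assumes a temperature greater than 0.

```python
import math

def apply_temperature(logits, temperature):
    """Turn raw scores into probabilities; low temperature sharpens them."""
    scaled = [x / temperature for x in logits]
    highest = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(x - highest) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]

def top_p_filter(probs, top_p):
    """Keep the most probable options until their combined mass reaches top_p."""
    ranked = sorted(enumerate(probs), key=lambda pair: pair[1], reverse=True)
    kept, cumulative = [], 0.0
    for index, p in ranked:
        kept.append((index, p))
        cumulative += p
        if cumulative >= top_p:
            break
    total = sum(p for _, p in kept)
    # Renormalize so the surviving options sum to 1 again.
    return {index: p / total for index, p in kept}

toy_logits = [2.0, 1.0, 0.1]  # made-up scores for three candidate words
print(apply_temperature(toy_logits, 0.2))  # the top word dominates
print(apply_temperature(toy_logits, 1.0))  # probabilities are more spread out
print(top_p_filter([0.5, 0.3, 0.2], 0.7))  # the least likely word is dropped
```

Lowering either value concentrates the choice on the most likely words, which is why low settings produce more repeatable output.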

### Using LLMs for classification 

We can write a simple system prompt to ask the model to classify text into various categories. First let's create a function for our API call to make it easier to reuse.


In [None]:
def api_call(system_message, user_message):
    completion = client.chat.completions.create(
        model="gpt-4o-mini",
        max_completion_tokens = 100,
        temperature=.2,
        messages=[
            {
                "role": "system",
                "content": system_message,
            },
            {
                "role": "user",
                "content": user_message,
            },
        ],
    )
    response = completion.choices[0].message.content
    print(f"Headline: {user_message} \nClassification: {response}")
    print()
    return user_message, response

We can use a triple-quoted string - three double quotes on each side - to write multi-line strings in Python.

In [None]:
system_message = """Classes: 
['U.S.', 'World', 'Business', 'Arts', 'Lifestyle', 'Opinion', 'Sports', 'Science', 'Other']
Classify the user input (newspaper headlines) into one of the above classes. 
If the headline doesn't match a category, respond 'Other'."""

user_message = "After a decade, scientists unveil fly brain in stunning detail"

In [None]:
user_message, response = api_call(system_message, user_message)

#### Newspaper headlines
Let's import a list of newspaper headlines from the US, [collected on Kaggle](https://www.kaggle.com/datasets/felixludos/babel-briefings).

The dataset is in JSON format, so we'll import the JSON library to work with the data and load it in a similar way as text files. `json.load` converts the JSON data into a Python object we can work with as a list of dictionaries for each headline "item".

In [None]:
import json

# US headlines from https://www.kaggle.com/datasets/felixludos/babel-briefings?resource=download

with open('data/babel-briefings-v1-us.json') as json_data:
    headlines = json.load(json_data)

Let's take a look at the dictionary for a single item in the headlines list. We want to work with the 'title' for each item, which is accessible via the dictionary key. 

In [None]:
headlines[0]

In [None]:
headlines[0]['title']

In [None]:
# let's print out classifications for the first ten items
for headline in headlines[0:10]:
    api_call(system_message, headline['title'])
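The model returns free text, so a response is not guaranteed to be exactly one of our class names: it might add quotes, change the capitalization, or invent a new label. A minimal sketch of a clean-up step follows. `normalize_label` is a hypothetical helper, and falling back to 'Other' is our own design choice that mirrors the prompt.

```python
CLASSES = ['U.S.', 'World', 'Business', 'Arts', 'Lifestyle',
           'Opinion', 'Sports', 'Science', 'Other']

def normalize_label(response, classes=CLASSES, fallback='Other'):
    """Map a raw model response onto one of the allowed class names."""
    # Remove surrounding whitespace and any quotes the model may have added.
    cleaned = response.strip().strip("'\"").strip()
    for label in classes:
        if cleaned.lower() == label.lower():
            return label
    return fallback

print(normalize_label(" 'science' "))
print(normalize_label("Politics"))
```

Applying this to each `response` before counting or saving the labels keeps the categories consistent across headlines.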

### Using LLMs for Named Entity Recognition (NER)
We can use the same technique, with a different system prompt, to ask for named entities (people, places, and other proper nouns) from each headline.

In [None]:
system_message = """For each user input (newspaper headlines), give me a list of:
- organization named entity
- location named entity
- person named entity
Format the output in valid json with the following keys:
- Organizations
- Locations
- Persons
"""

Instead of just printing our results, let's save them to a new Python dictionary. 

In [None]:
headline_ner = {}
for headline in headlines[20:30]:
    # use a separate name for the returned title so we don't overwrite the loop variable
    title, response = api_call(system_message, headline['title'])
    headline_ner[title] = response

Some of the JSON is not valid, despite our prompt! When that happens, `json.loads` raises a `json.JSONDecodeError` in the loop below.

In [None]:
for k, v in headline_ner.items():
    print(k)
    print(json.loads(v))
    print()
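One way to keep a loop like this from stopping at the first bad response is to wrap the parsing in a `try`/`except`. The sketch below uses a hypothetical helper, `safe_json_loads`. It also assumes one possible cause of invalid output, namely the model wrapping its JSON in a markdown code fence. We have not verified that this is what happened in the responses above.

```python
import json

FENCE = "`" * 3  # the markdown code-fence marker (three backticks)

def safe_json_loads(raw):
    """Return the parsed JSON, or None if the text is not valid JSON."""
    text = raw.strip()
    if text.startswith(FENCE):
        # Drop the opening fence line and, if present, the closing fence.
        lines = text.splitlines()[1:]
        if lines and lines[-1].strip() == FENCE:
            lines = lines[:-1]
        text = "\n".join(lines)
    try:
        return json.loads(text)
    except json.JSONDecodeError:
        return None

print(safe_json_loads('{"Persons": ["Ada Lovelace"]}'))
print(safe_json_loads('{"Persons": ["Ada'))  # cut-off JSON
```

Responses that come back as `None` can be collected and retried, or inspected by hand.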

### Get better JSON

There are ways to force ChatGPT to structure outputs using specific formats, as detailed in this post about the [`response_format` parameter](https://openai.com/index/introducing-structured-outputs-in-the-api/).

Let's create a new API call focused on providing JSON responses for NER.


In [None]:
def ner_api_call(user_message):
    completion = client.chat.completions.create(
        model="gpt-4o-mini",
        max_completion_tokens = 100,
        temperature=.2,
        response_format={
            "type": "json_schema",
            "json_schema": {
                "name": "ner_schema",
                "schema": {
                    "type": "object",
                    "properties": {
                        "Organizations": {"type": "array", "items": {"type": "string"}},
                        "Locations": {"type": "array", "items": {"type": "string"}},
                        "Persons": {"type": "array", "items": {"type": "string"}},
                    },
                    "required": ["Organizations", "Locations", "Persons"],
                    "additionalProperties": False,
                },
            },
        },
        messages=[
            {
                "role": "system",
                "content": "For each user input (newspaper headlines), give me a JSON list of named entities including locations, organizations, and persons.",
            },
            {
                "role": "user",
                "content": user_message,
            },
        ],
    )
    response = completion.choices[0].message.content
    print(f"Headline: {user_message} \nEntities: {response}")
    print()
    return user_message, response

In [None]:
headline_ner = {}
for headline in headlines[10:15]:
    # use a separate name for the returned title so we don't overwrite the loop variable
    title, response = ner_api_call(headline['title'])
    headline_ner[title] = response

In [None]:
for k, v in headline_ner.items():
    print(k)
    print(json.loads(v))
    print()
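Once every response parses cleanly, we can start summarizing the entities across headlines. Here is a small sketch that tallies how often each entity appears, using `collections.Counter` from the standard library. `count_entities` is a hypothetical helper, and the sample dictionary is made up to match the shape of `headline_ner`. It is not real API output.

```python
import json
from collections import Counter

def count_entities(results):
    """Tally entities per category across a {headline: JSON string} dictionary."""
    counts = {"Organizations": Counter(), "Locations": Counter(), "Persons": Counter()}
    for raw in results.values():
        parsed = json.loads(raw)
        for category, counter in counts.items():
            counter.update(parsed.get(category, []))
    return counts

# Made-up responses in the same shape as headline_ner.
sample = {
    "Headline A": '{"Organizations": ["NASA"], "Locations": ["Florida"], "Persons": []}',
    "Headline B": '{"Organizations": ["NASA", "SpaceX"], "Locations": [], "Persons": ["Jane Doe"]}',
}
counts = count_entities(sample)
print(counts["Organizations"].most_common(1))  # [('NASA', 2)]
```

Calling `count_entities(headline_ner)` would do the same for the real results.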

### Using LLMs for Sentiment Analysis
We can use a similar approach for sentiment analysis on the headlines. Imagine we want ChatGPT to respond with a dictionary with positive, negative, and neutral values ranging from 0 to 1. You could enforce this with a response format, but that might be overkill given the simplicity of the task.

In [None]:
def sa_api_call(user_message):
    system_message = """
    You are an expert in sentiment analysis. For each user input, analyze the sentiment and return a JSON dictionary with three keys: 'positive', 'negative', and 'neutral'. Each value should be a floating-point number between 0 and 1, representing the probability of each sentiment.
    """
    completion = client.chat.completions.create(
        model="gpt-4o-mini",
        max_completion_tokens = 100,
        temperature=.2,
        messages=[
            {
                "role": "system",
                "content": system_message
            },
            {
                "role": "user",
                "content": user_message,
            },
        ],
    )
    response = completion.choices[0].message.content
    return user_message, json.loads(response)

In [None]:
sa_api_call(headlines[10]['title'])

In [None]:
sa_api_call(headlines[11]['title'])
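Since we are not enforcing a schema here, it is worth sanity-checking the scores before using them. The sketch below picks the highest-scoring label and checks whether the three values roughly sum to 1, as probabilities should. `dominant_sentiment` and the `tolerance` value are our own choices for illustration, and the example scores are made up.

```python
def dominant_sentiment(scores, tolerance=0.05):
    """Return (top label, whether the scores roughly sum to 1)."""
    label = max(scores, key=scores.get)
    looks_normalized = abs(sum(scores.values()) - 1.0) <= tolerance
    return label, looks_normalized

# Made-up scores in the shape the prompt asks for.
example = {"positive": 0.1, "negative": 0.7, "neutral": 0.2}
print(dominant_sentiment(example))  # ('negative', True)
```

With the function above, the scores are the second item that `sa_api_call` returns.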

## spacy-llm wrapper
It's also possible to work with the ChatGPT API via different tools that have integrated LLMs into their own pipelines. spaCy, for example, offers a [spacy-llm package](https://spacy.io/usage/large-language-models) which provides pre-defined NLP tasks such as classifiers, NER, and summarization. 

Let's import spacy and load the `en_core_web_md` model that we worked with in previous lessons.

In [None]:
import spacy
nlp = spacy.load("en_core_web_md")

spaCy comes with built-in pipelines. We can take a look at them here. 

In [None]:
nlp.pipe_names

This shows us the sequence of the pipes.

1. 'tok2vec': This is the first component in the pipeline. It stands for "token to vector". This component converts the tokens into numerical vectors that represent the semantic meaning of each token. It's a crucial preprocessing step for many other components.
2. 'tagger': This component performs part-of-speech (POS) tagging. It assigns grammatical categories (like noun, verb, adjective, etc.) to each token in the text.
3. 'parser': The parser analyzes the grammatical structure of the sentence. It determines the relationships between words and creates a dependency parse tree.
4. 'attribute_ruler': This component can be used to add, modify or remove token attributes based on token or span matches. It's often used for rule-based corrections or additions to the pipeline's output.
5. 'lemmatizer': The lemmatizer reduces words to their base or dictionary form. For example, "running" would be lemmatized to "run".
6. 'ner': This stands for Named Entity Recognition. It identifies and classifies named entities (like persons, organizations, locations, etc.) in the text.

This sequence represents a common order of operations in NLP:

First, the text is tokenized and converted to vectors. Then, grammatical information is added (tagging and parsing). Additional attributes might be adjusted. Words are reduced to their base forms. Finally, named entities are identified.

Each step in this pipeline builds on the previous ones, creating a rich set of linguistic annotations for the input text. This particular pipeline is quite comprehensive and would be suitable for a wide range of NLP tasks.

In [None]:
for n, headline in enumerate(headlines[100:150]):
    print(n, headline['title'])
    doc = nlp(headline['title'])
    for ent in doc.ents:
        print(ent.text, ent.label_)
    print('---------------------------')

In [None]:
text = headlines[104]['title']
text

In [None]:
doc = nlp(text)

The spaCy model we loaded analyzed this headline using the pipeline components above. We can access the results at the token level for each element in `doc`. Here are some of the token attributes we can work with.

| Name         | Description                                     | Code Example       |
| ------------ | ----------------------------------------------- | ------------------ |
| `sent`       | The sentence to which the token belongs.        | `token.sent`       |
| `text`       | The raw text of the token.                      | `token.text`       |
| `head`       | The parent of the token in the dependency tree. | `token.head`       |
| `left_edge`  | The leftmost token of the token's subtree.      | `token.left_edge`  |
| `right_edge` | The rightmost token of the token's subtree.     | `token.right_edge` |
| `ent_type_`  | The entity type label of the token, if any.     | `token.ent_type_`  |
| `lemma_`     | The lemmatized form of the token.               | `token.lemma_`     |
| `morph`      | The morphological details of the token.         | `token.morph`      |
| `pos_`       | The part of speech tag of the token.            | `token.pos_`       |
| `dep_`       | The syntactic dependency relation.              | `token.dep_`       |
| `lang_`      | The language of the parent document.            | `token.lang_`      |

Let's look at some of the attributes of the 5th token, `Tagovailoa`, for example:

In [None]:
print(doc[4], doc[4].pos_, doc[4].ent_type_, doc[4].head)

You can guess from the `ent_type_` attribute that it's possible to use spaCy's built-in NER without resorting to an LLM. We can do that using the `.ents` attribute.

In [None]:
for ent in doc.ents:
    print(ent.text, ent.label_)

But we can also build our own pipelines that integrate LLMs to help us process our texts, without relying on spaCy's "en_core_web_md" model. To do that we'll create our own configuration file and then "assemble" it. Let's import the `assemble` tool from the `spacy_llm` utility library:

In [None]:
from spacy_llm.util import assemble

Now let's reload an nlp model from the config file in the `assets/` directory. We can take a look at the model design by selecting it from the file navigator to the left. 

In [None]:
nlp = assemble("assets/openai-ner.cfg")
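If you'd rather peek at a config from code than from the file navigator, the sketch below shows one way to do it. The config text here is hypothetical: it is modeled on the examples in the spacy-llm documentation and is not the actual contents of `assets/openai-ner.cfg`. The task and model names (`spacy.NER.v3`, `spacy.GPT-4.v2`) and the labels are assumptions that you should check against your installed version. We use the standard library's `configparser` only for inspection, which works because the values in spaCy config files are written as JSON.

```python
import configparser
import json

# Hypothetical config text in the style of the spacy-llm docs.
# It is NOT the actual contents of assets/openai-ner.cfg.
EXAMPLE_CFG = """
[nlp]
lang = "en"
pipeline = ["llm"]

[components]

[components.llm]
factory = "llm"

[components.llm.task]
@llm_tasks = "spacy.NER.v3"
labels = ["PERSON", "ORGANISATION", "LOCATION"]

[components.llm.model]
@llm_models = "spacy.GPT-4.v2"
"""

parser = configparser.ConfigParser(interpolation=None)
parser.read_string(EXAMPLE_CFG)

# Each value is a JSON literal, so json.loads turns it into a Python object.
pipeline = json.loads(parser["nlp"]["pipeline"])
task = json.loads(parser["components.llm.task"]["@llm_tasks"])
labels = json.loads(parser["components.llm.task"]["labels"])
print(pipeline, task, labels)
```

Notice that the pipeline in this example contains a single `llm` component. A config like this has no tagger or parser, so it produces entities but no part-of-speech tags or dependency information.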

If we apply this model to the same headline we'll see it hasn't gone through the full "en_core_web_md" pipeline. Attributes like `.pos_` and `.head` aren't available because our config file doesn't create them. But we are able to build in more specific NER labels.

In [None]:
doc = nlp(text)  # re-process the same headline with the new LLM-backed pipeline
print(doc[4], doc[4].pos_, doc[4].ent_type_, doc[4].head)

But we do see at least one more entity identified by the custom pipeline.

In [None]:
for ent in doc.ents:
    print(ent.text, ent.label_)

In [None]:
for headline in headlines[500:510]:
    print('--------')
    doc = nlp(headline['title'])
    print(f'{doc.text} \n')
    for ent in doc.ents:
        print(f'-- {ent.label_} {ent.text}')
    print()


In [None]:
nlp = assemble("assets/openai-textcat.cfg")

In [None]:
doc = nlp(headlines[0]['title'])

In [None]:
headlines[0]['title']

In [None]:
doc.cats  # a text classification pipeline stores its labels in doc.cats, not doc.ents

In [None]:
for headline in headlines[500:510]:
    print('--------')
    doc = nlp(headline['title'])
    print(f'{doc.text} \n')
    for label, score in doc.cats.items():  # categories assigned by the textcat pipeline
        print(f'-- {label} {score}')
    print()