# Tutorial: Using The OpenAI API for Categorization and Labeling
### Author: Campbell Lund
### 9/26/2023
This notebook walks through how to get started with the OpenAI API. We use the example of labeling and categorizing sentences to show what `gpt-3.5-turbo` can do for text analysis, along with techniques such as one-shot learning and prompt chaining.

### Table of contents:
- 1. [Initialization](#sec1)
- 2. [Example: prompting the model and zero-shot learning](#sec2)
    - 2.1.[Converting text to Python lists](#sec2p1)
- 3. [Example: prompting the model and one-shot learning](#sec3)
- 4. [Example: prompting in batches](#sec4)
    - 4.1.[Determining unique categories](#sec4p1)
- 5. [Categorizing](#sec5)

## 1. Initialization <a name="sec1"></a>

Import or `!pip install` the following libraries. Because this notebook will be shared, I've stored my API key in a `.env` file for security. Instructions for generating your personal API key can be found [here](https://help.openai.com/en/articles/4936850-where-do-i-find-my-secret-api-key). If you don't wish to store your key in a `.env` file, simply set `openai.api_key` equal to your key. Note that the code below uses the `openai.ChatCompletion` interface, which is only available in versions of the `openai` package earlier than 1.0.

In [3]:
import pandas as pd
import os
import openai
import json
import time
# for exponential backoff
from tenacity import (
    retry,
    stop_after_attempt,
    wait_random_exponential,
)  
# retrieving our API key from a secure file
from dotenv import load_dotenv, find_dotenv
_ = load_dotenv(find_dotenv())

openai.api_key  = os.environ['OPENAI_API_KEY']

### helper function:

In [4]:
# returns the model's response to a given message query
def get_completion_from_messages(messages, 
                                 model="gpt-3.5-turbo", 
                                 temperature=0, # degree of randomness
                                 max_tokens=150): #4000 is max for input and response combined
    response = openai.ChatCompletion.create(
        model=model,
        messages=messages,
        temperature=temperature, 
        max_tokens=max_tokens,
    )
    return response.choices[0].message["content"]

In [5]:
# read data
df = pd.read_csv('data/allQueries.csv', header=None, names=["sentences"])

In [6]:
df

Unnamed: 0,sentences
0,rock melting in india
1,free energy machines
2,levitation devices
3,flat earthers
4,who killed jfk?
...,...
623,biblical cosmology
624,mud flood
625,geoengineering
626,covid tests


## 2. Ex: prompting the model and zero-shot learning <a name="sec2"></a>

Now that we have a helper function to send queries and receive responses from the model, we must engineer our prompt. This is where trial and error is really your friend. The model can handle fairly complex instructions, but it's best to be direct. My advice for engineering a successful prompt is to pretend you're writing pseudo-code rather than giving written instructions to a friend: remember, it's a computer you're instructing, not a person.

### Vocab:
- **zero-shot learning:** an ML paradigm in which a model is applied to objects or concepts it has never seen in training. Since we do not give ChatGPT any labeled examples in the prompt, the example below is zero-shot.
- **delimiter:** a sequence of characters used to mark where one input ends and the next begins.
- **token:** a unit of text that the model processes. Tokens are often whole words, but longer or less common words may be split into several tokens. For our purposes, think of the token count as roughly the number of words in a query or a response.

In [5]:
# a good delimiter since it counts as a single token and isn't likely to appear naturally in the message
delimiter = "####" 

# message given to the model with instructions for how to respond
system_message = f"""
Your job is to determine the topic of a given sentence. \
You will be given a sentence as input and you will return \
a single word that best represents the topic of the sentence. \
Each input will be delimited by {delimiter} characters. \
Output a Python list of objects where each object has the following format: \
    "Sentence": <the input sentence>, \
    "Topic": <the topic output> \
"""
# message input from the user
user_message = f"""\
rock melting in india \
{delimiter}
free energy machines \
{delimiter}
levitation devices \
"""

# combining the system and user messages to give as input to our helper function
messages =  [  
{'role':'system', 
 'content': system_message},    
{'role':'user', 
 'content': f"{delimiter}{user_message}{delimiter}"},  
] 

In [6]:
response = get_completion_from_messages(messages)

In [7]:
response

'[{"Sentence": "rock melting in india", "Topic": "rock"}, {"Sentence": "free energy machines", "Topic": "energy"}, {"Sentence": "levitation devices", "Topic": "levitation"}]'

Although the output appears to be a Python list, as we instructed, note that everything our helper function returns is a string, which we must convert.

## 2.1. Converting text to Python lists <a name="sec2p1"></a>

In [8]:
topics = json.loads(response)

In [9]:
topics

[{'Sentence': 'rock melting in india', 'Topic': 'rock'},
 {'Sentence': 'free energy machines', 'Topic': 'energy'},
 {'Sentence': 'levitation devices', 'Topic': 'levitation'}]

Now we have an actual Python list of dictionaries!
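Because the model returns free text, `json.loads` raises an error whenever the reply is not valid JSON. Here is a minimal sketch of a more forgiving parser. The helper name `parse_topics` is hypothetical and not part of this notebook; it returns an empty list instead of raising, so the caller can decide whether to re-prompt.

```python
import json

def parse_topics(response_text):
    """Parse a model reply that should be a JSON list of objects.

    Hypothetical helper: returns [] when the reply is not valid JSON
    or is not a list, instead of raising an exception.
    """
    try:
        parsed = json.loads(response_text)
    except json.JSONDecodeError:
        return []
    return parsed if isinstance(parsed, list) else []

reply = '[{"Sentence": "levitation devices", "Topic": "levitation"}]'
parsed_reply = parse_topics(reply)
print(parsed_reply[0]["Topic"])
```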

## 3. Ex: prompting the model and one-shot learning <a name="sec3"></a>

For more complex tasks, or if the model just isn't returning what you expect, examples may need to be provided. 
### Vocab:
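- **one-shot learning:** an ML paradigm in which a model learns to handle objects or concepts from a very limited number of examples. Below, we place two examples of input and expected output in the prompt to guide the model (strictly speaking, using more than one example is usually called few-shot). The model's weights are not changed; the examples only steer its response.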

In [10]:
#providing examples of correct outputs to improve model accuracy
examples = [
    {"role": "user", "content": "rock melting in india"},
    {"role": "assistant", "content": "{\"Sentence\": \"rock melting in india\", \"Topic\": \"melting\"}"},
    {"role": "user", "content": "free energy machines"},
    {"role": "assistant", "content": "{\"Sentence\": \"free energy machines\", \"Topic\": \"machines\"}"}
      ]
    
# using the same system and user messages as defined in section 2
messages =  [  
{'role':'system', 
 'content': system_message},    
{'role':'user', 
 'content': f"{delimiter}{user_message}{delimiter}"},  
] 

In [11]:
# providing both the examples and the previous messages as input
response = get_completion_from_messages(examples + messages)
print(response)

[{"Sentence": "rock melting in india", "Topic": "melting"}, {"Sentence": "free energy machines", "Topic": "machines"}, {"Sentence": "levitation devices", "Topic": "devices"}]


Notice how the third topic changes after providing examples for the first two inputs.
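The example messages above were written with hand-escaped quotes. As a sketch of an alternative, the same structure could be generated from (sentence, topic) pairs with `json.dumps`. The function name `build_examples` is an assumption for illustration only.

```python
import json

def build_examples(pairs):
    """Turn (sentence, topic) pairs into alternating user/assistant messages."""
    messages = []
    for sentence, topic in pairs:
        messages.append({"role": "user", "content": sentence})
        # json.dumps produces the quoting that would otherwise be escaped by hand
        messages.append({
            "role": "assistant",
            "content": json.dumps({"Sentence": sentence, "Topic": topic}),
        })
    return messages

example_messages = build_examples([
    ("rock melting in india", "melting"),
    ("free energy machines", "machines"),
])
print(example_messages[1]["content"])
```

Adding or swapping examples then only requires editing the list of pairs.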

## 4. Ex: prompting in batches <a name="sec4"></a>

Often when we use the OpenAI API it is because we have large amounts of data that we want to use as prompts. Entering these queries by hand is time-consuming, so let's automate the process. Some important notes:

Our API account has a `rate_limit`, and the model has a limit of about 4,000 tokens that is shared between the input messages and the generated response (the `max_tokens` parameter caps only the response). This means the combined input and output for each query must stay under 4,000 tokens, which is one of the reasons we will break up large tasks into smaller parts.
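A rough way to check whether a batch is likely to fit is sketched below. It assumes about 4 characters per token, which is only a rule of thumb for English text; exact counts require a tokenizer such as `tiktoken`. The names `estimate_tokens` and `fits_in_context` are hypothetical.

```python
def estimate_tokens(text, chars_per_token=4):
    """Very rough token estimate; assumes ~4 characters per token."""
    return max(1, len(text) // chars_per_token)

def fits_in_context(system_message, user_message, max_response_tokens,
                    context_limit=4000):
    """Return True if the estimated prompt plus the reserved response fits."""
    prompt_tokens = estimate_tokens(system_message) + estimate_tokens(user_message)
    return prompt_tokens + max_response_tokens <= context_limit

system = "Return one topic word per sentence."
batch = "####".join(["flat earthers"] * 100)
print(fits_in_context(system, batch, max_response_tokens=300))
```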

Another reason is to stay within the `rate_limit`. To find the rate limit of your account, simply try running the earlier `get_completion_from_messages()` helper function on a large DataFrame. It won't be long before you receive this message:

`RateLimitError: Rate limit reached for default-gpt-3.5-turbo in organization org-KiUYu8NRxzHi3TljuvYUEiIG on tokens per min. Limit: 90000 / min. Current: 87379 / min. Contact us through our help center at help.openai.com if you continue to have issues.`

Now we know our `rate_limit` is 90,000 tokens/min.
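Staying under a tokens-per-minute limit amounts to spacing requests out. The sketch below uses the 90,000 tokens/min figure from the error message above and assumes every request consumes roughly the same number of tokens. The function name is hypothetical.

```python
def min_seconds_between_requests(tokens_per_request, tokens_per_minute=90000):
    """Smallest gap between requests that keeps usage under the limit."""
    if tokens_per_request <= 0:
        raise ValueError("tokens_per_request must be positive")
    requests_per_minute = tokens_per_minute / tokens_per_request
    return 60.0 / requests_per_minute

gap = min_seconds_between_requests(tokens_per_request=3000)
print(f"wait at least {gap:.1f} s between requests")
```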

### helper function:

The helper function below is similar to the previous `get_completion_from_messages()`, except that it now takes a `df`, a `system_message`, and a `batch_size` as parameters. `get_completion_from_messages_batch()` sends the rows of the provided `df` to the model in batches of `batch_size` and returns a list of all responses. If your results are inaccurate, try lowering `batch_size`; if it's taking too long, raise it.
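The batching step on its own, without any API calls, might look like the sketch below. It works on a plain list of sentences rather than a DataFrame, and the name `build_batch_messages` is an assumption for illustration.

```python
def build_batch_messages(sentences, batch_size, delimiter="####"):
    """Yield one delimited user message per batch of sentences."""
    for start in range(0, len(sentences), batch_size):
        batch = sentences[start:start + batch_size]
        # each sentence is followed by the delimiter
        yield "".join(f"{s}{delimiter}" for s in batch)

sentences = ["rock melting in india", "free energy machines", "levitation devices"]
for message in build_batch_messages(sentences, batch_size=2):
    print(message)
```

Separating message construction from the request makes it easy to inspect exactly what each batch will send before spending any tokens.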

In [7]:
# returns an array containing the model's responses from a given df of message queries
@retry(wait=wait_random_exponential(min=30, max=60), stop=stop_after_attempt(6))
def get_completion_from_messages_batch(df,
                                 system_message,
                                 batch_size,
                                 model="gpt-3.5-turbo", 
                                 temperature=0, # degree of randomness
                                 max_tokens=300): # 4000 is max for input and response combined
    
    responses = []
    delimiter = '####'
    for i in range(0, len(df), batch_size):
        batch = df.iloc[i:i+batch_size]
        user_message = ""
        for index, row in batch.iterrows():
            user_message += f"{row['sentences']}{delimiter}"
            
        messages = [  
        {'role':'system', 
         'content': system_message},    
        {'role':'user', 
         'content': user_message}  
        ] 
        
        # calculate sleep time before each request to ensure we don't exceed the rate limit
        # calculate_sleep_time() # comment out to test your account's rate limit
        
        response = openai.ChatCompletion.create(
            model=model,
            messages=messages,
            temperature=temperature, 
            max_tokens=max_tokens,
        )
        
        content = response.choices[0].message["content"]
        responses.append(content) 
        
    return responses
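Note that a retry decorator placed on the whole function re-runs every batch from the start when any single request fails. One possible alternative, sketched here with only the standard library, is to retry each request individually. `call_with_backoff` and `flaky_request` are hypothetical names, and catching every `Exception` is a simplification; real code would catch only the library's rate-limit error.

```python
import random
import time

def call_with_backoff(request_fn, max_attempts=6, base_delay=1.0,
                      max_delay=60.0, sleep=time.sleep):
    """Call request_fn, retrying with randomized exponential backoff."""
    for attempt in range(1, max_attempts + 1):
        try:
            return request_fn()
        except Exception:
            if attempt == max_attempts:
                raise
            # wait a random time up to base_delay * 2^(attempt - 1), capped at max_delay
            delay = min(max_delay, base_delay * 2 ** (attempt - 1))
            sleep(random.uniform(0, delay))

# stand-in for an API call that fails twice before succeeding
calls = {"count": 0}

def flaky_request():
    calls["count"] += 1
    if calls["count"] < 3:
        raise RuntimeError("simulated rate limit")
    return "ok"

result = call_with_backoff(flaky_request, sleep=lambda seconds: None)
print(result, calls["count"])
```

Inside the batch loop, each request would be wrapped this way so that completed batches are kept when a later one fails.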

### engineering a new prompt:
This prompt will be applied to our entire `df`. Try to keep it simple to speed things up. Since LLMs are already trained to do summarization tasks, we'll start our categorization by asking the model to determine the subject of each input.

In [36]:
delimiter = "####"
system_message = f"""
    The goal is to classify sentences based on their topic.\
    Your job is to determine the topic of a given sentence. \
    You will be given a sentence as input and you will return \
    a single word that best represents the topic of the sentence. \
    Be broad with the topics, some sentences should share \
    similar topics and it is okay to return the same topic\
    multiple times. Each input will be delimited by {delimiter} \
    characters. Format your response as a Python list, each \
    topic must be in double quotations. \
"""

In [34]:
responses = get_completion_from_messages_batch(df, system_message, 100)

In [32]:
# close any list that was cut off before its closing bracket
for i, r in enumerate(responses):
    if not r.endswith("]"):
        responses[i] += "]"
    print(responses[i])

["conspiracy theories", "bilderberg", "trump conspiracy theories", "gun control conspiracy theories", "crazy conspiracy theories", "9/11 hoaxes", "looking for bigfoot", "big government", "great reset", "voting scams", "elvis is alive", "moon landing did not happen", "jfk conspiracy", "have we been visited by other life forms", "was 9/11 allowed to happen by our own government", "fault", "untrue", "have aliens been to earth", "area 51", "mlk fbi", "mlk killed by govt", "mlk conspiracy", "john f kennedy", "ufos", "paranormal", "kennedy assassination", "climate change", "celebrity clones", "roswell crash conspiracy", "government conspiracies", "twin tower theories", "ufo & government", "shadow government", "russia collusion", "john kennedy", "assassination", "president", "government is hiding information", "covid vaccine implanting chips", "secret aliens on planets", "alex jones", "globalist agenda", "sandy hook shooting actors", "theories that are a conspiracy", "spare change", "covid 19

In [35]:
responses

['["conspiracy theories", "government coverup", "JFK assassination", "moon landing hoax", "aliens and UFOs", "COVID-19 vaccine misconceptions", "9/11 conspiracy", "Flat Earth theory", "secret societies", "COVID-19 origins"]',
 '["conspiracy theories","government","9/11","aliens","moon landing","JFK assassination","COVID-19","vaccines","celebrity clones","Roswell UFO","lizard men","one world order","election conspiracy","flat earth","secret society","Philadelphia experiment","hydroxychloroquine","Pizzagate","Q drops","Epstein","Bigfoot"]',
 '["covid-19 conspiracy theories","government conspiracy theories","chem trails","project bluebeam","most believed conspiracy theories","big conspiracy theories","unbelievable stories","myths that persist","okc bombing second suspect","timothy mcveigh black ops","aberration in the heartland of the real: the secret lives of timothy mcveigh" by wendy s. painting, phd","hawaii chemtrails","pizzagate adrenochrome","vaccine bloody spikes","man on the moon"

In [17]:
responses[0]

'["conspiracy theories", "bilderberg", "trump conspiracy theories", "gun control conspiracy theories", "crazy conspiracy theories", "9/11 hoaxes", "looking for bigfoot", "big government", "great reset", "voting scams", "elvis is alive", "moon landing did not happen", "jfk conspiracy", "have we been visited by other life forms", "was 9/11 allowed to happen by our own government", "fault", "untrue", "have aliens been to earth", "area 51", "mlk fbi", "mlk killed by govt", "mlk conspiracy", "john f kennedy", "ufos", "paranormal", "kennedy assassination", "climate change", "celebrity clones", "roswell crash conspiracy", "government conspiracies", "twin tower theories", "ufo & government", "shadow government", "russia collusion", "john kennedy", "assassination", "president", "government is hiding information", "covid vaccine implanting chips", "secret aliens on planets", "alex jones", "globalist agenda", "sandy hook shooting actors", "theories that are a conspiracy", "spare change", "covid 1
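The responses above are cut off in the middle of a string, most likely because the reply reached the `max_tokens` limit, so appending `]` alone does not yield a parseable list. Below is a minimal sketch of one way to recover the complete items. `salvage_list` is a hypothetical helper, and it assumes no item contains the sequence `",` inside its text.

```python
import json

def salvage_list(response_text):
    """Parse a JSON list of strings, dropping a trailing item that was cut off."""
    try:
        return json.loads(response_text)
    except json.JSONDecodeError:
        pass
    # cut back to the last item that is followed by a comma, then close the list
    cut = response_text.rfind('",')
    if cut == -1:
        return []
    try:
        return json.loads(response_text[:cut + 1] + "]")
    except json.JSONDecodeError:
        return []

truncated = '["conspiracy theories", "bilderberg", "covid 1'
print(salvage_list(truncated))
```

Raising `max_tokens` or lowering `batch_size` avoids the truncation in the first place; salvaging is only a fallback.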

## 4.1. Determining unique categories <a name="sec4p1"></a>

In [17]:
# reading allTopics.csv - run this cell only if you're working with the saved data
temp = pd.read_csv('data/allTopics.csv', skiprows=1, names=["topics"])

all_topics = []
temp = temp.values.tolist()
for t in temp:
    all_topics.append(t[0])

In [18]:
all_topics = [string.lower() for string in all_topics if len(string.split()) == 1]

In [19]:
unique_topics = list(set(all_topics))
print('Number of unique topics: ', len(unique_topics))
print('Topics: ', unique_topics)

Number of unique topics:  42
Topics:  ['9/11', 'cryptozoology', 'astronomy', 'cryptid', 'sustainability', 'science', 'politics', 'money', 'paranormal', 'environment', 'surveillance', 'food', 'aliens', 'conference', 'health', 'animals', 'uncategorized', 'extraterrestrial', 'entertainment', 'space', 'government', 'supernatural', 'history', 'crime', 'memory', 'misinformation', 'ufos', 'geology', 'economy', 'problem', 'sports', 'tragedy', 'mystery', 'false', 'terrorism', 'aviation', 'scandal', 'media', 'technology', 'error', 'pandemic', 'conspiracy']


Since there are a reasonable number of unique topics, I'll narrow down the most relevant final categories by hand. You can prompt the LLM to do this or use another NLP technique if you wish. I discerned 12 major categories as follows:

1. 'politics'
    - 'government'
    - 'scandal'
    - 'misinformation'
    - 'surveillance'
    - 'crime'
2. 'health'
    - 'pandemic'
3. 'terrorism'
    - '9/11'
    - 'tragedy'
4. 'media'
    - 'entertainment'
5. 'economy'
    - 'money'
6. 'history'
7. 'environment'
    - 'sustainability'
8. 'science'
    - 'geology'
9. 'technology'
    - 'aviation'
10. 'conspiracy'
    - 'false'
11. 'space'
    - 'paranormal'
    - 'extraterrestrial'
    - 'aliens'
    - 'astronomy'
    - 'ufos'
12. 'supernatural'
    - 'cryptid'
    - 'cryptozoology'
    - 'mystery'
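The grouping above can be written down as a lookup table so that any single-word topic maps to its major category. This is a sketch: the names `CATEGORIES`, `TOPIC_TO_CATEGORY`, and `to_category` are assumptions, and topics that were not assigned to a category (for example 'food') return `None`.

```python
# major category -> subtopics, copied from the list above
CATEGORIES = {
    "politics": ["government", "scandal", "misinformation", "surveillance", "crime"],
    "health": ["pandemic"],
    "terrorism": ["9/11", "tragedy"],
    "media": ["entertainment"],
    "economy": ["money"],
    "history": [],
    "environment": ["sustainability"],
    "science": ["geology"],
    "technology": ["aviation"],
    "conspiracy": ["false"],
    "space": ["paranormal", "extraterrestrial", "aliens", "astronomy", "ufos"],
    "supernatural": ["cryptid", "cryptozoology", "mystery"],
}

# invert to topic -> category; each category also maps to itself
TOPIC_TO_CATEGORY = {cat: cat for cat in CATEGORIES}
for cat, subtopics in CATEGORIES.items():
    for topic in subtopics:
        TOPIC_TO_CATEGORY[topic] = cat

def to_category(topic):
    """Return the major category for a topic, or None if it was not assigned one."""
    return TOPIC_TO_CATEGORY.get(topic.lower())

print(to_category("ufos"), to_category("pandemic"), to_category("food"))
```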

## 5. Categorizing  <a name="sec5"></a>

In this section, we use the output from a previous query as the input to another. This is usually called **prompt chaining** (chain-of-thought prompting, by contrast, asks the model to reason step by step within a single response). For complex tasks, it helps to break problems down into digestible parts.
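One way to make the hand-off between the two steps explicit is to generate the second prompt from the category list rather than typing it out. The sketch below uses a hypothetical function, `build_category_prompt`, with wording loosely modeled on the prompts in this notebook.

```python
def build_category_prompt(categories, delimiter="####"):
    """Build a classification system message from a list of categories."""
    topic_list = ", ".join(categories)
    return (
        "Your job is to classify sentences based on their topic. "
        "Given a sentence, determine which category it belongs to "
        f"from the topic list. Topic list: [{topic_list}] "
        f"Each input will be delimited by {delimiter} characters. "
        "Format your response as a Python list."
    )

categories = ["politics", "health", "terrorism"]  # shortened for illustration
prompt = build_category_prompt(categories)
print(prompt)
```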

In [94]:
system_message = f"""
    Your job is to classify sentences based on their topic.\
    Given a sentence, determine which category it belongs \
    to from the topic list. \
        Topic list: \
            [politics, \
            health, \
            terrorism, \
            media, \
            economy, \
            history, \
            environment, \
            science, \
            technology, \
            space, \
            supernatural] \
    Each input will be delimited by #### characters. \
    Format your response as a Python list. Output a Python \
    object of the following format: \
    "topic": <the determined topic from the Topic List>, \
    "sentence": <the input sentence> \
"""

In [95]:
responses = get_completion_from_messages_batch(df, system_message, 100)

In [96]:
responses

['[{"topic": "supernatural", "sentence": "rock melting in india"}, {"topic": "technology", "sentence": "free energy machines"}, {"topic": "supernatural", "sentence": "levitation devices"}, {"topic": "supernatural", "sentence": "flat earthers"}, {"topic": "histo...{"topic": "science", "sentence": "global warming"}, {"topic": "supernatural", "sentence": "aliens exist"}, {"topic": "conspiracies", "sentence": "911 is fake"}, {"topic": "space", "sentence": "us did not land on the moon"}, {"topic": "economy", "sentence": "fed conspiracies"}, {"topic": "conspiracies", "sentence": "9/11 truth"}]',
 '[{"topic": "conspiracy", "sentence": "bilderberg"}, {"topic": "conspiracy", "sentence": "trump conspiracy theories"}, {"topic": "conspiracy", "sentence": "gun control conspiracy theories"}, {"topic": "conspiracy", "sentence": "crazy conspiracy theories"}, {"topic": "conspiracy", "sentence": "9/11 hoaxes"}, {"topic": "supernatural", "sentence": "looking for bigfoot"}, {"topic": "politics", "sentence

Once you've refined the prompt to your liking, run `get_completion_from_messages_batch()` on your entire `df`. This will take a long time to run. To keep each run manageable, try sending less data at a time, either by lowering the batch size in the loop as we did in Section 4 or by manually slicing the `df`.
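The responses shown above include topics such as "conspiracies" that are not in the topic list given to the model, so it is worth checking the output before using it. Here is a minimal sketch, assuming the reply parses as JSON. `ALLOWED` and `split_by_validity` are hypothetical names, and the sample items are taken from the output above.

```python
import json

ALLOWED = {"politics", "health", "terrorism", "media", "economy", "history",
           "environment", "science", "technology", "space", "supernatural"}

def split_by_validity(response_text, allowed=ALLOWED):
    """Separate parsed items into those with an allowed topic and the rest."""
    valid, invalid = [], []
    for item in json.loads(response_text):
        (valid if item.get("topic") in allowed else invalid).append(item)
    return valid, invalid

sample = ('[{"topic": "space", "sentence": "us did not land on the moon"}, '
          '{"topic": "conspiracies", "sentence": "9/11 truth"}]')
valid, invalid = split_by_validity(sample)
print(len(valid), "valid;", [item["topic"] for item in invalid], "need review")
```

Items in `invalid` can be re-sent to the model or labeled by hand.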