# Demo Notebook

This notebook demonstrates the methodology used to retrieve labels for qualitative data using OpenAI's Python API. It uses data from the associated paper "Using GPT for Supervised Content Coding: An Application in Corresponding Experiments", but the method is applicable to many other use cases.

In [1]:
#install and update openai api and tokenizer
#!pip install openai
#!pip install --upgrade openai
#!pip install tiktoken
#!pip install --upgrade tiktoken

In [2]:
import pandas as pd
import numpy as np
from openai import OpenAI
import os
import tiktoken as tk #tokenizer

#import exponential backoff
from tenacity import (
    retry,
    stop_after_attempt,
    wait_random_exponential,
)

client = OpenAI(
  api_key='your_api_key',  # replace with your key; if omitted, the client reads the OPENAI_API_KEY environment variable
)



## Functions

The function below passes prompts to GPT. We use chat completions to run newer models such as gpt-3.5-turbo and gpt-4; completions are also available for older models. The function also uses exponential backoff. Large jobs will often fail because a single request is denied by OpenAI. Exponential backoff retries the failed request after a randomized wait period that grows with each attempt (here, up to six attempts), which makes it much more likely that a complete job will get through.

In [3]:
@retry(wait=wait_random_exponential(min=1, max=60), stop=stop_after_attempt(6)) #exponential backoff
def run_chat_complete_job(prompt, model):
    """
    Pass a prompt to OpenAI using a chat completion. Works with gpt-3.5-turbo and gpt-4.
    
    Parameters:
    - prompt (str): user prompt
    - model (str): gpt model to be used
    
    Returns:
    - ChatCompletion: completion object returned by OpenAI
    """
    
    #send chat completion job
    completion = client.chat.completions.create(
        model=model, #specify the model
        messages=[
            #system and user messages must be separate dicts; in a single dict the duplicate keys overwrite each other
            {'role': 'system', 'content': 'You help interpret emails.'},
            {'role': 'user', 'content': prompt},
        ],
        temperature=0, #randomness of completion, 0 makes repeated runs return (nearly) the same completion
        max_tokens=1, #maximum number of tokens to return in the completion
    )
    
    return completion
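
To make the retry behavior concrete, the sketch below hand-rolls a simplified exponential backoff using only the standard library. It is an illustration of the idea, not tenacity's implementation: `call_with_backoff` and `flaky_request` are hypothetical names, and the waits are recorded rather than slept so the example runs instantly.

```python
import random
import time


def call_with_backoff(func, max_attempts=6, min_wait=1.0, max_wait=60.0,
                      sleep=time.sleep, rng=random):
    """Call func(); after a failure, wait a random, growing interval and retry."""
    for attempt in range(1, max_attempts + 1):
        try:
            return func()
        except Exception:
            if attempt == max_attempts:
                raise  #out of attempts, surface the last error
            #upper bound on the wait doubles with each attempt, capped at max_wait
            upper = min(max_wait, min_wait * 2 ** attempt)
            sleep(rng.uniform(min_wait, upper))


#hypothetical request that fails twice before succeeding
calls = {'count': 0}

def flaky_request():
    calls['count'] += 1
    if calls['count'] < 3:
        raise RuntimeError('request denied')
    return 'yes'

waits = []  #record the waits instead of actually sleeping
result = call_with_backoff(flaky_request, sleep=waits.append, rng=random.Random(0))
print(result, calls['count'], len(waits))  # yes 3 2
```

Note that a request that fails on every attempt still raises after the final try, so a large job should be prepared to handle or re-run the occasional failed row.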

#### Bootstrapping Function

In [None]:
def bootstrap_accuracy(df, gpt_response, human_response, n):
    """
    Generate a bootstrap distribution of accuracies.
    
    Parameters:
    - df (dataframe): dataframe of labeled examples
    - gpt_response (str): gpt response column
    - human_response (str): human label column
    - n (int): number of bootstrap samples
    
    Returns:
    - numpy array: array of accuracies from each bootstrapped sample
    """
    
    sample_accuracy = np.empty(n) #array to store bootstrapped accuracies
    
    #loop n times: generate a bootstrap sample and record its accuracy
    for i in range(n):
        boot_sample = df.sample(n=len(df), replace=True)
        accuracy = round((boot_sample[gpt_response] == boot_sample[human_response]).mean(), 2)
        sample_accuracy[i] = accuracy
        
    return sample_accuracy

The function below estimates the price of running a job through OpenAI using gpt-3.5-turbo or gpt-4. The price is the total number of input tokens (in our case, the base prompt repeated once per email, plus the emails themselves) multiplied by the input token price, plus the total number of output tokens multiplied by the output token price. Prices from OpenAI are quoted per thousand tokens.

In [5]:
#gpt prices per thousand tokens
gpt_input_price = {'gpt-3.5-turbo': 0.0015, 'gpt-4': 0.03}
gpt_output_price = {'gpt-3.5-turbo': 0.002, 'gpt-4': 0.06}

def calculate_job_price(prompt, text, model, output_size):
    """
    Takes in a prompt, a series or list of text to be labeled, the model to be used, and the size of the gpt response
    and returns the estimated cost of running the job in dollars.
    
    Parameters:
    - prompt (str): prompt to be passed to gpt
    - text (series, list): a series or list containing text to be labeled.
    - model (str): gpt model to be used
    - output_size (int): the number of tokens gpt will use in completions
    
    Returns:
    - a dictionary containing the price, number of tokens in the prompt, number of tokens in the text, the total
      input tokens, and the total output tokens.
    """
    
    encoding = tk.encoding_for_model(model) #initialize an encoding object
    
    total_labels = len(text) #total number of labels needed
    prompt_tokens = len(encoding.encode(prompt)) #number of tokens in the prompt alone
    text_tokens = len(encoding.encode(' '.join(text))) #number of tokens contributed by the text
    total_input_tokens = text_tokens + (prompt_tokens * total_labels)
    total_output_tokens = total_labels * output_size
    
    #calculate the total price
    price = round((total_input_tokens/1000) * gpt_input_price[model] 
                  + (total_output_tokens/1000) * gpt_output_price[model], 2)
    
    return {'price':price, 'prompt_tokens':prompt_tokens, 
            'text_tokens':text_tokens, 'total_input_tokens':total_input_tokens, 
            'total_output_tokens':total_output_tokens}
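
As a worked example of the price formula, the sketch below repeats the arithmetic with made-up token counts, so it runs without tiktoken. The counts are hypothetical; the prices are the gpt-3.5-turbo entries from the dictionaries above.

```python
#hypothetical token counts, chosen only to make the arithmetic easy to follow
prompt_tokens = 100      #tokens in the base prompt
n_emails = 50            #number of emails to label
text_tokens = 10_000     #tokens across all emails combined
output_size = 1          #tokens returned per completion

#prices per thousand tokens, copied from the gpt-3.5-turbo entries above
input_price = 0.0015
output_price = 0.002

#the prompt is sent once per email, so its tokens are counted n_emails times
total_input_tokens = text_tokens + prompt_tokens * n_emails   #15,000
total_output_tokens = n_emails * output_size                  #50

price = ((total_input_tokens / 1000) * input_price
         + (total_output_tokens / 1000) * output_price)
print(round(price, 4))  # 0.0226
```

Because `calculate_job_price` rounds to two decimals, a job this small would be reported as 0.02 dollars.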

The function below prepares a list of shots to be passed to GPT along with the prompt. Shots are pre-labeled examples included in a prompt to guide the model.

In [6]:
def prepare_shots(df, text_column, human_label_column, shots):
    """
    Prepares shots to be used in a prompt when shots are contained within the test dataframe.
    
    Parameters:
    - df (dataframe): dataframe containing text to be labeled
    - text_column (str): the column of the dataframe containing the text to be labeled
    - human_label_column (str): column containing the human label
    - shots (list of integers): list of indices to be used as shots
    
    Returns:
    - tuple containing a list of shots, and a copy of df with the examples removed. 
    """
    
    #grab the shots from the original dataframe
    df_shots = df.loc[shots, [text_column, human_label_column]]
    
    #store the shots in a list, staggering the email and the label
    message = list(df_shots[text_column])
    response = list(df_shots[human_label_column])
    intermediate = list(map(list, zip(message, response)))
    shots_list = [a for sublist in intermediate for a in sublist]
    
    #remove the shots from the original dataframe
    df_shotless = df[~df.index.isin(shots)].reset_index(drop=True)
    
    return shots_list, df_shotless
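
The staggering step can be seen in isolation with plain lists. In the sketch below the emails, labels, and template are hypothetical stand-ins for the dataframe columns and prompt used in this notebook; it shows how the flattened list lines up with the placeholders of a two-shot prompt.

```python
#hypothetical labeled examples standing in for rows of the dataframe
emails = ['We have a bed open next week.',
          'Please call to discuss Medicaid coverage.']
labels = ['no', 'yes']

#stagger each email with its label: [email_1, label_1, email_2, label_2]
shots_list = [item for pair in zip(emails, labels) for item in pair]

#two shots fill the first four placeholders, the new email fills the fifth
template = 'email:{}\nresponse:{}\nemail:{}\nresponse:{}\nemail:{}\nresponse:'
new_email = 'We accept most private insurance plans.'  #hypothetical
filled = template.format(*shots_list, new_email)
print(filled)
```

The template needs one placeholder per shot element plus one for the email being labeled; `str.format` silently ignores surplus arguments, so a missing placeholder would drop the email without raising an error.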

## Prompts

Availability

In [None]:
prompt = """I am providing emails from a nursing home in response to a potential resident.
            If the email indicates that the nursing home has availability, respond with yes.
            If the email indicates that the nursing home does not have availability,
            respond with no. If the email indicates that the nursing home cannot accept
            the client, respond with no. If the email mentions the word waitlist,
            respond with waitlist. If the email does not indicate any of the above,
            respond with nan.

            email:{}
            response:"""

Payable

In [None]:
prompt = """I am providing an email from a nursing home in response to a potential client.
            If the email mentions payment or insurance, respond with yes.
            Otherwise, respond with no.

            email:{}
            response:"""

Citizen Flag

In [None]:
prompt = """I am providing an email from a nursing home in response to a potential client.
          If the email mentions citizenship or citizenship restrictions, respond with yes.
          Otherwise, respond with no.
            
          email:{}
          response:"""

More Information

In [None]:
prompt = """I am providing an email from a nursing home in response to a potential client.
            If the email requests additional information, documentation, or paperwork from the client,
            respond with yes. If the email asks the client a question, respond with yes. Otherwise,
            respond with no.
                
            email:{}
            response:"""
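
Each prompt above is a template whose `{}` placeholder is filled with one email via `str.format`. The sketch below shows that step on a hypothetical email. It also uses `textwrap.dedent` to strip the indentation that triple-quoted strings otherwise carry into the prompt; this is an optional variation, not something the notebook does, and whether the extra whitespace affects the labels is not tested here.

```python
import textwrap

#dedent removes the common leading whitespace from every line of the template
template = textwrap.dedent("""\
    I am providing an email from a nursing home in response to a potential client.
    If the email mentions payment or insurance, respond with yes.
    Otherwise, respond with no.

    email:{}
    response:""")

email = 'We accept Medicare and most private plans.'  #hypothetical email text
filled = template.format(email)
print(filled)
```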

## Test Samples

This section shows how prompts can be tested on a subset of the data before being used on the entire dataset.

In [7]:
path = '/Users/achurch/Desktop/RA_Job/actual_final_stuff/experiment_data/final_clean_results.csv'

df = pd.read_csv(path, index_col = 0, dtype=str)

#replace binary responses with yes and no
df.replace({'1':'yes', '0':'no'}, inplace=True)
df['availableGPT'] = df['availableGPT'].astype(str)
df['available'] = df['available'].astype(str)

#### Available, No Shots, gpt-3.5-turbo

Calculate price and define variables.

In [None]:
#create a random sample from the dataframe
df_test = df.sample(100)

#define the input variables
prompt = """I am providing emails from a nursing home in response to a potential resident.
            If the email indicates that the nursing home has availability, respond with yes.
            If the email indicates that the nursing home does not have availability,
            respond with no. If the email indicates that the nursing home cannot accept
            the client, respond with no. If the email mentions the word waitlist,
            respond with waitlist. If the email does not indicate any of the above,
            respond with nan.

            email:{}
            response:"""
            
model = 'gpt-3.5-turbo'

#calculate price before running
calculate_job_price(prompt, df_test['email_message'], model, 1)

In [None]:
#uncomment to run a job
#df_test['GPT'] = df_test['email_message'].apply(lambda x: run_chat_complete_job(prompt.format(x), model))

#retrieve completions from the gpt response and clean it
df_test['response'] = df_test['GPT'].apply(lambda x: x.choices[0].message.content)
df_test['response'] = df_test['response'].apply(lambda x: x.strip().lower())

#for availability, max_tokens = 1 truncates 'waitlist' to 'wait', so map 'wait' back to 'waitlist'
df_test['response'] = df_test['response'].apply(lambda x: 'waitlist' if x == 'wait' else x)

#calculate the accuracy compared to the human responses
accuracy = np.sum(df_test['response'] == df_test['available']) / len(df_test)
print(round(accuracy, 2))
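
The cleaning steps above can also be collected into one small function. The sketch below is a hypothetical helper, not part of the original notebook; unlike the cell above, it additionally maps any unexpected completion to `None` so that off-list answers are easy to spot.

```python
def clean_response(raw, allowed=('yes', 'no', 'waitlist', 'nan')):
    """Normalize a raw completion string to one of the allowed labels, else None."""
    text = raw.strip().lower()
    if text == 'wait':  #'waitlist' truncated by max_tokens = 1
        text = 'waitlist'
    return text if text in allowed else None


raw_outputs = [' Yes', 'no\n', 'wait', 'Maybe']  #hypothetical completions
cleaned = [clean_response(r) for r in raw_outputs]
print(cleaned)  # ['yes', 'no', 'waitlist', None]
```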

Now bootstrap the test sample: resample with replacement and recalculate the accuracy as many times as is computationally feasible. In this case we calculate accuracy 1000 times. Then calculate the confidence interval. The 95% confidence interval gives a range of plausible values for the accuracy that will be achieved on the entire dataset.

In [None]:
#bootstrap the accuracy of the gpt labels against the human labels
bootstrap_dist = bootstrap_accuracy(df_test, 'response', 'available', 1000)

#confidence interval using the percentile method
ci_info = (np.percentile(bootstrap_dist, 2.5), np.percentile(bootstrap_dist, 97.5))

print('sample_accuracy: ', round(bootstrap_dist.mean(), 2))
print('confidence_interval: ', ci_info)
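
For readers who want to see the percentile method without pandas or numpy, the sketch below repeats it with the standard library on made-up data. `bootstrap_ci` is a hypothetical helper, and it picks nearest-rank percentiles from the sorted accuracies rather than interpolating as `np.percentile` does, so its bounds can differ slightly from the cell above.

```python
import random


def bootstrap_ci(matches, n_boot=1000, alpha=0.05, seed=0):
    """Percentile bootstrap CI for accuracy, given a list of 0/1 match indicators."""
    rng = random.Random(seed)
    n = len(matches)
    #each bootstrap accuracy comes from n draws with replacement
    accuracies = sorted(
        sum(rng.choices(matches, k=n)) / n
        for _ in range(n_boot)
    )
    lower = accuracies[int((alpha / 2) * n_boot)]
    upper = accuracies[int((1 - alpha / 2) * n_boot) - 1]
    return lower, upper


matches = [1] * 85 + [0] * 15  #hypothetical: 85 of 100 gpt labels match the human labels
lower, upper = bootstrap_ci(matches)
print(lower, upper)
```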

#### Payable with Shots

In [None]:
#define the input variables
prompt = """I am providing an email from a nursing home in response to a potential client.
            If the email mentions payment or insurance, respond with yes.
            Otherwise, respond with no.

            email:{}
            response:{}
            email:{}
            response:{}
            email:{}
            response:"""

model = 'gpt-3.5-turbo'

#choose shots and remove from original dataframe
shots = [25, 46]
shots_list, df_shotless = prepare_shots(df, 'email_message', 'payable', shots)

#create a sample from the dataframe without shots
df_test = df_shotless.sample(100)

#calculate price before running; the final '{}' keeps the email placeholder in place
calculate_job_price(prompt.format(*shots_list, '{}'), df_test['email_message'], model, 1)

In [None]:
#uncomment to run a job
#df_test['GPT'] = df_test['email_message'].apply(lambda x: run_chat_complete_job(prompt.format(*shots_list, x), model))

#retrieve completions from the gpt response and clean it
df_test['response'] = df_test['GPT'].apply(lambda x: x.choices[0].message.content)
df_test['response'] = df_test['response'].apply(lambda x: x.strip().lower())

#calculate the accuracy compared to the human responses
accuracy = np.sum(df_test['response'] == df_test['payable']) / len(df_test)
print(round(accuracy, 2))

In [None]:
bootstrap_dist = bootstrap_accuracy(df_test, 'response', 'payable', 1000)

#confidence interval using the percentile method
ci_info = (np.percentile(bootstrap_dist, 2.5), np.percentile(bootstrap_dist, 97.5))

print('sample_accuracy: ', round(bootstrap_dist.mean(), 2))
print('confidence_interval: ', ci_info)

Once a satisfactory confidence interval is achieved, the job can be run on the entire dataset using the same method as above, passing in the entire dataset in place of the sample.