# Creating Synthetic Experts with Generative AI
> ## Label Text with ChatGPT4

Version 1.1   
Date: September 2, 2023    
Author: Daniel M. Ringel   
Contact: dmr@unc.edu   

*Daniel M. Ringel, Creating Synthetic Experts with Generative Artificial Intelligence (July 15, 2023).  
Available at SSRN: https://papers.ssrn.com/abstract_id=4542949*

#### This notebook uses the OpenAI API to communicate with GPT4.
- You need an account with OpenAI for API access
- Visit https://platform.openai.com/signup?launch to sign up
- Beware that using the API comes at a cost: https://openai.com/pricing

# 1. Imports

In [1]:
import pandas as pd
import numpy as np
import openai
import re, os, signal, datetime, warnings
from bs4 import BeautifulSoup

warnings.filterwarnings("ignore", category=UserWarning, module='bs4')
pd.set_option('display.max_colwidth', 300)

# 2. Configure

##### By using this notebook, you agree that the author is not liable for any cost or damages that you incur.

> I ***strongly recommend*** that you set a ***soft limit*** and a ***hard limit*** on your ***OpenAI account*** before running this notebook to prevent excessive cost due to glitches in the API interaction (e.g., unexpected answers from the API lead to ongoing queries that incur cost)

In [2]:
# Put your OpenAI API Key here. DO NOT SHARE YOUR KEY! 
# ----> Always delete your OpenAi API key before sharing the notebook! <-------

api_key = "DEMO-DGDH4Rd4gfsdhRRFgdsgh23rEdsGg3hyEAAFG12SFysd"

if api_key is not None:
    print("!!! Your API Key may be included in this notebook !!!\n\n >>> Do not forget to delete it before you share the notebook <<<")

!!! Your API Key is included in this notebook !!!

 >>> Do not forget to delete it before you share the notebook <<<


In [3]:
# Set Paths
IN_Path = "Data"
IN_File = "Demo_250_SyntheticTwins"
TEMP_Path = "tmp"
TEMP_File = "SyntheticExperts_tmp"
OUT_Path = "Data/out"
OUT_File = "Labeled-Texts"

if not os.path.exists(TEMP_Path): os.makedirs(TEMP_Path)
if not os.path.exists(OUT_Path): os.makedirs(OUT_Path)

In [4]:
# Formulate AI Prompt in RTF (Role, Task, Format) convention
AI_Prompt = "You are a renowned marketing scholar and an expert on the 4 Ps of Marketing: Product, Place, Price, and Promotion. When given a numbered list of Tweets, you examine each Tweet individually. For each Tweet, determine which of the 4 Ps it is about, if any. Output all relevant Ps for each tweet. Use only the terms Product, Place, Price, and Promotion. Do not provide notes or an explanation."

# AI Controls
tokens = 2000 # Maximum number of tokens the model may return per query (passed as max_tokens)
temp = 0 # According to OpenAI, as this value approaches 0, the GPT4 model becomes deterministic in its responses
model = "gpt-4"

# Batch Controls (number of texts per query - need to consider token limits, cost of retries, and size of failed batches)
batch_size = 25

# Set random state (change this for each run when you take majority labels across multiple runs; see Ringel (2023))
seed = 76
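
The seed comment above mentions taking majority labels across multiple runs. A minimal sketch of one way such a vote could be taken for a single text is shown below. The function name `majority_labels`, the strict-majority rule, and the example runs are assumptions for illustration; see Ringel (2023) for the procedure actually used.

```python
from collections import Counter

PS = ("Product", "Place", "Price", "Promotion")

def majority_labels(runs):
    """Return the Ps that a strict majority of runs assigned to one text.

    `runs` is a list of label strings such as "Product, Price" (one per run).
    """
    votes = Counter()
    for labels in runs:
        # Count each P at most once per run
        found = {p for p in PS if p.lower() in labels.lower()}
        votes.update(found)
    threshold = len(runs) / 2
    return [p for p in PS if votes[p] > threshold]

# Three hypothetical runs (different seeds) for the same text
runs_for_text = ["Product, Price", "Product", "Product, Price"]
print(majority_labels(runs_for_text))
```

With an odd number of runs, a strict majority avoids ties.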

**Notes on batch size:** 

- At the time of developing this notebook, the performance of OpenAI's API varied dramatically by 
> *weekday* **x** *time of day* **x** *internet connection* **x** *model used* **x** *number of tokens* 
- In general, I found:
    - smaller batches were less prone to API communication errors than larger batches.
    - longer texts work better in smaller batches
    - runtime dramatically increases during business hours
    - format of AI response deviates more during business hours and early evening, which can lead to errors in response processing
    
***My take-away:*** Create Synthetic Experts overnight on weekends and keep batch size at a moderate level, especially for longer texts (i.e., more tokens)
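
To see how a given batch size splits the texts, the inclusive start and end positions of each batch can be computed up front. This is a small sketch; `batch_ranges` is a hypothetical helper, not part of the notebook. Unlike the notebook's labeling function, it keeps a final batch of any size, including a single leftover text.

```python
def batch_ranges(n_texts, batch_size):
    """Return inclusive (start, end) row positions for each batch."""
    num_full, remainder = divmod(n_texts, batch_size)
    ranges = [(b * batch_size, (b + 1) * batch_size - 1) for b in range(num_full)]
    if remainder:
        # Final, smaller batch with the leftover texts
        ranges.append((num_full * batch_size, n_texts - 1))
    return ranges

ranges = batch_ranges(250, 25)
print(len(ranges), ranges[:2])
```

For the 250 demo texts and a batch size of 25, this gives 10 batches, starting with positions 0 to 24 and 25 to 49.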

# 3. Helper Functions

**Note from author:** These functions are coded for functionality, not for speed, elegance, or readability (i.e., they are not fully Pythonic). Refactor them as needed.

The code in the function *classify_by_ai* is rather extensive to catch errors, retry queries, and collect failed batches. While shorter solutions are possible, I found that the current state of OpenAI's API and models calls for extensive error catching and handling.
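
The retry idea can be illustrated in isolation. The sketch below is a simplified stand-in, not the notebook's function: `call_with_retries` and `flaky_query` are hypothetical names, and the simulated failure replaces a real API call.

```python
def call_with_retries(func, max_tries=5):
    """Call func() until it succeeds or max_tries is reached.

    Returns (result, tries_used); result is None if every try failed.
    """
    for attempt in range(1, max_tries + 1):
        try:
            return func(), attempt
        except Exception as err:  # broad on purpose: any error triggers a retry
            print(f"Try {attempt} failed: {err}")
    return None, max_tries

# Hypothetical stand-in for an API call that fails twice, then succeeds
calls = {"n": 0}

def flaky_query():
    calls["n"] += 1
    if calls["n"] < 3:
        raise ValueError("unexpected response format")
    return "0: Product"

result, tries = call_with_retries(flaky_query)
print(result, tries)
```

Capping the number of tries matters here because every retry is a paid query.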

In [5]:
def build_query(dataframe, start=0, end=0):
    """Function that builds the AI_query"""
    AI_Query = "".join([f"{i}: {dataframe.iloc[i]['Text']}\n" for i in range(start, end+1)])
    return AI_Query

def handle_interrupt(signum, frame):
    """Function to handle interrupts"""
    print("Interrupt signal received. Exiting...")
    exit(0)

def ask_gpt(AI_Prompt, AI_Query, tokens=2000, temp=1, model="gpt-4"):
    """Function that queries the OpenAI API (ChatCompletion interface of the openai package, versions before 1.0)"""
    response = openai.ChatCompletion.create(
        model=model,
        messages=[
            {"role": "system", "content": AI_Prompt},
            {"role": "user", "content": AI_Query}],
        max_tokens=tokens,
        temperature=temp,)
    return response

def process_response(answer, start, end, retry_count):
    """Function that Processes AI Response. Note: For the original project about marketing mix variables, 'FourP' corresponds to the returned labels. """
    # Extract content from answer
    if 'message' in answer.choices[0]:
        answer_content = answer.choices[0].message.content
    elif 'text' in answer.choices[0]:
        answer_content = answer.choices[0].text
    else: 
        raise ValueError("Processing Error: Cannot find model response text")
    # Get the token usage
    used_tokens = answer['usage']['total_tokens']
    # Pre-process message content
    # answer_content = answer_content.replace("###", "") #optional if you are passing additional information behind a separator (e.g., ###)
    lines = [line.strip() for line in answer_content.split('\n') if line.strip()]
    results = []
    for line in lines:
        try:
            index = int(re.findall(r'^(\d+)', line.strip())[0])
            text = re.findall(r'^\d+[:.\s](.*)$', line.strip())[0].strip()
        except IndexError:
            continue
        if retry_count < 3:
            if index is None:
                raise ValueError("Response missing [index]")
            if text is None or len(text) == 0:
                raise ValueError("Response missing [text]")
        results.append((index, text))  
    if len(results) == 0:
        raise ValueError("No index returned with content")
    # Create DataFrame
    FourP = pd.DataFrame(results, columns=["Index", "Content"]).set_index('Index', drop=True).rename_axis(None)
    FourP = FourP[~FourP.index.duplicated(keep='first')]
    # Check if the indices are within the range [start, end]
    indices = FourP.index.tolist()
    if not all(start <= index <= end for index in indices):
        raise ValueError("Returned indices do not correspond to input indices")    
    return FourP, used_tokens

def classify_by_ai(AI_Prompt, batch_size, model, tokens, temp, data, interims_file):
    """Function that gets labels for texts from the AI, batch by batch"""
    counter, sum_tokens, consecutive_fails = 1, 0, 0
    data_len = len(data)
    num_full_batches, remainder = divmod(data_len, batch_size)
    failed_batches = pd.DataFrame(columns=['start', 'end'])  # DataFrame to store the failed batches
    signal.signal(signal.SIGINT, handle_interrupt)    
    def process_batch(start, end, max_tries=5):
        nonlocal consecutive_fails, sum_tokens, counter
        print(f"\nstart: {start}, end: {end}")
        AI_Query = build_query(data, start, end)
        #signal.signal(signal.SIGINT, handle_interrupt) #optional for local handling (set to global 5 lines earlier)
        tries_query = 0
        while tries_query < max_tries:
            try:
                print(datetime.datetime.now(), f"Querying OpenAI: Try {tries_query+1}")
                if "gpt" in model:
                    response = ask_gpt(AI_Prompt, AI_Query, tokens, temp, model)
                else:
                    raise ValueError("Unknown Model Specification")
                try:
                    FourP, used_tokens = process_response(response, start, end, tries_query)
                    if 'Content' not in FourP.columns:
                        raise ValueError("Expected 'Content' column in FourP DataFrame")                    
                    sum_tokens += used_tokens
                    data.loc[FourP.index, '4P'] = FourP['Content'].values
                    consecutive_fails = 0
                    return True
                except ValueError as ve:
                    print(f"Unexpected AI response. Try {tries_query+1}, Error: {ve}")
                    tries_query += 1
            except Exception as e:
                print(f"Error: {e}")
                tries_query += 1
        print(f"Failed querying OpenAI {max_tries} times at batch {counter}.")
        consecutive_fails += 1
        new_row = {'start': start, 'end': end}
        failed_batches.loc[counter] = new_row
        return False
    for batch_num in range(num_full_batches):
        if consecutive_fails == 5:
            print("5 consecutive fails encountered. Stopping the process.")
            return data, sum_tokens, failed_batches
        start, end = batch_num * batch_size, (batch_num + 1) * batch_size - 1
        process_batch(start, end)
        if counter % 10 == 0:
            data.to_pickle(f"{interims_file}.pkl")
            failed_batches.to_pickle(f"{interims_file}_Failed_batches.pkl")
            print(f"Interim Results Saved: Batch {counter}")
        counter += 1
        print(f"Total Tokens used so far: {sum_tokens}")
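    # Note: the final partial batch is only queried if at least 2 texts remain; a single leftover text is not labeled
    if remainder >= 2: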
        start, end = num_full_batches * batch_size, num_full_batches * batch_size + remainder - 1
        process_batch(start, end, max_tries=3)
        if counter % 10 == 0:
            data.to_pickle(f"{interims_file}.pkl")
            failed_batches.to_pickle(f"{interims_file}_Failed_batches.pkl")
            print(f"Interim Results Saved: Batch {counter}")        
        print(f"Total Tokens used so far: {sum_tokens}")
    return data, sum_tokens, failed_batches

def clean_and_parse_text(text):
    """Function that cleans up texts by (1) parsing HTML, (2) replacing URLs with the placeholder URL, (3) removing line breaks and leading periods and colons, and (4) removing leading, trailing, and duplicate spaces."""
    text = re.sub(r"https?://\S+|www\.\S+", " URL ", text)
    parsed = BeautifulSoup(text, "html.parser").get_text() if "filename" not in str(BeautifulSoup(text, "html.parser")) else None
    return re.sub(r" +", " ", re.sub(r'^[.:]+', '', re.sub(r"\\n+|\n+", " ", parsed or text)).strip()) if parsed else None

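def boolean_ps(frame):
    """Function that checks which Ps the AI identified and creates a Boolean column for each P"""
    def as_items(x):
        # Strings are split on ", "; lists are used as-is; missing labels (e.g., NaN from failed batches) yield no items
        if isinstance(x, str):
            return x.split(", ")
        return x if isinstance(x, (list, tuple, set)) else []
    for p in ['Product', 'Place', 'Price', 'Promotion']:
        frame[p] = frame['4P'].apply(
            lambda x: any(p.lower() in item.lower() for item in as_items(x)))
    return frame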
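
The line parsing inside `process_response` can be tried in isolation. The sketch below applies the same regular-expression pattern to a made-up response string; `parse_numbered_lines` and the sample text are illustrative assumptions, and the checks on retries and index ranges are left out.

```python
import re

def parse_numbered_lines(answer_content):
    """Extract (index, label text) pairs from lines such as '12: Product, Price'."""
    results = []
    for line in answer_content.split("\n"):
        # A leading number, one separator (colon, period, or whitespace), then the labels
        match = re.match(r"^(\d+)[:.\s](.*)$", line.strip())
        if match:
            results.append((int(match.group(1)), match.group(2).strip()))
    return results

sample = "0: Product\n1. Price, Promotion\nNote: ignored line\n2 Place"
print(parse_numbered_lines(sample))
```

Lines that do not start with a number, such as stray notes from the model, are skipped rather than treated as errors.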

# 4. Load, Parse, and Clean Demo Text
These 250 demo texts are ***Synthetic Twins*** of real Tweets. I do not publish real (i.e., original) Tweets with this notebook.  
> ***Synthetic Twins*** correspond semantically in idea and meaning to original texts. However, wording, people, places, firms, brands, and products were changed by an AI. As such, ***Synthetic Twins*** mitigate, to some extent, possible privacy and copyright concerns. If you'd like to learn more about ***Synthetic Twins***, another generative AI project by Daniel Ringel, then please get in touch! dmr@unc.edu  


You can ***create your own Synthetic Twins of texts*** with this Python notebook:   `SyntheticExperts_Create_Synthetic_Twins_of_Texts.ipynb`,   
available as a BETA version (still being tested) in the **Synthetic Experts [GitHub](https://github.com/dringel/Synthetic-Experts)** repository.

In [6]:
# Load Texts
df = pd.read_pickle(f"{IN_Path}/{IN_File}.pkl")
df['Text'] = df['Text'].apply(clean_and_parse_text)

  parsed = BeautifulSoup(text, "html.parser").get_text() if "filename" not in str(BeautifulSoup(text, "html.parser")) else None


You may see a warning from Beautiful Soup when it finds a pattern in text that is similar to a filename. This warning is not a problem for this notebook and for what we are doing here.
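
The regular-expression steps of the cleaning can be followed on a single string. This is a simplified sketch that leaves out the HTML parsing done with Beautiful Soup; `clean_text` and the sample text are hypothetical.

```python
import re

def clean_text(text):
    """Apply the regex-based cleaning steps (without HTML parsing)."""
    text = re.sub(r"https?://\S+|www\.\S+", " URL ", text)  # replace URLs with a placeholder
    text = re.sub(r"\\n+|\n+", " ", text)                    # remove real and escaped line breaks
    text = re.sub(r"^[.:]+", "", text)                       # remove leading periods and colons
    return re.sub(r" +", " ", text).strip()                  # collapse duplicate spaces, trim ends

print(clean_text(".@brand new  deal\nat https://example.com/x now"))
```

Replacing URLs with a fixed placeholder keeps the information that a link was present while shortening the text sent to the model.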

# 5. Label Text with OpenAI's GPT4

> In my experience, the speed at which the AI labels texts and the occurrence of errors in communicating with the API are related to the day and time of day you query the API. Workday afternoons and evenings tend to see more traffic (i.e., queries) to GPT4, which can slow down its responses, lead to time-outs, and create various other errors.

In [7]:
# Set-up OpenAI API Key
openai.api_key = None
openai.api_key = api_key

# Shuffle order while preserving Index
df["original_Index"] = df.index
df = df.sample(frac=1, random_state=seed)
df.reset_index(inplace=True, drop=True)
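
The shuffle keeps the original index in a column so that the original order can be restored after labeling. A small round trip on a toy frame (`df_demo` is made up for illustration) shows the idea:

```python
import pandas as pd

# Toy frame with a non-trivial index
df_demo = pd.DataFrame({"Text": ["a", "b", "c", "d"]}, index=[10, 11, 12, 13])

# Shuffle: remember the original index, then renumber rows 0..n-1
df_demo["original_Index"] = df_demo.index
shuffled = df_demo.sample(frac=1, random_state=76).reset_index(drop=True)

# ... labels would be added here, addressed by the new positions ...

# Restore: put the original index back and sort by it
restored = shuffled.set_index("original_Index").sort_index().rename_axis(None)
print(restored.index.tolist())
```

Renumbering the shuffled rows matters because the batches are built from row positions.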

In [8]:
# Label with GPT-4
interims_file = f"{TEMP_Path}/{TEMP_File}_seed{seed}"
out, total_tokens, failed_batches = classify_by_ai(AI_Prompt, batch_size, model, tokens, temp, df, interims_file)
print(f"\nComplete. Total tokens used: {total_tokens}")
if not failed_batches.empty:
    print(f"WARNING: AI failed to label {len(failed_batches)} batches of texts.\nConsider querying the AI again for just these batches.")


start: 0, end: 24
2023-08-25 10:32:22.412614 Querying OpenAI: Try 1
Total Tokens used so far: 1501

start: 25, end: 49
2023-08-25 10:32:40.588612 Querying OpenAI: Try 1
Total Tokens used so far: 3028

start: 50, end: 74
2023-08-25 10:32:54.851895 Querying OpenAI: Try 1
Total Tokens used so far: 4593

start: 75, end: 99
2023-08-25 10:33:10.620880 Querying OpenAI: Try 1
Total Tokens used so far: 6125

start: 100, end: 124
2023-08-25 10:33:26.595477 Querying OpenAI: Try 1
Total Tokens used so far: 7765

start: 125, end: 149
2023-08-25 10:33:42.059876 Querying OpenAI: Try 1
Total Tokens used so far: 9394

start: 150, end: 174
2023-08-25 10:33:56.395934 Querying OpenAI: Try 1
Total Tokens used so far: 10963

start: 175, end: 199
2023-08-25 10:34:12.059762 Querying OpenAI: Try 1
Total Tokens used so far: 12640

start: 200, end: 224
2023-08-25 10:34:31.620215 Querying OpenAI: Try 1
Total Tokens used so far: 14301

start: 225, end: 249
2023-08-25 10:34:46.568102 Querying OpenAI: Try 1
Interim
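
If some batches fail, their recorded start and end positions can be expanded into the row positions that still need labels. The following is a sketch under assumptions: `rows_to_retry` is a hypothetical helper and the failed batches shown are invented, since this demo run did not report any.

```python
def rows_to_retry(failed):
    """Expand inclusive (start, end) pairs into a sorted list of row positions."""
    rows = set()
    for start, end in failed:
        rows.update(range(start, end + 1))
    return sorted(rows)

failed = [(50, 52), (100, 101)]  # hypothetical failed batches
print(rows_to_retry(failed))
```

With the `failed_batches` DataFrame, the pairs could be taken from its `start` and `end` columns.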

In [9]:
# Code 4Ps to Boolean Columns
out = boolean_ps(out)
out.head()

| | Text | original_Index | 4P | Product | Place | Price | Promotion |
|---|---|---|---|---|---|---|---|
| 0 | Today, I somehow ended up eating a full box of @Triscuit and according to my mom, I'm going to be "Triscuit Tubby" 🤣🤦‍♀️ | 91 | Product | True | False | False | False |
| 1 | Congratulations to the well-deserving duo who received an award for their outstanding support of @StJude. @BeyonceOfficial #GivingBack | 235 | | False | False | False | False |
| 2 | We may have another month of a highly positive jobs report, over a million perhaps, @POTUS better hope numbers don't normalize before next month @BBCNews @rtenews @SkyNews @CNNEE @CBSNews @nbc | 18 | | False | False | False | False |
| 3 | Craving the perfect pizza? Look no further! Experience pizza perfection with our mouthwatering creations. Leave behind the mainstream choices and let us treat you to a pizza like no other. Call us today and enjoy a truly satisfying pizza experience. 🍕😋 URL #PizzaPerfection #IndulgeInDeliciousness | 7 | Product, Place | True | True | False | False |
| 4 | If @Nokia brought back the classic Symbian phone, I would be the first one in the queue to purchase it. Regardless of the price, I would definitely get it. Shut up and take my money. | 113 | Product, Price | True | False | True | False |
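
The Boolean coding can also be followed for a single label string. This is a plain-Python sketch of the same substring check; `code_ps` is a hypothetical helper that returns a dictionary instead of adding DataFrame columns.

```python
PS = ["Product", "Place", "Price", "Promotion"]

def code_ps(label):
    """Map a label string such as 'Product, Price' to one Boolean per P."""
    # Non-string labels (e.g., missing values) are treated as having no Ps
    items = label.split(", ") if isinstance(label, str) else []
    return {p: any(p.lower() in item.lower() for item in items) for p in PS}

print(code_ps("Product, Price"))
print(code_ps(""))
```

An empty label yields False for all four Ps, which matches the rows in the output that have no 4P entry.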


In [10]:
# Reconstruct original index and recode:
out = out.set_index('original_Index').sort_index(ascending=True).rename_axis(None)

# Save
out.to_pickle(f"{OUT_Path}/{OUT_File}_seed{seed}_{model}_run1.pkl")
failed_batches.to_pickle(f"{OUT_Path}/{OUT_File}_seed{seed}_{model}_failed_run1.pkl")

In [11]:
print("If you use this notebook's code, please give credit to the author by citing the paper:\n\nDaniel M. Ringel, Creating Synthetic Experts with Generative Artificial Intelligence (July 15, 2023).\nAvailable at SSRN: https://papers.ssrn.com/abstract_id=4542949")

If you use this notebook's code, please give credit to the author by citing the paper:

Daniel M. Ringel, Creating Synthetic Experts with Generative Artificial Intelligence (July 15, 2023).
Available at SSRN: https://papers.ssrn.com/abstract_id=4542949
