# Creating Synthetic Experts with Generative AI
> ## Create Synthetic Twins of original texts with OpenAI's GPT4
Version BETA 0.1    
Date: September 5, 2023    
Author: Daniel M. Ringel   
Contact: dmr@unc.edu   


*Daniel M. Ringel, Creating Synthetic Experts with Generative Artificial Intelligence (July 15, 2023).  
Available at SSRN: https://papers.ssrn.com/abstract_id=4542949*

#### This notebook uses the OpenAI API to communicate with GPT4
- You need an account with OpenAI for API access
- Visit https://platform.openai.com/signup?launch to sign up
- Beware that using the API comes at a cost: https://openai.com/pricing

# *Synthetic Twins*
 
***Synthetic Twins*** correspond semantically in idea and meaning to original texts. However, wording, people, places, firms, brands, and products were changed by an AI. As such, ***Synthetic Twins*** mitigate, to some extent, possible privacy and copyright concerns. If you'd like to learn more about ***Synthetic Twins***, another generative AI project by Daniel Ringel, then please get in touch! dmr@unc.edu <br><br><br>

# 1. Imports

In [1]:
import pandas as pd, numpy as np, openai, re, os, signal, emoji, datetime, warnings
from bs4 import BeautifulSoup
warnings.filterwarnings("ignore", category=UserWarning, module='bs4')
pd.set_option("display.max_colwidth", 200)

# 2. Paths and Filenames

In [2]:
# Set Paths
IN_Path = "Data"
IN_File = "YourFile"

TEMP_Path = "Temp"
TEMP_File = "SyntheticTwins_YourBrand_TMP"

OUT_Path = IN_Path
OUT_File = "SyntheticTwins_of_YourBrand"

if not os.path.exists(TEMP_Path): os.makedirs(TEMP_Path)
if not os.path.exists(OUT_Path): os.makedirs(OUT_Path)

# 3. Configure AI interaction

##### By using this notebook, you agree that the author is not liable for any cost or damages that you incur.

> I ***strongly recommend*** that you set a ***soft limit*** and a ***hard limit*** on your ***OpenAI account*** before running this notebook to prevent excessive cost due to glitches in the API interaction (e.g., unexpected answers from the API that lead to repeated queries, each of which incurs cost).

In [3]:
# Put your OpenAI API Key here. DO NOT SHARE YOUR KEY! 
# ----> Always delete the key before sharing notebook! <-------
api_key = "DEMO-DGDH4Rd4gfsdhRRFgdsgh23rEdsGg3hyEAAFG12SFysd"

if api_key is not None:
    print("!!! Your API Key may be included in this notebook !!!\n\n >>> Do not forget to delete it before you share the notebook <<<")

!!! Your API Key may be included in this notebook !!!

 >>> Do not forget to delete it before you share the notebook <<<


In [4]:
# AI Prompting: Construct system prompt to query AI with

# Brand replacement: You might want to systematically replace a focal brand with an identifier in your Twins
FclBrand = "DMRBrand & Glitch (and any variations in writing such as DMRBrand and @dmrbrand)"
TwinBrand = "SynFcl"

# Define RTF prompt: Role, Task, Format. Preserve the original focal brand name by removing the third line for "Task": "Replace the brand {FclBrand} with the name {TwinBrand}. \"
Role = "You are a marketing scholar and creative social media user. \
You have a deep understanding of the marketing mix, specifically the 4 Ps of Marketing: \
Product, Place, Price and Promotion."

Task = f"Given a numbered list of Tweets, you generate a similar Tweet in meaning and in regard to \
the 4Ps of marketing that each Tweet in the list pertains to. \
Replace the brand {FclBrand} with the name {TwinBrand}. \
Important: Don't use the same brands and people in mentions (@) and hashtags (#) in your text. \
Replace them with similar REAL brands and people. Be creative! Replace and introduce emojis, hashtags, and mentions where appropriate."

Format = "Use the same numbering for your answer as the input list."

AI_Prompt = f"{Role} {Task} {Format}"
print(f"System prompt for AI:\n\n{AI_Prompt}")

System prompt for AI:

You are a marketing scholar and creative social media user. You have a deep understanding of the marketing mix, specifically the 4 Ps of Marketing: Product, Place, Price and Promotion. Given a numbered list of Tweets, you generate a similar Tweet in meaning and in regard to the 4Ps of marketing that each Tweet in the list pertains to. Replace the brand DMRBrand & Glitch (and any variations in writing such as DMRBrand and @dmrbrand) with the name SynFcl. Important: Don't use the same brands and people in mentions (@) and hashtags (#) in your text. Replace them with similar REAL brands and people. Be creative! Replace and introduce emojis, hashtags, and mentions where appropriate. Use the same numbering for your answer as the input list.
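
The comment in the cell above notes that the focal brand can be preserved by dropping the replacement line from the Task. A minimal sketch of turning that into a switch follows; `build_task` and `replace_brand` are hypothetical names that do not exist in the notebook, and the prompt wording is copied from the cell above.

```python
def build_task(fcl_brand, twin_brand, replace_brand=True):
    """Assemble the Task part of the system prompt (hypothetical helper)."""
    parts = [
        "Given a numbered list of Tweets, you generate a similar Tweet in meaning and in regard to "
        "the 4Ps of marketing that each Tweet in the list pertains to.",
    ]
    # Only ask for the brand swap when the focal brand should be hidden
    if replace_brand:
        parts.append(f"Replace the brand {fcl_brand} with the name {twin_brand}.")
    parts.append(
        "Important: Don't use the same brands and people in mentions (@) and hashtags (#) in your text. "
        "Replace them with similar REAL brands and people. Be creative! "
        "Replace and introduce emojis, hashtags, and mentions where appropriate."
    )
    return " ".join(parts)


task_with = build_task("DMRBrand & Glitch", "SynFcl")
task_without = build_task("DMRBrand & Glitch", "SynFcl", replace_brand=False)
print("SynFcl" in task_with, "SynFcl" in task_without)
```

The returned string could be used in place of `Task` when composing `AI_Prompt`.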


In [5]:
# AI Interaction:

# OpenAI API GPT controls
tokens = 2000      # Maximum number of tokens the model may generate per response (passed as max_tokens). As a rule of thumb, the number of words in a text corresponds to roughly 75% of its tokens.
temp = 1           # According to OpenAI, as this value approaches 0, the GPT4 model becomes deterministic in its responses
model = "gpt-4"    # You can also try a different model, e.g., "gpt-3.5-turbo"

# Batch Controls: How many texts to send per query
batch_size = 10              

**Notes on batch size:** 

- At the time of developing this notebook, the performance of OpenAI's API varied dramatically by 
> *weekday* **x** *time of day* **x** *internet connection* **x** *model used* **x** *number of tokens* 
- In general, I found:
    - smaller batches were less prone to API communication errors than larger batches.
    - longer texts work better in smaller batches
    - runtime dramatically increases during business hours
    - format of AI response deviates more during business hours and early evening, which can lead to errors in response processing
    
***My take-away:*** Create Synthetic Twins overnight on weekends and keep batch size at moderate level, especially for longer texts (i.e., more tokens)
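
To get a feel for how batch size and text length interact, the 75% rule of thumb mentioned in the controls cell can be turned into a rough estimate. This is a sketch under stated assumptions, not a tokenizer: `estimate_tokens` and `suggest_batch_size` are hypothetical helpers, and the sketch assumes that each Twin is about as long as its original text.

```python
import math


def estimate_tokens(text, words_per_token=0.75):
    """Rough token estimate from a word count (rule of thumb, not a tokenizer)."""
    return math.ceil(len(text.split()) / words_per_token)


def suggest_batch_size(texts, max_tokens=2000, max_batch=10):
    """Largest batch (up to max_batch) whose answers should fit into max_tokens.

    Assumption: every answer is about as long as the longest input text.
    """
    longest = max((estimate_tokens(t) for t in texts), default=1)
    longest = max(1, longest)  # guard against empty texts
    return max(1, min(max_batch, max_tokens // longest))


posts = ["Looooove this shirt from DMRBrand", "This shipping is trash. I just want my jeans"]
print(estimate_tokens(posts[0]), suggest_batch_size(posts))
```

For short posts the suggestion simply hits `max_batch`; the estimate only starts to matter for longer texts.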

# 4. Helper Functions

***Note from author:*** These functions are coded for functionality, not for speed, elegance, or best readability (i.e., not fully pythonic). Refactor them as needed.

The code in the function *twins_from_ai* is rather extensive to catch errors, retry queries, and collect failed batches. While shorter solutions are possible, I found that the current state of OpenAI's API and models calls for extensive error catching and handling.
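
The core idea of retrying a query a fixed number of times can be looked at in isolation. The following is a simplified sketch, not the notebook's implementation: `query_with_retries` and `flaky_query` are hypothetical names, and the flaky function only stands in for an API call.

```python
def query_with_retries(query_fn, max_tries=5):
    """Call query_fn until it succeeds or max_tries is reached.

    Returns (result, tries_used); result is None if every try failed.
    """
    for attempt in range(1, max_tries + 1):
        try:
            return query_fn(), attempt
        except Exception as err:  # broad on purpose: API and parsing errors alike
            print(f"Try {attempt} failed: {err}")
    return None, max_tries


# Stand-in for an API call that fails twice before answering
calls = {"n": 0}


def flaky_query():
    calls["n"] += 1
    if calls["n"] < 3:
        raise ValueError("Unexpected AI response")
    return "0. Example twin"


result, tries = query_with_retries(flaky_query)
print(result, tries)
```

A batch for which `result` is `None` would be recorded as failed and collected again later.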

In [6]:
def clean_and_parse_text(text):
    """Function that cleans text from URLs, phone numbers, e-mail addresses, social security numbers, and HTML code. Also removes line breaks and excessive spaces."""
    text = re.sub(r"https?://\S+|www\.\S+", " URL ", text)
    text = re.sub(r"\b\(?(\+1)?\s?[0-9]{3}\)?[-.\s]?[0-9]{3}[-.\s]?[0-9]{4}\b", " PHONENUMBER ", text)
    text = re.sub(r"\b[a-zA-Z0-9._%+-]+@[a-zA-Z0-9.-]+\.[a-zA-Z]{2,}\b", " EMAILADDRESS ", text)
    text = re.sub(r"\b\d{3}-?\d{2}-?\d{4}\b", " SSNUM ", text)    
    parsed = BeautifulSoup(text, "html.parser").get_text() if "filename" not in str(BeautifulSoup(text, "html.parser")) else None
    return re.sub(r" +", " ", re.sub(r'^[.:]+', '', re.sub(r"\\n+|\n+", " ", parsed or text)).strip()) if parsed else None

def build_query(dataframe, start=0, end=0):
    """Function that builds the AI_query"""
    AI_Query = "".join([f"{i}. {dataframe.iloc[i]['Text']} \n" for i in range(start, end+1)])
    return AI_Query


def handle_interrupt(signum, frame):
    """Function to handle interrupts (parameter renamed so it does not shadow the signal module)"""
    print("Interrupt signal received. Exiting...")
    exit(0)

def ask_gpt(AI_Prompt, AI_Query, tokens=2000, temp=1, model="gpt-4"):
    """Function that queries the OpenAI API. Uses the legacy openai.ChatCompletion interface (openai package versions before 1.0)"""
    response = openai.ChatCompletion.create(
        model=model,
        messages=[
            {"role": "system", "content": AI_Prompt},
            {"role": "user", "content": AI_Query}
        ],
        max_tokens=tokens,
        temperature=temp,
    )
    return response

def process_response(answer, start, end, retry_count):
    """Function that Processes AI Response"""
    if 'message' in answer.choices[0]:
        answer_content = answer.choices[0].message.content
    elif 'text' in answer.choices[0]:
        answer_content = answer.choices[0].text
    else:
        raise ValueError("Processing Error: Cannot find answer text")
    used_tokens = answer['usage']['total_tokens']
    answer_content = answer_content.replace("###", "")
    lines = [line.strip() for line in answer_content.split('\n') if line.strip()]
    results = []
    for line in lines:
        # Each line must look like "<index>. <text>" (":" or whitespace after the index is also accepted)
        match = re.match(r'^(\d+)[:.\s](.*)$', line)
        if match is None:
            raise ValueError("Response line missing [index] or [text]")
        results.append((int(match.group(1)), match.group(2).strip()))
    if len(results) == 0:
        raise ValueError("No index returned with texts")
    Twins = pd.DataFrame(results, columns=["Index", "Text"]).set_index('Index', drop=True).rename_axis(None)
    Twins = Twins[~Twins.index.duplicated(keep='first')]
    indices = Twins.index.tolist()
    if not all(start <= index <= end for index in indices):
        raise ValueError("Returned indices do not correspond to input indices")      
    return Twins, used_tokens

def twins_from_ai(AI_Prompt, batch_size, model, tokens, temp, data, interims_file):
    """Function that queries Synthetic Twins of text from AI"""
    counter = 1
    sum_tokens = 0
    consecutive_fails = 0
    data_len = len(data)
    num_full_batches = data_len // batch_size
    remainder = data_len % batch_size
    failed_batches = pd.DataFrame(columns=['start', 'end'])
    for batch_num in range(num_full_batches):
        if consecutive_fails == 5:
            print("5 consecutive failures encountered. Stopping the process.")
            return data, sum_tokens, failed_batches
        start = batch_num * batch_size
        end = start + batch_size - 1
        print(f"start: {start}, end: {end}")
        AI_Query = build_query(data, start, end)
        signal.signal(signal.SIGINT, handle_interrupt)
        max_tries_query = 5
        tries_query = 0
        while tries_query < max_tries_query:
            try:
                print(datetime.datetime.now())
                print(f"Querying OpenAI: Try {tries_query+1}")
                if "gpt" in model:
                    response = ask_gpt(AI_Prompt, AI_Query, tokens, temp, model)
                else:
                    print("Unknown Model Specification")
                try:
                    Twins, used_tokens = process_response(response, start, end, tries_query)
                    sum_tokens += used_tokens
                    data.loc[Twins.index, 'Twin'] = Twins['Text'].values
                    consecutive_fails = 0
                    break
                except ValueError as ve:
                    print(f"Unexpected AI response at start {start} until {end}. Try {tries_query+1}")
                    print(f"Processing Error: {ve}")
                    tries_query += 1
            except Exception as e:
                print(f"Error: {e}")
                tries_query += 1
        if tries_query == max_tries_query:
            print(f"Failed querying OpenAI {max_tries_query} times at batch {counter}. Moving to the next batch.")
            consecutive_fails += 1
            new_row = pd.DataFrame({'start': [start], 'end': [end]})
            failed_batches = pd.concat([failed_batches, new_row], ignore_index=True)
            counter += 1
            continue
        if counter % 10 == 0:    
            data.to_pickle(interims_file)
            failed_batches.to_pickle(f"{interims_file}_Failed_batches.pkl")
            print(f"Interim Results Saved: Batch {counter}")
        counter += 1
        print(f"Total Tokens used so far: {sum_tokens}")
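    # Process the remaining rows. Note from author: This code is repetitive and could be refactored
    # Caution: with this condition, a single leftover row (remainder == 1) is not queried and gets no Twin
    if remainder >= 2: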
        start = num_full_batches * batch_size
        end = start + remainder - 1
        print(f"start: {start}, end: {end}")
        AI_Query = build_query(data, start, end)
        signal.signal(signal.SIGINT, handle_interrupt)
        max_tries_query = 3
        tries_query = 0
        while tries_query < max_tries_query:
            try:
                print(f"Querying OpenAI: Try {tries_query+1}")
                if "gpt" in model:
                    response = ask_gpt(AI_Prompt, AI_Query, tokens, temp, model)
                else:
                    print("Unknown Model Specification")
                try:
                    Twins, used_tokens = process_response(response, start, end, tries_query)
                    sum_tokens += used_tokens
                    data.loc[Twins.index, 'Twin'] = Twins['Text'].values
                    consecutive_fails = 0
                    break
                except ValueError as ve:
                    print(f"Unexpected AI response at start {start} until {end}. Try {tries_query+1}")
                    print(f"Processing Error: {ve}")
                    tries_query += 1
            except Exception as e:
                print(f"Error: {e}")
                tries_query += 1
        if tries_query == max_tries_query:
            print(f"Failed querying OpenAI {max_tries_query} times at batch {counter}. Moving to the next batch.")
            consecutive_fails += 1
            new_row = pd.DataFrame({'start': [start], 'end': [end]})
            failed_batches = pd.concat([failed_batches, new_row], ignore_index=True)
        else:
            if counter % 10 == 0:
                data.to_pickle(interims_file)
                failed_batches.to_pickle(f"{interims_file}_Failed_batches.pkl")
                print(f"Interim Results Saved: Batch {counter}")
            print(f"Total Tokens used so far: {sum_tokens}")
    return data, sum_tokens, failed_batches
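
The author's note about repetitive code refers to the separate handling of full batches and the remainder. One way to remove that split is to compute all batch boundaries up front. This is a sketch of one possible refactoring, with `batch_ranges` as a hypothetical helper. Unlike the loop above, it also yields a final batch that holds a single row.

```python
def batch_ranges(n_rows, batch_size):
    """Yield inclusive (start, end) row positions that cover all n_rows."""
    for start in range(0, n_rows, batch_size):
        yield start, min(start + batch_size, n_rows) - 1


print(list(batch_ranges(25, 10)))  # [(0, 9), (10, 19), (20, 24)]
```

A single loop over these pairs could then call `build_query` and the retry logic once per batch, regardless of whether the batch is full.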

# 5. Load Texts

For demo purposes, I created exemplary data (i.e., micro blog posts) in this notebook based on real posts. You can easily load your own texts by uncommenting the respective code below.

In [7]:
# Load Texts and Clean
# original = pd.read_pickle(f"{IN_Path}/{IN_File}.pkl")  # or: original = pd.read_excel(f"{IN_Path}/{IN_File}.xlsx")
# original = original[["created_at", "text"]]
# original["Text"] = original.text.apply(clean_and_parse_text)
# original = original.drop(columns=['text'])
# original.reset_index(inplace=True, drop=True)

In [8]:
# Create exemplary data
raw_data = {
    'id': [2232, 121, 1778, 4533, 4555, 3430, 9198, 7701, 7027, 7534, 4497, 1386, 2358, 2890, 9163, 8628, 9856, 4639, 1569, 3250, 72, 1972, 6451, 3007, 5091],
    'created_at': ['2023-09-5T10:00:00.000Z']*25,
    'text': [
        "My favorite cologne was from DMRBrand & Glitch and those hosers really discontinued it and now I can’t find it anywhere. 😪",
        "Why yes I did wake up at 3am because of my cats and decide to buy this @DMRBrand jacket that I wanted that was finally back in my size",
        "Found my perfect pair of @DMRBrand jeans but they don’t come in black. Bought another pair and I’m going to attempt an at home dye job 😂🤞🏽",
        "I have been hacked 2x in 2 months. @Chase this is seriously unacceptable and I need those funds returned. I never use @PayPal nor do I shop @DMRBrand in Stockton, CA- wtf is going on with this shit?!!!",
        "To celebrate this New Year, @DMRBrand is DOUBLING all donations up to $25! 🎉 Your donation will help us answer 2X the calls, texts, and chats that come in, allow us to train 2X more volunteers, and reach 2X the number of LGBTQ young people: URL 📲 URL",
        "I don't have the jeans but I do have the season of flannel for my entry #denimyourway #seasonofflannel #castingcall #gym #workout #muscle #nutrition #health @DMRBrand URL",
        "hanz, joe, and i like DMRBrand a little too much...sponsor us? @DMRBrand",
        "Looooove this shirt from DMRBrand 😍 URL!!! Shoot me an e-mail to dmr@unc.edu, if you got questions about that!",
        "Stills from my latest video, “Forming Outfits Around My Favorite DMRBrand & Glitch Pieces” ☺️ Go check it & make sure to hit that subscribe button!!! @DMRBrand URL #style #fashion #mydmrbrand #dmrbrandstyle URL",
        "This @DMRBrand sweater is on major sale right now, with all sizes still available (which never happens). Shop it here: URL URL",
        "Everyone know’s I’m @DMRBrand’s #1 fan but... I received my order with 5 things missing (it happens, whatever) so I reached out and they were happy to resend what I was missing which is why I love them !!",
        "Remember the mini leather puffer S has been loving from @DMRBrand? They have another super similar one on sale that's faux fur and SO cute (and cozy.) Link: URL URL",
        "Waited in line at @DMRBrand with @jordanknight at the South Shore Plaza back when I was in high school. 'Excuse me....are you Jor-' 'Yup' 'Cool' URL",
        "Looooove this shirt from @DMRBrand 😍 URL",
        "While working at @DMRBrand back in the day, I helped @Seal pick out cargo pants. I believe his credit card actually said “Seal.” URL",
        "This is what I get from ordering from @dmrbrand at my big age. This shipping is trash. I just want my jeans 😭",
        "#Millennials are the greatest generation! We are trend setters. At @DMRBrand with my daughter and seeing kids in @Nike Air Force 1s. I was rocking these in middle school",
        "Outerwear is IN at @DMRBrand! 🧥 Warm and woolly, fun and fuzzy coats and jackets are waiting for you. Stop in and shop the sale! // #TownSquareLV URL",
        "9 MORE STYLES I AM LOVING FOR WINTER >> URL >> @AnnTaylor @UGG @shopbop @DMRBrand #fashion #winterfashion #fashionblog #style #OOTD URL",
        "The @DMRBrand perfume and the rose candle I got is such a good combination 😊",
        "I haven’t bought jeans in years and today I decided to buy some at @DMRBrand, little did I know I’m not a size 8  I am actually a 10 . My waist size stayed the same since 1987 . Fudge! 😳",
        "How do I check @DMRBrand Help to see if I missed a bday gift. I've gotten it every year but nothing this year so far",
        "Probably one of my favorite purchases. Ever. @mariahcarey @DMRBrand  🥳🎄🤗 URL. Call 'em at 919-962-8746",
        "Hey guess who hasn’t received their packages from @kohls yet? Ordered on 11/27. Kohl’s response has been: let us know if you don’t get a shipping update by ___. That’s all they do. I had no issue with any other store. @DMRBrandgot me a jacket delivered by @FedEx on 12/24",
        "Longest relationship I’ve had is with my @DMRBrand VIP membership 😊"]}
original = pd.DataFrame(raw_data)

# Clean exemplary data
original["Text"] = original.text.apply(clean_and_parse_text)
original = original.drop(columns=['text'])
original.reset_index(inplace=True, drop=True)

  parsed = BeautifulSoup(text, "html.parser").get_text() if "filename" not in str(BeautifulSoup(text, "html.parser")) else None


**Note:** You may see a warning that BeautifulSoup found something in the text that looks like a filename. This may be the case, but it is more likely attributable to the HTML markup. Hence, you can typically ignore the warning.
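
Before sending texts to the API, it can be useful to check on a single string what the cleaning step masks. The sketch below reuses three regular expressions from `clean_and_parse_text` with only the standard library; it leaves out the social security number and HTML steps, and `mask_identifiers` is a hypothetical name.

```python
import re


def mask_identifiers(text):
    """Mask URLs, phone numbers, and e-mail addresses (no HTML parsing here)."""
    text = re.sub(r"https?://\S+|www\.\S+", " URL ", text)
    text = re.sub(r"\b\(?(\+1)?\s?[0-9]{3}\)?[-.\s]?[0-9]{3}[-.\s]?[0-9]{4}\b", " PHONENUMBER ", text)
    text = re.sub(r"\b[a-zA-Z0-9._%+-]+@[a-zA-Z0-9.-]+\.[a-zA-Z]{2,}\b", " EMAILADDRESS ", text)
    # Collapse the extra spaces introduced by the placeholders
    return re.sub(r" +", " ", text).strip()


masked = mask_identifiers("Mail dmr@unc.edu or call 919-962-8746, see https://example.com/x")
print(masked)
```

The placeholders are padded with spaces, so a space can remain in front of punctuation that followed the masked item.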

# 6. Query OpenAI's GPT4 for Synthetic Twins

In [9]:
# Set OpenAI API Key
openai.api_key = None
openai.api_key = api_key

# Create Synthetic Twins with Generative AI
df = original.copy()
out, total_tokens, failed_batches = twins_from_ai(AI_Prompt, batch_size, model, tokens, temp, df, f"{TEMP_Path}/{TEMP_File}_{model}.pkl")
print(f"\nTotal tokens used in this job: {total_tokens}")

# Save generated Synthetic Twins
out.to_pickle(f"{OUT_Path}/{OUT_File}_{model}.pkl")
failed_batches.to_pickle(f"{OUT_Path}/{OUT_File}_{model}_failed_batches.pkl")
print("\n\nSynthetic Twins saved: If you use this notebook's code, please give credit to the author by citing the paper:\n\nDaniel M. Ringel, Creating Synthetic Experts with Generative Artificial Intelligence (July 15, 2023).\nAvailable at SSRN: https://papers.ssrn.com/abstract_id=4542949")
if not failed_batches.empty: print("\nWARNING: Some batches failed. Please check Failed-Batches and try to collect them again.")

# Model: GPT-4 Seed 42 start  end:  Tokens:   usage: start $9.56 to end $

start: 0, end: 9
2023-09-05 16:40:04.527518
Querying OpenAI: Try 1
Total Tokens used so far: 1061
start: 10, end: 19
2023-09-05 16:40:47.249823
Querying OpenAI: Try 1
Total Tokens used so far: 2078
start: 20, end: 24
Querying OpenAI: Try 1
Total Tokens used so far: 2739

Total tokens used in this job: 2739


Synthetic Twins saved: If you use this notebook's code, please give credit to the author by citing the paper:

Daniel M. Ringel, Creating Synthetic Experts with Generative Artificial Intelligence (July 15, 2023).
Available at SSRN: https://papers.ssrn.com/abstract_id=4542949
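
If the warning about failed batches appears, those batches need to be collected again. A minimal sketch of such a second pass follows. It assumes the failed batches are available as `(start, end)` pairs (with pandas, `failed_batches.itertuples(index=False, name=None)` would be one way to get them) and that `fetch_batch` is a hypothetical callable that queries one batch and returns its Twins or raises on failure.

```python
def recollect(failed, fetch_batch):
    """Retry each failed (start, end) batch once.

    fetch_batch(start, end) is a hypothetical callable that returns a dict
    {row_position: twin_text} or raises an exception on failure.
    """
    twins, still_failed = {}, []
    for start, end in failed:
        try:
            twins.update(fetch_batch(start, end))
        except Exception as err:
            print(f"Batch {start}-{end} failed again: {err}")
            still_failed.append((start, end))
    return twins, still_failed


# Stand-in fetcher: succeeds except for the batch that starts at row 20
def fake_fetch(start, end):
    if start == 20:
        raise RuntimeError("API timeout")
    return {i: f"twin of row {i}" for i in range(start, end + 1)}


twins, still_failed = recollect([(0, 1), (20, 24)], fake_fetch)
print(sorted(twins), still_failed)
```

Batches that remain in `still_failed` may need a smaller batch size or a later attempt.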


In [10]:
# Check-out YOUR Synthetic Twins!
out.head(25)

Unnamed: 0,id,created_at,Text,Twin
0,2232,2023-09-5T10:00:00.000Z,My favorite cologne was fromDMRBrand & Glitch and those hosers really discontinued it and now I can’t find it anywhere. 😪,"My go-to fragrance was from SynFcl and those guys discontinued it, can't find it anywhere now. 😪"
1,121,2023-09-5T10:00:00.000Z,Why yes I did wake up at 3am because of my cats and decide to buy this @DMRBrand jacket that I wanted that was finally back in my size,"Did I really wake up at 3am due to my kittens and decide to order the SynFcl jacket I've been eyeing because it's finally in my size? Yes, I did."
2,1778,2023-09-5T10:00:00.000Z,Found my perfect pair of @DMRBrand jeans but they don’t come in black. Bought another pair and I’m going to attempt an at home dye job 😂🤞🏽,Located my ideal pair of SynFcl jeans but they don't manufacture it in black. Decided to buy another pair and will try out DIY dye at home 😂🤞🏽
3,4533,2023-09-5T10:00:00.000Z,"I have been hacked 2x in 2 months. @Chase this is seriously unacceptable and I need those funds returned. I never use @PayPal nor do I shop @DMRBrand in Stockton, CA- wtf is going on with this shi...","Twice hacked in two months? @WellsFargo, this is seriously distressing and I need my money back. I don't use @Venmo nor do I shop at SynFcl in Fresno, CA - what the heck is going on?!"
4,4555,2023-09-5T10:00:00.000Z,"To celebrate this New Year, @DMRBrand is DOUBLING all donations up to $25! 🎉 Your donation will help us answer 2X the calls, texts, and chats that come in, allow us to train 2X more volunteers, an...","To commemorate the New Year, SynFcl is MATCHING all donations up to $25! 🎉 Your contribution will enable us to respond to 2X the queries, train 2X more volunteers, and connect with 2X the number o..."
5,3430,2023-09-5T10:00:00.000Z,I don't have the jeans but I do have the season of flannel for my entry #denimyourway #seasonofflannel #castingcall #gym #workout #muscle #nutrition #health @DMRBrand URL,I might not own the jeans but I definitely have the flannel season to enter #JeansYourOwnWay #SeasonOfFlannels #AuditionCall #Fitness #Training #BulgingMuscles #HealthyDiet @SynFcl URL
6,9198,2023-09-5T10:00:00.000Z,"hanz, joe, and i like DMRBrand a little too much...sponsor us? @DMRBrand","Me, tom, and bob seem to like SynFcl a little bit too much...consider sponsoring us? @SynFcl"
7,7701,2023-09-5T10:00:00.000Z,"Looooove this shirt from DMRBrand 😍 URL!!! Shoot me an e-mail to EMAILADDRESS , if you got questions about that!","Totally in love with this shirt from SynFcl 😍 URL!!! Feel free to drop me an email at EMAILADDRESS , if you have inquiries about it!"
8,7027,2023-09-5T10:00:00.000Z,"Stills from my latest video, “Forming Outfits Around My Favorite DMRBrand & Glitch Pieces” ☺️ Go check it & make sure to hit that subscribe button!!! @DMRBrand URL #style #fashion #mydmrbrand #dmr...","Images from my current video, ""Creating Outfits Around My Favorite SynFcl Pieces"" ☺️ Do view it and don't forget to hit the subscribe button!!! @SynFcl URL #modish #trendy #MySynFcl #SynFclFashion..."
9,7534,2023-09-5T10:00:00.000Z,"This @DMRBrand sweater is on major sale right now, with all sizes still available (which never happens). Shop it here: URL URL","This SynFcl sweater is greatly discounted currently, with all sizes still in stock (a rare occurrence). Order it here: URL URL."
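
Since the prompt asks the model to replace the focal brand, it is worth checking whether any Twin still contains it. The following is a sketch under assumptions: `find_brand_leaks` is a hypothetical helper, the pattern is only one guess at the brand's spelling variants, and the second sample text is an invented leak for illustration.

```python
import re


def find_brand_leaks(twins, brand_pattern=r"dmr\s*brand"):
    """Return the positions of twins that still contain the focal brand."""
    pattern = re.compile(brand_pattern, flags=re.IGNORECASE)
    return [i for i, text in enumerate(twins) if pattern.search(text)]


sample_twins = [
    "My go-to fragrance was from SynFcl and those guys discontinued it.",
    "Totally in love with this shirt from @DMRBrand",  # invented leak
]
print(find_brand_leaks(sample_twins))
```

With the notebook's data, the list of Twins could be taken from the `Twin` column of `out`, and flagged rows could be queried again.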
