# Creating Synthetic Experts with Generative AI
> ## Prediction with Fine-Tuned Model  
*Batch Edition* for larger Datasets
  
Version 1.0   
Date: September 2, 2023    
Author: Daniel M. Ringel    
Contact: dmr@unc.edu

*Daniel M. Ringel, Creating Synthetic Experts with Generative Artificial Intelligence (July 15, 2023).  
Available at SSRN: https://papers.ssrn.com/abstract_id=4542949*

##### Apple M1/M2 GPU MPS Requirements (Optional)
> See Python Notebook: [Setup-MacBook-M2-Pytorch-TensorFlow-Apr2023.ipynb](https://github.com/dringel/Synthetic-Experts)  

- Mac computer with Apple silicon GPU
- macOS 12.3 or later
- Python 3.7 or later
- Xcode command-line tools: `xcode-select --install`

##### If you have no GPU available, the code falls back to your CPU

# *Synthetic Twins*
This notebook is published with demo data. These data are based on real Tweets but were rewritten by an AI. I call these data ***Synthetic Twins***.  
  
  
***Synthetic Twins*** correspond semantically in idea and meaning to the original texts. However, the wording, people, places, firms, brands, and products were changed by an AI. As such, ***Synthetic Twins*** mitigate, to some extent, possible privacy and copyright concerns. If you'd like to learn more about ***Synthetic Twins***, another generative AI project by Daniel Ringel, please get in touch: dmr@unc.edu  

You can ***create your own Synthetic Twins of texts*** with this Python notebook:   `SyntheticExperts_Create_Synthetic_Twins_of_Texts.ipynb`,   
available as a beta version (still being tested) in the **Synthetic Experts [GitHub](https://github.com/dringel/Synthetic-Experts)** repository.

# 1. Installs

In [1]:
# Required Python Packages:
# !pip3 install beautifulsoup4
# !pip3 install torch torchvision torchaudio
# !pip3 install transformers

# 2. Imports

In [2]:
import os, pandas as pd, numpy as np, torch, warnings, re
from datetime import datetime
from transformers import AutoModelForSequenceClassification, AutoTokenizer, PreTrainedModel
import UseSynExp as synx

# 3. Setup

In [3]:
# Path and Filenames
IN_path = "Data"
IN_file = "Demo_FashionBrand_SyntheticTwins"
OUT_path = IN_path
OUT_file = IN_file

# Create the data directory if needed and remind the user which input file is expected
if not os.path.exists(IN_path):
    os.makedirs(IN_path)
    print(f'Directory "{IN_path}" created ... \nWARNING: You need to copy the "IN_file" ("{IN_file}.pkl") into this directory.')
else:
    print(f'Directory "{IN_path}" already exists:\nMake sure that it contains the "IN_file" ("{IN_file}.pkl")')

Directory "Data" already exists:
Make sure that it contains the "IN_file" ("Demo_FashionBrand_SyntheticTwins.pkl")


In [4]:
print(f"PyTorch version: {torch.__version__}")
device = "mps" if "backends" in dir(torch) and hasattr(torch.backends, 'mps') and torch.backends.mps.is_built() and torch.backends.mps.is_available() else "cuda" if torch.cuda.is_available() else "cpu"
if device == "cpu": print("No GPU found, using >>> CPU <<<, which will be slower.") 
else: print(f"GPU available! Using >>> {device} <<< for inference")

PyTorch version: 2.0.0
GPU available! Using >>> mps <<< for inference


In [5]:
# Set Controls
t = 0.5    # Threshold for positive labels
block_size = 1000  # Set your batch size

# Define HuggingFace Model: The MMX Synthetic Expert
MODEL = "dmr76/mmx_classifier_microblog_ENv02"

# 4. Load Data

In [6]:
# Load Demo Data - Assumes a pickle file with columns created_at and text - can easily change to another format
df = pd.read_pickle(f"{IN_path}/{IN_file}.pkl")  # df = pd.read_excel(f"{IN_path}/{IN_file}.xlsx")
df = df[["created_at", "text"]]  # Keep only the created_at and text columns

# Only the first 2000 texts are used here (comment out the next line to label all texts)
df = df.head(2000)

In [7]:
# OPTIONAL: Load raw Twitter data and save what is needed to pickle (or excel), then load.

# tweets = pd.read_csv(f"{IN_path}/Abercrombie.csv",low_memory=False)
# tweets = tweets[["id",'created_at','text']]
# tweets = tweets.drop_duplicates(subset=["text"])
# tweets.to_pickle(f"{IN_path}/{IN_file}.pkl")  # tweets.to_excel(f"{IN_path}/{IN_file}.xlsx")

# 5. Predict Texts

In [8]:
# Load Model and Tokenizer
model = AutoModelForSequenceClassification.from_pretrained(MODEL)
tokenizer = AutoTokenizer.from_pretrained(MODEL)
model.to(device)
id2label = model.config.id2label # Get id2label from the model's config
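
How the threshold `t` and the `id2label` mapping turn model outputs into label lists is handled inside `UseSynExp` and is not shown in this notebook. The following is a minimal sketch, assuming a multi-label setup in which each logit passes through a sigmoid independently and a label is assigned when its probability reaches the threshold. The function names, the example logits, and the label order in `id2label_demo` are made up for illustration; the real mapping comes from `model.config.id2label`.

```python
import math

def sigmoid(x: float) -> float:
    """Map a raw logit to a probability in (0, 1)."""
    return 1.0 / (1.0 + math.exp(-x))

def logits_to_labels(logits, id2label, threshold=0.5):
    """Return the names of all labels whose probability is >= threshold.

    Assumption: labels are independent (multi-label), so each logit is
    squashed on its own rather than through a softmax over all labels.
    """
    probs = [sigmoid(z) for z in logits]
    return [id2label[i] for i, p in enumerate(probs) if p >= threshold]

# Hypothetical mapping and logits, for illustration only
id2label_demo = {0: "Product", 1: "Place", 2: "Price", 3: "Promotion"}
example_logits = [2.0, -1.5, 0.3, -0.2]

print(logits_to_labels(example_logits, id2label_demo, threshold=0.5))
```

Raising the threshold makes the labeling more conservative: in this sketch, a threshold of 0.6 would keep only the label whose logit is 2.0.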

In [9]:
%%time
# Preprocess and Predict Texts in Batches
df = synx.block_process(df, block_size, model, tokenizer, device, t, id2label)

13:38:31 Starting block labeling:

13:38:42 --> Finished labeling up to 1000 Texts
13:38:51 --> Finished labeling up to 2000 Texts
CPU times: user 13.2 s, sys: 3.2 s, total: 16.4 s
Wall time: 20.6 s
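
The log above reports two blocks of 1,000 texts for the 2,000 texts loaded. The internals of `synx.block_process` are not shown in this notebook, so the following is only a sketch of how fixed-size blocking could work in general. The helper `iter_blocks` is hypothetical and not part of `UseSynExp`.

```python
def iter_blocks(n_rows: int, block_size: int):
    """Yield (start, stop) index pairs that cover range(n_rows) in order.

    The last block is shorter when n_rows is not a multiple of block_size.
    """
    if block_size <= 0:
        raise ValueError("block_size must be a positive integer")
    for start in range(0, n_rows, block_size):
        yield start, min(start + block_size, n_rows)

# Same sizes as in this notebook: 2000 texts, blocks of 1000
for start, stop in iter_blocks(2000, 1000):
    print(f"Would label rows {start} to {stop - 1}")
```

With a DataFrame, each pair could select a slice via `df.iloc[start:stop]`, so that only one block of texts is tokenized and held on the device at a time.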


In [10]:
# Take a look at first 5 texts
pd.set_option("display.max_colwidth", 200)
df[["text", "Labels"]].head(5)

Unnamed: 0,text,Labels
0,Spectacles-> @LensCrafters Coat -> @SynFcl Oxford-> @SynFcl Tee-> @SynFcl Denims-> @SynFcl Hosiery-> @SynFcl Belt-> Not sure! Ankle Boots-> @BananaRepublic URL,[Product]
1,"Absolutely yes, I did wake up at early hours owing to my pet kittens, and made a choice to buy this @SynFcl coat that I desired and was finally back in my size",[Product]
2,Discovered my ideal pair of @SynFcl denim pants but they aren’t made in black. Grabbed another pair and planning to try a DIY dyeing job 😂🤞🏽,[Product]
3,"I've been attacked digitally 2 times in 2 months. @WellsFargo this is seriously intolerable and I require a refund. I don’t ever utilize @Stripe nor am I a customer of @SynFcl in Bakersfield, CA- ...","[Product, Place]"
4,"To embrace the New Year, @SynFcl is MATCHING all contributions up to $25,000! 🎉 Your donation will help us answer twice as many calls, messages, and live chats, enable us to train twice as many vo...",[Promotion]


In [11]:
# Take a look at first 5 texts that are about Price
df[df.Price==1][["text", "Labels"]].head(5)

Unnamed: 0,text,Labels
9,"This @SynFcl pullover is currently available at a significant discount, with all sizes still open for purchase (which is unprecedented). Shop it here: URL URL","[Product, Price, Promotion]"
11,"Remember that compact leather down jacket S couldn't stop raving about from SynFcl? There's another one, quite alike and on sale, made of faux fur. It's irresistibly adorable and warm. Catch the d...","[Product, Price, Promotion]"
17,"Outerwear is the rage at SynFcl! 🧥 A variety of warm, woolly and fun, fuzzy coats and jackets are awaiting you. Visit and take advantage of the sales! #CityCentreLV","[Product, Place, Price, Promotion]"
29,"Outfit: $175, Christmas Dinner: $500, Sneaking off to take off my bra halfway through: Absolutely no price for that comfort 😆Dress- @SynFcl, Jacket & Shoes- @Target #holidayseason #ChristmasCelebr...","[Product, Price]"
30,/1/ Drove an extra 20 minutes to return a @SynFcl knit because the shipping price would've been a pain. Waited another 20 minutes in line and then 10 minutes at the counter. 🙄,"[Product, Place, Price]"
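
Because each text can carry several labels, it can be useful to see which labels co-occur with a given one. The sketch below tallies the label lists of the five Price rows displayed above, using only the standard library; the dictionary `price_rows` simply copies the `Labels` values from that output.

```python
from collections import Counter

# Label lists copied from the five Price rows shown above (keyed by row index)
price_rows = {
    9: ["Product", "Price", "Promotion"],
    11: ["Product", "Price", "Promotion"],
    17: ["Product", "Place", "Price", "Promotion"],
    29: ["Product", "Price"],
    30: ["Product", "Place", "Price"],
}

# Flatten the lists and count how often each label appears
label_counts = Counter(label for labels in price_rows.values() for label in labels)

for label, count in label_counts.most_common():
    print(f"{label}: {count}")
```

On the full DataFrame, a comparable tally could likely be obtained with `df["Labels"].explode().value_counts()`, assuming the `Labels` column holds Python lists as the displayed output suggests.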


# 6. Save Labeled Texts

In [12]:
# Save labeled Texts
df.to_pickle(f"{OUT_path}/{OUT_file}_labeled.pkl") # df.to_excel(f"{OUT_path}/{OUT_file}_labeled.xlsx")
print("If you use this notebook's code, please give credit to the author by citing the paper:\n\nDaniel M. Ringel, Creating Synthetic Experts with Generative Artificial Intelligence (July 15, 2023). Available at SSRN: https://papers.ssrn.com/abstract_id=4542949")

If you use this notebook's code, please give credit to the author by citing the paper:

Daniel M. Ringel, Creating Synthetic Experts with Generative Artificial Intelligence (July 15, 2023). Available at SSRN: https://papers.ssrn.com/abstract_id=4542949
