# Creating Synthetic Experts with Generative AI
> ## Prediction with Fine-Tuned Model  
*Batch Edition* for larger Datasets
  
Version 0.5 
Date: August 16, 2023    
Author: Daniel M. Ringel    
Contact: dmr@unc.edu

*Daniel M. Ringel, Creating Synthetic Experts with Generative Artificial Intelligence (July 15, 2023).  
Available at SSRN: https://papers.ssrn.com/abstract_id=4542949*

#### Requirements
- PyTorch
- BeautifulSoup
- Hugging Face `transformers`
- pandas and NumPy
- `warnings` and regular expressions (`re`), both part of the Python standard library

##### Apple M1/M2 GPU MPS Requirements
> See Python Notebook: ***Setup-MacBook-M2-Pytorch-TensorFlow-Apr2023.ipynb***  
http://www.synthetic-experts.ai/Setup-MacBook-M2-Pytorch-TensorFlow-Apr2023.ipynb

- Mac computer with Apple silicon GPU
- macOS 12.3 or later
- Python 3.7 or later
- Xcode command-line tools: `xcode-select --install`


In [1]:
# Required Libraries:
# !pip3 install beautifulsoup4
# !pip3 install torch torchvision torchaudio
# !pip3 install transformers

In [2]:
# Imports
import os
import pandas as pd
import numpy as np
import torch
import warnings
import re
from datetime import datetime
from transformers import AutoModelForSequenceClassification, AutoTokenizer, PreTrainedModel
import UseSynExp as synx

In [9]:
# Paths with demo data "Example_Tweets.pkl"
IN_path = "Data"
IN_file = "Example_Tweets"
OUT_path = IN_path
OUT_file = IN_file

if not os.path.exists(IN_path):
    os.makedirs(IN_path)
    print(f'Directory "{IN_path}" created ... \nWARNING: You need to copy the "IN_file" ("Example_Tweets.pkl") into this directory.')
else:
    print(f'Directory "{IN_path}" already exists:\nMake sure that it contains the "IN_file" ("Example_Tweets.pkl")')

Directory "Data" already exists:
Make sure that it contains the "IN_file" ("Example_Tweets.pkl")


In [10]:
# OPTIONAL: Load raw Twitter data, keep only the needed columns, and save to pickle (or Excel)
# tweets = pd.read_csv(f"{IN_path}/Abercrombie.csv",low_memory=False)
# tweets = tweets[["id",'text', 'created_at']]
# tweets = tweets.drop_duplicates(subset=["text"])
# tweets.to_pickle(f"{IN_path}/{IN_file}.pkl")  # tweets.to_excel(f"{IN_path}/{IN_file}.xlsx")

In [12]:
# Define HuggingFace Model
MODEL = "dmr76/mmx_classifier_microblog_ENv02"

# Set Controls
t = 0.5    # Threshold for positive labels
block_size = 1000  # Set your batch size

In [13]:
# Set Device
print(f"PyTorch version: {torch.__version__}")
if torch.backends.mps.is_built() and torch.backends.mps.is_available():
    device = "mps"
    print("MPS (Apple Metal Performance Shader) is available")
elif torch.cuda.is_available():
    device = "cuda"
    print("CUDA (GPU) is available")
else:
    device = "cpu"
    print("Neither MPS nor GPU available")
print(f"Using device: {device}")

PyTorch version: 2.0.0
MPS (Apple Metal Performance Shader) is available
Using device: mps


In [14]:
# Data - Assumes pickle file with columns id and text - can easily change to other format
df = pd.read_pickle(f"{IN_path}/{IN_file}.pkl")  # df = pd.read_excel(f"{IN_path}/{IN_file}.xlsx")
df = df[["id", "text"]]  # Keep only ID and Text columns

# only using first 4000 here (comment out for all)
df = df.head(4000) 

In [15]:
# Load Model and Tokenizer
model = AutoModelForSequenceClassification.from_pretrained(MODEL)
tokenizer = AutoTokenizer.from_pretrained(MODEL)
model.to(device)
id2label = model.config.id2label # Get id2label from the model's config
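
The threshold `t` and the `id2label` mapping are both handed to `synx.block_process`, whose internals are not shown in this notebook. As a rough illustration only, the sketch below assumes a multi-label setup in which each logit is passed through a sigmoid and a label is kept when its probability is at least `t`. The helper `logits_to_labels`, the example logits, and the label order in `example_id2label` are hypothetical; the real mapping comes from `model.config.id2label`.

```python
import math


def sigmoid(x: float) -> float:
    """Map a raw logit to a probability in (0, 1)."""
    return 1.0 / (1.0 + math.exp(-x))


def logits_to_labels(logits, id2label, t=0.5):
    """Return the names of all labels whose sigmoid probability is >= t.

    Assumption: multi-label classification with one independent logit per label.
    """
    return [id2label[i] for i, logit in enumerate(logits) if sigmoid(logit) >= t]


# Hypothetical mapping and logits, for illustration only
example_id2label = {0: "Product", 1: "Place", 2: "Price", 3: "Promotion"}
example_logits = [2.0, -1.5, 0.3, -3.0]

print(logits_to_labels(example_logits, example_id2label, t=0.5))
# ['Product', 'Price']
```

Raising `t` makes the labeling more conservative: with `t=0.9` the same example logits yield no labels at all.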

In [16]:
%%time
# Preprocess and Predict Tweets in Batches
df = synx.block_process(df, block_size, model, tokenizer, device, t, id2label)

19:07:19 Starting block labeling:

19:07:30 --> Finished labeling 1000 Tweets
19:07:39 --> Finished labeling 2000 Tweets
19:07:49 --> Finished labeling 3000 Tweets
19:07:58 --> Finished labeling 4000 Tweets
CPU times: user 21.9 s, sys: 5.53 s, total: 27.4 s
Wall time: 38.5 s
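
The log shows one progress line per 1,000 Tweets, which is consistent with the data being processed in consecutive blocks of `block_size` rows. How `synx.block_process` slices the data is not shown here; the sketch below is one plausible way to do it, and `iter_blocks` is a hypothetical helper rather than part of `UseSynExp`.

```python
def iter_blocks(n_rows: int, block_size: int):
    """Yield (start, stop) row positions that cover n_rows in blocks of block_size.

    The last block is shorter when n_rows is not a multiple of block_size.
    """
    if block_size <= 0:
        raise ValueError("block_size must be positive")
    for start in range(0, n_rows, block_size):
        yield start, min(start + block_size, n_rows)


# 4000 rows with block_size = 1000, as in the run above
blocks = list(iter_blocks(4000, 1000))
print(len(blocks))   # 4
print(blocks[-1])    # (3000, 4000)
```

Each `(start, stop)` pair could then be used as `df.iloc[start:stop]` to pull one block out of the DataFrame, so that only `block_size` Tweets are tokenized and held on the device at a time.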


In [17]:
# Take a look at first 5
pd.set_option("display.max_colwidth", 200)
df[["text", "Labels"]].head(5)

|   | text | Labels |
|---|------|--------|
| 0 | Glasses-> @WarbyParker Blazer -> @Abercrombie Poplin-> @Abercrombie T-Shirt-> @Abercrombie Jeans-> @Abercrombie Socks-> @Abercrombie Belt-> Don't know! Chukka Boots-> @OldNavy URL | [Product] |
| 1 | Why yes I did wake up at 3am because of my cats and decide to buy this @Abercrombie jacket that I wanted that was finally back in my size | [Product] |
| 2 | Found my perfect pair of @Abercrombie jeans but they don’t come in black. Bought another pair and I’m going to attempt an at home dye job 😂🤞🏽 | [Product] |
| 3 | I have been hacked 2x in 2 months. @Chase this is seriously unacceptable and I need those funds returned. I never use @PayPal nor do I shop @Abercrombie in Stockton, CA- wtf is going on with this ... | [Product, Place] |
| 4 | To celebrate this New Year, @Abercrombie is DOUBLING all donations up to $25,000! 🎉 Your donation will help us answer 2X the calls, texts, and chats that come in, allow us to train 2X more volunte... | [Promotion] |


In [18]:
# Take a look at first 5 that are about Price
df[df.Price==1][["text", "Labels"]].head(5)

|   | text | Labels |
|---|------|--------|
| 9 | This @Abercrombie sweater is on major sale right now, with all sizes still available (which never happens). Shop it here: URL URL | [Product, Price] |
| 11 | Remember the mini leather puffer S has been loving from @Abercrombie ? They have another super similar one on sale that's faux fur and SO cute (and cozy.) Link: URL URL | [Product, Price] |
| 17 | Outerwear is IN at @Abercrombie! 🧥 Warm and woolly, fun and fuzzy coats and jackets are waiting for you. Stop in and shop the sale! // #TownSquareLV URL | [Product, Place, Price, Promotion] |
| 29 | Outfit: $175 Xmas Dinner: $500 Taking my bra off halfway through the meal: Fuckin PRICELESS 👌🏽 Dress- @Abercrombie Jacket & Shoes- @Forever21 #holidays2020 #Christmas URL | [Product, Price] |
| 30 | /1/ Ugh drove 20 minutes out of my way to return a sweater from @Abercrombie bc I would’ve paid for shipping. Waited another 20 min in line to get in store and 10 mins for the register. | [Product, Place, Price] |
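
The filter `df.Price==1` suggests that the labeled DataFrame carries one 0/1 indicator column per label next to the `Labels` list. The exact column layout produced by `synx.block_process` is not shown, so the following is only a sketch of how such indicator columns could be derived and queried. The toy data and the `label_names` list are hypothetical; the names are taken from the labels visible in the outputs.

```python
import pandas as pd

# Hypothetical label names, taken from the labels visible in the outputs
label_names = ["Product", "Place", "Price", "Promotion"]

# Toy frame standing in for the labeled Tweets
toy = pd.DataFrame({
    "text": ["sweater on sale", "new jeans", "store line was long"],
    "Labels": [["Product", "Price"], ["Product"], ["Place"]],
})

# One 0/1 column per label, derived from the list stored in "Labels"
for name in label_names:
    toy[name] = toy["Labels"].apply(lambda labels, n=name: int(n in labels))

# Same style of filter as in the cell above
price_rows = toy[toy.Price == 1]
print(price_rows["text"].tolist())
# ['sweater on sale']
```

Keeping both representations is convenient: the list column reads well when inspecting rows, while the indicator columns make filtering and counting per label a one-liner.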


In [19]:
# Save labeled Tweets
df.to_pickle(f"{OUT_path}/{OUT_file}_labeled.pkl") # df.to_excel(f"{OUT_path}/{OUT_file}_labeled.xlsx")