# Creating Synthetic Experts with Generative AI
> ## Empirical Application: MMX, Sentiment, and Topics 
  Version BETA 0.1  
  Date: September 6, 2023    
  Author: Daniel M. Ringel    
  Contact: dmr@unc.edu

*Daniel M. Ringel, Creating Synthetic Experts with Generative Artificial Intelligence (July 15, 2023).  
Available at SSRN: https://papers.ssrn.com/abstract_id=4542949.*

# *Synthetic Twins*
This notebook is published with demo data. These data are based on real Tweets but were rewritten by an AI. I call these data ***Synthetic Twins***.  
  
  
***Synthetic Twins*** correspond semantically in idea and meaning to the original texts. However, wording, people, places, firms, brands, and products were changed by an AI. As such, ***Synthetic Twins*** mitigate, to some extent, possible privacy and copyright concerns. If you'd like to learn more about ***Synthetic Twins***, another generative AI project by Daniel Ringel, then please get in touch! dmr@unc.edu  

You can ***create your own Synthetic Twins of texts*** with this Python notebook:   `SyntheticExperts_Create_Synthetic_Twins_of_Texts.ipynb`,   
available as a BETA version (still being tested) in the **Synthetic Experts [GitHub](https://github.com/dringel/Synthetic-Experts)** repository.<br><br><br>

# 1. Setup
This notebook imports several Python packages. Make sure you have all of them installed in your Python environment prior to running this notebook. ***I strongly recommend that you create a separate Python environment to run this notebook in.***

To accelerate text processing, I recommend that you leverage a GPU. NVIDIA GPUs are usually relatively easy to set up and configure. Setting up Apple Silicon GPUs (integrated in Apple's M1/M2 chips) was a little more involved at the time of writing this notebook, particularly for TensorFlow (which this notebook does not require; it runs on PyTorch).

##### Apple M1/M2 GPU MPS Requirements (Optional)
> See Python Notebook: [Setup-MacBook-M2-Pytorch-TensorFlow-Apr2023.ipynb](https://github.com/dringel/Synthetic-Experts)  

- Mac computer with Apple silicon GPU
- macOS 12.3 or later
- Python 3.7 or later
- Xcode command-line tools: xcode-select --install

##### If you have no GPU available, the code falls back to your CPU

In [1]:
# !pip3 install -U numba
# !pip3 install torch torchvision torchaudio
# !pip3 install transformers
# !pip3 install -U plotly
# !pip3 install beautifulsoup4
# !pip3 install vaderSentiment
# !pip3 install -U sentence-transformers
# !pip3 install -U scikit-learn
# !pip3 install umap-learn
# !pip3 install hdbscan
# !pip3 install bertopic

# 2. Imports

In [2]:
import sys, os, warnings, pandas as pd, numpy as np, re
from datetime import datetime
import torch
from transformers import AutoModelForSequenceClassification, AutoTokenizer, PreTrainedModel
from bs4 import BeautifulSoup
from vaderSentiment.vaderSentiment import SentimentIntensityAnalyzer
from sentence_transformers import SentenceTransformer
from umap import UMAP
from hdbscan import HDBSCAN
from sklearn.feature_extraction.text import CountVectorizer
from bertopic import BERTopic
from bertopic.vectorizers import ClassTfidfTransformer
from bertopic.representation import MaximalMarginalRelevance
import plotly.io as pio
import UseSynExp as synx

pio.renderers.default = 'iframe'
pd.set_option("display.max_colwidth", 200)

# 3. Configure

In [3]:
# Paths
IN_path = "Data"                                 # location of raw text files   
IN_file = "Demo_FashionBrand_SyntheticTwins"     # assumes pickle format with .pkl extension
OUT_path = "MMX_Sentiment"                       # save results here
OUT_file = "MMXSent_FashionBrand"                # name of file to save results to

# Setup
device = "mps" if torch.backends.mps.is_built() and torch.backends.mps.is_available() else "cuda" if torch.cuda.is_available() else "cpu"
synexpert = "dmr76/mmx_classifier_microblog_ENv02" # downloads MMX classifier (Ringel, 2023) from the Hugging Face Model Hub
                                                   # --> alternatively, enter path to saved synthetic expert (with model folder name)
# Controls
start_date = '2020-01-01'   # Use texts from the beginning of 2020
end_date =   '2020-12-31'   # till the end of 2020

seed = 42          # for sampling
N_max = 4050       # max sample size in time frame
t = 0.5            # threshold for positive labels (synthetic expert returns score 0 to 1)
block_size = 1000  # break large datasets into blocks for processing to reduce memory pressure

In [None]:
if device == "cpu": print("No GPU found, using >>> CPU <<<, which will be slower.") 
else: print(f"GPU available! Using >>> {device} <<< for inference")
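The `block_size` control caps how many texts are processed at once. The actual batching happens inside `synx.block_process`, whose internals are not shown in this notebook, so the following is only a minimal sketch of the general idea. The helper name `iter_blocks` and the placeholder texts are assumptions made for illustration.

```python
from typing import Iterator, List, Sequence


def iter_blocks(items: Sequence[str], block_size: int) -> Iterator[List[str]]:
    """Yield consecutive slices of at most `block_size` items (hypothetical helper)."""
    if block_size < 1:
        raise ValueError("block_size must be a positive integer")
    for start in range(0, len(items), block_size):
        yield list(items[start:start + block_size])


# Placeholder texts; 1504 is the size of the demo sample loaded in Section 4
texts = [f"text {i}" for i in range(1504)]
block_lengths = [len(block) for block in iter_blocks(texts, block_size=1000)]
print(block_lengths)
```

With 1504 texts and a block size of 1000, this yields one full block and one partial block, so only one block of texts needs to be held in memory for inference at any time.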

# 4. Load Data

In [4]:
# Load data
print(f"Processing {IN_file}...")
brand = IN_file
df = pd.read_pickle(f"{IN_path}/{IN_file}.pkl")[['created_at','text']]             
df = df.drop_duplicates(subset=["text"])

# Filter desired time frame
df['created_at'] = pd.to_datetime(df['created_at'])
mask = (df['created_at'] >= start_date) & (df['created_at'] <= end_date)
df = df[mask]

# Sample up to N texts
df = df.sample(n=min(len(df), N_max), replace=False, random_state=seed)
print(f"Loaded {len(df)} Tweets")

Processing Demo_FashionBrand_SyntheticTwins...
Loaded 1504 Tweets
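A note on the time-frame filter: pandas interprets a bare date string such as `'2020-12-31'` as midnight at the start of that day. If `created_at` carries a time of day (an assumption you should check for your own data), texts posted later on the last day fall outside `<= end_date`. The sketch below illustrates the effect with plain `datetime` objects and invented timestamps, and shows a half-open interval as one possible alternative.

```python
from datetime import datetime, timedelta

start = datetime(2020, 1, 1)
end = datetime(2020, 12, 31)          # midnight at the START of the last day

# Invented timestamps for illustration
stamps = [
    datetime(2019, 12, 31, 23, 59),   # before the window
    datetime(2020, 6, 15, 12, 0),     # inside the window
    datetime(2020, 12, 31, 18, 30),   # last day, in the evening
    datetime(2021, 1, 1, 0, 0),       # after the window
]

# Closed interval against midnight: drops the evening of the last day
closed = [s for s in stamps if start <= s <= end]

# Half-open interval [start, end + 1 day): keeps the whole last day
half_open = [s for s in stamps if start <= s < end + timedelta(days=1)]

print(len(closed), len(half_open))
```

In pandas, the half-open variant would correspond to something like `df['created_at'] < pd.Timestamp(end_date) + pd.Timedelta(days=1)`.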


# 5. Identify MMX Variables and Sentiment by MMX

*using MMX Classifier (Ringel 2023) and VADER sentiment classifier (Hutto and Gilbert 2014)*

> Ringel, Daniel, Creating Synthetic Experts with Generative AI (July 15, 2023).  
*Available at SSRN: https://ssrn.com/abstract=4542949.*

> Hutto, C.J. & Gilbert, E.E. (2014). VADER: A Parsimonious Rule-based Model for Sentiment Analysis of Social Media Text. Eighth International Conference on Weblogs and Social Media (ICWSM-14). Ann Arbor, MI, June 2014

*I use VADER for convenience and speed in this notebook, and to avoid recommending a specific contemporary sentiment classifier. There are many more sophisticated sentiment classifiers available on the Hugging Face Model Hub that you can easily incorporate into this notebook. Nonetheless, VADER remains a popular classifier that has established itself across literature streams. As such, it is an excellent starting point (IMHO).*

In [5]:
# Load model, tokenizer, and labels
model = AutoModelForSequenceClassification.from_pretrained(synexpert).to(device)
tokenizer = AutoTokenizer.from_pretrained(synexpert)
id2label = model.config.id2label

In [6]:
# Preprocess text and classify MMX in blocks
df = synx.block_process(df, block_size, model, tokenizer, device, t, id2label)

# Calculate sentiment polarity scores
df = synx.apply_vader_sentiment(df, t)
print(f'\n{datetime.now().strftime("%H:%M:%S")} --> Finished sentiment analysis for {len(df)} Texts')

# Take a look
df.head(3)

13:17:22 Starting block labeling:

13:17:32 --> Finished labeling up to 1000 Texts
13:17:37 --> Finished labeling up to 2000 Texts

13:17:38 --> Finished sentiment analysis for 1504 Texts


Unnamed: 0,created_at,text,Prob_Product,Prob_Place,Prob_Price,Prob_Promotion,Product,Place,Price,Promotion,Labels,Sent_All,Sent_Product,Sent_Place,Sent_Price,Sent_Promotion,Sent_Max
51,2020-12-20,Just an idea... but the @SynFcl Fierce cologne is just unbeatable. Do you think it'd be absurd if I purchased it and started wearing it again? URL,0.995051,0.009062,0.007617,0.017444,1,0,0,0,[Product],0.0,0.0,,,,Product
976,2020-04-16,"One of the most favored fleece options from @SynFcl (and in this color, too!). Formerly priced $68, now only $19! Limited sizes, but if they have yours - I own 3 of these and ADORE them ! URL URL",0.937968,0.091446,0.9893,0.055162,1,0,1,0,"[Product, Price]",0.8581,0.8581,,0.8581,,Price
184,2020-11-15,"Deplorable customer support from @SynFcl, denying reimbursement despite acknowledgment email confirming returned merchandise.",0.719933,0.042571,0.788653,0.009624,1,0,1,0,"[Product, Price]",0.0772,0.0772,,0.0772,,Price
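The table above suggests the following pattern: each text receives one score between 0 and 1 per MMX variable, a variable becomes a label when its score passes the threshold `t`, and the overall sentiment is copied into `Sent_<variable>` only for assigned labels. The real logic lives in `UseSynExp` and is not shown here, so the sketch below is an assumption-laden reconstruction: it assumes independent sigmoid scores (the scores in the table do not sum to 1), a `>= t` rule, and invented logits.

```python
import math

MMX = ["Product", "Place", "Price", "Promotion"]


def sigmoid(x: float) -> float:
    """Map a raw model output (logit) to a score between 0 and 1."""
    return 1.0 / (1.0 + math.exp(-x))


def label_text(logits, sentiment, t=0.5):
    """Hypothetical helper: threshold per-variable scores and mask sentiment."""
    probs = {name: sigmoid(z) for name, z in zip(MMX, logits)}
    # Multi-label: every variable at or above the threshold is assigned
    labels = [name for name in MMX if probs[name] >= t]
    # Sentiment is kept only for assigned variables; None stands in for NaN
    sent = {f"Sent_{name}": (sentiment if name in labels else None) for name in MMX}
    return probs, labels, sent


# Invented logits for one text, paired with an overall sentiment score
probs, labels, sent = label_text([2.7, -2.3, 4.5, -2.8], sentiment=0.8581)
print(labels)
print(sent)
```

Raising `t` makes the labels more conservative; lowering it assigns more variables per text.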


In [7]:
# Save texts with MMX labels and sentiment scores
os.makedirs(OUT_path, exist_ok=True)  # create the output folder if it does not exist yet
df.to_pickle(f"{OUT_path}/{OUT_file}_{brand}.pkl")

# 6. Identify Topics

*using BERTopic*
> Grootendorst, M. (2022). BERTopic: Neural topic modeling with a class-based TF-IDF procedure. [ArXiv](https://doi.org/10.48550/arXiv.2203.05794)

## 6.1 Prepare Data

In [8]:
# Select MMX variable (P) to analyze: change string for P = 
P = "Place" # "Product" "Price" "Promotion"
SP = f"Sent_{P}"

# Only more negative texts (lower than the mean) pertaining to defined MMX variable (i.e., P)
  # --> could also define this as negative texts where sentiment must be, e.g., negative or < -.05
focal = df[df[SP].lt(df[SP].mean()) & df[SP].notna()].copy()

# Define filename for saving results of topic modeling by MMX variable
OUT_file = f"{OUT_file}_{P}_Topics"

print(f"{len(focal)} Texts remaining")

171 Texts remaining
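The two ways of selecting "negative" texts mentioned in the comments above can give different results: a cut-off at the mean is relative to the sample, whereas a fixed cut-off such as `< -.05` is absolute. The following minimal sketch compares both on invented scores, with `None` standing in for texts that do not carry the focal MMX label.

```python
from statistics import mean

# Invented sentiment scores; None marks texts without the focal label
scores = [0.86, None, -0.42, 0.08, -0.03, None, 0.51]

valid = [s for s in scores if s is not None]
avg = mean(valid)

# Relative rule: everything below the sample mean
below_mean = [s for s in valid if s < avg]

# Absolute rule: only clearly negative scores
negative = [s for s in valid if s < -0.05]

print(below_mean)
print(negative)
```

In this toy sample the relative rule also keeps a mildly positive text (0.08), because the sample mean is positive. Which rule fits better depends on the research question.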


In [9]:
# Clean-up text some more for topic discovery 
  # --> optional, but usually helpful. Comment out what you don't require or desire.

def remove_brands_placeholders(text):
    '''Remove words and placeholders that hold little information for topic discovery'''
    # Replace focal brand name
    text = re.sub(r'(?i)SynFcl', 'BRAND', text)
    # Remove hashtags and mentions
    text = re.sub(r'@\w+', '', text)
    text = re.sub(r'#\w+', '', text)
    # Remove URL indicator
    text = re.sub(r'URL', ' ', text)
    # Remove phone number indicator
    text = re.sub(r'PHONENUMBER', ' ', text)
    # Remove eMail indicator
    text = re.sub(r'EMAILADDRESS', ' ', text)
    # Remove social security indicator
    text = re.sub(r'SSNUM', ' ', text)
    return text

# Clean 'text' column some more
focal['cleaned'] = focal['text'].apply(remove_brands_placeholders)
focal['cleaned'] = focal['cleaned'].apply(synx.remove_joiners_commas_spaces)
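One detail of the cleaning order is worth knowing: the brand name is replaced first, so a mention such as `@SynFcl` becomes `@BRAND` and is then removed by the mention pattern. `BRAND` therefore survives only where the brand appears without an `@`. The sketch below reproduces the first two steps on an invented tweet. The `keep_brand_mentions` switch is a hypothetical variation, and the whitespace clean-up is a stand-in, not the `synx` helper.

```python
import re


def clean(text: str, keep_brand_mentions: bool = False) -> str:
    """Toy version of the first cleaning steps (hypothetical helper)."""
    # Optionally swallow the leading "@" so a brand mention becomes plain "BRAND"
    brand_pattern = r'(?i)@?SynFcl' if keep_brand_mentions else r'(?i)SynFcl'
    text = re.sub(brand_pattern, 'BRAND', text)
    # Remove remaining mentions; without the switch this also removes "@BRAND"
    text = re.sub(r'@\w+', '', text)
    # Collapse repeated whitespace (stand-in for further clean-up)
    return re.sub(r'\s+', ' ', text).strip()


tweet = "Late again, @SynFcl! Never ordering from SynFcl again. cc @UPS"
default_result = clean(tweet)
kept_result = clean(tweet, keep_brand_mentions=True)
print(default_result)
print(kept_result)
```

Dropping the brand mention is usually fine for topic discovery because almost every text contains it. Keeping it can make the cleaned texts easier to read.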

In [10]:
# Create documents array that BERTopic operates on
docs = focal.cleaned.values

## 6.2 Identify Topics using BERTopic

Please understand that this is not a deterministic approach to topic discovery in text. For brevity, I will not elaborate on the availability and viability of truly deterministic and convergent approaches for topic discovery. What I will say is ...

> **Machine Learning is Art and Science**  
  *Daniel M. Ringel (2021)*

In [11]:
# Controls: seed and select SBERT (sentence BERT) pretrained model
seed = 42
embed_model = "all-MiniLM-L6-v2"  # larger model: "all-mpnet-base-v2"

# Hyperparameters (RoT = Rule of Thumb)
MargRev = .5             # MMR diversity, 0 (none) to 1 (max): trades off each keyword's/key phrase's similarity to the document against its similarity to keywords already selected. Higher values yield more diverse topic words.
UMAP_neighbors = 15      # number of similar texts considered. RoT: lower for tighter clusters, raise for broader clusters
UMAP_dims = 30           # reduce embedding to this number of dimensions. RoT: too few or too many lead to poor topic discovery 
HDBSCAN_minclust = 10    # minimum cluster size. RoT: raise/lower for fewer/more topics
Vectorizer_N = 3         # N in ngram. Consider combination of up to N words
Vectorizer_maxocc = 100  # passed to CountVectorizer as max_df: drops terms whose document frequency exceeds N = 100 (adjust to remove replacement words like "BRAND")
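To make the `MargRev` setting more tangible, here is a minimal sketch of the Maximal Marginal Relevance selection rule: start with the keyword most similar to the document, then repeatedly add the keyword with the best balance of relevance and novelty. The 2-D vectors and words are invented for illustration, and BERTopic's own implementation works on embedding matrices and may differ in details.

```python
import math


def cosine(a, b):
    """Cosine similarity of two vectors given as tuples."""
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.hypot(*a) * math.hypot(*b))


def mmr(doc_vec, word_vecs, top_n, diversity):
    """Select `top_n` words by Maximal Marginal Relevance (illustrative sketch)."""
    relevance = {w: cosine(v, doc_vec) for w, v in word_vecs.items()}
    # First pick: the word most similar to the document
    selected = [max(relevance, key=relevance.get)]
    while len(selected) < min(top_n, len(word_vecs)):
        best_word, best_score = None, -math.inf
        for w, v in word_vecs.items():
            if w in selected:
                continue
            # Redundancy: similarity to the closest word already selected
            redundancy = max(cosine(v, word_vecs[s]) for s in selected)
            score = (1 - diversity) * relevance[w] - diversity * redundancy
            if score > best_score:
                best_word, best_score = w, score
        selected.append(best_word)
    return selected


# Invented toy embeddings: "shipping" is nearly a duplicate of "delivery"
doc = (1.0, 0.0)
words = {"delivery": (1.0, 0.1), "shipping": (1.0, 0.15), "refund": (0.8, -0.6)}

print(mmr(doc, words, top_n=2, diversity=0.0))
print(mmr(doc, words, top_n=2, diversity=0.5))
```

With `diversity=0.0` the near-duplicate is chosen second; with `diversity=0.5` the less similar but more novel word wins.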

In [12]:
# Embed texts
sentence_model = SentenceTransformer(embed_model) # i.e., SBERT
embeddings = sentence_model.encode(docs, show_progress_bar=True)

# Diversify topic representations
representation_model = MaximalMarginalRelevance(diversity=MargRev)

# Reduce dimensionality
umap_model = UMAP(n_neighbors=UMAP_neighbors, n_components=UMAP_dims, min_dist=0.0, metric='cosine', random_state=seed)

# Cluster reduced embeddings
np.random.seed(seed)
hdbscan_model = HDBSCAN(min_cluster_size=HDBSCAN_minclust, metric='euclidean', cluster_selection_method='eom', prediction_data=True)

# Tokenize topics
vectorizer_model = CountVectorizer(stop_words="english",ngram_range = (1, Vectorizer_N), max_df=Vectorizer_maxocc)

# Create topic representation
ctfidf_model = ClassTfidfTransformer()

# Instantiate BERTopic
topic_model = BERTopic(
  embedding_model=sentence_model,            # Use embedded sentences
  representation_model=representation_model, # Diversify topic words
  umap_model=umap_model,                     # Reduce dimensionality
  hdbscan_model=hdbscan_model,               # Cluster reduced embeddings
  vectorizer_model=vectorizer_model,         # Tokenize topics
  ctfidf_model=ctfidf_model,                 # Extract topic words
  calculate_probabilities=True,        
  verbose=True
)

# Fit model
topics, probs = topic_model.fit_transform(docs,embeddings)

# Get topic assignments
focal["Topic_ID"] = topics

# Save dataframe (this does not save the fitted model!)
focal.to_pickle(f"{OUT_path}/{OUT_file}_{brand}.pkl")

# Save BERTopic model
print("If you use this notebook's code, please give credit to the author by citing the paper:\n\nDaniel M. Ringel, Creating Synthetic Experts with Generative Artificial Intelligence (July 15, 2023).\nAvailable at SSRN: https://papers.ssrn.com/abstract_id=4542949")
topic_model.save(f"{OUT_path}/{OUT_file}_{brand}_BERTopic.model")

# Load it again:  
# topic_model = BERTopic.load(f"{OUT_path}/{OUT_file}_{brand}_BERTopic.model")

huggingface/tokenizers: The current process just got forked, after parallelism has already been used. Disabling parallelism to avoid deadlocks...
	- Avoid using `tokenizers` before the fork if possible
	- Explicitly set the environment variable TOKENIZERS_PARALLELISM=(true | false)


Batches:   0%|          | 0/6 [00:00<?, ?it/s]

OMP: Info #276: omp_set_nested routine deprecated, please use omp_set_max_active_levels instead.
2023-09-06 13:17:41,358 - BERTopic - Reduced dimensionality
2023-09-06 13:17:41,365 - BERTopic - Clustered reduced embeddings


If you use this notebook's code, please give credit to the author by citing the paper:

Daniel M. Ringel, Creating Synthetic Experts with Generative Artificial Intelligence (July 15, 2023).
Available at SSRN: https://papers.ssrn.com/abstract_id=4542949


## 6.3 Explore Discovered Topics
Note that ***HDBSCAN is not an exhaustive clustering algorithm***. It will identify "noise points", that is, objects (here, texts) that cannot be assigned to one of the detected clusters. Because BERTopic incorporates HDBSCAN, you will see a topic labeled "-1". ***The "-1" topic is a collection of all texts that could not be assigned to one of the other topics.*** It does not show up in the graphs and maps generated directly with BERTopic.

In [13]:
# Topic bar charts
topic_model.visualize_barchart(top_n_topics=12, n_words=10, width=300, height=300, title ='"Place" Topics for Focal Fashion Brand')

# Note: the c-TF-IDF scores below the bar charts take into account what makes the texts in one cluster (i.e., Topic) different from the texts in another cluster (i.e., Topic)
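For intuition about these scores, here is a minimal sketch of class-based TF-IDF following the formula in Grootendorst (2022): the weight of a term in a topic is its frequency in that topic times `log(1 + A / f)`, where `A` is the average number of words per topic and `f` is the term's frequency across all topics. The token lists are invented, and the sketch uses raw counts without the normalization steps a library implementation may apply.

```python
import math
from collections import Counter


def c_tf_idf(class_tokens):
    """class_tokens: dict mapping topic id -> list of tokens of all its texts."""
    tf = {c: Counter(tokens) for c, tokens in class_tokens.items()}
    # Frequency of each term across all topics
    total = Counter()
    for counts in tf.values():
        total.update(counts)
    # Average number of words per topic
    avg_words = sum(total.values()) / len(class_tokens)
    return {
        c: {w: n * math.log(1 + avg_words / total[w]) for w, n in counts.items()}
        for c, counts in tf.items()
    }


# Invented toy topics; "brand" occurs in both, "late" only in topic 0
toy = {
    0: "order late order refund brand".split(),
    1: "store closing store mall brand".split(),
}
weights = c_tf_idf(toy)
for word, weight in sorted(weights[0].items(), key=lambda p: -p[1]):
    print(f"{word:8s} {weight:.3f}")
```

Although "late" and "brand" each occur once in topic 0, "brand" receives the lower weight because it also appears in the other topic.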

In [14]:
# List topics
info = topic_model.get_topic_info()
info

Unnamed: 0,Topic,Count,Name,Representation,Representative_Docs
0,-1,61,-1_disappointing_customers_fashion_spree,"[disappointing, customers, fashion, spree, waiting, day, stores, aroma, anchoring news, quality]","[JAN 30 2019 - YESTERDAY & NOW! Once upon a time in 1969, Mick & Bowie had a ground floor office at 3 Savile Row. This B&W pic shows them working. Fast forward fifty years, my colour 📷 shows the s..."
1,0,32,0_customer service_return_arrive_dallas,"[customer service, return, arrive, dallas, placed order, received, minutes, seriously, deliveries, late]","[Dear I placed an order for some items for myself on 11/30. But I got a package with different items meant for a 3-year-old. Returned ASAP. No refunds yet. Contacted Sarah in Tampa, she claims she..."
2,1,28,1_sales_outlets_malls_plans,"[sales, outlets, malls, plans, physical store, leases, flagship, 2020, doors, closing]","[BRAND claims youth are returning to physical shopping. reported a rebound of about 80% of its U.S. store sales, compared with LY. Roughly 45% of its U.S. locations (285 out of 631) are back in bu..."
3,2,27,2_denim_iconic_shopping brand_california,"[denim, iconic, shopping brand, california, mind, lost, recollections incredible, push body, really miss training, pure spits lies]","[Ever been lost in nostalgia waiting for the new normal? to my Croatia visit last year at Plitvice Lakes National Park. Tees by BRAND, 👖 from 👟 by, captured with XZ3, Really digging this color sch..."
4,3,12,3_tags_associate_roles_radio,"[tags, associate, roles, radio, intern, ve, required columbus, positions info apply, previous jobs, producer dockhand pelican]","[5 careers, 5 tags 1. Retail representative 2. Crew member solarium 3. Marketing intern SGS 4. Waiter 5. Trainee radio operator, 5 previous jobs, 5 tags - - - copy editor - Sports radio soundboard..."
5,4,11,4_cases_labor_individuals_retail,"[cases, labor, individuals, retail, protests black staff, reported orient, priority posh law, racist remarks space, questioning brand possesses, remember victoria secret]","[To, cases of sexual harassment by supervisors have been reported at Orient Craft. Safety of women employees is a priority. As per the POSH law, it's essential to investigate cases., Petition here..."
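The size of the "-1" noise topic is worth checking before interpreting the remaining topics. A quick sketch using the counts from the table above:

```python
# Topic sizes copied from the `Count` column of the table above
topic_counts = {-1: 61, 0: 32, 1: 28, 2: 27, 3: 12, 4: 11}

total = sum(topic_counts.values())
noise_share = topic_counts[-1] / total
assigned = {topic: n for topic, n in topic_counts.items() if topic != -1}

print(f"{total} texts, {sum(assigned.values())} assigned to topics, "
      f"{noise_share:.1%} treated as noise")
```

The total matches the number of texts that remained after filtering in Section 6.1, and roughly a third of them were not assigned to any topic in this run.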


In [15]:
# Print central texts of each topic
for i in range(info['Topic'].max()+1):
    topic_doc_lists = info[info['Topic'] == i]['Representative_Docs'].to_list()
    for topic_docs in topic_doc_lists:
        print(f"Topic {i}:\n")
        for doc in topic_docs:
            print(f"- {doc}\n")
        print(f"\n")

Topic 0:

- Dear I placed an order for some items for myself on 11/30. But I got a package with different items meant for a 3-year-old. Returned ASAP. No refunds yet. Contacted Sarah in Tampa, she claims she’s unable to assist even though she confirmed receipt of returned items.

- I dislike tweeting for customer service issues but I might as well cease ordering from because my orders keep getting divided, show up late and are constantly misplaced. I placed an order with 2-day shipping LAST WEEK and nothing has arrived yet. Shipping from MO to KS. Absurd.

- Understand that staffing is a challenge, but I placed an order with 10 days ago and the shipment is expected to arrive only by Tuesday. They seem to use the slowest standard delivery service. Contrarily, deliveries from Walmart, Tradesy and an indie brand have all arrived within just 3-5 days.



Topic 1:

- BRAND claims youth are returning to physical shopping. reported a rebound of about 80% of its U.S. store sales, compared with

In [16]:
# See original texts for specified topic (ID)
i = 0
texts = focal[focal.Topic_ID==i].text.head(25).to_list()
for doc in texts:
    print(f"- {doc}\n")

- @SynFcl I received a delivery confirmation notification but it hasn't shown up. Delivery order number 30283690325.

- Guess who's still waiting for their orders from @Macys since 11/27? Macy's advice: Wait for a shipping update by ____. That's all they've got! In contrast,@SynFcl managed to deliver my jacket using @UPS on 12/24.

- Received unparalleled customer service from @SynFcl online (Canada).

- I dislike tweeting for customer service issues but I might as well cease ordering from @SynFcl because my orders keep getting divided, show up late and are constantly misplaced. I placed an order with 2-day shipping LAST WEEK and nothing has arrived yet. Shipping from MO to KS. Absurd.

- /1/ Drove an extra 20 minutes to return a @SynFcl knit because the shipping price would've been a pain. Waited another 20 minutes in line and then 10 minutes at the counter. 🙄

- Mailed a return to @SynFcl on the weekend, no tracking updates yet. When contacted, their customer service said it's due to

In [17]:
# Interactive topic map
topic_model.visualize_topics()

In [18]:
# Explore topics on interactive texts similarity map
topic_model.visualize_documents(docs, embeddings=embeddings,width=900, height=600, title ="Texts Map")

In [19]:
# Heatmap of topic similarity
topic_model.visualize_heatmap(n_clusters=len(info)-2)

In [20]:
# Show topic hierarchy
topic_model.visualize_hierarchy()

In [21]:
# Show topic probability distribution for a specific document 
# --> each row of topic_model.probabilities_ belongs to one text in the "docs" array; each column belongs to a topic
# Important: the row index (i.e., document number) is not a cluster ID (i.e., Topic ID).
# --> I used "-1" here to select the last document in "docs" (not the noise points, i.e., texts that were not associated with a topic, which are collected in cluster -1)

topic_model.visualize_distribution(topic_model.probabilities_[-1])
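To read such a distribution without the chart, you can rank the entries of one row. The sketch below uses an invented probability matrix with one row per document and one column per topic. It assumes that columns line up with Topic IDs 0, 1, 2, ... and that the noise topic has no column of its own, which you should verify for your BERTopic version.

```python
def top_topics(prob_row, k=3):
    """Return the k (topic_id, probability) pairs with the highest probability."""
    ranked = sorted(enumerate(prob_row), key=lambda pair: pair[1], reverse=True)
    return ranked[:k]


# Invented matrix: 2 documents x 5 topics
probabilities = [
    [0.70, 0.10, 0.05, 0.05, 0.02],
    [0.05, 0.15, 0.60, 0.10, 0.05],
]

last_doc = probabilities[-1]      # index -1 = last document, NOT topic -1
print(top_topics(last_doc, k=2))
```

Here the last document is most strongly associated with Topic 2, followed by Topic 1.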

In [22]:
print("If you use this notebook's code, please give credit to the author by citing the paper:\n\nDaniel M. Ringel, Creating Synthetic Experts with Generative Artificial Intelligence (July 15, 2023).\nAvailable at SSRN: https://papers.ssrn.com/abstract_id=4542949")

If you use this notebook's code, please give credit to the author by citing the paper:

Daniel M. Ringel, Creating Synthetic Experts with Generative Artificial Intelligence (July 15, 2023).
Available at SSRN: https://papers.ssrn.com/abstract_id=4542949
