# Creating Synthetic Experts with Generative AI
> ## Train MMX Synthetic Expert on AI-labeled texts

  
Version 1.0  
Date: September 2, 2023    
Author: Daniel M. Ringel    
Contact: dmr@unc.edu

*Daniel M. Ringel, Creating Synthetic Experts with Generative Artificial Intelligence (July 15, 2023).  
Available at SSRN: https://papers.ssrn.com/abstract_id=4542949*


**The 5,000 demo texts that this notebook uses are *Synthetic Twins* of real Tweets. I do not publish real (i.e., original) Tweets with this notebook.**

> ***Synthetic Twins*** correspond semantically in idea and meaning to original texts. However, wording, people, places, firms, brands, and products were changed by an AI. As such, ***Synthetic Twins*** mitigate, to some extent, possible privacy and copyright concerns. If you'd like to learn more about ***Synthetic Twins***, another generative AI project by Daniel Ringel, then please get in touch! dmr@unc.edu  


You can ***create your own Synthetic Twins of texts*** with this Python notebook:   `SyntheticExperts_Create_Synthetic_Twins_of_Texts.ipynb`,   
available as a BETA version (still being tested) in the **Synthetic Experts [GitHub](https://github.com/dringel/Synthetic-Experts)** repository.<br><br><br>

##### Apple M1/M2 GPU MPS Requirements (Optional)
> See Python Notebook: [Setup-MacBook-M2-Pytorch-TensorFlow-Apr2023.ipynb](https://github.com/dringel/Synthetic-Experts)  

- Mac computer with Apple silicon GPU
- macOS 12.3 or later
- Python 3.7 or later
- Xcode command-line tools: `xcode-select --install`

##### If you have no GPU available, the code falls back to your CPU

# 1. Imports

In [1]:
import pandas as pd
import numpy as np
import random
import re
import os
from sklearn.model_selection import train_test_split
from sklearn.metrics import precision_score, recall_score, f1_score, roc_auc_score
import krippendorff
import torch
from transformers import TrainingArguments, Trainer, EvalPrediction, AutoConfig, AutoTokenizer, AutoModelForSequenceClassification, IntervalStrategy
from datasets import Dataset, DatasetDict
from datetime import datetime

# 2. Configure

In [2]:
# Paths and Filenames
IN_TrainPath = "Data"
IN_TrainSample = "Demo_5000_Labeled_SyntheticTwins"
Training_Path = "Training"
os.makedirs(Training_Path, exist_ok=True)  # create the output folder if it does not exist yet

# Set Controls
P = 95   # percentile for max tokens
T = 0.2  # size of test split for training
seed = 42 # seed used everywhere

# Pre-Trained LLM to fine-tune 
# ---> Select from thousands at: https://huggingface.co/models and "plug-in" alternative model name
pretrained = 'roberta-large'

# Set basic Hyperparameters for training (classifier performance can vary with different parameter settings)
hyperparameters =  {'learning_rate': 6.7e-06,
                    'per_device_train_batch_size': 16,
                    'weight_decay': 1.1e-05,
                    'num_train_epochs': 3,
                    'warmup_steps': 500}

In [3]:
print(f"PyTorch version: {torch.__version__}")
# Prefer Apple MPS, then CUDA, and fall back to the CPU
mps_ok = hasattr(torch, "backends") and hasattr(torch.backends, "mps") and torch.backends.mps.is_built() and torch.backends.mps.is_available()
device = "mps" if mps_ok else "cuda" if torch.cuda.is_available() else "cpu"
if device == "cpu":
    print("No GPU found, using >>> CPU <<< for training, which will be slow.")
else:
    print(f"GPU available! Using >>> {device} <<< for training")

PyTorch version: 2.0.0
GPU available! Using >>> mps <<< for training


# 3. Helper Functions

In [4]:
def get_tokens(text):
    """Return the number of tokens in text (requires the global tokenizer to be instantiated)"""
    return len(tokenizer(text)['input_ids'])

def compute_percentile(split, P):
    """Compute Pth percentile of number of tokens in texts of a given split"""
    num_tokens = [get_tokens(dataset[split][i]["Text"]) for i in range(len(dataset[split]))]
    return np.percentile(num_tokens, P)

def preprocess(examples, max_tokens):
    """Encode texts with labels for training"""
    text = examples["Text"]
    encoding = tokenizer(text, padding="max_length", truncation=True, max_length=max_tokens)
    relevant_keys = set(examples.keys()) & set(labels)
    labels_matrix = np.zeros((len(text), len(labels)))
    for idx, label in enumerate(labels):
        if label in relevant_keys:
            labels_matrix[:, idx] = examples[label]
    encoding["labels"] = labels_matrix.tolist()
    return encoding

def multi_label_metrics(predictions: np.array, labels: np.array, threshold: float = 0.5) -> dict:
    """
    Calculate classification metrics for multi-label classification.
    :param predictions: The raw output predictions from the model.
    :param labels: The ground truth labels.
    :param threshold: The threshold for converting probabilities to binary predictions.
    :return: A dictionary containing precision, recall, F1 score, ROC AUC score, and Krippendorff's alpha.
    """
    sigmoid = torch.nn.Sigmoid()
    probs = sigmoid(torch.Tensor(predictions))
    y_pred = (probs >= threshold).numpy().astype(int)
    av = "micro"
    metrics = {
        'precision': precision_score(y_true=labels, y_pred=y_pred, average=av),
        'recall': recall_score(y_true=labels, y_pred=y_pred, average=av),
        'f1': f1_score(y_true=labels, y_pred=y_pred, average=av),
        'roc_auc': roc_auc_score(y_true=labels, y_score=probs, average=av),
        'krippendorff_alpha': krippendorff.alpha(reliability_data=np.vstack((labels.ravel(), y_pred.ravel())))
    }
    return metrics

def compute_metrics(eval_prediction: EvalPrediction) -> dict:
    """
    Wrapper function for computing multi-label metrics using EvalPrediction object.
    """
    preds = eval_prediction.predictions[0] if isinstance(eval_prediction.predictions, tuple) else eval_prediction.predictions
    return multi_label_metrics(predictions=preds, labels=eval_prediction.label_ids)

def seed_everything(seed=42):
    """Seed everything for replicability. Largely works on CUDA, but less reliably on Apple silicon (mps)."""
    random.seed(seed)
    os.environ['PYTHONHASHSEED'] = str(seed)
    np.random.seed(seed)
    torch.manual_seed(seed)
    if device == "cuda":
        torch.cuda.manual_seed(seed)
        torch.cuda.manual_seed_all(seed)
        torch.backends.cudnn.deterministic = True
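
The `average="micro"` setting in `multi_label_metrics` pools true positives, false positives, and false negatives over all texts and all four labels before computing precision, recall, and F1. The following is a minimal sketch of that pooling in plain Python, not the scikit-learn implementation; the toy logits and labels are made up for illustration.

```python
import math

def sigmoid(x):
    """Logistic function that turns a raw logit into a probability."""
    return 1.0 / (1.0 + math.exp(-x))

def micro_scores(logits, labels, threshold=0.5):
    """Micro-averaged precision, recall and F1: pool TP/FP/FN over all texts and labels."""
    tp = fp = fn = 0
    for logit_row, label_row in zip(logits, labels):
        for logit, truth in zip(logit_row, label_row):
            pred = int(sigmoid(logit) >= threshold)
            tp += int(pred == 1 and truth == 1)
            fp += int(pred == 1 and truth == 0)
            fn += int(pred == 0 and truth == 1)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return {"precision": precision, "recall": recall, "f1": f1}

# Toy data (made up): 3 texts, 4 labels in the order Product, Place, Price, Promotion
toy_logits = [[ 2.0, -1.5,  0.3, -3.0],
              [-0.5,  1.2, -2.0,  0.8],
              [ 1.0, -0.2, -1.0, -0.7]]
toy_labels = [[1, 0, 0, 0],
              [1, 1, 0, 1],
              [1, 0, 0, 0]]
print(micro_scores(toy_logits, toy_labels))
```

Because counts are pooled, frequent labels weigh more heavily in the micro average than rare ones. Switching `av` to `"macro"` in `multi_label_metrics` would instead weigh each label equally.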

# 4. Load and Prepare Data

In [5]:
# Load Training Data (Twins)
TrainSample = pd.read_pickle(f"{IN_TrainPath}/{IN_TrainSample}.pkl")[["Text", "Product", "Place", "Price", "Promotion"]].reset_index(drop=True)
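# Harmonize the text column name (no effect if the column is already called 'Text', as selected above)
TrainSample.rename(columns={'Twin': 'Text'}, inplace=True)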
TrainSample.index.name = "ID"

In [6]:
# Split the DataFrame into train and test sets, stratified by the minority label column
minority_label = TrainSample.iloc[:, 1:].sum().idxmin()
train, test = train_test_split(TrainSample, test_size=T, random_state=seed, stratify=TrainSample[minority_label])
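
To illustrate what stratifying on the minority label achieves, here is a small sketch in plain Python. It is not how `train_test_split` is implemented; the helpers `minority_label`, `stratified_split`, and `positive_share` and the toy rows are hypothetical and only mimic the idea: find the rarest label, then split each of its strata separately so that train and test keep the same share of positives.

```python
import random

def minority_label(rows, label_names):
    """Return the label with the fewest positive examples."""
    totals = {name: sum(row[name] for row in rows) for name in label_names}
    return min(totals, key=totals.get)

def stratified_split(rows, column, test_size, seed):
    """Split each stratum (value of `column`) separately so both parts keep its share."""
    rng = random.Random(seed)
    train, test = [], []
    for value in sorted({row[column] for row in rows}):
        stratum = [row for row in rows if row[column] == value]
        rng.shuffle(stratum)
        n_test = round(len(stratum) * test_size)
        test.extend(stratum[:n_test])
        train.extend(stratum[n_test:])
    return train, test

def positive_share(rows, column):
    """Share of rows in which the given label is set."""
    return sum(row[column] for row in rows) / len(rows)

# Toy data (made up): 100 texts, "Price" is the rarest label with 10 positives
rows = [{"Product": int(i % 2 == 0), "Price": int(i % 10 == 0)} for i in range(100)]
rarest = minority_label(rows, ["Product", "Price"])
train_rows, test_rows = stratified_split(rows, rarest, test_size=0.2, seed=42)
print(rarest, positive_share(train_rows, rarest), positive_share(test_rows, rarest))
```

Stratifying on a single column only guarantees the balance for that label. The shares of the other labels are balanced only approximately, which is usually less critical because they have more positives.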

In [7]:
# Create Hugging Face Dataset
dataset = DatasetDict({"train":Dataset.from_dict(train),"test":Dataset.from_dict(test)})

# Get Labels and create label dicts
labels = [label for label in dataset['train'].features.keys() if label not in ['ID', 'Text']]
id2label = {idx:label for idx, label in enumerate(labels)}
label2id = {label:idx for idx, label in enumerate(labels)}

In [8]:
# Set Tokenizer
tokenizer = AutoTokenizer.from_pretrained(pretrained)

# Prohibit Parallel Tokenization (can lead to forking in loops and batch processing)
os.environ['TOKENIZERS_PARALLELISM'] = 'false'

# Compute percentile for train and test splits (percentile for max tokens)
higher_percentile = max(compute_percentile('train',P), compute_percentile('test',P))

# Create encoded dataset
encoded_dataset = dataset.map(lambda examples: preprocess(examples, int(higher_percentile)), batched=True, remove_columns=dataset['train'].column_names)

# Set encoded dataset to pytorch tensors
encoded_dataset.set_format("torch")

Map:   0%|          | 0/4000 [00:00<?, ? examples/s]

Map:   0%|          | 0/1000 [00:00<?, ? examples/s]
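
The reason for capping the sequence length at the `P`-th percentile rather than at the longest text can be seen in a small worked example. The sketch below uses made-up token counts and a hand-written `percentile` helper with linear interpolation, which is also the default behavior of `np.percentile`; in the notebook, the counts come from `get_tokens`.

```python
import math

def percentile(values, p):
    """p-th percentile with linear interpolation between neighbouring ranks."""
    ordered = sorted(values)
    rank = (len(ordered) - 1) * p / 100
    lower = math.floor(rank)
    upper = math.ceil(rank)
    return ordered[lower] + (ordered[upper] - ordered[lower]) * (rank - lower)

# Toy token counts (made up), including one very long text
token_counts = [12, 15, 18, 20, 22, 25, 27, 30, 33, 35,
                38, 40, 42, 45, 48, 50, 55, 60, 70, 120]

max_tokens = int(percentile(token_counts, 95))  # same int() truncation as in the cell above
n_truncated = sum(count > max_tokens for count in token_counts)
print(f"max_tokens = {max_tokens}, texts truncated: {n_truncated} of {len(token_counts)}")
```

In this toy case, a single outlier would otherwise force every text to be padded to 120 tokens. With the 95th percentile, only that one text is truncated and all others are padded to a much shorter length, which saves memory and training time.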

# 5. Set-up Fine-Tuning of LLM

In [9]:
# Seed Torch etc.
seed_everything(seed)

# Instantiate Classifier
    # ---> Note: Set "ignore_mismatched_sizes" to "True" if you fine-tune a pre-trained classification model that has a different number of classes
    # ---> You should get several warnings about checkpoint weights not being used in initialization.
    #      This is expected because you will train the pre-trained model on a downstream task.
model = AutoModelForSequenceClassification.from_pretrained(pretrained,                                        
                                                           problem_type="multi_label_classification", 
                                                           num_labels=len(labels),
                                                           id2label=id2label,
                                                           label2id=label2id)
                                                           #ignore_mismatched_sizes=True)                                                 

Some weights of the model checkpoint at roberta-large were not used when initializing RobertaForSequenceClassification: ['lm_head.dense.bias', 'lm_head.layer_norm.weight', 'lm_head.bias', 'lm_head.dense.weight', 'roberta.pooler.dense.weight', 'roberta.pooler.dense.bias', 'lm_head.decoder.weight', 'lm_head.layer_norm.bias']
- This IS expected if you are initializing RobertaForSequenceClassification from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing RobertaForSequenceClassification from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).
Some weights of RobertaForSequenceClassification were not initialized from the model checkpoint at roberta-large and are newly initialized: ['classifier.out_proj.weight', 'clas
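
Setting `problem_type="multi_label_classification"` makes the model treat each of the four labels as its own yes/no decision: the Transformers sequence-classification heads then use a binary cross-entropy loss on the raw logits (PyTorch's `BCEWithLogitsLoss`) instead of a softmax over classes. The sketch below reproduces that loss for a single made-up text in plain Python; the function name `bce_with_logits` and the numbers are illustrative only.

```python
import math

def bce_with_logits(logits, targets):
    """Mean binary cross-entropy over labels, computed from raw logits."""
    losses = []
    for z, y in zip(logits, targets):
        # Numerically stable form of -[y*log(sigmoid(z)) + (1-y)*log(1-sigmoid(z))]
        losses.append(max(z, 0.0) - z * y + math.log1p(math.exp(-abs(z))))
    return sum(losses) / len(losses)

# One made-up text with logits for Product, Place, Price, Promotion
logits = [2.0, -1.0, 0.0, -3.0]
targets = [1.0, 0.0, 0.0, 0.0]
loss = bce_with_logits(logits, targets)
print(round(loss, 4))
```

Because every label contributes its own term, a text can carry several labels at once or none at all, which is what the 0.5 threshold in `multi_label_metrics` relies on.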

In [10]:
# Set Training Arguments
training_args = TrainingArguments(
    output_dir=f"{Training_Path}",
    evaluation_strategy="epoch",
    logging_dir=f"{Training_Path}/Logs",
    logging_strategy="steps",
    logging_steps=10,
    per_device_train_batch_size=hyperparameters['per_device_train_batch_size'],
    per_device_eval_batch_size= hyperparameters['per_device_train_batch_size'], 
    num_train_epochs=hyperparameters['num_train_epochs'],
    learning_rate=hyperparameters['learning_rate'], 
    weight_decay=hyperparameters['weight_decay'], 
    warmup_steps=hyperparameters['warmup_steps'],
    save_strategy="epoch",
    load_best_model_at_end=True,
    save_total_limit=2,
    use_mps_device=(device == "mps"),
    optim='adamw_torch',
    seed=seed
    # ---> You can also evaluate more often than once per epoch, e.g., every 100 steps
    #evaluation_strategy=IntervalStrategy.STEPS,  # Evaluate every 'eval_steps'
    #eval_steps=100,                              # Evaluate every 100 steps
    #do_train=True,
    #do_eval=True,
    #save_strategy=IntervalStrategy.STEPS,        # Save every 'save_steps'
    #save_steps=100,                              # Save every 100 steps
)

# Instantiate Trainer
trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=encoded_dataset["train"],
    eval_dataset=encoded_dataset["test"],
    tokenizer=tokenizer,
    compute_metrics=compute_metrics
)
print("Ready to Create Synthetic Expert")

Ready to Create Synthetic Expert
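
How `warmup_steps` interacts with the length of the run is easier to judge with numbers. The sketch below assumes the Trainer's default linear schedule (linear warmup followed by linear decay to zero), a single device, no gradient accumulation, and the 4,000 training texts of the demo data; the helper `linear_schedule_lr` is a hypothetical re-implementation for illustration, not a library call.

```python
import math

def linear_schedule_lr(step, peak_lr, warmup_steps, total_steps):
    """Learning rate under linear warmup followed by linear decay to zero."""
    if step < warmup_steps:
        return peak_lr * step / max(1, warmup_steps)
    remaining = max(0, total_steps - step)
    return peak_lr * remaining / max(1, total_steps - warmup_steps)

# Step count under the stated assumptions (4,000 training texts, one device)
steps_per_epoch = math.ceil(4000 / 16)   # per_device_train_batch_size = 16
total_steps = steps_per_epoch * 3        # num_train_epochs = 3

for step in (0, 250, 500, 625, total_steps):
    lr = linear_schedule_lr(step, peak_lr=6.7e-06, warmup_steps=500, total_steps=total_steps)
    print(f"step {step:>3}: lr = {lr:.2e}")
```

Under these assumptions, the run has 750 optimizer steps, so the learning rate reaches its peak only at step 500 and decays during the final 250 steps. With a different dataset size, batch size, or number of epochs, it may be worth scaling `warmup_steps` accordingly.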


# 6. Fine-Tune and Evaluate

In [11]:
# Fine-tune the model with trainer to create Synthetic Expert
print(f"Started training with seed {seed} at {datetime.now()}\nFine-tuning {pretrained}")
trainer.train()
print(f"Completed training at {datetime.now()}")

You're using a RobertaTokenizerFast tokenizer. Please note that with a fast tokenizer, using the `__call__` method is faster than using a method to encode the text followed by a call to the `pad` method to get a padded encoding.


Started training with seed 42 at 2023-09-04 09:42:22.485229
Fine-tuning roberta-large


Epoch,Training Loss,Validation Loss,Precision,Recall,F1,Roc Auc,Krippendorff Alpha
1,0.4985,0.491541,0.735,0.542035,0.623939,0.826578,0.467025
2,0.3021,0.283099,0.836991,0.787611,0.81155,0.947515,0.719186
3,0.2224,0.264282,0.858162,0.798673,0.827349,0.953445,0.743398


Completed training at 2023-09-04 09:49:25.725530
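
Since `load_best_model_at_end=True` is set without `metric_for_best_model`, the Trainer's documented default is to compare checkpoints by evaluation loss, where lower is better. The sketch below mimics that selection on the validation results from the training log above; `best_checkpoint` is a hypothetical helper, not part of the Transformers API.

```python
# Validation results copied from the training log above
history = [
    {"epoch": 1, "eval_loss": 0.491541, "eval_f1": 0.623939},
    {"epoch": 2, "eval_loss": 0.283099, "eval_f1": 0.811550},
    {"epoch": 3, "eval_loss": 0.264282, "eval_f1": 0.827349},
]

def best_checkpoint(history, metric="eval_loss", greater_is_better=False):
    """Return the log entry that wins on the chosen metric."""
    pick = max if greater_is_better else min
    return pick(history, key=lambda row: row[metric])

print("Best by loss:", best_checkpoint(history)["epoch"])
print("Best by F1:  ", best_checkpoint(history, "eval_f1", greater_is_better=True)["epoch"])
```

In this run, both criteria point to epoch 3. They can disagree in other runs, in which case setting `metric_for_best_model="f1"` together with `greater_is_better=True` in `TrainingArguments` would select by F1 instead.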


In [12]:
# Evaluate Synthetic Expert on test data
print("Model performance on Test")
trainer.evaluate()

Model performance on Test


{'eval_loss': 0.26428163051605225,
 'eval_precision': 0.8581616481774961,
 'eval_recall': 0.7986725663716814,
 'eval_f1': 0.8273491214667684,
 'eval_roc_auc': 0.9534447114633678,
 'eval_krippendorff_alpha': 0.7433975515816948,
 'eval_runtime': 7.7485,
 'eval_samples_per_second': 129.058,
 'eval_steps_per_second': 8.131,
 'epoch': 3.0}

In [13]:
# Evaluate Synthetic Expert on train data
print("Model performance on Train")
trainer.eval_dataset = encoded_dataset["train"]
trainer.evaluate()

Model performance on Train


{'eval_loss': 0.18405453860759735,
 'eval_precision': 0.9181194906953967,
 'eval_recall': 0.8694119829345205,
 'eval_f1': 0.8931021341463415,
 'eval_roc_auc': 0.978809225284601,
 'eval_krippendorff_alpha': 0.8409307659295377,
 'eval_runtime': 31.3477,
 'eval_samples_per_second': 127.601,
 'eval_steps_per_second': 7.975,
 'epoch': 3.0}
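
Two quick checks help when reading the evaluation outputs above. First, micro-averaged F1 is the harmonic mean of micro-averaged precision and recall, so the reported value can be verified by hand. Second, the difference between train and test scores gives a rough indication of how much the model has adapted to the training data. The sketch below only does arithmetic on the numbers printed above.

```python
# Metrics copied from the two evaluation outputs above
test_metrics = {
    "precision": 0.8581616481774961,
    "recall": 0.7986725663716814,
    "f1": 0.8273491214667684,
    "roc_auc": 0.9534447114633678,
    "krippendorff_alpha": 0.7433975515816948,
}
train_metrics = {
    "precision": 0.9181194906953967,
    "recall": 0.8694119829345205,
    "f1": 0.8931021341463415,
    "roc_auc": 0.978809225284601,
    "krippendorff_alpha": 0.8409307659295377,
}

# Check 1: F1 as the harmonic mean of precision and recall (test data)
p, r = test_metrics["precision"], test_metrics["recall"]
f1_check = 2 * p * r / (p + r)
print(f"F1 recomputed from precision and recall: {f1_check:.6f}")

# Check 2: train minus test for every metric
gaps = {name: train_metrics[name] - test_metrics[name] for name in test_metrics}
for name, gap in gaps.items():
    print(f"{name:<20} train - test = {gap:+.3f}")
```

All metrics are higher on the training data, with a gap of roughly 0.07 in F1. Whether that gap is acceptable depends on the application; more data, fewer epochs, or stronger regularization are common levers if it is not.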

# 7. Save Synthetic Expert

You can save your fine-tuned model and then:
- **Load** it to classify text (e.g., use the *[SyntheticExperts_Quickstart_MMXClassifier.ipynb](https://github.com/dringel/Synthetic-Experts)* notebook)
- **Share** it with others by sending them the model folder (consider sending them the corresponding notebooks for prediction as well)
- **Publish** it on the Hugging Face Model Hub

In [14]:
# Save fine-tuned model
trainer.save_model(f"{Training_Path}/my_MMX_SyntheticExpert_Twins")
print("Your Synthetic Expert was saved! If you use this notebook's code, please give credit to the author by citing the paper:\n\nDaniel M. Ringel, Creating Synthetic Experts with Generative Artificial Intelligence (July 15, 2023). Available at SSRN: https://papers.ssrn.com/abstract_id=4542949")

Your Synthetic Expert was saved! If you use this notebook's code, please give credit to the author by citing the paper:

Daniel M. Ringel, Creating Synthetic Experts with Generative Artificial Intelligence (July 15, 2023). Available at SSRN: https://papers.ssrn.com/abstract_id=4542949
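
For the **Load** option, the following is a minimal sketch of how the saved folder could be used to classify new texts. It is not necessarily how the Quickstart notebook does it. It assumes that the tokenizer was saved alongside the model (the Trainer does this when it was given a tokenizer) and that the 0.5 threshold from training is reused; `decode_labels` and `classify` are hypothetical helper names.

```python
import math

def decode_labels(logits, id2label, threshold=0.5):
    """Map one row of raw logits to the labels whose sigmoid probability passes the threshold."""
    probs = [1.0 / (1.0 + math.exp(-z)) for z in logits]
    return [id2label[i] for i, p in enumerate(probs) if p >= threshold]

def classify(texts, model_dir="Training/my_MMX_SyntheticExpert_Twins", threshold=0.5):
    """Load the saved Synthetic Expert and return the predicted labels for each text."""
    # Imported here so that decode_labels can be tried without the heavy dependencies
    import torch
    from transformers import AutoModelForSequenceClassification, AutoTokenizer

    tokenizer = AutoTokenizer.from_pretrained(model_dir)
    model = AutoModelForSequenceClassification.from_pretrained(model_dir)
    model.eval()
    inputs = tokenizer(texts, padding=True, truncation=True, return_tensors="pt")
    with torch.no_grad():
        logits = model(**inputs).logits
    return [decode_labels(row, model.config.id2label, threshold) for row in logits.tolist()]

# Dry run of the decoding step with made-up logits (no model needed)
demo_id2label = {0: "Product", 1: "Place", 2: "Price", 3: "Promotion"}
print(decode_labels([1.3, -2.1, 0.4, -0.9], demo_id2label))
```

After training, a call such as `classify(["Your text here"])` would return one list of label names per text. An empty list means that no label passed the threshold.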
