<a href="https://colab.research.google.com/github/Zihooo/Text-selection-codes-pub/blob/main/Prediction_Model_(RoBERTa_and_Longformer).ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Transformer Models for Personality Score Prediction
This Colab notebook is written in **Python** and illustrates how to *fine-tune* state-of-the-art **Transformer** models to predict personality scores. In this code sample, we use **roberta-base** as the example Transformer and **neuroticism** as the example personality trait. Notes in the code describe the changes needed to use other Transformers or to predict other personality traits.

In [None]:
# Mount Google drive to get access to the data
from google.colab import drive
drive.mount('/content/drive', force_remount=True)

Mounted at /content/drive


In [None]:
## install required packages
! pip install transformers==4.28.0
! pip install sentencepiece
! pip install datasets
! pip install scipy

Looking in indexes: https://pypi.org/simple, https://us-python.pkg.dev/colab-wheels/public/simple/
Collecting transformers==4.28.0
  Downloading transformers-4.28.0-py3-none-any.whl (7.0 MB)
Collecting huggingface-hub<1.0,>=0.11.0 (from transformers==4.28.0)
  Downloading huggingface_hub-0.14.1-py3-none-any.whl (224 kB)
Collecting tokenizers!=0.11.3,<0.14,>=0.11.1 (from transformers==4.28.0)
  Downloading tokenizers-0.13.3-cp310-cp310-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (7.8 MB)
Installing collected packages: tokenizers, huggingface-hub, transformers
Successfully installed huggingface-hub-0.14.1 tokenizers-0.13.3 transfor

In [None]:
# import packages
from transformers import AutoConfig, AutoTokenizer, TrainingArguments, Trainer

import torch
from torch.utils.data import Dataset

import scipy
from scipy.stats import pearsonr
from scipy.special import softmax
from sklearn.metrics import precision_recall_fscore_support
import pandas as pd
import numpy as np
from warnings import warn
import os
import sys
import gc


### Using a GPU
To speed things up you can use a *GPU* (*optional*).

First, you'll need to enable GPUs for the notebook:

- Navigate to Edit→Notebook Settings
- Select GPU from the Hardware Accelerator drop-down

Next, confirm that PyTorch can connect to the GPU:

In [None]:
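# A helper function to check for a GPU
def get_gpu():
  """Return the index of the current CUDA device, or -1 if no GPU is available."""
  if torch.cuda.is_available():
    # free cached GPU memory and run garbage collection before reporting the device
    torch.cuda.empty_cache()
    gc.collect()
    return torch.cuda.current_device()
  else:
    return -1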

In [None]:
get_gpu()

0

In [None]:
!nvidia-smi

Sun May 28 15:45:21 2023       
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 525.85.12    Driver Version: 525.85.12    CUDA Version: 12.0     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                               |                      |               MIG M. |
|   0  Tesla T4            Off  | 00000000:00:04.0 Off |                    0 |
| N/A   41C    P8     9W /  70W |      3MiB / 15360MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
                                                                               
+-----------------------------------------------------------------------------+
| Proces

### Functions and Classes

In [None]:
#@title Load user-defined utility functions
# Import Data function
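def import_data(path, text_col, label_col, index_col = None, index_val = None, enc = 'latin1'):
  """Import a CSV of sentences
  
  Args:
    path: A csv file path
    text_col: Name of column in csv containing sentences
    label_col: Name of column containing labels (None if the file has no labels)
    index_col: Name of a column used to filter rows (optional)
    index_val: Keep only rows where index_col equals this value (optional)
    enc: File encoding to be used (optional)

  Returns:
    (texts, labels, df), or (texts, df) when label_col is None
  """
  # keep_default_na=False reads empty cells as empty strings instead of NaN
  df = pd.read_csv(path, encoding = enc, keep_default_na=False)
  if index_val is not None:
    df = df[df[index_col] == index_val]
  if label_col is None:
    return df[text_col].tolist(), df
  return df[text_col].tolist(), df[label_col].tolist(), df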


# Get model for simple transformers
def get_model(model_type):
    if  model_type == "specter":
        model_name = "allenai/specter"
    elif model_type == "bert":
        model_name = "bert-base-cased"
    elif model_type == "roberta":
        model_name = "roberta-large"
    elif model_type == "distilbert":
        model_name = "distilbert-base-cased-distilled-squad"
    elif model_type == "distilroberta":
        model_type = "roberta"
        model_name = "cross-encoder/stsb-distilroberta-base"
    elif model_type == "electra-base":
        model_type = "electra"
        model_name = "cross-encoder/ms-marco-electra-base"
    elif model_type == "xlnet":
        model_name = "xlnet-large-cased"
    elif model_type == "bart":
        model_name = "facebook/bart-large"
    elif model_type == "deberta":
        model_type = "debertav2"
        model_name = "microsoft/deberta-v3-large"
    elif model_type == "albert":
        model_name = "albert-xlarge-v2"
    elif model_type == "xlmroberta":
        model_name = "xlm-roberta-large"
    else:
        # `warn` is imported directly above (from warnings import warn)
        warn("model_type is not pre-defined; using model_type as model_name")
        model_name = model_type
    return model_type, model_name
  


In [None]:
# eval metrics
from sklearn.metrics import mean_squared_error
from scipy.stats import pearsonr

def compute_metrics_for_regression(eval_pred):
    """Compute the MSE and the Pearson correlation between predictions and labels."""
    logits, labels = eval_pred
    labels = labels.reshape(-1, 1)   # match the (n_samples, 1) shape of the predictions
    mse = mean_squared_error(labels, logits)
    r = pearsonr(labels.reshape(-1), logits.reshape(-1))
    rscore = r[0].tolist()           # r[0] is the correlation coefficient
    return {"mse": mse, "r": rscore}
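
The `reshape(-1, 1)` in the metric function matters because the model output for `num_labels=1` has shape `(n_samples, 1)`, while labels usually arrive as a flat vector. The following is a minimal sketch with made-up toy numbers (not data from this study) that shows the broadcasting pitfall and computes both metrics with NumPy only:

```python
import numpy as np

# Toy predictions shaped like model output for num_labels=1: (n_samples, 1)
logits = np.array([[1.0], [2.0], [3.0], [4.0]])
# Toy labels as a flat vector: (n_samples,)
labels = np.array([1.5, 2.5, 2.5, 4.5])

# Pitfall: (4, 1) minus (4,) broadcasts to a (4, 4) matrix of pairwise differences
wrong_diff = logits - labels

# Aligning the shapes first gives exactly one error per sample
labels_2d = labels.reshape(-1, 1)
mse = float(np.mean((logits - labels_2d) ** 2))

# Pearson r is computed on the flattened vectors
r = float(np.corrcoef(logits.reshape(-1), labels)[0, 1])

print(wrong_diff.shape, mse, round(r, 3))
```

Every toy prediction is off by 0.5, so the MSE is 0.25; the correlation stays high because the predictions preserve the ordering of the labels.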

In [None]:
#@title Data Class
class TextClassificationDataset(torch.utils.data.Dataset):
    def __init__(self, encodings, labels):
        self.encodings = encodings
        self.labels = labels

    def __getitem__(self, idx):
        item = {key: torch.tensor(val[idx]) for key, val in self.encodings.items()}
        item['labels'] = torch.tensor(self.labels[idx])
        return item

    def __len__(self):
        return len(self.labels)
      

### Defining Variables


---


We define our variables as described in our research manuscript. However, we encourage researchers and practitioners to try alternative models. In addition, we deliberately kept hyperparameter tuning to a minimum during training, because the aim of this research is to show how Transformers perform as a baseline.

In [None]:
from transformers import AutoTokenizer, AutoModelForSequenceClassification, DataCollatorWithPadding
from torch.utils.data import DataLoader

BASE_MODEL = 'roberta-base' # replace with "allenai/longformer-base-4096" for Longformer
LEARNING_RATE = 5e-7
MAX_LENGTH = 512      # can be increased to 4096 when using Longformer; longer sequences increase the computational load
BATCH_SIZE = 12       # chosen based on the available computational resources (GPU memory)
EPOCHS = 50           # may be increased if the evaluation metric has not yet plateaued



---


## Fine-tuning A Transformer Model


---
This example demonstrates the fine-tuning process for the purpose of predicting scores from text data.


### Importing and formatting Training Data


---


Since we have already mounted Google Drive in this notebook, we can import the data directly from it.

In [None]:
#@title Importing custom datasets

# "textn" refers to the column that contains textual data selected for Neuroticism, "nscore" refers to the column that contains labels (Likert-type Neuroticism scores)
# the import_data function returns a list of texts, a list of labels, and the original dataset
# training set
train_text, train_labels, train_raw_data = import_data("/content/drive/MyDrive/Text Selection Paper Codes/data/train_relevant_10.csv", "textn", "nscore")

# evaluation set
eval_text, eval_labels, eval_raw_data = import_data("/content/drive/MyDrive/Text Selection Paper Codes/data/eval_relevant_10.csv", "textn", "nscore")

## testing set
test_text, test_labels, test_raw_data = import_data("/content/drive/MyDrive/Text Selection Paper Codes/data/test_relevant_10.csv", "textn", "nscore")

To import the training data properly, we must specify the file path, the name of the column containing our items, and the name of the column containing our labels. The `import_data()` function then returns three objects:

- a list (vector) of items
- a list (vector) of labels
- a copy of our training data

The code above assigns these to objects named `train_text`, `train_labels`, and `train_raw_data`, respectively.
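
To make the three return values concrete, here is a minimal sketch that repeats the same pandas steps on a tiny in-memory CSV. The rows and the `split` column are invented for illustration only; the column names `textn` and `nscore` follow the cell above.

```python
import io
import pandas as pd

# Hypothetical three-row CSV; the second row has an empty text cell
csv_text = (
    "id,textn,nscore,split\n"
    "1,I worry a lot.,4.2,train\n"
    "2,,2.0,train\n"
    "3,I stay calm.,1.5,test\n"
)

# keep_default_na=False keeps the empty text cell as '' rather than NaN,
# so the tokenizer later receives a string for every row
df = pd.read_csv(io.StringIO(csv_text), keep_default_na=False)

# Optional row filter, analogous to the index_col / index_val arguments
df_train = df[df["split"] == "train"]

texts = df_train["textn"].tolist()    # list of items
labels = df_train["nscore"].tolist()  # list of labels

print(texts, labels, df_train.shape)
```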

In [None]:
#@title Tokenize data

tokenizer = AutoTokenizer.from_pretrained(BASE_MODEL)

#train_labels_indx, lab_to_id, num_labs = map_labels_to_keys(train_labels)
train_encodings = tokenizer(train_text, truncation=True, max_length = MAX_LENGTH,padding='max_length')
train_dataset = TextClassificationDataset(train_encodings, train_labels)
    
#eval_labels_indx, _, _ = map_labels_to_keys(eval_labels)
eval_encodings = tokenizer(eval_text, truncation=True, max_length = MAX_LENGTH,padding='max_length')
eval_dataset = TextClassificationDataset(eval_encodings, eval_labels)

#test_labels_indx, _, _ = map_labels_to_keys(test_labels)
test_encodings = tokenizer(test_text, truncation=True, max_length = MAX_LENGTH,padding='max_length')
test_dataset = TextClassificationDataset(test_encodings, test_labels)


Downloading (…)lve/main/config.json:   0%|          | 0.00/481 [00:00<?, ?B/s]

Downloading (…)olve/main/vocab.json:   0%|          | 0.00/899k [00:00<?, ?B/s]

Downloading (…)olve/main/merges.txt:   0%|          | 0.00/456k [00:00<?, ?B/s]

Downloading (…)/main/tokenizer.json:   0%|          | 0.00/1.36M [00:00<?, ?B/s]

### Training the model



In [None]:
# load model
MODEL = AutoModelForSequenceClassification.from_pretrained(BASE_MODEL, num_labels=1) # problem_type is set to 'regression' when num_labels = 1

In [None]:
training_args = TrainingArguments(
    output_dir="/content/drive/MyDrive/Text Selection Paper Codes/checkpoints/relevant-n", # directory to save the model
    learning_rate=LEARNING_RATE,
    seed=100,                        # the training seed is fixed here, but some randomness remains in model initialization
    per_device_train_batch_size=BATCH_SIZE,
    per_device_eval_batch_size=BATCH_SIZE,
    num_train_epochs=EPOCHS,
    evaluation_strategy="epoch",
    save_strategy="epoch",
    logging_steps=50,
    metric_for_best_model="r", greater_is_better=True,  # can also be "mse"; in that case set greater_is_better to False
                                     # whichever metric is used, inspect the training log before selecting a model
    load_best_model_at_end=True,     # reloads the checkpoint with the best metric_for_best_model (here, the highest r) when training ends
    weight_decay=0.01
)

In [None]:
# initialize trainer
trainer = Trainer(
    model=MODEL,
    args=training_args,
    train_dataset=train_dataset,
    eval_dataset=eval_dataset,
    compute_metrics=compute_metrics_for_regression,
)

In [None]:
# RUN
# trainer.train()                          # For first-time training, run this line. To replicate our run, use the next cell instead; both start the training process.

In [None]:
# Because the regression head added on top of the pretrained model is randomly initialized each time the model is loaded, an exact replication cannot be achieved by setting the seed alone.
# We therefore provide the checkpoint from the first epoch of our run as the starting point, to make the training process as close to ours as possible.
# This cell can also be used to resume from any other saved epoch.
# use the same initial model for training
trainer.train(resume_from_checkpoint = "/content/drive/MyDrive/Text Selection Paper Codes/checkpoints/relevant-n/checkpoint-41")

You are resuming training from a checkpoint trained with 4.25.1 of Transformers but your current version is 4.28.0. This is not recommended and could yield to errors or unwanted behaviors.


0it [00:00, ?it/s]

| Epoch | Training Loss | Validation Loss | Mse | R |
|---|---|---|---|---|
| 2 | 7.5527 | 8.241269 | 8.241269 | 0.06594 |
| 3 | 7.2711 | 7.495969 | 7.495968 | 0.077086 |
| 4 | 6.4318 | 6.02827 | 6.028271 | 0.098823 |
| 5 | 4.4255 | 2.742921 | 2.742921 | -0.051857 |
| 6 | 4.4255 | 1.216266 | 1.216266 | -0.066707 |
| 7 | 1.805 | 1.116633 | 1.116633 | -0.001418 |
| 8 | 1.0175 | 1.124967 | 1.124967 | 0.037328 |
| 9 | 0.9428 | 1.122108 | 1.122108 | -0.006951 |
| 10 | 1.0034 | 1.103584 | 1.103583 | 0.14229 |
| 11 | 0.9543 | 1.088243 | 1.088243 | 0.244452 |


TrainOutput(global_step=2050, training_loss=1.2390587857874429, metrics={'train_runtime': 2561.527, 'train_samples_per_second': 9.447, 'train_steps_per_second': 0.8, 'total_flos': 6367230370406400.0, 'train_loss': 1.2390587857874429, 'epoch': 50.0})
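
The checkpoint name `checkpoint-41` can be related to the numbers in this output: with `save_strategy="epoch"`, checkpoints are named after the global step at which they are saved, and 2050 steps over 50 epochs is 41 steps per epoch. The sketch below works through that arithmetic. The training-set size of 484 is a hypothetical value chosen only so that the formula reproduces the log; it is not reported in this notebook.

```python
import math

def steps_per_epoch(n_train, batch_size):
    """Optimizer steps in one epoch on a single device, without gradient accumulation."""
    return math.ceil(n_train / batch_size)

# Values reported in the training output
global_step = 2050
epochs = 50
steps_from_log = global_step // epochs

# Hypothetical training-set size (an assumption, not taken from the data)
n_train_assumed = 484
batch_size = 12
steps_from_formula = steps_per_epoch(n_train_assumed, batch_size)

print(steps_from_log, steps_from_formula)
```

Under these assumptions both values are 41, which is consistent with `checkpoint-41` being the checkpoint saved at the end of the first epoch.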

### Predict scores with the fine-tuned model

---

Since we've fine-tuned the model, we can use the `.predict()` method to predict the target labels.

In [None]:
# check which epoch was selected
trainer.eval_dataset=eval_dataset
trainer.evaluate()

{'eval_loss': 0.8499435782432556,
 'eval_mse': 0.8499435186386108,
 'eval_r': 0.49377161508140194,
 'eval_runtime': 3.7557,
 'eval_samples_per_second': 32.484,
 'eval_steps_per_second': 2.929,
 'epoch': 50.0}
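
Model selection by `metric_for_best_model` amounts to picking the evaluated epoch with the best value of that metric. The following is a rough sketch in plain Python, using the rows visible in the training log excerpt in the previous section; `best_epoch` is a hypothetical helper, not part of the Trainer API. The excerpt stops at epoch 11, so it cannot identify the epoch behind the `eval_r` of about 0.49 reported above.

```python
# Rows copied from the partial training log: (epoch, eval MSE, eval r)
log_rows = [
    (2, 8.241269, 0.065940),
    (3, 7.495968, 0.077086),
    (4, 6.028271, 0.098823),
    (5, 2.742921, -0.051857),
    (6, 1.216266, -0.066707),
    (7, 1.116633, -0.001418),
    (8, 1.124967, 0.037328),
    (9, 1.122108, -0.006951),
    (10, 1.103583, 0.142290),
    (11, 1.088243, 0.244452),
]

def best_epoch(rows, metric="r"):
    """Return the epoch with the best value of `metric` (hypothetical helper)."""
    column = {"mse": 1, "r": 2}[metric]
    # A higher r is better; a lower mse is better
    pick = max if metric == "r" else min
    return pick(rows, key=lambda row: row[column])[0]

print(best_epoch(log_rows, "r"), best_epoch(log_rows, "mse"))
```

Within this excerpt both criteria agree on the same epoch, but over a full run they can disagree, which is why inspecting the log remains useful.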

In [None]:
# run prediction
pred_set = trainer.predict(test_dataset)

In [None]:
# save the predicted results into a list
xss = pred_set[0]
flat_list = [x for xs in xss for x in xs]
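
The nested list comprehension flattens predictions of shape `(n_samples, 1)` into a flat list. A minimal sketch of an equivalent NumPy call, using a made-up array as a stand-in for `pred_set[0]`:

```python
import numpy as np

# Stand-in for pred_set[0]: regression predictions with shape (n_samples, 1)
xss = np.array([[2.1], [3.4], [1.8]])

# Flattening with a nested comprehension, as in the cell above
flat_loop = [x for xs in xss for x in xs]

# Equivalent flattening with NumPy, converted to plain Python floats
flat_numpy = xss.reshape(-1).tolist()

print(flat_numpy)
```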


In [None]:
# calculate the correlation between predicted scores and labels
pearsonr(flat_list,pred_set[1])

PearsonRResult(statistic=0.36369003333956335, pvalue=1.3951184325477414e-09)

In [None]:
# save the predicted scores
import pandas as pd
dfpred = pd.DataFrame(flat_list)
#dfpred.to_csv('/content/drive/MyDrive/personality prediction/final-saved outputs/wd/relevance/test_O_epoch8.csv')