# Intro
In this exercise, we will use data from Aka & Bhatia ([2022](https://www.journals.uchicago.edu/doi/full/10.1086/718456)) to predict how people perceive different health states. We will use two approaches: feature extraction and fine-tuning. The former extracts features from a pre-trained language model and uses them as input to a separate model. The latter fine-tunes a pre-trained language model on the task at hand.

# Using Notebook Environments
1. To run a cell, press ```shift + enter```. The notebook will execute the code in the cell and move to the next cell. If the cell is a markdown cell (text only), it will render the markdown and move to the next cell.
2. Since cells can be executed in any order and variables can be overwritten, you may at some point lose track of the state of your notebook. If this happens, you can always restart the kernel by clicking ```Runtime``` in the menu bar (if you're using Colab) and selecting ```Restart runtime```. This will clear all variables and outputs.
3. The final variable in a cell will be printed on the screen. If you want to print multiple variables, use the ```print()``` function as usual.

Notebook environments support code cells and markdown cells. For the purposes of this workshop, markdown cells are used to provide high-level explanations of the code. More specific details are provided in the code cells themselves in the form of comments (lines beginning with ```#```).

In [1]:
import sys
if 'google.colab' in sys.modules:  # If in Google Colab environment
    # Installing requisite packages
    !pip install datasets transformers evaluate
    !pip install accelerate -U

    # Mount google drive to enable access to data files
    from google.colab import drive
    drive.mount('/content/drive')

    # Change working directory to ex1
    %cd /content/drive/MyDrive/LLM4JDM/ex1

# Preparing data
We begin by loading the requisite packages. For those coming from R, packages in Python are sometimes given shorter names for use in the code via the ```import <name> as <nickname>``` syntax (e.g. ```import pandas as pd```). These are usually standardized nicknames. We here make use of three packages:
1. ```pandas```: A very popular package for reading and manipulating data in Python.
2. ```datasets```: A HuggingFace (HF) package for loading and manipulating datasets in a format ready for use with HF models.
3. ```transformers```: A HF package for loading and manipulating transformer-based models.

In [2]:
import pandas as pd
from datasets import Dataset
from transformers import AutoTokenizer

The dataset, kindly provided by Aka & Bhatia ([2022](https://www.journals.uchicago.edu/doi/full/10.1086/718456)), has been processed to take the following structure:
1. ```text```: A short description of the health state (e.g., broken leg) followed by a more in-depth explanation of the health state scraped from the NHS website.
2. ```labels```: Average participant rating of the severity of the health state (larger rating -> less severe)

In [3]:
# Reading in the .csv data
dat = pd.read_csv('health.csv')
dat # Inspecting the data

Unnamed: 0,text,labels
0,Broken leg. A broken leg (leg fracture) will b...,49.333333
1,Bulimia. Bulimia is an eating disorder and men...,34.181818
2,Hyperacusis. Hyperacusis is when everyday soun...,53.818182
3,DVT. DVT (deep vein thrombosis) is a blood clo...,12.800000
4,Ectopic pregnancy. An ectopic pregnancy is whe...,31.700000
...,...,...
772,Typhoid fever. Typhoid fever is a bacterial in...,27.900000
773,Ankylosing spondylitis. Ankylosing spondylitis...,30.800000
774,Sleepwalking. Sleepwalking is when someone wal...,71.181818
775,Fits. If you see someone having a seizure or f...,34.111111


In [4]:
# Convert pandas dataframe to HF Dataset
dat = Dataset.from_pandas(dat)
dat

Dataset({
    features: ['text', 'labels'],
    num_rows: 777
})

Features of the ```Dataset``` object can be accessed like keys in a dictionary and behave like Python lists. Samples can be accessed by index, which returns a dictionary whose keys correspond to the feature names.

In [5]:
dat[0]

{'text': 'Broken leg. A broken leg (leg fracture) will be severely painful and may be swollen or bruised. You usually will not be able to walk on it.If it\'s a severe fracture, the leg may be an odd shape and the bone may even be poking out of the skin. There may have been a "crack" sound when the leg was broken, and the shock and pain of breaking your leg may cause you to feel faint, dizzy or sick.',
 'labels': 49.33333333}

To use models in the HF ecosystem, one must first define a model checkpoint (```ckpt```): the specific model (i.e., weights and architecture) we plan to use. This often needs to be done well before we even initialize the model since data preprocessing steps, such as tokenization, are also determined by the model architecture. We just need a pre-trained base model for our purposes (i.e., one that has not yet been fine-tuned on a specific task). One popular lightweight option is ```distilbert-base-uncased```.

In [6]:
# Defining model checkpoint
model_ckpt = 'distilbert-base-uncased'

Tokenization is breaking raw text into the desired atomic units for one's modeling task. This may be as simple as splitting the text into individual words. In the case of transformer-based models, tokenization is a bit more complex, usually occurring at the sub-word level.
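To make sub-word tokenization concrete, here is a minimal sketch of the greedy longest-match-first splitting that WordPiece-style tokenizers (the family used by BERT models) apply to a single word. The names `TOY_VOCAB` and `wordpiece` are hypothetical, and the vocabulary is a toy one chosen for illustration, not the real model vocabulary.

```python
# Toy vocabulary (an assumption for illustration only)
TOY_VOCAB = {"[UNK]", "sleep", "walk", "##walk", "##ing", "broken", "leg"}


def wordpiece(word, vocab, unk="[UNK]"):
    """Greedy longest-match-first split of one lowercase word into sub-word pieces."""
    pieces, start = [], 0
    while start < len(word):
        end = len(word)
        match = None
        # Try the longest remaining substring first, then shrink it from the right
        while end > start:
            candidate = word[start:end]
            if start > 0:
                candidate = "##" + candidate  # continuation pieces carry a '##' prefix
            if candidate in vocab:
                match = candidate
                break
            end -= 1
        if match is None:
            return [unk]  # no piece fits, so the whole word maps to the unknown token
        pieces.append(match)
        start = end
    return pieces


tokens = wordpiece("sleepwalking", TOY_VOCAB)
print(tokens)
```

A word that is in the vocabulary stays whole, while a rarer word is broken into pieces that are. This is how a fixed-size vocabulary can still cover words it has never seen as a unit.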

In [7]:
# Tokenizing the dataset
tokenizer = AutoTokenizer.from_pretrained(model_ckpt)
print(f'Vocabulary size: {tokenizer.vocab_size}, max context length: {tokenizer.model_max_length}')

Vocabulary size: 30522, max context length: 512


Two important arguments relating to tokenization:
1. ```padding```: Fills shorter sequences up to a given length so that all sequences in a batch have the same length. This is essential for training and inference with deep learning models that operate on fixed-size input tensors.
2. ```truncation```: Cuts off part of a sequence so that it fits within a specified maximum length (e.g., 512 tokens for BERT models).

The combination of padding and truncation ensures that all sequences have a consistent, fixed size. This is essential for processing in parallel on modern hardware like GPUs.
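The effect of padding and truncation on a single sequence can be sketched in plain Python. This is an illustration under stated assumptions rather than the tokenizer's actual implementation: `pad_or_truncate` is a hypothetical helper, the ids 101, 102 and 0 are taken to be `[CLS]`, `[SEP]` and `[PAD]` as in the BERT uncased vocabulary, and truncation here simply keeps the closing `[SEP]`.

```python
PAD_ID = 0  # assumed id of [PAD]


def pad_or_truncate(input_ids, max_length, pad_id=PAD_ID):
    """Return fixed-length input ids plus an attention mask (1 = real token, 0 = padding)."""
    if len(input_ids) > max_length:
        # Truncate the content but keep the final [SEP] id
        input_ids = input_ids[:max_length - 1] + input_ids[-1:]
    n_pad = max_length - len(input_ids)
    return {
        "input_ids": input_ids + [pad_id] * n_pad,
        "attention_mask": [1] * len(input_ids) + [0] * n_pad,
    }


short = [101, 3714, 4190, 1012, 102]                    # [CLS] broken leg . [SEP]
long = [101, 3714, 4190, 1012, 1037, 3714, 4190, 102]   # a longer sequence

padded = pad_or_truncate(short, max_length=8)
truncated = pad_or_truncate(long, max_length=5)
print(padded)
print(truncated)
```

The attention mask is what lets the model ignore the padded positions, which is why it appears alongside `input_ids` as a model input.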

In [8]:
# Function to tokenize a batch of samples
def batch_tokenize(batch):
    return tokenizer(batch['text'], padding="max_length", truncation=True)

#  Tokenizing the dataset
dat = dat.map(batch_tokenize, batched=True, batch_size=8)
dat[0]


{'text': 'Broken leg. A broken leg (leg fracture) will be severely painful and may be swollen or bruised. You usually will not be able to walk on it.If it\'s a severe fracture, the leg may be an odd shape and the bone may even be poking out of the skin. There may have been a "crack" sound when the leg was broken, and the shock and pain of breaking your leg may cause you to feel faint, dizzy or sick.',
 'labels': 49.33333333,
 'input_ids': [101,
  3714,
  4190,
  1012,
  1037,
  3714,
  4190,
  1006,
  4190,
  19583,
  1007,
  2097,
  2022,
  8949,
  9145,
  1998,
  2089,
  2022,
  13408,
  2030,
  18618,
  1012,
  2017,
  2788,
  2097,
  2025,
  2022,
  2583,
  2000,
  3328,
  2006,
  2009,
  1012,
  2065,
  2009,
  1005,
  1055,
  1037,
  5729,
  19583,
  1010,
  1996,
  4190,
  2089,
  2022,
  2019,
  5976,
  4338,
  1998,
  1996,
  5923,
  2089,
  2130,
  2022,
  21603,
  2041,
  1997,
  1996,
  3096,
  1012,
  2045,
  2089,
  2031,
  2042,
  1037,
  1000,
  8579,
  1000,
  2614,
  

Looking at the first sample, we see some important special tokens:
1. ```[CLS]```: Often added at the beginning of the input sequence. In the context of classification tasks, the embedding corresponding to the [CLS] token (after passing through the model) is often used as the aggregate representation for the entire sequence.
2. ```[SEP]```: Used to separate different segments in a sequence. For example, in tasks that take two different sentences as input (such as question-answering or text-pair classification), the [SEP] token is placed between the two sentences to indicate that they are distinct segments. This helps the model understand and process the two segments appropriately, recognizing the boundaries between them.
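The layout of these special tokens can be sketched as follows. `add_special_tokens` is a hypothetical helper written for illustration; in practice the tokenizer inserts these tokens itself.

```python
def add_special_tokens(first, second=None):
    """Wrap one or two token lists in BERT-style special tokens."""
    tokens = ["[CLS]"] + first + ["[SEP]"]
    if second is not None:
        # A second segment is appended after the first [SEP] and closed by its own [SEP]
        tokens += second + ["[SEP]"]
    return tokens


single = add_special_tokens(["broken", "leg", "."])
pair = add_special_tokens(["is", "it", "broken", "?"], ["yes", "."])
print(single)
print(pair)
```

Our dataset only contains single sequences, so each sample has one `[CLS]` at the start and one `[SEP]` at the end, followed by any padding.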

In [9]:
# Inspecting tokenization by looking at the first 100 tokens of the first sample
tokenizer.convert_ids_to_tokens(dat[0]['input_ids'])[:100]

['[CLS]',
 'broken',
 'leg',
 '.',
 'a',
 'broken',
 'leg',
 '(',
 'leg',
 'fracture',
 ')',
 'will',
 'be',
 'severely',
 'painful',
 'and',
 'may',
 'be',
 'swollen',
 'or',
 'bruised',
 '.',
 'you',
 'usually',
 'will',
 'not',
 'be',
 'able',
 'to',
 'walk',
 'on',
 'it',
 '.',
 'if',
 'it',
 "'",
 's',
 'a',
 'severe',
 'fracture',
 ',',
 'the',
 'leg',
 'may',
 'be',
 'an',
 'odd',
 'shape',
 'and',
 'the',
 'bone',
 'may',
 'even',
 'be',
 'poking',
 'out',
 'of',
 'the',
 'skin',
 '.',
 'there',
 'may',
 'have',
 'been',
 'a',
 '"',
 'crack',
 '"',
 'sound',
 'when',
 'the',
 'leg',
 'was',
 'broken',
 ',',
 'and',
 'the',
 'shock',
 'and',
 'pain',
 'of',
 'breaking',
 'your',
 'leg',
 'may',
 'cause',
 'you',
 'to',
 'feel',
 'faint',
 ',',
 'dizzy',
 'or',
 'sick',
 '.',
 '[SEP]',
 '[PAD]',
 '[PAD]',
 '[PAD]',
 '[PAD]']

# Feature Extraction
We require two packages/modules for feature extraction:
1. ```torch```: The PyTorch package, the most popular deep learning framework amongst researchers (https://paperswithcode.com/trends).
2. ```AutoModel```: A class from the HF ```transformers``` package that loads a pre-trained model from a checkpoint name. ```AutoModel``` is a convenient way to use pre-trained models since it abstracts away the details of the model architecture and allows us to focus on the task at hand.

In [10]:
import torch
torch.manual_seed(42) # For reproducibility
from transformers import AutoModel

In order to pass our data to the model, we need to convert it to torch tensors. If you are familiar with NumPy, torch tensors are very similar, but with the added benefit of being able to run on GPUs (which are optimized for tensor operations).

In [11]:
dat.set_format('torch', columns=['input_ids', 'attention_mask', 'labels'])
dat

Dataset({
    features: ['text', 'labels', 'input_ids', 'attention_mask'],
    num_rows: 777
})

In [12]:
# Loading the model and moving it to the GPU if available
if torch.cuda.is_available():  # for nvidia GPUs etc.
    device = torch.device('cuda')
elif torch.backends.mps.is_available(): # for Apple Metal Performance Shaders (mps) GPUs
    device = torch.device('mps')
else:
    device = torch.device('cpu')

device

device(type='mps')

In [13]:
# Loading distilbert-base-uncased and moving it to the GPU if available
model = AutoModel.from_pretrained(model_ckpt).to(device)
f'Model inputs: {tokenizer.model_input_names}'

Some weights of the model checkpoint at distilbert-base-uncased were not used when initializing DistilBertModel: ['vocab_transform.bias', 'vocab_projector.weight', 'vocab_layer_norm.bias', 'vocab_transform.weight', 'vocab_layer_norm.weight', 'vocab_projector.bias']
- This IS expected if you are initializing DistilBertModel from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing DistilBertModel from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).


"Model inputs: ['input_ids', 'attention_mask']"

Extracting features from the model is as simple as passing the data to the model and extracting the last hidden state (the activations of the final layer of the model). We here extract the hidden state for the [CLS] token, which is often used as the aggregate representation for the entire sequence.
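The indexing involved can be illustrated without a model. The last hidden state has shape (batch size, sequence length, hidden size), and the `[CLS]` vector is the one at position 0 of each sequence. The sketch below uses nested Python lists with made-up numbers and a hidden size of 4 (instead of 768) purely for readability.

```python
# Made-up activations: 2 sequences, 3 token positions each, hidden size 4
last_hidden_state = [
    [[0.1, 0.2, 0.3, 0.4], [1.0, 1.0, 1.0, 1.0], [2.0, 2.0, 2.0, 2.0]],
    [[0.5, 0.6, 0.7, 0.8], [3.0, 3.0, 3.0, 3.0], [4.0, 4.0, 4.0, 4.0]],
]

# Keep only the vector at position 0 ([CLS]) of every sequence.
# With a tensor, the same selection is written last_hidden_state[:, 0].
cls_vectors = [sequence[0] for sequence in last_hidden_state]
print(cls_vectors)
```

Each sample is thereby reduced to a single fixed-length vector, regardless of how many tokens it contains.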

In [14]:
def extract_features(batch):
    # Each batch is a dictionary with keys corresponding to the feature names. We only need the input ids and attention masks
    inputs = {k:v.to(device) for k, v in batch.items() if k in tokenizer.model_input_names}

    # Tell torch not to build the computation graph during inference with `torch.no_grad()`
    with torch.no_grad():
        last_hidden_state = model(**inputs).last_hidden_state # Extract last hidden states

    # Return vector for [CLS] token
    return {"hidden_state": last_hidden_state[:,0].cpu().numpy()}

# Extracting features. Features are extracted in batches of 8 samples to avoid running out of memory.
dat = dat.map(extract_features, batched=True, batch_size=8)
dat['hidden_state'].shape


torch.Size([777, 768])

# Predicting health perception with features
To predict the severity ratings from the extracted features and evaluate prediction performance, we will make use of ```sklearn``` (scikit-learn), a general machine learning library. Since we are dealing with high-dimensional embeddings, ordinary least squares regression runs a risk of overfitting. Instead, we will use a regularized (linear) regression model (```RidgeCV```). We evaluate model performance on a holdout test set using the coefficient of determination ($R^2$).
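As a rough illustration of the two ideas in this paragraph, the sketch below computes $R^2$ by hand and shows how a ridge penalty shrinks a regression slope. It is a one-feature toy with made-up data and no intercept, not what ```RidgeCV``` does internally; `r_squared` and `ridge_slope` are hypothetical helper names.

```python
def r_squared(y_true, y_pred):
    """Coefficient of determination: 1 - (residual sum of squares / total sum of squares)."""
    mean_y = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean_y) ** 2 for t in y_true)
    return 1 - ss_res / ss_tot


def ridge_slope(x, y, alpha):
    """Slope minimizing sum((y - w*x)^2) + alpha * w^2 for a single centered feature."""
    return sum(a * b for a, b in zip(x, y)) / (sum(a * a for a in x) + alpha)


# Made-up, centered data where y = 2 * x exactly
x = [-2.0, -1.0, 0.0, 1.0, 2.0]
y = [-4.0, -2.0, 0.0, 2.0, 4.0]

w_ols = ridge_slope(x, y, alpha=0.0)     # no penalty: ordinary least squares
w_ridge = ridge_slope(x, y, alpha=10.0)  # penalty pulls the slope toward zero

print(w_ols, r_squared(y, [w_ols * a for a in x]))
print(w_ridge, r_squared(y, [w_ridge * a for a in x]))
```

On the data it was fitted to, the penalized slope always fits worse. The benefit of the penalty only shows up on held-out data, which is why the penalty strength is chosen by cross-validation and performance is reported on a test set.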

In [None]:
from sklearn.linear_model import RidgeCV
from sklearn.model_selection import train_test_split

In [None]:
# Converting features to a pandas data frame for compatibility with sklearn
embeds = pd.DataFrame(dat['hidden_state'])
embeds

In [None]:
# Instantiating the RidgeCV model
regr = RidgeCV()
regr

In [None]:
# Splitting the data into train and test sets
X_train, X_test, y_train, y_test = train_test_split(embeds, dat['labels'], random_state=42)
f'Train size: {len(X_train)}, test size: {len(X_test)}'

In [None]:
# Fitting the model and evaluating performance
regr.fit(X_train, y_train)
f'Test R2 = {regr.score(X_test, y_test).round(2)}'

This shows that we can predict health perception from the extracted features with a reasonable degree of accuracy.

*Exercise*: Feel free to try out different regression algorithms (e.g., LassoCV): https://scikit-learn.org/stable/supervised_learning.html

# Predicting health perception with LM fine-tuning
We here make use of three modules from the transformers library:
1. ```AutoModelForSequenceClassification```: Loads a pre-trained model with a head attached, ready for fine-tuning on sequence classification/regression labels.
2. ```TrainingArguments```: Specify training arguments such as the number of epochs, batch size, learning rate, etc.
3. ```Trainer```: Allows us to train a model using the training arguments and a training dataset.

We also employ the ```evaluate``` library to compute the coefficient of determination ($R^2$) on the test set.

In [15]:
from transformers import AutoModelForSequenceClassification, TrainingArguments, Trainer
import evaluate

In [16]:
# Splitting the data into train and test sets
dat = dat.train_test_split(test_size=0.2, seed=42)
dat

DatasetDict({
    train: Dataset({
        features: ['text', 'labels', 'input_ids', 'attention_mask', 'hidden_state'],
        num_rows: 621
    })
    test: Dataset({
        features: ['text', 'labels', 'input_ids', 'attention_mask', 'hidden_state'],
        num_rows: 156
    })
})

The main difference here from the model we used for feature extraction is that ```distilbert-base-uncased``` now has a classification/regression head attached. During fine-tuning, the weights of this head along with the weights of the base model are updated for the task at hand.

You will see a warning that some parts of the model are randomly initialized. This is normal since the head has not yet been trained.
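To give a sense of what the head computes, here is a toy sketch in plain Python: a linear layer, a ReLU, and a second linear layer with a single output, applied to the `[CLS]` vector and scored with mean squared error. The weights and the hidden size of 2 are made up, dropout is omitted, and the helper names are hypothetical. The layer names in the comments follow the `pre_classifier` and `classifier` entries in the initialization warning.

```python
def linear(x, weights, bias):
    """Apply a linear layer: one output per row of weights."""
    return [sum(w * xi for w, xi in zip(row, x)) + b for row, b in zip(weights, bias)]


def relu(x):
    return [max(0.0, v) for v in x]


def regression_head(cls_vector, pre_w, pre_b, out_w, out_b):
    hidden = relu(linear(cls_vector, pre_w, pre_b))  # plays the role of pre_classifier + ReLU
    return linear(hidden, out_w, out_b)[0]           # plays the role of classifier, one output


def mse(preds, labels):
    """Mean squared error, the usual loss for a single-output regression head."""
    return sum((p - l) ** 2 for p, l in zip(preds, labels)) / len(preds)


# Made-up [CLS] vector and weights
cls_vector = [1.0, -2.0]
pre_w, pre_b = [[1.0, 0.0], [0.0, 1.0]], [0.0, 0.0]
out_w, out_b = [[3.0, 5.0]], [0.5]

prediction = regression_head(cls_vector, pre_w, pre_b, out_w, out_b)
loss = mse([prediction], [2.5])
print(prediction, loss)
```

In feature extraction only a separate regression model is fitted on top of frozen features, whereas in fine-tuning the gradient of this loss also flows back into the weights of the base model.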

In [19]:
# Loading distilbert-base-uncased and moving it to the GPU if available
model = (AutoModelForSequenceClassification
         .from_pretrained(model_ckpt, num_labels=1) # num_labels=1 for regression
         .to(device))

model

Some weights of the model checkpoint at distilbert-base-uncased were not used when initializing DistilBertForSequenceClassification: ['vocab_layer_norm.bias', 'vocab_layer_norm.weight', 'vocab_projector.bias', 'vocab_projector.weight', 'vocab_transform.bias', 'vocab_transform.weight']
- This IS expected if you are initializing DistilBertForSequenceClassification from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing DistilBertForSequenceClassification from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).
Some weights of DistilBertForSequenceClassification were not initialized from the model checkpoint at distilbert-base-uncased and are newly initialized: ['pre_classifier.weight', 'pre_classifier.bias', 'classi

DistilBertForSequenceClassification(
  (distilbert): DistilBertModel(
    (embeddings): Embeddings(
      (word_embeddings): Embedding(30522, 768, padding_idx=0)
      (position_embeddings): Embedding(512, 768)
      (LayerNorm): LayerNorm((768,), eps=1e-12, elementwise_affine=True)
      (dropout): Dropout(p=0.1, inplace=False)
    )
    (transformer): Transformer(
      (layer): ModuleList(
        (0-5): 6 x TransformerBlock(
          (attention): MultiHeadSelfAttention(
            (dropout): Dropout(p=0.1, inplace=False)
            (q_lin): Linear(in_features=768, out_features=768, bias=True)
            (k_lin): Linear(in_features=768, out_features=768, bias=True)
            (v_lin): Linear(in_features=768, out_features=768, bias=True)
            (out_lin): Linear(in_features=768, out_features=768, bias=True)
          )
          (sa_layer_norm): LayerNorm((768,), eps=1e-12, elementwise_affine=True)
          (ffn): FFN(
            (dropout): Dropout(p=0.1, inplace=False)
 

In [None]:
# Setting up training arguments for the trainer
model_name = f"{model_ckpt}-finetuned-health"
batch_size = 8
training_args = TrainingArguments(
    output_dir=model_name,  # output directory to save training checkpoints
    per_device_train_batch_size=batch_size,
    per_device_eval_batch_size=batch_size,
    logging_strategy="epoch", # log training metrics at the end of every epoch
    evaluation_strategy="epoch", # evaluate at the end of every epoch
    num_train_epochs=10, # number of times to iterate over the training data
)


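def compute_metrics(eval_preds):
    """Computes the coefficient of determination (R2) on the test set"""
    metric = evaluate.load("r_squared")
    preds, labels = eval_preds
    # A single-output head returns predictions of shape (n_samples, 1); flatten them to match the labels
    return {"r_squared": metric.compute(predictions=preds.reshape(-1), references=labels.reshape(-1))}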


# Instantiating the trainer
trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=dat['train'],
    eval_dataset=dat['test'],
    compute_metrics=compute_metrics,
)

# Training the model
trainer.train()
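As a quick sanity check on how long training runs, the number of optimizer steps can be worked out from the training-set size, batch size, and number of epochs. This sketch assumes a single device, no gradient accumulation, and that the last, smaller batch of each epoch is kept.

```python
import math

n_train = 621      # rows in the training split
batch_size = 8     # per-device batch size
num_epochs = 10

# The last batch of each epoch is smaller (621 = 77 * 8 + 5) but still counts as a step
steps_per_epoch = math.ceil(n_train / batch_size)
total_steps = steps_per_epoch * num_epochs

print(steps_per_epoch, total_steps)
```

Under these assumptions the progress bar shown during training should count up to the total number of steps computed here.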

# Conclusion
Based on these results, we can see that fine-tuning a language model on the task at hand can lead to better performance than using extracted features. This is not surprising: fine-tuning updates the model's weights to predict the labels directly, whereas the extracted features come from a model that was pre-trained on a generic objective (e.g., predicting a masked word in the sequence) and has never seen the labels. However, feature extraction is much faster than fine-tuning and may be less prone to overfitting for small datasets.
