# Using Notebook Environments
1. To run a cell, press ```shift + enter```. The notebook will execute the code in the cell and move to the next cell. If the cell is a markdown cell, it will render the markdown and move to the next cell.
2. Since cells can be executed in any order and variables can be overwritten, you may at some point feel that you have lost track of the state of your notebook. If this is the case, you can always restart the kernel by clicking ```Runtime``` in the menu bar (if you're using Colab) and selecting ```Restart runtime```. This will clear all variables and outputs.
3. The value of the last expression in a cell is displayed below the cell. If you want to display several values, you can use the ```print()``` function as usual.

Notebook environments support code cells and markdown cells. For the purposes of this workshop, markdown cells are used to provide high-level explanations of the code. More specific details are provided in the code cells themselves in the form of comments (lines beginning with ```#```).

In [2]:
import sys
if 'google.colab' in sys.modules:  # If in Google Colab environment
    # Installing requisite packages
    !pip install datasets transformers evaluate
    !pip install accelerate -U

    # Mount google drive to enable access to data files
    from google.colab import drive
    drive.mount('/content/drive')

    # Change working directory to ex1
    %cd /content/drive/MyDrive/LLM4JDM/ex1

# Preparing data
We begin by loading the requisite packages. For those coming from R, packages in Python are sometimes given shorter names for use in the code via the ```import <name> as <nickname>``` syntax (e.g. ```import pandas as pd```). These are usually standardized nicknames. Here we use three packages:
1. ```pandas```: A very popular library for reading and manipulating data in Python.
2. ```datasets```: A package from the HuggingFace (HF) ecosystem for loading and manipulating datasets in a format ready for use with HF models.
3. ```transformers```: A package from the HF ecosystem for loading and manipulating transformer-based models.

In [1]:
import pandas as pd
from datasets import Dataset
from transformers import AutoTokenizer

The dataset has the following structure:
1. ```text```: A free-text response (string).
2. ```labels```: The numeric value to be predicted from the text (float).

In [13]:
vaccine = (
    pd.read_csv('vaccine.csv') # Read csv file
    .dropna() # Drop rows with missing values
    .reset_index(drop=True) # Reset row indices
)
vaccine


| | text | labels |
|---|---|---|
| 0 | Looking at the probability of certain negative... | 0.625 |
| 1 | Not really, I looked at all the numbers and ma... | 0.750 |
| 2 | I weighed the side effects with the benefits a... | 0.750 |
| 3 | Percentages were important. In all things (med... | 0.375 |
| 4 | I would look up the potential side effects and... | 0.250 |
| ... | ... | ... |
| 1040 | I looked primarily at the effectiveness of eac... | 0.750 |
| 1041 | Personal experience talking | 0.125 |
| 1042 | Biased against the vaccine | 0.750 |
| 1043 | I have had a lot of people whos gotten the vac... | 0.250 |


In [14]:
# Convert pandas dataframe to HF Dataset
vaccine = Dataset.from_pandas(vaccine)
vaccine

Dataset({
    features: ['text', 'labels'],
    num_rows: 1045
})

Features of the ```Dataset``` object can be accessed like keys in a dictionary and behave like Python lists. Samples can be accessed by index, which returns a dictionary whose keys are the feature names. This reflects the fact that the ```datasets``` library is built on Apache Arrow, which defines a typed columnar format that is more memory efficient than native Python objects.
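To make the two access patterns concrete, here is a minimal pure-Python sketch of a columnar store. The class `ToyColumnarDataset` and its sample values are hypothetical and only mimic the behaviour described above; this is not how the HF `Dataset` class is implemented.

```python
class ToyColumnarDataset:
    """Toy columnar store: one Python list per feature (illustration only)."""

    def __init__(self, columns):
        lengths = {len(values) for values in columns.values()}
        if len(lengths) > 1:
            raise ValueError("all columns must have the same length")
        self.columns = columns

    def __len__(self):
        # Number of samples = length of any column (0 if there are no columns)
        return len(next(iter(self.columns.values()), []))

    def __getitem__(self, key):
        if isinstance(key, str):
            # Feature name -> the whole column, like a dictionary lookup
            return self.columns[key]
        # Integer index -> one sample as a dict keyed by feature name
        return {name: values[key] for name, values in self.columns.items()}


# Made-up toy values, not taken from the workshop data
toy = ToyColumnarDataset({
    "text": ["first response", "second response"],
    "labels": [0.625, 0.75],
})
print(toy["labels"])  # column access
print(toy[0])         # row access
```

Storing one list per feature is what makes column access cheap; a row has to be assembled on demand from every column.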

In [15]:
vaccine[0]

{'text': 'Looking at the probability of certain negative side effects and seeing if the vaccine genuinely protects someone would be very important.',
 'labels': 0.625}

To use models in the HF ecosystem, one must first define a model checkpoint (```ckpt```): the set of weights that are loaded into a given transformer architecture. This often needs to be done well before we even initialise the model, since data preprocessing steps such as tokenization must follow the steps used to train the original model. We just need a pretrained base model for our purposes (i.e. one that has not yet been fine-tuned on a specific task). One popular lightweight option is ```distilbert-base-uncased```.

In [16]:
# Defining model checkpoint
model_ckpt = 'distilbert-base-uncased'

Tokenization is the process of breaking raw text into the desired atomic units for one's modelling task. This may be as simple as splitting the text into individual characters. In the case of transformer-based models, tokenization is a bit more complex, usually occurring at the sub-word level.
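BERT-family tokenizers use WordPiece, which splits a word into the longest pieces found in the vocabulary and marks continuation pieces with ```##```. The sketch below illustrates that greedy longest-match idea in plain Python. The function name `wordpiece_tokenize` and the tiny `toy_vocab` are assumptions made for illustration; the real tokenizer also normalizes text and uses a vocabulary of about 30,000 entries.

```python
def wordpiece_tokenize(word, vocab, unk_token="[UNK]"):
    """Greedy longest-match-first split of one word into sub-word pieces."""
    pieces = []
    start = 0
    while start < len(word):
        end = len(word)
        match = None
        # Try the longest remaining substring first, then shrink it
        while start < end:
            candidate = word[start:end]
            if start > 0:
                candidate = "##" + candidate  # continuation-piece marker
            if candidate in vocab:
                match = candidate
                break
            end -= 1
        if match is None:
            # No piece fits: the whole word maps to the unknown token
            return [unk_token]
        pieces.append(match)
        start = end
    return pieces


# Hypothetical toy vocabulary, for illustration only
toy_vocab = {"vaccin", "##ation", "side", "effect", "##s"}

print(wordpiece_tokenize("vaccination", toy_vocab))
print(wordpiece_tokenize("effects", toy_vocab))
print(wordpiece_tokenize("xyz", toy_vocab))
```

Because rare words decompose into known pieces, a fixed vocabulary can cover most text without resorting to the unknown token.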

In [17]:
# Loading the tokenizer that matches the model checkpoint
tokenizer = AutoTokenizer.from_pretrained(model_ckpt)
print(f'Vocabulary size: {tokenizer.vocab_size}, max context length: {tokenizer.model_max_length}')

Vocabulary size: 30522, max context length: 512


Two important arguments relating to tokenization:
1. ```padding```: Fills up sequences to a certain length, ensuring that all sequences in a batch have the same length. This is essential for training and inference with deep learning models that operate on fixed-size input tensors.
2. ```truncation```: Cuts off part of a sequence so that it fits within a specified maximum length (e.g. 512 tokens for BERT models).

The combination of padding and truncation ensures that all sequences in a batch conform to a consistent, fixed size, which is essential for processing in parallel on modern hardware like GPUs.
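As a rough illustration of what these two arguments do, the following sketch pads and truncates lists of token ids by hand. The helper `pad_and_truncate` is hypothetical and simplified: it cuts from the end of the sequence, whereas a real tokenizer keeps the closing special token when it truncates.

```python
def pad_and_truncate(batch_ids, max_length, pad_id=0):
    """Bring every sequence in the batch to exactly max_length token ids."""
    input_ids, attention_mask = [], []
    for ids in batch_ids:
        ids = ids[:max_length]             # truncation: drop tokens past the limit
        n_pad = max_length - len(ids)      # padding: number of filler tokens needed
        input_ids.append(ids + [pad_id] * n_pad)
        # 1 marks a real token, 0 marks padding the model should ignore
        attention_mask.append([1] * len(ids) + [0] * n_pad)
    return {"input_ids": input_ids, "attention_mask": attention_mask}


batch = pad_and_truncate(
    [[101, 2559, 102], [101, 2559, 2012, 1996, 9723, 102]],
    max_length=4,
)
print(batch["input_ids"])
print(batch["attention_mask"])
```

The attention mask is what lets the model tell padding apart from real input, which is why the tokenizer returns it alongside the ids.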

In [18]:
# Function to tokenize a batch of samples
def tokenize(batch):
    return tokenizer(batch['text'], padding="max_length", truncation=True)

# Tokenizing the dataset
vaccine = vaccine.map(tokenize, batched=True)
vaccine[0]


{'text': 'Looking at the probability of certain negative side effects and seeing if the vaccine genuinely protects someone would be very important.',
 'labels': 0.625,
 'input_ids': [101, 2559, 2012, 1996, 9723, 1997, 3056, 4997, 2217, 3896, 1998, 3773, 2065, 1996, 17404, 15958, 18227, 2619, 2052, 2022, 2200, 2590, 1012, 102, 0, 0, 0, ...],
 ...}

Looking at the first sample, we see some important special tokens:
1. ```[CLS]```: Often added at the beginning of the input sequence. In the context of classification tasks, the embedding corresponding to the [CLS] token (after passing through the model) is often used as the aggregate representation for the entire sequence.
2. ```[SEP]```: Used to separate different segments in a sequence. For example, in tasks that take two different sentences as input (such as question-answering or text-pair classification), the [SEP] token is placed between the two sentences to indicate that they are distinct segments. This helps the model understand and process the two segments appropriately, recognizing the boundaries between them.
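The placement of these special tokens can be sketched in a few lines. The helper `build_inputs` below is hypothetical; it only shows the layout used by BERT-style models for a single sequence and for a sentence pair, and works on tokens that have already been split.

```python
def build_inputs(tokens_a, tokens_b=None):
    """Wrap one or two token lists in BERT-style special tokens."""
    tokens = ["[CLS]"] + tokens_a + ["[SEP]"]
    if tokens_b is not None:
        # The second segment gets its own closing [SEP]
        tokens += tokens_b + ["[SEP]"]
    return tokens


single = build_inputs(["side", "effects"])
pair = build_inputs(["is", "it", "safe", "?"], ["yes", "."])
print(single)
print(pair)
```

In this workshop every sample is a single sequence, so each input has exactly one ```[CLS]``` at the start and one ```[SEP]``` before the padding.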

In [21]:
# Inspecting tokenization by looking at the first 30 tokens of the first sample
tokenizer.convert_ids_to_tokens(vaccine[0]['input_ids'])[:30]

['[CLS]',
 'looking',
 'at',
 'the',
 'probability',
 'of',
 'certain',
 'negative',
 'side',
 'effects',
 'and',
 'seeing',
 'if',
 'the',
 'vaccine',
 'genuinely',
 'protects',
 'someone',
 'would',
 'be',
 'very',
 'important',
 '.',
 '[SEP]',
 '[PAD]',
 '[PAD]',
 '[PAD]',
 '[PAD]',
 '[PAD]',
 '[PAD]']

# Feature Extraction
We require two packages/modules for feature extraction:
1. ```torch```: The PyTorch package, which is the most popular deep learning framework amongst researchers (https://paperswithcode.com/trends).
2. ```AutoModel```: A class from the HF ```transformers``` package that allows us to load a pretrained model and use it for inference. This is a very convenient way to use pretrained models, since it abstracts away the details of the model architecture and allows us to focus on the task at hand.

In [22]:
import torch
torch.manual_seed(42) # For reproducibility
from transformers import AutoModel

In order to pass our data to the model, we need to convert it to torch tensors. If you are familiar with NumPy, torch tensors are very similar, but with the added benefit of being able to run on GPUs (which are optimized for tensor operations).

In [23]:
vaccine.set_format('torch', columns=['input_ids', 'attention_mask', 'labels'])
vaccine

Dataset({
    features: ['text', 'labels', 'input_ids', 'attention_mask'],
    num_rows: 1045
})

In [24]:
# Loading the model and moving it to the GPU if available
if torch.cuda.is_available():  # for nvidia GPUs
    device = torch.device('cuda')
elif torch.backends.mps.is_available(): # for Apple Metal Performance Shaders (mps) GPUs
    device = torch.device('mps')
else:
    device = torch.device('cpu')

device

device(type='mps')

In [25]:
# Loading distilbert-base-uncased and moving it to the GPU if available
model = AutoModel.from_pretrained(model_ckpt).to(device)
f'Model inputs: {tokenizer.model_input_names}'

Some weights of the model checkpoint at distilbert-base-uncased were not used when initializing DistilBertModel: ['vocab_transform.bias', 'vocab_transform.weight', 'vocab_layer_norm.bias', 'vocab_layer_norm.weight', 'vocab_projector.bias', 'vocab_projector.weight']
- This IS expected if you are initializing DistilBertModel from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing DistilBertModel from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).


"Model inputs: ['input_ids', 'attention_mask']"

Extracting features from the model is as simple as passing the data to the model and extracting the last hidden state (the activations of the final layer of the model). We here extract the hidden state for the [CLS] token, which is often used as the aggregate representation for the entire sequence.
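The last hidden state has one vector per token position for every sequence in the batch, i.e. the shape (batch size, sequence length, hidden size). The toy example below uses nested Python lists with made-up numbers to show what selecting position 0 of every sequence amounts to; with a tensor the same selection is written with the index ```[:, 0]```.

```python
# Toy "last hidden state": 2 sequences, 3 token positions, hidden size 4
# (made-up numbers; the real model uses a hidden size of 768)
last_hidden_state = [
    [[0.1, 0.2, 0.3, 0.4], [1.0, 1.0, 1.0, 1.0], [2.0, 2.0, 2.0, 2.0]],
    [[0.5, 0.6, 0.7, 0.8], [3.0, 3.0, 3.0, 3.0], [4.0, 4.0, 4.0, 4.0]],
]

# Keep only the vector at position 0 (the [CLS] token) of each sequence
cls_vectors = [sequence[0] for sequence in last_hidden_state]

print(len(cls_vectors), len(cls_vectors[0]))  # batch size, hidden size
print(cls_vectors)
```

The result has one fixed-length vector per sample, which is the format a downstream regression model expects.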

In [26]:
def extract_features(batch):
    # Each batch is a dictionary with keys corresponding to the feature names. We only need the input ids and attention masks
    inputs = {k:v.to(device) for k, v in batch.items() if k in tokenizer.model_input_names}

    # Tell torch not to build the computation graph during inference with `torch.no_grad()`
    with torch.no_grad():
        last_hidden_state = model(**inputs).last_hidden_state # Extract last hidden states

    # Return vector for [CLS] token
    return {"hidden_state": last_hidden_state[:,0].cpu().numpy()}

# Extracting features. Features are extracted in batches of 8 samples to avoid running out of memory.
vaccine = vaccine.map(extract_features, batched=True, batch_size=8)
vaccine['hidden_state'].shape


torch.Size([1045, 768])

# Predicting vaccine decisions with features
To predict the decisions from the extracted features and evaluate prediction performance, we will use ```sklearn``` (scikit-learn), a general-purpose machine learning library. Since we are dealing with high-dimensional embeddings, ordinary least squares regression risks overfitting. Instead, we use a regularized linear regression model (```RidgeCV```). We evaluate model performance on a held-out test set using the coefficient of determination ($R^2$).
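For readers unfamiliar with $R^2$, it compares the squared error of the model with the squared error of always predicting the mean: $R^2 = 1 - SS_{res}/SS_{tot}$. The sketch below computes it by hand on made-up numbers; the function name `r_squared` is chosen here for illustration and is not part of scikit-learn.

```python
def r_squared(y_true, y_pred):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    mean_y = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))  # model error
    ss_tot = sum((t - mean_y) ** 2 for t in y_true)             # error of the mean
    return 1 - ss_res / ss_tot


y_true = [0.25, 0.5, 0.75, 1.0]  # made-up targets

perfect = r_squared(y_true, y_true)                    # every prediction exact
baseline = r_squared(y_true, [0.625] * 4)              # always predict the mean
reversed_ = r_squared(y_true, [1.0, 0.75, 0.5, 0.25])  # worse than the mean

print(perfect, baseline, reversed_)
```

A value of 1 means perfect prediction, 0 means no better than predicting the mean, and negative values mean the model does worse than that baseline on the test data.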

In [27]:
from sklearn.linear_model import RidgeCV
from sklearn.model_selection import train_test_split

In [28]:
# Converting features to a pandas dataframe for compatibility with sklearn
embeds = pd.DataFrame(vaccine['hidden_state'])
embeds

| | 0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 | ... | 758 | 759 | 760 | 761 | 762 | 763 | 764 | 765 | 766 | 767 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0.086551 | -0.138291 | -0.136351 | -0.044146 | -0.006536 | -0.196219 | 0.129933 | 0.005525 | 0.046519 | -0.363784 | ... | -0.094813 | -0.216771 | -0.098917 | 0.025280 | 0.003007 | 0.224140 | -0.215962 | -0.277577 | 0.211385 | 0.269926 |
| 1 | 0.011459 | -0.027934 | 0.082097 | 0.002163 | -0.157471 | -0.063026 | 0.014516 | 0.165165 | 0.010776 | -0.129604 | ... | -0.064945 | -0.057040 | 0.017817 | -0.079575 | -0.001286 | -0.021249 | -0.276787 | -0.061606 | 0.282086 | 0.356513 |
| 2 | 0.040985 | -0.031094 | 0.085166 | -0.015342 | -0.150943 | -0.062235 | 0.269550 | -0.149736 | 0.259255 | -0.130736 | ... | -0.165882 | -0.097996 | -0.081691 | -0.126770 | 0.114904 | 0.140170 | -0.236507 | -0.087407 | 0.400360 | 0.436200 |
| 3 | -0.026979 | -0.054438 | -0.002667 | 0.077894 | -0.196275 | 0.081419 | 0.168415 | 0.052178 | 0.094805 | -0.340330 | ... | -0.128440 | -0.221502 | 0.042931 | -0.211251 | 0.098081 | 0.116236 | -0.258981 | -0.128494 | 0.248675 | 0.285449 |
| 4 | 0.117591 | 0.008332 | -0.051524 | -0.174926 | -0.051919 | -0.172235 | 0.188875 | 0.096570 | 0.075528 | -0.284214 | ... | -0.037496 | -0.080321 | 0.052056 | -0.079922 | 0.106612 | -0.104730 | -0.262912 | -0.184733 | 0.235717 | 0.172998 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 1040 | 0.060096 | -0.027563 | -0.197975 | -0.167290 | -0.256647 | -0.193959 | 0.278221 | -0.008111 | 0.037555 | -0.141178 | ... | -0.165374 | -0.132563 | -0.002246 | 0.049726 | 0.072532 | -0.039805 | -0.274175 | 0.038117 | 0.321610 | 0.417477 |
| 1041 | -0.157278 | -0.037632 | -0.135468 | -0.104194 | -0.073145 | -0.114484 | 0.210148 | 0.364075 | -0.133996 | -0.354392 | ... | 0.080500 | -0.176303 | 0.098889 | -0.213833 | 0.114860 | 0.073110 | -0.062321 | -0.282838 | 0.039596 | 0.202290 |
| 1042 | -0.264745 | -0.118108 | -0.243033 | -0.065164 | -0.172901 | -0.085691 | 0.123498 | 0.065335 | -0.042342 | -0.166539 | ... | -0.049490 | 0.014468 | -0.110201 | -0.100587 | 0.256933 | -0.017435 | -0.034038 | -0.193906 | 0.303171 | 0.435974 |
| 1043 | 0.214027 | 0.031197 | -0.053656 | -0.248237 | -0.046856 | -0.366515 | 0.371055 | 0.387127 | -0.086699 | -0.266235 | ... | -0.101303 | -0.187162 | 0.108514 | -0.177368 | 0.260718 | 0.150828 | -0.205036 | -0.006340 | 0.436779 | 0.446601 |


In [29]:
# Instantiating the RidgeCV model
regr = RidgeCV()
regr

In [38]:
# Splitting the data into train and test sets
X_train, X_test, y_train, y_test = train_test_split(embeds, vaccine['labels'], random_state=42)
f'Train size: {len(X_train)}, test size: {len(X_test)}'

'Train size: 783, test size: 262'

In [39]:
# Fitting the model and evaluating performance
regr.fit(X_train, y_train)
f'Test R2 = {regr.score(X_test, y_test).round(2)}'

'Test R2 = 0.12'

*Exercise*: Feel free to try out different regression models (e.g. LassoCV): https://scikit-learn.org/stable/supervised_learning.html

# Predicting vaccine decisions with LM fine-tuning
Here we use three classes from the transformers library:
1. ```AutoModelForSequenceClassification```: Loads a pretrained model with a head attached, ready to be fine-tuned on sequence classification/regression labels.
2. ```TrainingArguments```: Specifies training arguments such as the number of epochs, batch size, learning rate, etc.
3. ```Trainer```: Allows us to train a model using the training arguments and a training dataset.

We also employ the ```evaluate``` library to compute the coefficient of determination ($R^2$) on the test set.

In [40]:
from transformers import AutoModelForSequenceClassification, TrainingArguments, Trainer
import evaluate

In [None]:
# Splitting the data into train and test sets
vaccine = vaccine.train_test_split(test_size=0.2, seed=42) # fixed seed for a reproducible split
vaccine

The main difference from the model we used for feature extraction is that ```distilbert-base-uncased``` now has a classification/regression head attached. You will see a warning that some parts of the model are randomly initialized. This is expected, since the head has not yet been trained.
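When a single label is requested, the HF sequence classification models treat the task as regression and train with a mean squared error loss. The sketch below shows that idea in plain Python under a simplifying assumption: the head is reduced to one linear layer on a pooled feature vector, whereas the actual DistilBERT head contains an extra layer. The names `linear_head` and `mse_loss` and all numbers are illustrative.

```python
def linear_head(features, weights, bias):
    """Map a pooled feature vector to a single number (simplified head)."""
    return sum(f * w for f, w in zip(features, weights)) + bias


def mse_loss(preds, targets):
    """Mean squared error between predictions and targets."""
    return sum((p - t) ** 2 for p, t in zip(preds, targets)) / len(preds)


# Made-up weights; in the real model they start out random and are learned
weights, bias = [0.5, 0.25], 0.25

pred = linear_head([1.0, 2.0], weights, bias)
loss = mse_loss([pred, 0.5], [0.75, 0.5])
print(pred, loss)
```

Fine-tuning adjusts both the head and the pretrained layers underneath it to reduce this loss, which is what distinguishes it from fitting a regression on frozen features.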

In [None]:
# Loading distilbert-base-uncased and moving it to the GPU if available
model = (AutoModelForSequenceClassification
         .from_pretrained(model_ckpt, num_labels=1) # num_labels=1 for regression
         .to(device))

model

In [None]:
# Setting up training arguments for the trainer
model_name = f"{model_ckpt}-finetuned-vaccine"
batch_size = 8
training_args = TrainingArguments(
    output_dir=model_name,  # output directory to save training checkpoints
    per_device_train_batch_size=batch_size,
    per_device_eval_batch_size=batch_size,
    logging_strategy="epoch", # log training metrics at the end of every epoch
    evaluation_strategy="epoch", # evaluate at the end of every epoch (named `eval_strategy` in recent transformers releases)
    num_train_epochs=7, # number of times to iterate over the training data
)


def compute_metrics(eval_preds):
    """Computes the coefficient of determination (R2) on the evaluation set"""
    metric = evaluate.load("r_squared")
    preds, labels = eval_preds
    preds = preds.squeeze(-1)  # predictions have shape (n_samples, 1); flatten to match labels
    return {"r_squared": metric.compute(predictions=preds, references=labels)}


# Instantiating the trainer
trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=vaccine['train'],
    eval_dataset=vaccine['test'],
    compute_metrics=compute_metrics,
)

# Training the model
trainer.train()