In [None]:
# # Installing packages in Google Colab environment
# !pip install datasets transformers evaluate
# !pip install accelerate -U
#
# # Mounting google drive to enable access to data files
# from google.colab import drive
# drive.mount('/content/drive')
#
# # Changing working directory to ex1
# %cd /content/drive/MyDrive/LLM4JDM/ex1

# Preparing data

We begin by loading the required packages. For those coming from R: Python packages are often imported under a shorter alias via the ```import <name> as <nickname>``` syntax (e.g. ```import pandas as pd```), and these aliases are usually standardized by convention. We use three packages here:
1. ```pandas```: A very popular library for reading and manipulating data in Python.
2. ```datasets```: A package from the HuggingFace (HF) ecosystem for loading and manipulating datasets in a format ready for use with HF models.
3. ```transformers```: A package from the HF ecosystem for loading and manipulating transformer-based models.

In [1]:
import pandas as pd
from datasets import Dataset
from transformers import AutoTokenizer

In [13]:
vaccine = pd.read_csv('vaccine.csv')
vaccine

Unnamed: 0,labels,text
0,0.625,Looking at the probability of certain negative...
1,0.750,"Not really, I looked at all the numbers and ma..."
2,0.750,I weighed the side effects with the benefits a...
3,0.375,Percentages were important. In all things (med...
4,0.250,I would look up the potential side effects and...
...,...,...
1046,0.750,I looked primarily at the effectiveness of eac...
1047,0.125,Personal experience talking
1048,0.750,Biased against the vaccine
1049,0.250,I have had a lot of people whos gotten the vac...


In [14]:
# Converting pandas dataframe to HF Dataset
vaccine = Dataset.from_pandas(vaccine)
vaccine

Dataset({
    features: ['labels', 'text'],
    num_rows: 1051
})

Features of the ```Dataset``` object can be accessed like keys in a dictionary and behave like Python lists. Individual samples can be accessed by index, which returns a dictionary whose keys are the feature names. This reflects the fact that ```datasets``` is built on Apache Arrow, which defines a typed columnar format that is more memory-efficient than native Python objects.
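The access pattern described above can be mimicked with plain Python containers. The sketch below is only an analogy for how columns and rows are addressed, not how ```datasets``` is implemented; ```columns``` and ```get_row``` are hypothetical names, and the two rows are copied (truncated as displayed) from the table printed earlier.

```python
# A toy columnar table: one list per feature, all of equal length.
# This is an analogy for the access pattern only, not the Arrow-backed implementation.
columns = {
    "labels": [0.625, 0.750],
    "text": [
        "Looking at the probability of certain negative...",
        "Not really, I looked at all the numbers and ma...",
    ],
}

def get_row(table, index):
    """Return one sample as a dictionary keyed by feature name."""
    return {name: values[index] for name, values in table.items()}

label_column = columns["labels"]     # access by key gives the whole column
first_sample = get_row(columns, 0)   # access by index gives one sample as a dict

print(label_column)
print(first_sample["labels"])
```

Storing the data column by column is what lets a typed format keep each feature in one contiguous, uniformly typed block.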

In [19]:
vaccine[0]

{'labels': 0.625,
 'text': 'Looking at the probability of certain negative side effects and seeing if the vaccine genuinely protects someone would be very important.'}

To use models in the HF ecosystem, one must first define a model checkpoint (```ckpt```): the set of weights that are loaded into a given transformer architecture. This often needs to be done well before we even initialise the model, since data preprocessing steps such as tokenization must follow the steps used to train the original model. We just need a pretrained base model for our purposes (i.e. one that has not yet been fine-tuned on a specific task). One popular lightweight option is ```distilbert-base-uncased```.

In [20]:
# Defining model checkpoint
model_ckpt = 'distilbert-base-uncased'

Tokenization is the process of breaking raw text into the desired atomic units for one's modelling task. This may be as simple as splitting the text into individual characters. In the case of transformer-based models, tokenization is a bit more complex, usually occurring at the sub-word level.
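To make sub-word tokenization concrete, here is a minimal sketch of greedy longest-match splitting in the style of WordPiece, the scheme used by BERT-family tokenizers. The function name ```wordpiece_tokenize``` and the tiny ```toy_vocab``` are assumptions made for illustration; a real tokenizer also handles normalization, punctuation and a vocabulary of tens of thousands of pieces.

```python
def wordpiece_tokenize(word, vocab, unk_token="[UNK]"):
    """Split one lowercase word into sub-word pieces by greedy longest match."""
    pieces = []
    start = 0
    while start < len(word):
        end = len(word)
        match = None
        # Try the longest remaining substring first, then shrink it.
        while end > start:
            candidate = word[start:end]
            if start > 0:
                candidate = "##" + candidate  # continuation pieces carry a '##' prefix
            if candidate in vocab:
                match = candidate
                break
            end -= 1
        if match is None:
            return [unk_token]  # no piece fits, so the whole word is unknown
        pieces.append(match)
        start = end
    return pieces

# Hypothetical toy vocabulary, far smaller than a real one
toy_vocab = {"vaccine", "vacc", "##ine", "##s", "protect", "effect", "##ive", "##ness"}

for word in ["vaccines", "protects", "effectiveness", "xyz"]:
    print(word, wordpiece_tokenize(word, toy_vocab))
```

Frequent words survive as single tokens, while rarer words are assembled from smaller pieces, which keeps the vocabulary compact without discarding unseen words entirely.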

In [21]:
# Tokenizing the dataset
tokenizer = AutoTokenizer.from_pretrained(model_ckpt)
print(f'Vocabulary size: {tokenizer.vocab_size}, max context length: {tokenizer.model_max_length}')

Vocabulary size: 30522, max context length: 512


Two important arguments relating to tokenization:
1. ```padding```: Fills sequences up to a certain length so that all sequences in a batch have the same length. This is essential for training and inference with deep learning models, which operate on fixed-size input tensors.
2. ```truncation```: Cuts off the end of a sequence so that it fits within a specified maximum length (e.g. 512 tokens for BERT models).

Together, padding and truncation ensure that all sequences in a batch conform to a consistent, fixed size, which is essential for processing them in parallel on modern hardware such as GPUs.
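The effect of these two arguments can be sketched on plain lists of token ids. This is a simplified illustration, not the tokenizer's own code: ```pad_and_truncate``` is a hypothetical helper, the padding id of 0 is assumed to match the ```[PAD]``` token of this vocabulary, and unlike a real tokenizer the sketch does not preserve the final special token when truncating.

```python
def pad_and_truncate(ids, max_length, pad_id=0):
    """Return (input_ids, attention_mask), both exactly max_length long."""
    ids = ids[:max_length]                          # truncation: drop tokens past the limit
    n_pad = max_length - len(ids)                   # padding needed to reach max_length
    attention_mask = [1] * len(ids) + [0] * n_pad   # 1 = real token, 0 = padding
    return ids + [pad_id] * n_pad, attention_mask

# Two sequences of different lengths
batch = [
    [101, 2559, 2012, 102],
    [101, 2559, 2012, 1996, 9723, 1997, 102],
]

for ids in batch:
    input_ids, attention_mask = pad_and_truncate(ids, max_length=6)
    print(input_ids, attention_mask)
```

The attention mask is what lets the model ignore padded positions, so padding changes the shape of the input without changing its content.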

In [25]:
tokenize = lambda batch: tokenizer(batch['text'], padding="max_length", truncation=True)
vaccine = vaccine.map(tokenize, batched=True)
vaccine[0]

Map:   0%|          | 0/1051 [00:00<?, ? examples/s]

{'labels': tensor(0.6250),
 'input_ids': tensor([  101,  2559,  2012,  1996,  9723,  1997,  3056,  4997,  2217,  3896,
          1998,  3773,  2065,  1996, 17404, 15958, 18227,  2619,  2052,  2022,
          2200,  2590,  1012,   102,     0,     0,     0,     0,     0,     0,
             0,     0,     0,     0,     0,     0,     0,     0,     0,     0,
             0,     0,     0,     0,     0,     0,     0,     0,     0,     0,
             0,     0,     0,     0,     0,     0,     0,     0,     0,     0,
             0,     0,     0,     0,     0,     0,     0,     0,     0,     0,
             0,     0,     0,     0,     0,     0,     0,     0,     0,     0,
             0,     0,     0,     0,     0,     0,     0,     0,     0,     0,
             0,     0,     0,     0,     0,     0,     0,     0,     0,     0,
             0,     0,     0,     0,     0,     0,     0,     0,     0,     0,
             0,     0,     0,     0,     0,     0,     0,     0,     0,     0,
            

Looking at the first sample, we see some important special tokens:
1. ```[CLS]```: Often added at the beginning of the input sequence. In the context of classification tasks, the embedding corresponding to the [CLS] token (after passing through the model) is often used as the aggregate representation for the entire sequence.
2. ```[SEP]```: Used to separate different segments in a sequence. For example, in tasks that take two different sentences as input (such as question-answering or text-pair classification), the [SEP] token is placed between the two sentences to indicate that they are distinct segments. This helps the model understand and process the two segments appropriately, recognizing the boundaries between them.
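The placement of these special tokens can be sketched as follows. ```add_special_tokens``` is a hypothetical helper written for illustration, and the example sentences are made up; in practice the tokenizer inserts these tokens itself.

```python
def add_special_tokens(tokens_a, tokens_b=None):
    """Wrap one or two token lists in BERT-style special tokens."""
    tokens = ["[CLS]"] + tokens_a + ["[SEP]"]   # single segment: [CLS] A [SEP]
    if tokens_b is not None:
        tokens += tokens_b + ["[SEP]"]          # pair: [CLS] A [SEP] B [SEP]
    return tokens

single = add_special_tokens(["personal", "experience", "talking"])
pair = add_special_tokens(["is", "it", "safe", "?"], ["yes", "."])

print(single)
print(pair)
```

In this notebook each response is a single segment, so every sequence has exactly one ```[CLS]``` at position 0 and one ```[SEP]``` before the padding begins.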

In [28]:
# Inspecting tokenization
tokenizer.convert_ids_to_tokens(vaccine[0]['input_ids'])

['[CLS]',
 'looking',
 'at',
 'the',
 'probability',
 'of',
 'certain',
 'negative',
 'side',
 'effects',
 'and',
 'seeing',
 'if',
 'the',
 'vaccine',
 'genuinely',
 'protects',
 'someone',
 'would',
 'be',
 'very',
 'important',
 '.',
 '[SEP]',
 '[PAD]',
 '[PAD]',
 ...

In [24]:
vaccine.set_format('torch', columns=['input_ids', 'attention_mask', 'labels'])
vaccine[0]

{'labels': tensor(0.6250),
 'input_ids': tensor([  101,  2559,  2012,  1996,  9723,  1997,  3056,  4997,  2217,  3896,
          1998,  3773,  2065,  1996, 17404, 15958, 18227,  2619,  2052,  2022,
          2200,  2590,  1012,   102,     0,     0,     0,     0,     0,     0,
             0,     0,     0,     0,     0,     0,     0,     0,     0,     0,
             0,     0,     0,     0,     0,     0,     0,     0,     0,     0,
             0,     0,     0,     0,     0,     0,     0,     0,     0,     0,
             0,     0,     0,     0,     0,     0,     0,     0,     0,     0,
             0,     0,     0,     0,     0,     0,     0,     0,     0,     0,
             0,     0,     0,     0,     0,     0,     0,     0,     0,     0,
             0,     0,     0,     0,     0,     0,     0,     0,     0,     0,
             0,     0,     0,     0,     0,     0,     0,     0,     0,     0,
             0,     0,     0,     0,     0,     0,     0,     0,     0,     0,
            

# Feature Extraction

In [6]:
import torch
torch.manual_seed(42)
from transformers import AutoModel

In [7]:
# Loading the model and moving it to the GPU if available ('cuda' for nvidia GPUs and 'mps' for Apple's Metal Performance Shaders)
if torch.cuda.is_available():
    device = torch.device('cuda')
elif torch.backends.mps.is_available():
    device = torch.device('mps')
else:
    device = torch.device('cpu')

device

device(type='mps')

In [8]:
model = AutoModel.from_pretrained(model_ckpt).to(device)

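def extract_features(batch):
    """Extract the final-layer [CLS] embedding for each item in a batch"""
    # Keeping only the columns the model expects (input_ids, attention_mask)
    inputs = {k:v.to(device) for k, v in batch.items() if k in tokenizer.model_input_names}
    with torch.no_grad():
        last_hidden_state = model(**inputs).last_hidden_state
    # Position 0 of each sequence holds the [CLS] token
    return {"hidden_state": last_hidden_state[:,0].cpu().numpy()}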

vaccine = vaccine.map(extract_features, batched=True, batch_size=8)
embeds = pd.DataFrame(vaccine['hidden_state'])
embeds

Map:   0%|          | 0/968 [00:00<?, ? examples/s]

Unnamed: 0,0,1,2,3,4,5,6,7,8,9,...,758,759,760,761,762,763,764,765,766,767
0,0.086551,-0.138291,-0.136351,-0.044146,-0.006536,-0.196219,0.129933,0.005525,0.046519,-0.363784,...,-0.094813,-0.216771,-0.098917,0.025280,0.003007,0.224140,-0.215962,-0.277577,0.211385,0.269926
1,-0.086077,-0.046284,0.170872,-0.090205,-0.148626,-0.077882,0.115379,0.122023,0.147362,-0.176775,...,0.094826,-0.011131,0.072211,-0.027861,-0.058841,0.019084,-0.260078,-0.105455,0.249027,0.535908
2,0.040985,-0.031094,0.085166,-0.015342,-0.150943,-0.062235,0.269550,-0.149736,0.259255,-0.130736,...,-0.165882,-0.097996,-0.081691,-0.126770,0.114904,0.140170,-0.236507,-0.087407,0.400360,0.436200
3,-0.088134,-0.037886,0.070339,0.060876,-0.184012,0.031210,0.169997,0.052630,0.159975,-0.296517,...,-0.023511,-0.199103,0.055980,-0.137853,0.082254,0.127104,-0.297239,-0.143943,0.213958,0.313107
4,0.069829,0.019595,-0.001682,-0.246711,-0.061300,-0.168547,0.322209,0.217674,0.237314,-0.358912,...,0.042846,0.005196,0.082999,-0.093391,0.086103,-0.116669,-0.277758,-0.230386,0.221281,0.411202
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
963,0.122108,-0.101899,0.092345,-0.037650,-0.153381,-0.111409,0.147186,0.085400,0.108409,-0.179295,...,-0.119313,-0.141489,-0.058733,0.005137,0.079066,0.090325,-0.175999,-0.101646,0.329370,0.292443
964,0.060096,-0.027563,-0.197975,-0.167290,-0.256647,-0.193959,0.278221,-0.008111,0.037555,-0.141178,...,-0.165374,-0.132563,-0.002246,0.049726,0.072532,-0.039805,-0.274175,0.038117,0.321610,0.417477
965,-0.264745,-0.118108,-0.243033,-0.065164,-0.172901,-0.085691,0.123498,0.065335,-0.042342,-0.166539,...,-0.049490,0.014468,-0.110201,-0.100587,0.256933,-0.017435,-0.034038,-0.193906,0.303171,0.435974
966,0.214027,0.031197,-0.053656,-0.248237,-0.046856,-0.366515,0.371055,0.387127,-0.086699,-0.266235,...,-0.101303,-0.187162,0.108514,-0.177368,0.260718,0.150828,-0.205036,-0.006340,0.436779,0.446601


# Predicting vaccine decisions with embeddings

In [9]:
from sklearn.linear_model import RidgeCV
from sklearn.model_selection import train_test_split

In [10]:
regr = RidgeCV()
X_train, X_test, y_train, y_test = train_test_split(embeds, vaccine['labels'], random_state=42)
regr.fit(X_train, y_train)
print(regr.score(X_test, y_test))

0.08978435256497175
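The value printed by ```regr.score``` is the coefficient of determination, R², on the held-out data. As a rough illustration of what that number means, the sketch below computes R² by hand; ```r_squared``` is a hypothetical helper and the five labels are taken from the first rows of the table shown earlier, not from the actual test split.

```python
def r_squared(y_true, y_pred):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    mean_y = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))   # residual sum of squares
    ss_tot = sum((t - mean_y) ** 2 for t in y_true)              # total sum of squares
    return 1 - ss_res / ss_tot

y_true = [0.625, 0.75, 0.75, 0.375, 0.25]

# Predicting the mean for every sample is the reference point with R² = 0
mean_baseline = [sum(y_true) / len(y_true)] * len(y_true)

print(r_squared(y_true, y_true))         # perfect predictions
print(r_squared(y_true, mean_baseline))  # always predicting the mean
```

Read this way, a score of about 0.09 means the ridge regression on the embeddings explains roughly 9% of the variance in the held-out labels, only slightly better than always predicting the mean.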


# Predicting vaccine decisions with LM fine-tuning

In [11]:
from transformers import AutoModelForSequenceClassification, TrainingArguments, Trainer
import evaluate

In [12]:
vaccine = vaccine.train_test_split(test_size=0.2, seed=42) # fixed seed for a reproducible split
vaccine

DatasetDict({
    train: Dataset({
        features: ['labels', 'text', '__index_level_0__', 'input_ids', 'attention_mask', 'hidden_state'],
        num_rows: 774
    })
    test: Dataset({
        features: ['labels', 'text', '__index_level_0__', 'input_ids', 'attention_mask', 'hidden_state'],
        num_rows: 194
    })
})

In [13]:
# Defining the model
model = (AutoModelForSequenceClassification
         .from_pretrained(model_ckpt, num_labels=1) # num_labels=1 for regression
         .to(device))

model

Some weights of DistilBertForSequenceClassification were not initialized from the model checkpoint at distilbert-base-uncased and are newly initialized: ['pre_classifier.weight', 'classifier.weight', 'classifier.bias', 'pre_classifier.bias']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


DistilBertForSequenceClassification(
  (distilbert): DistilBertModel(
    (embeddings): Embeddings(
      (word_embeddings): Embedding(30522, 768, padding_idx=0)
      (position_embeddings): Embedding(512, 768)
      (LayerNorm): LayerNorm((768,), eps=1e-12, elementwise_affine=True)
      (dropout): Dropout(p=0.1, inplace=False)
    )
    (transformer): Transformer(
      (layer): ModuleList(
        (0-5): 6 x TransformerBlock(
          (attention): MultiHeadSelfAttention(
            (dropout): Dropout(p=0.1, inplace=False)
            (q_lin): Linear(in_features=768, out_features=768, bias=True)
            (k_lin): Linear(in_features=768, out_features=768, bias=True)
            (v_lin): Linear(in_features=768, out_features=768, bias=True)
            (out_lin): Linear(in_features=768, out_features=768, bias=True)
          )
          (sa_layer_norm): LayerNorm((768,), eps=1e-12, elementwise_affine=True)
          (ffn): FFN(
            (dropout): Dropout(p=0.1, inplace=False)
 

In [14]:
# Training the model
training_args = TrainingArguments(
    output_dir="test_trainer",
    evaluation_strategy="epoch", # named `eval_strategy` in newer transformers releases
    num_train_epochs=3,
)

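# Loading the metric once rather than on every evaluation call
r2_metric = evaluate.load("r_squared")

def compute_metrics(eval_preds):
    preds, labels = eval_preds
    # With num_labels=1 the predictions have shape (n, 1); flatten them to match the labels
    preds = preds.reshape(-1)
    return {"r_squared": r2_metric.compute(predictions=preds, references=labels)}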


trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=vaccine['train'],
    eval_dataset=vaccine['test'],
    compute_metrics=compute_metrics,
)

In [15]:
trainer.train()



Epoch,Training Loss,Validation Loss


TrainOutput(global_step=291, training_loss=0.07256945056194292, metrics={'train_runtime': 189.2224, 'train_samples_per_second': 12.271, 'train_steps_per_second': 1.538, 'total_flos': 307583814260736.0, 'train_loss': 0.07256945056194292, 'epoch': 3.0})