# Homework 10

In this homework, you will train a sentiment classifier on the [SST-2](https://huggingface.co/datasets/sst2) dataset using the pre-trained BERT model. For simplicity, I recommend using the [Hugging Face Transformers library](https://huggingface.co/docs/transformers/index). I've linked to corresponding tutorials below. You're welcome to use a different framework if you prefer.

**Listed Collaborations:** Huge thanks to Kerem Zaman for helping me debug an obscure IndexError that left me scratching my head for the better part of an evening. It turned out to be the direct result of using a distilbert-base-uncased tokenizer with a distilbert-base-cased model ;(

In [1]:
pip install transformers datasets evaluate

Looking in indexes: https://pypi.org/simple, https://us-python.pkg.dev/colab-wheels/public/simple/
Collecting transformers
  Downloading transformers-4.27.4-py3-none-any.whl (6.8 MB)
Collecting datasets
  Downloading datasets-2.11.0-py3-none-any.whl (468 kB)
Collecting evaluate
  Downloading evaluate-0.4.0-py3-none-any.whl (81 kB)
Collecting huggingface-hub<1.0,>=0.11.0
  Downloading huggingface_hub-0.13.3-py3-none-any.whl (199 kB)
Collecting tokenizers!=0.11.3,<0.14,>=0.11.1
  Downloading tokenizers-0

# Problem 1

1. Fine-tune [DistilBERT](https://huggingface.co/distilbert-base-uncased) from scratch on SST-2 and evaluate the results. You can find a tutorial for loading BERT and fine-tuning [here](https://huggingface.co/docs/transformers/training). In that tutorial, you will need to change the dataset from `"yelp_review_full"` to `"sst2"` and the model from `"bert-base-uncased"` to `"distilbert-base-uncased"`. You'll also need to modify the code since SST-2 is a two-class classification dataset (unlike the Yelp Reviews dataset, which is a five-class classification dataset).
2. Choose a different pre-trained BERT-style model from the [Hugging Face Model Hub](https://huggingface.co/models) and fine-tune it. There are tons of options - part of the homework is navigating the hub to find different models! I recommend picking a model that is smaller than BERT-Base (as DistilBERT is) just to make things computationally cheaper. Is the final validation accuracy higher or lower with this other model?

### Import Dependencies

In [2]:
import numpy as np
from tqdm.auto import tqdm
import matplotlib.pyplot as plt
import evaluate

from datasets import load_dataset
# Transformers by Hugging Face
from transformers import AutoTokenizer
from transformers import AutoModelForSequenceClassification
from transformers import get_scheduler
from transformers import TrainingArguments, Trainer
# Torch Modules
import torch
from torch.utils.data import DataLoader
from torch.optim import Adam

### Load and Process the Data

In [3]:
# Load the Stanford Sentiment Treebank (Binary) SST2 Dataset

dataset = load_dataset("sst2")
dataset["train"][10]

Downloading and preparing dataset sst2/default to /root/.cache/huggingface/datasets/sst2/default/2.0.0/9896208a8d85db057ac50c72282bcb8fe755accc671a57dd8059d4e130961ed5...


Downloading data:   0%|          | 0.00/7.44M [00:00<?, ?B/s]

Generating train split:   0%|          | 0/67349 [00:00<?, ? examples/s]

Generating validation split:   0%|          | 0/872 [00:00<?, ? examples/s]

Generating test split:   0%|          | 0/1821 [00:00<?, ? examples/s]

Dataset sst2 downloaded and prepared to /root/.cache/huggingface/datasets/sst2/default/2.0.0/9896208a8d85db057ac50c72282bcb8fe755accc671a57dd8059d4e130961ed5. Subsequent calls will reuse this data.


  0%|          | 0/3 [00:00<?, ?it/s]

{'idx': 10, 'sentence': 'goes to absurd lengths ', 'label': 0}

In [4]:
# Get the Correct Tokenizer
tokenizer = AutoTokenizer.from_pretrained("distilbert-base-uncased")

def tokenize_function(examples):
    """Pad and Tokenize A Given Example"""
    return tokenizer(examples["sentence"], padding="max_length", truncation=True)

# Tokenize the Entire Dataset
tokenized_datasets = dataset.map(tokenize_function, batched=True)

Map:   0%|          | 0/67349 [00:00<?, ? examples/s]

Map:   0%|          | 0/872 [00:00<?, ? examples/s]

Map:   0%|          | 0/1821 [00:00<?, ? examples/s]
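
Because `tokenize_function` uses `padding="max_length"`, every sentence is padded to the tokenizer's maximum length (presumably 512 for this checkpoint), even though SST-2 sentences are short. The following is a rough sketch of how much padding that implies compared with padding each batch to its longest member. The token lengths are made-up illustrative values, not measured from the dataset, and the helper names are hypothetical.

```python
def tokens_with_max_length_padding(lengths, max_length=512):
    """Total token slots when every sequence is padded to max_length."""
    return len(lengths) * max_length


def tokens_with_dynamic_padding(lengths, batch_size=8):
    """Total token slots when each batch is padded to its own longest sequence."""
    total = 0
    for start in range(0, len(lengths), batch_size):
        batch = lengths[start:start + batch_size]
        total += len(batch) * max(batch)
    return total


# Hypothetical token counts for 16 short sentences (illustrative only).
example_lengths = [7, 12, 9, 30, 15, 8, 22, 11, 5, 18, 25, 10, 14, 9, 6, 20]

fixed = tokens_with_max_length_padding(example_lengths)
dynamic = tokens_with_dynamic_padding(example_lengths)
print(f"max_length padding: {fixed} token slots")
print(f"dynamic padding:    {dynamic} token slots")
```

In the Transformers library, dynamic padding is usually obtained by tokenizing without `padding` and passing `DataCollatorWithPadding(tokenizer)` as the `collate_fn` of the `DataLoader`.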

In [5]:
# Post Processing

tokenized_datasets = tokenized_datasets.remove_columns(["sentence"])
tokenized_datasets = tokenized_datasets.rename_column("label", "labels")
tokenized_datasets.set_format("torch")

In [6]:
# Take a Truncated Dataset to Validate Pipeline Integrity

teenie_train_dataset = tokenized_datasets["train"].shuffle(seed=42).select(range(1000))
teenie_eval_dataset = tokenized_datasets["validation"].shuffle(seed=42).select(range(300))

tiny_train_dataset = tokenized_datasets["train"].shuffle(seed=42).select(range(30))
tiny_eval_dataset = tokenized_datasets["validation"].shuffle(seed=42).select(range(10))



In [7]:
from torch.utils.data import DataLoader

train_dataloader = DataLoader(teenie_train_dataset, shuffle=True, batch_size=8)
eval_dataloader = DataLoader(teenie_eval_dataset, batch_size=8)

### Load the Pretrained Model

In [21]:
distilbert = AutoModelForSequenceClassification.from_pretrained("distilbert-base-uncased", num_labels=2)

Some weights of the model checkpoint at distilbert-base-uncased were not used when initializing DistilBertForSequenceClassification: ['vocab_layer_norm.weight', 'vocab_transform.bias', 'vocab_layer_norm.bias', 'vocab_transform.weight', 'vocab_projector.weight', 'vocab_projector.bias']
- This IS expected if you are initializing DistilBertForSequenceClassification from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing DistilBertForSequenceClassification from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).
Some weights of DistilBertForSequenceClassification were not initialized from the model checkpoint at distilbert-base-uncased and are newly initialized: ['classifier.weight', 'classifier.bias', 'pre_classifier

### Define the Training Loop and Parameters

In [22]:
# Specify the device if GPU is available

device = torch.device("cuda") if torch.cuda.is_available() else torch.device("cpu")
distilbert.to(device)

DistilBertForSequenceClassification(
  (distilbert): DistilBertModel(
    (embeddings): Embeddings(
      (word_embeddings): Embedding(30522, 768, padding_idx=0)
      (position_embeddings): Embedding(512, 768)
      (LayerNorm): LayerNorm((768,), eps=1e-12, elementwise_affine=True)
      (dropout): Dropout(p=0.1, inplace=False)
    )
    (transformer): Transformer(
      (layer): ModuleList(
        (0-5): 6 x TransformerBlock(
          (attention): MultiHeadSelfAttention(
            (dropout): Dropout(p=0.1, inplace=False)
            (q_lin): Linear(in_features=768, out_features=768, bias=True)
            (k_lin): Linear(in_features=768, out_features=768, bias=True)
            (v_lin): Linear(in_features=768, out_features=768, bias=True)
            (out_lin): Linear(in_features=768, out_features=768, bias=True)
          )
          (sa_layer_norm): LayerNorm((768,), eps=1e-12, elementwise_affine=True)
          (ffn): FFN(
            (dropout): Dropout(p=0.1, inplace=False)
 

In [23]:
optimizer = Adam(distilbert.parameters(), lr=5e-5)

In [24]:
num_epochs = 3
num_training_steps = num_epochs * len(train_dataloader)
lr_scheduler = get_scheduler(
    name="linear", optimizer=optimizer, num_warmup_steps=0, num_training_steps=num_training_steps
)
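
As a rough sketch of what this configuration does: with 1,000 training examples and a batch size of 8 there are 125 batches per epoch, so 3 epochs give 375 optimizer steps, and a linear schedule with no warmup shrinks the learning rate from its initial value to zero over those steps. The helper names below are illustrative and written in plain Python rather than taken from the library.

```python
import math


def steps_per_epoch(num_examples, batch_size):
    """Number of batches a DataLoader yields per epoch (last partial batch kept)."""
    return math.ceil(num_examples / batch_size)


def linear_lr(step, base_lr, num_training_steps, num_warmup_steps=0):
    """Learning rate under linear warmup followed by linear decay to zero."""
    if step < num_warmup_steps:
        return base_lr * step / max(1, num_warmup_steps)
    remaining = num_training_steps - step
    return base_lr * max(0.0, remaining / max(1, num_training_steps - num_warmup_steps))


total_steps = 3 * steps_per_epoch(1000, 8)
for step in (0, total_steps // 2, total_steps):
    print(step, linear_lr(step, 5e-5, total_steps))
```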

In [25]:
# Run the training loop for num_epochs

progress_bar = tqdm(range(num_training_steps))

distilbert.train()
for epoch in range(num_epochs):
    for batch in train_dataloader:
        _ = batch.pop('idx')
        batch = {k: v.to(device) for k, v in batch.items()}
        outputs = distilbert(**batch)
        loss = outputs.loss
        loss.backward()
        optimizer.step()
        lr_scheduler.step()
        optimizer.zero_grad()
        progress_bar.update(1)

  0%|          | 0/375 [00:00<?, ?it/s]

In [26]:
# Evaluate the Model Performance on Validation Accuracy

metric = evaluate.load("accuracy")
distilbert.eval()
for batch in eval_dataloader:
    _ = batch.pop('idx')
    batch = {k: v.to(device) for k, v in batch.items()}
    with torch.no_grad():
        outputs = distilbert(**batch)
    logits = outputs.logits
    predictions = torch.argmax(logits, dim=-1)
    metric.add_batch(predictions=predictions, references=batch["labels"])
    
metric.compute()

{'accuracy': 0.85}
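
Since the evaluation subset has only 300 examples, it may help to attach a rough uncertainty to this number. A minimal sketch using the normal approximation to the binomial (an assumption that is reasonable at this sample size but not exact):

```python
import math


def accuracy_interval(accuracy, n, z=1.96):
    """Approximate 95% confidence interval for an accuracy measured on n examples."""
    std_err = math.sqrt(accuracy * (1.0 - accuracy) / n)
    return accuracy - z * std_err, accuracy + z * std_err


low, high = accuracy_interval(0.85, 300)
print(f"0.85 on 300 examples: roughly [{low:.3f}, {high:.3f}]")
```

Under this approximation the interval spans about four points on either side, so small differences between runs on this subset are within noise.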

### Repeating for the DistilRoBERTa Model

In [9]:
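# NOTE: the data loaders built above were tokenized with the distilbert-base-uncased
# tokenizer and are reused here; distilroberta-base has its own tokenizer and vocabulary.
device = torch.device("cuda") if torch.cuda.is_available() else torch.device("cpu")
distilroberta = AutoModelForSequenceClassification.from_pretrained("distilroberta-base", num_labels=2)
distilroberta.to(device)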

Some weights of the model checkpoint at distilroberta-base were not used when initializing RobertaForSequenceClassification: ['lm_head.dense.weight', 'lm_head.bias', 'roberta.pooler.dense.bias', 'roberta.pooler.dense.weight', 'lm_head.decoder.weight', 'lm_head.layer_norm.weight', 'lm_head.layer_norm.bias', 'lm_head.dense.bias']
- This IS expected if you are initializing RobertaForSequenceClassification from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing RobertaForSequenceClassification from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).
Some weights of RobertaForSequenceClassification were not initialized from the model checkpoint at distilroberta-base and are newly initialized: ['classifier.out_proj.weig

RobertaForSequenceClassification(
  (roberta): RobertaModel(
    (embeddings): RobertaEmbeddings(
      (word_embeddings): Embedding(50265, 768, padding_idx=1)
      (position_embeddings): Embedding(514, 768, padding_idx=1)
      (token_type_embeddings): Embedding(1, 768)
      (LayerNorm): LayerNorm((768,), eps=1e-05, elementwise_affine=True)
      (dropout): Dropout(p=0.1, inplace=False)
    )
    (encoder): RobertaEncoder(
      (layer): ModuleList(
        (0-5): 6 x RobertaLayer(
          (attention): RobertaAttention(
            (self): RobertaSelfAttention(
              (query): Linear(in_features=768, out_features=768, bias=True)
              (key): Linear(in_features=768, out_features=768, bias=True)
              (value): Linear(in_features=768, out_features=768, bias=True)
              (dropout): Dropout(p=0.1, inplace=False)
            )
            (output): RobertaSelfOutput(
              (dense): Linear(in_features=768, out_features=768, bias=True)
              (

In [14]:
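# NOTE: lr=5e-2 is 1000x larger than the 5e-5 used for DistilBERT above;
# the accuracy reported below was obtained with this value.
optimizer = Adam(distilroberta.parameters(), lr=5e-2)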

In [15]:
num_epochs = 3
num_training_steps = num_epochs * len(train_dataloader)
lr_scheduler = get_scheduler(
    name="linear", optimizer=optimizer, num_warmup_steps=0, num_training_steps=num_training_steps
)

In [16]:
# Run the training loop for num_epochs

progress_bar = tqdm(range(num_training_steps))

distilroberta.train()
for epoch in range(num_epochs):
    for batch in train_dataloader:
        _ = batch.pop('idx')
        batch = {k: v.to(device) for k, v in batch.items()}
        outputs = distilroberta(**batch)
        loss = outputs.loss
        loss.backward()
        optimizer.step()
        lr_scheduler.step()
        optimizer.zero_grad()
        progress_bar.update(1)

  0%|          | 0/375 [00:00<?, ?it/s]

In [17]:
# Evaluate the Model Performance on Validation Accuracy

metric = evaluate.load("accuracy")
distilroberta.eval()
for batch in eval_dataloader:
    _ = batch.pop('idx')
    batch = {k: v.to(device) for k, v in batch.items()}
    with torch.no_grad():
        outputs = distilroberta(**batch)
    logits = outputs.logits
    predictions = torch.argmax(logits, dim=-1)
    metric.add_batch(predictions=predictions, references=batch["labels"])
    
metric.compute()

{'accuracy': 0.5233333333333333}

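**Written Answer:** Between the two models, the final validation accuracy was substantially better for distilbert-base-uncased (0.85) than for distilroberta-base (0.52, close to chance for a binary task). I am curious why this is the case. From COMP790-LLM, I remember learning that RoBERTa is essentially BERT trained with a more careful recipe, so I would have expected it to do at least as well. One hypothesis is that RoBERTa dropped the next sentence prediction task, and that this task helped BERT (and thus DistilBERT) learn sentence-level representations that are useful for classification (though, as far as I know, DistilBERT itself was distilled without that objective, which weakens this explanation). However, this comparison is not controlled, so I can't attribute the gap to the pre-training objective: the DistilRoBERTa run used a learning rate of 5e-2 rather than 5e-5, and it reused inputs tokenized with the distilbert-base-uncased tokenizer. Either of these could plausibly explain near-chance accuracy on its own.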
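
One thing worth ruling out before reading much into this gap: the data loaders were built once with the `distilbert-base-uncased` tokenizer and then reused for `distilroberta-base`, which has its own vocabulary (50,265 entries versus 30,522 in the model printouts above). Ids that are in range raise no error but point at unrelated tokens. The toy vocabularies below are invented purely to illustrate the effect; they are not the real vocabularies of either model.

```python
# Two made-up vocabularies that assign different ids to the same tokens.
vocab_a = {"[PAD]": 0, "[CLS]": 1, "[SEP]": 2, "a": 3, "great": 4, "movie": 5}
vocab_b = {"<s>": 0, "<pad>": 1, "</s>": 2, "movie": 3, "terrible": 4, "a": 5}


def encode(tokens, vocab):
    """Map tokens to ids using the given vocabulary."""
    return [vocab[token] for token in tokens]


def decode(ids, vocab):
    """Map ids back to tokens using the given vocabulary."""
    id_to_token = {idx: token for token, idx in vocab.items()}
    return [id_to_token[idx] for idx in ids]


ids = encode(["[CLS]", "a", "great", "movie", "[SEP]"], vocab_a)
print("ids from tokenizer A:             ", ids)
print("read back with vocab A:           ", decode(ids, vocab_a))
print("as a model with vocab B sees them:", decode(ids, vocab_b))
```

In practice, calling `AutoTokenizer.from_pretrained` with the same checkpoint name as the model and re-running `dataset.map` before building the data loaders for the second model would remove this confound.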

# Problem 2

Instead of fine-tuning the full model on a target dataset, it's also possible to use the output representations from a BERT-style model as input to a linear classifier and *only* train the classifier (leaving the rest of the pre-trained parameters fixed). You can do this easily using the [`sentence-transformers`](https://www.sbert.net/) library. Using `sentence-transformers` gives you back a fixed-length representation of a given text sequence. To achieve this, you need to:
1. Pick a pre-trained sentence Transformer.
2. Load the SST-2 dataset and feed the text from each example into the model.
3. Train a linear classifier on the representations.
4. Evaluate performance on the validation set.

For the second step, you can learn more about how to use Hugging Face datasets [here](https://huggingface.co/docs/datasets/index). For the third and fourth steps, you can do this directly in PyTorch, or you can just collect the learned representations and use them as feature vectors to train a linear classifier in any other library (e.g. [scikit-learn](https://scikit-learn.org/stable/modules/linear_model.html)).

After you complete the above steps, report whether the accuracy on the validation set is higher or lower using a fixed sentence Transformer.
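
A minimal sketch of steps 3 and 4, assuming the sentence embeddings have already been collected into NumPy arrays. Random synthetic features stand in for real embeddings here so the example runs without downloading anything; `fit_linear_probe` is a hypothetical helper name, and scikit-learn's `LogisticRegression` is one possible choice of linear classifier.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression


def fit_linear_probe(train_x, train_y, val_x, val_y):
    """Train a logistic-regression probe on fixed features; return it with its validation accuracy."""
    clf = LogisticRegression(max_iter=1000)
    clf.fit(train_x, train_y)
    return clf, clf.score(val_x, val_y)


# Synthetic stand-in for embeddings: 8-dimensional features with a linear decision rule.
rng = np.random.default_rng(0)
features = rng.normal(size=(400, 8))
true_weights = rng.normal(size=8)
labels = (features @ true_weights > 0).astype(int)

probe, val_accuracy = fit_linear_probe(features[:300], labels[:300], features[300:], labels[300:])
print(f"validation accuracy on synthetic data: {val_accuracy:.3f}")
```

With real data, `train_x` would be something like the array returned by `SentenceTransformer(model_name).encode(sentences)` for the training sentences, with `val_x` computed the same way on the validation split.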

In [94]:
from transformers import AutoModel
import torch.nn as nn

class FineTunedCLF(nn.Module):
    """Pre-trained encoder with a linear classification head on mean-pooled outputs."""

    def __init__(self, hf_model, unfreeze=False, hidden_dim=768, output_dim=1):
        super().__init__()
        self.transformer = AutoModel.from_pretrained(hf_model)
        # Freeze the encoder unless unfreeze=True, so only the linear head is trained.
        if not unfreeze:
            for param in self.transformer.parameters():
                param.requires_grad = False
        self.linear = nn.Linear(hidden_dim, output_dim)

    def forward(self, **inputs):
        h = self.transformer(**inputs)['last_hidden_state']
        # Mean over all sequence positions (padding positions are included).
        pooler_output = torch.mean(h, dim=1)
        clf_output = self.linear(pooler_output)
        # Sigmoid output pairs with nn.BCELoss below.
        return torch.sigmoid(clf_output)
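
Note that `torch.mean` over the sequence dimension averages every position, including padding. A common alternative is to weight positions by the attention mask so that only real tokens contribute. The sketch below shows that variant on a small hand-made tensor; it is an illustration of the idea, not the pooling used to produce the results in this notebook, and `masked_mean_pool` is a hypothetical helper name.

```python
import torch


def masked_mean_pool(last_hidden_state, attention_mask):
    """Average token embeddings over positions where attention_mask is 1."""
    mask = attention_mask.unsqueeze(-1).to(last_hidden_state.dtype)  # (batch, seq, 1)
    summed = (last_hidden_state * mask).sum(dim=1)                   # (batch, hidden)
    counts = mask.sum(dim=1).clamp(min=1.0)                          # avoid division by zero
    return summed / counts


# One sequence of 4 positions with hidden size 2; the last two positions are padding.
hidden = torch.tensor([[[1.0, 2.0], [3.0, 4.0], [100.0, 100.0], [100.0, 100.0]]])
mask = torch.tensor([[1, 1, 0, 0]])

print("plain mean: ", hidden.mean(dim=1))
print("masked mean:", masked_mean_pool(hidden, mask))
```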

In [95]:
ftclf = FineTunedCLF('distilbert-base-uncased').to(device)

Some weights of the model checkpoint at distilbert-base-uncased were not used when initializing DistilBertModel: ['vocab_layer_norm.weight', 'vocab_transform.bias', 'vocab_layer_norm.bias', 'vocab_transform.weight', 'vocab_projector.weight', 'vocab_projector.bias']
- This IS expected if you are initializing DistilBertModel from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing DistilBertModel from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).


In [96]:
sum([param.numel() if param.requires_grad else 0 for param in ftclf.parameters()])

769
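
The count of 769 is what one would expect if only the linear head is trainable: a 768-dimensional input mapped to one output contributes 768 weights plus 1 bias. As a quick arithmetic check (the helper name is illustrative):

```python
def linear_layer_param_count(in_features, out_features, bias=True):
    """Number of parameters in a fully connected layer."""
    return in_features * out_features + (out_features if bias else 0)


print(linear_layer_param_count(768, 1))
```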

In [105]:
optimizer = Adam(ftclf.parameters(), lr=5e-3)
criterion = nn.BCELoss()
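
The model applies a sigmoid and the loss is `nn.BCELoss`, which works but can lose precision when the pre-sigmoid value is large in magnitude. An equivalent formulation computes the loss directly from the logit (PyTorch offers this as `nn.BCEWithLogitsLoss`). The plain-Python sketch below writes out that formula for a single example, where `z` is the logit and `y` the 0/1 label; it is only meant to show the algebra, and the function names are illustrative.

```python
import math


def bce_from_probability(p, y):
    """Binary cross-entropy given a predicted probability p."""
    return -(y * math.log(p) + (1 - y) * math.log(1 - p))


def bce_from_logit(z, y):
    """Numerically stable binary cross-entropy given a logit z."""
    return max(z, 0.0) - z * y + math.log1p(math.exp(-abs(z)))


z, y = 0.3, 1
p = 1.0 / (1.0 + math.exp(-z))
print(bce_from_probability(p, y), bce_from_logit(z, y))  # the two agree
print(bce_from_logit(-800.0, 1))  # stays finite for an extreme logit
```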

In [106]:
lr_scheduler = get_scheduler(
    name="linear", optimizer=optimizer, num_warmup_steps=0, num_training_steps=num_training_steps
)

In [107]:
# Run the training loop for num_epochs

progress_bar = tqdm(range(num_training_steps))

ftclf.train()
for epoch in range(num_epochs):
    for batch in train_dataloader:
        _ = batch.pop('idx')
        labels = batch.pop('labels')
        labels = labels.unsqueeze(1).to(torch.float32).to(device)
        batch = {k: v.to(device) for k, v in batch.items()}
        outputs = ftclf(**batch)
        loss = criterion(outputs, labels)
        loss.backward()
        optimizer.step()
        lr_scheduler.step()
        optimizer.zero_grad()
        progress_bar.update(1)

  0%|          | 0/375 [00:00<?, ?it/s]

In [108]:
metric = evaluate.load("accuracy")
ftclf.eval()
for batch in eval_dataloader:
    _ = batch.pop('idx')
    labels = batch.pop('labels')
    batch = {k: v.to(device) for k, v in batch.items()}
    with torch.no_grad():
        outputs = ftclf(**batch)
    predictions = torch.round(outputs)
    metric.add_batch(predictions=predictions, references=labels)
    
metric.compute()

{'accuracy': 0.8466666666666667}
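
To put the two DistilBERT numbers side by side: on a 300-example evaluation subset, accuracies of 0.85 and about 0.8467 correspond to 255 and 254 correct predictions, a difference of a single example. A quick check of that arithmetic (the helper name is illustrative):

```python
def correct_count(accuracy, n):
    """Convert an accuracy on n examples back to a count of correct predictions."""
    return round(accuracy * n)


full_finetune = correct_count(0.85, 300)
frozen_encoder = correct_count(0.8466666666666667, 300)
print(full_finetune, frozen_encoder, full_finetune - frozen_encoder)
```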

**Written Answer:** First, instead of using a sentence-transformers model, I wrote a more general Hugging Face wrapper class (`FineTunedCLF`) in order to make an apples-to-apples comparison between DistilBERT with all weights fine-tuned and frozen DistilBERT embeddings fed directly into a linear classifier. The two perform nearly identically on the 300-example validation subset (0.85 with full fine-tuning versus 0.847 with the frozen encoder). One possible explanation is that, in a short fine-tuning run, most of the useful adaptation happens in the later, more "task-specific" layers, so training only a classifier on top of the pre-trained representations recovers most of the benefit.