# Finetuning

Fine-tune a BERT classifier on the [Yelp review dataset](https://huggingface.co/datasets/yelp_review_full/viewer/yelp_review_full/train). This dataset has Yelp reviews along with the star rating given by the reviewer. The star rating is the label, which goes from 0 to 4 (corresponding to 1 star to 5 stars).

I'll use the `bert-base-cased` model for this. I pass its forward method a dict unpacked into keyword arguments (`model(**batch)`) with the following keys, each referring to an integer tensor, where $m$ is the batch size and 512 is the padded sequence length:
  * `labels` $\in \{0, \dots, 4\}^m$
  * `input_ids` $\in \mathbb Z^{m \times 512}$
  * `token_type_ids` $\in \{0, 1\}^{m \times 512}$
  * `attention_mask` $\in \{0, 1\}^{m \times 512}$


In [1]:
from datasets import load_dataset

In [2]:
dataset = load_dataset("yelp_review_full")
dataset

DatasetDict({
    train: Dataset({
        features: ['label', 'text'],
        num_rows: 650000
    })
    test: Dataset({
        features: ['label', 'text'],
        num_rows: 50000
    })
})

In [3]:
dataset["train"][100]

{'label': 0,
 'text': 'My expectations for McDonalds are t rarely high. But for one to still fail so spectacularly...that takes something special!\\nThe cashier took my friends\'s order, then promptly ignored me. I had to force myself in front of a cashier who opened his register to wait on the person BEHIND me. I waited over five minutes for a gigantic order that included precisely one kid\'s meal. After watching two people who ordered after me be handed their food, I asked where mine was. The manager started yelling at the cashiers for \\"serving off their orders\\" when they didn\'t have their food. But neither cashier was anywhere near those controls, and the manager was the one serving food to customers and clearing the boards.\\nThe manager was rude when giving me my order. She didn\'t make sure that I had everything ON MY RECEIPT, and never even had the decency to apologize that I felt I was getting poor service.\\nI\'ve eaten at various McDonalds restaurants for over 30 years. 

In [4]:
from transformers import AutoTokenizer

In [5]:
tokenizer = AutoTokenizer.from_pretrained("bert-base-cased")

def tokenize_function(examples):
    return tokenizer(examples["text"], padding="max_length", truncation=True)

tokenized_datasets = dataset.map(tokenize_function, batched=True)


In [6]:
tokenized_datasets

DatasetDict({
    train: Dataset({
        features: ['label', 'text', 'input_ids', 'token_type_ids', 'attention_mask'],
        num_rows: 650000
    })
    test: Dataset({
        features: ['label', 'text', 'input_ids', 'token_type_ids', 'attention_mask'],
        num_rows: 50000
    })
})
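
The three new columns come from the tokenizer call with `padding="max_length", truncation=True`. Here is a rough pure-Python sketch of what I understand those two options to do for a single sequence. The helper `pad_or_truncate` is hypothetical, and the special-token ids (101 for `[CLS]`, 102 for `[SEP]`, 0 for `[PAD]`) are my assumption for `bert-base-cased`; the real tokenizer also handles sentence pairs and other cases that this ignores.

```python
# Assumed special-token ids for bert-base-cased (not read from the tokenizer).
CLS_ID, SEP_ID, PAD_ID = 101, 102, 0


def pad_or_truncate(token_ids, max_length=512):
    """Toy version of padding="max_length" + truncation=True for one sequence."""
    # Truncation: leave room for the [CLS] and [SEP] tokens.
    body = token_ids[: max_length - 2]
    input_ids = [CLS_ID] + body + [SEP_ID]
    # Real tokens get mask 1, padding gets mask 0.
    attention_mask = [1] * len(input_ids)
    num_pad = max_length - len(input_ids)
    return {
        "input_ids": input_ids + [PAD_ID] * num_pad,
        "token_type_ids": [0] * max_length,  # single segment, so all zeros
        "attention_mask": attention_mask + [0] * num_pad,
    }


short_example = pad_or_truncate([1798, 22408, 117], max_length=8)
print(short_example["input_ids"])       # [101, 1798, 22408, 117, 102, 0, 0, 0]
print(short_example["attention_mask"])  # [1, 1, 1, 1, 1, 0, 0, 0]

long_example = pad_or_truncate(list(range(1000, 1600)))  # 600 tokens, too long
print(len(long_example["input_ids"]), sum(long_example["attention_mask"]))  # 512 512
```

Every row ends up with the same length, which is why the default collation in `DataLoader` can stack them into `[batch_size, 512]` tensors later on.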

In [7]:
instance = tokenized_datasets["train"][1000]
for key in instance.keys():
    print(key, type(instance[key]))

label <class 'int'>
text <class 'str'>
input_ids <class 'list'>
token_type_ids <class 'list'>
attention_mask <class 'list'>


In [8]:
tokenized_datasets = tokenized_datasets.remove_columns(["text"])
tokenized_datasets = tokenized_datasets.rename_column("label", "labels")
tokenized_datasets.set_format("torch")

In [9]:
instance = tokenized_datasets["train"][1000]
for key in instance.keys():
    print(key, type(instance[key]))

labels <class 'torch.Tensor'>
input_ids <class 'torch.Tensor'>
token_type_ids <class 'torch.Tensor'>
attention_mask <class 'torch.Tensor'>


In [10]:
small_train_dataset = tokenized_datasets["train"].shuffle(seed=42).select(range(1000))
small_eval_dataset = tokenized_datasets["test"].shuffle(seed=42).select(range(1000))

In [11]:
from torch.utils.data import DataLoader

In [12]:
train_dataloader = DataLoader(small_train_dataset, shuffle=True, batch_size=8)
eval_dataloader = DataLoader(small_eval_dataset, batch_size=8)

In [32]:
print(len(train_dataloader), len(eval_dataloader))

125 125


In [13]:
it = iter(train_dataloader)

In [14]:
data = next(it)

In [15]:
data

{'labels': tensor([3, 1, 3, 0, 1, 0, 4, 3]),
 'input_ids': tensor([[  101,  1798, 22408,  ...,     0,     0,     0],
         [  101,  8835,  3415,  ...,     0,     0,     0],
         [  101,  1188,  1282,  ...,     0,     0,     0],
         ...,
         [  101, 17627,  1106,  ...,     0,     0,     0],
         [  101,  2160,   117,  ...,     0,     0,     0],
         [  101,  3006,  1715,  ...,     0,     0,     0]]),
 'token_type_ids': tensor([[0, 0, 0,  ..., 0, 0, 0],
         [0, 0, 0,  ..., 0, 0, 0],
         [0, 0, 0,  ..., 0, 0, 0],
         ...,
         [0, 0, 0,  ..., 0, 0, 0],
         [0, 0, 0,  ..., 0, 0, 0],
         [0, 0, 0,  ..., 0, 0, 0]]),
 'attention_mask': tensor([[1, 1, 1,  ..., 0, 0, 0],
         [1, 1, 1,  ..., 0, 0, 0],
         [1, 1, 1,  ..., 0, 0, 0],
         ...,
         [1, 1, 1,  ..., 0, 0, 0],
         [1, 1, 1,  ..., 0, 0, 0],
         [1, 1, 1,  ..., 0, 0, 0]])}

In [16]:
print(type(data))
for key in data.keys():
    print(key, type(data[key]), data[key].shape)

<class 'dict'>
labels <class 'torch.Tensor'> torch.Size([8])
input_ids <class 'torch.Tensor'> torch.Size([8, 512])
token_type_ids <class 'torch.Tensor'> torch.Size([8, 512])
attention_mask <class 'torch.Tensor'> torch.Size([8, 512])


In [21]:
from transformers import AutoModelForSequenceClassification
import torch as t

In [18]:
model = AutoModelForSequenceClassification.from_pretrained("bert-base-cased", num_labels=5)

Some weights of BertForSequenceClassification were not initialized from the model checkpoint at bert-base-cased and are newly initialized: ['classifier.bias', 'classifier.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


In [19]:
model

BertForSequenceClassification(
  (bert): BertModel(
    (embeddings): BertEmbeddings(
      (word_embeddings): Embedding(28996, 768, padding_idx=0)
      (position_embeddings): Embedding(512, 768)
      (token_type_embeddings): Embedding(2, 768)
      (LayerNorm): LayerNorm((768,), eps=1e-12, elementwise_affine=True)
      (dropout): Dropout(p=0.1, inplace=False)
    )
    (encoder): BertEncoder(
      (layer): ModuleList(
        (0-11): 12 x BertLayer(
          (attention): BertAttention(
            (self): BertSelfAttention(
              (query): Linear(in_features=768, out_features=768, bias=True)
              (key): Linear(in_features=768, out_features=768, bias=True)
              (value): Linear(in_features=768, out_features=768, bias=True)
              (dropout): Dropout(p=0.1, inplace=False)
            )
            (output): BertSelfOutput(
              (dense): Linear(in_features=768, out_features=768, bias=True)
              (LayerNorm): LayerNorm((768,), eps=1e-12,

In [20]:
outputs = model(**data)

In [24]:
device = t.device("mps")
model = model.to(device)
data = {k: v.to(device) for k, v in data.items()}

In [25]:
outputs = model(**data)

In [None]:
outputs

In [None]:
outputs.logits.shape

In [None]:
data["labels"]
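
Because `data` contains `labels`, `outputs` carries a `loss` alongside the `[8, 5]` logits. As far as I understand, for integer labels and more than one class this is the cross-entropy between the logits and the labels, averaged over the batch. A minimal pure-Python sketch of that computation, with plain lists standing in for tensors (the function names are mine, not the library's):

```python
import math


def cross_entropy(logits, label):
    """Negative log-probability of `label` under softmax(logits)."""
    m = max(logits)  # subtract the max for numerical stability
    log_sum_exp = m + math.log(sum(math.exp(z - m) for z in logits))
    return log_sum_exp - logits[label]


def mean_cross_entropy(batch_logits, labels):
    """Average the per-example losses over the batch."""
    losses = [cross_entropy(row, y) for row, y in zip(batch_logits, labels)]
    return sum(losses) / len(losses)


# A model with no preference between the 5 classes outputs equal logits.
uniform_logits = [[0.0] * 5 for _ in range(8)]
labels = [3, 1, 3, 0, 1, 0, 4, 3]
print(round(mean_cross_entropy(uniform_logits, labels), 4))  # 1.6094, i.e. ln(5)
```

So a loss close to ln(5) ≈ 1.61 before training is what I'd expect from an untrained 5-way classification head.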

The eval dataset has 1000 / 8 = 125 batches. I estimated ~4 seconds per batch on the MPS device, which means 125 * 4 = 500 seconds ≈ 8 minutes. The actual run below finished in about 40 seconds (roughly 3 batches per second), so that estimate was pessimistic.
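
The arithmetic behind that estimate can be sketched as below. The helper names are hypothetical; the batch count uses a ceiling because a `DataLoader` without `drop_last=True` keeps a final partial batch.

```python
import math


def num_batches(num_examples, batch_size, drop_last=False):
    """Number of batches a DataLoader yields for a dataset of this size."""
    if drop_last:
        return num_examples // batch_size
    return math.ceil(num_examples / batch_size)


def estimated_seconds(num_examples, batch_size, seconds_per_batch):
    """Total time if every batch takes `seconds_per_batch` seconds."""
    return num_batches(num_examples, batch_size) * seconds_per_batch


print(num_batches(1000, 8))                                  # 125
total = estimated_seconds(1000, 8, seconds_per_batch=4.0)
print(f"{total:.0f} s = {total / 60:.1f} min")               # 500 s = 8.3 min
```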

In [26]:
import evaluate
from tqdm import tqdm

huggingface/tokenizers: The current process just got forked, after parallelism has already been used. Disabling parallelism to avoid deadlocks...
	- Avoid using `tokenizers` before the fork if possible
	- Explicitly set the environment variable TOKENIZERS_PARALLELISM=(true | false)


In [28]:
metric = evaluate.load("accuracy")
model.eval()
for batch in tqdm(eval_dataloader):
    with t.no_grad():
        batch = {k: v.to(device) for k, v in batch.items()}
        outputs = model(**batch)
    logits = outputs.logits
    preds = t.argmax(logits, dim=-1)
    metric.add_batch(predictions=preds, references=batch["labels"])

100%|██████████| 125/125 [00:40<00:00,  3.06it/s]


In [29]:
metric.compute()

{'accuracy': 0.205}

This accuracy is at chance level. The classification head is newly initialized (see the warning when loading the model), so the model is effectively picking one of 5 classes at random, which gives an expected accuracy of 20%.
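
The evaluation loop boils down to two steps: take the argmax over each row of logits, then count the fraction of predictions that match the labels. A minimal pure-Python sketch of that, with made-up logits and plain lists standing in for tensors (the helpers `argmax` and `accuracy` are mine, not part of `evaluate`):

```python
def argmax(row):
    """Index of the largest value in a list."""
    return max(range(len(row)), key=lambda i: row[i])


def accuracy(logits, labels):
    """Fraction of rows whose argmax equals the reference label."""
    preds = [argmax(row) for row in logits]
    correct = sum(int(p == y) for p, y in zip(preds, labels))
    return correct / len(labels)


# Made-up logits for 4 examples and 5 classes.
logits = [
    [0.1, 2.0, -1.0, 0.3, 0.0],  # predicts class 1
    [1.5, 0.2, 0.1, 0.0, -0.5],  # predicts class 0
    [0.0, 0.1, 0.2, 0.3, 3.0],   # predicts class 4
    [0.9, 0.8, 0.7, 0.6, 0.5],   # predicts class 0
]
labels = [1, 0, 3, 2]
print(accuracy(logits, labels))  # 0.5
```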

In [31]:
from transformers import get_scheduler

In [30]:
optim = t.optim.AdamW(model.parameters(), lr=5e-5)

In [33]:
num_epochs = 3
num_training_steps = num_epochs * len(train_dataloader)
lr_scheduler = get_scheduler(name="linear", optimizer=optim, num_warmup_steps=0, num_training_steps=num_training_steps)
lr_scheduler

<torch.optim.lr_scheduler.LambdaLR at 0x2a4ce38f0>
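
With `name="linear"` and `num_warmup_steps=0`, my understanding is that the learning rate starts at the optimizer's `lr` and decays linearly to 0 over `num_training_steps`. A small pure-Python sketch of that schedule, assuming that behaviour (the function `linear_schedule_lr` is a stand-in I wrote, not the library's implementation):

```python
def linear_schedule_lr(step, base_lr, num_training_steps, num_warmup_steps=0):
    """Learning rate at `step` for linear warmup followed by linear decay."""
    if step < num_warmup_steps:
        # Ramp up from 0 to base_lr during warmup.
        factor = step / max(1, num_warmup_steps)
    else:
        # Decay from base_lr to 0 over the remaining steps.
        remaining = num_training_steps - step
        factor = max(0.0, remaining / max(1, num_training_steps - num_warmup_steps))
    return base_lr * factor


num_training_steps = 3 * 125  # 3 epochs of 125 batches, as in this notebook
for step in (0, 125, 250, 375):
    print(step, linear_schedule_lr(step, 5e-5, num_training_steps))
```

Under this assumption the rate is 5e-5 at step 0, about 3.3e-5 after the first epoch, about 1.7e-5 after the second, and 0 at the final step.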

In [34]:
progress_bar = tqdm(range(num_training_steps))

model.train()
for epoch in range(num_epochs):
    for batch in train_dataloader:
        batch = {k: v.to(device) for k, v in batch.items()}
        outputs = model(**batch)
        loss = outputs.loss
        loss.backward()

        optim.step()
        lr_scheduler.step()
        optim.zero_grad()

        progress_bar.update(1)

100%|██████████| 375/375 [07:38<00:00,  1.18s/it]

In [35]:
metric = evaluate.load("accuracy")
model.eval()
for batch in tqdm(eval_dataloader):
    with t.no_grad():
        batch = {k: v.to(device) for k, v in batch.items()}
        outputs = model(**batch)
    logits = outputs.logits
    preds = t.argmax(logits, dim=-1)
    metric.add_batch(predictions=preds, references=batch["labels"])

100%|██████████| 125/125 [00:40<00:00,  3.06it/s]


In [36]:
metric.compute()

{'accuracy': 0.59}