**HW0**


Objective:
*   How to run a Jupyter Notebook and submit coding assignments
*   How to import a model from Hugging Face and train it


Tasks:

*   Complete this file to import the '*bert-base-uncased*' model and fine-tune it on the '*glue-mrpc*' dataset for a text classification task. Use the hyperparameters given in the notebook for this part. You can use the Hugging Face NLP course (https://huggingface.co/learn/nlp-course/chapter1/1) for reference.
*  We want to see the effect of batch size on model training. To do this, retrain the model with 5 widely different minibatch sizes (e.g. 1, 10, 100). How does compute time (for a fixed training set size) change with minibatch size? How (if at all) does test accuracy change with minibatch size?

*  We used the AdamW optimizer here. Is it the same as Adam?

**Note: Answers to the task questions must be submitted in the corresponding PDF submission, along with this coding submission, on Gradescope.**





### Installing the necessary libraries: primarily transformers and datasets

Please run the following two cells to install the libraries.

In [2]:
!pip install transformers



In [3]:
!pip install datasets

Collecting datasets
  Downloading datasets-2.16.1-py3-none-any.whl (507 kB)
Collecting dill<0.3.8,>=0.3.0 (from datasets)
  Downloading dill-0.3.7-py3-none-any.whl (115 kB)
Collecting multiprocess (from datasets)
  Downloading multiprocess-0.70.15-py310-none-any.whl (134 kB)
Installing collected packages: dill, multiprocess, datasets
Successfully installed datasets-2.16.1 dill-0.3.7 multiprocess-0.70.15


### Load the necessary packages

In [4]:
import torch
import transformers
import datasets
from datasets import load_dataset
from transformers import AutoTokenizer, DataCollatorWithPadding

### Load the dataset

In [5]:
raw_datasets = load_dataset("glue", "mrpc")

The secret `HF_TOKEN` does not exist in your Colab secrets.
To authenticate with the Hugging Face Hub, create a token in your settings tab (https://huggingface.co/settings/tokens), set it as secret in your Google Colab and restart your session.
You will be able to reuse this secret in all of your notebooks.
Please note that authentication is recommended but still optional to access public models or datasets.



### Creating the tokenizer for bert model using the bert-base-uncased checkpoint

Load the "bert-base-uncased" checkpoint in the cell below.

In [6]:
## complete the next few lines to load the checkpoint and instantiate the tokenizer
checkpoint = "bert-base-uncased"
tokenizer = AutoTokenizer.from_pretrained(checkpoint)

## complete the following function that tokenizes the input
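def tokenize_function(example):
    ## tokenize the two sentences as a pair; truncation guards against inputs longer than the model's maximum length
    tokenized_input = tokenizer(example["sentence1"], example["sentence2"], truncation=True)
    return tokenized_input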


### Tokenize the dataset
We apply the tokenization function on all our datasets at once. We’re using batched=True in our call to map so the function is applied to multiple elements of our dataset at once, and not on each element separately. This allows for faster preprocessing.

In [7]:
## apply the tokenization function in the line below. note: batched=True
tokenized_datasets = raw_datasets.map(tokenize_function, batched=True)
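
To make the effect of `batched=True` concrete, here is a minimal pure-Python sketch. It does not use the `datasets` library: `apply_map` and `count_words` are hypothetical stand-ins for `Dataset.map` and a tokenization function, and the batch size of 2 is chosen only for illustration. Under those assumptions, it shows that the batched path calls the function once per chunk rather than once per example, while producing the same result.

```python
def apply_map(columns, fn, batched=False, batch_size=2):
    """Toy stand-in for Dataset.map on a column-oriented dataset (dict of lists)."""
    num_rows = len(next(iter(columns.values())))
    output = {}
    if batched:
        # Pass a slice of every column: fn receives a dict of lists.
        for start in range(0, num_rows, batch_size):
            chunk = {name: values[start:start + batch_size] for name, values in columns.items()}
            for name, values in fn(chunk).items():
                output.setdefault(name, []).extend(values)
    else:
        # Pass one row at a time: fn receives a dict of single values.
        for i in range(num_rows):
            row = {name: values[i] for name, values in columns.items()}
            for name, value in fn(row).items():
                output.setdefault(name, []).append(value)
    return output


calls = {"count": 0}  # how many times the processing function was invoked


def count_words(example):
    """Hypothetical processing function that accepts one sentence or a list of sentences."""
    calls["count"] += 1
    sentences = example["sentence1"]
    if isinstance(sentences, list):
        return {"num_words": [len(s.split()) for s in sentences]}
    return {"num_words": len(sentences.split())}


toy_data = {"sentence1": ["a b c", "d e", "f", "g h i j", "k l"]}

calls["count"] = 0
per_example = apply_map(toy_data, count_words, batched=False)
calls_per_example = calls["count"]  # one call per row

calls["count"] = 0
per_batch = apply_map(toy_data, count_words, batched=True, batch_size=2)
calls_per_batch = calls["count"]  # one call per chunk of up to 2 rows

print(per_example == per_batch, calls_per_example, calls_per_batch)
```

With a fast tokenizer, fewer and larger calls are what makes the batched path quicker; the toy function here only counts words, so it illustrates the calling pattern rather than the speedup itself.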


The next thing we will need to do is pad all the examples to the length of the longest element when we batch elements together — a technique we refer to as dynamic padding. The function that is responsible for putting together samples inside a batch is called a collate function. Fortunately, the 🤗 Transformers library provides us with such a function via DataCollatorWithPadding. It takes a tokenizer when you instantiate it (to know which padding token to use, and whether the model expects padding to be on the left or on the right of the inputs) and will do everything you need.

In [8]:
## define the data collator below
data_collator = DataCollatorWithPadding(tokenizer=tokenizer)
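
As an illustration of dynamic padding, the following sketch pads each batch only to the length of its own longest sequence. It is a simplified pure-Python stand-in, not the actual `DataCollatorWithPadding` implementation: `pad_batch` is a hypothetical helper, the token ids are made up, and padding on the right with id 0 is an assumption made for this example.

```python
def pad_batch(sequences, pad_id=0):
    """Right-pad every sequence to the length of the longest one in this batch."""
    max_len = max(len(seq) for seq in sequences)
    input_ids = [seq + [pad_id] * (max_len - len(seq)) for seq in sequences]
    # 1 marks a real token, 0 marks padding the model should ignore.
    attention_mask = [[1] * len(seq) + [0] * (max_len - len(seq)) for seq in sequences]
    return {"input_ids": input_ids, "attention_mask": attention_mask}


# Made-up token ids of varying length.
examples = [[101, 7, 8, 102], [101, 9, 102], [101, 4, 5, 6, 7, 102], [101, 102]]

batch_a = pad_batch(examples[:2])  # longest sequence has 4 tokens
batch_b = pad_batch(examples[2:])  # longest sequence has 6 tokens

len_a = len(batch_a["input_ids"][0])
len_b = len(batch_b["input_ids"][0])
print(len_a, len_b)
```

Because the padded length is decided per batch, two batches from the same dataset can have different widths, which is why the sequence dimension printed when inspecting a batch is not fixed.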


### Post processing tokenized datasets

In [9]:
## Remove the columns corresponding to values the model does not expect (like the sentence1 and sentence2 columns).
tokenized_datasets = tokenized_datasets.remove_columns(["sentence1", "sentence2", "idx"])
## Rename the column label to labels (because the model expects the argument to be named labels).
tokenized_datasets = tokenized_datasets.rename_column("label", "labels")
## Set the format of the datasets so they return PyTorch tensors instead of lists. Hint: tokenized_datasets.set
tokenized_datasets.set_format("torch")

Check that the result only has columns that our model will accept: ["attention_mask", "input_ids", "labels", "token_type_ids"]

In [10]:
## check the columns
tokenized_datasets

DatasetDict({
    train: Dataset({
        features: ['labels', 'input_ids', 'token_type_ids', 'attention_mask'],
        num_rows: 3668
    })
    validation: Dataset({
        features: ['labels', 'input_ids', 'token_type_ids', 'attention_mask'],
        num_rows: 408
    })
    test: Dataset({
        features: ['labels', 'input_ids', 'token_type_ids', 'attention_mask'],
        num_rows: 1725
    })
})

### Create the dataloaders for train and evaluation datasets

In [11]:
from torch.utils.data import DataLoader ## loading necessary library
## pass the necessary arguments to the function. Hint: DataLoader()
train_dataloader = DataLoader(tokenized_datasets["train"], batch_size=8, collate_fn=data_collator)
eval_dataloader = DataLoader(tokenized_datasets["validation"], batch_size=8, collate_fn=data_collator)

To quickly check there is no mistake in the data processing, we can inspect a batch like this:

In [12]:
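'''
run this cell and compare the output to
{'attention_mask': torch.Size([8, 65]),
 'input_ids': torch.Size([8, 65]),
 'labels': torch.Size([8]),
 'token_type_ids': torch.Size([8, 65])}
note: the above output is for a batch size of 8; the second dimension (65 here) is the
length of the longest example in the batch, so it can differ from batch to batch
'''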
for batch in train_dataloader:
    break
{k: v.shape for k, v in batch.items()}

You're using a BertTokenizerFast tokenizer. Please note that with a fast tokenizer, using the `__call__` method is faster than using a method to encode the text followed by a call to the `pad` method to get a padded encoding.


{'labels': torch.Size([8]),
 'input_ids': torch.Size([8, 67]),
 'token_type_ids': torch.Size([8, 67]),
 'attention_mask': torch.Size([8, 67])}

### Instantiating the model

In [13]:
from transformers import AutoModelForSequenceClassification ## loading necessary library
model = AutoModelForSequenceClassification.from_pretrained(checkpoint, num_labels=2) ## hint: use the previously defined checkpoint


Some weights of BertForSequenceClassification were not initialized from the model checkpoint at bert-base-uncased and are newly initialized: ['classifier.weight', 'classifier.bias']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


In [14]:
## To make sure that everything will go smoothly during training, we pass our batch to this model
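## run this cell and check that the output has the form "tensor(0.5441, grad_fn=<NllLossBackward>) torch.Size([8, 2])"
## note: the loss value will differ from run to run because the classification head is randomly initialized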
outputs = model(**batch)
print(outputs.loss, outputs.logits.shape)

tensor(0.6951, grad_fn=<NllLossBackward0>) torch.Size([8, 2])


### Next, we need the AdamW optimizer and a learning rate scheduler

In [15]:
## optimizer
## the PyTorch implementation is used here; the AdamW class in transformers is deprecated
from torch.optim import AdamW
optimizer = AdamW(model.parameters(), lr=5e-5) # learning rate of 5e-5
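
On the question of whether AdamW is the same as Adam: the two differ in how weight decay enters the update. The sketch below compares a single first step of Adam with L2 regularization (the decay term is added to the gradient) against AdamW (the decay is applied to the weight directly, outside the adaptive scaling). It is a scalar illustration, not the PyTorch implementation; the function names and the values of `lr` and `wd` are assumptions chosen to make the difference visible.

```python
import math


def adam_l2_step(w, grad, lr=0.1, wd=0.01, beta1=0.9, beta2=0.999, eps=1e-8):
    """First Adam step with L2 regularization: decay is folded into the gradient."""
    g = grad + wd * w
    m = (1 - beta1) * g        # first-moment estimate, starting from 0
    v = (1 - beta2) * g * g    # second-moment estimate, starting from 0
    m_hat = m / (1 - beta1)    # bias correction for step t = 1
    v_hat = v / (1 - beta2)
    return w - lr * m_hat / (math.sqrt(v_hat) + eps)


def adamw_step(w, grad, lr=0.1, wd=0.01, beta1=0.9, beta2=0.999, eps=1e-8):
    """First AdamW step: decay shrinks the weight directly (decoupled weight decay)."""
    m = (1 - beta1) * grad
    v = (1 - beta2) * grad * grad
    m_hat = m / (1 - beta1)
    v_hat = v / (1 - beta2)
    w = w * (1 - lr * wd)      # decay is not rescaled by the adaptive denominator
    return w - lr * m_hat / (math.sqrt(v_hat) + eps)


# With a zero gradient, only the weight-decay term moves the weight.
w_adam_l2 = adam_l2_step(1.0, 0.0)
w_adamw = adamw_step(1.0, 0.0)

# Without weight decay the two updates coincide.
same_without_decay = abs(adam_l2_step(1.0, 0.5, wd=0.0) - adamw_step(1.0, 0.5, wd=0.0)) < 1e-12

print(w_adam_l2, w_adamw, same_without_decay)
```

In the L2 variant the decay term is normalized by the adaptive denominator, so even a tiny decay produces a step of roughly `lr`; in AdamW the shrinkage is proportional to `lr * wd`. Note that `torch.optim.AdamW` applies a weight decay of 0.01 by default, so the optimizer above is not equivalent to plain Adam even though no decay is passed explicitly.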

In [16]:
## learning rate scheduler
from transformers import get_scheduler
num_epochs = 3
num_training_steps = num_epochs * len(train_dataloader)
## pass the necessary parameters to get_scheduler()
lr_scheduler = get_scheduler("linear",optimizer=optimizer,num_warmup_steps=0,num_training_steps=num_training_steps)
print(num_training_steps) ## the number of training steps depends on the batch size and is 1377 for a batch size of 8

1377
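
The step count above follows from simple arithmetic: 3668 training examples in batches of 8 give 459 steps per epoch (the last batch is smaller), and 3 epochs give 1377 steps. The sketch below reproduces that count for several batch sizes and shows the shape of a linear schedule with no warmup. `steps_per_run` and `linear_lr` are hypothetical helpers written for this illustration; they assume the last partial batch is kept and that the learning rate decays linearly from its initial value to zero.

```python
import math

TRAIN_EXAMPLES = 3668  # size of the MRPC training split
NUM_EPOCHS = 3
BASE_LR = 5e-5


def steps_per_run(batch_size):
    """Total optimizer steps when the last, smaller batch of each epoch is kept."""
    return NUM_EPOCHS * math.ceil(TRAIN_EXAMPLES / batch_size)


def linear_lr(step, total_steps, base_lr=BASE_LR):
    """Learning rate after `step` updates for linear decay to zero without warmup."""
    return base_lr * max(0.0, (total_steps - step) / total_steps)


step_counts = {bs: steps_per_run(bs) for bs in (1, 8, 10, 50, 100, 150)}
print(step_counts)

total = step_counts[8]
print(linear_lr(0, total), linear_lr(total // 2, total), linear_lr(total, total))
```

Since the schedule is stretched over the total number of steps, larger batch sizes take fewer, larger steps across the same three passes over the data.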


### Important: Make sure you are using the GPU and not the CPU!

In [17]:
## Run this cell and ensure the output is "device(type='cuda')"
device = torch.device("cuda") if torch.cuda.is_available() else torch.device("cpu")
model.to(device)
device

device(type='cuda')

### Training Loop

In [18]:
from tqdm.auto import tqdm

progress_bar = tqdm(range(num_training_steps)) ## this is only for visualization of progress of training

## in the next few steps first set the model to training mode
model.train()
## next iterate over each batch from dataloader and train the model
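for epoch in range(num_epochs):
    for batch in train_dataloader:
        ## move the batch to the same device as the model
        batch = {k: v.to(device) for k, v in batch.items()}

        ## forward pass and backpropagation
        outputs = model(**batch)
        loss = outputs.loss
        loss.backward()

        ## update the weights and the learning rate, then clear the gradients
        optimizer.step()
        lr_scheduler.step()
        optimizer.zero_grad()
        progress_bar.update(1)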

  0%|          | 0/1377 [00:00<?, ?it/s]

### Evaluation Loop

In [19]:
## note: set the model to evaluation mode
model.eval()
## next iterate over the batches from evaluation dataloader and report the accuracy and f1 score
from sklearn.metrics import accuracy_score, f1_score

total_eval_accuracy = 0
total_eval_f1 = 0
num_eval_batches = 0  ## counts batches, not individual examples

for batch in eval_dataloader:
    batch = {k: v.to(device) for k, v in batch.items()}

    with torch.no_grad():
        outputs = model(**batch)

    logits = outputs.logits
    predictions = torch.argmax(logits, dim=-1)
    labels = batch['labels']

    ## per-batch metrics: averaging them is exact for accuracy only when all batches have the same size,
    ## and only approximates the F1 score computed over the whole validation set
    total_eval_accuracy += accuracy_score(labels.cpu().numpy(), predictions.cpu().numpy())
    total_eval_f1 += f1_score(labels.cpu().numpy(), predictions.cpu().numpy(), average='weighted')
    num_eval_batches += 1

# Calculate the average per-batch accuracy and F1 score
avg_accuracy = total_eval_accuracy / num_eval_batches
avg_f1_score = total_eval_f1 / num_eval_batches

print("Accuracy:",avg_accuracy)
print("F1 Score:",avg_f1_score)

Accuracy: 0.8553921568627451
F1 Score: 0.8409924498159793
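
The evaluation loop above averages per-batch scores. That is harmless for accuracy when batches have equal size, but F1 is not additive across batches, so the averaged value is generally not the F1 of the full set. The toy sketch below illustrates the gap. It uses a hand-written binary F1 for the positive class rather than scikit-learn's `average='weighted'` variant, and the labels and predictions are made up for the example.

```python
def binary_f1(labels, preds):
    """F1 of the positive class, computed as 2*TP / (2*TP + FP + FN)."""
    tp = sum(1 for y, p in zip(labels, preds) if y == 1 and p == 1)
    fp = sum(1 for y, p in zip(labels, preds) if y == 0 and p == 1)
    fn = sum(1 for y, p in zip(labels, preds) if y == 1 and p == 0)
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0


# Two made-up batches of (labels, predictions).
batches = [
    ([1, 1, 0, 0], [1, 0, 0, 0]),
    ([1, 1, 1, 1], [1, 1, 1, 0]),
]

# Option 1: average the per-batch F1 scores.
mean_of_batch_f1 = sum(binary_f1(y, p) for y, p in batches) / len(batches)

# Option 2: pool every prediction first, then compute F1 once.
all_labels = [y for labels, _ in batches for y in labels]
all_preds = [p for _, preds in batches for p in preds]
pooled_f1 = binary_f1(all_labels, all_preds)

print(mean_of_batch_f1, pooled_f1)
```

One way to obtain the dataset-level score in the notebook would be to collect predictions and labels across all batches and call `f1_score` once at the end; the reported numbers above were produced with the per-batch average.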


## Congratulations! You have successfully fine-tuned a BERT model on custom data. Now you can complete the remaining tasks, such as changing the batch size to answer the questions in the second task.

In [21]:
import time

from tqdm.auto import tqdm
from sklearn.metrics import accuracy_score, f1_score

batch_sizes = [1,10,50,100,150]
acc = []
f1 = []

for batch_size in batch_sizes:
    train_dataloader = DataLoader(tokenized_datasets["train"], batch_size=batch_size, collate_fn=data_collator)
    test_dataloader = DataLoader(tokenized_datasets["test"], batch_size=batch_size, collate_fn=data_collator)

    model2 = AutoModelForSequenceClassification.from_pretrained(checkpoint, num_labels=2)
    model2.to(device)

    optimizer = torch.optim.AdamW(model2.parameters(), lr=5e-5)
    num_training_steps = num_epochs * len(train_dataloader)
    lr_scheduler = get_scheduler("linear", optimizer=optimizer, num_warmup_steps=0, num_training_steps=num_training_steps)

    progress_bar = tqdm(range(num_training_steps))

    model2.train()
    start_time = time.time()
    for epoch in range(num_epochs):
        for batch in train_dataloader:
            batch = {k: v.to(device) for k, v in batch.items()}

            outputs = model2(**batch)
            loss = outputs.loss
            loss.backward()

            optimizer.step()
            lr_scheduler.step()
            optimizer.zero_grad()
            progress_bar.update(1)
    end_time = time.time()
    training_time = end_time - start_time
    print(f"Batch size: {batch_size}, Training Time: {training_time}")

    model2.eval()
    total_eval_accuracy = 0
    total_eval_f1 = 0
    total_eval_examples = 0

    for batch in test_dataloader:
        batch = {k: v.to(device) for k, v in batch.items()}

        with torch.no_grad():
            outputs = model2(**batch)

        logits = outputs.logits
        predictions = torch.argmax(logits, dim=-1)
        labels = batch['labels']

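        ## weight each per-batch metric by the batch size so the final (possibly smaller) batch is not over-counted;
        ## the F1 score is still only an approximation of the F1 over the full test set
        ## (with batch size 1, the per-batch F1 is 0 or 1 and therefore equals the accuracy)
        total_eval_accuracy += accuracy_score(labels.cpu().numpy(), predictions.cpu().numpy()) * labels.size(0)
        total_eval_f1 += f1_score(labels.cpu().numpy(), predictions.cpu().numpy(), average='weighted') * labels.size(0)
        total_eval_examples += labels.size(0)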

    avg_accuracy = total_eval_accuracy / total_eval_examples
    avg_f1_score = total_eval_f1 / total_eval_examples
    acc.append(avg_accuracy)
    f1.append(avg_f1_score)

    print(f"Batch size: {batch_size}, Accuracy: {avg_accuracy}, F1 Score: {avg_f1_score}")
    del model2
    torch.cuda.empty_cache()


Some weights of BertForSequenceClassification were not initialized from the model checkpoint at bert-base-uncased and are newly initialized: ['classifier.weight', 'classifier.bias']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


  0%|          | 0/11004 [00:00<?, ?it/s]

Batch size: 1, Training Time: 751.2733426094055
Batch size: 1, Accuracy: 0.6718840579710145, F1 Score: 0.6718840579710145




  0%|          | 0/1101 [00:00<?, ?it/s]

Batch size: 10, Training Time: 167.6044647693634
Batch size: 10, Accuracy: 0.8266666666666667, F1 Score: 0.8213809517484307




  0%|          | 0/222 [00:00<?, ?it/s]

Batch size: 50, Training Time: 145.53598546981812
Batch size: 50, Accuracy: 0.8434782608695652, F1 Score: 0.8383640883027355




  0%|          | 0/111 [00:00<?, ?it/s]

Batch size: 100, Training Time: 144.68421268463135
Batch size: 100, Accuracy: 0.8394202898550724, F1 Score: 0.8330331716309699




  0%|          | 0/75 [00:00<?, ?it/s]

Batch size: 150, Training Time: 146.40587759017944
Batch size: 150, Accuracy: 0.767536231884058, F1 Score: 0.756926152242712
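
To compare the runs above on a common scale, the following sketch converts the logged training times into throughput (training examples processed per second) and speedup relative to batch size 1. The times and accuracies are copied from the log and rounded; the variable names are introduced here for illustration. Each batch size was run once, so small differences in accuracy should be read with caution.

```python
TRAIN_EXAMPLES = 3668  # size of the MRPC training split
NUM_EPOCHS = 3

# batch size -> (training time in seconds, test accuracy), rounded from the log above
results = {
    1: (751.27, 0.6719),
    10: (167.60, 0.8267),
    50: (145.54, 0.8435),
    100: (144.68, 0.8394),
    150: (146.41, 0.7675),
}

examples_seen = TRAIN_EXAMPLES * NUM_EPOCHS
throughput = {bs: examples_seen / seconds for bs, (seconds, _) in results.items()}
speedup = {bs: results[1][0] / seconds for bs, (seconds, _) in results.items()}

print(f"{'batch':>5} {'time (s)':>9} {'examples/s':>11} {'speedup':>8} {'accuracy':>9}")
for bs, (seconds, accuracy) in results.items():
    print(f"{bs:>5} {seconds:>9.1f} {throughput[bs]:>11.1f} {speedup[bs]:>8.2f} {accuracy:>9.4f}")
```

Laid out this way, the table makes it easier to see where training time stops improving as the batch size grows and how test accuracy moves alongside it.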
