# HW11.1 Fine-tuning BERT LLM using Huggingface Transformers library

In this homework, we will step away from TensorFlow Keras for a moment and instead use the Transformers library from HuggingFace (https://huggingface.co/). HuggingFace is a community that hosts pre-trained models, from LLMs to computer vision and audio ML models. Its `transformers` library gives you easy access to SOTA LLMs and tools for fine-tuning them, and its `datasets` library (a generic name, but the library really is called datasets) provides standard benchmark datasets.

Specifically, in this homework you will:
1. Walk through an example of loading the `sst2` dataset (the Stanford Sentiment Treebank, a dataset for sentiment analysis) from the `GLUE` benchmark we talked about in class. GLUE covers a range of NLP tasks and is used to benchmark LLMs. After you load the dataset, a few example usages show how to inspect it.
2. From the `transformers` library, load the pre-trained LLM called DistilBERT, a smaller variant of the famous BERT LLM.
3. Fine-tune (train further) the DistilBERT model on the `sst2` dataset to achieve better performance.
4. Evaluate your fine-tuned model on `sst2` and compare it with (1) the model before fine-tuning and (2) the default model in the HuggingFace library, which was fine-tuned by experts.

Please complete all tasks/code and answer all questions.

## Requirements

You will need the following libraries at the minimum:

```
!pip install datasets
!pip install transformers
!pip install accelerate -U
!pip install torchinfo
```

# 1. Load SST2 data

In [1]:
!pip install datasets
!pip install transformers
!pip install accelerate -U
!pip install torchinfo


In [4]:
from datasets import load_dataset
import numpy as np

# to view the GLUE - SST2 dataset and what it is about, see: https://huggingface.co/datasets/nyu-mll/glue
# essentially this is the Stanford Sentiment Treebank dataset for sentiment analysis
datasets = load_dataset("glue", "sst2")

The secret `HF_TOKEN` does not exist in your Colab secrets.
To authenticate with the Hugging Face Hub, create a token in your settings tab (https://huggingface.co/settings/tokens), set it as secret in your Google Colab and restart your session.
You will be able to reuse this secret in all of your notebooks.
Please note that authentication is recommended but still optional to access public models or datasets.



In [5]:
# you can inspect this dataset and see what it contains
# you will see it has been divided into three parts: train, val, and test
datasets

DatasetDict({
    train: Dataset({
        features: ['sentence', 'label', 'idx'],
        num_rows: 67349
    })
    validation: Dataset({
        features: ['sentence', 'label', 'idx'],
        num_rows: 872
    })
    test: Dataset({
        features: ['sentence', 'label', 'idx'],
        num_rows: 1821
    })
})

In [6]:
print('Dataset Type:', type(datasets))

Dataset Type: <class 'datasets.dataset_dict.DatasetDict'>


In [7]:
print('Dataset Shape:', datasets.shape)

Dataset Shape: {'train': (67349, 3), 'validation': (872, 3), 'test': (1821, 3)}


## Task 1: inspect data text and labels

What are the labels? What do labels 0 and 1 represent? Take note of the keys in this dictionary and their values.

In [11]:
# TODO: inspect the first three examples in the train datasets
print("First three examples in the train dataset:")
print('Example 1:', datasets['train'][0])
print('Example 2:', datasets['train'][1])
print('Example 3:', datasets['train'][2])


First three examples in the train dataset:
Example 1: {'sentence': 'hide new secretions from the parental units ', 'label': 0, 'idx': 0}
Example 2: {'sentence': 'contains no wit , only labored gags ', 'label': 0, 'idx': 1}
Example 3: {'sentence': 'that loves its characters and communicates something rather beautiful about human nature ', 'label': 1, 'idx': 2}


In [10]:
#inspect the validation dataset
print("First three examples in the validation dataset:")
print('Example 1:', datasets['validation'][0])
print('Example 2:', datasets['validation'][1])
print('Example 3:', datasets['validation'][2])

First three examples in the validation dataset:
Example 1: {'sentence': "it 's a charming and often affecting journey . ", 'label': 1, 'idx': 0}
Example 2: {'sentence': 'unflinchingly bleak and desperate ', 'label': 0, 'idx': 1}
Example 3: {'sentence': 'allows us to hope that nolan is poised to embark a major career as a commercial yet inventive filmmaker . ', 'label': 1, 'idx': 2}


In [12]:
#inspect the test dataset
print("First three examples in the test dataset:")
print('Example 1:', datasets['test'][0])
print('Example 2:', datasets['test'][1])
print('Example 3:', datasets['test'][2])

First three examples in the test dataset:
Example 1: {'sentence': 'uneasy mishmash of styles and genres .', 'label': -1, 'idx': 0}
Example 2: {'sentence': "this film 's relationship to actual tension is the same as what christmas-tree flocking in a spray can is to actual snow : a poor -- if durable -- imitation .", 'label': -1, 'idx': 1}
Example 3: {'sentence': 'by the end of no such thing the audience , like beatrice , has a watchful affection for the monster .', 'label': -1, 'idx': 2}


The labels encode the sentiment of each sentence: label 0 marks a negative sentence and label 1 marks a positive one, as the train and validation examples above show. Each example is a dictionary with three keys: `sentence` (the text), `label` (the sentiment), and `idx` (the position of the example within its split). Note that the three test examples above all carry the label -1 rather than 0 or 1.
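
To make the label convention concrete, here is a minimal sketch that tallies the labels of a few examples copied from the outputs above. The `LABEL_NAMES` dictionary and the name "unlabeled" for -1 are our own illustrative choices, not something the `datasets` library defines.

```python
from collections import Counter

# Illustrative mapping (an assumption for this sketch, not a library object):
# 0 and 1 follow the examples above; -1 is what the test split shows.
LABEL_NAMES = {0: "negative", 1: "positive", -1: "unlabeled"}

# (sentence, label) pairs copied from the printed examples above
examples = [
    ("hide new secretions from the parental units ", 0),
    ("contains no wit , only labored gags ", 0),
    ("that loves its characters and communicates something rather beautiful about human nature ", 1),
    ("uneasy mishmash of styles and genres .", -1),
]

def describe(label):
    """Turn an integer label into a readable name."""
    return LABEL_NAMES.get(label, "unknown")

# Count how many examples fall under each readable label name
counts = Counter(describe(label) for _, label in examples)
print(counts)
```

The same tally run over a whole split would be a quick way to check whether that split has usable labels before evaluating on it.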

# 2. Load pre-trained model DistilBERT and preprocess text

We've talked about how each LLM comes with its own (subword, learned) tokenizer. Here, when we load the pre-trained LLM, we also load its tokenizer.

In [13]:
from transformers import AutoTokenizer

checkpoint = "distilbert-base-uncased"
tokenizer = AutoTokenizer.from_pretrained(checkpoint)
tokenized_sentences = tokenizer(datasets['train'][:3]['sentence'])


## Task 2: understand what tokenizer is doing
Now we've used the tokenizer to tokenize the first three sentences in the train dataset. Inspect the tokenized sentences. Take the first sentence: it is now represented by a sequence of integer indices. Can you map them back to the actual sub-word units to see how the tokenizer breaks up the words?

Hint: you can do `dir(tokenizer)` to find out how to convert ids to tokens. This applies to any object in python.

In [14]:
# YOUR CODE HERE
print('Tokenized Example 1:', tokenized_sentences[0])
print('Tokenized Example 2:', tokenized_sentences[1])
print('Tokenized Example 3:', tokenized_sentences[2])

Tokenized Example 1: Encoding(num_tokens=10, attributes=[ids, type_ids, tokens, offsets, attention_mask, special_tokens_mask, overflowing])
Tokenized Example 2: Encoding(num_tokens=11, attributes=[ids, type_ids, tokens, offsets, attention_mask, special_tokens_mask, overflowing])
Tokenized Example 3: Encoding(num_tokens=15, attributes=[ids, type_ids, tokens, offsets, attention_mask, special_tokens_mask, overflowing])


In [15]:
tokenized_sequence_1 = tokenized_sentences[0].tokens
#print first tokenized sequence
print('Tokenized sequence 1:', tokenized_sequence_1)


Tokenized sequence 1: ['[CLS]', 'hide', 'new', 'secret', '##ions', 'from', 'the', 'parental', 'units', '[SEP]']


In [16]:
token_ids = tokenizer.convert_tokens_to_ids(tokenized_sequence_1)
# Print the token ids
print(token_ids)

[101, 5342, 2047, 3595, 8496, 2013, 1996, 18643, 3197, 102]


In [17]:
actual_tokens = tokenizer.convert_ids_to_tokens(token_ids)
# Print the actual tokens
print(actual_tokens)

['[CLS]', 'hide', 'new', 'secret', '##ions', 'from', 'the', 'parental', 'units', '[SEP]']
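
To see why `secretions` became `secret` + `##ions`, here is a toy sketch of greedy longest-match-first sub-word splitting, the general idea behind WordPiece-style tokenizers. This is not the real DistilBERT tokenizer: `toy_vocab` is a tiny hypothetical vocabulary built only from the tokens printed above, and `wordpiece` and `toy_tokenize` are illustrative helper names.

```python
def wordpiece(word, vocab, unk="[UNK]"):
    """Greedily split one word into the longest pieces found in vocab."""
    pieces, start = [], 0
    while start < len(word):
        end = len(word)
        piece = None
        # Shrink the candidate from the right until it is in the vocabulary
        while start < end:
            candidate = word[start:end]
            if start > 0:
                candidate = "##" + candidate  # marks a word-internal piece
            if candidate in vocab:
                piece = candidate
                break
            end -= 1
        if piece is None:
            return [unk]  # no piece matched at this position
        pieces.append(piece)
        start = end
    return pieces

# Hypothetical mini-vocabulary, taken from the tokens printed above
toy_vocab = {"hide", "new", "secret", "##ions", "from", "the", "parental", "units"}

def toy_tokenize(sentence, vocab):
    """Lowercase, split on whitespace, then split each word into pieces."""
    tokens = ["[CLS]"]
    for word in sentence.lower().split():
        tokens.extend(wordpiece(word, vocab))
    tokens.append("[SEP]")
    return tokens

tokens = toy_tokenize("hide new secretions from the parental units ", toy_vocab)
print(tokens)
```

With this toy vocabulary the sketch reproduces the token list printed above, including the `[CLS]` and `[SEP]` markers that frame the sentence.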


The following function applies the tokenizer to all data.

In [18]:
def tokenize_fn(batch):
  return tokenizer(batch['sentence'], truncation=True)

In [19]:
tokenized_datasets = datasets.map(tokenize_fn, batched=True)
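
Note that `tokenize_fn` truncates but does not pad, so the tokenized sentences still have different lengths. Before a batch can go through the model, its sequences need a common length. The sketch below shows the idea of padding a batch to its longest sequence and building the matching attention mask. `pad_batch` is a hypothetical helper written for illustration (to our understanding, the `Trainer` takes care of this step for you when it is given the tokenizer); the second, shorter sequence is made up, and `pad_id=0` mirrors the `padding_idx=0` shown in the model printout later on.

```python
def pad_batch(batch_ids, pad_id=0):
    """Pad every id list to the longest one and mark real tokens with 1."""
    longest = max(len(ids) for ids in batch_ids)
    input_ids, attention_mask = [], []
    for ids in batch_ids:
        n_pad = longest - len(ids)
        input_ids.append(ids + [pad_id] * n_pad)
        attention_mask.append([1] * len(ids) + [0] * n_pad)
    return {"input_ids": input_ids, "attention_mask": attention_mask}

batch = pad_batch([
    [101, 5342, 2047, 3595, 8496, 2013, 1996, 18643, 3197, 102],  # first train sentence
    [101, 2047, 102],                                             # made-up short example
])
print(batch["input_ids"][1])
print(batch["attention_mask"][1])
```

The attention mask tells the model which positions are real tokens and which are padding to be ignored.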


# 3. Fine-tune the pre-trained DistilBERT model

In [20]:
from transformers import TrainingArguments
from transformers import AutoModelForSequenceClassification

In [21]:
training_args = TrainingArguments(
  'my_trainer',
  evaluation_strategy='epoch',
  save_strategy='epoch',
  num_train_epochs=1,
)

In [22]:
model = AutoModelForSequenceClassification.from_pretrained(
    checkpoint,
    num_labels=2)


Some weights of DistilBertForSequenceClassification were not initialized from the model checkpoint at distilbert-base-uncased and are newly initialized: ['classifier.bias', 'classifier.weight', 'pre_classifier.bias', 'pre_classifier.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


In [23]:
# this warning above tells you that this pretrained model was topped with a newly
# initialized classifier that needs to be trained/fine-tuned
# let's inspect this model and understand its internal structure

model

DistilBertForSequenceClassification(
  (distilbert): DistilBertModel(
    (embeddings): Embeddings(
      (word_embeddings): Embedding(30522, 768, padding_idx=0)
      (position_embeddings): Embedding(512, 768)
      (LayerNorm): LayerNorm((768,), eps=1e-12, elementwise_affine=True)
      (dropout): Dropout(p=0.1, inplace=False)
    )
    (transformer): Transformer(
      (layer): ModuleList(
        (0-5): 6 x TransformerBlock(
          (attention): MultiHeadSelfAttention(
            (dropout): Dropout(p=0.1, inplace=False)
            (q_lin): Linear(in_features=768, out_features=768, bias=True)
            (k_lin): Linear(in_features=768, out_features=768, bias=True)
            (v_lin): Linear(in_features=768, out_features=768, bias=True)
            (out_lin): Linear(in_features=768, out_features=768, bias=True)
          )
          (sa_layer_norm): LayerNorm((768,), eps=1e-12, elementwise_affine=True)
          (ffn): FFN(
            (dropout): Dropout(p=0.1, inplace=False)
 

In [24]:
from torchinfo import summary
# another way to inspect the model
summary(model)

Layer (type:depth-idx)                                  Param #
DistilBertForSequenceClassification                     --
├─DistilBertModel: 1-1                                  --
│    └─Embeddings: 2-1                                  --
│    │    └─Embedding: 3-1                              23,440,896
│    │    └─Embedding: 3-2                              393,216
│    │    └─LayerNorm: 3-3                              1,536
│    │    └─Dropout: 3-4                                --
│    └─Transformer: 2-2                                 --
│    │    └─ModuleList: 3-5                             42,527,232
├─Linear: 1-2                                           590,592
├─Linear: 1-3                                           1,538
├─Dropout: 1-4                                          --
Total params: 66,955,010
Trainable params: 66,955,010
Non-trainable params: 0
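
As a sanity check, the parameter counts in this summary can be reproduced by hand from the layer shapes. The sketch below does the arithmetic. Two things are assumptions on our part because the model printout above is cut off before it shows them: the FFN hidden size (`ffn_dim = 3072`) and a second LayerNorm at the end of each transformer block.

```python
# Shapes read off the printed model
vocab_size, max_positions, dim = 30522, 512, 768
n_layers, n_labels = 6, 2
ffn_dim = 3072  # assumption: not visible in the truncated printout

def linear(n_in, n_out):
    """Parameters of a Linear layer: weight matrix plus bias vector."""
    return n_in * n_out + n_out

layer_norm = 2 * dim                       # scale and shift vectors

word_emb = vocab_size * dim                # Embedding: 3-1
pos_emb = max_positions * dim              # Embedding: 3-2
attention = 4 * linear(dim, dim)           # q_lin, k_lin, v_lin, out_lin
ffn = linear(dim, ffn_dim) + linear(ffn_dim, dim)
# assumption: each block ends with a second LayerNorm after the FFN
block = attention + layer_norm + ffn + layer_norm
pre_classifier = linear(dim, dim)          # Linear: 1-2
classifier = linear(dim, n_labels)         # Linear: 1-3

total = (word_emb + pos_emb + layer_norm
         + n_layers * block + pre_classifier + classifier)
print(f"{n_layers * block:,} parameters in the transformer blocks")
print(f"{total:,} parameters in total")
```

Only the last two layers (`pre_classifier` and `classifier`) are newly initialized, which is a small fraction of the total.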

In [25]:
from transformers import Trainer
from datasets import load_metric
# define function to compute metrics
def compute_metrics(logits_and_labels):
  metric = load_metric("glue", "sst2")
  logits, labels = logits_and_labels
  predictions = np.argmax(logits, axis=-1)
  return metric.compute(predictions=predictions, references=labels)
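
To see what `compute_metrics` does with its input, here is a small NumPy-only sketch. The logits and labels are made-up numbers for illustration, and accuracy is computed directly instead of through the GLUE metric object.

```python
import numpy as np

# Made-up logits for 4 sentences: one row per sentence, one column per class
logits = np.array([
    [ 2.1, -1.3],
    [-0.4,  0.9],
    [ 0.3,  0.2],
    [-1.0,  1.5],
])
labels = np.array([0, 1, 1, 1])

# The predicted class is the column with the largest logit in each row
predictions = np.argmax(logits, axis=-1)

# Accuracy is the fraction of predictions that match the labels
accuracy = float((predictions == labels).mean())
print(predictions, accuracy)
```

Here the third sentence is misclassified because its class-0 logit is slightly larger, so three of the four predictions are correct.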

In [26]:
# set up trainer to fine-tune the model
trainer = Trainer(
    model,
    training_args,
    train_dataset=tokenized_datasets["train"],
    eval_dataset=tokenized_datasets["validation"],
    tokenizer=tokenizer,
    compute_metrics=compute_metrics,
)



## Task 3: fine-tune the model for 1 epoch!
Note that this might take some time.

Note that the number of epochs was set above in the training arguments.

After fine-tuning for 1 epoch, report the final accuracy.

In [27]:
trainer.train()

| Epoch | Training Loss | Validation Loss | Accuracy |
|---|---|---|---|
| 1 | 0.1957 | 0.350583 | 0.905963 |


  metric = load_metric("glue", "sst2")
You can avoid this message in future by passing the argument `trust_remote_code=True`.
Passing `trust_remote_code=True` will be mandatory to load this metric from the next major release of `datasets`.



TrainOutput(global_step=8419, training_loss=0.2642166666922822, metrics={'train_runtime': 427.0154, 'train_samples_per_second': 157.72, 'train_steps_per_second': 19.716, 'total_flos': 517212489917652.0, 'train_loss': 0.2642166666922822, 'epoch': 1.0})
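
The numbers in this `TrainOutput` can be cross-checked with a little arithmetic. The sketch below assumes a batch size of 8 on a single device; we did not set a batch size in `TrainingArguments`, so this relies on the library default, which is an assumption worth verifying for your installed version.

```python
import math

num_train_examples = 67349   # size of the train split
batch_size = 8               # assumption: default per-device batch size, one device
train_runtime = 427.0154     # seconds, from the TrainOutput above

# One optimizer step per batch; the last batch may be partially filled
steps_per_epoch = math.ceil(num_train_examples / batch_size)

samples_per_second = num_train_examples / train_runtime
steps_per_second = steps_per_epoch / train_runtime

print(steps_per_epoch)
print(round(samples_per_second, 2), round(steps_per_second, 3))
```

Under that assumption the step count matches `global_step`, and both throughput figures match the reported metrics.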

In [28]:
# save the model to disk so that you can load it back later
trainer.save_model('my_saved_model')

# use this code to massage the labels into something interpretable, NEGATIVE, POSITIVE
import json
config_path = 'my_saved_model/config.json'
with open(config_path) as f:
  j = json.load(f)

j['id2label'] = {0: 'NEGATIVE', 1: 'POSITIVE'}

with open(config_path, 'w') as f:
  json.dump(j, f, indent=2)
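
One detail worth knowing about the config edit above: JSON object keys are always strings, so the integer keys 0 and 1 are written to disk as "0" and "1". The sketch below shows the round trip with an in-memory string instead of the real `config.json`; the `config` dictionary is a stripped-down stand-in for the real file.

```python
import json

# Stand-in for the relevant part of config.json
config = {"id2label": {0: "NEGATIVE", 1: "POSITIVE"}}

# Serializing converts the integer keys to strings
text = json.dumps(config, indent=2)
restored = json.loads(text)
print(list(restored["id2label"].keys()))

# Converting the keys back to integers recovers the original mapping
restored_int = {int(k): v for k, v in restored["id2label"].items()}
print(restored_int)
```

The pipeline in the next cell still reports `NEGATIVE` and `POSITIVE`, so the string keys evidently cause no trouble when the config is loaded.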

## Use the saved model for inference on new sentences

Now you can use this newly fine-tuned model to build a `pipeline`, an object in the transformers library. The pipeline can be used to run inference on an input sentence.

In [29]:
from transformers import pipeline
new_model = pipeline('text-classification', model='my_saved_model')

# test your new pipeline
new_model('This movie is great!')


[{'label': 'POSITIVE', 'score': 0.9996024966239929}]

In [30]:
# test with more examples
# YOUR CODE HERE
new_model('I hate this movie!')

[{'label': 'NEGATIVE', 'score': 0.9959483742713928}]

In [31]:
# test with more examples
# YOUR CODE HERE
new_model('The movie is just there!')

[{'label': 'POSITIVE', 'score': 0.8446947932243347}]

In [32]:
# test with more examples
# YOUR CODE HERE
new_model('Nothing special about this movie!')

[{'label': 'NEGATIVE', 'score': 0.9976100921630859}]

In [33]:
# test with more examples
# YOUR CODE HERE
new_model('Very special movie!')

[{'label': 'POSITIVE', 'score': 0.9993752837181091}]
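
Where do the `score` values come from? For a two-class classifier, the pipeline's score is a softmax probability computed from the model's two output logits. Here is a minimal sketch of that last step; `classify` is an illustrative helper, and the logits passed to it are made up rather than taken from the model.

```python
import numpy as np

def classify(logits, id2label):
    """Turn raw logits into a {'label', 'score'} dict via softmax."""
    logits = np.asarray(logits, dtype=float)
    shifted = logits - logits.max()          # subtract the max for numerical stability
    probs = np.exp(shifted) / np.exp(shifted).sum()
    best = int(np.argmax(probs))
    return {"label": id2label[best], "score": float(probs[best])}

id2label = {0: "NEGATIVE", 1: "POSITIVE"}

# Made-up logits: a strong preference for class 1
result = classify([-3.5, 4.3], id2label)
print(result)
```

A score close to 1 means the two logits are far apart, while a score near 0.5 means the model is close to undecided, as with the lower score for 'The movie is just there!' above.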

# 4. Evaluate the model: how was the result of the fine-tuning?

Once you have trained a model, it is always important to show through proper evaluation that the fine-tuned model is indeed better than it was before fine-tuning, or to compare it with models fine-tuned by other people.

To use HuggingFace's evaluator, install:
`!pip install evaluate`

In [34]:
!pip install evaluate



In [35]:
from evaluate import evaluator

# first let's load the test portion of the sst2 data
test_datasets = load_dataset("glue", "sst2", split="test")

# let's compare three models and evaluate them against each other.

# Model 1: the pre-trained DistilBERT model as is. Since some new, untrained
# classifier layers have been added on top, it is expected to have low performance.
# let's load this model again.
checkpoint = "distilbert-base-uncased"
from transformers import AutoModelForSequenceClassification
model_distillBERT = AutoModelForSequenceClassification.from_pretrained(
    checkpoint,
    num_labels=2)

Some weights of DistilBertForSequenceClassification were not initialized from the model checkpoint at distilbert-base-uncased and are newly initialized: ['classifier.bias', 'classifier.weight', 'pre_classifier.bias', 'pre_classifier.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


In [49]:
# Model 2: the model you fine tuned. For this one, we already have the pipeline
# called new_model, we can use this directly for evaluation.
model_2 = pipeline('text-classification', model='my_saved_model')

In [None]:
# Model 3: the default model for the evaluator if you don't give it any model.
# i.e., you would not supply the argument for model_or_pipeline in the following.
# In this case, it defaults to a model that was fine-tuned by others.

## Task 4: evaluate the three models!
Report the results for Models 1, 2, and 3 above on the `test` portion of the `sst2` dataset. What results do you get? Can you think of why?

Now try testing the three models on the `validation` portion of the same dataset. Report the results. What do you observe?

Hint 1: if you are testing a certain model and get an error about the labels, you might want to swap in one of the commented-out lines below for the line currently in use.

Hint 2: if you can't figure out what's wrong with your accuracy, try going back to inspect the data!
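
To make Hint 1 concrete: a pipeline returns label strings, and `label_mapping` translates those strings into the integer labels used by the dataset. The sketch below imitates that translation with a plain dictionary lookup. `apply_label_mapping` is a hypothetical stand-in written for illustration, not the evaluator's actual code, and the pipeline outputs are made up.

```python
def apply_label_mapping(pipeline_outputs, label_mapping):
    """Translate each predicted label string into its integer id."""
    return [label_mapping[out["label"]] for out in pipeline_outputs]

# Made-up outputs in the format a text-classification pipeline returns
outputs = [
    {"label": "NEGATIVE", "score": 0.99},
    {"label": "POSITIVE", "score": 0.84},
]

# A mapping whose keys match the model's label names works
good = apply_label_mapping(outputs, {"NEGATIVE": 0, "POSITIVE": 1})
print(good)

# A mapping with the wrong keys fails on the first lookup
try:
    apply_label_mapping(outputs, {"LABEL_0": 0, "LABEL_1": 1})
    mismatch_error = None
except KeyError as err:
    mismatch_error = err
print(repr(mismatch_error))
```

So a model whose config names its classes `LABEL_0` and `LABEL_1` needs the `LABEL_*` mapping, while a model with `NEGATIVE` and `POSITIVE` needs the other one.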


In [36]:
# setting up the evaluator
#model_distillBERT on the test set

from evaluate import load
task_evaluator = evaluator("text-classification")
mdt_eval_results = task_evaluator.compute(
    model_or_pipeline= model_distillBERT,
    data= test_datasets,
    input_column="sentence",
    tokenizer=tokenizer,
    metric='accuracy',
    label_mapping={"LABEL_0": 0.0, "LABEL_1": 1.0}
)



In [38]:
print(mdt_eval_results)

{'accuracy': 0.0, 'total_time_in_seconds': 22.131480679000106, 'samples_per_second': 82.28098365455918, 'latency_in_seconds': 0.012153476484898465}


In [52]:
# setting up the evaluator
#new_model on the test set
nmt_eval_results = task_evaluator.compute(
    model_or_pipeline= model_2,
    data= test_datasets,
    input_column="sentence",
    tokenizer=tokenizer,
    metric='accuracy',
    label_mapping={"NEGATIVE": 0, "POSITIVE": 1}
)


In [53]:
print(nmt_eval_results)

{'accuracy': 0.0, 'total_time_in_seconds': 148.39955206000013, 'samples_per_second': 12.270926527215815, 'latency_in_seconds': 0.08149343880285564}


In [41]:
# setting up the evaluator
# Default_model on the test set

dmt_eval_results = task_evaluator.compute(
    model_or_pipeline= None,
    data= test_datasets,
    input_column="sentence",
    tokenizer=tokenizer,
    metric='accuracy',
    label_mapping={"NEGATIVE": 0, "POSITIVE": 1}
)

No model was supplied, defaulted to distilbert/distilbert-base-uncased-finetuned-sst-2-english and revision af0f99b (https://huggingface.co/distilbert/distilbert-base-uncased-finetuned-sst-2-english).
Using a pipeline without specifying a model name and revision in production is not recommended.



In [42]:
print(dmt_eval_results)

{'accuracy': 0.0, 'total_time_in_seconds': 14.671585285999981, 'samples_per_second': 124.11746682464143, 'latency_in_seconds': 0.008056883737506855}


In [43]:
validation_datasets = load_dataset("glue", "sst2", split="validation")

In [44]:
# setting up the evaluator
#model_distillBERT on the validation set

mdv_eval_results = task_evaluator.compute(
    model_or_pipeline= model_distillBERT,
    data= validation_datasets,
    input_column="sentence",
    tokenizer=tokenizer,
    metric='accuracy',
    label_mapping={"LABEL_0": 0.0, "LABEL_1": 1.0}
)

In [45]:
print(mdv_eval_results)

{'accuracy': 0.47706422018348627, 'total_time_in_seconds': 8.993543072999955, 'samples_per_second': 96.95845040403292, 'latency_in_seconds': 0.010313696184632976}


In [50]:
# setting up the evaluator
#new_model on the validation set

nmv_eval_results = task_evaluator.compute(
    model_or_pipeline= model_2,
    data= validation_datasets,
    input_column="sentence",
    tokenizer=tokenizer,
    metric='accuracy',
    label_mapping={"NEGATIVE": 0, "POSITIVE": 1}
)

In [51]:
print(nmv_eval_results)

{'accuracy': 0.9059633027522935, 'total_time_in_seconds': 71.063321696, 'samples_per_second': 12.270746415855804, 'latency_in_seconds': 0.08149463497247707}


In [47]:
# setting up the evaluator
# Default_model on the validation set

dmv_eval_results = task_evaluator.compute(
    model_or_pipeline= None,
    data= validation_datasets,
    input_column="sentence",
    tokenizer=tokenizer,
    metric='accuracy',
    label_mapping={"NEGATIVE": 0, "POSITIVE": 1}
    #label_mapping={"LABEL_0": 0.0, "LABEL_1": 1.0}
)

No model was supplied, defaulted to distilbert/distilbert-base-uncased-finetuned-sst-2-english and revision af0f99b (https://huggingface.co/distilbert/distilbert-base-uncased-finetuned-sst-2-english).
Using a pipeline without specifying a model name and revision in production is not recommended.


In [48]:
print(dmv_eval_results)

{'accuracy': 0.9105504587155964, 'total_time_in_seconds': 10.882602398000017, 'samples_per_second': 80.12789295327508, 'latency_in_seconds': 0.012480048621559652}


All three models scored 0% accuracy on the test data. This is because every test sentence carries the label -1, while the models can only predict 0 (negative) or 1 (positive), so no prediction can ever match its reference. The test score therefore says nothing about model quality. On the validation set, which has real labels, model_distillBERT performed worst, with an accuracy of approximately 48%. That is roughly chance level, as expected for a classifier head that was newly initialized and never trained. The fine-tuned model reached approximately 90.6%, and the default model did slightly better at approximately 91.1%. Overall, the default model had the best performance on the validation set.
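
Both observations can be checked with a few lines of plain Python. The first part uses made-up predictions to show that accuracy against references of -1 is always 0. The second part converts the validation accuracies reported above into counts of correctly classified sentences out of 872. The `accuracy` helper and the dictionary names are illustrative.

```python
def accuracy(predictions, references):
    """Fraction of predictions that equal their reference label."""
    correct = sum(p == r for p, r in zip(predictions, references))
    return correct / len(references)

# Made-up predictions: whatever the model says, it never equals -1
predictions = [0, 1, 1, 0]
test_references = [-1] * 4
test_acc = accuracy(predictions, test_references)
print(test_acc)

# Validation accuracies copied from the outputs above
validation_size = 872
reported = {
    "untrained head": 0.47706422018348627,
    "fine-tuned": 0.9059633027522935,
    "default": 0.9105504587155964,
}
correct_counts = {name: round(acc * validation_size) for name, acc in reported.items()}
print(correct_counts)
```

Seen as counts, the gap between the fine-tuned model and the default model is only a handful of sentences, which is a small margin on a validation set of this size.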

