# Full fine-tuning of a BERT SLM for Classification

![](https://i.imgur.com/7SXKckD.png)

Transfer learning means taking an already trained model and tuning or adapting it to our own downstream task.

Here we walk through, step by step, how to fine-tune a BERT-style Small Language Model (SLM) for a simple yet essential NLP task: text classification for sentiment analysis.

# Sentiment Analysis

Sentiment analysis is one of the most widely performed analyses on text data. It has improved tremendously, from classic methods to today's state-of-the-art models, which use deep learning to improve performance.

# Fine-tuning a model on a text classification task

In this notebook, we will see how to fine-tune a [🤗 Transformers](https://github.com/huggingface/transformers) model on the text classification task of sentiment analysis.

![](https://i.imgur.com/Pq7f3Fd.png)

___[Created By: Dipanjan (DJ)](https://www.linkedin.com/in/dipanjans/)___

In [1]:
import torch
torch.cuda.empty_cache()

In [2]:
!nvidia-smi

Fri Aug  9 17:54:42 2024       
+---------------------------------------------------------------------------------------+
| NVIDIA-SMI 535.129.03             Driver Version: 535.129.03   CUDA Version: 12.2     |
|-----------------------------------------+----------------------+----------------------+
| GPU  Name                 Persistence-M | Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp   Perf          Pwr:Usage/Cap |         Memory-Usage | GPU-Util  Compute M. |
|                                         |                      |               MIG M. |
|   0  NVIDIA A40                     On  | 00000000:57:00.0 Off |                    0 |
|  0%   33C    P8              30W / 300W |      4MiB / 46068MiB |      0%      Default |
|                                         |                      |                  N/A |
+-----------------------------------------+----------------------+----------------------+
                                                                    

You will be leveraging 🤗 Transformers and 🤗 Datasets, as well as other dependencies.

## Get Dataset

In [3]:
import pandas as pd

dataset = pd.read_csv(r'https://github.com/dipanjanS/nlp_workshop_dhs18/raw/master/Unit%2011%20-%20Sentiment%20Analysis%20-%20Unsupervised%20Learning/movie_reviews.csv.bz2',
                      compression='bz2',
                      nrows=20000)
dataset.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 20000 entries, 0 to 19999
Data columns (total 2 columns):
 #   Column     Non-Null Count  Dtype 
---  ------     --------------  ----- 
 0   review     20000 non-null  object
 1   sentiment  20000 non-null  object
dtypes: object(2)
memory usage: 312.6+ KB


In [4]:
dataset['sentiment'] = [1 if sentiment == 'positive' else 0 for sentiment in dataset['sentiment']]

In [5]:
dataset.head()

| | review | sentiment |
|---|---|---|
| 0 | One of the other reviewers has mentioned that ... | 1 |
| 1 | A wonderful little production. <br /><br />The... | 1 |
| 2 | I thought this was a wonderful way to spend ti... | 1 |
| 3 | Basically there's a family where a little boy ... | 0 |
| 4 | Petter Mattei's "Love in the Time of Money" is... | 1 |


## Create Training Datasets

This is a labeled dataset of IMDB movie reviews and their sentiment, encoded as 1 (positive) or 0 (negative).

The idea is to make BERT learn to predict the sentiment of a given review.

Let's create some datasets first!

In [6]:
train_df = dataset.iloc[:8000]
val_df = dataset.iloc[8000:10000]
test_df = dataset.iloc[10000:]

In [7]:
train_df.sentiment.value_counts()

sentiment
1    4003
0    3997
Name: count, dtype: int64

In [8]:
val_df.sentiment.value_counts()

sentiment
1    1025
0     975
Name: count, dtype: int64

In [9]:
test_df.sentiment.value_counts()

sentiment
0    5125
1    4875
Name: count, dtype: int64

In [10]:
train_df.to_csv('train.csv', index=False)
val_df.to_csv('val.csv', index=False)
test_df.to_csv('test.csv', index=False)

## Load Dataset

Here we convert the CSV files above into a Hugging Face dataset, which is easier to use when training transformer models with the Hugging Face libraries.

In [11]:
from datasets import load_dataset

data_files = {"train": "train.csv",
              "validation": "val.csv",
              "test": "test.csv"}
imdb_data = load_dataset("csv", data_files=data_files)

In [12]:
imdb_data

DatasetDict({
    train: Dataset({
        features: ['review', 'sentiment'],
        num_rows: 8000
    })
    validation: Dataset({
        features: ['review', 'sentiment'],
        num_rows: 2000
    })
    test: Dataset({
        features: ['review', 'sentiment'],
        num_rows: 10000
    })
})

In [13]:
imdb_data.keys()

dict_keys(['train', 'validation', 'test'])

In [14]:
# Optional: you can push the dataset splits to the HF Hub
# and download in the future directly
# no need to do this for now as I have already done this
# we will use this from the next time for some of the tutorials in the next module
#imdb_data.push_to_hub("dipanjanS/imdb_sentiment_finetune_dataset20k")

In [15]:
imdb_data['train'][:2]

{'review': ["One of the other reviewers has mentioned that after watching just 1 Oz episode you'll be hooked. They are right, as this is exactly what happened with me.<br /><br />The first thing that struck me about Oz was its brutality and unflinching scenes of violence, which set in right from the word GO. Trust me, this is not a show for the faint hearted or timid. This show pulls no punches with regards to drugs, sex or violence. Its is hardcore, in the classic use of the word.<br /><br />It is called OZ as that is the nickname given to the Oswald Maximum Security State Penitentary. It focuses mainly on Emerald City, an experimental section of the prison where all the cells have glass fronts and face inwards, so privacy is not high on the agenda. Em City is home to many..Aryans, Muslims, gangstas, Latinos, Christians, Italians, Irish and more....so scuffles, death stares, dodgy dealings and shady agreements are never far away.<br /><br />I would say the main appeal of the show is d

The `imdb_data` object is a [`DatasetDict`](https://huggingface.co/docs/datasets/package_reference/main_classes.html#datasetdict), which contains one key each for the training, validation and test sets:

In [16]:
imdb_data

DatasetDict({
    train: Dataset({
        features: ['review', 'sentiment'],
        num_rows: 8000
    })
    validation: Dataset({
        features: ['review', 'sentiment'],
        num_rows: 2000
    })
    test: Dataset({
        features: ['review', 'sentiment'],
        num_rows: 10000
    })
})

To access an actual element, you need to select a split first, then give an index:

In [17]:
imdb_data["train"][0]

{'review': "One of the other reviewers has mentioned that after watching just 1 Oz episode you'll be hooked. They are right, as this is exactly what happened with me.<br /><br />The first thing that struck me about Oz was its brutality and unflinching scenes of violence, which set in right from the word GO. Trust me, this is not a show for the faint hearted or timid. This show pulls no punches with regards to drugs, sex or violence. Its is hardcore, in the classic use of the word.<br /><br />It is called OZ as that is the nickname given to the Oswald Maximum Security State Penitentary. It focuses mainly on Emerald City, an experimental section of the prison where all the cells have glass fronts and face inwards, so privacy is not high on the agenda. Em City is home to many..Aryans, Muslims, gangstas, Latinos, Christians, Italians, Irish and more....so scuffles, death stares, dodgy dealings and shady agreements are never far away.<br /><br />I would say the main appeal of the show is du

We will use the [🤗 Evaluate](https://github.com/huggingface/evaluate) library to load the metrics we need for evaluation. This can be easily done with the function `evaluate.load`.

In [19]:
import evaluate

metric1 = evaluate.load("precision")
metric2 = evaluate.load("recall")
metric3 = evaluate.load("f1")
metric4 = evaluate.load("accuracy")

def evaluate_performance(predictions, references):
    precision = metric1.compute(predictions=predictions, references=references, average="macro")["precision"]
    recall = metric2.compute(predictions=predictions, references=references, average="macro")["recall"]
    f1 = metric3.compute(predictions=predictions, references=references, average="macro")["f1"]
    accuracy = metric4.compute(predictions=predictions, references=references)["accuracy"]
    return {"precision": precision, "recall": recall, "f1": f1, "accuracy": accuracy}


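You can call each metric's `compute` method with your predictions and reference labels, which for classification are lists of integer class IDs. The `evaluate_performance` helper wraps the four metrics and macro-averages precision, recall and F1 across the classes.

For classification, the most common metrics are accuracy and F1-score.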


In [20]:
predictions = [1,0,1,1,0]
references = [1,1,0,1,0]
scores = evaluate_performance(
    predictions=predictions, references=references
)
scores


{'precision': 0.5833333333333333,
 'recall': 0.5833333333333333,
 'f1': 0.5833333333333333,
 'accuracy': 0.6}
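
To see where these numbers come from, here is a minimal pure-Python sketch of macro-averaging, written for illustration only. The helper names `per_class_counts` and `macro_scores` are hypothetical and not part of the `evaluate` library; the sketch assumes macro-averaging means an unweighted mean of the per-class scores.

```python
def per_class_counts(predictions, references, label):
    """Count true positives, false positives and false negatives for one label."""
    pairs = list(zip(predictions, references))
    tp = sum(1 for p, r in pairs if p == label and r == label)
    fp = sum(1 for p, r in pairs if p == label and r != label)
    fn = sum(1 for p, r in pairs if p != label and r == label)
    return tp, fp, fn


def macro_scores(predictions, references):
    """Unweighted mean of per-class precision, recall and F1, plus accuracy."""
    labels = sorted(set(references) | set(predictions))
    precisions, recalls, f1s = [], [], []
    for label in labels:
        tp, fp, fn = per_class_counts(predictions, references, label)
        precision = tp / (tp + fp) if (tp + fp) else 0.0
        recall = tp / (tp + fn) if (tp + fn) else 0.0
        denom = precision + recall
        f1 = 2 * precision * recall / denom if denom else 0.0
        precisions.append(precision)
        recalls.append(recall)
        f1s.append(f1)
    correct = sum(1 for p, r in zip(predictions, references) if p == r)
    return {
        "precision": sum(precisions) / len(labels),
        "recall": sum(recalls) / len(labels),
        "f1": sum(f1s) / len(labels),
        "accuracy": correct / len(references),
    }


# Same toy inputs as the cell above.
manual_scores = macro_scores([1, 0, 1, 1, 0], [1, 1, 0, 1, 0])
print(manual_scores)
```

Class 1 scores 2/3 on all three metrics and class 0 scores 1/2, so each macro average is (2/3 + 1/2) / 2 ≈ 0.583, while accuracy is 3 correct out of 5.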

## Preprocessing the data

Before we can feed those texts to our model, we need to preprocess them. This is done by a 🤗 Transformers `Tokenizer`, which will (as the name indicates) tokenize the inputs (including converting the tokens to their corresponding IDs in the pretrained vocabulary), put them in the format the model expects, and generate the other inputs the model requires.

To do all of this, we instantiate our tokenizer with the `AutoTokenizer.from_pretrained` method, which will ensure:

- we get a tokenizer that corresponds to the model architecture we want to use,
- we download the vocabulary used when pretraining this specific checkpoint.

That vocabulary will be cached, so it's not downloaded again the next time we run the cell.

This notebook should work with any model checkpoint from the [Model Hub](https://huggingface.co/models), as long as that model has a version with a sequence classification head.

A standard choice is the [`bert-base-uncased`](https://huggingface.co/bert-base-uncased) checkpoint.

![](https://i.imgur.com/GmFRcP3.png)

BERT can be used for a variety of tasks and we will fine-tune it for classification (sentiment).

Here we will use a smaller version of the BERT model called DistilBERT to train faster.

In [21]:
from transformers import AutoTokenizer
model_checkpoint = "distilbert/distilbert-base-uncased"
tokenizer = AutoTokenizer.from_pretrained(model_checkpoint, use_fast=True)

We pass along `use_fast=True` to the call above to use one of the fast tokenizers (backed by Rust) from the 🤗 Tokenizers library. Those fast tokenizers are available for almost all models, but if you get an error with the previous call, remove that argument.

You can directly call this tokenizer on one sentence or a pair of sentences:

In [22]:
tokenizer("Hello, this is a sentence!")

{'input_ids': [101, 7592, 1010, 2023, 2003, 1037, 6251, 999, 102], 'attention_mask': [1, 1, 1, 1, 1, 1, 1, 1, 1]}
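
BERT-family tokenizers split words into subword pieces using WordPiece, then map each piece to a vocabulary ID and wrap the sequence in `[CLS]` and `[SEP]`. The sketch below illustrates the greedy longest-match-first idea on a tiny made-up vocabulary. `toy_vocab`, `wordpiece_tokenize` and `encode` are hypothetical names, the IDs do not match the real checkpoint, and the real tokenizer also handles punctuation and normalization that this sketch skips.

```python
# Hypothetical toy vocabulary; the real vocabulary is far larger and uses different IDs.
toy_vocab = {"[PAD]": 0, "[UNK]": 1, "[CLS]": 2, "[SEP]": 3,
             "fine": 4, "##tun": 5, "##ing": 6, "bert": 7}


def wordpiece_tokenize(word, vocab, unk_token="[UNK]"):
    """Greedy longest-match-first split of one word into subword pieces."""
    tokens = []
    start = 0
    while start < len(word):
        end = len(word)
        current = None
        # Try the longest remaining substring first, then shrink it.
        while start < end:
            piece = word[start:end]
            if start > 0:
                piece = "##" + piece  # continuation pieces carry a ## prefix
            if piece in vocab:
                current = piece
                break
            end -= 1
        if current is None:
            return [unk_token]  # no piece matched: the whole word is unknown
        tokens.append(current)
        start = end
    return tokens


def encode(text, vocab):
    """Lowercase, split on whitespace, tokenize, and add special tokens."""
    tokens = ["[CLS]"]
    for word in text.lower().split():
        tokens.extend(wordpiece_tokenize(word, vocab))
    tokens.append("[SEP]")
    return {
        "tokens": tokens,
        "input_ids": [vocab[t] for t in tokens],
        "attention_mask": [1] * len(tokens),
    }


encoded = encode("Finetuning BERT", toy_vocab)
print(encoded["tokens"])
print(encoded["input_ids"])
```

This mirrors the shape of the real output above: one ID per piece, special tokens at both ends, and an attention mask of ones because nothing has been padded yet.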

We can then write the function that will preprocess our samples. We just feed them to the `tokenizer` with the argument `truncation=True`.

This ensures that an input longer than the model can handle is truncated to the maximum length accepted by the model.

The padding will be dealt with later on (in a data collator) so we pad examples to the longest length in the batch and not the whole dataset.

In [23]:
def preprocess_function(examples):
    # max_length is 512 because that is the context window limit of BERT models:
    # each input can contain up to 512 tokens
    model_inputs = tokenizer(examples['review'], max_length=512, truncation=True)
    model_inputs["label"] = examples["sentiment"]
    return model_inputs
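
As a rough illustration of what `truncation=True` with `max_length` does for a single sequence, the sketch below assumes the two special tokens count toward the limit, so only `max_length - 2` content tokens survive. The function name `truncate_with_special_tokens` is hypothetical, and the small `max_length=8` is chosen only to keep the example readable.

```python
CLS_ID, SEP_ID = 101, 102  # the first and last IDs in the tokenizer output shown earlier


def truncate_with_special_tokens(content_ids, max_length=512):
    """Keep at most max_length IDs in total, reserving two slots for [CLS] and [SEP]."""
    budget = max_length - 2
    return [CLS_ID] + content_ids[:budget] + [SEP_ID]


# A short input is left untouched apart from the special tokens.
short_ids = truncate_with_special_tokens(list(range(1000, 1005)), max_length=8)

# A long input is cut down so the total length equals max_length.
long_ids = truncate_with_special_tokens(list(range(1000, 1020)), max_length=8)

print(len(short_ids), len(long_ids))
```

With the real limit of 512, a long review keeps its first 510 content tokens, so anything the reviewer says near the end of a very long review is not seen by the model.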

This function works with one or several documents. In the case of several documents, the tokenizer will return a list of lists for each key:

In [24]:
preprocess_function(imdb_data["train"][:2])

{'input_ids': [[101, 2028, 1997, 1996, 2060, 15814, 2038, 3855, 2008, 2044, 3666, 2074, 1015, 11472, 2792, 2017, 1005, 2222, 2022, 13322, 1012, 2027, 2024, 2157, 1010, 2004, 2023, 2003, 3599, 2054, 3047, 2007, 2033, 1012, 1026, 7987, 1013, 1028, 1026, 7987, 1013, 1028, 1996, 2034, 2518, 2008, 4930, 2033, 2055, 11472, 2001, 2049, 24083, 1998, 4895, 10258, 2378, 8450, 5019, 1997, 4808, 1010, 2029, 2275, 1999, 2157, 2013, 1996, 2773, 2175, 1012, 3404, 2033, 1010, 2023, 2003, 2025, 1037, 2265, 2005, 1996, 8143, 18627, 2030, 5199, 3593, 1012, 2023, 2265, 8005, 2053, 17957, 2007, 12362, 2000, 5850, 1010, 3348, 2030, 4808, 1012, 2049, 2003, 13076, 1010, 1999, 1996, 4438, 2224, 1997, 1996, 2773, 1012, 1026, 7987, 1013, 1028, 1026, 7987, 1013, 1028, 2009, 2003, 2170, 11472, 2004, 2008, 2003, 1996, 8367, 2445, 2000, 1996, 17411, 4555, 3036, 2110, 7279, 4221, 12380, 2854, 1012, 2009, 7679, 3701, 2006, 14110, 2103, 1010, 2019, 6388, 2930, 1997, 1996, 3827, 2073, 2035, 1996, 4442, 2031, 3221, 21430

To apply this function to all the reviews in our dataset, we use the `map` method of the `imdb_data` object we created earlier.

This applies the function to all the elements of all the splits in `imdb_data`, so our training, validation and test data are preprocessed in a single command.

In [25]:
tokenized_datasets = imdb_data.map(preprocess_function, batched=True)

In [26]:
# remove unnecessary columns
tokenized_datasets = tokenized_datasets.remove_columns('review')
tokenized_datasets = tokenized_datasets.remove_columns('sentiment')

## Fine-tuning the Transformer Model

Now that our data is ready, we can download the pretrained model and fine-tune it.

Since our task is about sentence classification, we use the `AutoModelForSequenceClassification` class.

Like with the tokenizer, the `from_pretrained` method will download and cache the model for us.

The only thing we have to specify is the number of labels for our problem, which is 2.

In [27]:
# we put in a mapping so the model knows which prediction label ID is which text label (human friendly)
id2label = {0: "NEGATIVE", 1: "POSITIVE"}
label2id = {"NEGATIVE": 0, "POSITIVE": 1}

In [28]:
from transformers import AutoModelForSequenceClassification, TrainingArguments, Trainer

model = AutoModelForSequenceClassification.from_pretrained(model_checkpoint,
                                                           id2label=id2label,
                                                           label2id=label2id,
                                                           num_labels=2)

Some weights of DistilBertForSequenceClassification were not initialized from the model checkpoint at distilbert/distilbert-base-uncased and are newly initialized: ['classifier.bias', 'classifier.weight', 'pre_classifier.bias', 'pre_classifier.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


The warning tells us that some weights are randomly initialized (the `pre_classifier` and `classifier` layers). Loading a masked language modeling checkpoint into a classification model also discards the pretraining head (the `vocab_transform` and `vocab_layer_norm` layers), although the message shown above lists only the newly initialized weights.

This is absolutely normal in this case, because we are removing the head used to pretrain the model on a masked language modeling objective and replacing it with a new head for which we don't have pretrained weights.

So the library warns us that we should fine-tune this model before using it for inference, which is exactly what we are going to do to make the model learn how to predict the two classes.

In [29]:
def print_trainable_parameters(model):
    """
    Prints the number of trainable parameters in the model.
    """
    trainable_params = 0
    all_param = 0
    for _, param in model.named_parameters():
        all_param += param.numel()
        if param.requires_grad:
            trainable_params += param.numel()
    print(
        f"trainable params: {trainable_params} || all params: {all_param} || trainable%: {100 * trainable_params / all_param}"
    )

In [30]:
print_trainable_parameters(model)

trainable params: 66955010 || all params: 66955010 || trainable%: 100.0


The output above shows that this DistilBERT classifier has about 67 million parameters and that all of them are trainable, which is what full fine-tuning means.
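
The parameter count printed above can be reproduced with plain arithmetic. The sketch below assumes the usual `distilbert-base-uncased` dimensions (vocabulary of 30,522, 512 positions, hidden size 768, feed-forward size 3,072, 6 layers); these values are not shown in this notebook, so treat them as assumptions. It adds the two-layer classification head named in the warning.

```python
# Assumed architecture sizes for distilbert-base-uncased (not printed in this notebook).
vocab_size, max_positions = 30522, 512
hidden, ffn, n_layers, n_labels = 768, 3072, 6, 2


def linear(n_in, n_out):
    """Parameters of a dense layer: weight matrix plus bias vector."""
    return n_in * n_out + n_out


layer_norm = 2 * hidden  # scale and shift vectors

# Word embeddings + position embeddings + embedding LayerNorm.
embeddings = vocab_size * hidden + max_positions * hidden + layer_norm

# One transformer block: query, key, value, output projections, then the feed-forward pair.
attention = 4 * linear(hidden, hidden)
feed_forward = linear(hidden, ffn) + linear(ffn, hidden)
encoder_layer = attention + feed_forward + 2 * layer_norm

backbone = embeddings + n_layers * encoder_layer

# New classification head: pre_classifier (768 -> 768) and classifier (768 -> 2).
head = linear(hidden, hidden) + linear(hidden, n_labels)

total = backbone + head
print(f"backbone={backbone:,} head={head:,} total={total:,}")
```

Under these assumptions the head accounts for well under 1% of the parameters, which is why full fine-tuning, updating the whole backbone, is much more expensive than training only the new head.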

To instantiate a `Trainer`, we will need to define two more things. The most important is the [`TrainingArguments`](https://huggingface.co/transformers/main_classes/trainer.html#transformers.TrainingArguments), which is a class that contains all the attributes to customize the training. It requires one folder name, which will be used to save the checkpoints of the model, and all other arguments are optional:

In [31]:
# if batch size is 64
# if total documents are 8000
# total number of steps (batches of data) to complete 1 full epoch is?
8000 // 64

125

In [32]:
# total steps to run two epochs are?
125 * 2

250

In [33]:
from transformers import TrainingArguments

batch_size = 64 
metric_name = "f1"

# Set up the training arguments
args = TrainingArguments(
    output_dir="distilbert-fullfinetune-runs",  # Directory where the model checkpoints and outputs will be saved.
    eval_strategy="steps",                      # Perform evaluation at regular intervals during training.
    save_strategy="steps",                      # Save the model checkpoint at regular intervals.
    learning_rate=1e-5,                         # Initial learning rate for the optimizer.
    logging_steps=20,                           # Log training metrics every 20 steps.
    eval_steps=20,                              # Perform evaluation every 20 steps.
    save_steps=50,                              # Save the model checkpoint every 50 steps.
    per_device_train_batch_size=batch_size,     # Batch size per GPU/TPU core/CPU during training.
    per_device_eval_batch_size=batch_size,      # Batch size per GPU/TPU core/CPU during evaluation.
    max_steps=250,                              # Stop training after 250 total steps.
    weight_decay=0.01,                          # Apply weight decay to reduce overfitting.
    metric_for_best_model=metric_name,          # Metric to use for selecting the best model during evaluation.
    push_to_hub=False                           # Do not push the model to the Hugging Face Hub after training.
)
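
To make the step-based settings concrete, the following sketch lists when evaluation and checkpointing would fire. It assumes a single device, no gradient accumulation, and that each action triggers whenever the global step is a multiple of its interval; the variable names are illustrative only.

```python
train_examples, per_device_batch, max_steps = 8000, 64, 250
eval_steps, save_steps = 20, 50

# With one device and no gradient accumulation, one step consumes one batch.
steps_per_epoch = train_examples // per_device_batch
epochs_covered = max_steps / steps_per_epoch

# Steps at which evaluation and checkpoint saving are expected.
eval_at = [s for s in range(1, max_steps + 1) if s % eval_steps == 0]
save_at = [s for s in range(1, max_steps + 1) if s % save_steps == 0]

print(f"steps per epoch: {steps_per_epoch}, epochs covered: {epochs_covered}")
print(f"evaluations: {len(eval_at)} (first {eval_at[0]}, last {eval_at[-1]})")
print(f"checkpoints at: {save_at}")
```

Because the evaluation and save intervals differ, the saved checkpoints do not always line up with an evaluated step, which is worth keeping in mind if you later want to pick a checkpoint by validation score.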

We use DataCollatorWithPadding to create a batch of examples. It will also dynamically pad your text to the length of the longest element in its batch, so they are a uniform length.

While it is possible to pad your text in the tokenizer function by setting `padding=True`, dynamic padding is more efficient.

In [34]:
from transformers import DataCollatorWithPadding

data_collator = DataCollatorWithPadding(tokenizer=tokenizer)
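
The efficiency gain from dynamic padding is easy to see with a small pure-Python sketch. `pad_batch` is a hypothetical stand-in for what a padding collator does, not the library implementation, and the sequence lengths are made up. It assumes the padding ID is 0, as in the BERT uncased vocabulary.

```python
PAD_ID = 0  # assumption: [PAD] maps to ID 0 in the BERT uncased vocabulary


def pad_batch(batch, pad_id=PAD_ID):
    """Pad every sequence in one batch to the length of that batch's longest sequence."""
    longest = max(len(ids) for ids in batch)
    return {
        "input_ids": [ids + [pad_id] * (longest - len(ids)) for ids in batch],
        "attention_mask": [[1] * len(ids) + [0] * (longest - len(ids)) for ids in batch],
    }


# Four made-up tokenized reviews with very different lengths.
sequences = [[7] * 5, [7] * 7, [7] * 120, [7] * 130]
batches = [sequences[0:2], sequences[2:4]]

# Dynamic padding: each batch is padded only to its own maximum.
dynamic_slots = sum(
    len(row) for batch in batches for row in pad_batch(batch)["input_ids"]
)

# Static padding: every sequence is padded to the longest one in the whole dataset.
global_max = max(len(ids) for ids in sequences)
static_slots = global_max * len(sequences)

print(dynamic_slots, static_slots)
```

In this toy case dynamic padding processes 274 token slots instead of 520; the attention mask marks the padded positions with 0 so the model ignores them.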

The last thing to define for our `Trainer` is how to compute the metrics from the predictions.

We need to define a function for this, which uses the `evaluate_performance` helper we defined earlier. The only preprocessing we have to do is take the argmax of the predicted logits.

In [35]:
import numpy as np

def compute_metrics(eval_pred):
    predictions, labels = eval_pred
    predictions = np.argmax(predictions, axis=1)
    return evaluate_performance(predictions=predictions, references=labels)
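
For intuition, here is a small sketch of how a row of logits becomes a predicted label. The logit values are made up, `decode` is a hypothetical helper, and the softmax step is included only to show how a probability-like confidence score can be derived from the same logits.

```python
import math

id2label = {0: "NEGATIVE", 1: "POSITIVE"}


def softmax(logits):
    """Turn raw logits into probabilities that sum to 1 (numerically stable form)."""
    highest = max(logits)
    exps = [math.exp(x - highest) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def decode(logits):
    """Pick the class with the largest logit and report its softmax probability."""
    predicted_id = max(range(len(logits)), key=lambda i: logits[i])
    return {"label": id2label[predicted_id], "score": softmax(logits)[predicted_id]}


# Made-up logits for two reviews: one row per review, one column per class.
example_logits = [[1.2, -0.8], [-0.3, 2.0]]
decoded = [decode(row) for row in example_logits]
print(decoded)
```

Softmax preserves the ordering of the logits, so taking the argmax of the raw logits, as `compute_metrics` does, gives the same predicted class as taking the argmax of the probabilities.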

Then we just need to pass all of this along with our datasets to the `Trainer`:

In [36]:
trainer = Trainer(
    model=model,
    args=args,
    train_dataset=tokenized_datasets["train"],
    eval_dataset=tokenized_datasets['validation'],
    tokenizer=tokenizer,
    data_collator=data_collator,
    compute_metrics=compute_metrics
)

Detected kernel version 5.4.0, which is below the recommended minimum of 5.5.0; this can cause the process to hang. It is recommended to upgrade the kernel to the minimum version or higher.
max_steps is given, it will override any value given in num_train_epochs


We can now fine-tune our model by calling the `train` method.

Expect the run to take around 5-6 minutes on a 48GB GPU.

In [37]:
trainer.train()

| Step | Training Loss | Validation Loss | Precision | Recall | F1 | Accuracy |
|---|---|---|---|---|---|---|
| 20 | 0.68 | 0.654358 | 0.764362 | 0.674309 | 0.648951 | 0.6815 |
| 40 | 0.5836 | 0.467734 | 0.866267 | 0.865616 | 0.865816 | 0.866 |
| 60 | 0.3962 | 0.332664 | 0.876171 | 0.874959 | 0.875264 | 0.8755 |
| 80 | 0.3194 | 0.298987 | 0.887905 | 0.887992 | 0.887499 | 0.8875 |
| 100 | 0.306 | 0.28685 | 0.892559 | 0.892295 | 0.8924 | 0.8925 |
| 120 | 0.2675 | 0.278104 | 0.892072 | 0.892308 | 0.891989 | 0.892 |
| 140 | 0.2311 | 0.265954 | 0.901446 | 0.901426 | 0.901436 | 0.9015 |
| 160 | 0.2577 | 0.263221 | 0.901937 | 0.901101 | 0.901354 | 0.9015 |
| 180 | 0.2381 | 0.257099 | 0.900914 | 0.900988 | 0.900948 | 0.901 |
| 200 | 0.2325 | 0.254601 | 0.898402 | 0.898549 | 0.898457 | 0.8985 |


TrainOutput(global_step=250, training_loss=0.3269468746185303, metrics={'train_runtime': 322.7933, 'train_samples_per_second': 49.567, 'train_steps_per_second': 0.774, 'total_flos': 2119478378496000.0, 'train_loss': 0.3269468746185303, 'epoch': 2.0})
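
The throughput figures in `TrainOutput` follow directly from the step count, batch size and runtime. The sketch below redoes that arithmetic, assuming a single device so that each step processes exactly one batch of 64 examples.

```python
global_step = 250
batch_size = 64
train_examples = 8000
train_runtime = 322.7933  # seconds, copied from the TrainOutput above

# Total examples processed across all steps (single device assumed).
samples_seen = global_step * batch_size

samples_per_second = samples_seen / train_runtime
steps_per_second = global_step / train_runtime
epochs = samples_seen / train_examples

print(round(samples_per_second, 3), round(steps_per_second, 3), epochs)
```

The rounded values agree with the reported `train_samples_per_second`, `train_steps_per_second` and `epoch` fields, and the runtime of roughly 323 seconds is consistent with the 5-6 minute estimate.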

## Save and Load Fine-tuned BERT Model

In [38]:
save_path = 'fullfinetune-distilbert-classification'
trainer.save_model(save_path)

In [39]:
# remove model checkpoints
!rm -rf distilbert-fullfinetune-runs

huggingface/tokenizers: The current process just got forked, after parallelism has already been used. Disabling parallelism to avoid deadlocks...
	- Avoid using `tokenizers` before the fork if possible
	- Explicitly set the environment variable TOKENIZERS_PARALLELISM=(true | false)


In [40]:
loaded_model = AutoModelForSequenceClassification.from_pretrained(save_path)
loaded_tokenizer = AutoTokenizer.from_pretrained(save_path)

# Using your fine-tuned model for Classification

Once you've fine-tuned the model, you can use it for inference with a `pipeline` object as follows:

In [41]:
from transformers import pipeline

In [42]:
# Here you can load your locally trained and saved model
clf = pipeline(task='text-classification', 
               model=loaded_model, 
               tokenizer=loaded_tokenizer, 
               device='cuda')

In [43]:
document = "The movie was not good at all"

In [44]:
clf(document)

[{'label': 'NEGATIVE', 'score': 0.87693190574646}]

In [45]:
document = "The movie was amazing"

In [46]:
clf(document)

[{'label': 'POSITIVE', 'score': 0.9100762605667114}]

## Fine-tuned Transformer performance on Test Data

We can feed our test set (which the model has not seen) to our pipeline to get a feel for the quality of the model predictions.

In [47]:
imdb_data['test'][:2]

{'review': ['" While sporadically engrossing (including a few effectively tender moments) and humorous, the sledgehammer-obvious satire \'Homecoming\' hinges on comes off as forced and ultimately unfulfilling. With material like this, timing is everything (Michael Moore knew to release "Fahrenheit 9/11" before the 2004 elections), and the real tragedy of Dante\'s film is that it didn\'t come out 2 years ago, when its message would have carried an energy that would have energized the dissidents further. In 2006, mockery of the well-settled Bush Administration hardly seems as controversially compelling (or imperiled) as it did then."<br /><br />frankly anyone that could be convinced of anything by a ham fisted zombie flick has questionable intelligence. <br /><br />and if you didn\'t notice, michael moore didn\'t exactly help to defeat bush.<br /><br />there was nothing engrossing about this film. i just felt disgust at how blatant and frankly stupid the film was, it was painful to watch

Inference on the full test set takes roughly 1-2 minutes.

In [48]:
%%time

predictions = clf(imdb_data['test']['review'],
                  batch_size=512, 
                  max_length=512, 
                  truncation=True)
predictions = [pred['label'] for pred in predictions]

predictions = [0 if item == 'NEGATIVE' else 1 for item in predictions]
labels = imdb_data['test']['sentiment']

CPU times: user 1min 8s, sys: 109 ms, total: 1min 8s
Wall time: 49.9 s


In [49]:
from sklearn.metrics import confusion_matrix, classification_report

print(classification_report(labels, predictions))
pd.DataFrame(confusion_matrix(labels, predictions))

              precision    recall  f1-score   support

           0       0.91      0.90      0.90      5125
           1       0.89      0.90      0.90      4875

    accuracy                           0.90     10000
   macro avg       0.90      0.90      0.90     10000
weighted avg       0.90      0.90      0.90     10000



| | 0 | 1 |
|---|---|---|
| 0 | 4599 | 526 |
| 1 | 479 | 4396 |
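
The classification report can be derived from the confusion matrix alone. The sketch below assumes the scikit-learn convention that rows are true labels and columns are predicted labels; `report_from_confusion` is a hypothetical helper written for this illustration.

```python
# Rows: true label (0 = negative, 1 = positive). Columns: predicted label.
confusion = [[4599, 526],
             [479, 4396]]


def report_from_confusion(cm):
    """Compute per-class precision, recall, F1 and support, plus overall accuracy."""
    n_classes = len(cm)
    total = sum(sum(row) for row in cm)
    per_class = {}
    for k in range(n_classes):
        true_positive = cm[k][k]
        support = sum(cm[k])                  # how many examples truly belong to k
        predicted = sum(row[k] for row in cm)  # how many examples were predicted as k
        precision = true_positive / predicted
        recall = true_positive / support
        f1 = 2 * precision * recall / (precision + recall)
        per_class[k] = {"precision": precision, "recall": recall,
                        "f1": f1, "support": support}
    accuracy = sum(cm[k][k] for k in range(n_classes)) / total
    return per_class, accuracy


per_class, accuracy = report_from_confusion(confusion)
for label, scores in per_class.items():
    print(label, {name: round(value, 2) for name, value in scores.items()})
print("accuracy", accuracy)
```

Read this way, 526 negative reviews were predicted as positive and 479 positive reviews were predicted as negative, so the errors are fairly evenly split between the two classes.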
