# QLoRA fine-tuning of a BERT SLM for Classification

![](https://i.imgur.com/2Kw1yTZ.gif)

Transfer learning is the practice of leveraging already trained models and tuning or adapting them to our own downstream tasks.

Here we start by understanding, step by step, how to fine-tune a simple BERT Small Language Model (SLM) for a simple yet essential task in NLP: Text Classification for Sentiment Analysis.

Instead of full fine-tuning, we will use Parameter-Efficient Fine-tuning (PEFT) methodologies here, more specifically the Quantized Low-Rank Adaptation (QLoRA) technique.

# Sentiment Analysis

When it comes to text data, sentiment analysis is one of the most widely performed analyses. Sentiment analysis has seen tremendous improvements, from classic methods to recent state-of-the-art models that use deep learning to improve performance.

# Fine-tuning a model on a text classification task

In this notebook, we will see how to fine-tune one of the [🤗 Transformers](https://github.com/huggingface/transformers) models on a text classification task: Sentiment Analysis.

![](https://i.imgur.com/Pq7f3Fd.png)

___[Created By: Dipanjan (DJ)](https://www.linkedin.com/in/dipanjans/)___

In [1]:
import torch
torch.cuda.empty_cache()

In [2]:
!nvidia-smi

Sun Feb 16 11:33:19 2025       
+-----------------------------------------------------------------------------------------+
| NVIDIA-SMI 565.57.01              Driver Version: 565.57.01      CUDA Version: 12.7     |
|-----------------------------------------+------------------------+----------------------+
| GPU  Name                 Persistence-M | Bus-Id          Disp.A | Volatile Uncorr. ECC |
| Fan  Temp   Perf          Pwr:Usage/Cap |           Memory-Usage | GPU-Util  Compute M. |
|                                         |                        |               MIG M. |
|   0  NVIDIA A40                     On  |   00000000:56:00.0 Off |                    0 |
|  0%   29C    P8             21W /  300W |       1MiB /  46068MiB |      0%      Default |
|                                         |                        |                  N/A |
+-----------------------------------------+------------------------+----------------------+
                                                

You will be leveraging 🤗 Transformers and 🤗 Datasets as well as other dependencies

## Load Datasets

Here we load the IMDB Sentiment dataset, which we previously uploaded to the Hugging Face Hub.

In [4]:
import pandas as pd
from datasets import load_dataset

imdb_data = load_dataset("dipanjanS/imdb_sentiment_finetune_dataset20k")


In [5]:
imdb_data

DatasetDict({
    train: Dataset({
        features: ['review', 'sentiment'],
        num_rows: 8000
    })
    validation: Dataset({
        features: ['review', 'sentiment'],
        num_rows: 2000
    })
    test: Dataset({
        features: ['review', 'sentiment'],
        num_rows: 10000
    })
})

In [6]:
imdb_data.keys()

dict_keys(['train', 'validation', 'test'])

In [7]:
# Looking at the first two rows of the train dataset
imdb_data['train'][:2]

{'review': ["One of the other reviewers has mentioned that after watching just 1 Oz episode you'll be hooked. They are right, as this is exactly what happened with me.<br /><br />The first thing that struck me about Oz was its brutality and unflinching scenes of violence, which set in right from the word GO. Trust me, this is not a show for the faint hearted or timid. This show pulls no punches with regards to drugs, sex or violence. Its is hardcore, in the classic use of the word.<br /><br />It is called OZ as that is the nickname given to the Oswald Maximum Security State Penitentary. It focuses mainly on Emerald City, an experimental section of the prison where all the cells have glass fronts and face inwards, so privacy is not high on the agenda. Em City is home to many..Aryans, Muslims, gangstas, Latinos, Christians, Italians, Irish and more....so scuffles, death stares, dodgy dealings and shady agreements are never far away.<br /><br />I would say the main appeal of the show is d

This is a labeled dataset of IMDB movie reviews and their corresponding sentiment labels, where 1 means positive and 0 means negative.

The idea is to make BERT learn to predict the sentiment given the review.

Let's set up our evaluation metrics first!

We will use the [🤗 Evaluate](https://github.com/huggingface/evaluate) library to load the metrics we need for evaluation. This can be easily done with the function `evaluate.load`.

In [8]:
import evaluate

metric1 = evaluate.load("precision")
metric2 = evaluate.load("recall")
metric3 = evaluate.load("f1")
metric4 = evaluate.load("accuracy")

def evaluate_performance(predictions, references):
    precision = metric1.compute(predictions=predictions, references=references, average="macro")["precision"]
    recall = metric2.compute(predictions=predictions, references=references, average="macro")["recall"]
    f1 = metric3.compute(predictions=predictions, references=references, average="macro")["f1"]
    accuracy = metric4.compute(predictions=predictions, references=references)["accuracy"]
    return {"precision": precision, "recall": recall, "f1": f1, "accuracy": accuracy}



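You can call a metric's `compute` method with your predictions and labels, which for classification need to be lists of integer class IDs. Our `evaluate_performance` helper wraps these calls for all four metrics.

For classification, the most common metrics include accuracy and F1-score.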


In [9]:
predictions = [1,0,1,1,0]
references = [1,1,0,1,0]
scores = evaluate_performance(
    predictions=predictions, references=references
)
scores


{'precision': 0.5833333333333333,
 'recall': 0.5833333333333333,
 'f1': 0.5833333333333333,
 'accuracy': 0.6}
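
To see where these numbers come from, the macro-averaged scores can be recomputed by hand. The following is a minimal sketch in plain Python (the helper name `per_class_counts` is ours and not part of 🤗 Evaluate). It computes precision and recall for each class and then averages them with equal weight, which is what `average="macro"` means.

```python
predictions = [1, 0, 1, 1, 0]
references = [1, 1, 0, 1, 0]

def per_class_counts(preds, refs, cls):
    """Count true positives, false positives and false negatives for one class."""
    tp = sum(p == cls and r == cls for p, r in zip(preds, refs))
    fp = sum(p == cls and r != cls for p, r in zip(preds, refs))
    fn = sum(p != cls and r == cls for p, r in zip(preds, refs))
    return tp, fp, fn

precisions, recalls = [], []
for cls in (0, 1):
    tp, fp, fn = per_class_counts(predictions, references, cls)
    precisions.append(tp / (tp + fp))
    recalls.append(tp / (tp + fn))

# Macro averaging: every class contributes equally, regardless of its size.
macro_precision = sum(precisions) / len(precisions)
macro_recall = sum(recalls) / len(recalls)
accuracy = sum(p == r for p, r in zip(predictions, references)) / len(references)

print(precisions)       # per-class precision for class 0 and class 1
print(macro_precision)  # unweighted mean over the two classes
print(accuracy)
```

Class 0 has a precision of 1/2 and class 1 a precision of 2/3, so the macro average is 7/12, about 0.583, which matches the output above.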

## Preprocessing the data

Before we can feed those texts to our model, we need to preprocess them. This is done by a 🤗 Transformers `Tokenizer` which will (as the name indicates) tokenize the inputs (including converting the tokens to their corresponding IDs in the pretrained vocabulary) and put it in a format the model expects, as well as generate the other inputs that model requires.

To do all of this, we instantiate our tokenizer with the `AutoTokenizer.from_pretrained` method, which will ensure:

- we get a tokenizer that corresponds to the model architecture we want to use,
- we download the vocabulary used when pretraining this specific checkpoint.

That vocabulary will be cached, so it's not downloaded again the next time we run the cell.

This notebook should work with any model checkpoint from the [Model Hub](https://huggingface.co/models) as long as that model has a version with a classification head.

The standard base checkpoint for BERT is [`bert-base-uncased`](https://huggingface.co/bert-base-uncased).

![](https://i.imgur.com/GmFRcP3.png)

BERT can be used for a variety of tasks and we will fine-tune it for classification (sentiment).

Here, however, we will use a smaller, distilled version of the BERT model called DistilBERT so that training is faster.

In [10]:
from transformers import AutoTokenizer

model_checkpoint = "distilbert/distilbert-base-uncased"
tokenizer = AutoTokenizer.from_pretrained(model_checkpoint, use_fast=True)


We pass along `use_fast=True` to the call above to use one of the fast tokenizers (backed by Rust) from the 🤗 Tokenizers library. Those fast tokenizers are available for almost all models, but if you get an error with the previous call, remove that argument.

You can directly call this tokenizer on one sentence or a pair of sentences:

In [11]:
tokenizer("Hello, this is a sentence!")

{'input_ids': [101, 7592, 1010, 2023, 2003, 1037, 6251, 999, 102], 'attention_mask': [1, 1, 1, 1, 1, 1, 1, 1, 1]}

We can then write the function that will preprocess our samples. We just feed them to the `tokenizer` with the argument `truncation=True`.

This will ensure that an input longer than what the selected model can handle will be truncated to the maximum length accepted by the model.

The padding will be dealt with later on (in a data collator) so we pad examples to the longest length in the batch and not the whole dataset.

In [12]:
def preprocess_function(examples):
    # max length is 512, as that is the context window limit of BERT models:
    # each input document can be at most 512 tokens long, longer ones are truncated
    model_inputs = tokenizer(examples['review'], max_length=512, truncation=True)
    model_inputs["label"] = examples["sentiment"]
    return model_inputs

This function works with one or several documents. In the case of several documents, the tokenizer will return a list of lists for each key:

In [13]:
preprocess_function(imdb_data["train"][:2])

{'input_ids': [[101, 2028, 1997, 1996, 2060, 15814, 2038, 3855, 2008, 2044, 3666, 2074, 1015, 11472, 2792, 2017, 1005, 2222, 2022, 13322, 1012, 2027, 2024, 2157, 1010, 2004, 2023, 2003, 3599, 2054, 3047, 2007, 2033, 1012, 1026, 7987, 1013, 1028, 1026, 7987, 1013, 1028, 1996, 2034, 2518, 2008, 4930, 2033, 2055, 11472, 2001, 2049, 24083, 1998, 4895, 10258, 2378, 8450, 5019, 1997, 4808, 1010, 2029, 2275, 1999, 2157, 2013, 1996, 2773, 2175, 1012, 3404, 2033, 1010, 2023, 2003, 2025, 1037, 2265, 2005, 1996, 8143, 18627, 2030, 5199, 3593, 1012, 2023, 2265, 8005, 2053, 17957, 2007, 12362, 2000, 5850, 1010, 3348, 2030, 4808, 1012, 2049, 2003, 13076, 1010, 1999, 1996, 4438, 2224, 1997, 1996, 2773, 1012, 1026, 7987, 1013, 1028, 1026, 7987, 1013, 1028, 2009, 2003, 2170, 11472, 2004, 2008, 2003, 1996, 8367, 2445, 2000, 1996, 17411, 4555, 3036, 2110, 7279, 4221, 12380, 2854, 1012, 2009, 7679, 3701, 2006, 14110, 2103, 1010, 2019, 6388, 2930, 1997, 1996, 3827, 2073, 2035, 1996, 4442, 2031, 3221, 21430

To apply this function to all the reviews in our dataset, we just use the `map` method of the `imdb_data` object we loaded earlier.

This will apply the function to all the elements of all the splits in `imdb_data`, so our training, validation and testing data will be preprocessed in one single command.

In [14]:
tokenized_datasets = imdb_data.map(preprocess_function, batched=True)


In [15]:
# remove unnecessary columns
tokenized_datasets = tokenized_datasets.remove_columns('review')
tokenized_datasets = tokenized_datasets.remove_columns('sentiment')

## Parameter Efficient Fine-tuning the Transformer Model

Now that our data is ready, we can download the pretrained model and fine-tune it.

Since our task is about sentence classification, we use the `AutoModelForSequenceClassification` class.

Like with the tokenizer, the `from_pretrained` method will download and cache the model for us.

The only thing we have to specify is the number of labels for our problem, which is 2.

Since we are using QLoRA, we also need to quantize the model weights, which basically means loading them in a lower-precision format (4-bit here).
This makes the model take much less GPU memory.

### Quantization

Quantization represents data with fewer bits, making it a useful technique for reducing memory usage and accelerating inference, especially when it comes to large language models (LLMs).

However, after a model is quantized it isn’t typically further trained for downstream tasks because training can be unstable due to the lower precision of the weights and activations.

But since PEFT methods only add extra trainable parameters, this allows you to train a quantized model with a PEFT adapter on top with some special training methodologies!

Combining quantization with PEFT can be a good strategy for training even the largest models on a single GPU.

For example, __QLoRA is a method that quantizes a model to 4-bits and then trains it with LoRA__
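
To make the idea of quantization concrete, here is a deliberately simplified sketch of block-wise 4-bit quantization in plain Python. It uses uniform absmax scaling to integer levels in [-7, 7], which is an assumption made purely for illustration: it is not the NF4 scheme applied by `bitsandbytes`, which uses non-uniform levels. The function names are hypothetical.

```python
def quantize_absmax_4bit(weights):
    """Map floats to integers in [-7, 7] using a single absmax scale for the block."""
    scale = max(abs(w) for w in weights) / 7  # one floating-point constant per block
    codes = [round(w / scale) for w in weights]
    return codes, scale

def dequantize(codes, scale):
    """Recover approximate float values from the integer codes."""
    return [c * scale for c in codes]

block = [0.12, -0.53, 0.07, 0.91, -0.33, 0.48, -0.02, 0.26]
codes, scale = quantize_absmax_4bit(block)
restored = dequantize(codes, scale)
max_error = max(abs(w - r) for w, r in zip(block, restored))

print(codes)      # small integers that fit in 4 bits each
print(scale)      # the quantization constant stored alongside the block
print(max_error)  # rounding error, at most half a quantization step
```

Each block of weights keeps one such constant next to its 4-bit codes. The frozen weights lose a little precision, which is why QLoRA trains separate higher-precision LoRA matrices rather than the quantized weights themselves.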

### Quantize and load BERT SLM

`bitsandbytes` is a quantization library with a Transformers integration. With this integration, you can quantize any SLM or LLM to 8 or 4-bits and enable many other options by configuring the `BitsAndBytesConfig` class.

In [42]:
# we put in a mapping so the model knows which prediction label ID is which text label (human friendly)
id2label = {0: "NEGATIVE", 1: "POSITIVE"}
label2id = {"NEGATIVE": 0, "POSITIVE": 1}

In [43]:
import torch
from transformers import AutoModelForSequenceClassification, TrainingArguments, Trainer, BitsAndBytesConfig

config = BitsAndBytesConfig(
    # Quantize the model weights to 4-bit precision upon loading, reducing memory usage.
    load_in_4bit=True,  
    # Use the 'Normalized Float 4' (NF4) data type, which uses a normal distribution to encode weights with just 4 bits
    bnb_4bit_quant_type="nf4",  
    # Apply double quantization: first quantize weights to 4-bit, then quantize the quantization constants used for quantizing weights
    bnb_4bit_use_double_quant=True,  
    # Utilize bfloat16 for computation, which takes less memory
    bnb_4bit_compute_dtype=torch.bfloat16,  
    # Skip quantization for specified modules, which will be trained separately
    llm_int8_skip_modules=["classifier", "pre_classifier"]  
)


model = AutoModelForSequenceClassification.from_pretrained(model_checkpoint,
                                                           id2label=id2label,
                                                           label2id=label2id,
                                                           num_labels=2,
                                                           quantization_config=config)

`low_cpu_mem_usage` was None, now default to True since model is quantized.
Some weights of DistilBertForSequenceClassification were not initialized from the model checkpoint at distilbert/distilbert-base-uncased and are newly initialized: ['classifier.bias', 'classifier.weight', 'pre_classifier.bias', 'pre_classifier.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


The warning is telling us that some weights (the `pre_classifier` and `classifier` layers) are newly and randomly initialized. The layers of the pretraining head (`vocab_transform` and `vocab_layer_norm`) are not used by this model class and are discarded.

This is absolutely normal in this case, because we are removing the head used to pretrain the model on a masked language modeling objective and replacing it with a new head for which we don't have pretrained weights.

So the library warns us we should fine-tune this model before using it for inference, which is exactly what we are going to do to make the model learn how to predict the two classes

In [44]:
from peft import prepare_model_for_kbit_training

model = prepare_model_for_kbit_training(model)

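The above function freezes all the base model parameters, casts the remaining non-quantized half-precision parameters to float32 for stability, and prepares the model for gradient checkpointing by making the embedding outputs require gradients, so that gradients can flow back to the adapters we add next. The embedding weights themselves stay frozen.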

In [45]:
def print_trainable_parameters(model):
    """
    Prints the number of trainable parameters in the model.
    """
    trainable_params = 0
    all_param = 0
    for _, param in model.named_parameters():
        all_param += param.numel()
        if param.requires_grad:
            trainable_params += param.numel()
    print(
        f"trainable params: {trainable_params} || all params: {all_param} || trainable%: {100 * trainable_params / all_param}"
    )

In [46]:
print_trainable_parameters(model)

trainable params: 0 || all params: 45721346 || trainable%: 0.0


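The output above shows that none of the model parameters are trainable at this point. Note that the reported total of about 45.7 million is lower than DistilBERT's roughly 66 million parameters because the 4-bit quantized weights are stored in a packed format, so `numel()` counts fewer elements for them. We will now add low-rank weight matrices for specific weight matrices in the self-attention layers, and we will train these instead of the actual model weight matrices.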
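
The reported total can be checked with some arithmetic. This is a sketch under the assumption that `bitsandbytes` stores two 4-bit values per byte, so that each quantized `Linear4bit` weight reports half as many elements through `numel()`. The layer shapes are those of DistilBERT base (6 blocks, hidden size 768, feed-forward size 3072).

```python
hidden, ffn, n_layers = 768, 3072, 6

# Weights of the quantized linear layers in one transformer block:
# four attention projections (q, k, v, out) plus the two feed-forward layers.
attention_weights = 4 * hidden * hidden
ffn_weights = 2 * hidden * ffn
quantized_weights = n_layers * (attention_weights + ffn_weights)

# Assumption: two 4-bit values are packed into each stored element.
packed_elements = quantized_weights // 2

reported_total = 45_721_346  # value printed by print_trainable_parameters
true_total = reported_total - packed_elements + quantized_weights

print(quantized_weights)  # 42467328
print(packed_elements)    # 21233664
print(true_total)         # 66955010
```

Under this assumption the model still holds roughly 67 million parameters including the classification head; only the reported element count shrinks.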

Let's explore our model architecture and see what layer and weight matrices are present in our BERT SLM

In [47]:
model

DistilBertForSequenceClassification(
  (distilbert): DistilBertModel(
    (embeddings): Embeddings(
      (word_embeddings): Embedding(30522, 768, padding_idx=0)
      (position_embeddings): Embedding(512, 768)
      (LayerNorm): LayerNorm((768,), eps=1e-12, elementwise_affine=True)
      (dropout): Dropout(p=0.1, inplace=False)
    )
    (transformer): Transformer(
      (layer): ModuleList(
        (0-5): 6 x TransformerBlock(
          (attention): DistilBertSdpaAttention(
            (dropout): Dropout(p=0.1, inplace=False)
            (q_lin): Linear4bit(in_features=768, out_features=768, bias=True)
            (k_lin): Linear4bit(in_features=768, out_features=768, bias=True)
            (v_lin): Linear4bit(in_features=768, out_features=768, bias=True)
            (out_lin): Linear4bit(in_features=768, out_features=768, bias=True)
          )
          (sa_layer_norm): LayerNorm((768,), eps=1e-12, elementwise_affine=True)
          (ffn): FFN(
            (dropout): Dropout(p=0.1,

We can see above in each transformer encoder block we have attention layer matrices as well as dense layer (feedforward) matrices.

We will focus on the attention layer matrices by creating low-rank (LoRA) matrices that learn an update to each weight matrix in the attention layer.

NOTE: During PEFT we don't finetune the actual attention weight matrices but these LoRA matrices as depicted here

![](https://i.imgur.com/Tyn3YTp.png)

In [48]:
from peft import LoraConfig, get_peft_model, TaskType, replace_lora_weights_loftq

# Set up the LoRA configuration for the model
config = LoraConfig(
    r=8,  # Rank of the LoRA matrices; a smaller rank reduces memory usage but may affect model performance.
    lora_alpha=32,  # Scaling factor applied to the LoRA updates; helps control the contribution of the LoRA weights.
    target_modules=["q_lin", "k_lin", "v_lin", "out_lin"],  # Specify the modules (weight matrix) within the model where LoRA is applied.
    lora_dropout=0.05,  # Dropout probability for LoRA layers to prevent overfitting during training.
    bias="none",  # Which bias terms to make trainable: "none", "all" or "lora_only".
    task_type=TaskType.SEQ_CLS  # Defines the task type, here it's set to sequence classification.
)

# Apply the LoRA configuration to the model
peft_model = get_peft_model(model, config)

# Print the number of trainable parameters in the model after applying LoRA
print_trainable_parameters(peft_model)

trainable params: 887042 || all params: 46608388 || trainable%: 1.903181032564353


In [49]:
peft_model.device

device(type='cuda', index=0)

We are now ready to train our model. As set above, the LoRA matrices have a rank of 8, which means the update to each frozen weight matrix of size __768 x 768__ is represented by two low-rank trainable weight matrices of size __768 x 8__ and __8 x 768__.
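
The trainable parameter count printed earlier (887042) can be reproduced by hand. The sketch below assumes that, for the sequence classification task type, the classification head (`pre_classifier` and `classifier`) is kept trainable alongside the LoRA matrices; the arithmetic is consistent with that assumption.

```python
hidden, rank, n_layers = 768, 8, 6
target_modules = ["q_lin", "k_lin", "v_lin", "out_lin"]

# Each adapted layer gets two matrices: 768 x 8 and 8 x 768.
lora_params_per_module = hidden * rank + rank * hidden
lora_params = n_layers * len(target_modules) * lora_params_per_module

# Classification head (weights + biases), assumed to stay trainable.
pre_classifier_params = hidden * hidden + hidden
classifier_params = hidden * 2 + 2
head_params = pre_classifier_params + classifier_params

print(lora_params)                # 294912
print(head_params)                # 592130
print(lora_params + head_params)  # 887042
print(hidden * hidden)            # 589824 weights in one full 768 x 768 matrix
```

Each adapted matrix therefore needs 12288 trainable values instead of 589824, and most of the trainable parameters reported above belong to the small classification head rather than to the LoRA matrices.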

In [50]:
type(peft_model)

peft.peft_model.PeftModelForSequenceClassification

To instantiate a `Trainer`, we will need to define two more things. The most important is the [`TrainingArguments`](https://huggingface.co/transformers/main_classes/trainer.html#transformers.TrainingArguments), which is a class that contains all the attributes to customize the training. It requires one folder name, which will be used to save the checkpoints of the model, and all other arguments are optional:

In [51]:
# if batch size is 64
# if total documents are 8000
# total number of steps (batches of data) to complete 1 full epoch is?
8000 // 64

125

In [52]:
# total steps to run two epochs are?
125 * 2

250

In [53]:
from transformers import TrainingArguments

batch_size = 64 
metric_name = "f1"

# Set up the training arguments
args = TrainingArguments(
    output_dir="distilbert-cls-qlorafinetune-runs",  # Directory where the model checkpoints and outputs will be saved.
    gradient_checkpointing=True,
    gradient_checkpointing_kwargs={'use_reentrant':False}, # suppress warnings
    eval_strategy="steps",                          # Perform evaluation at regular intervals during training.
    save_strategy="steps",                          # Save the model checkpoint at regular intervals.
    learning_rate=1e-4,                             # Initial learning rate for the optimizer.
    logging_steps=20,                               # Log training metrics every 20 steps.
    eval_steps=20,                                  # Perform evaluation every 20 steps.
    save_steps=50,                                  # Save the model checkpoint every 50 steps.
    per_device_train_batch_size=batch_size,         # Batch size per GPU/TPU core/CPU during training.
    per_device_eval_batch_size=batch_size,          # Batch size per GPU/TPU core/CPU during evaluation.
    max_steps=250,                                  # Stop training after 250 total steps.
    weight_decay=0.01,                              # Apply weight decay to reduce overfitting.
    metric_for_best_model=metric_name,              # Metric to use for selecting the best model during evaluation.
    push_to_hub=False,                              # Do not push the model to the Hugging Face Hub after training.
    fp16=True,                                      # Use 16-bit floating point precision to reduce memory usage and speed up training.
    optim="paged_adamw_8bit",                       # Use an 8-bit AdamW optimizer for memory efficiency and faster computation.
)

We use DataCollatorWithPadding to create a batch of examples. It will also dynamically pad your text to the length of the longest element in its batch, so they are a uniform length.

While it is possible to pad your text in the tokenizer function by setting `padding=True`, dynamic padding is more efficient.

In [54]:
from transformers import DataCollatorWithPadding

data_collator = DataCollatorWithPadding(tokenizer=tokenizer)
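
Conceptually, dynamic padding works as in the following plain-Python sketch. This is an illustration only, not the collator's actual implementation; `pad_batch` is a hypothetical helper and the token IDs are made up. The padding ID of 0 matches the `padding_idx=0` shown in the model summary above.

```python
def pad_batch(batch_input_ids, pad_token_id=0):
    """Pad every sequence in one batch to the longest sequence in that batch."""
    longest = max(len(ids) for ids in batch_input_ids)
    input_ids, attention_mask = [], []
    for ids in batch_input_ids:
        n_pad = longest - len(ids)
        input_ids.append(ids + [pad_token_id] * n_pad)
        # 1 marks a real token, 0 marks padding the model should ignore.
        attention_mask.append([1] * len(ids) + [0] * n_pad)
    return {"input_ids": input_ids, "attention_mask": attention_mask}

# Two batches with different maximum lengths (token IDs are illustrative).
short_batch = [[101, 7592, 102], [101, 2023, 2003, 102]]
long_batch = [[101, 2028, 1997, 1996, 2060, 102], [101, 102]]

print(pad_batch(short_batch))  # padded to length 4
print(pad_batch(long_batch))   # padded to length 6
```

Because each batch is padded only to its own longest example, batches of short reviews do not pay the cost of the longest review in the whole dataset.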

The last thing to define for our `Trainer` is how to compute the metrics from the predictions.

We need to define a function for this, which will just use the `evaluate_performance` function we defined earlier; the only preprocessing we have to do is to take the argmax of our predicted logits.

In [55]:
import numpy as np

def compute_metrics(eval_pred):
    predictions, labels = eval_pred
    predictions = np.argmax(predictions, axis=1)
    return evaluate_performance(predictions=predictions, references=labels)
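
As a small illustration of the argmax step, here is what it does to a batch of logits. The logit values are hypothetical and only serve to show the shape of the data: one row per review, one column per class.

```python
import numpy as np

# Hypothetical logits for three reviews; columns are [NEGATIVE, POSITIVE].
logits = np.array([
    [2.1, -1.3],
    [-0.4, 0.9],
    [0.2, 0.1],
])

# The predicted class is the column with the largest logit in each row.
predicted_ids = np.argmax(logits, axis=1)
print(predicted_ids)  # [0 1 0]
```

These integer class IDs are what the metric functions compare against the reference labels.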

Then we just need to pass all of this along with our datasets to the `Trainer`:

In [56]:
trainer = Trainer(
    model=peft_model,
    args=args,
    train_dataset=tokenized_datasets["train"],
    eval_dataset=tokenized_datasets['validation'],
    processing_class=tokenizer, # used to be called tokenizer earlier
    data_collator=data_collator,
    compute_metrics=compute_metrics
)

We can now finetune our model by just calling the `train` method

Remember we are NOT fine-tuning the actual BERT model here, since its weights are mostly frozen. We are fine-tuning the LoRA matrices, which is why the result is popularly called a LoRA adapter.

Run and wait for a few minutes (around 3 mins in the run shown here) on a 48GB GPU; this uses much less GPU memory than full fine-tuning!

In [57]:
trainer.train()



| Step | Training Loss | Validation Loss | Precision | Recall | F1 | Accuracy |
|---|---|---|---|---|---|---|
| 20 | 0.6792 | 0.648498 | 0.785919 | 0.772996 | 0.772248 | 0.7755 |
| 40 | 0.5939 | 0.464953 | 0.846325 | 0.846141 | 0.845497 | 0.8455 |
| 60 | 0.3796 | 0.331731 | 0.8605 | 0.860725 | 0.860478 | 0.8605 |
| 80 | 0.3256 | 0.307256 | 0.8754 | 0.875485 | 0.875437 | 0.8755 |
| 100 | 0.3114 | 0.294619 | 0.878956 | 0.878874 | 0.878912 | 0.879 |
| 120 | 0.2779 | 0.287186 | 0.879645 | 0.879212 | 0.879365 | 0.8795 |
| 140 | 0.2363 | 0.285744 | 0.883888 | 0.883089 | 0.883327 | 0.8835 |
| 160 | 0.3031 | 0.28141 | 0.887151 | 0.885328 | 0.885726 | 0.886 |
| 180 | 0.2572 | 0.270919 | 0.887589 | 0.887817 | 0.88749 | 0.8875 |
| 200 | 0.2568 | 0.267077 | 0.888965 | 0.888881 | 0.888919 | 0.889 |



TrainOutput(global_step=250, training_loss=0.3423960018157959, metrics={'train_runtime': 176.2775, 'train_samples_per_second': 90.766, 'train_steps_per_second': 1.418, 'total_flos': 2163078266880000.0, 'train_loss': 0.3423960018157959, 'epoch': 2.0})

## Save and Load Fine-tuned BERT classification LoRA Adapter

Parameter-Efficient Fine Tuning (PEFT) methods freeze the pretrained model parameters during fine-tuning and add a small number of trainable parameters (the LoRA adapters) on top of it. 

The adapters are trained to learn task-specific information. 

In this case we trained a LoRA adapter for Classification with BERT, let's save it.

Adapters trained with PEFT are also usually an order of magnitude smaller than the full model, making it convenient to share, store, load and switch between them!

In [58]:
save_path = 'qlora-distilbert-sentiment-adapter'
trainer.save_model(save_path)

In [59]:
# remove model checkpoints
!rm -rf distilbert-cls-qlorafinetune-runs



In [60]:
!du -sh * | sort -hr | grep qlora

6.3M	qlora-distilbert-sentiment-adapter




You can see the LoRA adapter (matrix) weights are only 6.3 MB! 

We can just add them on top of the base BERT model anytime and use it for classification!

Let's try this now

## Load Classification LoRA Adapter into Base Model

We start by loading the base DistilBERT model here

In [61]:
# load the base BERT model first
cls_model = AutoModelForSequenceClassification.from_pretrained('distilbert/distilbert-base-uncased',
                                                                id2label=id2label,
                                                                label2id=label2id,
                                                                num_labels=2)
tokenizer = AutoTokenizer.from_pretrained('distilbert/distilbert-base-uncased', use_fast=True)

Some weights of DistilBertForSequenceClassification were not initialized from the model checkpoint at distilbert/distilbert-base-uncased and are newly initialized: ['classifier.bias', 'classifier.weight', 'pre_classifier.bias', 'pre_classifier.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


Now we add in the Classification LoRA Adapter to this base model.

So at inference time, the product of the two corresponding LoRA matrices learned during fine-tuning (AxB, scaled by `lora_alpha / r`) is effectively added to each frozen attention weight matrix (W), i.e. __W + AxB__, and these updated weights are used inside the model for inference and predictions.

In [62]:
cls_model.load_adapter(peft_model_id='qlora-distilbert-sentiment-adapter',
                       adapter_name='sentiment-classifier')
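
The effect of the adapter on one attention weight matrix can be sketched with NumPy. This is an illustration with random stand-in values, not the PEFT implementation: the names `W`, `A` and `B` follow the text above rather than PEFT's internal layout, and the scaling factor `lora_alpha / r` uses the values from our LoRA configuration (32 / 8).

```python
import numpy as np

rng = np.random.default_rng(0)
d, r, lora_alpha = 768, 8, 32
scaling = lora_alpha / r

W = rng.normal(size=(d, d))         # frozen base weight (stand-in values)
A = rng.normal(size=(d, r)) * 0.01  # low-rank factor of size 768 x 8
B = rng.normal(size=(r, d)) * 0.01  # low-rank factor of size 8 x 768
x = rng.normal(size=(4, d))         # a small batch of input vectors

# Adapter kept separate: base output plus the scaled low-rank correction.
separate = x @ W + scaling * (x @ A @ B)

# Adapter merged: fold the correction into a single weight matrix first.
W_merged = W + scaling * (A @ B)
merged = x @ W_merged

print(np.allclose(separate, merged))  # True
```

Both routes give the same outputs. Keeping the adapter separate makes it easy to swap adapters on one base model, while folding it into the weights removes the extra matrix products at inference time.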

# Using your fine-tuned model for Classification

Once you’ve fine-tuned the model you can use it with a pipeline object, for inference as follows:

In [63]:
from transformers import pipeline

In [64]:
# Here you can load your locally trained or saved model
clf = pipeline(task='text-classification', 
               model=cls_model, 
               tokenizer=tokenizer, 
               device='cuda')

Device set to use cuda


In [65]:
document = "The movie was not good at all"

In [66]:
clf(document)

[{'label': 'NEGATIVE', 'score': 0.9393364787101746}]

In [67]:
document = "The movie was amazing"

In [68]:
clf(document)

[{'label': 'POSITIVE', 'score': 0.9716503024101257}]

## Fine-tuned Transformer performance on Test Data

We can feed our test set (which the model has not seen) to our pipeline to get a feel for the quality of the model predictions.

In [69]:
imdb_data['test'][:2]

{'review': ['" While sporadically engrossing (including a few effectively tender moments) and humorous, the sledgehammer-obvious satire \'Homecoming\' hinges on comes off as forced and ultimately unfulfilling. With material like this, timing is everything (Michael Moore knew to release "Fahrenheit 9/11" before the 2004 elections), and the real tragedy of Dante\'s film is that it didn\'t come out 2 years ago, when its message would have carried an energy that would have energized the dissidents further. In 2006, mockery of the well-settled Bush Administration hardly seems as controversially compelling (or imperiled) as it did then."<br /><br />frankly anyone that could be convinced of anything by a ham fisted zombie flick has questionable intelligence. <br /><br />and if you didn\'t notice, michael moore didn\'t exactly help to defeat bush.<br /><br />there was nothing engrossing about this film. i just felt disgust at how blatant and frankly stupid the film was, it was painful to watch

Inference on the full test data takes roughly a minute.

In [70]:
%%time

predictions = clf(imdb_data['test']['review'],
                  batch_size=512, 
                  max_length=512, 
                  truncation=True)
predictions = [pred['label'] for pred in predictions]

predictions = [0 if item == 'NEGATIVE' else 1 for item in predictions]
labels = imdb_data['test']['sentiment']

CPU times: user 1min 7s, sys: 41.7 ms, total: 1min 7s
Wall time: 45.5 s


In [71]:
from sklearn.metrics import confusion_matrix, classification_report

print(classification_report(labels, predictions))
pd.DataFrame(confusion_matrix(labels, predictions))

              precision    recall  f1-score   support

           0       0.92      0.86      0.89      5125
           1       0.86      0.92      0.89      4875

    accuracy                           0.89     10000
   macro avg       0.89      0.89      0.89     10000
weighted avg       0.89      0.89      0.89     10000



| | 0 | 1 |
|---|---|---|
| 0 | 4389 | 736 |
| 1 | 387 | 4488 |
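
The figures in the classification report follow directly from the confusion matrix. As a quick check, here is a small sketch in plain Python that recomputes them, relying on scikit-learn's convention that rows are true labels and columns are predicted labels.

```python
# Rows are true labels, columns are predicted labels.
cm = [[4389, 736],
      [387, 4488]]

total = sum(sum(row) for row in cm)
accuracy = (cm[0][0] + cm[1][1]) / total

per_class = {}
for cls in (0, 1):
    true_positives = cm[cls][cls]
    predicted_as_cls = cm[0][cls] + cm[1][cls]  # column sum
    actually_cls = sum(cm[cls])                 # row sum
    per_class[cls] = {
        "precision": true_positives / predicted_as_cls,
        "recall": true_positives / actually_cls,
    }
    print(cls, round(per_class[cls]["precision"], 2), round(per_class[cls]["recall"], 2))

print(round(accuracy, 2))  # 0.89
```

For example, 736 of the 5125 negative reviews were predicted as positive, which gives the recall of 0.86 for class 0.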


## Merge Classification LoRA Adapter into Base BERT Model

Instead of loading the LoRA adapter weights into the base model every time before doing inference,
we can merge them directly into the weights of the base model and create a final model.

This also helps with faster inference, and you don't need to load both the model and the adapter every time.

In [72]:
from peft import PeftModel, PeftConfig
from transformers import AutoModelForSequenceClassification, AutoTokenizer

peft_model_id = "qlora-distilbert-sentiment-adapter"
config = PeftConfig.from_pretrained(save_path) # peft_model_id or save_path
base_model = AutoModelForSequenceClassification.from_pretrained(config.base_model_name_or_path,
                                                                id2label=id2label,
                                                                label2id=label2id,
                                                                num_labels=2)
tokenizer = AutoTokenizer.from_pretrained(config.base_model_name_or_path, use_fast=True)

peft_model = PeftModel.from_pretrained(base_model, save_path).to('cuda')

Some weights of DistilBertForSequenceClassification were not initialized from the model checkpoint at distilbert/distilbert-base-uncased and are newly initialized: ['classifier.bias', 'classifier.weight', 'pre_classifier.bias', 'pre_classifier.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


In [73]:
merged_cls_model = peft_model.merge_and_unload()

In [74]:
save_path = 'merged-qlora-distilbert-classifier'

merged_cls_model.save_pretrained(save_path)
tokenizer.save_pretrained(save_path)

('merged-qlora-distilbert-classifier/tokenizer_config.json',
 'merged-qlora-distilbert-classifier/special_tokens_map.json',
 'merged-qlora-distilbert-classifier/vocab.txt',
 'merged-qlora-distilbert-classifier/added_tokens.json',
 'merged-qlora-distilbert-classifier/tokenizer.json')

In [75]:
# load the merged BERT model 
cls_model = AutoModelForSequenceClassification.from_pretrained('merged-qlora-distilbert-classifier',
                                                                id2label=id2label,
                                                                label2id=label2id,
                                                                num_labels=2)
tokenizer = AutoTokenizer.from_pretrained('merged-qlora-distilbert-classifier', use_fast=True)

In [76]:
clf = pipeline(task='text-classification', 
               model=cls_model, 
               tokenizer=tokenizer, 
               device='cuda')

Device set to use cuda


In [77]:
document = "The movie was not good at all"

In [78]:
clf(document)

[{'label': 'NEGATIVE', 'score': 0.9393364787101746}]

In [79]:
document = "The movie was amazing"

In [81]:
clf(document)

[{'label': 'POSITIVE', 'score': 0.9716503024101257}]