# `transformers` meets `bitsandbytes` for democratizing Large Language Models (LLMs) through 4bit quantization

<center>
<img src="https://github.com/huggingface/blog/blob/main/assets/96_hf_bitsandbytes_integration/Thumbnail_blue.png?raw=true" alt="drawing" width="700" class="center"/>
</center>

Welcome to this notebook, which walks through the recent `bitsandbytes` integration. It includes the work from XXX that introduces 4bit quantization techniques with no performance degradation, democratizing LLM inference and training.

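In this notebook, we will learn together how to load a model in 4bit and train it using Google Colab and the PEFT library from Hugging Face 🤗. The recipe targets large models such as `gpt-neo-x-20b`; the cells below apply it to the much smaller `huawei-noah/TinyBERT_General_4L_312D`.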

[In the general usage notebook](https://colab.research.google.com/drive/1ge2F1QSK8Q7h0hn3YKuBCOAS0bK8E0wf?usp=sharing), you can learn how to properly load a model in 4bit with all its variants.

If you liked the previous work for integrating [*LLM.int8*](https://arxiv.org/abs/2208.07339), you can have a look at the [introduction blogpost](https://huggingface.co/blog/hf-bitsandbytes-integration) to learn more about that quantization method.


In [None]:
!pip install -q -U bitsandbytes
!pip install -q -U git+https://github.com/huggingface/transformers.git
!pip install -q -U git+https://github.com/huggingface/peft.git
!pip install -q -U git+https://github.com/huggingface/accelerate.git
!pip install -q datasets

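First let's load the model we are going to use. The introduction mentions GPT-NeoX-20B, which is around 40GB in half precision; the cell below instead loads the much smaller `huawei-noah/TinyBERT_General_4L_312D` with a sequence classification head, using the same 4bit configuration.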

In [None]:
import torch
from transformers import AutoTokenizer, AutoModelForCausalLM, BitsAndBytesConfig, AutoModelForSequenceClassification

model_id = "huawei-noah/TinyBERT_General_4L_312D"
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_use_double_quant=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.bfloat16
)

tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForSequenceClassification.from_pretrained(model_id, quantization_config=bnb_config, device_map={"":0})

Some weights of BertForSequenceClassification were not initialized from the model checkpoint at huawei-noah/TinyBERT_General_4L_312D and are newly initialized: ['classifier.bias', 'classifier.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.
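
The 40GB half-precision figure quoted for GPT-NeoX-20B follows from simple arithmetic: roughly 20 billion parameters at 2 bytes each. Below is a minimal sketch of that estimate. It assumes weights dominate memory and ignores quantization constants, activations and optimizer state; the helper name `weight_memory_gb` is hypothetical and not part of any library.

```python
def weight_memory_gb(num_params: float, bits_per_param: float) -> float:
    """Approximate weight storage in GB (1 GB = 1e9 bytes)."""
    return num_params * bits_per_param / 8 / 1e9


# Assumption: 20e9 parameters, read off the "20B" in the model name
for label, bits in [("fp32", 32), ("fp16/bf16", 16), ("int8", 8), ("4bit", 4)]:
    print(f"{label:>9}: {weight_memory_gb(20e9, bits):.1f} GB")
```

Under these assumptions, going from 16 bits to 4 bits per weight cuts the weight storage to a quarter, which is what makes a model of that size plausible on a single GPU.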


Then we have to apply some preprocessing to the model to prepare it for training. For that, use the `prepare_model_for_kbit_training` function from PEFT.

In [None]:
from peft import prepare_model_for_kbit_training

model.gradient_checkpointing_enable()
model = prepare_model_for_kbit_training(model)

In [None]:
def print_trainable_parameters(model):
    """
    Prints the number of trainable parameters in the model.
    """
    trainable_params = 0
    all_param = 0
    for _, param in model.named_parameters():
        all_param += param.numel()
        if param.requires_grad:
            trainable_params += param.numel()
    print(
        f"trainable params: {trainable_params} || all params: {all_param} || trainable%: {100 * trainable_params / all_param}"
    )

In [None]:
from peft import LoraConfig, get_peft_model

config = LoraConfig(
    r=16,
    lora_alpha=16,
    target_modules=["query", "key", "value"],
    lora_dropout=0.05,
    bias="none",
    task_type="SEQ_CLS"
)

model = get_peft_model(model, config)
print_trainable_parameters(model)

trainable params: 120434 || all params: 12146284 || trainable%: 0.9915295904492271
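
The trainable parameter count printed above can be reproduced by hand. The sketch below assumes a hidden size of 312 and 4 encoder layers (as the model name suggests), 2 output labels, and that PEFT keeps the classification head trainable for `task_type="SEQ_CLS"`; the helper `lora_params` is hypothetical.

```python
def lora_params(in_features: int, out_features: int, r: int) -> int:
    """Parameters added by one LoRA adapter: A is (r x in), B is (out x r)."""
    return r * in_features + out_features * r


# Assumptions: hidden size 312, 4 encoder layers, 2 labels, rank 16
hidden, layers, num_labels, r = 312, 4, 2, 16
targets_per_layer = 3  # query, key, value

adapter_params = layers * targets_per_layer * lora_params(hidden, hidden, r)
classifier_params = hidden * num_labels + num_labels  # weight + bias
trainable = adapter_params + classifier_params

all_params = 12146284  # copied from the printed output above
print(f"trainable params: {trainable} || trainable%: {100 * trainable / all_params}")
```

Each adapter adds `r * (in + out)` parameters, so doubling `r` roughly doubles the trainable share while the frozen base model stays the same size.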


Let's install `evaluate` and load a common benchmark, the MRPC paraphrase task from GLUE, to fine-tune our model on sentence-pair classification.

In [None]:
!pip install evaluate



In [None]:
import evaluate
import numpy as np
import transformers

metric = evaluate.load("accuracy")

# 4. Define compute_metrics function
def compute_metrics(eval_pred):
    predictions, labels = eval_pred
    predictions = np.argmax(predictions, axis=1)  # Convert logits to predicted class
    return metric.compute(predictions=predictions, references=labels)


In [None]:
# 4. Load GLUE dataset (MRPC task)
from datasets import load_dataset  # Import the load_dataset function

task = "mrpc"
dataset = load_dataset("glue", task)
metric = evaluate.load("glue", task)
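
Because `metric` is reloaded here as the GLUE MRPC metric, `compute_metrics` reports both accuracy and F1 rather than accuracy alone. The sketch below shows what those two numbers mean, in plain Python and on made-up logits. It is an illustration only, not the `evaluate` implementation, and the helper names are hypothetical.

```python
def argmax_rows(logits):
    """Index of the largest logit in each row, i.e. the predicted class."""
    return [max(range(len(row)), key=row.__getitem__) for row in logits]


def accuracy_and_f1(predictions, references, positive=1):
    """Accuracy over all examples and F1 for the positive class."""
    pairs = list(zip(predictions, references))
    correct = sum(p == r for p, r in pairs)
    tp = sum(p == positive and r == positive for p, r in pairs)
    fp = sum(p == positive and r != positive for p, r in pairs)
    fn = sum(p != positive and r == positive for p, r in pairs)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return {"accuracy": correct / len(pairs), "f1": f1}


# Made-up logits for four sentence pairs; label 1 means "paraphrase"
logits = [[0.1, 0.9], [0.2, 0.8], [0.7, 0.3], [0.4, 0.6]]
labels = [1, 0, 0, 1]
scores = accuracy_and_f1(argmax_rows(logits), labels)
print(scores)
```

F1 only looks at the positive class, so a model that labels almost everything as a paraphrase can score a high F1 while its accuracy stays modest.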


In [None]:
import wandb
wandb.login()

wandb: Using wandb-core as the SDK backend.  Please refer to https://wandb.me/wandb-core for more information.
wandb: Appending key for api.wandb.ai to your netrc file: /root/.netrc


True

In [None]:
sweep_config = {
    'method': 'random',
    'metric': {
        'name': 'accuracy',  # Track accuracy as the metric for saving the best model
        'goal': 'maximize'
    },
    'parameters': {
        'optimizer': {
            'values': ['adamw_torch', 'adafactor', 'adamw_hf', 'adamw_8bit', 'sgd']
        },
        'learning_rate': {
            'distribution': 'log_uniform_values',
            'min': 1e-5,
            'max': 1e-3
        },
        'lr_scheduler': {
            'values': [
                'linear',
                'cosine',
                'constant',
                'constant_with_warmup',
                'polynomial'
            ]
        },
        'weight_decay': {
            'values': [0.0, 0.01, 0.001, 0.1]
        },
        'warmup_ratio': {
            'values': [0.05, 0.1, 0.15, 0.2]
        },
        'train_batch_size': {
            'values': [8, 16, 32, 64, 128]
        },
        'gradient_accumulation_steps': {
            'values': [2, 4, 8, 32, 64]
        },
        'lora_r': {
            'values': [8, 16, 32, 64, 128]
        },
        'lora_alpha': {
            'values': [4, 8, 16, 32, 64]
        },
        'lora_dropout': {
            'values': [0.05, 0.1, 0.2]
        }
    }
}
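
The `log_uniform_values` distribution used for `learning_rate` draws values whose logarithm is uniform between `min` and `max`, so each decade is equally likely. The sketch below illustrates that distribution with the standard library only. It does not reproduce wandb's own sampler, and `sample_log_uniform` is a hypothetical helper.

```python
import math
import random


def sample_log_uniform(low: float, high: float, rng: random.Random) -> float:
    """Draw a value whose logarithm is uniform on [log(low), log(high)]."""
    return math.exp(rng.uniform(math.log(low), math.log(high)))


rng = random.Random(0)  # fixed seed so the sketch is repeatable
samples = [sample_log_uniform(1e-5, 1e-3, rng) for _ in range(1000)]

# The geometric midpoint of the range is 1e-4, so about half the draws fall below it
share_below = sum(s < 1e-4 for s in samples) / len(samples)
print(f"min={min(samples):.2e} max={max(samples):.2e} share below 1e-4: {share_below:.2f}")
```

A plain uniform distribution over the same range would put about 90% of its draws above `1e-4`, which is why the log-uniform form is the usual choice for learning rates.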

In [None]:
sweep_id = wandb.sweep(sweep_config, project="TinyBert 101")

Create sweep with ID: pbit4fos
Sweep URL: https://wandb.ai/garima440-new-york-university/TinyBert%20101/sweeps/pbit4fos


Run the cells below to run the training! For the sake of the demo, we only run it for a few steps to showcase how to use this integration with existing tools in the HF ecosystem.

In [None]:
# Set a padding token for the tokenizer
if tokenizer.pad_token is None:
    tokenizer.add_special_tokens({'pad_token': '[PAD]'})  # Add a padding token

# 3. Preprocessing function for tokenization
def preprocess_function(examples):
    return tokenizer(
        examples["sentence1"],
        examples["sentence2"],
        truncation=True,
        padding="max_length",  # Ensure uniform input size
        max_length=512,       # Typical BERT max length
    )

# 4. Tokenize dataset
encoded_dataset = dataset.map(preprocess_function, batched=True)

# 5. Data collator
data_collator = transformers.DataCollatorWithPadding(tokenizer=tokenizer)

def train(config=None):
    # Initialize a new wandb run
    with wandb.init(config=config):
        # If called by wandb.agent, as below,
        # this config will be set by Sweep Controller
        config = wandb.config

        # 6. Define Trainer with TrainingArguments
        trainer = transformers.Trainer(
            model=model,
            train_dataset=encoded_dataset["train"],
            eval_dataset=encoded_dataset["validation"],
            tokenizer=tokenizer,
            data_collator=data_collator,
            args=transformers.TrainingArguments(
                per_device_train_batch_size=config.train_batch_size,  # BERT can handle larger batch sizes
                gradient_accumulation_steps=config.gradient_accumulation_steps,  # Adjust if GPU memory is limited
                warmup_ratio=config.warmup_ratio,
                max_steps=300,
                learning_rate=config.learning_rate,
                fp16=True,  # Enable mixed-precision if supported by your hardware
                logging_steps=50,
                evaluation_strategy="steps",  # Evaluate periodically
                output_dir="./outputs",
                save_steps=100,
                save_total_limit=2,  # Keep only the latest 2 checkpoints
                optim=config.optimizer,
                weight_decay=config.weight_decay,
                lr_scheduler_type=config.lr_scheduler,
            ),
            compute_metrics=compute_metrics,
        )

        # 7. Disable caching for training
        model.config.use_cache = False

        # 8. Train the model
        trainer.train()
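
One design choice in the cell above is worth a closer look: `padding="max_length"` with `max_length=512` pads every example to 512 tokens during preprocessing, so `DataCollatorWithPadding` has nothing left to pad. The sketch below compares the token counts of fixed and per-batch padding on hypothetical sentence-pair lengths. Both helpers and the lengths are made up for illustration.

```python
def padded_tokens_fixed(lengths, max_length):
    """Total tokens when every example is padded to max_length."""
    return len(lengths) * max_length


def padded_tokens_dynamic(lengths, batch_size):
    """Total tokens when each batch is padded to its own longest example."""
    total = 0
    for start in range(0, len(lengths), batch_size):
        batch = lengths[start:start + batch_size]
        total += len(batch) * max(batch)
    return total


# Hypothetical token counts for eight short sentence pairs
lengths = [38, 52, 45, 61, 40, 57, 49, 66]
fixed = padded_tokens_fixed(lengths, 512)
dynamic = padded_tokens_dynamic(lengths, batch_size=4)
print(f"fixed: {fixed} tokens, dynamic: {dynamic} tokens")
```

If the lengths in the real dataset look anything like these, dropping `padding="max_length"` from the tokenizer call and letting the collator pad each batch is one way to reduce the work per step.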



In [None]:
wandb.agent(sweep_id, train, count=1)

wandb: Agent Starting Run: m65nty3u with config:
wandb: 	gradient_accumulation_steps: 2
wandb: 	learning_rate: 0.0003916496770060197
wandb: 	lora_alpha: 8
wandb: 	lora_dropout: 0.2
wandb: 	lora_r: 16
wandb: 	lr_scheduler: polynomial
wandb: 	optimizer: adafactor
wandb: 	train_batch_size: 128
wandb: 	warmup_ratio: 0.2
wandb: 	weight_decay: 0



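| Step | Training Loss | Validation Loss | Accuracy | F1 |
|------|---------------|-----------------|----------|----|
| 50 | 0.6618 | 0.6526 | 0.683824 | 0.812227 |
| 100 | 0.6024 | 0.57295 | 0.681373 | 0.810496 |
| 150 | 0.5324 | 0.499376 | 0.789216 | 0.857616 |
| 200 | 0.4823 | 0.458111 | 0.813725 | 0.87541 |
| 250 | 0.4597 | 0.434768 | 0.821078 | 0.877311 |
| 300 | 0.4397 | 0.430333 | 0.823529 | 0.879195 |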


Run history:

| Metric | Trend |
|--------|-------|
| eval/accuracy | ▁▁▆███ |
| eval/f1 | ▁▁▆███ |
| eval/loss | █▅▃▂▁▁ |
| eval/runtime | ▂▂█▆▂▁ |
| eval/samples_per_second | ▇▇▁▃▇█ |
| eval/steps_per_second | ▇▇▁▃▇█ |
| train/epoch | ▁▁▂▂▄▄▅▅▇▇███ |
| train/global_step | ▁▁▂▂▄▄▅▅▇▇███ |
| train/grad_norm | ▃▁▁█▃▆ |
| train/learning_rate | ██▆▅▃▁ |

Run summary:

| Metric | Value |
|--------|-------|
| eval/accuracy | 0.82353 |
| eval/f1 | 0.87919 |
| eval/loss | 0.43033 |
| eval/runtime | 2.234 |
| eval/samples_per_second | 182.635 |
| eval/steps_per_second | 22.829 |
| total_flos | 1079050000465920.0 |
| train/epoch | 20.0 |
| train/global_step | 300.0 |
| train/grad_norm | 0.51951 |
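
The summary reports 20 epochs for 300 optimizer steps, which can be reconciled with the sampled batch size of 128 and 2 gradient accumulation steps. The sketch below assumes a single GPU, the 3668-example MRPC train split, and that a partial accumulation window at the end of an epoch still counts as one optimizer step. The Trainer's exact rounding may differ between versions.

```python
import math

train_examples = 3668  # size of the MRPC train split
batch_size = 128       # train_batch_size sampled by the sweep
grad_accum = 2         # gradient_accumulation_steps sampled by the sweep
max_steps = 300        # from TrainingArguments

batches_per_epoch = math.ceil(train_examples / batch_size)
updates_per_epoch = math.ceil(batches_per_epoch / grad_accum)
epochs = max_steps / updates_per_epoch

print(f"{batches_per_epoch} batches/epoch, {updates_per_epoch} updates/epoch, {epochs} epochs")
```

With an effective batch of 256 examples per update, 300 steps pass over this small dataset many times, so other sampled batch sizes will cover a very different number of epochs for the same `max_steps`.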


In [None]:
!pip install xformers



In [None]:
from transformers import AutoTokenizer, AutoModelForSequenceClassification
import torch
import time

# Load tokenizer and trained model
tokenizer = AutoTokenizer.from_pretrained("./outputs/checkpoint-300")  # Path to your saved model
model = AutoModelForSequenceClassification.from_pretrained("./outputs/checkpoint-300")
model.eval()  # Set to evaluation mode
model.config.use_cache = True

Some weights of BertForSequenceClassification were not initialized from the model checkpoint at huawei-noah/TinyBERT_General_4L_312D and are newly initialized: ['classifier.bias', 'classifier.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


In [None]:
from transformers.models.bert.modeling_bert import BertSelfAttention

# # Patch BERT with FlashAttention if supported
# def enable_flash_attention(self, hidden_states, attention_mask=None):
#     from xformers.ops import memory_efficient_attention

#     attention_output = memory_efficient_attention(
#         hidden_states,
#         hidden_states,
#         hidden_states,
#         attn_bias=attention_mask,
#     )
#     return attention_output

# BertSelfAttention.forward = enable_flash_attention

In [None]:
import torch.nn as nn
from xformers.ops import memory_efficient_attention

class FlashSelfAttention(nn.Module):
    def __init__(self, config, device):
        super().__init__()
        if config.hidden_size % config.num_attention_heads != 0:
            raise ValueError(
                f"The hidden size ({config.hidden_size}) is not a multiple of the number of attention heads ({config.num_attention_heads})"
            )
        self.num_attention_heads = config.num_attention_heads
        self.attention_head_size = config.hidden_size // config.num_attention_heads
        self.query = nn.Linear(config.hidden_size, config.hidden_size)
        self.key = nn.Linear(config.hidden_size, config.hidden_size)
        self.value = nn.Linear(config.hidden_size, config.hidden_size)
        self.dropout = nn.Dropout(config.attention_probs_dropout_prob)

    def forward(self, hidden_states, attention_mask=None, head_mask=None, encoder_hidden_states=None, encoder_attention_mask=None, past_key_value=None, output_attentions=False):  # Extra arguments mirror BertSelfAttention.forward
        # Project query, key, value
        q = self.query(hidden_states)
        k = self.key(hidden_states)
        v = self.value(hidden_states)

        # Split the hidden dimension into heads. xformers expects
        # [batch, seq, heads, head_dim], so no transpose is needed here.
        batch_size, seq_length, hidden_dim = hidden_states.size()
        shape = (batch_size, seq_length, self.num_attention_heads, self.attention_head_size)
        q, k, v = q.view(shape), k.view(shape), v.view(shape)

        # Assumption: BERT passes an additive mask that broadcasts to
        # [batch, heads, seq, seq] (e.g. [batch, 1, 1, seq]). Depending on the
        # xformers version, the bias may also need its last dimension padded
        # to a multiple of 8.
        attn_bias = None
        if attention_mask is not None:
            attn_bias = attention_mask.to(q.dtype).expand(
                batch_size, self.num_attention_heads, seq_length, seq_length
            )

        # Memory-efficient attention computation
        attention_output = memory_efficient_attention(q, k, v, attn_bias=attn_bias)

        # Merge the heads back into the hidden dimension (done exactly once)
        attention_output = attention_output.reshape(batch_size, seq_length, hidden_dim)
        return (attention_output, )  # You might need to adjust the return value based on the original forward method
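
`memory_efficient_attention` is meant to return the same result as standard scaled dot-product attention, `softmax(q kᵀ / sqrt(d) + bias) v`, only with a kernel that uses less memory. The sketch below is a tiny single-head reference in plain Python that can be used to check shapes and masking by hand. It is an illustration on nested lists and has nothing to do with how xformers implements the operation.

```python
import math


def softmax(row):
    """Numerically stable softmax over a list of floats."""
    m = max(row)
    exps = [math.exp(x - m) for x in row]
    total = sum(exps)
    return [e / total for e in exps]


def reference_attention(q, k, v, bias=None):
    """Single-head attention on nested lists: softmax(q k^T / sqrt(d) + bias) v."""
    d = len(q[0])
    out = []
    for i, qi in enumerate(q):
        # Scaled dot product of query i with every key
        scores = [sum(a * b for a, b in zip(qi, kj)) / math.sqrt(d) for kj in k]
        if bias is not None:
            scores = [s + b for s, b in zip(scores, bias[i])]
        weights = softmax(scores)
        # Weighted sum of the value vectors
        out.append([sum(w * vj[c] for w, vj in zip(weights, v)) for c in range(len(v[0]))])
    return out


q = k = [[1.0, 0.0], [0.0, 1.0]]
v = [[1.0, 2.0], [3.0, 4.0]]

# Additive mask that hides the second key from every query
mask_second_key = [[0.0, -1e9], [0.0, -1e9]]
masked = reference_attention(q, k, v, bias=mask_second_key)
print(masked)
```

With the second key masked out, every query can only attend to the first value vector, which is a quick way to confirm that an additive mask is applied along the key axis.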

In [None]:

import torch.nn as nn

# Function to replace the attention layers with FlashSelfAttention
def replace_attention_layers_with_flash(model, device):
    """
    Replace the attention layers in the model with FlashSelfAttention.

    Args:
        model (nn.Module): The model to modify.
        device (torch.device): Device the new attention modules are moved to.

    Returns:
        model (nn.Module): The model with replaced attention layers.
    """
    # Snapshot the module list so submodules can be swapped while looping
    for name, module in list(model.named_modules()):
        # BERT encoder layers expose their self-attention as `attention.self`
        if hasattr(module, 'attention') and hasattr(module.attention, 'self'):
            print(f"Replacing attention layer in module: {name}")
            flash_attention = FlashSelfAttention(model.config, device)

            # Copy the existing query/key/value weights; otherwise the new
            # projections would start from random initialization
            flash_attention.load_state_dict(module.attention.self.state_dict(), strict=False)

            # Assign the FlashSelfAttention instance to the module
            module.attention.self = flash_attention.to(device)

    return model

In [None]:
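# `device` must exist before the replacement is called
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
model = replace_attention_layers_with_flash(model, device)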

Replacing attention layer in module: bert.encoder.layer.0
Replacing attention layer in module: bert.encoder.layer.1
Replacing attention layer in module: bert.encoder.layer.2
Replacing attention layer in module: bert.encoder.layer.3


In [None]:
import time
import torch

device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
model.to(device)

def measure_latency(model, tokenizer, sentence1, sentence2, iterations=100):
    # Tokenize the sentence pair and move the tensors to the model's device
    inputs = tokenizer(sentence1, sentence2, return_tensors="pt", truncation=True, padding=True, max_length=512)
    inputs = inputs.to(device)

    # Cast only the attention mask to float16; the input ids must stay integer
    for key in ['attention_mask']:
        if key in inputs and isinstance(inputs[key], torch.Tensor):
            inputs[key] = inputs[key].type(torch.float16)

    # Warm-up runs so one-time setup costs are not timed
    with torch.no_grad():
        for _ in range(10):
            model(**inputs)

    # Wait for pending GPU work before starting and stopping the timer
    if device.type == "cuda":
        torch.cuda.synchronize()
    start_time = time.perf_counter()
    with torch.no_grad():
        for _ in range(iterations):
            model(**inputs)
    if device.type == "cuda":
        torch.cuda.synchronize()
    end_time = time.perf_counter()

    avg_latency = (end_time - start_time) / iterations
    print(f"Average Inference Latency: {avg_latency:.6f} seconds")

# Prepare example input (take a sample from the dataset)
sentence1 = dataset["test"][13]["sentence1"]
sentence2 = dataset["test"][13]["sentence2"]

measure_latency(model, tokenizer, sentence1, sentence2)

Average Inference Latency: 0.003916 seconds
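
A single average can hide a lot of run-to-run variation, so it can help to record per-call timings as well. The sketch below is a generic harness built on the standard library, with a hypothetical `benchmark` helper and a stand-in CPU workload. For a GPU model, the timed callable would also need to call `torch.cuda.synchronize()` so that queued kernels are included in each measurement.

```python
import statistics
import time


def benchmark(fn, warmup=10, iterations=100):
    """Time fn() per call after a warm-up phase and summarize the timings."""
    for _ in range(warmup):
        fn()
    timings = []
    for _ in range(iterations):
        start = time.perf_counter()
        fn()
        timings.append(time.perf_counter() - start)
    timings.sort()
    return {
        "mean": statistics.fmean(timings),
        "median": statistics.median(timings),
        "p95": timings[int(0.95 * (len(timings) - 1))],
    }


# Stand-in workload; replace with a closure around the model call
stats = benchmark(lambda: sum(i * i for i in range(10_000)))
print({name: f"{value * 1e3:.3f} ms" for name, value in stats.items()})
```

Comparing the median and the 95th percentile before and after swapping the attention layers gives a fairer picture than one mean taken on a single input.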
