<a href="https://colab.research.google.com/github/arkeodev/pytorch/blob/main/Quantization/qlora.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# QLoRA


## Introduction

Quantized Low-Rank Adaptation (QLoRA) reduces the memory footprint of fine-tuning large language models (LLMs) by quantizing the frozen base-model weights to 4-bit precision while preserving the performance of 16-bit fine-tuning. This makes it possible to fine-tune state-of-the-art models on consumer-grade hardware. QLoRA builds on LoRA for efficient fine-tuning and adds three algorithmic tricks: 4-bit NormalFloat (NF4) quantization, double quantization, and paged optimizers that prevent memory spikes.

## Main Steps

The main steps involved in QLoRA (Quantized Low-Rank Adaptation) can be summarized as follows:

### 1. Normalization

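- **Description:** The weights of the model are split into blocks, and each block is divided by its largest absolute value so that every weight falls within the range [-1, 1]. With all blocks on the same range, one shared set of quantization levels can be concentrated where weight values are most common.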

- **Purpose:** Ensures that the weights are distributed in a way that makes quantization more effective.
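The normalization step can be sketched in a few lines of plain Python. This is a minimal illustration, assuming block-wise absmax scaling; the helper name `absmax_normalize` and the tiny block size of 4 are hypothetical choices for readability, not part of any library.

```python
def absmax_normalize(weights, block_size=4):
    """Split a flat list of weights into blocks and scale each block into [-1, 1]."""
    normalized, scales = [], []
    for start in range(0, len(weights), block_size):
        block = weights[start:start + block_size]
        # The scaling factor is the largest magnitude in the block
        # (fall back to 1.0 for an all-zero block to avoid dividing by zero).
        scale = max(abs(w) for w in block) or 1.0
        scales.append(scale)
        normalized.extend(w / scale for w in block)
    return normalized, scales


weights = [0.5, -2.0, 1.0, 0.25, 0.1, -0.05, 0.2, 0.0]
normalized, scales = absmax_normalize(weights, block_size=4)

print(scales)      # [2.0, 0.2]
print(normalized)  # [0.25, -1.0, 0.5, 0.125, 0.5, -0.25, 1.0, 0.0]
```

Each block keeps one scaling factor, which is needed later to map the quantized values back to their original scale.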

### 2. Quantization

- **Description:** The normalized weights are quantized to a lower-precision format, specifically 4-bit in QLoRA. In the case of NF4 (4-bit NormalFloat), the 16 quantization levels are not evenly spaced: they are placed at quantiles of a normal distribution, so each level is expected to receive roughly the same number of weights.

- **Purpose:** Reduces the memory footprint of the model by representing weights with fewer bits.
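To make the idea of quantile-based levels concrete, here is a rough sketch that builds 16 levels from equal-probability quantiles of a standard normal distribution and maps a value to its nearest level. It is a simplified construction under stated assumptions: the helper names `quantile_levels` and `quantize` are hypothetical, and the actual NF4 code book used by bitsandbytes is constructed somewhat differently (for example, it contains an exact zero), so the numbers will not match it.

```python
from statistics import NormalDist


def quantile_levels(bits=4):
    """Build 2**bits levels at equal-probability quantiles of N(0, 1), rescaled to [-1, 1]."""
    n = 2 ** bits
    std_normal = NormalDist()
    raw = [std_normal.inv_cdf((i + 0.5) / n) for i in range(n)]
    largest = max(abs(v) for v in raw)
    return [v / largest for v in raw]


def quantize(x, levels):
    """Return the index of the level closest to x (x is assumed to lie in [-1, 1])."""
    return min(range(len(levels)), key=lambda i: abs(levels[i] - x))


levels = quantile_levels()
gaps = [b - a for a, b in zip(levels, levels[1:])]

print(len(levels))        # 16
print(gaps[0] > gaps[7])  # True: levels are denser near zero than in the tails
print(quantize(1.0, levels))  # 15
```

Because normally distributed weights cluster around zero, placing more levels there spends the 4-bit budget where most values actually fall.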

### 3. Double Quantization (DQ)

- **Description:** Applies a second round of quantization to the scaling factors (one per block of weights) for additional memory savings. Weights are quantized in blocks of 64, and the block scaling factors are quantized from 32-bit to 8-bit.

- **Purpose:** Further reduces memory usage by compressing the scaling factors.
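The saving from double quantization can be worked out as per-parameter overhead. The sketch below uses the block size of 64 from the description above; the second-level block size of 256 is the default reported in the QLoRA paper and should be treated as an assumption here.

```python
WEIGHT_BLOCK = 64     # weights that share one scaling factor
CONSTANT_BLOCK = 256  # scaling factors that share one second-level constant (assumed default)

# Without double quantization: one 32-bit scaling factor per 64 weights.
overhead_plain = 32 / WEIGHT_BLOCK

# With double quantization: one 8-bit scaling factor per 64 weights,
# plus one 32-bit second-level constant per 64 * 256 weights.
overhead_dq = 8 / WEIGHT_BLOCK + 32 / (WEIGHT_BLOCK * CONSTANT_BLOCK)

print(overhead_plain)                           # 0.5
print(round(overhead_dq, 3))                    # 0.127
print(round(overhead_plain - overhead_dq, 3))   # 0.373 bits saved per parameter
```

The overhead drops from 0.5 bits to about 0.127 bits per parameter, which adds up across billions of weights.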

### 4. Dequantization

- **Description:** During computation, the 4-bit quantized weights are dequantized to a higher-precision format, such as BFloat16. The 4-bit weights are dequantized on the GPU, and the matrix multiplication is then performed as a 16-bit floating-point operation.

  In other words, we use a low-precision storage data type (in our case 4-bit, but in principle interchangeable) and one normal precision computation data type. This is important because the latter defaults to 32-bit for hardware compatibility and numerical stability reasons, but should be set to the optimal BFloat16 for newer hardware supporting it to achieve the best performance.

- **Purpose:** Keeps storage at 4 bits while computations run at higher precision, which preserves numerical accuracy during training and inference.
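The split between a storage data type and a computation data type can be illustrated with plain Python. This is only a sketch: the helpers `pack_nibbles` and `dequantize` are hypothetical, the code book is a uniform placeholder rather than NF4, and ordinary Python floats stand in for BFloat16.

```python
def pack_nibbles(codes):
    """Pack 4-bit codes (integers 0..15) two per byte, high nibble first."""
    if len(codes) % 2:
        raise ValueError("expected an even number of codes")
    return bytes((hi << 4) | lo for hi, lo in zip(codes[0::2], codes[1::2]))


def dequantize(packed, levels, scale):
    """Unpack each byte into two codes, look up their levels, and rescale."""
    values = []
    for byte in packed:
        for code in (byte >> 4, byte & 0x0F):
            values.append(levels[code] * scale)
    return values


# Placeholder code book: 16 uniformly spaced levels in [-1, 1] (not the NF4 levels).
levels = [i / 7.5 - 1.0 for i in range(16)]

codes = [0, 15, 8, 7]                            # storage format: one 4-bit code per weight
packed = pack_nibbles(codes)                     # four weights occupy two bytes
weights = dequantize(packed, levels, scale=2.0)  # compute format: floating point

activations = [1.0, 1.0, 0.5, 0.5]
output = sum(w * a for w, a in zip(weights, activations))  # multiply in compute precision

print(len(packed))  # 2
print(weights[:2])  # [-2.0, 2.0]
```

The weights only exist in floating point for the duration of the multiplication; what stays in memory is the packed 4-bit representation plus the scaling factors.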

### 5. Paged Optimizers

- **Description:** Prevents memory spikes during gradient checkpointing from causing out-of-memory errors by paging optimizer states between GPU and CPU memory when the GPU runs short.

- **Purpose:** Ensures stable training without running out of memory.

### 6. Integration of Low-Rank Adapters (LoRA)

- **Description:** Low-rank adapters are inserted at every network layer to correct minimal residual quantization errors and facilitate efficient fine-tuning.

- **Purpose:** Allows the fine-tuning of the model to achieve performance comparable to full precision (16-bit) fine-tuning.
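A low-rank adapter adds a trainable update `B @ A` on top of a frozen weight matrix `W`, scaled by `alpha / r`. The sketch below shows that forward pass with plain Python lists; the helper names and the toy matrices are hypothetical and chosen only so the arithmetic is easy to follow.

```python
def matvec(matrix, vector):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(m * v for m, v in zip(row, vector)) for row in matrix]


def lora_forward(x, W, A, B, alpha, r):
    """Compute W x + (alpha / r) * B (A x); W is frozen, only A and B would be trained."""
    base = matvec(W, x)
    update = matvec(B, matvec(A, x))
    scaling = alpha / r
    return [b + scaling * u for b, u in zip(base, update)]


W = [[1.0, 0.0, 2.0], [0.0, 1.0, -1.0]]  # frozen weights, shape 2 x 3
A = [[0.5, 0.5, 0.0]]                    # down-projection, shape r x 3 with r = 1
B_zero = [[0.0], [0.0]]                  # up-projection initialized to zero, shape 2 x r
B = [[1.0], [-1.0]]                      # up-projection after some hypothetical training
x = [1.0, 2.0, 3.0]

print(lora_forward(x, W, A, B_zero, alpha=2, r=1))  # [7.0, -1.0]
print(lora_forward(x, W, A, B, alpha=2, r=1))       # [10.0, -4.0]
```

With `B` initialized to zero the adapter leaves the base model's output unchanged, so training starts from the pre-trained behavior and only learns a small correction.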

### 7. Fine-Tuning

- **Description:** The model is fine-tuned with backpropagation through the frozen, quantized weights. LoRA layers are updated during this process.

- **Purpose:** Customizes the pre-trained model for specific tasks while maintaining a low memory footprint.

## Detailed Algorithmic Tricks in QLoRA


- **4-bit NormalFloat (NF4) Quantization:** Exploits the normal distribution of model weights and assigns an equal number of weights to each quantization bin to enhance information density.

- **Double Quantization:** Applies a second layer of quantization to the scaling factors for additional memory savings.

- **Dequantization:** During computation, the 4-bit quantized weights are dequantized to a higher precision format, such as BFloat16.

- **Paged Optimizers:** Manages memory during gradient checkpointing to prevent out-of-memory errors.

The `BitsAndBytesConfig` below shows how these options are set in a typical implementation.

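```python
import torch
from transformers import BitsAndBytesConfig

bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,                      # Store the base-model weights in 4-bit precision
    bnb_4bit_use_double_quant=True,         # Double Quantization of the scaling factors
    bnb_4bit_quant_type="nf4",              # 4-bit NormalFloat (NF4) data type
    bnb_4bit_compute_dtype=torch.bfloat16,  # Dequantize to BFloat16 for computation
)
```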

## A Sample Example of NF4 and Double Quantization

```python
import torch

# Simplified stand-in for NF4 quantization. Real NF4 uses 16 levels placed at
# quantiles of a normal distribution; this demo rounds to a uniform grid instead.
def quantize_to_nf4(weights):
    # Absmax normalization: dividing by the largest magnitude maps weights into [-1, 1]
    max_val = torch.max(torch.abs(weights))
    # Round to a uniform grid with step 1/7.5, approximating a 4-bit range.
    # The result stays in normalized units; multiply by max_val to rescale.
    nf4_weights = torch.round(weights / max_val * 7.5) / 7.5
    return nf4_weights, max_val

# Double quantization: quantize the scaling factors themselves
def double_quantization(scaling_factors):
    # Normalize by the largest magnitude so that values fall in [-1, 1]
    max_val = torch.max(torch.abs(scaling_factors))
    # Round to a uniform grid with step 1/127, approximating a signed 8-bit range
    dq_factors = torch.round(scaling_factors / max_val * 127) / 127
    return dq_factors, max_val

# Example model weights (original weights are of type float32)
weights = torch.randn(8, 8)
print("Original Weights (float32):", weights)

# Quantize weights with the simplified NF4 stand-in
nf4_weights, max_val = quantize_to_nf4(weights)
print("Quantized NF4 Weights:", nf4_weights)
print("Max Value for NF4:", max_val)

# Example scaling factors (random values for demonstration)
scaling_factors = torch.randn(256)
print("Original Scaling Factors:", scaling_factors)

# Apply double quantization
dq_factors, dq_max_val = double_quantization(scaling_factors)
print("Double Quantized Factors:", dq_factors)
print("Max Value for Double Quantization:", dq_max_val)
```

```
Original Weights (float32): tensor([[ 1.3545, -0.0107, -0.6543,  0.5667,  0.6908,  0.4894, -0.2916, -0.1761],
        [-1.0498,  1.2755,  0.0163, -0.4651, -0.0133, -0.5115,  0.7723,  0.0408],
        [ 0.9752,  0.4867, -0.6149, -0.3338,  1.2824,  0.1431,  0.5395, -0.9338],
        [-0.2901,  0.8183,  0.0932, -1.2195,  0.7191, -1.5129,  1.1211,  0.7191],
        [-0.9321, -1.0891,  0.8873,  1.1584,  0.0864,  0.7182,  1.6358, -0.0203],
        [ 1.1615, -0.7422, -0.3044, -2.1620,  0.0750,  1.5512, -1.6954,  1.1042],
        [-0.2643,  0.9360, -0.8949,  0.1208, -1.2187, -2.0920,  0.5329,  0.7675],
        [ 2.2764,  1.8049, -0.0221,  0.5943, -0.2063,  1.1842,  0.3584,  1.2400]])
Quantized NF4 Weights: tensor([[ 0.5333, -0.0000, -0.2667,  0.2667,  0.2667,  0.2667, -0.1333, -0.1333],
        [-0.4000,  0.5333,  0.0000, -0.2667, -0.0000, -0.2667,  0.4000,  0.0000],
        [ 0.4000,  0.2667, -0.2667, -0.1333,  0.5333,  0.0000,  0.2667, -0.4000],
        [-0.1333,  0.4000,  0.0000, -0.5333,
```

## Implementation

The implementation below is taken from the Hugging Face and bitsandbytes QLoRA sample notebook: [`transformers` meets `bitsandbytes` for democratizing Large Language Models (LLMs) through 4bit quantization](https://colab.research.google.com/drive/1VoYNfYDKcKRQRor98Zbf2-9VQTtGJ24k?usp=sharing).

```python
# Notebook cell: the leading "!" runs each command in a shell
!pip install -q -U bitsandbytes
!pip install -q -U git+https://github.com/huggingface/transformers.git
!pip install -q -U git+https://github.com/huggingface/peft.git
!pip install -q -U git+https://github.com/huggingface/accelerate.git
!pip install -q datasets
```

First, let's load the model we are going to use: GPT-NeoX-20B. Note that the model itself takes around 40 GB in half precision.

```python
import torch
from transformers import AutoTokenizer, AutoModelForCausalLM, BitsAndBytesConfig

model_id = "EleutherAI/gpt-neox-20b"

# 4-bit NF4 storage, double quantization, BFloat16 compute
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_use_double_quant=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.bfloat16
)

tokenizer = AutoTokenizer.from_pretrained(model_id)
# device_map={"": 0} places the entire model on GPU 0
model = AutoModelForCausalLM.from_pretrained(model_id, quantization_config=bnb_config, device_map={"":0})
```


Next, we have to apply some preprocessing to the model to prepare it for training. For that, use the `prepare_model_for_kbit_training` function from PEFT.

```python
from peft import prepare_model_for_kbit_training

model.gradient_checkpointing_enable()
model = prepare_model_for_kbit_training(model)
```

```python
def print_trainable_parameters(model):
    """
    Prints the number of trainable parameters in the model.
    """
    trainable_params = 0
    all_param = 0
    for _, param in model.named_parameters():
        all_param += param.numel()
        if param.requires_grad:
            trainable_params += param.numel()
    print(
        f"trainable params: {trainable_params} || all params: {all_param} || trainable%: {100 * trainable_params / all_param}"
    )
```

```python
from peft import LoraConfig, get_peft_model

config = LoraConfig(
    r=8,                                 # rank of the low-rank update matrices
    lora_alpha=32,                       # scaling numerator (update is scaled by lora_alpha / r)
    target_modules=["query_key_value"],  # attach adapters to the fused attention projection
    lora_dropout=0.05,
    bias="none",
    task_type="CAUSAL_LM"
)

model = get_peft_model(model, config)
print_trainable_parameters(model)
```

```
trainable params: 8650752 || all params: 10597552128 || trainable%: 0.08162971878329976
```
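The trainable-parameter count reported above can be reproduced by hand. The sketch below assumes the published GPT-NeoX-20B configuration (hidden size 6144, 44 layers) and that `query_key_value` maps the hidden size to three times the hidden size; treat these values as assumptions if you swap in a different model.

```python
hidden_size = 6144  # assumed from the GPT-NeoX-20B configuration
num_layers = 44     # assumed from the GPT-NeoX-20B configuration
r = 8               # LoRA rank from the LoraConfig above

in_features = hidden_size
out_features = 3 * hidden_size  # fused query, key, and value projections

# Each adapted layer adds A (r x in_features) and B (out_features x r)
params_per_layer = r * in_features + out_features * r
trainable_params = params_per_layer * num_layers

all_params = 10597552128  # "all params" as reported by print_trainable_parameters

print(params_per_layer)                                # 196608
print(trainable_params)                                # 8650752
print(round(100 * trainable_params / all_params, 4))   # 0.0816
```

Only the adapter matrices are trained, which is why the trainable share stays below 0.1% of the reported parameter count.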


Let's load a common dataset, English quotes, to fine-tune our model on famous quotes.

```python
from datasets import load_dataset

data = load_dataset("Abirate/english_quotes")
data = data.map(lambda samples: tokenizer(samples["quote"]), batched=True)
```


Run the cell below to start training. For the sake of the demo, we run it for only a few steps to showcase how this integration works with existing tools in the HF ecosystem.

```python
import transformers

# needed for gpt-neo-x tokenizer
tokenizer.pad_token = tokenizer.eos_token

trainer = transformers.Trainer(
    model=model,
    train_dataset=data["train"],
    args=transformers.TrainingArguments(
        per_device_train_batch_size=1,
        gradient_accumulation_steps=4,  # 4 samples per optimizer step on one device
        warmup_steps=2,
        max_steps=10,                   # short demo run
        learning_rate=2e-4,
        fp16=True,
        logging_steps=1,
        output_dir="outputs",
        optim="paged_adamw_8bit"        # paged 8-bit AdamW optimizer
    ),
    data_collator=transformers.DataCollatorForLanguageModeling(tokenizer, mlm=False),
)
model.config.use_cache = False  # silence the warnings. Please re-enable for inference!
trainer.train()
```

max_steps is given, it will override any value given in num_train_epochs


| Step | Training Loss |
|------|---------------|
| 1 | 2.7152 |
| 2 | 2.1389 |
| 3 | 2.5453 |
| 4 | 1.6945 |
| 5 | 1.7935 |
| 6 | 2.6403 |
| 7 | 2.4984 |
| 8 | 1.6919 |
| 9 | 2.6234 |
| 10 | 2.1027 |


TrainOutput(global_step=10, training_loss=2.244428849220276, metrics={'train_runtime': 172.6524, 'train_samples_per_second': 0.232, 'train_steps_per_second': 0.058, 'total_flos': 167211775033344.0, 'train_loss': 2.244428849220276, 'epoch': 0.01594896331738437})
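The metrics in `TrainOutput` follow directly from the training arguments. As a quick check, the sketch below recomputes them, assuming a single GPU and the train split size of 2508 examples that the dataset loader reported in the original run.

```python
per_device_train_batch_size = 1
gradient_accumulation_steps = 4
max_steps = 10
train_examples = 2508     # train split size reported when the dataset was loaded
train_runtime = 172.6524  # seconds, from TrainOutput

# Samples consumed per optimizer step (assumes a single device)
samples_per_step = per_device_train_batch_size * gradient_accumulation_steps
samples_seen = samples_per_step * max_steps
epoch_fraction = samples_seen / train_examples

print(samples_seen)                            # 40
print(round(epoch_fraction, 6))                # 0.015949
print(round(samples_seen / train_runtime, 3))  # 0.232
print(round(max_steps / train_runtime, 3))     # 0.058
```

The run therefore covers only about 1.6% of one epoch, which is enough to demonstrate the workflow but far too short to expect a meaningful change in model quality.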

```python
import warnings
warnings.filterwarnings("ignore", category=UserWarning, module="torch.utils.checkpoint")

def generate_text(prompt, model, tokenizer, max_length=50):
    """Generate a continuation of `prompt`; max_length counts prompt tokens too."""
    inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
    # Gradient checkpointing is still enabled on the model, so the KV cache stays off
    model.config.use_cache = False
    outputs = model.generate(**inputs, max_length=max_length, pad_token_id=tokenizer.eos_token_id)
    return tokenizer.decode(outputs[0], skip_special_tokens=True)

# Example prompts for evaluation
prompts = [
    "The greatest glory in living lies not in never falling,",
    "The way to get started is to quit talking and begin doing.",
    "Your time is limited, so don't waste it living someone else's life."
]

for prompt in prompts:
    print(f"Prompt: {prompt}")
    print(f"Generated: {generate_text(prompt, model, tokenizer)}\n")
```



`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`...


Prompt: The greatest glory in living lies not in never falling,



Generated: The greatest glory in living lies not in never falling, but in rising every time we fall.”

– Ralph Waldo Emerson

“The only thing that stands between you and your goal is the bullshit you tell yourself.”

– Unknown

Prompt: The way to get started is to quit talking and begin doing.



Generated: The way to get started is to quit talking and begin doing.

The way to get started is to quit talking and begin doing.

The way to get started is to quit talking and begin doing.

The way to get started

Prompt: Your time is limited, so don't waste it living someone else's life.



Generated: Your time is limited, so don't waste it living someone else's life. Don't be trapped by dogma — which is living with the results of other people's thinking. Don't let the noise of others' opinions drown out your own



## Resources and References

- The implementation in this notebook is taken from the Hugging Face and bitsandbytes QLoRA sample notebook: [`transformers` meets `bitsandbytes` for democratizing Large Language Models (LLMs) through 4bit quantization](https://colab.research.google.com/drive/1VoYNfYDKcKRQRor98Zbf2-9VQTtGJ24k?usp=sharing).

    That notebook shows how to fine-tune a 4-bit model on a downstream task using the Hugging Face ecosystem. It shows that it is possible to fine-tune GPT-NeoX-20B on a Google Colab instance!

- In the paper [**LLM.int8(): 8-bit Matrix Multiplication for Transformers at Scale**](https://arxiv.org/abs/2208.07339), Tim Dettmers, Mike Lewis, Younes Belkada, and Luke Zettlemoyer developed a procedure for Int8 matrix multiplication in transformers that halves the GPU memory needed for inference while maintaining full-precision performance. Their method, LLM.int8(), allows a 175B-parameter model to be loaded, converted to Int8, and used immediately without performance degradation, making large models more accessible on consumer GPUs.

    **Authors and Research Affiliations:**

    - Tim Dettmers: University of Washington
    - Mike Lewis: Facebook AI Research
    - Younes Belkada: Hugging Face, ENS Paris-Saclay
    - Luke Zettlemoyer: University of Washington, Facebook AI Research
    

- Please also read the [introduction blog post](https://huggingface.co/blog/hf-bitsandbytes-integration) to learn more about the data types and the theory behind the implementation above.

- [In the general usage notebook](https://colab.research.google.com/drive/1ge2F1QSK8Q7h0hn3YKuBCOAS0bK8E0wf?usp=sharing), you can learn how to properly load a model in 4-bit with all its variants.

- [Finetune LLMs on your own consumer hardware using tools from PyTorch and Hugging Face ecosystem](https://pytorch.org/blog/finetune-llms/)

- [Which Quantization Method is Right for You? (GPTQ vs. GGUF vs. AWQ)](https://www.maartengrootendorst.com/blog/quantization/)