# LOFTQ: Enhancing Performance in Quantized Low-Rank Adaptation for Large Language Models

Source: [Arxiv Paper](https://arxiv.org/abs/2310.08659)
## Overview
LOFTQ addresses the performance degradation that occurs when quantization is combined with Low-Rank Adaptation (LoRA) in Large Language Models (LLMs). Low-bit quantization (e.g., 3-bit) introduces a discrepancy between the quantized weights and the original pre-trained weights, which gives LoRA fine-tuning a poor starting point and leads to sub-optimal performance on downstream tasks. LOFTQ (LoRA-Fine-Tuning-Aware Quantization) mitigates this by jointly optimizing the quantization and the low-rank approximation.

## Key Features

### LoRA-Aware Quantization
LOFTQ enhances model performance by integrating the quantization process with LoRA fine-tuning. This is achieved by minimizing the difference between the original pre-trained weights and the sum of the quantized weights and the low-rank approximation. By considering LoRA during quantization, LOFTQ ensures a better initialization point, leading to improved downstream performance.
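
In the notation of the LoftQ paper (these symbols are not used elsewhere in this notebook), let $W \in \mathbb{R}^{d_1 \times d_2}$ be a pre-trained weight matrix, $Q$ its $N$-bit quantized counterpart, and $A \in \mathbb{R}^{d_1 \times r}$, $B \in \mathbb{R}^{d_2 \times r}$ the rank-$r$ LoRA factors. The objective described above can then be written as:

```latex
\min_{Q,\,A,\,B} \; \left\lVert W - Q - A B^{\top} \right\rVert_{F}
```

For comparison, the usual quantize-then-attach-LoRA initialization corresponds to $Q = q_N(W)$ and $A B^{\top} = 0$, which leaves the full quantization error $W - Q$ uncorrected when fine-tuning starts.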

### Alternating Optimization Algorithm
LOFTQ employs an alternating optimization algorithm to effectively solve the joint optimization problem. The algorithm consists of two main steps:
1. **Quantization**: Quantizes the difference between the original weights and the current low-rank approximation.
2. **SVD**: Applies Singular Value Decomposition (SVD) to obtain a low-rank approximation of the quantization residual.

This iterative process refines both the quantized weights and the low-rank approximation, progressively aligning them more closely with the original pre-trained weights.
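
The two steps can be sketched in a few lines of NumPy. This is a minimal illustration, not the PEFT implementation: `uniform_quantize` is a toy symmetric uniform quantizer standing in for the real NF4/FP4 data types, and the function names, matrix size, rank, and step count are all assumptions chosen for the demo.

```python
import numpy as np

def uniform_quantize(x, bits=4):
    """Toy symmetric uniform quantizer (a stand-in for NF4/FP4, not the real thing)."""
    levels = 2 ** (bits - 1) - 1              # e.g. 7 positive levels for 4 bits
    max_abs = np.abs(x).max()
    if max_abs == 0:
        return x.copy()
    scale = max_abs / levels
    return np.round(x / scale).clip(-levels, levels) * scale

def loftq_init(W, rank=8, bits=4, steps=5):
    """Alternate between quantizing W - A @ B.T and a rank-`rank` SVD of W - Q."""
    A = np.zeros((W.shape[0], rank))
    B = np.zeros((W.shape[1], rank))
    for _ in range(steps):
        # Step 1: quantize the weights minus the current low-rank approximation
        Q = uniform_quantize(W - A @ B.T, bits)
        # Step 2: truncated SVD of the quantization residual
        U, S, Vt = np.linalg.svd(W - Q, full_matrices=False)
        A = U[:, :rank] * np.sqrt(S[:rank])
        B = Vt[:rank].T * np.sqrt(S[:rank])
    return Q, A, B

rng = np.random.default_rng(0)
W = rng.normal(size=(64, 48))                 # random stand-in for a weight matrix

Q_plain = uniform_quantize(W)
Q, A, B = loftq_init(W)

err_plain = np.linalg.norm(W - Q_plain)
err_loftq = np.linalg.norm(W - Q - A @ B.T)
print(f"quantization only:       {err_plain:.4f}")
print(f"quantization + low-rank: {err_loftq:.4f}")
```

The second number should come out smaller than the first: the low-rank factors absorb part of the quantization error, which is the improved starting point that LoRA fine-tuning then builds on.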

## Benefits
- **Improved Initialization**: Joint optimization of quantization and LoRA ensures more accurate initialization for fine-tuning, enhancing downstream task performance.
- **Enhanced Performance**: Mitigates performance degradation typically seen with low-bit quantization in combination with LoRA.
- **Iterative Refinement**: Continuous improvement through alternating optimization ensures the quantized model remains close to the original pre-trained model.

## How It Works
LOFTQ's alternating optimization process alternates between quantizing the residuals and applying SVD, refining the model iteratively:
1. **Quantize the residuals**: Quantize the difference between the original pre-trained weights and the current approximation.
2. **Apply SVD**: Use SVD to update the low-rank approximation.

The two steps repeat for a chosen number of iterations, after which the quantized weights plus the low-rank approximation sit closer to the original pre-trained weights than the quantized weights alone would.
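
Written out, one round of the two steps takes the following form. The notation is borrowed from the LoftQ paper rather than from this notebook: $W$ is the pre-trained weight matrix, $q_N(\cdot)$ the $N$-bit quantization function, $t$ the iteration index, $r$ the LoRA rank, and the low-rank term starts at $A_0 B_0^{\top} = 0$.

```latex
\begin{aligned}
Q_t &= q_N\!\left(W - A_{t-1} B_{t-1}^{\top}\right), \\
W - Q_t &= \sum_{i} \sigma_{t,i}\, u_{t,i}\, v_{t,i}^{\top}
  \quad \text{(singular value decomposition)}, \\
A_t &= \left[\sqrt{\sigma_{t,1}}\,u_{t,1}, \dots, \sqrt{\sigma_{t,r}}\,u_{t,r}\right], \qquad
B_t = \left[\sqrt{\sigma_{t,1}}\,v_{t,1}, \dots, \sqrt{\sigma_{t,r}}\,v_{t,r}\right].
\end{aligned}
```

Keeping only the $r$ largest singular values makes $A_t B_t^{\top}$ the best rank-$r$ approximation of the residual $W - Q_t$ in Frobenius norm.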

## Conclusion
LOFTQ offers a practical remedy for the performance degradation seen in quantized low-rank adaptation of LLMs. By choosing the quantized weights and the LoRA initialization jointly, it reduces the quantization discrepancy that fine-tuning would otherwise start from, which in turn improves performance on downstream tasks. This makes LOFTQ a useful tool for memory-efficient fine-tuning of large language models.

---

For more details on the implementation and usage, please refer to the [documentation](https://huggingface.co/docs/peft/en/developer_guides/lora).

## Install the libraries in colab


In [None]:
!pip install -q datasets
!pip install -q bitsandbytes accelerate
!pip install -q trl peft
# !pip install -q flash-attn --no-build-isolation     # not supported on the free Colab GPU as of 24 April 2024


In [None]:
# import for ignoring warning
import warnings

warnings.filterwarnings("ignore")

## Load datasets

In [None]:
from datasets import load_dataset, DatasetDict

raw_datasets = load_dataset("HuggingFaceH4/ultrachat_200k")
raw_datasets

DatasetDict({
    train_sft: Dataset({
        features: ['prompt', 'prompt_id', 'messages'],
        num_rows: 207865
    })
    test_sft: Dataset({
        features: ['prompt', 'prompt_id', 'messages'],
        num_rows: 23110
    })
    train_gen: Dataset({
        features: ['prompt', 'prompt_id', 'messages'],
        num_rows: 256032
    })
    test_gen: Dataset({
        features: ['prompt', 'prompt_id', 'messages'],
        num_rows: 28304
    })
})

In [None]:
# build a dataset containing only a train and a test split
# keep just 100 examples per split for this demo
dataset = DatasetDict({
    "train": raw_datasets["train_sft"].shuffle(seed=1000).select(range(100)),
    "test": raw_datasets["test_sft"].shuffle(seed=1000).select(range(100))
})
dataset

DatasetDict({
    train: Dataset({
        features: ['prompt', 'prompt_id', 'messages'],
        num_rows: 100
    })
    test: Dataset({
        features: ['prompt', 'prompt_id', 'messages'],
        num_rows: 100
    })
})

In [None]:
print("___________________prompt___________________\n")
print(dataset["train"][0]["prompt"][:250])

print("\n___________________prompt_id_____________________\n")
print(dataset["train"][0]["prompt_id"])

print("\n____________________messages___________________")
print(dataset["train"][0]["messages"])

___________________prompt___________________

Here is a piece of text: Lifting off from Cape Canaveral on the 60th anniversary of the launch of Explorer 1, the first U.S. satellite, the commercial booster rumbled into a clear late afternoon sky a day after SpaceX scrubbed a countdown to replace 

___________________prompt_id_____________________

30f9ad40556b1b8fce966ab3c734fc1daa15a3791e0f7ec637eee697155410d2

____________________messages___________________
[{'content': 'Here is a piece of text: Lifting off from Cape Canaveral on the 60th anniversary of the launch of Explorer 1, the first U.S. Satellite, the commercial booster rumbled into a clear late afternoon sky a day after SpaceX scrubbed a countdown to replace a sensor on the Falcon 9’s second stage.\nThe 229-foot-tall (70-meter) Falcon 9 rocket fired nine Merlin 1D main engines and climbed away from Cape Canaveral’s Complex 40 launch pad at 4:25 p.m. EST (2125 GMT), launching a few miles from the site of Explorer 1’s historic b

In [None]:
dataset["train"][0]["messages"][0].keys()

dict_keys(['content', 'role'])

In [None]:
# print each message separately, since `messages` is a list of dicts
for message in dataset["train"][0]["messages"]:
  print("***********ROLE***********")
  print(message["role"])
  print("\n***********CONTENT*************\n")
  print(message["content"][:250])
  print("-"*100)

***********ROLE***********
user

***********CONTENT*************

Here is a piece of text: Lifting off from Cape Canaveral on the 60th anniversary of the launch of Explorer 1, the first U.S. Satellite, the commercial booster rumbled into a clear late afternoon sky a day after SpaceX scrubbed a countdown to replace 
----------------------------------------------------------------------------------------------------
***********ROLE***********
assistant

***********CONTENT*************

The GovSat 1 communications satellite was successfully launched by a SpaceX Falcon 9 rocket from Cape Canaveral on January 31, 2018. Owned by GovSat, a public-private joint venture between SES and the government of Luxembourg, the satellite will offe
----------------------------------------------------------------------------------------------------
***********ROLE***********
user

***********CONTENT*************

Can you provide more details about the Falcon 9 rocket's launch and the specific trajectory i

### Load Tokenizer

In [None]:
from transformers import AutoTokenizer

model_id = "microsoft/Phi-3-mini-128k-instruct"

tokenizer = AutoTokenizer.from_pretrained(model_id)

# set pad_token_id to eos_token_id if not set
if tokenizer.pad_token_id is None:
  tokenizer.pad_token_id = tokenizer.eos_token_id

# set a reasonable max_length for models that do not define one
if tokenizer.model_max_length > 100_000:
  # the change is not visible when printing the tokenizer object,
  # but the attribute itself is updated
  tokenizer.model_max_length = 2_048

# set chat template if not already available
if not tokenizer.chat_template:
  print("Setting a default chat template since tokenizer has none")
  DEFAULT_CHAT_TEMPLATE = "{% for message in messages %}\n{% if message['role'] == 'user' %}\n{{ '<|user|>\n' + message['content'] + eos_token }}\n{% elif message['role'] == 'system' %}\n{{ '<|system|>\n' + message['content'] + eos_token }}\n{% elif message['role'] == 'assistant' %}\n{{ '<|assistant|>\n'  + message['content'] + eos_token }}\n{% endif %}\n{% if loop.last and add_generation_prompt %}\n{{ '<|assistant|>' }}\n{% endif %}\n{% endfor %}"
  tokenizer.chat_template = DEFAULT_CHAT_TEMPLATE

print(tokenizer.chat_template)

Special tokens have been added in the vocabulary, make sure the associated word embeddings are fine-tuned or trained.


{{ bos_token }}{% for message in messages %}{% if (message['role'] == 'user') %}{{'<|user|>' + '
' + message['content'] + '<|end|>' + '
' + '<|assistant|>' + '
'}}{% elif (message['role'] == 'assistant') %}{{message['content'] + '<|end|>' + '
'}}{% endif %}{% endfor %}


## Apply chat templates

More about chat templates: [Hugging Face Docs](https://huggingface.co/docs/transformers/main/en/chat_templating)

In [None]:
# see what the prompt looks like after applying the chat template
print(tokenizer.apply_chat_template(dataset["train"][0]["messages"], tokenize=False)[:200])

<s><|user|>
Here is a piece of text: Lifting off from Cape Canaveral on the 60th anniversary of the launch of Explorer 1, the first U.S. Satellite, the commercial booster rumbled into a clear late aft


In [None]:
import re
import random
from multiprocessing import cpu_count

def apply_chat_template(example, tokenizer):
  messages = example["messages"]
  # we add empty system message if there is none
  if messages[0]["role"] != "system":
    messages.insert(0, {"role": "system", "content": ""})
  example["text"] = tokenizer.apply_chat_template(messages, tokenize=False)

  return example

column_names = list(dataset["train"].features)

# apply template formatting before fine-tuning
dataset = dataset.map(apply_chat_template,
                      num_proc=cpu_count(),  # number of parallel worker processes
                      fn_kwargs={"tokenizer": tokenizer}, # map passes only the example, so the tokenizer is supplied via fn_kwargs
                      remove_columns=column_names,
                      desc="Applying the chat template"
                      )

# displaying the sample after applying the template format:
print(dataset["train"]["text"][0][:250])

<s><|user|>
Here is a piece of text: Lifting off from Cape Canaveral on the 60th anniversary of the launch of Explorer 1, the first U.S. Satellite, the commercial booster rumbled into a clear late afternoon sky a day after SpaceX scrubbed a countdown


## Define the quantization `config` for base model



In [None]:
from transformers import BitsAndBytesConfig
import torch

# quantization config
quantization_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="fp4",
    bnb_4bit_compute_dtype="bfloat16",
)

device_map = "auto" #{"": torch.cuda.current_device()} if torch.cuda.is_available() else None

# model arguments
model_kwargs = dict(
    # attn_implementation="flash_attention_2", # uncomment if the GPU supports it
    torch_dtype="auto",
    use_cache=False, # set to False because we use gradient checkpointing
    device_map=device_map,
    quantization_config=quantization_config,
    trust_remote_code=True
)
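
As a rough sanity check on what 4-bit loading buys, the weight-only footprint can be estimated from the checkpoint size. The sketch below assumes the two safetensors shards of this model (about 4.97 GB and 2.67 GB) store 2 bytes per parameter; it ignores quantization constants, layers that stay unquantized, activations, and optimizer state, so real usage will be higher. The helper name is hypothetical.

```python
def estimate_weight_memory_gb(num_params, bits_per_param):
    """Rough weight-only footprint in GB (1 GB = 1e9 bytes)."""
    return num_params * bits_per_param / 8 / 1e9

checkpoint_gb = 4.97 + 2.67               # size of the two safetensors shards
num_params = checkpoint_gb * 1e9 / 2      # assumption: 2 bytes per parameter on disk

mem_16bit = estimate_weight_memory_gb(num_params, 16)
mem_4bit = estimate_weight_memory_gb(num_params, 4)

print(f"parameters ~ {num_params / 1e9:.2f} B")
print(f"16-bit     ~ {mem_16bit:.2f} GB")
print(f"4-bit      ~ {mem_4bit:.2f} GB")
```

Under these assumptions the weights shrink to roughly a quarter of their 16-bit size, which is what makes the model fit comfortably on a free Colab GPU.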

### Load the `LoRA` config (needed before replacing the LoRA weights with `LoftQ`)

In [None]:
from transformers import AutoModelForCausalLM
from peft import LoraConfig

# Configuration based on LoraConfig
# note: don't pass init_lora_weights="loftq" or loftq_config!
peft_config = LoraConfig(
    r=64,
    lora_alpha=16,
    lora_dropout=0.1,
    bias="none",
    task_type="CAUSAL_LM",
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj"]
)

model = AutoModelForCausalLM.from_pretrained(model_id, **model_kwargs)

In [None]:
# see this in hugging face github: https://github.com/huggingface/peft/blob/main/docs/source/developer_guides/lora.md
from peft import replace_lora_weights_loftq, get_peft_model

# replace LoRA weights with Loftq
peft_model = get_peft_model(model, peft_config)
replace_lora_weights_loftq(peft_model)
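
`target_modules` only affects layers whose names actually match. Some architectures fuse the attention projections into a single layer (for example under a name such as `qkv_proj`), in which case `q_proj`, `k_proj`, and `v_proj` would match nothing and LoRA would be attached to fewer layers than intended. The helper below is a hypothetical convenience, not part of PEFT; it approximates name matching by comparing the last component of each module name, and the module names in the demo are made up for illustration.

```python
from collections import defaultdict

def matched_targets(module_names, target_modules):
    """Group module names by the target they match (last name component only)."""
    hits = defaultdict(list)
    for name in module_names:
        leaf = name.rsplit(".", 1)[-1]
        if leaf in target_modules:
            hits[leaf].append(name)
    return {target: hits.get(target, []) for target in target_modules}

# Toy module names for illustration; real names come from model.named_modules()
toy_names = [
    "model.layers.0.self_attn.qkv_proj",
    "model.layers.0.self_attn.o_proj",
    "model.layers.0.mlp.down_proj",
]
report = matched_targets(toy_names, ["q_proj", "k_proj", "v_proj", "o_proj"])
for target, names in report.items():
    print(f"{target}: {len(names)} match(es)")
```

In the notebook, the same check could be run on the base model before it is wrapped, with `matched_targets([name for name, _ in model.named_modules()], peft_config.target_modules)`.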

## Training Arguments

In [None]:
from trl import SFTTrainer
from transformers import TrainingArguments

# Save directory for logs and model checkpoints
output_dir = "phi-3-small/fine-tuned-sft-loftq_adapter"
training_args = TrainingArguments(
    output_dir=output_dir,
    fp16=True,
    # bf16=True,
    do_eval=True,
    evaluation_strategy="steps",
    gradient_accumulation_steps=128,  # Accumulate gradients and perform parameter updating to conserve memory usage
    gradient_checkpointing=True,
    gradient_checkpointing_kwargs={"use_reentrant": False},
    learning_rate=2.0e-05,
    # log_level="info",   # print the number of trainable params, info of saving and model when training
    logging_steps=1,
    logging_strategy="steps",
    lr_scheduler_type="cosine",
    max_steps=2,
    # num_train_epochs=1,  # do not set max_steps if you want num_train_epochs to control the training length
    per_device_train_batch_size=1,
    per_device_eval_batch_size=1,
    report_to="tensorboard",
    seed=42,
    save_strategy="epoch",
    save_total_limit=3,
)

## Trainer

In [None]:
trainer = SFTTrainer(
    model=peft_model,
    tokenizer=tokenizer,
    args=training_args,
    packing=True,   # see tokenizer.model_max_length
    train_dataset=dataset["train"],
    eval_dataset=dataset["test"],
    dataset_text_field="text",
    # peft_config=peft_config,
    # model_init_kwargs=model_kwargs,   # pass if you want to load from trainer
    max_seq_length=tokenizer.model_max_length
)

Token indices sequence length is longer than the specified maximum sequence length for this model (2816 > 2048). Running this sequence through the model will result in indexing errors


max_steps is given, it will override any value given in num_train_epochs


## Training Model

In [None]:
trainer_result = trainer.train()



| Step | Training Loss | Validation Loss |
|------|---------------|-----------------|
| 1    | 0.6521        | No log          |
| 2    | 0.6327        | No log          |


In [21]:
metrics = trainer_result.metrics

if hasattr(training_args, "max_train_samples"):
    max_train_samples = training_args.max_train_samples
else:
    max_train_samples = len(dataset["train"])

metrics["train_samples"] = min(max_train_samples, len(dataset["train"]))
print("Metrics: ", metrics)
print("Max train samples: ", max_train_samples)


Metrics:  {'train_runtime': 962.2323, 'train_samples_per_second': 0.266, 'train_steps_per_second': 0.002, 'total_flos': 5874901617475584.0, 'train_loss': 0.6423681676387787, 'epoch': 1.9692307692307693, 'train_samples': 100}
Max train samples:  100
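
The reported `epoch` of about 1.97 after only two optimizer steps can look odd next to `train_samples: 100`. A short calculation reconciles the numbers, assuming a single GPU and assuming the trainer computes the epoch as sequences seen divided by sequences per epoch. With `packing=True` the conversations are concatenated and cut into fixed-length blocks, so the unit being counted is a packed sequence rather than one of the 100 raw examples.

```python
per_device_train_batch_size = 1
gradient_accumulation_steps = 128
max_steps = 2
num_devices = 1  # assumption: a single Colab GPU

# sequences consumed per optimizer step and over the whole run
effective_batch_size = per_device_train_batch_size * gradient_accumulation_steps * num_devices
sequences_seen = effective_batch_size * max_steps

# values copied from the metrics printed above
reported_epoch = 1.9692307692307693
train_runtime = 962.2323

packed_sequences = sequences_seen / reported_epoch
samples_per_second = sequences_seen / train_runtime

print(effective_batch_size)              # 128
print(sequences_seen)                    # 256
print(round(packed_sequences))           # 130
print(round(samples_per_second, 3))      # 0.266
```

Under these assumptions the 100 conversations were packed into roughly 130 training sequences, and the recomputed throughput matches the reported `train_samples_per_second`.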


### Save the Model


In [22]:
trainer.log_metrics("train", metrics)
trainer.save_metrics("train", metrics)
trainer.save_state()

***** train metrics *****
  epoch                    =     1.9692
  total_flos               =  5471428GF
  train_loss               =     0.6424
  train_runtime            = 0:16:02.23
  train_samples            =        100
  train_samples_per_second =      0.266
  train_steps_per_second   =      0.002


## Inference

In [23]:
from transformers import AutoTokenizer, AutoModelForCausalLM

tokenizer = AutoTokenizer.from_pretrained("/content/phi-3-small/fine-tuned-sft-loftq_adapter/checkpoint-1")
model = AutoModelForCausalLM.from_pretrained("/content/phi-3-small/fine-tuned-sft-loftq_adapter/checkpoint-1",
                                             load_in_4bit=True,
                                             device_map="auto",
                                             trust_remote_code=True)

Special tokens have been added in the vocabulary, make sure the associated word embeddings are fine-tuned or trained.
The `load_in_4bit` and `load_in_8bit` arguments are deprecated and will be removed in the future versions. Please, pass a `BitsAndBytesConfig` object in `quantization_config` argument instead.


In [24]:
import torch

# We use the tokenizer's chat template to format each message - see https://huggingface.co/docs/transformers/main/en/chat_templating
messages = [
    {
        "role": "system",
        "content": "You are a friendly chatbot who always responds in the style of a pirate",
    },
    {"role": "user", "content": "How many helicopters can a human eat in one sitting?"},
]

# prepare the messages for the model
input_ids = tokenizer.apply_chat_template(messages, truncation=True, add_generation_prompt=True, return_tensors="pt").to("cuda")

# inference
outputs = model.generate(
        input_ids=input_ids,
        max_new_tokens=256,
        do_sample=True,
        temperature=0.7,
        top_k=50,
        top_p=0.95
)
print(tokenizer.batch_decode(outputs, skip_special_tokens=True)[0])

How many helicopters can a human eat in one sitting? It is biologically impossible for a human to consume helicopters or any other large object in one sitting due to the limitations of the human digestive system.
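
The decoded output above starts with the user's question because `generate` returns the prompt tokens followed by the newly generated ones. A minimal sketch of separating the two is shown below; the helper name and the token ids are made up for illustration.

```python
def split_generated(sequence_ids, prompt_length):
    """Return (prompt_ids, new_ids) for one generated sequence."""
    return sequence_ids[:prompt_length], sequence_ids[prompt_length:]

# Toy token ids for illustration; real ids come from input_ids and outputs
prompt_ids = [1, 32010, 1128, 32007, 32001]
generated = prompt_ids + [739, 338, 9047, 32007]

_, new_ids = split_generated(generated, len(prompt_ids))
print(new_ids)  # [739, 338, 9047, 32007]
```

In the notebook, the equivalent would be `tokenizer.decode(outputs[0][input_ids.shape[-1]:], skip_special_tokens=True)`, which prints only the model's reply.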
