# Parameter Efficient Fine-Tuning with QLoRA

Parameter-efficient fine-tuning (PEFT) trains only a small subset of a large model's parameters instead of updating the entire model during fine-tuning. This approach is especially beneficial for large language models like GPT and LLaMA, where updating all parameters would be computationally expensive and memory-intensive.

LoRA: LoRA (Low-Rank Adaptation) enables efficient fine-tuning of large language models by adding small, trainable low-rank matrices to selected layers of the model. The original weights stay frozen and only the added matrices are trained, so the model can capture task-specific variations without a full update of every parameter.
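To make the savings concrete, the sketch below counts trainable parameters for a single weight matrix with and without LoRA. The 4096 x 4096 shape and the rank of 64 are illustrative assumptions, and `lora_param_count` is a hypothetical helper written for this example.

```python
def lora_param_count(d_out: int, d_in: int, rank: int) -> int:
    """Parameters in the two low-rank factors: B (d_out x rank) and A (rank x d_in)."""
    return rank * (d_out + d_in)


d_out, d_in, rank = 4096, 4096, 64  # illustrative shape and rank (assumed)
full_params = d_out * d_in          # parameters updated by full fine-tuning
lora_params = lora_param_count(d_out, d_in, rank)

print(f"full: {full_params:,}  LoRA: {lora_params:,}  "
      f"ratio: {lora_params / full_params:.2%}")
```

Even at a fairly large rank, the two factors hold only a few percent of the parameters in the matrix they adapt.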

QLoRA: QLoRA builds on LoRA by adding quantization, making it even more memory-efficient. In QLoRA, the frozen base model weights are quantized to 4-bit precision, while the small LoRA matrices are trained in higher precision. This reduces the memory needed to store and process the model’s parameters.
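A rough back-of-the-envelope estimate shows why 4-bit storage matters. The sketch below assumes a nominal 7 billion parameters and counts only the weights themselves. It ignores quantization constants, activations, LoRA weights, and optimizer state, so real usage will be higher.

```python
def weight_memory_gib(num_params: float, bits_per_param: float) -> float:
    """Approximate storage for the weights alone, in GiB."""
    return num_params * bits_per_param / 8 / 1024**3


num_params = 7e9  # nominal size of a "7B" model (assumed round figure)
fp16_gib = weight_memory_gib(num_params, 16)
four_bit_gib = weight_memory_gib(num_params, 4)

print(f"16-bit weights: {fp16_gib:.1f} GiB")
print(f" 4-bit weights: {four_bit_gib:.1f} GiB")
```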

## Lab Description:

This lab explores Parameter Efficient Fine-Tuning (PEFT) using QLoRA (Quantized Low-Rank Adaptation) to fine-tune the NousResearch/Llama-2-7b-chat-hf model on the mlabonne/guanaco-llama2-1k dataset. The lab walks through configuring QLoRA parameters, bitsandbytes quantization settings, TrainingArguments, and Supervised Fine-Tuning (SFT) parameters. Participants will implement QLoRA to efficiently adapt a large-scale model while reducing memory footprint and computational costs.

### **Lab Objectives**  

1. **Understand QLoRA for Parameter-Efficient Fine-Tuning**   

2. **Configure and Apply QLoRA Parameters**    

3. **Fine-Tune LLaMA-2-7B Using the Guanaco Dataset**  

4. **Evaluate and Analyze Model Performance**  
   

## Libraries

1. **os**: Interacts with the operating system for file and directory management.
2. **torch**: Provides tensor operations, neural network functionality, and GPU support.
3. **datasets**: Loads and processes datasets for machine learning.
4. **transformers**:
   - **AutoModelForCausalLM**: Loads pre-trained causal language models.
   - **AutoTokenizer**: Tokenizes text for model input.
   - **BitsAndBytesConfig**: Configures model quantization for memory efficiency.
   - **HfArgumentParser**: Parses command-line arguments.
   - **TrainingArguments**: Configures training parameters.
   - **pipeline**: Simplifies common NLP tasks.
   - **logging**: Manages logging for transformers operations.
5. **peft**:
   - **LoraConfig**: Configures low-rank adaptation settings.
   - **PeftModel**: Applies parameter-efficient fine-tuning techniques to models.
6. **trl**:
   - **SFTTrainer**: Trains models with supervised fine-tuning, often as the first stage before reinforcement-learning-based alignment.


In [None]:
!pip install torch==2.2.0 torchvision==0.17.0 torchaudio==2.2.0 --index-url https://download.pytorch.org/whl/cu121
!pip install datasets==3.5.0
!pip install transformers==4.50.3
!pip install peft==0.15.1
!pip install trl==0.16.0
!pip install bitsandbytes==0.45.4

In [1]:
import os
import torch
from datasets import load_dataset
from transformers import (
    AutoModelForCausalLM,
    AutoTokenizer,
    BitsAndBytesConfig,
    HfArgumentParser,
    TrainingArguments,
    pipeline,
    logging,
)
from peft import LoraConfig, PeftModel
from trl import SFTTrainer



## Model and Data Set

- **Base Model**: `NousResearch/Llama-2-7b-chat-hf`  
  The starting model loaded from the Hugging Face Hub. This is a 7-billion parameter model designed for chat-based applications.

- **Dataset**: `mlabonne/guanaco-llama2-1k`  
  The instruction dataset used for fine-tuning, which contains 1,000 carefully curated samples already processed for Llama 2 models. This is useful if you don't want to reformat the dataset yourself to match LLaMA 2's prompt format.

- **Fine-Tuned Model Name**: `llama-2-7b-miniguanaco`  
  The name of the newly fine-tuned model, representing a customized version of Llama-2-7b trained on the Guanaco subset.


In [2]:
# The model that you want to train from the Hugging Face hub
model_name = "NousResearch/Llama-2-7b-chat-hf"

# The instruction dataset to use
dataset_name = "mlabonne/guanaco-llama2-1k"

# Fine-tuned model name
new_model = "llama-2-7b-miniguanaco"

## QLoRA Parameters

1. **`r`**: Sets the rank of low-rank adaptation matrices, controlling the complexity of the adjustments.
2. **`alpha`**: A scaling factor that controls the strength of the adaptation in LoRA layers.
3. **`dropout`**: Specifies dropout probability for LoRA layers to help prevent overfitting. Overfitting is a phenomenon in machine learning where a model learns the training data too well, including its noise and specific patterns, rather than generalizing to new, unseen data. Dropout is a regularization technique used in neural networks to help prevent overfitting. During training, dropout randomly "drops out" (or deactivates) a fraction of neurons in a layer by setting their output to zero. This prevents the model from relying too heavily on any single neuron or specific set of neurons to make predictions.
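The interaction between `r` and `alpha` can be sketched with a tiny forward pass. The example assumes the standard LoRA formulation, in which the low-rank update is scaled by `alpha / r` and the `B` matrix starts at zero. The matrices and the `lora_forward` helper are made up for illustration and use plain Python lists rather than tensors.

```python
def matvec(matrix, vector):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(w * x for w, x in zip(row, vector)) for row in matrix]


def lora_forward(W, A, B, x, alpha, r):
    """h = W x + (alpha / r) * B (A x); W is frozen, only A and B are trained."""
    scaling = alpha / r
    base = matvec(W, x)
    update = matvec(B, matvec(A, x))
    return [b + scaling * u for b, u in zip(base, update)]


# Tiny illustrative shapes: a 2x3 base weight and rank r = 1.
W = [[1.0, 0.0, 2.0], [0.0, 1.0, -1.0]]
A = [[0.5, 0.5, 0.5]]    # r x d_in
B_zero = [[0.0], [0.0]]  # d_out x r, zero at initialization
x = [1.0, 2.0, 3.0]

# With B at zero, the adapted layer reproduces the frozen layer exactly.
print(lora_forward(W, A, B_zero, x, alpha=16, r=1))

# Scaling factor implied by this lab's r = 64 and alpha = 16.
print(16 / 64)
```

With `r = 64` and `alpha = 16`, the update is multiplied by 0.25, so raising `alpha` or lowering `r` strengthens the adapter's contribution.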


In [3]:
# LoRA attention dimension
r = 64

# Alpha parameter for LoRA scaling
alpha = 16

# Dropout probability for LoRA layers
dropout = 0.1

## `bitsandbytes` Parameters

1. **`use_4bit`**: Activates loading of the model with 4-bit precision to reduce memory usage.
2. **`bnb_4bit_compute_dtype`**: Sets the data type for computations with 4-bit models, here as `float16` for efficient performance.
3. **`bnb_4bit_quant_type`**: Specifies the quantization type; `nf4` is optimized for normally distributed weights.
4. **`use_nested_quant`**: Enables or disables nested quantization (double quantization) for further memory efficiency.
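The idea behind 4-bit storage can be illustrated with a simplified quantizer. Note that this sketch uses uniform absmax quantization over one block, which is not the `nf4` scheme that `bitsandbytes` applies. It only shows how weights are mapped to a small set of integer codes plus one scale per block. The weights and helper names are made up.

```python
def quantize_block(weights, levels=7):
    """Map each weight to a signed integer in [-levels, levels] using the block's absmax."""
    scale = max(abs(w) for w in weights) or 1.0
    codes = [round(w / scale * levels) for w in weights]
    return codes, scale


def dequantize_block(codes, scale, levels=7):
    """Recover approximate weights from the integer codes and the block scale."""
    return [c / levels * scale for c in codes]


block = [0.12, -0.53, 0.07, 0.91, -0.25, 0.0, 0.33, -0.88]  # made-up weights
codes, scale = quantize_block(block)
restored = dequantize_block(codes, scale)
max_error = max(abs(a - b) for a, b in zip(block, restored))

print(codes)      # 15 possible values, so each code fits in 4 bits
print(max_error)  # bounded by half a quantization step
```

Nested (double) quantization goes one step further and also compresses the per-block scales.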


In [4]:
# Activate 4-bit precision base model loading
use_4bit = True

# Compute dtype for 4-bit base models
bnb_4bit_compute_dtype = "float16"

# Quantization type (fp4 or nf4)
bnb_4bit_quant_type = "nf4"

# Activate nested quantization for 4-bit base models (double quantization)
use_nested_quant = False

## `TrainingArguments` Parameters

1. **`save_directory`**: Directory for storing model outputs and checkpoints.
2. **`total_epochs`**: Number of times the model will iterate over the entire training dataset.
3. **`use_fp16` / `use_bf16`**: Enable mixed-precision training (16-bit floats). `use_bf16` requires a GPU with bfloat16 support, such as the A100 and other GPUs with compute capability 8.0 or higher.
4. **`train_batch_size_per_device`**: Batch size per GPU for training.
5. **`eval_batch_size_per_device`**: Batch size per GPU for evaluation.
6. **`grad_accumulation_steps`**: Number of steps to accumulate gradients before updating model weights.
7. **`enable_gradient_checkpointing`**: Enables gradient checkpointing to reduce memory usage.
8. **`max_gradient_norm`**: Sets maximum value for gradient clipping to prevent exploding gradients.
9. **`initial_learning_rate`**: Starting learning rate for the optimizer.
10. **`decay_rate`**: Regularization factor applied to model weights (excluding biases/LayerNorm).
11. **`optimizer_type`**: Specifies the optimizer type, here `paged_adamw_32bit` for memory efficiency.
12. **`schedule_type`**: Type of learning rate schedule, here `cosine` for smooth decay.
13. **`total_train_steps`**: Maximum training steps (overrides `total_epochs` if set).
14. **`warmup_proportion`**: Proportion of training steps for learning rate warmup.
15. **`batch_by_length`**: Groups sequences of similar lengths for more efficient training.
16. **`checkpoint_interval`**: Interval for saving model checkpoints during training.
17. **`log_interval`**: Frequency of logging training progress.
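Several of these settings combine to determine how long a run is. The sketch below estimates the number of optimizer steps and warmup steps for this lab's values (1,000 samples, batch size 4, no gradient accumulation, one epoch, 3% warmup). Rounding the warmup count up is an assumption about how the ratio is converted to steps, and `training_steps` is a hypothetical helper.

```python
import math


def training_steps(num_samples, per_device_batch, grad_accum, epochs, num_devices=1):
    """Optimizer steps for a run, assuming the last partial batch is kept."""
    batches_per_epoch = math.ceil(num_samples / (per_device_batch * num_devices))
    steps_per_epoch = max(batches_per_epoch // grad_accum, 1)
    return steps_per_epoch * epochs


total_steps = training_steps(num_samples=1000, per_device_batch=4, grad_accum=1, epochs=1)
warmup_steps = math.ceil(total_steps * 0.03)  # assumed rounding of the warmup ratio

print(total_steps, warmup_steps)
```

Doubling `grad_accumulation_steps` halves the number of optimizer steps while keeping the per-step memory cost unchanged.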

In [5]:
# Directory to save model outputs and checkpoints
save_directory = "./results"

# Total number of training epochs
total_epochs = 1

# Enable 16-bit precision training (set use_bf16 to True on GPUs that support bfloat16, e.g. A100)
use_fp16 = False
use_bf16 = False

# Training batch size per GPU
train_batch_size_per_device = 4

# Evaluation batch size per GPU
eval_batch_size_per_device = 4

# Number of steps to accumulate gradients before updating model weights
grad_accumulation_steps = 1

# Enable memory-efficient gradient checkpointing
enable_gradient_checkpointing = True

# Maximum value for gradient clipping (to stabilize training)
max_gradient_norm = 0.3

# Starting learning rate for the optimizer
initial_learning_rate = 2e-4

# Weight decay rate, except for bias and LayerNorm weights
decay_rate = 0.001

# Optimizer selection
optimizer_type = "paged_adamw_32bit"

# Type of learning rate schedule to apply
schedule_type = "cosine"

# Total number of training steps (overrides total_epochs if set)
total_train_steps = -1

# Proportion of training steps to linearly warm up the learning rate
warmup_proportion = 0.03

# Batch sequences of the same length together to save memory and increase speed
batch_by_length = True

# Interval of steps to save checkpoints
checkpoint_interval = 0

# Interval of steps to log training status
log_interval = 25
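The `cosine` schedule with linear warmup can be sketched as a plain function of the step index. This follows the commonly used formulation of linear warmup followed by a half-cosine decay to zero. The totals of 250 steps and 8 warmup steps are assumed values for illustration, and `cosine_lr` is a hypothetical helper.

```python
import math


def cosine_lr(step, total_steps, warmup_steps, peak_lr):
    """Linear warmup to peak_lr, then cosine decay to zero."""
    if step < warmup_steps:
        return peak_lr * step / max(1, warmup_steps)
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    return peak_lr * 0.5 * (1.0 + math.cos(math.pi * progress))


# Assumed run length; the peak matches initial_learning_rate = 2e-4.
for step in (0, 4, 8, 129, 250):
    print(step, cosine_lr(step, total_steps=250, warmup_steps=8, peak_lr=2e-4))
```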


## Supervised Fine-Tuning (SFT) Parameters 

1. **`max_input_length`**: Sets the maximum sequence length for the input data. If `None`, the model uses its default maximum length.
2. **`enable_packing`**: Allows multiple short examples to be packed together in one input sequence, increasing training efficiency.
3. **`model_device_map`**: Specifies which GPU to load the model on, here assigning the entire model to GPU 0.
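Packing can be pictured as concatenating tokenized examples into one stream and slicing it into fixed-length sequences. The sketch below is a simplified illustration of that idea, not the implementation used by `trl`. The token ids, the EOS id of 2, and the `pack_examples` helper are made up.

```python
def pack_examples(tokenized_examples, max_length, eos_id):
    """Concatenate examples, separated by eos_id, and cut into fixed-size chunks."""
    stream = []
    for tokens in tokenized_examples:
        stream.extend(tokens)
        stream.append(eos_id)
    # Drop the trailing remainder that does not fill a whole chunk.
    usable = len(stream) - len(stream) % max_length
    return [stream[i:i + max_length] for i in range(0, usable, max_length)]


examples = [[5, 6, 7], [8, 9], [10, 11, 12, 13], [14]]  # made-up token ids
packed = pack_examples(examples, max_length=4, eos_id=2)
print(packed)
```

Without packing, each of the four examples would occupy its own padded sequence. With packing, almost every position carries a real token.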



In [6]:
# Maximum sequence length for input data
max_input_length = None

# Enable packing of multiple short examples in one input sequence for efficiency
enable_packing = False

# Specify GPU to load the entire model on (GPU 0)
model_device_map = {"": 0}


## Loading the parameters

## Loading the dataset and setting `compute_dtype`

1. **`dataset = load_dataset(dataset_name, split="train")`**: 
   - Loads the specified dataset's training split, allowing access to training data. You can also preprocess or modify this dataset after loading.

2. **`compute_dtype = getattr(torch, bnb_4bit_compute_dtype)`**:
   - Dynamically sets the data type (e.g., `torch.float16`) for model computations based on the value in `bnb_4bit_compute_dtype` that we defined earlier, configuring precision for quantized or mixed-precision training.


In [None]:
# Load dataset (you can process it here)
dataset = load_dataset(dataset_name, split="train")

# Resolve the compute dtype name (e.g. "float16") to the matching torch dtype
compute_dtype = getattr(torch, bnb_4bit_compute_dtype)

## Setting the `BitsAndBytesConfig`

Assigns the `bitsandbytes` configuration parameters that we defined earlier to the `bnb_config` variable.

In [None]:
bnb_config = BitsAndBytesConfig(
    load_in_4bit=use_4bit,
    bnb_4bit_quant_type=bnb_4bit_quant_type,
    bnb_4bit_compute_dtype=compute_dtype,
    bnb_4bit_use_double_quant=use_nested_quant,
)

## GPU Compatibility Check

Checks whether the GPU in use supports `bf16` precision (compute capability 8.0 or higher). If it does, the cell prints a suggestion to enable `bf16` for training.

In [None]:
# Check GPU compatibility with bfloat16
if compute_dtype == torch.float16 and use_4bit:
    major, _ = torch.cuda.get_device_capability()
    if major >= 8:
        print("=" * 80)
        print("Your GPU supports bfloat16: accelerate training with bf16=True")
        print("=" * 80)
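One way to act on this check is to derive the precision flags from the compute capability instead of only printing a hint. The helper below is a hypothetical sketch. It takes the major capability number as an argument so it can run without a GPU, and it assumes bfloat16 is available from capability 8.0 upward, mirroring the check above.

```python
def choose_precision(compute_capability_major: int) -> dict:
    """Return fp16/bf16 flags; bf16 is assumed available from capability 8.0 upward."""
    use_bf16 = compute_capability_major >= 8
    return {"fp16": not use_bf16, "bf16": use_bf16}


print(choose_precision(8))  # e.g. Ampere or newer
print(choose_precision(7))  # e.g. Volta or Turing
```

In the notebook, the argument would come from `torch.cuda.get_device_capability()`. The lab itself leaves both `use_fp16` and `use_bf16` set to `False`.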

### Loading the Base Model with Quantization Configuration

Loads the base model for causal language modeling from Hugging Face using the quantization configuration and device map initialized earlier.

- **`model.config.use_cache = False`**: Disables caching to save memory during training.
- **`model.config.pretraining_tp = 1`**: Sets the tensor-parallelism degree recorded from pretraining to 1, so the linear layers use the standard (unsliced) computation path.

In [None]:
# Load base model
model = AutoModelForCausalLM.from_pretrained(
    model_name,
    quantization_config=bnb_config,
    device_map=model_device_map
)
model.config.use_cache = False
model.config.pretraining_tp = 1

### Loading and Configuring the Tokenizer

Initializes and configures the LLaMA tokenizer:

- **`AutoTokenizer.from_pretrained(model_name, trust_remote_code=True)`**: Loads the tokenizer associated with `model_name` from Hugging Face, with `trust_remote_code=True` to allow any custom tokenizer code.
- **`tokenizer.pad_token = tokenizer.eos_token`**: Sets the padding token to be the same as the end-of-sequence (EOS) token, ensuring consistent padding.
- **`tokenizer.padding_side = "right"`**: Sets padding to the right side of sequences, which helps avoid overflow issues that can arise with `fp16` (16-bit floating-point) training.


In [None]:
tokenizer = AutoTokenizer.from_pretrained(model_name, trust_remote_code=True)
tokenizer.pad_token = tokenizer.eos_token
tokenizer.padding_side = "right" # Fixes overflow issue with fp16 training
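The effect of these tokenizer settings on a batch can be sketched without loading a tokenizer. The example pads made-up token id lists on the right with the EOS id and builds the matching attention masks. The id value 2 and the `pad_right` helper are assumptions for illustration.

```python
def pad_right(batch, pad_id):
    """Pad every sequence on the right to the longest length and build attention masks."""
    longest = max(len(seq) for seq in batch)
    input_ids = [seq + [pad_id] * (longest - len(seq)) for seq in batch]
    attention_mask = [[1] * len(seq) + [0] * (longest - len(seq)) for seq in batch]
    return input_ids, attention_mask


eos_id = 2  # assumed id; the lab reuses the EOS token as the padding token
input_ids, attention_mask = pad_right([[1, 15, 16, 17], [1, 18]], pad_id=eos_id)
print(input_ids)
print(attention_mask)
```

The attention mask is what tells the model to ignore the padded positions, even though they hold the same id as a genuine end-of-sequence token.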


## Loading the LoRA Configuration

Passes the `r`, `alpha`, and `dropout` parameters defined earlier to `LoraConfig` and stores the result in the `peft_config` variable.

1. **`bias="none"`**:
   - Specifies that no bias parameters are trained. Only the low-rank matrices are updated, which keeps the number of trainable parameters small.

2. **`task_type="CAUSAL_LM"`**:
   - Indicates the type of task the model is being fine-tuned for, in this case, "Causal Language Modeling" (CAUSAL_LM).


In [None]:
# Load LoRA configuration
peft_config = LoraConfig(
    lora_alpha=alpha,
    lora_dropout=dropout,
    r=r,
    bias="none",
    task_type="CAUSAL_LM",
)
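The number of trainable parameters this configuration produces can be estimated by hand. The sketch assumes that, because no `target_modules` are listed, adapters are attached only to the query and value projections of each attention block, and that both are 4096 x 4096 in Llama-2-7B. Treat the result as an estimate under those assumptions rather than a measured value.

```python
hidden_size = 4096     # Llama-2-7B hidden size
num_layers = 32        # Llama-2-7B decoder layers
rank = 64              # r from this lab
adapted_per_layer = 2  # assumption: q_proj and v_proj only

# Each adapted module gets A (rank x hidden) and B (hidden x rank).
params_per_module = rank * (hidden_size + hidden_size)
trainable = num_layers * adapted_per_layer * params_per_module

print(f"{trainable:,} trainable parameters")
print(f"{trainable / 7e9:.2%} of a nominal 7B model")
```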

## Setting the training parameters

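Passes the training parameters defined earlier to `TrainingArguments` and stores the result in the `training_arguments` variable. Note that `eval_batch_size_per_device` and `enable_gradient_checkpointing` are defined above but not passed here, so they have no effect on this run.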

In [None]:
training_arguments = TrainingArguments(
    output_dir=save_directory,
    num_train_epochs=total_epochs,
    per_device_train_batch_size=train_batch_size_per_device,
    gradient_accumulation_steps=grad_accumulation_steps,
    optim=optimizer_type,
    save_steps=checkpoint_interval,
    logging_steps=log_interval,
    learning_rate=initial_learning_rate,
    weight_decay=decay_rate,
    fp16=use_fp16,
    bf16=use_bf16,
    max_grad_norm=max_gradient_norm,
    max_steps=total_train_steps,
    warmup_ratio=warmup_proportion,
    group_by_length=batch_by_length,
    lr_scheduler_type=schedule_type,
    report_to="tensorboard"
)
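Gradient clipping by global norm, which `max_grad_norm` controls, can be sketched in a few lines. The gradients below are made-up numbers, `clip_by_global_norm` is a hypothetical helper, and the small `eps` term is included on the assumption that it mirrors the usual guard against division by zero.

```python
import math


def clip_by_global_norm(grads, max_norm, eps=1e-6):
    """Rescale all gradients together so their L2 norm does not exceed max_norm."""
    total_norm = math.sqrt(sum(g * g for g in grads))
    coef = min(1.0, max_norm / (total_norm + eps))
    return [g * coef for g in grads], total_norm


clipped, norm_before = clip_by_global_norm([0.3, -0.4, 1.2], max_norm=0.3)
norm_after = math.sqrt(sum(g * g for g in clipped))
print(norm_before, norm_after)
```

Because every gradient is multiplied by the same coefficient, clipping changes the size of the update but not its direction.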

## Setting the Supervised Fine-Tuning Parameters

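Creates the `SFTTrainer` from the `model`, the `dataset`, the LoRA configuration, and the training arguments, and stores it in the `trainer` variable. The `max_input_length` and `enable_packing` values defined earlier are not passed in this cell, so the trainer's defaults apply.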

In [None]:
# Set supervised fine-tuning parameters
trainer = SFTTrainer(
    model=model,
    train_dataset=dataset,
    peft_config=peft_config,
    args=training_arguments
)

## Initiate Training

Start training with `.train()`.

In [8]:
trainer.train()

| Step | Training Loss |
|------|---------------|
| 25 | 1.4172 |
| 50 | 1.5946 |
| 75 | 1.2011 |
| 100 | 1.4736 |
| 125 | 1.1767 |
| 150 | 1.3685 |
| 175 | 1.1455 |
| 200 | 1.5181 |
| 225 | 1.1392 |
| 250 | 1.4379 |


TrainOutput(global_step=250, training_loss=1.3472347030639649, metrics={'train_runtime': 446.4961, 'train_samples_per_second': 2.24, 'train_steps_per_second': 0.56, 'total_flos': 1.78746411859968e+16, 'train_loss': 1.3472347030639649, 'epoch': 1.0})
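The reported metrics can be cross-checked against the logged values. The sketch below averages the ten logged losses and recomputes throughput from the step count and runtime. The sample count of 1,000 is an assumption based on one epoch over the dataset described earlier.

```python
logged_losses = [1.4172, 1.5946, 1.2011, 1.4736, 1.1767,
                 1.3685, 1.1455, 1.5181, 1.1392, 1.4379]
mean_logged_loss = sum(logged_losses) / len(logged_losses)

global_step, runtime_s = 250, 446.4961  # from the TrainOutput above
num_samples = 1000                      # assumed: one epoch over 1,000 samples
steps_per_second = global_step / runtime_s
samples_per_second = num_samples / runtime_s

print(round(mean_logged_loss, 4))
print(round(steps_per_second, 2), round(samples_per_second, 2))
```

The average of the logged values agrees with the reported `training_loss` to about four decimal places, and the recomputed rates round to the reported throughput figures.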

## Save the trained model

Save the trained LoRA adapter weights to the directory named by `new_model`, which was defined earlier.

In [9]:
trainer.model.save_pretrained(new_model)

In [None]:
# %load_ext tensorboard
# %tensorboard --logdir results/runs

In [6]:
!nvidia-smi

Fri Nov  8 15:13:39 2024       
+-----------------------------------------------------------------------------------------+
| NVIDIA-SMI 560.35.03              Driver Version: 560.35.03      CUDA Version: 12.6     |
|-----------------------------------------+------------------------+----------------------+
| GPU  Name                 Persistence-M | Bus-Id          Disp.A | Volatile Uncorr. ECC |
| Fan  Temp   Perf          Pwr:Usage/Cap |           Memory-Usage | GPU-Util  Compute M. |
|                                         |                        |               MIG M. |
|   0  NVIDIA L40S                    Off |   00000000:34:00.0 Off |                    0 |
| N/A   69C    P0            100W /  350W |   44946MiB /  46068MiB |      5%      Default |
|                                         |                        |                  N/A |
+-----------------------------------------+------------------------+----------------------+
                                                

## Logger Configuration

Configures the logger to only display critical messages, effectively suppressing all lower-severity log messages.

In [10]:
logging.set_verbosity(logging.CRITICAL)

## Inferencing

Now that we have successfully fine-tuned our model with QLoRA, we are all set to run inference with the new model.

- **`prompt`**: The input question or statement to generate text for.
- **`pipeline(...)`**: Initializes a text generation pipeline using the specified `model` and `tokenizer`.
  - **`task="text-generation"`**: Sets the task as text generation.
  - **`max_length=200`**: Limits the total sequence length (prompt plus generated text) to 200 tokens.
- **`pipe(f"<s>[INST] {prompt} [/INST]")`**: Runs the prompt through the model, formatting it for instruction-based models.
- **`print(result[0]['generated_text'])`**: Prints the generated text.

This setup provides a response to the prompt by leveraging the fine-tuned model in a simple pipeline.
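The prompt wrapping and the effect of `max_length` can be sketched with two small helpers. Both function names are hypothetical, and the prompt token count of 35 is a made-up figure. The real count depends on the tokenizer.

```python
def format_llama2_prompt(user_message: str) -> str:
    """Wrap a user message in the single-turn instruction format used in this lab."""
    return f"<s>[INST] {user_message} [/INST]"


def new_token_budget(max_length: int, prompt_token_count: int) -> int:
    """Tokens left for generation when max_length counts prompt and output together."""
    return max(0, max_length - prompt_token_count)


print(format_llama2_prompt("What is QLoRA?"))
print(new_token_budget(max_length=200, prompt_token_count=35))
```

Because the prompt counts toward the limit, long prompts leave less room for the answer, which is one reason a generated response can stop mid-sentence.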


In [15]:
# Run text generation pipeline with our new model
prompt = "If two dwarves wrapped in a trenchcoat tried to sneak into a human camp, what would give them away?"
pipe = pipeline(task="text-generation", model=model, tokenizer=tokenizer, max_length=200)
result = pipe(f"<s>[INST] {prompt} [/INST]")
print(result[0]['generated_text'])

<s>[INST] If two dwarves wrapped in a trenchcoat tried to sneak into a human camp, what would give them away? [/INST] The two dwarves wrapped in a trenchcoat are likely to give themselves away by their weight and size. The trenchcoat is likely to be too small for them, and their weight will make it difficult for them to move stealthily. Additionally, the two dwarves are likely to be too short to be able to reach the zipper on the trenchcoat, making it difficult for them to get in and out of the coat without being seen.

It is also likely that the two dwarves will not be able to blend in with the human camp, as they are much shorter than the average human and will stick out like a sore thumb. They may also be unable to speak the language of the humans, making it difficult for them to communicate

