# Finetune a Llama2 model with QLoRA
We are going to use the method introduced in the paper "[QLoRA: Efficient Finetuning of Quantized LLMs](https://arxiv.org/abs/2305.14314)" by Tim Dettmers et al. QLoRA is a technique that reduces the memory footprint of large language models during finetuning without sacrificing performance. The TL;DR of how QLoRA works is:

* Quantize the pretrained model to 4 bits and freeze it.
* Attach small, trainable adapter layers (LoRA).
* Finetune only the adapter layers, while using the frozen quantized model for context.

We prepared a train.py, which implements QLoRA using PEFT to train our model. The script also merges the LoRA weights into the model weights after training. That way you can use the model as a normal model without any additional code.
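
To get a feel for the memory savings, the weights-only footprint can be estimated with simple arithmetic. The sketch below assumes a round 7 billion parameters and ignores activations, optimizer state, and the small overhead of the quantization constants, so the numbers are rough lower bounds rather than measurements.

```python
# Rough weights-only memory estimate (assumption: exactly 7e9 parameters).
def weight_memory_gb(num_params: float, bits_per_param: int) -> float:
    """Return the memory needed to store the weights, in gigabytes (1e9 bytes)."""
    return num_params * bits_per_param / 8 / 1e9


num_params = 7e9
fp16_gb = weight_memory_gb(num_params, 16)  # half-precision weights
int4_gb = weight_memory_gb(num_params, 4)   # 4-bit quantized weights

print(f"16-bit weights: {fp16_gb:.1f} GB")
print(f"4-bit weights:  {int4_gb:.1f} GB")
```

Under these assumptions, the 4-bit weights take a quarter of the space of the 16-bit weights, which is what makes finetuning on a single 24 GB GPU plausible.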


<div style="background-color: #FFDDDD; border-left: 5px solid red; padding: 10px; color: black;">
    <strong>Kernel:</strong> PyTorch 2.0.0 Python 3.10 GPU Optimized, <strong>Instance Type:</strong> ml.g5.2xlarge
</div>

In [None]:
%pip install datasets transformers==4.31.0 accelerate==0.21.0 peft==0.4.0 trl==0.4.7 bitsandbytes==0.40.2 scipy tensorboard boto3 sagemaker

In [None]:
import os
import torch
from datasets import load_dataset
from transformers import (
    AutoModelForCausalLM,
    AutoTokenizer,
    BitsAndBytesConfig,
    HfArgumentParser,
    TrainingArguments,
    pipeline,
    logging,
)
from peft import LoraConfig, PeftModel
from trl import SFTTrainer
import bitsandbytes as bnb

Define the pretrained base model from the Hugging Face Hub, the dataset to finetune on, and a name for the finetuned model.

In [None]:
# The model that you want to train from the Hugging Face hub
model_name = "NousResearch/Llama-2-7b-chat-hf"

# The instruction dataset to use
dataset_name = "mlabonne/guanaco-llama2-1k"

# Fine-tuned model name
new_model = "llama-2-7b-miniguanaco"

# Setting up BitsAndBytes configurations
At a high level, QLoRA uses 4-bit quantization to compress a pretrained language model. The LM parameters are then `frozen` and a relatively small number of trainable parameters are added to the model in the form of `Low-Rank Adapters`. During finetuning, QLoRA backpropagates gradients through the frozen 4-bit quantized pretrained language model into the Low-Rank Adapters. 

The LoRA layers are the only parameters being updated during training. Read more about LoRA in the original [LoRA](https://arxiv.org/abs/2106.09685) paper.
In our example, we'll configure the 4-bit quantization parameters using the `BitsAndBytesConfig` class from the Hugging Face `transformers` library. For the full list of configurable parameters, please refer to the [latest version](https://github.com/huggingface/transformers/blob/main/src/transformers/utils/bitsandbytes.py) of the code on GitHub.
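
The adapter idea described above can be illustrated with a tiny, dependency-free sketch. It is not how PEFT implements LoRA internally. It only shows the computation `W x + (alpha / r) * B (A x)`, with `W` frozen and only `A` and `B` trainable. The dimensions and values are made up for illustration, and `B` starts at zero as proposed in the LoRA paper.

```python
import random


def matvec(matrix, vector):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(w * x for w, x in zip(row, vector)) for row in matrix]


def lora_forward(W, A, B, x, alpha, r):
    """Compute W x + (alpha / r) * B (A x); only A and B would be trained."""
    base = matvec(W, x)               # frozen weight (dequantized from 4 bits in QLoRA)
    update = matvec(B, matvec(A, x))  # low-rank path: d_in -> r -> d_out
    scale = alpha / r
    return [b + scale * u for b, u in zip(base, update)]


random.seed(0)
d_in, d_out, r, alpha = 6, 4, 2, 4  # toy sizes, chosen only for the example
W = [[random.gauss(0, 1) for _ in range(d_in)] for _ in range(d_out)]
A = [[random.gauss(0, 0.01) for _ in range(d_in)] for _ in range(r)]  # small random init
B = [[0.0] * r for _ in range(d_out)]                                  # zero init
x = [random.gauss(0, 1) for _ in range(d_in)]

# With B = 0 the adapter contributes nothing, so training starts from the base model.
base_out = matvec(W, x)
lora_out = lora_forward(W, A, B, x, alpha, r)
print(all(abs(a - b) < 1e-12 for a, b in zip(base_out, lora_out)))
```

The adapter adds `r * (d_in + d_out)` parameters per layer instead of `d_in * d_out`, which is why the number of trainable parameters stays small.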

In [None]:
################################################################################
# bitsandbytes parameters
################################################################################

# Activate 4-bit precision base model loading
use_4bit = True

# Compute dtype for 4-bit base models
bnb_4bit_compute_dtype = "float16"

# Quantization type (fp4 or nf4)
bnb_4bit_quant_type = "nf4"

# Activate nested quantization for 4-bit base models (double quantization)
use_nested_quant = False

# Resolve the compute dtype name to the corresponding torch dtype
compute_dtype = getattr(torch, bnb_4bit_compute_dtype)

# 4 bit configuration
bnb_config = BitsAndBytesConfig(
    load_in_4bit=use_4bit,
    bnb_4bit_quant_type=bnb_4bit_quant_type,
    bnb_4bit_compute_dtype=compute_dtype,
    bnb_4bit_use_double_quant=use_nested_quant,
)
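
To make the idea of block-wise quantization more concrete, here is a simplified sketch in plain Python. It uses uniform absmax quantization with 15 signed levels, which fit in 4 bits. This is deliberately not the NF4 data type: NF4 uses non-uniformly spaced levels designed for normally distributed weights, and `bitsandbytes` implements it in optimized kernels. The example values are made up.

```python
def quantize_block(values, levels=7):
    """Quantize one block to integer codes in [-levels, levels] using absmax scaling."""
    absmax = max(abs(v) for v in values) or 1.0  # avoid dividing by zero for all-zero blocks
    scale = absmax / levels
    codes = [max(-levels, min(levels, round(v / scale))) for v in values]
    return codes, scale


def dequantize_block(codes, scale):
    """Map integer codes back to approximate floating-point values."""
    return [c * scale for c in codes]


weights = [0.12, -0.40, 0.03, 0.27, -0.09, 0.31, -0.22, 0.05]  # one toy block
codes, scale = quantize_block(weights)
restored = dequantize_block(codes, scale)
max_error = max(abs(w - r) for w, r in zip(weights, restored))

print(codes)
print(f"scale={scale:.4f}, max abs error={max_error:.4f}")
```

Each block stores its integer codes plus one scale constant. The nested (double) quantization option controlled by `use_nested_quant` goes one step further and quantizes those constants as well.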

# Setup Hyper Parameters For FineTuning
In the following cell, we'll configure the hyperparameters used for the finetuning step. For a complete list of available hyperparameters, please refer to 
the latest [TrainingArguments](https://huggingface.co/docs/transformers/main/en/main_classes/trainer#transformers.TrainingArguments) documentation from Hugging Face.


In [None]:
################################################################################
# TrainingArguments parameters
################################################################################

# Output directory where the model predictions and checkpoints will be stored
output_dir = "./results"

# Number of training epochs
num_train_epochs = 1

# Enable fp16/bf16 training (set bf16 to True with an A100)
fp16 = False
bf16 = False

# Batch size per GPU for training
per_device_train_batch_size = 4

# Batch size per GPU for evaluation
per_device_eval_batch_size = 4

# Number of update steps to accumulate the gradients for
gradient_accumulation_steps = 1

# Enable gradient checkpointing
gradient_checkpointing = True

# Maximum gradient norm (gradient clipping)
max_grad_norm = 0.3

# Initial learning rate (AdamW optimizer)
learning_rate = 2e-4

# Weight decay to apply to all layers except bias/LayerNorm weights
weight_decay = 0.001

# Optimizer to use
optim = "paged_adamw_32bit"

# Learning rate schedule (constant a bit better than cosine)
lr_scheduler_type = "constant"

# Number of training steps (overrides num_train_epochs)
max_steps = -1

# Ratio of steps for a linear warmup (from 0 to learning rate)
warmup_ratio = 0.03

# Group sequences into batches with same length
# Saves memory and speeds up training considerably
group_by_length = True

# Save a checkpoint every X update steps
save_steps = 25

# Log every X update steps
logging_steps = 25

################################################################################
# SFT parameters
################################################################################

# Maximum sequence length to use
max_seq_length = None

# Pack multiple short examples in the same input sequence to increase efficiency
packing = False

# Load the entire model on the GPU 0
device_map = {"": 0}
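
As a sanity check on these settings, the number of optimizer steps and checkpoints can be estimated ahead of time. The helper below is a hypothetical planning aid and not part of `transformers`. It assumes a single GPU and the 1,000-sample dataset used in this notebook, and it only approximates how the trainer counts steps.

```python
import math


def training_schedule(num_samples, batch_size, grad_accum_steps, num_epochs,
                      warmup_ratio, save_steps):
    """Estimate step counts for a single-GPU run (a rough planning aid)."""
    effective_batch = batch_size * grad_accum_steps
    steps_per_epoch = math.ceil(num_samples / effective_batch)
    total_steps = steps_per_epoch * num_epochs
    return {
        "effective_batch_size": effective_batch,
        "total_steps": total_steps,
        "warmup_steps": math.ceil(total_steps * warmup_ratio),
        "checkpoints": total_steps // save_steps,
    }


schedule = training_schedule(
    num_samples=1000, batch_size=4, grad_accum_steps=1, num_epochs=1,
    warmup_ratio=0.03, save_steps=25,
)
print(schedule)
```

The warmup count only matters for schedulers that actually use warmup (for example `constant_with_warmup`), so it is worth checking which schedule is selected before relying on `warmup_ratio`.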

# Load Training Datasets
Like pretraining an LLM, finetuning a model requires training data to learn new patterns. In our example, we are going to leverage a dataset called [guanaco-llama2-1k](https://huggingface.co/datasets/mlabonne/guanaco-llama2-1k). This is a subset (1000 samples) of the excellent [timdettmers/openassistant-guanaco dataset](https://huggingface.co/datasets/timdettmers/openassistant-guanaco), processed to match Llama 2's prompt format as described in this article.

To load the training dataset, we'll use the [datasets](https://github.com/huggingface/datasets) library maintained by Hugging Face, which downloads the data from the Hugging Face Hub and exposes it as a `Dataset` object.

Please refer to [this guide](https://huggingface.co/docs/datasets/load_hub) to learn more about working with datasets from the Hub.

In [None]:
# Load dataset (you can process it here)
dataset = load_dataset(dataset_name, split="train")
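
The dataset above is already formatted, so no preprocessing is needed here. If you bring your own instruction data, a formatting step along the following lines could be applied first. The helper name is hypothetical, and the template is a simplified single-turn version of the Llama 2 chat format without a system prompt, matching the `[INST] ... [/INST]` markers used for inference later in this notebook.

```python
def format_llama2_sample(instruction: str, response: str) -> str:
    """Wrap one instruction/response pair in Llama 2 chat markers (single turn)."""
    return f"<s>[INST] {instruction.strip()} [/INST] {response.strip()} </s>"


sample = format_llama2_sample(
    "What is QLoRA?",
    "A memory-efficient finetuning method.",
)
print(sample)
```

The resulting strings would go into the `text` column, which is the field the trainer is configured to read.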

## Setting optimal Floating Precisions for mixed precision training
The following cell checks whether the underlying GPU supports the bf16 data type and, if so, enables bf16 mixed precision training for the finetuning job.

In [None]:
# Check GPU compatibility with bfloat16
if compute_dtype == torch.float16 and use_4bit:
    major, _ = torch.cuda.get_device_capability()
    if major >= 8:
        print("=" * 80)
        print("Your GPU supports bfloat16: accelerate training with bf16=True")
        print("=" * 80)
        bf16=True
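
The same decision can be written as a small pure function, which is easier to reuse and test. This is one possible variation and not what the cell above does: the hypothetical helper below falls back to fp16 on older GPUs, whereas the notebook leaves both flags off in that case. bf16 requires CUDA compute capability 8.0 (Ampere) or newer.

```python
def pick_precision_flags(compute_capability_major: int) -> dict:
    """Return mixed precision flags: bf16 on capability 8.x or newer, otherwise fp16."""
    supports_bf16 = compute_capability_major >= 8
    return {"bf16": supports_bf16, "fp16": not supports_bf16}


# Example: capability 8.x (Ampere) versus capability 7.x (Volta/Turing).
print(pick_precision_flags(8))
print(pick_precision_flags(7))
```

In the notebook, the major version would come from `torch.cuda.get_device_capability()`.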

# Load Base Llama2 Model
Load the base LLM using the `BitsAndBytes` quantization configuration and the device map, which specifies the GPU that holds the model weights. The cell also sets `pretraining_tp`, which determines whether the linear layers are computed in the [tensor-parallel](https://huggingface.co/docs/transformers/parallelism) style used during pretraining.

**Note**: Setting config.pretraining_tp to a value different than 1 will activate the more accurate but slower computation of the linear layers, which should better match the original logits. We set it to 1 here to use the standard, faster computation.

In [None]:
# Load base model
model = AutoModelForCausalLM.from_pretrained(
    model_name,
    quantization_config=bnb_config,
    device_map=device_map
)
# model.config.use_cache = False
model.config.pretraining_tp = 1 

In [None]:
# COPIED FROM https://github.com/artidoro/qlora/blob/main/qlora.py
def find_all_linear_names(model):
    lora_module_names = set()
    for name, module in model.named_modules():
        if isinstance(module, bnb.nn.Linear4bit):
            names = name.split(".")
            lora_module_names.add(names[0] if len(names) == 1 else names[-1])

    if "lm_head" in lora_module_names:  # needed for 16-bit
        lora_module_names.remove("lm_head")
    return list(lora_module_names)
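
To see what this function is expected to return without loading the model, the name extraction can be reproduced on a handful of module paths. The paths below follow the naming used by the Hugging Face Llama implementation and are listed by hand for illustration. The real function additionally filters by module type (`bnb.nn.Linear4bit`), which this sketch does not do.

```python
def leaf_names(module_paths, exclude=("lm_head",)):
    """Collect the last path component of each module name, minus excluded ones."""
    names = {path.split(".")[-1] for path in module_paths}
    return sorted(names - set(exclude))


example_paths = [
    "model.layers.0.self_attn.q_proj",
    "model.layers.0.self_attn.k_proj",
    "model.layers.0.self_attn.v_proj",
    "model.layers.0.self_attn.o_proj",
    "model.layers.0.mlp.gate_proj",
    "model.layers.0.mlp.up_proj",
    "model.layers.0.mlp.down_proj",
    "model.layers.1.self_attn.q_proj",  # repeated names collapse into one entry
    "lm_head",
]
print(leaf_names(example_paths))
```

These seven projection names are the layers that receive LoRA adapters when the result is passed as `target_modules`.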

## QLoRA parameters
The notable LoRA configuration parameters from the [peft](https://github.com/huggingface/peft) library are listed below:

* `r`: The LoRA attention dimension, that is, the rank of the low-rank update matrices.
* `lora_alpha`: The alpha parameter for LoRA scaling.
* `lora_dropout`: The dropout probability for LoRA layers.
* `bias`: Bias type for LoRA. Can be 'none', 'all' or 'lora_only'. If 'all' or 'lora_only', the corresponding biases will be updated during training. Be aware that this means that, even when disabling the adapters, the model will not produce the same output as the base model would have without adaptation.

In [None]:
modules = find_all_linear_names(model)

In [None]:
# LoRA attention dimension (rank of the update matrices)
lora_r = 64

# Alpha parameter for LoRA scaling
lora_alpha = 16

# Dropout probability for LoRA layers
lora_dropout = 0.1

peft_config = LoraConfig(
    lora_alpha=lora_alpha,
    lora_dropout=lora_dropout,
    r=lora_r,
    bias="none",
    task_type="CAUSAL_LM",
    target_modules=modules
)
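
With these settings, the number of trainable parameters can be estimated by hand. The sketch below assumes the published Llama-2-7B dimensions (hidden size 4096, MLP intermediate size 11008, 32 decoder layers) and adapters on all seven linear projections of every layer. Adjust the numbers if you target a different model or fewer modules.

```python
def lora_param_count(d_in: int, d_out: int, r: int) -> int:
    """Parameters added by one LoRA adapter: A is (r x d_in), B is (d_out x r)."""
    return r * (d_in + d_out)


hidden, intermediate, num_layers = 4096, 11008, 32  # assumed Llama-2-7B dimensions
r, alpha = 64, 16                                    # values from the cell above

per_layer = (
    4 * lora_param_count(hidden, hidden, r)          # q_proj, k_proj, v_proj, o_proj
    + 2 * lora_param_count(hidden, intermediate, r)  # gate_proj, up_proj
    + lora_param_count(intermediate, hidden, r)      # down_proj
)
total_trainable = per_layer * num_layers
scaling = alpha / r  # factor applied to the low-rank update

print(f"trainable LoRA parameters: {total_trainable:,}")
print(f"LoRA scaling factor (lora_alpha / r): {scaling}")
```

Under these assumptions, the adapters amount to roughly 160 million parameters, a small fraction of the approximately 7 billion frozen base weights.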

# Load Tokenizer (Llama2)
Load the tokenizer for the Llama 2 model. The tokenizer is used in the training process to convert the input text into tokens. For more information, please refer to this [link](https://huggingface.co/docs/transformers/v4.31.0/model_doc/llama2#transformers.LlamaTokenizer).

In [None]:
# Load LLaMA tokenizer
tokenizer = AutoTokenizer.from_pretrained(model_name, trust_remote_code=True)
tokenizer.pad_token = tokenizer.eos_token
tokenizer.padding_side = "right" # Fix weird overflow issue with fp16 training

# Setting up Trainer and Training Configuration
Pass the hyperparameters to Hugging Face's [TrainingArguments](https://huggingface.co/docs/transformers/main/en/main_classes/trainer#transformers.TrainingArguments) and create the trainer as an [SFTTrainer](https://huggingface.co/docs/trl/sft_trainer) object.

In [None]:
# Set training parameters
training_arguments = TrainingArguments(
    output_dir=output_dir,
    num_train_epochs=num_train_epochs,
    per_device_train_batch_size=per_device_train_batch_size,
    gradient_accumulation_steps=gradient_accumulation_steps,
    optim=optim,
    save_steps=save_steps,
    logging_steps=logging_steps,
    learning_rate=learning_rate,
    weight_decay=weight_decay,
    fp16=fp16,
    bf16=bf16,
    max_grad_norm=max_grad_norm,
    max_steps=max_steps,
    warmup_ratio=warmup_ratio,
    group_by_length=group_by_length,
    lr_scheduler_type=lr_scheduler_type,
    report_to="tensorboard"
)

trainer = SFTTrainer(
    model=model,
    train_dataset=dataset,
    peft_config=peft_config,
    dataset_text_field="text",
    max_seq_length=max_seq_length,
    tokenizer=tokenizer,
    args=training_arguments,
    packing=packing,
)

Kick off the finetuning process and save the LoRA weights.

In [None]:
# Train model
trainer.train()

# Save trained model
trainer.model.save_pretrained(new_model)

# Test Finetuned model locally through inferences

In [None]:
# Ignore warnings
logging.set_verbosity(logging.CRITICAL)

# Run a text generation pipeline with the finetuned model
prompt = "What is a large language model? Give an answer in a single sentence."
pipe = pipeline(task="text-generation", model=model, tokenizer=tokenizer, max_length=300)
result = pipe(f"<s>[INST] {prompt} [/INST]")
print(result[0]['generated_text'])

# Merge the model with LoRA weights
The finetuning run above produced the trained LoRA adapter weights, which are stored separately from the base model. In the next step, we'll merge the adapter with the base model so that we can deploy the finetuned model for inference.

**Note:** The finetuning steps executed above loaded the base model into VRAM. To ensure there is sufficient VRAM to merge the LoRA adapter weights into the base model, we first need to free up GPU memory by deleting the objects that hold the model and clearing the CUDA cache.

In [None]:
import gc

# The pipeline also holds a reference to the model, so delete it as well
del model, pipe, trainer
gc.collect()
torch.cuda.empty_cache()

In [None]:
# Reload model in FP16 and merge it with LoRA weights
base_model = AutoModelForCausalLM.from_pretrained(
    model_name,
    low_cpu_mem_usage=True,
    return_dict=True,
    torch_dtype=torch.float16,
    device_map=device_map,
)
model = PeftModel.from_pretrained(base_model, new_model)
model = model.merge_and_unload()
save_dir = "merged-4bit"
model.save_pretrained(save_dir, safe_serialization=True, max_shard_size="2GB")
# Reload tokenizer to save it
tokenizer = AutoTokenizer.from_pretrained(model_name, trust_remote_code=True)
tokenizer.pad_token = tokenizer.eos_token
tokenizer.padding_side = "right"
tokenizer.save_pretrained(save_dir)
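
Conceptually, merging folds the low-rank update into the base weight, `W_merged = W + (alpha / r) * B A`, so that the merged layer produces the same output as the base layer plus adapter. The toy example below checks this in plain Python with made-up matrices. It illustrates the arithmetic only and is not the `peft` implementation.

```python
def matvec(M, x):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(m * v for m, v in zip(row, x)) for row in M]


def matmul(X, Y):
    """Multiply two matrices given as lists of rows."""
    return [[sum(X[i][k] * Y[k][j] for k in range(len(Y)))
             for j in range(len(Y[0]))] for i in range(len(X))]


def merge_lora(W, A, B, alpha, r):
    """Return W + (alpha / r) * B @ A as a new matrix."""
    scale = alpha / r
    delta = matmul(B, A)
    return [[w + scale * d for w, d in zip(w_row, d_row)]
            for w_row, d_row in zip(W, delta)]


W = [[1.0, 0.0, 2.0], [0.5, -1.0, 0.0]]  # frozen base weight, shape 2 x 3
A = [[0.1, 0.2, -0.1]]                   # LoRA A, shape r x 3 with r = 1
B = [[1.0], [-2.0]]                      # LoRA B, shape 2 x r
x = [1.0, 2.0, 3.0]
alpha, r = 2, 1

adapter_out = [b + (alpha / r) * u
               for b, u in zip(matvec(W, x), matvec(B, matvec(A, x)))]
merged_out = matvec(merge_lora(W, A, B, alpha, r), x)
print(all(abs(a - m) < 1e-9 for a, m in zip(adapter_out, merged_out)))
```

Because the merged weight has the same shape as the original, the saved model can be loaded like any regular checkpoint without `peft`.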

# Upload Model Artifact to S3 
In the next section, we are going to upload the merged model artifact to an S3 bucket so that we can deploy the model using a SageMaker LMI container. 
We recommend enabling Amazon S3 bucket encryption for any bucket where you store data, including the model artifacts. Amazon S3 encryption helps protect the data stored in your S3 buckets, which is especially important for sensitive data. Please follow this [link](https://docs.aws.amazon.com/AmazonS3/latest/userguide/UsingKMSEncryption.html) for more information about setting up server-side or client-side encryption for your S3 bucket.

In [None]:
model_data_s3_location = "<specify an S3 location for storing the model weight>"
!cd {save_dir} && aws s3 cp --recursive . {model_data_s3_location}
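
As an alternative to the CLI command, the upload could also be done from Python with `boto3`, requesting server-side encryption with KMS for each object. This is a sketch under assumptions: the helper names are hypothetical, the bucket and prefix in the usage note are placeholders, and it relies on the default KMS key and on AWS credentials being available in the environment.

```python
import os


def list_upload_pairs(local_dir: str, key_prefix: str):
    """Return (local_path, s3_key) pairs for every file under local_dir."""
    prefix = key_prefix.strip("/")
    pairs = []
    for root, _dirs, files in os.walk(local_dir):
        for name in sorted(files):
            local_path = os.path.join(root, name)
            relative = os.path.relpath(local_path, local_dir).replace(os.sep, "/")
            key = f"{prefix}/{relative}" if prefix else relative
            pairs.append((local_path, key))
    return pairs


def upload_directory(local_dir: str, bucket: str, key_prefix: str) -> None:
    """Upload a directory to S3, requesting SSE-KMS encryption for each object."""
    import boto3  # imported lazily so list_upload_pairs works without AWS access

    s3 = boto3.client("s3")
    for local_path, key in list_upload_pairs(local_dir, key_prefix):
        s3.upload_file(
            local_path, bucket, key,
            ExtraArgs={"ServerSideEncryption": "aws:kms"},
        )
```

For example, `upload_directory(save_dir, "<your-bucket>", "<your-prefix>")` would upload the merged model files, with the bucket name and prefix replaced by your own values.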