<a href="https://colab.research.google.com/github/atharv-arya/Fine-Tuning-LLAMA-2-using-LoRA-and-QLoRA/blob/main/Fine_tuning_LLAMA_2.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Installing Required Packages

In [1]:
!pip install --upgrade pip setuptools wheel
!pip install tokenizers==0.22.1 transformers==4.56.2 accelerate peft bitsandbytes trl



In [2]:
import os
import torch
from datasets import load_dataset
from transformers import (
    AutoModelForCausalLM,
    AutoTokenizer,
    BitsAndBytesConfig,
    HfArgumentParser,
    TrainingArguments,
    pipeline,
    logging,
)
from peft import LoraConfig, PeftModel
from trl import SFTTrainer

# For LLAMA 2, the prompt template used for chat models is as follows:

* System Prompt (optional): guides the model's behavior
* User Prompt (required): gives the instruction
* Model Answer (required): the expected response



```
<s> [INST] <<SYS>>
System Prompt
<</SYS>>

User Prompt [/INST] Model Answer </s>
```
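
As a minimal sketch of the template above, the helper below assembles one single-turn training string. The name `build_llama2_prompt` is hypothetical (it is not part of any library), and the exact whitespace around the tags is an assumption taken from the template as written here.

```python
from typing import Optional


def build_llama2_prompt(user_prompt: str, model_answer: str,
                        system_prompt: Optional[str] = None) -> str:
    """Assemble one single-turn sample in the Llama 2 chat layout (illustrative helper)."""
    if system_prompt:
        # The optional system block sits inside the [INST] ... [/INST] span.
        instruction = f"<<SYS>>\n{system_prompt}\n<</SYS>>\n\n{user_prompt}"
    else:
        instruction = user_prompt
    return f"<s> [INST] {instruction} [/INST] {model_answer} </s>"


sample = build_llama2_prompt(
    user_prompt="What is QLoRA?",
    model_answer="QLoRA trains LoRA adapters on top of a 4-bit quantized base model.",
    system_prompt="You are a concise assistant.",
)
print(sample)
```

When the system prompt is omitted, the `<<SYS>>` block is dropped entirely and only the user prompt is wrapped in `[INST] ... [/INST]`.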









# We will reformat our instruction dataset to follow LLAMA 2's template

* Original dataset: https://huggingface.co/datasets/timdettmers/openassistant-guanaco

* The dataset above, reformatted to follow the Llama 2 template (1k samples): https://huggingface.co/datasets/mlabonne/guanaco-llama2-1k

* The full dataset, reformatted to follow the Llama 2 template: https://huggingface.co/datasets/mlabonne/guanaco-llama2
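
The reformatting step can be sketched as follows. This assumes the original samples use `### Human:` and `### Assistant:` markers, which is an assumption about the source dataset rather than something stated in this notebook. The helper `guanaco_to_llama2` is hypothetical and converts only the first exchange of a sample; it is not the script that produced the datasets linked above.

```python
import re


def guanaco_to_llama2(text: str) -> str:
    """Convert the first Human/Assistant exchange of a sample (illustrative helper)."""
    match = re.search(
        r"### Human:\s*(.*?)\s*### Assistant:\s*(.*?)\s*(?=### Human:|$)",
        text,
        flags=re.DOTALL,
    )
    if match is None:
        raise ValueError("no '### Human: ... ### Assistant: ...' exchange found")
    user_prompt, model_answer = match.group(1), match.group(2)
    # No system prompt is available in the source sample, so none is emitted.
    return f"<s> [INST] {user_prompt} [/INST] {model_answer} </s>"


raw = "### Human: What is 2 + 2?### Assistant: 4."
converted = guanaco_to_llama2(raw)
print(converted)
```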

# How to Fine Tune LLAMA 2
* The free tier of Google Colab offers a GPU with 15 GB of VRAM, which is barely enough to hold the Llama 2 7B weights in 16-bit precision.

* We also need to consider the overhead due to optimizer states, gradients, and forward activations

* Full fine-tuning is not possible here: we need parameter-efficient fine-tuning (PEFT) techniques like LoRA or QLoRA.

* To drastically reduce the VRAM usage, we must fine-tune the model in 4-bit precision, which is why we’ll use QLoRA here.
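
A rough back-of-the-envelope estimate makes the memory argument concrete. The figures below are assumptions for illustration: a nominal 7 billion parameters, 16-bit weights and gradients, and two 32-bit Adam moments per parameter for full fine-tuning. Activations, quantization constants, and framework overhead are ignored.

```python
GIB = 1024 ** 3
NUM_PARAMS = 7_000_000_000  # nominal "7B" parameter count (assumption)


def weights_gib(num_params: int, bits_per_param: int) -> float:
    """Memory needed to store the weights alone, in GiB."""
    return num_params * bits_per_param / 8 / GIB


fp16_weights = weights_gib(NUM_PARAMS, 16)
nf4_weights = weights_gib(NUM_PARAMS, 4)

# Full fine-tuning (assumed layout): 2 bytes weights + 2 bytes gradients
# + 8 bytes for two fp32 Adam moments, per parameter.
full_finetune = NUM_PARAMS * (2 + 2 + 8) / GIB

print(f"16-bit weights:   {fp16_weights:.1f} GiB")
print(f"4-bit weights:    {nf4_weights:.1f} GiB")
print(f"full fine-tuning: {full_finetune:.1f} GiB (before activations)")
```

Under these assumptions the 16-bit weights alone nearly fill a 15 GB card, while the 4-bit weights leave room for the adapters, their optimizer states, and activations.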


Steps:
1. Load the Llama-2-7b-chat-hf model (chat model)
2. Train it on the reformatted dataset (mlabonne/guanaco-llama2-1k), which will produce our fine-tuned model, Llama-2-7b-chat-finetune

QLoRA will use a rank of 64 with a scaling parameter of 16. We will load the Llama 2 model directly in 4-bit precision using the NF4 type and train it for one epoch.
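
To see what a rank of 64 and a scaling parameter of 16 imply, recall that LoRA replaces a full weight update with a low-rank product scaled by alpha / r. The sketch below counts trainable parameters for a single square projection matrix. The width of 4096 is an assumption used for illustration, and which layers actually receive adapters depends on the LoRA configuration.

```python
lora_r = 64
lora_alpha = 16
hidden_size = 4096  # assumed width of one square projection matrix

# Standard LoRA scaling factor applied to the low-rank update B @ A.
scaling = lora_alpha / lora_r

# A is (r x d_in) and B is (d_out x r), so the adapter adds r * (d_in + d_out) parameters.
full_matrix_params = hidden_size * hidden_size
adapter_params = lora_r * (hidden_size + hidden_size)
trainable_fraction = adapter_params / full_matrix_params

print(f"scaling factor: {scaling}")
print(f"adapter params per matrix: {adapter_params:,} of {full_matrix_params:,}")
print(f"trainable fraction: {trainable_fraction:.2%}")
```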

In [None]:
model_name = "NousResearch/Llama-2-7b-chat-hf"
dataset_name = "mlabonne/guanaco-llama2-1k"
# fine-tuned model name
new_model = "Llama-2-7b-chat-finetune"

#######################################################
### QLoRA params

# LoRA attention dimension
lora_r = 64

# Alpha parameter for LoRA scaling
lora_alpha = 16

# Dropout probability for LoRA layers
lora_dropout = 0.1
#######################################################

#######################################################
### bitsandbytes params (basically for Quantization)

# Activate 4-bit precision base model loading
use_4bit = True

# Compute data type for 4-bit base models
bnb_4bit_compute_dtype = "float16"

# Quantization type (fp4 or nf4)
bnb_4bit_quant_type = "nf4"

# Activate nested quantization for 4-bit base models (double quantization)
use_nested_quant = False
#######################################################

#######################################################
### TrainingArguments Parameters

# Output directory where the model predictions and checkpoints will be stored
output_dir = './results'

# Number of training epochs
num_train_epochs = 1

# Enable fp16/bf16 training
fp16 = False
bf16 = False

# Batch size per GPU for training
per_device_train_batch_size = 4

# Batch size per GPU for evaluation
per_device_eval_batch_size = 4

# number of update steps to accumulate the gradients for
gradient_accumulation_steps = 1

# Enable gradient checkpointing
gradient_checkpointing = True

# Maximum gradient norm (gradient clipping)
max_grad_norm = 0.3

# Initial Learning rate (Adam optimizer)
learning_rate = 2e-4

# Weight decay to apply to all layers except bias/LayerNorm weights
weight_decay = 0.001

# Optimizer to use
optim = "paged_adamw_32bit"

# learning rate schedule
lr_scheduler_type = "cosine"

# Number of training steps (overrides num_train_epochs when positive)
max_steps = -1

# Ratio of steps for a linear warmup (from 0 to learning rate)
warmup_ratio = 0.03

# Group sequences into batches with same length
# Saves memory and speeds up training considerably
group_by_length = True

# Save a checkpoint every X update steps
save_steps = 0

# Log every X update steps
logging_steps = 25
#######################################################

#######################################################
### SFT (Supervised Fine-Tuning) params

# Maximum sequence length to use
max_seq_length = None

# Pack multiple short examples in the same input sequence to increase efficiency
packing = False

# Load the entire model on the GPU 0
device_map = {"": 0}
#######################################################
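
As a quick sanity check on the training parameters above, the following sketch estimates how many optimizer steps one epoch takes. It assumes 1,000 training samples (inferred from the "1k" in the dataset name) and a single GPU. Rounding the warmup step count up is also an assumption, so the trainer's exact numbers may differ slightly.

```python
import math

num_samples = 1000  # assumed from the "1k" in the dataset name
num_gpus = 1        # assumption: a single Colab GPU

per_device_train_batch_size = 4
gradient_accumulation_steps = 1
num_train_epochs = 1
warmup_ratio = 0.03
logging_steps = 25

# Samples consumed per optimizer update.
effective_batch_size = per_device_train_batch_size * gradient_accumulation_steps * num_gpus

steps_per_epoch = math.ceil(num_samples / effective_batch_size)
total_steps = steps_per_epoch * num_train_epochs
warmup_steps = math.ceil(total_steps * warmup_ratio)
log_events = total_steps // logging_steps

print(f"effective batch size: {effective_batch_size}")
print(f"total optimizer steps: {total_steps}")
print(f"warmup steps: {warmup_steps}")
print(f"log entries: {log_events}")
```

Raising `gradient_accumulation_steps` increases the effective batch size without increasing per-step memory, at the cost of fewer optimizer updates per epoch.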