<a href="https://colab.research.google.com/github/aakriti1318/GenAI/blob/main/Finetuning%20llama2%20using%20QLoRA.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Fine Tune Llama2

## 1 - Installing Required Packages

- **Accelerate** - Lets the same PyTorch code run across any distributed configuration.
- **PEFT** - Parameter-Efficient Fine-Tuning. The technique freezes most of the LLM's weights and trains only a small number of parameters, so the model can adapt to your custom dataset at a fraction of the cost of full fine-tuning.
- **bitsandbytes** - Quantization library. Model weights are normally stored as floating-point values (for example float32); bitsandbytes stores them in a lower-precision format such as int8 or 4-bit instead, so the model fits in less GPU memory and can be fine-tuned on smaller hardware.

In [1]:
# !pip install -q accelerate==0.21.0 peft==0.4.0 bitsandbytes==0.40.2 transformers==4.31.0 trl==0.4.7

## 2 - Importing Libraries

In [2]:
# !pip install datasets

In [3]:
import os
import torch
from datasets import load_dataset
from transformers import (
    AutoModelForCausalLM,
    AutoTokenizer,
    BitsAndBytesConfig,
    HfArgumentParser,
    TrainingArguments,
    pipeline,
    logging,
)
from peft import LoraConfig, PeftModel
from trl import SFTTrainer

#### For Llama 2, the prompt template is as follows:

- System Prompt - to guide model
- User Prompt - to give instruction
- Model Answer
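
The three parts above can be combined into a single string. The sketch below assumes the commonly documented Llama 2 chat layout, where the optional system prompt is wrapped in `<<SYS>>` tags inside the `[INST]` block. The helper name `build_llama2_prompt` and the example texts are hypothetical and only serve as illustration.

```python
def build_llama2_prompt(user_prompt, answer="", system_prompt=None):
    """Assemble one Llama 2 chat turn (hypothetical helper, for illustration)."""
    if system_prompt:
        # The optional system prompt is wrapped in <<SYS>> tags inside [INST]
        instruction = f"<<SYS>>\n{system_prompt}\n<</SYS>>\n\n{user_prompt}"
    else:
        instruction = user_prompt
    prompt = f"<s>[INST] {instruction} [/INST]"
    if answer:
        # During training the model answer and the end-of-sequence token follow
        prompt += f" {answer} </s>"
    return prompt

example = build_llama2_prompt(
    "What is QLoRA?",
    answer="A way to fine-tune quantized models with LoRA adapters.",
    system_prompt="You are a helpful assistant.",
)
print(example)
```

At inference time the answer is left empty, so the string ends at `[/INST]` and the model generates the rest.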


## 3 - Reformatting the dataset in the Llama2 template with 1k samples

Reformat the dataset to follow the Llama 2 template, using 1k samples:
[mlabonne/guanaco-llama2-1k](https://huggingface.co/datasets/mlabonne/guanaco-llama2-1k)

In [4]:
!pip install -q datasets
!huggingface-cli login


    To login, `huggingface_hub` requires a token generated from https://huggingface.co/settings/tokens .
Token: 
Add token as git credential? (Y/n) n
Token is valid (permission: write).
Your token has been saved to /root/.cache/huggingface/token
Login successful


In [5]:
from datasets import load_dataset
import re

In [6]:
dataset = load_dataset('timdettmers/openassistant-guanaco')
# Shuffle the training split and keep the first 1,000 samples
dataset = dataset['train'].shuffle(seed=42).select(range(1000))

# Convert one '### Human: ... ### Assistant: ...' conversation to the Llama 2 template
def transform_conversation(example):
  conversation_text = example['text']
  segments = conversation_text.split('###')
  reformatted_segments = []

  # Iterate over (Human, Assistant) pairs; segments[0] is the text before the first '###'.
  # The range stops at len(segments)-1, so i+1 is always a valid index and a trailing
  # Human turn without a reply is skipped (the else branch below is never reached).
  for i in range(1, len(segments)-1, 2):
    human_text = segments[i].strip().replace('Human:', '').strip()
    if i+1 < len(segments):
      assistant_text = segments[i+1].strip().replace('Assistant:','').strip()
      reformatted_segments.append(f'<s>[INST] {human_text} [/INST] {assistant_text} </s>')
    else:
      reformatted_segments.append(f'<s>[INST] {human_text} [/INST] </s>')

  return {'text': ''.join(reformatted_segments)}

transformed_dataset = dataset.map(transform_conversation)

The secret `HF_TOKEN` does not exist in your Colab secrets.
To authenticate with the Hugging Face Hub, create a token in your settings tab (https://huggingface.co/settings/tokens), set it as secret in your Google Colab and restart your session.
You will be able to reuse this secret in all of your notebooks.
Please note that authentication is recommended but still optional to access public models or datasets.
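
To see what the transformation does without downloading the dataset, the sketch below applies the same splitting logic to a made-up two-turn conversation. The function name `to_llama2_format` and the sample text are hypothetical; the logic is a standalone copy of the pairing loop used above.

```python
def to_llama2_format(conversation_text):
    """Standalone copy of the transformation logic above, for a quick check."""
    segments = conversation_text.split('###')
    turns = []
    # Walk over (Human, Assistant) pairs, starting after the empty first segment
    for i in range(1, len(segments) - 1, 2):
        human_text = segments[i].strip().replace('Human:', '').strip()
        assistant_text = segments[i + 1].strip().replace('Assistant:', '').strip()
        turns.append(f'<s>[INST] {human_text} [/INST] {assistant_text} </s>')
    return ''.join(turns)

# Made-up two-turn conversation in the '### Human: ... ### Assistant: ...' layout
sample = '### Human: Hello### Assistant: Hi there!### Human: How are you?### Assistant: Fine.'
converted = to_llama2_format(sample)
print(converted)
```

Each Human/Assistant pair becomes its own `<s>[INST] ... [/INST] ... </s>` block, and the blocks are concatenated without a separator.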


In [7]:
# transformed_dataset.push_to_hub("guanaco-llama2-1k")

## 4 - Fine Tune llama2

- The free Google Colab tier offers a GPU with about 15 GB of VRAM, which is barely enough to store the weights of Llama 2 7B.
- On top of the weights, we need to account for the overhead of optimizer states, gradients, and forward activations.
- Full fine-tuning is therefore not possible here: we need PEFT techniques such as LoRA or QLoRA.
- To drastically reduce VRAM usage, we fine-tune the model in 4-bit precision using QLoRA.

- Load the Llama-2-7b-chat-hf model (the chat model).
- Train it on the dataset, which produces our fine-tuned model, Llama-2-7b-chat-finetune.
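
The memory argument above can be checked with some quick arithmetic. This is a rough, weight-only estimate that assumes a rounded count of 7 billion parameters; it ignores optimizer states, gradients, activations, and quantization metadata, so real usage is higher.

```python
# Rough weight-only memory estimate; the 7e9 parameter count is a rounded assumption
NUM_PARAMS = 7e9

# Storage cost per parameter for a few common precisions
BYTES_PER_PARAM = {"float32": 4, "float16": 2, "int8": 1, "4-bit": 0.5}

def weight_memory_gb(num_params, bytes_per_param):
    """Memory needed for the weights alone, in gigabytes (1 GB = 1e9 bytes)."""
    return num_params * bytes_per_param / 1e9

estimates = {name: weight_memory_gb(NUM_PARAMS, b) for name, b in BYTES_PER_PARAM.items()}
for name, gb in estimates.items():
    print(f"{name:>8}: {gb:.1f} GB")
```

Under these assumptions the float16 weights alone take about 14 GB, which leaves almost no room on a 15 GB GPU, while 4-bit weights take about 3.5 GB.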

In [8]:
model_name = "NousResearch/Llama-2-7b-chat-hf"
dataset_name = "mlabonne/guanaco-llama2-1k"
# Fine-tuned model name
new_model = "Llama-2-7b-chat-finetune"

QLoRA will use a rank of 64 (a hyperparameter) with a scaling parameter (alpha) of 16. We'll load the Llama 2 model directly in 4-bit precision using the NF4 type and train it for one epoch.

#### QLoRA Parameters

In [9]:
# LoRA attention dimension
lora_r = 64

# Alpha parameter for LoRA scaling
lora_alpha = 16

# Dropout probability for LoRA layers
lora_dropout = 0.1
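
To get a feel for what these values mean, the sketch below computes the LoRA scaling factor (`lora_alpha / lora_r`) and a rough count of trainable adapter parameters. The model shape (hidden size 4096, 32 decoder layers) and the choice of adapting only the `q_proj` and `v_proj` projections are assumptions made for illustration, not values read from the notebook.

```python
lora_r = 64
lora_alpha = 16

# Assumed Llama 2 7B shape and adapter placement (illustrative assumptions)
hidden_size = 4096
num_layers = 32
adapted_modules_per_layer = 2  # assumed: q_proj and v_proj

# The low-rank update is multiplied by alpha / r before being added to the frozen weight
scaling = lora_alpha / lora_r

# Each adapted module gets two small matrices: A (r x in) and B (out x r)
params_per_module = lora_r * (hidden_size + hidden_size)
trainable_params = params_per_module * adapted_modules_per_layer * num_layers

print(f"scaling = {scaling}")
print(f"trainable adapter parameters = {trainable_params:,}")
```

Under these assumptions roughly 33.5 million parameters are trained, which is well under 1% of a 7-billion-parameter model.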

In [10]:
#### bitsandbytes parameters - Quantization

# Activate 4-bit precision base model loading
use_4bit = True

# compute dtype for 4-bit base models
bnb_4bit_compute_dtype = 'float16'

# Quantization Type (fp4 or nf4)
bnb_4bit_quant_type = 'nf4'

# Activate nested quantization for 4-bit base models (double quantization)
use_nested_quant = False
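
As a rough intuition for what 4-bit quantization does, the sketch below quantizes one small block of weights to 16 levels and restores it. It uses plain uniform absmax quantization, which is a simplification: NF4 uses 16 non-uniform levels instead, and bitsandbytes handles all of this internally. The function names and the sample weights are hypothetical.

```python
def quantize_block(weights, num_levels=16):
    """Uniform absmax quantization of one block to 4-bit codes (not NF4)."""
    absmax = max(abs(w) for w in weights) or 1.0  # scale stored once per block
    step = 2.0 / (num_levels - 1)                 # spacing of the levels in [-1, 1]
    codes = [round((w / absmax + 1.0) / step) for w in weights]
    return codes, absmax

def dequantize_block(codes, absmax, num_levels=16):
    """Map the integer codes back to approximate floating-point weights."""
    step = 2.0 / (num_levels - 1)
    return [(c * step - 1.0) * absmax for c in codes]

# Made-up block of weights
block = [0.12, -0.50, 0.33, 0.05, -0.27, 0.41, -0.08, 0.19]
codes, absmax = quantize_block(block)
restored = dequantize_block(codes, absmax)
max_error = max(abs(a - b) for a, b in zip(block, restored))

print("codes:", codes)
print("max reconstruction error:", max_error)
```

Each weight is stored as an integer code between 0 and 15 plus one shared scale per block. Nested (double) quantization, controlled by `use_nested_quant`, additionally quantizes those per-block scales to save a little more memory.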

In [11]:
#### Training Arguments Parameters

# output dir where the model predictions and checkpoints will be stored
output_dir = "/content/results"

# number of training epochs
num_train_epochs = 1

# Enable fp16/bf16 training (set bf16 to True with an A100)
fp16 = False
bf16 = False

# Batch size per GPU for training
per_device_train_batch_size = 4

# Batch size per GPU for evaluation
per_device_eval_batch_size = 4

# Number of update steps to accumulate the gradients for
gradient_accumulation_steps = 1

# Enable gradient checkpointing
gradient_checkpointing = True

# Maximum gradient norm (gradient clipping)
max_grad_norm = 0.3

# Initial learning rate (AdamW optimizer)
learning_rate = 2e-4

# Weight decay to apply to all layers except bias/LayerNorm weights
weight_decay = 0.001

# Optimizer to use
optim = "paged_adamw_32bit"

# Learning rate schedule
lr_scheduler_type = "cosine"

# Number of training steps (overrides num_train_epochs)
max_steps = -1

# Ratio of steps for a linear warmup (from 0 to learning rate)
warmup_ratio = 0.03

# Group sequences into batches with same length
# Saves memory and speeds up training considerably
group_by_length = True

# Save checkpoint every X update steps
save_steps = 0

# Log every X update steps
logging_steps = 25
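
These settings determine how many optimizer steps the run takes. The sketch below works through the arithmetic, assuming a single GPU and a warmup length obtained by rounding `warmup_ratio * total_steps` up; both are assumptions about the environment and the trainer's behavior rather than something stated in the notebook.

```python
import math

num_samples = 1000                 # size of the training set used in this notebook
per_device_train_batch_size = 4
gradient_accumulation_steps = 1
num_train_epochs = 1
warmup_ratio = 0.03
num_gpus = 1                       # assumption: a single Colab GPU

# Samples consumed per optimizer update
effective_batch_size = per_device_train_batch_size * gradient_accumulation_steps * num_gpus
steps_per_epoch = math.ceil(num_samples / effective_batch_size)
total_steps = steps_per_epoch * num_train_epochs
# Assumed rounding: warmup covers the first ceil(ratio * total) steps
warmup_steps = math.ceil(total_steps * warmup_ratio)

print(f"effective batch size: {effective_batch_size}")
print(f"total optimizer steps: {total_steps}")
print(f"warmup steps: {warmup_steps}")
```

The 250 total steps match the `global_step=250` reported in the training output further down, and with `logging_steps = 25` the loss is logged ten times.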

In [12]:
#### SFT - Supervised Fine-Tuning

# Maximum sequence length to use
max_seq_length = None

# Pack multiple short examples in the same input sequence to increase efficiency
packing = False

# Load the entire model on the GPU 0
device_map = {"": 0}

## 5 - Starting Fine Tuning



*   First, load the dataset. (Usually the dataset needs preprocessing at this point: reformatting the prompt, filtering out bad text, combining multiple datasets, etc.)
*   Configure bitsandbytes for 4-bit quantization.
*   Load the Llama 2 model in 4-bit precision on a GPU, along with the corresponding tokenizer.
*   Finally, load the configuration for QLoRA and the regular training parameters, and pass everything to SFTTrainer.



In [13]:
!apt-get update
!apt-get install -y nvidia-driver-470


In [14]:
# Load Dataset
dataset = load_dataset(dataset_name, split = "train")

# Load tokenizer and model with QLoRA configuration
compute_dtype = getattr(torch, bnb_4bit_compute_dtype)

bnb_config = BitsAndBytesConfig(
    load_in_4bit=use_4bit,
    bnb_4bit_quant_type=bnb_4bit_quant_type,
    bnb_4bit_compute_dtype=compute_dtype,
    bnb_4bit_use_double_quant=use_nested_quant,
)

# Check GPU compatibility with bfloat16
if compute_dtype == torch.float16 and use_4bit:
    major, _ = torch.cuda.get_device_capability()
    if major >= 8:
        print("=" * 80)
        print("Your GPU supports bfloat16: accelerate training with bf16=True")
        print("=" * 80)

# Load the base model
model = AutoModelForCausalLM.from_pretrained(
    model_name,
    quantization_config=bnb_config,
    device_map=device_map
)
model.config.use_cache = False
model.config.pretraining_tp = 1

# Load LLaMA tokenizer
tokenizer = AutoTokenizer.from_pretrained(model_name, trust_remote_code=True)
tokenizer.pad_token = tokenizer.eos_token # Use the EOS token for padding
tokenizer.padding_side = "right" # Fix weird overflow issue with fp16 training

# Load LoRA configuration
peft_config = LoraConfig(
    lora_alpha=lora_alpha,
    lora_dropout=lora_dropout,
    r=lora_r,
    bias="none",
    task_type="CAUSAL_LM",
)


# Set training parameters
training_arguments = TrainingArguments(
    output_dir=output_dir,
    num_train_epochs=num_train_epochs,
    per_device_train_batch_size=per_device_train_batch_size,
    gradient_accumulation_steps=gradient_accumulation_steps,
    optim=optim,
    save_steps=save_steps,
    logging_steps=logging_steps,
    learning_rate=learning_rate,
    weight_decay=weight_decay,
    fp16=fp16,
    bf16=bf16,
    max_grad_norm=max_grad_norm,
    max_steps=max_steps,
    warmup_ratio=warmup_ratio,
    group_by_length=group_by_length,
    lr_scheduler_type=lr_scheduler_type,
    report_to="tensorboard"
)

# Set supervised fine-tuning parameters
trainer = SFTTrainer(
    model=model,
    train_dataset=dataset,
    peft_config=peft_config, # LoRA config
    dataset_text_field="text",
    max_seq_length=max_seq_length,
    tokenizer=tokenizer,
    args=training_arguments,
    packing=packing,
)

# Train model
trainer.train()


You're using a LlamaTokenizerFast tokenizer. Please note that with a fast tokenizer, using the `__call__` method is faster than using a method to encode the text followed by a call to the `pad` method to get a padded encoding.


| Step | Training Loss |
|------|---------------|
| 25 | 1.4087 |
| 50 | 1.6633 |
| 75 | 1.2147 |
| 100 | 1.446 |
| 125 | 1.1772 |
| 150 | 1.3664 |
| 175 | 1.1739 |
| 200 | 1.4678 |
| 225 | 1.1581 |
| 250 | 1.5431 |


TrainOutput(global_step=250, training_loss=1.3619149398803712, metrics={'train_runtime': 1599.0893, 'train_samples_per_second': 0.625, 'train_steps_per_second': 0.156, 'total_flos': 8755214190673920.0, 'train_loss': 1.3619149398803712, 'epoch': 1.0})

In [16]:
# save trained model

trainer.model.save_pretrained(new_model)
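
The call above saves only the LoRA adapter weights, not a full standalone model. One common follow-up is to reload the base model in FP16 and merge the adapter into it. The sketch below shows one way this could look; the helper name `merge_lora_adapter` is hypothetical, and it assumes the base model fits in memory in FP16 on the available hardware.

```python
def merge_lora_adapter(base_model_name, adapter_dir, device_map=None):
    """Reload the base model in FP16 and fold a saved LoRA adapter into it.

    Hypothetical helper: the imports live inside the function so that merely
    defining it does not load the libraries or touch the GPU.
    """
    import torch
    from peft import PeftModel
    from transformers import AutoModelForCausalLM

    base_model = AutoModelForCausalLM.from_pretrained(
        base_model_name,
        low_cpu_mem_usage=True,
        torch_dtype=torch.float16,
        device_map=device_map,
    )
    # Attach the adapter weights saved by trainer.model.save_pretrained(...)
    peft_model = PeftModel.from_pretrained(base_model, adapter_dir)
    # Merge the LoRA matrices into the base weights and drop the adapter wrappers
    return peft_model.merge_and_unload()

# Usage in this notebook (downloads the base model, so it is left commented out):
# merged_model = merge_lora_adapter(model_name, new_model, device_map={"": 0})
```

The merged model behaves like a regular Transformers model, so it can be used for inference or uploaded without requiring PEFT on the consumer side.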

## 6 - Testing with QnA

In [19]:
# Ignore Warnings
logging.set_verbosity(logging.CRITICAL)

# Run text generation pipeline with our new model
prompt = "How to own a plane in the United States"
pipe = pipeline(
    task="text-generation",
    model = model,
    tokenizer = tokenizer,
    max_length=200
  )
result = pipe(f"<s>[INST] {prompt} [/INST]")
print(result[0]['generated_text'])

<s>[INST] How to own a plane in the United States [/INST] In the United States, owning a plane is a complex and expensive process that requires a significant amount of time, money, and expertise. Here are some general steps that you might need to follow in order to own a plane in the United States:

1. Determine your budget: Planes are expensive, and owning one can be a significant financial burden. You will need to determine how much you are willing to spend on a plane, as well as how much you are willing to spend on maintenance, insurance, and other expenses.
2. Research the market: There are many different types of planes available, and you will need to research the market to determine which type of plane is best for you. Consider factors such as the size of the plane, the type of engine, and the level of luxury.
3. Find a seller: Once you


## 7 - Push Model to Hugging Face Hub

In [None]:
import locale
locale.getpreferredencoding = lambda: "UTF-8"

In [None]:
!huggingface-cli login

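# `check_pr` is not a documented `push_to_hub` argument, so it is omitted here;
# pass create_pr=True to open a pull request instead of committing directly
model.push_to_hub("entbappy/Llama-2-7b-chat-finetune")

tokenizer.push_to_hub("entbappy/Llama-2-7b-chat-finetune")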
