<a href="https://colab.research.google.com/github/Zenith1618/LLM/blob/main/Finetuning_Orca_Mini_3B_using_QLoRA.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Setup

We will need accelerate, peft, transformers, datasets, scipy, and TRL (for its SFTTrainer). We will use bitsandbytes to quantize the base model to 4-bit.

In [1]:
!pip install -q -U trl transformers accelerate sentencepiece git+https://github.com/huggingface/peft.git
!pip install -q -U datasets bitsandbytes einops scipy wandb



# Dataset

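QLoRA: the base model weights are stored in 4-bit NormalFloat (NF4); the optional second step, double quantization, then compresses the quantization constants to 8-bit.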

Dataset: https://huggingface.co/datasets/mlabonne/guanaco-llama2-1k  (This dataset is multilingual)

The SFTTrainer from TRL is trained here on a raw text field, which is why each row of the dataset is a single string. This is a real advantage because it leaves the format up to you: you can fine-tune on large chunks of plain text, or lay out each string as an instruction-style or prompt-completion example.
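As a rough illustration of the raw-text format, the sketch below collapses one prompt/response pair into a single string. The helper name `to_llama2_text` is hypothetical, and the `[INST] ... [/INST]` template is an assumption based on the Llama 2 chat convention that the dataset name suggests; inspect a few rows of the `text` column before relying on it.

```python
def to_llama2_text(prompt: str, response: str) -> str:
    """Join one prompt/response pair into a single training string.

    Hypothetical helper: the template is an assumption based on the
    Llama 2 chat convention, not something the dataset guarantees.
    """
    return f"<s>[INST] {prompt.strip()} [/INST] {response.strip()} </s>"


# One row shaped like the dataset: a dict with a single "text" field.
row = {"text": to_llama2_text("What is QLoRA?", "A 4-bit fine-tuning method.")}
print(row["text"])
```

Because the trainer only sees the final string, switching to plain-text chunks or another prompt template only means changing how this string is built.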

In [2]:
from datasets import load_dataset

dataset_name = 'mlabonne/guanaco-llama2-1k'
dataset = load_dataset(dataset_name, split="train")



# Loading the model

Converting the model to 4-bit is necessary so that we can fine-tune it on a consumer-level GPU.
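A back-of-the-envelope estimate shows why the bit width matters. The sketch below counts memory for the weights only; the parameter count of 3e9 is a nominal assumption for a "3B" model, and `weight_memory_gib` is a hypothetical helper, not part of any library.

```python
def weight_memory_gib(num_params: float, bits_per_param: float) -> float:
    """Approximate memory needed for the weights alone, in GiB."""
    return num_params * bits_per_param / 8 / 2**30


# Assumption: a nominal 3e9 parameters for a "3B" model.
NUM_PARAMS = 3e9

for label, bits in [("fp32", 32), ("fp16/bf16", 16), ("4-bit", 4)]:
    print(f"{label:>9}: {weight_memory_gib(NUM_PARAMS, bits):.2f} GiB")
```

The estimate ignores quantization constants, activations, optimizer state, and any layers kept in higher precision, so the memory actually allocated on the GPU will be higher.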

In [3]:
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig

model_name = "psmathur/orca_mini_3b"

# Quantizing the model to 4bit before even loading to the GPU
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.bfloat16, # weights are stored in 4-bit and dequantized to bfloat16 for computation
)

model = AutoModelForCausalLM.from_pretrained(
    model_name,
    quantization_config=bnb_config,
    trust_remote_code=True
)

model.config.use_cache = False



Loading the tokenizer

In [4]:
tokenizer = AutoTokenizer.from_pretrained(model_name, trust_remote_code=True)
# Rows differ in token count, so shorter rows are padded to a common length before being batched
tokenizer.pad_token = tokenizer.eos_token
tokenizer.padding_side = "right" # pad on right with <EOS>
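To make the effect of right padding concrete, here is a minimal pure-Python sketch of what happens to a batch of unequal rows. The helper `pad_right` and the token ids are made up for illustration; the real tokenizer does this internally.

```python
def pad_right(batch, pad_id):
    """Right-pad lists of token ids to the longest row and build attention masks."""
    max_len = max(len(ids) for ids in batch)
    input_ids = [ids + [pad_id] * (max_len - len(ids)) for ids in batch]
    # 1 marks a real token, 0 marks padding the model should ignore.
    attention_mask = [[1] * len(ids) + [0] * (max_len - len(ids)) for ids in batch]
    return input_ids, attention_mask


# Made-up token ids; 2 stands in for the EOS id reused as the pad token.
input_ids, attention_mask = pad_right([[5, 6, 7], [8]], pad_id=2)
print(input_ids)
print(attention_mask)
```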

In [5]:
from peft import LoraConfig, get_peft_model

lora_alpha = 16
lora_dropout = 0.1
# MAIN THING
lora_r = 64     # LoRA rank: each adapted weight gets a low-rank update B @ A, with B (d_out x r) and A (r x d_in), instead of a full d_out x d_in update

# LoRA settings that control the size and regularization of the adapter
peft_config = LoraConfig(
    lora_alpha=lora_alpha,
    lora_dropout=lora_dropout,
    r=lora_r,
    bias="none",
    task_type="CAUSAL_LM"
)
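The rank determines how many trainable parameters the adapter adds. The sketch below counts them for a single weight matrix; the 3200 x 3200 shape is an assumption chosen only for illustration, and `lora_param_count` is a hypothetical helper. The `alpha / r` factor is the scaling applied to the low-rank update in standard LoRA.

```python
def lora_param_count(d_in: int, d_out: int, r: int) -> int:
    """Trainable parameters LoRA adds to one d_out x d_in weight matrix.

    The update is B @ A with B of shape (d_out, r) and A of shape (r, d_in).
    """
    return r * (d_in + d_out)


# Assumption: a square 3200 x 3200 projection, chosen only for illustration.
d = 3200
r, alpha = 64, 16  # same rank and alpha as in the LoRA config

full = d * d
lora = lora_param_count(d, d, r)
print(f"full matrix: {full:,} params")
print(f"LoRA update: {lora:,} params ({lora / full:.1%} of full)")
print(f"scaling applied to B @ A: alpha / r = {alpha / r}")
```

A higher rank gives the adapter more capacity at the cost of more trainable parameters; the base weights stay frozen either way.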

# Loading the Trainer

Here we will use the SFTTrainer from the TRL library, a wrapper around the transformers Trainer that makes it easy to fine-tune models on instruction-based datasets using PEFT adapters. Let's first set up the training arguments below.

In [6]:
from transformers import TrainingArguments

output_dir = "./results"
per_device_train_batch_size = 1  # rows per batch sent to the model on each device
gradient_accumulation_steps = 25 # batches to accumulate gradients over before each weight update
optim = "paged_adamw_32bit"      # paged AdamW, the optimizer used in the QLoRA paper
save_steps = 100                 # save a checkpoint every 100 update steps
logging_steps = 10               # log the training loss every 10 update steps
learning_rate = 4e-3
max_grad_norm = 0.3              # gradient clipping threshold
max_steps = -1                   # -1 sets no step limit, so num_train_epochs decides the run length
warmup_ratio = 0.03              # has no effect with the "constant" scheduler ("constant_with_warmup" would use it)
lr_scheduler_type = "constant"

training_arguments = TrainingArguments(
    output_dir=output_dir,
    per_device_train_batch_size=per_device_train_batch_size,
    gradient_accumulation_steps=gradient_accumulation_steps,
    optim=optim,
    save_steps=save_steps,
    logging_steps=logging_steps,
    learning_rate=learning_rate,
    fp16=True,
    max_grad_norm=max_grad_norm,
    max_steps=max_steps,
    warmup_ratio=warmup_ratio,
    group_by_length=True,
    lr_scheduler_type=lr_scheduler_type,
    num_train_epochs=1,
)

In [7]:
from trl import SFTTrainer

max_seq_length = 512

trainer = SFTTrainer(
    model=model,
    train_dataset=dataset,
    peft_config=peft_config,
    dataset_text_field="text",
    max_seq_length=max_seq_length,
    tokenizer=tokenizer,
    args=training_arguments,
)


We will also pre-process the model by upcasting the layer norms to float32 for more stable training.

In [8]:
for name, module in trainer.model.named_modules():
    if "norm" in name:
        # nn.Module.to() converts the module's parameters in place
        module.to(torch.float32)

In [9]:
torch.cuda.empty_cache()

In [10]:
torch.cuda.memory_allocated()

2803385856
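`torch.cuda.memory_allocated()` reports bytes, which is hard to read at a glance. A small conversion, using the value printed above and a hypothetical helper name, makes it easier to compare against the GPU's capacity:

```python
def bytes_to_gib(num_bytes: int) -> float:
    """Convert a byte count to gibibytes (1 GiB = 2**30 bytes)."""
    return num_bytes / 2**30


allocated = 2803385856  # value printed by torch.cuda.memory_allocated() above
print(f"{bytes_to_gib(allocated):.2f} GiB allocated")
```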

# Train the model

In [11]:
trainer.train()

wandb: Currently logged in as: zenith10. Use `wandb login --relogin` to force relogin


| Step | Training Loss |
|------|---------------|
| 10   | 1.5653        |
| 20   | 1.4831        |
| 30   | 1.5001        |
| 40   | 1.5043        |


TrainOutput(global_step=40, training_loss=1.5132023334503173, metrics={'train_runtime': 763.0879, 'train_samples_per_second': 1.31, 'train_steps_per_second': 0.052, 'total_flos': 7366691555731200.0, 'train_loss': 1.5132023334503173, 'epoch': 1.0})
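The step count and throughput in this output can be checked by hand from the training arguments. The sketch below assumes a single GPU and uses the 1000-row dataset size together with the runtime reported above.

```python
num_rows = 1000                   # rows in the training split
per_device_train_batch_size = 1
gradient_accumulation_steps = 25
train_runtime_s = 763.0879        # 'train_runtime' from the output above

# Assumption: one GPU, so the effective batch is batch size x accumulation steps.
effective_batch = per_device_train_batch_size * gradient_accumulation_steps
steps_per_epoch = num_rows // effective_batch

print(f"effective batch size: {effective_batch}")
print(f"optimizer steps per epoch: {steps_per_epoch}")
print(f"samples per second: {num_rows / train_runtime_s:.2f}")
print(f"steps per second: {steps_per_epoch / train_runtime_s:.3f}")
```

With one epoch this gives 40 optimizer steps, matching `global_step=40`, and with `logging_steps = 10` that explains the four logged loss values.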

The SFTTrainer will take care of properly saving only the adapters during training instead of saving the entire model.

In [12]:
model_to_save = trainer.model.module if hasattr(trainer.model, 'module') else trainer.model # Take care of distributed/parallel training
model_to_save.save_pretrained("outputs")

In [13]:
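from peft import PeftModel
# PeftModel.from_pretrained loads the trained adapter weights; get_peft_model would only attach a fresh, untrained adapter.
# In a new session, reload the quantized base model first and pass it in here.
model = PeftModel.from_pretrained(model, "outputs")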

In [14]:
text = '''### User: Hi, I am Zenith. Who are you?\n
### Assistant: '''
device = "cuda:0"

inputs = tokenizer(text, return_tensors="pt").to(device)
outputs = model.generate(**inputs, max_new_tokens=1024)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))

### User: Hi, I am Zenith. Who are you?

### Assistant: 
I'm sorry, but I'm not sure what you're referring to. Can you please provide more context or information so I can better assist you?


In [15]:
text = '''### User: What is the answer to life, universe and everything?\n
### Assistant: '''
device = "cuda:0"

inputs = tokenizer(text, return_tensors="pt").to(device)
outputs = model.generate(**inputs, max_new_tokens=1024)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))

### User: What is the answer to life, universe and everything?

### Assistant: 
The answer to life, universe, and everything is "42".


To push the model to the Hugging Face Hub:

In [None]:
model.push_to_hub("orca_mini_3B_toy_run_guanaco")

# Clearing GPU memory using numba

In [20]:
!pip install -q numba

In [21]:
from numba import cuda
device = cuda.get_current_device()
device.reset()

In [22]:
!nvidia-smi

Tue Mar 12 10:12:41 2024       
+---------------------------------------------------------------------------------------+
| NVIDIA-SMI 535.104.05             Driver Version: 535.104.05   CUDA Version: 12.2     |
|-----------------------------------------+----------------------+----------------------+
| GPU  Name                 Persistence-M | Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp   Perf          Pwr:Usage/Cap |         Memory-Usage | GPU-Util  Compute M. |
|                                         |                      |               MIG M. |
|   0  Tesla T4                       Off | 00000000:00:04.0 Off |                    0 |
| N/A   71C    P0              31W /  70W |     51MiB / 15360MiB |      0%      Default |
|                                         |                      |                  N/A |
+-----------------------------------------+----------------------+----------------------+
                                                                    