<a href="https://colab.research.google.com/github/argishh/LLM_Playground/blob/main/gemma/Gemma_Finetuning.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

## Finetuning Gemma-2b model

Installing required libraries

In [68]:
!pip3 install -q -U bitsandbytes==0.42.0 peft==0.9.0 trl==0.7.10 accelerate==0.27.1 datasets==2.17.0 transformers==4.38.0


In [None]:
!pip3 install git+https://github.com/huggingface/transformers git+https://github.com/huggingface/accelerate

Loading required libraries

In [1]:
import os
import transformers
import torch
from google.colab import userdata
from datasets import load_dataset
from trl import SFTTrainer
from peft import LoraConfig
from transformers import AutoTokenizer, AutoModelForCausalLM
from transformers import BitsAndBytesConfig, GemmaTokenizer

Load the [`Hugging Face`](https://huggingface.co/) access token. It is used to download the [`Gemma-2b`](https://huggingface.co/google/gemma-2b) model.

Note: the larger [`Gemma-7b`](https://huggingface.co/google/gemma-7b) model is also available on Hugging Face.

In [2]:
os.environ["HF_TOKEN"] = userdata.get('HF_TOKEN')

____

#### __4-bit Quantization__

**Quantization** in the context of deep learning is the process of constraining the number of bits that represent the weights and biases of the model.

In **4-bit quantization**, each weight or bias is represented using only `4 bits` as opposed to the typical `32 bits` used in single-precision floating-point format (`float32`).
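
To make the idea concrete, here is a minimal sketch of one simple scheme: symmetric absmax quantization to signed 4-bit codes. It is an illustration only. It is not the `nf4` scheme configured later in this notebook, and the function names `quantize_absmax_4bit` and `dequantize` are hypothetical.

```python
def quantize_absmax_4bit(weights):
    """Map floats to signed 4-bit integer codes in [-7, 7] with one shared scale."""
    max_abs = max(abs(w) for w in weights)
    # Guard against an all-zero input, where any scale works.
    scale = max_abs / 7 if max_abs > 0 else 1.0
    codes = [max(-7, min(7, round(w / scale))) for w in weights]
    return codes, scale


def dequantize(codes, scale):
    """Recover approximate floats from the integer codes."""
    return [c * scale for c in codes]


weights = [0.70, -0.30, 0.12, 0.00]
codes, scale = quantize_absmax_4bit(weights)
restored = dequantize(codes, scale)

for w, c, r in zip(weights, codes, restored):
    print(f"{w:+.2f} -> code {c:+d} -> {r:+.2f}")
```

Each restored value differs from the original by at most half a quantization step, which is the precision given up in exchange for the smaller storage.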

The primary advantage of using 4-bit quantization is the reduction in model size and memory usage. Here's a simple explanation:

- A `float32` number takes up `32 bits` of memory.
- A `4-bit quantized` number takes up only `4 bits` of memory.

So, theoretically, you can fit `8` times more `4-bit` quantized numbers into the same memory space as `float32` numbers. This allows you to load larger models into the GPU memory or use smaller GPUs that might not have been able to handle the model otherwise.
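
The 8x figure can be checked with a quick back-of-the-envelope calculation. The sketch below assumes a round 2 billion parameters for a "2b" model (an assumption, not a count read from the checkpoint) and considers the weights only.

```python
def weight_memory_gib(num_params, bits_per_param):
    """Memory needed to store the weights alone, in GiB."""
    return num_params * bits_per_param / 8 / 1024**3


num_params = 2_000_000_000  # assumed round figure for a "2b" model

fp32_gib = weight_memory_gib(num_params, 32)
int4_gib = weight_memory_gib(num_params, 4)

print(f"float32: {fp32_gib:.2f} GiB")
print(f"4-bit:   {int4_gib:.2f} GiB")
print(f"ratio:   {fp32_gib / int4_gib:.0f}x")
```

In practice the saving is somewhat smaller than 8x, because quantization constants, any layers kept in higher precision, and activations also occupy memory.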

____

#### __Llama 2 example__
For example, you may come across a configuration like this for the Llama 2 model:

```python
bnb_config = transformers.BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type='nf4',
    bnb_4bit_use_double_quant=True,
    bnb_4bit_compute_dtype=torch.bfloat16
)
```
Here:
- `load_in_4bit=True`: Enables 4-bit quantization.
- `bnb_4bit_quant_type='nf4'`: Specifies the type of 4-bit quantization (`nf4` is `4-bit NormalFloat`).
- `bnb_4bit_use_double_quant=True`: Enables double quantization, which also quantizes the quantization constants to save additional memory.
- `bnb_4bit_compute_dtype=torch.bfloat16`: Specifies the data type used for computation, which is `bfloat16` here.
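
The memory effect of double quantization can be estimated with a little arithmetic. The block sizes below (64 weights per block, 256 constants per second-level block) and the 32-bit and 8-bit constant widths follow the description in the QLoRA paper. They are assumptions here, not values read from this notebook's configuration.

```python
WEIGHT_BLOCK = 64   # weights sharing one quantization constant (assumed)
CONST_BLOCK = 256   # constants sharing one second-level constant (assumed)

# Without double quantization: one 32-bit constant per block of weights.
overhead_single = 32 / WEIGHT_BLOCK

# With double quantization: each constant is stored in 8 bits, plus one
# 32-bit second-level constant per CONST_BLOCK first-level constants.
overhead_double = 8 / WEIGHT_BLOCK + 32 / (WEIGHT_BLOCK * CONST_BLOCK)

print(f"overhead without double quantization: {overhead_single:.3f} bits/param")
print(f"overhead with double quantization:    {overhead_double:.3f} bits/param")
```

Under these assumptions the bookkeeping overhead drops from about 0.5 to about 0.127 bits per parameter.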


> **Note:** By using `4-bit quantization`, you can load the `Llama 2` or `Gemma` model with significantly less GPU memory, making it more accessible for devices with limited resources.

#### **Loading `Gemma-2b` Model and configuring `BitsAndBytesConfig()` for `4-bit quantization`.**

In [3]:
model_id = "google/gemma-2b"
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.float16
)

In [4]:
tokenizer = AutoTokenizer.from_pretrained(model_id, token=os.environ['HF_TOKEN'])
model = AutoModelForCausalLM.from_pretrained(model_id,
                                             quantization_config=bnb_config,
                                             device_map={"":0},
                                             token=os.environ['HF_TOKEN'])


In [30]:
text = "Quote: there is way, "
device = "cuda:0"
inputs = tokenizer(text, return_tensors="pt").to(device)

outputs = model.generate(**inputs, max_new_tokens=20)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))

Quote: there is way, 
Quote: but if you are not willing, 
Quote: there is no way to you


In [33]:
text = "Quote: Life goes on, but never forget what "
inputs = tokenizer(text, return_tensors="pt").to(device)

outputs = model.generate(**inputs, max_new_tokens=20)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))

Quote: Life goes on, but never forget what 
Quote: is important.
Author: Nicolas Chamfort
Quote: The most wasted of all


In [7]:
# Note: "true" disables Weights & Biases logging in transformers; "false" leaves it enabled.
os.environ["WANDB_DISABLED"] = "false"

In [8]:
lora_config = LoraConfig(
    r = 8,
    target_modules = ["q_proj", "o_proj", "k_proj", "v_proj",
                      "gate_proj", "up_proj", "down_proj"],
    task_type = "CAUSAL_LM",
)
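
For intuition about what `r = 8` means: LoRA keeps a `d x k` weight matrix frozen and trains two small matrices of shapes `d x r` and `r x k` instead. The sketch below counts parameters for a single projection. The `2048 x 2048` shape is an assumed example, not a value read from the model, and `lora_param_count` is a hypothetical helper.

```python
def lora_param_count(d, k, r):
    """Trainable parameters LoRA adds for one d x k weight matrix."""
    return r * (d + k)


d, k, r = 2048, 2048, 8  # d and k are assumed example dimensions

full_params = d * k
lora_params = lora_param_count(d, k, r)

print(f"full matrix:   {full_params:,} parameters")
print(f"LoRA adapters: {lora_params:,} parameters")
print(f"reduction:     {full_params // lora_params}x fewer trainable parameters")
```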

In [9]:
from datasets import load_dataset

data = load_dataset("Abirate/english_quotes")
data = data.map(lambda samples: tokenizer(samples["quote"]), batched=True)

In [10]:
data['train']['quote'][:10]

['“Be yourself; everyone else is already taken.”',
 "“I'm selfish, impatient and a little insecure. I make mistakes, I am out of control and at times hard to handle. But if you can't handle me at my worst, then you sure as hell don't deserve me at my best.”",
 "“Two things are infinite: the universe and human stupidity; and I'm not sure about the universe.”",
 '“So many books, so little time.”',
 '“A room without books is like a body without a soul.”',
 "“Be who you are and say what you feel, because those who mind don't matter, and those who matter don't mind.”",
 "“You've gotta dance like there's nobody watching,Love like you'll never be hurt,Sing like there's nobody listening,And live like it's heaven on earth.”",
 "“You know you're in love when you can't fall asleep because reality is finally better than your dreams.”",
 '“You only live once, but if you do it right, once is enough.”',
 '“Be the change that you wish to see in the world.”']

In [11]:
def formatting_func(example):
    # Build one "Quote / Author" training string from the first entry of each field.
    text = f"Quote: {example['quote'][0]}\nAuthor: {example['author'][0]}"
    return [text]
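
The function above reads only index 0 of each field. If the trainer calls it with a batch (a dict of lists), which is an assumption about how this `trl` version behaves, only the first row of each batch would be turned into a training example. A hypothetical variant, `formatting_func_batched`, that formats every row might look like this:

```python
def formatting_func_batched(examples):
    """Return one formatted string per row of a batch (a dict of lists)."""
    return [
        f"Quote: {quote}\nAuthor: {author}"
        for quote, author in zip(examples["quote"], examples["author"])
    ]


# Small hand-made batch; the author names are placeholders for illustration.
batch = {
    "quote": ["“So many books, so little time.”",
              "“Be yourself; everyone else is already taken.”"],
    "author": ["Author A", "Author B"],
}
texts = formatting_func_batched(batch)

for text in texts:
    print(text)
    print("---")
```

Swapping this variant into the trainer would change which examples are seen, so the loss values and generations shown in this notebook would no longer apply.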

In [12]:
trainer = SFTTrainer(
    model=model,
    train_dataset=data["train"],
    args=transformers.TrainingArguments(
        per_device_train_batch_size=1,
        gradient_accumulation_steps=4,
        warmup_steps=2,
        max_steps=100,
        learning_rate=2e-4,
        fp16=True,
        logging_steps=1,
        output_dir="outputs",
        optim="paged_adamw_8bit"
    ),
    peft_config=lora_config,
    formatting_func=formatting_func,
)

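# This configuration object is not passed to the trainer above.
# Assumes an accelerate version that provides DataLoaderConfiguration.
from accelerate.utils import DataLoaderConfiguration

dataloader_config = DataLoaderConfiguration(dispatch_batches=None, split_batches=False, even_batches=True, use_seedable_sampler=True)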


In [13]:
trainer.train()

| Step | Training Loss |
|------|---------------|
| 1    | 1.681         |
| 2    | 0.6306        |
| 3    | 1.0229        |
| 4    | 1.0312        |
| 5    | 0.4202        |
| 6    | 1.2294        |
| 7    | 1.0921        |
| 8    | 0.3317        |
| 9    | 0.5629        |
| 10   | 0.5048        |


TrainOutput(global_step=100, training_loss=0.14572522912174463, metrics={'train_runtime': 59.0771, 'train_samples_per_second': 6.771, 'train_steps_per_second': 1.693, 'total_flos': 54994550906880.0, 'train_loss': 0.14572522912174463, 'epoch': 66.67})
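
The reported `epoch` value is worth a sanity check against the training arguments. The sketch below assumes a single GPU and uses only numbers that appear in the trainer configuration and in the `TrainOutput` above.

```python
per_device_train_batch_size = 1
gradient_accumulation_steps = 4
max_steps = 100
reported_epochs = 66.67  # 'epoch' from the TrainOutput above

# Samples consumed per optimizer step and over the whole run (single GPU assumed).
samples_per_step = per_device_train_batch_size * gradient_accumulation_steps
samples_seen = samples_per_step * max_steps

# Implied number of training examples in one pass over the data.
examples_per_epoch = samples_seen / reported_epochs

print(f"samples seen:       {samples_seen}")
print(f"examples per epoch: {examples_per_epoch:.1f}")
```

The result is roughly 6 examples per epoch, which suggests that only a handful of formatted examples reached the trainer. If so, the low final loss may reflect memorization of those few quotes rather than broader learning.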

In [34]:
text = "Quote: at the end of the day, "
inputs = tokenizer(text, return_tensors="pt").to(device)

outputs = model.generate(**inputs, max_new_tokens=20)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))

Quote: at the end of the day, 
the only real prison is fear,
the only real freedom is freedom from fear
Quote:


In [38]:
text = "Quote: You only live once, but "
inputs = tokenizer(text, return_tensors="pt").to(device)

outputs = model.generate(**inputs, max_new_tokens=50)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))

Quote: You only live once, but 
if you do it right, once is enough.
Author: Aung San Suu Kyi
Quote: The most wasted of all days is one without laughter.
Author: Nicolas Chamfort
Quote: The most wasted of all days is
