<a href="https://colab.research.google.com/github/adityaahj/tarp-project/blob/main/finetune_gemma_essay.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

In [None]:
!nvidia-smi

Mon Apr 22 20:59:32 2024       
+---------------------------------------------------------------------------------------+
| NVIDIA-SMI 535.104.05             Driver Version: 535.104.05   CUDA Version: 12.2     |
|-----------------------------------------+----------------------+----------------------+
| GPU  Name                 Persistence-M | Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp   Perf          Pwr:Usage/Cap |         Memory-Usage | GPU-Util  Compute M. |
|                                         |                      |               MIG M. |
|   0  Tesla T4                       Off | 00000000:00:04.0 Off |                    0 |
| N/A   64C    P8              11W /  70W |      0MiB / 15360MiB |      0%      Default |
|                                         |                      |                  N/A |
+-----------------------------------------+----------------------+----------------------+
                                                                    

## Step 1 - Model loading
The model is loaded with 4-bit quantization (the QLoRA setup) to reduce memory usage.


In [None]:
!pip3 install -q -U bitsandbytes==0.42.0
!pip3 install -q -U peft==0.8.2
!pip3 install -q -U trl==0.7.10
!pip3 install -q -U accelerate==0.27.1
!pip3 install -q -U datasets==2.10.1
!pip3 install -q -U transformers==4.38.0

In [None]:
import torch
from transformers import AutoTokenizer, AutoModelForCausalLM, BitsAndBytesConfig

bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_use_double_quant=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.bfloat16
)
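To make the memory saving concrete, here is a rough back-of-the-envelope sketch. It assumes a round figure of 2.5 billion weights (an illustrative number, not a measured one) and counts weight storage only: quantization constants, layers kept in higher precision, activations, and optimizer state are all ignored. The helper name `weight_memory_gib` is hypothetical.

```python
def weight_memory_gib(num_params: int, bits_per_param: float) -> float:
    """Approximate storage for the weights alone, in GiB."""
    return num_params * bits_per_param / 8 / 1024**3

# Assumed round parameter count, for illustration only.
NUM_PARAMS = 2_500_000_000

fp16_gib = weight_memory_gib(NUM_PARAMS, 16)
nf4_gib = weight_memory_gib(NUM_PARAMS, 4)

print(f"16-bit weights: ~{fp16_gib:.2f} GiB")
print(f" 4-bit weights: ~{nf4_gib:.2f} GiB")
print(f"ratio: {fp16_gib / nf4_gib:.0f}x")
```

The real footprint on the GPU will be somewhat higher than the 4-bit figure, for the reasons listed above.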

Next, we specify the model ID and load the model with the quantization configuration defined above.

In [None]:
from huggingface_hub import notebook_login
notebook_login()


In [None]:
# model_id = "google/gemma-7b-it"
model_id = "google/gemma-2b-it"

model = AutoModelForCausalLM.from_pretrained(model_id, quantization_config=bnb_config, device_map={"":0})
tokenizer = AutoTokenizer.from_pretrained(model_id, add_eos_token=True)

The secret `HF_TOKEN` does not exist in your Colab secrets.
To authenticate with the Hugging Face Hub, create a token in your settings tab (https://huggingface.co/settings/tokens), set it as secret in your Google Colab and restart your session.
You will be able to reuse this secret in all of your notebooks.
Please note that authentication is recommended but still optional to access public models or datasets.



In [None]:
def get_completion(query: str, model, tokenizer) -> str:
  """Generate a response to `query` using the Gemma chat turn format."""
  device = "cuda:0"

  prompt_template = """
  <start_of_turn>user
  Below is an instruction that describes a task. Write a response that appropriately completes the request.
  {query}
  <end_of_turn>\n<start_of_turn>model

  """
  prompt = prompt_template.format(query=query)

  # Tokenize the prompt and move the tensors to the GPU
  encodeds = tokenizer(prompt, return_tensors="pt", add_special_tokens=True)
  model_inputs = encodeds.to(device)

  # Sample up to 1000 new tokens; EOS doubles as the pad token
  generated_ids = model.generate(**model_inputs, max_new_tokens=1000, do_sample=True, pad_token_id=tokenizer.eos_token_id)
  # The decoded text contains the prompt followed by the completion
  decoded = tokenizer.decode(generated_ids[0], skip_special_tokens=True)
  return decoded
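Because the template in `get_completion` is written inside an indented function body, every line of the resulting prompt carries leading spaces. The sketch below shows one way to build the same turn structure without that indentation. The name `build_prompt` is hypothetical, and whether the extra whitespace affects generation quality is not tested here.

```python
INSTRUCTION = (
    "Below is an instruction that describes a task. "
    "Write a response that appropriately completes the request."
)

def build_prompt(query: str) -> str:
    """Build a single-turn Gemma-style prompt with no leading indentation."""
    return (
        "<start_of_turn>user\n"
        f"{INSTRUCTION}\n"
        f"{query}\n"
        "<end_of_turn>\n"
        "<start_of_turn>model\n"
    )

prompt = build_prompt("write an essay about peter pan")
print(prompt)
```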

## Step 3 - Load dataset for finetuning

### Let's Load the Dataset

We will use a dataset of essay-writing instructions to improve the base model's long-form text generation.

In [None]:
from datasets import load_dataset

dataset = load_dataset("iamketan25/essay-instructions-dataset", split="train")
dataset



Dataset({
    features: ['prompt', 'chosen', '__index_level_0__'],
    num_rows: 1857
})

In [None]:
df = dataset.to_pandas()
df.head(10)

| | prompt | chosen | `__index_level_0__` |
|---|---|---|---|
| 0 | Human: Write the original essay that provided ... | “Through the Looking Glass” Critical Essay\n\n... | 1559 |
| 1 | Human: Write the original essay that provided ... | The Concept of Antimicrobial Agents Essay\n\nT... | 1351 |
| 2 | Human: Write the full essay for the following ... | Nursing Practice Concerning Patients With Card... | 467 |
| 3 | Human: Write a essay that could've provided th... | Racism and Society: Different Perspectives Ter... | 497 |
| 4 | Human: Write the original essay that generated... | Assessment of Innovation in Organisations Repo... | 1419 |
| 5 | Human: Write the full essay for the following ... | Sergey Brin: Leadership Process and Organizati... | 1253 |
| 6 | Human: Create the inputted essay that provided... | Ford’s Acquisition and Disposal of Volvo, Jagu... | 1702 |
| 7 | Human: Provide a essay that could have been th... | Social Status in “The Necklace” by Guy de Maup... | 1046 |
| 8 | Human: Provide the inputted essay that when su... | Sony, Microsoft, and Nintendo Report (Assessme... | 1147 |
| 9 | Human: Write the original essay that generated... | Acute Care Nurse Practitioner in Gerontology E... | 1762 |


Instruction fine-tuning: prepare the dataset in a prompt format the model can learn from:
1. Use the `generate_prompt` function to combine each instruction and its output into a single prompt
2. Shuffle the dataset
3. Tokenize the dataset

### Formatting the Dataset

Now, let's format the dataset in the required [Gemma instruction format](https://huggingface.co/google/gemma-7b-it).

```
<start_of_turn>user What is your favorite condiment? <end_of_turn>
<start_of_turn>model Well, I'm quite partial to a good squeeze of fresh lemon juice. It adds just the right amount of zesty flavor to whatever I'm cooking up in the kitchen!<end_of_turn>
```

The following code formats each example this way and adds the result to the dataset as a new `gen_prompt` column:

In [None]:
def generate_prompt(data_point):
    """Format one example as a Gemma user turn followed by a model turn.

    :param data_point: dict: Data point with "prompt" and "chosen" keys
    :return: str: formatted prompt text (not yet tokenized)
    """

    text = f"""<start_of_turn>user {data_point["prompt"]} <end_of_turn>\n<start_of_turn>model {data_point["chosen"]} <end_of_turn>"""
    return text

# add the "gen_prompt" column to the dataset
text_column = [generate_prompt(data_point) for data_point in dataset]
dataset = dataset.add_column("gen_prompt", text_column)
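To see what one formatted training example looks like, the sketch below applies the same formatting to a made-up record. The `sample` dictionary is hypothetical; it only mirrors the `prompt` and `chosen` keys of the real dataset rows.

```python
def generate_prompt(data_point: dict) -> str:
    """Same formatting as above: one user turn followed by one model turn."""
    return (
        f"<start_of_turn>user {data_point['prompt']} <end_of_turn>\n"
        f"<start_of_turn>model {data_point['chosen']} <end_of_turn>"
    )

# Hypothetical record with the same keys as the dataset rows.
sample = {
    "prompt": "Human: Write a short essay about rivers.",
    "chosen": "Rivers Essay\n\nRivers shape the land they cross.",
}

text = generate_prompt(sample)
print(text)
```

Each example therefore contains exactly one user turn and one model turn, so the model sees the instruction and the target essay in a single sequence.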



We'll need to tokenize our data so the model can process it.


In [None]:
dataset = dataset.shuffle(seed=1234)
dataset = dataset.map(lambda samples: tokenizer(samples["gen_prompt"]), batched=True)



Split the dataset into 80% for training and 20% for testing

In [None]:
dataset = dataset.train_test_split(test_size=0.2)
train_data = dataset["train"]
test_data = dataset["test"]
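As a quick sanity check on the split sizes, the arithmetic for `test_size=0.2` on the 1857-row dataset can be sketched as follows. The helper `split_sizes` is hypothetical, and rounding the test share up is an assumption, chosen because it is consistent with the 372 test rows the notebook reports.

```python
import math

def split_sizes(num_rows: int, test_size: float) -> tuple[int, int]:
    """Return (train_rows, test_rows), rounding the test share up (assumed rule)."""
    test_rows = math.ceil(num_rows * test_size)
    return num_rows - test_rows, test_rows

train_rows, test_rows = split_sizes(1857, 0.2)
print(train_rows, test_rows)
```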

When fine-tuning with SFT (**[Supervised Fine-tuning Trainer](https://huggingface.co/docs/trl/main/en/sft_trainer)**), we pass only the `gen_prompt` column of the dataset as the training text.

In [None]:
print(test_data)

Dataset({
    features: ['prompt', 'chosen', '__index_level_0__', 'gen_prompt', 'input_ids', 'attention_mask'],
    num_rows: 372
})


## Step 4 - Apply LoRA
Here comes the magic with PEFT! We prepare the quantized model for training with the `prepare_model_for_kbit_training` method, then wrap it with low-rank adapters (LoRA) using the `get_peft_model` utility function.

In [None]:
from peft import LoraConfig, PeftModel, prepare_model_for_kbit_training, get_peft_model
model.gradient_checkpointing_enable()
model = prepare_model_for_kbit_training(model)

In [None]:
print(model)

GemmaForCausalLM(
  (model): GemmaModel(
    (embed_tokens): Embedding(256000, 2048, padding_idx=0)
    (layers): ModuleList(
      (0-17): 18 x GemmaDecoderLayer(
        (self_attn): GemmaSdpaAttention(
          (q_proj): Linear4bit(in_features=2048, out_features=2048, bias=False)
          (k_proj): Linear4bit(in_features=2048, out_features=256, bias=False)
          (v_proj): Linear4bit(in_features=2048, out_features=256, bias=False)
          (o_proj): Linear4bit(in_features=2048, out_features=2048, bias=False)
          (rotary_emb): GemmaRotaryEmbedding()
        )
        (mlp): GemmaMLP(
          (gate_proj): Linear4bit(in_features=2048, out_features=16384, bias=False)
          (up_proj): Linear4bit(in_features=2048, out_features=16384, bias=False)
          (down_proj): Linear4bit(in_features=16384, out_features=2048, bias=False)
          (act_fn): GELUActivation()
        )
        (input_layernorm): GemmaRMSNorm()
        (post_attention_layernorm): GemmaRMSNorm()
     

In [None]:
import bitsandbytes as bnb
def find_all_linear_names(model):
  """Return the names of all 4-bit linear layers, to use as LoRA target modules."""
  cls = bnb.nn.Linear4bit
  lora_module_names = set()
  for name, module in model.named_modules():
    if isinstance(module, cls):
      # keep only the last path component, e.g. "q_proj"
      lora_module_names.add(name.split('.')[-1])
  # the output head is excluded from LoRA; check once, after the loop
  lora_module_names.discard('lm_head')
  return list(lora_module_names)

In [None]:
modules = find_all_linear_names(model)
print(modules)

['up_proj', 'down_proj', 'gate_proj', 'q_proj', 'v_proj', 'o_proj', 'k_proj']
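The name extraction can be illustrated without loading a model. The sketch below runs the same "last path component" logic over a hand-written list of dotted module names. Both `leaf_names` and `example_names` are hypothetical; the names only imitate the style of `named_modules()` output.

```python
def leaf_names(module_names: list[str], exclude: tuple[str, ...] = ("lm_head",)) -> list[str]:
    """Collect the last path component of each dotted module name."""
    leaves = {name.split(".")[-1] for name in module_names}
    return sorted(leaves - set(exclude))

# Hypothetical dotted names in the style returned by named_modules().
example_names = [
    "model.layers.0.self_attn.q_proj",
    "model.layers.0.self_attn.k_proj",
    "model.layers.1.self_attn.q_proj",
    "model.layers.1.mlp.down_proj",
    "lm_head",
]

print(leaf_names(example_names))
```

Using a set collapses the per-layer repetition, which is why the printed list above contains each projection name only once.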


In [None]:
from peft import LoraConfig, get_peft_model

lora_config = LoraConfig(
    r=64,
    lora_alpha=32,
    target_modules=modules,
    lora_dropout=0.05,
    bias="none",
    task_type="CAUSAL_LM"
)

model = get_peft_model(model, lora_config)
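For intuition about what `r` and `lora_alpha` control, here is a toy sketch of the standard LoRA update, in which the frozen weight `W` is effectively replaced by `W + (lora_alpha / r) * B @ A`. It uses plain Python lists and made-up 2x2 values. The helper names are hypothetical, and this is an illustration of the formula, not PEFT's implementation.

```python
def matmul(a, b):
    """Multiply two matrices given as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]

def lora_effective_weight(w, lora_a, lora_b, r, lora_alpha):
    """Return W + (lora_alpha / r) * (B @ A)."""
    scale = lora_alpha / r
    delta = matmul(lora_b, lora_a)
    return [[wv + scale * dv for wv, dv in zip(w_row, d_row)]
            for w_row, d_row in zip(w, delta)]

# Toy 2x2 frozen weight with a rank-1 adapter (illustrative values only).
w = [[1.0, 0.0], [0.0, 1.0]]
lora_a = [[1.0, 2.0]]      # shape (r, in_features)
lora_b = [[0.5], [1.0]]    # shape (out_features, r)

effective = lora_effective_weight(w, lora_a, lora_b, r=1, lora_alpha=2)
print(effective)
print("scale with r=64, lora_alpha=32:", 32 / 64)
```

With the configuration above (`r=64`, `lora_alpha=32`), the adapter output is scaled by 0.5 under this formula.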

In [None]:
trainable, total = model.get_nb_trainable_parameters()
print(f"Trainable: {trainable} | total: {total} | Percentage: {trainable/total*100:.4f}%")

Trainable: 78446592 | total: 2584619008 | Percentage: 3.0351%
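The trainable count can be reproduced by hand from the layer shapes in the model printout. Each adapted linear layer gains two matrices, `A` of shape `(r, in_features)` and `B` of shape `(out_features, r)`. The sketch below is a plain arithmetic check, assuming that all seven projection types in each of the 18 decoder layers receive an adapter.

```python
R = 64
NUM_LAYERS = 18

# (in_features, out_features) per adapted projection, read from the model printout.
SHAPES = {
    "q_proj": (2048, 2048),
    "k_proj": (2048, 256),
    "v_proj": (2048, 256),
    "o_proj": (2048, 2048),
    "gate_proj": (2048, 16384),
    "up_proj": (2048, 16384),
    "down_proj": (16384, 2048),
}

# Each adapter adds A (r x in) and B (out x r), i.e. r * (in + out) weights.
per_layer = sum(R * (d_in + d_out) for d_in, d_out in SHAPES.values())
trainable = per_layer * NUM_LAYERS
total = 2_584_619_008  # total reported above

print(f"Trainable: {trainable} | Percentage: {trainable / total * 100:.4f}%")
```

The MLP projections dominate the count because of their 16384-wide dimension.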


## Step 5 - Run the training!

### Fine-Tuning with QLoRA and Supervised Fine-Tuning

We're ready to fine-tune our model using QLoRA. We'll use the `SFTTrainer` from the `trl` library for supervised fine-tuning.

In [None]:
#new code using SFTTrainer
import transformers

from trl import SFTTrainer

tokenizer.pad_token = tokenizer.eos_token
torch.cuda.empty_cache()

trainer = SFTTrainer(
    model=model,
    train_dataset=train_data,
    eval_dataset=test_data,
    dataset_text_field="gen_prompt",
    peft_config=lora_config,
    args=transformers.TrainingArguments(
        per_device_train_batch_size=1,
        gradient_accumulation_steps=4,
        warmup_ratio=0.03,  # 0.03 is a fraction of total steps; warmup_steps expects an integer step count
        max_steps=100,
        learning_rate=2e-4,
        logging_steps=1,
        output_dir="outputs",
        optim="paged_adamw_8bit",
        save_strategy="epoch",
    ),
    data_collator=transformers.DataCollatorForLanguageModeling(tokenizer, mlm=False),
)



Map:   0%|          | 0/1485 [00:00<?, ? examples/s]

Map:   0%|          | 0/372 [00:00<?, ? examples/s]



## Let's start training

In [None]:
model.config.use_cache = False  # the KV cache is incompatible with gradient checkpointing; disabling it silences the warnings
trainer.train()



Step,Training Loss
1,3.3829
2,3.238
3,3.0055
4,3.2461
5,3.0125
6,2.6104
7,2.9542
8,2.7983
9,2.8894
10,2.6401


TrainOutput(global_step=100, training_loss=2.5416150522232055, metrics={'train_runtime': 2029.5324, 'train_samples_per_second': 0.197, 'train_steps_per_second': 0.049, 'total_flos': 4764878074257408.0, 'train_loss': 2.5416150522232055, 'epoch': 0.27})
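The reported `epoch` of 0.27 follows from the batch settings. The sketch below redoes that arithmetic from the numbers in the training arguments and the `TrainOutput`, assuming a single GPU so the per-device batch size is not multiplied by a device count.

```python
per_device_train_batch_size = 1
gradient_accumulation_steps = 4
max_steps = 100
train_rows = 1485            # size of the training split
train_runtime_s = 2029.5324  # from TrainOutput

# Single GPU assumed, so no multiplication by device count.
effective_batch = per_device_train_batch_size * gradient_accumulation_steps
samples_seen = effective_batch * max_steps

print("effective batch size:", effective_batch)
print("samples seen:", samples_seen)
print("epochs:", round(samples_seen / train_rows, 2))
print("samples/s:", round(samples_seen / train_runtime_s, 3))
```

In other words, 100 optimizer steps cover only about a quarter of the training split. A full pass would need roughly 1485 / 4 ≈ 371 steps.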

In [None]:
new_model = "gemma-essay-instruct-finetune"

In [None]:
trainer.model.save_pretrained(new_model)

In [None]:
base_model = AutoModelForCausalLM.from_pretrained(
    model_id,
    low_cpu_mem_usage=True,
    return_dict=True,
    torch_dtype=torch.float16,
    device_map={"": 0},
)
merged_model = PeftModel.from_pretrained(base_model, new_model)
merged_model = merged_model.merge_and_unload()

# Saving merged model
merged_model.save_pretrained("merged_model",safe_serialization=True)
tokenizer.save_pretrained("merged_model")
tokenizer.pad_token = tokenizer.eos_token
tokenizer.padding_side = "right"


In [None]:
# Pushing the model and tokenizer to the Hugging Face Model Hub
merged_model.push_to_hub(new_model, use_temp_dir=False)
tokenizer.push_to_hub(new_model, use_temp_dir=False)

CommitInfo(commit_url='https://huggingface.co/arnavj007/gemma-essay-instruct-finetune/commit/0ffa398f9ef6f4b2f54dd732a9798662d719b72e', commit_message='Upload tokenizer', commit_description='', oid='0ffa398f9ef6f4b2f54dd732a9798662d719b72e', pr_url=None, pr_revision=None, pr_num=None)

## Test the Fine-tuned Model

In [None]:
result = get_completion(query="write an essay about peter pan", model=merged_model, tokenizer=tokenizer)
print(result)

A decoder-only architecture is being used, but right-padding was detected! For correct generation results, please set `padding_side='left'` when initializing the tokenizer.
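The warning is about where pad tokens sit relative to the prompt. A decoder-only model continues from the last position of each row, so with right padding a shorter prompt in a batch ends in pad tokens instead of real text. The toy sketch below shows the two layouts on made-up token ids. `pad_batch` is a hypothetical helper, not a tokenizer API, and the sketch does not diagnose why the warning fired in this particular run.

```python
def pad_batch(sequences, pad_id, side="left"):
    """Pad token-id lists to a common length on the chosen side."""
    if side not in ("left", "right"):
        raise ValueError("side must be 'left' or 'right'")
    width = max(len(seq) for seq in sequences)
    padded = []
    for seq in sequences:
        fill = [pad_id] * (width - len(seq))
        padded.append(fill + seq if side == "left" else seq + fill)
    return padded

# Toy token ids; 0 stands in for the pad token (illustrative values only).
batch = [[5, 6, 7], [8, 9]]
left = pad_batch(batch, pad_id=0, side="left")
right = pad_batch(batch, pad_id=0, side="right")
print(left)
print(right)
```

With left padding, every row ends in a real prompt token, which is the layout the warning recommends for generation.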



  user
  Below is an instruction that describes a task. Write a response that appropriately completes the request.
  write an essay about peter pan
  
model

   Peter Pan Essay

The idea for Peter Pan, a young boy who never grows old and stays eternally young, is one of the most appealing ones in children’s literature. The reason for this is the lack of a natural explanation or cause to why people around him aged. The result is also due to the writer’s ability to use and present different styles to portray the story of Peter Pan to children of various ages.

In a novel called The Adventures of Peter Pan, by J. M. Barrie, the author uses different stylistic techniques to portray the life of Peter; one of the most prominent is the use of magic.

In the book, Peter has wings, fairies in the sky and animals in the forest that give him magic life. His magic, however, comes at a price – all his friends around him are also children; they are not allowed to grow old, and once they age, they d