## Prerequisites

Before delving into the fine-tuning process, ensure that you have the following prerequisites in place:

1. **GPU**: [gemma-2b](https://huggingface.co/google/gemma-2b) can be fine-tuned on a T4 (available on the free tier of Google Colab), while [gemma-7b](https://huggingface.co/google/gemma-7b) requires an A100 GPU.
2. **Python Packages**: Install the necessary Python packages with the following commands:

In [None]:
!pip3 install datasets transformers wandb --quiet
!pip3 install mosaicml[nlp] --quiet

## Step 2 - Model loading
We'll load the model with 4-bit quantization (the QLoRA setup) to reduce memory usage.
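
As a rough illustration of why 4-bit loading helps, the sketch below estimates the memory taken by the weights alone at two precisions. The parameter counts (2 and 7 billion) are nominal figures read off the model names, not exact counts, and `weight_memory_gib` is a hypothetical helper name. Activations, gradients, optimizer state, and quantization overhead are ignored.

```python
def weight_memory_gib(n_params: int, bits_per_param: float) -> float:
    """Approximate memory for the weights alone, in GiB."""
    return n_params * bits_per_param / 8 / 2**30

# Nominal parameter counts (assumption: taken from the model names).
NOMINAL_PARAMS = {"gemma-2b": 2_000_000_000, "gemma-7b": 7_000_000_000}

for name, n_params in NOMINAL_PARAMS.items():
    for bits in (16, 4):
        print(f"{name} at {bits}-bit: ~{weight_memory_gib(n_params, bits):.1f} GiB")
```

Going from 16-bit to 4-bit weights cuts this part of the footprint by a factor of four, which is what makes the smaller model fit comfortably on a T4.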


In [None]:
!pip3 install -q -U bitsandbytes==0.42.0
!pip3 install -q -U peft==0.8.2
!pip3 install -q -U trl==0.7.10
!pip3 install -q -U accelerate==0.27.1
!pip3 install -q -U datasets==2.17.0
!pip3 install -q -U transformers==4.38.0

In [None]:
import torch
from transformers import AutoTokenizer, AutoModelForCausalLM, BitsAndBytesConfig

bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_use_double_quant=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.bfloat16
)

Now we specify the model ID and load the model with the quantization configuration defined above.

In [None]:
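# If you are using Google Colab, read the token from the Secrets panel
# rather than hard-coding it in the notebook.
# Assumes a Colab secret named HF_TOKEN has been created.

import os
from google.colab import userdata
os.environ["HF_TOKEN"] = userdata.get("HF_TOKEN")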
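
Outside Colab, `google.colab` is not available. A minimal sketch of a fallback, assuming the token has been exported as the `HF_TOKEN` environment variable before the notebook starts; `get_hf_token` is a hypothetical helper name, not part of any library.

```python
import os

def get_hf_token(env_var: str = "HF_TOKEN") -> str:
    """Return the Hugging Face token from the environment, failing loudly if absent."""
    token = os.environ.get(env_var, "").strip()
    if not token:
        raise RuntimeError(
            f"{env_var} is not set; export it in your shell instead of "
            "hard-coding the token in the notebook."
        )
    return token
```

Keeping the token out of the notebook source matters because notebooks are often shared or committed with their cells intact.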

In [None]:
#model_id = "google/gemma-7b-it"
#model_id = "google/gemma-7b"
model_id = "google/gemma-2b-it"
# model_id = "google/gemma-2b"

model = AutoModelForCausalLM.from_pretrained(model_id, quantization_config=bnb_config, device_map={"":0})
tokenizer = AutoTokenizer.from_pretrained(model_id, add_eos_token=True)






In [None]:
def get_completion(query: str, model, tokenizer) -> str:
  device = "cuda:0"

  prompt_template = """
  You are a teacher of computer science and you answer the following question:
  Here is the structure of the DataFrame you will be working with:

  filepath = "/content/DailyDelhiClimateTrain.csv"
  dataset = pd.read_csv(filepath)

  columns are : ['date', 'meantemp', 'humidity', 'wind_speed', 'meanpressure']
  types: [date, float, float, float, float]

  must start with ```python
  must end with ```
  must not contain any def function
  must focus on the color specified
  must contain only one time ```python and ``` not more

  {query}
  Answer :
  """
  prompt = prompt_template.format(query=query)

  encodeds = tokenizer(prompt, return_tensors="pt", add_special_tokens=True)

  model_inputs = encodeds.to(device)


  generated_ids = model.generate(**model_inputs, max_new_tokens=1000, do_sample=True, pad_token_id=tokenizer.eos_token_id)
  decoded = tokenizer.decode(generated_ids[0], skip_special_tokens=True)
  return decoded
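
Because the decoded text echoes the whole prompt, the generated code has to be cut out of it before it can be used. A minimal sketch of one way to do that, assuming the answer follows the `Answer :` marker used in the prompt template; `extract_code` and `FENCE` are hypothetical names introduced for illustration.

```python
import re
from typing import Optional

FENCE = "`" * 3  # three backticks, built this way so the example stays fence-safe

def extract_code(generated: str, answer_marker: str = "Answer :") -> Optional[str]:
    """Return the first fenced Python block that follows the answer marker."""
    # The prompt itself mentions the fence, so only look after the marker.
    _, found, tail = generated.partition(answer_marker)
    answer = tail if found else generated
    # Accept a missing closing fence, since generation can stop mid-block.
    pattern = re.escape(FENCE) + r"python[ \t]*\n(.*?)(?:" + re.escape(FENCE) + r"|\Z)"
    match = re.search(pattern, answer, flags=re.DOTALL)
    return match.group(1).strip() if match else None

sample = (
    "must start with " + FENCE + "python\n"
    "Answer :\n"
    + FENCE + "python\nprint('hello')\n" + FENCE
)
print(extract_code(sample))
```

The function returns `None` when no fenced block is found, which makes it easy to count how often the model ignores the formatting rules.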

In [None]:
result = get_completion(query="Plot a time serie of humidity in python using matplotlib", model=model, tokenizer=tokenizer)
print(result)

A decoder-only architecture is being used, but right-padding was detected! For correct generation results, please set `padding_side='left'` when initializing the tokenizer.


"
  You are a teacher of computer science and you answer the following question:
  Here is the structure of the DataFrame you will be working with:

  filepath = "/content/DailyDelhiClimateTrain.csv"
  dataset = pd.read_csv(filepath)

  columns are : ['date', 'meantemp', 'humidity', 'wind_speed', 'meanpressure']
  types: [date, float, float, float, float]

  must start with ```python
  must end with ```
  most contain only the final answer without all the previous text :

  Plot a time serie of humidity in python using matplotlib
  Answer :
  ```python
import matplotlib.pyplot as plt
import pandas as pd

# Load the data
filepath = "/content/DailyDelhiClimateTrain.csv"
dataset = pd.read_csv(filepath)

# Create the time series
df_humidity = dataset['humidity'].to_datetime()

# Group the data by date
humidity_grouped = df_humidity.groupby(df_humidity.index)

# Create a line chart
plt.plot(humidity_grouped.index, humidity_grouped['humidity'])
plt.xlabel('Date')
plt.ylabel('Humidity')
plt.t

## Step 3 - Load dataset for finetuning

### Let's Load the Dataset

For this tutorial, we will fine-tune Gemma 2B Instruct (`google/gemma-2b-it`) for code generation.

The code below loads [flytech/python-codes-25k](https://huggingface.co/datasets/flytech/python-codes-25k), which has the same alpaca-style `instruction`, `input`, and `output` columns as this [dataset](https://huggingface.co/datasets/TokenBender/code_instructions_122k_alpaca_style) curated by [TokenBender (e/xperiments)](https://twitter.com/4evaBehindSOTA), an excellent data source for fine-tuning models for code generation. The alpaca style of instructions is a good starting point for this task. A record in that style resembles the following:

```json
{
  "instruction": "Create a function to calculate the sum of a sequence of integers.",
  "input": "[1, 2, 3, 4, 5]",
  "output": "# Python code def sum_sequence(sequence): sum = 0 for num in sequence: sum += num return sum"
}
```

In [None]:
from datasets import load_dataset

dataset = load_dataset("flytech/python-codes-25k", split="train")
dataset

Dataset({
    features: ['text', 'input', 'instruction', 'output'],
    num_rows: 49626
})

In [None]:
df = dataset.to_pandas()
df.head(10)

Unnamed: 0,text,input,instruction,output
0,Help me set up my daily to-do list! Setting up...,Setting up your daily to-do list...,Help me set up my daily to-do list!,```python\ntasks = []\nwhile True:\n task =...
1,Create a shopping list based on my inputs! Cre...,Creating a shopping list...,Create a shopping list based on my inputs!,```python\nshopping_list = {}\nwhile True:\n ...
2,Calculate how much time I spend on my phone pe...,Calculating weekly phone usage...,Calculate how much time I spend on my phone pe...,"```python\ntotal_time = 0\nfor i in range(1, 8..."
3,Help me split the bill among my friends! Split...,Splitting the bill...,Help me split the bill among my friends!,```python\ntotal_bill = float(input('Enter the...
4,Organize my movie list into genres! Organizing...,Organizing your movie list...,Organize my movie list into genres!,```python\nmovie_list = {}\nwhile True:\n g...
5,Calculate the average rating of my book collec...,Calculating the average rating of your book co...,Calculate the average rating of my book collec...,```python\nratings = []\nwhile True:\n rati...
6,Create a playlist based on my mood! Creating a...,Creating a playlist...,Create a playlist based on my mood!,```python\nmood = input('What's your mood toda...
7,Help me find the best deals on my shopping lis...,Finding the best deals...,Help me find the best deals on my shopping list!,```python\nbest_deals = {}\nwhile True:\n i...
8,Calculate how much I need to save per month fo...,Calculating monthly savings for your vacation...,Calculate how much I need to save per month fo...,```python\nvacation_cost = float(input('Enter ...
9,Determine the most efficient route for my erra...,Determining the most efficient route...,Determine the most efficient route for my erra...,```python\nlocations = []\nwhile True:\n lo...


Instruction fine-tuning: prepare the dataset in a "prompt" format that the model can learn from:
1. the function `generate_prompt` takes the instruction and output and generates a prompt
2. shuffle the dataset
3. tokenize the dataset

### Formatting the Dataset

Now, let's format the dataset in the required [Gemma instruction format](https://huggingface.co/google/gemma-7b-it).

> Many tutorials and blogs skip over this part, but I feel this is a really important step.

```
<start_of_turn>user What is your favorite condiment? <end_of_turn>
<start_of_turn>model Well, I'm quite partial to a good squeeze of fresh lemon juice. It adds just the right amount of zesty flavor to whatever I'm cooking up in the kitchen!<end_of_turn>
```

You can use the following code to process your dataset and add a `prompt` column in the correct format:

In [None]:
def generate_prompt(data_point):
    """Build the training text from a task instruction, optional context, and answer.

    :param data_point: dict: Data point with "instruction", "input", and "output" keys
    :return: str: prompt in the Gemma turn format
    """
    prefix_text = 'Below is an instruction that describes a task. Write a response that ' \
               'appropriately completes the request.\n\n'
    # Samples with additional context info.
    if data_point['input']:
        text = f"""<start_of_turn>user {prefix_text} {data_point["instruction"]} here are the inputs {data_point["input"]} <end_of_turn>\n<start_of_turn>model {data_point["output"]} <end_of_turn>"""
    # Samples without additional context.
    else:
        text = f"""<start_of_turn>user {prefix_text} {data_point["instruction"]} <end_of_turn>\n<start_of_turn>model {data_point["output"]} <end_of_turn>"""
    return text

# add the "prompt" column in the dataset
text_column = [generate_prompt(data_point) for data_point in dataset]
dataset = dataset.add_column("prompt", text_column)
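
Before tokenizing, it can be worth checking that every formatted prompt contains exactly one user turn followed by one model turn. A minimal sketch; `check_gemma_turns` is a hypothetical helper, and the single-exchange assumption matches the prompts built above but not multi-turn chats.

```python
def check_gemma_turns(text: str) -> bool:
    """Check that turn markers are balanced and the user turn precedes the model turn."""
    starts = text.count("<start_of_turn>")
    ends = text.count("<end_of_turn>")
    user_at = text.find("<start_of_turn>user")
    model_at = text.find("<start_of_turn>model")
    # find() returns -1 when a marker is missing, which fails the ordering test.
    return starts == ends == 2 and 0 <= user_at < model_at

good = "<start_of_turn>user Say hi <end_of_turn>\n<start_of_turn>model hi <end_of_turn>"
bad = "<start_of_turn>user Say hi <end_of_turn>\n<start_of_turn>model hi"
print(check_gemma_turns(good), check_gemma_turns(bad))
```

Applied to the notebook's data, something like `[i for i, p in enumerate(dataset["prompt"]) if not check_gemma_turns(p)]` would list the rows worth inspecting by hand.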

Next, we tokenize the data so the model can consume it.


In [None]:
dataset = dataset.shuffle(seed=1234)  # Shuffle dataset here
dataset = dataset.map(lambda samples: tokenizer(samples["prompt"]), batched=True)

Split the dataset into 80% for training and 20% for testing:

In [None]:
dataset = dataset.train_test_split(test_size=0.2)
train_data = dataset["train"]
test_data = dataset["test"]
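
The split sizes can be checked by hand. The sketch below assumes that, with `test_size=0.2`, the test share is rounded up and the remainder goes to training; that assumption agrees with the 9926 test rows printed later in this section.

```python
import math

n_rows = 49626    # num_rows printed when the dataset was loaded
test_size = 0.2   # value passed to train_test_split

# Assumption: the test share is rounded up, the rest is used for training.
n_test = math.ceil(n_rows * test_size)
n_train = n_rows - n_test
print(n_train, n_test)  # 39700 9926
```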

### After Formatting, We should get something like this

```json
{
"text":"<start_of_turn>user Create a function to calculate the sum of a sequence of integers. here are the inputs [1, 2, 3, 4, 5] <end_of_turn>
<start_of_turn>model # Python code def sum_sequence(sequence): sum = 0 for num in sequence: sum += num return sum <end_of_turn>",
"instruction":"Create a function to calculate the sum of a sequence of integers",
"input":"[1, 2, 3, 4, 5]",
"output":"# Python code def sum_sequence(sequence): sum = 0 for num in sequence: sum += num return sum",
"prompt":"<start_of_turn>user Create a function to calculate the sum of a sequence of integers. here are the inputs [1, 2, 3, 4, 5] <end_of_turn>
<start_of_turn>model # Python code def sum_sequence(sequence): sum = 0 for num in sequence: sum += num return sum <end_of_turn>"
}
```

When fine-tuning with SFT (**[Supervised Fine-tuning Trainer](https://huggingface.co/docs/trl/main/en/sft_trainer)**), we pass only the `prompt` column of the dataset to the trainer, through `dataset_text_field="prompt"`.

In [None]:
print(test_data)

Dataset({
    features: ['text', 'input', 'instruction', 'output', 'prompt', 'input_ids', 'attention_mask'],
    num_rows: 9926
})


## Step 4 - Apply LoRA
Here comes the magic with PEFT! We first prepare the quantized model for training with `prepare_model_for_kbit_training`, then wrap it with low-rank adapters (LoRA) using the `get_peft_model` utility function from PEFT.

In [None]:
from peft import LoraConfig, PeftModel, prepare_model_for_kbit_training, get_peft_model
model.gradient_checkpointing_enable()
model = prepare_model_for_kbit_training(model)

In [None]:
print(model)

GemmaForCausalLM(
  (model): GemmaModel(
    (embed_tokens): Embedding(256000, 2048, padding_idx=0)
    (layers): ModuleList(
      (0-17): 18 x GemmaDecoderLayer(
        (self_attn): GemmaSdpaAttention(
          (q_proj): Linear4bit(in_features=2048, out_features=2048, bias=False)
          (k_proj): Linear4bit(in_features=2048, out_features=256, bias=False)
          (v_proj): Linear4bit(in_features=2048, out_features=256, bias=False)
          (o_proj): Linear4bit(in_features=2048, out_features=2048, bias=False)
          (rotary_emb): GemmaRotaryEmbedding()
        )
        (mlp): GemmaMLP(
          (gate_proj): Linear4bit(in_features=2048, out_features=16384, bias=False)
          (up_proj): Linear4bit(in_features=2048, out_features=16384, bias=False)
          (down_proj): Linear4bit(in_features=16384, out_features=2048, bias=False)
          (act_fn): GELUActivation()
        )
        (input_layernorm): GemmaRMSNorm()
        (post_attention_layernorm): GemmaRMSNorm()
     

In [None]:
import bitsandbytes as bnb

def find_all_linear_names(model):
  """Return the names of all 4-bit linear layers, used as LoRA target modules."""
  cls = bnb.nn.Linear4bit
  lora_module_names = set()
  for name, module in model.named_modules():
    if isinstance(module, cls):
      # Keep only the last path component, e.g. "q_proj"
      lora_module_names.add(name.split('.')[-1])
  # lm_head is excluded from the LoRA targets (needed for 16-bit)
  lora_module_names.discard('lm_head')
  return list(lora_module_names)

In [None]:
modules = find_all_linear_names(model)
print(modules)

['q_proj', 'down_proj', 'k_proj', 'o_proj', 'up_proj', 'gate_proj', 'v_proj']


In [None]:
from peft import LoraConfig, get_peft_model

lora_config = LoraConfig(
    r=64,
    lora_alpha=32,
    target_modules=modules,
    lora_dropout=0.05,
    bias="none",
    task_type="CAUSAL_LM"
)

model = get_peft_model(model, lora_config)

In [None]:
trainable, total = model.get_nb_trainable_parameters()
print(f"Trainable: {trainable} | total: {total} | Percentage: {trainable/total*100:.4f}%")

Trainable: 78446592 | total: 2584619008 | Percentage: 3.0351%
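
The trainable count can be reproduced by hand: for each targeted linear layer, LoRA adds one `r x in_features` matrix and one `out_features x r` matrix. The sketch below uses the layer shapes from the `print(model)` output and `r=64` from the `LoraConfig`; the names `PROJECTIONS` and `lora_params` are introduced here for illustration only.

```python
# (in_features, out_features) of each targeted projection, read from print(model)
PROJECTIONS = {
    "q_proj": (2048, 2048),
    "k_proj": (2048, 256),
    "v_proj": (2048, 256),
    "o_proj": (2048, 2048),
    "gate_proj": (2048, 16384),
    "up_proj": (2048, 16384),
    "down_proj": (16384, 2048),
}
N_LAYERS = 18  # "(0-17): 18 x GemmaDecoderLayer"
RANK = 64      # r in LoraConfig

def lora_params(in_features: int, out_features: int, rank: int) -> int:
    """Parameters added by one adapter: A (rank x in) plus B (out x rank)."""
    return rank * (in_features + out_features)

per_layer = sum(lora_params(i, o, RANK) for i, o in PROJECTIONS.values())
trainable = per_layer * N_LAYERS
print(trainable)  # 78446592
print(f"{trainable / 2584619008 * 100:.4f}%")  # 3.0351%
```

Since the count scales linearly with `r`, halving the rank would halve the number of trainable parameters.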


## Step 5 - Run the training!

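Setting the training arguments:
* For demonstration purposes, training runs for only a few steps, just to showcase how this integration works with existing tools in the HF ecosystem. The commented-out `Trainer` configuration below uses 100 steps; the `SFTTrainer` run later in this section uses 300.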

In [None]:
# import transformers

# tokenizer.pad_token = tokenizer.eos_token


# trainer = transformers.Trainer(
#     model=model,
#     train_dataset=train_data,
#     eval_dataset=test_data,
#     args=transformers.TrainingArguments(
#         per_device_train_batch_size=1,
#         gradient_accumulation_steps=4,
#         warmup_steps=0.03,
#         max_steps=100,
#         learning_rate=2e-4,
#         fp16=True,
#         logging_steps=1,
#         output_dir="outputs_mistral_b_finance_finetuned_test",
#         optim="paged_adamw_8bit",
#         save_strategy="epoch",
#     ),
#     data_collator=transformers.DataCollatorForLanguageModeling(tokenizer, mlm=False),
# )


### Fine-Tuning with QLoRA and Supervised Fine-Tuning

We're ready to fine-tune our model using QLoRA. For this tutorial, we'll use the `SFTTrainer` from the `trl` library for supervised fine-tuning. Ensure that you've installed the `trl` library as shown in Step 2.

In [None]:
import wandb

# Initialize wandb
wandb.init(project="dst-LLM")

Run summary:

| Metric | Value |
| --- | --- |
| eval/loss | 0.98496 |
| eval/runtime | 1493.561 |
| eval/samples_per_second | 6.646 |
| eval/steps_per_second | 0.831 |
| train/epoch | 0.03 |
| train/global_step | 300.0 |
| train/grad_norm | 1.0733 |
| train/learning_rate | 0.0 |
| train/loss | 0.8322 |
| train/total_flos | 2430242837176320.0 |


In [None]:
import transformers
from trl import SFTTrainer
from torch.optim import AdamW
from torch.optim.lr_scheduler import LambdaLR, LinearLR
from transformers import EarlyStoppingCallback

initial_lr = 5e-5

# Define the optimizer over the model parameters
optimizer = AdamW(params=model.parameters(), lr=initial_lr, weight_decay=0.01, betas=(0.9, 0.99))

# Total number of training steps (kept equal to max_steps below)
total_steps = 300
# Polynomial decay schedule. LambdaLR multiplies the optimizer's base lr by the
# value returned here, so the lambda returns a factor in [0, 1], not an absolute lr.
lrscheduler = LambdaLR(optimizer, lr_lambda=lambda step: max(0.0, 1 - step / total_steps) ** 0.9)

# Alternative: linear schedule
#lrscheduler = LinearLR(optimizer, start_factor=0.1, total_iters=total_steps)

# Use the EOS token as the padding token
tokenizer.pad_token = tokenizer.eos_token

# Empty CUDA cache
torch.cuda.empty_cache()

# Set up the trainer with early stopping.
# Named `trainer` because the following cells call trainer.train() and trainer.model.
trainer = SFTTrainer(
    model=model,
    train_dataset=train_data,
    eval_dataset=test_data,
    dataset_text_field="prompt",
    peft_config=lora_config,
    args=transformers.TrainingArguments(
        per_device_train_batch_size=1,
        gradient_accumulation_steps=4,
        # warmup_steps expects an integer; a fraction belongs in warmup_ratio.
        # The built-in warmup only applies when no custom scheduler is passed.
        warmup_ratio=0.03,
        max_steps=300,
        learning_rate=5e-5,
        logging_steps=1,
        output_dir="outputs",
        optim="paged_adamw_8bit",
        save_strategy="epoch",
        evaluation_strategy="epoch",
        load_best_model_at_end=True,  # Ensure to load the best model at the end
        metric_for_best_model="eval_loss",  # Specify the metric to use for early stopping
        #report_to="wandb",  # Report to wandb for logging
    ),
    data_collator=transformers.DataCollatorForLanguageModeling(tokenizer, mlm=False),
    optimizers=(optimizer, lrscheduler),  # Pass the custom optimizer and scheduler
    callbacks=[
        EarlyStoppingCallback(early_stopping_patience=3)  # Early stopping callback
    ],
)
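
The cell above keeps a `LinearLR` warm-up (`start_factor=0.1`) commented out next to the polynomial `LambdaLR` decay. A sketch of how the two could be expressed as a single multiplicative factor, assuming 300 warm-up steps followed by 300 decay steps; the function name `two_phase_factor` and the phase lengths are illustrative assumptions, not the exact configuration of the recorded run.

```python
def two_phase_factor(step: int, warmup_steps: int = 300, decay_steps: int = 300,
                     start_factor: float = 0.1, power: float = 0.9) -> float:
    """Multiplier applied to the base learning rate at a given step."""
    if step < warmup_steps:
        # Phase 1: linear ramp from start_factor up to 1.0
        return start_factor + (1.0 - start_factor) * step / warmup_steps
    # Phase 2: polynomial decay from 1.0 down to 0.0, clamped after the last step
    progress = min(1.0, (step - warmup_steps) / decay_steps)
    return (1.0 - progress) ** power

base_lr = 5e-5
for step in (0, 150, 300, 450, 600):
    print(step, base_lr * two_phase_factor(step))
```

A function of this shape could be passed as `lr_lambda` to `LambdaLR`, which multiplies the optimizer's base learning rate by the returned factor at each step.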




## Let's start training

In [None]:
model.config.use_cache = False  # silence the warnings. Please re-enable for inference!
trainer.train()

| Epoch | Training Loss | Validation Loss |
| --- | --- | --- |
| 0 | 0.8322 | 0.984963 |


TrainOutput(global_step=300, training_loss=0.6599002648393313, metrics={'train_runtime': 1999.7567, 'train_samples_per_second': 0.6, 'train_steps_per_second': 0.15, 'total_flos': 2430242837176320.0, 'train_loss': 0.6599002648393313, 'epoch': 0.03})
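
The figures in `TrainOutput` follow from the training arguments. A short check, assuming a single GPU and a training split of 39700 rows (49626 rows minus the 9926-row test split); the variable names are illustrative.

```python
per_device_batch = 1   # per_device_train_batch_size
grad_accum = 4         # gradient_accumulation_steps
max_steps = 300
train_rows = 39700     # assumption: 49626 rows minus the 9926-row test split
runtime_s = 1999.7567  # train_runtime reported by TrainOutput

# Each optimizer step consumes batch size x accumulation steps samples.
samples_seen = per_device_batch * grad_accum * max_steps
print(samples_seen)                         # 1200
print(round(samples_seen / train_rows, 2))  # 0.03 (fraction of one epoch)
print(round(samples_seen / runtime_s, 1))   # 0.6 (train_samples_per_second)
print(round(max_steps / runtime_s, 2))      # 0.15 (train_steps_per_second)
```

In other words, the run saw about 3% of the training split, so the reported losses describe a short demonstration run rather than a full epoch.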

I chose multi-stage learning rate schedulers to train the model efficiently and avoid divergence during training. Initially, I set a relatively high initial learning rate to explore the optimization space quickly. In my first attempts, however, this approach led to divergence: the training loss rose rapidly with no significant improvement in model performance.

To address this, I opted for a two-phase learning rate strategy. During the first 300 steps, I used a linear scheduler (`LinearLR`) with a start factor of 0.1 to stabilize training and let the model converge initially. This phase damped the early fluctuations and prepared the model for finer adjustments.

Then, from step 300 to step 600, I switched to a polynomial scheduler (`LambdaLR`) with an exponent of 0.9 to gradually reduce the learning rate. This phase let the model focus on finer-grained optimization and refine its performance.

This combination of schedulers reached a minimum train/loss of 0.3, indicating effective convergence while avoiding the divergence observed earlier. Adjusting the schedulers to the progress of training and to the model's performance made training more efficient and gave satisfactory results in terms of model quality and stability.

## Share adapters on the 🤗 Hub

In [None]:
new_model = "gemma-2b-it-python-25K_v2"  # Name of the model you will push to the Hugging Face Hub

In [None]:
trainer.model.save_pretrained(new_model)

In [None]:
# Reload the base model in half precision (not quantized) so the adapter can be merged into it
base_model = AutoModelForCausalLM.from_pretrained(
    model_id,
    low_cpu_mem_usage=True,
    return_dict=True,
    torch_dtype=torch.float16,
    device_map={"": 0},
)
# Attach the saved LoRA adapter, then fold its weights into the base model
merged_model = PeftModel.from_pretrained(base_model, new_model)
merged_model = merged_model.merge_and_unload()

# Save the merged model
merged_model.save_pretrained("merged_model", safe_serialization=True)
tokenizer.save_pretrained("merged_model")
tokenizer.pad_token = tokenizer.eos_token
tokenizer.padding_side = "right"
# Push the model and tokenizer to the Hugging Face Model Hub
merged_model.push_to_hub(new_model, use_temp_dir=False)
tokenizer.push_to_hub(new_model, use_temp_dir=False)

## Test the Fine-tuned Model

In [None]:
result = get_completion(query="Plot a time serie of meantemp in python using matplotlib using color=red", model=merged_model, tokenizer=tokenizer)
print(result)

A decoder-only architecture is being used, but right-padding was detected! For correct generation results, please set `padding_side='left'` when initializing the tokenizer.


"
  You are a teacher of computer science and you answer the following question:
  Here is the structure of the DataFrame you will be working with:

  filepath = "/content/DailyDelhiClimateTrain.csv"
  dataset = pd.read_csv(filepath)

  columns are : ['date', 'meantemp', 'humidity', 'wind_speed', 'meanpressure']
  types: [date, float, float, float, float]

  must start with ```python
  must end with ```
  must not contain any def function
  most focus on the color specified
  must contain only one time ```python and ``` not more

  Plot a time serie of meantemp in python using matplotlib using color=red
  Answer :
  ```python
import matplotlib.pyplot as plt
import pandas as pd

# First import the libraries `matplotlib.pyplot` and `pandas`
# Declare df
df = pd.read_csv("/content/DailyDelhiClimateTrain.csv")
# create a line plot 
x = df['date'] # x-axis [datetime]
y = df['meantemp'] # y-axis [double]
colors = df['color'] # color of each point [string]

plt.plot(x,y,c=colors, label='MeanT