# **Fine-tuning Mistral 7B Instruct using QLoRA and Supervised Fine-Tuning on a multi-turn chat dataset**

Welcome to this workshop!

In this workshop, we are going to learn how to fine-tune an open LLM (Mistral-7B) using QLoRA on a multi-turn chat dataset.

In our example, we are going to leverage Hugging Face Transformers, TRL, Accelerate, and PEFT.

In detail, you will learn how to:


1.   Setup Development Environment
2.   Load and prepare the dataset
3.   Fine-Tune Mistral 7B with QLoRA


# 1. Setup Development Environment


Before diving into the fine-tuning process, make sure you have the following prerequisites.

GPU: This tutorial can run on a free Google Colab notebook with a GPU. Make sure the GPU is accessible with the following command:

In [None]:
!nvidia-smi

Fri Nov 17 13:32:32 2023       
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 525.105.17   Driver Version: 525.105.17   CUDA Version: 12.0     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                               |                      |               MIG M. |
|   0  Tesla T4            Off  | 00000000:00:04.0 Off |                    0 |
| N/A   54C    P8    12W /  70W |      0MiB / 15360MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
                                                                               

Then install the dependencies:

In [None]:
!pip install -q torch  # machine learning framework; here we mainly need its data types
!pip install -q git+https://github.com/huggingface/transformers  # Hugging Face Transformers, to download model weights
!pip install -q datasets  # Hugging Face Datasets, to download datasets from the Hub and manipulate them
!pip install -q peft  # Parameter-Efficient Fine-Tuning, for QLoRA (Quantized Low-Rank Adaptation) fine-tuning
!pip install -q bitsandbytes  # model weight quantization
!pip install -q trl  # Transformer Reinforcement Learning, used here for supervised fine-tuning
!pip install -q wandb -U  # Weights & Biases, to monitor metrics during training


In [None]:
import json
import re
from pprint import pprint

import pandas as pd
import torch
from datasets import Dataset, load_dataset
from huggingface_hub import notebook_login
from peft import LoraConfig, PeftModel
from transformers import (
    AutoModelForCausalLM,
    AutoTokenizer,
    BitsAndBytesConfig,
    TrainingArguments,
    pipeline,
    logging,
)
from trl import SFTTrainer # For supervised finetuning
import bitsandbytes as bnb
# For dataset formatting:
from typing import List, Literal, Optional
import random



In [None]:
from huggingface_hub import notebook_login
# Log in to HF Hub
notebook_login()


In [None]:
import wandb
wandb.login()



# 2. Load and prepare the dataset



For this workshop, we will fine-tune Mistral for a multi-turn chat task.

At Hugging Face, we recently released [Zephyr](https://huggingface.co/collections/HuggingFaceH4/zephyr-7b-6538c6d6d5ddd1cbb1744a66), a series of chat models fine-tuned from Mistral 7B.

We will be using this [dataset](https://huggingface.co/datasets/HuggingFaceH4/ultrachat_200k), which was part of Zephyr's initial training data. Since we don't want the training to take forever, we'll only keep a small number of examples.

In [None]:
ultrachat_dataset = load_dataset("HuggingFaceH4/ultrachat_200k", split="train_sft")

N = 2500  # replace with the number of rows you want
ultrachat_dataset = ultrachat_dataset.select(range(N))
ultrachat_dataset

Dataset({
    features: ['prompt', 'prompt_id', 'messages'],
    num_rows: 2500
})
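
Note that `select(range(N))` keeps the first `N` rows rather than a random sample. If a random subset is preferred, one option is to draw the row indices first and pass them to `Dataset.select`. The sketch below only shows the index selection, using the standard library; the helper name `sample_indices` and the seed are assumptions made for illustration.

```python
import random
from typing import List


def sample_indices(total_rows: int, n: int, seed: int = 42) -> List[int]:
    """Return n distinct row indices drawn uniformly from range(total_rows)."""
    rng = random.Random(seed)  # local RNG, so the global random state is untouched
    return sorted(rng.sample(range(total_rows), n))


# Small demo with made-up sizes
indices = sample_indices(total_rows=20, n=5)
print(indices)
```

With the real dataset, something like `ultrachat_dataset.select(sample_indices(len(ultrachat_dataset), N))` would then replace `select(range(N))`.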

In [None]:
DEFAULT_CHAT_TEMPLATE = "{% for message in messages %}\n{% if message['role'] == 'user' %}\n{{ '<|user|>\n' + message['content'] + eos_token }}\n{% elif message['role'] == 'system' %}\n{{ '<|system|>\n' + message['content'] + eos_token }}\n{% elif message['role'] == 'assistant' %}\n{{ '<|assistant|>\n'  + message['content'] + eos_token }}\n{% endif %}\n{% if loop.last and add_generation_prompt %}\n{{ '<|assistant|>' }}\n{% endif %}\n{% endfor %}"


def apply_chat_template(
    example, tokenizer, assistant_prefix="<|assistant|>\n"
):
    """Render the `messages` of one example into a single `text` string."""
    # Note: `assistant_prefix` is currently unused.
    messages = example["messages"]
    # We add an empty system message if there is none
    if messages[0]["role"] != "system":
        messages.insert(0, {"role": "system", "content": ""})
    example["text"] = tokenizer.apply_chat_template(
        messages, tokenize=False, add_generation_prompt=False
    )


    return example
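
To get a feel for what the chat template produces, here is a minimal pure-Python sketch that approximates it: each turn becomes a `<|role|>` tag on its own line, followed by the content and the end-of-sequence token. The function `render_chat` is a hypothetical helper, not part of Transformers, and the exact whitespace emitted by the Jinja template may differ slightly.

```python
from typing import Dict, List


def render_chat(messages: List[Dict[str, str]], eos_token: str = "</s>") -> str:
    """Approximate the Zephyr-style template: one '<|role|>' header per turn."""
    parts = []
    for message in messages:
        # Role tag on its own line, then the content followed by the EOS token.
        parts.append(f"<|{message['role']}|>\n{message['content']}{eos_token}\n")
    return "".join(parts)


conversation = [
    {"role": "system", "content": ""},
    {"role": "user", "content": "Hello!"},
    {"role": "assistant", "content": "Hi, how can I help?"},
]
print(render_chat(conversation))
```

Unlike the real template, this sketch accepts any role name and has no `add_generation_prompt` handling.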



In [None]:
# The model that you want to train from the Hugging Face hub
model_name = "mistralai/Mistral-7B-v0.1"

# The name of the fine-tuned model. Yes the "I" is on purpose, it's lower grade Zephyr
new_model = "Zhephir-7B-sft"

wandb_project_name = "Zhephir-7B-sft"

In [None]:

tokenizer = AutoTokenizer.from_pretrained(model_name)

if tokenizer.chat_template is None:
    tokenizer.chat_template = DEFAULT_CHAT_TEMPLATE

ultrachat_dataset = ultrachat_dataset.map(apply_chat_template, fn_kwargs={"tokenizer": tokenizer})
ultrachat_dataset


Dataset({
    features: ['prompt', 'prompt_id', 'messages', 'text'],
    num_rows: 2500
})

In [None]:

dataset = ultrachat_dataset.train_test_split(test_size=0.2)
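
With 2,500 rows and `test_size=0.2`, we expect 2,000 training rows and 500 test rows. The sketch below illustrates the idea of a shuffled split on plain indices. It is an illustration only and makes no claim about how `datasets` implements `train_test_split` internally; `split_indices` and its seed are hypothetical.

```python
import random
from typing import List, Tuple


def split_indices(n_rows: int, test_size: float, seed: int = 0) -> Tuple[List[int], List[int]]:
    """Shuffle row indices and cut them into (train, test) parts."""
    indices = list(range(n_rows))
    random.Random(seed).shuffle(indices)
    n_test = round(n_rows * test_size)  # number of rows held out for testing
    return indices[n_test:], indices[:n_test]


train_idx, test_idx = split_indices(2500, 0.2)
print(len(train_idx), len(test_idx))
```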

In [None]:
for index in random.sample(range(len(dataset["train"])), 3):
  print(f"#################################  SAMPLE {index} ##################################################")
  print(f"\n\n{dataset['train'][index]['text']} \n\n")

#################################  SAMPLE 226 ##################################################


<|system|>
</s>
<|user|>
Looking for quick recipe to enjoy as part of your breakfast? Each tasty Open-Face Cheese and Egg Sandwich features a fried egg, KRAFT SINGLES Light Cheese Slice, baby arugula and grape tomatoes on whole grain toast. Why wait? Try it tomorrow.
Melt margarine in large nonstick skillet on medium-low heat.
Slip cracked eggs, 1 at a time, into skillet, leaving spaces between eggs; cook 5 min. Or until egg whites are set and yolks are cooked to desired doneness.
Place 1 toast slice on each of 4 serving plates. Top with Kraft Singles, arugula, tomatoes and eggs.
Can you please provide me with the ingredients required to make an Open-Face Cheese and Egg Sandwich?</s>
<|assistant|>
- 4 slices of whole grain bread
- 4 large eggs
- 4 KRAFT SINGLES Light Cheese Slices
- 2 cups of baby arugula
- 1 cup of grape tomatoes
- 1 tablespoon of margarine</s>
<|user|>
Can you give me a

# 3. Fine-Tune Mistral 7B with QLoRA


In [None]:
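# 4-bit NF4 quantization config; the values mirror the bitsandbytes parameters set later in this notebook
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.float16,
)
model = AutoModelForCausalLM.from_pretrained(
    model_name,
    quantization_config=bnb_config,
    device_map="auto",  # Automatically selects the device to put the model on.
)
model.config.use_cache = False  # the KV cache is not needed during training
print(model)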

In [None]:
model.gradient_checkpointing_enable()

In [None]:
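from peft import prepare_model_for_kbit_training

# Prepare the quantized model for training (freezes the base weights and enables gradient checkpointing)
model = prepare_model_for_kbit_training(model, use_gradient_checkpointing=True)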

In [None]:
def find_all_linear_names(model):
    cls = bnb.nn.Linear4bit  # bitsandbytes is imported as `bnb`; use bnb.nn.Linear8bitLt for 8-bit or torch.nn.Linear for 16-bit
    lora_module_names = set()
    for name, module in model.named_modules():
        if isinstance(module, cls):
            names = name.split('.')
            lora_module_names.add(names[0] if len(names) == 1 else names[-1])


    # lm_head is often excluded.
    if 'lm_head' in lora_module_names:  # needed for 16-bit
        lora_module_names.remove('lm_head')
    return list(lora_module_names)


# Use the quantized `model` loaded above (`base_model` is only defined later in the notebook)
modules = find_all_linear_names(model)
print(modules)
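
The core of `find_all_linear_names` is simply "take the last component of each dotted module name and drop `lm_head`". A minimal sketch of that step on plain strings is shown below; the module names are hypothetical examples shaped like those of a decoder-only transformer, and `leaf_names` is an illustrative helper, not a library function.

```python
from typing import Iterable, List


def leaf_names(module_names: Iterable[str], exclude: Iterable[str] = ("lm_head",)) -> List[str]:
    """Collect the last dotted component of each name, minus the excluded ones."""
    leaves = {name.split(".")[-1] for name in module_names}
    return sorted(leaves - set(exclude))


# Hypothetical module names for illustration
example_names = [
    "model.layers.0.self_attn.q_proj",
    "model.layers.0.self_attn.k_proj",
    "model.layers.0.mlp.gate_proj",
    "model.layers.1.self_attn.q_proj",
    "lm_head",
]
print(leaf_names(example_names))
```

The real function additionally filters on the module type (`Linear4bit`), which this string-only sketch cannot do.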


In [None]:
################################################################################
# QLoRA parameters
################################################################################

# LoRA attention dimension
lora_r = 64

# Alpha parameter for LoRA scaling
lora_alpha = 16

# Dropout probability for LoRA layers
lora_dropout = 0.1

# Layers of the network to which we want to add LoRA adapters (these are the ones used in Zephyr)
targets = [
    "q_proj",
    "k_proj",
    "v_proj",
    "o_proj",
]
# These are all the linear layers that LoRA can be applied to; this overrides the list above.
targets = modules

################################################################################
# bitsandbytes parameters
################################################################################

# Activate 4-bit precision base model loading
use_4bit = True

# Compute dtype for 4-bit base models
bnb_4bit_compute_dtype = "float16"

# Quantization type (fp4 or nf4)
bnb_4bit_quant_type = "nf4"

# Activate nested quantization for 4-bit base models (double quantization)
use_nested_quant = False

################################################################################
# TrainingArguments parameters
################################################################################

# Output directory where the model predictions and checkpoints will be stored
output_dir = "./results"

# Number of training epochs
num_train_epochs = 1

# Enable fp16/bf16 training (set bf16 to True with an A100)
fp16 = False
bf16 = False

# Batch size per GPU for training
per_device_train_batch_size = 4

# Batch size per GPU for evaluation
per_device_eval_batch_size = 4

# Number of update steps to accumulate the gradients for
gradient_accumulation_steps = 1

# Enable gradient checkpointing
gradient_checkpointing = True

# Maximum gradient norm (gradient clipping)
max_grad_norm = 0.3

# Initial learning rate (AdamW optimizer)
learning_rate = 2.0e-05

# Weight decay to apply to all layers except bias/LayerNorm weights
weight_decay = 0.001

# Optimizer to use
optim = "paged_adamw_32bit"

# Learning rate schedule (constant a bit better than cosine)
lr_scheduler_type = "constant"

# Number of training steps (overrides num_train_epochs)
max_steps = -1

# Ratio of steps for a linear warmup (from 0 to learning rate)
warmup_ratio = 0.03

# Group sequences into batches with same length
# Saves memory and speeds up training considerably
group_by_length = True

# Save a checkpoint every X update steps
save_steps = 25

# Log every X update steps
logging_steps = 25

################################################################################
# SFT parameters
################################################################################

# Maximum sequence length to use
max_seq_length = None

# Pack multiple short examples in the same input sequence to increase efficiency
packing = False

# Load the entire model on the GPU 0
device_map = {"": 0}
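
To see what `lora_r = 64` means in practice, we can estimate the number of trainable LoRA parameters: each adapted linear layer of shape `(in, out)` gets two small matrices with `r * (in + out)` parameters in total. The sketch below assumes the Mistral-7B-v0.1 layer shapes (hidden size 4096, key/value projection size 1024, MLP size 14336, 32 decoder layers); these numbers are assumptions written into the code, not values read from the loaded model.

```python
# (in_features, out_features) of each linear layer in one decoder block.
# Assumed Mistral-7B-v0.1 shapes, hard-coded for illustration.
LAYER_SHAPES = {
    "q_proj": (4096, 4096),
    "k_proj": (4096, 1024),
    "v_proj": (4096, 1024),
    "o_proj": (4096, 4096),
    "gate_proj": (4096, 14336),
    "up_proj": (4096, 14336),
    "down_proj": (14336, 4096),
}
NUM_LAYERS = 32  # assumed number of decoder blocks


def lora_param_count(targets, r):
    """LoRA adds A (r x in) and B (out x r) per target layer: r * (in + out) parameters."""
    per_block = sum(r * (LAYER_SHAPES[t][0] + LAYER_SHAPES[t][1]) for t in targets)
    return per_block * NUM_LAYERS


attention_only = lora_param_count(["q_proj", "k_proj", "v_proj", "o_proj"], r=64)
all_linear = lora_param_count(list(LAYER_SHAPES), r=64)
scaling = 16 / 64  # lora_alpha / lora_r, the factor applied to the LoRA update

print(f"attention only: {attention_only:,}")
print(f"all linear layers: {all_linear:,}")
print(f"scaling: {scaling}")
```

Under these assumptions, targeting every linear layer roughly triples the number of trainable parameters compared with the attention projections alone.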

In [None]:
# Load the base model with QLoRA configuration
compute_dtype = getattr(torch, bnb_4bit_compute_dtype)

bnb_config = BitsAndBytesConfig(
    load_in_4bit=use_4bit,
    bnb_4bit_quant_type=bnb_4bit_quant_type,
    bnb_4bit_compute_dtype=compute_dtype,
    bnb_4bit_use_double_quant=use_nested_quant,
)

base_model = AutoModelForCausalLM.from_pretrained(
    model_name,
    quantization_config=bnb_config,
    device_map={"": 0}
)

base_model.config.use_cache = False
base_model.config.pretraining_tp = 1
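
A back-of-the-envelope estimate shows why 4-bit loading matters on a 15 GiB T4. The numbers below are rough assumptions: about 7.24 billion parameters, all of them quantized (in practice some parts such as embeddings and norms stay in higher precision), and one fp32 scaling constant per block of 64 weights when nested quantization is off.

```python
def weight_memory_gib(num_params: float, bits_per_param: float) -> float:
    """Memory needed to store the weights alone, in GiB."""
    return num_params * bits_per_param / 8 / 2**30


NUM_PARAMS = 7.24e9  # approximate parameter count (assumption)

fp16_gib = weight_memory_gib(NUM_PARAMS, 16)
# 4 bits per weight plus one 32-bit constant shared by each block of 64 weights
nf4_gib = weight_memory_gib(NUM_PARAMS, 4 + 32 / 64)

print(f"fp16 weights: {fp16_gib:.1f} GiB")
print(f"4-bit weights: {nf4_gib:.1f} GiB")
```

This only counts the weights; activations, LoRA parameters, gradients, and optimizer state come on top.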


In [None]:
import os

# These environment variables are read by the HF `Trainer` when it reports to wandb.
# You can also define your run beforehand with `wandb.init`.
os.environ["WANDB_PROJECT"] = wandb_project_name

# Log model when running HF Trainer reporting to wandb.
os.environ["WANDB_LOG_MODEL"] = "true"  # Apparently this is deprecated in version 5 of transformers.

# Use wandb to watch the gradients & model parameters.
os.environ["WANDB_WATCH"] = "all"

In [None]:
run = wandb.init(
    project=wandb_project_name,  # Project name.
    name="log_dataset",          # name of the run within this project.
    config={                     # Configuration dictionary.
        "split": "train"
    },
    group="dataset",             # Group runs. This run belongs in "dataset".
    tags=["dataset"],            # Tags. More dynamic, low-level grouping.
    notes="Logging subset of UltraChat 200k dataset."  # Description about the run.
)  # Check out the other parameters in the `wandb.init`!

In [None]:
# Load LoRA configuration
peft_config = LoraConfig(
    lora_alpha=lora_alpha,
    lora_dropout=lora_dropout,
    r=lora_r,
    target_modules=targets,
    bias="none",
    task_type="CAUSAL_LM",
)

# Set training parameters
training_arguments = TrainingArguments(
    output_dir=output_dir,
    num_train_epochs=num_train_epochs,
    per_device_train_batch_size=per_device_train_batch_size,
    gradient_accumulation_steps=gradient_accumulation_steps,
    optim=optim,
    save_steps=save_steps,
    logging_steps=logging_steps,
    learning_rate=learning_rate,
    weight_decay=weight_decay,
    fp16=fp16,
    bf16=bf16,
    max_grad_norm=max_grad_norm,
    max_steps=100,  # hard-coded number of training steps; overrides num_train_epochs and the max_steps variable defined above
    warmup_ratio=warmup_ratio,
    group_by_length=group_by_length,
    lr_scheduler_type=lr_scheduler_type,
    report_to="wandb",  # send training logs to the W&B project configured above
)

# Set supervised fine-tuning parameters
trainer = SFTTrainer(
    model=base_model,
    train_dataset=dataset['train'],
    peft_config=peft_config,
    dataset_text_field="text",
    max_seq_length=max_seq_length,
    tokenizer=tokenizer,
    args=training_arguments,
    packing=packing,
)
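
It helps to work out what these settings imply before launching the run. The sketch below is plain arithmetic on the values used in this notebook (100 steps, batch size 4, no gradient accumulation, 2,000 training rows, a checkpoint every 25 steps); `training_summary` is a hypothetical helper and a single GPU is assumed.

```python
def training_summary(max_steps, per_device_batch, grad_accum, num_devices, train_rows, save_steps):
    """Derive a few quantities implied by the training arguments."""
    effective_batch = per_device_batch * grad_accum * num_devices
    examples_seen = max_steps * effective_batch
    return {
        "effective_batch": effective_batch,
        "examples_seen": examples_seen,
        "epochs": examples_seen / train_rows,  # fraction of the training set covered
        "checkpoints": list(range(save_steps, max_steps + 1, save_steps)),
    }


summary = training_summary(
    max_steps=100, per_device_batch=4, grad_accum=1,
    num_devices=1, train_rows=2000, save_steps=25,
)
print(summary)
```

So this run sees only a fraction of the training split, which is fine for a workshop but worth keeping in mind when judging the results.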

In [None]:
# Train model
trainer.train()

# Save trained model
trainer.model.save_pretrained(new_model)



In [None]:
run.finish()

In [None]:
# Empty VRAM
import gc
del base_model
gc.collect()

del trainer
gc.collect()

torch.cuda.empty_cache()
gc.collect()

In [None]:
base_model = AutoModelForCausalLM.from_pretrained(
    model_name,
    low_cpu_mem_usage=True,
    return_dict=True,
    torch_dtype=torch.float16,
    device_map={"": 0},
)
# Load the trained LoRA adapter on top of the fp16 base model, then fold it into the weights
merged_model = PeftModel.from_pretrained(base_model, new_model)
merged_model = merged_model.merge_and_unload()

# Save the merged model
merged_model.save_pretrained("merged_model", safe_serialization=True)
tokenizer.save_pretrained("merged_model")
# Use the EOS token for padding, on the right side
tokenizer.pad_token = tokenizer.eos_token
tokenizer.padding_side = "right"
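
Conceptually, merging a LoRA adapter folds the low-rank update into each adapted weight matrix: `W_merged = W + (alpha / r) * B @ A`. The toy example below shows this on 2x2 matrices in pure Python. It is a sketch of the math only, not of PEFT's actual implementation, and the matrices and the `alpha`/`r` values are made up.

```python
from typing import List

Matrix = List[List[float]]


def matmul(a: Matrix, b: Matrix) -> Matrix:
    """Plain matrix product of a (n x k) and b (k x m)."""
    return [[sum(a[i][k] * b[k][j] for k in range(len(b))) for j in range(len(b[0]))]
            for i in range(len(a))]


def merge_lora(w: Matrix, lora_b: Matrix, lora_a: Matrix, alpha: float, r: int) -> Matrix:
    """Return W + (alpha / r) * B @ A."""
    scale = alpha / r
    delta = matmul(lora_b, lora_a)
    return [[w[i][j] + scale * delta[i][j] for j in range(len(w[0]))] for i in range(len(w))]


w = [[1.0, 0.0], [0.0, 1.0]]       # toy base weight
lora_b = [[1.0, 0.0], [0.0, 2.0]]  # toy B matrix (out x r)
lora_a = [[2.0, 4.0], [6.0, 8.0]]  # toy A matrix (r x in)

merged = merge_lora(w, lora_b, lora_a, alpha=1.0, r=2)
print(merged)
```

After merging, the model has the same shape as the base model, so it can be loaded without PEFT.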

In [None]:
# Push the model and tokenizer to the Hugging Face Model Hub
merged_model.push_to_hub(new_model, use_temp_dir=False)
tokenizer.push_to_hub(new_model, use_temp_dir=False)