# Local Finetune Mistral 7B

Mistral 7B is an open-weight large language model (LLM) from Mistral AI.

This sample is adapted from the following tutorial:
https://adithyask.medium.com/a-beginners-guide-to-fine-tuning-mistral-7b-instruct-model-0f39647b20fe

This tutorial covers the following steps:

- Set up the development environment
- Load and prepare the dataset
- Fine-tune Mistral with QLoRA


This notebook has been tested on an Amazon SageMaker notebook instance with a single GPU (ml.g5.2xlarge).

## Setup development environment

In [1]:
!pip install transformers==4.38.1 datasets==2.17.1 peft==0.8.2 bitsandbytes==0.42.0 trl==0.7.11 --upgrade --quiet

## Load and prepare the dataset


### Choose a dataset

This tutorial uses `databricks-dolly-15k` (Dolly), an open-source dataset containing about 15k instruction-response records.

Example record from Dolly:
```
{
  "instruction": "Who was the first woman to have four country albums reach No. 1 on the Billboard 200?",
  "context": "",
  "response": "Carrie Underwood."
}
```


In [2]:
from datasets import load_dataset
from random import randrange

# Load the full dataset from the Hub
# dataset = load_dataset("databricks/databricks-dolly-15k", split="train")

# For local testing of the fine-tuning code, limit the dataset to 20 samples
dataset = load_dataset("databricks/databricks-dolly-15k", split="train").select(range(20))

print(f"dataset size: {len(dataset)}")
print(dataset[randrange(len(dataset))])

dataset size: 20
{'instruction': 'Why mobile is bad for human', 'context': '', 'response': 'We are always engaged one phone which is not good.', 'category': 'brainstorming'}


### Understand the Mistral format and prepare the prompt input

`mistralai/Mistral-7B-Instruct-v0.1` is a conversational chat model, meaning we can chat with it using the following prompt format:


```
<s> [INST] User Instruction 1 [/INST] Model answer 1</s> [INST] User instruction 2 [/INST]
```
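
The format above can be assembled programmatically. The following is a minimal sketch that only reproduces the string layout shown in this section; the function name `build_mistral_prompt` and its arguments are hypothetical and are not part of any library.

```python
def build_mistral_prompt(history, next_instruction):
    """Build a multi-turn prompt in the [INST] ... [/INST] layout shown above.

    history: list of (user_instruction, model_answer) pairs from earlier turns.
    next_instruction: the new user message the model should answer.
    """
    prompt = "<s>"  # beginning-of-sequence marker, written once at the start
    for user_instruction, model_answer in history:
        # Each finished turn ends with the end-of-sequence marker </s>
        prompt += f" [INST] {user_instruction} [/INST] {model_answer}</s>"
    # The final instruction is left open so the model generates the answer
    prompt += f" [INST] {next_instruction} [/INST]"
    return prompt


example = build_mistral_prompt(
    [("User Instruction 1", "Model answer 1")],
    "User instruction 2",
)
print(example)
```

For single-turn instruction fine-tuning, `history` is empty and the prompt reduces to one `[INST] ... [/INST]` block.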


For instruction fine-tuning, it is common to have two columns in the dataset: one for the prompt and one for the response.

In [3]:
from random import randint

# Define the create_prompt function
def create_prompt(sample):
    bos_token = "<s>"
    eos_token = "</s>"
    
    instruction = sample['instruction']
    context = sample['context']
    response = sample['response']

    text_row = f"""[INST] Below is the question based on the context. Question: {instruction}. Below is the given the context {context}. Write a response that appropriately completes the request.[/INST]"""
    answer_row = response

    sample["prompt"] = bos_token + text_row
    sample["completion"] = answer_row + eos_token

    return sample

Let's test our formatting function on a random example.

In [4]:
dataset_instruct_format = dataset.map(create_prompt, remove_columns=['instruction','context','response','category'])
# Print a random sample (randint is inclusive at both ends, hence the -1)
print(dataset_instruct_format[randint(0, len(dataset_instruct_format) - 1)])

{'prompt': "<s>[INST] Below is the question based on the context. Question: Who was John Moses Browning?. Below is the given the context John Moses Browning (January 23, 1855 – November 26, 1926) was an American firearm designer who developed many varieties of military and civilian firearms, cartridges, and gun mechanisms – many of which are still in use around the world. He made his first firearm at age 13 in his father's gun shop and was awarded the first of his 128 firearm patents on October 7, 1879, at the age of 24. He is regarded as one of the most successful firearms designers of the 19th and 20th centuries and pioneered the development of modern repeating, semi-automatic, and automatic firearms.\n\nBrowning influenced nearly all categories of firearms design, especially the autoloading of ammunition. He invented, or made significant improvements to, single-shot, lever-action, and pump-action rifles and shotguns. He developed the first reliable and compact autoloading pistols by
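
Some Dolly records, such as the example record shown earlier, have an empty `context`, in which case the template in `create_prompt` leaves a dangling context clause in the prompt. The sketch below shows one possible way to handle that. It is a hypothetical variant, not the function used in the rest of this notebook, and both the name `create_prompt_optional_context` and the shortened wording of the template are assumptions made for illustration.

```python
def create_prompt_optional_context(sample):
    """Hypothetical variant of create_prompt that omits the context clause when it is empty."""
    bos_token = "<s>"
    eos_token = "</s>"

    instruction = sample["instruction"]
    context = sample["context"].strip()

    # Build the instruction body piece by piece so empty context is skipped
    parts = [f"Question: {instruction}"]
    if context:
        parts.append(f"Context: {context}")
    parts.append("Write a response that appropriately completes the request.")

    return {
        "prompt": f"{bos_token}[INST] " + "\n".join(parts) + " [/INST]",
        "completion": sample["response"] + eos_token,
    }


record = {
    "instruction": "Who was the first woman to have four country albums reach No. 1 on the Billboard 200?",
    "context": "",
    "response": "Carrie Underwood.",
}
formatted = create_prompt_optional_context(record)
print(formatted["prompt"])
print(formatted["completion"])
```

Because the function returns a dictionary with `prompt` and `completion` keys, it could be passed to `dataset.map` in the same way as `create_prompt`.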

### Prepare the configuration for training the LLM

For background on the techniques used here, see the Hugging Face blog post "Making LLMs even more accessible with bitsandbytes, 4-bit quantization and QLoRA":
https://huggingface.co/blog/4bit-transformers-bitsandbytes
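
To get a feel for why 4-bit loading matters on a single GPU, the back-of-the-envelope sketch below compares the memory needed for the weights alone at 16 and 4 bits per parameter. The parameter count of roughly 7.24 billion is an assumption, and the helper `weight_memory_gib` is hypothetical.

```python
def weight_memory_gib(num_params, bits_per_param):
    """Approximate memory needed to hold the weights alone, in GiB."""
    return num_params * bits_per_param / 8 / 1024**3


# Assumption: roughly 7.24 billion parameters for Mistral 7B
ASSUMED_NUM_PARAMS = 7.24e9

fp16_gib = weight_memory_gib(ASSUMED_NUM_PARAMS, 16)
int4_gib = weight_memory_gib(ASSUMED_NUM_PARAMS, 4)

print(f"16-bit weights: ~{fp16_gib:.1f} GiB")
print(f"4-bit weights:  ~{int4_gib:.1f} GiB")
```

This is only a lower bound: it ignores quantization constants, layers that stay in higher precision, activations, the LoRA weights, and optimizer state.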


In [5]:
model_id = "mistralai/Mistral-7B-Instruct-v0.1"
new_model = "Mistral-qlora-7B-Instruct-v0.1" #set the name of the new model

In [6]:

################################################################################
# QLoRA parameters
################################################################################

# LoRA attention dimension
lora_r = 64

# Alpha parameter for LoRA scaling
lora_alpha = 16

# Dropout probability for LoRA layers
lora_dropout = 0.1

################################################################################
# bitsandbytes parameters
################################################################################

# Activate 4-bit precision base model loading
use_4bit = True

# Compute dtype for 4-bit base models
bnb_4bit_compute_dtype = "bfloat16"

# Quantization type (fp4 or nf4)
bnb_4bit_quant_type = "nf4"

# Activate nested quantization for 4-bit base models (double quantization)
use_nested_quant = True


################################################################################
# TrainingArguments parameters
################################################################################

# Output directory where the model predictions and checkpoints will be stored
output_dir = "./results"

# Number of training epochs
num_train_epochs = 1

# Enable fp16/bf16 training (set bf16 to True on GPUs with bfloat16 support, such as the A100)
fp16 = False
bf16 = False

# Batch size per GPU for training
per_device_train_batch_size = 4

# Batch size per GPU for evaluation
per_device_eval_batch_size = 4

# Number of update steps to accumulate the gradients for
gradient_accumulation_steps = 1

# Enable gradient checkpointing
gradient_checkpointing = True

# Maximum gradient norm (gradient clipping)
max_grad_norm = 0.3

# Initial learning rate (AdamW optimizer)
learning_rate = 2e-4

# Weight decay to apply to all layers except bias/LayerNorm weights
weight_decay = 0.001

# Optimizer to use
optim = "paged_adamw_32bit"

# Learning rate schedule (constant a bit better than cosine)
lr_scheduler_type = "constant"

# Number of training steps (overrides num_train_epochs)
max_steps = -1

# Ratio of steps for a linear warmup (from 0 to learning rate)
warmup_ratio = 0.03

# Group sequences into batches with same length
# Saves memory and speeds up training considerably
group_by_length = True

# Save a checkpoint every X update steps
save_steps = 25

# Log every X update steps
logging_steps = 25

################################################################################
# SFT parameters
################################################################################

# Maximum sequence length to use
max_seq_length = 1024

# Pack multiple short examples in the same input sequence to increase efficiency
packing = False

# Load the entire model on GPU 0
device_map = {"": 0}
#device_map = torch.device('cuda' if torch.cuda.is_available() else 'cpu')
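
The batch-size settings above determine how many optimizer steps one pass over the data takes. The sketch below works through that arithmetic for the 20-sample subset used in this notebook and for the `max_steps=100` value that is passed to `TrainingArguments` later on. The helper `steps_per_epoch` is hypothetical and is only a rough model of how the trainer counts steps.

```python
import math


def steps_per_epoch(num_samples, per_device_batch_size, grad_accum_steps, num_devices=1):
    """Rough number of optimizer updates needed for one pass over the dataset."""
    effective_batch = per_device_batch_size * grad_accum_steps * num_devices
    return math.ceil(num_samples / effective_batch)


num_samples = 20                  # size of the local test subset
per_device_train_batch_size = 4
gradient_accumulation_steps = 1
save_steps = 25
max_steps = 100                   # value hard-coded in the TrainingArguments call

updates_per_epoch = steps_per_epoch(
    num_samples, per_device_train_batch_size, gradient_accumulation_steps
)
passes_over_data = max_steps / updates_per_epoch
checkpoints_written = max_steps // save_steps

print(f"updates per epoch:   {updates_per_epoch}")
print(f"passes over data:    {passes_over_data:.0f}")
print(f"checkpoints written: {checkpoints_written}")
```

With these settings the 20 samples are seen about 20 times, which is fine for checking that the code runs but says little about model quality.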


In [7]:
import json
import pandas as pd
import torch
from datasets import Dataset, load_dataset
from transformers import (
    AutoModelForCausalLM,
    AutoTokenizer,
    BitsAndBytesConfig,
    TrainingArguments,
    pipeline,
    logging,
)

# Load the base model with QLoRA configuration
compute_dtype = getattr(torch, bnb_4bit_compute_dtype)

bnb_config = BitsAndBytesConfig(
    load_in_4bit=use_4bit,
    bnb_4bit_quant_type=bnb_4bit_quant_type,
    bnb_4bit_compute_dtype=compute_dtype,
    bnb_4bit_use_double_quant=use_nested_quant,
)

base_model = AutoModelForCausalLM.from_pretrained(
    model_id,
    quantization_config=bnb_config,
    device_map=device_map
)

base_model.config.use_cache = False
base_model.config.pretraining_tp = 1

# Load the Mistral tokenizer
tokenizer = AutoTokenizer.from_pretrained(model_id)
tokenizer.pad_token = tokenizer.eos_token
tokenizer.padding_side = 'right'


In [None]:
print(base_model)

In [None]:
import bitsandbytes as bnb

def find_all_linear_names(model):
    lora_module_names = set()
    for name, module in model.named_modules():
        if isinstance(module, bnb.nn.Linear4bit):
            names = name.split(".")
            lora_module_names.add(names[0] if len(names) == 1 else names[-1])

    if "lm_head" in lora_module_names:  # needed for 16-bit
        lora_module_names.remove("lm_head")
    return list(lora_module_names)

In [None]:
# get lora target modules
modules = find_all_linear_names(base_model)

In [None]:
print(modules)
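
The name handling in `find_all_linear_names` can be illustrated without loading a model. The sketch below applies the same "keep the last path component" logic to a hand-written list of module names. The names follow the usual Hugging Face layout for this architecture but are assumptions, the `Linear4bit` type check is left out, and `leaf_module_names` is a hypothetical helper.

```python
def leaf_module_names(qualified_names, exclude=("lm_head",)):
    """Reduce dotted module paths to their unique last component, minus exclusions."""
    leaves = {name.split(".")[-1] for name in qualified_names}
    return sorted(leaves - set(exclude))


# Hypothetical module paths, as they might appear in model.named_modules()
example_names = [
    "model.layers.0.self_attn.q_proj",
    "model.layers.0.self_attn.k_proj",
    "model.layers.0.self_attn.v_proj",
    "model.layers.0.self_attn.o_proj",
    "model.layers.0.mlp.gate_proj",
    "model.layers.0.mlp.up_proj",
    "model.layers.0.mlp.down_proj",
    "model.layers.1.self_attn.q_proj",  # repeated leaf names collapse into one entry
    "lm_head",
]

print(leaf_module_names(example_names))
```

The resulting leaf names are the kind of values that `LoraConfig` accepts in `target_modules`.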

#### Inference with the base model before fine-tuning

In [None]:
# eval_prompt = create_prompt(dataset[randrange(len(dataset))])["prompt"]

# # import random
# model_input = tokenizer(eval_prompt, return_tensors="pt").to("cuda")

# base_model.eval()
# with torch.no_grad():
#     print(tokenizer.decode(base_model.generate(**model_input, max_new_tokens=256, pad_token_id=2)[0], skip_special_tokens=True))

In [None]:
from peft import LoraConfig, PeftModel
from trl import SFTTrainer, DataCollatorForCompletionOnlyLM


# Set LoRA configuration
peft_config = LoraConfig(
    lora_alpha=lora_alpha,
    lora_dropout=lora_dropout,
    r=lora_r,
    target_modules=[
        "q_proj",
        "k_proj",
        "v_proj",
        "o_proj",
        "gate_proj",
        "up_proj",
        "down_proj",
    ],
    bias="none",
    task_type="CAUSAL_LM",
)

# Set training parameters
training_arguments = TrainingArguments(
    output_dir=output_dir,
    num_train_epochs=num_train_epochs,
    per_device_train_batch_size=per_device_train_batch_size,
    gradient_accumulation_steps=gradient_accumulation_steps,
    optim=optim,
    save_steps=save_steps,
    logging_steps=logging_steps,
    learning_rate=learning_rate,
    weight_decay=weight_decay,
    gradient_checkpointing=gradient_checkpointing,
    fp16=fp16,
    bf16=bf16,
    max_grad_norm=max_grad_norm,
    max_steps=100,  # hard-coded total number of training steps; overrides num_train_epochs and ignores the max_steps variable defined earlier
    warmup_ratio=warmup_ratio,
    group_by_length=group_by_length,
    lr_scheduler_type=lr_scheduler_type,
)
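
To see what `r=64` on these seven target modules means in terms of trainable parameters, the sketch below counts the LoRA weights. Each adapted linear layer of shape `d_in x d_out` gains two low-rank matrices with `r * (d_in + d_out)` parameters in total. The layer dimensions and layer count are assumptions about Mistral 7B that are not stated in this notebook, and `lora_param_count` is a hypothetical helper.

```python
# Assumed Mistral 7B dimensions (not taken from this notebook)
HIDDEN = 4096          # model width
KV_DIM = 1024          # key/value projection width (grouped-query attention)
INTERMEDIATE = 14336   # MLP width
NUM_LAYERS = 32

# (input features, output features) for each targeted linear layer
TARGET_SHAPES = {
    "q_proj": (HIDDEN, HIDDEN),
    "k_proj": (HIDDEN, KV_DIM),
    "v_proj": (HIDDEN, KV_DIM),
    "o_proj": (HIDDEN, HIDDEN),
    "gate_proj": (HIDDEN, INTERMEDIATE),
    "up_proj": (HIDDEN, INTERMEDIATE),
    "down_proj": (INTERMEDIATE, HIDDEN),
}


def lora_param_count(r, shapes, num_layers):
    """Count LoRA parameters: matrices A (r x d_in) and B (d_out x r) per target layer."""
    per_layer = sum(r * (d_in + d_out) for d_in, d_out in shapes.values())
    return per_layer * num_layers


lora_r = 64
lora_alpha = 16

trainable = lora_param_count(lora_r, TARGET_SHAPES, NUM_LAYERS)
scaling = lora_alpha / lora_r  # default LoRA scaling factor applied to the update

print(f"trainable LoRA parameters: {trainable:,}")
print(f"LoRA scaling (alpha / r):  {scaling}")
```

Under these assumptions the adapters add roughly 168 million trainable parameters, a small fraction of the base model, and the low-rank update is scaled by 0.25.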

To compute the loss on completions only, use `DataCollatorForCompletionOnlyLM` as described in the TRL documentation: https://huggingface.co/docs/trl/en/sft_trainer

In [None]:
def formatting_prompts_func(example):
    output_texts = []
    for i in range(len(example['prompt'])):
        text = f"{example['prompt'][i]}\n\n ### Answer: {example['completion'][i]}"
        output_texts.append(text)
    return output_texts

response_template = "### Answer:"
collator = DataCollatorForCompletionOnlyLM(response_template, tokenizer=tokenizer)
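
Conceptually, training on completions only means that every label up to and including the response template is replaced by an ignore index, so only the answer tokens contribute to the loss. The sketch below illustrates the idea on toy token IDs. It is not TRL's implementation; the function name and the token IDs are made up for illustration, and `-100` is the ignore index conventionally used for the cross-entropy loss.

```python
IGNORE_INDEX = -100  # labels with this value are skipped by the loss


def mask_before_response(token_ids, template_ids, ignore_index=IGNORE_INDEX):
    """Return labels where everything up to the end of the template is ignored."""
    n = len(template_ids)
    for start in range(len(token_ids) - n + 1):
        if token_ids[start:start + n] == template_ids:
            end = start + n
            return [ignore_index] * end + token_ids[end:]
    # Template not found: ignore the whole example
    return [ignore_index] * len(token_ids)


# Toy example: 1 = BOS, 10-12 = prompt, 90-91 = "### Answer:", 20-21 = answer, 2 = EOS
token_ids = [1, 10, 11, 12, 90, 91, 20, 21, 2]
template_ids = [90, 91]

labels = mask_before_response(token_ids, template_ids)
print(labels)
```

The match is done on token IDs, so the template must tokenize the same way inside the full text as it does on its own; if it does not, the template is never found and the example contributes nothing to the loss.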

In [None]:
# Initialize the SFTTrainer for fine-tuning
trainer = SFTTrainer(
    model=base_model,
    train_dataset=dataset_instruct_format,
    formatting_func=formatting_prompts_func,
    data_collator=collator,
    peft_config=peft_config,
    max_seq_length=max_seq_length,  # You can specify the maximum sequence length here
    tokenizer=tokenizer,
    args=training_arguments,
    packing=packing
)

In [None]:
# Start the training process
trainer.train()

In [None]:
# Save the fine-tuned model
trainer.model.save_pretrained(new_model)

### Finished all training steps

#### Merge the trained QLoRA adapter into the base model

In [None]:
base_model = AutoModelForCausalLM.from_pretrained(
    model_id,
    low_cpu_mem_usage=True,
    torch_dtype=torch.float16,
    device_map=device_map
)

In [None]:
print(base_model)

In [None]:
merged_model = PeftModel.from_pretrained(base_model, new_model)

In [None]:
print(merged_model)

In [None]:
merged_model = merged_model.merge_and_unload()

In [None]:
print(merged_model)

In [None]:
sagemaker_save_dir = "Mistral-Finetuned-Merged"

In [None]:
merged_model.save_pretrained(sagemaker_save_dir, safe_serialization=True, max_shard_size="2GB")
# Save the tokenizer alongside the merged model for easy inference
tokenizer.save_pretrained(sagemaker_save_dir)