<a href="https://colab.research.google.com/github/fatemafaria142/Fine-Tuning-of-Mistral-7B-on-Diverse-Instruction-Based-Datasets/blob/main/Mistral_7B_Instruct_v0_2_using_databricks_dolly_15k_dataset.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

### **Install Required Packages**

In [1]:
!pip install accelerate peft bitsandbytes transformers trl datasets torch

Collecting accelerate
  Downloading accelerate-0.26.1-py3-none-any.whl (270 kB)
Collecting peft
  Downloading peft-0.7.1-py3-none-any.whl (168 kB)
Collecting bitsandbytes
  Downloading bitsandbytes-0.42.0-py3-none-any.whl (105.0 MB)
Collecting trl
  Downloading trl-0.7.9-py3-none-any.whl (141 kB)
Collecting datasets
  Downloading datasets-2.16.1-py3-none-any.whl (507 kB)

#### **Dataset Link:** https://huggingface.co/datasets/databricks/databricks-dolly-15k?row=59

In [2]:
from datasets import load_dataset

instruct_tune_dataset = load_dataset("databricks/databricks-dolly-15k")

The secret `HF_TOKEN` does not exist in your Colab secrets.
To authenticate with the Hugging Face Hub, create a token in your settings tab (https://huggingface.co/settings/tokens), set it as secret in your Google Colab and restart your session.
You will be able to reuse this secret in all of your notebooks.
Please note that authentication is recommended but still optional to access public models or datasets.


Downloading readme:   0%|          | 0.00/8.20k [00:00<?, ?B/s]

Downloading data:   0%|          | 0.00/13.1M [00:00<?, ?B/s]

Generating train split: 0 examples [00:00, ? examples/s]

### **Dataset structure**
* The dataset contains four columns. We are only interested in the `instruction` and `response` columns. The `category` column has 8 possible values, and we keep only one of them (`classification`).

In [3]:
instruct_tune_dataset

DatasetDict({
    train: Dataset({
        features: ['instruction', 'context', 'response', 'category'],
        num_rows: 15011
    })
})
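Before filtering, it can help to see how the rows are spread across the `category` values. The snippet below is a minimal sketch: `category_counts` is a hypothetical helper, and the short list is a made-up stand-in for `instruct_tune_dataset["train"]["category"]`.

```python
from collections import Counter

def category_counts(categories):
    """Return (category, count) pairs, most frequent first."""
    return Counter(categories).most_common()

# Hypothetical stand-in for instruct_tune_dataset["train"]["category"]
sample_categories = ["classification", "open_qa", "classification", "summarization"]
counts = category_counts(sample_categories)
print(counts)
```

On the real dataset, passing the `category` column instead of the toy list should give one entry per category.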

In [4]:
instruct_tune_dataset = instruct_tune_dataset.filter(lambda x: x["category"] == "classification")

Filter:   0%|          | 0/15011 [00:00<?, ? examples/s]

In [5]:
instruct_tune_dataset

DatasetDict({
    train: Dataset({
        features: ['instruction', 'context', 'response', 'category'],
        num_rows: 2136
    })
})

#### **We will use just a small subset of the data for this training example**

In [6]:
filtered = instruct_tune_dataset["train"]
instruct_tune_dataset["train"] = filtered.select(range(1500))
# Take the test rows from outside the training range so the two splits do not overlap
instruct_tune_dataset["test"] = filtered.select(range(1500, 1700))

In [7]:
instruct_tune_dataset

DatasetDict({
    train: Dataset({
        features: ['instruction', 'context', 'response', 'category'],
        num_rows: 1500
    })
    test: Dataset({
        features: ['instruction', 'context', 'response', 'category'],
        num_rows: 200
    })
})

* Mistral-Instruct marks the start and end of each user message with the control tokens `[INST]` and `[/INST]` (assistant messages are not wrapped). **Mistral-instruct was trained with these tokens.**
* To benefit from instruction fine-tuning, wrap each prompt in `[INST]` and `[/INST]`. Only the very first instruction begins with the beginning-of-sentence (BOS) token `<s>`; later instructions do not. Each assistant reply ends with the end-of-sentence (EOS) token `</s>`.
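The rules above can be sketched for a multi-turn conversation as follows. `build_mistral_prompt` is a hypothetical helper, the follow-up question is invented for illustration, and the spaces around the control tokens are an assumption that follows the layout commonly shown for this model rather than something this notebook defines.

```python
BOS, EOS = "<s>", "</s>"

def build_mistral_prompt(turns):
    """Format (user, assistant) pairs; assistant may be None for the final open turn."""
    prompt = BOS  # BOS appears once, before the first instruction only
    for user_msg, assistant_msg in turns:
        prompt += f"[INST] {user_msg} [/INST]"
        if assistant_msg is not None:
            # Every completed assistant reply is closed with EOS
            prompt += f" {assistant_msg}{EOS}"
    return prompt

conversation = [
    ("Which is a species of fish? Tope or Rope", "Tope"),
    ("Is a Tope a shark?", None),  # hypothetical follow-up, left open for the model
]
print(build_mistral_prompt(conversation))
```

The second `[INST]` block has no `<s>` in front of it, and the open final turn has no `</s>`, which leaves room for the model to generate the reply.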

In [8]:
def create_prompt(sample):
    """
    Build one training string from a dataset row:
    the instruction wrapped in [INST] ... [/INST], followed by the response.
    The `context` column is not used.
    """
    bos_token = "<s>"
    eos_token = "</s>"

    # Use a predefined template for instructions
    instructions_template = "Below is an instruction that describes a task. Write a response that appropriately completes the request. "



    full_prompt = ""
    full_prompt += bos_token
    full_prompt += "[INST]"
    full_prompt += instructions_template
    full_prompt += sample['instruction']
    full_prompt += "[/INST]"
    full_prompt += sample['response']
    full_prompt += eos_token

    return full_prompt

In [9]:
create_prompt(instruct_tune_dataset["train"][0])

'<s>[INST]Below is an instruction that describes a task. Write a response that appropriately completes the request. Which is a species of fish? Tope or Rope[/INST]Tope</s>'

In [10]:
create_prompt(instruct_tune_dataset["train"][1])

'<s>[INST]Below is an instruction that describes a task. Write a response that appropriately completes the request. Identify which instrument is string or percussion: Cantaro, Gudok[/INST]Gudok is string, Cantaro is percussion.</s>'

In [11]:
create_prompt(instruct_tune_dataset["train"][2])

'<s>[INST]Below is an instruction that describes a task. Write a response that appropriately completes the request. Which of the following is useful for transportation: a glass of wine, a lamp, a train, an iced cube tray, a plane, a bicycle, an apple and a scooter.[/INST]The useful objects for transportation in the list are train, plane, bicyle and scooter.</s>'

In [12]:
create_prompt(instruct_tune_dataset["train"][3])

'<s>[INST]Below is an instruction that describes a task. Write a response that appropriately completes the request. Identify which instrument is string or woodwind: Panduri, Zurna[/INST]Zurna is woodwind, Panduri is string.</s>'

### **Initializing the Model**
* Load the model in 4-bit (NF4) with double quantization, using float16 as the compute data type.

* Here we use the instruct-tuned model rather than the base model. Fine-tuning a base model to follow instructions generally needs considerably more data!

In [16]:
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig
import torch

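bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,                     # store the weights in 4-bit
    bnb_4bit_quant_type="nf4",             # NormalFloat4 quantization
    bnb_4bit_compute_dtype=torch.float16,  # dtype used for the matrix multiplications
    bnb_4bit_use_double_quant=True,        # also quantize the quantization constants
)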


* https://huggingface.co/mistralai/Mistral-7B-Instruct-v0.2

In [17]:
model_id = "mistralai/Mistral-7B-Instruct-v0.2"

In [18]:
model = AutoModelForCausalLM.from_pretrained(
        model_id, quantization_config=bnb_config, device_map="auto", use_cache=False
    )

config.json:   0%|          | 0.00/596 [00:00<?, ?B/s]

model.safetensors.index.json:   0%|          | 0.00/25.1k [00:00<?, ?B/s]

Downloading shards:   0%|          | 0/3 [00:00<?, ?it/s]

model-00001-of-00003.safetensors:   0%|          | 0.00/4.94G [00:00<?, ?B/s]

model-00002-of-00003.safetensors:   0%|          | 0.00/5.00G [00:00<?, ?B/s]

model-00003-of-00003.safetensors:   0%|          | 0.00/4.54G [00:00<?, ?B/s]

Loading checkpoint shards:   0%|          | 0/3 [00:00<?, ?it/s]

generation_config.json:   0%|          | 0.00/111 [00:00<?, ?B/s]
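The shard sizes in the download log give a rough feel for why 4-bit loading matters. The estimate below is back-of-the-envelope arithmetic only: it assumes the checkpoint stores 2 bytes per parameter and that every weight shrinks to 4 bits, which ignores quantization constants and any layers kept in higher precision.

```python
# Shard sizes (GB) as reported in the download log above.
shard_sizes_gb = [4.94, 5.00, 4.54]

checkpoint_gb = sum(shard_sizes_gb)        # size of the downloaded checkpoint
approx_params = checkpoint_gb * 1e9 / 2    # assumption: 2 bytes per parameter
four_bit_gb = approx_params * 0.5 / 1e9    # assumption: 4 bits = 0.5 byte per parameter

print(f"downloaded checkpoint: {checkpoint_gb:.2f} GB")
print(f"approx. parameters: {approx_params / 1e9:.2f} B")
print(f"4-bit weights (rough lower bound): {four_bit_gb:.2f} GB")
```

The real memory footprint will be somewhat higher than this lower bound, but the roughly fourfold reduction is what makes a single Colab GPU workable.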

In [19]:
tokenizer = AutoTokenizer.from_pretrained("mistralai/Mistral-7B-Instruct-v0.2")
tokenizer.pad_token = tokenizer.eos_token
tokenizer.padding_side = "right"

tokenizer_config.json:   0%|          | 0.00/1.46k [00:00<?, ?B/s]

tokenizer.model:   0%|          | 0.00/493k [00:00<?, ?B/s]

tokenizer.json:   0%|          | 0.00/1.80M [00:00<?, ?B/s]

special_tokens_map.json:   0%|          | 0.00/72.0 [00:00<?, ?B/s]

### **Let's examine how well the model does at this task currently:**

In [21]:
def generate_response(prompt, model):
  encoded_input = tokenizer(prompt,  return_tensors="pt", add_special_tokens=True)
  model_inputs = encoded_input.to('cuda')

  generated_ids = model.generate(**model_inputs, max_new_tokens=512, do_sample=True, pad_token_id=tokenizer.eos_token_id)

  decoded_output = tokenizer.batch_decode(generated_ids)

  return decoded_output[0].replace(prompt, "")

In [22]:
# Use a predefined template for instructions
prompt = "<s>[INST] Below is an instruction that describes a task. Write a response that appropriately completes the request. "
prompt += "Identify which instrument is string or percussion: Kpanlogo, Shamisen [/INST]"
print(prompt)

<s>[INST] Below is an instruction that describes a task. Write a response that appropriately completes the request. Identify which instrument is string or percussion: Kpanlogo, Shamisen [/INST]


In [23]:
generate_response(prompt, model)



'<s><s> [INST] Below is an instruction that describes a task. Write a response that appropriately completes the request. Identify which instrument is string or percussion: Kpanlogo, Shamisen [/INST] The Kpanlogo is a traditional string instrument that is commonly used in West African music, particularly in Ghana. It is constructed from a series of slender logs or sections of hollow bamboo, each of which is strung with a single heavy wire. The player strikes the log tuning keys or sounding boards with the tips of their fingers to produce distinct tones.\n\nOn the other hand, the Shamisen is a three-stringed Japanese musical instrument that resembles a banjo or a lute. It features a long neck with three tuning pegs, a rounded body covered in animal skin, and a bridge where the three strings rest. The player uses a large plectrum called a bachi to strike the strings, producing a sound reminiscent of a guitar or a sitar.\n\nIn conclusion, the Kpanlogo is a string instrument, as it is playe
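The output above starts with `<s><s>` and still contains the prompt. Two things appear to cause this: the prompt string already begins with `<s>` while `add_special_tokens=True` adds a second BOS token, and the decoded text then no longer matches the prompt string exactly, so `.replace(prompt, "")` has nothing to remove. One possible alternative, sketched here under the assumption that the same `model` and `tokenizer` objects are used, is to slice off the prompt by token position. The name `generate_completion` is hypothetical.

```python
def generate_completion(prompt, model, tokenizer, max_new_tokens=512):
    """Return only the newly generated text for a prompt that already contains <s>."""
    # The prompt string already starts with "<s>", so do not add a second BOS.
    inputs = tokenizer(prompt, return_tensors="pt", add_special_tokens=False)
    inputs = inputs.to(model.device)
    prompt_length = inputs["input_ids"].shape[1]

    generated_ids = model.generate(
        **inputs,
        max_new_tokens=max_new_tokens,
        do_sample=True,
        pad_token_id=tokenizer.eos_token_id,
    )
    # Keep only the tokens produced after the prompt, then drop special tokens such as </s>.
    new_token_ids = generated_ids[0][prompt_length:]
    return tokenizer.decode(new_token_ids, skip_special_tokens=True)
```

Calling `generate_completion(prompt, model, tokenizer)` should then return just the model's answer, without the echoed instruction.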

### **Setting up the Training**
We will use the Hugging Face `transformers` and `peft` libraries.

In [24]:
from peft import AutoPeftModelForCausalLM, LoraConfig, get_peft_model, prepare_model_for_kbit_training

peft_config = LoraConfig(r=8, lora_alpha=16, lora_dropout=0.05, bias="none", task_type="CAUSAL_LM")


* The model must be prepared for training in 4-bit, so we use the **`prepare_model_for_kbit_training`** function from `peft`.




In [25]:
model = prepare_model_for_kbit_training(model)
model = get_peft_model(model, peft_config)
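To get a sense of how small the LoRA adapter is, the count below estimates the number of trainable parameters for `r=8`. The layer shapes and the choice of adapted modules are assumptions not stated in this notebook: 32 decoder layers, with adapters on `q_proj` (4096 to 4096) and `v_proj` (4096 to 1024), which is what `peft` is believed to pick by default for this architecture when `target_modules` is not set.

```python
# Assumed Mistral-7B shapes (not stated in this notebook).
NUM_LAYERS = 32
TARGET_SHAPES = {"q_proj": (4096, 4096), "v_proj": (4096, 1024)}  # (in, out)

def lora_param_count(r, shapes=TARGET_SHAPES, num_layers=NUM_LAYERS):
    """Each adapted layer adds A (r x in) and B (out x r): r * (in + out) weights."""
    per_layer = sum(r * (d_in + d_out) for d_in, d_out in shapes.values())
    return per_layer * num_layers

r, lora_alpha = 8, 16
trainable = lora_param_count(r)
print(f"trainable LoRA parameters: {trainable:,}")
print(f"update scaling lora_alpha / r: {lora_alpha / r}")
```

Under these assumptions the adapter holds a few million weights, a tiny fraction of the frozen base model. Calling `model.print_trainable_parameters()` on the wrapped model reports the actual figure.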

### **Training Hyperparameters**
The right hyperparameters depend on how long you are willing to train. Pay special attention to the following:

* `num_train_epochs` / `max_steps`: control how long training runs. `num_train_epochs` is the number of passes over the data, while `max_steps` is the total number of optimizer steps and takes precedence when set to a positive value. Too many iterations may lead to overfitting!

* `learning_rate`: the step size of each update. Too high and training can become unstable; too low and convergence is slow.

In [26]:
from transformers import TrainingArguments
output_model= "mistral_instruct_generation"
training_arguments = TrainingArguments(
        output_dir=output_model,
        per_device_train_batch_size=1,
        gradient_accumulation_steps=4,
        optim="paged_adamw_32bit",
        learning_rate=2e-4,
        lr_scheduler_type="cosine",
        save_strategy="epoch",
        logging_steps=10,
        num_train_epochs=1,  # ignored here: max_steps takes precedence when set
        max_steps=100,
        fp16=True,
)
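With the arguments above, two derived quantities are worth keeping in mind: the effective batch size and the learning rate over time. The sketch below assumes a single GPU and no warmup (none is configured), and `cosine_lr` is a hypothetical helper that mirrors the usual half-cosine decay rather than the library's own scheduler code.

```python
import math

per_device_train_batch_size = 1
gradient_accumulation_steps = 4
learning_rate = 2e-4
max_steps = 100

# Sequences consumed per optimizer step, assuming a single GPU.
effective_batch_size = per_device_train_batch_size * gradient_accumulation_steps

def cosine_lr(step, base_lr=learning_rate, total_steps=max_steps):
    """Cosine decay from base_lr to 0 with no warmup."""
    return 0.5 * base_lr * (1.0 + math.cos(math.pi * step / total_steps))

print("effective batch size:", effective_batch_size)
for step in (0, 50, 100):
    print(f"step {step:3d}: lr = {cosine_lr(step):.2e}")
```

The learning rate starts at `2e-4`, is halved by the midpoint, and reaches zero at step 100.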


### **Setting up the trainer**

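`max_seq_length`: maximum number of tokens per training sequence. With `packing=True`, short examples are concatenated and cut into sequences of this length.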


In [27]:
from trl import SFTTrainer

max_seq_length = 512

trainer = SFTTrainer(
  model=model,
  peft_config=peft_config,
  max_seq_length=max_seq_length,
  tokenizer=tokenizer,
  packing=True,
  formatting_func=create_prompt, # applies create_prompt to every example in the training and test datasets
  args=training_arguments,
  train_dataset=instruct_tune_dataset["train"],
  eval_dataset=instruct_tune_dataset["test"]
)

Generating train split: 0 examples [00:00, ? examples/s]

Generating train split: 0 examples [00:00, ? examples/s]
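The idea behind `packing=True` can be illustrated with a toy version. This is a simplified sketch, not TRL's actual implementation: `pack_sequences` is a hypothetical helper and the token ids are made up.

```python
def pack_sequences(tokenized_examples, seq_length, eos_id):
    """Concatenate token lists (EOS between examples) and cut fixed-length chunks."""
    buffer = []
    for tokens in tokenized_examples:
        buffer.extend(tokens + [eos_id])
    # Drop the incomplete tail so every chunk has exactly seq_length tokens.
    usable = len(buffer) - len(buffer) % seq_length
    return [buffer[i:i + seq_length] for i in range(0, usable, seq_length)]

# Hypothetical token ids standing in for three short tokenized prompts.
examples = [[5, 6, 7], [8, 9], [10, 11, 12, 13]]
packed = pack_sequences(examples, seq_length=4, eos_id=2)
print(packed)
```

Note how the second chunk mixes the end of one example with the start of the next. That is the trade-off of packing: no padding is wasted, but example boundaries no longer line up with sequence boundaries.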



### **Training starts here**

In [28]:
trainer.train()

You're using a LlamaTokenizerFast tokenizer. Please note that with a fast tokenizer, using the `__call__` method is faster than using a method to encode the text followed by a call to the `pad` method to get a padded encoding.


| Step | Training Loss |
|------|---------------|
| 10   | 2.3518        |
| 20   | 1.4203        |
| 30   | 1.2864        |
| 40   | 1.2612        |
| 50   | 1.2747        |
| 60   | 1.3086        |
| 70   | 1.2637        |
| 80   | 1.1476        |
| 90   | 1.1226        |
| 100  | 1.0869        |




TrainOutput(global_step=100, training_loss=1.3523885631561279, metrics={'train_runtime': 808.4496, 'train_samples_per_second': 0.495, 'train_steps_per_second': 0.124, 'total_flos': 8741766719078400.0, 'train_loss': 1.3523885631561279, 'epoch': 1.27})
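The reported metrics can be cross-checked against the training arguments. The arithmetic below copies its inputs from the `TrainOutput` and assumes a single GPU; the number of packed sequences per epoch is an estimate derived from the reported epoch value, not something the trainer prints.

```python
# Values copied from the TrainOutput above.
global_step = 100
train_runtime = 808.4496          # seconds
reported_epoch = 1.27

# From the TrainingArguments: 1 sequence per device x 4 accumulation steps.
effective_batch_size = 1 * 4      # assumes a single GPU

sequences_seen = global_step * effective_batch_size
samples_per_second = sequences_seen / train_runtime
steps_per_second = global_step / train_runtime
packed_sequences_per_epoch = sequences_seen / reported_epoch

print(f"sequences seen: {sequences_seen}")
print(f"samples/s: {samples_per_second:.3f}")
print(f"steps/s: {steps_per_second:.3f}")
print(f"approx. packed sequences per epoch: {packed_sequences_per_epoch:.0f}")
```

The throughput figures agree with `train_samples_per_second` and `train_steps_per_second`. The epoch value above 1 suggests the 100 steps covered a bit more than one pass over the packed data.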

### **Save the model**

In [32]:
trainer.save_model("mistral_instruct_generation")

In [33]:
merged_model = model.merge_and_unload()
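Conceptually, merging folds each low-rank update into the weight it adapts: `W_merged = W + (lora_alpha / r) * B @ A`. The toy example below illustrates that formula with plain Python lists. All numbers are made up, and the real `peft` merge on a quantized model involves additional steps that are not shown here.

```python
def matmul(a, b):
    """Multiply two matrices given as nested lists."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]

def merge_lora(W, A, B, lora_alpha, r):
    """Return W + (lora_alpha / r) * B @ A."""
    scale = lora_alpha / r
    delta = matmul(B, A)
    return [[w + scale * d for w, d in zip(w_row, d_row)] for w_row, d_row in zip(W, delta)]

# Toy 2x2 weight with a rank-1 adapter (numbers chosen only for illustration).
W = [[1.0, 0.0], [0.0, 1.0]]
A = [[1.0, 2.0]]        # r x in
B = [[0.5], [0.0]]      # out x r
merged = merge_lora(W, A, B, lora_alpha=2, r=1)
print(merged)
```

After merging, inference needs only the single merged matrix, so the adapter adds no extra computation at generation time.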

In [34]:
def generate_response(prompt, model):
  encoded_input = tokenizer(prompt,  return_tensors="pt", add_special_tokens=True)
  model_inputs = encoded_input.to('cuda')

  generated_ids = model.generate(**model_inputs, max_new_tokens=512, do_sample=True, pad_token_id=tokenizer.eos_token_id)

  decoded_output = tokenizer.batch_decode(generated_ids)

  return decoded_output[0]

### **Example No:1**

In [38]:
# Use a predefined template for instructions
prompt = '''<s>[INST] Below is an instruction that describes a task. Write a response that appropriately completes the request. '''
prompt += '''Pick if these would be useful or not useful for a high school student to put in their backpack. Notebooks, textbook, desk lamp, pencil pouch, beach ball, pillow, laptop. [/INST]'''
response = generate_response(prompt, model)
# Print the response with formatted output
print(response)



<s><s> [INST] Below is an instruction that describes a task. Write a response that appropriately completes the request. Pick if these would be useful or not useful for a high school student to put in their backpack. Notebooks, textbook, desk lamp, pencil pouch, beach ball, pillow, laptop. [/INST] Of the items listed, the following would be useful for a high school student to put in their backpack:
[ ] Notebooks
[ ] Textbook
[ ] Desk lamp
[ ] Pencil pouch
[ ] Laptop
[ ] Beach ball
[ ] Pillow

Useful: Notebooks, Textbook, Desk lamp, Pencil pouch, Laptop.

Not Useful: Beach ball, Pillow.</s>


### **Example No:2**

In [39]:
# Use a predefined template for instructions
prompt = '''<s>[INST] Below is an instruction that describes a task. Write a response that appropriately completes the request. '''
prompt += '''Which sports would be easiest to find success in if you’re not tall: baseball, soccer, basketball, bowling. [/INST]'''
response = generate_response(prompt, model)
# Print the response with formatted output
print(response)

<s><s> [INST] Below is an instruction that describes a task. Write a response that appropriately completes the request. Which sports would be easiest to find success in if you’re not tall: baseball, soccer, basketball, bowling. [/INST] Bowling and baseball would likely be the easiest sports to find success in without being particularly tall. Baseball and soccer do not necessarily require height to excel at. Basketball, on the other hand, heavily favors tall players as it is a sport that relies largely on height advantages for playing effectively.</s>


### **Example No:3**

In [40]:
# Use a predefined template for instructions
prompt = '''<s>[INST] Below is an instruction that describes a task. Write a response that appropriately completes the request. '''
prompt += '''
Classify the following types of cars as "economy" or "luxury": Ford, Chevrolet, Lamborghini, Ferrari, Mercedes, Honda, Lexus, Toyota, Nissan, Subaru [/INST]'''
response = generate_response(prompt, model)
# Print the response with formatted output
print(response)

<s><s> [INST] Below is an instruction that describes a task. Write a response that appropriately completes the request. 
Classify the following types of cars as "economy" or "luxury": Ford, Chevrolet, Lamborghini, Ferrari, Mercedes, Honda, Lexus, Toyota, Nissan, Subaru [/INST] Ford, Chevrolet, Honda, Toyota, Nissan and Subaru are typically classified as economy cars. Lamborghini, Ferrari, Mercedes, and Lexus are classified as luxury cars.</s>
