<a href="https://colab.research.google.com/github/fatemafaria142/Finetuning-of-Mistral-7B-using-instruct-v3-dataset/blob/main/Finetuning_Mistral_7B_using_instruct_v3_dataset.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

### **Install Required Packages**

In [1]:
!pip install accelerate peft bitsandbytes transformers trl datasets torch

Collecting accelerate
  Downloading accelerate-0.26.1-py3-none-any.whl (270 kB)
Collecting peft
  Downloading peft-0.7.1-py3-none-any.whl (168 kB)
Collecting bitsandbytes
  Downloading bitsandbytes-0.42.0-py3-none-any.whl (105.0 MB)
Collecting trl
  Downloading trl-0.7.9-py3-none-any.whl (141 kB)
Collecting datasets
  Downloading datasets-2.16.1-py3-none-any.whl (507 kB)

### **Dataset Link:** https://huggingface.co/datasets/mosaicml/instruct-v3

In [2]:
from datasets import load_dataset

instruct_tune_dataset = load_dataset("mosaicml/instruct-v3")

The secret `HF_TOKEN` does not exist in your Colab secrets.
To authenticate with the Hugging Face Hub, create a token in your settings tab (https://huggingface.co/settings/tokens), set it as secret in your Google Colab and restart your session.
You will be able to reuse this secret in all of your notebooks.
Please note that authentication is recommended but still optional to access public models or datasets.


Downloading readme:   0%|          | 0.00/3.05k [00:00<?, ?B/s]

Downloading data:   0%|          | 0.00/127M [00:00<?, ?B/s]

Downloading data:   0%|          | 0.00/10.3M [00:00<?, ?B/s]

Generating train split:   0%|          | 0/56167 [00:00<?, ? examples/s]

Generating test split:   0%|          | 0/6807 [00:00<?, ? examples/s]

### **Dataset structure**

* The dataset contains three columns: `prompt`, `response`, and `source`. We only need `prompt` and `response` for training. The `source` column takes 9 possible values, and we are interested in only one of them.

In [3]:
instruct_tune_dataset

DatasetDict({
    train: Dataset({
        features: ['prompt', 'response', 'source'],
        num_rows: 56167
    })
    test: Dataset({
        features: ['prompt', 'response', 'source'],
        num_rows: 6807
    })
})
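
To see which values the `source` column takes before filtering, one option is to tally them with `collections.Counter`. The sketch below runs on a few hypothetical rows: the `toy_rows` list and the `count_sources` helper are illustrative and not part of the notebook, and `other_source` is a placeholder name. With the real data, the same function could be applied to `instruct_tune_dataset["train"]`.

```python
from collections import Counter

def count_sources(rows):
    """Return a Counter mapping each `source` value to its number of rows."""
    return Counter(row["source"] for row in rows)

# Hypothetical stand-in rows; the real training split has 56,167 rows.
toy_rows = [
    {"prompt": "p1", "response": "r1", "source": "dolly_hhrlhf"},
    {"prompt": "p2", "response": "r2", "source": "other_source"},
    {"prompt": "p3", "response": "r3", "source": "dolly_hhrlhf"},
]

counts = count_sources(toy_rows)
for source, n in counts.most_common():
    print(f"{source}: {n}")
```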

* We want a model that generates instructions: given a response, it should produce an instruction that could have elicited it. To do this, we train only on the **`dolly_hhrlhf`** subset, so we filter out every row that comes from a different source.

In [4]:
instruct_tune_dataset = instruct_tune_dataset.filter(lambda x: x["source"] == "dolly_hhrlhf")

Filter:   0%|          | 0/56167 [00:00<?, ? examples/s]

Filter:   0%|          | 0/6807 [00:00<?, ? examples/s]

In [5]:
instruct_tune_dataset

DatasetDict({
    train: Dataset({
        features: ['prompt', 'response', 'source'],
        num_rows: 34333
    })
    test: Dataset({
        features: ['prompt', 'response', 'source'],
        num_rows: 4771
    })
})

### **We will use just a small subset of the data for this training example**

In [9]:
instruct_tune_dataset["train"] = instruct_tune_dataset["train"].select(range(500))

In [10]:
instruct_tune_dataset["test"] = instruct_tune_dataset["test"].select(range(100))

In [11]:
instruct_tune_dataset

DatasetDict({
    train: Dataset({
        features: ['prompt', 'response', 'source'],
        num_rows: 500
    })
    test: Dataset({
        features: ['prompt', 'response', 'source'],
        num_rows: 100
    })
})

In [12]:
def create_prompt(sample):
  """
  Build one training example with the roles reversed: the dataset's
  `response` becomes the model input, and the original instruction
  (extracted from `prompt`) becomes the target the model must generate.
  """
  bos_token = "<s>"
  eos_token = "</s>"
  original_system_message = "Below is an instruction that describes a task. Write a response that appropriately completes the request."
  system_message = "Use the provided input to create an instruction that could have been used to generate the response with an LLM."

  # Strip the original template so that only the instruction text remains.
  response = sample["prompt"].replace(original_system_message, "").replace("\n\n### Instruction\n", "").replace("\n### Response\n", "").strip()
  # Named `input_text` to avoid shadowing the built-in `input`.
  input_text = sample["response"]

  full_prompt = ""
  full_prompt += bos_token
  full_prompt += "### Instruction:"
  full_prompt += "\n" + system_message
  full_prompt += "\n\n### Input:"
  full_prompt += "\n" + input_text
  full_prompt += "\n\n### Response:"
  full_prompt += "\n" + response
  full_prompt += eos_token

  return full_prompt

In [13]:
create_prompt(instruct_tune_dataset["train"][0])

'<s>### Instruction:\nUse the provided input to create an instruction that could have been used to generate the response with an LLM.\n\n### Input:\nThere are more than 12,000 species of grass. The most common is Kentucky Bluegrass, because it grows quickly, easily, and is soft to the touch. Rygrass is shiny and bright green colored. Fescues are dark green and shiny. Bermuda grass is harder but can grow in drier soil.\n\n### Response:\nWhat are different types of grass?</s>'
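
The chain of `replace` calls in `create_prompt` can be checked in isolation. The sketch below assumes the original `prompt` field follows the layout implied by those calls (the system message, then `### Instruction`, the instruction text, and `### Response`). The `raw_prompt` string is a reconstructed example, not a row copied from the dataset, and `extract_instruction` is a hypothetical helper name.

```python
ORIGINAL_SYSTEM_MESSAGE = (
    "Below is an instruction that describes a task. "
    "Write a response that appropriately completes the request."
)

def extract_instruction(prompt: str) -> str:
    """Remove the template scaffolding and return only the instruction text."""
    return (
        prompt.replace(ORIGINAL_SYSTEM_MESSAGE, "")
        .replace("\n\n### Instruction\n", "")
        .replace("\n### Response\n", "")
        .strip()
    )

# Assumed layout of a raw `prompt` value (reconstructed for illustration).
raw_prompt = (
    ORIGINAL_SYSTEM_MESSAGE
    + "\n\n### Instruction\nWhat are different types of grass?"
    + "\n\n### Response\n"
)

instruction = extract_instruction(raw_prompt)
print(instruction)
```

If the real rows used a different layout (for example a colon after `### Instruction`), the `replace` calls would leave the headers in place, so printing one formatted sample is a cheap sanity check.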

### **Initializing the Model**
* Load the model in 4-bit precision (NF4) with double quantization, and use float16 as the compute data type.

* Note that we start from the instruct-tuned model rather than the base model, because fine-tuning a base model for this task would require substantially more data.
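
A rough back-of-the-envelope calculation shows why 4-bit loading with double quantization matters on a single GPU. The sketch assumes about 7.24 billion parameters for Mistral-7B and uses the per-parameter overhead figures described in the QLoRA paper: one 32-bit constant per block of 64 weights, which double quantization reduces to 8-bit constants plus one 32-bit second-level constant per 256 blocks. It ignores activations, the LoRA adapters, optimizer state, and any layers that stay unquantized, so the numbers are estimates only.

```python
N_PARAMS = 7.24e9   # assumed approximate parameter count for Mistral-7B
BLOCK_SIZE = 64     # weights per quantization block (QLoRA paper setting)
DQ_GROUP = 256      # first-level constants per second-level constant

def gigabytes(total_bits: float) -> float:
    """Convert a bit count to decimal gigabytes."""
    return total_bits / 8 / 1e9

fp16_gb = gigabytes(N_PARAMS * 16)   # weights stored in 16-bit floats
nf4_gb = gigabytes(N_PARAMS * 4)     # weights stored in 4-bit NF4

# Extra bits per parameter spent on quantization constants.
overhead_bits_plain = 32 / BLOCK_SIZE
overhead_bits_double = 8 / BLOCK_SIZE + 32 / (BLOCK_SIZE * DQ_GROUP)

print(f"fp16 weights:            {fp16_gb:.2f} GB")
print(f"4-bit weights:           {nf4_gb:.2f} GB")
print(f"constants (plain):       {gigabytes(N_PARAMS * overhead_bits_plain):.2f} GB")
print(f"constants (double quant): {gigabytes(N_PARAMS * overhead_bits_double):.2f} GB")
```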

In [31]:
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig
import torch

bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype="float16",
    bnb_4bit_use_double_quant=True,
)


* https://huggingface.co/mistralai/Mistral-7B-Instruct-v0.1

In [32]:
model_id = "mistralai/Mistral-7B-Instruct-v0.1"

In [33]:
model = AutoModelForCausalLM.from_pretrained(
        model_id, quantization_config=bnb_config, device_map="auto", use_cache=False
    )

Loading checkpoint shards:   0%|          | 0/2 [00:00<?, ?it/s]

In [34]:
tokenizer = AutoTokenizer.from_pretrained("mistralai/Mistral-7B-v0.1")
# The tokenizer defines no padding token, so reuse the end-of-sequence token.
tokenizer.pad_token = tokenizer.eos_token
tokenizer.padding_side = "right"

### **Let's examine how well the model currently does at this task:**

In [35]:
def generate_response(prompt, model):
  encoded_input = tokenizer(prompt,  return_tensors="pt", add_special_tokens=True)
  model_inputs = encoded_input.to('cuda')

  generated_ids = model.generate(**model_inputs, max_new_tokens=1000, do_sample=True, pad_token_id=tokenizer.eos_token_id)

  decoded_output = tokenizer.batch_decode(generated_ids)

  return decoded_output[0].replace(prompt, "")

In [36]:
prompt="### Instruction:\nUse the provided input to create an instruction that could have been used to generate the response with an LLM.### Input:\nThere are more than 12,000 species of grass. The most common is Kentucky Bluegrass, because it grows quickly, easily, and is soft to the touch. Rygrass is shiny and bright green colored. Fescues are dark green and shiny. Bermuda grass is harder but can grow in drier soil.\n\n### Response:"

In [37]:
generate_response(prompt, model)



"<s> \nIf you're looking for a grass species that grows quickly, easily and is soft to the touch, Kentucky Bluegrass is the most common and a great choice. If you want a shiny and bright green-colored grass, Rygrass could be the right option for you. Fescues are dark green and shiny, if you prefer that look. If you're looking for a harder grass that can still grow in drier soil, Bermuda grass is another option to consider.</s>"

### **Setting up the Training**
We will use the Hugging Face `transformers` and `peft` libraries.

In [38]:
from peft import AutoPeftModelForCausalLM, LoraConfig, get_peft_model, prepare_model_for_kbit_training

peft_config = LoraConfig(r=8, lora_alpha=16, lora_dropout=0.05, bias="none", task_type="CAUSAL_LM")


* We need to prepare the model for 4-bit training, so we use the **`prepare_model_for_kbit_training`** function from `peft`.




In [39]:
model = prepare_model_for_kbit_training(model)
model = get_peft_model(model, peft_config)
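
To get a feel for how small the trainable adapter is, the sketch below counts the LoRA parameters by hand. It assumes that `peft` falls back to its default Mistral target modules (`q_proj` and `v_proj`), since the `LoraConfig` above sets no `target_modules`, and it uses Mistral-7B's published shapes: hidden size 4096, 32 layers, and 8 key/value heads of dimension 128. Calling `model.print_trainable_parameters()` on the wrapped model is the way to confirm the figure.

```python
R = 8             # LoRA rank, as in the notebook's LoraConfig
LORA_ALPHA = 16   # LoRA alpha, as in the notebook's LoraConfig

HIDDEN = 4096     # assumed Mistral-7B hidden size
N_LAYERS = 32     # assumed number of transformer layers
KV_DIM = 8 * 128  # assumed v_proj output size (8 KV heads x 128 dims)

def lora_params(in_features: int, out_features: int, r: int) -> int:
    """LoRA adds A (r x in_features) and B (out_features x r) beside a frozen weight."""
    return r * (in_features + out_features)

per_layer = lora_params(HIDDEN, HIDDEN, R) + lora_params(HIDDEN, KV_DIM, R)
total_trainable = per_layer * N_LAYERS
scaling = LORA_ALPHA / R  # factor applied to the low-rank update

print(f"trainable LoRA parameters: {total_trainable:,}")
print(f"LoRA scaling factor: {scaling}")
```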

### **Training Hyperparameters**
The choice of hyperparameters is contingent upon the desired training duration. Pay special attention to the following key factors:

* `num_train_epochs` / `max_steps`: Control how many passes over the data, or how many optimizer steps, are run. When `max_steps` is set to a positive value, it overrides `num_train_epochs`. Too many iterations can lead to overfitting.

* `learning_rate`: Sets the size of each optimizer update. A value that is too high can make training unstable, and a value that is too low slows convergence.
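
The interaction between batch size, gradient accumulation, and `max_steps` is simple arithmetic. The sketch below uses the values from this notebook's training configuration (`per_device_train_batch_size=1`, `gradient_accumulation_steps=4`, `max_steps=50`) and assumes a single GPU; `num_devices` is an assumption, not something the notebook sets.

```python
per_device_train_batch_size = 1
gradient_accumulation_steps = 4
max_steps = 50
num_devices = 1  # assumption: one GPU, as on a standard Colab runtime

# Sequences contributing to each optimizer update.
effective_batch_size = (
    per_device_train_batch_size * gradient_accumulation_steps * num_devices
)
# Total sequences processed over the whole run.
sequences_seen = effective_batch_size * max_steps

print(f"effective batch size: {effective_batch_size}")
print(f"sequences seen in {max_steps} steps: {sequences_seen}")
```

Accumulating gradients over several small batches keeps peak memory low while still averaging each update over more than one sequence.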

In [48]:
from transformers import TrainingArguments
output_model= "mistral_instruct_generation"
training_arguments = TrainingArguments(
        output_dir=output_model,
        per_device_train_batch_size=1,
        gradient_accumulation_steps=4,
        optim="paged_adamw_32bit",
        learning_rate=2e-4,
        lr_scheduler_type="cosine",
        save_strategy="epoch",
        logging_steps=10,
        num_train_epochs=1,
        max_steps=50,
        fp16=True,
)
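
With `lr_scheduler_type="cosine"` and no warmup configured, the learning rate decays from its peak to zero over the run along half a cosine wave. The function below is a standalone re-derivation of that curve under those assumptions, not a call into `transformers`, evaluated at the steps where the loss is logged. `cosine_lr` is a hypothetical helper name.

```python
import math

PEAK_LR = 2e-4     # learning_rate from the training arguments
TOTAL_STEPS = 50   # max_steps from the training arguments

def cosine_lr(step: int, peak_lr: float = PEAK_LR, total_steps: int = TOTAL_STEPS) -> float:
    """Half-cosine decay from peak_lr at step 0 to zero at total_steps (no warmup)."""
    progress = step / total_steps
    return peak_lr * 0.5 * (1.0 + math.cos(math.pi * progress))

for step in (0, 10, 20, 30, 40, 50):
    print(f"step {step:2d}: lr = {cosine_lr(step):.2e}")
```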


### **Setting up the trainer**

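`max_seq_length`: the maximum number of tokens in each training sequence. With `packing=True`, several short examples are concatenated to fill sequences of this length.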
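
Packing can be pictured as concatenating tokenized examples, separated by an end-of-sequence token, into one stream and cutting that stream into fixed-length chunks. The sketch below is a simplified illustration with made-up token IDs and a tiny `seq_length`; `pack_sequences` is a hypothetical helper, not the implementation used inside `trl`.

```python
from typing import List

def pack_sequences(
    examples: List[List[int]], seq_length: int, eos_id: int
) -> List[List[int]]:
    """Concatenate examples (each followed by eos_id) and split into full chunks."""
    stream: List[int] = []
    for tokens in examples:
        stream.extend(tokens)
        stream.append(eos_id)
    # Keep only complete chunks; a trailing partial chunk is dropped.
    n_full = len(stream) // seq_length
    return [stream[i * seq_length:(i + 1) * seq_length] for i in range(n_full)]

# Made-up token IDs standing in for three tokenized examples.
toy_examples = [[5, 6, 7], [8, 9], [10, 11, 12, 13]]
packed = pack_sequences(toy_examples, seq_length=4, eos_id=2)
print(packed)
```

The benefit is that no compute is spent on padding tokens; the cost is that one sequence may contain pieces of unrelated examples.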


In [49]:
from trl import SFTTrainer

max_seq_length = 1024

trainer = SFTTrainer(
  model=model,
  peft_config=peft_config,
  max_seq_length=max_seq_length,
  tokenizer=tokenizer,
  packing=True,
  formatting_func=create_prompt, # applies create_prompt to every example in the training and test sets
  args=training_arguments,
  train_dataset=instruct_tune_dataset["train"],
  eval_dataset=instruct_tune_dataset["test"]
)

Generating train split: 0 examples [00:00, ? examples/s]

Generating train split: 0 examples [00:00, ? examples/s]

### **Training starts here**

In [50]:
trainer.train()

| Step | Training Loss |
|------|---------------|
| 10   | 1.7008        |
| 20   | 1.5103        |
| 30   | 1.4407        |
| 40   | 1.4307        |
| 50   | 1.4135        |




TrainOutput(global_step=50, training_loss=1.4992018699645997, metrics={'train_runtime': 754.1514, 'train_samples_per_second': 0.265, 'train_steps_per_second': 0.066, 'total_flos': 8741766719078400.0, 'train_loss': 1.4992018699645997, 'epoch': 2.15})

### **Save the model**

In [51]:
trainer.save_model("mistral_instruct_generation")

In [52]:
merged_model = model.merge_and_unload()



In [53]:
def generate_response(prompt, model):
  encoded_input = tokenizer(prompt,  return_tensors="pt", add_special_tokens=True)
  model_inputs = encoded_input.to('cuda')

  generated_ids = model.generate(**model_inputs, max_new_tokens=1000, do_sample=True, pad_token_id=tokenizer.eos_token_id)

  decoded_output = tokenizer.batch_decode(generated_ids)

  return decoded_output[0]

In [54]:
generate_response("### Instruction:\nUse the provided input to create an instruction that could have been used to generate the response with an LLM.### Input:\nThe first thing to know is that guacamole is a popular dip made from avocados, tomatoes, onions, and spices. It originated in Mexico, and is generally eaten as an appetizer or snack. Here are some simple steps: Choose 2 ripe avocados, about 2 cups Mash avocados in a large bowl using a fork or potato masher Add in 1-2 chopped tomatoes, salt, pepper, 1-2 garlic cloves, minced, 1-2 teaspoons fresh lime juice, and 1⁄4-1⁄2 cup chopped cilantro (optional). Let sit for about 10 minutes Taste, and add more salt, pepper, cilantro, or lime juice if needed Guacamole is usually served with tortilla chips. There are many variations, such as adding sour cream, diced vegetables, or more spicy hot peppers.\n\n### Response:", merged_model)



"<s> ### Instruction:\nUse the provided input to create an instruction that could have been used to generate the response with an LLM.### Input:\nThe first thing to know is that guacamole is a popular dip made from avocados, tomatoes, onions, and spices. It originated in Mexico, and is generally eaten as an appetizer or snack. Here are some simple steps: Choose 2 ripe avocados, about 2 cups Mash avocados in a large bowl using a fork or potato masher Add in 1-2 chopped tomatoes, salt, pepper, 1-2 garlic cloves, minced, 1-2 teaspoons fresh lime juice, and 1⁄4-1⁄2 cup chopped cilantro (optional). Let sit for about 10 minutes Taste, and add more salt, pepper, cilantro, or lime juice if needed Guacamole is usually served with tortilla chips. There are many variations, such as adding sour cream, diced vegetables, or more spicy hot peppers.\n\n### Response:\nHere's an instruction on how to make guacamole that can be used with an LLM: \nStart by choosing 2 ripe avocados. Next, mash the avocado