<a href="https://colab.research.google.com/github/fatemafaria142/Fine-Tuning-of-Mistral-7B-on-Diverse-Instruction-Based-Datasets/blob/main/Mistral_7B_Instruct_v0_2_using_instruct_v3_dataset.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

### **Install Required Packages**

In [1]:
!pip install accelerate peft bitsandbytes transformers trl datasets torch

Collecting accelerate
  Downloading accelerate-0.26.1-py3-none-any.whl (270 kB)
Collecting peft
  Downloading peft-0.7.1-py3-none-any.whl (168 kB)
Collecting bitsandbytes
  Downloading bitsandbytes-0.42.0-py3-none-any.whl (105.0 MB)
Collecting trl
  Downloading trl-0.7.9-py3-none-any.whl (141 kB)
Collecting datasets
  Downloading datasets-2.16.1-py3-none-any.whl (507 kB)

#### **Dataset Link: https://huggingface.co/datasets/mosaicml/instruct-v3**

In [2]:
from datasets import load_dataset

instruct_tune_dataset = load_dataset("mosaicml/instruct-v3")

The secret `HF_TOKEN` does not exist in your Colab secrets.
To authenticate with the Hugging Face Hub, create a token in your settings tab (https://huggingface.co/settings/tokens), set it as secret in your Google Colab and restart your session.
You will be able to reuse this secret in all of your notebooks.
Please note that authentication is recommended but still optional to access public models or datasets.


Downloading readme:   0%|          | 0.00/3.05k [00:00<?, ?B/s]

Downloading data:   0%|          | 0.00/127M [00:00<?, ?B/s]

Downloading data:   0%|          | 0.00/10.3M [00:00<?, ?B/s]

Generating train split:   0%|          | 0/56167 [00:00<?, ? examples/s]

Generating test split:   0%|          | 0/6807 [00:00<?, ? examples/s]

### **Dataset structure**
* The dataset contains three columns: `prompt`, `response`, and `source`. We are only interested in the `prompt` and `response` columns. The `source` column has 9 possible values, and we are only interested in one of them.

In [3]:
instruct_tune_dataset

DatasetDict({
    train: Dataset({
        features: ['prompt', 'response', 'source'],
        num_rows: 56167
    })
    test: Dataset({
        features: ['prompt', 'response', 'source'],
        num_rows: 6807
    })
})

* We aim to create a model specifically for instruction following, so we focus exclusively on the `dolly_hhrlhf` subset. We filter out every other source and fine-tune the model on this subset only.

In [4]:
instruct_tune_dataset = instruct_tune_dataset.filter(lambda x: x["source"] == "dolly_hhrlhf")

Filter:   0%|          | 0/56167 [00:00<?, ? examples/s]

Filter:   0%|          | 0/6807 [00:00<?, ? examples/s]

In [5]:
instruct_tune_dataset

DatasetDict({
    train: Dataset({
        features: ['prompt', 'response', 'source'],
        num_rows: 34333
    })
    test: Dataset({
        features: ['prompt', 'response', 'source'],
        num_rows: 4771
    })
})
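Before filtering, it can help to check how many rows each `source` contributes. The sketch below tallies a few toy rows with `collections.Counter` and then applies the same predicate as the filter above. The rows and the `other_source` name are made up for illustration and are not taken from the dataset.

```python
from collections import Counter

# Toy rows that mimic the three columns of the dataset (values are illustrative)
rows = [
    {"prompt": "p1", "response": "r1", "source": "dolly_hhrlhf"},
    {"prompt": "p2", "response": "r2", "source": "other_source"},  # hypothetical source name
    {"prompt": "p3", "response": "r3", "source": "dolly_hhrlhf"},
]

# Count how many rows come from each source
source_counts = Counter(row["source"] for row in rows)

# Keep only the rows from the subset of interest
kept = [row for row in rows if row["source"] == "dolly_hhrlhf"]

print(source_counts)
print(len(kept))
```

On the unfiltered dataset, the same tally could be run over the `source` column of the train split.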

### **We will use just a small subset of the data for this training example**

In [6]:
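instruct_tune_dataset["train"] = instruct_tune_dataset["train"].select(range(3500))
# Take the evaluation rows from the test split, not from train, so they do not overlap with the training data
instruct_tune_dataset["test"] = instruct_tune_dataset["test"].select(range(300))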

In [7]:
instruct_tune_dataset

DatasetDict({
    train: Dataset({
        features: ['prompt', 'response', 'source'],
        num_rows: 3500
    })
    test: Dataset({
        features: ['prompt', 'response', 'source'],
        num_rows: 300
    })
})

* Mistral-Instruct's prompt format uses the control tokens `[INST]` and `[/INST]` to mark the start and end of user messages (but not assistant messages!). **Mistral-instruct was trained with these tokens.**
* To leverage instruction fine-tuning, your prompt should be surrounded by `[INST]` and `[/INST]` tokens. The very first instruction should begin with a beginning-of-sentence (BOS) token id; subsequent instructions should not. Each assistant generation is ended by the end-of-sentence (EOS) token id.
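The rules above can be sketched for a multi-turn conversation as follows. `build_mistral_prompt` is a hypothetical helper written for illustration, and the exact whitespace around the control tokens is an assumption; the point is only where the BOS and EOS tokens go.

```python
def build_mistral_prompt(turns, bos_token="<s>", eos_token="</s>"):
    """Format (user, assistant) pairs; assistant may be None for a final open turn."""
    prompt = bos_token  # BOS appears once, before the very first instruction
    for user_msg, assistant_msg in turns:
        prompt += f"[INST] {user_msg} [/INST]"
        if assistant_msg is not None:
            # Every completed assistant reply is closed with EOS
            prompt += f" {assistant_msg}{eos_token}"
    return prompt


turns = [
    ("What is LoRA?", "A low-rank adapter method."),
    ("Why use it?", None),  # the model is expected to answer this turn
]
print(build_mistral_prompt(turns))
```

The single-turn training strings built in the next cells are the special case of one completed turn.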

In [14]:
def create_prompt(sample):
    """
    Update the prompt template:
    Combine the prompt and response into a single training string.
    """
    bos_token = "<s>"
    eos_token = "</s>"

    full_prompt = ""
    full_prompt += bos_token
    full_prompt += "[INST]"
    full_prompt += sample['prompt']
    full_prompt += "[/INST]"
    full_prompt += sample['response']
    full_prompt += eos_token

    return full_prompt

In [38]:
def create_prompt(sample):
    """
    Update the prompt template:
    Combine the prompt and response into a single training string.
    Strip the '### Instruction' and '### Response' section markers from the 'prompt' column.
    """
    bos_token = "<s>"
    eos_token = "</s>"

    # Remove the section markers (and their surrounding newlines) from the 'prompt' column
    cleaned_instruction = sample['prompt'].replace('\n\n### Instruction\n', '').replace('\n\n### Response\n', '')

    full_prompt = ""
    full_prompt += bos_token
    full_prompt += "[INST]"
    full_prompt += " " + cleaned_instruction
    full_prompt += "[/INST]"
    full_prompt += sample['response']
    full_prompt += eos_token

    return full_prompt

In [39]:
create_prompt(instruct_tune_dataset["train"][0])

'<s>[INST] Below is an instruction that describes a task. Write a response that appropriately completes the request.What are different types of grass?[/INST]There are more than 12,000 species of grass. The most common is Kentucky Bluegrass, because it grows quickly, easily, and is soft to the touch. Rygrass is shiny and bright green colored. Fescues are dark green and shiny. Bermuda grass is harder but can grow in drier soil.</s>'

In [40]:
create_prompt(instruct_tune_dataset["train"][1])

'<s>[INST] Below is an instruction that describes a task. Write a response that appropriately completes the request.How can I cook food while camping?[/INST]The best way to cook food is over a fire. You’ll need to build a fire and light it first, and then heat food in a pot on top of the fire.</s>'

In [42]:
create_prompt(instruct_tune_dataset["train"][2])

'<s>[INST] Below is an instruction that describes a task. Write a response that appropriately completes the request.What are some fun scenarios my kids can play with their Barbies?[/INST]Some fun scenarios for your kids to play with their Barbies include designing Barbies’ dream homes, organizing a Barbie-themed party, or staging a fashion show with their Barbies.  Barbie games can also involve learning basic household skills, such as cooking or cleaning, or practicing Barbie’s favorite activities, such as shopping or dancing.  Some other fun Barbie scenarios include Barbie vacation adventures or a makeover salon!</s>'

In [43]:
create_prompt(instruct_tune_dataset["train"][3])

'<s>[INST] Below is an instruction that describes a task. Write a response that appropriately completes the request.How many titles have Liverpool won?\nDomestically, the club has won 19 League titles, eight FA Cups, a record nine League Cups and 16 FA Community Shields. In international competitions, the club has won six European Cups, three UEFA Cups, four UEFA Super Cups—all English records—and one FIFA Club World Cup.[/INST]Liverpool has won 19 League titles, 8 FA cups, 9 League cups, 16 FA community shields and 6 European cups.</s>'

### **Initializing the Model**
* Load the model in 4-bit (NF4) with double quantization, and set float16 as the compute data type.

* We use the instruct-tuned model here rather than the base model, since fine-tuning a base model requires a more substantial amount of data!

In [46]:
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig
import torch

bnb_config = BitsAndBytesConfig(
        load_in_4bit=True, bnb_4bit_quant_type="nf4", bnb_4bit_compute_dtype="float16", bnb_4bit_use_double_quant=True
    )


* https://huggingface.co/mistralai/Mistral-7B-Instruct-v0.2

In [47]:
model_id = "mistralai/Mistral-7B-Instruct-v0.2"

In [48]:
model = AutoModelForCausalLM.from_pretrained(
        model_id, quantization_config=bnb_config, device_map="auto", use_cache=False
    )

config.json:   0%|          | 0.00/596 [00:00<?, ?B/s]

model.safetensors.index.json:   0%|          | 0.00/25.1k [00:00<?, ?B/s]

Downloading shards:   0%|          | 0/3 [00:00<?, ?it/s]

model-00001-of-00003.safetensors:   0%|          | 0.00/4.94G [00:00<?, ?B/s]

model-00002-of-00003.safetensors:   0%|          | 0.00/5.00G [00:00<?, ?B/s]

model-00003-of-00003.safetensors:   0%|          | 0.00/4.54G [00:00<?, ?B/s]

Loading checkpoint shards:   0%|          | 0/3 [00:00<?, ?it/s]

generation_config.json:   0%|          | 0.00/111 [00:00<?, ?B/s]
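The shard sizes in the download log give a rough idea of what 4-bit loading saves. The arithmetic below is a back-of-the-envelope estimate, assuming the checkpoint stores 16-bit weights (2 bytes per parameter). It ignores the quantization constants and any layers that stay unquantized, so the real footprint will be somewhat larger.

```python
shard_sizes_gb = [4.94, 5.00, 4.54]  # sizes reported in the download log
bytes_per_param_16bit = 2            # assumption: the checkpoint stores 16-bit weights
bits_per_param_4bit = 4

checkpoint_gb = sum(shard_sizes_gb)
approx_params_billion = checkpoint_gb / bytes_per_param_16bit
approx_4bit_gb = approx_params_billion * bits_per_param_4bit / 8

print(f"checkpoint size:     {checkpoint_gb:.2f} GB")
print(f"parameters:         ~{approx_params_billion:.2f} billion")
print(f"4-bit weights only: ~{approx_4bit_gb:.2f} GB")
```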

In [49]:
tokenizer = AutoTokenizer.from_pretrained("mistralai/Mistral-7B-Instruct-v0.2")
tokenizer.pad_token = tokenizer.eos_token
tokenizer.padding_side = "right"

tokenizer_config.json:   0%|          | 0.00/1.46k [00:00<?, ?B/s]

tokenizer.model:   0%|          | 0.00/493k [00:00<?, ?B/s]

tokenizer.json:   0%|          | 0.00/1.80M [00:00<?, ?B/s]

special_tokens_map.json:   0%|          | 0.00/72.0 [00:00<?, ?B/s]

### **Let's examine how well the model does at this task currently:**

In [50]:
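def generate_response(prompt, model):
  # The prompt already starts with "<s>", so add_special_tokens=True prepends a second BOS token
  encoded_input = tokenizer(prompt,  return_tensors="pt", add_special_tokens=True)
  model_inputs = encoded_input.to('cuda')

  # Sample up to 512 new tokens; the EOS token doubles as the padding token
  generated_ids = model.generate(**model_inputs, max_new_tokens=512, do_sample=True, pad_token_id=tokenizer.eos_token_id)

  decoded_output = tokenizer.batch_decode(generated_ids)

  # The decoded text begins with "<s><s> [INST]", so it does not contain the exact prompt string and replace() leaves it unchanged
  return decoded_output[0].replace(prompt, "")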

In [70]:
# Use a predefined template for instructions
prompt = "<s>[INST] Below is an instruction that describes a task. Write a response that appropriately completes the request. "
prompt += "Was She Couldn't Say No movie re-released? She Couldn't Say No is a 1954 American rural comedy film starring Robert Mitchum, Jean Simmons and Arthur Hunnicutt. The last film in the long directing career of Lloyd Bacon, it was later re-released as Beautiful but Dangerous [/INST]"
print(prompt)

<s>[INST] Below is an instruction that describes a task. Write a response that appropriately completes the request. Was She Couldn't Say No movie re-released? She Couldn't Say No is a 1954 American rural comedy film starring Robert Mitchum, Jean Simmons and Arthur Hunnicutt. The last film in the long directing career of Lloyd Bacon, it was later re-released as Beautiful but Dangerous [/INST]


In [52]:
generate_response(prompt, model)



"<s><s> [INST] Below is an instruction that describes a task. Write a response that appropriately completes the request. Was She Couldn't Say No movie re-released? She Couldn't Say No is a 1954 American rural comedy film starring Robert Mitchum, Jean Simmons and Arthur Hunnicutt. The last film in the long directing career of Lloyd Bacon, it was later re-released as Beautiful but Dangerous [/INST] Yes, that is correct. She Couldn't Say No, the 1954 American rural comedy film starring Robert Mitchum, Jean Simmons, and Arthur Hunnicutt, was indeed re-released under the title Beautiful but Dangerous. Lloyd Bacon directed the film, which was his last work in the film industry.</s>"

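The returned text above still contains the prompt: the decoded string starts with `<s><s> [INST]`, which does not match the prompt string exactly, so `replace` has nothing to remove. A more robust option is to drop the prompt tokens by position before decoding. The sketch below shows the idea on plain lists of token ids; the ids are arbitrary, and `strip_prompt_tokens` is a hypothetical helper, not part of `transformers`.

```python
def strip_prompt_tokens(generated_ids, prompt_length):
    """Return only the tokens produced after the prompt, for each sequence in the batch."""
    return [sequence[prompt_length:] for sequence in generated_ids]


prompt_ids = [1, 733, 16289, 28793]              # arbitrary ids standing in for the encoded prompt
generated_ids = [prompt_ids + [5592, 28725, 2]]  # generation output = prompt tokens + new tokens

new_tokens = strip_prompt_tokens(generated_ids, len(prompt_ids))
print(new_tokens)
```

With tensors, the equivalent is to slice the generated ids along the sequence dimension, starting at the length of the input ids, before calling `tokenizer.batch_decode`.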
### **Setting up the Training**
We will use the Hugging Face `transformers` and `peft` libraries!

In [53]:
from peft import AutoPeftModelForCausalLM, LoraConfig, get_peft_model, prepare_model_for_kbit_training

peft_config = LoraConfig(r=8, lora_alpha=16, lora_dropout=0.05, bias="none", task_type="CAUSAL_LM")
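To get a feel for what `r=8` and `lora_alpha=16` mean, the sketch below counts the parameters a LoRA adapter adds and computes the scaling factor `lora_alpha / r`. The layer shapes, the layer count, and the choice of adapted projections (query and value) are assumptions made for illustration; none of them is stated in this notebook.

```python
def lora_param_count(d_in, d_out, r):
    """Parameters added by one LoRA adapter: A has shape (r, d_in), B has shape (d_out, r)."""
    return r * d_in + d_out * r


r, lora_alpha = 8, 16
scaling = lora_alpha / r  # factor applied to the low-rank update B @ A

# Assumed shapes, not taken from the notebook
hidden_size = 4096  # input width of the attention projections
kv_size = 1024      # output width of the value projection
num_layers = 32

per_layer = (
    lora_param_count(hidden_size, hidden_size, r)  # query projection
    + lora_param_count(hidden_size, kv_size, r)    # value projection
)
total = per_layer * num_layers

print(f"scaling factor: {scaling}")
print(f"trainable LoRA parameters: {total:,}")
```

Under these assumptions the adapters add only a few million trainable parameters, a tiny fraction of the full model.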


* The model must be prepared for training in 4-bit, so we use the **`prepare_model_for_kbit_training`** function from `peft`.




In [54]:
model = prepare_model_for_kbit_training(model)
model = get_peft_model(model, peft_config)

### **Training Hyperparameters**
The choice of hyperparameters depends on how long you want to train. Pay special attention to the following:

* `num_train_epochs` / `max_steps`: Control how long training runs, either as passes over the data or as a number of optimizer steps. When `max_steps` is set to a positive value, it overrides `num_train_epochs`. Exercise caution, as too many iterations may lead to overfitting!

* `learning_rate`: Sets the step size of the optimizer updates. Too high a value can make training unstable, while too low a value slows convergence.

In [56]:
from transformers import TrainingArguments
output_model = "mistral_instruct_generation"
training_arguments = TrainingArguments(
        output_dir=output_model,
        per_device_train_batch_size=1,
        gradient_accumulation_steps=4,  # 1 x 4 = 4 sequences per optimizer step on each device
        optim="paged_adamw_32bit",
        learning_rate=2e-4,
        lr_scheduler_type="cosine",
        save_strategy="epoch",
        logging_steps=10,
        num_train_epochs=1,  # has no effect here because max_steps is set
        max_steps=100,
        fp16=True,
)
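With `lr_scheduler_type="cosine"` and no warmup configured, the learning rate decays from `learning_rate` towards zero over the `max_steps` steps. The function below is a minimal sketch of that half-cosine shape, written from the formula rather than copied from the library, so treat it as an approximation of what the trainer does.

```python
import math


def cosine_lr(step, total_steps, peak_lr):
    """Half-cosine decay from peak_lr at step 0 to 0 at total_steps, with no warmup."""
    progress = step / total_steps
    return peak_lr * 0.5 * (1.0 + math.cos(math.pi * progress))


# Learning rate at a few points of the 100-step run configured above
for step in (0, 25, 50, 75, 100):
    print(step, f"{cosine_lr(step, 100, 2e-4):.2e}")
```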


### **Setting up the trainer**

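`max_seq_length`: the maximum number of tokens per training sequence. With `packing=True`, examples are concatenated and cut into chunks of this length.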


In [57]:
from trl import SFTTrainer

max_seq_length = 512

trainer = SFTTrainer(
  model=model,
  peft_config=peft_config,
  max_seq_length=max_seq_length,
  tokenizer=tokenizer,
  packing=True,
  formatting_func=create_prompt, # applies create_prompt to every sample in the training and test datasets
  args=training_arguments,
  train_dataset=instruct_tune_dataset["train"],
  eval_dataset=instruct_tune_dataset["test"]
)
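With `packing=True`, short examples are not padded individually. Instead, they are joined into one token stream and cut into blocks of `max_seq_length` tokens. The sketch below illustrates the idea on toy token ids; `pack_sequences` is a hypothetical helper, and details such as dropping the incomplete final block are assumptions about how a packer might behave, not a description of the library internals.

```python
def pack_sequences(tokenized_examples, seq_length, eos_id):
    """Concatenate examples separated by eos_id, then cut into full blocks of seq_length."""
    stream = []
    for ids in tokenized_examples:
        stream.extend(ids + [eos_id])  # mark the end of each example
    n_full = len(stream) // seq_length  # an incomplete trailing block is dropped
    return [stream[i * seq_length:(i + 1) * seq_length] for i in range(n_full)]


examples = [[1, 2, 3], [4, 5], [6, 7, 8, 9]]  # toy token ids
chunks = pack_sequences(examples, seq_length=4, eos_id=0)
print(chunks)
```

Note that a block can contain the end of one example and the start of the next, which is why the EOS token at the end of each training string matters.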

Generating train split: 0 examples [00:00, ? examples/s]

Generating train split: 0 examples [00:00, ? examples/s]



### **Training starts here**

In [58]:
trainer.train()

You're using a LlamaTokenizerFast tokenizer. Please note that with a fast tokenizer, using the `__call__` method is faster than using a method to encode the text followed by a call to the `pad` method to get a padded encoding.


| Step | Training Loss |
|------|---------------|
| 10 | 2.4893 |
| 20 | 1.7983 |
| 30 | 1.747 |
| 40 | 1.647 |
| 50 | 1.5615 |
| 60 | 1.6279 |
| 70 | 1.6465 |
| 80 | 1.5857 |
| 90 | 1.496 |
| 100 | 1.5503 |


TrainOutput(global_step=100, training_loss=1.7149563503265381, metrics={'train_runtime': 807.6419, 'train_samples_per_second': 0.495, 'train_steps_per_second': 0.124, 'total_flos': 8741766719078400.0, 'train_loss': 1.7149563503265381, 'epoch': 0.31})
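The numbers in `TrainOutput` can be cross-checked against the training configuration. The arithmetic below assumes a single GPU and that every packed sequence has exactly `max_seq_length` tokens. The estimate of the packed dataset size is only approximate because the reported epoch value is rounded.

```python
per_device_batch = 1
grad_accum = 4
max_steps = 100
max_seq_length = 512
train_runtime = 807.6419  # seconds, from TrainOutput
reported_epoch = 0.31     # from TrainOutput (rounded)

sequences_per_step = per_device_batch * grad_accum  # assumption: one GPU
sequences_seen = sequences_per_step * max_steps
tokens_seen = sequences_seen * max_seq_length

samples_per_second = sequences_seen / train_runtime
steps_per_second = max_steps / train_runtime
approx_packed_dataset_size = sequences_seen / reported_epoch

print(f"sequences seen: {sequences_seen}")
print(f"tokens seen:    {tokens_seen}")
print(f"samples/s:      {samples_per_second:.3f}")
print(f"steps/s:        {steps_per_second:.3f}")
print(f"packed sequences per epoch: ~{approx_packed_dataset_size:.0f}")
```

The computed throughput agrees with the `train_samples_per_second` and `train_steps_per_second` values reported above.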

### **Save the model**

In [59]:
trainer.save_model("mistral_instruct_generation")

In [60]:
merged_model = model.merge_and_unload()
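Conceptually, merging folds the LoRA update into the base weights: each adapted weight matrix W becomes W + (lora_alpha / r) · B · A, after which the adapter matrices are no longer needed. The toy example below works through that formula in pure Python with tiny, arbitrary matrices. It is a sketch of the math only and ignores quantization, which a real merge into a 4-bit model has to handle.

```python
def matmul(a, b):
    """Multiply two matrices given as lists of rows."""
    return [
        [sum(a[i][k] * b[k][j] for k in range(len(b))) for j in range(len(b[0]))]
        for i in range(len(a))
    ]


def merge_lora(weight, lora_a, lora_b, alpha, r):
    """Return weight + (alpha / r) * (lora_b @ lora_a)."""
    scale = alpha / r
    delta = matmul(lora_b, lora_a)  # (d_out x r) @ (r x d_in) -> (d_out x d_in)
    return [
        [w + scale * d for w, d in zip(w_row, d_row)]
        for w_row, d_row in zip(weight, delta)
    ]


weight = [[1.0, 0.0], [0.0, 1.0]]  # toy 2 x 2 base weight
lora_a = [[1.0, 2.0]]              # A: r x d_in with r = 1
lora_b = [[0.5], [0.0]]            # B: d_out x r

merged = merge_lora(weight, lora_a, lora_b, alpha=2, r=1)
print(merged)
```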



In [61]:
def generate_response(prompt, model):
  encoded_input = tokenizer(prompt,  return_tensors="pt", add_special_tokens=True)
  model_inputs = encoded_input.to('cuda')

  generated_ids = model.generate(**model_inputs, max_new_tokens=512, do_sample=True, pad_token_id=tokenizer.eos_token_id)

  decoded_output = tokenizer.batch_decode(generated_ids)

  return decoded_output[0]

### **Example No:1**

In [71]:
# Use a predefined template for instructions
prompt = '''<s>[INST] Below is an instruction that describes a task. Write a response that appropriately completes the request. '''
prompt += '''I am making a YouTube video on sports contracts that changed the sporting world. Can you compile a list of sport contracts that were thought to be abnormally large/game changing when they were signed? [/INST]'''
response = generate_response(prompt, model)
# Print the response with formatted output
print(response)



<s><s> [INST] Below is an instruction that describes a task. Write a response that appropriately completes the request. I am making a YouTube video on sports contracts that changed the sporting world. Can you compile a list of sport contracts that were thought to be abnormally large/game changing when they were signed? [/INST] Of course! Here are some of the most notable sports contracts in history that were considered game changing when they were signed:

1. LeBron James’ contract with the Miami Heat in 2010, worth $100 million over six years.
2. Alex Rodriguez’s contract with the Texas Rangers in 2001, worth $252 million over 10 years.
3. Lionel Messi’s contract with Barcelona in 2017, worth €512 million plus performance-related bonuses over four years.
4. David Beckham’s contract with the LA Galaxy in 2007, worth $32.5 million over five years.
5. Peyton Manning’s contract with the Indianapolis Colts in 2012, worth $96 million over five years.
6. Floyd Mayweather Jr.’s contract with 

### **Example No:2**

In [72]:
# Use a predefined template for instructions
prompt = '''<s>[INST] Below is an instruction that describes a task. Write a response that appropriately completes the request. '''
prompt += '''How do you get rid of the smell in your gym shoes? [/INST]'''
response = generate_response(prompt, model)
# Print the response with formatted output
print(response)

<s><s> [INST] Below is an instruction that describes a task. Write a response that appropriately completes the request. How do you get rid of the smell in your gym shoes? [/INST]There are several methods to get rid of the smell from your gym shoes:

1. Air them out regularly and use a foot powder or cedar shavings to absorb any sweat or odor.
2. Put your shoes in the freezer overnight which kills bacteria that causes odor.
3. Use a mixture of water, baking soda and tea tree oil to clean the insides and leave the shoes to dry.
4. Spray deodorant inside the shoes and let them sit for a few hours before wiping it off.
5. Clean the insides with a mild soap and water then rinse thoroughly. Avoid harsh detergents and bleach as they can be damaging to the materials of your shoes. Let them air dry.
6. Place a charcoal filter inside the shoe after cleaning to further help absorb any remaining odors.</s>


### **Example No:3**

In [73]:
# Use a predefined template for instructions
prompt = '''<s>[INST] Below is an instruction that describes a task. Write a response that appropriately completes the request. '''
prompt += '''Is it hard to become an ophthalmologist ? [/INST]'''
response = generate_response(prompt, model)
# Print the response with formatted output
print(response)

<s><s> [INST] Below is an instruction that describes a task. Write a response that appropriately completes the request. Is it hard to become an ophthalmologist ? [/INST] Becoming an ophthalmologist is a complex process that requires a great deal of education and training. To become an ophthalmologist, one must first earn a college degree then earn a Doctor of Medicine (MD) degree from medical school. After medical school, an ophthalmology residency is necessary. This typically involves completing a rotation in a hospital for each of the major sub-specialties in ophthalmology. After residency training, formal fellowship training is often pursued in a specific sub-specialty to further enhance one's understanding and skills. As you can see, the pathway to becoming an ophthalmologist is quite extensive. It requires dedication, hard work, and a solid understanding of anatomy, physiology, pharmacology, and a multitude of other subjects.</s>
