<a href="https://colab.research.google.com/github/fatemafaria142/Instructions-Tuning-Across-Various-LLMs-with-Alpaca-Dataset/blob/main/Instructions_Tuning_using_Mistral_7B_Instruct_v0_2_.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

### **Install Required Packages**

In [1]:
!pip install accelerate peft bitsandbytes transformers trl datasets torch

Collecting accelerate
  Downloading accelerate-0.26.1-py3-none-any.whl (270 kB)
Collecting peft
  Downloading peft-0.7.1-py3-none-any.whl (168 kB)
Collecting bitsandbytes
  Downloading bitsandbytes-0.42.0-py3-none-any.whl (105.0 MB)
Collecting trl
  Downloading trl-0.7.9-py3-none-any.whl (141 kB)
Collecting datasets
  Downloading datasets-2.16.1-py3-none-any.whl (507 kB)

#### **Dataset Link:** https://huggingface.co/datasets/tatsu-lab/alpaca?row=0

In [41]:
from datasets import load_dataset

instruct_tune_dataset = load_dataset("tatsu-lab/alpaca")


### **Dataset structure**


In [42]:
instruct_tune_dataset

DatasetDict({
    train: Dataset({
        features: ['instruction', 'input', 'output', 'text'],
        num_rows: 52002
    })
})

In [44]:
# Display information for 5 data points from the 'train' split
num_samples_to_show = 5
for i in range(num_samples_to_show):
    data = instruct_tune_dataset['train'][i]
    print(f"Data Point {i + 1}:")
    print("Instruction:", data['instruction'])
    print("Input:", data['input'])
    print("Output:", data['output'])
    print("\n-----------------------------\n")

Data Point 1:
Instruction: Give three tips for staying healthy.
Input: 
Output: 1.Eat a balanced diet and make sure to include plenty of fruits and vegetables. 
2. Exercise regularly to keep your body active and strong. 
3. Get enough sleep and maintain a consistent sleep schedule.

-----------------------------

Data Point 2:
Instruction: What are the three primary colors?
Input: 
Output: The three primary colors are red, blue, and yellow.

-----------------------------

Data Point 3:
Instruction: Describe the structure of an atom.
Input: 
Output: An atom is made up of a nucleus, which contains protons and neutrons, surrounded by electrons that travel in orbits around the nucleus. The protons and neutrons have a positive charge, while the electrons have a negative charge, resulting in an overall neutral atom. The number of each particle determines the atomic number and the type of atom.

-----------------------------

Data Point 4:
Instruction: How can we reduce air pollution?
Input: 

#### **We will use just a small subset of the data for this training example**

In [45]:
full_train = instruct_tune_dataset["train"]
instruct_tune_dataset["train"] = full_train.select(range(3500))
# Draw the test rows from outside the training range so the two splits do not overlap
instruct_tune_dataset["test"] = full_train.select(range(3500, 3800))

In [46]:
instruct_tune_dataset

DatasetDict({
    train: Dataset({
        features: ['instruction', 'input', 'output', 'text'],
        num_rows: 3500
    })
    test: Dataset({
        features: ['instruction', 'input', 'output', 'text'],
        num_rows: 300
    })
})

* Mistral's chat format wraps each user message in the control tokens `[INST]` and `[/INST]`; assistant messages are not wrapped. **Mistral-instruct was trained with these tokens.**
* To benefit from instruction fine-tuning, surround your prompt with `[INST]` and `[/INST]`. Only the very first instruction is preceded by the beginning-of-sentence (BOS) token `<s>`; later instructions are not. Each assistant reply ends with the end-of-sentence (EOS) token `</s>`.
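
The rules above can be sketched for a multi-turn conversation as follows. This is only an illustration of the format: `build_mistral_prompt` and the example conversation are hypothetical and are not used elsewhere in this notebook.

```python
def build_mistral_prompt(turns, bos_token="<s>", eos_token="</s>"):
    """Format (user, assistant) pairs; assistant=None leaves the last turn open."""
    prompt = bos_token  # BOS appears once, before the first instruction only
    for user_msg, assistant_msg in turns:
        prompt += f"[INST] {user_msg} [/INST]"
        if assistant_msg is not None:
            # Every completed assistant reply is closed with EOS
            prompt += f" {assistant_msg}{eos_token}"
    return prompt


conversation = [
    ("What are the three primary colors?", "Red, blue, and yellow."),
    ("Which two make green?", None),  # open turn: the model should answer this
]
print(build_mistral_prompt(conversation))
```

Leaving the last assistant slot empty produces a prompt that ends in `[/INST]`, which is the point at which the model is expected to start generating.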

In [48]:
def create_prompt(sample):
    """
    Build one training string in the Mistral instruct format:
    <s>[INST]{template}{instruction}[/INST]{output}</s>
    Only the 'instruction' and 'output' columns are used; 'input' is not added.
    """
    bos_token = "<s>"
    eos_token = "</s>"

    # Use a predefined template for instructions
    instructions_template = "Below is an instruction that describes a task. Write a response that appropriately completes the request. "

    full_prompt = ""
    full_prompt += bos_token
    full_prompt += "[INST]"
    full_prompt += instructions_template
    full_prompt += sample['instruction']
    full_prompt += "[/INST]"
    full_prompt += sample['output']
    full_prompt += eos_token

    return full_prompt

In [49]:
create_prompt(instruct_tune_dataset["train"][0])

'<s>[INST]Below is an instruction that describes a task. Write a response that appropriately completes the request. Give three tips for staying healthy.[/INST]1.Eat a balanced diet and make sure to include plenty of fruits and vegetables. \n2. Exercise regularly to keep your body active and strong. \n3. Get enough sleep and maintain a consistent sleep schedule.</s>'

In [50]:
create_prompt(instruct_tune_dataset["train"][1])

'<s>[INST]Below is an instruction that describes a task. Write a response that appropriately completes the request. What are the three primary colors?[/INST]The three primary colors are red, blue, and yellow.</s>'

In [51]:
create_prompt(instruct_tune_dataset["train"][2])

'<s>[INST]Below is an instruction that describes a task. Write a response that appropriately completes the request. Describe the structure of an atom.[/INST]An atom is made up of a nucleus, which contains protons and neutrons, surrounded by electrons that travel in orbits around the nucleus. The protons and neutrons have a positive charge, while the electrons have a negative charge, resulting in an overall neutral atom. The number of each particle determines the atomic number and the type of atom.</s>'

In [52]:
create_prompt(instruct_tune_dataset["train"][3])

'<s>[INST]Below is an instruction that describes a task. Write a response that appropriately completes the request. How can we reduce air pollution?[/INST]There are a number of ways to reduce air pollution, such as shifting to renewable energy sources, encouraging the use of public transportation, prohibiting the burning of fossil fuels, implementing policies to reduce emissions from industrial sources, and implementing vehicle emissions standards. Additionally, individuals can do their part to reduce air pollution by reducing car use, avoiding burning materials such as wood, and changing to energy efficient appliances.</s>'
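
The prompts above are built from `instruction` and `output` only. Some Alpaca rows carry extra context in the `input` column, so a variant that keeps it may be worth trying. The sketch below is a hypothetical alternative, not the function used for the training run in this notebook; the name `create_prompt_with_input` and the `"\nInput: "` separator are assumptions.

```python
def create_prompt_with_input(sample):
    """Hypothetical variant of create_prompt that keeps the optional 'input' field."""
    template = (
        "Below is an instruction that describes a task. "
        "Write a response that appropriately completes the request. "
    )
    user_text = template + sample["instruction"]
    if sample.get("input"):  # an empty string means there is no extra context
        user_text += "\nInput: " + sample["input"]
    return f"<s>[INST]{user_text}[/INST]{sample['output']}</s>"


example = {
    "instruction": "Classify the sentiment.",
    "input": "I love this phone.",
    "output": "Positive",
}
print(create_prompt_with_input(example))
```

For rows whose `input` is empty, this variant yields the same string layout as the prompts shown above.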

### **Initializing the Model**
* Load the model using a 4-bit (NF4) configuration with double quantization, and set float16 as the compute data type.

* We use the instruct-tuned model here rather than the base model, because fine-tuning a base model to follow instructions requires substantially more data.
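
To get a feel for what double quantization saves, here is a rough back-of-the-envelope estimate. It assumes the block sizes described in the QLoRA paper (64 weights per block, 256 scales per second-level block) and a parameter count of about 7.24 billion; both are assumptions, and the helper `bits_per_param` is hypothetical. It also treats every parameter as quantized, which is not the case in practice, so read the result as an order-of-magnitude figure only.

```python
def bits_per_param(block_size=64, double_quant=True, second_block_size=256):
    """Approximate storage per weight for 4-bit blockwise quantization."""
    weight_bits = 4.0
    if not double_quant:
        # One 32-bit scale is stored for every block of weights
        return weight_bits + 32 / block_size
    # Scales are themselves quantized to 8 bits, plus one 32-bit constant
    # for every group of `second_block_size` scales
    return weight_bits + 8 / block_size + 32 / (block_size * second_block_size)


n_params = 7.24e9  # assumed parameter count for a 7B-class model
for dq in (False, True):
    bits = bits_per_param(double_quant=dq)
    gigabytes = n_params * bits / 8 / 1e9
    print(f"double_quant={dq}: {bits:.3f} bits/param, ~{gigabytes:.2f} GB")
```

Under these assumptions, double quantization trims a little under 0.4 bits per parameter from the overhead of the quantization constants.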

In [53]:
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig
import torch

bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.float16,
    bnb_4bit_use_double_quant=True,
)


* https://huggingface.co/mistralai/Mistral-7B-Instruct-v0.2

In [54]:
model_id = "mistralai/Mistral-7B-Instruct-v0.2"

In [55]:
model = AutoModelForCausalLM.from_pretrained(
        model_id, quantization_config=bnb_config, device_map="auto", use_cache=False
    )


In [56]:
tokenizer = AutoTokenizer.from_pretrained("mistralai/Mistral-7B-Instruct-v0.2")
tokenizer.pad_token = tokenizer.eos_token
tokenizer.padding_side = "right"

### **Let's examine how well the model currently does at this task:**

In [63]:
def generate_response(prompt, model):
  encoded_input = tokenizer(prompt,  return_tensors="pt", add_special_tokens=True)
  model_inputs = encoded_input.to('cuda')

  generated_ids = model.generate(**model_inputs, max_new_tokens=512, do_sample=True, pad_token_id=tokenizer.eos_token_id)

  decoded_output = tokenizer.batch_decode(generated_ids)

  return decoded_output[0].replace(prompt, "")

In [64]:
# Use a predefined template for instructions
prompt = "<s>[INST] Below is an instruction that describes a task. Write a response that appropriately completes the request. "
prompt += "Come up with a creative tagline for a beauty product. [/INST]"
print(prompt)

<s>[INST] Below is an instruction that describes a task. Write a response that appropriately completes the request. Come up with a creative tagline for a beauty product. [/INST]


In [65]:
generate_response(prompt, model)



'<s><s> [INST] Below is an instruction that describes a task. Write a response that appropriately completes the request. Come up with a creative tagline for a beauty product. [/INST] Task Instruction: Create a tagline for a new line of luxury skincare products.\n\nResponse:\nIntroducing "Aura Essence": Pamper Your Skin with Nature\'s Finest, where Radiant Beauty Meets Rejuvenating Indulgence. Each luxurious, botanical infused elixir is meticulously crafted to unlock the timeless secrets of youthful, flawless, and vibrant skin. Embrace the power of nature to transform your daily routine into a sublime journey towards an exceptional, redefined beauty experience.</s>'
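
Note that the prompt is still present in the output above: the decoded text starts with `<s><s> [INST]` (the prompt already contains `<s>` and the tokenizer adds another), which differs from the prompt string, so `replace(prompt, "")` finds nothing to remove. A more robust option is to drop the prompt by token count instead of by string matching. The sketch below shows the idea with plain lists of made-up token ids standing in for the tensors; `strip_prompt_tokens` is a hypothetical helper.

```python
def strip_prompt_tokens(generated_ids, prompt_length):
    """Keep only the tokens produced after the prompt, for each sequence in the batch."""
    return [sequence[prompt_length:] for sequence in generated_ids]


prompt_ids = [1, 733, 16289, 28793]        # made-up ids standing in for the encoded prompt
generated = [prompt_ids + [415, 9314, 2]]  # generation output = prompt + continuation
new_tokens = strip_prompt_tokens(generated, len(prompt_ids))
print(new_tokens)
```

With the real tensors, the equivalent slice would be `generated_ids[:, model_inputs["input_ids"].shape[1]:]`, and passing `skip_special_tokens=True` to `tokenizer.batch_decode` would also drop the `<s>` and `</s>` markers.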

### **Setting up the Training**
We will use the Hugging Face `transformers` and `trl` libraries together with `peft`.

In [67]:
from peft import AutoPeftModelForCausalLM, LoraConfig, get_peft_model, prepare_model_for_kbit_training

peft_config = LoraConfig(r=8, lora_alpha=16, lora_dropout=0.05, bias="none", task_type="CAUSAL_LM")


* The model must be prepared for training in 4-bit precision, so we use the **`prepare_model_for_kbit_training`** function from `peft`.




In [68]:
model = prepare_model_for_kbit_training(model)
model = get_peft_model(model, peft_config)
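
To see how small the trainable part is, here is a rough count of the parameters that LoRA adds with `r=8`. The layer shapes (hidden size 4096, key/value projection size 1024, 32 layers) and the choice of `q_proj` and `v_proj` as the adapted modules are assumptions about Mistral-7B and about the defaults `peft` applies when `target_modules` is not given; `model.print_trainable_parameters()` reports the actual figure.

```python
def lora_param_count(d_in, d_out, r):
    """LoRA adds A (r x d_in) and B (d_out x r) next to a frozen d_out x d_in weight."""
    return r * (d_in + d_out)


r, lora_alpha = 8, 16
hidden, kv_dim, n_layers = 4096, 1024, 32  # assumed Mistral-7B shapes

# Assumed adapted modules per layer: q_proj (hidden -> hidden) and v_proj (hidden -> kv_dim)
per_layer = lora_param_count(hidden, hidden, r) + lora_param_count(hidden, kv_dim, r)
total = per_layer * n_layers

print(f"trainable LoRA parameters: {total:,}")
print(f"update scaling lora_alpha / r = {lora_alpha / r}")
```

The ratio `lora_alpha / r` is the factor by which the low-rank update is scaled before it is added to the frozen weight, so changing `r` without adjusting `lora_alpha` also changes the effective size of the update.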

### **Training Hyperparameters**
The right hyperparameters depend on how long you are willing to train. Pay special attention to the following:

* `num_train_epochs` / `max_steps`: How many passes over the data (or how many optimizer steps) to run. When `max_steps` is set to a positive value, it overrides `num_train_epochs`. Too many iterations can lead to overfitting!

* `learning_rate`: The step size of each optimizer update. Too high a value can make training unstable; too low a value slows convergence.

In [69]:
from transformers import TrainingArguments
output_model= "mistral_instruct_generation"
training_arguments = TrainingArguments(
        output_dir=output_model,
        per_device_train_batch_size=1,
        gradient_accumulation_steps=4,
        optim="paged_adamw_32bit",
        learning_rate=2e-4,
        lr_scheduler_type="cosine",
        save_strategy="epoch",
        logging_steps=10,
        num_train_epochs=1,
        max_steps=150,
        fp16=True,
)
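
A few quantities follow directly from the arguments above. The sketch below works them out, assuming a single GPU and no warmup steps (neither is set explicitly in the cell above); `cosine_lr` is a hypothetical helper that mirrors a half-cosine decay, not a call into `transformers`.

```python
import math

per_device_train_batch_size = 1
gradient_accumulation_steps = 4
max_steps = 150
learning_rate = 2e-4

# Sequences consumed per optimizer step on a single device
effective_batch_size = per_device_train_batch_size * gradient_accumulation_steps
# Total sequences seen over the whole run
sequences_seen = effective_batch_size * max_steps


def cosine_lr(step, total_steps=max_steps, peak_lr=learning_rate):
    """Half-cosine decay from peak_lr down to 0, assuming no warmup."""
    progress = step / total_steps
    return peak_lr * 0.5 * (1.0 + math.cos(math.pi * progress))


print(effective_batch_size, sequences_seen)
for step in (0, 75, 150):
    print(step, f"{cosine_lr(step):.2e}")
```

Because `max_steps` is positive, the run stops after 150 optimizer steps regardless of `num_train_epochs`.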


### **Setting up the trainer**

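`max_seq_length`: The maximum number of tokens in each training sequence. With `packing=True`, the formatted examples are concatenated and split into chunks of this length.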


In [70]:
from trl import SFTTrainer

max_seq_length = 512

trainer = SFTTrainer(
  model=model,
  peft_config=peft_config,
  max_seq_length=max_seq_length,
  tokenizer=tokenizer,
  packing=True,
  formatting_func=create_prompt, # this applies create_prompt to every example in the training and test datasets
  args=training_arguments,
  train_dataset=instruct_tune_dataset["train"],
  eval_dataset=instruct_tune_dataset["test"]
)
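
The idea behind `packing=True` can be sketched in a few lines: tokenized examples are joined into one long stream, separated by the EOS token, and the stream is cut into chunks of `max_seq_length`. This is a simplified illustration with made-up token ids, not `trl`'s actual implementation; `pack_sequences` is a hypothetical helper.

```python
def pack_sequences(tokenized_examples, seq_length, eos_token_id):
    """Concatenate examples (EOS-separated) and cut the stream into fixed-length chunks."""
    stream = []
    for token_ids in tokenized_examples:
        stream.extend(token_ids)
        stream.append(eos_token_id)  # separator between consecutive examples
    n_chunks = len(stream) // seq_length  # an incomplete final chunk is dropped
    return [stream[i * seq_length:(i + 1) * seq_length] for i in range(n_chunks)]


examples = [[11, 12, 13], [21, 22], [31, 32, 33, 34]]  # made-up token ids
packed = pack_sequences(examples, seq_length=4, eos_token_id=2)
print(packed)
```

Packing avoids padding, so short examples do not waste compute, at the cost that one training sequence may contain pieces of several examples.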




### **Training starts here**

In [71]:
trainer.train()

You're using a LlamaTokenizerFast tokenizer. Please note that with a fast tokenizer, using the `__call__` method is faster than using a method to encode the text followed by a call to the `pad` method to get a padded encoding.


| Step | Training Loss |
|------|---------------|
| 10 | 2.325 |
| 20 | 1.503 |
| 30 | 1.4464 |
| 40 | 1.3774 |
| 50 | 1.3852 |
| 60 | 1.3173 |
| 70 | 1.2721 |
| 80 | 1.1302 |
| 90 | 0.9911 |
| 100 | 1.0665 |


TrainOutput(global_step=150, training_loss=1.2601723861694336, metrics={'train_runtime': 1210.2682, 'train_samples_per_second': 0.496, 'train_steps_per_second': 0.124, 'total_flos': 1.31126500786176e+16, 'train_loss': 1.2601723861694336, 'epoch': 0.87})

### **Save the model**

In [80]:
trainer.save_model("mistral_instruct_generation")

In [81]:
merged_model = model.merge_and_unload()

In [74]:
def generate_response(prompt, model):
  encoded_input = tokenizer(prompt,  return_tensors="pt", add_special_tokens=True)
  model_inputs = encoded_input.to('cuda')

  generated_ids = model.generate(**model_inputs, max_new_tokens=512, do_sample=True, pad_token_id=tokenizer.eos_token_id)

  decoded_output = tokenizer.batch_decode(generated_ids)

  return decoded_output[0]

### **Example No:1**

In [85]:
# Use a predefined template for instructions
prompt = '''[INST] Below is an instruction that describes a task. Write a response that appropriately completes the request. '''
prompt += '''Find the highest peak in the world. [/INST]'''
response = generate_response(prompt, model)
# Print the response with formatted output
print(response)

<s> [INST] Below is an instruction that describes a task. Write a response that appropriately completes the request. Find the highest peak in the world. [/INST] The highest peak in the world is Mount Everest, with a height of 29,029 feet (8,848 meters).</s>


### **Example No:2**

In [83]:
# Use a predefined template for instructions
prompt = '''[INST] Below is an instruction that describes a task. Write a response that appropriately completes the request. '''
prompt += '''Describe the behavior and characteristics of a falcon [/INST]'''
response = generate_response(prompt, model)
# Print the response with formatted output
print(response)

<s> [INST] Below is an instruction that describes a task. Write a response that appropriately completes the request. Describe the behavior and characteristics of a falcon [/INST] A falcon is a bird of prey that belongs to the family of birds of prey, known as Falconidae. Known for their agility and sharp eyesight, these birds display several distinct behaviors and characteristics. 

Falcons are known for their swift and powerful flight. They are known to be one of the fastest birds in the world, able to reach speeds of up to 240 miles per hour in a high-speed dive. They are also known for their acrobatic flight, which involves climbing high in the sky and performing rolls and flips while hunting.

Another distinguishing characteristic of falcons is their ability to see very clearly, especially in low-light conditions. This allows them to locate their prey even in the darkest of environments. They typically use their strong beaks to tear apart their prey and swallow it whole.

Falcons a

### **Example No:3**

In [84]:
# Use a predefined template for instructions
prompt = '''[INST] Below is an instruction that describes a task. Write a response that appropriately completes the request. '''
prompt += '''Create a timeline for the Battle of Gettysburg. [/INST]'''
response = generate_response(prompt, model)
# Print the response with formatted output
print(response)

<s> [INST] Below is an instruction that describes a task. Write a response that appropriately completes the request. Create a timeline for the Battle of Gettysburg. [/INST] The Battle of Gettysburg took place between July 1 and July 3, 1863. The Confederate Army, led by General Robert E. Lee, launched an attack on the Union Army, led by General George G. Meade, in Pennsylvania.

* July 1: The two sides clashed at McPherson's Ridge. The Confederates outnumbered the Union army, but the Union held their ground.
* July 2: The Confederates continued their attack, focusing on Cemetery Hill and Culp's Hill. However, the Union was able to prevent them from taking these positions.
* July 3: Lee launched an assault on the Union's right flank at Pickett's Charge. It was the largest assault of the war, but the Union troops were able to hold their ground against the Confederate onslaught.
* July 4: Lee decided to leave the area given the Union's success in repelling their attacks.

Overall, the out