<a href="https://colab.research.google.com/github/fatemafaria142/Instructions-Tuning-Across-Various-LLMs-with-Alpaca-Dataset/blob/main/Instructions_Tuning_using_Mistral_7B.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

### **Install Required Packages**

In [1]:
!pip install accelerate peft bitsandbytes transformers trl datasets torch

Collecting accelerate
  Downloading accelerate-0.26.1-py3-none-any.whl (270 kB)
Collecting peft
  Downloading peft-0.7.1-py3-none-any.whl (168 kB)
Collecting bitsandbytes
  Downloading bitsandbytes-0.42.0-py3-none-any.whl (105.0 MB)
Collecting trl
  Downloading trl-0.7.9-py3-none-any.whl (141 kB)
Collecting datasets
  Downloading datasets-2.16.1-py3-none-any.whl (507 kB)

### **Dataset Link:** https://huggingface.co/datasets/tatsu-lab/alpaca?row=0

In [12]:
from datasets import load_dataset

instruct_tune_dataset = load_dataset("tatsu-lab/alpaca")


In [13]:
instruct_tune_dataset

DatasetDict({
    train: Dataset({
        features: ['instruction', 'input', 'output', 'text'],
        num_rows: 52002
    })
})

In [15]:
# Display information for 5 data points from the 'train' split
num_samples_to_show = 5
for i in range(num_samples_to_show):
    data = instruct_tune_dataset['train'][i]
    print(f"Data Point {i + 1}:")
    print("Instruction:", data['instruction'])
    print("Input:", data['input'])
    print("Output:", data['output'])
    print("\n-----------------------------\n")


Data Point 1:
Instruction: Give three tips for staying healthy.
Input: 
Output: 1.Eat a balanced diet and make sure to include plenty of fruits and vegetables. 
2. Exercise regularly to keep your body active and strong. 
3. Get enough sleep and maintain a consistent sleep schedule.

-----------------------------

Data Point 2:
Instruction: What are the three primary colors?
Input: 
Output: The three primary colors are red, blue, and yellow.

-----------------------------

Data Point 3:
Instruction: Describe the structure of an atom.
Input: 
Output: An atom is made up of a nucleus, which contains protons and neutrons, surrounded by electrons that travel in orbits around the nucleus. The protons and neutrons have a positive charge, while the electrons have a negative charge, resulting in an overall neutral atom. The number of each particle determines the atomic number and the type of atom.

-----------------------------

Data Point 4:
Instruction: How can we reduce air pollution?
Input: 

### **We will use just a small subset of the data for this training example**

In [16]:
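# Take the evaluation rows from outside the training range so the two sets do not overlap.
full_train = instruct_tune_dataset["train"]
instruct_tune_dataset["train"] = full_train.select(range(1500))
instruct_tune_dataset["test"] = full_train.select(range(1500, 1700))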

In [17]:
instruct_tune_dataset

DatasetDict({
    train: Dataset({
        features: ['instruction', 'input', 'output', 'text'],
        num_rows: 1500
    })
    test: Dataset({
        features: ['instruction', 'input', 'output', 'text'],
        num_rows: 200
    })
})
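
Because the evaluation rows are drawn with `select` from the same split as the training rows, it is worth confirming that the two index ranges do not overlap. The following is a minimal sketch on plain `range` objects; the helper name `overlap` and the example ranges are illustrative assumptions, not part of the notebook.

```python
def overlap(a: range, b: range) -> int:
    """Return how many indices two step-1 ranges have in common."""
    return max(0, min(a.stop, b.stop) - max(a.start, b.start))


train_rows = range(0, 1500)
eval_inside_train = range(0, 200)       # evaluation rows taken from the training rows
eval_after_train = range(1500, 1700)    # evaluation rows taken from the rows that follow

print(overlap(train_rows, eval_inside_train))  # 200 shared rows: evaluation would be leaked
print(overlap(train_rows, eval_after_train))   # 0 shared rows: a clean hold-out set
```

An evaluation loss computed on rows the model was trained on says little about generalization, so a count of zero is the result to aim for.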

In [18]:
def create_prompt(sample):
  """
  Format one dataset row as a single training string:
  a fixed system message under "### Instruction:", the row's instruction
  under "### Input:", and the row's output under "### Response:".
  The row's own "input" field is not used.
  """
  bos_token = "<s>"
  original_system_message = "Below is an instruction that describes a task. Write a response that appropriately completes the request."
  system_message = "Use the provided input to create an instruction that could have been used to generate the response with an LLM."
  # Strip any Alpaca template text that may be left in the output field.
  response = sample["output"].replace(original_system_message, "").replace("\n\n### Instruction\n", "").replace("\n### Response\n", "").strip()
  user_input = sample["instruction"]  # renamed to avoid shadowing the built-in input()
  eos_token = "</s>"

  full_prompt = ""
  full_prompt += bos_token
  full_prompt += "### Instruction:"
  full_prompt += "\n" + system_message
  full_prompt += "\n\n### Input:"
  full_prompt += "\n" + user_input
  full_prompt += "\n\n### Response:"
  full_prompt += "\n" + response
  full_prompt += eos_token

  return full_prompt

In [19]:
create_prompt(instruct_tune_dataset["train"][0])

'<s>### Instruction:\nUse the provided input to create an instruction that could have been used to generate the response with an LLM.\n\n### Input:\nGive three tips for staying healthy.\n\n### Response:\n1.Eat a balanced diet and make sure to include plenty of fruits and vegetables. \n2. Exercise regularly to keep your body active and strong. \n3. Get enough sleep and maintain a consistent sleep schedule.</s>'

### **Initializing the Model**
* Load the model in 4-bit NF4 precision with double quantization, using float16 as the compute data type.

* We use the instruct-tuned model here rather than the base model, because fine-tuning a base model requires substantially more data.

In [20]:
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig
import torch

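bnb_config = BitsAndBytesConfig(
        load_in_4bit=True,                     # store the weights in 4 bits
        bnb_4bit_quant_type="nf4",             # NormalFloat4 quantization
        bnb_4bit_compute_dtype=torch.float16,  # dtype used for the matrix multiplications
        bnb_4bit_use_double_quant=True,        # also quantize the quantization constants
    )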
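
To get a feel for what this configuration saves, here is a rough back-of-the-envelope estimate of the weight memory. It is a sketch under stated assumptions: the parameter count of about 7.24 billion is approximate, every parameter is treated as quantized (in practice some layers stay in higher precision), and the per-parameter overhead of the quantization constants (0.5 bits without double quantization, about 0.127 bits with it, for 64-weight blocks) follows the figures reported in the QLoRA paper.

```python
N_PARAMS = 7.24e9  # assumption: approximate parameter count of Mistral-7B


def weight_gib(n_params: float, bits_per_param: float) -> float:
    """Memory needed for the weights alone, in GiB."""
    return n_params * bits_per_param / 8 / 2**30


fp16_gib = weight_gib(N_PARAMS, 16)            # unquantized half precision
nf4_gib = weight_gib(N_PARAMS, 4 + 0.5)        # 4-bit weights + per-block constants
nf4_dq_gib = weight_gib(N_PARAMS, 4 + 0.127)   # same, with double quantization

print(f"fp16 weights:        {fp16_gib:.2f} GiB")
print(f"nf4 weights:         {nf4_gib:.2f} GiB")
print(f"nf4 + double quant:  {nf4_dq_gib:.2f} GiB")
```

The estimate ignores activations, optimizer state, and the LoRA adapters, so actual GPU usage during training will be higher.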


* https://huggingface.co/mistralai/Mistral-7B-Instruct-v0.1

In [21]:
model_id = "mistralai/Mistral-7B-Instruct-v0.1"

In [22]:
model = AutoModelForCausalLM.from_pretrained(
        model_id, quantization_config=bnb_config, device_map="auto", use_cache=False
    )


In [23]:
tokenizer = AutoTokenizer.from_pretrained("mistralai/Mistral-7B-v0.1")
tokenizer.pad_token = tokenizer.eos_token  # no pad token is defined, so reuse EOS for padding
tokenizer.padding_side = "right"


### **Let's examine how well the model currently does at this task:**

In [24]:
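def generate_response(prompt, model):
  # Tokenize the prompt (the tokenizer prepends <s>) and move the tensors to the GPU.
  encoded_input = tokenizer(prompt,  return_tensors="pt", add_special_tokens=True)
  model_inputs = encoded_input.to('cuda')

  # Sample up to 1000 new tokens; EOS doubles as the padding token.
  generated_ids = model.generate(**model_inputs, max_new_tokens=1000, do_sample=True, pad_token_id=tokenizer.eos_token_id)

  decoded_output = tokenizer.batch_decode(generated_ids)

  # Remove the echoed prompt so only the special tokens and the completion remain.
  return decoded_output[0].replace(prompt, "")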

In [27]:
prompt="### Instruction:\nUse the provided input to create an instruction that could have been used to generate the response with an LLM.### Input:\n What is the capital of France?\n\n### Response:"

In [28]:
generate_response(prompt, model)

'<s> \n The capital of France is Paris.</s>'

### **Setting up the Training**
We will use the Hugging Face `transformers`, `peft`, and `trl` libraries.

In [29]:
from peft import AutoPeftModelForCausalLM, LoraConfig, get_peft_model, prepare_model_for_kbit_training

peft_config = LoraConfig(r=8, lora_alpha=16, lora_dropout=0.05, bias="none", task_type="CAUSAL_LM")
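
To see how small the adapter is, the number of trainable LoRA parameters can be estimated by hand. The sketch below rests on assumptions that are not stated in the notebook: that PEFT's default target modules for Mistral are `q_proj` and `v_proj`, and that the model has 32 layers, a hidden size of 4096, and a key/value projection width of 1024. Each adapted weight of shape `d_out x d_in` gains two matrices, `A` (`r x d_in`) and `B` (`d_out x r`), and the update is scaled by `lora_alpha / r`.

```python
def lora_params(d_in: int, d_out: int, r: int) -> int:
    """Parameters added by one LoRA adapter: A is (r x d_in), B is (d_out x r)."""
    return r * (d_in + d_out)


r, lora_alpha = 8, 16                      # values from the LoraConfig above
HIDDEN, KV_DIM, N_LAYERS = 4096, 1024, 32  # assumed Mistral-7B dimensions

per_layer = lora_params(HIDDEN, HIDDEN, r) + lora_params(HIDDEN, KV_DIM, r)  # q_proj + v_proj
total = per_layer * N_LAYERS
scaling = lora_alpha / r

print(total)    # 3407872 trainable parameters under these assumptions
print(scaling)  # 2.0
```

If the assumptions hold, this is well under 0.1% of the base model's parameters, which is why the adapter trains quickly on a single GPU. Calling `model.print_trainable_parameters()` on the PEFT model reports the actual count.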


* We need to prepare the model for 4-bit training, so we use the **`prepare_model_for_kbit_training`** function from `peft`.




In [30]:
model = prepare_model_for_kbit_training(model)
model = get_peft_model(model, peft_config)

### **Training Hyperparameters**
The right hyperparameters depend on how long you can afford to train. Pay particular attention to:

* `num_train_epochs` / `max_steps`: control how many passes, or how many optimizer steps, are run over the data. When both are set, `max_steps` takes precedence. Too many can lead to overfitting!

* `learning_rate`: sets the step size of each update. Too high a value can make training unstable; too low a value slows convergence.

In [31]:
from transformers import TrainingArguments
output_model = "mistral_instruct_generation"
training_arguments = TrainingArguments(
        output_dir=output_model,
        per_device_train_batch_size=1,
        gradient_accumulation_steps=4,   # effective batch size of 4 per device
        optim="paged_adamw_32bit",
        learning_rate=2e-4,
        lr_scheduler_type="cosine",
        save_strategy="epoch",
        logging_steps=10,
        num_train_epochs=1,              # ignored here because max_steps is set
        max_steps=50,                    # takes precedence over num_train_epochs
        fp16=True,
)
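
The `lr_scheduler_type="cosine"` setting decays the learning rate from its starting value to zero over the run. The sketch below reproduces that shape in plain Python, assuming no warmup (the `TrainingArguments` default) and a single half-cosine cycle; the function name `cosine_lr` is a hypothetical helper, not a library call.

```python
import math


def cosine_lr(step: int, total_steps: int, base_lr: float) -> float:
    """Half-cosine decay from base_lr at step 0 to 0 at total_steps (no warmup)."""
    progress = step / total_steps
    return base_lr * 0.5 * (1.0 + math.cos(math.pi * progress))


BASE_LR, MAX_STEPS = 2e-4, 50  # values from the training arguments above

for step in (0, 10, 25, 40, 50):
    print(step, f"{cosine_lr(step, MAX_STEPS, BASE_LR):.2e}")
```

The rate stays near `2e-4` for the first few steps, passes through half that value at the midpoint, and approaches zero by step 50.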


### **Setting up the trainer**

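`max_seq_length`: the maximum number of tokens in each training sequence. With `packing=True`, examples are concatenated and cut into chunks of this length.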


In [32]:
from trl import SFTTrainer

max_seq_length = 1024

trainer = SFTTrainer(
  model=model,
  peft_config=peft_config,
  max_seq_length=max_seq_length,
  tokenizer=tokenizer,
  packing=True,
  formatting_func=create_prompt, # applies create_prompt to every example in the training and test datasets
  args=training_arguments,
  train_dataset=instruct_tune_dataset["train"],
  eval_dataset=instruct_tune_dataset["test"]
)
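
With `packing=True`, short examples are not padded individually; they are joined into one token stream and cut into fixed-length sequences. The toy function below illustrates the idea on small integer lists. It is a simplified sketch, not TRL's implementation: the name `pack`, the token ids, and the choice to drop the incomplete final chunk are assumptions made for illustration.

```python
from typing import Iterable, List


def pack(examples: Iterable[List[int]], seq_length: int, eos_id: int) -> List[List[int]]:
    """Concatenate tokenized examples (EOS after each) and split into full chunks."""
    stream: List[int] = []
    for ids in examples:
        stream.extend(ids)
        stream.append(eos_id)          # mark the boundary between examples
    n_full = len(stream) // seq_length  # leftover tokens at the end are dropped
    return [stream[i * seq_length:(i + 1) * seq_length] for i in range(n_full)]


toy_examples = [[5, 6, 7], [8, 9], [10, 11, 12, 13]]
chunks = pack(toy_examples, seq_length=4, eos_id=2)
print(chunks)  # [[5, 6, 7, 2], [8, 9, 2, 10], [11, 12, 13, 2]]
```

Note that a chunk can contain the end of one example and the start of the next, which is why the number of packed sequences is much smaller than the number of dataset rows.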




### **Training starts here**

In [33]:
trainer.train()

You're using a LlamaTokenizerFast tokenizer. Please note that with a fast tokenizer, using the `__call__` method is faster than using a method to encode the text followed by a call to the `pad` method to get a padded encoding.


| Step | Training Loss |
|------|---------------|
| 10 | 1.2982 |
| 20 | 1.0573 |
| 30 | 0.9844 |
| 40 | 0.9211 |
| 50 | 0.9419 |




TrainOutput(global_step=50, training_loss=1.0405669403076172, metrics={'train_runtime': 765.8477, 'train_samples_per_second': 0.261, 'train_steps_per_second': 0.065, 'total_flos': 8741766719078400.0, 'train_loss': 1.0405669403076172, 'epoch': 1.25})
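
The reported `epoch` of 1.25 can look odd next to a 1,500-row training set, but it follows from packing. The arithmetic below is a sketch that assumes a single GPU, so that each optimizer step consumes `per_device_train_batch_size * gradient_accumulation_steps` packed sequences.

```python
global_step = 50        # from TrainOutput
per_device_batch = 1    # per_device_train_batch_size
grad_accum = 4          # gradient_accumulation_steps
reported_epoch = 1.25   # from TrainOutput metrics
train_runtime = 765.8477

sequences_seen = global_step * per_device_batch * grad_accum
sequences_per_epoch = sequences_seen / reported_epoch
seconds_per_step = train_runtime / global_step

print(sequences_seen)       # 200
print(sequences_per_epoch)  # 160.0
print(round(seconds_per_step, 1))
```

Under these assumptions, the 1,500 formatted examples were packed into roughly 160 sequences of up to 1,024 tokens, so 50 steps cover the data about one and a quarter times. The figure of 200 sequences over 765.8 seconds also matches the logged `train_samples_per_second` of 0.261.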

### **Save the model**

In [34]:
trainer.save_model("mistral_instruct_generation")

In [35]:
merged_model = model.merge_and_unload()



In [36]:
def generate_response(prompt, model):
  encoded_input = tokenizer(prompt,  return_tensors="pt", add_special_tokens=True)
  model_inputs = encoded_input.to('cuda')

  generated_ids = model.generate(**model_inputs, max_new_tokens=1000, do_sample=True, pad_token_id=tokenizer.eos_token_id)

  decoded_output = tokenizer.batch_decode(generated_ids)

  return decoded_output[0]  # full decoded text, prompt included
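
Since this version returns the prompt together with the completion, a small post-processing step can isolate the generated answer. The helper below is a minimal sketch; the name `extract_response` is hypothetical, and it assumes the prompt ends with the `### Response:` marker and that `<s>` and `</s>` are the only special tokens in the decoded text.

```python
RESPONSE_MARKER = "### Response:"


def extract_response(decoded: str, bos: str = "<s>", eos: str = "</s>") -> str:
    """Return the text after the last response marker, without special tokens."""
    # rpartition leaves the whole string in `tail` when the marker is absent.
    _, _, tail = decoded.rpartition(RESPONSE_MARKER)
    return tail.replace(bos, "").replace(eos, "").strip()


decoded_text = (
    "<s> ### Instruction:\nSome system message.\n"
    "### Input:\nWhat is the capital of France?\n\n"
    "### Response:\nThe capital of France is Paris.</s>"
)
print(extract_response(decoded_text))  # The capital of France is Paris.
```

Splitting on the last marker, rather than replacing the prompt string, still works when the decoded prompt differs slightly from the original, for example by a leading space after `<s>`.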

In [40]:
# Example usage
prompt = "### Instruction:\nUse the provided input to create an instruction that could have been used to generate the response with an LLM."
prompt += "\n### Input:\nWrite a short story in third person narration about a protagonist who has to make an important career decision."
prompt += "\n\n### Response:"
response = generate_response(prompt, model)

# Print the response with formatted output
print(response)

<s> ### Instruction:
Use the provided input to create an instruction that could have been used to generate the response with an LLM.
### Input:
Write a short story in third person narration about a protagonist who has to make an important career decision.

### Response:
The protagonist found themselves standing at a crossroads in their career. They had been offered a promotion, which would require them to move across the country and take on more responsibility. At the same time, their dream job had finally opened up, and they had been invited for an interview. They had to make a decision - should they take the promotion and leave everything behind, or should they risk not getting the job and stay in their current position?</s>


In [38]:
# Example usage
prompt = "### Instruction:\nUse the provided input to create an instruction that could have been used to generate the response with an LLM."
prompt += "\n### Input:\nDescribe a time when you had to make a difficult decision."
prompt += "\n\n### Response:"
response = generate_response(prompt, model)

# Print the response with formatted output
print(response)

<s> ### Instruction:
Use the provided input to create an instruction that could have been used to generate the response with an LLM.
### Input:
Describe a time when you had to make a difficult decision.

### Response:
One time I had to make a difficult decision was when I was faced with the choice of leaving my stable job for a new opportunity that paid more but required me to relocate. While my current job was comfortable, I knew that the benefits of the new opportunity could be significant in terms of my future career growth and financial stability. I had to weigh the risks and benefits, considering the impact on my personal life, as well as my long-term professional goals. Ultimately, I decided to take the chance and pursue the new opportunity, which proved to be the best decision I have ever made for my career.</s>


In [39]:
# Example usage
prompt = "### Instruction:\nUse the provided input to create an instruction that could have been used to generate the response with an LLM."
prompt += "\n### Input:\nDescribe the structure of an atom."
prompt += "\n\n### Response:"
response = generate_response(prompt, model)

# Print the response with formatted output
print(response)

<s> ### Instruction:
Use the provided input to create an instruction that could have been used to generate the response with an LLM.
### Input:
Describe the structure of an atom.

### Response:
An atom is the basic unit of ordinary matter that constitutes a chemical element. It contains a dense nucleus at its center made up of protons and neutrons, which is surrounded by a cloud of electrons that forms the atom's electron shell. The number of electrons in an atom's shell, along with the number of protons in the nucleus, determines what element the atom is, and how it interacts with other matter.</s>
