<a href="https://colab.research.google.com/github/fatemafaria142/From-Text-to-Code-Generation-Pipeline-Incorporating-Various-LLMs/blob/main/Code_Generation_using_Mistral_7B_Instruct_v0_2.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

### **Install Required Packages**

In [1]:
!pip install accelerate peft bitsandbytes transformers trl datasets torch

Collecting accelerate
  Downloading accelerate-0.26.1-py3-none-any.whl (270 kB)
Collecting peft
  Downloading peft-0.7.1-py3-none-any.whl (168 kB)
Collecting bitsandbytes
  Downloading bitsandbytes-0.42.0-py3-none-any.whl (105.0 MB)
Collecting trl
  D

#### **Dataset Link:** https://huggingface.co/datasets/sahil2801/CodeAlpaca-20k

In [2]:
from datasets import load_dataset

instruct_tune_dataset = load_dataset("sahil2801/CodeAlpaca-20k")

The secret `HF_TOKEN` does not exist in your Colab secrets.
To authenticate with the Hugging Face Hub, create a token in your settings tab (https://huggingface.co/settings/tokens), set it as secret in your Google Colab and restart your session.
You will be able to reuse this secret in all of your notebooks.
Please note that authentication is recommended but still optional to access public models or datasets.



### **Dataset structure**
* The dataset contains three columns: `instruction`, `input`, and `output`.

In [3]:
instruct_tune_dataset

DatasetDict({
    train: Dataset({
        features: ['output', 'input', 'instruction'],
        num_rows: 20022
    })
})

In [4]:
# Display information for 5 data points from the 'train' split
num_samples_to_show = 5
for i in range(num_samples_to_show):
    data = instruct_tune_dataset['train'][i]
    print(f"Data Point {i + 1}:")
    print("Instruction:", data['instruction'])
    print("Input:", data['input'])
    print("Output:", data['output'])
    print("\n-----------------------------\n")


Data Point 1:
Instruction: Create an array of length 5 which contains all even numbers between 1 and 10.
Input: 
Output: arr = [2, 4, 6, 8, 10]

-----------------------------

Data Point 2:
Instruction: Formulate an equation to calculate the height of a triangle given the angle, side lengths and opposite side length.
Input: 
Output: Height of triangle = opposite side length * sin (angle) / side length

-----------------------------

Data Point 3:
Instruction: Write a replace method for a string class which replaces the given string with a given set of characters.
Input: string = "Hello World!"
replace_with = "Greetings!"
Output: def replace(self, replace_with):
    new_string = ""
    for char in self:
        if char == " ":
            new_string += replace_with
        else:
            new_string += char
    return new_string

-----------------------------

Data Point 4:
Instruction: Create an array of length 15 containing numbers divisible by 3 up to 45.
Input: 
Output: arr = [3, 

### **We will use just a small subset of the data for this training example**

In [5]:
full_train = instruct_tune_dataset["train"]
instruct_tune_dataset["train"] = full_train.select(range(3500))
# Draw the test rows from outside the training range so the two splits do not overlap
instruct_tune_dataset["test"] = full_train.select(range(3500, 3800))

In [6]:
instruct_tune_dataset

DatasetDict({
    train: Dataset({
        features: ['output', 'input', 'instruction'],
        num_rows: 3500
    })
    test: Dataset({
        features: ['output', 'input', 'instruction'],
        num_rows: 300
    })
})

* **Mistral-instruct was trained with the control tokens `[INST]` and `[/INST]`**, which mark the start and end of user messages (but not assistant messages).
* To benefit from instruction fine-tuning, surround each prompt with `[INST]` and `[/INST]`. Only the very first instruction is preceded by the beginning-of-sentence token (`<s>`); later instructions are not. Each assistant reply ends with the end-of-sentence token (`</s>`).
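
The template described above can be sketched as a small helper. The function name `build_mistral_prompt` and the `(user, assistant)` tuple layout are assumptions made for illustration; they are not part of this notebook or of any library.

```python
def build_mistral_prompt(turns):
    """Build a Mistral-instruct style prompt from (user, assistant) pairs.

    Pass None as the assistant reply of the last pair to leave the prompt
    open for generation.
    """
    bos_token, eos_token = "<s>", "</s>"
    prompt = bos_token  # BOS appears once, before the first instruction only
    for user_message, assistant_reply in turns:
        prompt += f"[INST] {user_message} [/INST]"
        if assistant_reply is not None:
            # Each completed assistant turn is closed by the EOS token
            prompt += f" {assistant_reply}{eos_token}"
    return prompt


example = build_mistral_prompt([
    ("Write a function that adds two numbers.", "def add(a, b):\n    return a + b"),
    ("Now add type hints.", None),
])
print(example)
```

Leaving the last reply as `None` produces a prompt that ends in `[/INST]`, which is the point at which the model is expected to start generating.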

In [7]:
def create_prompt(sample):
    """
    Update the prompt template:
    Combine both the prompt and input into a single column.
    """
    bos_token = "<s>"
    eos_token = "</s>"

    # Use a predefined template for instructions
    instructions_template = " Think of yourself as a Coding Expert. Your expertise lies in solving coding problems with precision and clarity. Imagine you're ready to provide accurate solutions to any coding-related inquiry. Your goal is to solve the given coding problem with utmost accuracy. Please proceed with the solution once the question is provided. Coding Question: "



    full_prompt = ""
    full_prompt += bos_token
    full_prompt += "[INST]"
    full_prompt += instructions_template
    full_prompt += sample['instruction']
    full_prompt += sample['input']
    full_prompt += "[/INST]"
    full_prompt += sample['output']
    full_prompt += eos_token

    return full_prompt

In [8]:
create_prompt(instruct_tune_dataset["train"][0])

"<s>[INST] Think of yourself as a Coding Expert. Your expertise lies in solving coding problems with precision and clarity. Imagine you're ready to provide accurate solutions to any coding-related inquiry. Your goal is to solve the given coding problem with utmost accuracy. Please proceed with the solution once the question is provided. Coding Question: Create an array of length 5 which contains all even numbers between 1 and 10.[/INST]arr = [2, 4, 6, 8, 10]</s>"

In [9]:
create_prompt(instruct_tune_dataset["train"][1])

"<s>[INST] Think of yourself as a Coding Expert. Your expertise lies in solving coding problems with precision and clarity. Imagine you're ready to provide accurate solutions to any coding-related inquiry. Your goal is to solve the given coding problem with utmost accuracy. Please proceed with the solution once the question is provided. Coding Question: Formulate an equation to calculate the height of a triangle given the angle, side lengths and opposite side length.[/INST]Height of triangle = opposite side length * sin (angle) / side length</s>"

In [10]:
create_prompt(instruct_tune_dataset["train"][2])

'<s>[INST] Think of yourself as a Coding Expert. Your expertise lies in solving coding problems with precision and clarity. Imagine you\'re ready to provide accurate solutions to any coding-related inquiry. Your goal is to solve the given coding problem with utmost accuracy. Please proceed with the solution once the question is provided. Coding Question: Write a replace method for a string class which replaces the given string with a given set of characters.string = "Hello World!"\nreplace_with = "Greetings!"[/INST]def replace(self, replace_with):\n    new_string = ""\n    for char in self:\n        if char == " ":\n            new_string += replace_with\n        else:\n            new_string += char\n    return new_string</s>'

In [11]:
create_prompt(instruct_tune_dataset["train"][3])

"<s>[INST] Think of yourself as a Coding Expert. Your expertise lies in solving coding problems with precision and clarity. Imagine you're ready to provide accurate solutions to any coding-related inquiry. Your goal is to solve the given coding problem with utmost accuracy. Please proceed with the solution once the question is provided. Coding Question: Create an array of length 15 containing numbers divisible by 3 up to 45.[/INST]arr = [3, 6, 9, 12, 15, 18, 21, 24, 27, 30, 33, 36, 39, 42, 45]</s>"

In [12]:
create_prompt(instruct_tune_dataset["train"][4])

'<s>[INST] Think of yourself as a Coding Expert. Your expertise lies in solving coding problems with precision and clarity. Imagine you\'re ready to provide accurate solutions to any coding-related inquiry. Your goal is to solve the given coding problem with utmost accuracy. Please proceed with the solution once the question is provided. Coding Question: Write a function to find the number of distinct states in a given matrix.matrix = [[1, 0, 0],\n          [1, 0, 1],\n          [1, 1, 1]][/INST]def find_num_distinct_states(matrix):\n    states = set()\n    for row in matrix:\n        state = "".join([str(x) for x in row])\n        states.add(state)\n    return len(states)</s>'

In [13]:
create_prompt(instruct_tune_dataset["train"][5])

"<s>[INST] Think of yourself as a Coding Expert. Your expertise lies in solving coding problems with precision and clarity. Imagine you're ready to provide accurate solutions to any coding-related inquiry. Your goal is to solve the given coding problem with utmost accuracy. Please proceed with the solution once the question is provided. Coding Question: Create a nested loop to print every combination of numbers between 0-9[/INST]for i in range(10):\n    for j in range(10):\n        print(i, j)</s>"

### **Initializing the Model**
* Load the model in 4-bit precision (NF4) with double quantization, and set float16 as the compute data type.

* We use the instruct-tuned model here rather than the base model, because fine-tuning a base model requires considerably more data.

In [14]:
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig
import torch

bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.float16,
    bnb_4bit_use_double_quant=True,
)
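
To see why 4-bit loading matters on a single GPU, here is a rough back-of-the-envelope estimate of the memory needed for the weights alone. The parameter count of about 7.2 billion is an assumption, and the helper name `approx_weight_memory_gib` is made up for this sketch. The estimate ignores quantization constants, layers that stay unquantized, activations, and optimizer state, so real usage will be higher.

```python
def approx_weight_memory_gib(num_params, bits_per_param):
    """Approximate memory needed to hold the weights alone, in GiB."""
    return num_params * bits_per_param / 8 / 1024**3


# Assumption: roughly 7.2 billion parameters for a "7B" model
num_params = 7.2e9

fp16_gib = approx_weight_memory_gib(num_params, 16)
nf4_gib = approx_weight_memory_gib(num_params, 4)

print(f"float16 weights: ~{fp16_gib:.1f} GiB")
print(f"4-bit weights:   ~{nf4_gib:.1f} GiB")
```

Under these assumptions the 4-bit weights take about a quarter of the float16 footprint, which is what lets the model fit on a small GPU.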


* https://huggingface.co/mistralai/Mistral-7B-Instruct-v0.2

In [15]:
model_id = "mistralai/Mistral-7B-Instruct-v0.2"

In [16]:
model = AutoModelForCausalLM.from_pretrained(
        model_id, quantization_config=bnb_config, device_map="auto", use_cache=False
    )


In [17]:
tokenizer = AutoTokenizer.from_pretrained("mistralai/Mistral-7B-Instruct-v0.2")
tokenizer.pad_token = tokenizer.eos_token
tokenizer.padding_side = "right"


### **Let's examine how well the model currently does at this task:**

In [18]:
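def generate_response(prompt, model):
  # The prompts used below already start with "<s>"; add_special_tokens=True
  # prepends a second BOS token, which is why the outputs begin with "<s><s>".
  encoded_input = tokenizer(prompt, return_tensors="pt", add_special_tokens=True)
  model_inputs = encoded_input.to('cuda')

  # Sample up to 1024 new tokens; EOS doubles as the padding token
  generated_ids = model.generate(**model_inputs, max_new_tokens=1024, do_sample=True, pad_token_id=tokenizer.eos_token_id)

  decoded_output = tokenizer.batch_decode(generated_ids)

  # The decoded text does not reproduce the prompt string exactly (note the
  # doubled "<s>" and added space), so this replace() leaves the prompt in place.
  return decoded_output[0].replace(prompt, "")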

In [26]:
def create_prompt(sample):
    """
    Update the prompt template:
    Combine both the prompt and input into a single column.
    """
    bos_token = "<s>"
    eos_token = "</s>"

    # Use a predefined template for instructions
    instructions_template = " Think of yourself as a Coding Expert. Your expertise lies in solving coding problems with precision and clarity. Imagine you're ready to provide accurate solutions to any coding-related inquiry. Your goal is to solve the given coding problem with utmost accuracy. Please proceed with the solution once the question is provided. Coding Question: "



    full_prompt = ""
    full_prompt += bos_token
    full_prompt += "[INST]"
    full_prompt += instructions_template
    full_prompt += sample['instruction']
    full_prompt += sample['input']
    full_prompt += "[/INST] "  # trailing space separates the tag from the answer (the only change from the earlier version)
    full_prompt += sample['output']
    full_prompt += eos_token

    return full_prompt

### **Check that the prompt is formatted correctly**

In [27]:
check_prompt = create_prompt(instruct_tune_dataset["train"][0])
print(check_prompt)

<s>[INST] Think of yourself as a Coding Expert. Your expertise lies in solving coding problems with precision and clarity. Imagine you're ready to provide accurate solutions to any coding-related inquiry. Your goal is to solve the given coding problem with utmost accuracy. Please proceed with the solution once the question is provided. Coding Question: Create an array of length 5 which contains all even numbers between 1 and 10.[/INST] arr = [2, 4, 6, 8, 10]</s>


In [29]:
# Use a predefined template for instructions
prompt = "<s>[INST] Think of yourself as a Coding Expert. Your expertise lies in solving coding problems with precision and clarity. Imagine you're ready to provide accurate solutions to any coding-related inquiry. Your goal is to solve the given coding problem with utmost accuracy. Please proceed with the solution once the question is provided. Coding Question: "
prompt += "Create an array of length 5 which contains all even numbers between 1 and 10. [/INST]"
print(prompt)

<s>[INST] Think of yourself as a Coding Expert. Your expertise lies in solving coding problems with precision and clarity. Imagine you're ready to provide accurate solutions to any coding-related inquiry. Your goal is to solve the given coding problem with utmost accuracy. Please proceed with the solution once the question is provided. Coding Question: Create an array of length 5 which contains all even numbers between 1 and 10. [/INST]


In [30]:
generate_response(prompt, model)



'<s><s> [INST] Think of yourself as a Coding Expert. Your expertise lies in solving coding problems with precision and clarity. Imagine you\'re ready to provide accurate solutions to any coding-related inquiry. Your goal is to solve the given coding problem with utmost accuracy. Please proceed with the solution once the question is provided. Coding Question: Create an array of length 5 which contains all even numbers between 1 and 10. [/INST] To create an array of length 5 filled with even numbers between 1 and 10, you can follow these steps:\n\n1. Initialize an empty integer array of length 5.\n2. Assign even numbers between 1 and 10 to each index of the array.\n\nHere is the code for solving this problem in several common programming languages:\n\n**Python:**\n\n```python\neven_numbers = [2, 4, 6, 8, 10]\n```\n\n**JavaScript:**\n\n```javascript\nlet even_numbers = [2, 4, 6, 8, 10];\n```\n\n**Java:**\n\n```java\nint[] even_numbers = {2, 4, 6, 8, 10};\n```\n\n**C++:**\n\n```cpp\n#inclu

### **Setting up the Training**
We will use the Hugging Face `peft` library together with `transformers` and `trl`.

In [31]:
from peft import AutoPeftModelForCausalLM, LoraConfig, get_peft_model, prepare_model_for_kbit_training

peft_config = LoraConfig(r=8, lora_alpha=16, lora_dropout=0.05, bias="none", task_type="CAUSAL_LM")
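
As a rough illustration of what `r=8` and `lora_alpha=16` imply, the sketch below counts the parameters a LoRA adapter adds. The layer shapes and the choice of adapted projections are assumptions (they are not stated in this notebook), and `lora_param_count` is a hypothetical helper name.

```python
def lora_param_count(d_in, d_out, r):
    """Parameters added by one LoRA adapter: A is (r x d_in), B is (d_out x r)."""
    return r * (d_in + d_out)


r, lora_alpha = 8, 16      # values from the LoraConfig above
scaling = lora_alpha / r   # the low-rank update is multiplied by this factor

# Assumed shapes (not stated in the notebook): 32 decoder layers, with LoRA
# applied to a 4096 -> 4096 query projection and a 4096 -> 1024 value projection.
num_layers = 32
per_layer = lora_param_count(4096, 4096, r) + lora_param_count(4096, 1024, r)
total = num_layers * per_layer

print(f"scaling factor: {scaling}")
print(f"trainable LoRA parameters: {total:,}")
```

Once the model has been wrapped with `get_peft_model`, calling `model.print_trainable_parameters()` reports the actual count for comparison.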


* The model must be prepared for 4-bit training, so we use the **`prepare_model_for_kbit_training`** function from `peft`.




In [32]:
model = prepare_model_for_kbit_training(model)
model = get_peft_model(model, peft_config)

### **Training Hyperparameters**
The right hyperparameters depend on how long you want training to run. Two deserve particular attention:

* `num_train_epochs` / `max_steps`: Control how many passes (or optimizer steps) are made over the data. Too many can lead to overfitting. When `max_steps` is set to a positive value, it overrides `num_train_epochs`.

* `learning_rate`: Sets the size of each weight update. Too high a value can make training unstable, and too low a value slows convergence.

In [33]:
from transformers import TrainingArguments

output_model = "mistral_instruct_generation"
training_arguments = TrainingArguments(
        output_dir=output_model,
        per_device_train_batch_size=1,
        gradient_accumulation_steps=4,  # accumulate 4 batches per optimizer update
        optim="paged_adamw_32bit",
        learning_rate=2e-4,
        lr_scheduler_type="cosine",
        save_strategy="epoch",
        logging_steps=10,
        num_train_epochs=1,
        max_steps=100,  # a positive max_steps overrides num_train_epochs
        fp16=True,
)
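
The arithmetic behind the training arguments above can be worked through as follows. This is a sketch that assumes a single GPU (`num_devices = 1` is not stated in the notebook).

```python
per_device_train_batch_size = 1
gradient_accumulation_steps = 4
max_steps = 100
num_devices = 1  # assumption: a single GPU, as in a typical Colab session

# Sequences contributing to each optimizer update
effective_batch_size = per_device_train_batch_size * gradient_accumulation_steps * num_devices

# Sequences consumed over the whole run
sequences_seen = effective_batch_size * max_steps

print(f"effective batch size: {effective_batch_size}")
print(f"sequences seen in {max_steps} steps: {sequences_seen}")
```

Raising `gradient_accumulation_steps` increases the effective batch size without increasing per-step GPU memory, at the cost of fewer optimizer updates per sequence processed.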


### **Setting up the trainer**

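`max_seq_length`: the maximum number of tokens in each training sequence. With `packing=True`, multiple short examples are concatenated to fill sequences of this length.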


In [34]:
from trl import SFTTrainer

max_seq_length = 512

trainer = SFTTrainer(
  model=model,
  peft_config=peft_config,
  max_seq_length=max_seq_length,
  tokenizer=tokenizer,
  packing=True,
  formatting_func=create_prompt, # applies create_prompt to every sample in the training and test datasets
  args=training_arguments,
  train_dataset=instruct_tune_dataset["train"],
  eval_dataset=instruct_tune_dataset["test"]
)
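
To make the effect of `packing=True` concrete, here is a simplified illustration of the idea using toy token ids. It is not TRL's actual implementation, and `pack_sequences` is a hypothetical name: tokenized examples are joined into one stream, separated by an EOS id, and the stream is cut into chunks of `max_seq_length` tokens.

```python
def pack_sequences(tokenized_examples, max_seq_length, eos_token_id):
    """Concatenate token lists (EOS-separated) and cut them into equal chunks."""
    stream = []
    for token_ids in tokenized_examples:
        stream.extend(token_ids)
        stream.append(eos_token_id)  # separator between examples
    # Keep only full-length chunks; the incomplete tail is dropped
    usable = len(stream) - len(stream) % max_seq_length
    return [stream[i:i + max_seq_length] for i in range(0, usable, max_seq_length)]


# Toy token ids; 0 stands in for the EOS token id
examples = [[11, 12, 13], [21, 22], [31, 32, 33, 34], [41]]
packed = pack_sequences(examples, max_seq_length=4, eos_token_id=0)
print(packed)
```

Because every packed sequence is full length, no compute is spent on padding, but a single training sequence may contain parts of several unrelated examples.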




### **Training starts here**

In [35]:
trainer.train()

You're using a LlamaTokenizerFast tokenizer. Please note that with a fast tokenizer, using the `__call__` method is faster than using a method to encode the text followed by a call to the `pad` method to get a padded encoding.


| Step | Training Loss |
|------|---------------|
| 10   | 1.527         |
| 20   | 0.8725        |
| 30   | 0.6953        |
| 40   | 0.6378        |
| 50   | 0.6656        |
| 60   | 0.6709        |
| 70   | 0.6851        |
| 80   | 0.6611        |
| 90   | 0.6432        |
| 100  | 0.6319        |


TrainOutput(global_step=100, training_loss=0.769051103591919, metrics={'train_runtime': 807.742, 'train_samples_per_second': 0.495, 'train_steps_per_second': 0.124, 'total_flos': 8741766719078400.0, 'train_loss': 0.769051103591919, 'epoch': 0.34})

### **Save the model**

In [36]:
trainer.save_model("mistral_instruct_generation")

In [37]:
merged_model = model.merge_and_unload()



In [38]:
def generate_response(prompt, model):
  encoded_input = tokenizer(prompt,  return_tensors="pt", add_special_tokens=True)
  model_inputs = encoded_input.to('cuda')

  generated_ids = model.generate(**model_inputs, max_new_tokens=1024, do_sample=True, pad_token_id=tokenizer.eos_token_id)

  decoded_output = tokenizer.batch_decode(generated_ids)

  return decoded_output[0]
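
Since this version returns the full decoded text, each response echoes the prompt before the answer. One possible way to isolate the answer is sketched below. `extract_answer` is a hypothetical helper, not part of the notebook, and the decoded string is a made-up stand-in for real model output.

```python
def extract_answer(decoded_text, end_of_instruction="[/INST]", eos_token="</s>"):
    """Return the text generated after the last [/INST] tag, without the EOS token."""
    # rpartition splits on the last occurrence of the tag
    _, _, answer = decoded_text.rpartition(end_of_instruction)
    return answer.replace(eos_token, "").strip()


decoded = "<s><s> [INST] Coding Question: Find the smallest element.[/INST] min(arr)</s>"
print(extract_answer(decoded))
```

Splitting on the tag rather than on the prompt string avoids depending on the tokenizer reproducing the prompt character for character.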

### **Example No:1**

In [42]:
# Use a predefined template for instructions
prompt = '''<s>[INST] Think of yourself as a Coding Expert. Your expertise lies in solving coding problems with precision and clarity. Imagine you're ready to provide accurate solutions to any coding-related inquiry. Your goal is to solve the given coding problem with utmost accuracy. Please proceed with the solution once the question is provided. Coding Question: '''
prompt += '''Write a Python function to replace all the occurrences of a given substring with another. string = "Hello World"
substring = "World"
replacement = "Universe" [/INST]'''
response = generate_response(prompt, model)
# Print the response with formatted output
print(response)



<s><s> [INST] Think of yourself as a Coding Expert. Your expertise lies in solving coding problems with precision and clarity. Imagine you're ready to provide accurate solutions to any coding-related inquiry. Your goal is to solve the given coding problem with utmost accuracy. Please proceed with the solution once the question is provided. Coding Question: Write a Python function to replace all the occurrences of a given substring with another. string = "Hello World"
substring = "World"
replacement = "Universe" [/INST] def replace_substring(string, substring, replacement):
    """
    Replace all occurrences of a given substring with another in a string

    :param string: str, given string
    :param substring: str, substring to be replaced
    :param replacement: str, the replacement substring
    :return: str, modified string
    """
    return string.replace(substring, replacement) 

# Test the function
string = "Hello World"
substring = "World"
replacement = "Universe"
print(repla

### **Example No:2**

In [43]:
# Use a predefined template for instructions
prompt = '''<s>[INST] Think of yourself as a Coding Expert. Your expertise lies in solving coding problems with precision and clarity. Imagine you're ready to provide accurate solutions to any coding-related inquiry. Your goal is to solve the given coding problem with utmost accuracy. Please proceed with the solution once the question is provided. Coding Question: '''
prompt += '''Find the smallest element in a given array.[6, 8, 1, 4, 5][/INST]'''
response = generate_response(prompt, model)
# Print the response with formatted output
print(response)

<s><s> [INST] Think of yourself as a Coding Expert. Your expertise lies in solving coding problems with precision and clarity. Imagine you're ready to provide accurate solutions to any coding-related inquiry. Your goal is to solve the given coding problem with utmost accuracy. Please proceed with the solution once the question is provided. Coding Question: Find the smallest element in a given array.[6, 8, 1, 4, 5][/INST] Function find_smallest(arr):

    smallest = arr[0]
    for i in arr:
        if smallest > i:
            smallest = i
    return smallest

# Test the function
arr = [6, 8, 1, 4, 5]
print(find_smallest(arr)) # Output: 1</s>


### **Example No:3**

In [44]:
# Use a predefined template for instructions
prompt = '''<s>[INST] Think of yourself as a Coding Expert. Your expertise lies in solving coding problems with precision and clarity. Imagine you're ready to provide accurate solutions to any coding-related inquiry. Your goal is to solve the given coding problem with utmost accuracy. Please proceed with the solution once the question is provided. Coding Question: '''
prompt += '''Formulate an equation to calculate the height of a triangle given the angle, side lengths and opposite side length. [/INST]'''
response = generate_response(prompt, model)
# Print the response with formatted output
print(response)

<s><s> [INST] Think of yourself as a Coding Expert. Your expertise lies in solving coding problems with precision and clarity. Imagine you're ready to provide accurate solutions to any coding-related inquiry. Your goal is to solve the given coding problem with utmost accuracy. Please proceed with the solution once the question is provided. Coding Question: Formulate an equation to calculate the height of a triangle given the angle, side lengths and opposite side length. [/INST] To calculate the height of a triangle given the angle, side lengths and opposite side length, you can use the following formula:

h = opposite side length / (2 * tan(angle / 2))

where:
- h is the height,
- angle is the angle in radians,
- opposite side length is the length of the side opposite to the angle.</s>
