<a href="https://colab.research.google.com/github/fatemafaria142/SQL-Query-Answer-Generation/blob/main/SQL_Query_Answer_Generation_using_Mistral_7B_Instruct_v0.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

### **Install Required Packages**

In [1]:
!pip install accelerate peft bitsandbytes transformers trl datasets torch

Collecting accelerate
  Downloading accelerate-0.26.1-py3-none-any.whl (270 kB)
Collecting peft
  Downloading peft-0.7.1-py3-none-any.whl (168 kB)
Collecting bitsandbytes
  Downloading bitsandbytes-0.42.0-py3-none-any.whl (105.0 MB)
Collecting trl
  Downloading trl-0.7.10-py3-none-any.whl (150 kB)

#### **Dataset Link:**  https://huggingface.co/datasets/b-mc2/sql-create-context

In [2]:
from datasets import load_dataset
instruct_tune_dataset = load_dataset("b-mc2/sql-create-context")

The secret `HF_TOKEN` does not exist in your Colab secrets.
To authenticate with the Hugging Face Hub, create a token in your settings tab (https://huggingface.co/settings/tokens), set it as secret in your Google Colab and restart your session.
You will be able to reuse this secret in all of your notebooks.
Please note that authentication is recommended but still optional to access public models or datasets.



### **Dataset structure**
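* The dataset contains three columns: `context` (the `CREATE TABLE` statement), `question` (the natural-language request), and `answer` (the target SQL query).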

In [3]:
instruct_tune_dataset

DatasetDict({
    train: Dataset({
        features: ['context', 'question', 'answer'],
        num_rows: 78577
    })
})

In [4]:
# Display information for 5 data points from the 'train' split
num_samples_to_show = 5
for i in range(num_samples_to_show):
    data = instruct_tune_dataset['train'][i]
    print(f"Data Point {i + 1}:")
    print("Answer:", data['answer'])
    print("Question:", data['question'])
    print("Context:", data['context'])
    print("\n-----------------------------\n")


Data Point 1:
Answer: SELECT COUNT(*) FROM head WHERE age > 56
Question: How many heads of the departments are older than 56 ?
Context: CREATE TABLE head (age INTEGER)

-----------------------------

Data Point 2:
Answer: SELECT name, born_state, age FROM head ORDER BY age
Question: List the name, born state and age of the heads of departments ordered by age.
Context: CREATE TABLE head (name VARCHAR, born_state VARCHAR, age VARCHAR)

-----------------------------

Data Point 3:
Answer: SELECT creation, name, budget_in_billions FROM department
Question: List the creation year, name and budget of each department.
Context: CREATE TABLE department (creation VARCHAR, name VARCHAR, budget_in_billions VARCHAR)

-----------------------------

Data Point 4:
Answer: SELECT MAX(budget_in_billions), MIN(budget_in_billions) FROM department
Question: What are the maximum and minimum budget of the departments?
Context: CREATE TABLE department (budget_in_billions INTEGER)

----------------------------

### **We will use just a small subset of the data for this training example**

In [5]:
# Keep a small training subset and a separate, non-overlapping test subset
full_train = instruct_tune_dataset["train"]
instruct_tune_dataset["train"] = full_train.select(range(4500))
instruct_tune_dataset["test"] = full_train.select(range(4500, 5000))
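
A held-out split is only informative when its rows are not also part of the training split. A minimal sketch of that check on plain index ranges follows; `ranges_overlap` is a hypothetical helper and the ranges are illustrative, not something the `datasets` library provides.

```python
def ranges_overlap(a, b):
    """Return True if two index ranges share at least one row index."""
    return len(set(a) & set(b)) > 0


train_idx = range(0, 4500)

# Test rows taken from inside the training rows: the two splits overlap.
print(ranges_overlap(train_idx, range(0, 500)))

# Test rows taken right after the training rows: the two splits are disjoint.
print(ranges_overlap(train_idx, range(4500, 5000)))
```

If the check reports an overlap, evaluation loss on the test split will look better than it would on unseen data.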

In [6]:
instruct_tune_dataset

DatasetDict({
    train: Dataset({
        features: ['context', 'question', 'answer'],
        num_rows: 4500
    })
    test: Dataset({
        features: ['context', 'question', 'answer'],
        num_rows: 500
    })
})

* The Mistral chat format wraps each user message in the control tokens `[INST]` and `[/INST]`; assistant messages are not wrapped. **Mistral-instruct was trained with these tokens.**
* To benefit from instruction fine-tuning, surround your prompt with `[INST]` and `[/INST]`. Only the first instruction should be preceded by the beginning-of-sentence token (`<s>`); later instructions should not. Each assistant response ends with the end-of-sentence token (`</s>`).
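
The format described above can be sketched in plain Python. `build_chat_prompt` is a hypothetical helper, not part of `transformers`, and the exact spacing around the tags is an assumption; treat it as an illustration of where the tokens go rather than a reference implementation.

```python
BOS, EOS = "<s>", "</s>"


def build_chat_prompt(turns):
    """Build a multi-turn prompt.

    turns: list of (user_message, assistant_reply) pairs; use None as the
    reply of the last turn when the model should generate it.
    """
    prompt = BOS  # the beginning-of-sentence token appears only once
    for user_msg, assistant_msg in turns:
        prompt += f"[INST] {user_msg} [/INST]"
        if assistant_msg is not None:
            # every completed assistant reply is closed with the EOS token
            prompt += f" {assistant_msg}{EOS}"
    return prompt


example = build_chat_prompt([
    ("How many rows are in table t?", "SELECT COUNT(*) FROM t"),
    ("And how many distinct ids?", None),
])
print(example)
```

The last turn is left open, so the model's generation becomes the assistant reply for that turn.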

#### **Prompt Creation**

In [7]:
def create_prompt(sample):
    """
    Build one training example in the Mistral instruct format:
    <s>[INST]instructions + table schema + question[/INST]answer</s>
    """
    bos_token = "<s>"
    eos_token = "</s>"

    # Use a predefined template for instructions
    instructions_template = "Your task is to write SQL queries to retrieve specific information from the given database. Please follow the instructions carefully and use the appropriate SQL syntax. "



    full_prompt = ""
    full_prompt += bos_token
    full_prompt += "[INST]"
    full_prompt += instructions_template
    full_prompt += "Here is the table I have created: "
    full_prompt += sample['context']
    full_prompt += " and My question is about the table is: "
    full_prompt += sample['question']
    full_prompt += "[/INST]"
    full_prompt += sample['answer']
    full_prompt += eos_token

    return full_prompt

In [8]:
create_prompt(instruct_tune_dataset["train"][0])

'<s>[INST]Your task is to write SQL queries to retrieve specific information from the given database. Please follow the instructions carefully and use the appropriate SQL syntax. Here is the table I have created: CREATE TABLE head (age INTEGER) and My question is about the table is: How many heads of the departments are older than 56 ?[/INST]SELECT COUNT(*) FROM head WHERE age > 56</s>'

In [9]:
create_prompt(instruct_tune_dataset["train"][1])

'<s>[INST]Your task is to write SQL queries to retrieve specific information from the given database. Please follow the instructions carefully and use the appropriate SQL syntax. Here is the table I have created: CREATE TABLE head (name VARCHAR, born_state VARCHAR, age VARCHAR) and My question is about the table is: List the name, born state and age of the heads of departments ordered by age.[/INST]SELECT name, born_state, age FROM head ORDER BY age</s>'

In [10]:
create_prompt(instruct_tune_dataset["train"][2])

'<s>[INST]Your task is to write SQL queries to retrieve specific information from the given database. Please follow the instructions carefully and use the appropriate SQL syntax. Here is the table I have created: CREATE TABLE department (creation VARCHAR, name VARCHAR, budget_in_billions VARCHAR) and My question is about the table is: List the creation year, name and budget of each department.[/INST]SELECT creation, name, budget_in_billions FROM department</s>'

### **Initializing the Model**
* Load the model in 4-bit NF4 precision with double quantization, and set float16 as the compute data type.

* Notably, we opt for the instruct-tuned model in this instance rather than the base model. It's worth mentioning that fine-tuning a base model necessitates a more substantial amount of data!
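
To see why 4-bit loading matters on a single Colab GPU, here is a rough back-of-the-envelope estimate of weight storage. The parameter count of 7.24 billion is an assumption for a "7B" Mistral model, and the estimate ignores quantization constants, activations, LoRA adapters, and optimizer state.

```python
def weight_memory_gb(num_params, bits_per_param):
    """Approximate weight storage in gigabytes (1 GB = 1e9 bytes)."""
    return num_params * bits_per_param / 8 / 1e9


NUM_PARAMS = 7.24e9  # assumed parameter count, for illustration only

fp16_gb = weight_memory_gb(NUM_PARAMS, 16)  # half-precision weights
nf4_gb = weight_memory_gb(NUM_PARAMS, 4)    # 4-bit quantized weights

print(f"16-bit weights: {fp16_gb:.2f} GB")
print(f"4-bit weights:  {nf4_gb:.2f} GB")
```

Under these assumptions the quantized weights need roughly a quarter of the half-precision footprint.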

In [11]:
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig
import torch

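bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,                     # store the weights in 4-bit precision
    bnb_4bit_quant_type="nf4",             # NormalFloat4 quantization
    bnb_4bit_compute_dtype=torch.float16,  # dtype used for the matrix multiplications
    bnb_4bit_use_double_quant=True,        # also quantize the quantization constants
)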


* **Model and Tokenizer Link:** https://huggingface.co/mistralai/Mistral-7B-Instruct-v0.1

In [12]:
model_id = "mistralai/Mistral-7B-Instruct-v0.1"

In [13]:
model = AutoModelForCausalLM.from_pretrained(
        model_id, quantization_config=bnb_config, device_map="auto", use_cache=False
    )


In [14]:
tokenizer = AutoTokenizer.from_pretrained("mistralai/Mistral-7B-Instruct-v0.1")
tokenizer.pad_token = tokenizer.eos_token
tokenizer.padding_side = "right"


### **Let's examine how well the model currently performs at this task:**

In [15]:
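def generate_response(prompt, model):
  # Tokenize the prompt (the tokenizer prepends <s>) and move the tensors to the GPU
  encoded_input = tokenizer(prompt, return_tensors="pt", add_special_tokens=True)
  model_inputs = encoded_input.to('cuda')

  # Sample up to 256 new tokens with top-k sampling
  generated_ids = model.generate(
      **model_inputs, max_new_tokens=256, top_k=5, temperature=0.7,
      repetition_penalty=1.2, do_sample=True, pad_token_id=tokenizer.eos_token_id
  )

  decoded_output = tokenizer.batch_decode(generated_ids)

  # Remove the echoed prompt so only the special tokens and the completion remain
  return decoded_output[0].replace(prompt, "")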

In [16]:
# Use a predefined template for instructions
prompt = "[INST] Your task is to write SQL queries to retrieve specific information from the given database. Please follow the instructions carefully and use the appropriate SQL syntax. "
prompt += "Here is the table I have created: "
prompt +="CREATE TABLE inst (Id VARCHAR) " #context
prompt +="and My question is about the table is: "
prompt += "How many institutions are there?[/INST] " #question
print(prompt)

[INST] Your task is to write SQL queries to retrieve specific information from the given database. Please follow the instructions carefully and use the appropriate SQL syntax. Here is the table I have created: CREATE TABLE inst (Id VARCHAR) and My question is about the table is: How many institutions are there?[/INST] 


In [17]:
generate_response(prompt, model)



'<s> 1. To count the number of rows in the "inst" table, you can use the following query:\n```\nSELECT COUNT(*) FROM inst;\n```</s>'

### **Setting up the Training**
We will use the Hugging Face libraries together with `peft`. The main `LoraConfig` parameters are:
* `r (int):` LoRA attention dimension (the rank of the update matrices).
* `lora_alpha (int):` The alpha parameter for LoRA scaling.
* `lora_dropout (float):` The dropout probability for LoRA layers.
* `bias (str):` Bias type for LoRA. Can be 'none', 'all' or 'lora_only'. If 'all' or 'lora_only', the corresponding biases are updated during training.
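
LoRA keeps the original weight matrix frozen and learns two small factors whose product, scaled by `lora_alpha / r`, is added to it. A minimal sketch of what `r` and `lora_alpha` imply for one linear layer follows; the layer width of 4096 is an assumption chosen for illustration, and `lora_param_count` is a hypothetical helper.

```python
def lora_param_count(d_in, d_out, r):
    """Parameters in the two low-rank factors A (r x d_in) and B (d_out x r)."""
    return r * d_in + d_out * r


r, lora_alpha = 8, 16
d_in = d_out = 4096  # assumed layer width, for illustration only

full = d_in * d_out                      # parameters in the frozen weight matrix
lora = lora_param_count(d_in, d_out, r)  # trainable parameters added by LoRA
scaling = lora_alpha / r                 # factor applied to the low-rank update

print(f"frozen: {full:,}  trainable: {lora:,}  ratio: {lora / full:.4%}")
print(f"scaling: {scaling}")
```

A larger `r` gives the adapter more capacity at the cost of more trainable parameters, while `lora_alpha` controls how strongly the learned update is weighted.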

In [22]:
from peft import AutoPeftModelForCausalLM, LoraConfig, get_peft_model, prepare_model_for_kbit_training
#This is the configuration class to store the configuration of a [LoraModel].
peft_config = LoraConfig(r=8, lora_alpha=16, lora_dropout=0.05, bias="none", task_type="CAUSAL_LM")


* We need to prepare the model for 4-bit training, so we use the **`prepare_model_for_kbit_training`** function from `peft`.




In [23]:
model = prepare_model_for_kbit_training(model)
model = get_peft_model(model, peft_config)

### **Training Hyperparameters**
The choice of hyperparameters is contingent upon the desired training duration. Pay special attention to the following key factors:

* `num_train_epochs` / `max_steps`: Set how long training runs, either as passes over the data or as optimizer steps. When both are given, `max_steps` takes precedence. Exercise caution, as too many iterations may lead to overfitting!

* `learning_rate`: Sets the size of each optimizer update. Too high a value can make training unstable; too low a value slows convergence.

In [24]:
from transformers import TrainingArguments
output_model= "mistral_instruct_generation"
training_arguments = TrainingArguments(
        output_dir=output_model,
        per_device_train_batch_size=1,
        gradient_accumulation_steps=4,
        optim="paged_adamw_32bit",
        learning_rate=2e-4,
        lr_scheduler_type="cosine",
        save_strategy="epoch",
        logging_steps=10,
        num_train_epochs=1,
        max_steps=150,
        fp16=True,
)
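
The arguments above imply an effective batch size and a learning-rate curve. The sketch below works both out in plain Python, assuming a single GPU, no warmup, and a plain half-cosine decay to zero; `cosine_lr` is a hypothetical helper, not the scheduler object the trainer creates.

```python
import math


def cosine_lr(step, total_steps, base_lr):
    """Cosine decay from base_lr at step 0 to 0 at total_steps (no warmup)."""
    progress = step / total_steps
    return base_lr * 0.5 * (1.0 + math.cos(math.pi * progress))


per_device_batch, grad_accum, max_steps, base_lr = 1, 4, 150, 2e-4

# Sequences contributing to one optimizer update (single-GPU assumption)
effective_batch = per_device_batch * grad_accum
# Total sequences processed over the whole run
sequences_seen = effective_batch * max_steps
print(effective_batch, sequences_seen)

# Learning rate at the start, midpoint, and end of training
for step in (0, 75, 150):
    print(step, cosine_lr(step, max_steps, base_lr))
```

Gradient accumulation trades speed for memory: each update still averages over four sequences, but only one is held on the GPU at a time.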


### **Setting up the trainer**

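`max_seq_length`: The maximum number of tokens per training sequence. With `packing=True`, several short examples are concatenated into sequences of this length.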
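
Packing can be pictured as concatenating tokenized examples and cutting the result into fixed-length blocks. The sketch below is a simplified illustration under that assumption; `pack_sequences` is a hypothetical helper, and the trainer's actual implementation may differ in how it buffers examples and handles leftovers.

```python
def pack_sequences(tokenized_examples, seq_length, eos_id):
    """Concatenate examples (each followed by eos_id) and cut into full blocks."""
    buffer = []
    for tokens in tokenized_examples:
        buffer.extend(tokens)
        buffer.append(eos_id)  # separator between consecutive examples
    n_blocks = len(buffer) // seq_length  # an incomplete trailing block is dropped
    return [buffer[i * seq_length:(i + 1) * seq_length] for i in range(n_blocks)]


examples = [[1, 2, 3], [4, 5], [6, 7, 8, 9]]
blocks = pack_sequences(examples, seq_length=4, eos_id=0)
print(blocks)  # [[1, 2, 3, 0], [4, 5, 0, 6], [7, 8, 9, 0]]
```

Because blocks are always full, no compute is spent on padding tokens, but a single example can be split across two blocks.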


In [25]:
from trl import SFTTrainer

max_seq_length = 256

trainer = SFTTrainer(
  model=model,
  peft_config=peft_config,
  max_seq_length=max_seq_length,
  tokenizer=tokenizer,
  packing=True,
  formatting_func=create_prompt, # applies create_prompt to every example in the training and test datasets
  args=training_arguments,
  train_dataset=instruct_tune_dataset["train"],
  eval_dataset=instruct_tune_dataset["test"]
)




### **Training starts here**

In [26]:
trainer.train()

You're using a LlamaTokenizerFast tokenizer. Please note that with a fast tokenizer, using the `__call__` method is faster than using a method to encode the text followed by a call to the `pad` method to get a padded encoding.


| Step | Training Loss |
|------|---------------|
| 10 | 1.65 |
| 20 | 0.9146 |
| 30 | 0.6446 |
| 40 | 0.5549 |
| 50 | 0.5476 |
| 60 | 0.5404 |
| 70 | 0.539 |
| 80 | 0.5348 |
| 90 | 0.5322 |
| 100 | 0.5118 |


TrainOutput(global_step=150, training_loss=0.6325917879740397, metrics={'train_runtime': 832.8364, 'train_samples_per_second': 0.72, 'train_steps_per_second': 0.18, 'total_flos': 6556325039308800.0, 'train_loss': 0.6325917879740397, 'epoch': 0.25})
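
The figures in `TrainOutput` can be cross-checked against the training arguments with a little arithmetic. The sketch below assumes a single GPU, and the sequences-per-epoch estimate is only approximate because the reported `epoch` value is rounded.

```python
# Values copied from the training arguments and the TrainOutput above
global_step = 150
per_device_batch = 1
grad_accum = 4
train_runtime = 832.8364  # seconds
epoch_fraction = 0.25

# Sequences processed during the run (single-GPU assumption)
sequences = global_step * per_device_batch * grad_accum

samples_per_second = sequences / train_runtime
steps_per_second = global_step / train_runtime

# Rough size of one epoch in packed sequences, given 0.25 epochs were completed
sequences_per_epoch = sequences / epoch_fraction

print(round(samples_per_second, 2), round(steps_per_second, 2), sequences_per_epoch)
```

Both throughput numbers match the logged `train_samples_per_second` and `train_steps_per_second`, which suggests the run processed about 600 packed sequences in total.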

### **Save the model**

In [27]:
trainer.save_model("mistral_instruct_generation")

In [28]:
merged_model = model.merge_and_unload()



### **Example No:1**

In [32]:
# Use a predefined template for instructions
prompt = "[INST] Your task is to write SQL queries to retrieve specific information from the given database. Please follow the instructions carefully and use the appropriate SQL syntax. "
prompt += "Here is the table I have created: "
prompt +="CREATE TABLE loan (branch_id VARCHAR); CREATE TABLE bank (bname VARCHAR, branch_id VARCHAR) " #context
prompt +="and My question is about the table is: "
prompt += "Find the total amount of loans offered by each bank branch.[/INST] " #question
response = generate_response(prompt, merged_model)
# Print the response with formatted output
print(response)

<s> 1. SELECT b.bname FROM loan l JOIN bank b ON l.branch_id = b.branch_id GROUP BY b.bname</s>


### **Example No:2**

In [33]:
# Use a predefined template for instructions
prompt = "[INST] Your task is to write SQL queries to retrieve specific information from the given database. Please follow the instructions carefully and use the appropriate SQL syntax. "
prompt += "Here is the table I have created: "
prompt +="CREATE TABLE inst (Id VARCHAR) " #context
prompt +="and My question is about the table is: "
prompt += "How many institutions are there?[/INST] " #question
response = generate_response(prompt, merged_model)
# Print the response with formatted output
print(response)

<s> SELECT COUNT(*) FROM inst</s>


### **Example No:3**

In [34]:
# Use a predefined template for instructions
prompt = "[INST] Your task is to write SQL queries to retrieve specific information from the given database. Please follow the instructions carefully and use the appropriate SQL syntax. "
prompt += "Here is the table I have created: "
prompt +="CREATE TABLE player (Player_name VARCHAR, Player_ID VARCHAR); CREATE TABLE coach (coach_name VARCHAR, Coach_ID VARCHAR); CREATE TABLE player_coach (Coach_ID VARCHAR, Player_ID VARCHAR)  " #context
prompt +="and My question is about the table is: "
prompt += "Show the names of players and names of their coaches.[/INST] " #question
response = generate_response(prompt, merged_model)
# Print the response with formatted output
print(response)

<s> 1. SELECT p.Player_name, c.coach_name FROM player p JOIN coach c ON p.Player_ID = c.Coach_ID</s>
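
The responses above still contain the `<s>` and `</s>` tokens and sometimes a leading `1.`. A small post-processing step can reduce them to the bare query. `extract_sql` is a hypothetical helper written for the outputs shown in this notebook, not a general SQL parser.

```python
import re


def extract_sql(response):
    """Strip special tokens and a leading list marker such as '1.' from a response."""
    text = response.replace("<s>", "").replace("</s>", "").strip()
    # Drop a numbered-list prefix if the model produced one
    text = re.sub(r"^\d+\.\s*", "", text)
    return text


print(extract_sql("<s> 1. SELECT p.Player_name, c.coach_name FROM player p JOIN coach c ON p.Player_ID = c.Coach_ID</s>"))
print(extract_sql("<s> SELECT COUNT(*) FROM inst</s>"))
```

Passing `skip_special_tokens=True` to `tokenizer.batch_decode` is another way to drop the special tokens at decoding time.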
