<a href="https://colab.research.google.com/github/fatemafaria142/SQL-Query-Answer-Generation/blob/main/SQL_Query_Answer_Generation_using_Starling_LM_7B_alpha.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

### **Install Required Packages**

In [1]:
!pip install accelerate peft bitsandbytes transformers trl datasets torch

Collecting accelerate
  Downloading accelerate-0.26.1-py3-none-any.whl (270 kB)
Collecting peft
  Downloading peft-0.7.1-py3-none-any.whl (168 kB)
Collecting bitsandbytes
  Downloading bitsandbytes-0.42.0-py3-none-any.whl (105.0 MB)
Collecting trl
  Downloading trl-0.7.10-py3-none-any.whl (150 kB)
Collecting datasets
  Downloading datasets-2.16.1-py3-none-any.whl (507 kB)

#### **Dataset Link:** https://huggingface.co/datasets/b-mc2/sql-create-context

In [2]:
from datasets import load_dataset
instruct_tune_dataset = load_dataset("b-mc2/sql-create-context")

The secret `HF_TOKEN` does not exist in your Colab secrets.
To authenticate with the Hugging Face Hub, create a token in your settings tab (https://huggingface.co/settings/tokens), set it as secret in your Google Colab and restart your session.
You will be able to reuse this secret in all of your notebooks.
Please note that authentication is recommended but still optional to access public models or datasets.



### **Dataset structure**
* The dataset contains three columns: `context`, `question`, and `answer`.

In [3]:
instruct_tune_dataset

DatasetDict({
    train: Dataset({
        features: ['context', 'question', 'answer'],
        num_rows: 78577
    })
})

In [4]:
# Display information for 5 data points from the 'train' split
num_samples_to_show = 5
for i in range(num_samples_to_show):
    data = instruct_tune_dataset['train'][i]
    print(f"Data Point {i + 1}:")
    print("Context:", data['context'])
    print("Question:", data['question'])
    print("Answer:", data['answer'])
    print("\n-----------------------------\n")


Data Point 1:
Context: CREATE TABLE head (age INTEGER)
Question: How many heads of the departments are older than 56 ?
Answer: SELECT COUNT(*) FROM head WHERE age > 56

-----------------------------

Data Point 2:
Context: CREATE TABLE head (name VARCHAR, born_state VARCHAR, age VARCHAR)
Question: List the name, born state and age of the heads of departments ordered by age.
Answer: SELECT name, born_state, age FROM head ORDER BY age

-----------------------------

Data Point 3:
Context: CREATE TABLE department (creation VARCHAR, name VARCHAR, budget_in_billions VARCHAR)
Question: List the creation year, name and budget of each department.
Answer: SELECT creation, name, budget_in_billions FROM department

-----------------------------

Data Point 4:
Context: CREATE TABLE department (budget_in_billions INTEGER)
Question: What are the maximum and minimum budget of the departments?
Answer: SELECT MAX(budget_in_billions), MIN(budget_in_billions) FROM department

----------------------------

### **We will use just a small subset of the data for this training example**

In [5]:
# Take disjoint slices so the evaluation rows never appear in the training set
full_train = instruct_tune_dataset["train"]
instruct_tune_dataset["train"] = full_train.select(range(4500))
instruct_tune_dataset["test"] = full_train.select(range(4500, 5000))

In [6]:
instruct_tune_dataset

DatasetDict({
    train: Dataset({
        features: ['context', 'question', 'answer'],
        num_rows: 4500
    })
    test: Dataset({
        features: ['context', 'question', 'answer'],
        num_rows: 500
    })
})

In [7]:
def create_prompt(sample):
    # Use a predefined template for instructions
    prompt = "Your task is to write SQL queries to retrieve specific information from the given database. Please follow the instructions carefully and use the appropriate SQL syntax. "
    prompt += f"Context: {sample['context']}\n"
    prompt += f"User's Question: {sample['question']}\n"
    single_turn_prompt = f"GPT4 Correct User: {prompt}<|end_of_turn|>GPT4 Correct Assistant: {sample['answer']}"
    return single_turn_prompt
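
At inference time the same template is needed without the answer. The following is a minimal sketch, assuming a hypothetical helper `create_inference_prompt` (not part of the original notebook) that mirrors the layout of `create_prompt` and stops right after the assistant tag:

```python
INSTRUCTION = (
    "Your task is to write SQL queries to retrieve specific information "
    "from the given database. Please follow the instructions carefully "
    "and use the appropriate SQL syntax. "
)

def create_inference_prompt(context, question):
    """Hypothetical helper: same layout as create_prompt, but without the answer."""
    prompt = INSTRUCTION
    prompt += f"Context: {context}\n"
    prompt += f"User's Question: {question}\n"
    # End on the assistant tag so the model continues with the SQL query
    return f"GPT4 Correct User: {prompt}<|end_of_turn|>GPT4 Correct Assistant:"

example_prompt = create_inference_prompt(
    "CREATE TABLE head (age INTEGER)",
    "How many heads of the departments are older than 56 ?",
)
print(example_prompt)
```

Building inference prompts with the same function shape keeps whitespace and the position of `<|end_of_turn|>` identical to what the model saw during training.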

In [8]:
create_prompt(instruct_tune_dataset["train"][0])

"GPT4 Correct User: Your task is to write SQL queries to retrieve specific information from the given database. Please follow the instructions carefully and use the appropriate SQL syntax. Context: CREATE TABLE head (age INTEGER)\nUser's Question: How many heads of the departments are older than 56 ?\n<|end_of_turn|>GPT4 Correct Assistant: SELECT COUNT(*) FROM head WHERE age > 56"

In [9]:
create_prompt(instruct_tune_dataset["train"][1])

"GPT4 Correct User: Your task is to write SQL queries to retrieve specific information from the given database. Please follow the instructions carefully and use the appropriate SQL syntax. Context: CREATE TABLE head (name VARCHAR, born_state VARCHAR, age VARCHAR)\nUser's Question: List the name, born state and age of the heads of departments ordered by age.\n<|end_of_turn|>GPT4 Correct Assistant: SELECT name, born_state, age FROM head ORDER BY age"

In [10]:
create_prompt(instruct_tune_dataset["train"][2])

"GPT4 Correct User: Your task is to write SQL queries to retrieve specific information from the given database. Please follow the instructions carefully and use the appropriate SQL syntax. Context: CREATE TABLE department (creation VARCHAR, name VARCHAR, budget_in_billions VARCHAR)\nUser's Question: List the creation year, name and budget of each department.\n<|end_of_turn|>GPT4 Correct Assistant: SELECT creation, name, budget_in_billions FROM department"

In [11]:
create_prompt(instruct_tune_dataset["train"][3])

"GPT4 Correct User: Your task is to write SQL queries to retrieve specific information from the given database. Please follow the instructions carefully and use the appropriate SQL syntax. Context: CREATE TABLE department (budget_in_billions INTEGER)\nUser's Question: What are the maximum and minimum budget of the departments?\n<|end_of_turn|>GPT4 Correct Assistant: SELECT MAX(budget_in_billions), MIN(budget_in_billions) FROM department"

In [12]:
create_prompt(instruct_tune_dataset["train"][4])

"GPT4 Correct User: Your task is to write SQL queries to retrieve specific information from the given database. Please follow the instructions carefully and use the appropriate SQL syntax. Context: CREATE TABLE department (num_employees INTEGER, ranking INTEGER)\nUser's Question: What is the average number of employees of the departments whose rank is between 10 and 15?\n<|end_of_turn|>GPT4 Correct Assistant: SELECT AVG(num_employees) FROM department WHERE ranking BETWEEN 10 AND 15"

In [13]:
create_prompt(instruct_tune_dataset["train"][5])

"GPT4 Correct User: Your task is to write SQL queries to retrieve specific information from the given database. Please follow the instructions carefully and use the appropriate SQL syntax. Context: CREATE TABLE head (name VARCHAR, born_state VARCHAR)\nUser's Question: What are the names of the heads who are born outside the California state?\n<|end_of_turn|>GPT4 Correct Assistant: SELECT name FROM head WHERE born_state <> 'California'"

### **Initializing the Model**
* Load the model using a 4-bit configuration, employing double quantization, and set float16 as the compute data type.

* We use the instruct-tuned model here rather than the base model, because fine-tuning a base model requires a substantially larger amount of data.
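
To see why 4-bit loading matters on a single Colab GPU, here is a rough back-of-the-envelope estimate. It assumes about 7.24 billion parameters (an assumption, not a figure from this notebook) and ignores quantization constants and any layers kept in higher precision:

```python
# Assumption: roughly 7.24 billion parameters (not stated in the notebook).
NUM_PARAMS = 7.24e9

def weight_memory_gb(num_params, bits_per_param):
    """Approximate weight storage in gigabytes (1 GB = 1e9 bytes)."""
    return num_params * bits_per_param / 8 / 1e9

fp16_gb = weight_memory_gb(NUM_PARAMS, 16)
nf4_gb = weight_memory_gb(NUM_PARAMS, 4)

print(f"16-bit weights: ~{fp16_gb:.1f} GB")
print(f"4-bit weights:  ~{nf4_gb:.1f} GB (before quantization constants)")
```

The real footprint is somewhat higher than the 4-bit figure, since activations, the LoRA adapters, and optimizer state also need memory.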

In [14]:
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig
import torch

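bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,                     # store the weights in 4-bit precision
    bnb_4bit_quant_type="nf4",             # NormalFloat4 quantization
    bnb_4bit_compute_dtype=torch.float16,  # dtype used for computation
    bnb_4bit_use_double_quant=True,        # also quantize the quantization constants
)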


* **Model and Tokenizer:** https://huggingface.co/berkeley-nest/Starling-LM-7B-alpha

In [15]:
model_id = "berkeley-nest/Starling-LM-7B-alpha"

In [16]:
model = AutoModelForCausalLM.from_pretrained(
        model_id, quantization_config=bnb_config, device_map="auto", use_cache=False
    )


In [17]:
tokenizer = AutoTokenizer.from_pretrained("berkeley-nest/Starling-LM-7B-alpha")
tokenizer.pad_token = tokenizer.eos_token
tokenizer.padding_side = "right"


Special tokens have been added in the vocabulary, make sure the associated word embeddings are fine-tuned or trained.


### **Let's examine how well the model currently does at this task:**

In [18]:
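def generate_response(prompt, model):
    # Tokenize the prompt and move the tensors to the GPU
    encoded_input = tokenizer(prompt, return_tensors="pt", add_special_tokens=True)
    model_inputs = encoded_input.to('cuda')

    # Sample up to 256 new tokens; EOS doubles as the padding token
    generated_ids = model.generate(**model_inputs, max_new_tokens=256, do_sample=True, pad_token_id=tokenizer.eos_token_id)

    # Decode the full sequence (prompt + completion), special tokens included
    decoded_output = tokenizer.batch_decode(generated_ids)

    # Try to strip the prompt; the decoded text starts with "<s>" and may differ
    # in spacing, so the prompt can remain in the result
    return decoded_output[0].replace(prompt, "")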

In [19]:
prompt = '''GPT4 Correct User: Your task is to write SQL queries to retrieve specific information from the given database. Please follow the instructions carefully and use the appropriate SQL syntax. Context: CREATE TABLE head (name VARCHAR, born_state VARCHAR)
User's Question: What are the names of the heads who are born outside the California state?
<|end_of_turn|>GPT4 Correct Assistant:'''

In [20]:
generate_response(prompt, model)



"<s> GPT4 Correct User: Your task is to write SQL queries to retrieve specific information from the given database. Please follow the instructions carefully and use the appropriate SQL syntax. Context: CREATE TABLE head (name VARCHAR, born_state VARCHAR)\nUser's Question: What are the names of the heads who are born outside the California state?\n<|end_of_turn|> GPT4 Correct Assistant: To retrieve the names of the heads who are born outside the California state, you can use the following SQL query:\n\n```sql\nSELECT name\nFROM head\nWHERE born_state != 'California';\n```<|end_of_turn|>"

### **Setting up the Training**
We will use the Hugging Face `transformers`, `trl`, and `peft` libraries.

In [21]:
from peft import AutoPeftModelForCausalLM, LoraConfig, get_peft_model, prepare_model_for_kbit_training

peft_config = LoraConfig(r=8, lora_alpha=16, lora_dropout=0.05, bias="none", task_type="CAUSAL_LM")


* We need to prepare the model for 4-bit training, so we use the **`prepare_model_for_kbit_training`** function from `peft`.




In [22]:
model = prepare_model_for_kbit_training(model)
model = get_peft_model(model, peft_config)
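
To get a feel for how few parameters LoRA trains, the sketch below counts the extra weights for a single adapted layer. The rank and alpha come from the `LoraConfig` above; the 4096x4096 layer shape is an assumption made only for illustration:

```python
def lora_extra_params(d_in, d_out, r):
    """Parameters added by one LoRA adapter: A is (r x d_in), B is (d_out x r)."""
    return r * d_in + d_out * r

r, lora_alpha = 8, 16          # values from the LoraConfig above
d_in = d_out = 4096            # assumption: a square 4096x4096 projection layer

full = d_in * d_out
extra = lora_extra_params(d_in, d_out, r)
scaling = lora_alpha / r       # factor applied to the low-rank update

print(f"frozen weights:  {full:,}")
print(f"LoRA parameters: {extra:,} ({100 * extra / full:.2f}% of the layer)")
print(f"scaling factor:  {scaling}")
```

On the real model, `model.print_trainable_parameters()` reports the actual trainable and total parameter counts.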

### **Training Hyperparameters**
The choice of hyperparameters is contingent upon the desired training duration. Pay special attention to the following key factors:

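* `num_train_epochs` / `max_steps`: Set how long training runs, either in passes over the data or in optimizer steps. A positive `max_steps` overrides `num_train_epochs`. Exercise caution, as too many iterations may lead to overfitting!

* `learning_rate`: Controls the size of each update and therefore how quickly the model converges. Too high a value can destabilize training; too low a value slows it down.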

In [23]:
from transformers import TrainingArguments
output_model= "Starling_instruct_generation"
training_arguments = TrainingArguments(
        output_dir=output_model,
        per_device_train_batch_size=1,
        gradient_accumulation_steps=4,
        optim="paged_adamw_32bit",
        learning_rate=2e-4,
        lr_scheduler_type="cosine",
        save_strategy="epoch",
        logging_steps=10,
        num_train_epochs=1,
        max_steps=200,
        fp16=True,
)
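
The arguments above interact in ways that are easy to miss. The sketch below works out the effective batch size and the learning rate at a few steps. It assumes a single GPU and a cosine schedule without warmup; both are assumptions about this run rather than something the notebook states:

```python
import math

per_device_train_batch_size = 1
gradient_accumulation_steps = 4
max_steps = 200
learning_rate = 2e-4

# Sequences contributing to each optimizer update (assuming a single GPU)
effective_batch_size = per_device_train_batch_size * gradient_accumulation_steps
sequences_seen = effective_batch_size * max_steps

def cosine_lr(step, total_steps, peak_lr):
    """Cosine decay from peak_lr to 0 with no warmup (assumed)."""
    progress = step / total_steps
    return peak_lr * 0.5 * (1.0 + math.cos(math.pi * progress))

print("effective batch size:", effective_batch_size)
print("sequences seen:", sequences_seen)
for step in (0, 100, 200):
    print(step, f"{cosine_lr(step, max_steps, learning_rate):.2e}")
```

Because `max_steps=200` takes precedence over `num_train_epochs=1`, training stops after 200 optimizer steps even if that is less than one full pass over the data.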


### **Setting up the trainer**

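`max_seq_length`: the maximum number of tokens per training sequence. With `packing=True`, formatted samples are concatenated and split into blocks of this length.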
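
The idea behind packing can be sketched in a few lines of plain Python. This is a simplified illustration with toy token ids, not the `trl` implementation; the function name `pack_sequences` is hypothetical:

```python
def pack_sequences(tokenized_samples, eos_token_id, seq_length):
    """Concatenate samples (each followed by EOS) and cut fixed-length blocks."""
    buffer = []
    for sample in tokenized_samples:
        buffer.extend(sample + [eos_token_id])
    # Keep only full blocks; the incomplete tail is dropped in this sketch
    n_blocks = len(buffer) // seq_length
    return [buffer[i * seq_length:(i + 1) * seq_length] for i in range(n_blocks)]

# Toy token ids; 0 stands in for the EOS token
samples = [[5, 6, 7], [8, 9], [10, 11, 12, 13]]
blocks = pack_sequences(samples, eos_token_id=0, seq_length=4)
print(blocks)
```

Packing avoids padding short SQL examples up to the full sequence length, at the cost of blocks that may start or end in the middle of a sample.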


In [24]:
from trl import SFTTrainer

max_seq_length = 256

trainer = SFTTrainer(
  model=model,
  peft_config=peft_config,
  max_seq_length=max_seq_length,
  tokenizer=tokenizer,
  packing=True,
  formatting_func=create_prompt, # applies create_prompt to every sample in the training and evaluation datasets
  args=training_arguments,
  train_dataset=instruct_tune_dataset["train"],
  eval_dataset=instruct_tune_dataset["test"]
)




### **Training starts here**

In [25]:
trainer.train()

You're using a LlamaTokenizerFast tokenizer. Please note that with a fast tokenizer, using the `__call__` method is faster than using a method to encode the text followed by a call to the `pad` method to get a padded encoding.


| Step | Training Loss |
|------|---------------|
| 10   | 1.5242 |
| 20   | 0.842  |
| 30   | 0.579  |
| 40   | 0.7582 |
| 50   | 0.5522 |
| 60   | 0.5141 |
| 70   | 0.5015 |
| 80   | 0.4941 |
| 90   | 0.497  |
| 100  | 0.5156 |


TrainOutput(global_step=200, training_loss=0.5767436313629151, metrics={'train_runtime': 1039.1083, 'train_samples_per_second': 0.77, 'train_steps_per_second': 0.192, 'total_flos': 8741776785408000.0, 'train_loss': 0.5767436313629151, 'epoch': 0.33})

### **Save the model**

In [26]:
trainer.save_model("Starling_instruct_generation")

In [27]:
merged_model = model.merge_and_unload()
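
Conceptually, merging folds the low-rank update back into the frozen weight, `W + (lora_alpha / r) * B @ A`. The sketch below shows that arithmetic on a toy 2x2 layer with nested lists. It illustrates the formula only and is not how `peft` handles quantized weights internally:

```python
def matmul(a, b):
    """Multiply two matrices given as nested lists."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]

def merge_lora(weight, lora_b, lora_a, lora_alpha, r):
    """Return W + (lora_alpha / r) * (B @ A) for small nested-list matrices."""
    scaling = lora_alpha / r
    delta = matmul(lora_b, lora_a)
    return [[w + scaling * d for w, d in zip(w_row, d_row)]
            for w_row, d_row in zip(weight, delta)]

# Toy 2x2 layer with rank r=1 (illustrative numbers only)
W = [[1.0, 0.0], [0.0, 1.0]]
B = [[1.0], [2.0]]      # shape (d_out, r)
A = [[0.5, 0.25]]       # shape (r, d_in)
merged = merge_lora(W, B, A, lora_alpha=2, r=1)
print(merged)
```

After merging, inference needs no separate adapter computation because the update is already part of the weight matrix.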



In [28]:
def generate_response(prompt, model):
  encoded_input = tokenizer(prompt,  return_tensors="pt", add_special_tokens=True)
  model_inputs = encoded_input.to('cuda')

  generated_ids = model.generate(**model_inputs, max_new_tokens=256, do_sample=True, pad_token_id=tokenizer.eos_token_id)

  decoded_output = tokenizer.batch_decode(generated_ids)

  return decoded_output[0]
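
Since this version of `generate_response` returns the prompt together with the completion, it can help to pull out just the SQL. Here is a minimal sketch, assuming a hypothetical helper `extract_assistant_reply` and an illustrative decoded string rather than real model output:

```python
END_OF_TURN = "<|end_of_turn|>"
ASSISTANT_TAG = "GPT4 Correct Assistant:"

def extract_assistant_reply(decoded_text):
    """Hypothetical helper: return only the text after the last assistant tag."""
    # Keep everything after the final assistant tag, if present
    _, _, reply = decoded_text.rpartition(ASSISTANT_TAG)
    # Drop the end-of-turn marker and surrounding whitespace
    return reply.replace(END_OF_TURN, "").strip()

# Illustrative string in the same layout as the decoded outputs
decoded = (
    "<s> GPT4 Correct User: List all heads.\n<|end_of_turn|> "
    "GPT4 Correct Assistant: SELECT name FROM head<|end_of_turn|>"
)
reply = extract_assistant_reply(decoded)
print(reply)
```

Splitting on the assistant tag is more robust than `str.replace(prompt, "")`, because decoding can add a leading `<s>` and change spacing around special tokens.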

### **Example No:1**

In [30]:
prompt = "GPT4 Correct User: Your task is to write SQL queries to retrieve specific information from the given database. Please follow the instructions carefully and use the appropriate SQL syntax. "
prompt += "Context: CREATE TABLE services (service_name VARCHAR, service_id VARCHAR); CREATE TABLE first_notification_of_loss (service_id VARCHAR)\n"
prompt += "User's Question: Find the name of services that have been used for more than 2 times in first notification of loss. <|end_of_turn|>\n"
prompt += "GPT4 Correct Assistant:"
response = generate_response(prompt, model)
# Print the response with formatted output
print(response)

<s> GPT4 Correct User: Your task is to write SQL queries to retrieve specific information from the given database. Please follow the instructions carefully and use the appropriate SQL syntax. Context: CREATE TABLE services (service_name VARCHAR, service_id VARCHAR); CREATE TABLE first_notification_of_loss (service_id VARCHAR)
User's Question: Find the name of services that have been used for more than 2 times in first notification of loss. <|end_of_turn|> 
GPT4 Correct Assistant: SELECT T1.service_name
FROM services AS T1 JOIN first_notification_of_loss AS T2 ON T1.service_id = T2.service_id
GROUP BY T1.service_name
HAVING COUNT(*) > 2<|end_of_turn|>


### **Example No:2**

In [31]:
prompt = "GPT4 Correct User: Your task is to write SQL queries to retrieve specific information from the given database. Please follow the instructions carefully and use the appropriate SQL syntax. "
prompt += "Context: CREATE TABLE player (Player_name VARCHAR, Player_ID VARCHAR); CREATE TABLE coach (coach_name VARCHAR, Coach_ID VARCHAR); CREATE TABLE player_coach (Coach_ID VARCHAR, Player_ID VARCHAR)\n"
prompt += "User's Question: Show the names of players and names of their coaches. <|end_of_turn|>\n"
prompt += "GPT4 Correct Assistant:"
response = generate_response(prompt, model)
# Print the response with formatted output
print(response)

<s> GPT4 Correct User: Your task is to write SQL queries to retrieve specific information from the given database. Please follow the instructions carefully and use the appropriate SQL syntax. Context: CREATE TABLE player (Player_name VARCHAR, Player_ID VARCHAR); CREATE TABLE coach (coach_name VARCHAR, Coach_ID VARCHAR); CREATE TABLE player_coach (Coach_ID VARCHAR, Player_ID VARCHAR)
User's Question: Show the names of players and names of their coaches. <|end_of_turn|> 
GPT4 Correct Assistant: SELECT T1.Player_name, T2.coach_name FROM player AS T1 JOIN player_coach AS T2 ON T1.player_ID = T2.player_ID JOIN coach AS T3 ON T2.coach_ID = T3.Coach_ID<|end_of_turn|>


### **Example No:3**

In [32]:
prompt = "GPT4 Correct User: Your task is to write SQL queries to retrieve specific information from the given database. Please follow the instructions carefully and use the appropriate SQL syntax. "
prompt += "Context: CREATE TABLE Movie (title VARCHAR, director VARCHAR, YEAR VARCHAR)\n"
prompt += "User's Question: What are names of the movies that are either made before 1980 or directed by James Cameron? <|end_of_turn|>\n"
prompt += "GPT4 Correct Assistant:"
response = generate_response(prompt, model)
# Print the response with formatted output
print(response)

<s> GPT4 Correct User: Your task is to write SQL queries to retrieve specific information from the given database. Please follow the instructions carefully and use the appropriate SQL syntax. Context: CREATE TABLE Movie (title VARCHAR, director VARCHAR, YEAR VARCHAR)
User's Question: What are names of the movies that are either made before 1980 or directed by James Cameron? <|end_of_turn|> 
GPT4 Correct Assistant: SELECT title FROM Movie WHERE YEAR < 1980 OR director = 'James Cameron'<|end_of_turn|>


### **Example No:4**

In [33]:
prompt = "GPT4 Correct User: Your task is to write SQL queries to retrieve specific information from the given database. Please follow the instructions carefully and use the appropriate SQL syntax. "
prompt += "Context: CREATE TABLE organisation_Types (organisation_type VARCHAR, organisation_type_description VARCHAR); CREATE TABLE Grants (grant_id VARCHAR, organisation_id VARCHAR, grant_amount VARCHAR); CREATE TABLE Organisations (organisation_id VARCHAR, organisation_type VARCHAR); CREATE TABLE documents (sent_date VARCHAR, grant_id VARCHAR)\n"
prompt += "User's Question: Find out the send dates of the documents with the grant amount of more than 5000 were granted by organisation type described <|end_of_turn|>\n"
prompt += "GPT4 Correct Assistant:"
response = generate_response(prompt, model)
# Print the response with formatted output
print(response)

<s> GPT4 Correct User: Your task is to write SQL queries to retrieve specific information from the given database. Please follow the instructions carefully and use the appropriate SQL syntax. Context: CREATE TABLE organisation_Types (organisation_type VARCHAR, organisation_type_description VARCHAR); CREATE TABLE Grants (grant_id VARCHAR, organisation_id VARCHAR, grant_amount VARCHAR); CREATE TABLE Organisations (organisation_id VARCHAR, organisation_type VARCHAR); CREATE TABLE documents (sent_date VARCHAR, grant_id VARCHAR)
User's Question: Find out the send dates of the documents with the grant amount of more than 5000 were granted by organisation type described <|end_of_turn|> 
GPT4 Correct Assistant: SELECT sent_date FROM documents, Grants, Organisations WHERE documents.grant_id = Grants.grant_id AND Grants.organisation_id = Organisations.organisation_id AND organisation_type IN (SELECT organisation_type FROM organisation_Types WHERE organisation_type_description = 'described') AND 