# Text-to-SQL Fine Tuning Lab

**Note:** This is a slightly modified version of a notebook from [Phil Schmid's blog](https://www.philschmid.de/fine-tune-llms-in-2024-with-trl).

This lab illustrates fine-tuning a 7-billion-parameter LLM for text-to-SQL use cases. It is often valuable to mix general instruction-tuning examples into the dataset alongside the text-to-SQL examples, which makes the model more robust. If you include only SQL examples, the model will not generalize as well when your users' inputs drift over time.

## Concepts

### Fine tuning
Fine tuning a model is the process of taking an already trained model and further training it on specific tasks. In our case, we'll train it on an instruction-following dataset (Dolly) as well as SQL datasets.

### LoRA
LoRA is a parameter-efficient fine-tuning technique for large language models (LLMs). It works by introducing trainable low-rank decomposition matrices to the weights of the model. Instead of fine-tuning all parameters of a pre-trained model, LoRA freezes the original model weights and injects trainable rank decomposition matrices into each layer of the model.
The key idea behind LoRA is to represent the weight updates during fine-tuning as the product of two low-rank matrices. Mathematically, if W is the original weight matrix, the LoRA update can be expressed as:

W' = W + BA

Where B and A are low-rank matrices, and their product BA represents the update to the original weights.
LoRA works effectively for several reasons:

* Parameter efficiency: By using low-rank matrices, LoRA dramatically reduces the number of trainable parameters compared to full fine-tuning. This makes it possible to adapt large models on limited hardware.
* Preservation of pre-trained knowledge: Since the original weights are kept frozen, the model retains most of its pre-trained knowledge while learning new tasks.
* Adaptability: The low-rank update allows the model to learn task-specific adaptations without overfitting as easily as full fine-tuning might.
* Computational efficiency: Training and applying LoRA updates is computationally cheaper than full fine-tuning or using adapter layers.
* Theoretical foundation: The effectiveness of LoRA is grounded in the observation that the weight updates during fine-tuning often have a low intrinsic rank, meaning they can be well-approximated by low-rank matrices.
* Composability: Multiple LoRA adaptations can be combined, allowing for interesting multi-task and transfer learning scenarios.

The reason LoRA works so well is that it exploits the low intrinsic dimensionality of the updates needed to adapt a pre-trained model to a new task. By focusing on these key directions of change, LoRA can achieve performance comparable to full fine-tuning with only a fraction of the trainable parameters.
This approach has proven particularly effective for large language models, where the cost and computational requirements of full fine-tuning can be prohibitive.
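
To make the update rule `W' = W + BA` concrete, here is a minimal sketch using plain Python lists. The shapes and values are toy assumptions chosen for readability, not anything taken from a real model; the point is only to show how the low-rank product is added to the frozen weights and how the trainable parameter count compares.

```python
# Toy illustration of the LoRA update W' = W + BA using plain Python lists.
# All shapes and values are hypothetical; real layers are far larger.

def matmul(x, y):
    """Multiply two matrices given as lists of rows."""
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*y)] for row in x]

def matadd(x, y):
    """Add two matrices of the same shape element-wise."""
    return [[a + b for a, b in zip(rx, ry)] for rx, ry in zip(x, y)]

d, k, r = 4, 3, 1  # output dim, input dim, LoRA rank (assumed toy values)

W = [[1.0, 0.0, 0.0],
     [0.0, 1.0, 0.0],
     [0.0, 0.0, 1.0],
     [0.0, 0.0, 0.0]]              # frozen pre-trained weights, shape (d, k)
B = [[0.5], [0.0], [0.0], [1.0]]   # trainable factor, shape (d, r)
A = [[2.0, 0.0, 1.0]]              # trainable factor, shape (r, k)

delta = matmul(B, A)               # low-rank update BA, shape (d, k)
W_adapted = matadd(W, delta)       # W' = W + BA

full_params = d * k                # parameters updated by full fine-tuning
lora_params = r * (d + k)          # parameters trained by LoRA

print(W_adapted)
print(f"full fine-tuning: {full_params} params, LoRA: {lora_params} params")
```

With these toy shapes the saving is small, but because the LoRA count grows with `r * (d + k)` rather than `d * k`, the gap widens quickly as the layer dimensions grow.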

## Steps
1. Install dependencies & setup SageMaker Session
2. Create and process our dataset
3. Train our model in a notebook

# Takeaways
There are many ways to fine tune a model. This training job will take roughly 2 hours on a g5.2xlarge ($1.515/hr in us-west-2).

This means the total training job will cost about $3.03. Not bad!
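
The estimate is simple arithmetic, sketched below so you can plug in your own numbers. The duration and hourly rate are the figures quoted in the text; actual pricing and run time may differ.

```python
# Back-of-the-envelope cost estimate for the training job.
hours = 2.0            # approximate training time quoted in the text
hourly_rate = 1.515    # g5.2xlarge price quoted in the text (USD per hour, us-west-2)

estimated_cost = hours * hourly_rate
print(f"Estimated cost: ${estimated_cost:.2f}")
```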

# 2. Setup development environment
Our first step is to install the Hugging Face libraries and PyTorch, including trl, transformers, and datasets.

In [None]:
%pip install "torch==2.4.0" tensorboard

# Install Hugging Face libraries
%pip install  --upgrade \
  "transformers==4.44.2" \
  "datasets==2.21.0" \
  "accelerate==0.33.0" \
  "evaluate==0.4.2" \
  "bitsandbytes==0.43.3" \
  "trl==0.9.6" \
  "peft==0.12.0" 

In [None]:
# Import HF token

from dotenv import load_dotenv, find_dotenv
import os

# loading environment variables that are stored in local file
local_env_filename = 'dev.env'
load_dotenv(find_dotenv(local_env_filename),override=True)

os.environ['REGION'] = os.getenv('REGION')
os.environ['HF_TOKEN'] = os.getenv('HF_TOKEN')

REGION = os.environ['REGION']
HF_TOKEN = os.environ['HF_TOKEN']
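
One caveat with the cell above: assigning `os.getenv(...)` back into `os.environ` raises a `TypeError` if the variable is missing, which can be confusing to debug. A small helper like the following (the name `require_env` is hypothetical, not part of the lab code) is one way to fail with a clearer message.

```python
import os

def require_env(name: str) -> str:
    """Return the value of an environment variable or raise a descriptive error.

    `require_env` is an illustrative helper, not part of the original notebook.
    """
    value = os.getenv(name)
    if not value:
        raise RuntimeError(
            f"Environment variable {name!r} is not set; check your dev.env file."
        )
    return value
```

It could then be used as `HF_TOKEN = require_env("HF_TOKEN")` after `load_dotenv(...)` has run.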



# Install Flash Attention
For this lab, flash attention will take too long to install. Leave this commented out. It'll take longer for the model to train without flash attention, so it's recommended to use it when doing this work outside of a lab. 

In [None]:
# import torch; assert torch.cuda.get_device_capability()[0] >= 8, 'Hardware not supported for Flash Attention'
# # install flash-attn
# !pip install ninja packaging
# !MAX_JOBS=4 pip install flash-attn --no-build-isolation

# Login to Hugging Face
We need to log in to Hugging Face to download gated models.

In [None]:
!huggingface-cli login --token {HF_TOKEN}

# Create Dataset
To make a more robust model, we're going to take our synthetically generated data and mix it with an instruction dataset and a more generic SQL dataset. The original instruction tuning paper used ~15k examples, but later research indicates you may need far fewer to get a performant model.

Resources: 
1. [LIMA](https://arxiv.org/abs/2305.11206)
2. [Instruct](https://arxiv.org/abs/2203.02155)

Most LLMs released to consumers are further refined using reinforcement learning with human feedback. However, you can still get a decent model with regular supervised fine tuning (SFT). In a production system, the dataset would be changing over time and it's not uncommon to have 10s of thousands or even hundreds of thousands of training samples.

Because we're training off the base model, the model isn't aligned by default to protect against harmful queries. You should consider further tuning it on alignment data like the dataset provided by Anthropic [here](https://huggingface.co/datasets/Anthropic/hh-rlhf)

In [None]:
from datasets import load_dataset, Dataset
from random import randrange
import json
 
# Load dataset from the hub
dolly_dataset = load_dataset("databricks/databricks-dolly-15k", split="train")
sql_dataset = load_dataset("b-mc2/sql-create-context", split="train")

# Load our synthetic dataset from disk
synthetic_data = []
with open('./data/synthetic_data.jsonl', 'r') as f:
    for line in f:
        # Parse each line as a JSON object
        synthetic_data.append(json.loads(line.strip()))

# Pull it into a huggingface Dataset.
synthetic_dataset = Dataset.from_list(synthetic_data)
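
For reference, each line of `synthetic_data.jsonl` is expected to be a JSON object with the `Question`, `Context`, and `Query` fields that the formatting code in the next cell reads. The record below is invented purely to illustrate that shape; the real file's contents will differ.

```python
import json

# A hypothetical record showing the three fields the formatting code reads.
# The values are made up for illustration only.
example_line = json.dumps({
    "Question": "How many orders were placed in 2023?",
    "Context": "CREATE TABLE orders (id INT, placed_at DATE)",
    "Query": "SELECT COUNT(*) FROM orders WHERE placed_at >= '2023-01-01' AND placed_at < '2024-01-01'",
})

# Parse it the same way the loading loop does and check the required keys.
record = json.loads(example_line.strip())
missing = {"Question", "Context", "Query"} - record.keys()
print(sorted(record.keys()), "missing:", sorted(missing))
```

Running a check like this over the first few lines of your file is a cheap way to catch schema mismatches before the `map` step.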

In [None]:
from datasets import concatenate_datasets
from typing import Dict, List, Tuple, Any

SYSTEM_MESSAGE: str = 'You are a helpful assistant'

# Format functions provided by the user, both now returning tuples
def format_dolly(sample: Dict[str, str]) -> Tuple[str, str]:
    instruction: str = f"{sample['instruction']}"
    context: str = f"### Context\n{sample['context']}" if len(sample["context"]) > 0 else ""
    
    # Join the instruction and context together
    user_msg: str = "\n\n".join([i for i in [instruction, context] if i])
    
    return user_msg, sample['response']

def format_sql(sample: Dict[str, str]) -> Tuple[str, str]:
    instruction: str = f"{sample['question']}"
    context: str = f"### Context\n{sample['context']}" if len(sample["context"]) > 0 else ""

    # Join the instruction and context together
    user_msg: str = "\n\n".join([i for i in [instruction, context] if i])

    return user_msg, sample['answer']

def format_synthetic_data(sample: Dict[str, str]) -> Tuple[str, str]:
    instruction: str = f"{sample['Question']}"
    context: str = f"### Context\n{sample['Context']}" if len(sample["Context"]) > 0 else ""

    # Join the instruction and context together
    user_msg: str = "\n\n".join([i for i in [instruction, context] if i])

    return user_msg, sample['Query']

def create_conversation(sample: Dict[str, str], format_func: callable) -> Dict[str, List[Dict[str, str]]]:
    user_msg: str
    ai_msg: str
    user_msg, ai_msg = format_func(sample)

    return {
        "messages": [
            {"role": "system", "content": SYSTEM_MESSAGE},
            {"role": "user", "content": user_msg},
            {"role": "assistant", "content": ai_msg}
        ]
    }
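
To see what these helpers produce, the sketch below restates the same formatting logic in one standalone function and applies it to an invented SQL-style sample. The function name `to_conversation` and the sample values are assumptions made for illustration.

```python
from typing import Dict, List

SYSTEM_MESSAGE = "You are a helpful assistant"

def to_conversation(question: str, context: str, answer: str) -> Dict[str, List[Dict[str, str]]]:
    """Standalone restatement of the formatting logic above, for illustration."""
    context_block = f"### Context\n{context}" if context else ""
    # Join instruction and context, skipping the context block when it is empty.
    user_msg = "\n\n".join(part for part in [question, context_block] if part)
    return {
        "messages": [
            {"role": "system", "content": SYSTEM_MESSAGE},
            {"role": "user", "content": user_msg},
            {"role": "assistant", "content": answer},
        ]
    }

# Invented sample in the shape of a SQL record (question / context / answer).
sample = to_conversation(
    question="List the names of employees hired after 2020.",
    context="CREATE TABLE employees (name VARCHAR, hired_year INTEGER)",
    answer="SELECT name FROM employees WHERE hired_year > 2020",
)
for message in sample["messages"]:
    print(f"[{message['role']}] {message['content']}")
```

Every dataset ends up in this same three-message layout, which is what lets the three sources be concatenated and shuffled together.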

In [None]:
# Apply formatting and create conversations for each dataset
dolly_formatted: Dataset = dolly_dataset.map(
    lambda x: create_conversation(x, format_dolly),
    remove_columns = dolly_dataset.features,batched=False
)
sql_formatted: Dataset = sql_dataset.map(
    lambda x: create_conversation(x, format_sql),
    remove_columns = sql_dataset.features,batched=False
)

synthetic_formatted: Dataset = synthetic_dataset.map(
    lambda x: create_conversation(x, format_synthetic_data),
    remove_columns = synthetic_dataset.features,batched=False
)

# To keep training time down, cap the number of examples per dataset (2,400 in total here).
dolly_size, sql_size, synthetic_size = 1200, 200, 1000

# Balance the datasets
balanced_dolly: Dataset = dolly_formatted.shuffle(seed=42).select(range(dolly_size))
balanced_sql: Dataset = sql_formatted.shuffle(seed=42).select(range(sql_size))
balanced_synthetic: Dataset = synthetic_formatted.shuffle(seed=42).select(range(synthetic_size))

# Combine the balanced datasets
combined_dataset: Dataset = concatenate_datasets([balanced_dolly, balanced_sql, balanced_synthetic])

# Shuffle the combined dataset
dataset: Dataset = combined_dataset.shuffle(seed=42)

# Calculate the number of samples for the test set (10% of total)
test_size: int = int(len(dataset) * 0.1)

# Split to Test/Train
dataset = dataset.train_test_split(test_size=test_size)
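
As a quick sanity check, the arithmetic behind the mix and the split can be sketched as follows, using the sizes from the cell above.

```python
# Sizes taken from the cell above.
dolly_size, sql_size, synthetic_size = 1200, 200, 1000

total = dolly_size + sql_size + synthetic_size
test_size = int(total * 0.1)      # same 10% rule as above
train_size = total - test_size

# Show each source's share of the combined dataset.
for name, size in [("dolly", dolly_size), ("sql", sql_size), ("synthetic", synthetic_size)]:
    print(f"{name:<10} {size:>5}  {size / total:.1%}")
print(f"train={train_size}, test={test_size}")
```

Half of the mix is general instruction data, with the remainder split between the generic SQL dataset and the synthetic examples.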

# Save dataset to disk
Let's save the dataset to the notebook's working directory.

In [None]:
print(dataset["train"][345]["messages"])

# save datasets to disk 
dataset["train"].to_json("train_dataset.json", orient="records")
dataset["test"].to_json("test_dataset.json", orient="records")

# 4. Fine-tune LLM using trl and the SFTTrainer
We are now ready to fine-tune our model. We will use the SFTTrainer from trl, which makes it straightforward to run supervised fine-tuning on open LLMs. The SFTTrainer is a subclass of the Trainer from the transformers library and supports all the same features, including logging, evaluation, and checkpointing, but adds additional quality-of-life features, including:

* Dataset formatting, including conversational and instruction formats
* Training on completions only, ignoring prompts
* Packing datasets for more efficient training
* PEFT (parameter-efficient fine-tuning) support, including QLoRA
* Preparing the model and tokenizer for conversational fine-tuning (e.g. adding special tokens)

We will use the dataset formatting, packing, and PEFT features in our example. As our PEFT method we will use QLoRA, a technique that uses quantization to reduce the memory footprint of large language models during fine-tuning without sacrificing performance. If you want to learn more about QLoRA and how it works, check out the blog post "Making LLMs even more accessible with bitsandbytes, 4-bit quantization and QLoRA".
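
Packing is easiest to picture with a toy example. The sketch below concatenates tokenized examples into one stream and cuts it into fixed-size blocks, so no compute is wasted on padding. It is a simplified illustration under assumed inputs (made-up token ids, a hypothetical `pack_sequences` helper), not the implementation trl uses.

```python
from typing import Iterable, List

def pack_sequences(token_sequences: Iterable[List[int]], block_size: int) -> List[List[int]]:
    """Concatenate tokenized examples and cut the stream into fixed-size blocks.

    Simplified illustration of packing; the trailing partial block is dropped.
    """
    stream: List[int] = []
    for tokens in token_sequences:
        stream.extend(tokens)
    n_blocks = len(stream) // block_size
    return [stream[i * block_size:(i + 1) * block_size] for i in range(n_blocks)]

# Three short "examples" with made-up token ids.
examples = [[1, 2, 3], [4, 5], [6, 7, 8, 9]]
packed = pack_sequences(examples, block_size=4)
print(packed)  # [[1, 2, 3, 4], [5, 6, 7, 8]]
```

Note that a packed block can contain the end of one example and the start of the next, which is why the end-of-sequence tokens added by the chat template matter.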

Now, let's get started! 🚀

First, we need to load our dataset from disk.

In [None]:
from datasets import load_dataset

# Load jsonl data from disk
dataset = load_dataset("json", data_files="train_dataset.json", split="train")

Next, we will load our LLM. For our use case we are going to use Mistral 7B. But we can easily swap out the model for another model, e.g. Llama or Mixtral models, TII Falcon, or any other LLMs by changing our model_id variable. We will use bitsandbytes to quantize our model to 4-bit.
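
To get a feel for why 4-bit quantization matters, here is a rough estimate of the memory needed just to hold the weights at different precisions. The parameter count is an assumed round figure for a "7B" model, and the estimate deliberately ignores quantization constants, activations, LoRA parameters, gradients, and optimizer state.

```python
# Rough weight-memory estimate at different precisions.
# Ignores quantization constants, activations, gradients, and optimizer state.
n_params = 7e9   # assumed round figure for a "7B" model

BYTES_PER_PARAM = {"fp32": 4.0, "bf16": 2.0, "4-bit": 0.5}

weight_gb = {name: n_params * size / 1e9 for name, size in BYTES_PER_PARAM.items()}
for name, gb in weight_gb.items():
    print(f"{name:>5}: ~{gb:.1f} GB")
```

The real footprint during training is higher than these figures, but the ratio between precisions is what makes a single 24GB GPU workable.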

Note: Be aware the bigger the model the more memory it will require. In our example we will use the 7B version, which can be tuned on 24GB GPUs. If you have a smaller GPU.

Correctly preparing the LLM and tokenizer for training chat/conversational models is crucial. We need to add new special tokens to the tokenizer and model and teach the model to understand the different roles in a conversation. In trl we have a convenient method called setup_chat_format, which:

* Adds special tokens to the tokenizer, e.g. <|im_start|> and <|im_end|>, to indicate the start and end of each message in a conversation.
* Resizes the model’s embedding layer to accommodate the new tokens.
* Sets the chat_template of the tokenizer, which is used to format the input data into a chat-like format. The default is chatml from OpenAI.
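
The function below is a rough approximation of what a ChatML-formatted conversation looks like once rendered to text. `render_chatml` is a hypothetical helper written for illustration; in practice the authoritative formatting is whatever `tokenizer.apply_chat_template` produces after `setup_chat_format` has been applied.

```python
from typing import Dict, List

def render_chatml(messages: List[Dict[str, str]], add_generation_prompt: bool = False) -> str:
    """Approximate the ChatML layout: each turn is wrapped in <|im_start|> ... <|im_end|>."""
    text = "".join(
        f"<|im_start|>{m['role']}\n{m['content']}<|im_end|>\n" for m in messages
    )
    if add_generation_prompt:
        # Leave an open assistant turn for the model to complete.
        text += "<|im_start|>assistant\n"
    return text

messages = [
    {"role": "system", "content": "You are a helpful assistant"},
    {"role": "user", "content": "List all customers."},
]
print(render_chatml(messages, add_generation_prompt=True))
```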

In [None]:
import torch
from transformers import AutoTokenizer, AutoModelForCausalLM, BitsAndBytesConfig
from trl import setup_chat_format

# Hugging Face model id
model_id = "mistralai/Mistral-7B-v0.1" # or `meta-llama/Meta-Llama-3.1-8B`

# BitsAndBytesConfig int-4 config
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True, bnb_4bit_use_double_quant=True, bnb_4bit_quant_type="nf4", bnb_4bit_compute_dtype=torch.bfloat16
)

# Load model and tokenizer
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    device_map="auto",
    # attn_implementation="flash_attention_2", # Uncomment this line to use flash attention.
    torch_dtype=torch.bfloat16,
    quantization_config=bnb_config
)
tokenizer = AutoTokenizer.from_pretrained(model_id)
tokenizer.padding_side = 'right' # to prevent warnings

# Set chat template to OAI ChatML; remove if you start from a fine-tuned model
model, tokenizer = setup_chat_format(model, tokenizer)

The SFTTrainer natively integrates with peft, which makes it easy to efficiently tune LLMs using, e.g., QLoRA. We only need to create our LoraConfig and provide it to the trainer. Our LoraConfig parameters are based on the [QLoRA paper](https://arxiv.org/pdf/2305.14314) and [Sebastian Raschka's blog post](https://magazine.sebastianraschka.com/p/practical-tips-for-finetuning-llms).

In [None]:
from peft import LoraConfig

# LoRA config based on QLoRA paper & Sebastian Raschka experiment
peft_config = LoraConfig(
        lora_alpha=128,
        lora_dropout=0.05,
        r=256,
        bias="none",
        target_modules="all-linear",
        task_type="CAUSAL_LM", 
)
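
To put the `r` and `lora_alpha` values in context, the sketch below computes the scaling factor applied to the low-rank update and the number of adapter parameters for a single linear layer. The 4096 x 4096 layer shape is an assumption used for illustration; the actual projection shapes vary across the model.

```python
# Values from the LoraConfig above.
lora_alpha, r = 128, 256

# Assumed shape of one square projection layer; other layers differ.
d_out, d_in = 4096, 4096

scaling = lora_alpha / r                 # factor applied to the BA update
adapter_params = r * (d_in + d_out)      # A is (r, d_in), B is (d_out, r)
frozen_params = d_in * d_out             # size of the frozen weight matrix

print(f"scaling = {scaling}")
print(f"adapter params = {adapter_params:,} "
      f"({adapter_params / frozen_params:.1%} of the frozen layer)")
```

A rank of 256 is on the large side, so the adapter is a noticeable fraction of each layer; lowering `r` shrinks the adapter proportionally at the cost of capacity.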

In [None]:
from transformers import TrainingArguments

args = TrainingArguments(
    output_dir="mistral-7b-text-to-sql",    # directory to save and repository id
    num_train_epochs=3,                     # number of training epochs
    per_device_train_batch_size=1,          # batch size per device during training
    gradient_accumulation_steps=8,          # number of micro-batches accumulated before each optimizer update
    gradient_checkpointing=True,            # use gradient checkpointing to save memory
    optim="adamw_torch_fused",              # use fused adamw optimizer
    logging_steps=10,                       # log every 10 steps
    save_strategy="epoch",                  # save checkpoint every epoch
    learning_rate=2e-4,                     # learning rate, based on QLoRA paper
    bf16=True,                              # use bfloat16 precision
    tf32=True,                              # use tf32 precision
    max_grad_norm=0.3,                      # max gradient norm based on QLoRA paper
    warmup_ratio=0.03,                      # warmup ratio based on QLoRA paper
    lr_scheduler_type="constant",           # use constant learning rate scheduler
    push_to_hub=True,                       # push model to hub
    report_to="tensorboard",                # report metrics to tensorboard
)
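
With a per-device batch size of 1, gradient accumulation is what determines how many sequences contribute to each optimizer update. A minimal sketch of that arithmetic, assuming a single GPU:

```python
# Values from the TrainingArguments above.
per_device_train_batch_size = 1
gradient_accumulation_steps = 8
num_devices = 1   # assumption: training on a single GPU

# Number of sequences that contribute to each optimizer update.
effective_batch_size = (
    per_device_train_batch_size * gradient_accumulation_steps * num_devices
)
print(f"Sequences per optimizer step: {effective_batch_size}")
```

Because packing is enabled later, each of those sequences is a packed block of up to `max_seq_length` tokens rather than a single conversation.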

We now have every building block we need to create our SFTTrainer and start training our model.

In [None]:
from trl import SFTTrainer

max_seq_length = 4096 # max sequence length for model and packing of the dataset

trainer = SFTTrainer(
    model=model,
    args=args,
    train_dataset=dataset,
    peft_config=peft_config,
    max_seq_length=max_seq_length,
    tokenizer=tokenizer,
    packing=True,
    dataset_kwargs={
        "add_special_tokens": False,  # We template with special tokens
        "append_concat_token": False, # No need to add additional separator token
    }
)

In [None]:
# start training, the model will be automatically saved to the hub and the output directory
trainer.train()

# save model 
trainer.save_model()

# Congrats!
Congrats! You just completed your first training job! In the previous sections, we pulled three datasets together to fine-tune a base Mistral 7B model on instruction and SQL generation examples.


### Next Steps
This training job takes about 2 hours to run at 3 epochs. You will have your workshop environment for 72 hours. After this workshop you can go back and deploy this model to an endpoint and test it out. We encourage you to move on to the next lab, where we will pull a model trained the same way, deploy it to an endpoint, and use it for the rest of the workshop.

If you'd like to play with the model you trained, you can leave the training job running and follow the appendix steps below.

# Appendix A) Run Inference

# Merge LoRA adapter into the original model
When using QLoRA, we only train adapters and not the full model. This means when saving the model during training we only save the adapter weights and not the full model. If you want to save the full model, which makes it easier to use with Text Generation Inference you can merge the adapter weights into the model weights using the merge_and_unload method and then save the model with the save_pretrained method. This will save a default model, which can be used for inference.

Note: This requires > 30GB CPU Memory.

In [None]:
#### COMMENT IN TO MERGE PEFT AND BASE MODEL ####
from peft import AutoPeftModelForCausalLM

# Load PEFT model on CPU
model = AutoPeftModelForCausalLM.from_pretrained(
    args.output_dir,
    torch_dtype=torch.float16,
    low_cpu_mem_usage=True,
)  
# Merge LoRA and base model and save
merged_model = model.merge_and_unload()
merged_model.save_pretrained(args.output_dir,safe_serialization=True, max_shard_size="2GB")
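
Conceptually, merging folds the scaled low-rank update into the base weight so that inference needs only one matrix multiply per layer. The toy check below, with invented 2 x 2 values and an assumed scaling of 0.5, illustrates that the merged and unmerged paths give the same output. It is a sketch of the idea, not the peft implementation.

```python
# Toy check that merging a LoRA adapter leaves a layer's output unchanged.
# Shapes and values are invented; scaling plays the role of lora_alpha / r.

def matvec(m, v):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(a * b for a, b in zip(row, v)) for row in m]

W = [[1.0, 2.0], [3.0, 4.0]]    # frozen base weight (2 x 2)
B = [[1.0], [0.0]]              # adapter factor (2 x 1)
A = [[0.5, -1.0]]               # adapter factor (1 x 2)
scaling = 0.5
x = [4.0, 1.0]                  # example input vector

# Unmerged: base output plus the scaled adapter path B(Ax).
base_out = matvec(W, x)
adapter_out = [scaling * v for v in matvec(B, matvec(A, x))]
unmerged = [b + a for b, a in zip(base_out, adapter_out)]

# Merged: fold scaling * BA into the weight once, then do a single multiply.
W_merged = [
    [W[i][j] + scaling * B[i][0] * A[0][j] for j in range(2)]
    for i in range(2)
]
merged = matvec(W_merged, x)

print(unmerged, merged)
```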

# Test Model and Run Inference
After the training is done we want to evaluate and test our model. We will load different samples from the original dataset and evaluate the model on those samples, using a simple loop and accuracy as our metric.

In [None]:
import torch
from transformers import AutoTokenizer, pipeline, AutoModelForCausalLM

model_id = "./mistral-7b-text-to-sql"  # must match the output_dir used during training

# Load Model with PEFT adapter
model = AutoModelForCausalLM.from_pretrained(
  model_id,
  device_map="auto",
  torch_dtype=torch.float16
)
tokenizer = AutoTokenizer.from_pretrained(model_id)
# load into pipeline
pipe = pipeline("text-generation", model=model, tokenizer=tokenizer)

In [None]:
from datasets import load_dataset 
from random import randint


# Load our test dataset
eval_dataset = load_dataset("json", data_files="test_dataset.json", split="train")
rand_idx = randint(0, len(eval_dataset) - 1)  # randint is inclusive on both ends

# Test on sample 
prompt = pipe.tokenizer.apply_chat_template(eval_dataset[rand_idx]["messages"][:2], tokenize=False, add_generation_prompt=True)
outputs = pipe(prompt, max_new_tokens=256, do_sample=False, temperature=0.1, top_k=50, top_p=0.1, eos_token_id=pipe.tokenizer.eos_token_id, pad_token_id=pipe.tokenizer.pad_token_id)

print(f"Query:\n{eval_dataset[rand_idx]['messages'][1]['content']}")
print(f"Original Answer:\n{eval_dataset[rand_idx]['messages'][2]['content']}")
print(f"Generated Answer:\n{outputs[0]['generated_text'][len(prompt):].strip()}")
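
The cell above checks a single random sample. To extend it into the simple accuracy loop described earlier, something like the following sketch could be used. `exact_match_accuracy` and `fake_generate` are hypothetical names; in the notebook, the `generate` callable would wrap `pipe` and return the generated text. Exact match after light normalization is a strict and somewhat crude metric for SQL, so treat the number as a lower bound.

```python
from typing import Callable, Dict, List

def exact_match_accuracy(
    samples: List[Dict[str, List[Dict[str, str]]]],
    generate: Callable[[List[Dict[str, str]]], str],
) -> float:
    """Fraction of samples whose generated answer equals the reference after normalization."""
    def normalize(text: str) -> str:
        # Crude normalization: trim, drop a trailing semicolon, lowercase, collapse whitespace.
        return " ".join(text.strip().rstrip(";").lower().split())

    if not samples:
        return 0.0
    hits = 0
    for sample in samples:
        prompt_messages = sample["messages"][:2]        # system + user turns
        reference = sample["messages"][2]["content"]    # assistant answer
        if normalize(generate(prompt_messages)) == normalize(reference):
            hits += 1
    return hits / len(samples)

# Stand-in generator for demonstration; replace with a call into `pipe`.
def fake_generate(messages: List[Dict[str, str]]) -> str:
    return "select count(*) from t;"

demo_samples = [
    {"messages": [
        {"role": "system", "content": "You are a helpful assistant"},
        {"role": "user", "content": "How many rows are in t?"},
        {"role": "assistant", "content": "SELECT COUNT(*) FROM t"},
    ]},
    {"messages": [
        {"role": "system", "content": "You are a helpful assistant"},
        {"role": "user", "content": "List the names in t."},
        {"role": "assistant", "content": "SELECT name FROM t"},
    ]},
]
accuracy = exact_match_accuracy(demo_samples, fake_generate)
print(f"Accuracy: {accuracy:.2%}")
```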