# Text-to-SQL Fine Tuning Lab
This lab illustrates fine tuning a 7 billion parameter LLM for text-to-SQL use cases. It's often valuable to include general instruction-tuning datapoints in the dataset alongside the text-to-SQL datapoints to make the model a bit more robust. If you include only SQL examples, the model will not generalize as well when your users' inputs drift over time.

## Concepts

### Fine tuning
Fine tuning is the process of taking an already trained model and further training it on specific tasks. In our case, we'll train it to follow instructions (using the Dolly dataset) and to generate SQL (using a SQL dataset).

### LoRA
LoRA is a parameter-efficient fine-tuning technique for large language models (LLMs). It works by introducing trainable low-rank decomposition matrices to the weights of the model. Instead of fine-tuning all parameters of a pre-trained model, LoRA freezes the original model weights and injects trainable rank decomposition matrices into each layer of the model.
The key idea behind LoRA is to represent the weight updates during fine-tuning as the product of two low-rank matrices. Mathematically, if W is the original weight matrix, the LoRA update can be expressed as:

W' = W + BA

Where B and A are low-rank matrices, and their product BA represents the update to the original weights.
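
To make the update rule concrete, here is a minimal pure-Python sketch of W' = W + BA, together with a parameter count. The helper names (`matmul`, `lora_update`, `lora_param_counts`) and the 4096 x 4096 layer size are illustrative assumptions, not values taken from this lab's training script.

```python
def matmul(X, Y):
    """Multiply two matrices given as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*Y)] for row in X]


def lora_update(W, B, A):
    """Return W' = W + BA without modifying the frozen weights W."""
    delta = matmul(B, A)
    return [[w + d for w, d in zip(w_row, d_row)] for w_row, d_row in zip(W, delta)]


def lora_param_counts(d_out, d_in, rank):
    """Trainable parameters: full fine-tuning vs. a LoRA update of the given rank."""
    full = d_out * d_in
    lora = rank * (d_out + d_in)  # B is d_out x rank, A is rank x d_in
    return full, lora


# Tiny worked example: a 3x3 identity weight with a rank-1 update.
W = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]
B = [[1.0], [2.0], [3.0]]  # 3 x 1
A = [[0.5, 0.0, -0.5]]     # 1 x 3
W_prime = lora_update(W, B, A)
print(W_prime)

# Hypothetical 4096 x 4096 projection with rank 8 (illustrative sizes only).
full, lora = lora_param_counts(4096, 4096, 8)
print(f"full: {full:,} params, LoRA: {lora:,} params ({lora / full:.2%})")
```

Only B and A are trained, so the number of trainable parameters grows with the rank rather than with the product of the layer dimensions.
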
LoRA works effectively for several reasons:

* Parameter efficiency: By using low-rank matrices, LoRA dramatically reduces the number of trainable parameters compared to full fine-tuning. This makes it possible to adapt large models on limited hardware.
* Preservation of pre-trained knowledge: Since the original weights are kept frozen, the model retains most of its pre-trained knowledge while learning new tasks.
* Adaptability: The low-rank update allows the model to learn task-specific adaptations without overfitting as easily as full fine-tuning might.
* Computational efficiency: Training and applying LoRA updates is computationally cheaper than full fine-tuning or using adapter layers.
* Theoretical foundation: The effectiveness of LoRA is grounded in the observation that the weight updates during fine-tuning often have a low intrinsic rank, meaning they can be well-approximated by low-rank matrices.
* Composability: Multiple LoRA adaptations can be combined, allowing for interesting multi-task and transfer learning scenarios.

The reason LoRA works so well is that it exploits the low intrinsic dimensionality of the updates needed to adapt a pre-trained model to a new task. By focusing on these key directions of change, LoRA can achieve performance comparable to full fine-tuning with only a fraction of the trainable parameters.
This approach has proven particularly effective for large language models, where the cost and computational requirements of full fine-tuning can be prohibitive.

## Steps
1. Install dependencies & set up SageMaker Session
2. Create and process our dataset
3. Configure our SageMaker training job
4. Run training job

# Takeaways
There are many ways to fine tune a model. This training job will take roughly 6 hours on an ml.g5.2xlarge ($1.515 / hr in us-west-2).

That puts the total cost of the training job at about $9.09 (6 hours at $1.515 per hour). Not bad!
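
The cost figure is simply runtime multiplied by the hourly rate. A small sketch of that arithmetic follows; `estimate_training_cost` is a hypothetical helper, and the inputs are the 6 hours and $1.515 per hour quoted above.

```python
def estimate_training_cost(hours, hourly_rate_usd):
    """Return the estimated on-demand cost in USD, rounded to cents."""
    return round(hours * hourly_rate_usd, 2)


cost = estimate_training_cost(hours=6, hourly_rate_usd=1.515)
print(f"Estimated training cost: ${cost:.2f}")
```

Swap in your own runtime or regional price to re-estimate, for example if you train on a smaller subset of the data.
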

In [None]:
# 1. Install dependencies

# !pip install -r requirements.txt
# %pip install transformers "datasets[s3]==2.18.0" "sagemaker>=2.190.0" "huggingface_hub[cli]" "python-dotenv" --upgrade --quiet

In [None]:
import sagemaker
import boto3
import os
from dotenv import load_dotenv, find_dotenv

local_env_filename = 'dev.env'
load_dotenv(find_dotenv(local_env_filename),override=True)
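# Read the Hugging Face token loaded from the env file (or already set in the environment)
HF_TOKEN = os.getenv('HF_TOKEN')
if not HF_TOKEN:
    raise ValueError(f"HF_TOKEN is not set; add it to {local_env_filename} or export it before running this cell")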

sess = sagemaker.Session()
# sagemaker session bucket -> used for uploading data, models and logs
# sagemaker will automatically create this bucket if it does not exist
sagemaker_session_bucket=None
if sagemaker_session_bucket is None and sess is not None:
    # set to default bucket if a bucket name is not given
    sagemaker_session_bucket = sess.default_bucket()
 
try:
    role = sagemaker.get_execution_role()
except ValueError:
    iam = boto3.client('iam')
    role = iam.get_role(RoleName='sagemaker_execution_role')['Role']['Arn']
 
sess = sagemaker.Session(default_bucket=sagemaker_session_bucket)
 
print(f"sagemaker role arn: {role}")
print(f"sagemaker bucket: {sess.default_bucket()}")
print(f"sagemaker session region: {sess.boto_region_name}")


In [None]:
from datasets import load_dataset, Dataset
from random import randrange
 
# Load dataset from the hub
dolly_dataset = load_dataset("databricks/databricks-dolly-15k", split="train")
sql_dataset = load_dataset("b-mc2/sql-create-context", split="train")

In [None]:
from datasets import concatenate_datasets
from typing import Dict, List, Tuple, Any

SYSTEM_MESSAGE: str = 'You are a helpful assistant'

# Format functions: each returns a (user_message, assistant_message) tuple
def format_dolly(sample: Dict[str, str]) -> Tuple[str, str]:
    instruction: str = f"{sample['instruction']}"
    context: str = f"### Context\n{sample['context']}" if len(sample["context"]) > 0 else ""
    
    # Join the instruction and context together
    user_msg: str = "\n\n".join([i for i in [instruction, context] if i])
    
    return user_msg, sample['response']

def format_sql(sample: Dict[str, str]) -> Tuple[str, str]:
    instruction: str = f"{sample['question']}"
    context: str = f"### Context\n{sample['context']}" if len(sample["context"]) > 0 else ""
    
    # Join the instruction and context together
    user_msg: str = "\n\n".join([i for i in [instruction, context] if i])
    
    return user_msg, sample['answer']

def create_conversation(sample: Dict[str, str], format_func: callable) -> Dict[str, List[Dict[str, str]]]:
    user_msg: str
    ai_msg: str
    user_msg, ai_msg = format_func(sample)
    
    return {
        "messages": [
            {"role": "system", "content": SYSTEM_MESSAGE},
            {"role": "user", "content": user_msg},
            {"role": "assistant", "content": ai_msg}
        ]
    }
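
To see what a single training record looks like after formatting, here is a self-contained sketch that mirrors the functions above on one made-up sample. The question, schema, and query are hypothetical and not drawn from either dataset, and `to_conversation` is an illustrative stand-in for `create_conversation` combined with `format_sql`.

```python
from typing import Dict, List

SYSTEM_MESSAGE = "You are a helpful assistant"


def to_conversation(question: str, context: str, answer: str) -> Dict[str, List[Dict[str, str]]]:
    """Build a system/user/assistant conversation, appending the context only when present."""
    context_block = f"### Context\n{context}" if context else ""
    user_msg = "\n\n".join(part for part in [question, context_block] if part)
    return {
        "messages": [
            {"role": "system", "content": SYSTEM_MESSAGE},
            {"role": "user", "content": user_msg},
            {"role": "assistant", "content": answer},
        ]
    }


# Hypothetical text-to-SQL sample (made up for illustration).
conversation = to_conversation(
    question="How many employees are older than 40?",
    context="CREATE TABLE employees (name VARCHAR, age INTEGER)",
    answer="SELECT COUNT(*) FROM employees WHERE age > 40",
)
for message in conversation["messages"]:
    print(f"[{message['role']}]\n{message['content']}\n")
```

Samples with an empty context simply omit the `### Context` block, so the user message is just the instruction.
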

In [None]:
# Apply formatting and create conversations for each dataset
dolly_formatted: Dataset = dolly_dataset.map(
    lambda x: create_conversation(x, format_dolly),
    remove_columns=dolly_dataset.column_names,  # drop the raw columns, keep only "messages"
    batched=False,
)
sql_formatted: Dataset = sql_dataset.map(
    lambda x: create_conversation(x, format_sql),
    remove_columns=sql_dataset.column_names,  # drop the raw columns, keep only "messages"
    batched=False,
)

# To keep training time down, you can cap each dataset at 600 examples (~1200 total)
# by uncommenting the next line and commenting out the min() line below.
# min_size = 600

# Otherwise, use the size of the smaller dataset so both are equally represented
min_size: int = min(len(dolly_formatted), len(sql_formatted))

# Balance the datasets
balanced_dolly: Dataset = dolly_formatted.shuffle(seed=42).select(range(min_size))
balanced_sql: Dataset = sql_formatted.shuffle(seed=42).select(range(min_size))

# Combine the balanced datasets
combined_dataset: Dataset = concatenate_datasets([balanced_dolly, balanced_sql])

# Shuffle the combined dataset
dataset: Dataset = combined_dataset.shuffle(seed=42)

# Calculate the number of samples for the test set (10% of total)
test_size: int = int(len(dataset) * 0.1)

# Split into train/test sets (seeded so the split is reproducible across runs)
dataset = dataset.train_test_split(test_size=test_size, seed=42)
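
The balancing and splitting logic above can be sketched with plain Python lists, which makes the resulting sizes easy to check by hand. This is only an illustration with toy data; `balance_and_split` is a hypothetical helper, and the real cell uses the `datasets` library instead.

```python
import random


def balance_and_split(a, b, test_fraction=0.1, seed=42):
    """Downsample both lists to the smaller length, mix them, and hold out a test set."""
    rng = random.Random(seed)
    n = min(len(a), len(b))
    combined = rng.sample(a, n) + rng.sample(b, n)  # equal share from each source
    rng.shuffle(combined)
    test_size = int(len(combined) * test_fraction)
    return combined[test_size:], combined[:test_size]


# Toy stand-ins for the two formatted datasets (sizes are made up).
general = [f"general-{i}" for i in range(15)]
sql = [f"sql-{i}" for i in range(80)]

train, test = balance_and_split(general, sql)
print(f"train: {len(train)} examples, test: {len(test)} examples")
```

Because the larger dataset is downsampled to the size of the smaller one, each source contributes exactly half of the combined examples.
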

In [None]:
# Uncomment to save datasets to disk if you'd like
# dataset["train"].to_json("train_dataset.json", orient="records")
# dataset["test"].to_json("test_dataset.json", orient="records")

In [None]:
# save train_dataset to s3 using our SageMaker session
training_input_path = f's3://{sess.default_bucket()}/datasets/text-to-sql'

# save datasets to s3
dataset["train"].to_json(f"{training_input_path}/train_dataset.json", orient="records")
dataset["test"].to_json(f"{training_input_path}/test_dataset.json", orient="records")

In [None]:
!huggingface-cli login --token {HF_TOKEN}

In [None]:
# hyperparameters, which are passed into the training job
hyperparameters = {
  ### SCRIPT PARAMETERS ###
  'dataset_path': '/opt/ml/input/data/training/train_dataset.json', # path where sagemaker will save training dataset
  'model_id': "mistralai/Mistral-7B-v0.1",           # Hugging Face model id of the base model to fine tune
  'max_seq_len': 3072,                               # max sequence length for model and packing of the dataset
  'use_qlora': True,                                 # use QLoRA model
  ### TRAINING PARAMETERS ###
  'num_train_epochs': 3,                             # number of training epochs
  'per_device_train_batch_size': 1,                  # batch size per device during training
  'gradient_accumulation_steps': 4,                  # number of batches whose gradients are accumulated before each optimizer update
  'gradient_checkpointing': True,                    # use gradient checkpointing to save memory
  'optim': "adamw_torch_fused",                      # use fused adamw optimizer
  'logging_steps': 10,                               # log every 10 steps
  'save_strategy': "epoch",                          # save checkpoint every epoch
  'learning_rate': 2e-4,                             # learning rate, based on QLoRA paper
  'bf16': True,                                      # use bfloat16 precision
  'tf32': True,                                      # use tf32 precision
  'max_grad_norm': 0.3,                              # max gradient norm based on QLoRA paper
  'warmup_ratio': 0.03,                              # warmup ratio based on QLoRA paper
  'lr_scheduler_type': "constant",                   # use constant learning rate scheduler
  'report_to': "tensorboard",                        # report metrics to tensorboard
  'output_dir': '/tmp/tun',                          # Temporary output directory for model checkpoints
  'merge_adapters': False,                           # set to True to merge LoRA adapters into the base model for easier deployment
}
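
With `per_device_train_batch_size` of 1 and `gradient_accumulation_steps` of 4 on a single-GPU instance, each optimizer update sees 4 sequences. The sketch below works through that arithmetic; the helper names and the 1000-sequence dataset size are illustrative assumptions, since the real count depends on how the training script packs the dataset.

```python
import math


def effective_batch_size(per_device_batch_size, gradient_accumulation_steps, num_devices=1):
    """Number of sequences that contribute to each optimizer update."""
    return per_device_batch_size * gradient_accumulation_steps * num_devices


def approx_optimizer_steps(num_sequences, effective_batch, epochs):
    """Rough number of optimizer updates over the whole run."""
    return math.ceil(num_sequences / effective_batch) * epochs


batch = effective_batch_size(per_device_batch_size=1, gradient_accumulation_steps=4, num_devices=1)
# Hypothetical number of packed training sequences, chosen only for illustration.
steps = approx_optimizer_steps(num_sequences=1000, effective_batch=batch, epochs=3)
print(f"effective batch size: {batch}, approximate optimizer steps: {steps}")
```

Raising `gradient_accumulation_steps` increases the effective batch size without increasing GPU memory use, at the cost of fewer optimizer updates per epoch.
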

In [None]:
from sagemaker.huggingface import HuggingFace

# define the base name for the training job (SageMaker appends a timestamp)
job_name = 'mistral7b-text-to-sql'

# create the Estimator
huggingface_estimator = HuggingFace(
    entry_point          = 'trl_sft.py',      # train script
    source_dir           = './scripts',       # directory which includes all the files needed for training
    instance_type        = 'ml.g5.2xlarge',   # instance type used for the training job
    instance_count       = 1,                 # the number of instances used for training
    max_run              = 2*24*60*60,        # maximum runtime in seconds (days * hours * minutes * seconds)
    base_job_name        = job_name,          # prefix for the training job name
    role                 = role,              # IAM role used in the training job to access AWS resources, e.g. S3
    volume_size          = 300,               # the size of the EBS volume in GB
    transformers_version = '4.36',            # the transformers version used in the training job
    pytorch_version      = '2.1',             # the pytorch version used in the training job
    py_version           = 'py310',           # the python version used in the training job
    hyperparameters      =  hyperparameters,  # the hyperparameters passed to the training job
    disable_output_compression = True,        # do not compress output, to save training time and cost
    environment          = {
                            "HUGGINGFACE_HUB_CACHE": "/tmp/.cache", # set env variable to cache models in /tmp
                            "HF_TOKEN": HF_TOKEN # huggingface token to access gated models, e.g. llama 2
                            }, 
)

In [None]:
# define a data input dictionary with our uploaded s3 uris
data = {'training': training_input_path}

# starting the train job with our uploaded datasets as input
huggingface_estimator.fit(data, wait=False)

In [None]:
# Print the current job name
print(huggingface_estimator._current_job_name)
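
Because `fit` was called with `wait=False`, the notebook returns immediately while the job keeps running. One way to check on it is to poll the SageMaker `describe_training_job` API. The sketch below assumes a boto3 SageMaker client is passed in; `wait_for_training_job` is a hypothetical helper, not part of the SageMaker SDK.

```python
import time

# Statuses after which the job will not change state again.
TERMINAL_STATUSES = {"Completed", "Failed", "Stopped"}


def wait_for_training_job(sm_client, job_name, poll_seconds=60, sleep=time.sleep):
    """Poll the training job until it reaches a terminal status, then return that status."""
    while True:
        response = sm_client.describe_training_job(TrainingJobName=job_name)
        status = response["TrainingJobStatus"]
        print(f"{job_name}: {status}")
        if status in TERMINAL_STATUSES:
            return status
        sleep(poll_seconds)


# Usage in this notebook (assumes valid AWS credentials and the objects defined above):
# final_status = wait_for_training_job(sess.sagemaker_client, huggingface_estimator._current_job_name)
```

The `sleep` argument is injectable so the loop can be exercised without real waiting; in normal use the default `time.sleep` is fine.
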

# Congrats!
Congrats! You just kicked off your first training job! In the previous sections, we pulled two datasets together to fine tune a base Mistral 7B model on instruction-following and SQL generation examples.


### Next Steps
This training job takes about 6 hours to run at 3 epochs. You will have your workshop environment for 72 hours. After this workshop, you can come back, deploy this model to an endpoint, and test it out. We encourage you to move on to the next lab, where we will pull a LoRA adapter trained using the same script, merge it into the same base model, and use that for the rest of the workshop.

If you'd like to play with the model you trained, you can leave the training job running and follow the appendix steps below.

# Appendix A) Try out the model
Once the training job is completed, you can use this code to create a SageMaker endpoint and test out the model you trained in this notebook. The LoRA adapter you trained in the previous step and the one we pull in the next lab will operate very similarly.

In [None]:
from sagemaker.estimator import Estimator

TRAINING_JOB_NAME = huggingface_estimator._current_job_name

huggingface_estimator = Estimator.attach(TRAINING_JOB_NAME, sagemaker_session=sess)

In [None]:
huggingface_estimator.model_data["S3DataSource"]["S3Uri"].replace("s3://", "https://s3.console.aws.amazon.com/s3/buckets/")

In [None]:
from sagemaker.huggingface import get_huggingface_llm_image_uri
 
# retrieve the llm image uri
llm_image = get_huggingface_llm_image_uri(
  "huggingface",
  version="1.4.0",
  session=sess,
)
 
# print ecr image uri
print(f"llm image uri: {llm_image}")

In [None]:
import json
from sagemaker.huggingface import HuggingFaceModel
 
# s3 path where the training job uploaded the model
# if you deploy the model at a later time (e.g. in a new session), set the s3 path here manually
model_s3_path = huggingface_estimator.model_data["S3DataSource"]["S3Uri"]
 
# sagemaker config
instance_type = "ml.g5.2xlarge"
number_of_gpu = 1
health_check_timeout = 300
 
# Define Model and Endpoint configuration parameter
config = {
  'HF_MODEL_ID': "/opt/ml/model", # path to where sagemaker stores the model
  'SM_NUM_GPUS': json.dumps(number_of_gpu), # Number of GPU used per replica
  'MAX_INPUT_LENGTH': json.dumps(1024), # Max length of input text
  'MAX_TOTAL_TOKENS': json.dumps(2048), # Max length of the generation (including input text)
}
 
# create HuggingFaceModel with the image uri
llm_model = HuggingFaceModel(
  role=role,
  image_uri=llm_image,
  model_data={'S3DataSource':{'S3Uri': model_s3_path,'S3DataType': 'S3Prefix','CompressionType': 'None'}},
  env=config
)
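
The two token limits in `config` bound every request: the prompt may use at most `MAX_INPUT_LENGTH` tokens, and prompt plus generated tokens may not exceed `MAX_TOTAL_TOKENS`. The sketch below checks a request against that budget; `generation_headroom` is a hypothetical helper, and the 512 is the `max_new_tokens` value used in the request later in this appendix.

```python
def generation_headroom(max_input_length, max_total_tokens, max_new_tokens):
    """Return spare tokens left when a maximum-length prompt requests max_new_tokens.

    Raises ValueError if the limits are inconsistent or the request cannot fit.
    """
    if max_input_length >= max_total_tokens:
        raise ValueError("MAX_INPUT_LENGTH must be smaller than MAX_TOTAL_TOKENS")
    available = max_total_tokens - max_input_length
    if max_new_tokens > available:
        raise ValueError(
            f"max_new_tokens={max_new_tokens} exceeds the {available} tokens left after a full-length prompt"
        )
    return available - max_new_tokens


headroom = generation_headroom(max_input_length=1024, max_total_tokens=2048, max_new_tokens=512)
print(f"spare tokens for a maximum-length prompt: {headroom}")
```

If your prompts include long table schemas, raise both limits together so that generation still has room after the prompt.
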

In [None]:
# Deploy model to an endpoint
# https://sagemaker.readthedocs.io/en/stable/api/inference/model.html#sagemaker.model.Model.deploy
llm = llm_model.deploy(
  initial_instance_count=1,
  instance_type=instance_type,
  container_startup_health_check_timeout=health_check_timeout, # seconds (300 s = 5 minutes) to give SageMaker time to download the model
)

In [None]:
from transformers import AutoTokenizer
from sagemaker.s3 import S3Downloader
 
# Load the test dataset from s3
S3Downloader.download(f"{training_input_path}/test_dataset.json", ".")
test_dataset = load_dataset("json", data_files="test_dataset.json",split="train")
random_sample = test_dataset[345]


In [None]:
# We need the tokenizer so that we can use the apply_chat_template() function. This is only on the instruct version of the tokenizer.
# We essentially recreated this function above when formatting our inputs.
tokenizer = AutoTokenizer.from_pretrained("mistralai/Mistral-7B-Instruct-v0.1")

def request(sample):
    prompt = tokenizer.apply_chat_template(sample, tokenize=False, add_generation_prompt=True)
    outputs = llm.predict({
      "inputs": prompt,
      "parameters": {
        "max_new_tokens": 512,
        "do_sample": False,
        "return_full_text": False,
        "stop": ["<|im_end|>"],
      }

    })
    return {"role": "assistant", "content": outputs[0]["generated_text"].strip()}


print(random_sample["messages"])

# We don't need the answer to do inference.
request(random_sample["messages"][:2])
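
The `stop` sequence `<|im_end|>` in the request suggests the fine-tuned model expects ChatML-style prompts. That is an inference from the stop token, not something this notebook confirms, and the Instruct tokenizer loaded above applies Mistral's own `[INST]` format instead, so it is worth checking which template `trl_sft.py` used. Under the ChatML assumption, a prompt could be rendered by hand as sketched below; `render_chatml` is a hypothetical helper.

```python
def render_chatml(messages, add_generation_prompt=True):
    """Render a list of {"role", "content"} dicts as a ChatML prompt string."""
    parts = [f"<|im_start|>{m['role']}\n{m['content']}<|im_end|>\n" for m in messages]
    if add_generation_prompt:
        # Open the assistant turn so the model continues with its answer.
        parts.append("<|im_start|>assistant\n")
    return "".join(parts)


# Made-up messages in the same shape as the formatted dataset records.
example_messages = [
    {"role": "system", "content": "You are a helpful assistant"},
    {"role": "user", "content": "List all tables."},
]
prompt = render_chatml(example_messages)
print(prompt)
```

Printing the prompt produced by `tokenizer.apply_chat_template` next to this one is a quick way to confirm that inference uses the same format as training.
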