# Fine-tuning for Text-to-SQL on Amazon Bedrock
Use of Bedrock Fine-tuning to improve Text-to-SQL accuracy.

---


## Suggested SageMaker Environment
SageMaker Image: sagemaker-distribution-cpu

Kernel: Python 3

Instance Type: ml.m5.large

---

## Contents
1. [Overview of Building Training Data Frame](#step-1-overview-of-building-training-data-frame)
1. [Download the Spider Dataset](#step-2-download-the-spider-data-set)
1. [Install Dependencies](#step-3-install-dependencies)
1. [Training Data Helper Functions](#step-4-training-data-helper-functions)
1. [Build Training Data Frames and Save](#step-5-build-training-data-frames-and-save)
1. [Format Training Data for Amazon Bedrock Fine-Tuning](#step-6-format-training-data-for-amazon-bedrock-fine-tuning)
1. [Configure Constraints for Fine-Tuning](#step-7-configure-constraints-for-fine-tuning)
1. [Generate Files Formatted for Bedrock Fine-Tuning](#step-8-generate-files-formatted-for-bedrock-fine-tuning)
1. [Validate Training File Structure](#step-9-validate-training-file-structure)
1. [Upload Training Data](#step-10-upload-training-data)
1. [Kick Off a Fine-Tuning Job from the SDK](#step-11-kick-off-a-fine-tuning-job-from-the-sdk)
1. [Create Bedrock Provisioned Model](#step-12-create-provisioned-model)
1. [Setup for Benchmarking Model Performance](#step-13-setup-for-benchmarking-model-performance)
1. [Run Off-the-Shelf Models](#step-14-run-off-the-shelf-models)
1. [Run Our Titan Fine-Tuned Model](#step-15-run-on-our-titan-fine-tuned-model)
1. [Set up for Model Performance](#step-16-set-up-for-model-performance)
1. [Evaluate Models](#step-17-evaluate-models)
1. [Analyze Results](#step-18-analyze-results)
1. [Summary](#summary)

---
## Objective
This notebook walks you through extracting schema and key information from your source databases, preparing it as training data, and kicking off a fine-tuning job on Amazon Bedrock.

To get a good understanding of our performance, we'll run our benchmarks against the actual SQL databases provided in the [Spider Dataset](https://github.com/taoyds/spider/tree/master).
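To illustrate what benchmarking against an actual database can look like, here is a minimal sketch that compares a reference query and a generated query by the rows they return. The in-memory table, its rows, and the `same_result` helper are hypothetical stand-ins, and this is not necessarily how the benchmark later in this workshop scores results.

```python
import sqlite3


def same_result(conn, gold_sql, predicted_sql):
    """Return True when both queries yield the same rows (order-insensitive)."""
    try:
        gold_rows = conn.execute(gold_sql).fetchall()
        predicted_rows = conn.execute(predicted_sql).fetchall()
    except sqlite3.Error:
        # A query that fails to execute counts as a miss.
        return False
    return sorted(gold_rows, key=repr) == sorted(predicted_rows, key=repr)


# Hypothetical in-memory database standing in for a Spider SQLite file.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE singer (singer_id INTEGER PRIMARY KEY, name TEXT, country TEXT)"
)
conn.executemany(
    "INSERT INTO singer VALUES (?, ?, ?)",
    [(1, "Ana", "France"), (2, "Ben", "Japan"), (3, "Cleo", "France")],
)

gold = "SELECT name FROM singer WHERE country = 'France'"
predicted = (
    "SELECT s.name FROM singer AS s WHERE s.country = 'France' ORDER BY s.name DESC"
)
print(same_result(conn, gold, predicted))
```

Comparing result sets rather than SQL strings means that two differently written but equivalent queries still count as a match.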


---
## The Approach to the Text-to-SQL Problem
The [SQL-PALM](https://arxiv.org/abs/2010.02840) publication is one of the most influential papers in the natural-language-to-SQL space.
We therefore want to recreate the training set for the [Spider Dataset](https://github.com/taoyds/spider/tree/master) with the same schema, primary key, and foreign key information used in the paper.

### Tools
Amazon Bedrock SDK, Pandas


---

### Step 1: Overview of Building Training Data Frame

Before we can start training, we need to set up our datasets. Therefore, we will bring in the original JSON files found in the data repository and define a few helper functions to assist us with wrangling the JSON files into a coherent data frame.

After that, we concatenate all the training data files into one large training dataset and examine the table information provided by the original Spider dataset. Once the query data frame is built, we construct the schema description from the table information, along with the primary and foreign keys.

The SQL-PALM paper describes tables in a format that differs from the regular DDL description. On average, the DDL format uses more tokens to describe the same table, but it may also appear more frequently in the data that large language models were trained on. For our example, we stick to the SQL-PALM format.

If the DDL format is new to you, see this [brief introduction to DDL](https://www.ibm.com/docs/en/i/7.2?topic=programming-data-definition-language).
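To make the two formats concrete, the sketch below serializes the same table both ways. The `singer` table, the `concert_db` name, and the type mapping are hypothetical and chosen only for illustration. The compact layout mirrors the schema string built by the helper functions in this notebook.

```python
# Hypothetical toy table, not taken from Spider.
columns = [("singer_id", "number"), ("name", "text"), ("country", "text")]

# Compact serialization in the style used throughout this notebook.
compact = (
    "[Schema (values) (types)]: | concert_db |  singer : "
    + " , ".join(f"{name} ({col_type})" for name, col_type in columns)
    + ";"
)

# The same table described as a DDL statement (type mapping is an assumption).
sql_types = {"number": "INTEGER", "text": "TEXT"}
ddl = (
    "CREATE TABLE singer (\n"
    + ",\n".join(f"    {name} {sql_types[col_type]}" for name, col_type in columns)
    + "\n);"
)

print(compact)
print(ddl)
```

The compact form keeps the whole database on a single line, while DDL repeats the `CREATE TABLE` scaffolding for every table.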


### Step 2: Download the Spider Data Set
1. Visit [this link](https://drive.google.com/u/0/uc?id=1iRDVHLr4mX2wQKSgA9J8Pire73Jahh0m&export=download) and click 'Download anyway'
2. Once the download to your computer is complete, upload to the JupyterLab file explorer on the left. This can take a few minutes and should say `Uploading...` below in the status bar of JupyterLab.
3. [Open a terminal session](https://jupyterlab.readthedocs.io/en/stable/user/terminal.html), navigate to your uploaded file, and use `unzip` to unzip the file. For example: `sagemaker-user@default:~/module_3$ unzip spider.zip`
4. Verify the `spider` folder has been extracted, and update the file path below to point to its location. This path is used throughout this notebook.

In [None]:
spider_folder = '/home/sagemaker-user/text-to-sql-bedrock-workshop/module_3/spider'

### Step 3: Install dependencies

In [None]:
import pandas as pd
import boto3
import json
from random import randint
import logging
from botocore.exceptions import ClientError
from botocore.config import Config
import os
import datetime
from tqdm import tqdm
from time import sleep
import pickle
from concurrent.futures import ThreadPoolExecutor
import random
import sqlite3

# print the full string, no truncation
pd.set_option("display.max_colwidth", None)

# set s3 bucket:
S3_BUCKET_NAME = "<AthenaResultsS3Location>" # Can be found in CloudFormation outputs
FINE_TUNING_JOB_ROLE_ARN = "<BedrockFineTuningJobRole>" # can be found in the cloudformation outputs under BedrockFineTuningJobRole

### Step 4: Training Data Helper Functions
These helper functions will assist with reading the .json files and constructing the query data sets appropriately for fine-tuning our Titan model.

In [None]:
def read_json_file(file_name):
    """
    Reads a JSON file and returns its contents as a dictionary.

    Args:
    file_name (str): The name of the JSON file to read.

    Returns:
    dict: The contents of the JSON file as a dictionary.
    """
    with open(file_name) as f:
        data = json.load(f)
    return data


def construct_queries(queries):
    """
    Reads a JSON file containing queries and returns a pandas DataFrame with the db_id, query, and question.

    Args:
    queries (str): The path to the JSON file containing the queries.

    Returns:
    pandas.DataFrame: A DataFrame containing the db_id, query, and question.
    """
    queries = read_json_file(queries)

    query_df = pd.DataFrame(columns=["db_id", "query", "question"])
    for idx, _ in enumerate(queries):
        db_id = queries[idx]["db_id"]
        query = queries[idx]["query"]
        question = queries[idx]["question"]

        query_df.loc[idx] = [db_id, query, question]

    return query_df



Now we'll combine the two training data sets offered by Spider into a single dataframe.

In [None]:
query_train_spider = construct_queries(
    f"{spider_folder}/train_spider.json"
)
query_train_other = construct_queries(
    f"{spider_folder}/train_others.json"
)

# Concatenate DataFrames
query_train = pd.concat([query_train_spider, query_train_other], ignore_index=True)

# Add index column to query_train
query_train.insert(0, "index", range(len(query_train)))

query_dev = construct_queries(f"{spider_folder}/dev.json")

More helper functions for preparing our training data:

In [None]:
def construct_schema(table):
    """
    Constructs a string representation of the schema for a given table.

    Args:
    table (dict): A dictionary containing information about a table in the Spider dataset.

    Returns:
    str: A string representation of the schema for the given table.
    """
    table_names_original = table["table_names_original"]
    no_tables = len(table_names_original)

    Schema = f"[Schema (values) (types)]: | {table['db_id']} | "
    for i in range(no_tables):
        Schema += f" {table_names_original[i]} : "

        # column_types is aligned with column_names_original (a database-wide list),
        # so pair each column with its type through the global column index.
        tableCols = [
            (col[1], table["column_types"][col_idx])
            for col_idx, col in enumerate(table["column_names_original"])
            if col[0] == i
        ]
        for j, (col_name, col_type) in enumerate(tableCols):
            if j != len(tableCols) - 1:
                Schema += f"{col_name.lower()} ({col_type}) , "
            elif i != no_tables - 1:
                Schema += f"{col_name.lower()} ({col_type}) |"
            else:
                Schema += f"{col_name.lower()} ({col_type});"

    return Schema

In [None]:
def construct_primary_keys(table):
    """
    Constructs a string representation of the primary keys for a given table.

    Args:
    table (dict): A dictionary containing information about a table in the Spider dataset.

    Returns:
    str: A string representation of the primary keys for the given table.
    """
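    primary_keys = "[Primary Keys]: "
    for idx, key in enumerate(table["primary_keys"]):
        # Look up the table that owns the key column instead of relying on list position,
        # which breaks as soon as one table has no primary key.
        table_idx = table["column_names_original"][key][0]
        table_name = table["table_names_original"][table_idx].lower()
        primary_key = table["column_names_original"][key][1].lower()
        if idx != len(table["primary_keys"]) - 1:
            primary_keys += f"{table_name} : {primary_key}, "
        else:
            primary_keys += f"{table_name} : {primary_key}"
    return primary_keys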

In [None]:
def construct_foreign_keys(table):
    """
    Constructs a string representation of the foreign keys for a given table.

    Args:
    table (dict): A dictionary containing information about a table in the Spider dataset.

    Returns:
    str: A string representation of the foreign keys for the given table.
    """
    foreign_keys = "[Foreign Keys]: "
    for i in range(len(table["foreign_keys"])):
        fk1 = table["foreign_keys"][i][0]
        fk2 = table["foreign_keys"][i][1]

        fk1_name = table["column_names_original"][fk1][1].lower()
        fk1_table_idx = table["column_names_original"][fk1][0]
        fk1_table = table["table_names_original"][fk1_table_idx].lower()

        fk2_name = table["column_names_original"][fk2][1].lower()
        fk2_table_idx = table["column_names_original"][fk2][0]
        fk2_table = table["table_names_original"][fk2_table_idx].lower()

        if i != len(table["foreign_keys"]) - 1:
            foreign_keys += f"{fk1_table} : {fk1_name} = {fk2_table} : {fk2_name} | "
        else:
            foreign_keys += f"{fk1_table} : {fk1_name} = {fk2_table} : {fk2_name}"

    return foreign_keys

In [None]:
def construct_table_df(tables_path):
    """
    Constructs a pandas dataframe containing information about tables in the Spider dataset.

    Args:
    tables_path (str): The path to the tables.json file in the Spider dataset.

    Returns:
    pandas.DataFrame: A dataframe containing information about tables in the Spider dataset.
    """
    tables = read_json_file(tables_path)
    table_df = pd.DataFrame(columns=["db_id", "schema", "primary_keys", "foreign_keys"])
    for idx, table in enumerate(tables):
        db_id = table["db_id"]
        schema = construct_schema(table)
        primary_keys = construct_primary_keys(table)
        foreign_keys = construct_foreign_keys(table)
        table_df.loc[idx] = [db_id, schema, primary_keys, foreign_keys]
    return table_df



### Step 5: Build training data frames and save
Here we build our training data frames, verify that no rows were lost in the merge, and save them as `.csv` files.

In [None]:

table_df = construct_table_df(f"{spider_folder}/tables.json")
train_df = pd.merge(query_train, table_df, on="db_id", how="inner")
dev_df = pd.merge(query_dev, table_df, on="db_id", how="inner")

assert len(train_df) == len(query_train_other) + len(query_train_spider)
assert len(dev_df) == len(query_dev)

train_df.to_csv("train_master.csv", index=False)
dev_df.to_csv("dev_master.csv", index=False)

In [None]:
dev_df.head(1)

### Step 6: Format Training Data for Amazon Bedrock Fine-Tuning
Before we can use the data that we have put in our data frame, we first need to wrangle it to fit a supervised fine-tuning approach.

This means that we need to construct an instruction set with context, as well as the desired output. Amazon Bedrock fine-tuning will then run supervised fine-tuning on our data with the selected model.

Amazon Bedrock expects our data in the `.jsonl` format.
Please be aware that the keys required in the `.jsonl` data may differ from model to model.
For Amazon Titan, each JSON line entry needs the keys `input` and `output`. The `input` is the instruction that describes the task, together with any context you want to give to the model. Examples can also be placed here, which is known as few-shot learning.

When we fine-tune large language models on instruction datasets,
a clear instruction can help the training loss decrease more quickly than training without instructions.
Furthermore, you can use "few-shot examples", which means giving the model an example of your desired output; this can help reduce the training loss even more quickly.

However, you want to vary your examples throughout your training dataset. Otherwise, you might overfit to a single example, which could hurt your inference performance.

The second part of every `.jsonl` entry, the `output`, is the desired output from the model. Many models respond well to the prompting technique that we call "putting words in their mouths": if you know that you are always going to be running a `SELECT` statement, you could prime the model by adding `SELECT` at the end of the instruction. However, for our training data, we're not going to do that. We simply want the SQL statement to be the only output from the model.
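As a rough illustration of the record layout described above, the sketch below serializes two shortened, hypothetical records into JSON Lines text and reads them back. Real records built in this notebook also carry the schema and key information in the `input`.

```python
import json

# Hypothetical, shortened records for illustration only.
records = [
    {
        "input": "Answer the following question with a SQL Statement:How many singers are there?\n[SQL]:\n",
        "output": "SELECT count(*) FROM singer",
    },
    {
        "input": "Answer the following question with a SQL Statement:List all singer names.\n[SQL]:\n",
        "output": "SELECT name FROM singer",
    },
]

# One JSON object per line. json.dumps escapes the embedded newlines,
# so every record stays on a single physical line.
jsonl_text = "\n".join(json.dumps(record) for record in records)
print(jsonl_text)

# Reading the text back line by line recovers the original records.
parsed = [json.loads(line) for line in jsonl_text.splitlines()]
```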


In [None]:
def template_dataset_titan(
    sample: dict, return_jsonl: bool = True, SQL_flavour="SQLite"
):
    """
    Builds a Titan fine-tuning record from a sample.

    Args:
        sample (dict): A dictionary (or DataFrame row) with the keys question, schema, primary_keys, foreign_keys, and query.
        return_jsonl (bool, optional): If True, return a JSON Lines string; otherwise return a dict. Defaults to True.
        SQL_flavour (str, optional): The SQL dialect named in the prompt. Defaults to "SQLite".

    Returns:
        str or dict: The record with the keys "input" and "output".
    """
    SystemPrompt = f"Your task is converting text into SQL statements. We will first give the dataset schema and column types, primary keys and foreign keys and then ask a question in text. You are asked to generate a SQL statement in {SQL_flavour}.\n "
    # Schema, primary keys, and foreign keys each go on their own line
    prompt_template = (
        SystemPrompt
        + "{Schema}\n{primary_keys}\n{foreign_keys}\nAnswer the following question with a SQL Statement:{Question}\n[SQL]:\n"
    )

    question = sample["question"]
    schema = sample["schema"]
    primary_keys = sample["primary_keys"]
    foreign_keys = sample["foreign_keys"]
    sql_query = sample["query"]

    input_text = prompt_template.format(
        Question=question,
        Schema=schema,
        primary_keys=primary_keys,
        foreign_keys=foreign_keys,
    )
    output_text = sql_query
    if return_jsonl:
        json_line = json.dumps({"input": input_text, "output": output_text})
        return json_line
    else:
        return {"input": input_text, "output": output_text}

Let's check that everything is as we expect by looking at an example.


In [None]:
template_dataset_titan(train_df.iloc[0].to_dict(), return_jsonl=False)

### Step 7: Configure Constraints for Fine-Tuning

When fine-tuning Amazon Titan models on Amazon Bedrock, we must adhere to several constraints.
First, the input and the output are each limited to 12,288 characters, and to 24,576 characters combined.

We can supply at most 10,000 training records and at most 1,000 validation records.

Therefore, we generate our training datasets according to those constraints.

We do so with the help of a few functions defined in the following cells.
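The helper defined later in this step keeps the first rows of a data frame until the record cap is reached. If a random subset is preferred instead, for example so that the validation file is not dominated by the databases that happen to come first, one possible alternative is sketched below. The `sample_indices` function, the seed, and the record count of 1,200 are hypothetical and are not part of the original workflow.

```python
import random


def sample_indices(total_records, max_records, seed=0):
    """Pick at most max_records row positions at random, in ascending order."""
    if total_records <= max_records:
        # Nothing to cut: keep every row.
        return list(range(total_records))
    rng = random.Random(seed)  # fixed seed keeps the selection reproducible
    return sorted(rng.sample(range(total_records), max_records))


# Hypothetical data set with 1,200 rows and a cap of 1,000 records.
picked = sample_indices(total_records=1200, max_records=1000)
print(len(picked))
```

The resulting positions could then be passed to `DataFrame.iloc` before generating the `.jsonl` file.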


In [None]:
# constants
constrains = {
    "maxInputChars": 12288,
    "maxOutputChars": 12288,
    "maxTotalChars": 24576,
    "TrainingRecords": 10000,
    "ValidationRecords": 1000,
}

In [None]:
def generate_jsonl_file(
    dataset: pd.DataFrame,
    savepath: str,
    enforce_titan_constrains: bool = True,
    constrains: dict = None,
    train: bool = True,
) -> list:
    """
    Generates a JSON Lines file from a given dataset.

    Args:
        dataset (pd.DataFrame): A pandas DataFrame containing the dataset.
        savepath (str): The path to save the generated JSON Lines file.
        enforce_titan_constrains (bool, optional): Whether to enforce Titan constraints. Defaults to True.
        constrains (dict, optional): A dictionary containing the constraints to enforce. Defaults to None.
        train (bool, optional): Whether the dataset is for training. Defaults to True.

    Returns:
        list: A list of indices of dropped records.
    """
    json_output = []
    droppedIdx = []

    if not enforce_titan_constrains:
        # No limits enforced: write every record as-is
        for idx in range(dataset.shape[0]):
            sample = dataset.iloc[idx]
            json_output.append(template_dataset_titan(sample))
    else:
        if constrains is None:
            raise ValueError(
                "constrains must be provided when enforce_titan_constrains is True"
            )
        maxRecords = (
            constrains["TrainingRecords"] if train else constrains["ValidationRecords"]
        )
        maxTotalChars = constrains["maxTotalChars"]
        maxOutputChars = constrains["maxOutputChars"]
        maxInputChars = constrains["maxInputChars"]

        # Only the first maxRecords rows are considered
        for idx in range(min(maxRecords, dataset.shape[0])):
            sample = dataset.iloc[idx]
            json_line = template_dataset_titan(sample, return_jsonl=False)
            if (
                len(json_line["input"]) <= maxInputChars
                and len(json_line["output"]) <= maxOutputChars
                and len(json_line["input"]) + len(json_line["output"]) <= maxTotalChars
            ):
                json_output.append(json.dumps(json_line))
            else:
                print(f"Sample at index {idx} dropped due to constraints")
                droppedIdx.append(idx)
        print(f"Total records dropped: {len(droppedIdx)}")

    # Write one JSON object per line, without a trailing newline
    with open(savepath, "w") as f:
        f.write("\n".join(json_output))

    return droppedIdx

### Step 8: Generate Files Formatted for Bedrock Fine-Tuning
Now, with the helper functions described above, we'll generate the necessary `.jsonl` files to use in our job.

In [None]:
# create directory for output data sets (-p avoids an error if it already exists)
!mkdir -p output_datasets
output_datasets_dir = "./output_datasets"

# set output paths for training files
train_path_local = (
    f"{output_datasets_dir}/train_titan.jsonl"
)
validation_path_local = (
    f"{output_datasets_dir}/eval_titan.jsonl"
)
benchmark_path_local = (
    f"{output_datasets_dir}/validation_titan.jsonl"
)

# generate training files in their respective locations
dropped_records_train = generate_jsonl_file(
    train_df,
    train_path_local,
    constrains=constrains,
    train=True,
    enforce_titan_constrains=True,
)

dropped_records_dev = generate_jsonl_file(
    dev_df,
    validation_path_local,
    constrains=constrains,
    train=False,
    enforce_titan_constrains=True,
)

# benchmark file: all dev records, no constraints enforced
dropped_records_benchmark = generate_jsonl_file(
    dev_df,
    benchmark_path_local,
    constrains=constrains,
    train=False,
    enforce_titan_constrains=False,
)

Save the data frames in Feather format for later use.

In [None]:
# Intermediary save to feather format
dev_df.to_feather(
    f"{output_datasets_dir}/dev_df.feather"
)
train_df.to_feather(
    f"{output_datasets_dir}/train_df.feather"
)

### Step 9: Validate training file structure
We'll do this by creating a `test_jsonl_file` helper function that verifies a file is ready for use in Bedrock fine-tuning.

In [None]:
def test_jsonl_file(savepath):
    """
    Tests that a JSON Lines file adheres to the JSONL standard and only contains the "input" and "output" fields.

    Args:
        savepath (str): The path to the JSON Lines file.

    Returns:
        None
    """
    # Read all lines up front so the line count is known while iterating
    with open(savepath, "r") as f:
        lines = f.readlines()

    for idx, line in enumerate(lines):
        json_line = json.loads(line)

        # Check that the JSON line only contains the "input" and "output" fields
        assert set(json_line.keys()) == {"input", "output"}

        # Check that the "input" and "output" fields are not empty
        assert json_line["input"] != ""
        assert json_line["output"] != ""

        # Check that every line except the last ends with a newline character
        if idx != len(lines) - 1:
            assert line.endswith("\n")

    # Check that the last line of the file does not have a newline character
    with open(savepath, "rb") as f:
        f.seek(-1, os.SEEK_END)
        assert f.read() != b"\n"
    print(
        f'JSON Lines file {savepath} adheres to the JSONL standard and only contains the "input" and "output" fields.'
    )

Let's validate that our training files are ready for use.

In [None]:
# Test the train dataset
test_jsonl_file(train_path_local)

# Test the validation dataset
test_jsonl_file(validation_path_local)

If you see `JSON Lines file ./output_datasets/train_titan.jsonl adheres to the JSONL standard and only contains the "input" and "output" fields.` then you can proceed.

### Step 10: Upload training data
To use our data for the fine-tuning job on Amazon Bedrock, we need to upload the data sets to an S3 bucket in the same Region in which the fine-tuning job will run.

The following function uploads the data to S3 for you and returns the S3 URI associated with the uploaded file.


In [None]:
def get_s3_uri(bucket_name: str, local_path: str, s3_path: str) -> str:
    # Create an S3 client
    s3 = boto3.client("s3")

    # Upload the file to S3
    s3.upload_file(local_path, bucket_name, s3_path)

    # Get the S3 URI for the uploaded file
    s3_uri = f"s3://{bucket_name}/{s3_path}"

    return s3_uri

Initialize our Bedrock boto3 client and test it to make sure we can successfully call the API.

If the cell returns `200`, you are good to proceed.

In [None]:
session = boto3.session.Session()
bedrock = session.client("bedrock")

# Test the connection
bedrock.list_foundation_models()['ResponseMetadata']['HTTPStatusCode']

In [None]:
bucket_name = S3_BUCKET_NAME

trainingDataS3URI = get_s3_uri(
    bucket_name=bucket_name,
    local_path=train_path_local,
    s3_path="data/train_FINAL_20240122.jsonl"
)
validationDataS3URI = get_s3_uri(
    bucket_name=bucket_name,
    local_path=validation_path_local,
    s3_path="data/dev_FINAL_20240122.jsonl"
)
print(trainingDataS3URI)
print(validationDataS3URI)

### Step 11: Kick off a fine-tuning job from the SDK

The Boto3 SDK allows us to kick off a fine-tuning job on Amazon Bedrock without any manual interaction with the console.
We only need to specify the base model, the learning rate, the number of epochs, and the other hyperparameters shown below. You can also follow [our guidelines](https://docs.aws.amazon.com/bedrock/latest/userguide/model-customization-guidelines.html#model-customization-guidelines-titan-text-express) for fine-tuning Titan Express to fit your use case.

For a quick iteration, you might want to set the number of epochs to 1 by setting `epochs = "1"`.
Otherwise, ***as this is currently configured this training job will take up to 12 hours***. If you also decide to change the batch size, update your learning rate as well, using the following expression: `newLearningRate = oldLearningRate x newBatchSize / oldBatchSize` per [our documentation for fine-tuning Titan](https://docs.aws.amazon.com/bedrock/latest/userguide/model-customization-guidelines.html#model-customization-guidelines-titan-text-express).
Please inspect all of the parameters set in the `create_model_customization_job` call below.
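The scaling expression quoted above depends only on the old and new batch sizes. As a worked example, the sketch below starts from this notebook's settings (learning rate `1E-6`, batch size 1) and applies a hypothetical batch size of 4. Whether that batch size is available for your base model is an assumption to check against the guidelines.

```python
def scale_learning_rate(old_learning_rate, old_batch_size, new_batch_size):
    """Apply newLearningRate = oldLearningRate x newBatchSize / oldBatchSize."""
    return old_learning_rate * new_batch_size / old_batch_size


# Notebook settings: learning rate 1E-6 at batch size 1.
# Hypothetical change: batch size 4.
new_rate = scale_learning_rate(1e-6, old_batch_size=1, new_batch_size=4)

# Bedrock hyperparameters are passed as strings, so format the value accordingly.
print(f"{new_rate:.0E}")
```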

Notably, you can supply a training and validation data config, which points to your training and validation data sets in S3. You will get an output of loss and perplexity scores at the end of every epoch of your training. These metrics will be stored in a CSV file on S3 for you in the output location that you can follow from the UI.


In [None]:
role_arn = FINE_TUNING_JOB_ROLE_ARN
base_model_identifier = (
    f"arn:aws:bedrock:{bedrock.meta.region_name}::foundation-model/amazon.titan-text-express-v1:0:8k"
)
learning_rate = "1E-6"
epochs = "10"

## for faster fine-tuning, but poorer results, do the following:
# learning_rate = "1E-6" 
# epochs = "1" 

current_time = datetime.datetime.now().strftime("%Y-%m-%d-%H-%M-%S")

custom_model_name = f"ft-spider-{learning_rate}-{epochs}-{current_time}"
jobName = f"job-{custom_model_name}"

roleArn = (
    role_arn
)
outputDataS3URI = f"s3://{bucket_name}/titan-output/" + jobName


response = bedrock.create_model_customization_job(
    jobName=jobName,
    customModelName=custom_model_name,
    roleArn=roleArn,
    baseModelIdentifier=base_model_identifier,
    trainingDataConfig={"s3Uri": trainingDataS3URI},
    validationDataConfig={
        "validators": [{"s3Uri": validationDataS3URI}]
    },
    outputDataConfig={"s3Uri": outputDataS3URI},
    hyperParameters={
        "epochCount": epochs,
        "batchSize": "1",
        "learningRate": learning_rate,
    }
)
print(response)

To find out whether our model training job is `InProgress`, `Completed`, `Failed`, or `Stopped`, we can use the Boto3 SDK once again to inspect the current model customization job.

If you train multiple models, call `list_model_customization_jobs` instead and filter for the jobs that you would like to inspect.

Once your training job has reached `Completed`, we are ready to deploy the model.
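The cell below checks the status once. If you would rather block until the job finishes, a polling loop along the following lines is one option. The `wait_for_job` helper and its parameters are hypothetical, and the example runs against a stand-in status source so that it works without an AWS connection.

```python
import time

TERMINAL_STATES = {"Completed", "Failed", "Stopped"}


def wait_for_job(get_status, poll_seconds=60, max_polls=1000, sleep_fn=time.sleep):
    """Call get_status() until it returns a terminal state or max_polls is reached."""
    status = None
    for _ in range(max_polls):
        status = get_status()
        if status in TERMINAL_STATES:
            break
        sleep_fn(poll_seconds)
    return status


# Stand-in status source. In this notebook, get_status could be:
#   lambda: bedrock.get_model_customization_job(jobIdentifier=jobName)["status"]
fake_statuses = iter(["InProgress", "InProgress", "Completed"])
final = wait_for_job(
    lambda: next(fake_statuses), poll_seconds=0, sleep_fn=lambda seconds: None
)
print(final)
```

Passing the status lookup in as a function keeps the loop easy to test and reusable for the provisioned model status check later on.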


In [None]:
# check if tuning job has finished
status = bedrock.get_model_customization_job(jobIdentifier=str(jobName))["status"]

# True once the job has reached a terminal state
jobStatusFinished = status in ("Completed", "Failed", "Stopped")

status

### Step 12: Create provisioned model
In order to support the throughput necessary for running our benchmarks in the next notebook, we need to create a provisioned model. For more information on how Provisioned Models work in Bedrock, [check out our documentation](https://docs.aws.amazon.com/bedrock/latest/userguide/prov-throughput.html) and also note there are [**costs associated with using these**](https://console.aws.amazon.com/bedrock/home#/providers) that vary by model provider.

In [None]:
provisionedModelName = f"pvs-{custom_model_name}"

provisioned_model = bedrock.create_provisioned_model_throughput(
    modelUnits=1,
    provisionedModelName=provisionedModelName,
    modelId=custom_model_name,
)
provisioned_model_arn = provisioned_model['provisionedModelArn']

Let's make sure our provisioned model is ready before moving on. Once the Provisioned model is in `InService` status, move on to the next cell.

In [None]:
pvs_status = bedrock.get_provisioned_model_throughput(
    provisionedModelId=provisioned_model_arn
)['status']

if pvs_status in ("InService", "Failed"):
    ProvisionedModelStatus = True
else:
    ProvisionedModelStatus = False

pvs_status

Let's test our provisioned throughput model to ensure it's working as expected.

In [None]:
boto3_config = Config(read_timeout=1000)
bedrock_runtime = session.client(
    service_name="bedrock-runtime",
    config=boto3_config
)

response = bedrock_runtime.invoke_model(
    modelId=provisioned_model_arn,
    contentType="application/json",
    accept="application/json",
    body=json.dumps(
        {
            "inputText": "SELECT * FROM table1 WHERE table1.id = 1",
        }
    ),
)

response_body = json.loads(response.get("body").read())
output = response_body.get("results")[0].get("outputText")
print(f"model output: {output}")

### Step 13: Setup for Benchmarking Model Performance

In this section, we benchmark our fine-tuned model against the models that are available out of the box on Amazon Bedrock. You will learn how to invoke both the custom model that you fine-tuned and the non-fine-tuned models.

We put a specific focus on multi-threading our invocations to Amazon Bedrock endpoints. One benefit of having your own Provisioned Throughput is that you can expect high throughput on your model endpoints, and harnessing it requires multi-processing or multi-threading. However, when running a benchmark this way, you need to pay attention to execution order so that the `golden-queries` stay aligned with the model-generated queries.
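One way to keep results aligned with their prompts is `ThreadPoolExecutor.map`, which returns results in input order even when workers finish out of order. The sketch below uses a hypothetical `fake_model_call` in place of a real Bedrock invocation. The worker count of 8 is an arbitrary choice.

```python
from concurrent.futures import ThreadPoolExecutor
import random
import time


def fake_model_call(prompt):
    """Stand-in for a Bedrock invocation; sleeps briefly to scramble completion order."""
    time.sleep(random.uniform(0, 0.01))
    return f"SELECT /* {prompt} */ 1"


prompts = [f"question {i}" for i in range(20)]

# executor.map yields results in the order of the inputs, regardless of
# which worker finishes first, so predictions[i] belongs to prompts[i].
with ThreadPoolExecutor(max_workers=8) as executor:
    predictions = list(executor.map(fake_model_call, prompts))

print(predictions[:3])
```

By contrast, collecting futures with `as_completed` returns them in completion order, so each result would need to carry its own index.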


Before we go any further, let's ensure our results directory is ready to hold our results.

In [None]:
# results directory
results_directory = './results'

# Ensure the directory exists
os.makedirs(results_directory, exist_ok=True)

Here we build some utility functions to assist with calling Bedrock.


In [None]:
def call_bedrock(modelId, prompt_data, bedrock_runtime_client, custom_base_model_id=None, system_prompt=""):
    """
    Wraps bedrock.invoke_model to handle input and output parsing.
    This method leverages the word-in-mouth technique by starting assistant response with SELECT
    :modelId                : the model ID to invoke. Can be custom model id.
    :prompt_data            : the text to pass to the model as a prompt. For Anthropic models
                            this is used for user message in bedrock messages API
    :bedrock_runtime_client : a boto3 bedrock runtime client to use
    :custom_base_model_id   : (Optional) custom model flag.
                            model id of base model used for customized model.
                            indicates use of customized model.
    :system_prompt          : (Optional) Used for anthropic models bedrock messages API
    """
    requested_models = f"{modelId}, {custom_base_model_id}"
    response = None
    if "amazon." in requested_models:
        body = json.dumps(
            {
                "inputText": f"{prompt_data} SELECT",
                "textGenerationConfig": {
                    "maxTokenCount": 4096,
                    "stopSequences": [],
                    "temperature": 0,
                    "topP": 0.9,
                },
            }
        )
    elif "anthropic." in requested_models:
        user_message =  {"role": "user", "content": prompt_data}
        asst_message = {"role": "assistant", "content": "SELECT"}
        messages = [user_message, asst_message]
        body=json.dumps(
            {
                "anthropic_version": "bedrock-2023-05-31",
                "max_tokens": 4096,
                "system": system_prompt,
                "messages": messages,
                "temperature": 0.0,
                "stop_sequences": [';']
            }
        )

    else:
        print("Parameter model must be one of providers amazon or anthropic")
        return response

    model_response = bedrock_runtime_client.invoke_model(
        body=body,
        modelId=modelId,
        accept="application/json",
        contentType="application/json"
    )

    response_dict = json.loads(model_response.get("body").read().decode("utf-8"))

    if "amazon." in requested_models:
        response = "SELECT" + response_dict["results"][0]["outputText"]
    elif "anthropic." in requested_models:
        response = "SELECT" + response_dict["content"][0]["text"]

    return response

In [None]:
def process_with_retries(model_id, input_data, bedrock_runtime_client, custom_base_model_id=None, max_retries=5, initial_sleep=60):
    """
    calls our bedrock utility function with exponential backoff
    :modelId                : the model ID to invoke. Can be custom model id.
    :input_data             : the text to pass to the model as a prompt.
    :bedrock_runtime_client : a boto3 bedrock runtime client to use
    :custom_base_model_id   : (Optional) custom model flag.
                            model id of base model used for customized model.
                            indicates use of customized model.
    :max_retries            : (Optional) maximum number of retries. Defaults to 5.
    :initial_sleep          : (Optional) initial sleep time in seconds. Defaults to 60 seconds
    """
    attempts = 0
    while attempts < max_retries:
        try:
            # response = call_bedrock(model_id, input_data["input"])
            response = call_bedrock(model_id, input_data, bedrock_runtime_client, custom_base_model_id)
            return response
        except Exception as e:
            if "ModelNotReadyException" in str(e):
                sleep_time = initial_sleep * (2**attempts)
                print(f"Retrying in {sleep_time} seconds...")
                sleep(sleep_time)
                attempts += 1
            else:
                # input_data is already the prompt string, so print it directly
                print(f"Error with input: {input_data}")
                print(e)
                return None
    return None  # Return None after max_retries
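
To get a feel for how long the backoff can take, the sketch below lists the waits `process_with_retries` would perform with its default arguments if every attempt raised `ModelNotReadyException`. `backoff_schedule` is a hypothetical helper written only for this illustration.

```python
def backoff_schedule(initial_sleep: int = 60, max_retries: int = 5) -> list:
    """Sleep time in seconds after each failed attempt: initial_sleep * 2**attempt."""
    return [initial_sleep * (2 ** attempt) for attempt in range(max_retries)]


schedule = backoff_schedule()
print(schedule)       # [60, 120, 240, 480, 960]
print(sum(schedule))  # 1860 seconds, or 31 minutes in the worst case
```

Lowering `initial_sleep` or `max_retries` shortens the worst case considerably, since each wait is double the previous one.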

Let's test our utility functions on some text to ensure we get a response back.

In [None]:
# test a call to titan model:
modelId = "amazon.titan-text-express-v1"
call_bedrock(modelId, "The quick brown fox jumps over the lazy dog", bedrock_runtime_client=bedrock_runtime)

In [None]:
# test a call to titan model:
modelId = "amazon.titan-text-express-v1"
call_bedrock(modelId, "What is AWS Lambda?", bedrock_runtime_client=bedrock_runtime)

### Step 14: Run Off-the-shelf Models
Here we'll run the Spider data set questions through a few models that have not been fine-tuned. We'll later compare their results with our fine-tuned model.


In [None]:
# Load the validation questions into a list:
validation_file_path = (
    f"{output_datasets_dir}/validation_titan.jsonl"
)

validation_data = []
with open(validation_file_path, "r") as file:
    for line in file:
        json_line = json.loads(line)
        validation_data.append(json_line)

len(validation_data)

#### Titan Express
Prompt the model for every question in the Spider data set and record the results.
Note we're leveraging thread pooling so we can get our results faster through parallelism.

In [None]:
model_id = "amazon.titan-text-express-v1"
model_name = model_id.split(".")[1]
# validation_data was loaded from the .jsonl file in the previous cell and is reused here

# Set up ThreadPoolExecutor
with ThreadPoolExecutor(10) as executor:
    futures = [
        executor.submit(call_bedrock, model_id, line['input'], bedrock_runtime) for line in validation_data
    ]

    results = []
    for future in tqdm(futures, total=len(validation_data)):
        try:
            result = future.result()  # You can add a timeout value here if needed
            results.append(result)
        except Exception as e:
            print("Error processing a line.")
            print(e)
            results.append(None)  # Append None or some error indicator


# Save the results
with open(f"{results_directory}/{model_name}", "wb") as fp:  # Pickling
    pickle.dump(results, fp)
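
The results are later assigned to dataframe rows by position, so their order has to match the order of the questions. The sketch below, which uses a stand-in function instead of `call_bedrock`, illustrates why iterating over the list of futures preserves submission order even when calls finish out of order. `fake_model_call` and `prompts` are hypothetical.

```python
import time
from concurrent.futures import ThreadPoolExecutor


def fake_model_call(prompt: str) -> str:
    # Stand-in for call_bedrock: shorter prompts finish first.
    time.sleep(0.01 * len(prompt))
    return prompt.upper()


prompts = ["a long prompt", "short", "mid size"]

with ThreadPoolExecutor(max_workers=3) as executor:
    futures = [executor.submit(fake_model_call, p) for p in prompts]
    # Iterating the futures list (rather than as_completed) keeps submission order.
    ordered = [future.result() for future in futures]

print(ordered)
```

Using `concurrent.futures.as_completed` instead would yield results in completion order, which would break the positional alignment.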

#### Claude 3 Haiku
Prompt the model for every question in the Spider data set and record the results.

**This process with Claude 3 Haiku can take up to 2 hours.** If you'd like to skip to save time please do so.

In [None]:
# Claude 3 Haiku
model_id = "anthropic.claude-3-haiku-20240307-v1:0"
model_name = model_id.split(".")[1]

answers = []
split_string = 'Answer the following question with a SQL Statement:'
with open(validation_file_path, "r") as f:
    num_lines = sum(1 for line in f)
    f.seek(0)
    for line in tqdm(f, total=num_lines):
        json_line = json.loads(line)
        # Placeholder keeps answers aligned with the validation set if the call fails
        response = "Query could not be completed."
        # Retry the same question while the model is not ready (up to 5 attempts);
        # decrementing idx inside a for loop does not repeat the line
        for attempt in range(5):
            try:
                user_question = json_line["input"].split(split_string)[1]
                system_prompt = json_line["input"].split(split_string)[0] + split_string
                response = call_bedrock(
                    model_id,
                    prompt_data=user_question,
                    bedrock_runtime_client=bedrock_runtime,
                    system_prompt=system_prompt
                )
                break
            except Exception as e:
                print(e)
                if "ModelNotReadyException" not in str(e):
                    break
                print("waiting 60 seconds and retrying")
                sleep(60)
        answers.append(response)

# Save the results
with open(f"{results_directory}/{model_name}", "wb") as fp:  # Pickling
    pickle.dump(answers, fp)
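
For the Anthropic models, each validation prompt is split into a system prompt and a user question around a fixed instruction string. The sketch below illustrates that split on a made-up record; `sample_input` is hypothetical and much shorter than the real entries in `validation_titan.jsonl`.

```python
split_string = "Answer the following question with a SQL Statement:"

# Hypothetical record, shaped like the "input" field of a validation line.
sample_input = (
    "CREATE TABLE singer (singer_id INT, name TEXT)\n"
    "Answer the following question with a SQL Statement: How many singers are there?"
)

parts = sample_input.split(split_string)
# Everything up to and including the instruction becomes the system prompt.
system_prompt = parts[0] + split_string
# The remainder is the natural language question sent as the user message.
user_question = parts[1]

print(repr(user_question))
```

If the instruction string is missing from a record, `parts[1]` raises an `IndexError`, which is why the split sits inside the `try` block in the cells above.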

#### Claude Instant
Prompt the model for every question in the Spider data set and record the results.

In [None]:
model_id = "anthropic.claude-instant-v1"
model_name = model_id.split(".")[1]

answers = []
split_string = 'Answer the following question with a SQL Statement:'
with open(validation_file_path, "r") as f:
    num_lines = sum(1 for line in f)
    f.seek(0)
    for line in tqdm(f, total=num_lines):
        json_line = json.loads(line)
        # Placeholder keeps answers aligned with the validation set if the call fails
        response = "Query could not be completed."
        # Retry the same question while the model is not ready (up to 5 attempts);
        # decrementing idx inside a for loop does not repeat the line
        for attempt in range(5):
            try:
                user_question = json_line["input"].split(split_string)[1]
                system_prompt = json_line["input"].split(split_string)[0] + split_string
                response = call_bedrock(
                    model_id,
                    prompt_data=user_question,
                    bedrock_runtime_client=bedrock_runtime,
                    system_prompt=system_prompt
                )
                break
            except Exception as e:
                print(e)
                if "ModelNotReadyException" not in str(e):
                    break
                print("waiting 60 seconds and retrying")
                sleep(60)
        # call_bedrock already prepends "SELECT", so the response is stored as-is
        answers.append(response)

# Save the results
with open(f"{results_directory}/{model_name}", "wb") as fp:  # Pickling
    pickle.dump(answers, fp)

### Step 15: Run on our Titan Fine-Tuned Model
Here we'll use our fine-tuned model created in the previous notebook to run the same test questions from the Spider data set, just as we've done with the off-the-shelf models. Before doing so, let's make sure our model is responding reasonably to requests. You should see something like "There are more than 400000 singers in the world."

Now, let's run all of our validations with our fine-tuned model.

In [None]:
# Set up ThreadPoolExecutor
with ThreadPoolExecutor(4) as executor:
    futures = [
        executor.submit(
            process_with_retries, 
            model_id=provisioned_model_arn, 
            input_data=line["input"], 
            bedrock_runtime_client=bedrock_runtime,
            custom_base_model_id="amazon.titan-text-express-v1")
        # print(line)
        for line in validation_data
    ]
    results = []
    for future in tqdm(futures, total=len(validation_data)):
        try:
            result = future.result()  # You can add a timeout value here if needed
            results.append(result)
        except Exception as e:
            print("Error processing a line.")
            print(e)
            results.append("Query could not be completed.")  # Append string in order to match size of dataframe

model_name = custom_model_name

# Save the results
with open(f"{results_directory}/{model_name}", "wb") as fp:  # Pickling
    pickle.dump(results, fp)

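If requests are throttled, check the model invocation limits listed under [Amazon Bedrock quotas](https://docs.aws.amazon.com/bedrock/latest/userguide/quotas.html).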


### Step 16: Set Up for Model Performance
To analyze our performance across all models, we'll create a pandas dataframe to hold all of the generated queries from the models we're benchmarking. We'll use this dataframe to analyze the results.

In [None]:
df_eval_bedrock = pd.read_feather(
    f"{output_datasets_dir}/dev_df.feather"
)

In [None]:
df_eval_bedrock.head(1)

We'll create a couple of helper functions to assist.
`clean_results` will ensure all of the answers we received from our models are cleansed of leading and trailing whitespace, quotes, and the like.

In [None]:
def clean_results(answerlist):
    """
    Cleans a list of strings
    :param answerlist: a list of strings (None or empty for failed calls)
    :return: a list of strings with the same length as answerlist
    """
    clean_list = []
    for item in answerlist:
        if not item:
            # Keep a placeholder so the list stays aligned with the dataframe rows
            clean_list.append("Query could not be completed.")
            continue
        # Remove any leading or trailing whitespace
        item = item.strip()
        # Remove surrounding double quotes
        if item.startswith('"') and item.endswith('"'):
            item = item.strip('"')
        # Replace newlines with spaces
        item = item.replace("\n", " ")
        # Remove semicolons to match the validation set
        item = item.replace(";", "")
        # Add the cleaned item to the clean_list
        clean_list.append(item)

    return clean_list
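
As a quick illustration of the cleaning steps, the sketch below applies the same transformations to a single made-up model response. `clean_one` is a hypothetical single-item version written only for this example.

```python
def clean_one(item: str) -> str:
    """Apply the same per-item cleaning steps as clean_results (illustrative)."""
    item = item.strip()
    if item.startswith('"') and item.endswith('"'):
        item = item.strip('"')
    item = item.replace("\n", " ")
    return item.replace(";", "")


# Made-up response with padding, wrapping quotes, a newline, and a semicolon.
raw = '  "SELECT name\nFROM singer;"  '
cleaned = clean_one(raw)
print(cleaned)  # SELECT name FROM singer
```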

This utility function compares the results from our model with Spider's labeled ground truth to determine the execution match accuracy. For every question in the Spider data set, it runs both the generated SQL query and the gold SQL query against the question's SQLite database and compares the rows they return.

In [None]:
def run_exact_match_bench(df, model, spider_folder):
    """
    Executes gold and predicted queries and returns the execution match accuracy
    :param df               : dataframe holding model outputs
    :param model            : name of the dataframe column holding the model's queries
    :param spider_folder    : parent spider directory that holds the ground truth spider databases
    :returns                : dict with the model name and its accuracy
    """
    results = []
    counter = 0

    for idx in range(0, df.shape[0]):
        sql_query = df.iloc[idx]["query"]
        prediction_query = df.iloc[idx][model]

        db_id = df.iloc[idx]["db_id"]
        db_file = f"{spider_folder}/database/{db_id}/{db_id}.sqlite"
        conn = sqlite3.connect(db_file)
        cursor = conn.cursor()
        result_row = {
            'gold_query': sql_query,
            'gold_result': None,
            'prediction_query': prediction_query,
            'prediction': None,
            'match': False,
            'error': None,
        }
        try:
            # Fetching the gold standard
            cursor.execute(sql_query)
            result_row['gold_result'] = cursor.fetchall()

            # Fetching prediction results
            cursor.execute(prediction_query)
            result_row['prediction'] = cursor.fetchall()

            # Comparing the results
            if result_row['gold_result'] == result_row['prediction']:
                result_row['match'] = True
                counter += 1
        except Exception as e:
            # A failing query (gold or predicted) is recorded as a non-match
            result_row['error'] = str(e)
        finally:
            conn.close()
        results.append(result_row)

    df_results = pd.DataFrame(results)
    df_results.to_csv(
        path_or_buf=f"comparison_{model}.txt", 
        sep=',', 
        header=True, 
        index=False, 
        mode='w', 
    )

    accuracy = counter / df.shape[0]
    print(f"{model} Accuracy: {accuracy}")
    return {"model": model, "accuracy": accuracy}
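
To make the comparison concrete, the sketch below runs the same kind of check against a small in-memory SQLite database. The `singer` table, its rows, and the queries are made up for illustration and are not taken from Spider. It suggests that differently written queries can still match when they return the same rows, while a different row order counts as a mismatch under a plain list comparison.

```python
import sqlite3

# Hypothetical in-memory database standing in for a Spider .sqlite file.
conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    CREATE TABLE singer (singer_id INTEGER PRIMARY KEY, name TEXT, age INTEGER);
    INSERT INTO singer VALUES (1, 'Ana', 30), (2, 'Bo', 25), (3, 'Cy', 41);
    """
)


def rows(query: str) -> list:
    """Execute a query and return all rows, as the benchmark function does."""
    return conn.execute(query).fetchall()


gold = "SELECT count(*) FROM singer WHERE age > 26"
aliased = "SELECT COUNT(singer_id) AS n FROM singer AS s WHERE s.age > 26"
gold_names = "SELECT name FROM singer ORDER BY age"
reordered = "SELECT name FROM singer ORDER BY age DESC"

# Different SQL text, same rows: counted as a match.
alias_match = rows(gold) == rows(aliased)
# Same rows in a different order: counted as a mismatch.
order_match = rows(gold_names) == rows(reordered)
conn.close()

print(alias_match, order_match)  # True False
```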

### Step 17: Evaluate Models
With our utility functions available, we'll iterate through each of our output results files, aggregate the accuracy as measured against the labeled Spider data set, and display the accuracy of each model. While we're at it, we'll also add the generated SQL to our `df_eval_bedrock` data frame for analysis.

In [None]:
# add results from all tests to benchmark dataframe
# build accuracy results
accuracy_results = []
for filename in os.listdir(results_directory):
    # Skip Jupyter checkpoint folders
    if filename == '.ipynb_checkpoints':
        continue
    with open(f"{results_directory}/{filename}", "rb") as fp:  # Unpickling
        results = pickle.load(fp)
    cleaned_results = clean_results(results)
    try:
        df_eval_bedrock[filename] = cleaned_results
        accuracy = run_exact_match_bench(df_eval_bedrock, filename, spider_folder)
        accuracy_results.append(accuracy)
    except Exception as e:
        print(f"couldn't add for {results_directory}/{filename}: {e}")

### Step 18: Analyze Results
Now that we've benchmarked our models, including our fine-tuned model, take a look at our dataframe to compare how the models generated SQL.

How much of an improvement over the base model, Titan Express, did our fine-tuned model achieve?
Why might some of Claude's queries not have matched? Did it alias columns? 

In [None]:
df_eval_bedrock.head()

Let's see why some responses may not have matched by loading the comparison results for our fine-tuned model into a pandas dataframe and filtering to non-matches.

How would you improve your prompting to better account for these?

In [None]:
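# header=0 replaces the header row written by to_csv; header=1 would also drop the first data row
df_fine_tuned_result = pd.read_csv(f"comparison_{custom_model_name}.txt", sep=',', header=0, names=['gold_query','gold_result','prediction_query','prediction','match','error'])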

In [None]:
df_fine_tuned_result.loc[df_fine_tuned_result['match'] == False]

## Summary
In this notebook we used the Spider data set to fine-tune a Titan Text Model to be optimized for Text-to-SQL.

It's important to remember that our fine-tuned model is now optimized to accomplish a much narrower scope of tasks: converting a natural language question to a SQL query.
Specifically, we've fine-tuned on the Spider data set, so the model will use its understanding of this data set when responding to questions. This means using this model on a data set other than Spider will yield poor results. 
Secondly, the SQL used in the Spider data set is biased toward SQLite syntax. This will add to poor performance when used against databases other than SQLite.

Those wishing to fine-tune a model on their own data will require a fully labeled list of questions and SQL queries, just like Spider. This effort can constitute a significant engineering cost to the project, as a human must carefully curate questions with their syntactically correct SQL queries.