# Deploy a Hugging Face LLM on SageMaker
## Overview
Amazon SageMaker provides fully managed hosting for machine learning (ML) model inference. SageMaker hosting offers a broad selection of ML infrastructure and model deployment options to cover a wide range of inference needs.

## Deploy SQLCoder LLM on SageMaker
SQLCoder is a family of state-of-the-art LLMs for converting natural language questions to SQL queries. [SQLCoder](https://github.com/defog-ai/sqlcoder) has reported benchmark results that outperform GPT-4 and GPT-4 Turbo on the text-to-SQL generation task. In this lab, we'll deploy a quantized version of this model to a SageMaker endpoint using SageMaker LMI (Large Model Inference), run inference on it, and validate the inference results.

## SageMaker LMI Containers
SageMaker LMI containers are a set of high-performance Docker containers purpose-built for large language model (LLM) inference. With these containers you can leverage high-performance open-source inference libraries such as vLLM, TensorRT-LLM, DeepSpeed, and Transformers NeuronX to deploy LLMs on Amazon SageMaker endpoints. These containers bundle a model server with open-source inference libraries to deliver an all-in-one LLM serving solution. We provide quick start notebooks that get you deploying popular open-source models in minutes, and advanced guides to maximize the performance of your endpoint.

The lab is organized into the following key steps:
1. Create a model serving configuration file that specifies a 4-bit quantized SQLCoder model available on the Hugging Face Hub.
2. Deploy the model to SageMaker as a real-time endpoint using SageMaker LMI.
3. Verify that the model is deployed successfully and can serve inference requests for SQL query generation.
4. Set up a test database called **Chinook** in SQLite, a lightweight embedded database.
5. Send requests to the SQLCoder LLM using natural language as the LLM prompt.
6. Invoke the model and receive a response query.
7. Run the response query against the test SQLite database and verify the results.



> This notebook has been tested in a **`SageMaker Distribution 1.4`** Image using **Base Python 3.0 kernel** on  **ml.m5.large** instance.
> 

First, let's install all the required dependencies.

In [1]:
%pip install boto3 sagemaker fmeval jsonlines transformers -Uq

ERROR: pip's dependency resolver does not currently take into account all the packages that are installed. This behaviour is the source of the following dependency conflicts.
aiobotocore 2.7.0 requires botocore<1.31.65,>=1.31.16, but you have botocore 1.34.76 which is incompatible.
autogluon-multimodal 0.8.2 requires transformers[sentencepiece]<4.32.0,>=4.31.0, but you have transformers 4.39.3 which is incompatible.
Note: you may need to restart the kernel to use updated packages.


In [2]:
%pip install -Iv pandas==2.1.4 -q

Collecting pandas==2.1.4
  Using cached pandas-2.1.4-cp310-cp310-manylinux_2_17_x86_64.manylinux2014_x86_64.whl.metadata (18 kB)
Collecting numpy<2,>=1.22.4 (from pandas==2.1.4)
  Using cached numpy-1.26.4-cp310-cp310-manylinux_2_17_x86_64.manylinux2014_x86_64.whl.metadata (61 kB)
Collecting python-dateutil>=2.8.2 (from pandas==2.1.4)
  Using cached python_dateutil-2.9.0.post0-py2.py3-none-any.whl.metadata (8.4 kB)
Collecting pytz>=2020.1 (from pandas==2.1.4)
  Using cached pytz-2024.1-py2.py3-none-any.whl.metadata (22 kB)
Collecting tzdata>=2022.1 (from pandas==2.1.4)
  Using cached tzdata-2024.1-py2.py3-none-any.whl.metadata (1.4 kB)
Collecting six>=1.5 (from python-dateutil>=2.8.2->pandas==2.1.4)
  Using cached six-1.16.0-py2.py3-none-any.whl.metadata (1.8 kB)
Using cached pandas-2.1.4-cp310-cp310-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (12.3 MB)
Using cached numpy-1.26.4-cp310-cp310-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (18.2 MB)
Using cached python_dateutil-2.9.0.p

In [3]:
import warnings
warnings.filterwarnings('ignore', category=DeprecationWarning)
warnings.simplefilter(action='ignore', category=FutureWarning)

Import the libraries used throughout the lab.

In [4]:
import boto3
import sagemaker
from sagemaker import Model, image_uris, serializers, deserializers
import json
import sqlite3

role = sagemaker.get_execution_role()  # execution role for the endpoint
sess = sagemaker.session.Session()  # sagemaker session for interacting with different AWS APIs
region = sess._region_name  # region name of the current SageMaker Studio environment
account_id = sess.account_id()  # account_id of the current SageMaker Studio environment


sagemaker.config INFO - Not applying SDK defaults from location: /etc/xdg/sagemaker/config.yaml
sagemaker.config INFO - Not applying SDK defaults from location: /home/sagemaker-user/.config/sagemaker/config.yaml


# Deploy a Hugging Face LLM Using SageMaker LMI
To deploy an LLM using SageMaker LMI, we need to set up a configuration file with key information about how the model should be hosted and how it should serve inference requests. For instance, to deploy a Hugging Face LLM, we only need to provide the Hugging Face model ID in the configuration; SageMaker LMI automatically downloads the model and loads it in the serving container. SageMaker LMI is highly configurable, which gives users the flexibility to choose the most optimized configuration for serving their models. Please refer to this [link](https://docs.aws.amazon.com/sagemaker/latest/dg/large-model-inference-configuration.html) to learn more about the available configuration parameters for LMI.

The following diagram gives an overview of a SageMaker LMI deployment pipeline you can use to deploy your models.

![SageMaker LMI Deployment](images/sm_lmi_pipeline.jpg)

This [blog post](https://aws.amazon.com/blogs/machine-learning/boost-inference-performance-for-llms-with-new-amazon-sagemaker-containers/) describes SageMaker LMI in detail, including the inference optimization frameworks it supports and the performance benchmarks for each of them.

First, we provide a serving.properties file with the model specific details:

In [5]:
%%writefile serving.properties
engine=Python
option.model_id=TheBloke/sqlcoder-34b-alpha-GPTQ
option.task=text-generation
option.trust_remote_code=true
option.tensor_parallel_degree=max
option.rolling_batch=auto
option.quantize=gptq

Writing serving.properties


Next, we'll create a tar file that contains only the serving.properties file.

In [6]:
%%sh
mkdir model
mv serving.properties model/
tar czvf sqlcoder.tar.gz model/
rm -rf model

model/
model/serving.properties
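
The same packaging step can also be done in Python. The following is a minimal sketch, assuming the same property values as the cell above; `build_model_archive` is a hypothetical helper (not part of the SageMaker SDK), and the archive is written to a temporary location so that it does not overwrite `sqlcoder.tar.gz`.

```python
import tarfile
import tempfile
from pathlib import Path


def build_model_archive(properties: dict, archive_path: str) -> list:
    """Hypothetical helper: write serving.properties into model/ and pack it as .tar.gz."""
    with tempfile.TemporaryDirectory() as tmp:
        model_dir = Path(tmp) / "model"
        model_dir.mkdir()
        lines = [f"{key}={value}" for key, value in properties.items()]
        (model_dir / "serving.properties").write_text("\n".join(lines) + "\n")
        with tarfile.open(archive_path, "w:gz") as tar:
            # arcname keeps the same model/ prefix that the shell cell produces
            tar.add(str(model_dir), arcname="model")
    # Re-open the archive and list its members as a sanity check
    with tarfile.open(archive_path, "r:gz") as tar:
        return tar.getnames()


properties = {
    "engine": "Python",
    "option.model_id": "TheBloke/sqlcoder-34b-alpha-GPTQ",
    "option.task": "text-generation",
    "option.trust_remote_code": "true",
    "option.tensor_parallel_degree": "max",
    "option.rolling_batch": "auto",
    "option.quantize": "gptq",
}

archive_file = str(Path(tempfile.gettempdir()) / "sqlcoder_sketch.tar.gz")
members = build_model_archive(properties, archive_file)
print(members)
```

Listing the archive members before uploading is a cheap way to confirm that `serving.properties` sits under `model/`, as in the shell output above.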


Define an LMI container to use by specifying the framework and the framework version.

In [7]:
image_uri = image_uris.retrieve(
        framework="djl-deepspeed",
        region=sess.boto_session.region_name,
        version="0.26.0"
    )

Upload the .tar.gz file to S3 so that the serving container can pick it up at model deployment time.

In [8]:
s3_code_prefix = "models/large-model-lmi/sqlcoder"
bucket = sess.default_bucket()  # bucket to house artifacts
code_artifact = sess.upload_data("sqlcoder.tar.gz", bucket, s3_code_prefix)
print(f"S3 Code or Model tar ball uploaded to --- > {code_artifact}")

model = Model(image_uri=image_uri, model_data=code_artifact, role=role)

S3 Code or Model tar ball uploaded to --- > s3://sagemaker-us-east-1-602900100639/models/large-model-lmi/sqlcoder/sqlcoder.tar.gz


Specify the instance type and endpoint name, then use the SageMaker SDK to trigger the model deployment.

In [9]:
instance_type = "ml.g5.2xlarge"
endpoint_name = sagemaker.utils.name_from_base("sqlcoder-lmi-model")

model.deploy(initial_instance_count=1,
             instance_type=instance_type,
             endpoint_name=endpoint_name,
            )

---------------!

Once the model is deployed successfully, we can start running inference against the endpoint. The SageMaker SDK provides a Predictor class that simplifies sending inference requests and handling responses. In the following cell, we'll create a predictor object for the endpoint so that we can use it to generate SQL queries.

In [10]:
# requests are sent as JSON, so we specify a JSON serializer; no deserializer is set, so responses come back as raw bytes
predictor = sagemaker.Predictor(
    endpoint_name=endpoint_name,
    sagemaker_session=sess,
    serializer=serializers.JSONSerializer(),
)
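
For reference, the same request could also be sent without the `Predictor` class by calling the SageMaker runtime API through boto3. The sketch below is an assumption-laden illustration: `build_invoke_kwargs` and `invoke_with_boto3` are hypothetical helper names, and the payload shape mirrors the one passed to `predictor.predict` later in this lab.

```python
import json


def build_invoke_kwargs(endpoint_name: str, prompt: str, max_tokens: int = 1024) -> dict:
    """Hypothetical helper: build the keyword arguments for invoke_endpoint."""
    body = {"inputs": prompt, "parameters": {"max_tokens": max_tokens}}
    return {
        "EndpointName": endpoint_name,
        "ContentType": "application/json",
        "Body": json.dumps(body).encode("utf-8"),
    }


def invoke_with_boto3(endpoint_name: str, prompt: str) -> str:
    """Hypothetical helper: call the endpoint and return the raw JSON text."""
    import boto3  # imported lazily so the payload helper works without AWS access

    runtime = boto3.client("sagemaker-runtime")
    response = runtime.invoke_endpoint(**build_invoke_kwargs(endpoint_name, prompt))
    # The response body is a stream of bytes, just like the Predictor's raw output
    return response["Body"].read().decode("utf-8")


# Building the request does not require a deployed endpoint
kwargs = build_invoke_kwargs("example-endpoint", "example prompt")
print(kwargs["ContentType"])
```

Separating payload construction from the network call keeps the request format easy to inspect and test before any endpoint exists.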

Here's a sample prompt template that consists of an instruction with placeholders to be sent to SQLCoder LLM for inference. You can modify the template and observe how different prompts would impact the response from the LLM.

In [11]:
prompt_template = """### Task
Generate a SQL query to answer [QUESTION]{user_question}[/QUESTION]

### Instructions
- If you cannot answer the question with the available database schema, return 'I do not know'

### Database Schema
The query will run on a database with the following schema:
{table_metadata_string}

### Answer
Given the database schema, here is the SQL query that answers [QUESTION]{user_question}[/QUESTION]
[SQL]"""

## SQL Query Validation
At this point, we start validating the LLM's capability by running the generated SQL query against a test database. For this example, we use the [Chinook](https://github.com/lerocha/chinook-database) database, which contains sample data on music album sales. We also created `metadata.sql`, which contains the DDL for the database schema. The schema information is inserted into the prompt template above to complete the prompt.

The following database diagram illustrates the chinook database tables and their relationships.

![chinook schema](images/chinook-schema.jpg)

In [12]:
metadata_file = "metadata.sql"
with open(metadata_file, "r") as f:
    table_metadata_string = f.read()

Let's ask a question using natural language relevant to the given DB schema.

In [13]:
question = "how many unique albums are there?"
prompt = prompt_template.format(user_question=question, table_metadata_string=table_metadata_string)

In [14]:
print(prompt)

### Task
Generate a SQL query to answer [QUESTION]how many unique albums are there?[/QUESTION]

### Instructions
- If you cannot answer the question with the available database schema, return 'I do not know'

### Database Schema
The query will run on a database with the following schema:
CREATE TABLE [Album]
(
    [AlbumId] INTEGER  NOT NULL,
    [Title] NVARCHAR(160)  NOT NULL,
    [ArtistId] INTEGER  NOT NULL,
    CONSTRAINT [PK_Album] PRIMARY KEY  ([AlbumId]),
    FOREIGN KEY ([ArtistId]) REFERENCES [Artist] ([ArtistId])
                ON DELETE NO ACTION ON UPDATE NO ACTION
);

CREATE TABLE [Artist]
(
    [ArtistId] INTEGER  NOT NULL,
    [Name] NVARCHAR(120),
    CONSTRAINT [PK_Artist] PRIMARY KEY  ([ArtistId])
);

CREATE TABLE [Customer]
(
    [CustomerId] INTEGER  NOT NULL,
    [FirstName] NVARCHAR(40)  NOT NULL,
    [LastName] NVARCHAR(20)  NOT NULL,
    [Company] NVARCHAR(80),
    [Address] NVARCHAR(70),
    [City] NVARCHAR(40),
    [State] NVARCHAR(40),
    [Country] NVARCHA

In [15]:
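def invoke_model(prompt):
    """Send the prompt to the endpoint and return the generated SQL query."""
    response = predictor.predict(
        {"inputs": prompt, "parameters": {"max_tokens": 1024}}
    )
    # No deserializer is set on the predictor, so the response is raw bytes
    output = response.decode("utf-8")
    full_output_text = json.loads(output)["generated_text"]
    # Keep only the text before the closing [/SQL] tag
    sql_query = full_output_text.split("[/SQL]")[0]
    return sql_query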
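
The parsing step inside `invoke_model` can be exercised without a live endpoint. The following is a minimal sketch, assuming the endpoint returns a JSON object with a `generated_text` field as in the cell above; `extract_sql` is a hypothetical helper and the sample response is made up for illustration.

```python
import json


def extract_sql(raw_response: bytes) -> str:
    """Hypothetical helper: pull the SQL text out of a raw endpoint response."""
    payload = json.loads(raw_response.decode("utf-8"))
    if "generated_text" not in payload:
        raise ValueError("response does not contain 'generated_text'")
    # Drop anything after the closing tag and trim surrounding whitespace
    return payload["generated_text"].split("[/SQL]")[0].strip()


# Made-up response bytes, shaped like the output that invoke_model parses
sample_response = json.dumps(
    {"generated_text": " SELECT COUNT(*) FROM Album;[/SQL] trailing text"}
).encode("utf-8")

extracted = extract_sql(sample_response)
print(extracted)  # → SELECT COUNT(*) FROM Album;
```

Unlike `invoke_model`, this sketch also strips whitespace and fails loudly when the expected field is missing, which can make malformed responses easier to diagnose.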

Invoke the LLM to generate the SQL query.

In [16]:
sql_query = invoke_model(prompt)

In [17]:
print(sql_query)

SELECT COUNT(DISTINCT Album.AlbumId) AS total_albums FROM Album;


Next, we'll create a connection to a test SQLite database stored in a local file.
We'll also download the Chinook SQLite database file from the link below so that we can query its data.

In [18]:
connection = sqlite3.connect("test.db")

In [19]:
db_file_url = "https://github.com/lerocha/chinook-database/releases/download/v1.4.5/Chinook_Sqlite.sqlite"
db_filename = "Chinook_Sqlite.sqlite"

In [20]:
from urllib.request import urlretrieve

In [21]:
urlretrieve(db_file_url, db_filename)

('Chinook_Sqlite.sqlite', <http.client.HTTPMessage at 0x7f66eca80d90>)

In [22]:
def run_query(query):
    # Create a SQL connection to our SQLite database
    con = sqlite3.connect(db_filename)
    try:
        cur = con.cursor()
        # The result of a "cursor.execute" can be iterated over by row
        for row in cur.execute(query):
            print(row)
    finally:
        # Close the connection even if the query fails
        con.close()
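
To illustrate the validation step, here is a minimal sketch that runs the query generated earlier against a tiny in-memory stand-in for the `Album` table. The three sample rows and the `fetch_rows` helper are assumptions made for illustration; against the real Chinook file you would call `run_query(sql_query)` instead.

```python
import sqlite3


def fetch_rows(connection: sqlite3.Connection, query: str) -> list:
    """Hypothetical helper: run a query and return all rows as a list of tuples."""
    cursor = connection.cursor()
    try:
        return cursor.execute(query).fetchall()
    finally:
        cursor.close()


# ":memory:" creates a throwaway database that lives only for this connection
con = sqlite3.connect(":memory:")
con.executescript(
    """
    CREATE TABLE Album (
        AlbumId INTEGER PRIMARY KEY,
        Title TEXT NOT NULL,
        ArtistId INTEGER NOT NULL
    );
    INSERT INTO Album VALUES (1, 'First', 1), (2, 'Second', 1), (3, 'Third', 2);
    """
)

# The query text is the one the model returned for "how many unique albums are there?"
generated_query = "SELECT COUNT(DISTINCT Album.AlbumId) AS total_albums FROM Album;"
rows = fetch_rows(con, generated_query)
print(rows)  # → [(3,)]
con.close()
```

Returning rows instead of printing them makes it straightforward to compare the result of a generated query with the result of a reference query.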

## Use SageMaker Foundation Model Evaluation (fmeval) to evaluate SQLCoder
### Foundation Model Evaluations Library
fmeval is an open source library to evaluate Large Language Models (LLMs) in order to help select the best LLM for your use case. The library evaluates LLMs for the following tasks:

* Open-ended generation - The production of natural human responses to text that does not have a pre-defined structure.
* Text summarization - The generation of a condensed summary retaining the key information contained in a longer text.
* Question Answering - The generation of a relevant and accurate response to a question.
* Classification - Assigning a category, such as a label or score to text, based on its content.

To learn more about how to use the `fmeval` library, see the [GitHub](https://github.com/aws/fmeval) repository and the [documentation](https://docs.aws.amazon.com/sagemaker/latest/dg/clarify-foundation-model-evaluate-overview.html).

For the SQL generation task, we'll use the Q&A evaluation task to measure the accuracy of the generated query.
For this lab, we've curated a few Q&A samples to serve as the ground truth data. `fmeval` uses this data to evaluate the model.

In [23]:
import jsonlines

input_file = "fmeval_data_inputs.jsonl"
output_file = "fmeval_data_outputs.jsonl"

# For each line in `input_file`, invoke the model using the input from that line,
# augment the line with the invocation results, and write the augmented line to `output_file`.
with jsonlines.open(input_file) as input_fh, jsonlines.open(output_file, "w") as output_fh:
    for line in input_fh:
        if "question" in line:
            question = line["question"]
            print(f"Question: {question}")
            p = prompt_template.format(user_question=question, table_metadata_string=table_metadata_string)
            output = invoke_model(p)
            print(f"Model output: {output}")
            print("==============================")
            line["model_output"] = output
            output_fh.write(line)

Question: How many artists whose name starts with 'Adam'?
Model output: SELECT COUNT(ArtistId) AS ArtistCount FROM Artist WHERE Name like 'Adam%';
Question: how many unique albums are there?
Model output: SELECT COUNT(DISTINCT Album.AlbumId) AS total_albums FROM Album;
Question: How many customers do employee with first name of 'Jane' has?
Model output: SELECT COUNT(DISTINCT c.customerid) AS total_customers FROM customer c JOIN employee e ON c.supportrepid = e


## FMEval Setup

In this section, we evaluate the model that we deployed. We use the FMEval library to measure the accuracy of the model outputs generated above.

In [24]:
from fmeval.data_loaders.data_config import DataConfig
from fmeval.constants import MIME_TYPE_JSONLINES
from fmeval.eval_algorithms.qa_accuracy import QAAccuracy, QAAccuracyConfig

### Data Config Setup
Below, we create a DataConfig for the local dataset file we just created, fmeval_data_outputs.jsonl.

* dataset_name is just an identifier for your own reference
* dataset_uri is either a local path to a file or an S3 URI
* dataset_mime_type is the MIME type of the dataset. Currently, JSON and JSON Lines are supported.
* model_input_location, target_output_location, and model_output_location are JMESPath queries used to find the model inputs, target outputs, and model outputs within the dataset. The values that you specify here depend on the structure of the dataset itself. Take a look at fmeval_data_outputs.jsonl to see where "question", "answers", and "model_output" show up.
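
To make the field mapping concrete, the sketch below builds one JSON Lines record with the three fields that the DataConfig points at and resolves each location by key. The record values are hypothetical placeholders (not taken from the lab dataset), and the plain dictionary lookup only stands in for a JMESPath query on a top-level key. The `<OR>` separator is the `target_output_delimiter` used in this lab to list several acceptable answers.

```python
import json

# Hypothetical record: the values are placeholders, only the field names match the lab
record = {
    "question": "how many rows are in table t?",
    "answers": "SELECT COUNT(*) FROM t;<OR>SELECT COUNT(1) FROM t;",
    "knowledge_category": "SQL",
    "model_output": "SELECT COUNT(*) FROM t;",
}

# One record per line is what the JSON Lines format expects
line = json.dumps(record)

# The same names that are passed to DataConfig in this lab
locations = {
    "model_input_location": "question",
    "target_output_location": "answers",
    "model_output_location": "model_output",
}

parsed = json.loads(line)
# For a top-level key, a JMESPath query is just the key name
resolved = {name: parsed[key] for name, key in locations.items()}

# Several acceptable target answers can be separated with the delimiter
accepted_answers = resolved["target_output_location"].split("<OR>")
print(accepted_answers)
```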

In [25]:
config = DataConfig(
    dataset_name="llm_sample_with_model_outputs",
    dataset_uri="fmeval_data_outputs.jsonl",
    dataset_mime_type=MIME_TYPE_JSONLINES,
    model_input_location="question",
    target_output_location="answers",
    model_output_location="model_output"
)

## Run Evaluation
In use cases that we demonstrate in the other example notebooks, we usually pass a model runner and prompt template to the evaluate method of our evaluation algorithm. However, since our dataset already contains all of the model inference outputs, we only need to pass our dataset config.

In [26]:
eval_algo = QAAccuracy(QAAccuracyConfig(target_output_delimiter="<OR>"))
eval_output = eval_algo.evaluate(dataset_config=config, save=True)

2024-04-03 06:43:36,838	INFO worker.py:1724 -- Started a local Ray instance.



### Parse Evaluation Results

In [30]:
# Pretty-print the evaluation output (notice the score).
import json
print(json.dumps(eval_output, default=vars, indent=4))

[
    {
        "eval_name": "qa_accuracy",
        "dataset_name": "llm_sample_with_model_outputs",
        "dataset_scores": [
            {
                "name": "f1_score",
                "value": 0.9555555555555556
            },
            {
                "name": "exact_match_score",
                "value": 0.6666666666666666
            },
            {
                "name": "quasi_exact_match_score",
                "value": 0.6666666666666666
            },
            {
                "name": "precision_over_words",
                "value": 1.0
            },
            {
                "name": "recall_over_words",
                "value": 0.9215686274509803
            }
        ],
        "prompt_template": null,
        "category_scores": null,
        "output_path": "/tmp/eval_results/qa_accuracy_llm_sample_with_model_outputs.jsonl",
        "error": null
    }
]


In [31]:
eval_output[0].dataset_scores

[EvalScore(name='f1_score', value=0.9555555555555556),
 EvalScore(name='exact_match_score', value=0.6666666666666666),
 EvalScore(name='quasi_exact_match_score', value=0.6666666666666666),
 EvalScore(name='precision_over_words', value=1.0),
 EvalScore(name='recall_over_words', value=0.9215686274509803)]
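
To build intuition for these scores, the following is a simplified sketch of exact match and word-overlap precision, recall, and F1. It is an illustration under stated assumptions, not the `fmeval` implementation: the library applies its own text normalization, and the two query strings below are made up.

```python
from collections import Counter


def exact_match(target: str, output: str) -> float:
    """1.0 when the two strings are identical, otherwise 0.0."""
    return float(target == output)


def word_scores(target: str, output: str) -> tuple:
    """Return (precision, recall, f1) over whitespace-separated words."""
    target_words = target.split()
    output_words = output.split()
    # Multiset intersection counts each shared word at most as often as it appears in both
    overlap = sum((Counter(target_words) & Counter(output_words)).values())
    if overlap == 0:
        return 0.0, 0.0, 0.0
    precision = overlap / len(output_words)
    recall = overlap / len(target_words)
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1


# Made-up pair: the output omits one word that the target contains
target_query = "SELECT e.FirstName, COUNT(*) FROM employee e"
output_query = "SELECT COUNT(*) FROM employee e"

em = exact_match(target_query, output_query)
precision, recall, f1 = word_scores(target_query, output_query)
print(em, precision, round(recall, 4), round(f1, 4))
```

In this made-up pair every output word appears in the target, so precision is 1.0 while recall falls below 1.0. The scores above show the same pattern, which suggests that at least one model output omitted words present in its reference answer.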

In [32]:
# Create a Pandas DataFrame to visualize the results
import pandas as pd

data = []

# Load the dataset with model outputs and attach the dataset-level (aggregate) scores to every row
with open("fmeval_data_outputs.jsonl", "r") as file:
    for line in file:
        data_dict = json.loads(line)
        data_dict["eval_f1_score"] = eval_output[0].dataset_scores[0].value
        data_dict["eval_exact_match_score"] = eval_output[0].dataset_scores[1].value
        data_dict["eval_quasi_exact_match_score"] = eval_output[0].dataset_scores[2].value
        data_dict["eval_quasi_precision_over_words_score"] = eval_output[0].dataset_scores[3].value
        data_dict["eval_quasi_recall_over_words_score"] = eval_output[0].dataset_scores[4].value
        data.append(data_dict)

df = pd.DataFrame(data)

In [33]:
df

| | answers | knowledge_category | question | model_output | eval_f1_score | eval_exact_match_score | eval_quasi_exact_match_score | eval_quasi_precision_over_words_score | eval_quasi_recall_over_words_score |
|---|---|---|---|---|---|---|---|---|---|
| 0 | SELECT COUNT(ArtistId) AS ArtistCount FROM Art... | SQL | How many artists whose name starts with 'Adam'? | SELECT COUNT(ArtistId) AS ArtistCount FROM Art... | 0.955556 | 0.666667 | 0.666667 | 1.0 | 0.921569 |
| 1 | SELECT COUNT(DISTINCT Album.AlbumId) AS total_... | SQL | how many unique albums are there? | SELECT COUNT(DISTINCT Album.AlbumId) AS total_... | 0.955556 | 0.666667 | 0.666667 | 1.0 | 0.921569 |
| 2 | SELECT e.FirstName, COUNT(DISTINCT c.customeri... | SQL | How many customers do employee with first name... | SELECT COUNT(DISTINCT c.customerid) AS total_c... | 0.955556 | 0.666667 | 0.666667 | 1.0 | 0.921569 |


# Conclusion
In this lab, we learned how to deploy a Hugging Face model (SQLCoder) in a SageMaker Deep Learning Container using the SageMaker SDK.

To test the deployed LLM, we asked a question in natural language and had the LLM generate a relevant SQL query based on the given context.

We also loaded a test SQLite database with sample data so that we could run the generated query against the database to fetch the results.

Additionally, we used an open-source FM evaluation framework called FMEval to evaluate the performance of the model.

Finally, we showed the evaluation results as a pandas DataFrame for visualization.

# Clean Up

In [None]:
sess.delete_endpoint(endpoint_name)
sess.delete_endpoint_config(endpoint_name)
model.delete_model()