## Deploy fine-tuned model
Now that we have fine-tuned the model, we can deploy it to a SageMaker endpoint. SageMaker offers several deployment [options](https://docs.aws.amazon.com/sagemaker/latest/dg/deploy-model.html), including real-time, serverless, asynchronous, and batch transform inference. In this notebook, we will deploy the model to a real-time endpoint.
There are also several choices to make when deploying LLMs for real-time inference, including:
- Single-model or multi-model endpoints
- Instance types (GPU, Inferentia2)
- Inference frameworks such as Large Model Inference, Text Generation Inference, TorchServe, and TensorRT-LLM

We'll use the Large Model Inference (LMI) container to deploy the LLM.

Refer to this [blog post](https://aws.amazon.com/blogs/machine-learning/boost-inference-performance-for-mixtral-and-llama-2-models-with-new-amazon-sagemaker-containers/) for detailed recommendations on configuring various model architectures for optimal performance on the LMI container.

In [None]:
import sys
import os
module_path = "../.."
sys.path.append(os.path.abspath(module_path))
from utils.environment_validation import validate_environment, validate_model_access
validate_environment()

In [None]:
import boto3
import sagemaker
from pathlib import Path
from sagemaker import Model, image_uris, serializers, deserializers

role = sagemaker.get_execution_role()  # execution role for the endpoint
sess = sagemaker.session.Session()  # sagemaker session for interacting with different AWS APIs
region = sess.boto_region_name  # region name of the current SageMaker Studio environment (public attribute)
bucket = sess.default_bucket()  # default bucket name
account_id = sess.account_id() 

In [None]:
# first we need to upload our merged model to S3
local_model_path = "merged_model"
s3_model_path = f"s3://{bucket}/banking-regulations-model"
!aws s3 cp {local_model_path} {s3_model_path} --recursive

The deployment configuration for the LMI container is specified via a `serving.properties` file. The file contains various configuration options such as the number of threads, batch size, and the maximum sequence length. In this case we will use the following parameters:
- **engine**: The runtime engine of the inference code. For LMI this should be either `MPI` or `Python`, depending on which inference backend is used
- **option.model_id**: The S3 location of the model artifact or the model name from Hugging Face Hub
- **option.trust_remote_code**: Allows custom code shipped with the model to be executed when it is loaded, which some models require
- **option.tensor_parallel_degree**: The number of GPUs across which the model should be parallelized (our deployment instance has a single GPU, so we set this to 1)
- **option.rolling_batch**: Enables continuous batching (iteration level batching) with one of the supported backends. See [here](https://docs.djl.ai/docs/serving/serving/docs/lmi/deployment_guide/configurations.html#inference-library-configuration) for more details
- **option.max_rolling_batch_size**: The maximum number of requests/sequences the model can process at a time. This parameter should be tuned to maximize throughput while staying within the available memory limits
- **option.dtype**: The data type the model weights are cast to when loaded
- **option.max_model_len**: The maximum number of tokens the model can process in a single sequence
- **option.gpu_memory_utilization**: The fraction of the GPU memory that the model can use. This parameter should be tuned to maximize throughput while staying within the available memory limits

For more details on the available configuration options, refer to the [documentation](https://docs.djl.ai/docs/serving/serving/docs/lmi/deployment_guide/configurations.html).
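
To make the interplay between `option.dtype`, `option.gpu_memory_utilization`, and `option.max_rolling_batch_size` more concrete, the rough arithmetic below estimates how much GPU memory is left for the KV cache once the weights are loaded. This is a back-of-the-envelope sketch, not how the LMI container does its accounting: the 7 billion parameter count is an assumed example (this notebook does not state the model size), the 24 GB figure is the single A10G GPU on an `ml.g5.2xlarge`, the function name is hypothetical, and activation memory and framework overhead are ignored.

```python
# Bytes per parameter for common weight dtypes (fp32 = 4, fp16/bf16 = 2).
BYTES_PER_PARAM = {"fp32": 4, "fp16": 2, "bf16": 2}


def kv_cache_budget_gb(num_params_billion, dtype, gpu_memory_gb, gpu_memory_utilization):
    """Rough estimate of GPU memory left for the KV cache, in GB (1 GB = 1e9 bytes).

    Hypothetical helper: ignores activations and framework overhead.
    """
    # 1 billion parameters at N bytes each is N GB of weights.
    weights_gb = num_params_billion * BYTES_PER_PARAM[dtype]
    # Only this fraction of the GPU memory is made available to the engine.
    usable_gb = gpu_memory_gb * gpu_memory_utilization
    return usable_gb - weights_gb


# Assumed example: a 7B-parameter model in fp16 on a 24 GB GPU at 0.9 utilization.
budget = kv_cache_budget_gb(7, "fp16", 24, 0.9)
print(f"{budget:.1f}")  # prints 7.6
```

A negative result (for example, the same assumed model in `fp32`) suggests the weights alone would not fit. A small positive result suggests lowering `option.max_rolling_batch_size` or `option.max_model_len`.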

In [None]:
with Path("serving.properties").open("w") as f:
    f.write((
        "engine=Python\n"
        f"option.model_id={s3_model_path}\n"
        "option.trust_remote_code=true\n"
        "option.tensor_parallel_degree=1\n"
        "option.rolling_batch=vllm\n"
        "option.max_rolling_batch_size=4\n"
        "option.dtype=fp16\n"
        "option.max_model_len=8192\n"
        "option.gpu_memory_utilization=0.9\n"
    ))

The `serving.properties` file is then packaged into a tarball and uploaded to S3. The tarball can also contain additional files such as the model weights and custom inference code. In this case, though, the container downloads the weights from the S3 location given by `option.model_id` in `serving.properties`, and we use the default inference code provided by the LMI container.
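
If you prefer not to shell out to `tar`, the same `mymodel/serving.properties` layout can be produced with Python's standard `tarfile` module. The snippet below is a minimal sketch. The helper name `package_serving_properties` is hypothetical, and the demo writes to a temporary directory with a one-line placeholder configuration rather than the real file.

```python
import tarfile
import tempfile
from pathlib import Path


def package_serving_properties(properties_text, output_dir, archive_name="mymodel.tar.gz"):
    """Write serving.properties and store it under mymodel/ in a gzipped tarball."""
    output_dir = Path(output_dir)
    props_path = output_dir / "serving.properties"
    props_path.write_text(properties_text)

    archive_path = output_dir / archive_name
    with tarfile.open(str(archive_path), "w:gz") as tar:
        # arcname sets the path stored inside the archive.
        tar.add(str(props_path), arcname="mymodel/serving.properties")
    return archive_path


# Demo with a placeholder configuration in a throwaway directory.
with tempfile.TemporaryDirectory() as tmp:
    archive = package_serving_properties("engine=Python\n", tmp)
    with tarfile.open(str(archive)) as tar:
        archive_members = tar.getnames()

print(archive_members)  # ['mymodel/serving.properties']
```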

In [None]:
%%sh
mkdir -p mymodel
mv serving.properties mymodel/
tar czvf mymodel.tar.gz mymodel/
rm -rf mymodel

In [None]:
# use the image_uris module to look up the container uri for the large model inference container

image_uri = image_uris.retrieve(
    framework="djl-deepspeed",
    region=region,
    version="0.27.0",
)

In [None]:
# upload the model configuration artifact to S3

s3_code_prefix = "banking-regulations-model/code"
code_artifact = sess.upload_data("mymodel.tar.gz", bucket, s3_code_prefix)
print(f"Model configuration tarball uploaded to: {code_artifact}")

model = Model(image_uri=image_uri, model_data=code_artifact, role=role)

In [None]:
# finally deploy the model using the SageMaker SDK

instance_type = "ml.g5.2xlarge"
endpoint_name = sagemaker.utils.name_from_base("banking-regulatory-model")

model.deploy(initial_instance_count=1,
             instance_type=instance_type,
             endpoint_name=endpoint_name,
             # uncomment to allow more time (in seconds) for the container to download and load the weights
             # container_startup_health_check_timeout=3600
            )

# our requests and responses will be in json format so we specify the serializer and the deserializer
predictor = sagemaker.Predictor(
    endpoint_name=endpoint_name,
    sagemaker_session=sess,
    serializer=serializers.JSONSerializer(),
    deserializer=deserializers.JSONDeserializer()
)

Let's test the endpoint with an example question from the test set.

In [None]:
# load the test data

import json
test_data = []
with open("data/prepared_data/prepared_data_test.jsonl", "r") as f:
    for line in f:
        test_data.append(json.loads(line))
    

inference_template = "[INST] You are a Banking Regulations expert.\nGiven this context\nCONTEXT\n{context}\n Answer this question\nQuestion: {question} [/INST]"

In [None]:
# invoke the endpoint with a sample question
idx = 125
context = test_data[idx]["context"]
question = test_data[idx]["question"]
answer = test_data[idx]["answer"]

prompt = inference_template.format(context=context, question=question)

response = predictor.predict(
    {"inputs": prompt, "parameters": {"max_new_tokens":256, "do_sample":False, "temperature":0}}
)

In [None]:
print("Question: ", question)
print("\nGenerated Answer: ", response["generated_text"])
print("\nGround Truth Answer: ", answer)
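
The cell above assumes the endpoint returns a dictionary with a `generated_text` key. Other serving stacks may wrap that dictionary in a single-item list, so a small defensive accessor can make downstream code less brittle. The function below is a hypothetical helper, not part of the SageMaker SDK, and the sample response is made up for illustration.

```python
def extract_generated_text(response):
    """Return the generated text from a dict response or a single-item list of dicts."""
    if isinstance(response, list):
        if not response:
            raise ValueError("empty response list")
        response = response[0]
    if not isinstance(response, dict) or "generated_text" not in response:
        raise ValueError(f"unexpected response format: {response!r}")
    # Strip leading/trailing whitespace that models often emit.
    return response["generated_text"].strip()


# Made-up sample response, for illustration only.
print(extract_generated_text({"generated_text": " example answer "}))  # prints: example answer
```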

In [None]:
# save the endpoint name to a file so we can use it in the next notebook

with open("endpoint_config.json", "w") as f:
    f.write(json.dumps({"endpoint_name": endpoint_name}))
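
For reference, the next notebook can read the endpoint name back along these lines. This is a sketch that assumes the file is read from the same working directory. The helper name `load_endpoint_name` and the `example-endpoint` value are placeholders, and the demo uses a temporary file so it doesn't touch the real `endpoint_config.json`.

```python
import json
import tempfile
from pathlib import Path


def load_endpoint_name(config_path="endpoint_config.json"):
    """Read the endpoint name saved by this notebook (hypothetical helper)."""
    config = json.loads(Path(config_path).read_text())
    name = config.get("endpoint_name")
    if not name:
        raise KeyError(f"'endpoint_name' missing or empty in {config_path}")
    return name


# Round trip with a temporary file and a placeholder endpoint name.
with tempfile.TemporaryDirectory() as tmp:
    path = Path(tmp) / "endpoint_config.json"
    path.write_text(json.dumps({"endpoint_name": "example-endpoint"}))
    loaded_name = load_endpoint_name(path)

print(loaded_name)  # prints: example-endpoint
```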

### Conclusion
In this notebook, we deployed a fine-tuned model to a SageMaker endpoint using the Large Model Inference container. In the next notebook, we will incorporate the endpoint into our RAG pipeline. 