## Deploy fine-tuned model
Now that the model is fine-tuned, we can deploy it to a SageMaker endpoint. SageMaker offers several deployment [options](https://docs.aws.amazon.com/sagemaker/latest/dg/deploy-model.html), including real-time, serverless, asynchronous, and batch transform inference. In this notebook, we deploy the model as a real-time endpoint.

There are also several options for deploying LLMs for real-time inference, including:
- Single-model or multi-model endpoints
- Instance types (GPU, Inferentia2)
- Inference frameworks such as Large Model Inference, Text Generation Inference, TorchServe, and TensorRT-LLM

We'll use the Large Model Inference (LMI) container to deploy the LLM.

Refer to [this blog post](https://aws.amazon.com/blogs/machine-learning/boost-inference-performance-for-mixtral-and-llama-2-models-with-new-amazon-sagemaker-containers/) for detailed recommendations on configuring various model architectures for optimal performance on the LMI container.

In [1]:
import sys
import os
module_path = "../.."
sys.path.append(os.path.abspath(module_path))
from utils.environment_validation import validate_environment, validate_model_access
validate_environment()

Validating base environment
Base environment validated successfully


In [2]:
import os
import boto3
import sagemaker
from pathlib import Path
from sagemaker.djl_inference.model import DJLModel
from sagemaker import serializers, deserializers

role = sagemaker.get_execution_role()  # execution role for the endpoint
sess = sagemaker.session.Session()  # sagemaker session for interacting with different AWS APIs
region = sess.boto_region_name  # region of the current SageMaker session (public attribute)
bucket = sess.default_bucket()  # default bucket name
account_id = sess.account_id()  # AWS account ID



sagemaker.config INFO - Not applying SDK defaults from location: /etc/xdg/sagemaker/config.yaml
sagemaker.config INFO - Not applying SDK defaults from location: /root/.config/sagemaker/config.yaml


In [3]:
# first we need to upload our merged model to S3
local_model_path = "merged_model"
s3_model_path = f"s3://{bucket}/banking-regulations-model"
!aws s3 sync {local_model_path} {s3_model_path}
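
To confirm the sync landed where the endpoint will look, one option is to list a few keys under the prefix. The sketch below is an illustration, not part of the original notebook: `split_s3_uri` and `list_model_files` are hypothetical helper names, and the client argument is assumed to behave like `boto3.client("s3")`.

```python
from urllib.parse import urlparse


def split_s3_uri(uri):
    """Split 's3://bucket/some/prefix' into (bucket, prefix)."""
    parsed = urlparse(uri)
    if parsed.scheme != "s3" or not parsed.netloc:
        raise ValueError(f"not an S3 URI: {uri!r}")
    return parsed.netloc, parsed.path.lstrip("/")


def list_model_files(s3_client, uri, max_keys=20):
    """Return up to max_keys object keys stored under the given S3 URI."""
    bucket_name, prefix = split_s3_uri(uri)
    resp = s3_client.list_objects_v2(Bucket=bucket_name, Prefix=prefix, MaxKeys=max_keys)
    # "Contents" is absent when nothing matches the prefix
    return [obj["Key"] for obj in resp.get("Contents", [])]
```

In the notebook this could be called as `list_model_files(boto3.client("s3"), s3_model_path)`; an empty list would suggest the upload did not reach the expected prefix.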

In [4]:
# define inference environment for LLM
# for more details see documentation here https://docs.djl.ai/master/docs/serving/serving/docs/lmi/deployment_guide/configurations.html
llm_env = {
    "TENSOR_PARALLEL_DEGREE": "1",  # use 1 GPU (no tensor parallelism)
    "OPTION_ROLLING_BATCH": "vllm",  # use the vLLM rolling-batch backend
    "OPTION_MAX_ROLLING_BATCH_SIZE": "32",  # max rolling batch size (controls the concurrency)
    "OPTION_DTYPE": "fp16",  # load weights in fp16
    "OPTION_MAX_MODEL_LEN": "16384",  # max context length in tokens for the model
    "OPTION_TRUST_REMOTE_CODE": "true",  # allow custom model code shipped with the checkpoint to run
    "OPTION_GPU_MEMORY_UTILIZATION": "0.95",  # use up to 95% of GPU memory
}

# create DJLModel object for LLM
# see here for LMI version updates https://github.com/aws/deep-learning-containers/blob/master/available_images.md#large-model-inference-containers 
sm_llm_model = DJLModel(
    model_id=s3_model_path,
    djl_version="0.30.0",
    djl_framework="djl-lmi",
    role=role,
    env=llm_env,
)
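
Two quick sanity checks can catch common configuration mistakes before deployment. The sketch below is illustrative only: `check_env` and `fp16_weight_gib` are hypothetical helpers, and the 7-billion parameter count is an assumed figure, since the notebook does not state the model size. It relies on two facts: container environment variables are passed as strings, and fp16 weights take 2 bytes per parameter.

```python
def check_env(env):
    """Return the keys whose values are not strings."""
    return [key for key, value in env.items() if not isinstance(value, str)]


def fp16_weight_gib(n_params):
    """Approximate size of fp16 weights in GiB (2 bytes per parameter)."""
    return n_params * 2 / 2**30


# Example config with one deliberate mistake: an integer instead of a string
example_env = {"TENSOR_PARALLEL_DEGREE": "1", "OPTION_MAX_MODEL_LEN": 16384}
print("non-string values:", check_env(example_env))

# Assumed 7B-parameter model; substitute the real parameter count
print("fp16 weights (GiB):", round(fp16_weight_gib(7e9), 2))
```

The weights have to fit inside the share of GPU memory set by `OPTION_GPU_MEMORY_UTILIZATION`, and whatever remains is what the engine can use for the KV cache, so a long `OPTION_MAX_MODEL_LEN` on a single 24 GB GPU leaves limited room for concurrent requests.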

In [5]:
instance_type = "ml.g5.xlarge"
endpoint_name = sagemaker.utils.name_from_base("bank-new")

predictor = sm_llm_model.deploy(
    initial_instance_count=1,
    instance_type=instance_type,
    endpoint_name=endpoint_name,
    serializer=serializers.JSONSerializer(),
    deserializer=deserializers.JSONDeserializer(),
    container_startup_health_check_timeout=1800,  # allow up to 30 minutes for the container to pass its health check
)

----------------!
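
The `deploy` call above waits for the endpoint by default, but it can be useful to check the status again later, for example from a new session. A minimal sketch, assuming a client that behaves like `boto3.client("sagemaker")`; the helper names `endpoint_status` and `is_ready` are hypothetical:

```python
def endpoint_status(sm_client, name):
    """Return the EndpointStatus string reported by DescribeEndpoint."""
    return sm_client.describe_endpoint(EndpointName=name)["EndpointStatus"]


def is_ready(sm_client, name):
    """True once the endpoint reports 'InService'."""
    return endpoint_status(sm_client, name) == "InService"
```

In the notebook, `is_ready(boto3.client("sagemaker"), endpoint_name)` would return `True` once the endpoint can serve requests.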

Let's test the endpoint with an example question.

In [6]:
# load the test data

import json
test_data = []
with open("data/prepared_data/prepared_data_test.jsonl", "r") as f:
    for line in f:
        test_data.append(json.loads(line))
    

inference_template = "[INST] You are a Banking Regulations expert.\nGiven this context\nCONTEXT\n{context}\n Answer this question\nQuestion: {question} [/INST]"
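
To see what the model actually receives, the template can be filled in with a small made-up record. The record below is hypothetical and only mirrors the `context` and `question` keys used by the test data:

```python
inference_template = "[INST] You are a Banking Regulations expert.\nGiven this context\nCONTEXT\n{context}\n Answer this question\nQuestion: {question} [/INST]"

# Hypothetical record with the same keys as the test data
record = {
    "context": "Example regulation text.",
    "question": "What does the rule require?",
}

prompt = inference_template.format(context=record["context"], question=record["question"])
print(prompt)
```

The `[INST] ... [/INST]` wrapper is part of the template itself, so the same instruction markers are sent to the endpoint on every request.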

In [7]:
# invoke the endpoint with a sample question
idx = 125
context = test_data[idx]["context"]
question = test_data[idx]["question"]
answer = test_data[idx]["answer"]

prompt = inference_template.format(context=context, question=question)

response = predictor.predict(
    {
        "inputs": prompt,
        "parameters": {"max_new_tokens": 256, "do_sample": False, "temperature": 0},
    }
)
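
When the `predictor` object is not available, for example in a different notebook, the same request can be sent through the SageMaker runtime API. This is a sketch under assumptions: `build_payload` and `invoke` are hypothetical helper names, the client is assumed to behave like `boto3.client("sagemaker-runtime")`, and the response is assumed to carry a `generated_text` field as it does in this notebook.

```python
import json


def build_payload(prompt, max_new_tokens=256):
    """Request body mirroring the payload sent via predictor.predict."""
    return {
        "inputs": prompt,
        "parameters": {"max_new_tokens": max_new_tokens, "do_sample": False, "temperature": 0},
    }


def invoke(runtime_client, endpoint_name, prompt):
    """Call the endpoint through a runtime client and return the generated text."""
    resp = runtime_client.invoke_endpoint(
        EndpointName=endpoint_name,
        ContentType="application/json",
        Body=json.dumps(build_payload(prompt)),
    )
    # The response body is a stream; read it fully before decoding the JSON
    return json.loads(resp["Body"].read())["generated_text"]
```

A call would look like `invoke(boto3.client("sagemaker-runtime"), endpoint_name, prompt)`.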

In [8]:
print("Question: ", question)
print("\nGenerated Answer: ", response["generated_text"])
print("\nGround Truth Answer: ", answer)

Question:  What is the impact of co-branding in the marketing of private education loans, and what disclosure is required to avoid implying endorsement by the covered educational institution?

Generated Answer:   Co-branding in the marketing of private education loans can imply endorsement by the covered educational institution, which is prohibited unless the marketing includes a clear and conspicuous disclosure that the covered educational institution does not endorse the creditor's loans and that the creditor is not affiliated with the covered educational institution (§ 226.48(a)(1)). 

Ground Truth Answer:  Co-branding in the marketing of private education loans implies endorsement by the covered educational institution. To avoid implying endorsement, the creditor's marketing must include a clear and conspicuous disclosure that is equally prominent and closely proximate to the reference to the covered educational institution, stating that the covered educational institution does not

In [9]:
# save the endpoint name to a file so we can use it in the next notebook

with open("endpoint_config.json", "w") as f:
    f.write(json.dumps({"endpoint_name": endpoint_name}))
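
For reference, the saved file can be read back in the next notebook along these lines. `load_endpoint_name` is a hypothetical helper, and the default path assumes the next notebook runs from the same working directory:

```python
import json
from pathlib import Path


def load_endpoint_name(path="endpoint_config.json"):
    """Read the endpoint name written by this notebook."""
    config = json.loads(Path(path).read_text())
    return config["endpoint_name"]
```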

### Conclusion
In this notebook, we deployed a fine-tuned model to a SageMaker endpoint using the Large Model Inference container. In the next notebook, we will incorporate the endpoint into our RAG pipeline. 