# Deploying Llama2-7b using Large Model Inference container DLC with SageMaker Hosting

In this notebook, we explore how to use SageMaker's Large Model Inference (LMI) container to deploy Llama2 on a SageMaker real-time endpoint. We use DJLServing, which is bundled in the LMI container, as the model serving solution. DJLServing is a high-performance, programming-language-agnostic universal model serving solution powered by the Deep Java Library (DJL). To learn more about DJL and DJLServing, see the documentation (https://docs.djl.ai/docs/serving/index.html).

We use the SageMaker LMI container, which provides rolling batch support for continuous batching along with Paged Attention. In this notebook, we deploy the https://huggingface.co/TheBloke/Llama-2-7B-fp16 model across multiple GPUs on an ml.g5.12xlarge instance (which has 4 GPUs in total).

### Import required libraries and establish session using SageMaker SDK

In [None]:
!pip install sagemaker boto3 huggingface_hub --upgrade --quiet

In [None]:
import sagemaker
import jinja2
from sagemaker import image_uris
import boto3
import os
import time
import json
from pathlib import Path

In [None]:
role = sagemaker.get_execution_role()  # execution role for the endpoint
sess = sagemaker.session.Session()  # sagemaker session for interacting with different AWS APIs
bucket = sess.default_bucket()  # bucket to house artifacts

In [None]:
model_bucket = sess.default_bucket()  # bucket to house model artifacts
s3_code_prefix = "hf-large-model-djl/meta-llama/Llama-2-7b-fp16/code"  # folder within bucket where code artifact will go

s3_model_prefix = "hf-large-model-djl/meta-llama/Llama-2-7b-fp16/model"  # folder within bucket where model artifact will go
region = sess.boto_region_name  # public attribute; avoids relying on the private `_region_name`
account_id = sess.account_id()

s3_client = boto3.client("s3")
sm_client = boto3.client("sagemaker")
smr_client = boto3.client("sagemaker-runtime")

jinja_env = jinja2.Environment()

### [OPTIONAL] Download the model from Hugging Face and upload the model artifacts on Amazon S3

If you want to download your own copy of the model and upload it to an S3 location in your AWS account, follow the steps below. Otherwise, skip to the next step.

In [None]:
"""from huggingface_hub import snapshot_download
from pathlib import Path
import os

# - This will download the model into the current directory, wherever the Jupyter notebook is running
local_model_path = Path(".")
local_model_path.mkdir(exist_ok=True)
model_name = "TheBloke/Llama-2-7b-fp16"
# Only download pytorch checkpoint files
allow_patterns = ["*.json", "*.txt", "*.model", "*.safetensors", "*.bin", "*.chk", "*.pth"]

# - Leverage the snapshot library to download the model, since the model is stored in the repository using LFS
model_download_path = snapshot_download(
    repo_id=model_name, cache_dir=local_model_path, allow_patterns=allow_patterns
)"""
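
To get a feel for which files the `allow_patterns` list above would select, here is a minimal sketch that applies the same glob patterns with Python's standard `fnmatch` module. The file names in `repo_files` are hypothetical, and the helper `is_allowed` is introduced only for illustration; the actual filtering inside `snapshot_download` may differ in detail.

```python
from fnmatch import fnmatch

# Same glob patterns as in the download cell above.
allow_patterns = ["*.json", "*.txt", "*.model", "*.safetensors", "*.bin", "*.chk", "*.pth"]


def is_allowed(filename, patterns):
    """Return True if the file name matches at least one glob pattern."""
    return any(fnmatch(filename, pattern) for pattern in patterns)


# Hypothetical repository listing, for illustration only.
repo_files = [
    "config.json",
    "tokenizer.model",
    "model-00001-of-00002.safetensors",
    "README.md",
    ".gitattributes",
]

selected = [name for name in repo_files if is_allowed(name, allow_patterns)]
print(selected)
```

Files such as `README.md` fall outside every pattern, so they would be skipped under this assumption.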

In [None]:
# upload files from local to S3 location
# pretrained_model_location = sess.upload_data(path=model_download_path, key_prefix=s3_model_prefix)
# print(f"Model uploaded to --- > {pretrained_model_location}")

In [None]:
# Cleanup locally stored model files post S3 upload
#!rm -rf {model_download_path}

### Define a variable to contain the s3 url of the location that has the model

In [None]:
# Define a variable to contain the S3 URL of the location that has the model. For demo purposes, we use the Llama-2-7b-fp16 model artifacts from our S3 bucket
pretrained_model_location = f"s3://sagemaker-example-files-prod-{region}/models/llama-2/fp16/7B/"

### 1. Deploy Llama2 on SageMaker using LMI
#### 1.1 Create serving.properties
We start by creating a configuration file that tells DJL Serving which model parallelization and inference optimization libraries to use. Set the configuration according to your needs.

Here is a list of the settings that we use in this configuration file:

    engine: The engine for DJL to use. You can choose from options such as Python or MPI, which may support different capabilities such as dynamic batching, continuous batching, and Paged Attention.

    option.entryPoint: The entry point for the model handler (e.g. djl_python.huggingface). If you don't provide this option, DJL Serving looks for model.py in the model directory.

    option.rolling_batch: Enables iteration-level batching using one of the supported strategies (auto, scheduler, lmi-dist).

    option.max_rolling_batch_size: Limits the number of concurrent requests.

    option.paged_attention: Only supported for option.rolling_batch=lmi-dist. Enabling this requires more GPU memory to be preallocated for caching.

    option.max_rolling_batch_prefill_tokens: Only supported for option.rolling_batch=lmi-dist. Limits the number of tokens for caching. This needs to be tuned based on batch size and input sequence length to avoid GPU OOM.

    option.model_id: The model ID of a pretrained model hosted inside a model repository on huggingface.co (https://huggingface.co/models) or an S3 path to the model artifacts. LMI will use this parameter to dynamically download the model at deploy time.

    option.tensor_parallel_degree: If you plan to run on multiple GPUs, use this option to set the number of GPUs per worker. If more GPUs are available on the instance, LMI can create multiple copies (e.g. tp=2 on an instance with 4 GPUs will create 2 copies).

For more details on the configuration options and an exhaustive list, refer to the documentation: https://docs.aws.amazon.com/sagemaker/latest/dg/large-model-inference-configuration.html.
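
The relationship between `option.tensor_parallel_degree` and the number of model copies can be sketched in a few lines. This is an illustration only: the helpers `parse_properties` and `model_copies` are hypothetical, and the sketch assumes the number of copies is simply the GPU count divided by the tensor parallel degree, as in the "tp=2 on 4 GPUs gives 2 copies" example above.

```python
def parse_properties(text):
    """Parse simple `key = value` lines into a dict, skipping blanks and comments."""
    props = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue
        key, _, value = line.partition("=")
        props[key.strip()] = value.strip()
    return props


def model_copies(num_gpus, tensor_parallel_degree):
    """Assumed rule: each copy occupies `tensor_parallel_degree` GPUs."""
    if tensor_parallel_degree < 1 or tensor_parallel_degree > num_gpus:
        raise ValueError("tensor_parallel_degree must be between 1 and num_gpus")
    return num_gpus // tensor_parallel_degree


sample = """
engine = MPI
option.rolling_batch = auto
option.tensor_parallel_degree = 2
"""

props = parse_properties(sample)
tp = int(props["option.tensor_parallel_degree"])
copies = model_copies(num_gpus=4, tensor_parallel_degree=tp)  # 4 GPUs as on ml.g5.12xlarge
print(f"tp={tp} on 4 GPUs -> {copies} copies")
```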

In [None]:
!rm -rf code_llama2_7b_fp16
!mkdir -p code_llama2_7b_fp16

In [None]:
%%writefile code_llama2_7b_fp16/serving.properties
engine = MPI
option.entryPoint = djl_python.huggingface
option.rolling_batch = auto
option.max_rolling_batch_size = 64
option.paged_attention = true
option.max_rolling_batch_prefill_tokens = 16080
option.tensor_parallel_degree = 2
option.model_loading_timeout = 900
option.model_id = {{model_id}}

In [None]:
# Plug the appropriate model location into our `serving.properties`
properties_path = Path("code_llama2_7b_fp16/serving.properties")
template = jinja_env.from_string(properties_path.read_text())
properties_path.write_text(template.render(model_id=pretrained_model_location))
!pygmentize code_llama2_7b_fp16/serving.properties | cat -n
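
Before packaging the file, it can help to confirm that no template placeholder was left unrendered. The following is a minimal sketch of such a check using only the standard library; the helper `unrendered_placeholders` and the `s3://example-bucket/...` path are hypothetical and used purely for illustration.

```python
import re

# Matches Jinja-style expressions ({{ ... }}) and statements ({% ... %}).
PLACEHOLDER = re.compile(r"\{\{.*?\}\}|\{%.*?%\}")


def unrendered_placeholders(text):
    """Return every Jinja-style placeholder still present in the text."""
    return PLACEHOLDER.findall(text)


# Hypothetical file contents before and after rendering.
raw = "engine = MPI\noption.model_id = {{model_id}}\n"
rendered = "engine = MPI\noption.model_id = s3://example-bucket/models/llama-2/fp16/7B/\n"

print(unrendered_placeholders(raw))
print(unrendered_placeholders(rendered))
```

In the notebook, the same function could be applied to the contents of `code_llama2_7b_fp16/serving.properties` after the render step.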

In [None]:
inference_image_uri = image_uris.retrieve(
    framework="djl-deepspeed", region=region, version="0.24.0"
)
print(f"Image going to be used is ---- > {inference_image_uri}")

In [None]:
!rm -f model.tar.gz
!tar czvf model.tar.gz code_llama2_7b_fp16

In [None]:
s3_code_artifact = sess.upload_data("model.tar.gz", bucket, s3_code_prefix)
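
If you want to verify the archive layout before uploading, a sketch like the one below lists the files inside a `.tar.gz` with Python's standard `tarfile` module. To stay self-contained it builds a throwaway archive in a temporary directory that mirrors the `tar czvf model.tar.gz code_llama2_7b_fp16` command above; the helper name `archive_members` is an assumption made for this example.

```python
import tarfile
import tempfile
from pathlib import Path


def archive_members(archive_path):
    """Return the sorted names of all regular files in a gzip-compressed tar archive."""
    with tarfile.open(str(archive_path), "r:gz") as tar:
        return sorted(member.name for member in tar.getmembers() if member.isfile())


with tempfile.TemporaryDirectory() as tmp:
    # Recreate the directory layout used in this notebook.
    code_dir = Path(tmp) / "code_llama2_7b_fp16"
    code_dir.mkdir()
    (code_dir / "serving.properties").write_text("engine = MPI\n")

    # Package the directory the same way the tar command does.
    archive = Path(tmp) / "model.tar.gz"
    with tarfile.open(str(archive), "w:gz") as tar:
        tar.add(str(code_dir), arcname=code_dir.name)

    members = archive_members(archive)

print(members)
```

Calling `archive_members("model.tar.gz")` on the real archive should show `serving.properties` under the `code_llama2_7b_fp16/` directory.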

#### 1.2 Deploy endpoint

### Create SageMaker model object
The model object allows you to manage your models by specifying a unique name for the model object and associating an execution role. In addition, you need to map the model artifact to the inference container image you would like to use with it.

In [None]:
from sagemaker.utils import name_from_base

model_name = name_from_base("Llama-2-7b-fp16-mpi")
print(model_name)

create_model_response = sm_client.create_model(
    ModelName=model_name,
    ExecutionRoleArn=role,
    PrimaryContainer={"Image": inference_image_uri, "ModelDataUrl": s3_code_artifact},
)
model_arn = create_model_response["ModelArn"]

print(f"Created Model: {model_arn}")

### Create SageMaker Endpoint Configuration
The SageMaker endpoint configuration allows you to manage your endpoints by specifying properties that will be used when creating endpoints. This is where you specify properties such as the instance type and the name of the model object to deploy, which we created in the previous step.

In [None]:
endpoint_config_name = f"{model_name}-config"
endpoint_name = f"{model_name}-endpoint"

endpoint_config_response = sm_client.create_endpoint_config(
    EndpointConfigName=endpoint_config_name,
    ProductionVariants=[
        {
            "VariantName": "variant1",
            "ModelName": model_name,
            "InstanceType": "ml.g5.12xlarge",
            "InitialInstanceCount": 1,
            "ModelDataDownloadTimeoutInSeconds": 900,
            "ContainerStartupHealthCheckTimeoutInSeconds": 900,
        },
    ],
)
endpoint_config_response

### Create SageMaker Endpoint 
We can now create the SageMaker endpoint using the configuration we specified in the last step.

In [None]:
create_endpoint_response = sm_client.create_endpoint(
    EndpointName=endpoint_name, EndpointConfigName=endpoint_config_name
)
print(f"Created Endpoint: {create_endpoint_response['EndpointArn']}")

#### This can take a while, so please be patient

In [None]:
import time

resp = sm_client.describe_endpoint(EndpointName=endpoint_name)
status = resp["EndpointStatus"]
print("Status: " + status)

while status == "Creating":
    time.sleep(60)
    resp = sm_client.describe_endpoint(EndpointName=endpoint_name)
    status = resp["EndpointStatus"]
    print("Status: " + status)

print("Arn: " + resp["EndpointArn"])
print("Status: " + status)
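
The loop above polls until the status leaves `Creating`, with no upper bound on the wait. As a hedged alternative, the sketch below wraps the same polling logic in a helper with a timeout. The function `wait_for_status` and its parameters are hypothetical, and a fake `describe` callable stands in for `sm_client.describe_endpoint` so the example runs without AWS access.

```python
import time


def wait_for_status(describe, pending="Creating", poll_seconds=60,
                    timeout_seconds=3600, sleep=time.sleep):
    """Poll `describe()` until EndpointStatus leaves `pending` or the timeout expires."""
    deadline = time.monotonic() + timeout_seconds
    status = describe()["EndpointStatus"]
    while status == pending:
        if time.monotonic() >= deadline:
            raise TimeoutError(f"Endpoint still '{pending}' after {timeout_seconds} seconds")
        sleep(poll_seconds)
        status = describe()["EndpointStatus"]
    return status


# Fake responses standing in for successive describe_endpoint calls.
responses = iter(["Creating", "Creating", "InService"])


def fake_describe():
    return {"EndpointStatus": next(responses)}


final_status = wait_for_status(fake_describe, sleep=lambda seconds: None)
print("Status: " + final_status)
```

With the real client, the call might look like `wait_for_status(lambda: sm_client.describe_endpoint(EndpointName=endpoint_name))`.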

### Perform Inference Request

This is a generative model, so we pass in text as a prompt, and the model completes the sentence and returns the result.

In [None]:
%%time
smr_client.invoke_endpoint(
    EndpointName=endpoint_name,
    Body=json.dumps(
        {
            "inputs": "The diamondback terrapin was the first reptile to be",
            "parameters": {
                "do_sample": True,
                "max_new_tokens": 256,
                "min_new_tokens": 256,
                "temperature": 0.3,
                "watermark": True,
            },
        }
    ),
    ContentType="application/json",
)["Body"].read().decode("utf8")
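
The call above returns the raw response body as a string. The sketch below shows one way to pull the generated text out of it. The exact response schema depends on the container version and batching mode, so the two shapes handled here (a JSON object with a `generated_text` key, or a list of such objects) are assumptions, as is the helper name `extract_generated_text`.

```python
import json


def extract_generated_text(raw_body):
    """Extract `generated_text` from a JSON response body (assumed schema)."""
    payload = json.loads(raw_body)
    # Some handlers wrap the result in a list; take the first element if so.
    if isinstance(payload, list):
        payload = payload[0] if payload else {}
    if isinstance(payload, dict) and "generated_text" in payload:
        return payload["generated_text"]
    raise ValueError("Response does not contain a 'generated_text' field")


# Hypothetical response bodies, for illustration only.
as_object = '{"generated_text": " listed as a protected species."}'
as_list = '[{"generated_text": " listed as a protected species."}]'

print(extract_generated_text(as_object))
print(extract_generated_text(as_list))
```

Inspect the actual response from your endpoint first and adapt the parsing if its structure differs.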

## Clean Up
Delete the resources (endpoint, endpoint config, model) deployed for the endpoint used above.

In [None]:
# - Delete the endpoint
sm_client.delete_endpoint(EndpointName=endpoint_name)

In [None]:

# - Even if the endpoint failed, we still want to delete the endpoint config and the model
sm_client.delete_endpoint_config(EndpointConfigName=endpoint_config_name)
sm_client.delete_model(ModelName=model_name)
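
If one delete call fails (for example, because the endpoint never finished creating), the remaining resources should still be removed. The sketch below illustrates a best-effort cleanup pattern that runs every step and collects failures instead of stopping at the first one. The helper `run_cleanup` is hypothetical, and fake actions stand in for the `sm_client` delete calls so the example runs without AWS access.

```python
def run_cleanup(steps):
    """Run every (label, action) pair; return a dict of label -> exception for failures."""
    failures = {}
    for label, action in steps:
        try:
            action()
        except Exception as exc:  # keep going so later resources are still removed
            failures[label] = exc
    return failures


deleted = []


def failing_endpoint_delete():
    # Simulates a delete call that raises, e.g. because the endpoint does not exist.
    raise RuntimeError("endpoint not found")


steps = [
    ("endpoint", failing_endpoint_delete),
    ("endpoint-config", lambda: deleted.append("endpoint-config")),
    ("model", lambda: deleted.append("model")),
]

failures = run_cleanup(steps)
print("Deleted:", deleted)
print("Failed:", list(failures))
```

With the real client, each action could be a lambda such as `lambda: sm_client.delete_model(ModelName=model_name)`.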