# Serve CodeLlama-34b and CodeLlama-13b-Instruct using Inference Components with LMI container at scale with high performance on SageMaker

In this notebook, we deploy the [CodeLlama-34b](https://huggingface.co/codellama/CodeLlama-34b-hf) and [CodeLlama-13b-Instruct](https://huggingface.co/codellama/CodeLlama-13b-Instruct-hf) models on the same inference endpoint using Inference Components on SageMaker by leveraging the [SageMaker Large Model Inference Container](https://github.com/aws/deep-learning-containers/blob/master/available_images.md#large-model-inference-containers).

Code Llama is a family of large language models (LLMs) released by Meta that accept text prompts and can generate and discuss code. Besides the base model, the release includes two other variants (Code Llama Python and Code Llama Instruct) in different sizes (7B, 13B, 34B, and 70B).

For the purpose of this notebook, we'll use the weights from the following sources:
- https://huggingface.co/codellama/CodeLlama-34b-hf
- https://huggingface.co/codellama/CodeLlama-13b-Instruct-hf

However, you can use the same approach to deploy any other Code Llama weights.


For more information on Code Llama, please refer to [this publication](https://ai.meta.com/research/publications/code-llama-open-foundation-models-for-code/).

This notebook explains how to deploy models optimized for latency and throughput. A tuning guide is available here: [LLM Tuning Guide](https://github.com/aws/deep-learning-containers/blob/master/available_images.md#large-model-inference-containers).


Additionally, you can refer to the AWS blog posts on [boosting inference performance for LLMs with the new SageMaker containers](https://aws.amazon.com/blogs/machine-learning/boost-inference-performance-for-llms-with-new-amazon-sagemaker-containers/) and on [the new SageMaker inference capabilities that help reduce foundation model deployment costs and latency](https://aws.amazon.com/blogs/aws/amazon-sagemaker-adds-new-inference-capabilities-to-help-reduce-foundation-model-deployment-costs-and-latency/).




## General Setup

### Install dependencies

In [None]:
%pip install sagemaker --upgrade  --quiet


### Import libraries

In [None]:
import json
import boto3
import sagemaker
from sagemaker import Model, image_uris, serializers, deserializers
import sys
import time

### Set configurations

In [None]:
role = sagemaker.get_execution_role()  # execution role for the endpoint
sess = sagemaker.session.Session()  # sagemaker session for interacting with different AWS APIs
region = sess.boto_region_name  # region name of the current SageMaker Studio environment
account_id = sess.account_id()  # account_id of the current SageMaker Studio environment
sm_client = boto3.client("sagemaker")
smr_client = boto3.client("sagemaker-runtime")
prefix = sagemaker.utils.unique_name_from_base("DEMO-IC-CodeGen-")

### Create SageMaker Endpoint Configuration
There are a few parameters we want to set up for our endpoint. We start by setting the variant name and the instance type we want our endpoint to use. We also set *model_data_download_timeout_in_seconds* and *container_startup_health_check_timeout_in_seconds* as guardrails for when we deploy inference components to our endpoint. In addition, we use Managed Instance Scaling, which allows SageMaker to scale the number of instances based on the requirements of your inference components. We set *MinInstanceCount* and *MaxInstanceCount* to size the endpoint according to the workload you want to serve while keeping costs under control. Lastly, we set *RoutingStrategy* so that the endpoint routes requests to instances and inference components for the best performance.

In [None]:
# Set a unique endpoint config name
endpoint_config_name = f"{prefix}-endpoint-config"
print(f"Demo endpoint config name: {endpoint_config_name}")

# Set variant name and instance type for hosting
variant_name = "AllTraffic"
instance_type = "ml.p4d.24xlarge"
model_data_download_timeout_in_seconds = 400
container_startup_health_check_timeout_in_seconds = 300

initial_instance_count = 1  # Change this as per your traffic pattern and autoscaling needs
max_instance_count = 1  # Change this as per your autoscaling needs
print(f"Initial instance count: {initial_instance_count}")
print(f"Max instance count: {max_instance_count}")

sm_client.create_endpoint_config(
    EndpointConfigName=endpoint_config_name,
    ExecutionRoleArn=role,
    ProductionVariants=[
        {
            "VariantName": variant_name,
            "InstanceType": instance_type,
            "InitialInstanceCount": initial_instance_count,
            "ModelDataDownloadTimeoutInSeconds": model_data_download_timeout_in_seconds,
            "ContainerStartupHealthCheckTimeoutInSeconds": container_startup_health_check_timeout_in_seconds,
            "ManagedInstanceScaling": {
                "Status": "ENABLED",
                "MinInstanceCount": initial_instance_count,
                "MaxInstanceCount": max_instance_count,
            },
            "RoutingConfig": {"RoutingStrategy": "LEAST_OUTSTANDING_REQUESTS"},
        }
    ],
)
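
As a minimal sketch, the variant settings described above can be assembled by a small helper that checks the instance bounds before any API call is made. The function name `build_production_variant` and its argument names are hypothetical and not part of the SageMaker SDK; only the dictionary keys mirror the `create_endpoint_config` call in this notebook.

```python
def build_production_variant(
    variant_name: str,
    instance_type: str,
    min_instances: int,
    max_instances: int,
    download_timeout_s: int = 400,
    startup_timeout_s: int = 300,
    routing_strategy: str = "LEAST_OUTSTANDING_REQUESTS",
) -> dict:
    """Return one ProductionVariants entry (hypothetical helper for this notebook)."""
    # Catch an inverted range locally instead of waiting for the API to reject it.
    if max_instances < min_instances:
        raise ValueError("max_instances must be >= min_instances")
    if routing_strategy not in ("LEAST_OUTSTANDING_REQUESTS", "RANDOM"):
        raise ValueError(f"unsupported routing strategy: {routing_strategy}")
    return {
        "VariantName": variant_name,
        "InstanceType": instance_type,
        "InitialInstanceCount": min_instances,
        "ModelDataDownloadTimeoutInSeconds": download_timeout_s,
        "ContainerStartupHealthCheckTimeoutInSeconds": startup_timeout_s,
        "ManagedInstanceScaling": {
            "Status": "ENABLED",
            "MinInstanceCount": min_instances,
            "MaxInstanceCount": max_instances,
        },
        "RoutingConfig": {"RoutingStrategy": routing_strategy},
    }


variant = build_production_variant("AllTraffic", "ml.p4d.24xlarge", 1, 1)
print(variant["ManagedInstanceScaling"])
```

The returned dictionary could then be passed as the single element of `ProductionVariants`.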

### Create SageMaker Endpoint
We can now use the EndpointConfiguration created in the last step to create an endpoint with SageMaker.

In [None]:
# Set a unique endpoint name
endpoint_name = f"{prefix}-endpoint"
print(f"Demo endpoint name: {endpoint_name}")

sm_client.create_endpoint(
    EndpointName=endpoint_name,
    EndpointConfigName=endpoint_config_name,
)

In [None]:
sess.wait_for_endpoint(endpoint_name)

That's it! Your endpoint is now ready, and we can reference it in the following cells to deploy inference components. Now that the endpoint is in service, you can start associating it with models by creating one or many inference components. We will create 2 inference components: 1/ CodeLlama-34b and 2/ CodeLlama-13b-Instruct.

In [None]:
inference_component_name_34b = f"{prefix}-IC-34b-0"
inference_component_name_13b = f"{prefix}-IC-13b-0"

## Select the appropriate configuration parameters and container
To optimize the deployment of Large Language Models (LLMs), one needs to choose the appropriate model partitioning framework, optimal batching technique, batch size, tensor parallelism degree, etc. The choice of a particular configuration depends on the use case.

Determining the level of partitioning to use with your model comes down to the following factors:

1. Size of the model

2. Cost you are willing to pay for an instance

3. Availability of a given instance

4. Your latency requirements

More information on best practices for selecting the model partitioning, instance type, and other tunable performance parameters is available [here](https://docs.aws.amazon.com/sagemaker/latest/dg/large-model-inference-choosing-instance-types.html).

Hence, based on the use case, you need to:
1. set the configuration parameters for the container.
2. select the appropriate container image to be used for inference.



### Set the configuration parameters using environment variables
1. `SERVING_LOAD_MODELS`: Specifies the model to load and the engine that will be used for this workload. In this case we host the model using the **MPI** engine, which allows the model server to start distributed processes to load and serve the model.

2. `OPTION_MODEL_ID`: Set this to the URI of the Amazon S3 bucket that contains the model. When this is set, the container leverages [s5cmd](https://github.com/peak/s5cmd) to download the model from S3. This enables faster deployments by using an optimized approach within the DJL inference container to transfer the model from S3 to the hosting instance.
If you want to download the model from huggingface.co, you can set `OPTION_MODEL_ID` to the model id of a pre-trained model hosted inside a model repository on huggingface.co (https://huggingface.co/models). The container uses this model id to download the corresponding model repository from huggingface.co. This notebook uses the second approach.

3. `OPTION_TENSOR_PARALLEL_DEGREE`: Set to the number of GPU devices over which the model is partitioned. This parameter also controls the number of workers per model that are started when DJL Serving runs. In this example we use the `ml.p4d.24xlarge` instance, which has 8 GPUs; the value is set to `max` so that each container uses all the GPUs visible to it, that is, the accelerators assigned to its inference component.

4. `OPTION_ROLLING_BATCH`: Enables continuous (iteration-level) batching, which merges concurrent requests that arrive at different times into the same inference batch. The value selects the batching backend: for example, `trtllm` uses [TensorRT-LLM](https://github.com/NVIDIA/TensorRT-LLM), a TensorRT toolbox for optimized Large Language Model inference on NVIDIA GPUs. In this notebook we set this parameter to `lmi-dist`.

5. `OPTION_MAX_ROLLING_BATCH_SIZE`: The maximum number of concurrent requests that the model server places in a batch for inference. Clients can still send more requests to the endpoint; the extra requests are queued.


For more information on the available options, please refer to the [DJL Serving - SageMaker Large Model Inference Configurations](https://github.com/deepjavalibrary/djl-serving/blob/master/serving/docs/lmi/configurations_large_model_inference_containers.md)
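
As a rough sketch of how the options above fit together, the helper below builds the environment dictionary for one model. The name `build_lmi_env` and its arguments are hypothetical; the keys and default values are the ones used in this notebook. All values are converted to strings because container environment variables are passed as strings.

```python
def build_lmi_env(
    model_id: str,
    rolling_batch: str = "lmi-dist",
    max_rolling_batch_size: int = 32,
    tensor_parallel_degree="max",
    dtype: str = "fp16",
) -> dict:
    """Build the LMI container environment for one model (hypothetical helper)."""
    return {
        "HUGGINGFACE_HUB_CACHE": "/tmp",
        "TRANSFORMERS_CACHE": "/tmp",
        # Load the model with the MPI engine.
        "SERVING_LOAD_MODELS": "test::MPI=/opt/ml/model",
        # Hugging Face model id (an S3 URI is the alternative described above).
        "OPTION_MODEL_ID": model_id,
        "OPTION_TRUST_REMOTE_CODE": "true",
        "OPTION_TENSOR_PARALLEL_DEGREE": str(tensor_parallel_degree),
        "OPTION_ROLLING_BATCH": rolling_batch,
        "OPTION_MAX_ROLLING_BATCH_SIZE": str(max_rolling_batch_size),
        "OPTION_DTYPE": dtype,
    }


env_34b = build_lmi_env("codellama/CodeLlama-34b-hf")
env_13b_instruct = build_lmi_env("codellama/CodeLlama-13b-Instruct-hf")
print(env_34b["OPTION_MODEL_ID"], env_13b_instruct["OPTION_MODEL_ID"])
```

Building both dictionaries from one function keeps the two model configurations from drifting apart when a single option is tuned.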



We leverage the DeepSpeed LMI container (`djl-deepspeed`) in this notebook; for other containers, such as the TensorRT-LLM one, refer to [Large Model Inference available DLC](https://github.com/aws/deep-learning-containers/blob/master/available_images.md#large-model-inference-containers).

### When generating a large number of output tokens (> 1024), use the following configuration

For CodeLlama-34B with the fp16 data type, a single model copy fits across 2 A100 GPU devices with 40 GB of high-bandwidth GPU memory per device. In this notebook we assign 4 GPU accelerators to the CodeLlama-34B inference component (see `NumberOfAcceleratorDevicesRequired` further below), which leaves additional GPU memory for long generations, and set the model sharding degree (i.e. `OPTION_TENSOR_PARALLEL_DEGREE`) to "max" (which means the container will use all the CUDA devices visible to the inference container).


In [None]:
env_lmidist_codellama34b = {"HUGGINGFACE_HUB_CACHE": "/tmp",
               "TRANSFORMERS_CACHE": "/tmp",
               "SERVING_LOAD_MODELS": "test::MPI=/opt/ml/model",
               "OPTION_MODEL_ID": "codellama/CodeLlama-34b-hf",
               "OPTION_TRUST_REMOTE_CODE": "true",
               "OPTION_TENSOR_PARALLEL_DEGREE": "max",
               "OPTION_ROLLING_BATCH": "lmi-dist",
               "OPTION_MAX_ROLLING_BATCH_SIZE": "32",
               "OPTION_DTYPE":"fp16"
              }

deepspeed_image_uri = image_uris.retrieve(
    framework="djl-deepspeed", 
    region=sess.boto_session.region_name, 
    version="0.27.0"
)

For CodeLlama-13B-Instruct with the fp16 data type, a single model copy easily fits on 1 A100 GPU device with 40 GB of high-bandwidth GPU memory. Hence, we assign 1 GPU accelerator to the CodeLlama-13B-Instruct inference component and set the model sharding degree (i.e. `OPTION_TENSOR_PARALLEL_DEGREE`) to "max" (which means the container will use all the CUDA devices visible to the inference container).

In [None]:
env_lmidist_codellama13binstruct = {"HUGGINGFACE_HUB_CACHE": "/tmp",
               "TRANSFORMERS_CACHE": "/tmp",
               "SERVING_LOAD_MODELS": "test::MPI=/opt/ml/model",
               "OPTION_MODEL_ID": "codellama/CodeLlama-13b-Instruct-hf",
               "OPTION_TRUST_REMOTE_CODE": "true",
               "OPTION_TENSOR_PARALLEL_DEGREE": "max",
               "OPTION_ROLLING_BATCH": "lmi-dist",
               "OPTION_MAX_ROLLING_BATCH_SIZE": "32",
               "OPTION_DTYPE":"fp16"
              }
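
The GPU sizing above can be sanity-checked with back-of-the-envelope arithmetic: fp16 stores each parameter in 2 bytes, so the weights alone need roughly 2 GB per billion parameters. The sketch below is only an estimate under that assumption; the function names are hypothetical, the parameter counts are the nominal "34B" and "13B" figures, and the KV cache and activations are not included.

```python
import math


def weight_memory_gb(params_billion: float, bytes_per_param: int = 2) -> float:
    """Approximate memory for the weights alone, in GB (1 GB = 1e9 bytes)."""
    return params_billion * bytes_per_param


def min_gpus_for_weights(
    params_billion: float, gpu_memory_gb: float = 40.0, bytes_per_param: int = 2
) -> int:
    """Smallest number of GPUs whose combined memory holds the weights."""
    return math.ceil(weight_memory_gb(params_billion, bytes_per_param) / gpu_memory_gb)


for name, size_b in [("CodeLlama-34b", 34), ("CodeLlama-13b-Instruct", 13)]:
    print(name, weight_memory_gb(size_b), "GB of weights,",
          "at least", min_gpus_for_weights(size_b), "x 40 GB GPU(s)")
```

This gives a lower bound only: the memory that remains after loading the weights is what the server can use for the KV cache of concurrent requests, so assigning more accelerators than the minimum leaves room for larger batches and longer outputs.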


In [None]:
# Select the container image used by both models.
# The DeepSpeed LMI image is used here for generating a large number of output tokens (> 1024).
inference_image_uri = deepspeed_image_uri

print(f"Image going to be used is ---- > {inference_image_uri}")

With the endpoint already in service, the remaining steps are:
- Create a Model for each set of weights using the inference image container

- Create an inference component for each Model and attach it to the endpoint

In this notebook we leverage the boto3 SDK. You can also use the [SageMaker SDK](https://sagemaker.readthedocs.io/en/stable/).

### Create the Model
Leverage the `inference_image_uri` to create a model object.

In [None]:
model_name_34b = sagemaker.utils.name_from_base("lmi-codellama-34b")
print(model_name_34b)

create_model_response = sm_client.create_model(
    ModelName=model_name_34b,
    ExecutionRoleArn=role,
    PrimaryContainer={
        "Image": inference_image_uri,
        "Environment": env_lmidist_codellama34b,
    }
)
model_arn = create_model_response["ModelArn"]

print(f"Created Model: {model_arn}")



In [None]:
model_name_13b = sagemaker.utils.name_from_base("lmi-codellama-13b-instruct")
print(model_name_13b)

create_model_response = sm_client.create_model(
    ModelName=model_name_13b,
    ExecutionRoleArn=role,
    PrimaryContainer={
        "Image": inference_image_uri,
        "Environment": env_lmidist_codellama13binstruct,
    }
)
model_arn = create_model_response["ModelArn"]

print(f"Created Model: {model_arn}")


We assign 4 GPU accelerators to the CodeLlama-34B inference component using the attribute `NumberOfAcceleratorDevicesRequired`. We will host 1 copy of the CodeLlama-34B inference component by setting `CopyCount` to 1 for now. Refer to this [notebook](https://github.com/aws/amazon-sagemaker-examples/tree/main/inference/generativeai/llm-workshop/lab-inference-components-with-scaling) for a demo on configuring an autoscaling policy for the inference component.

In [None]:
variant_name = "AllTraffic"
model_data_download_timeout_in_seconds = 3600
container_startup_health_check_timeout_in_seconds = 3600
min_memory_required_in_mb = 1024  # max memory util is up to 85%
initial_copy_count=1

sm_client.create_inference_component(
    InferenceComponentName=inference_component_name_34b,
    EndpointName=endpoint_name,
    VariantName=variant_name,
    Specification={
        "ModelName": model_name_34b,
        "StartupParameters": {
            "ModelDataDownloadTimeoutInSeconds": model_data_download_timeout_in_seconds,
            "ContainerStartupHealthCheckTimeoutInSeconds": container_startup_health_check_timeout_in_seconds,
        },
        "ComputeResourceRequirements": {
            # "NumberOfCpuCoresRequired": number_of_cpu_cores_required,
            "MinMemoryRequiredInMb": min_memory_required_in_mb,
            "NumberOfAcceleratorDevicesRequired": 4,
        },
    },
    RuntimeConfig={
        "CopyCount": initial_copy_count,
    },
)

We assign 1 GPU accelerator to the CodeLlama-13B-Instruct inference component using the attribute `NumberOfAcceleratorDevicesRequired`. We will host 4 copies of the CodeLlama-13B-Instruct inference component by setting `CopyCount` to 4 for now. Refer to this [notebook](https://github.com/aws/amazon-sagemaker-examples/tree/main/inference/generativeai/llm-workshop/lab-inference-components-with-scaling) for a demo on configuring an autoscaling policy for the inference component.

In [None]:
initial_copy_count=4

sm_client.create_inference_component(
    InferenceComponentName=inference_component_name_13b,
    EndpointName=endpoint_name,
    VariantName=variant_name,
    Specification={
        "ModelName": model_name_13b,
        "StartupParameters": {
            "ModelDataDownloadTimeoutInSeconds": model_data_download_timeout_in_seconds,
            "ContainerStartupHealthCheckTimeoutInSeconds": container_startup_health_check_timeout_in_seconds,
        },
        "ComputeResourceRequirements": {
            # "NumberOfCpuCoresRequired": number_of_cpu_cores_required,
            "MinMemoryRequiredInMb": min_memory_required_in_mb,
            "NumberOfAcceleratorDevicesRequired": 1,
        },
    },
    RuntimeConfig={
        "CopyCount": initial_copy_count,
    },
)
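
Before creating the components, it can help to check that the requested accelerators fit on the endpoint. The sketch below adds up accelerators per copy times copy count for the two components in this notebook and compares the total against the 8 GPUs of one `ml.p4d.24xlarge` instance. The `plan` structure and the helper name are hypothetical and used only for this check.

```python
GPUS_PER_INSTANCE = 8  # ml.p4d.24xlarge has 8 A100 GPUs
MAX_INSTANCES = 1      # mirrors max_instance_count in the endpoint config

# Hypothetical description of the inference components created in this notebook.
plan = [
    {"name": "CodeLlama-34b", "accelerators_per_copy": 4, "copy_count": 1},
    {"name": "CodeLlama-13b-Instruct", "accelerators_per_copy": 1, "copy_count": 4},
]


def total_accelerators(components: list) -> int:
    """Sum the accelerators needed by all copies of all components."""
    return sum(c["accelerators_per_copy"] * c["copy_count"] for c in components)


required = total_accelerators(plan)
available = GPUS_PER_INSTANCE * MAX_INSTANCES
fits = required <= available
print(f"required={required}, available={available}, fits={fits}")
```

With the values used here the plan needs 4 + 4 = 8 accelerators, which exactly fills one instance; adding copies would require raising the maximum instance count.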

In [None]:
import sys
import time


while True:
    desc = sm_client.describe_inference_component(InferenceComponentName=inference_component_name_34b)
    status = desc["InferenceComponentStatus"]
    print(status)
    sys.stdout.flush()
    if status in ["InService", "Failed"]:
        break
    time.sleep(30)

In [None]:
while True:
    desc = sm_client.describe_inference_component(InferenceComponentName=inference_component_name_13b)
    status = desc["InferenceComponentStatus"]
    print(status)
    sys.stdout.flush()
    if status in ["InService", "Failed"]:
        break
    time.sleep(30)
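
The two polling loops above share the same shape and run until a terminal status appears. A minimal sketch of a reusable version with a timeout follows; `wait_for_status` is a hypothetical helper, and the describe call is passed in as a function so the helper does not depend on a particular client.

```python
import time


def wait_for_status(
    describe_fn,
    status_key: str,
    terminal_states=("InService", "Failed"),
    poll_seconds: int = 30,
    timeout_seconds: int = 3600,
    sleep_fn=time.sleep,
) -> str:
    """Poll describe_fn() until status_key reaches a terminal state or time runs out."""
    deadline = time.monotonic() + timeout_seconds
    while True:
        status = describe_fn()[status_key]
        print(status, flush=True)
        if status in terminal_states:
            return status
        if time.monotonic() >= deadline:
            raise TimeoutError(f"still '{status}' after {timeout_seconds} seconds")
        sleep_fn(poll_seconds)


# Example with a stand-in describe function that walks through three statuses.
_fake_statuses = iter(["Creating", "Creating", "InService"])
final = wait_for_status(
    lambda: {"InferenceComponentStatus": next(_fake_statuses)},
    "InferenceComponentStatus",
    sleep_fn=lambda seconds: None,  # skip real sleeping in this demonstration
)
```

In the notebook, `describe_fn` could be `lambda: sm_client.describe_inference_component(InferenceComponentName=inference_component_name_34b)` with the default `sleep_fn`.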


### Invoke the endpoint with a sample prompt

In [None]:
%%time

prompt = """
Table departments, columns = [DepartmentId, DepartmentName]
Table students, columns = [DepartmentId, StudentId, StudentName]
Create a MySQL query for all students in the Computer Science Department
"""

params = { "max_new_tokens": 1500, 
          "do_sample": True,
          "top_p": 0.5,
          "temperature": 0.01,
         }

response_model = smr_client.invoke_endpoint(
    EndpointName=endpoint_name,
    InferenceComponentName=inference_component_name_34b,
    Body=json.dumps(
        {
            "inputs": prompt,
            "parameters": params
        }
    ),
    ContentType="application/json",
)

response_model["Body"].read().decode("utf8")

In [None]:
response_model = smr_client.invoke_endpoint(
    EndpointName=endpoint_name,
    InferenceComponentName=inference_component_name_13b,
    Body=json.dumps(
        {
            "inputs": prompt,
            "parameters": params
        }
    ),
    ContentType="application/json",
)

response_model["Body"].read().decode("utf8")
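
The cells above display the raw response string. A small sketch for pulling out only the generated text is shown below. It assumes the container returns JSON shaped either as `{"generated_text": "..."}` or as a one-element list of such objects; that shape is an assumption and should be checked against the actual output of your container version. The helper name is hypothetical.

```python
import json


def extract_generated_text(raw_body: bytes) -> str:
    """Return the generated text from a JSON response body (assumed shape, see above)."""
    payload = json.loads(raw_body.decode("utf-8"))
    # Some servers wrap the result in a list; unwrap the first element if so.
    if isinstance(payload, list):
        payload = payload[0]
    return payload["generated_text"]


# Stand-in response body used only to illustrate the parsing.
sample_body = b'{"generated_text": "SELECT * FROM students;"}'
print(extract_generated_text(sample_body))
```

Note that `response_model["Body"]` is a stream and can be read only once, so store the result of `.read()` in a variable before passing it to a helper like this.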

## Clean up

In [None]:
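# Delete the inference components before the endpoint that hosts them.
# Deletion is asynchronous, so the endpoint call may need a retry until they are gone.
sm_client.delete_inference_component(InferenceComponentName=inference_component_name_34b)
sm_client.delete_inference_component(InferenceComponentName=inference_component_name_13b)
sm_client.delete_endpoint(EndpointName=endpoint_name)
sm_client.delete_endpoint_config(EndpointConfigName=endpoint_config_name)
sm_client.delete_model(ModelName=model_name_13b)
sm_client.delete_model(ModelName=model_name_34b)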