# Deploy Llama 2 7B Chat LMI Model with response streaming on SageMaker



In this notebook, we explore how to host a Llama 2 7B Chat large language model on SageMaker using the DeepSpeed container. We use DJLServing, which is bundled in the Large Model Inference (LMI) container, as the model serving solution in this example. DJLServing is a high-performance, programming-language-agnostic universal model serving solution powered by the Deep Java Library (DJL). To learn more about DJL and DJLServing, you can refer to our blog post (https://aws.amazon.com/blogs/machine-learning/deploy-bloom-176b-and-opt-30b-on-amazon-sagemaker-with-large-model-inference-deep-learning-containers-and-deepspeed/).


Model parallelism can help deploy large models that would normally be too large for a single GPU. With model parallelism, we partition and distribute a model across multiple GPUs. Each GPU holds a different part of the model, resolving the memory capacity issue for the largest deep learning models with billions of parameters. 
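
As a rough illustration of the memory argument above, the following sketch estimates the memory needed per GPU for the model weights alone. The function name and the numbers are assumptions for illustration: it assumes 2 bytes per parameter (16-bit weights), a nominal parameter count, and an even split across GPUs, and it ignores activations and the KV cache, which need additional memory at inference time.

```python
def weight_memory_gib(num_params: float,
                      bytes_per_param: int = 2,
                      tensor_parallel_degree: int = 1) -> float:
    """Approximate per-GPU memory for model weights only, in GiB.

    Assumes the weights are split evenly across the GPUs.
    """
    if tensor_parallel_degree < 1:
        raise ValueError("tensor_parallel_degree must be at least 1")
    total_bytes = num_params * bytes_per_param
    return total_bytes / tensor_parallel_degree / 1024**3


# Nominal 7B parameters in 16-bit precision on a single GPU
single_gpu = weight_memory_gib(7e9)

# Hypothetical 70B-parameter model partitioned across 8 GPUs
sharded = weight_memory_gib(70e9, tensor_parallel_degree=8)

print(f"{single_gpu:.1f} GiB on 1 GPU, {sharded:.1f} GiB per GPU across 8 GPUs")
```

Under these assumptions, the weights of a 7B-parameter model fit within the 24 GB of a single NVIDIA A10G GPU, while much larger models need to be partitioned.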

SageMaker provides a DeepSpeed container that lets you use SageMaker's managed serving capabilities and takes care of the undifferentiated heavy lifting.

In this notebook, we deploy the `meta-llama/Llama-2-7b-chat-hf` model on an `ml.g5.2xlarge` instance.

## Prerequisite
### Hugging Face Account

You need a Hugging Face account. If you do not already have one, sign up with your email at https://huggingface.co/join.

- To access the models available on Hugging Face, especially gated models such as Llama, for fine-tuning and inference, you need a Hugging Face account to obtain a read access token.
- After signing up, [log in](https://huggingface.co/login) and visit https://huggingface.co/settings/tokens to create a read access token.
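
Rather than pasting the token directly into a notebook cell, you may prefer to read it from an environment variable. The helper below is a minimal sketch of that idea; the function name `get_hf_token` and the choice of the `HF_TOKEN` variable name are assumptions made for this example.

```python
import os


def get_hf_token(env_var: str = "HF_TOKEN") -> str:
    """Return the Hugging Face read access token stored in an environment variable.

    The variable name is an assumption for this sketch; use whatever
    name you exported the token under.
    """
    token = os.environ.get(env_var, "").strip()
    if not token:
        raise RuntimeError(
            f"Set the {env_var} environment variable to your "
            "Hugging Face read access token."
        )
    return token
```

The returned value can then be passed wherever a token is required, for example as the `token` argument of `snapshot_download`.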

### Request access to the next version of Llama

Use the same email address to request permission from Meta by visiting this link: https://ai.meta.com/resources/models-and-libraries/llama-downloads/

- The Llama models available via Hugging Face are gated models. The use of Llama models is governed by the Meta license. To download the model weights and tokenizer, visit https://ai.meta.com/resources/models-and-libraries/llama-downloads/ and accept the license before requesting access.
- Within 2 days, you might be granted access to the Llama models via a confirmation email with the subject: [Access granted] Your request to access model meta-llama/Llama-2-13b-chat-hf has been accepted. Though the model id is Llama-2-13b-chat-hf, you should be able to access other variants too.


In [None]:
!pip install -Uq pip
!pip install -Uq sagemaker boto3 huggingface_hub 

In [None]:
import sagemaker
import jinja2
from sagemaker import image_uris
import boto3
import os
import time
import json
from pathlib import Path

In [None]:
role = sagemaker.get_execution_role()  # execution role for the endpoint
sess = sagemaker.session.Session()  # sagemaker session for interacting with different AWS APIs
bucket = sess.default_bucket()  # bucket to house artifacts

In [None]:
s3_prefix = "hf-large-model-djl/meta-llama/Llama-2-7b-chat"
s3_code_prefix = f"{s3_prefix}/code"  # folder within bucket where code artifact will go
s3_model_prefix = f"{s3_prefix}/model"  # folder within bucket where model artifact will go

region = sess.boto_region_name  # public attribute; avoids relying on the private _region_name
account_id = sess.account_id()

s3_client = boto3.client("s3")
sm_client = boto3.client("sagemaker")
smr_client = boto3.client("sagemaker-runtime")

jinja_env = jinja2.Environment()

## Download the model snapshot from Hugging Face and upload the model artifacts on Amazon S3

If you intend to download your own copy of the model and upload it to an S3 location in your AWS account, follow the steps below. Otherwise, you can skip to the next step.

The following snapshot download takes around 4 to 6 minutes.

In [None]:
%%time
from huggingface_hub import snapshot_download
from pathlib import Path
import os

# - This downloads the model into the directory where the Jupyter notebook is running
local_model_path = Path(".")
local_model_path.mkdir(exist_ok=True)
model_name = 'meta-llama/Llama-2-7b-chat-hf'
# Only download the model weights, tokenizer, and configuration files
allow_patterns = ["*.json", "*.txt", "*.model", "*.safetensors", "*.bin", "*.chk", "*.pth"]

# - Use snapshot_download to fetch the model, since it is stored in the repository using Git LFS
model_download_path = snapshot_download(
    repo_id=model_name, 
    cache_dir=local_model_path, 
    allow_patterns=allow_patterns, 
    token='<YOUR_HUGGING_FACE_READ_ACCESS_TOKEN>'
)

Upload the files to the default S3 bucket and store the resulting URI in a variable.

In [None]:
base_model_s3_uri = sess.upload_data(path=model_download_path, key_prefix=s3_model_prefix)
print(f"Model uploaded to --- > {base_model_s3_uri}")
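
To check that the upload landed where expected, it can help to split the returned URI into its bucket and key prefix. The helper below is a small sketch; `split_s3_uri` is a hypothetical name and the example URI uses a placeholder bucket.

```python
from urllib.parse import urlparse


def split_s3_uri(s3_uri: str) -> tuple:
    """Split an s3://bucket/prefix URI into (bucket, key_prefix)."""
    parsed = urlparse(s3_uri)
    if parsed.scheme != "s3" or not parsed.netloc:
        raise ValueError(f"Not a valid S3 URI: {s3_uri!r}")
    return parsed.netloc, parsed.path.lstrip("/")


# Placeholder URI for illustration; in the notebook you would pass base_model_s3_uri
bucket_name, key_prefix = split_s3_uri(
    "s3://example-bucket/hf-large-model-djl/meta-llama/Llama-2-7b-chat/model"
)

# The two parts can then be used to list the uploaded objects, for example:
# s3_client.list_objects_v2(Bucket=bucket_name, Prefix=key_prefix)
```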

In [None]:
# Cleanup locally stored model files post S3 upload
!rm -rf {model_download_path}

## Create a SageMaker-compatible model artifact and upload it to S3

SageMaker Large Model Inference containers can be used to host models without providing your own inference code. This is extremely useful when there is no custom pre-processing of the input data or postprocessing of the model's predictions.

SageMaker expects the model artifacts to be packaged as a tarball. In this example, the tarball contains a single file, `serving.properties`.

The tarball has the following layout:

```
chat_llama2_7b_hf
└── serving.properties
```

`serving.properties` is the configuration file that can be used to configure the model server.


### Create serving.properties file for response streaming

This is a configuration file to indicate to DJL Serving which model parallelization and inference optimization libraries you would like to use. Depending on your need, you can set the appropriate configuration.

Here is a list of the settings that we use in this configuration file:

- `engine`: The runtime engine for DJL to use. The possible values for engine include *Python*, *DeepSpeed*, *FasterTransformer*, and *MPI*. In this case, we set it to MPI (Message Passing Interface), which facilitates partitioning the model across all the available GPUs and thus accelerates inference.
- `option.rolling_batch`: Enables iteration-level batching using one of the supported strategies. Values include `auto`, `scheduler`, and `lmi-dist`. We use `lmi-dist` to turn on continuous batching for Llama 2.
- `option.max_rolling_batch_size`: Limits the number of concurrent requests in the continuous batch. Defaults to 32.
- `option.model_id`: The model id of a pretrained model hosted inside a [model repository on Hugging Face](https://huggingface.co/models) or the S3 path to the model artifact.
- `option.tensor_parallel_degree`: The number of tensor parallel partitions, that is, the number of GPU devices over which the model is partitioned. This parameter also controls the number of workers per model that are started when DJL Serving runs. For example, if we have a 4-GPU machine and create 4 partitions, we have 1 worker per model to serve the requests.
- `option.enable_streaming`: Response streaming reduces the perceived latency of inference, so we set this option to *true*.

For more details on the configuration options and an exhaustive list, refer to the documentation: https://docs.aws.amazon.com/sagemaker/latest/dg/realtime-endpoints-large-model-configuration.html.

Since we serve the model using the DeepSpeed container, and Llama 2 is a large model, we follow the approach described in [Large model inference with DeepSpeed and DJL Serving](https://docs.aws.amazon.com/sagemaker/latest/dg/large-model-inference-tutorials-deepspeed-djl.html).
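
The settings listed above end up as plain `key=value` lines in `serving.properties`. As an alternative view of that format, here is a minimal sketch that renders such lines from a Python dictionary. The helper name `render_serving_properties` and the S3 path are placeholders for illustration; this is not how the notebook itself writes the file.

```python
def render_serving_properties(options: dict) -> str:
    """Render a dict of settings as key=value lines for serving.properties."""
    lines = []
    for key, value in options.items():
        # Booleans are written in lowercase, matching the values used in the file
        if isinstance(value, bool):
            value = "true" if value else "false"
        lines.append(f"{key}={value}")
    return "\n".join(lines) + "\n"


example = render_serving_properties(
    {
        "engine": "MPI",
        "option.model_id": "s3://example-bucket/path/to/model/",  # placeholder path
        "option.tensor_parallel_degree": 1,
        "option.rolling_batch": "lmi-dist",
        "option.max_rolling_batch_size": 16,
        "option.enable_streaming": True,
    }
)
print(example)
```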

In [None]:
!rm -rf chat_llama2_7b_hf
!mkdir -p chat_llama2_7b_hf
model_id = base_model_s3_uri

We also set *enable_streaming* to *true* to obtain a response stream when we run inference against Llama 2. Since we are deploying Llama 2 7B, we set **tensor_parallel_degree** to **1** and use the single NVIDIA A10G GPU available on the [`ml.g5.2xlarge`](https://aws.amazon.com/sagemaker/pricing/#:~:text=Amazon%20SageMaker%20G5%20instance%20product%20details) instance.

See also: [Improve throughput performance of Llama 2 models using Amazon SageMaker](https://aws.amazon.com/blogs/machine-learning/improve-throughput-performance-of-llama-2-models-using-amazon-sagemaker/)

In [None]:
%%writefile chat_llama2_7b_hf/serving.properties
engine = MPI
option.entryPoint=djl_python.huggingface
option.tensor_parallel_degree=1
option.rolling_batch_type=LmiDistRollingBatch
option.rolling_batch=lmi-dist
option.max_rolling_batch_size=16
option.model_loading_timeout=3600
option.model_id={{model_id}}
option.paged_attention=true
option.enable_streaming=true

In [None]:
# Plug the model location into `serving.properties` by rendering the Jinja template in place
props_path = Path("chat_llama2_7b_hf/serving.properties")
template = jinja_env.from_string(props_path.read_text())
props_path.write_text(template.render(model_id=base_model_s3_uri))
!pygmentize chat_llama2_7b_hf/serving.properties | cat -n

Retrieve the image URI for the DJL DeepSpeed container.

In [None]:
inference_image_uri = image_uris.retrieve(
    framework="djl-deepspeed", region=region, version="0.24.0"
)
inference_image_uri

Create the tarball and upload it to S3.

In [None]:
!rm -f model.tar.gz
!tar czvf model.tar.gz chat_llama2_7b_hf
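
If you would rather not shell out to `tar`, the same archive can be built with Python's standard `tarfile` module. The function below is a sketch under that assumption; `make_model_tarball` is a hypothetical helper name, and it returns the member names so you can confirm the layout.

```python
import tarfile
from pathlib import Path


def make_model_tarball(source_dir: str, output_path: str = "model.tar.gz") -> list:
    """Pack source_dir into a gzip-compressed tarball and return its member names."""
    source = Path(source_dir)
    if not source.is_dir():
        raise FileNotFoundError(f"Directory not found: {source_dir}")
    # Store the directory under its own name so the archive root matches the tar command
    with tarfile.open(output_path, "w:gz") as tar:
        tar.add(str(source), arcname=source.name)
    # Re-open the archive to list what was written
    with tarfile.open(output_path, "r:gz") as tar:
        return tar.getnames()


# Example usage in the notebook (not executed here):
# make_model_tarball("chat_llama2_7b_hf")
```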

In [None]:
s3_code_artifact = sess.upload_data("model.tar.gz", bucket, s3_code_prefix)

In [None]:
!rm model.tar.gz

## Deploy Llama 2 7B Chat LMI Model

For guidance on instance selection, see [Choosing instance types for large model inference](https://docs.aws.amazon.com/sagemaker/latest/dg/large-model-inference-choosing-instance-types.html).

We deploy the `meta-llama/Llama-2-7b-chat-hf` model on `ml.g5.2xlarge`.

The steps to deploy the model to a SageMaker endpoint are as follows:

1. Create the model using the container image and the model tarball uploaded earlier.
2. Create the endpoint config using the following key parameters:

    a) Instance type is ml.g5.2xlarge.
    
    b) ContainerStartupHealthCheckTimeoutInSeconds is 3600, which gives the container enough time to load the model before the startup health check times out.
3. Create the endpoint using the endpoint config created in the previous step.


#### Create the Model
Use the image URI for the DJL container and the S3 location to which the tarball was uploaded.

The container downloads the model into the `/tmp` space on the instance because SageMaker maps `/tmp` to the Amazon Elastic Block Store (Amazon EBS) volume that is mounted when we specify the endpoint creation parameter VolumeSizeInGB.
It uses [`s5cmd`](https://github.com/peak/s5cmd), which offers very fast download speeds and is therefore extremely useful when downloading large models.

For instances like p4d, which come with a preconfigured instance storage volume, we can continue to use `/tmp` on the container. The size of this mount is large enough to hold the model.


In [None]:
from sagemaker.utils import name_from_base

model_name = name_from_base("Llama-2-7b-chat-lmi-streaming")

create_model_response = sm_client.create_model(
    ModelName=model_name,
    ExecutionRoleArn=role,
    PrimaryContainer={
        "Image": inference_image_uri,
        "ModelDataUrl": s3_code_artifact,
        "Environment": {"MODEL_LOADING_TIMEOUT": "3600"},
    },
)
model_arn = create_model_response["ModelArn"]

#print(f"Created Model: {model_arn}")

#### Create the Endpoint Config

In [None]:
endpoint_config_name = f"{model_name}-config"

endpoint_name = name_from_base(model_name)

endpoint_config_response = sm_client.create_endpoint_config(
    EndpointConfigName=endpoint_config_name,
    ProductionVariants=[
        {
            "VariantName": "variant1",
            "ModelName": model_name,
            "InstanceType": "ml.g5.2xlarge",
            "InitialInstanceCount": 1,
            "ModelDataDownloadTimeoutInSeconds": 3600,
            "ContainerStartupHealthCheckTimeoutInSeconds": 3600,
        },
    ],
)
#endpoint_config_response

#### Create Endpoint and Deploy the model

In [None]:
create_endpoint_response = sm_client.create_endpoint(
    EndpointName=endpoint_name, EndpointConfigName=endpoint_config_name
)
#print(f"Created Endpoint: {create_endpoint_response['EndpointArn']}")

In [None]:
%%time
import time

resp = sm_client.describe_endpoint(EndpointName=endpoint_name)
status = resp["EndpointStatus"]
print("Status: " + status)

while status == "Creating":
    time.sleep(60)
    resp = sm_client.describe_endpoint(EndpointName=endpoint_name)
    status = resp["EndpointStatus"]
    print("🌊", end='')

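if status == "InService":
    print('🏄‍♂️')
else:
    print('💦')
    # Show the failure reason, if SageMaker reported one
    print("\nFailure reason: " + resp.get("FailureReason", "not reported"))
print("\nArn: " + resp["EndpointArn"])
print("Status: " + status)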
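
The loop above waits for as long as the endpoint stays in the `Creating` state. If you want an upper bound on the wait, a variant with a timeout can be sketched as below. The helper `wait_for_endpoint` and its parameters are assumptions for illustration; it takes the describe call as a function so it can be exercised without touching AWS.

```python
import time


def wait_for_endpoint(describe, poll_seconds=60.0, timeout_seconds=3600.0,
                      sleep=time.sleep):
    """Poll describe() until the endpoint leaves the Creating state.

    describe: zero-argument callable returning a dict with "EndpointStatus".
    Returns the last response; raises TimeoutError if the deadline passes.
    """
    deadline = time.monotonic() + timeout_seconds
    while True:
        resp = describe()
        if resp["EndpointStatus"] != "Creating":
            return resp
        if time.monotonic() >= deadline:
            raise TimeoutError("Endpoint is still being created after the timeout")
        sleep(poll_seconds)


# Example usage in the notebook (not executed here):
# resp = wait_for_endpoint(
#     lambda: sm_client.describe_endpoint(EndpointName=endpoint_name)
# )
```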

#### While you wait for the endpoint to be created, you can read more about:
- [Deep Learning containers for large model inference](https://docs.aws.amazon.com/sagemaker/latest/dg/realtime-endpoints-large-model-dlc.html)

We store the values of the variables `endpoint_name`, `bucket`, and `s3_prefix` to use them in the inference notebook.

In [None]:
%store \
endpoint_name \
bucket \
s3_prefix
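
As a preview of how the stored endpoint name might be used, the sketch below decodes the event stream returned by the SageMaker runtime's `invoke_endpoint_with_response_stream` call, where each event can carry a `PayloadPart` with raw `Bytes`. The helper name `iter_stream_text` is hypothetical, and the request payload format expected by the container is not covered here, so treat the commented usage as an assumption.

```python
import codecs


def iter_stream_text(event_stream):
    """Yield decoded text chunks from a response-stream event iterator.

    An incremental decoder is used so that a multi-byte UTF-8 character
    split across two payload parts is still decoded correctly.
    """
    decoder = codecs.getincrementaldecoder("utf-8")()
    for event in event_stream:
        part = event.get("PayloadPart")
        if part is None:
            continue  # skip events that carry no payload bytes
        text = decoder.decode(part["Bytes"])
        if text:
            yield text
    # Flush anything still buffered in the decoder
    tail = decoder.decode(b"", final=True)
    if tail:
        yield tail


# Example usage in the inference notebook (not executed here; `payload` is hypothetical):
# resp = smr_client.invoke_endpoint_with_response_stream(
#     EndpointName=endpoint_name,
#     Body=json.dumps(payload),
#     ContentType="application/json",
# )
# for text in iter_stream_text(resp["Body"]):
#     print(text, end="")
```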