# Deploying Large Language Models Using the vLLM Backend for General Inference


In this tutorial, you will use the vLLM backend of the Large Model Inference (LMI) Deep Learning Container (DLC) to deploy a Hugging Face model on Amazon SageMaker, then use boto3 to test inference in both streaming and non-streaming modes.

Before proceeding, make sure your machine has enough free disk space to hold the downloaded model weights.


## Step 1: Set up the development environment

In [None]:
!pip install "sagemaker>=2.216.0" --upgrade --quiet

In [None]:
!pip install huggingface_hub jinja2

In [None]:
import sagemaker
import boto3
sess = sagemaker.Session()
# SageMaker session bucket: used for uploading data, models, and logs.
# SageMaker creates this bucket automatically if it does not exist.
sagemaker_session_bucket=None
if sagemaker_session_bucket is None and sess is not None:
    # set to default bucket if a bucket name is not given
    sagemaker_session_bucket = sess.default_bucket()

try:
    role = sagemaker.get_execution_role()
except ValueError:
    iam = boto3.client('iam')
    role = iam.get_role(RoleName='sagemaker_execution_role')['Role']['Arn']

sess = sagemaker.Session(default_bucket=sagemaker_session_bucket)

print(f"sagemaker role arn: {role}")
print(f"sagemaker session region: {sess.boto_region_name}")


## Step 2: Start preparing model artifacts
The LMI container expects the following artifacts to set up the model:

- serving.properties (required): defines the model server settings
- model.py (optional): a Python file that defines the core inference logic
- requirements.txt (optional): any additional pip packages to install

For the purpose of this tutorial, we will focus only on the `serving.properties` file.

### Download the model locally and upload it to S3

Please skip this step if you already possess a model on S3, whether it's a downloaded or a fine-tuned version.


Update your model ID and Hugging Face token.

In [None]:
model_id = "****"
huggingface_token = "****"

In [None]:
from huggingface_hub import snapshot_download
# Download the model repository from the Hugging Face Hub
model_directory = snapshot_download(
    model_id, token=huggingface_token,
    local_dir=f"/home/ec2-user/SageMaker/{model_id}",
    ignore_patterns=["*.pth", "original/*"],
)
print(f"Downloaded model {model_id} to {model_directory}")

In [None]:
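# Upload the downloaded model directory to S3
from sagemaker.s3 import S3Uploader

S3Uploader.upload(
    # Use the path returned by snapshot_download so the upload works from any working directory
    local_path=model_directory,
    desired_s3_uri=f"s3://{sagemaker_session_bucket}/models/{model_id}",
    sagemaker_session=sess,
)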

### Prepare the serving.properties

Update the model location to the correct S3 path. If you have fine-tuned a model stored in S3, please change the value to reflect your specific S3 bucket location.

In [None]:
import jinja2
import os
from pathlib import Path

# Define the directory path
deployment_path = "deployment"

# Check if the directory exists. If not, create it.
os.makedirs(deployment_path, exist_ok=True)

jinja_env = jinja2.Environment()

# Render the template with the S3 model location and write serving.properties
template = jinja_env.from_string(Path("serving.template").read_text())
Path(f"{deployment_path}/serving.properties").write_text(
    template.render(model_id=f"s3://{sagemaker_session_bucket}/models/{model_id}")
)
!pygmentize deployment/serving.properties | cat -n
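
The `serving.template` file itself is not shown in this tutorial. As a rough idea of what a rendered `serving.properties` for the vLLM backend might look like, here is a minimal sketch. The option keys follow LMI configuration conventions, but the specific values, the helper name `render_serving_properties`, and the S3 path are assumptions for illustration and may differ from the actual template.

```python
def render_serving_properties(model_location: str, tensor_parallel_degree: int = 1) -> str:
    """Build the text of a serving.properties file (illustrative values only)."""
    settings = {
        "engine": "Python",
        "option.model_id": model_location,            # S3 URI or Hugging Face model ID
        "option.rolling_batch": "vllm",               # select the vLLM backend
        "option.tensor_parallel_degree": tensor_parallel_degree,
        "option.max_rolling_batch_size": 32,          # assumed value, tune for your workload
    }
    return "\n".join(f"{key}={value}" for key, value in settings.items()) + "\n"


# Hypothetical bucket and model path, for illustration only
rendered = render_serving_properties("s3://my-bucket/models/my-org/my-model")
print(rendered)
```

Comparing the output of the `pygmentize` command above against a layout like this can help spot a template variable that was not substituted.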

Pack your `serving.properties` file into a tar archive:

In [None]:
%%sh
mkdir -p mymodel
rm -f mymodel.tar.gz
mv deployment/serving.properties mymodel/
tar czvf mymodel.tar.gz mymodel/
rm -rf mymodel
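
If you prefer to stay in Python, or want to check the archive layout before uploading, the same packaging step can be sketched with the standard `tarfile` module. The helper name `pack_model_dir` and the sample file content are assumptions for illustration; the sketch works in a temporary directory rather than touching the files created above.

```python
import tarfile
import tempfile
from pathlib import Path


def pack_model_dir(properties_text: str, archive_path: Path) -> list:
    """Write serving.properties into mymodel/ inside a gzip tarball and return the member names."""
    with tempfile.TemporaryDirectory() as tmp:
        model_dir = Path(tmp) / "mymodel"
        model_dir.mkdir()
        (model_dir / "serving.properties").write_text(properties_text)
        with tarfile.open(archive_path, "w:gz") as tar:
            # arcname keeps the temporary directory prefix out of the archive
            tar.add(model_dir, arcname="mymodel")
    # Reopen the archive to confirm what was actually written
    with tarfile.open(archive_path, "r:gz") as tar:
        return tar.getnames()


with tempfile.TemporaryDirectory() as out_dir:
    members = pack_model_dir("engine=Python\n", Path(out_dir) / "mymodel.tar.gz")
    print(members)
```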

## Step 3: Start building SageMaker endpoint

Getting the container image URI

In [None]:
from sagemaker import image_uris

image_uri = image_uris.retrieve(
    framework="djl-deepspeed",
    region=sess.boto_session.region_name,
    version="0.27.0",
)

Upload the artifact to S3 and create the SageMaker model

In [None]:
from sagemaker import Model

s3_code_prefix = f"large-model-vllm/{model_id}_code"
bucket = sess.default_bucket()  # bucket to house artifacts
code_artifact = sess.upload_data("mymodel.tar.gz", bucket, s3_code_prefix)
print(f"S3 code or model tarball uploaded to: {code_artifact}")

model = Model(image_uri=image_uri, model_data=code_artifact, role=role)

Deploy the SageMaker endpoint. The next cell does the following:

- `instance_type = "ml.g5.2xlarge"`: sets the instance type that hosts the endpoint. An ml.g5.2xlarge has a single NVIDIA A10G GPU with 24 GB of GPU memory. If you plan to use longer token lengths, you may need a larger instance type.
- `endpoint_name = sagemaker.utils.name_from_base(f"lmi-model-{model_id.replace('/', '-')}")`: generates a unique name for the endpoint from the model ID. Slashes are replaced with hyphens because endpoint names may contain only alphanumeric characters and hyphens.
- `model.deploy(...)`: deploys the model to the SageMaker endpoint with these parameters:
   * `initial_instance_count=1`: use one instance of the specified type.
   * `instance_type`: the instance type defined above.
   * `endpoint_name`: the unique name generated for the endpoint.
   * `container_startup_health_check_timeout=1800`: the time, in seconds, allowed for the container to pass its startup health check, so the deployment fails instead of hanging indefinitely if something goes wrong during startup.
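
Model IDs can contain characters other than slashes that are not valid in endpoint names, such as dots or underscores. A more defensive way to build the base name might look like the sketch below. The helper `sanitize_endpoint_base`, its default length cap, and the sample model ID are assumptions for illustration and are not part of the SageMaker SDK.

```python
import re


def sanitize_endpoint_base(model_id: str, prefix: str = "lmi-model", max_length: int = 40) -> str:
    """Turn a model ID into a base name made only of alphanumerics and single hyphens."""
    raw = f"{prefix}-{model_id}"
    # Collapse every run of disallowed characters into one hyphen
    cleaned = re.sub(r"[^a-zA-Z0-9]+", "-", raw).strip("-")
    # Cap the length (assumed value) to leave room for a unique suffix, and avoid a trailing hyphen
    return cleaned[:max_length].rstrip("-")


# Hypothetical model ID, for illustration only
base_name = sanitize_endpoint_base("my-org/My_Model-v1.5")
print(base_name)
```

The result could then be passed to `sagemaker.utils.name_from_base` in place of the `model_id.replace('/', '-')` expression.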

In [None]:
# Set the instance type; update this if using larger token lengths
instance_type = "ml.g5.2xlarge"
endpoint_name = sagemaker.utils.name_from_base(f"lmi-model-{model_id.replace('/', '-')}")
print(f"endpoint_name: {endpoint_name}")

model.deploy(initial_instance_count=1,
             instance_type=instance_type,
             endpoint_name=endpoint_name,
             container_startup_health_check_timeout=1800
        )

## Step 4: Run inference

In the example below, we demonstrate the inference process using a sample question as follows:

In [None]:
question = "tell me about Harry Potter in 100 words"

### Normal request

To query a SageMaker endpoint, use the `invoke_endpoint` API of the SageMaker Runtime service. It sends your input data to the deployed model and returns the prediction in the response body.

In [None]:
input_data = {
    "inputs": f"<|begin_of_text|><|start_header_id|>user<|end_header_id|>\n\n{question}<|eot_id|><|start_header_id|>assistant<|end_header_id|>\n\n\n\n", 
    "parameters": {"max_new_tokens":1024}
}
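
The special tokens in the prompt above match the Llama 3 instruct chat format, so the prompt only suits models trained with that template. Assuming that is the model family you deployed, a small helper can keep the formatting in one place. The function name `build_llama3_prompt` and the optional system message are assumptions for illustration; note that this sketch ends the assistant header with two newlines, as in the published Llama 3 format, rather than the four used in the cell above.

```python
from typing import Optional


def build_llama3_prompt(user_message: str, system_message: Optional[str] = None) -> str:
    """Format a single-turn prompt in the Llama 3 instruct chat layout."""
    parts = ["<|begin_of_text|>"]
    if system_message:
        parts.append(
            f"<|start_header_id|>system<|end_header_id|>\n\n{system_message}<|eot_id|>"
        )
    parts.append(f"<|start_header_id|>user<|end_header_id|>\n\n{user_message}<|eot_id|>")
    # Leave the assistant turn open so the model generates the reply
    parts.append("<|start_header_id|>assistant<|end_header_id|>\n\n")
    return "".join(parts)


prompt = build_llama3_prompt("tell me about Harry Potter in 100 words")
with_system = build_llama3_prompt("hello", system_message="You are a concise assistant.")
print(prompt)
```

For models from other families, replace the template with the one documented for that model.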

In [None]:
import json
# Create a SageMaker runtime client with the AWS SDK
client = boto3.client('sagemaker-runtime')

# Convert the input data to JSON string
payload = json.dumps(input_data)

# Set the content type for the endpoint, adjust if different for your model
content_type = "application/json"

# Invoke the SageMaker endpoint
response = client.invoke_endpoint(
        EndpointName=endpoint_name,
        ContentType=content_type,
        Body=payload
)

# The response body is a byte stream; read it fully and decode it as UTF-8.
result = response['Body'].read().decode('utf-8')

print(result)
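
The decoded result is a JSON string. Assuming the container returns an object with a `generated_text` field (the response schema is an assumption here and can vary with the container version and configuration), the text can be extracted as in this sketch. The helper name `extract_generated_text` and the sample body are hypothetical.

```python
import json


def extract_generated_text(raw_body: str) -> str:
    """Pull the generated text out of a JSON response body (assumed schema)."""
    parsed = json.loads(raw_body)
    # Some containers wrap the result in a single-element list
    if isinstance(parsed, list):
        parsed = parsed[0]
    return parsed["generated_text"]


# Hypothetical response body, for illustration only
sample_body = '{"generated_text": "Harry Potter is a young wizard."}'
generated = extract_generated_text(sample_body)
print(generated)
```

If the key is missing, printing the raw `result` first shows the actual structure your endpoint returns.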

### Streaming

`invoke_endpoint_with_response_stream` is the SageMaker Runtime API for streaming responses from a deployed endpoint. Instead of a single complete body, the response arrives as a series of events, each carrying a chunk of bytes.

The `LineIterator` is copied from https://github.com/deepjavalibrary/djl-demo/blob/master/aws/sagemaker/large-model-inference/sample-llm/utils/LineIterator.py
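
Because the event stream delivers arbitrary byte chunks, a JSON line can be split across two events. The sketch below illustrates the buffering idea behind a line iterator; it is a simplified stand-in written for this explanation, not the linked implementation. The function name `iter_lines` and the fake events are assumptions, and the `PayloadPart`/`Bytes` keys mirror the event structure returned by boto3.

```python
import json


def iter_lines(event_stream):
    """Yield complete newline-terminated lines from a stream of PayloadPart events."""
    buffer = b""
    for event in event_stream:
        part = event.get("PayloadPart")
        if part is None:
            continue  # skip events that carry no payload
        buffer += part["Bytes"]
        # Emit every complete line currently in the buffer
        while b"\n" in buffer:
            line, buffer = buffer.split(b"\n", 1)
            yield line.decode("utf-8")
    if buffer:
        yield buffer.decode("utf-8")  # trailing data without a final newline


# Fake events in which the first JSON line is split across two chunks
fake_events = [
    {"PayloadPart": {"Bytes": b'{"token": {"text": "Hel'}},
    {"PayloadPart": {"Bytes": b'lo"}}\n{"token": {"text": " world"}}\n'}},
]
lines = list(iter_lines(fake_events))
text = "".join(json.loads(line)["token"]["text"] for line in lines)
print(text)
```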

In [None]:
import json
import boto3
from utils.LineIterator import LineIterator

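smr_client = boto3.client("sagemaker-runtime")


def get_realtime_response_stream(sagemaker_runtime, endpoint_name, payload):
    # Invoke the endpoint and return the streaming response
    response_stream = sagemaker_runtime.invoke_endpoint_with_response_stream(
        EndpointName=endpoint_name,
        Body=json.dumps(payload),
        ContentType="application/json"
    )
    return response_stream


def print_response_stream(response_stream):
    event_stream = response_stream.get('Body')
    last_error_line = ''  # holds a line that failed to parse, to retry with the next one
    for line in LineIterator(event_stream):
        if isinstance(line, bytes):
            line = line.decode('utf-8')
        try:
            print(json.loads(last_error_line + line)["token"]["text"], end='')
            last_error_line = ''
        except (json.JSONDecodeError, KeyError, TypeError):
            # Incomplete or unexpected line: keep it and try again with the next chunk
            last_error_line = line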

In [None]:
payload = {
    "inputs": f"<|begin_of_text|><|start_header_id|>user<|end_header_id|>\n\n{question}<|eot_id|><|start_header_id|>assistant<|end_header_id|>\n\n\n\n",
    "parameters": {
        "max_new_tokens": 1024,
        "stop": ["<|start_header_id|>", "<|end_header_id|>", "<|eot_id|>", "<|reserved_special_token"]
    },
    "stream": True  # request a streamed response
}
response_stream = get_realtime_response_stream(smr_client, endpoint_name, payload)
print_response_stream(response_stream)

## Clean up resources

In [None]:
sess.delete_endpoint(endpoint_name)
sess.delete_endpoint_config(endpoint_name)
model.delete_model()


## Further reading

Please visit https://github.com/deepjavalibrary/djl-demo/tree/master/aws/sagemaker/large-model-inference/sample-llm 
for more LMI usage tutorials.