# Deploying a model with vLLM to Amazon SageMaker AI for Inference

This notebook guides you through the process of deploying any model supported by vLLM on SageMaker AI. The deployment process includes several key steps:

1. **Environment Setup**: Installing necessary dependencies and configuring the AWS environment
2. **Container Infrastructure**: Building and pushing a custom container to Amazon ECR
3. **SageMaker Deployment**: Creating and deploying a SageMaker endpoint

**Prerequisites**

Before starting, ensure you have:
- AWS credentials configured with appropriate permissions
- AWS CLI installed
- A model that is supported by vLLM

**Important Notes**

- The deployment uses an ml.g5.2xlarge instance which provides GPU acceleration necessary for efficient inference

## Environment Setup

First, we'll install `jq`, a lightweight command-line JSON processor. This will be used to parse AWS metadata and credentials later in our deployment process.

In [None]:
!sudo apt-get install -qq -y jq > /dev/null

In [None]:
%pip install boto3 sagemaker pandas huggingface_hub --quiet

In [None]:
%load_ext autoreload
%autoreload 2

Import the required libraries:

In [None]:
import json
import os
import boto3
import sagemaker
import pandas as pd
import ipywidgets as widgets
from IPython.display import display
from sagemaker.model import Model

## Initialize AWS Services

In [None]:
bucket_name = sagemaker.Session().default_bucket()
region = boto3.session.Session().region_name
account_id = boto3.client("sts").get_caller_identity()["Account"]
role = sagemaker.get_execution_role()
session = sagemaker.Session(boto_session=boto3.Session(region_name=region))
sm_client = boto3.client("sagemaker", region_name=region)

## Build and Push Docker Inference Container

Amazon SageMaker AI offers [three primary methods](https://docs.aws.amazon.com/sagemaker/latest/dg/docker-containers.html) for deploying ML models to a SageMaker AI Inference Endpoint:
1. Using pre-built SageMaker containers for standard frameworks like PyTorch or TensorFlow
2. Modifying existing Docker containers with your own dependencies through a requirements.txt file
3. Creating completely custom containers that implement a web server listening for requests (`/invocations`), for maximum flexibility and control over dependencies and requirements

To run the model, we build a custom container from our Dockerfile and push it to Amazon Elastic Container Registry (ECR), making it available to SageMaker when deploying our endpoint.

### Docker Installation

To create our custom container for model serving, we first need Docker installed in our environment. This script handles the installation of Docker and its dependencies, including necessary security keys and repository configurations.

**Install docker-cli**

At the end of the installation, you should see output similar to:

```bash
Client: Docker Engine - Community
 Version:           20.10.24
 API version:       1.41
 Go version:        go1.19.7
 Git commit:        297e128
 Built:             Tue Apr  4 18:21:03 2023
 OS/Arch:           linux/amd64
 Context:           default
 Experimental:      true
```

In [None]:
!bash docker-artifacts/01_docker_install.sh

### Build and push a custom image

SageMaker Inference supports simplified deployment of Qwen2 using Large Model Inference (LMI) container images as indicated [here](https://docs.aws.amazon.com/sagemaker/latest/dg/large-model-inference-container-docs.html) and available images listed [here](https://github.com/aws/deep-learning-containers/blob/master/available_images.md).

However, because new models are released quickly, the available LMI container may not support a new model yet. In these scenarios, we can use a more recent vLLM version that supports the model and build a custom image inside SageMaker Studio using `docker-cli`. If needed, adjust the vLLM version (docker-artifacts/dockerfile) to fit your model and deployment.

**Build our custom Docker image containing custom inference handler and push it to Amazon ECR (Elastic Container Registry).**

In [None]:
REPO_NAME = "vllm-sagemaker"
os.environ['REPO_NAME'] = REPO_NAME

In [None]:
%%bash -s {region} {account_id}

REGION=$1

VERSION_TAG="latest"
CURRENT_ACCOUNT_NUMBER=$2

echo "bash 02_build_and_push.sh $REPO_NAME $VERSION_TAG $REGION $CURRENT_ACCOUNT_NUMBER"
cd docker-artifacts && bash 02_build_and_push.sh $REPO_NAME $VERSION_TAG $REGION $CURRENT_ACCOUNT_NUMBER

### Getting Container Image URI

Retrieve the full URI of our Docker image from ECR. This URI is essential for SageMaker deployment as it tells SageMaker exactly where to find our custom container image. The URI follows the format:
`{account_id}.dkr.ecr.{region}.amazonaws.com/{repository_name}:{tag}`

In [None]:
image_uri = f"{account_id}.dkr.ecr.{region}.amazonaws.com/{REPO_NAME}:latest"
print(f"Base image to deploy a SageMaker endpoint: {image_uri}")

os.environ['CUSTOM_IMAGE'] = image_uri
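
As a quick sanity check of the URI format described above, the following sketch builds a URI from its parts and splits one back apart. The helper names `build_ecr_image_uri` and `parse_ecr_image_uri` are hypothetical (they are not part of boto3 or the SageMaker SDK), the account ID is a placeholder, and the pattern assumes the standard `amazonaws.com` domain.

```python
import re

# Hypothetical helpers for illustration; not part of boto3 or the SageMaker SDK.
# Assumes the standard "amazonaws.com" domain and a 12-digit account ID.
ECR_URI_PATTERN = re.compile(
    r"^(?P<account_id>\d{12})\.dkr\.ecr\.(?P<region>[a-z0-9-]+)\.amazonaws\.com/"
    r"(?P<repository>[^:]+):(?P<tag>.+)$"
)


def build_ecr_image_uri(account_id, region, repository, tag="latest"):
    """Assemble an image URI in the {account}.dkr.ecr.{region}... format."""
    return f"{account_id}.dkr.ecr.{region}.amazonaws.com/{repository}:{tag}"


def parse_ecr_image_uri(uri):
    """Split an image URI into its parts, or raise ValueError if it does not match."""
    match = ECR_URI_PATTERN.match(uri)
    if match is None:
        raise ValueError(f"Not a recognized ECR image URI: {uri}")
    return match.groupdict()


# Placeholder account ID and region, used only to exercise the helpers.
example_uri = build_ecr_image_uri("123456789012", "us-east-1", "vllm-sagemaker")
print(example_uri)
print(parse_ecr_image_uri(example_uri))
```

A check like this can catch a malformed URI (for example, an empty region) before SageMaker tries to pull the image.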

## Understanding the Model Serving Architecture

When we deploy our model to a SageMaker endpoint, here's how the components work together:

1. **Docker Container Structure**:
   - The container runs on the SageMaker instance (ml.g5.2xlarge)

2. **Request Flow**:
   - External requests → SageMaker endpoint → Container's port 8080
   - The `sed` commands we used modified the API paths to match SageMaker's expected structure:
     - `/ping` for health checks
     - `/invocations` for model inference
     - `/invocations/completions` for completion requests

3. **SageMaker Integration**:
   - Routes HTTPS requests to our container
   - Manages container lifecycle
   - Handles authentication and scaling
   - Monitors container health via the `/ping` endpoint

This setup allows us to serve our fine-tuned model with production-grade reliability and performance.
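
To make the request flow concrete, here is a minimal sketch of the contract a custom container has to satisfy: answer `GET /ping` with HTTP 200 and accept `POST /invocations`. It is a stand-in built on Python's standard library, not the vLLM server from our image; the `SketchHandler` class and its echo response are purely illustrative, and the sketch binds to a free local port instead of 8080 so it can run anywhere.

```python
import http.client
import json
import threading
from http.server import BaseHTTPRequestHandler, HTTPServer


class SketchHandler(BaseHTTPRequestHandler):
    """Illustrative handler: health check on /ping, echo on /invocations."""

    def _send_json(self, status, payload):
        body = json.dumps(payload).encode("utf-8")
        self.send_response(status)
        self.send_header("Content-Type", "application/json")
        self.send_header("Content-Length", str(len(body)))
        self.end_headers()
        self.wfile.write(body)

    def do_GET(self):
        if self.path == "/ping":
            self._send_json(200, {"status": "ok"})
        else:
            self._send_json(404, {"error": "not found"})

    def do_POST(self):
        if self.path == "/invocations":
            length = int(self.headers.get("Content-Length", 0))
            request = json.loads(self.rfile.read(length) or b"{}")
            # A real server would run model inference here; we only echo the input.
            self._send_json(200, {"echo": request})
        else:
            self._send_json(404, {"error": "not found"})

    def log_message(self, format, *args):
        pass  # Keep the notebook output quiet.


def call(port, method, path, payload=None):
    """Send one request to the local sketch server and return (status, parsed JSON)."""
    conn = http.client.HTTPConnection("127.0.0.1", port, timeout=5)
    body = json.dumps(payload).encode("utf-8") if payload is not None else None
    headers = {"Content-Type": "application/json"} if body else {}
    conn.request(method, path, body=body, headers=headers)
    response = conn.getresponse()
    result = (response.status, json.loads(response.read()))
    conn.close()
    return result


# Port 0 lets the OS pick a free port; a SageMaker container listens on 8080.
server = HTTPServer(("127.0.0.1", 0), SketchHandler)
threading.Thread(target=server.serve_forever, daemon=True).start()

ping_status, _ = call(server.server_port, "GET", "/ping")
invoke_status, invoke_body = call(server.server_port, "POST", "/invocations", {"prompt": "Hello"})

server.shutdown()
server.server_close()
print(ping_status, invoke_status, invoke_body)
```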

**[Optional] We can run our container interactively in a terminal by using the commands below. Make sure you are using a GPU instance for your JupyterLab space, since model inference requires a GPU.**

In [None]:
def dict_to_env_file(dict_data, output_file='output.env'):
    """
    Convert dictionary to environment file format (VARIABLE=VALUE)
    
    Args:
        dict_data (dict): Dictionary containing environment variables
        output_file (str): Name of the output file (default: output.env)
    """
    try:
        with open(output_file, 'w') as f:
            for key, value in dict_data.items():
                # Convert value to string and escape special characters if needed
                value_str = str(value).replace('"', '\\"')
                # Write each variable in KEY=VALUE format
                f.write(f'{key}={value_str}\n')
        print(f"Successfully wrote environment variables to {output_file}")
    except Exception as e:
        print(f"Error writing to env file: {e}")

In [None]:
# Define environment variables for the model
environment = {
    "HF_TOKEN": "your_token_here",  # replace with your Hugging Face access token
    # "USE_HF_TRANSFER": "true",  # Enable faster downloads
    # "HF_HUB_ENABLE_HF_TRANSFER": "1",
    "SM_VLLM_MODEL": "Qwen/Qwen2.5-VL-3B-Instruct",  # Hugging Face model ID to serve
    "SM_VLLM_LIMIT_MM_PER_PROMPT.IMAGE": "2",  # max number of images allowed in prompt. Increase for multi-page documents. Requires more memory.
    "SM_VLLM_LIMIT_MM_PER_PROMPT.VIDEO": "0",
    "SM_VLLM_MAX_NUM_SEQS": "8",  # decrease if less GPU memory available
    "SM_VLLM_MAX_MODEL_LEN": "38608",  # max context length, decrease if less GPU memory available
    # "SM_VLLM_MAX_MODEL_LEN": "10608",  # max context length, decrease if less GPU memory available
    # "SM_VLLM_DTYPE": "bfloat16"
}

dict_to_env_file(environment)
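
The same dictionary is later passed to SageMaker as the container environment, and the SageMaker `CreateModel` API expects that environment to be a map of strings to strings. A small normalization step guards against accidentally passing integers. The helper name `stringify_env` is hypothetical, and the example values are only for illustration.

```python
def stringify_env(env):
    """Return a copy of env with every key and value converted to str.

    Hypothetical helper for illustration; it does not modify the input dict.
    """
    return {str(key): str(value) for key, value in env.items()}


# Example values only: one integer and one string.
example_env = {"SM_VLLM_MAX_NUM_SEQS": 8, "SM_VLLM_MAX_MODEL_LEN": "38608"}
normalized_env = stringify_env(example_env)
print(normalized_env)
```

In this notebook, that would mean passing `stringify_env(environment)` wherever the environment is handed to SageMaker.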

In [None]:
%%writefile run_container.sh
REPO_NAME=$1

# Get credentials from instance metadata
TOKEN=$(curl -s -X PUT "http://169.254.169.254/latest/api/token" -H "X-aws-ec2-metadata-token-ttl-seconds: 21600")
ROLE_NAME=$(curl -s -H "X-aws-ec2-metadata-token: $TOKEN" http://169.254.169.254/latest/meta-data/iam/security-credentials/)
export $(curl -s -H "X-aws-ec2-metadata-token: $TOKEN" \
  http://169.254.169.254/latest/meta-data/iam/security-credentials/$ROLE_NAME \
  | jq -r '"AWS_ACCESS_KEY_ID="+.AccessKeyId, "AWS_SECRET_ACCESS_KEY="+.SecretAccessKey, "AWS_SESSION_TOKEN="+.Token')

# Now run your docker container with these environment variables
# Add --entrypoint /bin/bash in case you want to manually look into the container
# The container is named so that the "docker logs inference_container" call below can find it
set -x
docker run -d --name inference_container --gpus all --network host -e AWS_ACCESS_KEY_ID -e AWS_SECRET_ACCESS_KEY -e AWS_SESSION_TOKEN --env-file output.env ${REPO_NAME}

echo "Waiting for container to be ready..."
max_attempts=60  # Maximum number of attempts (10 minutes with 10-second intervals)
attempt=1

while [ $attempt -le $max_attempts ]; do
    echo "Attempt $attempt of $max_attempts: Checking if container is ready..."
    
    if curl -s -f http://localhost:8080/ping > /dev/null; then
        echo "Container is ready!"
        break
    fi
    
    if [ $attempt -eq $max_attempts ]; then
        echo "Container failed to become ready within the timeout period"
        docker logs inference_container
        exit 1
    fi
    
    attempt=$((attempt + 1))
    sleep 10
done

# Test chat completions endpoint
echo -e "\nTesting chat completions endpoint..."
curl -X POST http://localhost:8080/invocations \
    -H "Content-Type: application/json" \
    -d '{
        "messages": [
            {"role": "user", "content": "Hello, how are you?"}
        ],
        "model": "Qwen/Qwen2.5-VL-3B-Instruct",
        "temperature": 0.7,
        "max_tokens": 100
    }'

# Test completions endpoint
echo -e "\nTesting completions endpoint..."
curl -X POST http://localhost:8080/invocations/completions \
    -H "Content-Type: application/json" \
    -d '{
        "prompt": "Hello, how are you?",
        "model": "Qwen/Qwen2.5-VL-3B-Instruct",
        "temperature": 0.7,
        "max_tokens": 100
    }'



In [None]:
!bash run_container.sh "{REPO_NAME}"

## Creating a SageMaker Model and Deploying a SageMaker Endpoint

Finally, we'll create a SageMaker model and deploy it to an inference endpoint. This will give us an HTTPS endpoint that we can use for inference.

Note: We're using an ml.g5.2xlarge instance which provides GPU acceleration necessary for efficient inference with a small multimodal model.

For higher throughput, lower latency, or when deploying a bigger model, you might want to use a larger instance type.

In [None]:
hf_model = "Qwen/Qwen2.5-VL-3B-Instruct"
sm_model_name = "qwen25vl3b"
sm_endpoint_name = "vllm-sagemaker-qwen"

print(f"Model name: {sm_model_name}")
print(f"Endpoint name: {sm_endpoint_name}")

Deploy our model to a SageMaker endpoint using an ml.g5.2xlarge instance. This GPU-enabled instance type provides the computational power needed for efficient inference with a Qwen2.5-VL model. The deployment:
- Creates a SageMaker model using our custom container
- Configures the endpoint with the specified resources
- Blocks until the endpoint is ready (`wait=True`)
- Sets up an HTTPS endpoint for inference



In [None]:
# Main deployment logic
from utils.helpers import check_model_exists, check_endpoint_config_exists, check_endpoint_exists, delete_all_resources

endpoint_exists = check_endpoint_exists(endpoint_name=sm_endpoint_name, sm_client=sm_client)
model_exists = check_model_exists(sm_model_name, sm_client=sm_client)
config_exists = check_endpoint_config_exists(sm_endpoint_name, sm_client=sm_client)

if endpoint_exists or model_exists or config_exists:
    print("\nFound existing resources:")
    if endpoint_exists:
        print(f"- Endpoint: {sm_endpoint_name}")
    if model_exists:
        print(f"- Model: {sm_model_name}")
    if config_exists:
        print(f"- Endpoint config: {sm_endpoint_name}")
    
    delete_all_resources(sm_model_name, sm_endpoint_name, sm_client=sm_client)

# # Define environment variables for the model
# environment = {
#     # "USE_HF_TRANSFER": "true",  # Enable faster downloads
#     # "HF_HUB_ENABLE_HF_TRANSFER": "1",
#     "SM_VLLM_MODEL": hf_model, # you can name your model whatever you want
#     # "SM_VLLM_LIMIT_MM_PER_PROMPT": "image=2, video=0", # max number of images allowed in prompt. Increase for multi-page documents. Requires more memory.
#     # "SM_VLLM_MAX_NUM_SEQS":"8", # decrease if less GPU memory available
#     # "SM_VLLM_MAX_MODEL_LEN":"38608", # max context length, decrease if less GPU memory available
#     # "SM_VLLM_DTYPE": "bfloat16"
# }

# If we get here, either nothing existed or we've cleaned up
model = Model(
    image_uri=image_uri,
    role=role,
    sagemaker_session=session,
    name=sm_model_name,
    env=environment,
)

print("\nEndpoint is now being deployed. This may take several minutes.")

# Deploy a new endpoint
model.deploy(
    initial_instance_count=1,
    instance_type="ml.g5.2xlarge",
    # instance_type="local",
    endpoint_name=sm_endpoint_name,
    wait=True
)


## Next steps

After deploying the model as a SageMaker endpoint, we can call the model endpoint to run inference with the sample code in the next notebook [02_consume_model.ipynb](./02_consume_model.ipynb).
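
As a preview, a call to the endpoint might look like the sketch below, which reuses the chat payload from the local container test and sends it through the `invoke_endpoint` call of the boto3 `sagemaker-runtime` client. The helper names `build_chat_payload` and `invoke_chat` are hypothetical, and the code in the next notebook may differ. Only the payload construction runs here; the actual invocation is left as a comment because it needs a deployed endpoint.

```python
import json


def build_chat_payload(prompt, model, max_tokens=100, temperature=0.7):
    """Build a chat-style request body like the one used in run_container.sh."""
    return {
        "messages": [{"role": "user", "content": prompt}],
        "model": model,
        "temperature": temperature,
        "max_tokens": max_tokens,
    }


def invoke_chat(runtime_client, endpoint_name, payload):
    """Send the payload to the endpoint and return the parsed JSON response."""
    response = runtime_client.invoke_endpoint(
        EndpointName=endpoint_name,
        ContentType="application/json",
        Body=json.dumps(payload),
    )
    return json.loads(response["Body"].read())


payload = build_chat_payload("Hello, how are you?", "Qwen/Qwen2.5-VL-3B-Instruct")
print(json.dumps(payload, indent=2))

# Usage against a live endpoint (requires AWS credentials and a deployed endpoint):
# import boto3
# runtime = boto3.client("sagemaker-runtime")
# result = invoke_chat(runtime, "vllm-sagemaker-qwen", payload)
```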