# SageMaker VLLM endpoint example

## 1. Define some variables

This BYOC (bring your own container) example builds a vLLM endpoint Docker image and stores it in your private ECR repository (for example `sagemaker_endpoint/vllm`). Define the following variables first.

In [None]:
MODEL_ID = "deepseek-ai/DeepSeek-R1-Distill-Qwen-1.5B"
INSTANCE_TYPE = "ml.g5.2xlarge"
# better to work with vllm>=v0.7.2
VLLM_VERSION = "v0.7.2"
REPO_NAMESPACE = "sagemaker_endpoint/vllm"
ACCOUNT = !aws sts get-caller-identity --query Account --output text
REGION = !aws configure get region
ACCOUNT = ACCOUNT[0]
REGION = REGION[0]
if REGION.startswith("cn"):
    # this is a container mirror in cn region: https://github.com/nwcdlabs/container-mirror
    VLLM_REPO = "048912060910.dkr.ecr.cn-northwest-1.amazonaws.com.cn/dockerhub/vllm/vllm-openai"
    CONTAINER = f"{ACCOUNT}.dkr.ecr.{REGION}.amazonaws.com.cn/{REPO_NAMESPACE}:{VLLM_VERSION}"
else:
    VLLM_REPO = "vllm/vllm-openai"
    CONTAINER = f"{ACCOUNT}.dkr.ecr.{REGION}.amazonaws.com/{REPO_NAMESPACE}:{VLLM_VERSION}"
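The registry host differs only by its domain suffix between China regions and all other regions. As a minimal sketch, the same logic can be factored into a helper that is easy to check in isolation. `ecr_image_uri` is a hypothetical name, not part of the notebook or the SageMaker SDK, and the account ID is a placeholder.

```python
def ecr_image_uri(account: str, region: str, repo: str, tag: str) -> str:
    """Compose a private ECR image URI, using the .com.cn suffix for China regions."""
    suffix = "amazonaws.com.cn" if region.startswith("cn-") else "amazonaws.com"
    return f"{account}.dkr.ecr.{region}.{suffix}/{repo}:{tag}"


# Placeholder account ID, for illustration only.
print(ecr_image_uri("123456789012", "us-east-1", "sagemaker_endpoint/vllm", "v0.7.2"))
print(ecr_image_uri("123456789012", "cn-northwest-1", "sagemaker_endpoint/vllm", "v0.7.2"))
```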

## 2. Build the container

The endpoint startup code is in `app/`. The script below builds the image and pushes it to ECR.

**The Docker image only needs to be built once.** After that, the same image can be shared when deploying other endpoints.

In [None]:
cmd = f"VLLM_REPO={VLLM_REPO} VLLM_VERSION={VLLM_VERSION} REPO_NAMESPACE={REPO_NAMESPACE} ACCOUNT={ACCOUNT} REGION={REGION} bash ./build_and_push.sh "
print("Running:", cmd)
!{cmd}
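Because the image only needs to be built once, it can be useful to check whether the tag is already in ECR before rebuilding. The sketch below assumes the build script pushes to the repository `REPO_NAMESPACE` with the tag `VLLM_VERSION`, as the `CONTAINER` URI suggests. `image_exists` is a hypothetical helper; it takes a boto3 ECR client as an argument so that it can be exercised without AWS access.

```python
def image_exists(ecr_client, repository: str, tag: str) -> bool:
    """Return True if repository:tag is already present in the private ECR registry."""
    try:
        ecr_client.describe_images(
            repositoryName=repository,
            imageIds=[{"imageTag": tag}],
        )
        return True
    except (
        ecr_client.exceptions.ImageNotFoundException,
        ecr_client.exceptions.RepositoryNotFoundException,
    ):
        # Either the tag or the whole repository is missing, so a build is needed.
        return False


# Usage in the notebook (needs AWS credentials):
#   import boto3
#   if image_exists(boto3.client("ecr"), REPO_NAMESPACE, VLLM_VERSION):
#       print("Image already in ECR, the build cell can be skipped")
```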

## 3. Deploy on SageMaker

Define the model and deploy it on SageMaker.


In [None]:
%pip install -U boto3 sagemaker transformers huggingface_hub modelscope s5cmd datasets

### 3.1 Init SageMaker session

In [None]:
import os
import re
import json
from datetime import datetime
import time

import boto3
import sagemaker

sess = sagemaker.Session()
role = sagemaker.get_execution_role()
default_bucket = sess.default_bucket()

sagemaker_client = boto3.client("sagemaker")

### 3.2 Download and upload model file

First, prepare the model weights and upload them to S3. You can download them from HuggingFace or ModelScope, or upload your own model.

If you want vLLM to pull the model automatically at startup, you can skip this step.

In [None]:
model_name = MODEL_ID.replace("/", "-").replace(".", "-")
local_model_path = "./models/" + model_name
s3_model_path = f"s3://{default_bucket}/models/" + model_name

%mkdir -p code {local_model_path}

print("local_model_path:", local_model_path)
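The two `replace` calls matter because `model_name` is later reused as the base of the SageMaker model name, which only allows letters, digits, and hyphens. A model ID containing other characters, such as an underscore, would still slip through. The sketch below is a slightly more general sanitizer with a validity check. The helper names are hypothetical, and the pattern and 63-character limit are taken from the `ModelName` constraints of the `CreateModel` API as I understand them.

```python
import re

# Assumed ModelName constraints of the CreateModel API: alphanumerics and hyphens,
# no leading or trailing hyphen, at most 63 characters.
_NAME_RE = re.compile(r"^[a-zA-Z0-9]([\-a-zA-Z0-9]*[a-zA-Z0-9])?$")


def to_resource_name(model_id: str) -> str:
    """Replace every run of disallowed characters with a single hyphen."""
    return re.sub(r"[^a-zA-Z0-9-]+", "-", model_id).strip("-")


def is_valid_name(name: str, max_len: int = 63) -> bool:
    """Check a candidate name against the assumed pattern and length limit."""
    return len(name) <= max_len and bool(_NAME_RE.match(name))


name = to_resource_name("deepseek-ai/DeepSeek-R1-Distill-Qwen-1.5B")
print(name, is_valid_name(name))
```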

#### Option 1: Global region (download from HuggingFace)

In [None]:
!huggingface-cli download --resume-download {MODEL_ID} --local-dir {local_model_path} --max-workers 32

#### Option 2: China region (download from ModelScope)

In [None]:
# !modelscope download --local_dir {local_model_path} {MODEL_ID} 

#### Upload to S3

In [None]:
!s5cmd sync --concurrency 32 {local_model_path}/ {s3_model_path}/
print("s3_model_path:", s3_model_path)

### 3.3 Prepare vllm start scripts

Next, write the vLLM startup script for the endpoint. The container automatically uses `start.sh` as its entrypoint.

Adjust the startup script as needed, in particular the model serving parameters. All parameters are documented at [https://docs.vllm.ai/en/latest/serving/openai_compatible_server.html](https://docs.vllm.ai/en/latest/serving/openai_compatible_server.html)
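Since the script tends to change from model to model, one option is to generate it from a few parameters. The sketch below mirrors the flags used in this notebook's `start.sh` and covers both cases mentioned earlier: syncing weights from S3, or letting vLLM pull the model by ID at startup. `render_start_script` is a hypothetical helper, not part of the container code.

```python
from typing import Optional


def render_start_script(
    model_id: str,
    s3_model_path: Optional[str] = None,
    max_model_len: int = 65536,
) -> str:
    """Build the text of start.sh; download from S3 only when a path is given."""
    lines = ["#!/bin/bash", ""]
    if s3_model_path:
        lines += [
            "# download the model weights to local disk",
            f"s5cmd sync --concurrency 64 {s3_model_path}/* /temp/model_weight",
            "",
        ]
        model_arg = "/temp/model_weight"
    else:
        # No S3 copy: vLLM resolves the model ID itself at startup.
        model_arg = model_id
    lines += [
        "python3 -m vllm.entrypoints.openai.api_server \\",
        "    --port $SAGEMAKER_BIND_TO_PORT \\",
        "    --trust-remote-code \\",
        f"    --tensor-parallel-size 1 --max-model-len {max_model_len} --enforce-eager \\",
        f"    --served-model-name {model_id} \\",
        f"    --model {model_arg}",
    ]
    return "\n".join(lines) + "\n"


print(render_start_script("deepseek-ai/DeepSeek-R1-Distill-Qwen-1.5B"))
```

Building the script from a list of lines keeps the shebang on the first line and avoids escaping issues inside a triple-quoted f-string.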

Here is a simple script that pulls a model from S3 and starts a vLLM server.

In [None]:
endpoint_model_name = sagemaker.utils.name_from_base(model_name, short=True)
local_code_path = endpoint_model_name
s3_code_path = f"s3://{default_bucket}/endpoint_code/vllm_byoc/{endpoint_model_name}.tar.gz"

%mkdir -p {local_code_path}

print("local_code_path:", local_code_path)

with open(f"{local_code_path}/start.sh", "w") as f:
    # the shebang must be the first line of the file, so the string starts with it
    f.write(f"""#!/bin/bash

# download the model weights from S3 to local disk
s5cmd sync --concurrency 64 \\
    {s3_model_path}/* /temp/model_weight

# adjust the launch arguments below as needed
# the port must be $SAGEMAKER_BIND_TO_PORT

python3 -m vllm.entrypoints.openai.api_server \\
    --port $SAGEMAKER_BIND_TO_PORT \\
    --trust-remote-code \\
    --tensor-parallel-size 1 --max-model-len 65536 --enforce-eager \\
    --served-model-name {MODEL_ID} \\
    --model /temp/model_weight
""")

In [None]:
!rm -f {local_code_path}.tar.gz
!tar czvf {local_code_path}.tar.gz {local_code_path}/
!aws s3 cp {local_code_path}.tar.gz {s3_code_path}
print("s3_code_path:", s3_code_path)
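The archive can also be built from Python with the standard `tarfile` module, which avoids shelling out and makes it easy to verify the archive contents before uploading. This is a minimal sketch. `make_code_archive` is a hypothetical helper, and the demo uses a temporary directory rather than the notebook's `local_code_path`.

```python
import os
import tarfile
import tempfile


def make_code_archive(code_dir: str, archive_path: str) -> list:
    """Pack code_dir into a gzipped tarball and return the member names."""
    with tarfile.open(archive_path, "w:gz") as tar:
        # arcname keeps the directory name as the top-level entry, like the tar command
        tar.add(code_dir, arcname=os.path.basename(os.path.normpath(code_dir)))
    with tarfile.open(archive_path, "r:gz") as tar:
        return tar.getnames()


# Self-contained demo in a temporary directory.
with tempfile.TemporaryDirectory() as tmp:
    code_dir = os.path.join(tmp, "demo-code")
    os.makedirs(code_dir)
    with open(os.path.join(code_dir, "start.sh"), "w") as f:
        f.write("#!/bin/bash\n")
    members = make_code_archive(code_dir, os.path.join(tmp, "demo-code.tar.gz"))
    print(members)
```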

### 3.4 Deploy endpoint on SageMaker

In [None]:
# Step 0. create model

# endpoint_model_name was defined in the previous step

create_model_response = sagemaker_client.create_model(
    ModelName=endpoint_model_name,
    ExecutionRoleArn=role,
    PrimaryContainer={
        "Image": CONTAINER,
        "ModelDataUrl": s3_code_path
    },
)
print(create_model_response)
print("endpoint_model_name:", endpoint_model_name)

In [None]:
# Step 1. create transform input

# https://aws.amazon.com/cn/blogs/machine-learning/perform-batch-transforms-with-amazon-sagemaker-jumpstart-text2text-generation-large-language-models/

transform_job_name = sagemaker.utils.name_from_base(model_name, short=True)

from datasets import load_dataset
cnn_test = load_dataset('cnn_dailymail','3.0.0',split='test')

# You can specify a prompt here
prompt = "Briefly summarize this text: "
# File name for the test data
test_data_file_name = "articles.jsonl"

test_articles = []

max_tokens = 512

# Build one chat-completion request per article; each request becomes one line of the JSONL input
for idx, test_entry in enumerate(cnn_test):
    article = test_entry['article']
    # Per-record settings such as max_tokens can be varied here if needed
    messages = [{
        "role": "user",
        "content": f"{prompt}{article}"
    }]
    payload = {
        "model": MODEL_ID,
        "messages": messages,
        "max_tokens": max_tokens,
        "stream": False
    }
    test_articles.append(payload)
    if idx > 512:
        break

with open(test_data_file_name, "w") as outfile:
    for entry in test_articles:
        outfile.write("%s\n" % json.dumps(entry))

s3_transform_input_path = f"s3://{default_bucket}/batch_transform_job/{transform_job_name}/input"
s3_transform_output_path = f"s3://{default_bucket}/batch_transform_job/{transform_job_name}/output"

!aws s3 cp {test_data_file_name} {s3_transform_input_path}/{test_data_file_name}
# # Uploading the data        
# s3 = boto3.client("s3")
# s3.upload_file(test_data_file_name, output_bucket, os.path.join(output_prefix + "/batch_input/articles.jsonl"))
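With `SplitType` set to `Line`, each line of the input file is sent to the container as a separate request, so a single malformed or oversized line only shows up once the job is running. A quick local check before uploading can catch that early. The sketch below is a hypothetical helper. The 6 MB limit is the default `MaxPayloadInMB` for batch transform, and the required keys are an assumption based on the payload built above.

```python
import json

MAX_PAYLOAD_BYTES = 6 * 1024 * 1024  # default MaxPayloadInMB for batch transform is 6


def check_jsonl(lines, required=("model", "messages"), max_bytes=MAX_PAYLOAD_BYTES):
    """Return a list of (line_number, problem) for records that would likely fail."""
    problems = []
    for n, line in enumerate(lines, start=1):
        if len(line.encode("utf-8")) > max_bytes:
            problems.append((n, "record exceeds payload limit"))
            continue
        try:
            record = json.loads(line)
        except json.JSONDecodeError:
            problems.append((n, "not valid JSON"))
            continue
        if not isinstance(record, dict):
            problems.append((n, "record is not a JSON object"))
            continue
        missing = [key for key in required if key not in record]
        if missing:
            problems.append((n, f"missing keys: {missing}"))
    return problems


sample = [
    json.dumps({"model": "m", "messages": [{"role": "user", "content": "hi"}]}),
    "{not json",
    json.dumps({"model": "m"}),
]
print(check_jsonl(sample))
```

In the notebook this could be called as `check_jsonl(open(test_data_file_name))` before the `aws s3 cp` step.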


In [None]:
# Step 2. create transform job

response = sagemaker_client.create_transform_job(
    TransformJobName=transform_job_name,
    ModelName=endpoint_model_name,
    MaxConcurrentTransforms=32,
    BatchStrategy='SingleRecord',
    TransformInput={
        'DataSource': {
            'S3DataSource': {
                'S3DataType': 'S3Prefix',
                'S3Uri': s3_transform_input_path,
            }
        },
        'ContentType': 'application/jsonlines',
        'SplitType': 'Line'
    },
    TransformOutput={
        'S3OutputPath': s3_transform_output_path,
        'Accept': "text/csv",
        'AssembleWith': 'Line',
    },
    TransformResources={
        'InstanceType': INSTANCE_TYPE,
        'InstanceCount': 1,
    },
)

In [None]:
while True:
    status = sagemaker_client.describe_transform_job(TransformJobName=transform_job_name)["TransformJobStatus"]
    if status != "InProgress":
        break
    print(datetime.now().strftime('%Y%m%d-%H:%M:%S') + " status: " + status)
    time.sleep(60)
print("Batch transform job finished:", status)
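The loop above stops on any status other than `InProgress`, so it exits early on `Stopping` and treats a failed job like a finished one. A slightly more defensive variant waits for a terminal status, surfaces `FailureReason`, and gives up after a timeout. This is a sketch. `wait_for_job` is a hypothetical helper that takes the describe call as a function so it can be tried without AWS access.

```python
import time

TERMINAL = {"Completed", "Failed", "Stopped"}


def wait_for_job(describe, poll_seconds=60, timeout_seconds=6 * 3600, sleep=time.sleep):
    """Poll describe() until the job reaches a terminal status or the timeout passes."""
    waited = 0
    while True:
        desc = describe()
        status = desc["TransformJobStatus"]
        if status in TERMINAL:
            if status == "Failed":
                raise RuntimeError(desc.get("FailureReason", "transform job failed"))
            return status
        if waited >= timeout_seconds:
            raise TimeoutError(f"job still {status} after {waited} seconds")
        sleep(poll_seconds)
        waited += poll_seconds


# Usage in the notebook (needs AWS credentials):
#   wait_for_job(lambda: sagemaker_client.describe_transform_job(
#       TransformJobName=transform_job_name))
```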

## 4. Result

The results are written to `s3_transform_output_path`, with one `.out` file per input file.

In [None]:
!aws s3 cp {s3_transform_output_path}/{test_data_file_name}.out .

# Print only the first response
with open(f"{test_data_file_name}.out") as f:
    for output_line in f:
        output = json.loads(output_line)
        print(output["choices"][0]["message"]["content"])
        break
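To process every response rather than just the first, the output file can be read line by line. The sketch below assumes each line holds one chat-completion response, which is what `AssembleWith='Line'` should produce. It skips lines that do not have the expected shape, such as an error body returned for a single record. `extract_contents` is a hypothetical helper.

```python
import json


def extract_contents(lines):
    """Yield the assistant message text from each chat-completion response line."""
    for line in lines:
        line = line.strip()
        if not line:
            continue
        try:
            response = json.loads(line)
            yield response["choices"][0]["message"]["content"]
        except (json.JSONDecodeError, KeyError, IndexError, TypeError):
            # Not a well-formed chat-completion response; skip this record.
            continue


sample_lines = [
    json.dumps({"choices": [{"message": {"role": "assistant", "content": "Summary A"}}]}),
    json.dumps({"error": "request too long"}),
    "",
    json.dumps({"choices": [{"message": {"role": "assistant", "content": "Summary B"}}]}),
]
print(list(extract_contents(sample_lines)))
```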