# SageMaker vLLM endpoint example

## 1. Define some variables

This bring-your-own-container (BYOC) example builds a vLLM endpoint Docker image and stores it in your private ECR repository (for example, `sagemaker_endpoint/vllm`). Define the following variables first.

In [None]:
MODEL_ID = "Qwen/Qwen2-VL-7B-Instruct-AWQ"
INSTANCE_TYPE = "ml.g5.2xlarge"
VLLM_VERSION = "v0.6.4.post1"
REPO_NAMESPACE = "sagemaker_endpoint/vllm"
ACCOUNT = !aws sts get-caller-identity --query Account --output text
REGION = !aws configure get region
ACCOUNT = ACCOUNT[0]
REGION = REGION[0]
if REGION.startswith("cn"):
    # This is an example repository mirrored from vllm/vllm-openai; you can also build your own image in an account in a global region.
    VLLM_REPO = "public.ecr.aws/y0a9p9k0/vllm/vllm-openai"
    CONTAINER = f"{ACCOUNT}.dkr.ecr.{REGION}.amazonaws.com.cn/{REPO_NAMESPACE}:{VLLM_VERSION}"
else:
    VLLM_REPO = "vllm/vllm-openai"
    CONTAINER = f"{ACCOUNT}.dkr.ecr.{REGION}.amazonaws.com/{REPO_NAMESPACE}:{VLLM_VERSION}"
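
The region branching above can be wrapped in a small helper, which makes the URI format easy to check without calling AWS. This is only a minimal sketch: `ecr_image_uri` is a hypothetical helper name (it is not part of this notebook or of any AWS SDK), and the account ID is a placeholder.

```python
def ecr_image_uri(account: str, region: str, repo_namespace: str, tag: str) -> str:
    """Build a private ECR image URI (hypothetical helper mirroring the cell above)."""
    # China regions (names starting with "cn") use the amazonaws.com.cn domain suffix.
    suffix = "amazonaws.com.cn" if region.startswith("cn") else "amazonaws.com"
    return f"{account}.dkr.ecr.{region}.{suffix}/{repo_namespace}:{tag}"


# Placeholder account ID, for illustration only
print(ecr_image_uri("123456789012", "us-west-2", "sagemaker_endpoint/vllm", "v0.6.4.post1"))
```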

## 2. Build the container

The endpoint startup code is in `app/`. The `build_and_push.sh` script builds the image and pushes it to ECR.

**The Docker image only needs to be built once.** Later endpoint deployments can reuse the same image.

In [None]:
cmd = f"VLLM_REPO={VLLM_REPO} VLLM_VERSION={VLLM_VERSION} REPO_NAMESPACE={REPO_NAMESPACE} ACCOUNT={ACCOUNT} REGION={REGION} bash ./build_and_push.sh"
print("Running:", cmd)
!{cmd}

## 3. Deploy on SageMaker

Define the model and deploy it on SageMaker.


In [None]:
%pip install -U boto3 sagemaker transformers huggingface_hub modelscope

### 3.1 Init SageMaker session

In [None]:
import os
import re
import json
from datetime import datetime
import time

import boto3
import sagemaker

sess = sagemaker.Session()
role = sagemaker.get_execution_role()
default_bucket = sess.default_bucket()

sagemaker_client = boto3.client("sagemaker")

### 3.2 Download and upload model file

First, prepare the model weights and upload them to S3. You can download them from HuggingFace or ModelScope, or upload your own model.

You can skip this step if you want vLLM to pull the model automatically when it starts.

In [None]:
model_name = MODEL_ID.replace("/", "-").replace(".", "-")
local_model_path = os.environ['HOME'] + "/models/" + model_name
s3_model_path = f"s3://{default_bucket}/models/" + model_name

%mkdir -p code {local_model_path}

print("local_model_path:", local_model_path)

##### Option 1: Global region (download from HuggingFace)

In [None]:
# !huggingface-cli download --resume-download {MODEL_ID} --local-dir {local_model_path}

##### Option 2: China region (download from ModelScope)

In [None]:
!modelscope download --local_dir {local_model_path} {MODEL_ID} 

#### Upload to S3

In [None]:
!aws s3 sync {local_model_path} {s3_model_path}
print("s3_model_path:", s3_model_path)

### 3.3 Prepare vllm start scripts

Next, write the vLLM startup script for the endpoint. The container automatically uses `start.sh` as the entrypoint.

Adjust the startup script to your needs, in particular the model serving parameters. All parameters are documented at [https://docs.vllm.ai/en/latest/serving/openai_compatible_server.html](https://docs.vllm.ai/en/latest/serving/openai_compatible_server.html)

Here is a simple script that pulls a model from S3 and starts a vLLM server.

In [None]:
endpoint_model_name = sagemaker.utils.name_from_base(model_name, short=True)
local_code_path = endpoint_model_name
s3_code_path = f"s3://{default_bucket}/endpoint_code/vllm_byoc/{endpoint_model_name}.tar.gz"

%mkdir -p {local_code_path}

print("local_code_path:", local_code_path)

with open(f"{local_code_path}/start.sh", "w") as f:
    # Keep the shebang on the very first line of the file
    f.write(f"""#!/bin/bash

# Download the model from S3 to local storage
s5cmd sync {s3_model_path}/* /opt/ml/modelfile/


# Adjust the startup command as needed.
# The port must be $SAGEMAKER_BIND_TO_PORT.

python3 -m vllm.entrypoints.openai.api_server \\
    --port $SAGEMAKER_BIND_TO_PORT \\
    --trust-remote-code \\
    --model /opt/ml/modelfile/
""")

In [None]:
!rm -f {local_code_path}.tar.gz
!tar czvf {local_code_path}.tar.gz {local_code_path}/
!aws s3 cp {local_code_path}.tar.gz {s3_code_path}
print("s3_code_path:", s3_code_path)
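
If the start script in this section needs more server options, one approach is to generate the launch command from a dictionary of flags. The sketch below is an illustration under assumptions: `render_start_script` is a hypothetical helper, the S3 path is a placeholder, and the values given for `--max-model-len` and `--gpu-memory-utilization` are arbitrary examples that you would need to tune for your model and instance type.

```python
def render_start_script(s3_model_path: str, vllm_args: dict) -> str:
    """Render a start.sh body (hypothetical helper; adapt the flags to your model)."""
    lines = [
        "#!/bin/bash",
        "",
        "# Download the model from S3 to local storage",
        f"s5cmd sync {s3_model_path}/* /opt/ml/modelfile/",
        "",
        "python3 -m vllm.entrypoints.openai.api_server \\",
        "    --port $SAGEMAKER_BIND_TO_PORT \\",
        "    --model /opt/ml/modelfile/",
    ]
    for flag, value in vllm_args.items():
        # Add a line continuation to the previous line before appending a new flag
        lines[-1] += " \\"
        # A value of True means a bare switch such as --trust-remote-code
        lines.append(f"    --{flag}" if value is True else f"    --{flag} {value}")
    return "\n".join(lines) + "\n"


script = render_start_script(
    "s3://example-bucket/models/example-model",  # placeholder path
    {"trust-remote-code": True, "max-model-len": 8192, "gpu-memory-utilization": 0.9},
)
print(script)
```

The returned string can be written to `start.sh` in place of the hand-written template.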

### 3.4 Deploy endpoint on SageMaker

In [None]:
# Step 0. create model

# endpoint_model_name was defined in the previous step

create_model_response = sagemaker_client.create_model(
    ModelName=endpoint_model_name,
    ExecutionRoleArn=role,
    PrimaryContainer={
        "Image": CONTAINER,
        "ModelDataUrl": s3_code_path
    },
    
)
print(create_model_response)
print("endpoint_model_name:", endpoint_model_name)

In [None]:
# Step 1. create endpoint config

endpoint_config_name = sagemaker.utils.name_from_base(model_name, short=True)

endpoint_config_response = sagemaker_client.create_endpoint_config(
    EndpointConfigName=endpoint_config_name,
    ProductionVariants=[
        {
            "VariantName": "variant1",
            "ModelName": endpoint_model_name,
            "InstanceType": INSTANCE_TYPE,
            "InitialInstanceCount": 1,
            "ContainerStartupHealthCheckTimeoutInSeconds": 1000,
            # "EnableSSMAccess": True,
        },
    ],
)
print(endpoint_config_response)
print("endpoint_config_name:", endpoint_config_name)

In [None]:
# Step 2. create endpoint

endpoint_name = sagemaker.utils.name_from_base(model_name, short=True)

create_endpoint_response = sagemaker_client.create_endpoint(
    EndpointName=endpoint_name, EndpointConfigName=endpoint_config_name
)
print(create_endpoint_response)
print("endpoint_name:", endpoint_name)
# Poll until the endpoint leaves the "Creating" state
while True:
    status = sagemaker_client.describe_endpoint(EndpointName=endpoint_name)["EndpointStatus"]
    if status != "Creating":
        break
    print(datetime.now().strftime('%Y%m%d-%H:%M:%S') + " status: " + status)
    time.sleep(60)
# The final status may be "InService" or "Failed", so report it explicitly
print("Endpoint", endpoint_name, "finished with status:", status)
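
The polling loop can also be written as a function that raises on failure or timeout, so that later cells do not run against a broken endpoint. The following is a sketch under assumptions: `wait_for_endpoint` and its parameters are hypothetical names, and `describe` stands for any callable shaped like `sagemaker_client.describe_endpoint`.

```python
import time


def wait_for_endpoint(describe, endpoint_name, timeout_s=3600, poll_s=60, sleep=time.sleep):
    """Poll until the endpoint leaves 'Creating'; raise on failure or timeout.

    `describe` is any callable with the shape of
    sagemaker_client.describe_endpoint (injected here so it can be faked in tests).
    """
    waited = 0
    while True:
        response = describe(EndpointName=endpoint_name)
        status = response["EndpointStatus"]
        if status == "InService":
            return status
        if status != "Creating":
            # describe_endpoint includes FailureReason when creation fails
            reason = response.get("FailureReason", "unknown reason")
            raise RuntimeError(f"Endpoint {endpoint_name} ended in {status}: {reason}")
        if waited >= timeout_s:
            raise TimeoutError(f"Endpoint {endpoint_name} still Creating after {waited}s")
        sleep(poll_s)
        waited += poll_s
```

In the notebook this would be called as `wait_for_endpoint(sagemaker_client.describe_endpoint, endpoint_name)`. boto3 also ships a built-in waiter, `sagemaker_client.get_waiter("endpoint_in_service")`, which may be simpler if you do not need custom logging.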

## 4. Test

You can invoke your model with SageMaker runtime.

In [None]:
messages = [{
        "role": "user",
        "content": "Write a quick sort in python"
}]

### 4.1 Message API, non-streaming mode

In [None]:
sagemaker_runtime = boto3.client('runtime.sagemaker')

payload = {
    "messages": messages,
    "max_tokens": 1024,
    "stream": False
}
response = sagemaker_runtime.invoke_endpoint(
    EndpointName=endpoint_name,
    ContentType='application/json',
    Body=json.dumps(payload)
)

print(json.loads(response['Body'].read())["choices"][0]["message"]["content"])
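
Because `MODEL_ID` is a vision-language model, the endpoint may also accept image inputs in the OpenAI-style content format that vLLM's chat API serves. The sketch below only builds such a payload. `image_message` is a hypothetical helper, the image bytes are a placeholder, and whether the request succeeds depends on the deployed model, the vLLM version, and SageMaker payload size limits.

```python
import base64


def image_message(image_bytes: bytes, question: str, mime: str = "image/jpeg") -> dict:
    """Build an OpenAI-style user message with one inline image (hypothetical helper)."""
    encoded = base64.b64encode(image_bytes).decode("ascii")
    return {
        "role": "user",
        "content": [
            # The image is embedded as a base64 data URL
            {"type": "image_url", "image_url": {"url": f"data:{mime};base64,{encoded}"}},
            {"type": "text", "text": question},
        ],
    }


# Placeholder bytes; in practice read a real image, e.g. open(path, "rb").read()
payload = {
    "messages": [image_message(b"\xff\xd8\xff", "Describe this image.")],
    "max_tokens": 256,
}
```

The resulting `payload` can be sent with `invoke_endpoint` in the same way as the text-only payload above.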

### 4.2 Message API, streaming mode

In [None]:
payload = {
    "messages": messages,
    "max_tokens": 1024,
    "stream": True
}

response = sagemaker_runtime.invoke_endpoint_with_response_stream(
    EndpointName=endpoint_name,
    ContentType='application/json',
    Body=json.dumps(payload)
)

buffer = ""
for t in response['Body']:
    buffer += t["PayloadPart"]["Bytes"].decode()
    last_idx = 0
    # re.MULTILINE lets ^ match every event in the buffer, not only the first one
    for match in re.finditer(r'^data:\s*(.+?)(\n\n)', buffer, flags=re.MULTILINE):
        try:
            data = json.loads(match.group(1).strip())
            last_idx = match.span()[1]
            print(data["choices"][0]["delta"]["content"], end="")
        except (json.JSONDecodeError, KeyError, IndexError) as e:
            pass
    buffer = buffer[last_idx:]
print()
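
The buffering logic above can be factored into a reusable generator that also copes with events split across chunks. This is a sketch rather than a definitive parser: `iter_sse_json` is a hypothetical helper, it assumes events are separated by `\n\n` and carry single-line JSON in `data:` fields, and the byte chunks below are simulated instead of coming from a live endpoint.

```python
import codecs
import json


def iter_sse_json(chunks):
    """Yield parsed JSON objects from an iterable of raw SSE byte chunks."""
    # An incremental decoder tolerates multi-byte characters split across chunks.
    decoder = codecs.getincrementaldecoder("utf-8")()
    buffer = ""
    for chunk in chunks:
        buffer += decoder.decode(chunk)
        # Events are separated by a blank line; keep the trailing partial event.
        *events, buffer = buffer.split("\n\n")
        for event in events:
            for line in event.splitlines():
                if not line.startswith("data:"):
                    continue
                data = line[len("data:"):].strip()
                if data == "[DONE]":  # end-of-stream sentinel in OpenAI-style streams
                    return
                yield json.loads(data)


# Simulated stream: one event is split across two chunks, two events arrive together.
chunks = [
    b'data: {"choices": [{"delta": {"content": "Hel',
    b'lo"}}]}\n\ndata: {"choices": [{"delta": {"content": " world"}}]}\n\n',
    b"data: [DONE]\n\n",
]
text = "".join(obj["choices"][0]["delta"].get("content") or "" for obj in iter_sse_json(chunks))
print(text)  # prints: Hello world
```

Against the real endpoint, the chunks would be `(t["PayloadPart"]["Bytes"] for t in response["Body"] if "PayloadPart" in t)`.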

### 4.3 Completion API, non-streaming mode

In [None]:
from transformers import AutoTokenizer
tokenizer = AutoTokenizer.from_pretrained(local_model_path, trust_remote_code=True)

prompt = tokenizer.apply_chat_template(messages, add_generation_prompt=True, tokenize=False)

payload = {
    "prompt": prompt,
    "max_tokens": 1024,
    "stream": False
}

response = sagemaker_runtime.invoke_endpoint(
    EndpointName=endpoint_name,
    ContentType='application/json',
    Body=json.dumps(payload)
)

print(json.loads(response['Body'].read())["choices"][0]["text"])

### 4.4 Completion API, streaming mode

In [None]:
payload = {
    "prompt": prompt,
    "max_tokens": 1024,
    "stream": True
}

response = sagemaker_runtime.invoke_endpoint_with_response_stream(
    EndpointName=endpoint_name,
    ContentType='application/json',
    Body=json.dumps(payload)
)

buffer = ""
for t in response['Body']:
    buffer += t["PayloadPart"]["Bytes"].decode()
    last_idx = 0
    # re.MULTILINE lets ^ match every event in the buffer, not only the first one
    for match in re.finditer(r'^data:\s*(.+?)(\n\n)', buffer, flags=re.MULTILINE):
        try:
            data = json.loads(match.group(1).strip())
            last_idx = match.end()
            # print(data)
            print(data["choices"][0]["text"], end="")
        except (json.JSONDecodeError, KeyError, IndexError) as e:
            pass
    buffer = buffer[last_idx:]
print()