# **Deploy DeepSeek-Coder-V2 with vLLM on SageMaker Endpoint using LMI container from DJL.**

## Use DJL with the SageMaker Python SDK
- The SageMaker Python SDK lets you host models on Amazon SageMaker using the Deep Java Library. <BR>
- Deep Java Library (DJL) Serving is a high-performance, general-purpose, standalone model serving solution provided by DJL. It can load models trained with a variety of frameworks. <BR>
- With the SageMaker Python SDK, you can host large models on DJL Serving using backends such as DeepSpeed and HuggingFace Accelerate. <BR>
- For the supported versions of DJL Serving, see the [AWS documentation](https://docs.aws.amazon.com/deep-learning-containers/latest/devguide/deep-learning-containers-images.html). <BR>
- Using the latest supported version is recommended, because that is where development effort is focused. <BR>
- For general information on using the SageMaker Python SDK, see [Using the SageMaker Python SDK](https://sagemaker.readthedocs.io/en/v2.139.0/overview.html#using-the-sagemaker-python-sdk).
    
REF: [BLOG] [Deploy LLM with vLLM on SageMaker in only 13 lines of code](https://mrmaheshrajput.medium.com/deploy-llm-with-vllm-on-sagemaker-in-only-13-lines-of-code-1601f780c0cf)

## 1. Deploy model on SageMaker

In [1]:
import boto3
import sagemaker
from sagemaker import get_execution_role



- [Available DLC (Deep Learning Containers)](https://github.com/aws/deep-learning-containers/blob/master/available_images.md)

In [2]:
role = get_execution_role()
region = boto3.Session().region_name
sagemaker_session = sagemaker.session.Session()
smr_client = boto3.client("sagemaker-runtime")

### Setup Configuration


In [5]:
model_id = "deepseek-ai/DeepSeek-Coder-V2-Lite-Instruct"
#model_id = "deepseek-ai/DeepSeek-Coder-V2-Instruct"

In [7]:
container_uri = sagemaker.image_uris.retrieve(
    framework="djl-lmi", version="0.29.0", region=region
)
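if model_id == "deepseek-ai/DeepSeek-Coder-V2-Lite-Instruct":
    instance_type = "ml.g5.12xlarge"
elif model_id == "deepseek-ai/DeepSeek-Coder-V2-Instruct":
    # The p4de family is offered only in the 24xlarge size.
    instance_type = "ml.p4de.24xlarge"
else:
    raise ValueError(f"No instance type configured for model_id: {model_id}")
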
endpoint_name = sagemaker.utils.name_from_base("DeepSeek-Coder-V2-Instruct")

print(f"container_uri: {container_uri}")
print(f"instance_type: {instance_type}")
print(f"endpoint_name: {endpoint_name}")

container_uri: 763104351884.dkr.ecr.us-east-1.amazonaws.com/djl-inference:0.29.0-lmi11.0.0-cu124
instance_type: ml.g5.12xlarge
endpoint_name: DeepSeek-Coder-V2-Instruct-2024-08-21-07-42-56-186


### Create model with env variables


- Target model: [DeepSeek-Coder-V2-Lite-Instruct](https://huggingface.co/deepseek-ai/DeepSeek-Coder-V2-Lite-Instruct)

In [None]:
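deploy_env = {
    "HF_MODEL_ID": model_id,  # Hugging Face Hub model to load
    "OPTION_ROLLING_BATCH": "vllm",  # use vLLM as the rolling (continuous) batching backend
    "OPTION_TENSOR_PARALLEL_DEGREE": "max",  # shard the model across all GPUs on the instance
    "OPTION_MAX_ROLLING_BATCH_SIZE": "2",  # maximum number of requests batched concurrently
    "OPTION_DTYPE": "fp16",  # load weights in half precision
    "OPTION_TRUST_REMOTE_CODE": "true",  # allow the custom model code shipped with the repo
    "OPTION_MAX_MODEL_LEN": "8192",  # maximum sequence length (prompt plus generated tokens)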
}
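
The LMI container reads the settings above from environment variables. Under the DJL LMI convention, an `OPTION_*` variable corresponds to an `option.*` key in a `serving.properties` file, and `HF_MODEL_ID` corresponds to `option.model_id`. The helper below is a minimal sketch of that mapping, useful when comparing this configuration against DJL documentation written in `serving.properties` form. The function name `env_to_serving_properties` is hypothetical and not part of any SDK, and the sample values simply mirror the cell above.

```python
def env_to_serving_properties(env):
    """Render LMI-style environment variables as serving.properties lines.

    Hypothetical helper for illustration only; it is not part of the
    SageMaker SDK or DJL.
    """
    lines = []
    for key, value in env.items():
        if key == "HF_MODEL_ID":
            # HF_MODEL_ID is the environment-variable form of option.model_id.
            lines.append(f"option.model_id={value}")
        elif key.startswith("OPTION_"):
            # OPTION_FOO_BAR -> option.foo_bar
            lines.append(f"option.{key[len('OPTION_'):].lower()}={value}")
        else:
            # Anything else is left as a comment so nothing is silently dropped.
            lines.append(f"# unmapped: {key}={value}")
    return "\n".join(lines)


example_env = {
    "HF_MODEL_ID": "deepseek-ai/DeepSeek-Coder-V2-Lite-Instruct",
    "OPTION_ROLLING_BATCH": "vllm",
    "OPTION_TENSOR_PARALLEL_DEGREE": "max",
}
print(env_to_serving_properties(example_env))
```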

In [None]:
model = sagemaker.Model(
    image_uri=container_uri, 
    role=role,
    env=deploy_env
)

### Deploy model

In [None]:
model.deploy(
    instance_type=instance_type,
    initial_instance_count=1,
    endpoint_name=endpoint_name,
    container_startup_health_check_timeout=900
)
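
Loading a large model can take several minutes, which is why the startup health-check timeout is raised above. If the deployment is started from another process, or the notebook kernel is restarted, the endpoint status can be checked separately. The following is a minimal sketch that polls `describe_endpoint` on a `boto3` SageMaker client; the function name `wait_until_in_service` and its polling defaults are assumptions chosen for illustration.

```python
import time


def wait_until_in_service(sm_client, endpoint_name, poll_seconds=30, max_polls=60):
    """Poll the endpoint status until it is InService, fails, or polling runs out.

    `sm_client` is expected to behave like boto3.client("sagemaker").
    Hypothetical helper; the defaults are illustrative, not recommendations.
    """
    for _ in range(max_polls):
        desc = sm_client.describe_endpoint(EndpointName=endpoint_name)
        status = desc["EndpointStatus"]
        if status == "InService":
            return desc
        if status == "Failed":
            reason = desc.get("FailureReason", "unknown")
            raise RuntimeError(f"Endpoint {endpoint_name} failed: {reason}")
        time.sleep(poll_seconds)
    raise TimeoutError(f"Endpoint {endpoint_name} not InService after {max_polls} polls")


# Usage (requires AWS credentials and an existing endpoint):
# import boto3
# sm_client = boto3.client("sagemaker")
# wait_until_in_service(sm_client, endpoint_name)
```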

## 2. Invocation (Generate Text using the endpoint)

### Get a predictor for your endpoint

In [None]:
predictor = sagemaker.Predictor(
    endpoint_name=endpoint_name,
    sagemaker_session=sagemaker_session,
    serializer=sagemaker.serializers.JSONSerializer(),
    deserializer=sagemaker.deserializers.JSONDeserializer(),
)

### Make a prediction with your endpoint

- **question candidates**
    - write a quick sort algorithm in python.
    - Write a piece of quicksort code in C++.

In [None]:
outputs = predictor.predict(
    {
        "inputs": "write a quick sort algorithm in python.",
        "parameters": {"do_sample": True, "max_new_tokens": 256},
    }
)

print(outputs["generated_text"])
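
The same request can also be sent without the `Predictor` wrapper, for example from an application that only depends on `boto3`. The sketch below assumes a client created with `boto3.client("sagemaker-runtime")` (like `smr_client` earlier in this notebook) and the same `generated_text` response field used above. The function name `generate` is hypothetical.

```python
import json


def generate(runtime_client, endpoint_name, prompt, **parameters):
    """Send one non-streaming request and return the generated text.

    Hypothetical helper; `runtime_client` is expected to behave like
    boto3.client("sagemaker-runtime").
    """
    payload = {"inputs": prompt, "parameters": parameters}
    response = runtime_client.invoke_endpoint(
        EndpointName=endpoint_name,
        ContentType="application/json",
        Body=json.dumps(payload),
    )
    # The response body is a stream of bytes containing a JSON document.
    result = json.loads(response["Body"].read())
    return result["generated_text"]


# Usage (requires AWS credentials and a deployed endpoint):
# text = generate(smr_client, endpoint_name,
#                 "Write a piece of quicksort code in C++.",
#                 do_sample=True, max_new_tokens=256)
# print(text)
```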

### Streaming output from the endpoint


In [None]:
import io
import json
from pprint import pprint

In [None]:
class LineIterator:
    """
    A helper class for parsing the byte stream input.

    The output of the model will be in the following format:
    ```
    b'{"outputs": [" a"]}\n'
    b'{"outputs": [" challenging"]}\n'
    b'{"outputs": [" problem"]}\n'
    ...
    ```

    While usually each PayloadPart event from the event stream will contain a byte array
    with a full json, this is not guaranteed and some of the json objects may be split across
    PayloadPart events. For example:
    ```
    {'PayloadPart': {'Bytes': b'{"outputs": '}}
    {'PayloadPart': {'Bytes': b'[" problem"]}\n'}}
    ```

    This class accounts for this by concatenating bytes written via the 'write' function
    and then exposing a method which will return lines (ending with a '\n' character) within
    the buffer via the 'scan_lines' function. It maintains the position of the last read
    position to ensure that previous bytes are not exposed again.
    """

    def __init__(self, stream):
        self.byte_iterator = iter(stream)
        self.buffer = io.BytesIO()
        self.read_pos = 0

    def __iter__(self):
        return self

    def __next__(self):
        while True:
            self.buffer.seek(self.read_pos)
            line = self.buffer.readline()
            if line and line[-1] == ord("\n"):
                self.read_pos += len(line)
                return line[:-1]
            try:
                chunk = next(self.byte_iterator)
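            except StopIteration:
                # Stream exhausted: flush trailing bytes that lack a final newline
                # instead of looping forever on a partial line.
                if line:
                    self.read_pos += len(line)
                    return line
                raise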
            if "PayloadPart" not in chunk:
                # chunk is a dict, so pass it as a separate argument rather than concatenating.
                print("Unknown event type:", chunk)
                continue
            self.buffer.seek(0, io.SEEK_END)
            self.buffer.write(chunk["PayloadPart"]["Bytes"])
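
The reassembly behavior described in the `LineIterator` docstring can be tried offline, without an endpoint. The sketch below is a simplified generator version of the same idea (not the class above) run against simulated `PayloadPart` events in which one JSON object is split across two events. The name `iter_lines` and the sample events are illustrative assumptions.

```python
import json


def iter_lines(events):
    """Yield complete lines from a sequence of PayloadPart events.

    Simplified stand-in for LineIterator, for illustration only.
    """
    buffer = b""
    for event in events:
        part = event.get("PayloadPart")
        if part is None:
            continue  # skip event types that carry no payload bytes
        buffer += part["Bytes"]
        # Emit every complete line currently in the buffer.
        while b"\n" in buffer:
            line, buffer = buffer.split(b"\n", 1)
            yield line
    if buffer:
        yield buffer  # trailing data without a final newline


# Simulated stream: the second JSON object is split across two events.
simulated_events = [
    {"PayloadPart": {"Bytes": b'{"outputs": [" a"]}\n{"outputs": '}},
    {"PayloadPart": {"Bytes": b'[" problem"]}\n'}},
]

for line in iter_lines(simulated_events):
    print(json.loads(line))
```

Each printed object is a complete JSON document even though the raw events did not align with line boundaries.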

In [None]:
stop_token = "\n"  # Check the stop token for your model

In [None]:
# Create the request body and set 'stream' to True
body = {
    "inputs": "write a quick sort algorithm in python.",
    "parameters": {
        "max_new_tokens": 400,
        # "return_full_text": False  # This does not work with Phi3
    },
    "stream": True,
}

In [None]:
%%time
# Invoke the endpoint
resp = smr_client.invoke_endpoint_with_response_stream(
    EndpointName=endpoint_name, Body=json.dumps(body), ContentType="application/json"
)

# Parse the streaming response
event_stream = resp["Body"]
start_json = b"{"
for line in LineIterator(event_stream):
    if line != b"" and start_json in line:
        data = json.loads(line[line.find(start_json) :].decode("utf-8"))
        if data["token"]["text"] != stop_token:
            print(data["token"]["text"], end="")
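
Note that this notebook shows two different line shapes: the `LineIterator` docstring uses `{"outputs": [...]}`, while the loop above reads `data["token"]["text"]`. Which shape an endpoint actually emits may depend on the container version and its output settings, so that should be checked against a real response. As a hedged sketch, the helpers below accept either shape and collect the streamed text into a single string. The names `extract_text` and `collect_text` are hypothetical.

```python
import json


def extract_text(data):
    """Return the text carried by one parsed streaming line.

    Handles the two shapes that appear in this notebook; any other shape
    yields an empty string. Hypothetical helper for illustration.
    """
    if "token" in data:
        return data["token"].get("text", "")
    if "outputs" in data:
        return "".join(data["outputs"])
    return ""


def collect_text(lines, stop_token=None):
    """Join the text of all streamed lines, skipping the stop token."""
    pieces = []
    for line in lines:
        start = line.find(b"{")
        if start == -1:
            continue  # empty line or no JSON object on this line
        text = extract_text(json.loads(line[start:].decode("utf-8")))
        if text != stop_token:
            pieces.append(text)
    return "".join(pieces)


# Usage with a real stream (requires a deployed endpoint):
# full_text = collect_text(LineIterator(resp["Body"]), stop_token=stop_token)
```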

In [None]:
# Create the request body and set 'stream' to True
body = {
    "inputs": "The meaning of life",
    "parameters": {
        "max_new_tokens": 400,
        # "return_full_text": False  # This does not work with Phi3
    },
    "stream": True,
}