# **Deploy DeepSeek-Coder-V2 from S3 with vLLM on a SageMaker Endpoint using the DJL LMI container**

## Use DJL with the SageMaker Python SDK
- With the SageMaker Python SDK, you can host models on Amazon SageMaker using the Deep Java Library (DJL).
- This notebook is nearly identical to 1_deepseek-deploy-djl-lmi.ipynb, with the following differences:
    - Instead of pulling the [deepseek-ai/DeepSeek-Coder-V2-Lite-Instruct](https://huggingface.co/deepseek-ai/DeepSeek-Coder-V2-Lite-Instruct) (16B) model from Hugging Face, the SageMaker Endpoint is created from model files stored in S3. Because the source of the model files differs, this notebook performs the following steps:
        - Download deepseek-ai/DeepSeek-Coder-V2-Lite-Instruct to local storage.
        - Upload the locally downloaded model files to S3.
- This notebook can also be used to upload fine-tuned model files (weights and model definition) to S3 and create a SageMaker endpoint from them.

### Prerequisites
- Run [0_setup.ipynb](0_setup.ipynb) before running this notebook.
- Use the conda_pytorch_p310 kernel.

## 1. Deploy model on SageMaker

In [None]:
import boto3
import sagemaker
from sagemaker import get_execution_role

- [Available DLC (Deep Learning Containers)](https://github.com/aws/deep-learning-containers/blob/master/available_images.md)

In [None]:
role = get_execution_role()
region = boto3.Session().region_name
sagemaker_session = sagemaker.session.Session()
# Get the default bucket name
default_bucket = sagemaker_session.default_bucket()
print(f"Default SageMaker bucket name: {default_bucket}")

sm_client = boto3.client("sagemaker", region_name=region)
sm_runtime_client = boto3.client("sagemaker-runtime")
sm_autoscaling_client = boto3.client("application-autoscaling")

### Setup Configuration


 - [[DOC] DJL for serving](https://docs.djl.ai/master/docs/serving/serving/docs/lmi/index.html)
 - The recommended instance type is ml.g5.12xlarge.
     - ml.p4d.24xlarge can also be used for higher performance.

In [None]:
model_id = "deepseek-ai/DeepSeek-Coder-V2-Lite-Instruct"
instance_type = "ml.g5.12xlarge"
# instance_type = "ml.p4d.24xlarge"

container_uri = "763104351884.dkr.ecr.us-east-1.amazonaws.com/djl-inference:0.32.0-lmi14.0.0-cu124"
endpoint_name = sagemaker.utils.name_from_base("DeepSeek-Coder-V2-Instruct")
model_name = sagemaker.utils.name_from_base("DeepSeek-Coder-V2-Instruct")
container_startup_health_check_timeout = 120 # seconds

print(f'model_id: {model_id}')
print(f'container_uri: {container_uri}')
print(f'instance_type: {instance_type}')
print(f'model_name: {model_name}')
print(f'endpoint_name: {endpoint_name}')
print(f'container_startup_health_check_timeout: {container_startup_health_check_timeout}')

### LMI container image: v1.0-djl-0.32.0-inf-lmi-14.0.0
* Release date: Feb 8, 2025
    * https://github.com/aws/deep-learning-containers/releases/tag/v1.0-djl-0.32.0-inf-lmi-14.0.0
* Docker Image
    * 763104351884.dkr.ecr.us-west-2.amazonaws.com/djl-inference:0.32.0-lmi14.0.0-cu124
    * 763104351884.dkr.ecr.us-east-1.amazonaws.com/djl-inference:0.32.0-lmi14.0.0-cu124

As of 2025-02-12, the container version above has not yet been added to the SageMaker SDK.
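
The two image URIs listed above differ only in the region segment. The following is a minimal sketch of deriving the URI from a region name. It assumes the same registry account and tag are published in the target region, which this notebook only shows for us-east-1 and us-west-2. `build_lmi_image_uri` is a hypothetical helper, not part of the SageMaker SDK.

```python
def build_lmi_image_uri(region, tag="0.32.0-lmi14.0.0-cu124", account="763104351884"):
    """Compose the ECR URI of the DJL LMI image for a region.

    Assumption: the image is published under the same account and tag in
    the target region, as it is for the two regions listed above.
    """
    if not region:
        raise ValueError("region must be a non-empty string")
    return f"{account}.dkr.ecr.{region}.amazonaws.com/djl-inference:{tag}"


print(build_lmi_image_uri("us-west-2"))
```

Passing the `region` value obtained from `boto3.Session().region_name` would keep the image and the endpoint in the same region.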

### Download model
- Download the model from Hugging Face to local storage.

In [None]:
from huggingface_hub import snapshot_download

# Set to True to download the model from Hugging Face
# is_needed_download_model = True
is_needed_download_model = False

local_model_path = "./deepseek-coder-v2"

if is_needed_download_model:
    model_path = snapshot_download(
        repo_id="deepseek-ai/DeepSeek-Coder-V2-Lite-Instruct",
        local_dir=local_model_path,  # local directory to save the model to
        local_dir_use_symlinks=False  # download actual files instead of symlinks
    )
else:
    print("model is already downloaded")
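
Before uploading, it can help to confirm that the local directory actually holds the model files. The following is a minimal sketch, assuming the download produced a directory containing `config.json` (standard for Hugging Face model repositories). `summarize_model_dir` is a hypothetical helper, and the demo runs on a throwaway directory rather than the real `./deepseek-coder-v2` path.

```python
import os
import tempfile


def summarize_model_dir(model_dir):
    """Return (file_count, total_bytes, has_config) for a local model directory."""
    file_count = 0
    total_bytes = 0
    # Walk the directory tree and accumulate the number and size of files
    for root, _dirs, files in os.walk(model_dir):
        for name in files:
            file_count += 1
            total_bytes += os.path.getsize(os.path.join(root, name))
    has_config = os.path.isfile(os.path.join(model_dir, "config.json"))
    return file_count, total_bytes, has_config


# Demo on a temporary directory standing in for the downloaded model
with tempfile.TemporaryDirectory() as demo_dir:
    with open(os.path.join(demo_dir, "config.json"), "w") as f:
        f.write("{}")
    print(summarize_model_dir(demo_dir))
```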



### Upload model files to S3
- Upload the local model files to S3.

In [None]:
import sagemaker


def upload_files_to_s3(local_path, bucket, key_prefix):
    """Upload a local file or directory to S3 and return the resulting S3 URI."""
    sagemaker_session = sagemaker.Session()

    # upload_data accepts a file or directory path, uploads it to S3,
    # and returns the S3 URI of the uploaded data
    s3_path = sagemaker_session.upload_data(
        path=local_path,
        bucket=bucket,  # passing None falls back to the default SageMaker bucket
        key_prefix=key_prefix,
    )

    print(f"Uploaded to: {s3_path}")
    return s3_path


# Set to True to upload the local model files to S3
# is_needed_upload_model = True
is_needed_upload_model = False

bucket_key_prefix = "deepseek"
if is_needed_upload_model:
    s3_model_path = upload_files_to_s3(
        local_path=local_model_path,
        bucket=default_bucket,
        key_prefix=bucket_key_prefix,
    )
else:
    # Previously uploaded location; replace with your own bucket and prefix
    s3_model_path = "s3://sagemaker-us-east-1-057716757052/deepseek"
    print("model is already uploaded")
    print("s3 model path: ", s3_model_path)

In [None]:
! aws s3 ls {s3_model_path}/ --recursive

In [None]:
# ! aws s3 rm s3://sagemaker-us-east-1-057716757052/deepseek --recursive

### Create model with env variables


- Target model: [DeepSeek-Coder-V2-Lite-Instruct](https://huggingface.co/deepseek-ai/DeepSeek-Coder-V2-Lite-Instruct)

- **[Backend for attention computation in vLLM](https://docs.vllm.ai/en/latest/serving/env_vars.html)**
    - Available options:
        - "TORCH_SDPA": use torch.nn.MultiheadAttention
        - "FLASH_ATTN": use FlashAttention
        - "XFORMERS": use XFormers
        - "ROCM_FLASH": use ROCmFlashAttention
        - "FLASHINFER": use flashinfer

- **"OPTION_DISABLE_FLASH_ATTN": "false"** is for HF Accelerate with Seq-Scheduler
    - It is ignored when using the vLLM backend

- [[DOC] DJL-Container and Model Configurations (info. about properties)](https://docs.djl.ai/master/docs/serving/serving/docs/lmi/deployment_guide/configurations.html)

### HF_MODEL_ID is removed from the environment variables below
- Unlike the 1_deepseek-deploy-djl-lmi.ipynb notebook, `"HF_MODEL_ID": model_id` is removed.
- This is because the model files are loaded from S3.

In [None]:
deploy_env = {
    "OPTION_ROLLING_BATCH": "vllm",
    "OPTION_TENSOR_PARALLEL_DEGREE": "max",
    "OPTION_MAX_ROLLING_BATCH_SIZE": "2",
    "OPTION_DTYPE": "fp16",
    "OPTION_TRUST_REMOTE_CODE": "true",
    "OPTION_MAX_MODEL_LEN": "8192",
    "VLLM_ATTENTION_BACKEND": "XFORMERS",
    "OPTION_GPU_MEMORY_UTILIZATION": "0.9",  # cap on GPU memory utilization (default 0.9)
    "VLLM_MAX_NUM_SEQS": "16",  # limit on the number of sequences processed concurrently
}
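
Every value in `deploy_env` above is written as a string, which matches the string-to-string environment map that a SageMaker container definition takes. The following is a minimal sketch for the case where settings are kept as native Python values elsewhere. `to_env_strings` is a hypothetical helper, and lowercase `"true"`/`"false"` is chosen only to match the style used in the cell above.

```python
def to_env_strings(settings):
    """Convert a dict of settings into a string-to-string environment map."""
    env = {}
    for key, value in settings.items():
        if isinstance(value, bool):
            # Lowercase booleans, matching the "true" used in deploy_env above
            env[key] = "true" if value else "false"
        else:
            env[key] = str(value)
    return env


example = to_env_strings({
    "OPTION_TRUST_REMOTE_CODE": True,
    "OPTION_MAX_MODEL_LEN": 8192,
    "OPTION_GPU_MEMORY_UTILIZATION": 0.9,
})
print(example)
```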

In [None]:
model_s3_path = {
    'S3DataSource': {
        'S3Uri': f'{s3_model_path}/',
        'S3DataType': 'S3Prefix',
        'CompressionType': 'None',
    }
}
print("model_s3_path: \n", model_s3_path)
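
The cell above appends a trailing slash to the S3 prefix by hand, which yields a double slash if the prefix already ends with one. The following is a minimal sketch of a builder that normalizes the prefix first. `build_model_data_source` is a hypothetical helper, and the bucket name in the demo is a placeholder.

```python
def build_model_data_source(s3_prefix):
    """Build an uncompressed-model data source dict for an S3 prefix."""
    scheme = "s3://"
    if not s3_prefix.startswith(scheme):
        raise ValueError(f"expected an s3:// URI, got {s3_prefix!r}")
    # Drop any surrounding slashes from the bucket/key part, then add exactly one
    bucket_and_key = s3_prefix[len(scheme):].strip("/")
    if not bucket_and_key:
        raise ValueError("the S3 URI must include a bucket name")
    return {
        "S3DataSource": {
            "S3Uri": f"{scheme}{bucket_and_key}/",
            "S3DataType": "S3Prefix",
            "CompressionType": "None",
        }
    }


# "my-example-bucket" is a placeholder, not a bucket used by this notebook
print(build_model_data_source("s3://my-example-bucket/deepseek//"))
```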

In [None]:
model = sagemaker.Model(
    image_uri=container_uri, 
    model_data=model_s3_path,
    role=role,
    sagemaker_session=sagemaker_session,
    name = model_name,
    env=deploy_env,
)

### Deploy model

In [None]:
%%time

model.deploy(
    instance_type=instance_type,
    initial_instance_count=1,
    endpoint_name=endpoint_name,
    container_startup_health_check_timeout=container_startup_health_check_timeout
)

## 2. Invocation (Generate Text using the endpoint)

### Get a predictor for your endpoint

In [None]:
predictor = sagemaker.Predictor(
    endpoint_name=endpoint_name,
    sagemaker_session=sagemaker_session,
    serializer=sagemaker.serializers.JSONSerializer(),
    deserializer=sagemaker.deserializers.JSONDeserializer(),
)

### Make a prediction with your endpoint

- **question candidates**
    - write a quick sort algorithm in python.
    - Write a piece of quicksort code in C++.

In [None]:
outputs = predictor.predict(
    {
        "inputs": "write a quick sort algorithm in python and description",
        "parameters": {"do_sample": True, "max_new_tokens": 2048},
    }
)

print(outputs["generated_text"])

- **With chat template**
    - [DJL Chat Completions API Schema](https://docs.djl.ai/master/docs/serving/serving/docs/lmi/user_guides/chat_input_output_schema.html)

In [None]:
chat = [
    {"role": "user", "content": "Hello, how are you?"},
    {"role": "assistant", "content": "I'm doing great. How can I help you today?"},
    {"role": "user", "content": "I'd like to show off how chat templating works! anyway, write a quick sort algorithm in python and description"},
]

result = predictor.predict(
    {"messages": chat, "max_tokens": 1024}
)
result
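
The chat call above returns the whole response object. The following is a minimal sketch of pulling out just the assistant text, assuming the endpoint returns an OpenAI-style chat completion as described in the linked schema (a `choices` list whose entries carry a `message` with `content`). `chat_reply_text` is a hypothetical helper, and `sample_response` is made-up data shaped after that schema, not real model output.

```python
def chat_reply_text(response):
    """Return the assistant text from a chat-completions style response dict."""
    choices = response.get("choices") or []
    if not choices:
        return ""
    message = choices[0].get("message") or {}
    return message.get("content") or ""


# Hypothetical response for illustration only
sample_response = {
    "choices": [
        {
            "index": 0,
            "message": {"role": "assistant", "content": "def quick_sort(items): ..."},
        }
    ]
}
print(chat_reply_text(sample_response))
```

With a live endpoint, `chat_reply_text(result)` would be the corresponding call, provided the response follows the assumed shape.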

## 3. Streaming output from the endpoint


In [None]:
import json
import random 

In [None]:
# List of prompts covering a variety of coding tasks
prompts = [
    "write a quick sort algorithm in python.",
    "Write a Python function to implement a binary search algorithm.",
    "Create a JavaScript function to flatten a nested array.",
    "Implement a simple REST API using Flask in Python.",
    "Write a SQL query to find the top 5 customers by total purchase amount.",
    "Create a React component for a todo list with basic CRUD operations.",
    "Implement a depth-first search algorithm for a graph in C++.",
    "Write a bash script to find and delete files older than 30 days.",
    "Create a Python class to represent a deck of cards with shuffle and deal methods.",
    "Write a regular expression to validate email addresses.",
    "Implement a basic CI/CD pipeline using GitHub Actions."
]

def generate_payload():
    # Pick a prompt at random
    prompt = random.choice(prompts)
    
    # Build the JSON payload
    body = {
        "inputs": prompt,
        "parameters": {
            "max_new_tokens": 400,
            # "return_full_text": False  # This does not work with Phi3
        },
        "stream": True,
    }
    
    # Serialize the JSON to a string and encode it as bytes
    return json.dumps(body).encode('utf-8')

In [None]:
%%time
# Invoke the endpoint
resp = sm_runtime_client.invoke_endpoint_with_response_stream(
    EndpointName=endpoint_name, 
    # Body=json.dumps(body), 
    Body=generate_payload(), 
    
    ContentType="application/json"
)
print("Generated response:")
print("-" * 40)

buffer = ""
for event in resp['Body']:
    if 'PayloadPart' in event:
        chunk = event['PayloadPart']['Bytes'].decode()
        buffer += chunk
        try:
            # Try to parse the buffer as JSON
            data = json.loads(buffer)
            if 'token' in data:
                print(data['token']['text'], end='', flush=True)
            buffer = ""  # Clear the buffer after successful parsing
        except json.JSONDecodeError:
            # If parsing fails, keep the buffer for the next iteration
            pass

print("\n" + "-" * 40)
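
The loop above clears its buffer only when the whole buffer parses as a single JSON object. If the endpoint streams newline-delimited JSON (an assumption here, not something the cell verifies), one chunk can carry several objects, and a single object can be split across chunks. The following is a minimal sketch of line-based buffering. `JsonLinesBuffer` is a hypothetical helper, and the byte chunks are made-up sample data.

```python
import json


class JsonLinesBuffer:
    """Accumulate raw bytes and return one parsed object per complete line."""

    def __init__(self):
        self._pending = b""

    def feed(self, chunk):
        """Add a bytes chunk and return the list of objects it completed."""
        self._pending += chunk
        # Everything before the last newline is complete; the rest stays pending
        *complete, self._pending = self._pending.split(b"\n")
        return [json.loads(line) for line in complete if line.strip()]


# Simulated stream: the first object is split across two chunks, and the
# second chunk also carries a complete second line
chunks = [
    b'{"token": {"te',
    b'xt": "def"}}\n{"token": {"text": " quick"}}\n',
    b'{"token": {"text": "_sort"}}\n',
]
buf = JsonLinesBuffer()
for chunk in chunks:
    for obj in buf.feed(chunk):
        print(obj["token"]["text"], end="")
print()
```

Buffering bytes rather than decoded text also avoids decoding a multi-byte UTF-8 character that was cut at a chunk boundary.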

- **With chat template**
    - [DJL Chat Completions API Schema](https://docs.djl.ai/master/docs/serving/serving/docs/lmi/user_guides/chat_input_output_schema.html)

In [None]:
# Chat messages for the chat-template streaming example
chat = [
    {"role": "user", "content": "Hello, how are you?"},
    {"role": "assistant", "content": "I'm doing great. How can I help you today?"},
    {"role": "user", "content": "I'd like to show off how chat templating works! anyway, write a quick sort algorithm in python and description"},
]

def generate_payload():
    # Build the JSON payload in the chat completions format
    body = {
        "messages": chat,
        "max_tokens": 1024,
        "stream": True,
    }
    
    # Serialize the JSON to a string and encode it as bytes
    return json.dumps(body).encode('utf-8')

In [None]:
%%time
# Invoke the endpoint
resp = sm_runtime_client.invoke_endpoint_with_response_stream(
    EndpointName=endpoint_name, 
    # Body=json.dumps(body), 
    Body=generate_payload(), 
    
    ContentType="application/json"
)
print("Generated response:")
print("-" * 40)

buffer = ""
for event in resp['Body']:
    if 'PayloadPart' in event:
        chunk = event['PayloadPart']['Bytes'].decode()
        buffer += chunk
        try:
            # Try to parse the buffer as JSON
            data = json.loads(buffer)
            if 'choices' in data:
                # Some chunks may carry no 'content' (e.g. a final chunk), so default to ''
                print(data['choices'][0]['delta'].get('content') or '', end='', flush=True)
            buffer = ""  # Clear the buffer after successful parsing
        except json.JSONDecodeError:
            # If parsing fails, keep the buffer for the next iteration
            pass

print("\n" + "-" * 40)
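
The two streaming loops in this section read different fields: `token.text` for the `inputs` payload and `choices[0].delta.content` for the chat payload. The following is a minimal sketch of one extractor covering both shapes, returning an empty string when a chunk carries no text. `stream_text` is a hypothetical helper, and the sample events are made-up data shaped after the fields used in the loops above.

```python
def stream_text(event):
    """Return the text fragment carried by one parsed stream event, or ''."""
    token = event.get("token")
    if isinstance(token, dict):
        # Shape read by the first streaming loop
        return token.get("text") or ""
    choices = event.get("choices") or []
    if choices:
        # Shape read by the chat streaming loop
        delta = choices[0].get("delta") or {}
        return delta.get("content") or ""
    return ""


# Hypothetical events for illustration only
events = [
    {"token": {"text": "def "}},
    {"choices": [{"delta": {"content": "quick_sort"}}]},
    {"choices": [{"delta": {}, "finish_reason": "stop"}]},
]
print("".join(stream_text(e) for e in events))
```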

## 4. Delete endpoint

In [None]:
# delete endpoint
predictor.delete_model()
predictor.delete_endpoint()