# **Deploy Llama 3.2 through vLLM on SageMaker Endpoint using LMI container from DJL**

## Use DJL with the SageMaker Python SDK
- The SageMaker Python SDK lets you host models on Amazon SageMaker using the Deep Java Library.
- Deep Java Library (DJL) Serving is a high-performance, general-purpose, standalone model serving solution provided by DJL. DJL Serving supports loading models trained with a variety of frameworks.
- With the SageMaker Python SDK, you can use DJL Serving to host large models with backends such as DeepSpeed and HuggingFace Accelerate.
- For information about supported versions of DJL Serving, see the [AWS documentation](https://docs.aws.amazon.com/deep-learning-containers/latest/devguide/deep-learning-containers-images.html).
- We recommend using the latest supported version, because that is where development effort is focused.
- For general information about using the SageMaker Python SDK, see [Using the SageMaker Python SDK](https://sagemaker.readthedocs.io/en/v2.139.0/overview.html#using-the-sagemaker-python-sdk).
    
REF: [BLOG] [Deploy LLM with vLLM on SageMaker in only 13 lines of code](https://mrmaheshrajput.medium.com/deploy-llm-with-vllm-on-sagemaker-in-only-13-lines-of-code-1601f780c0cf)

## 1. Deploy model on SageMaker

In [None]:
import boto3
import sagemaker
from sagemaker import get_execution_role

- [Available DLC (Deep Learning Containers)](https://github.com/aws/deep-learning-containers/blob/master/available_images.md)

In [None]:
role = get_execution_role()
region = boto3.Session().region_name
sagemaker_session = sagemaker.session.Session()
sm_client = boto3.client("sagemaker", region_name=region)
sm_runtime_client = boto3.client("sagemaker-runtime")
sm_autoscaling_client = boto3.client("application-autoscaling")

### Setup Configuration


 - [[DOC] DJL for serving](https://docs.djl.ai/master/docs/serving/serving/docs/lmi/index.html)

In [None]:
model_id = "meta-llama/Llama-3.2-90B-Vision-Instruct"

In [None]:
container_uri = sagemaker.image_uris.retrieve(
    framework="djl-lmi", version="0.30.0", region=region
)
container_uri

In [None]:
instance_type = "ml.g5.48xlarge"
container_startup_health_check_timeout = 900

endpoint_name = sagemaker.utils.name_from_base("Meta-Llama-3-2-90B-Vision-Instruct")

print (f'container_uri: {container_uri}')
print (f'container_startup_health_check_timeout: {container_startup_health_check_timeout}')
print (f'instance_type: {instance_type}')
print (f'endpoint_name: {endpoint_name}')

### Create model with env variables


- Target model: [Llama-3.2-90B-Vision-Instruct](https://huggingface.co/meta-llama/Llama-3.2-90B-Vision-Instruct)

- **[Backend for attention computation in vLLM](https://docs.vllm.ai/en/latest/serving/env_vars.html)**
    - Available options:
        - "TORCH_SDPA": use torch.nn.MultiheadAttention
        - "FLASH_ATTN": use FlashAttention
        - "XFORMERS": use XFormers
        - "ROCM_FLASH": use ROCmFlashAttention
        - "FLASHINFER": use flashinfer

- **`"OPTION_DISABLE_FLASH_ATTN": "false"`** applies only to HF Accelerate with Seq-Scheduler
    - It is ignored when using the vLLM backend

- [[DOC] DJL-Container and Model Configurations (info. about properties)](https://docs.djl.ai/master/docs/serving/serving/docs/lmi/deployment_guide/configurations.html)
- [[DOC] Backend Specific Configurations](https://docs.djl.ai/master/docs/serving/serving/docs/lmi/user_guides/index.html)
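
As a rough illustration of how the environment variables in the next cell relate to the `serving.properties` keys described in the configuration docs above: an `OPTION_`-prefixed variable corresponds to the lower-cased `option.` key. The helper name `env_to_property` is hypothetical and covers only that prefix rule; any other variable name is returned unchanged.

```python
def env_to_property(name: str) -> str:
    """Map an LMI environment variable name to its serving.properties key.

    Hypothetical helper for illustration: only the OPTION_ prefix rule is
    handled; any other name is returned unchanged.
    """
    prefix = "OPTION_"
    if name.startswith(prefix):
        return "option." + name[len(prefix):].lower()
    return name


for env_name in ["OPTION_ROLLING_BATCH", "OPTION_TENSOR_PARALLEL_DEGREE", "OPTION_MAX_MODEL_LEN"]:
    print(f"{env_name} -> {env_to_property(env_name)}")
```

Setting these values as environment variables avoids packaging a separate `serving.properties` file with the model.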

In [None]:
deploy_env = {
    "HF_MODEL_ID": model_id,
    "OPTION_ROLLING_BATCH": "vllm",
    "OPTION_TENSOR_PARALLEL_DEGREE": "max",
    "OPTION_MAX_ROLLING_BATCH_SIZE": "32",
    "OPTION_DTYPE": "fp16",
    "OPTION_TRUST_REMOTE_CODE": "true",
    "OPTION_MAX_MODEL_LEN": "4096",
    "OPTION_ENFORCE_EAGER": "true",  # For Llama 3.2
    "VLLM_ATTENTION_BACKEND": "XFORMERS",
    # "OPTION_DISABLE_FLASH_ATTN": "false",  # HF Accelerate with Seq-Scheduler only
    "HF_TOKEN": "<your token>"
}

In [None]:
model = sagemaker.Model(
    image_uri=container_uri, 
    role=role,
    env=deploy_env
)

### Deploy model

In [None]:
model.deploy(
    instance_type=instance_type,
    initial_instance_count=1,
    endpoint_name=endpoint_name,
    container_startup_health_check_timeout=container_startup_health_check_timeout,
    sagemaker_session=sagemaker_session
)
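
`model.deploy()` waits for the endpoint by default, but if the kernel restarts during the long startup of a 90B model, or if `wait=False` is passed, the status can be polled separately. The following is a minimal sketch; the helper name `wait_until_in_service` and its default intervals are illustrative, and `client` is assumed to be a boto3 SageMaker client such as `sm_client`.

```python
import time


def wait_until_in_service(client, endpoint_name, poll_seconds=30, max_polls=60):
    """Poll describe_endpoint until the endpoint reports 'InService'.

    Raises RuntimeError if the endpoint reports 'Failed' and TimeoutError
    if it is still not in service after max_polls attempts.
    """
    for _ in range(max_polls):
        status = client.describe_endpoint(EndpointName=endpoint_name)["EndpointStatus"]
        if status == "InService":
            return status
        if status == "Failed":
            raise RuntimeError(f"Endpoint {endpoint_name} failed to deploy")
        time.sleep(poll_seconds)
    raise TimeoutError(f"Endpoint {endpoint_name} not in service after {max_polls} polls")


# Example (requires AWS credentials and an existing endpoint):
# wait_until_in_service(sm_client, endpoint_name)
```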

## 2. Invocation (Generate Text using the endpoint)

### Get a predictor for your endpoint

In [None]:
predictor = sagemaker.Predictor(
    endpoint_name=endpoint_name,
    sagemaker_session=sagemaker_session,
    serializer=sagemaker.serializers.JSONSerializer(),
    deserializer=sagemaker.deserializers.JSONDeserializer(),
)

### Make a prediction with your endpoint (with stream)

In [None]:
import json
import random 

- **With chat template**
    - [DJL Chat Completions API Schema](https://docs.djl.ai/master/docs/serving/serving/docs/lmi/user_guides/chat_input_output_schema.html)

In [None]:
def generate_payload(chat):

    # Build the JSON payload
    body = {
        "messages": chat,
        "max_tokens": 1024,
        "stream": True,
        #"stop": terminators,
        "ignore_eos": False
    }
    
    # Serialize the payload to a JSON string and encode it as bytes
    return json.dumps(body).encode('utf-8')
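
The payload above fixes `max_tokens` and `stream`. A slightly more general sketch follows; the function name `build_chat_payload` is hypothetical, and `temperature` is assumed to be accepted by the chat completions schema linked above, so check that schema before relying on it.

```python
import json


def build_chat_payload(messages, max_tokens=1024, stream=True, **sampling):
    """Build a chat completions request body and encode it as UTF-8 bytes.

    Extra keyword arguments (for example temperature) are copied into the
    body as-is; whether the server honors them depends on the schema.
    """
    body = {"messages": messages, "max_tokens": max_tokens, "stream": stream}
    body.update(sampling)
    return json.dumps(body).encode("utf-8")


payload = build_chat_payload(
    [{"role": "user", "content": "What is machine learning?"}],
    max_tokens=256,
    stream=False,
    temperature=0.7,
)
print(json.loads(payload))
```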

In [None]:
# Chat messages to send to the endpoint
chat = [
    #{"role": "system", "content": "You are a question-answering chatbot. Identify the intent of the user's question and answer it."},
    {"role": "user", "content": "I would like to get better at basketball. Can you provide me a 3 month plan to improve my skills?"},
    #{"role": "user", "content": "Cheolsu had 20 pencils. If Younghee took half of them and Minsu took 5 of the rest, how many pencils does Cheolsu have left?"},
]

In [None]:
%%time
# Invoke the endpoint
resp = sm_runtime_client.invoke_endpoint_with_response_stream(
    EndpointName=endpoint_name, 
    # Body=json.dumps(body), 
    Body=generate_payload(chat), 
    
    ContentType="application/json"
)
print("Generated response:")
print("-" * 40)

decoder = json.JSONDecoder()
buffer = ""
string = ""
for event in resp['Body']:
    if 'PayloadPart' in event:
        chunk = event['PayloadPart']['Bytes'].decode()
        buffer += chunk
        # Parse every complete JSON object in the buffer: one event may carry
        # several objects, and one object may be split across events
        while buffer.strip():
            buffer = buffer.lstrip()
            try:
                data, end = decoder.raw_decode(buffer)
            except json.JSONDecodeError:
                # Incomplete object: keep the buffer for the next event
                break
            buffer = buffer[end:]
            if 'choices' in data:
                # Some chunks (for example the last one) may carry no content
                content = data['choices'][0]['delta'].get('content') or ""
                print(content, end='', flush=True)
                string += content

print("\n" + "-" * 40)
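
The buffering logic in the loop above can be isolated into a small generator and exercised without an endpoint. This is a sketch under the assumption that the stream consists of concatenated or newline-delimited JSON objects shaped like chat completion chunks; the name `iter_stream_text` and the fake events are illustrative.

```python
import json


def iter_stream_text(events):
    """Yield text deltas from an iterable of response-stream events.

    Each event is a dict shaped like the items of resp['Body']. Bytes are
    buffered so that a JSON object split across events, or several objects
    packed into one event, are both handled.
    """
    decoder = json.JSONDecoder()
    raw = b""
    for event in events:
        if "PayloadPart" not in event:
            continue
        raw += event["PayloadPart"]["Bytes"]
        try:
            text = raw.decode("utf-8")
        except UnicodeDecodeError:
            continue  # a multi-byte character is split; wait for more bytes
        text = text.lstrip()
        while text:
            try:
                data, end = decoder.raw_decode(text)
            except json.JSONDecodeError:
                break  # incomplete object; keep it for the next event
            text = text[end:].lstrip()
            for choice in data.get("choices", []):
                content = choice.get("delta", {}).get("content")
                if content:
                    yield content
        raw = text.encode("utf-8")


# Fake events: the first object is split across two events, and the second
# event also carries a complete second object.
fake_events = [
    {"PayloadPart": {"Bytes": b'{"choices": [{"delta": {"content": "Hel'}},
    {"PayloadPart": {"Bytes": b'lo"}}]}\n{"choices": [{"delta": {"content": " world"}}]}\n'}},
]
print("".join(iter_stream_text(fake_events)))
```

Keeping the parser separate from the network call makes it possible to reuse the same code in the interactive cell and in the performance tester.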

## 3. Performance checker

In [None]:
HF_TOKEN = "<your token>"
!huggingface-cli login --token {HF_TOKEN}

In [None]:
import os
import boto3
import json
import time
import concurrent.futures
from tqdm import tqdm
import numpy as np
from transformers import AutoTokenizer

class LLMPerformanceTester:
    
    def __init__(self, endpoint_name, model_id):
        self.runtime_client = boto3.client('sagemaker-runtime')
        self.endpoint_name = endpoint_name
        self.tokenizer = AutoTokenizer.from_pretrained(model_id)
        
    def invoke_endpoint(self, prompt):
        
        print (prompt)
        start_time = time.time()
        try:
            
            resp = self.runtime_client.invoke_endpoint_with_response_stream(
                EndpointName=self.endpoint_name, 
                # Body=json.dumps(body), 
                Body=generate_payload(prompt), 

                ContentType="application/json"
            )
            print("Generated response:")
            print("-" * 40)

            decoder = json.JSONDecoder()
            buffer = ""
            response_body = ""
            for event in resp['Body']:
                if 'PayloadPart' in event:
                    chunk = event['PayloadPart']['Bytes'].decode()
                    buffer += chunk
                    # Parse every complete JSON object currently in the buffer
                    while buffer.strip():
                        buffer = buffer.lstrip()
                        try:
                            data, end = decoder.raw_decode(buffer)
                        except json.JSONDecodeError:
                            # Incomplete object: keep the buffer for the next event
                            break
                        buffer = buffer[end:]
                        if 'choices' in data:
                            # Some chunks (for example the last one) may carry no content
                            response_body += data['choices'][0]['delta'].get('content') or ""
            
            end_time = time.time()
            latency = end_time - start_time

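            # Count generated tokens only; skip special tokens such as BOS added by the tokenizer
            token = len(self.tokenizer.encode(response_body, add_special_tokens=False))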
            print (token)
            #print (response_body)
            print("\n" + "-" * 40)
            return {
                'success': True,
                'latency': latency,
                'response': response_body,
                'token': token
            }
        except Exception as e:
            return {
                'success': False,
                'error': str(e),
                'latency': time.time() - start_time
            }

    def run_concurrent_test(self, prompts, concurrent_requests=5):
        results = []
        start_time = time.time()
        with concurrent.futures.ThreadPoolExecutor(max_workers=concurrent_requests) as executor:
            futures = [executor.submit(self.invoke_endpoint, prompt) for prompt in prompts]
            for future in tqdm(concurrent.futures.as_completed(futures), total=len(prompts)):
                results.append(future.result())
        
        end_time = time.time()
        total_latency = end_time - start_time
        return results, total_latency

    def analyze_results(self, results, total_latency):
        successful_requests = [r for r in results if r['success']]
        failed_requests = [r for r in results if not r['success']]
        
        if not successful_requests:
            return "No successful requests"
            
        latencies = [r['latency'] for r in successful_requests]
        tokens = [r['token'] for r in successful_requests]
        
        analysis = {
            'total_requests': len(results),
            'successful_requests': len(successful_requests),
            'failed_requests': len(failed_requests),
            #'throughput': len(successful_requests) / sum(latencies),  # requests per second
            'throughput': sum(tokens) / total_latency,  # generated tokens per second (wall clock)
            'latency_stats': {
                'mean': np.mean(latencies),
                'median': np.median(latencies),
                'p95': np.percentile(latencies, 95),
                'p99': np.percentile(latencies, 99),
                'min': min(latencies),
                'max': max(latencies)
            }
        }
        return analysis
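
The throughput above divides the total generated tokens by the wall-clock duration of the whole run rather than by the sum of per-request latencies, because concurrent requests overlap in time. A minimal sketch with synthetic numbers shows the difference; the function name `summarize` and the sample values are illustrative and not measurements.

```python
import statistics


def summarize(results, wall_clock_seconds):
    """Compute tokens/sec and basic latency stats from per-request results.

    `results` uses the same keys as the dicts returned by invoke_endpoint
    ('success', 'latency', 'token'); failed requests are ignored.
    """
    ok = [r for r in results if r["success"]]
    latencies = [r["latency"] for r in ok]
    total_tokens = sum(r["token"] for r in ok)
    return {
        "tokens_per_second": total_tokens / wall_clock_seconds,
        "tokens_per_summed_latency": total_tokens / sum(latencies),
        "mean_latency": statistics.mean(latencies),
        "median_latency": statistics.median(latencies),
    }


# Two overlapping requests that finished within 5 seconds of wall-clock time
synthetic = [
    {"success": True, "latency": 2.0, "token": 100},
    {"success": True, "latency": 4.0, "token": 300},
    {"success": False, "latency": 0.5, "error": "timeout"},
]
print(summarize(synthetic, wall_clock_seconds=5.0))
```

Dividing by the summed latencies would understate throughput whenever requests run in parallel, which is the normal case with rolling batch.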

In [None]:
# Test configuration
#endpoint_name = "Meta-Llama-3-2-3B-Instruct-2024-11-08-11-00-28-868"
test_prompts = [
    [{"role": "user", "content": "What is machine learning?"}],
    [{"role": "user", "content": "Explain quantum computing"}],
    [{"role": "user", "content": "How does blockchain work?"}]
    # ... add more test prompts
] * 20  # Repeat each prompt 20 times (60 requests in total)

# Initialize the tester
tester = LLMPerformanceTester(endpoint_name, model_id)

# Run the concurrency test
print("Running performance test...")
results, total_latency = tester.run_concurrent_test(test_prompts, concurrent_requests=60)

# Analyze the results
analysis = tester.analyze_results(results, total_latency)
print("\nPerformance Analysis:")
print(json.dumps(analysis, indent=2))

## 4. Clean up

In [None]:
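# Look up the endpoint configuration and model names created by model.deploy()
endpoint_config_name = sm_client.describe_endpoint(EndpointName=endpoint_name)["EndpointConfigName"]
model_name = sm_client.describe_endpoint_config(
    EndpointConfigName=endpoint_config_name
)["ProductionVariants"][0]["ModelName"]

# Delete endpoint
sm_client.delete_endpoint(EndpointName=endpoint_name)

# Delete endpoint configuration
sm_client.delete_endpoint_config(EndpointConfigName=endpoint_config_name)

# Delete model
sm_client.delete_model(ModelName=model_name)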