# Fine-tuning with LLaMA-Factory on SageMaker
# 3. Deploy the model to a SageMaker endpoint with the SageMaker LMI (Large Model Inference) vLLM engine

The LMI container offers out-of-the-box integration with SageMaker for hosting multiple LoRA adapters with high performance (low latency and high throughput). It relies on the vLLM library, which builds on S-LoRA and Punica. S-LoRA stores all adapters in main memory and fetches only the adapters needed by the currently running queries into GPU memory. To use GPU memory efficiently and reduce fragmentation, S-LoRA proposes Unified Paging, which uses a single memory pool to manage both dynamic adapter weights with different ranks and KV cache tensors with varying sequence lengths. S-LoRA also employs a novel tensor parallelism strategy and highly optimized custom CUDA kernels for heterogeneous batching of LoRA computation. Together, these features let S-LoRA serve thousands of LoRA adapters on a single GPU or across multiple GPUs with small overhead.
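
To make the "all adapters in main memory, only the active ones on the GPU" idea concrete, here is a toy sketch in plain Python. It is not S-LoRA's Unified Paging and not how vLLM is implemented: the `AdapterPool` class, the slot count, and the least-recently-used eviction rule are all assumptions chosen for illustration.

```python
from collections import OrderedDict


class AdapterPool:
    """Toy model: every adapter lives in host memory, only a few are GPU-resident."""

    def __init__(self, host_adapters, gpu_slots):
        self.host = dict(host_adapters)  # all adapters stay in "main memory"
        self.gpu = OrderedDict()         # adapters currently on the "GPU"
        self.gpu_slots = gpu_slots       # assumed fixed number of adapter slots
        self.loads = 0                   # host-to-GPU copies performed so far

    def prepare_batch(self, adapter_names):
        """Ensure every adapter used by the running batch is GPU-resident."""
        needed = list(dict.fromkeys(adapter_names))  # de-duplicate, keep order
        if len(needed) > self.gpu_slots:
            raise ValueError("batch needs more adapters than there are GPU slots")
        for name in needed:
            if name in self.gpu:
                self.gpu.move_to_end(name)  # mark as recently used
                continue
            # Evict least recently used adapters that this batch does not need.
            while len(self.gpu) >= self.gpu_slots:
                victim = next(n for n in self.gpu if n not in needed)
                del self.gpu[victim]
            self.gpu[name] = self.host[name]
            self.loads += 1
        return [self.gpu[name] for name in adapter_names]


pool = AdapterPool({"a": "weights-a", "b": "weights-b", "c": "weights-c"}, gpu_slots=2)
pool.prepare_batch(["a", "b", "a"])  # copies a and b to the GPU
pool.prepare_batch(["a", "c"])       # a is already resident; c replaces b
print(list(pool.gpu), pool.loads)
```

The point of the sketch is only that requests in one batch can reference different adapters, and that adapters are copied to the GPU on demand rather than all at once.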

The diagram below shows the multi-LoRA-adapter serving stack of the LMI container on SageMaker:
![Multi-LoRA-adapter serving stack of the LMI container on SageMaker](https://raw.githubusercontent.com/aws-samples/sagemaker-genai-hosting-examples/0a98859eef9c53a5aa3beeae7d59b38a8de934dc/Llama2/Llama2-7b/LMI/LoRA-LMI-SageMaker.png)

In [1]:
import sagemaker
from sagemaker import image_uris
import boto3
import os
import time
import json

role = sagemaker.get_execution_role()  # execution role for the endpoint
sess = sagemaker.session.Session()  # sagemaker session for interacting with different AWS APIs
default_bucket = sess.default_bucket()  # bucket to house artifacts

region = sess.boto_region_name  # public attribute; avoids the private _region_name
account_id = sess.account_id()

s3_client = boto3.client("s3")
sm_client = boto3.client("sagemaker")
smr_client = boto3.client("sagemaker-runtime")

sagemaker.config INFO - Not applying SDK defaults from location: /etc/xdg/sagemaker/config.yaml
sagemaker.config INFO - Not applying SDK defaults from location: /home/ec2-user/.config/sagemaker/config.yaml


In [71]:
s3_model_prefix = "Meta-Llama-3-8B-Instruct-AWQ"  
s3_model_path =  f"s3://{default_bucket}/{s3_model_prefix}/"

In [72]:
inference_image_uri = image_uris.retrieve(
    framework="djl-lmi",
    region=sess.boto_session.region_name,
    version="0.29.0"
)
# inference_image_uri = (
#     "763104351884.dkr.ecr.us-east-1.amazonaws.com/djl-inference:0.28.0-lmi10.0.0-cu124-v1.0"
# )
print(f"Image going to be used is ---- > {inference_image_uri}")


Image going to be used is ---- > 763104351884.dkr.ecr.us-east-1.amazonaws.com/djl-inference:0.29.0-lmi11.0.0-cu124


In [73]:
local_code_dir = 'vllm_inference'
!mkdir -p {local_code_dir}


* Note: `option.model_id` must be changed to the S3 URL that the model was downloaded to.


In [97]:
%%writefile {local_code_dir}/serving.properties
engine=Python
option.model_id=TechxGenus/Meta-Llama-3-8B-Instruct-AWQ
option.dtype=fp16
option.enable_lora=true
option.rolling_batch=vllm
option.tensor_parallel_degree=1

Overwriting vllm_inference/serving.properties


In [146]:
%%writefile {local_code_dir}/requirements.txt
transformers==4.45.2

Overwriting vllm_inference/requirements.txt


In [147]:
%%writefile {local_code_dir}/serving.properties
engine=MPI
option.model_id=TechxGenus/Meta-Llama-3-8B-Instruct-AWQ
option.rolling_batch=lmi-dist
option.tensor_parallel_degree=1
option.enable_lora=true

Overwriting vllm_inference/serving.properties
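
The note above asks for `option.model_id` to point at the S3 copy of the model, while the cells write a Hugging Face Hub ID. A minimal sketch of generating the file content from a variable such as `s3_model_path` could look like this. The helper name `render_serving_properties` and the bucket name are hypothetical; the keys and values are the ones from the cell above.

```python
def render_serving_properties(model_location, enable_lora=True, tensor_parallel_degree=1):
    """Return serving.properties text using the same keys as the cell above."""
    options = {
        "engine": "MPI",
        "option.model_id": model_location,
        "option.rolling_batch": "lmi-dist",
        "option.tensor_parallel_degree": str(tensor_parallel_degree),
        "option.enable_lora": str(enable_lora).lower(),
    }
    return "\n".join(f"{key}={value}" for key, value in options.items()) + "\n"


# Hypothetical bucket name, for illustration only; in the notebook this
# would be the s3_model_path variable defined earlier.
example_s3_model_path = "s3://my-example-bucket/Meta-Llama-3-8B-Instruct-AWQ/"
properties_text = render_serving_properties(example_s3_model_path)
print(properties_text)
```

Writing `properties_text` to `vllm_inference/serving.properties` with ordinary file I/O would replace the `%%writefile` cell and keep the model location in one place.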


- Create the LoRA adapter directory

In [148]:
!mkdir -p {local_code_dir}/adapters

### Download the trained LoRA adapter to a local directory for packaging

In [149]:
!aws s3 sync s3://{default_bucket}/llama3-8b-qlora/finetuned_model/ {local_code_dir}/adapters/exp
!rm -rf {local_code_dir}/adapters/exp/checkpoint*
!rm -rf {local_code_dir}/adapters/exp/runs

download: s3://sagemaker-us-east-1-434444145045/llama3-8b-qlora/finetuned_model/checkpoint-500/adapter_config.json to vllm_inference/adapters/exp/checkpoint-500/adapter_config.json
download: s3://sagemaker-us-east-1-434444145045/llama3-8b-qlora/finetuned_model/checkpoint-500/README.md to vllm_inference/adapters/exp/checkpoint-500/README.md
download: s3://sagemaker-us-east-1-434444145045/llama3-8b-qlora/finetuned_model/checkpoint-500/rng_state.pth to vllm_inference/adapters/exp/checkpoint-500/rng_state.pth
download: s3://sagemaker-us-east-1-434444145045/llama3-8b-qlora/finetuned_model/checkpoint-500/tokenizer_config.json to vllm_inference/adapters/exp/checkpoint-500/tokenizer_config.json
download: s3://sagemaker-us-east-1-434444145045/llama3-8b-qlora/finetuned_model/checkpoint-500/special_tokens_map.json to vllm_inference/adapters/exp/checkpoint-500/special_tokens_map.json
download: s3://sagemaker-us-east-1-434444145045/llama3-8b-qlora/finetuned_model/checkpoint-500/scheduler.pt to vllm
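
The two `rm -rf` commands above strip intermediate checkpoints and TensorBoard runs so that only the final adapter is packaged. The same cleanup can be sketched in Python, which makes it easy to see what was removed. The function name `prune_training_artifacts` is hypothetical, and unlike the shell glob this version only removes directories.

```python
import shutil
import tempfile
from pathlib import Path


def prune_training_artifacts(adapter_dir):
    """Remove checkpoint* and runs directories; return the names removed."""
    adapter_dir = Path(adapter_dir)
    removed = []
    for path in sorted(adapter_dir.iterdir()):
        is_training_dir = path.name.startswith("checkpoint") or path.name == "runs"
        if path.is_dir() and is_training_dir:
            shutil.rmtree(path)
            removed.append(path.name)
    return removed


# Offline demonstration on a throwaway directory.
with tempfile.TemporaryDirectory() as tmp:
    root = Path(tmp)
    (root / "checkpoint-500").mkdir()
    (root / "runs").mkdir()
    (root / "adapter_config.json").write_text("{}")
    removed = prune_training_artifacts(root)
    remaining = sorted(p.name for p in root.iterdir())
print(removed, remaining)
```

In the notebook the call would be `prune_training_artifacts(f"{local_code_dir}/adapters/exp")`.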

In [150]:
!rm -f model.tar.gz
!cd {local_code_dir} && rm -rf ".ipynb_checkpoints"
!tar czvf model.tar.gz {local_code_dir}

vllm_inference/
vllm_inference/serving.properties
vllm_inference/adapters/
vllm_inference/adapters/exp/
vllm_inference/adapters/exp/trainer_log.jsonl
vllm_inference/adapters/exp/special_tokens_map.json
vllm_inference/adapters/exp/eval_results.json
vllm_inference/adapters/exp/README.md
vllm_inference/adapters/exp/all_results.json
vllm_inference/adapters/exp/adapter_model.safetensors
vllm_inference/adapters/exp/adapter_config.json
vllm_inference/adapters/exp/training_loss.png
vllm_inference/adapters/exp/tokenizer.json
vllm_inference/adapters/exp/train_results.json
vllm_inference/adapters/exp/training_eval_loss.png
vllm_inference/adapters/exp/tokenizer_config.json
vllm_inference/adapters/exp/training_args.bin
vllm_inference/adapters/exp/trainer_state.json
vllm_inference/requirements.txt
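
Before uploading, it can be worth confirming that the archive really contains the adapter files under `adapters/<name>/`, since `<name>` is what requests refer to later. The sketch below assumes an adapter needs `adapter_config.json` and `adapter_model.safetensors`, as in the listing above; the helper name `adapters_in_archive` is hypothetical.

```python
import io
import os
import tarfile
import tempfile


def adapters_in_archive(archive_path, code_dir="vllm_inference"):
    """Return adapter names that have both expected files inside the archive."""
    required = {"adapter_config.json", "adapter_model.safetensors"}
    found = {}
    with tarfile.open(archive_path, "r:gz") as tar:
        for member in tar.getnames():
            parts = member.split("/")
            # Expect paths shaped like <code_dir>/adapters/<name>/<file>.
            if len(parts) == 4 and parts[0] == code_dir and parts[1] == "adapters":
                found.setdefault(parts[2], set()).add(parts[3])
    return sorted(name for name, files in found.items() if required <= files)


# Offline demonstration with a tiny archive of empty files.
with tempfile.TemporaryDirectory() as tmp:
    archive = os.path.join(tmp, "model.tar.gz")
    with tarfile.open(archive, "w:gz") as tar:
        for name in [
            "vllm_inference/serving.properties",
            "vllm_inference/adapters/exp/adapter_config.json",
            "vllm_inference/adapters/exp/adapter_model.safetensors",
            "vllm_inference/adapters/broken/adapter_config.json",
        ]:
            tar.addfile(tarfile.TarInfo(name), io.BytesIO(b""))
    demo_adapters = adapters_in_archive(archive)
print(demo_adapters)
```

Against the real archive, `adapters_in_archive("model.tar.gz")` should list `exp` if the packaging step worked as intended.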


In [151]:
s3_code_prefix = "llm_finetune/llama-3-8b-qlora"
print(f"s3_code_prefix: {s3_code_prefix}")

s3_code_prefix: llm_finetune/llama-3-8b-qlora


In [152]:
s3_code_artifact = sess.upload_data("model.tar.gz", default_bucket, s3_code_prefix)
print(f"S3 Code or Model tar ball uploaded to --- > {s3_code_artifact}")

S3 Code or Model tar ball uploaded to --- > s3://sagemaker-us-east-1-434444145045/llm_finetune/llama-3-8b-qlora/model.tar.gz


### Create the SageMaker model

In [153]:
from sagemaker.utils import name_from_base
import boto3

model_name = name_from_base("llama3-8b-qlora").replace('.', '-').replace('_', '-')
print(model_name)
print(f"Image going to be used is ---- > {inference_image_uri}")

create_model_response = sm_client.create_model(
    ModelName=model_name,
    ExecutionRoleArn=role,
    PrimaryContainer={
        "Image": inference_image_uri,
        "ModelDataUrl": s3_code_artifact
    },
    
)
model_arn = create_model_response["ModelArn"]

print(f"Created Model: {model_arn}")

llama3-8b-qlora-2024-11-02-13-19-12-418
Image going to be used is ---- > 763104351884.dkr.ecr.us-east-1.amazonaws.com/djl-inference:0.29.0-lmi11.0.0-cu124
Created Model: arn:aws:sagemaker:us-east-1:434444145045:model/llama3-8b-qlora-2024-11-02-13-19-12-418
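
The `.replace('.', '-').replace('_', '-')` calls exist because SageMaker resource names accept only letters, digits, and hyphens. A slightly more general sketch is shown below. The 63-character limit is my understanding of the current service limit and should be checked against the API reference; the helper name `sanitize_sagemaker_name` is hypothetical.

```python
import re

MAX_NAME_LENGTH = 63  # assumed limit for model, endpoint config, and endpoint names


def sanitize_sagemaker_name(raw, suffix=""):
    """Replace disallowed characters with hyphens and leave room for a suffix."""
    cleaned = re.sub(r"[^a-zA-Z0-9-]", "-", raw)  # only letters, digits, hyphens
    cleaned = re.sub(r"-{2,}", "-", cleaned).strip("-")
    budget = MAX_NAME_LENGTH - len(suffix)
    return cleaned[:budget].rstrip("-") + suffix


print(sanitize_sagemaker_name("llama3.1_8b qlora"))
print(len(sanitize_sagemaker_name("x" * 80, suffix="-endpoint")))
```

Passing the suffix in matters here because the notebook later appends `-config` and `-endpoint` to the model name, and those derived names have to stay within the limit too.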


### Create the SageMaker endpoint configuration

In [154]:
endpoint_config_name = f"{model_name}-config"
endpoint_name = f"{model_name}-endpoint"

# Note: ml.g4dn.2xlarge is also an option
endpoint_config_response = sm_client.create_endpoint_config(
    EndpointConfigName=endpoint_config_name,
    ProductionVariants=[
        {
            "VariantName": "variant1",
            "ModelName": model_name,
            "InstanceType": "ml.g5.2xlarge",
            "InitialInstanceCount": 1,
            # "VolumeSizeInGB" : 400,
            # "ModelDataDownloadTimeoutInSeconds": 2400,
            "ContainerStartupHealthCheckTimeoutInSeconds": 10*60,
        },
    ],
)
endpoint_config_response

{'EndpointConfigArn': 'arn:aws:sagemaker:us-east-1:434444145045:endpoint-config/llama3-8b-qlora-2024-11-02-13-19-12-418-config',
 'ResponseMetadata': {'RequestId': '27361383-f26e-45d8-84e9-b0fe99b1ac9e',
  'HTTPStatusCode': 200,
  'HTTPHeaders': {'x-amzn-requestid': '27361383-f26e-45d8-84e9-b0fe99b1ac9e',
   'content-type': 'application/x-amz-json-1.1',
   'content-length': '127',
   'date': 'Sat, 02 Nov 2024 13:19:16 GMT'},
  'RetryAttempts': 0}}
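
A quick back-of-the-envelope check helps explain why a single-GPU instance is enough for this model. The sketch below estimates the memory taken by the weights alone, assuming roughly 8 billion parameters and 4 bits per parameter for AWQ. It ignores quantization metadata, any layers kept in higher precision, the KV cache, and the LoRA adapters, so treat the numbers as a lower bound.

```python
def weight_memory_gib(num_params, bits_per_param):
    """Approximate memory needed for model weights alone, in GiB."""
    return num_params * bits_per_param / 8 / 2**30


awq_weights = weight_memory_gib(8e9, 4)    # 4-bit AWQ weights
fp16_weights = weight_memory_gib(8e9, 16)  # the same model in fp16, for comparison
print(f"AWQ 4-bit: {awq_weights:.1f} GiB, fp16: {fp16_weights:.1f} GiB")
```

To my knowledge ml.g5.2xlarge carries one 24 GB A10G and ml.g4dn.2xlarge one 16 GB T4, so the quantized weights leave headroom on either, whereas the fp16 weights would nearly fill a T4 before any KV cache is allocated.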

### Create the SageMaker endpoint

In [155]:
create_endpoint_response = sm_client.create_endpoint(
    EndpointName=endpoint_name, EndpointConfigName=endpoint_config_name
)
print(f"Created Endpoint: {create_endpoint_response['EndpointArn']}")

Created Endpoint: arn:aws:sagemaker:us-east-1:434444145045:endpoint/llama3-8b-qlora-2024-11-02-13-19-12-418-endpoint


### Wait about 8 minutes for the endpoint deployment to succeed

In [156]:
import time

resp = sm_client.describe_endpoint(EndpointName=endpoint_name)
status = resp["EndpointStatus"]
print("Status: " + status)

while status == "Creating":
    time.sleep(60)
    resp = sm_client.describe_endpoint(EndpointName=endpoint_name)
    status = resp["EndpointStatus"]
    print("Status: " + status)

print("Arn: " + resp["EndpointArn"])
print("Status: " + status)

Status: Creating
Status: Creating
Status: Creating
Status: Creating
Status: Creating
Status: Creating
Status: Creating
Status: Creating
Status: InService
Arn: arn:aws:sagemaker:us-east-1:434444145045:endpoint/llama3-8b-qlora-2024-11-02-13-19-12-418-endpoint
Status: InService
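
The loop above exits on any status other than `Creating`, including `Failed`, and has no upper bound on waiting time. A hedged variant that raises on failure and on timeout is sketched below. The function name `wait_for_endpoint` and its defaults are hypothetical; it takes the describe call as an argument so the logic can be exercised without AWS access.

```python
import time


def wait_for_endpoint(describe, endpoint_name, poll_seconds=60,
                      timeout_seconds=1800, sleep=time.sleep):
    """Poll until the endpoint is InService; raise on failure or timeout."""
    waited = 0
    while True:
        resp = describe(EndpointName=endpoint_name)
        status = resp["EndpointStatus"]
        if status == "InService":
            return resp
        if status != "Creating":
            reason = resp.get("FailureReason", "no reason given")
            raise RuntimeError(f"Endpoint ended in status {status}: {reason}")
        if waited >= timeout_seconds:
            raise TimeoutError(f"Endpoint still Creating after {waited} seconds")
        sleep(poll_seconds)
        waited += poll_seconds


# Offline demonstration with a stand-in for sm_client.describe_endpoint.
_statuses = iter(["Creating", "Creating", "InService"])


def fake_describe(EndpointName):
    return {"EndpointStatus": next(_statuses), "EndpointArn": f"arn:example:{EndpointName}"}


result = wait_for_endpoint(fake_describe, "demo-endpoint", sleep=lambda seconds: None)
print(result["EndpointStatus"])
```

In the notebook the call would be `wait_for_endpoint(sm_client.describe_endpoint, endpoint_name)`. boto3 also ships a built-in waiter, `sm_client.get_waiter("endpoint_in_service")`, which covers the same need without custom code.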


### Load the tokenizer and use its chat template

In [175]:
from transformers import AutoTokenizer

model_id = 'TechxGenus/Meta-Llama-3-8B-Instruct-AWQ'
tokenizer = AutoTokenizer.from_pretrained(model_id)

## Invoke the SageMaker endpoint for inference

In [176]:
%%time
import json
import boto3

smr_client = boto3.client("sagemaker-runtime")
parameters = {
  "max_new_tokens": 512,
  "temperature": 0.9,
  "top_p":0.8
}

CPU times: user 2.69 ms, sys: 568 μs, total: 3.26 ms
Wall time: 2.6 ms


In [182]:
# First test message
messages = [
    {"role": "system", "content": "请始终用中文回答"},  # "Always answer in Chinese"
    {"role": "user", "content": "你是谁？"},  # "Who are you?"
]

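# Second test message
# messages = [
#     {"role": "system", "content": "请始终用中文回答"},  # "Always answer in Chinese"
#     {"role": "user", "content": "睡觉时被女鬼压床我该怎么办？"},  # "What should I do if a female ghost pins me down in my sleep?"
# ]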


inputs = tokenizer.apply_chat_template(
    messages,
    tokenize=False,
    add_generation_prompt=True
)
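
For readers unfamiliar with chat templates, `apply_chat_template` turns the message list into a single prompt string with role markers. The sketch below hand-builds what I understand the Llama 3 Instruct format to be. It is an approximation for illustration only, the function name `llama3_prompt` is hypothetical, and the tokenizer's own template remains the source of truth.

```python
def llama3_prompt(messages, add_generation_prompt=True):
    """Approximate the Llama 3 Instruct prompt layout for a list of messages."""
    prompt = "<|begin_of_text|>"
    for message in messages:
        prompt += (
            f"<|start_header_id|>{message['role']}<|end_header_id|>\n\n"
            f"{message['content'].strip()}<|eot_id|>"
        )
    if add_generation_prompt:
        # Open an assistant turn so the model continues with its reply.
        prompt += "<|start_header_id|>assistant<|end_header_id|>\n\n"
    return prompt


# Hypothetical English messages, used only to show the layout.
example_prompt = llama3_prompt([
    {"role": "system", "content": "Be brief."},
    {"role": "user", "content": "Who are you?"},
])
print(example_prompt)
```

Because the endpoint receives this string as plain `inputs`, the template has to be applied on the client side, which is why the tokenizer is loaded in the notebook even though generation happens remotely.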

In [190]:
parameters = {
        "max_tokens":512, 
        "temperature": 0.5,
    }


## Test without the LoRA adapter

In [200]:
invoke_response = smr_client.invoke_endpoint(
    EndpointName=endpoint_name,
    Body=json.dumps(
        {
            "inputs": inputs,
            "stream": False,
            **parameters,
        }
    ),
    ContentType="application/json",
    CustomAttributes='accept_eula=false'
)

# print(invoke_response)
print(json.loads(invoke_response["Body"].read().decode("utf-8")))

{'generated_text': '我是 LLaMA，一个由 Meta 开发的基于人工智能的对话系统。我可以理解和生成自然语言，帮助用户'}
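
The response above stops mid-sentence, which may mean the top-level `max_tokens` field was not applied. As far as I understand the LMI request schema, generation settings are nested under a `parameters` key and the length limit is called `max_new_tokens`; this is an assumption that should be verified against the LMI documentation for the container version in use. The helper name `build_lmi_payload` is hypothetical.

```python
import json


def build_lmi_payload(inputs, parameters=None, adapter=None, stream=False):
    """Build a request body with generation settings nested under 'parameters'."""
    body = {"inputs": inputs, "stream": stream}
    if parameters:
        body["parameters"] = dict(parameters)  # assumed LMI schema: nested dict
    if adapter is not None:
        body["adapters"] = adapter  # same field the notebook uses for LoRA
    return json.dumps(body)


example_body = build_lmi_payload(
    "Hello",
    parameters={"max_new_tokens": 512, "temperature": 0.5},
)
print(example_body)
```

The resulting string would be passed as `Body=` to `smr_client.invoke_endpoint` in place of the inline `json.dumps(...)` call.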


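## Test with the LoRA adapter
- Specify `"adapters": "exp"` in the request to load the LoRA adapter. If several LoRA adapters are deployed, the same field switches between them.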



In [201]:
invoke_response = smr_client.invoke_endpoint(
    EndpointName=endpoint_name,
    Body=json.dumps(
        {
            "inputs": inputs,
            "stream": False,
            "adapters": "exp",  # LoRA adapter
            **parameters,
        }
    ),
    ContentType="application/json",
    CustomAttributes='accept_eula=false'
)

# print(invoke_response)
print(json.loads(invoke_response["Body"].read().decode("utf-8")))

{'generated_text': '您好，我是 Riverbot，一个由 Riverbot 开发的人工智能助手。我的任务是回答用户的问题并提供必要的支持'}
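
Since switching adapters is only a matter of changing one request field, the base model and any number of adapters can be compared in a loop. The sketch below takes the runtime client as an argument and uses a stand-in client for an offline demonstration. The names `generate_with_adapters` and `FakeRuntimeClient` are hypothetical, and `extra_fields` is merged at the top level of the request, mirroring how the notebook spreads `parameters`.

```python
import io
import json


def generate_with_adapters(client, endpoint_name, inputs, adapters, extra_fields=None):
    """Invoke the endpoint once per adapter; None selects the base model."""
    results = {}
    for adapter in adapters:
        body = {"inputs": inputs, "stream": False, **(extra_fields or {})}
        if adapter is not None:
            body["adapters"] = adapter
        response = client.invoke_endpoint(
            EndpointName=endpoint_name,
            Body=json.dumps(body),
            ContentType="application/json",
        )
        payload = json.loads(response["Body"].read().decode("utf-8"))
        results[adapter or "base"] = payload["generated_text"]
    return results


class FakeRuntimeClient:
    """Stand-in for boto3.client("sagemaker-runtime"); echoes the adapter name."""

    def invoke_endpoint(self, EndpointName, Body, ContentType):
        request = json.loads(Body)
        text = f"reply from {request.get('adapters', 'base')}"
        return {"Body": io.BytesIO(json.dumps({"generated_text": text}).encode("utf-8"))}


demo = generate_with_adapters(FakeRuntimeClient(), "demo-endpoint", "hi", [None, "exp"])
print(demo)
```

With the real client the call would be `generate_with_adapters(smr_client, endpoint_name, inputs, [None, "exp"], extra_fields=parameters)`.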


## Important: when the experiment is finished, run the commands below to delete the endpoint

In [202]:
!aws sagemaker delete-endpoint --endpoint-name {endpoint_name}
!aws sagemaker delete-endpoint-config --endpoint-config-name {endpoint_config_name}
!aws sagemaker delete-model --model-name {model_name}
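
The same cleanup can be done from Python with the SageMaker client that the notebook already created. The sketch below keeps going if one deletion fails, so a missing endpoint does not leave the model and configuration behind. The function name `delete_endpoint_resources` and the `RecordingClient` stand-in are hypothetical; the three `delete_*` calls are the boto3 counterparts of the CLI commands above.

```python
def delete_endpoint_resources(sm_client, endpoint_name, endpoint_config_name, model_name):
    """Delete endpoint, endpoint config, and model; return a list of failures."""
    steps = [
        ("endpoint", lambda: sm_client.delete_endpoint(EndpointName=endpoint_name)),
        ("endpoint config", lambda: sm_client.delete_endpoint_config(
            EndpointConfigName=endpoint_config_name)),
        ("model", lambda: sm_client.delete_model(ModelName=model_name)),
    ]
    failures = []
    for label, action in steps:
        try:
            action()
        except Exception as error:  # continue so later resources are still deleted
            failures.append((label, str(error)))
    return failures


class RecordingClient:
    """Stand-in for boto3.client("sagemaker") that only records calls."""

    def __init__(self):
        self.calls = []

    def delete_endpoint(self, EndpointName):
        self.calls.append(("delete_endpoint", EndpointName))

    def delete_endpoint_config(self, EndpointConfigName):
        self.calls.append(("delete_endpoint_config", EndpointConfigName))

    def delete_model(self, ModelName):
        self.calls.append(("delete_model", ModelName))


demo_client = RecordingClient()
demo_failures = delete_endpoint_resources(demo_client, "demo-endpoint", "demo-config", "demo-model")
print(demo_client.calls, demo_failures)
```

In the notebook the call would be `delete_endpoint_resources(sm_client, endpoint_name, endpoint_config_name, model_name)`. The endpoint goes first because it is the resource that incurs instance charges.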