# Fine-tuning with LLaMA Factory on SageMaker
# 3. Deploy the model to a SageMaker Endpoint with the SageMaker LMI (Large Model Inference) vLLM engine

The LMI container offers out-of-the-box integration with SageMaker for hosting multiple LoRA adapters with high performance (low latency and high throughput). It uses the vLLM library, which builds on S-LoRA and Punica. S-LoRA stores all adapters in main memory and fetches the adapters needed by the currently running queries into GPU memory. To use GPU memory efficiently and reduce fragmentation, S-LoRA proposes Unified Paging, which uses a unified memory pool to manage dynamic adapter weights with different ranks and KV cache tensors with varying sequence lengths. S-LoRA also employs a novel tensor parallelism strategy and highly optimized custom CUDA kernels for heterogeneous batching of LoRA computation. Together, these features let S-LoRA serve thousands of LoRA adapters on a single GPU or across multiple GPUs with small overhead.
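
To make the Unified Paging idea more concrete, here is a toy sketch in plain Python. It is not S-LoRA's or vLLM's implementation: the class name `UnifiedPagePool`, the page size, and the element counts are all made up for illustration. It only shows the core idea that adapter weights of different ranks and KV cache entries of different lengths can draw fixed-size pages from one shared pool.

```python
class UnifiedPagePool:
    """Toy fixed-size page pool shared by adapter weights and KV cache (illustrative only)."""

    def __init__(self, num_pages, page_size):
        self.page_size = page_size
        self.free_pages = list(range(num_pages))
        self.owners = {}  # owner name -> list of page ids held by that owner

    def allocate(self, owner, num_elements):
        # Round the request up to whole pages (ceiling division).
        needed = -(-num_elements // self.page_size)
        if needed > len(self.free_pages):
            raise MemoryError(f"pool exhausted while allocating for {owner}")
        pages = [self.free_pages.pop() for _ in range(needed)]
        self.owners.setdefault(owner, []).extend(pages)
        return pages

    def release(self, owner):
        # Return every page held by this owner to the shared free list.
        self.free_pages.extend(self.owners.pop(owner, []))


hidden = 128  # hypothetical hidden size
pool = UnifiedPagePool(num_pages=16, page_size=1024)

# A LoRA pair (A and B) for one layer holds roughly 2 * rank * hidden elements.
pool.allocate("adapter:rank8", 2 * 8 * hidden)    # 2048 elements -> 2 pages
pool.allocate("adapter:rank32", 2 * 32 * hidden)  # 8192 elements -> 8 pages
pool.allocate("kv:request-1", 3000)               # 3000 elements -> 3 pages
free_after_alloc = len(pool.free_pages)

# Evicting an adapter frees pages that a KV cache (or another adapter) can reuse.
pool.release("adapter:rank32")
free_after_release = len(pool.free_pages)
print(free_after_alloc, free_after_release)
```

Because every allocation is a whole number of equally sized pages, freed pages from a large adapter can be reused by any later request, which is the fragmentation argument made above.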

The diagram below shows the multi-LoRA-adapter serving stack of the LMI container on SageMaker.
![image](https://raw.githubusercontent.com/aws-samples/sagemaker-genai-hosting-examples/0a98859eef9c53a5aa3beeae7d59b38a8de934dc/Llama2/Llama2-7b/LMI/LoRA-LMI-SageMaker.png)

In [1]:
import sagemaker
from sagemaker import image_uris
import boto3
import os
import time
import json

role = sagemaker.get_execution_role()  # execution role for the endpoint
sess = sagemaker.session.Session()  # sagemaker session for interacting with different AWS APIs
default_bucket = sess.default_bucket()  # bucket to house artifacts

region = sess._region_name
account_id = sess.account_id()

s3_client = boto3.client("s3")
sm_client = boto3.client("sagemaker")
smr_client = boto3.client("sagemaker-runtime")

sagemaker.config INFO - Not applying SDK defaults from location: /etc/xdg/sagemaker/config.yaml
sagemaker.config INFO - Not applying SDK defaults from location: /home/ec2-user/.config/sagemaker/config.yaml


In [2]:
# inference_image_uri = image_uris.retrieve(
#     framework="djl-deepspeed",
#     region=sess.boto_session.region_name,
#     version="0.27.0"
# )
inference_image_uri = (
    "763104351884.dkr.ecr.us-east-1.amazonaws.com/djl-inference:0.28.0-lmi10.0.0-cu124-v1.0"
)
print(f"Image going to be used is ---- > {inference_image_uri}")


Image going to be used is ---- > 763104351884.dkr.ecr.us-east-1.amazonaws.com/djl-inference:0.28.0-lmi10.0.0-cu124-v1.0
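
The image URI above is hardcoded for `us-east-1`. As a minimal sketch, the same string can be assembled from the region. The helper name `lmi_image_uri` is hypothetical, and the assumption that the registry account ID `763104351884` applies to your region should be checked against the AWS Deep Learning Containers image list, since some regions use a different account.

```python
def lmi_image_uri(region, tag="0.28.0-lmi10.0.0-cu124-v1.0", account="763104351884"):
    """Build the djl-inference image URI for a region (account ID is an assumption)."""
    return f"{account}.dkr.ecr.{region}.amazonaws.com/djl-inference:{tag}"


uri = lmi_image_uri("us-east-1")
print(uri)
```

In the notebook this could be called as `lmi_image_uri(region)` with the `region` variable defined in the first cell.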


In [3]:
local_code_dir = 'vllm_inference'
!mkdir -p {local_code_dir}


* Note: change option.model_id to the S3 URL of the downloaded model


In [10]:
%%writefile {local_code_dir}/serving.properties
engine=Python
option.model_id=TechxGenus/Meta-Llama-3-8B-Instruct-AWQ
option.dtype=fp16
option.enable_lora=true
option.rolling_batch=vllm
option.tensor_parallel_degree=1
option.max_model_len=8192
option.max_tokens=8192
option.output_formatter = json
option.model_loading_timeout = 1200
option.max_rolling_batch_size=64
option.max_cpu_loras=4

Overwriting vllm_inference/serving.properties
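
Following the note about `option.model_id`, the sketch below shows one way to swap that value programmatically instead of editing the file by hand. The helpers `parse_properties` and `render_properties` are hypothetical, and the S3 URL is a placeholder rather than a real location. The parser only handles simple `key=value` lines like the ones in this file.

```python
def parse_properties(text):
    """Parse simple key=value lines, tolerating spaces around '=' and '#' comments."""
    props = {}
    for raw in text.splitlines():
        line = raw.strip()
        if not line or line.startswith("#"):
            continue
        key, _, value = line.partition("=")
        props[key.strip()] = value.strip()
    return props


def render_properties(props):
    """Write the properties back out in insertion order, one key=value per line."""
    return "\n".join(f"{key}={value}" for key, value in props.items()) + "\n"


text = """engine=Python
option.model_id=TechxGenus/Meta-Llama-3-8B-Instruct-AWQ
option.enable_lora=true
option.output_formatter = json
"""

props = parse_properties(text)
# Placeholder S3 location; replace it with wherever the base model was uploaded.
props["option.model_id"] = "s3://my-bucket/path/to/base-model/"
rendered = render_properties(props)
print(rendered)
```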


- Create the LoRA adapter directory

In [11]:
!mkdir -p {local_code_dir}/adapters

### Download the trained LoRA adapter to a local directory for packaging

In [12]:
!aws s3 sync s3://{default_bucket}/llama3-8b-qlora/finetuned_model/ {local_code_dir}/adapters/exp
!rm -rf {local_code_dir}/adapters/exp/checkpoint*
!rm -rf {local_code_dir}/adapters/exp/runs

download: s3://sagemaker-us-east-1-577976195821/llama3-8b-qlora/finetuned_model/runs/Jun26_14-15-21_algo-1-nescd/events.out.tfevents.1719412318.algo-1-nescd.216.1 to vllm_inference/adapters/exp/runs/Jun26_14-15-21_algo-1-nescd/events.out.tfevents.1719412318.algo-1-nescd.216.1
download: s3://sagemaker-us-east-1-577976195821/llama3-8b-qlora/finetuned_model/runs/Jun26_14-15-21_algo-1-nescd/events.out.tfevents.1719411354.algo-1-nescd.216.0 to vllm_inference/adapters/exp/runs/Jun26_14-15-21_algo-1-nescd/events.out.tfevents.1719411354.algo-1-nescd.216.0


In [13]:
!rm -f model.tar.gz
!cd {local_code_dir} && rm -rf ".ipynb_checkpoints"
!tar czvf model.tar.gz {local_code_dir}

vllm_inference/
vllm_inference/serving.properties
vllm_inference/adapters/
vllm_inference/adapters/exp/
vllm_inference/adapters/exp/all_results.json
vllm_inference/adapters/exp/README.md
vllm_inference/adapters/exp/eval_results.json
vllm_inference/adapters/exp/training_args.bin
vllm_inference/adapters/exp/trainer_state.json
vllm_inference/adapters/exp/train_results.json
vllm_inference/adapters/exp/special_tokens_map.json
vllm_inference/adapters/exp/adapter_model.safetensors
vllm_inference/adapters/exp/trainer_log.jsonl
vllm_inference/adapters/exp/adapter_config.json
vllm_inference/adapters/exp/training_loss.png
vllm_inference/adapters/exp/tokenizer.json
vllm_inference/adapters/exp/tokenizer_config.json
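
The download, cleanup, and `tar` steps above can also be done in one pass with Python's standard `tarfile` module. This is a sketch under assumptions: the function names `keep` and `package` are hypothetical, and the exclusion rules mirror the `rm -rf` commands used earlier (`checkpoint*`, `runs`, `.ipynb_checkpoints`). The demo builds a throwaway directory tree so the snippet runs on its own.

```python
import os
import tarfile
import tempfile

EXCLUDED_DIRS = {".ipynb_checkpoints", "runs"}


def keep(tarinfo):
    """Return None to drop an entry; dropping a directory also skips its contents."""
    parts = tarinfo.name.split("/")
    if any(p in EXCLUDED_DIRS or p.startswith("checkpoint") for p in parts):
        return None
    return tarinfo


def package(code_dir, archive_path):
    """Create a gzip tarball of code_dir and return the sorted member names."""
    top_level = os.path.basename(os.path.normpath(code_dir))
    with tarfile.open(archive_path, "w:gz") as tar:
        tar.add(code_dir, arcname=top_level, filter=keep)
    with tarfile.open(archive_path, "r:gz") as tar:
        return sorted(tar.getnames())


with tempfile.TemporaryDirectory() as tmp:
    code_dir = os.path.join(tmp, "vllm_inference")
    for rel in [
        "serving.properties",
        "adapters/exp/adapter_config.json",
        "adapters/exp/checkpoint-100/adapter_model.safetensors",
        "adapters/exp/runs/events.out",
        ".ipynb_checkpoints/serving-checkpoint.properties",
    ]:
        path = os.path.join(code_dir, rel)
        os.makedirs(os.path.dirname(path), exist_ok=True)
        open(path, "w").close()  # empty stand-in file
    names = package(code_dir, os.path.join(tmp, "model.tar.gz"))

print(names)
```

Filtering at archive time leaves the local adapter directory untouched, which can be convenient if the checkpoints are still needed locally.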


In [14]:
s3_code_prefix = "llm_finetune/llama-3-8b-qlora"
print(f"s3_code_prefix: {s3_code_prefix}")

s3_code_prefix: llm_finetune/llama-3-8b-qlora


In [16]:
s3_code_artifact = sess.upload_data("model.tar.gz", default_bucket, s3_code_prefix)
print(f"S3 Code or Model tar ball uploaded to --- > {s3_code_artifact}")

S3 Code or Model tar ball uploaded to --- > s3://sagemaker-us-east-1-577976195821/llm_finetune/llama-3-8b-qlora/model.tar.gz


### Create the SageMaker model

In [17]:
from sagemaker.utils import name_from_base
import boto3

model_name = name_from_base("llama3-8b-qlora-vllm").replace('.','-').replace('_','-')
print(model_name)
print(f"Image going to be used is ---- > {inference_image_uri}")

create_model_response = sm_client.create_model(
    ModelName=model_name,
    ExecutionRoleArn=role,
    PrimaryContainer={
        "Image": inference_image_uri,
        "ModelDataUrl": s3_code_artifact
    },
    
)
model_arn = create_model_response["ModelArn"]

print(f"Created Model: {model_arn}")

llama3-8b-qlora-vllm-2024-06-26-14-39-03-287
Image going to be used is ---- > 763104351884.dkr.ecr.us-east-1.amazonaws.com/djl-inference:0.28.0-lmi10.0.0-cu124-v1.0
Created Model: arn:aws:sagemaker:us-east-1:577976195821:model/llama3-8b-qlora-vllm-2024-06-26-14-39-03-287
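
The `.replace('.','-').replace('_','-')` calls above exist because SageMaker resource names only accept alphanumeric characters and hyphens. The sketch below generalizes that cleanup. The helper names are hypothetical, and the pattern and the 63-character limit are stated here as assumptions about the naming constraints, so check them against the CreateModel API reference before relying on them.

```python
import re

# Assumed constraint: alphanumerics and hyphens, no leading/trailing hyphen, max 63 chars.
NAME_PATTERN = re.compile(r"^[a-zA-Z0-9](-*[a-zA-Z0-9])*$")
MAX_LEN = 63


def sanitize_name(name):
    """Replace disallowed characters with hyphens and trim to the assumed length limit."""
    cleaned = re.sub(r"[^a-zA-Z0-9-]", "-", name).strip("-")
    return cleaned[:MAX_LEN].rstrip("-")


def is_valid_name(name):
    return 0 < len(name) <= MAX_LEN and NAME_PATTERN.match(name) is not None


print(sanitize_name("llama3_8b.qlora-vllm"))
print(is_valid_name("llama3-8b-qlora-vllm-2024-06-26-14-39-03-287"))
print(is_valid_name("bad_name"))
```

Since the endpoint and endpoint config names in the next cell are derived by appending suffixes to `model_name`, it is worth validating those derived names as well.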


### Create the SageMaker endpoint configuration

In [18]:
endpoint_config_name = f"{model_name}-config"
endpoint_name = f"{model_name}-endpoint"

# Note: ml.g4dn.2xlarge is also an option
endpoint_config_response = sm_client.create_endpoint_config(
    EndpointConfigName=endpoint_config_name,
    ProductionVariants=[
        {
            "VariantName": "variant1",
            "ModelName": model_name,
            "InstanceType": "ml.g5.2xlarge",
            "InitialInstanceCount": 1,
            # "VolumeSizeInGB" : 400,
            # "ModelDataDownloadTimeoutInSeconds": 2400,
            "ContainerStartupHealthCheckTimeoutInSeconds": 10*60,
        },
    ],
)
endpoint_config_response

{'EndpointConfigArn': 'arn:aws:sagemaker:us-east-1:577976195821:endpoint-config/llama3-8b-qlora-vllm-2024-06-26-14-39-03-287-config',
 'ResponseMetadata': {'RequestId': 'df1f01e9-2b23-4d1f-8d50-4f7148e7bfa1',
  'HTTPStatusCode': 200,
  'HTTPHeaders': {'x-amzn-requestid': 'df1f01e9-2b23-4d1f-8d50-4f7148e7bfa1',
   'content-type': 'application/x-amz-json-1.1',
   'content-length': '132',
   'date': 'Wed, 26 Jun 2024 14:39:06 GMT'},
  'RetryAttempts': 0}}

### Create the SageMaker endpoint

In [19]:
create_endpoint_response = sm_client.create_endpoint(
    EndpointName=f"{endpoint_name}", EndpointConfigName=endpoint_config_name
)
print(f"Created Endpoint: {create_endpoint_response['EndpointArn']}")

Created Endpoint: arn:aws:sagemaker:us-east-1:577976195821:endpoint/llama3-8b-qlora-vllm-2024-06-26-14-39-03-287-endpoint


### Wait about 8 minutes for the endpoint to finish deploying

In [20]:
import time

resp = sm_client.describe_endpoint(EndpointName=endpoint_name)
status = resp["EndpointStatus"]
print("Status: " + status)

while status == "Creating":
    time.sleep(60)
    resp = sm_client.describe_endpoint(EndpointName=endpoint_name)
    status = resp["EndpointStatus"]
    print("Status: " + status)

print("Arn: " + resp["EndpointArn"])
print("Status: " + status)

Status: Creating
Status: Creating
Status: Creating
Status: Creating
Status: Creating
Status: Creating
Status: Creating
Status: InService
Arn: arn:aws:sagemaker:us-east-1:577976195821:endpoint/llama3-8b-qlora-vllm-2024-06-26-14-39-03-287-endpoint
Status: InService
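
The polling loop above exits as soon as the status is no longer `Creating`, so a `Failed` endpoint would fall through silently, and there is no upper bound on the wait. A slightly more defensive version might look like the sketch below. The function `wait_for_endpoint` and its parameters are hypothetical. The `describe` argument is meant to be `sm_client.describe_endpoint`, and a fake is used here so the snippet runs without AWS access.

```python
import time


def wait_for_endpoint(describe, endpoint_name, poll_seconds=60,
                      timeout_seconds=1800, sleep=time.sleep):
    """Poll until the endpoint is InService; raise on failure or timeout."""
    waited = 0
    while True:
        resp = describe(EndpointName=endpoint_name)
        status = resp["EndpointStatus"]
        print("Status: " + status)
        if status == "InService":
            return resp
        if status != "Creating":
            reason = resp.get("FailureReason", "no reason given")
            raise RuntimeError(f"Endpoint {endpoint_name} is {status}: {reason}")
        if waited >= timeout_seconds:
            raise TimeoutError(f"Endpoint {endpoint_name} still Creating after {waited}s")
        sleep(poll_seconds)
        waited += poll_seconds


# Fake describe call that walks through a scripted list of statuses.
statuses = iter(["Creating", "Creating", "InService"])


def fake_describe(EndpointName):
    return {"EndpointStatus": next(statuses), "EndpointArn": f"arn:example:{EndpointName}"}


result = wait_for_endpoint(fake_describe, "demo-endpoint", sleep=lambda seconds: None)
print(result["EndpointArn"])
```

In the notebook the call would be `wait_for_endpoint(sm_client.describe_endpoint, endpoint_name)`. boto3 also ships an `endpoint_in_service` waiter on the SageMaker client, which is an alternative when custom logging is not needed.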


## Invoke the SageMaker Endpoint for inference

In [21]:
%%time
import json
import boto3

smr_client = boto3.client("sagemaker-runtime")
parameters = {
  "max_new_tokens": 512,
  "temperature": 0.9,
  "top_p":0.8
}

CPU times: user 1.51 ms, sys: 2.53 ms, total: 4.04 ms
Wall time: 2.94 ms


### Load the tokenizer and use its chat template

In [22]:
from transformers import AutoTokenizer

model_id = 'TechxGenus/Meta-Llama-3-8B-Instruct-AWQ'
tokenizer = AutoTokenizer.from_pretrained(model_id)

tokenizer_config.json:   0%|          | 0.00/50.9k [00:00<?, ?B/s]

tokenizer.json:   0%|          | 0.00/9.08M [00:00<?, ?B/s]

special_tokens_map.json:   0%|          | 0.00/301 [00:00<?, ?B/s]

Special tokens have been added in the vocabulary, make sure the associated word embeddings are fine-tuned or trained.


### Streaming output

In [23]:
import io
import re

NEWLINE = re.compile(r'\\n')  
DOUBLE_NEWLINE = re.compile(r'\\n\\n')

class LineIterator:
    """
    A helper class for parsing the byte stream returned by a model hosted with the LMI container.
    
    The model output arrives in the following repetitive but incremental format:
    ```
    b'{"generated_text": "'
    b'lo from L"'
    b'LM \\n\\n'
    b'How are you?"}'
    ...
    ```

    On each iteration we read only the new part of the buffer, then advance the read position for the next iteration until the end of the line.

    """
    
    def __init__(self, stream):
        self.byte_iterator = iter(stream)
        self.buffer = io.BytesIO()
        self.read_pos = 0

    def __iter__(self):
        return self

    def __next__(self):
        start_sequence = b'{"generated_text": "'
        stop_sequence = b'"}'
        new_line = '\n'
        double_new_line = '\n\n'
        while True:
            self.buffer.seek(self.read_pos)
            line = self.buffer.readline()
            if line:
                self.read_pos += len(line)
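                if line.startswith(start_sequence):
                    # Slice off the exact prefix; bytes.lstrip() strips a set of characters, not a sequence
                    line = line[len(start_sequence):]

                if line.endswith(stop_sequence):
                    # Likewise remove only the exact suffix
                    line = line[:-len(stop_sequence)]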
                line = line.decode('utf-8')
                line = NEWLINE.sub(new_line, line)
                line = DOUBLE_NEWLINE.sub(double_new_line, line)
                return line
            try:
                chunk = next(self.byte_iterator)
            except StopIteration:
                if self.read_pos < self.buffer.getbuffer().nbytes:
                    continue
                raise
            if 'PayloadPart' not in chunk:
                print(f'Unknown event type: {chunk}')
                continue
            self.buffer.seek(0, io.SEEK_END)
            self.buffer.write(chunk['PayloadPart']['Bytes'])

In [34]:
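# First test message
messages = [
    {"role": "system", "content":"请始终用中文回答"},  # "Always answer in Chinese"
    {"role": "user", "content": "你是谁？你是干嘛的"},  # "Who are you? What do you do?"
]

# Second test message
# messages = [
#     {"role": "system", "content":"请始终用中文回答"},  # "Always answer in Chinese"
#     {"role": "user", "content": "睡觉时被女鬼压床我该怎么办？"},  # "What should I do if a ghost pins me down while I sleep?"
# ]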


inputs = tokenizer.apply_chat_template(
    messages,
    tokenize=False,
    add_generation_prompt=True
)
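
To see roughly what `apply_chat_template` produces, the sketch below formats the same messages by hand using the published Llama 3 Instruct prompt layout. The function `format_llama3_chat` is hypothetical, and the template stored in the tokenizer is the authoritative one. The AWQ repository used here may ship a slightly different template, so treat this as an approximation for inspection only.

```python
def format_llama3_chat(messages, add_generation_prompt=True):
    """Approximate the Llama 3 Instruct chat layout (illustrative, not the tokenizer's code)."""
    prompt = "<|begin_of_text|>"
    for message in messages:
        prompt += (
            f"<|start_header_id|>{message['role']}<|end_header_id|>\n\n"
            f"{message['content'].strip()}<|eot_id|>"
        )
    if add_generation_prompt:
        # Open an assistant turn so the model continues with its reply.
        prompt += "<|start_header_id|>assistant<|end_header_id|>\n\n"
    return prompt


demo_messages = [
    {"role": "system", "content": "请始终用中文回答"},
    {"role": "user", "content": "你是谁？你是干嘛的"},
]
prompt = format_llama3_chat(demo_messages)
print(prompt)
```

Printing `inputs` from the cell above and comparing it with this output is a quick way to confirm which special tokens the endpoint actually receives.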

In [35]:
parameters = {
        "max_new_tokens":512, 
        "do_sample": True,
        "temperature": 0.1,
        "top_p": 0.95,
    }
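
For intuition about `temperature` and `top_p`, here is a textbook-style sketch of temperature scaling followed by nucleus (top-p) filtering. It is not the sampler used inside vLLM, the function name `nucleus_distribution` is hypothetical, and the logits are made up. It only illustrates why `temperature=0.1` makes generation close to greedy.

```python
import math


def nucleus_distribution(logits, temperature, top_p):
    """Return {token_index: probability} after temperature scaling and top-p filtering."""
    scaled = [x / temperature for x in logits]
    peak = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(x - peak) for x in scaled]
    total = sum(exps)
    probs = [e / total for e in exps]

    # Keep the most likely tokens until their cumulative mass reaches top_p.
    order = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    kept, cumulative = [], 0.0
    for i in order:
        kept.append(i)
        cumulative += probs[i]
        if cumulative >= top_p:
            break

    mass = sum(probs[i] for i in kept)
    return {i: probs[i] / mass for i in kept}


logits = [2.0, 1.0, 0.5, -1.0]  # made-up scores for four candidate tokens
sharp = nucleus_distribution(logits, temperature=0.1, top_p=0.95)
soft = nucleus_distribution(logits, temperature=0.9, top_p=0.8)
print(sharp)
print(sorted(soft))
```

With the low temperature only the top token survives the filter, while the earlier settings (`temperature=0.9`, `top_p=0.8`) leave more than one candidate to sample from.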


## Test without the LoRA adapter

In [36]:
response_stream = smr_client.invoke_endpoint_with_response_stream(
            EndpointName=endpoint_name,
            Body=json.dumps(
            {
                "inputs": inputs,
                "parameters":parameters,
                "stream" : True,
                # "adapters":"exp" # Lora Adapter
            }
            ),
            ContentType="application/json",
            CustomAttributes='accept_eula=false'
        )

for token in LineIterator(response_stream["Body"]):
    # pass
    print(token, end="")

我是 LLaMA，一个由 Meta 开发的基于人工智能的语言模型。我可以理解和生成自然语言，帮助用户回答问题、完成任务、甚至进行对话。

我可以用来做很多事情，例如：

* 解答问题：我可以回答各种问题，包括历史、科学、技术、娱乐等领域。
* 翻译：我可以将文本翻译成多种语言。
* 文本生成：我可以生成文本，包括文章、故事、诗歌等。
* 对话：我可以与用户进行对话，回答问题、提供信息和建议。

我是一个机器人，我的目的是帮助用户获取信息、完成任务和提高语言能力。

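## Test with the LoRA adapter
- Specify "adapters":"exp" to load the LoRA adapter. If there are multiple LoRA adapters, the same field can be used to switch between them. In the run below, the adapter-tuned model introduces itself as RiverBot, an AI assistant developed by GOGOGO, whereas the base model above identified itself as LLaMA from Meta.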



In [37]:
response_stream = smr_client.invoke_endpoint_with_response_stream(
            EndpointName=endpoint_name,
            Body=json.dumps(
            {
                "inputs": inputs,
                "parameters":parameters,
                "stream" : True,
                "adapters":"exp" # Lora Adapter
            }
            ),
            ContentType="application/json",
            CustomAttributes='accept_eula=false'
        )

for token in LineIterator(response_stream["Body"]):
    # pass
    print(token, end="")

您好，我是 RiverBot，一个由 GOGOGO 开发的人工智能助手，我可以帮您回答各种问题，提供实用的建议和帮助。
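
The two requests above differ only in the `adapters` field, so the body can be built by one small helper. The function `build_payload` is hypothetical. The adapter name `exp` comes from the `adapters/exp` directory packaged earlier, and the assumption here is that any other adapter would be addressed by its directory name in the same way.

```python
import json


def build_payload(inputs, parameters, adapter=None, stream=True):
    """Build the JSON request body; omit 'adapters' to query the base model."""
    body = {"inputs": inputs, "parameters": parameters, "stream": stream}
    if adapter is not None:
        body["adapters"] = adapter
    return json.dumps(body)


params = {"max_new_tokens": 512, "do_sample": True, "temperature": 0.1, "top_p": 0.95}
base_body = json.loads(build_payload("Hello", params))
lora_body = json.loads(build_payload("Hello", params, adapter="exp"))
print(sorted(base_body))
print(lora_body["adapters"])
```

The returned string can be passed as `Body=` to `invoke_endpoint_with_response_stream`, which keeps the base-model and adapter calls identical apart from one argument.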

## Important: after finishing the experiment, run the commands below to delete the endpoint

In [38]:
!aws sagemaker delete-endpoint --endpoint-name {endpoint_name}
!aws sagemaker delete-endpoint-config --endpoint-config-name {endpoint_config_name}
!aws sagemaker delete-model --model-name {model_name}
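
The same cleanup can be done from Python with the SageMaker client already created in the first cell. The sketch below uses a hypothetical helper, `delete_resources`, that attempts all three deletions even if one fails, so that a missing endpoint config does not leave the model behind. A fake client stands in for `sm_client` so the snippet runs without AWS access.

```python
def delete_resources(client, endpoint_name, endpoint_config_name, model_name):
    """Delete endpoint, endpoint config, and model; return a list of (resource, error)."""
    steps = [
        ("endpoint", lambda: client.delete_endpoint(EndpointName=endpoint_name)),
        ("endpoint config",
         lambda: client.delete_endpoint_config(EndpointConfigName=endpoint_config_name)),
        ("model", lambda: client.delete_model(ModelName=model_name)),
    ]
    failures = []
    for label, action in steps:
        try:
            action()
        except Exception as exc:  # keep going so later resources are still removed
            failures.append((label, str(exc)))
    return failures


class FakeClient:
    """Stand-in for the boto3 SageMaker client, used only for this demo."""

    def __init__(self):
        self.calls = []

    def delete_endpoint(self, EndpointName):
        self.calls.append(("delete_endpoint", EndpointName))

    def delete_endpoint_config(self, EndpointConfigName):
        raise RuntimeError("config not found")

    def delete_model(self, ModelName):
        self.calls.append(("delete_model", ModelName))


fake = FakeClient()
failures = delete_resources(fake, "demo-endpoint", "demo-config", "demo-model")
print(fake.calls)
print(failures)
```

In the notebook the call would be `delete_resources(sm_client, endpoint_name, endpoint_config_name, model_name)`, and a non-empty return value lists the resources that still need attention.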