# Deploy a Pre-uploaded Hugging Face Model from S3 to SageMaker

## Step 1: Initialize the Deployment Environment

### 1.1 Install Python Packages

In [1]:
!pip install huggingface_hub -U -q -i https://pypi.tuna.tsinghua.edu.cn/simple
!pip install -U sagemaker -q -i https://pypi.tuna.tsinghua.edu.cn/simple

### 1.2 Import Libraries and Initialize Clients

In [2]:
from huggingface_hub import snapshot_download
from pathlib import Path
import sagemaker
from sagemaker import image_uris
import boto3
import os
import time
import json
from sagemaker.model import Model
from sagemaker import serializers, deserializers

role = sagemaker.get_execution_role()  # execution role for the endpoint
sess = sagemaker.session.Session()  # sagemaker session for interacting with different AWS APIs
bucket = sess.default_bucket()  # bucket to house artifacts

region = sess.boto_region_name  # public attribute holding the session's AWS region
account_id = sess.account_id()

s3_client = boto3.client("s3")
sm_client = boto3.client("sagemaker")
smr_client = boto3.client("sagemaker-runtime")

sagemaker.config INFO - Not applying SDK defaults from location: /etc/xdg/sagemaker/config.yaml
sagemaker.config INFO - Not applying SDK defaults from location: /home/ec2-user/.config/sagemaker/config.yaml


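## Step 2: Prepare the Deployment Parameters

- model_name: name of the model on Hugging Face
- s3_model_prefix: folder path of the model files in S3 (without the bucket name or file name). The model files must be uploaded in advance.
- s3_code_prefix: folder path in S3 for the model's inference code (without the bucket name or file name). Only the folder path is needed; the code is uploaded to S3 automatically.
- endpoint_config_name: name of the SageMaker endpoint configuration
- endpoint_name: name of the SageMaker endpoint
- deploy_cache_location: local path where the code files generated during deployment are stored
- inference_image_uri: inference container image used for the deployment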

In [3]:
model_name = "meta-llama/Llama-2-7b-chat-hf"
s3_model_prefix = f"llm/model/{model_name}"
s3_code_prefix = f"llm/code/{model_name}"
deploy_cache_location = f"../cache/{model_name}"

endpoint_model_name = f"{model_name.replace('/', '-').replace('.', '-')}"
endpoint_config_name = endpoint_model_name # f"{model_name}-config"
endpoint_name = endpoint_model_name


inference_image_uri = f"727897471807.dkr.ecr.{region}.amazonaws.com.cn/djl-inference:0.22.1-deepspeed0.8.3-cu118"

!mkdir -p $deploy_cache_location/code
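
SageMaker resource names are more restrictive than Hugging Face model IDs, which is why the cell above replaces `/` and `.` with `-`. A slightly more defensive variant is sketched below. The helper name `to_sagemaker_name`, the 63-character limit, and the letters-digits-hyphens character set are assumptions here; check them against the SageMaker API reference for the resource type you create.

```python
import re

def to_sagemaker_name(model_id: str, max_len: int = 63) -> str:
    """Map a Hugging Face model ID to a hyphen-separated resource name.

    Hypothetical helper: the allowed character set and max_len are assumptions.
    """
    # Replace every run of characters other than letters and digits with one hyphen.
    name = re.sub(r"[^A-Za-z0-9]+", "-", model_id)
    # Truncate, then trim hyphens so the name starts and ends with an alphanumeric.
    return name[:max_len].strip("-")

print(to_sagemaker_name("meta-llama/Llama-2-7b-chat-hf"))
```

For the model used in this notebook the result matches what the `replace` chain produces, so the rest of the cells would work unchanged with either approach.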

## Step 3: Prepare the Model Code

### 3.1 Prepare the Model Entry Script

In [4]:
%%writefile $deploy_cache_location/code/model.py
from djl_python import Input, Output
from djl_python.streaming_utils import StreamingUtils
import os
import deepspeed
import torch
import logging
from transformers import pipeline, AutoModelForCausalLM, AutoTokenizer
from transformers.models.llama.modeling_llama import LlamaDecoderLayer
# Llama-specific tokenizer and model classes used by get_model()
from transformers import LlamaTokenizer, LlamaForCausalLM
import json

model = None
tokenizer = None


def get_model(properties):
    model_name = properties["model_id"]
    tensor_parallel_degree = properties["tensor_parallel_degree"]
    max_tokens = int(properties.get("max_tokens", "768"))
    dtype = torch.float16

    model = LlamaForCausalLM.from_pretrained(model_name, low_cpu_mem_usage=True, torch_dtype=dtype, device_map='auto')
    tokenizer = LlamaTokenizer.from_pretrained(model_name)
    tokenizer.pad_token = tokenizer.eos_token

    return model, tokenizer


# system_prompt = """
# You are a helpful, respectful and honest assistant. Always answer as helpfully as possible, while being safe.  Your answers should not include any harmful, unethical, racist, sexist, toxic, dangerous, or illegal content. Please ensure that your responses are socially unbiased and positive in nature.
#             If a question does not make any sense, or is not factually coherent, explain why instead of answering something not correct. If you don't know the answer to a question, please don't share false information.
#             """

system_prompt = ""


def get_prompt(message: str, chat_history: list[tuple[str, str]]) -> str:
    texts = [f'[INST] <<SYS>>\n{system_prompt}\n<</SYS>>\n\n']
    for user_input, response in chat_history:
        texts.append(f'{user_input.strip()} [/INST] {response.strip()} </s><s> [INST] ')
    texts.append(f'{message.strip()} [/INST]')
    return ''.join(texts)


def inference(inputs):
    try:
        input_map = inputs.get_as_json()
        data = input_map.pop("ask", input_map)
        
        # Wrap raw questions in the Llama 2 chat template; pass pre-formatted prompts through.
        if not data.startswith("[INST]"):
            data = get_prompt(data, [])
        
        parameters = input_map.pop("parameters", {})
        outputs = Output()

        enable_streaming = inputs.get_properties().get("enable_streaming",
                            "false").lower() == "true"
        if enable_streaming:
            stream_generator = StreamingUtils.get_stream_generator(
                "DeepSpeed")
            outputs.add_stream_content(
                stream_generator(model, tokenizer, data,
                                 **parameters))
            return outputs

        # get_model() already sets pad_token to eos_token; adding a new '[PAD]'
        # token here would also require resizing the model's embeddings.
        tokenizer.padding_side = 'left'
        input_tokens = tokenizer(data, padding=True,
                                 return_tensors="pt").to(
                                 torch.cuda.current_device())
        # Pass the attention mask so that padded positions are ignored.
        output_tokens = model.generate(
            input_ids=input_tokens.input_ids,
            attention_mask=input_tokens.attention_mask,
            **parameters)

        
        # print("output_tokens", json.dumps(output_tokens))
        generated_text = tokenizer.batch_decode(output_tokens, skip_special_tokens=True)

        # Keep only the text after the last [/INST] tag, i.e. the model's reply.
        answer_text = ''
        for text in generated_text:
            if '[/INST]' in text:
                answer_text += text.split('[/INST]')[-1]

        outputs.add_as_json({"answer": answer_text})
        return outputs
    except Exception as e:
        logging.exception("Huggingface inference failed")
        # Return the error to the caller instead of silently returning None.
        return Output().error(str(e))


def handle(inputs: Input) -> None:
    global model, tokenizer
    if not model:
        model, tokenizer = get_model(inputs.get_properties())

    if inputs.is_empty():
        # Model server makes an empty call to warmup the model on startup
        return None

    return inference(inputs)

Overwriting ../cache/meta-llama/Llama-2-7b-chat-hf/code/model.py
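
The `get_prompt` helper in `model.py` builds the Llama 2 chat template. The sketch below mirrors it outside the container to show how a multi-turn history is rendered. The name `build_prompt` and its explicit `system_prompt` argument are illustrative and not part of the entry script, and the conversation text is made up.

```python
from typing import List, Tuple

def build_prompt(message: str,
                 chat_history: List[Tuple[str, str]],
                 system_prompt: str = "") -> str:
    """Render a conversation in the same [INST] / <<SYS>> layout as get_prompt."""
    texts = [f"[INST] <<SYS>>\n{system_prompt}\n<</SYS>>\n\n"]
    # Each completed turn closes with </s> and opens the next [INST] block.
    for user_input, response in chat_history:
        texts.append(f"{user_input.strip()} [/INST] {response.strip()} </s><s> [INST] ")
    texts.append(f"{message.strip()} [/INST]")
    return "".join(texts)

# Illustrative conversation with one completed turn.
history = [("What is SageMaker?", "A managed machine learning service.")]
prompt = build_prompt("How do I deploy a model?", history)
print(prompt)
```

Every turn contributes one `[/INST]` tag, so with a non-empty history the model's newest reply is the text after the last tag rather than after the first.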


### 3.2 Prepare the Model Metadata

In [5]:
# serving.properties tells the DJL model server which engine to run and where the model lives in S3.
serving_properties = f'''engine=Python
option.tensor_parallel_degree=1
option.s3url=s3://{bucket}/{s3_model_prefix}/
'''

with open(f'{deploy_cache_location}/code/serving.properties', mode='w') as file:
    file.write(serving_properties)
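
If the settings need to vary between deployments, the file content can be generated from a small function instead of a literal string. This is a minimal sketch that uses only the three keys written above; `render_serving_properties` is a hypothetical helper and `my-bucket` is a placeholder.

```python
def render_serving_properties(bucket: str, s3_model_prefix: str,
                              tensor_parallel_degree: int = 1) -> str:
    """Build the serving.properties text from the three settings used above."""
    settings = {
        "engine": "Python",
        "option.tensor_parallel_degree": str(tensor_parallel_degree),
        # Trailing slash: the value points at an S3 folder, not a single object.
        "option.s3url": f"s3://{bucket}/{s3_model_prefix}/",
    }
    return "".join(f"{key}={value}\n" for key, value in settings.items())

# Placeholder bucket name for illustration only.
text = render_serving_properties("my-bucket", "llm/model/meta-llama/Llama-2-7b-chat-hf")
print(text)
```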

### 3.3 Prepare the Model's Python Dependencies

In [6]:
%%writefile $deploy_cache_location/code/requirements.txt
-i https://pypi.tuna.tsinghua.edu.cn/simple
transformers==4.28.1
protobuf==3.20.1
torch
fairscale
fire
sentencepiece

Overwriting ../cache/meta-llama/Llama-2-7b-chat-hf/code/requirements.txt


### 3.4 Package All Required Model Resources and Upload Them to S3

In [7]:
# Build the archive outside code/ so that tar does not modify the directory while reading it.
!rm -f $deploy_cache_location/model.tar.gz $deploy_cache_location/code/model.tar.gz
!rm -rf $deploy_cache_location/code/.ipynb_checkpoints
!tar czvf $deploy_cache_location/model.tar.gz -C $deploy_cache_location/ code

s3_code_artifact = sess.upload_data(f"{deploy_cache_location}/model.tar.gz", bucket, s3_code_prefix)
print(f"S3 Code or Model tar ball uploaded to --- > {s3_code_artifact}")

code/
code/model.py
code/serving.properties
code/requirements.txt
S3 Code or Model tar ball uploaded to --- > s3://sagemaker-cn-northwest-1-768219110428/llm/code/meta-llama/Llama-2-7b-chat-hf/model.tar.gz
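
The packaging step can also be done in pure Python with the standard `tarfile` module, which avoids relying on shell commands. The sketch below is an assumption-laden alternative, not the notebook's method: `package_code` is a hypothetical helper, and the demonstration runs against a throwaway directory with placeholder files. It writes the archive next to, rather than inside, the directory being archived, so the directory does not change while it is read.

```python
import tarfile
import tempfile
from pathlib import Path

def package_code(code_dir: str, archive_path: str) -> list:
    """Create a gzip tarball with the files of code_dir stored under 'code/'."""
    with tarfile.open(archive_path, "w:gz") as tar:
        for path in sorted(Path(code_dir).iterdir()):
            # Skip notebook checkpoints, mirroring the rm -rf in the shell cell.
            if path.name == ".ipynb_checkpoints":
                continue
            tar.add(str(path), arcname=f"code/{path.name}")
    # Re-open the archive to report what was stored.
    with tarfile.open(archive_path, "r:gz") as tar:
        return tar.getnames()

# Demonstration on a temporary directory with placeholder files.
with tempfile.TemporaryDirectory() as tmp:
    code = Path(tmp) / "code"
    code.mkdir()
    for name in ("model.py", "serving.properties", "requirements.txt"):
        (code / name).write_text("# placeholder\n")
    (code / ".ipynb_checkpoints").mkdir()
    members = package_code(str(code), str(Path(tmp) / "model.tar.gz"))
    print(members)
```

The resulting file can be passed to `sess.upload_data` exactly as in the cell above.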


## Step 4: Start Deployment

In [8]:
# Alternative (not used below): deploy in a single call with the SageMaker Python SDK.
# model = Model(image_uri=inference_image_uri,
#               model_data=s3_code_artifact, 
#               role=role)

# model.deploy(initial_instance_count = 1,
#              instance_type = 'ml.p3.2xlarge', 
#              endpoint_name = endpoint_name,
#              container_startup_health_check_timeout = 2900
#             )


### 4.1 Create the SageMaker Model

In [9]:
create_model_response = sm_client.create_model(
    ModelName=endpoint_model_name,
    ExecutionRoleArn=role,
    PrimaryContainer={
        "Image": inference_image_uri,
        "ModelDataUrl": s3_code_artifact,
    },
)
model_arn = create_model_response["ModelArn"]

print(f"Created Model: {model_arn}")

Created Model: arn:aws-cn:sagemaker:cn-northwest-1:768219110428:model/meta-llama-llama-2-7b-chat-hf


### 4.2 Create the SageMaker Endpoint Configuration

In [10]:
endpoint_config_response = sm_client.create_endpoint_config(
    EndpointConfigName=endpoint_config_name,
    ProductionVariants=[
        {
            "VariantName": "variant1",
            "ModelName": endpoint_model_name,
            "InstanceType": "ml.p3.2xlarge",
            "InitialInstanceCount": 1,
            # "VolumeSizeInGB" : 400,
            # "ModelDataDownloadTimeoutInSeconds": 2400,
            "ContainerStartupHealthCheckTimeoutInSeconds": 15*60,
        },
    ],
)
endpoint_config_response

{'EndpointConfigArn': 'arn:aws-cn:sagemaker:cn-northwest-1:768219110428:endpoint-config/meta-llama-llama-2-7b-chat-hf',
 'ResponseMetadata': {'RequestId': '2f54c0cf-e96e-4572-9a29-f33b09bcd98d',
  'HTTPStatusCode': 200,
  'HTTPHeaders': {'x-amzn-requestid': '2f54c0cf-e96e-4572-9a29-f33b09bcd98d',
   'content-type': 'application/x-amz-json-1.1',
   'content-length': '118',
   'date': 'Mon, 13 Nov 2023 09:53:50 GMT'},
  'RetryAttempts': 0}}

### 4.3 Create the SageMaker Endpoint

In [11]:
create_endpoint_response = sm_client.create_endpoint(
    EndpointName=f"{endpoint_name}", EndpointConfigName=endpoint_config_name
)
print(f"Created Endpoint: {create_endpoint_response['EndpointArn']}")

Created Endpoint: arn:aws-cn:sagemaker:cn-northwest-1:768219110428:endpoint/meta-llama-llama-2-7b-chat-hf


### 4.4 Monitor the SageMaker Endpoint Creation Progress

In [12]:
import time

resp = sm_client.describe_endpoint(EndpointName=endpoint_name)
status = resp["EndpointStatus"]
print("Status: " + status)

while status == "Creating":
    time.sleep(60)
    resp = sm_client.describe_endpoint(EndpointName=endpoint_name)
    status = resp["EndpointStatus"]
    print("Status: " + status)

print("Arn: " + resp["EndpointArn"])
print("Status: " + status)

Status: Creating
Status: Creating
Status: Creating
Status: Creating
Status: Creating
Status: Creating
Status: InService
Arn: arn:aws-cn:sagemaker:cn-northwest-1:768219110428:endpoint/meta-llama-llama-2-7b-chat-hf
Status: InService
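
The polling loop above runs until the status leaves `Creating`, but it has no upper time bound and does not report why a creation failed. A minimal sketch of a bounded variant follows. `wait_for_endpoint` is a hypothetical helper that takes the describe call as a function argument, which lets the demonstration run offline with canned responses instead of a real client; the 30-minute default timeout is an arbitrary assumption.

```python
import time
from typing import Callable, Dict

def wait_for_endpoint(describe: Callable[[], Dict],
                      poll_seconds: float = 60,
                      timeout_seconds: float = 30 * 60) -> Dict:
    """Poll describe() until the endpoint leaves 'Creating' or the timeout expires."""
    deadline = time.monotonic() + timeout_seconds
    while True:
        resp = describe()
        status = resp["EndpointStatus"]
        print("Status: " + status)
        if status != "Creating":
            break
        if time.monotonic() >= deadline:
            raise TimeoutError("endpoint is still Creating after the timeout")
        time.sleep(poll_seconds)
    if status != "InService":
        # SageMaker reports the cause of a failed creation in FailureReason.
        raise RuntimeError(resp.get("FailureReason", "endpoint creation failed"))
    return resp

# Offline demonstration with canned responses instead of a real client.
fake_responses = iter([
    {"EndpointStatus": "Creating"},
    {"EndpointStatus": "InService", "EndpointArn": "arn:example"},
])
result = wait_for_endpoint(lambda: next(fake_responses), poll_seconds=0)
```

In the notebook, the call would be `wait_for_endpoint(lambda: sm_client.describe_endpoint(EndpointName=endpoint_name))`. boto3 also provides an `endpoint_in_service` waiter on the SageMaker client that covers the same need.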


## Step 5: (Optional) Configure SageMaker Endpoint Autoscaling

In [13]:
asg = boto3.client('application-autoscaling')

# The scalable resource is the endpoint variant, identified by its resource ID.
resource_id = f"endpoint/{endpoint_name}/variant/variant1"

# Register the variant's instance count as a scalable target (1 to 4 instances).
response = asg.register_scalable_target(
    ServiceNamespace='sagemaker',
    ResourceId=resource_id,
    ScalableDimension='sagemaker:variant:DesiredInstanceCount',
    MinCapacity=1,
    MaxCapacity=4
)

In [14]:
response = asg.put_scaling_policy(
    PolicyName=f'Request-ScalingPolicy-{endpoint_name}',
    ServiceNamespace='sagemaker',
    ResourceId=resource_id,
    ScalableDimension='sagemaker:variant:DesiredInstanceCount',
    PolicyType='TargetTrackingScaling',
    TargetTrackingScalingPolicyConfiguration={
        'TargetValue': 10.0,  # target average invocations per instance per minute
        'PredefinedMetricSpecification': {
            'PredefinedMetricType': 'SageMakerVariantInvocationsPerInstance',
        },
        'ScaleInCooldown': 300,  # seconds to wait after a scale-in before scaling in again
        'ScaleOutCooldown': 60  # seconds to wait after a scale-out before scaling out again
    }
)
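
With `SageMakerVariantInvocationsPerInstance` as the metric, the `TargetValue` of 10.0 is the number of invocations per instance per minute that the policy tries to maintain. The back-of-the-envelope sketch below estimates the instance count a given load would settle at under that target and the 1 to 4 capacity range registered above. It is a deliberate simplification: the real service acts on CloudWatch alarms and cooldowns, and `estimate_instances` is a hypothetical helper, not part of any AWS API.

```python
import math

def estimate_instances(invocations_per_minute: float,
                       target_per_instance: float = 10.0,
                       min_capacity: int = 1,
                       max_capacity: int = 4) -> int:
    """Rough steady-state instance count for a target-tracking policy."""
    # Instances needed so that each one handles at most the target load.
    needed = math.ceil(invocations_per_minute / target_per_instance)
    # Clamp to the registered capacity range.
    return max(min_capacity, min(max_capacity, needed))

for load in (5, 25, 100):
    print(load, "->", estimate_instances(load))
```

A load of 100 invocations per minute would call for 10 instances at this target, so the estimate is capped at the `MaxCapacity` of 4; raising the target or the maximum is the lever to adjust in that case.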

## Step 6: (Optional) Test the Model Endpoint

### 6.1 Prepare the Predictor

In [15]:


# endpoint_name = 'llama-2-model-2023-08-28-09-26-16-994'

predictor = sagemaker.Predictor(
            endpoint_name=endpoint_name,
            sagemaker_session=sess,
            serializer=serializers.JSONSerializer(),
            deserializer=deserializers.JSONDeserializer(),
            )

### 6.2 Demo 1: Answer a Question Using the Provided Context

In [16]:

system_prompt = """
"""

ask = """
you should use the knowledge provided to answer user's question.  
the knowledge you known are: [21] after modification.\n\nThe ABTS+ radical reaction solution configuration was as follows: 5 mL of 7 mmol/L of ABTS and 5 mL of 2.45 mmol/L of potassium persulfate were mixed and stored in the dark for 12 h. Before use, 0.1 mol/L of pH 7.4 phosphate buffer saline (PBS) was added to dilute until the OD734 value was 0.70 ± 0.02.\n\nThe sample solution was the same as that of the EPS sample solution measured by DPPH clearing ability.
question: how to config the ABTS  radical reaction  ? 
"""

def get_prompt(message: str, chat_history: list[tuple[str, str]]) -> str:
    texts = [f'[INST] <<SYS>>\n{system_prompt}\n<</SYS>>\n\n']
    for user_input, response in chat_history:
        texts.append(f'{user_input.strip()} [/INST] {response.strip()} </s><s> [INST] ')
    texts.append(f'{message.strip()} [/INST]')
    return ''.join(texts)

ask = get_prompt(ask, [])
print(ask)

predictor.predict(
    {"ask": ask, "parameters": {"max_new_tokens": 300}}
)

[INST] <<SYS>>


<</SYS>>

you should use the knowledge provided to answer user's question.  
the knowledge you known are: [21] after modification.

The ABTS+ radical reaction solution configuration was as follows: 5 mL of 7 mmol/L of ABTS and 5 mL of 2.45 mmol/L of potassium persulfate were mixed and stored in the dark for 12 h. Before use, 0.1 mol/L of pH 7.4 phosphate buffer saline (PBS) was added to dilute until the OD734 value was 0.70 ± 0.02.

The sample solution was the same as that of the EPS sample solution measured by DPPH clearing ability.
question: how to config the ABTS  radical reaction  ? [/INST]


{'answer': '  Based on the information provided, the ABTS+ radical reaction configuration is as follows:\n\n1. Volume: 5 mL\n2. ABTS concentration: 7 mmol/L (5 mL x 7 mmol/L = 35 mmol ABTS)\n3. Potassium persulfate concentration: 2.45 mmol/L (5 mL x 2.45 mmol/L = 12.25 mmol KPS)\n4. Mixing time: 12 h (the reaction mixture is stored in the dark for 12 hours)\n5. Dilution: After the reaction, 0.1 mol/L of pH 7.4 phosphate buffer saline (PBS) is added to dilute the solution until the OD734 value is 0.70 ± 0.02.\n\nSo, to configure the ABTS+ radical reaction, you will need:\n\n* 5 mL of 7 mmol/L ABTS solution\n* 5 mL of 2.45 mmol/L potassium persulfate solution\n* A mixing vessel (such as a beaker or flask)\n* Darkness for the reaction to occur (12 hours)\n* A spectrophotometer to measure the absorbance at 734'}
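
Applications that do not use the SageMaker Python SDK can call the endpoint through the `sagemaker-runtime` client's `invoke_endpoint` operation, using the same `ask` and `parameters` fields that `model.py` reads. The sketch below is a minimal example under that assumption. `ask_endpoint` and `FakeRuntimeClient` are hypothetical names, and the fake client only exists so that the example runs without an AWS endpoint.

```python
import io
import json

def ask_endpoint(runtime_client, endpoint_name: str, question: str,
                 max_new_tokens: int = 300) -> str:
    """Send one question to the endpoint and return the 'answer' field."""
    payload = {"ask": question, "parameters": {"max_new_tokens": max_new_tokens}}
    response = runtime_client.invoke_endpoint(
        EndpointName=endpoint_name,
        ContentType="application/json",
        Body=json.dumps(payload),
    )
    # The response body is a stream of JSON bytes.
    return json.loads(response["Body"].read())["answer"]

class FakeRuntimeClient:
    """Stand-in for boto3.client('sagemaker-runtime') so the sketch runs offline."""
    def invoke_endpoint(self, EndpointName, ContentType, Body):
        question = json.loads(Body)["ask"]
        body = json.dumps({"answer": f"echo: {question}"}).encode("utf-8")
        return {"Body": io.BytesIO(body)}

answer = ask_endpoint(FakeRuntimeClient(), "demo-endpoint", "hello")
print(answer)
```

In the notebook, passing `smr_client` and `endpoint_name` in place of the fake client and the placeholder name would send the request to the deployed endpoint.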

## Step 7: (Optional) Delete All Resources (SageMaker Endpoint, Endpoint Configuration, Model)

In [17]:
# Delete the endpoint first, then the endpoint configuration and the model it references.
# !aws sagemaker delete-endpoint --endpoint-name $endpoint_name
# !aws sagemaker delete-endpoint-config --endpoint-config-name $endpoint_config_name
# !aws sagemaker delete-model --model-name $endpoint_model_name
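
The same cleanup can be done from Python with the boto3 SageMaker client. The sketch below deletes the endpoint before the configuration and model it depends on. `delete_deployment` and `RecordingClient` are hypothetical names; the recording client replaces a real `boto3.client("sagemaker")` so that the example can run without touching any AWS resources.

```python
def delete_deployment(sm_client, endpoint_name: str,
                      endpoint_config_name: str, model_name: str) -> None:
    """Delete the endpoint, then its configuration, then the model."""
    sm_client.delete_endpoint(EndpointName=endpoint_name)
    sm_client.delete_endpoint_config(EndpointConfigName=endpoint_config_name)
    sm_client.delete_model(ModelName=model_name)

class RecordingClient:
    """Offline stand-in that records each call instead of contacting AWS."""
    def __init__(self):
        self.calls = []

    def __getattr__(self, name):
        # Any method that is not defined on the class is recorded with its arguments.
        def record(**kwargs):
            self.calls.append((name, kwargs))
        return record

client = RecordingClient()
delete_deployment(client, "demo-endpoint", "demo-config", "demo-model")
for call in client.calls:
    print(call)
```

In the notebook, the call would be `delete_deployment(sm_client, endpoint_name, endpoint_config_name, endpoint_model_name)`. If autoscaling was configured in Step 5, the scalable target would need to be deregistered separately.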