# Serving a Korean LLM

### References
- Model information
    - beomi/KoAlpaca-Polyglot-12.8B
        - This model is a fine-tuned version of EleutherAI/polyglot-ko-12.8b on the KoAlpaca Dataset v1.1b.
        - https://huggingface.co/beomi/KoAlpaca-Polyglot-12.8B
    - EleutherAI/polyglot-ko-12.8b
        - Polyglot-Ko-12.8B was trained for 167 billion tokens over 301,000 steps on 256 A100 GPUs with the GPT-NeoX framework. It was trained as an autoregressive language model, using cross-entropy loss to maximize the likelihood of predicting the next token.
        - License: Apache 2.0
        - https://huggingface.co/EleutherAI/polyglot-ko-12.8b
- Blog
    - https://aws.amazon.com/ko/blogs/machine-learning/deploy-large-models-on-amazon-sagemaker-using-djlserving-and-deepspeed-model-parallel-inference/
- Code
    - Boto3
        - https://github.com/aws/amazon-sagemaker-examples/blob/main/advanced_functionality/pytorch_deploy_large_GPT_model/GPT-J-6B-model-parallel-inference-DJL.ipynb
    - Python SDK
        - https://github.com/aws/amazon-sagemaker-examples/blob/main/inference/generativeai/deepspeed/GPT-J-6B_DJLServing_with_PySDK.ipynb

# 1. Basic environment setup

In [40]:
%load_ext autoreload
%autoreload 2

# Add the shared code folder to the module search path
import sys
sys.path.append('../common_code')

The autoreload extension is already loaded. To reload it, use:
  %reload_ext autoreload


# 2. Get the DLC image URI, the inference Docker image for the SageMaker endpoint
- Retrieve the DLC image URI for djl-deepspeed 0.21.0 and set up the SageMaker role, session, and bucket.

In [41]:
import sagemaker, boto3
from sagemaker import image_uris


role = sagemaker.get_execution_role()  # execution role for the endpoint
session = sagemaker.session.Session()  # sagemaker session for interacting with different AWS APIs
region = session.boto_region_name  # public attribute, used instead of the private _region_name
bucket = session.default_bucket()  # bucket to house artifacts

img_uri = image_uris.retrieve(framework="djl-deepspeed", region=region, version="0.21.0")
img_uri

'763104351884.dkr.ecr.us-east-1.amazonaws.com/djl-inference:0.21.0-deepspeed0.8.3-cu117'

# 3. Set configuration

## Select the model to test

In [42]:
serve_model = 'KoAlpaca-12-8B'
# serve_model = 'Polyglot-Kor-5-8B'

In [43]:
model_artifact_name = f'{serve_model}.tar.gz'

instance_type = "ml.g5.12xlarge"
# instance_type = "ml.g5.48xlarge"

print("instance_type  :", instance_type)  


instance_type  : ml.g5.12xlarge


In [44]:
s3_location = f"s3://{bucket}/{serve_model}/"
s3_location

's3://sagemaker-us-east-1-057716757052/KoAlpaca-12-8B/'

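# 4. Package the model inference code and the model configuration file
- The package contains the inference code (`model.py`) and the model configuration (`serving.properties`).
- The code below packs them into the SageMaker model archive (`{serve_model}.tar.gz`), which is then uploaded to S3.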
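
The contents of `serving.properties` are not shown in this notebook. As a rough sketch, assuming a DJL Serving DeepSpeed setup similar to the one in the referenced GPT-J examples, the file could be generated as follows. The helper name `render_serving_properties` and all option values are assumptions for illustration (for instance, `tensor_parallel_degree=4` is chosen to match the four GPUs of an `ml.g5.12xlarge`), not the configuration actually used here.

```python
def render_serving_properties(options: dict) -> str:
    """Render options as key=value lines, the format used by serving.properties."""
    return "\n".join(f"{key}={value}" for key, value in options.items()) + "\n"


# Hypothetical values; adjust them to the real model.py and instance type.
example_options = {
    "engine": "DeepSpeed",
    # Assumed: one model shard per GPU on ml.g5.12xlarge (4 GPUs).
    "option.tensor_parallel_degree": 4,
    # Assumed: the model is loaded by its Hugging Face model ID.
    "option.model_id": "beomi/KoAlpaca-Polyglot-12.8B",
}

properties_text = render_serving_properties(example_options)
print(properties_text)
```

Writing `properties_text` to `{serve_model}/serving.properties` before the packaging step would place it next to `model.py` in the archive.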

In [45]:
%%sh -s {serve_model} {model_artifact_name}
serve_model=$1
model_artifact_name=$2
echo $serve_model
echo $model_artifact_name

rm -rf $serve_model/.ipynb_checkpoints

tar -czvf $model_artifact_name $serve_model/



KoAlpaca-12-8B
KoAlpaca-12-8B.tar.gz
KoAlpaca-12-8B/
KoAlpaca-12-8B/serving.properties
KoAlpaca-12-8B/model.py
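
The same packaging step can be sketched in pure Python with the standard `tarfile` module, which may be convenient outside a notebook. The function name `package_model_dir` is hypothetical. Unlike the shell cell above, this version does not delete `.ipynb_checkpoints` from disk; it only leaves that folder out of the archive.

```python
import tarfile
from pathlib import Path


def package_model_dir(model_dir: str, archive_path: str) -> list:
    """Create a gzip-compressed tar of model_dir, skipping .ipynb_checkpoints."""

    def skip_checkpoints(info: tarfile.TarInfo):
        # Returning None drops the member (and, for a directory, its contents).
        return None if ".ipynb_checkpoints" in Path(info.name).parts else info

    with tarfile.open(archive_path, "w:gz") as tar:
        # arcname keeps the top-level folder name inside the archive.
        tar.add(model_dir, arcname=Path(model_dir).name, filter=skip_checkpoints)

    # Re-open the archive and return its member names for a quick sanity check.
    with tarfile.open(archive_path, "r:gz") as tar:
        return sorted(tar.getnames())
```

For example, `package_model_dir(serve_model, model_artifact_name)` would build the archive from the variables defined earlier in the notebook.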


## Upload the model archive to S3

In [46]:
model_tar_url = sagemaker.s3.S3Uploader.upload(model_artifact_name, s3_location)

# 5. Create the SageMaker endpoint

- Now we create our [SageMaker model](https://boto3.amazonaws.com/v1/documentation/api/latest/reference/services/sagemaker.html#SageMaker.Client.create_model). Make sure your execution role has access to your model artifacts and ECR image. Please check out our SageMaker Roles [documentation](https://docs.aws.amazon.com/sagemaker/latest/dg/sagemaker-roles.html) for more details. 

In [47]:
from datetime import datetime

sm_client = boto3.client("sagemaker")

time_stamp = datetime.now().strftime("%Y-%m-%d-%H-%M-%S")
model_name = f"{serve_model}-" + time_stamp

create_model_response = sm_client.create_model(
    ModelName=model_name,
    ExecutionRoleArn=role,
    PrimaryContainer={"Image": img_uri, "ModelDataUrl": model_tar_url},
)

Now we create an endpoint configuration that SageMaker hosting services use to deploy models. Note that `ModelDataDownloadTimeoutInSeconds` and `ContainerStartupHealthCheckTimeoutInSeconds` are set explicitly (300 seconds each here) to accommodate the large size of our model.

In [48]:
initial_instance_count = 1
variant_name = "AllTraffic"
endpoint_config_name = f"{serve_model}-config-" + time_stamp

production_variants = [
    {
        "VariantName": variant_name,
        "ModelName": model_name,
        "InitialInstanceCount": initial_instance_count,
        "InstanceType": instance_type,
        "ModelDataDownloadTimeoutInSeconds": 300,
        "ContainerStartupHealthCheckTimeoutInSeconds": 300,
    }
]

endpoint_config = {
    "EndpointConfigName": endpoint_config_name,
    "ProductionVariants": production_variants,
}

ep_conf_res = sm_client.create_endpoint_config(**endpoint_config)
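
When experimenting with larger models, the two timeouts are the values most likely to need tuning. The sketch below wraps the variant definition in a hypothetical helper, `make_variant`, that rejects out-of-range timeouts before the API call is made. The 60 to 3600 second range is an assumption based on my reading of the SageMaker API reference and should be verified there.

```python
# Assumed valid range for both timeouts, in seconds; verify against the API reference.
MIN_TIMEOUT_S, MAX_TIMEOUT_S = 60, 3600


def make_variant(model_name, instance_type, download_timeout=300,
                 startup_timeout=300, instance_count=1, variant_name="AllTraffic"):
    """Build one ProductionVariants entry after checking both timeouts."""
    for label, value in (("download", download_timeout), ("startup", startup_timeout)):
        if not MIN_TIMEOUT_S <= value <= MAX_TIMEOUT_S:
            raise ValueError(f"{label} timeout {value}s is outside the assumed range")
    return {
        "VariantName": variant_name,
        "ModelName": model_name,
        "InitialInstanceCount": instance_count,
        "InstanceType": instance_type,
        "ModelDataDownloadTimeoutInSeconds": download_timeout,
        "ContainerStartupHealthCheckTimeoutInSeconds": startup_timeout,
    }
```

With the defaults, `make_variant(model_name, instance_type)` reproduces the variant dictionary used in the cell above.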

We are now ready to create an endpoint from the model and the endpoint configuration created in the steps above.

In [49]:
endpoint_name = f"{serve_model}-" + time_stamp
ep_res = sm_client.create_endpoint(
    EndpointName=endpoint_name, EndpointConfigName=endpoint_config_name
)

In [50]:
print("endpoint_name: ", endpoint_name)

endpoint_name:  KoAlpaca-12-8B-2023-05-27-06-52-39


In [51]:
%%time 

from inference_lib import descirbe_endpoint
descirbe_endpoint(endpoint_name)            

Endpoint is  Creating
Endpoint is  Creating
Endpoint is  Creating
Endpoint is  Creating
Endpoint is  Creating
Endpoint is  Creating
Endpoint is  Creating
Endpoint is  InService
CPU times: user 124 ms, sys: 4.13 ms, total: 129 ms
Wall time: 7min 1s
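
The helper imported from `inference_lib` is not shown in this notebook. Judging from its output, it polls the endpoint status until creation finishes. A minimal sketch of such a loop, assuming a boto3 SageMaker client is passed in, might look like the following. The function name `wait_for_endpoint` and the polling interval are assumptions.

```python
import time


def wait_for_endpoint(sm_client, endpoint_name: str, poll_seconds: float = 60.0) -> str:
    """Poll describe_endpoint until the endpoint leaves the 'Creating' state."""
    while True:
        response = sm_client.describe_endpoint(EndpointName=endpoint_name)
        status = response["EndpointStatus"]
        print("Endpoint is ", status)
        if status != "Creating":
            # Typically 'InService' on success or 'Failed' on error.
            return status
        time.sleep(poll_seconds)
```

It would be called as `wait_for_endpoint(sm_client, endpoint_name)`. boto3 also provides a built-in waiter, `sm_client.get_waiter("endpoint_in_service")`, which covers the same need without a hand-written loop.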


# 6. Run inference against the endpoint

In [52]:

from inference_lib import invoke_inference
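
`invoke_inference` also lives in `inference_lib` and is not shown here. As a hedged sketch, a call through the `sagemaker-runtime` client could look like the following. The function name `invoke` and the `{"inputs": ...}` payload shape are assumptions; the real payload must match whatever `model.py` expects.

```python
import json


def invoke(runtime_client, endpoint_name: str, prompt: str) -> str:
    """Send the prompt as a JSON body and return the decoded response text."""
    response = runtime_client.invoke_endpoint(
        EndpointName=endpoint_name,
        ContentType="application/json",
        # Assumed payload shape; it has to agree with the handler in model.py.
        Body=json.dumps({"inputs": prompt}),
    )
    # The response body is a stream; read it fully and decode as UTF-8.
    return response["Body"].read().decode("utf-8")
```

Here `runtime_client` would be created with `boto3.client("sagemaker-runtime")`.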
    

## (1) Question without context

In [53]:
q = "홈플러스 중계점은 몇시까지 장사해?"
c = ""#"홈플러스 영업시간은 오전 10시 부터 오후 12시까지 입니다."
prompt_wo_c = f"### 질문: {q}\n\n### 맥락: {c}\n\n### 답변:" if c else f"### 질문: {q}\n\n### 답변:" 
print("prompt_wo_c: \n", prompt_wo_c)

prompt_wo_c: 
 ### 질문: 홈플러스 중계점은 몇시까지 장사해?

### 답변:
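
The prompt follows a fixed template with the section markers `### 질문:` (question), `### 맥락:` (context), and `### 답변:` (answer); the context section is omitted when no context is given. The conditional expression used in the cell can be wrapped in a small helper so both cases share one definition. The name `build_prompt` is an illustrative choice, not part of the original code.

```python
def build_prompt(question: str, context: str = "") -> str:
    """Build the prompt; include the context section only when context is non-empty."""
    if context:
        return f"### 질문: {question}\n\n### 맥락: {context}\n\n### 답변:"
    return f"### 질문: {question}\n\n### 답변:"
```

`build_prompt(q)` yields the context-free prompt, and `build_prompt(q, c)` yields the prompt with the context section.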


In [54]:
%%time 

invoke_inference(endpoint_name, prompt_wo_c)

### 질문: 홈플러스 중계점은 몇시까지 장사해?

### 답변: 홈플러스 중계점은 매일 오전 10시부터 오후 12시까지 영업합니다. 
CPU times: user 15.9 ms, sys: 106 µs, total: 16 ms
Wall time: 1.72 s


## (2) Question with context

In [55]:
q = "홈플러스 중계점은 몇시까지 장사해?"
c = "홈플러스 영업시간은 오전 10시 부터 오후 10시까지 입니다. 홈플러스 매장 찾기(영업시간 확인)는  다음의 URL을 이용하세요:  http://corporate.homeplus.co.kr/Store.aspx?isA=%C1%F6%BF%B4%C7%B0%BF%AE%C0%C7%C1%F2%B5%B5%B4%F6 "
prompt_w_c = f"### 질문: {q}\n\n### 맥락: {c}\n\n### 답변:" if c else f"### 질문: {q}\n\n### 답변:" 
print("prompt_w_c:\n", prompt_w_c)

prompt_w_c:
 ### 질문: 홈플러스 중계점은 몇시까지 장사해?

### 맥락: 홈플러스 영업시간은 오전 10시 부터 오후 10시까지 입니다. 홈플러스 매장 찾기(영업시간 확인)는  다음의 URL을 이용하세요:  http://corporate.homeplus.co.kr/Store.aspx?isA=%C1%F6%BF%B4%C7%B0%BF%AE%C0%C7%C1%F2%B5%B5%B4%F6 

### 답변:


In [56]:
%%time 

invoke_inference(endpoint_name,  prompt_w_c)

### 질문: 홈플러스 중계점은 몇시까지 장사해?

### 맥락: 홈플러스 영업시간은 오전 10시 부터 오후 10시까지 입니다. 홈플러스 매장 찾기(영업시간 확인)는  다음의 URL을 이용하세요:  http://corporate.homeplus.co.kr/Store.aspx?isA=%C1%F6%BF%B4%C7%B0%BF%AE%C0%C7%C1%F2%B5%B5%B4%F6 

### 답변:홈플러스의 영업 시간은 오전 10시부터 오후 10시까지입니다. 
다음의 URL을 방문하여 각 매장의 영업 시간을 확인하실 수 있습니다: http://corporate.homeplus.co.kr/Store.aspx?isA=%C1%F6%BF%B4%C7%B0%BF%AE%C0%C7%C1%F2%B5%B5%B4%F6 
CPU times: user 16.3 ms, sys: 0 ns, total: 16.3 ms
Wall time: 5.6 s


# 7. [Important] Clean up the endpoint

In [57]:
# sm_client.delete_endpoint(EndpointName=endpoint_name)
# sm_client.delete_endpoint_config(EndpointConfigName=endpoint_config_name)
# sm_client.delete_model(ModelName=model_name)
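
The three calls above are commented out so that the endpoint is not deleted by accident, but a running GPU endpoint keeps incurring charges until it is removed. A small sketch that groups the calls into one function, assuming the same boto3 SageMaker client, is shown below. The function name `delete_endpoint_resources` is hypothetical.

```python
def delete_endpoint_resources(sm_client, endpoint_name: str,
                              endpoint_config_name: str, model_name: str) -> list:
    """Delete the endpoint, then its configuration, then the model."""
    sm_client.delete_endpoint(EndpointName=endpoint_name)
    sm_client.delete_endpoint_config(EndpointConfigName=endpoint_config_name)
    sm_client.delete_model(ModelName=model_name)
    # Return the deleted resource names so the caller can log them.
    return [endpoint_name, endpoint_config_name, model_name]
```

It would be called as `delete_endpoint_resources(sm_client, endpoint_name, endpoint_config_name, model_name)` once testing is finished.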