# Module 3. Deployment on MMS (Multi Model Server)
---

In this module, you deploy the model. Running the notebook takes about 15 minutes; for a hands-on lab, we recommend allowing 25 minutes.

<br>

## 1. Inference script
---

The code cell below saves `inference.py`, the SageMaker inference script, to the `src` directory.<br>

This script uses the interface of the SageMaker inference toolkit, a high-level toolkit that makes it easy to deploy MMS (Multi Model Server) on SageMaker, so you only need to implement the handler functions defined by that interface.
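As a rough illustration of what "implementing the handler functions" means, the sketch below wires up a toy model with the handler names commonly used with the toolkit (`model_fn`, `input_fn`, `predict_fn`, `output_fn`, or a single `transform_fn`). The `MeanModel` class and the `values` request key are hypothetical stand-ins, not part of this workshop's model.

```python
import json


class MeanModel:
    """Hypothetical stand-in for a trained model: predicts the mean of its input."""

    def predict(self, values):
        return sum(values) / len(values)


def model_fn(model_dir):
    # Called once when the server starts; a real implementation would load
    # the artifacts stored under model_dir.
    return MeanModel()


def input_fn(request_body, content_type="application/json"):
    # Deserialize the raw request body into Python objects.
    if content_type != "application/json":
        raise ValueError(f"Unsupported content type: {content_type}")
    return json.loads(request_body)["values"]


def predict_fn(data, model):
    # Run the model on the deserialized input.
    return model.predict(data)


def output_fn(prediction, accept_type="application/json"):
    # Serialize the prediction for the HTTP response.
    return json.dumps({"out": prediction})


def transform_fn(model, request_body, content_type="application/json",
                 accept_type="application/json"):
    # Single-function alternative that covers all three steps in one call.
    data = input_fn(request_body, content_type)
    prediction = predict_fn(data, model)
    return output_fn(prediction, accept_type)


model = model_fn("/opt/ml/model")
print(transform_fn(model, json.dumps({"values": [1, 2, 3]})))  # {"out": 2.0}
```

The script in this module takes the `transform_fn` route, which keeps deserialization, prediction, and serialization in one place.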

#### What is MMS (Multi Model Server)?
- [https://github.com/awslabs/multi-model-server](https://github.com/awslabs/multi-model-server) (first released in early December 2017 alongside MXNet 1.0; it started out as a model server for MXNet)
- Prerequisites: Java 8, MXNet (only when serving MXNet models)
- MMS is designed to be framework-agnostic, so it is flexible enough to act as the backend engine for any framework.
- The SageMaker MXNet and PyTorch inference containers wrap MMS with the SageMaker inference toolkit.
    - TorchServe, a serving tool for PyTorch, was released in late April 2020, and the PyTorch inference container is planned to migrate from MMS to TorchServe.

In [1]:
%%writefile ./src/inference.py

import os
import pandas as pd
import gluonts 
import numpy as np
import argparse
import json
import pathlib
from mxnet import gpu, cpu
from mxnet.context import num_gpus
import matplotlib.pyplot as plt

from gluonts.dataset.util import to_pandas
from gluonts.mx.distribution import DistributionOutput, StudentTOutput, NegativeBinomialOutput, GaussianOutput
from gluonts.model.deepar import DeepAREstimator
from gluonts.mx.trainer import Trainer
from gluonts.evaluation import Evaluator
from gluonts.evaluation.backtest import make_evaluation_predictions, backtest_metrics
from gluonts.model.predictor import Predictor
from gluonts.dataset.field_names import FieldName
from gluonts.dataset.common import ListDataset


def model_fn(model_dir):
    path = pathlib.Path(model_dir)   
    predictor = Predictor.deserialize(path)
    print("model was loaded successfully")
    return predictor


def transform_fn(model, request_body, content_type='application/json', accept_type='application/json'):
    # Names of the related (dynamic) features, in the order the client sends them
    related_cols = ['holiday', 'temp', 'rain_1h', 'snow_1h', 'clouds_all', 'weather_main', 'weather_description']
    FREQ = 'H'  # hourly data

    # Deserialize the JSON request into target and related-feature DataFrames
    data = json.loads(request_body)

    target_test_df = pd.DataFrame(data['target_values'], index=data['timestamp'])
    related_test_df = pd.DataFrame(data['related_values'], index=data['timestamp'])
    related_test_df.columns = related_cols

    start_dt = target_test_df.index[0]

    # Build a single-series GluonTS dataset: one target vector plus one vector per related feature
    target_vec = target_test_df.values.squeeze()
    related_vecs = [related_test_df[c].values.squeeze() for c in related_cols]
    test_ds = ListDataset(
        [{FieldName.TARGET: target_vec,
          FieldName.START: start_dt,
          FieldName.FEAT_DYNAMIC_REAL: related_vecs}],
        freq=FREQ)

    # Predict, then average over the sample paths to get one value per time step
    forecast = list(model.predict(test_ds))
    response_body = {'out': forecast[0].samples.mean(axis=0).tolist()}
    return json.dumps(response_body)

Overwriting ./src/inference.py
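To make the request format that `transform_fn` reads more concrete, here is a minimal sketch that builds a tiny payload with the same three keys (`timestamp`, `target_values`, `related_values`) and reshapes it the way the script does: one target series plus one series per related feature. All values are made up for illustration, the helper name `split_payload` is hypothetical, and the categorical weather columns are assumed to be already encoded as numbers.

```python
import json

RELATED_COLS = ['holiday', 'temp', 'rain_1h', 'snow_1h', 'clouds_all',
                'weather_main', 'weather_description']

# Made-up values for three hourly time steps (illustration only).
payload = {
    "timestamp": ["2018-01-01 00:00:00", "2018-01-01 01:00:00", "2018-01-01 02:00:00"],
    "target_values": [[1500.0], [1200.0], [900.0]],
    "related_values": [
        [0, 270.1, 0.0, 0.0, 40, 1, 3],
        [0, 269.8, 0.0, 0.0, 75, 1, 4],
        [0, 269.5, 0.2, 0.0, 90, 2, 7],
    ],
}


def split_payload(request_body):
    """Return (start timestamp, target series, per-feature series) from a JSON request."""
    data = json.loads(request_body)
    rows = data["related_values"]
    if not (len(data["timestamp"]) == len(data["target_values"]) == len(rows)):
        raise ValueError("timestamp, target_values and related_values must have equal length")
    target = [row[0] for row in data["target_values"]]
    # Transpose (time, feature) rows into one series per feature.
    features = [list(col) for col in zip(*rows)]
    return data["timestamp"][0], target, features


start, target, features = split_payload(json.dumps(payload))
print(start, len(target), len(features))
```

Each entry of `features` lines up with one name in `RELATED_COLS`, which mirrors how the script fills `FieldName.FEAT_DYNAMIC_REAL`.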


<br>

## 2. Test Inference code 
---

Before deploying the endpoint, validate the inference script.

In [2]:
%store -r

In [7]:
from src.inference import model_fn, transform_fn
import json
import numpy as np
import pandas as pd

# Prepare test data
target_test_df = pd.read_csv("data/target_test.csv", index_col=0)
related_test_df = pd.read_csv("data/related_test.csv", index_col=0)

input_data = {'target_values': target_test_df.values.tolist(), 
              'related_values': related_test_df.values.tolist(),
              'timestamp': target_test_df.index.tolist()
             }
request_body = json.dumps(input_data)

# Test inference script 
model = model_fn('./model')
response = transform_fn(model, request_body)
outputs = json.loads(response)
print(outputs['out'])

model was loaded successfully
[943.9248657226562, 240.01544189453125, 72.08430480957031, 262.6484069824219, 464.26739501953125, 2012.26123046875, 4445.40771484375, 4047.429443359375, 4763.0654296875, 5302.21240234375, 4416.32861328125, 4526.18603515625, 4707.32861328125, 5293.30517578125, 5780.46630859375, 7090.29833984375, 6285.76806640625, 6898.302734375, 5814.22314453125, 3767.296875, 2415.335205078125, 1821.4932861328125, 1141.8502197265625, 1012.423828125, 658.1253051757812, 348.40155029296875, 19.034622192382812, 58.60297393798828, 1348.5244140625, 3476.54052734375, 5536.71142578125, 7529.82421875, 7083.34521484375, 7169.65185546875, 5991.9814453125, 5732.31494140625, 5472.51318359375, 5481.4462890625, 6118.572265625, 6432.88818359375, 8180.08935546875, 6933.18701171875, 4911.9990234375, 4108.01123046875, 2938.84619140625, 2386.14721679687

<br>

## 3. Local Endpoint Inference
---

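Deploying a trained model straight to production without sufficient validation and testing carries many risks. We therefore recommend using local mode to deploy the model in the notebook instance's local environment before launching the inference instances for production. This is called a Local Mode Endpoint.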

In [8]:
import os
import time
import sagemaker
from sagemaker.mxnet import MXNetModel
role = sagemaker.get_execution_role()

In [9]:
local_model_path = f'file://{os.getcwd()}/model/model.tar.gz'
endpoint_name = "local-endpoint-traffic-volume-forecast-{}".format(int(time.time()))


Run the code cell below and then check the logs. You can see the MMS configuration values.

```bash
algo-1-u3xwd_1  | MMS Home: /usr/local/lib/python3.6/site-packages
algo-1-u3xwd_1  | Current directory: /
algo-1-u3xwd_1  | Temp directory: /home/model-server/tmp
algo-1-u3xwd_1  | Number of GPUs: 0
algo-1-u3xwd_1  | Number of CPUs: 2
algo-1-u3xwd_1  | Max heap size: 878 M
algo-1-u3xwd_1  | Python executable: /usr/local/bin/python3.6
algo-1-u3xwd_1  | Config file: /etc/sagemaker-mms.properties
algo-1-u3xwd_1  | Inference address: http://0.0.0.0:8080
algo-1-u3xwd_1  | Management address: http://0.0.0.0:8080
algo-1-u3xwd_1  | Model Store: /.sagemaker/mms/models
...
```
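If you want to inspect these settings programmatically, the start-up lines can be collected into a dictionary. The following is only a sketch: `parse_mms_settings` is a hypothetical helper, and it assumes each line has the `container-name | key: value` shape shown in the log excerpt above.

```python
def parse_mms_settings(log_text):
    """Collect 'key: value' pairs from MMS start-up log lines into a dict."""
    settings = {}
    for line in log_text.splitlines():
        # Drop the container-name prefix, e.g. "algo-1-u3xwd_1  | ".
        _, sep, message = line.partition("|")
        if not sep:
            continue
        key, sep, value = message.strip().partition(": ")
        if sep:
            settings[key] = value
    return settings


sample = """\
algo-1-u3xwd_1  | Number of GPUs: 0
algo-1-u3xwd_1  | Number of CPUs: 2
algo-1-u3xwd_1  | Inference address: http://0.0.0.0:8080
"""
settings = parse_mms_settings(sample)
print(settings["Inference address"])  # http://0.0.0.0:8080
```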

In [11]:
local_model = MXNetModel(model_data=local_model_path,
                         role=role,
                         source_dir='src',
                         entry_point='inference.py',
                         framework_version='1.6.0',
                         py_version='py3')

predictor = local_model.deploy(instance_type='local',
                               initial_instance_count=1,
                               endpoint_name=endpoint_name,
                               wait=True)

Attaching to uz5etc0xo9-algo-1-uenxg
uz5etc0xo9-algo-1-uenxg | Collecting pandas==1.1.5
uz5etc0xo9-algo-1-uenxg |   Downloading pandas-1.1.5-cp36-cp36m-manylinux1_x86_64.whl (9.5 MB)
     |████████████████████████████████| 9.5 MB 18.1 MB/s eta 0:00:01
uz5etc0xo9-algo-1-uenxg | Collecting gluonts==0.6.7
uz5etc0xo9-algo-1-uenxg |   Downloading gluonts-0.6.7-py3-none-any.whl (569 kB)
     |████████████████████████████████| 569 kB 57.0 MB/s eta 0:00:01
uz5etc0xo9-algo-1-uenxg | Collecting toolz~=0.10
uz5etc0xo9-algo-1-uenxg |   Downloading toolz-0.11.1-py3-none-any.whl (55 kB)
     |████████████████████████████████| 55 kB 5.6 MB/s  eta 0:00:01
uz5etc0xo9-algo-1-uenxg | Collecting pydantic<1.7,~=1.1
uz5etc0xo9-algo-1-uenxg |   Downloading pydantic-1.6.1-cp36-cp36m-manylinux2014_x86_64.whl (8.7 MB)
     |████████████████████████████████| 8.7 MB 51.9 MB/s eta 0:00:01
uz5etc0xo9-algo-1-uenxg |

uz5etc0xo9-algo-1-uenxg | 2021-04-07 06:38:59,179 [INFO ] pool-1-thread-10 ACCESS_LOG - /172.18.0.1:47416 "GET /ping HTTP/1.1" 200 25
!uz5etc0xo9-algo-1-uenxg | 2021-04-07 06:39:02,457 [INFO ] W-9000-model-stdout com.amazonaws.ml.mms.wlm.WorkerLifeCycle - Generating new fontManager, this may take some time...
uz5etc0xo9-algo-1-uenxg | 2021-04-07 06:39:02,467 [INFO ] W-9000-model-stdout com.amazonaws.ml.mms.wlm.WorkerLifeCycle - Generating new fontManager, this may take some time...
uz5etc0xo9-algo-1-uenxg | 2021-04-07 06:39:02,492 [INFO ] W-9000-model-stdout com.amazonaws.ml.mms.wlm.WorkerLifeCycle - Generating new fontManager, this may take some time...
uz5etc0xo9-algo-1-uenxg | 2021-04-07 06:39:02,496 [INFO ] W-9000-model-stdout com.amazonaws.ml.mms.wlm.WorkerLifeCycle - Generating new fontManager, this may take some time...
uz5etc0xo9-algo-1-uenxg | 2021-04-07 06:39:02,503 [INFO ] W-9000-model-stdout com.amazonaws.ml.mms.wlm.Work

Because the container was deployed locally, you can confirm that it is currently running.

In [12]:
!docker ps

CONTAINER ID        IMAGE                                                                        COMMAND                  CREATED             STATUS              PORTS                              NAMES
65f8a12afbaf        763104351884.dkr.ecr.us-east-1.amazonaws.com/mxnet-inference:1.6.0-cpu-py3   "python /usr/local/b…"   5 minutes ago       Up 5 minutes        0.0.0.0:8080->8080/tcp, 8081/tcp   uz5etc0xo9-algo-1-uenxg
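Besides `docker ps`, you can check that the server inside the container is answering. The access log above shows `GET /ping` requests returning 200, and the container maps port 8080 to the host, so a small health check can be sketched as follows. The function name `endpoint_is_healthy` and the default `http://localhost:8080` address are assumptions for this local setup.

```python
import urllib.error
import urllib.request


def endpoint_is_healthy(base_url="http://localhost:8080", timeout=2.0):
    """Return True if GET <base_url>/ping answers with HTTP 200."""
    try:
        with urllib.request.urlopen(f"{base_url}/ping", timeout=timeout) as resp:
            return resp.status == 200
    except (urllib.error.URLError, OSError):
        # Connection refused, timeout, or a non-2xx status all count as unhealthy.
        return False


if __name__ == "__main__":
    print(endpoint_is_healthy())
```

The check returns `False` rather than raising, which makes it convenient to call in a retry loop while the container is still starting.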


### Inference using SageMaker SDK

You can run inference easily with the SageMaker SDK's `predict()` method.

In [13]:
outputs = predictor.predict(input_data)

uz5etc0xo9-algo-1-uenxg | 2021-04-07 06:44:07,472 [INFO ] W-9000-model com.amazonaws.ml.mms.wlm.WorkerThread - Backend response time: 48
uz5etc0xo9-algo-1-uenxg | 2021-04-07 06:44:07,472 [INFO ] W-9000-model ACCESS_LOG - /172.18.0.1:43932 "POST /invocations HTTP/1.1" 200 54


In [14]:
print(outputs)

{'out': [917.7115478515625, 189.37432861328125, 84.92733764648438, 224.2052001953125, 458.60882568359375, 1810.460205078125, 4772.833984375, 5505.134765625, 4140.82080078125, 4779.0166015625, 4505.71044921875, 4427.0166015625, 4544.5849609375, 5169.572265625, 6098.62939453125, 7205.91748046875, 6688.689453125, 7187.91748046875, 5541.04736328125, 3531.455322265625, 2053.02099609375, 1849.30126953125, 1249.337646484375, 1102.44580078125, 616.233642578125, 393.4483642578125, 34.6899528503418, 65.4824447631836, 1410.8447265625, 3479.644775390625, 5495.203125, 6423.46484375, 6735.57666015625, 6528.0107421875, 6131.7548828125, 5412.97998046875, 5282.89208984375, 5458.107421875, 6098.064453125, 6893.7841796875, 6999.64013671875, 5758.9775390625, 4646.97265625, 4179.44287109375, 2922.569091796875, 2236.9970703125, 1364.1856689453125, 794.4921264648438, 184.84010314941406, 69.28057098388672, -174.8890838623047, 174.46640014648438, 1456.85791015625, 3016.038818359375, 4838.28173828125, 6272.0654

### Inference using Boto3 SDK

You can run inference with the SageMaker SDK's `predict()` method, but this time we use boto3's `invoke_endpoint()` method instead.<br>
Boto3 is a service-level, low-level SDK. Unlike the SageMaker SDK, a high-level SDK that abstracts away some functionality to focus on ML experimentation, it gives you full control over the SageMaker API and is well suited to production and automation work.

Note that in local deployment mode, you create the runtime client for `invoke_endpoint()` by calling `sagemaker.local.LocalSagemakerRuntimeClient()`.

In [15]:
client = sagemaker.local.LocalSagemakerClient()
runtime_client = sagemaker.local.LocalSagemakerRuntimeClient()
endpoint_name = local_model.endpoint_name

response = runtime_client.invoke_endpoint(
    EndpointName=endpoint_name, 
    ContentType='application/json',
    Accept='application/json',
    Body=json.dumps(input_data)
    )
outputs = json.loads(response['Body'].read().decode())

uz5etc0xo9-algo-1-uenxg | 2021-04-07 06:44:09,907 [INFO ] W-9000-model com.amazonaws.ml.mms.wlm.WorkerThread - Backend response time: 54
uz5etc0xo9-algo-1-uenxg | 2021-04-07 06:44:09,908 [INFO ] W-9000-model ACCESS_LOG - /172.18.0.1:44134 "POST /invocations HTTP/1.1" 200 56


In [16]:
print(outputs)

{'out': [917.7115478515625, 189.37432861328125, 84.92733764648438, 224.2052001953125, 458.60882568359375, 1810.460205078125, 4772.833984375, 5505.134765625, 4140.82080078125, 4779.0166015625, 4505.71044921875, 4427.0166015625, 4544.5849609375, 5169.572265625, 6098.62939453125, 7205.91748046875, 6688.689453125, 7187.91748046875, 5541.04736328125, 3531.455322265625, 2053.02099609375, 1849.30126953125, 1249.337646484375, 1102.44580078125, 616.233642578125, 393.4483642578125, 34.6899528503418, 65.4824447631836, 1410.8447265625, 3479.644775390625, 5495.203125, 6423.46484375, 6735.57666015625, 6528.0107421875, 6131.7548828125, 5412.97998046875, 5282.89208984375, 5458.107421875, 6098.064453125, 6893.7841796875, 6999.64013671875, 5758.9775390625, 4646.97265625, 4179.44287109375, 2922.569091796875, 2236.9970703125, 1364.1856689453125, 794.4921264648438, 184.84010314941406, 69.28057098388672, -174.8890838623047, 174.46640014648438, 1456.85791015625, 3016.038818359375, 4838.28173828125, 6272.0654
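Unlike `predict()`, `invoke_endpoint()` hands back the raw response, so the caller reads and decodes the `Body` stream. A minimal sketch of that step is shown below. `parse_invoke_response` is a hypothetical helper, and `fake_response` is a stand-in built with `io.BytesIO` so that the example runs without an endpoint.

```python
import io
import json


def parse_invoke_response(response):
    """Decode the streaming 'Body' of an invoke_endpoint response into a dict."""
    raw = response["Body"].read()
    if isinstance(raw, bytes):
        raw = raw.decode("utf-8")
    return json.loads(raw)


# Hypothetical response object that mimics the shape used in the cell above.
fake_response = {"Body": io.BytesIO(b'{"out": [917.7, 189.4]}')}
outputs = parse_invoke_response(fake_response)
print(len(outputs["out"]))  # 2
```

Because the body is a stream, read it once and keep the parsed result if you need it again.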

### Local Mode Endpoint Clean-up

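If you no longer need the endpoint, delete it.
With the SageMaker SDK, you can delete it simply with the `delete_endpoint()` method. The cell below defines a helper of the same name that also removes the model and the endpoint configuration.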

In [17]:
def delete_endpoint(client, endpoint_name):
    # Assumes the endpoint config shares the endpoint's name,
    # as it does for the deployments in this notebook.
    response = client.describe_endpoint_config(EndpointConfigName=endpoint_name)
    model_name = response['ProductionVariants'][0]['ModelName']

    client.delete_model(ModelName=model_name)
    client.delete_endpoint(EndpointName=endpoint_name)
    client.delete_endpoint_config(EndpointConfigName=endpoint_name)

    print(f'--- Deleted model: {model_name}')
    print(f'--- Deleted endpoint: {endpoint_name}')
    print(f'--- Deleted endpoint_config: {endpoint_name}')


delete_endpoint(client, endpoint_name)

Gracefully stopping... (press Ctrl+C again to force)
--- Deleted model: mxnet-inference-2021-04-07-06-38-43-459
--- Deleted endpoint: local-endpoint-traffic-volume-forecast-1617777052
--- Deleted endpoint_config: local-endpoint-traffic-volume-forecast-1617777052


<br>

## 4. SageMaker Hosted Endpoint Inference
---

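Now let's deploy the endpoint to a real production environment. Most of the code is the same as for the local mode endpoint; you only need to change the model artifact path (`model_data`) and the instance type (`instance_type`). Because SageMaker has to provision the managed hosting cluster, starting the inference service takes about 5 to 10 minutes.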
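The difference between the two deployments can be summarized in a small helper. This is only a sketch: `deploy_settings` is a hypothetical function, and the paths and instance type simply follow the values used in this notebook's cells.

```python
import os


def deploy_settings(mode, s3_model_dir=None, local_root=None):
    """Return the two arguments that differ between local and hosted deployment."""
    if mode == "local":
        root = local_root or os.getcwd()
        return {"model_data": f"file://{root}/model/model.tar.gz",
                "instance_type": "local"}
    if mode == "hosted":
        if not s3_model_dir:
            raise ValueError("s3_model_dir is required for hosted deployment")
        return {"model_data": f"{s3_model_dir.rstrip('/')}/model.tar.gz",
                "instance_type": "ml.c5.large"}
    raise ValueError(f"Unknown mode: {mode}")


# "s3://my-bucket/model" is a placeholder, not a real bucket.
print(deploy_settings("hosted", s3_model_dir="s3://my-bucket/model"))
```

Everything else passed to `MXNetModel` and `deploy()` stays the same in both modes.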

In [20]:
import os
import time
import boto3
import sagemaker
from sagemaker.mxnet import MXNetModel

boto_session = boto3.Session()
sagemaker_session = sagemaker.Session(boto_session=boto_session)
role = sagemaker.get_execution_role()
bucket = sagemaker_session.default_bucket()

In [21]:
model_path = os.path.join(s3_model_dir, "model.tar.gz")
endpoint_name = "endpoint-traffic-volume-forecast-{}".format(int(time.time()))

In [22]:
model = MXNetModel(model_data=model_path,
                   role=role,
                   source_dir='src',
                   entry_point='inference.py',
                   framework_version='1.6.0',
                   py_version='py3')

predictor = model.deploy(instance_type="ml.c5.large", 
                         initial_instance_count=1, 
                         endpoint_name=endpoint_name,
                         wait=True)

-------------!

Run inference. The code is the same as in local mode, apart from how the clients are created.

In [23]:
import boto3
client = boto3.client('sagemaker')
runtime_client = boto3.client('sagemaker-runtime')
endpoint_name = model.endpoint_name

In [24]:
response = runtime_client.invoke_endpoint(
    EndpointName=endpoint_name, 
    ContentType='application/json',
    Accept='application/json',
    Body=json.dumps(input_data)
    )
outputs = json.loads(response['Body'].read().decode())

In [25]:
outputs

{'out': [917.7115478515625,
  189.37432861328125,
  84.92733764648438,
  224.2052001953125,
  458.60882568359375,
  1810.460205078125,
  4772.833984375,
  5505.134765625,
  4140.82080078125,
  4779.0166015625,
  4505.71044921875,
  4427.0166015625,
  4544.5849609375,
  5169.572265625,
  6098.62939453125,
  7205.91748046875,
  6688.689453125,
  7187.91748046875,
  5541.04736328125,
  3531.455322265625,
  2053.02099609375,
  1849.30126953125,
  1249.337646484375,
  1102.44580078125,
  616.233642578125,
  393.4483642578125,
  34.6899528503418,
  65.4824447631836,
  1410.8447265625,
  3479.644775390625,
  5495.203125,
  6423.46484375,
  6735.57666015625,
  6528.0107421875,
  6131.7548828125,
  5412.97998046875,
  5282.89208984375,
  5458.107421875,
  6098.064453125,
  6893.7841796875,
  6999.64013671875,
  5758.9775390625,
  4646.97265625,
  4179.44287109375,
  2922.569091796875,
  2236.9970703125,
  1364.1856689453125,
  794.4921264648438,
  184.84010314941406,
  69.28057098388672,
  -174

### SageMaker Hosted Endpoint Clean-up

If you no longer need the endpoint, delete it to avoid unnecessary charges.
With the SageMaker SDK, you can delete it simply with the `delete_endpoint()` method, and you can also delete it easily from the UI.

In [26]:
delete_endpoint(client, endpoint_name)

--- Deleted model: mxnet-inference-2021-04-07-06-46-42-331
--- Deleted endpoint: endpoint-traffic-volume-forecast-1617777999
--- Deleted endpoint_config: endpoint-traffic-volume-forecast-1617777999
