# Module 5. Deployment on MMS (Multi Model Server)
---

This module walks through deploying the model.

<br>

## 1. Inference script
---

The code cell below saves the SageMaker inference script, `inference.py`, to the `src` directory.<br>

This script uses the interface of the SageMaker inference toolkit, a high-level toolkit that makes it easy to deploy MMS (Multi Model Server) on SageMaker.
You only need to implement the handler functions defined by that interface.

#### What is MMS (Multi Model Server)?
- [https://github.com/awslabs/multi-model-server](https://github.com/awslabs/multi-model-server) (first released in early December 2017 alongside MXNet 1.0; it started as a model server for MXNet)
- Prerequisites: Java 8, MXNet (only if you use MXNet)
- MMS is designed to be framework-agnostic, so it is flexible enough to act as the backend engine for any framework.
- The SageMaker MXNet and PyTorch inference containers use MMS wrapped by the SageMaker inference toolkit.
    - With the release of TorchServe, a model serving service for PyTorch, at the end of April 2020, the PyTorch inference container is expected to migrate from MMS to TorchServe.
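
The way the handler functions fit together can be pictured roughly as follows. This is a minimal sketch and not the toolkit's actual implementation: `handle_request` and the toy handlers are hypothetical stand-ins that only share their names with the functions in `inference.py`, and the real toolkit adds content-type negotiation and error handling on top.

```python
import json

# Hypothetical toy handlers that mirror the four function names used in inference.py.
def model_fn(model_dir):
    # A real handler would load weights from model_dir; this toy "model" doubles its input.
    return lambda values: [2 * value for value in values]

def input_fn(request_body, request_content_type='application/json'):
    # Deserialize the request body into the object that predict_fn expects.
    if request_content_type != 'application/json':
        raise ValueError('Unsupported content type: ' + request_content_type)
    return json.loads(request_body)

def predict_fn(input_data, model):
    # Run the model on the deserialized input.
    return model(input_data)

def output_fn(prediction, accept='application/json'):
    # Serialize the prediction and report the response content type.
    return json.dumps({'prediction': prediction}), accept

def handle_request(model, body, content_type, accept):
    """Hypothetical driver: deserialize -> predict -> serialize."""
    data = input_fn(body, content_type)
    prediction = predict_fn(data, model)
    return output_fn(prediction, accept)

# Assumption: the model is loaded up front and reused for each request.
model = model_fn('/opt/ml/model')
response_body, response_type = handle_request(
    model, '[1, 2, 3]', 'application/json', 'application/json')
print(response_body)
```

Keeping the four steps as separate functions means each one can be tested on its own before any container is started.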

In [1]:
%%writefile ./src/inference.py

from __future__ import absolute_import

import argparse
import json
import logging
import os
import sys
import time
import random
from os.path import join
import numpy as np
import io
import tarfile

import boto3

from PIL import Image

import torch
import torch.distributed as dist
import torch.nn as nn
import torch.nn.functional as F
from torch.optim import lr_scheduler
import torch.optim as optim
import torchvision
import copy
import torch.utils.data
import torch.utils.data.distributed
from torchvision import datasets, transforms, models
from torch import topk

logger = logging.getLogger(__name__)
logger.setLevel(logging.DEBUG)
logger.addHandler(logging.StreamHandler(sys.stdout))

JSON_CONTENT_TYPE = 'application/json'

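# Loads the model from storage into memory and returns it.
def model_fn(model_dir):
    logger.info("==> model_dir : {}".format(model_dir))
    # NOTE: pretrained=True downloads the ImageNet weights first; they are then
    # replaced by the weights loaded from model.pt below.
    model = models.resnet18(pretrained=True)
    last_hidden_units = model.fc.in_features
    model.fc = torch.nn.Linear(last_hidden_units, 186)
    # map_location='cpu' lets a checkpoint saved on a GPU load on a CPU-only host;
    # predict_fn() moves the model to the available device afterwards.
    model.load_state_dict(torch.load(os.path.join(model_dir, 'model.pt'), map_location='cpu'))
    return model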

# Deserialize the request body
def input_fn(request_body, request_content_type='application/x-image'):
    print('An input_fn that loads a image tensor')
    print(request_content_type)
    if request_content_type == 'application/x-image':             
        img = np.array(Image.open(io.BytesIO(request_body)))
    elif request_content_type == 'application/x-npy':    
        img = np.frombuffer(request_body, dtype='uint8').reshape(137, 236)   
    else:
        raise ValueError(
            'Requested unsupported ContentType in content_type : ' + request_content_type)

    img = 255 - img
    img = img[:,:,np.newaxis]
    img = np.repeat(img, 3, axis=2)    

    test_transforms = transforms.Compose([
        transforms.ToTensor()
    ])

    img_tensor = test_transforms(img)

    return img_tensor         
        

# Predicts on the deserialized object with the model from model_fn()
def predict_fn(input_data, model):
    logger.info('Entering the predict_fn function')
    start_time = time.time()
    input_data = input_data.unsqueeze(0)
    device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')
    model.to(device)
    model.eval()
    input_data = input_data.to(device)
                          
    result = {}
                                                 
    with torch.no_grad():
        logits = model(input_data)
        pred_probs = F.softmax(logits, dim=1).data.squeeze()   
        outputs = topk(pred_probs, 5)                  
        result['score'] = outputs[0].detach().cpu().numpy()
        result['class'] = outputs[1].detach().cpu().numpy()
    
    print("--- Elapsed time: %s secs ---" % (time.time() - start_time))    
    return result        

# Serialize the prediction result into the response content type
def output_fn(pred_output, accept=JSON_CONTENT_TYPE):
    return json.dumps({'score': pred_output['score'].tolist(), 
                       'class': pred_output['class'].tolist()}), accept

Overwriting ./src/inference.py


<br>

## 2. Local Endpoint Inference
---

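Deploying a trained model straight to production without sufficient validation and testing carries many risks. We therefore recommend using local mode to deploy the model in the notebook instance's local environment before launching inference instances for production. This is called a Local Mode Endpoint.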

First, before deploying the local mode endpoint container, we run inference directly in the local environment to check the results, and then deploy the local mode endpoint right away.

### Local Inference

The following example runs inference with `content_type='application/x-image'`.

In [2]:
from src.inference import model_fn, input_fn, predict_fn, output_fn
from PIL import Image
import numpy as np
import json

file_path = 'test_imgs/test_0.jpg'
with open(file_path, mode='rb') as file:
    img_byte = bytearray(file.read())
data = input_fn(img_byte)
model = model_fn('./model')
result = predict_fn(data, model)
print(result)

An input_fn that loads a image tensor
application/x-image
==> model_dir : ./model
Entering the predict_fn function
--- Elapsed time: 3.42805814743042 secs ---
{'score': array([0.62198865, 0.2314413 , 0.04159949, 0.02067479, 0.0189735 ],
      dtype=float32), 'class': array([  3,   2, 169, 168,  70])}


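The following example runs inference with `content_type='application/x-npy'`, which sends the numpy array as-is. It is faster than `content_type='application/x-image'`, but converting the array with `tobytes()` before sending
does not preserve the array's `dtype` or `shape`, so they must be handled separately. Here, `input_fn` hard-codes `uint8` and a shape of `(137, 236)`.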

In [3]:
img_arr = np.array(Image.open(file_path))
data = input_fn(img_arr.tobytes(), request_content_type='application/x-npy')
model = model_fn('./model')
result = predict_fn(data, model)
print(result)

An input_fn that loads a image tensor
application/x-npy
==> model_dir : ./model
Entering the predict_fn function
--- Elapsed time: 0.01936507225036621 secs ---
{'score': array([0.62198865, 0.2314413 , 0.04159949, 0.02067479, 0.0189735 ],
      dtype=float32), 'class': array([  3,   2, 169, 168,  70])}
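
The loss of `dtype` and `shape` through `tobytes()` can be seen in a small standalone sketch. The array below is synthetic rather than the test image, and the `np.save`-based alternative is only one possible way to ship the metadata together with the payload; it is an assumption for illustration and not what `inference.py` does.

```python
import io
import numpy as np

# Synthetic grayscale image with the same shape that input_fn hard-codes.
img = (np.arange(137 * 236) % 256).astype(np.uint8).reshape(137, 236)

# tobytes() produces a raw buffer only: no dtype and no shape travel with it.
raw = img.tobytes()
flat = np.frombuffer(raw, dtype=np.uint8)   # the receiver must already know the dtype
print(flat.shape)                           # a flat 1-D array
restored = flat.reshape(137, 236)           # the receiver must already know the shape

# Alternative (assumption, not used by inference.py): the .npy format stores dtype and shape.
buffer = io.BytesIO()
np.save(buffer, img)
loaded = np.load(io.BytesIO(buffer.getvalue()))
print(loaded.dtype, loaded.shape)
```

The raw-buffer route keeps the payload as small as possible, at the cost of a contract between client and server about the array layout.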


### Local Mode Endpoint

In [4]:
import os
import time
import sagemaker
from sagemaker.pytorch.model import PyTorchModel
role = sagemaker.get_execution_role()


Run the code cell below and check the logs. They show the MMS settings.

```bash
algo-1-cgw5k_1  | 2020-08-19 00:26:27,519 [INFO ] main com.amazonaws.ml.mms.ModelServer - 
algo-1-cgw5k_1  | MMS Home: /opt/conda/lib/python3.6/site-packages
algo-1-cgw5k_1  | Current directory: /
algo-1-cgw5k_1  | Temp directory: /home/model-server/tmp
algo-1-cgw5k_1  | Number of GPUs: 0
algo-1-cgw5k_1  | Number of CPUs: 8
algo-1-cgw5k_1  | Max heap size: 13646 M
algo-1-cgw5k_1  | Python executable: /opt/conda/bin/python
algo-1-cgw5k_1  | Config file: /etc/sagemaker-mms.properties
algo-1-cgw5k_1  | Inference address: http://0.0.0.0:8080
algo-1-cgw5k_1  | Management address: http://0.0.0.0:8080
algo-1-cgw5k_1  | Model Store: /.sagemaker/mms/models
...
```

In [5]:
local_model_path = f'file://{os.getcwd()}/model/model.tar.gz'
endpoint_name = "local-endpoint-bangali-classifier-{}".format(int(time.time()))

local_pytorch_model = PyTorchModel(model_data=local_model_path,
                                   role=role,
                                   entry_point='./src/inference.py',
                                   framework_version='1.3.1',
                                   py_version='py3')

local_pytorch_model.deploy(instance_type='local', 
                           initial_instance_count=1, 
                           endpoint_name=endpoint_name,
                           wait=True)

Attaching to tmp2r4844_r_algo-1-vxhvg_1
algo-1-vxhvg_1  | 2020-08-19 10:17:56,279 [INFO ] main com.amazonaws.ml.mms.ModelServer - 
algo-1-vxhvg_1  | MMS Home: /opt/conda/lib/python3.6/site-packages
algo-1-vxhvg_1  | Current directory: /
algo-1-vxhvg_1  | Temp directory: /home/model-server/tmp
algo-1-vxhvg_1  | Number of GPUs: 0
algo-1-vxhvg_1  | Number of CPUs: 8
algo-1-vxhvg_1  | Max heap size: 13646 M
algo-1-vxhvg_1  | Python executable: /opt/conda/bin/python
algo-1-vxhvg_1  | Config file: /etc/sagemaker-mms.properties
algo-1-vxhvg_1  | Inference address: http://0.0.0.0:8080
algo-1-vxhvg_1  | Management address: http://0.0.0.0:8080
algo-1-vxhvg_1  | Model Store: /.sagemaker/mms/models
algo-1-vxhvg_1  | Initial Models: ALL
algo-1-vxhvg_1  | Log dir: /logs
algo-1-vxhvg_1  | Metrics dir: /logs
algo-1-vxhvg_1  | Netty threads: 0
algo-1-vxhvg

algo-1-vxhvg_1  | 2020-08-19 10:17:57,082 [INFO ] W-9007-model com.amazonaws.ml.mms.wlm.WorkerThread - Backend response time: 420
algo-1-vxhvg_1  | 2020-08-19 10:17:57,089 [INFO ] W-9002-model com.amazonaws.ml.mms.wlm.WorkerThread - Backend response time: 411
algo-1-vxhvg_1  | 2020-08-19 10:17:57,093 [INFO ] W-9003-model com.amazonaws.ml.mms.wlm.WorkerThread - Backend response time: 443
algo-1-vxhvg_1  | 2020-08-19 10:17:57,095 [INFO ] W-9004-model com.amazonaws.ml.mms.wlm.WorkerThread - Backend response time: 445
algo-1-vxhvg_1  | 2020-08-19 10:17:57,099 [INFO ] W-9006-model com.amazonaws.ml.mms.wlm.WorkerThread - Backend response time: 437
algo-1-vxhvg_1  | 2020-08-19 10:17:57,111 [INFO ] W-9000-model com.amazonaws.ml.mms.wlm.WorkerThread - Backend response time: 465
algo-1-vxhvg_1  | 2020-08-19 10:17:57,114 [INFO ] W-9001-model com.amazonaws.ml.mms.wlm.WorkerThread - Backend response time: 464
algo-1-vxhvg_1  | 

<sagemaker.pytorch.model.PyTorchPredictor at 0x7f63ad36a9b0>

Because the container was deployed locally, you can confirm that it is currently running.

In [6]:
!docker ps

CONTAINER ID        IMAGE                                                                          COMMAND                  CREATED             STATUS              PORTS                              NAMES
85b681e61ed8        763104351884.dkr.ecr.us-east-1.amazonaws.com/pytorch-inference:1.3.1-cpu-py3   "python /usr/local/b…"   5 seconds ago       Up 3 seconds        0.0.0.0:8080->8080/tcp, 8081/tcp   tmp2r4844_r_algo-1-vxhvg_1


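You could run inference with the SageMaker SDK `predict()` method, but this time we use boto3's `invoke_endpoint()` method.<br>
Boto3 is a service-level, low-level SDK. Unlike the SageMaker SDK, a high-level SDK that abstracts some functionality with a focus on ML experimentation,
boto3 gives you full control over the SageMaker API and is well suited to production and automation work.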

Note that in local deployment mode, the runtime client used to call `invoke_endpoint()` must be created with `sagemaker.local.LocalSagemakerRuntimeClient()`.


In [7]:
client = sagemaker.local.LocalSagemakerClient()
runtime_client = sagemaker.local.LocalSagemakerRuntimeClient()
endpoint_name = local_pytorch_model.endpoint_name

response = runtime_client.invoke_endpoint(
    EndpointName=endpoint_name, 
    ContentType='application/x-npy',
    Accept='application/json',
    Body=img_arr.tobytes()
    )
print(response['Body'].read().decode())

algo-1-vxhvg_1  | 2020-08-19 10:17:58,917 [INFO ] W-9007-model-stdout com.amazonaws.ml.mms.wlm.WorkerLifeCycle - ==> model_dir : /opt/ml/model
algo-1-vxhvg_1  | 2020-08-19 10:17:58,917 [INFO ] W-9007-model-stdout com.amazonaws.ml.mms.wlm.WorkerLifeCycle - ==> model_dir : /opt/ml/model
algo-1-vxhvg_1  | 2020-08-19 10:17:59,200 [WARN ] W-9007-model-stderr com.amazonaws.ml.mms.wlm.WorkerLifeCycle - Downloading: "https://download.pytorch.org/models/resnet18-5c106cde.pth" to /root/.cache/torch/checkpoints/resnet18-5c106cde.pth
algo-1-vxhvg_1  | 2020-08-19 10:17:59,200 [WARN ] W-9007-model-stderr com.amazonaws.ml.mms.wlm.WorkerLifeCycle - 
algo-1-vxhvg_1  | 2020-08-19 10:17:59,300 [WARN ] W-9007-model-stderr com.amazonaws.ml.mms.wlm.WorkerLifeCycle -   0%|          | 0.00/44.7M [00:00<?, ?B/s]
algo-1-vxhvg_1  | 2020-08-19 10:17:59,400 [WARN ] W-9007-model-stderr com.amazonaws.ml.mms.wlm.WorkerLifeCycle -  33%|███▎      | 14.7M/44.7M [00:0

In [8]:
response = runtime_client.invoke_endpoint(
    EndpointName=endpoint_name, 
    ContentType='application/x-image',
    Accept='application/json',
    Body=img_byte
    )

print(json.loads(response['Body'].read().decode()))

algo-1-vxhvg_1  | 2020-08-19 10:17:59,911 [INFO ] W-9002-model-stdout com.amazonaws.ml.mms.wlm.WorkerLifeCycle - ==> model_dir : /opt/ml/model
algo-1-vxhvg_1  | 2020-08-19 10:17:59,911 [INFO ] W-9002-model-stdout com.amazonaws.ml.mms.wlm.WorkerLifeCycle - ==> model_dir : /opt/ml/model
algo-1-vxhvg_1  | 2020-08-19 10:18:00,258 [INFO ] W-9002-model-stdout com.amazonaws.ml.mms.wlm.WorkerLifeCycle - An input_fn that loads a image tensor
algo-1-vxhvg_1  | 2020-08-19 10:18:00,258 [INFO ] W-9002-model-stdout com.amazonaws.ml.mms.wlm.WorkerLifeCycle - application/x-image
algo-1-vxhvg_1  | 2020-08-19 10:18:00,258 [INFO ] W-9002-model-stdout com.amazonaws.ml.mms.wlm.WorkerLifeCycle - Entering the predict_fn function
algo-1-vxhvg_1  | 2020-08-19 10:18:00,258 [INFO ] W-9002-model-stdout com.amazonaws.ml.mms.wlm.WorkerLifeCycle - Entering the predict_fn function
algo-1-vxhvg_1  | 2020-08-19 10:18:00,330 [INFO ] W-9002-model com.amazonaw

### Local Mode Endpoint Clean-up

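If you no longer need the endpoint, delete it.
With the SageMaker SDK, this can be done simply with the `delete_endpoint()` method. The helper below does the same through the low-level client and also removes the model and the endpoint configuration.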

In [9]:
def delete_endpoint(client, endpoint_name):
    # deploy() registers the endpoint config under the same name as the endpoint,
    # so the endpoint name is used here to look up the config and its model.
    response = client.describe_endpoint_config(EndpointConfigName=endpoint_name)
    model_name = response['ProductionVariants'][0]['ModelName']

    client.delete_model(ModelName=model_name)
    client.delete_endpoint(EndpointName=endpoint_name)
    client.delete_endpoint_config(EndpointConfigName=endpoint_name)
    
    print(f'--- Deleted model: {model_name}')
    print(f'--- Deleted endpoint: {endpoint_name}')
    print(f'--- Deleted endpoint_config: {endpoint_name}')    

In [10]:
delete_endpoint(client, endpoint_name)

Gracefully stopping... (press Ctrl+C again to force)
--- Deleted model: pytorch-inference-2020-08-19-10-17-52-283
--- Deleted endpoint: local-endpoint-bangali-classifier-1597832268
--- Deleted endpoint_config: local-endpoint-bangali-classifier-1597832268


You can confirm that the container has been removed.

In [11]:
!docker ps

CONTAINER ID        IMAGE               COMMAND             CREATED             STATUS              PORTS               NAMES


<br>

## 3. SageMaker Hosted Endpoint Inference
---

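Now let's deploy the endpoint to a real production environment. Most of the code is identical to the local mode endpoint; you only need to change the model artifact path (`model_data`) and the instance type (`instance_type`). Because SageMaker needs time to provision the managed hosting cluster, it takes about 5 to 10 minutes for the inference service to start.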


In [12]:
import boto3
client = boto3.client('sagemaker')
runtime_client = boto3.client('sagemaker-runtime')

In [13]:
def get_model_path(sm_client, max_results=1, name_contains='pytorch'):
    # Return the S3 model artifact path of the most recently created matching training job.
    training_job = sm_client.list_training_jobs(
        MaxResults=max_results,
        NameContains=name_contains,
        SortBy='CreationTime',
        SortOrder='Descending',
    )
    training_job_name = training_job['TrainingJobSummaries'][0]['TrainingJobName']
    training_job_description = sm_client.describe_training_job(TrainingJobName=training_job_name)
    model_path = training_job_description['ModelArtifacts']['S3ModelArtifacts']
    return model_path

In [14]:
model_path = get_model_path(client, max_results=3)
endpoint_name = "endpoint-bangali-classifier-{}".format(int(time.time()))

pytorch_model = PyTorchModel(model_data=model_path,
                             role=role,
                             entry_point='./src/inference.py',
                             framework_version='1.3.1',
                             py_version='py3')

predictor = pytorch_model.deploy(instance_type='ml.m5.xlarge', 
                                 initial_instance_count=1, 
                                 endpoint_name=endpoint_name,
                                 wait=False)
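
Because the cell above passes `wait=False`, `deploy()` returns before the endpoint is ready. One way to block until it is usable is to poll `describe_endpoint()` and inspect the `EndpointStatus` field. The sketch below is an illustration under assumptions: `wait_for_endpoint` and `FakeClient` are hypothetical names, and the fake stands in for `boto3.client('sagemaker')` so that the snippet runs without AWS access.

```python
import time

def wait_for_endpoint(sm_client, endpoint_name, poll_seconds=30, timeout_seconds=1800):
    """Poll describe_endpoint() until the endpoint is InService, fails, or the timeout expires."""
    deadline = time.time() + timeout_seconds
    while time.time() < deadline:
        status = sm_client.describe_endpoint(EndpointName=endpoint_name)['EndpointStatus']
        if status == 'InService':
            return status
        if status == 'Failed':
            raise RuntimeError('Endpoint {} failed to deploy'.format(endpoint_name))
        time.sleep(poll_seconds)
    raise TimeoutError('Endpoint {} not ready after {}s'.format(endpoint_name, timeout_seconds))

class FakeClient:
    """Hypothetical stand-in for boto3.client('sagemaker'), used only for this demo."""
    def __init__(self, statuses):
        self._statuses = iter(statuses)

    def describe_endpoint(self, EndpointName):
        # Return the next scripted status, shaped like the real response field.
        return {'EndpointName': EndpointName, 'EndpointStatus': next(self._statuses)}

fake = FakeClient(['Creating', 'Creating', 'InService'])
final_status = wait_for_endpoint(fake, 'demo-endpoint', poll_seconds=0)
print(final_status)
```

With a real client, boto3's built-in waiter `client.get_waiter('endpoint_in_service')` should serve the same purpose without a hand-written loop.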

In [18]:
import boto3
client = boto3.client('sagemaker')
runtime_client = boto3.client('sagemaker-runtime')
endpoint_name = pytorch_model.endpoint_name
client.describe_endpoint(EndpointName = endpoint_name)

{'EndpointName': 'endpoint-bangali-classifier-1597722095',
 'EndpointArn': 'arn:aws:sagemaker:us-east-1:143656149352:endpoint/endpoint-bangali-classifier-1597722095',
 'EndpointConfigName': 'endpoint-bangali-classifier-1597722095',
 'ProductionVariants': [{'VariantName': 'AllTraffic',
   'DeployedImages': [{'SpecifiedImage': '763104351884.dkr.ecr.us-east-1.amazonaws.com/pytorch-inference:1.3.1-cpu-py3',
     'ResolvedImage': '763104351884.dkr.ecr.us-east-1.amazonaws.com/pytorch-inference@sha256:a1b552fe76355d9f921c896c03ea985b00d7829f6f95c3ffc1d8bae9f29d8626',
     'ResolutionTime': datetime.datetime(2020, 8, 18, 3, 41, 43, 604000, tzinfo=tzlocal())}],
   'CurrentWeight': 1.0,
   'DesiredWeight': 1.0,
   'CurrentInstanceCount': 1,
   'DesiredInstanceCount': 1}],
 'EndpointStatus': 'InService',
 'CreationTime': datetime.datetime(2020, 8, 18, 3, 41, 41, 692000, tzinfo=tzlocal()),
 'LastModifiedTime': datetime.datetime(2020, 8, 18, 3, 47, 58, 503000, tzinfo=tzlocal()),
 'ResponseMetadata'

Run inference. The code is the same as in local mode.

In [16]:
response = runtime_client.invoke_endpoint(
    EndpointName=endpoint_name, 
    ContentType='application/x-image',
    Accept='application/json',
    Body=img_byte
    )

print(json.loads(response['Body'].read().decode()))

{'score': [0.6219883561134338, 0.23144130408763885, 0.04159952700138092, 0.02067478932440281, 0.01897350326180458], 'class': [3, 2, 169, 168, 70]}
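
The JSON body produced by `output_fn` holds two parallel lists, `score` and `class`. A small helper can pair them up for downstream use. In this sketch `top_predictions` is a hypothetical name, and the sample body reuses the values printed above with the scores rounded.

```python
import json

def top_predictions(body):
    """Pair each class index with its score, highest score first (hypothetical helper)."""
    payload = json.loads(body)
    pairs = zip(payload['class'], payload['score'])
    return sorted(pairs, key=lambda pair: pair[1], reverse=True)

# Sample body shaped like the endpoint response (scores rounded for brevity).
sample_body = '{"score": [0.622, 0.231, 0.042, 0.021, 0.019], "class": [3, 2, 169, 168, 70]}'
ranked = top_predictions(sample_body)
best_class, best_score = ranked[0]
print(best_class, best_score)
```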


### SageMaker Hosted Endpoint Clean-up

If you no longer need the endpoint, delete it to avoid unnecessary charges.
With the SageMaker SDK, it can be deleted simply with the `delete_endpoint()` method, and it is also easy to delete from the UI.

In [19]:
delete_endpoint(client, endpoint_name)

--- Deleted model pytorch-inference-2020-08-18-03-41-40-100
--- Deleted endpoint endpoint-bangali-classifier-1597722095
--- Deleted endpoint_config endpoint-bangali-classifier-1597722095
