# Module 5. Amazon SageMaker Deployment 
---

In this module, you will learn how to deploy a hosted endpoint on SageMaker.


### AWS Managed Inference Container
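SageMaker inference provides pre-built deployment containers for each framework: TensorFlow containers include TensorFlow Serving, PyTorch containers include TorchServe, MXNet containers include MMS (Multi Model Server), and scikit-learn containers include Flask. PyTorch containers previously bundled MMS, but since late 2020 they have bundled TorchServe, which Amazon and Facebook developed jointly (the PyTorch 1.5.0 container used later in this module still runs MMS, as its logs show).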

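When the deployment container starts, the `serve` command runs automatically and launches a RESTful API that accepts HTTP inference requests, which brings up the endpoint. At startup, SageMaker loads the external model artifacts, data, and other configuration information into the `/opt/ml` folder of the deployment instance so that they are available to the Docker container.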
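To make the "RESTful API" part concrete: a SageMaker serving container listens on port 8080 (the MMS logs later in this module show `Inference address: http://0.0.0.0:8080`), answers health checks on `GET /ping`, and receives inference requests on `POST /invocations`. The toy server below is only a sketch of that contract using Python's standard library; the real containers run MMS or TorchServe, and `InferenceHandler` and `serve_in_background` are hypothetical names. Where the comment says so, a real server would call the `input_fn()`/`predict_fn()`/`output_fn()` handlers.

```python
import json
import threading
from http.server import BaseHTTPRequestHandler, HTTPServer


class InferenceHandler(BaseHTTPRequestHandler):
    """Toy handler for the two routes a SageMaker serving container answers."""

    def do_GET(self):
        # Health check: SageMaker calls GET /ping to see whether the container is up
        self._reply(200 if self.path == "/ping" else 404, b"")

    def do_POST(self):
        if self.path != "/invocations":
            self._reply(404, b"")
            return
        # Read the raw request body; a real server would run input_fn/predict_fn/output_fn here
        length = int(self.headers.get("Content-Length", 0))
        body = self.rfile.read(length)
        payload = json.dumps({"received_bytes": len(body)}).encode()
        self._reply(200, payload, "application/json")

    def _reply(self, status, payload, content_type="text/plain"):
        self.send_response(status)
        self.send_header("Content-Type", content_type)
        self.send_header("Content-Length", str(len(payload)))
        self.end_headers()
        self.wfile.write(payload)

    def log_message(self, *args):
        # Silence the default per-request access log
        pass


def serve_in_background(port=8080):
    """Start the toy server on a daemon thread and return (server, bound_port)."""
    server = HTTPServer(("127.0.0.1", port), InferenceHandler)
    threading.Thread(target=server.serve_forever, daemon=True).start()
    return server, server.server_address[1]
```

Passing `port=0` lets the OS pick a free port, which is handy when testing the sketch on a machine where 8080 is already taken.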

![container](imgs/inference_container.png)

The Dockerfiles are open source, and AWS provides a wide range of versions, from older releases to the latest.
See the links below for each framework's Dockerfiles.

- TensorFlow containers: https://github.com/aws/sagemaker-tensorflow-containers 
- PyTorch container: https://github.com/aws/sagemaker-pytorch-container   
- MXNet containers: https://github.com/aws/sagemaker-mxnet-containers
- Chainer container: https://github.com/aws/sagemaker-chainer-container 
- Scikit-learn container: https://github.com/aws/sagemaker-scikit-learn-container
- SparkML serving container: https://github.com/aws/sagemaker-sparkml-serving-container

You can also use the AWS CLI to quickly check which versions are supported for each framework.

```sh
$ aws ecr list-images --repository-name tensorflow-inference --registry-id 763104351884
$ aws ecr list-images --repository-name pytorch-inference --registry-id 763104351884
$ aws ecr list-images --repository-name mxnet-inference --registry-id 763104351884

# EIA(Elastic Inference)
$ aws ecr list-images --repository-name tensorflow-inference-eia --registry-id 763104351884
$ aws ecr list-images --repository-name pytorch-inference-eia --registry-id 763104351884
$ aws ecr list-images --repository-name mxnet-inference-eia --registry-id 763104351884
```
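If you prefer to check the available tags from Python, the same listing can be done with boto3's ECR `list_images` paginator. The sketch below is our own illustration: `list_image_tags` and its `prefix` filter are hypothetical, and the client is passed in so the function works with any `boto3.client("ecr")`.

```python
def list_image_tags(ecr_client, repository_name, registry_id="763104351884", prefix=""):
    """Return the sorted image tags in an ECR repository that start with `prefix`.

    `ecr_client` is expected to be a boto3 ECR client (boto3.client("ecr")).
    """
    paginator = ecr_client.get_paginator("list_images")
    tags = set()
    for page in paginator.paginate(
        registryId=registry_id,
        repositoryName=repository_name,
        filter={"tagStatus": "TAGGED"},  # skip untagged image digests
    ):
        for image_id in page.get("imageIds", []):
            tag = image_id.get("imageTag")
            if tag and tag.startswith(prefix):
                tags.add(tag)
    return sorted(tags)
```

For example, `list_image_tags(boto3.client("ecr", region_name="ap-northeast-2"), "pytorch-inference", prefix="1.5.0")` would list the PyTorch 1.5.0 inference images in the Seoul region.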

<br>

## 1. Inference script
---

The code cell below saves the SageMaker inference script `inference.py` in the `src` directory.<br>

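This script uses the interface of the SageMaker inference toolkit, a high-level toolkit that makes it easy to deploy hosted endpoints on SageMaker, so you only need to implement the handler functions defined by the interface. The interface below is shared by all frameworks except TensorFlow.
- `model_fn()`: Defines how to load a model stored in S3 or a model zoo into the memory of the inference instance and return it. It receives `model_dir`, the local directory where the model artifacts were extracted.
- `input_fn()`: A preprocessing function that converts the user's request into a format suitable for model inference; the `content_type` argument tells you the format of the input.
- `predict_fn()`: Runs inference using the model returned by `model_fn()` and the data converted by `input_fn()`.
- `output_fn()`: A post-processing function that serializes and returns the inference result.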

Tip: Instead of implementing `input_fn()`, `predict_fn()`, and `output_fn()` separately, you can implement all three together in a single `transform_fn()` function. See the code snippets below.

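```python
# Option 1
def model_fn(model_dir):
    # Load the model from model_dir and return it
    model = Your_Model()
    return model

def input_fn(request_body, content_type):
    # Deserialize the request body based on its content type
    if content_type == 'text/csv':
        ...
    else:
        raise ValueError(f'Unsupported content type: {content_type}')

def predict_fn(input_data, model):
    # Perform prediction with the model returned by model_fn()
    return model(input_data)

def output_fn(prediction, accept):
    # Serialize the prediction result into the requested (accept) format
    ...
```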

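```python
# Option 2
def model_fn(model_dir):
    model = Your_Model()
    return model

def transform_fn(model, request_body, content_type, accept):
    # All-in-one function covering input_fn(), predict_fn(), and output_fn().
    # deserialize() and serialize() are placeholders for your own logic.
    input_data = deserialize(request_body, content_type)
    prediction = model(input_data)
    return serialize(prediction, accept), accept
```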
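To see how the handlers fit together, here is a simplified sketch of how a serving toolkit might call them for one request. It is not the SageMaker inference toolkit's actual code: `handle_request` is hypothetical, this sketch simply prefers `transform_fn` when it is defined (the real toolkit's rules for mixing the two options may differ), and a real server calls `model_fn()` once when the worker starts rather than on every request.

```python
def handle_request(handlers, model_dir, request_body, content_type, accept):
    """Hypothetical dispatcher mimicking the order in which handler functions are called.

    `handlers` maps handler names ("model_fn", "input_fn", ...) to user functions.
    """
    # A real server loads the model once at startup; it is done inline here for brevity
    model = handlers["model_fn"](model_dir)
    if "transform_fn" in handlers:
        # Option 2: one function does deserialization, prediction, and serialization
        return handlers["transform_fn"](model, request_body, content_type, accept)
    # Option 1: input_fn -> predict_fn -> output_fn
    data = handlers["input_fn"](request_body, content_type)
    prediction = handlers["predict_fn"](data, model)
    return handlers["output_fn"](prediction, accept)
```

Note the argument order: `predict_fn()` receives the data first and the model second, while `transform_fn()` receives the model first.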

Because the SageMaker training container used PyTorch 1.6.0, we use the same version for local inference testing.

In [1]:
%load_ext autoreload
%autoreload 2
!pip install torch==1.6.0



In [2]:
%%writefile ./src/inference.py

from __future__ import absolute_import

import argparse
import json
import logging
import os
import sys
import time
import random
from os.path import join
import numpy as np
import io
import tarfile

import boto3

from PIL import Image

import torch
import torch.distributed as dist
import torch.nn as nn
import torch.nn.functional as F
from torch.optim import lr_scheduler
import torch.optim as optim
import torchvision
import copy
import torch.utils.data
import torch.utils.data.distributed
from torchvision import datasets, transforms, models
from torch import topk

logger = logging.getLogger(__name__)
logger.setLevel(logging.DEBUG)
logger.addHandler(logging.StreamHandler(sys.stdout))

JSON_CONTENT_TYPE = 'application/json'

# Loads the model into memory from storage and return the model.
def model_fn(model_dir):
    logger.info("==> model_dir : {}".format(model_dir))
    model = models.resnet18(pretrained=True)
    last_hidden_units = model.fc.in_features
    model.fc = torch.nn.Linear(last_hidden_units, 186)
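    # map_location='cpu' lets GPU-saved weights load on CPU-only hosts; predict_fn() moves the model to the right device
    model.load_state_dict(torch.load(os.path.join(model_dir, 'model.pt'), map_location='cpu'))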
    return model

# Deserialize the request body
def input_fn(request_body, request_content_type='application/x-image'):
    print('An input_fn that loads a image tensor')
    print(request_content_type)
    if request_content_type == 'application/x-image':             
        img = np.array(Image.open(io.BytesIO(request_body)))
    elif request_content_type == 'application/x-npy':    
        img = np.frombuffer(request_body, dtype='uint8').reshape(137, 236)   
    else:
        raise ValueError(
            'Requested unsupported ContentType in content_type : ' + request_content_type)

    img = 255 - img
    img = img[:,:,np.newaxis]
    img = np.repeat(img, 3, axis=2)    

    test_transforms = transforms.Compose([
        transforms.ToTensor()
    ])

    img_tensor = test_transforms(img)

    return img_tensor         
        

# Predicts on the deserialized object with the model from model_fn()
def predict_fn(input_data, model):
    logger.info('Entering the predict_fn function')
    start_time = time.time()
    input_data = input_data.unsqueeze(0)
    device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')
    model.to(device)
    model.eval()
    input_data = input_data.to(device)
                          
    result = {}
                                                 
    with torch.no_grad():
        logits = model(input_data)
        pred_probs = F.softmax(logits, dim=1).data.squeeze()   
        outputs = topk(pred_probs, 5)                  
        result['score'] = outputs[0].detach().cpu().numpy()
        result['class'] = outputs[1].detach().cpu().numpy()
    
    print("--- Elapsed time: %s secs ---" % (time.time() - start_time))    
    return result        

# Serialize the prediction result into the response content type
def output_fn(pred_output, accept=JSON_CONTENT_TYPE):
    return json.dumps({'score': pred_output['score'].tolist(), 
                       'class': pred_output['class'].tolist()}), accept

Overwriting ./src/inference.py
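To see what `input_fn()` hands to the model, the NumPy-only sketch below mirrors its preprocessing: invert the grayscale image, replicate it to three channels for the 3-channel ResNet, and approximate `transforms.ToTensor()`, which for `uint8` input converts HWC to CHW and scales values to [0, 1]. `preprocess_like_input_fn` is an illustrative name, not part of the script.

```python
import numpy as np


def preprocess_like_input_fn(img):
    """NumPy re-creation of the input_fn() preprocessing for an (H, W) uint8 image."""
    img = 255 - img                  # invert so strokes become bright on a dark background
    img = img[:, :, np.newaxis]      # (H, W) -> (H, W, 1)
    img = np.repeat(img, 3, axis=2)  # (H, W, 1) -> (H, W, 3)
    # Approximates transforms.ToTensor(): HWC uint8 -> CHW float32 in [0, 1]
    return np.transpose(img, (2, 0, 1)).astype(np.float32) / 255.0
```

For a 137x236 input this yields a `(3, 137, 236)` array; `predict_fn()` then adds the batch dimension with `unsqueeze(0)`.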


<br>

## 2. Local Endpoint Inference
---

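Deploying a trained model straight to production without sufficient validation and testing carries many risks. Therefore, before launching an inference instance for production, we recommend first deploying the model in the local environment of the notebook instance using local mode. This is called a local mode endpoint (Local Mode Endpoint).

First, before deploying the local mode endpoint container, we run inference directly in the local environment to check the results, and then deploy the local mode endpoint.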

### Local Inference

This example runs inference with `content_type='application/x-image'`.

In [3]:
from src.inference import model_fn, input_fn, predict_fn, output_fn
from PIL import Image
import numpy as np
import json

file_path = 'test_imgs/test_0.jpg'
with open(file_path, mode='rb') as file:
    img_byte = bytearray(file.read())
data = input_fn(img_byte)
model = model_fn('./model')
result = predict_fn(data, model)
print(result)

An input_fn that loads a image tensor
application/x-image
==> model_dir : ./model
Entering the predict_fn function
--- Elapsed time: 0.08690214157104492 secs ---
{'score': array([0.40557373, 0.26362863, 0.11161146, 0.04144654, 0.02641259],
      dtype=float32), 'class': array([  3,   2,  70,  64, 169])}
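The `score` and `class` arrays come from `predict_fn()`, which applies softmax to the logits and keeps the five most likely classes with `topk`. The NumPy sketch below reproduces that post-processing for a single logit vector; `softmax_topk` is a hypothetical name, and subtracting the maximum logit before exponentiating is a standard numerical-stability trick that does not change the result.

```python
import numpy as np


def softmax_topk(logits, k=5):
    """Return (scores, classes) for the k most likely classes of one logit vector."""
    z = logits - np.max(logits)            # shift for numerical stability
    probs = np.exp(z) / np.sum(np.exp(z))  # softmax over classes
    classes = np.argsort(probs)[::-1][:k]  # class indices by descending probability
    return probs[classes], classes
```

Because softmax preserves the ordering of the logits, the top-k classes are the same as the top-k logits; softmax only turns them into probabilities that sum to 1.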


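This example runs inference with `content_type='application/x-npy'`, sending the NumPy array as-is. It is faster than `content_type='application/x-image'`, but when the array is converted with `tobytes()` before sending, its `dtype` and `shape` are not preserved, so they must be handled separately (here, `input_fn()` assumes `uint8` and reshapes the buffer to `(137, 236)`).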
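One way to keep `dtype` and `shape` is to send the array in the `.npy` format written by `np.save`, which stores both in a small header. This is an alternative to the `tobytes()` / `np.frombuffer(...).reshape(137, 236)` pair used here, not what the script currently does; `serialize_npy` and `deserialize_npy` are hypothetical names, and `input_fn()` would need to use `np.load` for this to work.

```python
import io

import numpy as np


def serialize_npy(arr):
    """Encode an array in .npy format so dtype and shape travel with the bytes."""
    buf = io.BytesIO()
    np.save(buf, arr, allow_pickle=False)
    return buf.getvalue()


def deserialize_npy(payload):
    """Decode bytes produced by serialize_npy() back into an array."""
    return np.load(io.BytesIO(payload), allow_pickle=False)
```

By contrast, `np.frombuffer(arr.tobytes(), dtype=...)` always returns a flat 1-D array, which is why the script has to hard-code the reshape.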

In [4]:
img_arr = np.array(Image.open(file_path))
data = input_fn(img_arr.tobytes(), request_content_type='application/x-npy')
model = model_fn('./model')
result = predict_fn(data, model)
print(result)

An input_fn that loads a image tensor
application/x-npy
==> model_dir : ./model
Entering the predict_fn function
--- Elapsed time: 0.020900249481201172 secs ---
{'score': array([0.40557373, 0.26362863, 0.11161146, 0.04144654, 0.02641259],
      dtype=float32), 'class': array([  3,   2,  70,  64, 169])}


### Local Mode Endpoint

In [5]:
import os
import time
import sagemaker
from sagemaker.pytorch.model import PyTorchModel
role = sagemaker.get_execution_role()


After running the code cell below, check the logs. You can see the MMS configuration values.

```bash
sc0es6wfbp-algo-1-f5wnl | 2021-03-02 13:28:03,924 [INFO ] main com.amazonaws.ml.mms.ModelServer - 
sc0es6wfbp-algo-1-f5wnl | MMS Home: /opt/conda/lib/python3.6/site-packages
sc0es6wfbp-algo-1-f5wnl | Current directory: /
sc0es6wfbp-algo-1-f5wnl | Temp directory: /home/model-server/tmp
sc0es6wfbp-algo-1-f5wnl | Number of GPUs: 0
sc0es6wfbp-algo-1-f5wnl | Number of CPUs: 8
sc0es6wfbp-algo-1-f5wnl | Max heap size: 3463 M
sc0es6wfbp-algo-1-f5wnl | Python executable: /opt/conda/bin/python
sc0es6wfbp-algo-1-f5wnl | Config file: /etc/sagemaker-mms.properties
sc0es6wfbp-algo-1-f5wnl | Inference address: http://0.0.0.0:8080
sc0es6wfbp-algo-1-f5wnl | Management address: http://0.0.0.0:8080
...
```



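### Debugging Tip
If inference works locally but the endpoint deployment fails, the cause is most likely a framework version mismatch or misconfigured container environment variables.
Keep framework versions as identical as possible; if an exact match is not available, try the closest version. The code below deploys a model trained with PyTorch 1.6.0 on a PyTorch 1.5.0 container.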
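One way to apply "use the closest version" mechanically is sketched below. The rule it encodes (take the newest available version that is not newer than the training version) is our own assumption, not SageMaker behavior; `closest_version` is a hypothetical helper, and `available` could come from the AWS CLI `list-images` output shown earlier in this module.

```python
def closest_version(target, available):
    """Return the newest version in `available` that is <= `target`, or None.

    Versions are compared numerically ("1.10.0" > "1.9.0"), not as strings.
    """
    def parse(v):
        return tuple(int(part) for part in v.split("."))

    candidates = [v for v in available if parse(v) <= parse(target)]
    if not candidates:
        return None
    return max(candidates, key=parse)
```

For example, `closest_version("1.6.0", ["1.4.0", "1.5.0", "1.8.1"])` picks `"1.5.0"`, matching the fallback used in this module.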

In [9]:
local_model_path = f'file://{os.getcwd()}/model/model.tar.gz'
endpoint_name = "local-endpoint-bangali-classifier-{}".format(int(time.time()))

local_pytorch_model = PyTorchModel(model_data=local_model_path,
                                   role=role,
                                   entry_point='./src/inference.py',
                                   framework_version='1.5.0',
                                   py_version='py3')

local_pytorch_model.deploy(instance_type='local', 
                           initial_instance_count=1, 
                           endpoint_name=endpoint_name,
                           wait=True)

Attaching to sc0es6wfbp-algo-1-f5wnl
[36msc0es6wfbp-algo-1-f5wnl |[0m 2021-03-02 13:28:03,924 [INFO ] main com.amazonaws.ml.mms.ModelServer - 
[36msc0es6wfbp-algo-1-f5wnl |[0m MMS Home: /opt/conda/lib/python3.6/site-packages
[36msc0es6wfbp-algo-1-f5wnl |[0m Current directory: /
[36msc0es6wfbp-algo-1-f5wnl |[0m Temp directory: /home/model-server/tmp
[36msc0es6wfbp-algo-1-f5wnl |[0m Number of GPUs: 0
[36msc0es6wfbp-algo-1-f5wnl |[0m Number of CPUs: 8
[36msc0es6wfbp-algo-1-f5wnl |[0m Max heap size: 3463 M
[36msc0es6wfbp-algo-1-f5wnl |[0m Python executable: /opt/conda/bin/python
[36msc0es6wfbp-algo-1-f5wnl |[0m Config file: /etc/sagemaker-mms.properties
[36msc0es6wfbp-algo-1-f5wnl |[0m Inference address: http://0.0.0.0:8080
[36msc0es6wfbp-algo-1-f5wnl |[0m Management address: http://0.0.0.0:8080
[36msc0es6wfbp-algo-1-f5wnl |[0m Model Store: /.sagemaker/mms/models
[36msc0es6wfbp-algo-1-f5wnl |[0m Initial Models: ALL
[36msc0es6wfbp-algo-1-f5wnl |[0m Log dir: /logs

[36msc0es6wfbp-algo-1-f5wnl |[0m 2021-03-02 13:28:05,390 [INFO ] W-9004-model-stdout com.amazonaws.ml.mms.wlm.WorkerLifeCycle - ==> model_dir : /.sagemaker/mms/models/model
[36msc0es6wfbp-algo-1-f5wnl |[0m 2021-03-02 13:28:05,390 [INFO ] W-9004-model-stdout com.amazonaws.ml.mms.wlm.WorkerLifeCycle - ==> model_dir : /.sagemaker/mms/models/model
[36msc0es6wfbp-algo-1-f5wnl |[0m 2021-03-02 13:28:05,392 [INFO ] W-9007-model-stdout com.amazonaws.ml.mms.wlm.WorkerLifeCycle - ==> model_dir : /.sagemaker/mms/models/model
[36msc0es6wfbp-algo-1-f5wnl |[0m 2021-03-02 13:28:05,392 [INFO ] W-9007-model-stdout com.amazonaws.ml.mms.wlm.WorkerLifeCycle - ==> model_dir : /.sagemaker/mms/models/model
[36msc0es6wfbp-algo-1-f5wnl |[0m 2021-03-02 13:28:05,393 [INFO ] W-9001-model-stdout com.amazonaws.ml.mms.wlm.WorkerLifeCycle - ==> model_dir : /.sagemaker/mms/models/model
[36msc0es6wfbp-algo-1-f5wnl |[0m 2021-03-02 13:28:05,393 [INFO ] W-9001-model-stdout com.amazonaws.ml.mms.wlm.WorkerLifeCyc

[36msc0es6wfbp-algo-1-f5wnl |[0m 2021-03-02 13:28:06,209 [WARN ] W-9004-model-stderr com.amazonaws.ml.mms.wlm.WorkerLifeCycle -  93%|█████████▎| 41.6M/44.7M [00:00<00:00, 100MB/s] 
[36msc0es6wfbp-algo-1-f5wnl |[0m 2021-03-02 13:28:06,630 [INFO ] W-9001-model com.amazonaws.ml.mms.wlm.WorkerThread - Backend response time: 2431
[36msc0es6wfbp-algo-1-f5wnl |[0m 2021-03-02 13:28:08,085 [INFO ] W-9002-model com.amazonaws.ml.mms.wlm.WorkerThread - Backend response time: 3894
[36msc0es6wfbp-algo-1-f5wnl |[0m 2021-03-02 13:28:08,105 [INFO ] W-9004-model com.amazonaws.ml.mms.wlm.WorkerThread - Backend response time: 3919
[36msc0es6wfbp-algo-1-f5wnl |[0m 2021-03-02 13:28:08,106 [INFO ] W-9005-model com.amazonaws.ml.mms.wlm.WorkerThread - Backend response time: 3915
[36msc0es6wfbp-algo-1-f5wnl |[0m 2021-03-02 13:28:08,107 [INFO ] W-9006-model com.amazonaws.ml.mms.wlm.WorkerThread - Backend response time: 3921
[36msc0es6wfbp-algo-1-f5wnl |[0m 2021-03-02 13:28:08,128 [INFO ] pool-1-thr

<sagemaker.pytorch.model.PyTorchPredictor at 0x7f6c65e25dd8>

[36msc0es6wfbp-algo-1-f5wnl |[0m 2021-03-02 13:28:08,219 [INFO ] W-9003-model com.amazonaws.ml.mms.wlm.WorkerThread - Backend response time: 4033
[36msc0es6wfbp-algo-1-f5wnl |[0m 2021-03-02 13:28:08,235 [INFO ] W-9000-model com.amazonaws.ml.mms.wlm.WorkerThread - Backend response time: 4048
[36msc0es6wfbp-algo-1-f5wnl |[0m 2021-03-02 13:28:08,238 [INFO ] W-9007-model com.amazonaws.ml.mms.wlm.WorkerThread - Backend response time: 4031


Because the container was deployed locally, you can confirm that it is currently running.

In [10]:
!docker ps

CONTAINER ID        IMAGE                                                                               COMMAND                  CREATED             STATUS              PORTS                              NAMES
afeebf748adc        763104351884.dkr.ecr.ap-northeast-2.amazonaws.com/pytorch-inference:1.5.0-cpu-py3   "python /usr/local/b…"   23 seconds ago      Up 15 seconds       0.0.0.0:8080->8080/tcp, 8081/tcp   sc0es6wfbp-algo-1-f5wnl


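You could run inference with the SageMaker SDK's `predict()` method, but this time we use boto3's `invoke_endpoint()` method.<br>
Boto3 is a low-level, service-level SDK. Unlike the SageMaker SDK, a high-level SDK that focuses on ML experimentation and abstracts away some functionality,
boto3 gives you full control over the SageMaker API and is well suited to production and automation workloads.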

Note that in local deployment mode, the runtime client used to call `invoke_endpoint()` must be created with `sagemaker.local.LocalSagemakerRuntimeClient()`.
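In this module, both the local runtime client and the boto3 `sagemaker-runtime` client used for the hosted endpoint are called with the same `invoke_endpoint(EndpointName=..., ContentType=..., Accept=..., Body=...)` keywords, so the call-and-decode step can be wrapped once. `invoke_json` is a hypothetical helper, and it assumes the endpoint returns JSON, as this module's `output_fn()` does.

```python
import json


def invoke_json(runtime_client, endpoint_name, body, content_type):
    """Send `body` to an endpoint and decode its JSON response.

    `runtime_client` can be sagemaker.local.LocalSagemakerRuntimeClient()
    or boto3.client('sagemaker-runtime').
    """
    response = runtime_client.invoke_endpoint(
        EndpointName=endpoint_name,
        ContentType=content_type,
        Accept='application/json',
        Body=body,
    )
    # 'Body' is a stream-like object: read it once, then parse the JSON payload
    return json.loads(response['Body'].read().decode())
```

With this helper, the two requests below could be written as `invoke_json(runtime_client, endpoint_name, img_arr.tobytes(), 'application/x-npy')` and `invoke_json(runtime_client, endpoint_name, img_byte, 'application/x-image')`.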


In [11]:
client = sagemaker.local.LocalSagemakerClient()
runtime_client = sagemaker.local.LocalSagemakerRuntimeClient()
endpoint_name = local_pytorch_model.endpoint_name

response = runtime_client.invoke_endpoint(
    EndpointName=endpoint_name, 
    ContentType='application/x-npy',
    Accept='application/json',
    Body=img_arr.tobytes()
    )
print(response['Body'].read().decode())

[36msc0es6wfbp-algo-1-f5wnl |[0m 2021-03-02 13:28:19,702 [INFO ] W-9001-model-stdout com.amazonaws.ml.mms.wlm.WorkerLifeCycle - An input_fn that loads a image tensor
[36msc0es6wfbp-algo-1-f5wnl |[0m 2021-03-02 13:28:19,702 [INFO ] W-9001-model-stdout com.amazonaws.ml.mms.wlm.WorkerLifeCycle - application/x-npy
[36msc0es6wfbp-algo-1-f5wnl |[0m 2021-03-02 13:28:19,703 [INFO ] W-9001-model-stdout com.amazonaws.ml.mms.wlm.WorkerLifeCycle - Entering the predict_fn function
[36msc0es6wfbp-algo-1-f5wnl |[0m 2021-03-02 13:28:19,703 [INFO ] W-9001-model-stdout com.amazonaws.ml.mms.wlm.WorkerLifeCycle - Entering the predict_fn function
[36msc0es6wfbp-algo-1-f5wnl |[0m 2021-03-02 13:28:19,759 [INFO ] W-9001-model-stdout com.amazonaws.ml.mms.wlm.WorkerLifeCycle - --- Elapsed time: 0.05574846267700195 secs ---
[36msc0es6wfbp-algo-1-f5wnl |[0m 2021-03-02 13:28:19,759 [INFO ] W-9001-model com.amazonaws.ml.mms.wlm.WorkerThread - Backend response time: 58
[36msc0es6wfbp-algo-1-f5wnl 

In [12]:
response = runtime_client.invoke_endpoint(
    EndpointName=endpoint_name, 
    ContentType='application/x-image',
    Accept='application/json',
    Body=img_byte
    )

print(json.loads(response['Body'].read().decode()))

[36msc0es6wfbp-algo-1-f5wnl |[0m 2021-03-02 13:28:27,264 [INFO ] W-9002-model-stdout com.amazonaws.ml.mms.wlm.WorkerLifeCycle - An input_fn that loads a image tensor
[36msc0es6wfbp-algo-1-f5wnl |[0m 2021-03-02 13:28:27,265 [INFO ] W-9002-model-stdout com.amazonaws.ml.mms.wlm.WorkerLifeCycle - application/x-image
[36msc0es6wfbp-algo-1-f5wnl |[0m 2021-03-02 13:28:27,265 [INFO ] W-9002-model-stdout com.amazonaws.ml.mms.wlm.WorkerLifeCycle - Entering the predict_fn function
[36msc0es6wfbp-algo-1-f5wnl |[0m 2021-03-02 13:28:27,265 [INFO ] W-9002-model-stdout com.amazonaws.ml.mms.wlm.WorkerLifeCycle - Entering the predict_fn function
[36msc0es6wfbp-algo-1-f5wnl |[0m 2021-03-02 13:28:27,320 [INFO ] W-9002-model com.amazonaws.ml.mms.wlm.WorkerThread - Backend response time: 63
[36msc0es6wfbp-algo-1-f5wnl |[0m 2021-03-02 13:28:27,321 [INFO ] W-9002-model-stdout com.amazonaws.ml.mms.wlm.WorkerLifeCycle - --- Elapsed time: 0.05550980567932129 secs ---
[36msc0es6wfbp-algo-1-f5wn

### Local Mode Endpoint Clean-up

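If you no longer need the endpoint, delete it.
The SageMaker SDK can delete an endpoint with a predictor's `delete_endpoint()` method; here we instead define a small `delete_endpoint()` helper that uses the client to remove the model, the endpoint, and the endpoint configuration together. It assumes, as with the `deploy()` call above, that the endpoint configuration has the same name as the endpoint.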

In [13]:
def delete_endpoint(client, endpoint_name):
    response = client.describe_endpoint_config(EndpointConfigName=endpoint_name)
    model_name = response['ProductionVariants'][0]['ModelName']

    client.delete_model(ModelName=model_name)    
    client.delete_endpoint(EndpointName=endpoint_name)
    client.delete_endpoint_config(EndpointConfigName=endpoint_name)    
    
    print(f'--- Deleted model: {model_name}')
    print(f'--- Deleted endpoint: {endpoint_name}')
    print(f'--- Deleted endpoint_config: {endpoint_name}')    

In [14]:
delete_endpoint(client, endpoint_name)

Gracefully stopping... (press Ctrl+C again to force)
--- Deleted model: pytorch-inference-2021-03-02-13-27-11-884
--- Deleted endpoint: local-endpoint-bangali-classifier-1614691629
--- Deleted endpoint_config: local-endpoint-bangali-classifier-1614691629


You can confirm that the container has been removed.

In [15]:
!docker ps

CONTAINER ID        IMAGE               COMMAND             CREATED             STATUS              PORTS               NAMES


<br>

## 3. SageMaker Hosted Endpoint Inference
---

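Now let's deploy the endpoint to the production environment. Most of the code is the same as for the local mode endpoint; you only need to change the model artifact path (`model_data`) and the instance type (`instance_type`). Because SageMaker needs time to provision the managed deployment cluster, the inference service takes about 5 to 10 minutes to start.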
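`deploy(wait=True)` blocks until the endpoint is ready. If you deploy with `wait=False` instead, you can poll the endpoint status yourself. The sketch below is a minimal polling loop: `wait_until_in_service` is a hypothetical helper, the poll interval and timeout are arbitrary choices, and `describe_endpoint` is passed in (for example `boto3.client('sagemaker').describe_endpoint`). boto3 also provides a built-in waiter, `client.get_waiter('endpoint_in_service')`.

```python
import time


def wait_until_in_service(describe_endpoint, endpoint_name, poll_seconds=30, timeout_seconds=1800):
    """Poll until EndpointStatus is 'InService'; raise on 'Failed' or on timeout.

    `describe_endpoint` is any callable accepting EndpointName=... and returning
    a dict with an 'EndpointStatus' key.
    """
    deadline = time.monotonic() + timeout_seconds
    while True:
        status = describe_endpoint(EndpointName=endpoint_name)["EndpointStatus"]
        if status == "InService":
            return status
        if status == "Failed":
            raise RuntimeError(f"Endpoint {endpoint_name} failed to deploy")
        if time.monotonic() > deadline:
            raise TimeoutError(f"Endpoint {endpoint_name} is still {status}")
        # Still Creating/Updating: wait before asking again
        time.sleep(poll_seconds)
```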


In [16]:
import boto3
client = boto3.client('sagemaker')
runtime_client = boto3.client('sagemaker-runtime')

In [17]:
def get_model_path(sm_client, max_results=1, name_contains='pytorch'):
    # Pick the most recent training job whose name contains `name_contains`
    training_job = sm_client.list_training_jobs(MaxResults=max_results,
                                                NameContains=name_contains,
                                                SortBy='CreationTime',
                                                SortOrder='Descending')
    training_job_name = training_job['TrainingJobSummaries'][0]['TrainingJobName']
    training_job_description = sm_client.describe_training_job(TrainingJobName=training_job_name)
    model_path = training_job_description['ModelArtifacts']['S3ModelArtifacts']  
    return model_path

In [22]:
%%time
model_path = get_model_path(client, max_results=3)
endpoint_name = "endpoint-bangali-classifier-{}".format(int(time.time()))

pytorch_model = PyTorchModel(model_data=model_path,
                             role=role,
                             entry_point='./src/inference.py',
                             framework_version='1.5.0',
                             py_version='py3')

predictor = pytorch_model.deploy(instance_type='ml.m5.xlarge', 
                                 initial_instance_count=1, 
                                 endpoint_name=endpoint_name,
                                 wait=True)

---------------!CPU times: user 2.88 s, sys: 422 ms, total: 3.3 s
Wall time: 7min 35s


In [23]:
import boto3
client = boto3.client('sagemaker')
runtime_client = boto3.client('sagemaker-runtime')
endpoint_name = pytorch_model.endpoint_name
client.describe_endpoint(EndpointName = endpoint_name)

{'EndpointName': 'endpoint-bangali-classifier-1614692838',
 'EndpointArn': 'arn:aws:sagemaker:ap-northeast-2:387793684046:endpoint/endpoint-bangali-classifier-1614692838',
 'EndpointConfigName': 'endpoint-bangali-classifier-1614692838',
 'ProductionVariants': [{'VariantName': 'AllTraffic',
   'DeployedImages': [{'SpecifiedImage': '763104351884.dkr.ecr.ap-northeast-2.amazonaws.com/pytorch-inference:1.5.0-cpu-py3',
     'ResolvedImage': '763104351884.dkr.ecr.ap-northeast-2.amazonaws.com/pytorch-inference@sha256:fdd5a5514161af205d600520f70e762e36f4ce0fa89ad8667a0b59ee2dda44e4',
     'ResolutionTime': datetime.datetime(2021, 3, 2, 13, 47, 24, 613000, tzinfo=tzlocal())}],
   'CurrentWeight': 1.0,
   'DesiredWeight': 1.0,
   'CurrentInstanceCount': 1,
   'DesiredInstanceCount': 1}],
 'EndpointStatus': 'InService',
 'CreationTime': datetime.datetime(2021, 3, 2, 13, 47, 22, 26000, tzinfo=tzlocal()),
 'LastModifiedTime': datetime.datetime(2021, 3, 2, 13, 54, 31, 295000, tzinfo=tzlocal()),
 'Res

Now run inference. The code is the same as in local mode.

In [24]:
response = runtime_client.invoke_endpoint(
    EndpointName=endpoint_name, 
    ContentType='application/x-image',
    Accept='application/json',
    Body=img_byte
    )

print(json.loads(response['Body'].read().decode()))

{'score': [0.4055737257003784, 0.26362863183021545, 0.11161146312952042, 0.041446536779403687, 0.026412585750222206], 'class': [3, 2, 70, 64, 169]}


### SageMaker Hosted Endpoint Clean-up

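If you no longer need the endpoint, delete it to avoid unnecessary charges.
You can delete it with the SageMaker SDK predictor's `delete_endpoint()` method or easily from the console UI; here we reuse the `delete_endpoint()` helper defined above.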

In [25]:
delete_endpoint(client, endpoint_name)

--- Deleted model: pytorch-inference-2021-03-02-13-47-21-516
--- Deleted endpoint: endpoint-bangali-classifier-1614692838
--- Deleted endpoint_config: endpoint-bangali-classifier-1614692838
