# Deploying a Fine-Tuned Model to a SageMaker Endpoint for Inference (Bring Your Own Container)

---

The Amazon Machine Learning Solutions Lab has already published a detailed tutorial on the AWS Korea AIML blog and on GitHub. Following that method, you can deploy an endpoint with only minor modifications.
- https://github.com/aws-samples/kogpt2-sagemaker/blob/master/sagemaker-deploy-en.md


### Modify the Dockerfile

The basic steps are the same as in the tutorial above. When editing the Dockerfile (based on `./docker/1.6.0/py3/Dockerfile.gpu`), change it as follows. (If you do not use KoGPT2, you can delete the four lines below `# For KoGPT2 installation`.)

```shell
RUN ${PIP} install --no-cache-dir \
    ${MX_URL} \
    git+https://github.com/dmlc/gluon-nlp.git@v0.9.0 \
    gluoncv==0.6.0 \
    mxnet-model-server==$MMS_VERSION \
    keras-mxnet==2.2.4.1 \
    numpy==1.17.4 \
    onnx==1.4.1 \
    "sagemaker-mxnet-inference>2"

# For KoBERT installation
RUN git clone https://github.com/SKTBrain/KoBERT.git \
&& cd KoBERT \
&& ${PIP} install -r requirements.txt \
&& ${PIP} install .

# For KoGPT2 installation
RUN git clone https://github.com/SKT-AI/KoGPT2.git \
&& cd KoGPT2 \
&& ${PIP} install -r requirements.txt \
&& ${PIP} install .

RUN ${PIP} uninstall -y mxnet ${MX_URL}
RUN ${PIP} install ${MX_URL}
```
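After the image is built and pushed to Amazon ECR as described in the tutorial, the endpoint refers to it by its image URI. As a minimal sketch, the helper below assembles that URI from its parts. `ecr_image_uri` and every argument value are hypothetical placeholders rather than names from the tutorial, and the host format shown is the one used in the commercial AWS regions.

```python
def ecr_image_uri(account_id, region, repository, tag='latest'):
    """Build an ECR image URI of the form <account>.dkr.ecr.<region>.amazonaws.com/<repo>:<tag>."""
    return f'{account_id}.dkr.ecr.{region}.amazonaws.com/{repository}:{tag}'


# Placeholder values only; substitute your own account ID, region, and repository.
print(ecr_image_uri('123456789012', 'us-east-1', 'my-mxnet-inference', 'gpu'))
```

The resulting string is what the `image` argument of `MXNetModel` expects when you deploy with your own container.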

### Inference

Now paste the code below into a SageMaker notebook instance and create the endpoint, specifying your inference script as the entry point.

Note that endpoint deployment takes about 9-11 minutes on a GPU instance and about 7-8 minutes on a CPU instance.
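The entry-point script itself is not reproduced in this section. As a rough sketch of its shape, the block below assumes the `transform_fn` hook of the SageMaker MXNet serving container, a JSON-encoded sentence as the request body, and a response with `score` and `time` fields like the prediction output in this document. `dummy_model` is a hypothetical stand-in for the fine-tuned classifier, and the softmax step is an assumption about how the scores are produced.

```python
import json
import math
import time


def softmax(logits):
    """Convert raw class scores into probabilities that sum to 1."""
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def transform_fn(model, request_body, content_type, accept_type):
    """Decode the request, run the model, and encode the response."""
    start = time.time()
    sentence = json.loads(request_body)  # the sentence arrives as a JSON string
    logits = model(sentence)             # assumption: one raw score per class
    response = {'score': softmax(logits), 'time': time.time() - start}
    return json.dumps(response), accept_type


def dummy_model(sentence):
    """Hypothetical stand-in for the classifier that model_fn would load."""
    return [0.0, 0.0]


body, _ = transform_fn(dummy_model, json.dumps('test'),
                       'application/json', 'application/json')
print(body)
```

In a real script, `model_fn(model_dir)` would load the fine-tuned network and tokenizer from the extracted `model.tar.gz`. That part depends on how the model was saved and is left out here.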

In [1]:
import sagemaker
from sagemaker.mxnet.model import MXNetModel
from sagemaker import get_execution_role

sagemaker_session = sagemaker.Session()
role = get_execution_role()
# Replace with the S3 path of your own model artifact:
# model_data = 's3://<YOUR BUCKET>/<YOUR MODEL PATH>/model.tar.gz'
model_data = 's3://sagemaker-us-east-1-143656149352/mxnet-training-2020-05-19-01-08-59-727/output/model.tar.gz'

mxnet_model = MXNetModel(model_data=model_data,
                         role=role,
                         entry_point='inference.py',  # inference script inside source_dir
                         source_dir='./src',
                         py_version='py3',
                         framework_version='1.6.0',
                         # image='<YOUR ECR IMAGE>',  # uncomment to use your own container image
                         # model_server_workers=2
                        )

In [2]:
%%time
predictor = mxnet_model.deploy(instance_type='ml.p2.xlarge', initial_instance_count=1)
print(predictor.endpoint)

-------------------!konlp-2020-05-14-11-57-34-075
CPU times: user 23 s, sys: 3.89 s, total: 26.9 s
Wall time: 9min 54s


If the endpoint already exists and you have restarted the Jupyter notebook, you can re-create the predictor with the code cell below.

In [3]:
# import sagemaker
# from sagemaker.mxnet.model import MXNetPredictor
# sagemaker_session = sagemaker.Session()
# endpoint_name = '<YOUR ENDPOINT NAME>'
# predictor = MXNetPredictor(endpoint_name, sagemaker_session)
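
Clients that do not have the SageMaker Python SDK installed can call the same endpoint through the SageMaker runtime API. The sketch below assumes the endpoint accepts and returns JSON, as the prediction output in this document suggests. `invoke_sentiment_endpoint` is a hypothetical helper, and the client is passed in as an argument so the function can be tried with a stub.

```python
import json


def invoke_sentiment_endpoint(runtime_client, endpoint_name, sentence):
    """Send one sentence to the endpoint and return the decoded JSON response."""
    response = runtime_client.invoke_endpoint(
        EndpointName=endpoint_name,
        ContentType='application/json',
        Accept='application/json',
        Body=json.dumps(sentence),
    )
    # The response body is a stream; read it fully before decoding.
    return json.loads(response['Body'].read())


# Usage against a live endpoint (requires AWS credentials):
# import boto3
# runtime = boto3.client('sagemaker-runtime')
# print(invoke_sentiment_endpoint(runtime, '<YOUR ENDPOINT NAME>', '<YOUR SENTENCE>'))
```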

The code cell below performs real-time prediction.

In [3]:
# Wow, this is a story that repeats reversal over reversal. Highly recommended
input_sentence = '우와, 정말 반전에 반전을 거듭하는 스토리입니다. 강력 추천합니다.'
pred_out = predictor.predict(input_sentence)
print(pred_out)

algo-1-4z4hp_1  | 2020-05-28 11:51:15,497 [INFO ] W-9002-model com.amazonaws.ml.mms.wlm.WorkerThread - Backend response time: 91
algo-1-4z4hp_1  | 2020-05-28 11:51:15,497 [INFO ] W-9002-model ACCESS_LOG - /172.18.0.1:54172 "POST /invocations HTTP/1.1" 200 94
{'score': [0.030505415052175522, 0.9694945812225342], 'time': 0.08880257606506348}


In [4]:
# The contents are really messed up, and the actor's acting skills are also messed up.
input_sentence = '하하, 정말 엉망진창에 배우 연기력도 꽝이에요.'
pred_out = predictor.predict(input_sentence)
print(pred_out)

algo-1-4z4hp_1  | 2020-05-28 11:51:21,400 [INFO ] W-9003-model com.amazonaws.ml.mms.wlm.WorkerThread - Backend response time: 87
algo-1-4z4hp_1  | 2020-05-28 11:51:21,401 [INFO ] W-9003-model ACCESS_LOG - /172.18.0.1:54172 "POST /invocations HTTP/1.1" 200 89
{'score': [0.9753417372703552, 0.024658288806676865], 'time': 0.08612608909606934}
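
The `score` list holds one probability per class. Judging from the two examples above, where the positive review scores highest at index 1 and the negative review at index 0, the class order appears to be negative first, then positive. The sketch below turns a response into a label under that assumption; `LABELS` and `interpret` are hypothetical names, not part of the endpoint.

```python
LABELS = ('negative', 'positive')  # assumed class order, inferred from the two examples


def interpret(pred_out):
    """Return the most likely label and its probability."""
    scores = pred_out['score']
    best = max(range(len(scores)), key=lambda i: scores[i])
    return LABELS[best], scores[best]


print(interpret({'score': [0.030505415052175522, 0.9694945812225342]}))
print(interpret({'score': [0.9753417372703552, 0.024658288806676865]}))
```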


### Optional: Delete the Endpoint

In [7]:
# predictor.delete_endpoint()
# predictor.delete_model()