# Fine-tune a PyTorch BERT model and deploy it with Elastic Inference on Amazon SageMaker

Text classification is a technique for putting text into different categories and has a wide range of applications: email providers use text classification to detect spam emails, marketing agencies use it for sentiment analysis of customer reviews, and moderators of discussion forums use it to detect inappropriate comments.

In the past, data scientists used methods such as [tf-idf](https://en.wikipedia.org/wiki/Tf%E2%80%93idf), [word2vec](https://en.wikipedia.org/wiki/Word2vec), or [bag-of-words (BOW)](https://en.wikipedia.org/wiki/Bag-of-words_model) to generate features for training classification models. While these techniques have been very successful in many NLP tasks, they don't always capture the meanings of words accurately when they appear in different contexts. Recently, there has been increasing interest in using Bidirectional Encoder Representations from Transformers (BERT) to achieve better results in text classification tasks, due to its ability to more accurately encode the meaning of words in different contexts.

BERT was trained on BookCorpus and English Wikipedia data, which contain 800 million words and 2,500 million words, respectively. Training BERT from scratch would be prohibitively expensive. By taking advantage of transfer learning, one can quickly fine-tune BERT for another use case with a relatively small amount of training data to achieve state-of-the-art results for common NLP tasks, such as text classification and question answering.

Amazon SageMaker is a fully managed service that provides developers and data scientists with the ability to build, train, and deploy machine learning (ML) models quickly. Amazon SageMaker removes the heavy lifting from each step of the machine learning process to make it easier to develop high-quality models. The SageMaker Python SDK provides open source APIs and containers that make it easy to train and deploy models in Amazon SageMaker with several different machine learning and deep learning frameworks.

Our customers often ask for quick fine-tuning and easy deployment of their NLP models. Furthermore, customers prefer low inference latency and low model inference cost. [Amazon Elastic Inference](https://aws.amazon.com/machine-learning/elastic-inference) enables attaching GPU-powered inference acceleration to endpoints, reducing the cost of deep learning inference without sacrificing performance.

This blog post demonstrates how to use Amazon SageMaker to fine-tune a PyTorch BERT model and deploy it with Elastic Inference. This work is inspired by a post by [Chris McCormick and Nick Ryan](https://mccormickml.com/2019/07/22/BERT-fine-tuning).

In this example, we walk through our dataset, the training process, and finally model deployment. 

# Setup

To start, we import some Python libraries and initialize a SageMaker session, S3 bucket and prefix, and IAM role.

In [3]:
# need torch 1.3.1 for elastic inference

!pip install torch==1.3.1

!pip install transformers

Collecting torch==1.3.1
  Downloading torch-1.3.1-cp36-cp36m-manylinux1_x86_64.whl (734.6 MB)
Installing collected packages: torch
  Attempting uninstall: torch
    Found existing installation: torch 1.4.0
    Uninstalling torch-1.4.0:
      Successfully uninstalled torch-1.4.0
Successfully installed torch-1.3.1


In [1]:
import os
import numpy as np
import pandas as pd
import sagemaker

sagemaker_session = sagemaker.Session()

bucket = sagemaker_session.default_bucket()
prefix = "sagemaker/DEMO-pytorch-bert"

role = sagemaker.get_execution_role()

In [2]:
import torch
print(torch.__version__)

1.3.1


# Prepare training data

We use the [Corpus of Linguistic Acceptability (CoLA)](https://nyu-mll.github.io/CoLA/), a dataset of 10,657 English sentences drawn from published linguistics literature and labeled as grammatical or ungrammatical. We download and unzip the data using the following code:

### Download data

In [3]:
if not os.path.exists("./cola_public_1.1.zip"):
    !curl -o ./cola_public_1.1.zip https://nyu-mll.github.io/CoLA/cola_public_1.1.zip
if not os.path.exists("./cola_public/"):
    !unzip cola_public_1.1.zip

  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed
100  249k  100  249k    0     0  1548k      0 --:--:-- --:--:-- --:--:-- 1548k
Archive:  cola_public_1.1.zip
   creating: cola_public/
  inflating: cola_public/README      
   creating: cola_public/tokenized/
  inflating: cola_public/tokenized/in_domain_dev.tsv  
  inflating: cola_public/tokenized/in_domain_train.tsv  
  inflating: cola_public/tokenized/out_of_domain_dev.tsv  
   creating: cola_public/raw/
  inflating: cola_public/raw/in_domain_dev.tsv  
  inflating: cola_public/raw/in_domain_train.tsv  
  inflating: cola_public/raw/out_of_domain_dev.tsv  


### Get sentences and labels

Let us take a quick look at our data. First we read in the training data. The only two columns we need are the sentence itself and its label. 

In [4]:
df = pd.read_csv(
    "./cola_public/raw/in_domain_train.tsv",
    sep="\t",
    header=None,
    usecols=[1, 3],
    names=["label", "sentence"],
)
sentences = df.sentence.values
labels = df.label.values

Printing out a few sentences shows how they are labeled according to their grammatical acceptability: 1 for acceptable and 0 for unacceptable.

In [5]:
print(sentences[20:25])
print(labels[20:25])

['The professor talked us.' 'We yelled ourselves hoarse.'
 'We yelled ourselves.' 'We yelled Harry hoarse.'
 'Harry coughed himself into a fit.']
[0 1 0 0 1]


We then split the dataset for training and testing.

In [6]:
from sklearn.model_selection import train_test_split

train, test = train_test_split(df)
train.to_csv("./cola_public/train.csv", index=False)
test.to_csv("./cola_public/test.csv", index=False)
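Because `train_test_split` is called without a `test_size`, scikit-learn falls back to its default of holding out 25% of the rows. The arithmetic below is a small sketch of how that default leads to the split sizes reported later in the training log (6,413 training rows and 2,138 test rows). The helper name `default_split_sizes` is ours for illustration and is not part of scikit-learn.

```python
import math


def default_split_sizes(n_samples, test_size=0.25):
    """Mirror how a fractional test_size is turned into row counts.

    The test share is rounded up and the training set receives the
    remaining rows.
    """
    n_test = math.ceil(test_size * n_samples)
    n_train = n_samples - n_test
    return n_train, n_test


# 8,551 is the sum of the train and test counts shown in the training log.
n_train, n_test = default_split_sizes(8551)
print(n_train, n_test)
```

The split is random, so the rows that end up in each file differ between runs; passing `random_state` to `train_test_split` is one way to make it reproducible.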

Next, we upload both to Amazon S3 for use later. The SageMaker Python SDK provides a helpful function for uploading to Amazon S3:

In [7]:
inputs_train = sagemaker_session.upload_data("./cola_public/train.csv", bucket=bucket, key_prefix=prefix)
inputs_test = sagemaker_session.upload_data("./cola_public/test.csv", bucket=bucket, key_prefix=prefix)

# Run training

## Training script

We use the [PyTorch-Transformers library](https://pytorch.org/hub/huggingface_pytorch-transformers), which contains PyTorch implementations and pre-trained model weights for many NLP models, including BERT.

Our training script should save model artifacts learned during training to a file path called `model_dir`, as stipulated by the SageMaker PyTorch image. Upon completion of training, model artifacts saved in `model_dir` will be uploaded to S3 by SageMaker and will become available in S3 for deployment.

We save this script in a file named `train_deploy.py`, and put the file in a directory named `code/`. The full training script can be viewed under `code/`.
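As a rough illustration of how a training script can pick up `model_dir` and the input channels, the sketch below reads them from the `SM_MODEL_DIR`, `SM_CHANNEL_TRAINING`, and `SM_CHANNEL_TESTING` environment variables that SageMaker sets inside the training container. The argument names and the local fallback paths are assumptions for this sketch; the actual `train_deploy.py` may organize its arguments differently.

```python
import argparse
import os


def parse_args(argv=None):
    """Read hyperparameters and SageMaker-provided paths.

    SageMaker passes hyperparameters as command-line flags and exposes
    paths through SM_* environment variables inside the training container.
    The fallback values are placeholders so the sketch also runs locally.
    """
    parser = argparse.ArgumentParser()
    parser.add_argument("--epochs", type=int, default=1)
    parser.add_argument("--num_labels", type=int, default=2)
    parser.add_argument(
        "--model-dir", type=str, default=os.environ.get("SM_MODEL_DIR", "./model")
    )
    parser.add_argument(
        "--data-dir", type=str, default=os.environ.get("SM_CHANNEL_TRAINING", "./data/train")
    )
    parser.add_argument(
        "--test", type=str, default=os.environ.get("SM_CHANNEL_TESTING", "./data/test")
    )
    return parser.parse_args(argv)


# Simulate the flags SageMaker would generate from the hyperparameters dict.
args = parse_args(["--epochs", "0", "--num_labels", "2"])
print(args.model_dir, args.data_dir, args.test)
```

The channel names in the environment variables follow the keys passed to `fit()`, which is why a `"training"` channel shows up as `SM_CHANNEL_TRAINING`.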

In [8]:
!pygmentize code/train_deploy.py

[34mimport[39;49;00m [04m[36margparse[39;49;00m
[34mimport[39;49;00m [04m[36mjson[39;49;00m
[34mimport[39;49;00m [04m[36mlogging[39;49;00m
[34mimport[39;49;00m [04m[36mos[39;49;00m
[34mimport[39;49;00m [04m[36msys[39;49;00m

[34mimport[39;49;00m [04m[36mnumpy[39;49;00m [34mas[39;49;00m [04m[36mnp[39;49;00m
[34mimport[39;49;00m [04m[36mpandas[39;49;00m [34mas[39;49;00m [04m[36mpd[39;49;00m
[34mimport[39;49;00m [04m[36mtorch[39;49;00m
[34mimport[39;49;00m [04m[36mtorch[39;49;00m[04m[36m.[39;49;00m[04m[36mdistributed[39;49;00m [34mas[39;49;00m [04m[36mdist[39;49;00m
[34mimport[39;49;00m [04m[36mtorch[39;49;00m[04m[36m.[39;49;00m[04m[36mutils[39;49;00m[04m[36m.[39;49;00m[04m[36mdata[39;49;00m
[34mimport[39;49;00m [04m[36mtorch[39;49;00m[04m[36m.[39;49;00m[04m[36mutils[39;49;00m[04m[36m.[39;49;00m[04m[36mdata[39;49;00m[04m[36m.[39;49;00m[04m[36mdistributed[39;49;00m
[34mfro

            logits = outputs[[34m0[39;49;00m]
            logits = logits.detach().cpu().numpy()
            label_ids = b_labels.to([33m"[39;49;00m[33mcpu[39;49;00m[33m"[39;49;00m).numpy()
            tmp_eval_accuracy = flat_accuracy(logits, label_ids)
            eval_accuracy += tmp_eval_accuracy

    logger.info([33m"[39;49;00m[33mTest set: Accuracy: [39;49;00m[33m%f[39;49;00m[33m\n[39;49;00m[33m"[39;49;00m, tmp_eval_accuracy)


[34mdef[39;49;00m [32mmodel_fn[39;49;00m(model_dir):
    device = torch.device([33m"[39;49;00m[33mcuda[39;49;00m[33m"[39;49;00m [34mif[39;49;00m torch.cuda.is_available() [34melse[39;49;00m [33m"[39;49;00m[33mcpu[39;49;00m[33m"[39;49;00m)

    model = BertForSequenceClassification.from_pretrained(model_dir)
    [34mreturn[39;49;00m model.to(device)


[34mdef[39;49;00m [32minput_fn[39;49;00m(request_body, request_content_type):
    [33m"""An input_fn that loads a pickled tensor"""[39;49;00m
   

## Train on Amazon SageMaker

We use Amazon SageMaker to train and deploy a model using our custom PyTorch code. The Amazon SageMaker Python SDK makes it easier to run a PyTorch script in Amazon SageMaker using its PyTorch estimator. After that, we can use the SageMaker Python SDK to deploy the trained model and run predictions. For more information on how to use this SDK with PyTorch, see [the SageMaker Python SDK documentation](https://sagemaker.readthedocs.io/en/stable/using_pytorch.html).

To start, we use the `PyTorch` estimator class to train our model. When creating our estimator, we make sure to specify a few things:

* `entry_point`: the name of our PyTorch script. It contains our training script, which loads data from the input channels, configures training with hyperparameters, trains a model, and saves a model. It also contains code to load and run the model during inference.
* `source_dir`: the location of our training scripts and `requirements.txt` file. The `requirements.txt` file lists the packages you want to use with your script.
* `framework_version`: the PyTorch version we want to use

The PyTorch estimator supports multi-machine, distributed PyTorch training. To use this, we just set `instance_count` to be greater than one. Our training script supports distributed training only on GPU instances.
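The sketch below shows one way a training script can decide whether to run in distributed mode, based on the host list that SageMaker exposes through the `SM_HOSTS` and `SM_CURRENT_HOST` environment variables. The function name `distributed_settings` and its exact logic are assumptions for illustration, modeled on the `Distributed training - False` line in the training log; the full script is not reproduced here.

```python
import json
import os


def distributed_settings(hosts, current_host, backend):
    """Return (is_distributed, world_size, rank) for this host."""
    # Distributed mode needs more than one host and a communication backend.
    is_distributed = len(hosts) > 1 and backend is not None
    world_size = len(hosts)
    rank = hosts.index(current_host)
    return is_distributed, world_size, rank


# Inside a training container, SageMaker exposes the cluster layout through
# SM_HOSTS (a JSON list) and SM_CURRENT_HOST. The defaults are local stand-ins.
hosts = json.loads(os.environ.get("SM_HOSTS", '["algo-1"]'))
current_host = os.environ.get("SM_CURRENT_HOST", hosts[0])

print(distributed_settings(hosts, current_host, backend=None))
print(distributed_settings(["algo-1", "algo-2"], "algo-2", backend="gloo"))
```

With a single instance and no backend, as in the debug run in this notebook, the first value is `False` and the script would train on one host.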

After creating the estimator, we call `fit()`, which launches a training job. We pass the Amazon S3 URIs of the training and test data that we uploaded earlier.

In [8]:
from sagemaker.pytorch import PyTorch

# place to save model artifact
output_path = f"s3://{bucket}/{prefix}"

estimator = PyTorch(
    entry_point="train_deploy.py",
    source_dir="code",
    role=role,
    framework_version="1.3.1",
    py_version="py3",
    instance_count=1,  # debug setting; use 2 for distributed training (GPU instances only)
    instance_type="local",  # debug setting; use "ml.p3.2xlarge" to train on SageMaker
    output_path=output_path,
    hyperparameters={
        "epochs": 0,  # debug setting; use 1 for an actual fine-tuning pass
        "num_labels": 2,
        # "backend": "gloo",
    },
    disable_profiler=True,  # turn off the SageMaker Debugger profiler
)
estimator.fit({"training": inputs_train, "testing": inputs_test})

Creating tmpcno7ooe6_algo-1-pkukq_1 ... done
Attaching to tmpcno7ooe6_algo-1-pkukq_1
algo-1-pkukq_1  | 2021-01-14 20:23:09,825 sagemaker-containers INFO     Imported framework sagemaker_pytorch_container.training
algo-1-pkukq_1  | 2021-01-14 20:23:09,827 sagemaker-containers INFO     No GPUs detected (normal if no gpus installed)
algo-1-pkukq_1  | 2021-01-14 20:23:09,837 sagemaker_pytorch_container.training INFO     Block until all host DNS lookups succeed.
algo-1-pkukq_1  | 2021-01-14 20:23:09,839 sagemaker_pytorch_container.training INFO     Invoking user training script.
algo-1-pkukq_1  | 2021-01-14 20:23:10,968 sagemaker-containers INFO     Module default_user_module_name does not provide a setup.py.
algo-1-pkukq_1  | Generating setup.py
algo-1-pkukq_1  | 2021-01-14 20:23:10,968 sagemaker-containers INFO     Generating setup.cfg
algo-1-pkukq_1  | 2021-01-14 20:23:10,968 sagemaker-containers INFO

algo-1-pkukq_1  | Loading BERT tokenizer...
algo-1-pkukq_1  | Distributed training - False
algo-1-pkukq_1  | Number of gpus available - 0
algo-1-pkukq_1  | Get train data loader
algo-1-pkukq_1  | Processes 6413/6413 (100%) of train data
algo-1-pkukq_1  | Processes 2138/2138 (100%) of test data
algo-1-pkukq_1  | Starting BertForSequenceClassification
algo-1-pkukq_1  |
algo-1-pkukq_1  | End of defining BertForSequenceClassification
algo-1-pkukq_1  |
algo-1-pkukq_1  | Saving tuned model.
algo-1-pkukq_1  | 2021-01-14 20:23:46,254 sagemaker-containers INFO     Reporting training SUCCESS
tmpcno7ooe6_algo-1-pkukq_1 exited with code 0
Aborting on container exit...
===== Job Complete =====


# Host

After training our model, we host it on an Amazon SageMaker Endpoint. To make the endpoint load the model and serve predictions, we implement a few methods in `train_deploy.py`.

* `model_fn()`: loads the saved model and returns a model object that can be used for model serving. The SageMaker PyTorch model server loads our model by invoking `model_fn()`.
* `input_fn()`: deserializes and prepares the prediction input. In this example, our request body is first serialized to JSON and then sent to the model serving endpoint. Therefore, in `input_fn()`, we first deserialize the JSON-formatted request body and return the input as a `torch.tensor`, as required for BERT.
* `predict_fn()`: performs the prediction and returns the result.
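To make the role of `input_fn()` more concrete, here is a minimal sketch of the deserialization step, followed by the padding and attention-mask step that BERT inputs need. It is not the code from `train_deploy.py`: the helper `pad_and_mask`, the hard-coded token ids, and the use of plain lists instead of `torch.tensor` are assumptions chosen so that the example runs without downloading a tokenizer.

```python
import json

MAX_LEN = 64  # assumed maximum sequence length, matching the tracing input used later


def input_fn(request_body, request_content_type):
    """Deserialize a JSON request body into the raw sentence."""
    if request_content_type != "application/json":
        raise ValueError(f"Unsupported content type: {request_content_type}")
    return json.loads(request_body)


def pad_and_mask(token_ids, max_len=MAX_LEN, pad_id=0):
    """Truncate or pad token ids to max_len and build the attention mask."""
    token_ids = list(token_ids)[:max_len]
    attention_mask = [1] * len(token_ids)  # 1 marks real tokens
    padding = max_len - len(token_ids)
    return token_ids + [pad_id] * padding, attention_mask + [0] * padding


sentence = input_fn('"Somebody just left - guess who."', "application/json")
# Hypothetical token ids; a real script would obtain them from BertTokenizer.
ids, mask = pad_and_mask([101, 2619, 2074, 2187, 102], max_len=8)
print(sentence)
print(ids)
print(mask)
```

In a real serving script, the padded ids and the mask would be wrapped in tensors and handed to `predict_fn()`, which runs the model's forward pass.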

To deploy our endpoint, we call `deploy()` on our PyTorch estimator object, passing in our desired number of instances and instance type:


In [15]:
predictor = estimator.deploy(initial_instance_count=1, instance_type="local") # debug #"ml.m4.xlarge")

Using the short-lived AWS credentials found in session. They might expire while running.


Attaching to tmph585qkaq_algo-1-f0a2q_1
algo-1-f0a2q_1  | Collecting regex
algo-1-f0a2q_1  |   Downloading regex-2020.11.13-cp36-cp36m-manylinux2014_x86_64.whl (723 kB)
algo-1-f0a2q_1  | Collecting sentencepiece
algo-1-f0a2q_1  |   Downloading sentencepiece-0.1.95-cp36-cp36m-manylinux2014_x86_64.whl (1.2 MB)
algo-1-f0a2q_1  | Collecting sacremoses
algo-1-f0a2q_1  |   Downloading sacremoses-0.0.43.tar.gz (883 kB)
algo-1-f0a2q_1  | Collecting transformers==2.3.0
algo-1-f0a2q_1  |   Downloading transformers-2.3.0-py3-none-any.whl (447 kB)
algo-1-f0a2q_1  | Building wheels for collected packages: sacremoses

algo-1-f0a2q_1  | 2021-01-13 19:30:37,973 [INFO ] W-9000-model-stdout com.amazonaws.ml.mms.wlm.WorkerLifeCycle - PyTorch version 1.5.0+cpu available.
algo-1-f0a2q_1  | 2021-01-13 19:30:37,973 [INFO ] W-9004-model-stdout com.amazonaws.ml.mms.wlm.WorkerLifeCycle - PyTorch version 1.5.0+cpu available.
algo-1-f0a2q_1  | 2021-01-13 19:30:37,973 [INFO ] W-9005-model-stdout com.amazonaws.ml.mms.wlm.WorkerLifeCycle - PyTorch version 1.5.0+cpu available.
algo-1-f0a2q_1  | 2021-01-13 19:30:37,973 [INFO ] W-9002-model-stdout com.amazonaws.ml.mms.wlm.WorkerLifeCycle - PyTorch version 1.5.0+cpu available.
algo-1-f0a2q_1  | 2021-01-13 19:30:37,973 [INFO ] W-9007-model-stdout com.amazonaws.ml.mms.wlm.WorkerLifeCycle - PyTorch version 1.5.0+cpu available.
algo-1-f0a2q_1  | 2021-01-13 19:30:37,973 [INFO ] W-9003-model-stdout com.amazonaws.ml.mms.wlm.WorkerLifeCycle - PyTorch version 1.5.0+cpu available.
algo-1-f0a2q_1  | 2021-01-13 19:30:37

algo-1-f0a2q_1  | 2021-01-13 19:30:40,014 [INFO ] pool-1-thread-9 ACCESS_LOG - /172.18.0.1:48368 "GET /ping HTTP/1.1" 200 14
!algo-1-f0a2q_1  | 2021-01-13 19:30:42,580 [INFO ] W-9001-model com.amazonaws.ml.mms.wlm.WorkerThread - Backend response time: 6292


We then configure the predictor to use `application/json` for the content type when sending requests to our endpoint:

In [16]:
#from sagemaker.predictor import json_deserializer, json_serializer

#predictor.content_type = "application/json"
#predictor.accept = "application/json"
predictor.serializer = sagemaker.serializers.JSONSerializer()
predictor.deserializer = sagemaker.deserializers.JSONDeserializer()

Finally, we use the returned predictor object to call the endpoint:

In [18]:
result = predictor.predict("Somebody just left - guess who.")
print("predicted class: ", np.argmax(result, axis=1))

algo-1-f0a2q_1  | 2021-01-13 19:31:12,051 [INFO ] W-9005-model-stdout com.amazonaws.ml.mms.wlm.WorkerLifeCycle - Loading BERT tokenizer...
algo-1-f0a2q_1  | 2021-01-13 19:31:12,051 [INFO ] W-9005-model-stdout com.amazonaws.ml.mms.wlm.WorkerLifeCycle - Loading BERT tokenizer...
algo-1-f0a2q_1  | 2021-01-13 19:31:12,051 [INFO ] W-9005-model-stdout com.amazonaws.ml.mms.wlm.WorkerLifeCycle - loading file https://s3.amazonaws.com/models.huggingface.co/bert/bert-base-uncased-vocab.txt from cache at /root/.cache/torch/transformers/26bc1ad6c0ac742e9b52263248f6d0f00068293b33709fae12320c0e35ccfbbb.542ce4285a40d23a559526243235df47c5f75c197f04f37d1a0c124c32c9a084
algo-1-f0a2q_1  | 2021-01-13 19:31:12,076 [INFO ] W-9005-model-stdout com.amazonaws.ml.mms.wlm.WorkerLifeCycle - loading configuration file /.sagemaker/mms/models/model/config.json
algo-1-f0a2q_1  | 2021-01-13 19:31:12,077 [INFO ] W-9005-model-stdout com.amazonaws.ml.mms.wlm.WorkerLifeCycle - M

The predicted class is 1, as expected, because the test sentence is grammatically correct.

Before moving on, let's delete the Amazon SageMaker endpoint to avoid charges:

In [19]:
predictor.delete_endpoint()

KeyboardInterrupt: 

In [20]:
# debug: force-remove any local container that is still bound to port 8080
!docker ps -a | grep 8080 | awk '{print $1}' | xargs -r docker rm -f

## Use a pretrained model

If you want to reuse a model that has already been trained, you can create a `PyTorchModel` from existing model artifacts. For example, we can retrieve the model artifacts we just trained.

In [10]:
model_data = estimator.model_data
print(model_data)

s3://sagemaker-us-west-2-688520471316/sagemaker/DEMO-pytorch-bert/pytorch-training-2021-01-14-20-21-55-608/model.tar.gz


In [11]:
from sagemaker.pytorch.model import PyTorchModel 

pytorch_model = PyTorchModel(model_data=model_data,
                             role=role,
                             framework_version="1.3.1",
                             source_dir="code",
                             py_version="py3",
                             entry_point="train_deploy.py")

predictor = pytorch_model.deploy(initial_instance_count=1, instance_type="local") #Debug #"ml.m4.xlarge")

Attaching to tmpb42upwt__algo-1-29n03_1
algo-1-29n03_1  | Collecting regex
algo-1-29n03_1  |   Downloading https://files.pythonhosted.org/packages/0d/8a/3ac62dadb767ace65a5b954265de4031a99b27148fe14b24771f5c2c2dca/regex-2020.11.13-cp36-cp36m-manylinux2014_x86_64.whl (723kB)
algo-1-29n03_1  | Collecting sentencepiece
algo-1-29n03_1  |   Downloading https://files.pythonhosted.org/packages/14/67/e42bd1181472c95c8cda79305df848264f2a7f62740995a46945d9797b67/sentencepiece-0.1.95-cp36-cp36m-manylinux2014_x86_64.whl (1.2MB)
algo-1-29n03_1  | Collecting sacremoses
algo-1-29n03_1  |   Downloading https://files.pythonhosted.org/packages/7d/34/09d19aff26edcc8eb2a01bed8e98f13a1537005d31e95233fd48216eed10/sacremoses-0.0.43.tar.gz (883kB)

algo-1-29n03_1  | 2021-01-14 20:26:23,639 [INFO ] W-9005-model-stdout com.amazonaws.ml.mms.wlm.WorkerLifeCycle - Listening on port: /home/model-server/tmp/.mms.sock.9005
algo-1-29n03_1  | Model server started.
algo-1-29n03_1  | 2021-01-14 20:26:23,639 [INFO ] W-9005-model-stdout com.amazonaws.ml.mms.wlm.WorkerLifeCycle - [PID]63
algo-1-29n03_1  | 2021-01-14 20:26:23,639 [INFO ] W-9005-model-stdout com.amazonaws.ml.mms.wlm.WorkerLifeCycle - MXNet worker started.
algo-1-29n03_1  | 2021-01-14 20:26:23,640 [INFO ] W-9005-model com.amazonaws.ml.mms.wlm.WorkerThread - Connecting to: /home/model-server/tmp/.mms.sock.9005
algo-1-29n03_1  | 2021-01-14 20:26:23,640 [INFO ] W-9005-model-stdout com.amazonaws.ml.mms.wlm.WorkerLifeCycle - Python runtime: 3.6.6
algo-1-29n03_1  | 2021-01-14 20:26:23,645 [WARN ] pool-2-thread-1 com.amazonaws.ml.mms.metrics.MetricCollector - worker pid is not available yet.
algo-1-29n03_1  | 2021-01

## Elastic Inference

Selecting the right instance type for inference requires deciding between different amounts of GPU, CPU, and memory resources, and optimizing for one of these resources on a standalone GPU instance usually leads to under-utilization of other resources. [Amazon Elastic Inference](https://aws.amazon.com/machine-learning/elastic-inference/) solves this problem by enabling us to attach the right amount of GPU-powered inference acceleration to our endpoint. In March 2020, [Elastic Inference support for PyTorch became available](https://aws.amazon.com/blogs/machine-learning/reduce-ml-inference-costs-on-amazon-sagemaker-for-pytorch-models-using-amazon-elastic-inference/) for both Amazon SageMaker and Amazon EC2.

To use Elastic Inference, we must convert our trained model to TorchScript. The location of the model artifacts is `estimator.model_data`. 

First we create a folder for the trained model and download the `model.tar.gz` file to a local directory.

In [12]:
%%sh -s $estimator.model_data
mkdir model
aws s3 cp $1 model/ 
tar xvzf model/model.tar.gz --directory ./model


The following code converts our model into the TorchScript format:

In [14]:
import subprocess
import torch
from transformers import BertForSequenceClassification

model_torchScript = BertForSequenceClassification.from_pretrained("model/", torchscript=True)
device = "cpu"
for_jit_trace_input_ids = [0] * 64
for_jit_trace_attention_masks = [0] * 64
for_jit_trace_input = torch.tensor([for_jit_trace_input_ids])
for_jit_trace_masks = torch.tensor([for_jit_trace_attention_masks])

traced_model = torch.jit.trace(
    model_torchScript, [for_jit_trace_input.to(device), for_jit_trace_masks.to(device)]
)
torch.jit.save(traced_model, "traced_bert.pt")

subprocess.call(["tar", "-czvf", "traced_bert.tar.gz", "traced_bert.pt"])

0

algo-1-29n03_1  | 2021-01-14 20:29:19,509 [INFO ] epollEventLoopGroup-4-6 com.amazonaws.ml.mms.wlm.WorkerThread - 9003 Worker disconnected. WORKER_MODEL_LOADED
algo-1-29n03_1  | 2021-01-14 20:29:19,511 [INFO ] epollEventLoopGroup-4-4 com.amazonaws.ml.mms.wlm.WorkerThread - 9005 Worker disconnected. WORKER_MODEL_LOADED
tmpb42upwt__algo-1-29n03_1 exited with code 137
Aborting on container exit...


Exception in thread Thread-4:
Traceback (most recent call last):
  File "/home/ubuntu/anaconda3/envs/pytorch_p36/lib/python3.6/site-packages/sagemaker/local/image.py", line 632, in run
    _stream_output(self.process)
  File "/home/ubuntu/anaconda3/envs/pytorch_p36/lib/python3.6/site-packages/sagemaker/local/image.py", line 692, in _stream_output
    raise RuntimeError("Process exited with code: %s" % exit_code)
RuntimeError: Process exited with code: 137

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/home/ubuntu/anaconda3/envs/pytorch_p36/lib/python3.6/threading.py", line 916, in _bootstrap_inner
    self.run()
  File "/home/ubuntu/anaconda3/envs/pytorch_p36/lib/python3.6/site-packages/sagemaker/local/image.py", line 637, in run
    raise RuntimeError(msg)
RuntimeError: Failed to run: ['docker-compose', '-f', '/tmp/tmpb42upwt_/docker-compose.yaml', 'up', '--build', '--abort-on-container-exit'], Process exited with code

Loading the TorchScript model and using it for prediction requires small changes in our model loading and prediction functions. We create a new script, `deploy_ei.py`, that is slightly different from the `train_deploy.py` script.
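The kind of change involved can be sketched as follows: instead of calling `BertForSequenceClassification.from_pretrained`, the loading function reads the traced file with `torch.jit.load`. This is a sketch under stated assumptions, not the contents of `deploy_ei.py`. To keep it self-contained, a small hypothetical module (`TinyClassifier`) stands in for the fine-tuned BERT model, and `predict_fn` takes the two input tensors directly.

```python
import os
import tempfile

import torch


class TinyClassifier(torch.nn.Module):
    """Hypothetical stand-in for the fine-tuned BERT classifier."""

    def __init__(self, vocab_size=100, hidden=8, num_labels=2):
        super().__init__()
        self.embed = torch.nn.Embedding(vocab_size, hidden)
        self.out = torch.nn.Linear(hidden, num_labels)

    def forward(self, input_ids, attention_mask):
        # Zero out padded positions, then average over the sequence.
        hidden = self.embed(input_ids) * attention_mask.unsqueeze(-1).float()
        return self.out(hidden.mean(dim=1))


def model_fn(model_dir):
    """Load a traced model from model_dir, as a serving script might do."""
    path = os.path.join(model_dir, "traced_bert.pt")
    model = torch.jit.load(path, map_location="cpu")
    return model.eval()


def predict_fn(input_data, model):
    """Run the traced model on (input_ids, attention_mask) without gradients."""
    input_ids, attention_mask = input_data
    with torch.no_grad():
        return model(input_ids, attention_mask)


# Trace and save the stand-in model, using the same calls as the BERT tracing cell.
input_ids = torch.zeros((1, 64), dtype=torch.long)
attention_mask = torch.ones((1, 64), dtype=torch.long)
original = TinyClassifier().eval()
model_dir = tempfile.mkdtemp()
traced = torch.jit.trace(original, (input_ids, attention_mask))
torch.jit.save(traced, os.path.join(model_dir, "traced_bert.pt"))

loaded = model_fn(model_dir)
logits = predict_fn((input_ids, attention_mask), loaded)
print(logits.shape)
```

On an Elastic Inference endpoint, the AWS examples for PyTorch additionally wrap the forward pass in `torch.jit.optimized_execution(True, {"target_device": "eia:0"})`. That context manager variant is only available in the Elastic Inference-enabled build of PyTorch, so it is left out of this sketch.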

In [4]:
!pygmentize code/deploy_ei.py

[34mimport[39;49;00m [04m[36mjson[39;49;00m
[34mimport[39;49;00m [04m[36mlogging[39;49;00m
[34mimport[39;49;00m [04m[36mos[39;49;00m
[34mimport[39;49;00m [04m[36msys[39;49;00m

[34mimport[39;49;00m [04m[36mtorch[39;49;00m
[34mimport[39;49;00m [04m[36mtorch[39;49;00m[04m[36m.[39;49;00m[04m[36mutils[39;49;00m[04m[36m.[39;49;00m[04m[36mdata[39;49;00m
[34mimport[39;49;00m [04m[36mtorch[39;49;00m[04m[36m.[39;49;00m[04m[36mutils[39;49;00m[04m[36m.[39;49;00m[04m[36mdata[39;49;00m[04m[36m.[39;49;00m[04m[36mdistributed[39;49;00m
[34mfrom[39;49;00m [04m[36mtransformers[39;49;00m [34mimport[39;49;00m BertTokenizer

logger = logging.getLogger([31m__name__[39;49;00m)
logger.setLevel(logging.DEBUG)
logger.addHandler(logging.StreamHandler(sys.stdout))

MAX_LEN = [34m64[39;49;00m  [37m# this is the max length of the sentence[39;49;00m

[36mprint[39;49;00m([33m"[39;49;00m[33mLoading BERT tokenizer...[39;

Next we upload the TorchScript model to S3 and deploy it using Elastic Inference. The `accelerator_type='ml.eia2.xlarge'` parameter is how we attach the Elastic Inference accelerator to our endpoint.

In [11]:
from sagemaker.pytorch import PyTorchModel

instance_type = 'ml.m5.large'
accelerator_type = 'ml.eia2.xlarge'

# TorchScript model
tar_filename = 'traced_bert.tar.gz'

# Returns S3 bucket URL
print('Upload tarball to S3')
model_data = sagemaker_session.upload_data(path=tar_filename, bucket=bucket, key_prefix=prefix)

endpoint_name = 'bert-ei-traced-{}-{}'.format(instance_type, accelerator_type).replace('.', '').replace('_', '')

pytorch = PyTorchModel(
    model_data=model_data,
    role=role,
    entry_point='deploy_ei.py',
    source_dir='code',
    framework_version='1.3.1',
    py_version='py3',
    sagemaker_session=sagemaker_session
)

# Function will exit before endpoint is finished creating
'''
predictor = pytorch.deploy(
    initial_instance_count=1,
    instance_type=instance_type,
    accelerator_type=accelerator_type,
    endpoint_name=endpoint_name,
    wait=True, # Debug #False
)
'''

Upload tarball to S3


'\npredictor = pytorch.deploy(\n    initial_instance_count=1,\n    instance_type=instance_type,\n    accelerator_type=accelerator_type,\n    endpoint_name=endpoint_name,\n    wait=True, # Debug #False\n)\n'
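The cell above strips dots and underscores from the instance and accelerator types before using them in `endpoint_name`, because SageMaker endpoint names may contain only alphanumeric characters and hyphens and are limited to 63 characters. The sketch below reproduces that string handling and checks the result. The helper names and the regular expression are written for this illustration and are not part of the SageMaker SDK.

```python
import re

# Pattern and length limit for SageMaker endpoint names.
ENDPOINT_NAME_PATTERN = re.compile(r"^[a-zA-Z0-9](-*[a-zA-Z0-9])*$")
MAX_ENDPOINT_NAME_LEN = 63


def build_endpoint_name(instance_type, accelerator_type):
    """Build the endpoint name the same way the cell above does."""
    name = "bert-ei-traced-{}-{}".format(instance_type, accelerator_type)
    return name.replace(".", "").replace("_", "")


def is_valid_endpoint_name(name):
    """Check the character set and length of a candidate endpoint name."""
    return len(name) <= MAX_ENDPOINT_NAME_LEN and bool(ENDPOINT_NAME_PATTERN.match(name))


name = build_endpoint_name("ml.m5.large", "ml.eia2.xlarge")
print(name)
print(is_valid_endpoint_name(name))
# Without the cleanup, the dots would make the name invalid.
print(is_valid_endpoint_name("bert-ei-traced-ml.m5.large"))
```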

In [34]:
# Remote predictor

instance_type = 'ml.m5.large'
accelerator_type = 'ml.eia2.xlarge'

pytorch = PyTorchModel(
    model_data=model_data,
    role=role,
    entry_point='deploy_ei.py',
    source_dir='code',
    framework_version='1.3.1',
    py_version='py3',
    sagemaker_session=sagemaker_session
)

predictor = pytorch.deploy(
    initial_instance_count=1,
    instance_type=instance_type,
    accelerator_type=accelerator_type,
    #endpoint_name=endpoint_name,
    wait=True, # Debug #False
)

---------------!

In [35]:
predictor.serializer = sagemaker.serializers.JSONSerializer()
predictor.deserializer = sagemaker.deserializers.JSONDeserializer()

In [38]:
res = predictor.predict('Please remember to delete me when you are done')
print(res)

[[0.12622171640396118, 0.22182804346084595]]
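The endpoint returns one pair of raw scores per input sentence rather than a class label. Assuming these values are the classifier's logits (index 0 for "unacceptable", index 1 for "acceptable", following the dataset labels), the sketch below turns them into probabilities with a softmax and picks the larger one, which is the same decision `np.argmax` makes earlier in the notebook. It uses only the standard library so that it runs without the endpoint.

```python
import math


def softmax(scores):
    """Convert raw scores (logits) into probabilities that sum to 1."""
    top = max(scores)
    exps = [math.exp(s - top) for s in scores]  # subtract the max for numerical stability
    total = sum(exps)
    return [e / total for e in exps]


# Scores returned by the endpoint in the cell above (one row per input sentence).
res = [[0.12622171640396118, 0.22182804346084595]]

for row in res:
    probs = softmax(row)
    predicted = max(range(len(row)), key=row.__getitem__)  # index of the largest score
    print(predicted, [round(p, 3) for p in probs])
```

The two scores are close to each other, which may reflect the debug setting of zero training epochs used for the run shown in this notebook.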


In [None]:
# Debug
# deploy to local and check model_fn
from sagemaker.pytorch import PyTorchModel

# TorchScript model
tar_filename = 'traced_bert.tar.gz'

# Returns S3 bucket URL
print('Upload tarball to S3')
model_data = sagemaker_session.upload_data(path=tar_filename, bucket=bucket, key_prefix=prefix)

#endpoint_name = 'bert-ei-traced-{}-{}'.format(instance_type, accelerator_type).replace('.', '').replace('_', '')

In [31]:

pytorch = PyTorchModel(
    model_data=model_data,
    role=role,
    entry_point='deploy_ei.py',
    source_dir='code',
    framework_version='1.3.1',
    py_version='py3',
    #sagemaker_session=sagemaker_session
)

local_predictor = pytorch.deploy(
    initial_instance_count=1,
    instance_type='local',
    wait = True
)

local_predictor.serializer = sagemaker.serializers.JSONSerializer()
local_predictor.deserializer = sagemaker.deserializers.JSONDeserializer()

Attaching to tmp9j60ynn7_algo-1-n5nl6_1
algo-1-n5nl6_1  | Collecting regex
algo-1-n5nl6_1  |   Downloading https://files.pythonhosted.org/packages/0d/8a/3ac62dadb767ace65a5b954265de4031a99b27148fe14b24771f5c2c2dca/regex-2020.11.13-cp36-cp36m-manylinux2014_x86_64.whl (723kB)
algo-1-n5nl6_1  | Collecting sentencepiece
algo-1-n5nl6_1  |   Downloading https://files.pythonhosted.org/packages/14/67/e42bd1181472c95c8cda79305df848264f2a7f62740995a46945d9797b67/sentencepiece-0.1.95-cp36-cp36m-manylinux2014_x86_64.whl (1.2MB)
algo-1-n5nl6_1  | Collecting sacremoses
algo-1-n5nl6_1  |   Downloading https://files.pythonhosted.org/packages/7d/34/09d19aff26edcc8eb2a01bed8e98f13a1537005d31e95233fd48216eed10/sacremoses-0.0.43.tar.gz (883kB)

algo-1-n5nl6_1  | 2021-01-14 20:55:55,526 [INFO ] W-9001-model com.amazonaws.ml.mms.wlm.WorkerThread - Backend response time: 337
algo-1-n5nl6_1  | 2021-01-14 20:55:55,526 [INFO ] W-9007-model com.amazonaws.ml.mms.wlm.WorkerThread - Backend response time: 347
algo-1-n5nl6_1  | 2021-01-14 20:55:55,532 [INFO ] W-9004-model com.amazonaws.ml.mms.wlm.WorkerThread - Backend response time: 353
algo-1-n5nl6_1  | 2021-01-14 20:55:55,545 [INFO ] W-9005-model com.amazonaws.ml.mms.wlm.WorkerThread - Backend response time: 375
algo-1-n5nl6_1  | 2021-01-14 20:55:55,551 [INFO ] W-9002-model com.amazonaws.ml.mms.wlm.WorkerThread - Backend response time: 368
algo-1-n5nl6_1  | 2021-01-14 20:55:55,554 [INFO ] W-9006-model com.amazonaws.ml.mms.wlm.WorkerThread - Backend response time: 364
algo-1-n5nl6_1  | 2021-01-14 20:55:55,565 [INFO ] W-9000-model com.amazonaws.ml.mms.wlm.WorkerThread - Backend response time: 343
algo-1-n5nl6_1  |

In [22]:
print(local_predictor)

<sagemaker.pytorch.model.PyTorchPredictor object at 0x7f84efc4f908>


In [32]:
local_predictor.predict('Please remember to delete me when you are done')

algo-1-n5nl6_1  | 2021-01-14 20:56:02,728 [INFO ] W-9001-model-stdout com.amazonaws.ml.mms.wlm.WorkerLifeCycle - PyTorch version 1.3.1 available.
algo-1-n5nl6_1  | 2021-01-14 20:56:03,261 [INFO ] W-9001-model-stdout com.amazonaws.ml.mms.wlm.WorkerLifeCycle - Loading BERT tokenizer...
algo-1-n5nl6_1  | 2021-01-14 20:56:03,261 [INFO ] W-9001-model-stdout com.amazonaws.ml.mms.wlm.WorkerLifeCycle - https://s3.amazonaws.com/models.huggingface.co/bert/bert-base-uncased-vocab.txt not found in cache or force_download set to True, downloading to /home/model-server/tmp/tmp89hjbd91
algo-1-n5nl6_1  | 2021-01-14 20:56:03,790 [INFO ] W-9001-model-stdout com.amazonaws.ml.mms.wlm.WorkerLifeCycle - copying /home/model-server/tmp/tmp89hjbd91 to cache at /root/.cache/torch/transformers/26bc1ad6c0ac742e9b52263248f6d0f00068293b33709fae12320c0e35ccfbbb.542ce4285a40d23a559526243235df47c5f75c197f04f37d1a0c124c32c9a084
algo-1-n5nl6_1  | 2021-01-14 20:56:03,791 [INFO

[[0.12622159719467163, 0.22182820737361908]]

algo-1-n5nl6_1  | 2021-01-14 20:56:05,453 [INFO ] W-9001-model-stdout com.amazonaws.ml.mms.wlm.WorkerLifeCycle - tensor([[0.1262, 0.2218]])


In [30]:
!docker rm -f $(docker ps -a -q)

algo-1-817p1_1  | 2021-01-14 20:51:38,394 [INFO ] epollEventLoopGroup-4-4 com.amazonaws.ml.mms.wlm.WorkerThread - 9005 Worker disconnected. WORKER_MODEL_LOADED
tmph3yyzggg_algo-1-817p1_1 exited with code 137
Aborting on container exit...
30b8485c6240


Exception in thread Thread-9:
Traceback (most recent call last):
  File "/home/ubuntu/anaconda3/envs/pytorch_p36/lib/python3.6/site-packages/sagemaker/local/image.py", line 632, in run
    _stream_output(self.process)
  File "/home/ubuntu/anaconda3/envs/pytorch_p36/lib/python3.6/site-packages/sagemaker/local/image.py", line 692, in _stream_output
    raise RuntimeError("Process exited with code: %s" % exit_code)
RuntimeError: Process exited with code: 137

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/home/ubuntu/anaconda3/envs/pytorch_p36/lib/python3.6/threading.py", line 916, in _bootstrap_inner
    self.run()
  File "/home/ubuntu/anaconda3/envs/pytorch_p36/lib/python3.6/site-packages/sagemaker/local/image.py", line 637, in run
    raise RuntimeError(msg)
RuntimeError: Failed to run: ['docker-compose', '-f', '/tmp/tmph3yyzggg/docker-compose.yaml', 'up', '--build', '--abort-on-container-exit'], Process exited with code

In [31]:
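import os

import torch
from transformers import BertTokenizer


def model_fn(model_dir):
    # Use the GPU when one is available, otherwise fall back to the CPU.
    device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

    # Load the TorchScript model saved as traced_bert.pt in model_dir.
    loaded_model = torch.jit.load(os.path.join(model_dir, "traced_bert.pt"))
    return loaded_model.to(device)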

MAX_LEN = 64

tokenizer = BertTokenizer.from_pretrained('bert-base-uncased', do_lower_case=True)

input_ids = []
encoded = tokenizer.encode('this is a sentence', add_special_tokens=True)
    
input_ids.append(encoded)

print(input_ids)

# Pad shorter sentences with 0 (the [PAD] token ID) up to MAX_LEN
input_ids_padded = []
for i in input_ids:
    while len(i) < MAX_LEN:
        i.append(0)
    input_ids_padded.append(i)
input_ids = input_ids_padded
print(input_ids)

# Attention mask: 0 for padding tokens, 1 for real tokens
attention_masks = []
for sent in input_ids:
    att_mask = [int(token_id > 0) for token_id in sent]
    attention_masks.append(att_mask)

# convert to PyTorch data types.
train_inputs = torch.tensor(input_ids)
train_masks = torch.tensor(attention_masks)



[[101, 2023, 2003, 1037, 6251, 102]]
[[101, 2023, 2003, 1037, 6251, 102, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0]]


In [25]:
print(train_masks)

tensor([[1, 1, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
         0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
         0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0]])
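
The padding and masking steps above can be collected into one small helper. The following is a minimal sketch rather than the notebook's own code: `pad_and_mask` and its `pad_id` parameter are hypothetical names, and it assumes that the padding token ID is 0, as in the cell above. Unlike the loop above, it also truncates sequences longer than `max_len`, which is an added assumption.

```python
def pad_and_mask(token_ids, max_len=64, pad_id=0):
    """Pad (or truncate) one list of token IDs and build its attention mask."""
    # Assumption: sequences longer than max_len are simply cut off.
    # Note that a plain cut-off also drops the trailing [SEP] token.
    ids = list(token_ids[:max_len])
    # The mask is 1 for real tokens and 0 for padding positions.
    mask = [1] * len(ids)
    n_pad = max_len - len(ids)
    ids.extend([pad_id] * n_pad)
    mask.extend([0] * n_pad)
    return ids, mask


# Token IDs taken from the tokenizer output printed above.
ids, mask = pad_and_mask([101, 2023, 2003, 1037, 6251, 102])
print(len(ids), sum(mask))
```

Building the mask from the sequence length, instead of testing `token_id > 0`, keeps it correct even if a real token were ever to share the padding ID.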


In [38]:
def predict_fn(input_data, model):
    device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
    model.to(device)
    model.eval()

    input_id, input_mask = input_data
    input_id = input_id.to(device)
    input_mask = input_mask.to(device)
    with torch.no_grad():
        # Disabled for local testing; on an Elastic Inference endpoint this call would be wrapped in:
        # with torch.jit.optimized_execution(True, {"target_device": "eia:0"}):
        return model(input_id, attention_mask=input_mask)[0]

In [39]:
model = model_fn('./')
input_data = (train_inputs, train_masks)

In [40]:
predict_fn(input_data, model)

tensor([[0.2675, 0.3289]])
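
The tensor returned by `predict_fn` holds one score per class rather than probabilities. Assuming these are unnormalized logits from a two-class classification head (the usual output of a fine-tuned BERT classifier, though this cell does not confirm it), a softmax turns them into probabilities. The sketch below uses plain Python so that it runs without PyTorch; `softmax` is a helper defined here, and the input values are copied from the output above.

```python
import math


def softmax(scores):
    """Convert a list of raw scores into probabilities that sum to 1."""
    # Subtracting the maximum improves numerical stability without changing the result.
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


logits = [0.2675, 0.3289]  # values printed by predict_fn above
probs = softmax(logits)
predicted_class = probs.index(max(probs))  # index of the most likely class
print(probs, predicted_class)
```

With PyTorch available, `torch.softmax(output, dim=1)` performs the same computation on the tensor directly.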

In [None]:
import deploy_ei
import json

model = deploy_ei.model_fn('../')
b = json.dumps('this is a test sentence')

data, mask = deploy_ei.input_fn(b, 'application/json') 

output = deploy_ei.predict_fn((data, mask), model)
print(output)

In [36]:
import inspect
print(inspect.getsource(torch.jit.optimized_execution))

@contextlib.contextmanager
def optimized_execution(should_optimize):
    """
    A context manager that controls whether the JIT's executor will run
    optimizations before executing a function.
    """
    stored_flag = torch._C._get_graph_executor_optimize()
    torch._C._set_graph_executor_optimize(should_optimize)
    try:
        yield
    finally:
        torch._C._set_graph_executor_optimize(stored_flag)



# Cleanup

Finally, delete the Amazon SageMaker endpoint when you are done to avoid ongoing charges:

In [None]:
predictor.delete_endpoint()
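
One way to make sure the endpoint is deleted even if a prediction call raises an exception is to wrap its use in a context manager. This is a minimal sketch under assumptions: `endpoint_cleanup` is a hypothetical helper, not part of the SageMaker Python SDK, and it relies only on the object exposing the `delete_endpoint()` method used above. A stand-in class replaces the real predictor so that the example runs without AWS access.

```python
import contextlib


@contextlib.contextmanager
def endpoint_cleanup(predictor):
    """Yield the predictor and always call delete_endpoint() afterwards."""
    try:
        yield predictor
    finally:
        # Runs on normal exit and on exceptions, so the endpoint is not left running.
        predictor.delete_endpoint()


class FakePredictor:
    """Stand-in for a deployed predictor (illustration only)."""

    def __init__(self):
        self.deleted = False

    def predict(self, text):
        return [[0.0, 0.0]]  # placeholder, not real model output

    def delete_endpoint(self):
        self.deleted = True


fake = FakePredictor()
with endpoint_cleanup(fake) as p:
    p.predict("Please remember to delete me when you are done")
print(fake.deleted)
```

In a real notebook, the object passed to `endpoint_cleanup` would be the deployed `predictor` instead of `FakePredictor`.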