# Compiling and Deploying HuggingFace Pretrained BERT
Starting from `torch-neuron==1.0.1386.0`, the AWS Neuron PyTorch compilation API `torch.neuron.trace` can assign unsupported `aten` operators to run on the CPU. This tutorial demonstrates that capability on HuggingFace's BERT-base.

### Install dependencies
This tutorial depends on `torch-neuron>=1.0.1386.0`, `neuron-cc>=1.0.16861.0`, and HuggingFace's `transformers` package. You may install them with `pip`.
```bash
python3 -m pip install torch-neuron "neuron-cc[tensorflow]" transformers --upgrade --extra-index-url=https://pip.repos.neuron.amazonaws.com
```
For simplicity, we recommend installing all of these dependencies on a single `inf1` instance. The compiler can also cross-compile for `inf1` on a CPU-only machine, so you may run the compilation step on an existing EC2 instance or on a local machine running Linux.
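Before compiling, it can help to confirm that the installed packages meet the minimum versions listed above. The following is a minimal sketch, assuming Python 3.8 or later (for `importlib.metadata`); the helper names `parse_version`, `meets_minimum`, and `check_requirements` are hypothetical and not part of any Neuron package.

```python
import re
from importlib import metadata

# Minimum versions taken from this tutorial's dependency list.
MINIMUM_VERSIONS = {
    "torch-neuron": "1.0.1386.0",
    "neuron-cc": "1.0.16861.0",
}


def parse_version(version):
    """Turn '1.0.1386.0' into (1, 0, 1386, 0), ignoring any '+local' suffix."""
    public_part = version.split("+")[0]
    return tuple(int(piece) for piece in re.findall(r"\d+", public_part))


def meets_minimum(installed, minimum):
    """Return True if the installed version is at least the minimum version."""
    return parse_version(installed) >= parse_version(minimum)


def check_requirements(requirements=MINIMUM_VERSIONS):
    """Return {package: (installed_version_or_None, ok)} for each requirement."""
    report = {}
    for package, minimum in requirements.items():
        try:
            installed = metadata.version(package)
        except metadata.PackageNotFoundError:
            report[package] = (None, False)
            continue
        report[package] = (installed, meets_minimum(installed, minimum))
    return report


if __name__ == "__main__":
    for package, (installed, ok) in check_requirements().items():
        status = "ok" if ok else "missing or too old"
        print("{}: {} ({})".format(package, installed, status))
```

The comparison is purely numeric, which is enough for dotted version strings like the ones above but is not a full implementation of Python's version-ordering rules.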

### Compile a model into an AWS Neuron optimized TorchScript

To compile, call `torch.neuron.trace` with the model and a tuple of example inputs; it returns a TorchScript module that you can run and save.

In [None]:
# You may save the content of this cell as compile_bert.py and run it with python3.
import tensorflow  # imported first to work around a protobuf version conflict
import torch
import torch.neuron
from transformers import AutoTokenizer, AutoModelForSequenceClassification


# Build tokenizer and model
tokenizer = AutoTokenizer.from_pretrained("bert-base-cased-finetuned-mrpc")
model = AutoModelForSequenceClassification.from_pretrained("bert-base-cased-finetuned-mrpc")

# Set up some example inputs
sequence_0 = "The company HuggingFace is based in New York City"
sequence_1 = "Apples are especially bad for your health"
sequence_2 = "HuggingFace's headquarters are situated in Manhattan"
paraphrase = tokenizer.encode_plus(sequence_0, sequence_2, max_length=128, pad_to_max_length=True, return_tensors="pt")
not_paraphrase = tokenizer.encode_plus(sequence_0, sequence_1, max_length=128, pad_to_max_length=True, return_tensors="pt")

# Run the original PyTorch model on both example inputs
paraphrase_classification_logits = model(**paraphrase)[0]
not_paraphrase_classification_logits = model(**not_paraphrase)[0]

# Convert example inputs to a format that is compatible with TorchScript tracing
example_inputs_paraphrase = paraphrase['input_ids'], paraphrase['attention_mask'], paraphrase['token_type_ids']
example_inputs_not_paraphrase = not_paraphrase['input_ids'], not_paraphrase['attention_mask'], not_paraphrase['token_type_ids']

# Run torch.neuron.trace to generate a TorchScript that is optimized by AWS Neuron, using optimization level -O2
model_neuron = torch.neuron.trace(model, example_inputs_paraphrase, compiler_args=['-O2'])

# Verify the TorchScript works on both example inputs
paraphrase_classification_logits_neuron = model_neuron(*example_inputs_paraphrase)
not_paraphrase_classification_logits_neuron = model_neuron(*example_inputs_not_paraphrase)

# Save the TorchScript for later use
model_neuron.save('bert_neuron.pt')
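
The script above computes logits with both the original model and the compiled model but does not compare them. The following is a minimal sketch of one way to do so, written against plain Python lists so it runs without any Neuron hardware; the helper names `max_abs_diff` and `same_prediction` are hypothetical.

```python
def max_abs_diff(reference, candidate):
    """Largest element-wise absolute difference between two logit lists."""
    if len(reference) != len(candidate):
        raise ValueError("logit lists must have the same length")
    return max((abs(r - c) for r, c in zip(reference, candidate)), default=0.0)


def argmax(values):
    """Index of the largest value (first one wins on ties)."""
    return max(range(len(values)), key=lambda index: values[index])


def same_prediction(reference, candidate):
    """True if both logit lists select the same class."""
    return argmax(reference) == argmax(candidate)
```

Assuming the variables from `compile_bert.py`, the inputs could be obtained with `paraphrase_classification_logits.detach()[0].tolist()` for the original model and `paraphrase_classification_logits_neuron[0][0].tolist()` for the compiled one. Small numerical differences between the two are to be expected; whether a given difference is acceptable depends on your application.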

The example above uses BERT-base. HuggingFace's full list of pretrained BERT models is in the BERT section of https://huggingface.co/transformers/pretrained_models.html.

You may inspect `model_neuron.graph` to see which parts run on the CPU and which run on the accelerator. All native `aten` operators that remain in the graph run on the CPU.

In [None]:
print(model_neuron.graph)
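
For a large model the printed graph is long, so a quick tally of operators by namespace can make the CPU/accelerator split easier to see. The following is a rough sketch that works on the graph's text form, `str(model_neuron.graph)`. The helper `count_ops_by_namespace` is hypothetical, and the sample graph, including the operator name `neuron::forward_v2`, is made up for illustration; the actual names in your graph may differ by `torch-neuron` version.

```python
import re
from collections import Counter


def count_ops_by_namespace(graph_text):
    """Count operators in a TorchScript graph dump, grouped by namespace.

    Matches the 'namespace::op' name that follows '=' on each node line.
    """
    op_names = re.findall(r"=\s*(\w+::\w+)", graph_text)
    return Counter(name.split("::")[0] for name in op_names)


# Hypothetical graph text, shaped like a TorchScript graph printout.
SAMPLE_GRAPH = """
graph(%self : __torch__.M, %x : Tensor):
  %1 : int = prim::Constant[value=1]()
  %2 : Tensor = aten::add(%x, %x, %1)
  %3 : Tensor = neuron::forward_v2(%2)
  return (%3)
"""

if __name__ == "__main__":
    for namespace, count in sorted(count_ops_by_namespace(SAMPLE_GRAPH).items()):
        print("{}: {}".format(namespace, count))
```

With a real model you would call `count_ops_by_namespace(str(model_neuron.graph))`; the `aten` count then approximates how many operators were left on the CPU.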

If you compiled on a different machine, don't forget to copy the saved TorchScript `bert_neuron.pt` to your `inf1` instance.

### Deploy the AWS Neuron optimized TorchScript on an `inf1` instance

To deploy the AWS Neuron optimized TorchScript on `inf1` instances, you can load the saved TorchScript from disk and skip the slow compilation step. Make sure you have both the pip package `torch-neuron>=1.0.1386.0` and the Debian/RPM package `aws-neuron-runtime` installed. The installation guide for `aws-neuron-runtime` is at https://github.com/aws/aws-neuron-sdk/blob/master/docs/neuron-runtime/nrt_start.md.

In [None]:
# You may save the content of this cell as run_bert.py and run it with python3.
import torch
import torch.neuron
from transformers import AutoTokenizer


# Build tokenizer
tokenizer = AutoTokenizer.from_pretrained("bert-base-cased-finetuned-mrpc")

# Set up some example inputs
sequence_0 = "The company HuggingFace is based in New York City"
sequence_1 = "Apples are especially bad for your health"
sequence_2 = "HuggingFace's headquarters are situated in Manhattan"
paraphrase = tokenizer.encode_plus(sequence_0, sequence_2, max_length=128, pad_to_max_length=True, return_tensors="pt")
not_paraphrase = tokenizer.encode_plus(sequence_0, sequence_1, max_length=128, pad_to_max_length=True, return_tensors="pt")

# Convert example inputs to the positional tuple format that was used when the model was traced
example_inputs_paraphrase = paraphrase['input_ids'], paraphrase['attention_mask'], paraphrase['token_type_ids']
example_inputs_not_paraphrase = not_paraphrase['input_ids'], not_paraphrase['attention_mask'], not_paraphrase['token_type_ids']

# Load TorchScript back
model_neuron = torch.jit.load('bert_neuron.pt')

# Verify the TorchScript works on both example inputs
paraphrase_classification_logits_neuron = model_neuron(*example_inputs_paraphrase)
not_paraphrase_classification_logits_neuron = model_neuron(*example_inputs_not_paraphrase)
classes = ['not paraphrase', 'paraphrase']
paraphrase_prediction = paraphrase_classification_logits_neuron[0][0].argmax().item()
not_paraphrase_prediction = not_paraphrase_classification_logits_neuron[0][0].argmax().item()
print('BERT says that "{}" and "{}" are {}'.format(sequence_0, sequence_2, classes[paraphrase_prediction]))
print('BERT says that "{}" and "{}" are {}'.format(sequence_0, sequence_1, classes[not_paraphrase_prediction]))
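
The script above reports only the winning class. If a confidence value is also useful, the logits can be converted to probabilities with a softmax. The following is a minimal pure-Python sketch; the helper names `softmax` and `describe_prediction` are hypothetical, and the logit values in the demo are invented rather than real model output.

```python
import math


def softmax(logits):
    """Convert a list of logits to probabilities that sum to 1.

    Subtracting the maximum first keeps math.exp from overflowing.
    """
    largest = max(logits)
    exps = [math.exp(value - largest) for value in logits]
    total = sum(exps)
    return [value / total for value in exps]


def describe_prediction(logits, classes):
    """Return (label, probability) for the highest-scoring class."""
    probabilities = softmax(logits)
    best = max(range(len(probabilities)), key=lambda index: probabilities[index])
    return classes[best], probabilities[best]


if __name__ == "__main__":
    classes = ['not paraphrase', 'paraphrase']
    # Invented logits for illustration only.
    label, probability = describe_prediction([-1.0, 2.0], classes)
    print('{} (p={:.3f})'.format(label, probability))
```

Assuming the variables from `run_bert.py`, the first argument could be `paraphrase_classification_logits_neuron[0][0].tolist()`.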