# Compiling and Deploying HuggingFace Pretrained BERT

### Prerequisites


Before running the cells below, verify that this Jupyter notebook is using the “conda_aws_neuron_pytorch_p36” kernel. You can select the kernel from the “Kernel -> Change Kernel” menu at the top of this notebook page.

### Compile the model into an AWS Neuron optimized TorchScript

This step is done by calling `torch.neuron.trace` on the model together with a set of example inputs.

In [1]:
# You may save the content of this cell as compile_bert.py and run it with python3.
import tensorflow  # imported first to work around a protobuf version conflict
import torch
import torch.neuron
from transformers import AutoTokenizer, AutoModelForSequenceClassification, AutoConfig
import transformers

# Build tokenizer and model
tokenizer = AutoTokenizer.from_pretrained("bert-base-cased-finetuned-mrpc")

model = None

if transformers.__version__.startswith("4."):
    model = AutoModelForSequenceClassification.from_pretrained("bert-base-cased-finetuned-mrpc", return_dict=False)
else:
    model = AutoModelForSequenceClassification.from_pretrained("bert-base-cased-finetuned-mrpc")

# Setup some example inputs
sequence_0 = "The company HuggingFace is based in New York City"
sequence_1 = "Apples are especially bad for your health"
sequence_2 = "HuggingFace's headquarters are situated in Manhattan"

max_length=128
paraphrase = tokenizer.encode_plus(sequence_0, sequence_2, max_length=max_length, padding='max_length', truncation=True, return_tensors="pt")
not_paraphrase = tokenizer.encode_plus(sequence_0, sequence_1, max_length=max_length, padding='max_length', truncation=True, return_tensors="pt")

# Run the original PyTorch model on the compilation example
paraphrase_classification_logits = model(**paraphrase)[0]

# Convert example inputs to a format that is compatible with TorchScript tracing
example_inputs_paraphrase = paraphrase['input_ids'], paraphrase['attention_mask'], paraphrase['token_type_ids']
example_inputs_not_paraphrase = not_paraphrase['input_ids'], not_paraphrase['attention_mask'], not_paraphrase['token_type_ids']

# Run torch.neuron.trace to generate a TorchScript that is optimized by AWS Neuron
model_neuron = torch.neuron.trace(model, example_inputs_paraphrase)

# Verify the TorchScript works on both example inputs
paraphrase_classification_logits_neuron = model_neuron(*example_inputs_paraphrase)
not_paraphrase_classification_logits_neuron = model_neuron(*example_inputs_not_paraphrase)

# Save the TorchScript for later use
model_neuron.save('bert_neuron.pt')

INFO:Neuron:There are 3 ops of 1 different types in the TorchScript that are not compiled by neuron-cc: aten::embedding, (For more information see https://github.com/aws/aws-neuron-sdk/blob/master/release-notes/neuron-cc-ops/neuron-cc-ops-pytorch.md)
INFO:Neuron:Number of arithmetic operators (pre-compilation) before = 714, fused = 694, percent fused = 97.2%


Instructions for updating:
Use tf.where in 2.0, which has the same broadcast rule as np.where


INFO:Neuron:Compiling function _NeuronGraph$661 with neuron-cc
INFO:Neuron:Compiling with command line: '/home/ubuntu/anaconda3/envs/aws_neuron_pytorch_p36/bin/neuron-cc compile /tmp/tmpb4ha43x7/graph_def.pb --framework TENSORFLOW --pipeline compile SaveTemps --output /tmp/tmpb4ha43x7/graph_def.neff --io-config {"inputs": {"0:0": [[1, 128, 768], "float32"], "1:0": [[1, 1, 1, 128], "float32"]}, "outputs": ["Add_136:0"]} --verbose 35'
INFO:Neuron:Successfully embedded GraphDef into MetaNeff for _NeuronGraph#62
Tensor output are ** NOT CALCULATED ** during CPU execution and only indicate tensor shape (Triggered internally at  /opt/workspace/KaenaPyTorchRuntime/neuron_op/neuron_op_impl.cpp:73.)
  outs = wrap_retval(mod(*_clone_inputs(inputs)))

The above example uses BERT-base. A full list of HuggingFace's pretrained BERT models can be found in the BERT section of https://huggingface.co/transformers/pretrained_models.html.

You may inspect `model_neuron.graph` to see which parts run on the CPU and which run on the accelerator. All native `aten` operators in the graph run on the CPU.

In [2]:
print(model_neuron.graph)

graph(%self : __torch__.torch_neuron.convert.AwsNeuronGraphModule,
      %tensor.4 : Long(1:128, 128:1, requires_grad=0, device=cpu),
      %tensor.1 : Long(1:128, 128:1, requires_grad=0, device=cpu),
      %4 : Long(1:128, 128:1, requires_grad=0, device=cpu)):
  %11 : int = prim::Constant[value=0]() # /home/ubuntu/anaconda3/envs/aws_neuron_pytorch_p36/lib/python3.6/site-packages/torch_neuron/native_ops/aten.py:34:0
  %12 : int = prim::Constant[value=0]() # /home/ubuntu/anaconda3/envs/aws_neuron_pytorch_p36/lib/python3.6/site-packages/torch_neuron/native_ops/aten.py:34:0
  %13 : int = prim::Constant[value=9223372036854775807]() # /home/ubuntu/anaconda3/envs/aws_neuron_pytorch_p36/lib/python3.6/site-packages/torch_neuron/native_ops/aten.py:34:0
  %14 : int = prim::Constant[value=1]() # /home/ubuntu/anaconda3/envs/aws_neuron_pytorch_p36/lib/python3.6/site-packages/torch_neuron/native_ops/aten.py:34:0
  %15 : Long(1:128, 128:1, requires_grad=0, device=cpu) = aten::slice(%tensor.1, %11, %1
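
The graph dump can be long, so it may help to summarize which `aten` operators remain on the CPU. The following is a minimal sketch that counts `aten::` operator names in the graph text using only the standard library. The helper `count_aten_ops` and the abridged `sample_graph` string are hypothetical illustrations; in the notebook you would pass `str(model_neuron.graph)` instead.

```python
import re
from collections import Counter

def count_aten_ops(graph_text):
    """Count each aten:: operator name appearing in a TorchScript graph dump."""
    return Counter(re.findall(r"aten::(\w+)", graph_text))

# Hypothetical, abridged graph text for illustration only; in the notebook
# you would pass str(model_neuron.graph) instead.
sample_graph = """
  %14 : int = prim::Constant[value=1]()
  %15 : Tensor = aten::slice(%tensor.1, %11, %12, %13, %14)
  %16 : Tensor = aten::embedding(%weight_a, %15)
  %17 : Tensor = aten::embedding(%weight_b, %4)
"""

aten_ops = count_aten_ops(sample_graph)
for name, n in sorted(aten_ops.items()):
    print(f"aten::{name}: {n}")
```

The pattern only matches the `aten::` prefix, so file paths such as `native_ops/aten.py` in the source annotations are not counted.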

Don't forget to copy your saved TorchScript `bert_neuron.pt` to your `inf1` instance.

### Deploy the AWS Neuron optimized TorchScript on an `inf1` instance

To deploy the AWS Neuron optimized TorchScript on an `inf1` instance, load the saved TorchScript from disk; this skips the slow compilation step.

In [3]:
# You may save the content of this cell as run_bert.py and run it with python3.
import torch
import torch.neuron
from transformers import AutoTokenizer

# Build tokenizer
tokenizer = AutoTokenizer.from_pretrained("bert-base-cased-finetuned-mrpc")

# Setup some example inputs
sequence_0 = "The company HuggingFace is based in New York City"
sequence_1 = "Apples are especially bad for your health"
sequence_2 = "HuggingFace's headquarters are situated in Manhattan"
paraphrase = tokenizer.encode_plus(sequence_0, sequence_2, max_length=128, padding='max_length', truncation=True, return_tensors="pt")
not_paraphrase = tokenizer.encode_plus(sequence_0, sequence_1, max_length=128, padding='max_length', truncation=True, return_tensors="pt")

# Convert example inputs to a format that is compatible with TorchScript tracing
example_inputs_paraphrase = paraphrase['input_ids'], paraphrase['attention_mask'], paraphrase['token_type_ids']
example_inputs_not_paraphrase = not_paraphrase['input_ids'], not_paraphrase['attention_mask'], not_paraphrase['token_type_ids']

# Load TorchScript back
model_neuron = torch.jit.load('bert_neuron.pt')

# Verify the TorchScript works on both example inputs
paraphrase_classification_logits_neuron = model_neuron(*example_inputs_paraphrase)
not_paraphrase_classification_logits_neuron = model_neuron(*example_inputs_not_paraphrase)
classes = ['not paraphrase', 'paraphrase']
paraphrase_prediction = paraphrase_classification_logits_neuron[0][0].argmax().item()
not_paraphrase_prediction = not_paraphrase_classification_logits_neuron[0][0].argmax().item()
print('BERT says that "{}" and "{}" are {}'.format(sequence_0, sequence_2, classes[paraphrase_prediction]))
print('BERT says that "{}" and "{}" are {}'.format(sequence_0, sequence_1, classes[not_paraphrase_prediction]))

BERT says that "The company HuggingFace is based in New York City" and "HuggingFace's headquarters are situated in Manhattan" are paraphrase
BERT says that "The company HuggingFace is based in New York City" and "Apples are especially bad for your health" are not paraphrase
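
The cell above picks the class with the largest logit. If a confidence value is also useful, the logits can be passed through a softmax. The sketch below does this in plain Python; the `softmax` helper and the `example_logits` values are hypothetical and are not real model output. In the notebook, the list could presumably be obtained with something like `paraphrase_classification_logits_neuron[0][0].tolist()`.

```python
import math

def softmax(logits):
    """Convert a list of raw logits into probabilities that sum to 1."""
    shift = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(x - shift) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

classes = ['not paraphrase', 'paraphrase']

# Hypothetical logits for one sentence pair (not real model output).
example_logits = [-0.5, 2.0]

probs = softmax(example_logits)
# Index of the largest probability, equivalent to argmax over the logits.
predicted = max(range(len(probs)), key=probs.__getitem__)
print('{} (p = {:.3f})'.format(classes[predicted], probs[predicted]))
```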


Now let's run the model in parallel on four cores.
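
As a rough picture of data parallelism, the sketch below splits one large batch into equal chunks, runs each chunk on its own worker, and concatenates the results in input order. It is only an illustration under stated assumptions: `fake_core_model` is a hypothetical stand-in for a per-core model, and this is not the code in `parallel.py`.

```python
from concurrent.futures import ThreadPoolExecutor

def split_into_chunks(batch, num_chunks):
    """Split a list into num_chunks equal consecutive chunks."""
    assert len(batch) % num_chunks == 0, "batch must divide evenly across cores"
    size = len(batch) // num_chunks
    return [batch[i * size:(i + 1) * size] for i in range(num_chunks)]

def fake_core_model(chunk):
    """Hypothetical stand-in for one per-core model: returns each item's length."""
    return [len(item) for item in chunk]

def run_data_parallel(batch, num_cores):
    """Run one chunk per worker thread and concatenate results in input order."""
    chunks = split_into_chunks(batch, num_cores)
    with ThreadPoolExecutor(max_workers=num_cores) as pool:
        results = list(pool.map(fake_core_model, chunks))  # map preserves order
    return [out for chunk_out in results for out in chunk_out]

sentences = ["a", "bb", "ccc", "dddd"]
outputs = run_data_parallel(sentences, num_cores=4)
print(outputs)
```

Because each worker needs a chunk of the same size, the total batch fed to the parallel model is `batch_size * num_cores`, which is why the cells below pad short batches up to that size.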

In [4]:
def get_input_with_padding(batch, batch_size, max_length):
    # Unpack the batch into three tensors; the DataLoader batches along the outer dimension
    encoded = batch['encoded']
    inputs = torch.squeeze(encoded['input_ids'], 1)
    attention = torch.squeeze(encoded['attention_mask'], 1)
    token_type = torch.squeeze(encoded['token_type_ids'], 1)
    quality = list(map(int, batch['quality']))

    if inputs.size()[0] != batch_size:
        print("Input size = {} - padding".format(inputs.size()))
        remainder = batch_size - inputs.size()[0]
        zeros = torch.zeros( [remainder, max_length], dtype=torch.long )
        inputs = torch.cat( [inputs, zeros] )
        attention = torch.cat( [attention, zeros] )
        token_type = torch.cat( [token_type, zeros] )

    assert(inputs.size()[0] == batch_size and inputs.size()[1] == max_length)
    assert(attention.size()[0] == batch_size and attention.size()[1] == max_length)
    assert(token_type.size()[0] == batch_size and token_type.size()[1] == max_length)

    return (inputs, attention, token_type), quality

def count(output, quality):
    # output may contain padded rows, so it can be longer than quality
    assert output.size(0) >= len(quality)
    correct_count = 0
    total = len(quality)

    batch_predictions = [ row.argmax().item() for row in output ]

    # zip stops at the shorter sequence, so padded rows are ignored
    for a, b in zip(batch_predictions, quality):
        if int(a)==int(b):
            correct_count += 1

    return correct_count, total
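
The two helpers above work together: a short final batch is padded with all-zero rows up to the expected size, and accuracy is counted only over the rows that have labels. The sketch below mirrors that logic with plain Python lists so it can be run without PyTorch. The names `pad_rows` and `count_correct` and all the values are hypothetical illustrations.

```python
def pad_rows(rows, batch_size, max_length):
    """Append all-zero rows until there are exactly batch_size rows."""
    remainder = batch_size - len(rows)
    return rows + [[0] * max_length for _ in range(remainder)]

def count_correct(predictions, labels):
    """Compare only the first len(labels) predictions; padded rows have no label."""
    correct = sum(int(p) == int(t) for p, t in zip(predictions, labels))
    return correct, len(labels)

# Hypothetical last batch: 2 real rows where 4 are expected.
rows = [[101, 7, 102], [101, 9, 102]]
padded = pad_rows(rows, batch_size=4, max_length=3)

# Hypothetical predictions for all 4 rows; only the first 2 have labels.
predictions = [1, 0, 0, 0]
labels = [1, 1]
print(len(padded), count_correct(predictions, labels))
```

Predictions for the padded rows are still produced by the model, but they never reach the accuracy count because `zip` stops at the end of the label list.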

In [5]:
from parallel import NeuronSimpleDataParallel
from bert_benchmark_utils import BertTestDataset, BertResults
import time

max_length = 128
num_cores = 4
batch_size = 1

tsv_file="glue_mrpc_dev.tsv"

data_set = BertTestDataset( tsv_file=tsv_file, tokenizer=tokenizer, max_length=max_length )
data_loader = torch.utils.data.DataLoader(data_set, batch_size=batch_size*num_cores, shuffle=True, num_workers=2)

# Create a model that will run parallel inferences on each core (code in parallel.py)
parallel_neuron_model = NeuronSimpleDataParallel('bert_neuron.pt', num_cores)

# Warm all cores
z = torch.zeros( [num_cores * batch_size, max_length], dtype=torch.long )
batch = (z, z, z)
parallel_neuron_model(*batch)

# Result aggregation class (code in bert_benchmark_utils.py)
results = BertResults(batch_size, num_cores)

for _ in range(5):
    for batch in data_loader:
        batch, quality = get_input_with_padding(batch, batch_size * num_cores, max_length)

        start = time.time()
        output = parallel_neuron_model(*batch)
        end = time.time()
        elapsed = end - start

        correct_count, inference_count = count(output, quality)
        results.add_result( correct_count, inference_count, [elapsed], [end], elapsed )

with open("benchmark.txt", "w") as f:
    results.report(f, bins=60)

with open("benchmark.txt", "r") as f:
    for line in f:
        print(line)



Histogram throughput (UTC times):

===

23:03:42.008 - 23:03:42.183 => 229 sentences/sec

23:03:42.183 - 23:03:42.357 => 229 sentences/sec

23:03:42.357 - 23:03:42.532 => 206 sentences/sec

23:03:42.532 - 23:03:42.707 => 229 sentences/sec

23:03:42.707 - 23:03:42.881 => 206 sentences/sec

23:03:42.881 - 23:03:43.056 => 229 sentences/sec

23:03:43.056 - 23:03:43.231 => 206 sentences/sec

23:03:43.231 - 23:03:43.405 => 206 sentences/sec

23:03:43.405 - 23:03:43.580 => 229 sentences/sec

23:03:43.580 - 23:03:43.754 => 206 sentences/sec

23:03:43.754 - 23:03:43.929 => 160 sentences/sec

23:03:43.929 - 23:03:44.104 => 137 sentences/sec

23:03:44.104 - 23:03:44.278 => 206 sentences/sec

23:03:44.278 - 23:03:44.453 => 206 sentences/sec

23:03:44.453 - 23:03:44.628 => 206 sentences/sec

23:03:44.628 - 23:03:44.802 => 183 sentences/sec

23:03:44.802 - 23:03:44.977 => 206 sentences/sec

23:03:44.977 - 23:03:45.152 => 206 sentences/sec

23:03:45.152 - 23:03:45.326 => 206 sentences/sec

23:03:45
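
To reduce a histogram like this to a few summary numbers, the report text can be parsed with a regular expression. The sketch below assumes the line format shown in the output above; the helper `parse_throughputs` is hypothetical, and the sample lines are the first five bins copied from that output. In the notebook, the text could be read from `benchmark.txt` instead.

```python
import re

def parse_throughputs(report_text):
    """Extract the sentences/sec figure from each histogram line."""
    return [int(m) for m in re.findall(r"=>\s*(\d+)\s+sentences/sec", report_text)]

# First five histogram lines copied from the output above.
report = """
23:03:42.008 - 23:03:42.183 => 229 sentences/sec
23:03:42.183 - 23:03:42.357 => 229 sentences/sec
23:03:42.357 - 23:03:42.532 => 206 sentences/sec
23:03:42.532 - 23:03:42.707 => 229 sentences/sec
23:03:42.707 - 23:03:42.881 => 206 sentences/sec
"""

rates = parse_throughputs(report)
mean_rate = sum(rates) / len(rates)
print("bins: {}, mean: {:.1f}, min: {}, max: {}".format(
    len(rates), mean_rate, min(rates), max(rates)))
```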

Now recompile with a larger batch size of six sentence pairs.

In [6]:
batch_size = 6

example_inputs_paraphrase = (
    torch.cat([paraphrase['input_ids']] * batch_size,0), 
    torch.cat([paraphrase['attention_mask']] * batch_size,0), 
    torch.cat([paraphrase['token_type_ids']] * batch_size,0)
)

# Run torch.neuron.trace to generate a TorchScript that is optimized by AWS Neuron
model_neuron_batch = torch.neuron.trace(model, example_inputs_paraphrase)

## Save the batched model
model_neuron_batch.save('bert_neuron_b{}.pt'.format(batch_size))

INFO:Neuron:There are 3 ops of 1 different types in the TorchScript that are not compiled by neuron-cc: aten::embedding, (For more information see https://github.com/aws/aws-neuron-sdk/blob/master/release-notes/neuron-cc-ops/neuron-cc-ops-pytorch.md)
INFO:Neuron:Number of arithmetic operators (pre-compilation) before = 714, fused = 694, percent fused = 97.2%
INFO:Neuron:Compiling function _NeuronGraph$1324 with neuron-cc
INFO:Neuron:Compiling with command line: '/home/ubuntu/anaconda3/envs/aws_neuron_pytorch_p36/bin/neuron-cc compile /tmp/tmp9slz5gra/graph_def.pb --framework TENSORFLOW --pipeline compile SaveTemps --output /tmp/tmp9slz5gra/graph_def.neff --io-config {"inputs": {"0:0": [[6, 128, 768], "float32"], "1:0": [[6, 1, 1, 128], "float32"]}, "outputs": ["Add_136:0"]} --verbose 35'
INFO:Neuron:Successfully embedded GraphDef into MetaNeff for _NeuronGraph#62
INFO:Neuron:Number of arithmetic operators (post-compilation) before = 714, compiled = 694, percent compiled = 97.2%
INFO:Ne

Rerun inference with a batch size of 6.

In [7]:
from parallel import NeuronSimpleDataParallel
from bert_benchmark_utils import BertTestDataset, BertResults
import time

batch_size = 6
num_cores = 4

tsv_file="glue_mrpc_dev.tsv"

data_set = BertTestDataset( tsv_file=tsv_file, tokenizer=tokenizer, max_length=max_length )
data_loader = torch.utils.data.DataLoader(data_set, batch_size=batch_size*num_cores, shuffle=True, num_workers=2)

# Create a model that will run parallel inferences on each core (code in parallel.py)
parallel_neuron_model = NeuronSimpleDataParallel('bert_neuron_b{}.pt'.format(batch_size), num_cores, batch_size)

# Warm all cores
z = torch.zeros( [num_cores * batch_size, max_length], dtype=torch.long )
batch = (z, z, z)
parallel_neuron_model(*batch)

# Result aggregation class (code in bert_benchmark_utils.py)
results = BertResults(batch_size, num_cores)

for _ in range(10):
    for batch in data_loader:
        batch, quality = get_input_with_padding(batch, batch_size * num_cores, max_length)

        start = time.time()
        output = parallel_neuron_model(*batch)
        end = time.time()
        elapsed = end - start

        correct_count, inference_count = count(output, quality)
        results.add_result( correct_count, inference_count, [elapsed], [end], elapsed )

with open("benchmark_b{}.txt".format(batch_size), "w") as f:
    results.report(f, bins=60)

with open("benchmark_b{}.txt".format(batch_size), "r") as f:
    for line in f:
        print(line)



Histogram throughput (UTC times):

===

23:30:26.213 - 23:30:26.325 => 860 sentences/sec

23:30:26.325 - 23:30:26.437 => 860 sentences/sec

23:30:26.437 - 23:30:26.548 => 860 sentences/sec

23:30:26.548 - 23:30:26.660 => 860 sentences/sec

23:30:26.660 - 23:30:26.771 => 215 sentences/sec

23:30:26.771 - 23:30:26.883 => 0 sentences/sec

23:30:26.883 - 23:30:26.994 => 860 sentences/sec

23:30:26.994 - 23:30:27.106 => 860 sentences/sec

23:30:27.106 - 23:30:27.217 => 860 sentences/sec

23:30:27.217 - 23:30:27.329 => 860 sentences/sec

23:30:27.329 - 23:30:27.440 => 215 sentences/sec

23:30:27.440 - 23:30:27.552 => 0 sentences/sec

23:30:27.552 - 23:30:27.663 => 645 sentences/sec

23:30:27.663 - 23:30:27.775 => 860 sentences/sec

23:30:27.775 - 23:30:27.886 => 860 sentences/sec

23:30:27.886 - 23:30:27.998 => 860 sentences/sec

23:30:27.998 - 23:30:28.109 => 430 sentences/sec

23:30:28.109 - 23:30:28.221 => 0 sentences/sec

23:30:28.221 - 23:30:28.332 => 430 sentences/sec

23:30:28.332 -