# Compiling and Deploying HuggingFace Pretrained BERT



### Introduction

In this tutorial we will compile and deploy the BERT-base version of HuggingFace BERT for AWS Inferentia. The full list of HuggingFace's pretrained BERT models can be found in the BERT section of https://huggingface.co/transformers/pretrained_models.html.

This Jupyter notebook should be run on an inf1.6xlarge instance or larger. Only the compilation step needs an instance of this size; inference itself does not. For simplicity we will run the whole tutorial on an inf1.6xlarge, but in a real deployment the compilation should be done on a compute instance and the deployment on an inf1 instance to save costs.

Before running the cells below, verify that this Jupyter notebook is using the “conda_aws_neuron_pytorch_p36” kernel. You can select the kernel from the “Kernel -> Change Kernel” menu at the top of this Jupyter notebook page.

### Install Dependencies:
This tutorial depends on several packages, including torch-neuron, neuron-cc, and HuggingFace's transformers, which are part of the conda environment.
The following cell installs the required transformers version.

In [None]:
!python -m pip install -U "transformers==4.0"

### Compile the model into an AWS Neuron optimized TorchScript


In [None]:
import tensorflow  # imported first to work around a protobuf version conflict
import torch
import torch.neuron
from transformers import AutoTokenizer, AutoModelForSequenceClassification, AutoConfig
import transformers

# Build tokenizer and model
tokenizer = AutoTokenizer.from_pretrained("bert-base-cased-finetuned-mrpc")
model = AutoModelForSequenceClassification.from_pretrained("bert-base-cased-finetuned-mrpc", return_dict=False)

# Setup some example inputs
sequence_0 = "The company HuggingFace is based in New York City"
sequence_1 = "Apples are especially bad for your health"
sequence_2 = "HuggingFace's headquarters are situated in Manhattan"

max_length=128
paraphrase = tokenizer.encode_plus(sequence_0, sequence_2, max_length=max_length, padding='max_length', truncation=True, return_tensors="pt")
not_paraphrase = tokenizer.encode_plus(sequence_0, sequence_1, max_length=max_length, padding='max_length', truncation=True, return_tensors="pt")

# Run the original PyTorch model on the compilation example
paraphrase_classification_logits = model(**paraphrase)[0]

# Convert example inputs to a format that is compatible with TorchScript tracing
example_inputs_paraphrase = paraphrase['input_ids'], paraphrase['attention_mask'], paraphrase['token_type_ids']
example_inputs_not_paraphrase = not_paraphrase['input_ids'], not_paraphrase['attention_mask'], not_paraphrase['token_type_ids']

# Run torch.neuron.trace to generate a TorchScript that is optimized by AWS Neuron
model_neuron = torch.neuron.trace(model, example_inputs_paraphrase)

# Verify the TorchScript works on both example inputs
paraphrase_classification_logits_neuron = model_neuron(*example_inputs_paraphrase)
not_paraphrase_classification_logits_neuron = model_neuron(*example_inputs_not_paraphrase)

# Save the TorchScript for later use
model_neuron.save('bert_neuron.pt')
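
The cell above tokenizes with `padding='max_length'` and `truncation=True`, so every example has exactly `max_length` token positions and the traced model always sees the same input shape. As a rough illustration of what that padding and truncation mean, here is a minimal pure-Python sketch. The helper `pad_or_truncate` and the token ids are hypothetical, and unlike the real tokenizer it does not handle special tokens such as `[CLS]` and `[SEP]`.

```python
def pad_or_truncate(token_ids, max_length, pad_id=0):
    """Return (input_ids, attention_mask), each exactly max_length long."""
    ids = list(token_ids)[:max_length]   # truncate sequences that are too long
    mask = [1] * len(ids)                # 1 marks a real token
    padding = max_length - len(ids)
    ids += [pad_id] * padding            # pad sequences that are too short
    mask += [0] * padding                # 0 marks a padding position
    return ids, mask

# Hypothetical token ids, chosen only for illustration
short_ids, short_mask = pad_or_truncate([101, 7, 8, 102], max_length=8)
long_ids, long_mask = pad_or_truncate(range(20), max_length=8)

print(short_ids, short_mask)
print(len(long_ids), sum(long_mask))
```

Both results have length 8 regardless of the original sequence length, which is the property the fixed-shape compiled model relies on.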

You may inspect `model_neuron.graph` to see which parts run on the CPU and which run on the accelerator. All native `aten` operators in the graph run on the CPU.

In [None]:
print(model_neuron.graph)
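
The printed graph can be long, so it may help to summarize it by operator namespace instead of reading it line by line. The sketch below counts operators such as `aten::...` in a graph dump. The helper `count_op_namespaces` is hypothetical, and the sample graph text is made up for illustration; the operator names in your real graph will differ.

```python
import re
from collections import Counter

def count_op_namespaces(graph_text):
    """Count operators in a TorchScript graph dump by namespace (aten, prim, ...)."""
    return Counter(re.findall(r"\b([A-Za-z_]\w*)::\w+", graph_text))

# Made-up graph excerpt for illustration only; real output will look different.
sample_graph = """
%1 : Tensor = aten::embedding(%weight, %input_ids)
%2 : Tensor = neuron::forward_v2(%1, %model)
%3 : (Tensor) = prim::TupleConstruct(%2)
"""

namespaces = count_op_namespaces(sample_graph)
print(namespaces)
```

In the notebook you could pass `str(model_neuron.graph)` to the helper. A large `aten` count would suggest that many operators were left on the CPU.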



### Deploy the AWS Neuron optimized TorchScript

To deploy the AWS Neuron optimized TorchScript, you may choose to load the saved TorchScript from disk and skip the slow compilation.

In [None]:
# Load TorchScript back
model_neuron = torch.jit.load('bert_neuron.pt')
# Verify the TorchScript works on both example inputs
paraphrase_classification_logits_neuron = model_neuron(*example_inputs_paraphrase)
not_paraphrase_classification_logits_neuron = model_neuron(*example_inputs_not_paraphrase)
classes = ['not paraphrase', 'paraphrase']
paraphrase_prediction = paraphrase_classification_logits_neuron[0][0].argmax().item()
not_paraphrase_prediction = not_paraphrase_classification_logits_neuron[0][0].argmax().item()
print('BERT says that "{}" and "{}" are {}'.format(sequence_0, sequence_2, classes[paraphrase_prediction]))
print('BERT says that "{}" and "{}" are {}'.format(sequence_0, sequence_1, classes[not_paraphrase_prediction]))
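
The cell above picks the class with `argmax` over the two logits. If you also want a confidence value, a softmax turns the logits into probabilities. The following is a minimal pure-Python sketch; the logit values are made up and are not output from the model.

```python
import math

def softmax(logits):
    """Convert a list of logits into probabilities that sum to 1."""
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]  # subtract the max for numerical stability
    total = sum(exps)
    return [e / total for e in exps]

classes = ['not paraphrase', 'paraphrase']
example_logits = [-0.5, 2.0]  # made-up values for illustration

probs = softmax(example_logits)
best = max(range(len(probs)), key=probs.__getitem__)  # same index argmax would return
print('{} (p={:.3f})'.format(classes[best], probs[best]))
```

With PyTorch tensors, `torch.softmax(logits, dim=-1)` computes the same quantity directly.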

Now let's run the model in parallel on four cores.

In [None]:
def get_input_with_padding(batch, batch_size, max_length):
    ## Reformulate the batch into three batch tensors - default batch size batches the outer dimension
    encoded = batch['encoded']
    inputs = torch.squeeze(encoded['input_ids'], 1)
    attention = torch.squeeze(encoded['attention_mask'], 1)
    token_type = torch.squeeze(encoded['token_type_ids'], 1)
    quality = list(map(int, batch['quality']))

    if inputs.size()[0] != batch_size:
        print("Input size = {} - padding".format(inputs.size()))
        remainder = batch_size - inputs.size()[0]
        zeros = torch.zeros( [remainder, max_length], dtype=torch.long )
        inputs = torch.cat( [inputs, zeros] )
        attention = torch.cat( [attention, zeros] )
        token_type = torch.cat( [token_type, zeros] )

    assert(inputs.size()[0] == batch_size and inputs.size()[1] == max_length)
    assert(attention.size()[0] == batch_size and attention.size()[1] == max_length)
    assert(token_type.size()[0] == batch_size and token_type.size()[1] == max_length)

    return (inputs, attention, token_type), quality

def count(output, quality):
    assert output.size(0) >= len(quality)
    correct_count = 0
    count = len(quality)
    
    batch_predictions = [ row.argmax().item() for row in output ]

    for a, b in zip(batch_predictions, quality):
        if int(a)==int(b):
            correct_count += 1

    return correct_count, count
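
One detail of the helpers above is easy to miss: `get_input_with_padding` appends all-zero rows to fill the last batch, but `quality` keeps only the real labels. Because `zip` stops at the shorter sequence, predictions for the padded rows never reach the accuracy count. The sketch below shows this with plain lists; the helper name `count_correct` and the values are made up for illustration.

```python
def count_correct(predictions, labels):
    """Compare predictions with labels; zip ignores predictions beyond len(labels)."""
    correct = sum(1 for p, y in zip(predictions, labels) if int(p) == int(y))
    return correct, len(labels)

# Made-up batch of 4 rows where only 3 are real; the 4th row is zero padding.
predictions = [1, 0, 1, 0]  # the last prediction comes from the padded row
labels = [1, 0, 0]          # labels exist only for the real rows

correct, total = count_correct(predictions, labels)
print(correct, total)
```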

In [None]:
from parallel import NeuronSimpleDataParallel
from bert_benchmark_utils import BertTestDataset, BertResults
import time

max_length = 128
num_cores = 4
batch_size = 1

tsv_file="glue_mrpc_dev.tsv"

data_set = BertTestDataset( tsv_file=tsv_file, tokenizer=tokenizer, max_length=max_length )
data_loader = torch.utils.data.DataLoader(data_set, batch_size=batch_size*num_cores, shuffle=True, num_workers=2)

# Create a model that will run parallel inferences on each core (code in parallel.py)
parallel_neuron_model = NeuronSimpleDataParallel('bert_neuron.pt', num_cores)

# Warm all cores
z = torch.zeros( [num_cores * batch_size, max_length], dtype=torch.long )
batch = (z, z, z)
parallel_neuron_model(*batch)

# Result aggregation class (code in bert_benchmark_utils.py)
results = BertResults(batch_size, num_cores)

for _ in range(5):
    for batch in data_loader:
        batch, quality = get_input_with_padding(batch, batch_size * num_cores, max_length)

        start = time.time()
        output = parallel_neuron_model(*batch)
        end = time.time()
        elapsed = end - start

        correct_count, inference_count = count(output, quality)
        results.add_result( correct_count, inference_count, [elapsed], [end], elapsed )

with open("benchmark.txt", "w") as f:
    results.report(f, bins=60)

with open("benchmark.txt", "r") as f:
    for line in f:
        print(line, end='')  # each line already ends with a newline
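
The report is produced by `BertResults`, whose code lives in bert_benchmark_utils.py and is not shown here. To make the measured quantities concrete, here is an independent minimal sketch that derives throughput and latency percentiles from per-call timings like the `elapsed` values collected above. The function `summarize` and the sample latencies are made up, and the sketch assumes the calls run one after another, as in the loop above.

```python
import math
import statistics

def summarize(latencies_s, inferences_per_call):
    """Summarize per-call latencies (in seconds) from sequential inference calls."""
    ordered = sorted(latencies_s)

    def percentile(p):
        # Nearest-rank percentile: smallest value with at least p% of samples at or below it
        rank = max(1, math.ceil(p / 100 * len(ordered)))
        return ordered[rank - 1]

    total_time = sum(latencies_s)
    return {
        "throughput": len(latencies_s) * inferences_per_call / total_time,
        "mean": statistics.mean(latencies_s),
        "p50": percentile(50),
        "p99": percentile(99),
    }

sample_latencies = [0.010, 0.012, 0.011, 0.030]  # made-up timings, not measured results
summary = summarize(sample_latencies, inferences_per_call=4)
print(summary)
```

Here `inferences_per_call` would be `batch_size * num_cores`, since each call to the parallel model processes that many sentence pairs.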

Now recompile with a larger batch size of six sentence pairs.

In [None]:
batch_size = 6

example_inputs_paraphrase = (
    torch.cat([paraphrase['input_ids']] * batch_size,0), 
    torch.cat([paraphrase['attention_mask']] * batch_size,0), 
    torch.cat([paraphrase['token_type_ids']] * batch_size,0)
)

# Run torch.neuron.trace to generate a TorchScript that is optimized by AWS Neuron
model_neuron_batch = torch.neuron.trace(model, example_inputs_paraphrase)

## Save the batched model
model_neuron_batch.save('bert_neuron_b{}.pt'.format(batch_size))

Rerun inference with a batch size of six.

In [None]:
batch_size = 6
num_cores = 4

data_set = BertTestDataset( tsv_file=tsv_file, tokenizer=tokenizer, max_length=max_length )
data_loader = torch.utils.data.DataLoader(data_set, batch_size=batch_size*num_cores, shuffle=True, num_workers=2)

# Create a model that will run parallel inferences on each core (code in parallel.py)
parallel_neuron_model = NeuronSimpleDataParallel('bert_neuron_b{}.pt'.format(batch_size), num_cores, batch_size)

# Warm all cores
z = torch.zeros( [num_cores * batch_size, max_length], dtype=torch.long )
batch = (z, z, z)
parallel_neuron_model(*batch)

# Result aggregation class (code in bert_benchmark_utils.py)
results = BertResults(batch_size, num_cores)

for _ in range(10):
    for batch in data_loader:
        batch, quality = get_input_with_padding(batch, batch_size * num_cores, max_length)

        start = time.time()
        output = parallel_neuron_model(*batch)
        end = time.time()
        elapsed = end - start

        correct_count, inference_count = count(output, quality)
        results.add_result( correct_count, inference_count, [elapsed], [end], elapsed )

with open("benchmark_b{}.txt".format(batch_size), "w") as f:
    results.report(f, bins=60)

with open("benchmark_b{}.txt".format(batch_size), "r") as f:
    for line in f:
        print(line, end='')  # each line already ends with a newline
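
Both benchmark cells pass the parallel model an input of `num_cores * batch_size` rows. The implementation of `NeuronSimpleDataParallel` is in parallel.py and is not shown here, so the following is only a sketch of how a data-parallel wrapper might divide such an input, assuming it hands each core one contiguous chunk of `batch_size` rows. The helper `split_rows` is hypothetical.

```python
def split_rows(rows, num_cores):
    """Split rows into num_cores contiguous chunks of equal size."""
    if len(rows) % num_cores != 0:
        raise ValueError("row count must be a multiple of num_cores")
    per_core = len(rows) // num_cores
    return [rows[i * per_core:(i + 1) * per_core] for i in range(num_cores)]

num_cores = 4
batch_size = 6
rows = list(range(num_cores * batch_size))  # stand-ins for 24 sentence pairs

chunks = split_rows(rows, num_cores)
for core, chunk in enumerate(chunks):
    print(core, chunk)
```

This is also why the final partial batch is padded before inference: each chunk has to match the batch size the model was compiled with. On tensors, `torch.chunk(tensor, num_cores)` performs a comparable split along the first dimension.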