# Compiling and Deploying Pretrained HuggingFace Pipelines distilBERT with Tensorflow2 Neuron

### Introduction

In this tutorial you will compile and deploy the distilBERT version of HuggingFace 🤗 Transformers BERT for Inferentia. The full list of HuggingFace's pretrained BERT models can be found in the BERT section of this page: https://huggingface.co/transformers/pretrained_models.html. You can also read about HuggingFace's pipeline feature here: https://huggingface.co/transformers/main_classes/pipelines.html

This Jupyter notebook should be run on an inf1.6xlarge instance or larger. In a real-life scenario, however, you would do the compilation on a compute instance and the deployment on an inf1 instance to save costs.

### Setting up your environment:

To run this tutorial, please make sure you deactivate any existing TensorFlow conda environments you are already using. Install TensorFlow 2.x by following the instructions at [TensorFlow Tutorial Setup Guide](https://awsdocs-neuron.readthedocs-hosted.com/en/latest/neuron-guide/neuron-frameworks/tensorflow-neuron/tutorials/tensorflow-tutorial-setup.html#tensorflow-tutorial-setup).

After following the Setup Guide, change your kernel to ```Python (Neuron TensorFlow 2)``` by clicking Kernel->Change Kernel->```Python (Neuron TensorFlow 2)```

Now you can install TensorFlow Neuron 2.x, HuggingFace transformers, and HuggingFace datasets dependencies here.

In [1]:
!pip install --upgrade transformers
!pip install ipywidgets

Defaulting to user installation because normal site-packages is not writeable
Looking in indexes: https://pypi.org/simple, https://pip.repos.neuron.amazonaws.com
You should consider upgrading via the '/usr/bin/python3 -m pip install --upgrade pip' command.

In [2]:
from transformers import pipeline
from transformers import TFBertForSequenceClassification, BertTokenizer
import tensorflow as tf
import tensorflow.neuron as tfn

2022-07-03 03:25:58.863780: I tensorflow/stream_executor/platform/default/dso_loader.cc:53] Successfully opened dynamic library libcudart.so.11.0
2022-07-03 03:25:59.860352: I tensorflow/stream_executor/platform/default/dso_loader.cc:53] Successfully opened dynamic library libcuda.so.1
2022-07-03 03:25:59.982273: E tensorflow/stream_executor/cuda/cuda_driver.cc:328] failed call to cuInit: CUDA_ERROR_NO_DEVICE: no CUDA-capable device is detected
2022-07-03 03:25:59.982312: I tensorflow/stream_executor/cuda/cuda_diagnostics.cc:156] kernel driver does not appear to be running on this host (ip-172-31-49-69.us-west-2.compute.internal): /proc/driver/nvidia/version does not exist
2022-07-03 03:26:00.046481: I tensorflow/core/platform/cpu_feature_guard.cc:142] This TensorFlow binary is optimized with oneAPI Deep Neural Network Library (oneDNN) to use the following CPU instructions in performance-critical operations:  AVX2 AVX512F FMA
To enable them in other operations, rebuild TensorFlow with 

### Compile the model into an AWS Neuron Optimized Model

In [3]:
#Create the HuggingFace pipeline for sentiment analysis.
#This model tries to determine whether the input text has a positive
#or a negative sentiment.
model_name = 'bert-base-uncased'

pipe = pipeline('sentiment-analysis', model=model_name, framework='tf')

#pipelines are extremely easy to use as they do all the tokenization,
#inference and output interpretation for you.
pipe(['I love pipelines, they are very easy to use!', 'this string makes it batch size two'])


All model checkpoint layers were used when initializing TFBertForSequenceClassification.

Some layers of TFBertForSequenceClassification were not initialized from the model checkpoint at bert-base-uncased and are newly initialized: ['classifier']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


[{'label': 'LABEL_0', 'score': 0.6728695034980774},
 {'label': 'LABEL_0', 'score': 0.6818313598632812}]

As you've seen above, HuggingFace's pipeline feature is a great wrapper for running inference on their models. It tokenizes the string inputs, feeds the tokenized input to the model, and finally interprets the model's outputs and formats them in a very human-readable way. Our goal is to compile the underlying model inside the pipeline and to make some edits to the tokenizer. You need to edit the tokenizer so that every input has a standard sequence length (in this case 128), because neuron only accepts static input shapes.
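
To make the static-shape requirement concrete, here is a minimal pure-Python sketch of what padding and truncating to a fixed length means. The helper `pad_or_truncate`, the padding id `0`, and the token ids used are assumptions for illustration only; a real tokenizer also handles special tokens, which this sketch ignores.

```python
from typing import List, Tuple

MAX_LENGTH = 128  # static sequence length used in this tutorial
PAD_ID = 0        # assumed padding token id, for illustration only


def pad_or_truncate(token_ids: List[int],
                    max_length: int = MAX_LENGTH,
                    pad_id: int = PAD_ID) -> Tuple[List[int], List[int]]:
    """Return (input_ids, attention_mask), both exactly max_length long."""
    kept = token_ids[:max_length]      # truncate anything past max_length
    n_pad = max_length - len(kept)     # number of padding positions needed
    input_ids = kept + [pad_id] * n_pad
    attention_mask = [1] * len(kept) + [0] * n_pad  # 1 = real token, 0 = padding
    return input_ids, attention_mask


# Arbitrary token ids: one short sequence and one that is too long.
short_ids, short_mask = pad_or_truncate([7, 8, 9, 10])
long_ids, long_mask = pad_or_truncate(list(range(1, 301)))

print(len(short_ids), sum(short_mask))
print(len(long_ids), sum(long_mask))
```

Whatever the length of the original text, both outputs have the same length, so every batch has the same shape.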


In [None]:
model = TFBertForSequenceClassification.from_pretrained("bert-base-uncased", num_labels=2)
model.compile(optimizer=tf.keras.optimizers.Adam(learning_rate=2e-5, epsilon=1e-08),
              loss=tf.keras.losses.SparseCategoricalCrossentropy(from_logits=True), metrics=["acc"])
original_tokenizer = BertTokenizer.from_pretrained('bert-base-uncased', do_lower_case=True)

def wrapper_function(*args, **kwargs):
    kwargs['padding'] = 'max_length'
    #this is the key line here to set a static input shape
    #so that all inputs are set to a len of 128
    kwargs['max_length'] = 128 
    kwargs['truncation'] = True
    kwargs['return_tensors'] = 'tf'
    return original_tokenizer(*args, **kwargs)

#Our example data!
string_inputs = [
    'I love to eat pizza!',
    'I am sorry. I really want to like it, but I just can not stand sushi.',
    'I really do not want to type out 128 strings to create batch 128 data.',
    'Ah! Multiplying this list by 32 would be a great solution!',
]
string_inputs = string_inputs * 32


example_inputs = wrapper_function(string_inputs)
example_inputs_list = [example_inputs['input_ids'], example_inputs['attention_mask']]

#Compile the model by calling tfn.trace, passing in the underlying model
#and the example inputs generated by our updated tokenizer.
#The subgraph builder below selects only MatMul operations for Neuron.
def subgraph_builder_function(node):
    return node.op == 'MatMul'

neuron_model = tfn.trace(model, example_inputs_list,
                         subgraph_builder_function=subgraph_builder_function)
neuron_model.save('./bert_based-uncased-neuron')

All model checkpoint layers were used when initializing TFBertForSequenceClassification.

Some layers of TFBertForSequenceClassification were not initialized from the model checkpoint at bert-base-uncased and are newly initialized: ['classifier']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


Instructions for updating:
The `validate_indices` argument has no effect. Indices are always validated on CPU and never validated on GPU.


2022-07-03 03:26:15.419428: I tensorflow/core/grappler/devices.cc:69] Number of eligible GPUs (core count >= 8, compute capability >= 0.0): 0
2022-07-03 03:26:15.419574: I tensorflow/core/grappler/clusters/single_machine.cc:357] Starting new session
2022-07-03 03:26:15.439147: I tensorflow/core/platform/profile_utils/cpu_utils.cc:114] CPU Frequency: 2999995000 Hz


### Why use batch size 128?

You'll notice that in the above example we passed two tensors of shape 128 (the batch size) x 128 (the sequence length) in the function call ```tfn.trace(model, example_inputs_list, subgraph_builder_function=subgraph_builder_function)```. The ```example_inputs_list``` argument is important to ```tfn.trace``` because it tells the neuron model what to expect (remember that a neuron model needs static input shapes, so the example inputs define that static input shape). A smaller batch size would also compile, but a large batch size ensures that the neuron hardware is fed enough data to be as performant as possible.
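
Because the compiled model expects exactly 128 inputs per call, a workload of arbitrary size has to be cut into full batches. The following is a rough sketch of one way to do that, assuming the last batch is filled with empty strings; the helper `fixed_size_batches` and the filler value are hypothetical and not part of the Neuron or HuggingFace APIs.

```python
from typing import Iterator, List, Tuple

BATCH_SIZE = 128  # static batch size the model was compiled with


def fixed_size_batches(items: List[str],
                       batch_size: int = BATCH_SIZE,
                       filler: str = "") -> Iterator[Tuple[List[str], int]]:
    """Yield (batch, n_real); batch always has batch_size entries,
    of which only the first n_real are real inputs."""
    for start in range(0, len(items), batch_size):
        chunk = items[start:start + batch_size]
        n_real = len(chunk)
        # Fill the final, shorter chunk so its shape matches the others.
        yield chunk + [filler] * (batch_size - n_real), n_real


sentences = [f"sentence {i}" for i in range(300)]
batches = list(fixed_size_batches(sentences))
print([(len(batch), n_real) for batch, n_real in batches])
```

The `n_real` count lets the caller discard the predictions that belong to filler entries.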

### What if my model isn't a Huggingface pipeline?

Not to worry! There is no requirement that your model be Huggingface pipeline compatible. The Huggingface pipeline is just a wrapper for an underlying TensorFlow model (in our case ```pipe.model```). As long as you have a TensorFlow 2.x model, you can compile it on neuron by calling ```tfn.trace(your_model, example_inputs)```. Processing the input and output of your own model is up to you! Take a look at the example below to see what happens when we call the model without the Huggingface pipeline wrapper as opposed to with it.

In [None]:
#directly call the model with the same list of tensors it was traced with
print(example_inputs)
print(string_inputs)

print(neuron_model(example_inputs_list))
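
Without the pipeline, turning raw model output into labels is your own job. As a hedged sketch, assuming the model returns one row of raw logits per input and that softmax is the appropriate normalization, the post-processing could look like this. The function names are hypothetical; the `LABEL_<index>` format simply mirrors the pipeline output shown earlier.

```python
import math
from typing import Dict, List, Union


def softmax(logits: List[float]) -> List[float]:
    """Numerically stable softmax over one row of logits."""
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def logits_to_predictions(
        batch_logits: List[List[float]]) -> List[Dict[str, Union[str, float]]]:
    """Convert rows of logits into {'label', 'score'} dictionaries."""
    results = []
    for row in batch_logits:
        probs = softmax(row)
        best = max(range(len(probs)), key=probs.__getitem__)  # argmax
        results.append({"label": f"LABEL_{best}", "score": probs[best]})
    return results


# Made-up logits for a batch of two inputs and two classes.
example_logits = [[2.0, 0.0], [-1.0, 1.0]]
predictions = logits_to_predictions(example_logits)
for prediction in predictions:
    print(prediction)
```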


### Save your neuron model to disk and avoid recompilation.

To avoid recompiling the model before every deployment, you can save the neuron model by calling ```neuron_model.save(model_dir)```. This ```save``` method prefers to work on flat input/output lists and does not work on dictionary input/output, which is what the Huggingface distilBERT expects as input. You can work around this by writing a simple wrapper that takes an input list instead of a dictionary, compiling the wrapped model, and saving it for later use.
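
The wrapper idea can be sketched in plain Python before applying it to a real model. This is an illustration under stated assumptions, not the tutorial's own wrapper: `FlatInputAdapter` and `toy_dict_model` are hypothetical names, and the toy model stands in for a dictionary-input TensorFlow model. In the notebook, the same idea would presumably be expressed as a Keras model subclass so that it can be traced.

```python
from typing import Callable, Dict, List, Sequence


class FlatInputAdapter:
    """Accept [input_ids, attention_mask] and call a dictionary-input model."""

    def __init__(self, dict_model: Callable[[Dict[str, list]], Dict[str, list]]):
        self.dict_model = dict_model

    def __call__(self, inputs: Sequence[list]) -> list:
        input_ids, attention_mask = inputs
        outputs = self.dict_model(
            {"input_ids": input_ids, "attention_mask": attention_mask})
        # Return only the logits so the output is flat as well.
        return outputs["logits"]


def toy_dict_model(features: Dict[str, list]) -> Dict[str, list]:
    # Stand-in model: the "logit" is the number of unmasked tokens per row.
    return {"logits": [[sum(mask)] for mask in features["attention_mask"]]}


flat_model = FlatInputAdapter(toy_dict_model)
flat_output = flat_model([[[5, 6, 0]], [[1, 1, 0]]])
print(flat_output)
```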

In [None]:
example_inputs_list = [example_inputs['input_ids'], example_inputs['attention_mask']]

#model_wrapped is the list-input wrapper around the model described above
#compile the wrapped model and save it to disk
model_wrapped_traced = tfn.trace(model_wrapped, example_inputs_list)
model_wrapped_traced.save('./distilbert_b128')

### Load the model from disk

Now you can reload the model by calling ```tf.keras.models.load_model(model_directory)```, where ```model_directory``` is the path string the model was saved to. This model is already compiled and can run inference on neuron, but if you want it to work with our Huggingface pipeline, you have to wrap it again to accept dictionary input.
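
Wrapping the reloaded model again is the inverse adaptation: a dictionary comes in and the flat model is called with a list. Below is a minimal plain-Python sketch of that direction. `DictInputAdapter` and `toy_flat_model` are hypothetical, and the toy function only stands in for a loaded model that takes `[input_ids, attention_mask]`.

```python
from typing import Callable, Dict, List


class DictInputAdapter:
    """Accept a dictionary of features and call a list-input model."""

    def __init__(self, flat_model: Callable[[List[list]], list]):
        self.flat_model = flat_model

    def __call__(self, features: Dict[str, list]) -> Dict[str, list]:
        logits = self.flat_model(
            [features["input_ids"], features["attention_mask"]])
        # Re-wrap the flat output under the key a dictionary consumer expects.
        return {"logits": logits}


def toy_flat_model(inputs: List[list]) -> list:
    # Stand-in for the reloaded model: counts unmasked tokens per row.
    input_ids, attention_mask = inputs
    return [[sum(mask)] for mask in attention_mask]


dict_model = DictInputAdapter(toy_flat_model)
dict_output = dict_model({"input_ids": [[5, 6, 0]], "attention_mask": [[1, 1, 0]]})
print(dict_output)
```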

### Benchmarking the neuron model

In [None]:
import warnings

warnings.warn("NEURONCORE_GROUP_SIZES is being deprecated, if your application is using NEURONCORE_GROUP_SIZES please \
see https://awsdocs-neuron.readthedocs-hosted.com/en/latest/release-notes/deprecation.html#announcing-end-of-support-for-neuroncore-group-sizes \
for more details.", DeprecationWarning)
%env NEURONCORE_GROUP_SIZES='4x1'

import time

#warm up with one inference before timing
neuron_model(example_inputs_list)
#benchmark batch 128 neuron model
neuron_b128_times = []
for i in range(1000):
    start = time.time()
    outputs = neuron_model(example_inputs_list)
    end = time.time()
    neuron_b128_times.append(end - start)
    

neuron_b128_times = sorted(neuron_b128_times)

print(f"Average throughput for batch 128 neuron model is {128/(sum(neuron_b128_times)/len(neuron_b128_times))} sentences/s.")
print(f"Peak throughput for batch 128 neuron model is {128/min(neuron_b128_times)} sentences/s.")
print()


print(f"50th percentile latency for batch 128 neuron model is {neuron_b128_times[int(1000*.5)] * 1000} ms.")
print(f"90th percentile latency for batch 128 neuron model is {neuron_b128_times[int(1000*.9)] * 1000} ms.")
print(f"95th percentile latency for batch 128 neuron model is {neuron_b128_times[int(1000*.95)] * 1000} ms.")
print(f"99th percentile latency for batch 128 neuron model is {neuron_b128_times[int(1000*.99)] * 1000} ms.")
print()
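
The measurement loop above can also be packaged as a reusable helper that derives the percentile index from the number of samples instead of hard-coding 1000. This is a sketch under assumptions: `benchmark` and `percentile` are hypothetical helpers, and the timed callable here is a CPU stand-in rather than the neuron model.

```python
import time
from typing import Callable, Dict, List


def percentile(sorted_values: List[float], fraction: float) -> float:
    """Look up sorted_values[int(n * fraction)], clamped to the last element."""
    index = min(int(len(sorted_values) * fraction), len(sorted_values) - 1)
    return sorted_values[index]


def benchmark(fn: Callable[[], object], batch_size: int,
              iterations: int = 1000, warmup: int = 1) -> Dict[str, float]:
    """Time fn() repeatedly and report throughput and latency percentiles."""
    for _ in range(warmup):
        fn()  # warm-up calls are not timed
    latencies = []
    for _ in range(iterations):
        start = time.perf_counter()
        fn()
        latencies.append(time.perf_counter() - start)
    latencies.sort()
    mean_latency = sum(latencies) / len(latencies)
    return {
        "avg_throughput": batch_size / mean_latency,   # items per second
        "peak_throughput": batch_size / latencies[0],  # from fastest call
        "p50_ms": percentile(latencies, 0.50) * 1000,
        "p90_ms": percentile(latencies, 0.90) * 1000,
        "p95_ms": percentile(latencies, 0.95) * 1000,
        "p99_ms": percentile(latencies, 0.99) * 1000,
    }


# Stand-in workload; in the notebook this would be the model call.
stats = benchmark(lambda: sum(range(10_000)), batch_size=128, iterations=200)
for name, value in stats.items():
    print(f"{name}: {value:.3f}")
```

With the notebook's objects, the callable would be something like `lambda: neuron_model(example_inputs_list)`.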

