![nebullvm nebuly AI accelerate inference optimize DeepLearning](https://user-images.githubusercontent.com/38586138/201391643-a80407e5-2c28-409c-90c9-327795cd27e8.png)

# Accelerate Hugging Face PyTorch BERT with Speedster


Hi and welcome 👋

In this notebook we will see how, in just a few steps, you can speed up deep learning model inference using Speedster, an app from the open-source library nebullvm.

With Speedster's latest API, you can speed up models by up to 10 times without any loss of accuracy (option A), or by up to 20-30 times by specifying how much accuracy you are willing to trade for an even lower response time (option B). To accelerate your model, Speedster uses deep learning compilers (in both option A and option B) and, in option B only, further techniques such as quantization and half precision.

Let's jump to the code.

In [26]:
!python --version

Python 3.8.10


In [1]:
%env CUDA_VISIBLE_DEVICES=0

env: CUDA_VISIBLE_DEVICES=0


## Installation

Install Speedster:

In [30]:
!pip install git+https://github.com/nebuly-ai/nebullvm.git#subdirectory=apps/accelerate/speedster

Collecting git+https://github.com/nebuly-ai/nebullvm.git#subdirectory=apps/accelerate/speedster
  Cloning https://github.com/nebuly-ai/nebullvm.git to /tmp/pip-req-build-j5jcezwo
  Running command git clone --filter=blob:none --quiet https://github.com/nebuly-ai/nebullvm.git /tmp/pip-req-build-j5jcezwo
  Resolved https://github.com/nebuly-ai/nebullvm.git to commit fe6716f956e281076c90593c65935379bee6c992
  Preparing metadata (setup.py) ... done

In [1]:
#!pip install git+https://github.com/nebuly-ai/nebullvm

In [1]:
#!pip install speedster

In [2]:
#!pip uninstall -y speedster

Install deep learning compilers:

In [3]:
# !python -m nebullvm.installers.auto_installer  --backends huggingface-full-torch --compilers all

## Model and Dataset setup

We chose BERT as the pre-trained model that we want to optimize. Let's download both the pre-trained model and the tokenizer from the Hugging Face model hub.

In [2]:
import torch
from transformers import BertTokenizer, BertModel

tokenizer = BertTokenizer.from_pretrained('bert-base-uncased')
model = BertModel.from_pretrained('bert-base-uncased', torchscript=True)

# Move the model to gpu if available and set eval mode
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
model = model.to(device).eval()

Some weights of the model checkpoint at bert-base-uncased were not used when initializing BertModel: ['cls.predictions.transform.LayerNorm.bias', 'cls.predictions.decoder.weight', 'cls.predictions.bias', 'cls.seq_relationship.bias', 'cls.seq_relationship.weight', 'cls.predictions.transform.dense.bias', 'cls.predictions.transform.dense.weight', 'cls.predictions.transform.LayerNorm.weight']
- This IS expected if you are initializing BertModel from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing BertModel from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).


Let's create an example dataset from a few sentence fragments, repeated a random number of times:

In [3]:
import random

sentences = [
    "Mars is the fourth planet from the Sun.",
    "has a crust primarily composed of elements",
    "However, it is unknown",
    "can be viewed from Earth",
    "It was the Romans",
]

len_dataset = 100

texts = []
for _ in range(len_dataset):
    n_times = random.randint(1, 30)
    texts.append(" ".join(random.choice(sentences) for _ in range(n_times)))
encoded_inputs = [tokenizer(text, return_tensors="pt") for text in texts]
len(encoded_inputs),encoded_inputs[0].keys()

## Speed up inference with Speedster: no metric drop

It's now time to improve inference speed. Let's use `Speedster`.

In [5]:
import speedster
from speedster import optimize_model
speedster.__file__

'/home/ttj/github/nebullvm/apps/accelerate/speedster/speedster/__init__.py'

Using Speedster is simple: call the `optimize_model` function and pass it the model, some example input data, and the optimization time mode. Optionally, you can also provide a `dynamic_info` dictionary to support inputs with dynamic shape.
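To make the layout of `dynamic_info` concrete, the sketch below walks such a dictionary and lists which axis of which tensor is allowed to vary. `describe_dynamic_axes` is a hypothetical helper written for illustration, not part of the Speedster API, and it assumes only the layout used in this notebook: one `{axis_index: axis_name}` dictionary per input and per output.

```python
def describe_dynamic_axes(dynamic_info):
    """List every dynamic axis declared in a dynamic_info-style dictionary."""
    lines = []
    for kind in ("inputs", "outputs"):
        # Each entry is a {axis_index: axis_name} dict describing one tensor.
        for position, axes in enumerate(dynamic_info.get(kind, [])):
            for axis, name in sorted(axes.items()):
                lines.append(f"{kind}[{position}] axis {axis} -> {name}")
    return lines


# Same structure as the dictionary used in this notebook.
example_dynamic_info = {
    "inputs": [
        {0: "batch", 1: "num_tokens"},
        {0: "batch", 1: "num_tokens"},
    ],
    "outputs": [{0: "batch", 1: "num_tokens"}],
}

axis_report = describe_dynamic_axes(example_dynamic_info)
print("\n".join(axis_report))
```

Axis 1 is marked dynamic because the number of tokens differs from one text to the next in the example dataset; axis 0 leaves the batch size free as well.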

In [7]:
dynamic_info = {
    "inputs": [
        {0: 'batch', 1: 'num_tokens'},
        {0: 'batch', 1: 'num_tokens'},
    ],
    "outputs": [
        {0: 'batch', 1: 'num_tokens'}
    ]
}

optimized_model = optimize_model(
    model=model,
    input_data=encoded_inputs,
    optimization_time="constrained",
    # Skip TensorRT (it does not work for this model) and TVM
    ignore_compilers=["tensor_rt", "tvm"],
    dynamic_info=dynamic_info,
)

2023-01-26 16:58:02 | INFO     | Running Speedster on GPU
2023-01-26 16:58:02 | INFO     | inside type <class 'speedster.root_op.SpeedsterRootOp'>
 Please install them to include them in the optimization pipeline.
2023-01-26 16:58:05 | INFO     | Benchmark performance of original model
2023-01-26 16:58:06 | INFO     | Original model latency: 0.0043879556655883785 sec/iter
2023-01-26 16:58:06 | INFO     | framework is DeepLearningFramework.PYTORCH, will convert to onnx
2023-01-26 16:58:06 | INFO     | self.conversion_op: <nebullvm.operations.conversions.converters.PytorchConverter object at 0x7f28323cbd90>, converted model will be saved at /tmp/tmpnhj7ieli/fp32
2023-01-2

In [10]:
import time

# Move inputs to gpu if available
encoded_inputs = [tokenizer(text, return_tensors="pt").to(device) for text in texts]

Let's run the prediction 100 times to calculate the average response time of the original model.

In [11]:
import torch
torch.__version__

'1.13.1+cu117'

In [12]:
def benchmark(model, model_desc='original BERT'):
    """Print the average per-call latency of `model` over `encoded_inputs`."""
    times = []

    # Warm up for 30 iterations so one-time setup costs are not measured
    for encoded_input in encoded_inputs[:30]:
        with torch.no_grad():
            final_out = model(**encoded_input)

    # Benchmark: time one forward pass per input
    for encoded_input in encoded_inputs:
        st = time.perf_counter()
        with torch.no_grad():
            final_out = model(**encoded_input)
        times.append(time.perf_counter() - st)
    avg_time_ms = sum(times) / len(times) * 1000
    print(f"Average response time for {model_desc}: {avg_time_ms} ms")

In [13]:
benchmark(model, 'original BERT')

Average response time for original BERT: 4.486726749601075 ms
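A caveat on timing GPU code: PyTorch queues CUDA work asynchronously, so a wall-clock timer can stop before the device has finished unless something forces a synchronization. The sketch below is a framework-agnostic variant of the benchmark loop that accepts an optional `sync` callback (on a GPU one could pass `torch.cuda.synchronize`) and reports the median and 95th percentile alongside the mean. The names `time_callable` and `summarize` are made up for this example, and the workload is a pure-Python stand-in rather than BERT.

```python
import statistics
import time


def time_callable(fn, inputs, warmup=3, sync=None):
    """Return per-call latencies in milliseconds for fn over inputs."""
    # Warm-up calls are executed but not timed.
    for item in inputs[:warmup]:
        fn(item)
    if sync is not None:
        sync()

    latencies_ms = []
    for item in inputs:
        start = time.perf_counter()
        fn(item)
        if sync is not None:
            sync()  # wait for queued device work before stopping the clock
        latencies_ms.append((time.perf_counter() - start) * 1000)
    return latencies_ms


def summarize(latencies_ms):
    """Return mean, median and (approximate) 95th percentile latency."""
    ordered = sorted(latencies_ms)
    p95_index = min(len(ordered) - 1, int(0.95 * len(ordered)))
    return {
        "mean": statistics.mean(ordered),
        "median": statistics.median(ordered),
        "p95": ordered[p95_index],
    }


# Stand-in workload: summing a range instead of running a model.
stats = summarize(time_callable(lambda n: sum(range(n)), [10_000] * 50))
print({name: round(value, 4) for name, value in stats.items()})
```

Whether synchronization changes the numbers in this notebook depends on the backend: a model whose call already copies results back to the host is synchronized implicitly.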


Let's see the output of the original model

In [14]:
# model(**encoded_input)

Let's run the prediction 100 times to calculate the average response time of the optimized model.

In [15]:
benchmark(optimized_model, 'optimized BERT (no metric drop)')

Average response time for optimized BERT (no metric drop): 2.9902880199369974 ms
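As a quick piece of arithmetic on the two averages printed above, the snippet below turns them into a speedup factor and a relative latency reduction. The numbers are copied from this particular run, so they will differ on other hardware.

```python
# Average latencies reported by the two benchmark runs above, in milliseconds.
original_ms = 4.486726749601075
optimized_ms = 2.9902880199369974

speedup = original_ms / optimized_ms
latency_reduction = 1 - optimized_ms / original_ms

print(f"{speedup:.2f}x faster, {latency_reduction:.0%} lower latency")
# prints: 1.50x faster, 33% lower latency
```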


In [16]:
type(optimized_model), optimized_model

(nebullvm.operations.inference_learners.huggingface.HuggingFaceInferenceLearner,
 HuggingFaceInferenceLearner(network_parameters=ModelParams(batch_size=1, input_infos=[<nebullvm.tools.base.InputInfo object at 0x7f9e96bbb880>, <nebullvm.tools.base.InputInfo object at 0x7f9e96bbb8e0>, <nebullvm.tools.base.InputInfo object at 0x7f9e96bbb940>], output_sizes=[(32, 768), (768,)], dynamic_info=DynamicAxisInfo(inputs=[{0: 'batch', 1: 'num_tokens'}, {0: 'batch', 1: 'num_tokens'}], outputs=[{0: 'batch', 1: 'num_tokens'}])), input_tfms=None, device=None))

Let's see the output of the optimized_model

In [17]:
# optimized_model(**encoded_input)

In [18]:
save_path = 'optimized_model'
optimized_model.save(save_path)

from nebullvm.operations.inference_learners.base import LearnerMetadata

optimized_model_reload = LearnerMetadata.read(save_path).load_model(save_path)

In [19]:
benchmark(optimized_model_reload, 'reloaded optimized BERT (no metric drop)')

Average response time for reloaded optimized BERT (no metric drop): 35.16878379959962 ms


In [21]:
# import rich
# rich.inspect(optimized_model, methods=True)

In [20]:
# import rich
# rich.inspect(optimized_model_reload, methods=True)

## Speed up inference with Speedster: metric drop

This time we will use the `metric_drop_ths` argument to accept a small drop in precision, which enables quantization and can yield a higher speedup.
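To give a feel for what the precision trade-off means numerically, the sketch below rounds a few illustrative values to IEEE 754 half precision using only Python's standard `struct` module. It does not involve Speedster; `to_half` is a helper defined here for illustration. Half precision keeps roughly three significant decimal digits, so each value moves slightly.

```python
import struct


def to_half(value):
    """Round a Python float to the nearest IEEE 754 half-precision value."""
    return struct.unpack("<e", struct.pack("<e", value))[0]


# Illustrative activations of the magnitude a BERT layer typically produces.
samples = [-1.3959, -0.2622, 0.2219, -0.8757, 1.0242, 0.1929]

for value in samples:
    rounded = to_half(value)
    rel_error = abs(rounded - value) / abs(value)
    print(f"{value:+.4f} -> {rounded:+.6f} (relative error {rel_error:.1e})")
```

For values in the normal half-precision range, the relative rounding error of a single value stays below 2^-11 (about 0.05%); errors in a full model can be larger because they accumulate across layers.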

In [21]:
optimized_model = optimize_model(
    model=model,
    input_data=encoded_inputs,
    optimization_time="constrained",
    ignore_compilers=["tensor_rt", "tvm"],
    dynamic_info=dynamic_info,
    metric_drop_ths=0.1,
)

2023-01-21 10:22:56 | INFO     | Running Speedster on GPU
2023-01-21 10:22:59 | INFO     | Benchmark performance of original model
2023-01-21 10:22:59 | INFO     | Original model latency: 0.004553823471069336 sec/iter
2023-01-21 10:23:02 | INFO     | Optimizing with PytorchBackendCompiler and q_type: None.
2023-01-21 10:23:03 | INFO     | Optimized model latency: 0.0028252601623535156 sec/iter
2023-01-21 10:23:03 | INFO     | Optimizing with PytorchBackendCompiler and q_type: QuantizationType.HALF.
2023-01-21 10:23:03 | INFO     | Optimizing with ONNXCompiler and q_type: None.
2023-01-21 10:23:04 | INFO     | Optimized model latency: 0.002164602279663086 sec/iter
2023-01-21 10:23:04 | INFO     | Optimizing with ONNXCompiler and q_type: QuantizationType.HALF.
2
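The log above already contains enough information to compare the candidates Speedster tried. The sketch below parses the visible part of that log and picks the fastest configuration. It is a rough illustration, not a Speedster feature: the line patterns are inferred from this output only, and it assumes that each "Optimized model latency" line belongs to the most recent "Optimizing with" line.

```python
import re

# Visible portion of the log above, with terminal colour codes removed.
LOG_SAMPLE = """\
2023-01-21 10:22:59 | INFO     | Original model latency: 0.004553823471069336 sec/iter
2023-01-21 10:23:02 | INFO     | Optimizing with PytorchBackendCompiler and q_type: None.
2023-01-21 10:23:03 | INFO     | Optimized model latency: 0.0028252601623535156 sec/iter
2023-01-21 10:23:03 | INFO     | Optimizing with PytorchBackendCompiler and q_type: QuantizationType.HALF.
2023-01-21 10:23:03 | INFO     | Optimizing with ONNXCompiler and q_type: None.
2023-01-21 10:23:04 | INFO     | Optimized model latency: 0.002164602279663086 sec/iter
"""

ORIGINAL_RE = re.compile(r"Original model latency: ([0-9.eE+-]+) sec/iter")
ATTEMPT_RE = re.compile(r"Optimizing with (\w+) and q_type: (.+)\.$")
RESULT_RE = re.compile(r"Optimized model latency: ([0-9.eE+-]+) sec/iter")


def parse_latencies(log_text):
    """Return (original_latency, {(compiler, q_type): latency}) in seconds."""
    original = None
    results = {}
    current = None  # most recent (compiler, q_type) attempt seen
    for line in log_text.splitlines():
        match = ORIGINAL_RE.search(line)
        if match:
            original = float(match.group(1))
            continue
        match = ATTEMPT_RE.search(line)
        if match:
            current = (match.group(1), match.group(2))
            continue
        match = RESULT_RE.search(line)
        if match and current is not None:
            results[current] = float(match.group(1))
            current = None  # an attempt reports at most one latency
    return original, results


original_latency, attempt_latencies = parse_latencies(LOG_SAMPLE)
best_attempt = min(attempt_latencies, key=attempt_latencies.get)
best_speedup = original_latency / attempt_latencies[best_attempt]
print(f"Fastest: {best_attempt[0]} (q_type={best_attempt[1]}), {best_speedup:.2f}x")
```

Attempts that never report a latency, such as the half-precision PyTorch attempt in this excerpt, are simply left out of the comparison.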

In [40]:
times = []

# Warmup for 30 iterations
for encoded_input in encoded_inputs[:30]:
    with torch.no_grad():
        final_out = model(**encoded_input)

# Benchmark
for encoded_input in encoded_inputs:
    st = time.perf_counter()
    with torch.no_grad():
        final_out = model(**encoded_input)
    times.append(time.perf_counter()-st)
original_model_time = sum(times)/len(times)*1000
print(f"Average response time for original BERT: {original_model_time} ms")

Average response time for original BERT: 4.694141010113526 ms


In [82]:
def benchmark(model, model_desc='original BERT'):
    """Print the average per-call latency of `model` over `encoded_inputs`."""
    times = []

    # Warm up for 30 iterations
    for encoded_input in encoded_inputs[:30]:
        with torch.no_grad():
            final_out = model(**encoded_input)

    # Benchmark
    for encoded_input in encoded_inputs:
        st = time.perf_counter()
        with torch.no_grad():
            final_out = model(**encoded_input)
        times.append(time.perf_counter() - st)
    avg_time_ms = sum(times) / len(times) * 1000
    print(f"Average response time for {model_desc}: {avg_time_ms} ms")

In [41]:
model(**encoded_input)

(tensor([[[-1.3959, -0.2622,  0.2219,  ..., -0.8757,  1.0242,  0.1929],
          [-0.1232,  1.0914,  0.7587,  ..., -0.7951,  1.3011,  0.3600],
          [-1.6476, -0.3754,  0.5519,  ...,  0.0806,  0.6507,  0.6131],
          ...,
          [-0.6952, -0.5882, -0.1518,  ..., -0.2717,  0.7255, -0.5284],
          [-1.0378, -1.0159, -0.3399,  ...,  0.5535,  0.9189, -0.3754],
          [ 0.1252,  0.1695, -0.2149,  ...,  0.4608, -0.2041, -0.1065]]],
        device='cuda:0', grad_fn=<NativeLayerNormBackward0>),
 tensor([[-0.4473, -0.7220, -0.9987,  0.8181,  0.9683, -0.6342, -0.2493,  0.6438,
          -0.9947, -0.9992, -0.9269,  0.9948,  0.6546,  0.9041, -0.1536, -0.7647,
          -0.7596, -0.6340,  0.3766,  0.9080,  0.4105,  1.0000, -0.8801,  0.6671,
           0.5953,  0.9994, -0.8529,  0.4076,  0.6064,  0.4705, -0.2271,  0.5080,
          -0.9505, -0.3206, -0.9993, -0.8728,  0.7812,  0.2545, -0.1004, -0.2399,
          -0.0988,  0.6550,  1.0000,  0.0055,  0.8167, -0.0669, -1.0000,  0.578

In [55]:
times = []

# Warmup for 30 iterations
for encoded_input in encoded_inputs[:30]:
    with torch.no_grad():
        final_out = optimized_model(**encoded_input)

# Benchmark
for encoded_input in encoded_inputs:
    st = time.perf_counter()
    with torch.no_grad():
        final_out = optimized_model(**encoded_input)
    times.append(time.perf_counter()-st)
optimized_model_time = sum(times)/len(times)*1000
print(f"Average response time for optimized BERT (metric drop): {optimized_model_time} ms")

Average response time for optimized BERT (metric drop): 128.33339850010816 ms


In [71]:
optimized_model_reload?

In [73]:
optimized_model_reload.device?

In [74]:
!pip install rich


In [86]:
import rich
rich.inspect(optimized_model_reload, methods=True)

In [89]:
optimized_model_reload.device = 'gpu'

In [None]:
optimized_model_reload.core_inference_learner

In [90]:
benchmark(optimized_model_reload)

Average response time for original BERT: 32.61423967003793 ms


In [69]:
times = []

# Warmup for 30 iterations
for encoded_input in encoded_inputs[:30]:
    with torch.no_grad():
        final_out = optimized_model_reload(**encoded_input)

# Benchmark
for encoded_input in encoded_inputs:
    st = time.perf_counter()
    with torch.no_grad():
        final_out = optimized_model_reload(**encoded_input)
    times.append(time.perf_counter()-st)
optimized_model_time = sum(times)/len(times)*1000
print(f"Average response time for optimized BERT (metric drop): {optimized_model_time} ms")

Average response time for optimized BERT (metric drop): 35.351582590192265 ms


In [51]:
optimized_model(**encoded_input)

(tensor([[[-1.3945, -0.2617,  0.2220,  ..., -0.8755,  1.0244,  0.1929],
          [-0.1223,  1.0918,  0.7598,  ..., -0.7944,  1.2998,  0.3601],
          [-1.6465, -0.3750,  0.5508,  ...,  0.0803,  0.6504,  0.6123],
          ...,
          [-0.6943, -0.5884, -0.1506,  ..., -0.2715,  0.7266, -0.5283],
          [-1.0371, -1.0156, -0.3401,  ...,  0.5532,  0.9189, -0.3748],
          [ 0.1237,  0.1702, -0.2141,  ...,  0.4604, -0.2036, -0.1069]]],
        device='cuda:0', dtype=torch.float16),
 tensor([[-0.4470, -0.7217, -0.9985,  0.8179,  0.9683, -0.6338, -0.2499,  0.6436,
          -0.9946, -0.9990, -0.9268,  0.9946,  0.6553,  0.9038, -0.1532, -0.7642,
          -0.7588, -0.6338,  0.3765,  0.9082,  0.4102,  1.0000, -0.8799,  0.6670,
           0.5952,  0.9995, -0.8525,  0.4082,  0.6069,  0.4709, -0.2269,  0.5078,
          -0.9507, -0.3203, -0.9995, -0.8730,  0.7808,  0.2542, -0.1000, -0.2391,
          -0.0997,  0.6548,  1.0000,  0.0044,  0.8164, -0.0666, -1.0000,  0.5781,
          -0
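The two printed outputs can be compared directly. The snippet below takes the six visible values of the first row from `model(**encoded_input)` and from `optimized_model(**encoded_input)` above and computes their absolute differences. It is only a rough check on a handful of rounded values, and it is not necessarily the metric Speedster evaluates against `metric_drop_ths`.

```python
# First printed row of each output tensor above (values as displayed).
original_row = [-1.3959, -0.2622, 0.2219, -0.8757, 1.0242, 0.1929]
optimized_row = [-1.3945, -0.2617, 0.2220, -0.8755, 1.0244, 0.1929]

abs_diffs = [abs(a - b) for a, b in zip(original_row, optimized_row)]
max_abs_diff = max(abs_diffs)
mean_abs_diff = sum(abs_diffs) / len(abs_diffs)

print(f"max abs diff:  {max_abs_diff:.4f}")
print(f"mean abs diff: {mean_abs_diff:.4f}")
```

The differences appear in the third or fourth decimal place, which is consistent with the optimized output being returned as `torch.float16`.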

## Save and reload the optimized model

We can easily save the optimized model to disk with the following line:

In [66]:
optimized_model.save("optimized_model")

from nebullvm.operations.inference_learners.base import LearnerMetadata

optimized_model_reload = LearnerMetadata.read("optimized_model").load_model("optimized_model")

In [49]:
optimized_model.save("model_save_path")

We can then load the model again:

In [79]:
from nebullvm.operations.inference_learners.base import LearnerMetadata

optimized_model = LearnerMetadata.read("model_save_path").load_model("model_save_path")

AttributeError: 'HuggingFaceInferenceLearner' object has no attribute 'to'

Great! Was it easy? How are the results? Do you have any comments?
Share your optimization results and thoughts with <a href="https://discord.gg/RbeQMu886J" target="_blank">our community on Discord</a>, where we chat about Speedster and AI acceleration.

Note that the acceleration Speedster achieves depends heavily on the hardware configuration and on your AI model. Given the same input model, Speedster can accelerate it by 10 times on some machines and perform poorly on others.

If you want to learn more about how Speedster works, look at other tutorials and performance benchmarks, check out the links below or write to us on Discord.

<center> 
    <a href="https://discord.com/invite/RbeQMu886J" target="_blank" style="text-decoration: none;"> Join the community </a> |
    <a href="https://nebuly.gitbook.io/nebuly/welcome/questions-and-contributions" target="_blank" style="text-decoration: none;"> Contribute to the library </a>
</center>

<center> 
    <a href="https://github.com/nebuly-ai/nebullvm#how-it-works" target="_blank" style="text-decoration: none;"> How nebullvm works </a> •
    <a href="https://github.com/nebuly-ai/nebullvm#documentation" target="_blank" style="text-decoration: none;"> Documentation </a> •
    <a href="https://github.com/nebuly-ai/nebullvm#api-quick-view" target="_blank" style="text-decoration: none;"> API quick view </a> 
</center>