# Using NeuronCore Pipeline with PyTorch

In this tutorial you compile a pretrained BERT base model from HuggingFace 🤗 Transformers, using the NeuronCore Pipeline feature of the AWS Neuron SDK. You benchmark model latency of the pipeline parallel mode and compare with the usual data parallel (multi-worker) deployment.

This tutorial is intended to run on an inf1.6xlarge instance running the latest AWS Deep Learning AMI (DLAMI). The inf1.6xlarge instance size has 4 AWS Inferentia chips for a total of 16 NeuronCores.

Before continuing, verify that this Jupyter notebook is running the `conda_aws_neuron_pytorch_p36` kernel of the DLAMI. You can select the kernel from the “Kernel -> Change Kernel” option at the top of this Jupyter notebook page. If you are using your own AMI, follow [these instructions](https://awsdocs-neuron.readthedocs-hosted.com/en/latest/neuron-intro/install-pytorch.html) to set up your environment.

> __Note:__ Do not execute this tutorial using "Run -> Run all cells" option.  

## Install Dependencies:
This tutorial requires the following pip packages:

- `torch-neuron`
- `neuron-cc[tensorflow]`
- `transformers`

Most of these packages will be installed when configuring your environment using the Neuron PyTorch setup guide. The additional HuggingFace 🤗 Transformers dependency must be installed here.

In [1]:
!pip install --upgrade "transformers==4.6.0"

Looking in indexes: https://pypi.org/simple, https://pip.repos.neuron.amazonaws.com


## Compiling a BERT base model for a single NeuronCore

To run a HuggingFace [BERTModel](https://huggingface.co/transformers/model_doc/bert.html#bertmodel) on Inferentia, you only need to add a single extra line of code to the usual 🤗 Transformers PyTorch implementation, after importing the torch_neuron framework. 

Add the argument `return_dict=False` to the BERT transformers model so it can be traced with [TorchScript](https://pytorch.org/docs/stable/jit.html). TorchScript is a way to create serializable and optimizable models from PyTorch code. 

Enable padding to a maximum sequence length of 128, to test the model's performance with a realistic payload size. You can adapt this sequence length to your application's requirement. 

You can adapt the original example from the [BertModel forward pass docstring](https://huggingface.co/transformers/model_doc/bert.html#transformers.BertModel.forward) as shown in the following cell.


In [1]:
import torch
import torch_neuron
from transformers import BertTokenizer, BertModel

from joblib import Parallel, delayed  
import numpy as np
from tqdm import tqdm

import os
import time 


# tokenizer = BertTokenizer.from_pretrained('bert-base-uncased')
# model = BertModel.from_pretrained('bert-base-uncased',return_dict=False)

# inputs = tokenizer("Hello, my dog is cute",return_tensors="pt",max_length=128,padding='max_length',truncation=True)


from transformers import GPT2Tokenizer
from transformers import GPT2Model
tokenizer = GPT2Tokenizer.from_pretrained('gpt2')
tokenizer.pad_token = tokenizer.eos_token
model = GPT2Model.from_pretrained('gpt2', return_dict=False)
inputs = tokenizer("Hello, my dog is cute",return_tensors="pt")



Some weights of GPT2Model were not initialized from the model checkpoint at gpt2 and are newly initialized: ['h.5.attn.masked_bias', 'h.10.attn.masked_bias', 'h.11.attn.masked_bias', 'h.4.attn.masked_bias', 'h.1.attn.masked_bias', 'h.2.attn.masked_bias', 'h.0.attn.masked_bias', 'h.9.attn.masked_bias', 'h.3.attn.masked_bias', 'h.8.attn.masked_bias', 'h.6.attn.masked_bias', 'h.7.attn.masked_bias']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


In [2]:
inputs

{'input_ids': tensor([[15496,    11,   616,  3290,   318, 13779]]), 'attention_mask': tensor([[1, 1, 1, 1, 1, 1]])}

In [3]:
%%time
a = model(**inputs)


CPU times: user 322 ms, sys: 0 ns, total: 322 ms
Wall time: 30 ms


In [4]:
import torch
import torch_neuron
# 
neuron_model = torch.neuron.trace(model,
example_inputs = (inputs['input_ids'], torch.tensor([]), inputs['attention_mask']),
verbose=1)


  if past_key_values is None or len(past_key_values)==0:
  assert batch_size > 0, "batch_size has to be defined and > 0"
  attn_weights = attn_weights / (float(value.size(-1)) ** 0.5)
  causal_mask = self.bias[:, :, key_length - query_length : key_length, :key_length].bool()
INFO:Neuron:There are 14 ops of 2 different types in the TorchScript that are not compiled by neuron-cc: aten::where, aten::embedding, (For more information see https://github.com/aws/aws-neuron-sdk/blob/master/release-notes/neuron-cc-ops/neuron-cc-ops-pytorch.md)
INFO:Neuron:Number of arithmetic operators (pre-compilation) before = 1182, fused = 1156, percent fused = 97.8%
INFO:Neuron:Compiling function _NeuronGraph$466 with neuron-cc
INFO:Neuron:Compiling with command line: '/home/ubuntu/anaconda3/envs/aws_neuron_pytorch_p36/bin/neuron-cc compile /tmp/tmpyxv2m3ce/graph_def.pb --framework TENSORFLOW --pipeline compile SaveTemps --output /tmp/tmpyxv2m3ce/graph_def.neff --io-config {"inputs": {"0:0": [[1, 6], "int64

Instructions for updating:
Use tf.where in 2.0, which has the same broadcast rule as np.where


INFO:Neuron:Compiling function _NeuronGraph$467 with neuron-cc
INFO:Neuron:Compiling with command line: '/home/ubuntu/anaconda3/envs/aws_neuron_pytorch_p36/bin/neuron-cc compile /tmp/tmp7l8mmwmk/graph_def.pb --framework TENSORFLOW --pipeline compile SaveTemps --output /tmp/tmp7l8mmwmk/graph_def.neff --io-config {"inputs": {"0:0": [[1, 6, 768], "float32"], "1:0": [[1, 6, 768], "float32"]}, "outputs": ["add:0", "split:2", "transpose_1:0", "truediv:0", "Cast_2:0"]} --verbose 1'
INFO:Neuron:Compiling function _NeuronGraph$468 with neuron-cc
INFO:Neuron:Compiling with command line: '/home/ubuntu/anaconda3/envs/aws_neuron_pytorch_p36/bin/neuron-cc compile /tmp/tmpknmy9bpm/graph_def.pb --framework TENSORFLOW --pipeline compile SaveTemps --output /tmp/tmpknmy9bpm/graph_def.neff --io-config {"inputs": {"0:0": [[1, 6], "int64"], "1:0": [[1, 6], "int64"], "2:0": [[1, 6, 768], "float32"], "3:0": [[1, 12, 6, 6], "float32"], "4:0": [[1, 12, 6, 64], "float32"], "5:0": [[1, 6, 768], "float32"]}, "outp

INFO:Neuron:Compile command returned: 1
ERROR:Neuron:neuron-cc failed with the following command line call:
/home/ubuntu/anaconda3/envs/aws_neuron_pytorch_p36/bin/neuron-cc compile /tmp/tmpgagbo84a/graph_def.pb --framework TENSORFLOW --pipeline compile SaveTemps --output /tmp/tmpgagbo84a/graph_def.neff --io-config '{"inputs": {"0:0": [[1, 6, 768], "float32"], "1:0": [[1, 12, 6, 6], "float32"], "2:0": [[1, 1, 1, 6], "float32"], "3:0": [[1, 12, 6, 64], "float32"], "4:0": [[1, 6, 768], "float32"]}, "outputs": ["add_7:0", "3:0", "transpose:0", "split:2", "transpose_3:0", "truediv:0", "Cast_2:0"]}' --verbose 1
Traceback (most recent call last):
  File "/home/ubuntu/anaconda3/envs/aws_neuron_pytorch_p36/lib/python3.6/site-packages/torch_neuron/convert.py", line 391, in op_converter
    item, inputs, compiler_workdir=sg_workdir, **kwargs)
  File "/home/ubuntu/anaconda3/envs/aws_neuron_pytorch_p36/lib/python3.6/site-packages/torch_neuron/decorators.py", line 194, in trace
    'neuron-cc failed

INFO:Neuron:Compiling function _NeuronGraph$475 with neuron-cc
INFO:Neuron:Compiling with command line: '/home/ubuntu/anaconda3/envs/aws_neuron_pytorch_p36/bin/neuron-cc compile /tmp/tmpjwrt047b/graph_def.pb --framework TENSORFLOW --pipeline compile SaveTemps --output /tmp/tmpjwrt047b/graph_def.neff --io-config {"inputs": {"0:0": [[1, 6, 768], "float32"], "1:0": [[1, 12, 6, 6], "float32"], "2:0": [[1, 1, 1, 6], "float32"], "3:0": [[1, 12, 6, 64], "float32"], "4:0": [[1, 6, 768], "float32"]}, "outputs": ["add_7:0", "3:0", "transpose:0", "split:2", "transpose_3:0", "truediv:0", "Cast_2:0"]} --verbose 1'
INFO:Neuron:Compile command returned: 1
ERROR:Neuron:neuron-cc failed with the following command line call:
/home/ubuntu/anaconda3/envs/aws_neuron_pytorch_p36/bin/neuron-cc compile /tmp/tmpjwrt047b/graph_def.pb --framework TENSORFLOW --pipeline compile SaveTemps --output /tmp/tmpjwrt047b/graph_def.neff --io-config '{"inputs": {"0:0": [[1, 6, 768], "float32"], "1:0": [[1, 12, 6, 6], "float

INFO:Neuron:Compiling function _NeuronGraph$479 with neuron-cc
INFO:Neuron:Compiling with command line: '/home/ubuntu/anaconda3/envs/aws_neuron_pytorch_p36/bin/neuron-cc compile /tmp/tmpvs89wcfg/graph_def.pb --framework TENSORFLOW --pipeline compile SaveTemps --output /tmp/tmpvs89wcfg/graph_def.neff --io-config {"inputs": {"0:0": [[1, 6], "int64"], "tensor.2:0": [[], "int64"], "2:0": [[1, 6, 768], "float32"], "3:0": [[1, 6, 768], "float32"], "4:0": [[1, 12, 6, 6], "float32"], "5:0": [[1, 1, 1, 6], "float32"], "6:0": [[1, 12, 6, 64], "float32"], "7:0": [[1, 6, 768], "float32"], "8:0": [[1, 12, 6, 64], "float32"], "9:0": [[1, 12, 6, 64], "float32"], "10:0": [[1, 12, 6, 64], "float32"], "11:0": [[1, 12, 6, 64], "float32"], "12:0": [[1, 12, 6, 64], "float32"], "13:0": [[1, 12, 6, 64], "float32"], "14:0": [[1, 12, 6, 64], "float32"], "15:0": [[1, 12, 6, 64], "float32"], "16:0": [[1, 12, 6, 64], "float32"], "17:0": [[1, 12, 6, 64], "float32"], "18:0": [[1, 12, 6, 64], "float32"], "19:0": [[1

The one extra line required is the call to the `torch.neuron.trace()` method. This call compiles the model and returns a new `torch.nn.Module` whose `forward` method you can use to run inference.

The compiled graph can be saved using the `torch.jit.save` function and restored using `torch.jit.load` function for inference on Inf1 instances. During inference, the previously compiled artifacts will be loaded into the Neuron Runtime for inference execution.


## Running the BERT base model on a single NeuronCore
With the model already available in memory, you can time one execution and check the latency of a single inference call. The first inference call also loads the model into Inferentia, so expect a large "wall time" the first time you run the next cell. Run the cell a second time to see the actual inference latency:

In [6]:
%%time
# The following line tests inference and should be executed on Inf1 instance family. 
outputs = neuron_model(inputs['input_ids'], torch.tensor([]), inputs['attention_mask'])

CPU times: user 154 ms, sys: 0 ns, total: 154 ms
Wall time: 17.2 ms


You can also check for the throughput of the single model running on a single NeuronCore.

The sequential inference test (for loop) does not measure all the performance one can achieve in an instance with multiple NeuronCores. To improve hardware utilization you can run parallel inference requests over multiple model workers, which you'll test in the Data Parallel Bonus Section below.

In [9]:
%%time
for _ in tqdm(range(100)):
    outputs = neuron_model(*(inputs['input_ids'],torch.tensor([]),inputs['attention_mask'])) 

100%|██████████| 100/100 [00:01<00:00, 96.58it/s]

CPU times: user 12.4 s, sys: 11 ms, total: 12.4 s
Wall time: 1.04 s
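
The sequential measurement above can be reduced to a small helper. The following is a minimal sketch, not part of the original notebook: `sequential_throughput` and `fake_infer` are hypothetical names, and `fake_infer` simply sleeps in place of a real `neuron_model(...)` call so the example runs without Inferentia hardware. It excludes a warm-up call (which would include the model load time) and shows that, for back-to-back calls, throughput is the reciprocal of the mean latency.

```python
import time


def sequential_throughput(infer, n_calls=100, warmup=1):
    """Return (calls per second, mean latency in seconds) for back-to-back calls."""
    # Warm-up calls are excluded from timing; on Inferentia the first call
    # also loads the model onto the device.
    for _ in range(warmup):
        infer()
    start = time.perf_counter()
    for _ in range(n_calls):
        infer()
    elapsed = time.perf_counter() - start
    return n_calls / elapsed, elapsed / n_calls


def fake_infer():
    # Hypothetical stand-in for neuron_model(...); sleeps for about 2 ms.
    time.sleep(0.002)


throughput, mean_latency = sequential_throughput(fake_infer, n_calls=50)
print(f"{throughput:.1f} calls/s, {mean_latency * 1000:.2f} ms per call")
```

With a single caller, the only way to raise throughput is to lower latency, which is why the later sections add concurrent requests.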





Save the compiled model for later use:

In [7]:
neuron_model.save('gpt2-neuron.pt')

## Compiling a BERT base model for 16 NeuronCores

Our next step is to compile the same model for all 16 NeuronCores available in the inf1.6xlarge and check the performance difference when running pipeline parallel inferences.

Prior to compiling and executing the model, use the following cell to restart your IPython Kernel. 

>__Note:__ If you run this notebook using Jupyter Notebooks, instead of Jupyterlab, you may need to restart the kernel using the "Kernel -> Restart" option, after running the next cell 

In [12]:
import IPython
# Automatically restarts kernel
IPython.Application.instance().kernel.do_shutdown(True) 

{'status': 'ok', 'restart': True}

After you get `{'status': 'ok', 'restart': True}`, reinstantiate your environment with the required libraries and the `BertTokenizer` and `BertModel` from 🤗Transformers.

In [2]:
import torch
import torch_neuron
from transformers import BertTokenizer, BertModel
from transformers import GPT2Tokenizer
from transformers import GPT2Model
from joblib import Parallel, delayed  
import numpy as np
from tqdm import tqdm

import os
import time 



tokenizer = GPT2Tokenizer.from_pretrained('gpt2')
tokenizer.pad_token = tokenizer.eos_token
model = GPT2Model.from_pretrained('gpt2',return_dict=False)

inputs = tokenizer("Hello, my dog is cute",return_tensors="pt",max_length=128,padding='max_length',truncation=True)


Some weights of GPT2Model were not initialized from the model checkpoint at gpt2 and are newly initialized: ['h.8.attn.masked_bias', 'h.0.attn.masked_bias', 'h.10.attn.masked_bias', 'h.6.attn.masked_bias', 'h.9.attn.masked_bias', 'h.7.attn.masked_bias', 'h.3.attn.masked_bias', 'h.5.attn.masked_bias', 'h.4.attn.masked_bias', 'h.1.attn.masked_bias', 'h.2.attn.masked_bias', 'h.11.attn.masked_bias']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


To enable pipeline mode during compilation, you only need to add the compiler flag `--neuroncore-pipeline-cores` and set the number of desired cores. The cell below sets a `neuroncore_pipeline_cores` value, which you can adjust to the number of NeuronCores available on the instance: _inf1.6xlarge_ has 16 NeuronCores in 4 Inferentia chips.


In [3]:
# Number of NeuronCores in the pipeline
neuroncore_pipeline_cores = 8  # Use 4 on an inf1.xlarge

# Compile with --neuroncore-pipeline-cores set to the value above
neuron_pipeline_model = torch.neuron.trace(model,
                                           example_inputs = (inputs['input_ids'],torch.tensor([]),inputs['attention_mask']),
                                           verbose=1,
                                           compiler_args = ['--neuroncore-pipeline-cores', str(neuroncore_pipeline_cores)]
                                          )

  if past_key_values is None or len(past_key_values)==0:
  assert batch_size > 0, "batch_size has to be defined and > 0"
  attn_weights = attn_weights / (float(value.size(-1)) ** 0.5)
  causal_mask = self.bias[:, :, key_length - query_length : key_length, :key_length].bool()
With rtol=1e-05 and atol=1e-05, found 1 element(s) (out of 98304) whose difference(s) exceeded the margin of error (including 0 nan comparisons). The greatest difference was 1.3686716556549072e-05 (0.11178360879421234 vs. 0.11179729551076889), which occurred at index (0, 85, 314).
  _module_class,
INFO:Neuron:There are 14 ops of 2 different types in the TorchScript that are not compiled by neuron-cc: aten::where, aten::embedding, (For more information see https://github.com/aws/aws-neuron-sdk/blob/master/release-notes/neuron-cc-ops/neuron-cc-ops-pytorch.md)
INFO:Neuron:Number of arithmetic operators (pre-compilation) before = 1182, fused = 1156, percent fused = 97.8%
INFO:Neuron:Compiler args type is <class 'list'>

Instructions for updating:
Use tf.where in 2.0, which has the same broadcast rule as np.where


INFO:Neuron:Compiler args type is <class 'list'> value is ['--neuroncore-pipeline-cores', '8']
INFO:Neuron:Compiling function _NeuronGraph$467 with neuron-cc
INFO:Neuron:Compiling with command line: '/home/ubuntu/anaconda3/envs/aws_neuron_pytorch_p36/bin/neuron-cc compile /tmp/tmptn876rd8/graph_def.pb --framework TENSORFLOW --pipeline compile SaveTemps --output /tmp/tmptn876rd8/graph_def.neff --io-config {"inputs": {"0:0": [[1, 128, 768], "float32"], "1:0": [[1, 128, 768], "float32"]}, "outputs": ["add:0", "split:2", "transpose_1:0", "truediv:0", "Cast_2:0"]} --neuroncore-pipeline-cores 8 --verbose 1'
INFO:Neuron:Compiler args type is <class 'list'> value is ['--neuroncore-pipeline-cores', '8']
INFO:Neuron:Compiling function _NeuronGraph$468 with neuron-cc
INFO:Neuron:Compiling with command line: '/home/ubuntu/anaconda3/envs/aws_neuron_pytorch_p36/bin/neuron-cc compile /tmp/tmp_w_06i7o/graph_def.pb --framework TENSORFLOW --pipeline compile SaveTemps --output /tmp/tmp_w_06i7o/graph_def.

With rtol=1e-05 and atol=1e-05, found 12 element(s) (out of 196608) whose difference(s) exceeded the margin of error (including 0 nan comparisons). The greatest difference was 1.9311904907226562e-05 (0.5990979671478271 vs. 0.5990786552429199), which occurred at index (0, 11, 2, 59).
  _module_class,
INFO:Neuron:Compiler args type is <class 'list'> value is ['--neuroncore-pipeline-cores', '8']
INFO:Neuron:Compiling function _NeuronGraph$471 with neuron-cc
INFO:Neuron:Compiling with command line: '/home/ubuntu/anaconda3/envs/aws_neuron_pytorch_p36/bin/neuron-cc compile /tmp/tmp79tp8lki/graph_def.pb --framework TENSORFLOW --pipeline compile SaveTemps --output /tmp/tmp79tp8lki/graph_def.neff --io-config {"inputs": {"0:0": [[1, 128, 768], "float32"], "1:0": [[1, 12, 128, 128], "float32"], "2:0": [[1, 1, 1, 128], "float32"], "3:0": [[1, 12, 128, 64], "float32"], "4:0": [[1, 128, 768], "float32"]}, "outputs": ["add_7:0", "3:0", "transpose:0", "split:2", "transpose_3:0", "truediv:0", "Cast_2:0

INFO:Neuron:Compiler args type is <class 'list'> value is ['--neuroncore-pipeline-cores', '8']
INFO:Neuron:Compiling function _NeuronGraph$474 with neuron-cc
INFO:Neuron:Compiling with command line: '/home/ubuntu/anaconda3/envs/aws_neuron_pytorch_p36/bin/neuron-cc compile /tmp/tmpaqpbkgvh/graph_def.pb --framework TENSORFLOW --pipeline compile SaveTemps --output /tmp/tmpaqpbkgvh/graph_def.neff --io-config {"inputs": {"0:0": [[1, 128, 768], "float32"], "1:0": [[1, 12, 128, 128], "float32"], "2:0": [[1, 1, 1, 128], "float32"], "3:0": [[1, 12, 128, 64], "float32"], "4:0": [[1, 128, 768], "float32"]}, "outputs": ["add_7:0", "3:0", "transpose:0", "split:2", "transpose_3:0", "truediv:0", "Cast_2:0"]} --neuroncore-pipeline-cores 8 --verbose 1'
INFO:Neuron:Compile command returned: 1
ERROR:Neuron:neuron-cc failed with the following command line call:
/home/ubuntu/anaconda3/envs/aws_neuron_pytorch_p36/bin/neuron-cc compile /tmp/tmpaqpbkgvh/graph_def.pb --framework TENSORFLOW --pipeline compile S

INFO:Neuron:Compile command returned: 1
ERROR:Neuron:neuron-cc failed with the following command line call:
/home/ubuntu/anaconda3/envs/aws_neuron_pytorch_p36/bin/neuron-cc compile /tmp/tmphlg0af17/graph_def.pb --framework TENSORFLOW --pipeline compile SaveTemps --output /tmp/tmphlg0af17/graph_def.neff --io-config '{"inputs": {"0:0": [[1, 128, 768], "float32"], "1:0": [[1, 12, 128, 128], "float32"], "2:0": [[1, 1, 1, 128], "float32"], "3:0": [[1, 12, 128, 64], "float32"], "4:0": [[1, 128, 768], "float32"]}, "outputs": ["add_7:0", "3:0", "transpose:0", "split:2", "transpose_3:0", "truediv:0", "Cast_2:0"]}' --neuroncore-pipeline-cores 8 --verbose 1
Traceback (most recent call last):
  File "/home/ubuntu/anaconda3/envs/aws_neuron_pytorch_p36/lib/python3.6/site-packages/torch_neuron/convert.py", line 391, in op_converter
    item, inputs, compiler_workdir=sg_workdir, **kwargs)
  File "/home/ubuntu/anaconda3/envs/aws_neuron_pytorch_p36/lib/python3.6/site-packages/torch_neuron/decorators.py"

INFO:Neuron:Number of arithmetic operators (post-compilation) before = 1182, compiled = 39, percent compiled = 3.3%
INFO:Neuron:The neuron partitioner created 14 sub-graphs
INFO:Neuron:Neuron successfully compiled 1 sub-graphs, Total fused subgraphs = 14, Percent of model sub-graphs successfully compiled = 7.1%
INFO:Neuron:Compiled these operators (and operator counts) to Neuron:
INFO:Neuron: => aten::Int: 10
INFO:Neuron: => aten::add: 1
INFO:Neuron: => aten::addmm: 1
INFO:Neuron: => aten::div: 1
INFO:Neuron: => aten::dropout: 1
INFO:Neuron: => aten::layer_norm: 1
INFO:Neuron: => aten::matmul: 1
INFO:Neuron: => aten::permute: 2
INFO:Neuron: => aten::size: 9
INFO:Neuron: => aten::slice: 4
INFO:Neuron: => aten::split: 1
INFO:Neuron: => aten::sub: 1
INFO:Neuron: => aten::to: 1
INFO:Neuron: => aten::transpose: 1
INFO:Neuron: => aten::view: 4
INFO:Neuron:Not compiled operators (and operator counts) to Neuron:
INFO:Neuron: => aten::Int: 272 [supported]
INFO:Neuron: => aten::ScalarImplicit: 1

## Running the BERT base model on 16 NeuronCores
Next, time one execution and check the latency of a single inference call over 16 cores. The first inference call also loads the model into Inferentia, so expect a large "wall time" the first time you run the next cell. Run the cell a second time to see the actual inference latency:

In [5]:
%%time
# The following line tests inference and should be executed on Inf1 instance family. 
outputs = neuron_pipeline_model(*(inputs['input_ids'],torch.tensor([]),inputs['attention_mask']))

CPU times: user 554 ms, sys: 15.5 ms, total: 570 ms
Wall time: 47.9 ms


Also check the throughput of the single model running over 16 NeuronCores.

The sequential inference test (for loop) does not measure all the performance one can achieve with Pipeline mode. As the inference runs in streaming fashion, at least 15 cores are waiting for a new call until the last one processes the first call. This results in low NeuronCore utilization. To improve hardware utilization you will require parallel inference requests, which you'll test in the next section.

In [6]:
for _ in tqdm(range(100)):
    outputs = neuron_pipeline_model(*(inputs['input_ids'],torch.tensor([]),inputs['attention_mask']))
    

100%|██████████| 100/100 [00:04<00:00, 21.63it/s]
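
The low utilization of a sequentially driven pipeline can be illustrated with a simple model. The sketch below is an idealization, not a description of the Neuron Runtime: it assumes one pipeline stage per NeuronCore, equal time per stage, and no transfer overhead. `pipeline_utilization` is a hypothetical helper introduced for illustration only.

```python
def pipeline_utilization(num_stages, in_flight):
    """Idealized fraction of pipeline stages that are busy at steady state.

    Assumes each in-flight request occupies exactly one stage at a time,
    so at most `num_stages` requests can make progress simultaneously.
    """
    if num_stages < 1:
        raise ValueError("num_stages must be at least 1")
    if in_flight < 0:
        raise ValueError("in_flight must not be negative")
    return min(in_flight, num_stages) / num_stages


# One request at a time keeps a single stage busy; more concurrent
# requests fill more stages until the pipeline is saturated.
for in_flight in (1, 4, 16):
    share = pipeline_utilization(16, in_flight)
    print(f"{in_flight:>2} in flight -> {share:.1%} of stages busy")
```

Under these assumptions a single sequential caller keeps only 1 of 16 stages busy, which matches the observation that 15 cores sit idle between calls.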


## Load Testing the Pipeline Parallel Mode

To put the 16 NeuronCores group to the test, a client has to run concurrent requests to the model. In this notebook setup you achieve it by creating a thread pool with `joblib.Parallel`, with every worker in the pool running one inference call at a time.

You can define a new method called `inference_latency()` to measure the amount of time each inference call takes.

In [10]:
def inference_latency(model,*inputs):
    """
    inference_latency is a simple method to return the latency of a model inference.
        
        Parameters:
            model: torch model object loaded using torch.jit.load
            inputs: model() args
        
        Returns:
            latency in seconds
    """
    start = time.time()
    _ = model(*inputs)
    return time.time() - start
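
To see how a thread pool drives `inference_latency()` concurrently without Inferentia hardware, here is a minimal sketch. It swaps in the standard library's `ThreadPoolExecutor` for `joblib.Parallel(prefer="threads")`, and `fake_model` and `load_test` are hypothetical names: `fake_model` sleeps for about 5 ms in place of a compiled Neuron model.

```python
import time
from concurrent.futures import ThreadPoolExecutor


def inference_latency(model, *inputs):
    """Return the latency, in seconds, of one model call."""
    start = time.time()
    _ = model(*inputs)
    return time.time() - start


def fake_model(x):
    # Hypothetical stand-in for a compiled model; sleeping releases the GIL,
    # much like a call that waits on an accelerator.
    time.sleep(0.005)
    return x


def load_test(model, n_requests, n_threads, *inputs):
    """Run n_requests calls over n_threads threads; return (latencies, throughput)."""
    start = time.perf_counter()
    with ThreadPoolExecutor(max_workers=n_threads) as pool:
        futures = [pool.submit(inference_latency, model, *inputs)
                   for _ in range(n_requests)]
        # Results are collected in submission order.
        latencies = [f.result() for f in futures]
    elapsed = time.perf_counter() - start
    return latencies, n_requests / elapsed


latencies, throughput = load_test(fake_model, 200, 8, 0)
print(f"{len(latencies)} calls, {throughput:.0f} calls/s")
```

Per-call latency stays near the sleep time while throughput scales with the number of threads, as long as the stand-in (or the device) can serve that many requests at once.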

Use `tqdm` to measure the total throughput of your experiment, with the nice side effect of a progress bar. The total throughput is expected to be high, so set your experiment range to a large number, here 30k inferences.

To calculate the latency statistics over the returned list of 30k latencies, use the `numpy.quantile()` method.

In [None]:
t = tqdm(range(30000), position=0, leave=True)
latency = Parallel(n_jobs=12,prefer="threads")(delayed(inference_latency)(neuron_pipeline_model,*(inputs['input_ids'],torch.tensor([]),inputs['attention_mask'])) for i in t)

p50 = np.quantile(latency[-10000:],0.50) * 1000
p95 = np.quantile(latency[-10000:],0.95) * 1000
p99 = np.quantile(latency[-10000:],0.99) * 1000
avg_throughput = t.total/t.format_dict['elapsed']
print(f'Avg Throughput: :{avg_throughput:.1f}')
print(f'50th Percentile Latency:{p50:.1f} ms')
print(f'95th Percentile Latency:{p95:.1f} ms')
print(f'99th Percentile Latency:{p99:.1f} ms')
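
For readers who want to see what the percentile calculation does, the following sketch reimplements the linear interpolation that `numpy.quantile` uses by default, in plain Python. The `quantile` helper and the sample latencies are made up for illustration and are not measurements from this tutorial.

```python
import math


def quantile(values, q):
    """Linear-interpolation quantile of a non-empty sequence, with 0 <= q <= 1."""
    if not values:
        raise ValueError("values must not be empty")
    if not 0.0 <= q <= 1.0:
        raise ValueError("q must be between 0 and 1")
    ordered = sorted(values)
    # Fractional position of the quantile within the sorted data.
    pos = q * (len(ordered) - 1)
    lower = math.floor(pos)
    upper = math.ceil(pos)
    frac = pos - lower
    # Interpolate between the two neighbouring samples.
    return ordered[lower] + (ordered[upper] - ordered[lower]) * frac


# Illustrative latencies in seconds, including one slow outlier.
latencies_s = [0.010, 0.012, 0.011, 0.050, 0.013]
p50_ms = quantile(latencies_s, 0.50) * 1000
p95_ms = quantile(latencies_s, 0.95) * 1000
print(f"p50 = {p50_ms:.1f} ms, p95 = {p95_ms:.1f} ms")
```

The median is barely affected by the outlier while the 95th percentile is pulled toward it, which is why tail percentiles are reported alongside the median. The `latency[-10000:]` slice in the cell above presumably restricts the statistics to the most recently submitted requests, after the initial warm-up.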

Save the compiled model for later use:

In [None]:
# Save the TorchScript graph
neuron_pipeline_model.save('gpt2-neuron-pipeline.pt')

## Bonus Section - Load Testing Data Parallel Mode

Prior to setting up a Data Parallel experiment using the model compiled for a single NeuronCore, run the following cell to restart your IPython Kernel. 

>__Note:__ If you run this notebook using Jupyter Notebooks, instead of Jupyterlab, you may need to restart the kernel using the "Kernel -> Restart" option, after running the next cell 

In [6]:
import IPython
# Automatically restarts kernel
IPython.Application.instance().kernel.do_shutdown(True) 

{'status': 'ok', 'restart': True}

After you get `{'status': 'ok', 'restart': True}`, reinstantiate your environment with the required libraries and just the tokenizer from 🤗 Transformers. You do not need to recompile the model because the `gpt2-neuron.pt` TorchScript graph saved earlier is available. Also, redefine the `inference_latency()` method.

In [1]:
import torch
import torch_neuron
from transformers import BertTokenizer 

from joblib import Parallel, delayed  
import numpy as np
from tqdm import tqdm
from transformers import GPT2Tokenizer
from transformers import GPT2Model
import os
import time 

def inference_latency(model,*inputs):
    """
    inference_latency is a simple method to return the latency of a model inference.
        
        Parameters:
            model: torch model object loaded using torch.jit.load
            inputs: model() args
        
        Returns:
            latency in seconds
    """
    start = time.time()
    _ = model(*inputs)
    return time.time() - start

tokenizer = GPT2Tokenizer.from_pretrained('gpt2')
tokenizer.pad_token = tokenizer.eos_token

inputs = tokenizer("Hello, my dog is cute",return_tensors="pt",max_length=128,padding='max_length',truncation=True)

You use the `'NEURONCORE_GROUP_SIZES'` environment variable to define NeuronCore groups that will each load a single model at runtime. Set the environment variable to the number of individual workers you want to test in parallel.

`torch_neuron` will load one model per NeuronCore group until it runs out of cores. At that point, if the Python process continues to spawn more model objects using `torch.jit.load`, `torch_neuron` will start stacking more than one model per core, until the Inferentia chip memory is full.

Inferentia is able to run inference over all the loaded models, but only one at a time. The Neuron Runtime takes care of dynamically switching the model context as requests come in, no extra worker process management required. Use 1 model per NeuronCore to achieve maximum performance.

The following cell creates a list with as many models as NeuronCore groups and executes a single dummy inference on each to load the models into Inferentia.

In [8]:
import warnings
# Number of data parallel workers
number_of_workers=8 # This number should be 4 on an inf1.xlarge

# Setting up a data parallel group
warnings.warn("NEURONCORE_GROUP_SIZES is being deprecated, if your application is using NEURONCORE_GROUP_SIZES please \
see https://awsdocs-neuron.readthedocs-hosted.com/en/latest/release-notes/deprecation.html#announcing-end-of-support-for-neuroncore-group-sizes \
for more details.", DeprecationWarning)
os.environ['NEURONCORE_GROUP_SIZES'] = ",".join(['1']*number_of_workers)

# Loading 'number_of_workers' amount of models in Python memory
model_list = [torch.jit.load('gpt2-neuron.pt') for _ in range(number_of_workers)]

# Dummy inference to load models to Inferentia
_ = [mod(*(inputs['input_ids'],torch.tensor([]),inputs['attention_mask'])) for mod in model_list]




Adapt the call to `joblib.Parallel()` iterating over a concatenated version of the `model_list`, to run 'round-robin' calls to each of the model workers.  
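
The "concatenated version" of the list relies on Python list repetition, which repeats references to the same objects in order rather than copying them. The sketch below uses plain `object()` instances as hypothetical stand-ins for loaded models to show the resulting submission order.

```python
from collections import Counter

# Hypothetical stand-ins for the loaded model workers.
workers = [object() for _ in range(4)]

# List repetition cycles through the same objects in order.
schedule = workers * 3
order = [workers.index(w) for w in schedule]
print(order)  # [0, 1, 2, 3, 0, 1, 2, 3, 0, 1, 2, 3]

# No new objects are created: the fifth entry is the first worker again.
assert schedule[4] is workers[0]

# Every worker receives the same number of requests.
print(Counter(order))
```

With 8 workers, `model_list*1500` therefore yields 12,000 requests without loading any additional models. Note that this fixes only the order in which requests are submitted; which thread picks up each request still depends on the thread pool.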

In [12]:
t = tqdm(model_list*1500,position=0, leave=True)
latency = Parallel(n_jobs=number_of_workers,prefer="threads")(delayed(inference_latency)(mod,*(inputs['input_ids'],torch.tensor([]),inputs['attention_mask'])) for mod in t)

p50 = np.quantile(latency[-10000:],0.50) * 1000
p95 = np.quantile(latency[-10000:],0.95) * 1000
p99 = np.quantile(latency[-10000:],0.99) * 1000
avg_throughput = t.total/t.format_dict['elapsed']
print(f'Avg Throughput: :{avg_throughput:.1f}')
print(f'50th Percentile Latency:{p50:.1f} ms')
print(f'95th Percentile Latency:{p95:.1f} ms')
print(f'99th Percentile Latency:{p99:.1f} ms')

100%|██████████| 12000/12000 [04:22<00:00, 45.64it/s]

Avg Throughput: :45.6
50th Percentile Latency:174.4 ms
95th Percentile Latency:247.8 ms
99th Percentile Latency:279.4 ms





For this model, despite the larger number of workers, the per-worker latency increases when running a single model per core, which in turn reduces the total throughput. 
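
The link between per-worker latency and total throughput can be checked with a back-of-the-envelope estimate: with a fixed number of requests in flight, throughput is roughly concurrency divided by latency. The sketch below assumes all 8 threads stay busy and uses the reported median latency (174.4 ms) as a proxy for the mean; `estimated_throughput` is a hypothetical helper.

```python
def estimated_throughput(concurrency, latency_s):
    """Estimate requests per second for a fixed number of in-flight requests."""
    if latency_s <= 0:
        raise ValueError("latency_s must be positive")
    return concurrency / latency_s


# 8 data parallel workers, median latency of 174.4 ms from the run above.
data_parallel = estimated_throughput(8, 0.1744)
print(f"{data_parallel:.1f} inferences/s")  # prints 45.9 inferences/s
```

The estimate is close to the measured average of 45.6 inferences per second, so the throughput of this run is explained almost entirely by the per-worker latency.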

This behavior may not repeat if the model memory footprint or the input payload size changes, for example with a batch size > 1. We encourage you to experiment with the data parallel and pipeline parallel modes to optimize your application performance.