# Timing benchmark

This notebook compares the inference time of a local PyTorch model with that of the same model hosted on a Triton inference server. The model we are working with is an example of the ParticleNet model described [here](https://cms-ml.github.io/documentation/inference/particlenet.html).

To test your own model, you will first need to prepare it for the inference server. An example of the model conversion can be found in the `model_conversion.ipynb` notebook.

Let's load the two taggers we will compare (edit the cells below to switch models):

In [1]:
import math
import numpy as np
import awkward as ak
import time

In [2]:
from models.ParticleNet import ParticleNetTagger
import torch

# load in local model
local_model = ParticleNetTagger(5, 2,
                        [(16, (64, 64, 64)), (16, (128, 128, 128)), (16, (256, 256, 256))],
                        [(256, 0.1)],
                        use_fusion=True,
                        use_fts_bn=False,
                        use_counts=True,
                        for_inference=False)

LOCAL_PATH = "/srv/models/pn_demo.pt"
local_model.load_state_dict(torch.load(LOCAL_PATH, map_location=torch.device('cpu')))
local_model.eval()

ParticleNetTagger(
  (conv): FeatureConv(
    (conv): Sequential(
      (0): BatchNorm1d(5, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
      (1): Conv1d(5, 32, kernel_size=(1,), stride=(1,), bias=False)
      (2): BatchNorm1d(32, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
      (3): ReLU()
    )
  )
  (pn): ParticleNet(
    (edge_convs): ModuleList(
      (0): EdgeConvBlock(
        (convs): ModuleList(
          (0): Conv2d(64, 64, kernel_size=(1, 1), stride=(1, 1), bias=False)
          (1): Conv2d(64, 64, kernel_size=(1, 1), stride=(1, 1), bias=False)
          (2): Conv2d(64, 64, kernel_size=(1, 1), stride=(1, 1), bias=False)
        )
        (bns): ModuleList(
          (0): BatchNorm2d(64, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
          (1): BatchNorm2d(64, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
          (2): BatchNorm2d(64, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True

In [3]:
from utils.tritonutils import wrapped_triton

# create instance of triton model
triton_model = wrapped_triton( "triton+grpc://triton.fnal.gov:443/pn_demo/1")

We will double-check that the outputs of the local and Triton models agree to 5 decimal places before moving on.

In [5]:
# create 5 random jets with 100 tracks each
test_inputs = {'points': np.random.rand(5,2,100).astype(np.float32),
               'features': np.random.rand(5,5,100).astype(np.float32),
               'mask': np.ones((5,1,100),dtype=np.float32)}

# slightly different input formats for each model
test_inputs_local = []
test_inputs_triton = {}
c = 0
for k in test_inputs.keys():
    test_inputs_local.append(torch.from_numpy(test_inputs[k]))
    test_inputs_triton[f'{k}__{c}'] = test_inputs[k]
    c += 1

with torch.no_grad():
    local_output = local_model(*test_inputs_local).detach().numpy()
triton_output = triton_model(test_inputs_triton, 'softmax__0')
np.testing.assert_almost_equal(local_output, triton_output, decimal=5, err_msg='Outputs do NOT match')
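
For reference, the NumPy documentation describes `decimal=5` in `np.testing.assert_almost_equal` as requiring `abs(desired - actual) < 1.5 * 10**(-5)` for every element. The sketch below restates that criterion in plain Python so that it runs without the model or the server. The helper name `max_abs_diff` and the score values are made up for illustration and are not real model outputs.

```python
def max_abs_diff(a, b):
    """Largest element-wise absolute difference between two equally shaped nested lists."""
    if isinstance(a, (list, tuple)):
        if len(a) != len(b):
            raise ValueError("shape mismatch")
        return max((max_abs_diff(x, y) for x, y in zip(a, b)), default=0.0)
    return abs(a - b)

# hypothetical softmax outputs for two jets (illustrative values only)
local_scores = [[0.912345, 0.087655], [0.250001, 0.749999]]
triton_scores = [[0.912349, 0.087651], [0.250004, 0.749996]]

decimal = 5
tolerance = 1.5 * 10 ** (-decimal)  # criterion as stated in the NumPy docs
diff = max_abs_diff(local_scores, triton_scores)
outputs_match = diff < tolerance
print(f"max |local - triton| = {diff:.2e}, tolerance = {tolerance:.1e}, match = {outputs_match}")
```

Printing the largest difference alongside the tolerance can make it easier to judge how close a failing comparison was.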

Next, let's create a much larger sample of data to test the timing of the two model versions. We will use the [awkward array](https://awkward-array.org/doc/main/) structure to hold the inputs because it is easier to adapt to both the local and Triton models when batching.

In [6]:
# create 10000 random jets with 100 tracks each
test_inputs = {'points': np.random.rand(10000,2,100).astype(np.float32),
               'features': np.random.rand(10000,5,100).astype(np.float32),
               'mask': np.ones((10000,1,100),dtype=np.float32)}

test_inputs_ak = ak.Array(test_inputs)

The inputs (jets in the ParticleNet case) are batched as they are processed. The batch size should be determined based on what is most efficient for the current model being used. For this demo model, a batch size of 1024 is used but this variable can be changed if desired.
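
One way to choose a batch size is to time the same workload at a few candidate values. The following is a minimal sketch of such a scan, assuming a stand-in function `fake_inference` (hypothetical, not part of this notebook) in place of a real model call, and toy placeholder data in place of jets.

```python
import time

def fake_inference(batch):
    """Hypothetical stand-in for a model call: some arithmetic per jet."""
    return [sum(x * x for x in jet) for jet in batch]

def time_batches(data, batch_size, infer):
    """Return the wall-clock time needed to push all of `data` through `infer`."""
    start = time.perf_counter()
    for i in range(0, len(data), batch_size):
        infer(data[i:i + batch_size])
    return time.perf_counter() - start

# 2000 toy "jets" with 50 numbers each (placeholder data)
jets = [[float(j % 7)] * 50 for j in range(2000)]

timings = {bs: time_batches(jets, bs, fake_inference) for bs in (64, 256, 1024)}
for bs, seconds in timings.items():
    print(f"batch_size={bs:5d}: {seconds:.4f} s, {len(jets) / seconds:.0f} jets/s")
```

With a real model the callable passed as `infer` would wrap the local or Triton inference call. Repeating each measurement a few times helps to average out noise.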

To test the timing differences between the two models, we will collect the time that has passed for each new batch of data and compare. Here is a function that will take in the full dataset, then batch and run either the local or triton model for inference.

In [7]:
def process_jets(in_jets, batch_size=1024, use_triton=False):
    
    print('Running triton server inference' if use_triton else 'Running local inference')
    
    # define variables to track processing time
    njets = []
    t = []
    t_begin = time.time()
    
    # loop through input data batches and run inference on each batch
    for ii in range(0, len(in_jets), batch_size):
        print('%i/%i jets processed, processing next batch'%(ii,len(in_jets)))

        # get a batch of data
        # get a batch of data; slicing past the end is clipped, so the last batch may be smaller
        jets_eval = in_jets[ii:ii + batch_size]
        njets.append(min(ii + batch_size, len(in_jets)))

        ## structure inputs slightly differently and run inference depending on model
        # triton model
        if use_triton:
            X = {}
            c = 0
            for k in jets_eval.fields:
                X[f'{k}__{c}'] = ak.to_numpy(jets_eval[k])
                c += 1
                
            # triton inference
            outputs = triton_model(X, 'softmax__0')
                
        # local model   
        else:
            X = []
            for k in jets_eval.fields:
                X.append(torch.from_numpy(ak.to_numpy(jets_eval[k])))
                
            # local inference
            with torch.no_grad():
                outputs = local_model(*X).detach().numpy()

        t.append(time.time()-t_begin)
        
    print('Total time elapsed = %f sec'%t[-1])

    return njets, t
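
The number of jets need not be a multiple of the batch size, so the final batch is usually shorter than the rest. The sketch below illustrates the index arithmetic on its own, using a hypothetical helper `batch_ranges` that is not part of this notebook.

```python
def batch_ranges(n_items, batch_size):
    """Yield (start, stop) index pairs that cover range(n_items) in order."""
    for start in range(0, n_items, batch_size):
        yield start, min(start + batch_size, n_items)

ranges = list(batch_ranges(10000, 1024))
print(len(ranges))   # number of batches: 10
print(ranges[-1])    # last batch is shorter: (9216, 10000)

# slicing a list past its end is clipped rather than raising an error
items = list(range(10))
print(items[8:8 + 4])  # [8, 9]
```

Because out-of-range slices are clipped rather than rejected, the true number of processed jets has to be computed explicitly, for example with `min`.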
    

In [None]:
local_njets, local_t = process_jets(test_inputs_ak, use_triton=False, batch_size=1024)

Running local inference
0/10000 jets processed, processing next batch
1024/10000 jets processed, processing next batch
2048/10000 jets processed, processing next batch
3072/10000 jets processed, processing next batch
4096/10000 jets processed, processing next batch
5120/10000 jets processed, processing next batch
6144/10000 jets processed, processing next batch
7168/10000 jets processed, processing next batch
8192/10000 jets processed, processing next batch


In [8]:
triton_njets, triton_t = process_jets(test_inputs_ak, use_triton=True, batch_size=1024)

Running triton server inference
0/10000 jets processed, processing next batch
1024/10000 jets processed, processing next batch
2048/10000 jets processed, processing next batch
3072/10000 jets processed, processing next batch
4096/10000 jets processed, processing next batch
5120/10000 jets processed, processing next batch
6144/10000 jets processed, processing next batch
7168/10000 jets processed, processing next batch
8192/10000 jets processed, processing next batch
9216/10000 jets processed, processing next batch
Total time elapsed = 6.103302 sec
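
The lists returned by `process_jets` are cumulative: each entry is the number of jets and the time elapsed since the start of the run. To look at individual batches, the differences between consecutive entries can be taken. Here is a minimal sketch, assuming a hypothetical helper `per_batch_stats` and made-up measurements rather than the results above.

```python
def per_batch_stats(njets, t):
    """Convert cumulative counts/times into per-batch (n_jets, seconds, jets_per_second)."""
    stats = []
    prev_n, prev_t = 0, 0.0
    for n, elapsed in zip(njets, t):
        batch_n = n - prev_n
        batch_t = elapsed - prev_t
        stats.append((batch_n, batch_t, batch_n / batch_t))
        prev_n, prev_t = n, elapsed
    return stats

# made-up cumulative measurements for three batches (not real benchmark results)
example_njets = [1024, 2048, 2500]
example_t = [0.50, 1.00, 1.25]

for batch_n, batch_t, rate in per_batch_stats(example_njets, example_t):
    print(f"{batch_n:5d} jets in {batch_t:.2f} s -> {rate:.0f} jets/s")
```

Per-batch numbers make it easier to spot a slow first batch, which can happen when a model or connection needs to warm up.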


Now we can plot some of the results and compare them between the two inference methods. We will use matplotlib as our plotting tool.

In [None]:
import matplotlib.pyplot as plt
from matplotlib import gridspec
import mplhep as hep
plt.style.use(hep.style.CMS)

Next, let's take a look at the time you gain when using triton as opposed to a local model.

In [None]:
fig = plt.figure()
# set height ratios for subplots
gs = gridspec.GridSpec(2, 1, height_ratios=[3, 1])

# the first subplot
ax0 = plt.subplot(gs[0])
# log scale for axis Y of the first subplot
ax0.set_yscale("log")
line0, = ax0.plot(local_njets, local_t, color='r')
line1, = ax0.plot(triton_njets, triton_t, color='b')

# the second subplot
# shared axis X
ax1 = plt.subplot(gs[1], sharex = ax0)
line2, = ax1.plot(local_njets, np.array(local_t)/np.array(triton_t), color='black', linestyle='--')
plt.setp(ax0.get_xticklabels(), visible=False)
# remove last tick label for the second subplot
yticks = ax1.yaxis.get_major_ticks()
yticks[-1].label1.set_visible(False)

# put legend on first subplot
ax0.legend((line0, line1), ('local model', 'triton model'), loc='lower left')

ax0.set_ylabel('time elapsed (s)')
ax1.set_ylabel('$t_{local}/t_{triton}$')
ax1.set_xlabel('# jets processed')

# remove vertical gap between subplots
plt.subplots_adjust(hspace=.0)
# resize this figure directly; changing rcParams here would only affect figures created later
fig.set_size_inches(7, 6)
plt.show()
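
The ratio panel divides the two timing lists element by element, which is only meaningful when both runs used the same batch boundaries and therefore produced lists of equal length. A minimal sketch of that calculation with an explicit length check follows. The helper `speedup` and the timing values are hypothetical and are not results from this notebook.

```python
def speedup(local_t, triton_t):
    """Element-wise ratio t_local / t_triton for two equally long timing lists."""
    if len(local_t) != len(triton_t):
        raise ValueError("both runs must use the same number of batches")
    return [lt / tt for lt, tt in zip(local_t, triton_t)]

# made-up cumulative times in seconds (not real benchmark results)
example_local_t = [2.0, 4.0, 6.0]
example_triton_t = [1.0, 1.6, 2.4]

ratios = speedup(example_local_t, example_triton_t)
print([round(r, 2) for r in ratios])
```

A ratio above 1 means the local run took longer than the Triton run up to that point.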