# Convert model for triton server

This notebook shows an example of how to convert a model so that it can be used by the triton server.

Start by loading the model you want to use. We call this the local model because it is stored on the LPC nodes and runs locally. Here, we are using a ParticleNet model, which uses PyTorch as a backend.

In [None]:
from models.ParticleNet import ParticleNetTagger
import torch

# load in local model
local_model = ParticleNetTagger(5, 2,
                        [(16, (64, 64, 64)), (16, (128, 128, 128)), (16, (256, 256, 256))],
                        [(256, 0.1)],
                        use_fusion=True,
                        use_fts_bn=False,
                        use_counts=True,
                        for_inference=False)

LOCAL_PATH = "/srv/models/pn_demo.pt"
local_model.load_state_dict(torch.load(LOCAL_PATH, map_location=torch.device('cpu')))
local_model.eval()

To upload the model to the triton server, it must first be converted to TorchScript with `torch.jit`, and then some configuration files must be created. Examples of the structure of the triton files can be found [here](https://github.com/fastmachinelearning/sonic-models/tree/master/models/particlenet), and more on the specifics of configuration can be found [here](https://github.com/triton-inference-server/server/blob/main/docs/user_guide/model_configuration.md).

In [None]:
# convert model with jit and save to file
jit_model = torch.jit.script(local_model)
JIT_PATH = '/srv/models/jit_model_demo.pt'
torch.jit.save(jit_model, JIT_PATH)

Examples of the configuration files needed by the triton server:
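
As a rough illustration, the sketch below writes a model repository layout and a `config.pbtxt` into a temporary directory. It is not the configuration used on the demo server: the input and output names are taken from the client code later in this notebook, the dimensions are inferred from the test inputs, and `max_batch_size` and the helper name `write_model_repository` are assumptions made for this example. `pytorch_libtorch` is the platform name triton uses for TorchScript models, and `model.pt` is the default file name it looks for inside the version directory.

```python
from pathlib import Path
import tempfile

MODEL_NAME = "pn_demo"  # model name taken from the server URL used in this demo

# Assumed configuration: dims exclude the batch dimension, and -1 marks the
# variable number of tracks per jet. max_batch_size is a placeholder value.
CONFIG_PBTXT = """\
name: "pn_demo"
platform: "pytorch_libtorch"
max_batch_size: 64
input [
  {
    name: "points__0"
    data_type: TYPE_FP32
    dims: [ 2, -1 ]
  },
  {
    name: "features__1"
    data_type: TYPE_FP32
    dims: [ 5, -1 ]
  },
  {
    name: "mask__2"
    data_type: TYPE_FP32
    dims: [ 1, -1 ]
  }
]
output [
  {
    name: "softmax__0"
    data_type: TYPE_FP32
    dims: [ 2 ]
  }
]
"""


def write_model_repository(repo_root: Path, version: str = "1") -> Path:
    """Create <repo_root>/<model>/<version>/ and write <model>/config.pbtxt."""
    model_dir = repo_root / MODEL_NAME
    (model_dir / version).mkdir(parents=True, exist_ok=True)
    config_path = model_dir / "config.pbtxt"
    config_path.write_text(CONFIG_PBTXT)
    return config_path


repo_root = Path(tempfile.mkdtemp())
config_path = write_model_repository(repo_root)
# The TorchScript file saved above would then be copied to
# <repo_root>/pn_demo/1/model.pt
print(config_path.relative_to(repo_root))
```

The version directory (`1`) matches the trailing `/1` in the server URL used by the client.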

Now the jitted torch model can be uploaded to the triton server at the multi-user facility. This demo hosts the server at [EAF](https://indico.cern.ch/event/903719/contributions/3803524/attachments/2013546/3364991/Elastic_AF_-_Fermilab.pdf) but the server path "triton+grpc://triton.apps.okddev.fnal.gov:443/MODEL_NAME/1" will need to be changed to the proper path used by your triton server.

Next, we will create a client that connects to the triton server. The client is written as a class so that the triton model can be called in the same way as the local model. This class is modeled off of the code here.

In [None]:
import tritonclient.grpc as triton_grpc
import numpy as np

class wrapped_triton:
  def __init__(self, model_url: str) -> None:
    # model_url has the form "triton+<protocol>://<address>/<model>/<version>"
    fullprotocol, location = model_url.split("://")
    _, protocol = fullprotocol.split("+")
    address, model, version = location.split("/")

    self._protocol = protocol
    self._address = address
    self._model = model
    self._version = version

    # create the client; this wrapper only handles the grpc protocol
    if self._protocol == "grpc":
      self._client = triton_grpc.InferenceServerClient(url=self._address,
                                                       verbose=False,
                                                       ssl=True)
      self._triton_protocol = triton_grpc
    else:
      raise ValueError(
          f"{self._protocol} is not a supported protocol (only grpc is handled)")

  def __call__(self, input_dict) -> np.ndarray:
    '''
    Run inference of model on triton server
    '''

    # wrap each numpy array in an InferInput; all inputs are sent as float32
    inputs = []
    for key, array in input_dict.items():
      infer_input = self._triton_protocol.InferInput(key, array.shape, "FP32")
      infer_input.set_data_from_numpy(array)
      inputs.append(infer_input)

    output = self._triton_protocol.InferRequestedOutput("softmax__0")

    # make request to server for inference
    request = self._client.infer(self._model,
                                 model_version=self._version,
                                 inputs=inputs,
                                 outputs=[output],
                                 )
    out = request.as_numpy("softmax__0")

    return out
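
The constructor above splits the URL without checking its shape, so a malformed URL fails with an unpacking error that does not say what was expected. The sketch below does the same parsing on its own and reports the expected format. `TritonEndpoint` and `parse_triton_url` are hypothetical names introduced for this example and are not part of `tritonclient`.

```python
from typing import NamedTuple


class TritonEndpoint(NamedTuple):
    protocol: str
    address: str
    model: str
    version: str


def parse_triton_url(model_url: str) -> TritonEndpoint:
    """Split 'triton+<protocol>://<address>/<model>/<version>' into its parts."""
    try:
        fullprotocol, location = model_url.split("://")
        _, protocol = fullprotocol.split("+")
        address, model, version = location.split("/")
    except ValueError as err:
        # Raised when any split yields the wrong number of pieces.
        raise ValueError(
            "expected 'triton+<protocol>://<address>/<model>/<version>', "
            f"got {model_url!r}"
        ) from err
    return TritonEndpoint(protocol, address, model, version)


endpoint = parse_triton_url("triton+grpc://triton.fnal.gov:443/pn_demo/1")
print(endpoint)
```

Parsing the URL in a separate function makes it possible to test the URL format without opening a connection to the server.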

Now we will create an instance of the triton version of the ParticleNet model. This instance will point towards the triton server hosted on EAF and inference will be called in the same way as with the local torch model.

In [None]:
triton_model = wrapped_triton("triton+grpc://triton.fnal.gov:443/pn_demo/1")

We can now test both the local and triton models and see if they return the same output.

In [None]:
# create 5 random jets with 100 tracks each
test_inputs = {'points': np.random.rand(5,2,100).astype(np.float32),
               'features': np.random.rand(5,5,100).astype(np.float32),
               'mask': np.ones((5,1,100),dtype=np.float32)}

# slightly different inputs for each model: the local model takes positional
# tensors, while the triton model takes a dict keyed by "<name>__<index>"
test_inputs_local = []
test_inputs_triton = {}
for c, k in enumerate(test_inputs):
    test_inputs_local.append(torch.from_numpy(test_inputs[k]))
    test_inputs_triton[f'{k}__{c}'] = test_inputs[k]

In [None]:
local_model(*test_inputs_local)

In [None]:
triton_model(test_inputs_triton)

The results match! Woohoo!
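
Instead of comparing the printed arrays by eye, the comparison can be made numerically. The sketch below is one way to do that with `np.allclose`. The helper name `outputs_match` and the tolerances are assumptions chosen for illustration, and the arrays are stand-ins with the (5 jets, 2 classes) shape of this demo rather than real model outputs.

```python
import numpy as np


def outputs_match(local_out, triton_out, rtol=1e-4, atol=1e-5):
    """Return True if both outputs have the same shape and agree within tolerance."""
    local_arr = np.asarray(local_out, dtype=np.float32)
    triton_arr = np.asarray(triton_out, dtype=np.float32)
    if local_arr.shape != triton_arr.shape:
        return False
    return bool(np.allclose(local_arr, triton_arr, rtol=rtol, atol=atol))


# Stand-in arrays: a fake local output and a copy with a tiny float32 offset.
rng = np.random.default_rng(0)
fake_local = rng.random((5, 2), dtype=np.float32)
fake_triton = fake_local + np.float32(1e-7)
print(outputs_match(fake_local, fake_triton))
```

With the real models, the local output is a torch tensor, so it would be converted first, for example `outputs_match(local_model(*test_inputs_local).detach().cpu().numpy(), triton_model(test_inputs_triton))`.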