<a href="https://www.nvidia.com/dli"> <img src="images/DLI_Header.png" alt="Header" style="width: 400px;"/> </a>

# 4.0 Model Serving

In this notebook, you'll deploy an ONNX model to Triton Inference Server and run inference on it.

**[4.1 Learning Objectives](#4.1-Learning-Objectives)<br>**
**[4.2 Set Up Triton Server](#4.2-Set-Up-Triton-Server)<br>**
**[4.3 Query Model](#4.3-Query-Model)<br>**
**[4.4 Run Inference](#4.4-Run-Inference)<br>**
**[4.5 Conclusion](#4.5-Conclusion)<br>**

---
## 4.1 Learning Objectives

Centralized model serving can be a significant design win for your products and applications. Hosting models in a central location reduces memory usage, since a model is loaded once rather than by every application that needs it, and the serving setup can be designed to reduce inter-device communication.
This example uses [NVIDIA's Triton Inference Server](https://developer.nvidia.com/nvidia-triton-inference-server) to serve the model exported in the previous section.

>Triton Inference Server, part of the NVIDIA AI platform, streamlines and standardizes AI inference by enabling teams to deploy, run, and scale trained AI models from any framework on any GPU- or CPU-based infrastructure. It provides AI researchers and data scientists the freedom to choose the right framework for their projects without impacting production deployment. It also helps developers deliver high-performance inference across cloud, on-prem, edge, and embedded devices.



<center>
    <video controls 
           src="https://dli-lms.s3.amazonaws.com/assets/s-ov-10-v1/DLI_part_7.mp4"
           type="video/mp4"
           width=800>
    </video>
</center>

---
## 4.2 Set Up Triton Server

In [None]:
!mkdir -p /opt/model_repository/ # create the folder for the models

In [None]:
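# Run Triton Inference Server as a background process
# --model-control-mode=poll: Triton periodically rescans the repository for changes
# --http-port=8988: moves the HTTP endpoint; the gRPC endpoint stays on its default port (8001)
import subprocess
server = subprocess.Popen(["tritonserver", "--model-repository=/opt/model_repository/", "--model-control-mode=poll", "--http-port=8988"])
server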

One nice feature of Triton is the [ability to have it "poll" a model repository](https://github.com/triton-inference-server/server/blob/main/docs/user_guide/model_management.md#model-control-mode-poll) to see if a change has occurred. All we need to do is copy the model into the `model_repository` directory, using the folder layout Triton expects. You can read more about the layout [here](https://github.com/triton-inference-server/server/blob/main/docs/user_guide/model_repository.md#repository-layout).
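
The layout Triton looks for is `<repository>/<model name>/<version>/<model file>`. As a minimal sketch of that structure, the snippet below builds the same tree in a temporary directory rather than in `/opt/model_repository/`. The helper name `add_model_version` and the placeholder file are illustrative assumptions, not part of Triton.

```python
import shutil
import tempfile
from pathlib import Path


def add_model_version(repo_root, model_name, version, model_file):
    """Copy model_file to <repo_root>/<model_name>/<version>/model.onnx."""
    version_dir = Path(repo_root) / model_name / str(version)
    version_dir.mkdir(parents=True, exist_ok=True)
    target = version_dir / "model.onnx"
    shutil.copy(model_file, target)
    return target


# Demonstrate on a throwaway directory with a placeholder file
# (the bytes below are NOT a real ONNX model).
demo_root = Path(tempfile.mkdtemp())
placeholder = demo_root / "placeholder.onnx"
placeholder.write_bytes(b"not a real model")

demo_repo = demo_root / "model_repository"
copied = add_model_version(demo_repo, "our_new_model", 1, placeholder)

# List everything under the repository, relative to its root.
layout = sorted(p.relative_to(demo_repo).as_posix() for p in demo_repo.rglob("*"))
print(layout)
```

The cells that follow create the same tree step by step with shell commands.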

In [None]:
!mkdir -p /opt/model_repository/our_new_model/ # create the folder with the model name

In [None]:
!mkdir -p /opt/model_repository/our_new_model/1/ # create the folder for version 1 of the model

Finally, we copy our model into the repository. An example ONNX model is provided and is copied in the next cell. If you would rather use the model you created, comment out the first line and uncomment the second.

In [None]:
!cp /dli/task/data/model.onnx /opt/model_repository/our_new_model/1/model.onnx # copy the provided model into the version folder
#!cp /dli/task/data/custom_model.onnx /opt/model_repository/our_new_model/1/model.onnx # Comment out the line above and run this one if you want to use your own ONNX model

---
## 4.3 Query Model

Now that the model is in the repository, Triton has automatically generated a model config and loaded the model. Let's query it to find out more!

_Note: if executing this cell results in an error, this may be because the polling has not "found" the model yet.  Wait a few seconds and run the cell again._
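
Instead of re-running the cell by hand, the wait can be automated. The sketch below is a generic retry helper; `wait_until` and the stand-in check `fake_ready` are hypothetical names used only for illustration, and the timeout values are arbitrary.

```python
import time


def wait_until(check, timeout_s=30.0, interval_s=1.0):
    """Call check() repeatedly until it returns True or timeout_s elapses."""
    deadline = time.monotonic() + timeout_s
    while True:
        try:
            if check():
                return True
        except Exception:
            # Treat errors (for example, the server not answering yet) as "not ready".
            pass
        if time.monotonic() >= deadline:
            return False
        time.sleep(interval_s)


# Stand-in check that succeeds on its third call, to show the retry behaviour.
attempts = {"count": 0}


def fake_ready():
    attempts["count"] += 1
    return attempts["count"] >= 3


ready = wait_until(fake_ready, timeout_s=5.0, interval_s=0.01)
print(ready, attempts["count"])
```

Once `triton_client` and `model_name` are defined in the cell below, a real check could be `wait_until(lambda: triton_client.is_model_ready(model_name))`.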

In [None]:
import tritonclient.grpc as grpcclient

inference_server_url = "localhost:8001"
triton_client = grpcclient.InferenceServerClient(url=inference_server_url)

# find out info about model
model_name = "our_new_model"
triton_client.get_model_config(model_name)

You can also create a custom config to control other parameters like batch size or maximum number of requests.
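
A custom config is a `config.pbtxt` text file placed in the model's folder (here that would be `/opt/model_repository/our_new_model/config.pbtxt`). As a rough sketch, the helper below renders a minimal one as a string. The function name `render_config` and the values `8` and `100` are illustrative assumptions. Whether batching is appropriate depends on how the model was exported, and a complete config may need more fields than shown here.

```python
def render_config(name, backend, max_batch_size, max_queue_delay_us=None):
    """Return the text of a minimal config.pbtxt (illustrative, not exhaustive)."""
    lines = [
        f'name: "{name}"',
        f'backend: "{backend}"',
        f"max_batch_size: {max_batch_size}",
    ]
    if max_queue_delay_us is not None:
        # dynamic_batching asks Triton to group individual requests into batches,
        # waiting up to the given delay for more requests to arrive.
        lines.append(
            f"dynamic_batching {{ max_queue_delay_microseconds: {max_queue_delay_us} }}"
        )
    return "\n".join(lines) + "\n"


# Example values only; tune them for your own model and workload.
config_text = render_config("our_new_model", "onnxruntime", 8, max_queue_delay_us=100)
print(config_text)
```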

For now, though, we'll keep the automatically generated config and run inference with the model!



---
## 4.4 Run Inference

The config above lists the input and output names of the model. Let's use this information to run inference.
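
The names can also be read programmatically rather than copied by eye. The sketch below assumes the config is available as a dictionary with `input` and `output` lists under a `config` key, which is the shape we would expect from `triton_client.get_model_config(model_name, as_json=True)`. The helper `io_names` and the hand-written `sample` dictionary are illustrative, not real server output.

```python
def io_names(config_json):
    """Return (input_names, output_names) from a model-config dictionary."""
    # Accept either the full response ({"config": {...}}) or the inner config.
    config = config_json.get("config", config_json)
    inputs = [entry["name"] for entry in config.get("input", [])]
    outputs = [entry["name"] for entry in config.get("output", [])]
    return inputs, outputs


# Hand-written stand-in for a server response (structure is an assumption).
sample = {
    "config": {
        "name": "our_new_model",
        "input": [{"name": "input", "data_type": "TYPE_FP32"}],
        "output": [{"name": "boxes"}, {"name": "labels"}, {"name": "scores"}],
    }
}

input_names, output_names = io_names(sample)
print(input_names, output_names)
```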

In [None]:
from tritonclient.utils import triton_to_np_dtype
import cv2
import numpy as np
from matplotlib import pyplot as plt

# load image data
target_width, target_height = 1024, 1024
image_bgr = cv2.imread("sample_image.png")
image_bgr = cv2.resize(image_bgr, (target_width, target_height))
image_rgb = cv2.cvtColor(image_bgr, cv2.COLOR_BGR2RGB)
image = np.float32(image_rgb)

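# preprocessing
image = image / 255  # scale pixel values to the range [0, 1]
image = np.moveaxis(image, -1, 0)  # HWC to CHW
image = image[np.newaxis, :]  # add batch dimension
image = np.float32(image)  # make sure the data is FP32, matching the input type below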

plt.imshow(image_rgb)

# create input
input_name = "input"
inputs = [grpcclient.InferInput(input_name, image.shape, "FP32")]
inputs[0].set_data_from_numpy(image)

output_names = ["boxes", "labels", "scores"]
outputs = [grpcclient.InferRequestedOutput(n) for n in output_names]


results = triton_client.infer(model_name, inputs, outputs=outputs)

boxes, labels, scores = [results.as_numpy(o) for o in output_names]

In [None]:
# annotate
annotated_image = image_bgr.copy()
score_threshold = 0.2  # only draw detections scoring above this value
border_color = (50, 0, 100)  # BGR, as expected by OpenCV

if boxes.size > 0:  # ensure something is found
    for box, lab, scr in zip(boxes, labels, scores):
        if scr > score_threshold:
            box_top_left = int(box[0]), int(box[1])
            box_bottom_right = int(box[2]), int(box[3])

            # bounding box
            cv2.rectangle(annotated_image, box_top_left, box_bottom_right, border_color, thickness=5,
                          lineType=cv2.LINE_8)

plt.imshow(cv2.cvtColor(annotated_image, cv2.COLOR_BGR2RGB))

In [None]:
# Shut down the Triton server
! kill $(pidof tritonserver)

---
## 4.5 Conclusion

You've completed the course - great job! <br>
Once you are ready, conclude by watching the final video.


<center>
    <video controls 
           src="https://dli-lms.s3.amazonaws.com/assets/s-ov-10-v1/DLI_part_8.mp4"
           type="video/mp4"
           width=800>
    </video>
</center>

---
<h2 style="color:green;">Congratulations!</h2>

In this final notebook, you have:
- Deployed an ONNX model to Triton Inference Server
- Used Triton to run inference with your model
