<div align="center"><a href="https://www.nvidia.com/en-us/deep-learning-ai/education/"><img src="./assets/DLI_Header.png"></a></div>

# Deploying a Model for Inference at Production Scale

## 01 - Getting Started

-------

**Table of Contents**

* [Introduction](#introduction)
* [Triton Inference Server](#triton)
* [Setup](#setup)
* [Conclusion](#conclusion)


<a id="introduction"></a>
### Introduction

In this notebook, we introduce Triton Inference Server and do some light setup for our lab.

<a id="triton"></a>
### Triton Inference Server

NVIDIA Triton Inference Server simplifies the deployment of AI models at scale in production. Triton is open-source inference-serving software that lets teams deploy trained AI models from any framework, loaded from local storage or from Google Cloud Platform or Azure, on any GPU- or CPU-based infrastructure (cloud, data center, or edge). You can get started with Triton by pulling the container from the NVIDIA NGC catalog, the hub for GPU-optimized deep learning and machine learning software that accelerates development-to-deployment workflows.

The figure below shows the Triton Inference Server high-level architecture. The model repository is a file-system-based repository of the models that Triton will make available for inferencing. Inference requests arrive at the server via HTTP/REST, gRPC, or the C API and are then routed to the appropriate per-model scheduler. Triton implements multiple scheduling and batching algorithms that can be configured on a model-by-model basis. Each model's scheduler optionally batches inference requests and then passes them to the backend corresponding to the model type. The backend performs inferencing on the inputs provided in the batched requests to produce the requested outputs, which are then returned to the client.

<img src="./assets/A-schematic-of-Triton-Server-architecture.png" alt="A Schematic of Triton Inference Server" style="width: 600px;"/>
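
The request flow described above (route to a per-model queue, optionally batch, hand the batch to a backend) can be illustrated with a toy model. This is only a sketch of the idea, not Triton's implementation: `ToyServer`, `Request`, and the lambda "backends" are hypothetical names invented for this example.

```python
from collections import defaultdict, deque
from dataclasses import dataclass
from typing import Callable, Deque, Dict, List


@dataclass
class Request:
    model: str      # name of the model this request targets
    payload: float  # stand-in for an input tensor


class ToyServer:
    """Toy per-model queues with simple batching (illustrative, not Triton code)."""

    def __init__(
        self,
        backends: Dict[str, Callable[[List[float]], List[float]]],
        max_batch_size: int = 4,
    ) -> None:
        self.backends = backends
        self.max_batch_size = max_batch_size
        self.queues: Dict[str, Deque[Request]] = defaultdict(deque)

    def submit(self, request: Request) -> None:
        # Route the request to the queue of the model it names.
        if request.model not in self.backends:
            raise KeyError(f"unknown model: {request.model}")
        self.queues[request.model].append(request)

    def step(self, model: str) -> List[float]:
        # Take up to max_batch_size queued requests and run the backend once.
        queue = self.queues[model]
        batch = [queue.popleft() for _ in range(min(self.max_batch_size, len(queue)))]
        if not batch:
            return []
        return self.backends[model]([request.payload for request in batch])


server = ToyServer(
    {
        "double": lambda xs: [2 * x for x in xs],
        "negate": lambda xs: [-x for x in xs],
    },
    max_batch_size=2,
)
for value in (1.0, 2.0, 3.0):
    server.submit(Request("double", value))
server.submit(Request("negate", 5.0))

first = server.step("double")   # batch of two requests
second = server.step("double")  # remaining single request
other = server.step("negate")   # separate queue, separate backend
print(first, second, other)
```

Each model has its own queue, so a backlog for one model does not change how requests for another model are batched.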

<a id="setup"></a>
### Setup

First, let's check what GPUs we have on our system:

In [1]:
!nvidia-smi

Tue Oct 31 17:21:49 2023       
+---------------------------------------------------------------------------------------+
| NVIDIA-SMI 535.54.04              Driver Version: 536.23       CUDA Version: 12.2     |
|-----------------------------------------+----------------------+----------------------+
| GPU  Name                 Persistence-M | Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp   Perf          Pwr:Usage/Cap |         Memory-Usage | GPU-Util  Compute M. |
|                                         |                      |               MIG M. |
|   0  NVIDIA TITAN RTX               On  | 00000000:02:00.0  On |                  N/A |
| 41%   37C    P8              22W / 280W |    904MiB / 24576MiB |      0%      Default |
|                                         |                      |                  N/A |
+-----------------------------------------+----------------------+----------------------+
                                                                    

We see our system has 1 GPU, an NVIDIA TITAN RTX with 24576 MiB of memory.
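
If you would rather read the GPU inventory from Python than from the table, `nvidia-smi` can emit CSV through its `--query-gpu` and `--format` options. The sketch below is one way to wrap that; the helper names `parse_gpu_csv` and `list_gpus` are made up for this example, and `list_gpus` simply returns an empty list when `nvidia-smi` is unavailable.

```python
import shutil
import subprocess
from typing import Dict, List

# Ask nvidia-smi for one CSV row per GPU: index, name, total memory.
QUERY = ["nvidia-smi", "--query-gpu=index,name,memory.total", "--format=csv,noheader"]


def parse_gpu_csv(text: str) -> List[Dict[str, str]]:
    """Parse rows such as '0, NVIDIA TITAN RTX, 24576 MiB' into dictionaries."""
    gpus = []
    for line in text.strip().splitlines():
        index, name, memory = [field.strip() for field in line.split(",")]
        gpus.append({"index": index, "name": name, "memory_total": memory})
    return gpus


def list_gpus() -> List[Dict[str, str]]:
    """Return the GPUs reported by nvidia-smi, or [] if it cannot be run."""
    if shutil.which("nvidia-smi") is None:
        return []
    try:
        result = subprocess.run(QUERY, capture_output=True, text=True, timeout=10)
    except (OSError, subprocess.SubprocessError):
        return []
    if result.returncode != 0:
        return []
    return parse_gpu_csv(result.stdout)


# Row built from the values shown in the nvidia-smi table above.
sample = "0, NVIDIA TITAN RTX, 24576 MiB\n"
parsed_sample = parse_gpu_csv(sample)
print(parsed_sample)
print(list_gpus())
```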

Additionally, let's examine our file system:

In [2]:
!ls -alh

total 160K
drwxrwxrwx 1 renan renan  512 Oct 31 13:05 .
drwxrwxrwx 1 renan renan  512 Oct 31 16:06 ..
drwxrwxrwx 1 renan renan  512 Oct 31 13:03 .ipynb_checkpoints
-rwxrwxrwx 1 renan renan 4.4K Dec 14  2021 00_jupyterlab.ipynb
-rwxrwxrwx 1 renan renan 6.7K Dec 14  2021 01_Getting_Started.ipynb
-rwxrwxrwx 1 renan renan  28K Oct 31 13:05 02_Simple_PyTorch_Model.ipynb
-rwxrwxrwx 1 renan renan  15K Dec 14  2021 03_HuggingFace_NLP_Model.ipynb
-rwxrwxrwx 1 renan renan  17K Dec 14  2021 04_Simple_TensorFlow_Model.ipynb
-rwxrwxrwx 1 renan renan  23K Dec 14  2021 05_Simple_TensorRT_Model.ipynb
-rwxrwxrwx 1 renan renan  26K Nov 24  2021 06_Advanced_Inference.ipynb
-rwxrwxrwx 1 renan renan 8.5K Dec 14  2021 07_Metrics.ipynb
drwxrwxrwx 1 renan renan  512 Oct 30 21:12 assets
-rwxrwxrwx 1 renan renan  14K Nov 18  2021 imagenet-simple-labels.json
drwxrwxrwx 1 renan renan  512 Oct 28 21:26 models


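We see several folders and Jupyter notebooks. We'll visit each of these notebooks later in the lab. Lastly, let's check which version of the CUDA toolkit we're working with. We can see from the output below that we're working with CUDA 12.3. Note that the `nvidia-smi` output above reports CUDA 12.2; that value is the highest CUDA version the installed driver supports, not the toolkit version.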

In [3]:
!nvcc --version

nvcc: NVIDIA (R) Cuda compiler driver
Copyright (c) 2005-2023 NVIDIA Corporation
Built on Fri_Sep__8_19:17:24_PDT_2023
Cuda compilation tools, release 12.3, V12.3.52
Build cuda_12.3.r12.3/compiler.33281558_0
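
When a script needs the toolkit version rather than a human reading it, the release number can be pulled out of the `nvcc --version` text with a regular expression. This is a minimal sketch that assumes the output keeps the `release X.Y` wording shown above; `parse_nvcc_release` is a hypothetical helper name.

```python
import re
from typing import Optional


def parse_nvcc_release(text: str) -> Optional[str]:
    """Return the 'X.Y' release string from nvcc --version output, if present."""
    match = re.search(r"release (\d+\.\d+)", text)
    return match.group(1) if match else None


# Line copied from the nvcc output above.
sample = "Cuda compilation tools, release 12.3, V12.3.52"
toolkit_version = parse_nvcc_release(sample)
print(toolkit_version)
```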


**Server**

In this lab, we already have a Triton Inference Server instance running. The command used to start a Triton server instance is shown below. More details can be found in the quickstart and build instructions:

* [Quickstart Documentation](https://github.com/triton-inference-server/server/blob/r20.12/docs/quickstart.md)
* [Build Documentation](https://github.com/triton-inference-server/server/blob/r20.12/docs/build.md)

```
docker run \
  --gpus=1 \
  --ipc=host --rm \
  --shm-size=1g \
  --ulimit memlock=-1 \
  --ulimit stack=67108864 \
  -p 8000:8000 -p 8001:8001 -p 8002:8002 \
  -v /models:/models \
  nvcr.io/nvidia/tritonserver:20.12-py3 \
  tritonserver \
  --model-repository=/models \
  --exit-on-error=false \
  --model-control-mode=poll \
  --repository-poll-secs 30
```

The Triton Inference Server container can be found on NGC: https://ngc.nvidia.com/catalog/containers/nvidia:tritonserver
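
In the command above, the three published ports are Triton's defaults: 8000 for HTTP/REST, 8001 for gRPC, and 8002 for metrics. Once the container is up, a quick way to confirm the server is accepting requests is to call the readiness endpoint of the HTTP/REST API, `/v2/health/ready`. The sketch below uses only the Python standard library; the `localhost:8000` address is an assumption that matches the port mapping above, and `health_url` and `is_ready` are hypothetical helper names.

```python
import urllib.error
import urllib.request


def health_url(base_url: str) -> str:
    """Build the readiness URL for a server base URL."""
    return base_url.rstrip("/") + "/v2/health/ready"


def is_ready(base_url: str = "http://localhost:8000", timeout: float = 2.0) -> bool:
    """Return True if the readiness endpoint answers with HTTP 200."""
    try:
        with urllib.request.urlopen(health_url(base_url), timeout=timeout) as response:
            return response.status == 200
    except (urllib.error.URLError, OSError):
        # Connection refused, timeout, or a non-2xx status all count as not ready.
        return False


print("Triton ready:", is_ready())
```

If this prints `False`, check that the container is running and that the port mapping matches the address used here.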

**Client**

We've also installed the Triton Inference Server client libraries, which provide APIs that make it easy to communicate with Triton from your C++ or Python application. Using these libraries, you can send either HTTP/REST or gRPC requests to Triton to access all of its capabilities: inferencing, status and health, statistics and metrics, model repository management, and so on. These libraries also support using system and CUDA shared memory for passing inputs to and receiving outputs from Triton. Examples are available for both the C++ and Python libraries.

The easiest way to get the Python client library is to use pip to install the `tritonclient` module, as detailed below. For more details on how to download or build the Triton Inference Server Client libraries, you can find the documentation here: https://github.com/triton-inference-server/server/blob/r20.12/docs/client_libraries.md

```
pip install nvidia-pyindex
pip install "tritonclient[all]"
```
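
After installation, a short liveness and readiness check is a reasonable first test of the Python client. The sketch below uses `tritonclient.http.InferenceServerClient` with its `is_server_live()` and `is_server_ready()` methods. The `localhost:8000` address is an assumption about where the server is listening, and `check_server` is a hypothetical helper that returns `None` if the library is missing or the server cannot be reached.

```python
from typing import Dict, Optional


def check_server(url: str = "localhost:8000") -> Optional[Dict[str, bool]]:
    """Return live/ready flags from Triton, or None if the check cannot run."""
    try:
        import tritonclient.http as httpclient
    except ImportError:
        # tritonclient is not installed in this environment.
        return None
    try:
        # The HTTP client takes host:port without a scheme.
        client = httpclient.InferenceServerClient(url=url)
        return {"live": client.is_server_live(), "ready": client.is_server_ready()}
    except Exception:
        # Most likely the server is unreachable at the assumed address.
        return None


status = check_server()
print(status)
```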


<a id="conclusion"></a>
### Conclusion

In this notebook, we introduced Triton Inference Server and did some light setup for our lab. Feel free to move on to the next notebook!
