Copyright 2021 NVIDIA Corporation. All Rights Reserved.

Licensed under the Apache License, Version 2.0 (the "License");
you may not use this file except in compliance with the License.
You may obtain a copy of the License at

    http://www.apache.org/licenses/LICENSE-2.0

Unless required by applicable law or agreed to in writing, software
distributed under the License is distributed on an "AS IS" BASIS,
WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
See the License for the specific language governing permissions and
limitations under the License.

<img src="http://developer.download.nvidia.com/compute/machine-learning/frameworks/nvidia_logo.png" style="width: 90px; float: center;">

# Triton Server

## Overview

This notebook shows how to deploy Triton Inference Server with an image classification model and compare inference performance on GPU and CPU.

## Setup <a class="anchor" id="Setup"></a>

To begin, check that the NVIDIA driver has been installed correctly. The `nvidia-smi` command should run and output information about the GPUs on your system:

In [None]:
!nvidia-smi

## Start the Triton Server

Let's start the Triton server in polling mode, in which Triton periodically checks the model repository and applies any changes it finds.

Run the Triton Inference Server in a separate terminal from the Jupyter Notebook:
```
tritonserver --model-repository=/tritonworkspace/models --model-control-mode=POLL
```

The above command should load the model from the model directory and print the log `successfully loaded 'inception_graphdef' version 1`. The Triton server listens on the following endpoints:

```
Port 8000    -> HTTP Service
Port 8001    -> GRPC Service
Port 8002    -> Metrics
```

We can test the status of the server connection by running the curl command `curl -v <IP of machine>:8000/v2/health/ready`, which should return `HTTP/1.1 200 OK`.

**NOTE:** In our case, Triton Server and this notebook run on the same machine, so the IP of the machine is `localhost`.

In [None]:
!curl -v localhost:8000/v2/health/ready
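
The same readiness check can be scripted from Python, which is handy when a later cell should only run once the server is up. The following is a minimal sketch using only the standard library; the helper names `health_url` and `is_ready` are hypothetical and not part of any Triton client package.

```python
import http.client
import urllib.request


def health_url(host: str = "localhost", port: int = 8000) -> str:
    """Build the HTTP readiness URL used by the curl command above."""
    return f"http://{host}:{port}/v2/health/ready"


def is_ready(url: str, timeout: float = 2.0) -> bool:
    """Return True only if the endpoint answers with HTTP 200."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as response:
            return response.status == 200
    except (OSError, http.client.HTTPException):
        # Refused connections, timeouts, and HTTP error statuses all land here.
        return False


# Reports "not ready" instead of raising when no server is listening.
print(health_url(), "->", "ready" if is_ready(health_url()) else "not ready")
```

Returning `False` on any connection problem keeps the check safe to call in a retry loop while the server is still loading models.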

## Check inference with image_client 

Let's look at the test data in the images folder.

In [None]:
from IPython.display import Image, display
listOfImageNames = ['/tritonworkspace/images/basketball.jpg',
                    '/tritonworkspace/images/football.jpg',
                    '/tritonworkspace/images/soccer_ball.jpg',
                    '/tritonworkspace/images/volleyball.jpg']

for imageName in listOfImageNames:
    display(Image(filename=imageName, width = 100, height = 50))

We will use the data from the above folder and run inference using the image_client that's included with the container. For each image, we ask the model for its top 3 classifications.

In [None]:
!python3 image_client.py -u localhost:8000 -m inception_graphdef -x 1 -s INCEPTION -c 3 /tritonworkspace/images
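
To make the `-c 3` option concrete, here is a small sketch of what "top 3 classifications" means: pick the three labels with the highest scores. The labels and scores below are made up for illustration and are not real model output, and `top_k` is a hypothetical helper, not the client's actual code.

```python
import heapq


def top_k(scores, labels, k=3):
    """Return the k (label, score) pairs with the highest scores."""
    return heapq.nlargest(k, zip(labels, scores), key=lambda pair: pair[1])


# Made-up scores for illustration only; not produced by the model.
labels = ["basketball", "football", "soccer ball", "volleyball", "tennis ball"]
scores = [0.05, 0.10, 0.70, 0.12, 0.03]

for label, score in top_k(scores, labels):
    print(f"{score:.2f}  {label}")
```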

Now let's batch all four images into a single request. With `-b 4`, the data for all 4 images is sent to the server in one HTTP request. This is the first example of how batching works in Triton.

In [None]:
!python3 image_client.py -u localhost:8000 -m inception_graphdef -x 1 -s INCEPTION -c 3 /tritonworkspace/images -b 4
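
The effect of `-b` on the number of requests can be sketched as simple chunking. This is only an illustration of the grouping step, under the assumption that one batch maps to one request; the real client also preprocesses each image before sending it. `make_batches` is a hypothetical helper.

```python
def make_batches(items, batch_size):
    """Split items into consecutive groups of at most batch_size elements."""
    if batch_size < 1:
        raise ValueError("batch_size must be at least 1")
    return [items[i:i + batch_size] for i in range(0, len(items), batch_size)]


image_paths = [
    "/tritonworkspace/images/basketball.jpg",
    "/tritonworkspace/images/football.jpg",
    "/tritonworkspace/images/soccer_ball.jpg",
    "/tritonworkspace/images/volleyball.jpg",
]

for batch_size in (1, 4):
    batches = make_batches(image_paths, batch_size)
    print(f"batch size {batch_size}: {len(batches)} request(s)")
```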

## Determine throughput and latency with Perf Analyzer <a class="anchor" id="PerfAnalyzer"></a>

Once the model is deployed for inference in Triton, we can measure its inference performance using `perf_analyzer`. The `perf_analyzer` application generates inference requests to the deployed model and measures the throughput and latency of those requests. For more information on the `perf_analyzer` utility, please refer to the [perf_analyzer documentation](https://github.com/triton-inference-server/server/blob/main/docs/perf_analyzer.md).
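
To clarify the two metrics, the sketch below computes a 95th-percentile latency and a throughput figure from a list of per-request latencies. It uses the nearest-rank percentile definition, which is an assumption made for illustration and not necessarily how `perf_analyzer` computes its numbers; the latency values are invented.

```python
import math


def percentile(values, pct):
    """Nearest-rank percentile: the smallest sample with pct% of samples at or below it."""
    if not values:
        raise ValueError("values must not be empty")
    ordered = sorted(values)
    rank = max(1, math.ceil(pct * len(ordered) / 100))
    return ordered[rank - 1]


def throughput(num_requests, window_seconds):
    """Completed requests per second over a measurement window."""
    return num_requests / window_seconds


# Invented latencies in milliseconds, including two slow outliers.
latencies_ms = [12, 15, 11, 14, 13, 40, 12, 13, 16, 12,
                14, 13, 15, 12, 11, 13, 14, 12, 13, 90]

print("p50 latency (ms):", percentile(latencies_ms, 50))
print("p95 latency (ms):", percentile(latencies_ms, 95))
print("throughput (infer/sec):", throughput(len(latencies_ms), 2.0))
```

The gap between the p50 and p95 values shows why a percentile is reported in addition to an average: a few slow requests dominate the tail.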

Now change the model's instance group from GPU to CPU in the Triton model `config.pbtxt`. We don't need to restart the Triton server here: because it runs in polling mode, it picks up the updated `config.pbtxt` automatically.

```
instance_group [
   {
     count: 1
     kind: KIND_CPU
   }
]
```
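Instead of editing `config.pbtxt` by hand for each experiment, the switch can be scripted. The following is a rough sketch that rewrites the `instance_group` block in the config text; `set_instance_group` is a hypothetical helper, the sample config is a trimmed stand-in for the real file, and the regular expression assumes the block contains no nested brackets (for example a `gpus: [ 0 ]` list).

```python
import re


def set_instance_group(config_text, kind, count=1):
    """Return config_text with its instance_group block replaced (or appended)."""
    if kind not in {"KIND_CPU", "KIND_GPU"}:
        # Only the two kinds used in this notebook are accepted here.
        raise ValueError(f"unsupported kind: {kind}")
    block = (
        "instance_group [\n"
        "   {\n"
        f"     count: {count}\n"
        f"     kind: {kind}\n"
        "   }\n"
        "]"
    )
    # Assumption: no nested [...] inside the existing instance_group block.
    pattern = re.compile(r"instance_group\s*\[.*?\]", re.DOTALL)
    if pattern.search(config_text):
        return pattern.sub(lambda _match: block, config_text, count=1)
    return config_text.rstrip("\n") + "\n" + block + "\n"


# Trimmed, hypothetical config used only to exercise the helper.
sample = (
    'name: "inception_graphdef"\n'
    "instance_group [\n"
    "   {\n"
    "     count: 1\n"
    "     kind: KIND_CPU\n"
    "   }\n"
    "]\n"
)

print(set_instance_group(sample, "KIND_GPU", count=2))
```

To apply it, read the model's `config.pbtxt`, pass the text through the helper, and write the result back; in polling mode the server then reloads the model on its own.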

In [None]:
!perf_analyzer -u localhost:8000  -m inception_graphdef --percentile=95 --concurrency-range=4 -b 1

Now change the config to move the model back to the GPU:

```
instance_group [
   {
     count: 1
     kind: KIND_GPU
   }
]
```

In [None]:
!perf_analyzer -u localhost:8000  -m inception_graphdef --percentile=95 --concurrency-range=4 -b 1

Now change the config to run 2 instances of the same model on the GPU:

```
instance_group [
   {
     count: 2
     kind: KIND_GPU
   }
]
```

In [None]:
!perf_analyzer -u localhost:8000  -m inception_graphdef --percentile=95 --concurrency-range=4 -b 1

#### Now add the dynamic batching block to the model config, then send requests with batch sizes 2 and 4 and observe the queue delay.

```
dynamic_batching {
   preferred_batch_size: [ 4, 8 ]
   max_queue_delay_microseconds: 2000
}
```
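To build intuition for these two settings, the sketch below simulates a heavily simplified batcher. It assumes a model instance is always free, so a batch is dispatched as soon as the queue reaches the smallest preferred size, or once the oldest queued request has waited longer than the maximum queue delay. This is an illustration only and not Triton's actual scheduler; the arrival times are invented.

```python
def simulate_dynamic_batching(arrivals_us, preferred_sizes, max_queue_delay_us):
    """Group sorted request arrival times (microseconds) into batches."""
    trigger = min(preferred_sizes)
    batches, queue = [], []
    for arrival in arrivals_us:
        # Flush first if the oldest queued request exceeded the delay budget.
        if queue and arrival - queue[0] > max_queue_delay_us:
            batches.append(queue)
            queue = []
        queue.append(arrival)
        # Dispatch immediately once a preferred batch size is reached.
        if len(queue) == trigger:
            batches.append(queue)
            queue = []
    if queue:
        batches.append(queue)  # Remaining requests leave when the delay expires.
    return batches


# Invented arrival times: a burst of four, a pair, then a straggler.
arrivals = [0, 100, 200, 300, 5000, 5100, 9000]
batches = simulate_dynamic_batching(arrivals, [4, 8], 2000)
print([len(batch) for batch in batches])
```

The burst fills a preferred batch right away, while the sparse requests wait out the delay and leave in smaller batches, which is the trade-off `max_queue_delay_microseconds` controls.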

#### Now use perf_analyzer to measure the performance again with different batch sizes and check the queue delay

In [None]:
!perf_analyzer -u localhost:8000  -m inception_graphdef --percentile=95 --concurrency-range=1:4 -b 4

Now let's repeat the measurement with batch size 2 and see how the queue delay changes.

In [None]:
!perf_analyzer -u localhost:8000 -m inception_graphdef --percentile=95 --concurrency-range=1:4 -b 2
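
When comparing several batch sizes, it can help to generate the `perf_analyzer` command lines programmatically. The sketch below only builds and prints the commands, using the same flags as the cells above; `perf_analyzer_cmd` is a hypothetical helper.

```python
import shlex


def perf_analyzer_cmd(model, batch_size, concurrency="1:4",
                      url="localhost:8000", percentile=95):
    """Build the perf_analyzer argument list used in this notebook."""
    return [
        "perf_analyzer",
        "-u", url,
        "-m", model,
        f"--percentile={percentile}",
        f"--concurrency-range={concurrency}",
        "-b", str(batch_size),
    ]


for batch_size in (2, 4):
    print(shlex.join(perf_analyzer_cmd("inception_graphdef", batch_size)))
```

Each list can be passed to `subprocess.run` on a machine where `perf_analyzer` is installed and the server is running.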

Finally, stop the Triton Server by pressing Ctrl + C in the terminal where it is running.