<center><img src="https://upload.wikimedia.org/wikipedia/en/thumb/6/6d/Nvidia_image_logo.svg/1200px-Nvidia_image_logo.svg.png" width="250"></center>

In [None]:
# Copyright (c) 2020 NVIDIA

# Permission is hereby granted, free of charge, to any person obtaining a copy
# of this software and associated documentation files (the "Software"), to deal
# in the Software without restriction, including without limitation the rights
# to use, copy, modify, merge, publish, distribute, sublicense, and/or sell
# copies of the Software, and to permit persons to whom the Software is
# furnished to do so, subject to the following conditions:

# The above copyright notice and this permission notice shall be included in all
# copies or substantial portions of the Software.

# THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
# IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
# FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE
# AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
# LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM,
# OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE
# SOFTWARE.

# Object Detection w/ TensorRT

This notebook will illustrate the full pipeline of taking a TensorFlow model for object detection (in this case an SSD model), creating a TensorRT engine, and running inference.

These are the kinds of results we are aiming for as we work through this notebook:

|   Framework for Inference  | Inference Time (ms) | Frames Per Second (fps) | Speedup |
|----------------------------|:-------------------:|:-----------------------:|:-------:|
| TensorFlow                 |         276         |          3.6            |    1x   |
| TensorRT (FP32)            |          48         |          20.8           |   5.8x  |
| TensorRT (FP16)            |          38         |          26.3           |   7.3x  |
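
The last two columns of the table follow directly from the inference time. As a small sketch (the function names are only illustrative), frames per second is `1000 / ms` and the speedup is the TensorFlow time divided by the time of the framework in question:

```python
# Inference times in milliseconds, copied from the table above.
inference_ms = {
    "TensorFlow": 276,
    "TensorRT (FP32)": 48,
    "TensorRT (FP16)": 38,
}

baseline_ms = inference_ms["TensorFlow"]

def frames_per_second(ms):
    """One frame takes `ms` milliseconds, so 1000 / ms frames fit in a second."""
    return 1000.0 / ms

def speedup(ms, baseline=baseline_ms):
    """How many times faster a run is than the TensorFlow baseline."""
    return baseline / ms

for name, ms in inference_ms.items():
    print("{:<16} {:>4} ms  {:5.1f} fps  {:4.1f}x".format(
        name, ms, frames_per_second(ms), speedup(ms)))
```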

## Setting MaxN mode

Before we do anything else, we want to put the Jetson Nano in MaxN mode.  In this mode the clock frequencies are allowed to reach their highest values, which gives us the lowest inference times.

To do this, we select the power mode with the `nvpmodel` command (`-m 0` selects MaxN) and then query it with `-q` to confirm the change.

In [None]:
!echo nvidia | sudo -S nvpmodel -m 0

In [None]:
!echo nvidia | sudo -S nvpmodel -q

Then we use the `jetson_clocks` script to lock the clocks at their maximum frequency.  This lets us get the most performance out of the device.

In [None]:
!echo nvidia | sudo -S jetson_clocks

## Deep Learning Architecture

There are many deep learning architectures for object detection (and for other computer vision tasks such as segmentation).  Some are designed to maximize accuracy, while others are designed for speed.  Because we are targeting object detection on embedded devices (namely, the Jetson Nano), we want a network that was designed for speed yet remains accurate enough to be useful.  For this reason we have chosen the Single-Shot MultiBox Detector (SSD); YOLO, another single-shot network, would also be a reasonable choice.  Other architectures (Faster R-CNN, for example) use a two-stage pipeline: the first stage generates region proposals and the second stage evaluates each proposal.  SSD and YOLO do all of this in a single pass.  To learn more about the SSD architecture, you can read the [paper](https://arxiv.org/abs/1512.02325).  The network architecture is shown below:

<center><img src="https://cdn-images-1.medium.com/fit/t/1600/480/1*hdSE1UCV7gA7jzfQ03EnWw.png" alt="SSD Architecture" width="1000"/></center>
<center>Image credit: https://cdn-images-1.medium.com/fit/t/1600/480/1*hdSE1UCV7gA7jzfQ03EnWw.png</center>

For this lab, we will use the Single-Shot MultiBox Detector (SSD) architecture as our starting point.  The image above shows SSD with a VGG16 backbone network (feature extractor).  We will instead use a MobileNet V2 backbone, which offers a better balance of accuracy and speed on an embedded device.
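
To make "single-shot" a little more concrete, the sketch below counts how many candidate boxes the network scores in one forward pass. The feature map sizes are the ones passed to the anchor generator later in this notebook. The number of anchors per cell is an assumption (3 on the first layer and 6 on the others, as in the stock TensorFlow Object Detection API SSD configurations), so treat the total as illustrative.

```python
# Feature map sizes (height == width) of the six SSD prediction layers.
# These match the featureMapShapes given to the anchor generator later on.
feature_map_sizes = [19, 10, 5, 3, 2, 1]

# ASSUMPTION: 3 anchors per cell on the first layer and 6 on the others,
# as in the stock TensorFlow Object Detection API SSD configurations.
anchors_per_cell = [3, 6, 6, 6, 6, 6]

def count_anchors(sizes, per_cell):
    """Sum the anchors over every cell of every feature map."""
    return sum(size * size * n for size, n in zip(sizes, per_cell))

total_anchors = count_anchors(feature_map_sizes, anchors_per_cell)
print("Candidate boxes scored in a single pass:", total_anchors)
```

Every one of these boxes gets class scores and box offsets from the same forward pass, which is why no separate proposal stage is needed.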

Let's get started...

## Setup

Let's go ahead and import all the necessary packages that we will use throughout this notebook.

In [None]:
import tensorflow as tf
import tensorrt as trt
import graphsurgeon as gs
import uff

import os
import ctypes
import time
import sys
import shutil
import tarfile

import numpy as np
from PIL import Image
from IPython.display import Image as read_img
from IPython.display import display as disp_img

import utils.inference as inference_utils
import utils.model as model_utils
import utils.boxes as boxes_utils
import utils.coco as coco_utils
import utils.engine as engine_utils
from utils.paths import PATHS
import common

print('TensorFlow version:', tf.__version__)
print('TensorRT version:', trt.__version__)

We can also set up a few variables that will make our life easier later in the notebook.  Some of the important ones include:

- Paths to models, intermediate file formats (uff), and TensorRT engines
- TensorRT parameters (precision, batch_size, etc.)
- Visualization information (thresholding, images, etc.)

In [None]:
# COCO label list
COCO_LABELS = coco_utils.COCO_CLASSES_LIST

USER = os.environ.get('USER')

# Confidence threshold for drawing bounding box
VISUALIZATION_THRESHOLD = 0.5

# Precision command line argument -> TRT Engine datatype
TRT_PRECISION_TO_DATATYPE = {
    16: trt.DataType.HALF,
    32: trt.DataType.FLOAT
}

# Layout of TensorRT network output metadata
TRT_PREDICTION_LAYOUT = {
    "image_id": 0,
    "label": 1,
    "confidence": 2,
    "xmin": 3,
    "ymin": 4,
    "xmax": 5,
    "ymax": 6
}

# Define model variables
PRECISION = 16
os.environ["PRECISION"] = str(PRECISION)
MAX_BATCH_SIZE, BATCH_SIZE = 1, 1

# Define path variables for all models
#MODEL_NAME = 'ssd_inception_v2_coco_2017_11_17'
MODEL_NAME = 'ssd_mobilenet_v2_coco_2018_03_29'
MODEL_PATH = os.path.join('/home', USER, 'jetson-training/models')
ENGINE_PATH = os.path.join('/home', USER, 'jetson-training/engines')
TF_FROZEN_MODEL_PATH = os.path.join(MODEL_PATH, MODEL_NAME, 'frozen_inference_graph.pb')
SAVED_MODEL_PATH = os.path.join(MODEL_PATH, MODEL_NAME, 'saved_model')
UFF_MODEL_PATH = os.path.join(MODEL_PATH, MODEL_NAME, 'frozen_inference_graph.uff')
TRT_ENGINE_PATH = os.path.join(ENGINE_PATH, 'ssd_mobilenet_v2_fp16.engine') if PRECISION == 16 else os.path.join(ENGINE_PATH, 'ssd_mobilenet_v2_fp32.engine')
TFTRT_MODEL_PATH = os.path.join(MODEL_PATH, MODEL_NAME, 'ssd_mobilenet_v2_tftrt_fp16') if PRECISION == 16 else os.path.join(MODEL_PATH, MODEL_NAME, 'ssd_mobilenet_v2_tftrt_fp32')

# Define data for testing the models
IMAGE_PATH = os.path.join('/home', USER, 'jetson-training/images')
INPUT_IMG = os.path.join(IMAGE_PATH, 'dogs.jpg')
OUTPUT_IMG = os.path.join(IMAGE_PATH, 'ssd_output.jpg')

In [None]:
print('TensorFlow model path: {}'.format(TF_FROZEN_MODEL_PATH))
print('UFF model path: {}'.format(UFF_MODEL_PATH))
print('TensorRT engine path: {}'.format(TRT_ENGINE_PATH))

print('\nUsing precision mode {} ({}-bit float) for building engine and inference'.format(TRT_PRECISION_TO_DATATYPE[PRECISION], PRECISION))

print('\nInput image for inference')
disp_img(read_img(filename=INPUT_IMG))

## Inference

Now that we have everything set up, we can start by benchmarking the TensorFlow model.

### TensorFlow Inference

First we want to run our original model, which was taken from the [TensorFlow Object Detection API](https://github.com/tensorflow/models/tree/master/research/object_detection) [Model Zoo](https://github.com/tensorflow/models/blob/master/research/object_detection/g3doc/detection_model_zoo.md).

This gives us a baseline against which to compare the TensorRT-optimized FP32 and FP16 models once we generate them.

#### Helper Functions

These functions provide us the ability to load an image and then convert that image into a numpy array (which will then be processed by the network).

First we can define a little bit about the model so that we can use it later.

In [None]:
class ModelData(object):
    # Name of input node
    INPUT_NAME = "Input"
    # CHW format of model input
    INPUT_SHAPE = (3, 300, 300)
    # Name of output node
    OUTPUT_NAME = "NMS"

    @staticmethod
    def get_input_channels():
        return ModelData.INPUT_SHAPE[0]

    @staticmethod
    def get_input_height():
        return ModelData.INPUT_SHAPE[1]

    @staticmethod
    def get_input_width():
        return ModelData.INPUT_SHAPE[2]

Next, we can define a couple of helper functions that will allow us to load an image and convert it to a numpy array easily for inference with TensorFlow and TensorRT.
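
Before looking at the NumPy version, it may help to see the TensorRT preprocessing written out by hand. The sketch below is a dependency-free illustration (the function name and the tiny test image are made up): it moves the channel axis to the front (HWC to CHW) and rescales pixel values from `[0, 255]` to `[-1.0, 1.0]` with the same `(2.0 / 255.0) * x - 1.0` formula used in the helper.

```python
def hwc_to_chw_normalized(pixels_hwc):
    """Convert a nested list [H][W][C] of 0..255 values to [C][H][W] in [-1, 1]."""
    height = len(pixels_hwc)
    width = len(pixels_hwc[0])
    channels = len(pixels_hwc[0][0])
    return [
        [
            [(2.0 / 255.0) * pixels_hwc[y][x][c] - 1.0 for x in range(width)]
            for y in range(height)
        ]
        for c in range(channels)
    ]

# A tiny 1x2 "image" with three channels per pixel (illustrative values).
tiny_image = [[[0, 255, 0], [255, 0, 255]]]
chw = hwc_to_chw_normalized(tiny_image)

# The layout is now channels first: 3 channels, 1 row, 2 columns.
print(len(chw), len(chw[0]), len(chw[0][0]))
print(chw[0])
```

The TensorFlow branch skips both steps because the frozen graph performs its own preprocessing; it only needs a leading batch dimension.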

In [None]:
def load_image_into_numpy_array(image):
    (im_width, im_height) = image.size
    return np.array(image).reshape(
        (im_height, im_width, ModelData.get_input_channels())
    ).astype(np.uint8)

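def load_img(image_path, fw='tf'):
    """Loads an image and preprocesses it for the given framework.
    Args:
         image_path (str): image to run classification/object detection model on
         fw (str): either 'tf' for tensorflow or 'trt' for tensorrt
    Returns:
         np.ndarray: uint8 batch of one HWC image for 'tf'; flattened CHW
             float array normalized to [-1.0, 1.0] for 'trt'
    """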
    # Load image into CPU
    image = Image.open(image_path)
    
    # Get model input width and height
    model_input_width = ModelData.get_input_width()
    model_input_height = ModelData.get_input_height()
    # Note: Bilinear interpolation used by Pillow is a little bit
    # different than the one used by Tensorflow, so if network receives
    # an image that is not 300x300, the network output may differ
    # from the one output by Tensorflow
    image_resized = image.resize(
        size=(model_input_width, model_input_height),
        resample=Image.BILINEAR
    )
    
    img_np = load_image_into_numpy_array(image_resized)
    if fw == 'tf':
        img_np = np.expand_dims(img_np, axis=0)
    else:
        img_np = img_np.transpose((2, 0, 1))
        # Normalize to [-1.0, 1.0] interval (expected by model)
        img_np = (2.0 / 255.0) * img_np - 1.0
        img_np = img_np.ravel()
    
    return img_np

Now let's download the model that we will use for the rest of this notebook.  In this case, we will use the SSD MobileNet V2 network.

In [None]:
if not os.path.exists(os.path.join(MODEL_PATH, MODEL_NAME)):
    if not os.path.exists(MODEL_PATH):
        os.mkdir(MODEL_PATH)
    print("Preparing pre-trained SSD model")
    model_dir = os.path.join(MODEL_PATH, MODEL_NAME)
    model_url = PATHS.get_model_url(MODEL_NAME)
    model_archive_path = os.path.join(MODEL_PATH, "{}.tar.gz".format(MODEL_NAME))
    model_utils.download_file(model_url, model_archive_path, False)
    print("Download complete\nUnpacking {}".format(model_archive_path))
    with tarfile.open(model_archive_path, "r:gz") as tar:
        tar.extractall(path=os.path.join(MODEL_PATH))
    print("Extracting complete\nRemoving {}".format(model_archive_path))
    os.remove(model_archive_path)
    print("Model ready")
else:
    print("Path already exists, no need to re-download and prepare the model")

Next we want to set up a TensorFlow session that will:

* Create input/output tensors
* Load the input image (and convert to numpy array)
* Start an inference timer
* Run inference using our TensorFlow model

The benchmarking numbers that are achieved from this inference execution will be used for our comparison later to TensorRT results.
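
The timing pattern used for this benchmark can be sketched independently of TensorFlow. The helper below is a hypothetical, minimal version: it times a callable several times, discards the first runs as warm-up (the first calls usually include one-time initialization costs), and reports the mean of the rest. The workload here is a stand-in for the `sess.run(...)` call.

```python
import statistics
import time

def benchmark(fn, iterations=10, warmup=2):
    """Time fn() `iterations` times and discard the first `warmup` runs."""
    times_ms = []
    for index in range(iterations):
        start = time.perf_counter()
        fn()
        elapsed_ms = (time.perf_counter() - start) * 1000.0
        if index >= warmup:
            times_ms.append(elapsed_ms)
    return times_ms

# Stand-in workload; in the notebook this would be the sess.run(...) call.
times_ms = benchmark(lambda: sum(range(100000)))
mean_ms = statistics.mean(times_ms)
print("Runs kept: {}".format(len(times_ms)))
print("Mean: {:.3f} ms ({:.1f} fps)".format(mean_ms, 1000.0 / mean_ms))
```

Averaging over the kept runs gives a steadier number than any single iteration, which matters when comparing against the TensorRT results later.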

In [None]:
def run_tf_sess():
    times = []
    
    with tf.Session() as sess:
        with tf.gfile.GFile(TF_FROZEN_MODEL_PATH, 'rb') as f:
            graph_def = tf.GraphDef()
            graph_def.ParseFromString(f.read())
            _ = tf.import_graph_def(graph_def)
            image_tensor = sess.graph.get_tensor_by_name('import/image_tensor:0')
            boxes = sess.graph.get_tensor_by_name('import/detection_boxes:0')
            scores = sess.graph.get_tensor_by_name('import/detection_scores:0')
            classes = sess.graph.get_tensor_by_name('import/detection_classes:0')
            num_detections = sess.graph.get_tensor_by_name('import/num_detections:0')

            iterations = 10

            input_image = load_img(INPUT_IMG, 'tf')  
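            for index in range(iterations):
                start = time.time()
                sess.run([boxes, scores, classes, num_detections], feed_dict={image_tensor: input_image})
                stop = time.time()
                # Skip the first two runs, which include one-time warm-up costs
                if index > 1:
                    times.append(stop - start)
                    print("Time per run for TensorFlow Inference %d: %f ms" % (index, (stop - start) * 1000))

    # Report the average over the timed (post-warm-up) runs
    print("Average TensorFlow inference time: %f ms" % (1000 * sum(times) / len(times)))

run_tf_sess()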

Now that we have our baseline numbers, we want to follow the same process we followed in the first image classification example.

## UFF Model Creation

We want to first create a UFF model (intermediate format) and then use that intermediate format to create a TensorRT engine we can then use for inference.

A few specialty layers in the SSD model are not natively supported by UFF, so we have to take care of them first: namely, the NMS layer at the end of the network and the FlattenConcat layers.  These unsupported nodes are set as "placeholder" plugin nodes in the UFF graph and then replaced in the TensorRT graph with optimized implementations.
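
As a mental model for the namespace collapsing used below, here is a simplified, dependency-free sketch. It is not the graphsurgeon implementation: the function, the node names, and the matching rule (a node belongs to a namespace if its name equals the namespace or starts with `<namespace>/`) are assumptions made for illustration.

```python
def collapse_namespaces_sketch(node_names, namespace_map):
    """Replace every node inside a mapped namespace with its plugin node name.

    Simplified model: a node belongs to a namespace if its name equals the
    namespace or starts with "<namespace>/". Duplicates collapse to one node.
    """
    collapsed = []
    for name in node_names:
        replacement = name
        for namespace, plugin in namespace_map.items():
            if name == namespace or name.startswith(namespace + "/"):
                replacement = plugin
                break
        if replacement not in collapsed:
            collapsed.append(replacement)
    return collapsed

# Hypothetical node names, loosely modeled on an SSD frozen graph.
nodes = [
    "image_tensor",
    "Preprocessor/sub",
    "FeatureExtractor/MobilenetV2/Conv/Conv2D",
    "Postprocessor/BatchMultiClassNonMaxSuppression/map/while",
    "Postprocessor/ToFloat",
]
plugin_map = {"image_tensor": "Input", "Preprocessor": "Input", "Postprocessor": "NMS"}

print(collapse_namespaces_sketch(nodes, plugin_map))
# ['Input', 'FeatureExtractor/MobilenetV2/Conv/Conv2D', 'NMS']
```

Whole subgraphs such as the postprocessor end up as a single plugin node, while supported layers (the feature extractor here) are left untouched.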

In [None]:
def ssd_unsupported_nodes_to_plugin_nodes(ssd_graph):
    """Makes ssd_graph TensorRT compatible using graphsurgeon.

    This function takes ssd_graph, which contains graphsurgeon
    DynamicGraph data structure. This structure describes frozen Tensorflow
    graph, that can be modified using graphsurgeon (by deleting, adding,
    replacing certain nodes). The graph is modified by removing
    Tensorflow operations that are not supported by TensorRT's UffParser
    and replacing them with custom layer plugin nodes.

    Note: This specific implementation works only for
    ssd_mobilenet_v2_coco_2018_03_29 network.

    Args:
        ssd_graph (gs.DynamicGraph): graph to convert
    Returns:
        gs.DynamicGraph: UffParser compatible SSD graph
    """
    # Create TRT plugin nodes to replace unsupported ops in Tensorflow graph
    channels = ModelData.get_input_channels()
    height = ModelData.get_input_height()
    width = ModelData.get_input_width()

    Input = gs.create_plugin_node(name="Input",
        op="Placeholder",
        dtype=tf.float32,
        shape=[1, channels, height, width])

    PriorBox = gs.create_plugin_node(name="MultipleGridAnchorGenerator", op="GridAnchor_TRT",
        minSize=0.2,
        maxSize=0.95,
        aspectRatios=[1.0, 2.0, 0.5, 3.0, 0.33],
        variance=[0.1,0.1,0.2,0.2],
        featureMapShapes=[19, 10, 5, 3, 2, 1],
        numLayers=6
    )

    NMS = gs.create_plugin_node(
        name="NMS",
        op="NMS_TRT",
        shareLocation=1,
        varianceEncodedInTarget=0,
        backgroundLabelId=0,
        confidenceThreshold=1e-8,
        nmsThreshold=0.6,
        topK=100,
        keepTopK=100,
        numClasses=91,
        inputOrder=[1, 0, 2],
        confSigmoid=1,
        isNormalized=1
    )

    concat_priorbox = gs.create_node(
        "concat_priorbox",
        op="ConcatV2",
        dtype=tf.float32,
        axis=2
    )
    
    concat_box_loc = gs.create_plugin_node(
        "concat_box_loc",
        op="FlattenConcat_TRT",
        dtype=tf.float32,
        axis=1,
        ignoreBatch=0
    )

    concat_box_conf = gs.create_plugin_node(
        "concat_box_conf",
        op="FlattenConcat_TRT",
        dtype=tf.float32,
        axis=1,
        ignoreBatch=0
    )

    # Create a mapping of namespace names -> plugin nodes.
    namespace_plugin_map = {
        "Concatenate": concat_priorbox,
        "MultipleGridAnchorGenerator": PriorBox,
        "Postprocessor": NMS,
        "Preprocessor": Input,
        "ToFloat": Input,
        "image_tensor": Input,
        "concat": concat_box_loc,
        "concat_1": concat_box_conf
    }

    # Create a new graph by collapsing namespaces
    ssd_graph.collapse_namespaces(namespace_plugin_map)
    # Remove the outputs, so we just have a single output node (NMS).
    # If remove_exclusive_dependencies is True, the whole graph will be removed!
    ssd_graph.remove(ssd_graph.graph_outputs, remove_exclusive_dependencies=False)
    
    ssd_graph.find_nodes_by_op("NMS_TRT")[0].input.remove("Input")
    return ssd_graph

Now we can use the `uff.from_tensorflow()` function to convert our model from a frozen TensorFlow graph to a UFF model.

In [None]:
def model_to_uff():
    dynamic_graph = gs.DynamicGraph(TF_FROZEN_MODEL_PATH)
    dynamic_graph = ssd_unsupported_nodes_to_plugin_nodes(dynamic_graph)

    uff.from_tensorflow(
        dynamic_graph.as_graph_def(),
        [ModelData.OUTPUT_NAME],
        output_filename=UFF_MODEL_PATH,
        text=True)
    
model_to_uff()

## TensorRT Engine Creation

Now that we have our intermediate representation, let's use it to create our TensorRT engine.
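
The build function below sets the builder's workspace limit to `1 << 30`. As a quick check of what that value means, this snippet converts it to more familiar units; nothing here depends on TensorRT.

```python
# Same expression as used for the builder's workspace limit.
max_workspace_size = 1 << 30

bytes_per_gib = 1024 ** 3
print("Workspace limit: {} bytes = {:.0f} GiB".format(
    max_workspace_size, max_workspace_size / bytes_per_gib))
```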

In [None]:
def build_trt_engine(trt_logger):
    with trt.Builder(trt_logger) as builder, builder.create_network() as network, trt.UffParser() as parser:
        builder.max_workspace_size = 1 << 30
        if TRT_PRECISION_TO_DATATYPE[PRECISION] == trt.DataType.HALF:
            builder.fp16_mode = True
        builder.max_batch_size = BATCH_SIZE
        
        parser.register_input(ModelData.INPUT_NAME, ModelData.INPUT_SHAPE)
        parser.register_output("MarkOutput_0")
        parser.parse(UFF_MODEL_PATH, network)
        
        print("Building TensorRT engine. This may take a few minutes.")
        
        return builder.build_cuda_engine(network)

In [None]:
def save_engine(engine, engine_dest_path):
    buf = engine.serialize()
    with open(engine_dest_path, 'wb') as f:
        f.write(buf)

def load_engine(trt_runtime, engine_path):
    with open(engine_path, 'rb') as f:
        engine_data = f.read()
    engine = trt_runtime.deserialize_cuda_engine(engine_data)
    return engine

In the interest of time, we will not go through the engine building process.  Instead, we will simply download the engine that would have been produced by the following code.

If you would like to go through the engine building process, you can simply remove the `ssd_mobilenet_v2_<precision>.engine` from the `engines` directory and run the following cell.  The same engine file will be built.
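
The "reuse the file if it exists, otherwise build and save it" logic can be sketched on its own. The helper and the stand-in build function below are hypothetical and use plain bytes in place of a serialized engine, but the control flow mirrors what the notebook does with `TRT_ENGINE_PATH`.

```python
import os
import tempfile

def load_or_build(path, build_fn):
    """Return serialized bytes from `path`, building and saving them if missing."""
    if os.path.exists(path):
        with open(path, "rb") as f:
            return f.read(), "loaded"
    data = build_fn()
    os.makedirs(os.path.dirname(path), exist_ok=True)
    with open(path, "wb") as f:
        f.write(data)
    return data, "built"

# Hypothetical stand-in for the slow engine build.
def fake_build():
    return b"serialized-engine-bytes"

cache_dir = tempfile.mkdtemp()
engine_file = os.path.join(cache_dir, "engines", "demo.engine")

print(load_or_build(engine_file, fake_build)[1])  # file is missing, so it is built
print(load_or_build(engine_file, fake_build)[1])  # file exists, so it is loaded
os.remove(engine_file)
print(load_or_build(engine_file, fake_build)[1])  # removed, so it is built again
```

Removing the cached file is therefore all it takes to force a rebuild.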

In [None]:
%%bash

DOWNLOAD_PATH='/home/'${USER}'/jetson-training/engines'

if [ "$PRECISION" == "16" ]; then
    echo "Downloading 16-bit precision version of SSD Mobilenet TensorRT engine..."
    FILE_ID='1Rtlvz2Ei2PpYr712IMoDjj_JjbPX03Hc'
    ENGINE_FILE_NAME='ssd_mobilenet_v2_fp16.engine'
elif [ "$PRECISION" == "32" ]; then
    echo "Downloading 32-bit precision version of SSD Mobilenet TensorRT engine..."
    FILE_ID='1PfObqUIPiZ7W8e0m-xInSXy3WOopgEd-'
    ENGINE_FILE_NAME='ssd_mobilenet_v2_fp32.engine'
fi

mkdir -p ${DOWNLOAD_PATH}
# The inner URL must be double-quoted so that ${FILE_ID} is expanded
wget --load-cookies /tmp/cookies.txt "https://docs.google.com/uc?export=download&confirm=$(wget --quiet --save-cookies /tmp/cookies.txt --keep-session-cookies --no-check-certificate "https://docs.google.com/uc?export=download&id=${FILE_ID}" -O- | sed -rn 's/.*confirm=([0-9A-Za-z_]+).*/\1\n/p')&id=${FILE_ID}" -O ${DOWNLOAD_PATH}/${ENGINE_FILE_NAME} && rm -rf /tmp/cookies.txt

In [None]:
!ls -alt engines/

In [None]:
TRT_LOGGER = trt.Logger(trt.Logger.WARNING)
# We first load all custom plugins shipped with TensorRT,
# some of which will be needed during inference
trt.init_libnvinfer_plugins(TRT_LOGGER, '')

# Initialize runtime needed for loading TensorRT engine from file
trt_runtime = trt.Runtime(TRT_LOGGER)

print("TensorRT inference engine settings:")
print("  * Inference precision - {}".format(TRT_PRECISION_TO_DATATYPE[PRECISION]))
print("  * Max batch size - {}\n".format(BATCH_SIZE))

if not os.path.exists(TRT_ENGINE_PATH):
    trt_engine = build_trt_engine(TRT_LOGGER)
    if not os.path.exists(ENGINE_PATH):
        os.mkdir(ENGINE_PATH)
    save_engine(trt_engine, TRT_ENGINE_PATH)
else:
    print("TensorRT model already exists at {}".format(TRT_ENGINE_PATH))
    print("Using this engine instead of creating a new one")

Now that we have our TensorRT engine, let's load it, run inference, and compare the result to the TensorFlow runtime.

In [None]:
trt_engine = load_engine(trt_runtime, TRT_ENGINE_PATH)

# This allocates memory for network inputs/outputs on both CPU and GPU
inputs, outputs, bindings, stream = engine_utils.allocate_buffers(trt_engine)

# Execution context is needed for inference
context = trt_engine.create_execution_context()

# Allocate memory for multiple usage [e.g. multiple batch inference]
input_volume = trt.volume(ModelData.INPUT_SHAPE)
numpy_array = np.zeros((trt_engine.max_batch_size, input_volume))

# Load image into CPU
img = load_img(INPUT_IMG, 'trt')

# Copy it into appropriate place in memory
# inputs was returned earlier by allocate_buffers()
np.copyto(inputs[0].host, img.ravel())

In [None]:
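inference_start_time = time.time()

# Fetch output from the model
[detection_out, keepCount_out] = common.do_inference(
    context, bindings=bindings, inputs=inputs, outputs=outputs, stream=stream)

# Report how long the TensorRT inference call took
inference_time_ms = (time.time() - inference_start_time) * 1000
print("TensorRT inference time: {:.1f} ms".format(inference_time_ms))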

Comparing this inference time with the TensorFlow baseline shows that simply converting the model to TensorRT gives a several-fold speedup; the table at the top of this notebook reports 5.8x for FP32 and 7.3x for FP16.

Let's make sure that the output of the network is correct by loading the image and drawing the predicted bounding boxes on it.
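
The detection output is a flat array with seven values per detection, laid out as described by `TRT_PREDICTION_LAYOUT`. The sketch below shows one way to slice it into records and scale the box corners to pixels. It uses made-up output values and hypothetical helper names, and it assumes the box coordinates are normalized to `[0, 1]`.

```python
# Field order taken from TRT_PREDICTION_LAYOUT in this notebook.
FIELDS = ["image_id", "label", "confidence", "xmin", "ymin", "xmax", "ymax"]

def decode_detections(flat_output, keep_count, threshold=0.5):
    """Split the flat output into one dict per kept detection above threshold."""
    detections = []
    for det in range(keep_count):
        start = det * len(FIELDS)
        record = dict(zip(FIELDS, flat_output[start:start + len(FIELDS)]))
        if record["confidence"] > threshold:
            detections.append(record)
    return detections

def to_pixels(record, image_width, image_height):
    """Scale normalized [0, 1] box corners to pixel coordinates (an assumption)."""
    return (
        int(record["xmin"] * image_width), int(record["ymin"] * image_height),
        int(record["xmax"] * image_width), int(record["ymax"] * image_height),
    )

# Made-up output holding two detections; the values are illustrative only.
fake_output = [
    0, 18, 0.9, 0.25, 0.5, 0.75, 1.0,
    0, 1, 0.3, 0.0, 0.0, 0.5, 0.5,
]

for record in decode_detections(fake_output, keep_count=2):
    print(int(record["label"]), record["confidence"], to_pixels(record, 640, 480))
# 18 0.9 (160, 240, 480, 480)
```

Only the first detection is printed because the second one falls below the 0.5 confidence threshold.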

In [None]:
def fetch_prediction_field(field_name, detection_out, pred_start_idx):
    """Fetches prediction field from prediction byte array.

    After TensorRT inference, prediction data is saved in
    byte array and returned by object detection network.
    This byte array contains several pieces of data about
    prediction - we call one such piece a prediction field.
    The prediction fields layout is described in TRT_PREDICTION_LAYOUT.

    This function, given the prediction byte array returned by the network,
    the starting index of a given prediction and the field name of interest,
    returns the prediction field data corresponding to the given arguments.

    Args:
        field_name (str): field of interest, one of keys of TRT_PREDICTION_LAYOUT
        detection_out (array): object detection network output
        pred_start_idx (int): start index of prediction of interest in detection_out

    Returns:
        Prediction field corresponding to given data.
    """
    return detection_out[pred_start_idx + TRT_PREDICTION_LAYOUT[field_name]]

In [None]:
def analyze_prediction(detection_out, pred_start_idx, img_pil):
    image_id = int(fetch_prediction_field("image_id", detection_out, pred_start_idx))
    label = int(fetch_prediction_field("label", detection_out, pred_start_idx))
    confidence = fetch_prediction_field("confidence", detection_out, pred_start_idx)
    xmin = fetch_prediction_field("xmin", detection_out, pred_start_idx)
    ymin = fetch_prediction_field("ymin", detection_out, pred_start_idx)
    xmax = fetch_prediction_field("xmax", detection_out, pred_start_idx)
    ymax = fetch_prediction_field("ymax", detection_out, pred_start_idx)
    if confidence > VISUALIZATION_THRESHOLD:
        class_name = COCO_LABELS[label]
        confidence_percentage = "{0:.0%}".format(confidence)
        print("Detected {} with confidence {}".format(
            class_name, confidence_percentage))
        boxes_utils.draw_bounding_boxes_on_image(
            img_pil, np.array([[ymin, xmin, ymax, xmax]]),
            display_str_list=["{}: {}".format(
                class_name, confidence_percentage)],
            color=coco_utils.COCO_COLORS[label]
        )

In [None]:
img_pil = Image.open(INPUT_IMG)
prediction_fields = len(TRT_PREDICTION_LAYOUT)

for det in range(int(keepCount_out[0])):
    analyze_prediction(detection_out, det * prediction_fields, img_pil)

We can see that for this particular image we detected 4 objects with a confidence above 50% (the `VISUALIZATION_THRESHOLD` we set earlier in the notebook).

Now let's save and view the image to finalize the process.

In [None]:
img_pil.save(OUTPUT_IMG)
print("Saved output image to: {}".format(OUTPUT_IMG))
disp_img(read_img(filename=OUTPUT_IMG))

Notice that the bounding boxes on the dogs are quite good, but the two boxes on the single person are less accurate.  This is a much smaller model than most, so we do not expect the best accuracy from it, but it does very well on the two foreground objects.

## TF-TRT

As mentioned before, there is a third option, TensorFlow TensorRT integration.

The reasons for integrating TensorRT into native TensorFlow are two-fold:

- If you are already developing a model in TensorFlow, it makes sense in some cases to stay in that workflow when moving to inference.
- If TensorRT cannot convert a model in full, the TensorFlow-TensorRT integration converts the portions it can and leaves the rest as TensorFlow subgraphs.

For these reasons, NVIDIA has worked with TensorFlow teams to integrate TensorRT functionality into TensorFlow. Although you may not see the exact same performance improvements that you would with native TensorRT, TF-TRT provides a great alternative with relatively simple code additions.

For TensorFlow <= 1.13 the conversion process looks something like this:

```python
import tensorflow.contrib.tensorrt as trt
# For a frozen graph
converted_graph_def = trt.create_inference_graph(
    input_graph_def=frozen_graph,
    outputs=['logits', 'classes'],
    precision_mode="FP16")
# For a saved model (no frozen graph is passed in this case)
trt.create_inference_graph(
    input_graph_def=None,
    outputs=None,
    input_saved_model_dir=input_saved_model_dir,
    output_saved_model_dir=output_saved_model_dir,
    precision_mode="FP16")
```

For TensorFlow 1.14+ (1.x versions > 1.13), we will use something slightly different:

```python
from tensorflow.python.compiler.tensorrt import trt_convert as trt
converter = trt.TrtGraphConverter(
    input_saved_model_dir=input_saved_model_dir,
    precision_mode=trt.TrtPrecisionMode.FP16)
converter.convert()
converter.save(output_saved_model_dir)
```

And for TensorFlow 2.x, the API changes slightly again:

```python
from tensorflow.python.compiler.tensorrt import trt_convert as trt
conversion_params = trt.DEFAULT_TRT_CONVERSION_PARAMS._replace(
    precision_mode=trt.TrtPrecisionMode.FP16)
converter = trt.TrtGraphConverterV2(
    input_saved_model_dir=input_saved_model_dir,
    conversion_params=conversion_params)
converter.convert()
# optionally build TRT engines before deployment
converter.build(input_fn=my_input_fn)
converter.save(output_saved_model_dir)
```

In [None]:
%%script false --no-raise-error
from tensorflow.python.compiler.tensorrt import trt_convert

def create_tftrt_model():
    converter = trt_convert.TrtGraphConverter(input_saved_model_dir = SAVED_MODEL_PATH,
                                              precision_mode='FP16'
                                             )
    converter.convert()
    converter.save(TFTRT_MODEL_PATH)

create_tftrt_model()

In [None]:
%%script false --no-raise-error
def run_tftrt_sess():
    times = []
    
    with tf.Session(graph=tf.Graph()) as sess:
        tf.saved_model.loader.load(sess,
                                   [tf.saved_model.tag_constants.SERVING],
                                   TFTRT_MODEL_PATH)
        image_tensor = sess.graph.get_tensor_by_name('image_tensor:0')
        boxes = sess.graph.get_tensor_by_name('detection_boxes:0')
        scores = sess.graph.get_tensor_by_name('detection_scores:0')
        classes = sess.graph.get_tensor_by_name('detection_classes:0')
        num_detections = sess.graph.get_tensor_by_name('num_detections:0')
        
        iterations = 10
        
        input_image = load_img(INPUT_IMG, 'tf')
        for index in range((iterations)):
            start = time.time()
            sess.run([boxes, scores, classes, num_detections], feed_dict={image_tensor: input_image})
            stop = time.time()
            if index > 1:
                times.append(stop-start)
                print("Time per run for TF-TRT Inference %d: %f ms" % (index, (stop - start) * 1000))
                
run_tftrt_sess()
