<center><img src="https://upload.wikimedia.org/wikipedia/en/thumb/6/6d/Nvidia_image_logo.svg/1200px-Nvidia_image_logo.svg.png" width="250"></center>

In [None]:
# Copyright (c) 2020 NVIDIA

# Permission is hereby granted, free of charge, to any person obtaining a copy
# of this software and associated documentation files (the "Software"), to deal
# in the Software without restriction, including without limitation the rights
# to use, copy, modify, merge, publish, distribute, sublicense, and/or sell
# copies of the Software, and to permit persons to whom the Software is
# furnished to do so, subject to the following conditions:

# The above copyright notice and this permission notice shall be included in all
# copies or substantial portions of the Software.

# THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
# IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
# FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE
# AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
# LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM,
# OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE
# SOFTWARE.

# Image Classification w/ TensorRT

This notebook illustrates the full pipeline: creating a very simple image classification model with TensorFlow/Keras, converting it to UFF (Universal Framework Format), and then using the UFF file to build a TensorRT engine and run inference.

Let's get started.

## Setting MaxN mode

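First, before we do anything else, we want to put the Jetson Nano in MaxN mode. This sets the clock frequencies to their highest values in order to achieve the lowest inference times.

To do this, we use the `nvpmodel` command. `nvpmodel -m 0` selects power mode 0 (MaxN on the Jetson Nano), which applies that mode's clock-frequency limits, and `nvpmodel -q` queries the active mode so we can confirm the change. Both commands pipe the password `nvidia` to `sudo -S`; change it if your device uses a different password.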

In [None]:
!echo nvidia | sudo -S nvpmodel -m 0

In [None]:
!echo nvidia | sudo -S nvpmodel -q

Then we use the `jetson_clocks` script to pin the clock frequencies at their maximum for the current mode. This lets us get the most performance out of our device.

In [None]:
!echo nvidia | sudo -S jetson_clocks
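
To confirm that the clocks are pinned, you can compare each CPU's current frequency with its maximum. The sketch below reads the standard Linux cpufreq files (`scaling_cur_freq`, `scaling_max_freq`) under `/sys/devices/system/cpu`. The helper names `read_cpu_freqs` and `clocks_pinned` are hypothetical, and the check only covers CPU clocks, not the GPU or memory clocks that `jetson_clocks` also sets.

```python
import os

def read_cpu_freqs(sysfs_root="/sys/devices/system/cpu"):
    """Return {cpu_name: (cur_khz, max_khz)} for every CPU exposing cpufreq files."""
    freqs = {}
    for name in sorted(os.listdir(sysfs_root)):
        # Only look at cpu0, cpu1, ...; skip entries such as cpufreq/ or cpuidle/
        if not (name.startswith("cpu") and name[3:].isdigit()):
            continue
        cpufreq_dir = os.path.join(sysfs_root, name, "cpufreq")
        try:
            with open(os.path.join(cpufreq_dir, "scaling_cur_freq")) as f:
                cur_khz = int(f.read().strip())
            with open(os.path.join(cpufreq_dir, "scaling_max_freq")) as f:
                max_khz = int(f.read().strip())
        except (FileNotFoundError, PermissionError):
            # Offline cores (or kernels without cpufreq) have no readable files
            continue
        freqs[name] = (cur_khz, max_khz)
    return freqs

def clocks_pinned(freqs):
    """True if every CPU found is currently running at its maximum frequency."""
    return bool(freqs) and all(cur == mx for cur, mx in freqs.values())

# On the Jetson itself:
# freqs = read_cpu_freqs()
# print(freqs, clocks_pinned(freqs))
```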

## Setup

Now that we have our clock frequencies set, let's import all of the packages that we will be using in this notebook.

In [None]:
import tensorflow as tf
import tensorrt as trt
import uff

import pycuda.driver as cuda
import pycuda.autoinit

import matplotlib.pyplot as plt
from random import randint
from PIL import Image
import numpy as np
import time
import sys
import os
sys.path.insert(1, os.path.join(sys.path[0], ".."))
import common

print('TensorFlow version: ', tf.__version__)
print('TensorRT version: ', trt.__version__)

We now have all the necessary packages, so let's create a model and convert it to TensorRT. First, we define a few helper functions that load the data, create the LeNet5-style model, and save the trained model.

The dataset we will train this small model on is MNIST, which consists of 70,000 handwritten digits (60,000 for training and 10,000 for testing) stored as 28x28 grayscale images.

<center><img src="https://miro.medium.com/max/530/1*VAjYygFUinnygIx9eVCrQQ.png" alt="MNIST Dataset" width="750"/></center>
<center>Image credit: https://miro.medium.com/max/530/1*VAjYygFUinnygIx9eVCrQQ.png</center>

In [None]:
def process_dataset():
    # Import the data
    (x_train, y_train),(x_test, y_test) = tf.keras.datasets.mnist.load_data()
    x_train_orig, y_train_orig, x_test_orig, y_test_orig = x_train, y_train, x_test, y_test
    x_train, x_test = x_train / 255.0, x_test / 255.0

    # Reshape the data
    NUM_TRAIN = 60000
    NUM_TEST = 10000
    x_train = np.reshape(x_train, (NUM_TRAIN, 28, 28, 1))
    x_test = np.reshape(x_test, (NUM_TEST, 28, 28, 1))
    return x_train, y_train, x_test, y_test, x_train_orig, y_train_orig 

The model we will create is a heavily simplified version of the LeNet model shown in the image below: instead of LeNet's convolution and pooling layers, it uses only a Flatten layer followed by two Dense (fully connected) layers.

<center><img src="https://engmrk.com/wp-content/uploads/2018/09/LeNet_Original_Image.jpg" alt="LeNet5 Architecture" width="750"/></center>
<center>Image credit: https://engmrk.com/wp-content/uploads/2018/09/LeNet_Original_Image.jpg</center>

In [None]:
def create_model():
    model = tf.keras.models.Sequential()
    model.add(tf.keras.layers.InputLayer(input_shape=[28,28, 1], name="input"))
    model.add(tf.keras.layers.Flatten())
    model.add(tf.keras.layers.Dense(512, activation=tf.nn.relu))
    model.add(tf.keras.layers.Dense(10, activation=tf.nn.softmax, name="output"))
    model.compile(optimizer='adam', loss='sparse_categorical_crossentropy', metrics=['accuracy'])
    return model
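
As a quick sanity check on this architecture, the number of trainable parameters can be worked out by hand: Flatten has none, and each Dense layer has one weight per input/output pair plus one bias per output unit. The plain-Python sketch below does that arithmetic (the helper `dense_params` is hypothetical); the total should agree with the `Total params` line that `model.summary()` prints for the model returned by `create_model()`.

```python
def dense_params(n_in, n_out):
    # A Dense layer has an n_in x n_out weight matrix plus one bias per output unit
    return n_in * n_out + n_out

flatten_size = 28 * 28 * 1                        # Flatten: 28x28x1 image -> 784 values (no parameters)
hidden_params = dense_params(flatten_size, 512)   # 784*512 + 512
output_params = dense_params(512, 10)             # 512*10 + 10
total_params = hidden_params + output_params

print(flatten_size, hidden_params, output_params, total_params)  # 784 401920 5130 407050
```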

In [None]:
def save(model, filename):
    # First freeze the graph and remove training nodes.
    output_names = model.output.op.name
    sess = tf.keras.backend.get_session()
    frozen_graph = tf.graph_util.convert_variables_to_constants(sess, sess.graph.as_graph_def(), [output_names])
    frozen_graph = tf.graph_util.remove_training_nodes(frozen_graph)
    # Save the model
    with open(filename, "wb") as ofile:
        ofile.write(frozen_graph.SerializeToString())

Now that we have our helper functions, let's go ahead and train our model.  Note that this should run relatively quickly since we are using a small model and a very small dataset.

**NOTE:** Normally you wouldn't train on a device like the Jetson; you would train on a more capable GPU, for example an A100 or V100 instance in AWS.

In [None]:
# Download and preprocess the MNIST dataset
x_train, y_train, x_test, y_test, x_train_orig, y_train_orig = process_dataset()

tf_frozen_model_file = "models/lenet5.pb"
tf_saved_model_path = "models/lenet5_saved_model_tf"

if not os.path.exists(os.path.dirname(tf_frozen_model_file)):
    os.mkdir(os.path.dirname(tf_frozen_model_file))

# Create the LeNet5 model (using the tf.keras API)
model = create_model()

# Train the model on the data
history = model.fit(x_train, y_train, epochs = 2, verbose = 1, validation_data = (x_test, y_test))

# Evaluate the model on test data
loss_and_metrics = model.evaluate(x_test, y_test)
print("Test Loss", loss_and_metrics[0])
print("Test Accuracy", loss_and_metrics[1])

# Save the model as a frozen graph (for use with UFF/TRT)
save(model, filename=tf_frozen_model_file)

# Save the model as a TF saved model (for use with TF-TRT)
model.save(tf_saved_model_path, save_format="tf")

fig = plt.figure()
for i in range(9):
    plt.subplot(3, 3, i+1)
    plt.tight_layout()
    plt.imshow(x_train_orig[i], cmap='gray', interpolation='none')
    plt.title('Digit: {}'.format(y_train_orig[i]))
    plt.xticks([])
    plt.yticks([])
fig

In [None]:
fig = plt.figure()
plt.subplot(2,1,1)
plt.plot(history.history['acc'])
plt.plot(history.history['val_acc'])
plt.title('model accuracy')
plt.ylabel('accuracy')
plt.xlabel('epoch')
plt.legend(['train', 'test'], loc='lower right')

plt.subplot(2,1,2)
plt.plot(history.history['loss'])
plt.plot(history.history['val_loss'])
plt.title('model loss')
plt.ylabel('loss')
plt.xlabel('epoch')
plt.legend(['train', 'test'], loc='upper right')

plt.tight_layout()

In [None]:
predicted_classes = model.predict_classes(x_test)

# see which we predicted correctly and which not
correct_indices = np.nonzero(predicted_classes == y_test)[0]
incorrect_indices = np.nonzero(predicted_classes != y_test)[0]
print(len(correct_indices)," classified correctly")
print(len(incorrect_indices)," classified incorrectly")

# adapt figure size to accommodate 18 subplots
plt.rcParams['figure.figsize'] = (7,14)

figure_evaluation = plt.figure()

# plot 9 correct predictions
for i, correct in enumerate(correct_indices[:9]):
    plt.subplot(6,3,i+1)
    plt.imshow(x_test[correct].reshape(28,28), cmap='gray', interpolation='none')
    plt.title(
      "Predicted: {}, Truth: {}".format(predicted_classes[correct],
                                        y_test[correct]))
    plt.xticks([])
    plt.yticks([])

# plot 9 incorrect predictions
for i, incorrect in enumerate(incorrect_indices[:9]):
    plt.subplot(6,3,i+10)
    plt.imshow(x_test[incorrect].reshape(28,28), cmap='gray', interpolation='none')
    plt.title(
      "Predicted: {}, Truth: {}".format(predicted_classes[incorrect],
                                        y_test[incorrect]))
    plt.xticks([])
    plt.yticks([])

figure_evaluation
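
The correct/incorrect counts above also give the test accuracy directly: accuracy is the number of correct predictions divided by the number of test images, so `len(correct_indices) / len(y_test)` should match the `Test Accuracy` reported by `model.evaluate`. The sketch below is a hypothetical helper, `summarize_predictions`, that computes the same quantity and also counts which true digits were misclassified; it accepts plain lists or NumPy arrays such as `predicted_classes` and `y_test`.

```python
from collections import Counter

def summarize_predictions(predicted, truth):
    """Return (accuracy, Counter mapping each true label to its number of misclassifications)."""
    if len(predicted) != len(truth):
        raise ValueError("predicted and truth must have the same length")
    errors_by_digit = Counter(t for p, t in zip(predicted, truth) if p != t)
    n_correct = len(truth) - sum(errors_by_digit.values())
    return n_correct / len(truth), errors_by_digit

# Toy example: 4 of 6 predictions are correct; a 9 and a 7 were misclassified
acc, errors = summarize_predictions([7, 2, 1, 0, 4, 1], [7, 2, 1, 0, 9, 7])
print(acc, errors)

# In the notebook: summarize_predictions(predicted_classes, y_test)
```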

Notice that we are saving the model twice here. We will use the frozen graph format (i.e. the .pb file) to create our UFF file (and consequently, our TensorRT engine), but we could also take a different route and use the TensorFlow SavedModel to create a TensorFlow-TensorRT (TF-TRT) model. More on this later.

## UFF Conversion

Now that we have our trained model, we can convert it to the UFF file format. We do this for two reasons:

* It is easier to convert to a TensorRT model
* Intermediate file formats (like UFF/ONNX) provide much more flexibility when going between different frameworks (e.g. TensorFlow, PyTorch, TensorRT, etc.)

Let's create a class that will house the information for our model and then read in our TensorFlow frozen graph so that we can use it for the conversion.

To find the correct input and output names, we use the `saved_model_cli show` command, which prints the SavedModel's signatures, including the names, dtypes, and shapes of its inputs and outputs.

In [None]:
!saved_model_cli show --all --dir "models/lenet5_saved_model_tf"

Notice that the input and output nodes here are conveniently named `input` and `output` for simplicity, but in general it is always good to check this information for your model in case the names are different. If you don't have the correct names, it will cause problems later on when we try to convert the model.

**NOTE:** Even though we named our output layer `output`, the node that actually produces the output is the softmax activation inside that layer, so tf.keras names it `output/Softmax`.

In [None]:
class ModelData(object):
    UFF_MODEL_NAME = "models/lenet5.uff"
    PB_MODEL_NAME = "models/lenet5.pb"
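    INPUT_NAME = "input"
    # CHW order; with a single channel this has the same memory layout as the Keras (28, 28, 1) input
    INPUT_SHAPE = (1, 28, 28)
    # Node (op) name of the softmax inside the "output" layer, without a ":0" tensor suffix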
    OUTPUT_NAME = "output/Softmax"
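
`INPUT_SHAPE` lists the dimensions in CHW (channels-first) order, which is what the UFF parser's `register_input` assumes by default, while the Keras model was built with a channels-last `(28, 28, 1)` input. Because MNIST images have a single channel, both orders describe the same flat sequence of 784 pixel values. The pure-Python sketch below (the helper `hwc_to_chw` is hypothetical) reorders a flat HWC buffer into CHW order, showing that the reordering is a no-op for one channel but not for three.

```python
def hwc_to_chw(image, height, width, channels):
    """Reorder a flat HWC (channels-last) buffer into flat CHW (channels-first) order."""
    return [image[(h * width + w) * channels + c]
            for c in range(channels)
            for h in range(height)
            for w in range(width)]

# Single channel (like MNIST): both layouts are the same flat sequence
gray = list(range(2 * 3 * 1))
print(hwc_to_chw(gray, 2, 3, 1) == gray)   # True

# Three channels: the order changes, so the layout would matter
rgb = list(range(2 * 2 * 3))
print(hwc_to_chw(rgb, 2, 2, 3))            # [0, 3, 6, 9, 1, 4, 7, 10, 2, 5, 8, 11]
```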

In [None]:
with tf.Graph().as_default():
    output_graph_def = tf.GraphDef()
    with open(ModelData.PB_MODEL_NAME, "rb") as f:
        output_graph_def.ParseFromString(f.read())

Now our model has been read in and parsed into `output_graph_def`, so we can use the `uff.from_tensorflow` function to convert it to a UFF file.

In [None]:
def model_to_uff(graphdef):
    uff_model = uff.from_tensorflow(
        graphdef=graphdef,
        output_filename=ModelData.UFF_MODEL_NAME,
        output_nodes=[ModelData.OUTPUT_NAME],
        text=False)
    
model_to_uff(output_graph_def)
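
`saved_model_cli` prints *tensor* names of the form `name:index` (for this model, something like `output/Softmax:0`), where the suffix is the index of the op's output. The `output_nodes` argument of `uff.from_tensorflow` and the UFF parser's `register_output` instead take the *node* name, `output/Softmax`. If you copy names straight from the CLI output, a small hypothetical helper like the one below strips the suffix.

```python
def split_tensor_name(tensor_name):
    """Split a TensorFlow tensor name 'node_name:index' into (node_name, index).

    A name without ':' is treated as output 0 of that node.
    """
    node_name, sep, index = tensor_name.rpartition(":")
    if not sep:
        return tensor_name, 0
    return node_name, int(index)

print(split_tensor_name("output/Softmax:0"))  # ('output/Softmax', 0)
print(split_tensor_name("input:0"))           # ('input', 0)
print(split_tensor_name("output/Softmax"))    # ('output/Softmax', 0)
```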

## TensorRT Engine Creation

We now have our intermediate file (in UFF format), so we can go through the process of creating a TensorRT engine from it.

First, we create a logger for TensorRT. With the severity set to `WARNING`, it reports only warnings and errors and suppresses the informational messages we don't need to see.

In [None]:
# You can set the logger severity higher to suppress messages (or lower to display more messages).
TRT_LOGGER = trt.Logger(trt.Logger.WARNING)
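
The severity works as a cut-off: TensorRT's logger severities are ordered from most to least severe (internal error, error, warning, info, verbose), and a message is reported only if it is at least as severe as the logger's minimum severity. The sketch below models that rule with a stand-in `IntEnum` whose order mirrors TensorRT's `ILogger::Severity`; `Severity` and `should_log` are hypothetical and do not use the `tensorrt` module.

```python
from enum import IntEnum

class Severity(IntEnum):
    # Stand-in mirroring the order of TensorRT's logger severities (most severe first)
    INTERNAL_ERROR = 0
    ERROR = 1
    WARNING = 2
    INFO = 3
    VERBOSE = 4

def should_log(message_severity, min_severity=Severity.WARNING):
    # A message is shown only if it is at least as severe as the threshold
    return message_severity <= min_severity

print([s.name for s in Severity if should_log(s)])
# ['INTERNAL_ERROR', 'ERROR', 'WARNING']
```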

Building a TensorRT engine on a device like the Jetson Nano can take a while, depending on the size of the network. Because this example uses such a small model, it should only take a couple of seconds.

But what exactly is going on when you build a TensorRT engine?

* TensorRT can import models from all the major frameworks and deploy them on NVIDIA GPUs, from data-center GPUs all the way to Jetson devices
* TensorRT performs layer fusion, combining layers vertically (sequential operations such as convolution, bias, and ReLU) and horizontally (parallel layers that share the same input) so that the fused computation runs in a single CUDA kernel
* TensorRT can run layers in reduced precision, FP16 or even INT8 (which requires a calibration step), to make operations faster; reduced precision must be enabled explicitly, so the engine built in this notebook uses the default FP32
* TensorRT finds the correct kernel that will perform the best on the given hardware (different kernels for the same function on different hardware)
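
Reduced precision trades a little accuracy for speed. The sketch below illustrates the two ideas in plain Python: rounding values to FP16 with the `struct` module's half-precision `'e'` format, and symmetric INT8 quantization, where each value is stored as an integer code in [-127, 127] times a per-tensor scale. The scale here comes from a simple max-absolute-value rule; TensorRT's INT8 calibrators (for example its entropy calibrator) instead choose the scale from representative input data, so treat this as an illustration of the concept rather than TensorRT's algorithm. The helpers `to_fp16` and `quantize_int8` are hypothetical.

```python
import struct

def to_fp16(x):
    """Round a Python float to IEEE 754 half precision and back to a float."""
    return struct.unpack("<e", struct.pack("<e", x))[0]

def quantize_int8(values):
    """Symmetric per-tensor INT8 quantization using a max-abs scale.

    Returns (codes, scale) such that each value is approximately code * scale.
    """
    amax = max(abs(v) for v in values)
    scale = amax / 127.0 if amax > 0 else 1.0
    # Round to the nearest integer code and clamp to the symmetric INT8 range
    codes = [max(-127, min(127, round(v / scale))) for v in values]
    return codes, scale

weights = [0.5, -1.27, 0.0031, 1.0]
print([to_fp16(w) for w in weights])   # FP16 keeps roughly 3 significant decimal digits
codes, scale = quantize_int8(weights)
print(codes)                           # [50, -127, 0, 100]
print([c * scale for c in codes])      # dequantized values; 0.0031 rounds to 0
```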

With all of that in mind, let's create the function that will build our TensorRT engine. Notice that we use `trt.Builder` and its `create_network()` method to set up the engine build. We also use `trt.UffParser`, which parses the UFF file into the TensorRT network definition so that we don't have to define the network layer by layer ourselves.

**NOTE:** There is an equivalent parser for ONNX (`trt.OnnxParser`) if you ever need that pathway.

In [None]:
def build_engine(model_file):
    # For more information on TRT basics, refer to the introductory samples.
    with trt.Builder(TRT_LOGGER) as builder, builder.create_network() as network, trt.UffParser() as parser:
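        # Set the maximum temporary GPU memory TensorRT may use when selecting kernels
        builder.max_workspace_size = common.GiB(1)

        # Register the input/output nodes and parse the UFF network
        parser.register_input(ModelData.INPUT_NAME, ModelData.INPUT_SHAPE)
        parser.register_output(ModelData.OUTPUT_NAME)
        if not parser.parse(model_file, network):
            raise RuntimeError("Failed to parse UFF file: " + model_file)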
        
        # Build and return an engine (underneath creating a CUDA context for creation).
        return builder.build_cuda_engine(network)
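
`builder.max_workspace_size` caps the temporary GPU memory that TensorRT's layer implementations may use, and `common.GiB(1)` asks for one gibibyte. For a sense of scale, the sketch below compares that with the FP32 input and output buffers this engine needs at batch size 1 (the builder's default maximum batch size). The helpers `gib` and `volume` are hypothetical stand-ins written for this illustration, not the `common` module's code.

```python
import operator
from functools import reduce

FLOAT32_BYTES = 4

def gib(n):
    # n gibibytes expressed in bytes (the notebook calls common.GiB(1) for the same purpose)
    return n * (1 << 30)

def volume(shape):
    # Number of elements in a tensor with the given dimensions
    return reduce(operator.mul, shape, 1)

workspace_bytes = gib(1)
input_elems = volume((1, 28, 28))   # ModelData.INPUT_SHAPE, batch size 1
output_elems = 10                   # one softmax score per digit

print(workspace_bytes)                              # 1073741824
print(input_elems, input_elems * FLOAT32_BYTES)     # 784 3136
print(output_elems, output_elems * FLOAT32_BYTES)   # 10 40
```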

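Now we can define a simple helper function that loads a random test digit for us to run inference on. Note that it reads one of the `.pgm` images (`0.pgm` to `9.pgm`) from the TensorRT MNIST sample data located by `common.find_sample_data`, rather than an image from the Keras dataset we trained on.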

In [None]:
# Loads a test case into the provided pagelocked_buffer.
def load_normalized_test_case(data_paths, pagelocked_buffer, case_num=None):
    # Choose the digit at call time; a randint() default would be evaluated only once, when the function is defined.
    if case_num is None:
        case_num = randint(0, 9)
    [test_case_path] = common.locate_files(data_paths, [str(case_num) + ".pgm"])
    # Flatten the image into a 1D array, invert and normalize it, and copy it to pagelocked memory.
    img = np.array(Image.open(test_case_path)).ravel()
    np.copyto(pagelocked_buffer, 1.0 - img / 255.0)
    return case_num
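
Python evaluates a default argument such as `case_num=randint(0, 9)` only once, when the `def` statement runs, so with that default every call that omits `case_num` would load the same digit until the cell is re-run. The usual fix is a `None` sentinel that defers the random choice to call time. The sketch below demonstrates the pitfall and the fix with two hypothetical functions, `pick_digit_buggy` and `pick_digit`.

```python
import random

def pick_digit_buggy(case_num=random.randint(0, 9)):
    # The default was computed once, when this def statement ran
    return case_num

def pick_digit(case_num=None):
    # A None sentinel defers the random choice to each call
    if case_num is None:
        case_num = random.randint(0, 9)
    return case_num

print({pick_digit_buggy() for _ in range(20)})   # always a single value
print({pick_digit() for _ in range(20)})         # usually several different values
```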

Now that we have our helper function, we can use our `build_engine` function to create our TensorRT engine, run inference on it to make sure we get the expected result, and serialize (save) the engine to `models/lenet5.engine`.

In [None]:
data_paths, _ = common.find_sample_data(description="Runs an MNIST network using a UFF model file", subfolder="mnist")

with build_engine(ModelData.UFF_MODEL_NAME) as engine:
    # Build an engine, allocate buffers and create a stream.
    # For more information on buffer allocation, refer to the introductory samples.
    inputs, outputs, bindings, stream = common.allocate_buffers(engine)
    with engine.create_execution_context() as context:
        case_num = load_normalized_test_case(data_paths, pagelocked_buffer=inputs[0].host)
        # For more information on performing inference, refer to the introductory samples.
        # The common.do_inference function will return a list of outputs - we only have one in this case.
        start = time.time()
        [output] = common.do_inference(context, bindings=bindings, inputs=inputs, outputs=outputs, stream=stream)
        stop = time.time()
        pred = np.argmax(output)
        print("Test Case: " + str(case_num))
        print("Prediction: " + str(pred))
        print("Time for 1 TensorRT inference (including memcpys): %f ms" % ((stop - start)*1000))
        with open("models/lenet5.engine", "wb") as f:
            f.write(engine.serialize())
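
The timing above measures a single call, which can be dominated by one-off costs and timer noise. A more representative number comes from a few untimed warm-up calls followed by many timed calls. The helper below is a hypothetical sketch using `time.perf_counter`; inside the `with engine.create_execution_context() as context:` block you could pass it `lambda: common.do_inference(context, bindings=bindings, inputs=inputs, outputs=outputs, stream=stream)`.

```python
import statistics
import time

def benchmark(fn, warmup=5, runs=50):
    """Call fn() `warmup` times untimed, then `runs` times timed.

    Returns (mean_ms, median_ms) over the timed calls.
    """
    for _ in range(warmup):
        fn()  # absorb one-off costs (lazy initialization, caches, clock ramp-up)
    times_ms = []
    for _ in range(runs):
        start = time.perf_counter()
        fn()
        times_ms.append((time.perf_counter() - start) * 1000.0)
    return statistics.mean(times_ms), statistics.median(times_ms)

# Stand-in workload; in the notebook, fn would wrap common.do_inference(...)
mean_ms, median_ms = benchmark(lambda: sum(range(10000)))
print("mean %.3f ms, median %.3f ms" % (mean_ms, median_ms))
```

Reporting the median alongside the mean makes occasional slow outliers easy to spot.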

We have now gone through the process of taking a very simple model and converting it into a TensorRT engine and using it for inference.

Now let's tackle something a little more difficult (and more usable). Continue on to [2-ObjectDetection-TRT.ipynb](2-ObjectDetection-TRT.ipynb) to go through the same process, this time for an object detection model (SSD). We will also discuss what is going on in more depth, since it is a more complex network.
