# optCNN: AI Inference Optimization for Deep CNNs
This notebook presents the results of Part 1 of the series: a performance analysis of four AI inference optimization levels. These levels are:

1. Standard PyTorch Inference
2. ONNX Runtime Inference
3. TensorRT Inference
4. TensorRT Mixed-Precision Inference

To ensure an apples-to-apples comparison, we will perform inference on the same GPU for all levels. Our example model will be ResNet-50, a popular deep convolutional network. In this analysis, we will focus solely on the speed of inference rather than the quality of the results.

We will use **<span style="color:green">Nsight Systems</span>** to profile each approach and gain insights into their performance.
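To make each inference easy to locate on the Nsight Systems timeline, one option is to wrap it in a named NVTX range. The following is a minimal sketch, assuming PyTorch's `torch.cuda.nvtx` helpers; `nvtx_range` is a hypothetical helper name, not something the notebook defines, and it degrades to a no-op when no CUDA build is available.

```python
from contextlib import contextmanager

try:
    import torch
    # torch.cuda.nvtx only works on CUDA builds, so guard on availability.
    _nvtx = torch.cuda.nvtx if torch.cuda.is_available() else None
except ImportError:
    _nvtx = None


@contextmanager
def nvtx_range(name):
    """Open a named NVTX range for the duration of the with-block.

    Hypothetical helper: does nothing when CUDA/NVTX is unavailable.
    """
    if _nvtx is not None:
        _nvtx.range_push(name)
    try:
        yield
    finally:
        if _nvtx is not None:
            _nvtx.range_pop()


# Usage: each iteration would appear as a labelled range in the NVTX row.
for i in range(3):
    with nvtx_range(f"inference_{i}"):
        pass  # the inference call (e.g. model(input_image)) would go here
```

A script annotated this way could then be profiled with something like `nsys profile --trace=cuda,nvtx -o report python script.py`, where `report` and `script.py` are placeholder names.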

In [None]:
import time
import warnings # This is merely a presentation notebook.
warnings.filterwarnings("ignore") # We'd like it clean.

import cv2
import numpy as np
import onnx
import onnxruntime as ort
import pycuda.driver as cuda
import torch
import torchvision.models as models
import tensorrt as trt

## 1. PyTorch GPU Inference

Fetching model and Benchmarking Inference in PyTorch's native space

In [22]:
# Load the PyTorch model
model = models.resnet50(pretrained=True).eval().cuda()

# Prepare input data
input_image = torch.randn(1, 3, 224, 224).cuda()

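# Warm-up (under no_grad so it matches the timed loop)
with torch.no_grad():
    for _ in range(10):
        _ = model(input_image)
torch.cuda.synchronize()  # finish warm-up kernels before starting the clock

# Measure inference time
start_time = time.time()
with torch.no_grad():
    for _ in range(100):
        _ = model(input_image)
torch.cuda.synchronize()  # CUDA launches are asynchronous; wait for the GPU to finish
end_time = time.time()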

print(f'PyTorch-GPU Baseline Inference Time: {(end_time - start_time) / 100:.6f} seconds')

PyTorch-GPU Baseline Inference Time: 0.017316 seconds


##### PyTorch-GPU Baseline Inference Time: 0.017316 seconds

![High Level Nsight-Systems Timeline Screenshot here](reports/snaps/torch_out.png)

The highlighted 17 ms window corresponds to one inference (one iteration of the loop above). We can use the patterns in the CUDA HW memory channel as markers to identify iterations. The durations of individual iterations vary widely, indicating room for optimization.
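The iteration-to-iteration variation can also be quantified without the profiler by timing every iteration separately instead of averaging over the whole loop. This is a minimal sketch using only the standard library; `benchmark` and its `sync` hook are hypothetical names introduced here for illustration.

```python
import statistics
import time


def benchmark(fn, warmup=10, iters=100, sync=None):
    """Time fn() per iteration and return latency statistics in seconds.

    sync is an optional callable (a hypothetical hook) invoked around each
    call so that asynchronous GPU work has finished when the clock is read.
    """
    for _ in range(warmup):
        fn()
    samples = []
    for _ in range(iters):
        if sync is not None:
            sync()
        start = time.perf_counter()
        fn()
        if sync is not None:
            sync()
        samples.append(time.perf_counter() - start)
    samples.sort()
    return {
        "mean": statistics.mean(samples),
        "median": statistics.median(samples),
        "stdev": statistics.stdev(samples),
        "p95": samples[int(0.95 * (len(samples) - 1))],
        "min": samples[0],
        "max": samples[-1],
    }


# Stand-in CPU workload; replace the lambda with the real inference call.
stats = benchmark(lambda: sum(range(10_000)), warmup=5, iters=50)
print({name: f"{value * 1e3:.3f} ms" for name, value in stats.items()})
```

For the PyTorch loop above, passing `sync=torch.cuda.synchronize` would make each sample cover the full GPU execution of one inference. A large gap between `median` and `p95` or `max` is the numerical counterpart of the uneven iteration widths seen on the timeline.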

![Single iteration Level Nsight-Systems Timeline Screenshot here](reports/snaps/torch_in.png)

Zooming in, we observe frequent and significant gaps between kernel executions. This is evident in the CUDA-HW kernel channel, where we see an alternation between kernel execution and idle times. Over an entire inference, these add up to be significant, with almost 50% of the inference window consisting of kernel idle time.
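The idle share can be estimated from kernel start and end times rather than read off the screenshot. The sketch below assumes the kernel intervals have already been exported from the profiler as `(start, end)` pairs; the function name and the sample numbers are illustrative, not measurements from this notebook.

```python
def idle_fraction(kernels, window):
    """Return the share of the window during which no kernel was running.

    kernels: iterable of (start, end) times, possibly overlapping across streams.
    window:  (start, end) of the inference being analysed, in the same unit.
    """
    w_start, w_end = window
    # Clip every kernel to the window and sort by start time.
    clipped = sorted(
        (max(start, w_start), min(end, w_end))
        for start, end in kernels
        if end > w_start and start < w_end
    )
    busy = 0.0
    cur_start = cur_end = None
    for start, end in clipped:
        if cur_end is None or start > cur_end:
            # A gap: close the previous busy interval and open a new one.
            if cur_end is not None:
                busy += cur_end - cur_start
            cur_start, cur_end = start, end
        else:
            # Overlapping kernels (e.g. on another stream) extend the interval.
            cur_end = max(cur_end, end)
    if cur_end is not None:
        busy += cur_end - cur_start
    return 1.0 - busy / (w_end - w_start)


# Made-up intervals in milliseconds, purely for illustration.
example_kernels = [(0.0, 1.0), (2.0, 3.0), (2.5, 3.5), (5.0, 6.0)]
print(f"idle: {idle_fraction(example_kernels, (0.0, 7.0)):.0%}")
```

Merging overlapping intervals first matters once several CUDA streams are active, otherwise concurrent kernels would be counted twice as busy time.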

![A few kernel Level Nsight-Systems Timeline Screenshot here](reports/snaps/torch_in_in.png)

Looking closer, we see that each kernel block (see streams channel) is preceded by a corresponding call to the kernel (CUDA API channel). The CUDA API makes a call to the kernel, followed by kernel execution, then idle time until another request is made. This launch-then-wait pattern suggests that the GPU is waiting on the host to issue kernel launches, which indicates a need for better kernel request handling.
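A toy timeline model helps show why this pattern leaves the GPU idle. The sketch assumes a single stream and a fixed host cost per kernel launch; `simulate_stream` and all numbers are hypothetical and are not taken from the profile.

```python
def simulate_stream(kernel_us, host_us):
    """Return (total_time, gpu_idle_time) in microseconds for one stream.

    kernel_us: list of kernel durations.
    host_us:   host-side cost of issuing each launch (framework plus API time).
    A kernel starts once its launch has been issued and the previous
    kernel has finished, whichever happens later.
    """
    host_time = 0.0    # when the host finishes issuing the current launch
    gpu_free = 0.0     # when the GPU finishes the previous kernel
    first_start = None
    busy = 0.0
    for duration in kernel_us:
        host_time += host_us
        start = max(host_time, gpu_free)
        if first_start is None:
            first_start = start
        gpu_free = start + duration
        busy += duration
    idle = (gpu_free - first_start) - busy
    return gpu_free, idle


kernels = [10.0] * 5  # five 10 us kernels (illustrative)

slow_total, slow_idle = simulate_stream(kernels, host_us=20.0)
fast_total, fast_idle = simulate_stream(kernels, host_us=2.0)
print(f"slow launches: total {slow_total} us, GPU idle {slow_idle} us")
print(f"fast launches: total {fast_total} us, GPU idle {fast_idle} us")
```

When issuing a launch costs more than the kernel it launches, the GPU finishes early and waits; when launches are cheap enough to be queued ahead of execution, the kernels run back to back.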


## 2. ONNX Runtime Inference

Loading the serialized (ONNX) version of the model and benchmarking inference with ONNX Runtime.
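The notebook loads `models/resnet50.onnx` without showing how the file was produced. One plausible way to create it is `torch.onnx.export`, sketched below; this is an assumption about the export step, not the author's confirmed procedure. The input name `"input"` matches the feed name used in the benchmark cell, while the output name `"output"` and the helper name are hypothetical.

```python
import os


def export_resnet50_to_onnx(path="models/resnet50.onnx"):
    """Export torchvision's ResNet-50 to ONNX with a fixed 1x3x224x224 input.

    Imports are local so that defining the helper does not require PyTorch.
    """
    import torch
    import torchvision.models as models

    directory = os.path.dirname(path)
    if directory:
        os.makedirs(directory, exist_ok=True)

    # Same model construction as the PyTorch baseline cell.
    model = models.resnet50(pretrained=True).eval()
    dummy_input = torch.randn(1, 3, 224, 224)

    torch.onnx.export(
        model,
        dummy_input,
        path,
        input_names=["input"],    # must match the name fed to session.run
        output_names=["output"],  # assumed name, not used elsewhere here
    )
    return path
```

Calling `export_resnet50_to_onnx()` once before the benchmark would write the file that both the ONNX Runtime and TensorRT sections read.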

In [32]:
# Load the ONNX model with GPU (CUDA) execution provider 
onnx_model_path = "models/resnet50.onnx"
# Request only the CUDA execution provider (no explicit CPU fallback listed)
providers = ['CUDAExecutionProvider']
session = ort.InferenceSession(onnx_model_path, providers=providers)

# Prepare input data
input_image = np.random.randn(1, 3, 224, 224).astype(np.float32)

# Warm-up
for _ in range(10):
    _ = session.run(None, {"input": input_image})

# Measure inference time
start_time = time.time()
for _ in range(100):
    _ = session.run(None, {"input": input_image})
end_time = time.time()

print(f'ONNX Runtime Inference Time: {(end_time - start_time) / 100:.6f} seconds')

ONNX Runtime Inference Time: 0.003505 seconds


##### ONNX Runtime Inference Time: 0.003505 seconds

![High Level Nsight-Systems Timeline Screenshot here](reports/snaps/onnx_out.png)

Using the patterns in the CUDA HW Memory channel as markers, we cut out a single 3.6 ms inference window. The iteration durations are much more consistent compared to PyTorch.

![A single iteration Level Nsight-Systems Timeline Screenshot here](reports/snaps/onnx_in.png)

A closer look reveals reduced idle times between CUDA API calls to kernels. The majority of the inference window is now occupied by kernel execution, with minimal idle time. Speedups should now come from accelerating individual kernel executions.

![A few kernel Level Nsight-Systems Timeline Screenshot here](reports/snaps/onnx_in_in.png)

Going deeper, we still see an alternation between kernel calls and executions, but the API calls are more evenly distributed, resulting in lower idle times. Queuing kernel call requests ahead of execution could further reduce the remaining idle time. We also observe more balanced activity across multiple CUDA streams, indicating a better-distributed workload and improved parallelism.


## 3. TensorRT Inference

Loading ONNX-serialized model and Benchmarking TensorRT-accelerated Inference

In [21]:
# Initialize CUDA
cuda.init()

# Create CUDA context
device = cuda.Device(0)
cuda_context = device.make_context()

# Load the ONNX model
onnx_model_path = "models/resnet50.onnx"
onnx_model = onnx.load(onnx_model_path)

# Create a TensorRT logger
TRT_LOGGER = trt.Logger(trt.Logger.WARNING)

In [22]:
# Build the TensorRT engine
def build_engine(onnx_file_path):
    with trt.Builder(TRT_LOGGER) as builder, \
         builder.create_network(1 << int(trt.NetworkDefinitionCreationFlag.EXPLICIT_BATCH)) as network, \
         trt.OnnxParser(network, TRT_LOGGER) as parser:
        
        config = builder.create_builder_config()
        config.set_memory_pool_limit(trt.MemoryPoolType.WORKSPACE, 1 << 30)  # 1GB

        with open(onnx_file_path, 'rb') as model:
            if not parser.parse(model.read()):
                print('Failed to parse the ONNX file')
                for error in range(parser.num_errors):
                    print(parser.get_error(error))
                return None
            
        serialized_engine = builder.build_serialized_network(network, config)
        if serialized_engine is None:
            print("Failed to build the engine.")
            return None
        
        runtime = trt.Runtime(TRT_LOGGER)
        return runtime.deserialize_cuda_engine(serialized_engine)

engine = build_engine(onnx_model_path)
if engine is None:
    # Raise instead of exit() so the notebook kernel is not shut down
    raise RuntimeError("Engine could not be created.")

In [23]:
# Allocate buffers
inputs, outputs, bindings, stream = [], [], [], cuda.Stream()
for i in range(engine.num_io_tensors):  # tensor-name API, consistent with get_tensor_name below
    binding = engine.get_tensor_name(i)
    size = trt.volume(engine.get_tensor_shape(binding))
    dtype = trt.nptype(engine.get_tensor_dtype(binding))
    host_mem = cuda.pagelocked_empty(size, dtype)
    device_mem = cuda.mem_alloc(host_mem.nbytes)
    bindings.append(int(device_mem))
    if engine.get_tensor_mode(binding) == trt.TensorIOMode.INPUT:
        inputs.append((host_mem, device_mem))
    else:
        outputs.append((host_mem, device_mem))

# Create context
context = engine.create_execution_context()

In [24]:
# Prepare input data
input_image = cv2.imread("data/croc.jpeg")
input_image = cv2.resize(input_image, (224, 224))
input_image = input_image.astype(np.float32)
input_image = input_image.transpose(2, 0, 1)  # HWC to CHW
input_image = np.expand_dims(input_image, axis=0)  # Add batch dimension
input_image = np.ascontiguousarray(input_image)

# Copy input data to pagelocked buffer
np.copyto(inputs[0][0], input_image.ravel())

In [25]:
# Warm-up
for _ in range(10):
    cuda.memcpy_htod_async(inputs[0][1], inputs[0][0], stream)
    context.execute_async_v2(bindings=bindings, stream_handle=stream.handle)
    cuda.memcpy_dtoh_async(outputs[0][0], outputs[0][1], stream)
    stream.synchronize()

# Measure inference time
start_time = time.time()
for _ in range(100):
    cuda.memcpy_htod_async(inputs[0][1], inputs[0][0], stream)
    context.execute_async_v2(bindings=bindings, stream_handle=stream.handle)
    cuda.memcpy_dtoh_async(outputs[0][0], outputs[0][1], stream)
    stream.synchronize()
end_time = time.time()

print(f'TensorRT Inference Time: {(end_time - start_time) / 100:.6f} seconds')

# Clean up
del context
del engine
cuda_context.pop()
del cuda_context

TensorRT Inference Time: 0.001817 seconds


##### TensorRT Inference Time: 0.001817 seconds

![High Level Nsight-Systems Timeline Screenshot here](reports/snaps/tesnorrt_out.png)

We now see a dedicated TensorRT channel. Using this and the CUDA HW memory patterns, we isolate an inference.

![Single iteration Level Nsight-Systems Timeline Screenshot here](reports/snaps/tensorrt_in.png)

We observe that the CUDA API calls for kernels are queued together, enabling a contiguous block of kernel execution with virtually no idle time once execution begins. There are fewer kernels, and they differ from those of the previous methods because of kernel auto-tuning and layer fusion, which significantly reduce total kernel execution time. However, the host now spends a large part of each iteration waiting in the cuStreamSynchronize block, which suggests potential for further optimization by speeding up individual kernels.


## 4. Quantized TensorRT Inference

Benchmarking mixed-precision quantized model using TensorRT

In [40]:
# Initialize CUDA
cuda.init()

# Create CUDA context
device = cuda.Device(0)
cuda_context = device.make_context()

# Load the ONNX model
onnx_model_path = "models/resnet50.onnx"
onnx_model = onnx.load(onnx_model_path)

# Create a TensorRT logger
TRT_LOGGER = trt.Logger(trt.Logger.WARNING)

In [41]:
# Define INT8 Calibrator class
class PythonEntropyCalibrator(trt.IInt8EntropyCalibrator2):
    def __init__(self, data, batch_size):
        trt.IInt8EntropyCalibrator2.__init__(self)
        self.data = data
        self.batch_size = batch_size
        self.current_index = 0
        # Only one batch is copied per get_batch call, so one batch of device memory suffices
        self.device_input = cuda.mem_alloc(data[:batch_size].nbytes)

    def get_batch_size(self):
        return self.batch_size

    def get_batch(self, names):
        if self.current_index + self.batch_size > self.data.shape[0]:
            return None
        batch = self.data[self.current_index:self.current_index + self.batch_size].ravel()
        cuda.memcpy_htod(self.device_input, batch)
        self.current_index += self.batch_size
        return [int(self.device_input)]

    def read_calibration_cache(self):
        return None

    def write_calibration_cache(self, cache):
        return None
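To illustrate what calibration produces, here is a pure-Python sketch of symmetric INT8 quantization. It uses simple max calibration, which is a deliberately swapped-in simplification: the entropy calibrator above instead chooses the range so as to minimize information loss. All function names and sample values are hypothetical.

```python
def calibrate_scale(values):
    """Max calibration: map the largest magnitude seen to the INT8 limit 127."""
    amax = max(abs(v) for v in values)
    return amax / 127.0 if amax > 0 else 1.0


def quantize(values, scale):
    """Round to the nearest INT8 step and clamp to [-127, 127]."""
    return [max(-127, min(127, round(v / scale))) for v in values]


def dequantize(q_values, scale):
    """Map INT8 codes back to approximate floating-point values."""
    return [q * scale for q in q_values]


# Made-up activations standing in for one tensor seen during calibration.
activations = [0.02, -1.27, 0.635, 0.9]
scale = calibrate_scale(activations)
codes = quantize(activations, scale)
restored = dequantize(codes, scale)

for original, code, approx in zip(activations, codes, restored):
    print(f"{original:+.4f} -> {code:+4d} -> {approx:+.4f}")
```

The round trip shows the trade-off: every value is stored in 8 bits, at the price of an error of up to half a quantization step per element.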

In [None]:
# Function to build the engine
def build_engine(onnx_file_path):
    with trt.Builder(TRT_LOGGER) as builder, \
         builder.create_network(1 << int(trt.NetworkDefinitionCreationFlag.EXPLICIT_BATCH)) as network, \
         trt.OnnxParser(network, TRT_LOGGER) as parser:
        
        config = builder.create_builder_config()
        config.set_memory_pool_limit(trt.MemoryPoolType.WORKSPACE, 1 << 30)  # 1GB
        
        if builder.platform_has_fast_fp16:
            config.set_flag(trt.BuilderFlag.FP16)
        if builder.platform_has_fast_int8:
            config.set_flag(trt.BuilderFlag.INT8)
            # Create dummy calibration data for example purposes
            dummy_data = np.random.random((100, 3, 224, 224)).astype(np.float32)
            calibrator = PythonEntropyCalibrator(dummy_data, batch_size=1)
            config.int8_calibrator = calibrator

        with open(onnx_file_path, 'rb') as model:
            if not parser.parse(model.read()):
                print('Failed to parse the ONNX file')
                for error in range(parser.num_errors):
                    print(parser.get_error(error))
                return None
            
        serialized_engine = builder.build_serialized_network(network, config)
        if serialized_engine is None:
            print("Failed to build the engine.")
            return None
        
        runtime = trt.Runtime(TRT_LOGGER)
        return runtime.deserialize_cuda_engine(serialized_engine)

engine = build_engine(onnx_model_path)
if engine is None:
    # Raise instead of exit() so the notebook kernel is not shut down
    raise RuntimeError("Engine could not be created.")

In [43]:
# Allocate buffers
inputs, outputs, bindings, stream = [], [], [], cuda.Stream()
for i in range(engine.num_io_tensors):  # tensor-name API, consistent with get_tensor_name below
    binding = engine.get_tensor_name(i)
    size = trt.volume(engine.get_tensor_shape(binding))
    dtype = trt.nptype(engine.get_tensor_dtype(binding))
    host_mem = cuda.pagelocked_empty(size, dtype)
    device_mem = cuda.mem_alloc(host_mem.nbytes)
    bindings.append(int(device_mem))
    if engine.get_tensor_mode(binding) == trt.TensorIOMode.INPUT:
        inputs.append((host_mem, device_mem))
    else:
        outputs.append((host_mem, device_mem))

# Create context
context = engine.create_execution_context()

In [44]:
# Prepare input data
input_image = cv2.imread("data/croc.jpeg")
input_image = cv2.resize(input_image, (224, 224))
input_image = input_image.astype(np.float32)
input_image = input_image.transpose(2, 0, 1)  # HWC to CHW
input_image = np.expand_dims(input_image, axis=0)  # Add batch dimension
input_image = np.ascontiguousarray(input_image)

# Copy input data to pagelocked buffer
np.copyto(inputs[0][0], input_image.ravel())

In [45]:
# Warm-up
for _ in range(10):
    cuda.memcpy_htod_async(inputs[0][1], inputs[0][0], stream)
    context.execute_async_v2(bindings=bindings, stream_handle=stream.handle)
    cuda.memcpy_dtoh_async(outputs[0][0], outputs[0][1], stream)
    stream.synchronize()

# Measure inference time
start_time = time.time()
for _ in range(100):
    cuda.memcpy_htod_async(inputs[0][1], inputs[0][0], stream)
    context.execute_async_v2(bindings=bindings, stream_handle=stream.handle)
    cuda.memcpy_dtoh_async(outputs[0][0], outputs[0][1], stream)
    stream.synchronize()
end_time = time.time()

print(f'Quantized TensorRT Inference Time: {(end_time - start_time) / 100:.6f} seconds')

# Clean up
del context
del engine
cuda_context.pop()
del cuda_context

Quantized TensorRT Inference Time: 0.000506 seconds


##### Quantized TensorRT Inference Time: 0.000506 seconds

![Single iteration Level Nsight-Systems Timeline Screenshot here](reports/snaps/quant_in.png)

We now see negligible cuStreamSynchronize time, indicating that individual kernels execute very quickly thanks to INT8 calibration and FP16 mixed precision. Execution is now so fast that kernel idle time reappears while the GPU waits for the next kernel call to be queued. The next steps would be to test the impact of lower precision on result quality, improve the management of the Memcpy HtoD block, and speed up the queuing of kernel calls.
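For the first of those next steps, a simple starting point is to compare the output scores of the full-precision and quantized engines on the same input. The sketch below is one possible check, not a procedure from this notebook; the function names and the score vectors are made up. A meaningful comparison would also need representative calibration data rather than the random dummy data used above.

```python
def top_k(scores, k):
    """Return the indices of the k largest scores, best first."""
    return sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)[:k]


def compare_outputs(reference, candidate, k=5):
    """Compare two score vectors produced for the same input."""
    ref_top = top_k(reference, k)
    cand_top = top_k(candidate, k)
    return {
        "top1_match": ref_top[0] == cand_top[0],
        "topk_overlap": len(set(ref_top) & set(cand_top)) / k,
        "max_abs_diff": max(abs(r - c) for r, c in zip(reference, candidate)),
    }


# Illustrative scores only; real vectors would come from outputs[0][0].
fp32_scores = [0.1, 2.5, -0.3, 1.9, 0.0, 0.7]
int8_scores = [0.2, 2.3, -0.2, 2.0, 0.1, 0.6]

report = compare_outputs(fp32_scores, int8_scores, k=3)
print(report)
```

Averaging `top1_match` over a labelled validation set would turn this per-image check into an accuracy-agreement figure.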

# Summary

| Inference Type            | Time (seconds) |
|---------------------------|----------------|
| PyTorch GPU               | 0.017316       |
| ONNX-Runtime GPU          | 0.003505       |
| TensorRT                  | 0.001817       |
| TensorRT + Quantization   | 0.000506       |

In [21]:
# Values copied manually from the benchmark outputs above
f'Inference time reduced to {(0.000506 / 0.017316) * 100:.1f}% of the baseline GPU time'

'Inference time reduced to 2.9% of the baseline GPU time'

While PyTorch with CUDA runs our model on the GPU, it has limitations in its native form. Moving to ONNX Runtime leverages operator-level optimizations and more evenly distributed CUDA API calls, significantly reducing GPU idle times. Transitioning to TensorRT provides scenario-tailored optimizations such as kernel auto-tuning, layer fusion, advanced memory management, and better kernel scheduling through meaningful queuing of CUDA API calls. Finally, reducing numerical precision enhances execution and memory transfer efficiency, resulting in lower latency and higher throughput.
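The contribution of each stage can be made explicit by deriving speedups from the timings printed by the benchmark cells. This is a small sketch over those four printed values; the throughput column assumes sequential batch-size-1 inference, which is how the loops above run.

```python
# Per-inference times (seconds) as printed by the four benchmark cells.
timings = {
    "PyTorch GPU": 0.017316,
    "ONNX Runtime GPU": 0.003505,
    "TensorRT": 0.001817,
    "TensorRT + Quantization": 0.000506,
}

baseline = timings["PyTorch GPU"]
rows = []
previous = None
for name, seconds in timings.items():
    vs_baseline = baseline / seconds
    vs_previous = previous / seconds if previous is not None else 1.0
    throughput = 1.0 / seconds  # inferences per second at batch size 1
    rows.append((name, vs_baseline, vs_previous, throughput))
    previous = seconds

print(f"{'Stage':<26}{'vs baseline':>12}{'vs previous':>13}{'inf/s':>10}")
for name, vs_baseline, vs_previous, throughput in rows:
    print(f"{name:<26}{vs_baseline:>11.1f}x{vs_previous:>12.1f}x{throughput:>10.0f}")
```

Reading the "vs previous" column shows how much each individual transition contributes, which the cumulative figure alone hides.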