# Setup and Imports

In [1]:
!nvidia-smi

Sun Aug 31 08:30:32 2025       
+-----------------------------------------------------------------------------------------+
| NVIDIA-SMI 550.54.15              Driver Version: 550.54.15      CUDA Version: 12.4     |
|-----------------------------------------+------------------------+----------------------+
| GPU  Name                 Persistence-M | Bus-Id          Disp.A | Volatile Uncorr. ECC |
| Fan  Temp   Perf          Pwr:Usage/Cap |           Memory-Usage | GPU-Util  Compute M. |
|                                         |                        |               MIG M. |
|   0  Tesla T4                       Off |   00000000:00:04.0 Off |                    0 |
| N/A   55C    P8              9W /   70W |       0MiB /  15360MiB |      0%      Default |
|                                         |                        |                  N/A |
+-----------------------------------------+------------------------+----------------------+
                                                

In [2]:
# Colab cell: basic helpers
!pip install onnx onnxruntime onnxsim pycuda
# NVIDIA Python index + tensorrt (often needed)
!pip install nvidia-pyindex
!pip install --upgrade nvidia-tensorrt
# Optional: torch-tensorrt to compile PyTorch models directly
!pip install torch-tensorrt

Collecting onnx
  Downloading onnx-1.19.0-cp312-cp312-manylinux2014_x86_64.manylinux_2_17_x86_64.whl.metadata (7.0 kB)
Collecting onnxruntime
  Downloading onnxruntime-1.22.1-cp312-cp312-manylinux_2_27_x86_64.manylinux_2_28_x86_64.whl.metadata (4.9 kB)
Collecting onnxsim
  Downloading onnxsim-0.4.36.tar.gz (21.0 MB)
  Preparing metadata (setup.py) ... done
Collecting pycuda
  Downloading pycuda-2025.1.1.tar.gz (1.7 MB)
  Installing build dependencies ... done
  Getting requirements to build wheel ... done
  Preparing metadata (pyproject.toml) ... done
Collecting coloredlogs (from onnxruntime)
  Downloading coloredlogs-15.0.1-py2.py3-none-any.whl.metadata (12 kB)
Collecting pytools>=2011.2 (f

In [3]:
import torch
import torch.nn as nn
import torch.optim as optim
import torchvision
import torchvision.transforms as transforms
from torch.utils.data import DataLoader

print("torch.cuda.available:", torch.cuda.is_available())
print("torch.version.cuda:", torch.version.cuda)
print("device name:", torch.cuda.get_device_name(0) if torch.cuda.is_available() else "no-gpu")

torch.cuda.available: True
torch.version.cuda: 12.6
device name: Tesla T4


Why: TensorRT requires an NVIDIA GPU + matching CUDA/driver. Check these first so you can pick the right binary/installation route. [NVIDIA Docs](https://docs.nvidia.com/deeplearning/tensorrt/latest/installing-tensorrt/overview.html?utm_source=chatgpt.com)
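
The two outputs above report different CUDA versions (12.4 from `nvidia-smi`, 12.6 from PyTorch). The version in the `nvidia-smi` header is the highest CUDA version the driver supports, not the installed toolkit. A minimal sketch of how the two numbers could be compared programmatically follows. The helper names are hypothetical, and treating "same major version" as compatible is a simplifying assumption based on CUDA minor-version compatibility, not a guarantee.

```python
import re

def parse_smi_cuda_version(smi_text):
    """Return (major, minor) from the 'CUDA Version: X.Y' field of nvidia-smi output, or None."""
    match = re.search(r"CUDA Version:\s*(\d+)\.(\d+)", smi_text)
    return (int(match.group(1)), int(match.group(2))) if match else None

def parse_version(text):
    """Turn a version string such as '12.6' into a (major, minor) tuple."""
    major, minor = text.split(".")[:2]
    return int(major), int(minor)

def same_cuda_major(driver_version, runtime_version):
    """Assumption: a matching major version is treated as 'likely compatible'."""
    return driver_version[0] == runtime_version[0]

# Header line copied from the nvidia-smi output above; "12.6" is torch.version.cuda.
header = "| NVIDIA-SMI 550.54.15   Driver Version: 550.54.15   CUDA Version: 12.4 |"
driver_cuda = parse_smi_cuda_version(header)
torch_cuda = parse_version("12.6")
print(driver_cuda, torch_cuda, same_cuda_major(driver_cuda, torch_cuda))
```

In a live session, the header text would come from running `nvidia-smi` and the runtime version from `torch.version.cuda`.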

# Learning and Understanding

### **Workflow:**

**Start from NVIDIA TensorRT Developer Guide**
✔ The content references official NVIDIA docs and provides direct links for deeper reading (e.g., \[Quick Start Guide], \[Python API Docs]).

---

### **Practical steps:**

**1. Train a model in PyTorch/TensorFlow**
✔ This content trains a simple neural network for demonstration purposes.
✔ It explains how to export the model to ONNX.

---

**2. Export to ONNX**
✔ Covered in detail with `torch.onnx.export()` including:

* `opset_version`
* `dynamic_axes`
* Best practices like constant folding

---

**3. Use TensorRT to convert ONNX → TRT Engine**
✔ The content uses the TensorRT **Python API** (more programmatic and customizable).

It also adds:

* Workspace size setting
* FP16 flag
* Handling ONNX parse errors
* Explanation of explicit batch flag

---

**4. Apply FP16 or INT8 quantization**
✔ FP16: clearly shown via `config.set_flag(trt.BuilderFlag.FP16)`
✔ INT8: mentions calibrators and links to references for implementation (not shown fully, but that's expected for a quick guide).

---

**5. Benchmark performance**
✔ `trtexec --iterations` for quick profiling and latency measurement explained in detail.
✔ Mentions Nsight tools for advanced profiling.
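
For latency numbers measured from Python rather than from `trtexec`, a small timing harness along the following lines could be used. This is a sketch: the `benchmark` name, the warm-up count, and the percentile choice are assumptions, and the callable being timed must block until the GPU work has finished (for example by synchronizing its CUDA stream), otherwise only the launch overhead is measured.

```python
import statistics
import time

def benchmark(fn, warmup=10, iters=100):
    """Call fn() repeatedly and return latency statistics in milliseconds."""
    for _ in range(warmup):  # warm-up runs are not timed
        fn()
    samples = []
    for _ in range(iters):
        start = time.perf_counter()
        fn()
        samples.append((time.perf_counter() - start) * 1000.0)
    samples.sort()
    return {
        "iters": iters,
        "mean_ms": statistics.fmean(samples),
        "p50_ms": samples[len(samples) // 2],
        "p99_ms": samples[min(len(samples) - 1, int(len(samples) * 0.99))],
    }

# Stand-in workload; in the notebook this would be a call that runs one inference.
stats = benchmark(lambda: sum(range(1000)), warmup=5, iters=50)
print(stats)
```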


## Training a (Sample) Model in PyTorch

In [4]:
# ----------------------
# 1. Set device
# ----------------------
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
print("Using device:", device)

# ----------------------
# 2. Define transforms and load CIFAR-10
# ----------------------
transform = transforms.Compose([
    transforms.Resize((224, 224)),  # Resize to match dummy input shape
    transforms.ToTensor()
])

train_dataset = torchvision.datasets.CIFAR10(root='./data', train=True,
                                             download=True, transform=transform)
test_dataset = torchvision.datasets.CIFAR10(root='./data', train=False,
                                            download=True, transform=transform)

train_loader = DataLoader(train_dataset, batch_size=64, shuffle=True)
test_loader = DataLoader(test_dataset, batch_size=64, shuffle=False)

# ----------------------
# 3. Define a simple CNN
# ----------------------
class SimpleCNN(nn.Module):
    def __init__(self):
        super(SimpleCNN, self).__init__()
        self.features = nn.Sequential(
            nn.Conv2d(3, 16, kernel_size=3, padding=1),
            nn.ReLU(),
            nn.MaxPool2d(2),
            nn.Conv2d(16, 32, kernel_size=3, padding=1),
            nn.ReLU(),
            nn.MaxPool2d(2)
        )
        self.classifier = nn.Sequential(
            nn.Flatten(),
            nn.Linear(32 * 56 * 56, 128),  # 224→112→56 after two maxpools
            nn.ReLU(),
            nn.Linear(128, 10)
        )

    def forward(self, x):
        x = self.features(x)
        x = self.classifier(x)
        return x

model = SimpleCNN().to(device)

# ----------------------
# 4. Loss and optimizer
# ----------------------
criterion = nn.CrossEntropyLoss()
optimizer = optim.Adam(model.parameters(), lr=0.001)

# ----------------------
# 5. Training loop (short for demo)
# ----------------------
epochs = 2  # keep small for Colab demo
for epoch in range(epochs):
    model.train()
    total_loss = 0
    for images, labels in train_loader:
        images, labels = images.to(device), labels.to(device)

        optimizer.zero_grad()
        outputs = model(images)
        loss = criterion(outputs, labels)
        loss.backward()
        optimizer.step()
        total_loss += loss.item()
    print(f"Epoch [{epoch+1}/{epochs}], Loss: {total_loss/len(train_loader):.4f}")

# ----------------------
# 6. Test accuracy (quick check)
# ----------------------
model.eval()
correct = 0
total = 0
with torch.no_grad():
    for images, labels in test_loader:
        images, labels = images.to(device), labels.to(device)
        outputs = model(images)
        _, predicted = torch.max(outputs, 1)
        total += labels.size(0)
        correct += (predicted == labels).sum().item()
print(f"Test Accuracy: {100 * correct / total:.2f}%")

# ----------------------
# 7. Save model
# ----------------------
torch.save(model, '/content/model.pth')
print("Model saved as /content/model.pth")

Using device: cuda


100%|██████████| 170M/170M [00:04<00:00, 42.3MB/s]


Epoch [1/2], Loss: 1.6389
Epoch [2/2], Loss: 1.3562
Test Accuracy: 53.85%
Model saved as /content/model.pth
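
The input size of the first `nn.Linear` layer (`32 * 56 * 56`) follows from the spatial size after each conv and pooling step. The sketch below reproduces that arithmetic with the standard output-size formula; the function names are illustrative and not part of the notebook.

```python
def conv_out(size, kernel, stride=1, padding=0):
    """Output length along one spatial dimension for a conv or pooling layer."""
    return (size + 2 * padding - kernel) // stride + 1

def simple_cnn_flat_features(input_size=224):
    """Flattened feature count fed to the first Linear layer of SimpleCNN."""
    size = input_size
    channels = 3
    for out_channels in (16, 32):
        size = conv_out(size, kernel=3, stride=1, padding=1)  # 3x3 conv, padding 1: size unchanged
        size = conv_out(size, kernel=2, stride=2)             # MaxPool2d(2): size halved
        channels = out_channels
    return channels * size * size

print(simple_cnn_flat_features(224))  # 32 * 56 * 56 = 100352
```

If the `Resize((224, 224))` transform were changed, the `nn.Linear` input size would have to change with it.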


## Export to ONNX

In [5]:
# Save only weights
torch.save(model.state_dict(), '/content/model.pth')

# Load weights into model
model = SimpleCNN()  # initialize same architecture
model.load_state_dict(torch.load('/content/model.pth', map_location='cpu'))
model.eval()

# Dummy input (match the training shape)
dummy_input = torch.randn(1, 3, 224, 224)

torch.onnx.export(
    model,
    dummy_input,
    "model.onnx",
    export_params=True,
    opset_version=13,
    do_constant_folding=True,
    input_names=["input"],
    output_names=["output"],
    dynamic_axes={"input": {0: "batch_size"}, "output": {0: "batch_size"}}
)

print("ONNX model exported as model.onnx")

ONNX model exported as model.onnx


Notes:

- Pick `opset_version` >= 11. This notebook uses 13, which TensorRT's ONNX parser supports.
[PyTorch Documentation](https://docs.pytorch.org/tutorials/beginner/onnx/export_simple_model_to_onnx_tutorial.html?utm_source=chatgpt.com)

- Save the model with `model.state_dict()` rather than `torch.save(model, ...)`. Saving the whole object pickles a reference to the class definition, which can be a hindrance when the model is reloaded for ONNX export in another session or script.

### **`torch.onnx.export(...)`**

This is the main ONNX export function. Parameters:

#### **`model`**

* The PyTorch model you want to export.

#### **`dummy_input`**

* A sample input tensor to trace the model graph.

#### **`"model.onnx"`**

* Output ONNX file name.

#### **`export_params=True`**

* Exports **all trained parameters** along with the graph (weights, biases).

#### **`opset_version=13`**

* ONNX **operator set version**.
* `13` is stable and widely supported by TensorRT.

  * If your model uses newer ops, you might need a higher opset.

(**Additional details:** Opset (Operator Set) in ONNX refers to the versioned set of operations (ops) that define the computation graph.

Each ONNX model specifies an opset version (e.g., 11, 13, 17).

Higher opset = newer operators and features.

Exporting with a supported opset ensures compatibility with inference engines (like TensorRT).

So, opset_version=13 means the model uses ONNX opset v13, which TensorRT supports.)

<br>

#### **`do_constant_folding=True`**

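* Performs **constant folding** during export:

  * Subgraphs whose inputs are all constants (e.g., a fixed scale multiplied by a fixed weight) are precomputed once and stored as a single constant, so they are not recomputed at inference time.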
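
To make the idea concrete, here is a toy constant folder over a tiny expression tree. It only illustrates the concept and is not how the PyTorch exporter is implemented; the tuple-based node format and the `fold` function are invented for this example.

```python
import operator

OPS = {"add": operator.add, "mul": operator.mul}

def fold(node):
    """Fold a tiny expression tree.

    Nodes are ("const", value), ("input", name), or (op, left, right).
    """
    kind = node[0]
    if kind in ("const", "input"):
        return node
    left, right = fold(node[1]), fold(node[2])
    if left[0] == "const" and right[0] == "const":
        # Both operands are known at export time: precompute the result.
        return ("const", OPS[kind](left[1], right[1]))
    return (kind, left, right)

# x + (2 * 3): the multiplication has only constant inputs, so it is folded to 6.
expr = ("add", ("input", "x"), ("mul", ("const", 2), ("const", 3)))
print(fold(expr))  # ('add', ('input', 'x'), ('const', 6))
```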

#### **`input_names=["input"], output_names=["output"]`**

* Names for model input and output nodes (useful for later inference).

#### **`dynamic_axes={"input": {0: "batch_size"}, "output": {0: "batch_size"}}`**

* Allows **dynamic batch size**.
* Means:

  * For the input tensor: axis `0` is named `batch_size` and can change at runtime.
  * For the output tensor: axis `0` can also change.
* Without this, the batch dimension is fixed to that of `dummy_input`, so the exported model would **only accept batch size 1**.

In [6]:
import onnx
from onnxsim import simplify

onnx_model = onnx.load("model.onnx")
model_simp, check = simplify(onnx_model)
if not check:
    raise RuntimeError("ONNX simplifier failed")
onnx.save(model_simp, "model_simplified.onnx")

# sanity check
onnx.checker.check_model("model_simplified.onnx")
print("ONNX simplified and valid")

ONNX simplified and valid


- Some framework exports add redundant nodes, so simplify the graph and validate the result before handing it to TensorRT.
- This reduces parsing errors with TensorRT and can speed up conversion.
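
To see what the simplifier actually removed, the operator counts before and after can be compared. The following is a sketch with hypothetical helper names; the stand-in node lists are invented so it runs without a model file, and for a small network like `SimpleCNN` the real difference may well be empty.

```python
from collections import Counter
from types import SimpleNamespace

def op_histogram(nodes):
    """Count graph nodes by operator type (works on onnx_model.graph.node)."""
    return Counter(node.op_type for node in nodes)

def removed_ops(before_nodes, after_nodes):
    """Operators whose count dropped after simplification, with the size of the drop."""
    before, after = op_histogram(before_nodes), op_histogram(after_nodes)
    return {op: n - after.get(op, 0) for op, n in before.items() if n > after.get(op, 0)}

# Stand-in nodes; with the notebook's objects the call would be
# removed_ops(onnx_model.graph.node, model_simp.graph.node).
before = [SimpleNamespace(op_type=t) for t in ("Conv", "Relu", "Identity", "Shape", "Gemm")]
after = [SimpleNamespace(op_type=t) for t in ("Conv", "Relu", "Gemm")]
print(removed_ops(before, after))  # {'Identity': 1, 'Shape': 1}
```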

## ONNX to TRT Engine

### With FP16 Precision Enabled

In [7]:
import tensorrt as trt

TRT_LOGGER = trt.Logger(trt.Logger.WARNING)
builder = trt.Builder(TRT_LOGGER)
network = builder.create_network(1 << int(trt.NetworkDefinitionCreationFlag.EXPLICIT_BATCH))
parser = trt.OnnxParser(network, TRT_LOGGER)

with open("model_simplified.onnx", "rb") as f:
    if not parser.parse(f.read()):
        for i in range(parser.num_errors):
            print(parser.get_error(i))
        raise RuntimeError("Failed to parse ONNX")

config = builder.create_builder_config()
config.set_memory_pool_limit(trt.MemoryPoolType.WORKSPACE, 1 << 30)  # 1 GB
config.set_flag(trt.BuilderFlag.FP16)

# Add optimization profile for dynamic input
profile = builder.create_optimization_profile()
profile.set_shape("input", (1, 3, 224, 224), (4, 3, 224, 224), (8, 3, 224, 224))
config.add_optimization_profile(profile)

# Build serialized engine
serialized_engine = builder.build_serialized_network(network, config)
if serialized_engine is None:
    raise RuntimeError("Failed to build TensorRT engine")

with open("model_trt.engine", "wb") as f:
    f.write(serialized_engine)

print("Saved TRT engine as model_trt.engine")

Saved TRT engine as model_trt.engine


If the parser prints errors, inspect them for unsupported ops, then either modify the model, fall back to ONNX Runtime, or implement a TRT plugin. [NVIDIA Docs](https://docs.nvidia.com/deeplearning/tensorrt/latest/inference-library/python-api-docs.html?utm_source=chatgpt.com)

* **`TRT_LOGGER = trt.Logger(trt.Logger.WARNING)`**
  → Creates a logger for TensorRT messages (only warnings and errors will be shown).

* **`builder = trt.Builder(TRT_LOGGER)`**
  → Builder object responsible for compiling and optimizing the network.

* **`builder.create_network(1 << int(trt.NetworkDefinitionCreationFlag.EXPLICIT_BATCH))`**
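  → Creates a network definition with the **explicit batch flag**. Explicit batch mode is what makes dynamic shapes and optimization profiles possible. In TensorRT 10, networks are always explicit batch and the flag is deprecated and ignored, so it is kept here only for compatibility with older releases.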

* **`parser = trt.OnnxParser(network, TRT_LOGGER)`**
  → Parses the ONNX model into TensorRT’s internal network format.

* **ONNX Parsing Block**

  ```python
  with open("model_simplified.onnx", "rb") as f:
      if not parser.parse(f.read()):
          for i in range(parser.num_errors):
              print(parser.get_error(i))
          raise RuntimeError("Failed to parse ONNX")
  ```

  → Reads ONNX file, parses it, and prints detailed errors if parsing fails.

* **`config = builder.create_builder_config()`**
  → Creates a configuration object for optimization and precision settings.

* **`config.set_memory_pool_limit(trt.MemoryPoolType.WORKSPACE, 1 << 30)`**
  → Caps the temporary GPU memory (workspace) that any layer may use at **1 GiB** (`1 << 30` bytes). Kernel implementations that need more scratch space are skipped during kernel selection.

* **`config.set_flag(trt.BuilderFlag.FP16)`**
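  → Enables **FP16 precision mode**: TensorRT may pick FP16 kernels where they are faster, which uses Tensor Cores on GPUs that have them (the Tesla T4 used here does). Individual layers can still run in FP32 if that turns out to be faster.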

* **Optimization Profile**

  ```python
  profile = builder.create_optimization_profile()
  profile.set_shape("input", (1, 3, 224, 224), (4, 3, 224, 224), (8, 3, 224, 224))
  config.add_optimization_profile(profile)
  ```

  → Adds **dynamic shape support** for the input tensor named `"input"`:

  * **min shape**: `(1, 3, 224, 224)`
  * **opt shape**: `(4, 3, 224, 224)` (preferred batch size for optimization)
  * **max shape**: `(8, 3, 224, 224)`

  The profile enables **dynamic batching** (any batch size from 1 to 8 here), which is key for real deployments.

* **`builder.build_serialized_network(network, config)`**
  → Builds and **serializes the TensorRT engine** in one step (modern approach).
  The older `builder.build_engine()` was deprecated in TensorRT 8 and removed in TensorRT 10; `build_serialized_network()` replaces it.

* **Write Engine to Disk**

  ```python
  with open("model_trt.engine", "wb") as f:
      f.write(serialized_engine)
  ```

  → Saves the engine as `model_trt.engine` for deployment.
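
The optimization profile above bounds the shapes the engine will accept at runtime: every dimension of a requested input shape has to lie between the profile's min and max. TensorRT rejects out-of-range shapes itself, but a pre-check such as the sketch below can produce a clearer error before any GPU memory is allocated. The `PROFILE` dictionary and both function names are illustrative and simply mirror the values passed to `profile.set_shape`.

```python
# Mirrors profile.set_shape("input", min, opt, max) from the build cell.
PROFILE = {
    "input": {"min": (1, 3, 224, 224), "opt": (4, 3, 224, 224), "max": (8, 3, 224, 224)},
}

def shape_in_profile(shape, min_shape, max_shape):
    """True if every dimension of `shape` lies within [min, max]."""
    if not (len(shape) == len(min_shape) == len(max_shape)):
        return False
    return all(lo <= dim <= hi for dim, lo, hi in zip(shape, min_shape, max_shape))

def check_request(name, shape, profile=PROFILE):
    """Raise ValueError if `shape` falls outside the profile for tensor `name`."""
    bounds = profile[name]
    if not shape_in_profile(shape, bounds["min"], bounds["max"]):
        raise ValueError(
            f"{name}: shape {shape} is outside the profile range "
            f"{bounds['min']} .. {bounds['max']}"
        )
    return True

print(check_request("input", (4, 3, 224, 224)))  # True
print(shape_in_profile((16, 3, 224, 224), (1, 3, 224, 224), (8, 3, 224, 224)))  # False
```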


In [8]:
import tensorrt as trt
import pycuda.driver as cuda
import pycuda.autoinit
import numpy as np

TRT_LOGGER = trt.Logger(trt.Logger.WARNING)

# ----------------------
# LOAD ENGINE AND CONTEXT
# ----------------------
with open("/content/model_trt.engine", "rb") as f, trt.Runtime(TRT_LOGGER) as runtime:
    engine = runtime.deserialize_cuda_engine(f.read())

context = engine.create_execution_context()

print("Engine object:", engine)
print("Engine is None?", engine is None)

try:
    print("First binding name:", engine.get_binding_name(0))
except Exception as e:
    print("Error calling get_binding_name(0):", e)

print(dir(engine))

# ----------------------
# Robust helper (no engine.num_bindings)
# ----------------------

def _collect_binding_indices(engine):
    """
    Legacy helper for TensorRT releases that still expose the binding API.
    Probes get_binding_name(idx) until it raises, so it does not depend on
    engine.num_bindings. On TensorRT 10 get_binding_name() no longer exists,
    so this returns an empty list; it is not used by the code below.
    """
    idx = 0
    indices = []
    while True:
        try:
            _ = engine.get_binding_name(idx)
            indices.append(idx)
            idx += 1
        except Exception:
            break
    return indices

def allocate_buffers(engine, context, input_shapes):
    """
    Allocate host/device buffers using TensorRT 10's tensor name-based API.
    input_shapes: dict mapping input tensor name -> shape tuple
    """
    inputs, outputs, bindings = [], [], []
    stream = cuda.Stream()

    # Get all tensor names
    tensor_names = [engine.get_tensor_name(i) for i in range(engine.num_io_tensors)]

    # Set dynamic shapes if provided
    for name, shape in input_shapes.items():
        if name not in tensor_names:
            raise ValueError(f"Input tensor '{name}' not found in engine tensors: {tensor_names}")
        if engine.get_tensor_mode(name) == trt.TensorIOMode.INPUT:
            context.set_input_shape(name, tuple(shape))

    # Allocate memory for all tensors
    for name in tensor_names:
        shape = context.get_tensor_shape(name)
        if any(dim < 0 for dim in shape):
            raise RuntimeError(f"Tensor '{name}' has unresolved shape {tuple(shape)}. Set all input shapes first.")
        size = int(trt.volume(shape))
        dtype = trt.nptype(engine.get_tensor_dtype(name))

        host_mem = cuda.pagelocked_empty(size, dtype)
        device_mem = cuda.mem_alloc(host_mem.nbytes)

        bindings.append(int(device_mem))

        entry = {"name": name, "host": host_mem, "device": device_mem, "shape": tuple(shape)}
        if engine.get_tensor_mode(name) == trt.TensorIOMode.INPUT:
            inputs.append(entry)
        else:
            outputs.append(entry)

    return inputs, outputs, bindings, stream

def infer(context, bindings, inputs, outputs, stream):
    # Copy input host → device
    for inp in inputs:
        cuda.memcpy_htod_async(inp["device"], inp["host"], stream)

    # Run inference
    context.execute_v2(bindings)

    # Copy output device → host
    for out in outputs:
        cuda.memcpy_dtoh_async(out["host"], out["device"], stream)

    stream.synchronize()

    # Convert outputs to numpy
    return [np.array(out["host"]).reshape(out["shape"]) for out in outputs]

# ----------------------
# EXAMPLE USAGE
# ----------------------

# Example: dynamic batch size 4 for input named exactly as the ONNX input (check name)
dynamic_shape = {"input": (4, 3, 224, 224)}

# Allocate
inputs, outputs, bindings, stream = allocate_buffers(engine, context, dynamic_shape)

# Fill input (host buffer is 1D pagelocked array)
in_shape = inputs[0]["shape"]
dummy_input = np.random.randn(*in_shape).astype(np.float32).ravel()
np.copyto(inputs[0]["host"], dummy_input)

# Run
results = infer(context, bindings, inputs, outputs, stream)

print("Output binding shapes:", [out["shape"] for out in outputs])
print("First 10 output values:", results[0].flatten()[:10])

Engine object: <tensorrt_bindings.tensorrt.ICudaEngine object at 0x7f00d838d530>
Engine is None? False
Error calling get_binding_name(0): 'tensorrt_bindings.tensorrt.ICudaEngine' object has no attribute 'get_binding_name'
['__class__', '__del__', '__delattr__', '__dir__', '__doc__', '__enter__', '__eq__', '__exit__', '__format__', '__ge__', '__getattribute__', '__getitem__', '__getstate__', '__gt__', '__hash__', '__init__', '__init_subclass__', '__le__', '__lt__', '__module__', '__ne__', '__new__', '__pybind11_module_local_v4_gcc_libstdcpp_cxxabi1016__', '__reduce__', '__reduce_ex__', '__repr__', '__setattr__', '__sizeof__', '__str__', '__subclasshook__', 'create_engine_inspector', 'create_execution_context', 'create_execution_context_without_device_memory', 'create_runtime_config', 'create_serialization_config', 'device_memory_size', 'device_memory_size_v2', 'engine_capability', 'error_recorder', 'get_device_memory_size_for_profile', 'get_device_memory_size_for_profile_v2', 'get_tenso

# **TensorRT Inference Pipeline with PyCUDA (Dynamic Shapes)**

This script demonstrates how to **load a TensorRT engine**, allocate memory for inputs/outputs, and run inference using the **TensorRT 10 API** and **PyCUDA**.

---

## **1. Setup and Imports**

```python
import tensorrt as trt
import pycuda.driver as cuda
import pycuda.autoinit
import numpy as np
```

* **tensorrt** → Core library for TensorRT engine handling.
* **pycuda.driver** → CUDA memory allocation and data transfer.
* **pycuda.autoinit** → Automatically initializes CUDA context.
* **numpy** → Handles input/output tensors.

---

## **2. Load Serialized TensorRT Engine**

```python
TRT_LOGGER = trt.Logger(trt.Logger.WARNING)

with open("/content/model_trt.engine", "rb") as f, trt.Runtime(TRT_LOGGER) as runtime:
    engine = runtime.deserialize_cuda_engine(f.read())

context = engine.create_execution_context()
```

* **TRT\_LOGGER** → Logger for TensorRT warnings and errors.
* **deserialize\_cuda\_engine** → Loads a serialized `.engine` file into memory.
* **create\_execution\_context()** → Creates an execution context for inference (manages dynamic shapes, bindings).

---

## **3. Inspect Engine**

```python
print("Engine object:", engine)
print("Engine is None?", engine is None)
```

* Verifies that the engine is successfully loaded.

---

## **4. Dynamic Shape & Buffer Management**

TensorRT 10 addresses inputs and outputs by **tensor name** instead of the old index-based binding APIs.

### **Helper: Allocate Buffers**

```python
def allocate_buffers(engine, context, input_shapes):
    inputs, outputs, bindings = [], [], []
    stream = cuda.Stream()

    tensor_names = [engine.get_tensor_name(i) for i in range(engine.num_io_tensors)]

    # Set dynamic input shape
    for name, shape in input_shapes.items():
        if engine.get_tensor_mode(name) == trt.TensorIOMode.INPUT:
            context.set_input_shape(name, tuple(shape))

    # Allocate memory for each tensor
    for name in tensor_names:
        shape = context.get_tensor_shape(name)
        size = int(trt.volume(shape))
        dtype = trt.nptype(engine.get_tensor_dtype(name))

        host_mem = cuda.pagelocked_empty(size, dtype)  # CPU memory
        device_mem = cuda.mem_alloc(host_mem.nbytes)   # GPU memory

        bindings.append(int(device_mem))

        entry = {"name": name, "host": host_mem, "device": device_mem, "shape": tuple(shape)}
        if engine.get_tensor_mode(name) == trt.TensorIOMode.INPUT:
            inputs.append(entry)
        else:
            outputs.append(entry)

    return inputs, outputs, bindings, stream
```

✅ **Key points:**

* **`context.set_input_shape(name, shape)`** → Required for dynamic input sizes.
* **Host memory (CPU)**: `cuda.pagelocked_empty()` → Pinned memory for fast transfer.
* **Device memory (GPU)**: `cuda.mem_alloc()` → Allocates space on GPU.
* **bindings\[]** → List of device memory pointers passed to TensorRT.
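
The size of each buffer is the number of elements (`trt.volume(shape)`) times the element size. The short sketch below reproduces that arithmetic in plain Python for the shapes used in this notebook; `buffer_nbytes` is a hypothetical helper, and the `(4, 10)` output shape assumes the 10-class classifier defined earlier.

```python
import math

def buffer_nbytes(shape, itemsize):
    """Bytes needed for a dense tensor of `shape` with `itemsize` bytes per element."""
    if any(dim < 0 for dim in shape):
        raise ValueError(f"shape {shape} still has a dynamic (-1) dimension")
    return math.prod(shape) * itemsize

# float32 input for batch size 4, and the matching 10-class output.
print(buffer_nbytes((4, 3, 224, 224), 4))  # 2408448 bytes (about 2.3 MiB)
print(buffer_nbytes((4, 10), 4))           # 160 bytes
```

A `-1` in the shape means the input shape has not been set on the context yet, which is why the allocation has to happen after `set_input_shape`.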

---

## **5. Inference Execution**

```python
def infer(context, bindings, inputs, outputs, stream):
    for inp in inputs:
        cuda.memcpy_htod_async(inp["device"], inp["host"], stream)

    context.execute_v2(bindings)

    for out in outputs:
        cuda.memcpy_dtoh_async(out["host"], out["device"], stream)

    stream.synchronize()

    return [np.array(out["host"]).reshape(out["shape"]) for out in outputs]
```

✅ **Steps:**

* Copy **inputs from host → GPU** (`memcpy_htod_async`).
* Execute inference with `context.execute_v2(bindings)`.
* Copy **outputs from GPU → host** (`memcpy_dtoh_async`).
* Synchronize CUDA stream to ensure completion.
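
The `infer` helper above issues its copies asynchronously on `stream` but runs the engine with the synchronous `execute_v2`. A variant that keeps the copies and the execution on the same stream, using `set_tensor_address` and `execute_async_v3` from the tensor-name API, might look like the sketch below. The name `infer_v3` and the `cuda_driver` parameter are illustrative: passing the `pycuda.driver` module in as an argument is only a device that lets the dry run at the bottom execute with stand-in objects and no GPU.

```python
from unittest import mock

def infer_v3(context, inputs, outputs, stream, cuda_driver):
    """Run one inference with copies and execution enqueued on `stream`.

    context        : trt.IExecutionContext with input shapes already set
    inputs/outputs : lists of dicts as returned by allocate_buffers
    stream         : pycuda.driver.Stream
    cuda_driver    : the pycuda.driver module
    """
    # Tell TensorRT where each named tensor lives on the device.
    for entry in inputs + outputs:
        context.set_tensor_address(entry["name"], int(entry["device"]))

    for inp in inputs:
        cuda_driver.memcpy_htod_async(inp["device"], inp["host"], stream)

    if not context.execute_async_v3(stream_handle=stream.handle):
        raise RuntimeError("TensorRT failed to enqueue inference")

    for out in outputs:
        cuda_driver.memcpy_dtoh_async(out["host"], out["device"], stream)

    stream.synchronize()  # wait for copies and kernels to finish
    return [out["host"].reshape(out["shape"]) for out in outputs]

# Dry run with stand-in objects (no GPU needed) to show the call sequence.
ctx, stream, drv = mock.MagicMock(), mock.MagicMock(), mock.MagicMock()
ctx.execute_async_v3.return_value = True
demo_in = [{"name": "input", "device": 1, "host": mock.MagicMock(), "shape": (4, 3, 224, 224)}]
demo_out = [{"name": "output", "device": 2, "host": mock.MagicMock(), "shape": (4, 10)}]
demo_results = infer_v3(ctx, demo_in, demo_out, stream, drv)
print(ctx.mock_calls)
```

With real objects the call would be `infer_v3(context, inputs, outputs, stream, cuda)`, where `cuda` is the `pycuda.driver` import from the cell above.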

---

## **6. Example Usage**

```python
dynamic_shape = {"input": (4, 3, 224, 224)}

# Allocate memory
inputs, outputs, bindings, stream = allocate_buffers(engine, context, dynamic_shape)

# Fill input tensor
dummy_input = np.random.randn(*inputs[0]["shape"]).astype(np.float32).ravel()
np.copyto(inputs[0]["host"], dummy_input)

# Run inference
results = infer(context, bindings, inputs, outputs, stream)

print("Output binding shapes:", [out["shape"] for out in outputs])
print("First 10 output values:", results[0].flatten()[:10])
```

✅ **What happens here:**

* **Dynamic batch size**: `(4, 3, 224, 224)`.
* Generate random input data and copy into host buffer.
* Run inference and display output shape & sample predictions.
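
Because the engine was built with the FP16 flag, its outputs will generally not match the PyTorch FP32 outputs bit for bit, so any comparison needs a tolerance. The sketch below uses only the standard library to show how much a single FP16 rounding changes a value and how a tolerance-based check absorbs it. The helper names and the tolerance values are assumptions for illustration; a real engine accumulates error across layers, so suitable tolerances have to be determined empirically.

```python
import struct

def to_fp16(x):
    """Round a Python float to the nearest IEEE 754 half-precision value."""
    return struct.unpack("<e", struct.pack("<e", x))[0]

def allclose(a, b, rtol=1e-2, atol=1e-3):
    """Element-wise |a - b| <= atol + rtol * |b| for two equal-length sequences."""
    return all(abs(x - y) <= atol + rtol * abs(y) for x, y in zip(a, b))

reference = [0.1, -1.2345, 3.14159, 10.0]       # stand-in for FP32 logits
fp16_like = [to_fp16(v) for v in reference]     # same values after FP16 rounding

print(fp16_like)
print("exact match:", fp16_like == reference)
print("within tolerance:", allclose(fp16_like, reference))
```

In the notebook, the same kind of check could be applied to `results[0]` and the output of the PyTorch model for the same input batch.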

---

### **⚠ Why did `get_binding_name()` fail?**

* TensorRT 10 removed the index-based binding API in favor of a **tensor name-based API**:

  * Use `engine.get_tensor_name(i)` instead of `get_binding_name()`.
  * Use `engine.num_io_tensors` instead of `num_bindings`.