
# TensorRT Acceleration for YOLOv5 — End‑to‑End Notebook

This notebook shows **how TensorRT achieves inference acceleration** on a real model (**YOLOv5**), with two practical routes:

1. **One‑click route (recommended first):** Use YOLOv5's `export.py` to directly generate a TensorRT **engine** (the stock script emits raw predictions, so NMS still runs as post‑processing).
2. **Controllable route:** Export **ONNX** → build **TensorRT FP16 / INT8 engines** via Python APIs (dynamic shapes, timing cache, calibrator) → **async inference** with multiple streams.

> **Requirements (run-time environment):**
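> - NVIDIA GPU (Turing+ recommended, ideally Ampere/Ada)
> - CUDA + cuDNN + TensorRT 8.4+ or 9.x installed (the binding-based calls used in section 6, e.g. `execute_async_v2`, were removed in TensorRT 10)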
> - `pycuda`, `numpy`, `torch`
> - Internet access to clone YOLOv5 repo (or place it locally)
> 
> **You can run only the parts you need.**


## 0) Quick Environment Check

In [None]:

import sys, os, subprocess, platform
print("Python:", sys.version)
!nvidia-smi || echo "nvidia-smi not found (ensure NVIDIA drivers are installed)"
try:
    import tensorrt as trt
    print("TensorRT version:", trt.__version__)
except Exception as e:
    print("TensorRT import failed:", e)

try:
    import torch
    print("PyTorch:", torch.__version__, "| CUDA available:", torch.cuda.is_available())
except Exception as e:
    print("PyTorch import failed:", e)

try:
    import pycuda.driver as cuda
    import pycuda.autoinit  # noqa: F401
    print("PyCUDA OK")
except Exception as e:
    print("PyCUDA import failed:", e)
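Because the later cells depend on the TensorRT major version, a small version gate can make failures easier to read. The sketch below is illustrative only: `trt_major` and `has_binding_api` are hypothetical helper names, and the cut-off assumes the binding-index calls (`get_binding_index`, `execute_async_v2`) are unavailable from TensorRT 10 onward.

```python
def trt_major(version: str) -> int:
    """Return the major component of a version string such as '8.6.1.6'."""
    return int(version.split(".")[0])


def has_binding_api(version: str) -> bool:
    """Hypothetical helper: True when the binding-index API is expected to exist.

    Assumption: the binding-based calls are only present before TensorRT 10.
    """
    return trt_major(version) < 10


# Example strings only; in the notebook you would pass trt.__version__.
for v in ("8.6.1", "10.0.1"):
    label = "binding API" if has_binding_api(v) else "named-tensor API only"
    print(v, "->", label)
```

In the notebook you could call `has_binding_api(trt.__version__)` before running section 6 and stop early with a clear message if it returns `False`.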



## 1) Get YOLOv5 and Weights

- Clone YOLOv5 (official Ultralytics repo).
- Download a small model weight (e.g., `yolov5s.pt`).

> If you already have the repo and weights, you can skip cloning/downloading.


In [None]:

import os, pathlib, sys
WORKDIR = pathlib.Path.cwd() / "yolov5"
if not WORKDIR.exists():
    !git clone --depth=1 https://github.com/ultralytics/yolov5.git
else:
    print("YOLOv5 repo already exists at", WORKDIR)

# Install Python deps for YOLOv5 (optional; comment out if handled elsewhere)
%cd yolov5
!pip -q install -r requirements.txt || echo "pip install skipped or failed"

# Get a small weights file (plain Python: a shell heredoc does not work inside a notebook cell)
from urllib.request import urlretrieve

WEIGHTS = "yolov5s.pt"
if not os.path.exists(WEIGHTS):
    url = "https://github.com/ultralytics/yolov5/releases/download/v7.0/yolov5s.pt"
    try:
        urlretrieve(url, WEIGHTS)
        print("Downloaded", WEIGHTS)
    except Exception as e:
        print("Download of yolov5s.pt failed:", e)
else:
    print("Weights already present:", WEIGHTS)
%cd ..



## 2) Route A — One‑Click Export to TensorRT Engine

YOLOv5's `export.py` supports `--include engine` to produce a TensorRT `.engine` directly.
The stock script does not embed an NMS plugin such as EfficientNMS: the engine returns raw predictions, and NMS runs as post‑processing.

In [None]:

%cd yolov5

ENGINE_NAME = "yolov5s_fp16.engine"
# Fixed 640x640 (stable latency). export.py rejects --dynamic combined with --half, so drop --half if you need variable shapes
!python export.py --weights yolov5s.pt --include engine --img 640 640 --half --device 0

# Move engine to notebook root (optional)
import shutil, os
if os.path.exists("yolov5s.engine"):
    shutil.move("yolov5s.engine", f"../{ENGINE_NAME}")
    print("Engine saved to:", f"../{ENGINE_NAME}")
else:
    print("Engine not found (export may have failed). Check logs above.")

%cd ..



## 3) Route B — Export ONNX from YOLOv5

This gives you more control for later building FP16/INT8 engines via the TensorRT Python API.


In [None]:

%cd yolov5
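# --dynamic keeps batch/height/width symbolic so the optimization profiles in sections 4 and 5 can apply
!python export.py --weights yolov5s.pt --include onnx --img 640 640 --opset 13 --simplify --dynamic --device 0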

import shutil, os
if os.path.exists("yolov5s.onnx"):
    shutil.move("yolov5s.onnx", "../yolov5s.onnx")
    print("Exported ONNX to ../yolov5s.onnx")
else:
    print("ONNX not found. Check export logs.")
%cd ..



## 4) Build FP16 TensorRT Engine (Dynamic Shapes + Timing Cache)

This cell demonstrates:
- **FP16** build flag
- **Optimization Profile** (min/opt/max shapes)
- **Timing Cache** load/save
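An optimization profile declares the range of input shapes the engine accepts: every dimension of a runtime shape must lie between the profile's min and max. As a rough illustration, the check below mirrors that rule in plain Python. `fits_profile` is a hypothetical helper, and the `stride=32` alignment is an assumption based on YOLOv5s downsampling by up to 32.

```python
def fits_profile(shape, min_shape, max_shape, stride=32):
    """Return True if `shape` (N, C, H, W) lies inside [min_shape, max_shape].

    `stride` is an assumption: YOLOv5s downsamples by up to 32, so H and W
    are expected to be multiples of 32.
    """
    if not (len(shape) == len(min_shape) == len(max_shape)):
        return False
    in_range = all(lo <= d <= hi for d, lo, hi in zip(shape, min_shape, max_shape))
    aligned = all(d % stride == 0 for d in shape[2:])
    return in_range and aligned


# Same ranges as the profile used in the build cell.
MIN_SHAPE = (1, 3, 320, 320)
MAX_SHAPE = (16, 3, 1280, 1280)

print(fits_profile((4, 3, 640, 640), MIN_SHAPE, MAX_SHAPE))   # inside the range
print(fits_profile((32, 3, 640, 640), MIN_SHAPE, MAX_SHAPE))  # batch exceeds max
```

Running such a check before setting the input shape on an execution context turns an opaque runtime error into an explicit one. The opt shape is not a constraint; it only tells the builder which shape to tune kernels for.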


In [None]:

import tensorrt as trt, os, pathlib

ONNX_PATH = "yolov5s.onnx"
ENGINE_FP16 = "yolov5s_fp16.plan"
TIMING_CACHE = "yolo_timing.cache"

logger = trt.Logger(trt.Logger.INFO)
builder = trt.Builder(logger)
flag = 1 << int(trt.NetworkDefinitionCreationFlag.EXPLICIT_BATCH)
network = builder.create_network(flag)
parser = trt.OnnxParser(network, logger)

assert os.path.exists(ONNX_PATH), "Missing yolov5s.onnx. Run the ONNX export cell first."
with open(ONNX_PATH, "rb") as f:
    parsed = parser.parse(f.read())
    if not parsed:
        for i in range(parser.num_errors):
            print(parser.get_error(i))
    assert parsed, "ONNX parse failed."

config = builder.create_builder_config()
config.set_memory_pool_limit(trt.MemoryPoolType.WORKSPACE, 8 << 30)  # 8GB for tactic search

if builder.platform_has_fast_fp16:
    config.set_flag(trt.BuilderFlag.FP16)

# Dynamic shapes (adjust ranges to your needs; requires an ONNX exported with dynamic axes).
# Take the input name from the parsed network ('images' for YOLOv5 exports) instead of guessing.
input_name = network.get_input(0).name
profile = builder.create_optimization_profile()
profile.set_shape(input_name,
                  min=(1, 3, 320, 320),
                  opt=(8, 3, 640, 640),
                  max=(16, 3, 1280, 1280))
config.add_optimization_profile(profile)

# Timing cache (reuse kernel tactic results)
cache_blob = b""
if os.path.exists(TIMING_CACHE):
    with open(TIMING_CACHE, "rb") as f:
        cache_blob = f.read()
    print("[TimingCache] Loaded")
else:
    print("[TimingCache] None, will build from scratch")
# An empty blob creates a fresh cache, so there is something to save after the first build
tc = config.create_timing_cache(cache_blob)
config.set_timing_cache(tc, ignore_mismatch=True)

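# build_serialized_network returns the serialized plan directly (build_engine is deprecated)
serialized_engine = builder.build_serialized_network(network, config)
assert serialized_engine is not None, "Build failed."

with open(ENGINE_FP16, "wb") as f:
    f.write(serialized_engine)
print("[Engine] Saved:", ENGINE_FP16)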

tc = config.get_timing_cache()
if tc:
    blob = tc.serialize()
    with open(TIMING_CACHE, "wb") as f:
        f.write(blob)
    print("[TimingCache] Saved:", TIMING_CACHE)



## 5) Optional — Build INT8 Engine with a Minimal Calibrator (PTQ)

> For production, **use a representative calibration dataset** and your real preprocessing.
> This demo uses random data only to show the **flow**; the resulting INT8 scales are not meaningful for accuracy.
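"Real preprocessing" for YOLOv5 typically means an aspect-preserving resize plus padding (letterboxing) before the image is normalized and fed to the network. The sketch below computes only the geometry of that step, so it runs without image libraries. `letterbox_geometry` is a hypothetical helper modelled on the letterbox idea, not a copy of YOLOv5's implementation.

```python
def letterbox_geometry(h, w, target=640):
    """Return (new_h, new_w, pad_top, pad_bottom, pad_left, pad_right).

    The image is scaled by a single factor so that it fits inside a
    target x target canvas, and the remainder is split into padding.
    """
    scale = min(target / h, target / w)
    new_h, new_w = round(h * scale), round(w * scale)
    pad_h, pad_w = target - new_h, target - new_w
    top, left = pad_h // 2, pad_w // 2
    # Any odd leftover pixel goes to the bottom/right side.
    return new_h, new_w, top, pad_h - top, left, pad_w - left


print(letterbox_geometry(480, 640))    # landscape image, padded top and bottom
print(letterbox_geometry(1080, 1920))  # Full HD frame scaled down to fit
```

A calibrator built on real data would apply this resize and padding to each image, convert it to the same layout and value range the deployed pipeline uses, and copy fixed-size batches to the device in `get_batch`.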


In [None]:

import tensorrt as trt, numpy as np, os
import pycuda.driver as cuda
import pycuda.autoinit  # noqa: F401

ONNX_PATH = "yolov5s.onnx"
ENGINE_INT8 = "yolov5s_int8.plan"
CALIB_CACHE = "yolo_calib.cache"

class RandomCalibrator(trt.IInt8EntropyCalibrator2):
    def __init__(self, batch_size=8, n_batches=20, h=640, w=640):
        super().__init__()
        self.batch_size = batch_size
        self.n_batches = n_batches
        self.h, self.w = h, w
        self.idx = 0
        self.cache_file = CALIB_CACHE
        self.device_input = cuda.mem_alloc(batch_size * 3 * h * w * np.float32().nbytes)

    def get_batch_size(self): return self.batch_size

    def get_batch(self, names):
        if self.idx >= self.n_batches:
            return None
        batch = np.random.rand(self.batch_size, 3, self.h, self.w).astype(np.float32)
        cuda.memcpy_htod(self.device_input, batch)
        self.idx += 1
        return [int(self.device_input)]

    def read_calibration_cache(self):
        if os.path.exists(self.cache_file):
            print("[Calib] Load cache")
            with open(self.cache_file, "rb") as f:
                return f.read()
        return None

    def write_calibration_cache(self, cache):
        with open(self.cache_file, "wb") as f:
            f.write(cache)
        print("[Calib] Save cache")

logger = trt.Logger(trt.Logger.INFO)
builder = trt.Builder(logger)
flag = 1 << int(trt.NetworkDefinitionCreationFlag.EXPLICIT_BATCH)
network = builder.create_network(flag)
parser = trt.OnnxParser(network, logger)

with open(ONNX_PATH, "rb") as f:
    assert parser.parse(f.read()), "ONNX parse failed"

config = builder.create_builder_config()
config.set_memory_pool_limit(trt.MemoryPoolType.WORKSPACE, 8 << 30)
config.set_flag(trt.BuilderFlag.INT8)
if builder.platform_has_fast_fp16:
    config.set_flag(trt.BuilderFlag.FP16)  # allow mixed precision

# Dynamic profile: take the input name from the parsed network ('images' for YOLOv5 exports)
input_name = network.get_input(0).name
profile = builder.create_optimization_profile()
profile.set_shape(input_name, min=(1,3,320,320), opt=(8,3,640,640), max=(16,3,1280,1280))
config.add_optimization_profile(profile)

# Calibration runs at the profile's opt shape, which must match the calibrator's batches (8x3x640x640)
config.set_calibration_profile(profile)
config.int8_calibrator = RandomCalibrator()

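# build_serialized_network returns the serialized plan directly (build_engine is deprecated)
serialized_engine = builder.build_serialized_network(network, config)
assert serialized_engine is not None, "INT8 build failed"

with open(ENGINE_INT8, "wb") as f:
    f.write(serialized_engine)
print("[Engine-INT8] Saved:", ENGINE_INT8)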



## 6) Async Inference (Two Streams) — FP16 Engine

This demonstrates:
- Setting dynamic shapes
- Using **pinned (page-locked) memory**
- **Async H2D → execute → D2H** on **two CUDA streams** (parallelism)
- Works with any number of output bindings (a single raw-prediction tensor, or several if your engine embeds an NMS plugin)
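With dynamic shapes, one common design choice is to size pinned host buffers and device buffers once for the largest shape in the profile and reuse them for every request, rather than allocating per call. The sketch below shows only the sizing arithmetic. `buffer_nbytes` is a hypothetical helper, and the item sizes assume FP16 (2 bytes) or FP32 (4 bytes) tensors.

```python
from math import prod


def buffer_nbytes(shape, itemsize):
    """Bytes needed for a dense tensor of `shape` with `itemsize` bytes per element."""
    return prod(shape) * itemsize


# Largest input shape allowed by the profile used in this notebook.
MAX_INPUT_SHAPE = (16, 3, 1280, 1280)

fp16_bytes = buffer_nbytes(MAX_INPUT_SHAPE, 2)
fp32_bytes = buffer_nbytes(MAX_INPUT_SHAPE, 4)
print(f"FP16 input buffer: {fp16_bytes / 2**20:.0f} MiB")
print(f"FP32 input buffer: {fp32_bytes / 2**20:.0f} MiB")
```

Each stream needs its own set of buffers, so two concurrent streams double these numbers. That is worth checking against free GPU memory before raising the profile's max shape.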


In [None]:

import tensorrt as trt, numpy as np, threading, os
import pycuda.driver as cuda
import pycuda.autoinit  # noqa: F401

# Prefer the Route B plan; otherwise fall back to the Route A engine (renamed to yolov5s_fp16.engine above)
ENGINE_PATH = "yolov5s_fp16.plan" if os.path.exists("yolov5s_fp16.plan") else "yolov5s_fp16.engine"
assert os.path.exists(ENGINE_PATH), "No engine found. Build FP16 or export engine first."

logger = trt.Logger(trt.Logger.ERROR)
with open(ENGINE_PATH, "rb") as f, trt.Runtime(logger) as rt:
    engine = rt.deserialize_cuda_engine(f.read())

# Try typical binding names; adapt as needed depending on your engine
input_name_candidates = ["images", "input"]
output_names_hint = []  # if EfficientNMS is present, expect multiple outputs

def get_binding_index_by_candidates(engine, names):
    for nm in names:
        try:
            idx = engine.get_binding_index(nm)
            if idx != -1:
                return idx, nm
        except Exception:
            pass
    # fallback to first input binding
    for i in range(engine.num_bindings):
        if engine.binding_is_input(i):
            return i, engine.get_binding_name(i)
    raise RuntimeError("No input binding found")

inp_idx, inp_name = get_binding_index_by_candidates(engine, input_name_candidates)
print("Using input binding:", inp_name, "(index:", inp_idx, ")")

out_indices = [i for i in range(engine.num_bindings) if not engine.binding_is_input(i)]
print("Output binding count:", len(out_indices), "->", [engine.get_binding_name(i) for i in out_indices])

def pagelocked_empty(shape, dtype=np.float32):
    return cuda.pagelocked_empty(int(np.prod(shape)), dtype).reshape(shape)

def infer_once(engine, shape, tag="stream"):
    context = engine.create_execution_context()
    context.set_binding_shape(inp_idx, shape)
    stream = cuda.Stream()

    in_dtype = trt.nptype(engine.get_binding_dtype(inp_idx))
    h_in = pagelocked_empty(shape, in_dtype)
    h_in[:] = np.random.rand(*shape).astype(in_dtype)

    d_in = cuda.mem_alloc(h_in.nbytes)

    # Prepare outputs
    bindings = [0]*engine.num_bindings
    bindings[inp_idx] = int(d_in)

    host_out = []
    device_out = []
    for oi in out_indices:
        out_shape = tuple(context.get_binding_shape(oi))
        out_dtype = trt.nptype(engine.get_binding_dtype(oi))
        h_o = pagelocked_empty(out_shape, out_dtype)
        d_o = cuda.mem_alloc(h_o.nbytes)
        host_out.append(h_o)
        device_out.append(d_o)
        bindings[oi] = int(d_o)

    cuda.memcpy_htod_async(d_in, h_in, stream)
    context.execute_async_v2(bindings, stream.handle)
    for h_o, d_o in zip(host_out, device_out):
        cuda.memcpy_dtoh_async(h_o, d_o, stream)
    stream.synchronize()

    print(f"[{tag}] shapes:", [arr.shape for arr in host_out])
    return host_out

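# Two parallel requests with different shapes (if dynamic profiles allow it).
# PyCUDA contexts are per-thread, so each worker makes the autoinit context current first.
def infer_in_thread(engine, shape, tag):
    ctx = pycuda.autoinit.context
    ctx.push()
    try:
        infer_once(engine, shape, tag)
    finally:
        ctx.pop()

t1 = threading.Thread(target=infer_in_thread, args=(engine, (1,3,640,640), "S1"))
t2 = threading.Thread(target=infer_in_thread, args=(engine, (1,3,384,384), "S2"))
t1.start(); t2.start(); t1.join(); t2.join()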



## 7) Notes & Troubleshooting

- If ONNX input name differs (e.g., `images` vs `input`), **adjust the code** accordingly.
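- The stock YOLOv5 export does not embed an NMS plugin, so these engines return raw predictions. Run NMS on the host (for example with `non_max_suppression` from YOLOv5's `utils/general.py`), or add an NMS plugin such as EfficientNMS to the ONNX graph yourself before building.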
- For **INT8**, replace the random calibrator with a **real calibration dataset** matching production distribution.
- Use `trtexec` for quick benchmarking and building when you don't need Python APIs:
```bash
trtexec --onnx=yolov5s.onnx --saveEngine=yolov5s_fp16.plan --fp16 --memPoolSize=workspace:16384 \
  --minShapes=images:1x3x320x320 --optShapes=images:8x3x640x640 --maxShapes=images:16x3x1280x1280 \
  --timingCacheFile=yolo_timing.cache --buildOnly --verbose
```
- Profile with **Nsight Systems/Compute** to locate I/O vs compute bottlenecks.
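For the host-side NMS case mentioned in the notes above, the following is a minimal, dependency-free sketch of greedy class-agnostic NMS. It assumes boxes are already decoded to `(x1, y1, x2, y2)` corner format with one confidence score each. The function names are illustrative, and the default thresholds (0.25 confidence, 0.45 IoU) are assumed values chosen to resemble common YOLOv5 settings. A production pipeline would normally use a vectorized implementation instead.

```python
def iou(a, b):
    """Intersection over union of two boxes given as (x1, y1, x2, y2)."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


def nms(boxes, scores, iou_thres=0.45, conf_thres=0.25):
    """Greedy NMS: return indices of kept boxes, highest score first."""
    # Drop low-confidence candidates, then visit the rest by descending score.
    order = sorted(
        (i for i, s in enumerate(scores) if s >= conf_thres),
        key=lambda i: scores[i],
        reverse=True,
    )
    keep = []
    for i in order:
        # Keep a box only if it does not overlap too much with any kept box.
        if all(iou(boxes[i], boxes[j]) <= iou_thres for j in keep):
            keep.append(i)
    return keep


boxes = [(10, 10, 110, 110), (12, 12, 112, 112), (200, 200, 300, 300)]
scores = [0.9, 0.8, 0.7]
print(nms(boxes, scores))  # [0, 2]: the second box overlaps the first and is suppressed
```

Per-class NMS can be obtained by running the same routine separately on the boxes of each predicted class.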
