# Lab 5: Model Optimization and Performance Evaluation for Embedded AI

This lab extends Lab 4 by exporting YOLOv8x to ONNX, optimizing with TensorRT and torch2trt, running live object detection, benchmarking Precision/Recall/F1 and mAP, and profiling with NVIDIA tools [<a href='#ref-1'>1</a>, <a href='#ref-2'>2</a>, <a href='#ref-3'>3</a>, <a href='#ref-4'>4</a>, <a href='#ref-5'>5</a>, <a href='#ref-6'>6</a>, <a href='#ref-7'>7</a>, <a href='#ref-8'>8</a>, <a href='#ref-10'>10</a>].

By the end of this lab, students will be able to: [<a href='#ref-1'>1</a>]

- Export YOLOv8x from PyTorch to ONNX and build a TensorRT engine [<a href='#ref-1'>1</a>, <a href='#ref-2'>2</a>, <a href='#ref-4'>4</a>]
- Optimize a YOLOv8x PyTorch module using torch2trt and save the engine/module state [<a href='#ref-6'>6</a>]
- Run object detection and record throughput (FPS), latency, and task performance metrics (Precision, Recall, F1, mAP@0.50, mAP@0.50–0.95) for PyTorch, TensorRT (engine), and torch2trt [<a href='#ref-1'>1</a>]
- Compare, analyze, and interpret results [<a href='#ref-9'>9</a>]
- Profile with Nsight Systems/Compute and the TensorRT profiler; visualize findings and summarize bottlenecks [<a href='#ref-7'>7</a>, <a href='#ref-8'>8</a>, <a href='#ref-5'>5</a>]

Assumptions: reuse the Lab 4 Docker image, where all tools are already installed [<a href='#ref-1'>1</a>].

## Folder Structure and Configuration [<a href='#ref-1'>1</a>]

- Create a working folder named `Lab5` inside your Jetson lab directory `~/CMPE6012/STUDENT_ID`  [<a href='#ref-1'>1</a>].
```bash
mkdir -p ~/CMPE6012/STUDENT_ID/Lab5 # Replace STUDENT_ID with your student ID
```
- Create the following files in `Lab5` and run them as indicated in later sections [<a href='#ref-1'>1</a>]:
  - config.py
  - baseline_infer_pytorch.py
  - task1_export_onnx.py
  - task1_build_trt_engine.py
  - task1_infer_trt.py
  - task2_optimize_torch2trt.py
  - task2_infer_torch2trt.py
  - task3_optimize_torch2trt.py
  - task3_infer_trt_fp16.py
  - visualize_efficiency.py
  - visualize_task_performance.py
  - profile_wrappers.py

### Create a config file for global settings across all tasks in Lab 5
config.py [<a href='#ref-1'>1</a>]

In [None]:
# LAB_DIR/config.py
from pathlib import Path

LAB_DIR = Path.cwd()  # or Path('/workspace') inside container
LAB_DIR.mkdir(exist_ok=True, parents=True)

# Model and I/O
MODEL_PT = '../Lab4/yolov8x.pt'
IMG_SIZE = 640
BATCH = 1
DYNAMIC = False
HALF = False

# Artifacts
ONNX_PATH = LAB_DIR / 'yolov8x.onnx'
ENGINE_PATH = LAB_DIR / 'yolov8x_trt_from_onnx.engine'
ONNX_PATH_FP16 = LAB_DIR / 'yolov8x_fp16.onnx'
ENGINE_PATH_FP16 = LAB_DIR / 'yolov8x_trt_from_onnx_fp16.engine'

ENGINE_PATH_T2T = LAB_DIR / 'yolov8x_trt_from_torch2trt.engine'
ENGINE_PATH_T2T_FP16 = LAB_DIR / 'yolov8x_trt_from_torch2trt_fp16.engine'
T2TRT_STATE = LAB_DIR / 'yolov8x_torch2trt.pth'
T2TRT_STATE_FP16 = LAB_DIR / 'yolov8x_torch2trt_fp16.pth'

# Dataset
DATA_YAML = '/datasets/coco/coco_mini_val.yaml'

# Camera
CAMERA_INDEX = 0  # replace with GStreamer string on Jetson if needed


## Task 0: Baseline PyTorch Live Inference [<a href='#ref-1'>1</a>, <a href='#ref-3'>3</a>]

Instructions: [<a href='#ref-1'>1</a>]

- Create and run the baseline inference script to establish reference FPS and latency [<a href='#ref-1'>1</a>].
- Keep the same dataset and image size for fair comparison across tasks [<a href='#ref-1'>1</a>].
- Alternatively, adapt your Task 3.4 implementation from Lab 4 to serve as the baseline for Lab 5.
- Record the reference metrics; you will need them for later comparisons.
- Questions to think about:
  1. What is `p95 latency`?
  2. What is the difference between `mean latency` and `p95 latency`?
  3. Why does percentile latency matter?
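
As a starting point for these questions, the sketch below computes both statistics in plain Python. The latency values are made up for illustration, and `percentile` is a hypothetical helper that uses linear interpolation (the same default method as `numpy.percentile`, which the lab scripts call).

```python
import math
import statistics

def percentile(values, q):
    """Linear-interpolated percentile (q in 0..100) of a non-empty list."""
    data = sorted(values)
    if not data:
        raise ValueError("no samples")
    pos = (len(data) - 1) * q / 100.0
    lo, hi = math.floor(pos), math.ceil(pos)
    return data[lo] + (data[hi] - data[lo]) * (pos - lo)

# Hypothetical per-frame latencies (ms): mostly 20 ms, with two slow frames
latencies_ms = [20.0] * 18 + [80.0, 120.0]
mean_ms = statistics.mean(latencies_ms)
p95_ms = percentile(latencies_ms, 95)
print(f"mean={mean_ms:.1f} ms  p95={p95_ms:.1f} ms")
```

Here two slow frames out of twenty barely move the mean but dominate the 95th percentile, which is why tail latency is usually reported alongside the average for real-time pipelines.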

In [None]:
# LAB_DIR/baseline_infer_pytorch.py
import time, numpy as np, cv2, argparse, json
from ultralytics import YOLO
from config import MODEL_PT, IMG_SIZE, CAMERA_INDEX, DATA_YAML

def infer_live(camera_index, metrics_out='baseline_live_metrics.json'):
    model = YOLO(MODEL_PT)
    cap = cv2.VideoCapture(camera_index)
    if not cap.isOpened(): raise RuntimeError('Camera not available')
    latencies = []; t_last = time.perf_counter(); fps_smooth, alpha = 0.0, 0.1
    while True:
        ok, frame = cap.read()
        if not ok: break
        t0 = time.perf_counter()
        results = model.predict(source=frame, imgsz=IMG_SIZE, verbose=False, device=0)
        t1 = time.perf_counter()
        lat_ms = (t1 - t0) * 1000.0; latencies.append(lat_ms)
        annotated = results[0].plot()
        dt = t1 - t_last; inst_fps = 1.0 / max(dt, 1e-6)
        fps_smooth = (1 - alpha)*fps_smooth + alpha*inst_fps if fps_smooth > 0 else inst_fps
        t_last = t1
        cv2.putText(annotated, f'Latency: {lat_ms:.1f} ms  FPS: {fps_smooth:.1f}', (10, 30),
                    cv2.FONT_HERSHEY_SIMPLEX, 0.9, (0,255,0), 2, cv2.LINE_AA)
        cv2.imshow('PyTorch YOLOv8x (Baseline)', annotated)
        if cv2.waitKey(1) & 0xFF == 27: break
    cap.release(); cv2.destroyAllWindows()
    if latencies:
        metrics = {
            'mean_latency_ms': float(np.mean(latencies)),
            'p95_latency_ms': float(np.percentile(latencies, 95)),
            'frames': len(latencies),
            'fps': len(latencies) / (np.sum(latencies)/1000.0)
        }
        print(metrics)
        with open(metrics_out, 'w') as f:
            json.dump(metrics, f, indent=2)
        print(f"Saved live metrics to {metrics_out}")

def infer_coco(metrics_out='baseline_coco_metrics.json'):
    model = YOLO(MODEL_PT)

    t0 = time.perf_counter()
    results = model.val(data=DATA_YAML, imgsz=IMG_SIZE, device=0, verbose=False)
    t1 = time.perf_counter()
    latency_ms = (t1 - t0) * 1000.0

    def to_scalar(x):
        a = np.asarray(x)
        if a.size == 0:
            return float('nan')
        if a.shape == ():
            return float(a)
        return float(np.nanmean(a))

    # Pull metrics with array-safe conversion
    try:
        precision = to_scalar(results.box.p)
        recall = to_scalar(results.box.r)
        # f1 may be missing sometimes, compute if needed
        f1 = to_scalar(getattr(results.box, 'f1', np.array([2 * precision * recall / max(precision + recall, 1e-12)])))
        map50 = to_scalar(results.box.map50)
        map50_95 = to_scalar(results.box.map)
    except AttributeError:
        # Fallback for older APIs
        p, r, map50_val, map50_95_val = results.mean_results()
        precision = float(p); recall = float(r)
        map50 = float(map50_val); map50_95 = float(map50_95_val)
        f1 = float(2 * precision * recall / max(precision + recall, 1e-12))

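    # Per-image latency and FPS from Ultralytics' reported inference speed (ms/image) when available.
    # This figure excludes pre- and post-processing. latency_ms above is the wall-clock time of the
    # whole validation pass, so it is used only as a fallback.
    try:
        inf_ms = float(results.speed.get('inference', 0.0))
    except Exception:
        inf_ms = 0.0
    per_image_ms = inf_ms if inf_ms > 0 else latency_ms
    fps = 1000.0 / max(per_image_ms, 1e-6)

    metrics = {
        'precision': precision,
        'recall': recall,
        'f1': f1,
        'map50': map50,
        'map50_95': map50_95,
        'mean_latency_ms': float(per_image_ms),
        'p95_latency_ms': float(per_image_ms),  # val() reports one average, so no percentile is available
        'total_val_time_ms': float(latency_ms),
        'fps': float(fps),
    }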

    print(metrics)
    with open(metrics_out, 'w') as f:
        json.dump(metrics, f, indent=2)
    print(f"Saved COCO metrics to {metrics_out}")

if __name__ == '__main__':
    parser = argparse.ArgumentParser()
    parser.add_argument('--mode', choices=['live', 'coco'], default='coco')
    parser.add_argument('--camera', type=int, default=CAMERA_INDEX)
    parser.add_argument('--metrics_out', type=str, default=None)
    args = parser.parse_args()
    if args.mode == 'live':
        out = args.metrics_out or 'baseline_live_metrics.json'
        infer_live(args.camera, metrics_out=out)
    else:
        out = args.metrics_out or 'baseline_coco_metrics.json'
        infer_coco(metrics_out=out)

## Task 1: PyTorch → ONNX → TensorRT [<a href='#ref-1'>1</a>, <a href='#ref-2'>2</a>, <a href='#ref-4'>4</a>]

Instructions: [<a href='#ref-1'>1</a>]

- Create and run the export and engine-build scripts in `Lab5`; verify .onnx and .engine files are produced [<a href='#ref-1'>1</a>, <a href='#ref-2'>2</a>].
- Create and run the inference script with the .engine and observe the metrics [<a href='#ref-1'>1</a>].
- Record results for comparison with Task 2 and the baseline (Task 0) [<a href='#ref-1'>1</a>].
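
One way to verify that the artifacts were produced is a small check like the sketch below. `check_artifacts` is a hypothetical helper, and the file names assume the defaults of `ONNX_PATH` and `ENGINE_PATH` in `config.py` with `Lab5` as the working directory.

```python
from pathlib import Path

def check_artifacts(paths):
    """Return {path: size_in_bytes}; raise if a file is missing or empty."""
    sizes = {}
    for p in map(Path, paths):
        if not p.is_file():
            raise FileNotFoundError(f"Missing artifact: {p}")
        size = p.stat().st_size
        if size == 0:
            raise ValueError(f"Empty artifact: {p}")
        sizes[str(p)] = size
    return sizes

if __name__ == "__main__":
    # Assumed default artifact names from config.py
    expected = ["yolov8x.onnx", "yolov8x_trt_from_onnx.engine"]
    try:
        for name, size in check_artifacts(expected).items():
            print(f"{name}: {size / 1e6:.1f} MB")
    except (FileNotFoundError, ValueError) as err:
        print(err)
```

A non-zero size only shows that a file was written; loading the engine in `task1_infer_trt.py` remains the real test.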

### task1_export_onnx.py [<a href='#ref-1'>1</a>, <a href='#ref-2'>2</a>]

In [None]:
# LAB_DIR/task1_export_onnx.py
from ultralytics import YOLO
from pathlib import Path
from config import MODEL_PT, IMG_SIZE, DYNAMIC, HALF, ONNX_PATH

def main():
    model = YOLO(MODEL_PT)
    exported = model.export(format='onnx', imgsz=IMG_SIZE, dynamic=DYNAMIC, half=HALF)
    src = Path(str(exported))
    if src.exists() and src.resolve() != ONNX_PATH.resolve():
        src.rename(ONNX_PATH)
    print(f'ONNX saved at: {ONNX_PATH}')

if __name__ == '__main__':
    main()


### task1_build_trt_engine.py [<a href='#ref-2'>2</a>, <a href='#ref-4'>4</a>]

In [None]:
# LAB_DIR/task1_build_trt_engine.py
from ultralytics.utils.export import export_engine
from config import ONNX_PATH, ENGINE_PATH, BATCH, IMG_SIZE, DYNAMIC

def main():
    assert ONNX_PATH.exists(), 'ONNX file not found'
    export_engine(
        onnx_file=str(ONNX_PATH),
        engine_file=str(ENGINE_PATH),
        workspace=4,
        half=False,
        int8=False,
        dynamic=DYNAMIC,
        shape=(BATCH, 3, IMG_SIZE, IMG_SIZE),
        verbose=True
    )
    print(f'Engine saved at: {ENGINE_PATH}')

if __name__ == '__main__':
    main()


### task1_infer_trt.py [<a href='#ref-1'>1</a>, <a href='#ref-6'>6</a>]

In [None]:
# LAB_DIR/task1_infer_trt.py
import time
import numpy as np
import cv2
import argparse
import json
import yaml
from pathlib import Path
from ultralytics import YOLO
from config import ENGINE_PATH, IMG_SIZE, CAMERA_INDEX, DATA_YAML

# Fix deprecated np.bool alias
if not hasattr(np, "bool"):
    np.bool = np.bool_

def xyxy_iou(box1, box2):
    """Compute IoU between two sets of boxes in xyxy format."""
    if len(box1) == 0 or len(box2) == 0:
        return np.zeros((len(box1), len(box2)), dtype=np.float32)

    x1 = np.maximum(box1[:, None, 0], box2[None, :, 0])
    y1 = np.maximum(box1[:, None, 1], box2[None, :, 1])
    x2 = np.minimum(box1[:, None, 2], box2[None, :, 2])
    y2 = np.minimum(box1[:, None, 3], box2[None, :, 3])

    inter_area = np.clip(x2 - x1, 0, None) * np.clip(y2 - y1, 0, None)
    box1_area = (box1[:, 2] - box1[:, 0]) * (box1[:, 3] - box1[:, 1])
    box2_area = (box2[:, 2] - box2[:, 0]) * (box2[:, 3] - box2[:, 1])
    union_area = box1_area[:, None] + box2_area[None, :] - inter_area
    iou = inter_area / np.clip(union_area, 1e-6, None)
    return iou

def load_yolo_dataset(yaml_file):
    """Load YOLO dataset from YAML file, supporting relative label folder."""
    with open(yaml_file, 'r') as f:
        data_cfg = yaml.safe_load(f)

    dataset_root = Path(data_cfg['path']).resolve()
    val_file = Path(data_cfg['val']).resolve()
    if not val_file.exists():
        val_file = dataset_root / data_cfg['val']
    if not val_file.exists():
        raise FileNotFoundError(f"Validation file not found: {val_file}")

    dataset = []
    with open(val_file, 'r') as f:
        lines = f.read().splitlines()

    for line in lines:
        img_path = (dataset_root / line).resolve()
        if not img_path.exists():
            print(f"Warning: image not found: {img_path}")
            continue
        img = cv2.imread(str(img_path))
        if img is None:
            continue
        h, w = img.shape[:2]

        # Correct relative label path
        label_path = (img_path.parent.parent.parent / "labels/val2017" / img_path.name).with_suffix('.txt')
        boxes = []
        if label_path.exists():
            with open(label_path, 'r') as lf:
                for lbl in lf:
                    cls, x_center, y_center, bw, bh = map(float, lbl.strip().split())
                    x1 = (x_center - bw/2) * w
                    y1 = (y_center - bh/2) * h
                    x2 = (x_center + bw/2) * w
                    y2 = (y_center + bh/2) * h
                    boxes.append([x1, y1, x2, y2])
        else:
            print(f"Warning: label file not found: {label_path}")

        dataset.append((str(img_path), np.array(boxes)))

    print(f"Found {len(dataset)} validation images.")
    return dataset

def infer_live(camera_index: int, metrics_out: str = 'trt_live_metrics.json'):
    assert ENGINE_PATH.exists(), f'TensorRT engine not found: {ENGINE_PATH}'
    model = YOLO(str(ENGINE_PATH))

    cap = cv2.VideoCapture(camera_index)
    if not cap.isOpened():
        raise RuntimeError(f'Camera {camera_index} not available')

    latencies = []
    t_last, fps_smooth, alpha = time.perf_counter(), 0.0, 0.1

    while True:
        ret, frame = cap.read()
        if not ret:
            break

        t0 = time.perf_counter()
        results = model.predict(source=frame, imgsz=IMG_SIZE, verbose=False, device=0)
        t1 = time.perf_counter()

        lat_ms = (t1 - t0) * 1000.0
        latencies.append(lat_ms)

        annotated = results[0].plot()
        inst_fps = 1.0 / max(t1 - t_last, 1e-6)
        fps_smooth = inst_fps if fps_smooth == 0 else (1 - alpha) * fps_smooth + alpha * inst_fps
        t_last = t1

        cv2.putText(
            annotated,
            f'Latency: {lat_ms:.1f} ms  FPS: {fps_smooth:.1f}',
            (10, 30),
            cv2.FONT_HERSHEY_SIMPLEX,
            0.9,
            (0, 255, 0),
            2,
            cv2.LINE_AA
        )
        cv2.imshow('TensorRT YOLOv8x', annotated)
        if cv2.waitKey(1) & 0xFF == 27:
            break

    cap.release()
    cv2.destroyAllWindows()

    if latencies:
        metrics = {
            'mean_latency_ms': float(np.mean(latencies)),
            'p95_latency_ms': float(np.percentile(latencies, 95)),
            'frames': len(latencies),
            'fps': len(latencies) / (np.sum(latencies) / 1000.0)
        }
    else:
        metrics = {
            'mean_latency_ms': float('nan'),
            'p95_latency_ms': float('nan'),
            'frames': 0,
            'fps': float('nan')
        }

    print(metrics)
    with open(metrics_out, 'w') as f:
        json.dump(metrics, f, indent=2)
    print(f"Saved live metrics to {metrics_out}")

def infer_coco(metrics_out: str = 'trt_coco_metrics.json'):
    assert ENGINE_PATH.exists(), f'TensorRT engine not found: {ENGINE_PATH}'
    model = YOLO(str(ENGINE_PATH))

    dataset = load_yolo_dataset(DATA_YAML)

    if len(dataset) == 0:
        print("Warning: No validation images found!")
        metrics = {
            'precision': float('nan'),
            'recall': float('nan'),
            'f1': float('nan'),
            'mean_latency_ms': float('nan'),
            'p95_latency_ms': float('nan'),
            'frames': 0,
            'fps': float('nan')
        }
        with open(metrics_out, 'w') as f:
            json.dump(metrics, f, indent=2)
        return

    latencies = []
    all_precisions, all_recalls = [], []

    for img_path, gt_boxes in dataset:
        t0 = time.perf_counter()
        results = model.predict(source=img_path, imgsz=IMG_SIZE, verbose=False, device=0)
        t1 = time.perf_counter()
        latencies.append((t1 - t0) * 1000.0)

        pred_boxes = results[0].boxes.xyxy.cpu().numpy() if results[0].boxes else np.zeros((0, 4))

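        if len(pred_boxes) > 0 and len(gt_boxes) > 0:
            ious = xyxy_iou(pred_boxes, gt_boxes)
            # Greedy one-to-one matching (class-agnostic): each ground-truth box can be
            # claimed by at most one prediction, so TP never exceeds either box count.
            matched_gt = set()
            for i in range(len(pred_boxes)):
                j = int(np.argmax(ious[i]))
                if ious[i, j] >= 0.5 and j not in matched_gt:
                    matched_gt.add(j)
            tp = len(matched_gt)
            fp = len(pred_boxes) - tp
            fn = len(gt_boxes) - tp
            precision = tp / max(tp + fp, 1e-6)
            recall = tp / max(tp + fn, 1e-6)
        else:
            precision, recall = 0.0, 0.0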

        all_precisions.append(precision)
        all_recalls.append(recall)

    mean_precision = float(np.mean(all_precisions))
    mean_recall = float(np.mean(all_recalls))
    f1 = 2 * mean_precision * mean_recall / max(mean_precision + mean_recall, 1e-12)
    mean_latency = float(np.mean(latencies))
    fps = len(latencies) / (np.sum(latencies) / 1000.0)

    metrics = {
        'precision': mean_precision,
        'recall': mean_recall,
        'f1': f1,
        'mean_latency_ms': mean_latency,
        'p95_latency_ms': float(np.percentile(latencies, 95)),
        'frames': len(latencies),
        'fps': fps
    }

    print(metrics)
    with open(metrics_out, 'w') as f:
        json.dump(metrics, f, indent=2)
    print(f"Saved dataset metrics to {metrics_out}")

if __name__ == '__main__':
    parser = argparse.ArgumentParser(description="TensorRT YOLOv8 inference")
    parser.add_argument('--mode', choices=['live', 'coco'], default='coco')
    parser.add_argument('--camera', type=int, default=CAMERA_INDEX)
    parser.add_argument('--metrics_out', type=str, default=None)
    args = parser.parse_args()

    out_file = args.metrics_out or ('trt_live_metrics.json' if args.mode == 'live' else 'trt_coco_metrics.json')
    if args.mode == 'live':
        infer_live(args.camera, metrics_out=out_file)
    else:
        infer_coco(metrics_out=out_file)
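
The precision and recall reported by `infer_coco` depend on how predictions are matched to ground-truth boxes. The sketch below is a minimal pure-Python illustration of greedy one-to-one matching at an IoU threshold of 0.5. The helper names and the example boxes are hypothetical, and the matching ignores class labels, so it is a simplification rather than the COCO evaluation protocol.

```python
def iou_xyxy(a, b):
    """IoU of two boxes given as (x1, y1, x2, y2)."""
    iw = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    ih = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = iw * ih
    union = (a[2] - a[0]) * (a[3] - a[1]) + (b[2] - b[0]) * (b[3] - b[1]) - inter
    return inter / union if union > 0 else 0.0

def match_counts(preds, gts, thr=0.5):
    """Greedy one-to-one matching; returns (tp, fp, fn)."""
    matched = set()
    for p in preds:
        best_j, best_iou = -1, 0.0
        for j, g in enumerate(gts):
            v = iou_xyxy(p, g)
            if j not in matched and v > best_iou:
                best_j, best_iou = j, v
        if best_iou >= thr:
            matched.add(best_j)  # this ground-truth box is now taken
    tp = len(matched)
    return tp, len(preds) - tp, len(gts) - tp

# Two near-duplicate predictions on one ground-truth box, plus one stray prediction
gts = [(0, 0, 10, 10)]
preds = [(0, 0, 10, 10), (1, 0, 10, 10), (50, 50, 60, 60)]
tp, fp, fn = match_counts(preds, gts)
print(tp, fp, fn)
```

Counting every prediction/ground-truth pair above the threshold would score both overlapping predictions as true positives; one-to-one matching treats the duplicate as a false positive instead.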

## (Ignore) Task 2: torch2trt Optimization [<a href='#ref-6'>6</a>]

*Ignore this task, as torch2trt is not compatible with YOLO.*

Instructions: [<a href='#ref-6'>6</a>]

- Create and run the optimization script to save the TensorRT engine and TRTModule state for YOLOv8x [<a href='#ref-6'>6</a>].
- Create and run the inference script for the torch2trt module and collect all metrics [<a href='#ref-6'>6</a>].
- Record and compare results against Task 1 and the baseline (Task 0) [<a href='#ref-1'>1</a>].

### task2_optimize_torch2trt.py [<a href='#ref-6'>6</a>]

In [None]:
# LAB_DIR/task2_optimize_torch2trt.py
import torch
from torch2trt import torch2trt, TRTModule
from ultralytics import YOLO
from config import MODEL_PT, IMG_SIZE, BATCH, T2TRT_STATE, ENGINE_PATH_T2T

def main():
    y = YOLO(MODEL_PT)
    core = y.model.eval().cuda()
    example = torch.randn(BATCH, 3, IMG_SIZE, IMG_SIZE).cuda()
    model_trt = torch2trt(core, [example], fp16_mode=False, max_batch_size=BATCH)
    torch.save(model_trt.state_dict(), str(T2TRT_STATE))
    print(f'torch2trt state saved at: {T2TRT_STATE}')
    # Save raw TensorRT engine file for trtexec
    with open(ENGINE_PATH_T2T, "wb") as f:
        f.write(model_trt.engine.serialize())

if __name__ == '__main__':
    main()


### task2_infer_torch2trt.py [<a href='#ref-6'>6</a>]

In [None]:
# LAB_DIR/task2_infer_torch2trt.py
import time, numpy as np, cv2, argparse, torch, json
from torchvision.ops import nms
from torch2trt import TRTModule
from config import T2TRT_STATE, IMG_SIZE, CAMERA_INDEX, DATA_YAML

def preprocess_bgr(frame, size=IMG_SIZE):
    img = cv2.resize(frame, (size, size), interpolation=cv2.INTER_LINEAR)
    img = cv2.cvtColor(img, cv2.COLOR_BGR2RGB)
    t = torch.from_numpy(img).permute(2,0,1).float().cuda() / 255.0
    return t.unsqueeze(0), img

def postprocess_yolov8(pred, conf_thres=0.25, iou_thres=0.45):
    # Assumes the raw YOLOv8 head output: (batch, 4 + num_classes, num_anchors), with no objectness column
    if isinstance(pred, (list, tuple)): pred = pred[0]
    if pred.ndim == 3: pred = pred[0]
    p = pred.detach().float()
    if p.shape[0] < p.shape[1]: p = p.t()  # -> (num_anchors, 4 + num_classes)
    if p.shape[-1] < 5: return []
    boxes_cxcywh = p[:, :4]
    class_scores = p[:, 4:]
    conf, cls_idx = class_scores.max(dim=1)  # best class score and its index per anchor
    keep = conf > conf_thres
    if keep.sum() == 0: return []
    boxes_cxcywh, conf, cls_idx = boxes_cxcywh[keep], conf[keep], cls_idx[keep]
    cx, cy, w, h = boxes_cxcywh.t()
    x1, y1, x2, y2 = cx - w/2, cy - h/2, cx + w/2, cy + h/2
    boxes_xyxy = torch.stack([x1, y1, x2, y2], dim=1)
    keep_idx = nms(boxes_xyxy, conf, iou_thres)
    return boxes_xyxy[keep_idx], conf[keep_idx], cls_idx[keep_idx]

def infer_live(camera_index, metrics_out='torch2trt_live_metrics.json'):
    tloaded = TRTModule()
    tloaded.load_state_dict(torch.load(str(T2TRT_STATE)))
    tloaded.eval().cuda()
    cap = cv2.VideoCapture(camera_index)
    if not cap.isOpened(): raise RuntimeError('Camera not available')
    latencies = []; t_last = time.perf_counter(); fps_smooth, alpha = 0.0, 0.1
    while True:
        ok, frame = cap.read()
        if not ok: break
        h0, w0 = frame.shape[:2]
        t_in, _ = preprocess_bgr(frame, IMG_SIZE)
        t0 = time.perf_counter()
        with torch.inference_mode():
            y = tloaded(t_in)
        t1 = time.perf_counter()
        lat_ms = (t1 - t0) * 1000.0; latencies.append(lat_ms)
        parsed = postprocess_yolov8(y)
        annotated = frame.copy()
        if parsed:
            boxes, confs, clss = parsed
            sx, sy = w0 / IMG_SIZE, h0 / IMG_SIZE
            for b, c, k in zip(boxes, confs, clss):
                x1, y1, x2, y2 = b.tolist()
                x1, x2 = int(x1*sx), int(x2*sx); y1, y2 = int(y1*sy), int(y2*sy)
                cv2.rectangle(annotated, (x1,y1), (x2,y2), (0,255,0), 2)
                cv2.putText(annotated, f'{int(k)}:{c:.2f}', (x1, max(0,y1-5)),
                            cv2.FONT_HERSHEY_SIMPLEX, 0.6, (0,255,0), 2, cv2.LINE_AA)
        dt = t1 - t_last; inst_fps = 1.0 / max(dt, 1e-6)
        fps_smooth = (1 - alpha)*fps_smooth + alpha*inst_fps if fps_smooth > 0 else inst_fps
        t_last = t1
        cv2.putText(annotated, f'Latency: {lat_ms:.1f} ms  FPS: {fps_smooth:.1f}', (10, 30),
                    cv2.FONT_HERSHEY_SIMPLEX, 0.9, (0,255,0), 2, cv2.LINE_AA)
        cv2.imshow('torch2trt (Task 2)', annotated)
        if cv2.waitKey(1) & 0xFF == 27: break
    cap.release(); cv2.destroyAllWindows()
    if latencies:
        metrics = {
            'mean_latency_ms': float(np.mean(latencies)),
            'p95_latency_ms': float(np.percentile(latencies, 95)),
            'frames': len(latencies),
            'fps': len(latencies) / (np.sum(latencies)/1000.0)
        }
        print(metrics)
        with open(metrics_out, 'w') as f:
            json.dump(metrics, f, indent=2)
        print(f"Saved live metrics to {metrics_out}")

def infer_coco(metrics_out='torch2trt_coco_metrics.json'):
    from pathlib import Path
    from torchvision.ops import box_iou  # pairwise IoU for xyxy boxes; torchvision is already imported above
    import os

    MINI_LIST = "/datasets/coco/coco_mini_val.txt"
    if Path(MINI_LIST).exists():
        with open(MINI_LIST) as f:
            image_paths = [l.strip() for l in f if l.strip()]
        image_paths = [p for p in image_paths if Path(p).exists()]
    else:
        import yaml
        with open(DATA_YAML) as f:
            data = yaml.safe_load(f)
        image_dir = Path(data['val'])
        image_paths = sorted(list(image_dir.glob('*.jpg')))[:100]

    tloaded = TRTModule()
    tloaded.load_state_dict(torch.load(str(T2TRT_STATE)))
    tloaded.eval().cuda()

    latencies = []
    TP = FP = FN = 0

    def label_path_for(image_path: str) -> Path:
        p = Path(image_path)
        parts = list(p.parts)
        for i, s in enumerate(parts):
            if s == "images":
                parts[i] = "labels"
                break
        lp = Path(*parts).with_suffix(".txt")
        return lp

    def load_labels_for_image(image_path: str):
        lab = label_path_for(image_path)
        boxes = []
        if lab.exists():
            with open(lab) as f:
                for line in f:
                    vals = line.strip().split()
                    if len(vals) >= 5:
                        c, cx, cy, w, h = map(float, vals[:5])
                        boxes.append((int(c), cx, cy, w, h))
        return boxes

    def scale_yolo_to_xyxy(lbls, w, h):
        out = []
        for c, cx, cy, bw, bh in lbls:
            x1 = (cx - bw/2) * w; y1 = (cy - bh/2) * h
            x2 = (cx + bw/2) * w; y2 = (cy + bh/2) * h
            out.append((c, x1, y1, x2, y2))
        return out

    for img_path in image_paths:
        im0 = cv2.imread(str(img_path))
        if im0 is None:
            continue
        h0, w0 = im0.shape[:2]
        im = cv2.resize(im0, (IMG_SIZE, IMG_SIZE))
        rgb = cv2.cvtColor(im, cv2.COLOR_BGR2RGB)
        t = torch.from_numpy(rgb).permute(2,0,1).float().cuda().unsqueeze(0) / 255.0

        t0 = time.perf_counter()
        with torch.inference_mode():
            pred = tloaded(t)
        t1 = time.perf_counter()
        latencies.append((t1 - t0) * 1000.0)

        if isinstance(pred, (list, tuple)): pred = pred[0]
        if pred.ndim == 3: pred = pred[0]
        # Assumes raw YOLOv8 output (4 + num_classes, num_anchors) with no objectness column
        if pred.shape[0] < pred.shape[1]: pred = pred.t()
        if pred.shape[-1] < 5:
            gt = scale_yolo_to_xyxy(load_labels_for_image(img_path), w0, h0)
            FN += len(gt)
            continue

        boxes_cxcywh = pred[:, :4].float()
        conf = pred[:, 4:].float().max(dim=1).values  # max() returns (values, indices)
        keep = conf > 0.25
        boxes_cxcywh = boxes_cxcywh[keep]; conf = conf[keep]
        if boxes_cxcywh.numel() == 0:
            gt = scale_yolo_to_xyxy(load_labels_for_image(img_path), w0, h0)
            FN += len(gt)
            continue
        cx, cy, w, h = boxes_cxcywh.t()
        x1 = (cx - w/2) * (w0 / IMG_SIZE); y1 = (cy - h/2) * (h0 / IMG_SIZE)
        x2 = (cx + w/2) * (w0 / IMG_SIZE); y2 = (cy + h/2) * (h0 / IMG_SIZE)
        det = torch.stack([x1, y1, x2, y2], dim=1)
        det = det[nms(det, conf, 0.45)]  # suppress duplicate boxes before matching

        gt_list = scale_yolo_to_xyxy(load_labels_for_image(img_path), w0, h0)
        if len(gt_list) == 0:
            FP += len(det)
            continue
        # Keep ground truth on the same device as the detections for box_iou
        gt = torch.tensor([g[1:] for g in gt_list], dtype=torch.float32, device=det.device)
        if det.numel() == 0:
            FN += len(gt)
            continue

        ious = box_iou(det, gt)
        matched_gt = set()
        for i in range(det.shape[0]):
            j = int(torch.argmax(ious[i]))
            if ious[i, j] >= 0.5 and j not in matched_gt:
                TP += 1; matched_gt.add(j)
            else:
                FP += 1
        FN += (gt.shape[0] - len(matched_gt))

    precision = TP / max(TP + FP, 1)
    recall = TP / max(TP + FN, 1)
    f1 = 2 * precision * recall / max(precision + recall, 1e-9)
    metrics = {
        'mean_latency_ms': float(np.mean(latencies)) if latencies else None,
        'p95_latency_ms': float(np.percentile(latencies, 95)) if latencies else None,
        'frames': len(latencies),
        'fps': len(latencies) / (np.sum(latencies)/1000.0) if latencies else None,
        'precision': precision,
        'recall': recall,
        'f1': f1
    }
    print(metrics)
    with open(metrics_out, 'w') as f:
        json.dump(metrics, f, indent=2)
    print(f"Saved COCO metrics to {metrics_out}")

if __name__ == '__main__':
    parser = argparse.ArgumentParser()
    parser.add_argument('--mode', choices=['live', 'coco'], default='coco')
    parser.add_argument('--camera', type=int, default=CAMERA_INDEX)
    parser.add_argument('--metrics_out', type=str, default=None)
    args = parser.parse_args()
    if args.mode == 'live':
        out = args.metrics_out or 'torch2trt_live_metrics.json'
        infer_live(args.camera, metrics_out=out)
    else:
        out = args.metrics_out or 'torch2trt_coco_metrics.json'
        infer_coco(metrics_out=out)

## Task 3: PyTorch → ONNX → TensorRT with FP16

Instructions:
- Revise the Task 1 scripts to enable `HALF`, and save the optimized model to `ONNX_PATH_FP16` and `ENGINE_PATH_FP16`. Note that `task1_build_trt_engine.py` hardcodes `half=False`, so change that argument as well.
- Copy `task1_infer_trt.py` to `task3_infer_trt_fp16.py`, and revise it to load the smaller FP16 engine and save results to `trt_fp16_live_metrics.json` or `trt_fp16_coco_metrics.json`.
- Run `task3_infer_trt_fp16.py` with the FP16 TensorRT engine and collect all metrics.
- Record and compare results against Task 1.
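
When comparing FP16 results against Task 1, it helps to know how coarse half precision is. The sketch below uses only the standard library to round a few arbitrary example values to IEEE 754 half precision. It illustrates the number format only and makes no claim about how TensorRT selects precision per layer.

```python
import struct

def to_fp16(x: float) -> float:
    """Round a Python float to the nearest IEEE 754 half-precision value."""
    return struct.unpack('<e', struct.pack('<e', x))[0]

for value in (0.1, 1.0, 0.25, 1000.3):
    half = to_fp16(value)
    rel_err = abs(half - value) / abs(value)
    print(f"{value!r:>8} -> {half!r:<18} relative error {rel_err:.2e}")
```

Half precision keeps 10 explicit mantissa bits, so the relative rounding error stays at or below about 2^-11 for values in its normal range. Small shifts in Precision, Recall, or mAP between the FP32 and FP16 engines are therefore plausible, while large drops deserve investigation.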

## Task 4: Comparison through Visualization [<a href='#ref-9'>9</a>, <a href='#ref-1'>1</a>]

- Plot latency series and FPS bar charts across baseline, TensorRT engine, and TensorRT engine FP16 [<a href='#ref-1'>1</a>].
- Plot Precision/Recall/F1 versus confidence across baseline, TensorRT engine, and TensorRT engine FP16 [<a href='#ref-9'>9</a>].
- Observe how these metrics change across all tasks.
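
Before plotting, a quick numeric comparison can make the changes easier to read. The sketch below assumes the metrics JSON files written by the Task 0 and Task 1 scripts in `coco` mode; `load_metrics` and `compare` are hypothetical helper names.

```python
import json
from pathlib import Path

def load_metrics(path):
    with open(path) as f:
        return json.load(f)

def compare(baseline, candidate,
            keys=("mean_latency_ms", "p95_latency_ms", "fps", "f1")):
    """Return {key: (baseline_value, candidate_value, ratio)} for numeric keys present in both."""
    out = {}
    for k in keys:
        b, c = baseline.get(k), candidate.get(k)
        if isinstance(b, (int, float)) and isinstance(c, (int, float)) and b:
            out[k] = (b, c, c / b)
    return out

if __name__ == "__main__":
    base_file = Path("baseline_coco_metrics.json")
    cand_file = Path("trt_coco_metrics.json")
    if base_file.exists() and cand_file.exists():
        rows = compare(load_metrics(base_file), load_metrics(cand_file))
        for key, (b, c, ratio) in rows.items():
            print(f"{key:>16}: baseline={b:.3f}  candidate={c:.3f}  ratio={ratio:.2f}x")
    else:
        print("Run the Task 0 and Task 1 scripts in coco mode first.")
```

A ratio above 1 is an improvement for `fps` and `f1`, while a ratio below 1 is an improvement for the latency keys.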

### visualize_efficiency.py (Latency/FPS) [<a href='#ref-1'>1</a>]

In [None]:
# LAB_DIR/visualize_efficiency.py
import json
import numpy as np
import matplotlib.pyplot as plt
from pathlib import Path

def plot_latency_fps_bar(metrics_files, labels, out_png):
    means, p95s, fps = [], [], []
    for f in metrics_files:
        with open(f) as jf:
            m = json.load(jf)
            means.append(m.get('mean_latency_ms', 0))
            p95s.append(m.get('p95_latency_ms', 0))
            fps.append(m.get('fps', 0))
    x = np.arange(len(labels))
    fig, ax = plt.subplots(1,2,figsize=(12,5))
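    width = 0.4  # side-by-side bars; stacking would draw mean + p95 as a single height
    ax[0].bar(x - width/2, means, width, label='Mean Latency')
    ax[0].bar(x + width/2, p95s, width, label='P95 Latency')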
    ax[0].set_xticks(x); ax[0].set_xticklabels(labels, rotation=15)
    ax[0].set_ylabel('Latency (ms)'); ax[0].legend()
    ax[1].bar(x, fps)
    ax[1].set_xticks(x); ax[1].set_xticklabels(labels, rotation=15)
    ax[1].set_ylabel('FPS')
    plt.tight_layout(); plt.savefig(out_png)
    print(f"Saved summary bar chart to {out_png}")

def plot_latency_series(metrics_files, labels, out_png):
    plt.figure(figsize=(10,5))
    for f, lab in zip(metrics_files, labels):
        with open(f) as jf:
            m = json.load(jf)
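            if 'latencies' in m:
                ys = m['latencies']
            else:
                # The inference scripts above do not save per-frame latencies by default;
                # add 'latencies': latencies to their metrics dict to enable this plot
                continue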
            xs = list(range(len(ys)))
            plt.plot(xs, ys, label=lab, alpha=0.8)
    plt.xlabel('Frame'); plt.ylabel('Latency (ms)'); plt.legend(); plt.tight_layout()
    plt.savefig(out_png)
    print(f"Saved latency series plot to {out_png}")

if __name__ == '__main__':
    metrics_files = [
        'baseline_coco_metrics.json',
        'trt_coco_metrics.json',
        'trt_fp16_coco_metrics.json'
    ]
    labels = ['PyTorch', 'TensorRT', 'TensorRT FP16']
    plot_latency_fps_bar(metrics_files, labels, 'latency_fps_bar.png')
    plot_latency_series(metrics_files, labels, 'latency_series.png')

### visualize_task_performance.py (Precision/Recall/F1 Curves) [<a href='#ref-9'>9</a>]

In [None]:
# LAB_DIR/visualize_task_performance.py
import json
import matplotlib.pyplot as plt

def plot_pr_f1_multi(curves_json_list, labels, out_png):
    plt.figure(figsize=(10,5))
    for curves_json, label in zip(curves_json_list, labels):
        with open(curves_json) as f:
            d = json.load(f)
        conf = d['precision_curve'][0]
        p = d['precision_curve'][1]
        r = d['recall_curve'][1]
        f1 = d['f1_curve'][1]
        plt.plot(conf, p, label=f'{label} Precision')
        plt.plot(conf, r, label=f'{label} Recall')
        plt.plot(conf, f1, label=f'{label} F1')
    plt.xlabel('Confidence')
    plt.ylabel('Score')
    plt.legend()
    plt.tight_layout()
    plt.savefig(out_png)
    print(f'Saved {out_png}')

if __name__ == '__main__':
    curves_json_list = [
        'baseline_pr_curve.json',
        'trt_pr_curve.json',
        'trt_fp16_pr_curve.json'
    ]
    labels = ['PyTorch', 'TensorRT', 'TensorRT FP16']
    plot_pr_f1_multi(curves_json_list, labels, 'pr_f1_curves.png')
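
The inference scripts shown earlier do not appear to write the `*_pr_curve.json` files that this visualizer reads, so they have to be produced separately. The sketch below shows one way to build a file with the layout the visualizer indexes (`[confidence_axis, values]` under `precision_curve`, `recall_curve`, and `f1_curve`). The detections are made-up `(confidence, is_true_positive)` pairs, and `pr_f1_curves` is a hypothetical helper, not part of Ultralytics.

```python
import json

def pr_f1_curves(detections, num_gt, steps=11):
    """detections: list of (confidence, is_true_positive). Returns a JSON-ready dict."""
    conf_axis = [i / (steps - 1) for i in range(steps)]
    p, r, f1 = [], [], []
    for thr in conf_axis:
        kept = [is_tp for conf, is_tp in detections if conf >= thr]
        tp = sum(kept)
        precision = tp / len(kept) if kept else 1.0  # convention when nothing is predicted
        recall = tp / num_gt if num_gt else 0.0
        p.append(precision)
        r.append(recall)
        f1.append(2 * precision * recall / (precision + recall) if precision + recall else 0.0)
    return {"precision_curve": [conf_axis, p],
            "recall_curve": [conf_axis, r],
            "f1_curve": [conf_axis, f1]}

# Hypothetical matched detections: (confidence, matched a ground-truth box?)
dets = [(0.9, True), (0.8, True), (0.6, False), (0.3, True)]
curves = pr_f1_curves(dets, num_gt=4)
with open("example_pr_curve.json", "w") as f:
    json.dump(curves, f, indent=2)
print(curves["f1_curve"][1])
```

In the lab, the `(confidence, is_true_positive)` pairs would come from matching each model's predictions against the validation labels, with one JSON file per model.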


## Task 5: Performance Profiling with Nsight and TensorRT [<a href='#ref-7'>7</a>, <a href='#ref-8'>8</a>, <a href='#ref-5'>5</a>, <a href='#ref-10'>10</a>]

Instructions: [<a href='#ref-7'>7</a>]

- Add NvtxRange around inference loops; use Nsight Systems with capture-range=nvtx for focused timelines [<a href='#ref-7'>7</a>].
- Use Nsight Compute on short runs to extract kernel-level metrics and inspect hotspots [<a href='#ref-8'>8</a>].
- Use trtexec with detailed verbosity and JSON exports for per-layer latencies from the ONNX engine [<a href='#ref-5'>5</a>, <a href='#ref-10'>10</a>].
- Install the Nsight Systems GUI on a desktop (lab computer) or laptop (your own computer) by following the guide at https://developer.nvidia.com/nsight-systems/get-started.
- View and visualize the major metrics, then analyze the results.
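
Once per-layer timings have been exported from `trtexec`, a short script can rank the slowest layers. The sketch below assumes the profile JSON is a list whose entries carry `name` and `averageMs` fields, preceded by a `count` entry. That layout is an assumption and may differ between TensorRT versions, so inspect your own file first. `top_layers` and the sample data are hypothetical.

```python
import json

def top_layers(profile, n=5):
    """Return the n slowest layers as (name, average_ms), skipping non-layer entries."""
    layers = [e for e in profile
              if isinstance(e, dict) and "name" in e and "averageMs" in e]
    layers.sort(key=lambda e: e["averageMs"], reverse=True)
    return [(e["name"], e["averageMs"]) for e in layers[:n]]

# Hypothetical excerpt in the assumed export layout
sample = json.loads('''[
  {"count": 100},
  {"name": "conv_0", "timeMs": 120.0, "averageMs": 1.20, "percentage": 40.0},
  {"name": "concat_5", "timeMs": 30.0, "averageMs": 0.30, "percentage": 10.0},
  {"name": "conv_9", "timeMs": 150.0, "averageMs": 1.50, "percentage": 50.0}
]''')
for name, avg_ms in top_layers(sample, n=2):
    print(f"{name:<12} {avg_ms:.2f} ms")
```

To use it on a real run, replace `sample` with `json.load(open(...))` on your exported profile file.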

### Add NVTX annotations to inference loops
Add the following class and relevant annotations to the inference loop of the following scripts:
- `baseline_infer_pytorch.py`
- `task1_infer_trt.py`
- `task3_infer_trt_fp16.py`

In [None]:
# LAB_DIR/profile_wrappers.py
import torch
import torch.cuda.nvtx as nvtx

class NvtxRange:
    def __init__(self, msg): self.msg = msg
    def __enter__(self): nvtx.range_push(self.msg)
    def __exit__(self, exc_type, exc, tb): nvtx.range_pop()

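# Example usage inside an inference loop:
# with NvtxRange('inference_step'):
#     torch.cuda.synchronize()  # drain earlier GPU work before the range starts timing
#     # run inference here
#     torch.cuda.synchronize()  # wait for this step's GPU work so the range covers it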

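The annotation pattern above can be combined with warm-up iterations and per-step timing. The following is a minimal sketch under stated assumptions: `nvtx_range`, `profiled_loop`, `run_inference`, and the `warmup` parameter are hypothetical names, `inputs` is assumed to be a list, and the NVTX and synchronization calls become no-ops when PyTorch or CUDA is unavailable so the script can still be checked off-device.

```python
import time
from contextlib import contextmanager

try:
    import torch
    _HAS_CUDA = torch.cuda.is_available()
except ImportError:  # lets the sketch run on a machine without PyTorch
    torch = None
    _HAS_CUDA = False


def _sync():
    """Block until queued GPU work finishes (no-op without CUDA)."""
    if _HAS_CUDA:
        torch.cuda.synchronize()


@contextmanager
def nvtx_range(msg):
    """Push an NVTX range on entry and pop it on exit, even on exceptions."""
    if _HAS_CUDA:
        torch.cuda.nvtx.range_push(msg)
    try:
        yield
    finally:
        if _HAS_CUDA:
            torch.cuda.nvtx.range_pop()


def profiled_loop(run_inference, inputs, warmup=3):
    """Run warm-up iterations unannotated, then time each annotated step."""
    for x in inputs[:warmup]:
        run_inference(x)
    _sync()  # make sure warm-up work does not leak into the first range
    latencies_ms = []
    for x in inputs[warmup:]:
        with nvtx_range('inference_step'):
            start = time.perf_counter()
            run_inference(x)
            _sync()  # keep GPU completion inside the range
            latencies_ms.append((time.perf_counter() - start) * 1000.0)
    return latencies_ms
```

Keeping warm-up iterations outside the range means one-time costs such as lazy initialization do not appear in the annotated steps. The per-step synchronization makes latencies easy to read on the timeline, but it also removes CPU–GPU overlap, so throughput measured this way may be lower than in an unsynchronized loop.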


---

### Profiling Analysis: Step-by-Step Guide

#### **A. Nsight Systems (CPU–GPU Timeline, NVTX Ranges, Synchronization)**

1. **Annotate Your Code:**
   - Add `NvtxRange` context managers around your inference loop as shown in `profile_wrappers.py`.
   - Example:
     ```python
     import torch
     from profile_wrappers import NvtxRange

     with NvtxRange('inference_step'):
         torch.cuda.synchronize()
         # run inference here
         torch.cuda.synchronize()
     ```

2. **Run Profiling:**
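   - In your terminal, execute:
     ```bash
     nsys profile -o nsys_report --trace=cuda,nvtx,osrt \
       --capture-range=nvtx --nvtx-capture=inference_step \
       --env-var=NSYS_NVTX_PROFILER_REGISTER_ONLY=0 \
       --capture-range-end=stop python <your_script.py>
     ```
   - Replace `<your_script.py>` with the script you want to profile (e.g., `baseline_infer_pytorch.py`).
   - `--nvtx-capture` names the NVTX range that triggers collection (here `inference_step`), and `NSYS_NVTX_PROFILER_REGISTER_ONLY=0` lets Nsight Systems match ranges pushed with plain strings, as `torch.cuda.nvtx.range_push` does.
   - With `--capture-range-end=stop`, collection ends when the first matching range closes. To capture several iterations in one report, wrap the whole loop in one outer range and pass that range's name to `--nvtx-capture`.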

3. **Open the Report:**
   - Transfer the `.nsys-rep` file to your desktop/laptop if needed.
   - Open with Nsight Systems GUI (`nsys-ui nsys_report.nsys-rep`).

4. **Analyze:**
   - Use the **Summary** and **Timeline** views.
   - Look for:
     - NVTX ranges marking inference steps.
     - Overlap between CPU and GPU activity.
     - CUDA kernel launches and memory transfers.
     - Host synchronizations (`cudaDeviceSynchronize`).
   - Identify bottlenecks such as excessive synchronization or poor overlap.

---

#### **B. Nsight Compute (Kernel-Level Analysis, Occupancy, Memory)**

1. **Short Profiling Run:**
   - Run a short inference (few frames or images).
   - Profile with:
     ```bash
     ncu --set full --target-processes all -o ncu_report python <your_script.py>
     ```

2. **Open the Report:**
   - Use Nsight Compute GUI (`ncu-ui ncu_report.ncu-rep`).

3. **Analyze:**
   - Review **Speed Of Light**, **Occupancy**, and **Memory Workload** sections.
   - Check for:
     - Kernel execution times.
     - SM occupancy and eligible warps.
     - Memory throughput and stall reasons.
   - Identify if your workload is memory-bound or compute-bound.

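One way to frame the memory-bound versus compute-bound question is the roofline model: a kernel's arithmetic intensity (FLOPs per byte of DRAM traffic) is compared with the device's ridge point (peak FLOP/s divided by peak memory bandwidth). Kernels below the ridge point are limited by bandwidth, and kernels above it by compute. The sketch below is illustrative only: `classify_kernel` is a hypothetical helper, and the peak figures and kernel counts are placeholders to be replaced with your device's specifications and the values reported by Nsight Compute.

```python
def classify_kernel(flops, dram_bytes, peak_flops_per_s, peak_bytes_per_s):
    """Classify a kernel with the roofline model.

    Returns (bound, arithmetic_intensity, ridge_point, attainable_flops_per_s).
    """
    intensity = flops / dram_bytes                  # FLOPs per byte moved
    ridge = peak_flops_per_s / peak_bytes_per_s     # intensity where the roofs meet
    attainable = min(peak_flops_per_s, intensity * peak_bytes_per_s)
    bound = 'memory-bound' if intensity < ridge else 'compute-bound'
    return bound, intensity, ridge, attainable


# Placeholder device peaks (NOT real Jetson specifications).
PEAK_FLOPS = 1.0e12   # FLOP/s
PEAK_BW = 1.0e11      # bytes/s, so the ridge point is 10 FLOPs per byte

# Placeholder kernel counts for illustration only.
for name, flops, nbytes in [('elementwise', 2.0e9, 1.0e9),
                            ('conv', 5.0e10, 1.0e9)]:
    bound, ai, ridge, roof = classify_kernel(flops, nbytes, PEAK_FLOPS, PEAK_BW)
    print(f'{name}: AI={ai:.1f} FLOP/B, ridge={ridge:.1f}, '
          f'roof={roof:.2e} FLOP/s -> {bound}')
```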
---

#### **C. TensorRT trtexec (Layer Latency, Tactics, Precision)**

1. **Export and Profile Engine:**
   - Use `trtexec` to build an engine from the ONNX model and profile it:
     ```bash
     trtexec --onnx=Lab5/yolov8x.onnx --shapes=input:1x3x640x640 \
       --profilingVerbosity=detailed --dumpLayerInfo --dumpProfile \
       --separateProfileRun --exportTimes=times.json --exportProfile=profile.json
     ```

2. **Analyze Output:**
   - Open `times.json` and `profile.json` in a text editor or import into Python for visualization.
   - Look for:
     - Per-layer latency breakdown.
     - Tactics and precision used (FP32/FP16/INT8), reported in the layer information printed by `--dumpLayerInfo`.
     - Layers with the highest latency.

3. **Compare FP16/INT8:**
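   - Repeat profiling with FP16 or INT8 enabled by adding `--fp16` or `--int8` to the `trtexec` command, exporting to different file names so earlier results are not overwritten.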
   - Compare speedups and check for any accuracy drop using your validation metrics.

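The per-layer analysis and precision comparison above can be scripted. The following is a minimal sketch, assuming the file written by `--exportProfile` is a JSON list whose layer entries carry `name` and `averageMs` keys (entries without those keys, such as a leading run-count record, are skipped). Verify this layout against the output of your `trtexec` version. The helper names and the file names `profile_fp32.json` and `profile_fp16.json` are hypothetical. Because layer fusion can differ between precisions, layer names may not match across engines, so the comparison uses summed layer time rather than a layer-by-layer diff.

```python
import json


def parse_layer_times(entries):
    """Return {layer_name: average_ms} from a parsed profile list."""
    return {
        e['name']: float(e['averageMs'])
        for e in entries
        if isinstance(e, dict) and 'name' in e and 'averageMs' in e
    }


def load_layer_times(path):
    """Read a profile JSON file and return {layer_name: average_ms}."""
    with open(path) as f:
        return parse_layer_times(json.load(f))


def top_layers(layer_times, k=10):
    """Return the k slowest layers as (name, average_ms, share_of_total)."""
    total = sum(layer_times.values())
    ranked = sorted(layer_times.items(), key=lambda kv: kv[1], reverse=True)
    return [(name, ms, ms / total if total else 0.0) for name, ms in ranked[:k]]


def total_speedup(baseline_times, optimized_times):
    """Ratio of summed per-layer averages (baseline / optimized)."""
    return sum(baseline_times.values()) / sum(optimized_times.values())


if __name__ == '__main__':
    # Hypothetical file names: one export per precision setting.
    fp32 = load_layer_times('profile_fp32.json')
    fp16 = load_layer_times('profile_fp16.json')
    for name, ms, share in top_layers(fp32, k=5):
        print(f'{ms:8.3f} ms  {share:6.1%}  {name}')
    print(f'FP16 speedup over FP32: {total_speedup(fp32, fp16):.2f}x')
```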
---

**Tip:** Always annotate and save screenshots and plots, and highlight key findings and recommendations.

---

## References

- <a id='ref-1'></a>[1] Ultralytics YOLO — Model Export & Predict Usage — https://docs.ultralytics.com/modes/export/
- <a id='ref-2'></a>[2] Ultralytics Exporter API reference — https://docs.ultralytics.com/reference/engine/exporter/
- <a id='ref-3'></a>[3] Ultralytics YOLOv8 model usage — https://docs.ultralytics.com/models/yolov8/
- <a id='ref-4'></a>[4] NVIDIA TensorRT Best Practices — https://docs.nvidia.com/deeplearning/tensorrt/latest/performance/best-practices.html
- <a id='ref-5'></a>[5] TensorRT trtexec wrapper & flags — https://github.com/NVIDIA/TensorRT/tree/main/samples/trtexec
- <a id='ref-6'></a>[6] NVIDIA torch2trt GitHub — https://github.com/NVIDIA-AI-IOT/torch2trt
- <a id='ref-7'></a>[7] Nsight Systems User Guide — https://docs.nvidia.com/nsight-systems/UserGuide/index.html
- <a id='ref-8'></a>[8] Nsight Compute CLI/UI — https://docs.nvidia.com/nsight-compute/NsightComputeCli/index.html
- <a id='ref-9'></a>[9] Ultralytics — Model Validation & Metrics — https://docs.ultralytics.com/modes/val/
- <a id='ref-10'></a>[10] trtexec timing/throughput clarifications — https://forums.developer.nvidia.com/t/need-some-precisions-about-trtexec-measures/173133