# Step 0 — Specify the config and build TensorRT engines

Before running CAICE inference, **choose a precision for each component** and **build TensorRT engines** from your FP32 ONNX model.

### 1) Specify component precision

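Each entry has the form `<component>-<precision>`, and entries are comma-separated (e.g. `ga-fp16,gs-fp16`). The components are `ga` (g_a, analysis transform), `gs` (g_s, synthesis transform), `ha` (h_a, hyper-analysis transform) and `hs` (h_s, hyper-synthesis transform). Each takes `fp8` or `fp16`: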

* `ga-fp8`, `ga-fp16`
* `gs-fp8`, `gs-fp16`
* `ha-fp8`, `ha-fp16`
* `hs-fp8`, `hs-fp16`

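Default config:

* `ga-fp8,gs-fp16,ha-fp8,hs-fp8`, where the `ha`/`hs` entries apply only if the model has those components (`bmshj2018_factorized`, for example, has no h_a/h_s)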

> Note: For FP8, the build script will quantize the corresponding sub-ONNX (Q/DQ) and then build a **strongly-typed** TensorRT engine.
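To make the format concrete, here is a small sketch that parses such a spec into a component-to-precision mapping. `parse_precision_config` is a hypothetical helper written for illustration; `build_engines.py` does its own parsing, which may accept or reject inputs differently.

```python
# Hypothetical helper (not part of build_engines.py): parse a spec such as
# "ga-fp8,gs-fp16" into {"ga": "fp8", "gs": "fp16"}.
VALID_COMPONENTS = {"ga", "gs", "ha", "hs"}
VALID_PRECISIONS = {"fp8", "fp16"}


def parse_precision_config(spec: str) -> dict[str, str]:
    """Split a comma-separated '<component>-<precision>' list into a dict."""
    result: dict[str, str] = {}
    for item in spec.split(","):
        item = item.strip()
        if not item:
            continue  # tolerate a trailing comma
        component, sep, precision = item.partition("-")
        if not sep or component not in VALID_COMPONENTS or precision not in VALID_PRECISIONS:
            raise ValueError(f"invalid entry {item!r}; expected e.g. 'ga-fp8'")
        if component in result:
            raise ValueError(f"component {component!r} specified twice")
        result[component] = precision
    return result


print(parse_precision_config("ga-fp16,gs-fp16"))
# {'ga': 'fp16', 'gs': 'fp16'}
```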

### 2) Define component boundaries (`config.json`)

After choosing the precision, you must define the **graph boundaries** of each component in the JSON file passed via `--boundaries` (here `config.json`).
The build script uses these boundaries to **extract sub-graphs** (g_a / g_s / h_a / h_s) from the full FP32 ONNX before quantizing and building engines.

#### What to provide for each component

For every component, specify:

* `inputs`: a list of **input tensor names** in the ONNX graph
* `outputs`: a list of **output tensor names** in the ONNX graph

Example (JSON):

```json
{
  "ga": { "inputs": ["input"], "outputs": ["/g_a/g_a.6/Conv_output_0"] },
  "ha": { "inputs": ["/g_a/g_a.6/Conv_output_0"], "outputs": ["<ha_out_tensor>"] },
  "hs": { "inputs": ["<hs_in_tensor>"], "outputs": ["/entropy_bottleneck/Transpose_1_output_0"] },
  "gs": { "inputs": ["/entropy_bottleneck/Transpose_1_output_0"], "outputs": ["output"] }
}
```
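The `<ha_out_tensor>` and `<hs_in_tensor>` entries above are placeholders that must be replaced with real tensor names. For a model without h_a/h_s (such as `bmshj2018_factorized`), those entries are left out, and the build output further down reports them with `[Skip]`. A small pre-flight check, sketched here with hypothetical helpers, catches leftover placeholders and malformed entries before a long build.

```python
import json

COMPONENTS = ("ga", "ha", "hs", "gs")


def load_boundaries(path: str) -> dict:
    """Read the boundaries JSON (e.g. config.json)."""
    with open(path) as f:
        return json.load(f)


def check_boundaries(boundaries: dict) -> tuple[list[str], list[str]]:
    """Return (defined, missing) components; raise on malformed or placeholder entries."""
    defined, missing = [], []
    for comp in COMPONENTS:
        spec = boundaries.get(comp)
        if spec is None:
            missing.append(comp)  # e.g. ha/hs for a factorized-prior model
            continue
        for key in ("inputs", "outputs"):
            names = spec.get(key)
            if not isinstance(names, list) or not names or not all(isinstance(n, str) and n for n in names):
                raise ValueError(f"{comp}.{key} must be a non-empty list of tensor names")
            if any(n.startswith("<") and n.endswith(">") for n in names):
                raise ValueError(f"{comp}.{key} still contains a placeholder: {names}")
        defined.append(comp)
    return defined, missing


# defined, missing = check_boundaries(load_boundaries("config.json"))
```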

#### Tips

* Tensor names must match **exactly** what appears in the exported ONNX (case-sensitive).
* You can inspect tensor names using **Netron** or by printing ONNX graph I/O names in Python.
* If you only plan to accelerate a subset of components, you can still define all boundaries now and only build engines for the components listed in your precision config.


### 3) Prepare calibration data and input shapes

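For **FP8 components**, calibration data is required to determine the quantization scales. It is passed to the build script as a `.npy` file (`--calib_npy`), and `--max_calib_samples` limits how many samples are used.

You must also specify the **exact input shape** (`--input_shape`, e.g. `512,3,128,128`), because the engines are built for that fixed shape.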
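A sketch of the two checks implied here, under stated assumptions: `--input_shape` is a comma-separated list of dimensions, and the calibration `.npy` is laid out as `N, C, H, W` (the benchmark's `input_bytes` of 100663296 = 512 * 3 * 128 * 128 * 4 is consistent with that for float32 data). Both helper names are hypothetical, and how `build_engines.py` actually batches calibration data is not shown here.

```python
import numpy as np


def parse_input_shape(text: str) -> tuple[int, ...]:
    """'512,3,128,128' -> (512, 3, 128, 128), mirroring the --input_shape format."""
    return tuple(int(d) for d in text.split(","))


def prepare_calibration(arr: np.ndarray, input_shape: tuple[int, ...],
                        max_samples: int) -> np.ndarray:
    """Check per-sample shape against the engine shape and cap the sample count."""
    # Assumption: axis 0 indexes samples; the remaining axes must match C, H, W.
    if arr.shape[1:] != tuple(input_shape[1:]):
        raise ValueError(f"samples have shape {arr.shape[1:]}, engine expects {input_shape[1:]}")
    return np.ascontiguousarray(arr[:max_samples], dtype=np.float32)


# calib = prepare_calibration(
#     np.load("/hwj/project/aiz-accelerate/data/nyx-dark_matter_density.npy"),
#     parse_input_shape("512,3,128,128"), max_samples=512,
# )
```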


```python
import torch
import onnx
from compressai.zoo import bmshj2018_factorized

quality = 1
project_dir = "/hwj"
device = "cuda:0"

# Load the model with locally stored weights
model = bmshj2018_factorized(quality=quality, pretrained=False)
state = torch.load(f"{project_dir}/data/model/bmshj2018-factorized-prior-{quality}.pth", map_location=device)
model.load_state_dict(state)
model.eval().to(device)

# Export the FP32 ONNX model (fixed input shape 512x3x128x128)
dummy_input = torch.randn((512, 3, 128, 128), device=device, dtype=torch.float32)
onnx_path = f"{project_dir}/data/model/onnx/bmshj2018-factorized-prior-{quality}-f32.onnx"

torch.onnx.export(
    model,
    dummy_input,
    onnx_path,
    input_names=["input"],
    output_names=["output"],
    opset_version=17,
)

# Export the FP16 ONNX model: cast the weights and the dummy input to float16
model.eval().to(device).to(torch.float16)

dummy_input = torch.randn((512, 3, 128, 128), device=device, dtype=torch.float16)
onnx_path = f"{project_dir}/data/model/onnx/bmshj2018-factorized-prior-{quality}-f16.onnx"

torch.onnx.export(
    model,
    dummy_input,
    onnx_path,
    input_names=["input"],
    output_names=["output"],
    opset_version=17,
)
```

```
!python ./utils/build_engines.py \
--onnx_fp32 /hwj/data/model/onnx/bmshj2018-factorized-prior-1-f32.onnx \
--onnx_fp16 /hwj/data/model/onnx/bmshj2018-factorized-prior-1-f16.onnx \
--input_shape 512,3,128,128 \
--config ga-fp16,gs-fp16 \
--boundaries /hwj/project/CompressAI-Science/examples/config.json \
--calib_npy /hwj/project/aiz-accelerate/data/nyx-dark_matter_density.npy \
--out_dir /hwj/project/CompressAI-Science/examples/out_engines \
--model_tag bmshj2018-factorized-q1 \
--max_calib_samples 512 \
--prefer_cuda_ort
```


```text
[OK] Extracted ga (fp16): /hwj/project/CompressAI-Science/examples/out_engines/subonnx/bmshj2018-factorized-q1/ga_fp16.onnx
[Skip] boundaries missing component: ha
[Skip] boundaries missing component: hs
[OK] Extracted gs (fp16): /hwj/project/CompressAI-Science/examples/out_engines/subonnx/bmshj2018-factorized-q1/gs_fp16.onnx

[Engine] Building ga engine (fp16): /hwj/project/CompressAI-Science/examples/out_engines/engines/bmshj2018-factorized-q1/ga/fp16.engine
trtexec --onnx=/hwj/project/CompressAI-Science/examples/out_engines/subonnx/bmshj2018-factorized-q1/ga_fp16.onnx --saveEngine=/hwj/project/CompressAI-Science/examples/out_engines/engines/bmshj2018-factorized-q1/ga/fp16.engine --fp16
&&&& RUNNING TensorRT.trtexec [TensorRT v101200] [b36] # trtexec --onnx=/hwj/project/CompressAI-Science/examples/out_engines/subonnx/bmshj2018-factorized-q1/ga_fp16.onnx --saveEngine=/hwj/project/CompressAI-Science/examples/out_engines/engines/bmshj2018-factorized-q1/ga/fp16.engine --fp16
[01/12/2026-
```
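The build output suggests the layout `<out_dir>/engines/<model_tag>/<component>/<precision>.engine`. Inferred from that log rather than from the script's source, a hypothetical helper can derive the `trt_engines` paths used in Step 1 and flag any that have not been built. This is relevant because Step 1 loads `ga/fp8.engine`, while the command above (`--config ga-fp16,gs-fp16`) produces `ga/fp16.engine`; the FP8 `ga` engine has to come from a separate build with `ga-fp8`.

```python
from pathlib import Path


def engine_paths(out_dir: str, model_tag: str, config: str) -> dict[str, str]:
    """Map each '<component>-<precision>' entry to its expected engine file (layout inferred from the log)."""
    paths = {}
    for item in config.split(","):
        comp, _, prec = item.strip().partition("-")
        paths[comp] = str(Path(out_dir) / "engines" / model_tag / comp / f"{prec}.engine")
    return paths


def missing_engines(paths: dict[str, str]) -> list[str]:
    """Return the components whose engine file does not exist yet."""
    return [comp for comp, p in paths.items() if not Path(p).is_file()]


# paths = engine_paths("/hwj/project/CompressAI-Science/examples/out_engines",
#                      "bmshj2018-factorized-q1", "ga-fp8,gs-fp16")
# print(missing_engines(paths))  # non-empty means build those engines first
```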

# Step 1 — Run Benchmark

```python
import numpy as np
import torch
from compressai.zoo import bmshj2018_factorized
from compressai.runtime import build_runtime
from compressai.runtime.config import RuntimeConfig
from compressai.runtime.codecs import GpuPackedEntropyCodec
from compressai.runtime.utils.benchmark import run_e2e

device = "cuda:0"

# 1) load net
net = bmshj2018_factorized(quality=1, pretrained=False).to(device).eval()
state = torch.load("/hwj/data/model/bmshj2018-factorized-prior-1.pth", map_location=device)
net.load_state_dict(state)

# 2) codec (in runtime)
codec = GpuPackedEntropyCodec(net.entropy_bottleneck, P=12)

# 3) runtime (TRT, dtype auto-infer)
cfg = RuntimeConfig(
    mode="trt",
    ga_input_dtype=torch.float32,
    gs_input_dtype=torch.float16,
    codec_input_dtype=torch.float32,
    trt_engines={
        "ga": "/hwj/project/CompressAI-Science/examples/out_engines/engines/bmshj2018-factorized-q1/ga/fp8.engine",
        "gs": "/hwj/project/CompressAI-Science/examples/out_engines/engines/bmshj2018-factorized-q1/gs/fp16.engine",
    },
)
engine = build_runtime(net, codec, cfg)

# 4) data
arr = np.load("/hwj/project/aiz-accelerate/data/nyx-dark_matter_density.npy")
x = torch.from_numpy(arr).float().to(device)

# 5) benchmark (auto stream)
stats, x_hat, x = run_e2e(engine, codec, x, warmup=3, iters=5)
stats
```


```text
{'input_bytes': 100663296.0,
 'enc_ms': 4.6011073112487795,
 'dec_ms': 8.461580657958985,
 'enc_GBps': 20.375529988357403,
 'dec_GBps': 11.079490202793082,
 'strings_bytes': 330396.0,
 'state_bytes': 31.0,
 'total_bytes': 330427.0,
 'bpp_strings': 161.326171875,
 'bpp_total': 161.34130859375,
 'cr_strings': 304.67468129154105,
 'cr_total': 304.64609732255536,
 'rmse': 0.10539152473211288,
 'nrmse': 0.10580645501613617,
 'maxe': 0.9387968182563782,
 'psnr': 13.489158630371094}
```
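The derived fields in `stats` can be cross-checked by hand. The formulas below are inferred from the printed numbers rather than taken from `run_e2e`'s source, but they reproduce them: throughput is measured in GiB (2^30 bytes) per second, compression ratios divide input bytes by compressed bytes, and the `bpp_*` fields divide the total bits by a single 128x128 plane rather than by all 512 * 3 * 128 * 128 input values.

```python
import math

# Values copied from the `stats` dict printed above.
input_bytes = 100663296.0   # 512 * 3 * 128 * 128 * 4 bytes (float32)
strings_bytes = 330396.0
state_bytes = 31.0
enc_ms = 4.6011073112487795

total_bytes = strings_bytes + state_bytes           # compressed payload plus state
cr_total = input_bytes / total_bytes                # compression ratio
enc_GBps = input_bytes / 2**30 / (enc_ms / 1e3)     # GiB of input encoded per second
bpp_total = total_bytes * 8 / (128 * 128)           # bits per position of one 128x128 plane

print(total_bytes, round(cr_total, 2), round(enc_GBps, 2), round(bpp_total, 2))
# 330427.0 304.65 20.38 161.34
```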

```python
x_hat[0,0,0,:10], x[0,0,0,:10]
```

```text
(tensor([nan, nan, nan, nan, nan, nan, nan, nan, nan, nan], device='cuda:0'),
 tensor([0.0471, 0.0235, 0.1216, 0.0706, 0.0902, 0.0941, 0.0627, 0.1882, 0.0784,
         0.1882], device='cuda:0'))
```
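The reconstruction is NaN at every position shown while the input is finite, so it is worth measuring how widespread the problem is before relying on the scalar metrics. A minimal sketch: `nonfinite_report` is a hypothetical NumPy helper; convert the tensor first with `x_hat.detach().float().cpu().numpy()`.

```python
import numpy as np


def nonfinite_report(a) -> dict:
    """Summarize NaN/Inf entries of an array whose first axis is the batch."""
    a = np.asarray(a)
    nan = np.isnan(a)
    inf = np.isinf(a)
    bad = (nan | inf).reshape(a.shape[0], -1).any(axis=1)
    return {
        "nan_frac": float(nan.mean()),              # fraction of NaN values
        "inf_frac": float(inf.mean()),              # fraction of +/-Inf values
        "bad_samples": np.flatnonzero(bad).tolist(),  # batch indices with any non-finite value
    }


# report = nonfinite_report(x_hat.detach().float().cpu().numpy())
# print(report["nan_frac"], len(report["bad_samples"]))
```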