## 5. ONNX

ONNX (Open Neural Network eXchange) is a standardized format for representing neural networks. It abstracts
operations, turning the framework-specific code into an **execution graph** built from standardized operators.
It describes operations, input/output shapes, and model parameters in a hardware- and framework-agnostic way.
The graph can then be run with ONNX Runtime (ORT), which executes it using kernels and optimizations from
specialized execution providers, such as Intel OpenVINO or NVIDIA TensorRT.

ONNX and ONNX Runtime have considerable advantages:
1. Framework- and language-agnostic - an ONNX model can be exported from many frameworks and run from many
   programming languages, e.g. you can export a PyTorch model in Python, and then run it in a Java application.
2. Execution graph optimization - ONNX Runtime provides a series of optimizations for the execution graph,
   including hardware-specific operators provided by manufacturers.
3. Lightweight deployment - ONNX & ORT have a much smaller package size than full PyTorch (even the CPU-only wheels),
   reducing the size of dependencies and Docker containers, and speeding up loading.

In practice, `torch.compile()` works well for optimizing PyTorch code, but ONNX is preferable for deploying models,
particularly for lightweight or mobile runtimes. It also supports GPU inference, e.g. via the NVIDIA TensorRT provider.

Exporting to ONNX produces a raw computation graph in `.onnx` format. This file is:
- a static description of operators, weights, and I/O tensors
- a general graph - no hardware-specific rewrites happen during ONNX export
- hardware-agnostic - it does not contain CUDA/CPU kernels or provider information

We will export a Transformer model with dynamic batch size and dynamic sequence length.

```python
import torch
import torch.onnx

# Put the model in eval mode and move to CPU
model_cpu = model.eval().cpu()

# Example input for tracing (for ONNX export)
sample_input = tokenizer(
    "This is a sample input text for ONNX export.",
    padding=True,
    truncation=True,
    return_tensors="pt",
)

# Export to ONNX format
torch.onnx.export(
    model_cpu,
    (sample_input["input_ids"], sample_input["attention_mask"]),
    "model.onnx",
    opset_version=17,
    input_names=["input_ids", "attention_mask"],
    output_names=["output"],
    dynamic_axes={
        "input_ids": {0: "batch_size", 1: "sequence_length"},
        "attention_mask": {0: "batch_size", 1: "sequence_length"},
        "output": {0: "batch_size"},
    },
)
```

We export on CPU in `eval()` mode, which disables training-only behavior such as dropout, so the exported graph behaves deterministically.

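Look at how we marked dynamic axes in `dynamic_axes`:
1. For `input_ids` and `attention_mask`, we marked axes 0 (batch size) and 1 (sequence length) as dynamic,
   since they can vary during inference.
   - axis 0 - batch size, depends on the number of inputs
   - axis 1 - sequence length, depends on the text length
   - these token tensors have no axis 2; the embedding size (768) appears only in the model's hidden states
     and outputs, where it is fixed and constant, so we don't mark it
2. For `output`, we marked only axis 0 (batch size) as dynamic, since the output will have the same number
   of rows as the input batch size. If your model returns per-token hidden states rather than one vector per
   input, axis 1 of the output also varies with the sequence length and should be marked too.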
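
One way to check which axes ended up dynamic is to read the declared shapes back from the exported file. The sketch below is a minimal helper, assuming the separate `onnx` package is installed for the usage shown in the comments; `declared_shape` is a hypothetical name, and it relies only on the `dim_param` / `dim_value` fields of the ONNX protobuf. Depending on the exporter version, the symbolic names in the file may differ from the ones passed in `dynamic_axes`.

```python
def declared_shape(value_info):
    """Return a tensor's declared shape; dynamic axes appear as symbolic names."""
    dims = value_info.type.tensor_type.shape.dim
    # In the ONNX protobuf, a dimension carries either a symbolic name
    # (dim_param, a string) or a fixed size (dim_value, an integer).
    return [d.dim_param if d.dim_param else d.dim_value for d in dims]


# Usage (assumes `pip install onnx` and an already exported "model.onnx"):
#   import onnx
#   model = onnx.load("model.onnx")
#   for tensor in list(model.graph.input) + list(model.graph.output):
#       print(tensor.name, declared_shape(tensor))
```

A string in the printed shape means the axis is dynamic, while an integer means the size was baked in during export.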

The exported `model.onnx` is a raw graph, not yet optimized. ONNX Runtime optimizes it either when an
`InferenceSession` is created or when we explicitly run offline optimizations.
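
To get a feeling for what the raw graph contains, it can help to count the operator types in it, for example before and after optimization. This is a small sketch under the assumption that the `onnx` package is available for the usage in the comments; `count_operator_types` is a hypothetical helper that only reads `graph.node[i].op_type`.

```python
from collections import Counter


def count_operator_types(model):
    """Count how often each operator type occurs in an ONNX ModelProto-like object."""
    return Counter(node.op_type for node in model.graph.node)


# Usage (assumes `pip install onnx` and an already exported "model.onnx"):
#   import onnx
#   model = onnx.load("model.onnx")
#   onnx.checker.check_model(model)  # raises if the file is malformed
#   print(count_operator_types(model).most_common(5))
```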

### Optimization & inference with ONNX Runtime

First, we run inference using ONNX Runtime with default settings. By default, all optimizations are applied.

```python
import onnxruntime as ort
import numpy as np

# Load the model
ort_session = ort.InferenceSession("model.onnx")

# Prepare input data
sample_input = tokenizer(
    "This is a sample input text for ONNX inference.",
    padding=True,
    truncation=True,
    return_tensors="np",
)


# Create input dictionary, in same format as during export
inputs_onnx = {
    "input_ids": sample_input["input_ids"],
    "attention_mask": sample_input["attention_mask"],
}

# Run inference
outputs_onnx = ort_session.run(None, inputs_onnx)
```

ONNX Runtime parses the raw ONNX graph, optimizes it (the default level is `ORT_ENABLE_ALL`), and executes it
with the default execution provider (usually the generic CPU provider).

We did not specify a provider in this example to keep the code short. ONNX Runtime internally chooses
providers based on how it was built (for example, CPU only, or CPU + CUDA). For production use, you
should specify providers explicitly. We will do that in the next section.

### Graph optimization settings

ONNX Runtime groups graph optimizations into levels. Each level builds on the previous one:

1. **Basic graph optimizations** - semantics-preserving rewrites that remove redundant work.
   They run before graph partitioning, so they apply to nodes regardless of the target execution provider.
2. **Extended graph optimizations** - They run after graph partitioning and are applied only to nodes
   assigned to selected providers (CPU, CUDA, ROCm).
3. **Layout optimizations** - change the data layout from NCHW to NCHWc for the CPU provider.

All optimizations are enabled by default. You can control them using the `GraphOptimizationLevel` enum:
* `ORT_DISABLE_ALL` – disable all optimizations
* `ORT_ENABLE_BASIC` – only basic
* `ORT_ENABLE_EXTENDED` – basic and extended
* `ORT_ENABLE_ALL` – basic + extended + layout optimizations (default)
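
When the level should come from configuration (for example an environment variable), a small lookup keeps the choice explicit. The sketch below is only an illustration: the short aliases and the `optimization_level` helper are hypothetical, and the enum is passed in as an argument so that the function does not depend on how ONNX Runtime is imported.

```python
# Hypothetical short aliases mapped to the GraphOptimizationLevel member names.
_LEVEL_ALIASES = {
    "none": "ORT_DISABLE_ALL",
    "basic": "ORT_ENABLE_BASIC",
    "extended": "ORT_ENABLE_EXTENDED",
    "all": "ORT_ENABLE_ALL",
}


def optimization_level(level_enum, alias):
    """Translate a short alias such as "extended" into a member of `level_enum`."""
    try:
        member_name = _LEVEL_ALIASES[alias.lower()]
    except KeyError:
        raise ValueError(f"Unknown optimization level alias: {alias!r}") from None
    return getattr(level_enum, member_name)


# Usage with ONNX Runtime:
#   import onnxruntime as ort
#   options = ort.SessionOptions()
#   options.graph_optimization_level = optimization_level(
#       ort.GraphOptimizationLevel, "extended"
#   )
```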

### Online mode (load-time optimization)

In online mode, optimizations are applied each time you create an `InferenceSession`, for example:
```python
ort_session = ort.InferenceSession("model.onnx")
```
We can control the optimization level using `SessionOptions`:

```python
options = ort.SessionOptions()
options.graph_optimization_level = ort.GraphOptimizationLevel.ORT_ENABLE_ALL

ort_session = ort.InferenceSession(
    "model.onnx", sess_options=options, providers=["CPUExecutionProvider"]
)
```

Online mode is most convenient for:
- development and experimentation - you can quickly try out different settings
- dynamic environments - when the same model is deployed to different hardware or with different settings

The cost of online mode is that optimization work is repeated each time a session is created, which
may be noticeable for large models. When you always deploy to the same known target, offline mode is
a better choice.

### Offline mode (ahead-of-time optimization)

In offline mode, optimizations are applied once, and the optimized model is saved to a new ONNX file.
This can significantly reduce startup time in production environments. The key element is setting the
`SessionOptions.optimized_model_filepath`, which specifies where to save the optimized model.
When enabled, ONNX Runtime runs graph optimizations according to `graph_optimization_level`, and saves
the optimized model to the file.

```python
import onnxruntime as ort

sess_options = ort.SessionOptions()

# Choose the optimization level for the offline pass
sess_options.graph_optimization_level = ort.GraphOptimizationLevel.ORT_ENABLE_EXTENDED

# Save the optimized model to this path
sess_options.optimized_model_filepath = "model_optimized.onnx"

# Create InferenceSession, which will perform offline optimization and save the optimized model
ort.InferenceSession("model.onnx", sess_options)
```

Afterwards, you can load this file and disable optimizations to avoid re-optimizing:

```python
# Load the optimized model without re-optimizing
sess_options = ort.SessionOptions()
sess_options.graph_optimization_level = ort.GraphOptimizationLevel.ORT_DISABLE_ALL

ort_session_optimized = ort.InferenceSession(
    "model_optimized.onnx",
    sess_options=sess_options,
    providers=['CPUExecutionProvider']
)
```

Offline mode is best suited for:
- production deployments - startup time is important, and the model changes only when it is retrained
- limited resource environments - repeated optimization is costly
- static hardware setups - when we know the hardware configuration, there is no need for re-optimization
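
In a deployment script, the offline pass only needs to run when the optimized file is missing or stale. The following is a minimal sketch of such a check using file modification times; `needs_reoptimization` is a hypothetical helper, and comparing timestamps is an assumption that may not fit every build pipeline (a content hash would be more robust).

```python
from pathlib import Path


def needs_reoptimization(source_path, optimized_path):
    """Return True if the optimized file is missing or older than the source model."""
    source, optimized = Path(source_path), Path(optimized_path)
    if not optimized.exists():
        return True
    return optimized.stat().st_mtime < source.stat().st_mtime


# Usage with ONNX Runtime:
#   import onnxruntime as ort
#   if needs_reoptimization("model.onnx", "model_optimized.onnx"):
#       opts = ort.SessionOptions()
#       opts.graph_optimization_level = ort.GraphOptimizationLevel.ORT_ENABLE_EXTENDED
#       opts.optimized_model_filepath = "model_optimized.onnx"
#       ort.InferenceSession("model.onnx", opts, providers=["CPUExecutionProvider"])
```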

### Execution Providers

Execution providers decide how and where the nodes of the ONNX graph are executed. They are not an
extra optimization pass on top of the graph. Instead, they are backends that provide concrete kernel
implementations for operators such as `MatMul`, `Conv`, `LayerNorm`, and so on.

Typical providers include:

* `CPUExecutionProvider`
* `CUDAExecutionProvider`
* `TensorrtExecutionProvider`
* `OpenVINOExecutionProvider`

The ONNX file itself is always hardware-agnostic. It does not contain any provider information.
Providers come into play only when you create an `InferenceSession`. A provider is responsible for:

* mapping ONNX operations to actual kernels, e.g. CPU BLAS vs cuBLAS vs TensorRT engines
* deciding which fused patterns it can execute efficiently for extended optimizations
* executing its part of the graph on the target hardware

So why do we need to care about providers? In production, it is better to be explicit, so that the behavior
does not change when you move the same model to a different environment.

```python
import onnxruntime as ort

options = ort.SessionOptions()
options.graph_optimization_level = ort.GraphOptimizationLevel.ORT_ENABLE_ALL

# Force CPU only
session_cpu = ort.InferenceSession(
    "model.onnx", sess_options=options, providers=["CPUExecutionProvider"]
)

# Prefer CUDA, fall back to CPU if CUDA is not available
session_cuda = ort.InferenceSession(
    "model.onnx",
    sess_options=options,
    providers=["CUDAExecutionProvider", "CPUExecutionProvider"],
)
```
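
Being explicit can also mean failing early when none of the expected providers exists in the current build. The sketch below filters a preference list against the available providers; `select_providers` is a hypothetical helper, while `ort.get_available_providers()` and `session.get_providers()` in the usage comments are the ONNX Runtime calls it is meant to be combined with.

```python
def select_providers(preferred, available):
    """Keep the preferred providers that are available, preserving priority order."""
    selected = [name for name in preferred if name in available]
    if not selected:
        raise RuntimeError(f"None of {preferred} is available; found {available}")
    return selected


# Usage with ONNX Runtime:
#   import onnxruntime as ort
#   providers = select_providers(
#       ["CUDAExecutionProvider", "CPUExecutionProvider"],
#       ort.get_available_providers(),
#   )
#   session = ort.InferenceSession("model.onnx", providers=providers)
#   print(session.get_providers())  # providers actually used by this session
```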

For more information about providers, see the official [Execution Providers section](https://iot-robotics.github.io/ONNXRuntime/docs/execution-providers/).

### Exercise 6 (3 points)

1. Measure cold start time (including session creation) of the ONNX model using online and offline optimization modes
   on CPU.
2. Measure inference time of the ONNX model on CPU using both optimization modes.
3. Prepare deployment Docker images:
   - build two images, for a) compiled PyTorch model b) ONNX model with ONNX Runtime
   - select the best model in both cases in terms of the inference time
   - install a minimal set of requirements in both cases, e.g. do not install PyTorch for ONNX image
4. Compare for those apps:
   - Docker container sizes
   - response time (average of 100 requests)

In [1]:
import torch
import time
import collections
import os
from transformers import AutoTokenizer, AutoModel
import onnxruntime as ort
import numpy as np



In [2]:
model_name = "sentence-transformers/multi-qa-mpnet-base-cos-v1"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModel.from_pretrained(model_name)

device = 'cpu'
model.to(device).eval()
num_runs = 100

In [3]:
sample_text_for_onnx_export = "sample input text for ONNX export"
sample_input_for_export = tokenizer(
    sample_text_for_onnx_export,
    padding=True,
    truncation=True,
    return_tensors="pt",
).to(device)

sample_text_for_onnx_inference = "sample input text for ONNX inference"

In [5]:
model_cpu_for_onnx = model.to('cpu').eval()

onnx_model_path = "models/model.onnx"

input_ids_cpu = sample_input_for_export['input_ids'].cpu()
attention_mask_cpu = sample_input_for_export['attention_mask'].cpu()

torch.onnx.export(
    model_cpu_for_onnx,
    (input_ids_cpu, attention_mask_cpu),
    onnx_model_path,
    opset_version=17,
    input_names=["input_ids", "attention_mask"],
    output_names=["output"],
    dynamic_axes={
        "input_ids": {0: "batch_size", 1: "sequence_length"},
        "attention_mask": {0: "batch_size", 1: "sequence_length"},
        "output": {0: "batch_size"},
    },
)

W1125 21:41:09.012000 15579 torch/onnx/_internal/exporter/_compat.py:114] Setting ONNX exporter to use operator set version 18 because the requested opset_version 17 is a lower version than we have implementations for. Automatic version conversion will be performed, which may not be successful at converting to the requested version. If version conversion is unsuccessful, the opset version of the exported model will be kept at 18. Please consider setting opset_version >=18 to leverage latest ONNX features
W1125 21:41:09.300000 15579 torch/onnx/_internal/exporter/_registration.py:107] torchvision is not installed. Skipping torchvision::nms


[torch.onnx] Obtain model graph for `MPNetModel([...]` with `torch.export.export(..., strict=False)`...
[torch.onnx] Obtain model graph for `MPNetModel([...]` with `torch.export.export(..., strict=False)`... ✅
[torch.onnx] Run decomposition...


The model version conversion is not supported by the onnxscript version converter and fallback is enabled. The model will be converted using the onnx C API (target version: 17).


[torch.onnx] Run decomposition... ✅
[torch.onnx] Translate the graph into ONNX...
[torch.onnx] Translate the graph into ONNX... ✅
Applied 83 of general pattern rewrite rules.


ONNXProgram(
    model=
        <
            ir_version=10,
            opset_imports={'': 17},
            producer_name='pytorch',
            producer_version='2.9.1',
            domain=None,
            model_version=None,
        >
        graph(
            name=main_graph,
            inputs=(
                %"input_ids"<INT64,[s43,s53]>,
                %"attention_mask"<INT64,[s43,s53]>
            ),
            outputs=(
                %"output"<FLOAT,[1,s53,768]>,
                %"tanh"<FLOAT,[1,768]>
            ),
            initializers=(
                %"embeddings.LayerNorm.weight"<FLOAT,[768]>{TorchTensor(...)},
                %"embeddings.LayerNorm.bias"<FLOAT,[768]>{TorchTensor(...)},
                %"encoder.layer.0.attention.attn.q.bias"<FLOAT,[768]>{TorchTensor(...)},
                %"encoder.layer.0.attention.attn.k.bias"<FLOAT,[768]>{TorchTensor(...)},
                %"encoder.layer.0.attention.attn.v.bias"<FLOAT,[768]>{TorchTensor(...)},
           

In [6]:
inputs_onnx_inference = tokenizer(
    sample_text_for_onnx_inference,
    padding=True,
    truncation=True,
    return_tensors="np",
)

inputs_onnx_dict = {
    "input_ids": inputs_onnx_inference["input_ids"].astype(np.int64),
    "attention_mask": inputs_onnx_inference["attention_mask"].astype(np.int64),
}

In [7]:
# 1. Measure cold start time
start_cold_online = time.perf_counter()
ort_session_online = ort.InferenceSession(
    onnx_model_path,
    providers=["CPUExecutionProvider"] # Explicitly use CPU
)
_ = ort_session_online.run(None, inputs_onnx_dict)
end_cold_online = time.perf_counter()
cold_start_time_online = end_cold_online - start_cold_online
print(f"Cold start time (Online mode): {cold_start_time_online:.6f} s")

Cold start time (Online mode): 0.135782 s


In [None]:
def measure_onnx_inference_time(session, inputs, num_runs):
    start_time = time.perf_counter()
    for _ in range(num_runs):
        _ = session.run(None, inputs)
    end_time = time.perf_counter()
    return (end_time - start_time) / num_runs

inference_time_online = measure_onnx_inference_time(ort_session_online, inputs_onnx_dict, num_runs)
print(f"Avg inference time (100 runs, Online mode): {inference_time_online:.6f} s")

Avg inference time (100 runs): 0.004580 s
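
A single average can hide warm-up effects and outliers, and the online and offline results here differ by only about 0.1 ms. The sketch below is one possible refinement, not part of the original measurements: it discards a few warm-up runs and reports the mean, median, and an approximate 95th percentile. `benchmark_session` is a hypothetical helper that works with any object exposing `run(output_names, inputs)`, such as an `InferenceSession`.

```python
import statistics
import time


def benchmark_session(session, inputs, num_runs=100, warmup=10):
    """Time `session.run` per call and return summary statistics in seconds."""
    if num_runs < 1:
        raise ValueError("num_runs must be at least 1")
    # Warm-up runs are executed but not measured.
    for _ in range(warmup):
        session.run(None, inputs)
    durations = []
    for _ in range(num_runs):
        start = time.perf_counter()
        session.run(None, inputs)
        durations.append(time.perf_counter() - start)
    durations.sort()
    return {
        "mean": statistics.fmean(durations),
        "median": statistics.median(durations),
        # Approximate percentile: element at 95% of the sorted index range.
        "p95": durations[int(0.95 * (len(durations) - 1))],
    }


# Usage with the session from the cell above:
#   print(benchmark_session(ort_session_online, inputs_onnx_dict, num_runs))
```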


In [12]:
# Offline
optimized_onnx_model_path = "models/model_optimized.onnx"

start_offline_optimization_save = time.perf_counter()
options_offline = ort.SessionOptions()
options_offline.graph_optimization_level = ort.GraphOptimizationLevel.ORT_ENABLE_ALL
options_offline.optimized_model_filepath = optimized_onnx_model_path

_ = ort.InferenceSession(
    onnx_model_path,
    sess_options=options_offline,
    providers=["CPUExecutionProvider"]
)
end_offline_optimization_save = time.perf_counter()
save_optimized_time = end_offline_optimization_save - start_offline_optimization_save
print(f"Optimize and save time (Offline mode): {save_optimized_time:.6f} s")

Optimize and save time (Offline mode): 0.138693 s


2025-11-25 21:49:40.320 python[15579:438600] 2025-11-25 21:49:40.314153 [W:onnxruntime:, inference_session.cc:2473 Initialize] Serializing optimized model with Graph Optimization level greater than ORT_ENABLE_EXTENDED and the NchwcTransformer enabled. The generated model may contain hardware specific optimizations, and should only be used in the same environment the model was optimized in.


In [None]:
start_cold_offline = time.perf_counter()
options_load_optimized = ort.SessionOptions()
options_load_optimized.graph_optimization_level = ort.GraphOptimizationLevel.ORT_DISABLE_ALL
ort_session_offline = ort.InferenceSession(
    optimized_onnx_model_path,
    sess_options=options_load_optimized,
    providers=["CPUExecutionProvider"]
)
_ = ort_session_offline.run(None, inputs_onnx_dict)
end_cold_offline = time.perf_counter()
cold_start_time_offline = end_cold_offline - start_cold_offline
print(f"Cold start time: {cold_start_time_offline:.6f} s")

inference_time_offline = measure_onnx_inference_time(ort_session_offline, inputs_onnx_dict, num_runs)
print(f"Avg inference time (Offline mode, 100 runs): {inference_time_offline:.6f} s")

Cold start time: 0.116052 s
Avg inference time (Offline mode, 100 runs): 0.004716 s


# ANALYSIS OF DOCKER APPS

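1. Sizes (taken from the `docker ps --size` output at the end of this section):

| | ONNX | torch |
|---|---|---|
| Container size | 5.5GB (virtual 12.7GB) | 5.54GB (virtual 13.2GB) |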


In [16]:
import requests

In [19]:
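def measure_inference_time(url, text, num_runs):
    # Average the `inference_time` value reported by the app in its JSON response
    total_time = 0.0
    for _ in range(num_runs):
        response = requests.post(
            url,
            json={"text": text}
        )
        response.raise_for_status()  # fail loudly on HTTP errors
        total_time += float(response.json()['inference_time'])

    return total_time / num_runs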

In [20]:
print("ONNX-based app inference time")
measure_inference_time("http://localhost:8000/inference", sample_text_for_onnx_inference, 100)

ONNX-based app inference time


0.45872572422027585

In [25]:
print("Torch-based app inference time")
measure_inference_time("http://localhost:8001/inference", sample_text_for_onnx_inference, 100)

Torch-based app inference time


0.44559636592864993
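
The numbers above average the `inference_time` field returned by each app. The exercise asks for response time, which could also be measured on the client side as the full round trip. The sketch below shows one way to do that, assuming the same endpoints; `mean_response_time` is a hypothetical helper that times an arbitrary callable, so the HTTP call itself stays in the usage comments.

```python
import time


def mean_response_time(send_request, num_runs=100):
    """Mean client-side round-trip time in seconds over `num_runs` calls."""
    total = 0.0
    for _ in range(num_runs):
        start = time.perf_counter()
        send_request()
        total += time.perf_counter() - start
    return total / num_runs


# Usage with `requests` (a reused Session avoids reconnecting on every call):
#   import requests
#   http = requests.Session()
#   send = lambda: http.post(
#       "http://localhost:8000/inference", json={"text": "sample"}, timeout=30
#   ).raise_for_status()
#   print(mean_response_time(send, num_runs=100))
```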

In [26]:
!docker ps --size



CONTAINER ID   IMAGE              COMMAND                  CREATED          STATUS          PORTS                                         NAMES                SIZE
87f99bab5c2f   torch_app-ml-app   "uv run uvicorn torc…"   3 minutes ago    Up 3 minutes    0.0.0.0:8001->8001/tcp, [::]:8001->8001/tcp   torch_app-ml-app-1   5.54GB (virtual 13.2GB)
82a9f9a94e13   onnx_app-ml-app    "uv run uvicorn onnx…"   21 minutes ago   Up 17 minutes   0.0.0.0:8000->8000/tcp, [::]:8000->8000/tcp   onnx_app-ml-app-1    5.5GB (virtual 12.7GB)
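
When reading this output, note that the `SIZE` column of `docker ps --size` reports the container's writable layer, while the `virtual` value is the read-only image data plus that writable layer. Comparing image sizes (for example with `docker image ls`) may therefore be more informative than comparing the first number. The sketch below splits a `SIZE` cell into both values; `parse_docker_size` is a hypothetical helper, and it assumes Docker's decimal units (kB, MB, GB).

```python
import re

# Assumed decimal units as printed by the Docker CLI.
_UNITS = {"B": 1, "kB": 10**3, "MB": 10**6, "GB": 10**9, "TB": 10**12}
_SIZE_PATTERN = re.compile(r"([\d.]+)\s*(B|kB|MB|GB|TB)")


def parse_docker_size(cell):
    """Split a SIZE cell like '5.5GB (virtual 12.7GB)' into (writable, virtual) bytes."""
    found = _SIZE_PATTERN.findall(cell)
    if len(found) != 2:
        raise ValueError(f"Unexpected SIZE format: {cell!r}")
    writable, virtual = (int(round(float(value) * _UNITS[unit])) for value, unit in found)
    return writable, virtual


# Values taken from the output above:
torch_writable, torch_virtual = parse_docker_size("5.54GB (virtual 13.2GB)")
onnx_writable, onnx_virtual = parse_docker_size("5.5GB (virtual 12.7GB)")
# Image-only part (virtual minus writable layer), in GB:
print((torch_virtual - torch_writable) / 1e9, (onnx_virtual - onnx_writable) / 1e9)
```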


Again I have no idea why there are no differences...