# Lab 7 - Model Optimization for Inference

## Introduction

In this lab, we will focus on optimizing neural network models for faster inference.
There are many techniques available in PyTorch, including:

* switching the model to **evaluation** mode and disabling gradient computation
* various strategies for GPU speedup, e.g. optimized tensor placement, pinning memory,
  lower precision calculations
* using the `torch.compile()` function for automatic model compilation
* **model quantization** to reduce size and speed up computations
* exporting the model to ONNX format and ONNX Runtime optimization

These techniques allow you to **speed up** the inference and **reduce resource usage**,
which are crucial when deploying ML models to production systems. They are particularly
useful for low-latency applications (e.g. online services, streaming ML), as well as for
mobile and edge deployments with limited resources.

### Environment note

We recommend using a local Python environment managed with `uv`. If you encounter problems
or do not have a CUDA-compatible GPU (e.g. on macOS), you can use Google Colab. In that case,
remember to enable the GPU accelerator in the runtime settings.

In the following exercises, we will use a pretrained Sentence Transformer model,
`sentence-transformers/multi-qa-mpnet-base-cos-v1`. It embeds sentences as 768-dimensional vectors.

## 1. Evaluation mode

When using PyTorch for inference, there are several optimizations that can be applied to reduce the overhead of the model.
They include:

1. Model evaluation (eval) mode - it switches layers that behave differently during training to their inference
   behavior (e.g. dropout is disabled, batch normalization uses its running statistics). Enabled with the
   `model.eval()` method.
2. Disabling gradients - during inference, gradients are not needed, so it omits tracking them and allocating memory
   for them. Used with `torch.no_grad()` context manager, or preferably with a more recently added and more performant
   `torch.inference_mode()`.
3. Inference loop optimization - avoid unnecessary repetition of operations, e.g. move model to a device beforehand,
   pre-allocate memory.

For differences between `no_grad()` and `inference_mode()`, see:
- [this StackOverflow answer](https://stackoverflow.com/a/74197846/9472066)
- [PyTorch forum discussion](https://discuss.pytorch.org/t/pytorch-torch-no-grad-vs-torch-inference-mode/134099)
- [PyTorch docs on grad modes](https://docs.pytorch.org/docs/stable/notes/autograd.html#grad-modes)
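
To make the difference between these modes concrete, here is a minimal sketch on a toy network. The `toy_model` below is a hypothetical stand-in, not the lab model. It illustrates that `model.eval()` only changes layer behavior (dropout becomes a no-op), while gradient tracking has to be disabled separately with `no_grad()` or `inference_mode()`.

```python
import torch
from torch import nn

torch.manual_seed(0)

# Hypothetical toy network with a dropout layer, used only for illustration
toy_model = nn.Sequential(nn.Linear(8, 8), nn.Dropout(p=0.5), nn.Linear(8, 2))
x = torch.randn(4, 8)

toy_model.train()
out_train = toy_model(x)   # dropout active, autograd graph recorded

toy_model.eval()
out_eval_a = toy_model(x)  # dropout disabled, but gradients are still tracked
out_eval_b = toy_model(x)

with torch.no_grad():
    out_no_grad = toy_model(x)    # no autograd graph is recorded

with torch.inference_mode():
    out_inference = toy_model(x)  # additionally marks outputs as inference tensors

print("eval() output requires grad:", out_eval_a.requires_grad)
print("eval() is deterministic:", torch.equal(out_eval_a, out_eval_b))
print("no_grad() output requires grad:", out_no_grad.requires_grad)
print("inference_mode() output is an inference tensor:", out_inference.is_inference())
```

Tensors created under `inference_mode()` cannot later be used in autograd, which is one reason `no_grad()` is sometimes the safer choice.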



### Exercise 1 (3 points)

1. Load the `sentence-transformers/multi-qa-mpnet-base-cos-v1` model and tokenizer. Use the `AutoModel` and
   `AutoTokenizer` classes from the `transformers` library.
2. Create a sample input text and tokenize it (padding, truncation, `return_tensors="pt"`).
3. Measure the inference time of the model in various inference modes (average time over 100 runs):
   - no optimizations (simple PyTorch)
   - `model.eval()`
   - `model.eval()` and `no_grad()`
   - `model.eval()` and `inference_mode()`
4. Compare the speedup of options 2, 3, and 4 over the pure PyTorch. To calculate speedup, divide the
   PyTorch time by the current time.

In general, the time should decrease for subsequent options. If `inference_mode()` is slower than `no_grad()`,
it may be due to some operations in the model not being supported, so `no_grad()` is preferred in such cases.
But when a model contains many operations and the autograd overhead is significant, `inference_mode()` should be faster.
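
The measurement procedure described above can be sketched with the standard library only. The helper names `average_time` and `speedup` are hypothetical, and the warm-up count is an assumption. When timing GPU code, you would additionally call `torch.cuda.synchronize()` before reading the clock, because CUDA kernels are launched asynchronously.

```python
import time


def average_time(fn, num_runs=100, warmup=3):
    """Return the average wall-clock time of fn() in seconds over num_runs calls."""
    for _ in range(warmup):
        fn()  # warm-up calls are excluded from the measurement
    start = time.perf_counter()
    for _ in range(num_runs):
        fn()
    return (time.perf_counter() - start) / num_runs


def speedup(baseline_time, optimized_time):
    """Speedup > 1 means the optimized variant is faster than the baseline."""
    return baseline_time / optimized_time


# Toy workloads standing in for two model variants
slow = average_time(lambda: sum(range(20_000)), num_runs=20)
fast = average_time(lambda: sum(range(2_000)), num_runs=20)
print(f"Speedup: {speedup(slow, fast):.2f}x")
```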

In [1]:
import torch
import time
from transformers import AutoModel, AutoTokenizer

# 1. Load the model and tokenizer
model_name = "sentence-transformers/multi-qa-mpnet-base-cos-v1"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModel.from_pretrained(model_name)

device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')
print(f"Using device: {device}")
model.to(device)



Using device: cuda


MPNetModel(
  (embeddings): MPNetEmbeddings(
    (word_embeddings): Embedding(30527, 768, padding_idx=1)
    (position_embeddings): Embedding(514, 768, padding_idx=1)
    (LayerNorm): LayerNorm((768,), eps=1e-05, elementwise_affine=True)
    (dropout): Dropout(p=0.1, inplace=False)
  )
  (encoder): MPNetEncoder(
    (layer): ModuleList(
      (0-11): 12 x MPNetLayer(
        (attention): MPNetAttention(
          (attn): MPNetSelfAttention(
            (q): Linear(in_features=768, out_features=768, bias=True)
            (k): Linear(in_features=768, out_features=768, bias=True)
            (v): Linear(in_features=768, out_features=768, bias=True)
            (o): Linear(in_features=768, out_features=768, bias=True)
            (dropout): Dropout(p=0.1, inplace=False)
          )
          (LayerNorm): LayerNorm((768,), eps=1e-05, elementwise_affine=True)
          (dropout): Dropout(p=0.1, inplace=False)
        )
        (intermediate): MPNetIntermediate(
          (dense): Linear(in_

In [None]:
# 2. Create a sample input text and tokenize it
sample_text = "Rudawa – rzeka w województwie małopolskim, lewy dopływ Wisły, do której uchodzi w 75,4 km biegu Wisły, na granicy między Zwierzyńcem."
inputs = tokenizer(sample_text, padding=True, truncation=True, return_tensors="pt").to(device)

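import contextlib

num_runs = 100

def measure_inference_time(model, inputs, num_runs, grad_context=torch.no_grad):
    # grad_context is a context-manager factory (contextlib.nullcontext, torch.no_grad or
    # torch.inference_mode), so each variant is timed under exactly the mode it names
    with grad_context():
        _ = model(**inputs)  # warm-up run, excluded from timing
        if torch.cuda.is_available():
            torch.cuda.synchronize()  # finish pending GPU work before starting the timer
        start_time = time.perf_counter()
        for _ in range(num_runs):
            _ = model(**inputs)
        if torch.cuda.is_available():
            torch.cuda.synchronize()  # CUDA kernels run asynchronously
        end_time = time.perf_counter()
    return (end_time - start_time) / num_runs


# no optimizations: training mode, gradients tracked
model.train()
time_no_optim = measure_inference_time(model, inputs, num_runs, grad_context=contextlib.nullcontext)
print(f"- No optimizations (model.train()): {time_no_optim:.6f} s")

# model.eval() only: gradients are still tracked
model.eval()
time_eval = measure_inference_time(model, inputs, num_runs, grad_context=contextlib.nullcontext)
print(f"- model.eval(): {time_eval:.6f} s")

# model.eval() and no_grad()
model.eval()
time_eval_no_grad = measure_inference_time(model, inputs, num_runs, grad_context=torch.no_grad)
print(f"- model.eval() and no_grad(): {time_eval_no_grad:.6f}s")

# model.eval() and inference_mode()
model.eval()
time_eval_inference_mode = measure_inference_time(model, inputs, num_runs, grad_context=torch.inference_mode)
print(f"- model.eval() and inference_mode(): {time_eval_inference_mode:.6f} s")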



- No optimizations (model.train()): 0.021875 s
- model.eval(): 0.028562 s
- model.eval() and no_grad(): 0.025060s
- model.eval() and inference_mode(): 0.017609 s


In [None]:
# 4. Compare the speedup
speedup_eval = time_no_optim / time_eval
speedup_eval_no_grad = time_no_optim / time_eval_no_grad
speedup_eval_inference_mode = time_no_optim / time_eval_inference_mode

print(f"Speedup (model.eval() over no optimizations): {speedup_eval:.2f}x")
print(f"Speedup (model.eval() + no_grad() over no optimizations): {speedup_eval_no_grad:.2f}x")
print(f"Speedup (model.eval() + inference_mode() over no optimizations): {speedup_eval_inference_mode:.2f}x")


Speedup (model.eval() over no optimizations): 0.77x
Speedup (model.eval() + no_grad() over no optimizations): 0.87x
Speedup (model.eval() + inference_mode() over no optimizations): 1.24x


## 2. PyTorch model compilation

PyTorch 2.0 introduced a new functionality, model compilation, which automatically optimizes model execution
via `torch.compile()` function.

This mechanism uses modules such as **TorchDynamo** and **TorchInductor** under the hood to capture the model
computation graph and generate optimized low-level code. The default backend (TorchInductor) can generate
optimized CUDA kernels on GPU, and optimized vectorized code on CPU. It can also fuse operations together and
bypass the overhead of memory transfers and Python interpreter.

Note that `torch.compile()` is a lossless model optimization technique, changing only its physical execution.
You should call it after setting the model to evaluation mode, so that the computation graph contains only the
final inference operations.

Example usage:

```python
model.eval()
compiled_model = torch.compile(model)
```

The call above returns a compiled version of the model that can be used just like the original one.
During the first inference call, the model operations are traced and the computation graph is optimized,
which incurs a one-time overhead that can be quite significant. Further calls reuse the generated optimized
code and should be significantly faster.

### Exercise 2 (2 points)

In this exercise, we will verify the gains from model compilation with `torch.compile()`.

1. Compile the model using `torch.compile()` after switching it to evaluation mode, and warm-up the model
   by running a single inference call. Measure this compilation + warm-up time (just once).
2. Measure the inference time (average of 100 runs) of the compiled model in inference mode.
3. Calculate the speedup, and compare results with those from the previous exercise.

In [4]:
model.eval()

# 1. Compile the model using torch.compile() and warm-up
start_compile_warmup = time.perf_counter()
compiled_model = torch.compile(model)

with torch.inference_mode():
    _ = compiled_model(**inputs)

end_compile_warmup = time.perf_counter()
time_compile_warmup = end_compile_warmup - start_compile_warmup
print(f"Compilation and warm-up time: {time_compile_warmup:.6f}s")

W1125 19:54:38.386000 21573 torch/_inductor/utils.py:1558] [0/0] Not enough SMs to use max_autotune_gemm mode


Compilation and warm-up time: 22.410585s


In [5]:
# 2. Measure the inference time (average of 100 runs) of the compiled model

def measure_compiled_inference_time(model_to_measure, inputs, num_runs):
    start_time = time.perf_counter()
    for _ in range(num_runs):
        with torch.inference_mode():
            _ = model_to_measure(**inputs)
    end_time = time.perf_counter()
    return (end_time - start_time) / num_runs

time_compiled_inference = measure_compiled_inference_time(compiled_model, inputs, num_runs)
print(f"- Compiled model inference time (with inference_mode): {time_compiled_inference:.6f}s")


- Compiled model inference time (with inference_mode): 0.008229s


In [7]:

# 3. Calculate speedup and compare results with those from the previous exercise

speedup_compiled_over_no_optim = time_no_optim / time_compiled_inference
speedup_compiled_over_inference_mode = time_eval_inference_mode / time_compiled_inference

print(f"Speedup (compiled model over no optimizations): {speedup_compiled_over_no_optim:.2f}x")
print(f"Speedup (compiled model over model.eval() + inference_mode()): {speedup_compiled_over_inference_mode:.2f}x")



Speedup (compiled model over no optimizations): 2.66x
Speedup (compiled model over model.eval() + inference_mode()): 2.14x


The compiled model was over 2.6 times faster than the unoptimized one (2.66x), and 2.14x faster than the uncompiled model with `model.eval()` and `inference_mode()`. This comes at the cost of a one-time compilation and warm-up of about 22 seconds.

## 3. Quantization

Another way to optimize a model is to **quantize** its weights, reducing its size, but also the precision.
Quantization means representing parameters (weights, and optionally also activations) with lower precision
than the standard 32 bits. Most often, this means switching to 8-bit integers, i.e. dtype `int8`.

PyTorch provides built-in tools for both **dynamic** and **static quantization**.

**Dynamic quantization:**
- convert weights `fp32 -> int8`, while activations remain in `float32` and are quantized dynamically during
  model execution
- does not require any post-training calibration
- slower and more complex than static quantization, but also more precise
- most effective and popular on CPUs, which widely support integer operations
- GPU usage requires specialized software & hardware (supporting `int8` operations)

**Static quantization:**
- quantize both weights and activations to `int8`
- typically requires calibration, i.e. passing data through the model to estimate value ranges to know
  how to quantize
- faster and simpler to execute, but may be less precise (due to rounding activations)
- more frequently used in production, particularly because the saved model files are smaller in this mode
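
As an illustration of what "quantizing to `int8`" means numerically, here is a minimal pure-Python sketch of per-tensor affine quantization. The function names and the sample weights are hypothetical, and this is a simplified scheme, not PyTorch's exact implementation. Each float is mapped to an integer through a shared `scale` and `zero_point`, and the round trip loses at most half a quantization step.

```python
def quantize_affine(values, qmin=-128, qmax=127):
    """Map floats to integers in [qmin, qmax] with one scale and zero point for the whole tensor."""
    lo = min(min(values), 0.0)  # the range must contain 0 so that 0.0 is represented exactly
    hi = max(max(values), 0.0)
    scale = (hi - lo) / (qmax - qmin) or 1.0  # fall back to 1.0 for an all-zero tensor
    zero_point = max(qmin, min(qmax, round(qmin - lo / scale)))
    quantized = [max(qmin, min(qmax, round(v / scale) + zero_point)) for v in values]
    return quantized, scale, zero_point


def dequantize_affine(quantized, scale, zero_point):
    """Recover approximate float values from the integer representation."""
    return [(q - zero_point) * scale for q in quantized]


weights = [-0.62, -0.1, 0.0, 0.33, 1.27]  # hypothetical fp32 weights
q, scale, zero_point = quantize_affine(weights)
restored = dequantize_affine(q, scale, zero_point)
max_error = max(abs(w - r) for w, r in zip(weights, restored))

print("int8 values:", q)
print(f"scale={scale:.6f}, zero_point={zero_point}")
print(f"max round-trip error: {max_error:.6f} (half a step is {scale / 2:.6f})")
```

Calibration in static quantization serves to estimate `lo` and `hi` for activations ahead of time, whereas dynamic quantization computes them for each batch at run time.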

### Exercise 3 (3 points)

We will perform dynamic quantization of our model, which is very simple to do in PyTorch.
It provides the `torch.ao.quantization.quantize_dynamic()` function, to which we pass the model and a
list of layer types that we want to quantize. In the case of transformers, those are primarily the linear
layers, which contain the majority of weights and perform most computations.

1. Ensure the model is on the CPU.
2. Quantize the model with `torch.ao.quantization.quantize_dynamic()`, setting the target weight dtype to
   `torch.qint8` and layers to a single-element set with `nn.Linear`.
3. Save the model to a new variable (e.g. `model_quantized`), and print it to verify that linear layers have been
   quantized properly (i.e. `DynamicQuantizedLinear` instead of `Linear`).
4. Save both models to disk (`state_dict` for both) and compare the file sizes (e.g. `os.path.getsize()`).
5. Compare the inference speed and speedup on CPU for original and quantized models (again, average of 100 runs).
6. Display the comparison. Do you think that quantization is helpful in this case?

Typically, we would observe the reduction in model size up to 4x and speedup of 1.5-2x, depending on the model type
and what parameters exactly are quantized.

In [8]:
import torch
import torch.nn as nn
import os
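# 1. Ensure the model is on the CPU (.to() moves the module in place, so model_cpu and model are the same object)
model_cpu = model.to('cpu')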


In [9]:
# 2. Quantize the model with torch.ao.quantization.quantize_dynamic()
model_quantized = torch.ao.quantization.quantize_dynamic(
    model_cpu,
    {nn.Linear},
    dtype=torch.qint8
)

For migrations of users: 
1. Eager mode quantization (torch.ao.quantization.quantize, torch.ao.quantization.quantize_dynamic), please migrate to use torchao eager mode quantize_ API instead 
2. FX graph mode quantization (torch.ao.quantization.quantize_fx.prepare_fx,torch.ao.quantization.quantize_fx.convert_fx, please migrate to use torchao pt2e quantization API instead (prepare_pt2e, convert_pt2e) 
3. pt2e quantization has been migrated to torchao (https://github.com/pytorch/ao/tree/main/torchao/quantization/pt2e) 
see https://github.com/pytorch/ao/issues/2259 for more details
  model_quantized = torch.ao.quantization.quantize_dynamic(


In [10]:
# 3. Save the model to a new variable (e.g. model_quantized), and print
print(model_quantized)

MPNetModel(
  (embeddings): MPNetEmbeddings(
    (word_embeddings): Embedding(30527, 768, padding_idx=1)
    (position_embeddings): Embedding(514, 768, padding_idx=1)
    (LayerNorm): LayerNorm((768,), eps=1e-05, elementwise_affine=True)
    (dropout): Dropout(p=0.1, inplace=False)
  )
  (encoder): MPNetEncoder(
    (layer): ModuleList(
      (0-11): 12 x MPNetLayer(
        (attention): MPNetAttention(
          (attn): MPNetSelfAttention(
            (q): DynamicQuantizedLinear(in_features=768, out_features=768, dtype=torch.qint8, qscheme=torch.per_tensor_affine)
            (k): DynamicQuantizedLinear(in_features=768, out_features=768, dtype=torch.qint8, qscheme=torch.per_tensor_affine)
            (v): DynamicQuantizedLinear(in_features=768, out_features=768, dtype=torch.qint8, qscheme=torch.per_tensor_affine)
            (o): DynamicQuantizedLinear(in_features=768, out_features=768, dtype=torch.qint8, qscheme=torch.per_tensor_affine)
            (dropout): Dropout(p=0.1, inplace=F

In [14]:
# 4. Save both models to disk and compare file sizes
original_model_path = "original_model.pth"
quantized_model_path = "quantized_model.pth"

torch.save(model_cpu.state_dict(), original_model_path)
torch.save(model_quantized.state_dict(), quantized_model_path)

original_size = os.path.getsize(original_model_path)
quantized_size = os.path.getsize(quantized_model_path)

print("Model Size Comparison")
print(f"Original model size: {original_size / (1024*1024):.2f} MB")
print(f"Quantized model size: {quantized_size / (1024*1024):.2f} MB")
print(f"Size reduction: {(original_size - quantized_size) / original_size * 100:.2f}%")

Model Size Comparison
Original model size: 417.73 MB
Quantized model size: 173.10 MB
Size reduction: 58.56%


In [None]:

# 5. Compare the inference speed and speedup on CPU
print("\nInference Speed Comparison")

model_cpu.eval()
model_quantized.eval()

inputs_cpu = tokenizer(sample_text, padding=True, truncation=True, return_tensors="pt").to('cpu')

with torch.inference_mode():
    _ = model_cpu(**inputs_cpu)

time_original_cpu = measure_inference_time(model_cpu, inputs_cpu, num_runs)
print(f"Original model inference: {time_original_cpu:.6f} s")

with torch.inference_mode():
    _ = model_quantized(**inputs_cpu)

time_quantized_cpu = measure_inference_time(model_quantized, inputs_cpu, num_runs)
print(f"Quantized model inference: {time_quantized_cpu:.6f} s")

speedup_quantized = time_original_cpu / time_quantized_cpu
print(f"Speedup quantized over original: {speedup_quantized:.2f}x")


Inference Speed Comparison
Original model inference: 0.205795 s
Quantized model inference: 0.113990 s
Speedup quantized over original: 1.81x


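Quantization was helpful in this case: the saved model is 58.56% smaller (417.73 MB vs 173.10 MB) and CPU inference is 1.81x faster. The effect on embedding quality was not measured here.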

## 4. GPU optimization strategies

### GPU inference

The most straightforward way to speed up inference is to run the model on a GPU if you have a suitable card
and can afford that in the production environment. Deep models typically run much faster on GPU than on CPU,
especially for larger batches.

For example:
```python
device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')
model = model.to(device)
inputs_gpu = {k: v.to(device) for k, v in inputs.items()}
with torch.inference_mode():
    outputs = model(**inputs_gpu)
```

Transferring data to the GPU involves additional overhead, so it's done explicitly in PyTorch as above.
Due to this overhead, it may not be beneficial for tiny models and single inputs, so this should be
kept in mind for inference.

After transferring data to the GPU, it is also worth considering the use of `torch.compile()` on the model
to gain additional acceleration through operator fusion and generation of optimized CUDA code. It works
in the same way as the compilation we tried before.

![torch_compile_1](images/with_torch_compile.png)

### CUDA Graphs

Launching individual GPU kernels for single operations incurs a significant overhead for many operations.
Each one requires memory allocation, memory transfer, and synchronization. Instead, we can combine them
in a **CUDA Graph**, replacing a sequence of kernels with a single, efficient operation.

```python
# Enable CUDA Graphs for maximum throughput
compiled_model_with_cudagraphs = torch.compile(model, mode="max-autotune")
```

![torch_compile_2](images/with_torch_compile_with_cuda_graphs.png)

The `max-autotune` mode of PyTorch compilation can generate entirely new operations on the fly. In this mode,
PyTorch creates several Triton kernel implementations for each operation, benchmarks their performance, and
selects the fastest one.

![torch_compile_3](images/with_torch_compile_with_cuda_kernels.png)

These automatically generated kernels often outperform naive operations, or even handwritten generic
implementations, because they are tailored for a given model. For example, tensor shapes are known and
constant, and memory access patterns are predictable.

However, CUDA Graphs are **static** by design - they record a fixed sequence of operations with predefined
tensor shapes. This is problematic for models handling dynamic input sizes, e.g. variable-length sentences
in transformers or images with different size in CNNs. CUDA Graphs become invalid when input dimensions
change.

The `max-autotune-no-cudagraphs` mode addresses this limitation. It still creates custom Triton kernels,
optimized computation graphs, and fused operations, but does not capture CUDA Graphs, so a change in input
shape does not invalidate a recorded graph. This is relevant to many production environments with
unpredictable input sizes, providing both flexibility and high performance.

```python
# Enable max-autotune without CUDA Graphs for dynamic input shapes
compiled_model_dynamic = torch.compile(model, mode="max-autotune-no-cudagraphs")
```

### Pinning GPU memory

When transferring data from CPU to GPU, using **pinned (page-locked) memory** can speed up the transfer process.
By default, PyTorch allocates tensors in pageable memory, which can be slower to transfer to GPU.
To allocate pinned memory, use the `pin_memory=True` argument when creating tensors or DataLoader.

Examples:

```python
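# Tokenizer outputs are ordinary CPU tensors, so pin them explicitly before the transfer (requires CUDA)
inputs = tokenizer(sample_text, padding=True, truncation=True, return_tensors="pt")
inputs = {k: v.pin_memory().to(device, non_blocking=True) for k, v in inputs.items()}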
```

```python
from torch.utils.data import DataLoader

dataloader = DataLoader(dataset, batch_size=32, pin_memory=True)
```

When transferring to GPU, pinned memory allows for faster transfers, improving overall throughput.
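
A minimal sketch of such a transfer is shown below. The helper `move_batch` and the random batch are hypothetical. Pinning is only attempted when a CUDA device is the target, since `Tensor.pin_memory()` needs a CUDA-enabled setup, and `non_blocking=True` lets the copy from pinned memory overlap with other work.

```python
import torch


def move_batch(batch, device):
    """Copy a dict of CPU tensors to `device`, using pinned memory when the target is a GPU."""
    if device.type == "cuda":
        # Pinned (page-locked) source buffers allow asynchronous host-to-device copies
        return {k: v.pin_memory().to(device, non_blocking=True) for k, v in batch.items()}
    return {k: v.to(device) for k, v in batch.items()}


device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

# Hypothetical tokenized batch: 4 sequences of 16 token IDs
batch = {
    "input_ids": torch.randint(0, 1000, (4, 16)),
    "attention_mask": torch.ones(4, 16, dtype=torch.long),
}
moved = move_batch(batch, device)
print({k: (v.device.type, tuple(v.shape)) for k, v in moved.items()})
```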

### Exercise 4 (2 points)

1. Compare inference time of:
   - `torch.compile()` with default settings
   - `torch.compile()` with `mode="max-autotune"`
   - `torch.compile()` with `mode="max-autotune-no-cudagraphs"`
2. Report the average time of 100 runs and speedup of the latter two modes.

Check a few different text input sizes. What happens in the latter two modes?

In [16]:
import collections

model.to(device).eval()

def measure_inference_time_gpu(model, inputs, num_runs):
    with torch.inference_mode():
        _ = model(**inputs)

    torch.cuda.synchronize()
    start_time = time.perf_counter()
    for _ in range(num_runs):
        with torch.inference_mode():
            _ = model(**inputs)

    torch.cuda.synchronize()
    end_time = time.perf_counter()
    return (end_time - start_time) / num_runs

In [17]:
sample_texts_dynamic = {
    "short": "Short prompt.",
    "medium": "Rudawa – rzeka w województwie małopolskim, lewy dopływ Wisły, do której uchodzi w 75,4 km biegu Wisły, na granicy między Zwierzyńcem a Półwsiem Zwierzynieckim, przy zachodnim krańcu bulwarze Rodła w Krakowie.",
    "long": "Theodore John „Ted” Kaczynski, ps. Unabomber (ur. 22 maja 1942 w Chicago, zm. 10 czerwca 2023 w Butner[1]) – amerykański matematyk, terrorysta i seryjny morderca motywujący swoje działania sprzeciwem wobec społeczeństwa i cywilizacji opartych na nowoczesnej technice. Przydomek Unabomber powstał z kryptonimu UNABOM (ang. university and airline bombings), który agenci Federalnego Biura Śledczego (FBI) nadali sprawie Theodore’a Kaczynskiego.",
}

inputs_collection = collections.OrderedDict()
for key, text in sample_texts_dynamic.items():
    inputs_collection[key] = tokenizer(text, padding=True, truncation=True, return_tensors="pt").to(device)

inference_times_gpu = {}

In [18]:
model.to(device).eval()
inference_times_gpu["uncompiled_gpu"] = {}
for key, inputs_gpu in inputs_collection.items():
    inference_times_gpu["uncompiled_gpu"][key] = measure_inference_time_gpu(model, inputs_gpu, num_runs)
    print(f"  Uncompiled GPU ({key}): {inference_times_gpu['uncompiled_gpu'][key]:.6f} s")

  Uncompiled GPU (short): 0.011171 s
  Uncompiled GPU (medium): 0.012495 s
  Uncompiled GPU (long): 0.012553 s


In [None]:
# 1.1 Compare inference time of: torch.compile() with default settings
print("torch.compile() with default settings")
model.to(device).eval()
compiled_model_default = torch.compile(model)
inference_times_gpu["default_compiled"] = {}
for key, inputs_gpu in inputs_collection.items():
    inference_times_gpu["default_compiled"][key] = measure_inference_time_gpu(compiled_model_default, inputs_gpu, num_runs)
    print(f"  Default compiled ({key}): {inference_times_gpu['default_compiled'][key]:.6f} s")


torch.compile() with default settings
  Default compiled (short): 0.005558 s
  Default compiled (medium): 0.009388 s
  Default compiled (long): 0.013214 s


In [20]:
# 1.2 Compare inference time of: torch.compile() with mode="max-autotune"
print("torch.compile() with mode=`max-autotune`")
model.to(device).eval()
compiled_model_max_autotune = torch.compile(model, mode="max-autotune")
inference_times_gpu["max_autotune_compiled"] = {}
for key, inputs_gpu in inputs_collection.items():
    inference_times_gpu["max_autotune_compiled"][key] = measure_inference_time_gpu(compiled_model_max_autotune, inputs_gpu, num_runs)
    print(f"Max-autotune compiled ({key}): {inference_times_gpu['max_autotune_compiled'][key]:.6f} s")

torch.compile() with mode=`max-autotune`
Max-autotune compiled (short): 0.004050 s
Max-autotune compiled (medium): 0.009638 s
Max-autotune compiled (long): 0.011872 s


In [21]:
# 1.3 Compare inference time of: torch.compile() with mode="max-autotune-no-cudagraphs"
print("torch.compile() with mode=`max-autotune-no-cudagraphs`")
model.to(device).eval()
compiled_model_max_autotune_dynamic = torch.compile(model, mode="max-autotune-no-cudagraphs")
inference_times_gpu["max_autotune_no_cudagraphs_compiled"] = {}
for key, inputs_gpu in inputs_collection.items():
    inference_times_gpu["max_autotune_no_cudagraphs_compiled"][key] = measure_inference_time_gpu(compiled_model_max_autotune_dynamic, inputs_gpu, num_runs)
    print(f"  Max-autotune-no-cudagraphs compiled ({key}): {inference_times_gpu['max_autotune_no_cudagraphs_compiled'][key]:.6f} s")

torch.compile() with mode=`max-autotune-no-cudagraphs`
  Max-autotune-no-cudagraphs compiled (short): 0.004132 s
  Max-autotune-no-cudagraphs compiled (medium): 0.009552 s
  Max-autotune-no-cudagraphs compiled (long): 0.011757 s


In [22]:
for size_key in sample_texts_dynamic.keys():
    uncompiled_time = inference_times_gpu["uncompiled_gpu"][size_key]
    default_compiled_time = inference_times_gpu["default_compiled"][size_key]
    max_autotune_time = inference_times_gpu["max_autotune_compiled"][size_key]
    max_autotune_no_cudagraphs_time = inference_times_gpu["max_autotune_no_cudagraphs_compiled"][size_key]

    speedup_default = uncompiled_time / default_compiled_time if default_compiled_time > 0 else float('inf')
    speedup_max_autotune = uncompiled_time / max_autotune_time if max_autotune_time > 0 else float('inf')
    speedup_max_autotune_no_cudagraphs = uncompiled_time / max_autotune_no_cudagraphs_time if max_autotune_no_cudagraphs_time > 0 else float('inf')

    print(f"\nInput Size: {size_key.capitalize()}")
    print(f"  uncompiled GPU: {uncompiled_time:.6f} s")
    print(f"  default compiled GPU: {default_compiled_time:.6f} s (Speedup: {speedup_default:.2f}x)")
    print(f"  max-autotune compiled GPU: {max_autotune_time:.6f} s (Speedup: {speedup_max_autotune:.2f}x)")
    print(f"  max-autotune no-cuda-graphs compiled GPU: {max_autotune_no_cudagraphs_time:.6f} s (Speedup: {speedup_max_autotune_no_cudagraphs:.2f}x)")



Input Size: Short
  uncompiled GPU: 0.011171 s
  default compiled GPU: 0.005558 s (Speedup: 2.01x)
  max-autotune compiled GPU: 0.004050 s (Speedup: 2.76x)
  max-autotune no-cuda-graphs compiled GPU: 0.004132 s (Speedup: 2.70x)

Input Size: Medium
  uncompiled GPU: 0.012495 s
  default compiled GPU: 0.009388 s (Speedup: 1.33x)
  max-autotune compiled GPU: 0.009638 s (Speedup: 1.30x)
  max-autotune no-cuda-graphs compiled GPU: 0.009552 s (Speedup: 1.31x)

Input Size: Long
  uncompiled GPU: 0.012553 s
  default compiled GPU: 0.013214 s (Speedup: 0.95x)
  max-autotune compiled GPU: 0.011872 s (Speedup: 1.06x)
  max-autotune no-cuda-graphs compiled GPU: 0.011757 s (Speedup: 1.07x)


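In general, the larger the input, the smaller the gain from compilation. Uncompiled inference time barely changed with input length (about 0.011-0.013 s), while the compiled variants were 2.0-2.8x faster for the short text, about 1.3x faster for the medium one, and roughly on par with the uncompiled model (0.95-1.07x) for the long one. The two `max-autotune` modes performed almost identically for every input size.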


## 5. Changing numerical precision

Most modern CPU and GPU hardware can perform operations on 16-bit numbers (`float16` / `fp16`)
much faster than on 32-bit numbers (`float32` / `fp32`). This is because we can pack twice as
many values into the same amount of memory, theoretically doubling the throughput. This is
also known as half-precision computation.

If your application can tolerate a minimal drop in accuracy, this kind of quantization (or precision
reduction, depending on definition) is really useful for inference. Since the conversion is a simple
cast that drops the low-order bits (and, for `float16`, narrows the exponent range), it can easily be
done on the fly, and some frameworks support doing this for weights at model loading.

There are also other dedicated formats for neural networks. Newer NVIDIA GPUs also support the `bfloat16`
type, which retains the value range and only cuts precision bits, which typically works better for neural
networks. Further, we can use mixed precision, i.e. perform less sensitive operations in `fp16`
(e.g. convolution), and more precise ones in `fp32` (e.g. weight updates).
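
The trade-off between the two 16-bit formats can be illustrated with the standard library alone. In this sketch the helper names are hypothetical, and `to_bfloat16` truncates the low bits for simplicity, whereas real conversions usually round to nearest. `float16` keeps more precision but overflows early, while `bfloat16` keeps the `float32` range at coarser precision.

```python
import math
import struct


def to_float16(x):
    """Round-trip through IEEE half precision (5 exponent bits, 10 mantissa bits)."""
    try:
        return struct.unpack("<e", struct.pack("<e", x))[0]
    except OverflowError:
        # Values beyond the float16 range (about 65504) cannot be represented
        return math.inf if x > 0 else -math.inf


def to_bfloat16(x):
    """Keep only the top 16 bits of the float32 encoding (8 exponent bits, 7 mantissa bits)."""
    bits = struct.unpack(">I", struct.pack(">f", x))[0]
    return struct.unpack(">f", struct.pack(">I", bits & 0xFFFF0000))[0]


for value in (0.1, 70000.0):
    print(f"{value}: float16 -> {to_float16(value)}, bfloat16 -> {to_bfloat16(value)}")
```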

PyTorch also supports simplified automated casting to reduced precision types with `autocast`, see:
- [torch.amp documentation](https://docs.pytorch.org/docs/stable/amp.html)
- [torch.amp autocasting docs](https://docs.pytorch.org/docs/stable/amp.html#autocasting)
- [automated mixed precision PyTorch tutorial](https://docs.pytorch.org/tutorials/recipes/recipes/amp_recipe.html)
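
A minimal sketch of autocasting is shown below. It assumes a CPU with `bfloat16` so that it runs without a GPU; on a CUDA device you would typically pass `device_type="cuda"` and `dtype=torch.float16` instead. The single `nn.Linear` layer is a hypothetical stand-in for a full model. Note that the weights stay in `float32` and only the computation inside the context is performed in reduced precision.

```python
import torch
from torch import nn

torch.manual_seed(0)

layer = nn.Linear(16, 4)  # weights are stored in float32
x = torch.randn(2, 16)

with torch.inference_mode():
    y_full = layer(x)
    # Inside autocast, eligible ops such as linear layers run in the reduced precision dtype
    with torch.autocast(device_type="cpu", dtype=torch.bfloat16):
        y_auto = layer(x)

print("weight dtype:", layer.weight.dtype)
print("full precision output dtype:", y_full.dtype)
print("autocast output dtype:", y_auto.dtype)
print("max abs difference:", (y_full - y_auto.float()).abs().max().item())
```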

However, if your hardware does not support those types and fast operations, they probably will not
provide any speedup, or this may even slow down execution due to type casts.

You can check if your NVIDIA GPU supports fast float16 (via Tensor Cores) using the following code:

```python
import torch

capability = torch.cuda.get_device_capability()
print(f"CUDA device capability: {capability}")

# Tensor Cores are available on NVIDIA GPUs with compute capability >= 7.0 (e.g. Volta, Turing, Ampere, Hopper)
if capability >= (7, 0):
    print("Tensor Cores available: fast float16 supported.")
else:
    print("Tensor Cores not available: float16 may be slow or unsupported.")
```

Casting model weights and inputs to half-precision works as follows:

```python
model_half = model.half().to('cuda')
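# Only the weights (and any floating-point inputs) are cast; token IDs and the attention mask stay integer
outputs = model_half(input_ids.to('cuda'), attention_mask=attention_mask.to('cuda'))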
```

You can also verify it by running:

```python
model_fp32 = torch.nn.Linear(10, 1)
data_fp32 = torch.randn(100, 10)
labels_fp32 = torch.randn(100, 1)

print(f"Data type of model_fp32 parameters: {model_fp32.weight.dtype}")
print(f"Data type of data_fp32: {data_fp32.dtype}")
print(f"Data type of labels_fp32: {labels_fp32.dtype}")

output_fp32 = model_fp32(data_fp32)
loss_fn = torch.nn.MSELoss()
loss_fp32 = loss_fn(output_fp32, labels_fp32)

print(f"Loss fp32: {loss_fp32.item()}")
```

```python
model_fp16 = model_fp32.half()
data_fp16 = data_fp32.half()
labels_fp16 = labels_fp32.half()

print(f"Data type of model_fp16 parameters: {model_fp16.weight.dtype}")
print(f"Data type of data_fp16: {data_fp16.dtype}")
print(f"Data type of labels_fp16: {labels_fp16.dtype}")

output_fp16 = model_fp16(data_fp16)
loss_fp16 = loss_fn(output_fp16.float(), labels_fp16.float())

print(f"Loss fp16: {loss_fp16.item()}")
```

### Exercise 5 (2 points)

1. Check if your GPU supports Tensor Cores (capability >= (7,0)). If not, switch to Google Colab with GPU runtime.
2. Measure inference time with:
   - full precision (`float32`)
   - manual half-precision (`float16`)
   - automatic mixed precision (`torch.autocast`)
3. Compare time and speedup. Which variant would you use in practice?

In [23]:
model.to(device).eval()

def measure_inference_time_precision(model_to_measure, inputs_to_measure, num_runs, precision_mode='fp32'):
    torch.cuda.synchronize()  # finish any pending GPU work before starting the timer
    start_time = time.perf_counter()
    for _ in range(num_runs):
        if precision_mode == 'autocast':
            with torch.inference_mode(), torch.autocast("cuda", dtype=torch.float16):
                _ = model_to_measure(**inputs_to_measure)
        else:
            with torch.inference_mode():
                _ = model_to_measure(**inputs_to_measure)
    torch.cuda.synchronize()
    end_time = time.perf_counter()
    return (end_time - start_time) / num_runs

inference_times_precision = {}


```python
print("\nMeasuring inference times with different precisions (average over 100 runs):\n")

for key, inputs in inputs_collection.items():
    inference_times_precision[key] = {}

    # float32
    model.to(device).eval()
    _ = measure_inference_time_precision(model, inputs, 100, precision_mode='fp32')  # warm-up

    time_fp32 = measure_inference_time_precision(model, inputs, num_runs, precision_mode='fp32')
    inference_times_precision[key]["fp32"] = time_fp32

    # float16
    # Note: model.half() converts `model` in place and returns the same object,
    # so `model` itself holds float16 weights from this point on.
    model_fp16 = model.half()
    _ = measure_inference_time_precision(model_fp16, inputs, 100, precision_mode='fp16')  # warm-up

    time_fp16 = measure_inference_time_precision(model_fp16, inputs, num_runs, precision_mode='fp16')
    inference_times_precision[key]["fp16"] = time_fp16

    # torch.autocast
    model.to(device).eval()
    _ = measure_inference_time_precision(model, inputs, 100, precision_mode='autocast')  # warm-up

    time_autocast = measure_inference_time_precision(model, inputs, num_runs, precision_mode='autocast')
    inference_times_precision[key]["autocast"] = time_autocast

    print(f"input size: {key}")
    print(f"   -float32: {time_fp32:.6f} s")
    print(f"   -float16: {time_fp16:.6f} s")
    print(f"   -autocast: {time_autocast:.6f} s")
```

```text
Measuring inference times with different precisions (average over 100 runs):

input size: short
   -float32: 0.008301 s
   -float16: 0.008319 s
   -autocast: 0.010559 s
input size: medium
   -float32: 0.008539 s
   -float16: 0.008376 s
   -autocast: 0.013878 s
input size: long
   -float32: 0.008886 s
   -float16: 0.008598 s
   -autocast: 0.010720 s
```


```python
print("\nSpeedup comparison over full prec")

for key in inference_times_precision.keys():
    fp32_time = inference_times_precision[key]["fp32"]
    fp16_time = inference_times_precision[key]["fp16"]
    autocast_time = inference_times_precision[key]["autocast"]

    speedup_fp16 = fp32_time / fp16_time
    speedup_autocast = fp32_time / autocast_time

    print(f"\ninput size: {key}")
    print(f"  float16 speedup: {speedup_fp16:.2f}x")
    print(f"  autocast speedup: {speedup_autocast:.2f}x")
```

```text
Speedup comparison over full prec

input size: short
  float16 speedup: 1.00x
  autocast speedup: 0.79x

input size: medium
  float16 speedup: 1.02x
  autocast speedup: 0.62x

input size: long
  float16 speedup: 1.03x
  autocast speedup: 0.83x
```


It is difficult to choose the best variant from these results. Manual float16 gave almost no speedup over float32 (1.00-1.03x), and autocast, which is generally expected to perform well, was the slowest (0.62-0.83x). This might be due to overhead dominating at small input sizes; autocast is likely more effective with larger inputs or during model training. It is also possible that I did not use it correctly.
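
One detail worth checking when interpreting measurements like these: `nn.Module.half()` converts the module in place and returns the same object, so after `model_fp16 = model.half()` the name `model` also refers to float16 weights. The sketch below uses a toy `torch.nn.Linear` layer (a stand-in, not the lab's Transformer) to show this behavior and one possible way to keep a separate float32 copy with `copy.deepcopy`.

```python
import copy

import torch

layer = torch.nn.Linear(4, 2)

# .half() modifies the module in place and returns the same object
layer_half = layer.half()
print(layer_half is layer)   # both names point to one module
print(layer.weight.dtype)    # the "original" is float16 as well

# To keep separate variants, copy the float32 module before casting
layer_fp32 = torch.nn.Linear(4, 2)
layer_fp16 = copy.deepcopy(layer_fp32).half()
print(layer_fp32.weight.dtype, layer_fp16.weight.dtype)
```

In a benchmark loop, this means either casting a deep copy, or converting the model back with `.float()` before the next float32 or autocast measurement.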

## 5. ONNX

ONNX (Open Neural Network eXchange) is a standardized format for representing neural networks. It abstracts
operations, turning the framework-specific code into an **execution graph** built from standardized operators.
It describes operations, input/output shapes, and model parameters in a hardware- and framework-agnostic way.
Then, it can be run via ONNX Runtime (ORT), which can execute the graph with kernels and optimizations from
specialized providers, like Intel OpenVINO or NVIDIA TensorRT.

ONNX and ONNX Runtime have considerable advantages:
1. Framework- and language-agnostic - ONNX models can be exported from many frameworks and run from many
   programming languages, e.g. you can export a PyTorch model in Python, and then run it in a Java application.
2. Execution graph optimization - ONNX Runtime provides a series of optimizations for the execution graph,
   including hardware-specific operators provided by manufacturers.
3. Lightweight deployment - ONNX & ORT have much smaller package size than the whole PyTorch (even CPU-only wheels),
   reducing sizes of dependencies and Docker containers, and accelerating loading.

In practice, `torch.compile()` works well for PyTorch optimization, but ONNX is preferable for deploying models,
particularly for lightweight or mobile runtimes. It also supports GPU inference via the NVIDIA TensorRT provider.

Exporting to ONNX produces a raw computation graph in `.onnx` format. This file is:
- a static description of operators, weights, and I/O tensors
- a general graph - no hardware-specific rewrites happen during ONNX export
- hardware-agnostic - it does not contain CUDA/CPU kernels or provider information

We will export a Transformer model with dynamic batch size and dynamic sequence length.

```python
import torch
import torch.onnx

# Put the model in eval mode and move to CPU
model_cpu = model.eval().cpu()

# Example input for tracing (for ONNX export)
sample_input = tokenizer(
    "This is a sample input text for ONNX export.",
    padding=True,
    truncation=True,
    return_tensors="pt",
)

# Export to ONNX format
torch.onnx.export(
    model_cpu,
    (sample_input["input_ids"], sample_input["attention_mask"]),
    "model.onnx",
    opset_version=17,
    input_names=["input_ids", "attention_mask"],
    output_names=["output"],
    dynamic_axes={
        "input_ids": {0: "batch_size", 1: "sequence_length"},
        "attention_mask": {0: "batch_size", 1: "sequence_length"},
        "output": {0: "batch_size"},
    },
)
```

We export on CPU in `eval()` mode to get deterministic behavior.

Look at how we marked dynamic axes in `dynamic_axes`:
1. For `input_ids` and `attention_mask`, we marked axes 0 (batch size) and 1 (sequence length) as dynamic,
   since they can vary during inference.
   - axis 0 - batch size, depends on number of inputs
   - axis 1 - sequence length, depends on text length
2. For `output`, we marked only axis 0 (batch size) as dynamic, since the output will have the same number
   of rows as the input batch size. The embedding size is fixed and constant (768), so we don't mark it.

The exported `model.onnx` is a raw graph, not yet optimized. It can be optimized when an `InferenceSession`
is created in ONNX Runtime, or when we explicitly run offline optimizations.

### Optimization & inference with ONNX Runtime

First, we run inference using ONNX Runtime with default settings. By default, all optimizations are applied.

```python
import onnxruntime as ort
import numpy as np

# Load the model
ort_session = ort.InferenceSession("model.onnx")

# Prepare input data
sample_input = tokenizer(
    "This is a sample input text for ONNX inference.",
    padding=True,
    truncation=True,
    return_tensors="np",
)


# Create input dictionary, in same format as during export
inputs_onnx = {
    "input_ids": sample_input["input_ids"],
    "attention_mask": sample_input["attention_mask"],
}

# Run inference
outputs_onnx = ort_session.run(None, inputs_onnx)
```

The raw ONNX graph is parsed, optimized (the default level is `ORT_ENABLE_ALL`), and executed using the default
execution provider (usually the generic CPU provider).

We did not specify a provider in this example to keep the code short. ONNX Runtime internally chooses
providers based on how it was built (for example, CPU only, or CPU + CUDA). For production use, you
should specify providers explicitly. We will do that in the next section.
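
A common sanity check after export is to compare the ONNX Runtime output with the PyTorch output for the same input. The helper below is a sketch working on NumPy arrays; the function name `compare_outputs` and the tolerance are assumptions made for illustration, and the arrays in the usage part are synthetic stand-ins for real model outputs.

```python
import numpy as np


def compare_outputs(reference: np.ndarray, candidate: np.ndarray, atol: float = 1e-4) -> dict:
    """Compare two output arrays of the same shape (hypothetical helper)."""
    if reference.shape != candidate.shape:
        raise ValueError(f"Shape mismatch: {reference.shape} vs {candidate.shape}")
    max_abs_diff = float(np.max(np.abs(reference - candidate)))
    return {
        "max_abs_diff": max_abs_diff,
        "all_close": bool(np.allclose(reference, candidate, atol=atol)),
    }


# Synthetic stand-ins for PyTorch and ONNX Runtime outputs
rng = np.random.default_rng(0)
torch_like = rng.standard_normal((2, 768)).astype(np.float32)
onnx_like = torch_like + np.float32(1e-6)

report = compare_outputs(torch_like, onnx_like)
print(report)
```

With the session above, `candidate` would be `outputs_onnx[0]`, and `reference` would be the PyTorch output for the same text converted with `.detach().cpu().numpy()`.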

### Graph optimization settings

ONNX Runtime groups graph optimizations into levels. Each level builds on the previous one:

1. **Basic graph optimizations** - semantics-preserving rewrites that remove redundant work.
   They run before graph partitioning, so they apply to nodes regardless of the target execution provider.
2. **Extended graph optimizations** - more complex node fusions. They run after graph partitioning and are
   applied only to nodes assigned to selected providers (CPU, CUDA, ROCm).
3. **Layout optimizations** - change the data layout from NCHW to NCHWc for the CPU provider.

All optimizations are enabled by default. You can control them using the `GraphOptimizationLevel` enum:
* `ORT_DISABLE_ALL` – disable all optimizations
* `ORT_ENABLE_BASIC` – only basic
* `ORT_ENABLE_EXTENDED` – basic and extended
* `ORT_ENABLE_ALL` – basic + extended + layout optimizations (default)

### Online mode (load-time optimization)

In online mode, optimizations are applied each time you create an `InferenceSession`, i.e. at model load time:
```python
ort_session = ort.InferenceSession("model.onnx")
```
We can control the optimization level using `SessionOptions`:

```python
options = ort.SessionOptions()
options.graph_optimization_level = ort.GraphOptimizationLevel.ORT_ENABLE_ALL

ort_session = ort.InferenceSession(
    "model.onnx", sess_options=options, providers=["CPUExecutionProvider"]
)
```

Online mode is most convenient for:
- development and experimentation - you can quickly try out different settings
- dynamic environments - when the same model runs on different hardware or deployment targets

The cost of online mode is that optimization work is repeated each time a session is created, which
may be noticeable for large models. When you always deploy to the same known target, offline mode is
a better choice.
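
To get a feeling for how large this repeated cost is, session creation can be timed directly. The sketch below is a generic timing helper; the name `measure_creation_time` and the stand-in factory are assumptions, and only the standard library is used so that it runs without a model file.

```python
import statistics
import time
from typing import Callable


def measure_creation_time(create_session: Callable[[], object], repeats: int = 5) -> dict:
    """Time repeated calls to a session factory (hypothetical helper)."""
    durations = []
    for _ in range(repeats):
        start = time.perf_counter()
        session = create_session()
        durations.append(time.perf_counter() - start)
        del session  # drop the session before the next iteration
    return {
        "mean_s": statistics.mean(durations),
        "min_s": min(durations),
        "max_s": max(durations),
    }


# Stand-in factory; with ONNX Runtime this could be
# lambda: ort.InferenceSession("model.onnx", providers=["CPUExecutionProvider"])
def fake_factory() -> object:
    time.sleep(0.01)
    return object()


stats = measure_creation_time(fake_factory, repeats=3)
print(stats)
```

Repeated creation inside one process benefits from operating system file caching, so a true cold start is better measured in a fresh process.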

### Offline mode (ahead-of-time optimization)

In offline mode, optimizations are applied once, and the optimized model is saved to a new ONNX file.
This can significantly reduce startup time in production environments. The key element is setting the
`SessionOptions.optimized_model_filepath`, which specifies where to save the optimized model.
When enabled, ONNX Runtime runs graph optimizations according to `graph_optimization_level`, and saves
the optimized model to the file.

```python
import onnxruntime as ort

sess_options = ort.SessionOptions()

# Choose the optimization level for the offline pass
sess_options.graph_optimization_level = ort.GraphOptimizationLevel.ORT_ENABLE_EXTENDED

# Save the optimized model to this path
sess_options.optimized_model_filepath = "model_optimized.onnx"

# Create InferenceSession, which will perform offline optimization and save the optimized model
ort.InferenceSession("model.onnx", sess_options, providers=["CPUExecutionProvider"])
```

Afterwards, you can load this file and disable optimizations to avoid re-optimizing it:

```python
# Load the optimized model without re-optimizing
sess_options = ort.SessionOptions()
sess_options.graph_optimization_level = ort.GraphOptimizationLevel.ORT_DISABLE_ALL

ort_session_optimized = ort.InferenceSession(
    "model_optimized.onnx",
    sess_options=sess_options,
    providers=['CPUExecutionProvider']
)
```

Offline mode is best suited for:
- production deployments - startup time is important, and the model changes only during training
- limited resource environments - repeated optimization is costly
- static hardware setups - when we know the hardware configuration, there is no need for re-optimization
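
In a deployment script, the "optimize once, then reuse" idea can be expressed as a small guard around the offline pass. This is only a sketch: `ensure_optimized` and `fake_optimize` are hypothetical names, and the stand-in optimizer simply copies the file so that the example runs without ONNX Runtime.

```python
import tempfile
from pathlib import Path
from typing import Callable


def ensure_optimized(source: Path, target: Path, optimize: Callable[[Path, Path], None]) -> bool:
    """Run `optimize` only when `target` is missing or older than `source`.

    Returns True if the optimization step was executed (hypothetical helper).
    """
    if target.exists() and target.stat().st_mtime >= source.stat().st_mtime:
        return False
    optimize(source, target)
    return True


# Stand-in for the ONNX Runtime offline pass: it just copies the file
def fake_optimize(source: Path, target: Path) -> None:
    target.write_bytes(source.read_bytes())


with tempfile.TemporaryDirectory() as tmp:
    src = Path(tmp) / "model.onnx"
    dst = Path(tmp) / "model_optimized.onnx"
    src.write_bytes(b"raw graph")

    first_run = ensure_optimized(src, dst, fake_optimize)   # runs the optimizer
    second_run = ensure_optimized(src, dst, fake_optimize)  # reuses the saved file

print(first_run, second_run)
```

In practice, `optimize` would create an `InferenceSession` with `optimized_model_filepath` set, as in the offline example above.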

### Execution providers

Execution providers decide how and where the nodes of the ONNX graph are executed. They are not an
extra optimization pass on top of the graph. Instead, they are backends that provide concrete kernel
implementations for operators such as `MatMul`, `Conv`, `LayerNorm`, and so on.

Typical providers include:

* `CPUExecutionProvider`
* `CUDAExecutionProvider`
* `TensorrtExecutionProvider`
* `OpenVINOExecutionProvider`

The ONNX file itself is always hardware-agnostic. It does not contain any provider information.
Providers come into play only when you create an `InferenceSession`. A provider is responsible for:

* mapping ONNX operations to actual kernels, e.g. CPU BLAS vs cuBLAS vs TensorRT engines
* deciding which fused patterns it can execute efficiently for extended optimizations
* executing its part of the graph on the target hardware

So why do we need to care about providers? In production, it is better to be explicit, so that the behavior
does not change when you move the same model to a different environment.

```python
import onnxruntime as ort

options = ort.SessionOptions()
options.graph_optimization_level = ort.GraphOptimizationLevel.ORT_ENABLE_ALL

# Force CPU only
session_cpu = ort.InferenceSession(
    "model.onnx", sess_options=options, providers=["CPUExecutionProvider"]
)

# Prefer CUDA, fall back to CPU if CUDA is not available
session_cuda = ort.InferenceSession(
    "model.onnx",
    sess_options=options,
    providers=["CUDAExecutionProvider", "CPUExecutionProvider"],
)
```

For more information about providers, see the official [Execution Providers section](https://iot-robotics.github.io/ONNXRuntime/docs/execution-providers/).
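
Being explicit about providers can be combined with a check of what the installed ONNX Runtime build actually offers. The sketch below shows one possible selection helper; `select_providers` is a hypothetical name, and the `available` list is hard-coded to mimic a CPU-only install so that the example runs without ONNX Runtime.

```python
from typing import Sequence


def select_providers(preferred: Sequence[str], available: Sequence[str]) -> list[str]:
    """Keep preferred providers that are available, in order of preference (hypothetical helper)."""
    selected = [name for name in preferred if name in available]
    if not selected:
        raise RuntimeError(f"None of {list(preferred)} is available; found {list(available)}")
    return selected


# With ONNX Runtime installed, `available` would come from ort.get_available_providers().
# Here a fixed list mimics a CPU-only install.
available = ["CPUExecutionProvider"]
providers = select_providers(["CUDAExecutionProvider", "CPUExecutionProvider"], available)
print(providers)
```

The resulting list can be passed as the `providers` argument of `ort.InferenceSession`, and logging it at startup makes it visible which backend a deployment really uses.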

### Exercise 6 (3 points)

1. Measure cold start time (including session creation) of the ONNX model using online and offline optimization modes
   on CPU.
2. Measure inference time of the ONNX model on CPU using both optimization modes.
3. Prepare deployment Docker images:
   - build two images, for a) compiled PyTorch model b) ONNX model with ONNX Runtime
   - select the best model in both cases in terms of the inference time
   - install a minimal set of requirements in both cases, e.g. do not install PyTorch for ONNX image
4. Compare for those apps:
   - Docker container sizes
   - response time (average of 100 requests)

# Task
Load the `sentence-transformers/multi-qa-mpnet-base-cos-v1` model and its corresponding tokenizer, and prepare a sample input text by tokenizing it. The model should be moved to the appropriate device (CPU for initial tests).

## Load Model and Tokenizer

### Subtask:
Load the `sentence-transformers/multi-qa-mpnet-base-cos-v1` model and its corresponding tokenizer using `AutoModel` and `AutoTokenizer` from the `transformers` library. Ensure the model is moved to the appropriate device (CPU for initial tests, or GPU if available).


**Reasoning**:
The subtask requires loading the model and tokenizer and moving the model to the appropriate device. This code block will perform all these steps.

