### Exercise 1 (3 points)

1. Load the `sentence-transformers/multi-qa-mpnet-base-cos-v1` model and tokenizer. Use the `AutoModel` and
   `AutoTokenizer` classes from the `transformers` library.

In [1]:
import time
import torch
from transformers import AutoTokenizer, AutoModel




In [2]:
model_name = "sentence-transformers/multi-qa-mpnet-base-cos-v1"

tokenizer = AutoTokenizer.from_pretrained(model_name)

model = AutoModel.from_pretrained(model_name)

device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
model.to(device)


MPNetModel(
  (embeddings): MPNetEmbeddings(
    (word_embeddings): Embedding(30527, 768, padding_idx=1)
    (position_embeddings): Embedding(514, 768, padding_idx=1)
    (LayerNorm): LayerNorm((768,), eps=1e-05, elementwise_affine=True)
    (dropout): Dropout(p=0.1, inplace=False)
  )
  (encoder): MPNetEncoder(
    (layer): ModuleList(
      (0-11): 12 x MPNetLayer(
        (attention): MPNetAttention(
          (attn): MPNetSelfAttention(
            (q): Linear(in_features=768, out_features=768, bias=True)
            (k): Linear(in_features=768, out_features=768, bias=True)
            (v): Linear(in_features=768, out_features=768, bias=True)
            (o): Linear(in_features=768, out_features=768, bias=True)
            (dropout): Dropout(p=0.1, inplace=False)
          )
          (LayerNorm): LayerNorm((768,), eps=1e-05, elementwise_affine=True)
          (dropout): Dropout(p=0.1, inplace=False)
        )
        (intermediate): MPNetIntermediate(
          (dense): Linear(in_

2. Create a sample input text and tokenize it (padding, truncation, `return_tensors="pt"`).

In [3]:
text = "David and Emma looked at each other across the table. The young couple were happy: the food was delicious, the light from the candles was soft and the music was perfect. "

inputs = tokenizer(
    text,
    padding=True,
    truncation=True,
    return_tensors="pt",
)

3. Measure the inference time of the model in various inference modes (average time over 100 runs):
   - no optimizations (simple PyTorch)
   - `model.eval()`
   - `model.eval()` and `no_grad()`
   - `model.eval()` and `inference_mode()`
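The four variants differ in two independent switches: `model.eval()` changes layer behaviour (for example, it turns dropout off), while `no_grad()` and `inference_mode()` stop autograd from recording operations. A minimal sketch on a toy module, where the `toy` network and the tensor names are made up for illustration and are not part of the exercise:

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
toy = nn.Sequential(nn.Linear(8, 8), nn.Dropout(p=0.5))  # hypothetical stand-in model
x = torch.ones(4, 8)

toy.train()
out_train_a, out_train_b = toy(x), toy(x)  # dropout draws a new random mask per call

toy.eval()
out_eval_a, out_eval_b = toy(x), toy(x)    # dropout is a no-op, so outputs repeat

with torch.no_grad():
    out_no_grad = toy(x)                   # no autograd graph is recorded

with torch.inference_mode():
    out_inference = toy(x)                 # no graph, and the result is an inference tensor

print("eval only, requires_grad:", out_eval_a.requires_grad)
print("no_grad, requires_grad:", out_no_grad.requires_grad)
print("inference_mode, is_inference:", out_inference.is_inference())
```

Switching to `eval()` alone still builds the autograd graph, which is why the context managers give an additional gain on top of it.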

In [None]:
inputs = {k: v.to(device) for k, v in inputs.items()}


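def measure_time(run_fn, n_runs: int = 100):
    """Return the average wall-clock time of run_fn in seconds over n_runs calls."""
    run_fn()  # warm-up call, excluded from the timing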

    if device.type == "cuda":
        torch.cuda.synchronize()

    start = time.perf_counter()

    for _ in range(n_runs):
        run_fn()

    if device.type == "cuda":
        torch.cuda.synchronize()

    end = time.perf_counter()

    return (end - start) / n_runs


In [5]:
model.train()


def forward_pass():
    _ = model(**inputs)

time_train_mode = measure_time(forward_pass)
print(f"Forward pass (train mode): {time_train_mode:.6f} s")

Forward pass (train mode): 0.005411 s


In [6]:
model.eval()

time_eval_mode = measure_time(forward_pass)
print(f"Forward pass (eval mode): {time_eval_mode:.6f} s")

Forward pass (eval mode): 0.004736 s


In [7]:
def forward_pass_no_grad():
    with torch.no_grad():
        _ = model(**inputs)

time_eval_mode_no_grad = measure_time(forward_pass_no_grad)
print(f"Forward pass (model.eval() + no_grad()): {time_eval_mode_no_grad:.6f} s")


Forward pass (model.eval() + no_grad()): 0.003610 s


In [8]:
def forward_pass_inference_mode():
    with torch.inference_mode():
        _ = model(**inputs)

time_inf_eval_mode = measure_time(forward_pass_inference_mode)
print(f"Forward pass (model.eval() + inference_mode()): {time_inf_eval_mode:.6f} s")



Forward pass (model.eval() + inference_mode()): 0.003282 s


4. Compare the speedup of options 2, 3, and 4 over plain PyTorch (option 1). To calculate speedup, divide the
   plain PyTorch time by the time of the given option.

In [9]:
def speedup(base, other):
    return base / other

print("Speedup relative to train mode:")
print(f"Forward pass: eval: {speedup(time_train_mode, time_eval_mode):.2f}x")
print(f"Forward pass: eval + no_grad: {speedup(time_train_mode, time_eval_mode_no_grad):.2f}x")
print(f"Forward pass: eval + inference_mode: {speedup(time_train_mode, time_inf_eval_mode):.2f}x")

Speedup relative to train mode:
Forward pass: eval: 1.14x
Forward pass: eval + no_grad: 1.50x
Forward pass: eval + inference_mode: 1.65x
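
As a quick check of the arithmetic, the speedups can be recomputed from the averaged times printed above. The dictionary below simply copies those measurements; the variable names are illustrative.

```python
# Average forward-pass times in seconds, copied from the outputs above.
times = {
    "train": 0.005411,
    "eval": 0.004736,
    "eval + no_grad": 0.003610,
    "eval + inference_mode": 0.003282,
}


def speedup(base: float, other: float) -> float:
    """Speedup of `other` over `base`; values above 1 mean `other` is faster."""
    return base / other


speedups = {
    name: round(speedup(times["train"], t), 2)
    for name, t in times.items()
    if name != "train"
}

for name, value in speedups.items():
    print(f"{name}: {value:.2f}x")
```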


### Exercise 2 (2 points)

In this exercise, we will verify the gains from model compilation with `torch.compile()`.

1. Compile the model using `torch.compile()` after switching it to evaluation mode, and warm up the model
   by running a single inference call. Measure this compilation + warm-up time (just once).

In [12]:
start = time.perf_counter()

model.eval()
compiled_model = torch.compile(model)

def forward_pass_inference_mode_model_compile():
    with torch.inference_mode():
        _ = compiled_model(**inputs)

forward_pass_inference_mode_model_compile()

if device.type == "cuda":
    torch.cuda.synchronize()

end = time.perf_counter()
print(f"Compilation + warm-up time: {(end - start):.6f} s")

Compilation + warm-up time: 8.008003 s


2. Measure the inference time (average of 100 runs) of the compiled model in inference mode.

In [13]:
time_inf_eval_mode_compile = measure_time(forward_pass_inference_mode_model_compile)
print(f"Forward pass (eval + inference_mode + compile): {time_inf_eval_mode_compile:.6f} s")

Forward pass (eval + inference_mode + compile): 0.002651 s


In [14]:
print(f"Forward pass: eval + inference_mode + compile: {speedup(time_train_mode, time_inf_eval_mode_compile):.2f}x")

Forward pass: eval + inference_mode + compile: 2.04x


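The compiled model is faster: 0.002651 s per forward pass versus 0.003282 s for the uncompiled model in inference mode (about 1.24x), and 2.04x faster than the train-mode baseline. The price is a one-time compilation and warm-up cost of about 8 s.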
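
Whether compilation pays off depends on how many calls amortize the one-time cost. Below is a rough break-even estimate from the numbers measured above, assuming the per-call times stay constant and no recompilation is triggered (for example by new input shapes). The variable names are illustrative.

```python
import math

compile_and_warmup_s = 8.008003  # one-time cost measured above
eager_s = 0.003282               # eval + inference_mode, per call
compiled_s = 0.002651            # eval + inference_mode + compile, per call

# Time saved by the compiled model on every call after warm-up.
saved_per_call_s = eager_s - compiled_s

# Number of calls after which the saved time covers the one-time cost.
break_even_calls = math.ceil(compile_and_warmup_s / saved_per_call_s)

print(f"Saved per call: {saved_per_call_s * 1e3:.3f} ms")
print(f"Break-even after about {break_even_calls} calls")
```

For a long-running service this threshold is reached quickly, while for a short script that runs a handful of inferences the compilation cost dominates.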

### Exercise 3 (3 points)

We will apply dynamic quantization to our model, which is operationally very simple in PyTorch.
PyTorch provides the `torch.ao.quantization.quantize_dynamic()` function, to which we pass the model and a
set of layer types that we want to quantize. In the case of transformers, those are primarily the linear
layers, which contain the majority of weights and perform most computations.

1. Ensure the model is on the CPU.
2. Quantize the model with `torch.ao.quantization.quantize_dynamic()`, setting the target weight dtype to `torch.qint8` and
   the layers to a single-element set with `nn.Linear`.
3. Save the model to a new variable (e.g. `model_quantized`), and print it to verify that linear layers have been
   quantized properly (i.e. `DynamicQuantizedLinear` instead of `Linear`).
4. Save both models to disk (`state_dict` for both) and compare the file sizes (e.g. `os.path.getsize()`).
5. Compare the inference speed and speedup on CPU for original and quantized models (again, average of 100 runs).
6. Display the comparison. Do you think that quantization is helpful in this case?
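
To make the idea of int8 weights concrete, the sketch below quantizes a handful of values to 8-bit integers with a single scale and zero point per tensor, then maps them back to floats. It is a simplified pure-Python illustration of affine quantization, not PyTorch's actual implementation; the function names and sample weights are made up.

```python
def quantize_affine(values, qmin=-128, qmax=127):
    """Map floats to integers in [qmin, qmax] using one scale and zero point."""
    lo = min(min(values), 0.0)  # the range must contain 0 so that 0.0 stays exact
    hi = max(max(values), 0.0)
    scale = (hi - lo) / (qmax - qmin) or 1.0  # guard against an all-zero tensor
    zero_point = qmin - round(lo / scale)
    quantized = [
        max(qmin, min(qmax, round(v / scale) + zero_point)) for v in values
    ]
    return quantized, scale, zero_point


def dequantize_affine(quantized, scale, zero_point):
    """Recover approximate floats from the integer representation."""
    return [(q - zero_point) * scale for q in quantized]


weights = [-1.0, -0.5, 0.0, 0.25, 1.0]  # illustrative values only
q, scale, zero_point = quantize_affine(weights)
restored = dequantize_affine(q, scale, zero_point)
max_error = max(abs(w - r) for w, r in zip(weights, restored))

print("int8 values:", q)
print(f"scale={scale:.6f}, zero_point={zero_point}, max error={max_error:.6f}")
```

Each weight is stored in one byte instead of four, which is where the size reduction comes from, and the rounding error stays within about one quantization step.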

In [6]:
import os
import torch.nn as nn
from torch.ao.quantization import quantize_dynamic

In [7]:
device_cpu = torch.device("cpu")

model_cpu = model.to(device_cpu)
model_cpu.eval()

inputs_cpu = {k: v.to(device_cpu) for k, v in inputs.items()}

model_quantized = quantize_dynamic(
    model_cpu,             
    {nn.Linear},           
    dtype=torch.qint8     
)

print("Quantized model")
print(model_quantized)


fp32_path = "model_fp32_state_dict.pth"
int8_path = "model_quantized_int8_state_dict.pth"

torch.save(model_cpu.state_dict(), fp32_path)
torch.save(model_quantized.state_dict(), int8_path)

size_fp32 = os.path.getsize(fp32_path)
size_int8 = os.path.getsize(int8_path)

def to_MB(bytes_size):
    return bytes_size / (1024 * 1024)

print("File size")
print(f"fp32 model size: {to_MB(size_fp32):.2f} MB")
print(f"int8 model size: {to_MB(size_int8):.2f} MB")
print(f"size comparison (int8 / fp32): {size_int8 / size_fp32:.2f}")


Quantized model
MPNetModel(
  (embeddings): MPNetEmbeddings(
    (word_embeddings): Embedding(30527, 768, padding_idx=1)
    (position_embeddings): Embedding(514, 768, padding_idx=1)
    (LayerNorm): LayerNorm((768,), eps=1e-05, elementwise_affine=True)
    (dropout): Dropout(p=0.1, inplace=False)
  )
  (encoder): MPNetEncoder(
    (layer): ModuleList(
      (0-11): 12 x MPNetLayer(
        (attention): MPNetAttention(
          (attn): MPNetSelfAttention(
            (q): DynamicQuantizedLinear(in_features=768, out_features=768, dtype=torch.qint8, qscheme=torch.per_tensor_affine)
            (k): DynamicQuantizedLinear(in_features=768, out_features=768, dtype=torch.qint8, qscheme=torch.per_tensor_affine)
            (v): DynamicQuantizedLinear(in_features=768, out_features=768, dtype=torch.qint8, qscheme=torch.per_tensor_affine)
            (o): DynamicQuantizedLinear(in_features=768, out_features=768, dtype=torch.qint8, qscheme=torch.per_tensor_affine)
            (dropout): Dropout(

In [8]:
def forward_fp32():
    with torch.inference_mode():
        _ = model_cpu(**inputs_cpu)


def forward_int8():
    with torch.inference_mode():
        _ = model_quantized(**inputs_cpu)


time_fp32 = measure_time(forward_fp32, n_runs=100)
time_int8 = measure_time(forward_int8, n_runs=100)

print("Inference comparison fp32 vs int8 on CPU")
print(f"fp32 model: {time_fp32:.6f} s")
print(f"int8 model: {time_int8:.6f} s")

speedup_int8 = time_fp32 / time_int8
print(f"Speedup int8 vs fp32: {speedup_int8:.2f}x")

Inference comparison fp32 vs int8 on CPU
fp32 model: 0.027787 s
int8 model: 0.011173 s
Speedup int8 vs fp32: 2.49x


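Quantization is helpful here. On CPU, the int8 quantized model runs about 2.49x faster than the fp32 model (0.011173 s vs 0.027787 s per forward pass), and its saved `state_dict` is smaller as well. Embedding quality was not measured here, so the accuracy cost of quantization remains unverified.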

### Exercise 4 (2 points)

1. Compare inference time of:
   - `torch.compile()` with default settings
   - `torch.compile()` with `mode="max-autotune"`
   - `torch.compile()` with `mode="max-autotune-no-cudagraphs"`
2. Report the average time of 100 runs and speedup of the latter two modes.

Check a few different text input sizes. What happens in the latter two modes?

In [9]:
#device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')
#model = model.to(device)
#inputs_gpu = {k: v.to(device) for k, v in inputs.items()}
#with torch.inference_mode():
#    outputs = model(**inputs_gpu)

In [None]:
# compiled_model_with_cudagraphs = torch.compile(model, mode="max-autotune")

In [None]:
# compiled_model_dynamic = torch.compile(model, mode="max-autotune-no-cudagraphs")

In [None]:
#inputs = tokenizer(sample_text, padding=True, truncation=True, return_tensors="pt", pin_memory=True)

#from torch.utils.data import DataLoader

#dataloader = DataLoader(dataset, batch_size=32, pin_memory=True)


In [21]:
import copy

device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
print("Device:", device)

base_model = model.to(device)
base_model.eval()

texts = {
    "short": "Kate looked around the room at the other people: ten men and ten women all around the same age. ",
    "medium": "Sergeant Frank Spike sat behind his desk and looked out of the window. "
              "Outside, cars moved slowly in the cold, grey rain. He looked down at the grey hairs on his arms.",
    "long": "Dr Tomas Streyer looked around the control room at his team of scientists and engineers. "
            "He was excited and frightened but he tried to seem calm. In a few minutes, they might start to "
            "discover something amazing: how the universe began. He looked out of the window at the beautiful"
            " blue summer sky and tried to breathe slowly. 'Ready,' he said. He pressed the first button and the "
            "complicated computers and machines came to life. 'Set,' he said. He pressed the second button and "
            "switched on the large particle accelerator that lay under the towns and fields of Switzerland. ",
}

def make_inputs(text: str):
    encoded = tokenizer(
        text,
        padding=True,
        truncation=True,
        return_tensors="pt",
    )
    return {k: v.to(device) for k, v in encoded.items()}


model_default = torch.compile(base_model)
model_max_auto = torch.compile(copy.deepcopy(base_model), mode="max-autotune")
model_max_auto_nocuda = torch.compile(copy.deepcopy(base_model), mode="max-autotune-no-cudagraphs")

def make_forward_fn(compiled_model, inputs_gpu):
    def forward():
        with torch.inference_mode():
            _ = compiled_model(**inputs_gpu)
    return forward


Device: cuda


In [22]:
print("Comparison of different torch.compile() modes - inference on GPU")
for name, text in texts.items():
    print(f"Input size: {name}\n")

    inputs_gpu = make_inputs(text)

    # warm-up
    with torch.inference_mode():
        _ = model_default(**inputs_gpu)
        _ = model_max_auto(**inputs_gpu)
        _ = model_max_auto_nocuda(**inputs_gpu)
    if device.type == "cuda":
        torch.cuda.synchronize()

    t_default = measure_time(make_forward_fn(model_default, inputs_gpu))
    t_max_auto = measure_time(make_forward_fn(model_max_auto, inputs_gpu))
    t_max_auto_nocg = measure_time(make_forward_fn(model_max_auto_nocuda, inputs_gpu))

    print(f"default compile: {t_default:.6f}s\n")

    print(f"max-autotune: {t_max_auto:.6f}s\n "
          f"(speedup vs default: {t_default / t_max_auto:.2f}x)\n")
    print(f"max-autotune-no-cudagraphs: {t_max_auto_nocg:.6f}s\n "
          f"(speedup vs default: {t_default / t_max_auto_nocg:.2f}x)\n")


Comparison of different torch.compile() modes - inference on GPU
Input size: short

default compile: 0.002623s

max-autotune: 0.002324s
 (speedup vs default: 1.13x)

max-autotune-no-cudagraphs: 0.002355s
 (speedup vs default: 1.11x)

Input size: medium

default compile: 0.002575s

max-autotune: 0.002442s
 (speedup vs default: 1.05x)

max-autotune-no-cudagraphs: 0.002560s
 (speedup vs default: 1.01x)

Input size: long

default compile: 0.004015s

max-autotune: 0.003885s
 (speedup vs default: 1.03x)

max-autotune-no-cudagraphs: 0.003988s
 (speedup vs default: 1.01x)



For this model, the three `torch.compile()` modes perform very similarly.
`max-autotune` is only slightly faster than the default (about 3 to 13%), with the largest gain on the short input.
`max-autotune-no-cudagraphs` is within about 1% of the default on the medium and long inputs and 1.11x faster on the short one.
Changing the input length (short / medium / long) does not change the ranking of the three modes.


### Exercise 5 (2 points)

1. Check if your GPU supports Tensor Cores (capability >= (7,0)). If not, switch to Google Colab with GPU runtime.
2. Measure inference time with:
   - full precision (`float32`)
   - manual half-precision (`float16`)
   - automatic mixed precision (`torch.autocast`)
3. Compare time and speedup. Which variant would you use in practice?

In [23]:
capability = torch.cuda.get_device_capability()
print(f"CUDA device capability: {capability}")

if capability >= (7, 0):
    print("Tensor Cores available: fast float16 supported.")
else:
    print("Tensor Cores not available: float16 may be slow or unsupported.")

CUDA device capability: (8, 9)
Tensor Cores available: fast float16 supported.


In [35]:
inputs = tokenizer(
    text,
    padding=True,
    truncation=True,
    return_tensors="pt",
)

input_ids = inputs["input_ids"].to("cuda")        
attention_mask = inputs["attention_mask"].to("cuda") 

# .half() converts a module in place, so convert a copy and keep `model` in float32
model_half = copy.deepcopy(model).half().to("cuda")

with torch.inference_mode():
    outputs = model_half(input_ids=input_ids,attention_mask=attention_mask)

In [33]:
print(model_half.dtype)                   
print(outputs.last_hidden_state.dtype)      

torch.float16
torch.float16


In [36]:
model_fp32 = torch.nn.Linear(10, 1)
data_fp32 = torch.randn(100, 10)
labels_fp32 = torch.randn(100, 1)

print(f"Data type of model_fp32 parameters: {model_fp32.weight.dtype}")
print(f"Data type of data_fp32: {data_fp32.dtype}")
print(f"Data type of labels_fp32: {labels_fp32.dtype}")

output_fp32 = model_fp32(data_fp32)
loss_fn = torch.nn.MSELoss()
loss_fp32 = loss_fn(output_fp32, labels_fp32)

print(f"Loss fp32: {loss_fp32.item()}")

model_fp16 = model_fp32.half()
data_fp16 = data_fp32.half()
labels_fp16 = labels_fp32.half()

print(f"Data type of model_fp16 parameters: {model_fp16.weight.dtype}")
print(f"Data type of data_fp16: {data_fp16.dtype}")
print(f"Data type of labels_fp16: {labels_fp16.dtype}")

output_fp16 = model_fp16(data_fp16)
loss_fp16 = loss_fn(output_fp16.float(), labels_fp16.float())

print(f"Loss fp16: {loss_fp16.item()}")

Data type of model_fp32 parameters: torch.float32
Data type of data_fp32: torch.float32
Data type of labels_fp32: torch.float32
Loss fp32: 1.1362165212631226
Data type of model_fp16 parameters: torch.float16
Data type of data_fp16: torch.float16
Data type of labels_fp16: torch.float16
Loss fp16: 1.136107325553894



2. Measure inference time with:
   - full precision (`float32`)
   - manual half-precision (`float16`)
   - automatic mixed precision (`torch.autocast`)

In [38]:
model_fp32 = base_model

model_fp16 = copy.deepcopy(base_model).half().to(device)


def forward_fp32(inputs):
    with torch.inference_mode():
        _ = model_fp32(**inputs)

def forward_fp16(inputs):
    with torch.inference_mode():
        _ = model_fp16(**inputs)

def forward_autocast(inputs):
    with torch.inference_mode(), torch.autocast(device_type="cuda", dtype=torch.float16):
        _ = model_fp32(**inputs)


print("comparison of numerical precisions \n")

for name, text in texts.items():
    print(f"input size: {name}")

    inputs = make_inputs(text)

    t_fp32 = measure_time(lambda: forward_fp32(inputs))
    t_fp16 = measure_time(lambda: forward_fp16(inputs))
    t_amp  = measure_time(lambda: forward_autocast(inputs))

    print(f"float32:{t_fp32:.6f}s")
    print(f"float16 (manual):{t_fp16:.6f}s (speedup vs fp32: {t_fp32 / t_fp16:.2f}x)")
    print(f"autocast (AMP):{t_amp:.6f}s (speedup vs fp32: {t_fp32 / t_amp:.2f}x)")
    print()

comparison of numerical precisions 

input size: short
float32:0.003158s
float16 (manual):0.003022s (speedup vs fp32: 1.04x)
autocast (AMP):0.003655s (speedup vs fp32: 0.86x)

input size: medium
float32:0.002786s
float16 (manual):0.002765s (speedup vs fp32: 1.01x)
autocast (AMP):0.003556s (speedup vs fp32: 0.78x)

input size: long
float32:0.002974s
float16 (manual):0.002874s (speedup vs fp32: 1.03x)
autocast (AMP):0.003794s (speedup vs fp32: 0.78x)



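For this model the time differences are small. Manual float16 is only slightly faster than float32 (1.01x to 1.04x),
while autocast is slower (0.78x to 0.86x), most likely because its per-operation casting overhead outweighs any gain on inputs this small.
One caveat: `nn.Module.half()` converts a module in place, so the float32 baseline is only valid if `base_model` was still in float32 when it was timed; near-identical float32 and float16 timings are also what one would see if it was not.
In practice, for such a small model, I would choose plain float32 or manual float16.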
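
Because `nn.Module.half()` converts a module in place and returns the same object, a float32 baseline is only meaningful if the model was not converted earlier in the session. A minimal sketch of that behaviour on a toy layer, where the `param_dtypes` helper and the `layer` object are hypothetical and stand in for the real model:

```python
import copy

import torch
import torch.nn as nn


def param_dtypes(module: nn.Module) -> set:
    """Return the set of parameter dtypes found in a module."""
    return {p.dtype for p in module.parameters()}


layer = nn.Linear(4, 2)                     # toy stand-in for the real model

half_copy = copy.deepcopy(layer).half()     # converts the copy only
dtypes_after_copy = param_dtypes(layer)     # the original is still float32

returned = layer.half()                     # converts `layer` itself
dtypes_after_inplace = param_dtypes(layer)  # the original is now float16

print(dtypes_after_copy, dtypes_after_inplace, returned is layer)
```

Calling a helper like `param_dtypes` on the baseline model right before timing it is a cheap way to confirm which precision is actually being measured.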


### Exercise 6 (3 points)

1. Measure cold start time (including session creation) of the ONNX model using online and offline optimization modes
   on CPU.
2. Measure inference time of the ONNX model on CPU using both optimization modes.
3. Prepare deployment Docker images:
   - build two images, for a) compiled PyTorch model b) ONNX model with ONNX Runtime
   - select the best model in both cases in terms of the inference time
   - install a minimal set of requirements in both cases, e.g. do not install PyTorch for ONNX image
4. Compare for those apps:
   - Docker container sizes
   - response time (average of 100 requests)

In [None]:
import torch
import torch.onnx

model_cpu = model.eval().cpu()

sample_input = tokenizer(
    "This is a sample input text for ONNX export.",
    padding=True,
    truncation=True,
    return_tensors="pt",
)

torch.onnx.export(
    model_cpu,
    (sample_input["input_ids"], sample_input["attention_mask"]),
    "model.onnx",
    opset_version=17,
    input_names=["input_ids", "attention_mask"],
    output_names=["output"],
    dynamic_axes={
        "input_ids": {0: "batch_size", 1: "sequence_length"},
        "attention_mask": {0: "batch_size", 1: "sequence_length"},
        # the hidden-state output also varies with the sequence length
        "output": {0: "batch_size", 1: "sequence_length"},
    },
)

In [None]:
import onnxruntime as ort
import numpy as np

ort_session = ort.InferenceSession("model.onnx")

sample_input = tokenizer(
    "This is a sample input text for ONNX inference.",
    padding=True,
    truncation=True,
    return_tensors="np",
)


inputs_onnx = {
    "input_ids": sample_input["input_ids"],
    "attention_mask": sample_input["attention_mask"],
}

outputs_onnx = ort_session.run(None, inputs_onnx)

### Online mode

In [None]:
options = ort.SessionOptions()
options.graph_optimization_level = ort.GraphOptimizationLevel.ORT_ENABLE_ALL

start = time.perf_counter()
session_online = ort.InferenceSession(
    "model.onnx", sess_options=options, providers=["CPUExecutionProvider"]
)
_ = session_online.run(None, inputs_onnx) 
end = time.perf_counter()

cold_online = end - start
print(f"cold start (online): {cold_online:.6f}s")

cold start (online): 0.472560s


### Inference time online

In [51]:
t_online = measure_time(lambda: session_online.run(None, inputs_onnx), n_runs=100)
print(f"Inference time (online): {t_online:.6f}s")

Inference time (online): 0.013596s


### Offline mode

In [None]:
import onnxruntime as ort

sess_options = ort.SessionOptions()

sess_options.graph_optimization_level = ort.GraphOptimizationLevel.ORT_ENABLE_EXTENDED

sess_options.optimized_model_filepath = "model_optimized.onnx"

start = time.perf_counter()
session_build = ort.InferenceSession("model.onnx", sess_options)
_ = session_build.run(None, inputs_onnx)  
end = time.perf_counter()

cold_offline = end - start
print(f"cold start (offline build): {cold_offline:.6f}s")

cold start (offline build): 0.577415s


In [None]:
sess_options = ort.SessionOptions()
sess_options.graph_optimization_level = ort.GraphOptimizationLevel.ORT_DISABLE_ALL

ort_session_optimized = ort.InferenceSession(
    "model_optimized.onnx", 
    sess_options=sess_options, 
    providers=['CPUExecutionProvider']
)


t_offline = measure_time(
    lambda: ort_session_optimized.run(None, inputs_onnx), 
    n_runs=100
)
print(f"inference time (offline, optimized model): {t_offline:.6f}s")

print(f"speedup offline vs online: {t_online / t_offline:.2f}x")

inference time (offline, optimized model): 0.026927s
speedup offline vs online: 0.50x
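
The offline timing above includes writing the optimized graph to disk. The cold start that a deployment would see in offline mode is the time to load the already optimized file with graph optimizations disabled. Below is a hedged sketch of how that could be timed, assuming `model_optimized.onnx` has been saved beforehand and `feeds` is the same input dictionary as `inputs_onnx`. The helper name `measure_cold_start` is hypothetical.

```python
import time


def measure_cold_start(model_path, feeds, optimization_level=None):
    """Time session creation plus the first inference call, in seconds."""
    # Imported lazily so the helper can be defined without ONNX Runtime installed.
    import onnxruntime as ort

    options = ort.SessionOptions()
    if optimization_level is not None:
        options.graph_optimization_level = optimization_level

    start = time.perf_counter()
    session = ort.InferenceSession(
        model_path, sess_options=options, providers=["CPUExecutionProvider"]
    )
    session.run(None, feeds)  # the first call is part of the cold start
    return time.perf_counter() - start
```

It could be called as `measure_cold_start("model.onnx", inputs_onnx, ort.GraphOptimizationLevel.ORT_ENABLE_ALL)` for the online mode and as `measure_cold_start("model_optimized.onnx", inputs_onnx, ort.GraphOptimizationLevel.ORT_DISABLE_ALL)` for the offline mode, so that both numbers cover the same steps.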


4. Compare for those apps:
   - Docker container sizes

Docker image sizes:

- PyTorch image (`lab7-torch:latest`): 7.87 GB
- ONNX image (`lab7-onnx:latest`): 726 MB

In [None]:
import time
import statistics as stats
import requests

payload = {"text": "Dr Tomas Streyer looked around the control room at his team of scientists and engineers."}
N_RUNS = 100

def measure_endpoint(url: str, n_runs: int = N_RUNS):
    print(f"Measuring {n_runs} requests for {url}")

    # warm-up 
    for _ in range(5):
        requests.post(url, json=payload)

    times = []
    for _ in range(n_runs):
        t0 = time.perf_counter()
        r = requests.post(url, json=payload)
        r.raise_for_status()
        t1 = time.perf_counter()
        times.append(t1 - t0)

    avg = stats.mean(times)
    print(f"avg: {avg*1000:.2f} ms")
    print(f"min: {min(times)*1000:.2f} ms")
    print(f"max: {max(times)*1000:.2f} ms")
    print()
    return avg

torch_avg = measure_endpoint("http://localhost:8000/predict")
onnx_avg  = measure_endpoint("http://localhost:8001/predict")
print(f"Speedup ONNX vs PyTorch: {torch_avg / onnx_avg:.2f}x")


Measuring 100 requests for http://localhost:8000/predict
avg: 20.91 ms
min: 17.92 ms
max: 27.04 ms

Measuring 100 requests for http://localhost:8001/predict
avg: 15.08 ms
min: 12.67 ms
max: 18.58 ms

Speedup ONNX vs PyTorch: 1.39x
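
The mean, minimum and maximum can hide tail latency. If the per-request timings are kept in a list, percentiles can be added with a few lines. Below is a minimal nearest-rank sketch on made-up sample data; the `percentile` helper and the `sample_ms` values are illustrative and are not measurements from the apps above.

```python
def percentile(samples, p: int):
    """Nearest-rank percentile: smallest sample with at least p% of the data at or below it."""
    if not samples:
        raise ValueError("samples must not be empty")
    ordered = sorted(samples)
    rank = max(1, (p * len(ordered) + 99) // 100)  # integer ceiling of p/100 * n
    return ordered[rank - 1]


# Made-up latencies in milliseconds: 1, 2, ..., 100
sample_ms = list(range(1, 101))

p50 = percentile(sample_ms, 50)
p95 = percentile(sample_ms, 95)
print(f"p50: {p50} ms, p95: {p95} ms")
```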

