# Lab 7 - Model Optimization for Inference

In this lab, we will focus on optimizing neural network models for faster inference.

### Exercise 1 (3 points)

1. Load the `sentence-transformers/multi-qa-mpnet-base-cos-v1` model and
   tokenizer. Use the `AutoModel` and `AutoTokenizer` classes from the
   `transformers` library.
2. Create a sample input text and tokenize it (padding, truncation,
   `return_tensors="pt"`).
3. Measure the inference time of the model in various inference modes (average
   time over 100 runs):
   - no optimizations (simple PyTorch)
   - `model.eval()`
   - `model.eval()` and `no_grad()`
   - `model.eval()` and `inference_mode()`
4. Compare the speedup of options 2, 3, and 4 over plain PyTorch (option 1). To
   calculate the speedup, divide the plain PyTorch time by the time of the given
   option.

In general, the time should decrease for subsequent options. If
`inference_mode()` is slower than `no_grad()`, it may be due to some unsupported
operations in the model, so `no_grad()` is preferred in such cases. But when a
model contains many operations and the autograd overhead is significant,
`inference_mode()` should be faster.
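The measurement and speedup rule described above can be sketched with the standard library alone. This is a minimal illustration: `avg_ms` and `speedup` are hypothetical helper names, and the workload is a placeholder standing in for a model call.

```python
from timeit import timeit


def avg_ms(fn, n_runs=100):
    """Average wall-clock time of one call to fn, in milliseconds."""
    total_s = timeit(fn, number=n_runs)  # total seconds for all runs
    return total_s / n_runs * 1000


def speedup(baseline_ms, current_ms):
    """How many times faster the current variant is than the baseline."""
    return baseline_ms / current_ms


# Placeholder workload standing in for `model(**inputs)`
t_demo = avg_ms(lambda: sum(range(1000)), n_runs=100)
print(f"{t_demo:.3f} ms/run")

# Hypothetical timings: a 200 ms baseline vs. a 100 ms optimized variant
print(f"Speedup: {speedup(200.0, 100.0):.2f}x")  # prints "Speedup: 2.00x"
```

For 100 runs, `total_s / 100 * 1000` simplifies to `total_s * 10`, which is the conversion used in the cells below.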

In [2]:
import torch
from transformers import AutoModel, AutoTokenizer

model = AutoModel.from_pretrained("sentence-transformers/multi-qa-mpnet-base-cos-v1")
tokenizer = AutoTokenizer.from_pretrained("sentence-transformers/multi-qa-mpnet-base-cos-v1")

prompt = "This is a sample prompt for inference time measurements."
inputs = tokenizer(prompt, truncation=True, padding="max_length", return_tensors="pt")

- no optimizations (simple PyTorch)

In [71]:
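from timeit import timeit

# from_pretrained() returns the model in eval mode, so switch back to training
# mode to get the unoptimized baseline (dropout active, autograd enabled)
model.train()
t1 = timeit(lambda: model(**inputs), number=100)
# timeit() returns total seconds for 100 runs: total / 100 * 1000 = total * 10 ms/run
print(f"{t1 * 10:.3f} ms/run | Speedup: {t1 / t1:.2f}x")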

280.540 ms/run | Speedup: 1.00x


- `model.eval()`

In [72]:
model.eval()
t2 = timeit(lambda: model(**inputs), number=100)
print(f"{t2 * 10:.3f} ms/run | Speedup: {t1 / t2:.2f}x")

231.350 ms/run | Speedup: 1.21x


- `model.eval()` and `no_grad()`

In [None]:
def predict_eval_nograd(inputs):
    with torch.no_grad():
        return model(**inputs)


t3 = timeit(lambda: predict_eval_nograd(inputs), number=100)
print(f"{t3 * 10:.3f} ms/run | Speedup: {t1 / t3:.2f}x")

210.753 ms/run | Speedup: 1.33x


- `model.eval()` and `inference_mode()`

In [None]:
def predict_eval_inference(inputs):
    with torch.inference_mode():
        return model(**inputs)


t4 = timeit(lambda: predict_eval_inference(inputs), number=100)
print(f"{t4 * 10:.3f} ms/run | Speedup: {t1 / t4:.2f}x")

213.607 ms/run | Speedup: 1.31x


### Exercise 2 (2 points)

In this exercise, we will verify the gains from model compilation with
`torch.compile()`.

1. Compile the model using `torch.compile()` after switching it to evaluation
   mode, and warm up the model by running a single inference call. Measure this
   compilation + warm-up time (just once).
2. Measure the inference time (average of 100 runs) of the compiled model in
   inference mode.
3. Calculate the speedup, and compare results with those from the previous
   exercise.
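When comparing the results, it may help to ask how many inference calls are needed before the one-time compilation cost pays off. The sketch below is plain arithmetic. `break_even_runs` is a hypothetical helper, and the numbers in the demo are made up for illustration, not measurements.

```python
import math


def break_even_runs(compile_s, baseline_ms, compiled_ms):
    """Number of calls after which compilation has paid for itself.

    Returns None when the compiled model is not faster than the baseline,
    because the compilation cost is then never recovered.
    """
    saved_ms = baseline_ms - compiled_ms  # time saved per call
    if saved_ms <= 0:
        return None
    return math.ceil(compile_s * 1000 / saved_ms)


# Hypothetical: 8 s compilation, 200 ms/run before, 150 ms/run after
print(break_even_runs(8.0, 200.0, 150.0))  # prints 160

# Hypothetical: the compiled model is slower, so there is no break-even point
print(break_even_runs(8.0, 200.0, 210.0))  # prints None
```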

In [None]:
import time

t = time.perf_counter()

model.eval()
model_compiled = torch.compile(model)
_ = model_compiled(**inputs)  # warm-up

t = time.perf_counter() - t
print(f"Compilation time: {t:.2f} s")

Compilation time: 7.65 s


In [None]:
def predict_compiled(inputs):
    with torch.inference_mode():
        return model_compiled(**inputs)


t5 = timeit(lambda: predict_compiled(inputs), number=100)
print(f"{t5 * 10:.3f} ms/run | Speedup: {t1 / t5:.2f}x")

227.483 ms/run | Speedup: 1.16x


### Exercise 3 (3 points)

We will perform dynamic quantization of our model, which is very simple to
apply in PyTorch. It provides the `torch.ao.quantization.quantize_dynamic()`
function, to which we pass the model and a set of layer types that we want to
quantize. In the case of transformers, those are primarily the linear layers,
which contain the majority of weights and perform most computations.
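To build intuition for what quantizing weights to 8 bits means, here is a simplified pure-Python sketch of per-tensor affine quantization. It illustrates the idea only and is not PyTorch's actual implementation. The function names and the sample weights are hypothetical.

```python
def quantize_affine(values, n_bits=8):
    """Map floats to signed integers using one scale and zero point per tensor."""
    qmin, qmax = -(2 ** (n_bits - 1)), 2 ** (n_bits - 1) - 1  # -128..127 for int8
    # The represented range must include 0.0 so that zero maps to an integer
    lo, hi = min(min(values), 0.0), max(max(values), 0.0)
    scale = (hi - lo) / (qmax - qmin) or 1.0  # avoid a zero scale for all-zero input
    zero_point = round(qmin - lo / scale)
    quantized = [
        max(qmin, min(qmax, round(v / scale) + zero_point)) for v in values
    ]
    return quantized, scale, zero_point


def dequantize(quantized, scale, zero_point):
    """Recover approximate float values from the integer representation."""
    return [(q - zero_point) * scale for q in quantized]


weights = [-0.62, -0.1, 0.0, 0.33, 0.91]  # made-up example weights
q, scale, zp = quantize_affine(weights)
restored = dequantize(q, scale, zp)

print("int8 values:", q)
print("max abs error:", max(abs(a - b) for a, b in zip(weights, restored)))
print("half a quantization step:", scale / 2)
```

Each weight is stored in 1 byte instead of 4, at the cost of a rounding error of at most about half a quantization step. In dynamic quantization, the weights are quantized ahead of time, while activations are quantized on the fly at inference time.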

1. Ensure the model is on the CPU.
2. Quantize the model with `torch.ao.quantization.quantize_dynamic()`, setting
   the target weight dtype to `torch.qint8` and the layers to a single-element
   set with `nn.Linear`.
3. Save the model to a new variable (e.g. `model_quantized`), and print it to
   verify that linear layers have been quantized properly (i.e.
   `DynamicQuantizedLinear` instead of `Linear`).
4. Save both models to disk (`state_dict` for both) and compare the file sizes
   (e.g. `os.path.getsize()`).
5. Compare the inference speed and speedup on CPU for original and quantized
   models (again, average of 100 runs).
6. Display the comparison. Do you think that quantization is helpful in this
   case?

Typically, we would observe a reduction in model size of up to 4x and a speedup
of 1.5-2x, depending on the model type and which parameters exactly are
quantized.
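The 4x figure is an upper bound, because only the quantized layers shrink from 4 bytes to 1 byte per parameter. Embeddings and other non-quantized parameters stay in `float32`. A rough estimate can be sketched as follows. The helper name and the parameter counts are hypothetical round numbers chosen for illustration. The estimate ignores scales, zero points, and file format overhead.

```python
def estimated_size_mb(n_quantized, n_other):
    """Return (float32_mb, int8_mb) for a model with the given parameter counts."""
    mb = 1024 * 1024
    fp32_mb = (n_quantized + n_other) * 4 / mb     # everything stored in 4 bytes
    int8_mb = (n_quantized * 1 + n_other * 4) / mb  # quantized part in 1 byte
    return fp32_mb, int8_mb


# Hypothetical counts: 85M parameters in linear layers, 24M elsewhere
fp32_mb, int8_mb = estimated_size_mb(n_quantized=85_000_000, n_other=24_000_000)
print(f"float32: {fp32_mb:.1f} MB, quantized: {int8_mb:.1f} MB")
print(f"Reduction: {fp32_mb / int8_mb:.2f}x")
```

The larger the share of parameters outside the quantized layers, the further the reduction falls below 4x.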

In [None]:
model.device

device(type='cpu')

In [None]:
model_quantized = torch.ao.quantization.quantize_dynamic(model, {torch.nn.Linear}, dtype=torch.qint8)
model_quantized

MPNetModel(
  (embeddings): MPNetEmbeddings(
    (word_embeddings): Embedding(30527, 768, padding_idx=1)
    (position_embeddings): Embedding(514, 768, padding_idx=1)
    (LayerNorm): LayerNorm((768,), eps=1e-05, elementwise_affine=True)
    (dropout): Dropout(p=0.1, inplace=False)
  )
  (encoder): MPNetEncoder(
    (layer): ModuleList(
      (0-11): 12 x MPNetLayer(
        (attention): MPNetAttention(
          (attn): MPNetSelfAttention(
            (q): DynamicQuantizedLinear(in_features=768, out_features=768, dtype=torch.qint8, qscheme=torch.per_tensor_affine)
            (k): DynamicQuantizedLinear(in_features=768, out_features=768, dtype=torch.qint8, qscheme=torch.per_tensor_affine)
            (v): DynamicQuantizedLinear(in_features=768, out_features=768, dtype=torch.qint8, qscheme=torch.per_tensor_affine)
            (o): DynamicQuantizedLinear(in_features=768, out_features=768, dtype=torch.qint8, qscheme=torch.per_tensor_affine)
            (dropout): Dropout(p=0.1, inplace=F

In [None]:
import os

os.makedirs("models", exist_ok=True)
torch.save(model.state_dict(), "models/model.wts")
torch.save(model_quantized.state_dict(), "models/model_quantized.wts")

print(f"Vanilla   model size: {os.path.getsize('models/model.wts') / 1024 / 1024:.2f} MB")
print(f"Quantized model size: {os.path.getsize('models/model_quantized.wts') / 1024 / 1024:.2f} MB")

Vanilla   model size: 417.72 MB
Quantized model size: 173.10 MB


In [None]:
def predict_vanilla(inputs):
    with torch.inference_mode():
        return model(**inputs)


t_vanilla = timeit(lambda: predict_vanilla(inputs), number=100)
print(f"{t_vanilla * 10:.3f} ms/run | Speedup: {t_vanilla / t_vanilla:.2f}x")

267.568 ms/run | Speedup: 1.00x


In [None]:
def predict_quantized(inputs):
    with torch.inference_mode():
        return model_quantized(**inputs)


t_quantized = timeit(lambda: predict_quantized(inputs), number=100)
print(f"{t_quantized * 10:.3f} ms/run | Speedup: {t_vanilla / t_quantized:.2f}x")

208.267 ms/run | Speedup: 1.16x


### Exercise 4 (2 points)

1. Compare inference time of:
   - `torch.compile()` with default settings
   - `torch.compile()` with `mode="max-autotune"`
   - `torch.compile()` with `mode="max-autotune-no-cudagraphs"`
2. Report the average time of 100 runs and speedup of the latter two modes.
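Two details can distort these measurements. CUDA kernels run asynchronously, and `torch.compile()` compiles lazily on the first call. A possible timing helper, sketched below, runs a few untimed warm-up calls and takes an optional synchronization callback. `timed_ms` and its parameters are hypothetical names, and the demo uses a CPU placeholder workload so that it runs anywhere.

```python
import time


def timed_ms(fn, n_runs=100, warmup=3, sync=lambda: None):
    """Average time of fn in ms, excluding warm-up and including pending device work."""
    for _ in range(warmup):
        fn()  # untimed: absorbs lazy compilation and autotuning
    sync()  # wait for warm-up work to finish before starting the clock
    start = time.perf_counter()
    for _ in range(n_runs):
        fn()
    sync()  # wait for all queued work before stopping the clock
    return (time.perf_counter() - start) / n_runs * 1000


# Placeholder CPU workload; no synchronization is needed here
print(f"{timed_ms(lambda: sum(range(10_000)), n_runs=20):.3f} ms/run")
```

On a GPU, one would presumably pass `sync=torch.cuda.synchronize`, for example `timed_ms(lambda: predict_gpu(inputs_gpu), sync=torch.cuda.synchronize)` once such a prediction function is defined.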


In [3]:
device = "cuda"
model = model.to(device)
inputs_gpu = {k: v.to(device) for k, v in inputs.items()}

In [4]:
model.eval()
compiled_model_with_cudagraphs = torch.compile(model, mode="max-autotune")
compiled_model_dynamic = torch.compile(model, mode="max-autotune-no-cudagraphs")

In [None]:
def predict_gpu(inputs):
    with torch.inference_mode():
        return model(**inputs)


t_default = timeit(lambda: predict_gpu(inputs_gpu), number=100)
print(f"{t_default * 10:.3f} ms/run | Speedup: {t_default / t_default:.2f}x")

47.856 ms/run | Speedup: 1.00x


In [None]:
def predict_gpu_cudagraph(inputs):
    with torch.inference_mode():
        return compiled_model_with_cudagraphs(**inputs)


t_with_cudagraphs = timeit(lambda: predict_gpu_cudagraph(inputs_gpu), number=100)
print(f"{t_with_cudagraphs * 10:.3f} ms/run | Speedup: {t_default / t_with_cudagraphs:.2f}x")

25.624 ms/run | Speedup: 1.87x


In [None]:
def predict_gpu_dynamic(inputs):
    with torch.inference_mode():
        return compiled_model_dynamic(**inputs)


t_dynamic = timeit(lambda: predict_gpu_dynamic(inputs_gpu), number=100)
print(f"{t_dynamic * 10:.3f} ms/run | Speedup: {t_default / t_dynamic:.2f}x")

26.947 ms/run | Speedup: 1.78x


### Exercise 5 (2 points)

1. Check if your GPU supports Tensor Cores (compute capability >= (7, 0)). If
   not, switch to Google Colab with a GPU runtime.
2. Measure inference time with:
   - full precision (`float32`)
   - manual half-precision (`float16`)
   - automatic mixed precision (`torch.autocast`)
3. Compare time and speedup. Which variant would you use in practice?
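When choosing between the variants, it helps to remember what `float16` gives up: precision and range. The sketch below uses the standard library's half-precision `struct` format to round-trip a few values. The helper names are hypothetical, and the example only illustrates the number format, not PyTorch behavior.

```python
import struct


def to_fp16(x):
    """Round a Python float to the nearest IEEE 754 half-precision value."""
    return struct.unpack("<e", struct.pack("<e", x))[0]


def fits_fp16(x):
    """Return True if x is within the finite range of half precision."""
    try:
        struct.pack("<e", x)
        return True
    except OverflowError:
        return False


for value in (0.1, 1000.1, 65504.0):
    print(f"{value} -> {to_fp16(value)}")

print("70000.0 fits in float16:", fits_fp16(70000.0))  # prints False
```

Manual half-precision converts every weight and operation to this format, while `torch.autocast` keeps some numerically sensitive operations in `float32`.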

In [10]:
import torch

capability = torch.cuda.get_device_capability()
print(f"CUDA device capability: {capability}")

# Tensor Cores are available on NVIDIA GPUs with compute capability >= 7.0 (e.g. Volta, Turing, Ampere, Hopper)
if capability >= (7, 0):
    print("Tensor Cores available: fast float16 supported.")
else:
    print("Tensor Cores not available: float16 may be slow or unsupported.")

CUDA device capability: (7, 5)
Tensor Cores available: fast float16 supported.


In [None]:
def predict_fp32(inputs):
    with torch.inference_mode():
        return model(**inputs)


t_fp32 = timeit(lambda: predict_fp32(inputs_gpu), number=100)
print(f"{t_fp32 * 10:.3f} ms/run | Speedup: {t_fp32 / t_fp32:.2f}x")

33.646 ms/run | Speedup: 1.00x


In [None]:
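import copy

# .half() converts the module in place, so cast a copy to keep `model` in float32
model_half = copy.deepcopy(model).half().to("cuda")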


def predict_fp16(inputs):
    with torch.inference_mode():
        return model_half(**inputs)


t_fp16 = timeit(lambda: predict_fp16(inputs_gpu), number=100)
print(f"{t_fp16 * 10:.3f} ms/run | Speedup: {t_fp32 / t_fp16:.2f}x")

17.267 ms/run | Speedup: 1.95x


In [None]:
def predict_autocast(inputs):
    with torch.inference_mode(), torch.autocast(device, dtype=torch.float16):
        return model(**inputs)


t_auto = timeit(lambda: predict_autocast(inputs_gpu), number=100)
print(f"{t_auto * 10:.3f} ms/run | Speedup: {t_fp32 / t_auto:.2f}x")

14.402 ms/run | Speedup: 2.34x


### Exercise 6 (3 points)

1. Measure cold start time (including session creation) of the ONNX model using online and offline optimization modes
   on CPU.
2. Measure inference time of the ONNX model on CPU using both optimization modes.
3. Prepare deployment Docker images:
   - build two images, for a) compiled PyTorch model b) ONNX model with ONNX Runtime
   - select the best model in both cases in terms of the inference time
   - install a minimal set of requirements in both cases, e.g. do not install PyTorch for ONNX image
4. Compare for those apps:
   - Docker container sizes
   - response time (average of 100 requests)
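Cold start and steady-state inference time can be measured with one small helper that takes the session factory as a callable. This is a sketch under assumptions: `measure_cold_and_warm` is a hypothetical name, and the demo uses dummy stand-ins instead of a real ONNX Runtime session.

```python
import time


def measure_cold_and_warm(create_session, run_once, n_runs=100):
    """Return (cold_start_s, warm_ms_per_run).

    Cold start covers session creation plus the first inference call.
    The warm time is the average over n_runs subsequent calls.
    """
    start = time.perf_counter()
    session = create_session()
    run_once(session)
    cold_s = time.perf_counter() - start

    start = time.perf_counter()
    for _ in range(n_runs):
        run_once(session)
    warm_ms = (time.perf_counter() - start) / n_runs * 1000
    return cold_s, warm_ms


# Dummy stand-ins: a "session" that takes 10 ms to create and a trivial call
def fake_create_session():
    time.sleep(0.01)
    return {"ready": True}


cold_s, warm_ms = measure_cold_and_warm(
    fake_create_session, lambda s: s["ready"], n_runs=100
)
print(f"Cold start: {cold_s:.3f} s | Warm: {warm_ms:.4f} ms/run")
```

With ONNX Runtime, `create_session` would presumably wrap `onnxruntime.InferenceSession(...)` with `providers=["CPUExecutionProvider"]`. For the online mode, `SessionOptions.graph_optimization_level` would be set to an enabled level. For the offline mode, the session would load a model previously saved via `SessionOptions.optimized_model_filepath`, with optimizations disabled at load time.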
