### Exercise 1 (3 points)

1. Load the `sentence-transformers/multi-qa-mpnet-base-cos-v1` model and tokenizer. Use the `AutoModel` and
   `AutoTokenizer` classes from the `transformers` library.

In [1]:
import time
import torch
from transformers import AutoTokenizer, AutoModel




In [2]:
model_name = "sentence-transformers/multi-qa-mpnet-base-cos-v1"

tokenizer = AutoTokenizer.from_pretrained(model_name)

model = AutoModel.from_pretrained(model_name)

device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
model.to(device)

MPNetModel(
  (embeddings): MPNetEmbeddings(
    (word_embeddings): Embedding(30527, 768, padding_idx=1)
    (position_embeddings): Embedding(514, 768, padding_idx=1)
    (LayerNorm): LayerNorm((768,), eps=1e-05, elementwise_affine=True)
    (dropout): Dropout(p=0.1, inplace=False)
  )
  (encoder): MPNetEncoder(
    (layer): ModuleList(
      (0-11): 12 x MPNetLayer(
        (attention): MPNetAttention(
          (attn): MPNetSelfAttention(
            (q): Linear(in_features=768, out_features=768, bias=True)
            (k): Linear(in_features=768, out_features=768, bias=True)
            (v): Linear(in_features=768, out_features=768, bias=True)
            (o): Linear(in_features=768, out_features=768, bias=True)
            (dropout): Dropout(p=0.1, inplace=False)
          )
          (LayerNorm): LayerNorm((768,), eps=1e-05, elementwise_affine=True)
          (dropout): Dropout(p=0.1, inplace=False)
        )
        (intermediate): MPNetIntermediate(
          (dense): Linear(in_

2. Create a sample input text and tokenize it (padding, truncation, `return_tensors="pt"`).

In [3]:
text = "David and Emma looked at each other across the table. The young couple were happy: the food was delicious, the light from the candles was soft and the music was perfect. "

inputs = tokenizer(
    text,
    padding=True,
    truncation=True,
    return_tensors="pt",
)

3. Measure the inference time of the model in various inference modes (average time over 100 runs):
   - no optimizations (simple PyTorch)
   - `model.eval()`
   - `model.eval()` and `no_grad()`
   - `model.eval()` and `inference_mode()`

In [4]:
inputs = {k: v.to(device) for k, v in inputs.items()}


def measure_time(run_fn, n_runs: int = 100):
    run_fn()  # warm-up call, not timed

    if device.type == "cuda":
        torch.cuda.synchronize()

    start = time.perf_counter()

    for _ in range(n_runs):
        run_fn()

    if device.type == "cuda":
        torch.cuda.synchronize()

    end = time.perf_counter()

    return (end - start) / n_runs
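
A single average can hide run-to-run variance, which matters when the differences between modes are only fractions of a millisecond. The following is a minimal sketch of a variant that times each call separately and summarizes the samples. The name `measure_time_stats` and the `n_warmup` and `sync_fn` parameters are hypothetical and not part of the exercise; it uses only the standard library, so it runs without a model.

```python
import statistics
import time
from typing import Callable, Optional


def measure_time_stats(
    run_fn: Callable[[], object],
    n_runs: int = 100,
    n_warmup: int = 3,
    sync_fn: Optional[Callable[[], None]] = None,
) -> dict:
    """Time run_fn call by call and summarize the samples (all values in seconds)."""
    for _ in range(n_warmup):  # warm-up calls are not timed
        run_fn()

    samples = []
    for _ in range(n_runs):
        if sync_fn is not None:
            sync_fn()  # drain pending asynchronous work before starting the clock
        start = time.perf_counter()
        run_fn()
        if sync_fn is not None:
            sync_fn()  # wait until this call has actually finished
        samples.append(time.perf_counter() - start)

    return {
        "mean": statistics.mean(samples),
        "median": statistics.median(samples),
        "stdev": statistics.stdev(samples) if n_runs > 1 else 0.0,
        "min": min(samples),
        "max": max(samples),
    }


# Stand-in workload (an assumption for illustration); replace with a forward pass.
stats = measure_time_stats(lambda: sum(range(10_000)), n_runs=50)
print({name: f"{value:.6f} s" for name, value in stats.items()})
```

On a GPU, `sync_fn=torch.cuda.synchronize` could be passed so that each sample covers the full kernel execution. Synchronizing around every call adds some overhead compared with synchronizing once around the whole loop, so the two approaches are not directly comparable.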


In [41]:
model.train()


def forward_pass():
    _ = model(**inputs)

time_train_mode = measure_time(forward_pass)
print(f"Forward pass (train mode): {time_train_mode:.6f} s")

Forward pass (train mode): 0.005723 s


In [42]:
model.eval()

time_eval_mode = measure_time(forward_pass)
print(f"Forward pass (eval mode): {time_eval_mode:.6f} s")

Forward pass (eval mode): 0.004994 s


In [43]:
def forward_pass_no_grad():
    with torch.no_grad():
        _ = model(**inputs)

time_eval_mode_no_grad = measure_time(forward_pass_no_grad)
print(f"Forward pass (model.eval() + no_grad()): {time_eval_mode_no_grad:.6f} s")


Forward pass (model.eval() + no_grad()): 0.003649 s


In [56]:
def forward_pass_inference_mode():
    with torch.inference_mode():
        _ = model(**inputs)

time_inf_eval_mode = measure_time(forward_pass_inference_mode)
print(f"Forward pass (model.eval() + inference_mode()): {time_inf_eval_mode:.6f} s")



Forward pass (model.eval() + inference_mode()): 0.003471 s
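
To see what each of the four modes changes, independent of timing, the sketch below inspects the outputs of a tiny stand-in network. `toy_model` is a hypothetical two-layer model used only for illustration, not the MPNet model from the exercise; the point is which flags the output tensors carry in each mode.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)

# Tiny stand-in network (hypothetical; not the MPNet model from the exercise).
toy_model = nn.Sequential(nn.Linear(8, 8), nn.Dropout(p=0.5), nn.Linear(8, 4))
x = torch.randn(2, 8)

# Train mode: dropout is active and autograd records the computation graph.
toy_model.train()
out_train = toy_model(x)

# Eval mode: dropout is disabled, so repeated calls give identical results,
# but autograd still tracks the computation.
toy_model.eval()
out_eval_a = toy_model(x)
out_eval_b = toy_model(x)

# Eval + no_grad: no graph is recorded, so the output does not require grad.
with torch.no_grad():
    out_no_grad = toy_model(x)

# Eval + inference_mode: the output is additionally marked as an inference tensor.
with torch.inference_mode():
    out_inference = toy_model(x)

print("train          requires_grad:", out_train.requires_grad)
print("eval           requires_grad:", out_eval_a.requires_grad)
print("eval           deterministic:", torch.equal(out_eval_a, out_eval_b))
print("no_grad        requires_grad:", out_no_grad.requires_grad)
print("inference_mode requires_grad:", out_inference.requires_grad)
print("inference_mode is_inference: ", out_inference.is_inference())
```

`model.eval()` only changes layer behavior (for example, it disables dropout), while `no_grad()` and `inference_mode()` remove autograd bookkeeping; this is consistent with the larger gains measured above for the last two modes.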


4. Compare the speedup of options 2, 3, and 4 over plain PyTorch. To calculate the speedup, divide the
   plain PyTorch time by the time of the given option.

In [57]:
def speedup(base, other):
    return base / other

print("Speedup relative to train mode:")
print(f"Forward pass: eval: {speedup(time_train_mode, time_eval_mode):.2f}x")
print(f"Forward pass: eval + no_grad: {speedup(time_train_mode, time_eval_mode_no_grad):.2f}x")
print(f"Forward pass: eval + inference_mode: {speedup(time_train_mode, time_inf_eval_mode):.2f}x")

Speedup relative to train mode:
Forward pass: eval: 1.15x
Forward pass: eval + no_grad: 1.57x
Forward pass: eval + inference_mode: 1.65x
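
The speedups can be cross-checked by hand from the timings printed in the cells above. The sketch below repeats that arithmetic and also converts each result into the fraction of time saved per call, `1 - other / base`; that second metric and the `time_saved` helper are additions for illustration, not part of the exercise.

```python
# Per-call timings in seconds, copied from the printed outputs of the cells above.
timings = {
    "train": 0.005723,
    "eval": 0.004994,
    "eval + no_grad": 0.003649,
    "eval + inference_mode": 0.003471,
}


def speedup(base: float, other: float) -> float:
    """How many times faster `other` is than `base`."""
    return base / other


def time_saved(base: float, other: float) -> float:
    """Fraction of the baseline time that is no longer spent (hypothetical helper)."""
    return 1.0 - other / base


baseline = timings["train"]
for name, seconds in timings.items():
    if name == "train":
        continue  # the baseline is not compared with itself
    print(
        f"{name}: {speedup(baseline, seconds):.2f}x, "
        f"{time_saved(baseline, seconds):.1%} less time per call"
    )
```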


### Exercise 2 (2 points)

In this exercise, we will verify the gains from model compilation with `torch.compile()`.

1. Compile the model using `torch.compile()` after switching it to evaluation mode, and warm up the model
   by running a single inference call. Measure this compilation + warm-up time (just once).

In [53]:
start = time.perf_counter()

model.eval()
compiled_model = torch.compile(model)

def forward_pass_inference_mode_model_compile():
    with torch.inference_mode():
        _ = compiled_model(**inputs)

forward_pass_inference_mode_model_compile()

if device.type == "cuda":
    torch.cuda.synchronize()

end = time.perf_counter()
print(f"Compilation + warm-up time: {(end - start):.6f} s")

Compilation + warm-up time: 0.019686 s


2. Measure the inference time (average of 100 runs) of the compiled model in inference mode.

In [58]:
time_inf_eval_mode_compile = measure_time(forward_pass_inference_mode_model_compile)
print(f"Forward pass (eval + inference_mode + compile): {time_inf_eval_mode_compile:.6f} s")

Forward pass (eval + inference_mode + compile): 0.003350 s


In [59]:
print(f"Forward pass: eval + inference_mode + compile: {speedup(time_train_mode, time_inf_eval_mode_compile):.2f}x")

Forward pass: eval + inference_mode + compile: 1.71x


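The compiled model is only marginally faster: 0.003350 s per forward pass versus 0.003471 s for the uncompiled model in inference mode (about 1.04x, or 1.71x versus 1.65x relative to train mode).
In many runs the two times are almost the same.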

### Exercise 3 (3 points)

We will apply dynamic quantization to our model, which is very simple to do in PyTorch.
PyTorch provides the `torch.ao.quantization.quantize_dynamic()` function, to which we pass the model and a
set of layer types that we want to quantize. For transformers, these are primarily the linear
layers, which contain the majority of the weights and perform most of the computation.

1. Ensure the model is on the CPU.
2. Quantize the model with `torch.ao.quantization.quantize_dynamic()`, setting the target weight dtype to `torch.qint8` and
   the layers to quantize to a single-element set containing `nn.Linear`.
3. Save the model to a new variable (e.g. `model_quantized`), and print it to verify that linear layers have been
   quantized properly (i.e. `DynamicQuantizedLinear` instead of `Linear`).
4. Save both models to disk (`state_dict` for both) and compare the file sizes (e.g. `os.path.getsize()`).
5. Compare the inference speed and speedup on CPU for original and quantized models (again, average of 100 runs).
6. Display the comparison. Do you think that quantization is helpful in this case?
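
To build intuition for what int8 quantization does to a weight tensor, here is a minimal pure-Python sketch of per-tensor affine quantization. It is an illustration under simplifying assumptions, not PyTorch's implementation: the helper names `quantize_affine` and `dequantize_affine` are hypothetical, and the real kernels differ in detail (for example, dynamic quantization also quantizes activations on the fly at inference time).

```python
def quantize_affine(values, qmin=-128, qmax=127):
    """Map floats to int8 codes using one scale and zero point for the whole tensor."""
    lo = min(min(values), 0.0)  # the representable range must include zero
    hi = max(max(values), 0.0)
    scale = (hi - lo) / (qmax - qmin) or 1.0  # fall back to 1.0 for an all-zero tensor
    zero_point = round(qmin - lo / scale)
    codes = [max(qmin, min(qmax, round(v / scale) + zero_point)) for v in values]
    return codes, scale, zero_point


def dequantize_affine(codes, scale, zero_point):
    """Recover approximate floats from int8 codes."""
    return [(c - zero_point) * scale for c in codes]


weights = [-0.5, -0.1, 0.0, 0.25, 0.5]  # made-up example weights
codes, scale, zero_point = quantize_affine(weights)
restored = dequantize_affine(codes, scale, zero_point)
max_error = max(abs(w - r) for w, r in zip(weights, restored))

print("int8 codes:", codes)
print("scale:", scale, "zero point:", zero_point)
print("max reconstruction error:", max_error)
print("storage per weight: 1 byte (int8) vs 4 bytes (fp32)")
```

Each weight is stored in one byte instead of four, at the cost of a reconstruction error on the order of one quantization step (`scale`). Layers that are not quantized, such as the embeddings, keep their fp32 size, so the saved file shrinks by less than 4x overall.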

In [6]:
import os
import torch.nn as nn
from torch.ao.quantization import quantize_dynamic

In [7]:
device_cpu = torch.device("cpu")

model_cpu = model.to(device_cpu)
model_cpu.eval()

inputs_cpu = {k: v.to(device_cpu) for k, v in inputs.items()}

model_quantized = quantize_dynamic(
    model_cpu,
    {nn.Linear},
    dtype=torch.qint8,
)

print("Quantized model")
print(model_quantized)


fp32_path = "model_fp32_state_dict.pth"
int8_path = "model_quantized_int8_state_dict.pth"

torch.save(model_cpu.state_dict(), fp32_path)
torch.save(model_quantized.state_dict(), int8_path)

size_fp32 = os.path.getsize(fp32_path)
size_int8 = os.path.getsize(int8_path)

def to_MB(bytes_size):
    return bytes_size / (1024 * 1024)

print("File size")
print(f"fp32 model size: {to_MB(size_fp32):.2f} MB")
print(f"int8 model size: {to_MB(size_int8):.2f} MB")
print(f"size comparison (int8 / fp32): {size_int8 / size_fp32:.2f}")


Quantized model
MPNetModel(
  (embeddings): MPNetEmbeddings(
    (word_embeddings): Embedding(30527, 768, padding_idx=1)
    (position_embeddings): Embedding(514, 768, padding_idx=1)
    (LayerNorm): LayerNorm((768,), eps=1e-05, elementwise_affine=True)
    (dropout): Dropout(p=0.1, inplace=False)
  )
  (encoder): MPNetEncoder(
    (layer): ModuleList(
      (0-11): 12 x MPNetLayer(
        (attention): MPNetAttention(
          (attn): MPNetSelfAttention(
            (q): DynamicQuantizedLinear(in_features=768, out_features=768, dtype=torch.qint8, qscheme=torch.per_tensor_affine)
            (k): DynamicQuantizedLinear(in_features=768, out_features=768, dtype=torch.qint8, qscheme=torch.per_tensor_affine)
            (v): DynamicQuantizedLinear(in_features=768, out_features=768, dtype=torch.qint8, qscheme=torch.per_tensor_affine)
            (o): DynamicQuantizedLinear(in_features=768, out_features=768, dtype=torch.qint8, qscheme=torch.per_tensor_affine)
            (dropout): Dropout(

In [8]:
def forward_fp32():
    with torch.inference_mode():
        _ = model_cpu(**inputs_cpu)


def forward_int8():
    with torch.inference_mode():
        _ = model_quantized(**inputs_cpu)


time_fp32 = measure_time(forward_fp32, n_runs=100)
time_int8 = measure_time(forward_int8, n_runs=100)

print("Inference comparison fp32 vs int8 on CPU")
print(f"fp32 model: {time_fp32:.6f} s")
print(f"int8 model: {time_int8:.6f} s")

speedup_int8 = time_fp32 / time_int8
print(f"Speedup int8 vs fp32: {speedup_int8:.2f}x")

Inference comparison fp32 vs int8 on CPU
fp32 model: 0.038781 s
int8 model: 0.013071 s
Speedup int8 vs fp32: 2.97x


Quantization is helpful here. On CPU, the int8 quantized model runs about 2.97x faster than the fp32 model (0.013071 s vs. 0.038781 s per forward pass), and its saved `state_dict` is smaller.