# Quantization examples

- Post Training Quantization (PTQ) : quantization parameters (scale, shift/zero point) are determined after training
    - Dynamic range quantization -> only the weights are quantized to 8 bits
    - Full integer quantization -> weights, model input data, and activations are all quantized
    - Float16 quantization -> fp32 weights are quantized to fp16
    
- Quantization Aware Training (QAT) : quantization is emulated during training to mitigate the accuracy drop
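
The two quantization parameters mentioned above can be illustrated with a minimal sketch in plain Python. It assumes asymmetric 8-bit quantization with a floating-point scale and an integer offset (the "shift", usually called the zero point). The function names and sample weights are illustrative only and are not taken from any library.

```python
def choose_qparams(values, num_bits=8):
    """Pick a scale and zero point that map [min, max] onto unsigned integers."""
    qmin, qmax = 0, 2 ** num_bits - 1
    # Extend the range to include 0.0 so that zero is exactly representable.
    lo, hi = min(min(values), 0.0), max(max(values), 0.0)
    scale = (hi - lo) / (qmax - qmin) or 1.0  # fall back to 1.0 for all-zero input
    zero_point = round(qmin - lo / scale)
    return scale, max(qmin, min(qmax, zero_point))


def quantize(values, scale, zero_point, num_bits=8):
    """Map floats to integers in [0, 2^num_bits - 1]."""
    qmax = 2 ** num_bits - 1
    return [max(0, min(qmax, round(v / scale) + zero_point)) for v in values]


def dequantize(qvalues, scale, zero_point):
    """Map integers back to approximate floats."""
    return [scale * (q - zero_point) for q in qvalues]


weights = [-0.8, -0.25, 0.0, 0.5, 1.2]  # made-up example weights
scale, zero_point = choose_qparams(weights)
q = quantize(weights, scale, zero_point)
restored = dequantize(q, scale, zero_point)
max_error = max(abs(a - b) for a, b in zip(weights, restored))
print(q, max_error)
```

The round trip loses at most about half a quantization step per value, which is the error that PTQ accepts and QAT tries to train around.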

### More details

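- Dynamic range quantization 
    - No separate calibration data is needed
    - Smaller model (1/4 of the fp32 size at 8 bits)
    - However, the inputs and outputs of each op stay in floating point; activations are quantized on the fly only inside kernels that support it
    - Well suited to small-batch LSTMs and MLPs

- Full integer quantization
    - Smaller model (1/4 of the fp32 size at 8 bits)
    - Lower memory usage and better cache reuse
    - Faster computation (on hardware that supports 8-bit fixed-point arithmetic)
    - However, calibration data is needed to determine the activation quantization parameters (usually 100 to 1000 samples taken from the training data)
        - because the data that will arrive at run time is not known in advance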
- Float16 quantization

- Quantization Aware Training
    - Emulates quantization during training, so the quantization error that occurs at inference time is already reflected at training time
    - QAT is applied by fine-tuning an already trained model
    - Smaller accuracy drop than PTQ
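
"Emulating quantization during training" is commonly done with a fake-quantize op: the forward pass rounds values to the 8-bit grid and immediately converts them back to float, while the backward pass ignores the rounding (a straight-through estimator). The sketch below is an illustration under those assumptions, not the implementation used by PyTorch's QAT tooling. `FakeQuantSTE`, the scale and the zero point are all made up for the example.

```python
import torch


class FakeQuantSTE(torch.autograd.Function):
    """Quantize-dequantize in the forward pass; pass gradients straight through."""

    @staticmethod
    def forward(ctx, x, scale, zero_point):
        # Snap to the unsigned 8-bit grid, then map back to float.
        q = torch.clamp(torch.round(x / scale) + zero_point, 0, 255)
        return (q - zero_point) * scale

    @staticmethod
    def backward(ctx, grad_output):
        # Straight-through estimator: treat the rounding step as the identity.
        # scale and zero_point are constants here, so they get no gradient.
        return grad_output, None, None


x = torch.tensor([-0.30, 0.12, 0.47], requires_grad=True)
y = FakeQuantSTE.apply(x, 0.01, 128)  # scale=0.01, zero_point=128 (illustrative)
y.sum().backward()
print(y, x.grad)
```

Because the loss is computed on the rounded values, the weights can adapt to the quantization error during fine-tuning.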

### PyTorch categories

- Dynamic range quantization (Dynamic quantization) : suited to small-batch LSTMs and MLPs, no dataset needed
- Post training quantization (Static quantization) : suited to CNNs, calibration dataset needed -> can also be applied to embeddings and multi-head attention
- Quantization Aware Training


### Post Training Dynamic Quantization (PTQ)
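- Dynamic quantization is the easiest method to use when no pretrained, already quantized model is available
- Its main limitation is that the qconfig_spec option currently supports only nn.Linear and nn.LSTM
- To quantize other modules such as nn.Conv2d, static quantization or QAT is required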
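
The limitation described above can be checked on a toy model. The sketch below assumes a CPU build of PyTorch with a quantized backend available. `TinyNet` is a made-up module with one convolution and one linear layer, and only the linear layer is expected to be swapped for a dynamically quantized version.

```python
import torch
from torch import nn
from torch.ao.quantization import quantize_dynamic


class TinyNet(nn.Module):
    """Made-up model with one conv layer and one linear layer."""

    def __init__(self):
        super().__init__()
        self.conv = nn.Conv2d(3, 4, kernel_size=3, padding=1)
        self.fc = nn.Linear(4 * 8 * 8, 10)

    def forward(self, x):
        return self.fc(torch.relu(self.conv(x)).flatten(1))


model = TinyNet().eval()
# Returns a quantized copy; the original model is left untouched by default.
quantized = quantize_dynamic(model, qconfig_spec={nn.Linear}, dtype=torch.qint8)

print(type(quantized.conv))  # still a regular float nn.Conv2d
print(type(quantized.fc))    # replaced by a dynamically quantized Linear
out = quantized(torch.randn(2, 3, 8, 8))
```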

In [5]:
from save_utils import load_model, save_model, bit_save_model
from print_utils import print_size_of_model
from dataset import calibrate, bit_calibrate

import torch

In [6]:
from transformers import CLIPVisionModelWithProjection

visual_model = CLIPVisionModelWithProjection.from_pretrained("openai/clip-vit-base-patch32")

# quantize the weights of every nn.Linear to int8
model_dynamic_quantized = torch.quantization.quantize_dynamic(visual_model, qconfig_spec={torch.nn.Linear}, dtype=torch.qint8)

Some weights of the model checkpoint at openai/clip-vit-base-patch32 were not used when initializing CLIPVisionModelWithProjection: ['text_model.encoder.layers.11.self_attn.k_proj.weight', 'text_model.encoder.layers.2.mlp.fc1.bias', 'text_model.encoder.layers.0.layer_norm2.bias', 'text_model.encoder.layers.11.self_attn.out_proj.weight', 'text_model.encoder.layers.7.layer_norm2.bias', 'text_model.encoder.layers.0.mlp.fc2.bias', 'text_model.encoder.layers.2.self_attn.q_proj.weight', 'text_model.encoder.layers.5.self_attn.q_proj.weight', 'text_model.encoder.layers.0.self_attn.v_proj.weight', 'text_model.encoder.layers.1.self_attn.out_proj.weight', 'text_model.encoder.layers.9.self_attn.k_proj.weight', 'text_model.encoder.layers.4.mlp.fc2.weight', 'text_model.encoder.layers.3.self_attn.v_proj.weight', 'text_model.encoder.layers.1.layer_norm2.weight', 'text_model.encoder.layers.5.mlp.fc1.bias', 'text_model.encoder.layers.3.self_attn.q_proj.bias', 'text_model.encoder.layers.5.mlp.fc1.weight'

In [7]:
print(f'[fp32]')
print_size_of_model(visual_model)

print(f'[int8]')
print_size_of_model(model_dynamic_quantized)

[fp32]
Model size: 351.46MB
[int8]
Model size: 95.53MB
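
The sizes above give a ratio of roughly 3.7x rather than a full 4x, which is consistent with only the nn.Linear weights being stored as int8. The helper `print_size_of_model` comes from a local module that is not shown here. One plausible way to measure size, given purely as an assumption about what such a helper might do, is to serialize the `state_dict` and count the bytes:

```python
import io

import torch
from torch import nn


def size_in_mb(model: nn.Module) -> float:
    """Serialize the state_dict to memory and return its size in MB (10^6 bytes)."""
    buffer = io.BytesIO()
    torch.save(model.state_dict(), buffer)
    return buffer.getbuffer().nbytes / 1e6


layer = nn.Linear(1024, 1024)  # about 1M float32 weights plus a bias vector
fp32_mb = size_in_mb(layer)
print(f"Model size: {fp32_mb:.2f}MB")
```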


In [8]:
# save_model(model_dynamic_quantized, "DQ_int8", "cpu")

In [14]:
%%timeit
calibrate(visual_model, "cpu", time=True)

  1%|          | 3/313 [00:04<07:50,  1.52s/it]
  1%|          | 3/313 [00:04<07:37,  1.47s/it]
  1%|          | 3/313 [00:04<07:40,  1.48s/it]
  1%|          | 3/313 [00:04<07:32,  1.46s/it]
  1%|          | 3/313 [00:04<07:54,  1.53s/it]
  1%|          | 3/313 [00:04<07:51,  1.52s/it]
  1%|          | 3/313 [00:04<07:46,  1.50s/it]
  1%|          | 3/313 [00:04<07:35,  1.47s/it]

4.56 s ± 75 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)





In [15]:
%%timeit
calibrate(model_dynamic_quantized, "cpu", time=True)

  1%|          | 3/313 [00:03<06:36,  1.28s/it]
  1%|          | 3/313 [00:03<06:43,  1.30s/it]
  1%|          | 3/313 [00:03<06:51,  1.33s/it]
  1%|          | 3/313 [00:03<06:44,  1.30s/it]
  1%|          | 3/313 [00:03<06:44,  1.30s/it]
  1%|          | 3/313 [00:03<06:47,  1.31s/it]
  1%|          | 3/313 [00:03<06:45,  1.31s/it]
  1%|          | 3/313 [00:03<06:48,  1.32s/it]

4.02 s ± 27 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)





### 8bit load

In [2]:
from transformers import CLIPVisionModelWithProjection

visual_model = CLIPVisionModelWithProjection.from_pretrained("openai/clip-vit-base-patch32", load_in_8bit=True)
print(visual_model.get_memory_footprint())

Some weights of the model checkpoint at openai/clip-vit-base-patch32 were not used when initializing CLIPVisionModelWithProjection: ['text_model.encoder.layers.11.self_attn.k_proj.weight', 'text_model.encoder.layers.2.mlp.fc1.bias', 'text_model.encoder.layers.0.layer_norm2.bias', 'text_model.encoder.layers.11.self_attn.out_proj.weight', 'text_model.encoder.layers.7.layer_norm2.bias', 'text_model.encoder.layers.0.mlp.fc2.bias', 'text_model.encoder.layers.2.self_attn.q_proj.weight', 'text_model.encoder.layers.5.self_attn.q_proj.weight', 'text_model.encoder.layers.0.self_attn.v_proj.weight', 'text_model.encoder.layers.1.self_attn.out_proj.weight', 'text_model.encoder.layers.9.self_attn.k_proj.weight', 'text_model.encoder.layers.4.mlp.fc2.weight', 'text_model.encoder.layers.3.self_attn.v_proj.weight', 'text_model.encoder.layers.1.layer_norm2.weight', 'text_model.encoder.layers.5.mlp.fc1.bias', 'text_model.encoder.layers.3.self_attn.q_proj.b

90370960


In [3]:
print(f'[int8_load]')
print_size_of_model(visual_model)

[int8_load]
Model size: 90.79MB


In [None]:
bit_save_model(visual_model, "Load_int8")

In [None]:
%%timeit
bit_calibrate(visual_model, time=True)

  1%|          | 3/313 [00:01<02:06,  2.45it/s]
  1%|          | 3/313 [00:01<01:59,  2.60it/s]
  1%|          | 3/313 [00:01<01:59,  2.60it/s]
  1%|          | 3/313 [00:01<01:57,  2.64it/s]
  1%|          | 3/313 [00:01<01:57,  2.65it/s]
  1%|          | 3/313 [00:01<01:57,  2.64it/s]
  1%|          | 3/313 [00:01<02:00,  2.58it/s]
  1%|          | 3/313 [00:01<02:04,  2.50it/s]

1.24 s ± 22.1 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)





### Post Training Static Quantization
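- Both the model's weights and its activations are converted to 8-bit integers ahead of time
- Unlike dynamic quantization, activations are not converted on the fly during inference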
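
Fixing the activation parameters ahead of time requires a calibration pass: representative batches are run through the model while the observed value range is recorded, and the scale and zero point are derived from that range. A minimal sketch in plain Python follows. `RunningMinMax` and the sample batches are illustrative. PyTorch ships its own observer classes, which this does not reproduce.

```python
class RunningMinMax:
    """Track the smallest and largest activation seen during calibration."""

    def __init__(self):
        self.lo, self.hi = float("inf"), float("-inf")

    def observe(self, batch):
        self.lo = min(self.lo, min(batch))
        self.hi = max(self.hi, max(batch))

    def qparams(self, num_bits=8):
        """Derive a fixed scale and zero point from the recorded range."""
        qmax = 2 ** num_bits - 1
        lo, hi = min(self.lo, 0.0), max(self.hi, 0.0)  # keep 0.0 representable
        scale = (hi - lo) / qmax or 1.0
        zero_point = round(-lo / scale)
        return scale, zero_point


observer = RunningMinMax()
calibration_batches = ([0.1, 0.9, -0.2], [2.0, 0.3, -0.5], [1.1, -0.1, 0.0])
for batch in calibration_batches:  # made-up activations
    observer.observe(batch)

scale, zero_point = observer.qparams()
print(scale, zero_point)
```

After calibration these values stay fixed, so inference does not need to inspect each incoming batch the way dynamic quantization does.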

### PyTorch quantization Modes
- https://pytorch.org/docs/stable/quantization.html#quantization-api-summary
- https://pytorch.org/tutorials/prototype/fx_graph_mode_ptq_static.html

PyTorch offers two quantization modes: Eager Mode Quantization and FX Graph Mode Quantization.
- Eager Mode Quantization
    - The user must fuse modules and specify manually where quantization and dequantization happen
    - It supports only modules, not functionals
- FX Graph Mode Quantization
    - A prototype feature at the time of writing
    - It is not expected to work on arbitrary models, because the model must be symbolically traceable
    - The model may need to be modified to make it traceable
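
A typical reason a model needs modification for FX Graph Mode is data-dependent control flow, which `torch.fx` symbolic tracing cannot record. The sketch below uses two made-up modules, `Branchy` and `Traceable`. It shows tracing failing on a Python `if` over tensor values and succeeding once the same logic is written as a tensor op.

```python
import torch
from torch import nn
from torch.fx import symbolic_trace


class Branchy(nn.Module):
    def forward(self, x):
        # Data-dependent control flow: the branch depends on tensor values.
        if x.sum() > 0:
            return x * 2
        return x - 1


class Traceable(nn.Module):
    def forward(self, x):
        # Same logic as a tensor op, which tracing can record in the graph.
        return torch.where(x.sum() > 0, x * 2, x - 1)


try:
    symbolic_trace(Branchy())
    traced_ok = True
except Exception as err:  # torch.fx reports a tracing error here
    traced_ok = False
    print(f"tracing failed: {type(err).__name__}")

graph_module = symbolic_trace(Traceable())
print(graph_module.code)
```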


In [10]:
import torch
from torch.ao.quantization import (
  get_default_qconfig_mapping,
  get_default_qat_qconfig_mapping,
  QConfigMapping,
)
import torch.ao.quantization.quantize_fx as quantize_fx
import copy

from torchvision.models import resnet18
model_fp = resnet18(pretrained=True)

### post training dynamic/weight_only quantization ###

# we need to deepcopy if we still want to keep model_fp unchanged after quantization since quantization apis change the input model
model_to_quantize = copy.deepcopy(model_fp)
model_to_quantize.eval()
qconfig_mapping = QConfigMapping().set_global(torch.ao.quantization.default_dynamic_qconfig)

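# a tuple of one or more example inputs is needed to trace the model
input_fp32 = torch.randn(1, 3, 224, 224)  # image-shaped batch; torch.tensor((1,3,224,224)) would build a 1-D tensor of four ints
example_inputs = (input_fp32,)  # the trailing comma makes this a one-element tuple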

# prepare
model_prepared = quantize_fx.prepare_fx(model_to_quantize, qconfig_mapping, example_inputs)

# no calibration needed when we only have dynamic/weight_only quantization
# quantize
model_quantized = quantize_fx.convert_fx(model_prepared)



In [11]:
print(f'[fp32]')
print_size_of_model(model_fp)

print(f'[int8]')
print_size_of_model(model_quantized)

[fp32]
Model size: 46.83MB
[int8]
Model size: 45.22MB
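
The size barely changes here, and a likely explanation is that dynamic quantization only converts linear-type layers, while ResNet-18 consists almost entirely of convolutions. A back-of-the-envelope check follows. It assumes the only affected layer is the final classifier (512 inputs, 1000 classes) and that MB means 10^6 bytes.

```python
# ResNet-18's single nn.Linear layer is the classifier: 512 features -> 1000 classes.
fc_weights = 512 * 1000

fp32_bytes = fc_weights * 4  # 4 bytes per float32 weight
int8_bytes = fc_weights * 1  # 1 byte per int8 weight
expected_saving_mb = (fp32_bytes - int8_bytes) / 1e6

observed_saving_mb = 46.83 - 45.22  # sizes printed above
print(f"expected ~{expected_saving_mb:.2f} MB, observed {observed_saving_mb:.2f} MB")
```

The two numbers are close, which supports the reading that the convolution weights were left in fp32. Shrinking them would need static quantization or QAT.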


### FP16 inference

In [5]:
from transformers import CLIPTextModelWithProjection, CLIPVisionModelWithProjection

model = CLIPVisionModelWithProjection.from_pretrained("openai/clip-vit-base-patch32")

# Test fp32 model
print_size_of_model(model)

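# Test fp16 model
# note: .half() converts the module in place and returns it, so `model` is fp16 as well from here on
fp16_model = model.half()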
print_size_of_model(fp16_model)

Some weights of the model checkpoint at openai/clip-vit-base-patch32 were not used when initializing CLIPVisionModelWithProjection: ['text_projection.weight', 'text_model.encoder.layers.8.self_attn.k_proj.bias', 'text_model.encoder.layers.9.mlp.fc1.bias', 'text_model.encoder.layers.4.layer_norm1.weight', 'text_model.encoder.layers.6.mlp.fc1.weight', 'text_model.encoder.layers.1.self_attn.v_proj.weight', 'text_model.encoder.layers.1.self_attn.out_proj.bias', 'text_model.encoder.layers.1.mlp.fc1.weight', 'text_model.encoder.layers.7.mlp.fc1.bias', 'text_model.encoder.layers.8.mlp.fc2.weight', 'text_model.encoder.layers.6.layer_norm2.weight', 'text_model.encoder.layers.10.self_attn.v_proj.bias', 'text_model.encoder.layers.1.layer_norm1.bias', 'text_model.encoder.layers.11.self_attn.out_proj.weight', 'text_model.encoder.layers.3.self_attn.out_proj.weight', 'text_model.encoder.layers.8.self_attn.k_proj.weight', 'text_model.encoder.layers.10.mlp.fc1.weight', 'text_model.encoder.layers.2.self

Model size: 351.46MB
Model size: 175.76MB
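
The halving seen above follows directly from the element size: every parameter goes from 4 bytes to 2. A small sketch on a made-up model counts parameter bytes before and after the cast. `copy.deepcopy` is used so the fp32 model stays intact, since `.half()` modifies a module in place.

```python
import copy

import torch
from torch import nn


def param_bytes(model: nn.Module) -> int:
    """Total bytes occupied by the model's parameters."""
    return sum(p.numel() * p.element_size() for p in model.parameters())


fp32_model = nn.Sequential(nn.Linear(256, 256), nn.ReLU(), nn.Linear(256, 10))
fp16_model = copy.deepcopy(fp32_model).half()  # cast a copy, keep the original

print(param_bytes(fp32_model), param_bytes(fp16_model))
```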


In [6]:
# save_model(fp16_model, "fp16", "cuda")

In [12]:
%%timeit
calibrate_half(fp16_model, "cuda", time=True)

  1%|          | 3/313 [00:01<01:43,  2.98it/s]
  1%|          | 3/313 [00:00<01:33,  3.31it/s]
  1%|          | 3/313 [00:00<01:33,  3.32it/s]
  1%|          | 3/313 [00:00<01:35,  3.25it/s]
  1%|          | 3/313 [00:00<01:36,  3.21it/s]
  1%|          | 3/313 [00:00<01:32,  3.34it/s]
  1%|          | 3/313 [00:00<01:33,  3.32it/s]
  1%|          | 3/313 [00:00<01:33,  3.31it/s]

996 ms ± 12.7 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)





### Optimum - Dynamic quantization

In [None]:
import evaluate
from optimum.intel import INCQuantizer
from datasets import load_dataset
from transformers import AutoModelForQuestionAnswering, AutoTokenizer, pipeline
from neural_compressor.config import AccuracyCriterion, TuningCriterion, PostTrainingQuantConfig

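# AutoModelForQuestionAnswering needs a QA checkpoint; CLIP ("openai/clip-vit-base-patch32") has no
# question-answering head, so a SQuAD-finetuned DistilBERT is used here instead
model_name = "distilbert-base-cased-distilled-squad"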
model = AutoModelForQuestionAnswering.from_pretrained(model_name)

tokenizer = AutoTokenizer.from_pretrained(model_name)
eval_dataset = load_dataset("squad", split="validation").select(range(64))
task_evaluator = evaluate.evaluator("question-answering")
qa_pipeline = pipeline("question-answering", model=model, tokenizer=tokenizer)

def eval_fn(model):
    qa_pipeline.model = model
    metrics = task_evaluator.compute(model_or_pipeline=qa_pipeline, data=eval_dataset, metric="squad")
    return metrics["f1"]

# Set the accepted accuracy loss to 5%
accuracy_criterion = AccuracyCriterion(tolerable_loss=0.05)
# Set the maximum number of trials to 10
tuning_criterion = TuningCriterion(max_trials=10)
quantization_config = PostTrainingQuantConfig(
    approach="dynamic", accuracy_criterion=accuracy_criterion, tuning_criterion=tuning_criterion
)
quantizer = INCQuantizer.from_pretrained(model, eval_fn=eval_fn)
quantizer.quantize(quantization_config=quantization_config, save_directory="dynamic_quantization")
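
The `tolerable_loss=0.05` setting can be read as an acceptance test on each tuning trial. The sketch below assumes the loss is measured relative to the fp32 baseline and that a higher metric is better. This is an assumption about the criterion, not code taken from Neural Compressor, and the F1 scores are made up.

```python
def within_tolerance(baseline: float, candidate: float, tolerable_loss: float = 0.05) -> bool:
    """Accept a trial if the metric dropped by at most `tolerable_loss`, relatively."""
    relative_drop = (baseline - candidate) / baseline
    return relative_drop <= tolerable_loss


baseline_f1 = 86.0             # made-up F1 of the fp32 model
trial_f1 = [80.1, 83.9, 85.2]  # made-up F1 scores of quantized trials

accepted = [f1 for f1 in trial_f1 if within_tolerance(baseline_f1, f1)]
print(accepted)
```

Under this reading, tuning would stop at the first accepted trial or after `max_trials` attempts, whichever comes first.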