# ModelComPressor (MCP) Quantization Flow
## Quantization Scheme
- Post-training
- Static
- Mixed precision
- Linear Quantization


    <details>
    <summary>details</summary>

    - Post-training -> No backpropagation, No weight update
    - Static -> Not online, Observe (dynamic) range at pre-compile time, Outlier handling -> calibration -> quantization param
    - Mixed precision -> multiple data type -> layer/operator/node-wise quantization format
    - Linear Quantization -> Not non-linear -> scale, zero-point, round, integer

    </details>
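
To make the "Linear Quantization" item (scale, zero-point, round, integer) concrete, here is a minimal sketch of asymmetric linear quantization onto a signed 8-bit grid. The function names (`compute_qparams`, `quantize`, `dequantize`) and the example range are illustrative assumptions; they are not MCP APIs and may differ from what MCP does internally.

```python
def compute_qparams(x_min: float, x_max: float, nbits: int = 8):
    """Derive (scale, zero_point, qmin, qmax) for a signed integer grid."""
    qmin, qmax = -(2 ** (nbits - 1)), 2 ** (nbits - 1) - 1
    # Keep 0.0 inside the range so that it maps exactly onto an integer.
    x_min, x_max = min(x_min, 0.0), max(x_max, 0.0)
    scale = (x_max - x_min) / (qmax - qmin) or 1.0  # avoid a zero scale
    zero_point = round(qmin - x_min / scale)
    return scale, zero_point, qmin, qmax


def quantize(x, scale, zero_point, qmin, qmax):
    """Real value -> integer: divide by scale, round, shift, clamp."""
    q = round(x / scale) + zero_point
    return max(qmin, min(qmax, q))


def dequantize(q, scale, zero_point):
    """Integer -> approximate real value."""
    return (q - zero_point) * scale


# Hypothetical calibration range for one tensor.
scale, zp, qmin, qmax = compute_qparams(-3.65, 3.47)
for x in (-3.65, -1.0, 0.0, 2.5, 3.47):
    q = quantize(x, scale, zp, qmin, qmax)
    print(f"{x:+.2f} -> q={q:4d} -> {dequantize(q, scale, zp):+.4f}")
```

Values outside the calibrated range are clamped to `qmin`/`qmax`, which is why the observed range (and any outlier handling) matters.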

# Quantization Pipeline
Load model -> calibration -> quantization

This walkthrough is based on working code whose accuracy has been verified: https://github.com/deeplearningfromscratch/inference/tree/mlperf-qgpt-j

<img src="quantization flow.jpg" width="1100"/>

### calibration

`calibrate`

https://github.com/deeplearningfromscratch/inference/blob/mlperf-qgpt-j/language/gpt-j/quantization/calibrate.py#L72

- input
    - model(`fx.GraphModule`): symbolically traced
    - dataloader: loads the calibration data; written by the user
    - quantization config: calibration method, granularity, dtype, nbits, ...
- output
    - calibration range
    - quantization format

### quantization

`quantize model`

https://github.com/deeplearningfromscratch/inference/blob/mlperf-qgpt-j/language/gpt-j/quantization/__init__.py#L14

- input
    - model(`fx.GraphModule`): symbolically traced
    - quantization config
    - quantization parameter/calibration range
    - quantization format
- output
    - quantized model


Python environment (pip dependencies):

```yaml
  - pip:
      - --extra-index-url https://download.pytorch.org/whl/cu118
      - torch==2.1.0+cu118
      - numpy==1.26.4
      - datasets==2.18.0
      - nltk==3.8.1
      - evaluate==0.4.1
      - absl-py==2.1.0
      - rouge_score==0.1.2
      - git+https://github.com/furiosa-ai/model-compressor-private.git@bc5112f79809bc2c45e5a82615867f5c3b6d6265
      - git+https://github.com/furiosa-ai/transformers-compression.git@MLPerf4.1-v1.0
```

### util functions

In [None]:
from typing import Dict

import torch
import yaml
from torch.fx import passes


def load_yaml(file_path: str) -> Dict:
    with open(file_path, "r") as f:
        config = yaml.safe_load(f)
    return config

def vizualize_fx_graph(graph: torch.fx.GraphModule, output_path: str, name: str = ""):
    g = passes.graph_drawer.FxGraphDrawer(graph, name)
    with open(output_path, "wb") as f:
        f.write(g.get_dot_graph().create_svg())

### env. variables

In [None]:
MODEL_CONFIG_PATH = "toy-config.json"
QUANT_CONFIG_PATH = "toy-quant_config.yaml"
QUANT_PARAM_PATH = "toy-quant_param.npy"
QUANT_FORMAT_PATH = "toy-quant_format.yaml"

CALIB_DATA_PATH = "../../data/dataset/cnn-daily-mail/calibration/cnn_dailymail_calibration.json"
N_CALIB = 100

## Load/trace Model

The model used in this walkthrough is GPT-J. To keep the analysis simple, `n_layer` was reduced from 28 to 1. The rest of the configuration is identical to the original model.

The goal is to show how a text-generation model is quantized.

```yaml
{
  "_name_or_path": "EleutherAI/gpt-j-6b",
  "activation_function": "gelu_new",
  "architectures": [
    "GPTJForCausalLM"
  ],
  "attn_pdrop": 0.0,
  "bos_token_id": 50256,
  "embd_pdrop": 0.0,
  "eos_token_id": 50256,
  "gradient_checkpointing": false,
  "initializer_range": 0.02,
  "layer_norm_epsilon": 1e-05,
  "model_type": "gptj",
  "n_embd": 4096,
  "n_head": 16,
  "n_inner": null,
  "n_layer": 1, # 28
  "n_positions": 2048,
  "resid_pdrop": 0.0,
  "rotary": true,
  "rotary_dim": 64,
  "scale_attn_weights": true,
  "summary_activation": null,
  "summary_first_dropout": 0.1,
  "summary_proj_to_labels": true,
  "summary_type": "cls_index",
  "summary_use_proj": true,
  "task_specific_params": {
    "text-generation": {
      "do_sample": true,
      "max_length": 50,
      "temperature": 1.0
    }
  },
  "tie_word_embeddings": false,
  "tokenizer_class": "GPT2Tokenizer",
  "torch_dtype": "float32",
  "transformers_version": "4.28.0.dev0",
  "use_cache": true,
  "vocab_size": 50401
}

```

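Instantiating the model (note: `AutoModelForCausalLM.from_config` builds the architecture with freshly initialized weights; it does not load pre-trained weights)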

In [45]:
import torch
import transformers

from transformers import AutoConfig, AutoModelForCausalLM


def load_pytorch_model(model_config: transformers.PretrainedConfig, use_gpu: bool=True) -> transformers.PreTrainedModel:
    model = AutoModelForCausalLM.from_config(
        model_config,
        torch_dtype=torch.float32,
    )

    if use_gpu:
        print("Casting models to GPU...")
        assert torch.cuda.is_available(), "torch gpu is not available, exiting..."
        device = torch.device("cuda:0")
        model.to(device)

    model.eval()
    model = model.to(memory_format=torch.channels_last)
    return model

In [46]:
model_config = AutoConfig.from_pretrained(MODEL_CONFIG_PATH)
print(type(model_config))
model_config

<class 'transformers.models.gptj.configuration_gptj.GPTJConfig'>


GPTJConfig {
  "_name_or_path": "toy-config.json",
  "activation_function": "gelu_new",
  "architectures": [
    "GPTJForCausalLM"
  ],
  "attn_pdrop": 0.0,
  "bos_token_id": 50256,
  "embd_pdrop": 0.0,
  "eos_token_id": 50256,
  "gradient_checkpointing": false,
  "initializer_range": 0.02,
  "layer_norm_epsilon": 1e-05,
  "model_type": "gptj",
  "n_embd": 4096,
  "n_head": 4,
  "n_inner": null,
  "n_layer": 1,
  "n_positions": 2048,
  "resid_pdrop": 0.0,
  "rotary": true,
  "rotary_dim": 64,
  "scale_attn_weights": true,
  "summary_activation": null,
  "summary_first_dropout": 0.1,
  "summary_proj_to_labels": true,
  "summary_type": "cls_index",
  "summary_use_proj": true,
  "task_specific_params": {
    "text-generation": {
      "do_sample": true,
      "max_length": 50,
      "temperature": 1.0
    }
  },
  "tie_word_embeddings": false,
  "tokenizer_class": "GPT2Tokenizer",
  "torch_dtype": "float32",
  "transformers_version": "4.31.0",
  "use_cache": true,
  "vocab_size": 50401
}

In [47]:
toy_model = load_pytorch_model(model_config)
print(type(toy_model))
toy_model

Casting models to GPU...
<class 'transformers.models.gptj.modeling_gptj.GPTJForCausalLM'>


GPTJForCausalLM(
  (transformer): GPTJModel(
    (wte): Embedding(50401, 4096)
    (drop): Dropout(p=0.0, inplace=False)
    (h): ModuleList(
      (0): GPTJBlock(
        (ln_1): LayerNorm((4096,), eps=1e-05, elementwise_affine=True)
        (attn): GPTJAttention(
          (attn_dropout): Dropout(p=0.0, inplace=False)
          (resid_dropout): Dropout(p=0.0, inplace=False)
          (k_proj): Linear(in_features=4096, out_features=4096, bias=False)
          (v_proj): Linear(in_features=4096, out_features=4096, bias=False)
          (q_proj): Linear(in_features=4096, out_features=4096, bias=False)
          (out_proj): Linear(in_features=4096, out_features=4096, bias=False)
        )
        (mlp): GPTJMLP(
          (fc_in): Linear(in_features=4096, out_features=16384, bias=True)
          (fc_out): Linear(in_features=16384, out_features=4096, bias=True)
          (act): NewGELUActivation()
          (dropout): Dropout(p=0.0, inplace=False)
        )
      )
    )
    (ln_f): LayerN

`transformers.models.gptj.modeling_gptj.GPTJForCausalLM.forward`: https://github.com/huggingface/transformers/blob/main/src/transformers/models/gptj/modeling_gptj.py#L214

<img src="GPT-vs-GPT-J-uai-2064x1337.png" width="1100"/>

Tracing the model graph

Tracing is required because the quantization steps operate on the graph structure.

`custom_symbolic_trace`: https://github.com/deeplearningfromscratch/inference/blob/mlperf-qgpt-j/language/gpt-j/quantization/custom_symbolic_trace.py#L31

It depends on `HFTracer`: https://github.com/huggingface/transformers/blob/main/src/transformers/utils/fx.py#L741

In [48]:
from quantization.custom_symbolic_trace import custom_symbolic_trace  # isort:skip


toy_graph, _, _ = custom_symbolic_trace(toy_model)
vizualize_fx_graph(toy_graph, "toy-model-f32.svg")

<Output portion of the f32 graph>

<img src="toy-model-f32.png" width="1400"/>

## Calibration
- quantization config
- calibration data and calibration data loader
- calibration range calculation

### quantization config
For activations and weights:
- calibration method
- quantization granularity
- quantization data type
- number of quantization levels(number of bits)

In addition:

- quantization level
- key/value data type

(Advanced settings such as SMQ are not covered here.)

In [49]:

qconfig = load_yaml(QUANT_CONFIG_PATH)
qconfig

{'act_calib_method': 'MINMAX_ASYM',
 'act_dtype': 'int8',
 'act_granularity': 'channel',
 'act_nbits': 8,
 'batch_size': 1,
 'calib_batch_size': 1,
 'model': 'gpt-j',
 'model_overwrite': True,
 'percentile': 99.9,
 'target_machine': 'RGDA0',
 'weight_calib_method': 'AMAX_SYM',
 'weight_dtype': 'int8',
 'weight_granularity': 'channel',
 'weight_nbits': 8,
 'qlevel': 4,
 'kv_dtype': 'int8'}
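
Because every later step reads its settings from this dictionary, a quick sanity check before calibration can catch typos early. The sketch below is an assumption-laden helper, not part of MCP: `check_qconfig` is a hypothetical name, the required keys are simply the ones this notebook passes to `create_quantsim_model`, and the rule that an `intN` dtype should match `*_nbits` is a guess at a reasonable invariant.

```python
# Keys this notebook reads when building the quantsim model.
REQUIRED_KEYS = {
    "weight_calib_method", "weight_granularity", "weight_dtype", "weight_nbits",
    "act_calib_method", "act_granularity", "act_dtype", "act_nbits",
    "qlevel", "target_machine", "kv_dtype",
}


def check_qconfig(cfg: dict) -> list:
    """Return a list of problems; an empty list means the config looks sane."""
    problems = [f"missing key: {k}" for k in sorted(REQUIRED_KEYS - set(cfg))]
    for prefix in ("weight", "act"):
        dtype, nbits = cfg.get(f"{prefix}_dtype"), cfg.get(f"{prefix}_nbits")
        # Assumption: an "intN" dtype should agree with the declared bit width.
        if isinstance(dtype, str) and dtype.startswith("int") and dtype[3:].isdigit():
            if int(dtype[3:]) != nbits:
                problems.append(f"{prefix}_dtype={dtype} but {prefix}_nbits={nbits}")
    return problems


# Copy of the config printed above.
qconfig_example = {
    "act_calib_method": "MINMAX_ASYM", "act_dtype": "int8",
    "act_granularity": "channel", "act_nbits": 8,
    "batch_size": 1, "calib_batch_size": 1,
    "model": "gpt-j", "model_overwrite": True, "percentile": 99.9,
    "target_machine": "RGDA0",
    "weight_calib_method": "AMAX_SYM", "weight_dtype": "int8",
    "weight_granularity": "channel", "weight_nbits": 8,
    "qlevel": 4, "kv_dtype": "int8",
}

print(check_qconfig(qconfig_example))
```

Returning a list of messages rather than raising on the first problem makes it easier to fix several config mistakes in one pass.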

### calibration dataloader

In [51]:
from torch.utils.data import DataLoader
from dataset import Dataset


def cal_data_loader(calib_dataset_path: str, batch_size: int, n_calib: int) -> torch.utils.data.DataLoader:
    data_object = Dataset(calib_dataset_path, batch_size)
    data_list = []
    for idx in range(len(data_object.source_encoded_input_ids)):
        data_list.append(
            {
                "input_ids": data_object.source_encoded_input_ids[idx],
                "attention_mask": data_object.source_encoded_attn_masks[idx],
                "position_ids": torch.arange(
                    len(data_object.source_encoded_input_ids[idx][0])
                ),
            }
        )
    return DataLoader(data_list[:n_calib], batch_size)

dataloader = cal_data_loader(CALIB_DATA_PATH, batch_size=qconfig["calib_batch_size"], n_calib=N_CALIB)
print(type(dataloader))

Constructing QSL
Encoding Samples
Finished destroying QSL.
<class 'torch.utils.data.dataloader.DataLoader'>


### calibration range calculation(calibration)

Calibration inputs

- augmented `torch.fx.GraphModule`(Qlevel2)
- quantization config
- calibration dataloader

Outputs

- quantization parameter
- quantization format

In [None]:
from quantization.custom_symbolic_trace import custom_symbolic_trace  # isort:skip
import model_compressor  # isort:skip


def calibrate(graph: torch.fx.GraphModule, 
              qconfig: Dict, 
              calib_dataloader: torch.utils.data.DataLoader, 
              qparam_path: str, qformat_path: str) -> None:
    # run calibration
    model_compressor.calibrate(
        graph,
        model_name=qconfig["model"],
        calib_dataloader=calib_dataloader,
        weight_calib_method=qconfig["weight_calib_method"],
        weight_granularity=qconfig["weight_granularity"],
        weight_dtype=qconfig["weight_dtype"],
        weight_nbits=qconfig["weight_nbits"],
        act_calib_method=qconfig["act_calib_method"],
        act_granularity=qconfig["act_granularity"],
        act_dtype=qconfig["act_dtype"],
        act_nbits=qconfig["act_nbits"],
        percentile=qconfig["percentile"],
        target_machine=qconfig["target_machine"],
    )

    # save calibration outputs
    model_compressor.save(
        graph,
        qparam_out_path=qparam_path,
        qformat_out_path=qformat_path,
        weight_calib_method=qconfig["weight_calib_method"],
        weight_granularity=qconfig["weight_granularity"],
        weight_dtype=qconfig["weight_dtype"],
        weight_nbits=qconfig["weight_nbits"],
        act_calib_method=qconfig["act_calib_method"],
        act_granularity=qconfig["act_granularity"],
        act_dtype=qconfig["act_dtype"],
        act_nbits=qconfig["act_nbits"],
    )
    return

#### augmented fx graph

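Calibration is the step that collects min/max values for the weights and inputs of selected nodes.

Each of these nodes is given a "Calibrator" that performs this collection.

Technically, this step converts the graph into a Qlevel2 graph, that is, a "dynamic & simulated quantization (fake quant)" graph.

During calibration, this general-purpose IR is specially assigned `scale=1.0` so that it can be used for calibration. In other words, each quantizer behaves as an identity.

As a result, the tensor values are collected from what is effectively the f32 graph.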
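
The idea can be sketched in plain Python. The class below is a hypothetical stand-in for a calibrator (it is not MCP's `MinMaxCalibrator`): it records the running min/max of whatever passes through it and returns the values unchanged, so the layer still computes on the original floating-point values.

```python
class MinMaxObserver:
    """Hypothetical calibrator attached to one tensor of one node."""

    def __init__(self):
        self.min = None
        self.max = None

    def __call__(self, values):
        lo, hi = min(values), max(values)
        self.min = lo if self.min is None else min(self.min, lo)
        self.max = hi if self.max is None else max(self.max, hi)
        return values  # identity: downstream computation is unaffected


def linear(x, weight, observer):
    """Toy linear layer whose input goes through the observer first."""
    x = observer(x)
    return [sum(w * v for w, v in zip(row, x)) for row in weight]


obs = MinMaxObserver()
weight = [[0.5, -1.0], [2.0, 0.25]]
batches = [[1.0, -2.0], [0.5, 3.0], [-4.0, 0.0]]  # made-up calibration inputs
outputs = [linear(x, weight, obs) for x in batches]
print(obs.min, obs.max)  # -4.0 3.0
```

After all calibration batches have run, the recorded range is what later gets turned into quantization parameters.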

In [52]:
def augment_graph(graph: torch.fx.GraphModule, 
                  calib_dataloader: torch.utils.data.DataLoader) -> torch.fx.GraphModule:
    return model_compressor.create_quantsim_model(
            graph,
            weight_calib_method=qconfig["weight_calib_method"],
            weight_granularity=qconfig["weight_granularity"],
            weight_dtype=qconfig["weight_dtype"],
            weight_nbits=qconfig["weight_nbits"],
            act_calib_method=qconfig["act_calib_method"],
            act_granularity=qconfig["act_granularity"],
            act_dtype=qconfig["act_dtype"],
            act_nbits=qconfig["act_nbits"],
            qlevel=qconfig["qlevel"],
            target_machine=qconfig["target_machine"],
            dataloader=calib_dataloader,
            disable_inout=(True, True),
            kv_dtype=qconfig["kv_dtype"],
        )

qlv2_graph = augment_graph(toy_graph, dataloader)

not changed node: size size
not changed node: getitem <built-in function getitem>
not changed node: view view
not changed node: size_1 size
not changed node: getitem_1 <built-in function getitem>
not changed node: view_1 view
not changed node: long long
not changed node: view_2 view
not changed node: getitem_5 <built-in function getitem>
not changed node: to to
this module remains in legacy quant module: transformer_drop <class 'torch.nn.modules.dropout.Dropout'>
not changed node: size_3 size
not changed node: getitem_6 <built-in function getitem>
not changed node: size_4 size
not changed node: getitem_7 <built-in function getitem>
not changed node: view_3 view
not changed node: size_5 size
not changed node: getitem_8 <built-in function getitem>
not changed node: view_4 view
not changed node: size_6 size
not changed node: getitem_9 <built-in function getitem>
not changed node: view_5 view
not changed node: permute permute
not changed node: get_embed_positions <function get_embed_positi

In [53]:
vizualize_fx_graph(qlv2_graph, "toy-model-qlv2.svg")
qlv2_graph

GraphModule(
  (transformer): Module(
    (wte): ModelCompressorModuleEmbedding(
      (org_target): Embedding(50401, 4096)
      (_input_quantizer): TensorQuantizer(disabled)
      (_weight_quantizer): TensorQuantizer(disabled)
    )
    (drop): Dropout(p=0.0, inplace=False)
    (h): Module(
      (0): Module(
        (ln_1): ModelCompressorUnaryModule(
          (org_target): LayerNorm((4096,), eps=1e-05, elementwise_affine=True)
          (_input_quantizer): TensorQuantizer(32bit fake per-tensor amax=dynamic max=dynamic min=dynamic scale=1.0 quant)
        )
        (attn): Module(
          (q_proj): ModelCompressorModuleLinear(
            (org_target): Linear(in_features=4096, out_features=4096, bias=False)
            (_input_quantizer): TensorQuantizer(8bit fake axis=-1 amax=dynamic max=dynamic min=dynamic calibrator=MinMaxCalibrator scale=1.0 quant)
            (_weight_quantizer): TensorQuantizer(8bit fake axis=0 amax=dynamic max=dynamic min=dynamic calibrator=MaxCalibrator s

<Output portion of the qlv2 graph>

<img src="toy-model-qlv2.png" width="1400"/>

Graph characteristics

- `ModelCompressorModuleLinear`
- `ModelCompressorModuleConcat`
- `ModelCompressorBinaryModule`
- `ModelCompressorUnaryModule`
- `ModelCompressorModuleOutput`
- `ModelCompressorModuleEmbedding`

Note that every element-wise binary operation has been converted from a `function` call into a `module`.
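
One plausible reason for this conversion is that a module can carry state (one quantizer per input), whereas a bare function cannot. The sketch below illustrates that idea in plain Python. `BinaryModule` and `RangeRecorder` are hypothetical names and not MCP classes; only the attribute names `_input_0_quantizer` and `_input_1_quantizer` mirror the names visible in the qformat and qparam keys.

```python
import operator


class RangeRecorder:
    """Hypothetical per-input quantizer stub: records the range, returns input as-is."""

    def __init__(self):
        self.min, self.max = float("inf"), float("-inf")

    def __call__(self, x):
        self.min, self.max = min(self.min, x), max(self.max, x)
        return x


class BinaryModule:
    """Wraps a stateless binary function so that each input has its own quantizer."""

    def __init__(self, fn):
        self.fn = fn
        self._input_0_quantizer = RangeRecorder()
        self._input_1_quantizer = RangeRecorder()

    def __call__(self, a, b):
        return self.fn(self._input_0_quantizer(a), self._input_1_quantizer(b))


add = BinaryModule(operator.add)  # replaces a plain `a + b` call site
results = [add(a, b) for a, b in [(1.0, 10.0), (-2.0, 30.0), (0.5, -5.0)]]
print(results)
print(add._input_0_quantizer.min, add._input_1_quantizer.max)
```

The wrapped call returns the same values as the original function, while each operand now has a place to store its own statistics and dtype settings.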


`TensorQuantizer`: https://github.com/furiosa-ai/model-compressor-private/blob/main/model_compressor/quant_op/tensor_quantizer/quantizer_module.py#L37

In [54]:
calibrate(qlv2_graph, qconfig, dataloader, qparam_path=QUANT_PARAM_PATH, qformat_path=QUANT_FORMAT_PATH)

100%|██████████| 100/100 [00:04<00:00, 22.69it/s]
100%|██████████| 156/156 [00:00<00:00, 4070.80it/s]


### Calibration output: quant_format 과 quant_param
- quant_format defines the quantization spec for each `ModelCompressorModule` (MCM) node.

    ```yaml
    {
        'dtype': 'int8', 'axis': 2, 'dynamic': False, 
        'etc_for_MCLab': 
            {'torch_name': 'transformer.h.0.attn.q_proj._input_quantizer', 
            'input_dtype': 'torch.float32', 
            'input_shape': [1, 1422, 4096], 
            'nbits': 8, 
            'per_ch': True, 
            'if_per_channel_scaling': False, 
            'group_size': None, 
            'unsigned': False, 
            'calibrator_type': 'minmax', 
            'calibrator_method': 'minmax', 
            'asymmetric': True, 
            'do_quant': True, 
            'do_zp_equalizing': False}
    }
    ```
    
- quant_param holds (in the current context) the calibration ranges of the quantization targets with sub-fp16 dtypes (e.g. `int8`), which need a scale/zero-point for linear quantization.

    ```yaml
    {
        'amax': None,
        'max': array([3.4669478, 3.827307 , 3.4593744, ..., 3.8100147, 3.8997357,
            3.7047665], dtype=float32),
        'min': array([-3.6546168, -4.5315475, -3.6790934, ..., -3.6721456, -3.8043125,
                -3.9023168], dtype=float32),
        'amax_outlier': None,
        'max_outlier': None,
        'min_outlier': None,
        'basis': None,
        'scale_per_channel': None,
        'scale_per_channel_outlier': None,
        'clipping_bound': None,
        'outlier_cin_idx': None,
        'ch': None
    }
    ```

#### quant_format

In [55]:
qformat = load_yaml(QUANT_FORMAT_PATH)["quantized op list"]
print(f'Num: {len(qformat.keys())}')
print(qformat.keys())

Num: 45
dict_keys(['transformer_wte', 'transformer_h_0_ln_1', 'transformer_h_0_attn_q_proj', 'transformer_h_0_attn_k_proj', 'transformer_h_0_attn_v_proj', 'transformer_h_0_attn_out_proj', 'transformer_h_0_mlp_fc_in', 'transformer_h_0_mlp_fc_out', 'transformer_ln_f', 'lm_head', 'sub', 'mul', 'add', 'add_1', 'add_2', 'add_3', 'floordiv', 'mul_1', 'mul_2', 'add_4', 'mul_3', 'mul_4', 'add_5', 'cat', 'cat_1', 'cat_2', 'cat_3', 'sub_1', 'matmul', 'truediv', 'add_6', 'softmax', 'matmul_1', 'add_7', 'mul_5', 'mul_6', 'add_8', 'mul_7', 'add_9', 'mul_8', 'add_10', 'add_11', 'output_0_quantize_node', 'output_1_quantize_node', 'output_2_quantize_node'])


In [57]:
print(f'- LayerNorm: \n{qformat["transformer_h_0_ln_1"]}\n')
print(f'- Add: \n{qformat["add"]}\n')
print(f'- Concat: \n{qformat["cat"]}\n')
print(f'- Linear: \n{qformat["transformer_h_0_attn_q_proj"]}')

- LayerNorm: 
{'output_shape': [1, 1422, 4096], 'quant_desc_input': {'dtype': 'fp32', 'axis': None, 'dynamic': False, 'etc_for_MCLab': {'torch_name': 'transformer.h.0.ln_1._input_quantizer', 'input_dtype': 'torch.float32', 'input_shape': [1, 1422, 4096], 'nbits': 32, 'per_ch': False, 'if_per_channel_scaling': False, 'group_size': None, 'unsigned': False, 'calibrator_type': 'None', 'calibrator_method': 'None', 'asymmetric': False, 'do_quant': True, 'do_zp_equalizing': False}}}

- Add: 
{'output_shape': [0, 0, 0, 0], 'quant_desc_input_0': {'dtype': 'fp32', 'axis': None, 'dynamic': False, 'etc_for_MCLab': {'torch_name': 'add._input_0_quantizer', 'input_dtype': 'float', 'input_shape': [0], 'nbits': 32, 'per_ch': False, 'if_per_channel_scaling': False, 'group_size': None, 'unsigned': False, 'calibrator_type': 'None', 'calibrator_method': 'None', 'asymmetric': False, 'do_quant': False, 'do_zp_equalizing': False}}, 'quant_desc_input_1': {'dtype': 'fp32', 'axis': None, 'dynamic': False, 'etc_f
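
Since each entry nests one or more `quant_desc_*` descriptors, a small helper makes it easier to see which inputs are actually kept in an integer dtype. The sketch below runs on an abridged, hand-copied sample rather than the real file. The helper name `iter_descriptors` is hypothetical, most fields are omitted, and the `quant_desc_input` key for the linear layer is assumed by analogy with the LayerNorm entry.

```python
from collections import Counter

# Abridged entries in the shape of the printed qformat; most fields omitted.
sample_qformat = {
    "transformer_h_0_ln_1": {
        "output_shape": [1, 1422, 4096],
        "quant_desc_input": {"dtype": "fp32", "etc_for_MCLab": {"nbits": 32, "do_quant": True}},
    },
    "add": {
        "output_shape": [0, 0, 0, 0],
        "quant_desc_input_0": {"dtype": "fp32", "etc_for_MCLab": {"nbits": 32, "do_quant": False}},
        "quant_desc_input_1": {"dtype": "fp32"},  # remainder truncated in the output
    },
    "transformer_h_0_attn_q_proj": {
        # Key name assumed; descriptor values taken from the int8 example shown earlier.
        "quant_desc_input": {"dtype": "int8", "etc_for_MCLab": {"nbits": 8, "do_quant": True}},
    },
}


def iter_descriptors(qformat):
    """Yield (op_name, descriptor_key, descriptor) for every quant_desc_* entry."""
    for op_name, spec in qformat.items():
        for key, desc in spec.items():
            if key.startswith("quant_desc_") and isinstance(desc, dict):
                yield op_name, key, desc


dtype_counts = Counter(desc["dtype"] for _, _, desc in iter_descriptors(sample_qformat))
int_inputs = [
    f"{op}.{key}"
    for op, key, desc in iter_descriptors(sample_qformat)
    if desc["dtype"].startswith("int")
]
print(dict(dtype_counts))
print(int_inputs)
```

On the real file, the same loop could be pointed at `qformat` as loaded in the cell above.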

#### quant_param

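Quantization statistics/parameters of the layers targeted for quantization.

No quant_param exists for the following, because they are not quantized:

- the first and last layers
- element-wise binary operations between tensors, such as add and mul
- the embedding layer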


In [59]:
import numpy as np


qparam = np.load(QUANT_PARAM_PATH, allow_pickle=True)[()]
print(f'Num: {len(qparam.keys())}')
print(qparam.keys())


Num: 18
dict_keys(['transformer.wte._weight_quantizer', 'transformer.h.0.attn.q_proj._input_quantizer', 'transformer.h.0.attn.q_proj._weight_quantizer', 'transformer.h.0.attn.k_proj._input_quantizer', 'transformer.h.0.attn.k_proj._weight_quantizer', 'transformer.h.0.attn.v_proj._input_quantizer', 'transformer.h.0.attn.v_proj._weight_quantizer', 'transformer.h.0.attn.out_proj._input_quantizer', 'transformer.h.0.attn.out_proj._weight_quantizer', 'transformer.h.0.mlp.fc_in._input_quantizer', 'transformer.h.0.mlp.fc_in._weight_quantizer', 'transformer.h.0.mlp.fc_out._input_quantizer', 'transformer.h.0.mlp.fc_out._weight_quantizer', 'lm_head._input_quantizer', 'lm_head._weight_quantizer', 'matmul._input_0_quantizer', 'matmul._input_1_quantizer', 'matmul_1._input_1_quantizer'])
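
The qparam keys use dotted torch names ending in a quantizer name, while the qformat keys use underscore-joined op names. A small sketch for relating the two is shown below. Both helper names are hypothetical, and the "replace `.` with `_`" rule is only an assumption inferred from the keys printed in this notebook.

```python
from collections import defaultdict

# A few keys copied from the printed qparam dict.
qparam_keys = [
    "transformer.wte._weight_quantizer",
    "transformer.h.0.attn.q_proj._input_quantizer",
    "transformer.h.0.attn.q_proj._weight_quantizer",
    "lm_head._input_quantizer",
    "lm_head._weight_quantizer",
    "matmul._input_0_quantizer",
    "matmul._input_1_quantizer",
    "matmul_1._input_1_quantizer",
]


def group_by_module(keys):
    """Map module path -> list of quantizer names that have calibration data."""
    grouped = defaultdict(list)
    for key in keys:
        module, _, quantizer = key.rpartition(".")
        grouped[module].append(quantizer)
    return dict(grouped)


def to_qformat_name(module: str) -> str:
    # Assumption: qformat op names are the dotted path with "." replaced by "_".
    return module.replace(".", "_")


grouped = group_by_module(qparam_keys)
for module, quantizers in grouped.items():
    print(f"{to_qformat_name(module):32s} {quantizers}")
```

Grouping this way makes it easy to spot nodes where only one operand has a range, such as `matmul_1` in the output above.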


In [58]:
print(f'Weight ranges: \n{qparam["transformer.h.0.attn.q_proj._weight_quantizer"]}\n')
print(f'Input ranges: \n{qparam["transformer.h.0.attn.q_proj._input_quantizer"]}')

Weight ranges: 
{'amax': array([0.00256775, 0.00223099, 0.00242827, ..., 0.00225689, 0.00251386,
       0.00288823], dtype=float32), 'max': None, 'min': None, 'amax_outlier': None, 'max_outlier': None, 'min_outlier': None, 'basis': None, 'scale_per_channel': None, 'scale_per_channel_outlier': None, 'clipping_bound': None, 'outlier_cin_idx': None, 'ch': None}

Input ranges: 
{'amax': None, 'max': array([3.7416744, 3.6124158, 3.6038327, ..., 4.3889737, 3.7187653,
       5.0028906], dtype=float32), 'min': array([-3.613177 , -3.4567056, -4.5581727, ..., -3.420187 , -3.7840812,
       -3.5473545], dtype=float32), 'amax_outlier': None, 'max_outlier': None, 'min_outlier': None, 'basis': None, 'scale_per_channel': None, 'scale_per_channel_outlier': None, 'clipping_bound': None, 'outlier_cin_idx': None, 'ch': None}
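
The weight entry stores one `amax` per channel and no min/max, which fits a symmetric scheme. As a hedged illustration, the sketch below turns per-channel `amax` values into per-channel scales and quantizes toy weight rows. It assumes that `AMAX_SYM` means `scale = amax / 127` with a zero-point of 0 and an integer range of [-127, 127]; MCP's actual formula may differ. The function names and the weight rows are made up.

```python
def symmetric_scales(amax, nbits=8):
    """One scale per channel: the largest magnitude maps to the largest integer."""
    qmax = 2 ** (nbits - 1) - 1
    return [(a / qmax) or 1.0 for a in amax]  # fall back to 1.0 for all-zero channels


def quantize_rows(weight, scales, nbits=8):
    """Quantize each row (output channel) with its own scale; zero-point is 0."""
    qmax = 2 ** (nbits - 1) - 1
    return [
        [max(-qmax, min(qmax, round(w / s))) for w in row]
        for row, s in zip(weight, scales)
    ]


amax = [0.00256775, 0.00223099, 0.00242827]  # first three values printed above
scales = symmetric_scales(amax)

weight = [  # made-up rows whose absolute maxima equal the amax values
    [0.00256775, -0.001, 0.0],
    [-0.00223099, 0.0005, 0.002],
    [0.0012, -0.00242827, 0.0001],
]
q = quantize_rows(weight, scales)
for row, s in zip(q, scales):
    print(f"scale={s:.3e} -> {row}")
```

With per-channel scales, a channel with small weights still uses the full integer range instead of being squeezed by a single tensor-wide maximum.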


## Quantization

- The graph consists of `QLV4_Ops`: each MCM is now built from PyTorch ATen ops.
- Tensor dtypes have been converted to the dtypes that will actually run on RNGD.

In [60]:
def quantize_model(model: torch.fx.GraphModule, 
                   qconfig: Dict, qparam_path: str, qformat_path: str) -> torch.fx.GraphModule:
    model = model_compressor.create_quantsim_model(
        model,
        qformat_path=qformat_path,
        qparam_path=qparam_path,
        weight_calib_method=qconfig["weight_calib_method"],
        weight_granularity=qconfig["weight_granularity"],
        weight_dtype=qconfig["weight_dtype"],
        weight_nbits=qconfig["weight_nbits"],
        act_calib_method=qconfig["act_calib_method"],
        act_granularity=qconfig["act_granularity"],
        act_dtype=qconfig["act_dtype"],
        act_nbits=qconfig["act_nbits"],
        qlevel=qconfig["qlevel"],
        target_machine=qconfig["target_machine"],
        disable_inout=(True, True),
        kv_dtype=qconfig["kv_dtype"],
    )

    return model

In [61]:
qlv4_graph = quantize_model(toy_graph, qconfig, QUANT_PARAM_PATH, QUANT_FORMAT_PATH)

not changed node: size size
not changed node: getitem <built-in function getitem>
not changed node: view view
not changed node: size_1 size
not changed node: getitem_1 <built-in function getitem>
not changed node: view_1 view
not changed node: long long
not changed node: view_2 view
not changed node: getitem_5 <built-in function getitem>
not changed node: to to
this module remains in legacy quant module: transformer_drop <class 'torch.nn.modules.dropout.Dropout'>
not changed node: size_3 size
not changed node: getitem_6 <built-in function getitem>
not changed node: size_4 size
not changed node: getitem_7 <built-in function getitem>
not changed node: view_3 view
not changed node: size_5 size
not changed node: getitem_8 <built-in function getitem>
not changed node: view_4 view
not changed node: size_6 size
not changed node: getitem_9 <built-in function getitem>
not changed node: view_5 view
not changed node: permute permute
not changed node: get_embed_positions <function get_embed_positi

In [62]:
vizualize_fx_graph(qlv4_graph, 'toy-model-qlv4.svg')
qlv4_graph

GraphModule(
  (transformer): Module(
    (wte): QLV4_Embedding(
      (QLV4_embedding): _QLV4_Embedding_MOD()
      (_QLV4_output): QLV4_Output_MOD()
    )
    (drop): Dropout(p=0.0, inplace=False)
    (h): Module(
      (0): Module(
        (ln_1): QLV4_LayerNorm(
          (_QLV4_layernorm): QLV4_LayerNorm_MOD()
          (_QLV4_output): QLV4_Output_MOD()
        )
        (attn): Module(
          (q_proj): QLV4_Linear(
            (QLV4_linear): _QLV4_Linear_MOD()
            (QLV4_bias): _QLV4_LINEAR_BIAS_MOD()
            (_QLV4_output): QLV4_Output_MOD()
          )
          (k_proj): QLV4_Linear(
            (QLV4_linear): _QLV4_Linear_MOD()
            (QLV4_bias): _QLV4_LINEAR_BIAS_MOD()
            (_QLV4_output): QLV4_Output_MOD()
          )
          (v_proj): QLV4_Linear(
            (QLV4_linear): _QLV4_Linear_MOD()
            (QLV4_bias): _QLV4_LINEAR_BIAS_MOD()
            (_QLV4_output): QLV4_Output_MOD()
          )
          (attn_dropout): Dropout(p=0.0, inplac

<Output portion of the qlv4 graph>

<img src="toy-model-qlv4.png" width="1400"/>

# Summary

- MCP quantization operates on a symbolically traced PyTorch `torch.fx.GraphModule`.
    - The graph tracer must be implemented by the user.
- (Qlevel2) The nodes of this PyTorch graph are converted to `ModelCompressorModule`s and calibration is run.
    - This calibration step yields the qformat, which specifies the dtype and related settings of each layer, and the qparam for the layers that will be quantized.
    - The calibration dataloader must be implemented by the user.
- (Qlevel4) Using the qformat and qparam, the graph is finally transformed so that it can be compiled by the RNGD compiler.
    - All `ModelCompressorModule`s are also converted to PyTorch `ATen` ops.
    - Quantization itself happens in the Qlevel2 -> Qlevel3 transition. That is, the dtype of every layer is converted as the qformat prescribes.
- Even for a text-generation model, the quantization targets are the layers included in `model.forward`.

## Comment
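- For a symbolically traced fx GraphModule, the graph structure and data flow are hard to follow in SVG form.
    - An interactive graph visualizer on the level of netron would improve this considerably.
- The graph tracer and the calibration dataloader are parts that the user has to implement (even if the amount of code is very small).
- From a quantization standpoint, Qlevel3 is the final stage; Qlevel4 is better approached from the standpoint of subsequent model porting (exporting).
    - For user convenience, it seems necessary to serialize the graph at the Qlevel3 stage and keep it as a single file.
    - (Currently, running the quantized model requires the weights, qconfig, qparam, qformat, and a git submodule.)
    - In particular, because a text-generation model is split into prefill/decoder parts and compiled after quantization, that code needs to be developed and maintained separately from the quantization tool.

End.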