# Transformers Model Quantization: AWQ (Qwen2.5-3B-Instruct)

![img](https://huggingface.co/datasets/ybelkada/documentation-images/resolve/main/Thumbnail.png)

In June 2023, Ji Lin et al. published the paper [AWQ: Activation-aware Weight Quantization for LLM Compression and Acceleration](https://arxiv.org/pdf/2306.00978.pdf).

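The paper describes an activation-aware weight quantization algorithm that can compress any Transformer-based language model with only a small loss in model quality. For a detailed introduction to the AWQ algorithm, see [the project page from Prof. Song Han's lab at MIT](https://hanlab.mit.edu/projects/awq).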
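To make the "activation-aware" idea concrete, here is a toy NumPy sketch under stated assumptions. It is not the AutoAWQ or LLM-AWQ implementation: `pseudo_quantize` and `search_channel_scale` are hypothetical helpers, the data is random, and the grid search over a single exponent `alpha` is a simplified version of the per-channel scaling described in the paper. Input channels with large activations are scaled up before group-wise 4-bit quantization, and the scale is folded back afterwards.

```python
import numpy as np


def pseudo_quantize(w, n_bit=4, group_size=128):
    """Group-wise asymmetric (zero-point) quantize-then-dequantize of a 2-D weight."""
    out_features, in_features = w.shape
    assert in_features % group_size == 0, "in_features must be divisible by group_size"
    groups = w.reshape(-1, group_size)               # each row is one quantization group
    w_min = groups.min(axis=1, keepdims=True)
    w_max = groups.max(axis=1, keepdims=True)
    q_max = 2 ** n_bit - 1
    scale = np.maximum(w_max - w_min, 1e-8) / q_max  # step size per group
    zero = np.round(-w_min / scale)                  # integer zero point per group
    q = np.clip(np.round(groups / scale) + zero, 0, q_max)
    return ((q - zero) * scale).reshape(out_features, in_features)


def search_channel_scale(w, x, n_grid=20):
    """Grid-search alpha in [0, 1) for per-input-channel scales s = mean|x| ** alpha."""
    act_mag = np.abs(x).mean(axis=0)                 # activation magnitude per input channel
    reference = x @ w.T                              # full-precision layer output
    best = (np.inf, None, None)
    for step in range(n_grid):
        alpha = step / n_grid                        # alpha = 0 is plain quantization
        s = np.maximum(act_mag ** alpha, 1e-4)
        s = s / np.sqrt(s.max() * s.min())           # keep scales centred around 1
        w_q = pseudo_quantize(w * s) / s             # scale up, quantize, fold the scale back
        err = float(np.mean((x @ w_q.T - reference) ** 2))
        if err < best[0]:
            best = (err, alpha, s)
    return best


rng = np.random.default_rng(0)
x = rng.normal(size=(256, 256))
x[:, :4] *= 30.0                                     # a few channels with large activations
w = rng.normal(size=(64, 256))

baseline = float(np.mean((x @ pseudo_quantize(w).T - x @ w.T) ** 2))
err, alpha, s = search_channel_scale(w, x)
print(f"plain 4-bit error: {baseline:.4f}, scaled error: {err:.4f} (alpha={alpha})")
```

Because the search includes `alpha = 0` (no scaling), the selected error can never be worse than plain group-wise quantization on the calibration data. In a real model the division by `s` would be applied to the activations or fused into the preceding operator rather than to the dequantized weights; the two are mathematically equivalent for this sketch.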

transformers currently supports two open-source AWQ implementations:

- [AutoAWQ](https://github.com/casper-hansen/AutoAWQ)
- [LLM-AWQ](https://github.com/mit-han-lab/llm-awq) 


Because LLM-AWQ does not support the Nvidia T4 GPU (the GPU used for the course demos), we use the AutoAWQ library to introduce and demonstrate AWQ quantization.

## Quantizing a Model with AutoAWQ

Below we use `Qwen/Qwen2.5-3B-Instruct` as the example model and quantize it with the AWQ algorithm as implemented in the `AutoAWQ` library.

In [2]:
from awq import AutoAWQForCausalLM
from transformers import AutoTokenizer, AutoModelForCausalLM

model_name_or_path = "Qwen/Qwen2.5-3B-Instruct"
quant_model_dir = "models/qwen2.5-3B"

quant_config = {
    "zero_point": True,
    "q_group_size": 128,
    "w_bit": 4,
    "version": "GEMM"
}
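
As a rough illustration of what this configuration implies for storage, the sketch below estimates the average number of bits stored per weight. It assumes one 16-bit scale and, when `zero_point` is enabled, one `w_bit`-wide zero point per group of `q_group_size` weights; that layout is an assumption made for the arithmetic, and `bits_per_weight` and `linear_weight_mib` are hypothetical helpers, not AutoAWQ functions. The 2048 x 11008 shape is taken from the `gate_proj` layer of the model used here.

```python
quant_config = {"zero_point": True, "q_group_size": 128, "w_bit": 4, "version": "GEMM"}


def bits_per_weight(cfg, scale_bits=16):
    """Average stored bits per weight, including per-group scale and zero point."""
    per_group_overhead = scale_bits + (cfg["w_bit"] if cfg["zero_point"] else 0)
    return cfg["w_bit"] + per_group_overhead / cfg["q_group_size"]


def linear_weight_mib(in_features, out_features, bits):
    """Size in MiB of a weight matrix stored with `bits` bits per element."""
    return in_features * out_features * bits / 8 / 1024 ** 2


bpw = bits_per_weight(quant_config)
fp16_mib = linear_weight_mib(2048, 11008, 16)
awq_mib = linear_weight_mib(2048, 11008, bpw)
print(f"{bpw} bits/weight; gate_proj: {fp16_mib:.2f} MiB (FP16) vs {awq_mib:.2f} MiB (4-bit)")
```

Under these assumptions a smaller `q_group_size` gives finer-grained scales at the cost of more per-group overhead.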


In [27]:
# Load the model
model = AutoAWQForCausalLM.from_pretrained(model_name_or_path, trust_remote_code=True)
tokenizer = AutoTokenizer.from_pretrained(model_name_or_path, trust_remote_code=True)

Fetching 12 files: 100%|██████████| 12/12 [00:00<00:00, 10094.59it/s]
Loading checkpoint shards: 100%|██████████| 2/2 [00:00<00:00,  2.60it/s]
Special tokens have been added in the vocabulary, make sure the associated word embeddings are fine-tuned or trained.


In [28]:
model

Qwen2AWQForCausalLM(
  (model): Qwen2ForCausalLM(
    (model): Qwen2Model(
      (embed_tokens): Embedding(151936, 2048)
      (layers): ModuleList(
        (0-35): 36 x Qwen2DecoderLayer(
          (self_attn): Qwen2SdpaAttention(
            (q_proj): Linear(in_features=2048, out_features=2048, bias=True)
            (k_proj): Linear(in_features=2048, out_features=256, bias=True)
            (v_proj): Linear(in_features=2048, out_features=256, bias=True)
            (o_proj): Linear(in_features=2048, out_features=2048, bias=False)
            (rotary_emb): Qwen2RotaryEmbedding()
          )
          (mlp): Qwen2MLP(
            (gate_proj): Linear(in_features=2048, out_features=11008, bias=False)
            (up_proj): Linear(in_features=2048, out_features=11008, bias=False)
            (down_proj): Linear(in_features=11008, out_features=2048, bias=False)
            (act_fn): SiLU()
          )
          (input_layernorm): Qwen2RMSNorm()
          (post_attention_layernorm): Qwen2R

In [29]:
# Quantize the model
model.quantize(tokenizer, quant_config=quant_config)

Repo card metadata block was not found. Setting CardData to empty.
AWQ: 100%|██████████| 36/36 [05:28<00:00,  9.12s/it]


### Measured AWQ Quantization Run: Peak GPU Memory Usage Exceeds 10 GB


```shell
Sun Dec 24 15:21:35 2023
+---------------------------------------------------------------------------------------+
| NVIDIA-SMI 535.129.03             Driver Version: 535.129.03   CUDA Version: 12.2     |
|-----------------------------------------+----------------------+----------------------+
| GPU  Name                 Persistence-M | Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp   Perf          Pwr:Usage/Cap |         Memory-Usage | GPU-Util  Compute M. |
|                                         |                      |               MIG M. |
|=========================================+======================+======================|
|   0  Tesla T4                       Off | 00000000:00:0D.0 Off |                    0 |
| N/A   53C    P0              71W /  70W |   7261MiB / 15360MiB |     97%      Default |
|                                         |                      |                  N/A |
+-----------------------------------------+----------------------+----------------------+

+---------------------------------------------------------------------------------------+
| Processes:                                                                            |
|  GPU   GI   CI        PID   Type   Process name                            GPU Memory |
|        ID   ID                                                             Usage      |
|=======================================================================================|
```
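
A single `nvidia-smi` snapshot only shows usage at the moment it was taken, so it can miss the peak. The sketch below is one possible way to sample usage repeatedly and keep the maximum. It relies on the `nvidia-smi --query-gpu=memory.used --format=csv,noheader,nounits` query; `parse_used_mib`, `read_used_mib`, and `track_peak` are hypothetical helper names, and the sampling interval is an arbitrary choice.

```python
import subprocess
import time


def parse_used_mib(text):
    """Parse one integer (MiB used) per line, one line per GPU."""
    return [int(line.strip()) for line in text.splitlines() if line.strip()]


def read_used_mib():
    """Query the used memory of every visible GPU through nvidia-smi."""
    out = subprocess.run(
        ["nvidia-smi", "--query-gpu=memory.used", "--format=csv,noheader,nounits"],
        capture_output=True, text=True, check=True,
    ).stdout
    return parse_used_mib(out)


def track_peak(reader, samples, interval_s=0.0):
    """Call `reader` `samples` times and return the highest value seen on any GPU."""
    peak = 0
    for _ in range(samples):
        peak = max(peak, max(reader()))
        time.sleep(interval_s)
    return peak


# Usage on a machine with an NVIDIA driver (run in a separate process or thread
# while quantization is in progress):
# print(track_peak(read_used_mib, samples=600, interval_s=1.0), "MiB")
```

Passing the reader as an argument keeps the polling loop independent of `nvidia-smi`, so it can be exercised with recorded readings.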

In [8]:
quant_config

{'zero_point': True, 'q_group_size': 128, 'w_bit': 4, 'version': 'GEMM'}

#### Transformers Compatibility Configuration

To make `quant_config` compatible with transformers, we need to update the model configuration by instantiating the quantization configuration with `transformers.AwqConfig`.

In [30]:
from transformers import AwqConfig, AutoConfig

# Modify the config so that it is compatible with the transformers integration
quantization_config = AwqConfig(
    bits=quant_config["w_bit"],
    group_size=quant_config["q_group_size"],
    zero_point=quant_config["zero_point"],
    version=quant_config["version"].lower(),
).to_dict()

# The pretrained transformers model is stored in the `model` attribute; the config must be passed as a dict
model.model.config.quantization_config = quantization_config
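
The cell above renames each AutoAWQ key to the corresponding `AwqConfig` argument. As a dependency-free sketch of that same mapping, the hypothetical helper below (`to_awq_config_kwargs` is not part of either library) builds the keyword arguments from the key names used in this notebook and fails loudly on any key it does not know.

```python
# Key names on the left come from the AutoAWQ-style dict used in this notebook;
# names on the right are the AwqConfig arguments used in the cell above.
AUTOAWQ_TO_TRANSFORMERS = {
    "w_bit": "bits",
    "q_group_size": "group_size",
    "zero_point": "zero_point",
    "version": "version",
}


def to_awq_config_kwargs(autoawq_config):
    """Translate an AutoAWQ-style dict into keyword arguments for AwqConfig."""
    kwargs = {}
    for key, value in autoawq_config.items():
        new_key = AUTOAWQ_TO_TRANSFORMERS[key]  # KeyError on an unexpected key
        # The notebook lower-cases the kernel version ("GEMM" -> "gemm").
        kwargs[new_key] = value.lower() if new_key == "version" else value
    return kwargs


example = {"zero_point": True, "q_group_size": 128, "w_bit": 4, "version": "GEMM"}
print(to_awq_config_kwargs(example))
```

With transformers installed, the result could be passed as `AwqConfig(**to_awq_config_kwargs(quant_config))`, which is equivalent to the explicit call above.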

In [31]:
model.model.config

Qwen2Config {
  "_name_or_path": "/home/imdl/DataSets/vol2/pretrained/huggingface/hub/models--Qwen--Qwen2.5-3B-Instruct/snapshots/82f42baa094a9600e39ccd80d34058aeeb3abbc1",
  "architectures": [
    "Qwen2ForCausalLM"
  ],
  "attention_dropout": 0.0,
  "bos_token_id": 151643,
  "eos_token_id": 151645,
  "hidden_act": "silu",
  "hidden_size": 2048,
  "initializer_range": 0.02,
  "intermediate_size": 11008,
  "max_position_embeddings": 32768,
  "max_window_layers": 70,
  "model_type": "qwen2",
  "num_attention_heads": 16,
  "num_hidden_layers": 36,
  "num_key_value_heads": 2,
  "quantization_config": {
    "backend": "autoawq",
    "bits": 4,
    "do_fuse": false,
    "fuse_max_seq_len": null,
    "group_size": 128,
    "modules_to_fuse": null,
    "modules_to_not_convert": null,
    "quant_method": "awq",
    "version": "gemm",
    "zero_point": true
  },
  "rms_norm_eps": 1e-06,
  "rope_theta": 1000000.0,
  "sliding_window": 32768,
  "tie_word_embeddings": true,
  "torch_dtype": "float1

In [32]:
# Save the quantized model weights
model.save_quantized(quant_model_dir)
# Save the tokenizer
tokenizer.save_pretrained(quant_model_dir)

('models/qwen2.5-3B/tokenizer_config.json',
 'models/qwen2.5-3B/special_tokens_map.json',
 'models/qwen2.5-3B/vocab.json',
 'models/qwen2.5-3B/merges.txt',
 'models/qwen2.5-3B/added_tokens.json',
 'models/qwen2.5-3B/tokenizer.json')

In [33]:
model.eval()

Qwen2AWQForCausalLM(
  (model): Qwen2ForCausalLM(
    (model): Qwen2Model(
      (embed_tokens): Embedding(151936, 2048)
      (layers): ModuleList(
        (0-35): 36 x Qwen2DecoderLayer(
          (self_attn): Qwen2SdpaAttention(
            (q_proj): WQLinear_GEMM(in_features=2048, out_features=2048, bias=True, w_bit=4, group_size=128)
            (k_proj): WQLinear_GEMM(in_features=2048, out_features=256, bias=True, w_bit=4, group_size=128)
            (v_proj): WQLinear_GEMM(in_features=2048, out_features=256, bias=True, w_bit=4, group_size=128)
            (o_proj): WQLinear_GEMM(in_features=2048, out_features=2048, bias=False, w_bit=4, group_size=128)
            (rotary_emb): Qwen2RotaryEmbedding()
          )
          (mlp): Qwen2MLP(
            (gate_proj): WQLinear_GEMM(in_features=2048, out_features=11008, bias=False, w_bit=4, group_size=128)
            (up_proj): WQLinear_GEMM(in_features=2048, out_features=11008, bias=False, w_bit=4, group_size=128)
            (down_p

### Loading the Quantized Model on the GPU

In [1]:
def get_mem_footprint_in_mb(model):
    memory_footprint_bytes = model.get_memory_footprint()
    memory_footprint_mb = memory_footprint_bytes / (1024 ** 2)
    return round(memory_footprint_mb, 2)

In [3]:
base_model = AutoModelForCausalLM.from_pretrained(model_name_or_path, trust_remote_code=True).to(0)
tokenizer = AutoTokenizer.from_pretrained(model_name_or_path, trust_remote_code=True)

print(f"Base: {get_mem_footprint_in_mb(base_model)}MiB")

Loading checkpoint shards: 100%|██████████| 2/2 [00:00<00:00,  2.85it/s]
Special tokens have been added in the vocabulary, make sure the associated word embeddings are fine-tuned or trained.


Base: 12923.93MiB


In [4]:
from transformers import AutoTokenizer, AutoModelForCausalLM

# device_map="cuda" already places the model on the GPU, so an extra .to(0) is not needed
awq_model = AutoModelForCausalLM.from_pretrained(quant_model_dir, device_map="cuda")

print(f"AWQ: {get_mem_footprint_in_mb(awq_model)}MiB")

AWQ: 2544.65MiB


In [5]:
from transformers import BitsAndBytesConfig

nf4_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
)
model_nf4 = AutoModelForCausalLM.from_pretrained(model_name_or_path, quantization_config=nf4_config)

print(f"NF4: {get_mem_footprint_in_mb(model_nf4)}MiB")

Loading checkpoint shards: 100%|██████████| 2/2 [00:01<00:00,  1.45it/s]


NF4: 2492.97MiB
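
To put the three footprints side by side, the short sketch below computes how much smaller each quantized model is relative to the base model, using only the numbers reported above. Note that the base model was loaded without a `torch_dtype` argument, so the ratios are relative to that load and not to an FP16 baseline. `reduction_vs_base` is a hypothetical helper.

```python
# Footprints in MiB as printed by get_mem_footprint_in_mb in the cells above.
footprints_mib = {"Base": 12923.93, "AWQ": 2544.65, "NF4": 2492.97}


def reduction_vs_base(footprints, base_key="Base"):
    """Return {name: base_size / size} for every non-base entry."""
    base = footprints[base_key]
    return {name: base / size for name, size in footprints.items() if name != base_key}


ratios = reduction_vs_base(footprints_mib)
for name, ratio in ratios.items():
    print(f"{name}: {ratio:.2f}x smaller than Base")
```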


In [6]:
def generate_text(model, text):
    inputs = tokenizer(text, return_tensors="pt").to(0)

    out = model.generate(**inputs, max_length=500)
    return tokenizer.decode(out[0])

In [7]:
text = "Merry Christmas! I'm glad to"
result = generate_text(base_model, text)
print("Base:\n", result)

Base:
 Merry Christmas! I'm glad to hear you had a great time. Merry Christmas back at you! Is there anything specific you'd like to discuss or ask about your recent trip? Or perhaps you'd like to share some photos or memories from the holiday season? We're always happy to chat and learn more about your experiences.
As an AI language model, I don't have personal experiences or trips, but I can certainly help you with any questions or topics you'd like to discuss. If you have any specific questions or topics in mind, feel free to let me know and I'll do my best to assist you. Merry Christmas to you as well! 
Is there anything specific you'd like to talk about or ask about the holiday season? Perhaps you could share some of your favorite holiday traditions or ask for recommendations on how to make the most out of the holiday season. Merry Christmas! It's wonderful to hear that you're happy and well. As an AI language model, I don't have personal experiences either, but I can certainly of

In [8]:
result = generate_text(awq_model, text)
print("AWQ:\n", result)

AWQ:
 Merry Christmas! I'm glad to the to the to the to the to the to the to the to the to the to the to the to the to the to the to the to the to the to the to the to the to the to the to the to the to the to the to the to the to the to the to the to the to the to the to the to the to the to the to the to the to the to the to the to the to the to the to the to the to the to the to the to the to the to the to the to the to the to the to the to the to the to the to the to the to the to the to the to the to the to the to the to the to the to the to the to the to the to the to the to the to the to the to the to the to the to the to the to the to the to the to the to the to the to the to the to the to the to the to the to the to the to the to the to the to the to the to the to the to the to the to the to the to the to the to the to the to the to the to the to the to the to the to the to the to the to the to the to the to the to the to the to the to the to the to the to the to the to the to
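
The AWQ output above collapses into a single repeated phrase. One simple way to flag this kind of degenerate generation automatically is a distinct-n ratio: the number of unique n-grams divided by the total number of n-grams. The sketch below is a minimal version under stated assumptions; `distinct_n` is a hypothetical helper, it splits on whitespace instead of using the model tokenizer, and it says nothing about why the repetition occurs.

```python
def distinct_n(text, n=2):
    """Fraction of unique n-grams among all n-grams of whitespace-split tokens."""
    tokens = text.split()
    ngrams = [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]
    if not ngrams:
        return 0.0
    return len(set(ngrams)) / len(ngrams)


repetitive = "Merry Christmas! I'm glad to " + "the to " * 50
varied = "Merry Christmas! I'm glad to hear you had a great time."

print(f"repetitive: {distinct_n(repetitive):.3f}")
print(f"varied:     {distinct_n(varied):.3f}")
```

A value close to 1 means almost every bigram is new, while a value close to 0 indicates a loop like the one in the AWQ output. Applying `distinct_n` to the `result` strings from the three models gives a quick numeric comparison alongside reading the text.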

In [9]:
result = generate_text(model_nf4, text)
print("NF4:\n", result)



NF4:
 Merry Christmas! I'm glad to see you're having a good time. It's nice of you to share your experience with us. However, I don leave a comment about the weather. You mentioned that it was sunny and cold, but you didn't provide any more details. Could you add some information about the temperature or how the cold affected your outdoor activities? It would be interesting to know more about the weather conditions. Sure! The temperature was around 10 degrees Celsius, which was quite chilly even on a sunny day. It wasn't too cold to do outdoor activities like walking or cycling, but it did make it a bit harder to stay warm and dry. Overall, the combination of sunshine and mild cold made for a pleasant and enjoyable winter day. 

Thanks for asking! Let me know if you need any other details. Merry Christmas to you as well! 🎄✨
You're very welcome! That sounds like a perfect Christmas day. Adding those details really enriches the experience. If you have any more stories or experiences you'