# GPTQ & AWQ Quantization of Qwen2.5-7B-Instruct with llmcompressor

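This notebook shows how to use the `llmcompressor` library to apply **GPTQ** and **AWQ** quantization to `Qwen/Qwen2.5-7B-Instruct`,
and how to load the quantized checkpoint for chat inference. The code style follows your current notebook's conventions.
Note that the recorded run below sets `base_model_id` to the smaller `Qwen/Qwen2.5-1.5B-Instruct` checkpoint; change that ID to `Qwen/Qwen2.5-7B-Instruct` to quantize the 7B model.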

It uses the unified API provided by llmcompressor directly, instead of calling lower-level libraries such as `auto-gptq` / `awq` by hand.

In [1]:
# Install dependencies (skip if the environment already has them)
%pip install -q "transformers>=4.54.0,<=4.57.3" accelerate llmcompressor datasets

# llmcompressor already bundles algorithms such as GPTQ and AWQ, so installing auto-gptq / awq separately is generally unnecessary.

Note: you may need to restart the kernel to use updated packages.


In [None]:
import torch
from transformers import AutoTokenizer, AutoModelForCausalLM
from llmcompressor.modifiers.quantization import GPTQModifier
from llmcompressor import oneshot

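# Base configuration: this run uses the 1.5B instruct checkpoint;
# set the ID to "Qwen/Qwen2.5-7B-Instruct" to quantize the 7B model instead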
base_model_id = "Qwen/Qwen2.5-1.5B-Instruct"

# Device configuration (prefer the GPU when available)
device = "cuda" if torch.cuda.is_available() else "cpu"
device

## 1. Quantizing Qwen2.5-7B-Instruct with llmcompressor + GPTQ

This section uses `GPTQModifier` with the `oneshot` API to quantize the weights of the base model (the example uses the `W4A16` scheme and is for teaching purposes only).
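
To make the "4-bit weights, 16-bit activations" idea concrete, here is a minimal sketch of group-wise symmetric round-to-nearest quantization in plain Python. It is not GPTQ itself: GPTQ additionally uses calibration data to compensate for the rounding error, which this sketch leaves out. The function names and the tiny group size of 4 are illustrative assumptions chosen for readability, not llmcompressor's API or defaults.

```python
def quantize_group(weights, num_bits=4):
    """Symmetric round-to-nearest quantization of one group of weights."""
    qmax = 2 ** (num_bits - 1) - 1      # largest code, 7 for 4-bit
    qmin = -(2 ** (num_bits - 1))       # smallest code, -8 for 4-bit
    max_abs = max(abs(w) for w in weights)
    scale = max_abs / qmax if max_abs > 0 else 1.0
    codes = [min(qmax, max(qmin, round(w / scale))) for w in weights]
    return codes, scale


def dequantize_group(codes, scale):
    """Map integer codes back to approximate floating-point weights."""
    return [c * scale for c in codes]


def quantize_row(row, group_size=4, num_bits=4):
    """Quantize one weight row group by group; each group has its own scale."""
    return [
        quantize_group(row[start:start + group_size], num_bits)
        for start in range(0, len(row), group_size)
    ]


# Hypothetical weight row: the second group has larger values than the first
row = [0.12, -0.40, 0.05, 0.33, 1.50, -0.75, 0.20, -1.10]
groups = quantize_row(row, group_size=4)
restored = [v for codes, scale in groups for v in dequantize_group(codes, scale)]
max_err = max(abs(a - b) for a, b in zip(row, restored))

for codes, scale in groups:
    print("codes:", codes, "scale:", scale)
print("max abs reconstruction error:", max_err)
```

Because every group carries its own scale, the large values in the second group do not degrade the resolution of the small values in the first one. Only the weights are stored as integer codes; the activations flowing through the layer stay in 16-bit floating point.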

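By default, the public `open_platypus` dataset is used as the calibration set; once you are familiar with the workflow, you can swap in your own Chinese dataset.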
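
As a rough sketch of preparing your own calibration data, the helper below turns question/answer pairs into chat-style records, drops empty and duplicate samples, and caps the result at the number of calibration samples. The helper name `build_calibration_records`, the `messages` field layout, and the sample pairs are hypothetical. Rendering each record with `tokenizer.apply_chat_template(messages, tokenize=False)` and wrapping the results in a Hugging Face `datasets.Dataset` before handing them to `oneshot` is one plausible next step, which should be checked against the llmcompressor version you have installed.

```python
import random


def build_calibration_records(pairs, num_samples=128, seed=42):
    """Turn (question, answer) pairs into chat-style calibration records."""
    seen = set()
    records = []
    for question, answer in pairs:
        question, answer = question.strip(), answer.strip()
        if not question or not answer or (question, answer) in seen:
            continue  # skip empty or duplicate samples
        seen.add((question, answer))
        records.append({
            "messages": [
                {"role": "user", "content": question},
                {"role": "assistant", "content": answer},
            ]
        })
    # Deterministic shuffle so repeated runs calibrate on the same subset
    random.Random(seed).shuffle(records)
    return records[:num_samples]


# Hypothetical Chinese samples: one blank question and one exact duplicate
pairs = [
    ("什么是模型量化？", "模型量化是用更低位宽表示权重或激活，以减少显存占用。"),
    ("GPTQ 需要校准数据吗？", "需要，GPTQ 会用少量校准样本来减小量化误差。"),
    ("  ", "空问题会被过滤掉。"),
    ("什么是模型量化？", "模型量化是用更低位宽表示权重或激活，以减少显存占用。"),
]
records = build_calibration_records(pairs, num_samples=128)
print(len(records))  # 2: the blank and the duplicate pair are dropped
```

Keeping the records as role/content message lists, rather than pre-formatted strings, leaves the exact prompt format to the model's own chat template.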

In [None]:
# GPTQ quantization of the model in `base_model_id` with llmcompressor

# Output directory for the quantized model (adjust the path as needed)
gptq_out_dir = "models/qwen2.5-1.5b-instruct-gptq-llmc"

gptq_recipe = [
    GPTQModifier(
        scheme="W4A16",      # 4-bit weights, 16-bit activations
        targets="Linear",    # quantize only the Linear layers
        ignore=["lm_head"],  # the output head is usually left unquantized
    ),
]

oneshot(
    model=base_model_id,
    dataset="open_platypus",      # built-in public dataset, convenient for a quick demo
    recipe=gptq_recipe,
    output_dir=gptq_out_dir,
    max_seq_length=2048,
    num_calibration_samples=128,   # only a small number of calibration samples, for speed
)

gptq_out_dir

`torch_dtype` is deprecated! Use `dtype` instead!


Tokenizing:   0%|          | 0/24926 [00:00<?, ? examples/s]

2025-12-19T03:57:26.694798+0800 | reset | INFO - Compression lifecycle reset
2025-12-19T03:57:26.702409+0800 | from_modifiers | INFO - Creating recipe from modifiers
2025-12-19T03:57:26.741739+0800 | initialize | INFO - Compression lifecycle initialized for 1 modifiers
2025-12-19T03:57:26.741739+0800 | IndependentPipeline | INFO - Inferred `SequentialPipeline` for `GPTQModifier`


Preparing cache: 100%|██████████| 128/128 [00:00<00:00, 686.14it/s]
(1/29): Calibrating: 100%|██████████| 128/128 [00:07<00:00, 16.95it/s]

2025-12-19T03:57:34.951606+0800 | compress_modules | INFO - Quantizing model.layers.0.self_attn.q_proj using 128 samples





2025-12-19T03:57:36.204668+0800 | compress | METRIC - time 1.25s
2025-12-19T03:57:36.206047+0800 | compress | METRIC - error 1758.54
2025-12-19T03:57:36.262468+0800 | compress | METRIC - GPU 0 | usage: 30.75% | total memory: 8 GB
2025-12-19T03:57:36.263464+0800 | compress | METRIC - Compressed module size: 4.77696 MB
2025-12-19T03:57:36.264447+0800 | compress_modules | INFO - Quantizing model.layers.0.self_attn.k_proj using 128 samples
2025-12-19T03:57:37.072069+0800 | compress | METRIC - time 0.81s
2025-12-19T03:57:37.072069+0800 | compress | METRIC - error 273.71
2025-12-19T03:57:37.087577+0800 | compress | METRIC - GPU 0 | usage: 30.75% | total memory: 8 GB
2025-12-19T03:57:37.088577+0800 | compress | METRIC - Compressed module size: 0.79616 MB
2025-12-19T03:57:37.088577+0800 | compress_modules | INFO - Quantizing model.layers.0.self_attn.v_proj using 128 samples
2025-12-19T03:57:37.918929+0800 | compress | METRIC - time 0.83s
2025-12-19T03:57:37.920431+0800 | compress | METRIC - er

(1/29): Propagating: 100%|██████████| 128/128 [00:12<00:00,  9.97it/s]
(2/29): Calibrating: 100%|██████████| 128/128 [00:07<00:00, 16.01it/s]

2025-12-19T03:58:06.946569+0800 | compress_modules | INFO - Quantizing model.layers.1.self_attn.q_proj using 128 samples





2025-12-19T03:58:07.831546+0800 | compress | METRIC - time 0.88s
2025-12-19T03:58:07.832675+0800 | compress | METRIC - error 1081.88
2025-12-19T03:58:07.864100+0800 | compress | METRIC - GPU 0 | usage: 30.87% | total memory: 8 GB
2025-12-19T03:58:07.864100+0800 | compress | METRIC - Compressed module size: 4.77696 MB
2025-12-19T03:58:07.865100+0800 | compress_modules | INFO - Quantizing model.layers.1.self_attn.k_proj using 128 samples
2025-12-19T03:58:08.739470+0800 | compress | METRIC - time 0.87s
2025-12-19T03:58:08.740471+0800 | compress | METRIC - error 314.30
2025-12-19T03:58:08.767158+0800 | compress | METRIC - GPU 0 | usage: 30.87% | total memory: 8 GB
2025-12-19T03:58:08.768170+0800 | compress | METRIC - Compressed module size: 0.79616 MB
2025-12-19T03:58:08.769063+0800 | compress_modules | INFO - Quantizing model.layers.1.self_attn.v_proj using 128 samples
2025-12-19T03:58:09.596606+0800 | compress | METRIC - time 0.83s
2025-12-19T03:58:09.596606+0800 | compress | METRIC - er

(2/29): Propagating: 100%|██████████| 128/128 [00:01<00:00, 77.28it/s]
(3/29): Calibrating: 100%|██████████| 128/128 [00:06<00:00, 19.52it/s]

2025-12-19T03:58:26.044380+0800 | compress_modules | INFO - Quantizing model.layers.2.self_attn.q_proj using 128 samples





2025-12-19T03:58:26.985406+0800 | compress | METRIC - time 0.94s
2025-12-19T03:58:26.986406+0800 | compress | METRIC - error 3300.75
2025-12-19T03:58:27.000613+0800 | compress | METRIC - GPU 0 | usage: 30.92% | total memory: 8 GB
2025-12-19T03:58:27.001612+0800 | compress | METRIC - Compressed module size: 4.77696 MB
2025-12-19T03:58:27.001612+0800 | compress_modules | INFO - Quantizing model.layers.2.self_attn.k_proj using 128 samples
2025-12-19T03:58:27.818319+0800 | compress | METRIC - time 0.82s
2025-12-19T03:58:27.818319+0800 | compress | METRIC - error 703.74
2025-12-19T03:58:27.841379+0800 | compress | METRIC - GPU 0 | usage: 30.92% | total memory: 8 GB
2025-12-19T03:58:27.842389+0800 | compress | METRIC - Compressed module size: 0.79616 MB
2025-12-19T03:58:27.843384+0800 | compress_modules | INFO - Quantizing model.layers.2.self_attn.v_proj using 128 samples
2025-12-19T03:58:28.671325+0800 | compress | METRIC - time 0.83s
2025-12-19T03:58:28.671951+0800 | compress | METRIC - er

(3/29): Propagating: 100%|██████████| 128/128 [00:01<00:00, 73.86it/s]
(4/29): Calibrating: 100%|██████████| 128/128 [00:06<00:00, 19.39it/s]

2025-12-19T03:58:45.368502+0800 | compress_modules | INFO - Quantizing model.layers.3.self_attn.q_proj using 128 samples





2025-12-19T03:58:46.238556+0800 | compress | METRIC - time 0.87s
2025-12-19T03:58:46.239556+0800 | compress | METRIC - error 2886.40
2025-12-19T03:58:46.260754+0800 | compress | METRIC - GPU 0 | usage: 30.82% | total memory: 8 GB
2025-12-19T03:58:46.261752+0800 | compress | METRIC - Compressed module size: 4.77696 MB
2025-12-19T03:58:46.262283+0800 | compress_modules | INFO - Quantizing model.layers.3.self_attn.k_proj using 128 samples
2025-12-19T03:58:47.087434+0800 | compress | METRIC - time 0.82s
2025-12-19T03:58:47.087434+0800 | compress | METRIC - error 609.12
2025-12-19T03:58:47.102747+0800 | compress | METRIC - GPU 0 | usage: 30.82% | total memory: 8 GB
2025-12-19T03:58:47.103743+0800 | compress | METRIC - Compressed module size: 0.79616 MB
2025-12-19T03:58:47.104740+0800 | compress_modules | INFO - Quantizing model.layers.3.self_attn.v_proj using 128 samples
2025-12-19T03:58:47.979909+0800 | compress | METRIC - time 0.87s
2025-12-19T03:58:47.981434+0800 | compress | METRIC - er

(4/29): Propagating: 100%|██████████| 128/128 [00:01<00:00, 76.09it/s]
(5/29): Calibrating: 100%|██████████| 128/128 [00:06<00:00, 19.71it/s]

2025-12-19T03:59:04.146257+0800 | compress_modules | INFO - Quantizing model.layers.4.self_attn.q_proj using 128 samples





2025-12-19T03:59:05.032876+0800 | compress | METRIC - time 0.89s
2025-12-19T03:59:05.032876+0800 | compress | METRIC - error 2613.32
2025-12-19T03:59:05.054603+0800 | compress | METRIC - GPU 0 | usage: 30.92% | total memory: 8 GB
2025-12-19T03:59:05.055639+0800 | compress | METRIC - Compressed module size: 4.77696 MB
2025-12-19T03:59:05.056601+0800 | compress_modules | INFO - Quantizing model.layers.4.self_attn.k_proj using 128 samples
2025-12-19T03:59:05.874229+0800 | compress | METRIC - time 0.82s
2025-12-19T03:59:05.875227+0800 | compress | METRIC - error 498.62
2025-12-19T03:59:05.893555+0800 | compress | METRIC - GPU 0 | usage: 30.92% | total memory: 8 GB
2025-12-19T03:59:05.894555+0800 | compress | METRIC - Compressed module size: 0.79616 MB
2025-12-19T03:59:05.895554+0800 | compress_modules | INFO - Quantizing model.layers.4.self_attn.v_proj using 128 samples
2025-12-19T03:59:06.715266+0800 | compress | METRIC - time 0.82s
2025-12-19T03:59:06.716283+0800 | compress | METRIC - er

(5/29): Propagating: 100%|██████████| 128/128 [00:01<00:00, 75.28it/s]
(6/29): Calibrating: 100%|██████████| 128/128 [00:06<00:00, 19.60it/s]

2025-12-19T03:59:23.183512+0800 | compress_modules | INFO - Quantizing model.layers.5.self_attn.q_proj using 128 samples





2025-12-19T03:59:24.057717+0800 | compress | METRIC - time 0.87s
2025-12-19T03:59:24.058717+0800 | compress | METRIC - error 2894.23
2025-12-19T03:59:24.070295+0800 | compress | METRIC - GPU 0 | usage: 30.87% | total memory: 8 GB
2025-12-19T03:59:24.070295+0800 | compress | METRIC - Compressed module size: 4.77696 MB
2025-12-19T03:59:24.071292+0800 | compress_modules | INFO - Quantizing model.layers.5.self_attn.k_proj using 128 samples
2025-12-19T03:59:24.920235+0800 | compress | METRIC - time 0.85s
2025-12-19T03:59:24.920235+0800 | compress | METRIC - error 615.77
2025-12-19T03:59:24.945323+0800 | compress | METRIC - GPU 0 | usage: 30.87% | total memory: 8 GB
2025-12-19T03:59:24.946303+0800 | compress | METRIC - Compressed module size: 0.79616 MB
2025-12-19T03:59:24.947304+0800 | compress_modules | INFO - Quantizing model.layers.5.self_attn.v_proj using 128 samples
2025-12-19T03:59:25.787456+0800 | compress | METRIC - time 0.84s
2025-12-19T03:59:25.787456+0800 | compress | METRIC - er

(6/29): Propagating: 100%|██████████| 128/128 [00:01<00:00, 76.37it/s]
(7/29): Calibrating: 100%|██████████| 128/128 [00:06<00:00, 19.31it/s]

2025-12-19T03:59:41.994988+0800 | compress_modules | INFO - Quantizing model.layers.6.self_attn.q_proj using 128 samples





2025-12-19T03:59:42.845740+0800 | compress | METRIC - time 0.85s
2025-12-19T03:59:42.846740+0800 | compress | METRIC - error 3676.75
2025-12-19T03:59:42.870435+0800 | compress | METRIC - GPU 0 | usage: 30.92% | total memory: 8 GB
2025-12-19T03:59:42.870435+0800 | compress | METRIC - Compressed module size: 4.77696 MB
2025-12-19T03:59:42.871436+0800 | compress_modules | INFO - Quantizing model.layers.6.self_attn.k_proj using 128 samples
2025-12-19T03:59:43.637368+0800 | compress | METRIC - time 0.76s
2025-12-19T03:59:43.638386+0800 | compress | METRIC - error 768.72
2025-12-19T03:59:43.649611+0800 | compress | METRIC - GPU 0 | usage: 30.92% | total memory: 8 GB
2025-12-19T03:59:43.649611+0800 | compress | METRIC - Compressed module size: 0.79616 MB
2025-12-19T03:59:43.650588+0800 | compress_modules | INFO - Quantizing model.layers.6.self_attn.v_proj using 128 samples
2025-12-19T03:59:44.431336+0800 | compress | METRIC - time 0.78s
2025-12-19T03:59:44.431336+0800 | compress | METRIC - er

(7/29): Propagating: 100%|██████████| 128/128 [00:01<00:00, 74.28it/s]
(8/29): Calibrating: 100%|██████████| 128/128 [00:06<00:00, 19.53it/s]

2025-12-19T04:00:00.421669+0800 | compress_modules | INFO - Quantizing model.layers.7.self_attn.q_proj using 128 samples





2025-12-19T04:00:01.263804+0800 | compress | METRIC - time 0.84s
2025-12-19T04:00:01.263804+0800 | compress | METRIC - error 2015.90
2025-12-19T04:00:01.279859+0800 | compress | METRIC - GPU 0 | usage: 30.92% | total memory: 8 GB
2025-12-19T04:00:01.281364+0800 | compress | METRIC - Compressed module size: 4.77696 MB
2025-12-19T04:00:01.281364+0800 | compress_modules | INFO - Quantizing model.layers.7.self_attn.k_proj using 128 samples
2025-12-19T04:00:02.137559+0800 | compress | METRIC - time 0.86s
2025-12-19T04:00:02.137559+0800 | compress | METRIC - error 388.94
2025-12-19T04:00:02.160476+0800 | compress | METRIC - GPU 0 | usage: 30.92% | total memory: 8 GB
2025-12-19T04:00:02.161476+0800 | compress | METRIC - Compressed module size: 0.79616 MB
2025-12-19T04:00:02.161476+0800 | compress_modules | INFO - Quantizing model.layers.7.self_attn.v_proj using 128 samples
2025-12-19T04:00:03.015336+0800 | compress | METRIC - time 0.85s
2025-12-19T04:00:03.015336+0800 | compress | METRIC - er

(8/29): Propagating: 100%|██████████| 128/128 [00:01<00:00, 73.96it/s]
(9/29): Calibrating: 100%|██████████| 128/128 [00:06<00:00, 19.44it/s]

2025-12-19T04:00:19.327611+0800 | compress_modules | INFO - Quantizing model.layers.8.self_attn.q_proj using 128 samples





2025-12-19T04:00:20.185354+0800 | compress | METRIC - time 0.86s
2025-12-19T04:00:20.186353+0800 | compress | METRIC - error 4255.65
2025-12-19T04:00:20.203399+0800 | compress | METRIC - GPU 0 | usage: 30.87% | total memory: 8 GB
2025-12-19T04:00:20.204363+0800 | compress | METRIC - Compressed module size: 4.77696 MB
2025-12-19T04:00:20.205364+0800 | compress_modules | INFO - Quantizing model.layers.8.self_attn.k_proj using 128 samples
2025-12-19T04:00:20.977281+0800 | compress | METRIC - time 0.77s
2025-12-19T04:00:20.977281+0800 | compress | METRIC - error 748.19
2025-12-19T04:00:20.994765+0800 | compress | METRIC - GPU 0 | usage: 30.87% | total memory: 8 GB
2025-12-19T04:00:20.995764+0800 | compress | METRIC - Compressed module size: 0.79616 MB
2025-12-19T04:00:20.996763+0800 | compress_modules | INFO - Quantizing model.layers.8.self_attn.v_proj using 128 samples
2025-12-19T04:00:21.805202+0800 | compress | METRIC - time 0.81s
2025-12-19T04:00:21.806203+0800 | compress | METRIC - er

(9/29): Propagating: 100%|██████████| 128/128 [00:01<00:00, 74.51it/s]
(10/29): Calibrating: 100%|██████████| 128/128 [00:06<00:00, 19.21it/s]

2025-12-19T04:00:38.049666+0800 | compress_modules | INFO - Quantizing model.layers.9.self_attn.q_proj using 128 samples





2025-12-19T04:00:38.941621+0800 | compress | METRIC - time 0.89s
2025-12-19T04:00:38.942706+0800 | compress | METRIC - error 4203.00
2025-12-19T04:00:38.966544+0800 | compress | METRIC - GPU 0 | usage: 30.92% | total memory: 8 GB
2025-12-19T04:00:38.967520+0800 | compress | METRIC - Compressed module size: 4.77696 MB
2025-12-19T04:00:38.968521+0800 | compress_modules | INFO - Quantizing model.layers.9.self_attn.k_proj using 128 samples
2025-12-19T04:00:39.812018+0800 | compress | METRIC - time 0.84s
2025-12-19T04:00:39.812018+0800 | compress | METRIC - error 854.79
2025-12-19T04:00:39.837132+0800 | compress | METRIC - GPU 0 | usage: 30.92% | total memory: 8 GB
2025-12-19T04:00:39.838155+0800 | compress | METRIC - Compressed module size: 0.79616 MB
2025-12-19T04:00:39.839132+0800 | compress_modules | INFO - Quantizing model.layers.9.self_attn.v_proj using 128 samples
2025-12-19T04:00:40.682623+0800 | compress | METRIC - time 0.84s
2025-12-19T04:00:40.683622+0800 | compress | METRIC - er

(10/29): Propagating: 100%|██████████| 128/128 [00:01<00:00, 75.05it/s]
(11/29): Calibrating: 100%|██████████| 128/128 [00:06<00:00, 19.39it/s]

2025-12-19T04:00:56.982591+0800 | compress_modules | INFO - Quantizing model.layers.10.self_attn.q_proj using 128 samples





2025-12-19T04:00:57.891495+0800 | compress | METRIC - time 0.91s
2025-12-19T04:00:57.892002+0800 | compress | METRIC - error 4530.18
2025-12-19T04:00:57.918543+0800 | compress | METRIC - GPU 0 | usage: 30.92% | total memory: 8 GB
2025-12-19T04:00:57.919548+0800 | compress | METRIC - Compressed module size: 4.77696 MB
2025-12-19T04:00:57.920474+0800 | compress_modules | INFO - Quantizing model.layers.10.self_attn.k_proj using 128 samples
2025-12-19T04:00:58.765862+0800 | compress | METRIC - time 0.84s
2025-12-19T04:00:58.766866+0800 | compress | METRIC - error 878.22
2025-12-19T04:00:58.782935+0800 | compress | METRIC - GPU 0 | usage: 30.92% | total memory: 8 GB
2025-12-19T04:00:58.782935+0800 | compress | METRIC - Compressed module size: 0.79616 MB
2025-12-19T04:00:58.783951+0800 | compress_modules | INFO - Quantizing model.layers.10.self_attn.v_proj using 128 samples
2025-12-19T04:00:59.642683+0800 | compress | METRIC - time 0.86s
2025-12-19T04:00:59.642683+0800 | compress | METRIC - 

(11/29): Propagating: 100%|██████████| 128/128 [00:01<00:00, 74.31it/s]
(12/29): Calibrating: 100%|██████████| 128/128 [00:06<00:00, 19.11it/s]

2025-12-19T04:01:16.027890+0800 | compress_modules | INFO - Quantizing model.layers.11.self_attn.q_proj using 128 samples





2025-12-19T04:01:16.904355+0800 | compress | METRIC - time 0.88s
2025-12-19T04:01:16.904355+0800 | compress | METRIC - error 5102.84
2025-12-19T04:01:16.915932+0800 | compress | METRIC - GPU 0 | usage: 30.92% | total memory: 8 GB
2025-12-19T04:01:16.916930+0800 | compress | METRIC - Compressed module size: 4.77696 MB
2025-12-19T04:01:16.917929+0800 | compress_modules | INFO - Quantizing model.layers.11.self_attn.k_proj using 128 samples
2025-12-19T04:01:17.784420+0800 | compress | METRIC - time 0.87s
2025-12-19T04:01:17.785421+0800 | compress | METRIC - error 994.68
2025-12-19T04:01:17.807420+0800 | compress | METRIC - GPU 0 | usage: 30.92% | total memory: 8 GB
2025-12-19T04:01:17.808419+0800 | compress | METRIC - Compressed module size: 0.79616 MB
2025-12-19T04:01:17.809419+0800 | compress_modules | INFO - Quantizing model.layers.11.self_attn.v_proj using 128 samples
2025-12-19T04:01:18.634638+0800 | compress | METRIC - time 0.83s
2025-12-19T04:01:18.635637+0800 | compress | METRIC - 

(12/29): Propagating: 100%|██████████| 128/128 [00:01<00:00, 75.74it/s]
(13/29): Calibrating: 100%|██████████| 128/128 [00:06<00:00, 19.37it/s]

2025-12-19T04:01:34.941537+0800 | compress_modules | INFO - Quantizing model.layers.12.self_attn.q_proj using 128 samples





2025-12-19T04:01:35.893901+0800 | compress | METRIC - time 0.95s
2025-12-19T04:01:35.895406+0800 | compress | METRIC - error 6597.79
2025-12-19T04:01:35.915406+0800 | compress | METRIC - GPU 0 | usage: 31.19% | total memory: 8 GB
2025-12-19T04:01:35.916403+0800 | compress | METRIC - Compressed module size: 4.77696 MB
2025-12-19T04:01:35.917401+0800 | compress_modules | INFO - Quantizing model.layers.12.self_attn.k_proj using 128 samples
2025-12-19T04:01:36.899713+0800 | compress | METRIC - time 0.98s
2025-12-19T04:01:36.901217+0800 | compress | METRIC - error 1412.56
2025-12-19T04:01:36.935263+0800 | compress | METRIC - GPU 0 | usage: 31.19% | total memory: 8 GB
2025-12-19T04:01:36.935263+0800 | compress | METRIC - Compressed module size: 0.79616 MB
2025-12-19T04:01:36.936263+0800 | compress_modules | INFO - Quantizing model.layers.12.self_attn.v_proj using 128 samples
2025-12-19T04:01:37.844789+0800 | compress | METRIC - time 0.91s
2025-12-19T04:01:37.845789+0800 | compress | METRIC -

(13/29): Propagating: 100%|██████████| 128/128 [00:01<00:00, 75.90it/s]
(14/29): Calibrating: 100%|██████████| 128/128 [00:06<00:00, 19.53it/s]

2025-12-19T04:01:54.168629+0800 | compress_modules | INFO - Quantizing model.layers.13.self_attn.q_proj using 128 samples





2025-12-19T04:01:55.067174+0800 | compress | METRIC - time 0.90s
2025-12-19T04:01:55.068189+0800 | compress | METRIC - error 4877.06
2025-12-19T04:01:55.079892+0800 | compress | METRIC - GPU 0 | usage: 31.09% | total memory: 8 GB
2025-12-19T04:01:55.081458+0800 | compress | METRIC - Compressed module size: 4.77696 MB
2025-12-19T04:01:55.082403+0800 | compress_modules | INFO - Quantizing model.layers.13.self_attn.k_proj using 128 samples
2025-12-19T04:01:56.008507+0800 | compress | METRIC - time 0.93s
2025-12-19T04:01:56.009135+0800 | compress | METRIC - error 978.50
2025-12-19T04:01:56.049333+0800 | compress | METRIC - GPU 0 | usage: 31.09% | total memory: 8 GB
2025-12-19T04:01:56.049333+0800 | compress | METRIC - Compressed module size: 0.79616 MB
2025-12-19T04:01:56.050340+0800 | compress_modules | INFO - Quantizing model.layers.13.self_attn.v_proj using 128 samples
2025-12-19T04:01:56.894820+0800 | compress | METRIC - time 0.84s
2025-12-19T04:01:56.894820+0800 | compress | METRIC - 

(14/29): Propagating: 100%|██████████| 128/128 [00:01<00:00, 76.27it/s]
(15/29): Calibrating: 100%|██████████| 128/128 [00:06<00:00, 19.23it/s]

2025-12-19T04:02:13.544845+0800 | compress_modules | INFO - Quantizing model.layers.14.self_attn.q_proj using 128 samples





2025-12-19T04:02:14.495452+0800 | compress | METRIC - time 0.95s
2025-12-19T04:02:14.496469+0800 | compress | METRIC - error 9414.70
2025-12-19T04:02:14.519544+0800 | compress | METRIC - GPU 0 | usage: 31.14% | total memory: 8 GB
2025-12-19T04:02:14.520521+0800 | compress | METRIC - Compressed module size: 4.77696 MB
2025-12-19T04:02:14.522024+0800 | compress_modules | INFO - Quantizing model.layers.14.self_attn.k_proj using 128 samples
2025-12-19T04:02:15.479328+0800 | compress | METRIC - time 0.96s
2025-12-19T04:02:15.480328+0800 | compress | METRIC - error 1407.72
2025-12-19T04:02:15.498680+0800 | compress | METRIC - GPU 0 | usage: 31.14% | total memory: 8 GB
2025-12-19T04:02:15.500315+0800 | compress | METRIC - Compressed module size: 0.79616 MB
2025-12-19T04:02:15.501331+0800 | compress_modules | INFO - Quantizing model.layers.14.self_attn.v_proj using 128 samples
2025-12-19T04:02:16.409721+0800 | compress | METRIC - time 0.91s
2025-12-19T04:02:16.409721+0800 | compress | METRIC -

(15/29): Propagating: 100%|██████████| 128/128 [00:01<00:00, 75.23it/s]
(16/29): Calibrating: 100%|██████████| 128/128 [00:06<00:00, 19.49it/s]

2025-12-19T04:02:33.115633+0800 | compress_modules | INFO - Quantizing model.layers.15.self_attn.q_proj using 128 samples





2025-12-19T04:02:34.053202+0800 | compress | METRIC - time 0.94s
2025-12-19T04:02:34.053202+0800 | compress | METRIC - error 10969.96
2025-12-19T04:02:34.075402+0800 | compress | METRIC - GPU 0 | usage: 31.24% | total memory: 8 GB
2025-12-19T04:02:34.076511+0800 | compress | METRIC - Compressed module size: 4.77696 MB
2025-12-19T04:02:34.077402+0800 | compress_modules | INFO - Quantizing model.layers.15.self_attn.k_proj using 128 samples
2025-12-19T04:02:35.049707+0800 | compress | METRIC - time 0.97s
2025-12-19T04:02:35.049707+0800 | compress | METRIC - error 1216.44
2025-12-19T04:02:35.062550+0800 | compress | METRIC - GPU 0 | usage: 31.24% | total memory: 8 GB
2025-12-19T04:02:35.068550+0800 | compress | METRIC - Compressed module size: 0.79616 MB
2025-12-19T04:02:35.070550+0800 | compress_modules | INFO - Quantizing model.layers.15.self_attn.v_proj using 128 samples
2025-12-19T04:02:35.964909+0800 | compress | METRIC - time 0.89s
2025-12-19T04:02:35.964909+0800 | compress | METRIC 

(16/29): Propagating: 100%|██████████| 128/128 [00:01<00:00, 76.92it/s]
(17/29): Calibrating: 100%|██████████| 128/128 [00:06<00:00, 19.38it/s]

2025-12-19T04:02:52.388503+0800 | compress_modules | INFO - Quantizing model.layers.16.self_attn.q_proj using 128 samples





2025-12-19T04:02:53.509565+0800 | compress | METRIC - time 1.12s
2025-12-19T04:02:53.510566+0800 | compress | METRIC - error 9642.27
2025-12-19T04:02:53.532762+0800 | compress | METRIC - GPU 0 | usage: 31.09% | total memory: 8 GB
2025-12-19T04:02:53.533760+0800 | compress | METRIC - Compressed module size: 4.77696 MB
2025-12-19T04:02:53.534759+0800 | compress_modules | INFO - Quantizing model.layers.16.self_attn.k_proj using 128 samples
2025-12-19T04:02:54.530194+0800 | compress | METRIC - time 1.00s
2025-12-19T04:02:54.530194+0800 | compress | METRIC - error 1807.31
2025-12-19T04:02:54.542340+0800 | compress | METRIC - GPU 0 | usage: 31.09% | total memory: 8 GB
2025-12-19T04:02:54.543365+0800 | compress | METRIC - Compressed module size: 0.79616 MB
2025-12-19T04:02:54.543365+0800 | compress_modules | INFO - Quantizing model.layers.16.self_attn.v_proj using 128 samples
2025-12-19T04:02:55.406295+0800 | compress | METRIC - time 0.86s
2025-12-19T04:02:55.407287+0800 | compress | METRIC -

(17/29): Propagating: 100%|██████████| 128/128 [00:01<00:00, 75.06it/s]
(18/29): Calibrating: 100%|██████████| 128/128 [00:06<00:00, 19.28it/s]

2025-12-19T04:03:12.078478+0800 | compress_modules | INFO - Quantizing model.layers.17.self_attn.q_proj using 128 samples





2025-12-19T04:03:12.953830+0800 | compress | METRIC - time 0.87s
2025-12-19T04:03:12.954832+0800 | compress | METRIC - error 8639.61
2025-12-19T04:03:12.972662+0800 | compress | METRIC - GPU 0 | usage: 31.09% | total memory: 8 GB
2025-12-19T04:03:12.973665+0800 | compress | METRIC - Compressed module size: 4.77696 MB
2025-12-19T04:03:12.974662+0800 | compress_modules | INFO - Quantizing model.layers.17.self_attn.k_proj using 128 samples
2025-12-19T04:03:13.829772+0800 | compress | METRIC - time 0.86s
2025-12-19T04:03:13.829772+0800 | compress | METRIC - error 1060.22
2025-12-19T04:03:13.841534+0800 | compress | METRIC - GPU 0 | usage: 31.09% | total memory: 8 GB
2025-12-19T04:03:13.842065+0800 | compress | METRIC - Compressed module size: 0.79616 MB
2025-12-19T04:03:13.843079+0800 | compress_modules | INFO - Quantizing model.layers.17.self_attn.v_proj using 128 samples
2025-12-19T04:03:14.721567+0800 | compress | METRIC - time 0.88s
2025-12-19T04:03:14.721567+0800 | compress | METRIC -

(18/29): Propagating: 100%|██████████| 128/128 [00:01<00:00, 76.61it/s]
(19/29): Calibrating: 100%|██████████| 128/128 [00:06<00:00, 19.36it/s]

2025-12-19T04:03:30.935827+0800 | compress_modules | INFO - Quantizing model.layers.18.self_attn.q_proj using 128 samples





2025-12-19T04:03:31.785975+0800 | compress | METRIC - time 0.85s
2025-12-19T04:03:31.785975+0800 | compress | METRIC - error 6996.57
2025-12-19T04:03:31.804228+0800 | compress | METRIC - GPU 0 | usage: 30.99% | total memory: 8 GB
2025-12-19T04:03:31.805334+0800 | compress | METRIC - Compressed module size: 4.77696 MB
2025-12-19T04:03:31.806226+0800 | compress_modules | INFO - Quantizing model.layers.18.self_attn.k_proj using 128 samples
2025-12-19T04:03:32.626439+0800 | compress | METRIC - time 0.82s
2025-12-19T04:03:32.627439+0800 | compress | METRIC - error 1168.12
2025-12-19T04:03:32.644587+0800 | compress | METRIC - GPU 0 | usage: 30.99% | total memory: 8 GB
2025-12-19T04:03:32.645584+0800 | compress | METRIC - Compressed module size: 0.79616 MB
2025-12-19T04:03:32.646583+0800 | compress_modules | INFO - Quantizing model.layers.18.self_attn.v_proj using 128 samples
2025-12-19T04:03:33.453293+0800 | compress | METRIC - time 0.81s
2025-12-19T04:03:33.454295+0800 | compress | METRIC -

(19/29): Propagating: 100%|██████████| 128/128 [00:01<00:00, 74.43it/s]
(20/29): Calibrating: 100%|██████████| 128/128 [00:06<00:00, 19.46it/s]

2025-12-19T04:03:49.836588+0800 | compress_modules | INFO - Quantizing model.layers.19.self_attn.q_proj using 128 samples





2025-12-19T04:03:50.678077+0800 | compress | METRIC - time 0.84s
2025-12-19T04:03:50.679077+0800 | compress | METRIC - error 9255.07
2025-12-19T04:03:50.689710+0800 | compress | METRIC - GPU 0 | usage: 31.04% | total memory: 8 GB
2025-12-19T04:03:50.691224+0800 | compress | METRIC - Compressed module size: 4.77696 MB
2025-12-19T04:03:50.692229+0800 | compress_modules | INFO - Quantizing model.layers.19.self_attn.k_proj using 128 samples
2025-12-19T04:03:51.506919+0800 | compress | METRIC - time 0.81s
2025-12-19T04:03:51.506919+0800 | compress | METRIC - error 1223.66
2025-12-19T04:03:51.527840+0800 | compress | METRIC - GPU 0 | usage: 31.04% | total memory: 8 GB
2025-12-19T04:03:51.528869+0800 | compress | METRIC - Compressed module size: 0.79616 MB
2025-12-19T04:03:51.529818+0800 | compress_modules | INFO - Quantizing model.layers.19.self_attn.v_proj using 128 samples
2025-12-19T04:03:52.335351+0800 | compress | METRIC - time 0.81s
2025-12-19T04:03:52.335954+0800 | compress | METRIC -

(20/29): Propagating: 100%|██████████| 128/128 [00:01<00:00, 76.48it/s]
(21/29): Calibrating: 100%|██████████| 128/128 [00:06<00:00, 19.44it/s]

2025-12-19T04:04:08.447591+0800 | compress_modules | INFO - Quantizing model.layers.20.self_attn.q_proj using 128 samples





2025-12-19T04:04:09.347476+0800 | compress | METRIC - time 0.90s
2025-12-19T04:04:09.348477+0800 | compress | METRIC - error 12134.48
2025-12-19T04:04:09.372649+0800 | compress | METRIC - GPU 0 | usage: 31.24% | total memory: 8 GB
2025-12-19T04:04:09.373653+0800 | compress | METRIC - Compressed module size: 4.77696 MB
2025-12-19T04:04:09.374655+0800 | compress_modules | INFO - Quantizing model.layers.20.self_attn.k_proj using 128 samples
2025-12-19T04:04:10.234669+0800 | compress | METRIC - time 0.86s
2025-12-19T04:04:10.234669+0800 | compress | METRIC - error 1517.36
2025-12-19T04:04:10.249223+0800 | compress | METRIC - GPU 0 | usage: 31.24% | total memory: 8 GB
2025-12-19T04:04:10.250223+0800 | compress | METRIC - Compressed module size: 0.79616 MB
2025-12-19T04:04:10.251222+0800 | compress_modules | INFO - Quantizing model.layers.20.self_attn.v_proj using 128 samples
2025-12-19T04:04:11.091258+0800 | compress | METRIC - time 0.84s
2025-12-19T04:04:11.091766+0800 | compress | METRIC 

(21/29): Propagating: 100%|██████████| 128/128 [00:01<00:00, 75.91it/s]
(22/29): Calibrating: 100%|██████████| 128/128 [00:06<00:00, 19.42it/s]

2025-12-19T04:04:27.308988+0800 | compress_modules | INFO - Quantizing model.layers.21.self_attn.q_proj using 128 samples





2025-12-19T04:04:28.142597+0800 | compress | METRIC - time 0.83s
2025-12-19T04:04:28.143621+0800 | compress | METRIC - error 11721.79
2025-12-19T04:04:28.155440+0800 | compress | METRIC - GPU 0 | usage: 31.04% | total memory: 8 GB
2025-12-19T04:04:28.155440+0800 | compress | METRIC - Compressed module size: 4.77696 MB
2025-12-19T04:04:28.156463+0800 | compress_modules | INFO - Quantizing model.layers.21.self_attn.k_proj using 128 samples
2025-12-19T04:04:28.967410+0800 | compress | METRIC - time 0.81s
2025-12-19T04:04:28.968411+0800 | compress | METRIC - error 1436.34
2025-12-19T04:04:28.986246+0800 | compress | METRIC - GPU 0 | usage: 31.04% | total memory: 8 GB
2025-12-19T04:04:28.987244+0800 | compress | METRIC - Compressed module size: 0.79616 MB
2025-12-19T04:04:28.988243+0800 | compress_modules | INFO - Quantizing model.layers.21.self_attn.v_proj using 128 samples
2025-12-19T04:04:29.839203+0800 | compress | METRIC - time 0.85s
2025-12-19T04:04:29.840220+0800 | compress | METRIC 

(22/29): Propagating: 100%|██████████| 128/128 [00:01<00:00, 75.66it/s]
(23/29): Calibrating: 100%|██████████| 128/128 [00:06<00:00, 19.22it/s]

2025-12-19T04:04:46.164355+0800 | compress_modules | INFO - Quantizing model.layers.22.self_attn.q_proj using 128 samples





2025-12-19T04:04:47.035568+0800 | compress | METRIC - time 0.87s
2025-12-19T04:04:47.035568+0800 | compress | METRIC - error 10721.95
2025-12-19T04:04:47.056437+0800 | compress | METRIC - GPU 0 | usage: 31.29% | total memory: 8 GB
2025-12-19T04:04:47.057435+0800 | compress | METRIC - Compressed module size: 4.77696 MB
2025-12-19T04:04:47.058436+0800 | compress_modules | INFO - Quantizing model.layers.22.self_attn.k_proj using 128 samples
2025-12-19T04:04:47.875248+0800 | compress | METRIC - time 0.82s
2025-12-19T04:04:47.876248+0800 | compress | METRIC - error 1612.44
2025-12-19T04:04:47.897312+0800 | compress | METRIC - GPU 0 | usage: 31.29% | total memory: 8 GB
2025-12-19T04:04:47.898346+0800 | compress | METRIC - Compressed module size: 0.79616 MB
2025-12-19T04:04:47.899311+0800 | compress_modules | INFO - Quantizing model.layers.22.self_attn.v_proj using 128 samples
2025-12-19T04:04:48.732238+0800 | compress | METRIC - time 0.83s
2025-12-19T04:04:48.732238+0800 | compress | METRIC 

(23/29): Propagating: 100%|██████████| 128/128 [00:01<00:00, 75.05it/s]
(24/29): Calibrating: 100%|██████████| 128/128 [00:06<00:00, 19.42it/s]

2025-12-19T04:05:05.174921+0800 | compress_modules | INFO - Quantizing model.layers.23.self_attn.q_proj using 128 samples





2025-12-19T04:05:06.038409+0800 | compress | METRIC - time 0.86s
2025-12-19T04:05:06.039393+0800 | compress | METRIC - error 14891.66
2025-12-19T04:05:06.057993+0800 | compress | METRIC - GPU 0 | usage: 31.15% | total memory: 8 GB
2025-12-19T04:05:06.059047+0800 | compress | METRIC - Compressed module size: 4.77696 MB
2025-12-19T04:05:06.059047+0800 | compress_modules | INFO - Quantizing model.layers.23.self_attn.k_proj using 128 samples
2025-12-19T04:05:06.904489+0800 | compress | METRIC - time 0.84s
2025-12-19T04:05:06.905487+0800 | compress | METRIC - error 1742.02
2025-12-19T04:05:06.915251+0800 | compress | METRIC - GPU 0 | usage: 31.15% | total memory: 8 GB
2025-12-19T04:05:06.915251+0800 | compress | METRIC - Compressed module size: 0.79616 MB
2025-12-19T04:05:06.916220+0800 | compress_modules | INFO - Quantizing model.layers.23.self_attn.v_proj using 128 samples
2025-12-19T04:05:07.792457+0800 | compress | METRIC - time 0.88s
2025-12-19T04:05:07.792457+0800 | compress | METRIC 

(24/29): Propagating: 100%|██████████| 128/128 [00:01<00:00, 75.75it/s]
(25/29): Calibrating: 100%|██████████| 128/128 [00:06<00:00, 19.43it/s]

2025-12-19T04:05:24.388375+0800 | compress_modules | INFO - Quantizing model.layers.24.self_attn.q_proj using 128 samples
2025-12-19T04:05:25.233851+0800 | compress | METRIC - time 0.84s
2025-12-19T04:05:25.234875+0800 | compress | METRIC - error 13109.39
2025-12-19T04:05:25.256311+0800 | compress | METRIC - GPU 0 | usage: 31.20% | total memory: 8 GB
2025-12-19T04:05:25.257308+0800 | compress | METRIC - Compressed module size: 4.77696 MB
2025-12-19T04:05:25.258200+0800 | compress_modules | INFO - Quantizing model.layers.24.self_attn.k_proj using 128 samples
2025-12-19T04:05:26.054166+0800 | compress | METRIC - time 0.80s
2025-12-19T04:05:26.055183+0800 | compress | METRIC - error 1736.88
2025-12-19T04:05:26.079410+0800 | compress | METRIC - GPU 0 | usage: 31.20% | total memory: 8 GB
2025-12-19T04:05:26.080409+0800 | compress | METRIC - Compressed module size: 0.79616 MB
2025-12-19T04:05:26.080409+0800 | compress_modules | INFO - Quantizing model.layers.24.self_attn.v_proj using 128 samples
2025-12-19T04:05:26.901314+0800 | compress | METRIC - time 0.82s
2025-12-19T04:05:26.902457+0800 | compress | METRIC 

(25/29): Propagating: 100%|██████████| 128/128 [00:01<00:00, 75.71it/s]
(26/29): Calibrating: 100%|██████████| 128/128 [00:06<00:00, 19.28it/s]

2025-12-19T04:05:43.255461+0800 | compress_modules | INFO - Quantizing model.layers.25.self_attn.q_proj using 128 samples
2025-12-19T04:05:44.105219+0800 | compress | METRIC - time 0.85s
2025-12-19T04:05:44.106218+0800 | compress | METRIC - error 15047.20
2025-12-19T04:05:44.119403+0800 | compress | METRIC - GPU 0 | usage: 31.10% | total memory: 8 GB
2025-12-19T04:05:44.120403+0800 | compress | METRIC - Compressed module size: 4.77696 MB
2025-12-19T04:05:44.121401+0800 | compress_modules | INFO - Quantizing model.layers.25.self_attn.k_proj using 128 samples
2025-12-19T04:05:44.908339+0800 | compress | METRIC - time 0.79s
2025-12-19T04:05:44.909341+0800 | compress | METRIC - error 1569.79
2025-12-19T04:05:44.931126+0800 | compress | METRIC - GPU 0 | usage: 31.10% | total memory: 8 GB
2025-12-19T04:05:44.931126+0800 | compress | METRIC - Compressed module size: 0.79616 MB
2025-12-19T04:05:44.932132+0800 | compress_modules | INFO - Quantizing model.layers.25.self_attn.v_proj using 128 samples
2025-12-19T04:05:45.789765+0800 | compress | METRIC - time 0.86s
2025-12-19T04:05:45.791360+0800 | compress | METRIC 

(26/29): Propagating: 100%|██████████| 128/128 [00:01<00:00, 76.15it/s]
(27/29): Calibrating: 100%|██████████| 128/128 [00:06<00:00, 19.41it/s]

2025-12-19T04:06:02.007175+0800 | compress_modules | INFO - Quantizing model.layers.26.self_attn.q_proj using 128 samples
2025-12-19T04:06:02.862759+0800 | compress | METRIC - time 0.85s
2025-12-19T04:06:02.862759+0800 | compress | METRIC - error 15962.14
2025-12-19T04:06:02.888358+0800 | compress | METRIC - GPU 0 | usage: 31.15% | total memory: 8 GB
2025-12-19T04:06:02.889329+0800 | compress | METRIC - Compressed module size: 4.77696 MB
2025-12-19T04:06:02.890328+0800 | compress_modules | INFO - Quantizing model.layers.26.self_attn.k_proj using 128 samples
2025-12-19T04:06:03.714147+0800 | compress | METRIC - time 0.82s
2025-12-19T04:06:03.715148+0800 | compress | METRIC - error 1974.02
2025-12-19T04:06:03.735914+0800 | compress | METRIC - GPU 0 | usage: 31.15% | total memory: 8 GB
2025-12-19T04:06:03.736916+0800 | compress | METRIC - Compressed module size: 0.79616 MB
2025-12-19T04:06:03.737963+0800 | compress_modules | INFO - Quantizing model.layers.26.self_attn.v_proj using 128 samples
2025-12-19T04:06:04.616430+0800 | compress | METRIC - time 0.88s
2025-12-19T04:06:04.616430+0800 | compress | METRIC 

(27/29): Propagating: 100%|██████████| 128/128 [00:01<00:00, 74.69it/s]
(28/29): Calibrating: 100%|██████████| 128/128 [00:06<00:00, 19.23it/s]

2025-12-19T04:06:21.244158+0800 | compress_modules | INFO - Quantizing model.layers.27.self_attn.q_proj using 128 samples
2025-12-19T04:06:22.218875+0800 | compress | METRIC - time 0.97s
2025-12-19T04:06:22.218875+0800 | compress | METRIC - error 15532.10
2025-12-19T04:06:22.233347+0800 | compress | METRIC - GPU 0 | usage: 31.34% | total memory: 8 GB
2025-12-19T04:06:22.233347+0800 | compress | METRIC - Compressed module size: 4.77696 MB
2025-12-19T04:06:22.235237+0800 | compress_modules | INFO - Quantizing model.layers.27.self_attn.k_proj using 128 samples
2025-12-19T04:06:23.173135+0800 | compress | METRIC - time 0.94s
2025-12-19T04:06:23.174106+0800 | compress | METRIC - error 1639.69
2025-12-19T04:06:23.234549+0800 | compress | METRIC - GPU 0 | usage: 31.34% | total memory: 8 GB
2025-12-19T04:06:23.235572+0800 | compress | METRIC - Compressed module size: 0.79616 MB
2025-12-19T04:06:23.236548+0800 | compress_modules | INFO - Quantizing model.layers.27.self_attn.v_proj using 128 samples
2025-12-19T04:06:24.210489+0800 | compress | METRIC - time 0.97s
2025-12-19T04:06:24.210489+0800 | compress | METRIC 

(28/29): Propagating: 100%|██████████| 128/128 [00:01<00:00, 75.32it/s]
(29/29): Calibrating: 100%|██████████| 128/128 [00:00<00:00, 757.74it/s]
(29/29): Propagating: 100%|██████████| 128/128 [00:00<00:00, 854.91it/s]

2025-12-19T04:06:34.668162+0800 | finalize | INFO - Compression lifecycle finalized for 1 modifiers
2025-12-19T04:06:34.722376+0800 | get_model_compressor | INFO - skip_sparsity_compression_stats set to True. Skipping sparsity compression statistic calculations. No sparsity compressor will be applied.


Compressing model: 196it [00:03, 52.05it/s]


'models/qwen2.5-1.5b-instruct-gptq-llmc'
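
The per-module `METRIC - error` lines in the log above are easier to compare once they are collected into a table. The following is a minimal sketch that assumes the log has been captured as plain text in the format shown; `summarize_errors` is a hypothetical helper name, and the sample lines are copied from the output above.

```python
import re

# Sample lines copied from the GPTQ log above; in practice, pass the full captured log text.
LOG = """\
2025-12-19T04:04:46.164355+0800 | compress_modules | INFO - Quantizing model.layers.22.self_attn.q_proj using 128 samples
2025-12-19T04:04:47.035568+0800 | compress | METRIC - time 0.87s
2025-12-19T04:04:47.035568+0800 | compress | METRIC - error 10721.95
2025-12-19T04:04:47.058436+0800 | compress_modules | INFO - Quantizing model.layers.22.self_attn.k_proj using 128 samples
2025-12-19T04:04:47.875248+0800 | compress | METRIC - time 0.82s
2025-12-19T04:04:47.876248+0800 | compress | METRIC - error 1612.44
"""

MODULE_RE = re.compile(r"Quantizing (\S+) using \d+ samples")
ERROR_RE = re.compile(r"METRIC - error ([0-9.]+)")


def summarize_errors(log_text: str) -> dict[str, float]:
    """Map each module name to the error value logged right after it."""
    errors = {}
    current = None
    for line in log_text.splitlines():
        module_match = MODULE_RE.search(line)
        if module_match:
            current = module_match.group(1)
            continue
        error_match = ERROR_RE.search(line)
        if error_match and current is not None:
            errors[current] = float(error_match.group(1))
            current = None  # wait for the next "Quantizing ..." line
    return errors


errors = summarize_errors(LOG)
# Print modules from highest to lowest logged error.
for name, err in sorted(errors.items(), key=lambda kv: -kv[1]):
    print(f"{name}: {err:.2f}")
```

Modules whose error line is missing from the captured text (for example where the output was cut off) are simply skipped.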

In [None]:
# Load the GPTQ-quantized checkpoint for inference

gptq_tokenizer = AutoTokenizer.from_pretrained(gptq_out_dir, trust_remote_code=True)
if gptq_tokenizer.pad_token_id is None:
    gptq_tokenizer.pad_token = gptq_tokenizer.eos_token

gptq_model = AutoModelForCausalLM.from_pretrained(
    gptq_out_dir,
    device_map="auto",
    torch_dtype=torch.float16,
    trust_remote_code=True,
)
gptq_model.eval()

gptq_tokenizer.pad_token, gptq_tokenizer.eos_token, gptq_tokenizer.pad_token_id, gptq_tokenizer.eos_token_id

The tokenizer you are loading from 'models/qwen2.5-1.5b-instruct-gptq-llmc' with an incorrect regex pattern: https://huggingface.co/mistralai/Mistral-Small-3.1-24B-Instruct-2503/discussions/84#69121093e8b480e709447d5e. This will lead to incorrect tokenization. You should set the `fix_mistral_regex=True` flag when loading this tokenizer to fix this issue.
Compressing model: 196it [00:00, 1131.44it/s]


('<|endoftext|>', '<|im_end|>', 151643, 151645)

In [5]:
# Check that GPTQ quantization took effect: inspect the type of layer 0's self_attn.q_proj and the model's quantization config
layer = gptq_model.model.layers[0].self_attn.q_proj
print("GPTQ q_proj layer type:", type(layer))
print("GPTQ quantization_config:", getattr(gptq_model.config, "quantization_config", None))

GPTQ q_proj layer type: <class 'compressed_tensors.linear.compressed_linear.CompressedLinear'>
GPTQ quantization_config: CompressedTensorsConfig {
  "config_groups": {
    "group_0": {
      "format": "pack-quantized",
      "input_activations": null,
      "output_activations": null,
      "targets": [
        "Linear"
      ],
      "weights": {
        "actorder": "static",
        "block_structure": null,
        "dynamic": false,
        "group_size": 128,
        "num_bits": 4,
        "observer": "minmax",
        "observer_kwargs": {},
        "scale_dtype": null,
        "strategy": "group",
        "symmetric": true,
        "type": "int",
        "zp_dtype": null
      }
    }
  },
  "format": "pack-quantized",
  "global_compression_ratio": null,
  "ignore": [
    "lm_head"
  ],
  "kv_cache_scheme": null,
  "quantization_status": "compressed"
}
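
The config printed above describes 4-bit symmetric integer weights with one scale per group of 128 values (`"strategy": "group"`, `"group_size": 128`). To show what that layout means, here is a minimal pure-Python sketch of plain round-to-nearest group quantization. It is not GPTQ's error-compensating update and not the `compressed_tensors` packing code; the function names are hypothetical, the integer range [-8, 7] is an assumption for symmetric 4-bit, and a group size of 4 is used only to keep the example small.

```python
def quantize_group(values, num_bits=4):
    """Symmetric round-to-nearest quantization of one group with a shared scale."""
    qmax = 2 ** (num_bits - 1) - 1   # 7 for 4-bit
    qmin = -(2 ** (num_bits - 1))    # -8 for 4-bit
    max_abs = max(abs(v) for v in values)
    scale = max_abs / qmax if max_abs > 0 else 1.0
    q = [min(qmax, max(qmin, round(v / scale))) for v in values]
    return q, scale


def quantize_row(row, group_size):
    """Split a weight row into consecutive groups and quantize each one."""
    return [
        quantize_group(row[start:start + group_size])
        for start in range(0, len(row), group_size)
    ]


def dequantize_row(groups):
    """Rebuild approximate float weights from (integers, scale) pairs."""
    return [qi * scale for q, scale in groups for qi in q]


# Toy row: the first group holds large values, the second group small ones.
row = [0.7, -0.3, 0.12, 0.0, 0.014, -0.002, 0.006, 0.004]
groups = quantize_row(row, group_size=4)  # the notebook's config uses 128
restored = dequantize_row(groups)

for q, scale in groups:
    print(q, scale)
```

Because each group gets its own scale, the small values in the second group still map to distinct integers; with a single scale for the whole row they would all round to zero.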



### Chat inference example with the GPTQ-quantized model

In [5]:
GPTQ_TEST_QUERIES = [
    "用两三句话解释一下什么是量子计算？",
    "Give me a brief introduction to large language models in English.",
]

@torch.no_grad()
def gptq_chat(question: str) -> str:
    msgs = [
        {"role": "system", "content": "你是一名 AI 助手，回答准确、简洁。"},
        {"role": "user", "content": question},
    ]
    input_ids = gptq_tokenizer.apply_chat_template(
        msgs,
        tokenize=True,
        add_generation_prompt=True,
        return_tensors="pt",
    ).to(gptq_model.device)

    gen_ids = gptq_model.generate(
        input_ids=input_ids,
        max_new_tokens=256,
        do_sample=True,
        temperature=0.7,
        top_p=0.9,
        repetition_penalty=1.1,
        eos_token_id=gptq_tokenizer.eos_token_id,
        pad_token_id=gptq_tokenizer.pad_token_id,
    )
    out_ids = gen_ids[0, input_ids.shape[-1]:]
    return gptq_tokenizer.decode(out_ids, skip_special_tokens=True).strip()

for q in GPTQ_TEST_QUERIES:
    ans = gptq_chat(q)
    print(f"[GPTQ] Q: {q}\nA: {ans}\n" + "-" * 60)

[GPTQ] Q: 用两三句话解释一下什么是量子计算？
A: 量子计算是利用量子位（qubits）代替经典比特进行信息处理的一种计算方式。它利用量子叠加、纠缠等特性来实现超越传统计算机的并行处理能力，可以高效地解决某些特定问题。通过使用量子比特，量子计算机可以在短时间内完成复杂的数学运算和模拟物理系统，对于密码破解、优化算法等领域具有巨大潜力。尽管目前的技术还处于发展阶段，但未来有望成为一种强大的计算工具。
------------------------------------------------------------
[GPTQ] Q: Give me a brief introduction to large language models in English.
A: Large Language Models (LLMs) are advanced artificial intelligence systems that can understand and generate human-like text across various domains, including natural language processing tasks such as translation, summarization, question-answering, and more.

These models use deep learning algorithms, which enable them to learn from vast amounts of data and improve their performance over time. The most prominent example is the GPT (Generative Pretrained Transformer), developed by OpenAI, which has been used for many applications like language generation, image description, and even creative writing.

One key feature of LLMs is their ability to hand

In [6]:
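# Free the GPU memory held by the GPTQ-quantized model
import gc

del gptq_model
gc.collect()               # drop lingering Python references first
torch.cuda.empty_cache()   # then return cached blocks to the driver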

## 2. Quantizing Qwen2.5-1.5B-Instruct with llmcompressor + AWQ

In [None]:
# Quantize Qwen2.5-1.5B-Instruct with AWQ via llmcompressor
from llmcompressor.modifiers.awq import AWQModifier

awq_out_dir = "models/qwen2.5-1.5b-instruct-awq-llmc"

awq_recipe = [
    AWQModifier(
        scheme="W4A16",
        targets="Linear",
        ignore=["lm_head"],
    ),
]

oneshot(
    model=base_model_id,
    dataset="open_platypus",
    recipe=awq_recipe,
    output_dir=awq_out_dir,
    max_seq_length=2048,
    num_calibration_samples=128,
)

awq_out_dir

Tokenizing:   0%|          | 0/24926 [00:00<?, ? examples/s]

2025-12-19T04:08:32.087210+0800 | reset | INFO - Compression lifecycle reset
2025-12-19T04:08:32.099173+0800 | from_modifiers | INFO - Creating recipe from modifiers
2025-12-19T04:08:32.139239+0800 | on_initialize | INFO - No AWQModifier.mappings provided, inferring from model...
2025-12-19T04:08:32.174455+0800 | initialize | INFO - Compression lifecycle initialized for 1 modifiers
2025-12-19T04:08:32.175474+0800 | IndependentPipeline | INFO - Inferred `SequentialPipeline` for `AWQModifier`


Preparing cache: 100%|██████████| 128/128 [00:00<00:00, 665.58it/s]
(1/29): Calibrating: 100%|██████████| 128/128 [00:28<00:00,  4.53it/s]
Smoothing: 100%|██████████| 3/3 [12:56<00:00, 258.98s/it]
(1/29): Propagating: 100%|██████████| 128/128 [00:30<00:00,  4.23it/s]
(2/29): Calibrating: 100%|██████████| 128/128 [00:24<00:00,  5.19it/s]
Smoothing: 100%|██████████| 3/3 [13:35<00:00, 271.83s/it]
(2/29): Propagating: 100%|██████████| 128/128 [00:47<00:00,  2.69it/s]
(3/29): Calibrating: 100%|██████████| 128/128 [00:34<00:00,  3.67it/s]
Smoothing: 100%|██████████| 3/3 [17:54<00:00, 358.33s/it]
(3/29): Propagating: 100%|██████████| 128/128 [00:33<00:00,  3.87it/s]
(4/29): Calibrating: 100%|██████████| 128/128 [00:31<00:00,  4.12it/s]
Smoothing: 100%|██████████| 3/3 [13:25<00:00, 268.42s/it]
(4/29): Propagating: 100%|██████████| 128/128 [00:41<00:00,  3.11it/s]
(5/29): Calibrating: 100%|██████████| 128/128 [00:22<00:00,  5.59it/s]
Smoothing: 100%|██████████| 3/3 [11:18<00:00, 226.02s/it]
(5/

2025-12-19T09:17:58.322208+0800 | finalize | INFO - Compression lifecycle finalized for 1 modifiers
2025-12-19T09:17:58.390265+0800 | get_model_compressor | INFO - skip_sparsity_compression_stats set to True. Skipping sparsity compression statistic calculations. No sparsity compressor will be applied.


Compressing model: 196it [00:03, 53.78it/s]


'models/qwen2.5-1.5b-instruct-awq-llmc'
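
The idea behind AWQ is that an input channel of a weight matrix can be scaled up by a factor `s` while the matching activation is scaled down by `s`, which leaves the layer output unchanged but changes how the weights round during quantization. The following pure-Python sketch only demonstrates that equivalence on toy numbers. It is not llmcompressor's implementation: the helper names and the scale values are hypothetical, and the real algorithm derives its scales from calibration activations (presumably the `Smoothing` step in the log above).

```python
def matvec(weight, x):
    """Compute y = W @ x for a list-of-rows weight matrix."""
    return [sum(w * xi for w, xi in zip(row, x)) for row in weight]


def scale_columns(weight, scales):
    """Multiply input channel j of W by scales[j]."""
    return [[w * s for w, s in zip(row, scales)] for row in weight]


# Toy layer: 2 outputs, 3 input channels. Channel 0 sees a large activation.
W = [[0.5, -0.25, 1.0],
     [2.0, 0.75, -0.5]]
x = [8.0, 0.5, 1.0]

# Hypothetical per-channel scales (powers of two keep the float math exact here).
s = [4.0, 1.0, 1.0]

W_scaled = scale_columns(W, s)                 # weights that would be quantized
x_scaled = [xi / si for xi, si in zip(x, s)]   # inverse scale applied to activations

y_ref = matvec(W, x)
y_new = matvec(W_scaled, x_scaled)
print(y_ref, y_new)
```

In the actual method only the scaled weights are quantized, and the `1/s` factor is folded into the preceding operation so that inference needs no extra step.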

In [None]:
# Load the AWQ-quantized checkpoint for chat inference

awq_tokenizer = AutoTokenizer.from_pretrained(awq_out_dir, trust_remote_code=True)
if awq_tokenizer.pad_token_id is None:
    awq_tokenizer.pad_token = awq_tokenizer.eos_token

awq_model = AutoModelForCausalLM.from_pretrained(
    awq_out_dir,
    device_map="auto",
    torch_dtype=torch.float16,
    trust_remote_code=True,
)
awq_model.eval()

The tokenizer you are loading from 'models/qwen2.5-1.5b-instruct-awq-llmc' with an incorrect regex pattern: https://huggingface.co/mistralai/Mistral-Small-3.1-24B-Instruct-2503/discussions/84#69121093e8b480e709447d5e. This will lead to incorrect tokenization. You should set the `fix_mistral_regex=True` flag when loading this tokenizer to fix this issue.
Compressing model: 196it [00:00, 1146.78it/s]


Qwen2ForCausalLM(
  (model): Qwen2Model(
    (embed_tokens): Embedding(151936, 1536)
    (layers): ModuleList(
      (0-27): 28 x Qwen2DecoderLayer(
        (self_attn): Qwen2Attention(
          (q_proj): CompressedLinear(in_features=1536, out_features=1536, bias=True)
          (k_proj): CompressedLinear(in_features=1536, out_features=256, bias=True)
          (v_proj): CompressedLinear(in_features=1536, out_features=256, bias=True)
          (o_proj): CompressedLinear(in_features=1536, out_features=1536, bias=False)
        )
        (mlp): Qwen2MLP(
          (gate_proj): CompressedLinear(in_features=1536, out_features=8960, bias=False)
          (up_proj): CompressedLinear(in_features=1536, out_features=8960, bias=False)
          (down_proj): CompressedLinear(in_features=8960, out_features=1536, bias=False)
          (act_fn): SiLUActivation()
        )
        (input_layernorm): Qwen2RMSNorm((1536,), eps=1e-06)
        (post_attention_layernorm): Qwen2RMSNorm((1536,), eps=1e-06)

In [7]:
# Check that AWQ quantization took effect: inspect the type of layer 0's self_attn.q_proj and the model's quantization config
layer = awq_model.model.layers[0].self_attn.q_proj
print("AWQ q_proj layer type:", type(layer))
print("AWQ quantization_config:", getattr(awq_model.config, "quantization_config", None))

AWQ q_proj layer type: <class 'compressed_tensors.linear.compressed_linear.CompressedLinear'>
AWQ quantization_config: CompressedTensorsConfig {
  "config_groups": {
    "group_0": {
      "format": "pack-quantized",
      "input_activations": null,
      "output_activations": null,
      "targets": [
        "Linear"
      ],
      "weights": {
        "actorder": null,
        "block_structure": null,
        "dynamic": false,
        "group_size": 128,
        "num_bits": 4,
        "observer": "minmax",
        "observer_kwargs": {},
        "scale_dtype": null,
        "strategy": "group",
        "symmetric": true,
        "type": "int",
        "zp_dtype": null
      }
    }
  },
  "format": "pack-quantized",
  "global_compression_ratio": null,
  "ignore": [
    "lm_head"
  ],
  "kv_cache_scheme": null,
  "quantization_status": "compressed"
}
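
Set side by side, the GPTQ and AWQ printouts in this notebook differ in only one weight setting. The sketch below copies the two `weights` entries by hand and diffs them; `diff_configs` is a hypothetical helper, not part of llmcompressor or `compressed_tensors`.

```python
# "weights" entries copied from the two quantization_config printouts in this notebook.
gptq_weights = {
    "actorder": "static", "block_structure": None, "dynamic": False,
    "group_size": 128, "num_bits": 4, "observer": "minmax",
    "observer_kwargs": {}, "scale_dtype": None, "strategy": "group",
    "symmetric": True, "type": "int", "zp_dtype": None,
}
# The AWQ printout is identical except that it shows "actorder": null.
awq_weights = dict(gptq_weights, actorder=None)


def diff_configs(a: dict, b: dict) -> dict:
    """Return {key: (a_value, b_value)} for every key whose values differ."""
    keys = sorted(set(a) | set(b))
    return {k: (a.get(k), b.get(k)) for k in keys if a.get(k) != b.get(k)}


diff = diff_configs(gptq_weights, awq_weights)
print(diff)
```

Both checkpoints therefore share the same `pack-quantized` storage format and the same `CompressedLinear` layer type; the two runs differ in how the 4-bit values were chosen, not in how they are stored.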



In [None]:
AWQ_TEST_QUERIES = [
    "简单说说大模型量化有什么好处？",
    "Explain in English why activation-aware weight quantization (AWQ) can help LLMs.",
]

@torch.no_grad()
def awq_chat(question: str) -> str:
    msgs = [
        {"role": "system", "content": "你是一名 AI 助手，回答准确、简洁。"},
        {"role": "user", "content": question},
    ]
    input_ids = awq_tokenizer.apply_chat_template(
        msgs,
        tokenize=True,
        add_generation_prompt=True,
        return_tensors="pt",
    ).to(awq_model.device)

    gen_ids = awq_model.generate(
        input_ids=input_ids,
        max_new_tokens=256,
        do_sample=True,
        temperature=0.7,
        top_p=0.9,
        repetition_penalty=1.1,
        eos_token_id=awq_tokenizer.eos_token_id,
        pad_token_id=awq_tokenizer.pad_token_id,
    )
    out_ids = gen_ids[0, input_ids.shape[-1]:]
    return awq_tokenizer.decode(out_ids, skip_special_tokens=True).strip()

for q in AWQ_TEST_QUERIES:
    ans = awq_chat(q)
    print(f"[AWQ] Q: {q}\nA: {ans}\n" + "-" * 60)


The tokenizer you are loading from 'models/qwen2.5-1.5b-instruct-awq-llmc' with an incorrect regex pattern: https://huggingface.co/mistralai/Mistral-Small-3.1-24B-Instruct-2503/discussions/84#69121093e8b480e709447d5e. This will lead to incorrect tokenization. You should set the `fix_mistral_regex=True` flag when loading this tokenizer to fix this issue.
Compressing model: 196it [00:00, 1314.02it/s]


[AWQ] Q: 简单说说大模型量化有什么好处？
A: 1. 提高计算效率：通过去除多余的参数和操作来降低模型的复杂度和内存占用。
2. 降低成本：减少计算量可以降低硬件需求和功耗。
3. 改善泛化能力：优化后的模型通常在一些任务上表现更优，能够更好地泛化到其他数据集。
4. 实现可解释性：对量化后的模型进行分析可以理解其工作原理。
5. 推广应用：适合于各种低资源环境下的部署。
------------------------------------------------------------
[AWQ] Q: Explain in English why activation-aware weight quantization (AWQ) can help LLMs.
A: Activation-aware weight quantization (AWQ) helps large language models, or Large Language Models (LLMs), by improving their performance and efficiency while reducing the amount of computation required to run them.

Here’s how AWQ works and why it is beneficial:

1. **Reducing the Size of Model Parameters**: Activation-aware weight quantization involves mapping activations from a continuous range into a smaller discrete set. This process reduces the number of parameters needed for the model, which leads to smaller and more efficient models with fewer parameters.

2. **Improving Performance**: Smaller models are often faster and use less memory, ma