# Quantization with AutoAWQ for opt-6.7b

```
Week 3, Assignment 1:
1. Quantize the OPT-6.7B model with GPTQ. Course code: https://github.com/DjangoPeng/LLM-quickstart/blob/main/quantization/AutoGPTQ_opt-2.7b.ipynb
2. Quantize the Facebook OPT-6.7B model with AWQ. Facebook OPT models: https://huggingface.co/facebook?search_models=opt
Course code: https://github.com/DjangoPeng/LLM-quickstart/blob/main/quantization/AWQ_opt-2.7b.ipynb
https://github.com/DjangoPeng/LLM-quickstart/blob/main/quantization/AWQ-opt-125m.ipynb ------> this notebook is for this task.

Week 3, Assignment 2: Depending on available hardware, fine-tune ChatGLM3-6B with QLoRA on the AdvertiseGen dataset for at least 10K examples, observe how the loss changes, and compare the model's outputs before and after fine-tuning.
Course code:
https://github.com/DjangoPeng/LLM-quickstart/blob/main/peft/peft_qlora_chatglm.ipynb
https://github.com/DjangoPeng/LLM-quickstart/blob/main/peft/peft_chatglm_inference.ipynb
```

In [1]:
# Run this cell once: it defines the helper that quantizes the model and saves it to disk.
from awq import AutoAWQForCausalLM
from transformers import AutoTokenizer, AwqConfig

def quantize_and_save_awq_model(
    model_name_or_path: str,
    quant_model_dir: str,
    w_bit: int = 4,
    q_group_size: int = 128,
    zero_point: bool = True,
    version: str = "GEMM",
    trust_remote_code: bool = True
):
    """
    Quantize a model using AWQ, save the quantized model & tokenizer.

    Args:
        model_name_or_path (str): Path or name of the unquantized source model (e.g., "facebook/opt-2.7b")
        quant_model_dir (str): Directory to save the quantized model & tokenizer
        w_bit (int): Number of bits for weights (e.g., 4)
        q_group_size (int): Group size for quantization (e.g., 128)
        zero_point (bool): Whether to use zero-point quantization
        version (str): AWQ kernel version, e.g., "GEMM" or "GEMV"
        trust_remote_code (bool): Whether to trust remote code when loading the model
    """
    # Define AWQ quantization config
    quant_config = {
        "zero_point": zero_point,
        "q_group_size": q_group_size,
        "w_bit": w_bit,
        "version": version
    }

    print(f"🔃 Loading model from: {model_name_or_path}, w_bit: {w_bit}, q_group_size: {q_group_size}, zero_point: {zero_point}, version: {version}")
    # Load the unquantized model using AutoAWQ
    model = AutoAWQForCausalLM.from_pretrained(
        model_name_or_path,
        trust_remote_code=trust_remote_code
    )
    tokenizer = AutoTokenizer.from_pretrained(
        model_name_or_path,
        trust_remote_code=trust_remote_code
    )

    print("⚙️  Quantizing model with AWQ...")
    # Quantize the model
    model.quantize(tokenizer, quant_config=quant_config)

    # Convert AWQ config to Transformers-compatible format
    transformers_quant_config = AwqConfig(
        bits=w_bit,
        group_size=q_group_size,
        zero_point=zero_point,
        version=version.lower(),  # e.g., "gemm"
    ).to_dict()

    # Set the quantization config to the model's config (optional but useful for tracking)
    model.model.config.quantization_config = transformers_quant_config

    # Save quantized model and tokenizer
    print(f"💾 Saving quantized model to: {quant_model_dir}")
    model.save_quantized(quant_model_dir)
    print(f"💾 Saving quantized tokenizer to: {quant_model_dir}")
    tokenizer.save_pretrained(quant_model_dir)

    # Set model to eval mode
    model.eval()
    print("Model change to eval mode.")

    print("✅ Quantization and save complete.")
    return model, tokenizer
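
To make the `w_bit`, `q_group_size`, and `zero_point` settings more concrete, here is a minimal pure-Python sketch of group-wise zero-point quantization. It only illustrates the storage format (round-to-nearest within each group) and does not reproduce AWQ's activation-aware scale search. The helper names `quantize_group`, `dequantize_group`, and `quantize_grouped` are hypothetical and not part of AutoAWQ, and widening each group's range to include zero is an assumption made to keep the sketch simple.

```python
from typing import List, Tuple


def quantize_group(weights: List[float], w_bit: int = 4) -> Tuple[List[int], float, int]:
    """Quantize one group of weights to unsigned integers with a zero point."""
    q_max = 2 ** w_bit - 1                       # 15 for 4-bit codes
    # Assumption: the representable range always includes 0.0.
    w_min = min(min(weights), 0.0)
    w_max = max(max(weights), 0.0)
    scale = (w_max - w_min) / q_max or 1.0       # fall back to 1.0 for an all-zero group
    zero = max(0, min(q_max, round(-w_min / scale)))
    codes = [max(0, min(q_max, round(w / scale) + zero)) for w in weights]
    return codes, scale, zero


def dequantize_group(codes: List[int], scale: float, zero: int) -> List[float]:
    """Map integer codes back to approximate floating-point weights."""
    return [(c - zero) * scale for c in codes]


def quantize_grouped(weights: List[float], w_bit: int = 4, q_group_size: int = 128):
    """Split a flat weight list into groups; each group gets its own scale and zero point."""
    return [
        quantize_group(weights[start:start + q_group_size], w_bit)
        for start in range(0, len(weights), q_group_size)
    ]
```

Because each group of `q_group_size` weights has its own scale, a smaller group size tracks local weight ranges more closely at the cost of storing more scales and zero points.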

In [2]:
# run this one time
# update to the expected model name
# model_name_or_path = "facebook/opt-125m"
# model_name_or_path = "facebook/opt-2.7b"
model_name_or_path = "facebook/opt-6.7b"

model_name = model_name_or_path.split("/")[-1]  # e.g., "opt-2.7b"
quant_model_dir = f"models/{model_name}-awq"
print(f"using quant_model_dir: {quant_model_dir}")

# Call the function
quant_model, quant_tokenizer = quantize_and_save_awq_model(
    model_name_or_path=model_name_or_path,
    quant_model_dir=quant_model_dir,
    w_bit=4,
    q_group_size=128,
    zero_point=True,
    version="GEMM",
    trust_remote_code=True
)

using quant_model_dir: models/opt-6.7b-awq
🔃 Loading model from: facebook/opt-6.7b, w_bit: 4, q_group_size: 128, zero_point: True, version: GEMM




⚙️  Quantizing model with AWQ...


Repo card metadata block was not found. Setting CardData to empty.
AWQ: 100%|██████████| 32/32 [40:54<00:00, 76.69s/it]


💾 Saving quantized model to: models/opt-6.7b-awq
💾 Saving quantized tokenizer to: models/opt-6.7b-awq
Model change to eval mode.
✅ Quantization and save complete.
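
As a rough sanity check on what the settings above (`w_bit=4`, `q_group_size=128`, `zero_point=True`) buy, the following sketch estimates the average storage per weight. It assumes one 16-bit scale and one zero point of `w_bit` bits per group, and it treats all of the nominal 6.7B parameters as quantized, which overstates the savings because layers such as embeddings are typically left unquantized. The function names are hypothetical.

```python
def effective_bits_per_weight(
    w_bit: int = 4,
    q_group_size: int = 128,
    scale_bits: int = 16,
    zero_point: bool = True,
) -> float:
    """Average bits stored per weight, including per-group metadata."""
    group_bits = w_bit * q_group_size + scale_bits   # packed codes plus one scale
    if zero_point:
        group_bits += w_bit                          # assumption: zero point stored at w_bit width
    return group_bits / q_group_size


def estimated_size_gb(num_params: float, bits_per_weight: float) -> float:
    """Approximate weight storage in gigabytes (10^9 bytes)."""
    return num_params * bits_per_weight / 8 / 1e9


if __name__ == "__main__":
    bpw = effective_bits_per_weight()
    print(f"bits per weight: {bpw}")
    print(f"quantized estimate: {estimated_size_gb(6.7e9, bpw):.2f} GB")
    print(f"16-bit baseline:    {estimated_size_gb(6.7e9, 16):.2f} GB")
```

Comparing the estimate with the actual size of the saved `models/opt-6.7b-awq` directory is a quick way to confirm that quantization took effect.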


In [3]:
# run this as many times as you want
from transformers import AutoTokenizer, AutoModelForCausalLM
import time

# Set the model name here so this cell can also run on its own (e.g., after a kernel restart)
# model_name_or_path = "facebook/opt-125m"
# model_name_or_path = "facebook/opt-2.7b"
model_name_or_path = "facebook/opt-6.7b"

model_name = model_name_or_path.split("/")[-1]  # e.g., "opt-2.7b"
quant_model_dir = f"models/{model_name}-awq"

tokenizer = AutoTokenizer.from_pretrained(quant_model_dir)
# device_map already places the weights on the GPU, so no extra .to() call is needed
model = AutoModelForCausalLM.from_pretrained(quant_model_dir, device_map="cuda")

def generate_text(text):
    """Generate up to 64 new tokens for `text`, print the elapsed time, and return the decoded output."""
    print(f"Generating text for: {text}")
    start_time = time.time()
    # Move the input tensors to the same device as the model
    inputs = tokenizer(text, return_tensors="pt").to(model.device)
    out = model.generate(**inputs, max_new_tokens=64)
    result = tokenizer.decode(out[0], skip_special_tokens=True)
    end_time = time.time()
    print(f"Time taken {end_time - start_time:.2f} seconds")
    print(f"Generated text: {result}")
    return result

# testing
result = generate_text("Merry Christmas! I'm glad to")
result = generate_text("The woman worked as a")


Generating text for: Merry Christmas! I'm glad to
Time taken 3.40 seconds
Generated text: Merry Christmas! I'm glad to, Merry Christmas
 I'm glad to, Merry Christmas.
That's all right all
Generating text for: The woman worked as a
Time taken 10.56 seconds
Generated text: The woman worked as a nursing assistant at a hospital and the the
 the woman worked as a nursing assistant at a hospital and the the
 the woman worked as a nursing assistant at a hospital and the the
 the woman worked as a nursing assistant at a hospital and the the

The woman worked as a nursing assistant at a hospital and the
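
The second output above loops on the same clause. One simple way to put a number on that kind of repetition is the share of word n-grams that occur more than once. The sketch below assumes plain whitespace tokenization, and `repeated_ngram_ratio` is a hypothetical helper rather than part of Transformers or AutoAWQ.

```python
from collections import Counter


def repeated_ngram_ratio(text: str, n: int = 4) -> float:
    """Fraction of word n-grams in `text` that appear more than once (0.0 means no repeats)."""
    tokens = text.split()
    ngrams = [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]
    if not ngrams:
        return 0.0                      # text shorter than n words
    counts = Counter(ngrams)
    repeated = sum(c for c in counts.values() if c > 1)
    return repeated / len(ngrams)
```

Calling `repeated_ngram_ratio(result)` on each generated string gives a quick comparison between runs. If the ratio is high, passing `no_repeat_ngram_size` or `repetition_penalty` to `model.generate` may reduce the looping, though the effect on this quantized model was not tested here.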
