# 🧪 GPTQ: Accurate Post-Training Quantization (2022/2023)

[![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/adiel2012/model-quantization/blob/main/chronology/gptq_demo.ipynb)

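GPTQ is a one-shot, post-training weight quantization method based on approximate second-order information. It quantizes a model layer by layer using a small calibration set, with no retraining. The original paper reports quantizing models with up to 175 billion parameters to 3 or 4 bits per weight in roughly four GPU hours, with negligible accuracy loss relative to the uncompressed baseline.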
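
To make "approximate second-order information" more concrete, here is a toy NumPy sketch of the core idea: quantize one weight column at a time and push the rounding error onto the columns that are still unquantized, weighted by the inverse Hessian of the layer-wise squared error. This is a simplified illustration, not AutoGPTQ's implementation. It omits grouping, lazy batched updates, and activation ordering, and it assumes no weight row is entirely zero. The names `make_grid`, `quantize_rtn`, and `gptq_toy` are hypothetical helpers introduced only for this example.

```python
import numpy as np

def make_grid(W, bits=4):
    """Build one asymmetric uniform grid per output row (assumes no all-zero row)."""
    maxq = 2 ** bits - 1
    wmin = np.minimum(W.min(axis=1), 0.0)
    wmax = np.maximum(W.max(axis=1), 0.0)
    scale = (wmax - wmin) / maxq
    zero = np.round(-wmin / scale)
    return scale, zero, maxq

def quantize_rtn(w, scale, zero, maxq):
    """Round to the nearest grid point, then map back to floats."""
    q = np.clip(np.round(w / scale) + zero, 0, maxq)
    return scale * (q - zero)

def gptq_toy(W, X, bits=4, damp=0.01):
    """Quantize W (out_features x in_features) column by column with error feedback.

    X holds calibration inputs with shape (in_features, n_samples).
    """
    scale, zero, maxq = make_grid(W, bits)  # grid is fixed from the original weights
    W = W.astype(np.float64).copy()

    # Second-order information: Hessian of the layer-wise squared error.
    H = 2.0 * X @ X.T
    H += damp * np.mean(np.diag(H)) * np.eye(H.shape[0])  # damping for stability
    # Upper Cholesky factor of H^-1; row i holds the update weights for column i.
    U = np.linalg.cholesky(np.linalg.inv(H)).T

    Q = np.zeros_like(W)
    for i in range(W.shape[1]):
        Q[:, i] = quantize_rtn(W[:, i], scale, zero, maxq)
        err = (W[:, i] - Q[:, i]) / U[i, i]
        # Push the rounding error onto the columns that are not yet quantized.
        W[:, i + 1:] -= np.outer(err, U[i, i + 1:])
    return Q

# Toy comparison against plain round-to-nearest on random data.
rng = np.random.default_rng(0)
W = rng.normal(size=(8, 32))
X = rng.normal(size=(32, 256))

scale, zero, maxq = make_grid(W)
Q_rtn = quantize_rtn(W, scale[:, None], zero[:, None], maxq)
Q_gptq = gptq_toy(W, X)

print("RTN  output error:", np.linalg.norm(W @ X - Q_rtn @ X))
print("GPTQ output error:", np.linalg.norm(W @ X - Q_gptq @ X))
```

Both results use the same 16-level grid per row. The difference is only in which grid point each weight lands on: the error-feedback version chooses points that aim to reduce the error of the layer output `W @ X` rather than the error of each weight in isolation.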

In this notebook, we use `AutoGPTQ`, through the `GPTQConfig` integration in `transformers`, to quantize an `OPT-125M` model.

In [None]:
!pip install auto-gptq transformers optimum -q

In [None]:
from transformers import AutoModelForCausalLM, AutoTokenizer, GPTQConfig
import torch
import time

model_id = "facebook/opt-125m"
tokenizer = AutoTokenizer.from_pretrained(model_id)

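# 1. Load Model with GPTQ Configuration
# We'll use 4-bit quantization with a group size of 128
# (one scale and zero point per group of 128 weights).
# The tokenizer is passed so the "wikitext2" calibration samples can be tokenized.
quantization_config = GPTQConfig(
    bits=4,
    group_size=128,
    dataset="wikitext2",
    tokenizer=tokenizer,
    desc_act=False,  # keep the original column order (no activation-order reordering)
)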

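print("--- Loading and Quantizing Model ---")
# Passing a GPTQConfig with a calibration dataset makes from_pretrained
# run calibration and quantize the weights while loading.
model_gptq = AutoModelForCausalLM.from_pretrained(
    model_id,
    quantization_config=quantization_config,
    torch_dtype=torch.float16,
    device_map="auto",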
)
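
The `bits=4` and `group_size=128` settings determine how much storage each quantized weight needs. The back-of-the-envelope calculation below is a rough sketch that assumes each group stores one fp16 scale (16 bits) and one zero point packed at 4 bits. Those two figures are assumptions about the storage layout, and `effective_bits_per_weight` is a hypothetical helper, not part of any library.

```python
def effective_bits_per_weight(bits=4, group_size=128, scale_bits=16, zero_bits=4):
    """Average stored bits per weight when each group carries its own scale and zero point."""
    return bits + (scale_bits + zero_bits) / group_size

eff = effective_bits_per_weight()
ratio = 16 / eff  # compared with fp16 weights
print(f"{eff:.5f} bits per weight, about {ratio:.2f}x smaller than fp16")
```

Under these assumptions the overhead of the per-group metadata is small (about 0.16 bits per weight). The ratio applies only to the quantized linear layers. Embeddings and other unquantized tensors stay in fp16, so the whole-model reduction is lower, which is especially noticeable for a small model such as `OPT-125M`.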

In [None]:
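def benchmark_inference(model, tokenizer, input_text="The future of AI is"):
    """Time a single generate() call and print the decoded output."""
    inputs = tokenizer(input_text, return_tensors="pt").to(model.device)
    # perf_counter is a monotonic, high-resolution clock suited to benchmarking
    start_time = time.perf_counter()
    with torch.no_grad():
        output = model.generate(**inputs, max_new_tokens=30)
    if torch.cuda.is_available():
        torch.cuda.synchronize()  # make sure all queued GPU work has finished
    end_time = time.perf_counter()

    # A single cold run includes one-off warm-up costs, so treat this as a rough figure
    print(f"Duration: {end_time - start_time:.4f}s")
    print(f"Output: {tokenizer.decode(output[0], skip_special_tokens=True)}")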

print("--- GPTQ Inference ---")
benchmark_inference(model_gptq, tokenizer)
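
A single timed call like the one above can be dominated by one-off costs such as kernel loading. For a steadier number, one option is to add untimed warm-up runs and report the median of several timed runs. The helper below is a minimal sketch using only the standard library. `time_callable` and its parameters are hypothetical names introduced for this example.

```python
import statistics
import time

def time_callable(fn, warmup=1, repeats=5, sync=None):
    """Return the median wall-clock seconds of fn() over `repeats` timed runs.

    `sync`, if given, is called before and after each timed run so that
    asynchronous GPU work is included in the measurement.
    """
    def run_once():
        if sync is not None:
            sync()  # drain queued work before starting the clock
        start = time.perf_counter()
        fn()
        if sync is not None:
            sync()  # wait for the work launched by fn() to finish
        return time.perf_counter() - start

    for _ in range(warmup):
        fn()  # untimed runs absorb one-off setup costs
    return statistics.median(run_once() for _ in range(repeats))

# Stand-in workload so the sketch runs without a model or a GPU.
calls = []
median_s = time_callable(lambda: calls.append(sum(range(10_000))), warmup=2, repeats=3)
print(f"Median duration: {median_s:.6f}s over 3 runs")
```

In this notebook, the workload could be a lambda that calls `model_gptq.generate(**inputs, max_new_tokens=30)` on a tokenized prompt, with `sync=torch.cuda.synchronize` when running on a CUDA device.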