# Transformers meets AWQ quantization (AutoAWQ and LLM-AWQ) for lighter and faster quantized inference of LLMs

![img](https://huggingface.co/datasets/ybelkada/documentation-images/resolve/main/Thumbnail.png)

In June 2023, the paper [AWQ: Activation-aware Weight Quantization for LLM Compression and Acceleration](https://arxiv.org/pdf/2306.00978.pdf) was published by Ji Lin et al. It details an algorithm that compresses transformer-based language models to a few bits per weight with only a small performance degradation. To learn more about this quantization method, see Professor Song Han's excellent [talk](https://hanlab.mit.edu/projects/awq).
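To make "a few bits per weight" concrete, here is a minimal sketch of group-wise, zero-point (asymmetric) round-to-nearest quantization in plain Python. It is not the AWQ algorithm itself, which additionally rescales salient weight channels using activation statistics before quantizing. The function names are hypothetical, and the defaults (4 bits, groups of 128) are only chosen to mirror the configuration used later in this notebook.

```python
def quantize_group(weights, w_bit=4):
    """Asymmetric round-to-nearest quantization of one group of weights."""
    q_max = 2 ** w_bit - 1                       # 15 for 4-bit integers
    # Include 0.0 in the range so the zero point stays inside [0, q_max].
    w_min = min(min(weights), 0.0)
    w_max = max(max(weights), 0.0)
    scale = (w_max - w_min) / q_max or 1.0       # fall back to 1.0 for an all-zero group
    zero = round(-w_min / scale)                 # integer zero point
    q = [min(q_max, max(0, round(w / scale) + zero)) for w in weights]
    return q, scale, zero


def dequantize_group(q, scale, zero):
    """Map the integers back to approximate floating-point weights."""
    return [(v - zero) * scale for v in q]


def quantize_weights(weights, group_size=128, w_bit=4):
    """Quantize a flat list of weights, one (q, scale, zero) triple per group."""
    return [
        quantize_group(weights[start:start + group_size], w_bit)
        for start in range(0, len(weights), group_size)
    ]
```

Each group stores its own `scale` and `zero`, so smaller groups track the local weight range more closely at the cost of more metadata.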

We now support loading models quantized with the AWQ algorithm in 🤗 Transformers, from two different libraries: [LLM-AWQ](https://github.com/mit-han-lab/llm-awq) and [AutoAWQ](https://github.com/casper-hansen/AutoAWQ).

In this notebook, let's check the different options this integration offers: quantizing a model, pushing a quantized model to the 🤗 Hub, loading an already quantized model from the Hub, and more!

## Load required libraries

Let us first install the required libraries: 🤗 Transformers and an AWQ library (llm-awq or autoawq).

In [None]:
!nvcc --version

nvcc: NVIDIA (R) Cuda compiler driver
Copyright (c) 2005-2023 NVIDIA Corporation
Built on Tue_Aug_15_22:02:13_PDT_2023
Cuda compilation tools, release 12.2, V12.2.140
Build cuda_12.2.r12.2/compiler.33191640_0


In [None]:
!pip install -q transformers accelerate

AutoAWQ defaults to CUDA 12.1. If your Google Colab runtime has CUDA < 12.1 installed, install the wheels built for CUDA 11.8 instead. For CUDA 12.1, you can simply run `pip install autoawq`.
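The choice between the two install paths can be sketched as a small check on the `nvcc --version` output shown earlier. This is only an illustration: `parse_cuda_release` and `autoawq_install_command` are hypothetical helper names, the 12.1 threshold comes from the sentence above, and the wheel URL is the CUDA 11.8 one used in this notebook's install cell.

```python
import re

# CUDA 11.8 wheel used in this notebook's install cell (AutoAWQ v0.1.6, Python 3.10).
CU118_WHEEL = (
    "https://github.com/casper-hansen/AutoAWQ/releases/download/"
    "v0.1.6/autoawq-0.1.6+cu118-cp310-cp310-linux_x86_64.whl"
)


def parse_cuda_release(nvcc_output):
    """Return (major, minor) from `nvcc --version` output."""
    match = re.search(r"release (\d+)\.(\d+)", nvcc_output)
    if match is None:
        raise ValueError("could not find a CUDA release in the nvcc output")
    return int(match.group(1)), int(match.group(2))


def autoawq_install_command(nvcc_output):
    """Hypothetical helper: pick the pip command described in the text."""
    if parse_cuda_release(nvcc_output) >= (12, 1):
        return "pip install autoawq"
    return f"pip install -q -U {CU118_WHEEL}"


sample = "Cuda compilation tools, release 12.2, V12.2.140"
print(autoawq_install_command(sample))
```

In a notebook, the `nvcc` output could be captured with `subprocess.run(["nvcc", "--version"], capture_output=True, text=True).stdout` instead of the hard-coded sample string.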

In [None]:
!pip install torch==2.3.1

Collecting torch==2.3.1
  Downloading torch-2.3.1-cp310-cp310-manylinux1_x86_64.whl.metadata (26 kB)
Collecting nvidia-cuda-nvrtc-cu12==12.1.105 (from torch==2.3.1)
  Downloading nvidia_cuda_nvrtc_cu12-12.1.105-py3-none-manylinux1_x86_64.whl.metadata (1.5 kB)
Collecting nvidia-cuda-runtime-cu12==12.1.105 (from torch==2.3.1)
  Downloading nvidia_cuda_runtime_cu12-12.1.105-py3-none-manylinux1_x86_64.whl.metadata (1.5 kB)
Collecting nvidia-cuda-cupti-cu12==12.1.105 (from torch==2.3.1)
  Downloading nvidia_cuda_cupti_cu12-12.1.105-py3-none-manylinux1_x86_64.whl.metadata (1.6 kB)
Collecting nvidia-cudnn-cu12==8.9.2.26 (from torch==2.3.1)
  Downloading nvidia_cudnn_cu12-8.9.2.26-py3-none-manylinux1_x86_64.whl.metadata (1.6 kB)
Collecting nvidia-cublas-cu12==12.1.3.1 (from torch==2.3.1)
  Downloading nvidia_cublas_cu12-12.1.3.1-py3-none-manylinux1_x86_64.whl.metadata (1.5 kB)
Collecting nvidia-cufft-cu12==11.0.2.54 (from torch==2.3.1)
  Downloading nvidia_cufft_cu12-11.0.2.54-py3-none-manylin

In [None]:
!pip install torch torchvision -U

Collecting torch
  Downloading torch-2.5.1-cp310-cp310-manylinux1_x86_64.whl.metadata (28 kB)
Collecting torchvision
  Downloading torchvision-0.20.1-cp310-cp310-manylinux1_x86_64.whl.metadata (6.1 kB)
Collecting nvidia-cuda-nvrtc-cu12==12.4.127 (from torch)
  Using cached nvidia_cuda_nvrtc_cu12-12.4.127-py3-none-manylinux2014_x86_64.whl.metadata (1.5 kB)
Collecting nvidia-cuda-runtime-cu12==12.4.127 (from torch)
  Using cached nvidia_cuda_runtime_cu12-12.4.127-py3-none-manylinux2014_x86_64.whl.metadata (1.5 kB)
Collecting nvidia-cuda-cupti-cu12==12.4.127 (from torch)
  Using cached nvidia_cuda_cupti_cu12-12.4.127-py3-none-manylinux2014_x86_64.whl.metadata (1.6 kB)
Collecting nvidia-cudnn-cu12==9.1.0.70 (from torch)
  Using cached nvidia_cudnn_cu12-9.1.0.70-py3-none-manylinux2014_x86_64.whl.metadata (1.6 kB)
Collecting nvidia-cublas-cu12==12.4.5.8 (from torch)
  Using cached nvidia_cublas_cu12-12.4.5.8-py3-none-manylinux2014_x86_64.whl.metadata (1.5 kB)
Collecting nvidia-cufft-cu12==11

In [None]:
!pip install autoawq -U

Collecting torch==2.3.1 (from autoawq)
  Using cached torch-2.3.1-cp310-cp310-manylinux1_x86_64.whl.metadata (26 kB)
Collecting nvidia-cuda-nvrtc-cu12==12.1.105 (from torch==2.3.1->autoawq)
  Using cached nvidia_cuda_nvrtc_cu12-12.1.105-py3-none-manylinux1_x86_64.whl.metadata (1.5 kB)
Collecting nvidia-cuda-runtime-cu12==12.1.105 (from torch==2.3.1->autoawq)
  Using cached nvidia_cuda_runtime_cu12-12.1.105-py3-none-manylinux1_x86_64.whl.metadata (1.5 kB)
Collecting nvidia-cuda-cupti-cu12==12.1.105 (from torch==2.3.1->autoawq)
  Using cached nvidia_cuda_cupti_cu12-12.1.105-py3-none-manylinux1_x86_64.whl.metadata (1.6 kB)
Collecting nvidia-cudnn-cu12==8.9.2.26 (from torch==2.3.1->autoawq)
  Using cached nvidia_cudnn_cu12-8.9.2.26-py3-none-manylinux1_x86_64.whl.metadata (1.6 kB)
Collecting nvidia-cublas-cu12==12.1.3.1 (from torch==2.3.1->autoawq)
  Using cached nvidia_cublas_cu12-12.1.3.1-py3-none-manylinux1_x86_64.whl.metadata (1.5 kB)
Collecting nvidia-cufft-cu12==11.0.2.54 (from torch=

In [None]:
!pip install -q -U https://github.com/casper-hansen/AutoAWQ/releases/download/v0.1.6/autoawq-0.1.6+cu118-cp310-cp310-linux_x86_64.whl

## LLM-AWQ integration with Transformers

As LLM-AWQ is not supported on T4 devices (such as the ones available on free-tier Google Colab instances), you need access to hardware that is compatible with that repository, and you need to follow the [instructions](https://github.com/mit-han-lab/llm-awq/tree/main) provided by the llm-awq repository.

You can follow the instructions in [this section](https://github.com/mit-han-lab/llm-awq/blob/main/examples/chat_demo.ipynb), then use the conversion script provided [here](https://github.com/mit-han-lab/llm-awq/blob/main/examples/convert_to_hf.py) to convert your model into a transformers-compatible version.

## AutoAWQ integration with Transformers

Let's first quantize `Qwen/Qwen2.5-Math-72B-Instruct` using `autoawq`!

In [None]:
from awq import AutoAWQForCausalLM
from transformers import AutoTokenizer

model_path = "Qwen/Qwen2.5-Math-72B-Instruct"
quant_path = "Qwen2.5-Math-72B-Instruct-AWQ"
quant_config = {"zero_point": True, "q_group_size": 128, "w_bit": 4, "version":"GEMM"}

# Load model
model = AutoAWQForCausalLM.from_pretrained(model_path)
tokenizer = AutoTokenizer.from_pretrained(model_path, trust_remote_code=True)

# Quantize
model.quantize(tokenizer, quant_config=quant_config)


In [None]:
quant_config
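As a rough back-of-the-envelope check of what this `quant_config` means for storage, the sketch below estimates bits per weight and total weight size. It rests on several assumptions that the notebook does not state: each group of 128 weights stores one 16-bit scale and one 4-bit zero point, every parameter is quantized (in practice some layers, such as embeddings, typically stay in higher precision), and the parameter count is the nominal 72 billion from the model name.

```python
def awq_bits_per_weight(w_bit=4, q_group_size=128, scale_bits=16, zero_bits=4):
    """Effective bits per weight, including assumed per-group metadata."""
    return w_bit + (scale_bits + zero_bits) / q_group_size


def estimate_size_gb(n_params, bits_per_weight):
    """Approximate weight storage in gigabytes (10**9 bytes)."""
    return n_params * bits_per_weight / 8 / 1e9


n_params = 72e9  # nominal parameter count taken from the model name (assumption)
bits = awq_bits_per_weight()

print(f"bits per weight: {bits}")                                     # 4.15625
print(f"16-bit weights:  {estimate_size_gb(n_params, 16):.1f} GB")    # 144.0 GB
print(f"4-bit AWQ:       {estimate_size_gb(n_params, bits):.1f} GB")  # 37.4 GB
```

Under these assumptions the quantized weights take roughly a quarter of the 16-bit footprint; the real checkpoint size will differ somewhat.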

In order to make it compatible with transformers, we need to modify the config file.

In [None]:
from transformers import AwqConfig, AutoConfig
from huggingface_hub import HfApi

# modify the config file so that it is compatible with transformers integration
quantization_config = AwqConfig(
    bits=quant_config["w_bit"],
    group_size=quant_config["q_group_size"],
    zero_point=quant_config["zero_point"],
    version=quant_config["version"].lower(),
).to_dict()

# The pretrained transformers model is stored in the `model` attribute, and the config expects a dict
model.model.config.quantization_config = quantization_config
# A second solution would be to use AutoConfig and push to the Hub (what we do for llm-awq)


# save model weights
model.save_quantized(quant_path)
tokenizer.save_pretrained(quant_path)
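The key renaming performed in the cell above can be written as a small standalone mapping, which may help when converting several AutoAWQ configs. This is a sketch in plain Python: `to_awq_config_kwargs` is a hypothetical helper, and it only reproduces the four fields used in this notebook.

```python
# AutoAWQ key (left) -> `AwqConfig` argument name (right), as used in the cell above.
AUTOAWQ_TO_TRANSFORMERS = {
    "w_bit": "bits",
    "q_group_size": "group_size",
    "zero_point": "zero_point",
    "version": "version",
}


def to_awq_config_kwargs(quant_config):
    """Hypothetical helper: translate an AutoAWQ quant_config dict into AwqConfig kwargs."""
    missing = sorted(set(AUTOAWQ_TO_TRANSFORMERS) - set(quant_config))
    if missing:
        raise KeyError(f"quant_config is missing keys: {missing}")
    kwargs = {new: quant_config[old] for old, new in AUTOAWQ_TO_TRANSFORMERS.items()}
    kwargs["version"] = str(kwargs["version"]).lower()  # "GEMM" -> "gemm"
    return kwargs


quant_config = {"zero_point": True, "q_group_size": 128, "w_bit": 4, "version": "GEMM"}
print(to_awq_config_kwargs(quant_config))
```

The resulting dictionary could then be passed as `AwqConfig(**kwargs)`, which matches the explicit call in the cell above.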

In [None]:
from google.colab import userdata
from huggingface_hub import HfApi

username = "adriszmar"

# Defined in the secrets tab in Google Colab
api = HfApi(token=userdata.get("HF_TOKEN"))

api.create_repo(
    repo_id=f"{username}/{quant_path}",
    repo_type="model"
)
api.upload_folder(
    repo_id=f"{username}/{quant_path}",
    folder_path=quant_path,  # local folder written by save_quantized above
)