This notebook shows how to quantize Llama 2 with GPTQ.
It runs on Google Colab Pro. It can also run on a machine with at least 24 GB of CPU RAM and a GPU with 12 GB of VRAM.

For more details, check out this article:
[Quantization of Llama 2 with GTPQ for Fast Inference on Your Computer](https://kaitchup.substack.com/p/quantization-of-llama-2-with-gtpq)
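
To get a feel for why quantization changes the memory picture, here is a back-of-the-envelope estimate of the weight size before and after 4-bit quantization. It is only a rough sketch: the figure of 7 billion parameters is an assumption based on the 7B model used later in this notebook, and the estimate ignores the per-group scales and offsets, any layers that stay in 16-bit, and activation memory.

```python
def weight_size_gb(n_params, bits_per_weight):
    """Approximate size of the weights alone, in gigabytes (1 GB = 1e9 bytes)."""
    return n_params * bits_per_weight / 8 / 1e9

# Assumption: roughly 7 billion parameters (the "7b" in the model name).
n_params = 7e9

fp16_gb = weight_size_gb(n_params, 16)  # original 16-bit weights
int4_gb = weight_size_gb(n_params, 4)   # weights stored with 4 bits each

print(f"16-bit: ~{fp16_gb:.1f} GB, 4-bit: ~{int4_gb:.1f} GB")
```

Under these assumptions the weights shrink by a factor of four, which is what makes inference on a consumer GPU realistic.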

We need the latest version of AutoGPTQ, so we will install it from GitHub.

In [None]:
!git clone https://github.com/PanQiWei/AutoGPTQ.git

Cloning into 'AutoGPTQ'...
remote: Enumerating objects: 2487, done.
remote: Counting objects: 100% (825/825), done.
remote: Compressing objects: 100% (427/427), done.
remote: Total 2487 (delta 503), reused 574 (delta 382), pack-reused 1662
Receiving objects: 100% (2487/2487), 7.48 MiB | 19.33 MiB/s, done.
Resolving deltas: 100% (1627/1627), done.


First, we patch the repository to enable use_auth_token support. Skip this step if the model you want to use doesn't require an access token. This patch may also become obsolete very soon, so you can try without it first.

In [None]:
!wget https://about.benjaminmarie.com/data/py/auto-gptq-patch/_utils.py
!wget https://about.benjaminmarie.com/data/py/auto-gptq-patch/auto.py

!mv _utils.py AutoGPTQ/auto_gptq/modeling/
!mv auto.py AutoGPTQ/auto_gptq/modeling/

--2023-07-26 20:59:49--  https://about.benjaminmarie.com/data/py/auto-gptq-patch/_utils.py
Resolving about.benjaminmarie.com (about.benjaminmarie.com)... 192.95.30.6
Connecting to about.benjaminmarie.com (about.benjaminmarie.com)|192.95.30.6|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 7224 (7.1K) [text/x-python]
Saving to: ‘_utils.py’


2023-07-26 20:59:50 (182 MB/s) - ‘_utils.py’ saved [7224/7224]

--2023-07-26 20:59:50--  https://about.benjaminmarie.com/data/py/auto-gptq-patch/auto.py
Resolving about.benjaminmarie.com (about.benjaminmarie.com)... 192.95.30.6
Connecting to about.benjaminmarie.com (about.benjaminmarie.com)|192.95.30.6|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 4697 (4.6K) [text/x-python]
Saving to: ‘auto.py’


2023-07-26 20:59:50 (134 MB/s) - ‘auto.py’ saved [4697/4697]



In [None]:
%cd AutoGPTQ
!pip install .


/content/AutoGPTQ
Processing /content/AutoGPTQ
  Preparing metadata (setup.py) ... done
Collecting accelerate>=0.19.0 (from auto-gptq==0.3.2+cu118)
  Downloading accelerate-0.21.0-py3-none-any.whl (244 kB)
Collecting datasets (from auto-gptq==0.3.2+cu118)
  Downloading datasets-2.14.0-py3-none-any.whl (492 kB)
Collecting rouge (from auto-gptq==0.3.2+cu118)
  Downloading rouge-1.0.1-py3-none-any.whl (13 kB)
Collecting safetensors (from auto-gptq==0.3.2+cu118)
  Downloading safetensors-0.3.1-cp310-cp310-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (1.3 MB)
Collecting transformers>=4.31.0 (from auto-gptq==0.3

We import all the necessary libraries:

*Note: all of them were installed as dependencies of auto-gptq.*

In [None]:
import random
import numpy as np
import torch
from datasets import load_dataset
from transformers import AutoTokenizer
from auto_gptq import AutoGPTQForCausalLM, BaseQuantizeConfig

Set up some variables, load the tokenizer, and define a function to deal with the data used for calibration.

In [None]:
# Replace the following with your own Hugging Face access token.
# This is my token (revoked, of course; never share your token online).
access_token = "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"

# The model we want to quantize
pretrained_model_dir = "meta-llama/Llama-2-7b-chat-hf"

# The name of the model once quantized.
# Note that we will only save the model; the tokenizer remains the same.
quantized_model_dir = "Llama-2-7b-4bit-chat-hf"

tokenizer = AutoTokenizer.from_pretrained(pretrained_model_dir, use_fast=True, use_auth_token=access_token)

# I copied and edited this function from the AutoGPTQ repository.
# The `model` argument is not used in the body.
def get_wikitext2(nsamples, seed, seqlen, model):

    # Load the validation split, since the training split is unnecessarily large
    traindata = load_dataset('wikitext', 'wikitext-2-raw-v1', split='validation')


    trainenc = tokenizer("\n\n".join(traindata['text']), return_tensors='pt')

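    # Seed every random number generator with the same value for reproducibility
    random.seed(seed)
    np.random.seed(seed)
    torch.random.manual_seed(seed)

    traindataset = []

    # It is unnecessary to use the entire dataset for calibration.
    # We only draw nsamples random windows of seqlen tokens (128 in this notebook).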
    for _ in range(nsamples):
        i = random.randint(0, trainenc.input_ids.shape[1] - seqlen - 1)
        j = i + seqlen
        inp = trainenc.input_ids[:, i:j]
        attention_mask = torch.ones_like(inp)
        traindataset.append({'input_ids':inp,'attention_mask': attention_mask})
    return traindataset
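
To make the sampling step concrete, here is a toy version of the same idea: draw random fixed-length windows from one long token sequence. It is a simplified sketch, not the AutoGPTQ code. A plain Python list stands in for the tokenized tensor, and `sample_windows` and its arguments are names made up for this illustration.

```python
import random

def sample_windows(token_ids, nsamples, seqlen, seed=0):
    """Return nsamples random contiguous windows of seqlen tokens."""
    rng = random.Random(seed)  # local generator, so the global state is untouched
    windows = []
    for _ in range(nsamples):
        # Same bounds as in get_wikitext2: the window always fits in the sequence
        start = rng.randint(0, len(token_ids) - seqlen - 1)
        windows.append(token_ids[start:start + seqlen])
    return windows

toy_ids = list(range(100))  # stand-in for the token ids of the whole corpus
windows = sample_windows(toy_ids, nsamples=4, seqlen=8)
for w in windows:
    print(w)
```

Each start position is drawn independently, so windows can overlap. Using the same seed returns the same windows, which keeps the calibration set reproducible.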


Then, we can finally quantize the model and save it.

In [None]:
#Process the data that will be used for calibration
traindataset = get_wikitext2(128, 0, 2048, pretrained_model_dir)


quantize_config = BaseQuantizeConfig(
    bits=4,  # quantize model to 4-bit
    group_size=128,  # it is recommended to set the value to 128
    desc_act=False,  # False can significantly speed up inference, but perplexity may be slightly worse
)


# load Llama 2
model = AutoGPTQForCausalLM.from_pretrained(pretrained_model_dir, quantize_config, use_auth_token=access_token)

# quantize the model, using traindataset for calibration
# this may take up to 10 minutes on Google Colab Pro
model.quantize(traindataset)

# save quantized model using safetensors
model.save_quantized(quantized_model_dir, use_safetensors=True)
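
To illustrate what `bits` and `group_size` control, the sketch below applies plain round-to-nearest quantization to one row of weights, group by group. This is deliberately not the GPTQ algorithm, which additionally uses the calibration data to compensate for the rounding error. It only shows that each group of `group_size` weights shares one scale and one offset, and that each weight is stored as an integer between 0 and 2^bits - 1. The function names are made up for this example, and a tiny group size is used so the output stays readable.

```python
import random

def quantize_group(weights, bits=4):
    """Round-to-nearest quantization of one group: returns (codes, scale, offset)."""
    qmax = 2 ** bits - 1  # 15 for 4-bit
    w_min, w_max = min(weights), max(weights)
    scale = (w_max - w_min) / qmax or 1.0  # fall back to 1.0 for a constant group
    codes = [round((w - w_min) / scale) for w in weights]
    return codes, scale, w_min

def dequantize_group(codes, scale, w_min):
    """Map the integer codes back to approximate floating-point weights."""
    return [c * scale + w_min for c in codes]

def quantize_row(row, bits=4, group_size=128):
    """Quantize a row in consecutive groups of group_size weights."""
    return [
        quantize_group(row[start:start + group_size], bits)
        for start in range(0, len(row), group_size)
    ]

rng = random.Random(0)
row = [rng.gauss(0.0, 0.02) for _ in range(16)]  # toy weights
groups = quantize_row(row, bits=4, group_size=8)  # 2 groups of 8 weights
restored = [w for g in groups for w in dequantize_group(*g)]
max_error = max(abs(a - b) for a, b in zip(row, restored))

print("codes of first group:", groups[0][0])
print(f"max reconstruction error: {max_error:.5f}")
```

A smaller `group_size` means more scales and offsets to store, but each one covers a narrower range of values, so the rounding error per weight tends to be smaller.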


Now, let's check that the quantized model still works. I'll discuss a proper evaluation method in another article.
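
One cheap sanity check before generating text is to confirm that the saved folder is much smaller than the original 16-bit checkpoint. The helper below is a minimal sketch using only the standard library. `dir_size_gb` is a name invented here, and the demo measures a temporary folder so that the snippet runs anywhere.

```python
import tempfile
from pathlib import Path

def dir_size_gb(path):
    """Total size of all files under path, in gigabytes (1 GB = 1e9 bytes)."""
    return sum(f.stat().st_size for f in Path(path).rglob("*") if f.is_file()) / 1e9

# Demo on a throwaway folder holding a single 1 MB file
with tempfile.TemporaryDirectory() as tmp:
    (Path(tmp) / "weights.bin").write_bytes(b"\0" * 1_000_000)
    demo_size = dir_size_gb(tmp)

print(f"{demo_size:.3f} GB")
```

In this notebook you would call it on the quantized model folder instead, for example `dir_size_gb("Llama-2-7b-4bit-chat-hf")`. For a 4-bit 7B model, a value of a few gigabytes would be the expected order of magnitude.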

In [None]:
# Load the quantized model
model = AutoGPTQForCausalLM.from_quantized(quantized_model_dir, use_safetensors=True, device="cuda:0", use_auth_token=False)

# Your test prompt
prompt = "Tell me about gravity"
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
output_ids = model.generate(**inputs, max_length=300)
print(tokenizer.decode(output_ids[0]))





<s> Tell me about gravity. everybody knows what it is, but can you explain it in a way that makes it easy to understand?
Gravity is a fundamental force of nature that causes objects with mass to attract each other. It is the weakest of the four fundamental forces of nature, but it is the one that dominates at large scales, shaping the structure of the universe as we know it today.
Gravity is a two-way force, meaning that any two objects with mass will attract each other. The strength of the gravitational force between two objects depends on their mass and the distance between them. The greater the mass of the objects and the closer they are to each other, the stronger the gravitational force will be.
Gravity is a universal force that affects everything with mass, from the smallest subatomic particles to the largest galaxies. It is the force that keeps planets in orbit around their stars, causes objects to fall towards the ground, and holds galaxies together.
One of the most famous exam