<a href="https://colab.research.google.com/github/ahmeda335/QuantizationMethods/blob/main/GPTQ_Quantization.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# GPTQ Quantization. 📌
- ➡️ GPTQ quantization is a post-training method that reduces the hardware a model requires and can increase inference speed.

- ➡️ The model's weights are stored in 4-bit precision. During inference they are dequantized back to fp16 on the fly inside a fused kernel, rather than being expanded in the GPU's global memory.
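
To make the "store in 4 bits, restore to fp16 on the fly" idea concrete, here is a minimal pure-Python sketch of group-wise 4-bit quantization. It is only an illustration of the storage format: GPTQ itself picks the 4-bit codes with an error-compensating, layer-by-layer procedure, whereas this sketch uses plain round-to-nearest. The function names and the sample weights are hypothetical.

```python
def quantize_group(weights, bits=4):
    """Asymmetric round-to-nearest quantization of one group of weights."""
    qmax = 2 ** bits - 1                          # 15 for 4-bit codes
    w_min, w_max = min(weights), max(weights)
    span = w_max - w_min
    scale = span / qmax if span > 0 else 1.0      # avoid a zero scale for constant groups
    zero = round(-w_min / scale)                  # integer code that maps back to ~0.0
    codes = [min(qmax, max(0, round(w / scale) + zero)) for w in weights]
    return codes, scale, zero


def dequantize_group(codes, scale, zero):
    """Restore approximate floating-point weights from the integer codes."""
    return [(q - zero) * scale for q in codes]


def pack_nibbles(codes):
    """Pack pairs of 4-bit codes into single bytes (low nibble first)."""
    if len(codes) % 2:
        codes = codes + [0]                       # pad to an even count
    return bytes(lo | (hi << 4) for lo, hi in zip(codes[0::2], codes[1::2]))


# Hypothetical group of weights (a real group would hold e.g. 128 values).
group = [-0.50, -0.12, 0.0, 0.07, 0.33, 0.91, -0.27, 0.64]

codes, scale, zero = quantize_group(group)
packed = pack_nibbles(codes)
restored = dequantize_group(codes, scale, zero)
max_error = max(abs(a - b) for a, b in zip(group, restored))

print(len(packed), "bytes for", len(group), "weights")
print("largest round-trip error:", max_error, "with scale", scale)
```

Each group keeps one scale and one zero point next to its packed codes, and the round-trip error stays within about half a quantization step.
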
### 📖 In this notebook we quantize a model with GPTQ and then share it on the Hugging Face Hub 🤗.


---
---

## 1️⃣
### 🚀 Install the libraries required for GPTQ quantization.


In [1]:
!pip install -q -U transformers peft accelerate optimum
!pip install -q datasets==2.15.0
!pip install -q auto-gptq --extra-index-url https://huggingface.github.io/autogptq-index/whl/cu117/


## 2️⃣
### 🚀 Creating a `GPTQConfig` object and setting the number of bits to quantize to.
### 🚀 We also use a dataset (as described in the GPTQ paper) to calibrate the weights for quantization. The dataset can be one of those recommended in the paper: `['wikitext2', 'c4', 'c4-new', 'ptb', 'ptb-new']`. I will use `'wikitext2'`. You can also use your own dataset if you want.

⚠️⚠️ Note: If you are using Llama or any other gated model, you must first request access to the model on Hugging Face and then log in here with your Hugging Face account. The login code is shown further below.

In [2]:
from transformers import AutoModelForCausalLM, AutoTokenizer, GPTQConfig

model_id = "facebook/opt-125m"
model_name = model_id.split('/')[-1]

tokenizer = AutoTokenizer.from_pretrained(model_id)
gptq_config = GPTQConfig(bits=4, group_size=128, dataset="wikitext2", tokenizer=tokenizer, desc_act=False)

The secret `HF_TOKEN` does not exist in your Colab secrets.
To authenticate with the Hugging Face Hub, create a token in your settings tab (https://huggingface.co/settings/tokens), set it as secret in your Google Colab and restart your session.
You will be able to reuse this secret in all of your notebooks.
Please note that authentication is recommended but still optional to access public models or datasets.
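
For the "use your own dataset" option mentioned above, `GPTQConfig` also accepts a list of strings as `dataset`. The sketch below shows one way to prepare such a list. The helper names, the sample documents, and the filtering thresholds are all hypothetical, and a real calibration run would need many more samples than this.

```python
def build_calibration_texts(documents, min_chars=20, max_samples=128):
    """Keep non-trivial, de-duplicated text samples for calibration."""
    seen, texts = set(), []
    for doc in documents:
        text = " ".join(doc.split())              # collapse whitespace and newlines
        if len(text) >= min_chars and text not in seen:
            seen.add(text)
            texts.append(text)
        if len(texts) == max_samples:
            break
    return texts


def make_gptq_config(tokenizer, texts, bits=4, group_size=128):
    """Build a GPTQConfig that calibrates on a custom list of strings."""
    # Imported lazily so the helper above works without transformers installed.
    from transformers import GPTQConfig
    return GPTQConfig(bits=bits, group_size=group_size, dataset=texts,
                      tokenizer=tokenizer, desc_act=False)


# Hypothetical raw documents; replace them with text from your own domain.
raw_documents = [
    "Quantization reduces the memory needed to store model weights.",
    "  Quantization reduces the memory needed\nto store model weights.  ",
    "too short",
    "Calibration samples should resemble the text the model will see in production.",
]

calibration_texts = build_calibration_texts(raw_documents)
print(calibration_texts)

# With the tokenizer loaded above, the config would then be created as:
# gptq_config = make_gptq_config(tokenizer, calibration_texts)
```
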



## 3️⃣
### 🚀 Creating the quantized model. This will take some time.

In [3]:
quantized_model = AutoModelForCausalLM.from_pretrained(model_id, device_map="auto", quantization_config=gptq_config)

`loss_type=None` was set in the config but it is unrecognised.Using the default loss: `ForCausalLMLoss`.


🚩 Measuring the memory footprint of the quantized model.

In [4]:
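def bytes_to_giga_bytes(num_bytes):
  # Parameter renamed so it no longer shadows the built-in `bytes`; 1 GB = 1024**3 bytes here.
  return num_bytes / 1024 / 1024 / 1024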

In [8]:
print(bytes_to_giga_bytes(quantized_model.get_memory_footprint()), "GB")

0.11647796630859375 GB
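
The footprint printed above can be reproduced by hand, which also shows why a 4-bit model is not a quarter of its fp16 size. The sketch below is a back-of-the-envelope estimate. It assumes the OPT-125m dimensions from the model's published config (vocabulary 50272, hidden size 768, feed-forward size 3072, 12 decoder blocks, 2048 positions plus an offset of 2). It also assumes that each quantized linear layer stores packed 4-bit weights, fp16 scales, packed 4-bit zero points, an int32 group index, and an fp16 bias. None of these numbers come from this notebook, so treat them as assumptions.

```python
# Assumed OPT-125m dimensions (from its published config, not from this notebook).
VOCAB, HIDDEN, FFN, LAYERS = 50272, 768, 3072, 12
POSITIONS = 2048 + 2          # OPT adds an offset of 2 to its learned positions
GROUP_SIZE = 128              # matches group_size in the GPTQConfig above

FP16, INT32 = 2, 4            # bytes per element

# (in_features, out_features) of the six linear layers in one decoder block:
# four attention projections plus the two feed-forward layers.
linears = [(HIDDEN, HIDDEN)] * 4 + [(HIDDEN, FFN), (FFN, HIDDEN)]


def quantized_linear_bytes(n_in, n_out):
    """Estimated storage of one 4-bit linear layer (assumed layout)."""
    groups = n_in // GROUP_SIZE
    qweight = n_in * n_out // 2          # two 4-bit codes per byte
    scales = groups * n_out * FP16       # one fp16 scale per group and output
    qzeros = groups * n_out // 2         # one 4-bit zero point per group and output
    g_idx = n_in * INT32                 # group index for every input feature
    bias = n_out * FP16
    return qweight + scales + qzeros + g_idx + bias


block = sum(quantized_linear_bytes(n_in, n_out) for n_in, n_out in linears)
block += 2 * (2 * HIDDEN) * FP16         # two LayerNorms (weight + bias) per block

# Token and position tables stay in fp16; the output head is assumed to share
# the token embedding table, so it is not counted a second time.
embeddings = (VOCAB + POSITIONS) * HIDDEN * FP16
final_norm = 2 * HIDDEN * FP16

total_bytes = embeddings + LAYERS * block + final_norm
print(total_bytes, "bytes =", total_bytes / 1024**3, "GB")
print("share held by fp16 embeddings:", round(embeddings / total_bytes, 2))
```

Under these assumptions the total is 125,067,264 bytes, which agrees with the value printed above. The fp16 embedding tables account for about 64% of it, because only the linear layers inside the decoder blocks are quantized.
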


## 4️⃣
### 🚀 Running inference to try the model.

In [16]:
tokenizer = AutoTokenizer.from_pretrained(model_id)

text = "I love"
inputs = tokenizer(text, return_tensors="pt").to(quantized_model.device)  # put the inputs on the model's device

out = quantized_model.generate(**inputs)
print(tokenizer.decode(out[0], skip_special_tokens=True))

I love the fact that the guy is wearing a shirt that says "I'm a woman" and the guy


## 5️⃣
### 🚀 Sharing the model to 🤗 Hub.


 🚩 Get your access token from Hugging Face and enter it below.

In [14]:
from huggingface_hub import login, HfApi


login("WRITE_YOUR_TOKEN_HERE")    # 🚩🚩 WRITE YOUR TOKEN HERE.

# Create an instance of the HfApi class
api = HfApi()

# Get user information
user_info = api.whoami()

# Print user information to verify
print("\nYour account:", user_info['name'])
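
If you would rather not paste the token into the notebook, one option is to read it from an environment variable. The sketch below assumes the token has been stored in a variable named `HF_TOKEN`; the helper names are hypothetical, and only `huggingface_hub.login` is a library call.

```python
import os


def resolve_hf_token(env=os.environ, var_name="HF_TOKEN"):
    """Return the token stored in `var_name`, or None if it is unset or blank."""
    token = env.get(var_name, "").strip()
    return token or None


def login_from_env():
    """Log in to the Hub with the token found in the environment."""
    # Imported lazily so resolve_hf_token can be used on its own.
    from huggingface_hub import login

    token = resolve_hf_token()
    if token is None:
        raise RuntimeError("Set the HF_TOKEN environment variable before logging in.")
    login(token=token)


# Demonstration with a stand-in mapping instead of the real environment.
print(resolve_hf_token({"HF_TOKEN": "  hf_example_token  "}))
print(resolve_hf_token({}))
```

Calling `login_from_env()` then replaces the hard-coded `login(...)` call, and the token never appears in the shared notebook.
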


🚩 Pushing the model you quantized to the Hub.

In [None]:
quantized_model.push_to_hub(f"{model_name}-gptq-4bit")
tokenizer.push_to_hub(f"{model_name}-gptq-4bit")