# Quantize models with GPTQ

This notebook quantizes Llama-2-7b to 4 bits with GPTQ, using the [`AutoGPTQ` library](https://github.com/PanQiWei/AutoGPTQ#quick-installation). Compare the results to the baseline performance [here](https://e2-dogfood.staging.cloud.databricks.com/?o=6051921418418893#notebook/418210139975057).

This approach resulted in a model memory footprint of 3.9 GB and a throughput of approximately 32 tokens per second.
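
As a rough sanity check on the footprint figure, the size of a 4-bit GPTQ model can be estimated from parameter counts. The sketch below is not part of the original measurement. It assumes that Llama-2-7b has about 6.74 billion parameters, that the embedding and output layers (vocabulary 32,000, hidden size 4,096) stay in fp16, and that each group of 128 weights stores one fp16 scale and one 4-bit zero point. The function name `estimate_gptq_footprint_gb` is hypothetical.

```python
def estimate_gptq_footprint_gb(linear_params, fp16_params, bits=4, group_size=128):
    """Back-of-envelope size of a GPTQ-quantized model in GB (1 GB = 1e9 bytes)."""
    weight_bytes = linear_params * bits / 8        # packed quantized weights
    n_groups = linear_params / group_size
    overhead_bytes = n_groups * (2 + bits / 8)     # assumed: fp16 scale + packed zero point per group
    fp16_bytes = fp16_params * 2                   # layers left unquantized
    return (weight_bytes + overhead_bytes + fp16_bytes) / 1e9

# Assumed figures for Llama-2-7b; not taken from this notebook.
VOCAB_SIZE, HIDDEN_SIZE = 32_000, 4_096
TOTAL_PARAMS = 6_738_415_616

fp16_params = 2 * VOCAB_SIZE * HIDDEN_SIZE         # input embedding + output head
linear_params = TOTAL_PARAMS - fp16_params         # treat everything else as quantized
estimate = estimate_gptq_footprint_gb(linear_params, fp16_params)
print(f"estimated footprint: {estimate:.2f} GB")
```

Under these assumptions the estimate lands close to the measured footprint. The exact number depends on packing details and on whether the measurement is reported in GB or GiB.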

## Notes and Resources

**Notes**

As of 2023-08-28, AutoGPTQ still had some gotchas and sharp edges. In particular, following the examples in the [Making LLMs lighter with AutoGPTQ and transformers](https://huggingface.co/blog/gptq-integration#native-support-of-gptq-models-in-%F0%9F%A4%97-transformers) release blog resulted in CUDA out-of-memory errors when quantizing. Removing the `device_map="auto"` argument fixed this. According to a [recent issue](https://github.com/PanQiWei/AutoGPTQ/issues/291#issuecomment-1695646845) in the AutoGPTQ repository, AutoGPTQ already uses GPUs correctly for quantization; the errors appear to come from an undesirable interaction with the device mapping from the accelerate library.

**Docs and Further Reading**
- https://gist.github.com/TheBloke/b47c50a70dd4fe653f64a12928286682#file-quant_autogptq-py
- https://huggingface.co/docs/optimum/llm_quantization/usage_guides/quantization
- https://huggingface.co/docs/transformers/main_classes/quantization
- https://huggingface.co/docs/transformers/v4.32.1/en/main_classes/quantization#transformers.GPTQConfig
- https://github.com/PanQiWei/AutoGPTQ/issues/179
- https://github.com/PanQiWei/AutoGPTQ/issues/291
- https://github.com/PanQiWei/AutoGPTQ/issues/291#issuecomment-1695992421

In [0]:
%pip install --upgrade torch transformers accelerate huggingface_hub optimum
dbutils.library.restartPython()

In [0]:
# install the AutoGPTQ Library corresponding to the CUDA version (11.8)
%pip install auto-gptq --extra-index-url https://huggingface.github.io/autogptq-index/whl/cu118/
dbutils.library.restartPython()

In [0]:
from utils import generate_text, clear_model, torch_profile_to_dataframe, wrap_module_with_profiler
import huggingface_hub
import pandas as pd
import torch
import transformers
from transformers import AutoTokenizer, pipeline
import os
import datetime
import time
import accelerate

In [0]:
huggingface_hub.login()

In [0]:
from transformers import AutoTokenizer, AutoModelForCausalLM, pipeline, GPTQConfig

tokenizer = AutoTokenizer.from_pretrained(
    "meta-llama/Llama-2-7b-hf", use_cache=True, padding_side="left"
)

quantization_config = GPTQConfig(
    bits=4,
    dataset="c4",
    tokenizer=tokenizer,
    group_size=128,  # default
)

model = AutoModelForCausalLM.from_pretrained(
    "meta-llama/Llama-2-7b-hf",
    trust_remote_code=True,
    torch_dtype=torch.float16,
    quantization_config=quantization_config,
)
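
To make `bits=4` and `group_size=128` concrete, here is a minimal sketch of group-wise quantization using plain round-to-nearest. This is a simplification and not what AutoGPTQ runs: GPTQ additionally uses calibration data (here `c4`) to adjust the not-yet-quantized weights and compensate for rounding error. All helper names are hypothetical.

```python
import random

def quantize_group(weights, bits=4):
    """Round-to-nearest quantization of one group to unsigned `bits`-bit integers."""
    qmax = 2 ** bits - 1                      # 15 for 4 bits
    lo = min(min(weights), 0.0)               # keep zero inside the representable range
    hi = max(max(weights), 0.0)
    scale = (hi - lo) / qmax or 1.0           # fall back to 1.0 for an all-zero group
    zero = round(-lo / scale)                 # integer zero point in [0, qmax]
    q = [min(qmax, max(0, round(w / scale) + zero)) for w in weights]
    return q, scale, zero

def dequantize_group(q, scale, zero):
    """Map the integers back to approximate float weights."""
    return [(v - zero) * scale for v in q]

def quantize_row(row, bits=4, group_size=128):
    """Quantize a row group by group; each group gets its own scale and zero point."""
    return [quantize_group(row[i:i + group_size], bits)
            for i in range(0, len(row), group_size)]

random.seed(0)
row = [random.gauss(0.0, 0.02) for _ in range(256)]   # stand-in for one weight row
groups = quantize_row(row)
restored = [w for q, s, z in groups for w in dequantize_group(q, s, z)]
max_err = max(abs(a - b) for a, b in zip(row, restored))
print(f"{len(groups)} groups, max abs error {max_err:.5f}")
```

A smaller group size means more scales and zero points to store, but each group covers a narrower range of values, so the rounding error per weight is typically lower.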

## Inspect Model

In [0]:
model

# Save the model
This will also save the quantization config.

In [0]:
save_folder = "/dbfs/daniel.liden/models/llama2GPTQc4/"
model.save_pretrained(save_folder)
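
One way to confirm that the quantization settings were saved is to read them back from the saved folder. The sketch below assumes that transformers stores them under a `quantization_config` key in `config.json`; the helper `read_quantization_config` is hypothetical.

```python
import json
from pathlib import Path

def read_quantization_config(model_dir):
    """Return the `quantization_config` entry of `config.json`, or None if it is absent."""
    config_path = Path(model_dir) / "config.json"
    with config_path.open() as f:
        config = json.load(f)
    return config.get("quantization_config")

# Example (not executed here): read_quantization_config(save_folder)
```

If the function returns `None`, the folder holds an unquantized checkpoint or the config was written elsewhere.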

# Load the quantized model
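First detach and reattach the compute so that the GPU starts from a clean state, then reinstall the required libraries. This ensures that the CUDA memory measurements below reflect only the loaded quantized model and not memory left over from the quantization run.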

In [0]:
%pip install --upgrade torch transformers accelerate huggingface_hub optimum
%pip install auto-gptq --extra-index-url https://huggingface.github.io/autogptq-index/whl/cu118/
dbutils.library.restartPython()

In [0]:
from utils import generate_text, clear_model, torch_profile_to_dataframe
import huggingface_hub
import pandas as pd
import torch
import transformers
from transformers import AutoTokenizer, AutoModelForCausalLM, pipeline, GPTQConfig
import os
import datetime
import time
import accelerate


prompts = [
    "Dreams are",
    "The future of technology is",
    "In a world where magic exists,",
    "The most influential person in history is",
    "One of the most intriguing mysteries of the universe is",
    "When humans finally ventured out into the cosmos, they discovered",
    "The relationship between artificial intelligence and humanity has always been",
    "As the boundaries of science and fiction blur, the implications for society become",
    "In the depths of the enchanted forest, ancient creatures and forgotten tales come to life, revealing",
    "While many believe that technological advancements will be the key to solving humanity's greatest challenges, others argue that it will only exacerbate existing inequalities, leading to"
]

huggingface_hub.login()


In [0]:
save_folder = "/dbfs/daniel.liden/models/llama2GPTQc4/"
gptq_config = GPTQConfig(bits=4, disable_exllama=False, use_cuda_fp16=False)
model = AutoModelForCausalLM.from_pretrained(
    save_folder,
    device_map="auto",
    quantization_config=gptq_config,
)
tokenizer = AutoTokenizer.from_pretrained(
    "meta-llama/Llama-2-7b-hf", use_cache=True, padding_side="left"
)

# Throughput and Memory

## Serial Prompts

In [0]:
out = generate_text(prompts, model, tokenizer, batch=False,
                    eos_token_id=tokenizer.eos_token_id, max_new_tokens=50)
pd.DataFrame(out)
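
Throughput is presumably the number of newly generated tokens divided by wall-clock time. The `generate_text` helper comes from this project's `utils` module and its internals are not shown, so the following is only a sketch of that calculation. `measure_throughput` and the stand-in `fake_generate` are hypothetical; in practice the callable would wrap a call to `model.generate` and return the number of new tokens.

```python
import time

def measure_throughput(generate_fn, prompts):
    """Run `generate_fn` on each prompt; return (total_new_tokens, seconds, tokens_per_second)."""
    start = time.perf_counter()
    total_tokens = sum(generate_fn(p) for p in prompts)
    elapsed = time.perf_counter() - start
    return total_tokens, elapsed, total_tokens / elapsed

def fake_generate(prompt, max_new_tokens=50):
    """Stand-in for real generation: sleeps briefly and reports `max_new_tokens` tokens."""
    time.sleep(0.01)
    return max_new_tokens

tokens, seconds, tps = measure_throughput(
    fake_generate, ["Dreams are", "The future of technology is"]
)
print(f"{tokens} tokens in {seconds:.3f}s -> {tps:.1f} tokens/s")
```

When timing real GPU work, it is safer to call `torch.cuda.synchronize()` before reading the clock so that queued kernels are included in the measurement.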

## Batch Prompts

In [0]:
out = generate_text(prompts, model, tokenizer, batch=True,
                    eos_token_id=tokenizer.eos_token_id, max_new_tokens=50)
out
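
Batched generation needs prompts of equal length, which is why the tokenizer is loaded with `padding_side="left"`: a decoder-only model continues from the last position, so padding has to go in front of the prompt rather than after it. Here is a minimal sketch of what left-padding produces, using made-up token ids and a hypothetical `left_pad` helper rather than the real tokenizer:

```python
def left_pad(sequences, pad_id):
    """Pad token-id lists on the left to a common length and build attention masks."""
    width = max(len(seq) for seq in sequences)
    input_ids, attention_mask = [], []
    for seq in sequences:
        n_pad = width - len(seq)
        input_ids.append([pad_id] * n_pad + list(seq))
        attention_mask.append([0] * n_pad + [1] * len(seq))   # 0 marks padding
    return input_ids, attention_mask

batch = [[11, 12], [21, 22, 23, 24]]      # made-up token ids for two prompts
ids, mask = left_pad(batch, pad_id=0)
print(ids)   # [[0, 0, 11, 12], [21, 22, 23, 24]]
print(mask)  # [[0, 0, 1, 1], [1, 1, 1, 1]]
```

The attention mask tells the model to ignore the padded positions, so the shorter prompt is not affected by the padding in front of it.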

# Torch Profiling -- Basic

In [0]:
import torch.profiler as profiler

with profiler.profile(
    record_shapes=True,
    profile_memory=True,
    activities=[profiler.ProfilerActivity.CPU, profiler.ProfilerActivity.CUDA],
) as prof:
    output = generate_text(prompts, model, tokenizer, eos_token_id=tokenizer.eos_token_id,
                           max_new_tokens=10)

torch_profile_to_dataframe(prof).sort_values("Self CUDA %", ascending=False)