## Evangelista – Hugging Models Quantization - GPTQ (see also AWQ, GGUF/GGML, SqueezeLLM)
- GPTQ: Accurate Post-Training Quantization for Generative Pre-trained Transformers.
  - Paper: https://arxiv.org/abs/2210.17323
  - Git: https://github.com/IST-DASLab/gptq

In [None]:
# Optional, Show Machine/Pod Info
!uname -a
!python --version && echo
!pip list | grep -E 'torch|auto' && echo
!lscpu | head -n 8 && echo
!nvidia-smi | grep -E 'NVIDIA|MiB'

### Install GPTQ

In [None]:
%env PIP_ROOT_USER_ACTION=ignore
!pip install -q --upgrade pip

# GPTQ with CUDA requires torch >= 2.2.0
!pip install -q --upgrade "torch==2.2.0+cu118" -f https://download.pytorch.org/whl/torch_stable.html
!pip uninstall -q torchaudio torchvision -y

!pip install -q --upgrade accelerate optimum transformers

!pip install -q --upgrade auto-gptq --extra-index-url https://huggingface.github.io/autogptq-index/whl/cu118/
#!pip install -q --upgrade auto-gptq[triton] --extra-index-url https://huggingface.github.io/autogptq-index/whl/cu118/

print('Done!\n')

### Log into HuggingFace - Needed To Upload Your Quantization OR If The Input Model Is Gated

In [None]:
# Use env variable token if defined, don't restart sessions
import huggingface_hub, os
huggingface_hub.login(token=os.getenv('HF_ACCESS_TOKEN'), new_session=False, add_to_git_credential=False)

# Optionally, Force re-login
#huggingface_hub.login(None, new_session=True, add_to_git_credential=False)

### Load Your Desired Model

In [None]:
# ENTER YOUR MODEL URI BELOW
# --------------------------------------------------------------------------------
%env HF_MODEL_URI = meta-llama/Llama-2-7b-chat-hf

import os, torch
from transformers import AutoTokenizer, AutoModelForCausalLM

torch.cuda.empty_cache()
#torch.set_default_device('cuda:0')                     # Using cuda as default doesn't work with GPTQQuantizer
#torch.set_default_dtype(torch.float16)

HF_MODEL_URI = os.environ.get('HF_MODEL_URI')
MODEL_NAME = os.path.basename(HF_MODEL_URI)

tokenizer = AutoTokenizer.from_pretrained(
    HF_MODEL_URI,
    trust_remote_code=True,
)
tokenizer.pad_token = tokenizer.eos_token

model = AutoModelForCausalLM.from_pretrained(
    HF_MODEL_URI,
    torch_dtype=torch.float16,
    trust_remote_code=True,
    device_map='auto',
)

### Quantize to 4-bit & Save
- HF Reference: https://huggingface.co/docs/transformers/en/quantization
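
As a rough guide to what 4-bit quantization with a group size of 128 buys, the sketch below estimates the weight storage before and after. It assumes one fp16 scale and one 4-bit zero-point per group, and a hypothetical round figure of 7e9 parameters. Real checkpoints usually come out somewhat larger, since layers such as embeddings are typically left unquantized.

```python
def gptq_size_gib(n_params, bits=4, group_size=128, scale_bits=16, zero_bits=4):
    """Rough size of quantized weights in GiB.

    Assumes every group of `group_size` weights stores one scale and one
    zero-point. The metadata widths are assumptions, not values read from
    an actual checkpoint.
    """
    bits_per_weight = bits + (scale_bits + zero_bits) / group_size
    return n_params * bits_per_weight / 8 / 2**30


n_params = 7e9  # hypothetical round figure for a "7B" model
fp16_gib = n_params * 16 / 8 / 2**30
q4_gib = gptq_size_gib(n_params)
print(f"fp16: {fp16_gib:.1f} GiB, 4-bit with group_size=128: {q4_gib:.1f} GiB")
```

The per-group metadata adds only a fraction of a bit per weight at `group_size=128`; smaller groups raise that overhead in exchange for finer-grained scales.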

In [None]:
from optimum.gptq import GPTQQuantizer

quantizer = GPTQQuantizer(
    use_cuda_fp16=True,                                 # Optimized kernel for fp16 (requires pytorch >=2.1)
    group_size=128,                                     # Number of weights sharing one scale/zero-point
    bits=4,                                             # Bits per weight
    dataset='c4',                                       # Calibration dataset used for quantization
    desc_act=False,                                     # Quantize columns in order of decreasing activation size. False speeds up inference but may slightly increase perplexity
    model_seqlen=2048,                                  # Maximum sequence length the model can handle
)

# Note, this may take a few hours
quantized_model = quantizer.quantize_model(model, tokenizer)

print('Saving model...')
QUANTIZED_MODEL_NAME=f'{MODEL_NAME}-GPTQ-Q{quantizer.group_size}_B{quantizer.bits}_{quantizer.dataset}'
quantizer.save(quantized_model, QUANTIZED_MODEL_NAME)
print('Done!')
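
To make `bits` and `group_size` more concrete, here is a minimal sketch of group-wise round-to-nearest quantization in plain Python. This is not the GPTQ algorithm: GPTQ additionally uses calibration data to compensate for rounding error, which this sketch omits. The function names and the synthetic weight row are illustrative assumptions.

```python
import random


def quantize_group(weights, bits=4):
    """Asymmetric round-to-nearest quantization of one group of floats."""
    qmax = 2**bits - 1
    lo, hi = min(weights), max(weights)
    scale = (hi - lo) / qmax or 1.0  # fall back to 1.0 for constant groups
    codes = [round((w - lo) / scale) for w in weights]
    return codes, scale, lo


def dequantize_group(codes, scale, lo):
    """Map integer codes back to approximate float weights."""
    return [c * scale + lo for c in codes]


def quantize_row(row, bits=4, group_size=128):
    """Quantize and dequantize a row, one scale/zero-point per group."""
    out = []
    for start in range(0, len(row), group_size):
        group = row[start:start + group_size]
        out.extend(dequantize_group(*quantize_group(group, bits)))
    return out


def mean_abs_error(original, approx):
    return sum(abs(a - b) for a, b in zip(original, approx)) / len(original)


random.seed(0)
row = [random.gauss(0.0, 0.02) for _ in range(256)]  # synthetic weights
row[200] = 1.0  # one outlier stretches the range of whichever group holds it

for g in (256, 128, 32):
    err = mean_abs_error(row, quantize_row(row, bits=4, group_size=g))
    print(f"group_size={g:3d}  mean abs error={err:.5f}")
```

Smaller groups confine the outlier's effect to fewer weights, which is the trade-off behind `group_size`: lower error at the cost of storing more scales.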

### Create HuggingFace Repo & Upload Model

In [None]:
from huggingface_hub import create_repo, HfApi

# ENTER YOUR HUGGINGFACE USER ID BELOW
# --------------------------------------------------------------------------------
HF_USER_ID='bevangelista'
REPO_ID=f'{HF_USER_ID}/{QUANTIZED_MODEL_NAME}'

# Create Repo -- NOTE: Make sure your token has WRITE permission
try:
    create_repo(REPO_ID, repo_type='model', private=False)
except Exception as err:
    print(err)

# Upload model weights and JSON configs (only files matching allow_patterns)
api = HfApi()
api.upload_folder(
    repo_id=REPO_ID,
    folder_path=QUANTIZED_MODEL_NAME,
    path_in_repo='/',
    allow_patterns=['*.bin', '*.safetensors', '*.json'],
    commit_message='Upload quantized models'
)
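
Before uploading, it can help to preview which files the `allow_patterns` filter would keep. The sketch below assumes the filter behaves like glob-style matching with Python's `fnmatch`. The file listing is hypothetical; the actual folder contents depend on the model and on whether tokenizer files were saved alongside it.

```python
from fnmatch import fnmatch

ALLOW_PATTERNS = ['*.bin', '*.safetensors', '*.json']


def would_upload(filename, patterns=ALLOW_PATTERNS):
    """Return True if the filename matches at least one allowed pattern."""
    return any(fnmatch(filename, pattern) for pattern in patterns)


# Hypothetical folder listing, for illustration only
files = ['config.json', 'model.safetensors', 'quantize_config.json',
         'tokenizer.json', 'tokenizer.model', 'README.md']
for name in files:
    status = 'upload' if would_upload(name) else 'skipped'
    print(f"{name:24s} {status}")
```

Under this assumption, a SentencePiece `tokenizer.model` file would be skipped, so adding `'*.model'` to the patterns may be worth considering if such a file is present.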

### Load and Use Quantized Model

In [None]:
from auto_gptq import AutoGPTQForCausalLM

torch.set_default_device('cuda:0')
torch.set_default_dtype(torch.float16)

model = AutoGPTQForCausalLM.from_quantized(
    QUANTIZED_MODEL_NAME,
    use_marlin=True,                                    # Optimized 4b kernels, may need weight repack
    use_safetensors=True,
)

In [None]:
input_ids = tokenizer('Apples are?', return_tensors='pt').input_ids
outputs = model.generate(
    input_ids=input_ids,
    do_sample=True,
)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))   # outputs has shape (batch, seq_len); decode the first sequence

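### Is It Possible To Fine-Tune a Quantized Model?

Yes, with adapters. The quantized base weights stay frozen while small trainable adapter layers (for example, LoRA) are trained on top of them.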
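
The adapter idea can be sketched in plain Python using a LoRA-style low-rank update. The class and helper names below are hypothetical and the base weights are random stand-ins for dequantized GPTQ weights; in practice an adapter library would handle this, so treat it as an illustration of the concept rather than a training recipe.

```python
import random


def matvec(matrix, vector):
    """Multiply a list-of-rows matrix by a vector."""
    return [sum(a * b for a, b in zip(row, vector)) for row in matrix]


class LoRALinear:
    """Frozen base weight plus a low-rank update: y = W x + (alpha / r) * B (A x)."""

    def __init__(self, base_weight, rank=4, alpha=8.0, seed=0):
        rng = random.Random(seed)
        out_dim, in_dim = len(base_weight), len(base_weight[0])
        self.base_weight = base_weight  # frozen, never updated during fine-tuning
        self.A = [[rng.gauss(0.0, 0.01) for _ in range(in_dim)] for _ in range(rank)]
        self.B = [[0.0] * rank for _ in range(out_dim)]  # zero init: adapter starts as a no-op
        self.scaling = alpha / rank

    def forward(self, x):
        base = matvec(self.base_weight, x)
        update = matvec(self.B, matvec(self.A, x))
        return [b + self.scaling * u for b, u in zip(base, update)]

    def trainable_parameters(self):
        """Only A and B would receive gradients."""
        return sum(len(r) for r in self.A) + sum(len(r) for r in self.B)


rng = random.Random(1)
W = [[rng.gauss(0.0, 0.02) for _ in range(64)] for _ in range(64)]  # stand-in base weights
layer = LoRALinear(W, rank=4)
x = [rng.gauss(0.0, 1.0) for _ in range(64)]

print("matches frozen layer at init:", layer.forward(x) == matvec(W, x))
print("trainable:", layer.trainable_parameters(), "of", 64 * 64, "base weights")
```

Because `B` starts at zero, the adapted layer initially reproduces the frozen layer exactly, and only the small `A` and `B` matrices would need to be trained and stored.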