## Evangelista – Hugging Models Quantization - AWQ (see also GGUF/GGML, SqueezeLLM)
- AWQ (Activation-aware Weight Quantization).
  - Paper: https://arxiv.org/abs/2306.00978
  - Git: https://github.com/mit-han-lab/llm-awq

In [None]:
# Optional, Show Machine/Pod Info
!uname -a
!python --version && echo
!lscpu | head -n 8 && echo
!nvidia-smi | grep -E 'NVIDIA|MiB'

### Install AWQ

In [None]:
%env PIP_ROOT_USER_ACTION=ignore
!pip install -q --upgrade pip
!pip install -q accelerate transformers

!pip install -q -U autoawq -f https://github.com/casper-hansen/AutoAWQ/releases/download/v0.1.8/autoawq-0.1.8+cu118-cp310-cp310-linux_x86_64.whl
#!pip install -q -U autoawq -f https://github.com/casper-hansen/AutoAWQ/releases/download/v0.1.8/autoawq-0.1.8+cu118-cp310-cp310-win_amd64.whl

print('Done!\n')

### Log in to Hugging Face - Needed to Upload Your Quantization or if the Input Model Is Gated

In [None]:
# Use env variable token if defined, don't restart sessions
import huggingface_hub, os
huggingface_hub.login(token=os.getenv('HF_ACCESS_TOKEN'), new_session=False, add_to_git_credential=False)

# Optionally, Force re-login
#huggingface_hub.login(None, new_session=True, add_to_git_credential=False)

### Load and Save Your Desired Model

In [None]:
# YOUR MODEL URI BELOW
# --------------------------------------------------------------------------------
%env HF_MODEL_URI = meta-llama/Llama-2-7b-chat-hf

import os, torch
from awq import AutoAWQForCausalLM
from transformers import AutoTokenizer

# Default CUDA and float16
torch.cuda.empty_cache()
torch.set_default_device('cuda')
torch.set_default_dtype(torch.float16)

HF_MODEL_URI = os.environ.get('HF_MODEL_URI')
MODEL_NAME = os.path.basename(HF_MODEL_URI)

tokenizer = AutoTokenizer.from_pretrained(
    HF_MODEL_URI,
    trust_remote_code=True,
)

model = AutoAWQForCausalLM.from_pretrained( # Use AutoAWQForCausalLM instead of AutoModelForCausalLM
    HF_MODEL_URI,
    torch_dtype=torch.float16,
    trust_remote_code=True,
)
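
Before loading, it can help to estimate how much memory the float16 weights will occupy on the device that ends up holding them. The sketch below is only back-of-the-envelope arithmetic: the helper name `weight_memory_gib` and the round figure of 7 billion parameters are assumptions for illustration, not values read from the model, and the estimate ignores activations and calibration data.

```python
def weight_memory_gib(num_params, bytes_per_param=2):
    """Approximate size of the weights alone, in GiB (float16 = 2 bytes per parameter)."""
    return num_params * bytes_per_param / 2**30

# Assumption: a round 7e9 parameters stands in for a "7B" model; the exact count differs.
approx_params = 7_000_000_000
fp16_gib = weight_memory_gib(approx_params)
print(f"Estimated float16 weights: {fp16_gib:.1f} GiB")
```

Comparing this number with the memory reported by `nvidia-smi` gives a quick sanity check before starting a long download.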

### Quantize Model to 4-bit
- HF Reference: https://huggingface.co/docs/transformers/en/quantization

In [None]:
awq_config = {
    'zero_point': True, 
    'q_group_size': 128,
    'w_bit': 4,
    'version': 'GEMM', # GEMV: 20% faster than GEMM (only batch size 1). GEMM: Much faster than FP16 at batch sizes below 8.
}

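# Output folder name, e.g. Llama-2-7b-chat-hf-AWQ-Q128_B4
QUANTIZED_MODEL_NAME = f"{MODEL_NAME}-AWQ-Q{awq_config['q_group_size']}_B{awq_config['w_bit']}"
model.quantize(tokenizer, quant_config=awq_config)

# Save the quantized weights and the tokenizer side by side so the folder is self-contained
model.save_quantized(QUANTIZED_MODEL_NAME, safetensors=True)
tokenizer.save_pretrained(QUANTIZED_MODEL_NAME)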
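
To make the `w_bit`, `q_group_size`, and `zero_point` settings more concrete, here is a minimal pure-Python sketch of group-wise zero-point quantization: each group of weights gets its own scale and integer zero point, and every weight is rounded to one of `2**w_bit` levels. This is not AutoAWQ's implementation, and it leaves out the activation-aware scaling that AWQ applies before rounding. The helper names (`quantize_group`, `dequantize_group`, `quantize_row`) and the synthetic sine-wave weights are assumptions made for illustration.

```python
import math

def quantize_group(group, w_bit=4):
    """Map one group of floats to integer codes in [0, 2**w_bit - 1]."""
    q_max = 2 ** w_bit - 1                      # 15 levels above zero for 4-bit
    w_min, w_max = min(group), max(group)
    scale = (w_max - w_min) / q_max
    if scale == 0.0:                            # constant group: any non-zero scale works
        scale = 1.0
    zero = min(q_max, max(0, round(-w_min / scale)))
    codes = [min(q_max, max(0, round(w / scale) + zero)) for w in group]
    return codes, scale, zero

def dequantize_group(codes, scale, zero):
    """Recover approximate floats from integer codes."""
    return [(c - zero) * scale for c in codes]

def quantize_row(row, q_group_size=128, w_bit=4):
    """Split a row into consecutive groups and quantize each one independently."""
    return [quantize_group(row[i:i + q_group_size], w_bit)
            for i in range(0, len(row), q_group_size)]

# Synthetic weights (assumption): 256 values, so two groups of 128.
row = [0.1 * math.sin(i) for i in range(256)]
groups = quantize_row(row)
restored = [w for codes, scale, zero in groups
            for w in dequantize_group(codes, scale, zero)]
max_err = max(abs(a - b) for a, b in zip(row, restored))
print(f"groups: {len(groups)}, max round-trip error: {max_err:.6f}")
```

In this sketch a smaller `q_group_size` means more scales and zero points to store but a tighter fit per group. The round-trip error stays within half a quantization step as long as each group contains both negative and positive values, as the synthetic data here does.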

### Create HuggingFace Repo & Upload Model

In [None]:
from huggingface_hub import create_repo, HfApi

# YOUR HUGGINGFACE USER ID BELOW
# --------------------------------------------------------------------------------
HF_USER_ID='soij'
REPO_ID=f'{HF_USER_ID}/{QUANTIZED_MODEL_NAME}'

# Create Repo -- NOTE: Make sure your token has WRITE permission
try:
    create_repo(REPO_ID, repo_type='model', private=False)
except Exception as err:
    print(err)

# Upload all files
api = HfApi()
api.upload_folder(
    repo_id=REPO_ID,
    folder_path=QUANTIZED_MODEL_NAME,
    path_in_repo='/',
    allow_patterns=['*.bin', '*.safetensors', '*.json'],
    commit_message='Upload quantized models'
)
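
Before uploading, it can be useful to preview which files the `allow_patterns` list would let through. The sketch below approximates that filtering with the standard library's `fnmatch`; the matching inside `huggingface_hub` may differ in details. The helper `filter_by_patterns` and the folder listing are hypothetical, chosen only to illustrate the effect of the patterns.

```python
from fnmatch import fnmatch

def filter_by_patterns(filenames, allow_patterns):
    """Keep only the names that match at least one glob pattern."""
    return [name for name in filenames
            if any(fnmatch(name, pattern) for pattern in allow_patterns)]

# Hypothetical folder contents (assumption); check your actual output folder.
files = [
    'config.json',
    'generation_config.json',
    'model.safetensors',
    'quant_config.json',
    'tokenizer.json',
    'tokenizer.model',
    'README.md',
]
allow_patterns = ['*.bin', '*.safetensors', '*.json']

kept = filter_by_patterns(files, allow_patterns)
skipped = [name for name in files if name not in kept]
print('uploaded:', kept)
print('skipped: ', skipped)
```

With these patterns, a SentencePiece `tokenizer.model` file or a model card would be left out. If your folder contains such files and you want them in the repo, extending the list (for example with `'*.model'` or `'*.md'`) is one option.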