## Evangelista – HuggingFace Model Quantization - GGUF/GGML  (see also AWQ, GPTQ, SqueezeLLM)
- GGUF is the model file format that replaces the deprecated GGML format. It is open source, comes from the llama.cpp team, and is designed to be more extensible and user friendly.
  - Quant Comparisons: https://deci.ai/blog/ggml-vs-gguf-comparing-formats-amp-top-5-methods-for-running-gguf
  - Pre-Quantized Models: https://huggingface.co/TheBloke/CodeLlama-34B-GGUF


In [None]:
# Optional, Show Machine/Pod Info
!uname -a
!python --version && echo
!pip list | grep torch && echo
!lscpu | head -n 8 && echo
!nvidia-smi | grep -E 'NVIDIA|MiB'

### Clone and Build llama.cpp
llama.cpp provides the tools to convert models to GGUF and to quantize them.

In [None]:
!apt update -qq -y
!apt install build-essential cmake -y >/dev/null

# Clone llama.cpp
!if [ ! -d "llama.cpp" ]; then git clone https://github.com/ggerganov/llama.cpp.git; fi
%cd llama.cpp

# Build llama.cpp
%env PIP_ROOT_USER_ACTION=ignore
!pip install -q --upgrade pip
!pip install -q -r requirements.txt
!make quantize
%cd ..

%reset -f
print('Done!\n')
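
Before moving on, it can help to confirm that the build produced the files the later cells call. A minimal sketch, assuming the layout used in this notebook (`llama.cpp/quantize` and `llama.cpp/convert.py`); `missing_tools` is a hypothetical helper, not part of llama.cpp:

```python
import os

def missing_tools(root='llama.cpp', tools=('quantize', 'convert.py')):
    """Return the expected tool files that are absent under root."""
    return [t for t in tools if not os.path.isfile(os.path.join(root, t))]

# Report what, if anything, is missing from the checkout/build
missing = missing_tools()
print('All tools present' if not missing else f'Missing: {missing}')
```

An empty list means both the compiled `quantize` binary and the conversion script are where the following cells expect them.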

### Log Into HuggingFace - Needed To Upload the Quantized Model or When the Input Model Is Gated

In [None]:
# Use env variable token if defined, don't restart sessions
import huggingface_hub, os
huggingface_hub.login(token=os.getenv('HF_ACCESS_TOKEN'), new_session=False, add_to_git_credential=False)

# Optionally, Force re-login
#huggingface_hub.login(None, new_session=True, add_to_git_credential=False)
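
An environment variable that is set but blank (or padded with whitespace) would be passed to `login` as if it were a real token. A minimal sketch of normalizing it first, assuming the same `HF_ACCESS_TOKEN` variable as above; `resolve_hf_token` is a hypothetical helper:

```python
import os

def resolve_hf_token(env=None, var_name='HF_ACCESS_TOKEN'):
    """Return the stripped token from the environment, or None if unset or blank."""
    env = os.environ if env is None else env
    token = (env.get(var_name) or '').strip()
    return token or None

token = resolve_hf_token()
print('Token found' if token else 'No token set, login would have to prompt for one')
```

The result can be passed as `token=` to `huggingface_hub.login`; when it is `None`, the call behaves like the commented-out forced login above.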

### Download and Locally Save The Desired Model

In [None]:
# ENTER YOUR MODEL URI BELOW
# --------------------------------------------------------------------------------
%env HF_MODEL_URI = meta-llama/Llama-2-7b-chat-hf

import os, torch
from transformers import AutoModelForCausalLM, AutoTokenizer

# Default to CUDA when a GPU is available (otherwise stay on CPU) and use float16
if torch.cuda.is_available():
    torch.cuda.empty_cache()
    torch.set_default_device('cuda')
torch.set_default_dtype(torch.float16)

HF_MODEL_URI = os.environ.get('HF_MODEL_URI')
MODEL_NAME = os.path.basename(HF_MODEL_URI)

tokenizer = AutoTokenizer.from_pretrained(
    HF_MODEL_URI,
    trust_remote_code=True,
)

model = AutoModelForCausalLM.from_pretrained(
    HF_MODEL_URI,
    torch_dtype=torch.float16,
    trust_remote_code=True,
)

print('Saving model...')
tokenizer.save_pretrained(MODEL_NAME)
model.save_pretrained(MODEL_NAME)
print('Done!\n')
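
Loading the weights through `transformers` only to save them again needs enough memory for the whole model. If the goal is just to get the files on disk for conversion, one alternative is to copy the repository files directly. A minimal sketch, assuming `huggingface_hub.snapshot_download` is available in the installed version and that the repo ships the weights, config, and tokenizer files the converter needs; `local_dir_for` and `download_snapshot` are hypothetical helper names:

```python
import os

def local_dir_for(model_uri):
    """Derive the local folder name the same way MODEL_NAME is derived above."""
    return os.path.basename(model_uri.rstrip('/'))

def download_snapshot(model_uri, token=None):
    """Copy all files of a Hub repo into ./<model name> without loading the weights."""
    # Imported lazily so the helper above works even without huggingface_hub installed
    from huggingface_hub import snapshot_download
    return snapshot_download(
        repo_id=model_uri,
        local_dir=local_dir_for(model_uri),
        token=token,
    )

print(local_dir_for('meta-llama/Llama-2-7b-chat-hf'))
```

Calling `download_snapshot(HF_MODEL_URI)` would then stand in for the load-and-save cell. Note that a repo holding several copies of the weights in different file formats is downloaded in full.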

### Convert Model to GGUF

In [None]:
# Note: convert.py has no quiet mode; check=True raises if the conversion fails
import subprocess
subprocess.run(['python', 'llama.cpp/convert.py', MODEL_NAME, '--outfile', f'{MODEL_NAME}.gguf'], check=True)
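
A quick sanity check on the converted file is to read its header: a GGUF file begins with the four ASCII bytes `GGUF`, followed by a 32-bit format version. A minimal sketch, assuming a little-endian file; `read_gguf_header` is a hypothetical helper, not a llama.cpp API:

```python
import struct

GGUF_MAGIC = b'GGUF'

def read_gguf_header(path):
    """Return (is_gguf, version) based on the first 8 bytes of the file."""
    with open(path, 'rb') as f:
        head = f.read(8)
    # Too short, or wrong magic: not a GGUF file
    if len(head) < 8 or head[:4] != GGUF_MAGIC:
        return False, None
    # Version is an unsigned 32-bit integer (little-endian assumed here)
    (version,) = struct.unpack('<I', head[4:8])
    return True, version
```

For example, `read_gguf_header(f'{MODEL_NAME}.gguf')` should report `True` plus a version number if the conversion above succeeded.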

### Quantize to Q4_K_M
Quantizations Reference: https://github.com/ggerganov/llama.cpp/pull/1684

In [None]:
NUM_THREADS = 16
QUANTIZATION_MODE = 'Q4_K_M'
QUANTIZED_MODEL_NAME = f'{MODEL_NAME}-GGUF-{QUANTIZATION_MODE}'
QUANTIZED_MODEL_URI = f'{QUANTIZED_MODEL_NAME}.gguf'

# Note: quantize has no quiet mode; check=True raises if quantization fails
subprocess.run(['llama.cpp/quantize', f'{MODEL_NAME}.gguf', QUANTIZED_MODEL_URI, QUANTIZATION_MODE, str(NUM_THREADS)], check=True)
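
The same call can be repeated for other quantization types. A minimal sketch that only builds the argument lists, following the argument order used above (input file, output file, type, thread count) and this notebook's naming scheme; `quantize_command` is a hypothetical helper, and the extra modes are simply other types llama.cpp offers:

```python
def quantize_command(model_name, mode, num_threads=16, binary='llama.cpp/quantize'):
    """Build the argument list for one quantize run."""
    src = f'{model_name}.gguf'
    dst = f'{model_name}-GGUF-{mode}.gguf'
    return [binary, src, dst, mode, str(num_threads)]

# Print the commands for a few modes instead of running them
for mode in ('Q4_K_M', 'Q5_K_M', 'Q8_0'):
    print(' '.join(quantize_command('Llama-2-7b-chat-hf', mode)))
```

Each list can be handed to `subprocess.run(..., check=True)` once the converted `.gguf` file exists.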

### Create HuggingFace Repo & Upload Model

In [None]:
from huggingface_hub import create_repo, HfApi

# ENTER YOUR HUGGINGFACE USER ID BELOW
# --------------------------------------------------------------------------------
HF_USER_ID='bevangelista'
REPO_ID=f'{HF_USER_ID}/{QUANTIZED_MODEL_NAME}'

# Create the repo (exist_ok=True makes re-runs safe) -- NOTE: Make sure your token has WRITE permission
create_repo(REPO_ID, repo_type='model', private=False, exist_ok=True)

# Upload the quantized GGUF file
api = HfApi()
api.upload_file(
    repo_id=REPO_ID,
    path_or_fileobj=QUANTIZED_MODEL_URI,
    path_in_repo=QUANTIZED_MODEL_URI,
    commit_message='Upload quantized model'
)
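
Once the upload finishes, the file can be fetched from the Hub's `resolve` endpoint. A minimal sketch of how that URL is formed, assuming the default `main` branch; `hub_file_url` is a hypothetical helper written out for illustration (`huggingface_hub` ships `hf_hub_url` for the same purpose):

```python
from urllib.parse import quote

def hub_file_url(repo_id, filename, revision='main'):
    """Build the direct download URL for one file in a model repo."""
    return (
        f'https://huggingface.co/{repo_id}'
        f'/resolve/{quote(revision, safe="")}/{quote(filename)}'
    )

print(hub_file_url('bevangelista/Llama-2-7b-chat-hf-GGUF-Q4_K_M',
                   'Llama-2-7b-chat-hf-GGUF-Q4_K_M.gguf'))
```

With the variables from the cell above, `hub_file_url(REPO_ID, QUANTIZED_MODEL_URI)` gives the link to share or download from.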