## Evangelista - Hugging Face Models Quantization - GGUF/GGML  (see also AWQ, SqueezeLLM)
- GGUF: the file format that replaces GGML (now deprecated). It is open source, more extensible, and more user friendly.
  - Quant Comparisons: https://deci.ai/blog/ggml-vs-gguf-comparing-formats-amp-top-5-methods-for-running-gguf
  - Pre-Quantized Models: https://huggingface.co/TheBloke/CodeLlama-34B-GGUF
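
One quick way to tell a GGUF file from a legacy GGML file is to read its header. The sketch below is illustrative only: `read_gguf_header` is a hypothetical helper, and it assumes the header layout used by GGUF version 2 and later (4-byte magic `GGUF`, then a little-endian uint32 version, uint64 tensor count, and uint64 metadata key/value count).

```python
import struct

GGUF_MAGIC = b"GGUF"  # first four bytes of every GGUF file


def read_gguf_header(path):
    """Return (version, tensor_count, metadata_kv_count) for a GGUF file.

    Assumes the GGUF v2+ layout; raises ValueError for anything else,
    including legacy GGML files, which start with a different magic.
    """
    with open(path, "rb") as f:
        magic = f.read(4)
        if magic != GGUF_MAGIC:
            raise ValueError(f"not a GGUF file (magic={magic!r})")
        header = f.read(20)  # 4 + 8 + 8 bytes
        if len(header) != 20:
            raise ValueError("truncated GGUF header")
        # "<" = little-endian without padding, I = uint32, Q = uint64
        version, tensor_count, kv_count = struct.unpack("<IQQ", header)
    return version, tensor_count, kv_count
```

Calling it on a converted or downloaded `.gguf` file is a cheap sanity check before spending time on quantization.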

### Clone and Build llama.cpp
llama.cpp provides the tools to convert a model to GGUF and quantize it.


In [None]:
!apt update -y
!apt install build-essential cmake -y >/dev/null

# Clone llama.cpp
!if [ ! -d "llama.cpp" ]; then git clone https://github.com/ggerganov/llama.cpp.git; fi
%cd llama.cpp

# Build llama.cpp
!pip install -q --upgrade pip
!pip install -q -r requirements.txt
!make quantize
%cd ..

%reset -f
print('Done!\n')

### Log into HuggingFace - Needed To Upload Your Model OR If Input Model Is Gated

In [None]:
# NOTE: Login is needed to upload your model, or to download a gated input model
from huggingface_hub import notebook_login
notebook_login()

### Download and Locally Save The Desired Model

In [None]:
# YOUR MODEL URI BELOW
# --------------------------------------------------------------------------------
%env HF_MODEL_URI = meta-llama/Llama-2-7b-chat-hf

import torch, os
from transformers import AutoModelForCausalLM, AutoTokenizer

# Default CUDA and float16
torch.set_default_device('cuda')
torch.set_default_dtype(torch.float16)

HF_MODEL_URI = os.environ.get('HF_MODEL_URI')
MODEL_NAME = os.path.basename(HF_MODEL_URI)
GGUF_MODEL_URI = f'{MODEL_NAME}/{MODEL_NAME}.gguf'

tokenizer = AutoTokenizer.from_pretrained(
    HF_MODEL_URI,
    pad_token='<pad>',
    trust_remote_code=True,
    token=os.getenv('HF_ACCESS_TOKEN') # optionally, set env var as token for repo access
)

model = AutoModelForCausalLM.from_pretrained(
    HF_MODEL_URI,
    torch_dtype=torch.float16,
    trust_remote_code=True,
    token=os.getenv('HF_ACCESS_TOKEN') # optionally, set env var as token for repo access
)

print('Saving model...')
tokenizer.save_pretrained(MODEL_NAME)
model.save_pretrained(MODEL_NAME)
print('Done!\n')
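
After saving, it can help to confirm that the checkpoint actually landed on disk at a plausible size (a float16 checkpoint takes roughly 2 bytes per parameter). A minimal sketch, where `dir_size_bytes` is a hypothetical helper and not part of any library:

```python
import os


def dir_size_bytes(path):
    """Sum the sizes of all regular files under `path`, recursively."""
    total = 0
    for root, _dirs, files in os.walk(path):
        for name in files:
            total += os.path.getsize(os.path.join(root, name))
    return total
```

In this notebook it could be called as `dir_size_bytes(MODEL_NAME) / 1e9` to get the saved size in GB.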

### Convert Model to GGUF

In [None]:
#!python ./convert.py ./models/$MODEL_NAME --outfile ./models/$MODEL_NAME/$MODEL_NAME.gguf

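import subprocess, sys

# convert.py is a Python script, so run it with the current interpreter.
# check=True raises if the conversion fails instead of continuing silently.
subprocess.run([sys.executable, 'llama.cpp/convert.py', MODEL_NAME, '--outfile', GGUF_MODEL_URI], check=True)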
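
By default `subprocess.run` does not raise when the command exits with an error, so a failed conversion can go unnoticed until the quantize step. One way to guard against that is a small wrapper; `run_step` below is a hypothetical helper, shown with a harmless stand-in command rather than the real convert call.

```python
import subprocess
import sys


def run_step(cmd):
    """Echo a command, run it, and raise CalledProcessError on a non-zero exit."""
    print("$", " ".join(cmd))
    result = subprocess.run(cmd, check=True, capture_output=True, text=True)
    return result.stdout


# Stand-in command; in the notebook the list would be the convert command.
output = run_step([sys.executable, "-c", "print('converted')"])
```

Note that `capture_output=True` hides live progress output in the notebook; dropping it trades the returned text for streaming logs.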

### Quantize to Q4_K_M
Quantizations Reference: https://github.com/ggerganov/llama.cpp/pull/1684

In [None]:
#!./quantize ./models/$MODEL_NAME/$MODEL_NAME.gguf ./models/$MODEL_NAME/$MODEL_NAME-Q4_K_M.gguf Q4_K_M 16

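NUM_THREADS = 16
QUANTIZATION_MODE = 'Q4_K_M'
QUANTIZED_MODEL_NAME = f'{MODEL_NAME}-{QUANTIZATION_MODE}'
GGUF_QUANTIZED_MODEL_URI = f'{MODEL_NAME}/{QUANTIZED_MODEL_NAME}.gguf'

# quantize takes positional arguments: <input> <output> <type> [nthreads]
subprocess.run(['llama.cpp/quantize', GGUF_MODEL_URI, GGUF_QUANTIZED_MODEL_URI, QUANTIZATION_MODE, str(NUM_THREADS)], check=True)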
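
To get a rough idea of what quantization buys, the weight storage can be estimated as parameters times bits per weight. This is a back-of-the-envelope sketch: the 7 billion parameter count is a round-number assumption, and 4.5 bits per weight is the figure the referenced PR gives for the plain Q4_K type. Q4_K_M keeps some tensors at higher precision, so the real file will be somewhat larger.

```python
def estimated_size_gb(n_params, bits_per_weight):
    """Approximate weight storage in GB (10^9 bytes), ignoring file metadata."""
    return n_params * bits_per_weight / 8 / 1e9


N_PARAMS = 7_000_000_000  # assumption: round figure for a "7B" model

fp16_gb = estimated_size_gb(N_PARAMS, 16)   # unquantized float16 weights
q4k_gb = estimated_size_gb(N_PARAMS, 4.5)   # Q4_K at 4.5 bits per weight

print(f"float16: ~{fp16_gb:.1f} GB, Q4_K: ~{q4k_gb:.1f} GB")
```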

### Create HuggingFace Repo & Upload Model

In [None]:
from huggingface_hub import create_repo, HfApi

# YOUR HUGGINGFACE USER ID BELOW
# --------------------------------------------------------------------------------
HF_USER_ID='soij'
REPO_ID=f'{HF_USER_ID}/{MODEL_NAME}-{QUANTIZATION_MODE}'

# Create Repo -- NOTE: Make sure your token has WRITE permission
try:
    create_repo(REPO_ID, repo_type='model', private=False)
except Exception as err:
    print(err)

# Upload the quantized GGUF file
api = HfApi()
api.upload_file(
    repo_id=REPO_ID,
    path_or_fileobj=GGUF_QUANTIZED_MODEL_URI,
    path_in_repo=os.path.basename(GGUF_QUANTIZED_MODEL_URI),  # file name inside the repo
    commit_message='Upload quantized model'
)