<a href="https://colab.research.google.com/github/Vaibhavs10/notebooks/blob/main/hf_gguf_convert.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Convert model checkpoints from the HF Hub to GGUF

Flow:

1. Download the model checkpoints (safetensors or bin) from the Hub.
2. Convert them to GGUF quants with one of llama.cpp's conversion scripts (`convert-hf-to-gguf.py` is used below).
3. Upload the quants back to the Hub.
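
The three steps share a handful of local paths, so it can help to derive them in one place. The sketch below is not part of the original notebook: `ConversionPlan` and `plan_conversion` are hypothetical names, and the file naming simply mirrors the one used in the cells further down.

```python
from dataclasses import dataclass


@dataclass
class ConversionPlan:
    local_dir: str     # folder the Hub snapshot is downloaded into
    f16_file: str      # intermediate f16 GGUF file
    quant_files: dict  # quant type -> path of the quantised file


def plan_conversion(hub_id, quant_types=("q4_k_m", "q5_k_m")):
    """Derive the local paths used by the download, convert, and upload steps."""
    model_name = hub_id.split("/")[-1]
    base_path = f"{model_name}/{model_name.lower()}"
    return ConversionPlan(
        local_dir=model_name,
        f16_file=f"{model_name}-f16.gguf",
        quant_files={q: f"{base_path}.{q}.gguf" for q in quant_types},
    )


plan = plan_conversion("microsoft/phi-2")
print(plan.local_dir)              # phi-2
print(plan.f16_file)               # phi-2-f16.gguf
print(plan.quant_files["q4_k_m"])  # phi-2/phi-2.q4_k_m.gguf
```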


Criteria:

> Keep dependencies to a minimum; all conversion must happen within the llama.cpp repo.

## Set up environment

1. Clone llama.cpp
2. Install the dependencies listed in its `requirements.txt`.
3. Install the `huggingface_hub` library to interact with the Hub.

In [2]:
!git clone -q https://github.com/ggerganov/llama.cpp

fatal: destination path 'llama.cpp' already exists and is not an empty directory.
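
The `fatal` message above only means the cell was re-run after an earlier clone. A minimal sketch of a guard that skips the clone when the folder is already populated follows; `clone_if_missing` is a hypothetical helper, not something llama.cpp or the notebook provides.

```python
import os
import subprocess

LLAMA_CPP_URL = "https://github.com/ggerganov/llama.cpp"


def clone_if_missing(repo_url=LLAMA_CPP_URL, dest="llama.cpp"):
    """Clone repo_url into dest unless dest already exists and is non-empty.

    Returns True if a clone was run, False if it was skipped.
    """
    if os.path.isdir(dest) and os.listdir(dest):
        return False  # an earlier run already populated the directory
    subprocess.run(["git", "clone", "-q", repo_url, dest], check=True)
    return True
```

Calling `clone_if_missing()` in place of the `!git clone` line should make the cell safe to re-run, at the cost of not updating an existing checkout.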


In [3]:
!pip install -q -r llama.cpp/requirements.txt

In [1]:
!pip install -q huggingface_hub

## Download the ckpts from the Hub.



In [5]:
MODEL_HUB_ID = "microsoft/phi-2"

In [7]:
import os
from huggingface_hub import snapshot_download

model_name = MODEL_HUB_ID.split('/')[-1]
# Prefix for the quantised files written later, e.g. "phi-2/phi-2"
base_path = f"{model_name}/{model_name.lower()}"

# Download the model with huggingface_hub; skip if the folder already exists
local_dir = model_name
if not os.path.exists(local_dir):
    snapshot_download(repo_id=MODEL_HUB_ID, local_dir=local_dir, local_dir_use_symlinks=False, revision="main")

The secret `HF_TOKEN` does not exist in your Colab secrets.
To authenticate with the Hugging Face Hub, create a token in your settings tab (https://huggingface.co/settings/tokens), set it as secret in your Google Colab and restart your session.
You will be able to reuse this secret in all of your notebooks.
Please note that authentication is recommended but still optional to access public models or datasets.
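
The snapshot includes files such as the README and license that the conversion step may not need. As a hedged sketch, `snapshot_download` accepts an `allow_patterns` argument with glob-style patterns; the pattern list below is an assumption about what a typical conversion needs, and `is_needed` is a hypothetical local helper for previewing which files would match.

```python
from fnmatch import fnmatch

# Assumed pattern list: weights plus config and tokenizer files.
CONVERSION_PATTERNS = ["*.safetensors", "*.bin", "*.json", "*.txt", "*.model"]


def is_needed(filename, patterns=CONVERSION_PATTERNS):
    """Return True if filename matches any of the glob patterns."""
    return any(fnmatch(filename, p) for p in patterns)


def download_for_conversion(hub_id, local_dir, patterns=CONVERSION_PATTERNS):
    """Download only the files matching patterns from the Hub repo."""
    # Imported lazily so is_needed can be used without the library installed.
    from huggingface_hub import snapshot_download

    return snapshot_download(repo_id=hub_id, local_dir=local_dir, allow_patterns=patterns)
```

If a repo ships both safetensors and bin weights, keeping both patterns would download the weights twice, so trimming the list to one format may be preferable.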



## Quantise the model ckpts

> Convert the ckpt to an f16 GGUF file first

In [8]:
!python llama.cpp/convert-hf-to-gguf.py phi-2 --outfile phi-2-f16.gguf --outtype f16

Loading model: phi-2
gguf: This GGUF file is for Little Endian only
Set model parameters
Set model tokenizer
The cache for model files in Transformers v4.22.0 has been updated. Migrating your old cache. This is a one-time only operation. You can interrupt this and resume the migration later on by calling `transformers.utils.move_cache()`.
0it [00:00, ?it/s]
Special tokens have been added in the vocabulary, make sure the associated word embeddings are fine-tuned or trained.
gguf: Adding 50000 merge(s).
gguf: Setting special token type bos to 50256
gguf: Setting special token type eos to 50256
gguf: Setting special token type unk to 50256
Exporting model to 'phi-2-f16.gguf'
gguf: loading model part 'model-00001-of-00002.safetensors'
token_embd.weight, n_dims = 2, torch.float16 --> float16
blk.0.attn_norm.bias, n_dims = 1, torch.float16 --> float32
blk.0.attn_norm.weight, n_dims = 1, torch.float16 --> float32
blk.0.ffn_up.bias, n_dims = 1, torch.float16 --> float32
blk.0.ffn_up.weight, n_
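
The conversion command can also be assembled from Python, which makes it easier to reuse for other models. This is a sketch under the assumption that the script path matches the cloned checkout (the file name may differ between llama.cpp versions); `build_convert_command` and `convert_to_gguf` are hypothetical helpers, and only the `--outfile` and `--outtype` flags shown in the cell above are used.

```python
import subprocess
import sys


def build_convert_command(model_dir, outfile, outtype="f16",
                          script="llama.cpp/convert-hf-to-gguf.py"):
    """Return the argument list for the conversion script, mirroring the cell above."""
    return [sys.executable, script, model_dir, "--outfile", outfile, "--outtype", outtype]


def convert_to_gguf(model_dir, outfile, outtype="f16"):
    """Run the conversion and raise CalledProcessError if the script fails."""
    subprocess.run(build_convert_command(model_dir, outfile, outtype), check=True)
```

For the model used here, `convert_to_gguf("phi-2", "phi-2-f16.gguf")` would run the same command as the cell above.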

## Build llama.cpp for the quantisation step.

In [13]:
!cd llama.cpp && make

I ccache not found. Consider installing it for faster compilation.
I llama.cpp build info: 
I UNAME_S:   Linux
I UNAME_P:   x86_64
I UNAME_M:   x86_64
I CFLAGS:    -I. -Icommon -D_XOPEN_SOURCE=600 -D_GNU_SOURCE -DNDEBUG  -std=c11   -fPIC -O3 -Wall -Wextra -Wpedantic -Wcast-qual -Wno-unused-function -Wshadow -Wstrict-prototypes -Wpointer-arith -Wmissing-prototypes -Werror=implicit-int -Werror=implicit-function-declaration -pthread -march=native -mtune=native -Wdouble-promotion 
I CXXFLAGS:  -std=c++11 -fPIC -O3 -Wall -Wextra -Wpedantic -Wcast-qual -Wno-unused-function -Wmissing-declarations -Wmissing-noreturn -pthread  -march=native -mtune=native -Wno-array-bounds -Wno-format-truncation -Wextra-semi -I. -Icommon -D_XOPEN_SOURCE=600 -D_GNU_SOURCE -DNDEBUG 
I NVCCFLAGS: -std=c++11 -O3 
I LDFLAGS:    
I CC:        cc (Ubuntu 11.4.0-1ubuntu1~22.04) 11.4.0
I CXX:       g++ (Ubuntu 11.4.0-1ubuntu1~22.04) 11.4.0

cc  -I. -Icommon -D_XOPEN_SOURCE=600 -D_GNU_SOURCE -DNDEBUG  -std=c11   -fPIC -O3

> Quantise the f16 file to the other quant types

In [14]:
import os
import subprocess

# (quant type, file extension) pairs to produce
quant_methods = [("q4_k_m", ".q4_k_m.gguf"), ("q5_k_m", ".q5_k_m.gguf")]
fp16_file = "phi-2-f16.gguf"
for method, extension in quant_methods:
    # base_path comes from the download cell, e.g. "phi-2/phi-2"
    quant_file = f"{base_path}{extension}"
    # Skip quants that already exist so the cell can be re-run
    if not os.path.isfile(quant_file):
        subprocess.run(["llama.cpp/quantize", fp16_file, quant_file, method], check=True)

KeyboardInterrupt: 
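
The loop above assumes the binary is at `llama.cpp/quantize`. A slightly more defensive sketch is shown below; `find_quantize_binary` and `quantize_all` are hypothetical helpers, and the second candidate name, `llama-quantize`, is an assumption about how other llama.cpp builds may name the tool.

```python
import os
import subprocess

# "quantize" is the name used in this notebook; "llama-quantize" is an assumed alternative.
QUANTIZE_CANDIDATES = ("quantize", "llama-quantize")


def find_quantize_binary(build_dir="llama.cpp", candidates=QUANTIZE_CANDIDATES):
    """Return the path of the first candidate binary that exists, else None."""
    for name in candidates:
        path = os.path.join(build_dir, name)
        if os.path.isfile(path):
            return path
    return None


def quantize_all(fp16_file, base_path, quant_types=("q4_k_m", "q5_k_m"), build_dir="llama.cpp"):
    """Quantise fp16_file into one file per quant type, skipping files that already exist."""
    binary = find_quantize_binary(build_dir)
    if binary is None:
        raise FileNotFoundError(f"no quantize binary found in {build_dir}; build llama.cpp first")
    # Make sure the output folder exists before the binary tries to write into it.
    os.makedirs(os.path.dirname(base_path) or ".", exist_ok=True)
    written = []
    for quant in quant_types:
        quant_file = f"{base_path}.{quant}.gguf"
        if not os.path.isfile(quant_file):
            subprocess.run([binary, fp16_file, quant_file, quant], check=True)
            written.append(quant_file)
    return written
```

Because existing files are skipped, a partial file left behind by an interrupted run would not be regenerated, so it may be worth deleting incomplete outputs before re-running.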

## Upload the quantised ckpts back to the Hub.

In [15]:
from huggingface_hub import login

login()


In [17]:
from huggingface_hub import HfApi
api = HfApi()

model_id = "reach-vb/phi-2-gguf"
api.create_repo(model_id, exist_ok=True, repo_type="model")

RepoUrl('https://huggingface.co/reach-vb/phi-2-gguf', endpoint='https://huggingface.co', repo_type='model', repo_id='reach-vb/phi-2-gguf')

In [20]:
api.upload_folder(
    folder_path="phi-2/",
    repo_id=model_id,
    allow_patterns="*.gguf",
)

phi-2.q4_k_m.gguf:   0%|          | 0.00/1.74G [00:00<?, ?B/s]

Upload 2 LFS files:   0%|          | 0/2 [00:00<?, ?it/s]

phi-2.q5_k_m.gguf:   0%|          | 0.00/1.58G [00:00<?, ?B/s]

CommitInfo(commit_url='https://huggingface.co/reach-vb/phi-2-gguf/commit/ecc5b62ee6f0b7944a99988230a3ede67ef783e9', commit_message='Upload folder using huggingface_hub', commit_description='', oid='ecc5b62ee6f0b7944a99988230a3ede67ef783e9', pr_url=None, pr_revision=None, pr_num=None)
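
Before uploading, it can be useful to check which files the `*.gguf` pattern is likely to pick up and how large they are. The helper below is a hypothetical local check, not part of `huggingface_hub`; it approximates the pattern by searching the folder recursively.

```python
from pathlib import Path


def list_gguf_files(folder):
    """Return (relative path, size in bytes) for every .gguf file under folder."""
    root = Path(folder)
    return sorted(
        (p.relative_to(root).as_posix(), p.stat().st_size)
        for p in root.rglob("*.gguf")
        if p.is_file()
    )
```

Printing `list_gguf_files("phi-2/")` before calling `api.upload_folder` gives a quick way to spot a missing or unexpectedly small quant file.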

# That's it! 🤗