<a href="https://colab.research.google.com/github/VishanOberoi/FineTuningForTheGPUPoor/blob/main/Conversion_to_GGUF_using_Llama_cpp.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

GGUF is a binary file format introduced in August 2023 by the llama.cpp project for storing models such as LLaMA and Llama 2. It aims to make model handling more efficient by providing:

- **Fast Loading**: Quickly loads models for immediate use.
- **Flexibility**: Supports special tokens, metadata, and future extensibility without breaking compatibility.
- **Single-File Convenience**: Packages entire models into one file, simplifying distribution and usage.

This format is particularly useful when a model is trained or fine-tuned in a framework such as PyTorch and then served by an inference engine such as llama.cpp. Because it stores metadata as key-value pairs, GGUF is more adaptable and easier to use than earlier formats such as GGML and GGJT.
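To make the single-file, key-value layout concrete, here is a minimal sketch that reads only the fixed-size header at the start of a GGUF file. It assumes a little-endian file of format version 2 or later, where the header is the 4-byte magic `GGUF`, a 32-bit version, a 64-bit tensor count, and a 64-bit metadata key-value count. The function names `parse_gguf_header` and `read_gguf_header` are hypothetical helpers for illustration and are not part of llama.cpp or the `gguf` package.

```python
import struct

GGUF_MAGIC = b"GGUF"
HEADER_SIZE = 24  # 4-byte magic + uint32 version + two uint64 counts


def parse_gguf_header(data: bytes) -> dict:
    """Parse the fixed-size GGUF header (assumes little-endian, version >= 2)."""
    if len(data) < HEADER_SIZE:
        raise ValueError("need at least 24 bytes for a GGUF v2+ header")
    magic, version, tensor_count, kv_count = struct.unpack(
        "<4sIQQ", data[:HEADER_SIZE]
    )
    if magic != GGUF_MAGIC:
        raise ValueError(f"not a GGUF file: magic={magic!r}")
    if version < 2:
        # Version 1 stored the counts as 32-bit integers; not handled here.
        raise ValueError("GGUF version 1 is not handled in this sketch")
    return {
        "version": version,
        "tensor_count": tensor_count,
        "metadata_kv_count": kv_count,
    }


def read_gguf_header(path: str) -> dict:
    """Read and parse the header from a GGUF file on disk."""
    with open(path, "rb") as f:
        return parse_gguf_header(f.read(HEADER_SIZE))
```

Once a conversion has produced a file, calling `read_gguf_header("/content/finetuned-2.gguf")` is a quick way to confirm the output really is GGUF before uploading it.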

In [None]:
HUGGING_FACE_USERNAME = ''
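The upload cell later in this notebook builds its repository id from `HUGGING_FACE_USERNAME`, so leaving the value empty produces an id that starts with `/`. A small guard along the following lines could catch that early. `build_repo_id` is a hypothetical helper introduced here for illustration, not something provided by `huggingface_hub`.

```python
def build_repo_id(username: str, repo_name: str) -> str:
    """Return 'username/repo_name', refusing an empty username."""
    username = username.strip()
    if not username:
        raise ValueError("Set HUGGING_FACE_USERNAME before uploading")
    return f"{username}/{repo_name}"
```

For example, `build_repo_id(HUGGING_FACE_USERNAME, "Llama-2-7b-chat-hf-finedtuned-to-GGUF")` raises immediately if the username cell was skipped.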

In [None]:
# this cell will take approx 5 mins

In [None]:
# Make sure you have git-lfs installed (https://git-lfs.com)
!git lfs install
# Clone your model from Hugging Face
!git clone https://huggingface.co/vishanoberoi/Llama-2-7b-chat-hf-fine-tuned
# Clone the llama.cpp repository, which provides the scripts that convert models to GGUF.
!git clone https://github.com/ggerganov/llama.cpp.git

Git LFS initialized.
Cloning into 'Llama-2-7b-chat-hf-fine-tuned'...
remote: Enumerating objects: 16, done.
remote: Counting objects: 100% (13/13), done.
remote: Compressing objects: 100% (13/13), done.
remote: Total 16 (delta 1), reused 0 (delta 0), pack-reused 3
Unpacking objects: 100% (16/16), 482.11 KiB | 7.09 MiB/s, done.
Filtering content: 100% (3/3), 4.55 GiB | 15.14 MiB/s, done.
Encountered 2 file(s) that may not have been copied correctly on Windows:
	model-00001-of-00003.safetensors
	model-00002-of-00003.safetensors

See: `git lfs help smudge` for more details.
Cloning into 'llama.cpp'...
remote: Enumerating objects: 19342, done.
remote: Counting objects: 100% (4270/4270), done.
remote: Compressing objects: 100% (168/168), done.
remote: Total 19342 (delta 4186), reused 4130 (delta 4101), pack-reused 15072
Receiving objects: 100% (19342/19342), 22.65 MiB | 20.94 MiB/s, done.
Resolving deltas: 100% (13554/13554), done.


In [None]:
!pip install -r /content/llama.cpp/requirements.txt


Collecting numpy~=1.24.4 (from -r /content/llama.cpp/./requirements/requirements-convert.txt (line 1))
  Downloading numpy-1.24.4-cp310-cp310-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (17.3 MB)
Collecting gguf>=0.1.0 (from -r /content/llama.cpp/./requirements/requirements-convert.txt (line 4))
  Downloading gguf-0.6.0-py3-none-any.whl (23 kB)
Collecting protobuf<5.0.0,>=4.21.0 (from -r /content/llama.cpp/./requirements/requirements-convert.txt (line 5))
  Downloading protobuf-4.25.3-cp37-abi3-manylinux2014_x86_64.whl (294 kB)
Collecting torch~=2.1.1 (from -r /content/llama.cpp/./requirements/requirements-convert-hf-to-gguf.txt (line 2))
  Downloading torch-2.1.2-cp310-cp310-manylinux1_x86_64.whl (670.2 MB)

In [None]:
# Omitting --outtype converts the model to FP16 by default; add `--outtype q8_0` for 8-bit output.
!python /content/llama.cpp/convert.py /content/Llama-2-7b-chat-hf-fine-tuned \
  --vocab-type hfft \
  --outfile /content/finetuned-2.gguf


Loading model file /content/Llama-2-7b-chat-hf-fine-tuned/model-00001-of-00003.safetensors
Loading model file /content/Llama-2-7b-chat-hf-fine-tuned/model-00001-of-00003.safetensors
Loading model file /content/Llama-2-7b-chat-hf-fine-tuned/model-00002-of-00003.safetensors
Loading model file /content/Llama-2-7b-chat-hf-fine-tuned/model-00003-of-00003.safetensors
params = Params(n_vocab=32000, n_embd=4096, n_layer=32, n_ctx=4096, n_ff=11008, n_head=32, n_head_kv=32, n_experts=None, n_experts_used=None, f_norm_eps=1e-05, rope_scaling_type=None, f_rope_freq_base=None, f_rope_scale=None, n_orig_ctx=None, rope_finetuned=None, ftype=None, path_model=PosixPath('/content/Llama-2-7b-chat-hf-fine-tuned'))
Found vocab files: {'tokenizer.model': None, 'vocab.json': None, 'tokenizer.json': PosixPath('/content/Llama-2-7b-chat-hf-fine-tuned/tokenizer.json')}
Loading vocab file '/content/Llama-2-7b-chat-hf-fine-tuned', type 'hfft'
fname_tokenizer: /content/Llama-2-7b-chat-hf-fine-tuned
Vocab info: <HfV
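
As a sanity check on the conversion, the expected file size can be estimated from the `Params` line printed above. The sketch below assumes a standard Llama 2 layout (untied output head, four square attention projections because `n_head_kv == n_head`, three feed-forward matrices, and two RMSNorm vectors per layer) and ignores the small metadata and vocabulary overhead. The q8_0 figure assumes every weight is stored in blocks of 32 int8 values plus one FP16 scale (34 bytes per 32 weights), which slightly misstates a real file if some small tensors are kept at higher precision. `llama_param_count` is a hypothetical helper written for this estimate.

```python
def llama_param_count(n_vocab: int, n_embd: int, n_layer: int, n_ff: int) -> int:
    """Approximate parameter count for a Llama-style decoder (assumed layout)."""
    embeddings = n_vocab * n_embd   # token embedding table
    lm_head = n_vocab * n_embd      # output projection, assumed untied
    attn = 4 * n_embd * n_embd      # Wq, Wk, Wv, Wo
    ffn = 3 * n_embd * n_ff         # gate, up, and down projections
    norms = 2 * n_embd              # two RMSNorm weight vectors per layer
    final_norm = n_embd
    return embeddings + lm_head + n_layer * (attn + ffn + norms) + final_norm


# Values taken from the Params(...) line in the conversion output.
params = llama_param_count(n_vocab=32000, n_embd=4096, n_layer=32, n_ff=11008)
fp16_bytes = params * 2          # 2 bytes per weight
q8_0_bytes = params * 34 // 32   # 34 bytes per block of 32 weights

print(f"parameters: {params:,}")                   # parameters: 6,738,415,616
print(f"FP16 size:  {fp16_bytes / 1e9:.2f} GB")    # FP16 size:  13.48 GB
print(f"q8_0 size:  {q8_0_bytes / 1e9:.2f} GB")    # q8_0 size:  7.16 GB
```

The FP16 estimate is in line with the roughly 13.5G reported by the upload progress bar further down, which suggests the default conversion did produce 16-bit weights.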

In [None]:
from huggingface_hub import notebook_login
notebook_login()


In [None]:
# Push the converted GGUF file to a Hugging Face model repository.
# HUGGING_FACE_USERNAME must be set in the first cell before running this.
from huggingface_hub import HfApi
api = HfApi()

model_id = f"{HUGGING_FACE_USERNAME}/Llama-2-7b-chat-hf-finedtuned-to-GGUF"
api.create_repo(model_id, exist_ok=True, repo_type="model")
api.upload_file(
    path_or_fileobj="/content/finetuned-2.gguf",  # file written by convert.py
    path_in_repo="finetuned.gguf",
    repo_id=model_id,
)


The secret `HF_TOKEN` does not exist in your Colab secrets.
To authenticate with the Hugging Face Hub, create a token in your settings tab (https://huggingface.co/settings/tokens), set it as secret in your Google Colab and restart your session.
You will be able to reuse this secret in all of your notebooks.
Please note that authentication is recommended but still optional to access public models or datasets.


finetuned-2.gguf:   0%|          | 0.00/13.5G [00:00<?, ?B/s]

CommitInfo(commit_url='https://huggingface.co/dev02chandan/Llama-2-7b-chat-hf-finedtuned-to-GGUF/commit/7a675fd2a331abea4f881082770940eba577e67b', commit_message='Upload finetuned.gguf with huggingface_hub', commit_description='', oid='7a675fd2a331abea4f881082770940eba577e67b', pr_url=None, pr_revision=None, pr_num=None)


In [None]:
# Push the same FP16 file to the author's repository under a name that records its precision.
from huggingface_hub import HfApi
api = HfApi()

model_id = "vishanoberoi/Llama-2-7b-chat-hf-finedtuned-to-GGUF"
api.create_repo(model_id, exist_ok=True, repo_type="model")
api.upload_file(
    path_or_fileobj="/content/finetuned-2.gguf",  # file written by convert.py
    path_in_repo="finetuned-16b.gguf",
    repo_id=model_id,
)


finetuned-2.gguf:   0%|          | 0.00/13.5G [00:00<?, ?B/s]

CommitInfo(commit_url='https://huggingface.co/vishanoberoi/Llama-2-7b-chat-hf-finedtuned-to-GGUF/commit/b8f971e6eec5f438859042deef325f5767f7fc29', commit_message='Upload finetuned-16b.gguf with huggingface_hub', commit_description='', oid='b8f971e6eec5f438859042deef325f5767f7fc29', pr_url=None, pr_revision=None, pr_num=None)

In [None]:
# Same conversion as earlier; append `--outtype q8_0` to emit 8-bit weights instead of the FP16 default.
!python /content/llama.cpp/convert.py /content/Llama-2-7b-chat-hf-fine-tuned \
  --vocab-type hfft \
  --outfile /content/finetuned-2.gguf


In [None]:
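# You can quantise the model further (e.g. to Q5_K_M) on your local machine. Run the compiled
# quantize binary (not quantize.cpp) on the GGUF file. The binary name (`quantize` or `llama-quantize`)
# and build steps depend on your llama.cpp version; the output filename below is only an example:
# !/content/llama.cpp/quantize /content/finetuned-2.gguf /content/finetuned-2-Q5_K_M.gguf Q5_K_M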