<a href="https://colab.research.google.com/github/cosmo3769/Quantized-LLMs/blob/main/notebooks/quantize_stablelm_zephyr_3b_GGUF.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Install llama.cpp

In [1]:
# Install llama.cpp
!git clone https://github.com/ggerganov/llama.cpp
!cd llama.cpp && git pull && make clean && LLAMA_CUBLAS=1 make
!pip install -r llama.cpp/requirements.txt

Cloning into 'llama.cpp'...
remote: Enumerating objects: 19614, done.
remote: Counting objects: 100% (6928/6928), done.
remote: Compressing objects: 100% (538/538), done.
remote: Total 19614 (delta 6693), reused 6468 (delta 6389), pack-reused 12686
Receiving objects: 100% (19614/19614), 23.47 MiB | 17.82 MiB/s, done.
Resolving deltas: 100% (13862/13862), done.
Already up to date.
I ccache not found. Consider installing it for faster compilation.
I llama.cpp build info: 
I UNAME_S:   Linux
I UNAME_P:   x86_64
I UNAME_M:   x86_64
I CFLAGS:    -I. -Icommon -D_XOPEN_SOURCE=600 -D_GNU_SOURCE -DNDEBUG  -std=c11   -fPIC -O3 -Wall -Wextra -Wpedantic -Wcast-qual -Wno-unused-function -Wshadow -Wstrict-prototypes -Wpointer-arith -Wmissing-prototypes -Werror=implicit-int -Werror=implicit-function-declaration -pthread -march=native -mtune=native -Wdouble-promotion 
I CXXFLAGS:  -std=c++11 -fPIC -O3 -Wall -Wextra -Wpedantic -Wcast-qual -Wno-unused-function -Wmissing-declarations -Wmissin

In [None]:
import os

# Kill the current kernel process (signal 9), which forces Colab to restart the runtime
os.kill(os.getpid(), 9)

# Choose Model

In [1]:
# Variables
MODEL_ID = "stabilityai/stablelm-zephyr-3b"
QUANTIZATION_METHOD = "q4_k_m"

# Constants
MODEL_NAME = MODEL_ID.split('/')[-1]
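
The later cells derive two file paths from these values: the intermediate fp16 file and the quantized GGUF file. As a minimal sketch, the same naming scheme can be collected in one helper. The function name `artifact_paths` is hypothetical and not part of the notebook; only the path patterns come from the cells below.

```python
def artifact_paths(model_id: str, quantization_method: str) -> tuple[str, str]:
    """Return (fp16_path, quantized_path) using the notebook's naming scheme."""
    # The model directory is the last component of the Hub repo ID
    model_name = model_id.split("/")[-1]
    fp16 = f"{model_name}/{model_name.lower()}.fp16.bin"
    quantized = f"{model_name}/{model_name.lower()}.{quantization_method.upper()}.gguf"
    return fp16, quantized


fp16_path, quantized_path = artifact_paths("stabilityai/stablelm-zephyr-3b", "q4_k_m")
print(fp16_path)       # stablelm-zephyr-3b/stablelm-zephyr-3b.fp16.bin
print(quantized_path)  # stablelm-zephyr-3b/stablelm-zephyr-3b.Q4_K_M.gguf
```

Changing `QUANTIZATION_METHOD` only changes the suffix of the quantized file, so several quantized variants can sit side by side in the same model directory.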

# Download Model

In [2]:
# Download model
!git lfs install
!git clone https://huggingface.co/{MODEL_ID}

Git LFS initialized.
Cloning into 'stablelm-zephyr-3b'...
remote: Enumerating objects: 206, done.
remote: Counting objects: 100% (203/203), done.
remote: Compressing objects: 100% (203/203), done.
remote: Total 206 (delta 119), reused 0 (delta 0), pack-reused 3
Receiving objects: 100% (206/206), 650.45 KiB | 3.65 MiB/s, done.
Resolving deltas: 100% (119/119), done.
Encountered 1 file(s) that may not have been copied correctly on Windows:
	model.safetensors

See: `git lfs help smudge` for more details.


# Quantize Model

In [5]:
# Convert the Hugging Face checkpoint to fp16.
# convert-hf-to-gguf.py is used here instead of convert.py.
# The output is a GGUF file even though it keeps a .bin extension.
fp16 = f"{MODEL_NAME}/{MODEL_NAME.lower()}.fp16.bin"
!python llama.cpp/convert-hf-to-gguf.py {MODEL_NAME} --outtype f16 --outfile {fp16}

Loading model: stablelm-zephyr-3b
gguf: This GGUF file is for Little Endian only
Set model parameters
Set model tokenizer
Special tokens have been added in the vocabulary, make sure the associated word embeddings are fine-tuned or trained.
gguf: Adding 50009 merge(s).
gguf: Setting special token type bos to 0
gguf: Setting special token type eos to 0
gguf: Setting special token type unk to 0
gguf: Setting special token type pad to 0
gguf: Setting chat_template to {% for message in messages %}
{% if message['role'] == 'user' %}
{{ '<|user|>
' + message['content'] + eos_token }}
{% elif message['role'] == 'system' %}
{{ '<|system|>
' + message['content'] + eos_token }}
{% elif message['role'] == 'assistant' %}
{{ '<|assistant|>
'  + message['content'] + eos_token }}
{% endif %}
{% if loop.last and add_generation_prompt %}
{{ '<|assistant|>' }}
{% endif %}
{% endfor %}
Exporting model to 'stablelm-zephyr-3b/stablelm-zephyr-3b.fp16.bin'
gguf: loading model part 'model.safetensors'
output.w
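
The conversion log above shows the chat template that gets stored in the GGUF metadata. A rough Python sketch of what that template produces can help when writing prompts by hand. The function name `build_prompt` is hypothetical, the default `eos_token` value is an assumption (the log does not show it), and the exact whitespace depends on how the Jinja template is rendered, so treat the output as an approximation.

```python
def build_prompt(messages, eos_token="<|endoftext|>", add_generation_prompt=True):
    """Approximate the chat template from the conversion log with plain string building.

    eos_token is an assumed default; pass the tokenizer's real value if it differs.
    """
    role_tags = {
        "user": "<|user|>",
        "system": "<|system|>",
        "assistant": "<|assistant|>",
    }
    parts = []
    for message in messages:
        tag = role_tags.get(message["role"])
        if tag is None:
            continue  # the template emits nothing for other roles
        parts.append(f"{tag}\n{message['content']}{eos_token}\n")
    # The template adds the assistant tag only after the last message in the loop
    if add_generation_prompt and messages:
        parts.append("<|assistant|>\n")
    return "".join(parts)


print(build_prompt([{"role": "user", "content": "Write a haiku about GPUs."}]))
```

A prompt formatted this way can be passed to the `-p` flag in place of a bare instruction.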

In [6]:
# Quantize the model
qtype = f"{MODEL_NAME}/{MODEL_NAME.lower()}.{QUANTIZATION_METHOD.upper()}.gguf"
!./llama.cpp/quantize {fp16} {qtype} {QUANTIZATION_METHOD}

ggml_init_cublas: GGML_CUDA_FORCE_MMQ:   no
ggml_init_cublas: CUDA_USE_TENSOR_CORES: yes
ggml_init_cublas: found 1 CUDA devices:
  Device 0: Tesla T4, compute capability 7.5, VMM: yes
main: build = 2330 (5a51cc1b)
main: built with cc (Ubuntu 11.4.0-1ubuntu1~22.04) 11.4.0 for x86_64-linux-gnu
main: quantizing 'stablelm-zephyr-3b/stablelm-zephyr-3b.fp16.bin' to 'stablelm-zephyr-3b/stablelm-zephyr-3b.Q4_K_M.gguf' as Q4_K_M
llama_model_loader: loaded meta data with 19 key-value pairs and 356 tensors from stablelm-zephyr-3b/stablelm-zephyr-3b.fp16.bin (version GGUF V3 (latest))
llama_model_loader: Dumping metadata keys/values. Note: KV overrides do not apply in this output.
llama_model_loader: - kv   0:                       general.architecture str              = stablelm
llama_model_loader: - kv   1:                               general.name str              = stablelm-zephyr-3b
llama_model_loader: - kv   2:                    stablelm.context_length u32              = 4096
llama_model_l
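
Once quantization finishes, it can be useful to check how much smaller the quantized file is than the fp16 file. The sketch below is one way to do that with the standard library. The helper name `compression_ratio` is hypothetical, and the example paths are the ones this notebook generates.

```python
import os


def compression_ratio(original_path: str, quantized_path: str) -> float:
    """Return original size divided by quantized size (e.g. 3.0 means 3x smaller)."""
    original_size = os.path.getsize(original_path)
    quantized_size = os.path.getsize(quantized_path)
    if quantized_size == 0:
        raise ValueError(f"{quantized_path} is empty")
    return original_size / quantized_size


# Paths produced by the cells above; the check is skipped if the files are missing
fp16_file = "stablelm-zephyr-3b/stablelm-zephyr-3b.fp16.bin"
quantized_file = "stablelm-zephyr-3b/stablelm-zephyr-3b.Q4_K_M.gguf"
if os.path.exists(fp16_file) and os.path.exists(quantized_file):
    print(f"{compression_ratio(fp16_file, quantized_file):.2f}x smaller")
```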

# Run inference

In [7]:
import os

# List the GGUF files available in the model directory
model_list = [file for file in os.listdir(MODEL_NAME) if file.endswith(".gguf")]

prompt = input("Enter your prompt: ")
chosen_model = input("Name of the model (options: " + ", ".join(model_list) + "): ")

# Verify that the chosen file is in the list, then run inference with that file
if chosen_model not in model_list:
    print("Invalid name")
else:
    qtype = f"{MODEL_NAME}/{chosen_model}"
    !./llama.cpp/main -m {qtype} -n 128 --color -ngl 35 -p "{prompt}"

Enter your prompt: Write a Python function to add two intergers.
Name of the model (options: stablelm-zephyr-3b.Q4_K_M.gguf): stablelm-zephyr-3b.Q4_K_M.gguf
Log start
main: build = 2330 (5a51cc1b)
main: built with cc (Ubuntu 11.4.0-1ubuntu1~22.04) 11.4.0 for x86_64-linux-gnu
main: seed  = 1709541801
ggml_init_cublas: GGML_CUDA_FORCE_MMQ:   no
ggml_init_cublas: CUDA_USE_TENSOR_CORES: yes
ggml_init_cublas: found 1 CUDA devices:
  Device 0: Tesla T4, compute capability 7.5, VMM: yes
llama_model_loader: loaded meta data with 21 key-value pairs and 356 tensors from stablelm-zephyr-3b/stablelm-zephyr-3b.Q4_K_M.gguf (version GGUF V3 (latest))
llama_model_loader: Dumping metadata keys/values. Note: KV overrides do not apply in this output.
llama_model_loader: - kv   0:                       general.architecture str              = stablelm
llama_model_loader: - kv   1:                               general.name str              = stablelm-zephyr-3b
llama_model_loader: - kv   2:                 

In [8]:
import os

# List the GGUF files available in the model directory
model_list = [file for file in os.listdir(MODEL_NAME) if file.endswith(".gguf")]

prompt = input("Enter your prompt: ")
chosen_model = input("Name of the model (options: " + ", ".join(model_list) + "): ")

# Verify that the chosen file is in the list, then run inference with that file
if chosen_model not in model_list:
    print("Invalid name")
else:
    qtype = f"{MODEL_NAME}/{chosen_model}"
    !./llama.cpp/main -m {qtype} -n 128 --color -ngl 35 -p "{prompt}"

Enter your prompt: Write a Python function to output the fibonnaci numbers till 100.
Name of the model (options: stablelm-zephyr-3b.Q4_K_M.gguf): stablelm-zephyr-3b.Q4_K_M.gguf
Log start
main: build = 2330 (5a51cc1b)
main: built with cc (Ubuntu 11.4.0-1ubuntu1~22.04) 11.4.0 for x86_64-linux-gnu
main: seed  = 1709541830
ggml_init_cublas: GGML_CUDA_FORCE_MMQ:   no
ggml_init_cublas: CUDA_USE_TENSOR_CORES: yes
ggml_init_cublas: found 1 CUDA devices:
  Device 0: Tesla T4, compute capability 7.5, VMM: yes
llama_model_loader: loaded meta data with 21 key-value pairs and 356 tensors from stablelm-zephyr-3b/stablelm-zephyr-3b.Q4_K_M.gguf (version GGUF V3 (latest))
llama_model_loader: Dumping metadata keys/values. Note: KV overrides do not apply in this output.
llama_model_loader: - kv   0:                       general.architecture str              = stablelm
llama_model_loader: - kv   1:                               general.name str              = stablelm-zephyr-3b
llama_model_loader: - kv  
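
The inference cells interpolate the prompt into a shell line inside double quotes, which can break if the prompt itself contains a double quote. One possible alternative is to build an argument list and hand it to `subprocess.run`, so no shell quoting is involved. The helper name `build_main_command` is hypothetical; the flags and their values are the ones used in the cells above.

```python
import os
import shlex


def build_main_command(model_dir, chosen_file, prompt, n_predict=128, gpu_layers=35):
    """Build the argument list for ./llama.cpp/main without going through a shell."""
    # Accept only a bare .gguf file name, not a path
    if os.path.basename(chosen_file) != chosen_file or not chosen_file.endswith(".gguf"):
        raise ValueError(f"Invalid model file name: {chosen_file!r}")
    model_path = os.path.join(model_dir, chosen_file)
    return [
        "./llama.cpp/main",
        "-m", model_path,
        "-n", str(n_predict),
        "--color",
        "-ngl", str(gpu_layers),
        "-p", prompt,
    ]


cmd = build_main_command(
    "stablelm-zephyr-3b", "stablelm-zephyr-3b.Q4_K_M.gguf", 'Say "hello" in French.'
)
print(shlex.join(cmd))  # shell-safe rendering, for inspection only
# To actually run it: import subprocess; subprocess.run(cmd, check=True)
```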

# Push to Hub

In [9]:
!pip install -q huggingface_hub
from huggingface_hub import HfApi
from google.colab import userdata

username = "cosmo3769"
repo_id = f"{username}/{MODEL_NAME}-GGUF"

# HF_TOKEN is defined in the Secrets tab in Google Colab
api = HfApi(token=userdata.get("HF_TOKEN"))

# Create the repo if it does not exist yet, using the same authenticated client
api.create_repo(
    repo_id=repo_id,
    repo_type="model",
    exist_ok=True,
)

# Upload only the GGUF files
api.upload_folder(
    folder_path=MODEL_NAME,
    repo_id=repo_id,
    allow_patterns="*.gguf",
)

stablelm-zephyr-3b.Q4_K_M.gguf:   0%|          | 0.00/1.71G [00:00<?, ?B/s]

CommitInfo(commit_url='https://huggingface.co/cosmo3769/stablelm-zephyr-3b-GGUF/commit/f9e78a41a5ecc393e63bd18ca62f90f80ee49775', commit_message='Upload folder using huggingface_hub', commit_description='', oid='f9e78a41a5ecc393e63bd18ca62f90f80ee49775', pr_url=None, pr_revision=None, pr_num=None)
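
After the upload, one way to confirm that the GGUF files reached the Hub is to list the files in the repo. The sketch below assumes an authenticated `HfApi` instance like the `api` object created above and relies on its `list_repo_files` method. The helper name `uploaded_gguf_files` is hypothetical.

```python
def uploaded_gguf_files(api, repo_id):
    """Return the sorted .gguf file names that the Hub reports for repo_id."""
    return sorted(
        name for name in api.list_repo_files(repo_id=repo_id) if name.endswith(".gguf")
    )


# In the notebook, reusing `api`, `username`, and `MODEL_NAME` from the cell above:
# print(uploaded_gguf_files(api, f"{username}/{MODEL_NAME}-GGUF"))
```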