<a href="https://colab.research.google.com/github/cosmo3769/Quantized-LLMs/blob/main/notebooks/quantize_EvolCodeLlama_7b_GGUF.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Install llama.cpp

In [1]:
# Install llama.cpp
!git clone https://github.com/ggerganov/llama.cpp
!cd llama.cpp && git pull && make clean && LLAMA_CUBLAS=1 make
!pip install -r llama.cpp/requirements.txt

Cloning into 'llama.cpp'...
remote: Enumerating objects: 19379, done.
remote: Counting objects: 100% (19379/19379), done.
remote: Compressing objects: 100% (5728/5728), done.
remote: Total 19379 (delta 13644), reused 19127 (delta 13494), pack-reused 0
Receiving objects: 100% (19379/19379), 22.61 MiB | 15.26 MiB/s, done.
Resolving deltas: 100% (13644/13644), done.
Already up to date.
I ccache not found. Consider installing it for faster compilation.
I llama.cpp build info: 
I UNAME_S:   Linux
I UNAME_P:   x86_64
I UNAME_M:   x86_64
I CFLAGS:    -I. -Icommon -D_XOPEN_SOURCE=600 -D_GNU_SOURCE -DNDEBUG  -std=c11   -fPIC -O3 -Wall -Wextra -Wpedantic -Wcast-qual -Wno-unused-function -Wshadow -Wstrict-prototypes -Wpointer-arith -Wmissing-prototypes -Werror=implicit-int -Werror=implicit-function-declaration -pthread -march=native -mtune=native -Wdouble-promotion 
I CXXFLAGS:  -std=c++11 -fPIC -O3 -Wall -Wextra -Wpedantic -Wcast-qual -Wno-unused-function -Wmissing-declarations -Wmis

In [None]:
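import os
# Send signal 9 (SIGKILL) to the notebook kernel so the runtime restarts,
# presumably so that the packages installed above are picked up.
os.kill(os.getpid(), 9)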

# Choose Model

In [1]:
# Variables
MODEL_ID = "mlabonne/EvolCodeLlama-7b"
# QUANTIZATION_METHODS = ["q4_k_m", "q5_k_m"]
QUANTIZATION_METHOD = "q4_k_m"

# Constants
MODEL_NAME = MODEL_ID.split('/')[-1]
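
The commented-out `QUANTIZATION_METHODS` list suggests quantizing with more than one method. As a minimal sketch, the output file name for each method could be derived with the naming pattern this notebook uses later (`<model dir>/<lowercased model name>.<METHOD>.gguf`). The helper name `gguf_path` is hypothetical and not part of llama.cpp.

```python
def gguf_path(model_id: str, method: str) -> str:
    """Return the GGUF output path for one quantization method.

    Hypothetical helper: it only mirrors the f-string used in the
    quantization cell of this notebook.
    """
    model_name = model_id.split("/")[-1]
    return f"{model_name}/{model_name.lower()}.{method.upper()}.gguf"


# The two methods named in the commented-out list above
for method in ["q4_k_m", "q5_k_m"]:
    print(gguf_path("mlabonne/EvolCodeLlama-7b", method))
```

Keeping the path logic in one place means the quantization and inference cells cannot drift apart on file naming.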

# Download Model

In [2]:
# Download model
!git lfs install
!git clone https://huggingface.co/{MODEL_ID}

Git LFS initialized.
Cloning into 'EvolCodeLlama-7b'...
remote: Enumerating objects: 35, done.
remote: Total 35 (delta 0), reused 0 (delta 0), pack-reused 35
Unpacking objects: 100% (35/35), 483.38 KiB | 5.25 MiB/s, done.
Filtering content: 100% (5/5), 4.70 GiB | 10.11 MiB/s, done.
Encountered 1 file(s) that may not have been copied correctly on Windows:
	pytorch_model-00001-of-00002.bin

See: `git lfs help smudge` for more details.
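
Given the Git LFS warning above, one quick sanity check is to confirm that no weight file was left as an LFS pointer instead of the real content. A pointer file is a small text file whose first line is `version https://git-lfs.github.com/spec/v1`. The sketch below scans a directory for such files; the function name `find_lfs_pointers` is hypothetical, and the check is an optional addition, not part of the original workflow.

```python
import os

# First bytes of every Git LFS pointer file
LFS_POINTER_PREFIX = b"version https://git-lfs.github.com/spec/v1"


def find_lfs_pointers(directory: str) -> list[str]:
    """Return the names of files in `directory` that are still LFS pointers."""
    pointers = []
    for name in sorted(os.listdir(directory)):
        path = os.path.join(directory, name)
        if not os.path.isfile(path):
            continue
        # Read only as many bytes as the prefix, so large weights are not loaded
        with open(path, "rb") as handle:
            head = handle.read(len(LFS_POINTER_PREFIX))
        if head == LFS_POINTER_PREFIX:
            pointers.append(name)
    return pointers


# Only run the check if the model directory from the clone exists
if os.path.isdir("EvolCodeLlama-7b"):
    print(find_lfs_pointers("EvolCodeLlama-7b"))
```

An empty list means every file holds real content; any name that is listed would need to be fetched again before conversion.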


# Quantize Model

In [3]:
# Convert to fp16
fp16 = f"{MODEL_NAME}/{MODEL_NAME.lower()}.fp16.bin"
!python llama.cpp/convert.py {MODEL_NAME} --outtype f16 --outfile {fp16}

# Quantize the model for each method in the QUANTIZATION_METHODS list
# for method in QUANTIZATION_METHODS:
#     qtype = f"{MODEL_NAME}/{MODEL_NAME.lower()}.{method.upper()}.gguf"
#     !./llama.cpp/quantize {fp16} {qtype} {method}

qtype = f"{MODEL_NAME}/{MODEL_NAME.lower()}.{QUANTIZATION_METHOD.upper()}.gguf"
!./llama.cpp/quantize {fp16} {qtype} {QUANTIZATION_METHOD}

Loading model file EvolCodeLlama-7b/pytorch_model-00001-of-00002.bin
Loading model file EvolCodeLlama-7b/pytorch_model-00001-of-00002.bin
Loading model file EvolCodeLlama-7b/pytorch_model-00002-of-00002.bin
params = Params(n_vocab=32016, n_embd=4096, n_layer=32, n_ctx=16384, n_ff=11008, n_head=32, n_head_kv=32, n_experts=None, n_experts_used=None, f_norm_eps=1e-05, rope_scaling_type=None, f_rope_freq_base=1000000, f_rope_scale=None, n_orig_ctx=None, rope_finetuned=None, ftype=<GGMLFileType.MostlyF16: 1>, path_model=PosixPath('EvolCodeLlama-7b'))
Found vocab files: {'tokenizer.model': PosixPath('EvolCodeLlama-7b/tokenizer.model'), 'vocab.json': None, 'tokenizer.json': PosixPath('EvolCodeLlama-7b/tokenizer.json')}
Loading vocab file 'EvolCodeLlama-7b/tokenizer.model', type 'spm'
Vocab info: <SentencePieceVocab with 32016 base tokens and 0 added tokens>
Special vocab info: <SpecialVocab with 0 merges, special tokens {'bos': 1, 'eos': 2, 'unk': 0}, add special tokens unset>
Permuting layer
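
The `Params` line logged by `convert.py` is enough for a rough estimate of the model size. The sketch below assumes the standard Llama layout: untied input and output embeddings, four attention projections, three feed-forward projections and two norm vectors per layer, plus one final norm. Since the log shows `n_head_kv` equal to `n_head`, the key and value projections are taken to be full size. The function name `llama_param_count` is hypothetical.

```python
def llama_param_count(n_vocab: int, n_embd: int, n_layer: int, n_ff: int) -> int:
    """Estimate the parameter count of a Llama-style model (assumed layout)."""
    embeddings = 2 * n_vocab * n_embd   # token embedding + output head
    attention = 4 * n_embd * n_embd     # query, key, value, output projections
    feed_forward = 3 * n_embd * n_ff    # gate, up, down projections
    norms = 2 * n_embd                  # attention norm + feed-forward norm
    final_norm = n_embd
    return embeddings + n_layer * (attention + feed_forward + norms) + final_norm


# Values copied from the Params line in the log above
params = llama_param_count(n_vocab=32016, n_embd=4096, n_layer=32, n_ff=11008)
fp16_gib = params * 2 / 2**30  # 2 bytes per weight at fp16
print(f"{params:,} parameters, about {fp16_gib:.2f} GiB at fp16")
```

Under these assumptions the count comes to roughly 6.74 billion parameters, which is consistent with the "7b" in the model name.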

# Run inference

In [5]:
import os

# Collect the quantized GGUF files produced in the previous step
model_list = [file for file in os.listdir(MODEL_NAME) if file.endswith(".gguf")]

prompt = input("Enter your prompt: ")
chosen_model = input("Name of the model (options: " + ", ".join(model_list) + "): ")

# Verify the chosen model file is in the list
if chosen_model not in model_list:
    print("Invalid name")
else:
    # Run the file the user actually selected, not the default method
    qtype = f"{MODEL_NAME}/{chosen_model}"
    !./llama.cpp/main -m {qtype} -n 128 --color -ngl 35 -p "{prompt}"

Enter your prompt: Write a Python function to add two intergers.
Name of the model (options: evolcodellama-7b.Q4_K_M.gguf): evolcodellama-7b.Q4_K_M.gguf
Log start
main: build = 2282 (cb49e0f8)
main: built with cc (Ubuntu 11.4.0-1ubuntu1~22.04) 11.4.0 for x86_64-linux-gnu
main: seed  = 1709105158
ggml_init_cublas: GGML_CUDA_FORCE_MMQ:   no
ggml_init_cublas: CUDA_USE_TENSOR_CORES: yes
ggml_init_cublas: found 1 CUDA devices:
  Device 0: Tesla T4, compute capability 7.5, VMM: yes
llama_model_loader: loaded meta data with 20 key-value pairs and 291 tensors from EvolCodeLlama-7b/evolcodellama-7b.Q4_K_M.gguf (version GGUF V3 (latest))
llama_model_loader: Dumping metadata keys/values. Note: KV overrides do not apply in this output.
llama_model_loader: - kv   0:                       general.architecture str              = llama
llama_model_loader: - kv   1:                               general.name str              = .
llama_model_loader: - kv   2:                       llama.context_length u
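
The inference cell interpolates the prompt into a shell command, so a prompt containing double quotes can break the quoting. One possible alternative is to build the command as an argument list and pass it to `subprocess.run`. The flags are the ones used in the cell above; the helper name `build_main_command` and its default values are assumptions made for illustration.

```python
import shlex


def build_main_command(model_path: str, prompt: str,
                       n_predict: int = 128, gpu_layers: int = 35) -> list[str]:
    """Build the llama.cpp `main` call as an argument list (hypothetical helper)."""
    return [
        "./llama.cpp/main",
        "-m", model_path,
        "-n", str(n_predict),
        "--color",
        "-ngl", str(gpu_layers),
        "-p", prompt,  # passed as a single argument, so quotes need no escaping
    ]


cmd = build_main_command(
    "EvolCodeLlama-7b/evolcodellama-7b.Q4_K_M.gguf",
    'Write a function that prints "hello"',
)
# Show the shell-quoted form for inspection
print(shlex.join(cmd))
# To execute it: import subprocess; subprocess.run(cmd, check=True)
```

Because no shell parses the list form, the prompt reaches `main` exactly as typed.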

In [6]:
import os

# Collect the quantized GGUF files produced in the previous step
model_list = [file for file in os.listdir(MODEL_NAME) if file.endswith(".gguf")]

prompt = input("Enter your prompt: ")
chosen_model = input("Name of the model (options: " + ", ".join(model_list) + "): ")

# Verify the chosen model file is in the list
if chosen_model not in model_list:
    print("Invalid name")
else:
    # Run the file the user actually selected, not the default method
    qtype = f"{MODEL_NAME}/{chosen_model}"
    !./llama.cpp/main -m {qtype} -n 128 --color -ngl 35 -p "{prompt}"

Enter your prompt: Write a Python function to output the fibonnaci numbers till 100.
Name of the model (options: evolcodellama-7b.Q4_K_M.gguf): evolcodellama-7b.Q4_K_M.gguf
Log start
main: build = 2282 (cb49e0f8)
main: built with cc (Ubuntu 11.4.0-1ubuntu1~22.04) 11.4.0 for x86_64-linux-gnu
main: seed  = 1709105257
ggml_init_cublas: GGML_CUDA_FORCE_MMQ:   no
ggml_init_cublas: CUDA_USE_TENSOR_CORES: yes
ggml_init_cublas: found 1 CUDA devices:
  Device 0: Tesla T4, compute capability 7.5, VMM: yes
llama_model_loader: loaded meta data with 20 key-value pairs and 291 tensors from EvolCodeLlama-7b/evolcodellama-7b.Q4_K_M.gguf (version GGUF V3 (latest))
llama_model_loader: Dumping metadata keys/values. Note: KV overrides do not apply in this output.
llama_model_loader: - kv   0:                       general.architecture str              = llama
llama_model_loader: - kv   1:                               general.name str              = .
llama_model_loader: - kv   2:                       ll