<a href="https://colab.research.google.com/github/ambirpatel/Optimized-LLM/blob/main/vicuna_cpp_compilation.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Step-by-step guide to compiling lmsys/vicuna-7b-v1.3 with llama.cpp

 1. Download the base model
 2. Clone [llama.cpp](https://github.com/ggerganov/llama.cpp) repo and install llama.cpp
 3. Convert base model to GGUF format
 4. Quantize GGUF model
---

In [1]:
# -p keeps the cell re-runnable if the directory already exists
!mkdir -p compiled_model

## 1. Download the base model

[Model Link](https://huggingface.co/lmsys/vicuna-7b-v1.3)

**Model Details:**<br>
> Vicuna is a chat assistant trained by fine-tuning LLaMA on user-shared conversations collected from ShareGPT.

**Uses:**<br>
>The primary use of Vicuna is research on large language models and chatbots. The primary intended users of the model are researchers and hobbyists in natural language processing, machine learning, and artificial intelligence.



In [2]:
from huggingface_hub import snapshot_download

model_name = "lmsys/vicuna-7b-v1.3"
base_model = "./base_model/"
snapshot_download(repo_id=model_name, local_dir=base_model, ignore_patterns=["*.pth"])

Access to the secret `HF_TOKEN` has not been granted on this notebook.
You will not be requested again.
Please restart the session if you want to be prompted again.


Fetching 10 files:   0%|          | 0/10 [00:00<?, ?it/s]

.gitattributes:   0%|          | 0.00/1.52k [00:00<?, ?B/s]

config.json:   0%|          | 0.00/566 [00:00<?, ?B/s]

README.md:   0%|          | 0.00/1.98k [00:00<?, ?B/s]

generation_config.json:   0%|          | 0.00/132 [00:00<?, ?B/s]

special_tokens_map.json:   0%|          | 0.00/435 [00:00<?, ?B/s]

pytorch_model.bin.index.json:   0%|          | 0.00/26.8k [00:00<?, ?B/s]

pytorch_model-00001-of-00002.bin:   0%|          | 0.00/9.98G [00:00<?, ?B/s]

pytorch_model-00002-of-00002.bin:   0%|          | 0.00/3.50G [00:00<?, ?B/s]

tokenizer_config.json:   0%|          | 0.00/727 [00:00<?, ?B/s]

tokenizer.model:   0%|          | 0.00/500k [00:00<?, ?B/s]

'/content/base_model'
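
Before converting, it can help to confirm that the snapshot landed on disk in full. The helper below is a minimal sketch, not part of the original notebook: `summarize_dir` is a hypothetical name, and the demo runs on a throwaway directory so it works without the model.

```python
import tempfile
from pathlib import Path


def summarize_dir(path):
    """Return (file_count, total_bytes) for all regular files under path."""
    files = [p for p in Path(path).rglob("*") if p.is_file()]
    return len(files), sum(p.stat().st_size for p in files)


# Demo on a temporary directory holding two small stand-in files.
with tempfile.TemporaryDirectory() as tmp:
    (Path(tmp) / "config.json").write_bytes(b"{}")
    (Path(tmp) / "weights.bin").write_bytes(b"\0" * 1024)
    demo_summary = summarize_dir(tmp)

print(demo_summary)
```

Calling `summarize_dir("./base_model/")` after the download should report a total close to the sum of the two weight shards listed above (9.98 GB + 3.50 GB). The file count may differ from 10 if `huggingface_hub` keeps its own cache metadata inside the target directory.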

## 2. Clone [llama.cpp](https://github.com/ggerganov/llama.cpp) repo and install llama.cpp

**Why:**
>1. Supports quantization for reduced memory usage and faster inference.<br>
>2. Offers a simple API for loading and running models.<br>
>3. Compatible with various Llama-based models, including Vicuna.<br>


In [3]:
!git clone https://github.com/ggerganov/llama.cpp

Cloning into 'llama.cpp'...
remote: Enumerating objects: 36020, done.
remote: Counting objects: 100% (187/187), done.
remote: Compressing objects: 100% (111/111), done.
remote: Total 36020 (delta 95), reused 133 (delta 74), pack-reused 35833 (from 1)
Receiving objects: 100% (36020/36020), 58.29 MiB | 20.66 MiB/s, done.
Resolving deltas: 100% (26181/26181), done.


In [4]:
# Build llama.cpp with CMake (mkdir -p keeps the cell re-runnable)
!mkdir -p llama.cpp/build && cd llama.cpp/build && cmake .. && cmake --build . --config Release

-- The C compiler identification is GNU 11.4.0
-- The CXX compiler identification is GNU 11.4.0
-- Detecting C compiler ABI info
-- Detecting C compiler ABI info - done
-- Check for working C compiler: /usr/bin/cc - skipped
-- Detecting C compile features
-- Detecting C compile features - done
-- Detecting CXX compiler ABI info
-- Detecting CXX compiler ABI info - done
-- Check for working CXX compiler: /usr/bin/c++ - skipped
-- Detecting CXX compile features
-- Detecting CXX compile features - done
-- Found Git: /usr/bin/git (found version "2.34.1")
-- Performing Test CMAKE_HAVE_LIBC_PTHREAD
-- Performing Test CMAKE_HAVE_LIBC_PTHREAD - Success
-- Found Threads: TRUE
-- Found OpenMP_C: -fopenmp (found version "4.5")
-- Found OpenMP_CXX: -fopenmp (found version "4.5")
-- Found OpenMP: TRUE (found version "4.5")
-- OpenMP found
-- Using llamafile
-- Using AMX
-- CMAKE_SYSTEM_PROCESSOR: x86_64
-- x86 detected
-- Configuring done (1.9s)
-- Generating done (0.2s)
-- Build files have been wr

## 3. Convert base model to GGUF format

**GGUF Format:**<br>
> GGUF is a file format for storing models for inference with GGML and executors based on GGML. GGUF is a binary format that is designed for fast loading and saving of models, and for ease of reading. Models are traditionally developed using PyTorch or another framework, and then converted to GGUF for use in GGML.


[More about the GGUF format.](https://huggingface.co/docs/hub/en/gguf)
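
To make the "binary format" description concrete, here is a minimal sketch that reads the fixed-size header at the start of a GGUF file. It assumes GGUF version 2 or later, where the header is the 4-byte magic `GGUF`, a little-endian `uint32` version, and two `uint64` counts (tensors and metadata key-value pairs). `read_gguf_header` is a hypothetical helper, and the demo parses a synthetic header rather than a real model file.

```python
import struct
import tempfile
from pathlib import Path

GGUF_MAGIC = b"GGUF"


def read_gguf_header(path):
    """Return (version, tensor_count, metadata_kv_count) from a GGUF file."""
    with open(path, "rb") as f:
        magic = f.read(4)
        if magic != GGUF_MAGIC:
            raise ValueError(f"not a GGUF file: magic={magic!r}")
        # Little-endian: uint32 version, uint64 tensor count, uint64 KV count.
        version, tensor_count, kv_count = struct.unpack("<IQQ", f.read(20))
    return version, tensor_count, kv_count


# Demo: write a synthetic 24-byte header and parse it back.
with tempfile.TemporaryDirectory() as tmp:
    demo_file = Path(tmp) / "demo.gguf"
    demo_file.write_bytes(GGUF_MAGIC + struct.pack("<IQQ", 3, 291, 24))
    header = read_gguf_header(demo_file)

print(header)
```

The demo values mirror what the loader reports for this model elsewhere in the notebook (GGUF V3, 291 tensors, 24 key-value pairs). Pointing `read_gguf_header` at a real `.gguf` file should return that file's actual header fields.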

In [5]:
# convert the Hugging Face checkpoint to a GGUF file at FP16 precision
!python llama.cpp/convert_hf_to_gguf.py ./base_model/ --outfile compiled_model/vicuna_7b_FP16.gguf

INFO:hf-to-gguf:Loading model: base_model
INFO:gguf.gguf_writer:gguf: This GGUF file is for Little Endian only
INFO:hf-to-gguf:Exporting model...
INFO:hf-to-gguf:gguf: loading model weight map from 'pytorch_model.bin.index.json'
INFO:hf-to-gguf:gguf: loading model part 'pytorch_model-00001-of-00002.bin'
INFO:hf-to-gguf:token_embd.weight,           torch.float16 --> F16, shape = {4096, 32000}
INFO:hf-to-gguf:blk.0.attn_q.weight,         torch.float16 --> F16, shape = {4096, 4096}
INFO:hf-to-gguf:blk.0.attn_k.weight,         torch.float16 --> F16, shape = {4096, 4096}
INFO:hf-to-gguf:blk.0.attn_v.weight,         torch.float16 --> F16, shape = {4096, 4096}
INFO:hf-to-gguf:blk.0.attn_output.weight,    torch.float16 --> F16, shape = {4096, 4096}
INFO:hf-to-gguf:blk.0.ffn_gate.weight,       torch.float16 --> F16, shape = {4096, 11008}
INFO:hf-to-gguf:blk.0.ffn_down.weight,       torch.float16 --> F16, shape = {11008, 4096}
INFO:hf-to-gguf:blk.0.ffn_up.weight,         torch.float16 --> F16, s
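
The shapes in the conversion log are enough to reconstruct the model's parameter count. The sketch below assumes the standard LLaMA tensor layout: the per-block `attn_norm` and `ffn_norm` vectors and the final `output_norm` and `output` tensors are not visible in the truncated log above, so they are assumptions here. The block count of 32 comes from the `llama.block_count` value printed by the model loader later in this notebook.

```python
from math import prod


def llama_tensor_shapes(n_embd, n_ff, n_vocab, n_layer):
    """Build a name -> shape map for an assumed standard LLaMA layout."""
    shapes = {
        "token_embd.weight": (n_embd, n_vocab),
        "output_norm.weight": (n_embd,),       # assumed, not in the log excerpt
        "output.weight": (n_embd, n_vocab),    # assumed, not in the log excerpt
    }
    for i in range(n_layer):
        for name in ("attn_q", "attn_k", "attn_v", "attn_output"):
            shapes[f"blk.{i}.{name}.weight"] = (n_embd, n_embd)
        shapes[f"blk.{i}.ffn_gate.weight"] = (n_embd, n_ff)
        shapes[f"blk.{i}.ffn_up.weight"] = (n_embd, n_ff)
        shapes[f"blk.{i}.ffn_down.weight"] = (n_ff, n_embd)
        shapes[f"blk.{i}.attn_norm.weight"] = (n_embd,)  # assumed
        shapes[f"blk.{i}.ffn_norm.weight"] = (n_embd,)   # assumed
    return shapes


# Dimensions taken from the shapes logged by convert_hf_to_gguf.py.
shapes = llama_tensor_shapes(n_embd=4096, n_ff=11008, n_vocab=32000, n_layer=32)
n_tensors = len(shapes)
n_params = sum(prod(shape) for shape in shapes.values())

print(n_tensors)  # 291
print(n_params)   # 6738415616
```

Under these assumptions the tensor count matches the 291 tensors the loader reports, and the total of roughly 6.7 billion parameters matches the `general.size_label` of 6.7B.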

## 4. Quantize GGUF model

**Memory/Disk Requirements:**

As the models are currently fully loaded into memory, you will need adequate disk space to save them and sufficient RAM to load them. At the moment, memory and disk requirements are the same.

| Model | Original size | Quantized size (Q4_K_M) |
|------:|--------------:|----------------------:|
|    7B |         13 GB |                3.9 GB |
|   13B |         24 GB |                7.8 GB |
|   30B |         60 GB |               19.5 GB |
|   65B |        120 GB |               38.5 GB |
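
The sizes in the table follow roughly from parameter count times bits per weight. The sketch below is a back-of-the-envelope estimate, not a measurement: `estimate_size_gb` is a hypothetical helper, the 6.7e9 parameter count comes from the `general.size_label` reported for this model, and the 4.5 bits per weight used for the 4-bit case is an illustrative assumption. Real quantized files keep some tensors at higher precision and carry metadata, so they usually come out somewhat larger than the estimate.

```python
def estimate_size_gb(n_params, bits_per_weight):
    """Approximate model size in decimal gigabytes (1 GB = 1e9 bytes)."""
    return n_params * bits_per_weight / 8 / 1e9


N_PARAMS = 6.7e9  # from the model's general.size_label (6.7B)

fp16_gb = estimate_size_gb(N_PARAMS, 16)    # 2 bytes per weight
q4_gb = estimate_size_gb(N_PARAMS, 4.5)     # assumed average for a 4-bit scheme

print(f"FP16:  {fp16_gb:.1f} GB")
print(f"4-bit: {q4_gb:.1f} GB")
```

The FP16 estimate lands near the 13 GB in the first row of the table, and the 4-bit estimate lands slightly below the 3.9 GB figure, which is the direction the caveat above predicts.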



In [6]:
# quantize the FP16 GGUF model to 4 bits (Q4_K_M method)
!cd llama.cpp/build/bin && ./llama-quantize /content/compiled_model/vicuna_7b_FP16.gguf /content/compiled_model/vicuna_7b_FP16_K_M.gguf q4_K_M

main: build = 3943 (cda0e4b6)
main: built with cc (Ubuntu 11.4.0-1ubuntu1~22.04) 11.4.0 for x86_64-linux-gnu
main: quantizing '/content/compiled_model/vicuna_7b_FP16.gguf' to '/content/compiled_model/vicuna_7b_FP16_K_M.gguf' as Q4_K_M
llama_model_loader: loaded meta data with 24 key-value pairs and 291 tensors from /content/compiled_model/vicuna_7b_FP16.gguf (version GGUF V3 (latest))
llama_model_loader: Dumping metadata keys/values. Note: KV overrides do not apply in this output.
llama_model_loader: - kv   0:                       general.architecture str              = llama
llama_model_loader: - kv   1:                               general.type str              = model
llama_model_loader: - kv   2:                               general.name str              = Base_Model
llama_model_loader: - kv   3:                         general.size_label str              = 6.7B
llama_model_loader: - kv   4:                          llama.block_count u32              = 32
llama_model_loader: - k

___
## Inference on quantized model

In [7]:
!pip install llama-cpp-python==0.2.85

Collecting llama-cpp-python==0.2.85
  Downloading llama_cpp_python-0.2.85.tar.gz (49.3 MB)
     ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 49.3/49.3 MB 13.2 MB/s eta 0:00:00
  Installing build dependencies ... done
  Getting requirements to build wheel ... done
  Installing backend dependencies ... done
  Preparing metadata (pyproject.toml) ... done
Collecting diskcache>=5.6.1 (from llama-cpp-python==0.2.85)
  Downloading diskcache-5.6.3-py3-none-any.whl.metadata (20 kB)
Downloading diskcache-5.6.3-py3-none-any.whl (45 kB)
   ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 45.5/45.5 kB 3.8 MB/s eta 0:00:00
Building wheels for collected packages: llama-cpp-python
  Building wheel for llama-cpp-python (pyproject.toml) ... done
  Created wheel for llama-cpp-python: filename=llama_cpp_python-0.2.85-cp310-cp310-linux_x86_64.whl size=2857609 sha256=73

In [8]:
from llama_cpp import Llama

model_path = "/content/compiled_model/vicuna_7b_FP16_K_M.gguf"

llm = Llama(model_path=model_path)

llama_model_loader: loaded meta data with 24 key-value pairs and 291 tensors from /content/compiled_model/vicuna_7b_FP16_K_M.gguf (version GGUF V3 (latest))
llama_model_loader: Dumping metadata keys/values. Note: KV overrides do not apply in this output.
llama_model_loader: - kv   0:                       general.architecture str              = llama
llama_model_loader: - kv   1:                               general.type str              = model
llama_model_loader: - kv   2:                               general.name str              = Base_Model
llama_model_loader: - kv   3:                         general.size_label str              = 6.7B
llama_model_loader: - kv   4:                          llama.block_count u32              = 32
llama_model_loader: - kv   5:                       llama.context_length u32              = 2048
llama_model_loader: - kv   6:                     llama.embedding_length u32              = 4096
llama_model_loader: - kv   7:                  llama.feed_fo

In [9]:
generation_kwargs = {
    "max_tokens":200,
    "echo":False,
    "top_k":1
}
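
Vicuna checkpoints from v1.1 onward were fine-tuned on a specific conversation template, so wrapping the user message in that template and passing stop strings may keep generations more focused than a bare prompt. The sketch below is an assumption-laden illustration rather than the notebook's method: `build_vicuna_prompt` and `ask` are hypothetical helpers, the system text follows the template published with FastChat as best I know it, and the demo uses a stub in place of a loaded `Llama` object so it runs without the model.

```python
VICUNA_SYSTEM = (
    "A chat between a curious user and an artificial intelligence assistant. "
    "The assistant gives helpful, detailed, and polite answers to the user's questions."
)


def build_vicuna_prompt(user_message, system=VICUNA_SYSTEM):
    """Wrap a single user turn in the assumed Vicuna v1.1-style template."""
    return f"{system} USER: {user_message.strip()} ASSISTANT:"


def ask(llm, user_message, **generation_kwargs):
    """Call any llama-cpp-python style callable with a templated prompt."""
    prompt = build_vicuna_prompt(user_message)
    # Stop strings end generation when the model starts a new turn.
    return llm(prompt, stop=["USER:", "</s>"], **generation_kwargs)


def fake_llm(prompt, **kwargs):
    """Stub standing in for a loaded Llama instance; echoes its inputs."""
    return {"prompt": prompt, "kwargs": kwargs}


demo = ask(fake_llm, "Tell me something about ChatGPT.", max_tokens=200, top_k=1)
print(demo["prompt"])
```

With a real model, `ask(llm, "Tell me something about ChatGPT.", **generation_kwargs)` would take the place of the stub call, provided `generation_kwargs` does not already contain a `stop` entry.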

In [11]:
prompt = "Tell me something about ChatGPT."
res = llm(prompt, **generation_kwargs)
res


llama_print_timings:        load time =    5545.62 ms
llama_print_timings:      sample time =       7.70 ms /   200 runs   (    0.04 ms per token, 25963.91 tokens per second)
llama_print_timings: prompt eval time =    5545.54 ms /    10 tokens (  554.55 ms per token,     1.80 tokens per second)
llama_print_timings:        eval time =  153947.73 ms /   199 runs   (  773.61 ms per token,     1.29 tokens per second)
llama_print_timings:       total time =  159792.17 ms /   209 tokens


{'id': 'cmpl-8f1f0ade-8ad5-469f-a561-6bebd66a0906',
 'object': 'text_completion',
 'created': 1729434551,
 'model': '/content/compiled_model/vicuna_7b_FP16_K_M.gguf',
 'choices': [{'text': '\n\nChatGPT is a language model developed by researchers from Large Model Systems Organization (LMSYS). It is a large-scale language model that is trained on a vast amount of text data and can generate human-like text in response to prompts. It is designed to assist with a wide range of natural language processing tasks, such The 2018-2023 World Outlook for 3D Printing and Additive Manufacturing\nThis study covers the world outlook for 3D printing and additive manufacturing across more than 190 countries. For each year reported, estimates are given for the latent demand, or potential industry earnings (P.I.E.), for the country in question (in millions of U.S. dollars), the percent share and growth rate of the world market, and the compound annual growth rate (CAGR) for the same period, calculated on