<a href="https://colab.research.google.com/github/aswinaus/LLM_Inference/blob/main/Copy_of_Quantizing_LLMs_and_inferencing_Quantized_model_from_HF___Llama_cpp.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

**This notebook covers the following topics:**
- What quantization is
- How to quantize a model
- How to run inference with a quantized model

**Quantization techniques covered in this notebook:**
- llama.cpp

# **Intro to Quantization**

***What is quantization of a large language model?***

Quantization is a technique for reducing the compute and memory requirements of a Large Language Model (LLM) by converting its weights, and in some schemes its activations, from a high-precision floating-point representation (typically 32-bit or 16-bit) to a lower-precision format such as 8-bit or 4-bit integers. This lets LLMs run more efficiently on hardware with limited resources, including mobile and IoT devices, usually with only a small loss in accuracy.
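As a rough illustration of the idea, the sketch below applies symmetric "absmax" quantization to a handful of weights in plain Python: each value is divided by a shared scale and rounded to an 8-bit integer, then multiplied back to approximate the original. The function names and sample weights are illustrative assumptions, and real quantizers (including those in llama.cpp) work block by block on whole tensors with more elaborate schemes.

```python
def quantize_absmax(weights, bits=8):
    """Map floats to signed integers using one shared scale (hypothetical helper)."""
    qmax = 2 ** (bits - 1) - 1  # 127 for 8 bits
    max_abs = max(abs(w) for w in weights)
    scale = max_abs / qmax if max_abs > 0 else 1.0
    # Round to the nearest integer and clamp to the representable range.
    quantized = [max(-qmax, min(qmax, round(w / scale))) for w in weights]
    return quantized, scale


def dequantize(quantized, scale):
    """Recover approximate float values from the integers."""
    return [q * scale for q in quantized]


weights = [0.5, -1.0, 0.25, 0.0, 0.8]  # made-up sample values
quantized, scale = quantize_absmax(weights)
restored = dequantize(quantized, scale)
max_error = max(abs(w - r) for w, r in zip(weights, restored))

print(quantized)
print(f"scale={scale:.6f}, max abs error={max_error:.6f}")
```

The rounding error per weight is at most half the scale, which is why fewer bits (a larger scale) cost more accuracy.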

***What are the benefits of quantization in large language models?***
- Reduced Model Size / Memory Footprint
- Faster Inference Speed / Increased Efficiency
- Lower Power Consumption / Energy Efficiency - Suitable for mobile devices
- Model Compression and Portability
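The size reduction can be estimated with simple arithmetic: storage is roughly the parameter count times the bits per weight. The sketch below assumes a round 8 billion parameters for an "8B" model and ignores metadata and per-block scales; `estimate_size_gb` is a hypothetical helper.

```python
def estimate_size_gb(n_params, bits_per_weight):
    """Approximate weight storage in gigabytes (10^9 bytes)."""
    return n_params * bits_per_weight / 8 / 1e9


N_PARAMS = 8.0e9  # assumption: round figure for an "8B" model

for label, bits in [("float32", 32), ("float16", 16), ("int8", 8), ("4-bit", 4)]:
    print(f"{label:>8}: {estimate_size_gb(N_PARAMS, bits):5.1f} GB")
```

Real files come out somewhat larger than the 4-bit estimate. The upload log later in this notebook reports 16.1G for the fp16 file and 4.92G for the Q4_K_M file, because such formats also store scales and keep some tensors at higher precision.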

***What are different quantization techniques?***
- Post-Training Quantization (PTQ)
- Quantization-Aware Training (QAT)
- Activation-Aware Weight Quantization (AWQ)
- NF4 Quantization - BitsAndBytes
- etc.

***Different Options for Quantization:***
- 16-bit (Float16)
- 8-bit (Int8): for deploying models on edge devices or situations where computational resources are limited
- 4-bit: Useful for extremely resource-constrained environments
- 1-bit (Binary)
- NF4 (4-bit NormalFloat): A specialized 4-bit data type, introduced with QLoRA and available in BitsAndBytes, whose 16 quantization levels are spaced to match normally distributed weights. Weights are normalized block by block, mapped to the nearest level, and dequantized at compute time. Suitable for applications that need a smaller model while retaining higher accuracy than uniform 4-bit quantization.
- etc.

--------------------------------------------------------------------------------

# **Quantize models using GGUF and llama.cpp**



Useful links:
- llama.cpp GitHub repo: https://github.com/ggerganov/llama.cpp
- llama-cpp-python GitHub repo: https://github.com/abetlen/llama-cpp-python

Quantization methods available in llama.cpp:

*   **q2_k:** Uses Q4_K for the attention.wv and feed_forward.w2 tensors, Q2_K for the other tensors.
*   **q3_k_l:** Uses Q5_K for the attention.wv, attention.wo, and feed_forward.w2 tensors, else Q3_K
*   **q3_k_m:** Uses Q4_K for the attention.wv, attention.wo, and feed_forward.w2 tensors, else Q3_K
*   **q3_k_s:** Uses Q3_K for all tensors
*   **q4_0:** Original quant method, 4-bit.
*   **q4_1:** Higher accuracy than q4_0 but lower than q5_0, with faster inference than the q5 models.
*   **q4_k_m:** Uses Q6_K for half of the attention.wv and feed_forward.w2 tensors, else Q4_K
*   **q4_k_s:** Uses Q4_K for all tensors
*   **q5_0:** Higher accuracy, higher resource usage and slower inference.
*   **q5_1:** Even higher accuracy, resource usage and slower inference.
*   **q5_k_m:** Uses Q6_K for half of the attention.wv and feed_forward.w2 tensors, else Q5_K
*   **q5_k_s:** Uses Q5_K for all tensors
*   **q6_k:** Uses Q6_K for all tensors
*   **q8_0:** Almost indistinguishable from float16. High resource use and slow. Not recommended for most users.

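These formats share one basic idea: weights are split into small blocks, and each block stores low-bit integers plus its own scale. The sketch below is a simplified, hypothetical illustration of that idea with 32-weight blocks and 4-bit values. It is not llama.cpp's actual q4_0 or k-quant code, and the 16-bit scale is an assumption used only to count storage.

```python
BLOCK_SIZE = 32  # assumption: weights per block


def quantize_blocks(weights, block_size=BLOCK_SIZE):
    """Quantize each block to 4-bit signed integers with a per-block scale."""
    blocks = []
    for start in range(0, len(weights), block_size):
        block = weights[start:start + block_size]
        max_abs = max(abs(w) for w in block)
        scale = max_abs / 7 if max_abs > 0 else 1.0
        # 4-bit signed range is [-8, 7].
        quants = [max(-8, min(7, round(w / scale))) for w in block]
        blocks.append((scale, quants))
    return blocks


def dequantize_blocks(blocks):
    """Rebuild approximate weights from (scale, quants) pairs."""
    return [scale * q for scale, quants in blocks for q in quants]


def bits_per_weight(block_size=BLOCK_SIZE, quant_bits=4, scale_bits=16):
    """Average storage per weight once the per-block scale is included."""
    return (block_size * quant_bits + scale_bits) / block_size


weights = [i / 10 for i in range(-32, 32)]  # 64 made-up weights -> 2 blocks
blocks = quantize_blocks(weights)
restored = dequantize_blocks(blocks)

print(len(blocks), "blocks")
print("bits per weight:", bits_per_weight())  # 4.5
```

Because every block carries its own scale, one outlier only degrades the 32 weights around it rather than the whole tensor, at the price of slightly more than 4 bits per weight.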


In [1]:
!git clone https://github.com/ggerganov/llama.cpp
!cd llama.cpp && cmake -B build && cmake --build build --config Release

Cloning into 'llama.cpp'...
remote: Enumerating objects: 46054, done.
remote: Counting objects: 100% (31/31), done.
remote: Compressing objects: 100% (16/16), done.
remote: Total 46054 (delta 22), reused 16 (delta 15), pack-reused 46023 (from 2)
Receiving objects: 100% (46054/46054), 97.61 MiB | 9.16 MiB/s, done.
Resolving deltas: 100% (32996/32996), done.
-- The C compiler identification is GNU 11.4.0
-- The CXX compiler identification is GNU 11.4.0
-- Detecting C compiler ABI info
-- Detecting C compiler ABI info - done
-- Check for working C compiler: /usr/bin/cc - skipped
-- Detecting C compile features
-- Detecting C compile features - done
-- Detecting CXX compiler ABI info
-- Detecting CXX compiler ABI info - done
-- Check for working CXX compiler: /usr/bin/c++ - skipped
-- Detecting CXX compile features
-- Detecting CXX compile features - done
-- Found Git: /usr/bin/git (found version "2.34.1")
-- Performing Test CMAKE_HAVE_LIBC_PTHREAD
-- Performing Test CMAKE_HAVE

In [2]:
from google.colab import userdata, drive
import torch
import os
from torch import bfloat16
from huggingface_hub import login, HfApi, create_repo
from transformers import AutoTokenizer, AutoModelForCausalLM

In [1]:
!pip install -r llama.cpp/requirements.txt

Looking in indexes: https://pypi.org/simple, https://download.pytorch.org/whl/cpu, https://download.pytorch.org/whl/cpu, https://download.pytorch.org/whl/cpu, https://download.pytorch.org/whl/cpu


In [2]:
!pip install --force-reinstall torchvision torchaudio --index-url https://download.pytorch.org/whl/cu118

Looking in indexes: https://download.pytorch.org/whl/cu118
Collecting torchvision
  Downloading https://download.pytorch.org/whl/cu118/torchvision-0.21.0%2Bcu118-cp311-cp311-linux_x86_64.whl.metadata (6.1 kB)
Collecting torchaudio
  Downloading https://download.pytorch.org/whl/cu118/torchaudio-2.6.0%2Bcu118-cp311-cp311-linux_x86_64.whl.metadata (6.6 kB)
Collecting numpy (from torchvision)
  Downloading https://download.pytorch.org/whl/numpy-2.1.2-cp311-cp311-manylinux_2_17_x86_64.manylinux2014_x86_64.whl.metadata (60 kB)
     ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 60.9/60.9 kB 4.0 MB/s eta 0:00:00
Collecting torch==2.6.0 (from torchvision)
  Downloading https://download.pytorch.org/whl/cu118/torch-2.6.0%2Bcu118-cp311-cp311-linux_x86_64.whl.metadata (27 kB)
Collecting pillow!=8.3.*,>=5.3.0 (from torchvision)
  Downloading https://download.pytorch.org/whl/pillow-11.0.0-cp311-cp311-manylinux_2_28_x86_64.whl.metadata (9.1 kB)
Collecting filel

In [None]:
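# Optional: update llama.cpp and rebuild with CUDA support.
# Recent llama.cpp versions build with CMake; `-DGGML_CUDA=ON` replaces the older `LLAMA_CUBLAS=1 make`.
# !cd llama.cpp && git pull && cmake -B build -DGGML_CUDA=ON && cmake --build build --config Release
# !pip install -r llama.cpp/requirements.txt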

remote: Enumerating objects: 7, done.
remote: Counting objects: 100% (5/5), done.
remote: Compressing objects: 100% (3/3), done.
remote: Total 7 (delta 2), reused 2 (delta 2), pack-reused 2 (from 2)
Unpacking objects: 100% (7/7), 2.59 KiB | 139.00 KiB/s, done.
From https://github.com/ggerganov/llama.cpp
   c1f958c0..8a8c4ceb  master     -> origin/master
Updating c1f958c0..8a8c4ceb
Fast-forward
 ggml/src/ggml-cpu/llamafile/sgemm.cpp | 17 +++

In [2]:
from transformers import pipeline

In [4]:
import os
import nest_asyncio
nest_asyncio.apply()

from google.colab import userdata
# Set the OpenAI API key as an environment variable
os.environ["OPENAI_API_KEY"] =  userdata.get('OPENAI_API_KEY')

HF_TOKEN = userdata.get('HUGGING_FACE_TOKEN')
# Import the login function from huggingface_hub
from huggingface_hub import login, HfApi
login(token=HF_TOKEN)
api = HfApi(token=HF_TOKEN)
username = api.whoami()['name']
print(username)

The token has not been saved to the git credentials helper. Pass `add_to_git_credential=True` in this function directly or `--add-to-git-credential` if using via `huggingface-cli` if you want to set the git credential as well.
Token is valid (permission: fineGrained).
Your token has been saved to /root/.cache/huggingface/token
Login successful
aswinaus


**Quantize meta-llama/Meta-Llama-3-8B-Instruct**

In [5]:
# Define the model ID for the desired model
model_id_llama = "meta-llama/Meta-Llama-3-8B-Instruct"
quantization_methods = ["q5_k_m", "q4_k_m"]

In [6]:
# Derive the local model folder name and the target Hub repo for the GGUF files
model_name = model_id_llama.split("/")[-1]
print(model_name)
quant_name = f"{model_name}-GGUF"
print(quant_name)
quant_repo_id = f"{username}/{quant_name}"
print(quant_repo_id)

Meta-Llama-3-8B-Instruct
Meta-Llama-3-8B-Instruct-GGUF
aswinaus/Meta-Llama-3-8B-Instruct-GGUF


`git-lfs` is Git Large File Storage (LFS), a Git extension for handling large files efficiently. Git normally stores the full history of every file in a repository, which causes performance problems and storage bloat when files are large. Git LFS instead stores large files outside the main repository and replaces them with small pointer files. Model weights on the Hugging Face Hub are stored this way, so `git-lfs` must be installed before cloning a model repository.
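A Git LFS pointer is a short text file made of `key value` lines that name the spec version, the content hash, and the size of the real file. The sketch below parses one. The digest and size are made-up example values, and `parse_lfs_pointer` is a hypothetical helper rather than part of any Git LFS tooling.

```python
def parse_lfs_pointer(text):
    """Parse the key/value lines of a Git LFS pointer file into a dict."""
    fields = {}
    for line in text.strip().splitlines():
        key, _, value = line.partition(" ")
        fields[key] = value
    hash_algo, _, digest = fields["oid"].partition(":")
    return {
        "version": fields["version"],
        "hash_algo": hash_algo,
        "digest": digest,
        "size": int(fields["size"]),
    }


fake_digest = "0123456789abcdef" * 4  # placeholder, not a real file hash
example_pointer = (
    "version https://git-lfs.github.com/spec/v1\n"
    f"oid sha256:{fake_digest}\n"
    "size 12345\n"
)

info = parse_lfs_pointer(example_pointer)
print(info["hash_algo"], info["size"])
```

If a cloned model folder contains files of only a few hundred bytes that look like this, the LFS content was not downloaded.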

In [None]:
!git-lfs install

Git LFS initialized.


In [None]:
# Download the model (the repo is gated, so access must be granted on its Hugging Face page first)
!git clone https://{username}:{HF_TOKEN}@huggingface.co/{model_id_llama}

Cloning into 'Meta-Llama-3-8B-Instruct'...
remote: Your request to access model meta-llama/Meta-Llama-3-8B-Instruct is awaiting a review from the repo authors.
fatal: unable to access 'https://huggingface.co/meta-llama/Meta-Llama-3-8B-Instruct/': The requested URL returned error: 403


In [None]:
# Convert the Hugging Face checkpoint to a 16-bit (f16) GGUF file
fp16 = f"{model_name}/{model_name.lower()}.fp16.bin"
!python llama.cpp/convert_hf_to_gguf.py {model_name} --outtype f16 --outfile {fp16}

# Quantize the f16 file once for each method in quantization_methods
for method in quantization_methods:
    qtype = f"{model_name}/{model_name.lower()}.{method.upper()}.gguf"
    !./llama.cpp/build/bin/llama-quantize {fp16} {qtype} {method}

INFO:hf-to-gguf:Loading model: Meta-Llama-3-8B-Instruct
INFO:gguf.gguf_writer:gguf: This GGUF file is for Little Endian only
INFO:hf-to-gguf:Exporting model...
INFO:hf-to-gguf:gguf: loading model weight map from 'model.safetensors.index.json'
INFO:hf-to-gguf:gguf: loading model part 'model-00001-of-00004.safetensors'
INFO:hf-to-gguf:token_embd.weight,           torch.bfloat16 --> F16, shape = {4096, 128256}
INFO:hf-to-gguf:blk.0.attn_norm.weight,      torch.bfloat16 --> F32, shape = {4096}
INFO:hf-to-gguf:blk.0.ffn_down.weight,       torch.bfloat16 --> F16, shape = {14336, 4096}
INFO:hf-to-gguf:blk.0.ffn_gate.weight,       torch.bfloat16 --> F16, shape = {4096, 14336}
INFO:hf-to-gguf:blk.0.ffn_up.weight,         torch.bfloat16 --> F16, shape = {4096, 14336}
INFO:hf-to-gguf:blk.0.ffn_norm.weight,       torch.bfloat16 --> F32, shape = {4096}
INFO:hf-to-gguf:blk.0.attn_k.weight,         torch.bfloat16 --> F16, shape = {4096, 1024}
INFO:hf-to-gguf:blk.0.attn_output.weight,    torch.bfloat1
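The `!` shell magic used in the conversion cell only works inside a notebook. As a minimal sketch, assuming the same paths as that cell, the same commands could be assembled as argument lists for `subprocess.run` in a plain Python script. `build_commands` is a hypothetical helper, and this block only prints the commands rather than running them.

```python
def build_commands(model_name, methods, llama_cpp_dir="llama.cpp"):
    """Return the convert command and one quantize command per method."""
    fp16 = f"{model_name}/{model_name.lower()}.fp16.bin"
    convert_cmd = [
        "python", f"{llama_cpp_dir}/convert_hf_to_gguf.py", model_name,
        "--outtype", "f16", "--outfile", fp16,
    ]
    quantize_cmds = []
    for method in methods:
        out_file = f"{model_name}/{model_name.lower()}.{method.upper()}.gguf"
        quantize_cmds.append(
            [f"./{llama_cpp_dir}/build/bin/llama-quantize", fp16, out_file, method]
        )
    return convert_cmd, quantize_cmds


convert_cmd, quantize_cmds = build_commands(
    "Meta-Llama-3-8B-Instruct", ["q5_k_m", "q4_k_m"]
)

print(" ".join(convert_cmd))
for cmd in quantize_cmds:
    print(" ".join(cmd))

# To execute outside a notebook, pass each list to subprocess.run(cmd, check=True).
```

Argument lists avoid shell quoting issues, and `check=True` would stop the script if conversion fails before any quantization step runs.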

In [None]:
# Create an empty repo
api.create_repo(
    repo_id = quant_repo_id,
    repo_type="model",
    exist_ok=True,
    token=HF_TOKEN,
    private=True
)

RepoUrl('https://huggingface.co/bigopot420/Meta-Llama-3-8B-Instruct-GGUF', endpoint='https://huggingface.co', repo_type='model', repo_id='bigopot420/Meta-Llama-3-8B-Instruct-GGUF')

In [None]:
# Upload the model folder (this sends the original safetensors shards as well as the GGUF files)
api.upload_folder(
    folder_path=model_name,
    repo_id=quant_repo_id,
    token=HF_TOKEN
)

The secret `HF_TOKEN` does not exist in your Colab secrets.
To authenticate with the Hugging Face Hub, create a token in your settings tab (https://huggingface.co/settings/tokens), set it as secret in your Google Colab and restart your session.
You will be able to reuse this secret in all of your notebooks.
Please note that authentication is recommended but still optional to access public models or datasets.


meta-llama-3-8b-instruct.Q4_K_M.gguf:   0%|          | 0.00/4.92G [00:00<?, ?B/s]

meta-llama-3-8b-instruct.fp16.bin:   0%|          | 0.00/16.1G [00:00<?, ?B/s]

Upload 8 LFS files:   0%|          | 0/8 [00:00<?, ?it/s]

meta-llama-3-8b-instruct.Q5_K_M.gguf:   0%|          | 0.00/5.73G [00:00<?, ?B/s]

model-00002-of-00004.safetensors:   0%|          | 0.00/5.00G [00:00<?, ?B/s]

model-00001-of-00004.safetensors:   0%|          | 0.00/4.98G [00:00<?, ?B/s]

model-00003-of-00004.safetensors:   0%|          | 0.00/4.92G [00:00<?, ?B/s]

model-00004-of-00004.safetensors:   0%|          | 0.00/1.17G [00:00<?, ?B/s]

tokenizer.model:   0%|          | 0.00/2.18M [00:00<?, ?B/s]

CommitInfo(commit_url='https://huggingface.co/bigopot420/Meta-Llama-3-8B-Instruct-GGUF/commit/e9db7db3449aa9a92b2d2dbf456fa9eb8a8f832d', commit_message='Upload folder using huggingface_hub', commit_description='', oid='e9db7db3449aa9a92b2d2dbf456fa9eb8a8f832d', pr_url=None, repo_url=RepoUrl('https://huggingface.co/bigopot420/Meta-Llama-3-8B-Instruct-GGUF', endpoint='https://huggingface.co', repo_type='model', repo_id='bigopot420/Meta-Llama-3-8B-Instruct-GGUF'), pr_revision=None, pr_num=None)
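The upload log shows that the original safetensors shards were uploaded along with the GGUF files, because the whole model folder was sent. `upload_folder` accepts an `allow_patterns` argument for restricting what is uploaded. The sketch below uses `fnmatch` from the standard library to preview which of the files named in the log a pattern list would keep. The pattern list and `preview_upload` are assumptions for illustration, and the Hub client's own matching may differ in edge cases.

```python
from fnmatch import fnmatch

# File names taken from the upload log above.
files_in_folder = [
    "meta-llama-3-8b-instruct.Q4_K_M.gguf",
    "meta-llama-3-8b-instruct.Q5_K_M.gguf",
    "meta-llama-3-8b-instruct.fp16.bin",
    "model-00001-of-00004.safetensors",
    "model-00002-of-00004.safetensors",
    "model-00003-of-00004.safetensors",
    "model-00004-of-00004.safetensors",
    "tokenizer.model",
]

allow_patterns = ["*.gguf"]  # assumption: publish only the quantized files


def preview_upload(files, patterns):
    """Return the files that match at least one glob pattern."""
    return [name for name in files if any(fnmatch(name, p) for p in patterns)]


selected = preview_upload(files_in_folder, allow_patterns)
print(selected)
```

Passing the same list as `allow_patterns=allow_patterns` in the `api.upload_folder(...)` call should then skip the roughly 16 GB of safetensors shards and the fp16 file.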

**Running inference with GGUF models**

**Using llama-cpp-python (recommended)**

In [None]:
!pip install llama-cpp-python

Collecting llama-cpp-python
  Downloading llama_cpp_python-0.3.7.tar.gz (66.7 MB)
     ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 66.7/66.7 MB 12.8 MB/s eta 0:00:00
  Installing build dependencies ... done
  Getting requirements to build wheel ... done
  Installing backend dependencies ... done
  Preparing metadata (pyproject.toml) ... done
Collecting diskcache>=5.6.1 (from llama-cpp-python)
  Downloading diskcache-5.6.3-py3-none-any.whl.metadata (20 kB)
Downloading diskcache-5.6.3-py3-none-any.whl (45 kB)
   ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 45.5/45.5 kB 4.2 MB/s eta 0:00:00
Building wheels for collected packages: llama-cpp-python
  Building wheel for llama-cpp-python (pyproject.toml) ... done
  Created wheel for llama-cpp-python: filename=llama_cpp_python-0.3.7-cp311-cp311-linux_x86_64.whl size=4552818 sha256=8a7af38f8ad89cd8f7fa

In [None]:
!pip install python-dotenv

Collecting python-dotenv
  Downloading python_dotenv-1.0.1-py3-none-any.whl.metadata (23 kB)
Downloading python_dotenv-1.0.1-py3-none-any.whl (19 kB)
Installing collected packages: python-dotenv
Successfully installed python-dotenv-1.0.1


In [None]:
from llama_cpp import Llama
import os
import dotenv
from huggingface_hub import login, HfApi

In [None]:
dotenv.load_dotenv()

HF_TOKEN = os.environ.get("HUGGING_FACE_API_KEY")
login(token=HF_TOKEN)
api = HfApi(token=HF_TOKEN)
username = api.whoami()["name"]


In [None]:
repo_id = "bigopot420/Meta-Llama-3-8B-Instruct-GGUF"
filename = "meta-llama-3-8b-instruct.Q4_K_M.gguf"

In [None]:
prompt = "Tell me a funny joke about Large Language Models meeting a Blackhole in an intergalactic Bar."

In [None]:
llm = Llama.from_pretrained(
    repo_id=repo_id,
    filename=filename,
    verbose=False,
)

meta-llama-3-8b-instruct.Q4_K_M.gguf:   0%|          | 0.00/4.92G [00:00<?, ?B/s]

llama_init_from_model: n_ctx_per_seq (512) < n_ctx_train (8192) -- the full capacity of the model will not be utilized


In [None]:
llm_response = llm.create_chat_completion(
    messages=[{"role": "user", "content": prompt}],
    temperature=0.85,
    top_p=0.8,
    top_k=50,
    repeat_penalty=1.01,
)
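The `temperature`, `top_k`, and `top_p` arguments control how the next token is sampled. The sketch below shows, in plain Python, one common way these three filters can be combined on a toy list of logits. It is a conceptual illustration under stated assumptions, not llama.cpp's sampler: the order of operations there may differ, `repeat_penalty` is not modeled, and `filter_distribution` is a hypothetical helper.

```python
import math


def filter_distribution(logits, temperature=0.85, top_k=50, top_p=0.8):
    """Return {token_index: probability} after temperature, top-k and top-p."""
    # Temperature: values below 1 sharpen the distribution, above 1 flatten it.
    scaled = [value / temperature for value in logits]
    peak = max(scaled)
    exps = [math.exp(value - peak) for value in scaled]
    total = sum(exps)
    probs = [e / total for e in exps]

    # Top-k: keep only the k most likely tokens.
    ranked = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)[:top_k]

    # Top-p: keep the smallest prefix whose cumulative probability reaches top_p.
    kept, cumulative = [], 0.0
    for index in ranked:
        kept.append(index)
        cumulative += probs[index]
        if cumulative >= top_p:
            break

    # Renormalize so the surviving probabilities sum to 1.
    norm = sum(probs[i] for i in kept)
    return {i: probs[i] / norm for i in kept}


toy_logits = [2.0, 1.0, 0.1, -1.0]  # made-up scores for a 4-token vocabulary
dist = filter_distribution(toy_logits, temperature=1.0, top_k=3, top_p=0.8)
print(dist)
```

With these toy logits only the two most likely tokens survive: the first alone falls short of the 0.8 threshold, and adding the second crosses it.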

In [None]:
llm_response['choices'][0]['message']['content']

'What a delightful and absurd request! Here\'s a joke for you:\n\nWhy did the Large Language Model get sucked into a black hole at the intergalactic bar?\n\nBecause it was trying to "parse" the situation and ended up getting "lost in translation"... now it\'s just a "hole" lot of data being consumed!\n\n(The bartender, a wise-cracking alien, shrugs and says, "Well, I guess that\'s what happens when you try to \'model\' a black hole... it\'s a \'gravity\' mistake!")'