<a href="https://colab.research.google.com/github/aswinaus/Quantization/blob/main/CPU_Quantizing_LLMs_and_inferencing_Quantized_model.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

**This notebook covers the following topics:**
- What quantization is
- How to quantize a model
- How to run inference with a quantized model

**Quantization techniques covered in this notebook:**
- llama.cpp

# **Intro to Quantization**

***What is quantization of a large language model?***

Quantization of Large Language Models (LLMs) is a technique that reduces a model's computational and memory requirements by converting its weights (and sometimes its activations) from a high-precision floating-point representation, typically 32-bit or 16-bit, to a lower-precision format such as 8-bit or 4-bit integers. This lets LLMs run more efficiently on hardware with limited compute and memory, including mobile and IoT devices, usually with only a small loss in accuracy.
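
As a rough illustration of the idea, the sketch below quantizes a handful of float weights to 8-bit integers using a single "absmax" scale and then converts them back. The function names and the sample weights are made up for this example, and real libraries use more elaborate schemes (per-block scales, zero points, and so on).

```python
def quantize_absmax(weights, bits=8):
    """Map floats to signed integers using one scale derived from the largest magnitude."""
    qmax = 2 ** (bits - 1) - 1              # 127 for 8 bits
    max_abs = max(abs(w) for w in weights)
    scale = max_abs / qmax if max_abs > 0 else 1.0
    quantized = [max(-qmax, min(qmax, round(w / scale))) for w in weights]
    return quantized, scale


def dequantize(quantized, scale):
    """Recover approximate float values from the integer codes."""
    return [q * scale for q in quantized]


# Illustrative weights, not taken from any real model.
weights = [0.12, -0.53, 0.98, -1.27, 0.04, 0.66]
q, scale = quantize_absmax(weights, bits=8)
restored = dequantize(q, scale)
max_error = max(abs(w - r) for w, r in zip(weights, restored))

print("int8 codes:", q)
print("scale:", scale)
print("max round-trip error:", max_error)
```

The round-trip error is bounded by half the scale, which is why fewer bits (a larger scale) generally means a larger error.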

***What are the benefits of quantization in large language models?***
- Reduced Model Size / Memory Footprint
- Faster Inference Speed / Increased Efficiency
- Lower Power Consumption / Energy Efficiency - Suitable for mobile devices
- Model Compression and Portability
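
To make the memory saving concrete, here is a back-of-the-envelope estimate of weight storage at different precisions. It assumes a round figure of 8 billion parameters and counts only the weights, ignoring activations, the KV cache, and any per-block metadata that real formats add.

```python
def model_size_gb(n_params, bits_per_weight):
    """Approximate weight storage in gigabytes (1 GB = 1e9 bytes)."""
    return n_params * bits_per_weight / 8 / 1e9


n_params = 8e9  # assumed round figure for an 8B-parameter model

for name, bits in [("float32", 32), ("float16", 16), ("int8", 8), ("4-bit", 4)]:
    print(f"{name:>8}: {model_size_gb(n_params, bits):5.1f} GB")
```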

***What are the different quantization techniques?***
- Post-Training Quantization (PTQ)
- Quantization-Aware Training (QAT)
- Activation-Aware Weight Quantization (AWQ)
- NF4 Quantization - BitsAndBytes
- etc.

***Different Options for Quantization:***
- 16-bit (Float16)
- 8-bit (Int8): Useful for deploying models on edge devices or in situations where computational resources are limited
- 4-bit: Useful for extremely resource-constrained environments
- 1-bit (Binary)
- NF4 (4-bit NormalFloat): A specialized 4-bit data type whose quantization levels are chosen for normally distributed weights. Weights are normalized, quantized to 4 bits, and dequantized again at compute time. Suitable for applications that need a balance between model size reduction and higher accuracy than traditional 4-bit quantization.
- etc.

--------------------------------------------------------------------------------

# **Quantize models using GGUF and llama.cpp**



Useful links:
- llama.cpp GitHub repo: https://github.com/ggerganov/llama.cpp
- llama-cpp-python GitHub repo: https://github.com/abetlen/llama-cpp-python

Quantization methods available in llama.cpp include:

*   **q2_k:** Uses Q4_K for the attention.wv and feed_forward.w2 tensors, Q2_K for the other tensors.
*   **q3_k_l:** Uses Q5_K for the attention.wv, attention.wo, and feed_forward.w2 tensors, else Q3_K
*   **q3_k_m:** Uses Q4_K for the attention.wv, attention.wo, and feed_forward.w2 tensors, else Q3_K
*   **q3_k_s:** Uses Q3_K for all tensors
*   **q4_0:** Original quant method, 4-bit.
*   **q4_1:** Higher accuracy than q4_0 but not as high as q5_0. Faster inference than the q5 models, however.
*   **q4_k_m:** Uses Q6_K for half of the attention.wv and feed_forward.w2 tensors, else Q4_K
*   **q4_k_s:** Uses Q4_K for all tensors
*   **q5_0:** Higher accuracy, higher resource usage and slower inference.
*   **q5_1:** Even higher accuracy, resource usage and slower inference.
*   **q5_k_m:** Uses Q6_K for half of the attention.wv and feed_forward.w2 tensors, else Q5_K
*   **q5_k_s:** Uses Q5_K for all tensors
*   **q6_k:** Uses Q6_K for all tensors
*   **q8_0:** Almost indistinguishable from float16. High resource use and slow. Not recommended for most users.
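
The methods above all quantize weights in small blocks, each with its own scale. The sketch below shows that idea with a simplified 4-bit scheme: 32 weights per block and one scale per block. It is only loosely modeled on q4_0 and is not bit-compatible with llama.cpp; the function names and the random test weights are assumptions made for the example.

```python
import random

BLOCK_SIZE = 32  # weights per block; q4_0 also groups weights in blocks of 32


def quantize_block_4bit(block):
    """Quantize one block to signed 4-bit codes in [-8, 7] with a single scale."""
    max_abs = max(abs(x) for x in block)
    scale = max_abs / 7 if max_abs > 0 else 1.0
    codes = [max(-8, min(7, round(x / scale))) for x in block]
    return codes, scale


def quantize_blockwise(weights):
    """Split the weights into blocks and quantize each block independently."""
    return [
        quantize_block_4bit(weights[start:start + BLOCK_SIZE])
        for start in range(0, len(weights), BLOCK_SIZE)
    ]


def dequantize_blockwise(blocks):
    """Rebuild approximate floats from (codes, scale) pairs."""
    return [code * scale for codes, scale in blocks for code in codes]


random.seed(0)
weights = [random.gauss(0.0, 0.02) for _ in range(4 * BLOCK_SIZE)]  # toy data

blocks = quantize_blockwise(weights)
restored = dequantize_blockwise(blocks)
mean_abs_error = sum(abs(w - r) for w, r in zip(weights, restored)) / len(weights)

# Storage cost: 4 bits per weight plus one 16-bit scale shared by each block.
bits_per_weight = (BLOCK_SIZE * 4 + 16) / BLOCK_SIZE

print("blocks:", len(blocks))
print("mean absolute error:", mean_abs_error)
print("bits per weight:", bits_per_weight)
```

Because each block carries its own scale, one outlier weight only degrades the precision of its own block rather than the whole tensor.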



In [None]:
!git clone https://github.com/ggerganov/llama.cpp
!cd llama.cpp && cmake -B build && cmake --build build --config Release

In [None]:
from google.colab import userdata, drive
import torch
import os
from torch import bfloat16
from huggingface_hub import login, HfApi, create_repo
from transformers import AutoTokenizer, AutoModelForCausalLM

In [None]:
!pip install -r llama.cpp/requirements.txt

In [None]:
!pip install --force-reinstall torchvision torchaudio --index-url https://download.pytorch.org/whl/cu118

In [None]:
# !cd llama.cpp && git pull && make clean && LLAMA_CUBLAS=1 make
# !pip install -r llama.cpp/requirements.txt

In [None]:
from transformers import pipeline

In [None]:
import os
import nest_asyncio
nest_asyncio.apply()

from google.colab import userdata
# Set the OpenAI API key as an environment variable
os.environ["OPENAI_API_KEY"] =  userdata.get('OPENAI_API_KEY')

HF_TOKEN = userdata.get('HUGGING_FACE_TOKEN')
# Import the login function from huggingface_hub
from huggingface_hub import login, HfApi
login(token=HF_TOKEN)
api = HfApi(token=HF_TOKEN)
username = api.whoami()['name']
print(username)

**Quantize meta-llama/Meta-Llama-3-8B-Instruct**

In [None]:
# Define the model ID for the desired model
model_id_llama = "meta-llama/Meta-Llama-3-8B-Instruct"
quantization_methods = ["q5_k_m", "q4_k_m"]

In [None]:
model_name =  model_id_llama.split("/")[-1]
print(model_name)
quant_name =  model_id_llama.split("/")[-1] + "-GGUF"
print(quant_name)
quant_repo_id = f"{username}/{quant_name}"
print(quant_repo_id)

git-lfs is Git Large File Storage, a Git extension designed to handle large files more efficiently. Normally, Git stores the full history of every file in a repository, which for large files leads to slow operations and storage bloat. Git LFS addresses this by storing large files outside the main Git repository and replacing them in the repository with small pointer files.
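
For a sense of what such a pointer looks like, the sketch below parses the text of a Git LFS pointer file. The pointer contents are hypothetical (the hash and size are placeholders, not those of any real model file), and `parse_lfs_pointer` is a helper written for this example.

```python
def parse_lfs_pointer(text):
    """Parse the 'key value' lines of a Git LFS pointer file into a dict."""
    fields = {}
    for line in text.strip().splitlines():
        key, _, value = line.partition(" ")
        fields[key] = value
    fields["size"] = int(fields["size"])  # size of the real file, in bytes
    return fields


# Hypothetical pointer contents, for illustration only.
pointer_text = """version https://git-lfs.github.com/spec/v1
oid sha256:4d7a214614ab2935c943f9e0ff69d22eadbb8f32b1258daaa5e2ca24d17e2393
size 4976698672
"""

info = parse_lfs_pointer(pointer_text)
print("hash:", info["oid"])
print("real file size (GB):", info["size"] / 1e9)
```

If a cloned model directory contains only these tiny text files instead of multi-gigabyte weights, git-lfs was most likely not installed before cloning.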

In [None]:
!git-lfs install

In [None]:
# Download model
!git clone https://{username}:{HF_TOKEN}@huggingface.co/{model_id_llama}

In [None]:
# Convert the Hugging Face checkpoint to a 16-bit GGUF file.
# Note: the converter writes GGUF regardless of the ".bin" extension used here.
fp16 = f"{model_name}/{model_name.lower()}.fp16.bin"
!python llama.cpp/convert_hf_to_gguf.py {model_name} --outtype f16 --outfile {fp16}

# Quantize the model once for each method in quantization_methods
for method in quantization_methods:
    qtype = f"{model_name}/{model_name.lower()}.{method.upper()}.gguf"
    !./llama.cpp/build/bin/llama-quantize {fp16} {qtype} {method}
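
Outside a notebook, the `!` shell lines are not available. One possible way to express the same quantization loop in plain Python is to build the argument lists first and hand them to `subprocess.run`. The helper name `build_quantize_commands` is made up for this sketch; the binary path and file naming follow the cell above.

```python
from pathlib import Path

QUANTIZE_BIN = "./llama.cpp/build/bin/llama-quantize"  # same path as in the cell above


def build_quantize_commands(model_name, methods):
    """Return one llama-quantize argument list per quantization method."""
    model_dir = Path(model_name)
    stem = model_name.lower()
    fp16_path = model_dir / f"{stem}.fp16.bin"
    commands = []
    for method in methods:
        out_path = model_dir / f"{stem}.{method.upper()}.gguf"
        commands.append(
            [QUANTIZE_BIN, fp16_path.as_posix(), out_path.as_posix(), method]
        )
    return commands


commands = build_quantize_commands("Meta-Llama-3-8B-Instruct", ["q5_k_m", "q4_k_m"])
for cmd in commands:
    print(" ".join(cmd))
    # To actually run the step, assuming the binary has been built:
    # import subprocess; subprocess.run(cmd, check=True)
```

Building the commands separately from running them makes it easy to inspect the output paths before starting a long quantization job.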

In [None]:
# Create an empty repo
api.create_repo(
    repo_id = quant_repo_id,
    repo_type="model",
    exist_ok=True,
    token=HF_TOKEN,
    private=True
)

In [None]:
# Upload only the GGUF files (skips the original checkpoint files in the folder)
api.upload_folder(
    folder_path=model_name,
    repo_id=quant_repo_id,
    allow_patterns=["*.gguf"],
    token=HF_TOKEN
)

**Running inference with GGUF models**

**Using llama-cpp-python (recommended)**

In [None]:
!pip install llama-cpp-python

In [None]:
!pip install python-dotenv

In [None]:
from llama_cpp import Llama
import os
import dotenv
from huggingface_hub import login, HfApi

In [None]:
dotenv.load_dotenv()

HF_TOKEN = os.environ.get("HUGGING_FACE_API_KEY")
login(token=HF_TOKEN)
api = HfApi(token=HF_TOKEN)
username = api.whoami()["name"]

In [None]:
repo_id = "bigopot420/Meta-Llama-3-8B-Instruct-GGUF"
filename = "meta-llama-3-8b-instruct.Q4_K_M.gguf"

In [None]:
prompt = "Tell me a funny joke about Large Language Models meeting a Blackhole in an intergalactic Bar."

In [None]:
llm = Llama.from_pretrained(
    repo_id=repo_id,
    filename=filename,
    verbose=False,
)

In [None]:
llm_response = llm.create_chat_completion(
    messages=[{"role": "user", "content": prompt}],
    temperature=0.85,
    top_p=0.8,
    top_k=50,
    repeat_penalty=1.01,
)
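
The `temperature`, `top_k`, and `top_p` arguments shape how the next token is picked. The toy sketch below applies the three ideas to a made-up set of logits. It is not llama.cpp's actual sampler (which may apply the steps in a different order and also handles `repeat_penalty`); the function name and the logits are assumptions for illustration.

```python
import math
import random


def filter_distribution(logits, temperature=1.0, top_k=0, top_p=1.0):
    """Return {token: probability} after temperature, top-k and top-p filtering."""
    # Temperature: values below 1 sharpen the distribution, above 1 flatten it.
    scaled = {tok: logit / temperature for tok, logit in logits.items()}
    max_logit = max(scaled.values())
    exps = {tok: math.exp(v - max_logit) for tok, v in scaled.items()}
    total = sum(exps.values())
    probs = sorted(
        ((tok, e / total) for tok, e in exps.items()),
        key=lambda pair: pair[1],
        reverse=True,
    )
    # Top-k: keep only the k most likely tokens (0 disables the filter).
    if top_k > 0:
        probs = probs[:top_k]
    # Top-p: keep the smallest prefix whose cumulative probability reaches top_p.
    kept, cumulative = [], 0.0
    for tok, p in probs:
        kept.append((tok, p))
        cumulative += p
        if cumulative >= top_p:
            break
    norm = sum(p for _, p in kept)
    return {tok: p / norm for tok, p in kept}


# Made-up logits for five candidate tokens.
toy_logits = {"the": 3.0, "a": 2.5, "black": 1.0, "hole": 0.5, "bar": -1.0}
dist = filter_distribution(toy_logits, temperature=0.85, top_k=50, top_p=0.8)

random.seed(0)
tokens, weights = zip(*dist.items())
next_token = random.choices(tokens, weights=weights, k=1)[0]

print("candidates after filtering:", dist)
print("sampled token:", next_token)
```

With these toy numbers, `top_p=0.8` is the binding constraint: only the most likely tokens survive, and `top_k=50` has no effect because there are fewer than 50 candidates.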

In [None]:
llm_response['choices'][0]['message']['content']
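
The response is a nested dictionary in the OpenAI chat-completion style. A small helper, sketched below, can pull out the reply text and check why generation stopped. The example dictionary is hand-written for illustration and is not real model output; `extract_reply` is a name invented for this sketch.

```python
def extract_reply(response):
    """Return (text, finish_reason) from a chat-completion style dict."""
    choice = response["choices"][0]
    return choice["message"]["content"], choice.get("finish_reason")


# Hypothetical response, hand-written for illustration.
example_response = {
    "object": "chat.completion",
    "choices": [
        {
            "index": 0,
            "message": {"role": "assistant", "content": "Example reply text."},
            "finish_reason": "stop",
        }
    ],
    "usage": {"prompt_tokens": 30, "completion_tokens": 5, "total_tokens": 35},
}

text, finish_reason = extract_reply(example_response)
print(text)
if finish_reason == "length":
    # The reply hit the token limit before the model finished.
    print("Reply was cut off; consider passing a larger max_tokens.")
```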