<a href="https://colab.research.google.com/github/alok-abhishek/Quantizing-LLMs-and-inferencing-Quantized-model-from-HF/blob/master/Quantizing_LLMs_and_inferencing_Quantized_model_from_HF.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

**This workbook covers the following topics:**
- What quantization is
- How to quantize a model
- How to run inference with a quantized model

**Quantization techniques covered:**
- llama.cpp
- bnb
- AWQ
- ExLlamaV2
- GPTQ

🤗 For questions or comments, reach out to me [@alokabhishek](https://www.linkedin.com/in/alokabhishek/).

# **Intro to Quantization**

***What is quantization of large language models?***

Quantization of Large Language Models (LLMs) reduces the computational and memory requirements of these models by converting their weights, and in some schemes their activations, from a high-precision 32-bit floating-point representation to a lower-precision format such as 8-bit or 4-bit integers. This allows LLMs to run more efficiently on hardware with limited computational resources, including mobile and IoT devices, without significantly compromising the model's performance or accuracy.
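To make the idea concrete, here is a minimal sketch of one simple scheme, symmetric "absmax" quantization, in plain Python. The function names and the toy weight values are assumptions chosen for illustration; the libraries covered later in this workbook use more elaborate methods.

```python
def quantize_absmax(weights, bits=8):
    """Map floats to signed integers in [-qmax, qmax] using one shared scale."""
    qmax = 2 ** (bits - 1) - 1  # 127 for 8-bit, 7 for 4-bit
    max_abs = max(abs(w) for w in weights)
    scale = max_abs / qmax if max_abs > 0 else 1.0
    quantized = [max(-qmax, min(qmax, round(w / scale))) for w in weights]
    return quantized, scale


def dequantize(quantized, scale):
    """Recover approximate floats from the stored integers and the scale."""
    return [q * scale for q in quantized]


weights = [0.12, -0.53, 0.98, -1.27, 0.04]  # toy values, not real model weights

for bits in (8, 4):
    q, scale = quantize_absmax(weights, bits)
    restored = dequantize(q, scale)
    max_err = max(abs(w - r) for w, r in zip(weights, restored))
    print(f"{bits}-bit: ints={q} scale={scale:.4f} max_abs_error={max_err:.4f}")
```

Only the integers and the single scale need to be stored, which is where the memory saving comes from; the reconstruction error grows as the bit width shrinks.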

***What are the benefits of quantization in large language models?***
- Reduced Model Size / Memory Footprint
- Faster Inference Speed / Increased Efficiency
- Lower Power Consumption / Energy Efficiency - Suitable for mobile devices
- Model Compression and Portability
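The memory benefit can be estimated with simple arithmetic. The sketch below assumes a round figure of 7 billion parameters and counts only the weights; activations, the KV cache, and quantization metadata such as scales are ignored, so real footprints will be somewhat larger.

```python
def weight_memory_gib(n_params, bits_per_weight):
    """Approximate storage for the weights alone, in GiB."""
    return n_params * bits_per_weight / 8 / 2**30


N_PARAMS = 7_000_000_000  # assumed round figure for a "7B" model

for label, bits in [("float32", 32), ("float16", 16), ("int8", 8), ("4-bit", 4)]:
    print(f"{label:>8}: {weight_memory_gib(N_PARAMS, bits):6.2f} GiB")
```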

***What are the different quantization techniques?***
- Post-Training Quantization (PTQ)
- Quantization-Aware Training (QAT)
- Activation-Aware Weight Quantization (AWQ)
- NF4 Quantization - BitsAndBytes
- etc.

***Different Options for Quantization:***
- 16-bit (Float16)
- 8-bit (Int8): for deploying models on edge devices or situations where computational resources are limited
- 4-bit: Useful for extremely resource-constrained environments
- 1-bit (Binary)
- NF4 (4-bit NormalFloat): A specialized 4-bit format designed to efficiently represent a higher-precision data type. It uses normalization, quantization, and dequantization steps to approximate the original 32-bit weights. Suitable for applications that need to balance model size reduction against higher accuracy than traditional 4-bit quantization.
- etc.
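The normalize, quantize, dequantize cycle mentioned for NF4 can be sketched with a 16-entry codebook. This is only an illustration: the codebook below is built from evenly spaced quantiles of a standard normal distribution and rescaled to [-1, 1], which is an assumption and not the exact table that bitsandbytes ships. All function names are hypothetical.

```python
from statistics import NormalDist


def build_codebook(levels=16):
    """Illustrative codebook: normal quantiles rescaled to the range [-1, 1]."""
    nd = NormalDist()
    probs = [(i + 0.5) / levels for i in range(levels)]  # avoids 0 and 1
    quantiles = [nd.inv_cdf(p) for p in probs]
    largest = max(abs(q) for q in quantiles)
    return [q / largest for q in quantiles]


def quantize_block(block, codebook):
    """Normalize a block by its absmax, then store the nearest codebook index."""
    absmax = max(abs(w) for w in block) or 1.0
    indices = [
        min(range(len(codebook)), key=lambda k: abs(codebook[k] - w / absmax))
        for w in block
    ]
    return indices, absmax


def dequantize_block(indices, absmax, codebook):
    """Look up each index and undo the normalization."""
    return [codebook[i] * absmax for i in indices]


codebook = build_codebook()
block = [0.5, -2.0, 0.1, 1.0]  # toy weights
indices, absmax = quantize_block(block, codebook)
print(indices, absmax)
print(dequantize_block(indices, absmax, codebook))
```

Each weight is stored as a 4-bit index (0 to 15) plus one shared absmax per block, and the levels are denser near zero where normally distributed weights concentrate.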

--------------------------------------------------------------------------------

# **Quantize models using GGUF and llama.cpp**



Useful links:
- llama.cpp GitHub repo: [llama.cpp github repo](https://github.com/ggerganov/llama.cpp)
- llama-cpp-python GitHub repo: https://github.com/abetlen/llama-cpp-python

*   **q2_k:** Uses Q4_K for the attention.wv and feed_forward.w2 tensors, Q2_K for the other tensors.
*   **q3_k_l:** Uses Q5_K for the attention.wv, attention.wo, and feed_forward.w2 tensors, else Q3_K
*   **q3_k_m:** Uses Q4_K for the attention.wv, attention.wo, and feed_forward.w2 tensors, else Q3_K
*   **q3_k_s:** Uses Q3_K for all tensors
*   **q4_0:** Original quant method, 4-bit.
*   **q4_1:** Higher accuracy than q4_0 but not as high as q5_0. However, inference is quicker than with q5 models.
*   **q4_k_m:** Uses Q6_K for half of the attention.wv and feed_forward.w2 tensors, else Q4_K
*   **q4_k_s:** Uses Q4_K for all tensors
*   **q5_0:** Higher accuracy, higher resource usage and slower inference.
*   **q5_1:** Even higher accuracy, with higher resource usage and slower inference.
*   **q5_k_m:** Uses Q6_K for half of the attention.wv and feed_forward.w2 tensors, else Q5_K
*   **q5_k_s:** Uses Q5_K for all tensors
*   **q6_k:** Uses Q8_K for all tensors
*   **q8_0:** Almost indistinguishable from float16. High resource use and slow. Not recommended for most users.



In [None]:
!git clone https://github.com/ggerganov/llama.cpp
!cd llama.cpp && git pull && make clean && LLAMA_CUBLAS=1 make
!pip install -r llama.cpp/requirements.txt

In [None]:
from google.colab import userdata, drive
import torch
import os
from torch import bfloat16
from huggingface_hub import login, HfApi, create_repo
from transformers import AutoTokenizer, AutoModelForCausalLM, pipeline

In [None]:
HF_TOKEN = userdata.get('HUGGING_FACE_API_KEY')
login(token=HF_TOKEN)
api = HfApi(token=HF_TOKEN)
username = api.whoami()['name']
print(username)

**Quantize meta-llama/Llama-2-7b-chat-hf**

In [None]:
# Define the model ID for the desired model
model_id_llama = "meta-llama/Llama-2-7b-chat-hf"
quantization_methods = ["q5_k_m", "q4_k_m"]

In [None]:
model_name =  model_id_llama.split("/")[-1]
print(model_name)
quant_name =  model_id_llama.split("/")[-1] + "-GGUF"
print(quant_name)
quant_repo_id = f"{username}/{quant_name}"
print(quant_repo_id)

In [None]:
!git-lfs install

In [None]:
# Download model
!git clone https://{username}:{HF_TOKEN}@huggingface.co/{model_id_llama}
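Cloning with the token embedded in the URL stores the token in the clone's git configuration and can echo it into the cell output. As an alternative, here is a minimal sketch using `snapshot_download` from `huggingface_hub`. The helper names `local_dir_for` and `download_model` are hypothetical, and the target folder simply mirrors the folder name that `git clone` would create.

```python
from typing import Optional


def local_dir_for(repo_id: str) -> str:
    """Mirror the notebook's convention: a folder named after the model."""
    return repo_id.split("/")[-1]


def download_model(repo_id: str, token: Optional[str] = None) -> str:
    """Download every file of a Hub repo and return the local folder path."""
    # Imported lazily so the helper above works without the package installed.
    from huggingface_hub import snapshot_download

    return snapshot_download(
        repo_id=repo_id,
        local_dir=local_dir_for(repo_id),
        token=token,
    )


# Example (requires network access and access to the gated repo):
# download_model("meta-llama/Llama-2-7b-chat-hf", token=HF_TOKEN)
```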


In [None]:
# Convert to fp16
fp16 = f"{model_name}/{model_name.lower()}.fp16.bin"
!python llama.cpp/convert.py {model_name} --outtype f16 --outfile {fp16}

# Quantize the model for each method in the quantization_methods list
for method in quantization_methods:
    qtype = f"{model_name}/{model_name.lower()}.{method.upper()}.gguf"
    !./llama.cpp/quantize {fp16} {qtype} {method}
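Outside a notebook, the `!` shell escapes are not available. The sketch below builds the same convert and quantize commands as argument lists that could be passed to `subprocess.run`. The function name `build_quantize_commands` is hypothetical, and the script paths are copied from the cell above, so they assume the same llama.cpp checkout layout.

```python
import shlex


def build_quantize_commands(model_name, methods):
    """Return the convert command followed by one quantize command per method."""
    fp16 = f"{model_name}/{model_name.lower()}.fp16.bin"
    commands = [
        ["python", "llama.cpp/convert.py", model_name,
         "--outtype", "f16", "--outfile", fp16]
    ]
    for method in methods:
        out = f"{model_name}/{model_name.lower()}.{method.upper()}.gguf"
        commands.append(["./llama.cpp/quantize", fp16, out, method])
    return commands


for cmd in build_quantize_commands("Llama-2-7b-chat-hf", ["q5_k_m", "q4_k_m"]):
    # To execute, pass `cmd` to subprocess.run(cmd, check=True).
    print(shlex.join(cmd))
```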

In [None]:
# Create an empty repo
api.create_repo(
    repo_id = quant_repo_id,
    repo_type="model",
    exist_ok=True,
    token=HF_TOKEN,
    private=True
)

In [None]:
# Upload gguf files
api.upload_folder(
    folder_path=model_name,
    repo_id=quant_repo_id,
    token=HF_TOKEN
)

**Inferencing GGUF type models**

**Using llama_cpp (recommended)**

In [None]:
from llama_cpp import Llama
import os
import dotenv
from huggingface_hub import login, HfApi

In [None]:
dotenv.load_dotenv()

HF_TOKEN = os.environ.get("HUGGING_FACE_API_KEY")
login(token=HF_TOKEN)
api = HfApi(token=HF_TOKEN)
username = api.whoami()["name"]

In [None]:
repo_id = "alokabhishek/Llama-2-7b-chat-hf-GGUF"
filename = "llama-2-7b-chat-hf.Q4_K_M.gguf"

In [None]:
prompt = "Tell me a funny joke about Large Language Models meeting a Blackhole in an intergalactic Bar."

In [None]:
llm = Llama.from_pretrained(
    repo_id=repo_id,
    filename=filename,
    verbose=False,
)

In [None]:
llm_response = llm.create_chat_completion(
    messages=[{"role": "user", "content": prompt}],
    temperature=0.85,
    top_p=0.8,
    top_k=50,
    repeat_penalty=1.01,
)

In [None]:
llm_response_formatted = llm_response["choices"][0]["message"]["content"]

In [None]:
print(llm_response_formatted)
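The nested indexing into the response can be wrapped in a small helper that fails with a clear message when no choice is returned. This is a sketch: `extract_reply` is a hypothetical name, and `mock_response` is a hand-written stand-in shaped like the dict indexed in the cell above, not real model output.

```python
def extract_reply(response):
    """Return the assistant text from a chat-completion style response dict."""
    choices = response.get("choices") or []
    if not choices:
        raise ValueError("response contains no choices")
    return choices[0]["message"]["content"].strip()


# Hypothetical stand-in for what create_chat_completion returns.
mock_response = {
    "choices": [
        {
            "index": 0,
            "message": {"role": "assistant", "content": "  Why did the LLM...  "},
            "finish_reason": "stop",
        }
    ]
}
print(extract_reply(mock_response))
```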

**Using ctransformers (the ctransformers library has not been updated in the last 6+ months, so I don't recommend using it right now, as it does not support some of the newer models and frameworks)**

In [None]:
# Quote the requirement so the shell does not treat `>=` as an output redirect
! pip install "ctransformers[cuda]>=0.2.24"
! pip install -U sentence-transformers
! pip install transformers huggingface_hub torch

In [None]:
from ctransformers import AutoModelForCausalLM
from transformers import pipeline, AutoModel
from huggingface_hub import login, HfApi, create_repo
from transformers import AutoTokenizer, pipeline
from sentence_transformers import SentenceTransformer
from google.colab import userdata, drive
import os

In [None]:
HF_TOKEN = userdata.get('HUGGING_FACE_API_KEY')
login(token=HF_TOKEN)
api = HfApi(token=HF_TOKEN)
username = api.whoami()['name']
print(username)

In [None]:
model_mistral = AutoModelForCausalLM.from_pretrained(
    "alokabhishek/Mistral-7B-Instruct-v0.2-GGUF",
    model_file="mistral-7b-instruct-v0.2.Q4_K_M.gguf",
    model_type="mistral", gpu_layers=50, hf=True
)

In [None]:
tokenizer_mistral = AutoTokenizer.from_pretrained(
    "alokabhishek/Mistral-7B-Instruct-v0.2-GGUF", use_fast=True
)

In [None]:
# Create a pipeline
pipe_mistral = pipeline(model=model_mistral, tokenizer=tokenizer_mistral, task='text-generation')

In [None]:
prompt_mistral = "Tell me a funny joke about Large Language Models meeting a Blackhole in an intergalactic Bar."

In [None]:
output_mistral = pipe_mistral(prompt_mistral, max_new_tokens=512)

In [None]:
print(output_mistral[0]["generated_text"])

# **Quantize models using bitsandbytes (bnb)**



- Hugging Face Blog post on 4-bit quantization using bitsandbytes: [Making LLMs even more accessible with bitsandbytes, 4-bit quantization and QLoRA](https://huggingface.co/blog/4bit-transformers-bitsandbytes)

- bitsandbytes github repo: [bitsandbytes github repo](https://github.com/TimDettmers/bitsandbytes)


In [None]:
!pip install -q -U bitsandbytes accelerate torch huggingface_hub
!pip install -q -U git+https://github.com/huggingface/transformers.git
!pip install -q -U git+https://github.com/huggingface/peft.git

In [None]:
from google.colab import userdata, drive
import torch
import os
from torch import bfloat16
from huggingface_hub import login, HfApi, create_repo
from transformers import AutoTokenizer, AutoModelForCausalLM, pipeline, BitsAndBytesConfig

In [None]:
HF_TOKEN = userdata.get('HUGGING_FACE_API_KEY')
login(token=HF_TOKEN)
api = HfApi(token=HF_TOKEN)
username = api.whoami()['name']
print(username)

In [None]:
# Define the model ID for the desired model
model_id = "meta-llama/Llama-2-7b-chat-hf"

In [None]:
# Define the quantization configuration for the model
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_use_double_quant=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.bfloat16
)
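The `bnb_4bit_use_double_quant` flag quantizes the per-block scaling constants themselves. The arithmetic below sketches why that helps, assuming the block sizes reported in the QLoRA paper (64 weights per block, 256 scales per second-level block); the defaults in a given bitsandbytes release may differ, and the function names are hypothetical.

```python
def scale_overhead_bits(block_size, scale_bits):
    """Extra bits per weight when each block stores one scale."""
    return scale_bits / block_size


def double_quant_overhead_bits(block_size, scale_bits, second_block_size, second_scale_bits):
    """Extra bits per weight when the scales are themselves quantized in blocks."""
    first_level = scale_bits / block_size
    second_level = second_scale_bits / (block_size * second_block_size)
    return first_level + second_level


# Assumed figures: 32-bit scales per 64 weights, versus 8-bit scales
# with one 32-bit second-level scale per 256 first-level scales.
plain = scale_overhead_bits(64, 32)
double = double_quant_overhead_bits(64, 8, 256, 32)
print(f"plain: {plain:.3f} bits/weight, double quant: {double:.3f} bits/weight")
```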

In [None]:
# Load the tokenizer associated with the model
tokenizer = AutoTokenizer.from_pretrained(model_id)

In [None]:
# Load the model using the model ID and quantization configuration
model_4bit = AutoModelForCausalLM.from_pretrained(model_id, quantization_config=bnb_config)

In [None]:
quant_name =  model_id.split("/")[-1] + "-bnb-4bit"
print(quant_name)
quant_repo_id = f"{username}/{quant_name}"
print(quant_repo_id)

In [None]:
# Create an empty repo
api.create_repo(
    repo_id = quant_repo_id,
    repo_type="model",
    exist_ok=True,
    token=HF_TOKEN,
    private=True
)

In [None]:
tokenizer.save_pretrained(quant_name)

In [None]:
model_4bit.push_to_hub(quant_name, token=True, use_safetensors=True)

In [None]:
api.upload_folder(folder_path=quant_name,repo_id=quant_repo_id,token=HF_TOKEN)

**Inferencing bnb quantized model from HF hub**

In [None]:
!pip install -q -U bitsandbytes accelerate torch huggingface_hub
!pip install -q -U git+https://github.com/huggingface/transformers.git
!pip install -q -U git+https://github.com/huggingface/peft.git
!pip install flash-attn --no-build-isolation

In [None]:
from google.colab import userdata, drive
import torch
import os
from torch import bfloat16
from huggingface_hub import login, HfApi, create_repo
from transformers import AutoTokenizer, AutoModelForCausalLM, pipeline, BitsAndBytesConfig, LlamaForCausalLM

In [None]:
HF_TOKEN = userdata.get('HUGGING_FACE_API_KEY')
login(token=HF_TOKEN)
api = HfApi(token=HF_TOKEN)
username = api.whoami()['name']
print(username)

In [None]:
model_id_falcon = "alokabhishek/falcon-7b-instruct-bnb-4bit"

In [None]:
tokenizer_falcon = AutoTokenizer.from_pretrained(model_id_falcon, use_fast=True)

In [None]:
model_falcon = AutoModelForCausalLM.from_pretrained(
    model_id_falcon,
    device_map="auto"
)

In [None]:
# Create a pipeline
pipe_falcon = pipeline(model=model_falcon, tokenizer=tokenizer_falcon, task='text-generation')

In [None]:
prompt_falcon = "Tell me a funny joke about Large Language Models meeting a Blackhole in an intergalactic Bar."

In [None]:
output_falcon = pipe_falcon(prompt_falcon, max_new_tokens=512)

In [None]:
print(output_falcon[0]["generated_text"])

# **Quantize models using ExLlamaV2**



- ExLlamaV2 github repo: [ExLlamaV2 github repo](https://github.com/turboderp/exllamav2)

In [None]:
! git clone https://github.com/turboderp/exllamav2

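In [None]:
# Each `!` command runs in its own subshell, so a separate `! cd exllamav2` cell
# would not persist. Install from the notebook's working directory instead.
! pip install -r exllamav2/requirements.txt

In [None]:
! pip install ./exllamav2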

In [None]:
from google.colab import userdata, drive
from torch import bfloat16
from transformers import AutoTokenizer, AutoModelForCausalLM, pipeline
from huggingface_hub import login, HfApi, create_repo
import locale
import torch
import os

In [None]:
HF_TOKEN = userdata.get('HUGGING_FACE_API_KEY')
login(token=HF_TOKEN)
api = HfApi(token=HF_TOKEN)
username = api.whoami()['name']
print(username)

In [None]:
# Define the model ID for the desired model
model_id = "mistralai/Mistral-7B-Instruct-v0.2"
BPW = 5.0

In [None]:
model_name =  model_id.split("/")[-1]
print(model_name)
quant_name =  model_id.split("/")[-1] + f"-{BPW:.1f}-bpw-exl2"
print(quant_name)
quant_repo_id = f"{username}/{quant_name}"
print(quant_repo_id)

In [None]:
!git-lfs install

In [None]:
!git clone https://{username}:{HF_TOKEN}@huggingface.co/{model_id}

In [None]:
!mv {model_name} base_model

In [None]:
!rm -rf /content/base_model/*.bin

In [None]:
# Download dataset
!wget https://huggingface.co/datasets/wikitext/resolve/9a9e482b5987f9d25b3a9b2883fc6cc9fd8071b3/wikitext-103-v1/wikitext-test.parquet

In [None]:
!mkdir quant

In [None]:
! cp /content/base_model/config.json /content/quant/config.json

In [None]:
# Quantize model

!python exllamav2/convert.py \
    -i base_model \
    -o quant \
    -c wikitext-test.parquet \
    -b {BPW}

In [None]:
# remove out_tensor dir
!rm -rf /content/quant/out_tensor

In [None]:
# Copy files
!rsync -av --exclude='*.safetensors' --exclude='.*' ./base_model/ ./quant/

In [None]:
# Create an empty repo
api.create_repo(
    repo_id = quant_repo_id,
    repo_type="model",
    exist_ok=True,
    token=HF_TOKEN,
    private=True
)

In [None]:
# Upload files
api.upload_folder(
    folder_path="quant",
    repo_id=quant_repo_id,
    token=HF_TOKEN
)

**Inferencing ExLlamaV2 Quantized Models**

In [None]:
# Define the model ID for the desired model
model_id = "alokabhishek/Llama-2-7b-chat-hf-5.0-bpw-exl2"

In [None]:
model_name =  model_id.split("/")[-1]
print(model_name)

In [None]:
!git clone https://{username}:{HF_TOKEN}@huggingface.co/{model_id} {model_name}

In [None]:
# exllamav2 was installed with pip earlier, so no sys.path changes are needed
# (`__file__` is not defined inside a notebook and would raise a NameError).
from exllamav2 import (
    ExLlamaV2,
    ExLlamaV2Config,
    ExLlamaV2Cache,
    ExLlamaV2Tokenizer,
)
from exllamav2.generator import ExLlamaV2BaseGenerator, ExLlamaV2Sampler

In [None]:
model_directory = f"./{model_name}/"  # folder created by the git clone above

In [None]:
config = ExLlamaV2Config(model_directory)
model = ExLlamaV2(config)
cache = ExLlamaV2Cache(model, lazy=True)
model.load_autosplit(cache)
tokenizer = ExLlamaV2Tokenizer(config)

In [None]:
# Initialize generator
generator = ExLlamaV2BaseGenerator(model, cache, tokenizer)

In [None]:
settings = ExLlamaV2Sampler.Settings()
settings.temperature = 0.85
settings.top_k = 50
settings.top_p = 0.8
settings.token_repetition_penalty = 1.01
settings.disallow_tokens(tokenizer, [tokenizer.eos_token_id])
max_new_tokens = 512

In [None]:
prompt = "Tell me a funny joke about Large Language Models meeting a Blackhole in an intergalactic Bar."

In [None]:
generator.warmup()

In [None]:
output = generator.generate_simple(prompt, settings, max_new_tokens, seed=1234)

In [None]:
print(output)

# **Quantize models using AutoAWQ**


- AutoAWQ github repo: [AutoAWQ github repo](https://github.com/casper-hansen/AutoAWQ/tree/main)
- MIT-han-lab llm-awq github repo:  [MIT-han-lab llm-awq github repo](https://github.com/mit-han-lab/llm-awq/tree/main)

In [None]:
!pip install autoawq

In [None]:
import torch
import os
from google.colab import userdata, drive
from torch import bfloat16
from huggingface_hub import login, HfApi, create_repo
from transformers import AutoTokenizer, AwqConfig, AutoConfig
from awq import AutoAWQForCausalLM
from datasets import load_dataset

In [None]:
HF_TOKEN = userdata.get('HUGGING_FACE_API_KEY')
login(token=HF_TOKEN)
api = HfApi(token=HF_TOKEN)
username = api.whoami()['name']
print(username)

In [None]:
# Define the model ID for the desired model
model_id = "meta-llama/Llama-2-7b-chat-hf"

In [None]:
model_name =  model_id.split("/")[-1]
print(model_name)
quant_name =  model_id.split("/")[-1] + "-4bit-AWQ"
print(quant_name)
quant_repo_id = f"{username}/{quant_name}"
print(quant_repo_id)

In [None]:
quant_config = { "zero_point": True, "q_group_size": 128, "w_bit": 4, "version": "GEMM" }
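The `zero_point`, `q_group_size`, and `w_bit` fields describe group-wise asymmetric quantization. The sketch below illustrates only that storage scheme in plain Python; it deliberately leaves out AWQ's activation-aware scaling, and all function names and the toy weights are assumptions.

```python
import math


def quantize_group(group, w_bit=4):
    """Asymmetric quantization of one group to integers in [0, 2**w_bit - 1]."""
    qmax = 2 ** w_bit - 1
    lo, hi = min(group), max(group)
    scale = (hi - lo) / qmax if hi > lo else 1.0
    zero_point = min(qmax, max(0, round(-lo / scale)))
    q = [min(qmax, max(0, round(w / scale) + zero_point)) for w in group]
    return q, scale, zero_point


def dequantize_group(q, scale, zero_point):
    """Shift by the zero point, then rescale back to floats."""
    return [(v - zero_point) * scale for v in q]


def quantize_rowwise(weights, q_group_size=128, w_bit=4):
    """Split a flat weight list into groups, each with its own scale and zero point."""
    return [
        quantize_group(weights[start:start + q_group_size], w_bit)
        for start in range(0, len(weights), q_group_size)
    ]


weights = [0.1 * math.sin(i) for i in range(256)]  # toy values, two groups of 128
for q, scale, zero_point in quantize_rowwise(weights):
    print(f"scale={scale:.5f} zero_point={zero_point} first_ints={q[:8]}")
```

Smaller groups track local weight ranges more closely at the cost of storing more scales and zero points.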

In [None]:
# Load model
model = AutoAWQForCausalLM.from_pretrained(model_id)

In [None]:
tokenizer = AutoTokenizer.from_pretrained(model_id, trust_remote_code=True)

In [None]:
# Quantize
model.quantize(tokenizer, quant_config=quant_config)

In [None]:
# Save quantized model
model.save_quantized(quant_name)

In [None]:
# Save the tokenizer alongside the quantized model
tokenizer.save_pretrained(quant_name)

In [None]:
# Create an empty repo
api.create_repo(
    repo_id = quant_repo_id,
    repo_type="model",
    exist_ok=True,
    token=HF_TOKEN,
    private=True
)

In [None]:
# Upload files
api.upload_folder(
    folder_path=quant_name,
    repo_id=quant_repo_id,
    token=HF_TOKEN
)

**Inferencing AWQ Quantized Models**

In [None]:
!pip install autoawq
!pip install accelerate

In [None]:
from google.colab import userdata, drive
import torch
import os
from torch import bfloat16
from huggingface_hub import login, HfApi, create_repo
from transformers import AutoTokenizer, pipeline
from awq import AutoAWQForCausalLM

In [None]:
HF_TOKEN = userdata.get('HUGGING_FACE_API_KEY')
login(token=HF_TOKEN)
api = HfApi(token=HF_TOKEN)
username = api.whoami()['name']
print(username)

In [None]:
model_id_llama = "alokabhishek/Mistral-7B-Instruct-v0.2-4bit-AWQ"

In [None]:
tokenizer_llama = AutoTokenizer.from_pretrained(model_id_llama, use_fast=True)

In [None]:
model_llama = AutoAWQForCausalLM.from_quantized(model_id_llama, fuse_layers=True, trust_remote_code=False, safetensors=True)

In [None]:
prompt_llama = "Tell me a funny joke about Large Language Models meeting a Blackhole in an intergalactic Bar."

In [None]:
formatted_prompt = f'''<s> [INST] You are a helpful, and fun loving assistant. Always answer as jestfully as possible.[/INST] </s> [INST] {prompt_llama}[/INST]'''

In [None]:
tokens = tokenizer_llama(formatted_prompt, return_tensors="pt").input_ids.cuda()

In [None]:
generation_output = model_llama.generate(tokens, do_sample=True, temperature=1.7, top_p=0.95, top_k=40, max_new_tokens=512)

In [None]:
print(tokenizer_llama.decode(generation_output[0], skip_special_tokens=True))

# **Quantize models using GPTQ**

**GPTQ Quantization 4-bit LLM**

In [None]:
!BUILD_CUDA_EXT=0 pip install -q auto-gptq transformers

In [None]:
import random
import locale
import torch
import os
from google.colab import userdata, drive
from torch import bfloat16
from transformers import AutoTokenizer, AutoModelForCausalLM, pipeline
from huggingface_hub import login, HfApi, create_repo
from auto_gptq import AutoGPTQForCausalLM, BaseQuantizeConfig
from datasets import load_dataset, concatenate_datasets


In [None]:
HF_TOKEN = userdata.get('HUGGING_FACE_API_KEY')
login(token=HF_TOKEN)
api = HfApi(token=HF_TOKEN)
username = api.whoami()['name']
print(username)

In [None]:
# Define the model ID for the desired model
model_id = "mistralai/Mistral-7B-Instruct-v0.2"

In [None]:
model_name =  model_id.split("/")[-1]
print(model_name)
quant_name =  model_id.split("/")[-1] + "-GPTQ"
print(quant_name)
quant_repo_id = f"{username}/{quant_name}"
print(quant_repo_id)

In [None]:
# Load quantize config, model and tokenizer
quantize_config = BaseQuantizeConfig(
    bits=4,
    group_size=128,
    damp_percent=0.01,
    desc_act=False,
)


In [None]:
model = AutoGPTQForCausalLM.from_pretrained(model_id, quantize_config)
tokenizer = AutoTokenizer.from_pretrained(model_id)

In [None]:
# Define the base parameters
base_samples = 1024
file_numbers = [1, 6, 10]  # Specify the file numbers to include

# Generate the list of data files based on the specified file numbers
data_files = [f"en/c4-train.{num:05d}-of-01024.json.gz" for num in file_numbers]

# Calculate the total number of samples
n_samples = base_samples * len(file_numbers)

# Load and concatenate the datasets
datasets = []
for file in data_files:
    dataset = load_dataset("allenai/c4", data_files=file, split=f"train[:{base_samples*5}]")
    datasets.append(dataset)
# concatenate dataset
data = concatenate_datasets(datasets)

In [None]:
# Tokenize the concatenated data
tokenized_data = tokenizer("\n\n".join(data['text']), return_tensors='pt')

In [None]:
# Define a maximum sequence length
max_length = 1024  # Adjust based on your requirements

examples_ids = []
for _ in range(n_samples):
    # Ensure the random start index selection is within bounds
    i = random.randint(0, max(tokenized_data.input_ids.shape[1] - max_length - 1, 0))
    j = i + max_length
    input_ids = tokenized_data.input_ids[:, i:j]
    attention_mask = torch.ones_like(input_ids)
    examples_ids.append({'input_ids': input_ids, 'attention_mask': attention_mask})
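The calibration loop above draws fixed-length windows at random offsets from one long token sequence. Here is the same idea on plain Python lists, so it can be inspected without loading a tokenizer. `sample_windows` is a hypothetical helper, the token ids are stand-ins, and the upper bound on the start index differs by one from the cell above while still staying in range.

```python
import random


def sample_windows(token_ids, n_samples, max_length, seed=0):
    """Draw fixed-length windows at random offsets from one long token sequence."""
    rng = random.Random(seed)
    last_start = max(len(token_ids) - max_length, 0)
    windows = []
    for _ in range(n_samples):
        start = rng.randint(0, last_start)  # inclusive on both ends
        window = token_ids[start:start + max_length]
        windows.append({"input_ids": window, "attention_mask": [1] * len(window)})
    return windows


toy_tokens = list(range(10_000))  # stand-in for real token ids
examples = sample_windows(toy_tokens, n_samples=4, max_length=1024)
print([ex["input_ids"][0] for ex in examples])
```

Seeding the generator makes the calibration set reproducible between runs.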

In [None]:
# Quantize with GPTQ
model.quantize(
    examples_ids,
    batch_size=1,
    use_triton=True,
)

In [None]:
# Save model and tokenizer
model.save_quantized(quant_name, use_safetensors=True)

In [None]:
tokenizer.save_pretrained(quant_name)

In [None]:
# Create an empty repo
api.create_repo(
    repo_id = quant_repo_id,
    repo_type="model",
    exist_ok=True,
    token=HF_TOKEN,
    private=True
)

In [None]:
# Upload files
api.upload_folder(
    folder_path=quant_name,
    repo_id=quant_repo_id,
    token=HF_TOKEN
)