<a href="https://colab.research.google.com/github/arkeodev/nlp/blob/main/Quantization/04_PRILoRA_GPTQ_GGML_GGUF_AWQ.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

$$
\begin{array}{c}
\text{$\Large “Just\ improve\ yourself;\ that\ is\ the\ only\ thing\ you\ can\ do\ to\ better\ the\ world.”$} \\
{\text{{$\small Ludwig\ Wittgenstein$}}} \\
\end{array}
$$

# Further Techniques for LLM Quantization

## The Types of Quantization

1. **Training**:

   - **Quantization-Aware Training (QAT)**: During training, the model simulates the effects of quantization. This helps the model learn to be robust to the quantization process. Both weights and activations are quantized, but gradients are typically kept in higher precision to avoid significant loss in training effectiveness.

   - **Dynamic Range Quantization**: This method quantizes weights and activations dynamically, usually after each forward pass, to simulate the quantization effect during training.

2. **Inference**:

   - **Post-Training Quantization**: After training, the model is quantized. This involves:

     - **Static Quantization**: Calibrating the model with representative data to determine the optimal quantization parameters.
     
     - **Dynamic Quantization**: Quantizing weights and dynamically quantizing activations during inference. This is simpler and does not require extensive calibration.

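   - **QLoRA**: Strictly a fine-tuning method rather than an inference-only one: the frozen base-model weights are quantized to the 4-bit NF4 format, and small LoRA adapters are trained on top in higher precision. The quantized base weights give significant memory savings during both fine-tuning and inference.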
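
To make the difference between static and dynamic post-training quantization concrete, here is a minimal sketch in plain Python. The helper names (`quant_params`, `quantize`, `dequantize`, `round_trip_error`) and the sample values are illustrative assumptions, not part of any library: static quantization fixes the scale and zero-point once from calibration data, while dynamic quantization recomputes them from each incoming tensor.

```python
def quant_params(lo, hi, bits=8):
    """Return (scale, zero_point) mapping the real range [lo, hi] onto unsigned ints."""
    qmax = 2 ** bits - 1
    lo, hi = min(lo, 0.0), max(hi, 0.0)  # keep 0.0 inside the range
    scale = (hi - lo) / qmax or 1.0      # fall back to 1.0 for an all-zero range
    zero_point = round(-lo / scale)
    return scale, zero_point


def quantize(values, scale, zero_point, bits=8):
    """Map floats to integer codes, clamping to the representable range."""
    qmax = 2 ** bits - 1
    return [min(qmax, max(0, round(v / scale) + zero_point)) for v in values]


def dequantize(codes, scale, zero_point):
    """Map integer codes back to approximate floats."""
    return [(q - zero_point) * scale for q in codes]


def round_trip_error(values, scale, zero_point):
    """Largest absolute error after quantizing and dequantizing."""
    restored = dequantize(quantize(values, scale, zero_point), scale, zero_point)
    return max(abs(v - r) for v, r in zip(values, restored))


# Static: parameters are fixed once from (hypothetical) calibration data.
calibration = [-1.0, 0.5, 2.0]
static_scale, static_zp = quant_params(min(calibration), max(calibration))

# Dynamic: parameters are recomputed from the tensor seen at inference time.
activations = [-0.2, 0.1, 0.4]
dynamic_scale, dynamic_zp = quant_params(min(activations), max(activations))

static_err = round_trip_error(activations, static_scale, static_zp)
dynamic_err = round_trip_error(activations, dynamic_scale, dynamic_zp)
print(f"static error:  {static_err:.5f}")
print(f"dynamic error: {dynamic_err:.5f}")
```

In this example the dynamic parameters fit the narrow activation range more tightly, so the round-trip error is smaller; the price in practice is the extra work of computing the range at inference time.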

## TheBloke

Quantizing and sharding a model (splitting it into smaller chunks) from scratch every time a new model is released is not an easy task. Fortunately, many models have already been sharded and quantized for us. In particular, [TheBloke](https://huggingface.co/TheBloke) is a Hugging Face user who publishes a large number of ready-to-use quantized models.

<figure>
    <img src="https://raw.githubusercontent.com/arkeodev/nlp/main/Quantization/images/thebloke.png" width="600" height="600">
    <figcaption>Models already sharded and quantized</figcaption>
</figure>

## PRILoRA

**Pruned and Rank-Increasing Low-Rank Adaptation (PRILoRA)** enhances LoRA by increasing efficiency through two mechanisms: linearly increasing ranks and ongoing importance-based pruning.

1. **Linear Rank Increase**: PRILoRA increases the rank linearly across layers, starting with a low rank and increasing it for each subsequent layer.

   Neural networks, especially deep ones, process information hierarchically. Lower layers often capture more general features, while higher layers capture more specific features. Starting with a low rank and increasing it in higher layers aligns with this hierarchical nature. Lower layers might not need as much capacity (low rank) to represent general features, whereas higher layers require more capacity (high rank) for complex, specific features.

2. **A-Weight Pruning**: It prunes the least significant weights in the A matrix based on an importance matrix, which reduces memory requirements and fine-tuning time.

   Neural networks can be memory-intensive, particularly when dealing with large models. Pruning the A matrix by removing the least significant weights helps reduce memory consumption, making the model more efficient and deployable on resource-constrained devices.

3. **Importance Matrix**: An importance matrix quantifies the significance of each weight in the A matrix, that is, how crucial each weight is for the model's performance.

   The importance of each weight can be calculated using various methods, such as:
   - **Magnitude-Based Methods:** Weights with smaller magnitudes are often considered less important.
   - **Gradient-Based Methods:** Weights whose gradients are small have little effect on the loss and might be deemed less significant.
   - **Saliency Scores:** Calculated based on how much the model's output is affected by changes in a particular weight.

   The importance matrix is used to guide the pruning process. Weights that are deemed less important (e.g., those with lower scores in the importance matrix) are pruned first. This targeted pruning ensures that the most critical parameters are retained, minimizing the impact on model performance.
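
As a rough illustration of importance-guided pruning, the sketch below keeps a running importance score per weight and zeroes the lowest-scoring entries. The scoring rule used here (weight magnitude times an input norm, smoothed with an exponential moving average) and the names `update_importance` and `prune_lowest` are assumptions made for this example; the source does not specify PRILoRA's exact formula.

```python
def update_importance(importance, weights, input_norms, beta=0.9):
    """Update an exponential moving average of |weight| * input norm, per weight.

    weights and importance are rank x in_features lists of lists;
    input_norms holds one (hypothetical) statistic per input feature.
    """
    for r, row in enumerate(weights):
        for c, w in enumerate(row):
            score = abs(w) * input_norms[c]
            importance[r][c] = beta * importance[r][c] + (1 - beta) * score
    return importance


def prune_lowest(weights, importance, k):
    """Zero the k weights with the smallest importance scores (in place)."""
    ranked = sorted(
        (importance[r][c], r, c)
        for r in range(len(weights))
        for c in range(len(weights[0]))
    )
    for _, r, c in ranked[:k]:
        weights[r][c] = 0.0
    return weights


# Toy A matrix (rank 2, 2 input features) and made-up input statistics.
a_matrix = [[0.5, -0.01], [0.02, -0.8]]
input_norms = [1.0, 2.0]
importance = [[0.0, 0.0], [0.0, 0.0]]

importance = update_importance(importance, a_matrix, input_norms)
a_matrix = prune_lowest(a_matrix, importance, k=2)
print(a_matrix)
```

Because the score is accumulated over training steps, a weight that is only briefly small is less likely to be pruned than one that stays unimportant throughout.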

In [1]:
import torch
import torch.nn as nn

# Define a function to increase rank linearly across layers
def increase_rank_linearly(layers, start_rank, end_rank):
    """
    Generate ranks linearly increasing from start_rank to end_rank across the specified number of layers.

    Args:
    layers (int): Number of layers.
    start_rank (int): Initial rank at the first layer.
    end_rank (int): Final rank at the last layer.

    Returns:
    torch.Tensor: A tensor containing the ranks for each layer.
    """
    ranks = torch.linspace(start_rank, end_rank, steps=layers).int()
    return ranks

# Define a function to prune least significant weights
def prune_weights(matrix, importance_matrix, prune_step):
    """
    Prune the least significant weights in the matrix based on the importance matrix.

    Args:
    matrix (torch.Tensor): The weight matrix to be pruned.
    importance_matrix (torch.Tensor): The matrix indicating the importance of each weight.
    prune_step (int): The number of weights to prune in each step.

    Returns:
    torch.Tensor: The pruned weight matrix.
    """
    for _ in range(prune_step):
        # Find the least significant weight (smallest importance)
        min_importance, indices = torch.min(importance_matrix.view(-1), dim=0)
        # Convert the flat index back to 2D index
        index = torch.tensor([indices // importance_matrix.size(1), indices % importance_matrix.size(1)])
        # Prune (set to zero) the least significant weight
        matrix[index[0], index[1]] = 0
        # Update the importance matrix to avoid pruning the same weight again
        importance_matrix[index[0], index[1]] = float('inf')
    return matrix

# Define a simple neural network for demonstration
class SimpleNN(nn.Module):
    def __init__(self, input_size, output_size, layer_ranks):
        super(SimpleNN, self).__init__()
        self.layers = nn.ModuleList()
        for rank in layer_ranks:
            rank = int(rank)  # nn.Linear expects plain ints, not 0-dim tensors
            self.layers.append(nn.Linear(input_size, rank))
            input_size = rank
        self.layers.append(nn.Linear(input_size, output_size))

    def forward(self, x):
        for layer in self.layers[:-1]:
            x = torch.relu(layer(x))
        x = self.layers[-1](x)
        return x

# Example usage of the SimpleNN model with PRILoRA
layers = 12
start_rank = 4
end_rank = 12

# Generate linearly increasing ranks for each layer
ranks = increase_rank_linearly(layers, start_rank, end_rank)
print("Ranks for each layer:", ranks)

# Define input and output sizes
input_size = 16
output_size = 4

# Initialize the model
model = SimpleNN(input_size, output_size, ranks)
print("Initialized SimpleNN model with PRILoRA:")

# Display model architecture
print(model)

# Example input tensor
input_tensor = torch.randn(1, input_size)

# Forward pass through the model
output_tensor = model(input_tensor)
print("Output of the model before pruning:", output_tensor)

# Example weights and importance matrix for pruning
weights = model.layers[0].weight.data
importance_matrix = torch.rand(weights.size())
print("Original Weights (Layer 1):", weights)
print("Importance Matrix (Layer 1):", importance_matrix)

# Prune weights based on importance
pruned_weights = prune_weights(weights, importance_matrix, prune_step=10)
print("Pruned Weights (Layer 1):", pruned_weights)

# Reassign the pruned weights back to the model
model.layers[0].weight.data = pruned_weights

# Forward pass through the model after pruning
output_tensor_after_pruning = model(input_tensor)
print("Output of the model after pruning:", output_tensor_after_pruning)

# Explanation of PRILoRA
print("\nPRILoRA Explanation:")
print("\n1. Linear Rank Increase: Ranks for each layer are linearly increased from", start_rank, "to", end_rank)
print("   This allows lower layers to use fewer parameters while higher layers can use more parameters.")
print("\n2. A-Weight Pruning: Least significant weights in the A matrix are pruned based on an importance matrix.")
print("   This reduces memory requirements and fine-tuning time while maintaining model performance.")

Ranks for each layer: tensor([ 4,  4,  5,  6,  6,  7,  8,  9,  9, 10, 11, 12], dtype=torch.int32)
Initialized SimpleNN model with PRILoRA:
SimpleNN(
  (layers): ModuleList(
    (0): Linear(in_features=16, out_features=4, bias=True)
    (1): Linear(in_features=4, out_features=4, bias=True)
    (2): Linear(in_features=4, out_features=5, bias=True)
    (3): Linear(in_features=5, out_features=6, bias=True)
    (4): Linear(in_features=6, out_features=6, bias=True)
    (5): Linear(in_features=6, out_features=7, bias=True)
    (6): Linear(in_features=7, out_features=8, bias=True)
    (7): Linear(in_features=8, out_features=9, bias=True)
    (8): Linear(in_features=9, out_features=9, bias=True)
    (9): Linear(in_features=9, out_features=10, bias=True)
    (10): Linear(in_features=10, out_features=11, bias=True)
    (11): Linear(in_features=11, out_features=12, bias=True)
    (12): Linear(in_features=12, out_features=4, bias=True)
  )
)
Output of the model before pruning: tensor([[ 0.0720, -0.

## GPTQ

Generative Pre-trained Transformer Quantization (GPTQ) is a technique that speeds up inference and reduces the memory footprint of transformer-based models such as GPT by quantizing the model parameters. GPTQ is a Post-Training Quantization (PTQ) method, which makes it particularly interesting for massive models, where full training or even fine-tuning can be very expensive.

Specifically, GPTQ adopts a mixed int4/fp16 quantization scheme where weights are quantized as int4 while activations remain in float16. During inference, weights are dequantized on the fly and the actual compute is performed in float16.
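
The storage side of this scheme can be sketched in a few lines of plain Python. This is a simplified illustration, not AutoGPTQ's actual memory layout: the function names and the "low nibble first" packing order are assumptions. Two signed 4-bit codes share one byte, and the floating-point values are recovered on the fly by multiplying with a scale.

```python
def quantize_int4(weights):
    """Symmetric 4-bit quantization: integer codes in [-8, 7] plus one float scale."""
    scale = max(abs(w) for w in weights) / 7 or 1.0
    codes = [max(-8, min(7, round(w / scale))) for w in weights]
    return codes, scale


def pack_int4(codes):
    """Store two signed 4-bit codes per byte (low nibble first)."""
    if len(codes) % 2:
        codes = codes + [0]  # pad to an even count
    return bytes(
        (codes[i] & 0xF) | ((codes[i + 1] & 0xF) << 4)
        for i in range(0, len(codes), 2)
    )


def unpack_int4(packed, count):
    """Recover the signed 4-bit codes from the packed bytes."""
    codes = []
    for byte in packed:
        for nibble in (byte & 0xF, byte >> 4):
            codes.append(nibble - 16 if nibble >= 8 else nibble)
    return codes[:count]


def dequantize(packed, scale, count):
    """Rebuild approximate float weights at inference time."""
    return [code * scale for code in unpack_int4(packed, count)]


weights = [0.7, -0.3, 0.1, -0.7]
codes, scale = quantize_int4(weights)
packed = pack_int4(codes)
restored = dequantize(packed, scale, len(weights))
print(len(packed), "bytes for", len(weights), "weights")
print(restored)
```

The four weights occupy two bytes instead of the eight they would need in float16, which is where the roughly fourfold memory reduction comes from (the scale adds a small overhead).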

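GPTQ is a layer-wise quantization method: for each layer, it chooses quantized weights that minimize the mean squared error (MSE) between the layer's original output and its quantized output on calibration data.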
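
Written out, the layer-wise objective takes the following form, where $W$ is the layer's original weight matrix, $\widehat{W}$ its quantized counterpart, and $X$ the layer inputs collected from the calibration data (the symbols are introduced here for illustration):

$$
\widehat{W}^{*} = \underset{\widehat{W}}{\arg\min}\ \lVert W X - \widehat{W} X \rVert_2^2
$$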

1. **Lazy Batch Updating**:
   - Instead of updating the whole weight matrix after every quantized column, GPTQ processes the columns in blocks and defers most of the updates.
   - **Process**:
     - The algorithm starts from a Cholesky decomposition of the inverse Hessian, the matrix that determines how the remaining weights should be adjusted to compensate for each quantization error.
     - It then loops over the weight matrix in blocks of columns (128 at a time).
     - For each column in a block, it quantizes the weights, computes the quantization error, and updates the not-yet-quantized weights in the same block to compensate.
     - Once a block is finished, the accumulated error is applied to all remaining columns of the matrix in a single update.

   Each layer's weight matrix goes through this procedure in turn, with the Hessian estimated from a small calibration dataset. When every layer has been processed, the quantized layers together form the quantized model.

2. **Mixed INT4/FP16 Quantization**:
  - **INT4 (4-bit integers)**: Model weights are quantized to 4-bit integers. This significantly reduces the memory and computational requirements.
  - **FP16 (16-bit floating point)**: Activations (the intermediate outputs of the network) remain in 16-bit floating point format. This ensures that during inference, the model maintains a high level of precision and accuracy.
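
The quantize-and-compensate step at the heart of GPTQ can be sketched for a single weight row in plain Python. This is a deliberately simplified version under stated assumptions: it uses a plain inverse Hessian that is updated after each column instead of the Cholesky form, it does not model the 128-column lazy batching, and the tiny Hessian, the weights, and the function names are made up for the example.

```python
def invert(matrix):
    """Gauss-Jordan inverse of a small positive-definite matrix (no pivoting)."""
    n = len(matrix)
    aug = [
        [float(v) for v in row] + [float(i == j) for j in range(n)]
        for i, row in enumerate(matrix)
    ]
    for col in range(n):
        pivot = aug[col][col]
        aug[col] = [v / pivot for v in aug[col]]
        for r in range(n):
            if r != col:
                factor = aug[r][col]
                aug[r] = [a - factor * b for a, b in zip(aug[r], aug[col])]
    return [row[n:] for row in aug]


def round_to_grid(w, scale):
    """Round a weight to the nearest multiple of the quantization step."""
    return round(w / scale) * scale


def gptq_quantize_row(weights, hessian, scale):
    """Quantize one row left to right, pushing each error onto later weights."""
    n = len(weights)
    h_inv = invert(hessian)
    w = list(weights)
    quantized = [0.0] * n
    for i in range(n):
        quantized[i] = round_to_grid(w[i], scale)
        err = (w[i] - quantized[i]) / h_inv[i][i]
        for j in range(i + 1, n):
            w[j] -= err * h_inv[i][j]  # compensate on not-yet-quantized weights
        # Remove weight i from the inverse Hessian before moving on.
        pivot = h_inv[i][i]
        col = [h_inv[r][i] for r in range(n)]
        row = h_inv[i][:]
        for r in range(n):
            for c in range(n):
                h_inv[r][c] -= col[r] * row[c] / pivot
    return quantized


def output_error(weights, quantized, hessian):
    """Squared output error e^T H e, with H standing in for X X^T."""
    e = [w - q for w, q in zip(weights, quantized)]
    n = len(e)
    return sum(e[i] * hessian[i][j] * e[j] for i in range(n) for j in range(n))


hessian = [[2.0, 1.0], [1.0, 2.0]]  # two correlated input features
weights = [0.4, 0.4]
scale = 1.0                         # quantization grid of whole numbers

rtn = [round_to_grid(w, scale) for w in weights]
gptq = gptq_quantize_row(weights, hessian, scale)
print("round-to-nearest:", rtn, output_error(weights, rtn, hessian))    # error ~0.96
print("with compensation:", gptq, output_error(weights, gptq, hessian))  # error ~0.56
```

Rounding both weights down loses too much output, so the compensation step pushes the second weight up far enough that it rounds to 1 instead of 0, which lowers the output error.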

GPTQ has also been integrated into Hugging Face via the [AutoGPTQ](https://huggingface.co/blog/gptq-integration#a-gentle-summary-of-the-gptq-paper) library.

**Key differences between GPTQ and QLoRA**:

1. **Quantization Techniques**:
   - GPTQ primarily focuses on the quantization of weights to INT4 while keeping activations in FP16.
   - QLoRA combines quantization with low-rank adaptation: the frozen base weights are stored in 4-bit NF4, while the LoRA adapters and the activations stay in 16-bit precision.

2. **Adaptation Method**:
   - GPTQ does not inherently include low-rank adaptation; it focuses on batch-wise quantization and MSE minimization.
   - QLoRA combines low-rank matrix factorization with quantization, enabling efficient adaptation and fine-tuning.

3. **Application Scenarios**:
   - GPTQ is well-suited for scenarios where maintaining activation precision is critical, and the primary goal is to reduce model size through weight quantization.
   - QLoRA is designed for environments where fine-tuning pre-trained models with minimal computational resources is essential, leveraging both quantization and low-rank adaptation.
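
The structural difference can be sketched as a forward pass: QLoRA adds a trainable low-rank update on top of a frozen, quantized base weight, whereas GPTQ only has the quantized weight. The code below is a plain-Python illustration under assumptions: a uniform 4-bit round trip stands in for NF4 (which uses non-uniform levels), and the names, shapes, and `alpha` value are hypothetical.

```python
def matvec(matrix, vector):
    """Multiply a list-of-rows matrix by a vector."""
    return [sum(a * b for a, b in zip(row, vector)) for row in matrix]


def quantize_4bit(matrix):
    """Per-row symmetric 4-bit round trip (uniform stand-in for NF4)."""
    result = []
    for row in matrix:
        scale = max(abs(w) for w in row) / 7 or 1.0
        result.append([max(-8, min(7, round(w / scale))) * scale for w in row])
    return result


def qlora_forward(x, frozen_w, lora_a, lora_b, alpha=16):
    """y = W_q x + (alpha / r) * B (A x); only A and B would be trained."""
    rank = len(lora_a)
    base = matvec(frozen_w, x)
    update = matvec(lora_b, matvec(lora_a, x))
    return [b + (alpha / rank) * u for b, u in zip(base, update)]


x = [1.0, 2.0]
frozen_w = quantize_4bit([[0.31, -0.72], [0.05, 0.44]])  # out=2, in=2
lora_a = [[0.1, -0.2]]                                    # rank=1, in=2
lora_b = [[0.0], [0.0]]                                   # out=2, rank=1, zero init

y_start = qlora_forward(x, frozen_w, lora_a, lora_b)
lora_b = [[0.5], [-0.25]]                                 # pretend training moved B
y_tuned = qlora_forward(x, frozen_w, lora_a, lora_b)
print(y_start, y_tuned)
```

With `lora_b` initialized to zero the adapter contributes nothing, so fine-tuning starts from the quantized base model's behavior and only the small A and B matrices change.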

The following walkthrough applies GPTQ to a pretrained transformer model step by step:

### Quantizing a GPT-2 Model with AutoGPTQ

**1. Install Required Packages**

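First, install the necessary packages. GPU acceleration requires CUDA-enabled hardware. Note that the `BUILD_CUDA_EXT=0` flag used below disables compilation of AutoGPTQ's CUDA extensions, which is one of the possible causes listed in the CUDA warning printed when the model is reloaded in step 6.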

In [2]:
!BUILD_CUDA_EXT=0 pip install -q auto-gptq transformers

**2. Import Libraries and Define Configuration**

Next, import the required libraries and define the model and output directory.

In [4]:
import random
from auto_gptq import AutoGPTQForCausalLM, BaseQuantizeConfig
from datasets import load_dataset
import torch
from transformers import AutoTokenizer

# Define the model ID and output directory for the quantized model
model_id = "gpt2"
output_directory = model_id + "-GPTQ"



**3. Load the Model and Tokenizer**

Load the tokenizer using the `AutoTokenizer` class and the model with a specific quantization configuration.

In [5]:
# Define the quantization configuration
quantize_config = BaseQuantizeConfig(
    bits=4,           # Number of bits for quantization
    group_size=128,   # Number of weights that share one scale/zero-point
    damp_percent=0.01, # Damping added to the Hessian diagonal for numerical stability
    desc_act=False    # Do not reorder columns by decreasing activation size (act-order)
)

# Load the model with the quantization configuration
print("Loading model and tokenizer...")
model = AutoGPTQForCausalLM.from_pretrained(model_id, quantize_config)
tokenizer = AutoTokenizer.from_pretrained(model_id)
print("Model and tokenizer loaded.")

Loading model and tokenizer...
Model and tokenizer loaded.


**4. Prepare the Dataset**

Use the C4 dataset to generate samples for the quantization process. Tokenize the dataset and format the examples.

In [6]:
# Number of samples to use for quantization
n_samples = 1024

# Load the C4 dataset
print("Loading and tokenizing the C4 dataset...")
data = load_dataset("allenai/c4", data_files="en/c4-train.00001-of-01024.json.gz", split=f"train[:{n_samples*5}]")

# Tokenize the dataset
tokenized_data = tokenizer("\n\n".join(data['text']), return_tensors='pt')
print("Dataset tokenized.")

# Format tokenized examples
examples = []
for _ in range(n_samples):
    # Randomly select a segment from the tokenized data
    i = random.randint(0, tokenized_data.input_ids.shape[1] - tokenizer.model_max_length - 1)
    j = i + tokenizer.model_max_length

    # Extract input IDs and create attention mask
    input_ids = tokenized_data.input_ids[:, i:j]
    attention_mask = torch.ones_like(input_ids)

    # Append the example to the list
    examples.append({'input_ids': input_ids, 'attention_mask': attention_mask})

print(f"Formatted {n_samples} examples for quantization.")

Loading and tokenizing the C4 dataset...

Token indices sequence length is longer than the specified maximum sequence length for this model (2441065 > 1024). Running this sequence through the model will result in indexing errors


Dataset tokenized.
Formatted 1024 examples for quantization.


**5. Quantize the Model**

Start the quantization process with a batch size of 1, then save the quantized model and tokenizer.

In [7]:
# Quantize the model using GPTQ
print("Starting the quantization process...")
model.quantize(
    examples,
    batch_size=1,    # Batch size for quantization
    use_triton=True, # Use Triton for GPU acceleration
)

# Save the quantized model and tokenizer
print(f"Saving quantized model to {output_directory}...")
model.save_quantized(output_directory, use_safetensors=True)
tokenizer.save_pretrained(output_directory)
print("Quantized model and tokenizer saved.")

Starting the quantization process...


INFO - Start quantizing layer 1/12
INFO - Quantizing attn.c_attn in layer 1/12...
INFO - Quantizing attn.c_proj in layer 1/12...
INFO - Quantizing mlp.c_fc in layer 1/12...
INFO - Quantizing mlp.c_proj in layer 1/12...
INFO - Start quantizing layer 2/12
INFO - Quantizing attn.c_attn in layer 2/12...
INFO - Quantizing attn.c_proj in layer 2/12...
INFO - Quantizing mlp.c_fc in layer 2/12...
INFO - Qu

Saving quantized model to gpt2-GPTQ...
Quantized model and tokenizer saved.


**6. Reload and Test the Quantized Model**

Reload the quantized model and tokenizer, then test it with a text generation pipeline.

In [8]:
# Reload the quantized model and tokenizer
print("Reloading the quantized model and tokenizer...")
device = "cuda:0" if torch.cuda.is_available() else "cpu"
model = AutoGPTQForCausalLM.from_quantized(
    output_directory,
    device=device,
    use_triton=True,
    use_safetensors=True,
)
tokenizer = AutoTokenizer.from_pretrained(output_directory)
print("Quantized model and tokenizer reloaded.")

# Test the quantized model using a text generation pipeline
from transformers import pipeline

print("Testing the quantized model with text generation...")
generator = pipeline('text-generation', model=model, tokenizer=tokenizer)
result = generator("Just improve yourself", do_sample=True, max_length=50)[0]['generated_text']
print("Generated text:")
print(result)

1. You disabled CUDA extensions compilation by setting BUILD_CUDA_EXT=0 when install auto_gptq from source.
2. You are using pytorch without CUDA support.
3. CUDA and nvcc are not installed in your device.
INFO - The layer lm_head is not quantized.


Reloading the quantized model and tokenizer...
Quantized model and tokenizer reloaded.


The model 'GPT2GPTQForCausalLM' is not supported for text-generation. Supported models are ['BartForCausalLM', 'BertLMHeadModel', 'BertGenerationDecoder', 'BigBirdForCausalLM', 'BigBirdPegasusForCausalLM', 'BioGptForCausalLM', 'BlenderbotForCausalLM', 'BlenderbotSmallForCausalLM', 'BloomForCausalLM', 'CamembertForCausalLM', 'LlamaForCausalLM', 'CodeGenForCausalLM', 'CohereForCausalLM', 'CpmAntForCausalLM', 'CTRLLMHeadModel', 'Data2VecTextForCausalLM', 'DbrxForCausalLM', 'ElectraForCausalLM', 'ErnieForCausalLM', 'FalconForCausalLM', 'FuyuForCausalLM', 'GemmaForCausalLM', 'GitForCausalLM', 'GPT2LMHeadModel', 'GPT2LMHeadModel', 'GPTBigCodeForCausalLM', 'GPTNeoForCausalLM', 'GPTNeoXForCausalLM', 'GPTNeoXJapaneseForCausalLM', 'GPTJForCausalLM', 'JambaForCausalLM', 'JetMoeForCausalLM', 'LlamaForCausalLM', 'MambaForCausalLM', 'MarianForCausalLM', 'MBartForCausalLM', 'MegaForCausalLM', 'MegatronBertForCausalLM', 'MistralForCausalLM', 'MixtralForCausalLM', 'MptForCausalLM', 'MusicgenForCausalLM

Testing the quantized model with text generation...
Generated text:
Just improve yourself, just like the next generation."

While it remains to be seen the specifics in the upcoming season, that could be the story that the show's creators have wanted from the beginning.

The showrunners may not have to


## GGML/GGUF

**GGML** is a C tensor library written by Georgi Gerganov (the "GG" in the name) that underpins llama.cpp, and it also lent its name to the library's original model file format. **GGUF** (often expanded as "GPT-Generated Unified Format") is the file format that replaced it. Both store quantized large language models (LLMs) in a way that makes them accessible to users with limited hardware resources: inference runs efficiently on CPUs, with the option to offload some layers to a GPU for better performance.

### GGUF vs GGML

| Aspects                    | GGML                                                                                                                                   | GGUF                                                                                                     |
|----------------------------|---------------------------------------------------------------------------------------------------------------------------------------|----------------------------------------------------------------------------------------------------------|
| Basic                      | GGML is an obsolete format for creating quantized LLMs using the GGML tensor library.                                                  | GGUF is the successor of the GGML format that has better efficiency. It is also created with the GGML tensor library.   |
| Speed                      | Compared to GGUF, model loading and inference are slower.                                                                              | GGUF supports memory mapping (mmap), which shortens load time and speeds up inference.                   |
| Special Tokens             | Special tokens are not supported by GGML.                                                                                            | GGUF supports special tokens, which are useful for building effective prompts and for LLM fine-tuning.   |
| Support for Non-Llama Models | Non-llama models are not supported.                                                                                                  | GGUF format has extended compatibility with non-llama architecture models like Falcon, Bloom, Phi, Mistral, etc.   |
| Extensibility & Flexibility| GGML had extensibility issues where small changes in the base model used to result in breaking changes.                              | GGUF format has been designed to be more extensible & flexible allowing the addition of new features without breaking anything. |
| Ease of Use                | Compared to GGUF, the setup for GGML required more inputs from user and also had dependency on external libraries.                    | GGUF is much more user-friendly for setup, with not much dependency on external libraries.                |
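
Part of what makes GGUF self-describing is its fixed header: a 4-byte magic string, a format version, a tensor count, and a count of metadata key-value pairs, all little-endian. The sketch below parses such a header with Python's `struct` module. It builds a fake header in memory rather than reading a real file, the counts are made up, and the layout assumed here is the one used by GGUF version 2 and later.

```python
import struct

GGUF_MAGIC = b"GGUF"
HEADER_FORMAT = "<4sIQQ"  # magic, uint32 version, uint64 tensors, uint64 metadata KVs


def parse_gguf_header(data: bytes) -> dict:
    """Decode the fixed-size fields at the start of a GGUF file."""
    magic, version, tensor_count, kv_count = struct.unpack_from(HEADER_FORMAT, data, 0)
    if magic != GGUF_MAGIC:
        raise ValueError("not a GGUF file")
    return {
        "version": version,
        "tensor_count": tensor_count,
        "metadata_kv_count": kv_count,
    }


# Hypothetical header: version 3, 291 tensors, 24 metadata entries.
fake_header = struct.pack(HEADER_FORMAT, GGUF_MAGIC, 3, 291, 24)
header = parse_gguf_header(fake_header)
print(header)
```

To inspect a real model, the same function could be applied to the first 24 bytes read from a `.gguf` file opened in binary mode.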


### GGUF Quantization Process

1. **Selecting a Quantization Level**: Choose the desired bit precision (e.g., 4-bit, 8-bit).

2. **Compressing the Model**: Apply the k-quant scheme to compress the model weights. K-quants are block-wise quantization types with names such as `Q4_K_M`: the leading number is the nominal bit width, and the `S`/`M`/`L` suffix selects how aggressively the different tensors are compressed. Fewer bits give a smaller file at the cost of a larger accuracy loss. The scheme works by:

   - **Mapping High-Precision Weights**: Converting high-precision (e.g., 16- or 32-bit floating point) weights to low-precision (e.g., 4-bit integer) representations.

   - **Sharing Parameters per Block**: Weights are grouped into small blocks, and each block stores its own scale (and, for some types, a minimum) so that the integer codes only need to cover that block's range. Blocks are in turn grouped into super-blocks, and the per-block scales are themselves stored in quantized form to keep the overhead low.

   - **Optimizing Storage**: The integer codes are bit-packed next to their block parameters so that they can be read and dequantized quickly during inference.

3. **Saving the Model**: Store the quantized model in the GGUF format, which includes the necessary metadata and configurations for loading and running the model.
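
The benefit of per-block parameters can be shown with a small plain-Python sketch. It is a simplified illustration and not llama.cpp's actual k-quant layout: there are no super-blocks, the scale and minimum are assumed to be stored as 16-bit floats, and all function names are hypothetical.

```python
def quantize_block(block, bits=4):
    """Asymmetric quantization of one block: integer codes, scale, and minimum."""
    levels = 2 ** bits - 1
    lo, hi = min(block), max(block)
    scale = (hi - lo) / levels or 1.0
    return [round((w - lo) / scale) for w in block], scale, lo


def dequantize_block(codes, scale, lo):
    """Rebuild approximate weights from the codes and the block parameters."""
    return [code * scale + lo for code in codes]


def round_trip(weights, block_size):
    """Quantize and dequantize in blocks of the given size."""
    restored = []
    for start in range(0, len(weights), block_size):
        codes, scale, lo = quantize_block(weights[start:start + block_size])
        restored.extend(dequantize_block(codes, scale, lo))
    return restored


def bits_per_weight(block_size, bits=4, param_bits=16):
    """Storage cost including one scale and one minimum per block."""
    return (block_size * bits + 2 * param_bits) / block_size


# 32 small weights followed by 32 much larger ones.
weights = [i * 0.01 for i in range(32)] + [float(i) for i in range(32)]

per_tensor = round_trip(weights, block_size=len(weights))
per_block = round_trip(weights, block_size=32)

tensor_err = max(abs(w - r) for w, r in zip(weights[:32], per_tensor[:32]))
block_err = max(abs(w - r) for w, r in zip(weights[:32], per_block[:32]))
print(f"error on the small weights, one range for all: {tensor_err:.4f}")
print(f"error on the small weights, one range per block: {block_err:.4f}")
print("bits per weight with 32-weight blocks:", bits_per_weight(32))
```

With a single range for the whole tensor the small weights all collapse to the same code, while a per-block range preserves them; the extra parameters raise the cost from 4 to 5 bits per weight in this setup.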

### Using a GGUF Formatted Model

1. **Install Necessary Packages**

The `ctransformers` package provides Python bindings for transformer models implemented in C/C++ on top of the GGML library. It loads quantized GGML/GGUF model files and runs inference on CPUs, with optional offloading of layers to a GPU.

In [9]:
! pip install ctransformers[cuda]

Collecting ctransformers[cuda]
  Downloading ctransformers-0.2.27-py3-none-any.whl (9.9 MB)
Installing collected packages: ctransformers
Successfully installed ctransformers-0.2.27


2. **Load the Model and Tokenizer**

In [10]:
from ctransformers import AutoModelForCausalLM
from transformers import AutoTokenizer, pipeline

# Load LLM and Tokenizer
# Use `gpu_layers` to specify how many layers will be offloaded to the GPU.
model = AutoModelForCausalLM.from_pretrained(
    "TheBloke/zephyr-7B-beta-GGUF",
    model_file="zephyr-7b-beta.Q4_K_M.gguf",
    model_type="mistral",
    gpu_layers=50,  # Number of layers to offload to the GPU
    hf=True  # Return a model that is compatible with Hugging Face Transformers
)
tokenizer = AutoTokenizer.from_pretrained(
    "HuggingFaceH4/zephyr-7b-beta",
    use_fast=True  # Use the fast tokenizer for improved speed
)

3. **Create a Text Generation Pipeline**

In [11]:
# Create a pipeline for text generation
pipe = pipeline(model=model, tokenizer=tokenizer, task='text-generation')

4. **Run a Prompt**

In [12]:
# Define the prompt
prompt = "Please prepare me a GGUF/GGML tutorial."

# Generate text based on the prompt
outputs = pipe(prompt, max_new_tokens=256)

# Print the generated text
print(outputs[0]["generated_text"])

Please prepare me a GGUF/GGML tutorial.

Sure, here's a high-level overview of how GGUF and GGML work:

GGUF (Generalized Gaussian Units and Factorization) is a neural network architecture that can efficiently represent high-dimensional data with low memory and computational requirements. It achieves this by using a factorized representation of the weights, which allows for more efficient training and inference.

GGML (Generalized Gaussian Model Language) is a programming language and library for implementing GGUF and related models. It provides a high-level, easy-to-use interface for defining and training GGUF models, as well as a low-level, optimized implementation for efficient execution.

Here's a basic tutorial on how to use GGML to train a simple image classification model:

1. Install GGML:

GGML is written in Rust, so you'll need to have Rust installed on your machine. You can then add GGML to your Rust project by adding the following to your `Cargo.toml` file:

```toml
[depend
```

## AWQ

Activation-aware Weight Quantization (AWQ) takes weight quantization a step further by considering the model's activations during the quantization process. In traditional weight quantization, the weights are quantized independently of the data they process. AWQ instead uses the activation distribution observed on calibration data to identify the small fraction of weight channels that matter most for the output, and protects those channels by rescaling them before quantization rather than by keeping them in higher precision.

### How AWQ Works

1. **Collect Activation Statistics**: During a calibration phase, the model is run on a small subset of data, and the range and distribution of the activations it produces are recorded.

2. **Search for Weight Quantization Parameters**: Weights are quantized by taking the activation statistics into account. Concretely, a search over the quantization parameters (e.g., scales and zero points) finds the values that minimize the distortion that quantization introduces in the output activations. As a result, the weights can be represented accurately with fewer bits.

3. **Quantize**: With the quantization parameters in place, the model weights are quantized to the reduced bit width.
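
The search in step 2 can be sketched in plain Python. This is a simplified illustration under assumptions, not the reference AWQ implementation: each input channel is scaled by its mean activation magnitude raised to a power `alpha`, the scaled weights are quantized to 4 bits per output row, and the `alpha` with the lowest output error on the calibration inputs wins. The function names, the `alpha` grid, and the toy data are hypothetical.

```python
def matvec(matrix, vector):
    """Multiply a list-of-rows matrix by a vector."""
    return [sum(a * b for a, b in zip(row, vector)) for row in matrix]


def quantize_rows(matrix, bits=4):
    """Symmetric round-trip quantization with one scale per output row."""
    qmax = 2 ** (bits - 1) - 1
    result = []
    for row in matrix:
        scale = max(abs(w) for w in row) / qmax or 1.0
        result.append([max(-qmax - 1, min(qmax, round(w / scale))) * scale for w in row])
    return result


def awq_search(weights, calib_inputs, alphas=(0.0, 0.25, 0.5, 0.75, 1.0)):
    """Return (error, alpha, dequantized weights) for the best scaling exponent."""
    n_in = len(weights[0])
    act_mag = [
        sum(abs(x[j]) for x in calib_inputs) / len(calib_inputs) for j in range(n_in)
    ]
    reference = [matvec(weights, x) for x in calib_inputs]
    best = None
    for alpha in alphas:
        s = [max(m, 1e-8) ** alpha for m in act_mag]
        scaled = [[w * s[j] for j, w in enumerate(row)] for row in weights]
        # Dividing the weights by s afterwards is equivalent to dividing the inputs by s.
        deq = [[w / s[j] for j, w in enumerate(row)] for row in quantize_rows(scaled)]
        err = sum(
            (a - b) ** 2
            for x, ref in zip(calib_inputs, reference)
            for a, b in zip(matvec(deq, x), ref)
        )
        if best is None or err < best[0]:
            best = (err, alpha, deq)
    return best


# Toy layer (2 outputs, 3 inputs); input channel 1 carries large activations.
weights = [[0.9, 0.05, -0.3], [0.2, -0.07, 0.6]]
calib_inputs = [[0.1, 8.0, -0.2], [-0.2, 10.0, 0.1], [0.05, -9.0, 0.3]]

rtn_err, _, _ = awq_search(weights, calib_inputs, alphas=(0.0,))
best_err, best_alpha, _ = awq_search(weights, calib_inputs)
print("plain rounding error:", rtn_err)
print("best alpha:", best_alpha, "error:", best_err)
```

Setting `alpha` to 0 disables the scaling and reduces to plain round-to-nearest, so the searched result can never be worse than that baseline on the calibration data.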

In [26]:
import torch
import torch.nn as nn

# Define function to collect activation statistics
def collect_activation_statistics(model, data_loader):
    activation_stats = []
    for data in data_loader:
        data = data.unsqueeze(0)  # Add batch dimension
        output = model(data)
        activation_stats.append(output)
    return torch.cat(activation_stats, dim=0)

# Define function to quantize based on activations
# Toy illustration only: real AWQ rescales salient channels before quantizing
# instead of keeping them in full precision.
def activation_aware_quantize(weights, activations, bits=4):
    # Threshold at the 75th percentile of the absolute activation values
    threshold = torch.quantile(activations.abs(), 0.75)
    print("Threshold:", threshold)

    # Fraction of samples in which each output channel exceeds the threshold
    mask = (activations.abs() > threshold).float().mean(dim=0)
    print("Mask:", mask)

    # Symmetric quantization with a scale derived from the weights, so the
    # rounded values span the signed integer range instead of collapsing to 0
    qmax = 2 ** (bits - 1) - 1
    scale = weights.abs().max() / qmax
    quant_weights = torch.round(weights / scale).clamp(-qmax - 1, qmax) * scale

    # Keep the rows of frequently active output channels in full precision.
    # The mask is per output channel, so it must align with the rows (dim 0).
    quant_weights = torch.where(mask.unsqueeze(1) > 0.1, weights, quant_weights)

    return quant_weights

# Example usage
model = nn.Linear(4, 4)
data_loader = [torch.randn(4) for _ in range(20)]
activation_stats = collect_activation_statistics(model, data_loader)
print("Activation Statistics:", activation_stats)

weights = model.weight.data
print("Original Weights:", weights)

# Apply activation-aware quantization
quant_weights = activation_aware_quantize(weights, activation_stats)
print("Quantized Weights with AWQ:", quant_weights)


Activation Statistics: tensor([[ 0.6376,  0.9171, -0.1691, -0.1528],
        [-0.1878,  1.0891, -0.3310,  0.3858],
        [-0.4195, -0.9617,  0.6482,  0.6964],
        [ 0.0380,  1.0539, -0.5740,  0.4708],
        [-0.4287,  0.4822,  0.0827,  0.7769],
        [ 0.0116,  0.8562, -0.2326,  0.3336],
        [-0.6194,  0.5045,  0.3495,  0.6302],
        [-0.5160,  0.3522, -0.2777,  1.2109],
        [ 0.2588, -0.2071, -0.1515,  0.5693],
        [-0.4380, -0.7717,  0.0657,  1.3330],
        [-0.4623, -1.0104,  0.6211,  0.7919],
        [-0.3850, -0.4657,  0.3673,  0.8714],
        [-0.9009, -0.7500,  0.5793,  1.4988],
        [ 0.9014, -0.0134, -0.2230, -0.0462],
        [-0.4321,  0.2919, -0.0915,  0.7033],
        [ 0.6737, -1.5593,  0.6364,  0.1896],
        [-0.1158, -0.0773, -0.6494,  1.1337],
        [ 0.1789,  0.1457, -0.2509,  0.6566],
        [-0.1595, -0.2854,  0.5268,  0.4588],
        [-0.2641, -0.6513, -0.0016,  1.0579]], grad_fn=<CatBackward0>)
Original Weights: tensor([[ 0.34

## Additional Resources

- 4-bit Quantization with GPTQ, by Maxime Labonne: https://towardsdatascience.com/4-bit-quantization-with-gptq-36b0f4f02c34

- TheBloke on Hugging Face: https://huggingface.co/TheBloke

- AWQ: Activation-aware Weight Quantization for LLM Compression and Acceleration: https://arxiv.org/abs/2306.00978

- Which Quantization Method is Right for You? (GPTQ vs. GGUF vs. AWQ): https://www.maartengrootendorst.com/blog/quantization/

- A Guide to Quantization in LLMs: https://symbl.ai/developers/blog/a-guide-to-quantization-in-llms/