- **Author:** **Kandimalla Hemanth**
- **Date last modified:** **19-01-2024**
- **E-mail:** **speechcodehemanth2@gmail.com**





# Understanding Quantization Methods for Model Optimization

Quantization methods play a pivotal role in optimizing models for efficient computational usage. These methods follow a specific naming convention for easy identification and understanding. Here's a comprehensive list of various quantization methods and their recommended use cases, curated from TheBloke's model cards:

#### q2_k:
- Utilizes Q4_K for attention.wv and feed_forward.w2 tensors, while employing Q2_K for other tensors.
- **Use Case:** Balancing precision where specific tensors require higher precision while maintaining efficiency elsewhere.

#### q3_k_l:
- Employs Q5_K for attention.wv, attention.wo, and feed_forward.w2 tensors, utilizing Q3_K for other instances.
- **Use Case:** Prioritizing higher precision for selected tensors critical for model performance.

#### q3_k_m:
- Leverages Q4_K for attention.wv, attention.wo, and feed_forward.w2 tensors, using Q3_K otherwise.
- **Use Case:** Striking a balance between precision and efficiency across significant tensors.

#### q3_k_s:
- Utilizes Q3_K across all tensors uniformly.
- **Use Case:** Standardizing precision for streamlined computational efficiency.

#### q4_0:
- Represents the original 4-bit quantization method.
- **Use Case:** Baseline quantization method with balanced accuracy and efficiency.

#### q4_1:
- Offers higher accuracy than q4_0 but not as high as q5_0. Features quicker inference compared to q5 models.
- **Use Case:** Balancing accuracy and inference speed while maintaining reasonable efficiency.

#### q4_k_m:
- Employs Q6_K for half of the attention.wv and feed_forward.w2 tensors, utilizing Q4_K for others.
- **Use Case:** Prioritizing higher precision for key tensors without compromising overall efficiency.

#### q4_k_s:
- Uses Q4_K uniformly across all tensors.
- **Use Case:** Standardized 4-bit precision for all tensors.

#### q5_0:
- Delivers higher accuracy but demands higher resource usage and slower inference.
- **Use Case:** Optimal accuracy where computational efficiency is not the primary concern.

#### q5_1:
- Offers even higher accuracy but comes with increased resource usage and slower inference compared to q5_0.
- **Use Case:** Maximal accuracy, sacrificing some efficiency for critical tasks.

#### q5_k_m (best):
- Utilizes Q6_K for half of the attention.wv and feed_forward.w2 tensors, employing Q5_K for others.
- **Use Case:** Enhanced precision for crucial tensors while balancing efficiency.

#### q5_k_s:
- Uniformly applies Q5_K across all tensors.
- **Use Case:** Standardized higher precision for all tensors.

#### q6_k:
- Employs Q8_K for all tensors uniformly.
- **Use Case:** Extremely high precision with potential resource implications.

#### q8_0:
- Represents precision almost indistinguishable from float16 but demands high resources and slower processing.
- **Use Case:** Niche applications where extreme precision is paramount, despite trade-offs in resource efficiency and speed.

Understanding these quantization methods empowers users to tailor models to specific needs, balancing precision, efficiency, and computational demands for optimal performance in various applications.
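
The naming convention used in the list above can be made explicit with a small parser. This is a minimal sketch, not part of llama.cpp: the function name `parse_quant_name` and the returned field names are hypothetical, and the reading of the `s`/`m`/`l` suffixes as small/medium/large mixes is an assumption based on how the names are used above.

```python
import re

# q<bits>_<0|1>          -> original methods (q4_0, q4_1, q5_0, q5_1, q8_0)
# q<bits>_k[_<s|m|l>]    -> k-quant methods (q2_k, q3_k_l, q4_k_m, q6_k, ...)
_PATTERN = re.compile(r"^q(?P<bits>\d)_(?:(?P<orig>[01])|k(?:_(?P<mix>[sml]))?)$")
_MIX_NAMES = {"s": "small", "m": "medium", "l": "large"}


def parse_quant_name(name):
    """Split a quantization method name into its parts (illustrative only)."""
    match = _PATTERN.match(name.lower())
    if match is None:
        raise ValueError(f"unrecognized quantization method: {name!r}")
    if match.group("orig") is not None:
        family, variant = "original", match.group("orig")
    else:
        family, variant = "k-quant", _MIX_NAMES.get(match.group("mix"))
    return {"bits": int(match.group("bits")), "family": family, "variant": variant}


for method in ("q4_0", "q4_k_m", "q5_k_s", "q6_k"):
    print(method, parse_quant_name(method))
```

For a name such as `q6_k`, which has no size suffix, the sketch reports `variant` as `None`.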

---





Quantization levels, size formulas, and model sizes in tabular form:

| Quantization Level | Types | Characteristics | Memory Suitability |
| ------------------ | ----- | --------------- | ------------------ |
| 2 to 4-bit | Q2_K, Q3_K, Q4_K | Highly compressed, smaller sizes, lower memory, some accuracy trade-off | Suitable for systems with limited RAM |
| 5 to 6-bit | Q5_K, Q6_K | Refined balance between size and accuracy | Favorable for systems with moderate memory capacities |
| 8-bit | Q8_0 | Closer to original model accuracy, demands higher memory | Suitable for systems with substantial RAM |

| Format | Description | Size Relative to FP16 (QK = block size) |
| ------ | ----------- | --------------------------------------- |
| Q4_0 | Represents each block with a 32-bit float scaling factor and multiple 4-bit integers | (4 + QK/2) / (2 * QK) |
| Q4_1 | Enhances Q4_0 with an additional offset factor | (8 + QK/2) / (2 * QK) |

| Model Size | Parameters | Not Quantized | Quantized | Example Hardware |
| ---------- | ---------- | ------------- | --------- | ---------------- |
| 7b | 7 billion | About 14 GB in FP16 (about 28 GB in FP32) | Q4 or Q6: feasible for systems with 16GB RAM | M2 MacBook Air or similar |
| 13b | 13 billion | N/A | Still substantial memory and computational overhead | GPU setups or higher-end hardware with ample memory |

Together, these tables give an overview of the characteristics, size formulas, and memory suitability of the different quantization levels and model sizes.
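
The two size formulas in the table can be evaluated directly. The sketch below is only an illustration under stated assumptions: it takes QK = 32 as the block size, follows the table in using a 4-byte header for Q4_0 (one 32-bit scale) and an 8-byte header for Q4_1 (scale plus offset), and ignores file metadata. The helper names are hypothetical.

```python
def block_bytes(qk, header_bytes):
    """Bytes used by one block: the header plus qk values at 4 bits each."""
    return header_bytes + qk / 2


def size_ratio(qk, header_bytes):
    """Quantized size relative to FP16, which needs 2 bytes per weight."""
    return block_bytes(qk, header_bytes) / (2 * qk)


def bits_per_weight(qk, header_bytes):
    """Effective storage cost of a single weight, in bits."""
    return 8 * block_bytes(qk, header_bytes) / qk


def model_gb(n_params, bpw):
    """Approximate weight storage in GB (10**9 bytes), ignoring metadata."""
    return n_params * bpw / 8 / 1e9


QK = 32  # assumed block size
print(f"FP16: 7b = {model_gb(7e9, 16):.2f} GB, 13b = {model_gb(13e9, 16):.2f} GB")
for name, header in (("Q4_0", 4), ("Q4_1", 8)):
    bpw = bits_per_weight(QK, header)
    print(f"{name}: ratio = {size_ratio(QK, header):.4f}, bits/weight = {bpw:.1f}, "
          f"7b = {model_gb(7e9, bpw):.2f} GB, 13b = {model_gb(13e9, bpw):.2f} GB")
```

Under these assumptions Q4_0 costs 5 bits per weight (ratio 0.3125) and Q4_1 costs 6 bits per weight (ratio 0.375), so 7 billion weights shrink from about 14 GB in FP16 to roughly 4.4 GB and 5.3 GB. Real files will differ somewhat, because the k-quant types use different block layouts.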



| Instruction Set | Description |
| --------------- | ----------- |
| SSE (Streaming SIMD Extensions) | Introduced by Intel, this set provides single instruction multiple data capabilities, supporting 128-bit SIMD instructions. It operates on packed single-precision floating-point data types and integer data types. |
| AVX (Advanced Vector Extensions) | An extension of SSE, AVX introduces wider 256-bit SIMD instructions, allowing for higher parallelism and increased data throughput. It enhances support for floating-point operations and provides additional registers. |
| AVX2 | Builds upon AVX by introducing support for integer operations with 256-bit SIMD instructions, expanding the capabilities for integer processing. It enhances performance for various computational tasks. |
| AVX-512 | The most advanced extension, operating on 512-bit SIMD instructions. It provides even higher parallelism and throughput, enabling intensive computational tasks across a wider range of data types. |
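
To relate the register widths in this table to a quantization block, the following sketch counts how many 32-bit values fit in one register and how many registers one block occupies. It is plain arithmetic, not a benchmark; the block size of 32 and the 32-bit element width are assumptions, and the helper names are hypothetical.

```python
# Register widths in bits, taken from the table above
REGISTER_BITS = {"SSE": 128, "AVX": 256, "AVX2": 256, "AVX-512": 512}


def lanes(register_bits, element_bits=32):
    """Number of elements one SIMD register holds."""
    return register_bits // element_bits


def registers_per_block(qk, register_bits, element_bits=32):
    """Registers needed to hold one block of qk elements, rounded up."""
    total_bits = qk * element_bits
    return (total_bits + register_bits - 1) // register_bits


QK = 32  # assumed block size
for name, bits in REGISTER_BITS.items():
    print(f"{name:8s} lanes = {lanes(bits):2d}, "
          f"registers per block = {registers_per_block(QK, bits)}")
```

A block size that is a multiple of the lane count fills every register completely, which is one reason a block size can suit one instruction set better than another.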



# Objective we want to work on

1. **Understanding the Trade-off:**
   - **Objective:** Find the optimal value for QK (block size).
   - **Trade-off:** The decision involves balancing compression ratio and accuracy.
   - **Factors:** Consider the CPU architecture as different sizes might be more efficient for various SIMD instruction sets.

2. **Definition of QK (Block Size):**
   - **QK Definition:** QK is the block size of the quantization scheme: the number of weights that are grouped together and share one scaling factor.
   - **Significance:** It determines how data is divided into blocks for simultaneous processing.

3. **Impact of Block Size on Compression and Accuracy:**
   - **Compression Ratio:** Larger block sizes give a smaller quantized size, because the per-block scaling factor is amortized over more weights (in the Q4_0 formula, (4 + QK/2) / (2 * QK) shrinks as QK grows). However, this might affect accuracy.
   - **Accuracy:** Smaller block sizes tend to give more accurate results, because each scaling factor has to cover fewer weights, but at the expense of more storage overhead per weight.

4. **Consideration of CPU Architecture:**
   - **SIMD Instruction Sets:** Understand the SIMD instruction sets supported by the CPU architecture.
   - **Efficiency:** Different block sizes may be more efficient for specific SIMD instruction sets.
   - **Parallel Processing:** SIMD allows parallel processing of the same operation on multiple data points, aligning with the nature of operations in Large Language Models.

5. **Algorithmic Decision Making:**
   - **Bayesian Optimization:** Utilize Bayesian optimization techniques to search for the optimal QK.
   - **Objective Function:** Define an objective function that captures the trade-off between compression ratio and accuracy.
   - **Iterative Refinement:** Perform iterative refinement using the Bayesian optimization to converge towards the optimal QK.

6. **Implementation of SIMD for Parallel Processing:**
   - **Identification:** Identify the specific operations in your algorithm that can benefit from parallel processing.
   - **Coding for SIMD:** Implement SIMD instructions in the algorithm to enable simultaneous processing.
   - **Performance Monitoring:** Monitor the performance on different CPU architectures to validate efficiency gains.

7. **Testing and Validation:**
   - **Dataset:** Use a diverse dataset to test the algorithm under various conditions.
   - **Metrics:** Evaluate compression ratio, accuracy, and overall performance metrics.
   - **Fine-tuning:** Adjust QK based on testing results and feedback from the specific CPU architecture.

8. **Documentation and Reporting:**
   - **Document Parameters:** Clearly document the chosen QK, rationale, and any adjustments made.
   - **Report Findings:** Provide a detailed report on the trade-offs, efficiency gains, and overall performance achieved.

This step-by-step guide encompasses algorithmic decisions, logic, and considerations for finding the optimal QK, taking into account both compression and accuracy, as well as the specific characteristics of SIMD and CPU architecture.
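
The trade-off described in the steps above can be explored numerically. The following is a minimal sketch, not llama.cpp's implementation. It assumes a simplified symmetric 4-bit scheme (one float scale per block, integers in [-7, 7]) and Gaussian random numbers as stand-in weights. It uses a plain sweep over a few candidate QK values in place of Bayesian optimization, since the candidate set is small. All function names are hypothetical.

```python
import random


def quantize_block(block):
    """Symmetric 4-bit quantization: one float scale plus integers in [-7, 7]."""
    amax = max(abs(x) for x in block)
    scale = amax / 7 if amax > 0 else 1.0
    quantized = [max(-7, min(7, round(x / scale))) for x in block]
    return scale, quantized


def dequantize_block(scale, quantized):
    """Reconstruct approximate float values from a quantized block."""
    return [scale * q for q in quantized]


def size_ratio(qk, scale_bytes=4):
    """Quantized bytes divided by FP16 bytes for one block of qk weights."""
    return (scale_bytes + qk / 2) / (2 * qk)


def rms_error(weights, qk):
    """Root-mean-square error after a quantize/dequantize round trip."""
    squared = 0.0
    for start in range(0, len(weights), qk):
        block = weights[start:start + qk]
        restored = dequantize_block(*quantize_block(block))
        squared += sum((a - b) ** 2 for a, b in zip(block, restored))
    return (squared / len(weights)) ** 0.5


def sweep(weights, candidates=(16, 32, 64, 128, 256)):
    """Evaluate size ratio and error for each candidate block size."""
    return [(qk, size_ratio(qk), rms_error(weights, qk)) for qk in candidates]


random.seed(0)
weights = [random.gauss(0.0, 1.0) for _ in range(4096)]  # stand-in data
results = sweep(weights)
for qk, ratio, err in results:
    print(f"QK={qk:4d}  size ratio={ratio:.4f}  RMS error={err:.4f}")
```

An objective function for step 5 could combine the two printed columns, for example as a weighted sum. The weighting is a design decision that depends on how much accuracy loss the application tolerates, and the error should ultimately be measured on real model weights or on model output quality instead of on random data.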

# **How to Use the Above Concepts for LLM Inference**

**GGML (Georgi Gerganov Machine Learning) vs. GGUF (GPT-Generated Unified Format): A Detailed Analysis**

*GGML at a Glance:*
GGML, conceived by Georgi Gerganov, acts as a specialized tensor library crafted explicitly for machine learning, with a focal point on optimizing language models like GPT. Initially, it represented an early innovation in the endeavor to establish a dedicated file format for GPT models. Its primary objective was to ensure efficient storage and high performance across diverse hardware architectures, including Apple Silicon.

**GGML Advantages:**
1. **Pioneering Innovation:** GGML marked the initial steps in devising a file format dedicated to GPT models, laying the foundation for further advancements in model storage.
2. **Unified File Sharing:** Introducing the convenience of consolidating entire models within a single file, it significantly streamlined collaborative efforts and model distribution.
3. **CPU Compatibility:** GGML exhibited adaptability by seamlessly operating on CPUs, ensuring broader accessibility across varying hardware configurations.

**GGML Limitations:**
1. **Flexibility Constraints:** Incorporating additional model information posed challenges for GGML, limiting its adaptability to evolving model architectures.
2. **Compatibility Challenges:** The introduction of novel features often led to compatibility issues with existing models, hindering seamless transitions.
3. **Manual Configuration Burden:** Users frequently encountered complexities in manually fine-tuning settings like rope-freq-base, rope-freq-scale, gqa, and rms-norm-eps, risking potential errors.

*Introduction of GGUF:*
GGUF emerged as the successor to GGML on August 21, 2023, presenting a substantial advancement in language model file formats. Developed collaboratively by AI community contributors, including Georgi Gerganov, GGUF aimed to rectify GGML's limitations and enhance the user experience in handling large-scale AI models.

**GGUF Advantages:**
1. **Overcoming GGML's Constraints:** GGUF was purpose-built to surmount GGML's limitations, offering a more resilient and user-friendly solution for model storage.
2. **Extensibility:** Enabling seamless integration of new features without compromising compatibility with existing models ensured adaptability to evolving architectures.
3. **Stability Focus:** GGUF prioritizes stability, striving to minimize disruptive changes and streamline transitions to newer format versions.
4. **Enhanced Versatility:** GGUF's utility extends beyond llama models, showcasing its capability in supporting diverse language models.

**GGUF Challenges:**
1. **Transition Effort:** Migrating existing models to GGUF may demand considerable time and resources.
2. **Adaptation Phase:** Users and developers require acclimatization to the new format, potentially necessitating an adjustment period for widespread adoption.

*Conclusion:*
GGUF stands as a notable progression from GGML, offering improved features, stability, and adaptability. While the transition may pose challenges, its advantages in handling large language models signify a promising evolution in language model file formats. For more detailed insights, explore the GitHub issues [here](https://github.com/ggerganov/llama.cpp/issues) and delve into the llama.cpp project by Georgi Gerganov [here](https://github.com/ggerganov).
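
Because GGUF keeps the whole model, including its metadata, in a single file, basic facts about a model can be read from the first few bytes. The sketch below is a hedged illustration and not the official loader. It assumes the header layout of a 4-byte `GGUF` magic, a little-endian 32-bit version, then the tensor count and the metadata key-value count, and it assumes the counts are 32-bit in version 1 and 64-bit in later versions. The function name is hypothetical, and the example parses a synthetic header, not a real file.

```python
import io
import struct

GGUF_MAGIC = b"GGUF"


def read_gguf_header(stream):
    """Read the fixed-size part of a GGUF header from a binary stream."""
    magic = stream.read(4)
    if magic != GGUF_MAGIC:
        raise ValueError(f"not a GGUF file (magic={magic!r})")
    (version,) = struct.unpack("<I", stream.read(4))
    # Assumption: version 1 stores 32-bit counts, later versions 64-bit counts
    fmt = "<II" if version == 1 else "<QQ"
    tensor_count, kv_count = struct.unpack(fmt, stream.read(struct.calcsize(fmt)))
    return {
        "version": version,
        "tensor_count": tensor_count,
        "metadata_kv_count": kv_count,
    }


# Synthetic version-1 header with 291 tensors and 16 metadata key-value pairs
fake_file = io.BytesIO(GGUF_MAGIC + struct.pack("<I", 1) + struct.pack("<II", 291, 16))
header = read_gguf_header(fake_file)
print(header)
```

For a real model, the same function could be called on a file opened with `open(path, "rb")`. The metadata key-value pairs that follow the header are what remove the need to pass settings such as rope-freq-base by hand.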


# Exploring
---

### Steps to Execute Quantization Code for Any Model from Hugging Face

To run the provided code, follow these steps using a terminal equipped with git, a C++ compiler, and pip for Python package management:

#### 1. Clone the llama.cpp Repository:
This repository hosts the essential quantization tools required for the LLaMA model.

```bash
git clone https://github.com/ggerganov/llama.cpp
cd llama.cpp
git pull
make clean
LLAMA_CUBLAS=1 make
cd ..
```

#### 2. Install Python Requirements for llama.cpp:
Ensure the necessary Python dependencies are installed to execute the quantization scripts.

```bash
pip install -r llama.cpp/requirements.txt
```

#### 3. Configure Git LFS and Clone the Model Repository:
Set up Git LFS for handling large files like pre-trained models. Clone the model repository from Hugging Face.

```bash
git lfs install
git clone https://huggingface.co/{MODEL_ID}
```

#### 4. Convert the Model to FP16 Format:
Utilize the provided convert.py script in the llama.cpp repository to convert the model's parameters to FP16, a half-precision floating-point format, reducing model size and potentially accelerating computations on FP16-supported GPUs.

```bash
python llama.cpp/convert.py {MODEL_NAME} --outtype f16 --outfile {fp16}
```

#### 5. Quantize the Model:
Iterate through each quantization method in the QUANTIZATION_METHODS list. Use the quantize executable to convert the FP16 model into each quantized format, generating one file of quantized weights per method.

```bash
MODEL_NAME="{MODEL_NAME}"               # directory cloned in step 3
QUANTIZATION_METHODS=(q4_k_m q5_k_m)    # methods to generate
for method in "${QUANTIZATION_METHODS[@]}"; do
    qtype="${MODEL_NAME}/${MODEL_NAME,,}.${method^^}.gguf"   # bash 4+: ,, lowercases, ^^ uppercases
    ./llama.cpp/quantize "{fp16}" "${qtype}" "${method}"
done
```

To execute these commands as a script, create a bash file (e.g., quantize_model.sh) containing these commands, replacing {MODEL_ID}, {MODEL_NAME}, and {fp16} with the actual values. Grant execution permission using chmod +x quantize_model.sh, then execute it with ./quantize_model.sh.




---



# `Local installation` of Git LFS (skip this on cloud machines)
### Step 1: Adding the PackageCloud Repository
1. **apt/deb Repositories:**
   - Run the following command to add the apt/deb repository:
     ```bash
     curl -s https://packagecloud.io/install/repositories/github/git-lfs/script.deb.sh | sudo bash
     ```

2. **yum/rpm Repositories:**
   - Add the yum/rpm repository with this command:
     ```bash
     curl -s https://packagecloud.io/install/repositories/github/git-lfs/script.rpm.sh | sudo bash
     ```

3. **For Specific or Compatible Distributions:**
   - If your Linux distribution does not match a repository uploaded for Git LFS, or if it's compatible with an upstream distribution:
     - Use additional parameters or manually correct the resulting repository URLs.

4. **Specific Distributions Configuration Example:**
   - For LinuxMint 17.1 Rebecca (downstream of Ubuntu Trusty and Debian Jessie):
     ```bash
     curl -s https://packagecloud.io/install/repositories/github/git-lfs/script.deb.sh | os=debian dist=jessie sudo -E bash
     ```
   - For automatic detection of the distribution for Ubuntu-based systems like Pop OS:
     ```bash
     (. /etc/lsb-release && curl -s https://packagecloud.io/install/repositories/github/git-lfs/script.deb.sh | sudo env os=ubuntu dist="${DISTRIB_CODENAME}" bash)
     ```

### Step 2: Installing Git LFS Packages
1. **apt/deb:**
   - Use the following command to install Git LFS:
     ```bash
     sudo apt-get install git-lfs
     ```

2. **yum/rpm:**
   - For yum/rpm-based systems, install Git LFS using:
     ```bash
     sudo yum install git-lfs
     ```

### Note about Proxies
- If your system is behind a proxy-server that's required for internet access, and you're using `sudo`, the environment variables like `http_proxy` or `https_proxy` might not persist.
- To retain environment variables when switching to root with `sudo`, use `sudo -E ...`.



In [None]:
# Variables
MODEL_ID = "mlabonne/EvolCodeLlama-7b"
QUANTIZATION_METHODS = ["q4_k_m", "q5_k_m"]

# Constants
MODEL_NAME = MODEL_ID.split('/')[-1]

# Install llama.cpp
!git clone https://github.com/ggerganov/llama.cpp
!cd llama.cpp && git pull && make clean && LLAMA_CUBLAS=1 make
!pip install -r llama.cpp/requirements.txt

# Download model
!git lfs install
!git clone https://huggingface.co/{MODEL_ID}

# Convert to fp16
fp16 = f"{MODEL_NAME}/{MODEL_NAME.lower()}.fp16.bin"
!python llama.cpp/convert.py {MODEL_NAME} --outtype f16 --outfile {fp16}

# Quantize the model for each method in the QUANTIZATION_METHODS list
for method in QUANTIZATION_METHODS:
    qtype = f"{MODEL_NAME}/{MODEL_NAME.lower()}.{method.upper()}.gguf"
    !./llama.cpp/quantize {fp16} {qtype} {method}



### Initialization:
- `ggml_init_cublas: found 1 CUDA devices:`
  - Indicates the initialization of CUDA devices for GPU processing. In this case, one CUDA device (Tesla T4) was found.

### Build and Quantization Information:
- `main: build = 1100 (dd0dc36)`
  - Specifies the build information or version (build 1100 with identifier dd0dc36).
- `main: quantizing 'EvolCodeLlama-7b/evolcodellama-7b.gguf.fp16.bin' to 'EvolCodeLlama-7b/evolcodellama-7b.gguf.q4_k_s.bin' as Q4_K_S`
  - Indicates the quantization process applied to a specific model (`EvolCodeLlama-7b/evolcodellama-7b.gguf.fp16.bin`) using the `Q4_K_S` method, resulting in a new quantized model file (`EvolCodeLlama-7b/evolcodellama-7b.gguf.q4_k_s.bin`).

### Model Loading Information:
- `llama_model_loader: loaded meta data with 16 key-value pairs and 291 tensors from EvolCodeLlama-7b/evolcodellama-7b.gguf.fp16.bin (version GGUF V1 (support until nov 2023))`
  - Confirms the successful loading of metadata associated with the model file (`EvolCodeLlama-7b/evolcodellama-7b.gguf.fp16.bin`). It contains 16 key-value pairs and 291 tensors. Additionally, it specifies the model's version as GGUF V1 with support until November 2023.
- `llama_model_loader: - tensor 0: token_embd.weight f16 [ 4096, 32016, 1, 1 ]`
- `llama_model_loader: - tensor 1: blk.0.attn_q.weight f16 [ 4096, 4096, 1, 1 ]`
- `llama_model_loader: - tensor 2: blk.0.attn_k.weight f16 [ 4096, 4096, 1, 1 ]`
- `llama_model_loader: - tensor 3: blk.0.attn_v.weight f16 [ 4096, 4096, 1, 1 ]`
- `llama_model_loader: - tensor 4: blk.0.attn_output.weight f16 [ 4096, 4096, 1, 1 ]`


### Tensor Information:
- **Names:**
  - Specifies the names or identifiers of the individual tensors within the model. For instance:
    - `token_embd.weight`
    - `blk.0.attn_q.weight`
    - `blk.0.attn_k.weight`
    - `blk.0.attn_v.weight`
    - `blk.0.attn_output.weight`
    - Each name identifies a particular tensor within the model's architecture.

- **Data Type (f16 for half-precision floating-point):**
  - Indicates the data type used to represent the numerical values within these tensors. Here, the notation `f16` refers to half-precision floating-point format, which uses 16 bits to represent numerical values.
  - Half-precision floating-point provides a compromise between memory usage and precision, allowing for faster computations and reduced memory requirements compared to higher-precision formats like `float32`.

- **Shape ([ne0, ne1, ne2, ne3]):**
  - Describes the dimensions of each tensor using up to four values. ggml lists the innermost (contiguous) dimension first:
    - `[ne0, ne1, ne2, ne3]`
    - For example:
      - `token_embd.weight`: `[4096, 32016, 1, 1]`
      - `blk.0.attn_q.weight`: `[4096, 4096, 1, 1]`
      - `blk.0.attn_k.weight`: `[4096, 4096, 1, 1]`
      - `blk.0.attn_v.weight`: `[4096, 4096, 1, 1]`
      - `blk.0.attn_output.weight`: `[4096, 4096, 1, 1]`
    - Each value within the square brackets represents a dimension:
      - `ne0`: The number of elements in each row (the row length).
      - `ne1`: The number of rows.
      - `ne2`: The third dimension (1 for these two-dimensional weight matrices).
      - `ne3`: The fourth dimension (also 1 here).
    - For `token_embd.weight`, `4096` is the embedding size and `32016` is the vocabulary size.
    - These dimensions signify the shape and size of the tensors, crucial for understanding their structure and how data is organized within the model.
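
The shapes listed above are enough to work out how much memory each tensor needs. The short sketch below multiplies the dimensions to get the element count and assumes 2 bytes per element for `f16`. The shapes are copied from the log, while the helper names are hypothetical.

```python
from math import prod

# Shapes copied from the loader output above
SHAPES = {
    "token_embd.weight": (4096, 32016, 1, 1),
    "blk.0.attn_q.weight": (4096, 4096, 1, 1),
    "blk.0.attn_k.weight": (4096, 4096, 1, 1),
    "blk.0.attn_v.weight": (4096, 4096, 1, 1),
    "blk.0.attn_output.weight": (4096, 4096, 1, 1),
}
BYTES_PER_F16 = 2


def n_elements(shape):
    """Total number of values stored in a tensor of this shape."""
    return prod(shape)


def f16_mib(shape):
    """Storage needed in FP16, in MiB (2**20 bytes)."""
    return n_elements(shape) * BYTES_PER_F16 / 2**20


for name, shape in SHAPES.items():
    print(f"{name:26s} {n_elements(shape):>12,d} values  {f16_mib(shape):8.3f} MiB")
```

The embedding table holds 131,137,536 values (about 250 MiB in FP16), and each 4096 x 4096 attention matrix holds 16,777,216 values (32 MiB in FP16).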


This log excerpt captures the initialization of CUDA devices, details about the build, quantization process, and information regarding the loaded model, including metadata and specific tensor details such as their names, shapes, and data types.

In [None]:
!huggingface-cli login


    _|    _|  _|    _|    _|_|_|    _|_|_|  _|_|_|  _|      _|    _|_|_|      _|_|_|_|    _|_|      _|_|_|  _|_|_|_|
    _|    _|  _|    _|  _|        _|          _|    _|_|    _|  _|            _|        _|    _|  _|        _|
    _|_|_|_|  _|    _|  _|  _|_|  _|  _|_|    _|    _|  _|  _|  _|  _|_|      _|_|_|    _|_|_|_|  _|        _|_|_|
    _|    _|  _|    _|  _|    _|  _|    _|    _|    _|    _|_|  _|    _|      _|        _|    _|  _|        _|
    _|    _|    _|_|      _|_|_|    _|_|_|  _|_|_|  _|      _|    _|_|_|      _|        _|    _|    _|_|_|  _|_|_|_|

    To login, `huggingface_hub` requires a token generated from https://huggingface.co/settings/tokens .
Token: 
Add token as git credential? (Y/n) y
Token is valid (permission: read).
[1m[31mCannot authenticate through git-credential as no helper is defined on your machine.
You might have to re-authenticate when pushing to the Hugging Face Hub.
Run the following command in your terminal in case you want to set the 'store

In [None]:
import os

model_list = [file for file in os.listdir(MODEL_NAME) if "gguf" in file]

prompt = input("Enter your prompt: ")
chosen_method = input("Name of the model (options: " + ", ".join(model_list) + "): ")

# Verify the chosen file is in the list
if chosen_method not in model_list:
    print("Invalid name")
else:
    # chosen_method is already a file name from model_list, so build the path from it
    qtype = f"{MODEL_NAME}/{chosen_method}"
    !./llama.cpp/main -m {qtype} -n 128 --color -ngl 35 -p "{prompt}"

In [None]:
MODEL_ID="mlabonne/EvolCodeLlama-7b"
QUANTIZATION_METHODS=["q4_k_m", "q5_k_m"]  # Make sure you define this list if it's not already defined.
MODEL_NAME=MODEL_ID.split('/')[-1]  # This will be "EvolCodeLlama-7b"

# Create the directory where the quantized models will be saved
!mkdir -p Hemanth_LLms
# Mistral AI v0.1
# Download model
# !git lfs install
!git clone https://huggingface.co/{MODEL_ID}

# Convert to fp16 and save in the 'Hemanth_LLms' folder
fp16 = f"Hemanth_LLms/{MODEL_NAME.lower()}.fp16.bin"
!python llama.cpp/convert.py {MODEL_NAME} --outtype f16 --outfile {fp16}

# Quantize the model for each method in the QUANTIZATION_METHODS list and save in the 'Hemanth_LLms' folder
for method in QUANTIZATION_METHODS:
    qtype = f"Hemanth_LLms/{MODEL_NAME.lower()}.{method.upper()}.gguf"
    !./llama.cpp/quantize {fp16} {qtype} {method}

Cloning into 'EvolCodeLlama-7b'...
remote: Enumerating objects: 35, done.
remote: Counting objects: 100% (32/32), done.
remote: Compressing objects: 100% (32/32), done.
remote: Total 35 (delta 8), reused 0 (delta 0), pack-reused 3
Unpacking objects: 100% (35/35), 483.46 KiB | 2.63 MiB/s, done.
Filtering content: 100% (5/5), 4.70 GiB | 10.00 MiB/s, done.
Encountered 1 file(s) that may not have been copied correctly on Windows:
	pytorch_model-00001-of-00002.bin

See: `git lfs help smudge` for more details.
python3: can't open file '/content/llama.cpp/convert.py': [Errno 2] No such file or directory
/bin/bash: line 1: ./llama.cpp/quantize: No such file or directory
/bin/bash: line 1: ./llama.cpp/quantize: No such file or directory


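### Work in progress: instructions for running the model will be updated soon

In the output above, `llama.cpp/convert.py` and `./llama.cpp/quantize` were not found, so llama.cpp has to be cloned and built (step 1) before this cell is run. Other errors that can occur: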

`Model directory not found`:

The convert.py script is failing because it cannot find the model directory Mistral-7B-v0.1. This might be due to an incorrect path or because the model has not been cloned into the expected directory.

`Shared library error`:

The quantize executable from the llama.cpp repository is failing to run because it cannot find the libcuda.so.1 shared library. This library is part of the CUDA toolkit and is required for GPU operations. This error typically occurs when you're trying to run a GPU-accelerated program on a system without the necessary CUDA libraries installed, or when the system's library path does not include the directory where libcuda.so.1 is located.

Here's how you can address these issues:

For the model directory not found error:

Make sure that the git clone command is executed in the correct directory and that it completes successfully. The model should be cloned into a directory with the same name as the MODEL_ID's last component.






current_working_directory
- Mistral-7B-v0.1/
  - config.json
  - pytorch_model.bin
  - ...

- The convert.py script should be run from the current_working_directory and it will look for the model in the Mistral-7B-v0.1 directory.
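
Path problems like the ones described in this section can be caught before any conversion starts. The following is a small sketch of such a check, assuming the directory layout used in this notebook (a `llama.cpp` folder next to the model folder, containing `convert.py` and a built `quantize` binary). The function name `missing_prerequisites` is hypothetical.

```python
from pathlib import Path


def missing_prerequisites(model_name, llama_dir="llama.cpp", workdir="."):
    """Return a description of every required path that does not exist."""
    root = Path(workdir)
    required = {
        "model directory": root / model_name,
        "convert script": root / llama_dir / "convert.py",
        "quantize binary": root / llama_dir / "quantize",
    }
    return [f"{label}: {path}" for label, path in required.items() if not path.exists()]


problems = missing_prerequisites("Mistral-7B-v0.1")
if problems:
    print("Missing before conversion can start:")
    for problem in problems:
        print("  -", problem)
else:
    print("All prerequisites found.")
```

Running this from the current working directory, before the convert and quantize commands, reports a missing model folder or an unbuilt llama.cpp up front.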

