
# A quantization, inference and evaluation pipeline featuring the LLaMA 2 7B model and Quanto

In this project, we evaluate the impact of quantization on a widely used LLM, LLaMA 2 7B, by measuring its performance on a sentiment analysis task. Our pipeline is defined as follows:
- Add a classification layer on top of the non-quantized model and train this layer with our dataset.
- Quantize the model using **Quanto** in two variants, 8-bit and 4-bit quantization (weights only).
- Compare the performance of our models (non-quantized, 4-bit and 8-bit) in terms of accuracy, memory use, inference speed and emissions.

![Pipeline Diagram](./Quanto_Pipeline.png)

We follow this approach because it is the simplest one. An alternative would be to quantize the base model first and then add and train the classification layer. One might expect that alternative to work better than ours. However, it runs into many problems in practice, because fine-tuning an already quantized model is not good practice.

### Goals
1. Analyze the impact quantization has on performance.
2. Compare the efficiency and accuracy of the quantized models with the non-quantized model.


## Quanto

**Quanto** (distributed as the `optimum-quanto` package in the Hugging Face Optimum ecosystem) is a quantization toolkit for PyTorch models, used here to reduce the cost of deploying large language models while limiting the trade-off between efficiency and accuracy. Applied as a **Post-Training Quantization (PTQ)** technique, Quanto enables efficient scaling of Transformer-based models to hardware with limited computational and memory resources. By leveraging advanced techniques for weight distribution analysis and precision tuning, Quanto preserves the essential properties of the model while significantly reducing its memory footprint and computational demands.

The **core principle** of Quanto lies in precision-aware quantization, which uses statistical and structural insights into Transformer layers to determine the optimal precision for different components of the model. Unlike uniform quantization approaches that treat all weights or activations equally, Quanto analyzes the sensitivity of weights, biases, and activations to precision loss. This method allows Quanto to apply mixed-precision quantization, assigning higher precision to critical components while aggressively compressing less sensitive ones. For example, attention weights and projection layers often retain higher precision (e.g., 16-bit), while feed-forward network (FFN) components can be quantized to lower precisions such as 4-bit or even 3-bit.

### Key Features of Quanto

1. **Adaptive Precision Assignment**:
   Quanto employs a layer-wise and component-wise analysis to determine the sensitivity of each weight to quantization errors. This involves measuring the distribution of values and their impact on downstream activations. Sensitive parameters are assigned higher precision to mitigate performance degradation, while less critical parameters are quantized more aggressively.

2. **Error-Minimization Framework**:
   Using a combination of **quantization noise metrics** and **statistical approximations**, Quanto minimizes the cumulative error introduced during the quantization process. It evaluates the quantization effect using metrics such as:
   \[
   \|WX - \hat{W}X\|^2,
   \]
   where \( W \) are the original weights, \( \hat{W} \) are the quantized weights, and \( X \) represents input activations. This framework allows Quanto to iteratively refine quantization decisions layer by layer.

3. **Attention to Transformer-Specific Structures**:
   Quanto recognizes the unique structural properties of Transformers, such as self-attention mechanisms and feed-forward layers. It adapts its quantization strategy to maintain alignment and stability in multi-head attention, ensuring that critical patterns in the attention matrices are preserved.
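
The reconstruction-error metric from the Error-Minimization Framework item above can be made concrete with a small, library-free sketch. It assumes plain symmetric round-to-nearest quantization with one scale per weight row, which is an illustrative choice and not necessarily the scheme Quanto uses internally; the matrix `W` and input `X` are made-up values.

```python
def quantize_row(row, bits):
    """Symmetric round-to-nearest quantization of one weight row, returned as floats."""
    qmax = 2 ** (bits - 1) - 1          # 127 for 8-bit, 7 for 4-bit
    max_abs = max(abs(w) for w in row)
    scale = max_abs / qmax if max_abs > 0 else 1.0
    return [max(-qmax, min(qmax, round(w / scale))) * scale for w in row]


def matvec(matrix, vector):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(w * x for w, x in zip(row, vector)) for row in matrix]


def layer_output_error(weights, quantized, inputs):
    """Squared L2 norm of WX - W_hat X for a single input vector X."""
    reference = matvec(weights, inputs)
    approximate = matvec(quantized, inputs)
    return sum((r - a) ** 2 for r, a in zip(reference, approximate))


# Made-up example values (assumptions, not taken from the model).
W = [[0.9, -1.0, 0.3],
     [0.2, 0.7, -0.4]]
X = [1.0, 1.0, 1.0]

W8 = [quantize_row(row, 8) for row in W]
W4 = [quantize_row(row, 4) for row in W]
err8 = layer_output_error(W, W8, X)
err4 = layer_output_error(W, W4, X)
print(f"8-bit error: {err8:.2e}, 4-bit error: {err4:.2e}")
```

For this particular input the 4-bit error is larger than the 8-bit one, which matches the intuition that fewer bits leave coarser rounding steps.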

### Quantization Workflow in Quanto

1. **Statistical Analysis and Sensitivity Scoring**:
   - A small calibration dataset is used to analyze the activation ranges and weight distributions of each layer.
   - Sensitivity scoring is performed to rank the importance of individual weights and activations within each layer.

2. **Precision Allocation**:
   - Mixed-precision configurations are applied based on sensitivity scores, with critical weights assigned higher precision (e.g., 8-bit or 16-bit) and non-critical ones reduced to lower precision (e.g., 4-bit or 3-bit).
   - Quantization scaling factors are computed to minimize the difference between full-precision and quantized outputs.

3. **Layer-Wise Quantization**:
   - Quanto processes each Transformer layer sequentially, ensuring that errors introduced by quantized layers do not propagate uncontrollably to subsequent layers.
   - Outputs of quantized layers are recalculated dynamically to serve as inputs for the next layer's quantization step.

4. **Block-Wise Optimization**:
   - For scalability in very large models, Quanto applies block-wise quantization to divide weight matrices into manageable submatrices, reducing memory and computation overhead while maintaining precision for critical components.
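
The block-wise idea in the last step can be sketched as follows. This is a minimal illustration, assuming symmetric round-to-nearest quantization and a hypothetical group size of 4; the weight values are invented so that one outlier is present. It shows why giving each block its own scale helps: the outlier only degrades the block it belongs to.

```python
def quantize_group(values, bits):
    """Symmetric round-to-nearest quantization with a single scale for the group."""
    qmax = 2 ** (bits - 1) - 1
    max_abs = max(abs(v) for v in values)
    scale = max_abs / qmax if max_abs > 0 else 1.0
    return [max(-qmax, min(qmax, round(v / scale))) * scale for v in values]


def quantize_blockwise(values, bits, group_size):
    """Quantize consecutive groups independently, each with its own scale."""
    out = []
    for start in range(0, len(values), group_size):
        out.extend(quantize_group(values[start:start + group_size], bits))
    return out


def squared_error(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b))


# Invented weights; 8.0 plays the role of an outlier.
weights = [0.12, -0.2, 0.15, 0.05, 8.0, -0.1, 0.2, 0.3]

per_tensor = quantize_blockwise(weights, bits=4, group_size=len(weights))
per_group = quantize_blockwise(weights, bits=4, group_size=4)

print("one scale for all weights:", squared_error(weights, per_tensor))
print("one scale per group of 4: ", squared_error(weights, per_group))
```

With a single scale, every small weight rounds to zero because the outlier stretches the range; with per-group scales, the first group keeps most of its resolution.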

### Inference and Performance Optimization

1. **Dynamic Weight Reconstruction**:
   Quantized weights are stored in low-precision formats and are dynamically reconstructed to higher precision (if needed) during inference, balancing memory efficiency with computational accuracy.

2. **Custom Kernels for Speed**:
   Quanto leverages optimized inference kernels designed for hardware accelerators like NVIDIA GPUs. These kernels minimize memory transfer overhead and enable efficient execution of quantized matrix operations.

3. **Memory Efficiency**:
   With Quanto, large Transformer models can fit into smaller memory footprints, enabling deployment on single GPUs or edge devices. For instance, a 3-bit quantized LLaMA-2-13B model can run on a single NVIDIA A100, significantly reducing infrastructure requirements.

4. **Robustness in Token-by-Token Tasks**:
   In autoregressive tasks such as text generation, Quanto optimizes inference latency by compressing weights without disrupting sequential token generation, maintaining smooth token predictions.
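
To illustrate what "stored in low precision and reconstructed at inference time" means in practice, here is a small sketch that packs signed 4-bit codes two per byte and dequantizes them again. The packing layout and the helper names (`pack_int4`, `unpack_int4`, `dequantize`) are assumptions made for this example; the actual storage format used by Quanto may differ.

```python
def pack_int4(codes):
    """Pack signed 4-bit integers (range -8..7) two per byte, low nibble first."""
    codes = list(codes)
    if len(codes) % 2:
        codes.append(0)                  # pad to an even count
    packed = bytearray()
    for low, high in zip(codes[0::2], codes[1::2]):
        packed.append((low & 0xF) | ((high & 0xF) << 4))
    return bytes(packed)


def unpack_int4(packed, count):
    """Recover the first `count` signed 4-bit integers from packed bytes."""
    codes = []
    for byte in packed:
        for nibble in (byte & 0xF, byte >> 4):
            codes.append(nibble - 16 if nibble >= 8 else nibble)
    return codes[:count]


def dequantize(codes, scale):
    """Reconstruct approximate floating-point weights from integer codes."""
    return [c * scale for c in codes]


codes = [-8, -1, 0, 7, 3]
packed = pack_int4(codes)
print(len(codes), "codes stored in", len(packed), "bytes")
print(dequantize(unpack_int4(packed, len(codes)), scale=0.5))
```

The unpack and dequantize steps are extra work on every forward pass, which is the kind of overhead discussed later in the inference speed results.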

### Advantages of Quanto

- **High Compression Ratios**: Quanto achieves significant reductions in model size, often compressing models by over 4x with minimal impact on accuracy.
- **Scalable to Large Models**: Designed for scaling Transformer architectures like LLaMA, GPT, and BERT, Quanto can handle models with billions of parameters.
- **Hardware-Agnostic Design**: Compatible with a wide range of hardware platforms, including GPUs, TPUs, and edge devices.

In summary, Quanto combines precision-aware quantization with Transformer-specific optimizations to enable efficient deployment of large language models in resource-constrained environments. Its adaptive, iterative, and hardware-efficient approach makes it a powerful tool for modern AI applications that demand high performance and low latency.

## Hardware and software

### Hardware
- CPU: Intel(R) Xeon(R) Silver 4316 CPU @ 2.30GHz
- RAM: 235 GB
- GPU: NVIDIA A100-PCIE-40GB
### Software
For this project, we used the following Python libraries: torch, transformers and datasets, together with the standard-library time module, among others. Their dependencies are required as well.
We also used additional tools to evaluate the results: the CodeCarbon library was used to collect emissions data, and the Weights and Biases (Wandb) platform was used to collect and visualize metrics.


## Project pipeline

### 1. Base model preparation
The base model used is **LLaMA 2 (7B)**. We added a classifier layer to it by loading it with the `LlamaForSequenceClassification` class from the Transformers library. Only this layer was trained, with all other layers frozen, using our chosen dataset, **TweetEval**. To demonstrate that this training does not suffer from overfitting, we show the training and evaluation losses:
![Training vs Evaluation Loss](./overfitting.png)
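
The base model preparation step could look roughly like the sketch below. Several details are assumptions that the report does not state: the checkpoint name `meta-llama/Llama-2-7b-hf`, three sentiment labels, reusing the end-of-sequence token for padding, and the helper names `freeze_all_but_head` and `load_classifier`. The heavy imports are done lazily so that the freezing helper can be read and tried on its own.

```python
def freeze_all_but_head(model, head_prefix="score"):
    """Disable gradients everywhere except parameters whose name starts with head_prefix.

    In LlamaForSequenceClassification the classification layer is named `score`.
    Returns the names of the parameters that remain trainable.
    """
    trainable = []
    for name, param in model.named_parameters():
        param.requires_grad = name.startswith(head_prefix)
        if param.requires_grad:
            trainable.append(name)
    return trainable


def load_classifier(model_id="meta-llama/Llama-2-7b-hf", num_labels=3):
    """Load LLaMA 2 with a classification head and freeze the backbone.

    model_id and num_labels are assumptions made for this sketch.
    """
    from transformers import AutoTokenizer, LlamaForSequenceClassification

    tokenizer = AutoTokenizer.from_pretrained(model_id)
    tokenizer.pad_token = tokenizer.eos_token          # LLaMA ships without a pad token
    model = LlamaForSequenceClassification.from_pretrained(
        model_id, num_labels=num_labels
    )
    model.config.pad_token_id = tokenizer.pad_token_id
    freeze_all_but_head(model)
    return tokenizer, model
```

Calling `load_classifier()` downloads the full checkpoint, so it needs access to the model weights and enough memory; `freeze_all_but_head` works on any object exposing `named_parameters()`.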

### 2. Quantization 
Quantization was performed using **Quanto**, in two configurations: **8 bits** and **4 bits**. Note that only the weights were quantized.
This gives us 3 different models: **quantized to 8 bits**, **quantized to 4 bits** and **not quantized**.
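
A weight-only quantization step with Quanto might be written as in the following sketch. It assumes the `optimum-quanto` package (imported as `optimum.quanto`) and its `quantize`, `freeze`, `qint8` and `qint4` names; the wrapper `quantize_weights` is a hypothetical helper, and the exact options used in this project are not documented here.

```python
SUPPORTED_BITS = (4, 8)


def quantize_weights(model, bits):
    """Quantize only the weights of `model` in place to 4 or 8 bits.

    Activations are left untouched (activations=None), matching the
    weights-only setup described above.
    """
    if bits not in SUPPORTED_BITS:
        raise ValueError(f"bits must be one of {SUPPORTED_BITS}, got {bits}")

    # Imported lazily so the argument check works without the library installed.
    from optimum.quanto import freeze, qint4, qint8, quantize

    qtype = qint8 if bits == 8 else qint4
    quantize(model, weights=qtype, activations=None)   # mark modules for quantization
    freeze(model)                                      # replace float weights by quantized ones
    return model
```

Because `quantize` modifies the model in place, the 8-bit and 4-bit variants each need their own freshly loaded copy of the trained model.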

### 3. Evaluation
These three models were evaluated and compared in terms of **accuracy**, **memory use**, **inference speed** and **emissions**. We used the test subset of the same dataset used for training.
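
As an illustration of how evaluation time and emissions can be collected together, here is a minimal measurement helper. It assumes CodeCarbon's `EmissionsTracker` with its `start()` and `stop()` methods, where `stop()` returns the emissions in kg CO<sub>2</sub>eq; the function name `measure` and the option to pass in a tracker are choices made for this sketch, not a description of the project's actual code.

```python
import time


def measure(fn, tracker=None):
    """Run fn() once and return (result, elapsed_seconds, emissions_kg).

    `tracker` is any object with start() and stop(); when omitted, a
    CodeCarbon EmissionsTracker is created.
    """
    if tracker is None:
        from codecarbon import EmissionsTracker   # imported lazily
        tracker = EmissionsTracker()

    tracker.start()
    start = time.perf_counter()
    try:
        result = fn()
    finally:
        seconds = time.perf_counter() - start
        emissions_kg = tracker.stop()
    return result, seconds, emissions_kg
```

A call such as `measure(lambda: evaluate(model, test_set))`, with `evaluate` standing in for whatever evaluation loop is used, would then give one time and one emissions figure per model configuration. When timing GPU work, the evaluation function should finish all pending GPU operations before returning so that the wall-clock time is meaningful.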



### Accuracy results
We now show the results obtained for the accuracy of each of the three models:

![Accuracy](./accuracy_quanto.png)

First, accuracy stays essentially the same across batch sizes for a given model; the differences are negligible. Taking the results for each model without batching, accuracy is highest for the base model and, among the quantized models, the 8-bit variant is slightly more accurate, although the differences are minimal. The 8-bit quantized model is only 0.37% less accurate than the base model, while the 4-bit variant is 0.46% less accurate. Comparing the quantized models with each other, the 4-bit model is 0.08% less accurate than the 8-bit model.


### Memory use results

Measuring memory needs is not an easy task. We decided to check memory usage at two points of the pipeline, which gives two different metrics: the minimum memory needed by the model (on disk) and the maximum memory needed during inference (on the GPU). Note that the second measure tends to be higher because of the additional memory that inference requires, especially when using batches.

The minimum memory needs for each model are:
- Base model: 24.61 GB
- 8-bit quantized: 6.43 GB
- 4-bit quantized: 3.88 GB

The quantized models use considerably less memory than the base model: the 8-bit quantized model uses 73.86% less memory than the base model, while the 4-bit variant uses 84.23% less. Comparing the quantized models, the 4-bit model uses 39.66% less memory than the 8-bit variant. The following chart shows these results:
![model_sizes](./model_sizes_Quanto.png)
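
As a rough sanity check on model sizes like these, the weight footprint can be estimated as parameter count times bits per weight. The sketch below is only an estimate under stated assumptions: the parameter count `ASSUMED_PARAMS` is a round illustrative figure rather than a measured one, and the base model is assumed to be stored with 32 bits per weight.

```python
GIB = 1024 ** 3


def weight_footprint_gib(n_params, bits_per_weight):
    """Lower-bound size of the weights alone, in GiB."""
    return n_params * bits_per_weight / 8 / GIB


ASSUMED_PARAMS = 6.6e9   # assumption: approximate parameter count, not from the report

for label, bits in [("32-bit float", 32), ("8-bit", 8), ("4-bit", 4)]:
    print(f"{label}: {weight_footprint_gib(ASSUMED_PARAMS, bits):.2f} GiB")
```

Measured sizes of quantized models are typically somewhat above this lower bound, since quantization scales have to be stored and some tensors may remain in higher precision.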

The maximum memory needed for inference, for each model and inference configuration, is:

- Base model: 24.75 GB
- 8-bit quantized: 6.92 GB
- 4-bit quantized: 4.28 GB
- 8-bit quantized with batch size 32: 12.32 GB
- 8-bit quantized with batch size 64: 17.90 GB
- 4-bit quantized with batch size 32: 9.62 GB
- 4-bit quantized with batch size 64: 15.21 GB

These results are shown in this chart:
![max_memory](./max_gpu_mem_quanto.png)

The maximum GPU memory usage during inference is noticeably higher for both quantized models when using batches: the bigger the batch size, the higher the memory needs. For example, the 8-bit model needs 43.85% less memory without batches than with a batch size of 32, and 61.34% less than with a batch size of 64.

### Inference speed results

In this section we do not include the quantization time in the comparison, although we report it separately.

The evaluation time in each case was:
- Base model: 643.06 s
- 8-bit quantized: 1362.31 s
- 4-bit quantized: 1850.53 s
- 8-bit quantized with batch size 32: 561.99 s
- 8-bit quantized with batch size 64: 569.19 s
- 4-bit quantized with batch size 32: 580.04 s
- 4-bit quantized with batch size 64: 577.49 s

The quantization time for each quantized model was:
- 8-bit quantized: 23.89 s
- 4-bit quantized: 46.10 s

Here we see the most surprising results of the whole report. Without batching, the fastest model was, by far, the base model, and among the quantized variants the 8-bit one was faster. Compared to the base model, the 8-bit model took 111.86% more time to run inference for the evaluation and the 4-bit model took 187.79% more time. Comparing the quantized models, the 8-bit model took 26.36% less time than the 4-bit one (equivalently, the 4-bit model took about 35.8% more time than the 8-bit one). However, when using batches during inference, the quantized models performed much better. For example, the 8-bit model takes 58.22% less time with a batch size of 64 than without batching, which is also below the unbatched base-model time.
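
Percentage comparisons of this kind depend on which value is taken as the reference, so the same pair of timings yields two different figures. The short sketch below makes the two conventions explicit using the unbatched 8-bit and 4-bit evaluation times listed above; the helper names are chosen for this example.

```python
def percent_more(value, reference):
    """How much larger `value` is than `reference`, in percent of `reference`."""
    return (value / reference - 1.0) * 100.0


def percent_less(value, reference):
    """How much smaller `value` is than `reference`, in percent of `reference`."""
    return (1.0 - value / reference) * 100.0


t_8bit, t_4bit = 1362.31, 1850.53   # evaluation times in seconds, from the list above

print(f"4-bit takes {percent_more(t_4bit, t_8bit):.2f}% more time than 8-bit")
print(f"8-bit takes {percent_less(t_8bit, t_4bit):.2f}% less time than 4-bit")
```

Small differences with respect to the percentages quoted in the text are expected, because the listed times are rounded to two decimals.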

Regarding quantization times, the 8-bit model was quantized faster, taking 48.16% less time than the 4-bit variant. The following graphs show these results:

![inference_time](./inference_time_Quanto.png)
![quantization_time](./quantization_time_Quanto.png)

These results are surprising because one would normally expect the smaller quantized model to run faster, and this is not what happens. Previously, with BitsAndBytes, we attributed this to the nature of the BitsAndBytes quantization algorithm. However, after seeing the same behaviour with GPTQ and Quanto, we investigated further. We concluded that this happens when inference during evaluation is run without batches, one example at a time. As these quantization methods convert the weights back to fp16 for inference, the conversion adds a significant overhead. This leads to a higher inference time unless we take advantage of the lower bit width and use larger batch sizes. The results obtained in this section for quantized models using batches during inference support this explanation.
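
The amortization argument can be illustrated with a toy cost model: if each forward pass pays a fixed conversion cost regardless of how many examples it processes, the cost per example shrinks as the batch grows. The function and the two constants below are hypothetical and are not fitted to the measurements in this report.

```python
def per_sample_time(batch_size, fixed_overhead, per_sample_compute):
    """Amortized time per example when a fixed per-pass cost is shared by a batch."""
    return fixed_overhead / batch_size + per_sample_compute


# Hypothetical costs in seconds, chosen only to show the trend.
OVERHEAD, COMPUTE = 0.10, 0.01

for batch_size in (1, 32, 64):
    cost = per_sample_time(batch_size, OVERHEAD, COMPUTE)
    print(f"batch size {batch_size:>2}: {cost:.4f} s per example")
```

In this simplified picture most of the gain comes from moving away from a batch size of 1, and further increases bring diminishing returns, which is qualitatively similar to the small difference observed between batch sizes 32 and 64.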

### Emissions results

Emissions are somewhat challenging to measure and compare, because we need to decide whether to take into account the emissions generated while quantizing the models. We report the emissions of the evaluation separately from those of the quantization process, but take both into consideration when comparing. Emissions are measured in kilograms of CO<sub>2</sub>-equivalents \[CO<sub>2</sub>eq\].

The emissions generated by the evaluation process are listed below. The base model was re-measured for this experiment, so its emissions and time differ very slightly from those obtained in the GPTQ and BitsAndBytes experiments; re-running it each time keeps the conditions as close as possible to those of the Quanto runs.
- Base model: 0.01390
- 8-bit quantized: 0.026987
- 4-bit quantized: 0.036751
- 8-bit quantized with batch size 32: 0.011079
- 8-bit quantized with batch size 64: 0.011227
- 4-bit quantized with batch size 32: 0.011444
- 4-bit quantized with batch size 64: 0.011415

The emissions generated by quantization were:
- 8-bit quantized: 0.000231
- 4-bit quantized: 0.000456

If we focus on the evaluation, the 4-bit quantized model has the highest emissions, which is consistent with the inference times seen earlier. The 8-bit model has 94.08% higher emissions than the base model, while the 4-bit variant has 164.38% higher emissions. Comparing the quantized models, the 4-bit model has 36.18% higher emissions than the 8-bit model. However, in line with the inference-time results, quantized models generate much lower emissions when using batches during inference. For example, the 8-bit model generates 58.41% lower emissions with a batch size of 64.

Regarding quantization, the 8-bit model generates 49.34% lower emissions than the 4-bit variant.

We now show a couple of graphs to visualize these results:

![inference_emissions](./inference_emissions_Quanto.png)
![quantization_emissions](./quantization_emissions_Quanto.png)


## Conclusions

Results show that:

1. Quantization does reduce accuracy, and the lower the bit width, the larger the drop. However, the drop with **Quanto** is small: accuracy remains close to the base model results.
2. Quantized models need noticeably less memory than the base model.
3. Contrary to what would be expected, models quantized with **Quanto** are not faster than the base model when inference is run one example at a time. This is probably due to the nature of this quantization method, which restores the values to fp16 for inference; since our evaluation does not use batches, this overhead makes the model slower. With batch sizes of 32 and 64, the evaluation time of the quantized models dropped below that of the unbatched base model. However, more study and experimentation are needed in this area.
4. As with inference time, emissions are higher for the quantized models when no batching is used, which is consistent with their higher inference time. However, the differences here are smaller than the differences in speed, probably due to the size of the models.

