
# A quantization, usage and evaluation pipeline featuring the LLaMA 2 7B model and HQQ

In this project, we evaluate the impact of quantization on a commonly used LLM, in our case LLaMA 2 7B, on a sentiment analysis task. Our pipeline is defined as follows:
- Add a classification layer at the end of the non-quantized model and train this layer with our dataset.
- Quantize the model using **HQQ** in three variants: 8-bit, 4-bit and 3-bit quantization.
- Compare the performance of our models (not quantized, 8-bit, 4-bit and 3-bit) in terms of accuracy, memory use, inference speed and emissions.

![Pipeline Diagram](./HQQ_Pipeline.png)

We follow this approach because it is the simplest one. An alternative would be to quantize the base model first and then add and train the classification layer. One might expect that alternative to work better, but it runs into many problems in practice, because fine-tuning an already quantized model is not good practice.

### Goals
1. Analyze the impact quantization has on performance.
2. Compare the efficiency and accuracy of the quantized models with the non-quantized model.


## HQQ

HQQ (Half-Quadratic Quantization) is a post-training quantization (PTQ) method designed to quantize large language models quickly and efficiently. It minimizes the weight quantization error through a half-quadratic optimization approach. Unlike calibration-based methods such as GPTQ, HQQ does not require calibration data, which makes it significantly faster: reportedly up to 100 times quicker than techniques like GPTQ. This allows models such as LLaMA-2-70B to be quantized in under five minutes, which makes it well suited to deployment in environments with limited computational resources.

The core principle of HQQ is its optimization-based approach, which minimizes the error between the original and the quantized weights. Instead of relying on calibration data or gradients, HQQ formulates quantization as an optimization problem with a sparsity-promoting loss function. This mitigates the impact of outliers and helps preserve critical weight structures.

The HQQ process follows these steps:

- Problem formulation: Quantization is modeled as an optimization problem focused on weight errors rather than activations.
- Half-quadratic solution: The problem is split into sub-problems that each have a closed-form solution, so no gradient-based training is needed.
- Fast execution: Thanks to this efficient formulation, HQQ performs quantization in minutes, even for models with billions of parameters.

Inference with HQQ-quantized models benefits from a significant memory reduction while largely maintaining model accuracy. By preserving critical weight structures and using an efficient optimization strategy, HQQ aims for a good balance between computational efficiency and predictive performance.
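The steps above can be illustrated with a heavily simplified sketch in pure Python. It handles a single group of weights, keeps the scale fixed and only refines the zero point by alternating between a shrinkage step and a closed-form update. The function names and hyperparameter values (`p`, `beta`, `kappa`, `iters`) are illustrative assumptions; this is not the HQQ library's implementation, which works on tensors and processes many groups at once.

```python
import math


def shrink_lp(x, beta, p):
    """Generalized soft-thresholding: proximal step for an l_p loss with p < 1."""
    if x == 0.0:
        return 0.0
    magnitude = abs(x) - (abs(x) ** (p - 1)) / beta
    return math.copysign(max(magnitude, 0.0), x)


def quantize(weights, scale, zero, n_levels):
    """Map each weight to an integer code in [0, n_levels - 1]."""
    return [min(max(round(w / scale + zero), 0), n_levels - 1) for w in weights]


def dequantize(codes, scale, zero):
    """Map integer codes back to approximate weights."""
    return [scale * (q - zero) for q in codes]


def mean_abs_error(weights, scale, zero, n_levels):
    recon = dequantize(quantize(weights, scale, zero, n_levels), scale, zero)
    return sum(abs(w - r) for w, r in zip(weights, recon)) / len(weights)


def hqq_like_zero_point(weights, nbits=4, p=0.7, beta=10.0, kappa=1.01, iters=20):
    """Refine the zero point of one weight group (illustrative sketch only)."""
    n_levels = 2 ** nbits
    w_min, w_max = min(weights), max(weights)
    scale = (w_max - w_min) / (n_levels - 1)
    if scale == 0.0:  # all weights identical: any positive scale works
        scale = 1.0
    zero = -w_min / scale  # plain min/max (round-to-nearest) starting point
    best_zero = zero
    best_err = mean_abs_error(weights, scale, zero, n_levels)
    for _ in range(iters):
        codes = quantize(weights, scale, zero, n_levels)
        recon = dequantize(codes, scale, zero)
        # Sub-problem 1: sparse error term obtained with the shrinkage operator.
        w_e = [shrink_lp(w - r, beta, p) for w, r in zip(weights, recon)]
        # Sub-problem 2: closed-form update of the zero point (a simple mean).
        zero = sum(q - (w - e) / scale for q, w, e in zip(codes, weights, w_e)) / len(weights)
        beta *= kappa
        err = mean_abs_error(weights, scale, zero, n_levels)
        if err < best_err:
            best_zero, best_err = zero, err
        else:
            break  # stop as soon as the error no longer improves
    return scale, best_zero, best_err


# Toy weight group with one outlier (made-up values).
weights = [-0.9, -0.2, -0.1, 0.0, 0.05, 0.1, 0.15, 0.2, 2.5]
scale, zero, err = hqq_like_zero_point(weights, nbits=4)
baseline = mean_abs_error(weights, scale, -min(weights) / scale, 2 ** 4)
print(f"min/max error: {baseline:.4f}, refined error: {err:.4f}")
```

Because the loop keeps the best zero point seen so far, the refined error is never worse than the plain min/max starting point.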

In summary, HQQ combines speed, accuracy and efficiency without requiring calibration data. This makes it a useful tool for deploying modern language models in resource-constrained settings.

## Hardware and software

### Hardware
- CPU: Intel(R) Xeon(R) Silver 4316 CPU @ 2.30GHz
- RAM: 235 GB
- GPU: NVIDIA A100-PCIE-40GB
### Software
For this project, we used the following Python libraries, among others: torch, transformers, time and datasets, together with their dependencies.
We also used tools that help evaluate the results: the CodeCarbon library to collect emissions data, and the Weights and Biases (Wandb) platform to collect and visualize metrics.


## Project pipeline

### 1. Base model preparation
The base model is **LLaMA 2 (7B)**. We added a classification layer by loading the model with the `LlamaForSequenceClassification` class from the Transformers library. Only this layer was trained, with all other layers frozen, using our chosen dataset, **TweetEval**. To demonstrate that this training does not overfit, we show the training and evaluation losses:
![Training vs Evaluation Loss](./overfitting.png)
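A minimal sketch of how the classification head can be kept trainable while everything else is frozen. It relies on the fact that `LlamaForSequenceClassification` names its classification layer `score`; the model identifier and `num_labels=3` are assumptions for illustration, not values taken from our training script.

```python
def freeze_all_but_head(model, head_prefix="score"):
    """Freeze every parameter except those whose name starts with head_prefix.

    Returns the names of the parameters that remain trainable.
    """
    trainable = []
    for name, param in model.named_parameters():
        param.requires_grad = name.startswith(head_prefix)
        if param.requires_grad:
            trainable.append(name)
    return trainable


def load_classifier(model_id="meta-llama/Llama-2-7b-hf", num_labels=3):
    """Load LLaMA 2 with a classification head and freeze the backbone.

    model_id and num_labels are assumed values for illustration.
    """
    # Imported lazily so freeze_all_but_head can be used without transformers.
    from transformers import LlamaForSequenceClassification

    model = LlamaForSequenceClassification.from_pretrained(model_id, num_labels=num_labels)
    freeze_all_but_head(model)
    return model
```

After calling `load_classifier()`, only the parameters under `score` should report `requires_grad=True`, so an optimizer built from the trainable parameters updates the new layer alone.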

### 2. Quantization 
Quantization was performed using **HQQ**. With this method we used three quantization options: **8 bits**, **4 bits** and **3 bits**.
This gives us four different models: **quantized to 8 bits**, **quantized to 4 bits**, **quantized to 3 bits** and **not quantized**.
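One possible way to produce the three variants is through the HQQ integration in Transformers (`HqqConfig`), sketched below. Whether this route or the standalone `hqq` package is used is an implementation choice; `group_size=64`, the fp16 dtype and the `checkpoint_dir` argument are assumptions for illustration.

```python
# Bit widths compared in this project. group_size is an assumed value.
HQQ_VARIANTS = {
    "8-bit": {"nbits": 8, "group_size": 64},
    "4-bit": {"nbits": 4, "group_size": 64},
    "3-bit": {"nbits": 3, "group_size": 64},
}


def load_quantized(checkpoint_dir, variant):
    """Load the fine-tuned classifier and quantize it on the fly with HQQ.

    checkpoint_dir is a hypothetical path to the trained, non-quantized model.
    """
    # Imported lazily: these packages are only needed when actually loading a model.
    import torch
    from transformers import AutoModelForSequenceClassification, HqqConfig

    quant_config = HqqConfig(**HQQ_VARIANTS[variant])
    return AutoModelForSequenceClassification.from_pretrained(
        checkpoint_dir,
        torch_dtype=torch.float16,
        device_map="cuda",
        quantization_config=quant_config,
    )
```

Keeping the variants in one dictionary makes it easy to loop over them and time each quantization run separately.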

### 3. Evaluation
These four models were evaluated and compared in terms of **accuracy**, **memory use**, **inference speed** and **emissions**. We used the test subset of the same dataset used for training.
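An evaluation loop of this kind can be sketched as follows. `predict_fn` is a hypothetical callable that takes a list of texts and returns one predicted label per text; in practice it would wrap the tokenizer and the model. The same function covers both the one-by-one case (`batch_size=1`) and the batched runs.

```python
import time


def batched(items, batch_size):
    """Yield consecutive slices of items with at most batch_size elements."""
    for start in range(0, len(items), batch_size):
        yield items[start:start + batch_size]


def timed_evaluation(predict_fn, texts, labels, batch_size=1):
    """Return (accuracy, elapsed seconds) for predict_fn over the dataset."""
    correct = 0
    start = time.perf_counter()
    for text_batch, label_batch in zip(batched(texts, batch_size), batched(labels, batch_size)):
        predictions = predict_fn(text_batch)
        correct += sum(int(p == y) for p, y in zip(predictions, label_batch))
    elapsed = time.perf_counter() - start
    return correct / len(texts), elapsed
```

Batching only changes how the samples are grouped, so a deterministic `predict_fn` should give the same accuracy for every batch size; only the elapsed time should differ.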



### Accuracy results
We now show the results obtained for the accuracy of each of the four models:

![Accuracy](./accuracy_HQQ.png)

First, accuracy remains the same across batch sizes for a given model. Looking at the results without batches, the 8-bit model is, surprisingly, slightly more accurate than the base model, although the two are very close. Of the other two models, the 4-bit variant is the more accurate. Specifically, the base model is just 0.14% less accurate than the 8-bit model, while the 4-bit and 3-bit models are 2.92% and 5.78% less accurate than the base model, respectively. Among the quantized models, the 4-bit model is 3.06% less accurate than the 8-bit model, and the 3-bit model is 5.91% less accurate than the 8-bit model and 2.94% less accurate than the 4-bit one.


### Memory use results

Measuring memory needs is not an easy task. We measured memory usage at two points of the pipeline, which gives two different metrics: the minimum memory the model needs (on disk) and the maximum memory needed during inference (on GPU). Note that the second measure tends to be higher because of the overhead of inference, especially when using batches.

The minimum memory needs for each model are:
- Base model: 24.61 GB
- 8-bit quantized: 7.27 GB
- 4-bit quantized: 4.26 GB
- 3-bit quantized: 3.66 GB

We can see how the quantized models use considerably less memory than the base model. In fact, the 8-bit quantized model uses 70.45% less memory than the base model, while the 4-bit variant uses 82.69% less memory and the 3-bit one uses 85.12% less. If we compare the quantized models, the 4-bit model uses 41.40% less memory than the 8-bit variant, while the 3-bit model uses 49.65% less than the 8-bit model and 14.08% less than the 4-bit variant. We now show a chart with these results:
![model_sizes](./model_sizes_HQQ.png)
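The relative savings quoted above follow from a simple formula, sketched here with the sizes from the list. Because the listed sizes are rounded to two decimals, recomputed percentages can differ from the quoted ones in the last digit.

```python
def relative_reduction(reference, value):
    """Percentage by which value is smaller than reference."""
    return 100.0 * (reference - value) / reference


# Model sizes in GB, as reported in the list above.
sizes_gb = {"base": 24.61, "8-bit": 7.27, "4-bit": 4.26, "3-bit": 3.66}

for name in ("8-bit", "4-bit", "3-bit"):
    saving = relative_reduction(sizes_gb["base"], sizes_gb[name])
    print(f"{name}: {saving:.2f}% smaller than the base model")
```

The same helper applies to the time and emissions comparisons in the following sections.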

Now we show the maximum memory needed for inference for each model and inference configuration:

- Base model: 24.75 GB
- 8-bit quantized: 8.01 GB
- 4-bit quantized: 4.95 GB
- 3-bit quantized: 4.40 GB
- 8-bit quantized with batch size 32: 13.29 GB
- 8-bit quantized with batch size 64: 18.70 GB
- 4-bit quantized with batch size 32: 10.23 GB
- 4-bit quantized with batch size 64: 15.64 GB
- 3-bit quantized with batch size 32: 9.67 GB
- 3-bit quantized with batch size 64: 15.08 GB

These results are shown in this chart:
![max_memory](./max_gpu_mem_HQQ.png)

We see how the maximum memory usage in GPU during inference is noticeably higher for all models when using batches. The bigger the batch size, the higher the memory needs. For example, the 8-bit model needs 39.70% less memory when not using batches compared to using batches with size 32, and 57.17% less when the batch size is 64.
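Peak GPU memory of this kind can be collected with PyTorch's CUDA memory statistics, as in the sketch below. The helper name and the `cuda` parameter (which allows passing a stand-in object on machines without a GPU) are illustrative assumptions, and the conversion assumes 1 GB = 1024³ bytes.

```python
def peak_gpu_memory_gb(fn, cuda=None):
    """Run fn() and return (result, peak allocated GPU memory in GB)."""
    if cuda is None:
        # Imported lazily so the helper can be defined without torch installed.
        import torch
        cuda = torch.cuda
    cuda.reset_peak_memory_stats()  # start from a clean peak counter
    result = fn()
    cuda.synchronize()  # make sure all queued GPU work has finished
    return result, cuda.max_memory_allocated() / 1024 ** 3
```

Wrapping the whole evaluation run in `fn` reports the maximum over all batches, which is why larger batch sizes show up directly in this metric.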

### Inference speed results

In this section we exclude quantization time from the comparison, although we comment on it below.

The evaluation time for each configuration was:
- Base model: 637.12 s
- 8-bit quantized: 2042.63 s
- 4-bit quantized: 2201.76 s
- 3-bit quantized: 2801.76 s
- 8-bit quantized with batch size 32: 590.55 s
- 8-bit quantized with batch size 64: 582.52 s
- 4-bit quantized with batch size 32: 593.61 s
- 4-bit quantized with batch size 64: 588.62 s
- 3-bit quantized with batch size 32: 610.79 s
- 3-bit quantized with batch size 64: 595.73 s

The quantization time for each quantized model was:
- 8-bit quantized: 71.12 s
- 4-bit quantized: 72.19 s
- 3-bit quantized: 119.24 s

Here we see the most surprising results of the whole report. The fastest model was, by far, the base model. Among the quantized variants, the 8-bit one was the fastest and the 3-bit one the slowest. The base model took 68.79% less time to run inference for the evaluation than the 8-bit model, 71.06% less time than the 4-bit model and 77.24% less than the 3-bit model. Comparing the 8-bit and 4-bit models, the 8-bit model took 7.21% less time. Compared with the 3-bit model, the 8-bit model took 27.14% less time and the 4-bit model 21.42% less. However, when using batches during inference, quantized models performed much better in terms of inference time. For example, the 8-bit model takes 71.48% less time when using batches of size 64.

If we take quantization times into consideration, the 8-bit and 4-bit models took similar times to quantize, with differences too small to be meaningful. However, the 3-bit variant took significantly more time to quantize. The 8-bit model took 40.34% less time to quantize than the 3-bit model, while the 4-bit model took 39.45% less time. The following graphs show these results:

![inference_time](./inference_time_HQQ.png)
![quantization_time](./quantization_time_HQQ.png)

These results are surprising because one would normally expect a smaller quantized model to run faster. However, this does not happen. In our earlier experiments with BitsAndBytes we attributed this to the nature of that quantization algorithm. After seeing the same behavior with GPTQ, and now with HQQ, we looked for the cause. We concluded that it happens because evaluation runs inference one sample at a time, without batches. These quantization methods convert the weights back to fp16 for inference, which adds significant overhead. Inference is therefore slower unless we take advantage of the smaller memory footprint and use larger batch sizes. The results obtained in this section for quantized models using batches during inference support this explanation.
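The reasoning above can be made concrete with a toy cost model: if every forward pass pays a fixed overhead (for example, dequantizing weights to fp16) on top of a per-sample cost, the overhead is paid once per batch rather than once per sample. All numbers below are hypothetical and chosen only for illustration; they are not measurements from this project.

```python
import math


def estimated_eval_time(n_samples, batch_size, per_call_overhead_s, per_sample_s):
    """Toy model: total time = fixed overhead per forward pass + per-sample cost."""
    n_calls = math.ceil(n_samples / batch_size)
    return n_calls * per_call_overhead_s + n_samples * per_sample_s


# Hypothetical values: 10,000 samples, 0.15 s overhead per call, 0.05 s per sample.
for batch_size in (1, 32, 64):
    total = estimated_eval_time(10_000, batch_size, 0.15, 0.05)
    print(f"batch size {batch_size:>2}: {total:.1f} s")
```

In this model the gain from going from 32 to 64 is much smaller than from 1 to 32, which is qualitatively similar to the measured times above, although the real behavior also depends on GPU utilization and sequence lengths.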

### Emissions results

Emissions are a little challenging to measure and compare, because we need to decide whether to take into account the emissions generated while quantizing the models. We report the emissions from evaluation separately from those of the quantization process, but take both into consideration when comparing. We measure emissions in kilograms of CO<sub>2</sub>-equivalents \[CO<sub>2</sub>eq\].
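A minimal sketch of how a stage (quantization or evaluation) can be wrapped with CodeCarbon so that each one gets its own emissions figure. The helper name and the optional `tracker` argument are illustrative assumptions; `EmissionsTracker.start()` and `stop()` are CodeCarbon's API, and `stop()` returns the emissions in kg CO<sub>2</sub>eq.

```python
def measure_emissions(fn, tracker=None):
    """Run fn() and return (result, emissions in kg CO2eq)."""
    if tracker is None:
        # Imported lazily so the helper can be defined without codecarbon installed.
        from codecarbon import EmissionsTracker
        tracker = EmissionsTracker()
    tracker.start()
    try:
        result = fn()
    finally:
        # Always stop the tracker, even if fn() raises.
        emissions_kg = tracker.stop()
    return result, emissions_kg
```

Calling it once around quantization and once around evaluation keeps the two figures separate, so they can be reported individually and added up when comparing models.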

The emissions generated by the evaluation process were:
- Base model: 0.017084
- 8-bit quantized: 0.046630
- 4-bit quantized: 0.050213
- 3-bit quantized: 0.063788
- 8-bit quantized with batch size 32: 0.013359
- 8-bit quantized with batch size 64: 0.013226
- 4-bit quantized with batch size 32: 0.013440
- 4-bit quantized with batch size 64: 0.013330
- 3-bit quantized with batch size 32: 0.013822
- 3-bit quantized with batch size 64: 0.013521

The emissions generated by quantization were:
- 8-bit quantized: 0.000960
- 4-bit quantized: 0.001004
- 3-bit quantized: 0.001606

If we focus on the evaluation, the 3-bit quantized model has the highest emissions. This is consistent with the inference times we saw earlier. The base model has 63.44% lower emissions than the 8-bit model, 65.98% lower than the 4-bit model and 73.28% lower than the 3-bit model. Among the quantized models, the 8-bit model has 7.28% lower emissions than the 4-bit model and 26.94% lower than the 3-bit model, while the 4-bit model has 21.41% lower emissions than the 3-bit variant. However, in line with the inference time results, quantized models generate much lower emissions when using batches during inference. For example, the 8-bit model generates 71.64% lower emissions when using batches of size 64.

Regarding quantization, the emissions for the 8-bit and 4-bit quantizations are very similar, and the differences between them are not significant. However, 8-bit quantization generates 40.17% lower emissions than 3-bit quantization, and 4-bit quantization generates 37.51% lower emissions.

We now show a couple of graphs to visualize these results:

![inference_emissions](./inference_emissions_GPTQ.png)
![quantization_emissions](./quantization_emissions_GPTQ.png)


## Conclusions

Results show that:

1. Quantization does reduce accuracy, and the lower the bit count, the larger the drop. However, the drop with **HQQ** is small, and the 8-bit quantized model even scores slightly higher than the base model. Accuracy remains close to the base model results.
2. Quantized models are noticeably better than the base model in terms of memory usage. However, using batches can raise memory needs considerably.
3. Contrary to what would be expected, models quantized with **HQQ** are not faster than the base model when evaluated one sample at a time. This is probably due to the nature of this quantization method, which restores the values to fp16 for inference; with a "one by one" evaluation (no batches), this overhead makes the model slower. Using batches reduces inference time sharply. However, more study and experimentation are needed in this area.
4. As with inference time, emissions are higher for the quantized models when evaluating without batches. This is consistent with the higher inference time. However, the differences here are smaller than the differences in speed, probably due to the size of the models.

