
# A quantization, usage and evaluation pipeline for the LLaMA 2 7B model with GPTQ

In this project, we evaluate the impact of quantization on a commonly used LLM, in our case LLaMA 2 7B, on a sentiment analysis task. Our pipeline is defined as follows:
- Add a classification layer on top of the non-quantized model and train this layer on our dataset.
- Quantize the model using **GPTQ** in three variants: 8-bit, 4-bit and 3-bit.
- Compare the four models (non-quantized, 8-bit, 4-bit and 3-bit) in terms of accuracy, memory use, inference speed and emissions.

![Pipeline Diagram](./GPTQ_Pipeline.png)

We follow this approach because it is the simplest one. An alternative would be to quantize the base model first and then add and train the classification layer. This may even look like the better option. In practice, however, it causes many problems, because fine-tuning an already quantized model is not good practice.

### Goals
1. Analyze the impact of quantization on performance.
2. Compare the efficiency and accuracy of the quantized models with the non-quantized model.


## GPTQ

GPTQ is a post-training quantization method designed for large generative pre-trained transformer models (the name comes from the paper *GPTQ: Accurate Post-Training Quantization for Generative Pre-trained Transformers*). It was developed to address the need for efficient deployment of these models without a significant loss in accuracy. Because it is applied after training, it needs no retraining or fine-tuning, only a small calibration dataset. GPTQ reduces the precision of the model weights while preserving the behavior of the original model as far as possible, which makes it well suited to scenarios where memory and computational resources are limited.

The core idea behind GPTQ is to use approximate second-order information rather than gradients. For each linear layer, it looks for quantized weights that minimize the difference between the outputs of the original layer and those of the quantized layer on the calibration data. Unlike naive round-to-nearest quantization, which rounds every weight independently, GPTQ uses the Hessian of this layer-wise error to decide how the rounding error of one weight should be absorbed by the others. All weights of the quantized layers are stored at the same low bit width (8, 4 or 3 bits in our case), while activations remain in 16-bit floating point.

A key feature of GPTQ is that it quantizes incrementally. It processes the model layer by layer and, within each layer, quantizes the weights a few columns at a time, updating the weights that are not yet quantized to compensate for the error introduced so far. This lets the method correct quantization-induced errors as it goes. The result is a model with a significantly smaller memory footprint that retains most of its predictive performance.

In summary, GPTQ combines low-precision weight storage with second-order error compensation, allowing large models to run in constrained environments with little loss in accuracy. This makes it a widely used tool for deploying large language models in real-world applications.
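
The way quantization error can be adjusted for is easier to see on a toy case. The sketch below is not the GPTQ implementation: it assumes a layer with only two weights, made-up calibration inputs and a deliberately coarse grid, all of which are hypothetical. It rounds the first weight and then shifts the second one using second-order statistics of the inputs, which is the kind of compensation step GPTQ builds on.

```python
def quantize(value, step):
    """Round a weight to the nearest multiple of the grid step."""
    return round(value / step) * step


def layer_error(w, w_hat, inputs):
    """Sum of squared differences between original and modified layer outputs."""
    return sum(
        (sum(a * x for a, x in zip(w, xs)) - sum(b * x for b, x in zip(w_hat, xs))) ** 2
        for xs in inputs
    )


# Hypothetical calibration inputs (two correlated features) and weights.
inputs = [(1.0, 0.9), (0.5, 0.6), (-1.0, -0.8), (2.0, 1.7)]
w = (0.37, -0.21)
step = 0.25  # coarse grid, chosen only to make the effect visible

# Entries of H = sum_i x_i x_i^T that the update needs.
h12 = sum(x1 * x2 for x1, x2 in inputs)
h22 = sum(x2 * x2 for _, x2 in inputs)

q1 = quantize(w[0], step)
err1 = q1 - w[0]  # error introduced by rounding the first weight

# Without compensation: the second weight is left untouched.
plain = layer_error(w, (q1, w[1]), inputs)

# With compensation: shift the second weight by -err1 * H12 / H22, the value
# that minimizes the output error once the first weight is fixed at q1.
compensated = layer_error(w, (q1, w[1] - err1 * h12 / h22), inputs)

print(f"error without compensation: {plain:.6f}")
print(f"error with compensation:    {compensated:.6f}")
```

In a real layer the second weight would be quantized in turn, and the same update would be applied across many weights at once, so this only illustrates the direction of the correction.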

## Hardware and software

### Hardware
- CPU: Intel(R) Xeon(R) Silver 4316 CPU @ 2.30GHz
- RAM: 235 GB
- GPU: NVIDIA A100-PCIE-40GB
### Software
For this project, we used the following Python libraries, among others: torch, transformers and datasets, plus the standard-library time module. Their dependencies are required as well.
We also used additional tools to evaluate the results: the CodeCarbon library to collect emissions data, and the Weights and Biases (Wandb) platform to collect and visualize metrics.


## Project pipeline

### 1. Base model preparation
The base model is **LLaMA 2 (7B)**. We added a classification layer by loading the model with the `LlamaForSequenceClassification` class from the Transformers library. Only this layer was trained, with all other layers frozen, using our chosen dataset, **TweetEval**. To demonstrate that this training does not suffer from overfitting, we show the training and evaluation losses:
![Training vs Evaluation Loss](./overfitting.png)
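
Freezing everything except the classification layer could look roughly like the sketch below. It is not the project's training code: the helper name `freeze_all_but_head` is hypothetical, and it relies on the classification head of `LlamaForSequenceClassification` being exposed under the parameter name prefix `score`. The model id and label count in the commented usage are assumptions as well.

```python
def freeze_all_but_head(model, head_prefix="score"):
    """Freeze every parameter except those whose name starts with head_prefix.

    Works on any object exposing named_parameters(), such as a torch module.
    Returns the names of the parameters that remain trainable.
    """
    trainable = []
    for name, param in model.named_parameters():
        # Only the classification head keeps requires_grad=True.
        param.requires_grad = name.startswith(head_prefix)
        if param.requires_grad:
            trainable.append(name)
    return trainable


# Assumed usage (model id and num_labels are assumptions, not project settings):
#   from transformers import LlamaForSequenceClassification
#   model = LlamaForSequenceClassification.from_pretrained(
#       "meta-llama/Llama-2-7b-hf", num_labels=3
#   )
#   print(freeze_all_but_head(model))
```

Checking the returned list before training is a cheap way to confirm that only the new layer will receive updates.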

### 2. Quantization 
Quantization was performed using **GPTQ** at three bit widths: **8 bits**, **4 bits** and **3 bits**.
This gives us four different models: **quantized to 8 bits**, **quantized to 4 bits**, **quantized to 3 bits** and **not quantized**.

### 3. Evaluation
These four models were evaluated and compared in terms of **accuracy**, **memory use**, **inference speed** and **emissions**. We used the test split of the same dataset used for training.
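
A minimal sketch of such an evaluation loop is shown below, assuming one sample is classified at a time. The `evaluate` helper, the stand-in classifier and the toy samples are hypothetical and only there so the example runs on its own; in practice `predict` would wrap a call to the model.

```python
import time


def evaluate(predict, samples):
    """Classify one sample at a time; return (accuracy, elapsed seconds)."""
    correct = 0
    start = time.perf_counter()
    for text, label in samples:
        if predict(text) == label:
            correct += 1
    elapsed = time.perf_counter() - start
    return correct / len(samples), elapsed


# Stand-in classifier and data (hypothetical), so the sketch is self-contained.
def toy_predict(text):
    return 2 if "good" in text else 0


toy_samples = [("good day", 2), ("bad day", 0), ("so good", 2), ("meh", 1)]
accuracy, seconds = evaluate(toy_predict, toy_samples)
print(f"accuracy={accuracy:.2f}, time={seconds:.4f}s")
```

Emissions could be collected around the same call, for instance by starting a CodeCarbon `EmissionsTracker` before it and stopping it afterwards.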



### Accuracy results
We now show the accuracy obtained by each of the four models:

![Accuracy](./accuracy_all_GPTQ.png)

The base model has the highest accuracy and, among the quantized models, the 8-bit variant is the most accurate, although the differences are minimal (except for the 3-bit model). The 8-bit quantized model is just 0.13% less accurate than the base model, while the 4-bit variant is 0.79% less accurate and the 3-bit model 7.77% less. Among the quantized models, the 4-bit model is 0.67% less accurate than the 8-bit model, while the 3-bit model is 7.64% less accurate than the 8-bit model and 7.01% less than the 4-bit one.


### Memory use results

While running the four models, we checked the size of each one. The results were as expected:
- Base model: 26.43 GB
- 8-bit quantized: 6.90 GB
- 4-bit quantized: 3.63 GB
- 3-bit quantized: 2.82 GB

We can see how the quantized models use considerably less memory than the base model. In fact, the 8-bit quantized model uses 73.91% less memory than the base model, while the 4-bit variant uses 86.27% less memory and the 3-bit one uses 89.33% less. If we compare the quantized models, the 4-bit model uses 47.39% less memory than the 8-bit variant, while the 3-bit model uses 59.13% less than the 8-bit model and 22.32% less than the 4-bit variant. We now show a chart with these results:
![model_sizes](./model_sizes_GPTQ.png)
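
The relative reductions can be recomputed from the sizes listed above with a few lines of Python. The `reduction` helper is only illustrative, and because the listed sizes are rounded to two decimals, the recomputed values may differ from the reported percentages in the last digit.

```python
# Model sizes in GB, as listed above.
sizes_gb = {"base": 26.43, "8-bit": 6.90, "4-bit": 3.63, "3-bit": 2.82}


def reduction(new, reference):
    """Percentage by which `new` is smaller than `reference`."""
    return (1 - new / reference) * 100


for name in ("8-bit", "4-bit", "3-bit"):
    saved = reduction(sizes_gb[name], sizes_gb["base"])
    print(f"{name}: {saved:.2f}% smaller than the base model")
```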

### Inference speed results

In this section we decided not to include quantization time in the comparison, although we comment on it below.

The evaluation time for each of the four models was:
- Base model: 641.39 s
- 8-bit quantized: 2242.21 s
- 4-bit quantized: 2017.88 s
- 3-bit quantized: 3736.60 s

The quantization time for each quantized model was:
- 8-bit quantized: 12876.53 s
- 4-bit quantized: 12974.85 s
- 3-bit quantized: 12883.12 s

Here we see the most surprising results of the whole report. The fastest model was, by far, the base model, and among the quantized variants the 4-bit one was the fastest. Compared to the base model, the 8-bit model took 249.49% more time to run inference for the evaluation, the 4-bit model took 214.65% more and the 3-bit model took 482.57% more. Among the quantized models, the 8-bit model took 11.10% more time than the 4-bit model, and the 3-bit model took 66.68% more time than the 8-bit model and 85.07% more than the 4-bit model.

If we take quantization times into consideration, we see that all three models took a similar time to quantize, with the 8-bit variant being slightly faster. However, quantization takes a very long time, especially compared with the results we obtained with BitsAndBytes. We do not think it makes sense to compare the base model with the quantized models after adding quantization time, since the difference would obviously be huge. The following graphs show these results:

![inference_time](./inference_time_GPTQ.png)
![quantization_time](./quantization_time_GPTQ.png)

These results are surprising because one would normally expect the smaller, quantized models to run faster, which is not the case. Previously, with BitsAndBytes, we attributed this to the nature of the BitsAndBytes quantization algorithm. After seeing the same behavior with GPTQ, we looked for another explanation. We concluded that it happens because we run inference during evaluation without batches, one sample at a time. Both quantization methods convert the weights back to fp16 for inference, which adds significant overhead. Without larger batch sizes that take advantage of the lower bit width, this leads to a higher inference time.
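
Grouping the evaluation samples into batches is straightforward, as the sketch below shows. The `batched` helper and the sample texts are hypothetical, and whether batching actually closes the speed gap was not measured in this project.

```python
def batched(samples, batch_size):
    """Yield consecutive lists of at most `batch_size` samples."""
    for start in range(0, len(samples), batch_size):
        yield samples[start:start + batch_size]


# Hypothetical inputs: ten texts split into batches of four.
texts = [f"tweet {i}" for i in range(10)]
batches = list(batched(texts, 4))
print([len(batch) for batch in batches])  # [4, 4, 2]
```

Each batch would then be tokenized with padding and passed to the model in a single forward call instead of one call per sample.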

### Emissions results

Emissions are somewhat challenging to measure and compare, because we need to decide whether to take into account the emissions generated while quantizing the models. We report the results obtained for evaluation separately from the emissions of the quantization process, but take both into consideration when comparing. Emissions are measured in kilograms of CO<sub>2</sub>-equivalents \[CO<sub>2</sub>eq\].

The emissions generated by the evaluation process were:
- Base model: 0.013502
- 8-bit quantized: 0.044686 
- 4-bit quantized: 0.040202
- 3-bit quantized: 0.074038

The emissions generated by quantization were:
- 8-bit quantized: 0.133476
- 4-bit quantized: 0.132307
- 3-bit quantized: 0.132594

If we focus on the evaluation, we see that the 3-bit quantized model is the one with the highest emissions. This is consistent with the inference times we saw earlier. The 8-bit model has 230.93% higher emissions than the base model, while the 4-bit and 3-bit variants have 197.72% and 448.42% higher emissions, respectively. Among the quantized models, the 8-bit model has 11.15% higher emissions than the 4-bit model, while the 3-bit model has 65.68% higher emissions than the 8-bit model and 84.18% higher emissions than the 4-bit variant.

Quantization, in line with the long times reported above, produces a larger amount of emissions. The emissions of the three quantization runs are very similar, and the differences between them are negligible. However, quantization emissions are much larger than the ones for BitsAndBytes. For example, quantizing the 8-bit model produces 152754.24% more emissions than the equivalent BitsAndBytes model.

We now show a couple of graphs to visualize these results:

![inference_emissions](./inference_emissions_GPTQ.png)
![quantization_emissions](./quantization_emissions_GPTQ.png)
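
To see the effect of including quantization in the comparison, the two sets of figures listed above can be added per model. This is only a small arithmetic sketch over the reported values. The base model has no quantization step, so its total is its evaluation emissions alone.

```python
# Emissions in kg CO2eq, as listed above.
evaluation = {"base": 0.013502, "8-bit": 0.044686, "4-bit": 0.040202, "3-bit": 0.074038}
quantization = {"8-bit": 0.133476, "4-bit": 0.132307, "3-bit": 0.132594}

# Total per model: evaluation plus quantization (zero for the base model).
totals = {name: evaluation[name] + quantization.get(name, 0.0) for name in evaluation}

for name, total in totals.items():
    ratio = total / totals["base"]
    print(f"{name}: {total:.6f} kg CO2eq ({ratio:.1f}x the base model)")
```

Quantization is a one-time cost, so its weight in the total shrinks as the quantized model is used for more inference.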


## Conclusions

Results show that:

1. Quantization does reduce accuracy, and the lower the bit count, the larger the drop. However, the drop with **GPTQ** is small for the 8-bit and 4-bit variants, whose accuracy remains close to that of the base model. Only the 3-bit variant loses a noticeable amount.
2. Quantized models use considerably less memory than the base model.
3. Contrary to what would be expected, models quantized with **GPTQ** are not faster than the base model. This is probably due to the nature of this quantization method, which restores the values to fp16 for inference: since we evaluate one sample at a time (without batches), this overhead makes the model slower. However, more study and experimentation are needed in this area.
4. As with inference time, emissions are higher in the quantized models, which is consistent with their longer inference time. However, the differences here are smaller than the differences in speed, probably due to the size of the models.

