
# A quantization, inference and evaluation pipeline featuring the LLaMA 2 7B model and GPTQ

In this project, we evaluate the impact of quantization on a widely used LLM, in our case LLaMA 2 7B, by evaluating it on a sentiment analysis task. Our pipeline is defined as follows:
- Add a classification layer at the end of the non-quantized model and train this layer with our dataset.
- Quantize the model using **GPTQ** in three variants: 8-bit, 4-bit and 3-bit quantization.
- Compare the performance of our models (non-quantized, 8-bit, 4-bit and 3-bit) in terms of accuracy, memory use, inference speed and emissions.

![Pipeline Diagram](./GPTQ_Pipeline.png)

We follow this approach because it is the simplest one to implement. Another option would be to quantize the base model first and then add the classification layer and train it. One could argue that this alternative is better than the one described above. In practice, however, it runs into many problems, because fine-tuning an already quantized model is not good practice.

### Goals
1. Analyze the impact quantization has on performance.
2. Compare the efficiency and accuracy of the quantized models with the non-quantized model.


## GPTQ

GPTQ is a post-training quantization (PTQ) method designed for large generative language models, introduced in the paper *GPTQ: Accurate Post-Training Quantization for Generative Pre-trained Transformers*. It was developed to address the need for efficient deployment of these models without significant loss in accuracy. Because quantization is applied after training, no retraining or fine-tuning is needed: GPTQ reduces the precision of the model weights while preserving the behavior of the original model as closely as possible, which makes it well suited to scenarios where memory and computational resources are limited.

The core idea behind GPTQ is to guide quantization with approximate second-order information instead of rounding each weight in isolation. Naive round-to-nearest quantization treats all weights independently; GPTQ instead takes into account how much each weight influences the output of its layer. All weights of a layer are stored in the same low-precision format (such as 8-bit, 4-bit or 3-bit integers), and accuracy is preserved by adjusting the weights that have not been quantized yet so that they compensate for the error introduced by the ones that have.

The sensitivity of a layer to changes in its weights is measured with a second-order approximation based on the Hessian matrix. The metric being minimized is the squared error of the layer output, $\|WX - \hat{W}X\|^2$, where $W$ represents the original weights, $\hat{W}$ the quantized weights, and $X$ the layer inputs. The Hessian of this error depends only on the layer inputs $X$ and is used to estimate each weight's contribution to the overall error. The original algorithm processes the weights in a fixed order; some implementations can optionally process the most influential input columns first (often called activation ordering).
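
As a compact reference, the layer-wise objective and the compensation step can be written as follows. This follows the formulation used in the GPTQ paper; $w$ (one row of $W$), $q$ (the index of the weight being quantized) and $\delta$ (the update applied to the not-yet-quantized weights of that row) are symbols introduced here for illustration.

$$
\min_{\hat{W}} \; \|WX - \hat{W}X\|^2, \qquad H = 2XX^{\top}
$$

$$
\delta = -\,\frac{w_q - \mathrm{quant}(w_q)}{[H^{-1}]_{qq}} \,(H^{-1})_{:,q}
$$

The Hessian $H$ is the same for every row of $W$ because it depends only on the inputs $X$, which is what makes the procedure affordable for large layers.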

Weights are then processed through an iterative and adaptive procedure that adjusts the remaining weights to compensate for the error introduced by quantization. This treatment involves the following steps:
1. Adaptive Quantization: Each weight is quantized to the desired precision level. Then, the remaining (not yet quantized) weights are updated to minimize the cumulative error caused by the quantization of the processed weight.
2. Global Approach: Instead of separating weights into categories of high and low impact, GPTQ adopts a holistic strategy, optimizing all weight interactions within the same layer.
3. Block-wise Processing: In large-scale models, weights are processed in blocks of columns. Updates are applied immediately inside the current block and propagated to the rest of the matrix only once the block is finished, which significantly reduces the cost of processing very large matrices.
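
The steps above can be sketched for a single weight row in plain Python. This is a minimal illustration under simplifying assumptions, not the actual GPTQ implementation: it uses an explicit matrix inverse instead of the Cholesky-based formulation, a single symmetric quantization grid, and no block-wise batching. All function names (`invert`, `hessian`, `round_to_grid`, `quantize_row`) are hypothetical.

```python
def invert(matrix):
    """Invert a small square matrix with Gauss-Jordan elimination."""
    n = len(matrix)
    # Build the augmented matrix [A | I].
    aug = [[float(v) for v in row] + [1.0 if i == j else 0.0 for j in range(n)]
           for i, row in enumerate(matrix)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(aug[r][col]))
        if abs(aug[pivot][col]) < 1e-12:
            raise ValueError("matrix is singular; add damping to the diagonal")
        aug[col], aug[pivot] = aug[pivot], aug[col]
        p = aug[col][col]
        aug[col] = [v / p for v in aug[col]]
        for r in range(n):
            if r != col:
                f = aug[r][col]
                aug[r] = [a - f * b for a, b in zip(aug[r], aug[col])]
    return [row[n:] for row in aug]


def hessian(X):
    """H = 2 X X^T, with X given as d rows (input features) by n columns (samples)."""
    return [[2.0 * sum(a * b for a, b in zip(xi, xj)) for xj in X] for xi in X]


def round_to_grid(value, scale, bits):
    """Round to the nearest level of a signed integer grid with the given step."""
    qmax = 2 ** (bits - 1) - 1
    level = max(-qmax - 1, min(qmax, round(value / scale)))
    return level * scale


def quantize_row(weights, h_inv, scale, bits):
    """Quantize one row left to right, compensating the remaining weights."""
    w = list(weights)
    h = [row[:] for row in h_inv]  # working copy of the inverse Hessian
    d = len(w)
    quantized = [0.0] * d
    for i in range(d):
        quantized[i] = round_to_grid(w[i], scale, bits)
        err = (w[i] - quantized[i]) / h[i][i]
        # Shift the not-yet-quantized weights to absorb the rounding error.
        for j in range(i + 1, d):
            w[j] -= err * h[i][j]
        # Remove weight i from the inverse Hessian of the remaining weights.
        for a in range(i + 1, d):
            for b in range(i + 1, d):
                h[a][b] -= h[a][i] * h[i][b] / h[i][i]
    return quantized


# Toy usage: two correlated input features observed on three samples.
X = [[1.0, 2.0, 3.0], [1.1, 1.9, 3.2]]
print(quantize_row([0.30, -0.64], invert(hessian(X)), scale=0.25, bits=4))
```

On this toy input the second weight ends up at -0.5 rather than the -0.75 that plain rounding would give, because it was shifted to absorb the rounding error of the first weight. When the inputs are uncorrelated (diagonal Hessian) no compensation takes place and the result coincides with round-to-nearest.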

A key feature of GPTQ is that it proceeds layer by layer. Instead of quantizing the whole model at once, GPTQ quantizes one layer at a time, recomputing the layer inputs (and therefore the Hessian) from the calibration data at each step; no gradients or backpropagation are involved. This allows the method to account for quantization-induced errors as they accumulate and helps the final model remain robust. The process results in a model with a significantly smaller memory footprint that retains most of its predictive performance. Layer-by-layer compression in GPTQ relies on an independent optimization problem for each layer, which keeps the impact of quantizing one layer from propagating uncontrollably to the rest of the model. The procedure follows these steps:
1. Input and Output of Each Layer: A small dataset is used to calculate the inputs (X) and outputs for each layer of the model. These inputs and outputs serve as a reference to assess the quantization impact.
2. Local Optimization: For each layer, an optimization problem is solved to minimize the difference between the original output (with full-precision weights) and the output generated with quantized weights. This ensures each layer is locally adjusted to preserve the model’s overall accuracy.
3. Sequential Processing: The model is quantized sequentially, layer by layer. The inputs to each layer are updated using the outputs from already quantized layers. This approach ensures efficient quantization while maintaining consistency across the model.
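
The sequential procedure above can be sketched as a short loop. This is only an illustration under strong assumptions: layers are plain linear maps without activations, and `round_to_nearest` is a hypothetical stand-in for the per-layer optimization, which in GPTQ would use the calibration inputs it receives.

```python
def matvec(weights, x):
    """Apply a linear layer (list of rows) to an input vector."""
    return [sum(w * v for w, v in zip(row, x)) for row in weights]


def round_to_nearest(weights, calib_inputs, scale=0.25):
    """Placeholder per-layer quantizer; a real one would make use of calib_inputs."""
    return [[round(w / scale) * scale for w in row] for row in weights]


def quantize_sequentially(layers, calib_inputs, quantize_layer=round_to_nearest):
    """Quantize layers in order, feeding each one the outputs of the quantized previous layers."""
    quantized_layers = []
    activations = [list(x) for x in calib_inputs]
    for weights in layers:
        # Step 2: solve the local problem using the inputs this layer will really see.
        q = quantize_layer(weights, activations)
        quantized_layers.append(q)
        # Step 3: propagate the calibration data through the quantized layer.
        activations = [matvec(q, x) for x in activations]
    return quantized_layers


layers = [[[0.3, -0.6], [0.9, 0.1]], [[1.2, -0.4]]]
print(quantize_sequentially(layers, calib_inputs=[[1.0, 2.0], [0.5, -1.0]]))
```

Propagating the activations through the already quantized layers, rather than the original ones, means each layer is calibrated against the inputs it will actually receive at inference time.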

Finally, inference with models quantized using GPTQ incorporates key features to optimize performance:
1. Dynamic De-quantization: Weights stored in low precision (e.g., 3 or 4 bits) are dynamically converted to higher precision during runtime. This enables accurate computations without significantly increasing storage requirements.
2. Optimized Kernels: Custom kernels are used to perform matrix-vector multiplications, where the matrix is quantized, and the vector remains in high precision. These kernels are designed to minimize memory movement, resulting in significant acceleration of inference time.
3. Memory Reduction: Quantization significantly reduces the memory required to store model weights. For example, the GPTQ paper reports that a 3-bit quantized 175B-parameter model (OPT-175B) can be executed on a single 80 GB NVIDIA A100 GPU, whereas the non-quantized version requires multiple GPUs.
4. Efficiency in Generative Tasks: In tasks like text generation, where the model processes one token at a time, this strategy improves inference speed by reducing the time needed to access and process weights.
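
Dynamic de-quantization can be illustrated with a small sketch. The storage layout below (two unsigned 4-bit levels per byte, one scale and zero point for the whole list) is an assumption chosen for readability; real GPTQ kernels use their own packed layouts and typically keep separate scales and zero points per group of weights.

```python
def pack_int4(levels):
    """Pack unsigned 4-bit integers (0..15) two per byte, low nibble first."""
    levels = list(levels)
    if len(levels) % 2:
        levels.append(0)  # pad to an even count
    packed = bytearray()
    for lo, hi in zip(levels[0::2], levels[1::2]):
        if not (0 <= lo <= 15 and 0 <= hi <= 15):
            raise ValueError("levels must fit in 4 bits")
        packed.append(lo | (hi << 4))
    return bytes(packed)


def unpack_int4(packed, count):
    """Recover the first `count` 4-bit levels from packed bytes."""
    levels = []
    for byte in packed:
        levels.append(byte & 0x0F)
        levels.append(byte >> 4)
    return levels[:count]


def dequantize(levels, scale, zero_point):
    """Map integer levels back to real-valued weights at runtime."""
    return [scale * (q - zero_point) for q in levels]


stored = pack_int4([1, 15, 7, 0])  # four weights held in two bytes
weights = dequantize(unpack_int4(stored, 4), scale=0.5, zero_point=8)
print(len(stored), weights)
```

Only the packed bytes, the scale and the zero point need to be kept in memory; the higher-precision values exist just for the duration of the computation.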

In summary, GPTQ combines the efficiency of low-precision weights with Hessian-based error compensation, allowing large models to operate in constrained environments with only a small loss in accuracy. This makes it a valuable tool for deploying modern large language models in real-world applications.

## Hardware and software

### Hardware
- CPU: Intel(R) Xeon(R) Silver 4316 CPU @ 2.30GHz
- RAM: 235 GB
- GPU: NVIDIA A100-PCIE-40GB
### Software
For this project, we used the following Python libraries, among others: torch, transformers and datasets, together with the standard-library time module. The dependencies of these libraries are required as well.
We also used other libraries and tools that help to evaluate the results. For example, the CodeCarbon library was used to collect emissions data, and the Weights and Biases (Wandb) platform was used to collect and visualize metrics.


## Project pipeline

### 1. Base model preparation
The base model used is **LLaMA 2 (7B)**. We added a classification layer to it by loading it with the `LlamaForSequenceClassification` class from the Transformers library. Only this layer was trained, with all the other layers frozen, using our chosen dataset, **TweetEval**. To demonstrate that this training does not suffer from overfitting, we show the training and evaluation losses here:
![Training vs Evaluation Loss](./overfitting.png)
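
A minimal sketch of this setup is shown below; it is not necessarily the exact code used in the project. It assumes the Hugging Face checkpoint name `meta-llama/Llama-2-7b-hf`, three labels (as in the TweetEval sentiment subset), and that the classification head is exposed under the parameter prefix `score`, which is the name used by `LlamaForSequenceClassification`. The helper names are hypothetical.

```python
def freeze_all_but_head(model, head_prefix="score"):
    """Leave only parameters whose name starts with head_prefix trainable."""
    trainable = []
    for name, param in model.named_parameters():
        param.requires_grad = name.startswith(head_prefix)
        if param.requires_grad:
            trainable.append(name)
    return trainable


def load_classifier(model_id="meta-llama/Llama-2-7b-hf", num_labels=3):
    """Load LLaMA 2 with a classification head and freeze the base model."""
    # Imported lazily so the helper above can be used without transformers installed.
    from transformers import LlamaForSequenceClassification

    model = LlamaForSequenceClassification.from_pretrained(
        model_id, num_labels=num_labels
    )
    freeze_all_but_head(model)
    return model
```

Calling `load_classifier()` downloads the full checkpoint, so it is best run on the target machine. The list returned by `freeze_all_but_head` is a quick way to confirm that only the head will be updated during training.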

### 2. Quantization 
Quantization was performed using **GPTQ**, in three variants: **8 bits**, **4 bits** and **3 bits**.
This gives us four different models: **quantized to 8 bits**, **quantized to 4 bits**, **quantized to 3 bits** and **not quantized**.

### 3. Evaluation
These four models were evaluated and compared in terms of **accuracy**, **memory use**, **inference speed** and **emissions**. The dataset was the same one used for training, but evaluation used its test subset.



### Accuracy results
We now show the results obtained for the accuracy of each of the four models:

![Accuracy](./accuracy_GPTQ.png)

First, apart from negligible differences, accuracy stays the same across batch sizes for a given model. Looking at the results for each model without batches, the base model has the highest accuracy and, among the quantized models, the 8-bit variant is the most accurate, although the differences are minimal (except for the 3-bit model). In fact, the 8-bit quantized model is only 0.11% less accurate than the base model, while the 4-bit variant is 0.74% less accurate and the 3-bit model is 7.47% less accurate. Comparing the quantized models with each other, the 4-bit model is 0.62% less accurate than the 8-bit model, while the 3-bit model is 7.39% less accurate than the 8-bit model and 6.79% less accurate than the 4-bit one.


### Memory use results

Measuring memory needs is not an easy task. We decided to check the memory usage at two points of the pipeline, which gives two different memory metrics: the minimum memory needed to store the model (on disk) and the maximum memory needed during inference (on the GPU). Note that this second measure tends to be higher because of the additional memory that inference itself requires, especially when using batches.

The minimum memory needs for each model are:
- Base model: 24.61 GB
- 8-bit quantized: 6.42 GB
- 4-bit quantized: 3.38 GB
- 3-bit quantized: 2.62 GB

The quantized models use considerably less memory than the base model. In fact, the 8-bit quantized model uses 73.91% less memory than the base model, while the 4-bit variant uses 86.27% less and the 3-bit one uses 89.35% less. Among the quantized models, the 4-bit model uses 47.35% less memory than the 8-bit variant, while the 3-bit model uses 59.19% less than the 8-bit model and 22.49% less than the 4-bit variant. The following chart shows these results:
![model_sizes](./model_sizes_GPTQ.png)
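
The relative savings quoted above follow directly from the listed sizes. The short sketch below recomputes them; the helper name `percent_less` is ours, and the sizes are the ones reported in this section.

```python
# Model sizes reported above, in GB.
SIZES_GB = {"base": 24.61, "8-bit": 6.42, "4-bit": 3.38, "3-bit": 2.62}


def percent_less(smaller, larger):
    """Relative reduction of `smaller` with respect to `larger`, in percent."""
    return (1.0 - smaller / larger) * 100.0


for name in ("8-bit", "4-bit", "3-bit"):
    saving = percent_less(SIZES_GB[name], SIZES_GB["base"])
    print(f"{name}: {saving:.2f}% less memory than the base model")
```

The same function applies to any pair of configurations, for example `percent_less(SIZES_GB["4-bit"], SIZES_GB["8-bit"])` for the comparison between quantized variants.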

Now we show the maximum memory needed for inference for each model and inference configuration:

- Base model: 24.75 GB
- 8-bit quantized: 6.8 GB
- 4-bit quantized: 3.67 GB
- 3-bit quantized: 3.16 GB
- 8-bit quantized with batch size 32: 9.45 GB
- 8-bit quantized with batch size 64: 12.15 GB
- 4-bit quantized with batch size 32: 6.31 GB
- 4-bit quantized with batch size 64: 9.03 GB
- 3-bit quantized with batch size 32: 5.8 GB
- 3-bit quantized with batch size 64: 8.51 GB

These results are shown in this chart:
![max_memory](./max_gpu_mem_GPTQ.png)

The maximum GPU memory usage during inference is noticeably higher for all quantized models when using batches. The bigger the batch size, the higher the memory needs. For example, the 8-bit model needs 28.04% less memory when not using batches than when using batches of size 32, and 44.02% less than when the batch size is 64.

### Inference speed results

In this section, quantization time is not included in the comparison, although we comment on it below.

The evaluation time for each configuration was:
- Base model: 643.51 s
- 8-bit quantized: 2236.21 s
- 4-bit quantized: 2016.59 s
- 3-bit quantized: 3711.95 s
- 8-bit quantized with batch size 32: 131.32 s
- 8-bit quantized with batch size 64: 97.19 s
- 4-bit quantized with batch size 32: 123.54 s
- 4-bit quantized with batch size 64: 92.92 s
- 3-bit quantized with batch size 32: 180.70 s
- 3-bit quantized with batch size 64: 120.29 s

The quantization time for each quantized model was:
- 8-bit quantized: 2667.34 s
- 4-bit quantized: 2880.08 s
- 3-bit quantized: 2548.65 s

Here we see the most surprising results of the whole report. Without batches, the fastest model was, by far, the base model, and among the quantized variants the 4-bit one was the fastest. In fact, the base model took 71.22% less time to run inference for the evaluation than the 8-bit model, 68.08% less time than the 4-bit model and 82.66% less than the 3-bit model. Comparing the 8-bit and 4-bit models, the 4-bit model took 9.83% less time. Compared with the 3-bit model, the 8-bit model took 39.75% less time and the 4-bit model 45.69% less. However, when using batches during inference, the quantized models performed much better in terms of inference time. For example, the 8-bit model takes 95.65% less time when using batches of size 64.

Regarding quantization times, all three models took a similar time to be quantized, with the 3-bit variant being slightly faster. The following graphs show these results:

![inference_time](./inference_time_GPTQ.png)
![quantization_time](./quantization_time_GPTQ.png)

These results are surprising because one would normally expect a smaller quantized model to run faster. However, this does not happen. Previously, with BitsAndBytes, we attributed this to the nature of the BitsAndBytes quantization algorithm. After seeing the same behavior with GPTQ, we investigated further and concluded that it happens because inference during evaluation was run without batches, one sample at a time. Both quantization methods convert the weights back to fp16 for inference, which adds a significant overhead. This leads to a higher inference time unless we take advantage of the lower bit width and use larger batch sizes. The results obtained in this section for quantized models using batches during inference seem to support this.
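
The effect of batching on the number of forward passes can be sketched as follows. This is a generic timing harness, not the evaluation code of the project: `predict_batch` stands for any hypothetical function that maps a list of samples to a list of predictions, and the helper names are ours.

```python
import time


def iter_batches(samples, batch_size):
    """Yield consecutive slices of at most batch_size samples."""
    for start in range(0, len(samples), batch_size):
        yield samples[start:start + batch_size]


def timed_inference(predict_batch, samples, batch_size):
    """Run predict_batch over all samples; return predictions, seconds and number of calls."""
    predictions = []
    calls = 0
    start = time.perf_counter()
    for batch in iter_batches(samples, batch_size):
        predictions.extend(predict_batch(batch))
        calls += 1  # any fixed per-call overhead is paid once per batch
    elapsed = time.perf_counter() - start
    return predictions, elapsed, calls


samples = list(range(128))
for size in (1, 32, 64):
    _, seconds, calls = timed_inference(lambda b: [2 * x for x in b], samples, size)
    print(f"batch size {size}: {calls} forward passes, {seconds:.6f} s")
```

With a fixed overhead per forward pass, such as converting weights back to fp16, a batch size of 64 pays that overhead 64 times less often than one-by-one evaluation, which is in line with the behavior observed above.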

### Emissions results

Emissions are a little challenging to measure and compare, because we need to decide whether to take into account the emissions generated while quantizing the models. We report the results obtained for evaluation separately from the emissions of the quantization process, but take both into consideration when comparing. Emissions are measured in kilograms of CO<sub>2</sub>-equivalents \[CO<sub>2</sub>eq\].

The emissions generated by the evaluation process were:
- Base model: 0.013935
- 8-bit quantized: 0.044523 
- 4-bit quantized: 0.040132
- 3-bit quantized: 0.084331
- 8-bit quantized with batch size 32: 0.002541
- 8-bit quantized with batch size 64: 0.001891
- 4-bit quantized with batch size 32: 0.002384
- 4-bit quantized with batch size 64: 0.001811
- 3-bit quantized with batch size 32: 0.004092
- 3-bit quantized with batch size 64: 0.002680

The emissions generated by quantization were:
- 8-bit quantized: 0.043362
- 4-bit quantized: 0.04605
- 3-bit quantized: 0.041809

Focusing on the evaluation, the 3-bit quantized model is the one with the highest emissions, which is consistent with the inference times we saw earlier. Comparing the models, the base model has 68.61% lower emissions than the 8-bit model, 65.34% lower than the 4-bit model and 83.44% lower than the 3-bit model. Among the quantized models, the 4-bit model has 9.91% lower emissions than the 8-bit model, while the 8-bit model has 47.27% lower emissions than the 3-bit model and the 4-bit model has 52.43% lower emissions than the 3-bit variant. However, in line with the inference time results, quantized models generate much lower emissions when using batches during inference. For example, the 8-bit model generates 95.75% lower emissions when using batches of size 64.

Regarding quantization, the emissions for all quantized models are really similar, and differences between them are not important. 
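
The link between run time and emissions can be made explicit with a back-of-the-envelope sketch: emissions are the energy consumed multiplied by the carbon intensity of the electricity. The power draw and carbon intensity used below are hypothetical values chosen for illustration, not measurements from this project; CodeCarbon tracks the actual hardware power and uses a region-specific intensity.

```python
def energy_kwh(average_power_watts, seconds):
    """Energy consumed at a given average power over a duration, in kWh."""
    return average_power_watts * seconds / 3_600_000.0


def emissions_kg(average_power_watts, seconds, kg_co2eq_per_kwh):
    """Emissions in kg CO2eq: energy consumed times carbon intensity."""
    return energy_kwh(average_power_watts, seconds) * kg_co2eq_per_kwh


# Hypothetical figures: 250 W average draw, 0.4 kg CO2eq per kWh.
for seconds in (600, 1200, 2400):
    print(seconds, "s ->", emissions_kg(250, seconds, 0.4), "kg CO2eq")
```

At a roughly constant power draw, emissions grow linearly with run time, which is why the emissions ranking follows the inference-time ranking so closely.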

We now show a couple of graphs to visualize these results:

![inference_emissions](./inference_emissions_GPTQ.png)
![quantization_emissions](./quantization_emissions_GPTQ.png)


## Conclusions

Results show that:

1. Quantization does reduce accuracy, and the lower the bit count, the larger the drop. However, with **GPTQ** the drop is small for the 8-bit and 4-bit variants, whose accuracy remains close to that of the base model.
2. Quantized models are noticeably better than the base model in terms of memory usage. However, the use of batches can raise memory needs considerably.
3. Contrary to what would be expected, models quantized with **GPTQ** are not faster than the base model when evaluated one sample at a time. This is probably due to the nature of this quantization method, which restores the values to fp16 for inference; with a "one by one" evaluation (not using batches), this overhead makes the model slower. However, more study and experimentation are needed in this area.
4. As with inference time, emissions are higher for the quantized models, which is consistent with their longer inference time. However, the differences here are smaller than the differences in speed, probably due to the size of the models.

