
# A quantization, inference and evaluation pipeline with LLaMA 2 7B and BitsAndBytes

In this project, we evaluate the impact of quantization on a widely used LLM, LLaMA 2 7B, on a sentiment analysis task. Our pipeline is defined as follows:
- Add a classification layer at the end of the non-quantized model and train this layer on our dataset.
- Quantize the model using **bitsandbytes** in both of its variants, 8-bit and 4-bit quantization.
- Compare the performance of our models (non-quantized, 4-bit and 8-bit) in terms of accuracy, memory use, inference speed and emissions.

![Pipeline Diagram](./BitsAndBytes_Pipeline.png)

We follow this approach because it is the simplest one. An alternative would be to quantize the base model first and then add and train the classification layer. One could even argue that this alternative is better. However, we ran into many problems when trying it, because training on top of an already quantized model is generally not good practice.

### Goals
1. Analyze the impact quantization has on performance.
2. Compare the efficiency and accuracy of the quantized models with the non-quantized model.


## BitsAndBytes

Bitsandbytes is an optimized library for quantizing deep learning models, developed by Tim Dettmers, a researcher specializing in optimization and efficiency in Deep Learning. Its primary goal is to make computations in large models more efficient by allowing the use of lower precision operations, leading to reduced memory usage and faster training and inference processes. Bitsandbytes is widely used in the quantization of large matrices and vectors, offering support for 8-bit, 4-bit, and even mixed-precision quantization.

The main idea behind BitsAndBytes quantization is to identify outliers: data points or weights in the model whose values are unusually high or low compared to the rest. Because these values are rare, BitsAndBytes keeps them in 16-bit float precision, which helps preserve accuracy. All other values, the non-outliers, are converted to an 8-bit or 4-bit format, depending on the quantization type. After the final step of the quantization method, the non-outlier values are converted back to 16-bit float precision. This keeps a uniform data type and simplifies the integration and processing of the weights. Strictly speaking, this outlier handling describes the 8-bit mode (LLM.int8()); the 4-bit mode stores the weights in a 4-bit data type (FP4 or NF4) and dequantizes them for computation.

In summary, BitsAndBytes detects the outliers and keeps them in a high-precision data type to maintain accuracy, and reduces the precision of all the other values, which are the most common ones, to reduce the memory footprint of the model.
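
The idea summarized above can be illustrated with a toy example. The sketch below is not the library's implementation: it works on a plain Python list instead of weight matrices, uses a single absmax scale, and the function names and the `threshold` value are hypothetical choices made only for illustration.

```python
def quantize_with_outliers(values, threshold=6.0, bits=8):
    """Toy absmax quantization that keeps outliers in full precision."""
    qmax = 2 ** (bits - 1) - 1  # 127 for 8 bits, 7 for 4 bits
    # Values at or above the (hypothetical) threshold are stored untouched.
    outliers = {i: v for i, v in enumerate(values) if abs(v) >= threshold}
    regular = [v for i, v in enumerate(values) if i not in outliers]
    # One scale shared by all non-outlier values (absmax scaling).
    absmax = max((abs(v) for v in regular), default=0.0)
    scale = absmax / qmax if absmax > 0 else 1.0
    codes = {i: round(v / scale) for i, v in enumerate(values) if i not in outliers}
    return codes, outliers, scale


def dequantize(codes, outliers, scale, length):
    """Rebuild a float list: outliers as stored, the rest from integer codes."""
    return [outliers[i] if i in outliers else codes[i] * scale for i in range(length)]


values = [0.1, -0.5, 12.0, 0.25]
codes, outliers, scale = quantize_with_outliers(values)
restored = dequantize(codes, outliers, scale, len(values))
print(outliers)   # the value 12.0 is kept as-is
print(restored)   # the other values carry a small rounding error
```

The outlier is recovered exactly, while every other value is off by at most half a quantization step (`scale / 2`). Without the outlier separation, the single large value would stretch the scale and increase the rounding error of all the small values.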

## Hardware and software

### Hardware
- CPU: Intel(R) Xeon(R) Silver 4316 CPU @ 2.30GHz
- RAM: 235 GB
- GPU: NVIDIA A100-PCIE-40GB
### Software
For this project, we used the following Python libraries: torch, transformers, time and datasets, among others, along with their dependencies.
We also used tools that help evaluate the results: the CodeCarbon library to collect emissions data, and the Weights and Biases (Wandb) platform to collect and visualize metrics.


## Project pipeline

### 1. Base model preparation
The base model used is **LLaMA 2 (7B)**. We added a classifier layer to it by loading it with the LlamaForSequenceClassification class from the Transformers library. Only this layer was trained, with all the other layers frozen, using our chosen dataset, **TweetEval**. To demonstrate that this training does not suffer from overfitting, we show here the training and evaluation losses:
![Training vs Evaluation Loss](./overfitting.png)
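
A minimal sketch of this step is shown below. It is not the project's actual training code: the model identifier, `num_labels=3` (assuming the three-class sentiment subset of TweetEval) and both helper names are assumptions. It relies on the classification head of `LlamaForSequenceClassification` being exposed as the `score` module.

```python
def freeze_all_but_head(model, head_prefix="score"):
    """Disable gradients everywhere except the classification head."""
    trainable = []
    for name, param in model.named_parameters():
        param.requires_grad = name.startswith(head_prefix)
        if param.requires_grad:
            trainable.append(name)
    return trainable


def load_classifier(model_id="meta-llama/Llama-2-7b-hf", num_labels=3):
    """Load LLaMA 2 with a classification head and freeze the backbone.

    model_id and num_labels are assumptions made for this sketch.
    """
    # Imported lazily so freeze_all_but_head works without transformers installed.
    from transformers import AutoTokenizer, LlamaForSequenceClassification

    tokenizer = AutoTokenizer.from_pretrained(model_id)
    tokenizer.pad_token = tokenizer.eos_token  # LLaMA 2 ships without a pad token
    model = LlamaForSequenceClassification.from_pretrained(model_id, num_labels=num_labels)
    model.config.pad_token_id = tokenizer.pad_token_id  # needed for batched inputs
    freeze_all_but_head(model)
    return tokenizer, model
```

After `load_classifier()` returns, only the parameters listed by `freeze_all_but_head` (here `score.weight`) receive gradients, so any standard training loop or the Transformers `Trainer` updates the classifier layer alone.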

### 2. Quantization 
Quantization was performed using **bitsandbytes**. This quantization method offers two options: **8 bits** and **4 bits**.
This gives us three different models: **quantized to 8 bits**, **quantized to 4 bits** and **non-quantized**.
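
One way to obtain the two quantized variants, sketched under assumptions, is to reload the trained classifier through the Transformers integration of bitsandbytes. The helper names and the `checkpoint_dir` argument (a hypothetical directory where the trained model was saved) are not taken from the project, and only the default quantization settings are used.

```python
def bnb_kwargs(bits):
    """Keyword arguments for transformers.BitsAndBytesConfig, by bit width."""
    if bits == 8:
        return {"load_in_8bit": True}
    if bits == 4:
        return {"load_in_4bit": True}
    raise ValueError("only 8-bit and 4-bit quantization are covered by this sketch")


def load_quantized(checkpoint_dir, bits):
    """Reload a saved classifier with bitsandbytes quantization applied.

    checkpoint_dir is a hypothetical path to the trained, non-quantized model.
    """
    # Lazy import: needs transformers, accelerate, bitsandbytes and a CUDA GPU.
    from transformers import BitsAndBytesConfig, LlamaForSequenceClassification

    config = BitsAndBytesConfig(**bnb_kwargs(bits))
    return LlamaForSequenceClassification.from_pretrained(
        checkpoint_dir, quantization_config=config, device_map="auto"
    )
```

With this layout, the weights are quantized while the model is being loaded, so calling `load_quantized(path, 8)` and `load_quantized(path, 4)` yields the two quantized models, and a plain `from_pretrained(path)` yields the base model.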

### 3. Evaluation
These three models were evaluated and compared in terms of **accuracy**, **memory use**, **inference speed** and **emissions**. We used the test subset of the same dataset used for training.



### Accuracy results
We now show the results obtained for the accuracy of each of the three models:

![Accuracy](./accuracy_all.png)

The base model achieves the highest accuracy and, between the quantized models, the 8-bit variant is more accurate, although the differences are minimal: the 8-bit quantized model is only 1.49% less accurate than the base model, while the 4-bit variant is 2.01% less accurate.


### Memory use results

While running the three models, we checked the size of each one. The results were as expected:
- Base model: 26.43 GB
- 8-bit quantized: 6.74 GB
- 4-bit quantized: 3.50 GB

We can see how the quantized models use considerably less memory than the base model. In fact, the 8-bit quantized model uses 74.50% less memory than the base model, while the 4-bit variant uses 86.76% less memory.
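
The percentages above follow directly from the reported sizes. The short check below reproduces them; the helper name `reduction_pct` is only illustrative.

```python
def reduction_pct(base, quantized):
    """Relative reduction of `quantized` with respect to `base`, in percent."""
    return (1 - quantized / base) * 100


sizes_gb = {"base": 26.43, "8-bit": 6.74, "4-bit": 3.50}
for name in ("8-bit", "4-bit"):
    pct = reduction_pct(sizes_gb["base"], sizes_gb[name])
    print(f"{name}: {pct:.2f}% less memory than the base model")
```

If the models are Hugging Face `PreTrainedModel` objects, one possible way to read such sizes is `model.get_memory_footprint()`, which returns the footprint in bytes; whether the figures above were collected that way is an assumption.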

### Inference speed results

In this section, the main comparison does not include the quantization time, although we report and discuss it separately.

The evaluation time for each of the three models was:
- Base model: 640.65 s
- 8-bit quantized: 2146.62 s
- 4-bit quantized: 1049.69 s

The quantization time for each quantized model was:
- 8-bit quantized: 7.51 s
- 4-bit quantized: 8.29 s

Here we can see the most surprising results of the whole report. The fastest model was, by far, the base model, and of the two quantized variants, the 4-bit one was faster. The 8-bit model took 235.08% more time than the base model to run inference for the evaluation, while the 4-bit model took 63.85% more time. Comparing the two quantized models, the 8-bit model took 104.49% more time than the 4-bit one.

The quantization times of the two models were similar, with the 8-bit variant being slightly faster to quantize. The comparisons do not change much if we add the quantization time to the total: the 8-bit model took 236.21% more time than the base model, while the 4-bit model took 65.14% more time. Comparing the two quantized models, the 8-bit model took 103.66% more time than the 4-bit one.

These results are surprising because one would normally expect the smaller, quantized models to run faster. However, this does not happen. We think this is due to the nature of the BitsAndBytes quantization algorithm, which converts the non-outlier values previously stored in 8 or 4 bits back to 16-bit float. This helps maintain accuracy but is costly in time. In addition, it seems that the main goal of BitsAndBytes was to provide efficient kernels for training, not for inference, as stated by its author, Tim Dettmers.
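
Timings like the ones reported in this section can be collected with the standard `time` module. The context manager below is a minimal sketch, not the project's measurement code; the name `timed` and the stand-in workload are hypothetical.

```python
import time
from contextlib import contextmanager


@contextmanager
def timed(results, label):
    """Store the wall-clock duration of the enclosed block in results[label]."""
    start = time.perf_counter()
    try:
        yield
    finally:
        results[label] = time.perf_counter() - start


timings = {}
with timed(timings, "evaluation"):
    sum(i * i for i in range(100_000))  # stand-in for the evaluation loop
print(f"evaluation took {timings['evaluation']:.4f} s")
```

When timing GPU work, CUDA kernels run asynchronously, so calling `torch.cuda.synchronize()` at the end of the timed block helps make sure the measured interval covers the whole computation.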

### Emissions results

The emissions are a little challenging to measure and compare, because we need to decide whether to take into account the emissions generated while quantizing the models. We report the evaluation emissions separately from the quantization emissions, but consider both when comparing. Emissions are measured in kilograms of CO<sub>2</sub>-equivalents \[CO<sub>2</sub>eq\].

The emissions generated by the evaluation process were:
- Base model: 0.013476
- 8-bit quantized: 0.024606 
- 4-bit quantized: 0.022125

The emissions generated by quantization were:
- 8-bit quantized: 8.731975e-05
- 4-bit quantized: 9.470263e-05 

If we focus on the evaluation, the 8-bit quantized model has the highest emissions, followed by the 4-bit variant. This is consistent with the inference times we saw earlier. The 8-bit model has 82.61% higher emissions than the base model, while the 4-bit variant has 64.18% higher emissions. Comparing the two quantized models, the 8-bit one has 11.23% higher emissions than the 4-bit one.

Taking quantization emissions into account, the results do not change much. Both models produce similar emissions during quantization, with the 4-bit variant being 8.46% higher. If we add these quantization emissions to the total, the 8-bit model still has the highest emissions, 83.33% higher than the base model. The 4-bit variant is now 64.93% higher than the base model, and the 8-bit model has 11.14% higher emissions than the 4-bit variant.
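
Emissions figures of this kind can be gathered by wrapping each phase (quantization or evaluation) in a CodeCarbon tracker. The wrapper below is a sketch under assumptions: the helper name `measure_emissions` is hypothetical and the tracker is created with its default settings, which may differ from the configuration used in the project.

```python
def measure_emissions(fn, *args, **kwargs):
    """Run fn and return (result, emissions in kg CO2eq) using CodeCarbon."""
    # Lazy import so the module can be loaded without codecarbon installed.
    from codecarbon import EmissionsTracker

    tracker = EmissionsTracker()  # default settings, an assumption of this sketch
    tracker.start()
    try:
        result = fn(*args, **kwargs)
    finally:
        # stop() returns the estimated emissions of the tracked interval.
        emissions_kg = tracker.stop()
    return result, emissions_kg
```

For example, `measure_emissions(evaluate, model, test_set)` would return the evaluation output together with its emissions, where `evaluate`, `model` and `test_set` stand for the project's own evaluation function and data. Measuring quantization and evaluation in separate calls keeps the two contributions apart, as in the lists above.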


## Conclusions

Results show that:

1. Quantization does reduce accuracy, and the lower the bit count, the larger the drop. However, the drop with **BitsAndBytes** is small: accuracy remains close to that of the base model.
2. Both quantized models use noticeably less memory than the base model.
3. Contrary to what would be expected, models quantized with **BitsAndBytes** are not faster than the base model. This is probably due to the nature of this quantization method, which restores the values to fp16 for inference. However, more study and experimentation are needed in this area.
4. As with inference time, emissions are higher for the quantized models, which is consistent with their longer inference times. However, the differences here are smaller than the differences in speed, probably due to the size of the models.

