
# Creation of a quantization, use and evaluation pipeline, featuring LLaMA2 7B model and BitsAndBytes

In this project, we evaluate the impact of quantization on a widely used LLM, in our case LLaMA 2 7B, using a sentiment analysis task. Our pipeline is defined as follows:
- Add a classification layer at the end of the non-quantized model and train this layer with our dataset.
- Quantize the model using **bitsandbytes** in both of its variants, 8-bit and 4-bit quantization.
- Compare the performance of our models (non-quantized, 4-bit and 8-bit) in terms of accuracy, memory use, inference speed and emissions.

![Pipeline Diagram](./BitsAndBytes_Pipeline.png)

We follow this approach because it is the simplest one. An alternative would be to quantize the base model first and then add and train the classification layer. One could argue that this alternative is better. In practice, however, it raises many problems, because fine-tuning an already quantized model is not good practice.

### Goals
1. Analyze the impact of quantization on performance.
2. Compare the efficiency and accuracy of the quantized models with the non-quantized model.


## BitsAndBytes

Bitsandbytes is an optimized library for quantizing deep learning models, developed by Tim Dettmers, a researcher specializing in optimization and efficiency in deep learning. Its primary goal is to make computations in large models more efficient by allowing the use of lower-precision operations, leading to reduced memory usage and faster training and inference. Bitsandbytes is widely used to quantize large matrices and vectors, and supports 8-bit, 4-bit and mixed-precision quantization. Among quantization methods, it falls into the post-training quantization (PTQ) category. For the 4-bit variant specifically, it relies on a data type called 4-bit NormalFloat (NF4), which is designed to be theoretically optimal for representing normally distributed weights.

The main idea behind BitsAndBytes quantization is identifying the outliers. These are values in the model that are unusually large or small compared to the rest. Since such values are rare, BitsAndBytes keeps them in 16-bit floating-point precision, which helps preserve accuracy. All the other values, the non-outliers, are converted to a lower-precision 8-bit or 4-bit format, depending on the quantization type. At the end of the computation, the non-outlier values are converted back to 16-bit floating-point precision. This keeps the data type uniform and simplifies the integration and processing of the weights. In summary, BitsAndBytes detects the outliers and keeps them in a high-precision data type to maintain accuracy, and reduces the precision of all the other values, which are by far the most common, to reduce the memory footprint of the model.
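The idea described above can be illustrated with a minimal, pure-Python sketch. It is not the bitsandbytes implementation: the function names are hypothetical, Python floats stand in for 16-bit floats, and a simple absmax scheme is assumed for the 8-bit codes.

```python
def quantize_with_outliers(values, threshold=6.0):
    """Split values into exact outliers and int8-coded regular values."""
    # Values whose magnitude exceeds the threshold are kept untouched.
    outliers = {i: v for i, v in enumerate(values) if abs(v) > threshold}
    regular = [0.0 if i in outliers else v for i, v in enumerate(values)]
    # Absmax scaling: the largest regular magnitude maps to code 127.
    absmax = max((abs(v) for v in regular), default=0.0)
    scale = absmax / 127 if absmax > 0 else 1.0
    codes = [round(v / scale) for v in regular]
    return codes, scale, outliers


def dequantize(codes, scale, outliers):
    """Rescale the int8 codes and put the outliers back unchanged."""
    restored = [code * scale for code in codes]
    for i, v in outliers.items():
        restored[i] = v
    return restored


weights = [0.12, -0.45, 7.5, 0.03, -0.9, -8.2, 0.6]
q, scale, outliers = quantize_with_outliers(weights)
restored = dequantize(q, scale, outliers)
max_error = max(abs(a - b) for a, b in zip(weights, restored))
print("int8 codes:", q)
print("outliers kept exactly:", outliers)
print("largest round-trip error:", max_error)
```

Because the two large values are excluded before the scale is computed, the remaining values share a much finer quantization step than they would if 8.2 set the range.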

Given this explanation, it is clear that outliers are critical. One may therefore ask: "How are outliers defined? What percentage of the values in our model are outliers?" These are legitimate questions and we will try to answer them. First of all, the number of outliers depends on the model and on the BitsAndBytes configuration used, because the definition changes. For example, in an 8-bit BitsAndBytes configuration, the outlier threshold is set to 6 (in absolute value) by default, and the library compares it against hidden-state values. Although this value typically works well, it may not be appropriate for some large models. For this reason, it can be changed with the *llm_int8_threshold* parameter when creating the configuration. Different experiments have shown that outliers typically represent less than 1% of the values, but they can still be critical in some layers. To know the exact amount in our specific case, we would need to perform a layer-by-layer analysis of our quantized model. Such an experiment would also help decide which threshold best suits the model, by comparing different threshold configurations.
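As a rough illustration of how the share of outliers depends on the threshold, the sketch below counts values above several thresholds. The data is synthetic and the helper name is hypothetical; a real analysis would run over the values of each layer of the actual model.

```python
import random


def outlier_fraction(values, threshold):
    """Fraction of values whose magnitude exceeds the threshold."""
    if not values:
        return 0.0
    return sum(abs(v) > threshold for v in values) / len(values)


random.seed(0)
# Synthetic "layer": mostly small values plus three injected large ones.
layer = [random.gauss(0.0, 1.0) for _ in range(10_000)] + [9.0, -12.5, 7.1]

for threshold in (4.0, 6.0, 8.0):
    fraction = outlier_fraction(layer, threshold)
    print(f"threshold {threshold}: {fraction:.4%} outliers")
```

Running the same count per layer for a few candidate thresholds gives a quick view of which layers are most affected by the choice.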

Finally, regarding the inference process, we mentioned that BitsAndBytes reduces the precision of "normal" values but keeps the outliers in high precision. Inference is performed using mixed-precision matrices and specific CUDA kernels that handle this scenario correctly. "Normal" values and outliers are therefore processed separately and combined at the end to obtain the final result.
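The separate processing of regular values and outliers can be sketched as a small matrix-vector product in pure Python. This is a simplified illustration under stated assumptions (one absmax scale for the input, one per weight column, hypothetical function names), not the CUDA kernels that bitsandbytes actually uses.

```python
def absmax_quantize(values):
    """Quantize floats to int8 codes with a single absmax scale."""
    absmax = max((abs(v) for v in values), default=0.0)
    scale = absmax / 127 if absmax > 0 else 1.0
    return [round(v / scale) for v in values], scale


def mixed_precision_matvec(x, W, threshold=6.0):
    """Compute x @ W, handling outlier input features in full precision.

    x: list of in_features floats; W: in_features rows of out_features floats.
    """
    outlier_idx = [i for i, v in enumerate(x) if abs(v) > threshold]
    regular_idx = [i for i, v in enumerate(x) if abs(v) <= threshold]

    # Regular input features are coded once as int8.
    xq, x_scale = absmax_quantize([x[i] for i in regular_idx])
    result = []
    for j in range(len(W[0])):
        wq, w_scale = absmax_quantize([W[i][j] for i in regular_idx])
        # Integer accumulation, then rescaling back to floating point.
        regular_part = sum(a * b for a, b in zip(xq, wq)) * x_scale * w_scale
        # Outlier features use plain floating-point products.
        outlier_part = sum(x[i] * W[i][j] for i in outlier_idx)
        result.append(regular_part + outlier_part)
    return result


x = [0.5, -1.2, 9.0, 0.3]
W = [[0.1, -0.2], [0.4, 0.3], [-0.5, 0.25], [0.05, 0.6]]
exact = [sum(x[i] * W[i][j] for i in range(len(x))) for j in range(len(W[0]))]
approx = mixed_precision_matvec(x, W)
print("exact: ", exact)
print("approx:", approx)
```

The third input feature (9.0) exceeds the threshold, so its contribution is computed exactly, and only the small features carry quantization error.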

## Hardware and software

### Hardware
- CPU: Intel(R) Xeon(R) Silver 4316 CPU @ 2.30GHz
- RAM: 235 GB
- GPU: NVIDIA A100-PCIE-40GB
### Software
For this project, we used the following Python libraries: torch, transformers, time and datasets, among others, together with all of their dependencies.
We also used other libraries and tools that help evaluate the results. For example, the CodeCarbon library was used to collect emissions data, and the Weights and Biases (Wandb) platform was used to collect and visualize metrics.


## Project pipeline

### 1. Base model preparation
The base model used is **LLaMA 2 (7B)**. We added a classifier layer to it by loading it with the LlamaForSequenceClassification class from the Transformers library. Only this layer was trained, with all other layers frozen, using our chosen dataset, **TweetEval**. To show that this training does not suffer from overfitting, we plot the training and evaluation losses:
![Training vs Evaluation Loss](./overfitting.png)

### 2. Quantization 
Quantization was performed using **bitsandbytes**. This quantization method offers two options: **8 bits** and **4 bits**.
This gives us three different models: **quantized to 8 bits**, **quantized to 4 bits** and **not quantized**.
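The two quantized variants could be configured roughly as follows with the Transformers integration of bitsandbytes. This is a sketch, not the exact configuration used in this project: the checkpoint path is a hypothetical placeholder, and the 4-bit options shown (NF4 with a float16 compute type) are assumptions based on the description above.

```python
import torch
from transformers import BitsAndBytesConfig, LlamaForSequenceClassification

# Hypothetical path to the fine-tuned classifier checkpoint (not given here).
CHECKPOINT = "path/to/finetuned-llama2-7b-classifier"

# 8-bit configuration; 6.0 is the library's default outlier threshold.
config_8bit = BitsAndBytesConfig(load_in_8bit=True, llm_int8_threshold=6.0)

# 4-bit configuration using the NF4 data type, computing in 16-bit floats.
config_4bit = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.float16,
)

# Quantization happens while the weights are loaded onto the GPU.
model_8bit = LlamaForSequenceClassification.from_pretrained(
    CHECKPOINT, quantization_config=config_8bit, device_map="auto"
)
model_4bit = LlamaForSequenceClassification.from_pretrained(
    CHECKPOINT, quantization_config=config_4bit, device_map="auto"
)
```

Running this requires a CUDA GPU and the bitsandbytes package in addition to torch and transformers.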

### 3. Evaluation
These three models were evaluated and compared in terms of **accuracy**, **memory use**, **inference speed** and **emissions**. We used the test split of the same dataset that was used for training.



### Accuracy results
We now show the results obtained for the accuracy of each of the three models:

![Accuracy](./accuracy_ByB.png)

First, apart from negligible differences, accuracy stays the same across batch sizes for a given model. Looking at the results without batching, the base model has the highest accuracy and, between the quantized models, the 8-bit variant is slightly more accurate. The 8-bit quantized model is just 1.40% less accurate than the base model, while the 4-bit variant is 1.87% less accurate. Between the quantized models, the 8-bit model is 0.48% more accurate than the 4-bit model, a minimal difference.


### Memory use results

Measuring memory needs is not an easy task. We decided to check memory usage at two points of the pipeline, which gives two different metrics: the minimum memory the model needs (on disk) and the maximum memory needed during inference (on the GPU). Note that the second measure tends to be higher because of the extra memory that inference requires, especially when using batches.

The minimum memory needs for each model are:
- Base model: 24.61 GB
- 8-bit quantized: 6.28 GB
- 4-bit quantized: 3.36 GB

The quantized models use considerably less memory than the base model. The 8-bit quantized model uses 74.47% less memory than the base model, while the 4-bit variant uses 86.34% less. Between the two, the 4-bit model uses 46.50% less memory than the 8-bit variant. The following graph visualizes these sizes:
![model_sizes](./model_sizes_ByB.png)
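The relative savings can be recomputed from the sizes listed above with a few lines of Python. The helper name is ours, and because the inputs are the rounded sizes, the last digit may differ slightly from the percentages reported in the text.

```python
def percent_reduction(reference, value):
    """Percentage by which `value` is smaller than `reference`."""
    return (reference - value) / reference * 100


# Minimum model sizes in GB, as listed above.
sizes_gb = {"base": 24.61, "8-bit": 6.28, "4-bit": 3.36}

for name in ("8-bit", "4-bit"):
    saving = percent_reduction(sizes_gb["base"], sizes_gb[name])
    print(f"{name} vs base: {saving:.2f}% less memory")

saving = percent_reduction(sizes_gb["8-bit"], sizes_gb["4-bit"])
print(f"4-bit vs 8-bit: {saving:.2f}% less memory")
```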

Now we show the maximum memory needed for inference for each model and inference configuration:

- Base model: 24.75 GB
- 8-bit quantized: 6.47 GB
- 4-bit quantized: 3.64 GB
- 8-bit quantized with batch size 32: 9.39 GB
- 8-bit quantized with batch size 64: 12.35 GB
- 4-bit quantized with batch size 32: 6.34 GB
- 4-bit quantized with batch size 64: 9.10 GB

These results are shown in this chart:
![max_memory](./max_gpu_mem_ByB.png)

The maximum GPU memory usage during inference is noticeably higher for the quantized models when using batches, and the larger the batch size, the higher the memory needs. For example, without batching the 8-bit model needs 31.08% less memory than with batch size 32, and 47.59% less than with batch size 64.

### Inference speed results

In this section we leave the quantization time out of the comparison, although we do comment on it.

The evaluation time for each configuration was:
- Base model: 641.88 s
- 8-bit quantized: 2132.42 s
- 4-bit quantized: 1053.50 s
- 8-bit quantized with batch size 32: 130.62 s
- 8-bit quantized with batch size 64: 97.02 s
- 4-bit quantized with batch size 32: 102.32 s
- 4-bit quantized with batch size 64: 86.22 s

The quantization time for each quantized model was:
- 8-bit quantized: 9.89 s
- 4-bit quantized: 9.57 s
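Timings like the ones above can be collected with a small helper built on the standard `time` module. The sketch below is an assumption about how such a measurement might look, not the project's actual code: `dummy_predict` is a placeholder for the model's forward pass, and all names are hypothetical.

```python
import time
from contextlib import contextmanager


@contextmanager
def timed(label, results):
    """Store the wall-clock duration of the enclosed block in results[label]."""
    start = time.perf_counter()
    try:
        yield
    finally:
        results[label] = time.perf_counter() - start


def run_in_batches(items, batch_size, predict):
    """Call `predict` on consecutive slices of `items` and collect the outputs."""
    outputs = []
    for start in range(0, len(items), batch_size):
        outputs.extend(predict(items[start:start + batch_size]))
    return outputs


def dummy_predict(batch):
    # Placeholder for a model forward pass; returns one label per input.
    return [len(text) % 3 for text in batch]


texts = [f"tweet number {i}" for i in range(1000)]
timings = {}
with timed("batch_size=1", timings):
    single = run_in_batches(texts, 1, dummy_predict)
with timed("batch_size=64", timings):
    batched = run_in_batches(texts, 64, dummy_predict)
print(timings)
```

When timing a real model on a GPU, one would typically also synchronize the device (for example with `torch.cuda.synchronize()`) before reading the clock, since CUDA kernels run asynchronously.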

Here we see the most surprising results of the whole report. Without batching, the fastest model was, by far, the base model, and between the 8-bit and 4-bit variants, the 4-bit one was faster. The 8-bit model took 232.18% more time than the base model to run the evaluation, while the 4-bit model took 64.11% more time. Comparing the quantized models with each other, the 4-bit model needed roughly half the time of the 8-bit model (1053.50 s versus 2132.42 s). However, this changes dramatically when using batches. Both quantized models show a massive speed-up in inference time, with the fastest times obtained at batch size 64. For example, with batch size 64 the 8-bit model takes 84.89% less time than the non-batched base model, and 95.45% less time than the same model without batching.

Looking at quantization times, both models took similar times, with the 4-bit variant being slightly faster (9.57 s versus 9.89 s). We show these results in the following graphs:

![inference_time](./inference_time_ByB.png)
![quantization_time](./quantization_time_ByB.png)

These results are surprising because one would expect the smaller, quantized models to run faster. However, without batching this does not happen. We think this is due to the nature of the BitsAndBytes quantization algorithm: the values stored in 8-bit or 4-bit form are converted back to 16-bit floating point during computation, which helps maintain accuracy but is costly in time. In addition, it seems that the main goal of BitsAndBytes was to provide efficient kernels for training rather than for inference, as stated by its author, Tim Dettmers. Nevertheless, as the batched results show, the quantized models benefit greatly from batching and reduce their inference time considerably.

### Emissions results

Emissions are somewhat challenging to measure and compare, because we need to decide whether to take into account the emissions generated while quantizing the models. We report the results for evaluation separately from the emissions of the quantization process, but take both into consideration when comparing. Emissions are measured in kilograms of CO<sub>2</sub>-equivalents \[CO<sub>2</sub>eq\].

The emissions generated by the evaluation process were:
- Base model: 0.013894
- 8-bit quantized: 0.027767
- 4-bit quantized: 0.023016
- 8-bit quantized with batch size 32: 0.002301
- 8-bit quantized with batch size 64: 0.001974
- 4-bit quantized with batch size 32: 0.002136
- 4-bit quantized with batch size 64: 0.001829

The emissions generated by quantization were:
- 8-bit quantized: 0.000120
- 4-bit quantized: 0.000117

If we focus on the evaluation without batching, the 8-bit quantized model has the highest emissions, followed by the 4-bit variant. This is consistent with the inference times we saw earlier. The 8-bit model has 99.85% higher emissions than the base model, while the 4-bit variant has 65.66% higher emissions. Comparing the two quantized models, the 4-bit model has 17.10% lower emissions than the 8-bit model. However, emissions drop considerably when using batches, as a direct consequence of the reduced inference time. For example, for the 8-bit model, using batches of size 64 reduces emissions by 92.89% compared to the non-batched run.

Taking quantization emissions into account does not change the results much. Both models produce similar emissions during quantization, with the 4-bit variant being just 2.50% lower than the 8-bit one.
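To make the effect of quantization emissions concrete, the following sketch adds them to the non-batched evaluation emissions listed above and recomputes the gap to the base model. The dictionary layout and helper name are ours; the numbers are the ones reported in this section.

```python
# Emissions in kg CO2eq, as listed above (non-batched evaluation).
evaluation = {"base": 0.013894, "8-bit": 0.027767, "4-bit": 0.023016}
# The base model is not quantized, so it has no quantization emissions.
quantization = {"base": 0.0, "8-bit": 0.000120, "4-bit": 0.000117}

totals = {name: evaluation[name] + quantization[name] for name in evaluation}


def percent_increase(reference, value):
    """Percentage by which `value` exceeds `reference`."""
    return (value - reference) / reference * 100


for name in ("8-bit", "4-bit"):
    before = percent_increase(evaluation["base"], evaluation[name])
    after = percent_increase(totals["base"], totals[name])
    print(f"{name}: +{before:.2f}% (evaluation only), +{after:.2f}% (with quantization)")
```

Quantization adds less than one percentage point to either gap, which is why including it barely changes the comparison.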

We now show a couple of graphs to visualize these results:

![inference_emissions](./inference_emissions_ByB.png)
![quantization_emissions](./quantization_emissions_ByB.png)


## Conclusions

Results show that:

1. Quantization does in fact reduce accuracy, and the lower the bit count, the larger the drop. However, the drop with **BitsAndBytes** is small (under 2% in our experiments), and accuracy remains close to that of the base model.
2. Both quantized models are noticeably better than the base model in terms of memory usage.
3. Contrary to what one would expect, models quantized with **BitsAndBytes** are not faster than the base model when running without batching. This is probably due to the nature of this quantization method, which restores values to fp16 for inference. With batching, however, the quantized models become much faster. More study and experimentation are needed in this area.
4. As with inference time, emissions without batching are higher for the quantized models, which is consistent with their longer inference times. For the 8-bit model, however, the gap in emissions (99.85%) is smaller than the gap in inference time (232.18%), probably due to the size of the models.

