
# Quantization in depth

Quantization refers to the process of mapping a large set to a smaller set of values. There are many quantization techniques. In this lesson, we will dive deep into the theory of linear quantization. 

- You will implement the asymmetric variant of linear quantization from scratch.
- You will also learn about the scaling factor and the zero point.

## Quantize and dequantize a tensor

Let’s have a look at an example. On the left you can see the original tensor in FP32, and on the right the quantized tensor. The quantized tensor is stored in `torch.int8`, and we use linear quantization to obtain it. In this lesson we will see how to get this quantized tensor, and also how to get back to an approximation of the original tensor.

![Les3_s1_quantization_example](img/Les3_s1_quantization_example.png)

Let’s have a quick recap on what we can quantize in a neural network. In a neural network 
- you can quantize the weights, i.e. the neural network parameters,
- you can quantize the activations. The activations are values that propagate through the layers of the neural network.

If you quantize a neural network after it has been trained, you are doing something called **post-training quantization**.

Advantages of quantization:
- you get a smaller model, 
- speed gains from 
    - the memory bandwidth and 
    - faster operations, such as the matrix multiplication and the matrix-to-vector multiplication.

We will see why it is the case in the next part when we talk about how to perform inference with a quantized model.

There are many challenges to quantization. We will deep dive into these challenges in the last part of this lesson. Here's a preview of these challenges:

- quantization error,
- retraining (quantization aware training),
- limited hardware support,
- calibration dataset needed,
- packing / unpacking.

Linear quantization uses a linear mapping to map the higher precision range, for example floating point 32, to a lower precision range, for example int8.

![Les3_s2_linear_quantization_0](img/Les3_s2_linear_quantization_0.png)

There are two parameters in linear quantization:
- the scale S
- the zero point Z. 

The scale is stored in the same data type as the original tensor, and the zero point is stored in the same data type as the quantized tensor.
Why this is the case will become clear in the next slides.

Here is an example.

Let’s say the scale is equal to two and the zero point is equal to zero.

If we have a quantized value of ten (q = 10), the dequantized value r is s(q - z) = 2(10 - 0) = 20.

If we look at the example we presented in the first few slides, we would have something like this:

![Les3_s2_linear_quantization_1](img/Les3_s3_linear_quantization_1.png)

In the example, there is:
- the original tensor (at the left),
- the quantized tensor (at the right),
- the zero point (z = -77),
- the scale (s = 3.58).

We will see how we get the zero point and the scale in the next few slides.

![Les3_s4_quantization_example_1](img/Les3_s4_quantization_example_1.png)

But first, we have the original tensor and we need to quantize this tensor.
So, how do we get the quantized tensor? 

The relationship is r = s(q - z).

Isolating q gives q = r/s + z. Note that the quantized tensor (q) has a specific dtype, which is 8-bit integer in our example, so the result must be rounded. The last step is to cast the rounded value to the correct dtype, such as int8: q = int(round(r/s + z)).

![Les3_s5_isolate_q](img/Les3_s5_isolate_q.png)

Now, let’s code the function that will give us the quantized tensor. \
ref. notebooks/Les3_quantization_in_depth/L2_linear_I_quantize_dequantize_tensor.ipynb, function "linear_q_for_quantization_with_scale_and_zero_point".

Given the scale and the zero point, this function returns the quantized tensor.
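A minimal sketch of this quantization step, written in plain Python on a list of floats so that the arithmetic stays visible (the notebook works on PyTorch tensors). The function name and the clamping to the int8 range are assumptions made for illustration. The scale and zero point are the ones from the example above, and the input values are illustrative.

```python
def quantize_with_scale_and_zero_point(values, scale, zero_point,
                                       q_min=-128, q_max=127):
    """Quantize floats with q = round(r / s + z), clamped to [q_min, q_max]."""
    quantized = []
    for r in values:
        q = round(r / scale + zero_point)   # isolate q in r = s * (q - z), then round
        q = max(q_min, min(q_max, q))       # assumption: clamp so q fits in int8
        quantized.append(q)
    return quantized


# Scale and zero point from the example above; input values are illustrative.
quantized = quantize_with_scale_and_zero_point([-184.0, 0.0, 728.6],
                                               scale=3.58, zero_point=-77)
print(quantized)  # [-128, -77, 127]
```

Note that the value 0.0 lands exactly on the zero point, -77.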

After quantizing the tensor, we dequantize it. When the quantization works well, the dequantized tensor is close to the original tensor, but the two are not exactly the same.

To get the quantization error tensor, we subtract the dequantized tensor from the original tensor and take the element-wise absolute value.
ref. notebooks/Les3_quantization_in_depth/L2_linear_I_quantize_dequantize_tensor.ipynb

The quantization error can be measured by calculating the mean squared error between the original tensor and the dequantized tensor: \
`(dequantized_tensor - test_tensor).square().mean()`

In the example, the quantization error is about 170. The error is quite high because, in this example, we picked arbitrary values for the scale and the zero point.
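The dequantization and error measurement described above can be sketched in plain Python as follows. The helper names and the three sample values are illustrative assumptions, not the notebook's test tensor, so the resulting error differs from the 170 quoted above.

```python
def dequantize(quantized, scale, zero_point):
    """Map integers back to floats with r = s * (q - z)."""
    return [scale * (q - zero_point) for q in quantized]


def mean_squared_error(original, dequantized):
    """Average of the squared element-wise differences."""
    return sum((o - d) ** 2 for o, d in zip(original, dequantized)) / len(original)


# Illustrative values: three floats and their int8 codes for s = 3.58, z = -77.
original = [-184.0, 0.0, 728.6]
quantized = [-128, -77, 127]

dequantized = dequantize(quantized, scale=3.58, zero_point=-77)
abs_error = [abs(o - d) for o, d in zip(original, dequantized)]  # error tensor
mse = mean_squared_error(original, dequantized)
print(dequantized, abs_error, mse)
```

The dequantized values are close to the originals but not identical, which is exactly the quantization error the text refers to.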

## Get the scale and zero point

How to determine the optimal s and z?

To obtain the scale and the zero point, we need to look at the extreme values:
-  r_min should map to q_min and 
-  r_max should map to q_max

We get the following two equations:

![Les3_s6_calculate_s_and_z](img/Les3_s6_calculate_s_and_z.png)

Since we have two equations and two unknowns, s and z, we can solve this system.

If we subtract the first equation from the second one, we can get the scale.

![Les3_s11_scale_derivation](img/Les3_s11_scale_derivation.png)

For the zero point, since we have already determined s, we can take either equation (for example the first one), substitute the value of s, and solve for z.
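Written out, the two steps above look as follows. This is a sketch of the algebra only, before z is rounded to an integer:

$$
\begin{aligned}
r_{\min} &= s\,(q_{\min} - z) \\
r_{\max} &= s\,(q_{\max} - z) \\
r_{\max} - r_{\min} &= s\,(q_{\max} - q_{\min})
\quad\Rightarrow\quad
s = \frac{r_{\max} - r_{\min}}{q_{\max} - q_{\min}} \\
z &= q_{\min} - \frac{r_{\min}}{s}
\end{aligned}
$$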

We give z the same dtype as the quantized tensor.
In the example, z must be an integer. This is not the same dtype as the scale.

The goal behind this choice is to represent zero in the original range exactly as an integer in the quantized range.
Thanks to that, when you quantize the value zero, it takes the value z in the quantized range.
And what is great is that if you dequantize the value z, it becomes exactly zero again.

![Les3_s7_zero_point_derivation](img/Les3_s7_zero_point_derivation.png)

![Les3_s8_why_make_z_an_integer](img/Les3_s8_why_make_z_an_integer.png)

How we calculate the scale and the zero point in the example:
1. Get the range of the original tensor, i.e. r_min and r_max. Here, r_min = -184 and r_max = 728.6.
2. Get the range of the quantized tensor. We quantize the tensor in `torch.int8`, thus q_min = -128 and q_max = 127.

Knowing that 
- r_min = s (q_min - z) and
- r_max = s (q_max - z),

you get that the scale is equal to 3.58 and the zero point is equal to -77.

We also need to round z and cast it to the correct dtype, since we saw that z has the same dtype as the quantized tensor.

![Les3_s9_calculate_s_and_z_example](img/Les3_s9_calculate_s_and_z_example.png)
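Plugging the example values into the solved equations gives the following arithmetic (a worked sketch; intermediate values are rounded for display):

$$
s = \frac{r_{\max} - r_{\min}}{q_{\max} - q_{\min}}
  = \frac{728.6 - (-184)}{127 - (-128)}
  = \frac{912.6}{255} \approx 3.58
$$

$$
z = \mathrm{round}\!\left(q_{\min} - \frac{r_{\min}}{s}\right)
  = \mathrm{round}\!\left(-128 - \frac{-184}{3.58}\right)
  \approx \mathrm{round}(-76.6) = -77
$$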

The last edge case we need to treat is what happens when the zero point is out of range.

Since we need to cast z to the quantized dtype, such as int8, what should we do when the computed z does not fit in that range?
What if z is less than q_min, or larger than q_max?

- If z < q_min, we set z to be equal to q_min, 
- if z > q_max, we set z to be equal to q_max.

In this way we don’t have overflow and underflow.

![Les3_s10_zero_point_out_of_range](img/Les3_s10_zero_point_out_of_range.png)
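Putting the derivation and the out-of-range rule together, a plain-Python sketch could look like this. The function name is hypothetical and is not the notebook's implementation; rejecting constant inputs is an assumption added here to avoid a division by zero.

```python
def get_scale_and_zero_point(values, q_min=-128, q_max=127):
    """Return (scale, zero_point) for asymmetric linear quantization."""
    r_min, r_max = min(values), max(values)
    if r_max == r_min:
        # Assumption: a constant tensor is out of scope for this sketch.
        raise ValueError("values must span a non-empty range")

    scale = (r_max - r_min) / (q_max - q_min)
    zero_point = q_min - r_min / scale

    # Clamp the zero point to the quantized range, otherwise round it.
    if zero_point < q_min:
        zero_point = q_min
    elif zero_point > q_max:
        zero_point = q_max
    else:
        zero_point = round(zero_point)
    return scale, int(zero_point)


scale, zero_point = get_scale_and_zero_point([-184.0, 0.0, 728.6])
print(round(scale, 2), zero_point)  # 3.58 -77

# All-positive input: the unclamped zero point would be -383.
_, clamped = get_scale_and_zero_point([10.0, 20.0])
print(clamped)  # -128
```

With the extreme values of the example tensor, the sketch reproduces the scale and zero point quoted above.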

Now, consult the notebook:
- notebooks/Les3_quantization_in_depth_2/L2_linear_I_get_scale_and_zero_point.ipynb

In this notebook, a linear quantization function is defined.

The "linear_quantization" function takes only a tensor and returns the quantized tensor, the scale and the zero point.

It uses two previously defined functions:
- "get_q_scales_and_zero_point": takes the tensor and the dtype, and returns the scale and the zero point,
- "linear_q_scale_and_zero_point": takes the tensor, the scale, the zero point, and the dtype, and returns the quantized tensor.

## Symmetric vs asymmetric mode

In this part you will learn about the symmetric mode of linear quantization.

You will also implement quantization at different granularities:
- per tensor, 
- per channel, and 
- per group.

Finally, you will check how to perform inference on the quantized linear layer.

There are two modes in linear quantization.

- The first one is the **asymmetric mode**. \
    This is when we map [r_min, r_max] to [q_min, q_max], which is what we did in the previous lesson.

- The second one is the **symmetric mode**. \
    This is when we map [-r_max, r_max] to [-q_max, q_max], where r_max is the maximum absolute value in the tensor.

![Les3_s12_linear_quantization_mode](img/Les3_s12_linear_quantization_mode.png)

**In the symmetric mode, we do not need to store the zero point, since it is equal to zero.**
This happens because the floating point range and the quantized range are symmetric with respect to zero.

![Les3_s13_linear_quantization_mode_1](img/Les3_s13_linear_quantization_mode_1.png)

Hence, we can simplify the equations from the previous lesson to get the following:

- q = int(round(r/s))
- s = r_max / q_max

The quantized tensor (q) is simply the original tensor (r) divided by the scale (s), which we then round and cast to the dtype of the quantized tensor. \
The scale s is simply r_max / q_max.

For calculating the quantized tensor in the symmetric mode, we now only need to calculate the scale, since the zero point is known to be zero.
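A plain-Python sketch of the symmetric mode, under the assumption that the input contains at least one non-zero value. The function name is hypothetical and the input values are illustrative.

```python
def symmetric_quantize(values, q_max=127):
    """Symmetric linear quantization: s = r_max / q_max and q = round(r / s)."""
    r_max = max(abs(v) for v in values)   # largest absolute value in the input
    if r_max == 0:
        raise ValueError("an all-zero input has no usable scale in this sketch")
    scale = r_max / q_max
    quantized = [round(v / scale) for v in values]   # zero point is 0
    return quantized, scale


quantized, scale = symmetric_quantize([-184.0, 0.0, 728.6])
print(quantized)  # [-32, 0, 127]
```

Because the scale is derived from the largest absolute value, every quantized value falls within [-q_max, q_max] by construction, and 0.0 always maps to 0.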

Notebook:
notebooks/quantization_in_depth_3/L3_linear_II_symmetric_vs_asymmetric.ipynb

### Tradeoffs between symmetric and asymmetric quantization

The tradeoffs between these two linear quantization modes are:

1. The utilization of the quantization range.

When you use asymmetric quantization, the quantization range is fully used. \
When you use the symmetric mode and the floating point range is biased toward one side, part of the quantization range is dedicated to values that we will never see. Think, for example, of the output of a ReLU layer, which is never negative.

2. The second tradeoff is simplicity.

Symmetric mode is much simpler than asymmetric mode.

3. The third tradeoff is memory.

For symmetric quantization, we don’t need to store the zero point.

In practice, we use symmetric quantization when quantizing to eight bits, but when quantizing to fewer bits, such as two, three, or four bits, we often use asymmetric quantization.
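The range-utilization tradeoff can be illustrated with a small experiment. The sketch below uses hypothetical, evenly spaced non-negative values as a stand-in for ReLU outputs and counts how many distinct int8 codes each mode actually produces.

```python
# Hypothetical ReLU-like activations: 1001 evenly spaced values in [0, 100].
values = [i * 0.1 for i in range(1001)]

# Symmetric mode: [-r_max, r_max] is mapped to [-127, 127], zero point is 0.
s_sym = max(abs(v) for v in values) / 127
codes_sym = {round(v / s_sym) for v in values}

# Asymmetric mode: [r_min, r_max] is mapped to [-128, 127].
s_asym = (max(values) - min(values)) / (127 - (-128))
z_asym = round(-128 - min(values) / s_asym)
codes_asym = {round(v / s_asym + z_asym) for v in values}

print(len(codes_sym), len(codes_asym))  # 128 256
```

In this setup the symmetric mode never produces a negative code, so roughly half of the int8 range goes unused, while the asymmetric mode uses all 256 codes.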

## Finer granularity for more precision

The more granular the quantization is, the more accurate it will be. \
However, note that it requires more memory since we need to store more quantization parameters.

There are different levels of granularity when it comes to quantization.
There is:

1. per tensor quantization, 
2. per channel quantization,
3. per group quantization.

We don’t have to use the same scale and zero point for a whole tensor.
**Per tensor quantization** is what we did so far. \
We can, for instance, calculate a scale and a zero point for each row or each column, i.e. along one axis. This is called **per channel quantization**. \
We could also split the tensor into groups of n elements and quantize each group with its own scale and zero point, which is called **per group quantization**.

![Les3_s14_granularity](img/Les3_s14_granularity.png)

### Per Tensor Quantization

ref. notebooks/Les3_quantization_in_depth_4/L2_linear_II_per_channel.ipynb.

In this notebook, the test tensor from the previous labs is used. \
Symmetric quantization is performed on this tensor.

Afterwards, the dequantized tensor and the quantization error are calculated.

The quantization error with symmetric quantization on the test tensor is about 2.5.
When we used asymmetric quantization, we had a quantization error of about 1.5.

### Per Channel Quantization

We need to store the scales and the zero points
- for each row, if we decide to quantize along the rows, and 
- for each column, if we decide to quantize along the columns.

The memory needed to store all these quantization parameters is pretty small.
We usually use per channel quantization when quantizing models to eight bits.

![Les3_s15_per_channel_quantization](img/Les3_s15_per_channel_quantization.png)

Let’s code the per channel quantization.
To simplify the work we will restrict ourselves to the symmetric mode of the linear quantization.

ref. notebooks/Les3_quantization_in_depth_5/L3_linear_II_per_channel.ipynb.


For the test_tensor we've used, we get a lower quantization error in both cases (per row and per column quantization) compared to symmetric per tensor quantization.
This is because an outlier value only impacts the channel it is in, instead of the entire tensor.

| quantization type        | symmetric / asymmetric    | quantization error on test tensor |
|:-------------------------|:--------------------------|:----------------------------------|
| per tensor               | asymmetric                | 1.5                               |
| per tensor               | symmetric                 | ~2.5                              |
| per channel - per row    | symmetric                 | 1.8                               |
| per channel - per column | symmetric                 | 1.078                             |
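To make the outlier argument concrete, here is a plain-Python sketch of per row symmetric quantization compared with a single per tensor scale. The weight matrix and the helper names are hypothetical, and the sketch assumes no row is entirely zero.

```python
def quantize_per_row_symmetric(matrix, q_max=127):
    """Quantize each row with its own scale (symmetric mode, zero point = 0)."""
    quantized, scales = [], []
    for row in matrix:
        scale = max(abs(v) for v in row) / q_max   # one scale per row
        scales.append(scale)
        quantized.append([round(v / scale) for v in row])
    return quantized, scales


def mse_after_roundtrip(matrix, quantized, scales):
    """Dequantize with the per-row scales and return the mean squared error."""
    errors = [(v - s * q) ** 2
              for row, q_row, s in zip(matrix, quantized, scales)
              for v, q in zip(row, q_row)]
    return sum(errors) / len(errors)


# Hypothetical weights: the second row contains much larger values (outliers).
weights = [[0.4, -2.0, 1.2],
           [100.0, 40.0, -20.0]]

q_rows, row_scales = quantize_per_row_symmetric(weights)
per_row_error = mse_after_roundtrip(weights, q_rows, row_scales)

# Per tensor baseline: a single scale shared by every row.
tensor_scale = max(abs(v) for row in weights for v in row) / 127
q_tensor = [[round(v / tensor_scale) for v in row] for row in weights]
per_tensor_error = mse_after_roundtrip(weights, q_tensor,
                                       [tensor_scale] * len(weights))

print(q_rows)                            # [[25, -127, 76], [127, 51, -25]]
print(per_row_error < per_tensor_error)  # True
```

With a shared scale, the large values in the second row force a coarse step size on the first row as well. With one scale per row, the first row keeps a fine step size.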

### Per Group Quantization

Let's go even smaller and do per group quantization. In per group quantization we perform quantization on groups of n elements.
Common values for n are 32, 64, or 128. 

![per_group_quantization](img/Les3_s16_per_group_quantization.png)
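A plain-Python sketch of per group quantization on a flat list of values. The function name is hypothetical, the default `q_max=7` assumes a signed 4-bit range of [-7, 7], and the group size of 3 is chosen only to keep the example short.

```python
def quantize_per_group_symmetric(values, group_size, q_max=7):
    """Split a flat list into groups and quantize each group with its own scale.

    q_max=7 assumes a signed 4-bit range of [-7, 7] (an assumption of this sketch).
    """
    if len(values) % group_size != 0:
        raise ValueError("len(values) must be divisible by group_size")
    quantized, scales = [], []
    for start in range(0, len(values), group_size):
        group = values[start:start + group_size]
        scale = max(abs(v) for v in group) / q_max   # one scale per group
        scales.append(scale)
        quantized.extend(round(v / scale) for v in group)
    return quantized, scales


# Hypothetical values, quantized in groups of 3 to keep the example short.
values = [0.4, -2.0, 1.2, 100.0, 40.0, -20.0]
quantized, scales = quantize_per_group_symmetric(values, group_size=3)
print(quantized)    # [1, -7, 4, 7, 3, -1]
print(len(scales))  # 2
```

Each group stores one extra scale, which is where the additional memory cost of per group quantization comes from.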

Per group quantization can require a lot of memory. Let's say we want to quantize a tensor to four bits, and we choose a group size of 32.
We use the symmetric mode, which means the zero point is equal to zero, and we store the scales in FP16.

This means we are actually quantizing the tensor to 4.5 bits per element:
- each element is stored in 4 bits,
- we store one scale in FP16 (16 bits) for every group of 32 elements, which adds 16/32 = 0.5 bits per element.
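The same arithmetic can be checked for other group sizes with a small helper (the function name is hypothetical, and the symmetric mode is assumed, so no zero points are stored):

```python
def effective_bits_per_element(element_bits, scale_bits, group_size):
    """Storage cost per element when each group stores one scale (symmetric mode)."""
    return element_bits + scale_bits / group_size


print(effective_bits_per_element(4, 16, 32))   # 4.5
print(effective_bits_per_element(4, 16, 64))   # 4.25
print(effective_bits_per_element(4, 16, 128))  # 4.125
```

Larger groups reduce the overhead of the scales, at the cost of coarser quantization parameters.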

| quantization type        | symmetric / asymmetric    | quantization error on test tensor |
|:-------------------------|:--------------------------|:----------------------------------|
| per tensor               | asymmetric                | 1.5                               |
| per tensor               | symmetric                 | ~2.5                              |
| per channel - per row    | symmetric                 | 1.8                               |
| per channel - per column | symmetric                 | 1.078                             |
| per group - groups of 3  | symmetric                 | 2.15                              |

## Quantizing Weights & Activations for Inference

How to perform inference with linear quantization?

In a neural network, you can quantize the weights, but you can also quantize the activations.
Depending on what we quantize, the storage and the computation are not the same.

- If you only quantize the weights, the computation uses floating point arithmetic, i.e. floating point 32, floating point 16 or bfloat16.
- If you also quantize the activations, the computation uses integer arithmetic.

In the first case, where you only quantize the weights (for example to int8), note that you need to dequantize the weights to perform the floating point computation.

In the second case, remark that integer arithmetic is not supported by all hardware.

![Les3_s17_quantizing_weights_and_activations_for_inference](img/Les3_s17_quantizing_weights_and_activations_for_inference.png)
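A minimal sketch of the weight-only case in plain Python: the weights are stored as int8 codes with one scale per row, dequantized on the fly, and multiplied with floating point activations. The function name, the per row symmetric scheme, the omission of a bias term, and all values are assumptions made for illustration.

```python
def linear_weight_only(int8_weights, scales, activations):
    """Compute y = W x where W is stored as int8 rows plus one scale per row."""
    outputs = []
    for q_row, scale in zip(int8_weights, scales):
        dequantized_row = [scale * q for q in q_row]   # int8 -> floating point
        outputs.append(sum(w * x for w, x in zip(dequantized_row, activations)))
    return outputs


# Hypothetical per row symmetric quantization of
# [[0.4, -2.0, 1.2], [100.0, 40.0, -20.0]].
int8_weights = [[25, -127, 76], [127, 51, -25]]
scales = [2.0 / 127, 100.0 / 127]
activations = [1.0, 0.5, -1.0]          # kept in floating point

outputs = linear_weight_only(int8_weights, scales, activations)
print(outputs)  # close to the unquantized result [-1.8, 140.0]
```

Only the storage of the weights changes here. The multiply-accumulate itself still runs in floating point, which is why this case works on hardware without integer matrix multiplication support.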

## Custom build an 8-bit quantizer

In this part, you will leverage the tools that you have just built to create your own quantizer, which can quantize any model to eight-bit precision using the per channel linear quantization scheme.
This quantizer is modality agnostic, meaning you can apply it to vision, audio, text, and even multimodal models.

For that, we'll break down the project into multiple sub steps.

1. We will deep dive into creating a W8A16 linear layer class, where "W8" stands for eight-bit weights and "A16" stands for 16-bit activations.
2. We will use this class to store eight-bit weights and scales.
3. We will see how we can replace all instances of `torch.nn.Linear` layers with that new class.
4. We will build a quantizer to quantize our model end to end.
5. We will test our quantizer in many scenarios.
6. We will study the impact of eight-bit quantization on different models.

ref. Les3_quantization_in_depth_8/L4_building_quantizer_custom_quantizer.ipynb

## Replace PyTorch layers with Quantized layers

ref. Les3_quantization_in_depth_9/L4_building_quantizer_replace_layers.ipynb

## Quantize any Open Source PyTorch Model

Let's test our implementation on models that you can find on Hugging Face Transformers.

ref. Les3_quantization_in_depth_10/L4_building_quantizer_quantize_models.ipynb