## What is Quantization?

**Quantization** is a technique in machine learning that involves reducing the precision of the numerical values used by a model. Typically, models are trained using **32-bit floating-point (FP32)** precision, but quantization reduces this to lower precision formats such as **16-bit floating-point (FP16)** or **8-bit integer (INT8)**. 

### Why is Quantization Done?

Quantization is particularly valuable in situations where **memory** and **computational resources** are limited. The main benefits include:

- **Reducing model size:** By lowering the precision of the parameters, the overall size of the model decreases, allowing it to fit into devices with limited storage, such as mobile phones or embedded systems.
- **Speeding up inference:** Lower-precision arithmetic is typically faster on hardware with native support for it, which reduces inference time, crucial for **real-time applications** and **low-latency systems**.
- **Lower power consumption:** This is especially important in **edge devices** and **IoT (Internet of Things)** environments where battery life and energy efficiency are key constraints.
- **Deploying on mobile and embedded devices:** Quantization enables large, complex models to be run on devices like **smartphones, tablets**, and **autonomous vehicles**, where full precision models would otherwise be too slow or resource-hungry.
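
To make the size reduction concrete, here is a minimal sketch that compares the memory footprint of the same number of parameters stored at different precisions. The array names and the one-million-parameter size are illustrative assumptions, not part of the original notebook:

```python
import numpy as np

# Hypothetical weight tensor with one million parameters (illustrative size only).
weights_fp32 = np.zeros(1_000_000, dtype=np.float32)
weights_fp16 = weights_fp32.astype(np.float16)
weights_int8 = weights_fp32.astype(np.int8)

for name, arr in [("FP32", weights_fp32), ("FP16", weights_fp16), ("INT8", weights_int8)]:
    # nbytes is the raw buffer size: element count times bytes per element.
    print(f"{name}: {arr.nbytes / 1e6:.1f} MB")
# FP32: 4.0 MB
# FP16: 2.0 MB
# INT8: 1.0 MB
```

Going from FP32 to INT8 cuts storage by a factor of four, before counting the small overhead of storing a scale (and possibly a zero point) per tensor.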

### Key Processes Involved in Quantization:

1. **Training or Post-Training Quantization**: This can either be applied during the training phase, where the model learns to adapt to lower precision weights, or as a **post-training** step where the model is quantized after being fully trained.
   
2. **Quantization-Aware Training (QAT)**: This involves simulating the effects of quantization during training, so the model can compensate for the reduced precision and maintain higher accuracy.

3. **Post-Training Quantization (PTQ)**: A simpler approach where a pre-trained model is quantized without re-training, typically used when computational resources for re-training are limited.

4. **Dynamic and Static Quantization**: In **dynamic quantization**, weights are quantized ahead of time while activations are quantized on the fly during inference, using ranges computed from each input. **Static quantization** quantizes both weights and activations ahead of time, with activation ranges fixed from calibration data.
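
The difference between static and dynamic handling of activations can be sketched in a few lines of NumPy. This is a simplified illustration, not a framework implementation: `activation_scale`, `calibration_batch`, and `runtime_batch` are hypothetical names, and the random data stands in for real activations.

```python
import numpy as np

def activation_scale(x, bits=8):
    # Symmetric scale: map the largest magnitude to the largest signed integer.
    return np.max(np.abs(x)) / (2 ** (bits - 1) - 1)

rng = np.random.default_rng(0)
calibration_batch = rng.normal(size=1000)     # hypothetical calibration activations
runtime_batch = 3.0 * rng.normal(size=1000)   # hypothetical inference-time activations

# Static: the scale is fixed ahead of time from calibration data and reused for every input.
static_scale = activation_scale(calibration_batch)

# Dynamic: the scale is recomputed from the activations actually seen at inference.
dynamic_scale = activation_scale(runtime_batch)

print(f"static scale:  {static_scale:.4f}")
print(f"dynamic scale: {dynamic_scale:.4f}")
```

The static scale costs nothing at inference time but can clip inputs whose range differs from the calibration data; the dynamic scale adapts to each input at the price of computing the range on the fly.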

---

In this notebook, we'll explore how to implement quantization from scratch, examining the core steps involved in reducing precision while keeping the model's performance intact.


## Simulating a Tensor with Random Values

In this section, we generate a simple array of random numbers to simulate what a **tensor** might look like in practice. Using **NumPy**, we create an array with values drawn from a uniform distribution between **-20** and **200**. A few elements are manually adjusted to simulate specific scenarios, making debugging easier. 

This array gives us a basic structure to work with as we move forward in exploring quantization techniques.


In [1]:
import numpy as np 

np.set_printoptions(suppress = True)

# Generating a random distribution of "tensor" values
params = np.random.uniform(low = -20, high = 200, size = 15)

In [2]:
# modifying a few values for easy debugging
params[0] = params.max() + 1
params[1] = params.min() - 1
params[2] = 0

In [3]:
params = np.round(params,4)
params

array([199.9341,  -9.1758,   0.    ,   7.1529, 123.2222, 101.2439,
       182.7486,  88.2835,  58.6505,  -8.1758,  -4.9222, 159.3742,
       110.0036, 159.674 , 198.9341])
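
The array above changes on every run because no seed is set. A minimal sketch of a reproducible variant, assuming NumPy's `default_rng` generator API; the seed `42` and the name `params_seeded` are illustrative and not part of the original notebook:

```python
import numpy as np

# Hypothetical seed value; any fixed integer works.
rng = np.random.default_rng(seed=42)
params_seeded = rng.uniform(low=-20, high=200, size=15)

# Re-creating the generator with the same seed reproduces the same values.
rng_again = np.random.default_rng(seed=42)
params_again = rng_again.uniform(low=-20, high=200, size=15)

print(np.array_equal(params_seeded, params_again))  # True
```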

## Clamping Values for Controlled Range

One of the important steps in quantization is ensuring that the values in a tensor remain within a specified range. This is where **clamping** comes into play. Clamping restricts the values in an array to lie within defined **lower** and **upper bounds**.

By applying clamping, any values below the lower bound are set to the lower limit, and any values above the upper bound are set to the upper limit. In quantization, this guarantees that every rounded value fits in the target integer range, so out-of-range values saturate at the bounds instead of overflowing.

The clamping function allows us to keep our tensor's values within a well-defined range, preparing the data for the quantization process.


In [4]:
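def clamping(params_arr, lower_bound, upper_bound):
    # Note: modifies params_arr in place and returns the same array
    params_arr[params_arr < lower_bound] = lower_bound  # saturate low values
    params_arr[params_arr > upper_bound] = upper_bound  # saturate high values
    return params_arr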
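
A short usage sketch of the clamping step, compared with NumPy's built-in `np.clip`. The function is restated so the example runs on its own; the `values` array is made up for illustration.

```python
import numpy as np

def clamping(params_arr, lower_bound, upper_bound):
    # Same logic as the notebook's function; it edits the array in place.
    params_arr[params_arr < lower_bound] = lower_bound
    params_arr[params_arr > upper_bound] = upper_bound
    return params_arr

values = np.array([-5.0, 0.0, 42.0, 300.0])

clamped = clamping(values.copy(), 0, 255)  # copy so `values` is left untouched
clipped = np.clip(values, 0, 255)          # built-in equivalent, returns a new array

print(clamped)
print(clipped)
print(np.array_equal(clamped, clipped))    # True: both give 0, 0, 42, 255
```

Because `clamping` writes into the array it receives, passing a copy (or a freshly computed array, as the quantization functions do) avoids silently changing the original data.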

## Asymmetric Quantization and Dequantization

Asymmetric quantization maps a floating-point range onto an unsigned integer range, using a **zero point** so that the floating-point range does not have to be centered on zero. This makes good use of the available integer levels when the data is skewed, which is particularly useful for deploying models on resource-constrained devices.

### Asymmetric Quantization

The quantization process can be defined as follows:

1. **Calculate Scale and Zero Point**:
   - **Alpha**: The maximum value in the tensor.
   - **Beta**: The minimum value in the tensor.
   - **Scale**: 
   $$
   \text{scale} = \frac{\alpha - \beta}{2^{\text{bits}} - 1}
   $$
   - **Zero Point**: 
   $$
   \text{zero\_point} = -1 \cdot \text{round}\left(\frac{\beta}{\text{scale}}\right)
   $$

2. **Define Lower and Upper Bounds**:
   - The **lower bound** and **upper bound** for quantization are defined as:
   $$
   \text{lower\_bound} = 0
   $$
   $$
   \text{upper\_bound} = 2^{\text{bits}} - 1
   $$
   These bounds define the range of integer values that can be represented after quantization.

3. **Quantization**:
   - The original parameters are scaled and shifted using the calculated scale and zero point:
   $$
   \text{quantized} = \text{clamping}\left(\text{round}\left(\frac{\text{params}}{\text{scale}} + \text{zero\_point}\right), \text{lower\_bound}, \text{upper\_bound}\right)
   $$

### Asymmetric Dequantization

The dequantization process recovers an approximation of the original floating-point values from the quantized integers by reversing the quantization formula (the rounding step cannot be undone, so the result is not exact):
$$
\text{dequantized} = \text{scale} \times (\text{params\_q} - \text{zero\_point})
$$

These functions allow us to efficiently convert between floating-point and quantized representations, facilitating better performance and lower resource usage in machine learning models.


In [11]:
def asymmetric_quantization(params, bits):
    alpha = np.max(params) # the largest value in our "tensor"
    beta = np.min(params) # smallest value in our "tensor"
    scale = (alpha - beta) / (2**bits - 1) # here we can also use a min/max scaler
    zero_point = -1*np.round(beta / scale)
    lower_bound, upper_bound = 0, (2**bits - 1)
    # Quantization 
    quantized = clamping(np.round(params/scale + zero_point), lower_bound, upper_bound).astype(np.int32)
    return quantized, scale, zero_point

def asymmetric_dequantization(params_q, scale, zero_point):
    return scale * (params_q - zero_point)
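
To make the formulas concrete, here is a worked example for a single value, using the maximum and minimum of the `params` array printed earlier. It is a sketch that follows the formulas above; `np.clip` stands in for the notebook's `clamping` function, and the names `x`, `x_q`, and `x_hat` are illustrative.

```python
import numpy as np

alpha, beta, bits = 199.9341, -9.1758, 8   # max and min of the notebook's `params`

scale = (alpha - beta) / (2**bits - 1)     # width of one integer step
zero_point = -1 * np.round(beta / scale)   # integer that represents 0.0
print(scale, zero_point)                   # approximately 0.82 and 11.0

# Quantize and dequantize one value: 7.1529, the fourth entry of `params`.
x = 7.1529
x_q = int(np.clip(np.round(x / scale + zero_point), 0, 2**bits - 1))
x_hat = scale * (x_q - zero_point)
print(x_q, round(x_hat, 2))                # 20 and 7.38

# The float value 0.0 maps exactly onto the zero point and back.
zero_q = int(np.round(0.0 / scale + zero_point))
print(zero_q, scale * (zero_q - zero_point))
```

The value 7.1529 comes back as roughly 7.38: the difference is the rounding error, which is at most half a step (`scale / 2`) for values inside the range.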

## Symmetric Quantization and Dequantization

Symmetric quantization is a technique used to efficiently represent tensor values in a lower-bit integer format while ensuring that both positive and negative values are equally represented. This method is especially useful in scenarios where the tensor values are symmetrically distributed around zero.

### Symmetric Quantization

The quantization process can be defined as follows:

1. **Calculate Scale**:
   - **Alpha**: The maximum absolute value in the tensor:
   $$
   \alpha = \max(\lvert \text{params} \rvert)
   $$
   - **Scale**: 
   $$
   \text{scale} = \frac{\alpha}{2^{(\text{bits}-1)} - 1}
   $$

2. **Define Lower and Upper Bounds**:
   - The **lower bound** and **upper bound** for quantization are defined as:
   $$
   \text{lower\_bound} = -\left(2^{(\text{bits} - 1)} - 1\right)
   $$
   $$
   \text{upper\_bound} = 2^{(\text{bits} - 1)} - 1
   $$
   These bounds define the range of integer values that can be represented after quantization.

3. **Quantization**:
   - The original parameters are scaled and rounded using the calculated scale:
   $$
   \text{quantized} = \text{clamping}\left(\text{round}\left(\frac{\text{params}}{\text{scale}}\right), \text{lower\_bound}, \text{upper\_bound}\right)
   $$

### Symmetric Dequantization

The dequantization process recovers an approximation of the original floating-point values from the quantized integers using the following formula:
$$
\text{dequantized} = \text{scale} \times \text{params\_q}
$$

These functions allow us to efficiently convert between floating-point and quantized representations, enhancing model performance and reducing resource usage on various platforms.


In [37]:
def symmetric_quantization(params, bits):
    alpha = np.max(np.abs(params)) # the max absolute value
    scale = alpha / (2**(bits-1) - 1)
    lower_bound, upper_bound = -(2**(bits - 1) - 1), 2**(bits-1) - 1
    # Quantization 
    quantized = clamping(np.round(params/scale), lower_bound, upper_bound).astype(np.int32)
    return quantized, scale

def symmetric_dequantization(params_q, scale):
    return scale * params_q
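
As a worked example of the symmetric formulas, the sketch below quantizes a single value using the largest absolute value of the `params` array printed earlier. It assumes the symmetric integer range is ±(2^(bits-1) - 1), uses `np.clip` in place of the notebook's `clamping` function, and the names `q_max`, `x`, `x_q`, `x_hat`, and `levels_used` are illustrative.

```python
import numpy as np

alpha, bits = 199.9341, 8      # largest absolute value in the notebook's `params`
q_max = 2**(bits - 1) - 1      # 127 for 8 bits
scale = alpha / q_max

x = -9.1758                    # most negative entry of `params`
x_q = int(np.clip(np.round(x / scale), -q_max, q_max))
x_hat = scale * x_q
print(x_q, round(x_hat, 2))    # -6 and -9.45

# Integer levels reachable for data lying in [x, alpha].
levels_used = q_max - x_q + 1
print(levels_used, "of", 2 * q_max + 1)   # 134 of 255
```

With this skewed data, almost half of the integer range (everything below -6) is never used, which is why the symmetric step size is nearly twice the asymmetric one for the same array.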

## Quantization Error

Quantization error refers to the **difference** between the original floating-point values of a tensor and their quantized representations. This error is crucial to assess as it significantly impacts the model's **accuracy** after quantization.

In this case, we use **Mean Squared Error (MSE)** to quantify this error:

- **MSE** helps us understand how much **information is lost** during the quantization process.
- Minimizing this error is essential for maintaining the model's performance, especially when deploying to resource-constrained environments like **mobile devices** or **edge devices**.

By evaluating quantization error, we can make informed decisions to enhance the effectiveness of the quantization process.


In [23]:
def quantization_error(params, params_q):
    # we can calculate any form of loss here; this one is MSE
    return np.mean((params - params_q)**2)
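
A small usage sketch of the error function, alongside a second metric (maximum absolute error) for comparison. The three original values and their dequantized counterparts are copied from the asymmetric results printed later in this notebook; the names `original`, `dequantized`, `max_abs`, and `half_step` are illustrative.

```python
import numpy as np

def quantization_error(params, params_q):
    # Mean squared error between original and dequantized values.
    return np.mean((params - params_q) ** 2)

original = np.array([0.0, 7.1529, 123.2222])
dequantized = np.array([0.0, 7.38, 123.01])   # rounded asymmetric dequantized values

mse = quantization_error(original, dequantized)
max_abs = np.max(np.abs(original - dequantized))
print(f"MSE: {mse:.4f}, max absolute error: {max_abs:.4f}")

# With round-to-nearest and no clamping, each error is at most half a step.
half_step = 0.8200388235294118 / 2            # asymmetric scale from the notebook output
print(max_abs <= half_step)                   # True
```

MSE summarizes the average distortion, while the maximum absolute error shows the worst single value; the two can rank quantization schemes differently when a few values are clipped.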

## Seeing How It Plays Out

In this section, we demonstrate the **quantization** process by converting floating-point values (FP) to integer (INT) representations using both **asymmetric** and **symmetric quantization** methods.

### Asymmetric Quantization
- **Calibration**: Floating-point values are scaled and shifted to fit within a specific integer range.
- **Dequantization**: When converting back from INT to FP, we observe a slight difference between the original and dequantized values. This is due to the loss of precision inherent in the quantization process, especially with non-uniform data distributions.

### Symmetric Quantization
- **Calibration**: In this method, the data is scaled without a zero point, assuming a symmetric distribution around zero.
- **Dequantization**: Similarly, converting back results in some loss of information. While the symmetric method is simpler, it can result in higher error when the data is not symmetrically distributed.

### Observation
Both methods result in reduced precision after dequantization, showcasing how quantization compresses data but slightly distorts the original values.


In [24]:
(asymmetric_q, asymmetric_scale, asymmetric_zero) = asymmetric_quantization(params, 8)
(symmetric_q, symmetric_scale) = symmetric_quantization(params, 8)

print(f'Original:')
print(np.round(params, 2))
print('')
print(f'Asymmetric scale: {asymmetric_scale}, zero: {asymmetric_zero}')
print(asymmetric_q)
print('')
print(f'Symmetric scale: {symmetric_scale}')
print(symmetric_q)

Original:
[199.93  -9.18   0.     7.15 123.22 101.24 182.75  88.28  58.65  -8.18
  -4.92 159.37 110.   159.67 198.93]

Asymmetric scale: 0.8200388235294118, zero: 11.0
[255   0  11  20 161 134 234 119  83   1   5 205 145 206 254]

Symmetric scale: 1.5742842519685039
[127  -6   0   5  78  64 116  56  37  -5  -3 101  70 101 126]


In [25]:
# Dequantize the parameters back to floating point
params_deq_asymmetric = asymmetric_dequantization(asymmetric_q, asymmetric_scale, asymmetric_zero)
params_deq_symmetric = symmetric_dequantization(symmetric_q, symmetric_scale)

print(f'Original:')
print(np.round(params, 2))
print('')
print(f'Dequantize Asymmetric:')
print(np.round(params_deq_asymmetric,2))
print('')
print(f'Dequantize Symmetric:')
print(np.round(params_deq_symmetric, 2))

Original:
[199.93  -9.18   0.     7.15 123.22 101.24 182.75  88.28  58.65  -8.18
  -4.92 159.37 110.   159.67 198.93]

Dequantize Asymmetric:
[200.09  -9.02   0.     7.38 123.01 100.86 182.87  88.56  59.04  -8.2
  -4.92 159.09 109.89 159.91 199.27]

Dequantize Symmetric:
[199.93  -9.45   0.     7.87 122.79 100.75 182.62  88.16  58.25  -7.87
  -4.72 159.   110.2  159.   198.36]


### Error calculation

In [28]:
# Calculate the quantization error
print(f'{"Asymmetric error: ":>20}{np.round(quantization_error(params, params_deq_asymmetric), 2)}')
print(f'{"Symmetric error: ":>20}{np.round(quantization_error(params, params_deq_symmetric), 2)}')

  Asymmetric error: 0.05
   Symmetric error: 0.15


### Playing around

In [30]:
# Generate 10 random positive values and 10 random negative values
positive_values = np.random.uniform(low=0, high=20, size=10)
negative_values = np.random.uniform(low=-20, high=0, size=10)

# Concatenate the arrays to simulate symmetry
params_sy = np.concatenate((negative_values, positive_values))

# For an additional step, you can shuffle the values if needed
# np.random.shuffle(params_sy)

print("Symmetric tensor-like array:")
print(params_sy)

Symmetric tensor-like array:
[-16.88845729 -14.73619101  -9.84795077  -2.16624371 -13.58845586
 -13.89295657  -5.13396663  -7.43403557  -2.02378329  -4.6139564
  13.66297264  12.94906411   4.74121694   5.52796603   5.09095534
   8.92738701   7.37523507   2.56579524  16.80063994  13.47113548]


In [31]:
(symmetric_q, symmetric_scale) = symmetric_quantization(params_sy, 8)

In [32]:
print(f'Original:')
print(np.round(params_sy, 2))
print('')
print(f'Symmetric scale: {symmetric_scale}')
print(symmetric_q)

Original:
[-16.89 -14.74  -9.85  -2.17 -13.59 -13.89  -5.13  -7.43  -2.02  -4.61
  13.66  12.95   4.74   5.53   5.09   8.93   7.38   2.57  16.8   13.47]

Symmetric scale: 0.13297997865189634
[-127 -111  -74  -16 -102 -104  -39  -56  -15  -35  103   97   36   42
   38   67   55   19  126  101]


In [33]:
params_deq_symmetric = symmetric_dequantization(symmetric_q, symmetric_scale)
print(f'Original:')
print(np.round(params_sy, 2))
print('')
print(f'Dequantize Symmetric:')
print(np.round(params_deq_symmetric, 2))

Original:
[-16.89 -14.74  -9.85  -2.17 -13.59 -13.89  -5.13  -7.43  -2.02  -4.61
  13.66  12.95   4.74   5.53   5.09   8.93   7.38   2.57  16.8   13.47]

Dequantize Symmetric:
[-16.89 -14.76  -9.84  -2.13 -13.56 -13.83  -5.19  -7.45  -1.99  -4.65
  13.7   12.9    4.79   5.59   5.05   8.91   7.31   2.53  16.76  13.43]


In [36]:
print(f'{"Symmetric error: ":>20}{np.round(quantization_error(params_sy, params_deq_symmetric), 5)}')

   Symmetric error: 0.00159


#### Because this data is roughly symmetric around zero and spans a much smaller range (so the scale is much smaller), we can see how the error is very low.
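
As a rough sanity check on the three error values reported in this notebook, we can compare them with the standard uniform-rounding approximation, MSE ≈ scale² / 12. This assumes the rounding errors are spread evenly across one quantization step, which holds only loosely for arrays of 15 to 20 values. The scales and measured errors below are copied from the printed outputs; the dictionary names are illustrative.

```python
# Scales and measured MSE values copied from the notebook's printed output.
runs = {
    "asymmetric, skewed data": (0.8200388235294118, 0.05),
    "symmetric, skewed data": (1.5742842519685039, 0.15),
    "symmetric, balanced data": (0.13297997865189634, 0.00159),
}

# Uniform-rounding approximation: error spread evenly over [-scale/2, scale/2].
predicted = {name: scale**2 / 12 for name, (scale, _) in runs.items()}

for name, (scale, measured_mse) in runs.items():
    print(f"{name}: predicted {predicted[name]:.5f}, measured {measured_mse}")
```

The predictions come out at roughly 0.056, 0.207, and 0.0015, the same order of magnitude as the measured values. The takeaway is that the error is driven mainly by the step size: halving the scale cuts the expected MSE by about a factor of four.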