
In [1]:
import numpy as np
import matplotlib.pyplot as plt
import torch
import torch.nn as nn
import torch.optim as optim
import torchvision
import torchvision.transforms as transforms
from torch.utils.data import DataLoader
import sys
import io

## What is Quantization and why do we use it?

Quantization is a technique for performing computations and storing tensors at lower bitwidths than floating-point precision. A quantized model executes some or all of its operations on tensors with reduced precision rather than full-precision (floating-point) values <sup>[1](#1)</sup>. This allows for a more compact model representation and the use of high-performance vectorized operations on many hardware platforms. PyTorch supports INT8 quantization, which, compared to typical FP32 models, allows for a 4x reduction in model size and a 4x reduction in memory bandwidth requirements. Hardware support for INT8 computations is typically 2 to 4 times faster than FP32 compute. Quantization is primarily a technique to speed up inference, and only the forward pass is supported for quantized operators <sup>[1](#1)</sup>. PyTorch provides two different modes of quantization: Eager Mode Quantization and FX Graph Mode Quantization <sup>[1](#1)</sup>.
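
The 4x size figure follows directly from the element widths: an FP32 value occupies 4 bytes and an INT8 value occupies 1 byte. The sketch below is only an illustration of that arithmetic. It uses Python's standard `array` module so it runs without extra dependencies, and `num_weights` is a hypothetical layer size chosen for the example. Real quantized models also store scale factors and other metadata, so the measured ratio can be slightly lower than 4x.

```python
from array import array

# Hypothetical number of weights, used only to compare raw storage cost.
num_weights = 1_000

fp32_weights = array("f", [0.0] * num_weights)  # 'f' = 4-byte C float
int8_weights = array("b", [0] * num_weights)    # 'b' = 1-byte signed integer

# Raw storage = number of elements * bytes per element.
fp32_bytes = len(fp32_weights) * fp32_weights.itemsize
int8_bytes = len(int8_weights) * int8_weights.itemsize

print(f"FP32: {fp32_bytes} bytes")
print(f"INT8: {int8_bytes} bytes")
print(f"Reduction: {fp32_bytes / int8_bytes:.0f}x")
```

The same comparison holds for NumPy or PyTorch tensors with `float32` and `int8` dtypes, since the per-element widths are identical.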


Quantization in TensorFlow refers to the process of reducing the precision of the weights and/or activations of a neural network model. The goal is to represent the model with lower bit precision (e.g., 8-bit integers) rather than the standard 32-bit floating-point numbers. This reduction in precision can lead to more efficient model deployment, especially on hardware with limited computational resources, such as edge and mobile devices <sup>[2](#2)</sup>.

[1]<a class="anchor" id="1"></a>: https://pytorch.org/docs/stable/quantization.html


[2]<a class="anchor" id="2"></a>: https://www.tensorflow.org/api_docs/python/tf/quantization/quantize


# Understanding and Applying Quantization

Quantization is a method that can allow models to run faster and use less memory. By converting 32-bit floating-point numbers (the `float32` data type) into lower-precision formats, like 8-bit integers (the `int8` data type), we can reduce the computational requirements of our models. Let's start with the basics and gradually move towards quantizing complex models like CNNs.

### Learning Objectives
1. Explore how to quantize a single variable and a function in PyTorch
1. Apply quantization to a neural network
1. Compare the size and performance of a quantized convolutional neural network

# Section 1 - Quantization

We'll illustrate both 4-bit and 8-bit quantization. For the neural network part, we'll create a simple model and show how to quantize and dequantize its weights, with the code needed for each step.



### Quantization of a Single Value
Quantization is the process of mapping inputs from a large set of values onto a smaller set. In the context of deep learning, it's used to reduce the precision of the weights and activations of neural network models. This can help to reduce the memory footprint and computational cost of models. Here, we'll start by quantizing a single floating-point number.

We'll define two functions: one to quantize a value and another to unquantize (dequantize) it. The quantize function takes a floating-point number and a number of bits, and outputs an integer representation of the input number. The unquantize function takes that integer and the number of bits, and outputs a floating-point approximation of the original number.

The quantize function assumes input values between -1 and 1, and the unquantize function returns values in the same range. The number of bits determines the precision of the quantization: more bits means higher precision, but more memory per value. For this demonstration, we'll use 4 and 8 bits. Note that the `Quantization` class defined later in this section does not clip its input, so a value outside [-1, 1] produces an integer that no longer fits in the chosen bitwidth.
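
To make the link between bitwidth and precision concrete, here is a minimal sketch that computes the integer range and the spacing between neighbouring representable values in [-1, 1]. It assumes the symmetric signed scheme used by the `Quantization` class in this section, where the largest integer is `2**(bits - 1) - 1`. The helper name `signed_levels` is introduced here purely for illustration.

```python
def signed_levels(bits: int):
    """Return (largest integer magnitude, step size) for a symmetric signed scheme."""
    q_max = 2 ** (bits - 1) - 1  # 7 for 4 bits, 127 for 8 bits
    step = 1.0 / q_max           # gap between neighbouring dequantized values in [-1, 1]
    return q_max, step


for bits in (4, 8):
    q_max, step = signed_levels(bits)
    print(f"{bits}-bit: integers in [-{q_max}, {q_max}], step = {step:.4f}")
```

With 4 bits the representable values are about 0.14 apart, while with 8 bits they are about 0.008 apart, which is why the 8-bit round trip stays much closer to the original value.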

## Quantization and Dequantization of Floating Point Numbers

Quantization converts floating-point numbers to a lower-bit representation; dequantization maps them back to floating point, recovering an approximation of the original values. The formulas for both steps are:

### Quantization Formula

The quantization formula typically involves rounding the floating-point number to the nearest representable value within the reduced bitwidth. Let `x` be the original floating-point number, `q` be the quantized value, and `S` be the scaling factor:

$$ q = \text{round}(x \times S) $$

### Dequantization Formula

The dequantization process involves converting the quantized value back to the original floating-point precision. Let `q` be the quantized value, `x'` be the dequantized value, and `S` be the scaling factor:

$$ x' = \frac{q}{S} $$

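In these formulas, `S` scales the floating-point number before rounding during quantization, and dividing by the same `S` undoes that scaling during dequantization. `S` is often chosen to be a power of 2 to simplify the arithmetic. The code in this notebook instead uses `S = 2^(b-1) - 1`, the largest positive integer representable with `b` signed bits (7 for 4 bits, 127 for 8 bits).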

It's important to note that the choice of scaling factor and bitwidth greatly influences the precision and range of representable values after quantization and dequantization.
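
As a worked instance of the two formulas, the sketch below quantizes `x = 0.3` with `S = 2**(bits - 1) - 1` and converts it back. The free functions `quantize` and `dequantize` are illustrative helpers written for this example, not part of any library. Because rounding moves `x * S` by at most 0.5, the round-trip error for an in-range value is bounded by `1 / (2 * S)`.

```python
def quantize(x: float, scale: int) -> int:
    """q = round(x * S)"""
    return round(x * scale)


def dequantize(q: int, scale: int) -> float:
    """x' = q / S"""
    return q / scale


x = 0.3
for bits in (4, 8):
    scale = 2 ** (bits - 1) - 1       # S = 7 for 4 bits, S = 127 for 8 bits
    q = quantize(x, scale)            # integer code
    x_hat = dequantize(q, scale)      # floating-point approximation of x
    error = abs(x - x_hat)
    bound = 0.5 / scale               # worst-case rounding error for in-range x
    print(f"{bits}-bit: S={scale}, q={q}, x'={x_hat:.6f}, "
          f"error={error:.6f} (bound {bound:.6f})")
```

For 4 bits this gives `q = 2` and `x' = 2/7`, and for 8 bits `q = 38` and `x' = 38/127`, so the error shrinks as the scaling factor grows.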


#### Creating a Quantization object

In [None]:
class Quantization:
    """
    Quantization class for representing and manipulating quantized values.

    Attributes:
        value (float): The original floating-point value to be quantized.
        bits (int): The number of bits used for quantization.
        quantized_value (int): The quantized value resulting from the quantization process.

    Methods:
        __init__(self, value: float, bits: int):
            Initializes a Quantization object with the given original value and bitwidth.

        quantize(self) -> int:
            Quantizes the original floating-point value and stores the result in 'quantized_value'.
            Returns the quantized value as an integer.

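        unquantize(self) -> float:
            Dequantizes the quantized value and returns a floating-point approximation of the original value.
            Raises a ValueError if 'quantize' has not been called before attempting to unquantize.

        calculate_error(self) -> float:
            Returns the absolute error between the original value and the dequantized value.
            Raises a ValueError if 'quantize' has not been called.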
    """

    def __init__(self, value: float, bits: int):
        """
        Initializes a Quantization object.

        Parameters:
            value (float): The original floating-point value to be quantized.
            bits (int): The number of bits used for quantization.
        """
        self.value = value
        self.bits = bits
        self.quantized_value = None

    def quantize(self) -> int:
        """
        Quantizes the original floating-point value and stores the result in 'quantized_value'.
        
        Returns:
            int: The quantized value as an integer.
        """
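        # Scale by the largest positive integer a signed `bits`-bit value can hold,
        # then round to the nearest integer. No clipping is applied here.
        scale = 2 ** (self.bits - 1) - 1
        self.quantized_value = int(np.round(self.value * scale))
        return self.quantized_value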

    def unquantize(self) -> float:
        """
        Dequantizes the quantized value and returns a floating-point approximation of the original value.

        Returns:
            float: The dequantized value.
        
        Raises:
            ValueError: If 'quantize' has not been called before attempting to unquantize.
        """
        if self.quantized_value is None:
            raise ValueError("Please quantize the value first.")
        unquantized_value = self.quantized_value / (2**(self.bits -1) - 1)
        self.unquantized_value = unquantized_value
        return float(unquantized_value)
    
    def calculate_error(self) -> float:
        """
        Calculates the absolute error between the original value and the dequantized value.

        Returns:
            float: The absolute error.

        Raises:
            ValueError: If 'quantize' has not been called before attempting to calculate the error.
        """
        if self.quantized_value is None:
            raise ValueError("Please quantize the value first.")
        return abs(self.value - self.unquantize())

#### Testing the quantize and unquantize functions with 4 and 8 bits

In [47]:
value = 3.141592653589793

value_4bit             = Quantization(value, bits=4)
quantized_value_4bit   = value_4bit.quantize()
unquantized_value_4bit = value_4bit.unquantize()

value_8bit             = Quantization(value, bits=8)
quantized_value_8bit   = value_8bit.quantize()
unquantized_value_8bit = value_8bit.unquantize()

print(f"Original Value: {value}\n----\n4-bit Quantization:{quantized_value_4bit}\n4-bit Unquantization: {unquantized_value_4bit}\n----\n8-bit Quantization:{quantized_value_8bit}\n8-bit Unquantization: {unquantized_value_8bit}")

Original Value: 3.141592653589793
----
4-bit Quantization:22
4-bit Unquantization: 3.142857142857143
----
8-bit Quantization:399
8-bit Unquantization: 3.141732283464567
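
The test value above lies outside the [-1, 1] input range, so the integers 22 and 399 exceed what 4 and 8 signed bits can hold (at most 7 and 127 in this scheme). One common way to handle this, shown here only as a sketch and not as the notebook's method, is to clip the input to a known range and derive the scale from that range. The function names and the `max_abs` parameter are hypothetical and introduced for this example.

```python
def quantize_clipped(x: float, bits: int, max_abs: float = 1.0):
    """Clip x to [-max_abs, max_abs], then quantize to a signed `bits`-bit integer.

    Returns the integer code and the scale needed to dequantize it.
    """
    q_max = 2 ** (bits - 1) - 1                  # largest integer magnitude
    scale = q_max / max_abs                      # maps max_abs onto q_max
    x_clipped = max(-max_abs, min(max_abs, x))   # saturate out-of-range inputs
    return round(x_clipped * scale), scale


def dequantize_scaled(q: int, scale: float) -> float:
    """Map an integer code back to floating point."""
    return q / scale


value = 3.141592653589793
for bits in (4, 8):
    # Default range [-1, 1]: the value saturates at the largest code.
    q_sat, s_sat = quantize_clipped(value, bits)
    # Assumed wider range [-4, 4]: the value fits, at the cost of a coarser step.
    q_fit, s_fit = quantize_clipped(value, bits, max_abs=4.0)
    print(f"{bits}-bit, range 1.0: q={q_sat}, x'={dequantize_scaled(q_sat, s_sat):.6f}")
    print(f"{bits}-bit, range 4.0: q={q_fit}, x'={dequantize_scaled(q_fit, s_fit):.6f}")
```

With the default range the value saturates to 1.0, which makes the range violation visible instead of silently producing an oversized integer. Widening the range to 4.0 keeps the codes within the signed bitwidth (5 for 4 bits, 100 for 8 bits) while trading away some resolution.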
