# Requirements

Brevitas requires Python 3.7+ and PyTorch 1.5.1+ and can be installed from PyPI with `pip install brevitas`. 

This notebook additionally requires ONNX, onnxruntime, and onnxoptimizer, as well as PyTorch 1.8.1+.

# Introduction

The main goal of this notebook is to show how to use Brevitas to export your models in the two standards currently supported by ONNX for quantized models: QCDQ and QOps (i.e., QLinearConv, QLinearMatMul).
Once exported, these models can be run using onnxruntime.

## QuantizeLinear-Clip-DequantizeLinear (QCDQ)

In QCDQ export, two (or three, in case of clipping) extra ONNX nodes are added before each quantized operation:
- QuantizeLinear: Takes an FP tensor as input and quantizes it with a given zero-point and scale factor. It returns an INT8 tensor.
- Clip (optional): Takes an INT8 tensor as input and restricts its range to the given min/max values.
- DequantizeLinear: Takes an INT8 tensor as input and converts it to its FP counterpart with a given zero-point and scale factor.
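
The arithmetic performed by these three nodes can be sketched in plain Python. This is a minimal illustration, assuming per-tensor quantization and round-half-to-even rounding; the helpers `quantize_linear`, `clip` and `dequantize_linear` are hypothetical names written for this example, not Brevitas or ONNX APIs.

```python
def quantize_linear(x, scale, zero_point=0, signed=True):
    """Mimic QuantizeLinear: FP values -> saturated 8-bit integers."""
    lo, hi = (-128, 127) if signed else (0, 255)
    # Python's round() rounds halves to the nearest even integer.
    return [min(max(round(v / scale) + zero_point, lo), hi) for v in x]


def clip(q, min_val, max_val):
    """Mimic Clip: restrict the integer range to emulate a lower bit width."""
    return [min(max(v, min_val), max_val) for v in q]


def dequantize_linear(q, scale, zero_point=0):
    """Mimic DequantizeLinear: integers -> FP values."""
    return [(v - zero_point) * scale for v in q]


x = [-1.0, -0.26, 0.0, 0.4, 2.0]
scale = 0.25

q = quantize_linear(x, scale)    # 8-bit signed integers
c = clip(q, -3, 3)               # emulate signed 3 bit with narrow range
y = dequantize_linear(c, scale)  # back to floating point
print(q, c, y)
```

The clipped values still fit in an INT8 container, but only the integers between -3 and 3 can appear after the Clip step.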

There are several implications associated with this set of operations:
- It is not possible to quantize with a bitwidth higher than 8. Although DequantizeLinear supports both INT8 and INT32 as input, QuantizeLinear will always output INT8 (either signed or unsigned).
- Using only QuantizeLinear and DequantizeLinear, it is only possible to quantize at 8 bit (signed or unsigned).
- Adding the Clip node between QuantizeLinear and DequantizeLinear makes it possible to quantize a tensor to a bitwidth < 8. This is done by clipping the INT8 tensor coming out of the QuantizeLinear node to the min/max values of the desired bitwidth (e.g., for unsigned 3 bit, min_val = 0 and max_val = 7).
- It is possible to perform both per-tensor and per-channel quantization (per-channel is only supported with ONNX opset >= 13).
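
The difference between per-tensor and per-channel quantization comes down to how many scale factors are used. The following is a rough sketch, assuming symmetric signed 8-bit quantization with a scale derived from the maximum absolute value. That choice is made purely for illustration and is not necessarily how Brevitas initializes its scales; `max_abs_scale` and the weight values are hypothetical.

```python
def max_abs_scale(values, max_int=127):
    """Scale that maps the largest magnitude in `values` to `max_int`."""
    return max(abs(v) for v in values) / max_int


# Hypothetical weight matrix with 2 output channels (rows) and 3 inputs.
weight = [
    [0.5, -1.27, 0.25],     # channel 0: large values
    [0.01, -0.02, 0.0127],  # channel 1: small values
]

# Per-tensor: one scalar scale shared by every element.
per_tensor_scale = max_abs_scale([v for row in weight for v in row])

# Per-channel: one scale per output channel, i.e. a 1-D tensor of scales.
per_channel_scale = [max_abs_scale(row) for row in weight]

print(per_tensor_scale)   # a single float
print(per_channel_scale)  # one float per output channel
```

With the shared per-tensor scale, the small values of channel 1 would map to only a few integer levels, while a dedicated per-channel scale lets each channel use the full integer range.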

We will go through all these cases with some examples.



### Basic Example

First, we will look at `brevitas.nn.QuantLinear`, a quantized alternative to `torch.nn.Linear`. Similar considerations also apply to `QuantConv1d`, `QuantConv2d`, `QuantConvTranspose1d` and `QuantConvTranspose2d`.

Brevitas offers several APIs to export PyTorch modules into different formats, all sharing the same interface.
The three required arguments are:
- The PyTorch model to export
- A representative input
- The path where to save the exported model


In [None]:
%pip install netron

In [1]:
import netron
import time
from IPython.display import IFrame

def show_netron(model_path, port):
    time.sleep(3.)
    netron.start(model_path, address=("localhost", port), browse=False)
    return IFrame(src=f"http://localhost:{port}/", width="100%", height=400)

In [2]:
import brevitas.nn as qnn
import torch
from brevitas.export import export_standard_qcdq_onnx

IN_CH = 3
OUT_CH = 128
BATCH_SIZE = 1

linear = qnn.QuantLinear(IN_CH, OUT_CH, bias=True)
inp = torch.randn(BATCH_SIZE, IN_CH)
path = 'simple_model.onnx'

exported_model = export_standard_qcdq_onnx(linear, args=inp, export_path=path, opset_version=13)


In [31]:
show_netron(path, 8082)

Stopping http://localhost:8082
Serving 'simple_model.onnx' at http://localhost:8082


As can be seen from the exported ONNX graph, in this case only the weights are quantized, and they go through a QuantizeLinear/DequantizeLinear pair before being used by the linear layer. Moreover, there is a Clip operation that sets the min/max values of the tensor to ±127.

This is because in Brevitas, by default, quantized layers (but not quantized activations) have the option `narrow_range=True`.
In the case of signed quantization, this option makes sure that the maximum and minimum representable integers have the same absolute value (otherwise, the minimum integer would be -128).
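
The effect of `narrow_range` on the integer range can be sketched with a small helper. `int_range` is a hypothetical function written for this illustration, not a Brevitas API, and it only models narrow range for the signed case.

```python
def int_range(bit_width, signed, narrow_range=False):
    """Return the (min, max) integers for a given quantization setting.

    Sketch only: narrow range is modelled for the signed case alone.
    """
    if not signed:
        return 0, 2 ** bit_width - 1
    max_int = 2 ** (bit_width - 1) - 1
    # Narrow range drops the most negative value, making the range symmetric.
    min_int = -max_int if narrow_range else -max_int - 1
    return min_int, max_int


print(int_range(8, signed=True, narrow_range=True))   # (-127, 127)
print(int_range(8, signed=True, narrow_range=False))  # (-128, 127)
print(int_range(3, signed=False))                     # (0, 7)
```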


The input and bias remain in floating point, but in QCDQ export this is not a problem, since even the weights, which are quantized at 8 bit, are converted back to FP before being passed as input to the Linear node.

It is important to know that ONNX will automatically decide what nodes to use for each computation, based on the characteristics of the inputs and outputs. For example, if the parameter `bias` is set to False, the `Gemm` node would be replaced by a `MatMul` node. This has no immediate effect on the export flow, the quantization, or the numerical correctness of the resulting model.



### "Complete" Model

A similar approach can be used with entire PyTorch models, rather than single layers.

In [4]:
class Model(torch.nn.Module):
    def __init__(self) -> None:
        super().__init__()
        self.linear = qnn.QuantLinear(IN_CH, OUT_CH, bias=True, return_quant_tensor=True, weight_scaling_per_output_channel=True)
        self.act = qnn.QuantReLU()
    
    def forward(self, inp):
        inp = self.linear(inp)
        inp = self.act(inp)
        return inp

model = Model()
inp = torch.randn(BATCH_SIZE, IN_CH)
path = 'simple_model.onnx'

exported_model = export_standard_qcdq_onnx(model, args=inp, export_path=path, opset_version=13)

In [5]:
show_netron(path, 8082)

Stopping http://localhost:8082
Serving 'simple_model.onnx' at http://localhost:8082


We did not specify the argument `output_quant` in our QuantLinear layer, thus the output of the layer will be passed directly to the ReLU function without any intermediate re-quantization step. 

Furthermore, we have defined a per-channel quantization, so the scale factor will be a Tensor rather than a scalar (ONNX opset >= 13 is required for this).

Finally, since we are using a QuantReLU with default initialization, the output is requantized as an UInt8 Tensor.


### The C in QCDQ (Bitwidth <= 8)

As mentioned, Brevitas export expands on the basic QDQ format by adding the Clip operation.

This operation is inserted between the QuantizeLinear and DequantizeLinear nodes, and thus operates on integers.

Normally, using only the QDQ format, it would be impossible to export models quantized to fewer than 8 bits.

In Brevitas, however, if a quantized layer with bitwidth <= 8 is exported, the Clip node is automatically inserted. It performs a new `saturate` operation, computing the new min and max values based on the particular type of quantization performed (i.e., signed vs unsigned, narrow range vs no narrow range, etc.).

Even though the tensor data type will still be Int8 or UInt8, in practical terms the tensor will represent the desired bitwidth.
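
To see why an Int8 tensor can still represent a lower bitwidth, one can enumerate every Int8 value and clip it. The sketch below assumes the bounds of a signed 3 bit quantization with narrow range; the `clip` helper is hypothetical and written only for this example.

```python
def clip(value, min_val, max_val):
    """Restrict a single integer to the closed interval [min_val, max_val]."""
    return min(max(value, min_val), max_val)


# Every value a signed Int8 tensor can hold.
int8_values = range(-128, 128)

# Bounds for signed 3 bit with narrow range (assumed for illustration).
clipped = {clip(v, -3, 3) for v in int8_values}

print(sorted(clipped))  # [-3, -2, -1, 0, 1, 2, 3]
print(len(clipped))     # 7
```

Only 7 distinct integers survive the Clip step, which fits within the 8 levels available to a 3 bit value, even though the container type is unchanged.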

In [6]:
class Model(torch.nn.Module):
    def __init__(self) -> None:
        super().__init__()
        self.linear = qnn.QuantLinear(IN_CH, OUT_CH, bias=True, return_quant_tensor=True, weight_bit_width=3)
        self.act = qnn.QuantReLU(bit_width=4)
    
    def forward(self, inp):
        inp = self.linear(inp)
        inp = self.act(inp)
        return inp

model = Model()
model.eval()

inp = torch.randn(BATCH_SIZE, IN_CH)
path = 'simple_model.onnx'

exported_model = export_standard_qcdq_onnx(model, args=inp, export_path=path, opset_version=13)

In [7]:
show_netron(path, 8082)

Stopping http://localhost:8082
Serving 'simple_model.onnx' at http://localhost:8082


As can be seen from the generated ONNX, the weights of the Linear layer are clipped between -3 and 3, considering that we are performing a signed, 3 bit quantization, with `narrow_range=True`. 

Similarly, the output of the QuantReLU is clipped between 0 and 15, since in this case we are doing an unsigned 4 bit quantization. 

## QOps Export

Another supported style for exporting quantized operations in ONNX is represented by QOperations (QOps).

Compared to QCDQ, where it is possible to re-use standard floating point layers (e.g., Linear or Conv2d) preceded by QCDQ nodes, in this case the entire layer is replaced with its quantized counterpart.

Unlike with QCDQ, all elements of the computation have to be quantized in this case: input, weight, bias (if present), and output tensors.

This introduces some constraints on how we define our quantized layers through Brevitas.
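
The reason every tensor needs quantization parameters becomes clearer when looking at the kind of arithmetic a quantized operator performs. The following is a simplified pure-Python sketch of an integer matrix multiplication followed by requantization, in the spirit of QLinearMatMul. The function `qlinear_matmul` and all values are hypothetical, and real kernels may differ in rounding and accumulator details.

```python
def qlinear_matmul(a_q, a_scale, a_zp, b_q, b_scale, b_zp, y_scale, y_zp):
    """Integer matmul of a_q (M x K) by b_q (K x N), requantized to Int8."""
    rows, inner, cols = len(a_q), len(b_q), len(b_q[0])
    out = []
    for i in range(rows):
        out_row = []
        for j in range(cols):
            # Accumulate in integers (a wide accumulator in a real kernel).
            acc = sum((a_q[i][k] - a_zp) * (b_q[k][j] - b_zp) for k in range(inner))
            # Requantize the accumulator to the output scale and zero point.
            y = round(acc * a_scale * b_scale / y_scale) + y_zp
            out_row.append(min(max(y, -128), 127))  # saturate to Int8
        out.append(out_row)
    return out


a_q = [[10, -20, 31]]             # quantized input, shape 1 x 3
b_q = [[1, -2], [3, 4], [-5, 6]]  # quantized weight, shape 3 x 2
y_q = qlinear_matmul(a_q, 0.5, 0, b_q, 0.25, 0, y_scale=2.0, y_zp=0)
print(y_q)
```

Without an output scale and zero point there would be no way to map the integer accumulator back to Int8, which is why the output quantizer is mandatory in this export style.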



In [10]:
from brevitas.quant.scaled_int import Int8ActPerTensorFloat
from brevitas.export import export_standard_qop_onnx

IN_CH = 3
IMG_SIZE = 128
OUT_CH = 128
BATCH_SIZE = 1

class Model(torch.nn.Module):
    def __init__(self) -> None:
        super().__init__()
        self.act = qnn.QuantIdentity(return_quant_tensor=True)
        self.conv = qnn.QuantConv2d(IN_CH, OUT_CH, kernel_size=3, bias=False,
                                    weight_bit_width=4,
                                    output_quant=Int8ActPerTensorFloat, return_quant_tensor=True)

    def forward(self, inp):
        inp = self.act(inp)
        inp = self.conv(inp)
        return inp

inp = torch.randn(BATCH_SIZE, IN_CH, IMG_SIZE, IMG_SIZE)
model = Model() 
model.eval()


export_standard_qop_onnx(
    model.cpu(),
    args=inp,
    export_path="simple_model.onnx",
    opset_version=13
)

Stopping http://localhost:8082
Serving 'simple_model.onnx' at http://localhost:8082


In [None]:
show_netron("simple_model.onnx", 8082)

In this case, we need to make sure that the input to our `QuantConv2d` layer is quantized. Using the approach shown above (a `QuantIdentity` placed in front of the layer), Brevitas will add a QuantizeLinear node that is not followed by a DequantizeLinear one.

Moreover, our `qnn.QuantConv2d` layer has to specify how to re-quantize the output (in this case, with `Int8ActPerTensorFloat`), otherwise an error will be raised at export time.

Similarly, if a bias is present, it has to be quantized or an error will be raised.


#### Clipping in QOps

Even when using QLinearConv and QLinearMatMul, it is still possible to represent bitwidths < 8 through the use of clipping.

However, in this case the Clip operation won't be captured in the exported ONNX graph. Instead, it is performed at export time, and the already-clipped tensor is stored in the ONNX graph.

Examining the last exported model, it is possible to see that the weight tensor, even though it has Int8 as its data type, only takes values in [-7, 7], given that it is quantized at 4 bit with `narrow_range` set to True.



## ONNX RUNTIME

### QCDQ

Since for QCDQ we are only using standard ONNX operations, it is possible to run the exported model using ONNX Runtime.

In [None]:
import onnxruntime as ort

class Model(torch.nn.Module):
    def __init__(self) -> None:
        super().__init__()
        self.linear = qnn.QuantLinear(IN_CH, OUT_CH, bias=True, return_quant_tensor=True, weight_bit_width=3)
        self.act = qnn.QuantReLU(bit_width=4)
    
    def forward(self, inp):
        inp = self.linear(inp)
        inp = self.act(inp)
        return inp

model = Model()
model.eval()
inp = torch.randn(BATCH_SIZE, IN_CH)
path = 'simple_model.onnx'

exported_model = export_standard_qcdq_onnx(model, args=inp, export_path=path, opset_version=13)

sess_opt = ort.SessionOptions()
sess = ort.InferenceSession(path, sess_opt)
input_name = sess.get_inputs()[0].name
pred_onx = sess.run(None, {input_name: inp.numpy()})[0]


out_brevitas = model(inp)
out_ort = torch.tensor(pred_onx)

assert torch.allclose(out_brevitas, out_ort)

2022-12-09 16:22:03.824044539 [W:onnxruntime:, graph.cc:1271 Graph] Initializer linear.weight appears in graph inputs and will not be treated as constant value/weight. This may prevent some of the graph optimizations, like const folding. Move it out of graph inputs if there is no need to override it, by either re-generating the model with latest exporter/converter or with the tool onnxruntime/tools/python/remove_initializer_from_input.py.
2022-12-09 16:22:03.824084348 [W:onnxruntime:, graph.cc:1271 Graph] Initializer linear.bias appears in graph inputs and will not be treated as constant value/weight. This may prevent some of the graph optimizations, like const folding. Move it out of graph inputs if there is no need to override it, by either re-generating the model with latest exporter/converter or with the tool onnxruntime/tools/python/remove_initializer_from_input.py.


#### QGEMM vs GEMM

As stated before, ONNX will decide the appropriate kernel to execute for each operation based on the characteristics of the input and output tensors (e.g., Gemm vs MatMul).

We observed a similar behavior when using ONNX Runtime.

In very particular scenarios, we noticed that, even though in QCDQ all operations between tensors should be in FP, ONNX Runtime calls quantized kernels, suggesting that the DequantizeLinear node is moved and fused into the computation.

This seems to happen only when using a Quantized Linear layer, with the following requirements:
- Input, Weight, Bias, and Output tensors must be quantized;
- Bias tensor must be present, and quantized with bitwidth > 8;
- The output of the QuantLinear must be re-quantized;
- The output bitwidth must be equal to 8;
- The input bitwidth must be equal to 8;
- The weights bitwidth can be <= 8;
- The weights can be quantized per-tensor or per-channel;
- `return_quant_tensor` must be True.

We did not observe a similar behavior for other operations such as `QuantConvNd`.

An example of a layer that will match this definition is the following:

In [None]:
from brevitas.quant.scaled_int import Int16Bias
from brevitas.quant.scaled_int import Int8ActPerTensorFloat

qgemm_ort = qnn.QuantLinear(IN_CH, OUT_CH,
                            weight_bit_width=5,
                            input_quant=Int8ActPerTensorFloat,
                            output_quant=Int8ActPerTensorFloat,
                            bias=True, bias_quant=Int16Bias,)

Unfortunately, we have not found a way to determine which kernels onnxruntime uses for computation.

We found out about this behavior during our development process, and confirmed it by building onnxruntime v1.13.1 and ONNX v1.12 from source.
It might be necessary to set the environment variable `PROTOCOL_BUFFERS_PYTHON_IMPLEMENTATION=python` for this to run.
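
One way to set this variable from within a notebook, as a minimal sketch, is shown below. It assumes the assignment runs before any package that loads protobuf (such as `onnx`) is imported, since the protobuf backend is selected at import time.

```python
import os

# Must run before importing onnx or any other module that loads protobuf.
os.environ["PROTOCOL_BUFFERS_PYTHON_IMPLEMENTATION"] = "python"

print(os.environ["PROTOCOL_BUFFERS_PYTHON_IMPLEMENTATION"])  # python
```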


### QOps

As in the QCDQ case, we are only using standard ONNX operations, thus we can use onnxruntime to execute our exported models.

The main difference is that all operations happen between quantized tensors, thus we should expect to get Int8 or UInt8 tensors from our execution, rather than their floating point versions.

In [30]:
import onnxruntime as ort

class Model(torch.nn.Module):
    def __init__(self) -> None:
        super().__init__()
        self.act = qnn.QuantIdentity(return_quant_tensor=True)
        self.conv = qnn.QuantConv2d(IN_CH, OUT_CH, kernel_size=3, bias=False,
                                    weight_bit_width=4,
                                    output_quant=Int8ActPerTensorFloat, return_quant_tensor=True)
    
    def forward(self, inp):
        inp = self.act(inp)
        inp = self.conv(inp)
        return inp

model = Model()
model.eval()
inp = torch.randn(BATCH_SIZE, IN_CH, IMG_SIZE, IMG_SIZE)
path = 'simple_model.onnx'

exported_model = export_standard_qop_onnx(model, args=inp, export_path=path, opset_version=13)

sess_opt = ort.SessionOptions()
sess = ort.InferenceSession(path, sess_opt)
input_name = sess.get_inputs()[0].name
pred_onx = sess.run(None, {input_name: inp.numpy()})[0]


out_brevitas = model(inp).int()
out_ort = torch.tensor(pred_onx).int()

assert torch.allclose(out_brevitas, out_ort, atol=1)

In this case, before comparing the results, we first convert the floating point output of Brevitas to its integer representation, using the scale factor and zero point present in the QuantTensor.

Due to differences in how the computation is performed, the two results might be slightly different (Brevitas uses a style closer to QCDQ, rather than operating between integers), thus we added a slightly higher tolerance.
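
The conversion to integers and the tolerant comparison can be sketched in plain Python. The helpers `to_int` and `within_one_lsb`, as well as all numeric values, are hypothetical and only illustrate the idea; in the cell above the same roles are played by `QuantTensor.int()` and `torch.allclose(..., atol=1)`.

```python
def to_int(values, scale, zero_point):
    """Map dequantized FP values back to their integer representation."""
    return [round(v / scale) + zero_point for v in values]


def within_one_lsb(a, b):
    """True if two integer sequences differ by at most 1 element-wise."""
    return len(a) == len(b) and all(abs(x - y) <= 1 for x, y in zip(a, b))


scale, zero_point = 0.5, 0    # hypothetical output quantization parameters
fp_output = [-1.5, 0.0, 2.5]  # hypothetical dequantized output
int_output = to_int(fp_output, scale, zero_point)
print(int_output)

other_backend = [-3, 1, 5]    # hypothetical integers from another backend
print(within_one_lsb(int_output, other_backend))
```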