# YOLOv5 Post Training Quantization with ONNX

## 1. Export the PyTorch model to ONNX using Ultralytics `export.py` script

### 1.1 Download Pretrained Model

In [6]:
%pwd

'/home/hongbing/Projects/yolov5_qat'

In [5]:
%cd weights
%wget https://github.com/ultralytics/yolov5/releases/download/v7.0/yolov5m.pt
%cd ..

mkdir: cannot create directory ‘weights’: File exists


### 1.2 Export to ONNX

In [1]:
!python export.py --weights weights/yolov5m.pt --include onnx --opset 13

[34m[1mexport: [0mdata=data/coco128.yaml, weights=['weights/yolov5m.pt'], imgsz=[640, 640], batch_size=1, device=cpu, half=False, inplace=False, keras=False, optimize=False, int8=False, dynamic=False, simplify=False, opset=13, verbose=False, workspace=4, nms=False, agnostic_nms=False, topk_per_class=100, topk_all=100, iou_thres=0.45, conf_thres=0.25, include=['onnx']
YOLOv5 🚀 v7.0-63-gcdd804d Python-3.8.10 torch-1.13.0+cu117 CPU

Fusing layers... 
YOLOv5m summary: 290 layers, 21172173 parameters, 0 gradients

[34m[1mPyTorch:[0m starting from weights/yolov5m.pt with output shape (1, 25200, 85) (40.8 MB)

[34m[1mONNX:[0m starting export with onnx 1.13.0...
[34m[1mONNX:[0m export success ✅ 1.0s, saved as weights/yolov5m.onnx (81.2 MB)

Export complete (1.8s)
Results saved to [1m/home/hongbing/Projects/yolov5_qat/weights[0m
Detect:          python detect.py --weights weights/yolov5m.onnx 
Validate:        python val.py --weights weights/yolov5m.onnx 
PyTorch Hub:     model = 

### 1.3 Get the benchmark accuray of this float ONNX model

It is going to take a long time, please be patient. 

In [None]:
!python val.py --weights weights/yolov5m.onnx --data data/coco.yaml

 ```text
 Average Precision  (AP) @[ IoU=0.50:0.95 | area=   all | maxDets=100 ] = 0.450
 Average Precision  (AP) @[ IoU=0.50      | area=   all | maxDets=100 ] = 0.641
 Average Precision  (AP) @[ IoU=0.75      | area=   all | maxDets=100 ] = 0.490
 Average Precision  (AP) @[ IoU=0.50:0.95 | area= small | maxDets=100 ] = 0.280
 Average Precision  (AP) @[ IoU=0.50:0.95 | area=medium | maxDets=100 ] = 0.505
 Average Precision  (AP) @[ IoU=0.50:0.95 | area= large | maxDets=100 ] = 0.578
 Average Recall     (AR) @[ IoU=0.50:0.95 | area=   all | maxDets=  1 ] = 0.355
 Average Recall     (AR) @[ IoU=0.50:0.95 | area=   all | maxDets= 10 ] = 0.584
 Average Recall     (AR) @[ IoU=0.50:0.95 | area=   all | maxDets=100 ] = 0.634
 Average Recall     (AR) @[ IoU=0.50:0.95 | area= small | maxDets=100 ] = 0.464
 Average Recall     (AR) @[ IoU=0.50:0.95 | area=medium | maxDets=100 ] = 0.694
 Average Recall     (AR) @[ IoU=0.50:0.95 | area= large | maxDets=100 ] = 0.784
``` 

## 2. Qunatize the FP32 model using quantize_static

## 2.1 Quantize

In [11]:
import os
os.environ['CUDA_VISIBLE_DEVICES'] = '0'
import sys
sys.path.append("..")
from onnxruntime.quantization import quantize_static, QuantType, CalibrationMethod, CalibrationDataReader
import torch
from utils.dataloaders import LoadImages
from utils.general import check_dataset
import numpy as np

def representative_dataset_gen(dataset, ncalib=100):
    # Representative dataset generator for use with converter.representative_dataset, returns a generator of np arrays
    def data_gen():
        for n, (path, img, im0s, vid_cap, string) in enumerate(dataset):
            input = np.transpose(img, [0, 1, 2])
            input = np.expand_dims(input, axis=0).astype(np.float32)
            input /= 255
            yield [input]
    return data_gen

class CalibrationDataGenYOLO(CalibrationDataReader):
    def __init__(self,
        calib_data_gen,
        input_name
    ):
        x_train = calib_data_gen
        self.calib_data = iter([{input_name: np.array(data[0])} for data in x_train()])

    def get_next(self):
        return next(self.calib_data, None)


dataset = LoadImages(check_dataset('./data/coco128.yaml')['train'], img_size=[640, 640], auto=False)
data_generator = representative_dataset_gen(dataset)

data_reader = CalibrationDataGenYOLO(
    calib_data_gen=data_generator,
    input_name='images'
)

In [None]:
model_path = 'weights/yolov5m'
# Quantize the exported model
quantize_static(
    f'{model_path}.onnx',
    f'{model_path}_ort_quant.u8s8.onnx',
    calibration_data_reader=data_reader,
    activation_type=QuantType.QUInt8,
    weight_type=QuantType.QInt8,
    per_channel=True,
    reduce_range=True,
    calibrate_method=CalibrationMethod.MinMax
)

### 2.2 Check `x_scale` of the quantized model

Check the quantized onnx model with Netron. 

<img src="ptq_last_layers.png"> 

We can see the last 2 layers have large scale values. Tensor values are scaled to range ~[-8.7, 735.8] at the end of the model by multiplying large integers. 8 bits is not able to express such a big range. 

### 2.3 Evaluate the mAP using Ultralytics implementation

In [None]:
!python val.py --weights yolov5m_ort_quant.u8s8.onnx --data coco.yaml

## 3. Qunatize by excluding those big scale nodes

### 3.1 Quantize
- Excluding to quantize nodes which taking in those large tensors and cause mAP to 0.

    ```text
    nodes_to_exclude=["/model.24/Mul_1", "/model.24/Mul_3", "/model.24/Mul_5", "/model.24/Mul_7", "/model.24/Mul_9", "/model.24/Mul_11", "/model.24/Concat", "/model.24/Concat_1", "/model.24/Concat_2", "/model.24/Concat_3"],
    ```

In [None]:
model_path = 'weights/yolov5m'
# Quantize the exported model
quantize_static(
    f'{model_path}.onnx',
    f'{model_path}_ort_quant.u8s8.exclude.bigscale.onnx',
    calibration_data_reader=data_reader,
    activation_type=QuantType.QUInt8,
    weight_type=QuantType.QInt8,
    nodes_to_exclude=["/model.24/Mul_1", "/model.24/Mul_3", "/model.24/Mul_5", "/model.24/Mul_7", "/model.24/Mul_9", "/model.24/Mul_11", "/model.24/Concat", "/model.24/Concat_1", "/model.24/Concat_2", "/model.24/Concat_3"],
    per_channel=True,
    reduce_range=True,
    calibrate_method=CalibrationMethod.MinMax
)


### 3.2 Evaluate the mAP using Ultralytics implementation

In [None]:
!python val.py --weights yolov5m_ort_quant.u8s8.exclude.bigscale.onnx --data coco.yaml

```text
 Average Precision  (AP) @[ IoU=0.50:0.95 | area=   all | maxDets=100 ] = 0.397
 Average Precision  (AP) @[ IoU=0.50      | area=   all | maxDets=100 ] = 0.611
 Average Precision  (AP) @[ IoU=0.75      | area=   all | maxDets=100 ] = 0.427
 Average Precision  (AP) @[ IoU=0.50:0.95 | area= small | maxDets=100 ] = 0.202
 Average Precision  (AP) @[ IoU=0.50:0.95 | area=medium | maxDets=100 ] = 0.452
 Average Precision  (AP) @[ IoU=0.50:0.95 | area= large | maxDets=100 ] = 0.539
 Average Recall     (AR) @[ IoU=0.50:0.95 | area=   all | maxDets=  1 ] = 0.321
 Average Recall     (AR) @[ IoU=0.50:0.95 | area=   all | maxDets= 10 ] = 0.530
 Average Recall     (AR) @[ IoU=0.50:0.95 | area=   all | maxDets=100 ] = 0.584
 Average Recall     (AR) @[ IoU=0.50:0.95 | area= small | maxDets=100 ] = 0.397
 Average Recall     (AR) @[ IoU=0.50:0.95 | area=medium | maxDets=100 ] = 0.641
 Average Recall     (AR) @[ IoU=0.50:0.95 | area= large | maxDets=100 ] = 0.752
```

- The mAP of this quantized model is 0.397. The mAP of the FP32 model is 0.451.

## 4. Quantize the weights with UINT8

### 4.1. Qunatize the FP32 model using quantize_static

- Excluding to quantize nodes which taking in those large tensors and cause mAP to 0.

    ```text
    nodes_to_exclude=["/model.24/Mul_1", "/model.24/Mul_3", "/model.24/Mul_5", "/model.24/Mul_7", "/model.24/Mul_9", "/model.24/Mul_11", "/model.24/Concat", "/model.24/Concat_1", "/model.24/Concat_2", "/model.24/Concat_3"],
    ```

In [None]:
model_path = 'weights/yolov5m'
# Quantize the exported model
quantize_static(
    f'{model_path}.onnx',
    f'{model_path}_ort_quant.u8u8.exclude.bigscale.onnx',
    calibration_data_reader=data_reader,
    activation_type=QuantType.QUInt8,
    weight_type=QuantType.QUInt8,
    nodes_to_exclude=["/model.24/Mul_1", "/model.24/Mul_3", "/model.24/Mul_5", "/model.24/Mul_7", "/model.24/Mul_9", "/model.24/Mul_11", "/model.24/Concat", "/model.24/Concat_1", "/model.24/Concat_2", "/model.24/Concat_3"],
    per_channel=True,
    reduce_range=True,
    calibrate_method=CalibrationMethod.MinMax
)

### 4.2 Evaluate the mAP using Ultralytics implementation

In [None]:
!python val.py --weights yolov5m_ort_quant.u8u8.exclude.bigscale.onnx --data coco.yaml

```text
 Average Precision  (AP) @[ IoU=0.50:0.95 | area=   all | maxDets=100 ] = 0.413
 Average Precision  (AP) @[ IoU=0.50      | area=   all | maxDets=100 ] = 0.623
 Average Precision  (AP) @[ IoU=0.75      | area=   all | maxDets=100 ] = 0.448
 Average Precision  (AP) @[ IoU=0.50:0.95 | area= small | maxDets=100 ] = 0.233
 Average Precision  (AP) @[ IoU=0.50:0.95 | area=medium | maxDets=100 ] = 0.467
 Average Precision  (AP) @[ IoU=0.50:0.95 | area= large | maxDets=100 ] = 0.541
 Average Recall     (AR) @[ IoU=0.50:0.95 | area=   all | maxDets=  1 ] = 0.330
 Average Recall     (AR) @[ IoU=0.50:0.95 | area=   all | maxDets= 10 ] = 0.544
 Average Recall     (AR) @[ IoU=0.50:0.95 | area=   all | maxDets=100 ] = 0.596
 Average Recall     (AR) @[ IoU=0.50:0.95 | area= small | maxDets=100 ] = 0.413
 Average Recall     (AR) @[ IoU=0.50:0.95 | area=medium | maxDets=100 ] = 0.655
 Average Recall     (AR) @[ IoU=0.50:0.95 | area= large | maxDets=100 ] = 0.756
``` 

- The mAP of this U8U8 quantized model is 0.413. The mAP of U8S8 quantized model is 0.397. The mAP of the FP32 model is 0.451.

## 5. Convert to TFLite INT8

- It may take 4-5 minutes. 

- Note: TF has special processing to normalize pixel index to 0-1: https://github.com/ultralytics/yolov5/blob/cba4303d323352fc6d76730e15459b14498a9e34/models/tf.py#L231.

    So TFLite doesn't have to exclude last layers. 

    If you do similar thing to pytorch/onnx, ort quant should also work.

In [None]:
!python export.py --weights weights/yolov5m.pt --include tflite --int8

### 5.1 Evaluate with COCO Val

In [None]:
!python val.py --weights weights/yolov5m-int8.tflite --data coco.yaml

 ```text
 Average Precision  (AP) @[ IoU=0.50:0.95 | area=   all | maxDets=100 ] = 0.404
 Average Precision  (AP) @[ IoU=0.50      | area=   all | maxDets=100 ] = 0.621
 Average Precision  (AP) @[ IoU=0.75      | area=   all | maxDets=100 ] = 0.437
 Average Precision  (AP) @[ IoU=0.50:0.95 | area= small | maxDets=100 ] = 0.217
 Average Precision  (AP) @[ IoU=0.50:0.95 | area=medium | maxDets=100 ] = 0.461
 Average Precision  (AP) @[ IoU=0.50:0.95 | area= large | maxDets=100 ] = 0.542
 Average Recall     (AR) @[ IoU=0.50:0.95 | area=   all | maxDets=  1 ] = 0.325
 Average Recall     (AR) @[ IoU=0.50:0.95 | area=   all | maxDets= 10 ] = 0.537
 Average Recall     (AR) @[ IoU=0.50:0.95 | area=   all | maxDets=100 ] = 0.589
 Average Recall     (AR) @[ IoU=0.50:0.95 | area= small | maxDets=100 ] = 0.400
 Average Recall     (AR) @[ IoU=0.50:0.95 | area=medium | maxDets=100 ] = 0.648
 Average Recall     (AR) @[ IoU=0.50:0.95 | area= large | maxDets=100 ] = 0.758
 ```