# YOLOv5 Quantization

## 1. Export the PyTorch model to ONNX using the Ultralytics `export.py` script

In [7]:
%cd ~/Projects/yolov5
%pwd

/home/hongbing/Projects/yolov5


'/home/hongbing/Projects/yolov5'

In [2]:
!python export.py --weights yolov5m.pt --include onnx --opset 13

export: data=data/coco128.yaml, weights=['yolov5m.pt'], imgsz=[640, 640], batch_size=1, device=cpu, half=False, inplace=False, keras=False, optimize=False, int8=False, dynamic=False, simplify=False, opset=13, verbose=False, workspace=4, nms=False, agnostic_nms=False, topk_per_class=100, topk_all=100, iou_thres=0.45, conf_thres=0.25, include=['onnx']
YOLOv5 🚀 v7.0-24-gf8539a68 Python-3.8.10 torch-1.13.0+cu117 CPU

Fusing layers... 
YOLOv5m summary: 290 layers, 21172173 parameters, 0 gradients

PyTorch: starting from yolov5m.pt with output shape (1, 25200, 85) (40.8 MB)

ONNX: starting export with onnx 1.13.0...
ONNX: export success ✅ 1.0s, saved as yolov5m.onnx (81.2 MB)

Export complete (1.7s)
Results saved to /home/hongbing/Projects/yolov5
Detect:          python detect.py --weights yolov5m.onnx 
Validate:        python val.py --weights yolov5m.onnx 
PyTorch Hub:     model = torch.hub.load('ultralytics/yolov5', 'custom', 'yol

### 1.1 Get the baseline accuracy of the FP32 ONNX model

In [3]:
!python val.py --weights yolov5m.onnx --data coco.yaml

val: data=/home/hongbing/Projects/yolov5/data/coco.yaml, weights=['yolov5m.onnx'], batch_size=32, imgsz=640, conf_thres=0.001, iou_thres=0.6, max_det=300, task=val, device=, workers=8, single_cls=False, augment=False, verbose=False, save_txt=False, save_hybrid=False, save_conf=False, save_json=True, project=runs/val, name=exp, exist_ok=False, half=False, dnn=False
YOLOv5 🚀 v7.0-24-gf8539a68 Python-3.8.10 torch-1.13.0+cu117 CUDA:0 (Quadro RTX 5000, 16125MiB)

Loading yolov5m.onnx for ONNX Runtime inference...
2022-12-13 10:37:53.725313759 [W:onnxruntime:Default, onnxruntime_pybind_state.cc:578 CreateExecutionProviderInstance] Failed to create CUDAExecutionProvider. Please reference https://onnxruntime.ai/docs/reference/execution-providers/CUDA-ExecutionProvider.html#requirements to ensure all dependencies are met.
Forcing --batch-size 1 square inference (1,3,640,640) for non-PyTorch models
val: Scanning /home/hongbing/Projects/datasets/coco/val2017.cache... 495

## 2. Quantize the FP32 model using `quantize_static`

### 2.1 Quantize

In [14]:
import os
os.environ['CUDA_VISIBLE_DEVICES'] = '0'
import sys
sys.path.append("..")
from onnxruntime.quantization import quantize_static, QuantType, CalibrationMethod, CalibrationDataReader
import torch
from utils.dataloaders import LoadImages
from utils.general import check_dataset
import numpy as np

def representative_dataset_gen(dataset, ncalib=100):
    # Returns a generator function that yields up to `ncalib` preprocessed images as np arrays
    def data_gen():
        for n, (path, img, im0s, vid_cap, string) in enumerate(dataset):
            if n >= ncalib:
                break  # stop once the requested number of calibration images is reached
            # LoadImages already returns a CHW RGB uint8 array: add the batch axis and scale to [0, 1]
            im = np.expand_dims(img, axis=0).astype(np.float32)
            im /= 255
            yield [im]
    return data_gen

class CalibrationDataGenYOLO(CalibrationDataReader):
    def __init__(self,
        calib_data_gen,
        input_name
    ):
        # One-shot iterator of {input_name: array} feeds; it is exhausted after a single calibration run
        self.calib_data = iter([{input_name: np.array(data[0])} for data in calib_data_gen()])

    def get_next(self):
        # Return the next feed dict, or None when no calibration data is left
        return next(self.calib_data, None)


# Calibrate on the coco128 training images, letterboxed to a fixed 640x640 input
dataset = LoadImages(check_dataset('./data/coco128.yaml')['train'], img_size=[640, 640], auto=False)
data_generator = representative_dataset_gen(dataset)

data_reader = CalibrationDataGenYOLO(
    calib_data_gen=data_generator,
    input_name='images'  # input tensor name of the exported ONNX model
)
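
Note that the reader wraps a one-shot iterator: once a `quantize_static` call has consumed it, `get_next()` only returns `None`, so the cell above has to be re-run before each later quantization. A minimal sketch of a reusable variant follows. The `RewindableReader` class and its `rewind()` method are hypothetical names, not part of this notebook, and in real use the class would subclass `CalibrationDataReader`; that is omitted here to keep the sketch dependency-free.

```python
class RewindableReader:
    """Hypothetical reusable calibration reader (illustrative sketch)."""

    def __init__(self, feeds):
        # Materialize the feeds once so they can be replayed later.
        self._feeds = list(feeds)
        self._iter = iter(self._feeds)

    def get_next(self):
        # Same contract as the reader above: next feed dict, or None when done.
        return next(self._iter, None)

    def rewind(self):
        # Start over from the first feed.
        self._iter = iter(self._feeds)


# Stand-in feeds; real feeds would map 'images' to float32 arrays.
reader = RewindableReader({"images": [i]} for i in range(3))

first_pass = []
while (feed := reader.get_next()) is not None:
    first_pass.append(feed)

exhausted = reader.get_next()    # nothing left after one full pass
reader.rewind()
after_rewind = reader.get_next() # the first feed is available again
```

Calling `rewind()` between quantization runs would then replace re-running the data-loading cell.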

In [12]:
model_path = 'yolov5m'
# Quantize the exported model
quantize_static(
    f'{model_path}.onnx',
    f'{model_path}_ort_quant.u8s8.onnx',
    calibration_data_reader=data_reader,
    activation_type=QuantType.QUInt8,
    weight_type=QuantType.QInt8,
    per_channel=True,
    reduce_range=True,
    calibrate_method=CalibrationMethod.MinMax
)



### 2.2 Check `x_scale` of the quantized model

The last two layers have large scale values, so 8-bit quantization loses accuracy there.
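
To see why a large scale hurts, recall that affine 8-bit quantization spreads a tensor's calibrated range over a fixed number of integer levels, so the scale is roughly range / levels. The sketch below is illustrative only. The values are made up, the helper `fake_quantize` is hypothetical, and it assumes 256 levels on a tensor that mixes pixel-scale box coordinates with confidences in [0, 1]. It is one plausible reading of the observation above, not a verified diagnosis of this model.

```python
def fake_quantize(values, qmin=0, qmax=255):
    """Quantize to integers in [qmin, qmax], then map back to floats."""
    lo = min(min(values), 0.0)  # the quantized range must contain zero
    hi = max(max(values), 0.0)
    scale = (hi - lo) / (qmax - qmin)
    zero_point = round(qmin - lo / scale)
    restored = []
    for v in values:
        q = min(max(round(v / scale) + zero_point, qmin), qmax)
        restored.append((q - zero_point) * scale)
    return restored, scale


boxes = [320.0, 240.0, 640.0, 128.0]  # made-up pixel coordinates
scores = [0.9, 0.3, 0.05]             # made-up confidences

# One shared scale for boxes and scores versus a scale for the scores alone.
mixed, mixed_scale = fake_quantize(boxes + scores)
alone, alone_scale = fake_quantize(scores)

print(f"shared scale {mixed_scale:.3f}: scores -> {mixed[len(boxes):]}")
print(f"scores-only scale {alone_scale:.5f}: scores -> {[round(s, 3) for s in alone]}")
```

With the shared scale, one integer step is larger than any confidence value, so every score rounds to zero; quantized on their own, the same scores survive with an error of at most half a step.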

### 2.3 Evaluate the mAP using the Ultralytics implementation

In [None]:
!python val.py --weights yolov5m_ort_quant.u8s8.onnx --data coco.yaml

## 3. Quantize while excluding the large-scale nodes

### 3.1 Quantize
- Exclude from quantization the nodes that take in those large tensors; quantizing them drives the mAP to 0.

```
nodes_to_exclude=["/model.24/Mul_1", "/model.24/Mul_3", "/model.24/Mul_5", "/model.24/Mul_7", "/model.24/Mul_9", "/model.24/Mul_11", "/model.24/Concat", "/model.24/Concat_1", "/model.24/Concat_2", "/model.24/Concat_3"],
```
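
The excluded names follow a regular pattern (the odd-numbered `Mul` nodes and the `Concat` nodes under `/model.24`), so the list can also be built programmatically. This is a small sketch with illustrative variable names. The node names hold only for this particular export and should be checked against the actual graph before reuse.

```python
prefix = "/model.24"  # detection head of this exported YOLOv5m graph

# Mul_1, Mul_3, ..., Mul_11
mul_nodes = [f"{prefix}/Mul_{i}" for i in range(1, 12, 2)]

# Concat, Concat_1, Concat_2, Concat_3
concat_nodes = [f"{prefix}/Concat"] + [f"{prefix}/Concat_{i}" for i in range(1, 4)]

nodes_to_exclude = mul_nodes + concat_nodes
print(nodes_to_exclude)
```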

In [4]:
model_path = 'yolov5m'
# Quantize the exported model
quantize_static(
    f'{model_path}.onnx',
    f'{model_path}_ort_quant.u8s8.exclude.bigscale.onnx',
    calibration_data_reader=data_reader,
    activation_type=QuantType.QUInt8,
    weight_type=QuantType.QInt8,
    nodes_to_exclude=["/model.24/Mul_1", "/model.24/Mul_3", "/model.24/Mul_5", "/model.24/Mul_7", "/model.24/Mul_9", "/model.24/Mul_11", "/model.24/Concat", "/model.24/Concat_1", "/model.24/Concat_2", "/model.24/Concat_3"],
    per_channel=True,
    reduce_range=True,
    calibrate_method=CalibrationMethod.MinMax
)




### 3.2 Evaluate the mAP using the Ultralytics implementation

In [5]:
!python val.py --weights yolov5m_ort_quant.u8s8.exclude.bigscale.onnx --data coco.yaml

val: data=/home/hongbing/Projects/yolov5/data/coco.yaml, weights=['yolov5m_ort_quant.onnx'], batch_size=32, imgsz=640, conf_thres=0.001, iou_thres=0.6, max_det=300, task=val, device=, workers=8, single_cls=False, augment=False, verbose=False, save_txt=False, save_hybrid=False, save_conf=False, save_json=True, project=runs/val, name=exp, exist_ok=False, half=False, dnn=False
YOLOv5 🚀 v7.0-24-gf8539a68 Python-3.8.10 torch-1.13.0+cu117 CUDA:0 (Quadro RTX 5000, 16125MiB)

Loading yolov5m_ort_quant.onnx for ONNX Runtime inference...
2022-12-13 11:12:12.819420504 [W:onnxruntime:Default, onnxruntime_pybind_state.cc:578 CreateExecutionProviderInstance] Failed to create CUDAExecutionProvider. Please reference https://onnxruntime.ai/docs/reference/execution-providers/CUDA-ExecutionProvider.html#requirements to ensure all dependencies are met.
Forcing --batch-size 1 square inference (1,3,640,640) for non-PyTorch models
val: Scanning /home/hongbing/Projects/datasets/coco/

- The mAP of this quantized model is 0.397. The mAP of the FP32 model is 0.451.

## 4. Quantize the weights with UINT8

### 4.1 Quantize the FP32 model using `quantize_static`

- Exclude from quantization the nodes that take in those large tensors; quantizing them drives the mAP to 0.

```
nodes_to_exclude=["/model.24/Mul_1", "/model.24/Mul_3", "/model.24/Mul_5", "/model.24/Mul_7", "/model.24/Mul_9", "/model.24/Mul_11", "/model.24/Concat", "/model.24/Concat_1", "/model.24/Concat_2", "/model.24/Concat_3"],
```

In [15]:
model_path = 'yolov5m'
# Quantize the exported model
quantize_static(
    f'{model_path}.onnx',
    f'{model_path}_ort_quant.u8u8.exclude.bigscale.onnx',
    calibration_data_reader=data_reader,
    activation_type=QuantType.QUInt8,
    weight_type=QuantType.QUInt8,
    nodes_to_exclude=["/model.24/Mul_1", "/model.24/Mul_3", "/model.24/Mul_5", "/model.24/Mul_7", "/model.24/Mul_9", "/model.24/Mul_11", "/model.24/Concat", "/model.24/Concat_1", "/model.24/Concat_2", "/model.24/Concat_3"],
    per_channel=True,
    reduce_range=True,
    calibrate_method=CalibrationMethod.MinMax
)



### 4.2 Evaluate the mAP using the Ultralytics implementation

In [16]:
!python val.py --weights yolov5m_ort_quant.u8u8.exclude.bigscale.onnx --data coco.yaml

val: data=/home/hongbing/Projects/yolov5/data/coco.yaml, weights=['yolov5m_ort_quant.u8u8.exclude.bigscale.onnx'], batch_size=32, imgsz=640, conf_thres=0.001, iou_thres=0.6, max_det=300, task=val, device=, workers=8, single_cls=False, augment=False, verbose=False, save_txt=False, save_hybrid=False, save_conf=False, save_json=True, project=runs/val, name=exp, exist_ok=False, half=False, dnn=False
YOLOv5 🚀 v7.0-24-gf8539a68 Python-3.8.10 torch-1.13.0+cu117 CUDA:0 (Quadro RTX 5000, 16125MiB)

Loading yolov5m_ort_quant.u8u8.exclude.bigscale.onnx for ONNX Runtime inference...
2022-12-13 11:57:10.679788417 [W:onnxruntime:Default, onnxruntime_pybind_state.cc:578 CreateExecutionProviderInstance] Failed to create CUDAExecutionProvider. Please reference https://onnxruntime.ai/docs/reference/execution-providers/CUDA-ExecutionProvider.html#requirements to ensure all dependencies are met.
Forcing --batch-size 1 square inference (1,3,640,640) for non-PyTorch models
val: Sca

- The mAP of this U8U8 quantized model is 0.413, compared with 0.397 for the U8S8 quantized model and 0.451 for the FP32 model.
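
To put these numbers in perspective, the drop relative to the FP32 baseline can be computed directly. This is a small arithmetic sketch using only the mAP values reported in this notebook; the variable names are illustrative.

```python
fp32_map = 0.451
quantized = {"U8S8": 0.397, "U8U8": 0.413}

drops = {}
for name, m in quantized.items():
    absolute = fp32_map - m                 # mAP points lost
    relative = 100 * absolute / fp32_map    # percentage of the FP32 mAP lost
    drops[name] = (absolute, relative)
    print(f"{name}: -{absolute:.3f} mAP ({relative:.1f}% relative)")
```

In relative terms, U8S8 loses about 12% of the FP32 mAP and U8U8 about 8.4%.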