YOLOv9-QAT TensorRT Q/DQ: Improved Speed and Zero Accuracy Loss #253

Closed
levipereira opened this issue Mar 15, 2024 · 9 comments
levipereira commented Mar 15, 2024

This is outdated; follow the new repo instead:
https://github.com/levipereira/yolov9-qat

Please follow the original implementation in #327.

@WongKinYiu

I have developed the initial version of YOLOv9-QAT using the Q/DQ method, tailored specifically for YOLOv9 models intended for execution solely on TensorRT.

This implementation currently supports only the Inference Models (Converted and Gelan models).

The source code is available in the yolov9-qat branch.
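For context, here is a minimal sketch of what the Q/DQ insertion amounts to, assuming the branch builds on NVIDIA's pytorch-quantization toolkit (an assumption on my part; the actual logic lives in models/quantize.py, and `replace_conv2d` below is purely illustrative):

```python
import torch
from pytorch_quantization import nn as quant_nn
from pytorch_quantization.tensor_quant import QuantDescriptor

# Histogram-calibrated activation ranges; weights default to per-channel max.
quant_nn.QuantConv2d.set_default_quant_desc_input(
    QuantDescriptor(calib_method="histogram"))

def replace_conv2d(module: torch.nn.Module) -> None:
    """Recursively swap nn.Conv2d for QuantConv2d so fake Q/DQ nodes
    surround every convolution during calibration and fine-tuning."""
    for name, child in module.named_children():
        if isinstance(child, torch.nn.Conv2d):
            qconv = quant_nn.QuantConv2d(
                child.in_channels, child.out_channels, child.kernel_size,
                stride=child.stride, padding=child.padding,
                dilation=child.dilation, groups=child.groups,
                bias=child.bias is not None)
            qconv.weight.data.copy_(child.weight.data)
            if child.bias is not None:
                qconv.bias.data.copy_(child.bias.data)
            setattr(module, name, qconv)
        else:
            replace_conv2d(child)  # recurse into submodules
```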

Challenges

Quantizing all layers can, in some cases, decrease accuracy and increase latency, primarily due to the complexity of the last layer. To mitigate this, use the qat.py quantize --no-last-layer flag to exclude the last layer from quantization.
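Functionally, excluding the last layer amounts to disabling its quantizers so it stays in FP16 at engine-build time. A minimal sketch, assuming pytorch-quantization's TensorQuantizer and a hypothetical module prefix for the detection head (the real logic sits behind qat.py quantize --no-last-layer):

```python
from pytorch_quantization.nn import TensorQuantizer

def disable_last_layer_quant(model, head_prefix="model.22"):
    """Disable every TensorQuantizer under `head_prefix` (a hypothetical
    name for the detection head) so that layer keeps running in FP16."""
    for name, module in model.named_modules():
        if name.startswith(head_prefix) and isinstance(module, TensorQuantizer):
            module.disable()  # skip fake-quant for this tensor
```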

In this version, the Q/DQ scales are not yet optimized, which can lead TensorRT to generate unnecessary data-format conversions. Restricting the Q/DQ scales in models/quantize.py so that adjacent layers match data formats is essential for reducing latency (see the sketch below).
Contributions from the community are welcome here, as their knowledge is essential for implementing this functionality correctly.
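For reference, the usual fix (as in NVIDIA's YOLOv7-QAT example) is to make branches that feed the same Concat/Add share one input quantizer, so their Q/DQ scales are identical and TensorRT does not insert extra reformat kernels between them. A sketch under the assumption that pytorch-quantization's QuantConv2d is used; which branches to pair is illustrative:

```python
def share_input_scale(convs):
    """Point several pytorch-quantization QuantConv2d inputs at one
    TensorQuantizer so all branches feeding a Concat/Add carry the
    same Q/DQ scale (same amax -> same INT8 format for TensorRT)."""
    major = convs[0]._input_quantizer      # keep the first branch's quantizer
    for conv in convs[1:]:
        conv._input_quantizer = major      # other branches now share its scale
```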

Files Added / Modified

qat.py - Main

```
usage: qat.py [-h] {quantize,sensitive,eval} ...
positional arguments:
  {quantize,sensitive,eval}
    quantize            PTQ/QAT finetune ...
    sensitive           Sensitive layer analysis
    eval                Do evaluate
```

models/quantize.py - Quantize Module
models/quantize_rules.py - Quantize Rules
export.py - Changed to automatically detect QAT models and export them when using the flag --include onnx / onnx_end2end (see the export sketch below)
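For anyone exporting manually, the standard pytorch-quantization pattern is a single switch before torch.onnx.export so the graph carries real QuantizeLinear/DequantizeLinear nodes. A sketch only; the checkpoint layout and tensor names are assumptions, and export.py presumably does the equivalent:

```python
import torch
from pytorch_quantization import nn as quant_nn

# Emit real QuantizeLinear/DequantizeLinear ONNX nodes instead of
# PyTorch fake-quant ops; opset >= 13 is required for per-channel Q/DQ.
quant_nn.TensorQuantizer.use_fb_fake_quant = True

# Assumed YOLO-style checkpoint layout: {'model': nn.Module, ...}
model = torch.load("yolov9-c-qat.pt", map_location="cuda")["model"].float().eval()
dummy = torch.randn(1, 3, 640, 640, device="cuda")
torch.onnx.export(model, dummy, "yolov9-c-qat.onnx", opset_version=13,
                  input_names=["images"], output_names=["outputs"])
```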

Accuracy Report

QAT YOLOV9-C - ALL LAYERS

| Eval Model | AP     | AP50   | Precision | Recall |
|------------|--------|--------|-----------|--------|
| Origin     | 0.5297 | 0.699  | 0.7432    | 0.634  |
| PTQ        | 0.5295 | 0.6978 | 0.7455    | 0.6306 |
| QAT-Best   | 0.5291 | 0.6978 | 0.7449    | 0.632  |

QAT - YOLOV9-C - NO QAT LAST LAYER

| Eval Model | AP     | AP50   | Precision | Recall |
|------------|--------|--------|-----------|--------|
| Origin     | 0.5297 | 0.699  | 0.7432    | 0.634  |
| PTQ        | 0.529  | 0.698  | 0.7459    | 0.6297 |
| QAT-Best   | 0.5299 | 0.6984 | 0.7469    | 0.6305 |

QAT - YOLOV9-E - ALL LAYERS

| Eval Model | AP     | AP50   | Precision | Recall |
|------------|--------|--------|-----------|--------|
| Origin     | 0.5576 | 0.7246 | 0.7547    | 0.6649 |
| PTQ        | 0.5565 | 0.7241 | 0.7499    | 0.6649 |
| QAT-Best   | 0.5566 | 0.7232 | 0.7538    | 0.6637 |


QAT - YOLOV9-E - NO QAT LAST LAYER

| Eval Model | AP     | AP50   | Precision | Recall |
|------------|--------|--------|-----------|--------|
| Origin     | 0.5576 | 0.7246 | 0.7547    | 0.6649 |
| PTQ        | 0.5569 | 0.7242 | 0.7497    | 0.6646 |
| QAT-Best   | 0.5569 | 0.7239 | 0.7486    | 0.6657 |



Results using TensorRT engine models on Triton Inference Server
Tool: https://github.com/levipereira/triton-client-yolo

```
========================= EVALUATION SUMMARY - YOLOV9-C ========================
Average Precision  (AP) @[ IoU=0.50:0.95 | area=   all | maxDets=100 ] = 0.528
Average Precision  (AP) @[ IoU=0.50      | area=   all | maxDets=100 ] = 0.701
Average Precision  (AP) @[ IoU=0.75      | area=   all | maxDets=100 ] = 0.577
Average Precision  (AP) @[ IoU=0.50:0.95 | area= small | maxDets=100 ] = 0.361
Average Precision  (AP) @[ IoU=0.50:0.95 | area=medium | maxDets=100 ] = 0.582
Average Precision  (AP) @[ IoU=0.50:0.95 | area= large | maxDets=100 ] = 0.689
Average Recall     (AR) @[ IoU=0.50:0.95 | area=   all | maxDets=  1 ] = 0.392
Average Recall     (AR) @[ IoU=0.50:0.95 | area=   all | maxDets= 10 ] = 0.652
Average Recall     (AR) @[ IoU=0.50:0.95 | area=   all | maxDets=100 ] = 0.701
Average Recall     (AR) @[ IoU=0.50:0.95 | area= small | maxDets=100 ] = 0.538
Average Recall     (AR) @[ IoU=0.50:0.95 | area=medium | maxDets=100 ] = 0.759
Average Recall     (AR) @[ IoU=0.50:0.95 | area= large | maxDets=100 ] = 0.848
================================================================================
mAP@0.5:0.95: 0.528
mAP@0.5:      0.701
mAP@0.75:     0.577
================================================================================
```


```
======================= EVALUATION SUMMARY - YOLOV9-C-QAT ======================
Average Precision  (AP) @[ IoU=0.50:0.95 | area=   all | maxDets=100 ] = 0.528
Average Precision  (AP) @[ IoU=0.50      | area=   all | maxDets=100 ] = 0.699
Average Precision  (AP) @[ IoU=0.75      | area=   all | maxDets=100 ] = 0.576
Average Precision  (AP) @[ IoU=0.50:0.95 | area= small | maxDets=100 ] = 0.359
Average Precision  (AP) @[ IoU=0.50:0.95 | area=medium | maxDets=100 ] = 0.581
Average Precision  (AP) @[ IoU=0.50:0.95 | area= large | maxDets=100 ] = 0.692
Average Recall     (AR) @[ IoU=0.50:0.95 | area=   all | maxDets=  1 ] = 0.392
Average Recall     (AR) @[ IoU=0.50:0.95 | area=   all | maxDets= 10 ] = 0.651
Average Recall     (AR) @[ IoU=0.50:0.95 | area=   all | maxDets=100 ] = 0.699
Average Recall     (AR) @[ IoU=0.50:0.95 | area= small | maxDets=100 ] = 0.534
Average Recall     (AR) @[ IoU=0.50:0.95 | area=medium | maxDets=100 ] = 0.758
Average Recall     (AR) @[ IoU=0.50:0.95 | area= large | maxDets=100 ] = 0.845
================================================================================
mAP@0.5:0.95: 0.528
mAP@0.5:      0.699
mAP@0.75:     0.576
================================================================================
```

Latency Report

  • Device Properties:
    • Selected Device: NVIDIA GeForce RTX 4090
      • Compute Capability: 8.9
      • SMs: 128
      • Compute Clock Rate: 2.58 GHz
      • Device Global Memory: 24207 MiB
      • Shared Memory per SM: 100 KiB
      • Memory Bus Width: 384 bits
      • Memory Clock Rate: 10.501 GHz

Table Info:

  • "Average time": refers to the sum of the layer latencies, when profiling layers separately.
  • "Throughput": is measured in inferences per second (IPS).

Origin

| Model    | Precision | Batch Size | Layers | Weights (MB) | Activations (MB) | Throughput (IPS) | Total Throughput (IPS) | Average time (ms) |
|----------|-----------|------------|--------|--------------|------------------|------------------|------------------------|-------------------|
| yolov9-c | FP16      | 1          | 271    | 48.2         | 611.7            | 792              | 792                    | 2.1               |
| yolov9-c | FP16      | 8          | 273    | 48.2         | 4809.1           | 151              | 1209                   | 7.3               |
| yolov9-e | FP16      | 1          | 487    | 109.3        | 1706.5           | 353              | 353                    | 4.3               |
| yolov9-e | FP16      | 8          | 477    | 109.3        | 13461.3          | 57               | 457                    | 18.8              |

Last Layer not Quantized

| Model        | Precision | Batch Size | Layers | Weights (MB) | Activations (MB) | Throughput (IPS) | Total Throughput (IPS) | Average time (ms) |
|--------------|-----------|------------|--------|--------------|------------------|------------------|------------------------|-------------------|
| yolov9-c-qat | FP16/INT8 | 1          | 288    | 29.4         | 534.7            | 951              | 951                    | 1.9               |
| yolov9-c-qat | FP16/INT8 | 8          | 287    | 29.4         | 4190.2           | 181              | 1447                   | 6.4               |
| yolov9-e-qat | FP16/INT8 | 1          | 526    | 63.1         | 1757.0           | 405              | 405                    | 4.1               |
| yolov9-e-qat | FP16/INT8 | 8          | 526    | 63.1         | 13407.7          | 60               | 482                    | 18.2              |

All Layers Quantized

| Model        | Precision | Batch Size | Layers | Weights (MB) | Activations (MB) | Throughput (IPS) | Total Throughput (IPS) | Average time (ms) |
|--------------|-----------|------------|--------|--------------|------------------|------------------|------------------------|-------------------|
| yolov9-c-qat | FP16/INT8 | 1          | 295    | 24.2         | 540.1            | 957              | 957                    | 1.9               |
| yolov9-c-qat | FP16/INT8 | 8          | 293    | 24.2         | 4216.7           | 193              | 1547                   | 6.1               |
| yolov9-e-qat | FP16/INT8 | 1          | 532    | 57.8         | 1779.5           | 396              | 396                    | 4.1               |
| yolov9-e-qat | FP16/INT8 | 8          | 532    | 57.8         | 13431.8          | 62               | 493                    | 17.8              |
@levipereira

Added: two repositories for testing YOLOv9 QAT models.


ou525 commented Mar 18, 2024

Thanks for sharing. It would be better if there were an ONNX export for standalone deployment, not just Triton.

@trivedisarthak

@levipereira It would be interesting to see how the performance on Triton compares with YOLOv7-QAT, since the paper does not discuss this and neither does #143.


demuxin commented Mar 25, 2024

@levipereira Thank you for your contribution. A question: do I have to train the model in order to get a quantized one?

@levipereira

@demuxin Yes.

@levipereira

@trivedisarthak check the OP.

@levipereira

The original implementation is in #327.


R4Ajeti commented May 24, 2024

@levipereira How can I run quantization on a custom-trained YOLOv9 model?
