YOLOv9-QAT TensorRT Q/DQ: Improved Speed and Zero Accuracy Loss #253

Closed
levipereira opened this issue Mar 15, 2024 · 9 comments
levipereira commented Mar 15, 2024

This is outdated; follow the new repo instead:
https://github.com/levipereira/yolov9-qat

Please follow the original implementation in #327.

@WongKinYiu

I have developed the initial version of YOLOv9-QAT using the Q/DQ method, tailored specifically for YOLOv9 models intended for execution solely on TensorRT.

This implementation currently supports only the Inference Models (Converted and Gelan models).

The source code is available in the yolov9-qat branch.
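For context, here is a minimal sketch of what the Q/DQ insertion amounts to, assuming the branch builds on NVIDIA's pytorch-quantization toolkit (an assumption on my part; the actual logic lives in models/quantize.py, and `replace_conv2d` below is purely illustrative):

```python
import torch
from pytorch_quantization import nn as quant_nn
from pytorch_quantization.tensor_quant import QuantDescriptor

# Histogram-calibrated activation ranges; weights default to per-channel max.
quant_nn.QuantConv2d.set_default_quant_desc_input(
    QuantDescriptor(calib_method="histogram"))

def replace_conv2d(module: torch.nn.Module) -> None:
    """Recursively swap nn.Conv2d for QuantConv2d so fake Q/DQ nodes
    surround every convolution during calibration and fine-tuning."""
    for name, child in module.named_children():
        if isinstance(child, torch.nn.Conv2d):
            qconv = quant_nn.QuantConv2d(
                child.in_channels, child.out_channels, child.kernel_size,
                stride=child.stride, padding=child.padding,
                dilation=child.dilation, groups=child.groups,
                bias=child.bias is not None)
            qconv.weight.data.copy_(child.weight.data)
            if child.bias is not None:
                qconv.bias.data.copy_(child.bias.data)
            setattr(module, name, qconv)
        else:
            replace_conv2d(child)  # recurse into submodules
```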

Challenges

Quantizing all layers can, in some cases, decrease accuracy and increase latency, primarily due to the complexity of the last layer. To mitigate this, use the qat.py quantize --no-last-layer flag to exclude the last layer from quantization.
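Functionally, excluding the last layer amounts to disabling its quantizers so it stays in FP16 at engine-build time. A minimal sketch, assuming pytorch-quantization's TensorQuantizer and a hypothetical module prefix for the detection head (the real logic sits behind qat.py quantize --no-last-layer):

```python
from pytorch_quantization.nn import TensorQuantizer

def disable_last_layer_quant(model, head_prefix="model.22"):
    """Disable every TensorQuantizer under `head_prefix` (a hypothetical
    name for the detection head) so that layer keeps running in FP16."""
    for name, module in model.named_modules():
        if name.startswith(head_prefix) and isinstance(module, TensorQuantizer):
            module.disable()  # skip fake-quant for this tensor
```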

In this version, the Q/DQ scales are not yet optimized, which can lead TensorRT to generate unnecessary data-format conversions. Restricting the Q/DQ scales in models/quantize.py so that adjacent layers match data formats is essential for reducing latency (see the sketch below).
Contributions from the community are welcome here, as their knowledge is essential for implementing this functionality correctly.
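For reference, the usual fix (as in NVIDIA's YOLOv7-QAT example) is to make branches that feed the same Concat/Add share one input quantizer, so their Q/DQ scales are identical and TensorRT does not insert extra reformat kernels between them. A sketch under the assumption that pytorch-quantization's QuantConv2d is used; which branches to pair is illustrative:

```python
def share_input_scale(convs):
    """Point several pytorch-quantization QuantConv2d inputs at one
    TensorQuantizer so all branches feeding a Concat/Add carry the
    same Q/DQ scale (same amax -> same INT8 format for TensorRT)."""
    major = convs[0]._input_quantizer      # keep the first branch's quantizer
    for conv in convs[1:]:
        conv._input_quantizer = major      # other branches now share its scale
```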

Files Added / Modified

qat.py - Main

```
usage: qat.py [-h] {quantize,sensitive,eval} ...
positional arguments:
  {quantize,sensitive,eval}
    quantize            PTQ/QAT finetune ...
    sensitive           Sensitive layer analysis
    eval                Do evaluate
```

models/quantize.py - Quantize Module
models/quantize_rules.py - Quantize Rules
export.py - Changed to automatically detect QAT models and export them when using the flag --include onnx / onnx_end2end (see the export sketch below)
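For anyone exporting manually, the standard pytorch-quantization pattern is a single switch before torch.onnx.export so the graph carries real QuantizeLinear/DequantizeLinear nodes. A sketch only; the checkpoint layout and tensor names are assumptions, and export.py presumably does the equivalent:

```python
import torch
from pytorch_quantization import nn as quant_nn

# Emit real QuantizeLinear/DequantizeLinear ONNX nodes instead of
# PyTorch fake-quant ops; opset >= 13 is required for per-channel Q/DQ.
quant_nn.TensorQuantizer.use_fb_fake_quant = True

# Assumed YOLO-style checkpoint layout: {'model': nn.Module, ...}
model = torch.load("yolov9-c-qat.pt", map_location="cuda")["model"].float().eval()
dummy = torch.randn(1, 3, 640, 640, device="cuda")
torch.onnx.export(model, dummy, "yolov9-c-qat.onnx", opset_version=13,
                  input_names=["images"], output_names=["outputs"])
```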

Accuracy Report

QAT YOLOV9-C - ALL LAYERS

| Eval Model | AP     | AP50   | Precision | Recall |
|------------|--------|--------|-----------|--------|
| Origin     | 0.5297 | 0.699  | 0.7432    | 0.634  |
| PTQ        | 0.5295 | 0.6978 | 0.7455    | 0.6306 |
| QAT-Best   | 0.5291 | 0.6978 | 0.7449    | 0.632  |

QAT - YOLOV9-C - NO QAT LAST LAYER

| Eval Model | AP     | AP50   | Precision | Recall |
|------------|--------|--------|-----------|--------|
| Origin     | 0.5297 | 0.699  | 0.7432    | 0.634  |
| PTQ        | 0.529  | 0.698  | 0.7459    | 0.6297 |
| QAT-Best   | 0.5299 | 0.6984 | 0.7469    | 0.6305 |

QAT - YOLOV9-E - ALL LAYERS

| Eval Model | AP     | AP50   | Precision | Recall |
|------------|--------|--------|-----------|--------|
| Origin     | 0.5576 | 0.7246 | 0.7547    | 0.6649 |
| PTQ        | 0.5565 | 0.7241 | 0.7499    | 0.6649 |
| QAT-Best   | 0.5566 | 0.7232 | 0.7538    | 0.6637 |


QAT - YOLOV9-E - NO QAT LAST LAYER

| Eval Model | AP     | AP50   | Precision | Recall |
|------------|--------|--------|-----------|--------|
| Origin     | 0.5576 | 0.7246 | 0.7547    | 0.6649 |
| PTQ        | 0.5569 | 0.7242 | 0.7497    | 0.6646 |
| QAT-Best   | 0.5569 | 0.7239 | 0.7486    | 0.6657 |



Results using TensorRT engine models on Triton Inference Server
Tool: https://github.com/levipereira/triton-client-yolo

```
========================= EVALUATION SUMMARY - YOLOV9-C ========================
Average Precision  (AP) @[ IoU=0.50:0.95 | area=   all | maxDets=100 ] = 0.528
Average Precision  (AP) @[ IoU=0.50      | area=   all | maxDets=100 ] = 0.701
Average Precision  (AP) @[ IoU=0.75      | area=   all | maxDets=100 ] = 0.577
Average Precision  (AP) @[ IoU=0.50:0.95 | area= small | maxDets=100 ] = 0.361
Average Precision  (AP) @[ IoU=0.50:0.95 | area=medium | maxDets=100 ] = 0.582
Average Precision  (AP) @[ IoU=0.50:0.95 | area= large | maxDets=100 ] = 0.689
Average Recall     (AR) @[ IoU=0.50:0.95 | area=   all | maxDets=  1 ] = 0.392
Average Recall     (AR) @[ IoU=0.50:0.95 | area=   all | maxDets= 10 ] = 0.652
Average Recall     (AR) @[ IoU=0.50:0.95 | area=   all | maxDets=100 ] = 0.701
Average Recall     (AR) @[ IoU=0.50:0.95 | area= small | maxDets=100 ] = 0.538
Average Recall     (AR) @[ IoU=0.50:0.95 | area=medium | maxDets=100 ] = 0.759
Average Recall     (AR) @[ IoU=0.50:0.95 | area= large | maxDets=100 ] = 0.848
================================================================================
mAP@0.5:0.95: 0.528
mAP@0.5:      0.701
mAP@0.75:     0.577
================================================================================
```


```
======================= EVALUATION SUMMARY - YOLOV9-C-QAT ======================
Average Precision  (AP) @[ IoU=0.50:0.95 | area=   all | maxDets=100 ] = 0.528
Average Precision  (AP) @[ IoU=0.50      | area=   all | maxDets=100 ] = 0.699
Average Precision  (AP) @[ IoU=0.75      | area=   all | maxDets=100 ] = 0.576
Average Precision  (AP) @[ IoU=0.50:0.95 | area= small | maxDets=100 ] = 0.359
Average Precision  (AP) @[ IoU=0.50:0.95 | area=medium | maxDets=100 ] = 0.581
Average Precision  (AP) @[ IoU=0.50:0.95 | area= large | maxDets=100 ] = 0.692
Average Recall     (AR) @[ IoU=0.50:0.95 | area=   all | maxDets=  1 ] = 0.392
Average Recall     (AR) @[ IoU=0.50:0.95 | area=   all | maxDets= 10 ] = 0.651
Average Recall     (AR) @[ IoU=0.50:0.95 | area=   all | maxDets=100 ] = 0.699
Average Recall     (AR) @[ IoU=0.50:0.95 | area= small | maxDets=100 ] = 0.534
Average Recall     (AR) @[ IoU=0.50:0.95 | area=medium | maxDets=100 ] = 0.758
Average Recall     (AR) @[ IoU=0.50:0.95 | area= large | maxDets=100 ] = 0.845
================================================================================
mAP@0.5:0.95: 0.528
mAP@0.5:      0.699
mAP@0.75:     0.576
================================================================================
```

Latency Report

  • Device Properties:
    • Selected Device: NVIDIA GeForce RTX 4090
      • Compute Capability: 8.9
      • SMs: 128
      • Compute Clock Rate: 2.58 GHz
      • Device Global Memory: 24207 MiB
      • Shared Memory per SM: 100 KiB
      • Memory Bus Width: 384 bits
      • Memory Clock Rate: 10.501 GHz

Table Info:

  • "Average time": refers to the sum of the layer latencies, when profiling layers separately.
  • "Throughput": is measured in inferences per second (IPS).

Origin

| Model    | Precision | Batch Size | Layers | Weights (MB) | Activations (MB) | Throughput (IPS) | Total Throughput (IPS) | Average time (ms) |
|----------|-----------|------------|--------|--------------|------------------|------------------|------------------------|-------------------|
| yolov9-c | FP16      | 1          | 271    | 48.2         | 611.7            | 792              | 792                    | 2.1               |
| yolov9-c | FP16      | 8          | 273    | 48.2         | 4809.1           | 151              | 1209                   | 7.3               |
| yolov9-e | FP16      | 1          | 487    | 109.3        | 1706.5           | 353              | 353                    | 4.3               |
| yolov9-e | FP16      | 8          | 477    | 109.3        | 13461.3          | 57               | 457                    | 18.8              |

Last Layer not Quantized

| Model        | Precision | Batch Size | Layers | Weights (MB) | Activations (MB) | Throughput (IPS) | Total Throughput (IPS) | Average time (ms) |
|--------------|-----------|------------|--------|--------------|------------------|------------------|------------------------|-------------------|
| yolov9-c-qat | FP16/INT8 | 1          | 288    | 29.4         | 534.7            | 951              | 951                    | 1.9               |
| yolov9-c-qat | FP16/INT8 | 8          | 287    | 29.4         | 4190.2           | 181              | 1447                   | 6.4               |
| yolov9-e-qat | FP16/INT8 | 1          | 526    | 63.1         | 1757.0           | 405              | 405                    | 4.1               |
| yolov9-e-qat | FP16/INT8 | 8          | 526    | 63.1         | 13407.7          | 60               | 482                    | 18.2              |

All Layers Quantized

| Model        | Precision | Batch Size | Layers | Weights (MB) | Activations (MB) | Throughput (IPS) | Total Throughput (IPS) | Average time (ms) |
|--------------|-----------|------------|--------|--------------|------------------|------------------|------------------------|-------------------|
| yolov9-c-qat | FP16/INT8 | 1          | 295    | 24.2         | 540.1            | 957              | 957                    | 1.9               |
| yolov9-c-qat | FP16/INT8 | 8          | 293    | 24.2         | 4216.7           | 193              | 1547                   | 6.1               |
| yolov9-e-qat | FP16/INT8 | 1          | 532    | 57.8         | 1779.5           | 396              | 396                    | 4.1               |
| yolov9-e-qat | FP16/INT8 | 8          | 532    | 57.8         | 13431.8          | 62               | 493                    | 17.8              |
@levipereira

Added: two repositories for testing YOLOv9 QAT models.


ou525 commented Mar 18, 2024

Thanks for sharing. It would be better if there were an ONNX export for standalone deployment, not just Triton.

@trivedisarthak

@levipereira It would be interesting to see how the performance on Triton compares with YOLOv7-QAT, since the paper does not discuss this and neither does #143.


demuxin commented Mar 25, 2024

@levipereira Thank you for your contribution. A question: do I have to train the model in order to get a quantized one?

@levipereira

@demuxin Yes.

@levipereira

@trivedisarthak check the OP.

@levipereira

The original implementation is in #327.


R4Ajeti commented May 24, 2024

@levipereira How can I run quantization on a custom-trained YOLOv9 model?
