
SOTA claims vs leaderboards misalignment #40

Open · LifeIsStrange opened this issue Jul 7, 2022 · 7 comments

Comments

@LifeIsStrange

@WongKinYiu @AlexeyAB
Hi, friendly ping.

YOLOv7 surpasses all known object detectors in both speed and accuracy in the range from 5 FPS to 160 FPS and has the highest accuracy 56.8% AP among all known real-time object detectors with 30 FPS or higher on GPU V100.

This is a weird claim when you actually rank #20 on COCO.
If we exclude all models with extra training data, you still rank #11.
The #1 without extra data is Dual-Swin-L (HTC, multi-scale) with 60.1 box AP;
with extra data it is DINO (Swin-L, multi-scale) with 63.3 box AP.

@AlexeyAB (Collaborator) commented Jul 7, 2022

They are much slower than 5 FPS on a Tesla V100 GPU, and they are not real-time.

  • Dual-Swin-L (HTC) 1600x1600 - 59.1% AP - 1.5 FPS on V100 - isn't real-time - ~2000% slower (FPS) than YOLOv7-e6e
  • Dual-Swin-L (HTC, multi-scale) - 60.1% AP - 0.3 FPS on V100 - isn't real-time - ~12000% slower (FPS) than YOLOv7-e6e
  • DINO-5scale-R50 (10 FPS, 51.0% AP) is less accurate and ~1500% slower (FPS) than YOLOv7 (161 FPS, 51.2% AP)
  • DINO (Swin-L, multi-scale) with 63.3 box AP - additional training datasets are used (so no fair comparison), no publicly available code or models, and it runs slower than 1 FPS - isn't real-time; ~10000% slower than YOLOv7-e6e

Dual-Swin-L (HTC) and DINO-5scale (R50) are both in Table 9: https://arxiv.org/abs/2207.02696
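
As a side note, here is a minimal sketch of how those "% slower" figures fall out of the raw FPS numbers. The 36 FPS figure for YOLOv7-e6e is the one quoted later in this thread; the small differences from the quoted percentages come from rounding.

```python
# Hypothetical helper, not from the YOLOv7 codebase: derives the
# "% slower" figures quoted above from raw throughput numbers.

def percent_slower(fps_ref: float, fps_other: float) -> float:
    """How much slower fps_other is than fps_ref, in percent of throughput."""
    return (fps_ref / fps_other - 1.0) * 100.0

print(percent_slower(36, 1.5))  # Dual-Swin-L (HTC) vs YOLOv7-e6e  -> ~2300 (quoted ~2000%)
print(percent_slower(36, 0.3))  # Dual-Swin-L multi-scale          -> ~11900 (quoted ~12000%)
print(percent_slower(161, 10))  # DINO-5scale-R50 vs YOLOv7        -> ~1510 (quoted ~1500%)
```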

@LifeIsStrange (Author) commented Jul 8, 2022

@AlexeyAB Great answer! I can see the significant value proposition of this implementation now :)
So how about you update the abstract from

YOLOv7 surpasses all known object detectors

to

YOLOv7 surpasses all known real-time object detectors

Bonus question:
how does it compare to the recently announced YOLOv6? https://github.com/meituan/YOLOv6

@AlexeyAB (Collaborator) commented Jul 10, 2022

YOLOv7 surpasses all known object detectors

to

YOLOv7 surpasses all known real-time object detectors

Real-time is 30 FPS or higher.

YOLOv7 surpasses not only real-time detectors from 30 to 160 FPS, but also non-real-time detectors in the range from 4 to 30 FPS.

how does it compare to the recently announced YOLOv6? https://github.com/meituan/YOLOv6

Page 11: https://arxiv.org/pdf/2207.02696.pdf

[image: comparison with YOLOv6 from page 11 of the paper]

@LifeIsStrange (Author) commented Jul 10, 2022

@AlexeyAB
Fair enough, I wish every paper would defend its value as well as you did, in an evidence-based way :).
However, it seems to me that YOLOR-D6 beats YOLOv7 (in some FPS range at least).
YOLOR-D6 is not YOLOv6; it achieves 57.3% AP, which is 0.5% more than YOLOv7, at 34 FPS while YOLOv7 runs at 36 FPS, if I understand correctly.
Still, YOLOR-D6 does use extra training data. But at the end of the day, end users want a fast model with the best accuracy and will generally accept extra training data for pragmatism's sake.
Hence the following questions:
Do you plan on making a YOLOv7 version with improved accuracy by leveraging extra training data?
Secondly, I believe you could improve the state of the art without significantly altering performance by being the first to use the following very-simple-to-adopt innovations for object detection:
https://github.com/lessw2020/Ranger21
or
https://github.com/lessw2020/Ranger-Deep-Learning-Optimizer
https://arxiv.org/abs/2106.13731

It includes generally applicable innovations that improve accuracy, such as:
https://github.com/digantamisra98/Mish
The Mish activation function is in most cases the best activation function, often yielding a 0.5-1% accuracy increase for free.
Ranger can additionally use gradient centralization,
https://github.com/Yonghongwei/Gradient-Centralization
which also generally gives free gains.
It can then use a synergetic combination of optimizers,
such as RAdam in place of Adam
https://github.com/LiyuanLucasLiu/RAdam
+
the complementary Lookahead
https://github.com/michaelrzhang/lookahead
and others.

This library makes the integration and selection of optimization passes easy (see the sketch after this comment for how some of these pieces fit together). It is a tragedy that these innovations are generally ignored despite their huge potential to improve SOTA for free on key tasks.
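
For illustration only, here is a minimal PyTorch sketch of that kind of combination. `torch.optim.RAdam` and `nn.Mish` ship with recent PyTorch releases; the `Lookahead` class below is a simplified re-implementation of the idea behind the linked repo, not that repo's actual code, and Ranger21's other passes (gradient centralization, warmup schedules, etc.) are omitted.

```python
# Sketch only: RAdam as the inner ("fast") optimizer, a minimal Lookahead
# wrapper around it, and Mish as the activation. torch.optim.RAdam and
# nn.Mish exist in recent PyTorch; this Lookahead is a simplified
# illustration of the idea, not the linked repo's implementation.
import torch
import torch.nn as nn


class Lookahead:
    """Run the inner optimizer for k steps, then interpolate a set of
    'slow' weights toward the current 'fast' weights and restart the
    fast weights from the slow ones."""

    def __init__(self, inner, k=5, alpha=0.5):
        self.inner, self.k, self.alpha = inner, k, alpha
        self.steps = 0
        # one detached slow copy per parameter
        self.slow = [[p.detach().clone() for p in g["params"]]
                     for g in inner.param_groups]

    def zero_grad(self):
        self.inner.zero_grad()

    @torch.no_grad()
    def step(self):
        self.inner.step()
        self.steps += 1
        if self.steps % self.k == 0:
            for group, slow_params in zip(self.inner.param_groups, self.slow):
                for fast, slow in zip(group["params"], slow_params):
                    slow += self.alpha * (fast - slow)  # slow <- slow + a*(fast - slow)
                    fast.copy_(slow)                    # fast restarts from slow


# Toy model using Mish instead of ReLU/SiLU
model = nn.Sequential(nn.Linear(8, 16), nn.Mish(), nn.Linear(16, 1))
opt = Lookahead(torch.optim.RAdam(model.parameters(), lr=1e-3))

x, y = torch.randn(32, 8), torch.randn(32, 1)
for _ in range(20):
    opt.zero_grad()
    loss = nn.functional.mse_loss(model(x), y)
    loss.backward()
    opt.step()
```

Whether any of this translates into AP gains for YOLOv7 specifically would of course have to be measured on COCO rather than assumed.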

@AlexeyAB (Collaborator)

Still, YOLOR-D6 does use extra training data. But at the end of the day, end users want a fast model with the best accuracy and will generally accept extra training data for pragmatism's sake.

If you train your own model on your custom dataset, you will get higher accuracy with YOLOv7 than with YOLOR. And YOLOv7 is faster.

@silvada95

What definition do you use to decide whether a detector is real-time or not? I've seen a lot of authors mention it in their work, but with no definition at all...

@SteTala97

What definition do you use to decide whether a detector is real-time or not?

AlexeyAB commented on Jul 10, 2022:

Real-time is 30 FPS or higher.

So, real-time is 30 FPS or higher.
It commonly refers to the fact that if your input comes from a 30 FPS camera, or you are processing a video captured by a 30 FPS camera (the most common video frame rate), there is no delay between one frame and the next. Of course, this also means that if the input rate of your system is e.g. 10 FPS, a model that runs at 10 FPS can be considered "real-time" for your application (see the sketch below).
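
A minimal sketch of that frame-budget reasoning (the function name and numbers are illustrative, not from any codebase):

```python
# Illustrative only: "real-time" is relative to the input frame rate.

def is_realtime(model_fps: float, source_fps: float = 30.0) -> bool:
    """A detector keeps up with a video source when its per-frame
    latency fits inside the source's frame period."""
    frame_budget_ms = 1000.0 / source_fps   # ~33.3 ms at 30 FPS
    model_latency_ms = 1000.0 / model_fps
    return model_latency_ms <= frame_budget_ms

print(is_realtime(36))       # YOLOv7-e6e vs a 30 FPS camera -> True
print(is_realtime(10))       # a 10 FPS model vs a 30 FPS camera -> False
print(is_realtime(10, 10))   # ...but real-time for a 10 FPS input -> True
```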
