Observations on the calculations of COCO metrics #56
The computation you are describing is not how average precision is computed. I recommend reading http://homepages.inf.ed.ac.uk/ckiw/postscript/ijcv_voc09.pdf, section 4.2, or you can find a number of references online. AP is the area under the precision-recall curve. For each detection, you have a confidence. You then match detections (ordered by confidence) to ground truth, and for each recall value you get a precision. You then compute the area under this curve. There are a number of subtleties in this computation, but that's the overall idea. Take a look and I hope that answers your questions. Thanks!
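To make that concrete, here is a rough, unofficial sketch of AP as the raw area under the precision-recall curve (the real COCO evaluation adds IoU thresholds, area ranges, max-detection limits, and interpolation). The names `scores`, `is_true_positive`, and `num_gt` are hypothetical inputs, and the matching of detections to ground truth is assumed to be done already:

```python
import numpy as np

def average_precision(scores, is_true_positive, num_gt):
    """Toy AP: rank detections by confidence, trace the precision-recall
    curve, and integrate the area under it (no interpolation)."""
    order = np.argsort(-np.asarray(scores, dtype=float))      # most confident first
    tp = np.asarray(is_true_positive, dtype=float)[order]
    fp = 1.0 - tp
    cum_tp, cum_fp = np.cumsum(tp), np.cumsum(fp)
    recall = cum_tp / num_gt                                  # fraction of GT recovered so far
    precision = cum_tp / (cum_tp + cum_fp)                    # fraction of detections correct so far
    prev_recall = np.concatenate(([0.0], recall[:-1]))
    return float(np.sum((recall - prev_recall) * precision))  # area under the PR curve
```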
Thanks @pdollar, I will read the references in depth then :) I am mainly surprised to get a perfect metric of 1.0 for case 1, where we clearly have a large false alarm! Can we say that the metrics calculated by the COCO API (avg. recall = 1, avg. precision = 1) do not represent our system well in case 1, with the large false alarm?
@RSly like Piotr mentioned, the detection score is needed to compute the precision/recall curve and average precision, which is clearly missing in your description. AP is a metric that averages precision over recall.
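For what it's worth, cocoeval averages precision over 101 evenly spaced recall thresholds (0.00, 0.01, ..., 1.00), counting 0 for thresholds the detector never reaches. A rough, unofficial sketch of just that averaging step (ignoring IoU thresholds, area ranges, and max-detection limits) could look like:

```python
import numpy as np

def coco_style_ap(recall, precision, num_thresholds=101):
    """Approximate cocoeval's recall-threshold averaging: for each recall
    threshold take the best precision achieved at or beyond it, using 0
    for thresholds the detector never reaches, then average."""
    recall, precision = np.asarray(recall), np.asarray(precision)
    total = 0.0
    for t in np.linspace(0.0, 1.0, num_thresholds):
        reached = recall >= t
        total += precision[reached].max() if reached.any() else 0.0
    return total / num_thresholds
```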
Hi @tylin, thanks for the explanation; it is clear now. However, before closing this issue, could you please test the attached JSON files (coco_problem.zip) with both the Python and MATLAB APIs? The two give different results for me:
Python:
Matlab:
Thanks @RSly for pointing this out. We'll take a look when/if we have more time. I suspect this is a divide-by-zero kind of situation (and then the errors propagate to give different results elsewhere). If that's the case, it should hopefully never happen on real data (where there will be multiple errors and successes of every kind). Indeed, using real data we have extensively verified that the results of the Matlab/Python code are the same. Still, it is useful to have checks for this kind of degenerate case, so if we have more time we'll look into it. Thanks.
@pdollar, thanks for the answer.
The results make sense according to the AP metric, and they depend heavily on how you rank your detections. If your most confident detection is a true positive and there is only one ground-truth object, then regardless of how many false positives you make, you will get an AP of 1, because that is the area under the precision-recall curve.
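A quick numerical check of this with the toy `average_precision` sketch above (not the official code): one ground-truth object, a correct detection at confidence 0.9, and a false alarm at confidence 0.3. The false alarm is only counted after recall has already hit 1.0 at precision 1.0, so it adds nothing below the curve:

```python
# one GT box; detections ranked by confidence: true positive (0.9), false positive (0.3)
ap = average_precision(scores=[0.9, 0.3], is_true_positive=[1, 0], num_gt=1)
print(ap)  # -> 1.0: the PR curve reaches (recall=1, precision=1) before the false alarm appears
```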
Hi guys, I get similar results to what @RSly has reported. I have compiled my results and observations here. In short: if I have … @pdollar, @tylin, @RSly, could you please help me out? What could be the issue?
After struggling with a similar issue to @botcs above, I found this comment by @bhfs9999 at the bottom of this gist, https://gist.github.com/botcs/5d13a744104ab1fa9fdd9987ea7ff97a, which seems to solve the problem. I wrote a unit test that had just a single image with a single ground-truth box, and a single predicted box with perfect overlap and a score of 1.0. I expected the AP to be 1.0, but it was 0.0. After changing the 0-based IDs as suggested there, I got the expected AP of 1.0.
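For reference, a minimal sketch of that unit test against the pycocotools API (written from memory, with all image/annotation/category IDs starting at 1 rather than 0, per the gist comment) might look roughly like this:

```python
from pycocotools.coco import COCO
from pycocotools.cocoeval import COCOeval

# Ground truth: a single image with a single box. Note every id starts at 1, not 0.
gt = {
    "images": [{"id": 1, "width": 100, "height": 100}],
    "annotations": [{"id": 1, "image_id": 1, "category_id": 1,
                     "bbox": [10, 10, 20, 20], "area": 400, "iscrowd": 0}],
    "categories": [{"id": 1, "name": "object"}],
}
coco_gt = COCO()
coco_gt.dataset = gt
coco_gt.createIndex()

# A single detection with perfect overlap and a score of 1.0.
detections = [{"image_id": 1, "category_id": 1, "bbox": [10, 10, 20, 20], "score": 1.0}]
coco_dt = coco_gt.loadRes(detections)

coco_eval = COCOeval(coco_gt, coco_dt, iouType="bbox")
coco_eval.evaluate()
coco_eval.accumulate()
coco_eval.summarize()
print(coco_eval.stats[0])  # AP@[.50:.95]; expected to be 1.0
```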
@botcs Did that have such an effect on a full dataset? I can imagine it bringing down results for a tiny debugging dataset (2 images, 1 not being used because of the 0-index) - did it mess up a real dataset for you?
@ividal Only on the order of 1e-4.
@tylin The first detection will never get a recall of 1.0 unless there is only a single ground-truth object.
@botcs Yes, I also found that if you cannot reach a recall of 1.0, the precision reported by the cocoeval code will be 0 for the unreached recall thresholds. This makes AP decrease dramatically if you only have a small number of region proposals (even if all of them are correct).
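Using the rough `coco_style_ap` sketch from earlier in the thread, this is easy to reproduce: 10 ground-truth objects but only 4 proposals, all of them correct, keep precision at 1.0 yet cap recall at 0.4, and every recall threshold above 0.4 contributes a precision of 0:

```python
# 10 GT objects, 4 proposals, every one of them a true positive:
# precision stays at 1.0 but recall never exceeds 0.4
ap = coco_style_ap(recall=[0.1, 0.2, 0.3, 0.4], precision=[1.0, 1.0, 1.0, 1.0])
print(ap)  # -> ~0.41 (only 41 of the 101 recall thresholds are ever reached)
```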
In the VOC dataset paper [1], I found wording that confirms AP does penalize methods with few proposals and a low false-alarm rate:
[1] Everingham, Mark, Luc Van Gool, Christopher K. I. Williams, John Winn, and Andrew Zisserman. "The Pascal Visual Object Classes (VOC) Challenge." International Journal of Computer Vision 88, no. 2 (June 2010): 303–38. https://doi.org/10.1007/s11263-009-0275-4.
Hi,
I have some observations on the COCO metrics, especially the precision metric, that I would like to share.
It would be great if someone could clarify these points :) /cc @pdollar @tylin
For calculating precision/recall, I am computing the COCO average precision to get a feeling for the system's results. To better explain the issue, I will also calculate these metrics over all the observations as a whole (say, as one large stitched image rather than many separate images), which I call the overall recall/precision here.
Case 1. A system with perfect detection plus one false alarm: in this case, as detailed in the next figure, the COCO average precision comes out as 1.0, completely ignoring the existence of the false alarm!
Case 2. A system with zero false alarms: in this case we have no false alarms, and thus the overall precision is a perfect 1.0; however, the COCO precision comes out as 0.5! This case is very important, since it could mean that the COCO average precision penalizes systems with no false alarms and favors the detection part of a system in the evaluation. As you may know, systems with zero or few false alarms are of great importance in industrial applications.
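A hypothetical numerical version of case 2, using the toy `average_precision` sketch from earlier in this thread: two ground-truth objects, one correct detection, and zero false alarms give an overall precision of 1.0 but a recall of only 0.5, and the missing half of the recall range contributes no area:

```python
# 2 GT objects, a single correct detection, no false alarms at all
ap = average_precision(scores=[0.95], is_true_positive=[1], num_gt=2)
print(ap)  # -> 0.5: precision is 1.0 wherever it is defined, but recall stops at 0.5
```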
So I am not sure if the above cases are bugs, if they were intentionally decided for COCO, or if I am missing something.