
Observations on the calculations of COCO metrics #56

Open · RSly opened this issue Jun 20, 2017 · 14 comments

RSly commented Jun 20, 2017

Hi,

I have some observations on the COCO metrics, especially the precision metric, that I would like to share.
It would be great if someone could clarify these points :) /cc @pdollar @tylin

To calculate precision/recall, I compute the COCO average precision to get a feel for the system's results. To better explain the issue, I also calculate these metrics over all the observations pooled together (treating them as one large stitched image rather than many separate images), which I call the overall recall/precision here.

Case 1: a system with perfect detection plus one false alarm. In this case, as detailed in the next figure, the COCO average precision comes out at 1.0, completely ignoring the existence of the false alarm!

[image: Case 1 illustration]

Case 2: a system with zero false alarms. In this case there are no false alarms, so the overall precision is a perfect 1.0; however, the COCO precision comes out as 0.5! This case is very important, since it could mean that the COCO average precision penalizes systems with no false alarms and favors the detection side of a system in the evaluation. As you may know, systems with zero or few false alarms are of great importance in industrial applications.

[image: Case 2 illustration]

So I am not sure whether the above cases are bugs, are intentional design decisions for COCO, or whether I am missing something.
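A minimal sketch of the pooled "overall" precision/recall described above (all detections counted together, ignoring confidence scores); the per-case counts are illustrative assumptions, not taken from the attached figures:

```python
def overall_precision_recall(num_tp, num_fp, num_gt):
    """Pooled precision/recall over a whole dataset, ignoring confidence scores."""
    precision = num_tp / (num_tp + num_fp) if (num_tp + num_fp) > 0 else 0.0
    recall = num_tp / num_gt if num_gt > 0 else 0.0
    return precision, recall

# Case 1 (assumed counts): every ground-truth box detected, plus one false alarm.
print(overall_precision_recall(num_tp=1, num_fp=1, num_gt=1))  # (0.5, 1.0)

# Case 2 (assumed counts): no false alarms, but only half the ground-truth boxes detected.
print(overall_precision_recall(num_tp=1, num_fp=0, num_gt=2))  # (1.0, 0.5)
```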

pdollar (Collaborator) commented Jun 20, 2017

The computation you are describing is not how average precision is computed. I recommend reading section 4.2 of http://homepages.inf.ed.ac.uk/ckiw/postscript/ijcv_voc09.pdf, or you can find a number of references online. AP is the area under the precision/recall curve. Each detection has a confidence; detections are matched to ground truth in order of confidence, and for each recall value you get a precision. You then compute the area under this curve. There are a number of subtleties in this computation, but that's the overall idea. Take a look, and I hope that answers your questions. Thanks!
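A minimal sketch of that computation, using a simple all-points area under the PR curve (COCO itself averages precision over 101 fixed recall thresholds and several IoU thresholds, but the idea is the same); the inputs `scores`, `is_tp`, and `num_gt` are assumed to come from a matching step that has already been done:

```python
import numpy as np

def average_precision(scores, is_tp, num_gt):
    """Area under the precision/recall curve for a single category.

    scores : confidence of each detection
    is_tp  : 1 if the detection matched a ground-truth box, else 0
    num_gt : total number of ground-truth boxes
    """
    order = np.argsort(-np.asarray(scores, dtype=float))  # sort by descending confidence
    tp = np.asarray(is_tp, dtype=float)[order]
    cum_tp = np.cumsum(tp)
    recall = cum_tp / num_gt
    precision = cum_tp / np.arange(1, len(tp) + 1)
    # Interpolation: precision at recall r becomes the best precision at any recall >= r.
    precision = np.maximum.accumulate(precision[::-1])[::-1]
    # Integrate the step-wise precision over recall.
    prev_recall = np.concatenate(([0.0], recall[:-1]))
    return float(np.sum((recall - prev_recall) * precision))

# Case 1 from the original post, with assumed scores: one true positive at 0.9,
# one false alarm at 0.3, and a single ground-truth box.
print(average_precision(scores=[0.9, 0.3], is_tp=[1, 0], num_gt=1))  # 1.0
```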

RSly (Author) commented Jun 20, 2017

Thanks @pdollar, I will read the references in depth then :)

I am mainly surprised to get a perfect metric of 1.0 for case 1, where we clearly have a large false alarm!

=> Can we say that the metrics calculated by the COCO API (av. recall = 1, av. precision = 1) do not represent our system well in case 1, with the large false alarm?

tylin (Collaborator) commented Jun 20, 2017

@RSly, as Piotr mentioned, the detection score is needed to compute the precision/recall curve and average precision, and it is missing from your description.
Case 1 can be confusing for people who are just starting to use the AP metric. Here are some more details:
All detections are first sorted by score, and then hits and misses are computed.
From the sorted list of hits and misses, we can compute precision and recall at each detection.
The first detection gives (recall 1.0, precision 1.0) and the second gives (recall 1.0, precision 0.5).
When you plot the precision/recall curve through these two points, the area under it is 1.0.

AP is a metric that averages precision over recall.
In practice, your system will need to operate at a particular precision/recall point on the PR curve, and that choice determines the score threshold used to show detections.
For case 1, you can find a score threshold that keeps the true positive detection and drops the false alarm.
In that sense, you can have a perfect detection system for this specific case.
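To make the operating-point argument concrete, a small sketch with assumed scores (true positive at 0.9, false alarm at 0.3); the numbers are not from the attached figures:

```python
detections = [(0.9, True), (0.3, False)]  # (score, is_true_positive), assumed values
num_gt = 1

tp = fp = 0
for score, is_tp in sorted(detections, reverse=True):  # sweep thresholds high to low
    tp += int(is_tp)
    fp += int(not is_tp)
    print(f"threshold {score}: recall={tp / num_gt:.1f}, precision={tp / (tp + fp):.2f}")
# threshold 0.9: recall=1.0, precision=1.00
# threshold 0.3: recall=1.0, precision=0.50
# Any score threshold in (0.3, 0.9] keeps the true positive and drops the false alarm,
# which is the operating point where the system behaves perfectly.
```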

RSly (Author) commented Jun 21, 2017

Hi @tylin, thanks for the explanation; it is clear now.

However, before closing this issue, could you please test the attached JSON files with both the Python and Matlab APIs? coco_problem.zip
I get different results from Python and Matlab, as follows:

Python:
Average Precision (AP) @[ IoU=0.50:0.95 | area= all | maxDets=100 ] = 1.000
Average Precision (AP) @[ IoU=0.50 | area= all | maxDets=100 ] = 1.000
Average Precision (AP) @[ IoU=0.75 | area= all | maxDets=100 ] = 1.000
Average Precision (AP) @[ IoU=0.50:0.95 | area= small | maxDets=100 ] = -1.000
Average Precision (AP) @[ IoU=0.50:0.95 | area=medium | maxDets=100 ] = -1.000
Average Precision (AP) @[ IoU=0.50:0.95 | area= large | maxDets=100 ] = 1.000
Average Recall (AR) @[ IoU=0.50:0.95 | area= all | maxDets= 1 ] = 1.000
Average Recall (AR) @[ IoU=0.50:0.95 | area= all | maxDets= 10 ] = 1.000
Average Recall (AR) @[ IoU=0.50:0.95 | area= all | maxDets=100 ] = 1.000
Average Recall (AR) @[ IoU=0.50:0.95 | area= small | maxDets=100 ] = -1.000
Average Recall (AR) @[ IoU=0.50:0.95 | area=medium | maxDets=100 ] = -1.000
Average Recall (AR) @[ IoU=0.50:0.95 | area= large | maxDets=100 ] = 1.000

Matlab:
Average Precision (AP) @[ IoU=0.50:0.95 | area= all | maxDets=100 ] = 0.500
Average Precision (AP) @[ IoU=0.50 | area= all | maxDets=100 ] = 0.500
Average Precision (AP) @[ IoU=0.75 | area= all | maxDets=100 ] = 0.500
Average Precision (AP) @[ IoU=0.50:0.95 | area= small | maxDets=100 ] = NaN
Average Precision (AP) @[ IoU=0.50:0.95 | area=medium | maxDets=100 ] = NaN
Average Precision (AP) @[ IoU=0.50:0.95 | area= large | maxDets=100 ] = 0.500
Average Recall (AR) @[ IoU=0.50:0.95 | area= all | maxDets= 1 ] = 0.500
Average Recall (AR) @[ IoU=0.50:0.95 | area= all | maxDets= 10 ] = 0.500
Average Recall (AR) @[ IoU=0.50:0.95 | area= all | maxDets=100 ] = 0.500
Average Recall (AR) @[ IoU=0.50:0.95 | area= small | maxDets=100 ] = NaN
Average Recall (AR) @[ IoU=0.50:0.95 | area=medium | maxDets=100 ] = NaN
Average Recall (AR) @[ IoU=0.50:0.95 | area= large | maxDets=100 ] = 0.500

pdollar (Collaborator) commented Jun 25, 2017

Thanks @RSly for pointing this out. We'll take a look when/if we have more time. I suspect this is a divide-by-zero kind of situation (with errors then propagating to give different results elsewhere). If that's the case, it should hopefully never happen on real data (where there will be multiple errors and successes of every kind). Indeed, using real data we have extensively verified that the results of the Matlab and Python code are the same. Still, it is useful to have checks for this kind of degenerate case, so if we have more time we'll look into it. Thanks.

RSly (Author) commented Jun 26, 2017

@pdollar, thanks for the answer.
Unit tests may be of interest here.

IssamLaradji commented Aug 4, 2018

The results make sense according to the AP metric, and the outcome depends heavily on how you rank your detections. If your most confident detection is a true positive and there is only one ground-truth object, then regardless of how many false positives you make, you will get an AP of 1, because that is the area under the precision/recall curve.

botcs commented Mar 2, 2019

Hi guys,
I am implementing a unit test for COCOeval's Python API using a very simple task: I generate 2 white boxes on a single black plane and feed the annotations back in as predictions with a confidence score of 1.0.

However, I get results similar to those @RSly reported.

I have compiled my results and observations here.

In short: if I have N boxes, the precision is 1 - 1/N for recall thresholds <= 1 - 1/N, and 0 otherwise.

@pdollar, @tylin, @RSly, could you please help me figure out what the issue could be?

lewfish commented Jul 21, 2019

After struggling with an issue similar to @botcs's above, I found this comment by @bhfs9999 at the bottom of this gist, https://gist.github.com/botcs/5d13a744104ab1fa9fdd9987ea7ff97a, which seems to solve the problem.

I wrote a unit test with a single image containing a single ground-truth box and a single predicted box with perfect overlap and a score of 1.0. I expected the AP to be 1.0, but it was 0.0. After changing the id of the annotation from 0 to 1, the AP changed to 1.0.
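For reference, a minimal sketch of such a unit test against the pycocotools API; the image size, box, and category name are made up, and the key detail is that the ground-truth annotation id is 1 rather than 0:

```python
from pycocotools.coco import COCO
from pycocotools.cocoeval import COCOeval

# One image with one ground-truth box. The annotation id is 1; with an id of 0 the AP
# comes out as 0.0, apparently because the evaluator stores matches as ground-truth ids
# and treats 0 as "unmatched".
gt = {
    "images": [{"id": 1, "width": 100, "height": 100}],
    "annotations": [{
        "id": 1,
        "image_id": 1,
        "category_id": 1,
        "bbox": [10, 10, 20, 20],  # [x, y, w, h]
        "area": 400,
        "iscrowd": 0,
    }],
    "categories": [{"id": 1, "name": "object"}],
}

coco_gt = COCO()
coco_gt.dataset = gt
coco_gt.createIndex()

# A single prediction with perfect overlap and a score of 1.0.
detections = [{"image_id": 1, "category_id": 1, "bbox": [10, 10, 20, 20], "score": 1.0}]
coco_dt = coco_gt.loadRes(detections)

ev = COCOeval(coco_gt, coco_dt, iouType="bbox")
ev.evaluate()
ev.accumulate()
ev.summarize()  # expect AP = 1.0; with a ground-truth annotation id of 0 it drops to 0.0
```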

botcs commented Jul 22, 2019

@lewfish Indeed!
@qizhuli helped me debug the issue, and after that things worked as expected. I just forgot to post the solution here...

ividal commented Sep 5, 2019

@botcs Did that have such an effect on a full dataset? I can imagine it bringing down the results for a tiny debugging dataset (2 images, 1 of them unused because of the 0-index), but did it mess up a real dataset for you?

botcs commented Sep 6, 2019

@ividal Only on the order of 1e-4.

nico-zck commented Nov 23, 2019

> The first detection gives (recall 1.0, precision 1.0) and the second gives (recall 1.0, precision 0.5). [quoting @tylin's explanation above]

@tylin The first detection will never reach a recall of 1.0 unless there is only a single ground-truth object.

> In short: if I have N boxes, the precision is 1 - 1/N for recall thresholds <= 1 - 1/N, and 0 otherwise. [quoting @botcs's comment above]

@botcs Yes, I also found that if you cannot reach a recall of 1.0, the precision reported by the cocoeval code is 0 at the remaining recall thresholds. This can decrease AP dramatically if you only have a small number of region proposals (even if all of them are correct).
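A small sketch (with assumed counts) of that effect, mimicking how the accumulate step fills COCO's 101 fixed recall thresholds and leaves precision at 0 beyond the highest recall actually reached:

```python
import numpy as np

# Assume 4 ground-truth boxes and only 3 proposals, all of them correct:
# recall tops out at 0.75 even though precision is 1.0 wherever it is defined.
recall_reached = np.array([0.25, 0.50, 0.75])
precision_reached = np.array([1.0, 1.0, 1.0])

rec_thrs = np.linspace(0.0, 1.0, 101)   # COCO's fixed recall thresholds
prec_at_thrs = np.zeros_like(rec_thrs)  # stays 0 past the highest recall reached
inds = np.searchsorted(recall_reached, rec_thrs, side="left")
valid = inds < len(precision_reached)
prec_at_thrs[valid] = precision_reached[inds[valid]]

print(prec_at_thrs.mean())  # ~0.75: AP drops even though every detection is correct
```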

nico-zck commented

In the VOC dataset paper, I found a passage confirming that AP does penalize methods with few false alarms (i.e. methods that retrieve only a subset of examples with high precision):

The intention in interpolating the precision/recall curve in this way is to reduce the impact of the “wiggles” in the precision/recall curve, caused by small variations in the ranking of examples. It should be noted that to obtain a high score, a method must have precision at all levels of recall—this penalises methods which retrieve only a subset of examples with high precision (e.g. side views of cars).

[1] Everingham, Mark, Luc Van Gool, Christopher K. I. Williams, John Winn, and Andrew Zisserman. “The Pascal Visual Object Classes (VOC) Challenge.” International Journal of Computer Vision 88, no. 2 (June 1, 2010): 303–38. https://doi.org/10.1007/s11263-009-0275-4.
