Evaluation giving inconsistent results #739

Closed

ashnair1 opened this issue Jan 22, 2020 · 4 comments

Comments


ashnair1 commented Jan 22, 2020

While writing a custom model (to train on MS COCO), I had set TEST.EVAL_PERIOD to 5000 so that I could evaluate the model frequently and see how it was doing. The last logs show that I got 37.2 box AP and 33.8 mask AP. But when I run the evaluation command, it gives me different results. So my question is: why do the evaluation done during training (at the last iteration) and the evaluation done after training give drastically different results?
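For context, DefaultTrainer builds an EvalHook from cfg.TEST.EVAL_PERIOD, so periodic evaluation needs no extra code; the equivalent Python setup is just this (a sketch using my config path):

from detectron2.config import get_cfg

cfg = get_cfg()
cfg.merge_from_file("projects/custom/configs/Custom_R_50_FPN_1x.yaml")  # my config
cfg.TEST.EVAL_PERIOD = 5000  # run evaluation every 5000 training iterations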

Tensorboard training logs: [boxAP and maskAP curves]

Instructions To Reproduce the Issue:

  1. What changes you made (git diff) or what code you wrote:
     No change to base detectron2 code.
  2. What exact command you run:
python projects/custom/train_net.py \
              --num-gpus 8 \
              --config-file projects/custom/configs/Custom_R_50_FPN_1x.yaml \
              --eval-only MODEL.WEIGHTS projects/custom/output/model_final.pth
  3. What you observed (including the full logs):

Logs observed by running the above command:

[01/22 00:42:43 fvcore.common.checkpoint]: Loading checkpoint from projects/custom/output/model_final.pth
[01/22 00:42:43 d2.data.datasets.coco]: Loaded 5000 images in COCO format from datasets/coco/annotations/instances_val2017.json
[01/22 00:42:44 d2.data.build]: Distribution of instances among all 80 categories:
|   category    | #instances   |   category   | #instances   |   category    | #instances   |
|:-------------:|:-------------|:------------:|:-------------|:-------------:|:-------------|
|    person     | 10777        |   bicycle    | 314          |      car      | 1918         |
|  motorcycle   | 367          |   airplane   | 143          |      bus      | 283          |
|     train     | 190          |    truck     | 414          |     boat      | 424          |
| traffic light | 634          | fire hydrant | 101          |   stop sign   | 75           |
| parking meter | 60           |    bench     | 411          |     bird      | 427          |
|      cat      | 202          |     dog      | 218          |     horse     | 272          |
|     sheep     | 354          |     cow      | 372          |   elephant    | 252          |
|     bear      | 71           |    zebra     | 266          |    giraffe    | 232          |
|   backpack    | 371          |   umbrella   | 407          |    handbag    | 540          |
|      tie      | 252          |   suitcase   | 299          |    frisbee    | 115          |
|     skis      | 241          |  snowboard   | 69           |  sports ball  | 260          |
|     kite      | 327          | baseball bat | 145          | baseball gl.. | 148          |
|  skateboard   | 179          |  surfboard   | 267          | tennis racket | 225          |
|    bottle     | 1013         |  wine glass  | 341          |      cup      | 895          |
|     fork      | 215          |    knife     | 325          |     spoon     | 253          |
|     bowl      | 623          |    banana    | 370          |     apple     | 236          |
|   sandwich    | 177          |    orange    | 285          |   broccoli    | 312          |
|    carrot     | 365          |   hot dog    | 125          |     pizza     | 284          |
|     donut     | 328          |     cake     | 310          |     chair     | 1771         |
|     couch     | 261          | potted plant | 342          |      bed      | 163          |
| dining table  | 695          |    toilet    | 179          |      tv       | 288          |
|    laptop     | 231          |    mouse     | 106          |    remote     | 283          |
|   keyboard    | 153          |  cell phone  | 262          |   microwave   | 55           |
|     oven      | 143          |   toaster    | 9            |     sink      | 225          |
| refrigerator  | 126          |     book     | 1129         |     clock     | 267          |
|     vase      | 274          |   scissors   | 36           |  teddy bear   | 190          |
|  hair drier   | 11           |  toothbrush  | 57           |               |              |
|     total     | 36335        |              |              |               |              |
[01/22 00:42:44 d2.evaluation.evaluator]: Start inference on 625 images
[01/22 00:42:58 d2.evaluation.evaluator]: Inference done 11/625. 0.0667 s / img. ETA=0:00:49
[01/22 00:43:03 d2.evaluation.evaluator]: Inference done 74/625. 0.0667 s / img. ETA=0:00:44
[01/22 00:43:09 d2.evaluation.evaluator]: Inference done 133/625. 0.0667 s / img. ETA=0:00:40
[01/22 00:43:14 d2.evaluation.evaluator]: Inference done 191/625. 0.0671 s / img. ETA=0:00:36
[01/22 00:43:19 d2.evaluation.evaluator]: Inference done 252/625. 0.0670 s / img. ETA=0:00:31
[01/22 00:43:24 d2.evaluation.evaluator]: Inference done 312/625. 0.0669 s / img. ETA=0:00:26
[01/22 00:43:29 d2.evaluation.evaluator]: Inference done 366/625. 0.0681 s / img. ETA=0:00:22
[01/22 00:43:34 d2.evaluation.evaluator]: Inference done 428/625. 0.0680 s / img. ETA=0:00:16
[01/22 00:43:39 d2.evaluation.evaluator]: Inference done 488/625. 0.0680 s / img. ETA=0:00:11
[01/22 00:43:44 d2.evaluation.evaluator]: Inference done 546/625. 0.0680 s / img. ETA=0:00:06
[01/22 00:43:49 d2.evaluation.evaluator]: Inference done 605/625. 0.0680 s / img. ETA=0:00:01
[01/22 00:43:51 d2.evaluation.evaluator]: Total inference time: 0:00:53.291790 (0.085954 s / img per device, on 8 devices)
[01/22 00:43:51 d2.evaluation.evaluator]: Total inference pure compute time: 0:00:42 (0.067980 s / img per device, on 8 devices)
[01/22 00:43:55 d2.evaluation.coco_evaluation]: Preparing results for COCO format ...
[01/22 00:43:55 d2.evaluation.coco_evaluation]: Saving results to projects/custom/output/inference/coco_instances_results.json
[01/22 00:43:56 d2.evaluation.coco_evaluation]: Evaluating predictions ...
Loading and preparing results...
DONE (t=0.10s)
creating index...
index created!
Running per image evaluation...
Evaluate annotation type *bbox*
DONE (t=24.25s).
Accumulating evaluation results...
DONE (t=3.45s).
 Average Precision  (AP) @[ IoU=0.50:0.95 | area=   all | maxDets=100 ] = 0.080
 Average Precision  (AP) @[ IoU=0.50      | area=   all | maxDets=100 ] = 0.141
 Average Precision  (AP) @[ IoU=0.75      | area=   all | maxDets=100 ] = 0.080
 Average Precision  (AP) @[ IoU=0.50:0.95 | area= small | maxDets=100 ] = 0.156
 Average Precision  (AP) @[ IoU=0.50:0.95 | area=medium | maxDets=100 ] = 0.062
 Average Precision  (AP) @[ IoU=0.50:0.95 | area= large | maxDets=100 ] = 0.000
 Average Recall     (AR) @[ IoU=0.50:0.95 | area=   all | maxDets=  1 ] = 0.070
 Average Recall     (AR) @[ IoU=0.50:0.95 | area=   all | maxDets= 10 ] = 0.124
 Average Recall     (AR) @[ IoU=0.50:0.95 | area=   all | maxDets=100 ] = 0.133
 Average Recall     (AR) @[ IoU=0.50:0.95 | area= small | maxDets=100 ] = 0.249
 Average Recall     (AR) @[ IoU=0.50:0.95 | area=medium | maxDets=100 ] = 0.096
 Average Recall     (AR) @[ IoU=0.50:0.95 | area= large | maxDets=100 ] = 0.002
[01/22 00:44:24 d2.evaluation.coco_evaluation]: Evaluation results for bbox: 
|  AP   |  AP50  |  AP75  |  APs   |  APm  |  APl  |
|:-----:|:------:|:------:|:------:|:-----:|:-----:|
| 7.961 | 14.139 | 7.981  | 15.604 | 6.206 | 0.015 |
[01/22 00:44:24 d2.evaluation.coco_evaluation]: Per-category bbox AP: 
| category      | AP     | category     | AP     | category       | AP     |
|:--------------|:-------|:-------------|:-------|:---------------|:-------|
| person        | 14.309 | bicycle      | 3.587  | car            | 21.439 |
| motorcycle    | 3.300  | airplane     | 6.198  | bus            | 3.703  |
| train         | 1.097  | truck        | 5.772  | boat           | 7.196  |
| traffic light | 20.432 | fire hydrant | 12.571 | stop sign      | 13.182 |
| parking meter | 4.479  | bench        | 1.767  | bird           | 12.307 |
| cat           | 1.041  | dog          | 6.073  | horse          | 5.844  |
| sheep         | 10.237 | cow          | 21.495 | elephant       | 6.467  |
| bear          | 5.602  | zebra        | 9.040  | giraffe        | 2.224  |
| backpack      | 6.466  | umbrella     | 2.889  | handbag        | 2.727  |
| tie           | 11.791 | suitcase     | 6.102  | frisbee        | 31.111 |
| skis          | 3.124  | snowboard    | 4.623  | sports ball    | 40.541 |
| kite          | 25.432 | baseball bat | 4.418  | baseball glove | 21.957 |
| skateboard    | 6.203  | surfboard    | 5.146  | tennis racket  | 9.987  |
| bottle        | 18.071 | wine glass   | 11.447 | cup            | 16.309 |
| fork          | 2.426  | knife        | 4.041  | spoon          | 2.775  |
| bowl          | 6.518  | banana       | 2.247  | apple          | 1.902  |
| sandwich      | 0.121  | orange       | 6.569  | broccoli       | 2.491  |
| carrot        | 3.985  | hot dog      | 2.116  | pizza          | 3.158  |
| donut         | 12.144 | cake         | 8.363  | chair          | 2.758  |
| couch         | 0.001  | potted plant | 3.378  | bed            | 0.002  |
| dining table  | 0.121  | toilet       | 1.427  | tv             | 2.630  |
| laptop        | 2.669  | mouse        | 28.368 | remote         | 13.239 |
| keyboard      | 2.016  | cell phone   | 11.322 | microwave      | 7.797  |
| oven          | 0.154  | toaster      | 19.824 | sink           | 4.122  |
| refrigerator  | 0.000  | book         | 6.435  | clock          | 23.467 |
| vase          | 12.429 | scissors     | 2.079  | teddy bear     | 2.443  |
| hair drier    | 0.000  | toothbrush   | 3.674  |                |        |
Loading and preparing results...
DONE (t=1.17s)
creating index...
index created!
Running per image evaluation...
Evaluate annotation type *segm*
DONE (t=26.07s).
Accumulating evaluation results...
DONE (t=3.47s).
 Average Precision  (AP) @[ IoU=0.50:0.95 | area=   all | maxDets=100 ] = 0.071
 Average Precision  (AP) @[ IoU=0.50      | area=   all | maxDets=100 ] = 0.131
 Average Precision  (AP) @[ IoU=0.75      | area=   all | maxDets=100 ] = 0.071
 Average Precision  (AP) @[ IoU=0.50:0.95 | area= small | maxDets=100 ] = 0.131
 Average Precision  (AP) @[ IoU=0.50:0.95 | area=medium | maxDets=100 ] = 0.066
 Average Precision  (AP) @[ IoU=0.50:0.95 | area= large | maxDets=100 ] = 0.000
 Average Recall     (AR) @[ IoU=0.50:0.95 | area=   all | maxDets=  1 ] = 0.067
 Average Recall     (AR) @[ IoU=0.50:0.95 | area=   all | maxDets= 10 ] = 0.116
 Average Recall     (AR) @[ IoU=0.50:0.95 | area=   all | maxDets=100 ] = 0.124
 Average Recall     (AR) @[ IoU=0.50:0.95 | area= small | maxDets=100 ] = 0.233
 Average Recall     (AR) @[ IoU=0.50:0.95 | area=medium | maxDets=100 ] = 0.092
 Average Recall     (AR) @[ IoU=0.50:0.95 | area= large | maxDets=100 ] = 0.000
[01/22 00:44:58 d2.evaluation.coco_evaluation]: Evaluation results for segm: 
|  AP   |  AP50  |  AP75  |  APs   |  APm  |  APl  |
|:-----:|:------:|:------:|:------:|:-----:|:-----:|
| 7.147 | 13.081 | 7.054  | 13.103 | 6.640 | 0.000 |
[01/22 00:44:58 d2.evaluation.coco_evaluation]: Per-category segm AP: 
| category      | AP     | category     | AP     | category       | AP     |
|:--------------|:-------|:-------------|:-------|:---------------|:-------|
| person        | 10.871 | bicycle      | 2.107  | car            | 18.899 |
| motorcycle    | 2.472  | airplane     | 3.996  | bus            | 3.871  |
| train         | 0.999  | truck        | 4.816  | boat           | 7.160  |
| traffic light | 19.360 | fire hydrant | 11.865 | stop sign      | 11.618 |
| parking meter | 4.520  | bench        | 1.253  | bird           | 8.533  |
| cat           | 0.857  | dog          | 3.602  | horse          | 3.091  |
| sheep         | 8.017  | cow          | 17.685 | elephant       | 4.735  |
| bear          | 4.949  | zebra        | 7.591  | giraffe        | 0.910  |
| backpack      | 6.581  | umbrella     | 4.155  | handbag        | 4.195  |
| tie           | 12.225 | suitcase     | 7.172  | frisbee        | 31.182 |
| skis          | 0.286  | snowboard    | 1.189  | sports ball    | 39.017 |
| kite          | 18.157 | baseball bat | 3.880  | baseball glove | 22.217 |
| skateboard    | 3.620  | surfboard    | 3.660  | tennis racket  | 10.612 |
| bottle        | 17.147 | wine glass   | 8.445  | cup            | 16.217 |
| fork          | 1.197  | knife        | 1.418  | spoon          | 1.648  |
| bowl          | 6.062  | banana       | 1.923  | apple          | 2.176  |
| sandwich      | 0.094  | orange       | 6.710  | broccoli       | 2.283  |
| carrot        | 3.752  | hot dog      | 0.883  | pizza          | 2.911  |
| donut         | 10.769 | cake         | 8.378  | chair          | 2.281  |
| couch         | 0.000  | potted plant | 2.827  | bed            | 0.000  |
| dining table  | 0.115  | toilet       | 1.603  | tv             | 2.694  |
| laptop        | 2.797  | mouse        | 28.893 | remote         | 12.775 |
| keyboard      | 2.307  | cell phone   | 11.048 | microwave      | 7.792  |
| oven          | 0.168  | toaster      | 23.273 | sink           | 2.449  |
| refrigerator  | 0.000  | book         | 4.151  | clock          | 23.179 |
| vase          | 11.354 | scissors     | 2.376  | teddy bear     | 2.329  |
| hair drier    | 0.000  | toothbrush   | 3.394  |                |        |
[01/22 00:44:58 d2.engine.defaults]: Evaluation results for coco_2017_val in csv format:
[01/22 00:44:58 d2.evaluation.testing]: copypaste: Task: bbox
[01/22 00:44:58 d2.evaluation.testing]: copypaste: AP,AP50,AP75,APs,APm,APl
[01/22 00:44:58 d2.evaluation.testing]: copypaste: 7.9615,14.1390,7.9813,15.6043,6.2056,0.0146
[01/22 00:44:58 d2.evaluation.testing]: copypaste: Task: segm
[01/22 00:44:58 d2.evaluation.testing]: copypaste: AP,AP50,AP75,APs,APm,APl
[01/22 00:44:58 d2.evaluation.testing]: copypaste: 7.1468,13.0809,7.0539,13.1030,6.6402,0.0004

  4. Please also simplify the steps as much as possible so they do not require additional resources to run, such as a private dataset.

Expected behavior:

I was expecting the same performance I got during the last evaluation done before training ended, i.e. 37.2 box AP and 33.8 mask AP.

Following are the logs for the last evaluation:

[01/21 19:13:52 fvcore.common.checkpoint]: Saving checkpoint to projects/custom/output/model_0089999.pth
[01/21 19:13:53 fvcore.common.checkpoint]: Saving checkpoint to projects/custom/output/model_final.pth
[01/21 19:13:54 d2.data.datasets.coco]: Loaded 5000 images in COCO format from datasets/coco/annotations/instances_val2017.json
[01/21 19:13:55 d2.evaluation.evaluator]: Start inference on 625 images
[01/21 19:14:15 d2.evaluation.evaluator]: Inference done 11/625. 0.0697 s / img. ETA=0:00:56
[01/21 19:14:20 d2.evaluation.evaluator]: Inference done 65/625. 0.0705 s / img. ETA=0:00:52
[01/21 19:14:25 d2.evaluation.evaluator]: Inference done 119/625. 0.0700 s / img. ETA=0:00:47
[01/21 19:14:30 d2.evaluation.evaluator]: Inference done 170/625. 0.0706 s / img. ETA=0:00:43
[01/21 19:14:35 d2.evaluation.evaluator]: Inference done 225/625. 0.0704 s / img. ETA=0:00:37
[01/21 19:14:40 d2.evaluation.evaluator]: Inference done 278/625. 0.0704 s / img. ETA=0:00:32
[01/21 19:14:45 d2.evaluation.evaluator]: Inference done 329/625. 0.0708 s / img. ETA=0:00:28
[01/21 19:14:50 d2.evaluation.evaluator]: Inference done 382/625. 0.0708 s / img. ETA=0:00:23
[01/21 19:14:55 d2.evaluation.evaluator]: Inference done 439/625. 0.0706 s / img. ETA=0:00:17
[01/21 19:15:01 d2.evaluation.evaluator]: Inference done 492/625. 0.0708 s / img. ETA=0:00:12
[01/21 19:15:06 d2.evaluation.evaluator]: Inference done 545/625. 0.0707 s / img. ETA=0:00:07
[01/21 19:15:11 d2.evaluation.evaluator]: Inference done 598/625. 0.0707 s / img. ETA=0:00:02
[01/21 19:15:13 d2.evaluation.evaluator]: Total inference time: 0:00:58.796359 (0.094833 s / img per device, on 8 devices)
[01/21 19:15:13 d2.evaluation.evaluator]: Total inference pure compute time: 0:00:43 (0.070569 s / img per device, on 8 devices)
[01/21 19:15:16 d2.evaluation.coco_evaluation]: Preparing results for COCO format ...
[01/21 19:15:16 d2.evaluation.coco_evaluation]: Saving results to projects/custom/output/inference/coco_instances_results.json
[01/21 19:15:18 d2.evaluation.coco_evaluation]: Evaluating predictions ...
Loading and preparing results... 
DONE (t=0.16s)
creating index...
index created!
Running per image evaluation...  
Evaluate annotation type *bbox*  
DONE (t=42.89s).
Accumulating evaluation results...
DONE (t=5.15s).
 Average Precision  (AP) @[ IoU=0.50:0.95 | area=   all | maxDets=100 ] = 0.372
 Average Precision  (AP) @[ IoU=0.50      | area=   all | maxDets=100 ] = 0.580
 Average Precision  (AP) @[ IoU=0.75      | area=   all | maxDets=100 ] = 0.404
 Average Precision  (AP) @[ IoU=0.50:0.95 | area= small | maxDets=100 ] = 0.229
 Average Precision  (AP) @[ IoU=0.50:0.95 | area=medium | maxDets=100 ] = 0.408
 Average Precision  (AP) @[ IoU=0.50:0.95 | area= large | maxDets=100 ] = 0.461
 Average Recall     (AR) @[ IoU=0.50:0.95 | area=   all | maxDets=  1 ] = 0.309
 Average Recall     (AR) @[ IoU=0.50:0.95 | area=   all | maxDets= 10 ] = 0.493
 Average Recall     (AR) @[ IoU=0.50:0.95 | area=   all | maxDets=100 ] = 0.518
 Average Recall     (AR) @[ IoU=0.50:0.95 | area= small | maxDets=100 ] = 0.336
 Average Recall     (AR) @[ IoU=0.50:0.95 | area=medium | maxDets=100 ] = 0.552
 Average Recall     (AR) @[ IoU=0.50:0.95 | area= large | maxDets=100 ] = 0.639
[01/21 19:16:06 d2.evaluation.coco_evaluation]: Evaluation results for bbox:
|   AP   |  AP50  |  AP75  |  APs   |  APm   |  APl   |
|:------:|:------:|:------:|:------:|:------:|:------:|
| 37.234 | 58.004 | 40.447 | 22.884 | 40.763 | 46.111 |
[01/21 19:16:06 d2.evaluation.coco_evaluation]: Per-category bbox AP:
| category      | AP     | category     | AP     | category       | AP     |
|:--------------|:-------|:-------------|:-------|:---------------|:-------|
| person        | 52.741 | bicycle      | 27.515 | car            | 42.478 |
| motorcycle    | 39.294 | airplane     | 58.438 | bus            | 59.407 |
| train         | 56.701 | truck        | 29.680 | boat           | 23.933 |
| traffic light | 27.697 | fire hydrant | 63.138 | stop sign      | 61.819 |
| parking meter | 40.890 | bench        | 20.358 | bird           | 34.900 |
| cat           | 60.047 | dog          | 53.630 | horse          | 52.275 |
| sheep         | 47.798 | cow          | 54.097 | elephant       | 58.479 |
| bear          | 65.465 | zebra        | 64.125 | giraffe        | 61.621 |
| backpack      | 12.417 | umbrella     | 33.839 | handbag        | 9.689  |
| tie           | 28.073 | suitcase     | 32.586 | frisbee        | 63.571 |
| skis          | 19.549 | snowboard    | 28.105 | sports ball    | 48.491 |
| kite          | 39.304 | baseball bat | 24.536 | baseball glove | 35.756 |
| skateboard    | 46.410 | surfboard    | 32.171 | tennis racket  | 42.366 |
| bottle        | 36.515 | wine glass   | 32.604 | cup            | 38.606 |
| fork          | 25.010 | knife        | 13.497 | spoon          | 11.514 |
| bowl          | 37.432 | banana       | 21.316 | apple          | 17.472 |
| sandwich      | 29.125 | orange       | 28.825 | broccoli       | 21.318 |
| carrot        | 20.616 | hot dog      | 27.328 | pizza          | 46.914 |
| donut         | 42.979 | cake         | 31.300 | chair          | 23.899 |
| couch         | 36.475 | potted plant | 22.777 | bed            | 37.543 |
| dining table  | 23.097 | toilet       | 53.759 | tv             | 51.995 |
| laptop        | 53.548 | mouse        | 62.400 | remote         | 28.297 |
| keyboard      | 46.505 | cell phone   | 31.261 | microwave      | 49.518 |
| oven          | 27.909 | toaster      | 39.315 | sink           | 33.194 |
| refrigerator  | 46.601 | book         | 13.203 | clock          | 48.373 |
| vase          | 35.898 | scissors     | 21.912 | teddy bear     | 42.448 |
| hair drier    | 0.000  | toothbrush   | 13.033 |                |        |
Loading and preparing results... 
DONE (t=2.35s)
creating index...
index created!
Running per image evaluation...  
Evaluate annotation type *segm*  
DONE (t=43.31s).
Accumulating evaluation results...
DONE (t=5.14s).
 Average Precision  (AP) @[ IoU=0.50:0.95 | area=   all | maxDets=100 ] = 0.338
 Average Precision  (AP) @[ IoU=0.50      | area=   all | maxDets=100 ] = 0.549
 Average Precision  (AP) @[ IoU=0.75      | area=   all | maxDets=100 ] = 0.360
 Average Precision  (AP) @[ IoU=0.50:0.95 | area= small | maxDets=100 ] = 0.170
 Average Precision  (AP) @[ IoU=0.50:0.95 | area=medium | maxDets=100 ] = 0.367
 Average Precision  (AP) @[ IoU=0.50:0.95 | area= large | maxDets=100 ] = 0.470
 Average Recall     (AR) @[ IoU=0.50:0.95 | area=   all | maxDets=  1 ] = 0.290
 Average Recall     (AR) @[ IoU=0.50:0.95 | area=   all | maxDets= 10 ] = 0.451
 Average Recall     (AR) @[ IoU=0.50:0.95 | area=   all | maxDets=100 ] = 0.472
 Average Recall     (AR) @[ IoU=0.50:0.95 | area= small | maxDets=100 ] = 0.290
 Average Recall     (AR) @[ IoU=0.50:0.95 | area=medium | maxDets=100 ] = 0.507
 Average Recall     (AR) @[ IoU=0.50:0.95 | area= large | maxDets=100 ] = 0.604
[01/21 19:17:02 d2.evaluation.coco_evaluation]: Evaluation results for segm:
|   AP   |  AP50  |  AP75  |  APs   |  APm   |  APl   |
|:------:|:------:|:------:|:------:|:------:|:------:|
| 33.841 | 54.931 | 35.984 | 17.005 | 36.744 | 46.965 |
[01/21 19:17:02 d2.evaluation.coco_evaluation]: Per-category segm AP:
| category      | AP     | category     | AP     | category       | AP     |
|:--------------|:-------|:-------------|:-------|:---------------|:-------|
| person        | 44.800 | bicycle      | 15.811 | car            | 38.547 |
| motorcycle    | 29.415 | airplane     | 46.202 | bus            | 59.055 |
| train         | 56.181 | truck        | 29.455 | boat           | 20.163 |
| traffic light | 26.419 | fire hydrant | 59.199 | stop sign      | 62.297 |
| parking meter | 43.005 | bench        | 14.513 | bird           | 28.259 |
| cat           | 63.867 | dog          | 52.602 | horse          | 37.123 |
| sheep         | 40.717 | cow          | 46.050 | elephant       | 52.428 |
| bear          | 63.037 | zebra        | 54.434 | giraffe        | 45.558 |
| backpack      | 12.808 | umbrella     | 41.471 | handbag        | 11.502 |
| tie           | 27.327 | suitcase     | 35.124 | frisbee        | 62.476 |
| skis          | 1.410  | snowboard    | 15.310 | sports ball    | 47.174 |
| kite          | 29.109 | baseball bat | 19.426 | baseball glove | 38.633 |
| skateboard    | 26.263 | surfboard    | 26.510 | tennis racket  | 50.212 |
| bottle        | 34.975 | wine glass   | 28.526 | cup            | 38.941 |
| fork          | 9.778  | knife        | 8.366  | spoon          | 7.658  |
| bowl          | 35.849 | banana       | 16.423 | apple          | 17.350 |
| sandwich      | 29.737 | orange       | 29.015 | broccoli       | 20.243 |
| carrot        | 17.667 | hot dog      | 17.076 | pizza          | 46.926 |
| donut         | 44.344 | cake         | 31.410 | chair          | 15.707 |
| couch         | 31.305 | potted plant | 19.978 | bed            | 30.443 |
| dining table  | 12.300 | toilet       | 55.483 | tv             | 54.017 |
| laptop        | 54.179 | mouse        | 62.458 | remote         | 25.723 |
| keyboard      | 46.676 | cell phone   | 31.065 | microwave      | 50.325 |
| oven          | 26.529 | toaster      | 40.957 | sink           | 31.904 |
| refrigerator  | 48.174 | book         | 8.191  | clock          | 49.518 |
| vase          | 35.107 | scissors     | 15.465 | teddy bear     | 40.380 |
| hair drier    | 4.257  | toothbrush   | 8.944  |                |        |
[01/21 19:17:03 d2.engine.defaults]: Evaluation results for coco_2017_val in csv format:
[01/21 19:17:03 d2.evaluation.testing]: copypaste: Task: bbox
[01/21 19:17:03 d2.evaluation.testing]: copypaste: AP,AP50,AP75,APs,APm,APl
[01/21 19:17:03 d2.evaluation.testing]: copypaste: 37.2340,58.0043,40.4466,22.8836,40.7632,46.1108
[01/21 19:17:03 d2.evaluation.testing]: copypaste: Task: segm
[01/21 19:17:03 d2.evaluation.testing]: copypaste: AP,AP50,AP75,APs,APm,APl
[01/21 19:17:03 d2.evaluation.testing]: copypaste: 33.8407,54.9315,35.9840,17.0054,36.7437,46.9652
[01/21 19:17:03 d2.utils.events]: eta: 0:00:00  iter: 89999  total_loss: 0.766  loss_cls: 0.175  loss_box_reg: 0.223  loss_mask: 0.252  loss_rpn_cls: 0.031  loss_rpn_loc: 0.067  time: 0.3681  data_time: 0.0130  lr: 0.000200  max_mem: 5161M
[01/21 19:17:04 d2.engine.hooks]: Overall training speed: 89997 iterations in 9:12:07 (0.3681 s / it)
[01/21 19:17:04 d2.engine.hooks]: Total training time: 10:24:51 (1:12:43 on hooks)

Environment:

Please paste the output of python -m detectron2.utils.collect_env.
If detectron2 hasn't been successfully installed, use python detectron2/utils/collect_env.py.

------------------------  -------------------------------------------------------------------------------------------------------
sys.platform              linux
Python                    3.6.9 |Anaconda, Inc.| (default, Jul 30 2019, 19:07:31) [GCC 7.3.0]
numpy                     1.17.4
detectron2                0.1 @/home/an1/detectron2/detectron2
detectron2 compiler       GCC 6.5
detectron2 CUDA compiler  10.0
detectron2 arch flags     sm_70
DETECTRON2_ENV_MODULE     <not set>
PyTorch                   1.3.1 @/home/an1/miniconda3/envs/d2o/lib/python3.6/site-packages/torch
PyTorch debug build       False
CUDA available            True
GPU 0,1,2,3,4,5,6,7       Tesla V100-SXM2-32GB
CUDA_HOME                 /home/pub/spack/opt/spack/linux-centos7-x86_64/gcc-6.5.0/cuda-10.0.130-6rlvsy3qmu3txrldikm473betkqqespd
NVCC                      Cuda compilation tools, release 10.0, V10.0.130
Pillow                    6.2.2
torchvision               0.4.2 @/home/an1/miniconda3/envs/d2o/lib/python3.6/site-packages/torchvision
torchvision arch flags    sm_35, sm_50, sm_60, sm_70
cv2                       3.4.2
------------------------  -------------------------------------------------------------------------------------------------------
PyTorch built with:
  - GCC 7.3
  - Intel(R) Math Kernel Library Version 2019.0.4 Product Build 20190411 for Intel(R) 64 architecture applications
  - Intel(R) MKL-DNN v0.20.5 (Git Hash 0125f28c61c1f822fd48570b4c1066f96fcb9b2e)
  - OpenMP 201511 (a.k.a. OpenMP 4.5)
  - NNPACK is enabled
  - CUDA Runtime 9.2
  - NVCC architecture flags: -gencode;arch=compute_37,code=sm_37;-gencode;arch=compute_50,code=sm_50;-gencode;arch=compute_60,code=sm_60;-gencode;arch=compute_61,code=sm_61;-gencode;arch=compute_70,code=sm_70;-gencode;arch=compute_37,code=compute_37
  - CuDNN 7.6.3
  - Magma 2.5.1
  - Build settings: BLAS=MKL, BUILD_NAMEDTENSOR=OFF, BUILD_TYPE=Release, CXX_FLAGS= -Wno-deprecated -fvisibility-inlines-hidden -fopenmp -DUSE_FBGEMM -DUSE_QNNPACK -DUSE_PYTORCH_QNNPACK -O2 -fPIC -Wno-narrowing -Wall -Wextra -Wno-missing-field-initializers -Wno-type-limits -Wno-array-bounds -Wno-unknown-pragmas -Wno-sign-compare -Wno-unused-parameter -Wno-unused-variable -Wno-unused-function -Wno-unused-result -Wno-strict-overflow -Wno-strict-aliasing -Wno-error=deprecated-declarations -Wno-stringop-overflow -Wno-error=pedantic -Wno-error=redundant-decls -Wno-error=old-style-cast -fdiagnostics-color=always -faligned-new -Wno-unused-but-set-variable -Wno-maybe-uninitialized -fno-math-errno -fno-trapping-math -Wno-stringop-overflow, DISABLE_NUMA=1, PERF_WITH_AVX=1, PERF_WITH_AVX2=1, PERF_WITH_AVX512=1, USE_CUDA=True, USE_EXCEPTION_PTR=1, USE_GFLAGS=OFF, USE_GLOG=OFF, USE_MKL=ON, USE_MKLDNN=ON, USE_MPI=OFF, USE_NCCL=ON, USE_NNPACK=ON, USE_OPENMP=ON, USE_STATIC_DISPATCH=OFF, 

@ppwwyyxx (Contributor) commented:

It doesn't look like this happens for built-in models, and we can't help you since we don't know what you did.


ashnair1 commented Jan 22, 2020

@ppwwyyxx

The model I'm using is identical to Mask R-CNN except for the FPN backbone, where I've implemented path aggregation as in the PANet paper.

I don't understand how changing the FPN could cause the evaluations in the two stages to differ. Isn't there only one way to do evaluation?

Here's a comparison of their configs if it helps

Mask R-CNN

_BASE_: "../Base-RCNN-FPN.yaml"
MODEL:
  WEIGHTS: "detectron2://ImageNetPretrained/MSRA/R-50.pkl"
  MASK_ON: True
  RESNETS:
    DEPTH: 50

Mine

_BASE_: "Base-RCNN-FPN.yaml"
MODEL:
  BACKBONE:
    NAME: "build_resnet_pafpn_backbone"
  WEIGHTS: "detectron2://ImageNetPretrained/MSRA/R-50.pkl"
  MASK_ON: True
  RESNETS:
    DEPTH: 50
TEST:
  EVAL_PERIOD: 5000
OUTPUT_DIR: "projects/custom/output"

Apart from the FPN, the output directory, and the eval period, they're identical.
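For reference, that BACKBONE.NAME resolves through detectron2's BACKBONE_REGISTRY; the registration is roughly this (a sketch with illustrative names, since my actual PAFPN class is too long to paste):

from detectron2.modeling import BACKBONE_REGISTRY
from detectron2.modeling.backbone import FPN, build_resnet_backbone
from detectron2.modeling.backbone.fpn import LastLevelMaxPool

class PAFPN(FPN):
    # Stand-in for my real module, which adds a PANet-style bottom-up
    # aggregation path on top of the standard FPN; details omitted here.
    pass

@BACKBONE_REGISTRY.register()
def build_resnet_pafpn_backbone(cfg, input_shape):
    # Standard ResNet trunk, wrapped in the path-aggregation FPN.
    bottom_up = build_resnet_backbone(cfg, input_shape)
    return PAFPN(
        bottom_up=bottom_up,
        in_features=cfg.MODEL.FPN.IN_FEATURES,
        out_channels=cfg.MODEL.FPN.OUT_CHANNELS,
        norm=cfg.MODEL.FPN.NORM,
        top_block=LastLevelMaxPool(),
        fuse_type=cfg.MODEL.FPN.FUSE_TYPE,
    )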


ashnair1 commented Jan 23, 2020

If the evaluation done during training is correct and the evaluation done after saving the checkpoint is the problem, then the issue presumably lies in either the saving or the loading process. I trained a vanilla Mask R-CNN from scratch and did not observe the problem, so the cause does lie in the FPN modification I made.

I think what's happening is that my FPN modification (essentially an additional bottom-up branch) is preventing the FPN weights from being stored in, or loaded from, the checkpoint properly, which would explain the huge difference in accuracy between the two evaluations. How this is happening, I have no idea; adding a couple of conv layers ideally shouldn't cause an issue while saving/loading.
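A quick way to test that hypothesis is to diff the keys the model expects against the keys actually stored in the checkpoint (a sketch using my local paths; it assumes the custom backbone is imported and registered before build_model is called):

import torch
from detectron2.config import get_cfg
from detectron2.modeling import build_model

cfg = get_cfg()
cfg.merge_from_file("projects/custom/configs/Custom_R_50_FPN_1x.yaml")
# build_resnet_pafpn_backbone must already be registered at this point.
model = build_model(cfg)

ckpt = torch.load("projects/custom/output/model_final.pth", map_location="cpu")
missing = set(model.state_dict()) - set(ckpt["model"])     # expected but not saved
unexpected = set(ckpt["model"]) - set(model.state_dict())  # saved but not expected
print("missing from checkpoint:", sorted(missing))
print("unexpected in checkpoint:", sorted(unexpected))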


ashnair1 commented Jan 26, 2020

The problem was that I had not explicitly registered a list of convolution ops as submodules via the add_module method. As a result, the model did not save those weights properly. During training, the modules still exist in memory and the latest weights are used in the forward pass, so the evaluations run during training were 'correct'. But when it came time to save a checkpoint, the parts of my FPN that hadn't been registered via add_module didn't have their weights saved, so evaluation after loading the checkpoint didn't reproduce the same results.
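For anyone who runs into this later, here's a minimal illustration of the pitfall (class names are made up):

import torch.nn as nn

class BrokenBranch(nn.Module):
    def __init__(self):
        super().__init__()
        # Bug: a plain Python list is invisible to nn.Module, so these convs
        # are never registered -- they don't appear in state_dict() and are
        # silently dropped from any saved checkpoint.
        self.convs = [nn.Conv2d(256, 256, 3, padding=1) for _ in range(3)]

class FixedBranch(nn.Module):
    def __init__(self):
        super().__init__()
        # nn.ModuleList registers each conv, so their weights are saved and
        # restored with the checkpoint. Calling self.add_module("conv%d" % i, conv)
        # for each layer works the same way.
        self.convs = nn.ModuleList(
            nn.Conv2d(256, 256, 3, padding=1) for _ in range(3)
        )

print(len(BrokenBranch().state_dict()))  # 0 -> nothing would be checkpointed
print(len(FixedBranch().state_dict()))   # 6 -> 3 weights + 3 biases

The confusing part is that training looks fine: the unregistered convs still run in the forward pass, so the in-training evaluation sees the full model, and the gap only shows up after a save/load round trip.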

ashnair1 added a commit to ashnair1/detectron2 that referenced this issue Jan 26, 2020