
Training works but AssertionError for MODEL.ROI_HEADS.NUM_CLASSES only when COCOEvaluator is used #4922

Open
ravijo opened this issue Apr 20, 2023 · 5 comments



ravijo commented Apr 20, 2023

Based on the suggestions mentioned in this discussion, I am trying to compute the validation loss during the training of a Mask R-CNN model. Please note that my dataset has only one class, say toys, which are small items (at most 10 cm long).

Instructions To Reproduce the Issue:

  1. Using code from this discussion.
  2. Specifically, please see the ValLossHook below (a registration sketch follows this list):
    import torch
    import detectron2.utils.comm as comm
    from detectron2.data import DatasetMapper, build_detection_test_loader
    from detectron2.engine import HookBase

    class ValLossHook(HookBase):
        def __init__(self, cfg, validation_set_key):
            super().__init__()
            self.cfg = cfg.clone()
            self._val_key = validation_set_key
            self._loader = iter(self._build_loader())

        def _build_loader(self):
            # is_train=True keeps the ground-truth annotations in the batch,
            # which the model needs in order to return losses.
            return build_detection_test_loader(
                self.cfg, self._val_key, mapper=DatasetMapper(self.cfg, is_train=True)
            )

        def after_step(self):
            try:
                data = next(self._loader)
            except StopIteration:
                # The test loader is finite; restart it once exhausted so the
                # hook keeps working for the full training run.
                self._loader = iter(self._build_loader())
                data = next(self._loader)
            with torch.no_grad():
                # The model is still in training mode here, so calling it
                # returns a dict of losses rather than predictions.
                loss_dict = self.trainer.model(data)

                losses = sum(loss_dict.values())
                assert torch.isfinite(losses).all(), loss_dict

                loss_dict_reduced = {"val_" + k: v.item() for k, v in comm.reduce_dict(loss_dict).items()}
                losses_reduced = sum(loss_dict_reduced.values())
                if comm.is_main_process():
                    self.trainer.storage.put_scalars(val_total_loss=losses_reduced, **loss_dict_reduced)


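For context, this is how such a hook is typically registered on the trainer. A minimal sketch: the dataset name "toys_val" and the use of DefaultTrainer are my assumptions, not part of the original code.

    from detectron2.engine import DefaultTrainer

    trainer = DefaultTrainer(cfg)
    trainer.resume_or_load(resume=False)
    # "toys_val" is a placeholder for whatever validation split is registered
    trainer.register_hooks([ValLossHook(cfg, "toys_val")])
    trainer.train()
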
Observations:

While training and computing the validation loss at the same time, I noticed the following behaviors:

  1. With the default cfg, training works, but the COCOEvaluator throws an AssertionError, as shown below:
    Traceback (most recent call last):
      File "train.py", line 123, in <module>
        main()
      File "train.py", line 119, in main
        trainer.train()
      File "/home/ravi/.local/lib/python3.6/site-packages/detectron2/engine/defaults.py", line 484, in train
        super().train(self.start_iter, self.max_iter)
      File "/home/ravi/.local/lib/python3.6/site-packages/detectron2/engine/train_loop.py", line 150, in train
        self.after_step()
      File "/home/ravi/.local/lib/python3.6/site-packages/detectron2/engine/train_loop.py", line 180, in after_step
        h.after_step()
      File "/home/ravi/.local/lib/python3.6/site-packages/detectron2/engine/hooks.py", line 552, in after_step
        self._do_eval()
      File "/home/ravi/.local/lib/python3.6/site-packages/detectron2/engine/hooks.py", line 525, in _do_eval
        results = self._func()
      File "/home/ravi/.local/lib/python3.6/site-packages/detectron2/engine/defaults.py", line 453, in test_and_save_results
        self._last_eval_results = self.test(self.cfg, self.model)
      File "/home/ravi/.local/lib/python3.6/site-packages/detectron2/engine/defaults.py", line 608, in test
        results_i = inference_on_dataset(model, data_loader, evaluator)
      File "/home/ravi/.local/lib/python3.6/site-packages/detectron2/evaluation/evaluator.py", line 204, in inference_on_dataset
        results = evaluator.evaluate()
      File "/home/ravi/.local/lib/python3.6/site-packages/detectron2/evaluation/coco_evaluation.py", line 194, in evaluate
        self._eval_predictions(predictions, img_ids=img_ids)
      File "/home/ravi/.local/lib/python3.6/site-packages/detectron2/evaluation/coco_evaluation.py", line 229, in _eval_predictions
        f"A prediction has class={category_id}, "
    AssertionError: A prediction has class=77, but the dataset only has 1 classes and predicted class id should be in [0, 0].
    
    The solution is to set cfg.MODEL.ROI_HEADS.NUM_CLASSES = 1 (see the config sketch after this list). However, after adding this setting, detectron2 shows the following warnings:
    [04/20 18:56:34 d2.data.dataset_mapper]: [DatasetMapper] Augmentations used in training: [ResizeShortestEdge(short_edge_length=(640, 672, 704, 736, 768, 800), max_size=1333, sample_style='choice'), RandomFlip()]
    [04/20 18:56:34 d2.data.build]: Using training sampler TrainingSampler
    [04/20 18:56:34 d2.data.common]: Serializing 5 elements to byte tensors and concatenating them all ...
    [04/20 18:56:34 d2.data.common]: Serialized dataset takes 0.25 MiB
    Skip loading parameter 'roi_heads.box_predictor.cls_score.weight' to the model due to incompatible shapes: (81, 1024) in the checkpoint but (2, 1024) in the model! You might want to double check if this is expected.
    Skip loading parameter 'roi_heads.box_predictor.cls_score.bias' to the model due to incompatible shapes: (81,) in the checkpoint but (2,) in the model! You might want to double check if this is expected.
    Skip loading parameter 'roi_heads.box_predictor.bbox_pred.weight' to the model due to incompatible shapes: (320, 1024) in the checkpoint but (4, 1024) in the model! You might want to double check if this is expected.
    Skip loading parameter 'roi_heads.box_predictor.bbox_pred.bias' to the model due to incompatible shapes: (320,) in the checkpoint but (4,) in the model! You might want to double check if this is expected.
    Skip loading parameter 'roi_heads.mask_head.predictor.weight' to the model due to incompatible shapes: (80, 256, 1, 1) in the checkpoint but (1, 256, 1, 1) in the model! You might want to double check if this is expected.
    Skip loading parameter 'roi_heads.mask_head.predictor.bias' to the model due to incompatible shapes: (80,) in the checkpoint but (1,) in the model! You might want to double check if this is expected.
    Some model parameters or buffers are not found in the checkpoint:
    roi_heads.box_predictor.bbox_pred.{bias, weight}
    roi_heads.box_predictor.cls_score.{bias, weight}
    roi_heads.mask_head.predictor.{bias, weight}
    [04/20 18:56:34 d2.engine.train_loop]: Starting training from iteration 0
    
  2. My dataset contains small items (at most 10 cm long). However, the COCOEvaluator shows nan for APs (see the area-check sketch after this list):
    [04/20 18:56:45 d2.evaluation.fast_eval_api]: Evaluate annotation type *segm*
    [04/20 18:56:45 d2.evaluation.fast_eval_api]: COCOeval_opt.evaluate() finished in 0.02 seconds.
    [04/20 18:56:45 d2.evaluation.fast_eval_api]: Accumulating evaluation results...
    [04/20 18:56:45 d2.evaluation.fast_eval_api]: COCOeval_opt.accumulate() finished in 0.00 seconds.
     Average Precision  (AP) @[ IoU=0.50:0.95 | area=   all | maxDets=100 ] = 0.432
     Average Precision  (AP) @[ IoU=0.50      | area=   all | maxDets=100 ] = 0.709
     Average Precision  (AP) @[ IoU=0.75      | area=   all | maxDets=100 ] = 0.466
     Average Precision  (AP) @[ IoU=0.50:0.95 | area= small | maxDets=100 ] = -1.000
     Average Precision  (AP) @[ IoU=0.50:0.95 | area=medium | maxDets=100 ] = 0.401
     Average Precision  (AP) @[ IoU=0.50:0.95 | area= large | maxDets=100 ] = 0.437
     Average Recall     (AR) @[ IoU=0.50:0.95 | area=   all | maxDets=  1 ] = 0.016
     Average Recall     (AR) @[ IoU=0.50:0.95 | area=   all | maxDets= 10 ] = 0.190
     Average Recall     (AR) @[ IoU=0.50:0.95 | area=   all | maxDets=100 ] = 0.667
     Average Recall     (AR) @[ IoU=0.50:0.95 | area= small | maxDets=100 ] = -1.000
     Average Recall     (AR) @[ IoU=0.50:0.95 | area=medium | maxDets=100 ] = 0.750
     Average Recall     (AR) @[ IoU=0.50:0.95 | area= large | maxDets=100 ] = 0.662
    [04/20 18:56:45 d2.evaluation.coco_evaluation]: Evaluation results for segm: 
    |   AP   |  AP50  |  AP75  |  APs  |  APm   |  APl   |
    |:------:|:------:|:------:|:-----:|:------:|:------:|
    | 43.207 | 70.906 | 46.571 |  nan  | 40.081 | 43.697 |
    [04/20 18:56:45 d2.evaluation.coco_evaluation]: Some metrics cannot be computed and is shown as NaN.
    [04/20 18:56:45 d2.engine.defaults]: Evaluation results for toys_test in csv format:
    [04/20 18:56:45 d2.evaluation.testing]: copypaste: Task: bbox
    [04/20 18:56:45 d2.evaluation.testing]: copypaste: AP,AP50,AP75,APs,APm,APl
    [04/20 18:56:45 d2.evaluation.testing]: copypaste: 35.4201,68.1637,32.0485,nan,61.3645,35.2860
    [04/20 18:56:45 d2.evaluation.testing]: copypaste: Task: segm
    [04/20 18:56:45 d2.evaluation.testing]: copypaste: AP,AP50,AP75,APs,APm,APl
    [04/20 18:56:45 d2.evaluation.testing]: copypaste: 43.2066,70.9056,46.5707,nan,40.0806,43.6967
    
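For reference, a minimal sketch of the configuration change from observation 1. The exact Mask R-CNN config file is an assumption, since the issue does not state which one was used:

    from detectron2 import model_zoo
    from detectron2.config import get_cfg

    cfg = get_cfg()
    # Assumed baseline; substitute the config actually used for training.
    cfg.merge_from_file(model_zoo.get_config_file("COCO-InstanceSegmentation/mask_rcnn_R_50_FPN_3x.yaml"))
    cfg.MODEL.WEIGHTS = model_zoo.get_checkpoint_url("COCO-InstanceSegmentation/mask_rcnn_R_50_FPN_3x.yaml")
    # One foreground class ("toys"); background is handled internally, which is
    # why the checkpoint's (81, 1024) cls_score shrinks to (2, 1024) here.
    cfg.MODEL.ROI_HEADS.NUM_CLASSES = 1
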

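Regarding observation 2: COCOeval's size buckets are defined by pixel area (small < 32^2, medium 32^2 to 96^2, large > 96^2), not physical size, so 10 cm objects can still count as medium or large once imaged. A quick sketch to check which bucket the ground-truth boxes fall into; the dataset name "toys_test" and the XYWH_ABS box mode are assumptions:

    from detectron2.data import DatasetCatalog

    SMALL, MEDIUM = 32 ** 2, 96 ** 2  # COCOeval area thresholds in pixels^2
    counts = {"small": 0, "medium": 0, "large": 0}
    for record in DatasetCatalog.get("toys_test"):  # placeholder dataset name
        for ann in record["annotations"]:
            w, h = ann["bbox"][2], ann["bbox"][3]  # assumes BoxMode.XYWH_ABS
            area = w * h
            bucket = "small" if area < SMALL else "medium" if area < MEDIUM else "large"
            counts[bucket] += 1
    # If counts["small"] is 0, COCOeval reports -1 for that bucket,
    # which detectron2 prints as nan in the summary table.
    print(counts)
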
Environment:

| Name | Info./Version |
|:--|:--|
| sys.platform | linux |
| OS | Ubuntu 18.04.6 LTS |
| OS Kernel | 5.4.0-120-generic |
| Python | 3.6.9 (default, Mar 10 2023, 16:46:00) [GCC 8.4.0] |
| numpy | 1.19.5 |
| detectron2 | 0.6 @/home/ravi/.local/lib/python3.6/site-packages/detectron2 |
| Compiler | GCC 7.5 |
| CUDA compiler | CUDA 11.6 |
| detectron2 arch flags | 7.0 |
| DETECTRON2_ENV_MODULE | <not set> |
| PyTorch | 1.9.0+cu111 @/home/ravi/.local/lib/python3.6/site-packages/torch |
| PyTorch debug build | False |
| GPU available | Yes |
| GPU 0,1 | NVIDIA GeForce RTX 3090 (arch=8.6) |
| Driver version | 515.48.07 |
| CUDA_HOME | /usr/local/cuda |
| Pillow | 8.4.0 |
| torchvision | 0.10.0+cu111 @/home/ravi/.local/lib/python3.6/site-packages/torchvision |
| torchvision arch flags | 3.5, 5.0, 6.0, 7.0, 7.5, 8.0, 8.6 |
| fvcore | 0.1.5.post20221221 |
| iopath | 0.1.9 |
| cv2 | 4.7.0 |

Questions:

  1. Is it necessary to set NUM_CLASSES for roi_heads? Surprisingly, training works even without setting it. Nevertheless, after setting it, detectron2 shows the incompatible-shapes messages mentioned above. Does this mean that the weights for roi_heads are not loaded? Furthermore, does setting NUM_CLASSES to 1 hurt model learning/performance?
  2. Why does the COCOEvaluator show nan for APs when my dataset contains only small objects?

Thank you very much.

@github-actions github-actions bot added the needs-more-info More info is needed to complete the issue label Apr 20, 2023
@github-actions

You've chosen to report an unexpected problem or bug. Unless you already know the root cause of it, please include details about it by filling the issue template.
The following information is missing: "Instructions To Reproduce the Issue and Full Logs";

@github-actions github-actions bot removed the needs-more-info More info is needed to complete the issue label Apr 21, 2023

ravijo commented Apr 23, 2023

Dear @ppwwyyxx

Any suggestions, please?

Thank you so much.


Lihewin commented Apr 25, 2023

You have to set NUM_CLASSES for roi_heads. About the warnings, see #196.


ravijo commented Apr 25, 2023

@Lihewin

Thanks a lot. I did that. Could you please also take a look at the Questions section in my post above?

@clairelin23

Did you resolve the problem? I am facing the same issue even after setting NUM_CLASSES.
