
Training works but AssertionError for MODEL.ROI_HEADS.NUM_CLASSES only when COCOEvaluator is used #4922

Open
ravijo opened this issue Apr 20, 2023 · 5 comments



ravijo commented Apr 20, 2023

Based on the suggestions mentioned in this discussion, I am trying to compute the validation loss during the training of a Mask R-CNN model. Please note that my dataset has only one class, say toys, which are small items (at most 10 cm long).

Instructions To Reproduce the Issue:

  1. Using code from this discussion.
  2. Specifically, please see the ValLossHook below (a registration sketch follows this list):
    import torch
    import detectron2.utils.comm as comm
    from detectron2.data import DatasetMapper, build_detection_test_loader
    from detectron2.engine import HookBase

    class ValLossHook(HookBase):
        def __init__(self, cfg, validation_set_key):
            super().__init__()
            self.cfg = cfg.clone()
            self._val_key = validation_set_key
            self._loader = iter(self._build_loader())

        def _build_loader(self):
            # is_train=True keeps the ground-truth annotations in the batch,
            # which the model needs in order to return losses.
            return build_detection_test_loader(
                self.cfg, self._val_key, mapper=DatasetMapper(self.cfg, is_train=True)
            )

        def after_step(self):
            try:
                data = next(self._loader)
            except StopIteration:
                # The test loader is finite; restart it once exhausted so the
                # hook keeps working for the full training run.
                self._loader = iter(self._build_loader())
                data = next(self._loader)
            with torch.no_grad():
                # The model is still in training mode here, so calling it
                # returns a dict of losses rather than predictions.
                loss_dict = self.trainer.model(data)

                losses = sum(loss_dict.values())
                assert torch.isfinite(losses).all(), loss_dict

                loss_dict_reduced = {"val_" + k: v.item() for k, v in comm.reduce_dict(loss_dict).items()}
                losses_reduced = sum(loss_dict_reduced.values())
                if comm.is_main_process():
                    self.trainer.storage.put_scalars(val_total_loss=losses_reduced, **loss_dict_reduced)


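For context, this is how such a hook is typically registered on the trainer. A minimal sketch: the dataset name "toys_val" and the use of DefaultTrainer are my assumptions, not part of the original code.

    from detectron2.engine import DefaultTrainer

    trainer = DefaultTrainer(cfg)
    trainer.resume_or_load(resume=False)
    # "toys_val" is a placeholder for whatever validation split is registered
    trainer.register_hooks([ValLossHook(cfg, "toys_val")])
    trainer.train()
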
Observations:

While training and computing the validation loss at the same time, I noticed the following behaviors:

  1. With the default cfg, training works, but the COCOEvaluator throws an AssertionError, as shown below:
    Traceback (most recent call last):
      File "train.py", line 123, in <module>
        main()
      File "train.py", line 119, in main
        trainer.train()
      File "/home/ravi/.local/lib/python3.6/site-packages/detectron2/engine/defaults.py", line 484, in train
        super().train(self.start_iter, self.max_iter)
      File "/home/ravi/.local/lib/python3.6/site-packages/detectron2/engine/train_loop.py", line 150, in train
        self.after_step()
      File "/home/ravi/.local/lib/python3.6/site-packages/detectron2/engine/train_loop.py", line 180, in after_step
        h.after_step()
      File "/home/ravi/.local/lib/python3.6/site-packages/detectron2/engine/hooks.py", line 552, in after_step
        self._do_eval()
      File "/home/ravi/.local/lib/python3.6/site-packages/detectron2/engine/hooks.py", line 525, in _do_eval
        results = self._func()
      File "/home/ravi/.local/lib/python3.6/site-packages/detectron2/engine/defaults.py", line 453, in test_and_save_results
        self._last_eval_results = self.test(self.cfg, self.model)
      File "/home/ravi/.local/lib/python3.6/site-packages/detectron2/engine/defaults.py", line 608, in test
        results_i = inference_on_dataset(model, data_loader, evaluator)
      File "/home/ravi/.local/lib/python3.6/site-packages/detectron2/evaluation/evaluator.py", line 204, in inference_on_dataset
        results = evaluator.evaluate()
      File "/home/ravi/.local/lib/python3.6/site-packages/detectron2/evaluation/coco_evaluation.py", line 194, in evaluate
        self._eval_predictions(predictions, img_ids=img_ids)
      File "/home/ravi/.local/lib/python3.6/site-packages/detectron2/evaluation/coco_evaluation.py", line 229, in _eval_predictions
        f"A prediction has class={category_id}, "
    AssertionError: A prediction has class=77, but the dataset only has 1 classes and predicted class id should be in [0, 0].
    
    The solution is to set cfg.MODEL.ROI_HEADS.NUM_CLASSES = 1 (see the config sketch after this list). However, after adding this setting, detectron2 shows the following warnings:
    [04/20 18:56:34 d2.data.dataset_mapper]: [DatasetMapper] Augmentations used in training: [ResizeShortestEdge(short_edge_length=(640, 672, 704, 736, 768, 800), max_size=1333, sample_style='choice'), RandomFlip()]
    [04/20 18:56:34 d2.data.build]: Using training sampler TrainingSampler
    [04/20 18:56:34 d2.data.common]: Serializing 5 elements to byte tensors and concatenating them all ...
    [04/20 18:56:34 d2.data.common]: Serialized dataset takes 0.25 MiB
    Skip loading parameter 'roi_heads.box_predictor.cls_score.weight' to the model due to incompatible shapes: (81, 1024) in the checkpoint but (2, 1024) in the model! You might want to double check if this is expected.
    Skip loading parameter 'roi_heads.box_predictor.cls_score.bias' to the model due to incompatible shapes: (81,) in the checkpoint but (2,) in the model! You might want to double check if this is expected.
    Skip loading parameter 'roi_heads.box_predictor.bbox_pred.weight' to the model due to incompatible shapes: (320, 1024) in the checkpoint but (4, 1024) in the model! You might want to double check if this is expected.
    Skip loading parameter 'roi_heads.box_predictor.bbox_pred.bias' to the model due to incompatible shapes: (320,) in the checkpoint but (4,) in the model! You might want to double check if this is expected.
    Skip loading parameter 'roi_heads.mask_head.predictor.weight' to the model due to incompatible shapes: (80, 256, 1, 1) in the checkpoint but (1, 256, 1, 1) in the model! You might want to double check if this is expected.
    Skip loading parameter 'roi_heads.mask_head.predictor.bias' to the model due to incompatible shapes: (80,) in the checkpoint but (1,) in the model! You might want to double check if this is expected.
    Some model parameters or buffers are not found in the checkpoint:
    roi_heads.box_predictor.bbox_pred.{bias, weight}
    roi_heads.box_predictor.cls_score.{bias, weight}
    roi_heads.mask_head.predictor.{bias, weight}
    [04/20 18:56:34 d2.engine.train_loop]: Starting training from iteration 0
    
  2. My dataset contains small items (at most 10 cm long). However, the COCOEvaluator shows nan for APs (see the area-check sketch after this list):
    [04/20 18:56:45 d2.evaluation.fast_eval_api]: Evaluate annotation type *segm*
    [04/20 18:56:45 d2.evaluation.fast_eval_api]: COCOeval_opt.evaluate() finished in 0.02 seconds.
    [04/20 18:56:45 d2.evaluation.fast_eval_api]: Accumulating evaluation results...
    [04/20 18:56:45 d2.evaluation.fast_eval_api]: COCOeval_opt.accumulate() finished in 0.00 seconds.
     Average Precision  (AP) @[ IoU=0.50:0.95 | area=   all | maxDets=100 ] = 0.432
     Average Precision  (AP) @[ IoU=0.50      | area=   all | maxDets=100 ] = 0.709
     Average Precision  (AP) @[ IoU=0.75      | area=   all | maxDets=100 ] = 0.466
     Average Precision  (AP) @[ IoU=0.50:0.95 | area= small | maxDets=100 ] = -1.000
     Average Precision  (AP) @[ IoU=0.50:0.95 | area=medium | maxDets=100 ] = 0.401
     Average Precision  (AP) @[ IoU=0.50:0.95 | area= large | maxDets=100 ] = 0.437
     Average Recall     (AR) @[ IoU=0.50:0.95 | area=   all | maxDets=  1 ] = 0.016
     Average Recall     (AR) @[ IoU=0.50:0.95 | area=   all | maxDets= 10 ] = 0.190
     Average Recall     (AR) @[ IoU=0.50:0.95 | area=   all | maxDets=100 ] = 0.667
     Average Recall     (AR) @[ IoU=0.50:0.95 | area= small | maxDets=100 ] = -1.000
     Average Recall     (AR) @[ IoU=0.50:0.95 | area=medium | maxDets=100 ] = 0.750
     Average Recall     (AR) @[ IoU=0.50:0.95 | area= large | maxDets=100 ] = 0.662
    [04/20 18:56:45 d2.evaluation.coco_evaluation]: Evaluation results for segm: 
    |   AP   |  AP50  |  AP75  |  APs  |  APm   |  APl   |
    |:------:|:------:|:------:|:-----:|:------:|:------:|
    | 43.207 | 70.906 | 46.571 |  nan  | 40.081 | 43.697 |
    [04/20 18:56:45 d2.evaluation.coco_evaluation]: Some metrics cannot be computed and is shown as NaN.
    [04/20 18:56:45 d2.engine.defaults]: Evaluation results for toys_test in csv format:
    [04/20 18:56:45 d2.evaluation.testing]: copypaste: Task: bbox
    [04/20 18:56:45 d2.evaluation.testing]: copypaste: AP,AP50,AP75,APs,APm,APl
    [04/20 18:56:45 d2.evaluation.testing]: copypaste: 35.4201,68.1637,32.0485,nan,61.3645,35.2860
    [04/20 18:56:45 d2.evaluation.testing]: copypaste: Task: segm
    [04/20 18:56:45 d2.evaluation.testing]: copypaste: AP,AP50,AP75,APs,APm,APl
    [04/20 18:56:45 d2.evaluation.testing]: copypaste: 43.2066,70.9056,46.5707,nan,40.0806,43.6967
    
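For reference, a minimal sketch of the configuration change from observation 1. The exact Mask R-CNN config file is an assumption, since the issue does not state which one was used:

    from detectron2 import model_zoo
    from detectron2.config import get_cfg

    cfg = get_cfg()
    # Assumed baseline; substitute the config actually used for training.
    cfg.merge_from_file(model_zoo.get_config_file("COCO-InstanceSegmentation/mask_rcnn_R_50_FPN_3x.yaml"))
    cfg.MODEL.WEIGHTS = model_zoo.get_checkpoint_url("COCO-InstanceSegmentation/mask_rcnn_R_50_FPN_3x.yaml")
    # One foreground class ("toys"); background is handled internally, which is
    # why the checkpoint's (81, 1024) cls_score shrinks to (2, 1024) here.
    cfg.MODEL.ROI_HEADS.NUM_CLASSES = 1
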

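Regarding observation 2: COCOeval's size buckets are defined by pixel area (small < 32^2, medium 32^2 to 96^2, large > 96^2), not physical size, so 10 cm objects can still count as medium or large once imaged. A quick sketch to check which bucket the ground-truth boxes fall into; the dataset name "toys_test" and the XYWH_ABS box mode are assumptions:

    from detectron2.data import DatasetCatalog

    SMALL, MEDIUM = 32 ** 2, 96 ** 2  # COCOeval area thresholds in pixels^2
    counts = {"small": 0, "medium": 0, "large": 0}
    for record in DatasetCatalog.get("toys_test"):  # placeholder dataset name
        for ann in record["annotations"]:
            w, h = ann["bbox"][2], ann["bbox"][3]  # assumes BoxMode.XYWH_ABS
            area = w * h
            bucket = "small" if area < SMALL else "medium" if area < MEDIUM else "large"
            counts[bucket] += 1
    # If counts["small"] is 0, COCOeval reports -1 for that bucket,
    # which detectron2 prints as nan in the summary table.
    print(counts)
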
Environment:

| Name | Info./Version |
|:--|:--|
| sys.platform | linux |
| OS | Ubuntu 18.04.6 LTS |
| OS Kernel | 5.4.0-120-generic |
| Python | 3.6.9 (default, Mar 10 2023, 16:46:00) [GCC 8.4.0] |
| numpy | 1.19.5 |
| detectron2 | 0.6 @/home/ravi/.local/lib/python3.6/site-packages/detectron2 |
| Compiler | GCC 7.5 |
| CUDA compiler | CUDA 11.6 |
| detectron2 arch flags | 7.0 |
| DETECTRON2_ENV_MODULE | <not set> |
| PyTorch | 1.9.0+cu111 @/home/ravi/.local/lib/python3.6/site-packages/torch |
| PyTorch debug build | False |
| GPU available | Yes |
| GPU 0,1 | NVIDIA GeForce RTX 3090 (arch=8.6) |
| Driver version | 515.48.07 |
| CUDA_HOME | /usr/local/cuda |
| Pillow | 8.4.0 |
| torchvision | 0.10.0+cu111 @/home/ravi/.local/lib/python3.6/site-packages/torchvision |
| torchvision arch flags | 3.5, 5.0, 6.0, 7.0, 7.5, 8.0, 8.6 |
| fvcore | 0.1.5.post20221221 |
| iopath | 0.1.9 |
| cv2 | 4.7.0 |

Questions:

  1. Is it necessary to set NUM_CLASSES for roi_heads? Surprisingly, training works even without setting it. Nevertheless, after setting it, detectron2 shows the incompatible-shapes messages mentioned above. Does this mean that the weights for roi_heads are not loaded? Furthermore, does setting NUM_CLASSES to 1 hurt model learning/performance?
  2. Why does the COCOEvaluator show nan for APs when my dataset contains only small objects?

Thank you very much.

@github-actions github-actions bot added the needs-more-info More info is needed to complete the issue label Apr 20, 2023
@github-actions

You've chosen to report an unexpected problem or bug. Unless you already know the root cause of it, please include details about it by filling the issue template.
The following information is missing: "Instructions To Reproduce the Issue and Full Logs";

@github-actions github-actions bot removed the needs-more-info More info is needed to complete the issue label Apr 21, 2023

ravijo commented Apr 23, 2023

Dear @ppwwyyxx

Any suggestions, please?

Thank you so much.


Lihewin commented Apr 25, 2023

You have to set NUM_CLASSES for roi_heads. About the warnings, see #196.


ravijo commented Apr 25, 2023

@Lihewin

Thanks a lot. I did that. Could you please also take a look at the Questions section in my post above?

@clairelin23

Did you resolve the problem? I am facing the same issue even after setting NUM_CLASSES.
