Training fails if the --validate flag is set #39

klois · 2020-04-30T08:17:16Z

I got training on the COCO dataset working as expected.
However, when I set the --validate flag, as recommended in the documentation, training fails as soon as it reaches the first validation step.

The command

./tools/dist_train.sh configs/solo/decoupled_solo_light_r50_fpn_8gpu_3x.py 2

works while

./tools/dist_train.sh configs/solo/decoupled_solo_light_r50_fpn_8gpu_3x.py 2 --validate

produces the following error

[>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>] 58/58, 40.1 task/s, elapsed: 1s, ETA:     0s

Traceback (most recent call last):
  File "./tools/train.py", line 125, in <module>
    main()
  File "./tools/train.py", line 121, in main
    timestamp=timestamp)
  File "/home/klauskofler/SOLO/mmdet/apis/train.py", line 103, in train_detector
    timestamp=timestamp)
  File "/home/klauskofler/SOLO/mmdet/apis/train.py", line 250, in _dist_train
    runner.run(data_loaders, cfg.workflow, cfg.total_epochs)
  File "/home/klauskofler/anaconda3/envs/open-mmlab/lib/python3.7/site-packages/mmcv/runner/runner.py", line 364, in run
    epoch_runner(data_loaders[i], **kwargs)
  File "/home/klauskofler/anaconda3/envs/open-mmlab/lib/python3.7/site-packages/mmcv/runner/runner.py", line 278, in train
    self.call_hook('after_train_epoch')
  File "/home/klauskofler/anaconda3/envs/open-mmlab/lib/python3.7/site-packages/mmcv/runner/runner.py", line 231, in call_hook
    getattr(hook, fn_name)(self)
  File "/home/klauskofler/SOLO/mmdet/core/evaluation/eval_hooks.py", line 64, in after_train_epoch
    self.evaluate(runner, results)
  File "/home/klauskofler/SOLO/mmdet/core/evaluation/eval_hooks.py", line 124, in evaluate
    result_files = results2json(self.dataset, results, tmp_file)
  File "/home/klauskofler/SOLO/mmdet/core/evaluation/coco_utils.py", line 224, in results2json
    json_results = det2json(dataset, results)
  File "/home/klauskofler/SOLO/mmdet/core/evaluation/coco_utils.py", line 153, in det2json
    for i in range(bboxes.shape[0]):
AttributeError: 'NoneType' object has no attribute 'shape'
Traceback (most recent call last):
  File "/home/klauskofler/anaconda3/envs/open-mmlab/lib/python3.7/runpy.py", line 193, in _run_module_as_main
    "__main__", mod_spec)
  File "/home/klauskofler/anaconda3/envs/open-mmlab/lib/python3.7/runpy.py", line 85, in _run_code
    exec(code, run_globals)
  File "/home/klauskofler/anaconda3/envs/open-mmlab/lib/python3.7/site-packages/torch/distributed/launch.py", line 263, in <module>
    main()
  File "/home/klauskofler/anaconda3/envs/open-mmlab/lib/python3.7/site-packages/torch/distributed/launch.py", line 259, in main
    cmd=cmd)
subprocess.CalledProcessError: Command '['/home/klauskofler/anaconda3/envs/open-mmlab/bin/python', '-u', './tools/train.py', '--local_rank=1', 'configs/solo/decoupled_solo_light_r50_fpn_8gpu_3x.py', '--launcher', 'pytorch', '--validate']' returned non-zero exit status 1.

It seems to me that the validation step tries to calculate KPIs based on bounding boxes produced by the network, while the network does not produce any bounding boxes. Is this behavior expected, or am I doing something wrong?
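To illustrate the failure mode in the traceback, here is a minimal sketch. It assumes the evaluation loop receives `None` where it expects an array of boxes; `boxes_to_records` is a hypothetical, simplified stand-in, not the actual `det2json` from mmdet, and the record fields are only illustrative.

```python
def boxes_to_records(per_class_boxes):
    """Hypothetical, simplified stand-in for a bbox-to-JSON loop (not mmdet code)."""
    records = []
    for label, bboxes in enumerate(per_class_boxes):
        if bboxes is None:
            # No boxes for this entry (e.g. a mask-only model): skip instead of crashing.
            continue
        for x, y, w, h, score in bboxes:
            records.append({"category_id": label, "bbox": [x, y, w, h], "score": score})
    return records


# Accessing .shape on None raises the same error type as in the traceback.
try:
    None.shape
except AttributeError as err:
    failure = str(err)

# One class with a single box, one class with no box output at all.
records = boxes_to_records([[(0.0, 0.0, 10.0, 5.0, 0.9)], None])
```

A guard like this only avoids the crash; it would not make bounding-box evaluation meaningful for a model that predicts masks only.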


WXinlong · 2020-05-05T10:07:59Z

@klois The validation during training is not supported yet.

klois · 2020-05-05T11:17:18Z

Thanks for the information.
Maybe the documentation at https://github.com/WXinlong/SOLO/blob/master/docs/GETTING_STARTED.md should be updated to avoid further confusion.

WXinlong · 2020-05-13T02:42:41Z

@klois Yes, thanks for pointing it out. :)
