AssertionError: Default process group is not initialized #19

Closed
amiltonwong opened this issue Apr 8, 2021 · 3 comments

amiltonwong commented Apr 8, 2021

Hi, authors,

I got the following error after executing the command: python tools/train.py configs/SETR/SETR_PUP_768x768_40k_cityscapes_bs_8.py

2021-04-08 08:03:22,265 - mmseg - INFO - Loaded 2975 images
2021-04-08 08:03:24,275 - mmseg - INFO - Loaded 500 images
2021-04-08 08:03:24,276 - mmseg - INFO - Start running, host: root@milton-LabPC, work_dir: /media/root/mdata/data/code13/SETR/work_dirs/SETR_PUP_768x768_40k_cityscapes_bs_8
2021-04-08 08:03:24,276 - mmseg - INFO - workflow: [('train', 1)], max: 40000 iters
Traceback (most recent call last):
  File "tools/train.py", line 161, in <module>
    main()
  File "tools/train.py", line 150, in main
    train_segmentor(
  File "/media/root/mdata/data/code13/SETR/mmseg/apis/train.py", line 106, in train_segmentor
    runner.run(data_loaders, cfg.workflow, cfg.total_iters)
  File "/root/anaconda3/envs/pytorch1.7.0/lib/python3.8/site-packages/mmcv/runner/iter_based_runner.py", line 130, in run
    iter_runner(iter_loaders[i], **kwargs)
  File "/root/anaconda3/envs/pytorch1.7.0/lib/python3.8/site-packages/mmcv/runner/iter_based_runner.py", line 60, in train
    outputs = self.model.train_step(data_batch, self.optimizer, **kwargs)
  File "/root/anaconda3/envs/pytorch1.7.0/lib/python3.8/site-packages/mmcv/parallel/data_parallel.py", line 67, in train_step
    return self.module.train_step(*inputs[0], **kwargs[0])
  File "/media/root/mdata/data/code13/SETR/mmseg/models/segmentors/base.py", line 152, in train_step
    losses = self(**data_batch)
  File "/root/anaconda3/envs/pytorch1.7.0/lib/python3.8/site-packages/torch/nn/modules/module.py", line 727, in _call_impl
    result = self.forward(*input, **kwargs)
  File "/root/anaconda3/envs/pytorch1.7.0/lib/python3.8/site-packages/mmcv/runner/fp16_utils.py", line 84, in new_func
    return old_func(*args, **kwargs)
  File "/media/root/mdata/data/code13/SETR/mmseg/models/segmentors/base.py", line 122, in forward
    return self.forward_train(img, img_metas, **kwargs)
  File "/media/root/mdata/data/code13/SETR/mmseg/models/segmentors/encoder_decoder.py", line 157, in forward_train
    loss_decode = self._decode_head_forward_train(x, img_metas,
  File "/media/root/mdata/data/code13/SETR/mmseg/models/segmentors/encoder_decoder.py", line 100, in _decode_head_forward_train
    loss_decode = self.decode_head.forward_train(x, img_metas,
  File "/media/root/mdata/data/code13/SETR/mmseg/models/decode_heads/decode_head.py", line 185, in forward_train
    seg_logits = self.forward(inputs)
  File "/media/root/mdata/data/code13/SETR/mmseg/models/decode_heads/vit_up_head.py", line 93, in forward
    x = self.syncbn_fc_0(x)
  File "/root/anaconda3/envs/pytorch1.7.0/lib/python3.8/site-packages/torch/nn/modules/module.py", line 727, in _call_impl
    result = self.forward(*input, **kwargs)
  File "/root/anaconda3/envs/pytorch1.7.0/lib/python3.8/site-packages/torch/nn/modules/batchnorm.py", line 519, in forward
    world_size = torch.distributed.get_world_size(process_group)
  File "/root/anaconda3/envs/pytorch1.7.0/lib/python3.8/site-packages/torch/distributed/distributed_c10d.py", line 625, in get_world_size
    return _get_group_size(group)
  File "/root/anaconda3/envs/pytorch1.7.0/lib/python3.8/site-packages/torch/distributed/distributed_c10d.py", line 220, in _get_group_size
    _check_default_pg()
  File "/root/anaconda3/envs/pytorch1.7.0/lib/python3.8/site-packages/torch/distributed/distributed_c10d.py", line 210, in _check_default_pg
    assert _default_pg is not None, \
AssertionError: Default process group is not initialized
(pytorch1.7.0) root@milton-LabPC:/data/code13/SETR

Since I am training on a single GPU, the error seems to be related to distributed training. Any hints on how to solve this issue?

Thanks!
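
For context, the assertion comes from torch.nn.SyncBatchNorm: in training mode it queries the default distributed process group for the world size, so it cannot run outside a distributed launch. A minimal sketch (not from this repo, assuming a CUDA device is available) that reproduces the same error:

import torch
import torch.nn as nn

# In training mode, SyncBatchNorm asks torch.distributed for the world size,
# so calling it without an initialized default process group raises
# "AssertionError: Default process group is not initialized".
sync_bn = nn.SyncBatchNorm(16).cuda()
sync_bn.train()
x = torch.randn(2, 16, 8, 8, device='cuda')
out = sync_bn(x)  # raises the AssertionError seen in the traceback above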

@lzrobots

This is due to the SyncBN layers. Try

./tools/dist_train.sh configs/SETR/SETR_PUP_768x768_40k_cityscapes_bs_8.py 1

but that still won't be enough to train SETR with only one GPU.
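
For what it's worth, the distributed launcher is what initializes the default process group that SyncBN expects; with a single process the layers then see world_size == 1 and skip synchronization. A rough sketch of achieving the same effect by hand (a hypothetical workaround, not what tools/train.py does):

import os
import torch.distributed as dist

# Hypothetical single-process workaround: initialize a default process group
# of size 1 so SyncBatchNorm sees world_size == 1 and skips synchronization.
# This is roughly what launching via dist_train.sh with one GPU achieves.
os.environ.setdefault('MASTER_ADDR', '127.0.0.1')
os.environ.setdefault('MASTER_PORT', '29500')
dist.init_process_group(backend='nccl', rank=0, world_size=1)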

@amiltonwong

@lzrobots,

Does this mean that two or more GPUs are required to run the training step?

@lzrobots

Yes. I haven't seen any modern segmentation model that can be trained on a single GPU; see mmsegmentation.
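
If you still want to debug on a single GPU despite that, a common workaround for mmsegmentation-style code is to swap the SyncBatchNorm layers for plain BatchNorm before training. Below is a sketch of a hypothetical helper (not part of this repo; newer mmcv versions ship a similar revert_sync_batchnorm utility), with no guarantee that a single-GPU run reproduces the paper's results:

import torch.nn as nn

def revert_sync_batchnorm(module: nn.Module) -> nn.Module:
    # Hypothetical helper: recursively replace SyncBatchNorm with BatchNorm2d
    # so the model can run without an initialized default process group.
    module_out = module
    if isinstance(module, nn.SyncBatchNorm):
        module_out = nn.BatchNorm2d(module.num_features, module.eps,
                                    module.momentum, module.affine,
                                    module.track_running_stats)
        if module.affine:
            module_out.weight.data = module.weight.data.clone()
            module_out.bias.data = module.bias.data.clone()
        if module.track_running_stats:
            module_out.running_mean = module.running_mean
            module_out.running_var = module.running_var
            module_out.num_batches_tracked = module.num_batches_tracked
    for name, child in module.named_children():
        module_out.add_module(name, revert_sync_batchnorm(child))
    return module_out

# Usage (hypothetical): model = revert_sync_batchnorm(model) before training.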
