AssertionError: Default process group is not initialized #19

Closed
amiltonwong opened this issue Apr 8, 2021 · 3 comments

amiltonwong commented Apr 8, 2021

Hi, authors,

I got the following error after executing the command: python tools/train.py configs/SETR/SETR_PUP_768x768_40k_cityscapes_bs_8.py

2021-04-08 08:03:22,265 - mmseg - INFO - Loaded 2975 images
2021-04-08 08:03:24,275 - mmseg - INFO - Loaded 500 images
2021-04-08 08:03:24,276 - mmseg - INFO - Start running, host: root@milton-LabPC, work_dir: /media/root/mdata/data/code13/SETR/work_dirs/SETR_PUP_768x768_40k_cityscapes_bs_8
2021-04-08 08:03:24,276 - mmseg - INFO - workflow: [('train', 1)], max: 40000 iters
Traceback (most recent call last):
  File "tools/train.py", line 161, in <module>
    main()
  File "tools/train.py", line 150, in main
    train_segmentor(
  File "/media/root/mdata/data/code13/SETR/mmseg/apis/train.py", line 106, in train_segmentor
    runner.run(data_loaders, cfg.workflow, cfg.total_iters)
  File "/root/anaconda3/envs/pytorch1.7.0/lib/python3.8/site-packages/mmcv/runner/iter_based_runner.py", line 130, in run
    iter_runner(iter_loaders[i], **kwargs)
  File "/root/anaconda3/envs/pytorch1.7.0/lib/python3.8/site-packages/mmcv/runner/iter_based_runner.py", line 60, in train
    outputs = self.model.train_step(data_batch, self.optimizer, **kwargs)
  File "/root/anaconda3/envs/pytorch1.7.0/lib/python3.8/site-packages/mmcv/parallel/data_parallel.py", line 67, in train_step
    return self.module.train_step(*inputs[0], **kwargs[0])
  File "/media/root/mdata/data/code13/SETR/mmseg/models/segmentors/base.py", line 152, in train_step
    losses = self(**data_batch)
  File "/root/anaconda3/envs/pytorch1.7.0/lib/python3.8/site-packages/torch/nn/modules/module.py", line 727, in _call_impl
    result = self.forward(*input, **kwargs)
  File "/root/anaconda3/envs/pytorch1.7.0/lib/python3.8/site-packages/mmcv/runner/fp16_utils.py", line 84, in new_func
    return old_func(*args, **kwargs)
  File "/media/root/mdata/data/code13/SETR/mmseg/models/segmentors/base.py", line 122, in forward
    return self.forward_train(img, img_metas, **kwargs)
  File "/media/root/mdata/data/code13/SETR/mmseg/models/segmentors/encoder_decoder.py", line 157, in forward_train
    loss_decode = self._decode_head_forward_train(x, img_metas,
  File "/media/root/mdata/data/code13/SETR/mmseg/models/segmentors/encoder_decoder.py", line 100, in _decode_head_forward_train
    loss_decode = self.decode_head.forward_train(x, img_metas,
  File "/media/root/mdata/data/code13/SETR/mmseg/models/decode_heads/decode_head.py", line 185, in forward_train
    seg_logits = self.forward(inputs)
  File "/media/root/mdata/data/code13/SETR/mmseg/models/decode_heads/vit_up_head.py", line 93, in forward
    x = self.syncbn_fc_0(x)
  File "/root/anaconda3/envs/pytorch1.7.0/lib/python3.8/site-packages/torch/nn/modules/module.py", line 727, in _call_impl
    result = self.forward(*input, **kwargs)
  File "/root/anaconda3/envs/pytorch1.7.0/lib/python3.8/site-packages/torch/nn/modules/batchnorm.py", line 519, in forward
    world_size = torch.distributed.get_world_size(process_group)
  File "/root/anaconda3/envs/pytorch1.7.0/lib/python3.8/site-packages/torch/distributed/distributed_c10d.py", line 625, in get_world_size
    return _get_group_size(group)
  File "/root/anaconda3/envs/pytorch1.7.0/lib/python3.8/site-packages/torch/distributed/distributed_c10d.py", line 220, in _get_group_size
    _check_default_pg()
  File "/root/anaconda3/envs/pytorch1.7.0/lib/python3.8/site-packages/torch/distributed/distributed_c10d.py", line 210, in _check_default_pg
    assert _default_pg is not None, \
AssertionError: Default process group is not initialized
(pytorch1.7.0) root@milton-LabPC:/data/code13/SETR

Since I am training on a single GPU, the error seems to be related to distributed training. Any hints on how to solve this issue?

Thanks!
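
For context, the assertion comes from torch.nn.SyncBatchNorm: in training mode it queries the default distributed process group for the world size, so it cannot run outside a distributed launch. A minimal sketch (not from this repo, assuming a CUDA device is available) that reproduces the same error:

import torch
import torch.nn as nn

# In training mode, SyncBatchNorm asks torch.distributed for the world size,
# so calling it without an initialized default process group raises
# "AssertionError: Default process group is not initialized".
sync_bn = nn.SyncBatchNorm(16).cuda()
sync_bn.train()
x = torch.randn(2, 16, 8, 8, device='cuda')
out = sync_bn(x)  # raises the AssertionError seen in the traceback above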

@lzrobots

This is due to the SyncBN layers. Try

./tools/dist_train.sh configs/SETR/SETR_PUP_768x768_40k_cityscapes_bs_8.py 1

but that still won't be enough to train SETR with only one GPU.
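
For what it's worth, the distributed launcher is what initializes the default process group that SyncBN expects; with a single process the layers then see world_size == 1 and skip synchronization. A rough sketch of achieving the same effect by hand (a hypothetical workaround, not what tools/train.py does):

import os
import torch.distributed as dist

# Hypothetical single-process workaround: initialize a default process group
# of size 1 so SyncBatchNorm sees world_size == 1 and skips synchronization.
# This is roughly what launching via dist_train.sh with one GPU achieves.
os.environ.setdefault('MASTER_ADDR', '127.0.0.1')
os.environ.setdefault('MASTER_PORT', '29500')
dist.init_process_group(backend='nccl', rank=0, world_size=1)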

@amiltonwong

@lzrobots,

Does this mean that two or more GPUs are required to run the training step?

@lzrobots

Yes. I haven't seen any modern segmentation model that can be trained on a single GPU; see mmsegmentation.
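
If you still want to debug on a single GPU despite that, a common workaround for mmsegmentation-style code is to swap the SyncBatchNorm layers for plain BatchNorm before training. Below is a sketch of a hypothetical helper (not part of this repo; newer mmcv versions ship a similar revert_sync_batchnorm utility), with no guarantee that a single-GPU run reproduces the paper's results:

import torch.nn as nn

def revert_sync_batchnorm(module: nn.Module) -> nn.Module:
    # Hypothetical helper: recursively replace SyncBatchNorm with BatchNorm2d
    # so the model can run without an initialized default process group.
    module_out = module
    if isinstance(module, nn.SyncBatchNorm):
        module_out = nn.BatchNorm2d(module.num_features, module.eps,
                                    module.momentum, module.affine,
                                    module.track_running_stats)
        if module.affine:
            module_out.weight.data = module.weight.data.clone()
            module_out.bias.data = module.bias.data.clone()
        if module.track_running_stats:
            module_out.running_mean = module.running_mean
            module_out.running_var = module.running_var
            module_out.num_batches_tracked = module.num_batches_tracked
    for name, child in module.named_children():
        module_out.add_module(name, revert_sync_batchnorm(child))
    return module_out

# Usage (hypothetical): model = revert_sync_batchnorm(model) before training.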
