How can I self-train with 1 gpu? #38

Closed
Howie-Ye opened this issue Jul 21, 2023 · 4 comments
Comments

@Howie-Ye

Thank you for the cool work!
I have a question: how can I train on my own dataset with just 1 GPU? There is a "--num-gpus" option in the argument set, I only have 1 GPU, and my dataset is pretty small.

However, an error occurs when I launch the script:
python train_net.py --num-gpus 1 --config-file model_zoo/configs/CutLER-ImageNet/cascade_mask_rcnn_R_50_FPN_self_train.yaml --train-dataset imagenet_train_r1 OUTPUT_DIR ../model_output/self-train-r1/
Here is the relevant part of the log:

[07/21 15:40:26 d2.engine.train_loop]: Starting training from iteration 0
/root/autodl-tmp/project/CutLER/cutler/data/detection_utils.py:437: UserWarning: The given NumPy array is not writeable, and PyTorch does not support non-writeable tensors. This means you can write to the underlying (supposedly non-writeable) NumPy array using the tensor. You may want to copy the array to protect its data or make it writeable before converting it to a tensor. This type of warning will be suppressed for the rest of this program. (Triggered internally at /pytorch/torch/csrc/utils/tensor_numpy.cpp:143.)
torch.stack([torch.from_numpy(np.ascontiguousarray(x)) for x in masks])
ERROR [07/21 15:40:27 d2.engine.train_loop]: Exception during training:
Traceback (most recent call last):
File "/root/autodl-tmp/project/detectron2/detectron2/engine/train_loop.py", line 155, in train
self.run_step()
File "/root/autodl-tmp/project/CutLER/cutler/engine/defaults.py", line 505, in run_step
self._trainer.run_step()
File "/root/autodl-tmp/project/CutLER/cutler/engine/train_loop.py", line 335, in run_step
loss_dict = self.model(data)
File "/root/miniconda3/envs/cutler/lib/python3.8/site-packages/torch/nn/modules/module.py", line 889, in _call_impl
result = self.forward(*input, **kwargs)
File "/root/autodl-tmp/project/CutLER/cutler/modeling/meta_arch/rcnn.py", line 160, in forward
features = self.backbone(images.tensor)
File "/root/miniconda3/envs/cutler/lib/python3.8/site-packages/torch/nn/modules/module.py", line 889, in _call_impl
result = self.forward(*input, **kwargs)
File "/root/autodl-tmp/project/detectron2/detectron2/modeling/backbone/fpn.py", line 139, in forward
bottom_up_features = self.bottom_up(x)
File "/root/miniconda3/envs/cutler/lib/python3.8/site-packages/torch/nn/modules/module.py", line 889, in _call_impl
result = self.forward(*input, **kwargs)
File "/root/autodl-tmp/project/detectron2/detectron2/modeling/backbone/resnet.py", line 445, in forward
x = self.stem(x)
File "/root/miniconda3/envs/cutler/lib/python3.8/site-packages/torch/nn/modules/module.py", line 889, in _call_impl
result = self.forward(*input, **kwargs)
File "/root/autodl-tmp/project/detectron2/detectron2/modeling/backbone/resnet.py", line 356, in forward
x = self.conv1(x)
File "/root/miniconda3/envs/cutler/lib/python3.8/site-packages/torch/nn/modules/module.py", line 889, in _call_impl
result = self.forward(*input, **kwargs)
File "/root/autodl-tmp/project/detectron2/detectron2/layers/wrappers.py", line 131, in forward
x = self.norm(x)
File "/root/miniconda3/envs/cutler/lib/python3.8/site-packages/torch/nn/modules/module.py", line 889, in _call_impl
result = self.forward(*input, **kwargs)
File "/root/miniconda3/envs/cutler/lib/python3.8/site-packages/torch/nn/modules/batchnorm.py", line 532, in forward
world_size = torch.distributed.get_world_size(process_group)
File "/root/miniconda3/envs/cutler/lib/python3.8/site-packages/torch/distributed/distributed_c10d.py", line 711, in get_world_size
return _get_group_size(group)
File "/root/miniconda3/envs/cutler/lib/python3.8/site-packages/torch/distributed/distributed_c10d.py", line 263, in _get_group_size
default_pg = _get_default_group()
File "/root/miniconda3/envs/cutler/lib/python3.8/site-packages/torch/distributed/distributed_c10d.py", line 347, in _get_default_group
raise RuntimeError("Default process group has not been initialized, "
RuntimeError: Default process group has not been initialized, please make sure to call init_process_group.
[07/21 15:40:27 d2.engine.hooks]: Total training time: 0:00:01 (0:00:00 on hooks)
[07/21 15:40:27 d2.utils.events]: iter: 0 lr: N/A max_mem: 1098M
Traceback (most recent call last):
File "train_net.py", line 170, in
launch(
File "/root/autodl-tmp/project/detectron2/detectron2/engine/launch.py", line 84, in launch
main_func(*args)
File "train_net.py", line 160, in main
return trainer.train()
File "/root/autodl-tmp/project/CutLER/cutler/engine/defaults.py", line 495, in train
super().train(self.start_iter, self.max_iter)
File "/root/autodl-tmp/project/detectron2/detectron2/engine/train_loop.py", line 155, in train
self.run_step()
File "/root/autodl-tmp/project/CutLER/cutler/engine/defaults.py", line 505, in run_step
self._trainer.run_step()
File "/root/autodl-tmp/project/CutLER/cutler/engine/train_loop.py", line 335, in run_step
loss_dict = self.model(data)
File "/root/miniconda3/envs/cutler/lib/python3.8/site-packages/torch/nn/modules/module.py", line 889, in _call_impl
result = self.forward(*input, **kwargs)
File "/root/autodl-tmp/project/CutLER/cutler/modeling/meta_arch/rcnn.py", line 160, in forward
features = self.backbone(images.tensor)
File "/root/miniconda3/envs/cutler/lib/python3.8/site-packages/torch/nn/modules/module.py", line 889, in _call_impl
result = self.forward(*input, **kwargs)
File "/root/autodl-tmp/project/detectron2/detectron2/modeling/backbone/fpn.py", line 139, in forward
bottom_up_features = self.bottom_up(x)
File "/root/miniconda3/envs/cutler/lib/python3.8/site-packages/torch/nn/modules/module.py", line 889, in _call_impl
result = self.forward(*input, **kwargs)
File "/root/autodl-tmp/project/detectron2/detectron2/modeling/backbone/resnet.py", line 445, in forward
x = self.stem(x)
File "/root/miniconda3/envs/cutler/lib/python3.8/site-packages/torch/nn/modules/module.py", line 889, in _call_impl
result = self.forward(*input, **kwargs)
File "/root/autodl-tmp/project/detectron2/detectron2/modeling/backbone/resnet.py", line 356, in forward
x = self.conv1(x)
File "/root/miniconda3/envs/cutler/lib/python3.8/site-packages/torch/nn/modules/module.py", line 889, in _call_impl
result = self.forward(*input, **kwargs)
File "/root/autodl-tmp/project/detectron2/detectron2/layers/wrappers.py", line 131, in forward
x = self.norm(x)
File "/root/miniconda3/envs/cutler/lib/python3.8/site-packages/torch/nn/modules/module.py", line 889, in _call_impl
result = self.forward(*input, **kwargs)
File "/root/miniconda3/envs/cutler/lib/python3.8/site-packages/torch/nn/modules/batchnorm.py", line 532, in forward
world_size = torch.distributed.get_world_size(process_group)
File "/root/miniconda3/envs/cutler/lib/python3.8/site-packages/torch/distributed/distributed_c10d.py", line 711, in get_world_size
return _get_group_size(group)
File "/root/miniconda3/envs/cutler/lib/python3.8/site-packages/torch/distributed/distributed_c10d.py", line 263, in _get_group_size
default_pg = _get_default_group()
File "/root/miniconda3/envs/cutler/lib/python3.8/site-packages/torch/distributed/distributed_c10d.py", line 347, in _get_default_group
raise RuntimeError("Default process group has not been initialized, "
RuntimeError: Default process group has not been initialized, please make sure to call init_process_group.

It looks like this is related to DDP.
Thanks for your help!

@frank-xwang
Collaborator

Hello, considering that our model utilizes SyncBatchNorm, it is important to note that it cannot be used with a single worker on either CPU or GPU. To address this issue, I suggest referring to this link in the Detectron2 repository, where you may find a resolution for this problem.
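
For single-GPU training, the usual Detectron2-style workaround is to override the SyncBN normalization layers with plain BN from the command line. The exact keys depend on which NORM fields in the CutLER config are actually set to "SyncBN", so the command below is only a sketch, not a verified recipe:

python train_net.py --num-gpus 1 --config-file model_zoo/configs/CutLER-ImageNet/cascade_mask_rcnn_R_50_FPN_self_train.yaml --train-dataset imagenet_train_r1 MODEL.RESNETS.NORM BN MODEL.FPN.NORM BN MODEL.ROI_BOX_HEAD.NORM BN MODEL.ROI_MASK_HEAD.NORM BN OUTPUT_DIR ../model_output/self-train-r1/

Note that switching from SyncBN to per-GPU BN changes the effective batch statistics, so results may differ slightly from the multi-GPU setup.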

@Howie-Ye
Author

Hello, considering that our model utilizes SyncBatchNorm, it is important to note that it cannot be used with a single worker on either CPU or GPU. To address this issue, I suggest referring to this link in the Detectron2 repository, where you may find a resolution for this problem.

Thanks for your reply!
I've read the code and found these configs. In the end, using at least 2 GPUs resolved the issue.
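
For reference, the working launch was presumably just the original command with the GPU count raised, e.g.:

python train_net.py --num-gpus 2 --config-file model_zoo/configs/CutLER-ImageNet/cascade_mask_rcnn_R_50_FPN_self_train.yaml --train-dataset imagenet_train_r1 OUTPUT_DIR ../model_output/self-train-r1/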

@Howie-Ye
Author

Hello, considering that our model utilizes SyncBatchNorm, it is important to note that it cannot be used with a single worker on either CPU or GPU. To address this issue, I suggest referring to this link in the Detectron2 repository, where you may find a resolution for this problem.

Thanks for your reply! I've read the code and found these configs. In the end, using at least 2 GPUs resolved the issue.

BTW, I found that if I use python=3.8, torch=1.8.1, and cudatoolkit=11.3 with an RTX 30-series GPU, the CUDA error index >= -sizes[i] && index < sizes[i] && "index out of bounds" appears, even though it should already have been fixed according to this issue and this issue.
Confusing. So I had to switch to an RTX 20-series GPU with cudatoolkit=10.2, but then the batch size must be set smaller in the config file. I don't know why I am the only one hitting these problems. T_T
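
One thing worth ruling out here is a mismatch between the installed PyTorch build and the GPU architecture: RTX 30-series cards are compute capability 8.6 and need a CUDA 11.x build that includes sm_86 kernels. A quick, non-authoritative check:

python -c "import torch; print(torch.__version__, torch.version.cuda); print(torch.cuda.get_arch_list()); print(torch.cuda.get_device_capability(0))"

If sm_86 is missing from the reported arch list, trying a PyTorch build compiled with sm_86 support would be a reasonable next step.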

@frank-xwang
Collaborator

This seems to be a hardware-related issue; sorry for not being able to help much. Let me know if there are any code-related bugs.
