How can I self-train with 1 GPU? #38
Comments
Hello, since our model uses SyncBatchNorm, it cannot be run with a single worker on either CPU or GPU. To work around this, I suggest referring to this link in the Detectron2 repository, where you may find a resolution for this problem.
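If it helps, the usual single-device workaround is to replace the model's `nn.SyncBatchNorm` layers with plain `nn.BatchNorm2d` before training. A minimal sketch; the helper name `revert_sync_batchnorm` is my own and is not part of CutLER or Detectron2:

```python
import torch.nn as nn

def revert_sync_batchnorm(module: nn.Module) -> nn.Module:
    """Recursively replace nn.SyncBatchNorm with nn.BatchNorm2d, keeping stats."""
    mod = module
    if isinstance(module, nn.SyncBatchNorm):
        mod = nn.BatchNorm2d(
            module.num_features,
            eps=module.eps,
            momentum=module.momentum,
            affine=module.affine,
            track_running_stats=module.track_running_stats,
        )
        if module.affine:
            # Reuse the learned scale/shift parameters.
            mod.weight = module.weight
            mod.bias = module.bias
        mod.running_mean = module.running_mean
        mod.running_var = module.running_var
        mod.num_batches_tracked = module.num_batches_tracked
    # Recurse into children and swap any nested SyncBatchNorm layers too.
    for name, child in module.named_children():
        mod.add_module(name, revert_sync_batchnorm(child))
    return mod
```

You could apply this to the built model before training starts; alternatively, Detectron2 configs expose normalization keys such as `MODEL.RESNETS.NORM`, so switching the config value from `"SyncBN"` to `"BN"` may achieve the same thing. Treat both as suggestions to verify against your own config.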
Thanks for your reply!
BTW, I found that if I use
This bug seems to be a hardware-related issue; sorry for not being able to help much. Let me know if there are any code-related bugs.
Thank you for the cool work!
Now I have a question: how can I train on my own dataset with just 1 GPU?
There is a `--num-gpus` option in the argument set, I only have 1 GPU, and my dataset is pretty small.
But I hit an error when I launch the script:
python train_net.py --num-gpus 1 --config-file model_zoo/configs/CutLER-ImageNet/cascade_mask_rcnn_R_50_FPN_self_train.yaml --train-dataset imagenet_train_r1 OUTPUT_DIR ../model_output/self-train-r1/
It doesn't work; here is the relevant part of the log:
[07/21 15:40:26 d2.engine.train_loop]: Starting training from iteration 0
/root/autodl-tmp/project/CutLER/cutler/data/detection_utils.py:437: UserWarning: The given NumPy array is not writeable, and PyTorch does not support non-writeable tensors. This means you can write to the underlying (supposedly non-writeable) NumPy array using the tensor. You may want to copy the array to protect its data or make it writeable before converting it to a tensor. This type of warning will be suppressed for the rest of this program. (Triggered internally at /pytorch/torch/csrc/utils/tensor_numpy.cpp:143.)
torch.stack([torch.from_numpy(np.ascontiguousarray(x)) for x in masks])
/root/autodl-tmp/project/CutLER/cutler/data/detection_utils.py:437: UserWarning: The given NumPy array is not writeable, and PyTorch does not support non-writeable tensors. This means you can write to the underlying (supposedly non-writeable) NumPy array using the tensor. You may want to copy the array to protect its data or make it writeable before converting it to a tensor. This type of warning will be suppressed for the rest of this program. (Triggered internally at /pytorch/torch/csrc/utils/tensor_numpy.cpp:143.)
torch.stack([torch.from_numpy(np.ascontiguousarray(x)) for x in masks])
ERROR [07/21 15:40:27 d2.engine.train_loop]: Exception during training:
Traceback (most recent call last):
File "/root/autodl-tmp/project/detectron2/detectron2/engine/train_loop.py", line 155, in train
self.run_step()
File "/root/autodl-tmp/project/CutLER/cutler/engine/defaults.py", line 505, in run_step
self._trainer.run_step()
File "/root/autodl-tmp/project/CutLER/cutler/engine/train_loop.py", line 335, in run_step
loss_dict = self.model(data)
File "/root/miniconda3/envs/cutler/lib/python3.8/site-packages/torch/nn/modules/module.py", line 889, in _call_impl
result = self.forward(*input, **kwargs)
File "/root/autodl-tmp/project/CutLER/cutler/modeling/meta_arch/rcnn.py", line 160, in forward
features = self.backbone(images.tensor)
File "/root/miniconda3/envs/cutler/lib/python3.8/site-packages/torch/nn/modules/module.py", line 889, in _call_impl
result = self.forward(*input, **kwargs)
File "/root/autodl-tmp/project/detectron2/detectron2/modeling/backbone/fpn.py", line 139, in forward
bottom_up_features = self.bottom_up(x)
File "/root/miniconda3/envs/cutler/lib/python3.8/site-packages/torch/nn/modules/module.py", line 889, in _call_impl
result = self.forward(*input, **kwargs)
File "/root/autodl-tmp/project/detectron2/detectron2/modeling/backbone/resnet.py", line 445, in forward
x = self.stem(x)
File "/root/miniconda3/envs/cutler/lib/python3.8/site-packages/torch/nn/modules/module.py", line 889, in _call_impl
result = self.forward(*input, **kwargs)
File "/root/autodl-tmp/project/detectron2/detectron2/modeling/backbone/resnet.py", line 356, in forward
x = self.conv1(x)
File "/root/miniconda3/envs/cutler/lib/python3.8/site-packages/torch/nn/modules/module.py", line 889, in _call_impl
result = self.forward(*input, **kwargs)
File "/root/autodl-tmp/project/detectron2/detectron2/layers/wrappers.py", line 131, in forward
x = self.norm(x)
File "/root/miniconda3/envs/cutler/lib/python3.8/site-packages/torch/nn/modules/module.py", line 889, in _call_impl
result = self.forward(*input, **kwargs)
File "/root/miniconda3/envs/cutler/lib/python3.8/site-packages/torch/nn/modules/batchnorm.py", line 532, in forward
world_size = torch.distributed.get_world_size(process_group)
File "/root/miniconda3/envs/cutler/lib/python3.8/site-packages/torch/distributed/distributed_c10d.py", line 711, in get_world_size
return _get_group_size(group)
File "/root/miniconda3/envs/cutler/lib/python3.8/site-packages/torch/distributed/distributed_c10d.py", line 263, in _get_group_size
default_pg = _get_default_group()
File "/root/miniconda3/envs/cutler/lib/python3.8/site-packages/torch/distributed/distributed_c10d.py", line 347, in _get_default_group
raise RuntimeError("Default process group has not been initialized, "
RuntimeError: Default process group has not been initialized, please make sure to call init_process_group.
[07/21 15:40:27 d2.engine.hooks]: Total training time: 0:00:01 (0:00:00 on hooks)
[07/21 15:40:27 d2.utils.events]: iter: 0 lr: N/A max_mem: 1098M
Traceback (most recent call last):
File "train_net.py", line 170, in <module>
launch(
File "/root/autodl-tmp/project/detectron2/detectron2/engine/launch.py", line 84, in launch
main_func(*args)
File "train_net.py", line 160, in main
return trainer.train()
File "/root/autodl-tmp/project/CutLER/cutler/engine/defaults.py", line 495, in train
super().train(self.start_iter, self.max_iter)
File "/root/autodl-tmp/project/detectron2/detectron2/engine/train_loop.py", line 155, in train
self.run_step()
File "/root/autodl-tmp/project/CutLER/cutler/engine/defaults.py", line 505, in run_step
self._trainer.run_step()
File "/root/autodl-tmp/project/CutLER/cutler/engine/train_loop.py", line 335, in run_step
loss_dict = self.model(data)
File "/root/miniconda3/envs/cutler/lib/python3.8/site-packages/torch/nn/modules/module.py", line 889, in _call_impl
result = self.forward(*input, **kwargs)
File "/root/autodl-tmp/project/CutLER/cutler/modeling/meta_arch/rcnn.py", line 160, in forward
features = self.backbone(images.tensor)
File "/root/miniconda3/envs/cutler/lib/python3.8/site-packages/torch/nn/modules/module.py", line 889, in _call_impl
result = self.forward(*input, **kwargs)
File "/root/autodl-tmp/project/detectron2/detectron2/modeling/backbone/fpn.py", line 139, in forward
bottom_up_features = self.bottom_up(x)
File "/root/miniconda3/envs/cutler/lib/python3.8/site-packages/torch/nn/modules/module.py", line 889, in _call_impl
result = self.forward(*input, **kwargs)
File "/root/autodl-tmp/project/detectron2/detectron2/modeling/backbone/resnet.py", line 445, in forward
x = self.stem(x)
File "/root/miniconda3/envs/cutler/lib/python3.8/site-packages/torch/nn/modules/module.py", line 889, in _call_impl
result = self.forward(*input, **kwargs)
File "/root/autodl-tmp/project/detectron2/detectron2/modeling/backbone/resnet.py", line 356, in forward
x = self.conv1(x)
File "/root/miniconda3/envs/cutler/lib/python3.8/site-packages/torch/nn/modules/module.py", line 889, in _call_impl
result = self.forward(*input, **kwargs)
File "/root/autodl-tmp/project/detectron2/detectron2/layers/wrappers.py", line 131, in forward
x = self.norm(x)
File "/root/miniconda3/envs/cutler/lib/python3.8/site-packages/torch/nn/modules/module.py", line 889, in _call_impl
result = self.forward(*input, **kwargs)
File "/root/miniconda3/envs/cutler/lib/python3.8/site-packages/torch/nn/modules/batchnorm.py", line 532, in forward
world_size = torch.distributed.get_world_size(process_group)
File "/root/miniconda3/envs/cutler/lib/python3.8/site-packages/torch/distributed/distributed_c10d.py", line 711, in get_world_size
return _get_group_size(group)
File "/root/miniconda3/envs/cutler/lib/python3.8/site-packages/torch/distributed/distributed_c10d.py", line 263, in _get_group_size
default_pg = _get_default_group()
File "/root/miniconda3/envs/cutler/lib/python3.8/site-packages/torch/distributed/distributed_c10d.py", line 347, in _get_default_group
raise RuntimeError("Default process group has not been initialized, "
RuntimeError: Default process group has not been initialized, please make sure to call init_process_group.
It looks like it's related to DDP.
Thanks for your help!
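For what it's worth, the `RuntimeError` at the bottom of the trace can also be avoided by initializing a one-process default group before the model runs, so that SyncBatchNorm sees a world size of 1 and skips cross-process synchronization. A sketch under the assumption that the `gloo` backend is available; the helper name and port 29500 are arbitrary choices of mine:

```python
import os
import torch.distributed as dist

def init_single_process_group(port: int = 29500) -> None:
    """Initialize a 1-process default group so SyncBatchNorm can query world size."""
    if dist.is_initialized():
        return  # already set up, e.g. by a distributed launcher
    # env:// rendezvous reads these variables; harmless for a single local process.
    os.environ.setdefault("MASTER_ADDR", "127.0.0.1")
    os.environ.setdefault("MASTER_PORT", str(port))
    dist.init_process_group(backend="gloo", rank=0, world_size=1)

init_single_process_group()
```

With `world_size == 1`, PyTorch's SyncBatchNorm falls back to ordinary batch-norm behavior, so the `gloo` backend is sufficient here even though multi-GPU synchronization would normally require `nccl`. This is a workaround sketch, not something the CutLER codebase does itself, so verify it against your setup.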