How can I self-train with 1 gpu? #38

Closed
Howie-Ye opened this issue Jul 21, 2023 · 4 comments
Comments

@Howie-Ye

Thank you for the cool work!
I have a question: how can I train on my own dataset with just 1 GPU? There is a "--num-gpus" option in the argument set, I only have 1 GPU, and my dataset is pretty small.

However, an error occurs when I launch the script:
python train_net.py --num-gpus 1 --config-file model_zoo/configs/CutLER-ImageNet/cascade_mask_rcnn_R_50_FPN_self_train.yaml --train-dataset imagenet_train_r1 OUTPUT_DIR ../model_output/self-train-r1/
Here is the relevant part of the log:

[07/21 15:40:26 d2.engine.train_loop]: Starting training from iteration 0
/root/autodl-tmp/project/CutLER/cutler/data/detection_utils.py:437: UserWarning: The given NumPy array is not writeable, and PyTorch does not support non-writeable tensors. This means you can write to the underlying (supposedly non-writeable) NumPy array using the tensor. You may want to copy the array to protect its data or make it writeable before converting it to a tensor. This type of warning will be suppressed for the rest of this program. (Triggered internally at /pytorch/torch/csrc/utils/tensor_numpy.cpp:143.)
torch.stack([torch.from_numpy(np.ascontiguousarray(x)) for x in masks])
ERROR [07/21 15:40:27 d2.engine.train_loop]: Exception during training:
Traceback (most recent call last):
File "/root/autodl-tmp/project/detectron2/detectron2/engine/train_loop.py", line 155, in train
self.run_step()
File "/root/autodl-tmp/project/CutLER/cutler/engine/defaults.py", line 505, in run_step
self._trainer.run_step()
File "/root/autodl-tmp/project/CutLER/cutler/engine/train_loop.py", line 335, in run_step
loss_dict = self.model(data)
File "/root/miniconda3/envs/cutler/lib/python3.8/site-packages/torch/nn/modules/module.py", line 889, in _call_impl
result = self.forward(*input, **kwargs)
File "/root/autodl-tmp/project/CutLER/cutler/modeling/meta_arch/rcnn.py", line 160, in forward
features = self.backbone(images.tensor)
File "/root/miniconda3/envs/cutler/lib/python3.8/site-packages/torch/nn/modules/module.py", line 889, in _call_impl
result = self.forward(*input, **kwargs)
File "/root/autodl-tmp/project/detectron2/detectron2/modeling/backbone/fpn.py", line 139, in forward
bottom_up_features = self.bottom_up(x)
File "/root/miniconda3/envs/cutler/lib/python3.8/site-packages/torch/nn/modules/module.py", line 889, in _call_impl
result = self.forward(*input, **kwargs)
File "/root/autodl-tmp/project/detectron2/detectron2/modeling/backbone/resnet.py", line 445, in forward
x = self.stem(x)
File "/root/miniconda3/envs/cutler/lib/python3.8/site-packages/torch/nn/modules/module.py", line 889, in _call_impl
result = self.forward(*input, **kwargs)
File "/root/autodl-tmp/project/detectron2/detectron2/modeling/backbone/resnet.py", line 356, in forward
x = self.conv1(x)
File "/root/miniconda3/envs/cutler/lib/python3.8/site-packages/torch/nn/modules/module.py", line 889, in _call_impl
result = self.forward(*input, **kwargs)
File "/root/autodl-tmp/project/detectron2/detectron2/layers/wrappers.py", line 131, in forward
x = self.norm(x)
File "/root/miniconda3/envs/cutler/lib/python3.8/site-packages/torch/nn/modules/module.py", line 889, in _call_impl
result = self.forward(*input, **kwargs)
File "/root/miniconda3/envs/cutler/lib/python3.8/site-packages/torch/nn/modules/batchnorm.py", line 532, in forward
world_size = torch.distributed.get_world_size(process_group)
File "/root/miniconda3/envs/cutler/lib/python3.8/site-packages/torch/distributed/distributed_c10d.py", line 711, in get_world_size
return _get_group_size(group)
File "/root/miniconda3/envs/cutler/lib/python3.8/site-packages/torch/distributed/distributed_c10d.py", line 263, in _get_group_size
default_pg = _get_default_group()
File "/root/miniconda3/envs/cutler/lib/python3.8/site-packages/torch/distributed/distributed_c10d.py", line 347, in _get_default_group
raise RuntimeError("Default process group has not been initialized, "
RuntimeError: Default process group has not been initialized, please make sure to call init_process_group.
[07/21 15:40:27 d2.engine.hooks]: Total training time: 0:00:01 (0:00:00 on hooks)
[07/21 15:40:27 d2.utils.events]: iter: 0 lr: N/A max_mem: 1098M
Traceback (most recent call last):
File "train_net.py", line 170, in
launch(
File "/root/autodl-tmp/project/detectron2/detectron2/engine/launch.py", line 84, in launch
main_func(*args)
File "train_net.py", line 160, in main
return trainer.train()
File "/root/autodl-tmp/project/CutLER/cutler/engine/defaults.py", line 495, in train
super().train(self.start_iter, self.max_iter)
File "/root/autodl-tmp/project/detectron2/detectron2/engine/train_loop.py", line 155, in train
self.run_step()
File "/root/autodl-tmp/project/CutLER/cutler/engine/defaults.py", line 505, in run_step
self._trainer.run_step()
File "/root/autodl-tmp/project/CutLER/cutler/engine/train_loop.py", line 335, in run_step
loss_dict = self.model(data)
File "/root/miniconda3/envs/cutler/lib/python3.8/site-packages/torch/nn/modules/module.py", line 889, in _call_impl
result = self.forward(*input, **kwargs)
File "/root/autodl-tmp/project/CutLER/cutler/modeling/meta_arch/rcnn.py", line 160, in forward
features = self.backbone(images.tensor)
File "/root/miniconda3/envs/cutler/lib/python3.8/site-packages/torch/nn/modules/module.py", line 889, in _call_impl
result = self.forward(*input, **kwargs)
File "/root/autodl-tmp/project/detectron2/detectron2/modeling/backbone/fpn.py", line 139, in forward
bottom_up_features = self.bottom_up(x)
File "/root/miniconda3/envs/cutler/lib/python3.8/site-packages/torch/nn/modules/module.py", line 889, in _call_impl
result = self.forward(*input, **kwargs)
File "/root/autodl-tmp/project/detectron2/detectron2/modeling/backbone/resnet.py", line 445, in forward
x = self.stem(x)
File "/root/miniconda3/envs/cutler/lib/python3.8/site-packages/torch/nn/modules/module.py", line 889, in _call_impl
result = self.forward(*input, **kwargs)
File "/root/autodl-tmp/project/detectron2/detectron2/modeling/backbone/resnet.py", line 356, in forward
x = self.conv1(x)
File "/root/miniconda3/envs/cutler/lib/python3.8/site-packages/torch/nn/modules/module.py", line 889, in _call_impl
result = self.forward(*input, **kwargs)
File "/root/autodl-tmp/project/detectron2/detectron2/layers/wrappers.py", line 131, in forward
x = self.norm(x)
File "/root/miniconda3/envs/cutler/lib/python3.8/site-packages/torch/nn/modules/module.py", line 889, in _call_impl
result = self.forward(*input, **kwargs)
File "/root/miniconda3/envs/cutler/lib/python3.8/site-packages/torch/nn/modules/batchnorm.py", line 532, in forward
world_size = torch.distributed.get_world_size(process_group)
File "/root/miniconda3/envs/cutler/lib/python3.8/site-packages/torch/distributed/distributed_c10d.py", line 711, in get_world_size
return _get_group_size(group)
File "/root/miniconda3/envs/cutler/lib/python3.8/site-packages/torch/distributed/distributed_c10d.py", line 263, in _get_group_size
default_pg = _get_default_group()
File "/root/miniconda3/envs/cutler/lib/python3.8/site-packages/torch/distributed/distributed_c10d.py", line 347, in _get_default_group
raise RuntimeError("Default process group has not been initialized, "
RuntimeError: Default process group has not been initialized, please make sure to call init_process_group.

It looks like this is related to DDP.
Thanks for your help!

@frank-xwang
Collaborator

Hello, considering that our model utilizes SyncBatchNorm, it is important to note that it cannot be used with a single worker on either CPU or GPU. To address this issue, I suggest referring to this link in the Detectron2 repository, where you may find a resolution for this problem.
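
For single-GPU training, the usual Detectron2-style workaround is to override the SyncBN normalization layers with plain BN from the command line. The exact keys depend on which NORM fields in the CutLER config are actually set to "SyncBN", so the command below is only a sketch, not a verified recipe:

python train_net.py --num-gpus 1 --config-file model_zoo/configs/CutLER-ImageNet/cascade_mask_rcnn_R_50_FPN_self_train.yaml --train-dataset imagenet_train_r1 MODEL.RESNETS.NORM BN MODEL.FPN.NORM BN MODEL.ROI_BOX_HEAD.NORM BN MODEL.ROI_MASK_HEAD.NORM BN OUTPUT_DIR ../model_output/self-train-r1/

Note that switching from SyncBN to per-GPU BN changes the effective batch statistics, so results may differ slightly from the multi-GPU setup.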

@Howie-Ye
Author

Hello, considering that our model utilizes SyncBatchNorm, it is important to note that it cannot be used with a single worker on either CPU or GPU. To address this issue, I suggest referring to this link in the Detectron2 repository, where you may find a resolution for this problem.

Thanks for your reply!
I've read the code and found these configs. In the end, using at least 2 GPUs resolved the issue.
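
For reference, the working launch was presumably just the original command with the GPU count raised, e.g.:

python train_net.py --num-gpus 2 --config-file model_zoo/configs/CutLER-ImageNet/cascade_mask_rcnn_R_50_FPN_self_train.yaml --train-dataset imagenet_train_r1 OUTPUT_DIR ../model_output/self-train-r1/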

@Howie-Ye
Author

Hello, considering that our model utilizes SyncBatchNorm, it is important to note that it cannot be used with a single worker on either CPU or GPU. To address this issue, I suggest referring to this link in the Detectron2 repository, where you may find a resolution for this problem.

Thanks for your reply! I've read the code and found these configs. In the end, using at least 2 GPUs resolved the issue.

BTW, I found that if I use python=3.8, torch=1.8.1, and cudatoolkit=11.3 with an RTX 30-series GPU, the CUDA error index >= -sizes[i] && index < sizes[i] && "index out of bounds" appears, even though it should already have been fixed according to this issue and this issue.
Confusing. So I had to switch to an RTX 20-series GPU with cudatoolkit=10.2, but then the batch size must be set smaller in the config file. I don't know why I am the only one hitting these problems. T_T
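
One thing worth ruling out here is a mismatch between the installed PyTorch build and the GPU architecture: RTX 30-series cards are compute capability 8.6 and need a CUDA 11.x build that includes sm_86 kernels. A quick, non-authoritative check:

python -c "import torch; print(torch.__version__, torch.version.cuda); print(torch.cuda.get_arch_list()); print(torch.cuda.get_device_capability(0))"

If sm_86 is missing from the reported arch list, trying a PyTorch build compiled with sm_86 support would be a reasonable next step.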

@frank-xwang
Collaborator

This seems to be a hardware-related issue; sorry for not being able to help much. Let me know if there are any code-related bugs.
