This repository has been archived by the owner on Aug 29, 2023. It is now read-only.

Unable to train the model #69

Open
kagawa588 opened this issue Aug 21, 2022 · 0 comments

Comments

@kagawa588

Hi,

Thanks for your great work! I recently tried to train the model myself, but moving the model from the CPU to the GPU took a very long time (about an hour), and then training failed. Could you please give me any suggestions? Did I do something wrong?

Thanks in advance!

My environment is below:

sys.platform            linux
Python                  3.7.0 (default, Oct 9 2018, 10:31:47) [GCC 7.3.0]
numpy                   1.21.5
detectron2              0.6 @/home/mu/anaconda3/envs/maskformer/lib/python3.7/site-packages/detectron2
Compiler                GCC 7.3
CUDA compiler           CUDA 10.2
detectron2 arch flags   3.7, 5.0, 5.2, 6.0, 6.1, 7.0, 7.5
DETECTRON2_ENV_MODULE
PyTorch                 1.8.2 @/home/mu/anaconda3/envs/maskformer/lib/python3.7/site-packages/torch
PyTorch debug build     False
GPU available           Yes
GPU 0                   NVIDIA GeForce RTX 3080 Laptop GPU (arch=8.6)
Driver version          510.60.02
CUDA_HOME               /usr/local/cuda
Pillow                  9.2.0
torchvision             0.9.2 @/home/mu/anaconda3/envs/maskformer/lib/python3.7/site-packages/torchvision
torchvision arch flags  3.5, 5.0, 6.0, 7.0, 7.5
fvcore                  0.1.5.post20220512
iopath                  0.1.9
cv2                     4.6.0
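
One detail in the listing above may be worth checking: GPU 0 reports arch=8.6, while both arch-flag lists stop at 7.5. The sketch below compares the two, assuming the simplified rule that a precompiled kernel runs natively only on a device with the same major compute capability and an equal or higher minor version. The helper names are hypothetical and not part of detectron2 or PyTorch, and the rule ignores embedded PTX. If only PTX is usable, the driver has to JIT-compile kernels at load time, which is one possible (unconfirmed) explanation for the long startup.

```python
def parse_arch_flags(flags):
    """Turn a string such as '3.7, 5.0, 7.5' into [(3, 7), (5, 0), (7, 5)]."""
    archs = []
    for item in flags.split(","):
        major, minor = item.strip().split(".")
        archs.append((int(major), int(minor)))
    return archs


def has_native_kernels(device_arch, flags):
    """Assumed rule: same major version, compiled minor <= device minor."""
    major, minor = device_arch
    return any(m == major and n <= minor for m, n in parse_arch_flags(flags))


# Values copied from the environment listing above.
D2_FLAGS = "3.7, 5.0, 5.2, 6.0, 6.1, 7.0, 7.5"   # detectron2 arch flags
TV_FLAGS = "3.5, 5.0, 6.0, 7.0, 7.5"             # torchvision arch flags
GPU_ARCH = (8, 6)                                # RTX 3080 Laptop GPU

print(has_native_kernels(GPU_ARCH, D2_FLAGS))  # False
print(has_native_kernels(GPU_ARCH, TV_FLAGS))  # False
print(has_native_kernels((7, 5), D2_FLAGS))    # True
```

On a live machine the same inputs can be read with `torch.cuda.get_device_capability(0)` and `torch.cuda.get_arch_list()`; the latter returns strings such as `sm_75`, so it would need its own parsing.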


The error is below:

res4.9.conv3.norm.num_batches_tracked
res5.0.conv1.norm.num_batches_tracked
res5.0.conv2.norm.num_batches_tracked
res5.0.conv3.norm.num_batches_tracked
res5.0.shortcut.norm.num_batches_tracked
res5.1.conv1.norm.num_batches_tracked
res5.1.conv2.norm.num_batches_tracked
res5.1.conv3.norm.num_batches_tracked
res5.2.conv1.norm.num_batches_tracked
res5.2.conv2.norm.num_batches_tracked
res5.2.conv3.norm.num_batches_tracked
stem.conv1.norm.num_batches_tracked
stem.conv2.norm.num_batches_tracked
stem.conv3.norm.num_batches_tracked
stem.fc.{bias, weight}
[08/21 20:18:39 d2.engine.train_loop]: Starting training from iteration 0
ERROR [08/21 20:20:24 d2.engine.train_loop]: Exception during training:
Traceback (most recent call last):
  File "/cloud/maskformer/lib/python3.7/site-packages/detectron2/engine/train_loop.py", line 149, in train
    self.run_step()
  File "/cloud/maskformer/lib/python3.7/site-packages/detectron2/engine/defaults.py", line 494, in run_step
    self._trainer.run_step()
  File "/cloud/maskformer/lib/python3.7/site-packages/detectron2/engine/train_loop.py", line 285, in run_step
    losses.backward()
  File "/cloud/maskformer/lib/python3.7/site-packages/torch/tensor.py", line 245, in backward
    torch.autograd.backward(self, gradient, retain_graph, create_graph, inputs=inputs)
  File "/cloud/maskformer/lib/python3.7/site-packages/torch/autograd/__init__.py", line 147, in backward
    allow_unreachable=True, accumulate_grad=True)  # allow_unreachable flag
RuntimeError: Unable to find a valid cuDNN algorithm to run convolution
[08/21 20:20:24 d2.engine.hooks]: Total training time: 0:01:45 (0:00:00 on hooks)
[08/21 20:20:24 d2.utils.events]: iter: 0 lr: N/A max_mem: 5604M
Traceback (most recent call last):
  File "train_net.py", line 270, in <module>
    args=(args,),
  File "/cloud/maskformer/lib/python3.7/site-packages/detectron2/engine/launch.py", line 82, in launch
    main_func(*args)
  File "train_net.py", line 258, in main
    return trainer.train()
  File "/cloud/maskformer/lib/python3.7/site-packages/detectron2/engine/defaults.py", line 484, in train
    super().train(self.start_iter, self.max_iter)
  File "/cloud/maskformer/lib/python3.7/site-packages/detectron2/engine/train_loop.py", line 149, in train
    self.run_step()
  File "/cloud/maskformer/lib/python3.7/site-packages/detectron2/engine/defaults.py", line 494, in run_step
    self._trainer.run_step()
  File "/cloud/maskformer/lib/python3.7/site-packages/detectron2/engine/train_loop.py", line 285, in run_step
    losses.backward()
  File "/cloud/maskformer/lib/python3.7/site-packages/torch/tensor.py", line 245, in backward
    torch.autograd.backward(self, gradient, retain_graph, create_graph, inputs=inputs)
  File "/cloud/maskformer/lib/python3.7/site-packages/torch/autograd/__init__.py", line 147, in backward
    allow_unreachable=True, accumulate_grad=True)  # allow_unreachable flag
RuntimeError: Unable to find a valid cuDNN algorithm to run convolution
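
To narrow down the error above, it may help to run a tiny convolution forward and backward pass outside of train_net.py. This is only a sketch: the function name `cudnn_conv_smoke_test` and the tensor sizes are hypothetical choices, not part of detectron2 or MaskFormer. If this small case already fails, the problem is probably in the PyTorch/CUDA/cuDNN setup rather than in the model. If it passes, the failure may depend on the model's size or memory use (the log reports max_mem: 5604M at iteration 0).

```python
def cudnn_conv_smoke_test(channels=8, size=32):
    """Run one small Conv2d forward/backward pass on the GPU.

    Returns "ok", "no-torch", "no-cuda", or "failed: <message>".
    """
    try:
        import torch
    except ImportError:
        return "no-torch"
    if not torch.cuda.is_available():
        return "no-cuda"
    try:
        conv = torch.nn.Conv2d(3, channels, kernel_size=3, padding=1).cuda()
        x = torch.randn(1, 3, size, size, device="cuda", requires_grad=True)
        # The traceback above fails inside backward(), so exercise it too.
        conv(x).sum().backward()
        # CUDA calls are asynchronous; wait so errors surface inside the try.
        torch.cuda.synchronize()
    except RuntimeError as exc:
        return "failed: " + str(exc)
    return "ok"


print(cudnn_conv_smoke_test())
```

The function reports its result as a string instead of raising, so the same script can be run unchanged on machines with or without a GPU.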
