When training an LVIS model out of the box on a single GPU, training starts and then immediately fails with the error RuntimeError: CUDA error: invalid device function (ROIAlign_forward_cuda at /pasteur/u/jwhughes/detectron2/detectron2/layers/csrc/ROIAlign/ROIAlign_cuda.cu:359). I suspect this is not an issue with my system: I have reproduced the same error on two different clusters, and I believe I am following the installation instructions exactly.
To Reproduce
I made no changes to the code. I ran the command: python tools/train_net.py --num-gpus 1 --config-file configs/LVIS-InstanceSegmentation/mask_rcnn_R_101_FPN_1x.yaml SOLVER.IMS_PER_BATCH 8 OUTPUT_DIR ./output/test (the error persists when the batch size is set to 1).
I get the following error:
[10/14 13:15:26 d2.engine.train_loop]: Starting training from iteration 0
[10/14 13:15:28 d2.engine.hooks]: Total training time: 0:00:01 (0:00:00 on hooks)
Traceback (most recent call last):
File "tools/train_net.py", line 161, in <module>
args=(args,),
File "/sailhome/jwhughes/miniconda3/envs/pytorch/lib/python3.7/site-packages/detectron2/engine/launch.py", line 52, in launch
main_func(*args)
File "tools/train_net.py", line 149, in main
return trainer.train()
File "/sailhome/jwhughes/miniconda3/envs/pytorch/lib/python3.7/site-packages/detectron2/engine/defaults.py", line 329, in train
super().train(self.start_iter, self.max_iter)
File "/sailhome/jwhughes/miniconda3/envs/pytorch/lib/python3.7/site-packages/detectron2/engine/train_loop.py", line 132, in train
self.run_step()
File "/sailhome/jwhughes/miniconda3/envs/pytorch/lib/python3.7/site-packages/detectron2/engine/train_loop.py", line 212, in run_step
loss_dict = self.model(data)
File "/sailhome/jwhughes/miniconda3/envs/pytorch/lib/python3.7/site-packages/torch/nn/modules/module.py", line 541, in __call__
result = self.forward(*input, **kwargs)
File "/sailhome/jwhughes/miniconda3/envs/pytorch/lib/python3.7/site-packages/detectron2/modeling/meta_arch/rcnn.py", line 88, in forward
_, detector_losses = self.roi_heads(images, features, proposals, gt_instances)
File "/sailhome/jwhughes/miniconda3/envs/pytorch/lib/python3.7/site-packages/torch/nn/modules/module.py", line 541, in __call__
result = self.forward(*input, **kwargs)
File "/sailhome/jwhughes/miniconda3/envs/pytorch/lib/python3.7/site-packages/detectron2/modeling/roi_heads/roi_heads.py", line 535, in forward
losses = self._forward_box(features_list, proposals)
File "/sailhome/jwhughes/miniconda3/envs/pytorch/lib/python3.7/site-packages/detectron2/modeling/roi_heads/roi_heads.py", line 589, in _forward_box
box_features = self.box_pooler(features, [x.proposal_boxes for x in proposals])
File "/sailhome/jwhughes/miniconda3/envs/pytorch/lib/python3.7/site-packages/torch/nn/modules/module.py", line 541, in __call__
result = self.forward(*input, **kwargs)
File "/sailhome/jwhughes/miniconda3/envs/pytorch/lib/python3.7/site-packages/detectron2/modeling/poolers.py", line 195, in forward
output[inds] = pooler(x_level, pooler_fmt_boxes_level)
File "/sailhome/jwhughes/miniconda3/envs/pytorch/lib/python3.7/site-packages/torch/nn/modules/module.py", line 541, in __call__
result = self.forward(*input, **kwargs)
File "/sailhome/jwhughes/miniconda3/envs/pytorch/lib/python3.7/site-packages/detectron2/layers/roi_align.py", line 95, in forward
input, rois, self.output_size, self.spatial_scale, self.sampling_ratio, self.aligned
File "/sailhome/jwhughes/miniconda3/envs/pytorch/lib/python3.7/site-packages/detectron2/layers/roi_align.py", line 20, in forward
input, roi, spatial_scale, output_size[0], output_size[1], sampling_ratio, aligned
RuntimeError: CUDA error: invalid device function (ROIAlign_forward_cuda at /pasteur/u/jwhughes/detectron2/detectron2/layers/csrc/ROIAlign/ROIAlign_cuda.cu:359)
frame #0: c10::Error::Error(c10::SourceLocation, std::string const&) + 0x47 (0x7f6a9111d687 in /sailhome/jwhughes/miniconda3/envs/pytorch/lib/python3.7/site-packages/torch/lib/libc$0.so)
frame #1: ROIAlign_forward_cuda(at::Tensor const&, at::Tensor const&, float, int, int, int, bool) + 0xa37 (0x7f6a6e389ff3 in /sailhome/jwhughes/miniconda3/envs/pytorch/lib/python3.$/site-packages/detectron2/_C.cpython-37m-x86_64-linux-gnu.so)
frame #2: ROIAlign_forward(at::Tensor const&, at::Tensor const&, float, int, int, int, bool) + 0xbc (0x7f6a6e33201c in /sailhome/jwhughes/miniconda3/envs/pytorch/lib/python3.7/site$packages/detectron2/_C.cpython-37m-x86_64-linux-gnu.so)
frame #3: <unknown function> + 0x5967a (0x7f6a6e34367a in /sailhome/jwhughes/miniconda3/envs/pytorch/lib/python3.7/site-packages/detectron2/_C.cpython-37m-x86_64-linux-gnu.so)
frame #4: <unknown function> + 0x5977e (0x7f6a6e34377e in /sailhome/jwhughes/miniconda3/envs/pytorch/lib/python3.7/site-packages/detectron2/_C.cpython-37m-x86_64-linux-gnu.so)
frame #5: <unknown function> + 0x53d00 (0x7f6a6e33dd00 in /sailhome/jwhughes/miniconda3/envs/pytorch/lib/python3.7/site-packages/detectron2/_C.cpython-37m-x86_64-linux-gnu.so)
<omitting python frames>
frame #10: THPFunction_apply(_object*, _object*) + 0x8d6 (0x7f6ac514ae96 in /sailhome/jwhughes/miniconda3/envs/pytorch/lib/python3.7/site-packages/torch/lib/libtorch_python.so)
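For anyone hitting the same traceback: the CUDA runtime raises "invalid device function" when a kernel is missing or was not compiled for the architecture of the GPU it runs on, so one thing worth ruling out is a mismatch between the GPU's compute capability and the architectures the detectron2 extension was built for. The report above does not confirm that this is the cause here. The sketch below is only a diagnostic aid: the helper names `parse_arch_list` and `is_covered` are made up for illustration, and it assumes the extension was built with the `TORCH_CUDA_ARCH_LIST` value that is set in the current shell, which may not be true on your cluster.

```python
import os


def parse_arch_list(raw):
    """Parse a TORCH_CUDA_ARCH_LIST-style string, e.g. "6.0;7.0+PTX".

    Returns a list of (major, minor, has_ptx) tuples. Named entries such as
    "Pascal" are skipped, because this sketch only understands numeric ones.
    """
    archs = []
    for item in raw.replace(";", " ").split():
        has_ptx = item.endswith("+PTX")
        number = item.replace("+PTX", "")
        try:
            major, minor = number.split(".")
            archs.append((int(major), int(minor), has_ptx))
        except ValueError:
            continue  # not a numeric "X.Y" entry
    return archs


def is_covered(capability, archs):
    """Check whether a device capability (major, minor) can run the build.

    Binary code built for X.y runs on devices X.z with z >= y; an entry with
    +PTX can additionally be JIT-compiled for any newer device.
    """
    dev_major, dev_minor = capability
    for major, minor, has_ptx in archs:
        if major == dev_major and minor <= dev_minor:
            return True
        if has_ptx and (major, minor) <= (dev_major, dev_minor):
            return True
    return False


def main():
    try:
        import torch
    except ImportError:
        print("PyTorch is not installed in this environment")
        return
    print("torch", torch.__version__, "built with CUDA", torch.version.cuda)
    if not torch.cuda.is_available():
        print("No CUDA device is visible to PyTorch")
        return
    capability = torch.cuda.get_device_capability(0)
    print("GPU 0:", torch.cuda.get_device_name(0), "capability", capability)
    # Assumption: this variable held the same value when detectron2 was built.
    raw = os.environ.get("TORCH_CUDA_ARCH_LIST", "")
    if not raw:
        print("TORCH_CUDA_ARCH_LIST is not set; cannot compare architectures")
        return
    print("covered by build list:", is_covered(capability, parse_arch_list(raw)))


if __name__ == "__main__":
    main()
```

If the check reports that the GPU is not covered, or the extension was built on a node with a different GPU than the one used for training, rebuilding detectron2 on the target machine is a reasonable next step to try.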
Environment