
LVIS Training fails to find ROIAlign_forward_cuda #63

Closed
weston100 opened this issue Oct 14, 2019 · 2 comments
Labels: duplicate (This issue or pull request already exists)

@weston100

When training an LVIS model out of the box on a single GPU, training begins and then immediately fails with the error RuntimeError: CUDA error: invalid device function (ROIAlign_forward_cuda at /pasteur/u/jwhughes/detectron2/detectron2/layers/csrc/ROIAlign/ROIAlign_cuda.cu:359). I suspect this is not an issue with my system: I have reproduced the same error on two different clusters, and I believe I am following the installation instructions exactly.
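
Note: a CUDA "invalid device function" error at kernel launch usually means the compiled extension does not contain code for the running GPU's architecture. A minimal diagnostic sketch, using only standard PyTorch calls, to capture the relevant versions:

import torch

# "invalid device function" typically indicates an architecture mismatch
# between the compiled CUDA kernels and the GPU they are launched on.
print("PyTorch:", torch.__version__)
print("CUDA runtime PyTorch was built against:", torch.version.cuda)
major, minor = torch.cuda.get_device_capability(0)
# A TITAN RTX (see the environment table below) is compute capability 7.5,
# so the extension must be compiled with sm_75 support.
print(f"GPU 0 compute capability: sm_{major}{minor}")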

To Reproduce

I made no changes to the code. I ran the command: python tools/train_net.py --num-gpus 1 --config-file configs/LVIS-InstanceSegmentation/mask_rcnn_R_101_FPN_1x.yaml SOLVER.IMS_PER_BATCH 8 OUTPUT_DIR ./output/test. (The error persists when the batch size is set to 1.)

I get the following error:

[10/14 13:15:26 d2.engine.train_loop]: Starting training from iteration 0
[10/14 13:15:28 d2.engine.hooks]: Total training time: 0:00:01 (0:00:00 on hooks)
Traceback (most recent call last):
  File "tools/train_net.py", line 161, in <module>
    args=(args,),
  File "/sailhome/jwhughes/miniconda3/envs/pytorch/lib/python3.7/site-packages/detectron2/engine/launch.py", line 52, in launch
    main_func(*args)
  File "tools/train_net.py", line 149, in main
    return trainer.train()
  File "/sailhome/jwhughes/miniconda3/envs/pytorch/lib/python3.7/site-packages/detectron2/engine/defaults.py", line 329, in train
    super().train(self.start_iter, self.max_iter)
  File "/sailhome/jwhughes/miniconda3/envs/pytorch/lib/python3.7/site-packages/detectron2/engine/train_loop.py", line 132, in train
    self.run_step()
  File "/sailhome/jwhughes/miniconda3/envs/pytorch/lib/python3.7/site-packages/detectron2/engine/train_loop.py", line 212, in run_step
    loss_dict = self.model(data)
  File "/sailhome/jwhughes/miniconda3/envs/pytorch/lib/python3.7/site-packages/torch/nn/modules/module.py", line 541, in __call__
    result = self.forward(*input, **kwargs)
  File "/sailhome/jwhughes/miniconda3/envs/pytorch/lib/python3.7/site-packages/detectron2/modeling/meta_arch/rcnn.py", line 88, in forward
    _, detector_losses = self.roi_heads(images, features, proposals, gt_instances)
  File "/sailhome/jwhughes/miniconda3/envs/pytorch/lib/python3.7/site-packages/torch/nn/modules/module.py", line 541, in __call__
    result = self.forward(*input, **kwargs)
  File "/sailhome/jwhughes/miniconda3/envs/pytorch/lib/python3.7/site-packages/detectron2/modeling/roi_heads/roi_heads.py", line 535, in forward
    losses = self._forward_box(features_list, proposals)
  File "/sailhome/jwhughes/miniconda3/envs/pytorch/lib/python3.7/site-packages/detectron2/modeling/roi_heads/roi_heads.py", line 589, in _forward_box
    box_features = self.box_pooler(features, [x.proposal_boxes for x in proposals])
  File "/sailhome/jwhughes/miniconda3/envs/pytorch/lib/python3.7/site-packages/torch/nn/modules/module.py", line 541, in __call__
    result = self.forward(*input, **kwargs)
  File "/sailhome/jwhughes/miniconda3/envs/pytorch/lib/python3.7/site-packages/detectron2/modeling/poolers.py", line 195, in forward
    output[inds] = pooler(x_level, pooler_fmt_boxes_level)
  File "/sailhome/jwhughes/miniconda3/envs/pytorch/lib/python3.7/site-packages/torch/nn/modules/module.py", line 541, in __call__
    result = self.forward(*input, **kwargs)
  File "/sailhome/jwhughes/miniconda3/envs/pytorch/lib/python3.7/site-packages/detectron2/layers/roi_align.py", line 95, in forward
    input, rois, self.output_size, self.spatial_scale, self.sampling_ratio, self.aligned
  File "/sailhome/jwhughes/miniconda3/envs/pytorch/lib/python3.7/site-packages/detectron2/layers/roi_align.py", line 20, in forward
    input, roi, spatial_scale, output_size[0], output_size[1], sampling_ratio, aligned
RuntimeError: CUDA error: invalid device function (ROIAlign_forward_cuda at /pasteur/u/jwhughes/detectron2/detectron2/layers/csrc/ROIAlign/ROIAlign_cuda.cu:359)
frame #0: c10::Error::Error(c10::SourceLocation, std::string const&) + 0x47 (0x7f6a9111d687 in /sailhome/jwhughes/miniconda3/envs/pytorch/lib/python3.7/site-packages/torch/lib/libc10.so)
frame #1: ROIAlign_forward_cuda(at::Tensor const&, at::Tensor const&, float, int, int, int, bool) + 0xa37 (0x7f6a6e389ff3 in /sailhome/jwhughes/miniconda3/envs/pytorch/lib/python3.7/site-packages/detectron2/_C.cpython-37m-x86_64-linux-gnu.so)
frame #2: ROIAlign_forward(at::Tensor const&, at::Tensor const&, float, int, int, int, bool) + 0xbc (0x7f6a6e33201c in /sailhome/jwhughes/miniconda3/envs/pytorch/lib/python3.7/site-packages/detectron2/_C.cpython-37m-x86_64-linux-gnu.so)
frame #3: <unknown function> + 0x5967a (0x7f6a6e34367a in /sailhome/jwhughes/miniconda3/envs/pytorch/lib/python3.7/site-packages/detectron2/_C.cpython-37m-x86_64-linux-gnu.so)
frame #4: <unknown function> + 0x5977e (0x7f6a6e34377e in /sailhome/jwhughes/miniconda3/envs/pytorch/lib/python3.7/site-packages/detectron2/_C.cpython-37m-x86_64-linux-gnu.so)
frame #5: <unknown function> + 0x53d00 (0x7f6a6e33dd00 in /sailhome/jwhughes/miniconda3/envs/pytorch/lib/python3.7/site-packages/detectron2/_C.cpython-37m-x86_64-linux-gnu.so)
<omitting python frames>
frame #10: THPFunction_apply(_object*, _object*) + 0x8d6 (0x7f6ac514ae96 in /sailhome/jwhughes/miniconda3/envs/pytorch/lib/python3.7/site-packages/torch/lib/libtorch_python.so)

Environment

---------------------  --------------------------------------------------
Python                 3.7.4 (default, Aug 13 2019, 20:35:49) [GCC 7.3.0]
Detectron2 Compiler    GCC 5.4
DETECTRON2_ENV_MODULE  <not set>
PyTorch                1.3.0
PyTorch Debug Build    False
CUDA available         True
GPU 0                  TITAN RTX
Pillow                 6.2.0
cv2                    4.1.0
---------------------  --------------------------------------------------
PyTorch built with:
  - GCC 7.3
  - Intel(R) Math Kernel Library Version 2019.0.4 Product Build 20190411 for Intel(R) 64 architecture applications
  - Intel(R) MKL-DNN v0.20.5 (Git Hash 0125f28c61c1f822fd48570b4c1066f96fcb9b2e)
  - OpenMP 201511 (a.k.a. OpenMP 4.5)
  - NNPACK is enabled
  - CUDA Runtime 10.1
  - NVCC architecture flags: -gencode;arch=compute_35,code=sm_35;-gencode;arch=compute_50,code=sm_50;-gencode;arch=compute_60,code=sm_60;-gencode;arch=compute_61,code=sm_61;-gencode;arch=compute_70,code=sm_70;-gencode;arch=compute_75,code=sm_75;-gencode;arch=compute_50,code=compute_50
  - CuDNN 7.6.3
  - Magma 2.5.1
  - Build settings: BLAS=MKL, BUILD_NAMEDTENSOR=OFF, BUILD_TYPE=Release, CXX_FLAGS= -Wno-deprecated -fvisibility-inlines-hidden -fopenmp -DUSE_FBGEMM -DUSE_QNNPACK -DUSE_PYTORCH_QNNPACK -O2 -fPIC -Wno-narrowing -Wall -Wextra -Wno-missing-field-initializers -Wno-type-limits -Wno-array-bounds -Wno-unknown-pragmas -Wno-sign-compare -Wno-unused-parameter -Wno-unused-variable -Wno-unused-function -Wno-unused-result -Wno-strict-overflow -Wno-strict-aliasing -Wno-error=deprecated-declarations -Wno-stringop-overflow -Wno-error=pedantic -Wno-error=redundant-decls -Wno-error=old-style-cast -fdiagnostics-color=always -faligned-new -Wno-unused-but-set-variable -Wno-maybe-uninitialized -fno-math-errno -fno-trapping-math -Wno-stringop-overflow, DISABLE_NUMA=1, PERF_WITH_AVX=1, PERF_WITH_AVX2=1, PERF_WITH_AVX512=1, USE_CUDA=True, USE_EXCEPTION_PTR=1, USE_GFLAGS=OFF, USE_GLOG=OFF, USE_MKL=ON, USE_MKLDNN=ON, USE_MPI=OFF, USE_NCCL=ON, USE_NNPACK=ON, USE_OPENMP=ON, USE_STATIC_DISPATCH=OFF, 
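
Note that the NVCC flags above show the PyTorch binary itself was built with sm_75 support (the TITAN RTX's architecture); the failing kernel lives in detectron2's separately compiled extension (_C.cpython-37m-x86_64-linux-gnu.so in the stack trace). A common remedy for this class of error, sketched here only under the assumption of a local detectron2 source checkout and not as the confirmed fix for this thread, is to rebuild that extension with an explicit architecture list:

import os
import subprocess
import sys
import torch

# TORCH_CUDA_ARCH_LIST is read by torch.utils.cpp_extension at build time
# and controls which -gencode flags the extension is compiled with.
major, minor = torch.cuda.get_device_capability(0)  # (7, 5) on a TITAN RTX
env = dict(os.environ, TORCH_CUDA_ARCH_LIST=f"{major}.{minor}")

# Reinstall detectron2 from source so its CUDA kernels are rebuilt for
# this GPU. The checkout path is an assumption for illustration.
subprocess.run(
    [sys.executable, "-m", "pip", "install", "--no-deps", "--force-reinstall", "-e", "."],
    cwd=os.path.expanduser("~/detectron2"),
    env=env,
    check=True,
)
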
@weston100 (Author)

Oops, it looks like the author of #62 and I submitted the same issue at exactly the same time.

@ppwwyyxx added the duplicate label on Oct 14, 2019
@ppwwyyxx (Contributor)

Closing as a duplicate of #62.
