
LVIS Training fails to find ROIAlign_forward_cuda #63

Closed
weston100 opened this issue Oct 14, 2019 · 2 comments
Labels: duplicate (This issue or pull request already exists)

@weston100

When training an LVIS model out of the box on a single GPU, training begins and then immediately fails with the error RuntimeError: CUDA error: invalid device function (ROIAlign_forward_cuda at /pasteur/u/jwhughes/detectron2/detectron2/layers/csrc/ROIAlign/ROIAlign_cuda.cu:359). I suspect this is not an issue with my system: I have reproduced the same error on two different clusters, and I believe I am following the installation instructions exactly.
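
Note: a CUDA "invalid device function" error at kernel launch usually means the compiled extension does not contain code for the running GPU's architecture. A minimal diagnostic sketch, using only standard PyTorch calls, to capture the relevant versions:

import torch

# "invalid device function" typically indicates an architecture mismatch
# between the compiled CUDA kernels and the GPU they are launched on.
print("PyTorch:", torch.__version__)
print("CUDA runtime PyTorch was built against:", torch.version.cuda)
major, minor = torch.cuda.get_device_capability(0)
# A TITAN RTX (see the environment table below) is compute capability 7.5,
# so the extension must be compiled with sm_75 support.
print(f"GPU 0 compute capability: sm_{major}{minor}")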

To Reproduce

I made no changes to the code. I ran the command: python tools/train_net.py --num-gpus 1 --config-file configs/LVIS-InstanceSegmentation/mask_rcnn_R_101_FPN_1x.yaml SOLVER.IMS_PER_BATCH 8 OUTPUT_DIR ./output/test. (The error persists when the batch size is set to 1.)

I get the following error:

[10/14 13:15:26 d2.engine.train_loop]: Starting training from iteration 0
[10/14 13:15:28 d2.engine.hooks]: Total training time: 0:00:01 (0:00:00 on hooks)
Traceback (most recent call last):
  File "tools/train_net.py", line 161, in <module>
    args=(args,),
  File "/sailhome/jwhughes/miniconda3/envs/pytorch/lib/python3.7/site-packages/detectron2/engine/launch.py", line 52, in launch
    main_func(*args)
  File "tools/train_net.py", line 149, in main
    return trainer.train()
  File "/sailhome/jwhughes/miniconda3/envs/pytorch/lib/python3.7/site-packages/detectron2/engine/defaults.py", line 329, in train
    super().train(self.start_iter, self.max_iter)
  File "/sailhome/jwhughes/miniconda3/envs/pytorch/lib/python3.7/site-packages/detectron2/engine/train_loop.py", line 132, in train
    self.run_step()
  File "/sailhome/jwhughes/miniconda3/envs/pytorch/lib/python3.7/site-packages/detectron2/engine/train_loop.py", line 212, in run_step
    loss_dict = self.model(data)
  File "/sailhome/jwhughes/miniconda3/envs/pytorch/lib/python3.7/site-packages/torch/nn/modules/module.py", line 541, in __call__
    result = self.forward(*input, **kwargs)
  File "/sailhome/jwhughes/miniconda3/envs/pytorch/lib/python3.7/site-packages/detectron2/modeling/meta_arch/rcnn.py", line 88, in forward
    _, detector_losses = self.roi_heads(images, features, proposals, gt_instances)
  File "/sailhome/jwhughes/miniconda3/envs/pytorch/lib/python3.7/site-packages/torch/nn/modules/module.py", line 541, in __call__
    result = self.forward(*input, **kwargs)
  File "/sailhome/jwhughes/miniconda3/envs/pytorch/lib/python3.7/site-packages/detectron2/modeling/roi_heads/roi_heads.py", line 535, in forward
    losses = self._forward_box(features_list, proposals)
  File "/sailhome/jwhughes/miniconda3/envs/pytorch/lib/python3.7/site-packages/detectron2/modeling/roi_heads/roi_heads.py", line 589, in _forward_box
    box_features = self.box_pooler(features, [x.proposal_boxes for x in proposals])
  File "/sailhome/jwhughes/miniconda3/envs/pytorch/lib/python3.7/site-packages/torch/nn/modules/module.py", line 541, in __call__
    result = self.forward(*input, **kwargs)
  File "/sailhome/jwhughes/miniconda3/envs/pytorch/lib/python3.7/site-packages/detectron2/modeling/poolers.py", line 195, in forward
    output[inds] = pooler(x_level, pooler_fmt_boxes_level)
  File "/sailhome/jwhughes/miniconda3/envs/pytorch/lib/python3.7/site-packages/torch/nn/modules/module.py", line 541, in __call__
    result = self.forward(*input, **kwargs)
  File "/sailhome/jwhughes/miniconda3/envs/pytorch/lib/python3.7/site-packages/detectron2/layers/roi_align.py", line 95, in forward
    input, rois, self.output_size, self.spatial_scale, self.sampling_ratio, self.aligned
  File "/sailhome/jwhughes/miniconda3/envs/pytorch/lib/python3.7/site-packages/detectron2/layers/roi_align.py", line 20, in forward
    input, roi, spatial_scale, output_size[0], output_size[1], sampling_ratio, aligned
RuntimeError: CUDA error: invalid device function (ROIAlign_forward_cuda at /pasteur/u/jwhughes/detectron2/detectron2/layers/csrc/ROIAlign/ROIAlign_cuda.cu:359)
frame #0: c10::Error::Error(c10::SourceLocation, std::string const&) + 0x47 (0x7f6a9111d687 in /sailhome/jwhughes/miniconda3/envs/pytorch/lib/python3.7/site-packages/torch/lib/libc10.so)
frame #1: ROIAlign_forward_cuda(at::Tensor const&, at::Tensor const&, float, int, int, int, bool) + 0xa37 (0x7f6a6e389ff3 in /sailhome/jwhughes/miniconda3/envs/pytorch/lib/python3.7/site-packages/detectron2/_C.cpython-37m-x86_64-linux-gnu.so)
frame #2: ROIAlign_forward(at::Tensor const&, at::Tensor const&, float, int, int, int, bool) + 0xbc (0x7f6a6e33201c in /sailhome/jwhughes/miniconda3/envs/pytorch/lib/python3.7/site-packages/detectron2/_C.cpython-37m-x86_64-linux-gnu.so)
frame #3: <unknown function> + 0x5967a (0x7f6a6e34367a in /sailhome/jwhughes/miniconda3/envs/pytorch/lib/python3.7/site-packages/detectron2/_C.cpython-37m-x86_64-linux-gnu.so)
frame #4: <unknown function> + 0x5977e (0x7f6a6e34377e in /sailhome/jwhughes/miniconda3/envs/pytorch/lib/python3.7/site-packages/detectron2/_C.cpython-37m-x86_64-linux-gnu.so)
frame #5: <unknown function> + 0x53d00 (0x7f6a6e33dd00 in /sailhome/jwhughes/miniconda3/envs/pytorch/lib/python3.7/site-packages/detectron2/_C.cpython-37m-x86_64-linux-gnu.so)
<omitting python frames>
frame #10: THPFunction_apply(_object*, _object*) + 0x8d6 (0x7f6ac514ae96 in /sailhome/jwhughes/miniconda3/envs/pytorch/lib/python3.7/site-packages/torch/lib/libtorch_python.so)

Environment

---------------------  --------------------------------------------------
Python                 3.7.4 (default, Aug 13 2019, 20:35:49) [GCC 7.3.0]
Detectron2 Compiler    GCC 5.4
DETECTRON2_ENV_MODULE  <not set>
PyTorch                1.3.0
PyTorch Debug Build    False
CUDA available         True
GPU 0                  TITAN RTX
Pillow                 6.2.0
cv2                    4.1.0
---------------------  --------------------------------------------------
PyTorch built with:
  - GCC 7.3
  - Intel(R) Math Kernel Library Version 2019.0.4 Product Build 20190411 for Intel(R) 64 architecture applications
  - Intel(R) MKL-DNN v0.20.5 (Git Hash 0125f28c61c1f822fd48570b4c1066f96fcb9b2e)
  - OpenMP 201511 (a.k.a. OpenMP 4.5)
  - NNPACK is enabled
  - CUDA Runtime 10.1
  - NVCC architecture flags: -gencode;arch=compute_35,code=sm_35;-gencode;arch=compute_50,code=sm_50;-gencode;arch=compute_60,code=sm_60;-gencode;arch=compute_61,code=sm_61;-gencode;arch=compute_70,code=sm_70;-gencode;arch=compute_75,code=sm_75;-gencode;arch=compute_50,code=compute_50
  - CuDNN 7.6.3
  - Magma 2.5.1
  - Build settings: BLAS=MKL, BUILD_NAMEDTENSOR=OFF, BUILD_TYPE=Release, CXX_FLAGS= -Wno-deprecated -fvisibility-inlines-hidden -fopenmp -DUSE_FBGEMM -DUSE_QNNPACK -DUSE_PYTORCH_QNNPACK -O2 -fPIC -Wno-narrowing -Wall -Wextra -Wno-missing-field-initializers -Wno-type-limits -Wno-array-bounds -Wno-unknown-pragmas -Wno-sign-compare -Wno-unused-parameter -Wno-unused-variable -Wno-unused-function -Wno-unused-result -Wno-strict-overflow -Wno-strict-aliasing -Wno-error=deprecated-declarations -Wno-stringop-overflow -Wno-error=pedantic -Wno-error=redundant-decls -Wno-error=old-style-cast -fdiagnostics-color=always -faligned-new -Wno-unused-but-set-variable -Wno-maybe-uninitialized -fno-math-errno -fno-trapping-math -Wno-stringop-overflow, DISABLE_NUMA=1, PERF_WITH_AVX=1, PERF_WITH_AVX2=1, PERF_WITH_AVX512=1, USE_CUDA=True, USE_EXCEPTION_PTR=1, USE_GFLAGS=OFF, USE_GLOG=OFF, USE_MKL=ON, USE_MKLDNN=ON, USE_MPI=OFF, USE_NCCL=ON, USE_NNPACK=ON, USE_OPENMP=ON, USE_STATIC_DISPATCH=OFF, 
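
Note that the NVCC flags above show the PyTorch binary itself was built with sm_75 support (the TITAN RTX's architecture); the failing kernel lives in detectron2's separately compiled extension (_C.cpython-37m-x86_64-linux-gnu.so in the stack trace). A common remedy for this class of error, sketched here only under the assumption of a local detectron2 source checkout and not as the confirmed fix for this thread, is to rebuild that extension with an explicit architecture list:

import os
import subprocess
import sys
import torch

# TORCH_CUDA_ARCH_LIST is read by torch.utils.cpp_extension at build time
# and controls which -gencode flags the extension is compiled with.
major, minor = torch.cuda.get_device_capability(0)  # (7, 5) on a TITAN RTX
env = dict(os.environ, TORCH_CUDA_ARCH_LIST=f"{major}.{minor}")

# Reinstall detectron2 from source so its CUDA kernels are rebuilt for
# this GPU. The checkout path is an assumption for illustration.
subprocess.run(
    [sys.executable, "-m", "pip", "install", "--no-deps", "--force-reinstall", "-e", "."],
    cwd=os.path.expanduser("~/detectron2"),
    env=env,
    check=True,
)
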
@weston100 (Author)

Oops, it looks like the author of #62 and I submitted the same issue at exactly the same time.

@ppwwyyxx added the duplicate label on Oct 14, 2019
@ppwwyyxx (Contributor)

Closing as a duplicate of #62.
