
RuntimeError: radix_sort: failed on 1st step: cudaErrorInvalidDevice: invalid device ordinal #2743

Closed
tuttelikz opened this issue Mar 14, 2021 · 4 comments

tuttelikz commented Mar 14, 2021

Hi!

  1. I want to start a simple training run on the COCO dataset (all defaults) with a batch size of 4, as described on the Getting Started page of the documentation, but I get RuntimeError: radix_sort: failed on 1st step: cudaErrorInvalidDevice: invalid device ordinal.

Instructions To Reproduce the 🐛 Bug:

  1. Output of git rev-parse HEAD; git diff:
4aca4bdaa9ad48b8e91d7520e0d0815bb8ca0fb1
  2. Exact command I run:
./train_net.py --config-file ../configs/COCO-InstanceSegmentation/mask_rcnn_R_50_FPN_1x.yaml --num-gpus 2 SOLVER.IMS_PER_BATCH 4 SOLVER.BASE_LR 0.0025
  3. Output in terminal that I get:
RuntimeError: radix_sort: failed on 1st step: cudaErrorInvalidDevice: invalid device ordinal
[03/15 07:25:50 d2.engine.hooks]: Total training time: 0:00:10 (0:00:00 on hooks)
[03/15 07:25:50 d2.utils.events]:  iter: 0    lr: N/A  max_mem: 1965M
Traceback (most recent call last):
  File "./train_net.py", line 161, in <module>
    launch(
  File "/home/suzy/miniconda3/envs/refs/lib/python3.8/site-packages/detectron2/engine/launch.py", line 55, in launch
    mp.spawn(
  File "/home/suzy/miniconda3/envs/refs/lib/python3.8/site-packages/torch/multiprocessing/spawn.py", line 230, in spawn
    return start_processes(fn, args, nprocs, join, daemon, start_method='spawn')
  File "/home/suzy/miniconda3/envs/refs/lib/python3.8/site-packages/torch/multiprocessing/spawn.py", line 188, in start_processes
    while not context.join():
  File "/home/suzy/miniconda3/envs/refs/lib/python3.8/site-packages/torch/multiprocessing/spawn.py", line 150, in join
    raise ProcessRaisedException(msg, error_index, failed_process.pid)
torch.multiprocessing.spawn.ProcessRaisedException:

-- Process 1 terminated with the following error:
Traceback (most recent call last):
  File "/home/suzy/miniconda3/envs/refs/lib/python3.8/site-packages/torch/multiprocessing/spawn.py", line 59, in _wrap
    fn(i, *args)
  File "/home/suzy/miniconda3/envs/refs/lib/python3.8/site-packages/detectron2/engine/launch.py", line 94, in _distributed_worker
    main_func(*args)
  File "/home/suzy/notebooks/refs/detectron2/tools/train_net.py", line 155, in main
    return trainer.train()
  File "/home/suzy/miniconda3/envs/refs/lib/python3.8/site-packages/detectron2/engine/defaults.py", line 431, in train
    super().train(self.start_iter, self.max_iter)
  File "/home/suzy/miniconda3/envs/refs/lib/python3.8/site-packages/detectron2/engine/train_loop.py", line 138, in train
    self.run_step()
  File "/home/suzy/miniconda3/envs/refs/lib/python3.8/site-packages/detectron2/engine/defaults.py", line 441, in run_step
    self._trainer.run_step()
  File "/home/suzy/miniconda3/envs/refs/lib/python3.8/site-packages/detectron2/engine/train_loop.py", line 232, in run_step
    loss_dict = self.model(data)
  File "/home/suzy/miniconda3/envs/refs/lib/python3.8/site-packages/torch/nn/modules/module.py", line 889, in _call_impl
    result = self.forward(*input, **kwargs)
  File "/home/suzy/miniconda3/envs/refs/lib/python3.8/site-packages/torch/nn/parallel/distributed.py", line 705, in forward
    output = self.module(*inputs[0], **kwargs[0])
  File "/home/suzy/miniconda3/envs/refs/lib/python3.8/site-packages/torch/nn/modules/module.py", line 889, in _call_impl
    result = self.forward(*input, **kwargs)
  File "/home/suzy/miniconda3/envs/refs/lib/python3.8/site-packages/detectron2/modeling/meta_arch/rcnn.py", line 160, in forward
    proposals, proposal_losses = self.proposal_generator(images, features, gt_instances)
  File "/home/suzy/miniconda3/envs/refs/lib/python3.8/site-packages/torch/nn/modules/module.py", line 889, in _call_impl
    result = self.forward(*input, **kwargs)
  File "/home/suzy/miniconda3/envs/refs/lib/python3.8/site-packages/detectron2/modeling/proposal_generator/rpn.py", line 430, in forward
    gt_labels, gt_boxes = self.label_and_sample_anchors(anchors, gt_instances)
  File "/home/suzy/miniconda3/envs/refs/lib/python3.8/site-packages/torch/autograd/grad_mode.py", line 27, in decorate_context
    return func(*args, **kwargs)
  File "/home/suzy/miniconda3/envs/refs/lib/python3.8/site-packages/detectron2/modeling/proposal_generator/rpn.py", line 313, in label_and_sample_anchors
    gt_labels_i = self._subsample_labels(gt_labels_i)
  File "/home/suzy/miniconda3/envs/refs/lib/python3.8/site-packages/detectron2/modeling/proposal_generator/rpn.py", line 257, in _subsample_labels
    pos_idx, neg_idx = subsample_labels(
  File "/home/suzy/miniconda3/envs/refs/lib/python3.8/site-packages/detectron2/modeling/sampling.py", line 50, in subsample_labels
    perm2 = torch.randperm(negative.numel(), device=negative.device)[:num_neg]
RuntimeError: radix_sort: failed on 1st step: cudaErrorInvalidDevice: invalid device ordinal
  4. For datasets, I use COCO (instances_train2017.json), structured as follows:

└── coco
    ├── annotations
    │   ├── instances_minival2014_100.json
    │   ├── instances_train2017.json
    │   ├── instances_val2017_100.json
    │   ├── person_keypoints_minival2014_100.json
    │   └── person_keypoints_val2017_100.json
    ├── train2017
    │   ├──  00XXXXXXXXXX.jpg
    │   ├──  00XXXXXXXXXX.jpg
    │   ├──  00......
    ├── val2017
    │   ├──  00XXXXXXXXXX.jpg
    │   ├──  00XXXXXXXXXX.jpg
    │   ├──  00......

Expected behavior:

I expected training to start.
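
For context, the failing call in the traceback is torch.randperm(negative.numel(), device=negative.device) inside a worker spawned by detectron2's launch(). A minimal sketch of that pattern, purely illustrative and assuming a machine with at least two GPUs:

import torch
import torch.multiprocessing as mp

def worker(rank):
    # Each spawned worker uses its own GPU, mirroring detectron2's launch()
    torch.cuda.set_device(rank)
    # Same pattern as subsample_labels() in detectron2/modeling/sampling.py:
    # randperm allocated directly on the worker's (non-default) device.
    perm = torch.randperm(100000, device=torch.device("cuda", rank))
    print(rank, perm[:5])

if __name__ == "__main__":
    mp.spawn(worker, nprocs=2)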

Environment:

sys.platform            linux
Python                  3.8.8 (default, Feb 24 2021, 21:46:12) [GCC 7.3.0]
numpy                   1.19.2
detectron2              0.4 @/home/suzy/miniconda3/envs/refs/lib/python3.8/site-packages/detectron2
Compiler                GCC 9.3
CUDA compiler           CUDA 11.2
detectron2 arch flags   8.6
DETECTRON2_ENV_MODULE   <not set>
PyTorch                 1.8.0 @/home/suzy/miniconda3/envs/refs/lib/python3.8/site-packages/torch
PyTorch debug build     False
GPU available           True
GPU 0,1,2,3,4           GeForce RTX 3090 (arch=8.6)
CUDA_HOME               /usr/local/cuda-11.2
Pillow                  8.1.2
torchvision             0.9.0 @/home/suzy/miniconda3/envs/refs/lib/python3.8/site-packages/torchvision
torchvision arch flags  3.5, 5.0, 6.0, 7.0, 7.5, 8.0, 8.6
fvcore                  0.1.3.post20210311
cv2                     4.5.1
----------------------  ------------------------------------------------------------------------------
PyTorch built with:
  - GCC 7.3
  - C++ Version: 201402
  - Intel(R) Math Kernel Library Version 2020.0.4 Product Build 20200917 for Intel(R) 64 architecture applications
  - Intel(R) MKL-DNN v1.7.0 (Git Hash 7aed236906b1f7a05c0917e5257a1af05e9ff683)
  - OpenMP 201511 (a.k.a. OpenMP 4.5)
  - NNPACK is enabled
  - CPU capability usage: AVX2
  - CUDA Runtime 11.1
  - NVCC architecture flags: -gencode;arch=compute_37,code=sm_37;-gencode;arch=compute_50,code=sm_50;-gencode;arch=compute_60,code=sm_60;-gencode;arch=compute_61,code=sm_61;-gencode;arch=compute_70,code=sm_70;-gencode;arch=compute_75,code=sm_75;-gencode;arch=compute_80,code=sm_80;-gencode;arch=compute_86,code=sm_86;-gencode;arch=compute_37,code=compute_37
  - CuDNN 8.0.5
  - Magma 2.5.2
  - Build settings: BLAS_INFO=mkl, BUILD_TYPE=Release, CUDA_VERSION=11.1, CUDNN_VERSION=8.0.5, CXX_COMPILER=/opt/rh/devtoolset-7/root/usr/bin/c++, CXX_FLAGS= -Wno-deprecated -fvisibility-inlines-hidden -DUSE_PTHREADPOOL -fopenmp -DNDEBUG -DUSE_KINETO -DUSE_FBGEMM -DUSE_QNNPACK -DUSE_PYTORCH_QNNPACK -DUSE_XNNPACK -O2 -fPIC -Wno-narrowing -Wall -Wextra -Werror=return-type -Wno-missing-field-initializers -Wno-type-limits -Wno-array-bounds -Wno-unknown-pragmas -Wno-sign-compare -Wno-unused-parameter -Wno-unused-variable -Wno-unused-function -Wno-unused-result -Wno-unused-local-typedefs -Wno-strict-overflow -Wno-strict-aliasing -Wno-error=deprecated-declarations -Wno-stringop-overflow -Wno-psabi -Wno-error=pedantic -Wno-error=redundant-decls -Wno-error=old-style-cast -fdiagnostics-color=always -faligned-new -Wno-unused-but-set-variable -Wno-maybe-uninitialized -fno-math-errno -fno-trapping-math -Werror=format -Wno-stringop-overflow, LAPACK_INFO=mkl, PERF_WITH_AVX=1, PERF_WITH_AVX2=1, PERF_WITH_AVX512=1, TORCH_VERSION=1.8.0, USE_CUDA=ON, USE_CUDNN=ON, USE_EXCEPTION_PTR=1, USE_GFLAGS=OFF, USE_GLOG=OFF, USE_MKL=ON, USE_MKLDNN=ON, USE_MPI=OFF, USE_NCCL=ON, USE_NNPACK=ON, USE_OPENMP=ON,
@github-actions

You've chosen to report an unexpected problem or bug. Unless you already know the root cause of it, please include details about it by filling the issue template.
The following information is missing: "Instructions To Reproduce the Issue and Full Logs";

@github-actions github-actions bot added needs-more-info More info is needed to complete the issue and removed needs-more-info More info is needed to complete the issue labels Mar 14, 2021
@tuttelikz
Author

OK, downgrading to torch==1.7.1 seems to resolve the issue:
pytorch/pytorch#49161 (comment)

@ppwwyyxx
Contributor

We cannot reproduce this, so it's unlikely we'd be able to help with it. If you're able to provide a way to reproduce it (e.g. a Docker image), we can investigate.

According to your environment info, your CUDA version and PyTorch's CUDA version do not match. That's likely to cause issues.
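
A quick way to compare the two (illustrative only; the version numbers in the comments come from the environment dump above):

import torch
print(torch.version.cuda)         # CUDA version this PyTorch build ships with (11.1 here)
print(torch.cuda.is_available())
# Compare against the toolkit under CUDA_HOME used to build detectron2 (11.2 above).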

@ppwwyyxx
Contributor

ppwwyyxx commented Apr 3, 2021

This turns out to be a PyTorch 1.8.0 bug: pytorch/pytorch#54245
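
If that upstream report matches the traceback here, the failure should be reproducible without detectron2 at all. A hedged sketch, assuming torch 1.8.0 and a machine where cuda:1 exists:

import torch
# Reportedly fails on torch 1.8.0 with the same radix_sort /
# cudaErrorInvalidDevice error when the target is a non-default GPU.
print(torch.randperm(100000, device="cuda:1")[:5])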

@ppwwyyxx ppwwyyxx added the upstream issues Issues in other libraries label Apr 3, 2021
@github-actions github-actions bot locked as resolved and limited conversation to collaborators Jan 29, 2022