
RuntimeError: radix_sort: failed on 1st step: cudaErrorInvalidDevice: invalid device ordinal #2743

Closed
tuttelikz opened this issue Mar 14, 2021 · 4 comments

tuttelikz commented Mar 14, 2021

Hi!

  1. I want to start a simple training run on the COCO dataset (all defaults) with a batch size of 4, as described on the Getting Started page of the documentation, but I get RuntimeError: radix_sort: failed on 1st step: cudaErrorInvalidDevice: invalid device ordinal.

Instructions To Reproduce the 🐛 Bug:

  1. Output of git rev-parse HEAD; git diff:
4aca4bdaa9ad48b8e91d7520e0d0815bb8ca0fb1
  2. Exact command I run:
./train_net.py --config-file ../configs/COCO-InstanceSegmentation/mask_rcnn_R_50_FPN_1x.yaml --num-gpus 2 SOLVER.IMS_PER_BATCH 4 SOLVER.BASE_LR 0.0025
  3. Output in terminal that I get:
RuntimeError: radix_sort: failed on 1st step: cudaErrorInvalidDevice: invalid device ordinal
[03/15 07:25:50 d2.engine.hooks]: Total training time: 0:00:10 (0:00:00 on hooks)
[03/15 07:25:50 d2.utils.events]:  iter: 0    lr: N/A  max_mem: 1965M
Traceback (most recent call last):
  File "./train_net.py", line 161, in <module>
    launch(
  File "/home/suzy/miniconda3/envs/refs/lib/python3.8/site-packages/detectron2/engine/launch.py", line 55, in launch
    mp.spawn(
  File "/home/suzy/miniconda3/envs/refs/lib/python3.8/site-packages/torch/multiprocessing/spawn.py", line 230, in spawn
    return start_processes(fn, args, nprocs, join, daemon, start_method='spawn')
  File "/home/suzy/miniconda3/envs/refs/lib/python3.8/site-packages/torch/multiprocessing/spawn.py", line 188, in start_processes
    while not context.join():
  File "/home/suzy/miniconda3/envs/refs/lib/python3.8/site-packages/torch/multiprocessing/spawn.py", line 150, in join
    raise ProcessRaisedException(msg, error_index, failed_process.pid)
torch.multiprocessing.spawn.ProcessRaisedException:

-- Process 1 terminated with the following error:
Traceback (most recent call last):
  File "/home/suzy/miniconda3/envs/refs/lib/python3.8/site-packages/torch/multiprocessing/spawn.py", line 59, in _wrap
    fn(i, *args)
  File "/home/suzy/miniconda3/envs/refs/lib/python3.8/site-packages/detectron2/engine/launch.py", line 94, in _distributed_worker
    main_func(*args)
  File "/home/suzy/notebooks/refs/detectron2/tools/train_net.py", line 155, in main
    return trainer.train()
  File "/home/suzy/miniconda3/envs/refs/lib/python3.8/site-packages/detectron2/engine/defaults.py", line 431, in train
    super().train(self.start_iter, self.max_iter)
  File "/home/suzy/miniconda3/envs/refs/lib/python3.8/site-packages/detectron2/engine/train_loop.py", line 138, in train
    self.run_step()
  File "/home/suzy/miniconda3/envs/refs/lib/python3.8/site-packages/detectron2/engine/defaults.py", line 441, in run_step
    self._trainer.run_step()
  File "/home/suzy/miniconda3/envs/refs/lib/python3.8/site-packages/detectron2/engine/train_loop.py", line 232, in run_step
    loss_dict = self.model(data)
  File "/home/suzy/miniconda3/envs/refs/lib/python3.8/site-packages/torch/nn/modules/module.py", line 889, in _call_impl
    result = self.forward(*input, **kwargs)
  File "/home/suzy/miniconda3/envs/refs/lib/python3.8/site-packages/torch/nn/parallel/distributed.py", line 705, in forward
    output = self.module(*inputs[0], **kwargs[0])
  File "/home/suzy/miniconda3/envs/refs/lib/python3.8/site-packages/torch/nn/modules/module.py", line 889, in _call_impl
    result = self.forward(*input, **kwargs)
  File "/home/suzy/miniconda3/envs/refs/lib/python3.8/site-packages/detectron2/modeling/meta_arch/rcnn.py", line 160, in forward
    proposals, proposal_losses = self.proposal_generator(images, features, gt_instances)
  File "/home/suzy/miniconda3/envs/refs/lib/python3.8/site-packages/torch/nn/modules/module.py", line 889, in _call_impl
    result = self.forward(*input, **kwargs)
  File "/home/suzy/miniconda3/envs/refs/lib/python3.8/site-packages/detectron2/modeling/proposal_generator/rpn.py", line 430, in forward
    gt_labels, gt_boxes = self.label_and_sample_anchors(anchors, gt_instances)
  File "/home/suzy/miniconda3/envs/refs/lib/python3.8/site-packages/torch/autograd/grad_mode.py", line 27, in decorate_context
    return func(*args, **kwargs)
  File "/home/suzy/miniconda3/envs/refs/lib/python3.8/site-packages/detectron2/modeling/proposal_generator/rpn.py", line 313, in label_and_sample_anchors
    gt_labels_i = self._subsample_labels(gt_labels_i)
  File "/home/suzy/miniconda3/envs/refs/lib/python3.8/site-packages/detectron2/modeling/proposal_generator/rpn.py", line 257, in _subsample_labels
    pos_idx, neg_idx = subsample_labels(
  File "/home/suzy/miniconda3/envs/refs/lib/python3.8/site-packages/detectron2/modeling/sampling.py", line 50, in subsample_labels
    perm2 = torch.randperm(negative.numel(), device=negative.device)[:num_neg]
RuntimeError: radix_sort: failed on 1st step: cudaErrorInvalidDevice: invalid device ordinal
  4. For datasets, I use COCO (instances_train2017.json), structured as follows:

└── coco
    ├── annotations
    │   ├── instances_minival2014_100.json
    │   ├── instances_train2017.json
    │   ├── instances_val2017_100.json
    │   ├── person_keypoints_minival2014_100.json
    │   └── person_keypoints_val2017_100.json
    ├── train2017
    │   ├──  00XXXXXXXXXX.jpg
    │   ├──  00XXXXXXXXXX.jpg
    │   ├──  00......
    ├── val2017
    │   ├──  00XXXXXXXXXX.jpg
    │   ├──  00XXXXXXXXXX.jpg
    │   ├──  00......

Expected behavior:

I expected training to start.
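
For context, the failing call in the traceback is torch.randperm(negative.numel(), device=negative.device) inside a worker spawned by detectron2's launch(). A minimal sketch of that pattern, purely illustrative and assuming a machine with at least two GPUs:

import torch
import torch.multiprocessing as mp

def worker(rank):
    # Each spawned worker uses its own GPU, mirroring detectron2's launch()
    torch.cuda.set_device(rank)
    # Same pattern as subsample_labels() in detectron2/modeling/sampling.py:
    # randperm allocated directly on the worker's (non-default) device.
    perm = torch.randperm(100000, device=torch.device("cuda", rank))
    print(rank, perm[:5])

if __name__ == "__main__":
    mp.spawn(worker, nprocs=2)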

Environment:

sys.platform            linux
Python                  3.8.8 (default, Feb 24 2021, 21:46:12) [GCC 7.3.0]
numpy                   1.19.2
detectron2              0.4 @/home/suzy/miniconda3/envs/refs/lib/python3.8/site-packages/detectron2
Compiler                GCC 9.3
CUDA compiler           CUDA 11.2
detectron2 arch flags   8.6
DETECTRON2_ENV_MODULE   <not set>
PyTorch                 1.8.0 @/home/suzy/miniconda3/envs/refs/lib/python3.8/site-packages/torch
PyTorch debug build     False
GPU available           True
GPU 0,1,2,3,4           GeForce RTX 3090 (arch=8.6)
CUDA_HOME               /usr/local/cuda-11.2
Pillow                  8.1.2
torchvision             0.9.0 @/home/suzy/miniconda3/envs/refs/lib/python3.8/site-packages/torchvision
torchvision arch flags  3.5, 5.0, 6.0, 7.0, 7.5, 8.0, 8.6
fvcore                  0.1.3.post20210311
cv2                     4.5.1
----------------------  ------------------------------------------------------------------------------
PyTorch built with:
  - GCC 7.3
  - C++ Version: 201402
  - Intel(R) Math Kernel Library Version 2020.0.4 Product Build 20200917 for Intel(R) 64 architecture applications
  - Intel(R) MKL-DNN v1.7.0 (Git Hash 7aed236906b1f7a05c0917e5257a1af05e9ff683)
  - OpenMP 201511 (a.k.a. OpenMP 4.5)
  - NNPACK is enabled
  - CPU capability usage: AVX2
  - CUDA Runtime 11.1
  - NVCC architecture flags: -gencode;arch=compute_37,code=sm_37;-gencode;arch=compute_50,code=sm_50;-gencode;arch=compute_60,code=sm_60;-gencode;arch=compute_61,code=sm_61;-gencode;arch=compute_70,code=sm_70;-gencode;arch=compute_75,code=sm_75;-gencode;arch=compute_80,code=sm_80;-gencode;arch=compute_86,code=sm_86;-gencode;arch=compute_37,code=compute_37
  - CuDNN 8.0.5
  - Magma 2.5.2
  - Build settings: BLAS_INFO=mkl, BUILD_TYPE=Release, CUDA_VERSION=11.1, CUDNN_VERSION=8.0.5, CXX_COMPILER=/opt/rh/devtoolset-7/root/usr/bin/c++, CXX_FLAGS= -Wno-deprecated -fvisibility-inlines-hidden -DUSE_PTHREADPOOL -fopenmp -DNDEBUG -DUSE_KINETO -DUSE_FBGEMM -DUSE_QNNPACK -DUSE_PYTORCH_QNNPACK -DUSE_XNNPACK -O2 -fPIC -Wno-narrowing -Wall -Wextra -Werror=return-type -Wno-missing-field-initializers -Wno-type-limits -Wno-array-bounds -Wno-unknown-pragmas -Wno-sign-compare -Wno-unused-parameter -Wno-unused-variable -Wno-unused-function -Wno-unused-result -Wno-unused-local-typedefs -Wno-strict-overflow -Wno-strict-aliasing -Wno-error=deprecated-declarations -Wno-stringop-overflow -Wno-psabi -Wno-error=pedantic -Wno-error=redundant-decls -Wno-error=old-style-cast -fdiagnostics-color=always -faligned-new -Wno-unused-but-set-variable -Wno-maybe-uninitialized -fno-math-errno -fno-trapping-math -Werror=format -Wno-stringop-overflow, LAPACK_INFO=mkl, PERF_WITH_AVX=1, PERF_WITH_AVX2=1, PERF_WITH_AVX512=1, TORCH_VERSION=1.8.0, USE_CUDA=ON, USE_CUDNN=ON, USE_EXCEPTION_PTR=1, USE_GFLAGS=OFF, USE_GLOG=OFF, USE_MKL=ON, USE_MKLDNN=ON, USE_MPI=OFF, USE_NCCL=ON, USE_NNPACK=ON, USE_OPENMP=ON,
@github-actions

You've chosen to report an unexpected problem or bug. Unless you already know the root cause of it, please include details about it by filling the issue template.
The following information is missing: "Instructions To Reproduce the Issue and Full Logs";

@github-actions github-actions bot added needs-more-info More info is needed to complete the issue and removed needs-more-info More info is needed to complete the issue labels Mar 14, 2021
@tuttelikz
Author

OK, downgrading to torch==1.7.1 seems to resolve the issue:
pytorch/pytorch#49161 (comment)

@ppwwyyxx
Contributor

We cannot reproduce this, so it's unlikely we'd be able to help with it. If you're able to provide a way to reproduce it (e.g. a Docker image), we can investigate.

According to your environment info, your CUDA version and PyTorch's CUDA version do not match. That's likely to cause issues.
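
A quick way to compare the two (illustrative only; the version numbers in the comments come from the environment dump above):

import torch
print(torch.version.cuda)         # CUDA version this PyTorch build ships with (11.1 here)
print(torch.cuda.is_available())
# Compare against the toolkit under CUDA_HOME used to build detectron2 (11.2 above).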

@ppwwyyxx
Contributor

ppwwyyxx commented Apr 3, 2021

This turns out to be a PyTorch 1.8.0 bug: pytorch/pytorch#54245
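
If that upstream report matches the traceback here, the failure should be reproducible without detectron2 at all. A hedged sketch, assuming torch 1.8.0 and a machine where cuda:1 exists:

import torch
# Reportedly fails on torch 1.8.0 with the same radix_sort /
# cudaErrorInvalidDevice error when the target is a non-default GPU.
print(torch.randperm(100000, device="cuda:1")[:5])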

@ppwwyyxx ppwwyyxx added the upstream issues Issues in other libraries label Apr 3, 2021
@github-actions github-actions bot locked as resolved and limited conversation to collaborators Jan 29, 2022