This repository was archived by the owner on Nov 17, 2023. It is now read-only.

[Fast-RCNN] Proposal operator failed: too many resources requested for launch #6204

@lialie

Description

Environment info

Operating System: Ubuntu 16.04

Compiler: GCC 5.4, cuDNN 5.1, CUDA 8.0.44, NVIDIA driver 375.39 (GTX 980 Ti)

Package used (Python/R/Scala/Julia): Python

MXNet version: 0.9.5

MXNet commit hash (git rev-parse HEAD):

5d65519

Python version and distribution: Python 2.7.12

Error Message:

Called with argument: Namespace(begin_epoch=7, dataset='PascalVOC', dataset_path='data/VOCdevkit', end_epoch=20, frequent=20, gpus='0', image_set='2007_trainval', kvstore='device', lr=0.001, lr_step='20', network='vgg', no_flip=False, no_shuffle=False, prefix='model/mx95', pretrained='model/vgg', pretrained_epoch=7, resume=False, root_path='data', work_load_list=None)
{'ANCHOR_RATIOS': [0.5, 1, 2],
'ANCHOR_SCALES': [8, 16, 32],
'FIXED_PARAMS': ['conv1', 'conv2'],
'FIXED_PARAMS_SHARED': ['conv1', 'conv2', 'conv3', 'conv4', 'conv5'],
'IMAGE_STRIDE': 0,
'NUM_ANCHORS': 9,
'NUM_CLASSES': 21,
'PIXEL_MEANS': array([ 103.939, 116.779, 123.68 ]),
'RCNN_FEAT_STRIDE': 16,
'RPN_FEAT_STRIDE': 16,
'SCALES': [(600, 1000)],
'TEST': {'BATCH_IMAGES': 1,
'CXX_PROPOSAL': True,
'HAS_RPN': False,
'NMS': 0.3,
'PROPOSAL_MIN_SIZE': 16,
'PROPOSAL_NMS_THRESH': 0.7,
'PROPOSAL_POST_NMS_TOP_N': 2000,
'PROPOSAL_PRE_NMS_TOP_N': 20000,
'RPN_MIN_SIZE': 16,
'RPN_NMS_THRESH': 0.7,
'RPN_POST_NMS_TOP_N': 300,
'RPN_PRE_NMS_TOP_N': 6000},
'TRAIN': {'ASPECT_GROUPING': True,
'BATCH_IMAGES': 1,
'BATCH_ROIS': 128,
'BBOX_MEANS': [0.0, 0.0, 0.0, 0.0],
'BBOX_NORMALIZATION_PRECOMPUTED': True,
'BBOX_REGRESSION_THRESH': 0.5,
'BBOX_STDS': [0.1, 0.1, 0.2, 0.2],
'BBOX_WEIGHTS': array([ 1., 1., 1., 1.]),
'BG_THRESH_HI': 0.5,
'BG_THRESH_LO': 0.0,
'CXX_PROPOSAL': True,
'END2END': True,
'FG_FRACTION': 0.25,
'FG_THRESH': 0.5,
'RPN_BATCH_SIZE': 256,
'RPN_BBOX_WEIGHTS': [1.0, 1.0, 1.0, 1.0],
'RPN_CLOBBER_POSITIVES': False,
'RPN_FG_FRACTION': 0.5,
'RPN_MIN_SIZE': 16,
'RPN_NEGATIVE_OVERLAP': 0.3,
'RPN_NMS_THRESH': 0.7,
'RPN_POSITIVE_OVERLAP': 0.7,
'RPN_POSITIVE_WEIGHT': -1.0,
'RPN_POST_NMS_TOP_N': 2000,
'RPN_PRE_NMS_TOP_N': 12000}}
num_images 472
voc_2007_trainval gt roidb loaded from data/cache/voc_2007_trainval_gt_roidb.pkl
append flipped images to roidb
filtered 0 roidb entries: 944 -> 944
providing maximum shape [('data', (1, 3, 600, 1000)), ('gt_boxes', (1, 100, 5))] [('label', (1, 20646)), ('bbox_target', (1, 36, 37, 62)), ('bbox_weight', (1, 36, 37, 62))]
output shape
{'bbox_loss_reshape_output': (1L, 128L, 84L),
'blockgrad0_output': (1L, 128L),
'cls_prob_reshape_output': (1L, 128L, 21L),
'rpn_bbox_loss_output': (1L, 36L, 37L, 38L),
'rpn_cls_prob_output': (1L, 2L, 333L, 38L)}
lr 0.001 lr_epoch_diff [13] lr_iters [12272]
[15:47:47] ....../src/operator/././cudnn_algoreg-inl.h:65: Running performance tests to find the best convolution algorithm, this can take a while... (setting env variable MXNET_CUDNN_AUTOTUNE_DEFAULT to 0 to disable)
[15:47:56]....../dmlc-core/include/dmlc/././logging.h:304: [15:47:56] ....../src/operator/contrib/proposal.cu:476: Check failed: error == cudaSuccess (7 vs. 0) too many resources requested for launch

Stack trace returned 10 entries:
[bt] (0) /usr/local/lib/python2.7/dist-packages/mxnet-0.9.5-py2.7.egg/mxnet/libmxnet.so(_ZN4dmlc15LogMessageFatalD1Ev+0x3f) [0x7f3c727f0a99]
[bt] (1) /usr/local/lib/python2.7/dist-packages/mxnet-0.9.5-py2.7.egg/mxnet/libmxnet.so(ZN5mxnet2op13ProposalGPUOpIN7mshadow3gpuEE7ForwardERKNS_9OpContextERKSt6vectorINS_5TBlobESaIS9_EERKS8_INS_9OpReqTypeESaISE_EESD_SD+0x17dd) [0x7f3c73add499]
........
........
........
Traceback (most recent call last):
File "train_end2end.py", line 185, in <module>
main()
File "train_end2end.py", line 182, in main
lr=args.lr, lr_step=args.lr_step)
File "train_end2end.py", line 144, in train_net
arg_params=arg_params, aux_params=aux_params, begin_epoch=begin_epoch, num_epoch=end_epoch)
File "/usr/local/lib/python2.7/dist-packages/mxnet-0.9.5-py2.7.egg/mxnet/module/base_module.py", line 472, in fit
self.forward_backward(data_batch)
File "/usr/local/lib/python2.7/dist-packages/mxnet-0.9.5-py2.7.egg/mxnet/module/base_module.py", line 193, in forward_backward
self.forward(data_batch, is_train=True)
File "....../example/rcnn/rcnn/core/module.py", line 190, in forward
self._curr_module.forward(data_batch, is_train=is_train)
File "/usr/local/lib/python2.7/dist-packages/mxnet-0.9.5-py2.7.egg/mxnet/module/module.py", line 538, in forward
self.exec_group.forward(data_batch, is_train)
File "/usr/local/lib/python2.7/dist-packages/mxnet-0.9.5-py2.7.egg/mxnet/module/executor_group.py", line 379, in forward
exec_.forward(is_train=is_train)
File "/usr/local/lib/python2.7/dist-packages/mxnet-0.9.5-py2.7.egg/mxnet/executor.py", line 133, in forward
ctypes.c_int(int(is_train))))
File "/usr/local/lib/python2.7/dist-packages/mxnet-0.9.5-py2.7.egg/mxnet/base.py", line 84, in check_call
raise MXNetError(py_str(_LIB.MXGetLastError()))
mxnet.base.MXNetError: [15:50:32] ......../src/operator/contrib/proposal.cu:476: Check failed: error == cudaSuccess (7 vs. 0) too many resources requested for launch
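
As a sanity check on the log above, the RPN shapes it prints can be reproduced from the config values (`NUM_ANCHORS` 9, `RPN_FEAT_STRIDE` 16, `SCALES` (600, 1000)). This is a minimal sketch, not code from example/rcnn: the helper names `feat_size` and `rpn_shapes` are hypothetical, and floor rounding of the feature-map size is an assumption that happens to match the logged numbers.

```python
def feat_size(pixels, stride=16):
    """Feature-map extent for an input extent, assuming floor rounding."""
    return pixels // stride


def rpn_shapes(height, width, num_anchors=9, stride=16):
    """Shapes of the RPN tensors for a single image (batch size 1).

    Hypothetical helper: mirrors the shapes printed in the log,
    not the actual example/rcnn implementation.
    """
    h, w = feat_size(height, stride), feat_size(width, stride)
    return {
        "label": (1, num_anchors * h * w),           # one label per anchor
        "bbox_target": (1, 4 * num_anchors, h, w),   # 4 box deltas per anchor
        "rpn_cls_prob": (1, 2, num_anchors * h, w),  # bg/fg score per anchor
    }


# Maximum shape from the log: a 600 x 1000 input.
for name, shape in sorted(rpn_shapes(600, 1000).items()):
    print(name, shape)
```

With a 600 x 1000 input this gives a 37 x 62 feature map and 9 * 37 * 62 = 20646 anchors, which agrees with the `label` and `bbox_target` maximum shapes in the log. The 38-wide output shapes correspond to a narrower image, roughly 608 to 623 pixels wide under the same rounding assumption.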

Minimum reproducible example

example/rcnn

Steps to reproduce

  1. pwd: .../example/rcnn

  2. cmdline:
    python train_end2end.py --pretrained model/vgg --pretrained_epoch 7 --prefix model/mx95 --begin_epoch 7 --end_epoch 20 --lr_step 20 --gpus 0

What have you tried to solve it?

  1. Tried multiple versions (the 0.9.3 and 0.9.5 series); the failure is the same.
  2. Set kMaxThreadsPerBlock (tensor_gpu-inl.cuh) to 512, which caused another error.
  3. Any hints?
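
For context on item 2: one common cause of "too many resources requested for launch" is a kernel whose per-thread register use, multiplied by the block size, exceeds the per-block register budget. Whether that is what happens at proposal.cu:476 is not confirmed here. The sketch below only shows the arithmetic, assuming the compute capability 5.2 limits of the GTX 980 Ti (65536 32-bit registers and 1024 threads per block). The function names are hypothetical, and real hardware allocates registers in coarser units, so treat the result as an upper bound.

```python
WARP_SIZE = 32


def max_threads_per_block(regs_per_thread, regs_per_block=65536,
                          hw_max_threads=1024):
    """Largest warp-multiple block size that fits the register budget.

    Defaults are the assumed compute capability 5.2 limits; allocation
    granularity is ignored, so this is an optimistic estimate.
    """
    if regs_per_thread <= 0:
        raise ValueError("regs_per_thread must be positive")
    by_registers = regs_per_block // regs_per_thread
    limit = min(by_registers, hw_max_threads)
    return (limit // WARP_SIZE) * WARP_SIZE  # round down to whole warps


def launch_fits(threads, regs_per_thread):
    """True if a block of `threads` threads fits under the estimate above."""
    return threads <= max_threads_per_block(regs_per_thread)


# Illustrative register counts only, not measured values for proposal.cu.
for regs in (32, 64, 65, 128):
    print(regs, max_threads_per_block(regs), launch_fits(1024, regs))
```

Under these assumptions, a 1024-thread block stops fitting once a kernel needs more than 64 registers per thread. The actual register count of a kernel can be read from the compiler with `nvcc -Xptxas -v` or at run time with `cudaFuncGetAttributes`.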
