This repository has been archived by the owner on Nov 17, 2023. It is now read-only.

RCNN example fails for using latest mxnet #9823

Closed

zhechen opened this issue Feb 19, 2018 · 26 comments

zhechen commented Feb 19, 2018

I am using mxnet built with CUDA 9 + cuDNN 7 and distributed training enabled. However, when I re-ran the rcnn code from the examples, I got the following error:

Traceback (most recent call last):
File "train_end2end.py", line 199, in
main()
File "train_end2end.py", line 196, in main
lr=args.lr, lr_step=args.lr_step)
File "train_end2end.py", line 158, in train_net
arg_params=arg_params, aux_params=aux_params, begin_epoch=begin_epoch, num_epoch=end_epoch)
File "/----/libs/incubator-mxnet/python/mxnet/module/base_module.py", line 496, in fit
self.update_metric(eval_metric, data_batch.label)
File "/----/mx-rcnn/rcnn/core/module.py", line 227, in update_metric
self._curr_module.update_metric(eval_metric, labels)
File "/----/libs/incubator-mxnet/python/mxnet/module/module.py", line 749, in update_metric
self.exec_group.update_metric(eval_metric, labels)
File "/----/libs/incubator-mxnet/python/mxnet/module/executor_group.py", line 616, in update_metric
eval_metric.update_dict(labels_, preds)
File "/----/libs/incubator-mxnet/python/mxnet/metric.py", line 280, in update_dict
metric.update_dict(labels, preds)
File "/----/libs/incubator-mxnet/python/mxnet/metric.py", line 108, in update_dict
self.update(label, pred)
File "/----/mx-rcnn/rcnn/core/metric.py", line 51, in update
pred_label = mx.ndarray.argmax_channel(pred).asnumpy().astype('int32')
File "/----/libs/incubator-mxnet/python/mxnet/ndarray/ndarray.py", line 1801, in asnumpy
ctypes.c_size_t(data.size)))
File "/----/libs/incubator-mxnet/python/mxnet/base.py", line 148, in check_call
raise MXNetError(py_str(_LIB.MXGetLastError()))
mxnet.base.MXNetError: [17:08:44] src/operator/nn/./cudnn/cudnn_softmax_activation-inl.h:154: Check failed: e == CUDNN_STATUS_SUCCESS (3 vs. 0) cuDNN: CUDNN_STATUS_BAD_PARAM

Stack trace returned 10 entries:
[bt] (0) /----/libs/incubator-mxnet/python/mxnet/../../lib/libmxnet.so(dmlc::StackTrace()+0x3d) [0x2adc0c3395cd]
[bt] (1) /----/libs/incubator-mxnet/python/mxnet/../../lib/libmxnet.so(dmlc::LogMessageFatal::~LogMessageFatal()+0x18) [0x2adc0c339a58]
[bt] (2) /----/libs/incubator-mxnet/python/mxnet/../../lib/libmxnet.so(mxnet::op::CuDNNSoftmaxActivationOp::Backward(mxnet::OpContext const&, mxnet::TBlob const&, mxnet::TBlob const&, mxnet::OpReqType const&, mxnet::TBlob const&)+0x10b9) [0x2adc0f5c7669]
[bt] (3) /----/libs/incubator-mxnet/python/mxnet/../../lib/libmxnet.so(void mxnet::op::SoftmaxActivationGradComputemshadow::gpu(nnvm::NodeAttrs const&, mxnet::OpContext const&, std::vector<mxnet::TBlob, std::allocatormxnet::TBlob > const&, std::vector<mxnet::OpReqType, std::allocatormxnet::OpReqType > const&, std::vector<mxnet::TBlob, std::allocatormxnet::TBlob > const&)+0xd4c) [0x2adc0f5c2eac]
[bt] (4) /----/libs/incubator-mxnet/python/mxnet/../../lib/libmxnet.so(mxnet::exec::FComputeExecutor::Run(mxnet::RunContext, bool)+0x50) [0x2adc0ec4cc40]
[bt] (5) /----/libs/incubator-mxnet/python/mxnet/../../lib/libmxnet.so(+0x3284653) [0x2adc0ec54653]
[bt] (6)/----/libs/incubator-mxnet/python/mxnet/../../lib/libmxnet.so(mxnet::engine::ThreadedEngine::ExecuteOprBlock(mxnet::RunContext, mxnet::engine::OprBlock*)+0x2c4) [0x2adc0ec2fcd4]
[bt] (7) /----/libs/incubator-mxnet/python/mxnet/../../lib/libmxnet.so(void mxnet::engine::ThreadedEnginePerDevice::GPUWorker<(dmlc::ConcurrentQueueType)0>(mxnet::Context, bool, mxnet::engine::ThreadedEnginePerDevice::ThreadWorkerBlock<(dmlc::ConcurrentQueueType)0>, std::shared_ptrmxnet::engine::ThreadPool::SimpleEvent const&)+0x103) [0x2adc0ec34253]
[bt] (8) /----/libs/incubator-mxnet/python/mxnet/../../lib/libmxnet.so(std::_Function_handler<void (std::shared_ptrmxnet::engine::ThreadPool::SimpleEvent), mxnet::engine::ThreadedEnginePerDevice::PushToExecute(mxnet::engine::OprBlock*, bool)::{lambda()#3}::operator()() const::{lambda(std::shared_ptrmxnet::engine::ThreadPool::SimpleEvent)#1}>::_M_invoke(std::_Any_data const&, std::shared_ptrmxnet::engine::ThreadPool::SimpleEvent)+0x3e) [0x2adc0ec3448e]
[bt] (9)/----/libs/incubator-mxnet/python/mxnet/../../lib/libmxnet.so(std::thread::_Impl<std::_Bind_simple<std::function<void (std::shared_ptrmxnet::engine::ThreadPool::SimpleEvent)> (std::shared_ptrmxnet::engine::ThreadPool::SimpleEvent)> >::_M_run()+0x3b) [0x2adc0ec2e36b]

Can anyone help me with it? Thanks very much!

zhechen (Author) commented Feb 20, 2018

I somehow found a workaround for this. Since I observed that the issue is caused by the cudnn_softmax_activation code path, either disabling cuDNN entirely or dropping only the cuDNN implementation of softmax activation makes the problem go away. The failure mainly shows up when calling asnumpy() on softmax results. Maybe someone can help track down the real cause and fix it. Thanks!
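
For anyone trying to debug this, here is a minimal sketch of the call pattern the traceback goes through: a SoftmaxActivation forward/backward on a GPU, followed by asnumpy() on the result. The symbol and shapes are made up for illustration, and I am not claiming this exact snippet reproduces the failure on every affected build.

import mxnet as mx

data = mx.sym.Variable('data')
act = mx.sym.SoftmaxActivation(data=data, mode='channel', name='softmax_act')

# Bind on a single GPU with an arbitrary NCHW shape.
exe = act.simple_bind(ctx=mx.gpu(0), data=(2, 4, 7, 7))
exe.forward(is_train=True, data=mx.nd.random.uniform(shape=(2, 4, 7, 7), ctx=mx.gpu(0)))
exe.backward(out_grads=[mx.nd.ones((2, 4, 7, 7), ctx=mx.gpu(0))])

# Execution is asynchronous, so a cuDNN error raised in the backward pass
# only surfaces here, when the result is copied to host memory.
pred = exe.outputs[0].asnumpy()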

zhechen closed this as completed Feb 20, 2018
chinakook (Contributor) commented:
I encountered this problem in RCNN too. I tested with CUDA 8.0 and cuDNN 6.0.2/7.1.2, and both failed today. However, it ran successfully on an mxnet build from two months ago.
I think there may be a bug in the mxnet backend.

chinakook (Contributor) commented Mar 27, 2018

@marcoabreu It's not only a bug in RCNN, but also in mx.sym.SoftmaxOutput or mx.sym.SoftmaxActivation when their results are used in a metric, e.g. via pred.asnumpy().
It may occur in the multi-GPU case.
So I suggest reopening this issue until it's solved.

chinakook (Contributor) commented:
It's solved when I roll back to mxnet v1.1.0.

marcoabreu (Contributor) commented:
Thanks a lot for providing more detail! This indeed sounds like quite a serious issue. Just to clarify, does this only happen in a multi-GPU or distributed training environment?

@szha @rahul003 could you check this, please?

chinakook (Contributor) commented:
pred_label = mx.ndarray.argmax_channel(pred).asnumpy().astype('int32')

File "/home/ABCDEFG/dev/incubator-mxnet/python/mxnet/ndarray/ndarray.py", line 1826, in asnumpy
ctypes.c_size_t(data.size)))
File "/home/ABCDEFG/dev/incubator-mxnet/python/mxnet/base.py", line 149, in check_call
raise MXNetError(py_str(_LIB.MXGetLastError()))
mxnet.base.MXNetError: [11:18:51] src/operator/nn/./cudnn/cudnn_softmax_activation-inl.h:154: Check failed: e == CUDNN_STATUS_SUCCESS (3 vs. 0) cuDNN: CUDNN_STATUS_BAD_PARAM

Stack trace returned 10 entries:
[bt] (0) /home/ABCDEFG/dev/incubator-mxnet/lib/libmxnet.so(dmlc::StackTraceabi:cxx11+0x5b) [0x7f3943b0efab]
[bt] (1) /home/ABCDEFG/dev/incubator-mxnet/lib/libmxnet.so(mxnet::op::CuDNNSoftmaxActivationOp::Backward(mxnet::OpContext const&, mxnet::TBlob const&, mxnet::TBlob const&, mxnet::OpReqType const&, mxnet::TBlob const&)+0x1bf5) [0x7f3947f52885]
[bt] (2) /home/ABCDEFG/dev/incubator-mxnet/lib/libmxnet.so(void mxnet::op::SoftmaxActivationGradComputemshadow::gpu(nnvm::NodeAttrs const&, mxnet::OpContext const&, std::vector<mxnet::TBlob, std::allocatormxnet::TBlob > const&, std::vector<mxnet::OpReqType, std::allocatormxnet::OpReqType > const&, std::vector<mxnet::TBlob, std::allocatormxnet::TBlob > const&)+0x1e1b) [0x7f3947f4dd8b]
[bt] (3) /home/ABCDEFG/dev/incubator-mxnet/lib/libmxnet.so(mxnet::exec::FComputeExecutor::Run(mxnet::RunContext, bool)+0x50) [0x7f39462912d0]
[bt] (4) /home/ABCDEFG/dev/incubator-mxnet/lib/libmxnet.so(+0x330c7f8) [0x7f39462587f8]
[bt] (5) /home/ABCDEFG/dev/incubator-mxnet/lib/libmxnet.so(mxnet::engine::ThreadedEngine::ExecuteOprBlock(mxnet::RunContext, mxnet::engine::OprBlock*)+0x8e5) [0x7f394689d2c5]
[bt] (6) /home/ABCDEFG/dev/incubator-mxnet/lib/libmxnet.so(void mxnet::engine::ThreadedEnginePerDevice::GPUWorker<(dmlc::ConcurrentQueueType)0>(mxnet::Context, bool, mxnet::engine::ThreadedEnginePerDevice::ThreadWorkerBlock<(dmlc::ConcurrentQueueType)0>, std::shared_ptrdmlc::ManualEvent const&)+0xeb) [0x7f39468b2e4b]
[bt] (7) /home/ABCDEFG/dev/incubator-mxnet/lib/libmxnet.so(std::_Function_handler<void (std::shared_ptrdmlc::ManualEvent), mxnet::engine::ThreadedEnginePerDevice::PushToExecute(mxnet::engine::OprBlock*, bool)::{lambda()#3}::operator()() const::{lambda(std::shared_ptrdmlc::ManualEvent)#1}>::_M_invoke(std::_Any_data const&, std::shared_ptrdmlc::ManualEvent&&)+0x4e) [0x7f39468b30ae]
[bt] (8) /home/ABCDEFG/dev/incubator-mxnet/lib/libmxnet.so(std::thread::_Impl<std::_Bind_simple<std::function<void (std::shared_ptrdmlc::ManualEvent)> (std::shared_ptrdmlc::ManualEvent)> >::_M_run()+0x4a) [0x7f39468acf5a]
[bt] (9) /usr/lib/x86_64-linux-gnu/libstdc++.so.6(+0xb8c80) [0x7f3975ab5c80]

chinakook (Contributor) commented:
@marcoabreu It also happened on single GPU by chance.

chinakook (Contributor) commented:
I'm sure the problem is caused by this line:
https://github.com/apache/incubator-mxnet/blob/7c28089749287f42ea8f41abd1358e6dbac54187/example/rcnn/rcnn/symbol/symbol_resnet.py#L187
When I changed the line to

rpn_cls_prob = mx.symbol.softmax(data=rpn_cls_score_reshape, axis=1, name="rpn_cls_prob")

the problem went away. So I'm sure the mx.symbol.SoftmaxActivation operator (which depends on cuDNN; the mx.symbol.softmax operator, by contrast, is a native implementation) has had a bug since #9677. @zheng-da
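
For reference, a before/after sketch of that change. The "before" line is my best guess at what the linked line in symbol_resnet.py contains; mode="channel" in SoftmaxActivation is a softmax over axis 1 of the NCHW score tensor, so the two forms compute the same probabilities.

# Before: cuDNN-backed channel softmax (the operator implicated above)
rpn_cls_prob = mx.symbol.SoftmaxActivation(data=rpn_cls_score_reshape, mode="channel", name="rpn_cls_prob")

# After: native softmax over the channel axis, bypassing cuDNN
rpn_cls_prob = mx.symbol.softmax(data=rpn_cls_score_reshape, axis=1, name="rpn_cls_prob")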

ysfalo commented Apr 12, 2018

It happens occasionally on a single GPU on my machine with mxnet 1.2.0.

wkcn (Member) commented May 9, 2018

It seems the reason is that a cuDNN call fails.

When the compile options include USE_CUDNN=1, the softmax activation operator uses the cuDNN softmax.

The convolution operator prints this log:
src/operator/nn/convolution.cu:140: This convolution is not supported by cudnn, MXNET convolution is applied. The old CUDNN doesn't support dilated conv.

The softmax operator doesn't use cuDNN, so it doesn't cause any error.

Solution:
1. Compile MXNet with the latest cuDNN.
2. Replace mx.sym.SoftmaxActivation (cuDNN) with mx.sym.softmax (pure CUDA) if your cuDNN doesn't support SoftmaxActivation.

Ram-Godavarthi commented:
I had the same problem. Before, I had mxnet-cu80 1.2.0; I have just installed the GPU build of mxnet version 1.1.0, and it worked.

wkcn (Member) commented Jun 20, 2018

@ram124 The latest MXNet has fixed the bug.
#10918

Ram-Godavarthi commented:
@wkcn Oh cool, I will check that.

I have my custom dataset in Pascal VOC format. What changes need to be made to get started with training? I have 2 classes (pedestrian + bicycle) and need to detect both in a single image. I have changed pascal.py by changing the class names and numbers.

Is there anything else that I need to change?

Has anybody trained on their own dataset? Please help me out.

wkcn (Member) commented Jun 20, 2018

@ram124
You also need to change num_classes in config.py.

Ram-Godavarthi commented:
@wkcn
I should make it 3, right? Including the background. The current line is:
config.NUM_CLASSES = 21

And where should I specify my dataset? I have it in:
./data/my_own_data/
  Annotations
  ImageSets
  Images

In which files should I give this path, so that the code can read my custom data?

wkcn (Member) commented Jun 20, 2018

num_classes includes the background class, so 3 is right.
For the dataset path, check config.py, pascal_voc.py and pascal_voc_eval.py; a sketch of the typical edits is below.
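
A hedged sketch of what those edits typically look like; the attribute and file names below follow the mx-rcnn example's layout and may differ slightly in your checkout.

# rcnn/config.py -- the total class count includes the background class
config.NUM_CLASSES = 3  # __background__ + pedestrian + bicycle

# rcnn/dataset/pascal_voc.py (or your modified pascal.py) -- class list used
# to build the label map; the background class must stay at index 0
self.classes = ['__background__', 'pedestrian', 'bicycle']
self.num_classes = len(self.classes)

# Point the dataset root/path settings in config.py (or the corresponding
# train_end2end.py arguments) at ./data/my_own_data, keeping the
# Annotations / ImageSets / Images layout.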

Ram-Godavarthi commented:
@wkcn
Thank you.
When I run demo.py, I get the following. What does this mean?
(mxnet_p27) ubuntu@ip-172-31-10-202:~/mx-rcnn-1$ python demo.py --prefix model/vgg16 --epoch 0 --image myimage.jpg --gpu 0 --vis
Traceback (most recent call last):
File "demo.py", line 143, in
main()
File "demo.py", line 138, in main
predictor = get_net(symbol, args.prefix, args.epoch, ctx)
File "demo.py", line 49, in get_net
assert k in arg_params, k + ' not initialized'
AssertionError: rpn_conv_3x3_weight not initialized

wkcn (Member) commented Jun 20, 2018

It seems that you loaded a pretrained classification model rather than a detection model; there is no rpn_conv parameter in it. A quick way to check what the checkpoint contains is sketched below.
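
A quick diagnostic sketch (not part of the example): load the params file that demo.py is given and list its RPN parameters. The prefix and epoch come from your demo.py command; keys saved by MXNet checkpoints are typically prefixed with arg: or aux:.

import mxnet as mx

# demo.py was given --prefix model/vgg16 --epoch 0, i.e. model/vgg16-0000.params
params = mx.nd.load('model/vgg16-0000.params')

print(sorted(k for k in params if 'rpn' in k))
# An ImageNet classification checkpoint prints an empty list, while a trained
# Faster R-CNN detector lists arg:rpn_conv_3x3_weight, arg:rpn_conv_3x3_bias,
# arg:rpn_cls_score_weight, and so on.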

Ram-Godavarthi commented:
I solved that problem, thank you.

I am trying to train on my own dataset. I have changed the number of classes in config.py and modified pascal.py (changed the class names; I have only 2 classes + 1 background).

But now I am getting this error. What is the problem, @wkcn?

INFO:root:voc_radar_train append flipped images to roidb
Traceback (most recent call last):
File "train_end2end.py", line 178, in
main()
File "train_end2end.py", line 175, in main
lr=args.lr, lr_step=args.lr_step)
File "train_end2end.py", line 39, in train_net
for image_set in image_sets]
File "/home/ubuntu/mx-rcnn-1/rcnn/utils/load_data.py", line 13, in load_gt_roidb
roidb = imdb.append_flipped_images(roidb)
File "/home/ubuntu/mx-rcnn-1/rcnn/dataset/imdb.py", line 168, in append_flipped_images
assert (boxes[:, 2] >= boxes[:, 0]).all()
AssertionError

wkcn (Member) commented Jun 21, 2018

It seems the dataset is wrong.
The box coordinates xmin, ymin, xmax, ymax should start from 1; a quick sanity check over the annotations is sketched below.
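
If it helps, here is a small sanity-check sketch over a VOC-style Annotations folder; the path and tag names are assumptions based on the layout you described. The mx-rcnn VOC loader (like py-faster-rcnn) subtracts 1 from each coordinate on the assumption that annotations are 1-based, so an xmin or ymin of 0 is the usual cause of the flipped-box assertion.

import glob
import xml.etree.ElementTree as ET

for xml_file in glob.glob('data/my_own_data/Annotations/*.xml'):
    root = ET.parse(xml_file).getroot()
    for obj in root.findall('object'):
        b = obj.find('bndbox')
        xmin = int(float(b.find('xmin').text))
        ymin = int(float(b.find('ymin').text))
        xmax = int(float(b.find('xmax').text))
        ymax = int(float(b.find('ymax').text))
        if xmin < 1 or ymin < 1 or xmax <= xmin or ymax <= ymin:
            print('suspect box in %s: %s' % (xml_file, (xmin, ymin, xmax, ymax)))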

chinakook (Contributor) commented:

MATLAB indexing starts at 1, which is why VOC-style annotations are 1-based.

Ram-Godavarthi commented:
Yeah, I corrected it.

Now I am getting this error after 1 epoch. How do I solve this? Is it related to the mxnet version?

Traceback (most recent call last):
File "train_end2end.py", line 178, in
main()
File "train_end2end.py", line 175, in main
lr=args.lr, lr_step=args.lr_step)
File "train_end2end.py", line 137, in train_net
arg_params=arg_params, aux_params=aux_params, begin_epoch=begin_epoch, num_epoch=end_epoch)
File "/home/ubuntu/anaconda3/envs/mxnet_p27/lib/python2.7/site-packages/mxnet/module/base_module.py", line 517, in fit
self.set_params(arg_params, aux_params)
File "/home/ubuntu/anaconda3/envs/mxnet_p27/lib/python2.7/site-packages/mxnet/module/base_module.py", line 652, in set_params
allow_extra=allow_extra)
TypeError: init_params() got an unexpected keyword argument 'allow_extra'

Ram-Godavarthi commented:
@wkcn @chinakook @ysfalo @ijkguo I have a question regarding batch size. Can we use a batch size of more than 1 in mxnet-rcnn training?
I have a large dataset of 15000 images; training runs at 2.35 samples/sec, which works out to almost 4 hours per epoch.
Is there any other way I could increase the speed?

Any help is really appreciated.

ijkguo (Contributor) commented Jun 26, 2018

So the original issue has been fixed in #10918.

As for the unexpected kwarg 'allow_extra' and multi-batch-size training, they are solved in #11373.

chinakook (Contributor) commented:
@ram124 If batch size is your concern, please use SNIPER; it can be trained with a large batch size.

zhreshold (Member) commented:
Closing this after merging #11373; feel free to ping me to reopen if it's not fixed.
