This repository has been archived by the owner on Nov 17, 2023. It is now read-only.

RCNN example fails for using latest mxnet #9823

Closed

zhechen opened this issue Feb 19, 2018 · 26 comments

zhechen commented Feb 19, 2018

I am using mxnet built with CUDA 9 + cuDNN 7 and distributed training enabled. However, when I re-ran the rcnn code from the examples, I got the following error:

Traceback (most recent call last):
File "train_end2end.py", line 199, in
main()
File "train_end2end.py", line 196, in main
lr=args.lr, lr_step=args.lr_step)
File "train_end2end.py", line 158, in train_net
arg_params=arg_params, aux_params=aux_params, begin_epoch=begin_epoch, num_epoch=end_epoch)
File "/----/libs/incubator-mxnet/python/mxnet/module/base_module.py", line 496, in fit
self.update_metric(eval_metric, data_batch.label)
File "/----/mx-rcnn/rcnn/core/module.py", line 227, in update_metric
self._curr_module.update_metric(eval_metric, labels)
File "/----/libs/incubator-mxnet/python/mxnet/module/module.py", line 749, in update_metric
self.exec_group.update_metric(eval_metric, labels)
File "/----/libs/incubator-mxnet/python/mxnet/module/executor_group.py", line 616, in update_metric
eval_metric.update_dict(labels_, preds)
File "/----/libs/incubator-mxnet/python/mxnet/metric.py", line 280, in update_dict
metric.update_dict(labels, preds)
File "/----/libs/incubator-mxnet/python/mxnet/metric.py", line 108, in update_dict
self.update(label, pred)
File "/----/mx-rcnn/rcnn/core/metric.py", line 51, in update
pred_label = mx.ndarray.argmax_channel(pred).asnumpy().astype('int32')
File "/----/libs/incubator-mxnet/python/mxnet/ndarray/ndarray.py", line 1801, in asnumpy
ctypes.c_size_t(data.size)))
File "/----/libs/incubator-mxnet/python/mxnet/base.py", line 148, in check_call
raise MXNetError(py_str(_LIB.MXGetLastError()))
mxnet.base.MXNetError: [17:08:44] src/operator/nn/./cudnn/cudnn_softmax_activation-inl.h:154: Check failed: e == CUDNN_STATUS_SUCCESS (3 vs. 0) cuDNN: CUDNN_STATUS_BAD_PARAM

Stack trace returned 10 entries:
[bt] (0) /----/libs/incubator-mxnet/python/mxnet/../../lib/libmxnet.so(dmlc::StackTrace()+0x3d) [0x2adc0c3395cd]
[bt] (1) /----/libs/incubator-mxnet/python/mxnet/../../lib/libmxnet.so(dmlc::LogMessageFatal::~LogMessageFatal()+0x18) [0x2adc0c339a58]
[bt] (2) /----/libs/incubator-mxnet/python/mxnet/../../lib/libmxnet.so(mxnet::op::CuDNNSoftmaxActivationOp::Backward(mxnet::OpContext const&, mxnet::TBlob const&, mxnet::TBlob const&, mxnet::OpReqType const&, mxnet::TBlob const&)+0x10b9) [0x2adc0f5c7669]
[bt] (3) /----/libs/incubator-mxnet/python/mxnet/../../lib/libmxnet.so(void mxnet::op::SoftmaxActivationGradComputemshadow::gpu(nnvm::NodeAttrs const&, mxnet::OpContext const&, std::vector<mxnet::TBlob, std::allocatormxnet::TBlob > const&, std::vector<mxnet::OpReqType, std::allocatormxnet::OpReqType > const&, std::vector<mxnet::TBlob, std::allocatormxnet::TBlob > const&)+0xd4c) [0x2adc0f5c2eac]
[bt] (4) /----/libs/incubator-mxnet/python/mxnet/../../lib/libmxnet.so(mxnet::exec::FComputeExecutor::Run(mxnet::RunContext, bool)+0x50) [0x2adc0ec4cc40]
[bt] (5) /----/libs/incubator-mxnet/python/mxnet/../../lib/libmxnet.so(+0x3284653) [0x2adc0ec54653]
[bt] (6)/----/libs/incubator-mxnet/python/mxnet/../../lib/libmxnet.so(mxnet::engine::ThreadedEngine::ExecuteOprBlock(mxnet::RunContext, mxnet::engine::OprBlock*)+0x2c4) [0x2adc0ec2fcd4]
[bt] (7) /----/libs/incubator-mxnet/python/mxnet/../../lib/libmxnet.so(void mxnet::engine::ThreadedEnginePerDevice::GPUWorker<(dmlc::ConcurrentQueueType)0>(mxnet::Context, bool, mxnet::engine::ThreadedEnginePerDevice::ThreadWorkerBlock<(dmlc::ConcurrentQueueType)0>, std::shared_ptrmxnet::engine::ThreadPool::SimpleEvent const&)+0x103) [0x2adc0ec34253]
[bt] (8) /----/libs/incubator-mxnet/python/mxnet/../../lib/libmxnet.so(std::_Function_handler<void (std::shared_ptrmxnet::engine::ThreadPool::SimpleEvent), mxnet::engine::ThreadedEnginePerDevice::PushToExecute(mxnet::engine::OprBlock*, bool)::{lambda()#3}::operator()() const::{lambda(std::shared_ptrmxnet::engine::ThreadPool::SimpleEvent)#1}>::_M_invoke(std::_Any_data const&, std::shared_ptrmxnet::engine::ThreadPool::SimpleEvent)+0x3e) [0x2adc0ec3448e]
[bt] (9)/----/libs/incubator-mxnet/python/mxnet/../../lib/libmxnet.so(std::thread::_Impl<std::_Bind_simple<std::function<void (std::shared_ptrmxnet::engine::ThreadPool::SimpleEvent)> (std::shared_ptrmxnet::engine::ThreadPool::SimpleEvent)> >::_M_run()+0x3b) [0x2adc0ec2e36b]

Can anyone help me with it? Thanks very much!

zhechen (Author) commented Feb 20, 2018

I somehow found a workaround for this. Since I observed that the issue is caused by the cudnn_softmax_activation code path, either disabling cuDNN entirely or dropping only the cuDNN implementation of softmax activation makes the problem go away. The failure mainly shows up when calling asnumpy() on softmax results. Maybe someone can help track down the real cause and fix it. Thanks!
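
For anyone trying to debug this, here is a minimal sketch of the call pattern the traceback goes through: a SoftmaxActivation forward/backward on a GPU, followed by asnumpy() on the result. The symbol and shapes are made up for illustration, and I am not claiming this exact snippet reproduces the failure on every affected build.

import mxnet as mx

data = mx.sym.Variable('data')
act = mx.sym.SoftmaxActivation(data=data, mode='channel', name='softmax_act')

# Bind on a single GPU with an arbitrary NCHW shape.
exe = act.simple_bind(ctx=mx.gpu(0), data=(2, 4, 7, 7))
exe.forward(is_train=True, data=mx.nd.random.uniform(shape=(2, 4, 7, 7), ctx=mx.gpu(0)))
exe.backward(out_grads=[mx.nd.ones((2, 4, 7, 7), ctx=mx.gpu(0))])

# Execution is asynchronous, so a cuDNN error raised in the backward pass
# only surfaces here, when the result is copied to host memory.
pred = exe.outputs[0].asnumpy()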

zhechen closed this as completed Feb 20, 2018
chinakook (Contributor) commented:
I encountered this problem in RCNN too. I tested with CUDA 8.0 and cuDNN 6.0.2/7.1.2, and both failed today. However, it ran successfully on an mxnet build from two months ago.
I think there may be a bug in the mxnet backend.

chinakook (Contributor) commented Mar 27, 2018

@marcoabreu It's not only a bug in RCNN, but also in mx.sym.SoftmaxOutput or mx.sym.SoftmaxActivation when their results are used in a metric, e.g. via pred.asnumpy().
It may occur in the multi-GPU case.
So I suggest reopening this issue until it's solved.

chinakook (Contributor) commented:
It's solved when I roll back to mxnet v1.1.0.

marcoabreu (Contributor) commented:
Thanks a lot for providing more detail! This indeed sounds like quite a serious issue. Just to clarify, does this only happen in a multi-GPU or distributed training environment?

@szha @rahul003 could you check this, please?

chinakook (Contributor) commented:
pred_label = mx.ndarray.argmax_channel(pred).asnumpy().astype('int32')

File "/home/ABCDEFG/dev/incubator-mxnet/python/mxnet/ndarray/ndarray.py", line 1826, in asnumpy
ctypes.c_size_t(data.size)))
File "/home/ABCDEFG/dev/incubator-mxnet/python/mxnet/base.py", line 149, in check_call
raise MXNetError(py_str(_LIB.MXGetLastError()))
mxnet.base.MXNetError: [11:18:51] src/operator/nn/./cudnn/cudnn_softmax_activation-inl.h:154: Check failed: e == CUDNN_STATUS_SUCCESS (3 vs. 0) cuDNN: CUDNN_STATUS_BAD_PARAM

Stack trace returned 10 entries:
[bt] (0) /home/ABCDEFG/dev/incubator-mxnet/lib/libmxnet.so(dmlc::StackTraceabi:cxx11+0x5b) [0x7f3943b0efab]
[bt] (1) /home/ABCDEFG/dev/incubator-mxnet/lib/libmxnet.so(mxnet::op::CuDNNSoftmaxActivationOp::Backward(mxnet::OpContext const&, mxnet::TBlob const&, mxnet::TBlob const&, mxnet::OpReqType const&, mxnet::TBlob const&)+0x1bf5) [0x7f3947f52885]
[bt] (2) /home/ABCDEFG/dev/incubator-mxnet/lib/libmxnet.so(void mxnet::op::SoftmaxActivationGradComputemshadow::gpu(nnvm::NodeAttrs const&, mxnet::OpContext const&, std::vector<mxnet::TBlob, std::allocatormxnet::TBlob > const&, std::vector<mxnet::OpReqType, std::allocatormxnet::OpReqType > const&, std::vector<mxnet::TBlob, std::allocatormxnet::TBlob > const&)+0x1e1b) [0x7f3947f4dd8b]
[bt] (3) /home/ABCDEFG/dev/incubator-mxnet/lib/libmxnet.so(mxnet::exec::FComputeExecutor::Run(mxnet::RunContext, bool)+0x50) [0x7f39462912d0]
[bt] (4) /home/ABCDEFG/dev/incubator-mxnet/lib/libmxnet.so(+0x330c7f8) [0x7f39462587f8]
[bt] (5) /home/ABCDEFG/dev/incubator-mxnet/lib/libmxnet.so(mxnet::engine::ThreadedEngine::ExecuteOprBlock(mxnet::RunContext, mxnet::engine::OprBlock*)+0x8e5) [0x7f394689d2c5]
[bt] (6) /home/ABCDEFG/dev/incubator-mxnet/lib/libmxnet.so(void mxnet::engine::ThreadedEnginePerDevice::GPUWorker<(dmlc::ConcurrentQueueType)0>(mxnet::Context, bool, mxnet::engine::ThreadedEnginePerDevice::ThreadWorkerBlock<(dmlc::ConcurrentQueueType)0>, std::shared_ptrdmlc::ManualEvent const&)+0xeb) [0x7f39468b2e4b]
[bt] (7) /home/ABCDEFG/dev/incubator-mxnet/lib/libmxnet.so(std::_Function_handler<void (std::shared_ptrdmlc::ManualEvent), mxnet::engine::ThreadedEnginePerDevice::PushToExecute(mxnet::engine::OprBlock*, bool)::{lambda()#3}::operator()() const::{lambda(std::shared_ptrdmlc::ManualEvent)#1}>::_M_invoke(std::_Any_data const&, std::shared_ptrdmlc::ManualEvent&&)+0x4e) [0x7f39468b30ae]
[bt] (8) /home/ABCDEFG/dev/incubator-mxnet/lib/libmxnet.so(std::thread::_Impl<std::_Bind_simple<std::function<void (std::shared_ptrdmlc::ManualEvent)> (std::shared_ptrdmlc::ManualEvent)> >::_M_run()+0x4a) [0x7f39468acf5a]
[bt] (9) /usr/lib/x86_64-linux-gnu/libstdc++.so.6(+0xb8c80) [0x7f3975ab5c80]

chinakook (Contributor) commented:
@marcoabreu It also happened on single GPU by chance.

chinakook (Contributor) commented:
I'm sure the problem is caused by this line:
https://github.com/apache/incubator-mxnet/blob/7c28089749287f42ea8f41abd1358e6dbac54187/example/rcnn/rcnn/symbol/symbol_resnet.py#L187
When I changed the line to

rpn_cls_prob = mx.symbol.softmax(data=rpn_cls_score_reshape, axis=1, name="rpn_cls_prob")

the problem went away. So I'm sure the mx.symbol.SoftmaxActivation operator (which depends on cuDNN; the mx.symbol.softmax operator, by contrast, is a native implementation) has had a bug since #9677. @zheng-da
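
For reference, a before/after sketch of that change. The "before" line is my best guess at what the linked line in symbol_resnet.py contains; mode="channel" in SoftmaxActivation is a softmax over axis 1 of the NCHW score tensor, so the two forms compute the same probabilities.

# Before: cuDNN-backed channel softmax (the operator implicated above)
rpn_cls_prob = mx.symbol.SoftmaxActivation(data=rpn_cls_score_reshape, mode="channel", name="rpn_cls_prob")

# After: native softmax over the channel axis, bypassing cuDNN
rpn_cls_prob = mx.symbol.softmax(data=rpn_cls_score_reshape, axis=1, name="rpn_cls_prob")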

ysfalo commented Apr 12, 2018

It happens occasionally on a single GPU on my machine with mxnet 1.2.0.

wkcn (Member) commented May 9, 2018

It seems the reason is that a cuDNN call fails.

When the compile options include USE_CUDNN=1, the softmax activation operator uses the cuDNN softmax.

The convolution operator prints this log:
src/operator/nn/convolution.cu:140: This convolution is not supported by cudnn, MXNET convolution is applied. The old CUDNN doesn't support dilated conv.

The softmax operator doesn't use cuDNN, so it doesn't cause any error.

Solution:
1. Compile MXNet with the latest cuDNN.
2. Replace mx.sym.SoftmaxActivation (cuDNN) with mx.sym.softmax (pure CUDA) if your cuDNN doesn't support SoftmaxActivation.

Ram-Godavarthi commented:
I had the same problem. Before, I had mxnet-cu80 1.2.0; I have just installed the GPU build of mxnet version 1.1.0, and it worked.

wkcn (Member) commented Jun 20, 2018

@ram124 The latest MXNet has fixed the bug.
#10918

Ram-Godavarthi commented:
@wkcn Oh cool, I will check that.

I have my custom dataset in Pascal VOC format. What changes need to be made to get started with training? I have 2 classes (pedestrian + bicycle) and need to detect both in a single image. I have changed pascal.py by changing the class names and numbers.

Is there anything else that I need to change?

Has anybody trained on their own dataset? Please help me out.

wkcn (Member) commented Jun 20, 2018

@ram124
You also need to change num_classes in config.py.

Ram-Godavarthi commented:
@wkcn
I should make it 3, right? Including the background. The current line is:
config.NUM_CLASSES = 21

And where should I specify my dataset? I have it in:
./data/my_own_data/
  Annotations
  ImageSets
  Images

In which files should I give this path, so that the code can read my custom data?

wkcn (Member) commented Jun 20, 2018

num_classes includes the background class, so 3 is right.
For the dataset path, check config.py, pascal_voc.py and pascal_voc_eval.py; a sketch of the typical edits is below.
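
A hedged sketch of what those edits typically look like; the attribute and file names below follow the mx-rcnn example's layout and may differ slightly in your checkout.

# rcnn/config.py -- the total class count includes the background class
config.NUM_CLASSES = 3  # __background__ + pedestrian + bicycle

# rcnn/dataset/pascal_voc.py (or your modified pascal.py) -- class list used
# to build the label map; the background class must stay at index 0
self.classes = ['__background__', 'pedestrian', 'bicycle']
self.num_classes = len(self.classes)

# Point the dataset root/path settings in config.py (or the corresponding
# train_end2end.py arguments) at ./data/my_own_data, keeping the
# Annotations / ImageSets / Images layout.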

Ram-Godavarthi commented:
@wkcn
Thank you.
When I run demo.py, I get the following. What does this mean?
(mxnet_p27) ubuntu@ip-172-31-10-202:~/mx-rcnn-1$ python demo.py --prefix model/vgg16 --epoch 0 --image myimage.jpg --gpu 0 --vis
Traceback (most recent call last):
File "demo.py", line 143, in
main()
File "demo.py", line 138, in main
predictor = get_net(symbol, args.prefix, args.epoch, ctx)
File "demo.py", line 49, in get_net
assert k in arg_params, k + ' not initialized'
AssertionError: rpn_conv_3x3_weight not initialized

wkcn (Member) commented Jun 20, 2018

It seems that you loaded a pretrained classification model rather than a detection model; there is no rpn_conv parameter in it. A quick way to check what the checkpoint contains is sketched below.
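
A quick diagnostic sketch (not part of the example): load the params file that demo.py is given and list its RPN parameters. The prefix and epoch come from your demo.py command; keys saved by MXNet checkpoints are typically prefixed with arg: or aux:.

import mxnet as mx

# demo.py was given --prefix model/vgg16 --epoch 0, i.e. model/vgg16-0000.params
params = mx.nd.load('model/vgg16-0000.params')

print(sorted(k for k in params if 'rpn' in k))
# An ImageNet classification checkpoint prints an empty list, while a trained
# Faster R-CNN detector lists arg:rpn_conv_3x3_weight, arg:rpn_conv_3x3_bias,
# arg:rpn_cls_score_weight, and so on.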

Ram-Godavarthi commented:
I solved that problem, thank you.

I am trying to train on my own dataset. I have changed the number of classes in config.py and modified pascal.py (changed the class names; I have only 2 classes + 1 background).

But now I am getting this error. What is the problem, @wkcn?

INFO:root:voc_radar_train append flipped images to roidb
Traceback (most recent call last):
File "train_end2end.py", line 178, in
main()
File "train_end2end.py", line 175, in main
lr=args.lr, lr_step=args.lr_step)
File "train_end2end.py", line 39, in train_net
for image_set in image_sets]
File "/home/ubuntu/mx-rcnn-1/rcnn/utils/load_data.py", line 13, in load_gt_roidb
roidb = imdb.append_flipped_images(roidb)
File "/home/ubuntu/mx-rcnn-1/rcnn/dataset/imdb.py", line 168, in append_flipped_images
assert (boxes[:, 2] >= boxes[:, 0]).all()
AssertionError

wkcn (Member) commented Jun 21, 2018

It seems the dataset is wrong.
The box coordinates xmin, ymin, xmax, ymax should start from 1; a quick sanity check over the annotations is sketched below.
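
If it helps, here is a small sanity-check sketch over a VOC-style Annotations folder; the path and tag names are assumptions based on the layout you described. The mx-rcnn VOC loader (like py-faster-rcnn) subtracts 1 from each coordinate on the assumption that annotations are 1-based, so an xmin or ymin of 0 is the usual cause of the flipped-box assertion.

import glob
import xml.etree.ElementTree as ET

for xml_file in glob.glob('data/my_own_data/Annotations/*.xml'):
    root = ET.parse(xml_file).getroot()
    for obj in root.findall('object'):
        b = obj.find('bndbox')
        xmin = int(float(b.find('xmin').text))
        ymin = int(float(b.find('ymin').text))
        xmax = int(float(b.find('xmax').text))
        ymax = int(float(b.find('ymax').text))
        if xmin < 1 or ymin < 1 or xmax <= xmin or ymax <= ymin:
            print('suspect box in %s: %s' % (xml_file, (xmin, ymin, xmax, ymax)))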

chinakook (Contributor) commented:

MATLAB indexing starts at 1, which is why VOC-style annotations are 1-based.

Ram-Godavarthi commented:
Yeah, I corrected it.

Now I am getting this error after 1 epoch. How do I solve this? Is it related to the mxnet version?

Traceback (most recent call last):
File "train_end2end.py", line 178, in
main()
File "train_end2end.py", line 175, in main
lr=args.lr, lr_step=args.lr_step)
File "train_end2end.py", line 137, in train_net
arg_params=arg_params, aux_params=aux_params, begin_epoch=begin_epoch, num_epoch=end_epoch)
File "/home/ubuntu/anaconda3/envs/mxnet_p27/lib/python2.7/site-packages/mxnet/module/base_module.py", line 517, in fit
self.set_params(arg_params, aux_params)
File "/home/ubuntu/anaconda3/envs/mxnet_p27/lib/python2.7/site-packages/mxnet/module/base_module.py", line 652, in set_params
allow_extra=allow_extra)
TypeError: init_params() got an unexpected keyword argument 'allow_extra'

Ram-Godavarthi commented:
@wkcn @chinakook @ysfalo @ijkguo I have a question regarding batch size. Can we use a batch size of more than 1 in mxnet-rcnn training?
I have a large dataset of 15000 images; training runs at 2.35 samples/sec, which works out to almost 4 hours per epoch.
Is there any other way I could increase the speed?

Any help is really appreciated.

ijkguo (Contributor) commented Jun 26, 2018

So the original issue has been fixed in #10918.

As for the unexpected kwarg 'allow_extra' and multi-batch-size training, they are solved in #11373.

chinakook (Contributor) commented:
@ram124 If batch size is your concern, please use SNIPER; it can be trained with a large batch size.

zhreshold (Member) commented:
Closing this after merging #11373; feel free to ping me to reopen if it's not fixed.
