RCNN example fails when using latest mxnet #9823
Comments
I somehow found a solution to this. Since I observed that this issue is caused by the cudnn_softmax_activation function, both disabling cuDNN and dropping the cuDNN implementation of softmax solve the problem. It mainly shows up when calling asnumpy() on softmax results. Maybe someone can help track down the real problem and fix it. Thanks!
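For reference, a minimal sketch of the failing pattern (assuming a GPU build with USE_CUDNN=1; the shapes here are arbitrary):

```python
import mxnet as mx

# SoftmaxActivation dispatches to cuDNN when MXNet is built with USE_CUDNN=1.
data = mx.sym.Variable('data')
net = mx.sym.SoftmaxActivation(data=data, mode='channel')

exe = net.simple_bind(ctx=mx.gpu(0), data=(2, 4, 7, 7))
exe.forward(is_train=True, data=mx.nd.ones((2, 4, 7, 7), ctx=mx.gpu(0)))
exe.backward(mx.nd.ones((2, 4, 7, 7), ctx=mx.gpu(0)))

# Execution is asynchronous, so the CUDNN_STATUS_BAD_PARAM failure from the
# backward pass only surfaces at the next synchronization point, e.g.
# asnumpy() or waitall().
mx.nd.waitall()
```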
I encountered this problem in RCNN too. I've tested with CUDA 8.0 and cuDNN 6.0.2/7.1.2; both of them failed today. However, it ran successfully on the mxnet version from two months ago.
@marcoabreu It's not only a bug in RCNN; it also affects mx.sym.SoftmaxOutput and mx.sym.SoftmaxActivation when their results are used in a metric.
It's solved when I roll back to mxnet v1.1.0. |
File "/home/ABCDEFG/dev/incubator-mxnet/python/mxnet/ndarray/ndarray.py", line 1826, in asnumpy Stack trace returned 10 entries: |
@marcoabreu It also happens occasionally on a single GPU.
I'm sure that the problem is caused by this line.
The problem is solved. So I'm sure the mx.symbol.SoftmaxActivation operator (which depends on cuDNN; the mx.symbol.softmax operator, by contrast, is a native implementation) has had a bug since #9677. @zheng-da
It happened occasionally on a single GPU on my machine with mxnet 1.2.0.
It seems the reason is that a cuDNN call fails. When the compile options include USE_CUDNN=1, the softmax activation operator uses CuDNNSoftmax, and the convolution operator prints the corresponding cuDNN log. The Softmax operator doesn't use cuDNN, so it doesn't cause any error. Solution: use the native softmax operator instead of SoftmaxActivation, or build with USE_CUDNN=0 (see the sketch below).
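A minimal sketch of that swap (the symbol name cls_prob is illustrative; axis=1 corresponds to SoftmaxActivation's mode='channel'):

```python
import mxnet as mx

data = mx.sym.Variable('data')

# Before: the cuDNN-backed operator that triggers the error in this issue.
# cls_prob = mx.sym.SoftmaxActivation(data=data, mode='channel', name='cls_prob')

# After: the native softmax operator, which does not go through cuDNN.
cls_prob = mx.sym.softmax(data=data, axis=1, name='cls_prob')
```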
I had the same problem. I just installed the GPU version of mxnet 1.1.0 and it worked.
@wkcn Oh cool, I will check that. I have my custom dataset in Pascal VOC format. Is there anything else that I need to change? Has anybody done training on their own dataset? Please help me out.
@wkcn And where should I specify my dataset? In which files should I give this path?
num_classes includes the background class, so 3 is right for two foreground classes.
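A sketch of that convention as it appears in Pascal-style loaders (the foreground class names are placeholders):

```python
# The background entry conventionally comes first, so two foreground
# classes yield num_classes = 3.
classes = ['__background__', 'class_a', 'class_b']
num_classes = len(classes)  # 3
```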
@wkcn
(mxnet_p27) ubuntu@ip-172-31-10-202:~/mx-rcnn-1$ python demo.py --prefix model/vgg16 --epoch 0 --image myimage.jpg --gpu 0 --vis
It seems that you loaded a pretrained (classification) model rather than a trained detection model.
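One way to check (a sketch assuming the usual mx-rcnn parameter names such as cls_score and bbox_pred; other forks may differ):

```python
import mxnet as mx

# Load the checkpoint given by --prefix/--epoch and look for detection-head
# weights; a plain ImageNet-pretrained VGG16 checkpoint will not have them.
sym, arg_params, aux_params = mx.model.load_checkpoint('model/vgg16', 0)
print([k for k in arg_params if 'cls_score' in k or 'bbox_pred' in k])
# An empty list suggests a pretrained backbone, not a trained detector.
```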
I solved that problem, thank you for that. I am trying to train on my own dataset. I have modified pascal.py (changed the class names; I have only 2 classes + 1 background), but now I am getting this error. What is the problem, @wkcn? INFO:root:voc_radar_train append flipped images to roidb
It seems the dataset is wrong. |
MATLAB indexing starts at 1, so Pascal VOC box coordinates are 1-based (see the sketch below).
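A sketch of the usual 1-based to 0-based conversion in a pascal.py-style loader (the annotation path is illustrative):

```python
import xml.etree.ElementTree as ET

# Pascal VOC boxes follow the MATLAB convention and start at 1, so the
# loader shifts them to 0-based pixel coordinates while parsing.
tree = ET.parse('Annotations/000001.xml')
for obj in tree.findall('object'):
    bbox = obj.find('bndbox')
    x1 = float(bbox.find('xmin').text) - 1
    y1 = float(bbox.find('ymin').text) - 1
    x2 = float(bbox.find('xmax').text) - 1
    y2 = float(bbox.find('ymax').text) - 1
```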
Yeah, I corrected it. Now I am getting this error after 1 epoch. How do I solve this? Is it related to the mxnet version? Traceback (most recent call last):
@wkcn @chinakook @ysfalo @ijkguo I have a question regarding batch size. Can we use a batch size of more than 1 in mx-rcnn training? Any help is really appreciated.
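In mx-rcnn-style code the effective batch size is usually images-per-GPU times the number of GPUs. A sketch assuming the config layout of the mx-rcnn example (config.TRAIN.BATCH_IMAGES; the exact option names may differ between forks):

```python
import mxnet as mx
from rcnn.config import config  # assumes the mx-rcnn repository layout

config.TRAIN.BATCH_IMAGES = 2    # images per GPU
ctx = [mx.gpu(0), mx.gpu(1)]     # devices passed to the Module
# Effective batch size = BATCH_IMAGES * len(ctx) = 4 images per iteration.
```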
Closing this after merging #11373; feel free to ping me to reopen if it's not fixed.
I am using mxnet with CUDA 9 + cuDNN 7 and distributed training enabled. However, when I re-run the RCNN code in the example, I get the following error:
Traceback (most recent call last):
File "train_end2end.py", line 199, in
main()
File "train_end2end.py", line 196, in main
lr=args.lr, lr_step=args.lr_step)
File "train_end2end.py", line 158, in train_net
arg_params=arg_params, aux_params=aux_params, begin_epoch=begin_epoch, num_epoch=end_epoch)
File "/----/libs/incubator-mxnet/python/mxnet/module/base_module.py", line 496, in fit
self.update_metric(eval_metric, data_batch.label)
File "/----/mx-rcnn/rcnn/core/module.py", line 227, in update_metric
self._curr_module.update_metric(eval_metric, labels)
File "/----/libs/incubator-mxnet/python/mxnet/module/module.py", line 749, in update_metric
self.exec_group.update_metric(eval_metric, labels)
File "/----/libs/incubator-mxnet/python/mxnet/module/executor_group.py", line 616, in update_metric
eval_metric.update_dict(labels, preds)
File "/----/libs/incubator-mxnet/python/mxnet/metric.py", line 280, in update_dict
metric.update_dict(labels, preds)
File "/----/libs/incubator-mxnet/python/mxnet/metric.py", line 108, in update_dict
self.update(label, pred)
File "/----/mx-rcnn/rcnn/core/metric.py", line 51, in update
pred_label = mx.ndarray.argmax_channel(pred).asnumpy().astype('int32')
File "/----/libs/incubator-mxnet/python/mxnet/ndarray/ndarray.py", line 1801, in asnumpy
ctypes.c_size_t(data.size)))
File "/----/libs/incubator-mxnet/python/mxnet/base.py", line 148, in check_call
raise MXNetError(py_str(_LIB.MXGetLastError()))
mxnet.base.MXNetError: [17:08:44] src/operator/nn/./cudnn/cudnn_softmax_activation-inl.h:154: Check failed: e == CUDNN_STATUS_SUCCESS (3 vs. 0) cuDNN: CUDNN_STATUS_BAD_PARAM
Stack trace returned 10 entries:
[bt] (0) /----/libs/incubator-mxnet/python/mxnet/../../lib/libmxnet.so(dmlc::StackTrace()+0x3d) [0x2adc0c3395cd]
[bt] (1) /----/libs/incubator-mxnet/python/mxnet/../../lib/libmxnet.so(dmlc::LogMessageFatal::~LogMessageFatal()+0x18) [0x2adc0c339a58]
[bt] (2) /----/libs/incubator-mxnet/python/mxnet/../../lib/libmxnet.so(mxnet::op::CuDNNSoftmaxActivationOp::Backward(mxnet::OpContext const&, mxnet::TBlob const&, mxnet::TBlob const&, mxnet::OpReqType const&, mxnet::TBlob const&)+0x10b9) [0x2adc0f5c7669]
[bt] (3) /----/libs/incubator-mxnet/python/mxnet/../../lib/libmxnet.so(void mxnet::op::SoftmaxActivationGradCompute<mshadow::gpu>(nnvm::NodeAttrs const&, mxnet::OpContext const&, std::vector<mxnet::TBlob, std::allocator<mxnet::TBlob> > const&, std::vector<mxnet::OpReqType, std::allocator<mxnet::OpReqType> > const&, std::vector<mxnet::TBlob, std::allocator<mxnet::TBlob> > const&)+0xd4c) [0x2adc0f5c2eac]
[bt] (4) /----/libs/incubator-mxnet/python/mxnet/../../lib/libmxnet.so(mxnet::exec::FComputeExecutor::Run(mxnet::RunContext, bool)+0x50) [0x2adc0ec4cc40]
[bt] (5) /----/libs/incubator-mxnet/python/mxnet/../../lib/libmxnet.so(+0x3284653) [0x2adc0ec54653]
[bt] (6) /----/libs/incubator-mxnet/python/mxnet/../../lib/libmxnet.so(mxnet::engine::ThreadedEngine::ExecuteOprBlock(mxnet::RunContext, mxnet::engine::OprBlock*)+0x2c4) [0x2adc0ec2fcd4]
[bt] (7) /----/libs/incubator-mxnet/python/mxnet/../../lib/libmxnet.so(void mxnet::engine::ThreadedEnginePerDevice::GPUWorker<(dmlc::ConcurrentQueueType)0>(mxnet::Context, bool, mxnet::engine::ThreadedEnginePerDevice::ThreadWorkerBlock<(dmlc::ConcurrentQueueType)0>*, std::shared_ptr<mxnet::engine::ThreadPool::SimpleEvent> const&)+0x103) [0x2adc0ec34253]
[bt] (8) /----/libs/incubator-mxnet/python/mxnet/../../lib/libmxnet.so(std::_Function_handler<void (std::shared_ptr<mxnet::engine::ThreadPool::SimpleEvent>), mxnet::engine::ThreadedEnginePerDevice::PushToExecute(mxnet::engine::OprBlock*, bool)::{lambda()#3}::operator()() const::{lambda(std::shared_ptr<mxnet::engine::ThreadPool::SimpleEvent>)#1}>::_M_invoke(std::_Any_data const&, std::shared_ptr<mxnet::engine::ThreadPool::SimpleEvent>)+0x3e) [0x2adc0ec3448e]
[bt] (9) /----/libs/incubator-mxnet/python/mxnet/../../lib/libmxnet.so(std::thread::_Impl<std::_Bind_simple<std::function<void (std::shared_ptr<mxnet::engine::ThreadPool::SimpleEvent>)> (std::shared_ptr<mxnet::engine::ThreadPool::SimpleEvent>)> >::_M_run()+0x3b) [0x2adc0ec2e36b]
Can anyone help me with it? Thanks very much!