This repository has been archived by the owner on Nov 17, 2023. It is now read-only.

Flaky test: test_gluon.test_hybrid_static_memory_switching #11171

Closed
ThomasDelteil opened this issue Jun 6, 2018 · 14 comments

Comments

@ThomasDelteil
Contributor

test_gluon.test_hybrid_static_memory_switching

http://jenkins.mxnet-ci.amazon-ml.com/blue/organizations/jenkins/incubator-mxnet/detail/PR-11024/6/pipeline/

t_gluon.test_hybrid_static_memory_switching ... /work/runtime_functions.sh: line 382:     7 Segmentation fault      (core dumped) nosetests-3.4 --verbose tests/python/unittest

build.py: 2018-06-06 13:18:39,834 Running of command in container failed (139): docker run --rm -t --shm-size=500m -v /home/jenkins_slave/workspace/ut-python3-mkldnn-cpu:/work/mxnet -v /home/jenkins_slave/workspace/ut-python3-mkldnn-cpu/build:/work/build -u 1001:1001 mxnetci/build.ubuntu_cpu /work/runtime_functions.sh unittest_ubuntu_python3_cpu_mkldnn

build.py: 2018-06-06 13:18:39,835 You can try to get into the container by using the following command: docker run --rm -t --shm-size=500m -v /home/jenkins_slave/workspace/ut-python3-mkldnn-cpu:/work/mxnet -v /home/jenkins_slave/workspace/ut-python3-mkldnn-cpu/build:/work/build -u 1001:1001 -ti --entrypoint /bin/bash mxnetci/build.ubuntu_cpu /work/runtime_functions.sh unittest_ubuntu_python3_cpu_mkldnn

into container: False

Traceback (most recent call last):
  File "ci/build.py", line 318, in <module>
    sys.exit(main())
  File "ci/build.py", line 253, in main
    command=command, docker_registry=docker_registry)
  File "ci/build.py", line 155, in container_run
    raise subprocess.CalledProcessError(ret, cmd)
subprocess.CalledProcessError: Command 'docker run --rm -t --shm-size=500m -v /home/jenkins_slave/workspace/ut-python3-mkldnn-cpu:/work/mxnet -v /home/jenkins_slave/workspace/ut-python3-mkldnn-cpu/build:/work/build -u 1001:1001 mxnetci/build.ubuntu_cpu /work/runtime_functions.sh unittest_ubuntu_python3_cpu_mkldnn' returned non-zero exit status 139

script returned exit code 1
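
For readers of the log above: exit status 139 matches the "Segmentation fault (core dumped)" line, because shells conventionally report a process killed by signal N as 128 + N, and SIGSEGV is signal 11 on Linux. The helper below is only an illustrative sketch; the function name `describe_exit_status` is hypothetical and is not part of `ci/build.py`.

```python
import signal


def describe_exit_status(status: int) -> str:
    """Describe a shell-style exit status.

    Assumes the common shell convention that a status above 128
    means the process was terminated by signal (status - 128).
    """
    if status > 128:
        signum = status - 128
        try:
            name = signal.Signals(signum).name
        except ValueError:
            # Signal number not known on this platform.
            name = "unknown signal"
        return f"terminated by {name} (signal {signum})"
    return f"exited with code {status}"


# The status reported by the docker command in the log above.
print(describe_exit_status(139))
```

Note that the final "script returned exit code 1" comes from `build.py` itself exiting after the uncaught `CalledProcessError`, not from the test process.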
@kalyc
Contributor

kalyc commented Jun 14, 2018

Thanks for submitting this issue @ThomasDelteil

@ThomasDelteil
Contributor Author

ThomasDelteil commented Jun 15, 2018

@piiswrong you introduced this test in this commit

[WIP] Do Not Merge. Static memory allocation for cached_op (#10817) 2dbd143

and it seems to be flaky. I have seen it fail a few times in recent builds. Can you take a look? For example:
http://jenkins.mxnet-ci.amazon-ml.com/blue/organizations/jenkins/incubator-mxnet/detail/master/1002/pipeline/

Given all the talk about quality contributions on the dev mailing list, I am a little unsettled by the fact that this PR went undocumented (no design docs or explanation in the PR) and unreviewed (1 question ignored), that the optimization wasn't tested or benchmarked (it was actually making things slower), and that the code was self-merged.

Should we enforce that each PR, especially those that introduce a significant number of changes, be properly documented and reviewed before merging?

@szha @marcoabreu

@zheng-da
Contributor

zheng-da commented Jun 17, 2018

I ran the test thousands of times. The failure doesn't appear frequently. I checked the core dumps, and they all seem to fail in exactly the same place.

#0  0x00007ffff1c5c510 in typeinfo for mkldnn::impl::cpu::cpu_primitive_t () from /home/ubuntu/incubator-mxnet/lib/libmkldnn.so.0
#1  0x00007ffff15c1372 in mkldnn::impl::cpu::jit_uni_reorder_t::execute(mkldnn::impl::event_t*) ()
   from /home/ubuntu/incubator-mxnet/lib/libmkldnn.so.0
#2  0x00007ffff16f8293 in mkldnn::impl::cpu::cpu_engine_t::submit(mkldnn_primitive*, mkldnn::impl::event_t*, mkldnn::impl::nstl::vector<mkldnn::impl::event_t*>&) () from /home/ubuntu/incubator-mxnet/lib/libmkldnn.so.0
#3  0x00007ffff15741c6 in mkldnn::impl::stream_eager_t::submit_impl(unsigned long, unsigned long, mkldnn_primitive**) ()
   from /home/ubuntu/incubator-mxnet/lib/libmkldnn.so.0
#4  0x00007ffff15735b1 in mkldnn_stream::submit(mkldnn::impl::nstl::vector<mkldnn_primitive*> const&, mkldnn_primitive**) ()
   from /home/ubuntu/incubator-mxnet/lib/libmkldnn.so.0
#5  0x00007ffff15737c8 in mkldnn_stream_submit () from /home/ubuntu/incubator-mxnet/lib/libmkldnn.so.0
#6  0x00007fffcac89608 in mkldnn::stream::submit(std::vector<mkldnn::primitive, std::allocator<mkldnn::primitive> >) ()
   from /home/ubuntu/incubator-mxnet/lib/libmxnet.so
#7  0x00007fffcac8cfc2 in mxnet::MKLDNNMemory::ReorderTo(mkldnn::memory*) const ()
   from /home/ubuntu/incubator-mxnet/lib/libmxnet.so
#8  0x00007fffcac791da in mxnet::NDArray::Reorder2Default() const () from /home/ubuntu/incubator-mxnet/lib/libmxnet.so
#9  0x00007fffc847911a in mxnet::FallBackCompute(std::function<void (nnvm::NodeAttrs const&, mxnet::OpContext const&, std::vector<mxnet::TBlob, std::allocator<mxnet::TBlob> > const&, std::vector<mxnet::OpReqType, std::allocator<mxnet::OpReqType> > const&, std::vector<mxnet::TBlob, std::allocator<mxnet::TBlob> > const&)>, nnvm::NodeAttrs const&, mxnet::OpContext const&, std::vector<mxnet::NDArray, std::allocator<mxnet::NDArray> > const&, std::vector<mxnet::OpReqType, std::allocator<mxnet::OpReqType> > const&, std::vector<mxnet::NDArray, std::allocator<mxnet::NDArray> > const&) () from /home/ubuntu/incubator-mxnet/lib/libmxnet.so
#10 0x00007fffca8b8696 in mxnet::op::PoolingComputeExCPU(nnvm::NodeAttrs const&, mxnet::OpContext const&, std::vector<mxnet::NDArray, std::allocator<mxnet::NDArray> > const&, std::vector<mxnet::OpReqType, std::allocator<mxnet::OpReqType> > const&, std::vector<mxnet::NDArray, std::allocator<mxnet::NDArray> > const&) () from /home/ubuntu/incubator-mxnet/lib/libmxnet.so
#11 0x00007fffcabf34c5 in std::_Function_handler<void (mxnet::RunContext), mxnet::imperative::PushFComputeEx(std::function<void (nnvm::NodeAttrs const&, mxnet::OpContext const&, std::vector<mxnet::NDArray, std::allocator<mxnet::NDArray> > const&, std::vector<mxnet::OpReqType, std::allocator<mxnet::OpReqType> > const&, std::vector<mxnet::NDArray, std::allocator<mxnet::NDArray> > const&)> const&, nnvm::Op const*, nnvm::NodeAttrs const&, mxnet::Context const&, std::vector<mxnet::engine::Var*, std::allocator<mxnet::engine::Var*> > const&, std::vector<mxnet::engine::Var*, std::allocator<mxnet::engine::Var*> > const&, std::vector<mxnet::Resource, std::allocator<mxnet::Resource> > const&, std::vector<mxnet::NDArray*, std::allocator<mxnet::NDArray*> > const&, std::vector<mxnet::NDArray*, std::allocator<mxnet::NDArray*> > const&, std::vector<mxnet::OpReqType, std::allocator<mxnet::OpReqType> > const&)::{lambda(mxnet::RunContext)#1}>::_M_invoke(std::_Any_data const&, mxnet::RunContext&&) ()
   from /home/ubuntu/incubator-mxnet/lib/libmxnet.so
#12 0x00007fffcb0d6d88 in std::_Function_handler<void (mxnet::RunContext), mxnet::engine::ThreadedEngine::BulkAppend(std::function<void (mxnet::RunContext)>, mxnet::Context, std::vector<mxnet::engine::Var*, std::allocator<mxnet::engine::Var*> > const&, std::vector<mxnet::engine::Var*, std::allocator<mxnet::engine::Var*> > const&)::{lambda(mxnet::RunContext)#1}>::_M_invoke(std::_Any_data const&, mxnet::RunContext&&) () from /home/ubuntu/incubator-mxnet/lib/libmxnet.so
#13 0x00007fffcb0d6d67 in std::_Function_handler<void (mxnet::RunContext), mxnet::engine::ThreadedEngine::BulkAppend(std::function<void (mxnet::RunContext)>, mxnet::Context, std::vector<mxnet::engine::Var*, std::allocator<mxnet::engine::Var*> > const&, std::vector<mxnet::engine::Var*, std::allocator<mxnet::engine::Var*> > const&)::{lambda(mxnet::RunContext)#1}>::_M_invoke(std::_Any_data const&, mxnet::RunContext&&) () from /home/ubuntu/incubator-mxnet/lib/libmxnet.so
#14 0x00007fffcb0cb48b in std::_Function_handler<void (mxnet::RunContext, mxnet::engine::CallbackOnComplete), mxnet::engine::ThreadedEngine::BulkFlush()::{lambda(mxnet::RunContext, mxnet::engine::CallbackOnComplete)#1}>::_M_invoke(std::_Any_data const&, mxnet::RunContext&&, mxnet::engine::CallbackOnComplete&&) () from /home/ubuntu/incubator-mxnet/lib/libmxnet.so
#15 0x00007fffcb0ccd45 in mxnet::engine::ThreadedEngine::ExecuteOprBlock(mxnet::RunContext, mxnet::engine::OprBlock*) ()
   from /home/ubuntu/incubator-mxnet/lib/libmxnet.so
#16 0x00007fffcb0df459 in std::_Function_handler<void (std::shared_ptr<dmlc::ManualEvent>), mxnet::engine::ThreadedEnginePerDevice::PushToExecute(mxnet::engine::OprBlock*, bool)::{lambda()#1}::operator()() const::{lambda(std::shared_ptr<dmlc::ManualEvent>)#1}>::_M_invoke(std::_Any_data const&, std::shared_ptr<dmlc::ManualEvent>&&) () from /home/ubuntu/incubator-mxnet/lib/libmxnet.so
#17 0x00007fffcb0cc34a in std::thread::_Impl<std::_Bind_simple<std::function<void (std::shared_ptr<dmlc::ManualEvent>)> (std::shared_ptr<dmlc::ManualEvent>)> >::_M_run() () from /home/ubuntu/incubator-mxnet/lib/libmxnet.so
#18 0x00007fffe7f4dc80 in ?? () from /usr/lib/x86_64-linux-gnu/libstdc++.so.6
#19 0x00007ffff7bc16ba in start_thread (arg=0x7fff96154700) at pthread_create.c:333
#20 0x00007ffff78f741d in clone () at ../sysdeps/unix/sysv/linux/x86_64/clone.S:109
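
Reproducing an infrequent crash like this usually means rerunning the test in a loop and tallying how it exits. The sketch below shows one way to do that with only the Python standard library; the helper name `rerun` is hypothetical, and the nosetests command in the comment is an assumption about how the test might be selected, not the exact command used in this thread.

```python
import subprocess
import sys
from collections import Counter


def rerun(cmd, runs):
    """Run `cmd` `runs` times and tally the exit codes.

    Python's subprocess reports a child killed by signal N as
    return code -N, so a segfault shows up as -11 here rather
    than the shell-style 139 seen in the CI logs.
    """
    codes = Counter()
    for _ in range(runs):
        result = subprocess.run(
            cmd, stdout=subprocess.DEVNULL, stderr=subprocess.DEVNULL
        )
        codes[result.returncode] += 1
    return codes


# Demo with a trivial command. For the real test, cmd might be
# something like (assumed, adjust to your checkout):
#   ["nosetests", "--verbose",
#    "tests/python/unittest/test_gluon.py:test_hybrid_static_memory_switching"]
demo_cmd = [sys.executable, "-c", "pass"]
print(rerun(demo_cmd, 3))
```

Any key other than 0 in the resulting counter marks a failing run worth inspecting, for example by loading its core dump in gdb.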

@zheng-da
Contributor

zheng-da commented Jun 17, 2018

@TaoLv @pengzhao-intel @ashokei @azai91 Could you help check why it always fails in the same place?

@ThomasDelteil
Contributor Author

@zheng-da which hybridize kwargs trigger the failure?

@zheng-da
Contributor

It seems the error only happens with static_alloc=True, static_shape=True.

@marcoabreu
Contributor

Does it only happen with MKLDNN or is it unrelated?

@pengzhao-intel
Contributor

If it's only related to MKLDNN, we will take over the issue. @marcoabreu @zheng-da

szha added this to To Do in Tests Improvement via automation Jun 21, 2018
@piiswrong
Contributor

@pengzhao-intel Looks like it only fails when using MKLDNN. Could you help take a look?

@pengzhao-intel
Contributor

@piiswrong @zheng-da I can reproduce the issue. Will do further investigation.

szha self-assigned this Jun 23, 2018
@anirudhacharya
Member

Failure Logs - http://jenkins.mxnet-ci.amazon-ml.com/blue/rest/organizations/jenkins/pipelines/incubator-mxnet/branches/PR-10889/runs/18/nodes/735/log/?start=0

test_gluon.test_hybrid_static_memory_switching ... /work/runtime_functions.sh: line 541:     7 Segmentation fault      (core dumped) nosetests-2.7 $NOSE_COVERAGE_ARGUMENTS --with-xunit --xunit-file nosetests_unittest.xml --verbose tests/python/unittest
build.py: 2018-06-26 20:13:34,825 Running of command in container failed (139): docker run --rm -t --shm-size=500m -v /home/jenkins_slave/workspace/ut-python2-mkldnn-cpu:/work/mxnet -v /home/jenkins_slave/workspace/ut-python2-mkldnn-cpu/build:/work/build -v /efs-ccache:/work/ccache -u 1001:1001 -e CCACHE_MAXSIZE=500G -e CCACHE_TEMPDIR=/tmp/ccache -e CCACHE_DIR=/work/ccache -e CCACHE_LOGFILE=/tmp/ccache.log mxnetci/build.ubuntu_cpu /work/runtime_functions.sh unittest_ubuntu_python2_cpu
build.py: 2018-06-26 20:13:34,825 You can try to get into the container by using the following command: docker run --rm -t --shm-size=500m -v /home/jenkins_slave/workspace/ut-python2-mkldnn-cpu:/work/mxnet -v /home/jenkins_slave/workspace/ut-python2-mkldnn-cpu/build:/work/build -v /efs-ccache:/work/ccache -u 1001:1001 -ti --entrypoint /bin/bash -e CCACHE_MAXSIZE=500G -e CCACHE_TEMPDIR=/tmp/ccache -e CCACHE_DIR=/work/ccache -e CCACHE_LOGFILE=/tmp/ccache.log mxnetci/build.ubuntu_cpu /work/runtime_functions.sh unittest_ubuntu_python2_cpu
Traceback (most recent call last):
  File "ci/build.py", line 358, in <module>
    sys.exit(main())
  File "ci/build.py", line 291, in main
    command=command, docker_registry=args.docker_registry, local_ccache_dir=args.ccache_dir)
  File "ci/build.py", line 185, in container_run
    raise subprocess.CalledProcessError(ret, cmd)
subprocess.CalledProcessError: Command 'docker run --rm -t --shm-size=500m -v /home/jenkins_slave/workspace/ut-python2-mkldnn-cpu:/work/mxnet -v /home/jenkins_slave/workspace/ut-python2-mkldnn-cpu/build:/work/build -v /efs-ccache:/work/ccache -u 1001:1001 -e CCACHE_MAXSIZE=500G -e CCACHE_TEMPDIR=/tmp/ccache -e CCACHE_DIR=/work/ccache -e CCACHE_LOGFILE=/tmp/ccache.log mxnetci/build.ubuntu_cpu /work/runtime_functions.sh unittest_ubuntu_python2_cpu' returned non-zero exit status 139
script returned exit code 1

@KellenSunderland
Contributor

Failure Log: http://jenkins.mxnet-ci.amazon-ml.com/blue/rest/organizations/jenkins/pipelines/incubator-mxnet/branches/PR-11325/runs/16/nodes/768/log/?start=0

test_gluon.test_hybrid_static_memory ... ok
test_gluon.test_hybrid_static_memory_switching ... *** Error in `/usr/bin/python3': corrupted size vs. prev_size: 0x00007f6f010e01f0 ***
======= Backtrace: =========
/lib/x86_64-linux-gnu/libc.so.6(+0x777e5)[0x7f72ef4c27e5]
/lib/x86_64-linux-gnu/libc.so.6(+0x82aec)[0x7f72ef4cdaec]
/lib/x86_64-linux-gnu/libc.so.6(+0x82c0a)[0x7f72ef4cdc0a]
/lib/x86_64-linux-gnu/libc.so.6(posix_memalign+0x11d)[0x7f72ef4d271d]
/work/mxnet/python/mxnet/../../lib/libmxnet.so(_ZN5mxnet7storage16CPUDeviceStorage5AllocEm+0x2e)[0x7f72cc91211e]
/work/mxnet/python/mxnet/../../lib/libmxnet.so(_ZN5mxnet7storage19NaiveStorageManagerINS0_16CPUDeviceStorageEE5AllocEPNS_7Storage6HandleE+0xd)[0x7f72cc9121ad]
/work/mxnet/python/mxnet/../../lib/libmxnet.so(_ZN5mxnet11StorageImpl5AllocEPNS_7Storage6HandleE+0x60)[0x7f72cc90d700]
/work/mxnet/python/mxnet/../../lib/libmxnet.so(_ZN5mxnet7NDArray5Chunk13CheckAndAllocEm+0x1ea)[0x7f72cc480b2a]
/work/mxnet/python/mxnet/../../lib/libmxnet.so(_ZN5mxnet7NDArray16CreateMKLDNNDataERKN6mkldnn6memory14primitive_descE+0x6fe)[0x7f72cc46e06e]
/work/mxnet/python/mxnet/../../lib/libmxnet.so(_ZN5mxnet15CreateMKLDNNMemERKNS_7NDArrayERKN6mkldnn6memory14primitive_descENS_9OpReqTypeEPS1_+0x203)[0x7f72c9cae803]
/work/mxnet/python/mxnet/../../lib/libmxnet.so(_ZN5mxnet2op25MKLDNNConvolutionBackwardERKN4nnvm9NodeAttrsERKNS_9OpContextERKSt6vectorINS_7NDArrayESaIS9_EERKS8_INS_9OpReqTypeESaISE_EESD_+0x989)[0x7f72c9c8b179]
/work/mxnet/python/mxnet/../../lib/libmxnet.so(+0x29c003e)[0x7f72cc17e03e]
/work/mxnet/python/mxnet/../../lib/libmxnet.so(_ZZN5mxnet10imperative14CreateEngineOpERKNS_7ContextERKSt6vectorISt10shared_ptrINS_4exec10OpExecutorEESaIS8_EEENKUlNS_10RunContextENS_6engine18CallbackOnCompleteEE_clESD_SF_+0x9c)[0x7f72cc3afb9c]
/work/mxnet/python/mxnet/../../lib/libmxnet.so(_ZNSt17_Function_handlerIFvN5mxnet10RunContextENS0_6engine18CallbackOnCompleteEEZNS0_10imperative14CreateEngineOpERKNS0_7ContextERKSt6vectorISt10shared_ptrINS0_4exec10OpExecutorEESaISD_EEEUlS1_S3_E_E9_M_invokeERKSt9_Any_dataOS1_OS3_+0x21)[0x7f72cc3afd51]
/work/mxnet/python/mxnet/../../lib/libmxnet.so(_ZN5mxnet6engine14ThreadedEngine15ExecuteOprBlockENS_10RunContextEPNS0_8OprBlockE+0x8e5)[0x7f72cc8ef4d5]
/work/mxnet/python/mxnet/../../lib/libmxnet.so(_ZNSt17_Function_handlerIFvSt10shared_ptrIN4dmlc11ManualEventEEEZZN5mxnet6engine23ThreadedEnginePerDevice13PushToExecuteEPNS6_8OprBlockEbENKUlvE_clEvEUlS3_E_E9_M_invokeERKSt9_Any_dataOS3_+0xd9)[0x7f72cc901b59]
/work/mxnet/python/mxnet/../../lib/libmxnet.so(_ZNSt6thread5_ImplISt12_Bind_simpleIFSt8functionIFvSt10shared_ptrIN4dmlc11ManualEventEEEES6_EEE6_M_runEv+0x4a)[0x7f72cc8eeada]
/usr/lib/x86_64-linux-gnu/libstdc++.so.6(+0xb8c80)[0x7f72e5402c80]
/lib/x86_64-linux-gnu/libpthread.so.0(+0x76ba)[0x7f72ef81c6ba]
/lib/x86_64-linux-gnu/libc.so.6(clone+0x6d)[0x7f72ef55241d]
======= Memory map: ========
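
The glibc backtrace above prints each frame as `module(symbol+offset)[address]`, with the symbol left empty when none is available. As a reading aid, here is a small sketch that splits such lines into their parts; the regular expression and the name `parse_frame` are assumptions made for illustration, not part of any MXNet tooling.

```python
import re

# Matches frames such as:
#   /lib/x86_64-linux-gnu/libc.so.6(posix_memalign+0x11d)[0x7f72ef4d271d]
FRAME_RE = re.compile(
    r"^(?P<module>[^(\[]+)"
    r"\((?P<symbol>[^+)]*)\+(?P<offset>0x[0-9a-fA-F]+)\)"
    r"\[(?P<address>0x[0-9a-fA-F]+)\]$"
)


def parse_frame(line):
    """Return the parts of one backtrace frame, or None if the
    line does not have the module(symbol+offset)[address] shape."""
    match = FRAME_RE.match(line.strip())
    if match is None:
        return None
    return match.groupdict()


frame = parse_frame(
    "/lib/x86_64-linux-gnu/libc.so.6(posix_memalign+0x11d)[0x7f72ef4d271d]"
)
print(frame["module"], frame["symbol"], frame["offset"])
```

The mangled C++ symbols extracted this way (those starting with `_ZN`) can then be passed to a demangler such as `c++filt` to recover readable names like the ones in the gdb backtrace earlier in the thread.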

@zheng-da
Contributor

zheng-da commented Jul 16, 2018

The problem should have been fixed by #11577.
Can we close this issue?

szha closed this as completed Jul 16, 2018
Tests Improvement automation moved this from To Do to Done Jul 16, 2018

9 participants