This repository was archived by the owner on Nov 17, 2023. It is now read-only.

[v1.9.x] mkldnn build and test failures #20643

@josephevans

Description

The following tests fail consistently in the v1.9.x branch:

  1. tests/python/unittest/test_gluon.py - test_hybrid_static_memory_switching
  2. tests/cpp/operator/mkldnn_test.cc:103 - MKLDNN_UTIL_FUNC.MemFormat
  3. tests/cpp/thread_safety/thread_safety_test.cc:314 - ThreadSafety.CachedOpFullModel

Also, the MKLDNN Windows builds keep failing in the v1.9.x branch. Example failures are listed below.

Occurrences

  1. test_gluon.test_hybrid_static_memory_switching examples:
  2. MKLDNN_UTIL_FUNC.MemFormat examples:
  3. ThreadSafety.CachedOpFullModel examples:

Test/Build Failure Log Output

  1. test_gluon.test_hybrid_static_memory_switching
[2021-10-07T01:27:03.835Z] ======================================================================
[2021-10-07T01:27:03.835Z] ERROR: test_gluon.test_hybrid_static_memory_switching
[2021-10-07T01:27:03.835Z] ----------------------------------------------------------------------
[2021-10-07T01:27:03.835Z] Traceback (most recent call last):
[2021-10-07T01:27:03.835Z]   File "/usr/local/lib/python3.7/dist-packages/nose/case.py", line 198, in runTest
[2021-10-07T01:27:03.835Z]     self.test(*self.arg)
[2021-10-07T01:27:03.835Z]   File "/work/mxnet/tests/python/unittest/common.py", line 218, in test_new
[2021-10-07T01:27:03.835Z]     orig_test(*args, **kwargs)
[2021-10-07T01:27:03.835Z]   File "/work/mxnet/tests/python/unittest/test_gluon.py", line 1760, in test_hybrid_static_memory_switching
[2021-10-07T01:27:03.835Z]     check_hybrid_static_memory_switching(static_alloc=True)
[2021-10-07T01:27:03.835Z]   File "/work/mxnet/tests/python/unittest/test_gluon.py", line 1755, in check_hybrid_static_memory_switching
[2021-10-07T01:27:03.835Z]     mx.nd.waitall()
[2021-10-07T01:27:03.835Z]   File "/work/mxnet/python/mxnet/ndarray/ndarray.py", line 211, in waitall
[2021-10-07T01:27:03.835Z]     check_call(_LIB.MXNDArrayWaitAll())
[2021-10-07T01:27:03.835Z]   File "/work/mxnet/python/mxnet/base.py", line 246, in check_call
[2021-10-07T01:27:03.835Z]     raise get_last_ffi_error()
[2021-10-07T01:27:03.835Z] mxnet.base.MXNetError: Traceback (most recent call last):
[2021-10-07T01:27:03.835Z]   [bt] (9) /work/mxnet/python/mxnet/../../lib/libmxnet.so(std::_Function_handler<void (std::shared_ptr<dmlc::ManualEvent>), mxnet::engine::ThreadedEnginePerDevice::PushToExecute(mxnet::engine::OprBlock*, bool)::{lambda()#1}::operator()() const::{lambda(std::shared_ptr<dmlc::ManualEvent>)#1}>::_M_invoke(std::_Any_data const&, std::shared_ptr<dmlc::ManualEvent>&&)+0x147) [0x7f9183955ee7]
[2021-10-07T01:27:03.835Z]   [bt] (8) /work/mxnet/python/mxnet/../../lib/libmxnet.so(mxnet::engine::ThreadedEngine::ExecuteOprBlock(mxnet::RunContext, mxnet::engine::OprBlock*)+0x2d8) [0x7f9183942738]
[2021-10-07T01:27:03.835Z]   [bt] (7) /work/mxnet/python/mxnet/../../lib/libmxnet.so(std::_Function_handler<void (mxnet::RunContext, mxnet::engine::CallbackOnComplete), mxnet::engine::ThreadedEngine::BulkFlush()::{lambda(mxnet::RunContext, mxnet::engine::CallbackOnComplete)#1}>::_M_invoke(std::_Any_data const&, mxnet::RunContext&&, mxnet::engine::CallbackOnComplete&&)+0x1c6) [0x7f9183940056]
[2021-10-07T01:27:03.835Z]   [bt] (6) /work/mxnet/python/mxnet/../../lib/libmxnet.so(std::_Function_handler<void (mxnet::RunContext), mxnet::imperative::PushFComputeEx(std::function<void (nnvm::NodeAttrs const&, mxnet::OpContext const&, std::vector<mxnet::NDArray, std::allocator<mxnet::NDArray> > const&, std::vector<mxnet::OpReqType, std::allocator<mxnet::OpReqType> > const&, std::vector<mxnet::NDArray, std::allocator<mxnet::NDArray> > const&)> const&, nnvm::Op const*, nnvm::NodeAttrs const&, mxnet::Context const&, std::vector<mxnet::engine::Var*, std::allocator<mxnet::engine::Var*> > const&, std::vector<mxnet::engine::Var*, std::allocator<mxnet::engine::Var*> > const&, std::vector<mxnet::Resource, std::allocator<mxnet::Resource> > const&, std::vector<mxnet::NDArray*, std::allocator<mxnet::NDArray*> > const&, std::vector<mxnet::NDArray*, std::allocator<mxnet::NDArray*> > const&, std::vector<mxnet::OpReqType, std::allocator<mxnet::OpReqType> > const&)::{lambda(mxnet::RunContext)#1}>::_M_invoke(std::_Any_data const&, mxnet::RunContext&&)+0x17) [0x7f91838680d7]
[2021-10-07T01:27:03.835Z]   [bt] (5) /work/mxnet/python/mxnet/../../lib/libmxnet.so(mxnet::imperative::PushFComputeEx(std::function<void (nnvm::NodeAttrs const&, mxnet::OpContext const&, std::vector<mxnet::NDArray, std::allocator<mxnet::NDArray> > const&, std::vector<mxnet::OpReqType, std::allocator<mxnet::OpReqType> > const&, std::vector<mxnet::NDArray, std::allocator<mxnet::NDArray> > const&)> const&, nnvm::Op const*, nnvm::NodeAttrs const&, mxnet::Context const&, std::vector<mxnet::engine::Var*, std::allocator<mxnet::engine::Var*> > const&, std::vector<mxnet::engine::Var*, std::allocator<mxnet::engine::Var*> > const&, std::vector<mxnet::Resource, std::allocator<mxnet::Resource> > const&, std::vector<mxnet::NDArray*, std::allocator<mxnet::NDArray*> > const&, std::vector<mxnet::NDArray*, std::allocator<mxnet::NDArray*> > const&, std::vector<mxnet::OpReqType, std::allocator<mxnet::OpReqType> > const&)::{lambda(mxnet::RunContext)#1}::operator()(mxnet::RunContext) const+0x293) [0x7f9183867f43]
[2021-10-07T01:27:03.835Z]   [bt] (4) /work/mxnet/python/mxnet/../../lib/libmxnet.so(+0x5658020) [0x7f91835bb020]
[2021-10-07T01:27:03.835Z]   [bt] (3) /work/mxnet/python/mxnet/../../lib/libmxnet.so(mxnet::MKLDNNRun(std::function<void (nnvm::NodeAttrs const&, mxnet::OpContext const&, std::vector<mxnet::NDArray, std::allocator<mxnet::NDArray> > const&, std::vector<mxnet::OpReqType, std::allocator<mxnet::OpReqType> > const&, std::vector<mxnet::NDArray, std::allocator<mxnet::NDArray> > const&)>, nnvm::NodeAttrs const&, mxnet::OpContext const&, std::vector<mxnet::NDArray, std::allocator<mxnet::NDArray> > const&, std::vector<mxnet::OpReqType, std::allocator<mxnet::OpReqType> > const&, std::vector<mxnet::NDArray, std::allocator<mxnet::NDArray> > const&)+0x264) [0x7f917f3eacf4]
[2021-10-07T01:27:03.835Z]   [bt] (2) /work/mxnet/python/mxnet/../../lib/libmxnet.so(mxnet::op::MKLDNNConvolutionForward(nnvm::NodeAttrs const&, mxnet::OpContext const&, std::vector<mxnet::NDArray, std::allocator<mxnet::NDArray> > const&, std::vector<mxnet::OpReqType, std::allocator<mxnet::OpReqType> > const&, std::vector<mxnet::NDArray, std::allocator<mxnet::NDArray> > const&)+0x4f0) [0x7f917f3d6280]
[2021-10-07T01:27:03.835Z]   [bt] (1) /work/mxnet/python/mxnet/../../lib/libmxnet.so(mxnet::op::MKLDNNConvolutionForwardFullFeature(mxnet::op::MKLDNNConvFullParam const&, mxnet::OpContext const&, mxnet::op::MKLDNNConvForward*, std::vector<mxnet::NDArray, std::allocator<mxnet::NDArray> > const&, std::vector<mxnet::OpReqType, std::allocator<mxnet::OpReqType> > const&, std::vector<mxnet::NDArray, std::allocator<mxnet::NDArray> > const&)+0x580) [0x7f917f3d5540]
[2021-10-07T01:27:03.835Z]   [bt] (0) /work/mxnet/python/mxnet/../../lib/libmxnet.so(dmlc::LogMessageFatal::~LogMessageFatal()+0x72) [0x7f917ea45852]
[2021-10-07T01:27:03.835Z]   File "src/operator/nn/mkldnn/mkldnn_convolution.cc", line 434
[2021-10-07T01:27:03.835Z] MXNetError: Check failed: weight_mem->get_desc() == fwd->GetPd().weights_desc(): 
[2021-10-07T01:27:03.835Z] -------------------- >> begin captured logging << --------------------
[2021-10-07T01:27:03.835Z] common: WARNING: Error seen with seeded test, use MXNET_TEST_SEED=1188622132 to reproduce.
[2021-10-07T01:27:03.835Z] --------------------- >> end captured logging << ---------------------
  2. MKLDNN_UTIL_FUNC.MemFormat
[2021-10-07T02:39:35.601Z] [----------] 2 tests from MKLDNN_UTIL_FUNC
[2021-10-07T02:39:35.601Z] [ RUN      ] MKLDNN_UTIL_FUNC.AlignMem
[2021-10-07T02:39:35.601Z] [       OK ] MKLDNN_UTIL_FUNC.AlignMem (1 ms)
[2021-10-07T02:39:35.601Z] [ RUN      ] MKLDNN_UTIL_FUNC.MemFormat
[2021-10-07T02:39:35.601Z] unknown file: Failure
[2021-10-07T02:39:35.601Z] C++ exception with description "[02:39:59] /work/mxnet/tests/cpp/operator/mkldnn_test.cc:103: Check failed: (dnnl_format_tag_last) == (222) 
[2021-10-07T02:39:35.601Z] 
[2021-10-07T02:39:35.601Z] " thrown in the test body.
[2021-10-07T02:39:35.601Z] [  FAILED  ] MKLDNN_UTIL_FUNC.MemFormat (0 ms)
[2021-10-07T02:39:35.601Z] [----------] 2 tests from MKLDNN_UTIL_FUNC (1 ms total)
  3. ThreadSafety.CachedOpFullModel
[2021-10-07T02:32:31.459Z] [ RUN      ] ThreadSafety.CachedOpFullModel
[2021-10-07T02:32:31.459Z] [02:32:53] src/nnvm/legacy_json_util.cc:208: Loading symbol saved by previous version v0.8.0. Attempting to upgrade...
[2021-10-07T02:32:31.459Z] [02:32:53] src/nnvm/legacy_json_util.cc:216: Symbol successfully upgraded!
[2021-10-07T02:32:34.725Z] terminate called after throwing an instance of 'dmlc::Error'
[2021-10-07T02:32:34.725Z]   what():  [02:32:57] tests/cpp/thread_safety/thread_safety_test.cc:314: MXNetError: Check failed: weight_mem->get_desc() == fwd->GetPd().weights_desc(): 
[2021-10-07T02:32:34.725Z] Stack trace:
[2021-10-07T02:32:34.725Z]   File "src/operator/nn/mkldnn/mkldnn_convolution.cc", line 434
[2021-10-07T02:32:34.725Z] 
[2021-10-07T02:32:34.725Z] 
[2021-10-07T02:32:34.725Z] 
[2021-10-07T02:32:34.725Z] /work/runtime_functions.sh: line 1306:  1730 Aborted                 (core dumped) build/tests/cpp/mxnet_unit_tests --gtest_filter="ThreadSafety.*"
  4. Windows build failures:
[2021-10-07T01:49:05.521Z] [748/749] Linking CXX shared library libmxnet.dll
[2021-10-07T01:49:05.521Z] FAILED: libmxnet.dll libmxnet.lib 
[2021-10-07T01:49:05.521Z] cmd.exe /C "cd . && "C:\Program Files (x86)\Microsoft Visual Studio\2019\Community\Common7\IDE\CommonExtensions\Microsoft\CMake\CMake\bin\cmake.exe" -E vs_link_dll --intdir=CMakeFiles\mxnet.dir --rc=C:\PROGRA~2\WI3CF2~1\10\bin\100162~1.0\x64\rc.exe --mt=C:\PROGRA~2\WI3CF2~1\10\bin\100162~1.0\x64\mt.exe --manifests  -- C:\PROGRA~2\MICROS~1\2019\COMMUN~1\VC\Tools\MSVC\1428~1.293\bin\Hostx64\x64\link.exe /nologo @CMakeFiles\mxnet.rsp  /out:libmxnet.dll /implib:libmxnet.lib /pdb:libmxnet.pdb /dll /version:0.0 /machine:x64  /INCREMENTAL:NO /OPT:REF /OPT:ICF  && cmd.exe /C "cd /D C:\jenkins_slave\workspace\build-cpu-mkldnn\build && "C:\Program Files (x86)\Microsoft Visual Studio\2019\Community\Common7\IDE\CommonExtensions\Microsoft\CMake\CMake\bin\cmake.exe" -E copy C:/jenkins_slave/workspace/build-cpu-mkldnn/build/3rdparty/mkldnn/include/oneapi/dnnl/dnnl_config.h C:/jenkins_slave/workspace/build-cpu-mkldnn/include/mkldnn/oneapi/dnnl/ && "C:\Program Files (x86)\Microsoft Visual Studio\2019\Community\Common7\IDE\CommonExtensions\Microsoft\CMake\CMake\bin\cmake.exe" -E copy C:/jenkins_slave/workspace/build-cpu-mkldnn/build/3rdparty/mkldnn/include/oneapi/dnnl/dnnl_version.h C:/jenkins_slave/workspace/build-cpu-mkldnn/include/mkldnn/oneapi/dnnl/""
[2021-10-07T01:49:05.521Z] LINK: command "C:\PROGRA~2\MICROS~1\2019\COMMUN~1\VC\Tools\MSVC\1428~1.293\bin\Hostx64\x64\link.exe /nologo @CMakeFiles\mxnet.rsp /out:libmxnet.dll /implib:libmxnet.lib /pdb:libmxnet.pdb /dll /version:0.0 /machine:x64 /INCREMENTAL:NO /OPT:REF /OPT:ICF /MANIFEST /MANIFESTFILE:libmxnet.dll.manifest" failed (exit code 1120) with the following output:
[2021-10-07T01:49:05.521Z]    Creating library libmxnet.lib and object libmxnet.exp
[2021-10-07T01:49:05.521Z] LINK : warning LNK4098: defaultlib 'MSVCRT' conflicts with use of other libs; use /NODEFAULTLIB:library
[2021-10-07T01:49:05.521Z] LINK : warning LNK4217: symbol '_wcsdup' defined in 'libucrt.lib(wcsdup.obj)' is imported by 'dnnl.lib(ittnotify_static.c.obj)' in function '__itt_domain_createW_init_3_0'
[2021-10-07T01:49:05.521Z] LINK : warning LNK4217: symbol 'strncpy_s' defined in 'libucrt.lib(strncpy_s.obj)' is imported by 'dnnl.lib(ittnotify_static.c.obj)' in function '__itt_get_groups'
[2021-10-07T01:49:05.521Z] LINK : warning LNK4217: symbol 'malloc' defined in 'libucrt.lib(malloc.obj)' is imported by 'dnnl.lib(ittnotify_static.c.obj)' in function '__itt_domain_createA_init_3_0'
[2021-10-07T01:49:05.521Z] dnnl.lib(ittnotify_static.c.obj) : error LNK2019: unresolved external symbol __imp__strdup referenced in function __itt_domain_createA_init_3_0
[2021-10-07T01:49:05.521Z] libmxnet.dll : fatal error LNK1120: 1 unresolved externals
[2021-10-07T01:49:05.521Z] ninja: build stopped: subcommand failed.
[2021-10-07T01:49:05.521Z] 2021-10-07 01:49:29,320 5 build(s) have failed
[2021-10-07T01:49:05.521Z] 2021-10-07 01:49:29,320 Build failed
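
For anyone trying to reproduce the three test failures locally, the following is a minimal sketch that assembles the rerun commands from details visible in the logs above: the `MXNET_TEST_SEED` value from the captured logging, the nose runner seen in the traceback, and the `mxnet_unit_tests` binary with `--gtest_filter` from the ThreadSafety abort line. The helper names `repro_commands` and `as_shell_line` are hypothetical, and the paths assume the commands run from the repository root with the C++ tests already built; the CI environment may set additional variables not shown here.

```python
import shlex

# Seed reported in the captured logging for test_hybrid_static_memory_switching.
REPORTED_SEED = 1188622132

# Binary path as it appears in the ThreadSafety abort line of the log.
# Assumption: the same binary also contains the MKLDNN_UTIL_FUNC tests.
CPP_TEST_BINARY = "build/tests/cpp/mxnet_unit_tests"


def repro_commands(seed=REPORTED_SEED):
    """Return (extra_env, argv) pairs, one per failing test in this report."""
    return [
        # nose selects a single test with the file.py:test_name syntax.
        (
            {"MXNET_TEST_SEED": str(seed)},
            [
                "nosetests",
                "--verbose",
                "tests/python/unittest/test_gluon.py:test_hybrid_static_memory_switching",
            ],
        ),
        ({}, [CPP_TEST_BINARY, "--gtest_filter=MKLDNN_UTIL_FUNC.MemFormat"]),
        ({}, [CPP_TEST_BINARY, "--gtest_filter=ThreadSafety.CachedOpFullModel"]),
    ]


def as_shell_line(extra_env, argv):
    """Render one pair as a copy-pasteable POSIX shell line."""
    env_part = " ".join(f"{k}={shlex.quote(v)}" for k, v in extra_env.items())
    cmd_part = " ".join(shlex.quote(a) for a in argv)
    return f"{env_part} {cmd_part}".strip()


if __name__ == "__main__":
    for extra_env, argv in repro_commands():
        print(as_shell_line(extra_env, argv))
```

Running the script only prints the command lines; it does not execute the tests, so it can be used on a machine without an MXNet build to prepare a reproduction script.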
