This repository has been archived by the owner on Nov 17, 2023. It is now read-only.

[v1.x] V1.x CD blocked by test_gluon_data.test_list_dataset error #19918

Closed
Zha0q1 opened this issue Feb 18, 2021 · 5 comments

Comments

@Zha0q1
Contributor

Zha0q1 commented Feb 18, 2021

https://jenkins.mxnet-ci.amazon-ml.com/blue/organizations/jenkins/restricted-mxnet-cd%2Fmxnet-cd-release-job-1.x/detail/mxnet-cd-release-job-1.x/1530/pipeline/222

```
[2021-02-18T20:51:13.065Z] ERROR: test_gluon_data.test_list_dataset
[2021-02-18T20:51:13.065Z] ----------------------------------------------------------------------
[2021-02-18T20:51:13.065Z] Traceback (most recent call last):
[2021-02-18T20:51:13.065Z]   File "/usr/local/lib/python3.7/dist-packages/nose/case.py", line 198, in runTest
[2021-02-18T20:51:13.065Z]     self.test(*self.arg)
[2021-02-18T20:51:13.065Z]   File "/work/mxnet/tests/python/unittest/common.py", line 226, in test_new
[2021-02-18T20:51:13.065Z]     mx.nd.waitall()
[2021-02-18T20:51:13.065Z]   File "/work/mxnet/python/mxnet/ndarray/ndarray.py", line 211, in waitall
[2021-02-18T20:51:13.065Z]     check_call(_LIB.MXNDArrayWaitAll())
[2021-02-18T20:51:13.065Z]   File "/work/mxnet/python/mxnet/base.py", line 246, in check_call
[2021-02-18T20:51:13.065Z]     raise get_last_ffi_error()
[2021-02-18T20:51:13.065Z] mxnet.base.MXNetError: Traceback (most recent call last):
[2021-02-18T20:51:13.065Z]   [bt] (9) /work/mxnet/python/mxnet/../../lib/libmxnet.so(mxnet::engine::ThreadedEngine::PushAsync(std::function<void (mxnet::RunContext, mxnet::engine::CallbackOnComplete)>, mxnet::Context, std::vector<mxnet::engine::Var*, std::allocator<mxnet::engine::Var*> > const&, std::vector<mxnet::engine::Var*, std::allocator<mxnet::engine::Var*> > const&, mxnet::FnProperty, int, char const*, bool)+0xd8) [0x7fe50d2a63c8]
[2021-02-18T20:51:13.065Z]   [bt] (8) /work/mxnet/python/mxnet/../../lib/libmxnet.so(mxnet::engine::ThreadedEngine::Push(mxnet::engine::Opr*, mxnet::Context, int, bool)+0x3be) [0x7fe50d2b1c8e]
[2021-02-18T20:51:13.065Z]   [bt] (7) /work/mxnet/python/mxnet/../../lib/libmxnet.so(mxnet::engine::ThreadedEnginePerDevice::PushToExecute(mxnet::engine::OprBlock*, bool)+0x372) [0x7fe50d2be602]
[2021-02-18T20:51:13.065Z]   [bt] (6) /work/mxnet/python/mxnet/../../lib/libmxnet.so(mxnet::engine::ThreadedEngine::ExecuteOprBlock(mxnet::RunContext, mxnet::engine::OprBlock*)+0x621) [0x7fe50d2b4c11]
[2021-02-18T20:51:13.065Z]   [bt] (5) /work/mxnet/python/mxnet/../../lib/libmxnet.so(+0x5afd55b) [0x7fe50d2a655b]
[2021-02-18T20:51:13.065Z]   [bt] (4) /work/mxnet/python/mxnet/../../lib/libmxnet.so(+0x5cf9b40) [0x7fe50d4a2b40]
[2021-02-18T20:51:13.065Z]   [bt] (3) /work/mxnet/python/mxnet/../../lib/libmxnet.so(mxnet::StorageImpl::Free(mxnet::Storage::Handle)+0x74) [0x7fe50dc111e4]
[2021-02-18T20:51:13.065Z]   [bt] (2) /work/mxnet/python/mxnet/../../lib/libmxnet.so(mxnet::storage::CPUSharedStorageManager::Free(mxnet::Storage::Handle)+0xe3) [0x7fe50dc0be93]
[2021-02-18T20:51:13.065Z]   [bt] (1) /work/mxnet/python/mxnet/../../lib/libmxnet.so(mxnet::storage::CPUSharedStorageManager::FreeImpl(mxnet::Storage::Handle const&)+0xc0) [0x7fe50dc0bad0]
[2021-02-18T20:51:13.065Z]   [bt] (0) /work/mxnet/python/mxnet/../../lib/libmxnet.so(+0xc981a8) [0x7fe5084411a8]
[2021-02-18T20:51:13.065Z]   File "src/storage/./cpu_shared_storage_manager.h", line 218
[2021-02-18T20:51:13.065Z] MXNetError: Check failed: count >= 0 (-3 vs. 0) :
```

@leezu @mseth10 @josephevans Have you seen this before?

@mseth10
Contributor

mseth10 commented Feb 18, 2021

@Zha0q1 I haven't seen this error before. It looks like it's fixed in the default flavor (MKLDNN=1) of MXNet, as it fails for "native" but passes in the "cpu" pipeline.

@Zha0q1
Contributor Author

Zha0q1 commented Feb 18, 2021

> @Zha0q1 I haven't seen this error before. It looks like it's fixed in the default flavor (MKLDNN=1) of MXNet, as it fails for "native" but passes in the "cpu" pipeline.

I think it's a flaky test. It also fails on cu101.

In this run from three days earlier, native also passed:
https://jenkins.mxnet-ci.amazon-ml.com/blue/organizations/jenkins/restricted-mxnet-cd%2Fmxnet-cd-release-job-1.x/detail/mxnet-cd-release-job-1.x/1487/pipeline

@ptrendx
Member

ptrendx commented Feb 19, 2021

We have seen this error before. The problem happens when you create a new data loader while the previous data loader is not yet fully destroyed (including the data it produced in shared memory). Workers in the new data loader inherit those shared-memory NDArrays without increasing the usage counter that lives in the shared-memory region itself, so once Python's garbage collector decides to destroy them, they decrement the usage counter, and it gets decremented too many times. Two things can happen then: either the workers destroy the NDArray and the main process gets this error, or the main process does it first, in which case the workers get the error and crash, which results in a hang.
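
A hypothetical minimal sketch of that failure pattern (illustrative names, not the actual test code):

```python
# Hypothetical sketch of the failure pattern described above, not the
# actual test code: two multi-worker DataLoaders created back to back,
# where the second forks its workers before the first (and the
# shared-memory batches it produced) has been garbage-collected.
import mxnet as mx
from mxnet.gluon.data import ArrayDataset, DataLoader

dataset = ArrayDataset(mx.nd.random.uniform(shape=(100, 3)))

loader1 = DataLoader(dataset, batch_size=10, num_workers=2)
for batch in loader1:
    pass  # batches produced by the workers live in shared memory

# If loader1's shared-memory NDArrays are still alive here, the workers
# forked by loader2 inherit them without bumping the usage counter in
# the shared-memory region, so later GC passes (in the workers and in
# the main process) can decrement that counter below zero.
loader2 = DataLoader(dataset, batch_size=10, num_workers=2)
for batch in loader2:
    pass
```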

I made a small workaround for this in our container by inserting a waitall and forcing a Python GC before the fork. I will make a PR with this workaround tomorrow.
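
A minimal sketch of the waitall-plus-GC idea, applied from user code for illustration (the actual workaround lives in the container, as described above):

```python
# Sketch of the workaround: drain the engine and force a GC pass before
# the next loader forks its workers, so they cannot inherit stale
# shared-memory NDArrays from the previous loader.
import gc

import mxnet as mx
from mxnet.gluon.data import ArrayDataset, DataLoader

dataset = ArrayDataset(mx.nd.random.uniform(shape=(100, 3)))

loader1 = DataLoader(dataset, batch_size=10, num_workers=2)
for batch in loader1:
    pass
del loader1

mx.nd.waitall()  # wait for all pending engine work (including frees)
gc.collect()     # destroy any lingering shared-memory NDArrays now

loader2 = DataLoader(dataset, batch_size=10, num_workers=2)
for batch in loader2:
    pass
```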

@Zha0q1
Contributor Author

Zha0q1 commented Feb 19, 2021

> We have seen this error before. The problem happens when you create a new data loader while the previous data loader is not yet fully destroyed (including the data it produced in shared memory). Workers in the new data loader inherit those shared-memory NDArrays without increasing the usage counter that lives in the shared-memory region itself, so once Python's garbage collector decides to destroy them, they decrement the usage counter, and it gets decremented too many times. Two things can happen then: either the workers destroy the NDArray and the main process gets this error, or the main process does it first, in which case the workers get the error and crash, which results in a hang.
>
> I made a small workaround for this in our container by inserting a waitall and forcing a Python GC before the fork. I will make a PR with this workaround tomorrow.

That's great, thanks!

@Zha0q1
Contributor Author

Zha0q1 commented Feb 22, 2021

Should be fixed now, thanks @ptrendx!

@Zha0q1 Zha0q1 closed this as completed Feb 22, 2021