Unit test on shared memory fails nondeterministically #755
Comments
@aksnzhy would you please have a look?
Sure, I will check this error.
@aksnzhy any progress? We might need to temporarily disable the test case since it's too much of a nuisance.
@jermainewang It's very strange that I cannot reproduce this bug on my dev machine. @zheng-da Could you give some advice? I'm not very familiar with the graph_store code.
Try using the Docker image directly. It should reliably reproduce the problem.
Just go through tests/distributed/test_shared_mem_store.py. In the buggy test_init() (dgl/tests/distributed/test_shared_mem_store.py, lines 105 to 116 in c4989b4), the execution order of work_p1 and work_p2 is undetermined, even on a single-core machine.
Before dgl/tests/distributed/test_shared_mem_store.py lines 68 to 70 in c4989b4, the workers are synced at g._sync_barrier(60), but the execution after that point is not synced. It is therefore possible that dgl/tests/distributed/test_shared_mem_store.py lines 31 to 34 in c4989b4 execute later than dgl/tests/distributed/test_shared_mem_store.py lines 77 to 79 in c4989b4, which may cause the problem.
To reproduce the bug deterministically, add time.sleep(3) at L27 of the test file.
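For anyone who wants to see this kind of race in isolation, here is a minimal sketch. It is not the DGL test code: it uses threads rather than processes, threading.Barrier stands in for g._sync_barrier(60), and the names run, writer, reader and the extra_barrier flag are all hypothetical. Which side of the real test writes and which side reads is also an assumption here. The sleep plays the same role as the time.sleep(3) suggested above: it forces one side to run late so the stale read shows up reliably.

```python
import threading
import time


def run(extra_barrier, delay=0.3):
    """Return the value the reader observes after the initial barrier."""
    shared = {"value": 0}
    start = threading.Barrier(2)    # stand-in for the existing sync point
    written = threading.Barrier(2)  # hypothetical second sync point
    seen = []

    def writer():
        start.wait()
        time.sleep(delay)           # force the write to happen late
        shared["value"] = 1
        if extra_barrier:
            written.wait()          # signal that the write is done

    def reader():
        start.wait()
        if extra_barrier:
            written.wait()          # wait until the write is done
        seen.append(shared["value"])

    threads = [threading.Thread(target=writer), threading.Thread(target=reader)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return seen[0]


if __name__ == "__main__":
    print("only the initial barrier:", run(extra_barrier=False))
    print("with a second barrier:   ", run(extra_barrier=True))
```

With only the initial barrier the reader normally sees the stale value, because nothing orders the read after the write. With the second barrier the read always happens after the write, which is the kind of extra synchronization that would make the test deterministic.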
Also make test_shared_mem_store.py more deterministic.
* upd * fig edgebatch edges * add test * trigger * Update README.md for pytorch PinSage example. Add noting that the PinSage model example under example/pytorch/recommendation only work with Python 3.6+ as its dataset loader depends on stanfordnlp package which work only with Python 3.6+. * Provid a frame agnostic API to test nn modules on both CPU and CUDA side. 1. make dgl.nn.xxx frame agnostic 2. make test.backend include dgl.nn modules 3. modify test_edge_softmax of test/mxnet/test_nn.py and test/pytorch/test_nn.py work on both CPU and GPU * Fix style * Delete unused code * Make agnostic test only related to tests/backend 1. clear all agnostic related code in dgl.nn 2. make test_graph_conv agnostic to cpu/gpu * Fix code style * fix * doc * Make all test code under tests.mxnet/pytorch.test_nn.py work on both CPU and GPU. * Fix syntex * Remove rand * Add TAGCN nn.module and example * Now tagcn can run on CPU. * Add unitest for TGConv * Fix style * For pubmed dataset, using --lr=0.005 can achieve better acc * Fix style * Fix some descriptions * trigger * Fix doc * Add nn.TGConv and example * Fix bug * Update data in mxnet.tagcn test acc. * Fix some comments and code * delete useless code * Fix namming * Fix bug * Fix bug * Add test for mxnet TAGCov * Add test code for mxnet TAGCov * Update some docs * Fix some code * Update docs dgl.nn.mxnet * Update weight init * Fix * reproduce the bug * Fix concurrency bug reported at #755. Also make test_shared_mem_store.py more deterministic. * Update test_shared_mem_store.py * Update dmlc/core
🐛 Bug
test_shared_mem_store may occasionally fail on CI. Also confirmed by @yzh119.