Unit test on shared memory fails nondeterministically #755

Closed
BarclayII opened this issue Aug 11, 2019 · 6 comments

BarclayII (Collaborator) commented Aug 11, 2019

🐛 Bug

test_shared_mem_store may occasionally fail on CI:

test_shared_mem_store.test_init ... /var/jenkins_home/workspace/DGL_PR-752@2/python/dgl/base.py:18: UserWarning: Initializer is not set. Use zero initializer instead. To suppress this warning, use `set_initializer` to explicitly specify which initializer to use.
  warnings.warn(msg, warn_type)
/var/jenkins_home/workspace/DGL_PR-752@2/python/dgl/base.py:18: UserWarning: It may not be safe to access node data of all nodes.It's recommended to node data of a subset of nodes directly.
  warnings.warn(msg, warn_type)
/var/jenkins_home/workspace/DGL_PR-752@2/python/dgl/base.py:18: UserWarning: It may not be safe to access edge data of all edges.It's recommended to edge data of a subset of edges directly.
  warnings.warn(msg, warn_type)
[06:09:05] /var/jenkins_home/workspace/DGL_PR-752@2/src/runtime/shared_mem.cc:32: remove /test_graph1_node_feat for shared memory
[06:09:05] /var/jenkins_home/workspace/DGL_PR-752@2/src/runtime/shared_mem.cc:32: remove /test_graph1_node_test4 for shared memory
[06:09:05] /var/jenkins_home/workspace/DGL_PR-752@2/src/runtime/shared_mem.cc:32: remove /test_graph1_in for shared memory
[06:09:05] /var/jenkins_home/workspace/DGL_PR-752@2/src/runtime/shared_mem.cc:32: remove /test_graph1_edge_feat for shared memory
[06:09:05] /var/jenkins_home/workspace/DGL_PR-752@2/src/runtime/shared_mem.cc:32: remove /test_graph1_edge_test4 for shared memory
/var/jenkins_home/workspace/DGL_PR-752@2/python/dgl/base.py:18: UserWarning: Initializer is not set. Use zero initializer instead. To suppress this warning, use `set_initializer` to explicitly specify which initializer to use.
  warnings.warn(msg, warn_type)
/var/jenkins_home/workspace/DGL_PR-752@2/python/dgl/base.py:18: UserWarning: It may not be safe to access node data of all nodes.It's recommended to node data of a subset of nodes directly.
  warnings.warn(msg, warn_type)
/var/jenkins_home/workspace/DGL_PR-752@2/python/dgl/base.py:18: UserWarning: It may not be safe to access edge data of all edges.It's recommended to edge data of a subset of edges directly.
  warnings.warn(msg, warn_type)

Arrays are not almost equal to 7 decimals

(mismatch 100.0%)
 x: array([20., 20., 20., 20., 20., 20., 20., 20., 20., 20.], dtype=float32)
 y: array(11)
Traceback (most recent call last):
  File "/var/jenkins_home/workspace/DGL_PR-752@2/tests/distributed/test_shared_mem_store.py", line 65, in check_init_func
    check_array_shared_memory(g, worker_id, [g.nodes[:].data['test4'], g.edges[:].data['test4']])
  File "/var/jenkins_home/workspace/DGL_PR-752@2/tests/distributed/test_shared_mem_store.py", line 28, in check_array_shared_memory
    assert_almost_equal(F.asnumpy(arr[0]), i + 10)
  File "/usr/local/lib/python3.5/dist-packages/numpy/testing/nose_tools/utils.py", line 567, in assert_almost_equal
    return assert_array_almost_equal(actual, desired, decimal, err_msg)
  File "/usr/local/lib/python3.5/dist-packages/numpy/testing/nose_tools/utils.py", line 965, in assert_array_almost_equal
    precision=decimal)
  File "/usr/local/lib/python3.5/dist-packages/numpy/testing/nose_tools/utils.py", line 781, in assert_array_compare
    raise AssertionError(msg)
AssertionError: 
Arrays are not almost equal to 7 decimals

(mismatch 100.0%)
 x: array([20., 20., 20., 20., 20., 20., 20., 20., 20., 20.], dtype=float32)
 y: array(11)
FAIL

Also confirmed by @yzh119

BarclayII reopened this issue Aug 11, 2019
jermainewang (Member) commented

@aksnzhy would you please have a look?

jermainewang added the bug:confirmed (Something isn't working) label Aug 16, 2019
aksnzhy (Contributor) commented Aug 16, 2019

Sure, I will check this error.

jermainewang (Member) commented Aug 21, 2019

@aksnzhy any progress? We might need to temporarily disable the test case since it's too much of a nuisance.

aksnzhy (Contributor) commented Aug 22, 2019

@jermainewang It's very strange, but I cannot reproduce this bug on my dev machine. @zheng-da, could you give some advice? I'm not very familiar with the graph_store code.

jermainewang (Member) commented

Try using the Docker image directly. It should reliably reproduce the problem.

classicsong (Contributor) commented Sep 1, 2019

I just went through tests/distributed/test_shared_mem_store.py. In the buggy test_init():

def test_init():
    manager = Manager()
    return_dict = manager.dict()
    serv_p = Process(target=server_func, args=(2, 'test_graph1'))
    work_p1 = Process(target=check_init_func, args=(0, 'test_graph1', return_dict))
    work_p2 = Process(target=check_init_func, args=(1, 'test_graph1', return_dict))
    serv_p.start()
    work_p1.start()
    work_p2.start()
    serv_p.join()
    work_p1.join()
    work_p2.join()

The execution order of work_p1 and work_p2 is nondeterministic, even on a single-core machine. Before

g.init_ndata('test4', (g.number_of_nodes(), 10), 'float32')
g.init_edata('test4', (g.number_of_edges(), 10), 'float32')
g._sync_barrier(60)

the workers are synced at g._sync_barrier(60), but the code after that barrier is not synced. It is possible that
else:
    g._sync_barrier(60)
    for i, arr in enumerate(arrays):
        assert_almost_equal(F.asnumpy(arr[0]), i + 10)

executes later than
data = g.edges[:].data['test4']
g.set_e_repr({'test4': F.ones((1, 10)) * 20}, edges=[0])
assert_almost_equal(F.asnumpy(data[0]), np.squeeze(F.asnumpy(g.edges[0].data['test4'])))

which may cause the problem.

To reproduce the bug deterministically, add time.sleep(3) to L27 as follows:

def check_array_shared_memory(g, worker_id, arrays):
    if worker_id == 0:
        for i, arr in enumerate(arrays):
            arr[0] = i + 10
        g._sync_barrier(60)
    else:
        g._sync_barrier(60)
        time.sleep(3)
        for i, arr in enumerate(arrays):
            assert_almost_equal(F.asnumpy(arr[0]), i + 10)
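
As a side note, here is a minimal sketch of one way to close the race (not necessarily how the actual fix is implemented, and assuming g._sync_barrier can be reached a second time by both workers): add a second barrier so that worker 0 cannot run ahead and overwrite the shared arrays before worker 1 has finished its assertions.

def check_array_shared_memory(g, worker_id, arrays):
    if worker_id == 0:
        for i, arr in enumerate(arrays):
            arr[0] = i + 10
        g._sync_barrier(60)   # publish the writes to the other worker
        g._sync_barrier(60)   # wait until the other worker has checked them
    else:
        g._sync_barrier(60)   # wait for the writer to finish writing
        for i, arr in enumerate(arrays):
            assert_almost_equal(F.asnumpy(arr[0]), i + 10)
        g._sync_barrier(60)   # release the writer to continue

With the second barrier in place, the later set_e_repr call that overwrites 'test4' with 20 can only run after both workers have passed the checks above.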

@zheng-da @jermainewang

classicsong pushed a commit to classicsong/dgl that referenced this issue Sep 2, 2019
Fix concurrency bug reported at #755. Also make test_shared_mem_store.py more deterministic.
zheng-da pushed a commit that referenced this issue Oct 8, 2019
zheng-da pushed a commit that referenced this issue Oct 11, 2019
zheng-da pushed a commit that referenced this issue Oct 11, 2019
jermainewang pushed a commit that referenced this issue Oct 30, 2019