Unit test on shared memory fails nondeterministically #755

Closed
BarclayII opened this issue Aug 11, 2019 · 6 comments

BarclayII (Collaborator) commented Aug 11, 2019

🐛 Bug

test_shared_mem_store may occasionally fail on CI:

test_shared_mem_store.test_init ... /var/jenkins_home/workspace/DGL_PR-752@2/python/dgl/base.py:18: UserWarning: Initializer is not set. Use zero initializer instead. To suppress this warning, use `set_initializer` to explicitly specify which initializer to use.
  warnings.warn(msg, warn_type)
/var/jenkins_home/workspace/DGL_PR-752@2/python/dgl/base.py:18: UserWarning: It may not be safe to access node data of all nodes.It's recommended to node data of a subset of nodes directly.
  warnings.warn(msg, warn_type)
/var/jenkins_home/workspace/DGL_PR-752@2/python/dgl/base.py:18: UserWarning: It may not be safe to access edge data of all edges.It's recommended to edge data of a subset of edges directly.
  warnings.warn(msg, warn_type)
[06:09:05] /var/jenkins_home/workspace/DGL_PR-752@2/src/runtime/shared_mem.cc:32: remove /test_graph1_node_feat for shared memory
[06:09:05] /var/jenkins_home/workspace/DGL_PR-752@2/src/runtime/shared_mem.cc:32: remove /test_graph1_node_test4 for shared memory
[06:09:05] /var/jenkins_home/workspace/DGL_PR-752@2/src/runtime/shared_mem.cc:32: remove /test_graph1_in for shared memory
[06:09:05] /var/jenkins_home/workspace/DGL_PR-752@2/src/runtime/shared_mem.cc:32: remove /test_graph1_edge_feat for shared memory
[06:09:05] /var/jenkins_home/workspace/DGL_PR-752@2/src/runtime/shared_mem.cc:32: remove /test_graph1_edge_test4 for shared memory
/var/jenkins_home/workspace/DGL_PR-752@2/python/dgl/base.py:18: UserWarning: Initializer is not set. Use zero initializer instead. To suppress this warning, use `set_initializer` to explicitly specify which initializer to use.
  warnings.warn(msg, warn_type)
/var/jenkins_home/workspace/DGL_PR-752@2/python/dgl/base.py:18: UserWarning: It may not be safe to access node data of all nodes.It's recommended to node data of a subset of nodes directly.
  warnings.warn(msg, warn_type)
/var/jenkins_home/workspace/DGL_PR-752@2/python/dgl/base.py:18: UserWarning: It may not be safe to access edge data of all edges.It's recommended to edge data of a subset of edges directly.
  warnings.warn(msg, warn_type)

Arrays are not almost equal to 7 decimals

(mismatch 100.0%)
 x: array([20., 20., 20., 20., 20., 20., 20., 20., 20., 20.], dtype=float32)
 y: array(11)
Traceback (most recent call last):
  File "/var/jenkins_home/workspace/DGL_PR-752@2/tests/distributed/test_shared_mem_store.py", line 65, in check_init_func
    check_array_shared_memory(g, worker_id, [g.nodes[:].data['test4'], g.edges[:].data['test4']])
  File "/var/jenkins_home/workspace/DGL_PR-752@2/tests/distributed/test_shared_mem_store.py", line 28, in check_array_shared_memory
    assert_almost_equal(F.asnumpy(arr[0]), i + 10)
  File "/usr/local/lib/python3.5/dist-packages/numpy/testing/nose_tools/utils.py", line 567, in assert_almost_equal
    return assert_array_almost_equal(actual, desired, decimal, err_msg)
  File "/usr/local/lib/python3.5/dist-packages/numpy/testing/nose_tools/utils.py", line 965, in assert_array_almost_equal
    precision=decimal)
  File "/usr/local/lib/python3.5/dist-packages/numpy/testing/nose_tools/utils.py", line 781, in assert_array_compare
    raise AssertionError(msg)
AssertionError: 
Arrays are not almost equal to 7 decimals

(mismatch 100.0%)
 x: array([20., 20., 20., 20., 20., 20., 20., 20., 20., 20.], dtype=float32)
 y: array(11)
FAIL

Also confirmed by @yzh119

BarclayII reopened this issue Aug 11, 2019
jermainewang (Member) commented

@aksnzhy would you please have a look?

jermainewang added the bug:confirmed (Something isn't working) label Aug 16, 2019
aksnzhy (Contributor) commented Aug 16, 2019

Sure, I will check this error.

jermainewang (Member) commented Aug 21, 2019

@aksnzhy any progress? We might need to temporarily disable the test case since it's too much of a nuisance.

aksnzhy (Contributor) commented Aug 22, 2019

@jermainewang It's very strange, but I cannot reproduce this bug on my dev machine. @zheng-da, could you give some advice? I'm not very familiar with the graph_store code.

jermainewang (Member) commented

Try using the Docker image directly. It should reliably reproduce the problem.

classicsong (Contributor) commented Sep 1, 2019

I just went through tests/distributed/test_shared_mem_store.py. In the buggy test_init():

def test_init():
    manager = Manager()
    return_dict = manager.dict()
    serv_p = Process(target=server_func, args=(2, 'test_graph1'))
    work_p1 = Process(target=check_init_func, args=(0, 'test_graph1', return_dict))
    work_p2 = Process(target=check_init_func, args=(1, 'test_graph1', return_dict))
    serv_p.start()
    work_p1.start()
    work_p2.start()
    serv_p.join()
    work_p1.join()
    work_p2.join()

The execution order of work_p1 and work_p2 is nondeterministic, even on a single-core machine. Before

g.init_ndata('test4', (g.number_of_nodes(), 10), 'float32')
g.init_edata('test4', (g.number_of_edges(), 10), 'float32')
g._sync_barrier(60)

the workers are synced at g._sync_barrier(60), but the code after that barrier is not synced. It is possible that
else:
    g._sync_barrier(60)
    for i, arr in enumerate(arrays):
        assert_almost_equal(F.asnumpy(arr[0]), i + 10)

executes later than
data = g.edges[:].data['test4']
g.set_e_repr({'test4': F.ones((1, 10)) * 20}, edges=[0])
assert_almost_equal(F.asnumpy(data[0]), np.squeeze(F.asnumpy(g.edges[0].data['test4'])))

which may cause the problem.

To reproduce the bug deterministically, add time.sleep(3) to L27 as follows:

def check_array_shared_memory(g, worker_id, arrays):
    if worker_id == 0:
        for i, arr in enumerate(arrays):
            arr[0] = i + 10
        g._sync_barrier(60)
    else:
        g._sync_barrier(60)
        time.sleep(3)
        for i, arr in enumerate(arrays):
            assert_almost_equal(F.asnumpy(arr[0]), i + 10)
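
As a side note, here is a minimal sketch of one way to close the race (not necessarily how the actual fix is implemented, and assuming g._sync_barrier can be reached a second time by both workers): add a second barrier so that worker 0 cannot run ahead and overwrite the shared arrays before worker 1 has finished its assertions.

def check_array_shared_memory(g, worker_id, arrays):
    if worker_id == 0:
        for i, arr in enumerate(arrays):
            arr[0] = i + 10
        g._sync_barrier(60)   # publish the writes to the other worker
        g._sync_barrier(60)   # wait until the other worker has checked them
    else:
        g._sync_barrier(60)   # wait for the writer to finish writing
        for i, arr in enumerate(arrays):
            assert_almost_equal(F.asnumpy(arr[0]), i + 10)
        g._sync_barrier(60)   # release the writer to continue

With the second barrier in place, the later set_e_repr call that overwrites 'test4' with 20 can only run after both workers have passed the checks above.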

@zheng-da @jermainewang

classicsong pushed a commit to classicsong/dgl that referenced this issue Sep 2, 2019
Fix concurrency bug reported at #755. Also make test_shared_mem_store.py more deterministic.
zheng-da pushed a commit that referenced this issue Oct 8, 2019
zheng-da pushed a commit that referenced this issue Oct 11, 2019
zheng-da pushed a commit that referenced this issue Oct 11, 2019
jermainewang pushed a commit that referenced this issue Oct 30, 2019