
Unrepeatable test_gluon_model_zoo_gpu.py:test_training CI failures seen. #9812

Closed
DickJC123 opened this issue Feb 16, 2018 · 3 comments

@DickJC123 (Contributor)

During development of the ci_test_randomness3 PR (#9791), failures were seen in test_gluon_model_zoo_gpu.py:test_training. The first failure occurred on the Python2: MKLDNN-GPU CI runner before the @with_seed() decorator had been added to the test, so no RNG seed information was recorded. After the @with_seed() decorator was added, a second failure (produced by seed 1521019752) was seen on the same runner. After that seed was hard-coded for the test, the test passed on all nodes. This suggests the problem is not data-related and is perhaps tied to the Python2 MKLDNN GPU implementation.
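
For context, a minimal sketch of the seeding setup, assuming the @with_seed() decorator from tests/python/unittest/common.py (a failing run can also be reproduced externally by setting the MXNET_TEST_SEED environment variable):

# Sketch only: with_seed() lives in tests/python/unittest/common.py.
from common import with_seed

# Normal CI mode: a random seed is chosen and logged on failure, so the run
# can be reproduced with MXNET_TEST_SEED=<seed>.
@with_seed()
def test_training():
    ...

# Debug mode: pin the decorator to the seed from the failing run.
@with_seed(1521019752)
def test_training():
    ...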

First test failure output:

test_gluon_model_zoo_gpu.test_training ... [04:13:14] src/io/iter_image_recordio_2.cc:170: ImageRecordIOParser2: data/val-5k-256.rec, use 1 threads for decoding..
testing resnet18_v1
[04:13:15] src/operator/nn/mkldnn/mkldnn_base.cc:60: Allocate 8028160 bytes with malloc directly
[04:13:16] src/operator/nn/./cudnn/./cudnn_algoreg-inl.h:107: Running performance tests to find the best convolution algorithm, this can take a while... (setting env variable MXNET_CUDNN_AUTOTUNE_DEFAULT to 0 to disable)
testing densenet121
[04:13:23] src/operator/nn/./cudnn/./cudnn_algoreg-inl.h:107: Running performance tests to find the best convolution algorithm, this can take a while... (setting env variable MXNET_CUDNN_AUTOTUNE_DEFAULT to 0 to disable)
FAIL

FAIL: test_gluon_model_zoo_gpu.test_training
----------------------------------------------------------------------
Traceback (most recent call last):
  File "/usr/local/lib/python2.7/dist-packages/nose/case.py", line 197, in runTest
    self.test(*self.arg)
  File "/workspace/tests/python/gpu/test_gluon_model_zoo_gpu.py", line 150, in test_training
    assert_almost_equal(cpu_out.asnumpy(), gpu_out.asnumpy(), rtol=1e-2, atol=1e-2)
  File "/workspace/python/mxnet/test_utils.py", line 493, in assert_almost_equal
    raise AssertionError(msg)
AssertionError: 
Items are not equal:
Error 60.428631 exceeds tolerance rtol=0.010000, atol=0.010000.  Location of maximum error:(6, 649), a=0.584048, b=-0.051143
 a: array([[-0.40942043, -0.5455089 , -0.26064384, ...,  0.33553356,
         0.5314904 , -0.15903676],
       [-0.6971618 , -0.3223077 , -0.7059576 , ...,  0.7106416 ,...
 b: array([[-0.40580893, -0.63151675, -0.37356558, ...,  0.36654586,
         0.43078798, -0.19291902],
       [-0.51749593, -0.26392186, -0.66467005, ...,  0.794114  ,...

Second test failure output:

test_gluon_model_zoo_gpu.test_training ... [09:40:45] src/io/iter_image_recordio_2.cc:170: ImageRecordIOParser2: data/val-5k-256.rec, use 1 threads for decoding..
testing resnet18_v1
[09:40:46] src/operator/nn/mkldnn/mkldnn_base.cc:60: Allocate 8028160 bytes with malloc directly
[09:40:47] src/operator/nn/./cudnn/./cudnn_algoreg-inl.h:107: Running performance tests to find the best convolution algorithm, this can take a while... (setting env variable MXNET_CUDNN_AUTOTUNE_DEFAULT to 0 to disable)
testing densenet121
[09:40:54] src/operator/nn/./cudnn/./cudnn_algoreg-inl.h:107: Running performance tests to find the best convolution algorithm, this can take a while... (setting env variable MXNET_CUDNN_AUTOTUNE_DEFAULT to 0 to disable)
[INFO] Setting test np/mx/python random seeds, use MXNET_TEST_SEED=1521019752 to reproduce.
FAIL

FAIL: test_gluon_model_zoo_gpu.test_training
----------------------------------------------------------------------
Traceback (most recent call last):

  File "/usr/local/lib/python2.7/dist-packages/nose/case.py", line 197, in runTest
    self.test(*self.arg)
  File "/workspace/tests/python/gpu/../unittest/common.py", line 155, in test_new
    orig_test(*args, **kwargs)
  File "/workspace/tests/python/gpu/test_gluon_model_zoo_gpu.py", line 156, in test_training
    assert_almost_equal(cpu_out.asnumpy(), gpu_out.asnumpy(), rtol=1e-2, atol=1e-2)
  File "/workspace/python/mxnet/test_utils.py", line 493, in assert_almost_equal
    raise AssertionError(msg)
AssertionError: 
Items are not equal:
Error 47.864170 exceeds tolerance rtol=0.010000, atol=0.010000.  Location of maximum error:(6, 311), a=-0.094315, b=0.737164
 a: array([[ 0.19677146,  0.15249339, -0.14161389, ..., -0.6827745 ,
         0.12698895, -0.08247809],
       [-0.01026695,  0.31750488, -0.14363009, ..., -0.7834535 ,...
 b: array([[ 0.05424175,  0.04719666, -0.09091276, ..., -0.7888349 ,
         0.11255977, -0.13169175],
       [ 0.0638914 ,  0.34906954, -0.02986413, ..., -0.7855257 ,...
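
For reference, the failing assertion compares the outputs of the same model run on CPU and GPU against the same data. A simplified, hypothetical sketch of that style of check (not the actual test body in test_gluon_model_zoo_gpu.py):

import mxnet as mx
from mxnet.gluon.model_zoo import vision
from mxnet.test_utils import assert_almost_equal

# Build one model, run the same batch on CPU and GPU, and require the
# outputs to agree within loose tolerances.
net = vision.resnet18_v1(pretrained=False)
net.initialize(mx.init.Xavier())
data = mx.nd.random.uniform(shape=(2, 3, 224, 224))

cpu_out = net(data.as_in_context(mx.cpu()))
net.collect_params().reset_ctx(mx.gpu(0))
gpu_out = net(data.as_in_context(mx.gpu(0)))

assert_almost_equal(cpu_out.asnumpy(), gpu_out.asnumpy(), rtol=1e-2, atol=1e-2)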

Git hash with the test hardcoded to the bad seed: 2ea19a2
Git hash of PR with the test open to random seeding (that will be printed if it fails): daceaca

Note that a related test, test_gluon_model_zoo_gpu.py:test_inference, is marked as skipped:
@unittest.skip("test fails intermittently. temporarily disabled.")

Should this test be disabled as well?
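
If so, applying the same decorator would look like this (sketch only):

import unittest

@unittest.skip("test fails intermittently. temporarily disabled.")
def test_training():
    ...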

@szha (Member) commented Feb 21, 2018

Ping @zheng-da

@zheng-da (Contributor)

This bug has been fixed by #9862. I think we can close the issue now.

@sandeep-krishnamurthy (Contributor)

Could not reproduce in 1000 runs. The issue was fixed by Da in PR #9862, which resolves an MKLDNN memory-layout race condition. Verified that the 50 most recent CI failures do not show this issue.
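
For reference, a repeated-run check along these lines (a hypothetical driver script; paths and the nosetests runner are assumed) can be used to look for this kind of flakiness:

import os
import subprocess

# Hypothetical driver: run the test repeatedly and stop at the first failure.
# Drop the MXNET_TEST_SEED entry to let each run pick a fresh random seed.
env = dict(os.environ, MXNET_TEST_SEED="1521019752")
for i in range(1000):
    ret = subprocess.call(
        ["nosetests", "-v",
         "tests/python/gpu/test_gluon_model_zoo_gpu.py:test_training"],
        env=env)
    if ret != 0:
        print("failure on run %d" % i)
        break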

Please reopen if the issue persists.
