
Unrepeatable test_gluon_model_zoo_gpu.py:test_training CI failures seen. #9812

Closed
DickJC123 opened this issue Feb 16, 2018 · 3 comments

@DickJC123 (Contributor)

During development of the ci_test_randomness3 PR (#9791), failures were seen in test_gluon_model_zoo_gpu.py:test_training. The first failure occurred on the Python2: MKLDNN-GPU CI runner before the @with_seed() decorator had been added to the test, so no RNG seed information was recorded. After the @with_seed() decorator was added, a second failure (produced by seed 1521019752) was seen on the same runner. After that seed was hard-coded for the test, the test passed on all nodes. This suggests the problem is not data-related and is perhaps tied to the Python2 MKLDNN GPU implementation.
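
For context, a minimal sketch of the seeding setup, assuming the @with_seed() decorator from tests/python/unittest/common.py (a failing run can also be reproduced externally by setting the MXNET_TEST_SEED environment variable):

# Sketch only: with_seed() lives in tests/python/unittest/common.py.
from common import with_seed

# Normal CI mode: a random seed is chosen and logged on failure, so the run
# can be reproduced with MXNET_TEST_SEED=<seed>.
@with_seed()
def test_training():
    ...

# Debug mode: pin the decorator to the seed from the failing run.
@with_seed(1521019752)
def test_training():
    ...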

First test failure output:

test_gluon_model_zoo_gpu.test_training ... [04:13:14] src/io/iter_image_recordio_2.cc:170: ImageRecordIOParser2: data/val-5k-256.rec, use 1 threads for decoding..
testing resnet18_v1
[04:13:15] src/operator/nn/mkldnn/mkldnn_base.cc:60: Allocate 8028160 bytes with malloc directly
[04:13:16] src/operator/nn/./cudnn/./cudnn_algoreg-inl.h:107: Running performance tests to find the best convolution algorithm, this can take a while... (setting env variable MXNET_CUDNN_AUTOTUNE_DEFAULT to 0 to disable)
testing densenet121
[04:13:23] src/operator/nn/./cudnn/./cudnn_algoreg-inl.h:107: Running performance tests to find the best convolution algorithm, this can take a while... (setting env variable MXNET_CUDNN_AUTOTUNE_DEFAULT to 0 to disable)
FAIL

FAIL: test_gluon_model_zoo_gpu.test_training
----------------------------------------------------------------------
Traceback (most recent call last):
  File "/usr/local/lib/python2.7/dist-packages/nose/case.py", line 197, in runTest
    self.test(*self.arg)
  File "/workspace/tests/python/gpu/test_gluon_model_zoo_gpu.py", line 150, in test_training
    assert_almost_equal(cpu_out.asnumpy(), gpu_out.asnumpy(), rtol=1e-2, atol=1e-2)
  File "/workspace/python/mxnet/test_utils.py", line 493, in assert_almost_equal
    raise AssertionError(msg)
AssertionError: 
Items are not equal:
Error 60.428631 exceeds tolerance rtol=0.010000, atol=0.010000.  Location of maximum error:(6, 649), a=0.584048, b=-0.051143
 a: array([[-0.40942043, -0.5455089 , -0.26064384, ...,  0.33553356,
         0.5314904 , -0.15903676],
       [-0.6971618 , -0.3223077 , -0.7059576 , ...,  0.7106416 ,...
 b: array([[-0.40580893, -0.63151675, -0.37356558, ...,  0.36654586,
         0.43078798, -0.19291902],
       [-0.51749593, -0.26392186, -0.66467005, ...,  0.794114  ,...

Second test failure output:

test_gluon_model_zoo_gpu.test_training ... [09:40:45] src/io/iter_image_recordio_2.cc:170: ImageRecordIOParser2: data/val-5k-256.rec, use 1 threads for decoding..
testing resnet18_v1
[09:40:46] src/operator/nn/mkldnn/mkldnn_base.cc:60: Allocate 8028160 bytes with malloc directly
[09:40:47] src/operator/nn/./cudnn/./cudnn_algoreg-inl.h:107: Running performance tests to find the best convolution algorithm, this can take a while... (setting env variable MXNET_CUDNN_AUTOTUNE_DEFAULT to 0 to disable)
testing densenet121
[09:40:54] src/operator/nn/./cudnn/./cudnn_algoreg-inl.h:107: Running performance tests to find the best convolution algorithm, this can take a while... (setting env variable MXNET_CUDNN_AUTOTUNE_DEFAULT to 0 to disable)
[INFO] Setting test np/mx/python random seeds, use MXNET_TEST_SEED=1521019752 to reproduce.
FAIL

FAIL: test_gluon_model_zoo_gpu.test_training
----------------------------------------------------------------------
Traceback (most recent call last):

  File "/usr/local/lib/python2.7/dist-packages/nose/case.py", line 197, in runTest
    self.test(*self.arg)
  File "/workspace/tests/python/gpu/../unittest/common.py", line 155, in test_new
    orig_test(*args, **kwargs)
  File "/workspace/tests/python/gpu/test_gluon_model_zoo_gpu.py", line 156, in test_training
    assert_almost_equal(cpu_out.asnumpy(), gpu_out.asnumpy(), rtol=1e-2, atol=1e-2)
  File "/workspace/python/mxnet/test_utils.py", line 493, in assert_almost_equal
    raise AssertionError(msg)
AssertionError: 
Items are not equal:
Error 47.864170 exceeds tolerance rtol=0.010000, atol=0.010000.  Location of maximum error:(6, 311), a=-0.094315, b=0.737164
 a: array([[ 0.19677146,  0.15249339, -0.14161389, ..., -0.6827745 ,
         0.12698895, -0.08247809],
       [-0.01026695,  0.31750488, -0.14363009, ..., -0.7834535 ,...
 b: array([[ 0.05424175,  0.04719666, -0.09091276, ..., -0.7888349 ,
         0.11255977, -0.13169175],
       [ 0.0638914 ,  0.34906954, -0.02986413, ..., -0.7855257 ,...
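
For reference, the failing assertion compares the outputs of the same model run on CPU and GPU against the same data. A simplified, hypothetical sketch of that style of check (not the actual test body in test_gluon_model_zoo_gpu.py):

import mxnet as mx
from mxnet.gluon.model_zoo import vision
from mxnet.test_utils import assert_almost_equal

# Build one model, run the same batch on CPU and GPU, and require the
# outputs to agree within loose tolerances.
net = vision.resnet18_v1(pretrained=False)
net.initialize(mx.init.Xavier())
data = mx.nd.random.uniform(shape=(2, 3, 224, 224))

cpu_out = net(data.as_in_context(mx.cpu()))
net.collect_params().reset_ctx(mx.gpu(0))
gpu_out = net(data.as_in_context(mx.gpu(0)))

assert_almost_equal(cpu_out.asnumpy(), gpu_out.asnumpy(), rtol=1e-2, atol=1e-2)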

Git hash with the test hardcoded to the bad seed: 2ea19a2
Git hash of PR with the test open to random seeding (that will be printed if it fails): daceaca

Note that a related test, test_gluon_model_zoo_gpu.py:test_inference, is marked as skipped:
@unittest.skip("test fails intermittently. temporarily disabled.")

Should this test be disabled as well?
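
If so, applying the same decorator would look like this (sketch only):

import unittest

@unittest.skip("test fails intermittently. temporarily disabled.")
def test_training():
    ...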

@szha (Member) commented Feb 21, 2018

Ping @zheng-da

@zheng-da (Contributor)

This bug has been fixed by #9862. I think we can close the issue now.

@sandeep-krishnamurthy (Contributor)

Could not reproduce in 1000 runs. The issue was fixed by Da in PR #9862, which resolves an MKLDNN memory-layout race condition. Verified that the 50 most recent CI failures do not show this issue.
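
For reference, a repeated-run check along these lines (a hypothetical driver script; paths and the nosetests runner are assumed) can be used to look for this kind of flakiness:

import os
import subprocess

# Hypothetical driver: run the test repeatedly and stop at the first failure.
# Drop the MXNET_TEST_SEED entry to let each run pick a fresh random seed.
env = dict(os.environ, MXNET_TEST_SEED="1521019752")
for i in range(1000):
    ret = subprocess.call(
        ["nosetests", "-v",
         "tests/python/gpu/test_gluon_model_zoo_gpu.py:test_training"],
        env=env)
    if ret != 0:
        print("failure on run %d" % i)
        break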

Please reopen if the issue persists.
