This repository has been archived by the owner on Nov 17, 2023. It is now read-only.
During development of the ci_test_randomness3 PR (#9791), failures were seen in test_gluon_model_zoo_gpu.py:test_training. The first failure occurred on the Python2: MKLDNN-GPU CI runner before the @with_seed() decorator had been added, so no RNG seed information was recorded. After the @with_seed() decorator was added, a second failure (produced by seed 1521019752) was seen on the same runner. With that seed hard-coded for the test, the test passed on all nodes. This suggests the problem is not data-related and is perhaps tied to the Python2 MKLDNN GPU implementation.
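For context, a minimal sketch of what a seed-fixing decorator in the spirit of @with_seed() does: seed the RNGs before the test and log the seed so an intermittent failure can be reproduced. This is an illustration only; MXNet's actual with_seed() lives in tests/python/unittest/common.py and also seeds mx.random. The names here (test_training_repro, the env-var fallback logic) are hypothetical.

```python
import functools
import os
import random
import time

import numpy as np


def with_seed(seed=None):
    """Sketch of a test decorator that seeds the RNGs and logs the seed
    so an intermittent failure can be reproduced later.  Illustration
    only; MXNet's real with_seed() also seeds mx.random."""
    def decorator(test_fn):
        @functools.wraps(test_fn)
        def wrapper(*args, **kwargs):
            # Priority: hard-coded seed > MXNET_TEST_SEED env var > fresh seed.
            env_seed = os.getenv('MXNET_TEST_SEED')
            if seed is not None:
                this_seed = seed
            elif env_seed is not None:
                this_seed = int(env_seed)
            else:
                this_seed = int(time.time() * 1e6) % (2 ** 31)
            np.random.seed(this_seed)
            random.seed(this_seed)
            print('[INFO] Setting test np/python random seeds, use '
                  'MXNET_TEST_SEED=%d to reproduce.' % this_seed)
            return test_fn(*args, **kwargs)
        return wrapper
    return decorator


@with_seed(1521019752)  # the "bad" seed from the second failure, hard-coded
def test_training_repro():
    # With the seed fixed, the same random draws occur on every run.
    return np.random.randint(0, 1000)
```

Running such a test with `MXNET_TEST_SEED=1521019752` (or with the seed hard-coded as above) makes every run deterministic, which is how the failure was isolated to that seed.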
First test failure output:
test_gluon_model_zoo_gpu.test_training ... [04:13:14] src/io/iter_image_recordio_2.cc:170: ImageRecordIOParser2: data/val-5k-256.rec, use 1 threads for decoding..
testing resnet18_v1
[04:13:15] src/operator/nn/mkldnn/mkldnn_base.cc:60: Allocate 8028160 bytes with malloc directly
[04:13:16] src/operator/nn/./cudnn/./cudnn_algoreg-inl.h:107: Running performance tests to find the best convolution algorithm, this can take a while... (setting env variable MXNET_CUDNN_AUTOTUNE_DEFAULT to 0 to disable)
testing densenet121
[04:13:23] src/operator/nn/./cudnn/./cudnn_algoreg-inl.h:107: Running performance tests to find the best convolution algorithm, this can take a while... (setting env variable MXNET_CUDNN_AUTOTUNE_DEFAULT to 0 to disable)
FAIL
FAIL: test_gluon_model_zoo_gpu.test_training
----------------------------------------------------------------------
Traceback (most recent call last):
File "/usr/local/lib/python2.7/dist-packages/nose/case.py", line 197, in runTest
self.test(*self.arg)
File "/workspace/tests/python/gpu/test_gluon_model_zoo_gpu.py", line 150, in test_training
assert_almost_equal(cpu_out.asnumpy(), gpu_out.asnumpy(), rtol=1e-2, atol=1e-2)
File "/workspace/python/mxnet/test_utils.py", line 493, in assert_almost_equal
raise AssertionError(msg)
AssertionError:
Items are not equal:
Error 60.428631 exceeds tolerance rtol=0.010000, atol=0.010000. Location of maximum error:(6, 649), a=0.584048, b=-0.051143
a: array([[-0.40942043, -0.5455089 , -0.26064384, ..., 0.33553356,
0.5314904 , -0.15903676],
[-0.6971618 , -0.3223077 , -0.7059576 , ..., 0.7106416 ,...
b: array([[-0.40580893, -0.63151675, -0.37356558, ..., 0.36654586,
0.43078798, -0.19291902],
[-0.51749593, -0.26392186, -0.66467005, ..., 0.794114 ,...
Second test failure output:
test_gluon_model_zoo_gpu.test_training ... [09:40:45] src/io/iter_image_recordio_2.cc:170: ImageRecordIOParser2: data/val-5k-256.rec, use 1 threads for decoding..
testing resnet18_v1
[09:40:46] src/operator/nn/mkldnn/mkldnn_base.cc:60: Allocate 8028160 bytes with malloc directly
[09:40:47] src/operator/nn/./cudnn/./cudnn_algoreg-inl.h:107: Running performance tests to find the best convolution algorithm, this can take a while... (setting env variable MXNET_CUDNN_AUTOTUNE_DEFAULT to 0 to disable)
testing densenet121
[09:40:54] src/operator/nn/./cudnn/./cudnn_algoreg-inl.h:107: Running performance tests to find the best convolution algorithm, this can take a while... (setting env variable MXNET_CUDNN_AUTOTUNE_DEFAULT to 0 to disable)
[INFO] Setting test np/mx/python random seeds, use MXNET_TEST_SEED=1521019752 to reproduce.
FAIL
FAIL: test_gluon_model_zoo_gpu.test_training
----------------------------------------------------------------------
Traceback (most recent call last):
File "/usr/local/lib/python2.7/dist-packages/nose/case.py", line 197, in runTest
self.test(*self.arg)
File "/workspace/tests/python/gpu/../unittest/common.py", line 155, in test_new
orig_test(*args, **kwargs)
File "/workspace/tests/python/gpu/test_gluon_model_zoo_gpu.py", line 156, in test_training
assert_almost_equal(cpu_out.asnumpy(), gpu_out.asnumpy(), rtol=1e-2, atol=1e-2)
File "/workspace/python/mxnet/test_utils.py", line 493, in assert_almost_equal
raise AssertionError(msg)
AssertionError:
Items are not equal:
Error 47.864170 exceeds tolerance rtol=0.010000, atol=0.010000. Location of maximum error:(6, 311), a=-0.094315, b=0.737164
a: array([[ 0.19677146, 0.15249339, -0.14161389, ..., -0.6827745 ,
0.12698895, -0.08247809],
[-0.01026695, 0.31750488, -0.14363009, ..., -0.7834535 ,...
b: array([[ 0.05424175, 0.04719666, -0.09091276, ..., -0.7888349 ,
0.11255977, -0.13169175],
[ 0.0638914 , 0.34906954, -0.02986413, ..., -0.7855257 ,...
Git hash with the test hardcoded to the bad seed: 2ea19a2
Git hash of PR with the test open to random seeding (that will be printed if it fails): daceaca
Note that a related test, test_gluon_model_zoo_gpu.py:test_inference, is marked as 'skip':
@unittest.skip("test fails intermittently. temporarily disabled.")
Should this test be disabled as well?
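The failing assertion above compares CPU and GPU outputs under a mixed absolute/relative tolerance. As an illustration of why errors like 60.43 trip the rtol=1e-2, atol=1e-2 check, here is a sketch using the standard np.isclose criterion |a - b| <= atol + rtol * |b|; MXNet's own assert_almost_equal (python/mxnet/test_utils.py) additionally locates and reports the maximum violation, and its exact error metric may differ.

```python
import numpy as np

def almost_equal(a, b, rtol=1e-2, atol=1e-2):
    """Mixed absolute/relative tolerance check, np.isclose-style.
    Sketch only; MXNet's assert_almost_equal also reports the
    location and magnitude of the worst violation."""
    return bool(np.all(np.abs(a - b) <= atol + rtol * np.abs(b)))

# The reported max-error location had a=0.584048 vs b=-0.051143:
# |a - b| ~= 0.635 far exceeds atol + rtol*|b| ~= 0.0105, so the check fails.
print(almost_equal(np.array([0.584048]), np.array([-0.051143])))  # False
```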
Could not reproduce in 1000 runs. The issue is fixed by Da's PR #9862, which resolves an MKLDNN memory-layout race condition. Verified that the 50 most recent CI failures do not show this issue.