Flaky test: test_mkldnn.test_activation #12377
Comments
@luobao-intel please take a look and find the cause. |
This test validates the activation calculation in MKL-DNN by comparing its gradient against theano.gradient.numeric_grad. However, the gradient computed by the theano-derived reference code is not correct when the input is close to zero. As a result, flaky failures occur when the random input vector contains extremely small positive numbers.

experiment 1:
input data: [[1, 2], [3, 0.0001]]
location:
gradient calculation referring to theano:
mkldnn:

experiment 2:
input data: [[1, -2], [-4, 0.0005]]
location:
gradient calculation referring to theano:
mkldnn:

analysis
The derivative of the ReLU function is 1 for x > 0 and 0 otherwise. Therefore, in the check_numeric_gradient function, the executor's gradient should equal location, element-wise, wherever the corresponding element of the input data is positive, and be 0 elsewhere. |
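The expected backward result described in the analysis can be sketched in NumPy. This is a minimal illustration, not the MKL-DNN kernel or the test-suite code: `relu_backward` and `out_grad` are hypothetical names, and the upstream gradient values are made up. Only the two input arrays come from the experiments.

```python
import numpy as np

def relu_backward(x, out_grad):
    """Gradient of ReLU w.r.t. x: pass out_grad through where x > 0, else 0."""
    return out_grad * (x > 0)

# Upstream gradient: made-up values, for illustration only.
out_grad = np.array([[0.5, 1.5], [2.5, 3.5]], dtype=np.float32)

x1 = np.array([[1, 2], [3, 0.0001]], dtype=np.float32)    # experiment 1 input
x2 = np.array([[1, -2], [-4, 0.0005]], dtype=np.float32)  # experiment 2 input

grad1 = relu_backward(x1, out_grad)  # every input is positive
grad2 = relu_backward(x2, out_grad)  # negative inputs get zero gradient
print(grad1)
print(grad2)
```

The tiny positive entries (0.0001 and 0.0005) still receive the full upstream gradient, because the derivative of ReLU does not depend on how close to zero a positive input is.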
The reference checker uses the finite difference method, but the eps is too large for the float data type here. |
Is failing again:
|
Sorry, I can't reproduce the same result with the same random seed, MXNET_TEST_SEED=304218922.

experiment
command:
export MXNET_TEST_SEED=304218922
log:
[INFO] Setting test np/mx/python random seeds, use MXNET_TEST_SEED=304218922 to reproduce.
Ran 1 test in 0.023s
OK |
Failing again - http://jenkins.mxnet-ci.amazon-ml.com/blue/rest/organizations/jenkins/pipelines/incubator-mxnet/branches/PR-12391/runs/9/nodes/951/log/?start=0
======================================================================
FAIL: test_mkldnn.test_activation
----------------------------------------------------------------------
Traceback (most recent call last):
File "/usr/local/lib/python3.5/dist-packages/nose/case.py", line 198, in runTest
self.test(*self.arg)
File "/work/mxnet/tests/python/mkl/../unittest/common.py", line 172, in test_new
orig_test(*args, **kwargs)
File "/work/mxnet/tests/python/mkl/test_mkldnn.py", line 298, in test_activation
check_activation_training(stype)
File "/work/mxnet/tests/python/mkl/test_mkldnn.py", line 294, in check_activation_training
check_numeric_gradient(test, in_location, numeric_eps=1e-6, rtol=0.16, atol=1e-4)
File "/work/mxnet/python/mxnet/test_utils.py", line 912, in check_numeric_gradient
("NUMERICAL_%s"%name, "BACKWARD_%s"%name))
File "/work/mxnet/python/mxnet/test_utils.py", line 491, in assert_almost_equal
raise AssertionError(msg)
AssertionError:
Items are not equal:
Error 1.184596 exceeds tolerance rtol=0.160000, atol=0.000100. Location of maximum error:(0, 0, 1, 0), a=0.715256, b=0.882672
NUMERICAL_data: array([[[[0. , 0. ],
[0.71525574, 0. ]],
...
BACKWARD_data: array([[[[0. , 0. ],
[0.8826717 , 0. ]],
...
-------------------- >> begin captured logging << --------------------
common: INFO: Setting test np/mx/python random seeds, use MXNET_TEST_SEED=1731055743 to reproduce.
--------------------- >> end captured logging << --------------------- |
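The "Error 1.184596" figure in the log above can be checked by hand. The sketch below assumes the error is the violation ratio |a - b| / (atol + rtol * |b|). That formula is not shown in the log, but it reproduces the reported number from the logged values of a, b, rtol and atol. The function name `violation` is hypothetical.

```python
def violation(a, b, rtol, atol):
    """Ratio of the absolute difference to the allowed tolerance (assumed formula)."""
    return abs(a - b) / (atol + rtol * abs(b))

# Values copied from the assertion message in the log.
err = violation(a=0.715256, b=0.882672, rtol=0.16, atol=1e-4)
print(err)
```

A ratio above 1 means the difference exceeds the tolerance. Here the numerical gradient (a) and the backward gradient (b) differ by about 0.167, while rtol=0.16 only allows about 0.141.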
Sorry about that. In the previous case, the problem was that an element of the input data was close to zero. Ultimately, these failures are caused by the inaccurate baseline calculation based on the theano gradient. We are trying to rewrite the test case using another approach. I suggest disabling the flaky test for the time being. |
…che#12418)" This reverts commit 445967e.
PR to disable the test again: #12516 |
* test_activation_rec_eps
* enable case

* Revert "Removing the re-size for validation data, which breaking the validation accuracy of CIFAR training (#12362)" This reverts commit ceabcaa.
* Revert "[MXNET-580] Add SN-GAN example (#12419)" This reverts commit 46a5cee.
* Revert "Remove regression checks for website links (#12507)" This reverts commit 619bc3e.
* Revert "Revert "Fix flaky test: test_mkldnn.test_activation #12377 (#12418)" (#12516)" This reverts commit 7ea0533.
* Revert "further bump up tolerance for sparse dot (#12527)" This reverts commit 90599e1.
* Revert "Fix broken URLs (#12508)" This reverts commit 3d83c89.
* Revert "Temporarily disable flaky tests (#12520)" This reverts commit 35ca13c.
* Revert "Add support for more req patterns for bilinear sampler backward (#12386)" This reverts commit 4ee866f.
* Revert "Change the way NDArrayIter handle the last batch (#12285)" This reverts commit 597a637.
I made a PR that addresses just this test (and ran it 10000 times with different seeds as well): #12560. In regard to @luobao-intel's analysis, this is not due to inputs being too large. The activation is linear above 0, so this is not a matter of a poor approximation; in fact, we should be able to get an exact solution. The change causes an error because, with a very small eps, the outputs f(x + eps/2) and f(x - eps/2) do not have enough precision. The formula is
(f(x + eps/2) - f(x - eps/2)) / eps
Since eps was 1e-6, the gradient was calculated from differences that must be captured below 1e-6. |
tl;dr: you should never use an eps smaller than 1e-5, as there is not enough precision in the numerator (f(x + eps/2) - f(x - eps/2)) to derive an accurate slope. |
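The precision argument can be reproduced outside MXNet with a few lines of NumPy. This is a rough sketch under stated assumptions, not the check_numeric_gradient implementation: `relu` and `central_difference` are hypothetical helpers, the input 3.0 is borrowed from experiment 1, and eps=1e-3 is an arbitrary larger step chosen for contrast.

```python
import numpy as np

def relu(x):
    # Keep the zero in the same dtype as x so nothing is silently promoted.
    return np.maximum(x, x.dtype.type(0))

def central_difference(f, x, eps, dtype):
    """(f(x + eps/2) - f(x - eps/2)) / eps, evaluated entirely in `dtype`."""
    x = dtype(x)
    half = dtype(eps / 2)
    return float((f(x + half) - f(x - half)) / dtype(eps))

x0 = 3.0  # a positive input, so the exact ReLU slope is 1

grad_small_eps = central_difference(relu, x0, 1e-6, np.float32)
grad_large_eps = central_difference(relu, x0, 1e-3, np.float32)
grad_float64 = central_difference(relu, x0, 1e-6, np.float64)

print("float32, eps=1e-6:", grad_small_eps)
print("float32, eps=1e-3:", grad_large_eps)
print("float64, eps=1e-6:", grad_float64)
```

Near 3.0, adjacent float32 values are about 2.4e-7 apart, so a half step of 5e-7 gets rounded to a whole number of those spacings. With eps=1e-6 the float32 estimate is therefore off by several percent, while the larger step or the float64 computation stays close to the exact slope of 1.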
Has been fixed with #12418 |
Sorry, probably this is the fix: #12560 |
http://jenkins.mxnet-ci.amazon-ml.com/blue/organizations/jenkins/incubator-mxnet/detail/master/1529/pipeline