This repository has been archived by the owner on Nov 17, 2023. It is now read-only.

Flaky test: test_mkldnn.test_activation #12377

Closed
lebeg opened this issue Aug 28, 2018 · 12 comments

Comments

@lebeg
Contributor

lebeg commented Aug 28, 2018

http://jenkins.mxnet-ci.amazon-ml.com/blue/organizations/jenkins/incubator-mxnet/detail/master/1529/pipeline

======================================================================
FAIL: test_mkldnn.test_activation
----------------------------------------------------------------------
Traceback (most recent call last):
  File "/usr/local/lib/python3.5/dist-packages/nose/case.py", line 198, in runTest
    self.test(*self.arg)
  File "/work/mxnet/tests/python/mkl/../unittest/common.py", line 172, in test_new
    orig_test(*args, **kwargs)
  File "/work/mxnet/tests/python/mkl/test_mkldnn.py", line 297, in test_activation
    check_activation_training(stype)
  File "/work/mxnet/tests/python/mkl/test_mkldnn.py", line 293, in check_activation_training
    check_numeric_gradient(test, in_location, numeric_eps=1e-2, rtol=0.16, atol=1e-4)
  File "/work/mxnet/python/mxnet/test_utils.py", line 912, in check_numeric_gradient
    ("NUMERICAL_%s"%name, "BACKWARD_%s"%name))
  File "/work/mxnet/python/mxnet/test_utils.py", line 491, in assert_almost_equal
    raise AssertionError(msg)
AssertionError: 
Items are not equal:
Error 2.232502 exceeds tolerance rtol=0.160000, atol=0.000100.  Location of maximum error:(0, 0, 0), a=0.445562, b=0.693506
 NUMERICAL_data: array([[[0.44556212, 0.2619341 , 0.77837706],
        [0.        , 0.8214429 , 0.5259812 ],
        [0.        , 0.        , 0.        ]],...
 BACKWARD_data: array([[[0.693506  , 0.26193386, 0.77837765],
        [0.        , 0.8214439 , 0.52598166],
        [0.        , 0.        , 0.        ]],...
-------------------- >> begin captured logging << --------------------
common: INFO: Setting test np/mx/python random seeds, use MXNET_TEST_SEED=1284728931 to reproduce.
--------------------- >> end captured logging << ---------------------
@pengzhao-intel
Contributor

@luobao-intel please take a look at the cause.

@luobao-intel
Contributor

luobao-intel commented Aug 29, 2018

This test validates the activation calculation in MKL-DNN by checking its gradient against a numeric gradient computed in the style of theano.gradient.numeric_grad. However, that Theano-style gradient calculation is not correct for inputs close to zero, so flaky failures occur whenever the random input vector contains extremely small positive numbers.
The experiments are as follows.

experiment 1:

input data: [[1, 2], [3, 0.0001]]

location:
{'data':
<RowSparseNDArray 2x2 @cpu(0)>, '__random_proj':
[[0.3546685 0.8954062 ]
[0.40476447 0.7724642 ]]
<NDArray 2x2 @cpu(0)>}

gradient calculation referring to Theano:
[[0.35466552 0.8954048 ]
[0.40476322 0.39395675]]

mkldnn:
[[0.3546685 0.8954062 ]
[0.40476447 0.7724642 ]]

experiment 2:

input data: [[1, -2], [-4, 0.0005]]

location:
{'data':
<RowSparseNDArray 2x2 @cpu(0)>, '__random_proj':
[[0.3546685 0.8954062 ]
[0.40476447 0.7724642 ]]
<NDArray 2x2 @cpu(0)>}

gradient calculation referring to Theano:
[[0.35466552 0. ]
[0. 0.4248553 ]]

mkldnn:
[[0.3546685 0. ]
[0. 0.7724642]]

analysis

The derivative of the ReLU function is 0 for x < 0 and 1 for x > 0.

Therefore, in the check_numeric_gradient function, the gradient computed by the executor should, element-wise, equal the corresponding __random_proj value wherever the input element is positive, and be 0 otherwise.
The Theano-style numeric gradient is clearly wrong when the corresponding element of the input data is close to zero, as the sketch below shows.
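
To make the failure mode concrete, here is a minimal NumPy sketch (an editorial illustration, not the MXNet test harness itself) of the centered finite difference the checker uses. With eps = 1e-2, as in the failing run above, the difference step straddles zero for tiny positive inputs, and the sketch reproduces the wrong values from both experiments:

import numpy as np

def relu(v):
    return np.maximum(v, 0.0)

def numeric_grad(x, eps):
    # Centered finite difference: (f(x + eps/2) - f(x - eps/2)) / eps
    return (relu(x + eps / 2) - relu(x - eps / 2)) / eps

eps = 1e-2        # numeric_eps from the failing test
proj = 0.7724642  # __random_proj weight at the mismatching position

# The true ReLU derivative at x > 0 is exactly 1, but the step straddles
# zero, so the negative half of the difference is clipped to 0.
print(numeric_grad(1e-4, eps))         # ~0.51 instead of 1.0
print(numeric_grad(1e-4, eps) * proj)  # ~0.39395675, experiment 1's wrong value
print(numeric_grad(5e-4, eps) * proj)  # ~0.4248553, experiment 2's wrong value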

@pengzhao-intel
Contributor

pengzhao-intel commented Aug 29, 2018

The reference checker applies the finite-difference method, but the eps is too large for the float datatype here.
In @luobao-intel's case, some input elements are tiny (on the order of xe-5), so with this eps the gradient can't be calculated correctly.
I suggest changing eps to 1e-6 (see the sketch below). @luobao-intel will file the PR soon.

https://github.com/apache/incubator-mxnet/blob/e2a3eef349cb6643c08a7840d8cbd43b38fedfd5/python/mxnet/test_utils.py#L716
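
As an editorial sketch of why the smaller step helps (plain float64 NumPy, reusing the helper from the sketch above): with eps = 1e-6 the difference step no longer straddles zero for inputs around 1e-4, so the estimate becomes exact.

import numpy as np

def relu(v):
    return np.maximum(v, 0.0)

def numeric_grad(x, eps):
    # Centered finite difference: (f(x + eps/2) - f(x - eps/2)) / eps
    return (relu(x + eps / 2) - relu(x - eps / 2)) / eps

print(numeric_grad(1e-4, 1e-2))  # ~0.51: the step straddles zero, wrong
print(numeric_grad(1e-4, 1e-6))  # ~1.0: the step stays on the positive side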

@lebeg
Contributor Author

lebeg commented Sep 10, 2018

It is failing again:
http://jenkins.mxnet-ci.amazon-ml.com/blue/organizations/jenkins/incubator-mxnet/detail/master/1563/pipeline

======================================================================
FAIL: test_mkldnn.test_activation
----------------------------------------------------------------------
Traceback (most recent call last):
  File "/usr/local/lib/python3.5/dist-packages/nose/case.py", line 198, in runTest
    self.test(*self.arg)
  File "/work/mxnet/tests/python/mkl/../unittest/common.py", line 172, in test_new
    orig_test(*args, **kwargs)
  File "/work/mxnet/tests/python/mkl/test_mkldnn.py", line 298, in test_activation
    check_activation_training(stype)
  File "/work/mxnet/tests/python/mkl/test_mkldnn.py", line 294, in check_activation_training
    check_numeric_gradient(test, in_location, numeric_eps=1e-6, rtol=0.16, atol=1e-4)
  File "/work/mxnet/python/mxnet/test_utils.py", line 912, in check_numeric_gradient
    ("NUMERICAL_%s"%name, "BACKWARD_%s"%name))
  File "/work/mxnet/python/mxnet/test_utils.py", line 491, in assert_almost_equal
    raise AssertionError(msg)
AssertionError: 
Items are not equal:
Error 1.153736 exceeds tolerance rtol=0.160000, atol=0.000100.  Location of maximum error:(0, 2, 1, 1), a=0.119209, b=0.146338
 NUMERICAL_data: array([[[[0.32782555, 0.52154064],
         [0.32782555, 0.        ]],
...
 BACKWARD_data: array([[[[0.31696534, 0.53385574],
         [0.3415597 , 0.        ]],
...
-------------------- >> begin captured logging << --------------------
common: INFO: Setting test np/mx/python random seeds, use MXNET_TEST_SEED=304218922 to reproduce.
--------------------- >> end captured logging << ---------------------

@luobao-intel
Contributor

Sorry, I can't reproduce the same result with the same random seed MXNET_TEST_SEED=304218922.
In my trial, test_activation passes.
The experiment is as follows:

experiment

command

export MXNET_TEST_SEED=304218922
python /usr/bin/nosetests tests/python/mkl/test_mkldnn.py:test_activation

log

[INFO] Setting test np/mx/python random seeds, use MXNET_TEST_SEED=304218922 to reproduce.
[22:51:22] src/operator/tensor/././../../common/utils.h:450:
Storage type fallback detected:
operator = Activation
input storage types = [row_sparse, ]
output storage types = [default, ]
params = {"act_type" : relu, }
context.dev_mask = cpu
The operator with default storage type will be dispatched for execution. You're seeing this warning message because the operator above is unable to process the given ndarrays with specified storage types, context and parameter. Temporary dense ndarrays are generated in order to execute the operator. This does not affect the correctness of the programme. You can set environment variable MXNET_STORAGE_FALLBACK_LOG_VERBOSE to 0 to suppress this warning.

Ran 1 test in 0.023s

OK

@anirudhacharya
Member

Failing again - http://jenkins.mxnet-ci.amazon-ml.com/blue/rest/organizations/jenkins/pipelines/incubator-mxnet/branches/PR-12391/runs/9/nodes/951/log/?start=0

======================================================================
FAIL: test_mkldnn.test_activation
----------------------------------------------------------------------
Traceback (most recent call last):
  File "/usr/local/lib/python3.5/dist-packages/nose/case.py", line 198, in runTest
    self.test(*self.arg)
  File "/work/mxnet/tests/python/mkl/../unittest/common.py", line 172, in test_new
    orig_test(*args, **kwargs)
  File "/work/mxnet/tests/python/mkl/test_mkldnn.py", line 298, in test_activation
    check_activation_training(stype)
  File "/work/mxnet/tests/python/mkl/test_mkldnn.py", line 294, in check_activation_training
    check_numeric_gradient(test, in_location, numeric_eps=1e-6, rtol=0.16, atol=1e-4)
  File "/work/mxnet/python/mxnet/test_utils.py", line 912, in check_numeric_gradient
    ("NUMERICAL_%s"%name, "BACKWARD_%s"%name))
  File "/work/mxnet/python/mxnet/test_utils.py", line 491, in assert_almost_equal
    raise AssertionError(msg)
AssertionError: 
Items are not equal:
Error 1.184596 exceeds tolerance rtol=0.160000, atol=0.000100.  Location of maximum error:(0, 0, 1, 0), a=0.715256, b=0.882672
 NUMERICAL_data: array([[[[0.        , 0.        ],
         [0.71525574, 0.        ]],
...
 BACKWARD_data: array([[[[0.        , 0.        ],
         [0.8826717 , 0.        ]],
...
-------------------- >> begin captured logging << --------------------
common: INFO: Setting test np/mx/python random seeds, use MXNET_TEST_SEED=1731055743 to reproduce.
--------------------- >> end captured logging << ---------------------

@luobao-intel
Contributor

luobao-intel commented Sep 11, 2018

Sorry about that. In the previous situation, we encountered input elements close to zero, found that the excessively large difference step, eps, was to blame, and turned its value down. In the current situation, however, the test failure is caused by an eps that is too small for large input elements. The smaller the eps, the more calculation steps are required, and for large inputs every step can introduce a small error, so the cumulative error may exceed the tolerance. A suitable eps therefore has to be picked.

After all, these problems are caused by the inaccurate baseline calculation based on the Theano gradient. We are trying to rewrite the test case with other approaches. I suggest disabling the flaky test for the time being.

@lebeg
Contributor Author

lebeg commented Sep 11, 2018

PR to disable the test again: #12516

aaronmarkham pushed a commit to aaronmarkham/incubator-mxnet that referenced this issue Sep 11, 2018
szha pushed a commit that referenced this issue Sep 12, 2018
* Revert "Removing the re-size for validation data, which breaking the validation accuracy of CIFAR training (#12362)"

This reverts commit ceabcaa.

* Revert "[MXNET-580] Add SN-GAN example (#12419)"

This reverts commit 46a5cee.

* Revert "Remove regression checks for website links (#12507)"

This reverts commit 619bc3e.

* Revert "Revert "Fix flaky test: test_mkldnn.test_activation #12377 (#12418)" (#12516)"

This reverts commit 7ea0533.

* Revert "further bump up tolerance for sparse dot (#12527)"

This reverts commit 90599e1.

* Revert "Fix broken URLs (#12508)"

This reverts commit 3d83c89.

* Revert "Temporarily disable flaky tests (#12520)"

This reverts commit 35ca13c.

* Revert "Add support for more req patterns for bilinear sampler backward (#12386)"

This reverts commit 4ee866f.

* Revert "Change the way NDArrayIter handle the last batch (#12285)"

This reverts commit 597a637.
@azai91
Contributor

azai91 commented Sep 14, 2018

Made a PR that addresses just this test (ran it 10000 times with different seeds as well): #12560.

In regards to @luobao-intel's comment, this is not due to the inputs being too large. Activation is linear above 0, so this is not a lack of approximation; in fact, we should be able to get an exact solution. The reason the change causes an error is that, with a very small eps, the outputs f(x + eps/2) and f(x - eps/2) do not have enough precision.

The formula is

grad = (f(x + eps/2) - f(x - eps/2)) / eps

Since eps was 1e-6, the gradient was computed from differences that must be resolved below 1e-6.
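
A minimal sketch of that precision loss (editorial; float32, as in the test tensors, with an illustrative input of 0.7): near 0.7 one float32 ulp is about 6e-8, so the 1e-6 numerator spans only ~17 ulps, and rounding of the two function values perturbs the slope by several percent before the rest of the computation adds any noise of its own.

import numpy as np

x = np.float32(0.7)    # true ReLU gradient here is exactly 1
eps = np.float32(1e-6)

fp = np.maximum(x + eps / np.float32(2), np.float32(0))
fm = np.maximum(x - eps / np.float32(2), np.float32(0))

# fp and fm are each rounded to the float32 grid (~6e-8 spacing near 0.7),
# so their ~1e-6 difference is quantized to whole ulps, distorting the slope.
print((fp - fm) / eps)  # ~0.9537 rather than 1.0, roughly a 5% error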

@azai91
Contributor

azai91 commented Sep 14, 2018

tldr: you should never use anything less than 1e-5, as there is not enough precision in the numerator (f(x + eps/2) - f(x - eps/2)) to derive an accurate slope.

anirudh2290 pushed a commit to anirudh2290/mxnet that referenced this issue Sep 19, 2018
anirudh2290 pushed a commit to anirudh2290/mxnet that referenced this issue Sep 19, 2018
anirudh2290 pushed a commit to anirudh2290/mxnet that referenced this issue Sep 19, 2018
* Revert "Removing the re-size for validation data, which breaking the validation accuracy of CIFAR training (apache#12362)"

This reverts commit ceabcaa.

* Revert "[MXNET-580] Add SN-GAN example (apache#12419)"

This reverts commit 46a5cee.

* Revert "Remove regression checks for website links (apache#12507)"

This reverts commit 619bc3e.

* Revert "Revert "Fix flaky test: test_mkldnn.test_activation apache#12377 (apache#12418)" (apache#12516)"

This reverts commit 7ea0533.

* Revert "further bump up tolerance for sparse dot (apache#12527)"

This reverts commit 90599e1.

* Revert "Fix broken URLs (apache#12508)"

This reverts commit 3d83c89.

* Revert "Temporarily disable flaky tests (apache#12520)"

This reverts commit 35ca13c.

* Revert "Add support for more req patterns for bilinear sampler backward (apache#12386)"

This reverts commit 4ee866f.

* Revert "Change the way NDArrayIter handle the last batch (apache#12285)"

This reverts commit 597a637.
@lebeg
Contributor Author

lebeg commented Oct 9, 2018

This has been fixed with #12418

@lebeg lebeg closed this as completed Oct 9, 2018
Tests Improvement automation moved this from To Do to Done Oct 9, 2018
@lebeg
Contributor Author

lebeg commented Oct 9, 2018

Sorry, probably this is the fix: #12560
