This repository has been archived by the owner on Nov 17, 2023. It is now read-only.

Flaky test: test_mkldnn.test_activation #12377

Closed
lebeg opened this issue Aug 28, 2018 · 12 comments

Comments

@lebeg
Contributor

lebeg commented Aug 28, 2018

http://jenkins.mxnet-ci.amazon-ml.com/blue/organizations/jenkins/incubator-mxnet/detail/master/1529/pipeline

======================================================================
FAIL: test_mkldnn.test_activation
----------------------------------------------------------------------
Traceback (most recent call last):
  File "/usr/local/lib/python3.5/dist-packages/nose/case.py", line 198, in runTest
    self.test(*self.arg)
  File "/work/mxnet/tests/python/mkl/../unittest/common.py", line 172, in test_new
    orig_test(*args, **kwargs)
  File "/work/mxnet/tests/python/mkl/test_mkldnn.py", line 297, in test_activation
    check_activation_training(stype)
  File "/work/mxnet/tests/python/mkl/test_mkldnn.py", line 293, in check_activation_training
    check_numeric_gradient(test, in_location, numeric_eps=1e-2, rtol=0.16, atol=1e-4)
  File "/work/mxnet/python/mxnet/test_utils.py", line 912, in check_numeric_gradient
    ("NUMERICAL_%s"%name, "BACKWARD_%s"%name))
  File "/work/mxnet/python/mxnet/test_utils.py", line 491, in assert_almost_equal
    raise AssertionError(msg)
AssertionError: 
Items are not equal:
Error 2.232502 exceeds tolerance rtol=0.160000, atol=0.000100.  Location of maximum error:(0, 0, 0), a=0.445562, b=0.693506
 NUMERICAL_data: array([[[0.44556212, 0.2619341 , 0.77837706],
        [0.        , 0.8214429 , 0.5259812 ],
        [0.        , 0.        , 0.        ]],...
 BACKWARD_data: array([[[0.693506  , 0.26193386, 0.77837765],
        [0.        , 0.8214439 , 0.52598166],
        [0.        , 0.        , 0.        ]],...
-------------------- >> begin captured logging << --------------------
common: INFO: Setting test np/mx/python random seeds, use MXNET_TEST_SEED=1284728931 to reproduce.
--------------------- >> end captured logging << ---------------------
@pengzhao-intel
Contributor

@luobao-intel please take a look at the cause.

@luobao-intel
Contributor

luobao-intel commented Aug 29, 2018

This test validates the activation calculation in MKL-DNN by checking its gradient against a numeric gradient computed in the style of theano.gradient.numeric_grad. However, that Theano-style gradient calculation is not correct for inputs close to zero, so flaky failures occur whenever the random input vector contains extremely small positive numbers.
The experiments are as follows.

experiment 1:

input data: [[1, 2], [3, 0.0001]]

location:
{'data':
<RowSparseNDArray 2x2 @cpu(0)>, '__random_proj':
[[0.3546685 0.8954062 ]
[0.40476447 0.7724642 ]]
<NDArray 2x2 @cpu(0)>}

gradient calculation referring to Theano:
[[0.35466552 0.8954048 ]
[0.40476322 0.39395675]]

mkldnn:
[[0.3546685 0.8954062 ]
[0.40476447 0.7724642 ]]

experiment 2:

input data: [[1, -2], [-4, 0.0005]]

location:
{'data':
<RowSparseNDArray 2x2 @cpu(0)>, '__random_proj':
[[0.3546685 0.8954062 ]
[0.40476447 0.7724642 ]]
<NDArray 2x2 @cpu(0)>}

gradient calculation referring to Theano:
[[0.35466552 0. ]
[0. 0.4248553 ]]

mkldnn:
[[0.3546685 0. ]
[0. 0.7724642]]

analysis

The derivative of the ReLU function is 0 for x < 0 and 1 for x > 0.

Therefore, in the check_numeric_gradient function, the gradient computed by the executor should, element-wise, equal the corresponding __random_proj value wherever the input element is positive, and be 0 otherwise.
The Theano-style numeric gradient is clearly wrong when the corresponding element of the input data is close to zero, as the sketch below shows.
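
To make the failure mode concrete, here is a minimal NumPy sketch (an editorial illustration, not the MXNet test harness itself) of the centered finite difference the checker uses. With eps = 1e-2, as in the failing run above, the difference step straddles zero for tiny positive inputs, and the sketch reproduces the wrong values from both experiments:

import numpy as np

def relu(v):
    return np.maximum(v, 0.0)

def numeric_grad(x, eps):
    # Centered finite difference: (f(x + eps/2) - f(x - eps/2)) / eps
    return (relu(x + eps / 2) - relu(x - eps / 2)) / eps

eps = 1e-2        # numeric_eps from the failing test
proj = 0.7724642  # __random_proj weight at the mismatching position

# The true ReLU derivative at x > 0 is exactly 1, but the step straddles
# zero, so the negative half of the difference is clipped to 0.
print(numeric_grad(1e-4, eps))         # ~0.51 instead of 1.0
print(numeric_grad(1e-4, eps) * proj)  # ~0.39395675, experiment 1's wrong value
print(numeric_grad(5e-4, eps) * proj)  # ~0.4248553, experiment 2's wrong value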

@pengzhao-intel
Contributor

pengzhao-intel commented Aug 29, 2018

The reference checker applies the finite-difference method, but the eps is too large for the float datatype here.
In @luobao-intel's case, some input elements are tiny (on the order of xe-5), so with this eps the gradient can't be calculated correctly.
I suggest changing eps to 1e-6 (see the sketch below). @luobao-intel will file the PR soon.

https://github.com/apache/incubator-mxnet/blob/e2a3eef349cb6643c08a7840d8cbd43b38fedfd5/python/mxnet/test_utils.py#L716
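
As an editorial sketch of why the smaller step helps (plain float64 NumPy, reusing the helper from the sketch above): with eps = 1e-6 the difference step no longer straddles zero for inputs around 1e-4, so the estimate becomes exact.

import numpy as np

def relu(v):
    return np.maximum(v, 0.0)

def numeric_grad(x, eps):
    # Centered finite difference: (f(x + eps/2) - f(x - eps/2)) / eps
    return (relu(x + eps / 2) - relu(x - eps / 2)) / eps

print(numeric_grad(1e-4, 1e-2))  # ~0.51: the step straddles zero, wrong
print(numeric_grad(1e-4, 1e-6))  # ~1.0: the step stays on the positive side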

@lebeg
Contributor Author

lebeg commented Sep 10, 2018

It is failing again:
http://jenkins.mxnet-ci.amazon-ml.com/blue/organizations/jenkins/incubator-mxnet/detail/master/1563/pipeline

======================================================================
FAIL: test_mkldnn.test_activation
----------------------------------------------------------------------
Traceback (most recent call last):
  File "/usr/local/lib/python3.5/dist-packages/nose/case.py", line 198, in runTest
    self.test(*self.arg)
  File "/work/mxnet/tests/python/mkl/../unittest/common.py", line 172, in test_new
    orig_test(*args, **kwargs)
  File "/work/mxnet/tests/python/mkl/test_mkldnn.py", line 298, in test_activation
    check_activation_training(stype)
  File "/work/mxnet/tests/python/mkl/test_mkldnn.py", line 294, in check_activation_training
    check_numeric_gradient(test, in_location, numeric_eps=1e-6, rtol=0.16, atol=1e-4)
  File "/work/mxnet/python/mxnet/test_utils.py", line 912, in check_numeric_gradient
    ("NUMERICAL_%s"%name, "BACKWARD_%s"%name))
  File "/work/mxnet/python/mxnet/test_utils.py", line 491, in assert_almost_equal
    raise AssertionError(msg)
AssertionError: 
Items are not equal:
Error 1.153736 exceeds tolerance rtol=0.160000, atol=0.000100.  Location of maximum error:(0, 2, 1, 1), a=0.119209, b=0.146338
 NUMERICAL_data: array([[[[0.32782555, 0.52154064],
         [0.32782555, 0.        ]],
...
 BACKWARD_data: array([[[[0.31696534, 0.53385574],
         [0.3415597 , 0.        ]],
...
-------------------- >> begin captured logging << --------------------
common: INFO: Setting test np/mx/python random seeds, use MXNET_TEST_SEED=304218922 to reproduce.
--------------------- >> end captured logging << ---------------------

@luobao-intel
Contributor

Sorry, I can't reproduce the same result with the same random seed MXNET_TEST_SEED=304218922.
In my trial, test_activation passes.
The experiment is as follows:

experiment

command

export MXNET_TEST_SEED=304218922
python /usr/bin/nosetests tests/python/mkl/test_mkldnn.py:test_activation

log

[INFO] Setting test np/mx/python random seeds, use MXNET_TEST_SEED=304218922 to reproduce.
[22:51:22] src/operator/tensor/././../../common/utils.h:450:
Storage type fallback detected:
operator = Activation
input storage types = [row_sparse, ]
output storage types = [default, ]
params = {"act_type" : relu, }
context.dev_mask = cpu
The operator with default storage type will be dispatched for execution. You're seeing this warning message because the operator above is unable to process the given ndarrays with specified storage types, context and parameter. Temporary dense ndarrays are generated in order to execute the operator. This does not affect the correctness of the programme. You can set environment variable MXNET_STORAGE_FALLBACK_LOG_VERBOSE to 0 to suppress this warning.

Ran 1 test in 0.023s

OK

@anirudhacharya
Member

Failing again - http://jenkins.mxnet-ci.amazon-ml.com/blue/rest/organizations/jenkins/pipelines/incubator-mxnet/branches/PR-12391/runs/9/nodes/951/log/?start=0

======================================================================
FAIL: test_mkldnn.test_activation
----------------------------------------------------------------------
Traceback (most recent call last):
  File "/usr/local/lib/python3.5/dist-packages/nose/case.py", line 198, in runTest
    self.test(*self.arg)
  File "/work/mxnet/tests/python/mkl/../unittest/common.py", line 172, in test_new
    orig_test(*args, **kwargs)
  File "/work/mxnet/tests/python/mkl/test_mkldnn.py", line 298, in test_activation
    check_activation_training(stype)
  File "/work/mxnet/tests/python/mkl/test_mkldnn.py", line 294, in check_activation_training
    check_numeric_gradient(test, in_location, numeric_eps=1e-6, rtol=0.16, atol=1e-4)
  File "/work/mxnet/python/mxnet/test_utils.py", line 912, in check_numeric_gradient
    ("NUMERICAL_%s"%name, "BACKWARD_%s"%name))
  File "/work/mxnet/python/mxnet/test_utils.py", line 491, in assert_almost_equal
    raise AssertionError(msg)
AssertionError: 
Items are not equal:
Error 1.184596 exceeds tolerance rtol=0.160000, atol=0.000100.  Location of maximum error:(0, 0, 1, 0), a=0.715256, b=0.882672
 NUMERICAL_data: array([[[[0.        , 0.        ],
         [0.71525574, 0.        ]],
...
 BACKWARD_data: array([[[[0.        , 0.        ],
         [0.8826717 , 0.        ]],
...
-------------------- >> begin captured logging << --------------------
common: INFO: Setting test np/mx/python random seeds, use MXNET_TEST_SEED=1731055743 to reproduce.
--------------------- >> end captured logging << ---------------------

@luobao-intel
Contributor

luobao-intel commented Sep 11, 2018

Sorry about that. In the previous situation, we encountered input elements close to zero, found that the excessively large difference step, eps, was to blame, and turned its value down. In the current situation, however, the test failure is caused by an eps that is too small for large input elements. The smaller the eps, the more calculation steps are required, and for large inputs every step can introduce a small error, so the cumulative error may exceed the tolerance. A suitable eps therefore has to be picked.

After all, these problems are caused by the inaccurate baseline calculation based on the Theano gradient. We are trying to rewrite the test case with other approaches. I suggest disabling the flaky test for the time being.

@lebeg
Contributor Author

lebeg commented Sep 11, 2018

PR to disable the test again: #12516

aaronmarkham pushed a commit to aaronmarkham/incubator-mxnet that referenced this issue Sep 11, 2018
szha pushed a commit that referenced this issue Sep 12, 2018
* Revert "Removing the re-size for validation data, which breaking the validation accuracy of CIFAR training (#12362)"

This reverts commit ceabcaa.

* Revert "[MXNET-580] Add SN-GAN example (#12419)"

This reverts commit 46a5cee.

* Revert "Remove regression checks for website links (#12507)"

This reverts commit 619bc3e.

* Revert "Revert "Fix flaky test: test_mkldnn.test_activation #12377 (#12418)" (#12516)"

This reverts commit 7ea0533.

* Revert "further bump up tolerance for sparse dot (#12527)"

This reverts commit 90599e1.

* Revert "Fix broken URLs (#12508)"

This reverts commit 3d83c89.

* Revert "Temporarily disable flaky tests (#12520)"

This reverts commit 35ca13c.

* Revert "Add support for more req patterns for bilinear sampler backward (#12386)"

This reverts commit 4ee866f.

* Revert "Change the way NDArrayIter handle the last batch (#12285)"

This reverts commit 597a637.
@azai91
Contributor

azai91 commented Sep 14, 2018

Made a PR that addresses just this test (ran it 10000 times with different seeds as well): #12560.

In regards to @luobao-intel's comment, this is not due to the inputs being too large. Activation is linear above 0, so this is not a lack of approximation; in fact, we should be able to get an exact solution. The reason the change causes an error is that, with a very small eps, the outputs f(x + eps/2) and f(x - eps/2) do not have enough precision.

The formula is

grad = (f(x + eps/2) - f(x - eps/2)) / eps

Since eps was 1e-6, the gradient was computed from differences that must be resolved below 1e-6.
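
A minimal sketch of that precision loss (editorial; float32, as in the test tensors, with an illustrative input of 0.7): near 0.7 one float32 ulp is about 6e-8, so the 1e-6 numerator spans only ~17 ulps, and rounding of the two function values perturbs the slope by several percent before the rest of the computation adds any noise of its own.

import numpy as np

x = np.float32(0.7)    # true ReLU gradient here is exactly 1
eps = np.float32(1e-6)

fp = np.maximum(x + eps / np.float32(2), np.float32(0))
fm = np.maximum(x - eps / np.float32(2), np.float32(0))

# fp and fm are each rounded to the float32 grid (~6e-8 spacing near 0.7),
# so their ~1e-6 difference is quantized to whole ulps, distorting the slope.
print((fp - fm) / eps)  # ~0.9537 rather than 1.0, roughly a 5% error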

@azai91
Contributor

azai91 commented Sep 14, 2018

tldr: you should never use anything less than 1e-5, as there is not enough precision in the numerator (f(x + eps/2) - f(x - eps/2)) to derive an accurate slope.

anirudh2290 pushed a commit to anirudh2290/mxnet that referenced this issue Sep 19, 2018
anirudh2290 pushed a commit to anirudh2290/mxnet that referenced this issue Sep 19, 2018
anirudh2290 pushed a commit to anirudh2290/mxnet that referenced this issue Sep 19, 2018
* Revert "Removing the re-size for validation data, which breaking the validation accuracy of CIFAR training (apache#12362)"

This reverts commit ceabcaa.

* Revert "[MXNET-580] Add SN-GAN example (apache#12419)"

This reverts commit 46a5cee.

* Revert "Remove regression checks for website links (apache#12507)"

This reverts commit 619bc3e.

* Revert "Revert "Fix flaky test: test_mkldnn.test_activation apache#12377 (apache#12418)" (apache#12516)"

This reverts commit 7ea0533.

* Revert "further bump up tolerance for sparse dot (apache#12527)"

This reverts commit 90599e1.

* Revert "Fix broken URLs (apache#12508)"

This reverts commit 3d83c89.

* Revert "Temporarily disable flaky tests (apache#12520)"

This reverts commit 35ca13c.

* Revert "Add support for more req patterns for bilinear sampler backward (apache#12386)"

This reverts commit 4ee866f.

* Revert "Change the way NDArrayIter handle the last batch (apache#12285)"

This reverts commit 597a637.
@lebeg
Contributor Author

lebeg commented Oct 9, 2018

This has been fixed with #12418

@lebeg lebeg closed this as completed Oct 9, 2018
Tests Improvement automation moved this from To Do to Done Oct 9, 2018
@lebeg
Contributor Author

lebeg commented Oct 9, 2018

Sorry, probably this is the fix: #12560
