This repository has been archived by the owner on Nov 17, 2023. It is now read-only.

test_operator.test_laop_3 hangs #9295

Closed
marcoabreu opened this issue Jan 4, 2018 · 26 comments

Comments

@marcoabreu
Contributor

The execution of test_operator.test_laop_3 randomly hangs without a message on Python3: MKLML-CPU. See http://jenkins.mxnet-ci.amazon-ml.com/blue/organizations/jenkins/incubator-mxnet/detail/master/130/pipeline/329

@marcoabreu
Contributor Author

@piiswrong @szha Please assign the proper label.

@KellenSunderland
Contributor

Here's the ticket for the LA hangs with MKL @mseeger @asmushetzel.

@KellenSunderland
Contributor

The next time someone runs into this one let's try and get a trace of all the threads. I looked at this a little already with @asmushetzel, and it certainly was a deadlock during initialization (i.e. not livelock, no activity on any thread), but I wasn't exactly clear on why one of the threads was blocked.
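
For the Python side, a cheap way to capture such a trace next time might be a faulthandler watchdog at the top of the test run (a sketch, not something that exists in the test suite today; the native MKL/OpenMP threads would still need something like gdb's "thread apply all bt" on the hung process):

# Sketch only: dump the stack of every Python thread if the run takes too long.
# This covers Python frames only; native threads need gdb attached separately.
import faulthandler
import sys

# If the process is still running after 10 minutes, print all Python thread
# stacks to stderr, and keep doing so every 10 minutes (repeat=True).
faulthandler.dump_traceback_later(600, repeat=True, file=sys.stderr)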

@mseeger
Contributor

mseeger commented Jan 6, 2018

If the test hangs when using MKL, then I'm afraid a normal run of MXNet that uses the linalg ops will eventually hang on MKL as well.

Do you also see these problems with the other laop tests? Because laop3 is only for a small subset of the ops. Is it clear which of them is the issue?

@marcoabreu
Contributor Author

I have never seen this problem with any test other than laop3. Of course, it's possible that earlier tests left the process in a corrupted state. I haven't had the chance yet to dig into the problem and identify the failing line. We might consider adding more debug output to those tests, especially where one test method consists of multiple checks (see the sketch below).
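
For example (a hypothetical helper, not code that exists in the test today), each sub-check could announce itself so a hang in the CI log can be pinned to a specific op:

import sys

def log_check(name, fn, *args, **kwargs):
    # Hypothetical wrapper: flush before running so the message survives
    # even if the wrapped check hangs and the job is killed.
    print('test_laop_3: starting %s' % name)
    sys.stdout.flush()
    fn(*args, **kwargs)
    print('test_laop_3: finished %s' % name)
    sys.stdout.flush()

# e.g. instead of calling check_fw(test_syevd2, [data_in1], [res_eye, res_a]) directly:
# log_check('syevd forward', check_fw, test_syevd2, [data_in1], [res_eye, res_a])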

@asmushetzel
Contributor

@meissnereric
Eric, can you own this? (I'm willing to help.) The configuration that fails is
python3, USE_MKL2017=1, USE_MKL2017_EXPERIMENTAL=1, USE_BLAS=openblas
The hanging test executes syevd, which according to the above settings will be the openblas version of syevd.
Nobody currently knows what the issue is: there seems to be no change in the test environment, and we have not modified any linalg functions (at least to my knowledge; you may take a look at the git logs). Still, this problem suddenly popped up, and worse, only occasionally: you have to run it repeatedly until it sometimes hangs (see the loop sketch below).
This is important, as we want all tests in MXNet working.
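
Something along these lines (a sketch; the timeout and run count are arbitrary) can loop just test_laop_3 with a per-run timeout, so a hang is detected instead of blocking forever:

import os
import subprocess

env = dict(os.environ, PYTHONPATH='./python/')
cmd = ['nosetests3', '--verbose',
       'tests/python/unittest/test_operator.py:test_laop_3']

for i in range(200):
    try:
        # Each run gets 5 minutes; subprocess.run kills it on timeout.
        subprocess.run(cmd, env=env, timeout=300, check=True)
    except subprocess.TimeoutExpired:
        print('run %d hung (no result after 300s)' % i)
        break
    except subprocess.CalledProcessError as e:
        print('run %d failed with exit code %d' % (i, e.returncode))
        break
    print('run %d passed' % i)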

@meissnereric
Contributor

Sure, I'll have a look at it today. Thanks for letting me know.

@KellenSunderland
Contributor

KellenSunderland commented Jan 10, 2018

Did these tests start failing around the end of December, when this change was merged? d0388d0

@marcoabreu
Contributor Author

marcoabreu commented Jan 10, 2018 via email

@mseeger
Contributor

mseeger commented Jan 11, 2018 via email

@meissnereric
Contributor

meissnereric commented Jan 11, 2018

This morning, on a c5.2xlarge instance, I built mxnet from source with the USE_MKL2017=1, USE_MKL2017_EXPERIMENTAL=1, USE_BLAS=openblas flags and ran
PYTHONPATH=./python/ nosetests3 --verbose tests/python/unittest

In 3 separate runs, test_operator.test_laop_3 failed. It didn't hang for me at all, and it's the only test to return a FAILURE. I'm not sure if we've seen failures like this before for this test, or only the hanging issue.

Thoughts?

======================================================================
FAIL: test_operator.test_laop_3

Traceback (most recent call last):
  File "/usr/lib/python3/dist-packages/nose/case.py", line 198, in runTest
    self.test(*self.arg)
  File "/home/ubuntu/src/incubator-mxnet/tests/python/unittest/test_operator.py", line 4278, in test_laop_3
    check_fw(test_syevd2, [data_in1], [res_eye, res_a])
  File "/home/ubuntu/src/incubator-mxnet/tests/python/unittest/test_operator.py", line 4253, in
    atol=atol_fw, dtype=dtype)
  File "/home/ubuntu/src/incubator-mxnet/python/mxnet/test_utils.py", line 997, in check_symbolic_forward
    equal_nan=equal_nan)
  File "/home/ubuntu/src/incubator-mxnet/python/mxnet/test_utils.py", line 495, in assert_almost_equal
    raise AssertionError(msg)
AssertionError:
Items are not equal:
Error 2142110.669088 exceeds tolerance rtol=0.000001, atol=0.000001. Location of maximum error:(5, 7), a=3.772180, b=0.518782
EXPECTED_Ut_L_U_output: array([[ -6.29539959, 3.14786301, -7.76048236, ..., -14.55426068,
5.89200304, 2.89609144],
[ 3.14786301, -2.20936719, -5.00206088, ..., 2.18995202,...
FORWARD_Ut_L_U_output: array([[ -5.86758909, 2.1861068 , -8.22037894, ..., -15.72788167,
7.86870499, 2.89609144],
[ 2.1861068 , -3.11242836, -5.28864811, ..., 3.78613349,...


@mseeger
Contributor

mseeger commented Jan 12, 2018 via email

@marcoabreu
Contributor Author

marcoabreu commented Jan 12, 2018 via email

@meissnereric
Contributor

With some help from Zhenwen in debugging the problem, I think we've figured out the bug that's causing this: somewhere during syevd, the signs of the eigenvectors are being computed or flipped incorrectly.

Haven't found where it's introduced or how, but below is some simple example code to reproduce the issue. This was run against the latest mxnet build as of this morning, with MKLML v0.12.

import mxnet as mx
import numpy as np

np.random.seed(1896893920)
n = 10
data_in1 = np.random.normal(0., 10., (n, n))
data_in1 = 0.5 * (data_in1 + data_in1.T)   # symmetrize

np_eigval, np_eigvec = np.linalg.eig(data_in1)
mx_eigvec, mx_eigval = mx.ndarray.linalg.syevd(mx.ndarray.array(data_in1, dtype=np.float64))

# numpy returns the eigenvalues unsorted; sort them to match syevd's ascending order
idx = np.argsort(np_eigval)

np.allclose(np_eigval[idx], mx_eigval.asnumpy())
# True: eigenvalues are close to each other after sorting the unsorted numpy ones

np.allclose(np_eigvec[:, idx].T, mx_eigvec.asnumpy())
# False: but the eigenvectors are not the same

np_signs = np.sign(np_eigvec[:, idx].T)[:, 0]
mx_signs = np.sign(mx_eigvec.asnumpy())[:, 0]

sign_flips = np_signs * mx_signs
print(sign_flips)
# [ 1. -1.  1. -1. -1.  1. -1. -1.  1. -1.], i.e. numpy and mxnet signs are not the same

np.allclose(np_eigvec[:, idx].T * sign_flips[:, None], mx_eigvec.asnumpy())
# True: if we flip the signs to match, the eigenvectors are all the same
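
One caveat when comparing against numpy (a side note, not a claim about where the bug is): each eigenvector is only determined up to a factor of -1, so two correct implementations can legitimately disagree on signs. A small sketch that canonicalizes signs before comparing:

import numpy as np

def canonical_signs(vecs_as_rows):
    # Flip each eigenvector (stored as a row) so that its largest-magnitude
    # entry is positive; assumes that entry is nonzero.
    idx_max = np.argmax(np.abs(vecs_as_rows), axis=1)
    signs = np.sign(vecs_as_rows[np.arange(vecs_as_rows.shape[0]), idx_max])
    return vecs_as_rows * signs[:, None]

# e.g.:
# np.allclose(canonical_signs(np_eigvec[:, idx].T),
#             canonical_signs(mx_eigvec.asnumpy()))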

@mseeger
Contributor

mseeger commented Jan 12, 2018 via email

@mseeger
Contributor

mseeger commented Jan 12, 2018 via email

@meissnereric
Contributor

We’ve been working on it this morning and have narrowed the issue down to a problem with syevd, only reproducible in the test environment.

Running the following simplified code:

def test_laop_TEST():
    np.random.seed(1896893920)
    n = 10
    data_in1 = np.random.normal(0., 10., (n, n))
    data_in1 = 0.5 * (data_in1 + data_in1.T)   # symmetrize
    u, lam = mx.nd.linalg.syevd(mx.nd.array(data_in1, dtype=np.float64))
    print(data_in1, u, lam)
    assert(False)   # fail on purpose so nose prints the captured output

Running this inside the test environment (via "PYTHONPATH=./python/ nosetests3 --verbose tests/python/unittest/test_operator.py:test_laop_TEST") returns the following results:
A: [[-13.34414296 15.99829902 -5.9348693 0.33725903 7.08999108
3.48517748 -8.1782682 2.64821474 -11.71583022 -2.67093716]
[ 15.99829902 -1.24079583 -5.15169143 -1.66193884 7.78811657
-3.24022297 6.49511542 12.28121039 8.40569059 -2.6123374 ]
[ -5.9348693 -5.15169143 18.40724804 -3.32434622 -8.2294941
-12.05570085 -3.134901 -12.91228277 11.51627784 -3.26478123]
[ 0.33725903 -1.66193884 -3.32434622 13.99603538 5.64011818
-5.48959896 4.85978757 4.35528089 -5.7480471 -11.40238071]
[ 7.08999108 7.78811657 -8.2294941 5.64011818 -5.28942777
-3.71190675 -13.53220115 3.52255819 6.93703998 8.68193436]
[ 3.48517748 -3.24022297 -12.05570085 -5.48959896 -3.71190675
-6.68378713 9.05448277 7.37555848 3.11379519 8.66276378]
[ -8.1782682 6.49511542 -3.134901 4.85978757 -13.53220115
9.05448277 -1.94185915 0.70195063 -10.21748333 4.8669592 ]
[ 2.64821474 12.28121039 -12.91228277 4.35528089 3.52255819
7.37555848 0.70195063 -20.63472654 7.87549062 1.3931764 ]
[-11.71583022 8.40569059 11.51627784 -5.7480471 6.93703998
3.11379519 -10.21748333 7.87549062 3.42171743 4.7941094 ]
[ -2.67093716 -2.6123374 -3.26478123 -11.40238071 8.68193436
8.66276378 4.8669592 1.3931764 4.7941094 -5.61335189]]
U: [[-0.40939395 -0.02004096 0.27457066 -0.25153737 0.66140858 0.28340247
0.14125274 0.22853749 0.18226673 -0.26486544]
[ 0.4472236 -0.48209223 0.10682268 -0.08815147 0.06254703 -0.36772867
0.30321108 0.56092533 0.04835745 -0.01863414]
[-0.39439743 0.09942506 -0.01934432 -0.06989485 -0.3580831 0.07148518
-0.39629101 0.70887307 -0.11974635 0.14861161]
[ 0.47108087 -0.15032447 0.34594981 0.19559081 -0.07895447 0.72793988
-0.23360136 0.07264473 -0.01704545 -0.03578629]
[-0.07231265 0.06808902 0.2476193 0.26860241 0.20379301 -0.02531744
0.10901297 0.01760572 0.27485097 0.85250383]
[-0.16324278 -0.30664205 -0.10891006 0.40596886 -0.10056171 -0.10892519
-0.28055657 -0.02900726 0.72918847 -0.26344591]
[ 0.12151877 0.6399878 0.30715681 0.04750568 -0.33368478 -0.02954461
0.40925563 0.16125632 0.34502726 -0.23300369]
[-0.34626999 -0.24576316 -0.15655482 0.51831127 -0.17288398 0.26124855
0.58672901 0.07084532 -0.27323606 -0.06688774]
[ 0.24814453 0.40352612 -0.33373039 0.52155155 0.48083476 -0.10008073
-0.17301593 0.27878979 -0.1649485 -0.12694282]
[-0.16370208 -0.05674461 0.6998121 0.32548404 0.00940855 -0.39724873
-0.21029293 -0.12432748 -0.34877216 -0.18731605]]
<NDArray 10x10 @cpu(0)>
Lambda: [-55.42942652 -36.08229736 -21.43322708 -13.37817004 -6.12139065
4.87496436 6.50358305 15.6977825 29.92340939 56.52168194]
<NDArray 10 @cpu(0)>

Whereas running the exact same code in an IPython notebook returns the following:

A: [[-13.34414296 15.99829902 -5.9348693 0.33725903 7.08999108
3.48517748 -8.1782682 2.64821474 -11.71583022 -2.67093716]
[ 15.99829902 -1.24079583 -5.15169143 -1.66193884 7.78811657
-3.24022297 6.49511542 12.28121039 8.40569059 -2.6123374 ]
[ -5.9348693 -5.15169143 18.40724804 -3.32434622 -8.2294941
-12.05570085 -3.134901 -12.91228277 11.51627784 -3.26478123]
[ 0.33725903 -1.66193884 -3.32434622 13.99603538 5.64011818
-5.48959896 4.85978757 4.35528089 -5.7480471 -11.40238071]
[ 7.08999108 7.78811657 -8.2294941 5.64011818 -5.28942777
-3.71190675 -13.53220115 3.52255819 6.93703998 8.68193436]
[ 3.48517748 -3.24022297 -12.05570085 -5.48959896 -3.71190675
-6.68378713 9.05448277 7.37555848 3.11379519 8.66276378]
[ -8.1782682 6.49511542 -3.134901 4.85978757 -13.53220115
9.05448277 -1.94185915 0.70195063 -10.21748333 4.8669592 ]
[ 2.64821474 12.28121039 -12.91228277 4.35528089 3.52255819
7.37555848 0.70195063 -20.63472654 7.87549062 1.3931764 ]
[-11.71583022 8.40569059 11.51627784 -5.7480471 6.93703998
3.11379519 -10.21748333 7.87549062 3.42171743 4.7941094 ]
[ -2.67093716 -2.6123374 -3.26478123 -11.40238071 8.68193436
8.66276378 4.8669592 1.3931764 4.7941094 -5.61335189]]
U: [[ 0.56896406 -0.49508446 -0.05515309 -0.09274153 0.05952866 -0.30276783
0.41076905 0.18170427 0.34030827 -0.08221328]
[-0.24222508 -0.14435087 0.26295741 -0.18396091 0.30094882 0.01134089
0.04484504 0.75461613 -0.35400702 -0.17327796]
[-0.13275656 -0.00823993 0.00394372 -0.16026696 0.57617405 0.29618633
0.3182481 -0.42766506 0.05309422 -0.4999056 ]
[ 0.4795233 -0.15113111 0.35068639 0.19762309 -0.06883189 0.72410954
-0.22482745 0.06161804 -0.02440456 -0.03418712]
[ 0.1443947 0.08427882 0.32210052 0.15902208 0.42793043 -0.1065733
0.30022962 -0.18552473 -0.29677106 0.6602256 ]
[ 0.4100674 0.61100202 0.24203264 -0.36583436 -0.26012926 -0.14006751
0.21300133 -0.01063569 -0.25163655 -0.26460707]
[-0.26364054 0.32696231 0.18616994 0.38349845 -0.17037144 0.19078566
0.53068931 0.2664058 0.47268578 0.01995924]
[ 0.19833204 0.25956163 0.15391535 0.52816029 0.37924181 -0.39832189
-0.39632359 0.06842959 0.15922644 -0.31789337]
[ 0.06100089 0.28041907 0.00365086 -0.52407767 0.32170882 0.11788932
-0.29540979 0.16579616 0.55700809 0.31712067]
[-0.2616713 -0.27894058 0.76428004 -0.18169327 -0.20501889 -0.23708203
-0.12629573 -0.27049638 0.21217396 -0.06485779]]
<NDArray 10x10 @cpu(0)>
Lambda: [-39.53770898 -31.85235173 -27.55026183 -13.38111156 -6.97443774
5.59771442 7.56903358 22.57472579 27.01218163 37.619126 ]
<NDArray 10 @cpu(0)>

In both cases, A is the same but U and Lambda are different.
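
Since the eigenvalues themselves differ while A is identical, at least one of the two results must be wrong. A quick way to tell which (a sketch, assuming mxnet's documented syevd convention A = U^T * diag(Lambda) * U with the eigenvectors stored as rows of U):

import numpy as np

def check_syevd(a, u, lam, tol=1e-8):
    # Accept either NDArrays or numpy arrays for u and lam.
    u = u.asnumpy() if hasattr(u, 'asnumpy') else np.asarray(u)
    lam = lam.asnumpy() if hasattr(lam, 'asnumpy') else np.asarray(lam)
    ok_recon = np.allclose(u.T.dot(np.diag(lam)).dot(u), a, atol=tol)   # A == U^T diag(L) U
    ok_ortho = np.allclose(u.dot(u.T), np.eye(u.shape[0]), atol=tol)    # rows orthonormal
    return ok_recon, ok_ortho

# Running check_syevd(data_in1, u, lam) in both environments should show
# which one returns an invalid decomposition.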

Thoughts?

@mseeger
Contributor

mseeger commented Jan 15, 2018 via email

@meissnereric
Contributor

OK, new update to this story.

I synced each setup to the latest changes from the master branch, and now I can't reproduce the bug at all. I ran the build in 3 configurations:

  1. one with no MKL installed/configured (my mac)
  2. one with MKL2017+MKL2017_EXPERIMENTAL+USE_BLAS=openblas (ec2 host on DL AMI)
  3. one with MKL2017, MKL2017_EXPERIMENTAL, USE_BLAS=mkl and full MKL (ec2 host on DL AMI)

Previously, we were seeing the bug in the second configuration. I haven't gone back to check if the bug is still reproducible at an earlier commit, but if it's fixed now it may or may not be worth figuring out what the issue was. What do you all think?

@mseeger
Contributor

mseeger commented Jan 18, 2018 via email

@szha
Member

szha commented Jan 18, 2018

MKLML is a dependency of the MKL2017 option and was updated recently in #9218. That might be related. It would be easy to verify by reverting to the point before that PR.

@meissnereric
Contributor

@mseeger No, I don't know for sure that this issue caused the hanging tests, as I never saw one hang. Both issues being in laop3 felt too specific to be a coincidence, so I assumed they shared the same underlying problem.

@szha Could be. Since the issue seems to be resolved now, it would just be testing that it wasn't broken at that point.

Anyone know if we're still seeing the hanging issue in CI builds?

@asmushetzel
Contributor

If all works fine now, we should close this issue. We have checked everything again; it should work as is, it does work as is, and tests are in place that ensure we will notice if this ever happens again.

Marco, ok with this?

@szha
Member

szha commented Jan 22, 2018

@meissnereric @asmushetzel, though not high priority, it would still be good to determine whether or not it's a bug in mxnet. That's why I suggested reverting to a version prior to PR #9218, so we know for sure.

@marcoabreu
Contributor Author

marcoabreu commented Jan 23, 2018 via email

@szha szha added this to To Do in Tests Improvement Jan 27, 2018
@marcoabreu
Contributor Author

Closed due to MKLDNN replacement

Tests Improvement automation moved this from To Do to Done Mar 9, 2018