This repository has been archived by the owner on Nov 17, 2023. It is now read-only.

test_operator.test_laop_3 hangs #9295

Closed
marcoabreu opened this issue Jan 4, 2018 · 26 comments

Comments

@marcoabreu
Contributor

The execution of test_operator.test_laop_3 randomly hangs without a message on Python3: MKLML-CPU. See http://jenkins.mxnet-ci.amazon-ml.com/blue/organizations/jenkins/incubator-mxnet/detail/master/130/pipeline/329

@marcoabreu
Contributor Author

@piiswrong @szha Please assign the proper label.

@KellenSunderland
Contributor

Here's the ticket for the LA hangs with MKL @mseeger @asmushetzel.

@KellenSunderland
Contributor

The next time someone runs into this one let's try and get a trace of all the threads. I looked at this a little already with @asmushetzel, and it certainly was a deadlock during initialization (i.e. not livelock, no activity on any thread), but I wasn't exactly clear on why one of the threads was blocked.
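
For the Python side, a cheap way to capture such a trace next time might be a faulthandler watchdog at the top of the test run (a sketch, not something that exists in the test suite today; the native MKL/OpenMP threads would still need something like gdb's "thread apply all bt" on the hung process):

# Sketch only: dump the stack of every Python thread if the run takes too long.
# This covers Python frames only; native threads need gdb attached separately.
import faulthandler
import sys

# If the process is still running after 10 minutes, print all Python thread
# stacks to stderr, and keep doing so every 10 minutes (repeat=True).
faulthandler.dump_traceback_later(600, repeat=True, file=sys.stderr)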

@mseeger
Contributor

mseeger commented Jan 6, 2018

If the test hangs when using MKL, then I'm afraid a normal run of MXNet that uses the linalg ops will eventually hang on MKL as well.

Do you also see these problems with the other laop tests? Because laop3 is only for a small subset of the ops. Is it clear which of them is the issue?

@marcoabreu
Contributor Author

I have never seen this problem with any test other than laop3. Of course, it's possible that earlier tests left the process in a corrupted state. I haven't had the chance yet to dig into the problem and identify the failing line. We might consider adding more debug output to those tests, especially where one test method consists of multiple checks (see the sketch below).
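
For example (a hypothetical helper, not code that exists in the test today), each sub-check could announce itself so a hang in the CI log can be pinned to a specific op:

import sys

def log_check(name, fn, *args, **kwargs):
    # Hypothetical wrapper: flush before running so the message survives
    # even if the wrapped check hangs and the job is killed.
    print('test_laop_3: starting %s' % name)
    sys.stdout.flush()
    fn(*args, **kwargs)
    print('test_laop_3: finished %s' % name)
    sys.stdout.flush()

# e.g. instead of calling check_fw(test_syevd2, [data_in1], [res_eye, res_a]) directly:
# log_check('syevd forward', check_fw, test_syevd2, [data_in1], [res_eye, res_a])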

@asmushetzel
Contributor

@meissnereric
Eric, can you own this? (I'm willing to help.) The configuration that fails is
python3, USE_MKL2017=1, USE_MKL2017_EXPERIMENTAL=1, USE_BLAS=openblas
The hanging test executes syevd, which according to the above settings will be the openblas version of syevd.
Nobody currently knows what the issue is: there seems to be no change in the test environment, and we have not modified any linalg functions (at least to my knowledge; you may take a look at the git logs). Still, this problem suddenly popped up, and worse, only occasionally: you have to run it repeatedly until it sometimes hangs (see the loop sketch below).
This is important, as we want all tests in MXNet working.
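
Something along these lines (a sketch; the timeout and run count are arbitrary) can loop just test_laop_3 with a per-run timeout, so a hang is detected instead of blocking forever:

import os
import subprocess

env = dict(os.environ, PYTHONPATH='./python/')
cmd = ['nosetests3', '--verbose',
       'tests/python/unittest/test_operator.py:test_laop_3']

for i in range(200):
    try:
        # Each run gets 5 minutes; subprocess.run kills it on timeout.
        subprocess.run(cmd, env=env, timeout=300, check=True)
    except subprocess.TimeoutExpired:
        print('run %d hung (no result after 300s)' % i)
        break
    except subprocess.CalledProcessError as e:
        print('run %d failed with exit code %d' % (i, e.returncode))
        break
    print('run %d passed' % i)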

@meissnereric
Contributor

Sure, I'll have a look at it today. Thanks for letting me know.

@KellenSunderland
Contributor

KellenSunderland commented Jan 10, 2018

Did these tests start failing around the end of December, when this change was merged? d0388d0

@marcoabreu
Contributor Author

marcoabreu commented Jan 10, 2018 via email

@mseeger
Contributor

mseeger commented Jan 11, 2018 via email

@meissnereric
Contributor

meissnereric commented Jan 11, 2018

This morning, on a c5.2xlarge instance, I built mxnet from source with the USE_MKL2017=1, USE_MKL2017_EXPERIMENTAL=1, USE_BLAS=openblas flags and ran
PYTHONPATH=./python/ nosetests3 --verbose tests/python/unittest

In 3 separate runs, test_operator.test_laop_3 failed. It didn't hang for me at all, and it's the only test to return a FAILURE. I'm not sure if we've seen failures like this before for this test, or only the hanging issue.

Thoughts?

======================================================================
FAIL: test_operator.test_laop_3

Traceback (most recent call last):
  File "/usr/lib/python3/dist-packages/nose/case.py", line 198, in runTest
    self.test(*self.arg)
  File "/home/ubuntu/src/incubator-mxnet/tests/python/unittest/test_operator.py", line 4278, in test_laop_3
    check_fw(test_syevd2, [data_in1], [res_eye, res_a])
  File "/home/ubuntu/src/incubator-mxnet/tests/python/unittest/test_operator.py", line 4253, in
    atol=atol_fw, dtype=dtype)
  File "/home/ubuntu/src/incubator-mxnet/python/mxnet/test_utils.py", line 997, in check_symbolic_forward
    equal_nan=equal_nan)
  File "/home/ubuntu/src/incubator-mxnet/python/mxnet/test_utils.py", line 495, in assert_almost_equal
    raise AssertionError(msg)
AssertionError:
Items are not equal:
Error 2142110.669088 exceeds tolerance rtol=0.000001, atol=0.000001. Location of maximum error:(5, 7), a=3.772180, b=0.518782
EXPECTED_Ut_L_U_output: array([[ -6.29539959, 3.14786301, -7.76048236, ..., -14.55426068,
5.89200304, 2.89609144],
[ 3.14786301, -2.20936719, -5.00206088, ..., 2.18995202,...
FORWARD_Ut_L_U_output: array([[ -5.86758909, 2.1861068 , -8.22037894, ..., -15.72788167,
7.86870499, 2.89609144],
[ 2.1861068 , -3.11242836, -5.28864811, ..., 3.78613349,...


@mseeger
Contributor

mseeger commented Jan 12, 2018 via email

@marcoabreu
Contributor Author

marcoabreu commented Jan 12, 2018 via email

@meissnereric
Contributor

With some help from Zhenwen in debugging the problem, I think we've figured out the bug that's causing this: somewhere during syevd, the signs of the eigenvectors are being computed or flipped incorrectly.

Haven't found where it's introduced or how, but below is some simple example code to reproduce the issue. This was run against the latest mxnet build as of this morning, with MKLML v0.12.

import mxnet as mx
import numpy as np

np.random.seed(1896893920)
n = 10
data_in1 = np.random.normal(0., 10., (n, n))
data_in1 = 0.5 * (data_in1 + data_in1.T)   # symmetrize

np_eigval, np_eigvec = np.linalg.eig(data_in1)
mx_eigvec, mx_eigval = mx.ndarray.linalg.syevd(mx.ndarray.array(data_in1, dtype=np.float64))

# numpy returns the eigenvalues unsorted; sort them to match syevd's ascending order
idx = np.argsort(np_eigval)

np.allclose(np_eigval[idx], mx_eigval.asnumpy())
# True: eigenvalues are close to each other after sorting the unsorted numpy ones

np.allclose(np_eigvec[:, idx].T, mx_eigvec.asnumpy())
# False: but the eigenvectors are not the same

np_signs = np.sign(np_eigvec[:, idx].T)[:, 0]
mx_signs = np.sign(mx_eigvec.asnumpy())[:, 0]

sign_flips = np_signs * mx_signs
print(sign_flips)
# [ 1. -1.  1. -1. -1.  1. -1. -1.  1. -1.], i.e. numpy and mxnet signs are not the same

np.allclose(np_eigvec[:, idx].T * sign_flips[:, None], mx_eigvec.asnumpy())
# True: if we flip the signs to match, the eigenvectors are all the same
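
One caveat when comparing against numpy (a side note, not a claim about where the bug is): each eigenvector is only determined up to a factor of -1, so two correct implementations can legitimately disagree on signs. A small sketch that canonicalizes signs before comparing:

import numpy as np

def canonical_signs(vecs_as_rows):
    # Flip each eigenvector (stored as a row) so that its largest-magnitude
    # entry is positive; assumes that entry is nonzero.
    idx_max = np.argmax(np.abs(vecs_as_rows), axis=1)
    signs = np.sign(vecs_as_rows[np.arange(vecs_as_rows.shape[0]), idx_max])
    return vecs_as_rows * signs[:, None]

# e.g.:
# np.allclose(canonical_signs(np_eigvec[:, idx].T),
#             canonical_signs(mx_eigvec.asnumpy()))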

@mseeger
Contributor

mseeger commented Jan 12, 2018 via email

@mseeger
Contributor

mseeger commented Jan 12, 2018 via email

@meissnereric
Contributor

We’ve been working on it this morning and have narrowed the issue down to a problem with syevd, only reproducible in the test environment.

Running the following simplified code:

def test_laop_TEST():
    np.random.seed(1896893920)
    n = 10
    data_in1 = np.random.normal(0., 10., (n, n))
    data_in1 = 0.5 * (data_in1 + data_in1.T)   # symmetrize
    u, lam = mx.nd.linalg.syevd(mx.nd.array(data_in1, dtype=np.float64))
    print(data_in1, u, lam)
    assert(False)   # fail on purpose so nose prints the captured output

Running this inside the test environment (via "PYTHONPATH=./python/ nosetests3 --verbose tests/python/unittest/test_operator.py:test_laop_TEST") returns the following results:
A: [[-13.34414296 15.99829902 -5.9348693 0.33725903 7.08999108
3.48517748 -8.1782682 2.64821474 -11.71583022 -2.67093716]
[ 15.99829902 -1.24079583 -5.15169143 -1.66193884 7.78811657
-3.24022297 6.49511542 12.28121039 8.40569059 -2.6123374 ]
[ -5.9348693 -5.15169143 18.40724804 -3.32434622 -8.2294941
-12.05570085 -3.134901 -12.91228277 11.51627784 -3.26478123]
[ 0.33725903 -1.66193884 -3.32434622 13.99603538 5.64011818
-5.48959896 4.85978757 4.35528089 -5.7480471 -11.40238071]
[ 7.08999108 7.78811657 -8.2294941 5.64011818 -5.28942777
-3.71190675 -13.53220115 3.52255819 6.93703998 8.68193436]
[ 3.48517748 -3.24022297 -12.05570085 -5.48959896 -3.71190675
-6.68378713 9.05448277 7.37555848 3.11379519 8.66276378]
[ -8.1782682 6.49511542 -3.134901 4.85978757 -13.53220115
9.05448277 -1.94185915 0.70195063 -10.21748333 4.8669592 ]
[ 2.64821474 12.28121039 -12.91228277 4.35528089 3.52255819
7.37555848 0.70195063 -20.63472654 7.87549062 1.3931764 ]
[-11.71583022 8.40569059 11.51627784 -5.7480471 6.93703998
3.11379519 -10.21748333 7.87549062 3.42171743 4.7941094 ]
[ -2.67093716 -2.6123374 -3.26478123 -11.40238071 8.68193436
8.66276378 4.8669592 1.3931764 4.7941094 -5.61335189]]
U: [[-0.40939395 -0.02004096 0.27457066 -0.25153737 0.66140858 0.28340247
0.14125274 0.22853749 0.18226673 -0.26486544]
[ 0.4472236 -0.48209223 0.10682268 -0.08815147 0.06254703 -0.36772867
0.30321108 0.56092533 0.04835745 -0.01863414]
[-0.39439743 0.09942506 -0.01934432 -0.06989485 -0.3580831 0.07148518
-0.39629101 0.70887307 -0.11974635 0.14861161]
[ 0.47108087 -0.15032447 0.34594981 0.19559081 -0.07895447 0.72793988
-0.23360136 0.07264473 -0.01704545 -0.03578629]
[-0.07231265 0.06808902 0.2476193 0.26860241 0.20379301 -0.02531744
0.10901297 0.01760572 0.27485097 0.85250383]
[-0.16324278 -0.30664205 -0.10891006 0.40596886 -0.10056171 -0.10892519
-0.28055657 -0.02900726 0.72918847 -0.26344591]
[ 0.12151877 0.6399878 0.30715681 0.04750568 -0.33368478 -0.02954461
0.40925563 0.16125632 0.34502726 -0.23300369]
[-0.34626999 -0.24576316 -0.15655482 0.51831127 -0.17288398 0.26124855
0.58672901 0.07084532 -0.27323606 -0.06688774]
[ 0.24814453 0.40352612 -0.33373039 0.52155155 0.48083476 -0.10008073
-0.17301593 0.27878979 -0.1649485 -0.12694282]
[-0.16370208 -0.05674461 0.6998121 0.32548404 0.00940855 -0.39724873
-0.21029293 -0.12432748 -0.34877216 -0.18731605]]
<NDArray 10x10 @cpu(0)>
Lambda: [-55.42942652 -36.08229736 -21.43322708 -13.37817004 -6.12139065
4.87496436 6.50358305 15.6977825 29.92340939 56.52168194]
<NDArray 10 @cpu(0)>

Whereas running the exact same code in an IPython notebook returns the following:

A: [[-13.34414296 15.99829902 -5.9348693 0.33725903 7.08999108
3.48517748 -8.1782682 2.64821474 -11.71583022 -2.67093716]
[ 15.99829902 -1.24079583 -5.15169143 -1.66193884 7.78811657
-3.24022297 6.49511542 12.28121039 8.40569059 -2.6123374 ]
[ -5.9348693 -5.15169143 18.40724804 -3.32434622 -8.2294941
-12.05570085 -3.134901 -12.91228277 11.51627784 -3.26478123]
[ 0.33725903 -1.66193884 -3.32434622 13.99603538 5.64011818
-5.48959896 4.85978757 4.35528089 -5.7480471 -11.40238071]
[ 7.08999108 7.78811657 -8.2294941 5.64011818 -5.28942777
-3.71190675 -13.53220115 3.52255819 6.93703998 8.68193436]
[ 3.48517748 -3.24022297 -12.05570085 -5.48959896 -3.71190675
-6.68378713 9.05448277 7.37555848 3.11379519 8.66276378]
[ -8.1782682 6.49511542 -3.134901 4.85978757 -13.53220115
9.05448277 -1.94185915 0.70195063 -10.21748333 4.8669592 ]
[ 2.64821474 12.28121039 -12.91228277 4.35528089 3.52255819
7.37555848 0.70195063 -20.63472654 7.87549062 1.3931764 ]
[-11.71583022 8.40569059 11.51627784 -5.7480471 6.93703998
3.11379519 -10.21748333 7.87549062 3.42171743 4.7941094 ]
[ -2.67093716 -2.6123374 -3.26478123 -11.40238071 8.68193436
8.66276378 4.8669592 1.3931764 4.7941094 -5.61335189]]
U: [[ 0.56896406 -0.49508446 -0.05515309 -0.09274153 0.05952866 -0.30276783
0.41076905 0.18170427 0.34030827 -0.08221328]
[-0.24222508 -0.14435087 0.26295741 -0.18396091 0.30094882 0.01134089
0.04484504 0.75461613 -0.35400702 -0.17327796]
[-0.13275656 -0.00823993 0.00394372 -0.16026696 0.57617405 0.29618633
0.3182481 -0.42766506 0.05309422 -0.4999056 ]
[ 0.4795233 -0.15113111 0.35068639 0.19762309 -0.06883189 0.72410954
-0.22482745 0.06161804 -0.02440456 -0.03418712]
[ 0.1443947 0.08427882 0.32210052 0.15902208 0.42793043 -0.1065733
0.30022962 -0.18552473 -0.29677106 0.6602256 ]
[ 0.4100674 0.61100202 0.24203264 -0.36583436 -0.26012926 -0.14006751
0.21300133 -0.01063569 -0.25163655 -0.26460707]
[-0.26364054 0.32696231 0.18616994 0.38349845 -0.17037144 0.19078566
0.53068931 0.2664058 0.47268578 0.01995924]
[ 0.19833204 0.25956163 0.15391535 0.52816029 0.37924181 -0.39832189
-0.39632359 0.06842959 0.15922644 -0.31789337]
[ 0.06100089 0.28041907 0.00365086 -0.52407767 0.32170882 0.11788932
-0.29540979 0.16579616 0.55700809 0.31712067]
[-0.2616713 -0.27894058 0.76428004 -0.18169327 -0.20501889 -0.23708203
-0.12629573 -0.27049638 0.21217396 -0.06485779]]
<NDArray 10x10 @cpu(0)>
Lambda: [-39.53770898 -31.85235173 -27.55026183 -13.38111156 -6.97443774
5.59771442 7.56903358 22.57472579 27.01218163 37.619126 ]
<NDArray 10 @cpu(0)>

In both cases, A is the same but U and Lambda are different.
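
Since the eigenvalues themselves differ while A is identical, at least one of the two results must be wrong. A quick way to tell which (a sketch, assuming mxnet's documented syevd convention A = U^T * diag(Lambda) * U with the eigenvectors stored as rows of U):

import numpy as np

def check_syevd(a, u, lam, tol=1e-8):
    # Accept either NDArrays or numpy arrays for u and lam.
    u = u.asnumpy() if hasattr(u, 'asnumpy') else np.asarray(u)
    lam = lam.asnumpy() if hasattr(lam, 'asnumpy') else np.asarray(lam)
    ok_recon = np.allclose(u.T.dot(np.diag(lam)).dot(u), a, atol=tol)   # A == U^T diag(L) U
    ok_ortho = np.allclose(u.dot(u.T), np.eye(u.shape[0]), atol=tol)    # rows orthonormal
    return ok_recon, ok_ortho

# Running check_syevd(data_in1, u, lam) in both environments should show
# which one returns an invalid decomposition.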

Thoughts?

@mseeger
Contributor

mseeger commented Jan 15, 2018 via email

@meissnereric
Contributor

OK, new update to this story.

I synced each setup to the latest changes from the master branch, and now I can't reproduce the bug at all. I ran the build in 3 configurations:

  1. one with no MKL installed/configured (my mac)
  2. one with MKL2017+MKL2017_EXPERIMENTAL+USE_BLAS=openblas (ec2 host on DL AMI)
  3. one with MKL2017, MKL2017_EXPERIMENTAL, USE_BLAS=mkl and full MKL (ec2 host on DL AMI)

Previously, we were seeing the bug in the second configuration. I haven't gone back to check if the bug is still reproducible at an earlier commit, but if it's fixed now it may or may not be worth figuring out what the issue was. What do you all think?

@mseeger
Contributor

mseeger commented Jan 18, 2018 via email

@szha
Member

szha commented Jan 18, 2018

MKLML is a dependency of the MKL2017 option and was updated recently in #9218. That might be related. It would be easy to verify by reverting to the point before that PR.

@meissnereric
Contributor

@mseeger No, I don't know for sure that this issue caused the hanging tests, as I never saw one hang. Both issues being in laop3 felt too specific to be a coincidence, so I assumed they shared the same underlying problem.

@szha Could be. Since the issue seems to be resolved now, it would just be testing that it wasn't broken at that point.

Anyone know if we're still seeing the hanging issue in CI builds?

@asmushetzel
Contributor

If all works fine now, we should close this issue. We have checked everything again; it should work as is, it does work as is, and tests are in place that ensure we will notice if this ever happens again.

Marco, ok with this?

@szha
Member

szha commented Jan 22, 2018

@meissnereric @asmushetzel, though not high priority, it would still be good to determine whether or not it's a bug in mxnet. That's why I suggested reverting to a version prior to PR #9218, so we know for sure.

@marcoabreu
Contributor Author

marcoabreu commented Jan 23, 2018 via email

@szha szha added this to To Do in Tests Improvement Jan 27, 2018
@marcoabreu
Contributor Author

Closed due to MKLDNN replacement

Tests Improvement automation moved this from To Do to Done Mar 9, 2018