test_operator.test_laop_3 hangs #9295
Comments
@piiswrong @szha Please assign the proper label. |
Here's the ticket for the LA hangs with MKL @mseeger @asmushetzel. |
The next time someone runs into this one, let's try to get a trace of all the threads. I looked at this a little already with @asmushetzel, and it certainly was a deadlock during initialization (i.e. not a livelock; there was no activity on any thread), but I wasn't exactly clear on why one of the threads was blocked. |
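One way to grab those traces next time (a sketch, assuming the snippet can be injected into the test process, e.g. via a sitecustomize or conftest module): Python's built-in faulthandler can dump every thread's Python-level stack. For the native MKL/OpenMP threads you would still want gdb's `thread apply all bt` against the hung process.

```python
import faulthandler
import signal

# Dump all Python thread stacks to stderr when the hung process
# receives SIGUSR1 (trigger externally with: kill -USR1 <pid>).
faulthandler.register(signal.SIGUSR1)

# Or dump automatically if the process is still running after
# 10 minutes, without terminating it.
faulthandler.dump_traceback_later(600, exit=False)
```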
If the test hangs using MKL, then I am afraid a normal run of MXNet using the linalg ops will eventually hang on MKL as well. Do you also see these problems with the other laop tests? laop3 covers only a small subset of the ops. Is it clear which of them is the issue? |
I have never seen this problem arise with any other test than laop3. Of course, it might be possible that the previous tests brought the process into a corrupted state. I haven't had the chance yet to dig into the problem and identify the failing line - we might consider adding more debug output to those tests, especially if one test-method consists of multiple tests. |
@meissnereric |
Sure, I'll have a look at it today thanks for letting me know. |
Did these tests start failing around the end of December, when this change was merged? d0388d0 |
That's the one! Yes, it started happening after Christmas. You're awesome, Kellen, thanks for finding it!
|
Eric Meissner in Cambridge is now owning this one.
|
Hmm, the plot thickens. So on a c5.2xlarge instance I built mxnet from source this morning using the USE_MKL2017=1, USE_MKL2017_EXPERIMENTAL=1, USE_BLAS=openblas flags and ran

```
PYTHONPATH=./python/ nosetests3 --verbose tests/python/unittest
```

3 separate times. test_operator.test_laop_3 FAILED every time, apparently due to numerical issues; it didn't hang for me at all. It's the only test to return a FAILURE. I'm not sure if we've seen failures like this before for this test, or only the hanging issue. Thoughts?

```
======================================================================
FAIL: test_operator.test_laop_3
Traceback (most recent call last):
  File "/usr/lib/python3/dist-packages/nose/case.py", line 198, in runTest
    self.test(*self.arg)
  File "/home/ubuntu/src/incubator-mxnet/tests/python/unittest/test_operator.py", line 4278, in test_laop_3
    check_fw(test_syevd2, [data_in1], [res_eye, res_a])
  File "/home/ubuntu/src/incubator-mxnet/tests/python/unittest/test_operator.py", line 4253, in
    atol=atol_fw, dtype=dtype)
  File "/home/ubuntu/src/incubator-mxnet/python/mxnet/test_utils.py", line 997, in check_symbolic_forward
    equal_nan=equal_nan)
  File "/home/ubuntu/src/incubator-mxnet/python/mxnet/test_utils.py", line 495, in assert_almost_equal
    raise AssertionError(msg)
AssertionError:
Items are not equal:
Error 2142110.669088 exceeds tolerance rtol=0.000001, atol=0.000001.
Location of maximum error: (5, 7), a=3.772180, b=0.518782
EXPECTED_Ut_L_U_output: array([[ -6.29539959,   3.14786301,  -7.76048236, ..., -14.55426068,
          5.89200304,   2.89609144],
       [  3.14786301,  -2.20936719,  -5.00206088, ...,   2.18995202,...
FORWARD_Ut_L_U_output: array([[ -5.86758909,   2.1861068 ,  -8.22037894, ..., -15.72788167,
          7.86870499,   2.89609144],
       [  2.1861068 ,  -3.11242836,  -5.28864811, ...,   3.78613349,...
```
|
Hello,
This test compares A and the identity to what you get computing expressions from U and Lambda. Please output everything: A, U, Lambda, and the results of the expressions. Do this directly in numpy format, then compare to numpy, using the same style. These errors are too large to be due to low precision. This is exactly what a test is for. If you find something wrong, either with syevd or, worse, with gemm2, please isolate the case in a non-random test.
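For reference, a minimal numpy sketch of the check being described, assuming MXNet's syevd convention A = U^T * diag(Lambda) * U with orthonormal rows in U (the helper name and tolerances here are illustrative, not the actual test code):

```python
import numpy as np

def check_syevd_outputs(a, u, lam, rtol=1e-6, atol=1e-6):
    """Recompute A and the identity from U and Lambda and compare."""
    recomposed_a = u.T @ np.diag(lam) @ u   # should reproduce the input A
    gram = u @ u.T                          # should be the identity matrix
    assert np.allclose(recomposed_a, a, rtol=rtol, atol=atol)
    assert np.allclose(gram, np.eye(a.shape[0]), rtol=rtol, atol=atol)
```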
|
Just FYI, we updated MXNet to use MKLML v0.12 last night.
|
With some help from Zhenwen in debugging the problem, I think we've figured out what the bug is that's causing this. Somewhere during syevd the signs of the eigenvectors are getting computed or flipped incorrectly. I haven't found where it's introduced or how, but below is some simple example code to reproduce the issue. This was run against the latest mxnet build as of this morning, with MKLML v0.12.

```python
import mxnet as mx
import numpy as np

np.random.seed(1896893920)
n = 10
data_in1 = np.random.normal(0., 10., (n, n))
data_in1 = 0.5 * (data_in1 + data_in1.T)   # symmetrize

np_eigval, np_eigvec = np.linalg.eig(data_in1)
mx_eigvec, mx_eigval = mx.ndarray.linalg.syevd(mx.ndarray.array(data_in1, dtype=np.float64))

idx = np.argsort(np_eigval)                # numpy's eigenvalues come back unsorted; sort ascending to match mxnet

np.allclose(np_eigval[idx], mx_eigval.asnumpy())
# True: eigenvalues are close to each other after sorting the unsorted numpy ones

np.allclose(np_eigvec[:, idx].T, mx_eigvec.asnumpy())
# False: but the eigenvectors are not the same

np_signs = np.sign(np_eigvec[:, idx].T)[:, 0]
mx_signs = np.sign(mx_eigvec.asnumpy())[:, 0]
sign_flips = np_signs * mx_signs
print(sign_flips)
# [ 1. -1.  1. -1. -1.  1. -1. -1.  1. -1.], i.e. the numpy and mxnet signs are not the same

np.allclose(np_eigvec[:, idx].T * sign_flips[:, None], mx_eigvec.asnumpy())
# True: if we flip the signs to match, the eigenvectors are all the same
```
|
I noticed this issue with the signs as well, and I am in fact controlling the signs; see my code. The test, however, should be independent of the signs of the eigenvectors; it just depends on the definition of the decomposition, right?
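A quick numpy illustration of that point (a sketch, not code from the thread): flipping the sign of any row of U changes U itself but leaves both U^T * diag(Lambda) * U and U * U^T unchanged, so a check built on the definition of the decomposition is independent of the signs.

```python
import numpy as np

rng = np.random.default_rng(0)
a = rng.normal(size=(5, 5))
a = 0.5 * (a + a.T)                   # random symmetric matrix
lam, v = np.linalg.eigh(a)
u = v.T                               # rows as eigenvectors, mxnet-style

s = np.diag([1., -1., 1., -1., -1.])  # arbitrary per-row sign flips
u_flipped = s @ u

# S @ diag(lam) @ S == diag(lam) for any diagonal +-1 matrix S, so both
# recompositions agree even though u and u_flipped differ.
assert np.allclose(u.T @ np.diag(lam) @ u, u_flipped.T @ np.diag(lam) @ u_flipped)
assert np.allclose(u @ u.T, u_flipped @ u_flipped.T)
```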
|
The test does not compare eigenvectors. This cannot be the reason it fails.
|
We've been working on it this morning and have narrowed the issue down to a problem with syevd, only reproducible in the test environment. Running the following simplified code:

```python
def test_laop_TEST():
    np.random.seed(1896893920)
    n = 10
    data_in1 = np.random.normal(0., 10., (n, n))
    data_in1 = 0.5 * (data_in1 + data_in1.T)
    u, lam = mx.nd.linalg.syevd(mx.nd.array(data_in1, dtype=np.float64))
    print(data_in1, u, lam)
    assert False
```

inside the test environment (using `PYTHONPATH=./python/ nosetests3 --verbose tests/python/unittest/test_operator.py:test_laop_TEST`) returns:

```
Lambda: [-55.42942652 -36.08229736 -21.43322708 -13.37817004  -6.12139065
   4.87496436   6.50358305  15.6977825   29.92340939  56.52168194]
```

whereas running the exact same code in an ipython notebook returns:

```
Lambda: [-39.53770898 -31.85235173 -27.55026183 -13.38111156  -6.97443774
   5.59771442   7.56903358  22.57472579  27.01218163  37.619126  ]
```

(The printed A and the full 10x10 U matrices are omitted here; U also differs between the two runs.) In both cases, A is the same but U and Lambda are different. Thoughts? |
But in the test you said fails, it recomputes A and the identity from U and Lambda. While U is not unique (signs), the results must fulfill the definition, so it must reproduce A and the identity.
|
Ok, new update to this story. I synced to the latest changes from the master branch, and now I can't reproduce the bug at all. I ran the build in 3 configurations:

1. no MKL installed/configured (my Mac)
2. MKL2017 + MKL2017_EXPERIMENTAL + USE_BLAS=openblas (EC2 host on the DL AMI)
3. MKL2017 + MKL2017_EXPERIMENTAL + USE_BLAS=mkl with full MKL (EC2 host on the DL AMI)
Previously, we were seeing the bug in the second configuration. I haven't gone back to check if the bug is still reproducible at an earlier commit, but if it's fixed now it may or may not be worth figuring out what the issue was. What do you all think? |
Are we sure the test hangs due to the issue you saw? Well, we can watch for a while and see whether the issues Marco raised are still present.
|
MKLML is a dependency of the MKL2017 option and was updated recently in #9218. Might be related. It would be easy to verify if you revert to the point before that PR. |
@mseeger No, I don't know for sure that this issue caused the hanging tests, as I never saw one hang. Both issues being in laop3 felt too specific to be a coincidence, so I assumed they were the same underlying problem. @szha Could be. Since the issue seems to be resolved now, it would just be testing that it wasn't broken at that point. Anyone know if we're still seeing the hanging issue in CI builds? |
If all works fine now, we should close this issue. We have checked everything again: it should work as is, it does work as is, and tests are in place to ensure that we will notice if this ever happens again. Marco, ok with this? |
@meissnereric @asmushetzel, though not high priority, it would still be good to determine whether or not it's a bug in mxnet. This is why I suggested reverting to a version prior to PR #9218, so that we know for sure. |
Sorry for the late response. I didn't see any hangs since the upgrade, but it would be good if we could figure out whether that specific version was the issue.
|
Closed due to MKLDNN replacement |
The execution of test_operator.test_laop_3 randomly hangs without a message on Python3: MKLML-CPU. See http://jenkins.mxnet-ci.amazon-ml.com/blue/organizations/jenkins/incubator-mxnet/detail/master/130/pipeline/329