
Flakey test: test_operator_gpu.py:test_rnntanh_bidirectional #15034

Open
DickJC123 opened this issue May 22, 2019 · 9 comments

@DickJC123
Contributor

DickJC123 commented May 22, 2019

Description

The failures occur infrequently and nondeterministically (but within 1000 trials) when the test is run on an NVIDIA P40 GPU. Based on initial investigation, the problem is introduced by the commit:

1c49e40fd  2019-04-13  Hao Li                 Change RNN OP to stateful (#14476)

... which is not too surprising given the sizeable refactoring of the RNN code in that commit.
Because the P40 has far fewer compute resources than the P100 and V100, I suspect a timing-related issue. No failures are seen on P100 or V100, nor on the P40 with a checkout of the commit prior to 1c49e40. Looking over that commit, I see changes in how the various 'spaces' are handled in the GPU case. Maybe the commit author @lihaofd can chime in on the need/motivation for these changes:

Prior to the commit, the 'workspace' (sized by cudnnGetRNNWorkspaceSize) was allocated from MXNet's TempSpace. With the commit, it became a per-instance permanent allocation.

Also, prior to the commit, the dropout state space was a per-instance permanent allocation, while with the commit it became managed by the MXNet context resources (and swapped in/out with various instance uses). While I understand that MXNet is set up to manage the dropout state, is there any other motivation to make this switch?
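
For readers less familiar with the MXNet internals involved, a minimal sketch of the permanent per-instance workspace allocation introduced by the commit is shown below; the class and member names are illustrative placeholders, not the actual rnn-inl.h code (the per-call TempSpace alternative is sketched later in the thread):

    #include <mxnet/base.h>      // Context
    #include <mxnet/storage.h>   // Storage::Get()->Alloc / Free

    // Illustrative sketch only: after the commit, the cuDNN RNN workspace
    // (size reported by cudnnGetRNNWorkspaceSize) is held for the lifetime
    // of the stateful operator instead of being borrowed per call.
    class CuDNNRNNOpSketch {
     public:
      void InitWorkspace(size_t workspace_bytes, int dev_id) {
        workspace_bytes_ = workspace_bytes;
        workspace_ = mxnet::Storage::Get()->Alloc(workspace_bytes_,
                                                  mxnet::Context::GPU(dev_id));
      }
      ~CuDNNRNNOpSketch() {
        if (workspace_bytes_ != 0) mxnet::Storage::Get()->Free(workspace_);
      }
      void* workspace_ptr() const { return workspace_.dptr; }

     private:
      mxnet::Storage::Handle workspace_;
      size_t workspace_bytes_ = 0;
    };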

When the test fails, the non-fused model output is random garbage. Supporting the notion that a race condition exists, the test failures go away when a waitall() is inserted in the test_operator.py function check_rnn_consistency:

    dy = mx.random.uniform(shape=mod1.get_outputs()[0].shape)
    mod1.backward(out_grads=[dy])
    mx.nd.waitall()
    mod2.backward(out_grads=[dy])

@ptrendx @eric-haibin-lin, I'd like to see this resolved by the 1.5 code freeze.

Environment info (Required)

----------Python Info----------
Version      : 3.5.2
Compiler     : GCC 5.4.0 20160609
Build        : ('default', 'Nov 12 2018 13:43:14')
Arch         : ('64bit', 'ELF')
------------Pip Info-----------
Version      : 19.1.1
Directory    : /usr/local/lib/python3.5/dist-packages/pip
----------MXNet Info-----------
Version      : 1.5.0
Directory    : /opt/mxnet/python/mxnet
Hashtag not found. Not installed from pre-built package.
----------System Info----------
Platform     : Linux-4.4.0-36-generic-x86_64-with-Ubuntu-16.04-xenial
system       : Linux
node         : 636ace361501
release      : 4.4.0-36-generic
version      : #55~14.04.1-Ubuntu SMP Fri Aug 12 11:49:30 UTC 2016
----------Hardware Info----------
machine      : x86_64
processor    : x86_64
Architecture:          x86_64
CPU op-mode(s):        32-bit, 64-bit
Byte Order:            Little Endian
CPU(s):                40
On-line CPU(s) list:   0-39
Thread(s) per core:    2
Core(s) per socket:    10
Socket(s):             2
NUMA node(s):          2
Vendor ID:             GenuineIntel
CPU family:            6
Model:                 63
Model name:            Intel(R) Xeon(R) CPU E5-2660 v3 @ 2.60GHz
Stepping:              2
CPU MHz:               2902.046
CPU max MHz:           3300.0000
CPU min MHz:           1200.0000
BogoMIPS:              5201.48
Virtualization:        VT-x
Hypervisor vendor:     vertical
Virtualization type:   full
L1d cache:             32K
L1i cache:             32K
L2 cache:              256K
L3 cache:              25600K
NUMA node0 CPU(s):     0-9,20-29
NUMA node1 CPU(s):     10-19,30-39
Flags:                 fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush dts acpi mmx fxsr sse sse2 ss ht tm pbe syscall nx pdpe1gb rdtscp lm constant_tsc arch_perfmon pebs bts rep_good nopl xtopology nonstop_tsc aperfmperf eagerfpu pni pclmulqdq dtes64 monitor ds_cpl vmx smx est tm2 ssse3 sdbg fma cx16 xtpr pdcm pcid dca sse4_1 sse4_2 x2apic movbe popcnt tsc_deadline_timer aes xsave avx f16c rdrand lahf_lm abm epb tpr_shadow vnmi flexpriority ept vpid fsgsbase tsc_adjust bmi1 avx2 smep bmi2 erms invpcid cqm xsaveopt cqm_llc cqm_occup_llc dtherm ida arat pln pts

Error Message:

Two example outputs are shown:

======================================================================
FAIL: test_operator_gpu.test_rnntanh_bidirectional
----------------------------------------------------------------------
Traceback (most recent call last):
  File "/usr/local/lib/python3.5/dist-packages/nose/case.py", line 198, in runTest
    self.test(*self.arg)
  File "/usr/local/lib/python3.5/dist-packages/nose/util.py", line 620, in newfunc
    return func(*arg, **kw)
  File "/opt/mxnet/tests/python/gpu/../unittest/common.py", line 177, in test_new
    orig_test(*args, **kwargs)
  File "/opt/mxnet/tests/python/gpu/../unittest/common.py", line 110, in test_new
    orig_test(*args, **kwargs)
  File "/opt/mxnet/tests/python/gpu/../unittest/test_operator.py", line 182, in test_rnntanh_bidirectional
    check_rnn_consistency(fused, stack, T, N, I, H, 'add')
  File "/opt/mxnet/tests/python/gpu/../unittest/test_operator.py", line 69, in check_rnn_consistency
    assert_allclose(mod1.get_input_grads()[0].asnumpy(), mod2.get_input_grads()[0].asnumpy(), rtol=rtol, atol=atol)
  File "/usr/local/lib/python3.5/dist-packages/numpy/testing/utils.py", line 1395, in assert_allclose
    verbose=verbose, header=header, equal_nan=equal_nan)
  File "/usr/local/lib/python3.5/dist-packages/numpy/testing/utils.py", line 778, in assert_array_compare
    raise AssertionError(msg)
AssertionError: 
Not equal to tolerance rtol=0.01, atol=0.0001

(mismatch 99.9725%)
 x: array([[[ 0.122356,  0.663351,  0.721616, ...,  0.300692,  0.809006,
          0.190476],
        [ 0.063241,  0.969914,  0.543127, ...,  0.040564,  0.695362,...
 y: array([[[-0.036576,  0.022715, -0.00182 , ...,  0.014202, -0.042219,
          0.026592],
        [-0.008018,  0.020907,  0.006875, ..., -0.017068, -0.04306 ,...

Second failure: note that the difference is in the second (unfused) RNN model only.

======================================================================
FAIL: test_operator_gpu.test_rnntanh_bidirectional
----------------------------------------------------------------------
Traceback (most recent call last):
  File "/usr/local/lib/python3.5/dist-packages/nose/case.py", line 198, in runTest
    self.test(*self.arg)
  File "/usr/local/lib/python3.5/dist-packages/nose/util.py", line 620, in newfunc
    return func(*arg, **kw)
  File "/opt/mxnet/tests/python/gpu/../unittest/common.py", line 177, in test_new
    orig_test(*args, **kwargs)
  File "/opt/mxnet/tests/python/gpu/../unittest/common.py", line 110, in test_new
    orig_test(*args, **kwargs)
  File "/opt/mxnet/tests/python/gpu/../unittest/test_operator.py", line 182, in test_rnntanh_bidirectional
    check_rnn_consistency(fused, stack, T, N, I, H, 'add')
  File "/opt/mxnet/tests/python/gpu/../unittest/test_operator.py", line 69, in check_rnn_consistency
    assert_allclose(mod1.get_input_grads()[0].asnumpy(), mod2.get_input_grads()[0].asnumpy(), rtol=rtol, atol=atol)
  File "/usr/local/lib/python3.5/dist-packages/numpy/testing/utils.py", line 1395, in assert_allclose
    verbose=verbose, header=header, equal_nan=equal_nan)
  File "/usr/local/lib/python3.5/dist-packages/numpy/testing/utils.py", line 778, in assert_array_compare
    raise AssertionError(msg)
AssertionError: 
Not equal to tolerance rtol=0.01, atol=0.0001

(mismatch 99.96375%)
 x: array([[[ 0.122356,  0.663351,  0.721616, ...,  0.300692,  0.809006,
          0.190476],
        [ 0.063241,  0.969914,  0.543127, ...,  0.040564,  0.695362,...
 y: array([[[  8.044541e-03,   5.417631e-02,   3.945356e-02, ...,
          -4.552861e-03,  -1.618103e-02,   7.161065e-02],
        [ -8.904358e-03,   3.195144e-02,   2.073918e-02, ...,...

Steps to reproduce

MXNET_TEST_COUNT=1000 MXNET_TEST_SEED=42 nosetests --verbose -s --logging-level=DEBUG tests/python/gpu/test_operator_gpu.py:test_rnntanh_bidirectional

What have you tried to solve it?

See above discussion.

@mxnet-label-bot
Contributor

Hey, this is the MXNet Label Bot.
Thank you for submitting the issue! I will try and suggest some labels so that the appropriate MXNet community members can help resolve it.
Here are my recommended labels: Test

@pengzhao-intel
Contributor

Refactoring legacy code is not easy, and we have taken on these tasks to move MXNet forward.
Since we are not very familiar with the GPU code, @DickJC123, would you mind looking into the details and fixing the issue? We can take a look at the CPU part if anything is necessary.

@roywei
Member

roywei commented May 22, 2019

@pengzhao-intel It seems there is another refactor in #14713. Will it cause more behavior similar to this issue?
Maybe we should evaluate the impact before further refactoring.

cc @szha @eric-haibin-lin

@TaoLv
Member

TaoLv commented May 23, 2019

@DickJC123 I can see many discussions about resources and storage in #14476. Could you please take a look at them? I hope they address some of your questions. Thanks.

@DickJC123
Contributor Author

I experimented with reverting the workspace and dropout state handling changes of commit 1c49e40 to the original approach. This has eliminated the failures seen on the P40. I will be making a PR for this tomorrow.

@szha
Member

szha commented May 23, 2019

While I understand that MXNet is set up to manage the dropout state, is there any other motivation to make this switch?

Setting the cuDNN dropout descriptor is very slow if the dropout state space is not reused.
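
For context, here is a rough sketch of why that reuse matters, calling the cuDNN dropout API directly (error checking omitted; the helper function is hypothetical):

    #include <cudnn.h>

    // cudnnSetDropoutDescriptor initializes the RNG states in `states`, which
    // is expensive; cudnnRestoreDropoutDescriptor (available since cuDNN 7.0)
    // reattaches an already-initialized state buffer and is cheap. Reusing a
    // context-managed dropout state avoids paying the initialization cost for
    // every RNN op instance.
    void AttachDropoutState(cudnnHandle_t handle, cudnnDropoutDescriptor_t desc,
                            float dropout, void* states, size_t state_bytes,
                            unsigned long long seed, bool already_initialized) {
      if (already_initialized) {
        cudnnRestoreDropoutDescriptor(desc, handle, dropout, states, state_bytes, seed);
      } else {
        cudnnSetDropoutDescriptor(desc, handle, dropout, states, state_bytes, seed);
      }
    }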

@DickJC123
Contributor Author

I discovered a new issue with commit #14476: its use of kCuDNNDropoutDesc and the cuDNN API that supports it is only possible starting with cuDNN 7.0. From this commit onward, MXNet master no longer compiles against cuDNN 6.0. I will work up a PR that uses the new dropout state handling approach only when it is available, thus eliminating the inadvertent 'breaking change.'
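
A sketch of the kind of guard being described, assuming the usual MXNET_USE_CUDNN build flag and the CUDNN_VERSION macro from cudnn.h (the surrounding code is illustrative, not the actual PR):

    #if MXNET_USE_CUDNN == 1 && CUDNN_VERSION >= 7000
      // cuDNN 7.0+: use the context-managed dropout state (kCuDNNDropoutDesc)
      // and cudnnRestoreDropoutDescriptor to reuse it across op instances.
    #else
      // Older cuDNN: fall back to the pre-#14476 per-instance dropout state so
      // the operator still builds against cuDNN 6.x.
    #endif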

@szha
Member

szha commented May 23, 2019

cuDNN support for specific versions has never been promised in MXNet. The only CUDA version that required compiling with cuDNN 6.0 was CUDA 7.5, which is no longer supported in MXNet.

@DickJC123
Contributor Author

I found that moving the RNN workspace from a permanent per-instance allocation back to the prior (and typical) approach of using the TempSpace resource corrected the test flakiness on the P40. I won't touch the dropout descriptor handling, nor the fact that it implicitly forces MXNet users to cuDNN v7.0. PR shortly.
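
For illustration, the per-call TempSpace pattern that the fix restores looks roughly like the sketch below; the resource index and function name are placeholders, and it assumes the operator declares a kTempSpace resource request:

    #include <mshadow/tensor.h>
    #include <mxnet/op_attr_types.h>  // OpContext
    #include <mxnet/resource.h>       // Resource::get_space_typed

    // Illustrative sketch: the cuDNN workspace is borrowed from the
    // engine-managed TempSpace for the duration of the call, so reuse of the
    // buffer is serialized with the other operators that share it.
    void RNNForwardSketch(const mxnet::OpContext& ctx, size_t workspace_bytes,
                          mshadow::Stream<mshadow::gpu>* s) {
      mshadow::Tensor<mshadow::gpu, 1, char> workspace =
          ctx.requested[0].get_space_typed<mshadow::gpu, 1, char>(
              mshadow::Shape1(workspace_bytes), s);
      // ... pass workspace.dptr_ and workspace_bytes to cudnnRNNForwardTraining ...
    }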
