MKLDNN softmax outputs NaN in mkldnn 0.14 #13141

azai91 · 2018-11-06T20:03:04Z

Description

Extremely negative softmax inputs produce NaN outputs. This bug has already been reported in MKLDNN (oneapi-src/oneDNN#106) and fixed (https://gist.github.com/emfomenk/0386c529c5df21ae308b00d16454c48e) in MKLDNN v0.15+ (we are on v0.14).

The fix is to either:

patch MKLDNN v0.14 with the earlier fix, or
upgrade the MKLDNN version in mxnet (Update MKL-DNN dependency #12953).
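
For illustration only, here is a minimal NumPy sketch of one common way a softmax can produce NaN on extremely negative inputs, next to the usual max-subtraction remedy. The function names `naive_softmax` and `stable_softmax` are hypothetical, and this is not claimed to be the exact defect or fix in MKLDNN; the linked issue and gist are the authority on that.

```python
import numpy as np

def naive_softmax(x, axis=-1):
    # exp() of a hugely negative value underflows to 0, so the sum along
    # the axis can be 0 and the division becomes 0/0 = NaN.
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def stable_softmax(x, axis=-1):
    # Subtracting the per-axis maximum makes the largest exponent 0,
    # so at least one term is exp(0) = 1 and the sum is never 0.
    shifted = x - x.max(axis=axis, keepdims=True)
    e = np.exp(shifted)
    return e / e.sum(axis=axis, keepdims=True)

# Same shape, values, and axis as the reproduction script in this issue.
x = np.array([[[[-1e30, -1e30]]]], dtype=np.float32)

with np.errstate(invalid="ignore"):  # silence the 0/0 warning
    naive = naive_softmax(x, axis=1)
stable = stable_softmax(x, axis=1)

print(naive)
print(stable)
```

Because axis 1 has length 1 in this input, a correct softmax returns 1 for every element, which is what the stable variant yields.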

Environment info (Required)

ubuntu@ip-172-31-3-217:~$ python diagnose.py
----------Python Info----------
Version      : 3.6.4
Compiler     : GCC 7.2.0
Build        : ('default', 'Jan 16 2018 18:10:19')
Arch         : ('64bit', '')
------------Pip Info-----------
Version      : 9.0.1
Directory    : /home/ubuntu/anaconda3/lib/python3.6/site-packages/pip
----------MXNet Info-----------
/home/ubuntu/anaconda3/lib/python3.6/site-packages/h5py/__init__.py:36: FutureWarning: Conversion of the second argument of issubdtype from `float` to `np.floating` is deprecated. In future, it will be treated as `np.float64 == np.dtype(float).type`.
  from ._conv import register_converters as _register_converters
Version      : 1.3.0
Directory    : /home/ubuntu/anaconda3/lib/python3.6/site-packages/mxnet
Commit Hash   : b3be92f4a48bce62a5a8424271871c2f81c8f7f1
----------System Info----------
Platform     : Linux-4.4.0-1065-aws-x86_64-with-debian-stretch-sid
system       : Linux
node         : ip-172-31-3-217
release      : 4.4.0-1065-aws
version      : #75-Ubuntu SMP Fri Aug 10 11:14:32 UTC 2018
----------Hardware Info----------
machine      : x86_64
processor    : x86_64
Architecture:          x86_64
CPU op-mode(s):        32-bit, 64-bit
Byte Order:            Little Endian
CPU(s):                72
On-line CPU(s) list:   0-71
Thread(s) per core:    2
Core(s) per socket:    18
Socket(s):             2
NUMA node(s):          2
Vendor ID:             GenuineIntel
CPU family:            6
Model:                 85
Model name:            Intel(R) Xeon(R) Platinum 8124M CPU @ 3.00GHz
Stepping:              3
CPU MHz:               3000.000
BogoMIPS:              6000.00
Hypervisor vendor:     KVM
Virtualization type:   full
L1d cache:             32K
L1i cache:             32K
L2 cache:              1024K
L3 cache:              25344K
NUMA node0 CPU(s):     0-17,36-53
NUMA node1 CPU(s):     18-35,54-71
Flags:                 fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush mmx fxsr sse sse2 ss ht syscall nx pdpe1gb rdtscp lm constant_tsc arch_perfmon rep_good nopl xtopology nonstop_tsc aperfmperf eagerfpu pni pclmulqdq monitor ssse3 fma cx16 pcid sse4_1 sse4_2 x2apic movbe popcnt tsc_deadline_timer aes xsave avx f16c rdrand hypervisor lahf_lm abm 3dnowprefetch invpcid_single kaiser fsgsbase tsc_adjust bmi1 hle avx2 smep bmi2 erms invpcid rtm mpx avx512f rdseed adx smap clflushopt clwb avx512cd xsaveopt xsavec xgetbv1 ida arat
----------Network Test----------
Setting timeout: 10
Timing for MXNet: https://github.com/apache/incubator-mxnet, DNS: 0.0012 sec, LOAD: 0.4806 sec.
Timing for Gluon Tutorial(en): http://gluon.mxnet.io, DNS: 0.1717 sec, LOAD: 0.5293 sec.
Timing for Gluon Tutorial(cn): https://zh.gluon.ai, DNS: 0.1596 sec, LOAD: 0.3734 sec.
Timing for FashionMNIST: https://apache-mxnet.s3-accelerate.dualstack.amazonaws.com/gluon/dataset/fashion-mnist/train-labels-idx1-ubyte.gz, DNS: 0.0262 sec, LOAD: 0.1173 sec.
Timing for PYPI: https://pypi.python.org/pypi/pip, DNS: 0.0013 sec, LOAD: 0.3264 sec.
Timing for Conda: https://repo.continuum.io/pkgs/free/, DNS: 0.0118 sec, LOAD: 0.0690 sec.

Package used (Python/R/Scala/Julia):
Python

For Scala user, please provide:

Java version: (java -version)
Maven version: (mvn -version)
Scala runtime if applicable: (scala -version)

For R user, please provide R sessionInfo():

Build info (Required if built from source)

Compiler (gcc/clang/mingw/visual studio):

MXNet commit hash:
6b5d9f9

Build config:
MKLDNN (pip install mxnet-mkl)

Error Message:

ubuntu@ip-172-31-3-217:~/incubator-mxnet$ python tt.py
/home/ubuntu/anaconda3/lib/python3.6/site-packages/h5py/__init__.py:36: FutureWarning: Conversion of the second argument of issubdtype from `float` to `np.floating` is deprecated. In future, it will be treated as `np.float64 == np.dtype(float).type`.
  from ._conv import register_converters as _register_converters
[
[[[[nan nan]]]]
<NDArray 1x1x1x2 @cpu(0)>]

Minimum reproducible example

import mxnet as mx

# Extremely negative inputs trigger the NaN output with MKLDNN v0.14.
input_data = mx.nd.array([[[[-1e30, -1e30]]]])
data = mx.sym.Variable('data')
out1 = data.softmax(axis=1)
# Only 'data' is bound; the graph has no other arguments.
exec1 = out1.bind(mx.cpu(), args={'data': input_data})
exec1.forward()[0].wait_to_read()
print(exec1.outputs)

Steps to reproduce

Run the script above.

What have you tried to solve it?

Applying this one-line fix (https://gist.github.com/emfomenk/0386c529c5df21ae308b00d16454c48e) to mkldnn resolves the issue.

azai91 · 2018-11-06T20:03:15Z

@nswamy

pengzhao-intel · 2018-11-07T05:40:24Z

@azai91 could you help add a test case for this issue after #12953 is merged?

pengzhao-intel · 2018-11-08T04:51:02Z

@azai91 The PR is merged; go ahead and create a test for this issue 👍
Feel free to let me know if you need any help.

@TaoLv

Original: MKL-DNN 0.14 CI: 498e03d

[patric@mlt-skx122 mxnet]$ python soft.py
/home/patric/.local/lib/python2.7/site-packages/h5py/__init__.py:34: FutureWarning: Conversion of the second argument of issubdtype from `float` to `np.floating` is deprecated. In future, it will be treated as `np.float64 == np.dtype(float).type`.
  from ._conv import register_converters as _register_converters
[
[[[[nan nan]]]]
<NDArray 1x1x1x2 @cpu(0)>]

Now: MKL-DNN 0.17 CI: a32fa84

[patric@mlt-skx122 mxnet]$ python soft.py
/home/patric/.local/lib/python2.7/site-packages/h5py/__init__.py:34: FutureWarning: Conversion of the second argument of issubdtype from `float` to `np.floating` is deprecated. In future, it will be treated as `np.float64 == np.dtype(float).type`.
  from ._conv import register_converters as _register_converters
[
[[[[1. 1.]]]]
<NDArray 1x1x1x2 @cpu(0)>]
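
A regression test for this issue could look roughly like the sketch below. It is a hedged illustration, not the test that was eventually merged: `reference_softmax` and `check_softmax_large_magnitude` are hypothetical helpers, and the check takes any callable `softmax_fn(x, axis)` so that it runs here against a NumPy reference without MXNet installed.

```python
import numpy as np

def reference_softmax(x, axis=-1):
    # Numerically stable reference: shift by the per-axis maximum first.
    shifted = x - np.max(x, axis=axis, keepdims=True)
    e = np.exp(shifted)
    return e / np.sum(e, axis=axis, keepdims=True)

def check_softmax_large_magnitude(softmax_fn, axis=1):
    # Inputs with very large magnitude, including the case from this issue.
    cases = [
        np.full((1, 1, 1, 2), -1e30, dtype=np.float32),
        np.full((1, 1, 1, 2), 1e30, dtype=np.float32),
        np.array([[[[-1e30, 1e30]]]], dtype=np.float32),
    ]
    for x in cases:
        out = np.asarray(softmax_fn(x, axis))
        # The output must contain no NaN ...
        assert not np.isnan(out).any()
        # ... must sum to 1 along the softmax axis ...
        np.testing.assert_allclose(out.sum(axis=axis), 1.0, rtol=1e-5)
        # ... and must agree with the stable reference.
        np.testing.assert_allclose(out, reference_softmax(x, axis), rtol=1e-5)

# Sanity check: the reference implementation passes its own test.
check_softmax_large_magnitude(reference_softmax)
```

To exercise MXNet itself, one could pass something like `lambda x, axis: mx.nd.softmax(mx.nd.array(x), axis=axis).asnumpy()` as `softmax_fn`, assuming an MKLDNN-enabled build is installed.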

pengzhao-intel · 2018-11-08T05:31:31Z

@mxnet-label-bot [MKLDNN]

marcoabreu added the MKLDNN label Nov 8, 2018

nswamy mentioned this issue Nov 8, 2018: adding unittest for MKLDNN Softmax operator #12884 (Merged)

nswamy added the Bug label Nov 8, 2018

mseth10 mentioned this issue Nov 20, 2018: adding test for softmax operator for inputs with large magnitude #13328 (Merged)

anirudh2290 closed this as completed in #13328 Nov 22, 2018
