This repository has been archived by the owner on Nov 17, 2023. It is now read-only.

When compiled with MKL, fully_connected calls DNNL while dot and batch_dot call MKL #17980

Closed
kpuatamazon opened this issue Apr 6, 2020 · 39 comments

Comments

@kpuatamazon
Contributor

kpuatamazon commented Apr 6, 2020

Problem

Not sure how much we care about MKL support, but to the extent it still appears in the build system, operator support should be consistent.

When compiled with MKL present (MKL is found in /opt/intel), MXNet calls MKL for dot and batch_dot and DNNL for fully_connected. These are all GEMM operators; why is it inconsistent? This is making Sockeye decoding 22% slower (see below).

This inconsistency did not matter much in MXNet 1.5.0 because MKLDNN would delegate to MKL. However, aa1074d upgraded to MKLDNN 1.0, which hid the ability of MKLDNN to delegate to MKL: oneapi-src/oneDNN@3049150 . (MKLDNN has since been renamed DNNL.)

Since MKLDNN only hid support for delegating to MKL, it's possible to restore delegation (see workaround).
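
For context, FullyConnected without a bias and dot with transpose_b=True compute the same GEMM, which is why the inconsistent routing is surprising. A minimal sketch (shapes chosen to match the benchmark script further down; the tolerance is arbitrary):

import mxnet as mx
a = mx.nd.random_uniform(shape=(5, 512), low=-1.0, high=1.0)
w = mx.nd.random_uniform(shape=(512, 512), low=-1.0, high=1.0)
# FullyConnected with no_bias=True computes a * w^T, the same GEMM that dot
# computes with transpose_b=True -- yet one is routed to DNNL (inner_product)
# and the other to MKL (SGEMM).
fc = mx.nd.FullyConnected(a, w, no_bias=True, num_hidden=512)
dt = mx.nd.dot(a, w, transpose_b=True)
assert abs((fc - dt).asnumpy()).max() < 1e-5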

Testing

Tested with MXNet cfb474b. Compiled with mostly-default cmake settings:

cmake -GNinja -DUSE_CUDA=OFF -DCMAKE_BUILD_TYPE=Release ..

Then when I run

export MKL_VERBOSE=1
export MKLDNN_VERBOSE=1
python3
Python 3.6.9 (default, Nov  7 2019, 10:44:02) 
[GCC 8.3.0] on linux
Type "help", "copyright", "credits" or "license" for more information.
>>> import mxnet as mx
Numpy + Intel(R) MKL: THREADING LAYER: (null)
Numpy + Intel(R) MKL: setting Intel(R) MKL to use INTEL OpenMP runtime
Numpy + Intel(R) MKL: preloading libiomp5.so runtime
MKL_VERBOSE Intel(R) MKL 2019.0 Product build 20180829 for Intel(R) 64 architecture Intel(R) Advanced Vector Extensions 512 (Intel(R) AVX-512) enabled processors, Lnx 3.00GHz lp64 intel_thread
MKL_VERBOSE SDOT(2,0x1a0fdc0,1,0x1a0fdc0,1) 1.47ms CNR:OFF Dyn:1 FastMM:1 TID:0  NThr:24
>>> a = mx.nd.ones(shape=(2,2))
>>> mx.nd.FullyConnected(a,a,num_hidden=2,no_bias=True)
dnnl_verbose,info,DNNL v1.1.2 (commit cb2cc7ac17ff4e2ef50805c7048d33256d82be4d)
dnnl_verbose,info,Detected ISA is Intel AVX-512 with Intel DL Boost
dnnl_verbose,exec,cpu,inner_product,gemm:jit,forward_inference,src_f32::blocked:ab:f0 wei_f32::blocked:ab:f0 bia_undef::undef::f0 dst_f32::blocked:ab:f0,,,mb2ic2oc2,74.9971

[[2. 2.]
 [2. 2.]]
<NDArray 2x2 @cpu(0)>
>>> a = mx.nd.ones(shape=(2,2,2))
>>> mx.nd.batch_dot(a,a)
MKL_VERBOSE SGEMM_BATCH(N,N,0x7fc3238b809c,0x7fc3238b80a0,0x7fc3238b80a4,0x7fc3238b80b4,0x7fc228010b90,0x7fc3238b80a8,0x7fc22800f770,0x7fc3238b80ac,0x7fc3238b80b8,0x7fc2280190e0,0x7fc3238b80b0,0x7fc3238b7fc8,0x7 363.79us CNR:OFF Dyn:1 FastMM:1 TID:0  NThr:24

[[[2. 2.]
  [2. 2.]]

 [[2. 2.]
  [2. 2.]]]
>>> mx.nd.dot(a,a)
MKL_VERBOSE SGEMM(N,N,4,4,2,0x7fc3238b8198,0x7fc2280043c0,4,0x7fc2280043c0,2,0x7fc3238b81a0,0x7fc228004580,4) 8.52us CNR:OFF Dyn:1 FastMM:1 TID:0  NThr:24

[[[[2. 2.]
   [2. 2.]]

  [[2. 2.]
   [2. 2.]]]


 [[[2. 2.]
   [2. 2.]]

  [[2. 2.]
   [2. 2.]]]]
<NDArray 2x2x2x2 @cpu(0)>

You can see DNNL is called for FullyConnected while MKL is called for dot and batch_dot.

Performance impact

I timed Sockeye decoding. Commit aa1074d made decoding 22% slower (416.878s up from 342.037s for b5d07e3) even with MKL installed in /opt/intel/.

| Commit | Compilation | Time (s) |
|---|---|---|
| b5d07e3 (before MKLDNN 1.0 change) | Default | 342.037 |
| aa1074d (MKLDNN 1.0 change) | Default | 416.878 |
| aa1074d (MKLDNN 1.0 change) | Workaround | 343.706 |
| cfb474b (recent) | Default | 385.587 |
| cfb474b (recent) | Workaround | 312.509 |

(Default compilation is cmake -GNinja -DUSE_CUDA=OFF -DCMAKE_BUILD_TYPE=Release ..; workaround compilation is below.)

Tested on a Skylake Xeon, c5.9xlarge Intel(R) Xeon(R) Platinum 8124M CPU @ 3.00GHz with OMP_NUM_THREADS=4.

Workaround

Since DNNL hid support for delegating to MKL, it's still possible to turn delegation back on.

cmake -GNinja -DUSE_CUDA=OFF -DCMAKE_BUILD_TYPE=Release -D_DNNL_USE_MKL=FULL -DMKLINC=/opt/intel/mkl/include ..

which compiles but then triggers a link error at runtime: OSError: /home/ubuntu/mxnet/build/3rdparty/mkldnn/src/libmkldnn.so.1: undefined symbol: cblas_gemm_s8u8s32_pack
So I kludged it with export LD_PRELOAD=/opt/intel/mkl/lib/intel64/libmkl_rt.so and was then able to use MXNet at runtime. There's probably a cleaner way of fixing the linkage.
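
As a diagnostic (not a fix), one can check from Python whether the missing symbol becomes resolvable once MKL is loaded globally. This is a sketch assuming the default /opt/intel install path; it is roughly equivalent in effect to the LD_PRELOAD kludge:

import ctypes
# Load MKL with RTLD_GLOBAL so its symbols are visible to libraries dlopen'ed later,
# similar in effect to LD_PRELOAD=libmkl_rt.so.
mkl = ctypes.CDLL("/opt/intel/mkl/lib/intel64/libmkl_rt.so", mode=ctypes.RTLD_GLOBAL)
print(hasattr(mkl, "cblas_gemm_s8u8s32_pack"))  # True if MKL exports the symbol
import mxnet as mx  # libmkldnn.so.1 should now be able to resolve the symbol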

Recommended fix

When compiled with MKL, MXNet should call MKL directly from FullyConnected like it already does for dot and batch_dot.

@kpuatamazon kpuatamazon added the Bug label Apr 6, 2020
@TaoLv
Member

TaoLv commented Apr 6, 2020

Thanks for raising the issue, @kpuatamazon. We plan to optimize the dot and batch_dot operators with the DNNL MatMul primitive. But please note that the performance of the MatMul primitive was not optimized until v1.3, which was released last week. That's why we didn't integrate the primitive when it was first introduced in v1.2.

When compiled with MKL, MXNet should call MKL directly from FullyConnected like it already does for dot and batch_dot.

As mentioned above, dot and batch_dot will also be optimized with DNNL. And as DNNL is more dedicated to deep learning, we will give DNNL higher priority when both DNNL and MKL are enabled at compile time.

Please provide a simple reproducer if you find that FullyConnected based on DNNL is slower than the one based on MKL BLAS.

@kpuatamazon
Contributor Author

kpuatamazon commented Apr 6, 2020

Here are some benchmarks on 2fff11d (current tip of master) with the same Skylake c5.9xlarge. DNNL is substantially slower than MKL.

(Which DNNL is in master?)

#!/usr/bin/env python3
import mxnet as mx
import time

def time_procedure(shape, count, proc):
  rows, inner, cols = shape
  a = mx.nd.random_uniform(shape=(rows, inner), low=-1.0, high=1.0)
  b = mx.nd.random_uniform(shape=(cols, inner), low=-1.0, high=1.0)
  # Burn in
  proc(a, b, cols)
  mx.nd.waitall()
  begin = time.time()
  for i in range(0, count):
    proc(a, b, cols)
    mx.nd.waitall()
  return (time.time() - begin) / count

shapes = [(5, 512, 512), (5,512,1536), (5,512,2048), (5,2048,512), (4,512,512)]
count = 1000

procedures = {
  "fullyconnected (DNNL)" : (lambda a, b, cols : mx.nd.FullyConnected(a, b, no_bias=True, num_hidden=cols)),
  "dot (MKL)" : (lambda a, b, cols : mx.nd.dot(a, b, transpose_b = True))
}
for s in shapes:
  print("Shape " + str(s))
  stats = {}
  for name, l in procedures.items():
    stats[name] = time_procedure(s, count, l)
    print("{:.7f} seconds for {}".format(stats[name], name))

Run as OMP_NUM_THREADS=4 ./mult_bench.py:

Shape (5, 512, 512)
0.0000961 seconds for fullyconnected (DNNL)
0.0000509 seconds for dot (MKL)
Shape (5, 512, 1536)
0.0002011 seconds for fullyconnected (DNNL)
0.0000735 seconds for dot (MKL)
Shape (5, 512, 2048)
0.0002521 seconds for fullyconnected (DNNL)
0.0001027 seconds for dot (MKL)
Shape (5, 2048, 512)
0.0003569 seconds for fullyconnected (DNNL)
0.0001018 seconds for dot (MKL)
Shape (4, 512, 512)
0.0000946 seconds for fullyconnected (DNNL)
0.0000496 seconds for dot (MKL)

I don't really mind what the default BLAS implementation is. But choosing to use MKL should be 1) atomic and 2) not require undocumented compilation options.

@kpuatamazon
Contributor Author

I also tried the latest DNNL:

cd 3rdparty
rm -rf mkldnn
git clone https://github.com/oneapi-src/oneDNN mkldnn
cd ../build && ninja -j 32

and DNNL is still slower:

Shape (5, 512, 512)
0.0000875 seconds for fullyconnected (DNNL)
0.0000507 seconds for dot (MKL)
Shape (5, 512, 1536)
0.0002005 seconds for fullyconnected (DNNL)
0.0000729 seconds for dot (MKL)
Shape (5, 512, 2048)
0.0002516 seconds for fullyconnected (DNNL)
0.0000974 seconds for dot (MKL)
Shape (5, 2048, 512)
0.0003564 seconds for fullyconnected (DNNL)
0.0000981 seconds for dot (MKL)
Shape (4, 512, 512)
0.0000928 seconds for fullyconnected (DNNL)
0.0000497 seconds for dot (MKL)

@TaoLv
Member

TaoLv commented Apr 6, 2020

Thank you @kpuatamazon. I will try to reproduce the performance issue and get back to you later.

(Which DNNL is in master?)

It should be v1.2.2 but was downgraded to v1.1.2 by #17084 (maybe by accident).

But choosing to use MKL should be 1) atomic and 2) not require undocumented compilation options.

Not sure I understand what you mean. But using MKL BLAS is documented in:

@kpuatamazon
Contributor Author

kpuatamazon commented Apr 6, 2020

This still uses MKLDNN for FullyConnected

cmake -GNinja -DUSE_CUDA=OFF -DCMAKE_BUILD_TYPE=Release -DBLAS=MKL -DUSE_MKL_IF_AVAILABLE=ON ..

I was able to get MKL to run by disabling MKLDNN entirely:

cmake -GNinja -DUSE_CUDA=OFF -DCMAKE_BUILD_TYPE=Release -DBLAS=MKL -DUSE_MKL_IF_AVAILABLE=ON -DUSE_MKLDNN=OFF ..

But then I lose other kernels like DNNL softmax and actually Sockeye randomly crashes:

mxnet.base.MXNetError: MXNetError: Out of range value for value, value='inf', in operator _full(name="", dtype="float32", value="inf", ctx="cpu(0)", shape="(5, 1)")

(bizarrely this happens after it translated many sentences with that same code.)

How do I achieve the fastest combination of DNNL softmax and MKL matrix multiply for FullyConnected using only documented options?

@TaoLv
Member

TaoLv commented Apr 6, 2020

Yes, USE_MKLDNN is ON by default on Linux. See https://github.com/apache/incubator-mxnet/blob/master/CMakeLists.txt#L44. If you want to disable DNNL optimizations, you need to set USE_MKLDNN to OFF explicitly on the cmake command line.

How do I achieve the fastest combination of DNNL softmax and MKL matrix multiply for FullyConnected using only documented options?

I don't think it's possible at this moment. I think the real question is how to improve the performance of FullyConnected when DNNL is used. In fact, I see the performance of the DNNL primitive kernel is good compared with the output of the script. Take the last shape as an example (on my machine, not EC2):

dnnl_verbose,exec,cpu,inner_product,gemm:jit,forward_inference,src_f32::blocked:ab:f0 wei_f32::blocked:ab:f0 bia_undef::undef::f0 dst_f32::blocked:ab:f0,,,mb4ic512oc512,0.0490723
0.0001380 seconds for fullyconnected (DNNL)
0.0000556 seconds for dot (MKL)

You can set DNNL_VERBOSE=1 when running the script to get the verbose output.
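
If it's more convenient to do this from inside the script, a minimal sketch that should be equivalent (setting the variables before mxnet is imported is the safe order):

import os
os.environ["MKL_VERBOSE"] = "1"
os.environ["DNNL_VERBOSE"] = "1"   # MKLDNN_VERBOSE=1 for older MKL-DNN builds
import mxnet as mx                 # import after setting the environment
out = mx.nd.FullyConnected(mx.nd.ones((2, 2)), mx.nd.ones((2, 2)),
                           no_bias=True, num_hidden=2)
out.wait_to_read()                 # force execution so the verbose line is printed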

@kpuatamazon
Contributor Author

kpuatamazon commented Apr 6, 2020

I don't think it's possible at this moment.

Can we please have a build option to force MKL usage for all GEMM calls? This is a pretty big performance regression.

I think the real question is how to improve the performance of FullyConnected when DNNL is used.

Hmmm, when DNNL delegates to MKL, that happens internally to DNNL, right? If the MXNet wrapper is the problem, then why is delegating to MKL much faster?

Applying my workaround:

cmake -GNinja -DUSE_CUDA=OFF -DCMAKE_BUILD_TYPE=Release -D_DNNL_USE_MKL=FULL -DMKLINC=/opt/intel/mkl/include ..
ninja -j 30
export LD_PRELOAD=/opt/intel/mkl/lib/intel64/libmkl_rt.so
export OMP_NUM_THREADS=4
./mult_bench.py #NB I edited the script to label things more clearly.  

Output:

Shape (5, 512, 512)
0.0000628 seconds for fullyconnected (MKL workaround)
0.0000488 seconds for dot (MKL)
Shape (5, 512, 1536)
0.0000816 seconds for fullyconnected (MKL workaround)
0.0000661 seconds for dot (MKL)
Shape (5, 512, 2048)
0.0001109 seconds for fullyconnected (MKL workaround)
0.0000957 seconds for dot (MKL)
Shape (5, 2048, 512)
0.0001078 seconds for fullyconnected (MKL workaround)
0.0000954 seconds for dot (MKL)
Shape (4, 512, 512)
0.0000501 seconds for fullyconnected (MKL workaround)
0.0000434 seconds for dot (MKL)

There is something less efficient about FullyConnected than dot but it's nowhere near explaining DNNL's slowdown.

FYI I usually wear this hat (as opposed to @kpu) on Mondays so expect slower responses on other days.

@TaoLv
Member

TaoLv commented Apr 7, 2020

@emfomenk What's -D_DNNL_USE_MKL=FULL for? I'm not aware of this option and guess it's not a user-facing option.

@emfomenk

emfomenk commented Apr 7, 2020

@TaoLv, it is a developer option to bring back the Intel MKL gemm (instead of the jitted one). I would suggest avoiding it, as it is an internal option (marked with a leading underscore). I think it was even broken somewhere between v1.1 and v1.3...

@TaoLv
Member

TaoLv commented Apr 7, 2020

@emfomenk Thanks for the explanation. Do you think that enabling this option can improve the performance of the inner product primitive for the above cases?

@pengzhao-intel
Contributor

@bartekkuncer please help take a look at this issue as well

@emfomenk

emfomenk commented Apr 7, 2020

Do you think that enabling this option can improve the performance of the inner product primitive for the above cases?

@TaoLv, yes, it could. Two reasons for that:

  1. If Intel MKL is linked, then most likely libiomp5.so will be used instead of libgomp.so. So, if you see a performance difference, I would suggest testing Intel MKL-DNN v1.0 / DNNL v1.1+ with LD_PRELOAD=libiomp5.so, just to confirm whether or not the issue is due to the OpenMP runtime.
  2. A performance bug in Intel MKL-DNN v1.0 / DNNL v1.1+. We have definitely fixed a few performance issues in the library since v1.0.

I would also suggest trying to upgrade the library to the most recent version and checking the performance.
I strongly suggest not using this developer option. It is unsupported, not tested, and, as I said, most likely broken in subsequent releases.

@kpu

kpu commented Apr 7, 2020

If Intel MKL is linked

MKL was linked in both cases and in fact called in both cases from dot, just not from FullyConnected.

libiomp5.so will be used instead of libgomp.so

I tried single-threaded (OMP_NUM_THREADS=1) and still saw a performance drop. There shouldn't be much difference between OMP libraries then, right?

I would also suggest to try upgrading the library to the most recent version

See my post above "I also tried the latest DNNL". Performance was unchanged.

@emfomenk

emfomenk commented Apr 7, 2020

I tried single-threaded (OMP_NUM_THREADS=1) and still saw a performance drop. There shouldn't be much difference between OMP libraries then right?

Yeah, there should be zero difference in this case :)

I would also suggest to try upgrading the library to the most recent version

See my post above "I also tried the latest DNNL". Performance was unchanged.

Didn't see that, thanks! Yeah, it looks like a performance bug indeed.

@pengzhao-intel
Contributor

Thanks @emfomenk & @TaoLv for identifying the issue. Our team will follow up and switch more operators to DNNL soon.

@kpuatamazon
Contributor Author

Single-threaded benchmarks (OMP_NUM_THREADS=1) to confirm that it's not a difference in OMP library:

| Shape | dot (MKL), s | FullyConnected (MKL workaround), s | FullyConnected (DNNL), s | DNNL slowdown |
|---|---|---|---|---|
| 5, 512, 512 | 0.0000914 | 0.0001016 | 0.0002448 | 2.40x |
| 5, 512, 1536 | 0.0002939 | 0.0003051 | 0.0006576 | 2.15x |
| 5, 512, 2048 | 0.0003828 | 0.0003919 | 0.0008582 | 2.18x |
| 5, 2048, 512 | 0.0003681 | 0.0003785 | 0.0014638 | 3.86x |
| 4, 512, 512 | 0.0000917 | 0.0001038 | 0.0002364 | 2.27x |

@emfomenk

Summoning @aaraujom

@kpuatamazon
Contributor Author

kpuatamazon commented Apr 13, 2020

@TaoLv Regarding your comment

how to improve the performance of FullyConnected when DNNL is used. In fact, I see the performance of the DNNL primitive kernel is good compared with the output of the script.

I ran with both MKL and DNNL verbose options: OMP_NUM_THREADS=1 MKL_VERBOSE=1 DNNL_VERBOSE=1

verbose.txt

To summarize the verbose output in microseconds:

| Shape | MKL raw (µs) | DNNL raw (µs) | DNNL raw slowdown | MKL Python overhead (µs) | DNNL Python overhead (µs) |
|---|---|---|---|---|---|
| 5, 512, 512 | 55.71 | 213.19 | 3.83x | 53.19 | 61.61 |
| 5, 512, 1536 | 260.18 | 636.10 | 2.44x | 55.82 | 62.90 |
| 5, 512, 2048 | 345.66 | 844.64 | 2.44x | 56.74 | 65.66 |
| 5, 2048, 512 | 328.68 | 1346.32 | 4.10x | 56.62 | 187.78 |
| 4, 512, 512 | 53.75 | 202.27 | 3.76x | 51.45 | 60.03 |

Raw is defined as the time reported by the verbose output; note that I converted the DNNL times from milliseconds to microseconds.
Python overhead is defined as the time reported by the above Python script (converted to microseconds) minus the raw time. Note that MKL used dot whilst DNNL used FullyConnected.

In both cases, I excluded the first call for each shape to allow for JIT/burn-in.
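
For reference, the raw kernel times can be scraped from the verbose output with something like the sketch below; the field positions are assumptions based on the log lines quoted earlier in this thread:

import re

def kernel_time_us(line):
    """Return the kernel time in microseconds for one verbose line, else None."""
    if line.startswith("MKL_VERBOSE SGEMM"):
        # e.g. "MKL_VERBOSE SGEMM(N,N,...) 8.52us CNR:OFF ..." (also matches SGEMM_BATCH)
        m = re.search(r" ([0-9.]+)(us|ms) ", line)
        if m:
            return float(m.group(1)) * (1000.0 if m.group(2) == "ms" else 1.0)
    if line.startswith("dnnl_verbose,exec"):
        # the last comma-separated field is the primitive time in milliseconds
        return float(line.rsplit(",", 1)[1]) * 1000.0
    return None

raw_us = [t for t in (kernel_time_us(l) for l in open("verbose.txt")) if t is not None]

Summing these per shape and subtracting from the Python-measured time gives the overhead columns above.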

So we really have four issues:

  1. Raw speed as reported by MKL's verbose output is much faster than raw speed reported by DNNL's verbose output. This is definitely a speed problem with DNNL.
  2. No documented option to compile MKLDNN's kernels with MKL's GEMM. In my opinion, when I compile with -DBLAS set to something, it should use that BLAS. You can add an auto option that picks according to the priorities on the website.
  3. It's weird that dot calls MKL but FullyConnected calls DNNL given that they are both GEMMs. I understand @pengzhao-intel is working on this. Though unless 1 and 2 above are fixed, it will only make performance worse.
  4. MXNet / Python overhead is larger for FullyConnected than dot. This is small compared to the DNNL vs MKL difference though. We can also question why there is so much overhead relative to the cost of the multiply.

Intel should fix DNNL but that will take time.

MXNet should make -DBLAS actually choose the BLAS routine both to support a short-term bypass and because it better matches user expectations of what a compilation flag does.

@kpuatamazon
Contributor Author

In case somebody finds this issue and wants their optimized build, here is a different workaround that removes the need for LD_PRELOAD. Just do this before running cmake the first time:

export CXXFLAGS="${CXXFLAGS} -DUSE_MKL -I/opt/intel/mkl/include"

Then cmake can be run normally:

cmake -GNinja -DUSE_CUDA=OFF -DCMAKE_BUILD_TYPE=Release ..

and the compiled MXNet can be run normally without any special environment variables.

To be clear, the above kludge is an undocumented abomination. The problem with cmake -D_DNNL_USE_MKL=FULL -DMKLINC=/opt/intel/mkl/include is that it links against MKL twice. DNNL is hard-coded to link dynamically against libmkl_rt.so:
https://github.com/oneapi-src/oneDNN/blob/1b05a28eb9666efef83b281e4cc1936db5e6cf6c/cmake/MKL.cmake#L64
Then MXNet also links statically against MKL, and the dlopen still wants the shared library. So we should take the shared library out of the picture. How? By not going through DNNL's build file at all: the CXXFLAGS kludge sets by hand what the rest of that build file would set, specifically https://github.com/oneapi-src/oneDNN/blob/1b05a28eb9666efef83b281e4cc1936db5e6cf6c/cmake/MKL.cmake#L65-L67

@leezu
Contributor

leezu commented May 4, 2020

Does DNNL always link to MKL by default if MKL is available on the system, just as MXNet links to MKL in that situation?

Further, regarding the double-linking issue, #17794 should fix it.

@kpuatamazon
Contributor Author

DNNL has only a hidden unsupported option to link to MKL. It will not link to MKL by default.

@ChaiBapchya
Contributor

ChaiBapchya commented May 5, 2020

In case somebody finds this issue and wants their optimized build, here is a different workaround that removes the need for LD_PRELOAD. Just do this before running cmake the first time:

export CXXFLAGS="${CXXFLAGS} -DUSE_MKL -I/opt/intel/mkl/include"

Then cmake can be run normally:

cmake -GNinja -DUSE_CUDA=OFF -DCMAKE_BUILD_TYPE=Release ..

and the compiled MXNet can be run normally without any special environment variables.

@kpuatamazon Hi, I was trying to use opperf to benchmark MKL [default] vs. the workaround.
Despite ensuring MKL is installed and using the export CXXFLAGS followed by the usual cmake command, the build failed with

gemm.cpp:(.text+0xb45): undefined reference to `cblas_gemm_s8u8s32'

I tried the undocumented abominable kludge option you mentioned and that worked smoothly.

export LD_PRELOAD=/opt/intel/mkl/lib/intel64/libmkl_rt.so
rm -rf build/
mkdir -p build && cd build
cmake -GNinja -DUSE_CUDA=OFF -DCMAKE_BUILD_TYPE=Release -D_DNNL_USE_MKL=FULL -DMKLINC=/opt/intel/mkl/include ..
cmake --build . --parallel 1024

Script for OpPerf : https://gist.github.com/ChaiBapchya/5f2342f75ddeb1e21f14acac665c76ad

Results

| Operator | LHS | RHS | MKL Default | MKL Workaround |
|---|---|---|---|---|
| Dot | (4, 512, 512) | (4, 512, 512) | 15.1122 | 4.1254 |
| | (5, 512, 512) | (5, 512, 512) | 38.1678 | 7.5323 |
| | (5, 512, 1536) | (5, 512, 1536) | 21.6601 | 19.2503 |
| | (5, 512, 2048) | (5, 512, 2048) | 29.0369 | 23.7432 |
| | (5, 2048, 512) | (5, 2048, 512) | 167.5528 | 129.9957 |
| Batch_dot | (4, 512, 512) | (4, 512, 512) | 1.7898 | 1.5445 |
| | (5, 512, 512) | (5, 512, 512) | 2.2457 | 1.9361 |
| | (5, 512, 1536) | (5, 512, 1536) | 6.1453 | 5.4034 |
| | (5, 512, 2048) | (5, 512, 2048) | 8.246 | 8.0442 |
| | (5, 2048, 512) | (5, 2048, 512) | 160.6243 | 29.0772 |

| Operator | Data | Weight | MKL Default | MKL Workaround |
|---|---|---|---|---|
| FC | (4, 512) | (512, 512) | 0.0609 | 0.068 |
| | (5, 512) | (512, 512) | 0.0633 | 0.0731 |
| | (5, 512) | (1536, 512) | 0.0916 | 0.0996 |
| | (5, 512) | (2048, 512) | 0.1081 | 0.1084 |

However @kpuatamazon, when I try to test the default again [i.e. default -> workaround -> default] by unsetting the environment variable LD_PRELOAD, the default build fails with gemm.cpp:(.text+0xe6b): undefined reference to `cblas_gemm_s8u8s32'

@kpuatamazon
Contributor Author

@ChaiBapchya to be clear, here's how I am building the second option:

export CXXFLAGS="${CXXFLAGS} -DUSE_MKL -I/opt/intel/mkl/include"
unset LD_PRELOAD  #Technically this should be what exists in your environment by default
rm -rf build
mkdir build
cd build
cmake -GNinja -DUSE_CUDA=OFF -DCMAKE_BUILD_TYPE=Release ..
ninja -j 30

Note that cmake does not appear to pick up changes to CXXFLAGS in an existing build directory, so it needs a fresh build directory (deleting the cache might work).

@ChaiBapchya
Contributor

ChaiBapchya commented May 15, 2020

Tested with MXNet cfb474b. Compiled with mostly-default cmake settings:

cmake -GNinja -DUSE_CUDA=OFF -DCMAKE_BUILD_TYPE=Release ..

Then when I run

export MKL_VERBOSE=1
export MKLDNN_VERBOSE=1
python3
Python 3.6.9 (default, Nov  7 2019, 10:44:02) 
[GCC 8.3.0] on linux
Type "help", "copyright", "credits" or "license" for more information.
>>> import mxnet as mx
Numpy + Intel(R) MKL: THREADING LAYER: (null)
Numpy + Intel(R) MKL: setting Intel(R) MKL to use INTEL OpenMP runtime
Numpy + Intel(R) MKL: preloading libiomp5.so runtime

@kpuatamazon @kpu
Running on Ubuntu 18.04 [which doesn't have MKL installed by default] with the default cmake config doesn't use MKL as BLAS.
Hence those exports don't produce the output shown above.

Thus, for the Ubuntu 18.04 base AMI, one has to install MKL in /opt/intel and update the cmake command to

cmake -GNinja -DUSE_CUDA=OFF -DCMAKE_BUILD_TYPE=Release -DUSE_BLAS=mkl ..

I found this uses MKL as BLAS, and export MKL_VERBOSE=1 confirms it.

With this addition to both [default & workaround], I reran opperf and didn't see much performance difference.

Commands

Default

cmake -GNinja -DUSE_CUDA=OFF -DCMAKE_BUILD_TYPE=Release -DUSE_BLAS=mkl ..

Workaround

export CXXFLAGS="${CXXFLAGS} -DUSE_MKL -I/opt/intel/mkl/include"
cmake -GNinja -DUSE_CUDA=OFF -DCMAKE_BUILD_TYPE=Release -DUSE_BLAS=mkl ..

Logs

Default

Batch_dot

MKL_VERBOSE SGEMM_BATCH(T,N,0x7fdafe19ecec,0x7fdafe19ecf0,0x7fdafe19ecf4,0x7fdafe19ed04,0x7fda6001dfd0,0x7fdafe19ecf8,0x7fda6001e490,0x7fdafe19ecfc,0x7fdafe19ed08,0x3fd7ec0,0x7fdafe19ed00,0x7fdafe19ec28,0x7fdafe 28.71ms CNR:OFF Dyn:1 FastMM:1 TID:0  NThr:16
MKL_VERBOSE SGEMM_BATCH(T,N,0x7fdafe19ecec,0x7fdafe19ecf0,0x7fdafe19ecf4,0x7fdafe19ed04,0x7fda6001dfd0,0x7fdafe19ecf8,0x7fda6001e490,0x7fdafe19ecfc,0x7fdafe19ed08,0x3fd7ec0,0x7fdafe19ed00,0x7fdafe19ec28,0x7fdafe 28.53ms CNR:OFF Dyn:1 FastMM:1 TID:0  NThr:16

FC

dnnl_verbose,exec,cpu,inner_product,gemm:jit,forward_inference,src_f32::blocked:ab:f0 wei_f32::blocked:ab:f0 bia_undef::undef::f0 dst_f32::blocked:ab:f0,,,mb5ic2048oc512,0.0551758
dnnl_verbose,exec,cpu,inner_product,gemm:jit,forward_inference,src_f32::blocked:ab:f0 wei_f32::blocked:ab:f0 bia_undef::undef::f0 dst_f32::blocked:ab:f0,,,mb5ic2048oc512,0.0559082

Workaround

Batch_dot [same as default]

MKL_VERBOSE SGEMM_BATCH(T,N,0x7f985b78acec,0x7f985b78acf0,0x7f985b78acf4,0x7f985b78ad04,0x7f97b4016cd0,0x7f985b78acf8,0x7f97b401e550,0x7f985b78acfc,0x7f985b78ad08,0x26f2890,0x7f985b78ad00,0x7f985b78ac28,0x7f985b 28.72ms CNR:OFF Dyn:1 FastMM:1 TID:0  NThr:16
MKL_VERBOSE SGEMM_BATCH(T,N,0x7f985b78acec,0x7f985b78acf0,0x7f985b78acf4,0x7f985b78ad04,0x7f97b4016cd0,0x7f985b78acf8,0x7f97b401e550,0x7f985b78acfc,0x7f985b78ad08,0x26f2890,0x7f985b78ad00,0x7f985b78ac28,0x7f985b 28.77ms CNR:OFF Dyn:1 FastMM:1 TID:0  NThr:16

FC [additional MKL_VERBOSE before DNNL_VERBOSE]

MKL_VERBOSE SGEMM(T,N,512,4,512,0x7f985b789c28,0x7f97b5e52e80,512,0x7f97b401e600,512,0x7f985b789c30,0x7f976e89dd80,512) 39.68us CNR:OFF Dyn:1 FastMM:1 TID:0  NThr:16
dnnl_verbose,exec,cpu,inner_product,gemm:blas,forward_inference,src_f32::blocked:ab:f0 wei_f32::blocked:ab:f0 bia_undef::undef::f0 dst_f32::blocked:ab:f0,,,mb4ic512oc512,0.0769043
MKL_VERBOSE SGEMM(T,N,512,5,2048,0x7f985b789c28,0x7f976c400100,2048,0x7f976e887100,2048,0x7f985b789c30,0x7f976e8da000,512) 79.41us CNR:OFF Dyn:1 FastMM:1 TID:0  NThr:16
dnnl_verbose,exec,cpu,inner_product,gemm:blas,forward_inference,src_f32::blocked:ab:f0 wei_f32::blocked:ab:f0 bia_undef::undef::f0 dst_f32::blocked:ab:f0,,,mb5ic2048oc512,0.11377

Results

| Operator | LHS | RHS | MKL Default | MKL Workaround |
|---|---|---|---|---|
| Dot | (4, 512, 512) | (4, 512, 512) | 4.1112 | 4.8241 |
| | (5, 512, 512) | (5, 512, 512) | 6.4421 | 7.607 |
| | (5, 512, 1536) | (5, 512, 1536) | 20.3648 | 19.2217 |
| | (5, 512, 2048) | (5, 512, 2048) | 23.3236 | 23.2849 |
| | (5, 2048, 512) | (5, 2048, 512) | 123.1235 | 123.9806 |
| Batch_dot | (4, 512, 512) | (4, 512, 512) | 1.4105 | 1.407 |
| | (5, 512, 512) | (5, 512, 512) | 1.7558 | 1.7511 |
| | (5, 512, 1536) | (5, 512, 1536) | 6.5931 | 6.5585 |
| | (5, 512, 2048) | (5, 512, 2048) | 9.1452 | 9.1031 |
| | (5, 2048, 512) | (5, 2048, 512) | 29.0192 | 28.9236 |

| Operator | Data | Weight | MKL Default | MKL Workaround |
|---|---|---|---|---|
| FC | (4, 512) | (512, 512) | 0.057 | 0.0685 |
| | (5, 512) | (512, 512) | 0.0591 | 0.0698 |
| | (5, 512) | (1536, 512) | 0.0823 | 0.0939 |
| | (5, 512) | (2048, 512) | 0.0916 | 0.1026 |
| | (5, 2048) | (512, 2048) | 0.1146 | 0.1267 |

@kpu

kpu commented May 15, 2020

dot hasn't been taken over by MKLDNN (= DNNL = oneDNN) yet, so you shouldn't expect to see a difference there because it never calls MKLDNN's GEMM anyway. FullyConnected has been taken over by MKLDNN. This inconsistency is part of the bug report. Your logs agree with these statements.

Do you really mean LHS (4, 512, 512)? I'm talking about LHS (4,512) RHS (512, 512).

What CPU is this on? Note that mine were on a Skylake Xeon c5.9xlarge. I'd expect the AVX512 kernels to be pretty different from the AVX2 kernels.

What OMP threading were you using? I'd expect for matrices this small that thread scaling is terrible for anything but a tiny number of threads. (I mentioned using 4).
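
(In case it helps, a rough sketch for sweeping the thread count; OMP_NUM_THREADS has to be set before the process starts, hence the subprocess. It assumes the mult_bench.py script from earlier in the thread is executable in the current directory:)

import os
import subprocess

for threads in (1, 2, 4, 8):
    env = dict(os.environ, OMP_NUM_THREADS=str(threads))
    print("=== OMP_NUM_THREADS={} ===".format(threads))
    subprocess.run(["./mult_bench.py"], env=env, check=True)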

@ChaiBapchya
Contributor

Yes, the logs agree with these statements, but the perf difference isn't visible via opperf.
My bad, I interpreted LHS/RHS wrongly, but that only affects dot and batch_dot. The FC values for data and weight are the same as yours.

CPU : Intel(R) Xeon(R) CPU E5-2686 v4 @ 2.30GHz
Instance p3.8xl

This instance doesn't have avx512

$ lscpu | grep Flags | grep avx
Flags:               fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush mmx fxsr sse sse2 ht syscall nx pdpe1gb rdtscp lm constant_tsc rep_good nopl xtopology nonstop_tsc cpuid aperfmperf pni pclmulqdq ssse3 fma cx16 pcid sse4_1 sse4_2 x2apic movbe popcnt tsc_deadline_timer aes xsave avx f16c rdrand hypervisor lahf_lm abm 3dnowprefetch cpuid_fault invpcid_single pti fsgsbase bmi1 hle avx2 smep bmi2 erms invpcid rtm rdseed adx xsaveopt

OMP_Threading i used was default.

However, trying FC with OMP_NUM_THREADS=4 still shows that the workaround makes FC slower [despite it using MKL] than the default [which doesn't use MKL].

Workaround [slow]

MKL_VERBOSE SGEMM(T,N,512,5,2048,0x7f0b79ff8c28,0x7f0b62cfb040,2048,0x7f0b6800b740,2048,0x7f0b79ff8c30,0x7f0b68074e80,512) 119.93us CNR:OFF Dyn:1 FastMM:1 TID:0  NThr:4
dnnl_verbose,exec,cpu,inner_product,gemm:blas,forward_inference,src_f32::blocked:ab:f0 wei_f32::blocked:ab:f0 bia_undef::undef::f0 dst_f32::blocked:ab:f0,,,mb5ic2048oc512,0.142822
{'inputs': {'data': (4, 512), 'weight': (512, 512), 'no_bias': True, 'num_hidden': 512}, 'avg_time_forward_FullyConnected': 0.0925}, 
{'inputs': {'data': (5, 512), 'weight': (512, 512), 'no_bias': True, 'num_hidden': 512}, 'avg_time_forward_FullyConnected': 0.0921}, 
{'inputs': {'data': (5, 512), 'weight': (1536, 512), 'no_bias': True, 'num_hidden': 1536}, 'avg_time_forward_FullyConnected': 0.1441}, 
{'inputs': {'data': (5, 512), 'weight': (2048, 512), 'no_bias': True, 'num_hidden': 2048}, 'avg_time_forward_FullyConnected': 0.1688}, 
{'inputs': {'data': (5, 2048), 'weight': (512, 2048), 'no_bias': True, 'num_hidden': 512}, 'avg_time_forward_FullyConnected': 0.1773}]}]

Default [faster]

dnnl_verbose,exec,cpu,inner_product,gemm:jit,forward_inference,src_f32::blocked:ab:f0 wei_f32::blocked:ab:f0 bia_undef::undef::f0 dst_f32::blocked:ab:f0,,,mb5ic2048oc512,0.11499
dnnl_verbose,exec,cpu,inner_product,gemm:jit,forward_inference,src_f32::blocked:ab:f0 wei_f32::blocked:ab:f0 bia_undef::undef::f0 dst_f32::blocked:ab:f0,,,mb5ic2048oc512,0.115967
{'inputs': {'data': (4, 512), 'weight': (512, 512), 'no_bias': True, 'num_hidden': 512}, 'avg_time_forward_FullyConnected': 0.0691}, 
{'inputs': {'data': (5, 512), 'weight': (512, 512), 'no_bias': True, 'num_hidden': 512}, 'avg_time_forward_FullyConnected': 0.0669}, 
{'inputs': {'data': (5, 512), 'weight': (1536, 512), 'no_bias': True, 'num_hidden': 1536}, 'avg_time_forward_FullyConnected': 0.1166}, 
{'inputs': {'data': (5, 512), 'weight': (2048, 512), 'no_bias': True, 'num_hidden': 2048},  'avg_time_forward_FullyConnected': 0.1438}, 
{'inputs': {'data': (5, 2048), 'weight': (512, 2048), 'no_bias': True, 'num_hidden': 512}, 'avg_time_forward_FullyConnected': 0.1509}]}]

Python's timeit function in OpPerf

Also, the same trend holds if I use Python's timeit function instead of MXNet's native [built-in] profiler.
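
(For reference, a minimal sketch of that kind of timeit-based measurement; the shape, repeat, and iteration counts here are arbitrary rather than the ones OpPerf uses:)

import timeit
import mxnet as mx

data = mx.nd.random_uniform(shape=(5, 2048))
weight = mx.nd.random_uniform(shape=(512, 2048))

def run():
    mx.nd.FullyConnected(data, weight, no_bias=True, num_hidden=512)
    mx.nd.waitall()

run()  # warm-up so JIT / first-call costs are excluded
best = min(timeit.repeat(run, repeat=5, number=100)) / 100
print("avg time per FullyConnected call: {:.6f} s".format(best))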

Workaround [slow]

MKL_VERBOSE SGEMM(T,N,512,5,2048,0x7f86067f9c28,0x7f85e34fd040,2048,0x7f85f000b740,2048,0x7f86067f9c30,0x7f85f005fc80,512) 120.56us CNR:OFF Dyn:1 FastMM:1 TID:0  NThr:4
dnnl_verbose,exec,cpu,inner_product,gemm:blas,forward_inference,src_f32::blocked:ab:f0 wei_f32::blocked:ab:f0 bia_undef::undef::f0 dst_f32::blocked:ab:f0,,,mb5ic2048oc512,0.143066
[{'FullyConnected': [
{'inputs': {'data': (4, 512), 'weight': (512, 512), 'no_bias': True, 'num_hidden': 512}, 'avg_time_FullyConnected': 0.18504947423934937, 'p50_time_FullyConnected': 0.1817569718696177, 'p90_time_FullyConnected': 0.19327133195474744, 'p99_time_FullyConnected': 0.21545797004364425}, 
{'inputs': {'data': (5, 512), 'weight': (512, 512), 'no_bias': True, 'num_hidden': 512}, 'avg_time_FullyConnected': 0.1922117848880589, 'p50_time_FullyConnected': 0.18283800454810262, 'p90_time_FullyConnected': 0.19954713061451912, 'p99_time_FullyConnected': 0.36291924072429527}, 
{'inputs': {'data': (5, 512), 'weight': (1536, 512), 'no_bias': True, 'num_hidden': 1536}, 'avg_time_FullyConnected': 0.24743830785155296, 'p50_time_FullyConnected': 0.23927190341055393, 'p90_time_FullyConnected': 0.25965459644794464, 'p99_time_FullyConnected': 0.3613424429204313}, 
{'inputs': {'data': (5, 512), 'weight': (2048, 512), 'no_bias': True, 'num_hidden': 2048}, 'avg_time_FullyConnected': 0.2728361077606678, 'p50_time_FullyConnected': 0.2635499695315957, 'p90_time_FullyConnected': 0.2853184472769499, 'p99_time_FullyConnected': 0.39780672872439016}, 
{'inputs': {'data': (5, 2048), 'weight': (512, 2048), 'no_bias': True, 'num_hidden': 512}, 'avg_time_FullyConnected': 0.2834270987659693, 'p50_time_FullyConnected': 0.27543300529941916, 'p90_time_FullyConnected': 0.29397032922133803, 'p99_time_FullyConnected': 0.40460130083374674}]}]

Default [faster]

dnnl_verbose,exec,cpu,inner_product,gemm:jit,forward_inference,src_f32::blocked:ab:f0 wei_f32::blocked:ab:f0 bia_undef::undef::f0 dst_f32::blocked:ab:f0,,,mb5ic2048oc512,0.116943
dnnl_verbose,exec,cpu,inner_product,gemm:jit,forward_inference,src_f32::blocked:ab:f0 wei_f32::blocked:ab:f0 bia_undef::undef::f0 dst_f32::blocked:ab:f0,,,mb5ic2048oc512,0.115967
[{'FullyConnected': [
{'inputs': {'data': (4, 512), 'weight': (512, 512), 'no_bias': True, 'num_hidden': 512}, 'avg_time_FullyConnected': 0.17584662651643157, 'p50_time_FullyConnected': 0.17386197578161955, 'p90_time_FullyConnected': 0.18297710921615362, 'p99_time_FullyConnected': 0.2026369632221758}, 
{'inputs': {'data': (5, 512), 'weight': (512, 512), 'no_bias': True, 'num_hidden': 512}, 'avg_time_FullyConnected': 0.177996635902673, 'p50_time_FullyConnected': 0.17414247849956155, 'p90_time_FullyConnected': 0.18130375538021326, 'p99_time_FullyConnected': 0.23936283076181997}, 
{'inputs': {'data': (5, 512), 'weight': (1536, 512), 'no_bias': True, 'num_hidden': 1536}, 'avg_time_FullyConnected': 0.22948215948417783, 'p50_time_FullyConnected': 0.22402300965040922, 'p90_time_FullyConnected': 0.2324321656487882, 'p99_time_FullyConnected': 0.3020400775130837}, 
{'inputs': {'data': (5, 512), 'weight': (2048, 512), 'no_bias': True, 'num_hidden': 2048}, 'avg_time_FullyConnected': 0.2527528256177902, 'p50_time_FullyConnected': 0.24718954227864742, 'p90_time_FullyConnected': 0.263673288282007, 'p99_time_FullyConnected': 0.3303098364267497}, 
{'inputs': {'data': (5, 2048), 'weight': (512, 2048), 'no_bias': True, 'num_hidden': 512}, 'avg_time_FullyConnected': 0.26115449611097574, 'p50_time_FullyConnected': 0.25852309772744775, 'p90_time_FullyConnected': 0.26592351496219635, 'p99_time_FullyConnected': 0.304172352189198}]}]

Since Skylake has AVX512, I'm guessing this FC slowdown might be in the AVX512 kernels. Let me reproduce that.

@ChaiBapchya
Contributor

Can confirm that this issue is specific to AVX512 kernels.
Tried this on c5.xl
$ lscpu

Intel(R) Xeon(R) Platinum 8124M CPU @ 3.00GHz
Flags:               fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush mmx fxsr sse sse2 ss ht syscall nx pdpe1gb rdtscp lm constant_tsc arch_perfmon rep_good nopl xtopology nonstop_tsc cpuid aperfmperf tsc_known_freq pni pclmulqdq monitor ssse3 fma cx16 pcid sse4_1 sse4_2 x2apic movbe popcnt tsc_deadline_timer aes xsave avx f16c rdrand hypervisor lahf_lm abm 3dnowprefetch invpcid_single pti fsgsbase tsc_adjust bmi1 hle avx2 smep bmi2 erms invpcid rtm mpx avx512f avx512dq rdseed adx smap clflushopt clwb avx512cd avx512bw avx512vl xsaveopt xsavec xgetbv1 xsaves ida arat pku ospke

Results

Default [slower]

dnnl_verbose,exec,cpu,inner_product,gemm:jit,forward_inference,src_f32::blocked:ab:f0 wei_f32::blocked:ab:f0 bia_undef::undef::f0 dst_f32::blocked:ab:f0,,,mb5ic2048oc512,0.133789
dnnl_verbose,exec,cpu,inner_product,gemm:jit,forward_inference,src_f32::blocked:ab:f0 wei_f32::blocked:ab:f0 bia_undef::undef::f0 dst_f32::blocked:ab:f0,,,mb5ic2048oc512,0.132812
[{'FullyConnected': [
{'inputs': {'data': (4, 512), 'weight': (512, 512), 'no_bias': True, 'num_hidden': 512}, 'avg_time_FullyConnected': 0.10202302001744101, 'p50_time_FullyConnected': 0.10086749989568489, 'p90_time_FullyConnected': 0.10658760029400582, 'p99_time_FullyConnected': 0.13521948004836298}, 
{'inputs': {'data': (5, 512), 'weight': (512, 512), 'no_bias': True, 'num_hidden': 512}, 'avg_time_FullyConnected': 0.10642346004715364, 'p50_time_FullyConnected': 0.09991750016524747, 'p90_time_FullyConnected': 0.10565369971118344, 'p99_time_FullyConnected': 0.2586996700802042}, 
{'inputs': {'data': (5, 512), 'weight': (1536, 512), 'no_bias': True, 'num_hidden': 1536}, 'avg_time_FullyConnected': 0.16890607999812346, 'p50_time_FullyConnected': 0.16431500012004108, 'p90_time_FullyConnected': 0.1781331999154645, 'p99_time_FullyConnected': 0.2831235897247094}, 
{'inputs': {'data': (5, 512), 'weight': (2048, 512), 'no_bias': True, 'num_hidden': 2048}, 'avg_time_FullyConnected': 0.20140223995440465, 'p50_time_FullyConnected': 0.19778950013460417, 'p90_time_FullyConnected': 0.20401089991537447, 'p99_time_FullyConnected': 0.3063294199228036}, 
{'inputs': {'data': (5, 2048), 'weight': (512, 2048), 'no_bias': True, 'num_hidden': 512}, 'avg_time_FullyConnected': 0.21596427998701984, 'p50_time_FullyConnected': 0.2096700000038254, 'p90_time_FullyConnected': 0.21819640001012885, 'p99_time_FullyConnected': 0.3412436299549877}]}]

MKL Workaround [Faster]

MKL_VERBOSE SGEMM(T,N,512,5,2048,0x7f9bcf6fac28,0x7f9bc22f4040,2048,0x7f9b1400ce80,2048,0x7f9bcf6fac30,0x7f9b1405e840,512) 21.25us CNR:OFF Dyn:1 FastMM:1 TID:0  NThr:18
dnnl_verbose,exec,cpu,inner_product,gemm:blas,forward_inference,src_f32::blocked:ab:f0 wei_f32::blocked:ab:f0 bia_undef::undef::f0 dst_f32::blocked:ab:f0,,,mb5ic2048oc512,0.0378418
MKL_VERBOSE SGEMM(T,N,512,5,2048,0x7f9bcf6fac28,0x7f9bc22f4040,2048,0x7f9b1400ce80,2048,0x7f9bcf6fac30,0x7f9b14061c00,512) 20.94us CNR:OFF Dyn:1 FastMM:1 TID:0  NThr:18
dnnl_verbose,exec,cpu,inner_product,gemm:blas,forward_inference,src_f32::blocked:ab:f0 wei_f32::blocked:ab:f0 bia_undef::undef::f0 dst_f32::blocked:ab:f0,,,mb5ic2048oc512,0.0371094
[{'FullyConnected': [
{'inputs': {'data': (4, 512), 'weight': (512, 512), 'no_bias': True, 'num_hidden': 512}, 'avg_time_FullyConnected': 0.11772135999308375, 'p50_time_FullyConnected': 0.1149684999290912, 'p90_time_FullyConnected': 0.1244978000613628, 'p99_time_FullyConnected': 0.14825501980340045}, 
{'inputs': {'data': (5, 512), 'weight': (512, 512), 'no_bias': True, 'num_hidden': 512}, 'avg_time_FullyConnected': 0.120828840035756, 'p50_time_FullyConnected': 0.11370450010872446, 'p90_time_FullyConnected': 0.12752780021401122, 'p99_time_FullyConnected': 0.2412066401620902}, 
{'inputs': {'data': (5, 512), 'weight': (1536, 512), 'no_bias': True, 'num_hidden': 1536}, 'avg_time_FullyConnected': 0.13385597998421872, 'p50_time_FullyConnected': 0.12600750005731243, 'p90_time_FullyConnected': 0.14806160011175962, 'p99_time_FullyConnected': 0.2509373301927551}, 
{'inputs': {'data': (5, 512), 'weight': (2048, 512), 'no_bias': True, 'num_hidden': 2048}, 'avg_time_FullyConnected': 0.14175208003507578, 'p50_time_FullyConnected': 0.1372545000322134, 'p90_time_FullyConnected': 0.14401020002878798, 'p99_time_FullyConnected': 0.2423993399725075}, 
{'inputs': {'data': (5, 2048), 'weight': (512, 2048), 'no_bias': True, 'num_hidden': 512}, 'avg_time_FullyConnected': 0.143890859962994, 'p50_time_FullyConnected': 0.1397979999637755, 'p90_time_FullyConnected': 0.14637689982919258, 'p99_time_FullyConnected': 0.22678783964693117}]}]

To reproduce
https://gist.github.com/ChaiBapchya/a849cfd566b8114e695454850b48077b
https://gist.github.com/ChaiBapchya/5f2342f75ddeb1e21f14acac665c76ad#file-benchmark_intel_mkl-py

@aaraujom

Hi, thanks for checking. Let me try to reproduce this on my end and fix it.

@ChaiBapchya
Contributor

@aaraujom any update?

@aaraujom

aaraujom commented Jun 3, 2020

Hi @ChaiBapchya - I was able to reproduce the issue on my end. DNNL is missing some kernels for very specific problem sizes, which affects the gemm sizes in this thread. We will address this issue in DNNL.

@kpuatamazon
Contributor Author

Mostly fixed by the v1.6.x release of oneDNN. There is still a speed discrepancy but it's not as bad.

git checkout 6b568fd2c10e8cc14c21d6150ba71933a2af2a41 (current v1.x branch of MXNet including oneDNN 1.6.1)

Shape (5, 512, 512)
0.0000471 seconds for fullyconnected (DNNL)
0.0000413 seconds for dot (MKL)
Shape (5, 512, 1536)
0.0000691 seconds for fullyconnected (DNNL)
0.0000639 seconds for dot (MKL)
Shape (5, 512, 2048)
0.0000936 seconds for fullyconnected (DNNL)
0.0000868 seconds for dot (MKL)
Shape (5, 2048, 512)
0.0000949 seconds for fullyconnected (DNNL)
0.0000870 seconds for dot (MKL)
Shape (4, 512, 512)
0.0000446 seconds for fullyconnected (DNNL)
0.0000392 seconds for dot (MKL)

Revert to oneDNN v1.3.
git checkout 6b568fd2c10e8cc14c21d6150ba71933a2af2a41 && git revert 17111037176294d86b7dc6f719347aa24c96b317 && git revert eae6171cb56190cf06f5b11597380c082745991b

Shape (5, 512, 512)
0.0000842 seconds for fullyconnected (DNNL)
0.0000407 seconds for dot (MKL)
Shape (5, 512, 1536)
0.0001908 seconds for fullyconnected (DNNL)
0.0000606 seconds for dot (MKL)
Shape (5, 512, 2048)
0.0002421 seconds for fullyconnected (DNNL)
0.0000880 seconds for dot (MKL)
Shape (5, 2048, 512)
0.0003462 seconds for fullyconnected (DNNL)
0.0000912 seconds for dot (MKL)
Shape (4, 512, 512)
0.0000841 seconds for fullyconnected (DNNL)
0.0000395 seconds for dot (MKL)

@aaraujom

@ChaiBapchya - Will those improvements work for you?

@ChaiBapchya
Contributor

Is there a way to bring down the deficit and reach parity? Or do we know the reason why DNNL is slower? @TaoLv @aaraujom

@b-shi

b-shi commented Aug 18, 2020

Hi, we benchmarked gemm performance for those sizes internally using oneDNN vs oneMKL directly and here is what we got:

| Max (20K trials), GFlops | oneMKL (1Th) | oneDNN (1Th) | oneMKL (4Th) | oneDNN (4Th) | oneMKL (8Th) | oneDNN (8Th) | oneMKL (28Th) | oneDNN (28Th) |
|---|---|---|---|---|---|---|---|---|
| 5, 512, 512 | 34.49 | 40.96 | 174.76 | 262.14 | 291.27 | 436.91 | 374.49 | 524.29 |
| 5, 512, 1536 | 26.57 | 27.40 | 178.73 | 231.30 | 357.47 | 491.52 | 561.74 | 873.81 |
| 5, 512, 2048 | 25.83 | 26.55 | 143.64 | 174.76 | 338.25 | 436.91 | 616.81 | 1048.58 |
| 5, 2048, 512 | 25.51 | 26.82 | 139.81 | 183.96 | 388.36 | 524.29 | 873.81 | 1310.72 |
| 4, 512, 512 | 31.78 | 42.80 | 161.32 | 233.02 | 262.14 | 419.43 | 299.59 | 419.43 |

| Avg (20K trials), GFlops | oneMKL (1Th) | oneDNN (1Th) | oneMKL (4Th) | oneDNN (4Th) | oneMKL (8Th) | oneDNN (8Th) | oneMKL (28Th) | oneDNN (28Th) |
|---|---|---|---|---|---|---|---|---|
| 5, 512, 512 | 33.96 | 39.45 | 169.05 | 233.88 | 258.32 | 364.93 | 328.68 | 399.95 |
| 5, 512, 1536 | 26.30 | 27.13 | 174.04 | 226.75 | 342.51 | 463.81 | 492.14 | 765.09 |
| 5, 512, 2048 | 25.58 | 26.35 | 140.57 | 171.42 | 328.70 | 429.25 | 589.81 | 873.36 |
| 5, 2048, 512 | 25.27 | 26.57 | 137.35 | 179.91 | 368.89 | 508.38 | 792.29 | 1101.67 |
| 4, 512, 512 | 31.02 | 41.91 | 146.59 | 216.20 | 226.68 | 335.54 | 271.98 | 334.98 |
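
(For anyone relating these GFlops numbers to the timings earlier in the thread: a GEMM with the benchmark script's (rows, inner, cols) shape costs about 2*rows*inner*cols floating-point operations. A quick sketch of the conversion; the example timing is the single-threaded dot (MKL) number from the table earlier, so this is only a rough cross-check on different hardware:)

def gflops(rows, inner, cols, seconds):
    # 2 * m * k * n floating point operations for one (rows x inner) * (inner x cols) GEMM
    return 2.0 * rows * inner * cols / seconds / 1e9

# e.g. the single-threaded dot (MKL) timing for shape (5, 2048, 512) reported earlier
print(gflops(5, 2048, 512, 0.0003681))  # ~28 GFlops, same ballpark as the 1-thread oneMKL column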

@TaoLv
Member

TaoLv commented Aug 19, 2020

@ChaiBapchya Is it possible to compare the performance of FC in the two versions? Comparing FC with dot is not really fair to me. As you may know, they use different execution paths, FCompute vs. FComputeEx.

@pengzhao-intel
Contributor

Closing since the fix is included with the updated MKLDNN.

@ChaiBapchya
Contributor

@pengzhao-intel v1.7.0 public mxnet uses MKLDNN v1.3 and hence won't have this optimization right?
https://github.com/apache/incubator-mxnet/tree/1.7.0/3rdparty
This means public mxnet v1.7 won't have this fix. v1.8, on the contrary, would have the optimized mkldnn [by virtue of v1.x branch having mkldnn v1.6+ prior to 1.8.rc0] right?

@pengzhao-intel
Contributor

@pengzhao-intel v1.7.0 public mxnet uses MKLDNN v1.3 and hence won't have this optimization right?
https://github.com/apache/incubator-mxnet/tree/1.7.0/3rdparty
This means public mxnet v1.7 won't have this fix. v1.8, on the contrary, would have the optimized mkldnn [by virtue of v1.x branch having mkldnn v1.6+ prior to 1.8.rc0] right?

Thanks for raising the question. The China team is on a public holiday this week.
@anko-intel could you help take a look at the issue?

@anko-intel
Contributor

@ChaiBapchya, you are right; we also have to include 1.6.4 (#19251) to make it happen.
