
Integrating the MKL VML functions to MXNET to speed-up the (element-wised) mathematic computation #14893

Merged
merged 43 commits into from May 22, 2019

Conversation

juliusshufan
Contributor

@juliusshufan juliusshufan commented May 6, 2019

Description

Intel MKL provides a wide range of generic vectorized math (VML) functions that benefit from AVX-512 instructions. By integrating these VML functions, some of the element-wise operations are expected to be sped up.

Currently, the generic math computations are implemented by a series of mshadow OPs, which wrap the functions provided by the standard math library; the vectorization of the inputs is handled by dense/sparse tensors, and the computations are parallelized with OpenMP. Specifically, element-wise OPs taking one or two inputs, with “write-to”/”write-inplace” request types, can be backed by MKL VML functions (see the sketch below).
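For illustration, a minimal sketch of such a mapping, assuming MKL's VML headers are available; the wrapper name vml_exp is made up for this example and is not the API introduced by this PR:

#include <mkl.h>

// Hypothetical wrapper (illustration only): delegate an element-wise exp over
// a contiguous buffer to the VML routines instead of a scalar loop.
inline void vml_exp(const MKL_INT n, const float *in, float *out) {
  vsExp(n, in, out);   // single-precision VML exp
}

inline void vml_exp(const MKL_INT n, const double *in, double *out) {
  vdExp(n, in, out);   // double-precision VML exp
}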

@TaoLv @pengzhao-intel

Checklist

Essentials

Please feel free to remove inapplicable items for your PR.

  • The PR title starts with [MXNET-$JIRA_ID], where $JIRA_ID refers to the relevant JIRA issue created (except PRs with tiny changes)
  • Changes are complete (i.e. I finished coding on this PR)
  • All changes have test coverage:
  • Unit tests are added for small changes to verify correctness (e.g. adding a new operator)
  • Nightly tests are added for complicated/long-running ones (e.g. changing distributed kvstore)
  • Build tests will be added for build configuration changes (e.g. adding a new build option with NCCL)
  • Code is well-documented:
  • For user-facing API changes, API doc string has been updated.
  • For new C++ functions in header files, their functionalities and arguments are documented.
  • For new examples, README.md is added to explain what the example does, the source of the dataset, expected performance on the test set, and a reference to the original paper if applicable
  • Check the API doc at http://mxnet-ci-doc.s3-accelerate.dualstack.amazonaws.com/PR-$PR_ID/$BUILD_ID/index.html
  • To the best of my knowledge, examples are either not affected by this change, or have been fixed to be compatible with this change

Changes

The unit tests are provided with the changes.

@juliusshufan juliusshufan changed the title Integrating the MKL VML functions to MXNET to speed-up the mathematic computation Integrating the MKL VML functions to MXNET to speed-up the (element-wised) mathematic computation May 6, 2019
@anirudhacharya
Member

@mxnet-label-bot add [pr-awaiting-review]

@marcoabreu marcoabreu added the pr-awaiting-review PR is waiting for code review label May 6, 2019
@pengzhao-intel
Contributor

@eric-haibin-lin @szha Please help to review.

src/operator/mkl_functions-inl.h
MXNET_MKL_BINARY_MATH_FUNC(sub, Sub);
MXNET_MKL_BINARY_MATH_FUNC(mul, Mul);
MXNET_MKL_BINARY_MATH_FUNC(pow, Pow);
MXNET_MKL_BINARY_MATH_FUNC(hypot, Hypot);
Contributor

Will all of these functions be mapped automatically when MKL is enabled?

Member

No. We just put all the VML functions here. We think these functions can be leveraged by MXNet in the future, but currently the registration of each operator needs to be changed to use them. In this PR we only optimized some operators which are used in BERT. We propose to optimize the others when we face performance problems with them.

Contributor

Thanks for the explanation. We can add it back when we use it; otherwise, it is a little confusing for other developers.
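For context, a guess at the kind of wrapper a line like MXNET_MKL_BINARY_MATH_FUNC(add, Add) could generate; the macro body is not shown in this thread, so the struct below is illustrative only, while vsAdd/vdAdd are real MKL VML entry points:

#include <mkl.h>

// Illustrative sketch (not the PR's actual macro expansion): a binary wrapper
// dispatching to the single- and double-precision VML routines, matching the
// add::Vectorize(n, a, b, r) usage seen elsewhere in this review.
struct add {
  static void Vectorize(const MKL_INT n, const float *a, const float *b, float *r) {
    vsAdd(n, a, b, r);   // r[i] = a[i] + b[i] (float)
  }
  static void Vectorize(const MKL_INT n, const double *a, const double *b, double *r) {
    vdAdd(n, a, b, r);   // r[i] = a[i] + b[i] (double)
  }
};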

.set_attr<FInferStorageType>("FInferStorageType", ElemwiseStorageType<1, 1, \
false, true, true>) \
.set_attr<FCompute>("FCompute<" #__xpu$ ">", UnaryOp::MKL_Compute<__kernel$, __mkl_kernel$>) \
.set_attr<FComputeEx>("FComputeEx<" #__xpu$ ">", UnaryOp::MKL_ComputeEx<__kernel$, \
Member

Why do you override the FComputeEx attribute?

Member

@TaoLv TaoLv May 14, 2019

@eric-haibin-lin Thanks for the review. Not sure how sparse is handled in the original FComputeEx. Previously I thought sparse could also benefit from VML if its values are stored in a dense way, but we don't have much data to prove that.

Member

Thanks for the explanation. Yeah, it should benefit.

Member

I just reverted the change for FComputeEx as we don't have much data for that yet. Will revisit this part once we hit any performance issues for sparse.

Member

Actually, I'd prefer not reverting it. I don't see any reason why it wouldn't help. Let's undo the revert?

Member

Can you provide a simple benchmark for a sparse unary operator? We can take a quick try. Thanks! cc @juliusshufan

Contributor Author

@eric-haibin-lin @TaoLv Friendly ping. May I know your decision on the sparse part?


// LayerNorm on the last dimension
template <typename DType>
MSHADOW_XINLINE static void LayerNormLastDim(const index_t m,
Member

Does this PR also intend to enable optimization for layernorm?

Member

Yes. I'm working on enabling it and trying to understand the optimization and workflow from @sxjscience 's PR.

@eric-haibin-lin eric-haibin-lin dismissed their stale review May 15, 2019 05:00

comments addressed

@pengzhao-intel
Contributor

@eric-haibin-lin please help to review again :)

Contributor

@pengzhao-intel pengzhao-intel left a comment

A minor comment is added.

LGTM


mul::Vectorize(n, out_offset, gamma, out_offset);
div_(n, out_offset, var[i], out_offset);
add::Vectorize(n, out_offset, beta, out_offset);
Contributor

Any chance to fuse some of these operations to reduce memory bandwidth?

Member

How much faster is this version compared to the mshadow one?

Member

After reading the code, I think the current implementation, which relies on the vectorized operations, should be fast at scaling and shifting the data (data * gamma & data + beta). One possible improvement is to use Welford's online algorithm (https://en.wikipedia.org/wiki/Algorithms_for_calculating_variance) to calculate the mean/variance in one pass; the code would look like this:

template <typename DType>
MSHADOW_XINLINE static void mean_var_(index_t n, DType *in, DType *mean, DType* variance) {
  DType sigma2 = 0;
  DType mean_v = 0;
  DType old_mean_v = 0;
  // Welford's single-pass update of the running mean and the sum of
  // squared deviations (sigma2); avoids a second pass over the data.
  for (index_t i = 0; i < n; i++) {
    DType x = in[i];
    old_mean_v = mean_v;
    mean_v += (x - old_mean_v) / (i + 1);
    sigma2 += (x - old_mean_v) * (x - mean_v);
  }
  mean[0] = mean_v;
  variance[0] = sigma2 / n;
}


template <typename DType>
MSHADOW_XINLINE static void LayerNormLastDim(index_t m,
                                             index_t n,
                                             DType *a,
                                             DType *b,
                                             DType *ws,
                                             DType *gamma,
                                             DType *beta,
                                             DType *mean,
                                             DType *var,
                                             DType eps) {
  auto nthreads = engine::OpenMP::Get()->GetRecommendedOMPThreadCount();
#pragma omp parallel for num_threads(nthreads)
  for (index_t i = 0; i < m; i++) {
    DType ele_mean, ele_var;
    DType* in_offset = a + i * n;
    DType* out_offset = b + i * n;
    mean_var_(n, in_offset, &ele_mean, &ele_var);
    sub_(n, in_offset, ele_mean, out_offset);
    ele_var = math::sqrt(ele_var + eps);
    mul::Vectorize(n, out_offset, gamma, out_offset);
    div_(n, out_offset, ele_var, out_offset);
    add::Vectorize(n, out_offset, beta, out_offset);
    mean[i] = ele_mean;
    var[i] = ele_var;
  }
}

Member

@pengzhao-intel @sxjscience Loops are fused in the latest commit. I also removed the required workspace, but that means we cannot leverage VML functions and need to rely on the compiler for vectorization.
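For reference, a minimal sketch of the fused form described in this comment; it is not the PR's exact code, and the function and parameter names are illustrative. Mean and variance come from one Welford pass, and normalize/scale/shift are fused into a second loop that the compiler can auto-vectorize without a temporary workspace:

#include <cmath>
#include <cstddef>

// Illustrative per-row LayerNorm: one Welford pass for mean/variance,
// one fused pass for (x - mean) / std * gamma + beta.
template <typename DType>
void layer_norm_row(std::size_t n, const DType *in, DType *out,
                    const DType *gamma, const DType *beta, DType eps) {
  DType mean = 0, m2 = 0;
  for (std::size_t i = 0; i < n; ++i) {            // Welford pass
    DType delta = in[i] - mean;
    mean += delta / static_cast<DType>(i + 1);
    m2 += delta * (in[i] - mean);
  }
  DType inv_std = DType(1) / std::sqrt(m2 / static_cast<DType>(n) + eps);
  for (std::size_t i = 0; i < n; ++i) {            // fused normalize/scale/shift
    out[i] = (in[i] - mean) * inv_std * gamma[i] + beta[i];
  }
}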

@TaoLv
Member

TaoLv commented May 18, 2019

@sxjscience Can you help review? Here is an optimization for CPU LayerNorm.

@TaoLv
Member

TaoLv commented May 21, 2019

LayerNorm performance is measured on my SKL machine. Shapes are from the BERT base and large models respectively. The speedup from this PR is around 3x~10x. @eric-haibin-lin @sxjscience @pengzhao-intel

# mxnet-mkl==1.4.1
layernorm (1L, 128L, 768L): 0.23437 ms
layernorm (8L, 128L, 768L): 1.39641 ms
layernorm (32L, 128L, 768L): 5.18604 ms
layernorm (1L, 128L, 1024L): 0.35661 ms
layernorm (8L, 128L, 1024L): 1.80795 ms
layernorm (32L, 128L, 1024L): 6.76601 ms

# this PR built with USE_BLAS=mkl
layernorm (1, 128, 768): 0.07230 ms
layernorm (8, 128, 768): 0.21550 ms
layernorm (32, 128, 768): 0.51188 ms
layernorm (1, 128, 1024): 0.08863 ms
layernorm (8, 128, 1024): 0.25120 ms
layernorm (32, 128, 1024): 0.63479 ms

@pengzhao-intel
Contributor

@sxjscience @eric-haibin-lin Any other comments? If not, I will merge this PR soon for the 1.5 release.

@pengzhao-intel pengzhao-intel merged commit b0be6c5 into apache:master May 22, 2019
haohuanw pushed a commit to haohuanw/incubator-mxnet that referenced this pull request Jun 23, 2019
…ised) mathematic computation (apache#14893)

* mkl_func test with erf&log op, build success~

* fix lint and build issues

* Try to add support to sparse array

* fix build

* add functions

* Fix review comments

* remove unecessary code

* Update test case

* minor fix

* move the position of MKL_Compute

* mkl_func test with erf&log op, build success~

* fix lint and build issues

* Try to add support to sparse array

* fix build

* Fix review comments

* remove unecessary code

* Update test case

* minor fix

* add functions

* move the position of MKL_Compute

* fix cpplint

* cpp lint

* trigger ci

* address comments

* coding style

* enable layernorm

* fix windows build

* revert changes to FComputeEx

* int -> index_t

* remove workspace

* fix lint

* clean code
Labels
pr-awaiting-review PR is waiting for code review

8 participants