
[BYOC][DNNL] Improve performance of DNNL BYOC dense operator #11513

Merged: 10 commits from enhance_dnnl_dense merged into apache:main on Jun 10, 2022

Conversation

@billishyahao (Contributor) commented May 31, 2022

This patch enhances the performance of the DNNL BYOC dense operators by 1) introducing GELU fusion and 2) introducing an altered (packed) dense weight layout. (Implemented after merging PR #11345; thanks @apeskov.)

Why introduce GELU fusion:
Models in the BERT family use the GELU (Gaussian Error Linear Unit) activation heavily, so fusing GELU into the dense primitive gives those models a noticeable performance boost.
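For reference, here is a minimal sketch of what such a fused pattern can look like in the Relay pattern language, assuming the frontend lowers GELU to its erf-based decomposition 0.5 * x * (1 + erf(x / sqrt(2))); this is an illustration, not the exact pattern registered in this PR:

```python
from tvm.relay.dataflow_pattern import is_constant, is_op, wildcard

def make_dense_bias_gelu_pattern():
    """Match dense + bias followed by the erf-based GELU decomposition."""
    dense = is_op("nn.dense")(wildcard(), wildcard())
    biased = is_op("add")(dense, wildcard()) | is_op("nn.bias_add")(dense, wildcard())
    # GELU(x) = 0.5 * x * (1 + erf(x / sqrt(2)))
    erf = is_op("erf")(is_op("divide")(biased, is_constant()))
    half_x = is_op("multiply")(biased, is_constant())
    return is_op("multiply")(half_x, is_op("add")(erf, is_constant()))
```

A pattern like this lets the partitioner hand the whole chain to oneDNN, which then applies GELU as an eltwise_gelu_erf post-op of inner_product instead of running it as separate elementwise kernels.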

Why introduce an automatically packed dense and its altered weight layout:
Format tag::ab (a.k.a. tag::NC) is not the best weight format for the DNNL inner_product primitive, and always using it is a drawback of the current DNNL BYOC module; letting DNNL choose a packed (blocked) weight layout avoids this.

Which models benefit:
Dense-intensive models such as the BERT family.

With this patch, I benchmarked the inference performance of a vision transformer called PCPVT (https://arxiv.org/abs/2104.13840) on ICX-8352Y. Here are the numbers:

Configuration      32-core latency (dev)
baseline BYOC      11.45 ms
BYOC w/ patch       7.93 ms


@crazydemo (Contributor) commented:

I think this PR may have some overlap with @mengceng15's work.
There are also some lint issues, and a unit test (UT) is needed.

@mengceng15 (Contributor) commented:

> I think this PR may have some overlap with @mengceng15's work. There are also some lint issues, and a unit test (UT) is needed.

Yes, I worked on BYOC oneDNN dense several months ago.

Since this targets models including BERT, maybe batch_matmul support is also needed (via the matmul primitive of oneDNN)?

@billishyahao (Contributor, Author) commented:

> I think this PR may have some overlap with @mengceng15's work. There are also some lint issues, and a unit test (UT) is needed.

Hi @crazydemo, thanks for your suggestion. I have added some test cases to the unit-test module. It turned out that the end-of-line style of test_dnnl.py was CRLF (Windows mode); I reformatted it with dos2unix to make it consistent with the other DNNL files. Feel free to leave more comments here.

@billishyahao (Contributor, Author) commented:

> > I think this PR may have some overlap with @mengceng15's work. There are also some lint issues, and a unit test (UT) is needed.
>
> Yes, I worked on BYOC oneDNN dense several months ago.
>
> Since this targets models including BERT, maybe batch_matmul support is also needed (via the matmul primitive of oneDNN)?

I think so. I have implemented the matmul primitive in my local environment as well and may publish it soon.

@billishyahao force-pushed the enhance_dnnl_dense branch 2 times, most recently from 847f871 to 77beff9, on June 8, 2022 02:44.
@billishyahao changed the title from "[BYOC][DNNL] Enhance performance of DNNL BYOC dense operator" to "[BYOC][DNNL] Improve performance of DNNL BYOC dense operator" on Jun 8, 2022.
@billishyahao (Contributor, Author) commented:

Hi @masahi, @comaniac, @trevor-m, @mbaret, could you take a look at this PR? Thanks!

@linlifan commented Jun 9, 2022

LGTM

@billishyahao (Contributor, Author) commented:

@masahi Please take a look at this PR. Thanks!

@apeskov (Contributor) commented Jun 9, 2022

One important comment about performance, just to point it out.

In this patch you rely on the automatic detection of the proper layout inside dnnl_json_runtime. It works correctly and the dense primitive will use the optimal layout, but the weight reordering is executed on every inference call. This reordering significantly hurts performance (still better than before, but less than what is possible).

To avoid the per-call weight reordering, it should be done once during Init. For that you need to change the dense weight pattern from wildcard to is_constant.
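To illustrate the suggestion, a hedged sketch in the Relay pattern language (not the exact code in this PR): matching the weight with wildcard() means the runtime cannot assume it is fixed, while is_constant() lets the reorder be folded into Init.

```python
from tvm.relay.dataflow_pattern import is_constant, is_op, wildcard

# Weight as wildcard(): the weight tensor is only known at execution time,
# so a reorder into the blocked layout would run on every inference call.
dense_any_weight = is_op("nn.dense")(wildcard(), wildcard())

# Weight as is_constant(): the weight is bound at build time, so the runtime
# can reorder it once during Init and cache the packed copy.
dense_const_weight = is_op("nn.dense")(wildcard(), is_constant())
```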

@billishyahao (Contributor, Author) commented:

> One important comment about performance, just to point it out.
>
> In this patch you rely on the automatic detection of the proper layout inside dnnl_json_runtime. It works correctly and the dense primitive will use the optimal layout, but the weight reordering is executed on every inference call. This reordering significantly hurts performance (still better than before, but less than what is possible).
>
> To avoid the per-call weight reordering, it should be done once during Init. For that you need to change the dense weight pattern from wildcard to is_constant.

Hi @apeskov, here is a clip of the oneDNN verbose log:

```
onednn_verbose,exec,cpu,inner_product,brgemm:avx512_core,forward_inference,src_f32::blocked:ab:f0 wei_f32::blocked:AB16b64a:f0 bia_undef::undef::f0 dst_f32::blocked:ab:f0,attr-scratchpad:user ,,mb49ic512oc512,0.0400391
onednn_verbose,exec,cpu,inner_product,brgemm:avx512_core,forward_inference,src_f32::blocked:ab:f0 wei_f32::blocked:AB16b64a:f0 bia_undef::undef::f0 dst_f32::blocked:ab:f0,attr-scratchpad:user ,,mb49ic512oc1024,0.0717773
onednn_verbose,exec,cpu,inner_product,brgemm:avx512_core,forward_inference,src_f32::blocked:ab:f0 wei_f32::blocked:AB16b64a:f0 bia_f32::blocked:a:f0 dst_f32::blocked:ab:f0,attr-scratchpad:user ,,mb49ic512oc512,0.0351562
onednn_verbose,exec,cpu,inner_product,brgemm:avx512_core,forward_inference,src_f32::blocked:ab:f0 wei_f32::blocked:AB16b64a:f0 bia_f32::blocked:a:f0 dst_f32::blocked:ab:f0,attr-scratchpad:user attr-post-ops:eltwise_gelu_erf ,,mb49ic512oc2048,0.215088
onednn_verbose,exec,cpu,inner_product,brgemm:avx512_core,forward_inference,src_f32::blocked:ab:f0 wei_f32::blocked:AB16b64a:f0 bia_f32::blocked:a:f0 dst_f32::blocked:ab:f0,attr-scratchpad:user ,,mb49ic2048oc512,0.227051
onednn_verbose,exec,cpu,inner_product,brgemm:avx512_core,forward_inference,src_f32::blocked:ab:f0 wei_f32::blocked:AB16b64a:f0 bia_undef::undef::f0 dst_f32::blocked:ab:f0,attr-scratchpad:user ,,mb49ic512oc512,0.0339355
onednn_verbose,exec,cpu,inner_product,brgemm:avx512_core,forward_inference,src_f32::blocked:ab:f0 wei_f32::blocked:AB16b64a:f0 bia_undef::undef::f0 dst_f32::blocked:ab:f0,attr-scratchpad:user ,,mb49ic512oc1024,0.072998
onednn_verbose,exec,cpu,inner_product,brgemm:avx512_core,forward_inference,src_f32::blocked:ab:f0 wei_f32::blocked:AB16b64a:f0 bia_f32::blocked:a:f0 dst_f32::blocked:ab:f0,attr-scratchpad:user ,,mb49ic512oc512,0.0349121
onednn_verbose,exec,cpu,inner_product,brgemm:avx512_core,forward_inference,src_f32::blocked:ab:f0 wei_f32::blocked:AB16b64a:f0 bia_f32::blocked:a:f0 dst_f32::blocked:ab:f0,attr-scratchpad:user attr-post-ops:eltwise_gelu_erf ,,mb49ic512oc2048,0.226807
onednn_verbose,exec,cpu,inner_product,brgemm:avx512_core,forward_inference,src_f32::blocked:ab:f0 wei_f32::blocked:AB16b64a:f0 bia_f32::blocked:a:f0 dst_f32::blocked:ab:f0,attr-scratchpad:user ,,mb49ic2048oc512,0.231934
```

I don't observe any reorder primitive executed before or after the inner_product calls, so I think the current mechanism still works?

@apeskov (Contributor) commented Jun 9, 2022

@billishyahao Thanks for the verbose log and the quick response!

It looks like it works for you, but I'm a bit surprised: my previous experiments with BERT (the quantized version) showed that the reordering does happen... I will recheck.

@yangulei (Contributor) commented Jun 9, 2022

@billishyahao I remember that you introduced a mechanism to query the optimal layout and alter the op layout before the graph is consumed by DNNL, just like we do for Conv. Did you drop that and do the layout transform in the DNNL JSON runtime instead? If so, I agree with @apeskov and wonder why there is no layout transform of the weights.

@billishyahao (Contributor, Author) commented:

> @billishyahao I remember that you introduced a mechanism to query the optimal layout and alter the op layout before the graph is consumed by DNNL, just like we do for Conv. Did you drop that and do the layout transform in the DNNL JSON runtime instead? If so, I agree with @apeskov and wonder why there is no layout transform of the weights.

Yes, I reverted the change about altering the op layout after I saw PR #11345 from @apeskov.

@billishyahao (Contributor, Author) commented:

@apeskov @yangulei I addressed the above comments. Feel free to comment more.

@billishyahao (Contributor, Author) commented:

Hi @masahi, I want to check whether approval can be granted for this patch? Thanks in advance :-)

(A review thread on src/relay/backend/contrib/dnnl/codegen.cc was marked outdated and resolved.)
@yangulei (Contributor) commented:

> > @billishyahao I remember that you introduced a mechanism to query the optimal layout and alter the op layout before the graph is consumed by DNNL, just like we do for Conv. Did you drop that and do the layout transform in the DNNL JSON runtime instead? If so, I agree with @apeskov and wonder why there is no layout transform of the weights.
>
> Yes, I reverted the change about altering the op layout after I saw PR #11345 from @apeskov.

I think we need them both: query-and-alter layout can do the weight reordering at build time, ensuring optimal performance at run time.
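A rough sketch of that build-time half, assuming an AlterOpLayout-style hook similar to the x86 one that rewrites nn.dense into nn.contrib_dense_pack; the block size and layout string below are only illustrative, not a queried optimum:

```python
from tvm import relay

@relay.op.register_alter_op_layout("nn.dense", level=11)
def alter_dense_layout(attrs, inputs, tinfos, out_type):
    data, weight = inputs
    # Pack the (N, C) weight into a blocked layout once at build time;
    # "NC16n" is an illustrative blocked layout, not the queried-optimal one.
    packed_weight = relay.layout_transform(weight, "NC", "NC16n")
    return relay.nn.contrib_dense_pack(
        data, packed_weight, weight_layout="NC16n", units=None, out_dtype=out_type.dtype
    )
```

With the weight packed (and constant-folded) at build time, the DNNL runtime can consume it directly and no per-call reorder is needed.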

@billishyahao (Contributor, Author) commented:

> > > @billishyahao I remember that you introduced a mechanism to query the optimal layout and alter the op layout before the graph is consumed by DNNL, just like we do for Conv. Did you drop that and do the layout transform in the DNNL JSON runtime instead? If so, I agree with @apeskov and wonder why there is no layout transform of the weights.
> >
> > Yes, I reverted the change about altering the op layout after I saw PR #11345 from @apeskov.
>
> I think we need them both: query-and-alter layout can do the weight reordering at build time, ensuring optimal performance at run time.

Sure, @yangulei. Let me add that code in the following change.

@billishyahao (Contributor, Author) commented:

@masahi Thanks for the approval. Shall we go ahead and merge this PR?

@masahi merged commit e8712a9 into apache:main on Jun 10, 2022.
juda pushed a commit to juda/tvm that referenced this pull request on Jun 21, 2022 (…11513). Commit message:

* Enhance dnnl byoc dense operators performance by 1) introducing gelu fusion and 2) introducing alter dense weight layout.

* fix lint issue

* add unittest for dense pack

* Make code compatible after introducing TensorRequisite(PR-11345)

* Fix comments & refactor code

* Fix lint

* Fix partition graph unittest case

* Fix comments

* Fix comments

* Fix lint