Add MXNet Ops for fast multihead attention #16408

Caenorst · 2019-10-09T16:59:39Z

Description

Add new optimized Ops for Multihead attention

Checklist

Essentials

Please feel free to remove inapplicable items for your PR.

Changes are complete (i.e. I finished coding on this PR)
All changes have test coverage:
Unit tests are added for small changes to verify correctness (e.g. adding a new operator)
Code is well-documented:
For new C++ functions in header files, their functionalities and arguments are documented.
To the best of my knowledge, examples are either not affected by this change, or have been fixed to be compatible with this change

Changes

Add 4 Ops: matmul(K, Q) and matmul(attention_weights, V), each for both self-attention and encoder-decoder attention
Add unit tests for those Ops

Comments

https://github.com/Caenorst/gluon-nlp/tree/fast_mha shows an example of integration in BERT; it will not be PRed to gluon-nlp, as it breaks gluon-nlp outside of the BERT use case.
Those Ops require a different layout, (sequence, batch, encoding), except for the masked softmax / dropout.
Those Ops change the ordering of the projection weights, which means a BERT pretrained without those Ops needs to have its weights processed as in Caenorst@e987614#diff-4758fb9329d438de2836db2634a8f5f7R2505-R2519 in order to use those Ops.
The argument bwd_ignore_zero_init allows a further speedup and reduces memory consumption, but it only gives correct results with MXNET_EXEC_ENABLE_ADDTO set to 1. It is also a dirty trick, as it is actually not "adding to" but initializing: it relies on the fact that the two Ops' inputs are complete (not overlapping, and together using the whole tensor) despite using the same tensor.
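As a rough illustration of the weight reordering mentioned above, here is a minimal pure-Python sketch. It assumes a head-major ordering in which each head contributes its query rows, then its key rows, then its value rows; the function name `interleave_qkv_rows` and the toy row labels are hypothetical, and the linked commit remains the reference for the real conversion.

```python
def interleave_qkv_rows(w_q, w_k, w_v, num_heads):
    """Reorder the rows of three projection matrices into one interleaved matrix.

    Each matrix is a list of rows with num_heads * head_dim rows.
    Assumed output order, per head: query rows, key rows, value rows.
    """
    rows = len(w_q)
    assert len(w_k) == rows and len(w_v) == rows
    assert rows % num_heads == 0
    head_dim = rows // num_heads
    interleaved = []
    for head in range(num_heads):
        block = slice(head * head_dim, (head + 1) * head_dim)
        interleaved.extend(w_q[block])
        interleaved.extend(w_k[block])
        interleaved.extend(w_v[block])
    return interleaved


# Toy weights: 2 heads, head_dim 2, one labelled column per row.
w_q = [["q0"], ["q1"], ["q2"], ["q3"]]
w_k = [["k0"], ["k1"], ["k2"], ["k3"]]
w_v = [["v0"], ["v1"], ["v2"], ["v3"]]
interleaved = interleave_qkv_rows(w_q, w_k, w_v, num_heads=2)
row_order = [row[0] for row in interleaved]
```

With this ordering, a single projection produces an output whose last axis has size num_heads * head_dim * 3, which matches the input layout the Ops document.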

sxjscience · 2019-10-09T18:20:45Z

src/operator/contrib/transformer.cu

+    const float alpha               = 1.f;
+
+    if (req[0] != kNullOp) {
+      if (req[0] == kWriteTo && !params.bwd_ignore_zero_init) {


Why should we add this flag bwd_ignore_zero_init?

I do not fully understand the rationale behind this flag. I think we can determine whether we need to accumulate the gradient in place by checking whether req[0] == kAddTo.

As I put in the description, this is a hack. If you use the Ops within a multihead attention, the input tensor is an interleaving of the projected Q, K and V; one Op does the backward pass on the QK part, and the other does the backward pass on the V part (and the attention weights). Because of this, you never actually need to "accumulate" the gradients between both Ops. It's a hack that is only enabled if you set the flag bwd_ignore_zero_init. If you intend to use this Op outside of the official multihead attention, don't enable this flag and the behavior will be as expected.
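The argument that no accumulation is needed can be checked on a toy buffer. The sketch below is pure Python and only models index bookkeeping, not the actual kernels; it assumes the per-head [q, k, v] ordering of the last axis, and `slot_indices` is a hypothetical helper.

```python
def slot_indices(num_heads, head_dim, parts):
    """Indices of the interleaved last axis covered by the given parts (0=q, 1=k, 2=v)."""
    indices = []
    for head in range(num_heads):
        for part in parts:
            start = (head * 3 + part) * head_dim
            indices.extend(range(start, start + head_dim))
    return indices


num_heads, head_dim = 2, 2
size = num_heads * 3 * head_dim
qk_idx = slot_indices(num_heads, head_dim, parts=(0, 1))  # written by the QK backward
v_idx = slot_indices(num_heads, head_dim, parts=(2,))     # written by the V backward

# Stand-in gradient values for each slot.
grad_qk = {i: float(i) + 1.0 for i in qk_idx}
grad_v = {i: -(float(i) + 1.0) for i in v_idx}

# Reference behavior: zero the buffer, then let both Ops accumulate into it.
accumulated = [0.0] * size
for i, g in grad_qk.items():
    accumulated[i] += g
for i, g in grad_v.items():
    accumulated[i] += g

# Shortcut: skip the zero fill and let both Ops overwrite their own slots.
written = [float("nan")] * size
for i, g in grad_qk.items():
    written[i] = g
for i, g in grad_v.items():
    written[i] = g

slots_are_disjoint = set(qk_idx).isdisjoint(v_idx)
slots_cover_buffer = sorted(qk_idx + v_idx) == list(range(size))
```

The two results agree only because the slots are disjoint and together cover the whole buffer; if either property fails, plain writes are no longer equivalent to accumulation.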

src/operator/contrib/transformer.cu

.gitmodules

src/operator/contrib/transformer.cu

Makefile

src/operator/contrib/transformer.cc

src/operator/contrib/transformer.cu

…ilation flags

eric-haibin-lin · 2019-10-17T04:16:08Z

src/operator/contrib/transformer.cc

+
+the input must be a single tensor of interleaved projections
+of queries, keys and values following the layout:
+(seq_length, batch_size, num_heads * head_dim * 3)


Does this require the weights of k, q, v to be concatenated together? LAMB calculates the weight and grad norm per parameter, so for the gradient update we cannot use the weights stored in concatenated form, I assume?

This is a valid concern. I don't know if it matters that much (we should track the coefficients between the weights of q, k, v to see how much they differ). But for correctness we can integrate the concatenation in the graph / forward pass. Let me verify the impact on training.

Adding the concatenation reduces the speedup due to multihead attention by about 20%. I think we can look into an improvement, but meanwhile that is still a speedup. I would encourage an analysis of how the LAMB coefficients differ within multihead attention blocks; maybe applying LAMB directly on the concatenation of weights would be fine 🤷‍♂️ .
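To see why the concatenated form could change the LAMB update, here is a deliberately simplified sketch. It only computes a ratio of L2 norms (weight norm over update norm) and ignores weight decay, clipping and the Adam moments; the numbers are made up and `trust_ratio` is an illustrative helper, not MXNet's optimizer code.

```python
import math


def l2_norm(values):
    return math.sqrt(sum(v * v for v in values))


def trust_ratio(weights, update):
    """Simplified layer-wise scaling factor: ||w|| / ||update||."""
    w_norm, u_norm = l2_norm(weights), l2_norm(update)
    return w_norm / u_norm if w_norm > 0 and u_norm > 0 else 1.0


# Made-up weights with very different magnitudes and identical updates.
weights = {"q": [3.0, 4.0], "k": [0.3, 0.4], "v": [6.0, 8.0]}
updates = {"q": [0.1, 0.0], "k": [0.1, 0.0], "v": [0.1, 0.0]}

# One ratio per parameter, as when q, k, v are stored separately.
per_param = {name: trust_ratio(weights[name], updates[name]) for name in weights}

# A single ratio when q, k, v are stored as one concatenated parameter.
concat_w = weights["q"] + weights["k"] + weights["v"]
concat_u = updates["q"] + updates["k"] + updates["v"]
fused = trust_ratio(concat_w, concat_u)
```

In this toy case the single fused ratio lands between the smallest and largest per-parameter ratios, so the small-norm block takes a larger step than it would on its own. How much that matters in practice is the open question raised above.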

src/operator/contrib/transformer.cc

Caenorst · 2019-10-18T21:09:22Z

I had to remove some tests on my Ops; the behaviour of bwd_ignore_zero_init cannot be tested, as we can't enable kAddTo yet.

eric-haibin-lin · 2019-10-21T20:41:56Z

@aaronmarkham is the website preview functionality still working after the website upgrade? I cannot see the preview of this PR: https://mxnet-ci-doc.s3-accelerate.dualstack.amazonaws.com/PR-16408/11/index.html

eric-haibin-lin

Can we move the op to _contrib and explicitly say in the doc that this only supports GPU? Thanks

sxjscience · 2019-10-22T23:09:13Z

tests/python/gpu/test_operator_gpu.py

+    check_multihead_attention_selfatt(bwd_ignore_zero_init=False)
+    #os.environ['MXNET_EXEC_ENABLE_ADDTO'] = '1'
+    #check_multihead_attention_selfatt(bwd_ignore_zero_init=False)
+    #check_multihead_attention_selfatt(bwd_ignore_zero_init=True)


Why are these lines commented out?

My comment #16408 (comment) refers to this modification.

Apparently kAddTo is not working yet on this repository, and the behavior with bwd_ignore_zero_init=True is only correct if kAddTo is enabled. I can add a short TODO comment like "to uncomment once kAddTo feature is working".

Caenorst · 2019-11-01T23:42:21Z

@TaoLv, please note that, as @eric-haibin-lin indicated, there is no impact on Gluon, and I removed the flag, so there is no reason to block this PR.

TaoLv · 2019-11-02T03:27:23Z

Thank you @eric-haibin-lin for clarifying the impact on MXNet Gluon. My change request is not only about that, but also asks for a detailed proposal about these new operators. Although they will live under the contrib op category, I have not seen a clear definition of the scope of contrib ops in MXNet.
To make it clear, I hope the items below can be described in the proposal:

What's the current design? APIs, features, formulas, limitations, performance, ...
Why don't cuDNN MHA, dynamic operators, or subgraphs help here?
Do we need a call for contributions for a CPU implementation?
How long will these operators exist? If I understand correctly, once they're in, they are unlikely to be removed before the next major release. So a contrib op is not a trivial thing.
Will they be replaced with other implementations? with the same operator interfaces? If not, how about backward compatibility?
Do we encourage users and downstream projects to use them?
Known issues or further plan to improve?

I hope the proposal and more discussions can help other developers as well as users to understand these operators better and also make MXNet development more open.
I don't want to be hostile here, so it's not necessary for you to finish the proposal before merging this PR.

aaronmarkham · 2019-11-05T00:45:03Z

@aaronmarkham is the website preview functionality still working after the website upgrade? I cannot see the preview of this PR: https://mxnet-ci-doc.s3-accelerate.dualstack.amazonaws.com/PR-16408/11/index.html

No, a variety of features of the new site didn't work on S3.
To preview, you need to follow the directions on the wiki, or use the devmenu features I added in this PR: #16514

Caenorst · 2019-11-06T19:10:00Z

@TaoLv:

The current design is two separate Ops, each representing one matrix multiplication part of the multihead attention, for both the self-attention and encoder-decoder cases. The performance depends on the model implementation / hardware. As an example, on our internal containers we see about a 15% improvement on BERT training. The Ops force the layout to be changed to (sequence_length, batch) and the weights of the input linear projection to be interleaved.
cuDNN MHA Ops are not as fast as those Ops yet. The subgraph API could be used to automatically fuse to those Ops, but the implementation is not obvious, as the two Ops are inter-dependent. Those Ops require modifying the layout, which we can't do on the whole model without user involvement yet.
I don't think a CPU implementation is important for now. Those Ops have been designed for specific performance improvements on GPUs. As there are no plans to have them in Gluon-NLP, those Ops are here to give users the option of a non-negligible speedup on GPU if they need it. We do provide an example implementation at https://github.com/Caenorst/gluon-nlp/tree/fast_mha, but don't intend to present it as a "standard" implementation of multihead attention.
We do plan to send a design proposal to dev@ for the cuDNN fused multihead attention integration we're prototyping. For that effort, we are looking for contributions from the CPU side to make the feature better.
We don't know; it depends on future performance / ergonomic improvements and user feedback.
We don't know; it depends on future performance / ergonomic improvements and user feedback.
NVIDIA would encourage and help users who are looking for a very efficient implementation of multihead attention on GPU. We wouldn't encourage the usage for somebody who's looking for "A BERT implementation" for both CPU and GPU.
Unfortunately, because of the lack of full kAddTo functionality, there is a cudaMemset that could be avoided; we are looking forward to completing this functionality. We have several options for improvement, but we want to release some tests internally before making any plans. We would like to have a single Op for the whole MHA, but that would force the _masked_softmax + dropout inside the Op and remove some of the flexibility of the current set of Ops, hence our current decision to split it into two Ops.
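For readers who want to see the two-Op split on paper, below is a minimal pure-Python reference of the self-attention case. It is a sketch under several assumptions that the thread does not fully spell out: the input has layout (seq_length, batch, num_heads * head_dim * 3) with a per-head [q, k, v] ordering, the first matmul applies a 1/sqrt(head_dim) scale, and the softmax sits between the two matmuls. The function names are illustrative and are not the names of the Ops added by this PR.

```python
import math


def head_slice(vec, head, part, head_dim):
    """Slice one head's q (part=0), k (part=1) or v (part=2) out of an interleaved vector."""
    start = (head * 3 + part) * head_dim
    return vec[start:start + head_dim]


def matmul_qk(qkv, num_heads):
    """First matmul: scaled Q x K^T, returned as (batch * num_heads, seq, seq)."""
    seq_len, batch = len(qkv), len(qkv[0])
    head_dim = len(qkv[0][0]) // (3 * num_heads)
    scale = 1.0 / math.sqrt(head_dim)
    scores = []
    for b in range(batch):
        for h in range(num_heads):
            q = [head_slice(qkv[t][b], h, 0, head_dim) for t in range(seq_len)]
            k = [head_slice(qkv[t][b], h, 1, head_dim) for t in range(seq_len)]
            scores.append([[scale * sum(x * y for x, y in zip(qi, kj)) for kj in k]
                           for qi in q])
    return scores


def softmax_rows(matrix):
    """Row-wise softmax, applied outside of the two matmuls."""
    result = []
    for row in matrix:
        peak = max(row)
        exps = [math.exp(x - peak) for x in row]
        total = sum(exps)
        result.append([e / total for e in exps])
    return result


def matmul_valatt(qkv, attention, num_heads):
    """Second matmul: attention x V, returned as (seq, batch, num_heads * head_dim)."""
    seq_len, batch = len(qkv), len(qkv[0])
    head_dim = len(qkv[0][0]) // (3 * num_heads)
    out = [[[] for _ in range(batch)] for _ in range(seq_len)]
    for b in range(batch):
        for h in range(num_heads):
            v = [head_slice(qkv[t][b], h, 2, head_dim) for t in range(seq_len)]
            att = attention[b * num_heads + h]
            for t in range(seq_len):
                out[t][b].extend(
                    sum(att[t][s] * v[s][d] for s in range(seq_len))
                    for d in range(head_dim))
    return out


# Toy input: seq_len=2, batch=1, num_heads=2, head_dim=1.
# Last axis order: [q_h0, k_h0, v_h0, q_h1, k_h1, v_h1].
qkv = [
    [[1.0, 0.5, 7.0, 2.0, 1.0, 9.0]],
    [[0.0, 2.0, 7.0, 1.0, 3.0, 9.0]],
]
scores = matmul_qk(qkv, num_heads=2)
attention = [softmax_rows(m) for m in scores]
context = matmul_valatt(qkv, attention, num_heads=2)
```

Keeping the softmax (and, in training, dropout) outside of the two matmuls is what preserves the flexibility mentioned in the last answer; a single fused Op would have to own those steps as well.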

DickJC123

Thanks @Caenorst for these clarifications. Thanks also for your responses to the other reviewers, e.g. removing the bwd_ignore_zero_init flag which was the source of some confusion. At this point, the new operator definition is quite clean and a totally appropriate addition to the contrib area. With the performance improvements this PR brings, I think users will be eager to experiment with these operators, with the understanding of course that contrib operator definition is not guaranteed to be stable or be retained. In fact, as you say, the final MXNet definition of multi-head attention operators will come after a thoughtful developer discussion that takes into account the evolving cpu and cudnn implementations.

This PR LGTM.

TaoLv · 2019-11-07T01:51:34Z

Thanks for your response @Caenorst. Looking forward to your general proposal for cuDNN MHA integration. Now I'm withdrawing the change request.


DickJC123 · 2019-11-07T02:00:34Z

Thanks Tao. Looking forward to working with you and others on MXNet's MHA definition.

* add MXNet Ops for fast multihead attention
* add cutlass as 3rdparty dependency
* add cutlass to compilation flags
* remove all cutlass stuff
* add better error message and description and remove cutlass from compilation flags
* change credit for the approach since the code have changed
* fix typos
* correct another typo
* Add all the cuda/cublas helper functions
* remove tests using kAddTo
* only use cublasStridedBatchedGemm if CUDA >= 9.1
* add equivalent mxnet code in description of mha ops
* remove a wrong copy-paste
* add _contrib for namespace and add GPU only on description
* add warning in bwd_ignore_zero_init description, also test with fp32
* add error return if bwd_ignore_zero_init is used without MXNET_EXEC_ENABLE_ADDTO
* remove std::move for clang
* remove bwd_ignore_zero_init flag
* remove bwd_ignore_zero_init in test_operator_gpu.py
* fix typo
* fix another typo

* support mixed-precision true_divide (#16711) * [MKLDNN] use dim_t instead of int in slice/transpose operators (#16737) * use dim_t instead of int * fix same issue in pooling * rebase code * trigger CI * Add MXNet Ops for fast multihead attention (#16408) * add MXNet Ops for fast multihead attention * add cutlass as 3rdparty dependency * add cutlass to compilation flags * remove all cutlass stuff * add better error message and description and remove cutlass from compilation flags * change credit for the approach since the code have changed * fix typos * correct another typo * Add all the cuda/cublas helper functions * remove tests using kAddTo * only use cublasStridedBatchedGemm if CUDA >= 9.1 * add equivalent mxnet code in description of mha ops * remove a wrong copy-paste * add _contrib for namespace and add GPU only on description * add warning in bwd_ignore_zero_init description, also test with fp32 * add error return if bwd_ignore_zero_init is used without MXNET_EXEC_ENABLE_ADDTO * remove std::move for clang * remove bwd_ignore_zero_init flag * remove bwd_ignore_zero_init in test_operator_gpu.py * fix typo * fix another typo * Removed unrelated test

szha · 2020-01-10T02:27:38Z

tests/python/gpu/test_operator_gpu.py

+        assert(grads_orig[k].shape == grads_opti[k].shape)
+        assert_allclose(grads_orig[k], grads_opti[k], rtol=1e-2, atol=1e-3)
+
+def test_multihead_attention_selfatt():


This test should have been marked with a minimum cuDNN/CUDA version requirement, since the feature is not available in earlier versions.

These tests have been moved to unittests/test_operator.py. Please see the discussion here: #17138 (comment).
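One generic way to express such a version requirement is a skip decorator. The sketch below uses only the standard library `unittest` module. The 9.1 threshold is taken from the commit "only use cublasStridedBatchedGemm if CUDA >= 9.1"; `detected_cuda_version` and the `EXAMPLE_CUDA_VERSION` environment variable are hypothetical stand-ins, since how the CUDA version should be queried from MXNet is not shown in this thread.

```python
import os
import unittest

MIN_CUDA = (9, 1)  # taken from the commit list of this PR


def parse_version(text):
    """Turn a string such as '10.2' into a comparable tuple such as (10, 2)."""
    return tuple(int(part) for part in text.split("."))


def detected_cuda_version():
    # Hypothetical lookup: a real test suite would ask the framework or driver.
    return parse_version(os.environ.get("EXAMPLE_CUDA_VERSION", "0.0"))


class TestInterleavedAttention(unittest.TestCase):
    @unittest.skipIf(detected_cuda_version() < MIN_CUDA,
                     "interleaved attention Ops need CUDA >= 9.1")
    def test_selfatt(self):
        # Placeholder body: the real numerical checks would go here.
        self.assertTrue(True)
```

The skip reason then shows up in the test report, which makes it visible when a CI worker silently lacks the required toolkit.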

* Add cached op threadsafe version with corresponding C APIs, CPP Package changes, CI changes and tests * Fix download cmd in runtime_functions * Add CI changes * Add stage Fix indentation * Fix lint * Change to DEFAULT for C API * Fix mxnet_unit_tests path * export correct LD_LIBRARY_PATH * Add cpp include dirs * Build test with USE_CPP_PACKAGE * Add cached op threadsafe version with corresponding C APIs, CPP Package changes, CI changes and tests * Fix download cmd in runtime_functions * Merge * change mkldnn lib name * Add static_alloc, static_Shape support * Address review comments * Make GetCachedOpThreadSafeState similar to cached_op * Address review comments: comments for locking strategy * multithreaded inference tutorial * [Estimator] handle composite metrics in estimator (apache#16676) * handle composite metrics in estimator * fix composite metric case in handlers * remove unused import * [Estimator] refactor estimator to allow overriding evaluate/fit of a batch (apache#16678) * refactor estimator to allow overriding evaluate/fit of a batch * add doc to explain call structure and how to override * fix and doc * Pointwise fusion for GPU (apache#15167) * Beginning of RTC of pointwise ops * Code generation from the given JSON * add initial simple_partition_pass and use it for pointwise fusion * fix the fusion, use a symbol.Copy() at the beginning of binding function, use the name of input nodes in the cuda code * Fixes * Adding support for attribute inference for backward nodes when fusing * keep proper input ordering for fused Op * instantiate the indexed_graph before starting the subgraph replacement, return a new graph to reset the indexed_graph * Fuse backward * fix ordering of subgraph node inputs using subgraph topological ordering instead of main graph topological ordering, add tvm.patch * excluse forward node fusion during the fusion of the nodes in the backward graph * Dealing with fused backward nodes inferattr * use subgraph.indexed_graph() instead 
of main for _FusedOpHelper nodes node_id, invert control_deps loop to modify topology of subgraph before calling its indexed_graph(), check that all node of the first DFSVisit are actually in the subgraph * Adding support for other reqs in codegen * Fix * Cleaning * Change the TVM submodule * More cleaning * Making linter happy * Do fusion only if default context is GPU * Fixes for tests Add powerscalar and rpowerscalar, fix return type of zero and one Cleaning, fixing lint Go back to proper TVM submodule * Fix the TVM commit * Fix lint * Guard fusion with MXNET_USE_CUDA * Fix * Fix clang-tidy * Add erf and erfinv backward * Gluon support for fusion * Cleaning * Cleaning and allow shape/type change in FusedOp * Fixing Gluon bugs * Fixing after rebase * Fixing race condition and guarding against races when using NVRTC * Cleaning and renaming FusedOp to _FusedOp * Going easy on Windows compiler * Disable fusion on Windows for now * Refactor InferAttr and InferShapeAttr * Added slice and half2 support to FusedOp * Fix lint errors * Added multiple types support for vector loading/storing * add slice fusion when it's at the beginning of subgraphs * Removed constant ndim assumption in fused op * Fix memory alignment issue in slice for FusedOp * Fixes * Fix lint errors * Do not include cuda_fp16.h * Refactor fused op op lists * Make linter happy * Changes from review * Fixes after rebase * Expand FusedOp support for slice * Fix for fp16 _zeros and _ones * Fix * Moving aux functions to unnamed namespace and detail namespace -> fusion namespace * Disabling fusion if it alters topological order of inputs * Print code only when env variable is set * Fix * Fix lint and 2 tests that specify the same names for multiple inputs * Fixes from review and disabling fusion of slice with non-default step * Add amp_cast to fusion, fixes * Add amp_multicast and its backward to the list of support ops * Apply wording suggestions from code review Co-Authored-By: Aaron Markham 
<markhama@amazon.com> * Apply wording suggestions from code review Co-Authored-By: Aaron Markham <markhama@amazon.com> * Make clearer comment * Adding punctuation and capitalization to \brief descriptions * Fix * Fix * Add backward_cast to fusion * Adding unittests for fusion. Fix for erfinv_grad * Adding slice ops and add_n to tests * Fixes from review * Setting inplace option * Fix lint * Storing double in half * Retrigger CI * Slight relaxing of the relative tolerance in the test * Move the env variable check to the end * Fix a race condition between InferShape and scheduled Forward * Fix flakey test_fusion test involving fp32 erfinv op. * Fix from review * Added broadcast_like and slice_like to fused op * Minor fix and cleanup * Added negative axis support in slice_axis, temporarily disabled fusion of slice_like and broadcast_like * Added axes support to slice_like * Added axis support to broadcast_like * Add fast_load_slice function to fused op code * Added runtime switch for choosing fast and slow slice kernel * Fix lint and warning * Going easy on Windows compiler (again) * Fix slice_like * Debug broadcast_like fusion * Fix lint * Fix lint * Trigger CI * Get rid of the initializer list * Fix backward calls with different gradient type * avoid cycle when adding node specific for inputs of subgraph for pointwise fusion * Fix lint * Add namespace to the fusion implementations * Set launch bounds on the fused kernel * Fix NumPy tests * Test showcasing an issue fixed in PR apache#16553 * Cast scalarts to FP32 and perform (a*1.0/b) instead of (a/b) Fix lint errors Fix lint * Fix a bug in cycle detection for inputs only op in pointwise fusion * Add comments to simple_partition_pass.h file * fix install dir (apache#16690) * [numpy] add numpy operator : append (apache#16564) * add operator : append ; fix op concatenate when axis = None * pylint disable remove mistake disable pylint * Initializer.__eq__ (apache#16680) * fix binary dependencies in CD and nightly 
(apache#16693) * [MKL-DNN] Add mxnet mkldnn cmake tutorial (apache#16688) * add mxnet mkldnn cmake instruction * imporve doc * OMP->OpenMP * Revert "[MKLDNN]Fix reorder2default (apache#16602)" (apache#16697) This reverts commit dd4eaf5. * [Estimator] refactor estimator and clarify docs (apache#16694) * refactor estimator and clarify docs * fix info message and test * clean up after releasing logging handler * Eliminate common expressions (apache#15657) * Eliminate common expressions from a graph * Guarding against optimizing out stateful ops and ops that require resource * Fix lint * Added THasDeterministicOutput to multiple ops * DDebug eliminate common expr * Added test * Expose get_optimized_symbol * Fix * Fix 2 * Add doc to the Python call * Add env var MXNET_ELIMINATE_COMMON_EXPR, default true * Add comments, improve readability of eliminate_common_expr_pass.cc * Expand testing * Lower priority of THasDeterministicOutput attr for equal Node test * Change mx.gpu() to mx.cpu() in tests * Skip CSE test on Windows (as env variable setting during test does not work there) * Add missing import sys * Add missing import logging * Backport of apache#16711, apache#16737, apache#16408 to 1.6 branch (apache#16763) * support mixed-precision true_divide (apache#16711) * [MKLDNN] use dim_t instead of int in slice/transpose operators (apache#16737) * use dim_t instead of int * fix same issue in pooling * rebase code * trigger CI * Add MXNet Ops for fast multihead attention (apache#16408) * add MXNet Ops for fast multihead attention * add cutlass as 3rdparty dependency * add cutlass to compilation flags * remove all cutlass stuff * add better error message and description and remove cutlass from compilation flags * change credit for the approach since the code have changed * fix typos * correct another typo * Add all the cuda/cublas helper functions * remove tests using kAddTo * only use cublasStridedBatchedGemm if CUDA >= 9.1 * add equivalent mxnet code in description of mha 
ops * remove a wrong copy-paste * add _contrib for namespace and add GPU only on description * add warning in bwd_ignore_zero_init description, also test with fp32 * add error return if bwd_ignore_zero_init is used without MXNET_EXEC_ENABLE_ADDTO * remove std::move for clang * remove bwd_ignore_zero_init flag * remove bwd_ignore_zero_init in test_operator_gpu.py * fix typo * fix another typo * Removed unrelated test * Add example and documentation for multi threaded inference * Add LICENSE * Add get_model.py * Add license for README * Refactor cached op and cached op threadsafe * Add limitation * Add tests for naive engine * Add latest test changes * Thread Safety tests in NaiveEngine mode * Thread Safety tests update * Update thread safety tests, add unsupported use cases * Changes to doc and refactor * Fix todo owner, indentation and mx_float->float * Refactor cached op code, remove num_threads arg from example * Fix lint * Fix warning * Add back cython, required for unix-gpu build * Fix for windows * Add bulking support for thread safe cached op version * Add support for subgraph testing * import mxnet before calling get_backend_symbol * Fix symbol json name * Refactor DynamicForward * Add comments * Add DMLC_ATTRIBUTE_UNUSED * Fix use_naive_run issue * Fix lint * Revert unittest_cpp to old test since it doesnt test thread safety * Fix doc Co-authored-by: Sheng Zha <szha@users.noreply.github.com> Co-authored-by: Przemyslaw Tredak <ptrendx@gmail.com> Co-authored-by: Tao Lv <tao.a.lv@intel.com> Co-authored-by: JiangZhaoh <54654391+JiangZhaoh@users.noreply.github.com> Co-authored-by: Leonard Lausen <leonard@lausen.nl> Co-authored-by: Xinyu Chen <xinyu1.chen@intel.com> Co-authored-by: Zhennan Qin <zhennan.qin@intel.com>

leezu · 2020-02-13T22:38:35Z

BTW, cublasGemmStridedBatchedEx, which is used in this PR, is broken in CUDA 10.1, which will cause crashes on p2 instances. It seems to be fixed in CUDA 10.2 (as per the release notes).

larroy · 2020-02-14T18:37:10Z

So it's related to CUDA and not the arch?

leezu · 2020-02-14T18:45:56Z

There are two separate problems. cublasGemmStridedBatchedEx is buggy and has fixes in 10.2. But cublasGemmStridedBatchedEx is not supported in the first place on the 3.5 arch, and that's a bug in MXNet.

cublasGemmBatchedEx is only supported for GPUs with a compute capability of 5.0 or greater. Fixes a bug in #16408
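The two problems are independent conditions, which a small decision helper makes explicit. This is a hypothetical Python sketch, not the logic of the actual fix: the function name is invented, the 5.0 threshold comes from the commit message above, and treating CUDA 10.1 as the only affected runtime is an assumption based solely on the comments in this thread.

```python
def batched_gemm_ex_usable(compute_capability, cuda_version):
    """Return True if the fast batched GEMM path looks safe to take.

    Both arguments are (major, minor) tuples.
    """
    # Problem 1: the GPU architecture must support the call at all.
    arch_supported = compute_capability >= (5, 0)
    # Problem 2 (assumption): CUDA 10.1 is reported as buggy, 10.2 as fixed.
    runtime_trusted = cuda_version != (10, 1)
    return arch_supported and runtime_trusted


old_arch_fixed_cuda = batched_gemm_ex_usable((3, 7), (10, 2))
new_arch_buggy_cuda = batched_gemm_ex_usable((7, 0), (10, 1))
new_arch_fixed_cuda = batched_gemm_ex_usable((7, 0), (10, 2))
```

An older GPU stays on the fallback path even with a fixed CUDA release, which is the point made in the reply above: upgrading CUDA alone does not address the architecture limitation.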

* Fix transformer.cu interleaved matmul for cuda arch < 5 (#17596) cublasGemmBatchedEx is only supported for GPU with architecture capabilities equal or greater than 5.0. Fixes a bug in #16408 * pin Markdown version to 3.1 in Julia doc build (#17549) * pin Sphinx due to autodocsumm issue with v4.2.0 (#17561) * pin python dependencies (#17556) * [CI] Fix static build pipeline (#17474) * 1.5.x CI fixes (#17426) * Fix numpy decorator * Workaround pytest-dev/pytest#5903 * Disable pylint warnings * Fix Edge build * Fix numpy decorator on Centos * Follow redirects when downloading apache-maven-3.3.9-bin.tar.gz Co-authored-by: Hao Jin <hjjn.amzn@gmail.com> Co-authored-by: Aaron Markham <markhama@amazon.com>

* Add cached op threadsafe version with corresponding C APIs, CPP Package changes, CI changes and tests * Fix download cmd in runtime_functions * Add CI changes * Add stage Fix indentation * Fix lint * Change to DEFAULT for C API * Fix mxnet_unit_tests path * export correct LD_LIBRARY_PATH * Add cpp include dirs * Build test with USE_CPP_PACKAGE * Add cached op threadsafe version with corresponding C APIs, CPP Package changes, CI changes and tests * Fix download cmd in runtime_functions * Merge * change mkldnn lib name * Add static_alloc, static_Shape support * Address review comments * Make GetCachedOpThreadSafeState similar to cached_op * Address review comments: comments for locking strategy * multithreaded inference tutorial * [Estimator] handle composite metrics in estimator (apache#16676) * handle composite metrics in estimator * fix composite metric case in handlers * remove unused import * [Estimator] refactor estimator to allow overriding evaluate/fit of a batch (apache#16678) * refactor estimator to allow overriding evaluate/fit of a batch * add doc to explain call structure and how to override * fix and doc * Pointwise fusion for GPU (apache#15167) * Beginning of RTC of pointwise ops * Code generation from the given JSON * add initial simple_partition_pass and use it for pointwise fusion * fix the fusion, use a symbol.Copy() at the beginning of binding function, use the name of input nodes in the cuda code * Fixes * Adding support for attribute inference for backward nodes when fusing * keep proper input ordering for fused Op * instantiate the indexed_graph before starting the subgraph replacement, return a new graph to reset the indexed_graph * Fuse backward * fix ordering of subgraph node inputs using subgraph topological ordering instead of main graph topological ordering, add tvm.patch * excluse forward node fusion during the fusion of the nodes in the backward graph * Dealing with fused backward nodes inferattr * use subgraph.indexed_graph() instead 
of main for _FusedOpHelper nodes node_id, invert control_deps loop to modify topology of subgraph before calling its indexed_graph(), check that all node of the first DFSVisit are actually in the subgraph * Adding support for other reqs in codegen * Fix * Cleaning * Change the TVM submodule * More cleaning * Making linter happy * Do fusion only if default context is GPU * Fixes for tests Add powerscalar and rpowerscalar, fix return type of zero and one Cleaning, fixing lint Go back to proper TVM submodule * Fix the TVM commit * Fix lint * Guard fusion with MXNET_USE_CUDA * Fix * Fix clang-tidy * Add erf and erfinv backward * Gluon support for fusion * Cleaning * Cleaning and allow shape/type change in FusedOp * Fixing Gluon bugs * Fixing after rebase * Fixing race condition and guarding against races when using NVRTC * Cleaning and renaming FusedOp to _FusedOp * Going easy on Windows compiler * Disable fusion on Windows for now * Refactor InferAttr and InferShapeAttr * Added slice and half2 support to FusedOp * Fix lint errors * Added multiple types support for vector loading/storing * add slice fusion when it's at the beginning of subgraphs * Removed constant ndim assumption in fused op * Fix memory alignment issue in slice for FusedOp * Fixes * Fix lint errors * Do not include cuda_fp16.h * Refactor fused op op lists * Make linter happy * Changes from review * Fixes after rebase * Expand FusedOp support for slice * Fix for fp16 _zeros and _ones * Fix * Moving aux functions to unnamed namespace and detail namespace -> fusion namespace * Disabling fusion if it alters topological order of inputs * Print code only when env variable is set * Fix * Fix lint and 2 tests that specify the same names for multiple inputs * Fixes from review and disabling fusion of slice with non-default step * Add amp_cast to fusion, fixes * Add amp_multicast and its backward to the list of support ops * Apply wording suggestions from code review Co-Authored-By: Aaron Markham 
<markhama@amazon.com> * Apply wording suggestions from code review Co-Authored-By: Aaron Markham <markhama@amazon.com> * Make clearer comment * Adding punctuation and capitalization to \brief descriptions * Fix * Fix * Add backward_cast to fusion * Adding unittests for fusion. Fix for erfinv_grad * Adding slice ops and add_n to tests * Fixes from review * Setting inplace option * Fix lint * Storing double in half * Retrigger CI * Slight relaxing of the relative tolerance in the test * Move the env variable check to the end * Fix a race condition between InferShape and scheduled Forward * Fix flakey test_fusion test involving fp32 erfinv op. * Fix from review * Added broadcast_like and slice_like to fused op * Minor fix and cleanup * Added negative axis support in slice_axis, temporarily disabled fusion of slice_like and broadcast_like * Added axes support to slice_like * Added axis support to broadcast_like * Add fast_load_slice function to fused op code * Added runtime switch for choosing fast and slow slice kernel * Fix lint and warning * Going easy on Windows compiler (again) * Fix slice_like * Debug broadcast_like fusion * Fix lint * Fix lint * Trigger CI * Get rid of the initializer list * Fix backward calls with different gradient type * avoid cycle when adding node specific for inputs of subgraph for pointwise fusion * Fix lint * Add namespace to the fusion implementations * Set launch bounds on the fused kernel * Fix NumPy tests * Test showcasing an issue fixed in PR apache#16553 * Cast scalarts to FP32 and perform (a*1.0/b) instead of (a/b) Fix lint errors Fix lint * Fix a bug in cycle detection for inputs only op in pointwise fusion * Add comments to simple_partition_pass.h file * fix install dir (apache#16690) * [numpy] add numpy operator : append (apache#16564) * add operator : append ; fix op concatenate when axis = None * pylint disable remove mistake disable pylint * Initializer.__eq__ (apache#16680) * fix binary dependencies in CD and nightly 
(apache#16693) * [MKL-DNN] Add mxnet mkldnn cmake tutorial (apache#16688) * add mxnet mkldnn cmake instruction * imporve doc * OMP->OpenMP * Revert "[MKLDNN]Fix reorder2default (apache#16602)" (apache#16697) This reverts commit dd4eaf5. * [Estimator] refactor estimator and clarify docs (apache#16694) * refactor estimator and clarify docs * fix info message and test * clean up after releasing logging handler * Eliminate common expressions (apache#15657) * Eliminate common expressions from a graph * Guarding against optimizing out stateful ops and ops that require resource * Fix lint * Added THasDeterministicOutput to multiple ops * DDebug eliminate common expr * Added test * Expose get_optimized_symbol * Fix * Fix 2 * Add doc to the Python call * Add env var MXNET_ELIMINATE_COMMON_EXPR, default true * Add comments, improve readability of eliminate_common_expr_pass.cc * Expand testing * Lower priority of THasDeterministicOutput attr for equal Node test * Change mx.gpu() to mx.cpu() in tests * Skip CSE test on Windows (as env variable setting during test does not work there) * Add missing import sys * Add missing import logging * Backport of apache#16711, apache#16737, apache#16408 to 1.6 branch (apache#16763) * support mixed-precision true_divide (apache#16711) * [MKLDNN] use dim_t instead of int in slice/transpose operators (apache#16737) * use dim_t instead of int * fix same issue in pooling * rebase code * trigger CI * Add MXNet Ops for fast multihead attention (apache#16408) * add MXNet Ops for fast multihead attention * add cutlass as 3rdparty dependency * add cutlass to compilation flags * remove all cutlass stuff * add better error message and description and remove cutlass from compilation flags * change credit for the approach since the code have changed * fix typos * correct another typo * Add all the cuda/cublas helper functions * remove tests using kAddTo * only use cublasStridedBatchedGemm if CUDA >= 9.1 * add equivalent mxnet code in description of mha 
ops * remove a wrong copy-paste * add _contrib for namespace and add GPU only on description * add warning in bwd_ignore_zero_init description, also test with fp32 * add error return if bwd_ignore_zero_init is used without MXNET_EXEC_ENABLE_ADDTO * remove std::move for clang * remove bwd_ignore_zero_init flag * remove bwd_ignore_zero_init in test_operator_gpu.py * fix typo * fix another typo * Removed unrelated test * Add example and documentation for multi threaded inference * Add LICENSE * Add get_model.py * Add license for README * Refactor cached op and cached op threadsafe * Add limitation * Add tests for naive engine * Add latest test changes * Thread Safety tests in NaiveEngine mode * Thread Safety tests update * Update thread safety tests, add unsupported use cases * Changes to doc and refactor * Fix todo owner, indentation and mx_float->float * Refactor cached op code, remove num_threads arg from example * Fix lint * Fix warning * Add back cython, required for unix-gpu build * Fix for windows * Add bulking support for thread safe cached op version * Add support for subgraph testing * import mxnet before calling get_backend_symbol * Fix symbol json name * Refactor DynamicForward * Add comments * Add DMLC_ATTRIBUTE_UNUSED * Fix use_naive_run issue * Fix lint * Revert unittest_cpp to old test since it doesnt test thread safety * Fix doc Co-authored-by: Sheng Zha <szha@users.noreply.github.com> Co-authored-by: Przemyslaw Tredak <ptrendx@gmail.com> Co-authored-by: Tao Lv <tao.a.lv@intel.com> Co-authored-by: JiangZhaoh <54654391+JiangZhaoh@users.noreply.github.com> Co-authored-by: Leonard Lausen <leonard@lausen.nl> Co-authored-by: Xinyu Chen <xinyu1.chen@intel.com> Co-authored-by: Zhennan Qin <zhennan.qin@intel.com>

cublasGemmBatchedEx is only supported on GPUs with compute capability 5.0 or higher. Fixes a bug in apache#16408.
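
For illustration, the constraint described above could be guarded roughly as follows. This is a minimal sketch, not the code from the fix: `SupportsGemmBatchedEx` and `SelectBatchedGemmPath` are hypothetical names, and the "fallback" label stands in for whatever alternative path the actual commit uses, which is not shown here. In real code the major version would typically be read from `cudaDeviceProp::major` via `cudaGetDeviceProperties`; it is passed in as a plain integer so the sketch compiles without CUDA.

```cpp
#include <cassert>
#include <string>

// Hypothetical helper (not from the PR): returns true when the device's
// major compute capability is at least 5, the minimum stated above for
// cublasGemmBatchedEx.
constexpr bool SupportsGemmBatchedEx(int major) {
  return major >= 5;
}

// Hypothetical dispatch: names the batched GEMM path to use for a device.
// "fallback" is a placeholder; the real fallback is an assumption here.
inline std::string SelectBatchedGemmPath(int major) {
  assert(major >= 1);  // compute capability major versions start at 1
  return SupportsGemmBatchedEx(major) ? "cublasGemmBatchedEx" : "fallback";
}
```

A caller would query the device once, for example at operator setup, and branch on the result instead of calling cublasGemmBatchedEx unconditionally.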


add MXNet Ops for fast multihead attention

e987614

ptrendx mentioned this pull request Oct 9, 2019

[Discussion] 1.6.0 Roadmap #15589

Closed

sxjscience reviewed Oct 9, 2019


src/operator/contrib/transformer.cu (outdated, resolved)

add cutlass as 3rdparty dependency

cbbeac9

Caenorst requested a review from szha as a code owner October 9, 2019 18:49

add cutlass to compilation flags

7a46c6a

marcoabreu reviewed Oct 9, 2019


.gitmodules (outdated, resolved)

sxjscience reviewed Oct 9, 2019


src/operator/contrib/transformer.cu (outdated, resolved)

remove all cutlass stuff

7fe71f9