[Metal] Add pass for splitting kernel with huge number of args #8313

echuraev · 2021-06-23T12:37:16Z

Metal limits the number of input parameters that a kernel can take. More
information can be found here:
https://developer.apple.com/documentation/metal/buffers/about_argument_buffers?language=objc

This commit adds a new pass that splits functions with a large number of
arguments into smaller parts. The max_function_args parameter specifies
the maximum number of kernel arguments for a specific target, and a
kernel is split when its number of arguments exceeds the value of
max_function_args. Currently this pass works only for the concat
layer.
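
As a rough illustration of the splitting idea, here is a minimal pure-Python sketch. It is not the code in split_args.cc: Python lists stand in for tensors, `concat`, `chunked_concat` and the `log` argument are hypothetical names, and reserving one argument slot for the output buffer is an assumption.

```python
from typing import List, Optional, Sequence


def concat(tensors: Sequence[list]) -> list:
    """Stand-in for a single concat kernel: flatten a group of lists."""
    return [x for t in tensors for x in t]


def chunked_concat(
    tensors: Sequence[list],
    max_function_args: int,
    log: Optional[List[int]] = None,
) -> list:
    """Concatenate `tensors` using only concat calls whose argument count
    (inputs plus one output) stays within `max_function_args`."""
    # Assumption: one argument slot is reserved for the output buffer.
    max_inputs = max_function_args - 1
    if max_inputs < 2:
        raise ValueError("need room for at least two inputs and one output")

    level = list(tensors)
    # Repeatedly concat groups of at most `max_inputs` tensors until the
    # remaining partial results fit into a single final concat.
    while len(level) > max_inputs:
        next_level = []
        for start in range(0, len(level), max_inputs):
            group = level[start:start + max_inputs]
            if log is not None:
                log.append(len(group))  # record inputs used by this kernel
            next_level.append(concat(group))
        level = next_level

    if log is not None:
        log.append(len(level))
    return concat(level)
```

Passing a list as `log` records how many inputs each simulated kernel received, which makes it easy to check that no call exceeds the limit.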

Thanks for contributing to TVM! Please refer to guideline https://tvm.apache.org/docs/contribute/ for useful information and tips. After the pull request is submitted, please request code reviews from Reviewers by @ them in the pull request thread.

src/relay/transforms/split_args.cc

jwfromm · 2021-06-28T16:28:30Z

@mbrookhart can you take a look at this one?

mbrookhart · 2021-06-28T16:37:59Z

cc @masahi

I remember Masa doing something like this for Vulkan at one point, but I'm not sure if that was a branch or if it ever got merged. If it's merged somewhere, maybe we should combine the two and make this a generally available tool?

masahi · 2021-06-28T21:19:02Z

Not exactly, but I've dealt with a similar issue. My mitigation was to limit the maximum fusion depth, which breaks large parameter kernels into smaller ones. But that is not guaranteed to work and not predictable. I can imagine that having a pass like this that allows more fine-grained controls might be necessary in some cases.

@echuraev FYI you can cap the fuse depth by

tvm/tests/python/relay/test_pass_fuse_ops.py

Line 755 in 720e7b1

    with tvm.transform.PassContext(config={"relay.FuseOps.max_depth": max_fused_ops}):

mbrookhart

Minor nit.

In general, I see the utility for single Relay kernels that have too many inputs, but I wonder if you'll hit this more for kernels post-fusion. This doesn't seem to tackle that for anything other than giant concatenation?

src/relay/transforms/pattern_utils.h

echuraev · 2021-06-29T09:57:11Z

> Not exactly, but I've dealt with a similar issue. My mitigation was to limit the maximum fusion depth, which breaks large parameter kernels into smaller ones. But that is not guaranteed to work and not predictable. I can imagine that having a pass like this that allows more fine-grained controls might be necessary in some cases.
>
> @echuraev FYI you can cap the fuse depth by
>
> tvm/tests/python/relay/test_pass_fuse_ops.py
>
> Line 755 in 720e7b1
>
>     with tvm.transform.PassContext(config={"relay.FuseOps.max_depth": max_fused_ops}):

Thank you! I thought about reducing the fuse depth, but as you mentioned, it is not predictable and not guaranteed to work. This is why I think that the approach of splitting kernels is more robust.

echuraev · 2021-06-29T10:04:11Z

> Minor nit.
>
> In general, I see the utility for single Relay kernels that have too many inputs, but I wonder if you'll hit this more for kernels post-fusion. This doesn't seem to tackle that for anything other than giant concatenation?

Yes, for now it solves only the problem with the concat layer. I thought that this pass might also be useful for the split layer, for example, but I wasn't able to reproduce the same problem for a split layer in several simple tests.

mbrookhart · 2021-06-29T14:37:18Z

I have no problems merging this as it stands, but I do think I have a bigger question:

Should we put some sort of logic into fusion to automatically stop fusion if the argument list grows too large per this setting? That should be more robust than arbitrarily limiting the fusion depth. It could of course be a second PR.

echuraev · 2021-06-29T19:22:28Z

> I have no problems merging this as it stands, but I do think I have a bigger question:
>
> Should we put some sort of logic into fusion to automatically stop fusion if the argument list grows too large per this setting? That should be more robust than arbitrarily limiting the fusion depth. It could of course be a second PR.

It's a good point, and it looks reasonable to add such logic to the fusion algorithm. It could help us avoid possible problems with the number of arguments in the future. I think it would be better to add such logic in a separate PR, because the problem with the concat layer can appear even without fusing. For example, in the original problem we had many inputs because each input to concat had some preprocessing, and the fusing algorithm wasn't able to fuse these inputs into a tuple because of this preprocessing. I don't think that a limit on the fusion depth can solve this problem.


From the follow-up PR #15137, [Relay] Introduce arguments limit to FuseOps pass:

PR apache#8313 introduced the `max_function_args` parameter. It limits the number of function arguments, and when the limit is exceeded the concatenation layer is split into several concat operations.

I faced a problem on an Adreno GPU where, for a kernel with a large number of arguments, enqueueNDRange crashed without reporting any errors. The cause was again the huge number of arguments, but in this case the concat layer was not the only root cause: after several operations were fused, the resulting functions also had a large number of arguments.

As discussed in apache#8313, adding a limit on the number of function arguments to the FuseOps pass might be a good improvement. This PR introduces such a mechanism for the FuseOps pass and sets an argument limit of 128 parameters for OpenCL devices.

The idea of the current approach is to calculate the number of arguments for each node in the fusing algorithm. When the number of function arguments exceeds the limit specified by `max_function_args`, fusing stops. When a node has several inputs and the number of arguments has not yet been computed for some of them, fusing is postponed for this node and retried later, once the number of arguments has been computed for all inputs. Postponing fusing in this way helps to avoid additional computations during compilation.

Additionally, the case of dynamic shapes has to be handled. With a dynamic shape, the function arguments also include the sizes of the dynamic dimensions and the strides. The number of strides equals the rank of the tensor, and the number of additional parameters holding the sizes of dynamic dimensions equals the number of dynamic dimensions.
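
To make the "stop fusing when the limit would be exceeded" rule concrete, here is a deliberately simplified pure-Python sketch. It is not the FuseOps implementation: it models only a linear chain of ops, leaves out the postponement logic for nodes with several inputs, and the names `count_args`, `fuse_chain` and `extra_inputs` are hypothetical. Counting one slot for the output buffer is also an assumption.

```python
from typing import List


def count_args(group: List[int], extra_inputs: List[int]) -> int:
    """Argument count of a fused function made of consecutive ops.

    extra_inputs[i] is the number of external tensors op i reads in
    addition to the result of op i - 1.
    """
    external = sum(extra_inputs[op] for op in group)
    # Every group except the first also reads the previous group's result.
    incoming = 1 if group[0] > 0 else 0
    # Assumption: one extra argument for the output buffer.
    return external + incoming + 1


def fuse_chain(extra_inputs: List[int], max_function_args: int) -> List[List[int]]:
    """Greedily fuse a chain of ops, starting a new function whenever
    adding the next op would exceed `max_function_args`."""
    groups: List[List[int]] = []
    current: List[int] = []
    for op in range(len(extra_inputs)):
        candidate = current + [op]
        if current and count_args(candidate, extra_inputs) > max_function_args:
            # Fusing this op would exceed the limit: close the current
            # function and start a new one at this op.
            groups.append(current)
            current = [op]
        else:
            current = candidate
    if current:
        groups.append(current)
    return groups
```

An op that exceeds the limit on its own still ends up alone in its group. Fusion cannot help there, which is the situation the concat-splitting pass from #8313 addresses.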

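
The dynamic-shape accounting described above can be sketched as follows. This is a minimal illustration under stated assumptions, not the code from the PR: `count_tensor_args` is a hypothetical helper, a dynamic dimension is represented by `None`, and each tensor is assumed to contribute exactly one argument for its data buffer.

```python
from typing import Optional, Sequence


def count_tensor_args(shape: Sequence[Optional[int]]) -> int:
    """Estimate how many kernel arguments one tensor contributes.

    A dimension of None marks a dynamic dimension (assumption of this sketch).
    """
    num_dynamic = sum(1 for dim in shape if dim is None)
    args = 1  # assumption: one argument for the data buffer itself
    if num_dynamic > 0:
        args += len(shape)   # one stride per dimension (rank of the tensor)
        args += num_dynamic  # one size parameter per dynamic dimension
    return args
```

For example, a static shape `(1, 3, 224, 224)` counts as one argument, while `(None, 3, None, 224)` counts as 1 + 4 + 2 = 7 under these assumptions.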


echuraev force-pushed the echuraev/split_kernel_with_many_arguments branch 3 times, most recently from c45c2b5 to c9028c7 Compare June 24, 2021 10:12

mbrookhart reviewed Jun 28, 2021


echuraev added 3 commits June 29, 2021 09:39

Add getting number of output parameters

858d3f0

Fix CI and apply comments

f71fc96

echuraev force-pushed the echuraev/split_kernel_with_many_arguments branch from c9028c7 to f71fc96 Compare June 29, 2021 09:14

mbrookhart approved these changes Jun 29, 2021


masahi merged commit c989e4a into apache:main Jul 1, 2021

echuraev deleted the echuraev/split_kernel_with_many_arguments branch September 24, 2021 10:37

junrushao mentioned this pull request Nov 1, 2021

Apache TVM v0.8 Release Note Candidate #9416

Closed

echuraev mentioned this pull request Jun 21, 2023

[Relay] Introduce arguments limit to FuseOps pass #15137

Merged
