
[Metal] Add pass for splitting kernel with huge number of args #8313

Merged

Conversation

echuraev (Contributor)

Metal has some limitations on the number of input parameters a kernel can take. More information can be found here:
https://developer.apple.com/documentation/metal/buffers/about_argument_buffers?language=objc

This commit adds a new pass that splits functions with a large number of arguments into smaller parts. The parameter `max_function_args` specifies the maximum number of kernel arguments for a specific target, and a kernel is split when its number of arguments exceeds that value. Currently this pass works only for the concat layer.
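
For illustration only (a rough sketch under my own assumptions, not the code of the pass added in this PR), splitting an over-wide concatenation can be thought of as rewriting it into a tree of smaller concatenations so that each generated kernel sees at most `max_args` inputs; the helper `split_concat` below is hypothetical:

```python
from tvm import relay

def split_concat(tensors, axis, max_args):
    # Concatenate at most `max_args` tensors at a time, then recursively
    # concatenate the partial results; each resulting kernel takes fewer
    # arguments than the target's limit.
    if len(tensors) <= max_args:
        return relay.concatenate(tensors, axis)
    chunks = [relay.concatenate(tensors[i:i + max_args], axis)
              for i in range(0, len(tensors), max_args)]
    return split_concat(chunks, axis, max_args)
```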


@echuraev echuraev force-pushed the echuraev/split_kernel_with_many_arguments branch 3 times, most recently from c45c2b5 to c9028c7 Compare June 24, 2021 10:12
@jwfromm (Contributor) commented Jun 28, 2021

@mbrookhart can you take a look at this one?

@mbrookhart (Contributor)

cc @masahi

I remember Masa doing something like this for Vulkan at one point, but I'm not sure if that was a branch or if it ever got merged. If it's merged somewhere, maybe we should combine the two and make this a generally available tool?

@masahi (Member) commented Jun 28, 2021

Not exactly, but I've dealt with a similar issue. My mitigation was to limit the maximum fusion depth, which breaks large parameter kernels into smaller ones. But that is not guaranteed to work and not predictable. I can imagine that having a pass like this that allows more fine-grained controls might be necessary in some cases.

@echuraev FYI you can cap the fuse depth by

with tvm.transform.PassContext(config={"relay.FuseOps.max_depth": max_fused_ops}):
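
For context, a minimal sketch of how that config option might be applied when compiling a model (the module, parameters, and the "metal" target below are placeholders, not something prescribed by this PR):

```python
import tvm
from tvm import relay

def build_with_capped_fusion(mod, params, max_fused_ops=10):
    # Cap how many ops FuseOps may merge into a single kernel; this indirectly
    # bounds the number of arguments of the generated kernels, but gives no
    # hard guarantee on the final argument count.
    with tvm.transform.PassContext(
        opt_level=3, config={"relay.FuseOps.max_depth": max_fused_ops}
    ):
        return relay.build(mod, target="metal", params=params)
```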

@mbrookhart (Contributor) left a comment


Minor nit.

In general, I see the utility for single Relay kernels that have too many inputs, but I wonder if you'll hit this more for kernels post-fusion. This doesn't seem to tackle that for anything other than giant concatenation?

src/relay/transforms/pattern_utils.h (outdated review comment, resolved)
@echuraev echuraev force-pushed the echuraev/split_kernel_with_many_arguments branch from c9028c7 to f71fc96 Compare June 29, 2021 09:14
@echuraev (Contributor, Author)

Not exactly, but I've dealt with a similar issue. My mitigation was to limit the maximum fusion depth, which breaks large parameter kernels into smaller ones. But that is not guaranteed to work and not predictable. I can imagine that having a pass like this that allows more fine-grained controls might be necessary in some cases.

@echuraev FYI you can cap the fuse depth by

with tvm.transform.PassContext(config={"relay.FuseOps.max_depth": max_fused_ops}):

Thank you! I thought about reducing the fuse depth, but as you mentioned, it is not predictable and not guaranteed to work. That is why I think the approach of splitting kernels is more robust.

@echuraev (Contributor, Author)

Minor nit.

In general, I see the utility for single Relay kernels that have too many inputs, but I wonder if you'll hit this more for kernels post-fusion. This doesn't seem to tackle that for anything other than giant concatenation?

Yes, for now it only solves the problem with the concat layer. I thought this pass might also be useful for the split layer, for example, but I wasn't able to reproduce the same problem for a split layer in several simple tests.

@mbrookhart (Contributor) commented Jun 29, 2021

I have no problems merging this as it stands, but I do think I have a bigger question:

Should we put some sort of logic into fusion to automatically stop fusion if the argument list grows too large per this setting? That should be more robust than arbitrarily limiting the fusion depth. It could of course be a second PR.

@echuraev (Contributor, Author)

I have no problems merging this as it stands, but I do think I have a bigger question:

Should we put some sort of logic into fusion to automatically stop fusion if the argument list grows too large per this setting? That should be more robust than arbitrarily limiting the fusion depth. It could of course be a second PR.

It's a good point, and it looks reasonable to add such logic to the fusion algorithm. It could help us avoid possible problems with the number of arguments in the future. I think it would be better to add that logic in a separate PR, because the problem with the concat layer can appear even without fusing. For example, in the original problem we had many inputs because each input to the concat had some preprocessing, and the fusion algorithm wasn't able to fuse these inputs into a tuple because of that preprocessing. I don't think a limit on the fusion depth can solve this problem.
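
(A rough, hedged illustration of the guard discussed above; this is not TVM's actual FuseOps code, and `can_fuse` with its inputs is hypothetical.)

```python
def can_fuse(group_a_inputs, group_b_inputs, num_outputs, max_function_args):
    # A fused kernel needs roughly the union of both groups' external inputs
    # plus its outputs as arguments; skip the merge if that exceeds the limit.
    merged_args = len(set(group_a_inputs) | set(group_b_inputs)) + num_outputs
    return merged_args <= max_function_args
```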

@masahi masahi merged commit c989e4a into apache:main Jul 1, 2021
lygztq pushed a commit to lygztq/tvm that referenced this pull request Jul 1, 2021
* [Metal] Add pass for splitting kernel with huge number of args

* Add getting number of output parameters

* Fix CI and apply comments
@echuraev echuraev deleted the echuraev/split_kernel_with_many_arguments branch September 24, 2021 10:37
ylc pushed a commit to ylc/tvm that referenced this pull request Sep 29, 2021
zxy844288792 pushed a commit to zxy844288792/tvm that referenced this pull request Mar 4, 2022
echuraev added a commit to echuraev/tvm that referenced this pull request Jun 21, 2023
echuraev added a commit to echuraev/tvm that referenced this pull request Jun 21, 2023
echuraev added a commit to echuraev/tvm that referenced this pull request Jul 5, 2023
echuraev added a commit to echuraev/tvm that referenced this pull request Jul 7, 2023
echuraev added a commit to echuraev/tvm that referenced this pull request Jul 7, 2023
echuraev added a commit to echuraev/tvm that referenced this pull request Jul 14, 2023
echuraev added a commit to echuraev/tvm that referenced this pull request Jul 18, 2023
echuraev added a commit to echuraev/tvm that referenced this pull request Jul 18, 2023
echuraev added a commit to echuraev/tvm that referenced this pull request Jul 18, 2023
echuraev added a commit to echuraev/tvm that referenced this pull request Jul 20, 2023
masahi pushed a commit that referenced this pull request Jul 21, 2023
* [Relay] Introduce arguments limit to FuseOps pass

In PR #8313 a parameter `max_function_args` was introduced. It limits the number of function arguments, and when this value is exceeded, the concatenation layer is split into several concat operations.

I faced a problem on Adreno GPU where enqueueNDRange crashed without any errors for a kernel with a big number of arguments. The problem appeared because of the huge number of arguments, but in this case the concat layer was not the only root cause: after fusing several operations, the final functions also had a big number of arguments.

As discussed in #8313, adding a limitation on the number of function arguments to the FuseOps pass might be a good improvement. In this PR I introduce such a mechanism for limiting the number of function arguments in the FuseOps pass and add an argument limit of 128 parameters for OpenCL devices.

The idea of the current approach is to calculate the number of arguments for each node in the fusing algorithm and stop fusing when the number of function arguments exceeds the limit specified by `max_function_args`. When a node has several inputs and the number of arguments hasn't yet been computed for some of them, fusing for this node is postponed and retried later, once the number of arguments has been computed for all inputs. This postponed fusing helps to avoid additional computations during compilation.

Additionally, the case of dynamic shapes has to be handled. With dynamic shapes, the function arguments also include the sizes of the dynamic dimensions and the strides. The number of strides can be computed from the number of tensor dimensions (the number of strides equals the rank of the tensor), and the number of additional parameters holding the sizes of dynamic dimensions equals the number of dynamic dimensions.

* Fix memory_scope order in test

* Apply code review comments

* Apply comments
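
Below is a hedged sketch of the per-tensor argument counting described in the commit message above; it is not the actual FuseOps implementation, and the helper name `extra_args_for_tensor` is hypothetical:

```python
def extra_args_for_tensor(shape):
    # `shape` lists the tensor's dimensions; None marks a dynamic dimension,
    # e.g. [1, None, 224, 224].
    dynamic = sum(1 for dim in shape if dim is None)
    if dynamic == 0:
        return 0                 # static shape: no extra size/stride arguments
    strides = len(shape)         # one stride per dimension (tensor rank)
    return strides + dynamic     # plus one size argument per dynamic dimension
```

Under this reading, a shape like [1, None, 224, 224] contributes 4 strides plus 1 dynamic size, i.e. 5 extra arguments on top of the data buffer itself.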