-
Notifications
You must be signed in to change notification settings - Fork 3.5k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
[Metal] Add pass for splitting kernel with huge number of args #8313
[Metal] Add pass for splitting kernel with huge number of args #8313
Conversation
c45c2b5
to
c9028c7
Compare
@mbrookhart can you take a look at this one? |
cc @masahi I remember Masa doing something like this for Vulkan at one point, but I'm not sure if that was a branch or if it ever got merged. If it's merged somewhere, maybe we should combine make this a generally available tool? |
Not exactly, but I've dealt with a similar issue. My mitigation was to limit the maximum fusion depth, which breaks large parameter kernels into smaller ones. But that is not guaranteed to work and not predictable. I can imagine that having a pass like this that allows more fine-grained controls might be necessary in some cases. @echuraev FYI you can cap the fuse depth by tvm/tests/python/relay/test_pass_fuse_ops.py Line 755 in 720e7b1
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Minor nit.
In general, I see the utility for single Relay kernels that have too many inputs, but I wonder if you'll hit this more for kernels post-fusion. This doesn't seem to tackle that for anything other than giant concatenation?
The Metal has some limitations on the number of input parameters. More information can be found here: https://developer.apple.com/documentation/metal/buffers/about_argument_buffers?language=objc In this commit a new pass for splitting functions with big number of arguments to smaller parts was added. In parameter `max_function_args` we can specify the maximum number of kernel arguments for specific target and then split kernel when the number of arguments exceeds the value of `max_function_args`. Currently this pass works only for concat layer.
c9028c7
to
f71fc96
Compare
Thank you! I thought about the reducing fuse depth, but as you mentioned, it is not predictable and not guaranteed to work. This is why I think that this approach with splitting kernels is more robust. |
Yes, now it is solving only problem with concat layer. I thought that maybe this pass can be useful in case of split layer for example. But I wasn't able to reproduce the same problem for a split layer on several simple tests. |
I have no problems merging this as it stands, but I do think I have a bigger question: Should we put some sort of logic into fusion to automatically stop fusion if the argument list grows too large per this setting? That should be more robust than arbitrarily limiting the fusion depth. It could of course be a second PR. |
It's a good point and looks reasonable to add such logic into the fusion algorithm. It could help us to avoid some possible problems with number of arguments in the future. I think it would be better to do such logic in separate PR, due to problem with concat layer can appear and without fusing. For example, in the original problem we had many inputs because for each input to concat we had some preprocessing. And the fusing algorithm wasn't able to fuse these inputs into tuple due to this preprocessing. I don't think that the limit on the fusion depth can solve this problem. |
…e#8313) * [Metal] Add pass for splitting kernel with huge number of args The Metal has some limitations on the number of input parameters. More information can be found here: https://developer.apple.com/documentation/metal/buffers/about_argument_buffers?language=objc In this commit a new pass for splitting functions with big number of arguments to smaller parts was added. In parameter `max_function_args` we can specify the maximum number of kernel arguments for specific target and then split kernel when the number of arguments exceeds the value of `max_function_args`. Currently this pass works only for concat layer. * Add getting number of output parameters * Fix CI and apply comments
…e#8313) * [Metal] Add pass for splitting kernel with huge number of args The Metal has some limitations on the number of input parameters. More information can be found here: https://developer.apple.com/documentation/metal/buffers/about_argument_buffers?language=objc In this commit a new pass for splitting functions with big number of arguments to smaller parts was added. In parameter `max_function_args` we can specify the maximum number of kernel arguments for specific target and then split kernel when the number of arguments exceeds the value of `max_function_args`. Currently this pass works only for concat layer. * Add getting number of output parameters * Fix CI and apply comments
…e#8313) * [Metal] Add pass for splitting kernel with huge number of args The Metal has some limitations on the number of input parameters. More information can be found here: https://developer.apple.com/documentation/metal/buffers/about_argument_buffers?language=objc In this commit a new pass for splitting functions with big number of arguments to smaller parts was added. In parameter `max_function_args` we can specify the maximum number of kernel arguments for specific target and then split kernel when the number of arguments exceeds the value of `max_function_args`. Currently this pass works only for concat layer. * Add getting number of output parameters * Fix CI and apply comments
In PR apache#8313 a parameter `max_function_args` was introduced. It leads to limit number of function argument and in case when this value is exceeded then concatenation layer is splittet to a several concat operations. I faced a problem on Adreno GPU that for kernel with big number of arguments the enqueueNDRange was wailed without any errors. The problem was also in the huge number arguments. But in this case not only concat layer was a root cause of the problem. Also after fusing several operations the final functions had big number of arguments. As it was discussed in apache#8313, adding limitation on the number of function arguments to the FuseOps pass might be a good improvement. In this PR I introduced such mechanism for limitation number of function arguments for FuseOps pass and add an arguments limit to OpenCL devices at 128 parameters.
In PR apache#8313 a parameter `max_function_args` was introduced. It leads to limit number of function argument and in case when this value is exceeded then concatenation layer is split to a several concat operations. I faced a problem on Adreno GPU that for kernel with big number of arguments the enqueueNDRange was crashed without any errors. The problem appeared because of the huge number of arguments. But in this case not only concat layer was a root cause of the problem. Also after fusing several operations the final functions had a big number of arguments. As it was discussed in apache#8313, adding a limitation on the number of function arguments to the FuseOps pass might be a good improvement. In this PR I introduced such mechanism for limitation number of function arguments for FuseOps pass and add an arguments limit to OpenCL devices at 128 parameters.
In PR apache#8313 a parameter `max_function_args` was introduced. It leads to limit number of function argument and in case when this value is exceeded then concatenation layer is split to a several concat operations. I faced a problem on Adreno GPU that for kernel with big number of arguments the enqueueNDRange was crashed without any errors. The problem appeared because of the huge number of arguments. But in this case not only concat layer was a root cause of the problem. Also after fusing several operations the final functions had a big number of arguments. As it was discussed in apache#8313, adding a limitation on the number of function arguments to the FuseOps pass might be a good improvement. In this PR I introduced such mechanism for limitation number of function arguments for FuseOps pass and add an arguments limit to OpenCL devices at 128 parameters.
In PR apache#8313 a parameter `max_function_args` was introduced. It leads to limit number of function argument and in case when this value is exceeded then concatenation layer is split to a several concat operations. I faced a problem on Adreno GPU that for kernel with big number of arguments the enqueueNDRange was crashed without any errors. The problem appeared because of the huge number of arguments. But in this case not only concat layer was a root cause of the problem. Also after fusing several operations the final functions had a big number of arguments. As it was discussed in apache#8313, adding a limitation on the number of function arguments to the FuseOps pass might be a good improvement. In this PR I introduced such mechanism for limitation number of function arguments for FuseOps pass and add an arguments limit to OpenCL devices at 128 parameters.
In PR apache#8313 a parameter `max_function_args` was introduced. It leads to limit number of function argument and in case when this value is exceeded then concatenation layer is split to a several concat operations. I faced a problem on Adreno GPU that for kernel with big number of arguments the enqueueNDRange was crashed without any errors. The problem appeared because of the huge number of arguments. But in this case not only concat layer was a root cause of the problem. Also after fusing several operations the final functions had a big number of arguments. As it was discussed in apache#8313, adding a limitation on the number of function arguments to the FuseOps pass might be a good improvement. In this PR I introduced such mechanism for limitation number of function arguments for FuseOps pass and add an arguments limit to OpenCL devices at 128 parameters.
In PR apache#8313 a parameter `max_function_args` was introduced. It leads to limit number of function argument and in case when this value is exceeded then concatenation layer is split to a several concat operations. I faced a problem on Adreno GPU that for kernel with big number of arguments the enqueueNDRange was crashed without any errors. The problem appeared because of the huge number of arguments. But in this case not only concat layer was a root cause of the problem. Also after fusing several operations the final functions had a big number of arguments. As it was discussed in apache#8313, adding a limitation on the number of function arguments to the FuseOps pass might be a good improvement. In this PR I introduced such mechanism for limitation number of function arguments for FuseOps pass and add an arguments limit to OpenCL devices at 128 parameters.
In PR apache#8313 a parameter `max_function_args` was introduced. It leads to limit number of function argument and in case when this value is exceeded then concatenation layer is split to a several concat operations. I faced a problem on Adreno GPU that for kernel with big number of arguments the enqueueNDRange was crashed without any errors. The problem appeared because of the huge number of arguments. But in this case not only concat layer was a root cause of the problem. Also after fusing several operations the final functions had a big number of arguments. As it was discussed in apache#8313, adding a limitation on the number of function arguments to the FuseOps pass might be a good improvement. In this PR I introduced such mechanism for limitation number of function arguments for FuseOps pass and add an arguments limit to OpenCL devices at 128 parameters. The idea of current approach is calculate the number of arguments for each node in fusing algorithm and in case then the number of function arguments exceeds the limit, specified by `max_function_args`, then the fusing should be stopped. In case when node has several inputs and for some of the inputs the number of arguments wasn't computed, then we postpone fusing for this node and will try fuse this node later when the number of arguments will be computed for all inputs. This approach with postponed fusing helps to avoid additional computations during compilation. Additionally, case of dynamic shapes should be handled. In case of dynamic shape, function arguments also included sizes of dynamic dimension and strides. The number of strides can be computed by calculating number of tensor dimensions (the number of strides equals to the rank of the tensor). The number of additional parameters with sizes of dynamic dimensions can be calculated by computing number of dynamic dimensions.
In PR apache#8313 a parameter `max_function_args` was introduced. It leads to limit number of function argument and in case when this value is exceeded then concatenation layer is split to a several concat operations. I faced a problem on Adreno GPU that for kernel with big number of arguments the enqueueNDRange was crashed without any errors. The problem appeared because of the huge number of arguments. But in this case not only concat layer was a root cause of the problem. Also after fusing several operations the final functions had a big number of arguments. As it was discussed in apache#8313, adding a limitation on the number of function arguments to the FuseOps pass might be a good improvement. In this PR I introduced such mechanism for limitation number of function arguments for FuseOps pass and add an arguments limit to OpenCL devices at 128 parameters. The idea of current approach is calculate the number of arguments for each node in fusing algorithm and in case then the number of function arguments exceeds the limit, specified by `max_function_args`, then the fusing should be stopped. In case when node has several inputs and for some of the inputs the number of arguments wasn't computed, then we postpone fusing for this node and will try fuse this node later when the number of arguments will be computed for all inputs. This approach with postponed fusing helps to avoid additional computations during compilation. Additionally, case of dynamic shapes should be handled. In case of dynamic shape, function arguments also included sizes of dynamic dimension and strides. The number of strides can be computed by calculating number of tensor dimensions (the number of strides equals to the rank of the tensor). The number of additional parameters with sizes of dynamic dimensions can be calculated by computing number of dynamic dimensions.
In PR apache#8313 a parameter `max_function_args` was introduced. It leads to limit number of function argument and in case when this value is exceeded then concatenation layer is split to a several concat operations. I faced a problem on Adreno GPU that for kernel with big number of arguments the enqueueNDRange was crashed without any errors. The problem appeared because of the huge number of arguments. But in this case not only concat layer was a root cause of the problem. Also after fusing several operations the final functions had a big number of arguments. As it was discussed in apache#8313, adding a limitation on the number of function arguments to the FuseOps pass might be a good improvement. In this PR I introduced such mechanism for limitation number of function arguments for FuseOps pass and add an arguments limit to OpenCL devices at 128 parameters. The idea of current approach is calculate the number of arguments for each node in fusing algorithm and in case then the number of function arguments exceeds the limit, specified by `max_function_args`, then the fusing should be stopped. In case when node has several inputs and for some of the inputs the number of arguments wasn't computed, then we postpone fusing for this node and will try fuse this node later when the number of arguments will be computed for all inputs. This approach with postponed fusing helps to avoid additional computations during compilation. Additionally, case of dynamic shapes should be handled. In case of dynamic shape, function arguments also included sizes of dynamic dimension and strides. The number of strides can be computed by calculating number of tensor dimensions (the number of strides equals to the rank of the tensor). The number of additional parameters with sizes of dynamic dimensions can be calculated by computing number of dynamic dimensions.
In PR apache#8313 a parameter `max_function_args` was introduced. It leads to limit number of function argument and in case when this value is exceeded then concatenation layer is split to a several concat operations. I faced a problem on Adreno GPU that for kernel with big number of arguments the enqueueNDRange was crashed without any errors. The problem appeared because of the huge number of arguments. But in this case not only concat layer was a root cause of the problem. Also after fusing several operations the final functions had a big number of arguments. As it was discussed in apache#8313, adding a limitation on the number of function arguments to the FuseOps pass might be a good improvement. In this PR I introduced such mechanism for limitation number of function arguments for FuseOps pass and add an arguments limit to OpenCL devices at 128 parameters. The idea of current approach is calculate the number of arguments for each node in fusing algorithm and in case then the number of function arguments exceeds the limit, specified by `max_function_args`, then the fusing should be stopped. In case when node has several inputs and for some of the inputs the number of arguments wasn't computed, then we postpone fusing for this node and will try fuse this node later when the number of arguments will be computed for all inputs. This approach with postponed fusing helps to avoid additional computations during compilation. Additionally, case of dynamic shapes should be handled. In case of dynamic shape, function arguments also included sizes of dynamic dimension and strides. The number of strides can be computed by calculating number of tensor dimensions (the number of strides equals to the rank of the tensor). The number of additional parameters with sizes of dynamic dimensions can be calculated by computing number of dynamic dimensions.
* [Relay] Introduce arguments limit to FuseOps pass In PR #8313 a parameter `max_function_args` was introduced. It leads to limit number of function argument and in case when this value is exceeded then concatenation layer is split to a several concat operations. I faced a problem on Adreno GPU that for kernel with big number of arguments the enqueueNDRange was crashed without any errors. The problem appeared because of the huge number of arguments. But in this case not only concat layer was a root cause of the problem. Also after fusing several operations the final functions had a big number of arguments. As it was discussed in #8313, adding a limitation on the number of function arguments to the FuseOps pass might be a good improvement. In this PR I introduced such mechanism for limitation number of function arguments for FuseOps pass and add an arguments limit to OpenCL devices at 128 parameters. The idea of current approach is calculate the number of arguments for each node in fusing algorithm and in case then the number of function arguments exceeds the limit, specified by `max_function_args`, then the fusing should be stopped. In case when node has several inputs and for some of the inputs the number of arguments wasn't computed, then we postpone fusing for this node and will try fuse this node later when the number of arguments will be computed for all inputs. This approach with postponed fusing helps to avoid additional computations during compilation. Additionally, case of dynamic shapes should be handled. In case of dynamic shape, function arguments also included sizes of dynamic dimension and strides. The number of strides can be computed by calculating number of tensor dimensions (the number of strides equals to the rank of the tensor). The number of additional parameters with sizes of dynamic dimensions can be calculated by computing number of dynamic dimensions. * Fix memory_scope order in test * Apply code review comments * Apply comments
The Metal has some limitations on the number of input parameters. More
information can be found here:
https://developer.apple.com/documentation/metal/buffers/about_argument_buffers?language=objc
In this commit a new pass for splitting functions with big number of
arguments to smaller parts was added. In parameter
max_function_args
we can specify the maximum number of kernel arguments for specific
target and then split kernel when the number of arguments exceeds the
value of
max_function_args
. Currently this pass works only for concatlayer.
Thanks for contributing to TVM! Please refer to guideline https://tvm.apache.org/docs/contribute/ for useful information and tips. After the pull request is submitted, please request code reviews from Reviewers by @ them in the pull request thread.