
[TOPI][Relay] Add conv2d NHWC hybrid schedule for arm_cpu #16106

Merged: 6 commits merged into apache:main on Nov 24, 2023

Conversation


@Anndrey24 Anndrey24 commented Nov 10, 2023

Implemented an `arm_cpu` conv2d NHWC schedule using a hybrid GeMM approach, which breaks the matrix multiplication down into a macro-kernel (partitioning the problem into fixed-size, tile-level subproblems) and a micro-kernel (handling each subproblem independently). After the im2col transformation, the input matrix is handled natively (not interleaved), while the weights matrix is tiled and interleaved at compile time.
In the fp32 case, the micro-kernel uses 16 registers to accumulate the results of each 4x16 output tile, cycling through the operands needed to compute them (from the input and weight matrices) in the remaining registers.
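The macro-/micro-kernel split above can be illustrated with a plain NumPy sketch (names, shapes, and tile sizes here are illustrative only, not the schedule's actual implementation): a macro-kernel walks `C = A @ B` in fixed-size 4x16 tiles, and a micro-kernel accumulates each tile independently, standing in for the 16-register accumulation.

```python
import numpy as np

TILE_M, TILE_N = 4, 16  # output tile handled by one micro-kernel call

def micro_kernel(a_panel, b_panel):
    # a_panel: (TILE_M, K), b_panel: (K, TILE_N) -> one (TILE_M, TILE_N) tile.
    acc = np.zeros((TILE_M, TILE_N), dtype=np.float32)
    for k in range(a_panel.shape[1]):
        # Rank-1 update per reduction step, cycling through operands.
        acc += np.outer(a_panel[:, k], b_panel[k, :])
    return acc

def hybrid_gemm(a, b):
    m, k = a.shape
    _, n = b.shape
    assert m % TILE_M == 0 and n % TILE_N == 0  # assume padded shapes
    c = np.empty((m, n), dtype=np.float32)
    for i in range(0, m, TILE_M):        # macro-kernel: tile-level loop nest
        for j in range(0, n, TILE_N):
            c[i:i+TILE_M, j:j+TILE_N] = micro_kernel(a[i:i+TILE_M, :],
                                                     b[:, j:j+TILE_N])
    return c

a = np.random.rand(8, 32).astype(np.float32)
b = np.random.rand(32, 48).astype(np.float32)
assert np.allclose(hybrid_gemm(a, b), a @ b, atol=1e-4)
```

In the real schedule the micro-kernel body is what gets mapped onto vector registers; the sketch only shows the loop structure.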

There are now two ways to transform the weights matrix for conv2d, which are detailed in `convolution.cc`:

  • for int8 / uint8: tile, interleave, transpose
  • for everything else: tile, interleave

To maintain naming consistency across both of these implementations (transposed vs not transposed), all mentions of `tile_rows_B` or `tile_cols_B` have been changed to `tile_N` and `tile_K` respectively, to denote the tiling size along each axis of the flattened B matrix. As usual, `N = out_channels` and `K = kernel_width * kernel_height * in_channels`.
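The flattened dimensions and the compile-time tile-and-interleave step can be sketched as follows (pure NumPy, for illustration; the concrete tile factors and interleaved layout are placeholders, not the ones hard-coded in the schedule):

```python
import numpy as np

in_channels, out_channels = 32, 64
kernel_h, kernel_w = 3, 3

K = kernel_h * kernel_w * in_channels   # reduction axis of the flattened B
N = out_channels                        # output-channel axis

tile_K, tile_N = 4, 16                  # example tiling factors

# Weights flattened to (K, N), then tiled and interleaved at compile time:
B = np.random.rand(K, N).astype(np.float32)
B_tiled = B.reshape(K // tile_K, tile_K, N // tile_N, tile_N)
B_interleaved = B_tiled.transpose(2, 0, 1, 3)  # (N//tile_N, K//tile_K, tile_K, tile_N)
```

For the int8/uint8 path an additional transpose of the innermost tile would follow the interleave.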

I have also added a new conv2d NHWC fp32 test for both the conv2d_nhwc_spatial_pack and conv2d_NHWC_hybrid schedules, as well as new fp32 and fp16 implementation selection tests in test_select_implementation.py.

cc @ekalda @lhutton1 @neildhickey @leandron

@lhutton1 (Contributor) left a comment:

Thanks @Anndrey24, this looks like a great change :) I had a couple of comments around selecting when to use the schedule to ensure we don't break existing functionality. I think it would be good to check this by adding some tests to test_select_implementation.py to make sure we're still selecting the correct schedule for different devices.

Review comments left on:

  • python/tvm/relay/op/strategy/arm_cpu.py
  • python/tvm/topi/arm_cpu/arm_utils.py
  • python/tvm/topi/arm_cpu/conv2d_gemm.py
  • tests/python/integration/test_arm_aprofile.py
@lhutton1

@tvm-bot rerun

@Anndrey24 Anndrey24 changed the title [TOPI][Relay] Add conv2d NHWC fp32 hybrid schedule for arm_cpu [TOPI][Relay] Add conv2d NHWC hybrid schedule for arm_cpu Nov 16, 2023
@lhutton1 (Contributor) left a comment:

Thanks for the changes @Anndrey24, overall LGTM! Just noticed something I missed previously

Review comment on python/tvm/relay/op/strategy/arm_cpu.py
@lhutton1 (Contributor) left a comment:

LGTM!

@ekalda (Contributor) left a comment:

Thanks @Anndrey24, that is a lot of great work! 🚀 Some minor comments...

Review comments on python/tvm/relay/op/strategy/arm_cpu.py
@lhutton1

@tvm-bot rerun

@ekalda (Contributor) left a comment:

Looks great, thanks @Anndrey24!

@lhutton1

@tvm-bot rerun

@lhutton1 lhutton1 merged commit f38dc14 into apache:main Nov 24, 2023
18 checks passed
@lhutton1

Thanks @Anndrey24 @ekalda!

@Anndrey24 Anndrey24 deleted the fp32-hybrid-schedule branch November 24, 2023 15:56
Anndrey24 added a commit to Anndrey24/tvm that referenced this pull request Apr 17, 2024
This commit adds an `arm_cpu` conv2d NHWC schedule which generates SVE instructions by extending the hybrid GeMM approach implemented in apache#16106 to use scalable expressions as splitting factors.

Various vscale-related fixes needed to implement the schedule are also included, such as:

 - adding vscale bounds in the `ConstIntBoundAnalyzer` and `IntervalSetEvaluator`
 - simplifying `MinNode` and `MaxNode` that have scalable expression operands in `RewriteSimplifier`, which would appear when defining the shape of a buffer padded to be a multiple of vscale and in its respective buffer access indices (e.g. `C_1 = T.Buffer((1024 * (T.vscale() * 16 + 256 - 16 % T.vscale() * 16),), data=C)` instead of `C_1 = T.Buffer((1024 * (T.max(255, T.vscale() * 16 + 255 - 16 % T.vscale() * 16) + 1),), data=C)`)

The correctness of the new schedule is checked using a TOPI test, while the presence of generated SVE instructions is verified by a codegen test. The new `rewrite_simplify` rules are also covered by additional test cases.
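The buffer-padding arithmetic that the new simplification rules target can be sketched in plain Python (hypothetical helper, for illustration only): a dimension is padded up to a multiple of `vscale * 16`, where `vscale` is only known at runtime on SVE hardware, so the compiler must reason about the expression symbolically.

```python
def pad_to_multiple(extent, vscale, lanes=16):
    """Round extent up to a multiple of the scalable vector length."""
    step = vscale * lanes  # e.g. 16 fp32 lanes per vscale unit
    return ((extent + step - 1) // step) * step

# With a 256-bit SVE implementation, vscale = 2, so step = 32:
assert pad_to_multiple(256, 2) == 256  # already a multiple of 32
assert pad_to_multiple(250, 2) == 256  # padded up to the next multiple
```

The `RewriteSimplifier` changes let TVM cancel the `min`/`max` wrappers around such padded extents at compile time, even though `vscale` itself has no fixed value.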
Anndrey24 added a commit to Anndrey24/tvm that referenced this pull request Apr 22, 2024
lhutton1 pushed a commit that referenced this pull request Apr 24, 2024
Anndrey24 added a commit to Anndrey24/tvm that referenced this pull request Apr 29, 2024
…pu` targets

This patch partly reverts the unification of scalable and non-scalable scheduling of conv2d NHWC for `arm_cpu` targets introduced in apache#16899.

The non-scalable schedule for float32 splits the N axis (corresponding to the number of output channels) by 16 in both the unified and the non-unified schedule versions, and then additionally splits the inner partitions by 4 in only the non-unified version to which this patch is reverting (first added in apache#16106). The two versions' behaviour would be equivalent if none of the padding on the N axis were removed during lowering; however, we allow that to happen because it proved to increase performance for very small convolutions.

As it stands, there seems to be a regression in cases where the datatype is float32 and the number of output channels is greater than 16, a multiple of 4, and not a multiple of 16: even with the removed padding, the non-unified schedule can still vectorise over 4 elements, while the unified version can no longer vectorise over 16 elements.

Since all of the conv2d NHWC hybrid TOPI test cases used numbers of output channels either less than 16 or divisible by 16, this patch also adds a new case which falls in the aforementioned regression area.
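The regression condition above can be checked with a small worked example; `out_channels = 36` is one such value (greater than 16, a multiple of 4, but not a multiple of 16 — a hypothetical example, not a case taken from the test suite):

```python
out_channels = 36

# Unified schedule: N is split by 16 only. With the padding on N removed
# during lowering, a tail of out_channels % 16 elements remains, which
# cannot be vectorised over 16 lanes.
tail_after_16 = out_channels % 16  # 4 leftover channels

# Non-unified schedule: the inner partition is additionally split by 4,
# so the 4-element tail is still vectorisable.
assert out_channels > 16
assert out_channels % 4 == 0
assert out_channels % 16 != 0
assert tail_after_16 % 4 == 0  # tail still fits the 4-wide inner split
```

Any `out_channels` satisfying all three conditions (e.g. 20, 36, 52) falls in the regression area the new test case covers.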
lhutton1 pushed a commit that referenced this pull request Apr 29, 2024
…pu` targets (#16951)
