[SME] Utilize predication in fp32 matmul and conv2d schedules #17054

Merged: 2 commits, Jun 14, 2024

Conversation

lhutton1
Contributor

@lhutton1 lhutton1 commented May 31, 2024

Prior to this commit, the matmul and conv2d schedules required padding of the inputs to some multiple of vscale and a final "unpadding" stage.

Instead, we can leverage predicated operations to avoid the requirement for padding. Both the transpose interleave and outer product fp32 intrinsics are updated to use predication. The get_active_lane_mask intrinsic is utilized to generate a variably sized mask of active lanes depending on the global position the tensor intrinsic is operating on.
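To illustrate the idea of replacing padding with predication, here is a minimal pure-Python sketch. It is not the TVM schedule or the SME intrinsics from this PR: `get_active_lane_mask` below is a hypothetical emulation that assumes lane `i` is active when `base + i < limit`, and `predicated_matmul` is an illustrative helper that accumulates outer products per tile while skipping inactive lanes.

```python
def get_active_lane_mask(base, limit, lanes):
    """Emulated mask (assumption): lane i is active when base + i < limit."""
    return [base + i < limit for i in range(lanes)]


def predicated_matmul(a, b, tile):
    """Multiply a (m x k) by b (k x n) tile by tile, without padding.

    Each tile is accumulated as a sum of outer products. Lanes that fall
    outside the matrix bounds are masked off instead of being padded.
    """
    m, k, n = len(a), len(b), len(b[0])
    c = [[0.0] * n for _ in range(m)]
    for row0 in range(0, m, tile):
        # Mask depends on the global row position of this tile.
        row_mask = get_active_lane_mask(row0, m, tile)
        for col0 in range(0, n, tile):
            # Mask depends on the global column position of this tile.
            col_mask = get_active_lane_mask(col0, n, tile)
            for kk in range(k):
                # One predicated outer product: column kk of a times row kk of b.
                for i in range(tile):
                    for j in range(tile):
                        if row_mask[i] and col_mask[j]:
                            c[row0 + i][col0 + j] += a[row0 + i][kk] * b[kk][col0 + j]
    return c
```

With a 3x5 output and a tile size of 4, the edge tiles only touch the lanes that exist, so no padded copy of the inputs and no "unpadding" of the output is needed in this sketch.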

For now this relies on using offset_of and stride information from the tensor we're predicating an access on. Likely we will want to build on this in the future with a more intuitive API for determining the current tile location.
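As a rough sketch of how an element offset and a stride can identify the current tile, the following assumes a dense, row-major 2D buffer. The helper names (`tile_position`, `remaining_lanes`) are hypothetical and are not part of TVM or of this PR; they only show the arithmetic involved.

```python
def tile_position(elem_offset, row_stride):
    """Recover (row, col) of a tile's first element in a row-major 2D buffer.

    Assumption: elem_offset is the flat offset of the tile's top-left
    element and row_stride is the number of elements between rows.
    """
    return divmod(elem_offset, row_stride)


def remaining_lanes(elem_offset, row_stride, rows, cols, tile):
    """Number of in-bounds rows and columns for the tile at elem_offset."""
    row, col = tile_position(elem_offset, row_stride)
    return min(tile, rows - row), min(tile, cols - col)
```

For a 5x6 buffer with a row stride of 6 and a tile size of 4, the tile starting at flat offset 28 sits at row 4, column 4, so `remaining_lanes(28, 6, 5, 6, 4)` gives one active row and two active columns. Those counts are what a lane mask would be built from.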

Support for batched conv2d was removed since it causes numerical issues, which are suspected to be due to how the current tile is determined (see the paragraph above).

Note: this should not be merged until after #17048

cc @ekalda @Anndrey24

@github-actions github-actions bot requested a review from ekalda May 31, 2024 12:34
@lhutton1 lhutton1 force-pushed the predicate-sme-fp32-schedules branch 2 times, most recently from 398ea16 to 41a1f04 Compare June 10, 2024 09:47
@lhutton1 lhutton1 marked this pull request as ready for review June 10, 2024 09:59
Contributor

@ekalda ekalda left a comment

Thanks @lhutton1, very nice work! Looks good to me, but I'll let @Anndrey24 take a look as well...

@Anndrey24
Contributor

LGTM, too! Seems there's just a small merge conflict that has come up.

Prior to this commit, the matmul and conv2d schedules required padding
of the inputs to some multiple of vscale and a final "unpadding" stage.

Instead, we can leverage predicated operations to avoid the
requirement for padding. Both the transpose interleave and outer
product fp32 intrinsics are updated to use predication. The
`get_active_lane_mask` intrinsic is utilized to generate a variably
sized mask of active lanes depending on the global position the tensor
intrinsic is operating on.

For now this relies on using `offset_of` and `stride` information from
the tensor we're predicating an access on. Likely we will want to
build on this in the future with a more intuitive API for determining
the current tile location.

Support for batched conv2d was removed since it causes numerical
issues, which are suspected to be due to how the current tile is
determined (see the paragraph above).

Change-Id: I79620200c9a94e2ca9d7297c4ed2abf87549cc41
Change-Id: Iaddeb046bdecb0352a067174f6e6e4be335e94fd
@lhutton1 lhutton1 force-pushed the predicate-sme-fp32-schedules branch from 41a1f04 to e755e43 Compare June 13, 2024 12:55
@ekalda ekalda merged commit d3011ab into apache:main Jun 14, 2024
18 checks passed
@ekalda
Contributor

ekalda commented Jun 14, 2024

Thanks @lhutton1 and @Anndrey24!

3 participants