Special packing for complex, specialize packing for avx2 #75
Conversation
For complex, we'll want to use a different packing function. Add packing into the GemmKernel interface so that kernels can request a different packing function. The standard packing function is unchanged but gets its own module in the code.
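A minimal sketch of what adding packing to the kernel interface could look like; the trait shape and names here are illustrative assumptions, not the crate's actual API. Each kernel type exposes its own pack function, and a fallback kernel uses the standard column-by-column layout:

```rust
// Hypothetical kernel trait that lets each kernel choose its packing
// function (names are assumptions for illustration).
trait GemmKernel {
    /// Microkernel row-panel height.
    const MR: usize;
    /// Pack an MR x cols block of `a` (with row stride `rsa` and
    /// column stride `csa`) into contiguous storage.
    fn pack(cols: usize, pack: &mut Vec<f32>, a: &[f32], rsa: usize, csa: usize);
}

struct FallbackKernel;

impl GemmKernel for FallbackKernel {
    const MR: usize = 4;
    // Standard packing: copy each column of the MR-row panel in order.
    fn pack(cols: usize, pack: &mut Vec<f32>, a: &[f32], rsa: usize, csa: usize) {
        for j in 0..cols {
            for i in 0..Self::MR {
                pack.push(a[i * rsa + j * csa]);
            }
        }
    }
}
```

An AVX2 or complex kernel would then override `pack` with its specialized layout while the rest of the GEMM driver stays unchanged.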
bluss force-pushed the cgemmpack branch 3 times, most recently from 8831fc8 to c4a3896 on April 29, 2023 at 15:04
bluss changed the title from "Use different matrix packing for complex microkernels" to "Special packing for complex, specialize packing for avx2" on Apr 29, 2023
avx2 packing bench
bluss force-pushed the cgemmpack branch 5 times, most recently from e32906a to e259bb2 on April 29, 2023 at 20:43
Use a different pack function for complex microkernels that puts real and imaginary parts in separate rows. This enables much better autovectorization for the fallback kernels.
Custom sizes for Fma and Avx2 are a win for performance, and Avx2 does better than Fma here, so both are worthwhile.
Because we detect target features to select the kernel, and the kernel can select its own packing functions, we can now specialize the packing functions per target. As matrices get larger, packing performance matters much less, but for small matrix products it contributes more to the runtime. The default packing also already has a special case for contiguous matrices, which in C = A B happens when A is column major and B is row major. The specialization in this commit helps the most outside this special case.
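The split real/imaginary packing described above can be sketched as follows; the function name, the `[f32; 2]` complex representation, and the parameters are assumptions for illustration, not the crate's actual code:

```rust
// Illustrative sketch of split-complex packing: for each column of the
// panel, emit one run of real parts followed by one run of imaginary
// parts, so the microkernel works on plain float lanes and the fallback
// kernel autovectorizes well.
fn pack_complex_split(
    cols: usize,
    mr: usize,               // microkernel row-panel height
    pack: &mut Vec<f32>,
    a: &[[f32; 2]],          // complex values as [re, im] pairs
    rsa: usize,              // row stride
    csa: usize,              // column stride
) {
    for j in 0..cols {
        // real parts of this column's panel rows
        for i in 0..mr {
            pack.push(a[i * rsa + j * csa][0]);
        }
        // imaginary parts, in a separate row of the packed buffer
        for i in 0..mr {
            pack.push(a[i * rsa + j * csa][1]);
        }
    }
}
```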
With avx2 and fma enabled together, f32::mul_add autovectorizes successfully, while fma alone with f32::mul_add does not (!). For this reason, f32::mul_add is only called when we explicitly opt in to it.
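One way to express that opt-in is a small helper that picks between `f32::mul_add` and the plain multiply-add; this is a sketch under assumed names, not the crate's actual gate:

```rust
// Only use f32::mul_add (a fused multiply-add) when the caller has
// opted in, e.g. because both avx2 and fma are known to be enabled;
// otherwise fall back to a separate multiply and add.
#[inline(always)]
fn fma_or_fallback(use_fma: bool, a: f32, b: f32, c: f32) -> f32 {
    if use_fma {
        a.mul_add(b, c) // computes a * b + c with a single rounding
    } else {
        a * b + c
    }
}
```

In a real kernel the `use_fma` decision would be a compile-time constant per kernel instantiation, so the branch disappears after monomorphization.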
Remove flags that are now enabled by default in Miri.
Complex (cgemm, zgemm):
Use a different pack layout for complex microkernels that puts real and imaginary parts in separate rows. This enables much better autovectorization for the fallback kernels.
Also enable an Avx2 + Fma autovectorized kernel.
Performance improvements (all kernels autovectorized for cgemm, zgemm
at this time)
Float (sgemm, dgemm):
Now that kernels can select their own packing functions, instantiate an avx2 version of the general packing function for sgemm and dgemm. Packing performance matters most for small matrix multiplications; for bigger sizes it is a vanishingly small part of the runtime, depending on input layouts. Tested on M, K, N = 32, i.e. a small matrix.
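Instantiating an avx2 copy of the general packing function can be sketched like this; the function names are assumptions, but the pattern (same body recompiled under `#[target_feature]`, selected by runtime feature detection) is standard Rust:

```rust
// Generic packing body: a simple strided copy stands in for the real
// packing routine here.
fn pack_generic(dst: &mut [f32], src: &[f32]) {
    for (d, s) in dst.iter_mut().zip(src) {
        *d = *s;
    }
}

// Same body, compiled with avx2 codegen enabled so the compiler can
// autovectorize it with 256-bit operations.
#[cfg(any(target_arch = "x86", target_arch = "x86_64"))]
#[target_feature(enable = "avx2")]
unsafe fn pack_avx2(dst: &mut [f32], src: &[f32]) {
    pack_generic(dst, src)
}

// Runtime dispatch: use the avx2 instantiation when the CPU supports it.
fn pack_dispatch(dst: &mut [f32], src: &[f32]) {
    #[cfg(any(target_arch = "x86", target_arch = "x86_64"))]
    if is_x86_feature_detected!("avx2") {
        // Safe: we just verified avx2 is available on this CPU.
        return unsafe { pack_avx2(dst, src) };
    }
    pack_generic(dst, src)
}
```

This mirrors how the kernel itself is already selected by target-feature detection, so packing and microkernel specialization use the same mechanism.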