
Special packing for complex, specialize packing for avx2 #75

Merged

bluss merged 8 commits from cgemmpack into master on Apr 30, 2023
Conversation

@bluss (Owner) commented Apr 29, 2023

Complex (cgemm, zgemm):

Use a different pack layout for complex microkernels which puts real and
imag parts in separate rows. This enables much better autovectorization for
the fallback kernels.
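
As a rough illustration of such a layout (a sketch with assumed names and strides, not the crate's actual packing code): each packed column holds all real parts first, then all imaginary parts.

```rust
/// Sketch: pack one mr-wide strip of an interleaved complex matrix so that
/// real and imaginary parts land in separate rows of the packed buffer.
/// `rsa`/`csa` are the row/column strides of `a`, counted in complex elements.
fn pack_complex_strip(mr: usize, k: usize, pack: &mut [f32], a: &[f32], rsa: usize, csa: usize) {
    assert!(pack.len() >= 2 * mr * k);
    for j in 0..k {
        for i in 0..mr {
            let src = 2 * (i * rsa + j * csa); // interleaved [re, im] pairs
            // Column j of the packed panel: mr real parts, then mr imaginary
            // parts, so the fallback kernel sees two plain f32 rows per column.
            pack[2 * mr * j + i] = a[src];
            pack[2 * mr * j + mr + i] = a[src + 1];
        }
    }
}
```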

Also enable an Avx2 + Fma autovectorized kernel.

Performance improvements (all kernels autovectorized for cgemm, zgemm at this time):

  • AArch64 NEON (Apple M1): cgemm: +60%, zgemm: +10%
  • Fma + Avx (Intel Tiger Lake): cgemm: +143%
  • Fma + Avx2 (Intel Tiger Lake): cgemm: +395% (new), zgemm: +77% (new)

Float (sgemm, dgemm):

Now that the kernels can select their own packing functions, instantiate
an avx2 version of the general packing function for sgemm and dgemm.

Packing performance matters most for small matrix multiplications; for
bigger sizes it is a vanishingly small part of the runtime.

  • Avx2 (Intel Tiger Lake): sgemm improves 6-15%, dgemm improves 0-8%
    depending on input layouts. Tested with M = K = N = 32, i.e. a small matrix.

For complex, we'll want to use a different packing function.
Add packing to the GemmKernel interface so that kernels can request a
different packing function. The standard packing function is unchanged but
gets its own module in the code.
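
An illustrative sketch of what that interface extension could look like (names and signatures here are assumptions, not necessarily the crate's exact API):

```rust
/// Sketch: each kernel picks its own packing routines through the kernel trait.
trait GemmKernel {
    type Elem: Copy;

    const MR: usize;
    const NR: usize;

    /// Pack a k-by-mr block of A; standard kernels can forward to the shared
    /// general packing function, while e.g. complex or avx2 kernels
    /// substitute a specialized layout.
    fn pack_mr(k: usize, m: usize, pack: &mut [Self::Elem], a: &[Self::Elem], rsa: isize, csa: isize);

    /// Same for a k-by-nr block of B.
    fn pack_nr(k: usize, n: usize, pack: &mut [Self::Elem], b: &[Self::Elem], rsb: isize, csb: isize);
}
```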
@bluss force-pushed the cgemmpack branch 3 times, most recently from 8831fc8 to c4a3896, on April 29, 2023 15:04
@bluss changed the title from "Use different matrix packing for complex microkernels" to "Special packing for complex, specialize packing for avx2" on Apr 29, 2023
@bluss (Owner, Author) commented Apr 29, 2023

avx2 packing bench
 name                        nobeta-avx-before1 ns/iter  nobeta-avx-after1 ns/iter  diff ns/iter   diff %  speedup 
 layout_f32_032::nobeta_ccc  1,257                       1,143                              -114   -9.07%   x 1.10 
 layout_f32_032::nobeta_ccf  1,251                       1,140                              -111   -8.87%   x 1.10 
 layout_f32_032::nobeta_cfc  1,445                       1,259                              -186  -12.87%   x 1.15 
 layout_f32_032::nobeta_cff  1,441                       1,255                              -186  -12.91%   x 1.15 
 layout_f32_032::nobeta_fcc  1,080                       1,020                               -60   -5.56%   x 1.06 
 layout_f32_032::nobeta_fcf  1,074                       1,018                               -56   -5.21%   x 1.06 
 layout_f32_032::nobeta_ffc  1,287                       1,147                              -140  -10.88%   x 1.12 
 layout_f32_032::nobeta_fff  1,280                       1,142                              -138  -10.78%   x 1.12 
 layout_f64_032::nobeta_ccc  1,761                       1,783                                22    1.25%   x 0.99 
 layout_f64_032::nobeta_ccf  1,760                       1,776                                16    0.91%   x 0.99 
 layout_f64_032::nobeta_cfc  1,920                       1,839                               -81   -4.22%   x 1.04 
 layout_f64_032::nobeta_cff  1,914                       1,830                               -84   -4.39%   x 1.05 
 layout_f64_032::nobeta_fcc  1,636                       1,581                               -55   -3.36%   x 1.03 
 layout_f64_032::nobeta_fcf  1,632                       1,572                               -60   -3.68%   x 1.04 
 layout_f64_032::nobeta_ffc  1,766                       1,634                              -132   -7.47%   x 1.08 
 layout_f64_032::nobeta_fff  1,760                       1,627                              -133   -7.56%   x 1.08

@bluss force-pushed the cgemmpack branch 5 times, most recently from e32906a to e259bb2, on April 29, 2023 20:43
Use a different pack function for complex microkernels which puts real and
imag parts in separate rows. This enables much better autovectorization for
the fallback kernels.
Custom kernel sizes for Fma and Avx2 are a win for performance, and Avx2 does
better than Fma here, so both can be worthwhile.
Because we detect target features to select the kernel, and the kernel
can select its own packing functions, we can now specialize the packing
functions per target.

As matrices get larger, the packing performance matters much less, but
for small matrix products it contributes more to the runtime.

The default packing already has a special case for contiguous
matrices, which happens when, in C = A B, A is column major and B is row
major. The specialization in this commit helps the most outside this
special case.
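
A minimal sketch of that dispatch pattern (assumed names and signatures, not the crate's actual selection code; it assumes the same portable loop can simply be re-instantiated with avx2 enabled):

```rust
// Sketch: the same packing loop compiled twice, once with avx2 enabled,
// selected at runtime from detected target features.
#[inline(always)]
fn pack_impl(mr: usize, k: usize, buf: &mut [f32], a: &[f32], rsa: usize, csa: usize) {
    for j in 0..k {
        for i in 0..mr {
            buf[j * mr + i] = a[i * rsa + j * csa];
        }
    }
}

#[cfg(any(target_arch = "x86", target_arch = "x86_64"))]
#[target_feature(enable = "avx2")]
unsafe fn pack_avx2(mr: usize, k: usize, buf: &mut [f32], a: &[f32], rsa: usize, csa: usize) {
    // Same loop body; inlining it here lets the compiler autovectorize it with avx2.
    pack_impl(mr, k, buf, a, rsa, csa)
}

fn pack(mr: usize, k: usize, buf: &mut [f32], a: &[f32], rsa: usize, csa: usize) {
    #[cfg(any(target_arch = "x86", target_arch = "x86_64"))]
    {
        if is_x86_feature_detected!("avx2") {
            // Safe to call: avx2 availability was just checked at runtime.
            return unsafe { pack_avx2(mr, k, buf, a, rsa, csa) };
        }
    }
    pack_impl(mr, k, buf, a, rsa, csa)
}
```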
With avx2 + fma, f32::mul_add autovectorizes successfully, while with just
fma, f32::mul_add does not (!).

For this reason, only call f32::mul_add when we opt in to it.
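
One way to express that opt-in (a sketch with an assumed helper name, not the crate's code) is a compile-time switch around the multiply-add:

```rust
/// Sketch: make the fused multiply-add opt-in at compile time, so kernels
/// that don't benefit (fma without avx2) keep the plain `a * b + c` form.
#[inline(always)]
fn madd<const USE_MUL_ADD: bool>(a: f32, b: f32, c: f32) -> f32 {
    if USE_MUL_ADD {
        a.mul_add(b, c) // fused multiply-add
    } else {
        a * b + c // separate multiply and add; autovectorized more reliably without avx2
    }
}
```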
Remove flags that are now used by default by miri.
@bluss merged commit 84c0baa into master on Apr 30, 2023
@bluss deleted the cgemmpack branch on April 30, 2023 10:26