
Use fma, fused multiply add, for architectures supporting fma #35

Closed

SuperFluffy opened this issue Dec 3, 2018 · 4 comments

@SuperFluffy (Contributor)

Modern Intel architectures that support the FMA instruction set can perform the multiply and accumulate of the first loop, which computes the matrix-matrix product between panels `a` and `b`, in one go using `_mm256_fmadd_pd`. We should implement this and see how it affects performance.
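
For reference, a minimal x86_64-only sketch contrasting the two approaches (illustrative function names, not the crate's actual kernel code):

```rust
use std::arch::x86_64::*;

// Plain AVX: two instructions and two rounding steps, round(round(a * b) + c).
#[target_feature(enable = "avx")]
unsafe fn mul_then_add(a: __m256d, b: __m256d, c: __m256d) -> __m256d {
    _mm256_add_pd(_mm256_mul_pd(a, b), c)
}

// FMA: a single instruction and a single rounding step, round(a * b + c).
#[target_feature(enable = "fma")]
unsafe fn fused_mul_add(a: __m256d, b: __m256d, c: __m256d) -> __m256d {
    _mm256_fmadd_pd(a, b, c)
}
```

Besides the throughput win, the fused version skips the intermediate rounding, so results can differ from the plain AVX path in the last bit.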

@bluss (Owner) commented Dec 3, 2018

Let's land the pure avx dgemm first

@SuperFluffy (Author)

These performance gains are just lovely:

 name                 ./dgemm_avx ns/iter  dgemm_fma ns/iter  diff ns/iter   diff %  speedup 
 layout_f64_032::ccc  3,258                2,396                      -862  -26.46%   x 1.36 
 layout_f64_032::ccf  3,212                2,320                      -892  -27.77%   x 1.38 
 layout_f64_032::cfc  3,384                2,525                      -859  -25.38%   x 1.34 
 layout_f64_032::cff  3,341                2,457                      -884  -26.46%   x 1.36 
 layout_f64_032::fcc  3,144                2,276                      -868  -27.61%   x 1.38 
 layout_f64_032::fcf  3,090                2,189                      -901  -29.16%   x 1.41 
 layout_f64_032::ffc  3,251                2,376                      -875  -26.91%   x 1.37 
 layout_f64_032::fff  3,201                2,296                      -905  -28.27%   x 1.39 
 mat_mul_f64::m004    175                  171                          -4   -2.29%   x 1.02 
 mat_mul_f64::m006    237                  226                         -11   -4.64%   x 1.05 
 mat_mul_f64::m008    257                  242                         -15   -5.84%   x 1.06 
 mat_mul_f64::m012    516                  431                         -85  -16.47%   x 1.20 
 mat_mul_f64::m016    648                  519                        -129  -19.91%   x 1.25 
 mat_mul_f64::m032    3,296                2,384                      -912  -27.67%   x 1.38 
 mat_mul_f64::m064    22,168               14,856                   -7,312  -32.98%   x 1.49 
 mat_mul_f64::m127    160,532              104,342                 -56,190  -35.00%   x 1.54

@SuperFluffy (Author)

Sgemm is not as impressive, but shows a serious improvement as well:

 name                 sgemm_avx ns/iter  sgemm_fma ns/iter  diff ns/iter   diff %  speedup 
 layout_f32_032::ccc  2,050              1,788                      -262  -12.78%   x 1.15 
 layout_f32_032::ccf  2,050              1,774                      -276  -13.46%   x 1.16 
 layout_f32_032::cfc  2,317              2,042                      -275  -11.87%   x 1.13 
 layout_f32_032::cff  2,316              2,046                      -270  -11.66%   x 1.13 
 layout_f32_032::fcc  1,796              1,527                      -269  -14.98%   x 1.18 
 layout_f32_032::fcf  1,799              1,513                      -286  -15.90%   x 1.19 
 layout_f32_032::ffc  2,058              1,784                      -274  -13.31%   x 1.15 
 layout_f32_032::fff  2,052              1,785                      -267  -13.01%   x 1.15 
 mat_mul_f32::m004    187                171                         -16   -8.56%   x 1.09 
 mat_mul_f32::m006    210                208                          -2   -0.95%   x 1.01 
 mat_mul_f32::m008    179                175                          -4   -2.23%   x 1.02 
 mat_mul_f32::m012    524                458                         -66  -12.60%   x 1.14 
 mat_mul_f32::m016    492                429                         -63  -12.80%   x 1.15 
 mat_mul_f32::m032    2,036              1,793                      -243  -11.94%   x 1.14 
 mat_mul_f32::m064    12,621             11,283                   -1,338  -10.60%   x 1.12 
 mat_mul_f32::m127    88,308             82,163                   -6,145   -6.96%   x 1.07

SuperFluffy added a commit to SuperFluffy/matrixmultiply that referenced this issue Dec 3, 2018
@bluss (Owner) commented Dec 3, 2018

That's amazing

SuperFluffy added a commit to SuperFluffy/matrixmultiply that referenced this issue Dec 4, 2018
SuperFluffy added a commit to SuperFluffy/matrixmultiply that referenced this issue Dec 4, 2018
SuperFluffy added a commit to SuperFluffy/matrixmultiply that referenced this issue Dec 5, 2018
SuperFluffy added a commit to SuperFluffy/matrixmultiply that referenced this issue Dec 7, 2018
This introduces a new trait `DgemmMultiplyAdd` that selects
fused multiply add if available, and multiplication followed
by addition if not.

Tests for avx and fma kernels are disabled for now.
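
A minimal sketch of the shape this takes (only the trait name `DgemmMultiplyAdd` is from the commit message; the method signature and impl types here are assumptions):

```rust
use std::arch::x86_64::*;

trait DgemmMultiplyAdd {
    /// Computes a * b + c on a 4-lane f64 register.
    unsafe fn multiply_add(a: __m256d, b: __m256d, c: __m256d) -> __m256d;
}

struct AvxKernel;
struct FmaKernel;

impl DgemmMultiplyAdd for AvxKernel {
    #[inline(always)]
    unsafe fn multiply_add(a: __m256d, b: __m256d, c: __m256d) -> __m256d {
        // Multiplication followed by addition, for targets without fma.
        _mm256_add_pd(_mm256_mul_pd(a, b), c)
    }
}

impl DgemmMultiplyAdd for FmaKernel {
    #[inline(always)]
    unsafe fn multiply_add(a: __m256d, b: __m256d, c: __m256d) -> __m256d {
        // Single fused instruction on targets with fma.
        _mm256_fmadd_pd(a, b, c)
    }
}
```

The enclosing kernel would be compiled with the matching `#[target_feature(...)]` and the variant picked at runtime, e.g. via `is_x86_feature_detected!("fma")`; see #36 for what actually landed.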
bluss closed this as completed in #36 on Dec 7, 2018