
Use fma, fused multiply add, for architectures supporting fma #35

Closed

SuperFluffy opened this issue Dec 3, 2018 · 4 comments

@SuperFluffy (Contributor)

Modern Intel architectures that support the FMA instruction set can perform the multiply and accumulate of the first loop, which computes the matrix-matrix product between panels `a` and `b`, in one go using `_mm256_fmadd_pd`. We should implement this and see how it affects performance.
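
For reference, a minimal x86_64-only sketch contrasting the two approaches (illustrative function names, not the crate's actual kernel code):

```rust
use std::arch::x86_64::*;

// Plain AVX: two instructions and two rounding steps, round(round(a * b) + c).
#[target_feature(enable = "avx")]
unsafe fn mul_then_add(a: __m256d, b: __m256d, c: __m256d) -> __m256d {
    _mm256_add_pd(_mm256_mul_pd(a, b), c)
}

// FMA: a single instruction and a single rounding step, round(a * b + c).
#[target_feature(enable = "fma")]
unsafe fn fused_mul_add(a: __m256d, b: __m256d, c: __m256d) -> __m256d {
    _mm256_fmadd_pd(a, b, c)
}
```

Besides the throughput win, the fused version skips the intermediate rounding, so results can differ from the plain AVX path in the last bit.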

@bluss (Owner) commented Dec 3, 2018

Let's land the pure avx dgemm first

@SuperFluffy (Author)

These performance gains are just lovely:

 name                 ./dgemm_avx ns/iter  dgemm_fma ns/iter  diff ns/iter   diff %  speedup 
 layout_f64_032::ccc  3,258                2,396                      -862  -26.46%   x 1.36 
 layout_f64_032::ccf  3,212                2,320                      -892  -27.77%   x 1.38 
 layout_f64_032::cfc  3,384                2,525                      -859  -25.38%   x 1.34 
 layout_f64_032::cff  3,341                2,457                      -884  -26.46%   x 1.36 
 layout_f64_032::fcc  3,144                2,276                      -868  -27.61%   x 1.38 
 layout_f64_032::fcf  3,090                2,189                      -901  -29.16%   x 1.41 
 layout_f64_032::ffc  3,251                2,376                      -875  -26.91%   x 1.37 
 layout_f64_032::fff  3,201                2,296                      -905  -28.27%   x 1.39 
 mat_mul_f64::m004    175                  171                          -4   -2.29%   x 1.02 
 mat_mul_f64::m006    237                  226                         -11   -4.64%   x 1.05 
 mat_mul_f64::m008    257                  242                         -15   -5.84%   x 1.06 
 mat_mul_f64::m012    516                  431                         -85  -16.47%   x 1.20 
 mat_mul_f64::m016    648                  519                        -129  -19.91%   x 1.25 
 mat_mul_f64::m032    3,296                2,384                      -912  -27.67%   x 1.38 
 mat_mul_f64::m064    22,168               14,856                   -7,312  -32.98%   x 1.49 
 mat_mul_f64::m127    160,532              104,342                 -56,190  -35.00%   x 1.54

@SuperFluffy (Author)

Sgemm is not as impressive, but shows a serious improvement as well:

 name                 sgemm_avx ns/iter  sgemm_fma ns/iter  diff ns/iter   diff %  speedup 
 layout_f32_032::ccc  2,050              1,788                      -262  -12.78%   x 1.15 
 layout_f32_032::ccf  2,050              1,774                      -276  -13.46%   x 1.16 
 layout_f32_032::cfc  2,317              2,042                      -275  -11.87%   x 1.13 
 layout_f32_032::cff  2,316              2,046                      -270  -11.66%   x 1.13 
 layout_f32_032::fcc  1,796              1,527                      -269  -14.98%   x 1.18 
 layout_f32_032::fcf  1,799              1,513                      -286  -15.90%   x 1.19 
 layout_f32_032::ffc  2,058              1,784                      -274  -13.31%   x 1.15 
 layout_f32_032::fff  2,052              1,785                      -267  -13.01%   x 1.15 
 mat_mul_f32::m004    187                171                         -16   -8.56%   x 1.09 
 mat_mul_f32::m006    210                208                          -2   -0.95%   x 1.01 
 mat_mul_f32::m008    179                175                          -4   -2.23%   x 1.02 
 mat_mul_f32::m012    524                458                         -66  -12.60%   x 1.14 
 mat_mul_f32::m016    492                429                         -63  -12.80%   x 1.15 
 mat_mul_f32::m032    2,036              1,793                      -243  -11.94%   x 1.14 
 mat_mul_f32::m064    12,621             11,283                   -1,338  -10.60%   x 1.12 
 mat_mul_f32::m127    88,308             82,163                   -6,145   -6.96%   x 1.07

SuperFluffy added a commit to SuperFluffy/matrixmultiply that referenced this issue Dec 3, 2018
@bluss (Owner) commented Dec 3, 2018

That's amazing

SuperFluffy added a commit to SuperFluffy/matrixmultiply that referenced this issue Dec 4, 2018
SuperFluffy added a commit to SuperFluffy/matrixmultiply that referenced this issue Dec 4, 2018
SuperFluffy added a commit to SuperFluffy/matrixmultiply that referenced this issue Dec 5, 2018
SuperFluffy added a commit to SuperFluffy/matrixmultiply that referenced this issue Dec 7, 2018
This introduces a new trait `DgemmMultiplyAdd` that selects
fused multiply add if available, and multiplication followed
by addition if not.

Tests for avx and fma kernels are disabled for now.
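
A minimal sketch of the shape this takes (only the trait name `DgemmMultiplyAdd` is from the commit message; the method signature and impl types here are assumptions):

```rust
use std::arch::x86_64::*;

trait DgemmMultiplyAdd {
    /// Computes a * b + c on a 4-lane f64 register.
    unsafe fn multiply_add(a: __m256d, b: __m256d, c: __m256d) -> __m256d;
}

struct AvxKernel;
struct FmaKernel;

impl DgemmMultiplyAdd for AvxKernel {
    #[inline(always)]
    unsafe fn multiply_add(a: __m256d, b: __m256d, c: __m256d) -> __m256d {
        // Multiplication followed by addition, for targets without fma.
        _mm256_add_pd(_mm256_mul_pd(a, b), c)
    }
}

impl DgemmMultiplyAdd for FmaKernel {
    #[inline(always)]
    unsafe fn multiply_add(a: __m256d, b: __m256d, c: __m256d) -> __m256d {
        // Single fused instruction on targets with fma.
        _mm256_fmadd_pd(a, b, c)
    }
}
```

The enclosing kernel would be compiled with the matching `#[target_feature(...)]` and the variant picked at runtime, e.g. via `is_x86_feature_detected!("fma")`; see #36 for what actually landed.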
bluss closed this as completed in #36 on Dec 7, 2018