Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Use optimal kernel parameters (architectures, matrix layouts) #34

Open
SuperFluffy opened this issue Dec 3, 2018 · 7 comments
Open

Comments

@SuperFluffy
Copy link
Contributor

I am trying to figure out what to use as optimal kernel parameter for different architectures.

For example, it looks like blis is using 8x4 for Sandy Bridge, but 8x6 for Haswell. Why? What lead them to this setup? Specifically, because operations are usually on 4 doubles at a time, how does the 6 fit in there. Is Haswell able to separately execute a _mm256 and a _mm operation at the same time?

Furthermore, if we have non-square kernels like for dgemm, is there a scenario where choosing 4x8 over 8x4 is better?

@bluss
Copy link
Owner

bluss commented Dec 3, 2018

You must also include src/archparam.rs in this

@SuperFluffy
Copy link
Contributor Author

This discussion is revealing in terms of how to determine optimal kernel parameters: flame/blis#253

In particular, this states:

@VirtualEarth Turn your attention to Eq. 1:

  mr nr >= Nvec Lvfma Nvfma

On Intel Haswell/Broadwell/Skylake/Kabylake, Nvec is 4, Lvfma is 5, and Nvfma is 2. Thus, the register blocksize product mr nr must be at least 40 in order to overcome the floating-point latency. 6 x 8 = 48 more than satisfies this, but 8 x 6 would also be fine. The former is biased toward row-oriented SIMD output while the latter is better for column-oriented SIMD output. In BLIS, it almost never actually matters which one you pick because high-level logic will transpose the entire operation and make the output matrix C appear to be row- or column-stored, depending on the SIMD output "preference" of the microkernel.

Another interesting bit is the choice of 8x6 over 6x8 (or 8x4 over 4x8 for Sandy Bridge, which is our current implementation), which prefers column- vs row-storage in the C matrix. This then ties in with my question here: #31

@mratsim
Copy link

mratsim commented Dec 8, 2018

Hello there,

I'm the author of Arraymancer a Numpy + machine learning + deep learning written from scratch in Nim.

In the past month I've been building a new faster backend, with a specific focus on matrix multiplication as MKL, BLAS, BLIS were limiting my optimisation opportunities (like fusing neural network activations into GEMM or merging the im2col pass for convolution into matrix multiplication).

I've done extensive review of the literature here and added a lot of comments in my tiling code.

The most useful papers are:

[1] Anatomy of High-Performance Matrix Multiplication (Revised)
Kazushige Goto, Robert A. Van de Geijn
- http://www.cs.utexas.edu/~flame/pubs/GotoTOMS_revision.pdf

[2] Anatomy of High-Performance Many-Threaded Matrix Multiplication
Smith et al
- http://www.cs.utexas.edu/users/flame/pubs/blis3_ipdps14.pdf

[3] Automating the Last-Mile for High Performance Dense Linear Algebra
Veras et al
- https://arxiv.org/pdf/1611.08035.pdf

[4] GEMM: From Pure C to SSE Optimized Micro Kernels
Michael Lehn
- http://apfel.mathematik.uni-ulm.de/~lehn/sghpc/gemm/index.html

I also keep some extra links that I didn't have time to sort.

Anyway, in terms of performance I have a generic multi-threaded BLAS (float32, float64, int32, int64) that reaches between 97~102% of OpenBLAS on my Broadwell (AVX2 and FMA) laptop depending if multithreaded/serial/float32 or float64:

Kernel parameters

  • mr * nr: 6 * 16.
    • Why 16?: one-dimension must be a multiple of the vector size. With AVX it's 8. Also since CPU can issue 2 Fused-Multiply-Add in parallel (instruction-level parallelism) 16 is an easy choice.
    • Why 6?: We have 16 general purposes register on SSE2~AVX2. We need to keep a pointer to packed A, packed B and the loop index over kc. So with 6 I can unroll twice, using 12 registers, and have 4 left for bookeeping.
    • Why not 16x6?: C will be a MxN matrix, and very often C is a contiguous matrix that has been allocated. As by default I use row-major, 6x16 avoids transposing during the C update step and allow me to specialize when C as a unit stride along the N dimension.

Panel parameters

Proper tuning of mc and kc is very important as well.
Currently I use an arbitrary 768 bytes for mc and 2048 bytes for kc. In my testing I lost up to 35% performance with other values.

There are various constraint for both, the Goto paper goes quite in-depth into them:

  • micropanel of packed B of size kc * nr should remain in L1 cache, and so take less than half of it to not be evicted by other panels.
  • Panel of packed A of size kc * mc should take a considerable part of the L2 cache, but still stay addressable by the TLB.
    I had an algorithm to choose them, but as I can only query cache but no TLB information at the moment, I removed it and decided to tune manually.

@mratsim
Copy link

mratsim commented Dec 8, 2018

I forgot to add. As mentioned in the paper Automating the last mile. You need to choose your SIMD for C updates.

You can go for shuffle/permute or broadcast and balance them to a varying degree.

To find the best you need to check from which port those instructions can be issued. Also interleave them with FMA to hide data fetching latencies.

In my code I use full broadcast and no shuffle/permute but mainly because it was simpler to reason about, I didn't test other config.

@bluss
Copy link
Owner

bluss commented Dec 8, 2018

@mratsim Wow, cool to hear from you! Thanks for the links to the papers and for sharing your knowledge!

@bluss
Copy link
Owner

bluss commented Nov 13, 2021

Issue #59 allows tweaking the NC, MC, KC variables easily at compile-time which is one small step and a model for further compile time tweakability.

@ethanhs
Copy link

ethanhs commented Nov 25, 2021

Another idea which libsxmm uses is an autotuner, such as https://opentuner.org/. OpenTuner automatically evaluates the best parameters for the architecture the code is compiled on.

robertknight added a commit to robertknight/rten that referenced this issue Jan 3, 2023
These values were taken from the Blis kernel configuration for Haswell [1],
Setting NR to use 2 AVX registers takes advantage of the 2 FMA execution ports
[2].

[1] https://github.com/flame/blis/blob/f956b79922da412791e4c8b8b846b3aafc0a5ee0/kernels/haswell/bli_kernels_haswell.h#L55

[2] bluss/matrixmultiply#34 (comment)
robertknight added a commit to robertknight/rten that referenced this issue Jan 4, 2023
These values were taken from the Blis kernel configuration for Haswell [1],
Setting NR to use 2 AVX registers takes advantage of the 2 FMA execution ports
[2].

The clippy lint about the compiler optimizing away constant assertions
was disabled because that is exactly the behavior that we want. See also
rust-lang/rust-clippy#8159.

[1] https://github.com/flame/blis/blob/f956b79922da412791e4c8b8b846b3aafc0a5ee0/kernels/haswell/bli_kernels_haswell.h#L55

[2] bluss/matrixmultiply#34 (comment)
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

4 participants