Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Integrating with SB-SIMD #1

Open
phoe opened this issue May 13, 2023 · 14 comments
Open

Integrating with SB-SIMD #1

phoe opened this issue May 13, 2023 · 14 comments

Comments

@phoe
Copy link

phoe commented May 13, 2023

Since we now have packed arrays of structures, it should perhaps be possible to use SIMD to operate on these for going maximum brrrrrr. SB-SIMD became a contrib a while ago thanks to @marcoheisig; it provides an interface for SIMD instructions on SBCL. Will it be possible to somehow leverage it for use with unboxables?

One obstacle is that what we have here is not SIMD packs, or CL specialized vectors, but raw pointers with some external type information and possibly heterogenous data inside. Can SB-SIMD be somehow used with these?

@marcoheisig
Copy link

Right now, sb-simd only has instructions for loading data from vectors, not from raw memory. But adding these instructions is a piece of cake.

As usual, the biggest challenge is naming things. How do you like sap-ref-X.Y for the function that loads a pack of type X.Y from a supplied foreign pointer and offset?

@phoe
Copy link
Author

phoe commented May 13, 2023

Glad to hear the first part. I'm no good at naming and I'm no a specialist in SB-SIMD so I won't be able to say anything meaningful about naming though.

@marcoheisig
Copy link

OK, I'll try to get this into sbcl 2.3.5.

@digikar99
Copy link
Owner

The C world would be happy with untyped SIMD packs, but it wouldn't fit well with lisp.

Will we generate a distinct sap-ref-X.Y for every unboxable X and pack-fitting Y? It doesn't sound too bad. But does that also call for a distinct X.Y-values and make-X.Y for each case? We could go with a cffi:mem-ref like convention that takes in a type as an argument, but details need more working out.

Also, are there any particular fixed set of operations (looping while providing access) that we intend to provide? If so, we could design the primitives centered around those operations, but otherwise, more general primitives do seem necessary.

@marcoheisig
Copy link

Will we generate a distinct sap-ref-X.Y for every unboxable X and pack-fitting Y?

The X.Y would be for each element type abbreviation X (f32, f64, u8, ..., s64), and for each corresponding pack width Y. The rule is that the number of bits in X, multiplied by Y, must be the size of a SIMD register (either 128 or 256, and maybe 512 one day). With ten different element types X, two kinds of Y each, both loads and stores, and both regular and non-temporal variants, this makes for 10 x 2 x 2 x 2 = 80 functions. Plus another 40 functions (with 128 bit only) for SSE. Luckily, sb-simd can generate all these variants from a table :)

Also, are there any particular fixed set of operations (looping while providing access) that we intend to provide?

I started working on a library for this some time ago: https://github.com/marcoheisig/Loopus

I think what users want is to write the non-SIMD loop only, and then use an auto-vectorizing macro. Unfortunately, I never finished Loopus. Maybe this is something I could work on for next year's SBCL developer meeting 🤔

@phoe
Copy link
Author

phoe commented May 14, 2023

A naïve non-CL question from a non-SIMD developer: let's assume that we have a structure comprised of a f32 and a ub64, so heterogenous data, 96 bits in total. Is it somehow possible to efficiently operate on the float parts only via some sort of masking?

@marcoheisig
Copy link

@phoe Sure, sb-simd has masking operations. But if you only have a single structure instance with a single value, SIMD won't do much for you.

Things get interesting once you have more than one instance though. In that case, everything depends on the memory layout. If you have an array of structures layout then things are somewhat awkward, but if you have a structure of arrays layout then you'll have one array containing only the values of the first slots, followed by one array of values of the second slot, and so on. In the latter case, SIMD works really well.

@phoe
Copy link
Author

phoe commented May 14, 2023

Welp, apologies; rather than a structure of arrays (which is the trivial case) I meant an array of 96-bit structures f32 + ub64.

This example would mean that a 256-bit register needs to operate only on the 1st, 4th, 7th 32-bit chunk, then only 2nd, 5th, 8th, then only 3rd and 6th before wrapping over. I assume that this is possible with masking and this still gives some sort of non-negligible speedup over simple iteration?

@digikar99
Copy link
Owner

Even without masking, there are gather/scatter instructions that allow loading from / storing to offsets from a base address. I'm not sure about the speedup numbers though.

@digikar99
Copy link
Owner

The X.Y would be for each element type abbreviation X (f32, f64, u8, ..., s64), and for each corresponding pack width Y.

Ah, wait, so was the plan to generate these instructions for cases where X can be, say, point or triplet, or no such plans?

I started working on a library for this some time ago: https://github.com/marcoheisig/Loopus

Oh yes, you had mentioned it in your paper the previous year. Unfortunately I haven't had the time to dive into it, and I'm stuck in that "20% of the things provide 80% of functionality" local optima that can be a bane for lisp.

@phoe
Copy link
Author

phoe commented May 14, 2023

Ah, wait, so was the plan to generate these instructions for cases where X can be, say, point or triplet, or no such plans?

I don't know if a point or triplet can be SIMD-settable. I assume it would be the job of some sort of unboxables-simd system to compile some higher-level operations on user-defined types into sb-simd calls on the primitive constituents of that type.

@digikar99
Copy link
Owner

Oh, okay! I think I misunderstood somethings above.

Yup, okay, seems like the issue can be resolved by (i) having SB-SIMD functions that work on foreign pointers, and (ii) providing a LOOPUS or equivalent functionality.

@digikar99
Copy link
Owner

A more general idea on the wishlist is for SIMD support in CFFI itself.

@marcoheisig
Copy link

Even without masking, there are gather/scatter instructions that allow loading from / storing to offsets from a base address.

I haven't yet added gather/scatter instructions to sb-simd, because they are also not much faster than just emitting one load per element and then packing those values. The resulting CPU microcode is probably almost identical.

Yup, okay, seems like the issue can be resolved by (i) having SB-SIMD functions that work on foreign pointers, and (ii) providing a LOOPUS or equivalent functionality.

If you --- or anyone else --- wants to help with Loopus or start a related project, don't hesitate to contact me. I also wrote https://github.com/marcoheisig/cl-isl so that one can use the same secret sauce as GCC or Clang when it comes to optimizing loop nests.

A more general idea on the wishlist is for SIMD support in CFFI itself.

My original plan was to write a library called cl-simd to offer SIMD support for all (well, most) implementations. But given the daunting challenge of adding all those 1000+ instructions to half a dozen implementations, I decided to go for SBCL first. My hope is that this sets a de-facto standard that will be adopted by other implementation, so that a cl-simd portability library will emerge eventually. Once there is such a portability library, it would be straightforward to make CFFI use it. I really wouldn't want some #+sb-simd statements directly in CFFI, out of fear that I'd have to take on some of that maintenance burden.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

3 participants