New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Integrating with SB-SIMD #1
Comments
Right now, As usual, the biggest challenge is naming things. How do you like |
Glad to hear the first part. I'm no good at naming and I'm no a specialist in SB-SIMD so I won't be able to say anything meaningful about naming though. |
OK, I'll try to get this into sbcl 2.3.5. |
The C world would be happy with untyped SIMD packs, but it wouldn't fit well with lisp. Will we generate a distinct Also, are there any particular fixed set of operations (looping while providing access) that we intend to provide? If so, we could design the primitives centered around those operations, but otherwise, more general primitives do seem necessary. |
The X.Y would be for each element type abbreviation X (f32, f64, u8, ..., s64), and for each corresponding pack width Y. The rule is that the number of bits in X, multiplied by Y, must be the size of a SIMD register (either 128 or 256, and maybe 512 one day). With ten different element types X, two kinds of Y each, both loads and stores, and both regular and non-temporal variants, this makes for 10 x 2 x 2 x 2 = 80 functions. Plus another 40 functions (with 128 bit only) for SSE. Luckily, sb-simd can generate all these variants from a table :)
I started working on a library for this some time ago: https://github.com/marcoheisig/Loopus I think what users want is to write the non-SIMD loop only, and then use an auto-vectorizing macro. Unfortunately, I never finished Loopus. Maybe this is something I could work on for next year's SBCL developer meeting 🤔 |
A naïve non-CL question from a non-SIMD developer: let's assume that we have a structure comprised of a f32 and a ub64, so heterogenous data, 96 bits in total. Is it somehow possible to efficiently operate on the float parts only via some sort of masking? |
@phoe Sure, sb-simd has masking operations. But if you only have a single structure instance with a single value, SIMD won't do much for you. Things get interesting once you have more than one instance though. In that case, everything depends on the memory layout. If you have an array of structures layout then things are somewhat awkward, but if you have a structure of arrays layout then you'll have one array containing only the values of the first slots, followed by one array of values of the second slot, and so on. In the latter case, SIMD works really well. |
Welp, apologies; rather than a structure of arrays (which is the trivial case) I meant an array of 96-bit structures f32 + ub64. This example would mean that a 256-bit register needs to operate only on the 1st, 4th, 7th 32-bit chunk, then only 2nd, 5th, 8th, then only 3rd and 6th before wrapping over. I assume that this is possible with masking and this still gives some sort of non-negligible speedup over simple iteration? |
Even without masking, there are gather/scatter instructions that allow loading from / storing to offsets from a base address. I'm not sure about the speedup numbers though. |
Ah, wait, so was the plan to generate these instructions for cases where X can be, say,
Oh yes, you had mentioned it in your paper the previous year. Unfortunately I haven't had the time to dive into it, and I'm stuck in that "20% of the things provide 80% of functionality" local optima that can be a bane for lisp. |
I don't know if a |
Oh, okay! I think I misunderstood somethings above. Yup, okay, seems like the issue can be resolved by (i) having SB-SIMD functions that work on foreign pointers, and (ii) providing a LOOPUS or equivalent functionality. |
A more general idea on the wishlist is for SIMD support in CFFI itself. |
I haven't yet added gather/scatter instructions to sb-simd, because they are also not much faster than just emitting one load per element and then packing those values. The resulting CPU microcode is probably almost identical.
If you --- or anyone else --- wants to help with Loopus or start a related project, don't hesitate to contact me. I also wrote https://github.com/marcoheisig/cl-isl so that one can use the same secret sauce as GCC or Clang when it comes to optimizing loop nests.
My original plan was to write a library called |
Since we now have packed arrays of structures, it should perhaps be possible to use SIMD to operate on these for going maximum brrrrrr. SB-SIMD became a contrib a while ago thanks to @marcoheisig; it provides an interface for SIMD instructions on SBCL. Will it be possible to somehow leverage it for use with unboxables?
One obstacle is that what we have here is not SIMD packs, or CL specialized vectors, but raw pointers with some external type information and possibly heterogenous data inside. Can SB-SIMD be somehow used with these?
The text was updated successfully, but these errors were encountered: