v0.4: Arm NEON, SVE, and OpenMP-like Pools
This minor release implements the missing NEON and SVE kernel variants. On a dual-socket Graviton 4, they yield no noticeable improvement for single-precision float inputs over GCC 12 auto-vectorization. The release also adds a minimalistic thread-pool implementation based on fork_union, which performs worse than OpenMP on small inputs, highlighting the need for more work.
Minor
- Add: fork_union parallel version (c40b7f3)
- Add: Generic OpenMP pool (cce8be4)
- Add: NEON and SVE kernels (32d7d3e)
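
For readers unfamiliar with the fork-join pattern that both pool variants follow, here is a minimal sketch using only the C++ standard library. It is not the fork_union API and not the code from this release: `fork_join_sum` is a hypothetical helper, and it spawns fresh threads on every call instead of reusing a pool. It does show the fixed fork and join cost paid per call, which is one plausible reason small inputs are hard to parallelize profitably.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <numeric>
#include <thread>
#include <vector>

// Hypothetical helper, not part of fork_union or OpenMP.
// Splits `data` into contiguous slices, sums each slice on its own thread
// (fork), waits for all of them (join), then combines the partial sums.
inline double fork_join_sum(std::vector<float> const &data, std::size_t threads) {
    assert(threads > 0 && "need at least one worker");

    std::vector<double> partials(threads, 0.0);
    std::vector<std::thread> workers;
    workers.reserve(threads);

    // Ceiling division, so that every element lands in exactly one slice.
    std::size_t const chunk = (data.size() + threads - 1) / threads;

    for (std::size_t t = 0; t < threads; ++t) {
        std::size_t const begin = std::min(t * chunk, data.size());
        std::size_t const end = std::min(begin + chunk, data.size());
        workers.emplace_back([&data, &partials, t, begin, end] {
            auto const first = data.begin() + static_cast<std::ptrdiff_t>(begin);
            auto const last = data.begin() + static_cast<std::ptrdiff_t>(end);
            // Each worker writes only to its own slot, so no locking is needed.
            partials[t] = std::accumulate(first, last, 0.0);
        });
    }

    // Join: block until every worker has finished its slice.
    for (auto &worker : workers) worker.join();

    return std::accumulate(partials.begin(), partials.end(), 0.0);
}
```

A real pool keeps its workers alive between calls, so the per-call cost shrinks to waking and synchronizing them. That remaining cost still has to be amortized over enough work per thread.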