perf(parquet/internal/encoding): vectorize amd64 bool unpack #735
Open
zeroshade wants to merge 2 commits into apache:main from
Conversation
Keep go.mod at 1.24.0 to maintain compatibility with TinyGo 0.38.0, which doesn't yet support Go 1.25. CI still tests Go 1.25 and 1.26 to ensure forward compatibility.
…ghput

Replace the scalar bit-by-bit implementations of _bytes_to_bools_sse4 and _bytes_to_bools_avx2 with actual SIMD vectorized code. The previous implementations were auto-generated by c2goasm from clang output that failed to auto-vectorize, resulting in purely scalar code (movzx/shr/and/mov one bit at a time) despite being labeled as SSE4 and AVX2.

SSE4: uses PSHUFB to broadcast 2 input bytes into 16 XMM lanes, then PAND+PCMPEQB for a parallel bit test and PAND to normalize to 0/1. Processes 2 bytes (16 bools) per iteration.

AVX2: uses VPBROADCASTD+VPSHUFB to broadcast 4 input bytes into 32 YMM lanes, then VPAND+VPCMPEQB+VPAND for the parallel bit test and normalization. Processes 4 bytes (32 bools) per iteration. Includes VZEROUPPER to avoid SSE-AVX transition penalties.

Both include scalar tails for edge cases with fewer than vector-width output slots.

Benchmarks on AMD Ryzen 7 7800X3D (linux/amd64):

BytesToBools/64B    146.0ns -> 15.60ns  (9.4x, 418 -> 3913 MiB/s)
BytesToBools/256B   562.3ns -> 63.36ns  (8.9x, 434 -> 3853 MiB/s)
BytesToBools/1KB     2247ns -> 253.9ns  (8.8x, 435 -> 3846 MiB/s)
BytesToBools/4KB     8970ns ->  1018ns  (8.8x, 436 -> 3838 MiB/s)
BytesToBools/16KB   35798ns ->  4044ns  (8.9x, 437 -> 3864 MiB/s)

geomean: -88.8% latency, +795% throughput
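For readers unfamiliar with the routine, the semantics being vectorized can be sketched as a portable scalar reference in Go (the function name here is illustrative, not the library's export): each input byte expands LSB-first into up to 8 output bools, which is exactly what the SIMD kernels and their scalar tails must reproduce.

```go
package main

// bytesToBoolsScalar is a hypothetical scalar reference for the routine
// this PR vectorizes: it expands each input byte into up to 8 output
// bools, least-significant bit first.
func bytesToBoolsScalar(in []byte, out []bool) {
	for i := range out {
		// Select the byte holding bit i, then test bit (i mod 8).
		out[i] = (in[i/8]>>(uint(i)%8))&1 != 0
	}
}
```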
8380b64 to f11e97a
lidavidm approved these changes Mar 29, 2026
Rationale for this change
The SSE4 and AVX2 implementations of _bytes_to_bools in parquet/internal/utils/ contain zero SIMD instructions. They completely failed to auto-vectorize the C loop, producing purely scalar code (movzx/shr/and/mov one bit at a time). The SSE4 and AVX2 .s files are byte-for-byte identical — just scalar code with different labels.
This is the amd64 counterpart to #731 which fixed the same issue on ARM64 NEON.
What changes are included in this PR?
Rewrote both assembly implementations with actual SIMD vectorized code.
SSE4 (unpack_bool_sse4_amd64.s) — processes 2 input bytes → 16 output bools per iteration:
AVX2 (unpack_bool_avx2_amd64.s) — processes 4 input bytes → 32 output bools per iteration:
Both include scalar tails for when fewer than vector-width output slots remain.
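The broadcast-and-mask trick both kernels rely on can be modeled one lane at a time in plain Go (a sketch for illustration, not the shipped assembly): broadcast one input byte to every lane, AND each lane against a per-lane bit mask, compare-equal against the mask, then normalize the all-ones compare result down to a 0/1 byte.

```go
package main

// unpackByteMaskCompare models the PSHUFB/VPSHUFB broadcast followed by
// the PAND+PCMPEQB+PAND sequence, scalar lane by lane, for one byte.
func unpackByteMaskCompare(b byte) [8]byte {
	masks := [8]byte{1, 2, 4, 8, 16, 32, 64, 128}
	var out [8]byte
	for lane, m := range masks {
		// PAND: isolate this lane's bit from the broadcast byte.
		t := b & m
		// PCMPEQB: lane becomes all-ones if the bit was set, else zero.
		var cmp byte
		if t == m {
			cmp = 0xFF
		}
		// Final PAND with 0x01 normalizes the mask to a 0/1 bool byte.
		out[lane] = cmp & 1
	}
	return out
}
```

Normalizing with a trailing AND is what lets the result be stored directly as Go `[]bool` memory, where true must be exactly 0x01.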
Are these changes tested?
All existing tests continue to pass, new tests added to further validate:
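A sketch of the kind of validation this calls for (function names are hypothetical; the PR's actual tests live in the repository): compare a candidate unpack routine against a scalar reference at output lengths straddling the 16- and 32-lane vector widths, so both the vectorized body and the scalar tail paths are exercised.

```go
package main

import "fmt"

// referenceUnpack is the scalar ground truth: LSB-first bit expansion.
func referenceUnpack(in []byte, out []bool) {
	for i := range out {
		out[i] = (in[i/8]>>(uint(i)%8))&1 != 0
	}
}

// checkLengths compares a candidate implementation against the reference
// for lengths around the SSE4 (16-bool) and AVX2 (32-bool) strides.
func checkLengths(candidate func([]byte, []bool)) error {
	in := make([]byte, 8)
	for i := range in {
		in[i] = byte(0xA5 >> (i % 3)) // arbitrary non-uniform bit pattern
	}
	for _, n := range []int{0, 1, 7, 15, 16, 17, 31, 32, 33, 63} {
		got := make([]bool, n)
		want := make([]bool, n)
		candidate(in, got)
		referenceUnpack(in, want)
		for i := range got {
			if got[i] != want[i] {
				return fmt.Errorf("len %d: mismatch at index %d", n, i)
			}
		}
	}
	return nil
}
```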
Are there any user-facing changes?
No, this is purely a performance optimization:
Benchmark Results (AMD Ryzen 7 7800X3D, linux/amd64)
Throughput: 432 MiB/s → 3,853 MiB/s (+795%)
Zero allocations in both versions. All results statistically significant.
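The throughput figures above come from Go benchmarks that call b.SetBytes with the input size, so ns/op converts to MiB/s. A minimal sketch of that measurement pattern, using a scalar stand-in for the SIMD kernel (names are illustrative):

```go
package main

import "testing"

// benchUnpack measures a bytes-to-bools routine the way the PR's
// numbers were produced: SetBytes(input size) makes the testing
// package report MiB/s alongside ns/op.
func benchUnpack(n int) testing.BenchmarkResult {
	in := make([]byte, n)
	out := make([]bool, n*8)
	return testing.Benchmark(func(b *testing.B) {
		b.SetBytes(int64(n))
		for i := 0; i < b.N; i++ {
			// Scalar stand-in for the vectorized kernel under test.
			for j := range out {
				out[j] = (in[j/8]>>(uint(j)%8))&1 != 0
			}
		}
	})
}
```

Reusing the same in/out buffers across iterations is what keeps the benchmark at zero allocations per op.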