perf(parquet/internal/encoding): vectorize amd64 bool unpack #735

Open
zeroshade wants to merge 2 commits into apache:main from zeroshade:perf/vectorize-amd64-bool-unpack

Conversation

@zeroshade
Member

Rationale for this change

The SSE4 and AVX2 implementations of _bytes_to_bools in parquet/internal/utils/ contain zero SIMD instructions. The compiler failed to auto-vectorize the original C loop, so the generated code is purely scalar (movzx/shr/and/mov, one bit at a time). The SSE4 and AVX2 .s files are byte-for-byte identical: just scalar code with different labels.

This is the amd64 counterpart to #731 which fixed the same issue on ARM64 NEON.

What changes are included in this PR?

Rewrote both assembly implementations with actual SIMD vectorized code.
SSE4 (unpack_bool_sse4_amd64.s) — processes 2 input bytes → 16 output bools per iteration:

  1. MOVWLZX + MOVD — load 2 input bytes into XMM
  2. PSHUFB — broadcast byte 0 → lanes 0-7, byte 1 → lanes 8-15
  3. PAND + PCMPEQB — parallel bit-test against mask [1,2,4,8,16,32,64,128] × 2
  4. PAND — normalize 0xFF → 0x01 for valid Go bool values
  5. MOVOU — store 16 output bools at once
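The mask trick in steps 2-4 can be emulated lane-by-lane in plain Go to make it concrete. This is an illustrative sketch only: the real kernel performs the same broadcast, bit-test, compare, and normalize across 16 XMM lanes at once.

```go
package main

import "fmt"

// unpackByteSIMDStyle mimics the SSE4 sequence for a single input byte:
// broadcast to 8 lanes (PSHUFB), AND with a per-lane bit mask (PAND),
// compare-equal against the mask (PCMPEQB, yielding 0xFF where the bit
// was set), then AND with 0x01 to normalize to a valid Go bool (PAND).
func unpackByteSIMDStyle(b byte) [8]byte {
	mask := [8]byte{1, 2, 4, 8, 16, 32, 64, 128}
	var lanes, out [8]byte
	for i := range lanes {
		lanes[i] = b // PSHUFB: broadcast the byte to all 8 lanes
	}
	for i := range out {
		t := lanes[i] & mask[i] // PAND: isolate one bit per lane
		if t == mask[i] {       // PCMPEQB: 0xFF where the bit was set
			t = 0xFF
		} else {
			t = 0
		}
		out[i] = t & 1 // PAND with 0x01: normalize 0xFF -> 0x01
	}
	return out
}

func main() {
	fmt.Println(unpackByteSIMDStyle(0b00000101)) // [1 0 1 0 0 0 0 0]
}
```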

AVX2 (unpack_bool_avx2_amd64.s) — processes 4 input bytes → 32 output bools per iteration:

  1. MOVL + MOVD + VPBROADCASTD — load and broadcast 4 bytes across all 32 YMM lanes
  2. VPSHUFB — distribute each byte to its 8 corresponding lanes
  3. VPAND + VPCMPEQB + VPAND — parallel bit-test + normalize to 0/1
  4. VMOVDQU — store 32 output bools at once
  5. VZEROUPPER — avoid SSE-AVX transition penalties on return

Both implementations fall back to a scalar tail when fewer than a full vector width of output slots remains.
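For reference, the scalar semantics that both kernels and their tails must reproduce, LSB-first bit expansion bounded by the output length, can be sketched in plain Go (bytesToBools is a hypothetical name here, not the library's exported symbol):

```go
package main

import "fmt"

// bytesToBools expands each input bit (least-significant bit first) into
// one bool, writing at most len(out) values. This mirrors the reference
// behavior the SIMD kernels are validated against.
func bytesToBools(in []byte, out []bool) {
	i := 0
	for _, b := range in {
		for bit := 0; bit < 8 && i < len(out); bit++ {
			out[i] = b&(1<<bit) != 0
			i++
		}
	}
}

func main() {
	out := make([]bool, 16)
	bytesToBools([]byte{0b10100101, 0xFF}, out)
	fmt.Println(out[:8]) // [true false true false false true false true]
}
```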

Are these changes tested?

All existing tests continue to pass, and new tests were added for further validation:

  • TestBytesToBoolsCorrectness — validates every bit position against the reference Go implementation for sizes 1–1024 bytes
  • TestBytesToBoolsOutlenSmaller — edge case where output is smaller than 8× input
  • BenchmarkBytesToBools — parametric benchmark at 64B, 256B, 1KB, 4KB, 16KB
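A parametric benchmark of this shape can be sketched as follows. Both bytesToBoolsRef and runBench are hypothetical stand-ins; the PR's actual benchmark and the function under test live in parquet/internal/utils.

```go
package main

import (
	"fmt"
	"testing"
)

// bytesToBoolsRef is a scalar stand-in for the kernel under test.
func bytesToBoolsRef(in []byte, out []bool) {
	i := 0
	for _, b := range in {
		for bit := 0; bit < 8 && i < len(out); bit++ {
			out[i] = b&(1<<bit) != 0
			i++
		}
	}
}

// runBench times one input size, mirroring the parametric shape of a
// BenchmarkBytesToBools that sub-benchmarks over several sizes.
func runBench(n int) testing.BenchmarkResult {
	in := make([]byte, n)
	out := make([]bool, n*8)
	return testing.Benchmark(func(b *testing.B) {
		b.SetBytes(int64(n)) // lets the result report MB/s
		for i := 0; i < b.N; i++ {
			bytesToBoolsRef(in, out)
		}
	})
}

func main() {
	for _, n := range []int{64, 256} {
		fmt.Printf("bytes=%d: %v\n", n, runBench(n))
	}
}
```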

Are there any user-facing changes?

No, this is purely a performance optimization:

Benchmark Results (AMD Ryzen 7 7800X3D, linux/amd64)

                               baseline (scalar)   optimized (AVX2)
                                   sec/op              sec/op       vs base
BytesToBools/bytes=64-16           146.0n              15.60n     -89.32% (p=0.008)
BytesToBools/bytes=256-16          562.3n              63.36n     -88.73% (p=0.008)
BytesToBools/bytes=1K-16           2.247µ              253.9n     -88.70% (p=0.008)
BytesToBools/bytes=4K-16           8.970µ              1.018µ     -88.65% (p=0.008)
BytesToBools/bytes=16K-16         35.798µ              4.044µ     -88.70% (p=0.008)
geomean                            2.262µ              252.8n     -88.82%

Throughput: 432 MiB/s → 3,853 MiB/s (+795%)
Zero allocations in both versions. All results statistically significant.
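The throughput figures follow directly from the latencies; a quick arithmetic check (MiB/s = bytes per op / sec per op / 2^20) reproduces the 64B numbers from the commit message:

```go
package main

import "fmt"

// mibps converts a benchmark latency (ns per op over `bytes` input bytes)
// into MiB/s.
func mibps(bytes, nsPerOp float64) float64 {
	return bytes / (nsPerOp * 1e-9) / (1 << 20)
}

func main() {
	fmt.Printf("baseline 64B:  %.0f MiB/s\n", mibps(64, 146.0))  // 418
	fmt.Printf("optimized 64B: %.0f MiB/s\n", mibps(64, 15.60)) // 3913
}
```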

Keep go.mod at 1.24.0 to maintain compatibility with TinyGo 0.38.0,
which doesn't yet support Go 1.25. CI still tests Go 1.25 and 1.26
to ensure forward compatibility.
@zeroshade zeroshade requested review from kou and lidavidm March 27, 2026 20:27
…ghput

Replace the scalar bit-by-bit implementations of _bytes_to_bools_sse4
and _bytes_to_bools_avx2 with actual SIMD vectorized code. The previous
implementations were auto-generated by c2goasm from clang output that
failed to auto-vectorize, resulting in purely scalar code (movzx/shr/and/
mov one bit at a time) despite being labeled as SSE4 and AVX2.

SSE4: uses PSHUFB to broadcast 2 input bytes into 16 XMM lanes, then
PAND+PCMPEQB for parallel bit-test and PAND to normalize to 0/1.
Processes 2 bytes (16 bools) per iteration.

AVX2: uses VPBROADCASTD+VPSHUFB to broadcast 4 input bytes into 32 YMM
lanes, then VPAND+VPCMPEQB+VPAND for parallel bit-test and normalize.
Processes 4 bytes (32 bools) per iteration. Includes VZEROUPPER to
avoid SSE-AVX transition penalties.

Both include scalar tails for edge cases with <vector-width output slots.

Benchmarks on AMD Ryzen 7 7800X3D (linux/amd64):

  BytesToBools/64B   146.0ns -> 15.60ns  (9.4x, 418->3913 MiB/s)
  BytesToBools/256B  562.3ns -> 63.36ns  (8.9x, 434->3853 MiB/s)
  BytesToBools/1KB   2247ns  -> 253.9ns  (8.8x, 435->3846 MiB/s)
  BytesToBools/4KB   8970ns  -> 1018ns   (8.8x, 436->3838 MiB/s)
  BytesToBools/16KB  35798ns -> 4044ns   (8.9x, 437->3864 MiB/s)

  geomean: -88.8% latency, +795% throughput
@zeroshade zeroshade force-pushed the perf/vectorize-amd64-bool-unpack branch from 8380b64 to f11e97a on March 27, 2026 20:38