perf(parquet/internal/encoding): vectorize amd64 bool unpack #735

Open
zeroshade wants to merge 2 commits into apache:main from zeroshade:perf/vectorize-amd64-bool-unpack

Conversation

@zeroshade
Member

Rationale for this change

The SSE4 and AVX2 implementations of _bytes_to_bools in parquet/internal/utils/ contain zero SIMD instructions. The compiler failed to auto-vectorize the original C loop, so the generated code is purely scalar (movzx/shr/and/mov, one bit at a time). The SSE4 and AVX2 .s files are byte-for-byte identical: just scalar code with different labels.

This is the amd64 counterpart to #731 which fixed the same issue on ARM64 NEON.

What changes are included in this PR?

Rewrote both assembly implementations with actual SIMD vectorized code.
SSE4 (unpack_bool_sse4_amd64.s) — processes 2 input bytes → 16 output bools per iteration:

  1. MOVWLZX + MOVD — load 2 input bytes into XMM
  2. PSHUFB — broadcast byte 0 → lanes 0-7, byte 1 → lanes 8-15
  3. PAND + PCMPEQB — parallel bit-test against mask [1,2,4,8,16,32,64,128] × 2
  4. PAND — normalize 0xFF → 0x01 for valid Go bool values
  5. MOVOU — store 16 output bools at once
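The mask trick in steps 2-4 can be emulated lane-by-lane in plain Go to make it concrete. This is an illustrative sketch only: the real kernel performs the same broadcast, bit-test, compare, and normalize across 16 XMM lanes at once.

```go
package main

import "fmt"

// unpackByteSIMDStyle mimics the SSE4 sequence for a single input byte:
// broadcast to 8 lanes (PSHUFB), AND with a per-lane bit mask (PAND),
// compare-equal against the mask (PCMPEQB, yielding 0xFF where the bit
// was set), then AND with 0x01 to normalize to a valid Go bool (PAND).
func unpackByteSIMDStyle(b byte) [8]byte {
	mask := [8]byte{1, 2, 4, 8, 16, 32, 64, 128}
	var lanes, out [8]byte
	for i := range lanes {
		lanes[i] = b // PSHUFB: broadcast the byte to all 8 lanes
	}
	for i := range out {
		t := lanes[i] & mask[i] // PAND: isolate one bit per lane
		if t == mask[i] {       // PCMPEQB: 0xFF where the bit was set
			t = 0xFF
		} else {
			t = 0
		}
		out[i] = t & 1 // PAND with 0x01: normalize 0xFF -> 0x01
	}
	return out
}

func main() {
	fmt.Println(unpackByteSIMDStyle(0b00000101)) // [1 0 1 0 0 0 0 0]
}
```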

AVX2 (unpack_bool_avx2_amd64.s) — processes 4 input bytes → 32 output bools per iteration:

  1. MOVL + MOVD + VPBROADCASTD — load and broadcast 4 bytes across all 32 YMM lanes
  2. VPSHUFB — distribute each byte to its 8 corresponding lanes
  3. VPAND + VPCMPEQB + VPAND — parallel bit-test + normalize to 0/1
  4. VMOVDQU — store 32 output bools at once
  5. VZEROUPPER — avoid SSE-AVX transition penalties on return

Both implementations fall back to a scalar tail when fewer than a full vector width of output slots remains.
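For reference, the scalar semantics that both kernels and their tails must reproduce, LSB-first bit expansion bounded by the output length, can be sketched in plain Go (bytesToBools is a hypothetical name here, not the library's exported symbol):

```go
package main

import "fmt"

// bytesToBools expands each input bit (least-significant bit first) into
// one bool, writing at most len(out) values. This mirrors the reference
// behavior the SIMD kernels are validated against.
func bytesToBools(in []byte, out []bool) {
	i := 0
	for _, b := range in {
		for bit := 0; bit < 8 && i < len(out); bit++ {
			out[i] = b&(1<<bit) != 0
			i++
		}
	}
}

func main() {
	out := make([]bool, 16)
	bytesToBools([]byte{0b10100101, 0xFF}, out)
	fmt.Println(out[:8]) // [true false true false false true false true]
}
```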

Are these changes tested?

All existing tests continue to pass, and new tests were added for further validation:

  • TestBytesToBoolsCorrectness — validates every bit position against the reference Go implementation for sizes 1–1024 bytes
  • TestBytesToBoolsOutlenSmaller — edge case where output is smaller than 8× input
  • BenchmarkBytesToBools — parametric benchmark at 64B, 256B, 1KB, 4KB, 16KB
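A parametric benchmark of this shape can be sketched as follows. Both bytesToBoolsRef and runBench are hypothetical stand-ins; the PR's actual benchmark and the function under test live in parquet/internal/utils.

```go
package main

import (
	"fmt"
	"testing"
)

// bytesToBoolsRef is a scalar stand-in for the kernel under test.
func bytesToBoolsRef(in []byte, out []bool) {
	i := 0
	for _, b := range in {
		for bit := 0; bit < 8 && i < len(out); bit++ {
			out[i] = b&(1<<bit) != 0
			i++
		}
	}
}

// runBench times one input size, mirroring the parametric shape of a
// BenchmarkBytesToBools that sub-benchmarks over several sizes.
func runBench(n int) testing.BenchmarkResult {
	in := make([]byte, n)
	out := make([]bool, n*8)
	return testing.Benchmark(func(b *testing.B) {
		b.SetBytes(int64(n)) // lets the result report MB/s
		for i := 0; i < b.N; i++ {
			bytesToBoolsRef(in, out)
		}
	})
}

func main() {
	for _, n := range []int{64, 256} {
		fmt.Printf("bytes=%d: %v\n", n, runBench(n))
	}
}
```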

Are there any user-facing changes?

No, this is purely a performance optimization:

Benchmark Results (AMD Ryzen 7 7800X3D, linux/amd64)

                               baseline (scalar)   optimized (AVX2)
                                   sec/op              sec/op       vs base
BytesToBools/bytes=64-16           146.0n              15.60n     -89.32% (p=0.008)
BytesToBools/bytes=256-16          562.3n              63.36n     -88.73% (p=0.008)
BytesToBools/bytes=1K-16           2.247µ              253.9n     -88.70% (p=0.008)
BytesToBools/bytes=4K-16           8.970µ              1.018µ     -88.65% (p=0.008)
BytesToBools/bytes=16K-16         35.798µ              4.044µ     -88.70% (p=0.008)
geomean                            2.262µ              252.8n     -88.82%

Throughput: 432 MiB/s → 3,853 MiB/s (+795%)
Zero allocations in both versions. All results statistically significant.
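The throughput figures follow directly from the latencies; a quick arithmetic check (MiB/s = bytes per op / sec per op / 2^20) reproduces the 64B numbers from the commit message:

```go
package main

import "fmt"

// mibps converts a benchmark latency (ns per op over `bytes` input bytes)
// into MiB/s.
func mibps(bytes, nsPerOp float64) float64 {
	return bytes / (nsPerOp * 1e-9) / (1 << 20)
}

func main() {
	fmt.Printf("baseline 64B:  %.0f MiB/s\n", mibps(64, 146.0))  // 418
	fmt.Printf("optimized 64B: %.0f MiB/s\n", mibps(64, 15.60)) // 3913
}
```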

Keep go.mod at 1.24.0 to maintain compatibility with TinyGo 0.38.0,
which doesn't yet support Go 1.25. CI still tests Go 1.25 and 1.26
to ensure forward compatibility.
@zeroshade zeroshade requested review from kou and lidavidm March 27, 2026 20:27
…ghput

Replace the scalar bit-by-bit implementations of _bytes_to_bools_sse4
and _bytes_to_bools_avx2 with actual SIMD vectorized code. The previous
implementations were auto-generated by c2goasm from clang output that
failed to auto-vectorize, resulting in purely scalar code (movzx/shr/and/
mov one bit at a time) despite being labeled as SSE4 and AVX2.

SSE4: uses PSHUFB to broadcast 2 input bytes into 16 XMM lanes, then
PAND+PCMPEQB for parallel bit-test and PAND to normalize to 0/1.
Processes 2 bytes (16 bools) per iteration.

AVX2: uses VPBROADCASTD+VPSHUFB to broadcast 4 input bytes into 32 YMM
lanes, then VPAND+VPCMPEQB+VPAND for parallel bit-test and normalize.
Processes 4 bytes (32 bools) per iteration. Includes VZEROUPPER to
avoid SSE-AVX transition penalties.

Both include scalar tails for edge cases with <vector-width output slots.

Benchmarks on AMD Ryzen 7 7800X3D (linux/amd64):

  BytesToBools/64B   146.0ns -> 15.60ns  (9.4x, 418->3913 MiB/s)
  BytesToBools/256B  562.3ns -> 63.36ns  (8.9x, 434->3853 MiB/s)
  BytesToBools/1KB   2247ns  -> 253.9ns  (8.8x, 435->3846 MiB/s)
  BytesToBools/4KB   8970ns  -> 1018ns   (8.8x, 436->3838 MiB/s)
  BytesToBools/16KB  35798ns -> 4044ns   (8.9x, 437->3864 MiB/s)

  geomean: -88.8% latency, +795% throughput
@zeroshade zeroshade force-pushed the perf/vectorize-amd64-bool-unpack branch from 8380b64 to f11e97a on March 27, 2026 20:38