Context

Current scanner uses AVX2 + PCLMUL (128-bit). On CPUs supporting `avx512bw` + `vpclmulqdq` (Ice Lake / Sapphire Rapids / Zen 4+), a 128-byte chunk path could halve the loop iteration count.

Prerequisite: CPU support audit

This issue is gated on confirming that the project's actual build/CI hosts support `vpclmulqdq`. If not, ROI is 0 and the issue should be deferred indefinitely.

- Local dev host: confirmed missing `vpclmulqdq` (Skylake-X / Skylake-SP; has `avx512bw` but not `vpclmulqdq`). Cannot test locally.
- CI runners: `ubuntu-latest` runner CPUs vary by allocation. Add a one-line diagnostic to the workflow, collect its output over several CI runs, and only proceed if `vpclmulqdq` is reliably present.

If CI runners do not reliably provide `vpclmulqdq`, the only path to validating this is paid `larger-runners` or self-hosted runners.

Proposal (pending CPU confirmation)

- New `src/scan/avx512.rs` mirroring `avx2.rs` with 128-byte chunks
- Dispatcher (`src/scan/mod.rs`): AVX-512 → AVX2 → scalar fallback chain
- New `avx512` feature flag (default off) so release builds stay portable
- Use `_mm512_clmulepi64_epi128` for the inside-string prefix-XOR

Estimated impact

| Hardware | Est. speedup |
|---|---|
| CPUs with `avx512bw` + `vpclmulqdq` | ~1.5–2× scan throughput |
| Other CPUs | 0 (dispatcher falls back) |

Validation plan

- `scanner_crosscheck` proptest extended to compare AVX-512 vs scalar
- … `vpclmulqdq`
- `make bench` 3-run median on supported hardware

Recommendation

Last in the perf followup queue. The CPU support situation is uncertain; if it turns out CI runners don't have `vpclmulqdq`, this is dead code we maintain forever. Do the cheap wins (#5 memchr, #6 pooling, #7 PGO, #8 micro-opts) first.
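
For the CI audit described under the prerequisite, one way to collect the data is a tiny Rust helper that reads the CPU flags the runner reports. This is only a sketch: it assumes a Linux x86 runner that exposes a `flags` line in `/proc/cpuinfo`, and `cpu_has_flag` and `report_host_flags` are hypothetical names, not existing project code.

```rust
use std::fs;

/// Returns true if `flag` appears as a whole token on any `flags` line of
/// /proc/cpuinfo-style text. Whole-token matching matters here: a plain
/// substring search for "pclmulqdq" would also match "vpclmulqdq".
pub fn cpu_has_flag(cpuinfo: &str, flag: &str) -> bool {
    cpuinfo
        .lines()
        .filter(|line| line.starts_with("flags"))
        .filter_map(|line| line.split_once(':'))
        .any(|(_, flags)| flags.split_whitespace().any(|f| f == flag))
}

/// Builds a one-line summary for the CI log.
/// Assumption: Linux host; elsewhere the read fails and that is reported instead.
pub fn report_host_flags() -> String {
    match fs::read_to_string("/proc/cpuinfo") {
        Ok(text) => ["avx2", "pclmulqdq", "avx512bw", "vpclmulqdq"]
            .iter()
            .map(|flag| format!("{}={}", flag, cpu_has_flag(&text, flag)))
            .collect::<Vec<_>>()
            .join(" "),
        Err(err) => format!("could not read /proc/cpuinfo: {}", err),
    }
}
```

Printing `report_host_flags()` from a small binary or test in each CI run would give a per-allocation record of whether `vpclmulqdq` was present.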
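
The AVX-512 → AVX2 → scalar fallback chain from the proposal could look roughly like the following. The names `Backend`, `CpuCaps`, `choose_backend`, `detect_caps` and `select_backend` are illustrative assumptions, and the sketch only picks a backend; it does not contain any scanner code. The selection logic is kept separate from detection so it can be tested on hosts without AVX-512.

```rust
/// Scanner backends in order of preference (hypothetical names).
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum Backend {
    Avx512,
    Avx2,
    Scalar,
}

/// CPU capabilities relevant to the fallback chain.
#[derive(Debug, Clone, Copy, Default)]
pub struct CpuCaps {
    pub avx512bw: bool,
    pub vpclmulqdq: bool,
    pub avx2: bool,
    pub pclmulqdq: bool,
}

/// Pure selection logic: AVX-512 only if the feature flag is on and both
/// required CPU features are present, then AVX2 + PCLMUL, then scalar.
pub fn choose_backend(caps: CpuCaps, avx512_enabled: bool) -> Backend {
    if avx512_enabled && caps.avx512bw && caps.vpclmulqdq {
        Backend::Avx512
    } else if caps.avx2 && caps.pclmulqdq {
        Backend::Avx2
    } else {
        Backend::Scalar
    }
}

/// Runtime detection on x86_64 using the standard library macro.
#[cfg(target_arch = "x86_64")]
pub fn detect_caps() -> CpuCaps {
    CpuCaps {
        avx512bw: is_x86_feature_detected!("avx512bw"),
        vpclmulqdq: is_x86_feature_detected!("vpclmulqdq"),
        avx2: is_x86_feature_detected!("avx2"),
        pclmulqdq: is_x86_feature_detected!("pclmulqdq"),
    }
}

/// On other architectures nothing is detected, so the scalar path is chosen.
#[cfg(not(target_arch = "x86_64"))]
pub fn detect_caps() -> CpuCaps {
    CpuCaps::default()
}

/// Assumption: the proposed `avx512` Cargo feature gates the wide path.
pub fn select_backend() -> Backend {
    choose_backend(detect_caps(), cfg!(feature = "avx512"))
}
```

With this shape, the Skylake-X dev host mentioned above (`avx512bw` present, `vpclmulqdq` missing) would resolve to `Backend::Avx2`, and a default build without the feature flag would never select `Backend::Avx512`.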
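
As background for the prefix-XOR item in the proposal: a carry-less multiply of a quote bitmask by an all-ones operand yields, in its low 64 bits, the running XOR of the mask, which marks the bytes that lie inside a string. The portable model below shows that identity without any intrinsics. It assumes a simdjson-style convention (one bit per byte, opening quote included, closing quote excluded) and carries state between 64-byte blocks; whether the project's scanner uses exactly this layout is an assumption, and all function names are hypothetical.

```rust
/// Low 64 bits of the carry-less (GF(2)) product of `a` and `b`.
/// A portable model of what one 64x64 carry-less multiply produces.
pub fn clmul64_lo(a: u64, b: u64) -> u64 {
    let mut acc = 0u64;
    for i in 0..64 {
        if (b >> i) & 1 == 1 {
            acc ^= a << i;
        }
    }
    acc
}

/// Prefix XOR: bit i of the result is the XOR of bits 0..=i of `x`.
pub fn prefix_xor(x: u64) -> u64 {
    clmul64_lo(x, u64::MAX)
}

/// The same result via a shift-and-XOR cascade, for cross-checking.
pub fn prefix_xor_shifts(mut x: u64) -> u64 {
    x ^= x << 1;
    x ^= x << 2;
    x ^= x << 4;
    x ^= x << 8;
    x ^= x << 16;
    x ^= x << 32;
    x
}

/// Inside-string masks for consecutive 64-byte blocks.
/// `quotes[k]` has bit i set where byte i of block k is an unescaped quote.
pub fn inside_string_masks(quotes: &[u64]) -> Vec<u64> {
    // All-ones if the previous block ended inside a string, else zero.
    let mut carry = 0u64;
    quotes
        .iter()
        .map(|&q| {
            let mask = prefix_xor(q) ^ carry;
            // Broadcast the top bit so the next block starts in the right state.
            carry = ((mask as i64) >> 63) as u64;
            mask
        })
        .collect()
}
```

The VPCLMULQDQ form of the instruction performs one such 64x64 carry-less multiply per 128-bit lane, so a wider path would compute several of these block masks per instruction while still needing the cross-block carry shown here.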
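
The crosscheck idea in the validation plan is a differential test: feed the same random input to a reference path and a candidate path and require identical output. The sketch below illustrates that with stand-ins, since the real `scanner_crosscheck` test and scanner entry points are not shown in this issue. It uses a hand-rolled xorshift generator instead of proptest to stay dependency-free, and a 128-byte chunked loop stands in for the wide path; every name here is hypothetical.

```rust
/// Reference: byte-at-a-time quote positions (stand-in for the scalar scanner).
pub fn quote_positions_scalar(input: &[u8]) -> Vec<usize> {
    let mut out = Vec::new();
    for (i, &b) in input.iter().enumerate() {
        if b == b'"' {
            out.push(i);
        }
    }
    out
}

/// Candidate: processes 128-byte chunks plus a shorter tail, building a bitmask
/// per chunk the way a SIMD compare would (stand-in for the wide path).
pub fn quote_positions_chunked(input: &[u8]) -> Vec<usize> {
    const CHUNK: usize = 128;
    let mut out = Vec::new();
    let mut base = 0;
    for chunk in input.chunks(CHUNK) {
        let mut mask: u128 = 0;
        for (i, &b) in chunk.iter().enumerate() {
            if b == b'"' {
                mask |= 1u128 << i;
            }
        }
        // Extract set bits from lowest to highest.
        while mask != 0 {
            out.push(base + mask.trailing_zeros() as usize);
            mask &= mask - 1;
        }
        base += chunk.len();
    }
    out
}

/// Minimal xorshift64 generator; the seed must be non-zero.
pub struct XorShift64(pub u64);

impl XorShift64 {
    pub fn next_u64(&mut self) -> u64 {
        let mut x = self.0;
        x ^= x << 13;
        x ^= x >> 7;
        x ^= x << 17;
        self.0 = x;
        x
    }
}

/// Runs `cases` random inputs through both paths.
/// Returns the first input on which they disagree, if any.
pub fn crosscheck(seed: u64, cases: usize) -> Result<(), Vec<u8>> {
    let mut rng = XorShift64(seed);
    for _ in 0..cases {
        // Lengths up to 399 cover empty input, partial chunks and chunk boundaries.
        let len = (rng.next_u64() % 400) as usize;
        let mut input = Vec::with_capacity(len);
        for _ in 0..len {
            // Bias toward quotes and backslashes, the bytes a scanner cares about.
            let byte = match rng.next_u64() % 4 {
                0 => b'"',
                1 => b'\\',
                _ => b'a' + (rng.next_u64() % 26) as u8,
            };
            input.push(byte);
        }
        if quote_positions_scalar(&input) != quote_positions_chunked(&input) {
            return Err(input);
        }
    }
    Ok(())
}
```

Returning the failing input makes a mismatch reproducible; a proptest version would get shrinking of that input for free, which is the main reason to keep the real test on proptest.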