perf: add AVX512-optimized is_ascii to fix native CPU regression #22

Merged: bonega merged 2 commits into master from fix/avx512-is-ascii on Jan 22, 2026

@bonega commented Jan 22, 2026

Summary

  • Fix severe performance regression when compiling with -C target-cpu=native on AVX512 CPUs
  • Add custom is_ascii implementation using AVX512BW intrinsics
  • Runtime feature detection ensures compatibility with all systems

Problem

When compiling with -C target-cpu=native on AVX512-capable CPUs, LLVM generates poor code for the standard library's is_ascii(), making ASCII-heavy workloads roughly 30x slower.

Solution

Add src/simd.rs with AVX512-optimized is_ascii using:

  • _mm512_movepi8_mask to extract MSB of 64 bytes at once
  • _mm512_maskz_loadu_epi8 for safe tail handling
  • Runtime detection via is_x86_feature_detected!("avx512bw")
  • Fallback to stdlib on non-AVX512 systems

Benchmarks

| Workload | Default compilation | With -C target-cpu=native |
|---|---|---|
| ASCII decode/encode | 1.3-1.8x faster | 30-35x faster |

Test plan

  • All existing tests pass (cargo test)
  • Benchmarks show expected improvements
  • Works on non-AVX512 systems (falls back to stdlib)

When compiling with `-C target-cpu=native` on AVX512 CPUs, LLVM generates
poor code for stdlib's is_ascii(), causing 30x slowdowns. This adds a custom
implementation using AVX512BW intrinsics with runtime feature detection.

- Add src/simd.rs with is_ascii() using _mm512_movepi8_mask
- Use masked loads for safe tail handling (<64 bytes)
- Runtime detection via is_x86_feature_detected!("avx512bw")
- Falls back to stdlib is_ascii() on non-AVX512 systems

Benchmarks vs master:
- ASCII workloads: 1.3-1.8x faster (default compilation)
- ASCII workloads: 30-35x faster (with -C target-cpu=native)

@bonega enabled auto-merge (squash) January 22, 2026 22:02

@github-actions (bot) left a comment

Yore Benchmarks

| Benchmark suite | Current: a6dcb41 | Previous: dd07916 | Ratio |
|---|---|---|---|
| decode_checked/mostly_ascii/8 | 33 ns/iter (± 12) | 34 ns/iter (± 13) | 0.97 |
| decode_checked/mostly_ascii/64 | 66 ns/iter (± 7) | 70 ns/iter (± 14) | 0.94 |
| decode_checked/mostly_ascii/256 | 139 ns/iter (± 13) | 143 ns/iter (± 12) | 0.97 |
| decode_checked/mostly_ascii/512 | 232 ns/iter (± 19) | 228 ns/iter (± 21) | 1.02 |
| decode_checked/mostly_ascii/1024 | 415 ns/iter (± 24) | 423 ns/iter (± 28) | 0.98 |
| decode_checked/mostly_ascii/2048 | 879 ns/iter (± 53) | 888 ns/iter (± 47) | 0.99 |
| decode_checked/mostly_ascii/4096 | 1964 ns/iter (± 74) | 2004 ns/iter (± 75) | 0.98 |
| decode_checked/ascii/8 | 9 ns/iter (± 0) | 8 ns/iter (± 0) | 1.13 |
| decode_checked/ascii/64 | 8 ns/iter (± 0) | 8 ns/iter (± 0) | 1 |
| decode_checked/ascii/256 | 10 ns/iter (± 0) | 10 ns/iter (± 0) | 1 |
| decode_checked/ascii/512 | 16 ns/iter (± 0) | 15 ns/iter (± 0) | 1.07 |
| decode_checked/ascii/1024 | 26 ns/iter (± 3) | 25 ns/iter (± 0) | 1.04 |
| decode_checked/ascii/2048 | 51 ns/iter (± 4) | 50 ns/iter (± 11) | 1.02 |
| decode_checked/ascii/4096 | 91 ns/iter (± 0) | 91 ns/iter (± 3) | 1 |
| decode_checked/extended/8 | 35 ns/iter (± 1) | 36 ns/iter (± 1) | 0.97 |
| decode_checked/extended/64 | 84 ns/iter (± 3) | 85 ns/iter (± 2) | 0.99 |
| decode_checked/extended/256 | 259 ns/iter (± 3) | 260 ns/iter (± 6) | 1.00 |
| decode_checked/extended/512 | 484 ns/iter (± 19) | 485 ns/iter (± 89) | 1.00 |
| decode_checked/extended/1024 | 904 ns/iter (± 5) | 902 ns/iter (± 11) | 1.00 |
| decode_checked/extended/2048 | 1758 ns/iter (± 48) | 1760 ns/iter (± 100) | 1.00 |
| decode_checked/extended/4096 | 3464 ns/iter (± 392) | 3478 ns/iter (± 32) | 1.00 |
| decode_lossy/all_bad/8 | 48 ns/iter (± 1) | 39 ns/iter (± 0) | 1.23 |
| decode_lossy/all_bad/64 | 63 ns/iter (± 0) | 63 ns/iter (± 1) | 1 |
| decode_lossy/all_bad/256 | 178 ns/iter (± 10) | 173 ns/iter (± 17) | 1.03 |
| decode_lossy/all_bad/512 | 336 ns/iter (± 42) | 323 ns/iter (± 7) | 1.04 |
| decode_lossy/all_bad/1024 | 647 ns/iter (± 21) | 622 ns/iter (± 78) | 1.04 |
| decode_lossy/all_bad/2048 | 1256 ns/iter (± 7) | 1220 ns/iter (± 36) | 1.03 |
| decode_lossy/all_bad/4096 | 2482 ns/iter (± 22) | 2394 ns/iter (± 35) | 1.04 |
| decode_lossy/mostly_ascii/8 | 47 ns/iter (± 21) | 39 ns/iter (± 15) | 1.21 |
| decode_lossy/mostly_ascii/64 | 65 ns/iter (± 6) | 65 ns/iter (± 15) | 1 |
| decode_lossy/mostly_ascii/256 | 155 ns/iter (± 16) | 148 ns/iter (± 17) | 1.05 |
| decode_lossy/mostly_ascii/512 | 245 ns/iter (± 21) | 236 ns/iter (± 23) | 1.04 |
| decode_lossy/mostly_ascii/1024 | 429 ns/iter (± 24) | 414 ns/iter (± 31) | 1.04 |
| decode_lossy/mostly_ascii/2048 | 806 ns/iter (± 40) | 789 ns/iter (± 39) | 1.02 |
| decode_lossy/mostly_ascii/4096 | 1583 ns/iter (± 130) | 1545 ns/iter (± 55) | 1.02 |
| encode_checked/mostly_ascii/8 | 42 ns/iter (± 17) | 42 ns/iter (± 17) | 1 |
| encode_checked/mostly_ascii/64 | 126 ns/iter (± 3) | 128 ns/iter (± 20) | 0.98 |
| encode_checked/mostly_ascii/256 | 456 ns/iter (± 6) | 457 ns/iter (± 17) | 1.00 |
| encode_checked/mostly_ascii/512 | 905 ns/iter (± 70) | 911 ns/iter (± 180) | 0.99 |
| encode_checked/mostly_ascii/1024 | 1921 ns/iter (± 46) | 1801 ns/iter (± 103) | 1.07 |
| encode_checked/mostly_ascii/2048 | 3693 ns/iter (± 192) | 3541 ns/iter (± 331) | 1.04 |
| encode_checked/mostly_ascii/4096 | 7053 ns/iter (± 294) | 7027 ns/iter (± 206) | 1.00 |
| encode_checked/ascii/8 | 9 ns/iter (± 0) | 8 ns/iter (± 0) | 1.13 |
| encode_checked/ascii/64 | 8 ns/iter (± 0) | 8 ns/iter (± 0) | 1 |
| encode_checked/ascii/256 | 10 ns/iter (± 0) | 11 ns/iter (± 0) | 0.91 |
| encode_checked/ascii/512 | 14 ns/iter (± 0) | 14 ns/iter (± 0) | 1 |
| encode_checked/ascii/1024 | 25 ns/iter (± 0) | 24 ns/iter (± 0) | 1.04 |
| encode_checked/ascii/2048 | 50 ns/iter (± 1) | 50 ns/iter (± 6) | 1 |
| encode_checked/ascii/4096 | 91 ns/iter (± 2) | 89 ns/iter (± 1) | 1.02 |
| encode_checked/extended/8 | 54 ns/iter (± 1) | 52 ns/iter (± 1) | 1.04 |
| encode_checked/extended/64 | 206 ns/iter (± 3) | 208 ns/iter (± 8) | 0.99 |
| encode_checked/extended/256 | 716 ns/iter (± 6) | 705 ns/iter (± 83) | 1.02 |
| encode_checked/extended/512 | 1400 ns/iter (± 15) | 1382 ns/iter (± 283) | 1.01 |
| encode_checked/extended/1024 | 2749 ns/iter (± 26) | 2747 ns/iter (± 162) | 1.00 |
| encode_checked/extended/2048 | 5513 ns/iter (± 45) | 5425 ns/iter (± 338) | 1.02 |
| encode_checked/extended/4096 | 10854 ns/iter (± 165) | 10785 ns/iter (± 218) | 1.01 |
| encode_lossy/all_bad/8 | 55 ns/iter (± 1) | 52 ns/iter (± 0) | 1.06 |
| encode_lossy/all_bad/64 | 251 ns/iter (± 4) | 227 ns/iter (± 2) | 1.11 |
| encode_lossy/all_bad/256 | 890 ns/iter (± 11) | 797 ns/iter (± 5) | 1.12 |
| encode_lossy/all_bad/512 | 1747 ns/iter (± 72) | 1559 ns/iter (± 10) | 1.12 |
| encode_lossy/all_bad/1024 | 3484 ns/iter (± 54) | 3094 ns/iter (± 20) | 1.13 |
| encode_lossy/all_bad/2048 | 6797 ns/iter (± 1146) | 6138 ns/iter (± 129) | 1.11 |
| encode_lossy/all_bad/4096 | 13522 ns/iter (± 602) | 12228 ns/iter (± 90) | 1.11 |

This comment was automatically generated by workflow using github-action-benchmark.

@github-actions (bot) left a comment

⚠️ Performance Alert ⚠️

Possible performance regression was detected for benchmark 'Yore Benchmarks'.
Benchmark result of this commit is worse than the previous benchmark result exceeding threshold 1.10.

| Benchmark suite | Current: a6dcb41 | Previous: dd07916 | Ratio |
|---|---|---|---|
| decode_checked/ascii/8 | 9 ns/iter (± 0) | 8 ns/iter (± 0) | 1.13 |
| decode_lossy/all_bad/8 | 48 ns/iter (± 1) | 39 ns/iter (± 0) | 1.23 |
| decode_lossy/mostly_ascii/8 | 47 ns/iter (± 21) | 39 ns/iter (± 15) | 1.21 |
| encode_checked/ascii/8 | 9 ns/iter (± 0) | 8 ns/iter (± 0) | 1.13 |
| encode_lossy/all_bad/64 | 251 ns/iter (± 4) | 227 ns/iter (± 2) | 1.11 |
| encode_lossy/all_bad/256 | 890 ns/iter (± 11) | 797 ns/iter (± 5) | 1.12 |
| encode_lossy/all_bad/512 | 1747 ns/iter (± 72) | 1559 ns/iter (± 10) | 1.12 |
| encode_lossy/all_bad/1024 | 3484 ns/iter (± 54) | 3094 ns/iter (± 20) | 1.13 |
| encode_lossy/all_bad/2048 | 6797 ns/iter (± 1146) | 6138 ns/iter (± 129) | 1.11 |
| encode_lossy/all_bad/4096 | 13522 ns/iter (± 602) | 12228 ns/iter (± 90) | 1.11 |

This comment was automatically generated by workflow using github-action-benchmark.

@bonega merged commit 65ad00d into master on Jan 22, 2026
5 checks passed