Skip to content

tscore: SIMD ts::memcpy_tolower; use in URL.cc cache-key fast path#13167

Draft
phongn wants to merge 1 commit into
apache:masterfrom
phongn:simd-bulk-tolower
Draft

tscore: SIMD ts::memcpy_tolower; use in URL.cc cache-key fast path#13167
phongn wants to merge 1 commit into
apache:masterfrom
phongn:simd-bulk-tolower

Conversation

@phongn
Copy link
Copy Markdown
Collaborator

@phongn phongn commented May 14, 2026

Summary

Add a SIMD-accelerated bulk ASCII tolower helper ts::memcpy_tolower in tscore, and use it in place of the byte-at-a-time loop on the URL canonicalization fast path that produces the cache-key digest. Header-only helper with a compile-time ISA cascade: 64-byte AVX-512BW, 32-byte AVX2, 16-byte SSE2 on x86_64, plus 16-byte NEON on ARMv8. Selection is purely compile-time; runtime ifunc dispatch is left for a follow-up. Operators get the wider path automatically by raising -march (x86-64-v3 = AVX2, x86-64-v4 = AVX-512BW); a stock x86_64 build keeps SSE2.

Behavior matches ParseRules::ink_tolower exactly: bytes in A..Z map to a..z, all others (including 0x80..0xFF) pass through unchanged.

Implementation notes

  • Cascade: wider bodies drain into narrower ones, so the worst-case scalar tail is always <16 bytes regardless of build flags.
  • AVX-512BW kernel uses _mm512_mask_add_epi8 to fuse the conditional +0x20 into a single op, and a masked load/store for the 1–63-byte tail in one SIMD pass. Inspired by Tony Finch's copytolower64.c.
  • The whole AVX-512BW block is gated at n ≥ 64, because the masked load/store carries ~7 ns of fixed setup that loses to the narrower paths for short inputs; below 64 bytes the AVX-512BW build falls through to its AVX2 + SSE2 cascade.

src/proxy/hdrs/URL.cc drops its static-inline memcpy_tolower and calls ts::memcpy_tolower instead.

Performance — measured on Xeon Gold 6338 (Ice Lake, 2.0 GHz)

Mean ns per call from tools/benchmark/benchmark_memcpy_tolower:

Size scalar SSE2 AVX2 (-mavx2) AVX-512BW (-mavx512bw)
16 B 10.4 2.15 1.75 1.98
32 B 15.4 2.90 2.24 2.31
64 B 28.0 4.43 2.85 2.61
256 B 113 13.87 7.57 6.20
1024 B 425 50.47 24.23 17.49

Speedup vs scalar at 1024 B: SSE2 8.4×, AVX2 17.5×, AVX-512BW 24.3×.

URL hot path inputs: HTTP schemes ("http"/"https") are 4–5 bytes and stay on the scalar tail with no change. Typical host names (16+ bytes) get the full 4–14× speedup depending on build flags.

Test plan

  • Microbench tools/benchmark/benchmark_memcpy_tolower runs 269 correctness assertions covering:
    • Sizes 0, 1, 5, 15, 16, 17, 23, 31, 32, 33, 64, 257 (bracketing each SIMD body) against the scalar reference.
    • An exhaustive sweep of all 256 byte values verifying that only A..Z are remapped — guards against any future widening of the case-fold range.
  • All paths run correctness clean on:
    • Broadwell (AVX2-capable) with -mavx2
    • Ice Lake (AVX-512BW) with -mavx2, -mavx512bw, and the default
  • cmake --build build -t format clean.
  • src/proxy/hdrs/libhdrs.a builds clean with the updated URL.cc.
  • Jenkins CI green.

Notes for reviewers

  • No new compile flags or dependencies. Just baseline SSE2 (x86_64 ABI) and baseline NEON (ARMv8 ABI); wider paths kick in automatically with -march=x86-64-v3 or -march=x86-64-v4.
  • The header includes <immintrin.h> / <arm_neon.h> only inside the #if that needs them, so other architectures don't pull them in.
  • AVX-512BW kernel design (mask_add + masked tail) was adapted from Tony Finch's vectolower, license-compatible (0BSD / MIT-0).
  • Other call sites with the same byte-at-a-time tolower pattern (HPACK.cc, QPACK.cc, UrlRewrite.cc) could also benefit; left untouched here to keep this PR focused.

🤖 Generated with Claude Code

@phongn phongn force-pushed the simd-bulk-tolower branch from ed9a596 to d975fb2 Compare May 14, 2026 20:30
The bulk ASCII tolower loop used to canonicalize the scheme and host
portions of a URL before hashing into the cache key runs at ~1.5 GB/s
scalar (one byte and one ParseRules table lookup per iteration). The
work is trivially data-parallel and there is no per-byte branching, so
a SIMD kernel that lowercases a whole register at once gives a
straightforward speedup once the input is long enough to amortize the
vector setup.

Add a header-only helper ts::memcpy_tolower under
include/tscore/ink_memcpy_tolower.h with a compile-time-selected
cascade of SIMD bodies: 64-byte AVX-512BW, 32-byte AVX2, 16-byte SSE2
on x86_64, plus 16-byte NEON on ARMv8. Wider bodies fall through to
narrower drain loops, so the worst-case scalar tail is always <16
bytes. Selection is purely compile-time; runtime ifunc dispatch is
left for a follow-up.

The AVX-512BW body uses _mm512_mask_add_epi8 to fuse the conditional
"+0x20 where upper" into a single op, and a masked load/store handles
1..63 leftover bytes in a single SIMD pass (inspired by Tony Finch's
copytolower64.c, https://dotat.at/cgi/git/vectolower.git/). The whole
AVX-512BW block is gated at n >= 64 because the masked load/store has
~7 ns of fixed setup that loses to the narrower paths for short
inputs; below 64 bytes we fall through to the AVX2 + SSE2 cascade.

Semantics match the existing ParseRules::ink_tolower table exactly:
bytes in 'A'..'Z' map to 'a'..'z', all others (including 0x80..0xFF)
pass through unchanged.

Replace the static inline memcpy_tolower in src/proxy/hdrs/URL.cc with
this helper. Baseline x86_64 builds use the 16-byte SSE2 path; builds
that opt into a wider -march (x86-64-v3 = AVX2, x86-64-v4 = AVX-512BW)
get the wider bodies automatically. Sub-16-byte inputs (e.g. short
HTTP schemes like "http") use the scalar tail and see no perf change.

Measured throughput on a 2.0 GHz Ice Lake Xeon Gold 6338, mean ns:

  size   scalar   SSE2     AVX2     AVX-512BW
  ----   ------   ----     ----     ---------
  16 B   10.4     2.15     1.75     1.98
  32 B   15.4     2.90     2.24     2.31
  64 B   28.0     4.43     2.85     2.61
  256 B  113      13.87    7.57     6.20
  1024 B 425      50.47    24.23    17.49

Speedup vs scalar at 1024 B: SSE2 8.4x, AVX2 17.5x, AVX-512BW 24.3x.

A new microbenchmark under tools/benchmark covers correctness across
sizes 0..257 (bracketing each SIMD body size) plus an exhaustive byte-
value sweep that guards against any future widening of the case-fold
range, alongside scalar-vs-SIMD throughput numbers and a config-print
case that emits the selected ISA path.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
@bryancall
Copy link
Copy Markdown
Contributor

@phongn Should we do this for in place too?

@bryancall bryancall added this to the 11.0.0 milestone May 18, 2026
@bryancall bryancall requested a review from masaori335 May 18, 2026 22:16
Copy link
Copy Markdown
Contributor

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

This PR adds a header-only SIMD ASCII lowercase-copy helper in tscore and switches the URL cache-key fast path to use it instead of the local scalar loop.

Changes:

  • Adds ts::memcpy_tolower with scalar, SSE2, AVX2, AVX-512BW, and NEON paths.
  • Replaces URL fast-path scheme/host lowercasing with the shared helper.
  • Adds an optional Catch2 benchmark/correctness harness for the helper.

Reviewed changes

Copilot reviewed 4 out of 4 changed files in this pull request and generated 4 comments.

File Description
include/tscore/ink_memcpy_tolower.h Defines the new SIMD/scalar lowercase-copy helper.
src/proxy/hdrs/URL.cc Uses ts::memcpy_tolower in cache-key fast-path canonicalization.
tools/benchmark/benchmark_memcpy_tolower.cc Adds benchmark and correctness checks for the helper.
tools/benchmark/CMakeLists.txt Builds the new benchmark target when benchmarks are enabled.

Comment on lines +59 to +60
inline void
memcpy_tolower(char *dst, const char *src, std::size_t n) noexcept
Comment on lines +19 to +24
Implementation note: the bodies are stacked widest-first and each
drains its block size before falling through to the next. A build
with AVX-512BW gets the 64-byte body as the main loop, then at most
one 32-byte AVX2 iteration and one 16-byte SSE2 iteration to drain
the remainder before the scalar tail handles 0-15 bytes. Builds
without the wider ISAs simply skip those blocks. Selection is purely
Comment on lines +157 to +164
return output_scalar[0];
};

std::string simd_name = "ts::mct " + std::to_string(sz) + "B";
BENCHMARK(simd_name.c_str())
{
ts::memcpy_tolower(output_simd.data(), input.data(), sz);
return output_simd[0];
Comment on lines +163 to +164
ts::memcpy_tolower(output_simd.data(), input.data(), sz);
return output_simd[0];
@bryancall bryancall self-requested a review May 18, 2026 22:21
// the speedup that path already provides.
//
// Inspired by Tony Finch's copytolower64.c
// (https://dotat.at/cgi/git/vectolower.git/).
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The referenced copytolower64.c seems to be under a BSD/MIT-style license.

It's worth to mention it in our NOTICE file. https://github.com/apache/trafficserver/blob/master/NOTICE

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Projects

None yet

Development

Successfully merging this pull request may close these issues.

4 participants