tscore: SIMD ts::memcpy_tolower; use in URL.cc cache-key fast path by phongn · Pull Request #13167 · apache/trafficserver

phongn · 2026-05-14T19:55:30Z

Summary

Add a SIMD-accelerated bulk ASCII tolower helper ts::memcpy_tolower in tscore, and use it in place of the byte-at-a-time loop on the URL canonicalization fast path that produces the cache-key digest. Header-only helper with a compile-time ISA cascade: 64-byte AVX-512BW, 32-byte AVX2, 16-byte SSE2 on x86_64, plus 16-byte NEON on ARMv8. Selection is purely compile-time; runtime ifunc dispatch is left for a follow-up. Operators get the wider path automatically by raising -march (x86-64-v3 = AVX2, x86-64-v4 = AVX-512BW); a stock x86_64 build keeps SSE2.

Behavior matches ParseRules::ink_tolower exactly: bytes in A..Z map to a..z, all others (including 0x80..0xFF) pass through unchanged.

Implementation notes

Cascade: wider bodies drain into narrower ones, so the worst-case scalar tail is always <16 bytes regardless of build flags.
AVX-512BW kernel uses _mm512_mask_add_epi8 to fuse the conditional +0x20 into a single op, and a masked load/store for the 1–63-byte tail in one SIMD pass. Inspired by Tony Finch's copytolower64.c.
The whole AVX-512BW block is gated at n ≥ 64, because the masked load/store carries ~7 ns of fixed setup that loses to the narrower paths for short inputs; below 64 bytes the AVX-512BW build falls through to its AVX2 + SSE2 cascade.

src/proxy/hdrs/URL.cc drops its static-inline memcpy_tolower and calls ts::memcpy_tolower instead.

Performance — measured on Xeon Gold 6338 (Ice Lake, 2.0 GHz)

Mean ns per call from tools/benchmark/benchmark_memcpy_tolower:

Size	scalar	SSE2	AVX2 (`-mavx2`)	AVX-512BW (`-mavx512bw`)
16 B	10.4	2.15	1.75	1.98
32 B	15.4	2.90	2.24	2.31
64 B	28.0	4.43	2.85	2.61
256 B	113	13.87	7.57	6.20
1024 B	425	50.47	24.23	17.49

Speedup vs scalar at 1024 B: SSE2 8.4×, AVX2 17.5×, AVX-512BW 24.3×.

URL hot path inputs: HTTP schemes ("http"/"https") are 4–5 bytes and stay on the scalar tail with no change. Typical host names (16+ bytes) get the full 4–14× speedup depending on build flags.

Test plan

Microbench tools/benchmark/benchmark_memcpy_tolower runs 269 correctness assertions covering:
- Sizes 0, 1, 5, 15, 16, 17, 23, 31, 32, 33, 64, 257 (bracketing each SIMD body) against the scalar reference.
- An exhaustive sweep of all 256 byte values verifying that only A..Z are remapped — guards against any future widening of the case-fold range.
All paths run correctness clean on:
- Broadwell (AVX2-capable) with -mavx2
- Ice Lake (AVX-512BW) with -mavx2, -mavx512bw, and the default
cmake --build build -t format clean.
src/proxy/hdrs/libhdrs.a builds clean with the updated URL.cc.
Jenkins CI green.

Notes for reviewers

No new compile flags or dependencies. Just baseline SSE2 (x86_64 ABI) and baseline NEON (ARMv8 ABI); wider paths kick in automatically with -march=x86-64-v3 or -march=x86-64-v4.
The header includes <immintrin.h> / <arm_neon.h> only inside the #if that needs them, so other architectures don't pull them in.
AVX-512BW kernel design (mask_add + masked tail) was adapted from Tony Finch's vectolower, license-compatible (0BSD / MIT-0).
Other call sites with the same byte-at-a-time tolower pattern (HPACK.cc, QPACK.cc, UrlRewrite.cc) could also benefit; left untouched here to keep this PR focused.

🤖 Generated with Claude Code

The bulk ASCII tolower loop used to canonicalize the scheme and host portions of a URL before hashing into the cache key runs at ~1.5 GB/s scalar (one byte and one ParseRules table lookup per iteration). The work is trivially data-parallel and there is no per-byte branching, so a SIMD kernel that lowercases a whole register at once gives a straightforward speedup once the input is long enough to amortize the vector setup. Add a header-only helper ts::memcpy_tolower under include/tscore/ink_memcpy_tolower.h with a compile-time-selected cascade of SIMD bodies: 64-byte AVX-512BW, 32-byte AVX2, 16-byte SSE2 on x86_64, plus 16-byte NEON on ARMv8. Wider bodies fall through to narrower drain loops, so the worst-case scalar tail is always <16 bytes. Selection is purely compile-time; runtime ifunc dispatch is left for a follow-up. The AVX-512BW body uses _mm512_mask_add_epi8 to fuse the conditional "+0x20 where upper" into a single op, and a masked load/store handles 1..63 leftover bytes in a single SIMD pass (inspired by Tony Finch's copytolower64.c, https://dotat.at/cgi/git/vectolower.git/). The whole AVX-512BW block is gated at n >= 64 because the masked load/store has ~7 ns of fixed setup that loses to the narrower paths for short inputs; below 64 bytes we fall through to the AVX2 + SSE2 cascade. Semantics match the existing ParseRules::ink_tolower table exactly: bytes in 'A'..'Z' map to 'a'..'z', all others (including 0x80..0xFF) pass through unchanged. Replace the static inline memcpy_tolower in src/proxy/hdrs/URL.cc with this helper. Baseline x86_64 builds use the 16-byte SSE2 path; builds that opt into a wider -march (x86-64-v3 = AVX2, x86-64-v4 = AVX-512BW) get the wider bodies automatically. Sub-16-byte inputs (e.g. short HTTP schemes like "http") use the scalar tail and see no perf change. Measured throughput on a 2.0 GHz Ice Lake Xeon Gold 6338, mean ns: size scalar SSE2 AVX2 AVX-512BW ---- ------ ---- ---- --------- 16 B 10.4 2.15 1.75 1.98 32 B 15.4 2.90 2.24 2.31 64 B 28.0 4.43 2.85 2.61 256 B 113 13.87 7.57 6.20 1024 B 425 50.47 24.23 17.49 Speedup vs scalar at 1024 B: SSE2 8.4x, AVX2 17.5x, AVX-512BW 24.3x. A new microbenchmark under tools/benchmark covers correctness across sizes 0..257 (bracketing each SIMD body size) plus an exhaustive byte- value sweep that guards against any future widening of the case-fold range, alongside scalar-vs-SIMD throughput numbers and a config-print case that emits the selected ISA path. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

bryancall · 2026-05-18T22:15:21Z

@phongn Should we do this for in place too?

Copilot

Pull request overview

This PR adds a header-only SIMD ASCII lowercase-copy helper in tscore and switches the URL cache-key fast path to use it instead of the local scalar loop.

Changes:

Adds ts::memcpy_tolower with scalar, SSE2, AVX2, AVX-512BW, and NEON paths.
Replaces URL fast-path scheme/host lowercasing with the shared helper.
Adds an optional Catch2 benchmark/correctness harness for the helper.

Reviewed changes

Copilot reviewed 4 out of 4 changed files in this pull request and generated 4 comments.

File	Description
`include/tscore/ink_memcpy_tolower.h`	Defines the new SIMD/scalar lowercase-copy helper.
`src/proxy/hdrs/URL.cc`	Uses `ts::memcpy_tolower` in cache-key fast-path canonicalization.
`tools/benchmark/benchmark_memcpy_tolower.cc`	Adds benchmark and correctness checks for the helper.
`tools/benchmark/CMakeLists.txt`	Builds the new benchmark target when benchmarks are enabled.

+inline void
+memcpy_tolower(char *dst, const char *src, std::size_t n) noexcept


+  Implementation note: the bodies are stacked widest-first and each
+  drains its block size before falling through to the next. A build
+  with AVX-512BW gets the 64-byte body as the main loop, then at most
+  one 32-byte AVX2 iteration and one 16-byte SSE2 iteration to drain
+  the remainder before the scalar tail handles 0-15 bytes. Builds
+  without the wider ISAs simply skip those blocks. Selection is purely


+      return output_scalar[0];
+    };
+
+    std::string simd_name = "ts::mct " + std::to_string(sz) + "B";
+    BENCHMARK(simd_name.c_str())
+    {
+      ts::memcpy_tolower(output_simd.data(), input.data(), sz);
+      return output_simd[0];


+      ts::memcpy_tolower(output_simd.data(), input.data(), sz);
+      return output_simd[0];


masaori335 · 2026-05-19T06:17:23Z

+  // the speedup that path already provides.
+  //
+  // Inspired by Tony Finch's copytolower64.c
+  // (https://dotat.at/cgi/git/vectolower.git/).


The referenced copytolower64.c seems to be under a BSD/MIT-style license.

It's worth to mention it in our NOTICE file. https://github.com/apache/trafficserver/blob/master/NOTICE

phongn force-pushed the simd-bulk-tolower branch from ed9a596 to d975fb2 Compare May 14, 2026 20:30

phongn force-pushed the simd-bulk-tolower branch from d975fb2 to 32016aa Compare May 14, 2026 20:56

bryancall requested a review from Copilot May 18, 2026 22:14

Copilot started reviewing on behalf of bryancall May 18, 2026 22:14 View session

bryancall assigned phongn May 18, 2026

bryancall added this to the 11.0.0 milestone May 18, 2026

bryancall added the Performance label May 18, 2026

bryancall requested a review from masaori335 May 18, 2026 22:16

Copilot AI reviewed May 18, 2026

View reviewed changes

bryancall self-requested a review May 18, 2026 22:21

masaori335 reviewed May 19, 2026

View reviewed changes

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

tscore: SIMD ts::memcpy_tolower; use in URL.cc cache-key fast path#13167

tscore: SIMD ts::memcpy_tolower; use in URL.cc cache-key fast path#13167
phongn wants to merge 1 commit into
apache:masterfrom
phongn:simd-bulk-tolower

phongn commented May 14, 2026 •

edited

Loading

Uh oh!

bryancall commented May 18, 2026

Uh oh!

Copilot AI left a comment

Uh oh!

masaori335 May 19, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

4 participants

		inline void
		memcpy_tolower(char dst, const char src, std::size_t n) noexcept

		ts::memcpy_tolower(output_simd.data(), input.data(), sz);
		return output_simd[0];

Conversation

phongn commented May 14, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Summary

Implementation notes

Performance — measured on Xeon Gold 6338 (Ice Lake, 2.0 GHz)

Test plan

Notes for reviewers

Uh oh!

bryancall commented May 18, 2026

Uh oh!

Copilot AI left a comment

Choose a reason for hiding this comment

Pull request overview

Reviewed changes

Uh oh!

masaori335 May 19, 2026

Choose a reason for hiding this comment

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

4 participants

phongn commented May 14, 2026 •

edited

Loading