Skip materializing parsed_number_string_t spans on the hot path (addresses #384) #386

Closed
fcostaoliveira wants to merge 2 commits into fastfloat:main from redis-performance:pr/lazy-spans-coldpath

@fcostaoliveira
Contributor

@fcostaoliveira fcostaoliveira commented Jun 3, 2026

Summary

parsed_number_string_t carries two span<UC const> members (integer, fraction) that are read only on the rare slow paths: digit_comp and the >19-significant-digit truncation recompute. But they are written on every parse, which forces the ~56/64-byte struct to be materialized and marshaled through the by-value return. On the hot path that surfaces as backend/store pressure. This addresses #384 ("the structure parsed_number_string_t is ... probably too fat for its own good").

What changed

  • parse_number_string gains a runtime bool store_spans = true parameter (default keeps every existing caller unchanged). When false, the integer/fraction span stores and the span-reading >19-digit recompute are skipped.
  • from_chars_float_advanced parses with store_spans = false, attempts Clinger + Eisel-Lemire inline, and routes the two rare slow branches (too_many_digits, am.power2 < 0) to a single fastfloat_noinline (noinline+cold) helper that re-parses with spans and calls the unchanged from_chars_advanced.
  • New fastfloat_noinline macro in float_common.h.
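
The shape of the change can be sketched with a toy integer scanner. This is not the fast_float code: `SKETCH_NOINLINE`, `scan_digits`, `slow_path_sketch` and `parse_sketch` are hypothetical names, and the slow path's digit-by-digit accumulation is a placeholder rather than a correctly rounded algorithm. It only illustrates a runtime `store_spans` flag that defaults to true, a hot path that skips the span store, and a single out-of-line cold helper that re-parses with spans.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <string>

// Hypothetical stand-in for the PR's fastfloat_noinline macro.
#if defined(_MSC_VER)
#define SKETCH_NOINLINE __declspec(noinline)
#else
#define SKETCH_NOINLINE __attribute__((noinline, cold))
#endif

struct span_sketch {
  const char* ptr;
  std::size_t len;
};

struct scan_sketch {
  std::uint64_t mantissa;
  bool valid;
  bool too_many_digits;
  span_sketch integer; // only meaningful when store_spans was true
};

// Toy scanner for unsigned decimal integers. The default keeps callers that
// need the span unchanged; the hot path passes false and skips the store.
inline scan_sketch scan_digits(const char* first, const char* last,
                               bool store_spans = true) {
  scan_sketch r = {0, false, false, {nullptr, 0}};
  const char* p = first;
  while (p != last && *p >= '0' && *p <= '9') {
    // Unsigned wrap-around past 19 digits is harmless here: the caller never
    // reads the mantissa when too_many_digits is set.
    r.mantissa = r.mantissa * 10 + static_cast<std::uint64_t>(*p - '0');
    ++p;
  }
  const std::size_t n = static_cast<std::size_t>(p - first);
  r.valid = n > 0;
  r.too_many_digits = n > 19;
  if (store_spans) {
    r.integer.ptr = first;
    r.integer.len = n;
  }
  return r;
}

struct result_sketch {
  double value;
  bool ok;
  bool used_slow_path;
};

// Rare branch, kept out of line: re-parse with spans and consume them.
SKETCH_NOINLINE result_sketch slow_path_sketch(const char* first,
                                               const char* last) {
  const scan_sketch p = scan_digits(first, last, /*store_spans=*/true);
  assert(p.valid && p.too_many_digits);
  double acc = 0.0;
  for (std::size_t i = 0; i < p.integer.len; ++i)
    acc = acc * 10.0 + static_cast<double>(p.integer.ptr[i] - '0');
  result_sketch r = {acc, true, true};
  return r;
}

// Hot entry point: scan without spans, fall back only when required.
inline result_sketch parse_sketch(const std::string& s) {
  const char* first = s.data();
  const char* last = first + s.size();
  const scan_sketch p = scan_digits(first, last, /*store_spans=*/false);
  if (!p.valid) {
    result_sketch r = {0.0, false, false};
    return r;
  }
  if (p.too_many_digits) return slow_path_sketch(first, last);
  result_sketch r = {static_cast<double>(p.mantissa), true, false};
  return r;
}
```

Calling `parse_sketch("12345")` stays on the hot path, while a 21-digit input such as `"100000000000000000000"` is routed through `slow_path_sketch`.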

Two deliberate choices:

  • Runtime flag, not a template parameter — a template would create a second instantiation of the whole scanner whose icache cost wipes out the gain.
  • noinline cold slow path — the rare re-parse must stay out of line, or the force-inlined hot scanner gets duplicated into the caller; that bloats the hot frame and lengthens the loop-carried dependency chain, which regresses some targets (notably ARM gcc) even though it removes the spill.

Public from_chars / from_chars_advanced / parsed_number_string_t are unchanged.

Performance

Per-parser microbench (from_chars<double> in a tight loop over each dataset, median of 5, pinned core). base → this PR, MB/s (Δ%), vs current tip:

| Target | random | canada | mesh |
|---|---|---|---|
| ARM Neoverse-V2 (Graviton4) gcc | 1087 → 1907 (+75%) | 948 → 1645 (+73%) | 503 → 1489 (+196%) |
| ARM Neoverse-V2 (Graviton4) clang | 1347 → 1449 (+8%) | 1049 → 1135 (+8%) | 879 → 976 (+11%) |
| Intel Ice Lake (Xeon 8360Y) gcc | 1142 → 1358 (+19%) | 973 → 1138 (+17%) | 814 → 955 (+17%) |
| Intel Cascade Lake (Xeon 6248) gcc | 681 → 800 (+18%) | 599 → 705 (+18%) | 448 → 548 (+22%) |
| Intel Cascade Lake (Xeon 6248) clang | 528 → 595 (+13%) | 431 → 528 (+23%) | 311 → 365 (+17%) |

float mirrors double (e.g. ARM gcc float: +72% / +71% / +172%). The win is largest where the base codegen spilled the struct most (ARM gcc); clang baselines that already partly avoided the spill gain less. (Drift-controlled: the unchanged ffc-vs-fast_float control row was flat across base/patch on these nodes.)
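
For readers who want to reproduce numbers of this kind, here is a minimal sketch of a median-of-N throughput harness. It is not the benchmark used in this PR: `median_of`, `median_mb_per_s` and `percent_delta` are hypothetical helpers, MB is assumed to mean 10^6 bytes, and core pinning is left to the environment (for example `taskset`). Any callable that turns a string into a double can be passed as the parser.

```cpp
#include <algorithm>
#include <cassert>
#include <chrono>
#include <cstddef>
#include <cstdlib>
#include <string>
#include <vector>

// Median of a non-empty sample (upper median when the size is even).
inline double median_of(std::vector<double> v) {
  assert(!v.empty());
  auto mid = v.begin() + static_cast<std::ptrdiff_t>(v.size() / 2);
  std::nth_element(v.begin(), mid, v.end());
  return *mid;
}

// Parse every line once per run, time `runs` runs, return the median MB/s.
// Assumption: MB = 1e6 bytes, counted over the input text only.
template <class Parser>
double median_mb_per_s(const std::vector<std::string>& lines, Parser parse,
                       int runs = 5) {
  std::size_t bytes = 0;
  for (const std::string& l : lines) bytes += l.size();

  std::vector<double> samples;
  volatile double sink = 0.0; // keeps the parse loop from being optimized away
  for (int r = 0; r < runs; ++r) {
    const auto t0 = std::chrono::steady_clock::now();
    double sum = 0.0;
    for (const std::string& l : lines) sum += parse(l);
    const auto t1 = std::chrono::steady_clock::now();
    sink = sum;
    const double secs = std::chrono::duration<double>(t1 - t0).count();
    samples.push_back(static_cast<double>(bytes) / secs / 1e6);
  }
  (void)sink;
  return median_of(samples);
}

// Relative change in percent, as in the "base → this PR (Δ%)" cells.
inline double percent_delta(double base, double patched) {
  return (patched / base - 1.0) * 100.0;
}
```

As a usage example, `percent_delta(1087.0, 1907.0)` evaluates to roughly 75.4, which matches the +75% reported for Graviton4 gcc on the random dataset.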

Intel TMA (top-down), Ice Lake (Xeon 8360Y), short floats (mesh)

Isolated fast_float microbench under perf stat -M TopdownL1/L2:

| | base | this PR |
|---|---|---|
| Backend-Bound | 26.0% | 2.2% |
| Retiring | 60.3% | 77.3% |
| pipeline slots (TOPDOWN.SLOTS) | 37.2 B | 23.7 B (−36%) |
| wall time (same work) | 2.41 s | 1.53 s (−37%) |

The base spends 26% of pipeline slots backend-bound on the span spill; this PR collapses that to 2.2% and lifts retiring to 77%, with 36% fewer issued slots. That is the microarchitectural mechanism behind the throughput numbers above.

Correctness

  • Full float-exhaustive suite passes: exhaustive32, exhaustive32_64, exhaustive32_midpoint, random64 — all "all ok".
  • A 2³² single-precision sweep is byte-identical to the current tip.
  • Core + supplemental tests pass under -Werror -Wall -Wextra -Wconversion.

Equivalence reasoning:

  • >19 digits with store_spans=false: the mantissa is left un-truncated, but too_many_digits is set and the caller re-parses with spans before reading it.
  • am.power2<0: the re-parse re-runs Clinger. Clinger is a pure function of (mantissa, exponent, negative, T), none of which store_spans affects when !too_many_digits, so a Clinger attempt that failed on the hot path fails again, and digit_comp reproduces the original result from the re-materialized spans.
  • answer.ptr/ec are set identically on every path.
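
The purity claim can be illustrated with a simplified Clinger-style fast path. This is a sketch, not fast_float's implementation: `clinger_fast_path_sketch` and `assert_rerun_matches` are hypothetical names, only double is handled, and it assumes round-to-nearest with arithmetic evaluated in double precision. The result depends on nothing but its arguments, so running it a second time on the same (mantissa, exponent, negative) must succeed or fail in the same way.

```cpp
#include <cassert>
#include <cstdint>

// Simplified Clinger fast path for double: if the mantissa fits in 53 bits and
// |exponent| <= 22, both operands are exactly representable and a single
// multiply or divide yields the correctly rounded result.
inline bool clinger_fast_path_sketch(std::uint64_t mantissa,
                                     std::int64_t exponent, bool negative,
                                     double& out) {
  static const double pow10[] = {1e0,  1e1,  1e2,  1e3,  1e4,  1e5,
                                 1e6,  1e7,  1e8,  1e9,  1e10, 1e11,
                                 1e12, 1e13, 1e14, 1e15, 1e16, 1e17,
                                 1e18, 1e19, 1e20, 1e21, 1e22};
  if (mantissa > (std::uint64_t(1) << 53)) return false;
  if (exponent < -22 || exponent > 22) return false;
  double v = static_cast<double>(mantissa);
  v = exponent < 0 ? v / pow10[-exponent] : v * pow10[exponent];
  out = negative ? -v : v;
  return true;
}

// Re-running on identical inputs reproduces the first outcome; this is the
// property the equivalence argument relies on.
inline void assert_rerun_matches(std::uint64_t mantissa, std::int64_t exponent,
                                 bool negative) {
  double a = 0.0, b = 0.0;
  const bool ok_a = clinger_fast_path_sketch(mantissa, exponent, negative, a);
  const bool ok_b = clinger_fast_path_sketch(mantissa, exponent, negative, b);
  assert(ok_a == ok_b);
  assert(!ok_a || a == b);
  (void)ok_a;
  (void)ok_b;
}
```

For example, mantissa 12345 with exponent -2 takes the fast path and yields 123.45, while an exponent of 23 is rejected on every attempt.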

Notes

  • fastfloat_noinline expands to __attribute__((noinline, cold)) or __declspec(noinline), depending on the compiler. The attribute is ignored during constant evaluation, and the constexpr from_chars tests pass on gcc 13.3 and clang 18.1 (the MSVC/Alpine/MINGW CI here is also green).

parsed_number_string_t carries two span<UC const> members (integer, fraction)
that are only read on the rare slow paths (digit_comp, and the >19-significant-
digit truncation recompute). Materializing them on every parse forces the ~56/64-
byte struct to be written out and marshaled through the by-value return, which
shows up as backend/store pressure on the hot path.

This adds a runtime `store_spans` flag (default true, so all existing callers are
unchanged) to parse_number_string; from_chars_float_advanced parses with it false,
attempts the Clinger and Eisel-Lemire fast paths inline, and only re-parses with
spans on the two rare slow branches. The re-parse is pushed into a single
`fastfloat_noinline` (noinline+cold) helper so the force-inlined hot scanner is
emitted once rather than duplicated into the caller (without this the extra inline
copies regress some targets, e.g. ARM gcc, by bloating the hot frame and lengthening
the loop-carried dependency chain).

A runtime flag is used deliberately rather than a template parameter: a template
would create a second instantiation of the whole scanner whose icache cost wipes
out the gain.

Measured (per-parser microbench, median of 5, pinned core), fast_float from_chars
<double>/<float>, vs the current tip:
  - Intel Ice Lake (Xeon 8360Y): +17-19% (gcc), Intel TMA shows backend-bound
    26.0% -> 2.2% and retiring 60.3% -> 77.3% on short floats (the eliminated span
    spill), with -36% pipeline slots.
  - Intel Cascade Lake (Xeon 6248): +18-22% (gcc), +13-23% (clang).
  - ARM Neoverse-V2 (Graviton4): +73-196% (gcc), +8-11% (clang) -- the struct spill
    dominated the gcc hot loop there.
Correctness: the full float exhaustive suite (exhaustive32, exhaustive32_64,
exhaustive32_midpoint, random64) passes, and a 2^32 sweep is byte-identical to the
current tip. Public from_chars / from_chars_advanced / parsed_number_string_t are
unchanged.
@lemire
Member

lemire commented Jun 6, 2026

@fcostaoliveira Please see #387

Same deal, but I think it is conceptually better.

@fcostaoliveira
Contributor Author

Thanks @lemire. I benchmarked both approaches head-to-head against main (6258cbc): fast_float-only microbench (from_chars<double>, best-of-9, single-core pinned), across Cascade Lake / Ice Lake / Emerald Rapids / Granite Rapids (gcc 11) and Graviton4 (clang 18), in both C++17 and C++20.

They're performance-equivalent. Both land ~+12% median over main on the short-string datasets, and #387 vs #386 is +0.7% median (within run-to-run noise — neither consistently ahead). Since the unlikely-branch version is cleaner (no function-level cold) and achieves the same speedup, I agree it's the better option — I'll close #386 in favor of #387.

C++17 — MB/s (Δ vs main):

| env | dataset | main | #386 cold | #387 unlikely | #386 Δ | #387 Δ | #387 vs #386 |
|---|---|---|---|---|---|---|---|
| Cascade Lake (gcc11) | random | 598 | 680 | 668 | +13.8% | +11.7% | -1.8% |
| Cascade Lake (gcc11) | mesh | 417 | 517 | 501 | +24.0% | +20.2% | -3.1% |
| Cascade Lake (gcc11) | canada | 524 | 622 | 651 | +18.7% | +24.3% | +4.7% |
| Ice Lake (gcc11) | random | 976 | 1086 | 1109 | +11.2% | +13.6% | +2.2% |
| Ice Lake (gcc11) | mesh | 739 | 853 | 881 | +15.4% | +19.2% | +3.3% |
| Ice Lake (gcc11) | canada | 912 | 995 | 1044 | +9.0% | +14.5% | +5.0% |
| Emerald Rapids (gcc11) | random | 681 | 751 | 1550 | +10.3% | +127.5% | +106.3% ⚠️ noise |
| Emerald Rapids (gcc11) | mesh | 1212 | 1383 | 1358 | +14.1% | +12.0% | -1.9% |
| Emerald Rapids (gcc11) | canada | 1500 | 1596 | 1637 | +6.4% | +9.2% | +2.6% |
| Granite Rapids (gcc11) | random | 1424 | 1557 | 1579 | +9.4% | +10.8% | +1.4% |
| Granite Rapids (gcc11) | mesh | 1260 | 1362 | 1338 | +8.1% | +6.2% | -1.7% |
| Granite Rapids (gcc11) | canada | 1541 | 1699 | 1719 | +10.2% | +11.6% | +1.2% |
| Graviton4 (clang18) | random | 1296 | 1467 | 1455 | +13.2% | +12.3% | -0.8% |
| Graviton4 (clang18) | mesh | 988 | 1085 | 1090 | +9.8% | +10.4% | +0.5% |
| Graviton4 (clang18) | canada | 1215 | 1355 | 1355 | +11.6% | +11.6% | -0.0% |

C++20 — MB/s (Δ vs main):

| env | dataset | main | #386 cold | #387 unlikely | #386 Δ | #387 Δ | #387 vs #386 |
|---|---|---|---|---|---|---|---|
| Cascade Lake (gcc11) | random | 594 | 640 | 701 | +7.7% | +18.1% | +9.6% |
| Cascade Lake (gcc11) | mesh | 447 | 491 | 508 | +9.7% | +13.5% | +3.5% |
| Cascade Lake (gcc11) | canada | 545 | 636 | 645 | +16.6% | +18.4% | +1.6% |
| Ice Lake (gcc11) | random | 997 | 1065 | 1078 | +6.8% | +8.1% | +1.2% |
| Ice Lake (gcc11) | mesh | 803 | 879 | 895 | +9.5% | +11.5% | +1.9% |
| Ice Lake (gcc11) | canada | 893 | 1030 | 1031 | +15.4% | +15.4% | +0.0% |
| Emerald Rapids (gcc11) | random | 1398 | 1543 | 1553 | +10.3% | +11.0% | +0.7% |
| Emerald Rapids (gcc11) | mesh | 1168 | 1401 | 1370 | +19.9% | +17.3% | -2.2% |
| Emerald Rapids (gcc11) | canada | 1341 | 1619 | 1637 | +20.7% | +22.1% | +1.2% |
| Granite Rapids (gcc11) | random | 1357 | 1567 | 1572 | +15.5% | +15.9% | +0.3% |
| Granite Rapids (gcc11) | mesh | 1168 | 1387 | 1344 | +18.8% | +15.1% | -3.1% |
| Granite Rapids (gcc11) | canada | 1411 | 1686 | 1684 | +19.5% | +19.3% | -0.1% |
| Graviton4 (clang18) | random | 1307 | 1459 | 1481 | +11.6% | +13.3% | +1.5% |
| Graviton4 (clang18) | mesh | 960 | 1091 | 1088 | +13.7% | +13.4% | -0.3% |
| Graviton4 (clang18) | canada | 1210 | 1354 | 1359 | +11.9% | +12.3% | +0.4% |

#387 vs #386 across all cells: median +0.7%, mean +1.0%; |Δ|<3% in 22/29 cells (the contended Emerald Rapids C++17 cell excluded).
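
These summary figures can be re-derived from the "#387 vs #386" column of the two tables. The sketch below hard-codes the 29 percentages, leaving out the Emerald Rapids C++17 random cell flagged as noise; `delta_summary`, `summarize` and `pr_deltas` are hypothetical helper names introduced for illustration.

```cpp
#include <algorithm>
#include <cassert>
#include <cmath>
#include <cstddef>
#include <numeric>
#include <vector>

struct delta_summary {
  double median;
  double mean;
  int within; // cells with |delta| < tol
  int total;
};

// Median (middle element, odd-sized sample assumed), mean, and tolerance count.
inline delta_summary summarize(std::vector<double> d, double tol) {
  assert(!d.empty());
  delta_summary s;
  s.total = static_cast<int>(d.size());
  s.mean = std::accumulate(d.begin(), d.end(), 0.0) / static_cast<double>(d.size());
  s.within = static_cast<int>(std::count_if(
      d.begin(), d.end(), [tol](double x) { return std::fabs(x) < tol; }));
  auto mid = d.begin() + static_cast<std::ptrdiff_t>(d.size() / 2);
  std::nth_element(d.begin(), mid, d.end());
  s.median = *mid;
  return s;
}

// "#387 vs #386" percentages copied from the tables, noise cell excluded.
inline std::vector<double> pr_deltas() {
  return {
      // C++17 (14 cells)
      -1.8, -3.1, 4.7, 2.2, 3.3, 5.0, -1.9, 2.6, 1.4, -1.7, 1.2, -0.8, 0.5, -0.0,
      // C++20 (15 cells)
      9.6, 3.5, 1.6, 1.2, 1.9, 0.0, 0.7, -2.2, 1.2, 0.3, -3.1, -0.1, 1.5, -0.3, 0.4};
}
```

Calling `summarize(pr_deltas(), 3.0)` gives a median of 0.7, a mean of about 0.96, and 22 of 29 cells inside the ±3% band, consistent with the figures quoted above.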

(One Emerald Rapids C++17 cell shows an impossible +106% — that box is shared and was under contention; re-running it on an idle core puts it in line with the rest.)

@fcostaoliveira
Contributor Author

Closing in favor of #387 (Daniel's cleaner unlikely-branch variant) — benchmarked equivalent above. Thanks!
