Skip materializing parsed_number_string_t spans on the hot path (addresses #384) by fcostaoliveira · Pull Request #386 · fastfloat/fast_float

fcostaoliveira · 2026-06-03T08:32:26Z

Summary

parsed_number_string_t carries two span<UC const> members (integer, fraction) that are read only on the rare slow paths — digit_comp, and the >19-significant-digit truncation recompute. But they are written on every parse, which forces the ~56/64-byte struct to be materialized and marshaled through the by-value return. On the hot path that surfaces as backend/store pressure. This addresses #384 ("the structure parsed_number_string_t is ... probably too fat for its own good").

What changed

parse_number_string gains a runtime bool store_spans = true parameter (default keeps every existing caller unchanged). When false, the integer/fraction span stores and the span-reading >19-digit recompute are skipped.
from_chars_float_advanced parses with store_spans = false, attempts Clinger + Eisel-Lemire inline, and routes the two rare slow branches (too_many_digits, am.power2 < 0) to a single fastfloat_noinline (noinline+cold) helper that re-parses with spans and calls the unchanged from_chars_advanced.
New fastfloat_noinline macro in float_common.h.

Two deliberate choices:

Runtime flag, not a template parameter — a template would create a second instantiation of the whole scanner whose icache cost wipes out the gain.
noinline cold slow path — the rare re-parse must stay out of line, or the force-inlined hot scanner gets duplicated into the caller; that bloats the hot frame and lengthens the loop-carried dependency chain, which regresses some targets (notably ARM gcc) even though it removes the spill.

Public from_chars / from_chars_advanced / parsed_number_string_t are unchanged.
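The shape of the change described above can be illustrated with a toy scanner. This is only a minimal sketch under stated assumptions: `parsed_t`, `scan`, `slow_path`, `parse`, and `SKETCH_NOINLINE` are hypothetical names invented for illustration, not fast_float's actual code. It shows a runtime `store_spans` flag that skips the span store on the hot path, plus an out-of-line cold helper that re-parses with spans on the rare branch.

```cpp
#include <cstddef>
#include <cstdint>
#include <string_view>

// Hypothetical stand-in for the PR's fastfloat_noinline macro.
#if defined(__GNUC__) || defined(__clang__)
#define SKETCH_NOINLINE __attribute__((noinline, cold))
#elif defined(_MSC_VER)
#define SKETCH_NOINLINE __declspec(noinline)
#else
#define SKETCH_NOINLINE
#endif

// Toy analogue of parsed_number_string_t: the span is only read on the slow path.
struct parsed_t {
  std::uint64_t mantissa = 0;
  bool too_many_digits = false;
  std::string_view integer{}; // written only when store_spans is true
};

// Toy scanner: accumulates leading decimal digits. Beyond 19 digits the
// uint64 accumulator may wrap, so the result is flagged instead of trusted.
inline parsed_t scan(std::string_view s, bool store_spans = true) {
  parsed_t p;
  std::size_t n = 0;
  while (n < s.size() && s[n] >= '0' && s[n] <= '9') {
    p.mantissa = p.mantissa * 10 + std::uint64_t(s[n] - '0');
    ++n;
  }
  p.too_many_digits = n > 19;
  if (store_spans) p.integer = s.substr(0, n); // skipped on the hot path
  return p;
}

// Rare path, kept out of line: re-scan with spans, then read the span
// to recompute a mantissa truncated to the first 19 digits.
SKETCH_NOINLINE std::uint64_t slow_path(std::string_view s) {
  parsed_t p = scan(s, true);
  std::uint64_t m = 0;
  for (std::size_t i = 0; i < 19 && i < p.integer.size(); ++i)
    m = m * 10 + std::uint64_t(p.integer[i] - '0');
  return m;
}

// Hot path: no span store; falls back to the cold helper only when flagged.
inline std::uint64_t parse(std::string_view s) {
  parsed_t p = scan(s, false);
  if (!p.too_many_digits) return p.mantissa;
  return slow_path(s);
}
```

In this sketch `parse("12345")` never touches the span member, while a 23-digit input takes the cold helper and pays for a second scan.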

Performance

Per-parser microbench (from_chars<double> in a tight loop over each dataset, median of 5, pinned core). Each cell shows base → this PR in MB/s (Δ%), where base is the current tip:

| Target | random | canada | mesh |
| --- | --- | --- | --- |
| ARM Neoverse-V2 (Graviton4) gcc | 1087 → 1907 (+75%) | 948 → 1645 (+73%) | 503 → 1489 (+196%) |
| ARM Neoverse-V2 (Graviton4) clang | 1347 → 1449 (+8%) | 1049 → 1135 (+8%) | 879 → 976 (+11%) |
| Intel Ice Lake (Xeon 8360Y) gcc | 1142 → 1358 (+19%) | 973 → 1138 (+17%) | 814 → 955 (+17%) |
| Intel Cascade Lake (Xeon 6248) gcc | 681 → 800 (+18%) | 599 → 705 (+18%) | 448 → 548 (+22%) |
| Intel Cascade Lake (Xeon 6248) clang | 528 → 595 (+13%) | 431 → 528 (+23%) | 311 → 365 (+17%) |

float mirrors double (e.g. ARM gcc float: +72% / +71% / +172%). The win is largest where the base codegen spilled the struct most (ARM gcc); clang baselines that already partly avoided the spill gain less. (Drift-controlled: the unchanged ffc-vs-fast_float control row was flat across base/patch on these nodes.)
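For readers who want to reproduce the methodology (tight loop over a dataset, median of 5 runs, MB/s), a minimal harness might look like the following. It is a sketch under assumptions, not the benchmark used for the numbers above: `median`, `one_pass_mbps`, `median_mbps`, and `parse_strtod` are hypothetical names, `std::strtod` stands in for the parser under test so the block has no dependencies, and core pinning is left to the caller (for example via the OS scheduler).

```cpp
#include <algorithm>
#include <chrono>
#include <cstddef>
#include <cstdlib>
#include <string>
#include <vector>

// Median of a small, non-empty sample (takes a copy so it can sort).
inline double median(std::vector<double> v) {
  std::sort(v.begin(), v.end());
  std::size_t n = v.size();
  return n % 2 ? v[n / 2] : 0.5 * (v[n / 2 - 1] + v[n / 2]);
}

// One timed pass over the dataset; returns throughput in MB/s (1 MB = 1e6 bytes).
template <class Parse>
double one_pass_mbps(const std::vector<std::string>& lines, Parse parse,
                     double& sink) {
  std::size_t bytes = 0;
  auto t0 = std::chrono::steady_clock::now();
  for (const std::string& s : lines) {
    sink += parse(s); // accumulate so the parse cannot be optimized away
    bytes += s.size();
  }
  auto t1 = std::chrono::steady_clock::now();
  double secs = std::chrono::duration<double>(t1 - t0).count();
  return secs > 0 ? double(bytes) / secs / 1e6 : 0.0;
}

// Repeat the pass `runs` times and report the median throughput.
template <class Parse>
double median_mbps(const std::vector<std::string>& lines, Parse parse,
                   int runs = 5) {
  std::vector<double> samples;
  double sink = 0;
  for (int i = 0; i < runs; ++i)
    samples.push_back(one_pass_mbps(lines, parse, sink));
  volatile double keep = sink; // keep the accumulated result observable
  (void)keep;
  return median(samples);
}

// Stand-in parser (assumption): swap in the parser under test here.
inline double parse_strtod(const std::string& s) {
  return std::strtod(s.c_str(), nullptr);
}
```

Reporting the median rather than the mean keeps a single disturbed run from skewing the figure, which matters on shared machines.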

Intel TMA (top-down), Ice Lake (Xeon 8360Y), short floats (`mesh`)

Isolated fast_float microbench under perf stat -M TopdownL1/L2:

| Metric | base | this PR |
| --- | --- | --- |
| Backend-Bound | 26.0% | 2.2% |
| Retiring | 60.3% | 77.3% |
| pipeline slots (TOPDOWN.SLOTS) | 37.2 B | 23.7 B (−36%) |
| wall time (same work) | 2.41 s | 1.53 s (−37%) |

The base spends 26% of pipeline slots backend-bound on the span spill; this PR collapses that to 2.2% and lifts retiring to 77%, with 36% fewer issued slots. That is the microarchitectural mechanism behind the throughput numbers above.

Correctness

Full float-exhaustive suite passes: exhaustive32, exhaustive32_64, exhaustive32_midpoint, random64 — all "all ok".
A 2³² single-precision sweep is byte-identical to the current tip.
Core + supplemental tests pass under -Werror -Wall -Wextra -Wconversion.

Equivalence reasoning: when store_spans=false and the input has >19 digits, the mantissa is left un-truncated, but too_many_digits is set and the caller re-parses before reading it. The am.power2<0 re-parse re-runs Clinger, but Clinger is a pure function of (mantissa, exponent, negative, T), none of which store_spans affects when !too_many_digits. So a Clinger that failed on the hot path fails again, and digit_comp reproduces the original result via the re-materialized spans. answer.ptr/ec are set identically on every path.
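The purity argument can be made concrete with a sketch of the classic Clinger fast path for double. This is an illustration under assumptions, not fast_float's implementation: `clinger_fast_path` and `kPow10` are hypothetical names, and the sketch assumes round-to-nearest mode with arithmetic carried out in double precision. The result depends only on its three inputs, so calling it again after a re-parse that yields the same (mantissa, exponent, negative) gives the same outcome.

```cpp
#include <cstdint>

// Powers of ten that are exactly representable in a double (10^0 .. 10^22).
static const double kPow10[] = {
    1e0,  1e1,  1e2,  1e3,  1e4,  1e5,  1e6,  1e7,  1e8,  1e9,  1e10, 1e11,
    1e12, 1e13, 1e14, 1e15, 1e16, 1e17, 1e18, 1e19, 1e20, 1e21, 1e22};

// Returns true and writes `out` when the fast path applies: the mantissa
// fits exactly in a double (<= 2^53) and the power of ten is exact, so a
// single multiply or divide is correctly rounded. Otherwise returns false
// and the caller must fall back to a slower algorithm.
inline bool clinger_fast_path(std::uint64_t mantissa, int exponent,
                              bool negative, double& out) {
  if (mantissa > (std::uint64_t(1) << 53)) return false;
  if (exponent < -22 || exponent > 22) return false;
  double v = double(mantissa); // exact conversion for mantissa <= 2^53
  v = exponent < 0 ? v / kPow10[-exponent] : v * kPow10[exponent];
  out = negative ? -v : v;
  return true;
}
```

For example, `(125, -2, false)` yields 1.25, while an exponent of 23 is rejected on every call, which is the property the re-parse relies on.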

Notes

fastfloat_noinline maps to __attribute__((noinline, cold)) or __declspec(noinline), depending on the compiler. The attribute is ignored during constant evaluation, and the constexpr from_chars tests pass on gcc 13.3 and clang 18.1 (the MSVC/Alpine/MINGW CI here is also green).
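A small sketch of why the attribute does not interfere with constexpr use: the macro below is a hypothetical stand-in (`SKETCH_NOINLINE`, with compiler detection that is an assumption rather than the PR's exact definition), applied to an illustrative function `rare_branch`. The attribute only influences code generation for runtime calls, so the same function can still be evaluated at compile time.

```cpp
// Hypothetical stand-in for fastfloat_noinline; the real macro's
// conditions in float_common.h may differ.
#if defined(__GNUC__) || defined(__clang__)
#define SKETCH_NOINLINE __attribute__((noinline, cold))
#elif defined(_MSC_VER)
#define SKETCH_NOINLINE __declspec(noinline)
#else
#define SKETCH_NOINLINE
#endif

// Illustrative "rare" function: out of line at runtime, still constexpr.
SKETCH_NOINLINE constexpr int rare_branch(int x) { return x * 2 + 1; }

// Constant evaluation ignores the inlining attribute entirely.
static_assert(rare_branch(20) == 41, "usable in a constant expression");
```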

parsed_number_string_t carries two span<UC const> members (integer, fraction) that are only read on the rare slow paths (digit_comp, and the >19-significant-digit truncation recompute). Materializing them on every parse forces the ~56/64-byte struct to be written out and marshaled through the by-value return, which shows up as backend/store pressure on the hot path.

This adds a runtime `store_spans` flag (default true, so all existing callers are unchanged) to parse_number_string; from_chars_float_advanced parses with it false, attempts the Clinger and Eisel-Lemire fast paths inline, and only re-parses with spans on the two rare slow branches. The re-parse is pushed into a single `fastfloat_noinline` (noinline+cold) helper so the force-inlined hot scanner is emitted once rather than duplicated into the caller (without this the extra inline copies regress some targets, e.g. ARM gcc, by bloating the hot frame and lengthening the loop-carried dependency chain).

A runtime flag is used deliberately rather than a template parameter: a template would create a second instantiation of the whole scanner whose icache cost wipes out the gain.

Measured (per-parser microbench, median of 5, pinned core), fast_float from_chars<double>/<float>, vs the current tip:

- Intel Ice Lake (Xeon 8360Y): +17-19% (gcc), Intel TMA shows backend-bound 26.0% -> 2.2% and retiring 60.3% -> 77.3% on short floats (the eliminated span spill), with -36% pipeline slots.
- Intel Cascade Lake (Xeon 6248): +18-22% (gcc), +13-23% (clang).
- ARM Neoverse-V2 (Graviton4): +73-196% (gcc), +8-11% (clang) -- the struct spill dominated the gcc hot loop there.

Correctness: the full float exhaustive suite (exhaustive32, exhaustive32_64, exhaustive32_midpoint, random64) passes, and a 2^32 sweep is byte-identical to the current tip.

Public from_chars / from_chars_advanced / parsed_number_string_t are unchanged.


lemire · 2026-06-06T02:04:36Z

@fcostaoliveira Please see #387

Same deal, but I think it is conceptually better.

fcostaoliveira · 2026-06-06T09:51:18Z

Thanks @lemire — I benchmarked both approaches head-to-head against main (6258cbc): fast_float-only microbench (from_chars→double, best-of-9, single-core pinned), across Cascade Lake / Ice Lake / Emerald Rapids / Granite Rapids (gcc 11) and Graviton4 (clang 18), in both C++17 and C++20.

They're performance-equivalent. Both land ~+12% median over main on the short-string datasets, and #387 vs #386 is +0.7% median (within run-to-run noise — neither consistently ahead). Since the unlikely-branch version is cleaner (no function-level cold) and achieves the same speedup, I agree it's the better option — I'll close #386 in favor of #387.
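To illustrate the distinction between the two variants, here is a sketch of a branch-level "unlikely" hint, as opposed to marking a whole function cold. The details of #387 are not shown in this thread, so everything here is an assumption: `SKETCH_UNLIKELY`, `slow_fallback`, and `process` are hypothetical names, and the macro uses the GCC/Clang `__builtin_expect` builtin with a plain fallback elsewhere.

```cpp
#include <cstdint>

// Hypothetical branch-hint helper; not necessarily the marker used in #387.
#if defined(__GNUC__) || defined(__clang__)
#define SKETCH_UNLIKELY(x) __builtin_expect(!!(x), 0)
#else
#define SKETCH_UNLIKELY(x) (x)
#endif

// Illustrative rare-case handler; an ordinary function with no cold attribute.
inline std::uint64_t slow_fallback(std::uint64_t digits) {
  return digits % 1000;
}

// The hint sits on the branch itself, so the compiler can lay out the
// common case as the fall-through path without a function-level attribute.
inline std::uint64_t process(std::uint64_t digits, bool too_many_digits) {
  if (SKETCH_UNLIKELY(too_many_digits)) {
    return slow_fallback(digits);
  }
  return digits;
}
```

The hint changes only code layout and inlining heuristics; both branches return the same values with or without it.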

C++17 — MB/s (Δ vs main):

| env | dataset | `main` | #386 cold | #387 unlikely | #386 Δ | #387 Δ | #387 vs #386 |
| --- | --- | --- | --- | --- | --- | --- | --- |
| Cascade Lake (gcc11) | random | 598 | 680 | 668 | +13.8% | +11.7% | -1.8% |
| Cascade Lake (gcc11) | mesh | 417 | 517 | 501 | +24.0% | +20.2% | -3.1% |
| Cascade Lake (gcc11) | canada | 524 | 622 | 651 | +18.7% | +24.3% | +4.7% |
| Ice Lake (gcc11) | random | 976 | 1086 | 1109 | +11.2% | +13.6% | +2.2% |
| Ice Lake (gcc11) | mesh | 739 | 853 | 881 | +15.4% | +19.2% | +3.3% |
| Ice Lake (gcc11) | canada | 912 | 995 | 1044 | +9.0% | +14.5% | +5.0% |
| Emerald Rapids (gcc11) | random | 681 | 751 | 1550 | +10.3% | +127.5% | +106.3% ⚠️noise |
| Emerald Rapids (gcc11) | mesh | 1212 | 1383 | 1358 | +14.1% | +12.0% | -1.9% |
| Emerald Rapids (gcc11) | canada | 1500 | 1596 | 1637 | +6.4% | +9.2% | +2.6% |
| Granite Rapids (gcc11) | random | 1424 | 1557 | 1579 | +9.4% | +10.8% | +1.4% |
| Granite Rapids (gcc11) | mesh | 1260 | 1362 | 1338 | +8.1% | +6.2% | -1.7% |
| Granite Rapids (gcc11) | canada | 1541 | 1699 | 1719 | +10.2% | +11.6% | +1.2% |
| Graviton4 (clang18) | random | 1296 | 1467 | 1455 | +13.2% | +12.3% | -0.8% |
| Graviton4 (clang18) | mesh | 988 | 1085 | 1090 | +9.8% | +10.4% | +0.5% |
| Graviton4 (clang18) | canada | 1215 | 1355 | 1355 | +11.6% | +11.6% | -0.0% |

C++20 — MB/s (Δ vs main):

| env | dataset | `main` | #386 cold | #387 unlikely | #386 Δ | #387 Δ | #387 vs #386 |
| --- | --- | --- | --- | --- | --- | --- | --- |
| Cascade Lake (gcc11) | random | 594 | 640 | 701 | +7.7% | +18.1% | +9.6% |
| Cascade Lake (gcc11) | mesh | 447 | 491 | 508 | +9.7% | +13.5% | +3.5% |
| Cascade Lake (gcc11) | canada | 545 | 636 | 645 | +16.6% | +18.4% | +1.6% |
| Ice Lake (gcc11) | random | 997 | 1065 | 1078 | +6.8% | +8.1% | +1.2% |
| Ice Lake (gcc11) | mesh | 803 | 879 | 895 | +9.5% | +11.5% | +1.9% |
| Ice Lake (gcc11) | canada | 893 | 1030 | 1031 | +15.4% | +15.4% | +0.0% |
| Emerald Rapids (gcc11) | random | 1398 | 1543 | 1553 | +10.3% | +11.0% | +0.7% |
| Emerald Rapids (gcc11) | mesh | 1168 | 1401 | 1370 | +19.9% | +17.3% | -2.2% |
| Emerald Rapids (gcc11) | canada | 1341 | 1619 | 1637 | +20.7% | +22.1% | +1.2% |
| Granite Rapids (gcc11) | random | 1357 | 1567 | 1572 | +15.5% | +15.9% | +0.3% |
| Granite Rapids (gcc11) | mesh | 1168 | 1387 | 1344 | +18.8% | +15.1% | -3.1% |
| Granite Rapids (gcc11) | canada | 1411 | 1686 | 1684 | +19.5% | +19.3% | -0.1% |
| Graviton4 (clang18) | random | 1307 | 1459 | 1481 | +11.6% | +13.3% | +1.5% |
| Graviton4 (clang18) | mesh | 960 | 1091 | 1088 | +13.7% | +13.4% | -0.3% |
| Graviton4 (clang18) | canada | 1210 | 1354 | 1359 | +11.9% | +12.3% | +0.4% |

#387 vs #386 across all cells: median +0.7%, mean +1.0% — |Δ|<3% in 22/29 cells.
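The summary line can be rechecked from the "#387 vs #386" column of the two tables. The sketch below assumes the 29 cells are the 30 table cells minus the Emerald Rapids C++17 `random` cell flagged as noise; the helper names (`deltas`, `median_of`, `mean_of`, `count_within`) are hypothetical. Because it works from the rounded percentages as printed, the mean comes out near 0.96 rather than exactly 1.0.

```cpp
#include <algorithm>
#include <cmath>
#include <cstddef>
#include <numeric>
#include <vector>

// "#387 vs #386" deltas in percent, copied from both tables in row order,
// with the single noise-flagged cell left out (29 values).
inline std::vector<double> deltas() {
  return {-1.8, -3.1, 4.7, 2.2, 3.3, 5.0, -1.9, 2.6, 1.4, -1.7,
          1.2,  -0.8, 0.5, -0.0,                        // C++17 (14 cells)
          9.6,  3.5,  1.6, 1.2, 1.9, 0.0, 0.7, -2.2, 1.2, 0.3,
          -3.1, -0.1, 1.5, -0.3, 0.4};                  // C++20 (15 cells)
}

// Median of a non-empty sample (copy, sort, take the middle).
inline double median_of(std::vector<double> v) {
  std::sort(v.begin(), v.end());
  std::size_t n = v.size();
  return n % 2 ? v[n / 2] : 0.5 * (v[n / 2 - 1] + v[n / 2]);
}

// Arithmetic mean of a non-empty sample.
inline double mean_of(const std::vector<double>& v) {
  return std::accumulate(v.begin(), v.end(), 0.0) / double(v.size());
}

// Number of cells whose absolute delta is strictly below `limit` percent.
inline std::size_t count_within(const std::vector<double>& v, double limit) {
  return std::size_t(std::count_if(
      v.begin(), v.end(), [limit](double x) { return std::fabs(x) < limit; }));
}
```

With these inputs the median is 0.7 and 22 of the 29 cells fall within ±3%, matching the figures quoted above.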

(One Emerald Rapids C++17 cell shows an impossible +106% — that box is shared and was under contention; re-running it on an idle core puts it in line with the rest.)

fcostaoliveira · 2026-06-06T09:51:25Z

Closing in favor of #387 (Daniel's cleaner unlikely-branch variant) — benchmarked equivalent above. Thanks!

fcostaoliveira added 2 commits June 3, 2026 09:30

clang-format (clang-format-17 comment reflow + signature wrap; no sem…

3067491

…antic change)

lemire mentioned this pull request Jun 6, 2026

Using unlikely markers for PR386 #387

Open

fcostaoliveira closed this Jun 6, 2026

fcostaoliveira wants to merge 2 commits into fastfloat:main from redis-performance:pr/lazy-spans-coldpath
