Prevent FP contractions to FMA in index/.../content.hpp to avoid numerical instability in rtree by tinko92 · Pull Request #1453 · boostorg/geometry

tinko92 · 2026-03-27T08:31:43Z

This PR addresses #1452 . The issue shows a performance regression on GCC with -march=x86-64-v3 over -march=x86-64-v2 and presumably optimizations enabled, the compilation flags are not fully specified. After instrumenting the code further, I noticed many more calls to distance_cross_track and a different tree structure in v3 vs. v2 with less efficient querying (~twice as many geographic distance computations). The issue no longer replicates after doing any change out of a) disabling inlining, b) disabling FMA, c) removing the points with duplicate coordinates from the data.csv in the issue.

This points to a numerical instability in tree insertion due to some fusing of multiply and add/sub (b) across function boundaries (a), in some expression where cancellation to == 0.0 would occur without FMA but not with (from c). After disallowing this as in the commit with fp-contract=off for the content algorithm, the issue disappears. The content algorithm ends in a multiply, so when this small method is inlined, this multiply may be fused into the next addition/subtraction, so it fits the findings.

Multiple call sites of content seem to be affected by this (changing call sites individually yielded different tree structures), so I think, disallowing the fusing at the level of content.hpp is the cleanest solution. I consider this a conservative change because it keeps the behavior similar to how it would have been at x86-64-v2 and before (probably the most tested settings), where no FMA would be available, and also similar to Clang and MSVC which do not default to fp-contract=fast like GCC when optimizations are enabled.

It is guarded behind the BOOST_GCC macro to avoid warnings for compilers that do not support this pragma and because GCC is the only compiler I am aware of that defaults to fp-contract=fast at common optimization levels without enabling it specifically or setting something well-known to be unsafe like -ffast-math.

Edit: Added explanatory comment. No test-case added because the behaviour changes are compiler- and machine-dependent and I see no guarantees wrt to CPU architectures for the Github CI).

… to fix boostorg#1452 on GCC with x86-64-v3.

barendgehrels · 2026-03-29T05:16:43Z

include/boost/geometry/index/detail/algorithms/content.hpp

+// #1452. Safe to remove after numerically robust handling of degenerate cases
+// is ensured at all index::detail::content(box) call sites.
+#pragma GCC push_options
+#pragma GCC optimize ("fp-contract=off")


Thanks for the PR and the explanations!

In the issue, you suggested

Until the code is made more robust, you can pass -ffp-contract=on (arguably a generally sensible default, the default of clang) or -mno-fma as a compiler flag as a workaround to avoid the performance hit for your program.

so it has to be off now?

Is distance_cross_track.hpp the culprit? Then we might hit the problem also elsewhere.

Alternatively we could document that the compilation options should handle this (outside our library) or we could spawn a warning if the setting has another value.

But I'm fine either way and will approve.

Thanks for the very quick review!

so it has to be off now?

Sorry for the inconsistency. For this file, to the best of my understanding, "off" and "on" behave the same. The problems are with contractions across expressions (from inside content to operations outside contents at the call site) that only occur with "fast", with "on" there would only be contractions inside expressions (but content AFAICS has no operations inside individual expresions that could be contracted), and with "off" there can no contractions of floating point operations at all.

AFAIK "off" is the default for MSVC since 2022 [1], and the default for Clang until 13, with a change to "on" since 14 [2]. My understanding is that "off" is the safest (in terms of no surprises between x86-64-v2 to x86-64-v3) in case the implementation of content changes in the future, so that is why I propose it in this PR. "on" would also resolve the issue.

Is distance_cross_track.hpp the culprit?

I think in this case, no. I initially mistook it for the culprit, but I think the higher call count to distance_cross_track is just the symptom. For the program in the linked issue, the internal nodes in the rtree change from 6763 nodes in the rtree (v2) to 6685 (v3) and with the latter there are way more calls to distance (which also look more suspicious for the latter when printing the points and segments, the search trace doesn't seem to narrow down to the search point, but I am not familiar with the rtree details). I think the change in content-behaviour from v2 to v3 causes some degeneracy in the rtree but I cannot isolate it to a small test case and I currently lack the time to investigate it more deeply. That is why I propose this small change to enforce more well-known (pre x86-64-v3) behaviour to this helper function in the rtree construction.

Alternatively we could document that the compilation options should handle this (outside our library) or we could spawn a warning if the setting has another value.

Warning for fp-contract=fast (and proposing "off" to guarantee behaviour as it was without FMA in x86-64-v3) could rule out any surprises from FMA across the library but I don't know enough about all the formulas to say whether this would be good advice for all use cases because FMA also has advantages. If we have something like a*b-a*b, it can reveal numerical non-robustness, because suddenly that can be non-zero (counterintuitively the increased precision from one fewer rounding prevents two rounding errors in a*b from cancelling out). But if we have something like for(i=0; i<length; ++i) sum += x[i] * y[i]; with non-degenerate values, it might roughly halve the overall accumulated error and also make this faster.

I think, ideally we want to fix it at individual code sites, where it causes problems, and keep it in mind as a potential source of problems for future issue reports, where the problem only occurs when built with GCC for modern CPU targets.

[1] - https://devblogs.microsoft.com/cppblog/the-fpcontract-flag-and-changes-to-fp-modes-in-vs2022
[2] - https://releases.llvm.org/14.0.0/tools/clang/docs/ReleaseNotes.html#floating-point-support-in-clang

vissarion

Thank @tinko92, I am OK with the changes.

I have one question: can the issue be replicated (i.e. slow performance) with clang14+ ( which defaults to -ffp-contract=on)?

barendgehrels · 2026-03-30T08:48:24Z

I'm still fine with the change - but I might have an alternative.

@tinko92 can you personally reproduce it?

I think you can:

After instrumenting the code further, I noticed many more calls to distance_cross_track

so that might help me as well. More news tonight.

tinko92 · 2026-03-30T09:01:47Z

@vissarion Sorry, unintended edit earlier, I meant to answer. Here is the answer:

I have one question: can the issue be replicated (i.e. slow performance) with clang14+ ( which defaults to -ffp-contract=on)?

It does not replicate with clang for me. Here are different combinations of compilers and fp-contract settings (The line "2: " outputs the count of internal nodes, which is only interpretable to me in so far as it is different between cases):

for command in "g++ -ffp-contract=fast" "g++ -ffp-contract=on" "g++ -ffp-contract=off" "clang++ -ffp-contract=fast" "clang++ -ffp-contract=on" "clang++ -ffp-contract=off"; do echo $command && ($command -I/home/bartels/dev/boost/ -DNDEBUG -O3 -march=x86-64-v3 main.cpp -std=c++20 && ./a.
out | grep -E "query all took|^2:"); done

g++ -ffp-contract=fast #<-- default
2: 6685 #<-- rtree structure that seems to hurt performance
query all took :   6726.298 ms
g++ -ffp-contract=on
2: 6763
query all took :   4316.571 ms
g++ -ffp-contract=off
2: 6763
query all took :   4555.714 ms
clang++ -ffp-contract=fast
2: 6763
query all took :   4734.149 ms
clang++ -ffp-contract=on
2: 6763
query all took :   4323.103 ms
clang++ -ffp-contract=off
2: 6763
query all took :   4801.359 ms

With GCC 14 on this machine and Clang 19. Clang contracts a lot fewer multiplications and additions:

bartels@DESKTOP-BPNV3G3:~/dev/bg_1452/test$ g++ -ffp-contract=fast -DNDEBUG -O3 -march=x86-64-v3 main.cpp -std=c++20 && objdump -d a.out | grep -E "vfmadd|vfmsub" | wc -l ### 20260330 16:49 /home/bartels/dev/bg_1452/test
131
bartels@DESKTOP-BPNV3G3:~/dev/bg_1452/test$ clang++ -ffp-contract=fast -DNDEBUG -O3 -march=x86-64-v3 main.cpp -std=c++20 && objdump -d a.out | grep -E "vfmadd|vfmsub" | wc -l
64
bartels@DESKTOP-BPNV3G3:~/dev/bg_1452/test$ clang++ -ffp-contract=on -DNDEBUG -O3 -march=x86-64-v3 main.cpp -std=c++20 && objdump -d a.out | grep -E "vfmadd|vfmsub" | wc -l
50
bartels@DESKTOP-BPNV3G3:~/dev/bg_1452/test$ clang++ -ffp-contract=off -DNDEBUG -O3 -march=x86-64-v3 main.cpp -std=c++20 && objdump -d a.out | grep -E "vfmadd|vfmsub" | wc -l
0

Apparently it emits none that trigger the problem of issue #1452 for that example program. I don't know what in Clang's optimizer causes this and whether some unrelated changes (that lead to some instruction reordering, that suddenly make some previously unfused operations candidates for fusing) could affect this.

tinko92 · 2026-03-30T09:19:00Z

@barendgehrels

can you personally reproduce it?

Yes, on a server with an Intel Xeon CPU (every x86 CPU that satisfies x86-64-v3 should work, which are most since Intel Haswell, I think), I have access to. The reporter of the issue has an AMD CPU.

It directly replicates with

g++ -I/home/bartels/dev/boost/ -DNDEBUG -O3 -march=x86-64-v3 main.cpp -std=c++20 && ./a.out for me, but many other combinations also work.

so that might help me as well

The instrumented code? It is rather messy/crude because it was never intended for sharing, probably could be done better with a profiler (I have no IDE on the remote x86 system, my own working machine is ARM and cannot replicate it on that). I appended the edited main.cpp (as txt due to github file filters), and pushed the changed BG code for counting as is to my fork (can be pull from https://github.com/tinko92/geometry/tree/experiment/issue_1452_instrumentation ).
main.cpp.txt

barendgehrels · 2026-03-30T09:32:45Z

so that might help me as well

The instrumented code? It is rather messy/crude because it was never intended for sharing, probably could be done better with a profiler

Thanks, I actually meant that you could maybe check a PR if I create it... But good to know anyway. And good you can reproduce it.

barendgehrels · 2026-03-30T09:34:59Z

Now see:

It does not replicate with clang for me.

OK, clear. TBC.

barendgehrels · 2026-03-30T15:20:55Z

Conversation

tinko92 commented Mar 27, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

barendgehrels Mar 29, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Choose a reason for hiding this comment

Uh oh!

tinko92 Mar 30, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Choose a reason for hiding this comment

Uh oh!

vissarion left a comment • edited by tinko92 Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Choose a reason for hiding this comment

Uh oh!

barendgehrels commented Mar 30, 2026

Uh oh!

tinko92 commented Mar 30, 2026

Uh oh!

tinko92 commented Mar 30, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

barendgehrels commented Mar 30, 2026

Uh oh!

barendgehrels commented Mar 30, 2026

Uh oh!

barendgehrels commented Mar 30, 2026

Uh oh!

sandman7920 commented Mar 30, 2026

Uh oh!

barendgehrels commented Mar 30, 2026

Uh oh!

sandman7920 commented Mar 30, 2026

Uh oh!

tinko92 commented Mar 31, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

4 participants

tinko92 commented Mar 27, 2026 •

edited

Loading

barendgehrels Mar 29, 2026 •

edited

Loading

tinko92 Mar 30, 2026 •

edited

Loading

vissarion left a comment •

edited by tinko92

Loading

tinko92 commented Mar 30, 2026 •

edited

Loading