
Disable restrict keyword in GCC/Clang #586

Closed
wants to merge 1 commit into from

Conversation

@sipa (Contributor) commented Jan 18, 2019

This fixes #585.

@TheBlueMatt (Contributor) commented Jan 18, 2019

You can water down the comment a bit, I think. It's rather unlikely to affect secp, but it's also unlikely to speed anything up, so it's just not worth it. utACK either way.

@real-or-random (Contributor)

utACK

@sipa (Contributor, Author) commented Jan 18, 2019

A very rudimentary benchmark shows that it indeed doesn't affect verification speed.

I'll do a more thorough benchmark later.

@jgarzik commented Jan 18, 2019

utACK

@sipa (Contributor, Author) commented Feb 4, 2019

I've done some more benchmarking, and can't observe a slowdown in verification (if there is one, it's less than 0.2%).

@sipa (Contributor, Author) commented Feb 5, 2019

Oops, no. That was with assembly code enabled, which obviously already takes the non-aliasing assumptions into account but isn't affected by the restrict keyword.

When I disable asm, verification goes from 277000 cycles to 282000 cycles on a Threadripper 2950X system, a 1.8% slowdown.

@real-or-random (Contributor)

1.8% sounds reasonable enough to me for the added safety.

@gmaxwell (Contributor)

Meh: 1.8% is a rather large slowdown to take for a "maybe the compiler might screw up in ways that the tests don't find". Assuming the 1.8% number applies on ARM, that change is essentially 1-2 hours of additional sync time on a non-assumevalid initial sync on a small bitcoin node.

Here is my reasoning: There are a significant number of optimizations in the codebase with 1% - 3% level improvements which could be excluded on the basis of "maybe they get miscompiled" or "maybe they have a bug".

If the tests (including run-time self-tests) aren't adequate to catch all plausible miscompilation, we should improve them. This has an added benefit of also protecting the code against unforeseen risks, something that avoiding long-standing language features does not accomplish.

Or, if 1% vs the level of risk that using a less common compiler annotation implies is really the trade-off we want to take, we should probably remove all the distinct normalization cases (and replace them all with a single constant-time full normalization), the R comparison in projective space optimization, x86_64 assembly (which is now not much faster than what the compiler can produce), effective affine, etc. Each of these has a 1-2% performance impact and arguably a worse risk/performance trade-off than the use of restrict. Similarly, under that logic we should probably remove endomorphism and gmp support from the codebase (zero performance gain in the normal config, non-zero risk of introducing bugs in the normal configuration).

I don't actually think the above is a good idea: I think our principle before has been to include essentially any optimization-- even relatively complex ones-- which would make a measurable difference in a real system, subject to the constraint that we could test it well enough to be reasonably confident that it's safe. But since there is essentially no competing library for this curve which I'm aware of that is anywhere near as fast, I think a case could be made for simplifying the codebase until it is merely significantly faster than the next less-well-tested alternative. To the extent that safety is our primary goal, we only need to be fast enough to ensure that users who mostly care about speed select this code over a less safe alternative. [also, the audience of embedded devices that frequently adopt obviously less safe code might be swayed by smaller binary sizes...]

But whatever we do, we should try to be intentional and consistent in our strategy so that we get the full benefit of it.

I also think that a case could be made for doing this if it were shown that the slowdown didn't happen on ARM (since that's where a performance loss is especially critical), or as a temporary measure while tests were improved.

@real-or-random (Contributor)

> Each of which has 1-2% performance impacts and arguably has a worse risk/performance trade-off than the use of restrict.

I don't think that's so clear. This library offers exceptional quality due to

  • careful programming,
  • careful review, and
  • good tests.

Unfortunately, the first two bullets don't help in case of miscompilation. Tests do help, of course, but tests rarely show the absence of bugs. Another concern is that compilers behave differently on different architectures, with different versions, different flags, etc., and it's not clear whether people run the tests every time after they compile the code.

And sure, other code can be miscompiled, too. But without a specific bug in mind, there is no reason why a specific implementation (say, the projective space stuff) should be more prone to miscompilation than another, except that it uses more lines of code.

In our case, we know that these specific compiler bugs just won't happen if we disable restrict. The scope of the compiler bug is quite narrow and we know what to do to prevent it, and we know the slowdown.

If gcc released a new flag -finline-experimental-1.8 with the description "speeds up your code by 1.8 % but we've seen cases of miscompilation", would you enable it in secp if the tests pass? I wouldn't.

@gmaxwell (Contributor) commented Feb 20, 2019

> Another concern is that compilers behave differently on different architectures, with different versions, different flags, etc., and it's not clear if people run the tests every time after they compile the code.

This is not a justification for this change; it is a justification for runtime self-tests. I agree that the concern that the tests aren't getting run where they need to be run is a good one, but the answer isn't to make a particular tradeoff one way for one thing; the answer is to fix the underlying problem.

Again, I still hold that this particular decision doesn't carry unique risks compared to a dozen other choices, nor do I think a case has been made that it has a worse trade-off.

> And sure, other code can be miscompiled, too. But without a specific bug in mind, there is no reason why a specific implementation (say projective space stuff) should be more prone to miscompilation than other,

If you really want, I can go and justify the other examples in light of prior miscompilation bugs specifically, but I think it would be a waste of time. Besides, "except that it uses more lines of code" is a perfectly fine justification on its own. (Not more lines precisely, but more total machine code with more total corner cases.)

> If gcc released a new flag -finline-experimental-1.8 with the description "speeds up your code by 1.8 % but we've seen cases of miscompilation

The "new" kind of confuses the issue-- anything new also carries a higher degree of unknown unknowns--, but we certainly compile with optimizations and there is a long and illustrious history of compiler optimizations and miscompilation. The benefit is (presumably!) more than 1.8% but the risk, historically, is that there have been a tremendous number of miscomplations which were optimization linked.

GCC's bug tracker is filled with scads of reports of miscompilation both current and in recent history (some even reported by some of us in the past...). But no attention is being given to that, only to this one thing that someone noticed. This seems to me to be a reactionary response ungrounded in a systematic approach to managing the tradeoffs of software quality.

@gmaxwell (Contributor) commented Feb 20, 2019

As an aside, the linked GCC misbehaviour only occurs with -O3 (the same applies to the similar behaviour in LLVM) ... which fits nicely with my above comment about optimization. I'd find an argument to not use O3 in general to be more compelling...

/* As both clang and GCC have known miscompilation bugs related to restrict, disable it for now.
* TODO: re-evaluate when these bugs are fixed, and enable restrict in known good compilers.
* See:
* - https://gcc.gnu.org/bugzilla/show_bug.cgi?id=87609

I think the referenced bugs give the impression that restrict is forbidden for those compilers because of those bugs, while the real reason is "it works one day, it fails the other day". GCC fixed the bug on master just 8 hours ago. An upstream fix doesn't mean everyone is using the latest compiler: even GCC 8.3, released 2 hours ago, doesn't have the patch.

@jonasnick (Contributor)

Running this PR on an aarch64 (ARMv8) based ODROID C2 I get a 2.43% slowdown (452us vs 463us) in bench_verify. This is without using ASM, which is not enabled by default on ARM architectures.

@gmaxwell (Contributor) commented Mar 7, 2019

@jonasnick Can you try benchmarking on the same device with restrict but with -O2 instead of -O3?

@jonasnick (Contributor)

O2:  min 450us / avg 450us / max 451us
O3:  min 452us / avg 452us / max 461us

All measurements were done after increasing the ecdsa_verify benchmark "count" from 10 to 100.

@gmaxwell (Contributor) commented Mar 9, 2019

^ If we want to be paranoid, I think Jonas' results would support using -O2 instead of -O3: not only does it address the restrict concern in both GCC and Clang, it also sidesteps a long history of bugs at -O3.

@sipa (Contributor, Author) commented Mar 12, 2019

I don't think we ever (neither in this repo nor in the Bitcoin Core build) use -O3 by default?

Scratch that, the default is -O3 apparently.

@gmaxwell (Contributor)

At one point earlier I almost responded with "this restrict issue only exists at O3, let's not worry about it" ... then thought "hm. wait a minute."... O3 is also why asm vs non-asm doesn't make that big a difference on x86_64, IIRC.

@real-or-random (Contributor)

Fwiw, this is fixed in GCC 7, 8, and 9 (>= 7.4.1, >= 8.3.1, >= 9.0) according to https://gcc.gnu.org/bugzilla/show_bug.cgi?id=87609.

It's not yet fixed in LLVM but patches have been proposed already.

@real-or-random (Contributor)

> As an aside, the linked GCC misbehaviour only occurs with -O3 (the same applies to the similar behaviour in LLVM) ... which fits nicely with my above comment about optimization. I'd find an argument to not use O3 in general to be more compelling...

Now that we switched to O2, I hoped that this issue had been resolved on the way. But apparently, the miscompilation of the example program from the GCC bug only disappears on GCC, where it's fixed anyway in the most recent versions. Clang/LLVM does loop unrolling even with O2, so you additionally need -fno-unroll-loops (https://godbolt.org/z/urkkWD), which seems to have some performance impact at least if x86_64 asm is disabled (gmp does not make a big difference).

./configure --with-bignum=gmp --with-asm=x86_64 CC=clang CFLAGS=-fno-unroll-loops
ecdsa_verify: min 81.3us / avg 83.4us / max 85.6us
ecdsa_verify: min 81.9us / avg 82.9us / max 84.3us

./configure --with-bignum=gmp --with-asm=x86_64 CC=clang CFLAGS= 
ecdsa_verify: min 81.4us / avg 83.7us / max 86.5us
ecdsa_verify: min 81.4us / avg 83.1us / max 87.0us

./configure --with-bignum=no --with-asm=no CC=clang CFLAGS= 
ecdsa_verify: min 97.1us / avg 98.6us / max 100us

./configure --with-bignum=no --with-asm=no CC=clang CFLAGS=-fno-unroll-loops
ecdsa_verify: min 95.9us / avg 101us / max 104us

500000 runs each.

I'm not a fan of tweaking flags too much, but given that there are performance differences between gcc and clang anyway, and unrolling is disabled in gcc, I could indeed imagine adding -fno-unroll-loops if others can confirm my benchmarks.


Successfully merging this pull request may close these issues.

All Major Linux Compilers miscompile restrict
7 participants