ARROW-17436: [C++] Use -O2 instead of -O3 for RELEASE builds #13661
Conversation
@ursabot benchmark please
Supported benchmark command examples:
- To run all benchmarks: …
- To filter benchmarks by language: …
- To filter Python and R benchmarks by name: …
- To filter C++ benchmarks by archery --suite-filter and --benchmark-filter: …
- For other …
@ursabot please benchmark
Benchmark runs are scheduled for baseline = 1214083 and contender = 46e3195. Results will be available as each benchmark for each run completes.
cc @pitrou
Here's a dump of symbols that shrink the most in -O2: https://gist.github.com/wesm/4a2815077ed37b671d6160b8abec5e7c. I'd be interested to see if, e.g., unsafe numeric casts are significantly affected by this.
Looks great. I believe the big regressions from some tests are not real. One catch is that gcc -O2 disables vectorization, while clang -O2 keeps it. We may need additional -fxxxx flags if we want to keep some useful features.
Hmm, perhaps the bit-util micro-benchmarks are a bit pathological, but other regressions seem real and quite significant...
Not sure which gcc version is used in conbench.
Perhaps try …
There's …
Also, apparently with gcc 12 the following flags are enabled with …
if(NOT MSVC)
  string(REPLACE "-O3 -DNDEBUG" "" CMAKE_CXX_FLAGS_RELEASE "${CMAKE_CXX_FLAGS_RELEASE}")
Why remove -DNDEBUG here?
Especially, RelWithDebInfo needs -DNDEBUG, otherwise runtime assertions are enabled.
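For illustration, a downstream build could make the same -O3 to -O2 change at configure time without editing CMakeLists.txt, while keeping -DNDEBUG so runtime assertions stay disabled. A hypothetical invocation (paths and flag sets are assumptions, not the PR's actual approach):

```shell
# Override the Release flags entirely at configure time:
# drop -O3 in favor of -O2, but keep -DNDEBUG.
cmake -S cpp -B build \
  -DCMAKE_BUILD_TYPE=Release \
  -DCMAKE_CXX_FLAGS_RELEASE="-O2 -DNDEBUG"
```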
@ursabot please benchmark lang=C++
Benchmark runs are scheduled for baseline = 1214083 and contender = e4e430b. Results will be available as each benchmark for each run completes.
OK, so I ran this locally with gcc 9.4.0 (Ubuntu 20.04) on an AMD Zen 2. Build times (with ccache disabled):
Lib sizes
Compute benchmarks
All in all, in this case:
Hmm, I notice that … Edit: re-ran benchmarks and updated the gists above.
@ursabot please benchmark lang=C++
Benchmark runs are scheduled for baseline = 1214083 and contender = 6dd7bab. Results will be available as each benchmark for each run completes.
cc @save-buffer @westonpace for further opinions
I would guess there are places that benefit from loop unswitching as well.
Hmm, I don't think it's our duty to micro-optimize compiler options, though. There are too many moving parts (compiler brand, compiler version, architecture, etc.).
I would say that we should just keep -O3 and keep an eye on symbol sizes in case we need to intervene occasionally. On the whole I think the symbol sizes we have are not too bad.
@ursabot please benchmark lang=C++
Benchmark runs are scheduled for baseline = 8a2acaa and contender = 7e5ca1a. Results will be available as each benchmark for each run completes.
I added …
You undid the changes I had already pushed :-(
I've been trying to get caught up on the context here - I took a look at #13654. My current understanding is:
So looking at the results, -O3 adds about 1 MB (to ~22 MB) of total binary size, so I don't think that's an issue in itself. However, there is something to be said about bloating individual kernels. Reading the other PR, it seems like one of the kernels was 40 KB? That's quite alarming, as chips these days have about 32 KB of icache; in the worst case, that's quite a bit of thrashing.

As for solutions: looking at the benchmarks, it seems like the current code is pretty unstable with regard to what the compiler generates from a given set of flags. I'm not sure messing with compiler flags will be one-size-fits-all, as each combination of flags causes large changes in the generated code. I did like the changes in #13654. I really liked this point, which very much aligns with my experience and intuition that abstract templates lead to unstable code generation:
So in my mind, two solutions we could have are:
@save-buffer Thanks for your comments.
I agree with this as well -- I know that some feel that manually generating code when you could have templates "do it for you" is an antipattern, but it seems, at least, that the code in compute/kernels/codegen_internal.h has gone a little too far in introducing abstractions where we put too much blind faith in the compiler (e.g. the "OutputAdapter"). Not a priority by any means among our myriad priorities, but perhaps something for us to occasionally hack at in our idle moments (I did #13654 when I was bored on an airplane).
In https://conbench.ursa.dev/compare/runs/e938638743e84794ad829524fae04fbd...20727b1b390e4b30be10f49db7f06f3f/ it seems that there are several hundred microbenchmarks with > 10% performance regressions but also over 100 microbenchmarks with > 10% performance improvement. I'd say it's a coin toss whether to move to -O2 (with -ftree-vectorize) versus -O3. |
I would very much like to run the TPC-H benchmarks on this change. They are failing in conbench at the moment. There is a fix for these benchmarks in PR right now (#13679) so maybe we can run it after. That will at least give us some sense of the impact at a macro-level. |
Makes sense. Let’s get more data and make a decision after 9.0.0 goes out. |
@westonpace Are you planning to get/report TPC-H benchmark numbers for this? |
@ursabot please benchmark lang=R
Benchmark runs are scheduled for baseline = 8a2acaa and contender = 47fcf77. Results will be available as each benchmark for each run completes.
Ah, I think a rebase is needed to get these passing. I'll do that real quick. |
@pitrou I got the results here: https://conbench.ursa.dev/compare/runs/b724609840e242afbf4e1e26682afbe3...b742cce58407420db4da8e461604a1db/ There were no significant changes (one of the queries was 15% faster and everything else was within +/- 5%) so I think I'm +1 for this change. |
@github-actions crossbow submit -g cpp -g python -g r |
Revision: de53440 Submitted crossbow builds: ursacomputing/crossbow @ actions-0c44f1532a
LGTM
Benchmark runs are scheduled for baseline = 682c63a and contender = 9d1bbaf. 9d1bbaf is a master commit associated with this PR. Results will be available as each benchmark for each run completes.
I don't know why, but it seems that the "AMD64 Windows R 3.6 RTools 35" CI job has been failing since this change was merged into master: https://github.com/apache/arrow/runs/7866259947?check_suite_focus=true#step:11:699
That's really strange. Where is the log for the job that builds the artifacts that … depends on? Is the tarball it is downloading stale, by chance?
I don't know. Perhaps @paleolimbot wants to take a look. But regardless, we're now having a discussion to drop RTools 3.5 on the ML, so I'm not sure that matters much. |
As you noted, we're discussing dropping support for the failing platform. I don't currently have a development environment for RTools 35; while I could set one up, I'm not keen to spend a bunch of time doing that if we're about to drop support. I'll open a discussion with the R developers as to how we'll solve the issue.
…13661)

Motivated by investigation in apache#13654. To be discussed.

Lead-authored-by: Antoine Pitrou <antoine@python.org>
Co-authored-by: Wes McKinney <wesm@apache.org>
Signed-off-by: Antoine Pitrou <antoine@python.org>