-
Notifications
You must be signed in to change notification settings - Fork 1.5k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Expose lto and llvm folder compilation flags #8357
Conversation
0c578fa
to
756610a
Compare
080f887
to
9a8b456
Compare
There is still an unconnected failure in Linux job here: https://github.com/duckdb/duckdb/actions/runs/5659062468/job/15331786339?pr=8357#step:7:2949, and a few unconnected jobs that are still to be done. I run a few experiments as part of the tests, using geometric mean of current regression tests.
Unsure what should be read here, something like: |
9a8b456
to
ed8ade8
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Very cool work! I think we should carefully consider this. Running this on master builds seems viable CI-time wise for sure, and probably worth it. However, the question then would be how often will we run into ci failures that only occur on lto builds. Having to debug issues that are only caught on master lto builds seems like a potentially painful process that cost a lot of dev time
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks for the PR! LGTM - one comment:
Thanks @samansmink and @Mytherin for the feedback. I guess the hard choice that is what to make of all this / what to turn on and on what condition, and that would probably require some added considerations. On this PR I took the easy road of just providing options (and avoiding to have to re-discover the set of changes that were needed again in the future) without making real choices. This PR is having another round of CI since I moved the benchmarks to a separate workflow (to be run only via workflow dispatch or on changes to the workflow itself), but on my side is ready to be merged. If/when we want to experiment, I would consider probably easier to do so on OSX-based workflows, given that there clang it's already the default + we have an easier time testing them, and potentially moving from clang 14 to brew installed clang 16 and enabling LTO can bring gains with lower risks. |
ad9873a
to
8a56b9a
Compare
To be enabled via `LTO=thin make` or `LTO=full make`. To opt-in LTO variable has to be defined to something that clang will recognize LTO or FullLTO implies running additional optimisations at link time, trading off time spent compiling with improved compiled binary (both smaller and somehow more performant). ThinLTO aims at reaching similar gains with a smaller footprint AND avoiding degenerate cases where recompilations times becomes similar to compiling each time from scratch. Here some background on ThinLTO: http://blog.llvm.org/2016/06/thinlto-scalable-and-incremental-lto.html
Example of use `CMAKE_LLVM_PATH=~/llvm-project/build make` or `CMAKE_LLVM_PATH=/opt/homebrew/Cellar/llvm/16.0.6/ make` This is currently done only of LLVM/Clang, since executable names are hardcoded (eg llvm-ranlib). Same logic can be adapted to other compilers if we found it useful
This has two roles: check these option will keep working AND give a rough estimate of what can be gained by turning these on
LinkTimeOptimizations is available also in gcc, so make it also turn it on in the -flto version. Here additional details: https://gcc.gnu.org/onlinedocs/gccint/LTO-Overview.html, both on -flto and -whopr (somehow similar to ThinLTO)
Geometrical mean is very blunt, not to be used too seriously, but at least summarize as a single number "better / about the same / worst"
8a56b9a
to
cff0a04
Compare
Thanks - we can merge this and leave the actual choice of whether/where we want to enable LTO for a later date. |
This PR expose 2 independent flags to make and cmake invocations that enable link time optimisations and allow relying on an specific LLVM binary folder.
Examples:
Option are composable, between each other and with other options, but LTO requires the underlying compiler to supports it.
clang
supports both thin and full options,gcc
only the full option, while providing a non-supported option will result in a compiler failure.This PR do not adds either LTO or updated clang to any workflow but for self tests executed in NightlyTests.yml.
Eventually enabling this for distributed binaries is moved to a follow-up PR.
LTO basics
LTO (Link Time Optimisation) is basically trading slower compilation times for somewhat more optimised binary produced.
clang exposes also a rebuild friendly ThinLTO (http://blog.llvm.org/2016/06/thinlto-scalable-and-incremental-lto.html) that aims to be possible to turn in also during development.
This are full rebuild times without LTO, LTO=thin and LTO=full on a Mac M2 using the default clang 14:
I tried to estimate performance gains using our current benchmark suite, and speed up for full LTO seems to be, (very unscientifically) about 3% (on geometrical mean of all tests).
More testing / benchmarking is probably required to put a serious number into that. But I'd say the fact that there seems to be no regressions AND some workload are optimised significantly (up to 70% faster) it would make sense to consider this for inclusion.
Idea of this PR is allow, now or in the future, to experiment with this easily.
CMAKE_LLVM_PATH
Currently on most workflows we build DuckDB using the default system compiler, that for clang is version 14 both on macos and on ubuntu 22.04.
Current development is at clang 17 (used by duckdb-wasm), while clang 16 is already stable and packaged for example via
brew install llvm
.On my machine (Mac M2):
brew install llvm locate llvm-ranlib --- /opt/homebrew/Cellar/llvm/16.0.6/bin/llvm-ranlib CMAKE_LLVM_PATH='/opt/homebrew/Cellar/llvm/16.0.6' make
allows to build DuckDB using clang 16.0.6 instead of the stock clang 14.0.3.
More recent compilers allows more optimisation opportunities to be leveraged, also here benchmarking has been done not very seriously but seems to point towards something like 5% improvement in execution speed.
I have done this only for LLVM/clang (since it required llvm-ranlib to be specified), but potentially it could be worth exploring performance gains for more recent gcc versions.
How to roll this in
My idea is that having there options more easily available allows more experimenting with this, potentially moving some nightly binaries to use LTO and more recent compilers version and collect feedback to be able to decide whether this is worthy to be turned on also for proper releases.
But input is very welcome, and if someone wants to take this over and give a critical look, you are very welcome!
Note on benchmarking
Benchmarking is hard, especially if you do that while trying to prove a point.
I added to the regression test runner a summary such as "new is roughly X% faster | old is roughly Y% faster | about the same". This is done comparing geometrical means. It's a very blunt simplification, do take this with quite some distance, if it does more harm than good it should be removed.
I also added 3 jobs to be executed on nightly that re use the benchmark runner to evaluate LTO gains (on clang and gcc) and performance differences between clang and gcc. Role of these tests is mostly checking that LTO and CMAKE_LLVM_PATH options keep working over time.
Current invocation of regression script involves lots of copy pasting, probably logic should be refactored, but at first I though this was clearer.