Expose lto and llvm folder compilation flags #8357

Merged
Mytherin merged 12 commits into duckdb:master from the lto_and_llvm_flags branch
Aug 7, 2023

Conversation

carlopi
Contributor

@carlopi carlopi commented Jul 25, 2023

This PR exposes two independent flags to make and cmake invocations: one enables link-time optimisation (LTO), the other allows relying on a specific LLVM binary folder.
Examples:

LTO=full make                                        # default options plus passing -flto=full to the compiler/linker
LTO=thin make                                        # default options plus passing -flto=thin to the compiler/linker
CMAKE_LLVM_PATH=~/llvm-project/build make            # default options using clang++ / clang and llvm-ranlib found in the given folder (unquoted, so the shell expands ~)
LTO=full CMAKE_LLVM_PATH=~/llvm-project/build make   # default options using clang++ / clang and llvm-ranlib found in the given folder PLUS -flto=full

The options are composable, with each other and with other options, but LTO requires the underlying compiler to support it.
clang supports both the thin and full options, gcc only the full option; passing an unsupported option will result in a compiler failure.
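
The support rule can be sketched as a small lookup. This is only an illustration, not the logic used in the PR's Makefile or CMake files: the name `lto_compiler_flag` and the table are hypothetical, and mapping gcc's full option to plain `-flto` is an assumption.

```python
from typing import Optional

# Hypothetical table mirroring the support rule described in the PR text.
SUPPORTED_LTO = {
    "clang": {"thin": "-flto=thin", "full": "-flto=full"},
    # Assumption: gcc has no thin/full split, so "full" maps to plain -flto.
    "gcc": {"full": "-flto"},
}


def lto_compiler_flag(compiler: str, lto: str) -> Optional[str]:
    """Return the LTO flag for a compiler family, or None if LTO is unset."""
    if not lto:
        return None  # LTO variable not defined: build with default options
    try:
        return SUPPORTED_LTO[compiler][lto]
    except KeyError:
        # Stands in for the compiler failure on an unsupported option.
        raise ValueError(f"LTO={lto} is not supported by {compiler}") from None
```

Raising an error here plays the role of the compiler failure on an unsupported combination such as gcc with the thin option.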

This PR does not add either LTO or an updated clang to any workflow, except for the self tests executed in NightlyTests.yml.
Enabling this for distributed binaries is left to a follow-up PR.

LTO basics

LTO (Link Time Optimisation) basically trades slower compilation times for a somewhat more optimised binary.
clang also exposes a rebuild-friendly ThinLTO (http://blog.llvm.org/2016/06/thinlto-scalable-and-incremental-lto.html) that aims to be cheap enough to turn on during development as well.

These are full rebuild times without LTO, with LTO=thin and with LTO=full, on a Mac M2 using the default clang 14:

GEN=ninja make			489.60s user 30.12s system 678% cpu 1:16.57 total
GEN=ninja LTO=thin make		900.59s user 40.24s system 663% cpu 2:21.72 total
GEN=ninja LTO=full make		711.20s user 46.07s system 341% cpu 3:41.52 total
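
To put the three timings side by side, the wall-clock totals can be converted to seconds and compared to the baseline. The snippet below is just a convenience sketch over the numbers above; the helper name `wall_seconds` is made up for illustration.

```python
def wall_seconds(total: str) -> float:
    """Convert a `time`-style total such as '1:16.57' into seconds."""
    minutes, seconds = total.split(":")
    return int(minutes) * 60 + float(seconds)


# Wall-clock totals copied from the three runs above.
totals = {"no LTO": "1:16.57", "LTO=thin": "2:21.72", "LTO=full": "3:41.52"}
baseline = wall_seconds(totals["no LTO"])

# Slowdown of each build relative to the build without LTO.
slowdown = {name: wall_seconds(t) / baseline for name, t in totals.items()}

for name, factor in slowdown.items():
    print(f"{name}: {factor:.2f}x wall-clock time")
```

On these numbers the thin build burns more user CPU time than the full build (900.59s vs 711.20s) but finishes sooner on the wall clock, in line with the higher CPU utilisation reported for it (663% vs 341%).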

I tried to estimate performance gains using our current benchmark suite, and the speed-up for full LTO seems to be (very unscientifically) about 3% (on the geometric mean of all tests).
More testing / benchmarking is probably required to put a serious number on that. But given that there seem to be no regressions AND some workloads are optimised significantly (up to 70% faster), I'd say it makes sense to consider this for inclusion.

The idea of this PR is to allow, now or in the future, experimenting with this easily.

CMAKE_LLVM_PATH

Currently on most workflows we build DuckDB using the default system compiler, which for clang is version 14 on both macOS and Ubuntu 22.04.
Current development is at clang 17 (used by duckdb-wasm), while clang 16 is already stable and packaged, for example via brew install llvm.
On my machine (Mac M2):

brew install llvm
locate llvm-ranlib
--- /opt/homebrew/Cellar/llvm/16.0.6/bin/llvm-ranlib
CMAKE_LLVM_PATH='/opt/homebrew/Cellar/llvm/16.0.6' make

This allows building DuckDB using clang 16.0.6 instead of the stock clang 14.0.3.
More recent compilers allow more optimisation opportunities to be leveraged; here too the benchmarking has not been done very seriously, but it seems to point towards something like a 5% improvement in execution speed.

I have done this only for LLVM/clang (since it required llvm-ranlib to be specified), but potentially it could be worth exploring performance gains for more recent gcc versions.
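
As a rough sketch of what pointing the build at an LLVM folder amounts to, the helper below derives the tool paths from the folder and expresses them as standard CMake cache variables. The function name `llvm_toolchain_cmake_args` is hypothetical and this is not the PR's actual Makefile logic; the `bin/` layout is taken from the `locate llvm-ranlib` output above.

```python
import posixpath
from typing import List


def llvm_toolchain_cmake_args(llvm_path: str) -> List[str]:
    """Build -D arguments selecting clang, clang++ and llvm-ranlib from one folder."""
    bin_dir = posixpath.join(llvm_path, "bin")
    return [
        "-DCMAKE_C_COMPILER=" + posixpath.join(bin_dir, "clang"),
        "-DCMAKE_CXX_COMPILER=" + posixpath.join(bin_dir, "clang++"),
        "-DCMAKE_RANLIB=" + posixpath.join(bin_dir, "llvm-ranlib"),
    ]


for arg in llvm_toolchain_cmake_args("/opt/homebrew/Cellar/llvm/16.0.6"):
    print(arg)
```

Because the executable names are fixed, the same approach would need a different name table to cover gcc.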

How to roll this in

My idea is that having these options more easily available allows more experimenting with this, potentially moving some nightly binaries to LTO and more recent compiler versions, and collecting feedback to decide whether this is worth turning on for proper releases as well.
But input is very welcome, and if someone wants to take this over and give it a critical look, you are very welcome!

Note on benchmarking

Benchmarking is hard, especially if you do that while trying to prove a point.
I added to the regression test runner a summary such as "new is roughly X% faster | old is roughly Y% faster | about the same". This is done by comparing geometric means. It's a very blunt simplification, so do take it with quite some distance; if it does more harm than good it should be removed.
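
A minimal sketch of such a summary, assuming per-benchmark timings in seconds for the old and new builds. This is not the code in the regression test runner: the function names and the 1% "about the same" threshold are assumptions made for illustration.

```python
import math
from typing import Sequence


def geometric_mean(values: Sequence[float]) -> float:
    """Geometric mean of positive timings, computed via logarithms."""
    return math.exp(sum(math.log(v) for v in values) / len(values))


def summarize(old: Sequence[float], new: Sequence[float], threshold: float = 0.01) -> str:
    """Collapse two sets of timings into a single blunt verdict."""
    ratio = geometric_mean(old) / geometric_mean(new)
    if ratio > 1 + threshold:
        return f"new is roughly {(ratio - 1) * 100:.0f}% faster"
    if ratio < 1 / (1 + threshold):
        return f"old is roughly {(1 / ratio - 1) * 100:.0f}% faster"
    return "about the same"


# Made-up timings, only to show the three possible outcomes.
print(summarize([2.0, 8.0], [1.0, 4.0]))
print(summarize([1.0, 4.0], [2.0, 8.0]))
print(summarize([2.0, 8.0], [2.0, 8.0]))
```

A single ratio hides per-benchmark regressions, which is exactly why the number should be read with some distance.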

I also added 3 jobs to be executed nightly that reuse the benchmark runner to evaluate LTO gains (on clang and gcc) and performance differences between clang and gcc. The role of these tests is mostly to check that the LTO and CMAKE_LLVM_PATH options keep working over time.

The current invocation of the regression script involves lots of copy-pasting; the logic should probably be refactored, but at first I thought this was clearer.

@github-actions github-actions bot marked this pull request as draft July 25, 2023 07:53
@carlopi carlopi marked this pull request as ready for review July 25, 2023 09:06
@github-actions github-actions bot marked this pull request as draft July 25, 2023 09:13
@carlopi carlopi marked this pull request as ready for review July 25, 2023 09:13
@github-actions github-actions bot marked this pull request as draft July 25, 2023 16:00
@carlopi carlopi marked this pull request as ready for review July 25, 2023 16:00
@carlopi
Contributor Author

carlopi commented Jul 25, 2023

There is still an unrelated failure in the Linux job here: https://github.com/duckdb/duckdb/actions/runs/5659062468/job/15331786339?pr=8357#step:7:2949, and a few unrelated jobs that are still to be done.

I ran a few experiments as part of the tests, using the geometric mean of the current regression tests.
Results are:

| image | base | alternative | micro | tpch | tpcds | h2oai | imdb |
|---|---|---|---|---|---|---|---|
| macos-latest | clang 14 | clang 14, LTO=full | -17% | -3% | -2% | -5% | -5% |
| macos-latest | clang 14 | clang 16 | -15% | same | -2% | same | same |
| ubuntu-latest | gcc 11.3 | clang 14 | -7% | +2% | -4% | -2% | -3% |
| ubuntu-latest | gcc 11.3 | gcc 11.3, LTO=full | -1% | -2% | -1% | -3% | -1% |

Unsure what should be read here, something like:
clang is marginally better than gcc, newer compiler versions are marginally better than older ones, gcc with LTO is marginally better than regular gcc, and clang with LTO is visibly better than regular clang.
The takeaway is that the best combination is probably a recent clang with LTO enabled, but it is hard to put actual numbers on it, and this has to be balanced against bringing in additional dependencies or existing restrictions.

@github-actions github-actions bot marked this pull request as draft July 26, 2023 05:57
@carlopi carlopi marked this pull request as ready for review July 26, 2023 05:58
Contributor

@samansmink samansmink left a comment

Very cool work! I think we should carefully consider this. Running this on master builds seems viable CI-time wise for sure, and probably worth it. However, the question then would be how often we will run into CI failures that only occur on LTO builds. Having to debug issues that are only caught on master LTO builds seems like a potentially painful process that costs a lot of dev time.

Collaborator

@Mytherin Mytherin left a comment

Thanks for the PR! LGTM - one comment:

.github/workflows/NightlyTests.yml Outdated Show resolved Hide resolved
@github-actions github-actions bot marked this pull request as draft July 26, 2023 18:35
@carlopi carlopi marked this pull request as ready for review July 26, 2023 18:36
@carlopi
Contributor Author

carlopi commented Jul 26, 2023

Thanks @samansmink and @Mytherin for the feedback.

I guess the hard choice is what to make of all this / what to turn on and under what conditions, and that would probably require some added consideration. In this PR I took the easy road of just providing options (and avoiding having to re-discover the needed set of changes in the future) without making real choices.

This PR is having another round of CI since I moved the benchmarks to a separate workflow (to be run only via workflow dispatch or on changes to the workflow itself), but on my side it is ready to be merged.

If/when we want to experiment, I would consider it probably easier to do so on OSX-based workflows, given that clang is already the default there and we have an easier time testing them; moving from clang 14 to brew-installed clang 16 and enabling LTO can potentially bring gains with lower risk.

@github-actions github-actions bot marked this pull request as draft July 27, 2023 05:15
@carlopi carlopi marked this pull request as ready for review July 27, 2023 05:15
To be enabled via `LTO=thin make` or `LTO=full make`.
To opt in, the LTO variable has to be set to a value that clang will recognize.

LTO (or FullLTO) runs additional optimisations at link time, trading longer compile times
for an improved binary (both smaller and somewhat more performant).
ThinLTO aims at reaching similar gains with a smaller footprint AND avoiding degenerate cases
where recompilation times become similar to compiling from scratch each time.

Some background on ThinLTO: http://blog.llvm.org/2016/06/thinlto-scalable-and-incremental-lto.html

Example of use: `CMAKE_LLVM_PATH=~/llvm-project/build make` or
`CMAKE_LLVM_PATH=/opt/homebrew/Cellar/llvm/16.0.6/ make`

This is currently done only for LLVM/Clang, since executable names are hardcoded (e.g. llvm-ranlib).
The same logic can be adapted to other compilers if we find it useful.

This has two roles: check that these options keep working AND give a rough estimate of
what can be gained by turning them on.

Link-time optimisation is also available in gcc, so also turn it on there, using the -flto version.

More details here: https://gcc.gnu.org/onlinedocs/gccint/LTO-Overview.html,
both on -flto and on WHOPR (somewhat similar to ThinLTO).

The geometric mean is very blunt and not to be taken too seriously, but it at least
summarizes the result as a single number: "better / about the same / worse".
@github-actions github-actions bot marked this pull request as draft July 28, 2023 08:20
@carlopi carlopi marked this pull request as ready for review July 28, 2023 08:20
@Mytherin
Collaborator

Mytherin commented Aug 7, 2023

Thanks - we can merge this and leave the actual choice of whether/where we want to enable LTO for a later date.

@Mytherin Mytherin merged commit 538b9f9 into duckdb:master Aug 7, 2023
72 of 74 checks passed
@carlopi carlopi deleted the lto_and_llvm_flags branch August 28, 2023 12:03
3 participants