Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

debian/rules: Fix build_type for massive performance gain #54891

Closed
wants to merge 1 commit into from

Conversation

markhpc
Copy link
Member

@markhpc markhpc commented Dec 14, 2023

Recently I observed very slow 4K random write performance on an NVMe cluster running our 17.2.7 builds for Ubuntu Focal. After significant diagnosis, I was able to observe that this was due to the fact that our Debian builds are compiling RocksDB with CMAKE_BUILD_TYPE=None. For instance, looking at the console logs for the 17.2.7 Focal build:

https://jenkins.ceph.com/job/ceph-build/ARCH=x86_64,AVAILABLE_ARCH=x86_64,AVAILABLE_DIST=focal,DIST=focal,MACHINE_SIZE=gigantic/569/consoleFull

One can see this directly in the console log:
DCMAKE_CXX_COMPILER=/usr/bin/c++ -DWITH_SNAPPY=TRUE -DWITH_LZ4=TRUE -Dlz4_INCLUDE_DIRS=/usr/include -Dlz4_LIBRARIES=/usr/lib/x86_64-linux-gnu/liblz4.so -DWITH_ZLIB=TRUE -DPORTABLE=ON -DCMAKE_AR=/usr/bin/ar -DCMAKE_BUILD_TYPE=None -DFAIL_ON_WARNINGS=OFF -DUSE_RTTI=1 -DCMAKE_C_FLAGS=-Wno-stringop-truncation "-DCMAKE_CXX_FLAGS='-Wno-deprecated-copy -Wno-pessimizing-move'" "-GUnix Makefiles" /build/ceph-17.2.7/src/rocksdb

This is the same issue that both Gentoo and Canonical ran into several years ago in their builds. i.e:
https://bugs.launchpad.net/ubuntu/+source/ceph/+bug/1894453

After retesting with CMAKE_BUILD_TYPE properly set to RelWithDebInfo, 4K random write performance on the cluster doubled and RocksDB compaction times were reduced by roughly 2-3X.

Contribution Guidelines

  • To sign and title your commits, please refer to Submitting Patches to Ceph.

  • If you are submitting a fix for a stable branch (e.g. "quincy"), please refer to Submitting Patches to Ceph - Backports for the proper workflow.

  • When filling out the below checklist, you may click boxes directly in the GitHub web UI. When entering or editing the entire PR message in the GitHub web UI editor, you may also select a checklist item by adding an x between the brackets: [x]. Spaces and capitalization matter when checking off items this way.

Checklist

  • Tracker (select at least one)
    • References tracker ticket
    • Very recent bug; references commit where it was introduced
    • New feature (ticket optional)
    • Doc update (no ticket needed)
    • Code cleanup (no ticket needed)
  • Component impact
    • Affects Dashboard, opened tracker ticket
    • Affects Orchestrator, opened tracker ticket
    • No impact that needs to be tracked
  • Documentation (select at least one)
    • Updates relevant documentation
    • No doc update is appropriate
  • Tests (select at least one)
Show available Jenkins commands
  • jenkins retest this please
  • jenkins test classic perf
  • jenkins test crimson perf
  • jenkins test signed
  • jenkins test make check
  • jenkins test make check arm64
  • jenkins test submodules
  • jenkins test dashboard
  • jenkins test dashboard cephadm
  • jenkins test api
  • jenkins test docs
  • jenkins render docs
  • jenkins test ceph-volume all
  • jenkins test ceph-volume tox
  • jenkins test windows
  • jenkins test rook e2e

Signed-off-by: Mark Nelson <mark.nelson@clyso.com>
Copy link
Contributor

@tchaikov tchaikov left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

i think we should respect the CXXFLAGS set by user instead of overriding it with RelWithDebInfo.

IMHO, #54918 is a better fix of this performance issue.

@markhpc markhpc requested a review from a team December 18, 2023 16:25
@markhpc
Copy link
Member Author

markhpc commented Dec 18, 2023

i think we should respect the CXXFLAGS set by user instead of overriding it with RelWithDebInfo.

IMHO, #54918 is a better fix of this performance issue.

have you verified your fix? I've already verified this fix, and it's both more explicit and much clearer what we are doing in the debian rules file. It also matches what Canonical is doing. I don't think you should be blocking this fix.

@athanatos
Copy link
Contributor

@markhpc Do you prefer doing it unconditionally, or is it that you feel the effort to validate the CXX_FLAGS approach @tchaikov is suggesting isn't worthwhile?

@cbodley @ktdreyer Do y'all have an opinion on which form a fix should take?

@dvanders
Copy link
Contributor

Does this warning need to be updated too?
https://github.com/ceph/ceph/blob/main/do_cmake.sh#L102

Are there two separate issues here? i.e. what does do_cmake.sh do by default, and which flags were enabled for the builds in download.ceph.com?

@cbodley
Copy link
Contributor

cbodley commented Dec 20, 2023

i think we should respect the CXXFLAGS set by user instead of overriding it with RelWithDebInfo.

IMHO, #54918 is a better fix of this performance issue.

@tchaikov i like your pr. extra cflags from debian should apply to our subproject builds too 👍

but that only applies to the rocksdb build. it sounds like ceph builds are now defaulting to Debug, so we're not adding optimization flags from CMAKE_CXX_FLAGS_RELWITHDEBINFO:STRING=-O2 -g -DNDEBUG. if we don't add those to our cflags, how does it help rocksdb's performance to forward them?

we do need some strategy to guarantee that our releases get built with optimizations. do we know how rpm builds handle that? i don't see anything about CMAKE_BUILD_TYPE in ceph.spec.in, so it seems odd that debian/rules would need to override it

cc @batrick who recently added support for debug builds in ceph/ceph-build#2167. it doesn't look like that script uses debian/rules, so @markhpc's override of CMAKE_BUILD_TYPE may not interfere with that

@tchaikov
Copy link
Contributor

tchaikov commented Dec 21, 2023

i think we should respect the CXXFLAGS set by user instead of overriding it with RelWithDebInfo.
IMHO, #54918 is a better fix of this performance issue.

have you verified your fix?

i did check the compiling flags passed to compiler when compling rocksdb. please let me know if there's anything else i am supposed to do?

I've already verified this fix, and it's both more explicit

explicit ? no. i am not sure what RelWithDebInfo implies unless i read all of the build settings of cflags and rocksdb's cmake settings. i explained why my fix was better. and i cannot find your reply.

and much clearer what we are doing in the debian rules file.

clearer? maybe. it depends on how you tell the impact and verify the change.

It also matches what Canonical is doing.

i did read the ubuntu bug and their fix. i am not convinced with the reasoning of "somebody is doing foo and bar, so we should follow their leads". i am more interested in the real reasonings.

I don't think you should be blocking this fix.

to block this fix is not my intention. i want to improve this project and how it interacts with the rest of the world in general in a better way if i can. that's why i created a fix couple days ago.

@tchaikov
Copy link
Contributor

tchaikov commented Dec 21, 2023

@markhpc Do you prefer doing it unconditionally, or is it that you feel the effort to validate the CXX_FLAGS approach @tchaikov is suggesting isn't worthwhile?

@athanatos hi Sam, it's not about "worthwhile". it's more about correctness.

i wanted to fix it in the explicit and correct way. in debian and fedora's packagings, they intend to pass various cflags / cxxflags generated from their own toolings. for instance, debian provides dpkg-buildflags, and fedora packaging uses its rpm macros for defining/exposing cflags. instead of using whatever the upstream defines for a certain CMAKE_BUILD_TYPE, we should respect the distro's build options.

also, for instance, various distro downstreams have their own optimization and hardening policies. but these policies are not necessarily encoded into the upstream softwares' building systems.

@cbodley @ktdreyer Do y'all have an opinion on which form a fix should take?

@tchaikov
Copy link
Contributor

i think we should respect the CXXFLAGS set by user instead of overriding it with RelWithDebInfo.
IMHO, #54918 is a better fix of this performance issue.

@tchaikov i like your pr. extra cflags from debian should apply to our subproject builds too 👍

but that only applies to the rocksdb build. it sounds like ceph builds are now defaulting to Debug, so we're not adding optimization flags from CMAKE_CXX_FLAGS_RELWITHDEBINFO:STRING=-O2 -g -DNDEBUG. if we don't add those to our cflags, how does it help rocksdb's performance to forward them?

@cbodley hi Casey, could you share more details on "ceph builds are now defaulting to Debug" ? what is the context of this? like when, how and where? what's the packaging you are talking about?

we do need some strategy to guarantee that our releases get built with optimizations. do we know how rpm builds handle that? i don't see anything about CMAKE_BUILD_TYPE in ceph.spec.in, so it seems odd that debian/rules would need to override it

this is exactly the point i try to explain above. the cflags/cxxflags is what the distros are using.

cc @batrick who recently added support for debug builds in ceph/ceph-build#2167. it doesn't look like that script uses debian/rules, so @markhpc's override of CMAKE_BUILD_TYPE may not interfere with that

@tchaikov
Copy link
Contributor

Does this warning need to be updated too? https://github.com/ceph/ceph/blob/main/do_cmake.sh#L102

Are there two separate issues here? i.e. what does do_cmake.sh do by default, and which flags were enabled for the builds in download.ceph.com?

Dan, they are two different things. one is the debian packaging recipe, the other is do_cmake.sh. debian/rules does not use do_cmake.sh. IIUC, do_cmake.sh is only used for our CI for testing, and it is not used for the rpm or deb packagings.

@cbodley
Copy link
Contributor

cbodley commented Jan 2, 2024

@cbodley hi Casey, could you share more details on "ceph builds are now defaulting to Debug" ? what is the context of this? like when, how and where? what's the packaging you are talking about?

sorry @tchaikov, i just meant to say that the ceph builds aren't setting a CMAKE_BUILD_TYPE so cmake itself wasn't adding optimization flags. that's why rocksdb wasn't adding them when cmake/modules/BuildRocksDB.cmake forwards the CMAKE_BUILD_TYPE

i see that both the debian and rpm builds for shaman are exporting their own C/CXXFLAGS that include -O2, so your fix in #54918 covers both cases 👍

@markhpc
Copy link
Member Author

markhpc commented Jan 4, 2024

I see the other PR merged. Are we closing this?

Edit: Alright, talked to Casey. He thinks the other PR is most likely to cover the cases we need to cover so I'm going to close this. I will note however that I'm pretty frustrated that I spent several hours building and meticulously testing this on a customer cluster to verify that the flags got properly set and that it fully fixed the performance issue. I figured that a one liner to the debian rules file wouldn't be a controversial change. I won't make the mistake of actually verifying fixes before they are approved again given how easy it is for them to be blocked. I hope the other PR actually fixes the issue since I don't think it was actually ever tested before being merged.

And for posterity, I will note that the reason we have bugs like this in the first place is that we don't take the time to actually sit down, verify, and sanity check that the things we submit into the project behave the way we intend. I think it's incredibly disappointing that we value subjective correctness over actual verification and testing for even what should be a simple fix like this, but here we are.

@markhpc markhpc closed this Jan 4, 2024
@markhpc markhpc deleted the wip-debian-rocksdb-relwithdebinfo branch January 5, 2024 07:23
ProxBot pushed a commit to proxmox/ceph that referenced this pull request Jan 9, 2024
source: ceph/ceph#54891

build packages with 'RelWithDebInfo' to avoid to build rocksdb in debug

This is already the default in ubuntu packages
https://bugs.launchpad.net/ubuntu/+source/ceph/+bug/1894453
ProxBot pushed a commit to proxmox/ceph that referenced this pull request Jan 9, 2024
source: ceph/ceph#54891

build packages with 'RelWithDebInfo' to avoid to build rocksdb in debug

This is already the default in ubuntu packages
https://bugs.launchpad.net/ubuntu/+source/ceph/+bug/1894453
ProxBot pushed a commit to proxmox/ceph that referenced this pull request Jan 15, 2024
As Fabian correctly noticed, from the two PR's, namely PR #54918[0]
and PR #54891[1], only the first one is necessary, that's why the
second one was closed upstream, so drop it here too to avoid a
unnecessary divergence from upstream.

[0]: ceph/ceph#54918
[1]: ceph/ceph#54891

Signed-off-by: Thomas Lamprecht <t.lamprecht@proxmox.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
5 participants