pacific: mgr: various fixes for mgr scalability #44869
This pull request can no longer be automatically merged: a rebase is needed and changes have to be manually resolved.
jenkins test make check
jenkins retest this please
This does not change for the lifetime of an active mgr module. No need to keep calling back into Mgr to re-fetch it.

Signed-off-by: Sage Weil <sage@newdream.net>
(cherry picked from commit 994832e)

Conflicts:
    src/pybind/mgr/mgr_module.py
    - self._db_lock = threading.Lock() DNE in pacific
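The pattern this commit describes, fetching a value once because it cannot change while the module is active, can be sketched as below. This is a minimal illustration and not the actual `mgr_module.py` change; `FakeMgr`, `fetch_static_value`, and `CachingModule` are hypothetical names chosen for the example.

```python
class FakeMgr:
    """Hypothetical stand-in for an expensive call back into the mgr."""

    def __init__(self):
        self.calls = 0  # counts how often the expensive path is taken

    def fetch_static_value(self):
        self.calls += 1
        return "value-that-never-changes"


class CachingModule:
    """Fetch the value once at construction and reuse it afterwards."""

    def __init__(self, mgr):
        self._mgr = mgr
        # Assumption: the value is fixed for the lifetime of this object,
        # so a single fetch at start-up is enough.
        self._static_value = mgr.fetch_static_value()

    def get_static_value(self):
        # No call back into the mgr here; just return the cached copy.
        return self._static_value
```

Caching at construction time is only safe because the value is immutable for the module's lifetime; a value that can change would need invalidation instead.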
We only use a handful of fields, and the pg dump includes a gazillion fields that we waste CPU copying to python-land. This tends to lead to long ClusterState::lock hold times, leading to long ms_dispatch delays and generally gumming up the works.

Instead, create a new "pg_progress" item that dumps only the fields that mgr/progress needs.

Fixes: https://tracker.ceph.com/issues/53475
Signed-off-by: Sage Weil <sage@newdream.net>
(cherry picked from commit f5973cc)
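The effect of dumping only the needed fields can be illustrated with a small projection over a per-PG stats mapping. This is a sketch under stated assumptions: the field names in `NEEDED_FIELDS` and the shape of `full_dump` are hypothetical and are not taken from the real "pg_progress" item, which is produced on the C++ side.

```python
from typing import Any, Dict, Tuple

# Hypothetical field names; the real set used by mgr/progress may differ.
NEEDED_FIELDS: Tuple[str, ...] = ("state", "num_bytes", "num_bytes_recovered")


def project_pg_stats(
    full_dump: Dict[str, Dict[str, Any]],
    needed: Tuple[str, ...] = NEEDED_FIELDS,
) -> Dict[str, Dict[str, Any]]:
    """Keep only the requested fields for each PG, dropping everything else."""
    return {
        pgid: {key: stats[key] for key in needed if key in stats}
        for pgid, stats in full_dump.items()
    }
```

The smaller the projected structure, the less time is spent copying it while the lock is held, which is the motivation the commit message gives.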
We need to avoid making drastic changes to pg_num that outpace pgp_num, or else we may hit the per-osd pg limits.

Fixes: https://tracker.ceph.com/issues/53442
Signed-off-by: Sage Weil <sage@newdream.net>
(cherry picked from commit 3b2a112)

Conflicts:
    src/common/options/mgr.yaml.in
    - old way of specifying config settings
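One way to picture the limit described in this commit is a clamp that keeps each pg_num step within a fixed distance of the current pgp_num. The function below is a simplified sketch, not the logic in the mgr itself; `next_pg_num` and its `max_change` parameter are hypothetical names for illustration.

```python
def next_pg_num(pg_num: int, pgp_num: int, target: int, max_change: int) -> int:
    """Return the next pg_num on the way to target.

    The result never moves further than max_change away from pgp_num,
    and never moves away from the target.
    """
    if target > pg_num:
        ceiling = pgp_num + max_change
        # Grow toward the target, but stop at the ceiling; if pg_num is
        # already past the ceiling, hold still until pgp_num catches up.
        return max(pg_num, min(target, ceiling))
    if target < pg_num:
        floor = pgp_num - max_change
        # Shrink toward the target, but stop at the floor.
        return min(pg_num, max(target, floor))
    return pg_num
```

With `max_change=128`, a pool at pg_num 32 and pgp_num 32 heading for 1024 would first step to 160 and then wait for pgp_num to advance before growing further.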
Signed-off-by: Sage Weil <sage@newdream.net> (cherry picked from commit d3c8f17)
Note that we don't annotate the dashboard NotificationQueue because it is used internally by the dashboard with other events.

Signed-off-by: Sage Weil <sage@newdream.net>
(cherry picked from commit 1ac480d)

Conflicts:
    src/pybind/mgr/cephadm/module.py - trivial resolution
    src/pybind/mgr/dashboard/module.py - trivial resolution
    src/pybind/mgr/localpool/module.py - trivial resolution
    src/pybind/mgr/mds_autoscaler/module.py - trivial resolution
Signed-off-by: Sage Weil <sage@newdream.net> (cherry picked from commit 95ca3a4) Conflicts: src/pybind/mgr/mds_autoscaler/module.py - trivial resolution
Signed-off-by: Sage Weil <sage@newdream.net> (cherry picked from commit ee4e3ec)
master/quincy use a precreated .mgr pool and do not need this commit.

Signed-off-by: Neha Ojha <nojha@redhat.com>
Dropping #44207 for now due to a lot of conflicts.
Test runs and reruns: https://pulpito.ceph.com/?branch=wip-mgr-fixes-pacific-gil
@adk3798 can you please review the cephadm failures in the last rerun, https://pulpito.ceph.com/nojha-2022-02-18_01:15:24-rados-wip-mgr-fixes-pacific-gil-distro-basic-smithi/, to see if you find anything related? Otherwise this looks good.
Almost all errors fall under one of these, which happen independently of this PR. I've never seen that rm-zap-flag failure before, where the osd never comes up after an attempted replacement, but I don't think this PR is causing it. More likely it's just an unlikely failure that happened to pop up here, imo.
jenkins test dashboard
@ceph/dashboard can someone please help me understand if the dashboard test failure is related or not?
It looked unrelated, so I triggered the test again, and it failed due to a flaky test. I would say it's good to go.
Backports:
These fixes were identified to help with mgr scalability during Pawsey scale testing.