pacific: mgr: various fixes for mgr scalability #44869

Merged
merged 8 commits into ceph:pacific from wip-mgr-fixes-pacific on Feb 24, 2022

Conversation

neha-ojha
Member

@neha-ojha neha-ojha commented Feb 2, 2022

Backports:

These fixes were identified to help with mgr scalability during Pawsey scale testing.

Checklist

  • Tracker (select at least one)
    • References tracker ticket
    • Very recent bug; references commit where it was introduced
    • New feature (ticket optional)
    • Doc update (no ticket needed)
    • Code cleanup (no ticket needed)
  • Component impact
    • Affects Dashboard, opened tracker ticket
    • Affects Orchestrator, opened tracker ticket
    • No impact that needs to be tracked
  • Documentation (select at least one)
    • Updates relevant documentation
    • No doc update is appropriate
  • Tests (select at least one)

@neha-ojha neha-ojha requested review from a team as code owners February 2, 2022 20:22
@neha-ojha neha-ojha requested review from pereman2 and Sarthak0702 and removed request for a team February 2, 2022 20:22
@github-actions github-actions bot added this to the pacific milestone Feb 2, 2022
@github-actions

github-actions bot commented Feb 2, 2022

This pull request can no longer be automatically merged: a rebase is needed and changes have to be manually resolved

@github-actions github-actions bot added this to In progress in Dashboard Feb 2, 2022
@neha-ojha
Member Author

Changelog

  • Rebased to resolve minor conflict in src/pybind/mgr/progress/module.py

@vumrao
Contributor

vumrao commented Feb 3, 2022

@vumrao

@neha-ojha
Member Author

jenkins test make check

Dashboard automation moved this from In progress to Reviewer approved Feb 11, 2022
@neha-ojha
Member Author

Changelog

2022-02-06T07:27:13.493+0000 7f5efede2700 10 ceph_option_get osd_pool_default_size found: 2
2022-02-06T07:27:13.493+0000 7f5efede2700  0 [devicehealth WARNING root] Not enough OSDs yet to create monitoring pool
...
2022-02-06T08:00:08.609+0000 7f2a6c782700  0 [dashboard ERROR exception] Dashboard Exception
Traceback (most recent call last):
  File "/usr/share/ceph/mgr/dashboard/services/exception.py", line 71, in handle_rados_error
    yield
  File "/usr/share/ceph/mgr/dashboard/services/exception.py", line 46, in dashboard_exception_handler
    return handler(*args, **kwargs)
  File "/usr/lib/python3.6/site-packages/cherrypy/_cpdispatch.py", line 54, in __call__
    return self.callable(*self.args, **self.kwargs)
  File "/usr/share/ceph/mgr/dashboard/controllers/_base_controller.py", line 258, in inner
    ret = func(*args, **kwargs)
  File "/usr/share/ceph/mgr/dashboard/controllers/_rest_controller.py", line 191, in wrapper
    return func(*vpath, **params)
  File "/usr/share/ceph/mgr/dashboard/controllers/pool.py", line 284, in configuration
    return RbdConfiguration(pool_name).list()
  File "/usr/share/ceph/mgr/dashboard/services/rbd.py", line 146, in list
    ioctx = mgr.rados.open_ioctx(self._pool_name)
  File "rados.pyx", line 982, in rados.Rados.open_ioctx
rados.ObjectNotFound: [errno 2] RADOS object not found (error opening pool 'device_health_metrics')

@neha-ojha
Member Author

jenkins test make check

@neha-ojha
Member Author

jenkins retest this please

@github-actions

This pull request can no longer be automatically merged: a rebase is needed and changes have to be manually resolved

liewegas and others added 8 commits February 16, 2022 17:14
This does not change for the lifetime of an active mgr module.  No need to
keep calling back into Mgr to re-fetch it.

Signed-off-by: Sage Weil <sage@newdream.net>
(cherry picked from commit 994832e)

 Conflicts:
	src/pybind/mgr/mgr_module.py - self._db_lock = threading.Lock() DNE in pacific
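
The idea in the commit message above (fetch a value once, because it cannot change while the module is active) can be sketched roughly as follows. This is an illustration only, not the actual `mgr_module.py` change: `FakeMgr`, `get_id`, and `Module` are hypothetical names standing in for the real callback into Mgr.

```python
class FakeMgr:
    """Hypothetical stand-in for an expensive callback into Mgr."""

    def __init__(self):
        self.calls = 0

    def get_id(self):
        # Count how often the "expensive" call is actually made.
        self.calls += 1
        return "mgr.x"


class Module:
    """Hypothetical module that caches a value fixed for its lifetime."""

    def __init__(self, mgr):
        self._mgr = mgr
        self._cached_id = None  # filled on first use, never invalidated

    def get_id(self):
        # The value does not change while the module is active,
        # so fetch it once and reuse it afterwards.
        if self._cached_id is None:
            self._cached_id = self._mgr.get_id()
        return self._cached_id


mgr = FakeMgr()
module = Module(mgr)
ids = [module.get_id() for _ in range(3)]
```

In this sketch, three lookups trigger only one call into the stand-in Mgr object.
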
We only use a handful of fields, and the pg dump includes a gazillion
fields that we waste CPU copying to python-land.  This tends to lead to
long ClusterState::lock hold times, leading to long ms_dispatch delays
and generally gumming up the works.

Instead, create a new "pg_progress" item that dumps only the fields that
mgr/progress needs.

Fixes: https://tracker.ceph.com/issues/53475
Signed-off-by: Sage Weil <sage@newdream.net>
(cherry picked from commit f5973cc)
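
As a rough illustration of the "dump only what the consumer needs" idea described above, the sketch below projects a large per-PG record down to a few fields. The field names in `NEEDED_FIELDS` and the sample record are assumptions chosen for the example, not the actual contents of the `pg_progress` item. Per the commit message, the real saving comes from never copying the unused fields into Python at all, so the actual filtering would have to happen before the Python objects are built; this only shows the shape of the reduction.

```python
# Hypothetical selection of fields; the real pg_progress item may differ.
NEEDED_FIELDS = ("pgid", "state", "reported_epoch")


def pg_progress_view(pg_stats):
    """Return a reduced copy of each PG record with only the needed fields."""
    return [
        {key: pg[key] for key in NEEDED_FIELDS if key in pg}
        for pg in pg_stats
    ]


# Made-up sample record standing in for one entry of a full pg dump.
full_dump = [
    {
        "pgid": "1.0",
        "state": "active+clean",
        "reported_epoch": 42,
        "stat_sum": {"num_bytes": 123},
        "up": [0, 1],
        "acting": [0, 1],
    },
]
reduced = pg_progress_view(full_dump)
```
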
We need to avoid making drastic changes to pg_num that outpace pgp_num, or
else we may hit the per-osd pg limits.

Fixes: https://tracker.ceph.com/issues/53442
Signed-off-by: Sage Weil <sage@newdream.net>
(cherry picked from commit 3b2a112)

 Conflicts:
	src/common/options/mgr.yaml.in - old way of specifying config settings
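
One way to picture the throttling described above is to clamp each pg_num adjustment to a window around the current pgp_num. This is a minimal sketch under stated assumptions, not the logic of the actual commit: the function name `next_pg_num` and the `max_change` parameter are hypothetical.

```python
def next_pg_num(pgp_num, target_pg_num, max_change):
    """Pick the next pg_num, staying within max_change of pgp_num.

    Hypothetical helper: the window is [pgp_num - max_change,
    pgp_num + max_change], with a floor of 1.
    """
    lower = max(1, pgp_num - max_change)
    upper = pgp_num + max_change
    # Clamp the requested target into the allowed window.
    return min(max(target_pg_num, lower), upper)


big_jump = next_pg_num(pgp_num=64, target_pg_num=1024, max_change=128)
small_jump = next_pg_num(pgp_num=64, target_pg_num=96, max_change=128)
big_shrink = next_pg_num(pgp_num=512, target_pg_num=32, max_change=128)
```

With these inputs, the large increase is limited to 192, the small increase goes straight to 96, and the large decrease is limited to 384, so pg_num only moves further once pgp_num has caught up.
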
Signed-off-by: Sage Weil <sage@newdream.net>
(cherry picked from commit d3c8f17)
Note that we don't annotate the dashboard NotificationQueue because it is
used internally by the dashboard with other events.

Signed-off-by: Sage Weil <sage@newdream.net>
(cherry picked from commit 1ac480d)

 Conflicts:
	src/pybind/mgr/cephadm/module.py - trivial resolution
	src/pybind/mgr/dashboard/module.py - trivial resolution
	src/pybind/mgr/localpool/module.py - trivial resolution
	src/pybind/mgr/mds_autoscaler/module.py - trivial resolution
Signed-off-by: Sage Weil <sage@newdream.net>
(cherry picked from commit 95ca3a4)

 Conflicts:
	src/pybind/mgr/mds_autoscaler/module.py - trivial resolution
Signed-off-by: Sage Weil <sage@newdream.net>
(cherry picked from commit ee4e3ec)
master/quincy use a precreated .mgr pool and do not
need this commit

Signed-off-by: Neha Ojha <nojha@redhat.com>
@neha-ojha
Member Author

Dropping #44207 due to a lot of conflicts now

Test runs+reruns - https://pulpito.ceph.com/?branch=wip-mgr-fixes-pacific-gil

@adk3798 can you please review the cephadm failures in the last rerun https://pulpito.ceph.com/nojha-2022-02-18_01:15:24-rados-wip-mgr-fixes-pacific-gil-distro-basic-smithi/ to see if you find anything related, otherwise looks good

@adk3798
Contributor

adk3798 commented Feb 23, 2022

Dropping #44207 due to a lot of conflicts now

Test runs+reruns - https://pulpito.ceph.com/?branch=wip-mgr-fixes-pacific-gil

@adk3798 can you please review the cephadm failures in the last rerun https://pulpito.ceph.com/nojha-2022-02-18_01:15:24-rados-wip-mgr-fixes-pacific-gil-distro-basic-smithi/ to see if you find anything related, otherwise looks good

Almost all errors fall under one of these, which happen independently of this PR:
https://tracker.ceph.com/issues/53939
https://tracker.ceph.com/issues/54304
https://tracker.ceph.com/issues/54071
https://tracker.ceph.com/issues/54273

I've never seen that rm-zap-flag failure before, where the OSD never comes up after an attempted replacement, but I don't think this PR is causing it. More likely it's just an uncommon failure that happened to pop up here, imo.

@neha-ojha
Member Author

jenkins test dashboard

@neha-ojha
Member Author

@ceph/dashboard can someone please help me understand if the dashboard test failure is related or not?

@nizamial09
Member

@ceph/dashboard can someone please help me understand if the dashboard test failure is related or not?

Looked unrelated. So I triggered it again and it failed due to a flaky test. So I would say it's good to go.

@yuriw yuriw merged commit 5508fec into ceph:pacific Feb 24, 2022
Dashboard automation moved this from Reviewer approved to Done Feb 24, 2022
@neha-ojha neha-ojha deleted the wip-mgr-fixes-pacific branch February 24, 2022 20:45