mgr: Release GIL before calling OSDMap::calc_pg_upmaps() #31064
@liewegas @xiexingguo @jdurgin I'm not sure whether I put the PyEval_SaveThread() call in the best location, or whether there are ramifications to dropping the GIL there. I was able to run ceph balancer status and ceph balancer show xxx while a test sleep was happening in OSDMap::calc_pg_upmaps().
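As a rough Python-level analogy for the behaviour described above (not the ceph-mgr code itself), the sketch below uses time.sleep(), which releases the GIL while it blocks, as a stand-in for a native call wrapped in PyEval_SaveThread()/PyEval_RestoreThread(). The function names are hypothetical and exist only for illustration.

```python
import threading
import time


def slow_call(done):
    # time.sleep() releases the GIL while blocking. Here it stands in for a
    # long native call (such as a slow optimization) made with the GIL dropped.
    time.sleep(0.2)
    done.set()


def status_polls_during_slow_call():
    """Count how often the main thread gets to run while slow_call() blocks."""
    done = threading.Event()
    worker = threading.Thread(target=slow_call, args=(done,))
    worker.start()

    polls = 0
    while not done.is_set():
        polls += 1        # the "status command" thread keeps making progress
        time.sleep(0.01)

    worker.join()
    return polls
```

If the slow call held the GIL for its whole duration, the polling loop could not run until the call returned; with the GIL released, the loop runs many times in the meantime.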
Hmm, looking closer I see one potential race: since the balancer module inserts the plan object in optimize()->plan_create(), a user could call execute() on a partially initialized plan (with the incremental in particular still being updated). To fix this, we could add the plan to self.plans only after optimize() is done initializing it.
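The "publish only after initialization" idea can be sketched as follows. This is a minimal illustration under stated assumptions, not the balancer module's actual code: the Plan and Balancer classes, the _compute_incremental() helper, the lock, and the return values are all hypothetical.

```python
import errno
import threading


class Plan:
    """Hypothetical stand-in for the balancer's plan object."""

    def __init__(self, name):
        self.name = name
        self.inc = None  # the incremental, filled in during optimize()


class Balancer:
    """Hypothetical stand-in for the balancer module."""

    def __init__(self):
        self.plans = {}
        self._lock = threading.Lock()

    def _compute_incremental(self, plan):
        # Placeholder for the slow optimization step.
        return {"upmaps": []}

    def optimize(self, name):
        plan = Plan(name)  # local only: not yet visible to other threads
        plan.inc = self._compute_incremental(plan)
        with self._lock:
            self.plans[name] = plan  # publish once fully initialized
        return plan

    def execute(self, name):
        with self._lock:
            plan = self.plans.get(name)
        if plan is None:
            return -errno.ENOENT, "plan %s not found" % name
        # Any plan found in self.plans already has its incremental set.
        return 0, "executed %s" % name
```

With this ordering, execute() either does not see the plan at all or sees it complete; it never observes a plan whose incremental is still being built.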
@jdurgin I fixed the issue with self.plans that you pointed out. In addition, now the code disallows the command "ceph balancer optimize ...." while the balancer is active. This prevents races with optimize being run in 2 threads. There is still a race:
I can fix this with a boolean "optimizing" that is set while the active balancer is optimizing. So even if active = False, optimizing = True will still fail the balancer optimize command. There is a lock in the ceph-mgr for incoming ceph commands, so if a manual "ceph balancer optimize ..." takes a long time, other ceph balancer commands like status will hang as before. At least this prevents activating the balancer while a manual optimize is already running. I believe that this is the relevant code:
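The snippet referred to above is not reproduced in this conversation. Separately, the "optimizing" flag idea can be sketched as below. This is only an illustration under assumptions: the BalancerGuard class, its method names, the lock, and the error message are hypothetical and are not taken from the ceph-mgr or balancer sources.

```python
import errno
import threading


class BalancerGuard:
    """Hypothetical sketch of gating manual optimize on two flags."""

    def __init__(self):
        self.active = False      # automatic balancing enabled
        self.optimizing = False  # background optimize currently running
        self._lock = threading.Lock()

    def begin_background_optimize(self):
        with self._lock:
            self.optimizing = True

    def end_background_optimize(self):
        with self._lock:
            self.optimizing = False

    def handle_optimize_command(self):
        # Refuse a manual optimize if the balancer is enabled, or if a
        # background pass is still in flight after being disabled.
        with self._lock:
            if self.active or self.optimizing:
                return -errno.EINVAL, "balancer busy, retry after it is idle"
        return 0, "ok"
```

The second flag covers the window in which the balancer has just been switched off (active = False) but its background thread has not yet finished the optimization it started.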
Prevent optimize and execute commands from running with active balancer
Fixes: https://tracker.ceph.com/issues/42432
Signed-off-by: David Zafman <dzafman@redhat.com>

Add balancer status fields so that slow optimizations can be detected
Signed-off-by: David Zafman <dzafman@redhat.com>
Signed-off-by: David Zafman <dzafman@redhat.com>
I'm not quite sure about this... @dzafman, are you able to confirm whether it's only mgr balancer commands that are affected, or do all mgr commands block (try ceph osd status, for example)? If the latter is true, we're hitting a bigger problem (see https://tracker.ceph.com/issues/37514).
Hi, in my case, long balancer task blocked |
@tserong @mamahtehok This fix only deals with the active balancer background thread. In that case a slow If a user uses the |
Thanks for the explanation. As for the placement of |
* refs/pull/31064/head:
  test: Test balancer module commands
  mgr: Improve balancer module status
  mgr: Release GIL before calling OSDMap::calc_pg_upmaps()

Reviewed-by: Josh Durgin <jdurgin@redhat.com>
Reviewed-by: Kefu Chai <kchai@redhat.com>
Fixes: https://tracker.ceph.com/issues/42432
Signed-off-by: David Zafman <dzafman@redhat.com>
Checklist
Updates documentation if necessary
Includes tests for new functionality or reproducer for bug