
pacific: mgr/ActivePyModules.cc: fix cases where GIL is held while attempting to lock mutex #46302

Merged

Conversation

cfsnyder (Contributor)

The mgr process can deadlock if the GIL is held while attempting to lock a mutex.
Relevant regressions were introduced in commit a356bac. This fixes those regressions
and also cleans up some unnecessary yielding of the GIL.

Fixes: https://tracker.ceph.com/issues/55687
Signed-off-by: Cory Snyder csnyder@iland.com

Checklist

  • Tracker (select at least one)
    • References tracker ticket
    • Very recent bug; references commit where it was introduced
    • New feature (ticket optional)
    • Doc update (no ticket needed)
    • Code cleanup (no ticket needed)
  • Component impact
    • Affects Dashboard, opened tracker ticket
    • Affects Orchestrator, opened tracker ticket
    • No impact that needs to be tracked
  • Documentation (select at least one)
    • Updates relevant documentation
    • No doc update is appropriate
  • Tests (select at least one)

@cfsnyder cfsnyder requested a review from tchaikov May 17, 2022 10:06
@github-actions github-actions bot added this to the pacific milestone May 17, 2022
@neha-ojha neha-ojha changed the title mgr/ActivePyModules.cc: fix cases where GIL is held while attempting to lock mutex pacific: mgr/ActivePyModules.cc: fix cases where GIL is held while attempting to lock mutex May 17, 2022

@epuertat epuertat left a comment


LGTM! In any case, let's declare that this is partially reverting changes from the identified commit (a356bac) while also applying changes from another master commit (1da9885).

Comment on lines -255 to -256:

    without_gil_t no_gil;
    with_gil_t with_gil{no_gil};

Removing the first line reverts the (unintentional) changes made by https://github.com/ceph/ceph/pull/44750/files#diff-50ab66411d9293d402a15e00ed6843a4d37889c616873e69534e609c210f72ecR248, that's perfect!

However, for removing the 2nd line we should declare the commit that introduced those changes in master, which is 1da9885. Some of that commit's changes were already (and, again, unintentionally) introduced by a356bac (#44750), so with this PR the half-baked commit has now been completely backported.

@neha-ojha neha-ojha left a comment


https://pulpito.ceph.com/?branch=wip-lrc-fix-pacific - none of the failures are related

@ljflores (Contributor)

https://pulpito.ceph.com/?branch=wip-lrc-fix-pacific

Failures, unrelated:
1. https://tracker.ceph.com/issues/51858
2. https://tracker.ceph.com/issues/53939
3. https://tracker.ceph.com/issues/53501
4. https://tracker.ceph.com/issues/52321
5. https://tracker.ceph.com/issues/54360
6. https://tracker.ceph.com/issues/54071
7. https://tracker.ceph.com/issues/45318

Details:
1. octopus: rados/test_crash.sh failure - Ceph - RADOS
2. ceph-nfs-upgrade, pacific: Upgrade Paused due to UPGRADE_REDEPLOY_DAEMON: Upgrading daemon osd.0 on host smithi103 failed - Ceph - Orchestrator
3. Exception when running 'rook' task. - Ceph - Orchestrator
4. qa/tasks/rook times out: 'check osd count' reached maximum tries (90) after waiting for 900 seconds - Ceph - Orchestrator
5. Dead job at "Finished running handlers" in rados/cephadm/osds/.../rm-zap-wait - Ceph - Orchestrator
6. rados/cephadm/osds: Invalid command: missing required parameter hostname() - Ceph - Orchestrator
7. octopus: "Health check failed: 2/6 mons down, quorum b,a,c,e (MON_DOWN)" in cluster log running tasks/mon_clock_no_skews.yaml - Ceph - RADOS

@djgalloway djgalloway merged commit 0803e18 into ceph:pacific May 18, 2022
8 checks passed
7 participants