Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

rgw/multisite: Fix deadlock by not holding mutex over coroutine #54505

Merged
merged 1 commit into from Jan 4, 2024

Conversation

adamemerson
Copy link
Contributor

@adamemerson adamemerson commented Nov 14, 2023

Race to create the FIFO, stash it under mutex. Never hold the mutex over a yield, null or otherwise.

Fixes: https://tracker.ceph.com/issues/63373

Contribution Guidelines

  • To sign and title your commits, please refer to Submitting Patches to Ceph.

  • If you are submitting a fix for a stable branch (e.g. "quincy"), please refer to Submitting Patches to Ceph - Backports for the proper workflow.

  • When filling out the below checklist, you may click boxes directly in the GitHub web UI. When entering or editing the entire PR message in the GitHub web UI editor, you may also select a checklist item by adding an x between the brackets: [x]. Spaces and capitalization matter when checking off items this way.

Checklist

  • Tracker (select at least one)
    • References tracker ticket
    • Very recent bug; references commit where it was introduced
    • New feature (ticket optional)
    • Doc update (no ticket needed)
    • Code cleanup (no ticket needed)
  • Component impact
    • Affects Dashboard, opened tracker ticket
    • Affects Orchestrator, opened tracker ticket
    • No impact that needs to be tracked
  • Documentation (select at least one)
    • Updates relevant documentation
    • No doc update is appropriate
  • Tests (select at least one)
Show available Jenkins commands
  • jenkins retest this please
  • jenkins test classic perf
  • jenkins test crimson perf
  • jenkins test signed
  • jenkins test make check
  • jenkins test make check arm64
  • jenkins test submodules
  • jenkins test dashboard
  • jenkins test dashboard cephadm
  • jenkins test api
  • jenkins test docs
  • jenkins render docs
  • jenkins test ceph-volume all
  • jenkins test ceph-volume tox
  • jenkins test windows
  • jenkins test rook e2e

@petrutlucian94
Copy link
Contributor

We applied this patch on Reef but we're still experiencing the deadlock: https://paste.opendev.org/raw/bPUD6MaEBOfQ5qQa9i6y/.

For what is worth, I haven't seen any hanging lazy_init call before or after the patch.

@cbodley
Copy link
Contributor

cbodley commented Nov 15, 2023

@petrutlucian94 thanks a lot for testing. i don't think we've set up a reproducer for this one yet

If the FIFO doesn't exist, let clients race to create it, then stash
and use whoever wins.

Fixes: https://tracker.ceph.com/issues/63373
Signed-off-by: Adam Emerson <aemerson@redhat.com>
@cbodley
Copy link
Contributor

cbodley commented Jan 4, 2024

@cbodley cbodley merged commit 6787c4d into ceph:main Jan 4, 2024
10 of 11 checks passed
@frittentheke
Copy link
Contributor

(this fixes one issue, but doesn't entirely resolve https://tracker.ceph.com/issues/63373)

@cbodley could you elborate what is still broken?

Is the remaining part of this issue tracked anywhere but in https://tracker.ceph.com/issues/63373? I'd really love for this to be resolved with 18.2.2, but apparently https://tracker.ceph.com/issues/63373 is not on the roadmap for 18.2.2 at https://tracker.ceph.com/versions/678.

@cbodley
Copy link
Contributor

cbodley commented Feb 8, 2024

@frittentheke i don't think that roadmap page is accurate, it usually only tracks backports that already merged

in https://tracker.ceph.com/issues/63373#note-20, @mkogan1 bisected this down to a specific commit but nobody has followed up on that yet. if we can't make progress there, we should probably revert all of #49741. cc @adamemerson @mattbenjamin @smanjara

we're also nearing the code freeze for the squid release, and i'd love to have it fixed there too

@cbodley
Copy link
Contributor

cbodley commented Feb 9, 2024

if we can't make progress there, we should probably revert all of #49741. cc @adamemerson @mattbenjamin @smanjara

i opened #55522 which leaves most of #49741 intact, but replaces a single null_yield in RGWDataChangesLog::add_entry() to make it all synchronous

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
4 participants