quincy: librbd: Fix local rbd mirror journals growing forever #50159

Merged 3 commits on Mar 6, 2023


idryomov
Contributor

@github-actions

github-actions bot commented Mar 3, 2023

This pull request can no longer be automatically merged: a rebase is needed and changes have to be manually resolved

isodude and others added 3 commits March 3, 2023 19:12
This commit fixes commit 7ca1bab by pushing properly aligned
discards back to m_image_extents, if corrected.

If discards are misaligned (off 0, len 4608, gran=4096), they are
corrected properly, but only in object_extents and not in
m_image_extents.
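
To make the numbers in this example concrete, here is a minimal sketch of the alignment arithmetic, assuming the correction rounds the start of the discard up and its end down to a multiple of the granularity. `align_discard` is a hypothetical helper written for illustration, not a librbd function.

```cpp
#include <cstdint>
#include <optional>
#include <utility>

// Hypothetical helper (not librbd code): shrink [off, off + len) to the
// largest sub-range whose start and end are multiples of gran.
// Returns {offset, length}, or nullopt if nothing aligned remains.
std::optional<std::pair<uint64_t, uint64_t>> align_discard(
    uint64_t off, uint64_t len, uint64_t gran) {
  if (gran == 0) {
    // Assumption: a granularity of 0 means "no alignment requested".
    return std::make_pair(off, len);
  }
  uint64_t start = (off + gran - 1) / gran * gran;  // round start up
  uint64_t end = (off + len) / gran * gran;         // round end down
  if (end <= start) {
    return std::nullopt;  // too small to cover one aligned block
  }
  return std::make_pair(start, end - start);
}
```

With the values above, `align_discard(0, 4608, 4096)` yields offset 0 and length 4096, so an extent that still records length 4608 no longer matches what was actually discarded.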

When journal_append_event is triggered, it appends only from
m_image_extents and does not know about the alignment fixes.
commit_io_events_extent then logs a message and returns without
completing the I/O, since the larger misaligned area was sent to the
journal. This in turn breaks rbd journal mirroring, since the local
client waits indefinitely for the commit to complete, which never
happens.

This does not affect rbd-mirror in any way, which may be confusing and
dangerous since it's only rbd-mirror that updates ceph health, and not
the local client.

Setting `rbd_skip_partial_discard = false` under client will restore the
pre-7ca1bab behaviour and thus not trigger the bug with journals growing.
This will set `rbd_discard_granularity_bytes = 0` internally. The
setting only takes effect when a client starts.
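
As a sketch, the workaround might look like this in ceph.conf, assuming the option is placed in the `[client]` section as described above. Running clients would need to be restarted to pick it up.

```ini
[client]
# Workaround: restores the pre-7ca1bab discard behaviour
rbd_skip_partial_discard = false
```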

Fixes: 7ca1bab
Fixes: https://tracker.ceph.com/issues/57396
Signed-off-by: Josef Johansson <josef@oderland.se>
(cherry picked from commit 21a26a7)

Conflicts:
	src/librbd/io/ImageRequest.cc [ commit b2c8882 ("librbd:
	  return area from extents_to_file()") not in quincy ]
	src/test/librbd/io/test_mock_ImageRequest.cc [ commit
	  b9a2384 ("librbd: propagate area down to
	  file_to_extents()") not in quincy ]

Currently nothing triggers the length_modified case in
ImageDiscardRequest::prune_object_extents() in isolation. It's only
triggered in DiscardGranularityJournalAppendEnabled test together with
the prune_required case and a bad refactoring could easily break the
length_modified logic again.

Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
(cherry picked from commit 34e59c4)

Conflicts:
	src/test/librbd/io/test_mock_ImageRequest.cc [ commit
	  b9a2384 ("librbd: propagate area down to
	  file_to_extents()") not in quincy ]

"rbd feature disable" appears to reliably hang if the corresponding
remote request is proxied to rbd-nbd (because rbd-nbd happens to own
the exclusive lock after a series of blkdiscard calls) [1].  Work
around it here by enabling journaling before the image is mapped
and disabling it after the image is unmapped.

Also, don't assert on the output of "rbd journal inspect --verbose"
having a certain number of entries.  This is racy: if the script gets
delayed after the last blkdiscard call for some reason, there may be
fewer entries present in the journal or none at all.

[1] https://tracker.ceph.com/issues/58740

Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
(cherry picked from commit fcfef0a)
@idryomov
Contributor Author

idryomov commented Mar 3, 2023

Rebased to resolve a trivial context conflict in qa/workunits/rbd/rbd-nbd.sh that arose because #50292 got merged first.

@ljflores
Contributor

ljflores commented Mar 3, 2023

Rados suite review: https://pulpito.ceph.com/?branch=wip-yuri4-testing-2023-02-22-0817-quincy

Failures, unrelated:
1. https://tracker.ceph.com/issues/58585
2. https://tracker.ceph.com/issues/58146
3. https://tracker.ceph.com/issues/58837 -- new tracker
4. https://tracker.ceph.com/issues/58915 -- new tracker
5. https://tracker.ceph.com/issues/54750

Details:
1. rook: failed to pull kubelet image - Ceph - Orchestrator
2. test_cephadm.sh: Error: Error initializing source docker://quay.ceph.io/ceph-ci/ceph:master - Ceph - Orchestrator
3. mgr/test_progress.py: test_osd_healthy_recovery fails after timeout - Ceph - RADOS
4. map eXX had wrong heartbeat front addr - Ceph - RADOS
5. crash: PeeringState::Crashed::Crashed(boost::statechart::state<PeeringState::Crashed, PeeringState::PeeringMachine>::my_context): abort - Ceph - RADOS

@yuriw yuriw merged commit 6c82e69 into ceph:quincy Mar 6, 2023
@idryomov idryomov deleted the wip-57396-quincy branch March 7, 2023 09:54

5 participants