librbd: Fix local rbd mirror journals growing forever #49614
98282b9
to
f67ebe1
Compare
jenkins test make check |
This should be backported to nautilus+. I think I have an idea for tests. |
When journal_append_event is triggered it will only append from m_image_extents and does not know about the alignment fixes. In commit_io_events_extent it will throw an error that there are pending io since the larger misaligned area was sent to the journal.
Hi Josef,
What do you mean by "it will throw an error"? Doesn't commit_io_event_extent
just bail?
if (!event.pending_extents.empty()) {
ldout(cct, 20) << this << " " << __func__ << ": "
<< "pending extents: " << event.pending_extents << dendl;
return;
}
complete_event(it, event.ret_val);
This will in turn break rbd journal mirroring since the local client will wait
indefinitely on the commit to be completed, which it never does.
Am I understanding correctly that the root cause has nothing to do with the consumer/replayer side (rbd-mirror)? The issue is that the journal entry doesn't get "fulfilled" and therefore marked as committed on the producer side?
Nautilus is EOL. I'll adjust the tracker ticket to make sure the fix gets backported to supported releases.
I'd look in the direction of replicating |
True, bad wording on my part. I regard the ldout as an error in this case, it should be worded as "logs a warning and returns without completing the commit".
Yes.
complete_event is never run on the event. I read that as meaning the journal entry is not fulfilled and not marked as committed on the producer side. |
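To make the failure mode discussed above concrete, here is a minimal sketch of the bookkeeping. It is an illustration under stated assumptions, not librbd code: `PendingEvent` and `commit_extent` are hypothetical names, and pending bytes are tracked one by one for clarity, whereas the real code tracks extents. The early return has the same shape as the quoted `commit_io_event_extent` snippet.

```cpp
#include <cassert>
#include <cstdint>
#include <set>

// Hypothetical stand-in for a journal I/O event (not a librbd type).
struct PendingEvent {
  std::set<uint64_t> pending;  // bytes journaled but not yet committed
  bool completed = false;

  // Record the byte range that was appended to the journal.
  PendingEvent(uint64_t offset, uint64_t length) {
    assert(length > 0);
    for (uint64_t b = offset; b < offset + length; ++b) {
      pending.insert(b);
    }
  }

  // Clear the committed range, then return early while anything is
  // still pending, as in the quoted snippet.
  void commit_extent(uint64_t offset, uint64_t length) {
    for (uint64_t b = offset; b < offset + length; ++b) {
      pending.erase(b);
    }
    if (!pending.empty()) {
      return;  // the real code only emits a debug log line here
    }
    completed = true;
  }
};
```

If the journal records the unaligned range (offset 0, length 4608) but only the aligned range (offset 0, length 4096) is ever committed, 512 bytes stay pending and `completed` never becomes true. Journaling the aligned range instead lets the same commit complete the event.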
OK. I guess there's no problem in supplying a patch for others to apply, then (for those running Nautilus)? This is a pretty major problem for those running journaling, since in the end you might lose data because of this bug. Case in point: my test VM that I ran the compilation on had the bug, and I did not want to wait 24h+ for the journaling to complete, so I ran kill -9 on the KVM process. The resulting filesystem could not be rescued and I had to wipe the disk and start over.
Thanks, I'll take a look. |
7fe3246
to
e464a96
Compare
It's not even a warning -- just an informational debug message. It's expected when an image extent spans an object boundary. |
So far I don't see how one might lose data due to this bug. It only affects discards and the failure mode should be basically just a hang. AFAIU nothing bad is written to the image. Am I mistaken?
Did you encounter the bad image on the producer side or on the consumer side? You seemed to confirm above that rbd-mirror isn't in the picture but now mention |
I don't think it's possible to produce a new nautilus release upstream at this point -- all the infrastructure has moved on. Given that there appears to be a straightforward workaround (setting |
You are not mistaken. Nothing bad is written.
On the producer side, the consumer side is fine.
Initially, when a client opens an image, it will run JournalPlayer, which checks for journal entries that are not committed to the rbd image. I will try my best to describe the scenario again.
I actually waited out a dev VM that had this bug, just to see how long it would take; 24h later it was booted :) I realize that Proxmox should detect this scenario and opt for fixing the journal prior to running the QEMU process, and I'm working on a patch for that. Ideally this would be a Ceph health error. |
I'm trying out the below, but I always fail to access the protected m_image_extents. How do I do that properly?
diff --git a/src/test/librbd/io/test_mock_ImageRequest.cc b/src/test/librbd/io/test_mock_ImageRequest.cc
index e2ac825630c..23cf0e49a80 100644
--- a/src/test/librbd/io/test_mock_ImageRequest.cc
+++ b/src/test/librbd/io/test_mock_ImageRequest.cc
@@ -57,7 +57,6 @@ inline ImageCtx *get_image_ctx(MockTestImageCtx *image_ctx) {
namespace librbd {
namespace io {
-
namespace util {
template <>
@@ -83,6 +82,14 @@ std::pair<Extents, ImageArea> object_to_area_extents(
} // namespace util
+template <>
+class ImageDiscardRequest<librbd::MockTestImageCtx> {
+ public:
+ Extents get_image_extents() {
+ return this->m_image_extents;
+ }
+};
+
using ::testing::_;
using ::testing::InSequence;
using ::testing::Invoke;
@@ -96,6 +103,7 @@ struct TestMockIoImageRequest : public TestMockFixture {
typedef ImageReadRequest<librbd::MockTestImageCtx> MockImageReadRequest;
typedef ImageWriteRequest<librbd::MockTestImageCtx> MockImageWriteRequest;
typedef ImageDiscardRequest<librbd::MockTestImageCtx> MockImageDiscardRequest;
+
typedef ImageFlushRequest<librbd::MockTestImageCtx> MockImageFlushRequest;
typedef ImageWriteSameRequest<librbd::MockTestImageCtx> MockImageWriteSameRequest;
typedef ImageCompareAndWriteRequest<librbd::MockTestImageCtx> MockImageCompareAndWriteRequest;
@@ -367,6 +375,44 @@ TEST_F(TestMockIoImageRequest, DiscardGranularity) {
ASSERT_EQ(0, aio_comp_ctx.wait());
}
+TEST_F(TestMockIoImageRequest, DiscardGranularityJournalAppendDisabled) {
+ REQUIRE_FEATURE(RBD_FEATURE_JOURNALING);
+
+ librbd::ImageCtx *ictx;
+ ASSERT_EQ(0, open_image(m_image_name, &ictx));
+ ASSERT_EQ(0, resize(ictx, ictx->layout.object_size));
+ ictx->discard_granularity_bytes = 32;
+
+ MockTestImageCtx mock_image_ctx(*ictx);
+ MockTestJournal mock_journal;
+ mock_image_ctx.journal = &mock_journal;
+
+ InSequence seq;
+ expect_get_modify_timestamp(mock_image_ctx, false);
+ expect_is_journal_appending(mock_journal, false);
+ Extents extents = {{32,32}, {96,64}, {ictx->layout.object_size - 32, 32}};
+ for (auto extent : extents) {
+ expect_object_discard_request(mock_image_ctx, 0, extent.first, extent.second, 0);
+ }
+
+ C_SaferCond aio_comp_ctx;
+ AioCompletion *aio_comp = AioCompletion::create_and_start(
+ &aio_comp_ctx, ictx, AIO_TYPE_DISCARD);
+ MockImageDiscardRequest mock_aio_image_discard(
+ mock_image_ctx, aio_comp,
+ {{16, 63}, {96, 31}, {84, 100}, {ictx->layout.object_size - 33, 33}},
+ ImageArea::DATA, ictx->discard_granularity_bytes,
+ mock_image_ctx.get_data_io_context(), {});
+ {
+ std::shared_lock owner_locker{mock_image_ctx.owner_lock};
+ mock_aio_image_discard.send();
+ }
+ ASSERT_EQ(0, aio_comp_ctx.wait());
+ ASSERT_EQ(extents, mock_aio_image_discard.get_image_extents());
+
+ ASSERT_EQ(0, mock_journal.is_journal_ready());
+}
+
TEST_F(TestMockIoImageRequest, AioWriteJournalAppendDisabled) {
REQUIRE_FEATURE(RBD_FEATURE_JOURNALING); |
Another workaround could be to set |
I think you could add
On the other hand, I am not sure we have to check the image extents in the test. What we want to test is that committing the journal works as expected with a specific request, right? |
I know there are mocks for this in mock_test_journal, so maybe have it there as well. I wanted a check directly in ImageRequest that does not use the journal. Maybe even just extend the granularity checks. Since the file does not have any good journal support, I think it would make sense to avoid it. Also, mock_test_journal does not seem to have a way to look at each extent in the journal, only the length. I'm looking at the support structure there as well. |
This is in essence what rbd_skip_partial_discard does. Deprecating that one would make sense though, maybe with an info message saying to set rbd_discard_granularity instead. |
I think @idryomov's idea was just to take the |
The code seems to be working. However, when running unittest_librbd with RBD_FEATURES=125 (to enable journaling in features), tests with journaling stop working. How should I run the unit tests from the command line successfully? I tried to understand how the unit tests are run, but it's not entirely clear.
diff --git a/src/test/librbd/io/test_mock_ImageRequest.cc b/src/test/librbd/io/test_mock_ImageRequest.cc
index e2ac825630c..2764c379994 100644
--- a/src/test/librbd/io/test_mock_ImageRequest.cc
+++ b/src/test/librbd/io/test_mock_ImageRequest.cc
@@ -18,6 +18,7 @@ struct MockTestImageCtx;
struct MockTestJournal : public MockJournal {
MOCK_METHOD4(append_write_event, uint64_t(uint64_t, size_t,
const bufferlist &, bool));
+
MOCK_METHOD5(append_compare_and_write_event, uint64_t(uint64_t, size_t,
const bufferlist &,
const bufferlist &,
@@ -32,6 +33,12 @@ struct MockTestJournal : public MockJournal {
filter_ret_val);
}
+ bool appending = false;
+
+ bool is_journal_appending() const {
+ return appending;
+ }
+
MOCK_METHOD2(commit_io_event, void(uint64_t, int));
};
@@ -57,7 +64,6 @@ inline ImageCtx *get_image_ctx(MockTestImageCtx *image_ctx) {
namespace librbd {
namespace io {
namespace util {
template <>
@@ -84,10 +90,12 @@ std::pair<Extents, ImageArea> object_to_area_extents(
} // namespace util
using ::testing::_;
+using ::testing::DoAll;
using ::testing::InSequence;
using ::testing::Invoke;
using ::testing::Return;
using ::testing::WithArg;
+using ::testing::WithArgs;
using ::testing::WithoutArgs;
using ::testing::Exactly;
@@ -119,6 +127,19 @@ struct TestMockIoImageRequest : public TestMockFixture {
}
}
+ void expect_append_io_event(MockTestJournal &mock_journal, uint64_t journal_tid,
+ uint64_t offset,
+ size_t length) {
+ EXPECT_CALL(mock_journal, append_io_event_mock(_, _, _, _, _)).WillOnce(DoAll(
+ WithArgs<1, 2>(Invoke([offset, length] (long unsigned int entry_offset, long unsigned int entry_length) {
+
+ ASSERT_EQ(offset, entry_offset);
+ ASSERT_EQ(length, entry_length);
+ })),
+ Return(journal_tid)
+ ));
+ }
+
void expect_object_discard_request(MockTestImageCtx &mock_image_ctx,
uint64_t object_no, uint64_t offset,
uint32_t length, int r) {
@@ -367,6 +388,46 @@ TEST_F(TestMockIoImageRequest, DiscardGranularity) {
ASSERT_EQ(0, aio_comp_ctx.wait());
}
+TEST_F(TestMockIoImageRequest, DiscardGranularityJournalAppendEnabled) {
+ REQUIRE_FEATURE(RBD_FEATURE_JOURNALING);
+
+ librbd::ImageCtx *ictx;
+ ASSERT_EQ(0, open_image(m_image_name, &ictx));
+ ASSERT_EQ(0, resize(ictx, ictx->layout.object_size));
+ ictx->discard_granularity_bytes = 32;
+
+ MockTestImageCtx mock_image_ctx(*ictx);
+ MockTestJournal mock_journal;
+ mock_image_ctx.journal = &mock_journal;
+ mock_journal.appending = true;
+
+ InSequence seq;
+ expect_get_modify_timestamp(mock_image_ctx, false);
+ expect_is_journal_appending(mock_journal, true);
+ expect_object_discard_request(mock_image_ctx, 0, 32, 32, 0);
+ expect_append_io_event(mock_journal, 0, 32, 32);
+ expect_object_discard_request(mock_image_ctx, 0, 96, 64, 0);
+ expect_append_io_event(mock_journal, 1, 96, 64);
+ expect_object_discard_request(mock_image_ctx, 0, ictx->layout.object_size - 32, 32, 0);
+ expect_append_io_event(mock_journal, 2, ictx->layout.object_size - 32, 32);
+
+ C_SaferCond aio_comp_ctx;
+ AioCompletion *aio_comp = AioCompletion::create_and_start(
+ &aio_comp_ctx, ictx, AIO_TYPE_DISCARD);
+ MockImageDiscardRequest mock_aio_image_discard(
+ mock_image_ctx, aio_comp,
+ {{16, 63}, {96, 31}, {84, 100}, {ictx->layout.object_size - 33, 33}},
+ ImageArea::DATA, ictx->discard_granularity_bytes,
+ mock_image_ctx.get_data_io_context(), {});
+ {
+ std::shared_lock owner_locker{mock_image_ctx.owner_lock};
+ mock_aio_image_discard.send();
+ }
+ ASSERT_EQ(0, aio_comp_ctx.wait());
+}
+
TEST_F(TestMockIoImageRequest, AioWriteJournalAppendDisabled) {
REQUIRE_FEATURE(RBD_FEATURE_JOURNALING); |
I recommend just to add your test to your commit. It will be easier to comment on it.
Why do you need this? Doesn't
With this the test is run only when
Not sure I understand the question. Do you mean how to run a particular unit test with particular
Note, I added |
e464a96
to
a4c756d
Compare
Thanks! I have successfully compiled and tested the new commit. Ready for review. One thing that I could not properly manage to get rid of is that the test stalls if it's not getting the expected calls, maybe this is expected? There's a way to end-to-end test this by simply running blkdiscard on a nbd-device. Not sure if there's support for that in the infrastructure though? To enable appending I mocked journal_policy inside ImageCtx, this way it would return true, but also other tests would not fail. |
Before I modify the Title, I'm suggesting to change it to something along the line of To make it easier for folks reading the changelogs to see that the issue is solved. |
8b6abde
to
3e36669
Compare
I think you could add your test to |
I agree, a test based on #49614 (comment) should go to |
3e36669
to
8a2ba3d
Compare
Done. With the rbd-nbd.sh support structure it was quite easy to adapt the test. I run my tests inside a podman container running Fedora 37, with Debian 11 as the host OS. rbd-nbd list-mapped does not work there, since the pid in /sys/.../nbd0/pid is the pid in the default pid namespace (the host). I worked around that with the below, but it only got my test working. I'm not sure whether this is a bug or a case the Linux kernel does not support (the kernel does a task_pid_nr(current) for pid and does not care about namespaces at all).
@@ -105,8 +106,19 @@ function get_pid()
local pool=$1
local ns=$2
- PID=$(rbd device --device-type nbd --format xml list | $XMLSTARLET sel -t -v \
- "//devices/device[pool='${pool}'][namespace='${ns}'][image='${IMAGE}'][device='${DEV}']/id")
+
+
+ for pid in `pgrep -f rbd-nbd`; do
+ lsof -p ${pid} | grep -q ${DEV} || continue
+ arg=`< /proc/${pid}/cmdline tr '\0' '\n' | tail -n 1`
+ [ "${arg%%/*}" == "${pool}" ] || continue
+ [ "${arg##*/}" == "${IMAGE}" ] || continue
+ if [ -n "${ns}" ]; then
+ arg=${arg#*/}
+ [ "${arg%%/*}" == "${ns}" ] || continue
+ fi
+ PID=${pid}
+ done |
8a2ba3d
to
d09bb42
Compare
Is this a valid failure? Maybe it's racy, since it succeeded earlier and failed even though I did not touch that code in the last commit.
|
d09bb42
to
7c9b39a
Compare
I was wrong in the commit message.
|
I created new trackers for related problems |
83572d0
to
d3c83d7
Compare
This commit fixes commit 7ca1bab by pushing properly aligned discards back to m_image_extents, if corrected.
If discards are misaligned (off 0, len 4608, gran=4096), they are corrected properly, but only in object_extents and not in m_image_extents.
When journal_append_event is triggered it will only append from m_image_extents and does not know about the alignment fixes. In commit_io_events_extent it will log a message and return without completing the io since the larger misaligned area was sent to the journal. This will in turn break rbd journal mirroring since the local client will wait indefinitely on the commit to be completed, which it never does.
This does not affect rbd-mirror in any way, which may be confusing and dangerous since it's only rbd-mirror that updates ceph health, and not the local client.
Setting `rbd_skip_partial_discard = false` under client will restore the pre-7ca1bab behaviour and thus not trigger the bug with journals growing. This will set `rbd_discard_granularity_bytes = 0` internally. This setting is only changed during startup of a client.
Fixes: 7ca1bab
Fixes: https://tracker.ceph.com/issues/57396
Signed-off-by: Josef Johansson <josef@oderland.se>
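The alignment arithmetic behind the (off 0, len 4608, gran=4096) example can be sketched as follows. This is a simplified illustration, not the librbd implementation: `align_discard` is a hypothetical helper, and it assumes that the start is rounded up and the end rounded down to the granularity, which is consistent with the extents expected by the DiscardGranularity tests in this thread.

```cpp
#include <cassert>
#include <cstdint>
#include <optional>
#include <utility>

// (offset, length) pair, written the same way as extents in the tests.
using Extent = std::pair<uint64_t, uint64_t>;

// Hypothetical helper (not a librbd function): shrink a discard to the
// largest granularity-aligned range it fully covers. Returns std::nullopt
// when the discard does not cover a single whole granule.
std::optional<Extent> align_discard(Extent e, uint64_t granularity) {
  assert(granularity > 0);
  // Round the start up and the end down to a multiple of the granularity.
  uint64_t start = (e.first + granularity - 1) / granularity * granularity;
  uint64_t end = (e.first + e.second) / granularity * granularity;
  if (start >= end) {
    return std::nullopt;  // nothing aligned is left to discard
  }
  return Extent{start, end - start};
}
```

With a granularity of 4096, the extent (0, 4608) shrinks to (0, 4096). With a granularity of 32, (16, 63) becomes (32, 32), (84, 100) becomes (96, 64), and (96, 31) is dropped entirely. The bug described in the commit message is that the journal was given the original extent while the objects received the shrunken one.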
d3c83d7
to
21a26a7
Compare
Rebase against main also. |
Apologies for the delay, Josef! I hit the hang (rbd feature disable
issue) but attributed it to the on-going lab breakage initially...
A few things need to be addressed. If you no longer have the development setup, let me know -- I'd be happy to incorporate the changes and move this PR forward on your behalf. Thanks!
LGTM Do go ahead. Let's close this, I might find more bugs if I recommit :) Thanks for reviewing! I'll hand you my bugswatter! |
Currently nothing triggers the length_modified case in ImageDiscardRequest::prune_object_extents() in isolation. It's only triggered in the DiscardGranularityJournalAppendEnabled test together with the prune_required case, and a bad refactoring could easily break the length_modified logic again.
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
"rbd feature disable" appears to reliably hang if the corresponding remote request is proxied to rbd-nbd (because rbd-nbd happens to own the exclusive lock after a series of blkdiscard calls) [1]. Work around it here by enabling journaling before the image is mapped and disabling it after the image is unmapped.
Also, don't assert on the output of "rbd journal inspect --verbose" having a certain number of entries. This is racy: if the script gets delayed after the last blkdiscard call for some reason, there may be fewer entries present in the journal or none at all.
[1] https://tracker.ceph.com/issues/58740
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
jenkins test make check |
This commit fixes commit 7ca1bab by pushing properly aligned discards back to m_image_extents, if corrected.
If discards are misaligned (off 0, len 4608, gran=4096), they are corrected properly, but only in object_extents and not in m_image_extents.
When journal_append_event is triggered it will only append from m_image_extents and does not know about the alignment fixes. In commit_io_events_extent it will log a message and return without completing the io, since the larger misaligned area was sent to the journal. This will in turn break rbd journal mirroring since the local client will wait indefinitely on the commit to be completed, which it never does.
Fixes: 7ca1bab
Fixes: https://tracker.ceph.com/issues/57396
Signed-off-by: Josef Johansson <josef@oderland.se>