
mds/quiesce: let clients keep their buffered writes for a quiesced file #56755

Merged
merged 2 commits into main from wip-lusov-quiesce-xlock on Apr 17, 2024

Conversation

leonid-s-usov
Contributor

The analysis of https://bugzilla.redhat.com/show_bug.cgi?id=2273935 has shown that a client may build up a significant backlog of buffered writes to flush.

With the quiesce protocol taking a rdlock on the file, it also revokes the Fb capability, which the clients can't release until they are done flushing, and that may take arbitrarily long: evidently, more than 10 minutes.

We went for the rdlock to avoid affecting readonly clients, but given the evidence above we should not optimize for those. Ideally, we'd like to have a QUIESCE file lock mode where both rd and buffer are allowed, but as of now our best available option is to xlock the file, which lets the writing clients keep their buffers for the duration of the quiesce.

We can only afford this change for a splitauth config, i.e. where we drop the lock immediately after all Fw caps are revoked.
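
As a standalone illustration of the splitauth rule above (this is not Ceph code; the ClientCaps model and the can_drop_quiesce_lock helper are made up for the example), the sketch below shows the point of the change: the quiesce lock only needs to be held until every Fw is revoked, and Fb is never requested back, so writers can keep buffering.

// Standalone sketch (not Ceph code) of the splitauth rule described above:
// the quiesce lock on the file is only held until every client's Fw has
// been revoked; Fb is never requested back, so writers keep their buffers.
#include <cstdio>
#include <vector>

struct ClientCaps { bool Fw; bool Fb; };  // illustrative model, not Ceph's cap bits

// Returns true once the quiesce lock can be dropped under splitauth:
// all Fw caps revoked, regardless of outstanding Fb (buffered data).
static bool can_drop_quiesce_lock(const std::vector<ClientCaps>& clients) {
  for (const auto& c : clients)
    if (c.Fw) return false;  // still waiting for a writer to stop writing
  return true;               // Fb may still be held -- that's the point
}

int main() {
  std::vector<ClientCaps> clients = {{/*Fw=*/true,  /*Fb=*/true},
                                     {/*Fw=*/false, /*Fb=*/true}};
  printf("drop now? %s\n", can_drop_quiesce_lock(clients) ? "yes" : "no");
  clients[0].Fw = false;  // the writer acks the Fw revoke without flushing
  printf("drop now? %s\n", can_drop_quiesce_lock(clients) ? "yes" : "no");
}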


@github-actions github-actions bot added the cephfs Ceph File System label Apr 8, 2024
@leonid-s-usov
Contributor Author

I wasn't able to reproduce the scenario on the vossi server with ceph-fuse: the buffers were flushed quite fast.
However, I did verify that with this change the files now keep Fb during the quiesce:

2024-04-09T19:26:16.145+0000 7f3ac6ffd700  5 client.4250 handle_cap_grant on in 0x100000201a2 mds.0 seq 11 caps now pAsxLsXsxFsxcrwb was pAsxLsXsxFsxcrwb
2024-04-09T19:26:16.145+0000 7f3ac6ffd700  5 client.4250 handle_cap_grant on in 0x100000201a0 mds.0 seq 11 caps now pAsLsXsFcb was pAsxLsXsxFsxcrwb
2024-04-09T19:26:16.145+0000 7f3ac6ffd700  5 client.4250 handle_cap_grant on in 0x1000002019a mds.0 seq 11 caps now pAsLsXsFcb was pAsxLsXsxFsxcrwb
2024-04-09T19:26:16.145+0000 7f3ac6ffd700  5 client.4250 handle_cap_grant on in 0x100000201a1 mds.0 seq 12 caps now pAsLsXsFcb was pAsxLsXsxFsxcrwb
2024-04-09T19:26:16.145+0000 7f3ac6ffd700  5 client.4250 handle_cap_grant on in 0x100000201a3 mds.0 seq 11 caps now pAsLsXsFcb was pAsxLsXsxFsxcrwb
2024-04-09T19:26:16.145+0000 7f3ac6ffd700  5 client.4250 handle_cap_grant on in 0x100000201a4 mds.0 seq 11 caps now pAsLsXsFcb was pAsxLsXsxFsxcrwb
2024-04-09T19:26:16.145+0000 7f3ac6ffd700  5 client.4250 handle_cap_grant on in 0x10000020195 mds.0 seq 11 caps now pAsLsXsFcb was pAsxLsXsxFsxcrwb
2024-04-09T19:26:16.145+0000 7f3ac6ffd700  5 client.4250 handle_cap_grant on in 0x1000002019e mds.0 seq 12 caps now pAsLsXsFcb was pAsxLsXsxFsxcrwb
2024-04-09T19:26:16.145+0000 7f3ac6ffd700  5 client.4250 handle_cap_grant on in 0x10000020199 mds.0 seq 12 caps now pAsLsXsFcb was pAsxLsXsxFsxcrwb
2024-04-09T19:26:16.146+0000 7f3ac6ffd700  5 client.4250 handle_cap_grant on in 0x1000002019b mds.0 seq 12 caps now pAsLsXsFcb was pAsxLsXsxFsxcrwb
2024-04-09T19:26:16.146+0000 7f3ac6ffd700  5 client.4250 handle_cap_grant on in 0x1000002019c mds.0 seq 12 caps now pAsLsXsFcb was pAsxLsXsxFsxcrwb
2024-04-09T19:26:16.148+0000 7f3ac6ffd700  5 client.4250 handle_cap_grant on in 0x100000201a2 mds.0 seq 12 caps now pAsLsXsFcb was pAsxLsXsxFsxcrwb
2024-04-09T19:26:16.148+0000 7f3ac6ffd700  5 client.4250 handle_cap_grant on in 0x10000020196 mds.0 seq 11 caps now pAsLsXsFcb was pAsxLsXsxFsxcrwb
2024-04-09T19:26:16.148+0000 7f3ac6ffd700  5 client.4250 handle_cap_grant on in 0x10000020198 mds.0 seq 12 caps now pAsLsXsFcb was pAsxLsXsxFsxcrwb
2024-04-09T19:26:16.148+0000 7f3ac6ffd700  5 client.4250 handle_cap_grant on in 0x10000020197 mds.0 seq 11 caps now pAsLsXsFcb was pAsxLsXsxFsxcrwb
2024-04-09T19:26:16.148+0000 7f3ac6ffd700  5 client.4250 handle_cap_grant on in 0x1000002019d mds.0 seq 13 caps now pAsLsXsFcb was pAsxLsXsxFsxcrwb
2024-04-09T19:26:16.148+0000 7f3ac6ffd700  5 client.4250 handle_cap_grant on in 0x1000002019f mds.0 seq 12 caps now pAsLsXsFcb was pAsxLsXsxFsxcrwb
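
To read the grant lines above: in a cap string the letters after F are the file caps, with s=shared, x=excl, c=cache, r=read, w=write, b=buffer, so the clients go from pAsxLsXsxFsxcrwb to pAsLsXsFcb, i.e. they lose Fsxrw but keep Fc and Fb and can hold on to their buffered writes. A small standalone decoder of that shorthand (illustrative only, not the client's own parser):

// Standalone helper (not Ceph's parser): explain the file-cap letters that
// appear after 'F' in a cap string such as "pAsLsXsFcb".
#include <cstdio>
#include <string>

static void explain_file_caps(const std::string& caps) {
  auto pos = caps.find('F');
  if (pos == std::string::npos) { printf("no file caps\n"); return; }
  printf("%s -> file caps:", caps.c_str());
  for (char c : caps.substr(pos + 1)) {
    switch (c) {
      case 's': printf(" SHARED"); break;
      case 'x': printf(" EXCL");   break;
      case 'c': printf(" CACHE");  break;
      case 'r': printf(" READ");   break;
      case 'w': printf(" WRITE");  break;
      case 'b': printf(" BUFFER"); break;
      default:  printf(" ?%c", c); break;
    }
  }
  printf("\n");
}

int main() {
  explain_file_caps("pAsxLsXsxFsxcrwb");  // before: full writer caps
  explain_file_caps("pAsLsXsFcb");        // after: only CACHE and BUFFER remain
}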

batrick
batrick previously approved these changes Apr 10, 2024
@leonid-s-usov
Contributor Author

@batrick this shouldn't be merged yet: I haven't tested it with multiple ranks :(

/builddir/build/BUILD/ceph-18.2.1/src/mds/Locker.cc: 2077: FAILED ceph_assert(lock->get_sm()->can_remote_xlock)

 ceph version 18.2.1-136.el9cp (e7edde2b655d0dd9f860dda675f9d7954f07e6e3) reef (stable)
 1: (ceph::__ceph_assert_fail(char const*, char const*, int, char const*)+0x12e) [0x7f6a0646700e]
 2: /usr/lib64/ceph/libceph-common.so.2(+0x1631cc) [0x7f6a064671cc]
 3: (Locker::xlock_start(SimpleLock*, boost::intrusive_ptr<MDRequestImpl> const&)+0xa12) [0x556a16650472]
 4: (Locker::acquire_locks(boost::intrusive_ptr<MDRequestImpl> const&, MutationImpl::LockOpVec&, CInode*, std::set<MDSCacheObject*, std::less<MDSCacheObject*>, std::allocator<MDSCacheObject*> >, bool, bool)+0x2a11) [0x556a166459e1]
 5: (MDCache::dispatch_quiesce_inode(boost::intrusive_ptr<MDRequestImpl> const&)+0x942) [0x556a16621c72]
 6: (Server::handle_peer_auth_pin_ack(boost::intrusive_ptr<MDRequestImpl> const&, boost::intrusive_ptr<MMDSPeerRequest const> const&)+0x822) [0x556a16519602]
 7: (Server::handle_peer_request_reply(boost::intrusive_ptr<MMDSPeerRequest const> const&)+0x6cc) [0x556a16519efc]
 8: (Server::dispatch(boost::intrusive_ptr<Message const> const&)+0x12c) [0x556a1650679c]
 9: (MDSRank::_dispatch(boost::intrusive_ptr<Message const> const&, bool)+0x4fb) [0x556a164bccfb]
 10: (MDSRankDispatcher::ms_dispatch(boost::intrusive_ptr<Message const> const&)+0x5c) [0x556a164bd1fc]
 11: (MDSDaemon::ms_dispatch2(boost::intrusive_ptr<Message> const&)+0x196) [0x556a164a09b6]
 12: (DispatchQueue::entry()+0x542) [0x7f6a06666c02]
 13: /usr/lib64/ceph/libceph-common.so.2(+0x3fab31) [0x7f6a066feb31]
 14: /lib64/libc.so.6(+0x9f802) [0x7f6a05e35802]
 15: /lib64/libc.so.6(+0x3f450) [0x7f6a05dd5450]
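
For context on the assert: the backtrace shows a rank that is not auth for the inode (it had to auth-pin it on a peer) trying to start an xlock, which takes the remote-xlock path, and the filelock's state machine evidently does not allow remote xlocks, hence the ceph_assert. A standalone sketch of that failure mode (not Ceph code; the flag value is illustrative):

// Standalone sketch (not Ceph code) of the failure mode in the backtrace:
// a rank that is not auth for the inode must go through a remote xlock,
// which the lock's state machine may not support.
#include <cassert>

struct StateMachine { bool can_remote_xlock; };
struct Lock {
  const StateMachine* sm;
  const StateMachine* get_sm() const { return sm; }
};

static void xlock_start(const Lock* lock, bool i_am_auth) {
  if (i_am_auth)
    return;  // local xlock: fine
  // remote xlock path -- mirrors the failed assertion in Locker::xlock_start
  assert(lock->get_sm()->can_remote_xlock);
}

int main() {
  const StateMachine filelock_sm{/*can_remote_xlock=*/false};  // illustrative value
  Lock filelock{&filelock_sm};
  xlock_start(&filelock, /*i_am_auth=*/true);   // ok
  xlock_start(&filelock, /*i_am_auth=*/false);  // aborts, like the crash above
}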

@batrick
Member

batrick commented Apr 12, 2024

@leonid-s-usov leonid-s-usov force-pushed the wip-lusov-quiesce-xlock branch 2 times, most recently from 92a796c to 1be0a7e Compare April 15, 2024 16:36
@leonid-s-usov leonid-s-usov requested a review from a team April 15, 2024 16:39
@leonid-s-usov leonid-s-usov dismissed batrick’s stale review April 15, 2024 16:40

revoking this review due to the code change

Member

@batrick batrick left a comment


otherwise lgtm

  // NB: this will also wrlock the versionlock
  lov.add_xlock(&in->filelock);
  if (in->is_auth()) {
    lov.add_rdlock(&in->policylock); /* for the quiesce_block xattr test */
Member


replica mds needs this lock too?

Contributor Author


I don't think so, since it's a SimpleLock and therefore mirrored.

Contributor Author


hmm... you are right; but this makes me wonder if we have an issue here: there is a window where the auth and the replica may see different values of the quiesce_block xattr for the same quiesce.

Imagine a race between setting the quiesce xattr and a quiesce. One of the ranks could catch the value before it's updated on the other rank, which would order its quiesce of the node after the change.

Contributor Author


The only way I can think of to prevent the inconsistency is to not release the policylock, at least not until the quiesce is complete on all ranks.

Contributor Author


OK, I don't think we have the time for this. We'll just have to mention in the docs (are there docs about it?) that quiesce.block shouldn't be set during quiesces, for now.

Though before we add it to the docs, I'd still like to rename it to quiesce.skip, because block still throws me off, TBH.

Contributor Author


Alternatively, we could remove this feature until we re-introduce it consistently.

Contributor Author


Or, how about this: it appears that the policylock doesn't help us with the race between a quiesce and setting this flag. So, to be safe, the user would need to change this flag and then re-issue a quiesce. If that's the case, we may as well not bother taking the policylock at all :D

Member


If the lock is not in sync state, you cannot safely read it.

Please pull the commit out. I have discovered several bugs surrounding the policylock which I'll post in a separate PR.

Contributor Author


Ack!
We have to figure out a new way of setting the quiesce skip flag so that it stays consistent across the cluster even with a racing quiesce. Having a separate PR for that makes a lot of sense.

The current solution seems good enough for the first version of the quiesce. It will work well if we need to enable it in the field, as that will most probably mean a more or less static config under support supervision.

@leonid-s-usov
Contributor Author

jenkins test windows

@leonid-s-usov leonid-s-usov changed the title mds/quiesce: xlock the file to let clients keep their buffered writes mds/quiesce: let clients keep their buffered writes for a quiesced file Apr 15, 2024
@leonid-s-usov leonid-s-usov requested review from batrick and a team April 15, 2024 21:55
With the quiesce protocol taking a `rdlock` on the file,
it also revokes the `Fb` capability, which the clients can't release
until they are done flushing, and that may take up arbitrarily long,
evidently, more than 10 minutes.

We went for the rdlock to avoid affecting readonly clients,
but given the evidence above we should not optimize for those.
Ideally, we’d like to have a QUIESCE file lock mode where both rd
and buffer are allowed, but as of now it seems like our best
available option is to `xlock` the file which will let the writing
clients keep their buffers for the duration of the quiesce.

We can only afford this change for a `splitauth` config,
i.e. where we drop the lock immediately after all `Fw`s are revoked

Signed-off-by: Leonid Usov <leonid.usov@ibm.com>
For every mirrored lock, the auth will message the replica to ensure
the replicated lock state. When we take x/rdlock on the auth, it will
ensure the LOCK_LOCK state on the replica, which has the file caps we
want for quiesce: CACHE and BUFFER.

It should be sufficient to only hold the quiesce local lock
on the replica side.

Signed-off-by: Leonid Usov <leonid.usov@ibm.com>
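
A standalone sketch of the lock selection this commit message describes (not the actual MDS code; the LockRequest shape and the choice of xlock for the local quiesce lock are assumptions for illustration): the auth rank xlocks the filelock, which drives the replica's copy of the lock to LOCK_LOCK and leaves clients with CACHE and BUFFER, while a replica rank only takes its local quiesce lock.

// Standalone sketch (not the actual MDS code) of the lock selection the
// commit message describes: auth ranks lock the file (which also drives the
// replica's lock to LOCK_LOCK, allowing CACHE|BUFFER caps on the clients),
// while replica ranks only take their local quiesce lock.
#include <cstdio>
#include <string>
#include <vector>

enum class LockOp { Xlock, Wrlock, Rdlock };
struct LockRequest { std::string lock; LockOp op; };

static std::vector<LockRequest> quiesce_lock_ops(bool is_auth) {
  std::vector<LockRequest> lov;
  // taken on every rank; the op choice here is an assumption for illustration
  lov.push_back({"quiescelock", LockOp::Xlock});
  if (is_auth) {
    // only the auth rank locks the file; the replica follows via lock messages
    lov.push_back({"filelock", LockOp::Xlock});
  }
  return lov;
}

int main() {
  for (bool is_auth : {true, false}) {
    printf("%s rank locks:", is_auth ? "auth" : "replica");
    for (const auto& req : quiesce_lock_ops(is_auth))
      printf(" %s", req.lock.c_str());
    printf("\n");
  }
}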
@leonid-s-usov
Contributor Author

@batrick here is the latest quiesce test run: https://pulpito.ceph.com/leonidus-2024-04-16_05:41:33-fs-wip-lusov-quiesce-xlock-distro-default-smithi/

The same test has failed multiple times, namely test_quiesce_authpin_wait (tasks.cephfs.test_quiesce.TestQuiesceMultiRank)

This test passes for me locally. It's important to note that the above suite was running on a version with the policylock, though I don't think it should have affected anything.

I will schedule a new run with the latest code (without the policylock commit), but for now it's important to decide whether the test run was good enough for merging downstream at https://gitlab.cee.redhat.com/ceph/ceph/-/merge_requests/592.

@leonid-s-usov
Contributor Author

jenkins test windows

@batrick
Member

batrick commented Apr 16, 2024

jenkins test windows

Member

@batrick batrick left a comment


@batrick
Member

batrick commented Apr 17, 2024

This PR is under test in https://tracker.ceph.com/issues/65530.

@leonid-s-usov
Contributor Author

This PR is under test in https://tracker.ceph.com/issues/65530.

@batrick that run seems doomed for whatever reason. Would you consider this sufficient to merge?

@batrick batrick merged commit 08d35a8 into main Apr 17, 2024
10 of 11 checks passed
@batrick batrick deleted the wip-lusov-quiesce-xlock branch April 17, 2024 20:02
@batrick
Member

batrick commented Apr 17, 2024

This PR is under test in https://tracker.ceph.com/issues/65530.

@batrick that run seems doomed for whatever reason. Would you consider this sufficient to merge?

Yes, I was just including it in the branch I was building because I couldn't merge to main with a required jenkins test still running, and I wanted to include it in my tests.

leonid-s-usov added a commit that referenced this pull request Apr 21, 2024
With the quiesce protocol taking a `rdlock` on the file,
it also revokes the `Fb` capability, which the clients can't release
until they are done flushing, and that may take up arbitrarily long,
evidently, more than 10 minutes.

We went for the rdlock to avoid affecting readonly clients,
but given the evidence above we should not optimize for those.
Ideally, we’d like to have a QUIESCE file lock mode where both rd
and buffer are allowed, but as of now it seems like our best
available option is to `xlock` the file which will let the writing
clients keep their buffers for the duration of the quiesce.

We can only afford this change for a `splitauth` config,
i.e. where we drop the lock immediately after all `Fw`s are revoked

Signed-off-by: Leonid Usov <leonid.usov@ibm.com>
(cherry picked from commit 8ac9842)
Fixes: https://tracker.ceph.com/issues/65556
Original-Issue: https://tracker.ceph.com/issues/65472
Original-PR: #56755
leonid-s-usov added a commit that referenced this pull request Apr 21, 2024
For every mirrored lock, the auth will message the replica to ensure
the replicated lock state. When we take x/rdlock on the auth, it will
ensure the LOCK_LOCK state on the replica, which has the file caps we
want for quiesce: CACHE and BUFFER.

It should be sufficient to only hold the quiesce local lock
on the replica side.

Signed-off-by: Leonid Usov <leonid.usov@ibm.com>
(cherry picked from commit eac482b)
Fixes: https://tracker.ceph.com/issues/65556
Original-Issue: https://tracker.ceph.com/issues/65472
Original-PR: #56755