
mds/quiesce: let clients keep their buffered writes for a quiesced file #56755

Merged
merged 2 commits into main from wip-lusov-quiesce-xlock on Apr 17, 2024

Conversation

leonid-s-usov
Contributor

The analysis of https://bugzilla.redhat.com/show_bug.cgi?id=2273935 has shown that a client may build up a significant backlog of buffered writes to flush.

With the quiesce protocol taking a rdlock on the file, it also revokes the Fb capability, which the clients can't release until they are done flushing, and that may take arbitrarily long: evidently, more than 10 minutes.

We went for the rdlock to avoid affecting readonly clients, but given the evidence above we should not optimize for those. Ideally, we'd like to have a QUIESCE file lock mode where both rd and buffer are allowed, but as of now our best available option is to xlock the file, which lets the writing clients keep their buffers for the duration of the quiesce.

We can only afford this change for a splitauth config, i.e. where we drop the lock immediately after all Fw caps are revoked.
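
As a standalone illustration of the splitauth rule above (this is not Ceph code; the ClientCaps model and the can_drop_quiesce_lock helper are made up for the example), the sketch below shows the point of the change: the quiesce lock only needs to be held until every Fw is revoked, and Fb is never requested back, so writers can keep buffering.

// Standalone sketch (not Ceph code) of the splitauth rule described above:
// the quiesce lock on the file is only held until every client's Fw has
// been revoked; Fb is never requested back, so writers keep their buffers.
#include <cstdio>
#include <vector>

struct ClientCaps { bool Fw; bool Fb; };  // illustrative model, not Ceph's cap bits

// Returns true once the quiesce lock can be dropped under splitauth:
// all Fw caps revoked, regardless of outstanding Fb (buffered data).
static bool can_drop_quiesce_lock(const std::vector<ClientCaps>& clients) {
  for (const auto& c : clients)
    if (c.Fw) return false;  // still waiting for a writer to stop writing
  return true;               // Fb may still be held -- that's the point
}

int main() {
  std::vector<ClientCaps> clients = {{/*Fw=*/true,  /*Fb=*/true},
                                     {/*Fw=*/false, /*Fb=*/true}};
  printf("drop now? %s\n", can_drop_quiesce_lock(clients) ? "yes" : "no");
  clients[0].Fw = false;  // the writer acks the Fw revoke without flushing
  printf("drop now? %s\n", can_drop_quiesce_lock(clients) ? "yes" : "no");
}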


@github-actions github-actions bot added the cephfs Ceph File System label Apr 8, 2024
@leonid-s-usov
Contributor Author

I wasn't able to reproduce the scenario on the vossi server with ceph-fuse: the buffers were flushed quite fast.
However, I did verify that with this change the files now keep Fb during the quiesce:

2024-04-09T19:26:16.145+0000 7f3ac6ffd700  5 client.4250 handle_cap_grant on in 0x100000201a2 mds.0 seq 11 caps now pAsxLsXsxFsxcrwb was pAsxLsXsxFsxcrwb
2024-04-09T19:26:16.145+0000 7f3ac6ffd700  5 client.4250 handle_cap_grant on in 0x100000201a0 mds.0 seq 11 caps now pAsLsXsFcb was pAsxLsXsxFsxcrwb
2024-04-09T19:26:16.145+0000 7f3ac6ffd700  5 client.4250 handle_cap_grant on in 0x1000002019a mds.0 seq 11 caps now pAsLsXsFcb was pAsxLsXsxFsxcrwb
2024-04-09T19:26:16.145+0000 7f3ac6ffd700  5 client.4250 handle_cap_grant on in 0x100000201a1 mds.0 seq 12 caps now pAsLsXsFcb was pAsxLsXsxFsxcrwb
2024-04-09T19:26:16.145+0000 7f3ac6ffd700  5 client.4250 handle_cap_grant on in 0x100000201a3 mds.0 seq 11 caps now pAsLsXsFcb was pAsxLsXsxFsxcrwb
2024-04-09T19:26:16.145+0000 7f3ac6ffd700  5 client.4250 handle_cap_grant on in 0x100000201a4 mds.0 seq 11 caps now pAsLsXsFcb was pAsxLsXsxFsxcrwb
2024-04-09T19:26:16.145+0000 7f3ac6ffd700  5 client.4250 handle_cap_grant on in 0x10000020195 mds.0 seq 11 caps now pAsLsXsFcb was pAsxLsXsxFsxcrwb
2024-04-09T19:26:16.145+0000 7f3ac6ffd700  5 client.4250 handle_cap_grant on in 0x1000002019e mds.0 seq 12 caps now pAsLsXsFcb was pAsxLsXsxFsxcrwb
2024-04-09T19:26:16.145+0000 7f3ac6ffd700  5 client.4250 handle_cap_grant on in 0x10000020199 mds.0 seq 12 caps now pAsLsXsFcb was pAsxLsXsxFsxcrwb
2024-04-09T19:26:16.146+0000 7f3ac6ffd700  5 client.4250 handle_cap_grant on in 0x1000002019b mds.0 seq 12 caps now pAsLsXsFcb was pAsxLsXsxFsxcrwb
2024-04-09T19:26:16.146+0000 7f3ac6ffd700  5 client.4250 handle_cap_grant on in 0x1000002019c mds.0 seq 12 caps now pAsLsXsFcb was pAsxLsXsxFsxcrwb
2024-04-09T19:26:16.148+0000 7f3ac6ffd700  5 client.4250 handle_cap_grant on in 0x100000201a2 mds.0 seq 12 caps now pAsLsXsFcb was pAsxLsXsxFsxcrwb
2024-04-09T19:26:16.148+0000 7f3ac6ffd700  5 client.4250 handle_cap_grant on in 0x10000020196 mds.0 seq 11 caps now pAsLsXsFcb was pAsxLsXsxFsxcrwb
2024-04-09T19:26:16.148+0000 7f3ac6ffd700  5 client.4250 handle_cap_grant on in 0x10000020198 mds.0 seq 12 caps now pAsLsXsFcb was pAsxLsXsxFsxcrwb
2024-04-09T19:26:16.148+0000 7f3ac6ffd700  5 client.4250 handle_cap_grant on in 0x10000020197 mds.0 seq 11 caps now pAsLsXsFcb was pAsxLsXsxFsxcrwb
2024-04-09T19:26:16.148+0000 7f3ac6ffd700  5 client.4250 handle_cap_grant on in 0x1000002019d mds.0 seq 13 caps now pAsLsXsFcb was pAsxLsXsxFsxcrwb
2024-04-09T19:26:16.148+0000 7f3ac6ffd700  5 client.4250 handle_cap_grant on in 0x1000002019f mds.0 seq 12 caps now pAsLsXsFcb was pAsxLsXsxFsxcrwb
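
To read the grant lines above: in a cap string the letters after F are the file caps, with s=shared, x=excl, c=cache, r=read, w=write, b=buffer, so the clients go from pAsxLsXsxFsxcrwb to pAsLsXsFcb, i.e. they lose Fsxrw but keep Fc and Fb and can hold on to their buffered writes. A small standalone decoder of that shorthand (illustrative only, not the client's own parser):

// Standalone helper (not Ceph's parser): explain the file-cap letters that
// appear after 'F' in a cap string such as "pAsLsXsFcb".
#include <cstdio>
#include <string>

static void explain_file_caps(const std::string& caps) {
  auto pos = caps.find('F');
  if (pos == std::string::npos) { printf("no file caps\n"); return; }
  printf("%s -> file caps:", caps.c_str());
  for (char c : caps.substr(pos + 1)) {
    switch (c) {
      case 's': printf(" SHARED"); break;
      case 'x': printf(" EXCL");   break;
      case 'c': printf(" CACHE");  break;
      case 'r': printf(" READ");   break;
      case 'w': printf(" WRITE");  break;
      case 'b': printf(" BUFFER"); break;
      default:  printf(" ?%c", c); break;
    }
  }
  printf("\n");
}

int main() {
  explain_file_caps("pAsxLsXsxFsxcrwb");  // before: full writer caps
  explain_file_caps("pAsLsXsFcb");        // after: only CACHE and BUFFER remain
}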

batrick
batrick previously approved these changes Apr 10, 2024
@leonid-s-usov
Contributor Author

@batrick this shouldn't be merged yet: I haven't tested it with multiple ranks :(

/builddir/build/BUILD/ceph-18.2.1/src/mds/Locker.cc: 2077: FAILED ceph_assert(lock->get_sm()->can_remote_xlock)

 ceph version 18.2.1-136.el9cp (e7edde2b655d0dd9f860dda675f9d7954f07e6e3) reef (stable)
 1: (ceph::__ceph_assert_fail(char const*, char const*, int, char const*)+0x12e) [0x7f6a0646700e]
 2: /usr/lib64/ceph/libceph-common.so.2(+0x1631cc) [0x7f6a064671cc]
 3: (Locker::xlock_start(SimpleLock*, boost::intrusive_ptr<MDRequestImpl> const&)+0xa12) [0x556a16650472]
 4: (Locker::acquire_locks(boost::intrusive_ptr<MDRequestImpl> const&, MutationImpl::LockOpVec&, CInode*, std::set<MDSCacheObject*, std::less<MDSCacheObject*>, std::allocator<MDSCacheObject*> >, bool, bool)+0x2a11) [0x556a166459e1]
 5: (MDCache::dispatch_quiesce_inode(boost::intrusive_ptr<MDRequestImpl> const&)+0x942) [0x556a16621c72]
 6: (Server::handle_peer_auth_pin_ack(boost::intrusive_ptr<MDRequestImpl> const&, boost::intrusive_ptr<MMDSPeerRequest const> const&)+0x822) [0x556a16519602]
 7: (Server::handle_peer_request_reply(boost::intrusive_ptr<MMDSPeerRequest const> const&)+0x6cc) [0x556a16519efc]
 8: (Server::dispatch(boost::intrusive_ptr<Message const> const&)+0x12c) [0x556a1650679c]
 9: (MDSRank::_dispatch(boost::intrusive_ptr<Message const> const&, bool)+0x4fb) [0x556a164bccfb]
 10: (MDSRankDispatcher::ms_dispatch(boost::intrusive_ptr<Message const> const&)+0x5c) [0x556a164bd1fc]
 11: (MDSDaemon::ms_dispatch2(boost::intrusive_ptr<Message> const&)+0x196) [0x556a164a09b6]
 12: (DispatchQueue::entry()+0x542) [0x7f6a06666c02]
 13: /usr/lib64/ceph/libceph-common.so.2(+0x3fab31) [0x7f6a066feb31]
 14: /lib64/libc.so.6(+0x9f802) [0x7f6a05e35802]
 15: /lib64/libc.so.6(+0x3f450) [0x7f6a05dd5450]
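
For context on the assert: the backtrace shows a rank that is not auth for the inode (it had to auth-pin it on a peer) trying to start an xlock, which takes the remote-xlock path, and the filelock's state machine evidently does not allow remote xlocks, hence the ceph_assert. A standalone sketch of that failure mode (not Ceph code; the flag value is illustrative):

// Standalone sketch (not Ceph code) of the failure mode in the backtrace:
// a rank that is not auth for the inode must go through a remote xlock,
// which the lock's state machine may not support.
#include <cassert>

struct StateMachine { bool can_remote_xlock; };
struct Lock {
  const StateMachine* sm;
  const StateMachine* get_sm() const { return sm; }
};

static void xlock_start(const Lock* lock, bool i_am_auth) {
  if (i_am_auth)
    return;  // local xlock: fine
  // remote xlock path -- mirrors the failed assertion in Locker::xlock_start
  assert(lock->get_sm()->can_remote_xlock);
}

int main() {
  const StateMachine filelock_sm{/*can_remote_xlock=*/false};  // illustrative value
  Lock filelock{&filelock_sm};
  xlock_start(&filelock, /*i_am_auth=*/true);   // ok
  xlock_start(&filelock, /*i_am_auth=*/false);  // aborts, like the crash above
}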

@batrick
Member

batrick commented Apr 12, 2024

@leonid-s-usov leonid-s-usov force-pushed the wip-lusov-quiesce-xlock branch 2 times, most recently from 92a796c to 1be0a7e Compare April 15, 2024 16:36
@leonid-s-usov leonid-s-usov requested a review from a team April 15, 2024 16:39
@leonid-s-usov leonid-s-usov dismissed batrick’s stale review April 15, 2024 16:40

revoking this review due to the code change

Member

@batrick batrick left a comment


otherwise lgtm

  // NB: this will also wrlock the versionlock
  lov.add_xlock(&in->filelock);
  if (in->is_auth()) {
    lov.add_rdlock(&in->policylock); /* for the quiesce_block xattr test */
Member


replica mds needs this lock too?

Contributor Author


I don't think so, since it's a SimpleLock and therefore mirrored.

Contributor Author


hmm... you are right; but this makes me wonder if we have an issue here: there is a window where the auth and the replica may see different values of the quiesce_block xattr for the same quiesce.

Imagine a race between setting the quiesce xattr and a quiesce. One of the ranks could catch the value before it's updated on the other rank, which would order its quiesce of the node after the change.

Contributor Author


The only way I can think of to prevent the inconsistency is to not release the policylock, at least not until the quiesce is complete on all ranks.

Contributor Author


OK, I don't think we have the time for this. We'll just have to mention in the docs (are there docs about it?) that quiesce.block shouldn't be set during quiesces, for now.

Though before we add it to the docs, I'd still like to rename it to quiesce.skip, because block still throws me off, TBH.

Contributor Author


Alternatively, we could remove this feature until we re-introduce it consistently.

Contributor Author


Or, how about this: it appears that the policylock doesn't help us with the race between a quiesce and setting this flag. So, to be safe, the user would need to change this flag and then re-issue a quiesce. If that's the case, we may as well not bother taking the policylock at all :D

Member


If the lock is not in sync state, you cannot safely read it.

Please pull the commit out. I have discovered several bugs surrounding the policylock which I'll post in a separate PR.

Contributor Author


Ack!
We have to figure out a new way of setting the quiesce skip flag so that it stays consistent across the cluster even with a racing quiesce. Having a separate PR for that makes a lot of sense.

The current solution seems good enough for the first version of the quiesce. It will work well if we need to enable it in the field, as that will most probably mean a more or less static config under support supervision.

@leonid-s-usov
Contributor Author

jenkins test windows

@leonid-s-usov leonid-s-usov changed the title mds/quiesce: xlock the file to let clients keep their buffered writes mds/quiesce: let clients keep their buffered writes for a quiesced file Apr 15, 2024
@leonid-s-usov leonid-s-usov requested review from batrick and a team April 15, 2024 21:55
With the quiesce protocol taking a `rdlock` on the file,
it also revokes the `Fb` capability, which the clients can't release
until they are done flushing, and that may take up arbitrarily long,
evidently, more than 10 minutes.

We went for the rdlock to avoid affecting readonly clients,
but given the evidence above we should not optimize for those.
Ideally, we’d like to have a QUIESCE file lock mode where both rd
and buffer are allowed, but as of now it seems like our best
available option is to `xlock` the file which will let the writing
clients keep their buffers for the duration of the quiesce.

We can only afford this change for a `splitauth` config,
i.e. where we drop the lock immediately after all `Fw`s are revoked

Signed-off-by: Leonid Usov <leonid.usov@ibm.com>
For every mirrored lock, the auth will message the replica to ensure
the replicated lock state. When we take x/rdlock on the auth, it will
ensure the LOCK_LOCK state on the replica, which has the file caps we
want for quiesce: CACHE and BUFFER.

It should be sufficient to only hold the quiesce local lock
on the replica side.

Signed-off-by: Leonid Usov <leonid.usov@ibm.com>
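
A standalone sketch of the lock selection this commit message describes (not the actual MDS code; the LockRequest shape and the choice of xlock for the local quiesce lock are assumptions for illustration): the auth rank xlocks the filelock, which drives the replica's copy of the lock to LOCK_LOCK and leaves clients with CACHE and BUFFER, while a replica rank only takes its local quiesce lock.

// Standalone sketch (not the actual MDS code) of the lock selection the
// commit message describes: auth ranks lock the file (which also drives the
// replica's lock to LOCK_LOCK, allowing CACHE|BUFFER caps on the clients),
// while replica ranks only take their local quiesce lock.
#include <cstdio>
#include <string>
#include <vector>

enum class LockOp { Xlock, Wrlock, Rdlock };
struct LockRequest { std::string lock; LockOp op; };

static std::vector<LockRequest> quiesce_lock_ops(bool is_auth) {
  std::vector<LockRequest> lov;
  // taken on every rank; the op choice here is an assumption for illustration
  lov.push_back({"quiescelock", LockOp::Xlock});
  if (is_auth) {
    // only the auth rank locks the file; the replica follows via lock messages
    lov.push_back({"filelock", LockOp::Xlock});
  }
  return lov;
}

int main() {
  for (bool is_auth : {true, false}) {
    printf("%s rank locks:", is_auth ? "auth" : "replica");
    for (const auto& req : quiesce_lock_ops(is_auth))
      printf(" %s", req.lock.c_str());
    printf("\n");
  }
}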
@leonid-s-usov
Contributor Author

@batrick here is the latest quiesce test run: https://pulpito.ceph.com/leonidus-2024-04-16_05:41:33-fs-wip-lusov-quiesce-xlock-distro-default-smithi/

The same test has failed multiple times, namely test_quiesce_authpin_wait (tasks.cephfs.test_quiesce.TestQuiesceMultiRank)

This test passes for me locally. It's important to note that the above suite was running on a version with the policylock, though I don't think it should have affected anything.

I will schedule a new run with the latest code (without the policylock commit), but for now it's important to decide whether the test run was good enough for merging downstream at https://gitlab.cee.redhat.com/ceph/ceph/-/merge_requests/592.

@leonid-s-usov
Contributor Author

jenkins test windows

@batrick
Member

batrick commented Apr 16, 2024

jenkins test windows

Member

@batrick batrick left a comment


@batrick
Member

batrick commented Apr 17, 2024

This PR is under test in https://tracker.ceph.com/issues/65530.

@leonid-s-usov
Contributor Author

This PR is under test in https://tracker.ceph.com/issues/65530.

@batrick that run seems doomed for whatever reason. Would you consider this sufficient to merge?

@batrick batrick merged commit 08d35a8 into main Apr 17, 2024
10 of 11 checks passed
@batrick batrick deleted the wip-lusov-quiesce-xlock branch April 17, 2024 20:02
@batrick
Member

batrick commented Apr 17, 2024

This PR is under test in https://tracker.ceph.com/issues/65530.

@batrick that run seems doomed for whatever reason. Would you consider this sufficient to merge?

Yes, I was just including it in the branch I was building because I couldn't merge to main with a required jenkins test still running, and I wanted to include it in my tests.

leonid-s-usov added a commit that referenced this pull request Apr 21, 2024
With the quiesce protocol taking a `rdlock` on the file,
it also revokes the `Fb` capability, which the clients can't release
until they are done flushing, and that may take up arbitrarily long,
evidently, more than 10 minutes.

We went for the rdlock to avoid affecting readonly clients,
but given the evidence above we should not optimize for those.
Ideally, we’d like to have a QUIESCE file lock mode where both rd
and buffer are allowed, but as of now it seems like our best
available option is to `xlock` the file which will let the writing
clients keep their buffers for the duration of the quiesce.

We can only afford this change for a `splitauth` config,
i.e. where we drop the lock immediately after all `Fw`s are revoked

Signed-off-by: Leonid Usov <leonid.usov@ibm.com>
(cherry picked from commit 8ac9842)
Fixes: https://tracker.ceph.com/issues/65556
Original-Issue: https://tracker.ceph.com/issues/65472
Original-PR: #56755
leonid-s-usov added a commit that referenced this pull request Apr 21, 2024
For every mirrored lock, the auth will message the replica to ensure
the replicated lock state. When we take x/rdlock on the auth, it will
ensure the LOCK_LOCK state on the replica, which has the file caps we
want for quiesce: CACHE and BUFFER.

It should be sufficient to only hold the quiesce local lock
on the replica side.

Signed-off-by: Leonid Usov <leonid.usov@ibm.com>
(cherry picked from commit eac482b)
Fixes: https://tracker.ceph.com/issues/65556
Original-Issue: https://tracker.ceph.com/issues/65472
Original-PR: #56755