mds/quiesce: let clients keep their buffered writes for a quiesced file #56755
Conversation
(force-pushed from b2e28f5 to a6584ff)
I wasn't able to reproduce the scenario on the vossi server with
@batrick this shouldn't be merged, I haven't tested it with multiple ranks :(
(force-pushed from fc422e3 to a9e20a2)
(force-pushed from 92a796c to 1be0a7e)
revoking this review due to the code change
otherwise lgtm
src/mds/MDCache.cc (Outdated)

```cpp
// NB: this will also wrlock the versionlock
lov.add_xlock(&in->filelock);
if (in->is_auth()) {
  lov.add_rdlock(&in->policylock); /* for the quiesce_block xattr test */
```
replica mds needs this lock too?
I don't think so: since it's a `SimpleLock`, it's mirrored.
hmm... you are right; but this makes me wonder if we have an issue here, with a window where the auth and the replica may see different values of the `quiesce_block` xattr for the same quiesce.
Imagine a race between setting the quiesce xattr and a quiesce: one of the ranks could catch the value before the update, while the other rank could end up ordering its quiesce of the node after the change.
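To make the window concrete, here is a hypothetical interleaving (illustrative comments only, not real code):

```cpp
// Two ranks quiesce the same inode while a client flips the
// quiesce_block xattr; once the policylock is dropped, nothing
// orders the xattr change against both ranks' reads:
//
//   rank0                             rank1
//   -----                             -----
//   reads quiesce_block == false
//   quiesces the inode
//                                     xattr update applies:
//                                     quiesce_block -> true
//                                     reads quiesce_block == true
//                                     skips the inode
//
// The ranks now disagree on whether the inode is part of the
// same quiesce.
```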
The only way I can think of to prevent the inconsistency is to not release the policylock, at least not until the quiesce is complete on all ranks.
OK, I don't think we have the time for this. We'll just have to mention in the docs (are there docs about it?) that `quiesce.block` shouldn't be set during quiesces, for now.
Though before we add it to the docs, I'd still like to rename it to `quiesce.skip`, because `block` still throws me off, TBH.
Alternatively, we could remove this feature until we re-introduce it consistently.
Or, how about this: it appears that the `policylock` doesn't help us with the race of a quiesce and setting of this flag. So, to be safe, the user would need to change this flag and then re-issue a quiesce. If that's the case, we may as well not even bother taking the `policylock` :D
If the lock is not in `sync` state, you cannot safely read it.
Please pull the commit out. I have discovered several bugs surrounding the policylock which I'll post in a separate PR.
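A minimal sketch of the invariant described above, assuming the usual MDS lock-state names (`can_read_quiesce_block` is a hypothetical helper for illustration, not existing API):

```cpp
#include "SimpleLock.h"  // Ceph MDS lock types (src/mds); sketch only

// Hypothetical helper: a replicated SimpleLock's guarded value may
// only be read while the lock is in sync, i.e. when every rank is
// guaranteed to see the same value. In any transitional state the
// replica's copy may be stale.
static bool can_read_quiesce_block(const SimpleLock *lock)
{
  return lock->get_state() == LOCK_SYNC;
}
```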
Ack!
We have to figure out a new way of setting the quiesce skip flag that is consistent across the cluster in the face of a racing quiesce. Having a separate PR for that makes a lot of sense.
The current solution seems good enough for the first version of quiesce; it will work well if we need to enable it in the field, since that will most probably mean a more or less static config under support supervision.
jenkins test windows
(force-pushed from 1be0a7e to 9775b2e)
With the quiesce protocol taking a `rdlock` on the file, it also revokes the `Fb` capability, which the clients can't release until they are done flushing, and that may take arbitrarily long: evidently, more than 10 minutes. We went for the rdlock to avoid affecting readonly clients, but given the evidence above we should not optimize for those. Ideally, we'd like to have a QUIESCE file lock mode where both rd and buffer are allowed, but as of now it seems like our best available option is to `xlock` the file, which will let the writing clients keep their buffers for the duration of the quiesce. We can only afford this change for a `splitauth` config, i.e. where we drop the lock immediately after all `Fw`s are revoked.

Signed-off-by: Leonid Usov <leonid.usov@ibm.com>
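In terms of the lock vector the quiesce op builds, the change amounts to the following sketch (based on the diff fragment quoted earlier in the thread; surrounding code elided):

```cpp
// Previously: a rdlock on the filelock also revokes Fb, so buffered
// writers must flush everything before the quiesce can proceed.
//   lov.add_rdlock(&in->filelock);

// Now: an xlock revokes Fw but lets the writing clients keep their
// buffers for the duration of the quiesce.
// NB: this will also wrlock the versionlock
lov.add_xlock(&in->filelock);
```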
For every mirrored lock, the auth will message the replica to ensure the replicated lock state. When we take an x/rdlock on the auth, it will ensure the LOCK_LOCK state on the replica, which has the file caps we want for quiesce: CACHE and BUFFER. It should be sufficient to only hold the local quiesce lock on the replica side.

Signed-off-by: Leonid Usov <leonid.usov@ibm.com>
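A sketch of the auth/replica split this describes (the filelock call follows the diff fragment above; the unconditional local `quiescelock` xlock is illustrative):

```cpp
if (in->is_auth()) {
  // Only the auth drives the distributed filelock: x/rdlocking here
  // pushes the replicas into LOCK_LOCK, whose client caps already
  // include what quiesce needs (CACHE and BUFFER).
  // NB: this will also wrlock the versionlock
  lov.add_xlock(&in->filelock);
}
// Auth or replica, the local quiesce lock is held for the duration;
// on a replica this is all that's needed, since the replicated
// filelock state is ensured by the auth's lock messages.
lov.add_xlock(&in->quiescelock);
```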
(force-pushed from 9775b2e to eac482b)
@batrick here is the latest quiesce test run: https://pulpito.ceph.com/leonidus-2024-04-16_05:41:33-fs-wip-lusov-quiesce-xlock-distro-default-smithi/

The same test has failed multiple times. This test passes for me locally. It's important to note that the above suite was running on a version that still carried the policylock commit. I will schedule a new run with the latest code (without the policylock commit), but for now it's important to conclude whether the test run was good enough for merging downstream at https://gitlab.cee.redhat.com/ceph/ceph/-/merge_requests/592.
jenkins test windows
This PR is under test in https://tracker.ceph.com/issues/65530.
@batrick that run seems doomed for whatever reason. Would you consider this sufficient to merge?
Yes, I was just including it in the branch I was building because I couldn't merge to main with the required jenkins test running, and I wanted to include it in my tests.
With the quiesce protocol taking a `rdlock` on the file, it also revokes the `Fb` capability, which the clients can't release until they are done flushing, and that may take arbitrarily long: evidently, more than 10 minutes. We went for the rdlock to avoid affecting readonly clients, but given the evidence above we should not optimize for those. Ideally, we'd like to have a QUIESCE file lock mode where both rd and buffer are allowed, but as of now it seems like our best available option is to `xlock` the file, which will let the writing clients keep their buffers for the duration of the quiesce. We can only afford this change for a `splitauth` config, i.e. where we drop the lock immediately after all `Fw`s are revoked.

Signed-off-by: Leonid Usov <leonid.usov@ibm.com>
(cherry picked from commit 8ac9842)
Fixes: https://tracker.ceph.com/issues/65556
Original-Issue: https://tracker.ceph.com/issues/65472
Original-PR: #56755
For every mirrored lock, the auth will message the replica to ensure the replicated lock state. When we take an x/rdlock on the auth, it will ensure the LOCK_LOCK state on the replica, which has the file caps we want for quiesce: CACHE and BUFFER. It should be sufficient to only hold the local quiesce lock on the replica side.

Signed-off-by: Leonid Usov <leonid.usov@ibm.com>
(cherry picked from commit eac482b)
Fixes: https://tracker.ceph.com/issues/65556
Original-Issue: https://tracker.ceph.com/issues/65472
Original-PR: #56755
The analysis of https://bugzilla.redhat.com/show_bug.cgi?id=2273935 has shown that a client may build up a significant backlog of flushes in the buffers.

With the quiesce protocol taking a `rdlock` on the file, it also revokes the `Fb` capability, which the clients can't release until they are done flushing, and that may take arbitrarily long: evidently, more than 10 minutes.

We went for the rdlock to avoid affecting readonly clients, but given the evidence above we should not optimize for those. Ideally, we'd like to have a QUIESCE file lock mode where both rd and buffer are allowed, but as of now it seems like our best available option is to `xlock` the file, which will let the writing clients keep their buffers for the duration of the quiesce.

We can only afford this change for a `splitauth` config, i.e. where we drop the lock immediately after all `Fw`s are revoked.

Show available Jenkins commands
jenkins retest this please
jenkins test classic perf
jenkins test crimson perf
jenkins test signed
jenkins test make check
jenkins test make check arm64
jenkins test submodules
jenkins test dashboard
jenkins test dashboard cephadm
jenkins test api
jenkins test docs
jenkins render docs
jenkins test ceph-volume all
jenkins test ceph-volume tox
jenkins test windows
jenkins test rook e2e