client: improve the libcephfs when MDS is stopping #52336
base: main
Conversation
Force-pushed from b9434d1 to 604789e.
This is a good change, thanks Xiubo!

        in->delay_cap_item.remove_myself();
        check_caps(in, CHECK_CAPS_NODELAY);
      }
    }
I don't understand at all the purpose of flushing the caps here. We can never guarantee they will get flushed before the MDS stops, and something else should take over for it, so why bother doing something that adds load to a daemon right when it's trying to shed load?
This can't guarantee it either; it just tries to flush the dirty caps to the MDS before the MDS stops.

I noticed that when the MDS or client receives the first mdsmap marking an MDS as up:stopping, the client keeps sending client requests and cap update requests to it for around 20 seconds. I think this could make the MDS stopping process last longer.

That means during those 20 seconds the dirty caps may get flushed by the tick thread anyway, so I am thinking: why not trigger the flush as early as possible to speed it up?

Makes sense?
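The behavior described above, where clients keep targeting an up:stopping rank for new requests, can be sketched roughly as follows. This is a hypothetical, heavily simplified stand-in for the target-selection logic in Client.cc, not the actual code; `choose_target_mds` and the state values here are assumptions for illustration only:

```cpp
#include <cassert>
#include <map>

// Numeric values as quoted later in this thread; stand-ins for MDSMap's enum.
enum MDSState { STATE_ACTIVE = 13, STATE_STOPPING = 14 };

// Prefer an up:active rank for new requests; skip up:stopping ranks unless
// nothing else is available (we still need *some* target in that case).
int choose_target_mds(const std::map<int, MDSState>& ranks) {
  int fallback = -1;
  for (const auto& [rank, state] : ranks) {
    if (state == STATE_ACTIVE)
      return rank;        // best choice: a fully active rank
    if (fallback < 0)
      fallback = rank;    // remember a usable rank just in case
  }
  return fallback;        // every known rank is stopping (or map is empty)
}
```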
This pull request has been automatically marked as stale because it has not had any activity for 60 days. It will be closed if no further activity occurs for another 30 days.
This pull request has been automatically closed because there has been no activity for 90 days. Please feel free to reopen this pull request (or open a new one) if the proposed change is still appropriate. Thank you for your contribution!
src/client/Client.cc
Outdated
    @@ -3091,7 +3091,19 @@ void Client::handle_mds_map(const MConstRef<MMDSMap>& m)
          continue;
        }
        if (newstate >= MDSMap::STATE_ACTIVE) {
    -     if (oldstate < MDSMap::STATE_ACTIVE) {
    +     if (oldstate <= MDSMap::STATE_ACTIVE) {
STATE_ACTIVE is 13 while STATE_STOPPING is 14, so it seems we're skipping STATE_STOPPING.
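Using the values quoted above (STATE_ACTIVE == 13, STATE_STOPPING == 14), a minimal stand-alone illustration of what the old and new guards in the diff match. The constants and helper functions are simplified stand-ins, not the real MDSMap enum or Client.cc code:

```cpp
#include <cassert>

// Stand-ins for the real MDSMap state values quoted in the review comment.
constexpr int STATE_ACTIVE   = 13;
constexpr int STATE_STOPPING = 14;

// Old guard: only transitions into >= active from a pre-active state match,
// so an active -> stopping transition is skipped.
bool old_guard(int oldstate, int newstate) {
  return newstate >= STATE_ACTIVE && oldstate < STATE_ACTIVE;
}

// New guard from the diff: oldstate == STATE_ACTIVE now matches too,
// so the client reacts when a rank goes active -> stopping.
bool new_guard(int oldstate, int newstate) {
  return newstate >= STATE_ACTIVE && oldstate <= STATE_ACTIVE;
}
```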
Also, if we're concerned with the stopping/stopped states, why not check them explicitly? And as Greg mentioned in one of the comments: why flush the caps when there is no guarantee of them actually being flushed? How about instead creating a queue of the requests and sending them to the MDS when the MDS failover takes place, so that clients can continue with the data ops and survive on the cached metadata until the MDS comes back alive? Makes sense?
Actually this is the non-recovery case: the corresponding MDS is just trying to stop and has set its state to STOPPING.

Please note that the STOPPING state does not mean the MDS won't accept new client requests. The MDS will report and request the STOPPED state from the Monitor only after it has shut down cleanly. For more detail please see Client::make_request() and void MDSRank::stopping_done().

Before the MDS shuts down cleanly, all the sessions will be closed, and on the client side, when the sessions are closed, the clients will drop their dirty caps. So while the MDS is in the STOPPING state, just before the sessions are closed, why not try to flush the dirty caps?
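A rough sketch of this proposal: when a rank goes up:stopping, walk the inodes with dirty caps on that rank and kick an immediate flush instead of waiting for the periodic tick. Only the CHECK_CAPS_NODELAY name mirrors the diff above; Inode, check_caps(), and flush_dirty_caps_for_rank() here are toy stand-ins, not the real client structures:

```cpp
#include <cassert>
#include <vector>

// Mirrors the flag name seen in the diff; the value here is arbitrary.
constexpr int CHECK_CAPS_NODELAY = 1;

// Toy stand-in for the client's inode bookkeeping.
struct Inode {
  int mds_rank;   // rank holding this inode's caps
  bool dirty;     // inode has dirty caps to flush
};

// Stand-in for Client::check_caps(): flush the dirty state immediately.
void check_caps(Inode& in, int /*flags*/) {
  in.dirty = false;
}

// When a rank goes up:stopping, flush its dirty caps right away rather than
// leaving them to the tick thread; returns how many inodes were flushed.
int flush_dirty_caps_for_rank(std::vector<Inode>& inodes, int rank) {
  int flushed = 0;
  for (auto& in : inodes) {
    if (in.mds_rank == rank && in.dirty) {
      check_caps(in, CHECK_CAPS_NODELAY);
      ++flushed;
    }
  }
  return flushed;
}
```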
+1 for flushing the dirty caps, but I was trying to come up with something to make it a sure shot and not just a "try". I understand that the clients can send requests to the MDS when it is stopping, but what happens if the client just sent a request and the MDS stopped? The timing of the events can be such that the dirty caps get dropped, right? So why not take care of the dropped caps by maintaining a queue and then syncing the metadata changes once the MDS is back up? Or do we already have such a mechanism specifically for handling this STOPPING case?
I think the flushing_cap_tids is doing this. Let me check it again later. Thanks @dparmar18
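A toy model of the idea behind flushing_cap_tids: cap flushes that have been sent but not yet acked are remembered by tid, so anything still pending can be re-driven after an MDS failover instead of being lost. The structures below are simplified assumptions for illustration, not the real client code:

```cpp
#include <cassert>
#include <cstdint>
#include <map>
#include <set>
#include <string>

// Toy session: maps each in-flight flush tid to the inode it covers.
struct Session {
  std::map<uint64_t, std::string> flushing;  // tid -> inode name (stand-in)
  uint64_t next_tid = 1;
};

// Sending a cap flush records its tid until the MDS acks it.
uint64_t send_flush(Session& s, const std::string& ino) {
  uint64_t tid = s.next_tid++;
  s.flushing[tid] = ino;
  return tid;
}

// An ack means the flush is durable; forget the tid.
void handle_flush_ack(Session& s, uint64_t tid) {
  s.flushing.erase(tid);
}

// On failover, everything still in `flushing` must be resent to the new MDS.
std::set<std::string> resend_after_failover(const Session& s) {
  std::set<std::string> todo;
  for (const auto& [tid, ino] : s.flushing)
    todo.insert(ino);
  return todo;
}
```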
When stopping an MDS, the mdsmap will mark it as stopping; try to avoid choosing the stopping MDSs when making new requests. Fixes: https://tracker.ceph.com/issues/61914 Signed-off-by: Xiubo Li <xiubli@redhat.com>
Trigger a flush of the dirty caps when the MDS is stopping, or the caps could be dropped if the MDS stopped before the client could receive the mdsmap. Fixes: https://tracker.ceph.com/issues/61914 Signed-off-by: Xiubo Li <xiubli@redhat.com>
This pull request has been automatically marked as stale because it has not had any activity for 60 days. It will be closed if no further activity occurs for another 30 days.
When stopping an MDS, the mdsmap will mark it as stopping. Try to avoid choosing the stopping MDSs when making new requests, and trigger a flush of the dirty caps when the MDS is stopping, or the caps could be dropped if the MDS stopped before the client could receive the mdsmap.

Fixes: https://tracker.ceph.com/issues/61914
Contribution Guidelines
To sign and title your commits, please refer to Submitting Patches to Ceph.
If you are submitting a fix for a stable branch (e.g. "pacific"), please refer to Submitting Patches to Ceph - Backports for the proper workflow.
Checklist
Show available Jenkins commands
jenkins retest this please
jenkins test classic perf
jenkins test crimson perf
jenkins test signed
jenkins test make check
jenkins test make check arm64
jenkins test submodules
jenkins test dashboard
jenkins test dashboard cephadm
jenkins test api
jenkins test docs
jenkins render docs
jenkins test ceph-volume all
jenkins test ceph-volume tox
jenkins test windows