client: improve the libcephfs when MDS is stopping #52336
base: main
Conversation
Force-pushed from b9434d1 to 604789e.
This is a good change, thanks Xiubo!

        in->delay_cap_item.remove_myself();
        check_caps(in, CHECK_CAPS_NODELAY);
      }
    }
I don't understand at all the purpose of flushing the caps here. We can never guarantee they will get flushed before the MDS stops, and something else should take over for it, so why bother doing something that adds load to a daemon right when it's trying to shed load?
This can't guarantee it either; it just tries to flush the dirty caps to the MDS before the MDS stops.

I noticed that when the MDS or client receives the first mdsmap marking an MDS as up:stopping, the client keeps sending client requests and cap update requests to it for around 20 seconds. I think this could make the MDS stopping process last longer.

That means during those 20 seconds the dirty caps may get flushed by the tick thread anyway, so I am thinking: why not trigger the flush as early as possible to speed it up?

Makes sense?
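The behavior described above, where clients keep targeting an up:stopping rank for new requests, can be sketched roughly as follows. This is a hypothetical, heavily simplified stand-in for the target-selection logic in Client.cc, not the actual code; `choose_target_mds` and the state values here are assumptions for illustration only:

```cpp
#include <cassert>
#include <map>

// Numeric values as quoted later in this thread; stand-ins for MDSMap's enum.
enum MDSState { STATE_ACTIVE = 13, STATE_STOPPING = 14 };

// Prefer an up:active rank for new requests; skip up:stopping ranks unless
// nothing else is available (we still need *some* target in that case).
int choose_target_mds(const std::map<int, MDSState>& ranks) {
  int fallback = -1;
  for (const auto& [rank, state] : ranks) {
    if (state == STATE_ACTIVE)
      return rank;        // best choice: a fully active rank
    if (fallback < 0)
      fallback = rank;    // remember a usable rank just in case
  }
  return fallback;        // every known rank is stopping (or map is empty)
}
```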
This pull request has been automatically marked as stale because it has not had any activity for 60 days. It will be closed if no further activity occurs for another 30 days.
This pull request has been automatically closed because there has been no activity for 90 days. Please feel free to reopen this pull request (or open a new one) if the proposed change is still appropriate. Thank you for your contribution!
src/client/Client.cc
Outdated
    @@ -3091,7 +3091,19 @@ void Client::handle_mds_map(const MConstRef<MMDSMap>& m)
          continue;
        }
        if (newstate >= MDSMap::STATE_ACTIVE) {
    -     if (oldstate < MDSMap::STATE_ACTIVE) {
    +     if (oldstate <= MDSMap::STATE_ACTIVE) {
STATE_ACTIVE is 13 while STATE_STOPPING is 14, so it seems we're skipping STATE_STOPPING.
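Using the values quoted above (STATE_ACTIVE == 13, STATE_STOPPING == 14), a minimal stand-alone illustration of what the old and new guards in the diff match. The constants and helper functions are simplified stand-ins, not the real MDSMap enum or Client.cc code:

```cpp
#include <cassert>

// Stand-ins for the real MDSMap state values quoted in the review comment.
constexpr int STATE_ACTIVE   = 13;
constexpr int STATE_STOPPING = 14;

// Old guard: only transitions into >= active from a pre-active state match,
// so an active -> stopping transition is skipped.
bool old_guard(int oldstate, int newstate) {
  return newstate >= STATE_ACTIVE && oldstate < STATE_ACTIVE;
}

// New guard from the diff: oldstate == STATE_ACTIVE now matches too,
// so the client reacts when a rank goes active -> stopping.
bool new_guard(int oldstate, int newstate) {
  return newstate >= STATE_ACTIVE && oldstate <= STATE_ACTIVE;
}
```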
Also, if we're concerned with the stopping/stopped states, why not check them explicitly? And as Greg mentioned in one of the comments: why flush the caps when there is no guarantee of them actually being flushed? How about instead creating a queue of the requests and sending them to the MDS when the MDS failover takes place, so that clients can continue with the data ops and survive on the cached metadata until the MDS comes back alive? Makes sense?
Actually this is the non-recovery case: the corresponding MDS is just trying to stop and has set its state to STOPPING.

Please note that the STOPPING state does not mean the MDS won't accept new client requests. The MDS will report and request the STOPPED state from the Monitor only after it has shut down cleanly. For more detail please see Client::make_request() and void MDSRank::stopping_done().

Before the MDS shuts down cleanly, all the sessions will be closed, and on the client side, when the sessions are closed, the clients will drop their dirty caps. So while the MDS is in the STOPPING state, just before the sessions are closed, why not try to flush the dirty caps?
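A rough sketch of this proposal: when a rank goes up:stopping, walk the inodes with dirty caps on that rank and kick an immediate flush instead of waiting for the periodic tick. Only the CHECK_CAPS_NODELAY name mirrors the diff above; Inode, check_caps(), and flush_dirty_caps_for_rank() here are toy stand-ins, not the real client structures:

```cpp
#include <cassert>
#include <vector>

// Mirrors the flag name seen in the diff; the value here is arbitrary.
constexpr int CHECK_CAPS_NODELAY = 1;

// Toy stand-in for the client's inode bookkeeping.
struct Inode {
  int mds_rank;   // rank holding this inode's caps
  bool dirty;     // inode has dirty caps to flush
};

// Stand-in for Client::check_caps(): flush the dirty state immediately.
void check_caps(Inode& in, int /*flags*/) {
  in.dirty = false;
}

// When a rank goes up:stopping, flush its dirty caps right away rather than
// leaving them to the tick thread; returns how many inodes were flushed.
int flush_dirty_caps_for_rank(std::vector<Inode>& inodes, int rank) {
  int flushed = 0;
  for (auto& in : inodes) {
    if (in.mds_rank == rank && in.dirty) {
      check_caps(in, CHECK_CAPS_NODELAY);
      ++flushed;
    }
  }
  return flushed;
}
```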
+1 for flushing the dirty caps, but I was trying to come up with something to make it a sure shot and not just a "try". I understand that the clients can send requests to the MDS when it is stopping, but what happens if the client just sent a request and the MDS stopped? The timing of the events can be such that the dirty caps get dropped, right? So why not take care of the dropped caps by maintaining a queue and then syncing the metadata changes once the MDS is back up? Or do we already have such a mechanism specifically for handling this STOPPING case?
I think the flushing_cap_tids is doing this. Let me check it again later. Thanks @dparmar18
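A toy model of the idea behind flushing_cap_tids: cap flushes that have been sent but not yet acked are remembered by tid, so anything still pending can be re-driven after an MDS failover instead of being lost. The structures below are simplified assumptions for illustration, not the real client code:

```cpp
#include <cassert>
#include <cstdint>
#include <map>
#include <set>
#include <string>

// Toy session: maps each in-flight flush tid to the inode it covers.
struct Session {
  std::map<uint64_t, std::string> flushing;  // tid -> inode name (stand-in)
  uint64_t next_tid = 1;
};

// Sending a cap flush records its tid until the MDS acks it.
uint64_t send_flush(Session& s, const std::string& ino) {
  uint64_t tid = s.next_tid++;
  s.flushing[tid] = ino;
  return tid;
}

// An ack means the flush is durable; forget the tid.
void handle_flush_ack(Session& s, uint64_t tid) {
  s.flushing.erase(tid);
}

// On failover, everything still in `flushing` must be resent to the new MDS.
std::set<std::string> resend_after_failover(const Session& s) {
  std::set<std::string> todo;
  for (const auto& [tid, ino] : s.flushing)
    todo.insert(ino);
  return todo;
}
```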
When stopping an MDS, the mdsmap will mark it as stopping; try to avoid choosing the stopping MDSs when making new requests. Fixes: https://tracker.ceph.com/issues/61914 Signed-off-by: Xiubo Li <xiubli@redhat.com>
Trigger a flush of the dirty caps when the MDS is stopping, or the caps could be dropped if the MDS stopped before the client could receive the mdsmap. Fixes: https://tracker.ceph.com/issues/61914 Signed-off-by: Xiubo Li <xiubli@redhat.com>
This pull request has been automatically marked as stale because it has not had any activity for 60 days. It will be closed if no further activity occurs for another 30 days.
When stopping an MDS, the mdsmap will mark it as stopping. Try to avoid choosing the stopping MDSs when making new requests, and trigger a flush of the dirty caps when the MDS is stopping, or the caps could be dropped if the MDS stopped before the client could receive the mdsmap.

Fixes: https://tracker.ceph.com/issues/61914
Contribution Guidelines
To sign and title your commits, please refer to Submitting Patches to Ceph.
If you are submitting a fix for a stable branch (e.g. "pacific"), please refer to Submitting Patches to Ceph - Backports for the proper workflow.
Checklist
Show available Jenkins commands
jenkins retest this please
jenkins test classic perf
jenkins test crimson perf
jenkins test signed
jenkins test make check
jenkins test make check arm64
jenkins test submodules
jenkins test dashboard
jenkins test dashboard cephadm
jenkins test api
jenkins test docs
jenkins render docs
jenkins test ceph-volume all
jenkins test ceph-volume tox
jenkins test windows