Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

client: improve the libcephfs when MDS is stopping #52336

Open
wants to merge 2 commits into
base: main
Choose a base branch
from

Conversation

lxbsz
Copy link
Member

@lxbsz lxbsz commented Jul 6, 2023

When stopping an MDS the mdsmap will mark it as stopping, try to avoid choose the stopping MDSs when making new requests and trigger to flush the dirty caps when the MDS is stopping, or the caps could be dropped if the MDS stopped before the client could receiving the mdsmap.

Fixes: https://tracker.ceph.com/issues/61914

Contribution Guidelines

Checklist

  • Tracker (select at least one)
    • References tracker ticket
    • Very recent bug; references commit where it was introduced
    • New feature (ticket optional)
    • Doc update (no ticket needed)
    • Code cleanup (no ticket needed)
  • Component impact
    • Affects Dashboard, opened tracker ticket
    • Affects Orchestrator, opened tracker ticket
    • No impact that needs to be tracked
  • Documentation (select at least one)
    • Updates relevant documentation
    • No doc update is appropriate
  • Tests (select at least one)
Show available Jenkins commands
  • jenkins retest this please
  • jenkins test classic perf
  • jenkins test crimson perf
  • jenkins test signed
  • jenkins test make check
  • jenkins test make check arm64
  • jenkins test submodules
  • jenkins test dashboard
  • jenkins test dashboard cephadm
  • jenkins test api
  • jenkins test docs
  • jenkins render docs
  • jenkins test ceph-volume all
  • jenkins test ceph-volume tox
  • jenkins test windows

@lxbsz lxbsz requested a review from a team July 6, 2023 09:03
@github-actions github-actions bot added the cephfs Ceph File System label Jul 6, 2023
@lxbsz lxbsz force-pushed the wip-61914 branch 2 times, most recently from b9434d1 to 604789e Compare July 6, 2023 09:48
src/client/Client.cc Outdated Show resolved Hide resolved
src/client/Client.cc Outdated Show resolved Hide resolved
@batrick
Copy link
Member

batrick commented Jul 6, 2023

This is a good change, thanks Xiubo!

in->delay_cap_item.remove_myself();
check_caps(in, CHECK_CAPS_NODELAY);
}
}
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't understand at all the purpose of flushing the caps here. We can never guarantee they will get flushed before the MDS stops, and something else should take over for it, so why bother doing something that adds load to a daemon right when it's trying to shed load?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This also could't guarantee that, just trying to flush the dirty caps to MDS before the MDS stops.

I just saw that when the MDS or client get the first mdsmap marking the one MDS is up:stopping the client continued sending client requests and cap update requests to it. Which last around 20 seconds. I think this could make the MDS stopping process last longer.

That means during these 20 seconds the dirty caps also possibly will be flushed by the tick thread. I am thinking why not trigger it as earlier as possible to speed up it ?

Makes sense ?

Copy link

github-actions bot commented Nov 9, 2023

This pull request has been automatically marked as stale because it has not had any activity for 60 days. It will be closed if no further activity occurs for another 30 days.
If you are a maintainer or core committer, please follow-up on this pull request to identify what steps should be taken by the author to move this proposed change forward.
If you are the author of this pull request, thank you for your proposed contribution. If you believe this change is still appropriate, please ensure that any feedback has been addressed and ask for a code review.

@github-actions github-actions bot added the stale label Nov 9, 2023
@lxbsz lxbsz removed the stale label Nov 30, 2023
Copy link

This pull request has been automatically marked as stale because it has not had any activity for 60 days. It will be closed if no further activity occurs for another 30 days.
If you are a maintainer or core committer, please follow-up on this pull request to identify what steps should be taken by the author to move this proposed change forward.
If you are the author of this pull request, thank you for your proposed contribution. If you believe this change is still appropriate, please ensure that any feedback has been addressed and ask for a code review.

@github-actions github-actions bot added the stale label Jan 29, 2024
Copy link

This pull request has been automatically closed because there has been no activity for 90 days. Please feel free to reopen this pull request (or open a new one) if the proposed change is still appropriate. Thank you for your contribution!

@github-actions github-actions bot closed this Feb 28, 2024
@lxbsz lxbsz reopened this Feb 29, 2024
@github-actions github-actions bot removed the stale label Feb 29, 2024
@@ -3091,7 +3091,19 @@ void Client::handle_mds_map(const MConstRef<MMDSMap>& m)
continue;
}
if (newstate >= MDSMap::STATE_ACTIVE) {
if (oldstate < MDSMap::STATE_ACTIVE) {
if (oldstate <= MDSMap::STATE_ACTIVE) {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

STATE_ACTIVE is 13 while STATE_STOPPING is 14. So we're skipping STATE_STOPPING it seems.

Copy link
Contributor

@dparmar18 dparmar18 Mar 6, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

also if we're concerned with the stopping/stopped states then why not check them explicitly? and as greg mentioned in one of the comments - why flush the caps when there is no guarantee of them being actually flushed, instead how about creating a queue of the requests and sending them to the MDS when the MDS failover takes place? so that clients can continue with the data ops while they can survive with the cached metadata till the time mds comes alive, makes sense?

Copy link
Member Author

@lxbsz lxbsz Mar 7, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actually this should be in non-recover case and the corresponding MDS is just try to stop and set the state to STOPPING.

Please note the STOPPING state do not mean it won't accept the new client requests. And the MDS will report and request STOPPED state to Monitor just after it having shut down cleanly. More detail please see Client::make_request() and void MDSRank::stopping_done().

Before the MDS being shut down cleanly all the sessions will be closed, and then in client side when the sessions are closed the clients will drop the dirty caps. So while in STOPPING state and just before the sessions are closed why not try to flush the dirty caps ?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

+1 for flushing the dirty caps but I was trying to come up with something to make it a sure shot and not a "try". I understand that the clients can send requests to the MDS when it is stopping but what happens if the client just sent a request and the MDS stopped? The timing of the events can be such that the dirty caps can be dropped, right? So why not take care of the dropped caps by maintaining a queue and then syncing the metadata changes once the MDS is back up? Or do we already have such mechanism specially for handling this STOPPING case?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think the flushing_cap_tids is doing this. Let me check it again later. Thanks @dparmar18

When stopping an MDS the mdsmap will mark it as stopping, try to
avoid choose the stopping MDSs when making new requests.

Fixes: https://tracker.ceph.com/issues/61914
Signed-off-by: Xiubo Li <xiubli@redhat.com>
Trigger to flush the dirty caps when the MDS is stopping, or the
caps could be dropped if the MDS stopped before the client could
receiving the mdsmap.

Fixes: https://tracker.ceph.com/issues/61914
Signed-off-by: Xiubo Li <xiubli@redhat.com>
Copy link

github-actions bot commented May 6, 2024

This pull request has been automatically marked as stale because it has not had any activity for 60 days. It will be closed if no further activity occurs for another 30 days.
If you are a maintainer or core committer, please follow-up on this pull request to identify what steps should be taken by the author to move this proposed change forward.
If you are the author of this pull request, thank you for your proposed contribution. If you believe this change is still appropriate, please ensure that any feedback has been addressed and ask for a code review.

@github-actions github-actions bot added the stale label May 6, 2024
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
cephfs Ceph File System stale
Projects
None yet
4 participants