
mds: switch submit_mutex to fair mutex for MDLog #44215

Closed
wants to merge 1 commit into from

Conversation

lxbsz
Member

@lxbsz lxbsz commented Dec 6, 2021

Mutex implementations (e.g. std::mutex in C++) do not guarantee
fairness: they do not guarantee that the lock will be acquired by
threads in the order that they called lock().

In most cases this works well, but under overload the client request
handling thread and _submit_thread can keep reacquiring the
submit_mutex for a long time, which can leave MDLog::trim() stuck.
The MDS daemons then keep writing journal entries into the metadata
pool but cannot trim the expired segments in time.

Switch the submit_mutex to a fair mutex, which guarantees that all
submit_mutex waiters are served in FIFO order and each gets a chance
to run in time.

Fixes: https://tracker.ceph.com/issues/40002
Signed-off-by: Xiubo Li <xiubli@redhat.com>
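
For illustration, here is a minimal, hypothetical sketch of a ticket-style fair mutex (the class name is made up; this is not the actual fair mutex implementation in the Ceph tree). It only demonstrates how FIFO hand-off between lock() callers can be built on top of std::mutex and std::condition_variable:

```cpp
// Hypothetical sketch: a ticket-style fair mutex that serves lock() callers
// strictly in FIFO order. Illustration only; the real fair mutex used by the
// MDS differs in detail.
#include <condition_variable>
#include <cstdint>
#include <mutex>

class fair_mutex_sketch {
public:
  void lock() {
    std::unique_lock<std::mutex> l(mtx_);
    const uint64_t my_ticket = next_ticket_++;   // join the FIFO queue
    cond_.wait(l, [&] { return my_ticket == now_serving_; });
  }

  void unlock() {
    std::lock_guard<std::mutex> l(mtx_);
    ++now_serving_;                              // hand the lock to the next waiter
    cond_.notify_all();
  }

private:
  std::mutex mtx_;                // protects the ticket counters only
  std::condition_variable cond_;
  uint64_t next_ticket_ = 0;      // next ticket to hand out
  uint64_t now_serving_ = 0;      // ticket currently allowed to hold the lock
};
```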

Checklist

  • Tracker (select at least one)
    • References tracker ticket
    • Very recent bug; references commit where it was introduced
    • New feature (ticket optional)
    • Doc update (no ticket needed)
    • Code cleanup (no ticket needed)
  • Component impact
    • Affects Dashboard, opened tracker ticket
    • Affects Orchestrator, opened tracker ticket
    • No impact that needs to be tracked
  • Documentation (select at least one)
    • Updates relevant documentation
    • No doc update is appropriate
  • Tests (select at least one)
Show available Jenkins commands
  • jenkins retest this please
  • jenkins test classic perf
  • jenkins test crimson perf
  • jenkins test signed
  • jenkins test make check
  • jenkins test make check arm64
  • jenkins test submodules
  • jenkins test dashboard
  • jenkins test dashboard cephadm
  • jenkins test api
  • jenkins test docs
  • jenkins render docs
  • jenkins test ceph-volume all
  • jenkins test ceph-volume tox

@lxbsz lxbsz requested a review from a team December 6, 2021 01:51
@github-actions github-actions bot added the cephfs Ceph File System label Dec 6, 2021
Contributor

@vshankar vshankar left a comment

LGTM - we probably need to put this to test under load.

@lxbsz
Member Author

lxbsz commented Dec 8, 2021

> LGTM - we probably need to put this to test under load.

Sounds reasonable. I think we also need to test the mds_lock under heavy load too. I can do that later in a separate PR. Is that okay?

@vshankar
Contributor

vshankar commented Dec 8, 2021

> Sounds reasonable. I think we also need to test the mds_lock under heavy load too. I can do that later in a separate PR. Is that okay?

Sure.

@lxbsz
Member Author

lxbsz commented Dec 8, 2021

> Sounds reasonable. I think we also need to test the mds_lock under heavy load too. I can do that later in a separate PR. Is that okay?
>
> Sure.

I have raised one tracker about this: https://tracker.ceph.com/issues/53520

@batrick
Member

batrick commented Dec 9, 2021

I'm not confident this is fixing the real problem. I think it may be more likely the MDS finisher was starved which deferred expiration of segments (e.g. BatchCommitBacktrace). Your mds_lock fix for fair mutex may also incidentally fix the #40002 tracker ticket.

Do you have a test case showing this fix resolves the original issue?

@lxbsz
Member Author

lxbsz commented Dec 10, 2021

> I'm not confident this is fixing the real problem. I think it may be more likely the MDS finisher was starved which deferred expiration of segments (e.g. BatchCommitBacktrace). Your mds_lock fix for fair mutex may also incidentally fix the #40002 tracker ticket.

Yeah, it seems this will fix, or at least alleviate, some other issues like https://tracker.ceph.com/issues/53521 as well.

> Do you have a test case showing this fix resolves the original issue?

Not yet. I was just using a simple test script that creates and deletes dirs and files in loops in two terminals, which sends a lot of client requests, while a third terminal kept doing heavy reads/writes at the same time. I could see that mdlog->trim() would sometimes get stuck for a while waiting on the submit_mutex, just like what I saw when fixing the mds_lock issue before.

With this fix I haven't seen that yet.

@lxbsz
Member Author

lxbsz commented Dec 10, 2021

> I'm not confident this is fixing the real problem. I think it may be more likely the MDS finisher was starved which deferred expiration of segments (e.g. BatchCommitBacktrace). Your mds_lock fix for fair mutex may also incidentally fix the #40002 tracker ticket.

BatchCommitBacktrace was only added in Pacific, while both of the heavy-load issues were seen on ceph version 14.2.X.

@vshankar
Contributor

vshankar commented Dec 10, 2021

> I'm not confident this is fixing the real problem. I think it may be more likely the MDS finisher was starved which deferred expiration of segments (e.g. BatchCommitBacktrace). Your mds_lock fix for fair mutex may also incidentally fix the #40002 tracker ticket.

That's the reason this PR is not "approved", although the switch to using a fair mutex does look reasonable to me. We'll only know if log trimming does not get stuck when we load test this change.

@lxbsz
Member Author

lxbsz commented Dec 10, 2021

> I'm not confident this is fixing the real problem. I think it may be more likely the MDS finisher was starved which deferred expiration of segments (e.g. BatchCommitBacktrace). Your mds_lock fix for fair mutex may also incidentally fix the #40002 tracker ticket.

> That's the reason this PR is not "approved", although the switch to using a fair mutex does look reasonable to me. We'll only know if log trimming does not get stuck when we load test this change.

Yeah. As we know, mdlog->trim(), which is called from the MDS tick() thread, tries to acquire the submit_mutex before trimming the expired segments, and each client request also tries to acquire it when submitting a journal entry.

In theory, if a large number of client requests are queued in the dispatch thread, with a non-fair mutex the submit_mutex could be acquired by MDLog's _submit_entry() most of the time, and by _submit_thread() occasionally, while mdlog->trim() might not get it for a long time, such as a minute or longer.

Then I'm sure we would see the same issue as with the mds_lock: mdlog->trim() gets stuck and the metadata pool's usage could increase a lot.
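
To make that scenario concrete, here is a hypothetical, simplified simulation (the function and variable names are stand-ins, not the real MDS code): two hot threads re-acquire a mutex in a tight loop, playing the role of the request-handling path and _submit_thread(), while a third thread, playing the role of tick() -> mdlog->trim(), measures how long it waits for the lock once per second. std::mutex gives that third thread no fairness guarantee, so its wait is unbounded in principle; a FIFO fair mutex bounds its wait by the waiters already queued ahead of it.

```cpp
// Hypothetical illustration of the contention described above; names are
// simplified stand-ins, not the real MDS code.
#include <atomic>
#include <chrono>
#include <iostream>
#include <mutex>
#include <thread>

std::mutex submit_mutex;         // swap in a fair mutex to bound the trim wait
std::atomic<bool> stop{false};

// Stands in for the request-handling path / _submit_thread(): grabs the
// lock over and over with almost no gap in between.
void hot_path() {
  while (!stop.load()) {
    std::lock_guard<std::mutex> l(submit_mutex);
    // journal submission work would happen here
  }
}

// Stands in for tick() -> mdlog->trim(): needs the lock once per second.
void trim_tick() {
  for (int i = 0; i < 5; ++i) {
    const auto t0 = std::chrono::steady_clock::now();
    {
      std::lock_guard<std::mutex> l(submit_mutex);  // where trim() can starve
    }
    const auto waited = std::chrono::steady_clock::now() - t0;
    std::cout << "trim waited "
              << std::chrono::duration_cast<std::chrono::milliseconds>(waited).count()
              << " ms\n";
    std::this_thread::sleep_for(std::chrono::seconds(1));
  }
  stop = true;
}

int main() {
  std::thread a(hot_path), b(hot_path), c(trim_tick);
  a.join();
  b.join();
  c.join();
  return 0;
}
```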

@lxbsz
Member Author

lxbsz commented Dec 15, 2021

> I'm not confident this is fixing the real problem. I think it may be more likely the MDS finisher was starved which deferred expiration of segments (e.g. BatchCommitBacktrace). Your mds_lock fix for fair mutex may also incidentally fix the #40002 tracker ticket.
>
> Do you have a test case showing this fix resolves the original issue?

@batrick

With #44180, this really could resolve the problem. For more detail, please see https://tracker.ceph.com/issues/40002#note-14.

@lxbsz
Member Author

lxbsz commented Dec 16, 2021

I have moved this fix into #44180, since there is a dependency between them.

@lxbsz lxbsz closed this Dec 16, 2021
Labels
bug-fix cephfs Ceph File System
Projects
None yet
3 participants