
mds: switch submit_mutex to fair mutex for MDLog #44215

Closed
wants to merge 1 commit into from

Conversation

lxbsz
Member

@lxbsz lxbsz commented Dec 6, 2021

Mutex implementations (e.g. std::mutex in C++) do not guarantee
fairness: they do not guarantee that the lock will be acquired by
threads in the order that they called lock().

In most cases this works well, but under overload the client request
handling thread and _submit_thread can keep reacquiring the
submit_mutex for a long time, which can leave MDLog::trim() stuck.
The MDS daemons then keep writing journal entries into the metadata
pool but cannot trim the expired segments in time.

Switch the submit_mutex to a fair mutex, which guarantees that all
submit_mutex waiters are served in FIFO order and each gets a chance
to run in time.

Fixes: https://tracker.ceph.com/issues/40002
Signed-off-by: Xiubo Li <xiubli@redhat.com>
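
For illustration, here is a minimal, hypothetical sketch of a ticket-style fair mutex (the class name is made up; this is not the actual fair mutex implementation in the Ceph tree). It only demonstrates how FIFO hand-off between lock() callers can be built on top of std::mutex and std::condition_variable:

```cpp
// Hypothetical sketch: a ticket-style fair mutex that serves lock() callers
// strictly in FIFO order. Illustration only; the real fair mutex used by the
// MDS differs in detail.
#include <condition_variable>
#include <cstdint>
#include <mutex>

class fair_mutex_sketch {
public:
  void lock() {
    std::unique_lock<std::mutex> l(mtx_);
    const uint64_t my_ticket = next_ticket_++;   // join the FIFO queue
    cond_.wait(l, [&] { return my_ticket == now_serving_; });
  }

  void unlock() {
    std::lock_guard<std::mutex> l(mtx_);
    ++now_serving_;                              // hand the lock to the next waiter
    cond_.notify_all();
  }

private:
  std::mutex mtx_;                // protects the ticket counters only
  std::condition_variable cond_;
  uint64_t next_ticket_ = 0;      // next ticket to hand out
  uint64_t now_serving_ = 0;      // ticket currently allowed to hold the lock
};
```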

Checklist

  • Tracker (select at least one)
    • References tracker ticket
    • Very recent bug; references commit where it was introduced
    • New feature (ticket optional)
    • Doc update (no ticket needed)
    • Code cleanup (no ticket needed)
  • Component impact
    • Affects Dashboard, opened tracker ticket
    • Affects Orchestrator, opened tracker ticket
    • No impact that needs to be tracked
  • Documentation (select at least one)
    • Updates relevant documentation
    • No doc update is appropriate
  • Tests (select at least one)
Show available Jenkins commands
  • jenkins retest this please
  • jenkins test classic perf
  • jenkins test crimson perf
  • jenkins test signed
  • jenkins test make check
  • jenkins test make check arm64
  • jenkins test submodules
  • jenkins test dashboard
  • jenkins test dashboard cephadm
  • jenkins test api
  • jenkins test docs
  • jenkins render docs
  • jenkins test ceph-volume all
  • jenkins test ceph-volume tox

@lxbsz lxbsz requested a review from a team December 6, 2021 01:51
@github-actions github-actions bot added the cephfs Ceph File System label Dec 6, 2021
Contributor

@vshankar vshankar left a comment

LGTM - we probably need to put this to test under load.

@lxbsz
Member Author

lxbsz commented Dec 8, 2021

> LGTM - we probably need to put this to test under load.

Sounds reasonable. I think we also need to test the mds_lock under heavy load too. I can do that later in a separate PR. Is that okay?

@vshankar
Contributor

vshankar commented Dec 8, 2021

> Sounds reasonable. I think we also need to test the mds_lock under heavy load too. I can do that later in a separate PR. Is that okay?

Sure.

@lxbsz
Member Author

lxbsz commented Dec 8, 2021

> Sounds reasonable. I think we also need to test the mds_lock under heavy load too. I can do that later in a separate PR. Is that okay?
>
> Sure.

I have raised one tracker about this: https://tracker.ceph.com/issues/53520

@batrick
Member

batrick commented Dec 9, 2021

I'm not confident this is fixing the real problem. I think it may be more likely the MDS finisher was starved which deferred expiration of segments (e.g. BatchCommitBacktrace). Your mds_lock fix for fair mutex may also incidentally fix the #40002 tracker ticket.

Do you have a test case showing this fix resolves the original issue?

@lxbsz
Member Author

lxbsz commented Dec 10, 2021

> I'm not confident this is fixing the real problem. I think it may be more likely the MDS finisher was starved which deferred expiration of segments (e.g. BatchCommitBacktrace). Your mds_lock fix for fair mutex may also incidentally fix the #40002 tracker ticket.

Yeah, it seems this will fix, or at least alleviate, some other issues like https://tracker.ceph.com/issues/53521 as well.

> Do you have a test case showing this fix resolves the original issue?

Not yet. I was just using a simple test script that creates and deletes dirs and files in loops in two terminals, which sends a lot of client requests, while a third terminal kept doing heavy reads/writes at the same time. I could see that mdlog->trim() would sometimes get stuck for a while waiting on the submit_mutex, just like what I saw when fixing the mds_lock issue before.

With this fix I haven't seen that yet.

@lxbsz
Member Author

lxbsz commented Dec 10, 2021

> I'm not confident this is fixing the real problem. I think it may be more likely the MDS finisher was starved which deferred expiration of segments (e.g. BatchCommitBacktrace). Your mds_lock fix for fair mutex may also incidentally fix the #40002 tracker ticket.

BatchCommitBacktrace was only added in Pacific, while both of the heavy-load issues were seen on ceph version 14.2.X.

@vshankar
Contributor

vshankar commented Dec 10, 2021

> I'm not confident this is fixing the real problem. I think it may be more likely the MDS finisher was starved which deferred expiration of segments (e.g. BatchCommitBacktrace). Your mds_lock fix for fair mutex may also incidentally fix the #40002 tracker ticket.

That's the reason this PR is not "approved", although the switch to using a fair mutex does look reasonable to me. We'll only know if log trimming does not get stuck when we load test this change.

@lxbsz
Member Author

lxbsz commented Dec 10, 2021

> I'm not confident this is fixing the real problem. I think it may be more likely the MDS finisher was starved which deferred expiration of segments (e.g. BatchCommitBacktrace). Your mds_lock fix for fair mutex may also incidentally fix the #40002 tracker ticket.

> That's the reason this PR is not "approved", although the switch to using a fair mutex does look reasonable to me. We'll only know if log trimming does not get stuck when we load test this change.

Yeah. As we know, mdlog->trim(), which is called from the MDS tick() thread, tries to acquire the submit_mutex before trimming the expired segments, and each client request also tries to acquire it when submitting a journal entry.

In theory, if a large number of client requests are queued in the dispatch thread, with a non-fair mutex the submit_mutex could be acquired by MDLog's _submit_entry() most of the time, and by _submit_thread() occasionally, while mdlog->trim() might not get it for a long time, such as a minute or longer.

Then I'm sure we would see the same issue as with the mds_lock: mdlog->trim() gets stuck and the metadata pool's usage could increase a lot.
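
To make that scenario concrete, here is a hypothetical, simplified simulation (the function and variable names are stand-ins, not the real MDS code): two hot threads re-acquire a mutex in a tight loop, playing the role of the request-handling path and _submit_thread(), while a third thread, playing the role of tick() -> mdlog->trim(), measures how long it waits for the lock once per second. std::mutex gives that third thread no fairness guarantee, so its wait is unbounded in principle; a FIFO fair mutex bounds its wait by the waiters already queued ahead of it.

```cpp
// Hypothetical illustration of the contention described above; names are
// simplified stand-ins, not the real MDS code.
#include <atomic>
#include <chrono>
#include <iostream>
#include <mutex>
#include <thread>

std::mutex submit_mutex;         // swap in a fair mutex to bound the trim wait
std::atomic<bool> stop{false};

// Stands in for the request-handling path / _submit_thread(): grabs the
// lock over and over with almost no gap in between.
void hot_path() {
  while (!stop.load()) {
    std::lock_guard<std::mutex> l(submit_mutex);
    // journal submission work would happen here
  }
}

// Stands in for tick() -> mdlog->trim(): needs the lock once per second.
void trim_tick() {
  for (int i = 0; i < 5; ++i) {
    const auto t0 = std::chrono::steady_clock::now();
    {
      std::lock_guard<std::mutex> l(submit_mutex);  // where trim() can starve
    }
    const auto waited = std::chrono::steady_clock::now() - t0;
    std::cout << "trim waited "
              << std::chrono::duration_cast<std::chrono::milliseconds>(waited).count()
              << " ms\n";
    std::this_thread::sleep_for(std::chrono::seconds(1));
  }
  stop = true;
}

int main() {
  std::thread a(hot_path), b(hot_path), c(trim_tick);
  a.join();
  b.join();
  c.join();
  return 0;
}
```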

@lxbsz
Member Author

lxbsz commented Dec 15, 2021

> I'm not confident this is fixing the real problem. I think it may be more likely the MDS finisher was starved which deferred expiration of segments (e.g. BatchCommitBacktrace). Your mds_lock fix for fair mutex may also incidentally fix the #40002 tracker ticket.
>
> Do you have a test case showing this fix resolves the original issue?

@batrick

With #44180, this really could resolve the problem. For more detail, please see https://tracker.ceph.com/issues/40002#note-14.

@lxbsz
Member Author

lxbsz commented Dec 16, 2021

I have moved this fix into #44180, since there is a dependency between them.

@lxbsz lxbsz closed this Dec 16, 2021
Labels
bug-fix cephfs Ceph File System
Projects
None yet
3 participants