mds: reset heartbeat map at potential time-consuming places #22999
Conversation
src/mds/MDSContext.cc
Outdated
@@ -28,6 +28,9 @@ void MDSInternalContextBase::complete(int r) {
   assert(mds != NULL);
   assert(mds->mds_lock.is_locked_by_me());
   MDSContext::complete(r);
+
+  // avoid unhealthy heartbeat map when finish_contexts() processes large context list
+  mds->heartbeat_reset();
A large context list is made up of delayed messages (caused by a frozen subtree). How large it gets depends on how many messages clients sent while the subtree was frozen. These messages need to be processed before dispatching new messages (to guarantee message order). The MDS can't directly control how large the context list is (it can control it indirectly by reducing max_export_size). I can't figure out a better approach.
Thanks for explaining. What concerns me is that this hides the problem: the MDS is stuck processing a large context list and not doing anything else. This is precisely the type of scenario the heartbeat mechanism is designed to catch!
How about reducing the max_export_size? 1G and even 100M sounds too large. What are the drawbacks to breaking the exports into smaller chunks, like 10M?
10M is about 5000 inodes (roughly 2 KB per inode by that estimate); that's too small. Besides, we need to change the code that chooses the export candidate: it's too inefficient to freeze a large subtree but only export a small portion of it.
Okay.
src/mds/Locker.cc
Outdated
@@ -2737,7 +2737,7 @@ void Locker::handle_client_caps(MClientCaps *m)
     mdcache->wait_replay_cap_reconnect(m->get_ino(), new C_MDS_RetryMessage(mds, m));
     return;
   }
-  dout(1) << "handle_client_caps on unknown ino " << m->get_ino() << ", dropping" << dendl;
+  dout(7) << "handle_client_caps on unknown ino " << m->get_ino() << ", dropping" << dendl;
This also needs to be a separate PR, and the commit needs to be amended to note it fixes the issue. Sorry to be a PITA, Zheng, but this makes backporting easier and helps me get these fixes expedited.
src/mds/OpenFileTable.cc
Outdated
for (auto dir : fetch_queue) {
  if (dir->state_test(CDir::STATE_REJOINUNDEF))
    assert(dir->get_inode()->dirfragtree.is_leaf(dir->get_frag()));
  dir->fetch(gather.new_sub());

  if (!(++num_opening_dirfrags % 100))
    mds->heartbeat_reset();
Note: these don't bother me because we don't expect a non-active MDS to be responsive to clients.
But if a recovering MDS does not send beacons to the monitor, the monitor may replace it with a standby MDS.
Right, I'm saying that doing the heartbeat reset here makes sense.
Force-pushed from 98708f5 to 42d9b24.
Most changes are moved to #23088.
Otherwise LGTM.
src/mds/Migrator.cc
Outdated
std::list<MDSInternalContextBase*> contexts;
C_MDC_QueueContext(Migrator *m) : MigratorContext(m) {}
void finish(int r) override {
  get_mds()->queue_waiters_front(contexts);
Why do these need queued at the front?
This will require that MDSRank::finished_queue be a std::deque, which is not preferred to std::vector (in #23195). So, is it actually important to queue at the front?
Yes, it's important, to make sure contexts get processed in the proper order.
Please document an example.
src/mds/Migrator.cc
Outdated
  }
};

Dead code? Separate commit and PR please. (leaving this in here will mess up backports.)
src/mds/Migrator.cc
Outdated
@@ -1665,7 +1654,7 @@ uint64_t Migrator::encode_export_dir(bufferlist& exportbl,

 void Migrator::finish_export_dir(CDir *dir, mds_rank_t peer,
                                  map<inodeno_t,map<client_t,Capability::Import> >& peer_imported,
-                                 list<MDSInternalContextBase*>& finished, int *num_dentries)
+                                 list<MDSInternalContextBase*>& finished, unsigned& num_dentries)
This isn't relevant to this PR. Needs to be a separate refactor PR that won't be backported.
Force-pushed from d5c7a40 to 744d5b0.
@ukernel please rebase.
Force-pushed from 744d5b0 to 666c302.
Signed-off-by: "Yan, Zheng" <zyan@redhat.com>
Force-pushed from 6c85c8d to 14b71c2.
dout(7) << "mds has " << finished_queue.size() << " queued contexts" << dendl;
dout(10) << finished_queue << dendl;
decltype(finished_queue) ls;
ls.swap(finished_queue);
I think it would be preferable to keep ls.swap here so that any buggy context which adds to the finished_queue doesn't cause an infinite loop.
'auto fin = finished_queue.front()' is required by queue_waiters_front(). If a buggy context keeps adding another buggy context to finished_queue, it's an infinite loop regardless of whether we keep "ls.swap" or not.
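For reference, a minimal sketch (illustrative types and names, not the actual MDSRank code) of the swap-then-drain pattern being discussed: contexts queued while the current batch is being completed are left for the next pass, so they cannot extend the current loop indefinitely.

// Sketch of the swap-then-drain pattern; the element type and function name
// are assumptions for illustration, not the MDS implementation.
#include <deque>
#include <functional>

std::deque<std::function<void()>> finished_queue;

void drain_finished_queue() {
  decltype(finished_queue) ls;
  ls.swap(finished_queue);   // take the current batch, leaving the queue empty
  for (auto& fin : ls)
    fin();                   // a context may queue new entries onto finished_queue
  // entries added during this pass stay queued until the next drain
}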
src/mds/MDSRank.h
Outdated
void queue_waiters_front(MDSInternalContextBase::vec& ls) {
  MDSInternalContextBase::vec v;
  v.swap(ls);
  std::copy(v.rbegin(), v.rend(), std::front_inserter(finished_queue));
Wow. Okay, this needs a comment explaining what situation this is addressing. You're basically reversing the context vector and putting that at the front of the deque? This is not what one glancing at queue_waiters_front would expect, and it's easy to miss!
Contexts in 'ls' do not get reversed. If 'ls' and 'finished_queue' were lists, 'std::copy(v.rbegin(), v.rend(), std::front_inserter(finished_queue))' would be equivalent to 'finished_queue.splice(finished_queue.begin(), ls)'.
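For illustration, a tiny self-contained example (assumed values, not the MDS code) showing that copying the reversed range through std::front_inserter puts the elements at the front in their original order, just as a list splice at the front would:

#include <algorithm>
#include <cassert>
#include <deque>
#include <iterator>
#include <vector>

int main() {
  std::deque<int> finished_queue = {4, 5, 6};
  std::vector<int> ls = {1, 2, 3};

  // push_front(3), then push_front(2), then push_front(1)
  std::copy(ls.rbegin(), ls.rend(), std::front_inserter(finished_queue));

  assert((finished_queue == std::deque<int>{1, 2, 3, 4, 5, 6}));
  return 0;
}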
Okay, got it.
Force-pushed from 14b71c2 to 187114a.
And it also needs a tracker ticket for backport.
Signed-off-by: Yan, Zheng <zyan@redhat.com> Fixes: http://tracker.ceph.com/issues/26858
Session renew messages from clients can be in the dispatch queue, waiting to be dispatched. Signed-off-by: "Yan, Zheng" <zyan@redhat.com>
Force-pushed from 187114a to 2ca5708.
* refs/pull/22999/head: mds: consider max age of dispatch queue when finding stale client mds: reset heartbeat map at potential time-consuming places mds: change MDSRank::finished_queue to deque Reviewed-by: Patrick Donnelly <pdonnell@redhat.com>
Just realized after merging this that this PR may cause http://tracker.ceph.com/issues/26869. Zheng, can you take a look?
No description provided.