
mds: reset heartbeat map at potential time-consuming places #22999

Merged: 3 commits merged into ceph:master from ukernel's wip-mds-heartbeat branch on Aug 6, 2018

Conversation

ukernel (Contributor) commented Jul 12, 2018

No description provided.

ukernel added the bug-fix and cephfs (Ceph File System) labels on Jul 12, 2018
@@ -28,6 +28,9 @@ void MDSInternalContextBase::complete(int r) {
   assert(mds != NULL);
   assert(mds->mds_lock.is_locked_by_me());
   MDSContext::complete(r);
+
+  // avoid unhealthy heartbeat map when finish_contexts() processes large context list
+  mds->heartbeat_reset();
ukernel (Contributor Author):

A large context list consists of delayed messages (caused by a frozen subtree). How large it gets depends on how many messages clients sent while the subtree was frozen. These messages need to be processed before dispatching new messages (to guarantee message ordering). The MDS can't directly control how large the context list is (it can control it indirectly by reducing max_export_size). I can't figure out a better approach.

batrick (Member):
Thanks for explaining. What concerns me is that this hides the problem: the MDS is stuck processing a large context list and not doing anything else. This is precisely the type of scenario the heartbeat mechanism is designed to catch!

How about reducing max_export_size? 1G and even 100M sound too large. What are the drawbacks to breaking the exports into smaller chunks like 10M?

ukernel (Contributor Author):

10M is about 5000 inodes; that's too small. Besides, we would need to change the code that chooses the export candidate. It's too inefficient to freeze a large subtree but only export a small portion of it.
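(Rough arithmetic, taking that figure at face value: 10 MB for about 5000 inodes is roughly 2 KB of encoded export state per inode, so the 100 MB and 1 GB sizes mentioned above would correspond to roughly 50,000 and 500,000 inodes per export.)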

batrick (Member):
Okay.
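A minimal standalone sketch of the pattern the hunk above applies, for illustration only (not the actual MDS code; the Heartbeat struct is a hypothetical stand-in for the MDS heartbeat map): touch the heartbeat periodically while draining a long backlog of contexts so the watchdog still sees progress.

```cpp
#include <chrono>
#include <deque>
#include <functional>
#include <iostream>

// Hypothetical stand-in for the MDS heartbeat map: it just remembers the
// last time the worker proved it was still making progress.
struct Heartbeat {
  std::chrono::steady_clock::time_point last_reset = std::chrono::steady_clock::now();
  void reset() { last_reset = std::chrono::steady_clock::now(); }
};

int main() {
  Heartbeat hb;
  std::deque<std::function<void()>> finished_queue;

  // Simulate a large backlog of delayed contexts (e.g. messages that piled
  // up while a subtree was frozen).
  for (int i = 0; i < 100000; ++i)
    finished_queue.emplace_back([] { /* per-context work */ });

  // Drain the backlog. Without the periodic reset, a long drain would leave
  // the heartbeat stale and the daemon would be flagged as unhealthy.
  while (!finished_queue.empty()) {
    auto fin = std::move(finished_queue.front());
    finished_queue.pop_front();
    fin();
    hb.reset();  // analogous to the mds->heartbeat_reset() added in the hunk above
  }
  std::cout << "drained the backlog without starving the heartbeat\n";
}
```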

@@ -2737,7 +2737,7 @@ void Locker::handle_client_caps(MClientCaps *m)
     mdcache->wait_replay_cap_reconnect(m->get_ino(), new C_MDS_RetryMessage(mds, m));
     return;
   }
-  dout(1) << "handle_client_caps on unknown ino " << m->get_ino() << ", dropping" << dendl;
+  dout(7) << "handle_client_caps on unknown ino " << m->get_ino() << ", dropping" << dendl;
batrick (Member):
This also needs to be a separate PR, and the commit message needs to be amended to note that it fixes the issue. Sorry to be a PITA, Zheng, but this makes backporting easier and helps me get these fixes expedited.


for (auto dir : fetch_queue) {
if (dir->state_test(CDir::STATE_REJOINUNDEF))
assert(dir->get_inode()->dirfragtree.is_leaf(dir->get_frag()));
dir->fetch(gather.new_sub());

if (!(++num_opening_dirfrags % 100))
mds->heartbeat_reset();
batrick (Member):
Note: these don't bother me because we don't expect a non-active MDS to be responsive to clients.

ukernel (Contributor Author), Jul 12, 2018:

But if a recovering MDS does not send beacons to the monitor, the monitor may replace it with a standby MDS.

batrick (Member):
Right, I'm saying that doing the heartbeat reset here makes sense.

ukernel force-pushed the wip-mds-heartbeat branch 2 times, most recently from 98708f5 to 42d9b24, on July 17, 2018 07:17
ukernel (Contributor Author) commented Jul 17, 2018

Most of the changes have been moved to #23088.

batrick (Member) left a review comment:
Otherwise LGTM.

std::list<MDSInternalContextBase*> contexts;
C_MDC_QueueContext(Migrator *m) : MigratorContext(m) {}
void finish(int r) override {
get_mds()->queue_waiters_front(contexts);
batrick (Member):
Why do these need to be queued at the front?

batrick (Member):
This will require that MDSRank::finished_queue be a std::deque, which is not preferred over std::vector (see #23195). So, is it actually important to queue at the front?

ukernel (Contributor Author):
Yes, it's important, to make sure contexts get processed in the proper order.

batrick (Member):
Please document an example.

}
};


batrick (Member):
Dead code? Separate commit and PR please. (leaving this in here will mess up backports.)

@@ -1665,7 +1654,7 @@ uint64_t Migrator::encode_export_dir(bufferlist& exportbl,

 void Migrator::finish_export_dir(CDir *dir, mds_rank_t peer,
                                  map<inodeno_t,map<client_t,Capability::Import> >& peer_imported,
-                                 list<MDSInternalContextBase*>& finished, int *num_dentries)
+                                 list<MDSInternalContextBase*>& finished, unsigned& num_dentries)
batrick (Member):
This isn't relevant to this PR. Needs to be a separate refactor PR that won't be backported.

ukernel force-pushed the wip-mds-heartbeat branch 3 times, most recently from d5c7a40 to 744d5b0, on July 31, 2018 07:37
batrick (Member) commented Aug 1, 2018

@ukernel please rebase

Signed-off-by: "Yan, Zheng" <zyan@redhat.com>
ukernel force-pushed the wip-mds-heartbeat branch 2 times, most recently from 6c85c8d to 14b71c2, on August 2, 2018 00:24
dout(7) << "mds has " << finished_queue.size() << " queued contexts" << dendl;
dout(10) << finished_queue << dendl;
decltype(finished_queue) ls;
ls.swap(finished_queue);
batrick (Member):
I think it would be preferable to keep ls.swap here so that any buggy context which adds to the finished_queue doesn't cause an infinite loop.

ukernel (Contributor Author):
'auto fin = finished_queue.front()' is required by queue_waiters_front().

If a buggy context keeps adding another buggy context to finished_queue, it's an infinite loop whether or not we keep the ls.swap.
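A toy standalone example of the trade-off being discussed, for illustration only (not the MDS code; the Queue alias and drain functions are made up for this sketch): a swap-then-drain pass only processes the callbacks present when the pass started, while popping the live queue in place also processes whatever the callbacks themselves enqueue.

```cpp
#include <deque>
#include <functional>
#include <iostream>
#include <utility>

using Queue = std::deque<std::function<void()>>;

// Snapshot style: swap the queue out first. Callbacks enqueued during the
// drain wait for the next pass, so a single pass is always bounded.
void drain_snapshot(Queue& q) {
  Queue ls;
  ls.swap(q);
  for (auto& fin : ls)
    fin();
}

// In-place style: keep popping the live queue. Callbacks enqueued during
// the drain are processed in the same pass, so one pass can keep running
// for as long as contexts keep feeding it.
void drain_in_place(Queue& q) {
  while (!q.empty()) {
    auto fin = std::move(q.front());
    q.pop_front();
    fin();
  }
}

int main() {
  Queue q;
  // A callback that enqueues more work while the queue is being drained.
  q.push_back([&q] { q.push_back([] { std::cout << "ran the callback added mid-drain\n"; }); });

  drain_snapshot(q);  // leaves the newly added callback for the next pass
  std::cout << q.size() << " callback(s) deferred to the next pass\n";
  drain_in_place(q);  // processes it now
}
```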

void queue_waiters_front(MDSInternalContextBase::vec& ls) {
MDSInternalContextBase::vec v;
v.swap(ls);
std::copy(v.rbegin(), v.rend(), std::front_inserter(finished_queue));
batrick (Member):
Wow. Okay this needs a comment explaining what situation this is addressing. You're basically reversing the context vector and putting that at the front of the deque? This is not what one glancing at queue_waiters_front would expect and it's easy to miss!

ukernel (Contributor Author), Aug 3, 2018:
The contexts in 'ls' do not get reversed.

If 'ls' and 'finished_queue' were lists,
'std::copy(v.rbegin(), v.rend(), std::front_inserter(finished_queue))'
would be equivalent to
'finished_queue.splice(finished_queue.begin(), ls)'.

batrick (Member):
Okay, got it.
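A small standalone check of the equivalence claimed above (illustration only, not the MDS code): copying the waiter list in reverse through a std::front_inserter puts the waiters at the head of the deque in their original order, ahead of anything already queued.

```cpp
#include <algorithm>
#include <deque>
#include <iostream>
#include <iterator>
#include <vector>

int main() {
  std::deque<int> finished_queue = {10, 11, 12};  // work already queued
  std::vector<int> ls = {1, 2, 3};                // waiters to queue at the front

  // front_inserter pushes each element to the front, so walking the source
  // in reverse (3, 2, 1) leaves 1, 2, 3 at the head in their original order.
  std::copy(ls.rbegin(), ls.rend(), std::front_inserter(finished_queue));

  for (int x : finished_queue)
    std::cout << x << ' ';  // prints: 1 2 3 10 11 12
  std::cout << '\n';
}
```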

batrick (Member) commented Aug 4, 2018

And also needs a tracker ticket for backport.

Session renew messages from clients can be in the dispatch queue,
waiting to be dispatched.

Signed-off-by: "Yan, Zheng" <zyan@redhat.com>
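A rough standalone sketch of the idea in that commit message, as I read it (not the actual MDS implementation; session_is_stale and dispatch_queue_max_age are hypothetical names): when deciding whether a client session is stale, allow for how long messages, possibly including its session renew, have been sitting undispatched in the dispatch queue.

```cpp
#include <chrono>
#include <iostream>

using Clock = std::chrono::steady_clock;
using Secs  = std::chrono::duration<double>;

// Hypothetical helper: a session is only treated as stale when it has gone
// unrenewed for longer than the timeout plus the age of the oldest message
// still sitting in the dispatch queue, since its renew may simply not have
// been dispatched yet.
bool session_is_stale(Clock::time_point last_renew,
                      Secs session_timeout,
                      Secs dispatch_queue_max_age,
                      Clock::time_point now = Clock::now()) {
  Secs since_renew = now - last_renew;
  return since_renew > session_timeout + dispatch_queue_max_age;
}

int main() {
  auto now = Clock::now();
  auto last_renew = now - std::chrono::seconds(70);  // last renew seen 70s ago
  Secs timeout(60);                                  // nominal session timeout

  // With an empty dispatch queue the session looks stale...
  std::cout << session_is_stale(last_renew, timeout, Secs(0), now) << '\n';   // prints 1
  // ...but if messages have been queued for 30s, give it the benefit of the doubt.
  std::cout << session_is_stale(last_renew, timeout, Secs(30), now) << '\n';  // prints 0
}
```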
ukernel (Contributor Author) commented Aug 6, 2018

batrick merged commit 2ca5708 into ceph:master on Aug 6, 2018
batrick added a commit that referenced this pull request Aug 6, 2018
* refs/pull/22999/head:
	mds: consider max age of dispatch queue when finding stale client
	mds: reset heartbeat map at potential time-consuming places
	mds: change MDSRank::finished_queue to deque

Reviewed-by: Patrick Donnelly <pdonnell@redhat.com>
batrick (Member) commented Aug 6, 2018

Just realized, after merging this, that this PR may cause: http://tracker.ceph.com/issues/26869

Zheng, can you take a look?

ukernel deleted the wip-mds-heartbeat branch on August 9, 2018 01:30