mds: reset heartbeat map at potential time-consuming places #22999
Conversation
src/mds/MDSContext.cc
Outdated
@@ -28,6 +28,9 @@ void MDSInternalContextBase::complete(int r) {
   assert(mds != NULL);
   assert(mds->mds_lock.is_locked_by_me());
   MDSContext::complete(r);
+
+  // avoid unhealthy heartbeat map when finish_contexts() processes large context list
+  mds->heartbeat_reset();
A large context list is made up of delayed messages (caused by a frozen subtree). How large it gets depends on how many messages clients sent while the subtree was frozen. These messages need to be processed before dispatching new messages (to guarantee message order). The MDS can't directly control how large the context list is (it can control it indirectly by reducing max_export_size). I can't figure out a better approach.
Thanks for explaining. What concerns me is that this hides the problem: the MDS is stuck processing a large context list and not doing anything else. This is precisely the type of scenario the heartbeat mechanism is designed to catch!
How about reducing the max_export_size? 1G and even 100M sounds too large. What are the drawbacks to breaking the exports into smaller chunks, like 10M?
10M is about 5000 inodes (roughly 2 KB per inode by that estimate); that's too small. Besides, we need to change the code that chooses the export candidate: it's too inefficient to freeze a large subtree but only export a small portion of it.
Okay.
src/mds/Locker.cc
Outdated
@@ -2737,7 +2737,7 @@ void Locker::handle_client_caps(MClientCaps *m)
     mdcache->wait_replay_cap_reconnect(m->get_ino(), new C_MDS_RetryMessage(mds, m));
     return;
   }
-  dout(1) << "handle_client_caps on unknown ino " << m->get_ino() << ", dropping" << dendl;
+  dout(7) << "handle_client_caps on unknown ino " << m->get_ino() << ", dropping" << dendl;
This also needs to be a separate PR, and the commit needs to be amended to note it fixes the issue. Sorry to be a PITA, Zheng, but this makes backporting easier and helps me get these fixes expedited.
src/mds/OpenFileTable.cc
Outdated
for (auto dir : fetch_queue) {
  if (dir->state_test(CDir::STATE_REJOINUNDEF))
    assert(dir->get_inode()->dirfragtree.is_leaf(dir->get_frag()));
  dir->fetch(gather.new_sub());

  if (!(++num_opening_dirfrags % 100))
    mds->heartbeat_reset();
Note: these don't bother me because we don't expect a non-active MDS to be responsive to clients.
But if a recovering MDS does not send beacons to the monitor, the monitor may replace it with a standby MDS.
Right, I'm saying that doing the heartbeat reset here makes sense.
Force-pushed from 98708f5 to 42d9b24.
Most changes are moved to #23088.
Otherwise LGTM.
src/mds/Migrator.cc
Outdated
std::list<MDSInternalContextBase*> contexts;
C_MDC_QueueContext(Migrator *m) : MigratorContext(m) {}
void finish(int r) override {
  get_mds()->queue_waiters_front(contexts);
Why do these need queued at the front?
This will require that MDSRank::finished_queue be a std::deque, which is not preferred to std::vector (in #23195). So, is it actually important to queue at the front?
Yes, it's important, to make sure contexts get processed in the proper order.
Please document an example.
src/mds/Migrator.cc
Outdated
  }
};

Dead code? Separate commit and PR please. (leaving this in here will mess up backports.)
src/mds/Migrator.cc
Outdated
@@ -1665,7 +1654,7 @@ uint64_t Migrator::encode_export_dir(bufferlist& exportbl,

 void Migrator::finish_export_dir(CDir *dir, mds_rank_t peer,
                                  map<inodeno_t,map<client_t,Capability::Import> >& peer_imported,
-                                 list<MDSInternalContextBase*>& finished, int *num_dentries)
+                                 list<MDSInternalContextBase*>& finished, unsigned& num_dentries)
This isn't relevant to this PR. Needs to be a separate refactor PR that won't be backported.
Force-pushed from d5c7a40 to 744d5b0.
@ukernel please rebase.
Force-pushed from 744d5b0 to 666c302.
Signed-off-by: "Yan, Zheng" <zyan@redhat.com>
Force-pushed from 6c85c8d to 14b71c2.
dout(7) << "mds has " << finished_queue.size() << " queued contexts" << dendl;
dout(10) << finished_queue << dendl;
decltype(finished_queue) ls;
ls.swap(finished_queue);
I think it would be preferable to keep ls.swap here so that any buggy context which adds to the finished_queue doesn't cause an infinite loop.
'auto fin = finished_queue.front()' is required by queue_waiters_front(). If a buggy context keeps adding another buggy context to finished_queue, it's an infinite loop regardless of whether we keep "ls.swap" or not.
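For reference, a minimal sketch (illustrative types and names, not the actual MDSRank code) of the swap-then-drain pattern being discussed: contexts queued while the current batch is being completed are left for the next pass, so they cannot extend the current loop indefinitely.

// Sketch of the swap-then-drain pattern; the element type and function name
// are assumptions for illustration, not the MDS implementation.
#include <deque>
#include <functional>

std::deque<std::function<void()>> finished_queue;

void drain_finished_queue() {
  decltype(finished_queue) ls;
  ls.swap(finished_queue);   // take the current batch, leaving the queue empty
  for (auto& fin : ls)
    fin();                   // a context may queue new entries onto finished_queue
  // entries added during this pass stay queued until the next drain
}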
src/mds/MDSRank.h
Outdated
void queue_waiters_front(MDSInternalContextBase::vec& ls) {
  MDSInternalContextBase::vec v;
  v.swap(ls);
  std::copy(v.rbegin(), v.rend(), std::front_inserter(finished_queue));
Wow. Okay, this needs a comment explaining what situation this is addressing. You're basically reversing the context vector and putting that at the front of the deque? This is not what one glancing at queue_waiters_front would expect, and it's easy to miss!
Contexts in 'ls' do not get reversed. If 'ls' and 'finished_queue' were lists, 'std::copy(v.rbegin(), v.rend(), std::front_inserter(finished_queue))' would be equivalent to 'finished_queue.splice(finished_queue.begin(), ls)'.
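For illustration, a tiny self-contained example (assumed values, not the MDS code) showing that copying the reversed range through std::front_inserter puts the elements at the front in their original order, just as a list splice at the front would:

#include <algorithm>
#include <cassert>
#include <deque>
#include <iterator>
#include <vector>

int main() {
  std::deque<int> finished_queue = {4, 5, 6};
  std::vector<int> ls = {1, 2, 3};

  // push_front(3), then push_front(2), then push_front(1)
  std::copy(ls.rbegin(), ls.rend(), std::front_inserter(finished_queue));

  assert((finished_queue == std::deque<int>{1, 2, 3, 4, 5, 6}));
  return 0;
}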
Okay, got it.
Force-pushed from 14b71c2 to 187114a.
And it also needs a tracker ticket for backport.
Signed-off-by: Yan, Zheng <zyan@redhat.com> Fixes: http://tracker.ceph.com/issues/26858
Session renew messages from clients can be in the dispatch queue, waiting to be dispatched. Signed-off-by: "Yan, Zheng" <zyan@redhat.com>
Force-pushed from 187114a to 2ca5708.
* refs/pull/22999/head: mds: consider max age of dispatch queue when finding stale client mds: reset heartbeat map at potential time-consuming places mds: change MDSRank::finished_queue to deque Reviewed-by: Patrick Donnelly <pdonnell@redhat.com>
Just realized after merging this that this PR may cause http://tracker.ceph.com/issues/26869. Zheng, can you take a look?
No description provided.