Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

mimic: core: Health warnings on long network ping times #30225

Merged
merged 25 commits into from Oct 29, 2019

Conversation

@dzafman
Copy link
Member

dzafman commented Sep 6, 2019

@dzafman dzafman added this to the mimic milestone Sep 6, 2019
@dzafman

This comment has been minimized.

Copy link
Member Author

dzafman commented Sep 6, 2019

Need to merge #30220 first in case sha1 changes. DONE

@dzafman dzafman changed the title Wip network mimic DNM: Wip network mimic Sep 6, 2019
@dzafman dzafman force-pushed the dzafman:wip-network-mimic branch 2 times, most recently from 1255268 to 4758d88 Sep 6, 2019
@smithfarm smithfarm added the core label Sep 7, 2019
@smithfarm smithfarm changed the title DNM: Wip network mimic DNM: core: Health warnings on long network ping times Sep 7, 2019
@smithfarm smithfarm changed the title DNM: core: Health warnings on long network ping times DNM: mimic: core: Health warnings on long network ping times Sep 7, 2019
@dzafman

This comment has been minimized.

Copy link
Member Author

dzafman commented Sep 9, 2019

jenkins test make check

@dzafman dzafman removed the DNM label Sep 11, 2019
@dzafman dzafman changed the title DNM: mimic: core: Health warnings on long network ping times mimic: core: Health warnings on long network ping times Sep 11, 2019
@dzafman dzafman added the DNM label Sep 27, 2019
@dzafman dzafman added needs-qa needs-review and removed DNM labels Oct 5, 2019
@smithfarm

This comment has been minimized.

Copy link
Contributor

smithfarm commented Oct 18, 2019

@dzafman Sorry, other mimic PRs got merged and this one now needs rebasing.

xiexingguo and others added 9 commits May 16, 2016
The original logic will reuse the timestamp which we send pings to
the specific heartbeat peer to update the last_rx_front[back] field
on receiving the corresponding replies, which later shall be honoured
as the exact time we succeed in getting the corresponding replies and
is used to calculate the heartbeat latency and determine whether the
relevant peer is dead.

However this is not accurate enough as there may be a delay between
we receive a reply and call heartbeat_check(). We can eliminate
the delay by introducing a map to track the ping-history here,
each entry of which consists of three elements:

1. "tx_time", worked as the map key, indicates the exact timestamp
   we send pings.
2. "deadline", indicates we shall receive all replies by then,
   otherwise we consider this peer as "dead".
3. "unacknowledged", indicates how many pings for the corresponding
   ping are still unacknowledged. The initial value is 2(as we send
   two pings from the front and back side for each peer).

We insert an item into the map on every time we sending out a ping, and
decrease the "unacknowledged" counter by 1 each time we get a reply from
the tracked ping. If "unacknowledged" drops to 0, we know all the replies
have been successfully collected and we can safely erase the relevant
item from the map as well as the earlier sent ones,  if there is any.

By comparing the current timestamp with the oldest deadline, we can now
make a much accurate decision about whether the corresponding peer is
healthy or not. And by setting last_rx_* to the timestamp we receiving
the reply, the lower bound when we can no longer hear a reply from the
corresponding connection is also much clear now.

Signed-off-by: xie xingguo <xie.xingguo@zte.com.cn>
(cherry picked from commit 477774c)

Conflicts:
	src/osd/OSD.cc (send_still_alive() has 1 less argument)
If we never hear any replies from a heartbeat peer, use first_tx
to calculdate failed_for, which is more accurate.

Signed-off-by: xie xingguo <xie.xingguo@zte.com.cn>
(cherry picked from commit aba6037)
Delay to declared to be healthy until we have received the first
replies from both front and back connections.

Signed-off-by: xie xingguo <xie.xingguo@zte.com.cn>
(cherry picked from commit d912315)
Fixes: http://tracker.ceph.com/issues/40640

Signed-off-by: David Zafman <dzafman@redhat.com>
(cherry picked from commit 66d44e7)

Conflicts:
	src/common/options.cc (trivial)
	src/mon/PGMap.cc (trivial)
	src/osd/OSD.cc (trivial)
	src/osd/OSD.h (trivial)
	src/osd/osd_types.cc (encode version difference)
	src/osd/osd_types.h (osd_stat_t location in file changed)

src/mon/PGMap.cc manually get rid of extra argument to checks->add
src/osd/OSD.cc rename ping_stamp to stamp for backport
Signed-off-by: David Zafman <dzafman@redhat.com>
(cherry picked from commit 5d3c185)

Conflicts:
	src/mgr/ClusterState.cc (trivial)
	src/mgr/ClusterState.h (trivial
Signed-off-by: David Zafman <dzafman@redhat.com>
(cherry picked from commit 025b10a)

 Conflicts:
	src/osd/OSD.cc (trivial)
…tbeat_grace

Compute network ping threshold based on ratio (5% of 20 seconds is 1 second)
Make the threshold value used part of dump_osd_network for osd and mgr
Keep mon_warn_on_slow_ping_time (default 0) to optionally override the ratio

Signed-off-by: David Zafman <dzafman@redhat.com>
(cherry picked from commit 0d1bbd3)
Signed-off-by: David Zafman <dzafman@redhat.com>
(cherry picked from commit f4a0be2)

Conflicts:
	PendingReleaseNotes (trivial)
Signed-off-by: David Zafman <dzafman@redhat.com>
(cherry picked from commit 297a0e7)

Conflicts:
	src/osd/osd_types.cc (trivial)
	src/osd/osd_types.h (osd_stat_t location in file changed)
Fixes: https://tracker.ceph.com/issues/41743

Signed-off-by: David Zafman <dzafman@redhat.com>
(cherry picked from commit ded58ef)

Conflicts: 3 yamls don't exist in Mimic
@dzafman dzafman force-pushed the dzafman:wip-network-mimic branch from 8de12bd to e49f071 Oct 18, 2019
@dzafman dzafman removed the needs-rebase label Oct 18, 2019
Fix use of asok_command() which doesn't do try/catch
Need unregister_command() since unregister_commands() doesn't exist here
Use Mutex::locker since lock_guard() isn't available
Use new g_conf which isn't g_conf() anymore
cct->_conf is a pointer now
Use ceph_abort() because cct isn't available for ceph_abort_msg()

Signed-off-by: David Zafman <dzafman@redhat.com>
@dzafman dzafman force-pushed the dzafman:wip-network-mimic branch from 8007515 to 6320b21 Oct 19, 2019
@smithfarm

This comment has been minimized.

Copy link
Contributor

smithfarm commented Oct 19, 2019

1 tests failed out of 157

Total Test time (real) = 3768.40 sec

The following tests FAILED:
	125 - unittest_chain_xattr (Timeout)
@smithfarm

This comment has been minimized.

Copy link
Contributor

smithfarm commented Oct 19, 2019

jenkins test make check

@yuriw

This comment has been minimized.

Copy link
Contributor

yuriw commented Oct 22, 2019

The primary benefit is that the OSD doesn't need to keep a flood of
blocked heartbeat messages around in memory.
This prevents OSDs from accumulating heartbeat messages due to a
broken switch and then exhausting the whole node's memory:

Jun 11 04:19:26 host-192-168-9-12 kernel: [409881.137077] Out of memory:
Kill process 1471476 (ceph-osd) score 47 or sacrifice child
Jun 11 04:19:26 host-192-168-9-12 kernel: [409881.146054] Killed process
1471476 (ceph-osd) total-vm:4822548kB, anon-rss:3097860kB,
file-rss:2556kB, shmem-rss:0kB

Fixes: http://tracker.ceph.com/issues/40586
Signed-off-by: xie xingguo <xie.xingguo@zte.com.cn>
(cherry picked from commit 6cc90f3)

Conflicts:
	src/osd/OSD.cc (no boot_finisher.stop() and no lock_guard)
	src/osd/OSD.h (trivial)

Fixed get_val() call in reset_heartbeat_peers()
@yuriw

This comment has been minimized.

Copy link
Contributor

yuriw commented Oct 29, 2019

@yuriw yuriw merged commit b46933b into ceph:mimic Oct 29, 2019
4 checks passed
4 checks passed
Docs: build check OK - docs built
Details
Signed-off-by all commits in this PR are signed
Details
Unmodified Submodules submodules for project are unmodified
Details
make check make check succeeded
Details
liewegas added a commit to liewegas/ceph that referenced this pull request Oct 30, 2019
This reverts commit b46933b, reversing
changes made to aafdd9a.

This breaks the encoding compatibility for osd_stat_t.  Revert this for
now until compatibility is fixed.

Fixes: https://tracker.ceph.com/issues/42570
Signed-off-by: Sage Weil <sage@redhat.com>
@batrick

This comment has been minimized.

Copy link
Member

batrick commented Oct 31, 2019

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
6 participants
You can’t perform that action at this time.