feature: Health warnings on long network ping times, add "dump_osd_network" to get a report #28755

Merged: 18 commits merged on Sep 5, 2019

Changes from all commits (18 commits)
66d44e7
osd mon: Track heartbeat ping times and report health warning
dzafman Jun 26, 2019
025b10a
osd: Add "dump_osd_network" osd admin request to get a sorted report
dzafman Jul 10, 2019
5d3c185
mgr: Add "dump_osd_network" mgr admin request to get a sorted report
dzafman Jul 9, 2019
0d1bbd3
osd mgr mon: Add mon_warn_on_slow_ping_ratio config as 5% of osd_hear…
dzafman Jul 11, 2019
f4a0be2
doc: Add documentation and release notes
dzafman Jul 11, 2019
297a0e7
osd mgr: Add minimum and maximum tracking to network ping time
dzafman Jul 12, 2019
3f846d7
osd mgr: Store last pingtime for possible graphing
dzafman Jul 15, 2019
6555699
osd: After first interval populate vectors so 5min/15min values aren't 0
dzafman Jul 16, 2019
ea20d35
osd mon: Add last_update to osd_stat_t heartbeat info
dzafman Jul 19, 2019
5ab145d
mon: Indicate when an osd with slow ping time is down
dzafman Jul 19, 2019
048f809
osd mgr: Add osd_mon_heartbeat_stat_stale option to time out ping info
dzafman Jul 22, 2019
f2b26d8
osd: Add debug_disable_randomized_ping config for use in testing
dzafman Jul 24, 2019
573aea2
osd: Add debug_heartbeat_testing_span to allow quicker testing
dzafman Jul 24, 2019
4fb42ea
test: Add basic test for network ping tracking
dzafman Jul 13, 2019
8ac1562
common: Add support routines to generate strings for fixed point
dzafman Aug 9, 2019
9d02e5d
osd mon mgr: Convert all network ping time output to milliseconds
dzafman Aug 6, 2019
5f83a61
osd doc mon mgr: To milliseconds for config value, user input and thr…
dzafman Sep 4, 2019
71015b9
doc: Document network performance monitoring
dzafman Sep 4, 2019
16 changes: 16 additions & 0 deletions PendingReleaseNotes
@@ -108,3 +108,19 @@
* The structured output of ``ceph status`` or ``ceph -s`` is now more
concise, particularly the `mgrmap` and `monmap` sections, and the
structure of the `osdmap` section has been cleaned up.

* A health warning is now generated if the average osd heartbeat ping
time exceeds a configurable threshold for any of the intervals
computed. The OSD computes 1 minute, 5 minute and 15 minute
intervals with average, minimum and maximum values. New configuration
option ``mon_warn_on_slow_ping_ratio`` specifies a percentage of
``osd_heartbeat_grace`` to determine the threshold. A value of zero
disables the warning. New configuration option
``mon_warn_on_slow_ping_time``, specified in milliseconds, overrides the
computed value and causes a warning when OSD heartbeat pings take longer
than the specified amount.
New admin command ``ceph daemon mgr.# dump_osd_network [threshold]`` will
(Review comment: might make sense to add a little about the admin commands in the main docs too - maybe in doc/rados/operations/monitoring.rst)

list all connections whose average ping time, over any of the 3 intervals,
is longer than the specified threshold (or, if none is given, the value determined by the config options).
New admin command ``ceph daemon osd.# dump_osd_network [threshold]`` will
do the same, but include only heartbeats initiated by the specified OSD.
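
For illustration only (the daemon names below, and the assumption that these
options can be changed at runtime with ``ceph config set``, are examples rather
than part of this change), the new threshold and admin commands could be
exercised as::

    ceph config set mon mon_warn_on_slow_ping_time 500   # warn above 500 msec
    ceph daemon mgr.x dump_osd_network 0                  # every tracked connection
    ceph daemon osd.0 dump_osd_network 0                  # only heartbeats initiated by osd.0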
19 changes: 19 additions & 0 deletions doc/rados/configuration/mon-config-ref.rst
@@ -393,6 +393,25 @@ by setting it in the ``[mon]`` section of the configuration file.
:Default: True


``mon warn on slow ping ratio``

:Description: Issue a ``HEALTH_WARN`` in cluster log if any heartbeat
between OSDs exceeds ``mon warn on slow ping ratio``
of ``osd heartbeat grace``. The default is 5%.
:Type: Float
:Default: ``0.05``


``mon warn on slow ping time``

:Description: Override ``mon warn on slow ping ratio`` with a specific value.
Issue a ``HEALTH_WARN`` in cluster log if any heartbeat
between OSDs exceeds ``mon warn on slow ping time``
milliseconds. The default is 0 (disabled).
:Type: Integer
:Default: ``0``
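
A worked example of how these two options interact, assuming the default
``osd heartbeat grace`` of 20 seconds (the ``ceph config set`` invocations are
illustrative only)::

    # computed threshold = osd_heartbeat_grace * mon_warn_on_slow_ping_ratio
    #                    = 20 s * 0.05 = 1 s = 1000 msec
    ceph config set mon mon_warn_on_slow_ping_time 250   # warn above 250 msec instead
    ceph config set mon mon_warn_on_slow_ping_ratio 0    # disable the warning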


``mon cache target full warn ratio``

:Description: Position between pool's ``cache_target_full`` and
15 changes: 11 additions & 4 deletions doc/rados/configuration/mon-osd-interaction.rst
@@ -24,10 +24,8 @@ monitoring the Ceph Storage Cluster.
OSDs Check Heartbeats
=====================

(removed)
Each Ceph OSD Daemon checks the heartbeat of other Ceph OSD Daemons every 6
seconds. You can change the heartbeat interval by adding an ``osd heartbeat
interval`` setting under the ``[osd]`` section of your Ceph configuration file,
or by setting the value at runtime. If a neighboring Ceph OSD Daemon doesn't
(added)
Each Ceph OSD Daemon checks the heartbeat of other Ceph OSD Daemons at random
intervals less than every 6 seconds. If a neighboring Ceph OSD Daemon doesn't
show a heartbeat within a 20 second grace period, the Ceph OSD Daemon may
consider the neighboring Ceph OSD Daemon ``down`` and report it back to a Ceph
Monitor, which will update the Ceph Cluster Map. You may change this grace
@@ -379,6 +377,15 @@ OSD Settings
:Default: ``30``


``osd mon heartbeat stat stale``

:Description: Stop reporting on heartbeat ping times which haven't been updated for
this many seconds. Set to zero to disable this action.

:Type: 32-bit Integer
:Default: ``3600``
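
For example (illustrative only; it is assumed the option can be changed at
runtime with ``ceph config set``)::

    ceph config set osd osd_mon_heartbeat_stat_stale 7200   # keep entries for 2 hours
    ceph config set osd osd_mon_heartbeat_stat_stale 0      # never drop stale entries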


``osd mon report interval``

:Description: The number of seconds a Ceph OSD Daemon may wait
110 changes: 110 additions & 0 deletions doc/rados/operations/monitoring.rst
@@ -159,6 +159,116 @@ to a health state:
2017-07-25 10:11:13.535493 mon.a mon.0 172.21.9.34:6789/0 110 : cluster [INF] Health check cleared: PG_DEGRADED (was: Degraded data redundancy: 2 pgs unclean, 2 pgs degraded, 2 pgs undersized)
2017-07-25 10:11:13.535577 mon.a mon.0 172.21.9.34:6789/0 111 : cluster [INF] Cluster is now healthy

Network Performance Checks
--------------------------

Ceph OSDs send heartbeat ping messages amongst themselves to monitor daemon availability. We
also use the response times to monitor network performance.
While it is possible that a busy OSD could delay a ping response, we can assume
that if a network switch fails, multiple delays will be detected between distinct pairs of OSDs.

By default we will warn about ping times which exceed 1 second (1000 milliseconds).

::

HEALTH_WARN Long heartbeat ping times on back interface seen, longest is 1118.001 msec

The health detail will add the combination of OSDs that are seeing the delays and by how much. There is a limit of 10
detail line items.

::

[WRN] OSD_SLOW_PING_TIME_BACK: Long heartbeat ping times on back interface seen, longest is 1118.001 msec
Slow heartbeat ping on back interface from osd.0 to osd.1 1118.001 msec
Slow heartbeat ping on back interface from osd.0 to osd.2 1030.123 msec
Slow heartbeat ping on back interface from osd.2 to osd.1 1015.321 msec
Slow heartbeat ping on back interface from osd.1 to osd.0 1010.456 msec

To see even more detail and a complete dump of network performance information, the ``dump_osd_network`` command can be used. Typically, this would be
sent to a mgr, but it can be limited to a particular OSD's interactions by issuing it to that OSD. The current threshold, which defaults to 1 second
(1000 milliseconds), can be overridden as an argument in milliseconds.

The following command will show all gathered network performance data by specifying a threshold of 0 and sending it to the mgr.

::

$ ceph daemon /var/run/ceph/ceph-mgr.x.asok dump_osd_network 0
{
"threshold": 0,
"entries": [
{
"last update": "Wed Sep 4 17:04:49 2019",
"stale": false,
"from osd": 2,
"to osd": 0,
"interface": "front",
"average": {
"1min": 1.023,
"5min": 0.860,
"15min": 0.883
},
"min": {
"1min": 0.818,
"5min": 0.607,
"15min": 0.607
},
"max": {
"1min": 1.164,
"5min": 1.173,
"15min": 1.544
},
"last": 0.924
},
{
"last update": "Wed Sep 4 17:04:49 2019",
"stale": false,
"from osd": 2,
"to osd": 0,
"interface": "back",
"average": {
"1min": 0.968,
"5min": 0.897,
"15min": 0.830
},
"min": {
"1min": 0.860,
"5min": 0.563,
"15min": 0.502
},
"max": {
"1min": 1.171,
"5min": 1.216,
"15min": 1.456
},
"last": 0.845
},
{
"last update": "Wed Sep 4 17:04:48 2019",
"stale": false,
"from osd": 0,
"to osd": 1,
"interface": "front",
"average": {
"1min": 0.965,
"5min": 0.811,
"15min": 0.850
},
"min": {
"1min": 0.650,
"5min": 0.488,
"15min": 0.466
},
"max": {
"1min": 1.252,
"5min": 1.252,
"15min": 1.362
},
"last": 0.791
},
...
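
The JSON output can be post-processed with standard tools. As a sketch (this
``jq`` invocation is an example only; the field names are taken from the output
above), the worst 1 minute averages can be listed with::

    ceph daemon /var/run/ceph/ceph-mgr.x.asok dump_osd_network 0 |
        jq -r '.entries[] | (.["from osd"]|tostring) + " -> " + (.["to osd"]|tostring) + " " + .interface + ": " + (.average["1min"]|tostring) + " msec"' |
        sort -t: -k2 -rn | head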



Muting health checks
--------------------

145 changes: 145 additions & 0 deletions qa/standalone/misc/network-ping.sh
@@ -0,0 +1,145 @@
#!/usr/bin/env bash

source $CEPH_ROOT/qa/standalone/ceph-helpers.sh

function run() {
local dir=$1
shift

export CEPH_MON="127.0.0.1:7146" # git grep '\<7146\>' : there must be only one
export CEPH_ARGS
CEPH_ARGS+="--fsid=$(uuidgen) --auth-supported=none "
CEPH_ARGS+="--mon-host=$CEPH_MON "
CEPH_ARGS+="--debug_disable_randomized_ping=true "
CEPH_ARGS+="--debug_heartbeat_testing_span=5 "
CEPH_ARGS+="--osd_heartbeat_interval=1 "
local funcs=${@:-$(set | sed -n -e 's/^\(TEST_[0-9a-z_]*\) .*/\1/p')}
for func in $funcs ; do
setup $dir || return 1
$func $dir || return 1
teardown $dir || return 1
done
}

function TEST_network_ping_test1() {
local dir=$1

run_mon $dir a || return 1
run_mgr $dir x || return 1
run_osd $dir 0 || return 1
run_osd $dir 1 || return 1
run_osd $dir 2 || return 1

sleep 5

create_pool foo 16

# write some objects
timeout 20 rados bench -p foo 10 write -b 4096 --no-cleanup || return 1

# Get 1 cycle worth of ping data "1 minute"
sleep 10
flush_pg_stats

CEPH_ARGS='' ceph daemon $(get_asok_path osd.0) dump_osd_network | tee $dir/json
test "$(cat $dir/json | jq '.entries | length')" = "0" || return 1
test "$(cat $dir/json | jq '.threshold')" = "1000" || return 1

CEPH_ARGS='' ceph daemon $(get_asok_path mgr.x) dump_osd_network | tee $dir/json
test "$(cat $dir/json | jq '.entries | length')" = "0" || return 1
test "$(cat $dir/json | jq '.threshold')" = "1000" || return 1

CEPH_ARGS='' ceph daemon $(get_asok_path osd.0) dump_osd_network 0 | tee $dir/json
test "$(cat $dir/json | jq '.entries | length')" = "4" || return 1
test "$(cat $dir/json | jq '.threshold')" = "0" || return 1

CEPH_ARGS='' ceph daemon $(get_asok_path mgr.x) dump_osd_network 0 | tee $dir/json
test "$(cat $dir/json | jq '.entries | length')" = "12" || return 1
test "$(cat $dir/json | jq '.threshold')" = "0" || return 1

# Wait another 4 cycles to get "5 minute interval"
sleep 20
flush_pg_stats
CEPH_ARGS='' ceph daemon $(get_asok_path osd.0) dump_osd_network | tee $dir/json
test "$(cat $dir/json | jq '.entries | length')" = "0" || return 1
test "$(cat $dir/json | jq '.threshold')" = "1000" || return 1

CEPH_ARGS='' ceph daemon $(get_asok_path mgr.x) dump_osd_network | tee $dir/json
test "$(cat $dir/json | jq '.entries | length')" = "0" || return 1
test "$(cat $dir/json | jq '.threshold')" = "1000" || return 1

CEPH_ARGS='' ceph daemon $(get_asok_path osd.0) dump_osd_network 0 | tee $dir/json
test "$(cat $dir/json | jq '.entries | length')" = "4" || return 1
test "$(cat $dir/json | jq '.threshold')" = "0" || return 1

CEPH_ARGS='' ceph daemon $(get_asok_path mgr.x) dump_osd_network 0 | tee $dir/json
test "$(cat $dir/json | jq '.entries | length')" = "12" || return 1
test "$(cat $dir/json | jq '.threshold')" = "0" || return 1


# Wait another 10 cycles to get "15 minute interval"
sleep 50
flush_pg_stats
CEPH_ARGS='' ceph daemon $(get_asok_path osd.0) dump_osd_network | tee $dir/json
test "$(cat $dir/json | jq '.entries | length')" = "0" || return 1
test "$(cat $dir/json | jq '.threshold')" = "1000" || return 1

CEPH_ARGS='' ceph daemon $(get_asok_path mgr.x) dump_osd_network | tee $dir/json
test "$(cat $dir/json | jq '.entries | length')" = "0" || return 1
test "$(cat $dir/json | jq '.threshold')" = "1000" || return 1

CEPH_ARGS='' ceph daemon $(get_asok_path osd.0) dump_osd_network 0 | tee $dir/json
test "$(cat $dir/json | jq '.entries | length')" = "4" || return 1
test "$(cat $dir/json | jq '.threshold')" = "0" || return 1

CEPH_ARGS='' ceph daemon $(get_asok_path mgr.x) dump_osd_network 0 | tee $dir/json
test "$(cat $dir/json | jq '.entries | length')" = "12" || return 1
test "$(cat $dir/json | jq '.threshold')" = "0" || return 1

# Just check the threshold output matches the input
CEPH_ARGS='' ceph daemon $(get_asok_path mgr.x) dump_osd_network 99 | tee $dir/json
test "$(cat $dir/json | jq '.threshold')" = "99" || return 1
CEPH_ARGS='' ceph daemon $(get_asok_path osd.0) dump_osd_network 98 | tee $dir/json
test "$(cat $dir/json | jq '.threshold')" = "98" || return 1

rm -f $dir/json
}

# Test setting of mon_warn_on_slow_ping_time very low to
# get health warning
function TEST_network_ping_test2() {
local dir=$1

export CEPH_ARGS
export EXTRA_OPTS=" --mon_warn_on_slow_ping_time=1"
run_mon $dir a || return 1
run_mgr $dir x || return 1
run_osd $dir 0 || return 1
run_osd $dir 1 || return 1
run_osd $dir 2 || return 1

sleep 5

create_pool foo 16

# write some objects
timeout 20 rados bench -p foo 10 write -b 4096 --no-cleanup || return 1

# Get at least 1 cycle of ping data (this test runs with 5 second cycles of 1 second pings)
sleep 10
flush_pg_stats

ceph health | tee $dir/health
grep -q "Long heartbeat" $dir/health || return 1

ceph health detail | tee $dir/health
grep -q "OSD_SLOW_PING_TIME_BACK" $dir/health || return 1
grep -q "OSD_SLOW_PING_TIME_FRONT" $dir/health || return 1
rm -f $dir/health
}

main network-ping "$@"

# Local Variables:
# compile-command: "cd ../.. ; make -j4 && ../qa/run-standalone.sh network-ping.sh"
# End:
26 changes: 26 additions & 0 deletions src/common/Formatter.cc
@@ -25,6 +25,32 @@
// -----------------------
namespace ceph {

// Render an unsigned value that is scaled by 10^scale as a fixed-point decimal
// string, e.g. fixed_u_to_string(1118001, 3) == "1118.001".
std::string
fixed_u_to_string(uint64_t num, int scale)
{
  std::ostringstream t;

  // Pad with leading zeros so there is always at least one digit before the point.
  t.fill('0');
  t.width(scale + 1);
  t << num;
  int len = t.str().size();
  return t.str().substr(0,len - scale) + "." + t.str().substr(len - scale);
}

// Signed variant: format the magnitude as above, then re-apply the sign.
std::string
fixed_to_string(int64_t num, int scale)
{
  std::ostringstream t;
  bool neg = num < 0;
  if (neg) num = -num;

  t.fill('0');
  t.width(scale + 1);
  t << num;
  int len = t.str().size();
  return (neg ? "-" : "") + t.str().substr(0,len - scale) + "." + t.str().substr(len - scale);
}

/*
* FormatterAttrs(const char *attr, ...)
*
2 changes: 2 additions & 0 deletions src/common/Formatter.h
@@ -299,5 +299,7 @@ namespace ceph {
std::vector< std::string > m_column_name;
};

std::string fixed_to_string(int64_t num, int scale);
std::string fixed_u_to_string(uint64_t num, int scale);
}
#endif