osd,mon: misc full fixes and cleanups #13968

Merged
merged 24 commits on Apr 17, 2017
Changes from all commits
Commits
24 commits
1e7d227
osd: Fix log message
dzafman Feb 28, 2017
3a66f1f
ceph-objectstore-tool: cleanup comment
dzafman Apr 3, 2017
6c930f7
osd: Increase osd_backfill_retry_interval to 30 seconds
dzafman Mar 16, 2017
d024ab0
osd: Remove unused argument to clear_queued_recovery
dzafman Mar 16, 2017
811f89a
test: Switch from pg to osd for set-*-ratio commands
dzafman Apr 7, 2017
e927cd2
test: Fix intended test flow and restore nearfull-ratio
dzafman Apr 7, 2017
5baf7ab
osd: Fail-safe full is a hard stop even for mds
dzafman Mar 30, 2017
7912433
osd: too_full_for_backfill() returns ostream for reason
dzafman Mar 30, 2017
79a4ac4
common: Remove unused config option osd_recovery_threads
dzafman Mar 30, 2017
9dd6952
common: Bump ratio for backfillfull from 85% to 90%
dzafman Apr 3, 2017
0264bbd
osd: For testing full disks add injectfull socket command
dzafman Mar 30, 2017
a573107
osd: Handle backfillfull_ratio just like nearfull and full
dzafman Mar 30, 2017
1e2fde1
osd: Revamp injectfull op to support all full states
dzafman Mar 31, 2017
1711ccd
osd: Check failsafe full and crash on push/pull
dzafman Apr 3, 2017
94e253c
osd: Rename backfill_request_* to recovery_request_*
dzafman Mar 16, 2017
c7e8dca
osd: Add check_osdmap_full() to check for shard OSD fullness
dzafman Mar 16, 2017
27e1450
osd: Add PG state and flag for too full for recovery
dzafman Apr 5, 2017
8408856
osd: Check whether any OSD is full before starting recovery
dzafman Apr 5, 2017
1fafec2
osd: check_full_status() remove bogus comment and use equivalent comp…
dzafman Apr 12, 2017
afd739b
mon: Use currently configure full ratio to determine available space
dzafman Apr 13, 2017
c83f11d
mon: Always fix-up full ratios when specified incorrectly in config
dzafman Apr 13, 2017
e4cf10d
mon: Issue warning or error if a full ratio out of order
dzafman Apr 13, 2017
2522307
mon, osd: Add detailed full information for now in the mon
dzafman Apr 14, 2017
3becdd3
test: Test health check output for full ratios
dzafman Apr 15, 2017
7 changes: 4 additions & 3 deletions doc/dev/osd_internals/recovery_reservation.rst
@@ -34,8 +34,8 @@ the typical process.

Once the primary has its local reservation, it requests a remote
reservation from the backfill target. This reservation CAN be rejected,
for instance if the OSD is too full (osd_backfill_full_ratio config
option). If the reservation is rejected, the primary drops its local
for instance if the OSD is too full (backfillfull_ratio osd setting).
If the reservation is rejected, the primary drops its local
reservation, waits (osd_backfill_retry_interval), and then retries. It
will retry indefinitely.

@@ -62,9 +62,10 @@ to the monitor. The state chart can set:

- recovery_wait: waiting for local/remote reservations
- recovering: recovering
- recovery_toofull: recovery stopped, OSD(s) above full ratio
- backfill_wait: waiting for remote backfill reservations
- backfilling: backfilling
- backfill_toofull: backfill reservation rejected, OSD too full
- backfill_toofull: backfill stopped, OSD(s) above backfillfull ratio


--------
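
The new ``recovery_toofull`` and ``backfill_toofull`` states appear in regular PG status output. As a rough sketch (not part of this PR), an operator could watch for them and inspect the retry interval that gates re-requesting reservations; the admin-socket path below is an assumption and should be adjusted per deployment::

    # List any PGs currently held back because an OSD is over a full threshold.
    ceph pg dump pgs_brief | egrep 'backfill_toofull|recovery_toofull'

    # Inspect the retry interval (raised to 30s by this PR) on one OSD.
    ceph --admin-daemon /var/run/ceph/ceph-osd.0.asok config get osd_backfill_retry_interval
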
6 changes: 6 additions & 0 deletions doc/man/8/ceph.rst
@@ -1166,6 +1166,12 @@ Usage::

ceph pg set_full_ratio <float[0.0-1.0]>

Subcommand ``set_backfillfull_ratio`` sets ratio at which pgs are considered too full to backfill.

Usage::

ceph pg set_backfillfull_ratio <float[0.0-1.0]>

Subcommand ``set_nearfull_ratio`` sets ratio at which pgs are considered nearly
full.

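
The same threshold can also be managed with the newer ``ceph osd`` subcommands added in this PR's MonCommands.h change and exercised by the cephtool tests below; a minimal sketch::

    ceph osd set-backfillfull-ratio 0.90
    ceph osd dump | grep '^backfillfull_ratio'
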
10 changes: 10 additions & 0 deletions doc/rados/configuration/mon-config-ref.rst
@@ -400,6 +400,7 @@ a reasonable number for a near full ratio.
[global]

mon osd full ratio = .80
mon osd backfillfull ratio = .75
mon osd nearfull ratio = .70


@@ -412,6 +413,15 @@ a reasonable number for a near full ratio.
:Default: ``.95``


``mon osd backfillfull ratio``

:Description: The percentage of disk space used before an OSD is
considered too ``full`` to backfill.

:Type: Float
:Default: ``.90``


``mon osd nearfull ratio``

:Description: The percentage of disk space used before an OSD is
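
The three ratios are expected to stay ordered (``nearfull`` < ``backfillfull`` < ``full``); per the cephtool test changes below, the monitor reports ``HEALTH_ERR Full ratio(s) out of order`` when they are not. A quick way to check the effective values on a running cluster (a sketch, not part of the docs being changed)::

    ceph osd dump | egrep '^(full|backfillfull|nearfull)_ratio'
    ceph health detail
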
16 changes: 0 additions & 16 deletions doc/rados/configuration/osd-config-ref.rst
@@ -560,15 +560,6 @@ priority than requests to read or write data.
:Default: ``512``


``osd backfill full ratio``

:Description: Refuse to accept backfill requests when the Ceph OSD Daemon's
full ratio is above this value.

:Type: Float
:Default: ``0.85``


``osd backfill retry interval``

:Description: The number of seconds to wait before retrying backfill requests.
@@ -673,13 +664,6 @@ perform well in a degraded state.
:Default: ``8 << 20``


``osd recovery threads``

:Description: The number of threads for recovering data.
:Type: 32-bit Integer
:Default: ``1``


``osd recovery thread timeout``

:Description: The maximum time in seconds before timing out a recovery thread.
8 changes: 4 additions & 4 deletions doc/rados/operations/monitoring-osd-pg.rst
@@ -468,8 +468,7 @@ Ceph provides a number of settings to balance the resource contention between
new service requests and the need to recover data objects and restore the
placement groups to the current state. The ``osd recovery delay start`` setting
allows an OSD to restart, re-peer and even process some replay requests before
starting the recovery process. The ``osd recovery threads`` setting limits the
number of threads for the recovery process (1 thread by default). The ``osd
starting the recovery process. The ``osd
recovery thread timeout`` sets a thread timeout, because multiple OSDs may fail,
restart and re-peer at staggered rates. The ``osd recovery max active`` setting
limits the number of recovery requests an OSD will entertain simultaneously to
@@ -497,8 +496,9 @@ placement group can't be backfilled, it may be considered ``incomplete``.
Ceph provides a number of settings to manage the load spike associated with
reassigning placement groups to an OSD (especially a new OSD). By default,
``osd_max_backfills`` sets the maximum number of concurrent backfills to or from
an OSD to 10. The ``osd backfill full ratio`` enables an OSD to refuse a
backfill request if the OSD is approaching its full ratio (85%, by default).
an OSD to 10. The ``backfill full ratio`` enables an OSD to refuse a
backfill request if the OSD is approaching its full ratio (90%, by default);
the ratio can be changed with the ``ceph osd set-backfillfull-ratio`` command.
If an OSD refuses a backfill request, the ``osd backfill retry interval``
enables an OSD to retry the request (after 10 seconds, by default). OSDs can
also set ``osd backfill scan min`` and ``osd backfill scan max`` to manage scan
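
For reference, the backfill knobs named above can be adjusted on a live cluster; a sketch using ``injectargs`` with illustrative values only::

    ceph tell osd.* injectargs '--osd-max-backfills 2'
    ceph tell osd.* injectargs '--osd-backfill-retry-interval 30'
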
17 changes: 10 additions & 7 deletions doc/rados/troubleshooting/troubleshooting-osd.rst
@@ -206,28 +206,31 @@ Ceph prevents you from writing to a full OSD so that you don't lose data.
In an operational cluster, you should receive a warning when your cluster
is getting near its full ratio. The ``mon osd full ratio`` defaults to
``0.95``, or 95% of capacity before it stops clients from writing data.
The ``mon osd nearfull ratio`` defaults to ``0.85``, or 85% of capacity
The ``mon osd backfillfull ratio`` defaults to ``0.90``, or 90% of
capacity when it blocks backfills from starting. The
``mon osd nearfull ratio`` defaults to ``0.85``, or 85% of capacity
when it generates a health warning.

Full cluster issues usually arise when testing how Ceph handles an OSD
failure on a small cluster. When one node has a high percentage of the
cluster's data, the cluster can easily eclipse its nearfull and full ratio
immediately. If you are testing how Ceph reacts to OSD failures on a small
cluster, you should leave ample free disk space and consider temporarily
lowering the ``mon osd full ratio`` and ``mon osd nearfull ratio``.
lowering the ``mon osd full ratio``, ``mon osd backfillfull ratio`` and
``mon osd nearfull ratio``.

Full ``ceph-osds`` will be reported by ``ceph health``::

ceph health
HEALTH_WARN 1 nearfull osds
osd.2 is near full at 85%
HEALTH_WARN 1 nearfull osd(s)

Or::

ceph health
HEALTH_ERR 1 nearfull osds, 1 full osds
osd.2 is near full at 85%
ceph health detail
HEALTH_ERR 1 full osd(s); 1 backfillfull osd(s); 1 nearfull osd(s)
osd.3 is full at 97%
osd.4 is backfill full at 91%
osd.2 is near full at 87%

The best way to deal with a full cluster is to add new ``ceph-osds``, allowing
the cluster to redistribute data to the newly available storage.
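
Before adding capacity it usually helps to confirm which OSDs are over which threshold; a sketch (output formats vary by release)::

    ceph osd df          # per-OSD utilization
    ceph health detail   # names the full / backfillfull / nearfull OSDs, as above
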
14 changes: 8 additions & 6 deletions qa/tasks/ceph_manager.py
@@ -696,7 +696,7 @@ def test_backfill_full(self):
"""
Test backfills stopping when the replica fills up.

First, use osd_backfill_full_ratio to simulate a now full
First, use injectfull admin command to simulate a now full
osd by setting it to 0 on all of the OSDs.

Second, on a random subset, set
@@ -705,13 +705,14 @@

Then, verify that all backfills stop.
"""
self.log("injecting osd_backfill_full_ratio = 0")
self.log("injecting backfill full")
for i in self.live_osds:
self.ceph_manager.set_config(
i,
osd_debug_skip_full_check_in_backfill_reservation=
random.choice(['false', 'true']),
osd_backfill_full_ratio=0)
random.choice(['false', 'true']))
self.ceph_manager.osd_admin_socket(i, command=['injectfull', 'backfillfull'],
check_status=True, timeout=30, stdout=DEVNULL)
for i in range(30):
status = self.ceph_manager.compile_pg_status()
if 'backfill' not in status.keys():
@@ -724,8 +725,9 @@
for i in self.live_osds:
self.ceph_manager.set_config(
i,
osd_debug_skip_full_check_in_backfill_reservation='false',
osd_backfill_full_ratio=0.85)
osd_debug_skip_full_check_in_backfill_reservation='false')
self.ceph_manager.osd_admin_socket(i, command=['injectfull', 'none'],
check_status=True, timeout=30, stdout=DEVNULL)

def test_map_discontinuity(self):
"""
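
Outside the test harness, the same fault injection can be driven by hand through the OSD admin socket, mirroring what cephtool/test.sh does below; the socket path here is an assumption::

    ceph --admin-daemon /var/run/ceph/ceph-osd.0.asok injectfull backfillfull
    ceph --admin-daemon /var/run/ceph/ceph-osd.0.asok injectfull none   # clear the injection
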
2 changes: 1 addition & 1 deletion qa/workunits/ceph-helpers.sh
@@ -400,6 +400,7 @@ EOF
if test -z "$(get_config mon $id mon_initial_members)" ; then
ceph osd pool delete rbd rbd --yes-i-really-really-mean-it || return 1
ceph osd pool create rbd $PG_NUM || return 1
ceph osd set-backfillfull-ratio .99
fi
}

@@ -634,7 +635,6 @@ function activate_osd() {
ceph_disk_args+=" --prepend-to-path="

local ceph_args="$CEPH_ARGS"
ceph_args+=" --osd-backfill-full-ratio=.99"
ceph_args+=" --osd-failsafe-full-ratio=.99"
ceph_args+=" --osd-journal-size=100"
ceph_args+=" --osd-scrub-load-threshold=2000"
35 changes: 35 additions & 0 deletions qa/workunits/cephtool/test.sh
@@ -1419,9 +1419,44 @@ function test_mon_pg()

ceph osd set-full-ratio .962
ceph osd dump | grep '^full_ratio 0.962'
ceph osd set-backfillfull-ratio .912
ceph osd dump | grep '^backfillfull_ratio 0.912'
ceph osd set-nearfull-ratio .892
ceph osd dump | grep '^nearfull_ratio 0.892'

# Check health status
ceph osd set-nearfull-ratio .913
ceph health | grep 'HEALTH_ERR Full ratio(s) out of order'
ceph health detail | grep 'backfill_ratio (0.912) < nearfull_ratio (0.913), increased'
ceph osd set-nearfull-ratio .892
ceph osd set-backfillfull-ratio .963
ceph health detail | grep 'full_ratio (0.962) < backfillfull_ratio (0.963), increased'
ceph osd set-backfillfull-ratio .912

# Check injected full results
WAITFORFULL=10
ceph --admin-daemon $CEPH_OUT_DIR/osd.0.asok injectfull nearfull
sleep $WAITFORFULL
ceph health | grep "HEALTH_WARN.*1 nearfull osd(s)"
ceph --admin-daemon $CEPH_OUT_DIR/osd.1.asok injectfull backfillfull
sleep $WAITFORFULL
ceph health | grep "HEALTH_WARN.*1 backfillfull osd(s)"
ceph --admin-daemon $CEPH_OUT_DIR/osd.2.asok injectfull failsafe
sleep $WAITFORFULL
# failsafe and full are the same as far as the monitor is concerned
ceph health | grep "HEALTH_ERR.*1 full osd(s)"
ceph --admin-daemon $CEPH_OUT_DIR/osd.0.asok injectfull full
sleep $WAITFORFULL
ceph health | grep "HEALTH_ERR.*2 full osd(s)"
ceph health detail | grep "osd.0 is full at.*%"
ceph health detail | grep "osd.2 is full at.*%"
ceph health detail | grep "osd.1 is backfill full at.*%"
ceph --admin-daemon $CEPH_OUT_DIR/osd.0.asok injectfull none
ceph --admin-daemon $CEPH_OUT_DIR/osd.1.asok injectfull none
ceph --admin-daemon $CEPH_OUT_DIR/osd.2.asok injectfull none
sleep $WAITFORFULL
ceph health | grep HEALTH_OK

ceph pg stat | grep 'pgs:'
ceph pg 0.0 query
ceph tell 0.0 query
6 changes: 5 additions & 1 deletion qa/workunits/rest/test.py
@@ -359,10 +359,14 @@ def expect_nofail(url, method, respcode, contenttype, extra_hdrs=None,
r = expect('osd/dump', 'GET', 200, 'json', JSONHDR)
assert(float(r.myjson['output']['full_ratio']) == 0.90)
expect('osd/set-full-ratio?ratio=0.95', 'PUT', 200, '')
expect('osd/set-backfillfull-ratio?ratio=0.88', 'PUT', 200, '')
r = expect('osd/dump', 'GET', 200, 'json', JSONHDR)
assert(float(r.myjson['output']['backfillfull_ratio']) == 0.88)
expect('osd/set-backfillfull-ratio?ratio=0.90', 'PUT', 200, '')
expect('osd/set-nearfull-ratio?ratio=0.90', 'PUT', 200, '')
r = expect('osd/dump', 'GET', 200, 'json', JSONHDR)
assert(float(r.myjson['output']['nearfull_ratio']) == 0.90)
expect('osd/set-full-ratio?ratio=0.85', 'PUT', 200, '')
expect('osd/set-nearfull-ratio?ratio=0.85', 'PUT', 200, '')

r = expect('pg/stat', 'GET', 200, 'json', JSONHDR)
assert('num_pgs' in r.myjson['output'])
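
Equivalent calls against a running ``ceph-rest-api`` might look like the following; the host, port, and ``/api/v0.1`` prefix are assumptions based on that service's defaults, not something this PR changes::

    curl -X PUT 'http://localhost:5000/api/v0.1/osd/set-backfillfull-ratio?ratio=0.90'
    curl -H 'Accept: application/json' 'http://localhost:5000/api/v0.1/osd/dump'
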
2 changes: 2 additions & 0 deletions src/common/ceph_strings.cc
@@ -42,6 +42,8 @@ const char *ceph_osd_state_name(int s)
return "full";
case CEPH_OSD_NEARFULL:
return "nearfull";
case CEPH_OSD_BACKFILLFULL:
return "backfillfull";
default:
return "???";
}
11 changes: 6 additions & 5 deletions src/common/config_opts.h
@@ -308,6 +308,7 @@ OPTION(mon_pg_warn_min_pool_objects, OPT_INT, 1000) // do not warn on pools bel
OPTION(mon_pg_check_down_all_threshold, OPT_FLOAT, .5) // threshold of down osds after which we check all pgs
OPTION(mon_cache_target_full_warn_ratio, OPT_FLOAT, .66) // position between pool cache_target_full and max where we start warning
OPTION(mon_osd_full_ratio, OPT_FLOAT, .95) // what % full makes an OSD "full"
OPTION(mon_osd_backfillfull_ratio, OPT_FLOAT, .90) // what % full makes an OSD backfill full (backfill halted)
OPTION(mon_osd_nearfull_ratio, OPT_FLOAT, .85) // what % full makes an OSD near full
OPTION(mon_allow_pool_delete, OPT_BOOL, false) // allow pool deletion
OPTION(mon_globalid_prealloc, OPT_U32, 10000) // how many globalids to prealloc
@@ -626,11 +627,11 @@ OPTION(osd_max_backfills, OPT_U64, 1)
// Minimum recovery priority (255 = max, smaller = lower)
OPTION(osd_min_recovery_priority, OPT_INT, 0)

// Refuse backfills when OSD full ratio is above this value
OPTION(osd_backfill_full_ratio, OPT_FLOAT, 0.85)

// Seconds to wait before retrying refused backfills
OPTION(osd_backfill_retry_interval, OPT_DOUBLE, 10.0)
OPTION(osd_backfill_retry_interval, OPT_DOUBLE, 30.0)

// Seconds to wait before retrying refused recovery
OPTION(osd_recovery_retry_interval, OPT_DOUBLE, 30.0)

// max agent flush ops
OPTION(osd_agent_max_ops, OPT_INT, 4)
@@ -742,7 +743,6 @@ OPTION(osd_op_pq_min_cost, OPT_U64, 65536)
OPTION(osd_disk_threads, OPT_INT, 1)
OPTION(osd_disk_thread_ioprio_class, OPT_STR, "") // rt realtime be best effort idle
OPTION(osd_disk_thread_ioprio_priority, OPT_INT, -1) // 0-7
OPTION(osd_recovery_threads, OPT_INT, 1)
OPTION(osd_recover_clone_overlap, OPT_BOOL, true) // preserve clone_overlap during recovery/migration
OPTION(osd_op_num_threads_per_shard, OPT_INT, 2)
OPTION(osd_op_num_shards, OPT_INT, 5)
@@ -871,6 +871,7 @@ OPTION(osd_debug_skip_full_check_in_backfill_reservation, OPT_BOOL, false)
OPTION(osd_debug_reject_backfill_probability, OPT_DOUBLE, 0)
OPTION(osd_debug_inject_copyfrom_error, OPT_BOOL, false) // inject failure during copyfrom completion
OPTION(osd_debug_misdirected_ops, OPT_BOOL, false)
OPTION(osd_debug_skip_full_check_in_recovery, OPT_BOOL, false)
OPTION(osd_enxio_on_misdirected_op, OPT_BOOL, false)
OPTION(osd_debug_verify_cached_snaps, OPT_BOOL, false)
OPTION(osd_enable_op_tracker, OPT_BOOL, true) // enable/disable OSD op tracking
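
A quick way to confirm the new defaults (30-second backfill retry, the new ``osd_recovery_retry_interval``, and removal of ``osd_backfill_full_ratio``) on a live daemon, assuming the ``ceph daemon`` wrapper is available (a sketch)::

    ceph daemon osd.0 config get osd_backfill_retry_interval
    ceph daemon osd.0 config get osd_recovery_retry_interval
    ceph daemon osd.0 config get osd_backfill_full_ratio   # expected to error out after this PR
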
1 change: 1 addition & 0 deletions src/include/rados.h
@@ -116,6 +116,7 @@ struct ceph_eversion {
#define CEPH_OSD_NEW (1<<3) /* osd is new, never marked in */
#define CEPH_OSD_FULL (1<<4) /* osd is at or above full threshold */
#define CEPH_OSD_NEARFULL (1<<5) /* osd is at or above nearfull threshold */
#define CEPH_OSD_BACKFILLFULL (1<<6) /* osd is at or above backfillfull threshold */

extern const char *ceph_osd_state_name(int s);

4 changes: 4 additions & 0 deletions src/mon/MonCommands.h
@@ -592,6 +592,10 @@ COMMAND("osd set-full-ratio " \
"name=ratio,type=CephFloat,range=0.0|1.0", \
"set usage ratio at which OSDs are marked full",
"osd", "rw", "cli,rest")
COMMAND("osd set-backfillfull-ratio " \
"name=ratio,type=CephFloat,range=0.0|1.0", \
"set usage ratio at which OSDs are marked too full to backfill",
"osd", "rw", "cli,rest")
COMMAND("osd set-nearfull-ratio " \
"name=ratio,type=CephFloat,range=0.0|1.0", \
"set usage ratio at which OSDs are marked near-full",