From 76d419512b9dc9de8060717f035676e704879435 Mon Sep 17 00:00:00 2001
From: Greg Farnum
Date: Thu, 16 Jul 2020 18:15:51 +0000
Subject: [PATCH] doc: describe stretch mode for users and developers

Signed-off-by: Greg Farnum
---
 doc/rados/operations/index.rst        |   3 +-
 doc/rados/operations/stretch-mode.rst | 173 ++++++++++++++++++++++++++
 2 files changed, 175 insertions(+), 1 deletion(-)
 create mode 100644 doc/rados/operations/stretch-mode.rst

diff --git a/doc/rados/operations/index.rst b/doc/rados/operations/index.rst
index 351ef53e417f51..c8dff51a1fbc68 100644
--- a/doc/rados/operations/index.rst
+++ b/doc/rados/operations/index.rst
@@ -41,6 +41,8 @@ CRUSH algorithm.
    upmap
    crush-map
    crush-map-edits
+   stretch-mode
+   change-mon-elections

@@ -60,7 +62,6 @@ with new hardware.
    add-or-rm-osds
    add-or-rm-mons
-   change-mon-elections
    devices
    bluestore-migration
    Command Reference

diff --git a/doc/rados/operations/stretch-mode.rst b/doc/rados/operations/stretch-mode.rst
new file mode 100644
index 00000000000000..e94bfb57e11cd5
--- /dev/null
+++ b/doc/rados/operations/stretch-mode.rst
@@ -0,0 +1,173 @@
.. _stretch_mode:

================
Stretch Clusters
================


Stretch Clusters
================
Ceph generally expects all parts of its network and overall cluster to be
equally reliable, with failures randomly distributed across the CRUSH map.
So you may lose a switch that knocks out a big segment of OSDs, but we expect
the remaining OSDs and monitors to route around that.

This is usually a good choice, but it may not work well in some
stretched cluster configurations, where a significant part of your cluster
sits behind a single network component. For instance, consider a single
cluster located in multiple data centers, which you want to keep running
through the loss of a full data center.

There are two standard configurations we've seen deployed, with either
two or three data centers (or, in clouds, availability zones). With two
zones, we expect each site to hold a copy of the data, and a third
site to host a tiebreaker monitor (this can be a VM, or run at higher
latency than the main sites) to pick a winner if the network connection
fails and both DCs remain alive. For three sites, we expect a copy of the
data and an equal number of monitors in each site.

Note that the standard Ceph configuration will survive many failures of
the network or data centers, if you have configured it correctly, and it
will never compromise data consistency -- if you bring back enough of the
Ceph servers following a failure, it will recover. If you lose
a data center and can still form a quorum of monitors and have all the data
available (with enough copies to satisfy min_size, or CRUSH rules that will
re-replicate to meet it), Ceph will maintain availability.

What can't it handle?

Stretch Cluster Issues
======================
No matter what happens, Ceph will not compromise on data integrity
and consistency. If there's a failure in your network or a loss of nodes and
you can restore service, Ceph will return to normal functionality on its own.

But there are scenarios where you lose data availability despite having
enough servers available to satisfy Ceph's consistency and sizing
constraints, or where you may be surprised to find that your cluster does
not actually satisfy those constraints.
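
For reference, you can inspect the sizing constraints that apply to any
pool with the standard pool commands (a minimal sketch; mypool is a
placeholder pool name)::

  $ ceph osd pool get mypool size
  $ ceph osd pool get mypool min_size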
The first important category of these failures revolves around inconsistent
networks -- if there's a netsplit, Ceph may be unable to mark OSDs down and
kick them out of the acting PG sets even though the primary is unable to
replicate data to them. If this happens, IO will not be permitted, because
Ceph can't satisfy its durability guarantees.

The second important category of failures is when you think you have data
replicated across data centers, but the constraints aren't sufficient to
guarantee this. For instance, you might have data centers A and B, and your
CRUSH rule targets 3 copies and places a copy in each data center with a
min_size of 2. The PG may go active with 2 copies in site A and no copies
in site B, which means that if you then lose site A, you have lost the data
and Ceph can't operate on it. This situation is surprisingly difficult
to avoid with standard CRUSH rules.

Stretch Mode
============
The new stretch mode is designed to handle the 2-site case. (3-site
clusters are just as susceptible to netsplit issues, but are much more
resilient than 2-site clusters to the surprising data availability problems
described above.)

To enter stretch mode, you must set the location of each monitor, matching
your CRUSH map. For instance, to place mon.a in your first data center::

  $ ceph mon set_location a datacenter=site1

Next, generate a CRUSH rule which will place 2 copies in each data center.
This requires editing the CRUSH map directly::

  $ ceph osd getcrushmap > crush.map.bin
  $ crushtool -d crush.map.bin -o crush.map.txt

Then edit the crush.map.txt file to add a new rule. In this example there
is only one other rule, so we use id 1, but you may need to use a different
rule id. We also have two data center buckets named site1 and site2::

  rule stretch_rule {
          id 1
          type replicated
          min_size 1
          max_size 10
          step take site1
          step chooseleaf firstn 2 type host
          step emit
          step take site2
          step chooseleaf firstn 2 type host
          step emit
  }

Finally, inject the CRUSH map to make the rule available to the cluster::

  $ crushtool -c crush.map.txt -o crush2.map.bin
  $ ceph osd setcrushmap -i crush2.map.bin

If you aren't already running your monitors in connectivity mode, do so
using the instructions in `Changing Monitor Elections`_.

.. _Changing Monitor Elections: ../change-mon-elections

And last, tell the cluster to enter stretch mode. Here, mon.e is the
tiebreaker and we are splitting across data centers::

  $ ceph mon enable_stretch_mode e stretch_rule datacenter

When stretch mode is enabled, the OSDs will only take PGs active when
they peer across data centers (or whatever other CRUSH bucket type
you specified), assuming both are alive. Pools will increase in size
from the default 3 to 4, expecting 2 copies in each site. OSDs will only
be allowed to connect to monitors in the same data center.

If all the OSDs and monitors from one data center become inaccessible
at once, the surviving data center will enter a degraded stretch mode,
reducing pool size to 2 and min_size to 1, issuing a warning, and
going active by itself.

When the missing data center comes back, the cluster will enter
recovery stretch mode. This increases the pool size back to 4 and min_size
to 2, but still only requires OSDs from the data center which was up the
whole time, and it continues issuing a warning. This mode then waits until
all PGs are in a known state, and are neither degraded nor incomplete. At
that point, it transitions back to regular stretch mode and the warning
clears.
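
Both degraded stretch mode and recovery stretch mode surface as health
warnings, so you can watch these transitions with the usual health commands
(a minimal sketch; the exact warning text varies by release)::

  $ ceph status
  $ ceph health detail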


Stretch Mode Limitations
========================
As implied by the setup, stretch mode only handles 2 sites with OSDs.

While it is not enforced, you should run 2 monitors in each site plus
a tiebreaker, for a total of 5. This is because OSDs can only connect
to monitors in their own site when in stretch mode.

You cannot use erasure-coded pools with stretch mode. If you try to enable
stretch mode with EC pools present, it will refuse, and it will not allow
you to create EC pools once stretch mode is active.

You must create your own CRUSH rule which provides 2 copies in each site,
and you must use 4 total copies with 2 in each site. If you have existing
pools with non-default size/min_size, Ceph will object when you attempt to
enable_stretch_mode.

Because it runs with min_size 1 when degraded, you should only use stretch
mode with all-flash OSDs; this minimizes the time needed to recover once
connectivity is restored, and so minimizes the window in which only a
single copy of the data exists.

Hopefully, future development will extend this feature to support EC pools
and running with more than 2 full sites.

Other commands
==============
When in degraded stretch mode, the cluster will go into "recovery" mode
automatically when the disconnected data center comes back. If that doesn't
work, or you want to enable recovery mode early, you can invoke::

  $ ceph osd force_recovery_stretch_mode --yes-i-really-mean-it

But this command should not be necessary; it is included to deal with
unanticipated situations.

When in recovery mode, the cluster should go back into normal stretch mode
when the PGs are healthy. If this doesn't happen, or you want to force the
cross-data-center peering early and are willing to risk data downtime (or
have verified separately that all the PGs can peer, even if they aren't
fully recovered), you can invoke::

  $ ceph osd force_healthy_stretch_mode --yes-i-really-mean-it

This command should not be necessary; it is included to deal with
unanticipated situations. But you might wish to invoke it to remove
the HEALTH_WARN state which recovery mode generates.
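
To confirm how the cluster is currently configured, you can dump the
monitor map, which records whether stretch mode is enabled and which
monitor is the tiebreaker (a minimal sketch; the exact fields shown vary
by release)::

  $ ceph mon dump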