
mgr: implement 'osd safe-to-destroy' and 'osd ok-to-stop' commands #16976

Merged: 5 commits merged into ceph:master on Aug 13, 2017

Conversation

@liewegas (Member) commented Aug 10, 2017:

An osd is safe to destroy if

  • we have osd_stat for it
  • osd_stat indicates no pgs stored
  • all pgs are known
  • no pgs map to it
  • i.e., overall data durability will not be affected

An OSD is ok to stop if

  • we have the pg stats we need
  • no PGs will drop below min_size
  • i.e., availability won't be immediately compromised
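For readers skimming the PR, here is a minimal, self-contained sketch of the decision logic those bullets describe. The types and field names below (OsdInfo, PgInfo, etc.) are stand-ins for illustration only, not the mgr's real data structures:

```cpp
// Stand-in data structures; Ceph's real types (osd_stat_t, pg_stat_t, the
// osdmap) are richer -- this only mirrors the criteria listed above.
#include <iostream>
#include <vector>

struct OsdInfo {
  bool have_stat = false;   // do we have an osd_stat report for this OSD?
  int  pgs_stored = 0;      // PGs the OSD last reported as still stored
  int  pgs_mapped = 0;      // PGs the current map assigns to this OSD
};

struct PgInfo {
  int  acting = 0;          // current number of acting replicas/shards
  int  min_size = 0;        // the pool's min_size
  bool uses_osd = false;    // does this PG include the OSD in question?
};

// safe-to-destroy: destroying the OSD must not reduce durability at all.
bool safe_to_destroy(const OsdInfo& o, bool all_pgs_known)
{
  return o.have_stat && o.pgs_stored == 0 && all_pgs_known && o.pgs_mapped == 0;
}

// ok-to-stop: no PG that uses the OSD may drop below min_size when it stops.
bool ok_to_stop(const std::vector<PgInfo>& pgs)
{
  for (const auto& pg : pgs) {
    if (pg.uses_osd && pg.acting - 1 < pg.min_size)
      return false;         // availability would be immediately compromised
  }
  return true;
}

int main()
{
  OsdInfo drained{true, 0, 0};
  std::cout << safe_to_destroy(drained, /*all_pgs_known=*/true) << "\n"; // 1

  std::vector<PgInfo> pgs = {{3, 2, true}, {3, 2, false}};
  std::cout << ok_to_stop(pgs) << "\n";                                  // 1
}
```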

@liewegas (Member, Author):

@dvanders what do you think? this isn't smart enough to tell you whether removal will make the cluster go degraded but not lose data--that is a hard thing to tell. but it does tell you whether you've successfully drained an osd and can remove it without increasing the risk of data loss (which is really what we want, right?).

cmdctx->reply(r, ss);
return true;
}
ss << "osd." << osd << " is safe to remove";
Comment (Contributor):

Maybe s/remove/destroy/ to keep that verb consistent? (Also consistent with the ceph-disk command)

@dvanders (Contributor):

This more conservative approach is probably wise for the general safe-to-destroy command.

But maybe a min_size-based command could be useful if it were labelled as "ceph osd safe-to-stop" or safe-to-go-down?
This would check whether stopping an osd would cause any PG's up set to drop below min_size.

@liewegas (Member, Author):

@dvanders Yeah, we could do something like that, but it is harder. The bit that concerns me is that any such check is racy. If you have, say, 3 osds you want to stop, you can't check them all and then stop them all. You can't even do it in a loop unless you wait for the cluster to settle (fully peer) between each osd. We can have lots of warnings around it and make it take a list of OSDs to enable that sort of thing, but even so.

@jcsp suggested a more scary set of verbs for it to make it clear that it is not really safe because the redundancy in the cluster is being reduced. Something like unlikely-to-destroy-data-when-stopped but less of a mouthful. :/

@liewegas (Member, Author) commented Aug 10, 2017:

maybe "ceph osd unlikely-to-degrade-if-stopped osd.n" ?

@dvanders (Contributor):

Maybe not "degrade"... The PGs would still degrade if this osd were to be stopped, but they won't go inactive/down, or whatever it is.

Thinking about the use-case for this "probably-ok-to-stop":

  1. Operator wants to:
    • update some osds to newer ceph
    • power off a host
    • reboot a network switch
    • ...
    and it's not obvious to them if we have osd/host/rack-wise redundancy. Is it safe to stop this list of osds... or better yet a host, rack, etc.?

And yeah a warning about the raciness (against yourself or other operators) would be useful.

"Yes it is probably safe to stop osd/host/ x, provided there are no other ongoing interventions."

@dvanders (Contributor):

This looks great. Thanks!

@jcsp (Contributor) commented Aug 11, 2017:

I think I'd even be okay with dropping the probably- prefix -- and maybe changing the naming so that both commands start with ok- or safe- rather than differing. That way it is easy to explain both commands in one place, by saying that the difference is just that one command is about decommissioning and the other is about temporarily stopping.

It would be really nice to make parse_osd_id_list take arbitrary crush nodes too. See also things like https://bugzilla.redhat.com/show_bug.cgi?id=1318389, where the workflow would probably be a host-level "ok-to-stop" followed by a host-level "add-noout".

@liewegas (Member, Author) commented Aug 11, 2017:

So the thing with safe-to-destroy and safe-to-stop is that the difference sounds like it's about what action you're taking, but in both cases (destroy vs stop) the data is going offline. It's less about what you do and more about what criteria we're applying to decide whether we can do without it (no reduction in durability vs no reduction in immediate availability). Currently "safe" + "destroy" = durability, "ok" + "stop" = availability, but that's pretty subtle too. :/

I currently lean toward safe-to-destroy and ok-to-stop, with

  • check whether osd(s) can be safely destroyed without risking data loss
  • check whether osd(s) can be safely stopped without reducing immediate data availability

as the help messages...

@liewegas changed the title from "mgr: implement 'osd safe-to-destroy' command" to "mgr: implement 'osd safe-to-destroy' and 'osd ok-to-stop' commands" on Aug 11, 2017
@liewegas (Member, Author):

Ok, I think this is ready to go. I'll follow up later with improvements to the osd parsing to allow crush names to be used, along with a refactor that uses the same helper for a bunch of other mon commands.

@liewegas (Member, Author):

Note that I don't think we should backport the osd_stat change in mgr (and as a result we can't backport the test). We might do a conservative change for luminous that preserves just the seq for 'out' osds, just so this works, since I think we see failures from that occasionally in the upgrade tests.

@gregsfortytwo (Member) left a review:

Mostly good, but some suggested language changes and a couple of logic issues.

@@ -340,6 +341,7 @@ void osd_stat_t::encode(bufferlist &bl) const
::encode(os_perf_stat, bl);
::encode(up_from, bl);
::encode(seq, bl);
::encode(num_pgs, bl);
Comment (Member):

I thought we already had ways of getting the pgs-per-osd count by working some mapping in the opposite direction. But maybe that's just the number of CRUSH-assigned ones?
(Or maybe I had wanted this field, it didn't exist, and I swapped the memory in my head?)

@liewegas (Member, Author):

The point of this field is to act as a second, out-of-band safety check: not whether the current map maps to this osd, but whether there is still data stored on the osd that hasn't been cleaned up yet. It's a pretty strict check, but I'm erring on the side of paranoia here!

pending_inc.update_stat(from, stats->epoch, osd_stat_t());
}

pending_inc.update_stat(from, stats->epoch, std::move(stats->osd_stat));
Comment (Member):

Looks like John just didn't think we'd need stats on out OSDs (initial commit). This is better.

@liewegas (Member, Author):

Agree, although it mirrors PGMonitor's behavior, and the commit that added it was one of Sam's that added a bunch of checking and infrastructure to ensure that it is always in sync with osd_epochs (which AFAICS is obsolete). In any case, I think it's better this way, but I don't want to backport this part to luminous.

*ss << "invalid osd id '" << *i << "'";
return -EINVAL;
}
out->insert(osd);
Comment (Member):

Next time around, logic like this makes a lot more sense as an if block combined with a while loop... ;)
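For illustration, a hedged sketch of the shape being suggested here: a plain while loop over the id tokens with an if block for the "osd." prefix. The function name and signature are hypothetical, not the PR's actual parse_osd_id_list:

```cpp
#include <cerrno>
#include <cstdlib>
#include <set>
#include <sstream>
#include <string>
#include <vector>

// Hypothetical helper: parse tokens like "3" or "osd.3" into a set of ids.
int parse_osd_ids(const std::vector<std::string>& ids,
                  std::set<int>* out,
                  std::ostream* ss)
{
  auto i = ids.begin();
  while (i != ids.end()) {
    std::string token = *i;
    if (token.compare(0, 4, "osd.") == 0) {   // accept both "osd.N" and "N"
      token = token.substr(4);
    }
    char* end = nullptr;
    long osd = std::strtol(token.c_str(), &end, 10);
    if (end == token.c_str() || *end != '\0' || osd < 0) {
      *ss << "invalid osd id '" << *i << "'";
      return -EINVAL;
    }
    out->insert(static_cast<int>(osd));
    ++i;
  }
  return 0;
}
```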

@@ -107,6 +107,13 @@ COMMAND("osd test-reweight-by-pg " \
"dry run of reweight OSDs by PG distribution [overload-percentage-for-consideration, default 120]", \
"osd", "r", "cli,rest")

COMMAND("osd safe-to-destroy name=ids,type=CephString,n=N",
"check whether osd(s) can be safely destroyed without risking data loss",
"osd", "r", "cli,rest")
Comment (Member):

Maybe we could do "without reducing data durability" instead of "risking data loss"? Or do you expect this to expand to cover other kinds of contingencies I can't imagine?

The former phrasing is more specific and doesn't inspire quite as much fear of other cluster management commands.

@liewegas (Member, Author):

👍

<< "cannot draw any conclusions";
r = -EAGAIN;
} else if (!stored_pgs.empty()) {
ss << "OSD(s) " << stored_pgs << " last reported they still store PGs";
Comment (Member):

In this and the missing stats case we only care because not all PGs are active+clean; we should mention that.

return true;
}
ss << "OSD(s) " << osds << " are safe to destroy without reducing data "
<< "redundancy.";
Comment (Member):

s/redundancy/durability/ ?

}
const pg_pool_t *pi = osdmap.get_pg_pool(p.first.pool());
if (!pi) {
// pool deleted?
Comment (Member):

We should actually check this?

@liewegas (Member, Author):

Either the pool just got deleted (in which case we don't care) or perhaps we got pg stat reports that raced ahead of the osdmap update on the mgr. Either way, I don't think we care here.

I guess the creating case might mean we make pg creation break down. I'll just throw these PGs in the dangerous_pgs bucket. This should be extremely rare anyway (only possible right around pg create/delete).
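A small stand-in sketch of the conservative handling described above (the types and names are illustrative, not Ceph's): a PG whose pool cannot be found simply lands in the dangerous count rather than being guessed at:

```cpp
#include <map>
#include <vector>

struct PgStat  { int pool_id; };
struct PoolInfo { int min_size; };

// Count PGs whose pool can't be found in our copy of the osdmap as
// "dangerous": either the pool was just deleted (we don't care) or the pg
// stats raced ahead of the osdmap update, so be paranoid rather than guess.
int count_unknown_pool_pgs(const std::vector<PgStat>& pgs,
                           const std::map<int, PoolInfo>& pools)
{
  int dangerous_pgs = 0;
  for (const auto& pg : pgs) {
    if (pools.find(pg.pool_id) == pools.end())
      ++dangerous_pgs;   // rare: only possible around pool create/delete
  }
  return dangerous_pgs;
}
```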

(q->second.state & PG_STATE_DEGRADED)) {
// we don't currently have a good way to tell *how* degraded
// a degraded PG is, so we have to assume we cannot remove
// any more replicas/shards.
Comment (Member):

We should be able to apply the same min_size logic here to the degraded PGs that we do farther down?

@liewegas (Member, Author):

Not easily. The degraded flag means we have < acting.size() replicas for some objects, but we don't know how much less... maybe we can afford one more replica loss or maybe none. Erring on the side of caution here.

// a degraded PG is, so we have to assume we cannot remove
// any more replicas/shards.
++dangerous_pgs;
return;
Comment (Member):

Did you mean to continue here? Not sure why we'd cut the pg checking short.

@liewegas (Member, Author):

yes :)
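For reference, a stand-in sketch of the agreed fix (illustrative types and flag value, not Ceph's real pg_stat handling): the degraded PG is counted and the loop continues, instead of returning and cutting the scan short:

```cpp
#include <cstdint>
#include <vector>

constexpr uint32_t PG_STATE_DEGRADED = 1u << 0;  // stand-in flag value

struct Pg {
  uint32_t state;
  int acting;     // current number of acting replicas/shards
  int min_size;   // the pool's min_size
};

// Count PGs that would block 'ok-to-stop'.
int count_dangerous_pgs(const std::vector<Pg>& pgs)
{
  int dangerous_pgs = 0;
  for (const auto& pg : pgs) {
    if (pg.state & PG_STATE_DEGRADED) {
      // we can't tell *how* degraded, so assume we cannot lose another replica
      ++dangerous_pgs;
      continue;   // a `return` here would wrongly skip the remaining PGs
    }
    if (pg.acting - 1 < pg.min_size)
      ++dangerous_pgs;   // stopping the OSD would drop this PG below min_size
  }
  return dangerous_pgs;
}
```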

@@ -874,6 +874,18 @@ Usage::

ceph osd out <ids> [<ids>...]

Subcommand ``ok-to-stop`` checks whether the list of OSD(s) can be
stopped without reducing immediately data availability. That is, all
Comment (Member):

s/reducing immediately data availability/immediately making data unavailable/ is a much more accurate description of what it's doing.

Signed-off-by: Sage Weil <sage@redhat.com>
Signed-off-by: Sage Weil <sage@redhat.com>
I'm not quite sure why we were doing this. :/

Signed-off-by: Sage Weil <sage@redhat.com>
An osd is safe to destroy if

- we have osd_stat for it
- osd_stat indicates no pgs stored
- all pgs are known
- no pgs map to it

An osd is ok to stop if

- we have pg stats
- no pgs will drop below min_size

Signed-off-by: Sage Weil <sage@redhat.com>
This is hard with workunits/cephtool/test.sh because we don't
control the whole cluster.

Signed-off-by: Sage Weil <sage@redhat.com>
@gregsfortytwo (Member):

LGTM!

@liewegas merged commit 0a8ceaa into ceph:master on Aug 13, 2017
@liewegas deleted the wip-osd-empty branch on August 13, 2017 at 19:02
@vumrao (Contributor) commented Nov 22, 2021:

There was an upstream tracker for a similar request - https://tracker.ceph.com/issues/21579. Marked it resolved.
