
mgr: implement 'osd safe-to-destroy' and 'osd ok-to-stop' commands #16976

Merged: 5 commits merged into ceph:master on Aug 13, 2017

Conversation

@liewegas (Member) commented Aug 10, 2017:

An osd is safe to destroy if

  • we have osd_stat for it
  • osd_stat indicates no pgs stored
  • all pgs are known
  • no pgs map to it
  • i.e., overall data durability will not be affected

An OSD is ok to stop if

  • we have the pg stats we need
  • no PGs will drop below min_size
  • i.e., availability won't be immediately compromised
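For readers skimming the PR, here is a minimal, self-contained sketch of the decision logic those bullets describe. The types and field names below (OsdInfo, PgInfo, etc.) are stand-ins for illustration only, not the mgr's real data structures:

```cpp
// Stand-in data structures; Ceph's real types (osd_stat_t, pg_stat_t, the
// osdmap) are richer -- this only mirrors the criteria listed above.
#include <iostream>
#include <vector>

struct OsdInfo {
  bool have_stat = false;   // do we have an osd_stat report for this OSD?
  int  pgs_stored = 0;      // PGs the OSD last reported as still stored
  int  pgs_mapped = 0;      // PGs the current map assigns to this OSD
};

struct PgInfo {
  int  acting = 0;          // current number of acting replicas/shards
  int  min_size = 0;        // the pool's min_size
  bool uses_osd = false;    // does this PG include the OSD in question?
};

// safe-to-destroy: destroying the OSD must not reduce durability at all.
bool safe_to_destroy(const OsdInfo& o, bool all_pgs_known)
{
  return o.have_stat && o.pgs_stored == 0 && all_pgs_known && o.pgs_mapped == 0;
}

// ok-to-stop: no PG that uses the OSD may drop below min_size when it stops.
bool ok_to_stop(const std::vector<PgInfo>& pgs)
{
  for (const auto& pg : pgs) {
    if (pg.uses_osd && pg.acting - 1 < pg.min_size)
      return false;         // availability would be immediately compromised
  }
  return true;
}

int main()
{
  OsdInfo drained{true, 0, 0};
  std::cout << safe_to_destroy(drained, /*all_pgs_known=*/true) << "\n"; // 1

  std::vector<PgInfo> pgs = {{3, 2, true}, {3, 2, false}};
  std::cout << ok_to_stop(pgs) << "\n";                                  // 1
}
```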

@liewegas (Member, Author):

@dvanders what do you think? this isn't smart enough to tell you whether removal will make the cluster go degraded but not lose data--that is a hard thing to tell. but it does tell you whether you've successfully drained an osd and can remove it without increasing the risk of data loss (which is really what we want, right?).

cmdctx->reply(r, ss);
return true;
}
ss << "osd." << osd << " is safe to remove";
Comment (Contributor):

Maybe s/remove/destroy/ to keep that verb consistent? (Also consistent with the ceph-disk command)

@dvanders (Contributor):

This more conservative approach is probably wise for the general safe-to-destroy command.

But maybe a min_size-based command could be useful if it were labelled as "ceph osd safe-to-stop" or safe-to-go-down?
This would check whether stopping an osd would cause any PG's up set to drop below min_size.

@liewegas (Member, Author):

@dvanders Yeah, we could do something like that, but it is harder. The bit that concerns me is that any such check is racy. If you have, say, 3 osds you want to stop, you can't check them all and then stop them all. You can't even do it in a loop unless you wait for the cluster to settle (fully peer) between each osd. We can have lots of warnings around it and make it take a list of OSDs to enable that sort of thing, but even so.

@jcsp suggested a more scary set of verbs for it to make it clear that it is not really safe because the redundancy in the cluster is being reduced. Something like unlikely-to-destroy-data-when-stopped but less of a mouthful. :/

@liewegas (Member, Author) commented Aug 10, 2017:

maybe "ceph osd unlikely-to-degrade-if-stopped osd.n" ?

@dvanders (Contributor):

Maybe not "degrade"... The PGs would still degrade if this osd were to be stopped, but they won't go inactive/down, or whatever it is.

Thinking about the use-case for this "probably-ok-to-stop":

  1. Operator wants to:
    • update some osds to newer ceph
    • power off a host
    • reboot a network switch
    • ...
    and it's not obvious to them if we have osd/host/rack-wise redundancy. Is it safe to stop this list of osds... or better yet a host, rack, etc.?

And yeah a warning about the raciness (against yourself or other operators) would be useful.

"Yes it is probably safe to stop osd/host/ x, provided there are no other ongoing interventions."

@dvanders (Contributor):

This looks great. Thanks!

@jcsp (Contributor) commented Aug 11, 2017:

I think I'd even be okay with dropping the probably- prefix -- and maybe changing the naming so that both commands start with ok- or safe- rather than differing. That way it is easy to explain both commands in one place, by saying that the difference is just that one command is about decommissioning and the other is about temporarily stopping.

It would be really nice to make parse_osd_id_list take arbitrary crush nodes too. See also things like https://bugzilla.redhat.com/show_bug.cgi?id=1318389, where the workflow would probably be a host-level "ok-to-stop" followed by a host-level "add-noout".

@liewegas (Member, Author) commented Aug 11, 2017:

So the thing with safe-to-destroy and safe-to-stop is that the difference sounds like it's about what action you're taking, but in both cases (destroy vs stop) the data is going offline. It's less about what you do and more about what criteria we're applying to decide whether we can do without it (no reduction in durability vs no reduction in immediate availability). Currently "safe" + "destroy" = durability, "ok" + "stop" = availability, but that's pretty subtle too. :/

I currently lean toward safe-to-destroy and ok-to-stop, with

  • check whether osd(s) can be safely destroyed without risking data loss
  • check whether osd(s) can be safely stopped without reducing immediate data availability

as the help messages...

@liewegas changed the title from "mgr: implement 'osd safe-to-destroy' command" to "mgr: implement 'osd safe-to-destroy' and 'osd ok-to-stop' commands" on Aug 11, 2017
@liewegas (Member, Author):

Ok, I think this is ready to go. I'll follow up later with improvements to the osd parsing to allow crush names to be used, along with a refactor that uses the same helper for a bunch of other mon commands.

@liewegas (Member, Author):

Note that I don't think we should backport the osd_stat change in mgr (and as a result we can't backport the test). We might do a conservative change for luminous that preserves just the seq for 'out' osds, just so this works, since I think we see failures from that occasionally in the upgrade tests.

@gregsfortytwo (Member) left a review:

Mostly good, but some suggested language changes and a couple of logic issues.

@@ -340,6 +341,7 @@ void osd_stat_t::encode(bufferlist &bl) const
::encode(os_perf_stat, bl);
::encode(up_from, bl);
::encode(seq, bl);
::encode(num_pgs, bl);
Comment (Member):

I thought we already had ways of getting the pgs-per-osd count by working some mapping in the opposite direction. But maybe that's just the number of CRUSH-assigned ones?
(Or maybe I had wanted this field, it didn't exist, and I swapped the memory in my head?)

@liewegas (Member, Author):

The point of this field is to act as a second, out-of-band safety check: not whether the current map maps to this osd, but whether there is still data stored on the osd that hasn't been cleaned up yet. It's a pretty strict check, but I'm erring on the side of paranoia here!

pending_inc.update_stat(from, stats->epoch, osd_stat_t());
}

pending_inc.update_stat(from, stats->epoch, std::move(stats->osd_stat));
Comment (Member):

Looks like John just didn't think we'd need stats on out OSDs (initial commit). This is better.

@liewegas (Member, Author):

Agree, although it mirrors PGMonitor's behavior, and the commit that added it was one of Sam's that added a bunch of checking and infrastructure to ensure that it is always in sync with osd_epochs (which AFAICS is obsolete). In any case, I think it's better this way, but I don't want to backport this part to luminous.

*ss << "invalid osd id '" << *i << "'";
return -EINVAL;
}
out->insert(osd);
Comment (Member):

Next time around, logic like this makes a lot more sense as an if block combined with a while loop... ;)
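For illustration, a hedged sketch of the shape being suggested here: a plain while loop over the id tokens with an if block for the "osd." prefix. The function name and signature are hypothetical, not the PR's actual parse_osd_id_list:

```cpp
#include <cerrno>
#include <cstdlib>
#include <set>
#include <sstream>
#include <string>
#include <vector>

// Hypothetical helper: parse tokens like "3" or "osd.3" into a set of ids.
int parse_osd_ids(const std::vector<std::string>& ids,
                  std::set<int>* out,
                  std::ostream* ss)
{
  auto i = ids.begin();
  while (i != ids.end()) {
    std::string token = *i;
    if (token.compare(0, 4, "osd.") == 0) {   // accept both "osd.N" and "N"
      token = token.substr(4);
    }
    char* end = nullptr;
    long osd = std::strtol(token.c_str(), &end, 10);
    if (end == token.c_str() || *end != '\0' || osd < 0) {
      *ss << "invalid osd id '" << *i << "'";
      return -EINVAL;
    }
    out->insert(static_cast<int>(osd));
    ++i;
  }
  return 0;
}
```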

@@ -107,6 +107,13 @@ COMMAND("osd test-reweight-by-pg " \
"dry run of reweight OSDs by PG distribution [overload-percentage-for-consideration, default 120]", \
"osd", "r", "cli,rest")

COMMAND("osd safe-to-destroy name=ids,type=CephString,n=N",
"check whether osd(s) can be safely destroyed without risking data loss",
"osd", "r", "cli,rest")
Comment (Member):

Maybe we could do "without reducing data durability" instead of "risking data loss"? Or do you expect this to expand to cover other kinds of contingencies I can't imagine?

The former phrasing is more specific and doesn't inspire quite as much fear of other cluster management commands.

@liewegas (Member, Author):

👍

<< "cannot draw any conclusions";
r = -EAGAIN;
} else if (!stored_pgs.empty()) {
ss << "OSD(s) " << stored_pgs << " last reported they still store PGs";
Comment (Member):

In this and the missing stats case we only care because not all PGs are active+clean; we should mention that.

return true;
}
ss << "OSD(s) " << osds << " are safe to destroy without reducing data "
<< "redundancy.";
Comment (Member):

s/redundancy/durability/ ?

}
const pg_pool_t *pi = osdmap.get_pg_pool(p.first.pool());
if (!pi) {
// pool deleted?
Comment (Member):

We should actually check this?

@liewegas (Member, Author):

Either the pool just got deleted (in which case we don't care) or perhaps we got pg stat reports that raced ahead of the osdmap update on the mgr. Either way, I don't think we care here.

I guess the creating case might mean we make pg creation break down. I'll just throw these PGs in the dangerous_pgs bucket. This should be extremely rare anyway (only possible right around pg create/delete).
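A small stand-in sketch of the conservative handling described above (the types and names are illustrative, not Ceph's): a PG whose pool cannot be found simply lands in the dangerous count rather than being guessed at:

```cpp
#include <map>
#include <vector>

struct PgStat  { int pool_id; };
struct PoolInfo { int min_size; };

// Count PGs whose pool can't be found in our copy of the osdmap as
// "dangerous": either the pool was just deleted (we don't care) or the pg
// stats raced ahead of the osdmap update, so be paranoid rather than guess.
int count_unknown_pool_pgs(const std::vector<PgStat>& pgs,
                           const std::map<int, PoolInfo>& pools)
{
  int dangerous_pgs = 0;
  for (const auto& pg : pgs) {
    if (pools.find(pg.pool_id) == pools.end())
      ++dangerous_pgs;   // rare: only possible around pool create/delete
  }
  return dangerous_pgs;
}
```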

(q->second.state & PG_STATE_DEGRADED)) {
// we don't currently have a good way to tell *how* degraded
// a degraded PG is, so we have to assume we cannot remove
// any more replicas/shards.
Comment (Member):

We should be able to apply the same min_size logic here to the degraded PGs that we do farther down?

@liewegas (Member, Author):

Not easily. The degraded flag means we have < acting.size() replicas for some objects, but we don't know how much less... maybe we can afford one more replica loss or maybe none. Erring on the side of caution here.

// a degraded PG is, so we have to assume we cannot remove
// any more replicas/shards.
++dangerous_pgs;
return;
Comment (Member):

Did you mean to continue here? Not sure why we'd cut the pg checking short.

@liewegas (Member, Author):

yes :)
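For reference, a stand-in sketch of the agreed fix (illustrative types and flag value, not Ceph's real pg_stat handling): the degraded PG is counted and the loop continues, instead of returning and cutting the scan short:

```cpp
#include <cstdint>
#include <vector>

constexpr uint32_t PG_STATE_DEGRADED = 1u << 0;  // stand-in flag value

struct Pg {
  uint32_t state;
  int acting;     // current number of acting replicas/shards
  int min_size;   // the pool's min_size
};

// Count PGs that would block 'ok-to-stop'.
int count_dangerous_pgs(const std::vector<Pg>& pgs)
{
  int dangerous_pgs = 0;
  for (const auto& pg : pgs) {
    if (pg.state & PG_STATE_DEGRADED) {
      // we can't tell *how* degraded, so assume we cannot lose another replica
      ++dangerous_pgs;
      continue;   // a `return` here would wrongly skip the remaining PGs
    }
    if (pg.acting - 1 < pg.min_size)
      ++dangerous_pgs;   // stopping the OSD would drop this PG below min_size
  }
  return dangerous_pgs;
}
```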

@@ -874,6 +874,18 @@ Usage::

ceph osd out <ids> [<ids>...]

Subcommand ``ok-to-stop`` checks whether the list of OSD(s) can be
stopped without reducing immediately data availability. That is, all
Comment (Member):

s/reducing immediately data availability/immediately making data unavailable/ is a much more accurate description of what it's doing.

Signed-off-by: Sage Weil <sage@redhat.com>
Signed-off-by: Sage Weil <sage@redhat.com>
I'm not quite sure why we were doing this. :/

Signed-off-by: Sage Weil <sage@redhat.com>
An osd is safe to destroy if

- we have osd_stat for it
- osd_stat indicates no pgs stored
- all pgs are known
- no pgs map to it

An osd is ok to stop if

- we have pg stats
- no pgs will drop below min_size

Signed-off-by: Sage Weil <sage@redhat.com>
This is hard with workunits/cephtool/test.sh because we don't
control the whole cluster.

Signed-off-by: Sage Weil <sage@redhat.com>
@gregsfortytwo (Member):

LGTM!

@liewegas merged commit 0a8ceaa into ceph:master on Aug 13, 2017
@liewegas deleted the wip-osd-empty branch on August 13, 2017 at 19:02
@vumrao (Contributor) commented Nov 22, 2021:

There was an upstream tracker for a similar request - https://tracker.ceph.com/issues/21579. Marked it resolved.
