mgr: implement 'osd safe-to-destroy' and 'osd ok-to-stop' commands #16976
Conversation
@dvanders what do you think? this isn't smart enough to tell you whether removal will make the cluster go degraded but not lose data--that is a hard thing to tell. but it does tell you whether you've successfully drained an osd and can remove it without increasing the risk of data loss (which is really what we want, right?).
src/mgr/DaemonServer.cc
    cmdctx->reply(r, ss);
    return true;
  }
  ss << "osd." << osd << " is safe to remove";
Maybe s/remove/destroy/ to keep that verb consistent? (Also consistent with the ceph-disk command)
This more conservative approach is probably wise for a general safe-to-destroy command. But maybe a min_size-based command could be useful if it were labelled "ceph osd safe-to-stop" or safe-to-go-down?
Force-pushed from e75f57e to 4585604.
@dvanders Yeah we could do something like that, but it is harder. The bit that concerns me is that any such check is racy. If you have, say, 3 osds you want to stop, you can't check them all and then stop them all. You can't even do it in a loop unless you wait for the cluster to settle (fully peer) between each osd. We can have lots of warnings around it and make it take a list of OSDs to enable that sort of thing, but even so. @jcsp suggested a more scary set of verbs for it to make it clear that it is not really safe because the redundancy in the cluster is being reduced. Something like unlikely-to-destroy-data-when-stopped but less of a mouthful. :/
maybe "ceph osd unlikely-to-degrade-if-stopped osd.n" ? |
Maybe not "degrade"... The PGs would still degrade if this osd were to be stopped, but they won't go inactive/down or whatever it is. Thinking about the use-case for this "probably-ok-to-stop":
And yeah, a warning about the raciness (against yourself or other operators) would be useful. "Yes, it is probably safe to stop osd/host x, provided there are no other ongoing interventions."
Force-pushed from 4585604 to 15d9094.
This looks great. Thanks!
I think I'd even be okay with dropping the ...

It would be really nice to make parse_osd_id_list take arbitrary crush nodes too. See also things like https://bugzilla.redhat.com/show_bug.cgi?id=1318389, where the workflow would probably be a host-level "ok-to-stop" followed by a host-level "add-noout".
So the thing with ... I currently lean toward ... as the help messages...
Force-pushed from 15d9094 to 93fec6e.
Ok I think this is ready to go. I'll follow up with improvements to the osd parsing to allow crush names to be used later, along with a refactor that uses the same helper for a bunch of other mon commands. |
Note that I don't think we should backport the osd_stat change in mgr (and as a result we can't backport the test). Might do a conservative change that preserves just the seq for 'out' osds for luminous, just so this works, as I think we see failures from that occasionally in the upgrade tests.
Force-pushed from 93fec6e to 8ba90b9.
Mostly good but some suggested language changes and a couple logic issues.
@@ -340,6 +341,7 @@ void osd_stat_t::encode(bufferlist &bl) const
  ::encode(os_perf_stat, bl);
  ::encode(up_from, bl);
  ::encode(seq, bl);
  ::encode(num_pgs, bl);
I thought we already had ways of getting the pgs-per-osd by working some mapping in an opposite direction. But maybe that's just the number of CRUSH-assigned ones?
(Or I had wanted this field and it didn't exist but I swapped the memory in my head?)
The point of this field is as a second out of band safety check. Not whether the current map maps to this osd, but whether there is still data stored on the osd that hasn't been cleaned up yet. It's a pretty strict check but I'm erring on the side of paranoid here!
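For readers following along, a minimal sketch of the two-sided check described here; the types and helper name are simplified stand-ins for illustration, not the actual osd_stat_t/OSDMap/DaemonServer.cc code:

```cpp
// Sketch only (stand-in types): an OSD is treated as safe to destroy only if
// we have stats for it at all, the OSD itself last reported zero stored PGs
// (the new num_pgs field), and the current map no longer maps any PGs to it.
#include <map>
#include <set>

struct OsdStatStandIn {
  int num_pgs = 0;   // mirrors the num_pgs field this PR adds to osd_stat_t
};

bool safe_to_destroy_sketch(const std::map<int, OsdStatStandIn>& osd_stats,
                            const std::set<int>& osds_with_mapped_pgs,  // derived from the osdmap
                            int osd)
{
  auto p = osd_stats.find(osd);
  if (p == osd_stats.end())
    return false;                    // no stats at all: cannot draw conclusions
  if (p->second.num_pgs > 0)
    return false;                    // OSD says it still stores PG data
  if (osds_with_mapped_pgs.count(osd))
    return false;                    // current map still assigns PGs to it
  return true;                       // deliberately paranoid on both axes
}
```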
    pending_inc.update_stat(from, stats->epoch, osd_stat_t());
  }

  pending_inc.update_stat(from, stats->epoch, std::move(stats->osd_stat));
Looks like John just didn't think we'd need stats on out OSDs (initial commit). This is better.
Agree, although it mirrors PGMonitor's behavior, and the commit that added it was one of Sam's that adds a bunch of checking and infrastructure to ensure that it is always in sync with osd_epochs (which AFAICS is obsolete). In any case, I think it's better this way but I don't want to backport this part to luminous.
*ss << "invalid osd id '" << *i << "'"; | ||
return -EINVAL; | ||
} | ||
out->insert(osd); |
Next time around, logic like this makes a lot more sense as an if block combined with a while loop... ;)
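To make that concrete, here is one hedged reading of the suggestion as a hypothetical helper; the wildcard handling and all names here are assumptions, not the actual parse_osd_id_list, and plain strtol stands in for Ceph's own parsing helpers:

```cpp
// Hypothetical sketch of the "if block + while loop" structure suggested above.
#include <cerrno>
#include <cstdlib>
#include <set>
#include <sstream>
#include <string>
#include <vector>

int parse_osd_ids_sketch(const std::vector<std::string>& ids,
                         const std::set<int>& all_osds,   // e.g. gathered from the osdmap
                         std::set<int> *out,
                         std::stringstream *ss)
{
  // Handle a wildcard up front in one if block...
  if (ids.size() == 1 && (ids[0] == "all" || ids[0] == "any")) {
    *out = all_osds;
    return 0;
  }
  // ...then walk the individual ids with a plain while loop.
  auto i = ids.begin();
  while (i != ids.end()) {
    char *end = nullptr;
    long osd = std::strtol(i->c_str(), &end, 10);
    if (end == i->c_str() || *end != '\0' || osd < 0) {
      *ss << "invalid osd id '" << *i << "'";
      return -EINVAL;
    }
    out->insert(static_cast<int>(osd));
    ++i;
  }
  return 0;
}
```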
src/mgr/MgrCommands.h
@@ -107,6 +107,13 @@ COMMAND("osd test-reweight-by-pg " \
        "dry run of reweight OSDs by PG distribution [overload-percentage-for-consideration, default 120]", \
        "osd", "r", "cli,rest")

COMMAND("osd safe-to-destroy name=ids,type=CephString,n=N",
        "check whether osd(s) can be safely destroyed without risking data loss",
        "osd", "r", "cli,rest")
Maybe we could do "without reducing data durability" instead of "risking data loss"? Or do you expect this to expand to cover other kinds of contingencies I can't imagine?
The former phrasing is more specific and doesn't inspire quite as much fear of other cluster management commands.
👍
src/mgr/DaemonServer.cc
<< "cannot draw any conclusions"; | ||
r = -EAGAIN; | ||
} else if (!stored_pgs.empty()) { | ||
ss << "OSD(s) " << stored_pgs << " last reported they still store PGs"; |
In this and the missing stats case we only care because not all PGs are active+clean; we should mention that.
src/mgr/DaemonServer.cc
    return true;
  }
  ss << "OSD(s) " << osds << " are safe to destroy without reducing data "
     << "redundancy.";
s/redundancy/durability/ ?
src/mgr/DaemonServer.cc
  }
  const pg_pool_t *pi = osdmap.get_pg_pool(p.first.pool());
  if (!pi) {
    // pool deleted?
We should actually check this?
Either the pool just got deleted (in which case we don't care) or perhaps we got pg stat reports that raced ahead of the osdmap update on the mgr. Either way, I don't think we care here.
I guess the creating case might mean we make pg creation break down. I'll just throw these PGs in the dangerous_pgs bucket. This should be extremely rare anyway (only possible right around pg create/delete).
      (q->second.state & PG_STATE_DEGRADED)) {
    // we don't currently have a good way to tell *how* degraded
    // a degraded PG is, so we have to assume we cannot remove
    // any more replicas/shards.
We should be able to apply the same min_size logic here to the degraded PGs that we do farther down?
Not easily. The degraded flag means we have < acting.size() replicas for some objects, but we don't know how much less... maybe we can afford one more replica loss or maybe none. Erring on the side of caution here.
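As an aside for readers, a compact sketch of the per-PG reasoning discussed in this thread and in the pool-lookup thread above; the types and names are stand-ins, not the real DaemonServer.cc loop:

```cpp
// Sketch only: a PG makes stopping the listed OSDs "dangerous" if its pool is
// unknown (deleted, or stats raced ahead of the osdmap), if it is already
// degraded (we cannot tell how many more replicas it can afford to lose), or
// if stopping those OSDs would leave fewer than min_size acting replicas.
#include <set>
#include <vector>

struct PgInfoStandIn {
  bool pool_known = true;      // osdmap still knows this PG's pool
  bool degraded = false;       // PG_STATE_DEGRADED in the real code
  std::vector<int> acting;     // acting set of OSD ids
  int min_size = 2;            // the pool's min_size
};

bool pg_dangerous_if_stopped(const PgInfoStandIn& pg,
                             const std::set<int>& osds_to_stop)
{
  if (!pg.pool_known)
    return true;               // rare race around pg/pool create/delete
  if (pg.degraded)
    return true;               // conservatively assume no replica to spare
  int remaining = 0;
  for (int osd : pg.acting) {
    if (!osds_to_stop.count(osd))
      ++remaining;
  }
  return remaining < pg.min_size;   // would drop below min_size -> not ok
}
```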
src/mgr/DaemonServer.cc
    // a degraded PG is, so we have to assume we cannot remove
    // any more replicas/shards.
    ++dangerous_pgs;
    return;
Did you mean to continue here? Not sure why we'd cut the pg checking short.
yes :)
doc/man/8/ceph.rst
@@ -874,6 +874,18 @@ Usage::

        ceph osd out <ids> [<ids>...]

Subcommand ``ok-to-stop`` checks whether the list of OSD(s) can be
stopped without reducing immediately data availability. That is, all
s/reducing immediately data availability/immediately making data unavailable/ is a much more accurate description of what it's doing.
I'm not quite sure why we were doing this. :/ Signed-off-by: Sage Weil <sage@redhat.com>
Force-pushed from 8ba90b9 to 08b3624.
An osd is safe to destroy if

- we have osd_stat for it
- osd_stat indicates no pgs stored
- all pgs are known
- no pgs map to it

An osd is ok to stop if

- we have pg stats
- no pgs will drop below min_size

Signed-off-by: Sage Weil <sage@redhat.com>
This is hard with workunits/cephtool/test.sh because we don't control the whole cluster.

Signed-off-by: Sage Weil <sage@redhat.com>
Force-pushed from 08b3624 to ba66977.
LGTM!
There was an upstream tracker for a similar request - https://tracker.ceph.com/issues/21579. Marked it resolved.