New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
mon: osdmap prune #19331
mon: osdmap prune #19331
Conversation
I wish every such PR could have a such exhaustive docs. ;) |
doc/dev/mon-osdmap-prune.rst
Outdated
upper_pinned(last_pruned) C pinned AND | ||
lower_pinned(last_pruned) C pinned AND | ||
upper_pinned(last_pruned) != lower_pinned(last_pruned) | ||
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
indentation was messed up for these last two blocks. I'll address it before merging.
src/common/options.cc
Outdated
@@ -897,6 +897,36 @@ std::vector<Option> get_global_options() { | |||
.set_default(true) | |||
.set_description(""), | |||
|
|||
/* -- mon: osdmap prune (begin) -- */ | |||
Option("mon_osdmap_full_prune_enabled", Option::TYPE_BOOL, Option::LEVEL_ADVANCED) | |||
.set_default(false) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
true!
src/mon/OSDMonitor.cc
Outdated
* for more information, please refer to doc/dev/mon-osdmap-prune.rst | ||
*/ | ||
|
||
void OSDMonitor::update_osdmap_manifest() |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
load_osdmap_manifest()? i wasn't sure when i got here whether this was making some change or refreshing state.
src/mon/OSDMonitor.cc
Outdated
dout(10) << __func__ | ||
<< " interval [" << first << ".." << last << "]" << dendl; | ||
for (version_t v = first; v <= last; ++v) { | ||
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
unnecessary whitespace
src/mon/OSDMonitor.cc
Outdated
version_t next = osdmap_manifest.get_next_prunable(); | ||
|
||
if (g_conf->get_val<bool>("mon_debug_extra_checks") && next > 0) { | ||
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
unnecessary whitespace
src/mon/OSDMonitor.cc
Outdated
|
||
dout(20) << __func__ | ||
<< " dropping osdmap manifest from memory." << dendl; | ||
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
unnecessary whitespace
src/mon/OSDMonitor.cc
Outdated
return ret; | ||
} | ||
} else if (ret != 0) { | ||
return ret; | ||
} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
simpler?
if (ret == -ENOENT) { // build map? ret = get_full_from_pinned_map(ver, bl); } if (ret != 0) { return ret; }
src/mon/OSDMonitor.h
Outdated
return get_next_prunable(last_pruned); | ||
} | ||
|
||
version_t get_next_prunable(version_t v) const |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I wonder if it would be simpler to just drop this last_pruned, in_progress thing. Instead, when we prune, we (in a single transaction) (1) add the pinned epoch and (2) delete all maps between the new pinned epoch and the next one. I don't think the gap should be more then a few 10s of epochs anyway, and deletes aren't expensive to include in a transaction.
src/mon/OSDMonitor.cc
Outdated
osdm->encode(tbl, encode_features | CEPH_FEATURE_RESERVED); | ||
tosdm->decode(tbl); | ||
delete osdm; | ||
osdm = tosdm; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
why do the switcheroo to tosdm here? paranoia?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
pretty much; I don't want to rely on squashing existing values when decoding to an already existing map. And afaict we need the whole encode/decode dance to get updated crcs out of the maps.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think you should drop this. In the normal path(s) we apply multiple incrementals to one map. I think that's safe.
If you want you can add an if (mon_debug_extra_checks) branch that does a paranoid check.
src/mon/Monitor.cc
Outdated
@@ -306,6 +306,35 @@ void Monitor::do_admin_command(string command, cmdmap_t& cmdmap, string format, | |||
f->flush(ss); | |||
} | |||
|
|||
} else if (command == "test generate osdmap") { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
this whole thing makes me nervous... it seems inevitable that it will rot. can we instead do a loop like 'for f in seq 1 100
; do ceph osd unset nodown ; done'?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
i still don't like this patch; i'd rather do it the slow way (for loop in bash doing some trivial change)
'--mon-debug-block-osdmap-trim '\ | ||
'--mon-debug-extra-checks' || return 1 | ||
|
||
ceph daemon $(get_asok_path mon.a) test generate osdmap 1000 || return 1 |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
ok, i guess it won't bit rot. :)
The other thing that would be nice here is if one of the thrashing test cases set things such that this is regularly triggered. Like, min epochs = 25, prune max = 5, prune tx = 1, something like that. |
Option("mon_osdmap_full_prune_min", Option::TYPE_UINT, Option::LEVEL_ADVANCED) | ||
.set_default(10000) | ||
.set_description("Minimum number of versions in the store to trigger full map pruning") | ||
.add_see_also("mon_osdmap_full_prune_enabled") |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
nit, i think we can assume the yet-written tool will take Option("foo").add_see_also("bar")
as a link from "foo" to "bar", and hence "bar" will be linked to "foo" also.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I do not know what tool is this you are talking about, but -- even if it is in the works -- it seems that assumption to be weird. I understand the add_see_also()
as a unidirectional link, to further instruct the user to know more about a given topic; sometimes we'll have bidirectional links, but not necessarily.
In any case, I presume this tool will be smart enough to either take it as is, or complain about it and having us remove one of the links. When the time comes, I'll be all for fixing this to follow said tool's semantics.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
okay. i think it will be bidirectional. but, agreed. we can fix it when the tool is around. BTW, for the "tool", see https://trello.com/c/wBc0R9HV/6-config-options-docs and http://pad.ceph.com/p/config-annotation
script to generate .rst docs for docs.ceph.com
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
will continue reviewing tomorrow.
src/mon/OSDMonitor.cc
Outdated
|
||
bool OSDMonitor::should_prune() const | ||
{ | ||
version_t first = get_first_committed(); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
i'd suggest move the definition of first
, last
, min_osdmap_epochs
, prune_to
and last_pruned
after the following if
block.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
yeah
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@liewegas @tchaikov sorry, where to? Somehow, it looks like these comments have been kept, even though the code itself has changed (heck, github tells me the commit this review is on has changed). Not only there's no following if block
atm (it's a bunch of if-elseif-else that actually depend on these variables to be declared and initialized), but everything pertaining to last_pruned
has been dropped from the current algorithm.
src/mon/OSDMonitor.h
Outdated
{ | ||
set<version_t>::const_reverse_iterator it = pinned.crbegin(); | ||
if (it == pinned.crend()) { | ||
ceph_assert(pinned.empty()); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
i don't think we need this.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
agree
src/mon/OSDMonitor.h
Outdated
|
||
version_t get_lower_closest_pinned(version_t v) const | ||
{ | ||
set<version_t>::const_iterator p = pinned.lower_bound(v); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
nit, could just use auto
. less type.
Option("mon_osdmap_full_prune_min", Option::TYPE_UINT, Option::LEVEL_ADVANCED) | ||
.set_default(10000) | ||
.set_description("Minimum number of versions in the store to trigger full map pruning") | ||
.add_see_also("mon_osdmap_full_prune_enabled") |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
okay. i think it will be bidirectional. but, agreed. we can fix it when the tool is around. BTW, for the "tool", see https://trello.com/c/wBc0R9HV/6-config-options-docs and http://pad.ceph.com/p/config-annotation
script to generate .rst docs for docs.ceph.com
src/mon/OSDMonitor.cc
Outdated
<< dendl; | ||
|
||
if (!osdmap_manifest.is_pinned(first)) { | ||
osdmap_manifest.pinned.insert(first); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
might want to use the helper method instead. i.e.
osdmap_manifest.pin(first);
@@ -3425,6 +4020,12 @@ void OSDMonitor::dump_info(Formatter *f) | |||
f->open_object_section("crushmap"); | |||
osdmap.crush->dump(f); | |||
f->close_section(); | |||
|
|||
if (has_osdmap_manifest) { | |||
f->open_object_section("osdmap_manifest"); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
nit, use f.dump_object("osdmap_manifest", osdmap_manifest)
instead.
@jecluis ping? |
@tchaikov busy week; not forgotten ;) |
src/mon/OSDMonitor.cc
Outdated
bool store_has_manifest = | ||
mon->store->exists(get_service_name(), "full_manifest"); | ||
|
||
if (!store_has_manifest) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
nit
if (!store_has_manifest && !has_osdmap_manifest)
return;
src/mon/OSDMonitor.cc
Outdated
|
||
if (prune_interval > prune_min) { | ||
derr << __func__ | ||
<< " impossible to ascertain proper prune interval because " |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
nit: unnecessary white space after because
src/mon/OSDMonitor.cc
Outdated
// this allows us the monitor to trim maps without caring too much about | ||
// pinned maps, and then allow us to use another ceph-mon without these | ||
// capabilities, without having to repair the store. | ||
version_t last_to_pin = last - g_conf->mon_min_osdmap_epochs; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
g_conf->get_val<int64_t>("mon_min_osdmap_epochs");
src/mon/OSDMonitor.cc
Outdated
version_t last_to_pin = last - g_conf->mon_min_osdmap_epochs; | ||
version_t last_pinned = osdmap_manifest.get_last_pinned(); | ||
version_t last_pruned = osdmap_manifest.get_last_pruned(); | ||
uint64_t prune_interval = |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
nit: might be better to consistent with version_t
?
de9b0f2
to
f5ad08f
Compare
@liewegas mind taking another looks? I'm now working on getting a teuthology test along with the PR, but I haven't done this in a while and I'm still trying to figure out how (sigh). |
@jecluis don't forget to update the cli test: /home/jenkins-build/build/workspace/ceph-pull-requests/src/test/cli/monmaptool/feature-set-unset-list.t failed |
@liewegas yeah, already got a patch for it; will push that later squashed with the appropriate commit, as to avoid voiding any comments on the current state. |
6d4c4d8
to
c4f55ca
Compare
Just one of the latest runs (more: http://pulpito.ceph.com/compare/?branch=wip-mon-osdmap-prune&suite=rados:monthrash ). One of the jobs was marked Fairly confident this is ready for merge. |
doc/dev/mon-osdmap-prune.rst
Outdated
[1..50000], we would allow the pruning algorithm to operate only over | ||
osdmap epochs [1..49500); but, if have on-disk versions [1..10200], we | ||
won't be pruning because the algorithm would only operate on versions | ||
[1..700), and this interval contains less versions than the minimum |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
s/700/9700/
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
fixed, will rebase and squash before merge.
@liewegas ping |
@jecluis looks good, can you squash those commits down? |
@liewegas on it! :) |
Signed-off-by: Joao Eduardo Luis <joao@suse.de>
If we have gone over a safety threshold for number of versions, start pruning full osdmaps, poking holes in the sequence. To make up for the missing maps, we will rebuild full maps based on the incremental maps and the closest pinned full map we have available. Signed-off-by: Joao Eduardo Luis <joao@suse.de>
Signed-off-by: Joao Eduardo Luis <joao@suse.de>
We need to allow blocking osdmap trimming to test osdmap pruning. Signed-off-by: Joao Eduardo Luis <joao@suse.de>
Signed-off-by: Joao Eduardo Luis <joao@suse.de>
At least verified in openSUSE tumbleweed and ubuntu 16.04, `sysctl -n kernel.core_pattern` returns a pipe at the start of the value. We reset core_pattern to its original form once the script is about to end, but if we do not discard the pipe the new value will contain an extra pipe (apparently, the pipe is added automatically at some point, possibly simply on read). We are simply stripping it on read, as to prevent this behavior. Additionally, we are also enclosing the reset of kernel.core_pattern in quotes, so as to make sure patterns that include spaces (e.g., ubuntu's apport, or tumbleweed's systemd-coredump) are properly reset. Signed-off-by: Joao Eduardo Luis <joao@suse.de>
Signed-off-by: Joao Eduardo Luis <joao@suse.de>
If, by default, a user's PATH does not contain /sbin and /usr/sbin, we may have a hard time finding sysctl to adjust the kernel core pattern. Signed-off-by: Joao Eduardo Luis <joao@suse.de>
Keep a standalone wrapper for the workunit, so we can test it locally, leveraging the ceph-helpers to do the setup. Keep a workunit to be exercised by teuthology. Signed-off-by: Joao Eduardo Luis <joao@suse.de>
Signed-off-by: Joao Eduardo Luis <joao@suse.de>
Signed-off-by: Joao Eduardo Luis <joao@suse.de>
26b71f9
to
3997eed
Compare
lgtm also once the related commits are squashed.
…On Fri, Apr 6, 2018 at 11:08 AM Joao Eduardo Luis ***@***.***> wrote:
@liewegas <https://github.com/liewegas> on it! :)
—
You are receiving this because you were mentioned.
Reply to this email directly, view it on GitHub
<#19331 (comment)>, or mute
the thread
<https://github.com/notifications/unsubscribe-auth/AADmv921MalyqFhEb_svdpfTAFNUa7dLks5tltw_gaJpZM4Q17WC>
.
|
Yay!
|
Are there any reasons, why this functionality isn't backported to luminous? |
@liewegas Is this feature eligible for backport to luminous at this stage of its life cycle? |
@smithfarm I'm primarily worried about fixes that came days/weeks after the original PR merge.. I seem to remember there being several. @jecluis ? |
In the event of an unclean cluster for quite a while, the monitors may end up keeping quite a lot of maps (thousands to tens of thousands). When this happens, the monitors may be unstable due to the size of their store, which may lead to further unrest in the cluster -- and the whole situation can then become a catch-22 kind of thing.
This patch set introduces a new mechanism that will allow an administrator to prune maps. Unlike trim, prune will not be removing maps at the tail; instead, prune will be poking holes in the sequence of full osdmaps, reducing the usage of store space. Prune trades off performance for reduced store size, as the monitors will have to recompute pruned osdmaps from the last known full osdmap and all the incremental maps since.
The patch set also contains a dev doc describing the algorithm.
Signed-off-by: Joao Eduardo Luis
<joao@suse.de>