Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

mon: osdmap prune #19331

Merged
merged 11 commits into from Apr 6, 2018
Merged

mon: osdmap prune #19331

merged 11 commits into from Apr 6, 2018

Conversation

jecluis
Copy link
Member

@jecluis jecluis commented Dec 5, 2017

In the event of an unclean cluster for quite a while, the monitors may end up keeping quite a lot of maps (thousands to tens of thousands). When this happens, the monitors may be unstable due to the size of their store, which may lead to further unrest in the cluster -- and the whole situation can then become a catch-22 kind of thing.
This patch set introduces a new mechanism that will allow an administrator to prune maps. Unlike trim, prune will not be removing maps at the tail; instead, prune will be poking holes in the sequence of full osdmaps, reducing the usage of store space. Prune trades off performance for reduced store size, as the monitors will have to recompute pruned osdmaps from the last known full osdmap and all the incremental maps since.
The patch set also contains a dev doc describing the algorithm.

Signed-off-by: Joao Eduardo Luis <joao@suse.de>

@branch-predictor
Copy link
Contributor

I wish every such PR could have a such exhaustive docs. ;)

upper_pinned(last_pruned) C pinned AND
lower_pinned(last_pruned) C pinned AND
upper_pinned(last_pruned) != lower_pinned(last_pruned)

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

indentation was messed up for these last two blocks. I'll address it before merging.

@@ -897,6 +897,36 @@ std::vector<Option> get_global_options() {
.set_default(true)
.set_description(""),

/* -- mon: osdmap prune (begin) -- */
Option("mon_osdmap_full_prune_enabled", Option::TYPE_BOOL, Option::LEVEL_ADVANCED)
.set_default(false)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

true!

* for more information, please refer to doc/dev/mon-osdmap-prune.rst
*/

void OSDMonitor::update_osdmap_manifest()
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

load_osdmap_manifest()? i wasn't sure when i got here whether this was making some change or refreshing state.

dout(10) << __func__
<< " interval [" << first << ".." << last << "]" << dendl;
for (version_t v = first; v <= last; ++v) {

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

unnecessary whitespace

version_t next = osdmap_manifest.get_next_prunable();

if (g_conf->get_val<bool>("mon_debug_extra_checks") && next > 0) {

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

unnecessary whitespace


dout(20) << __func__
<< " dropping osdmap manifest from memory." << dendl;

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

unnecessary whitespace

return ret;
}
} else if (ret != 0) {
return ret;
}
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

simpler?

if (ret == -ENOENT) {
   // build map?
   ret = get_full_from_pinned_map(ver, bl);
}
if (ret != 0) {
  return ret;
}

return get_next_prunable(last_pruned);
}

version_t get_next_prunable(version_t v) const
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I wonder if it would be simpler to just drop this last_pruned, in_progress thing. Instead, when we prune, we (in a single transaction) (1) add the pinned epoch and (2) delete all maps between the new pinned epoch and the next one. I don't think the gap should be more then a few 10s of epochs anyway, and deletes aren't expensive to include in a transaction.

osdm->encode(tbl, encode_features | CEPH_FEATURE_RESERVED);
tosdm->decode(tbl);
delete osdm;
osdm = tosdm;
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

why do the switcheroo to tosdm here? paranoia?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

pretty much; I don't want to rely on squashing existing values when decoding to an already existing map. And afaict we need the whole encode/decode dance to get updated crcs out of the maps.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think you should drop this. In the normal path(s) we apply multiple incrementals to one map. I think that's safe.

If you want you can add an if (mon_debug_extra_checks) branch that does a paranoid check.

@@ -306,6 +306,35 @@ void Monitor::do_admin_command(string command, cmdmap_t& cmdmap, string format,
f->flush(ss);
}

} else if (command == "test generate osdmap") {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this whole thing makes me nervous... it seems inevitable that it will rot. can we instead do a loop like 'for f in seq 1 100 ; do ceph osd unset nodown ; done'?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

i still don't like this patch; i'd rather do it the slow way (for loop in bash doing some trivial change)

'--mon-debug-block-osdmap-trim '\
'--mon-debug-extra-checks' || return 1

ceph daemon $(get_asok_path mon.a) test generate osdmap 1000 || return 1
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

ok, i guess it won't bit rot. :)

@liewegas
Copy link
Member

liewegas commented Dec 5, 2017

The other thing that would be nice here is if one of the thrashing test cases set things such that this is regularly triggered. Like, min epochs = 25, prune max = 5, prune tx = 1, something like that.

Option("mon_osdmap_full_prune_min", Option::TYPE_UINT, Option::LEVEL_ADVANCED)
.set_default(10000)
.set_description("Minimum number of versions in the store to trigger full map pruning")
.add_see_also("mon_osdmap_full_prune_enabled")
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit, i think we can assume the yet-written tool will take Option("foo").add_see_also("bar") as a link from "foo" to "bar", and hence "bar" will be linked to "foo" also.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I do not know what tool is this you are talking about, but -- even if it is in the works -- it seems that assumption to be weird. I understand the add_see_also() as a unidirectional link, to further instruct the user to know more about a given topic; sometimes we'll have bidirectional links, but not necessarily.
In any case, I presume this tool will be smart enough to either take it as is, or complain about it and having us remove one of the links. When the time comes, I'll be all for fixing this to follow said tool's semantics.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

okay. i think it will be bidirectional. but, agreed. we can fix it when the tool is around. BTW, for the "tool", see https://trello.com/c/wBc0R9HV/6-config-options-docs and http://pad.ceph.com/p/config-annotation

script to generate .rst docs for docs.ceph.com

Copy link
Contributor

@tchaikov tchaikov left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

will continue reviewing tomorrow.


bool OSDMonitor::should_prune() const
{
version_t first = get_first_committed();
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

i'd suggest move the definition of first, last, min_osdmap_epochs, prune_to and last_pruned after the following if block.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

yeah

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@liewegas @tchaikov sorry, where to? Somehow, it looks like these comments have been kept, even though the code itself has changed (heck, github tells me the commit this review is on has changed). Not only there's no following if block atm (it's a bunch of if-elseif-else that actually depend on these variables to be declared and initialized), but everything pertaining to last_pruned has been dropped from the current algorithm.

{
set<version_t>::const_reverse_iterator it = pinned.crbegin();
if (it == pinned.crend()) {
ceph_assert(pinned.empty());
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

i don't think we need this.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

agree


version_t get_lower_closest_pinned(version_t v) const
{
set<version_t>::const_iterator p = pinned.lower_bound(v);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit, could just use auto. less type.

Option("mon_osdmap_full_prune_min", Option::TYPE_UINT, Option::LEVEL_ADVANCED)
.set_default(10000)
.set_description("Minimum number of versions in the store to trigger full map pruning")
.add_see_also("mon_osdmap_full_prune_enabled")
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

okay. i think it will be bidirectional. but, agreed. we can fix it when the tool is around. BTW, for the "tool", see https://trello.com/c/wBc0R9HV/6-config-options-docs and http://pad.ceph.com/p/config-annotation

script to generate .rst docs for docs.ceph.com

<< dendl;

if (!osdmap_manifest.is_pinned(first)) {
osdmap_manifest.pinned.insert(first);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

might want to use the helper method instead. i.e.

osdmap_manifest.pin(first);

@@ -3425,6 +4020,12 @@ void OSDMonitor::dump_info(Formatter *f)
f->open_object_section("crushmap");
osdmap.crush->dump(f);
f->close_section();

if (has_osdmap_manifest) {
f->open_object_section("osdmap_manifest");
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit, use f.dump_object("osdmap_manifest", osdmap_manifest) instead.

@tchaikov
Copy link
Contributor

@jecluis ping?

@jecluis
Copy link
Member Author

jecluis commented Dec 15, 2017

@tchaikov busy week; not forgotten ;)

bool store_has_manifest =
mon->store->exists(get_service_name(), "full_manifest");

if (!store_has_manifest) {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit

if (!store_has_manifest && !has_osdmap_manifest)
  return;


if (prune_interval > prune_min) {
derr << __func__
<< " impossible to ascertain proper prune interval because "
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit: unnecessary white space after because

// this allows us the monitor to trim maps without caring too much about
// pinned maps, and then allow us to use another ceph-mon without these
// capabilities, without having to repair the store.
version_t last_to_pin = last - g_conf->mon_min_osdmap_epochs;
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

g_conf->get_val<int64_t>("mon_min_osdmap_epochs");

version_t last_to_pin = last - g_conf->mon_min_osdmap_epochs;
version_t last_pinned = osdmap_manifest.get_last_pinned();
version_t last_pruned = osdmap_manifest.get_last_pruned();
uint64_t prune_interval =
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit: might be better to consistent with version_t?

@jecluis
Copy link
Member Author

jecluis commented Feb 7, 2018

@liewegas mind taking another looks? I'm now working on getting a teuthology test along with the PR, but I haven't done this in a while and I'm still trying to figure out how (sigh).

@liewegas
Copy link
Member

liewegas commented Feb 7, 2018

@jecluis don't forget to update the cli test: /home/jenkins-build/build/workspace/ceph-pull-requests/src/test/cli/monmaptool/feature-set-unset-list.t failed

@jecluis
Copy link
Member Author

jecluis commented Feb 7, 2018

@liewegas yeah, already got a patch for it; will push that later squashed with the appropriate commit, as to avoid voiding any comments on the current state.

@jecluis jecluis force-pushed the wip-mon-osdmap-prune branch 2 times, most recently from 6d4c4d8 to c4f55ca Compare March 23, 2018 06:06
@jecluis
Copy link
Member Author

jecluis commented Mar 27, 2018

http://pulpito.ceph.com/rdias-2018-03-26_09:00:32-rados:monthrash-wip-mon-osdmap-prune-distro-basic-smithi/

Just one of the latest runs (more: http://pulpito.ceph.com/compare/?branch=wip-mon-osdmap-prune&suite=rados:monthrash ). One of the jobs was marked dead for no obvious reason, but the workunit succeeded (and the job had a passing mark in a previous run).

Fairly confident this is ready for merge.

Anyone has any further objections? @liewegas @tchaikov ?

[1..50000], we would allow the pruning algorithm to operate only over
osdmap epochs [1..49500); but, if have on-disk versions [1..10200], we
won't be pruning because the algorithm would only operate on versions
[1..700), and this interval contains less versions than the minimum
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

s/700/9700/

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

fixed, will rebase and squash before merge.

@jecluis
Copy link
Member Author

jecluis commented Apr 6, 2018

@liewegas ping

@liewegas
Copy link
Member

liewegas commented Apr 6, 2018

@jecluis looks good, can you squash those commits down?

@jecluis
Copy link
Member Author

jecluis commented Apr 6, 2018

@liewegas on it! :)

Signed-off-by: Joao Eduardo Luis <joao@suse.de>
If we have gone over a safety threshold for number of versions, start
pruning full osdmaps, poking holes in the sequence. To make up for the
missing maps, we will rebuild full maps based on the incremental maps
and the closest pinned full map we have available.

Signed-off-by: Joao Eduardo Luis <joao@suse.de>
Signed-off-by: Joao Eduardo Luis <joao@suse.de>
We need to allow blocking osdmap trimming to test osdmap pruning.

Signed-off-by: Joao Eduardo Luis <joao@suse.de>
Signed-off-by: Joao Eduardo Luis <joao@suse.de>
At least verified in openSUSE tumbleweed and ubuntu 16.04,
`sysctl -n kernel.core_pattern` returns a pipe at the start of the
value. We reset core_pattern to its original form once the script is
about to end, but if we do not discard the pipe the new value will
contain an extra pipe (apparently, the pipe is added automatically at
some point, possibly simply on read). We are simply stripping it on
read, as to prevent this behavior.

Additionally, we are also enclosing the reset of kernel.core_pattern in
quotes, so as to make sure patterns that include spaces (e.g., ubuntu's
apport, or tumbleweed's systemd-coredump) are properly reset.

Signed-off-by: Joao Eduardo Luis <joao@suse.de>
Signed-off-by: Joao Eduardo Luis <joao@suse.de>
If, by default, a user's PATH does not contain /sbin and /usr/sbin, we
may have a hard time finding sysctl to adjust the kernel core pattern.

Signed-off-by: Joao Eduardo Luis <joao@suse.de>
Keep a standalone wrapper for the workunit, so we can test it locally,
leveraging the ceph-helpers to do the setup. Keep a workunit to be
exercised by teuthology.

Signed-off-by: Joao Eduardo Luis <joao@suse.de>
Signed-off-by: Joao Eduardo Luis <joao@suse.de>
Signed-off-by: Joao Eduardo Luis <joao@suse.de>
@tchaikov
Copy link
Contributor

tchaikov commented Apr 6, 2018 via email

@jecluis jecluis merged commit 940dd94 into ceph:master Apr 6, 2018
@jecluis jecluis deleted the wip-mon-osdmap-prune branch April 6, 2018 14:22
@liewegas
Copy link
Member

liewegas commented Apr 6, 2018 via email

@xelexin
Copy link
Contributor

xelexin commented Mar 4, 2019

Are there any reasons, why this functionality isn't backported to luminous?
Right now we have about 100k osdmaps in mon-kvstore and it take a lot of storage. Isn't safe to cherry-pick these changes?
IMHO this functionality is needed in older versions of ceph.

@smithfarm
Copy link
Contributor

@liewegas Is this feature eligible for backport to luminous at this stage of its life cycle?

@liewegas
Copy link
Member

@smithfarm I'm primarily worried about fixes that came days/weeks after the original PR merge.. I seem to remember there being several. @jecluis ?

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
7 participants