osd, pg, mgr: make snap trim queue problems visible #19520
Conversation
One can just parse the snap_trimq string, but that's much more expensive than just reading an unsigned int. Signed-off-by: Piotr Dałek <piotr.dalek@corp.ovh.com>
That way it will be unnecessary to go through all pgs separately to find pgs with excessively long snap trim queues. And we don't need to share the snap trim queues themselves, which may be large. As snap trim queues tend to be short, and I consider anything above 50 000 absurdly large, snaptrimq_len is capped at 2^32 to save space in pg_stat_t. Signed-off-by: Piotr Dałek <piotr.dalek@corp.ovh.com>
good job, good eyes man. 😄
This is great!
src/mon/PGMap.cc
Outdated
```cpp
if (snaptrimq_exceeded) {
  list<string> detail;
  stringstream ss;
  ss << "snap trim queue for " << snaptrimq_exceeded << " pg(s) hit the " << snapthreshold << " point. ";
```
Suggested change:

```cpp
ss << "snap trim queue for " << snaptrimq_exceeded << " pg(s) >= " << snapthreshold << " (mon_osd_snap_trim_queue_warn_on)";
```
src/mon/PGMap.cc
Outdated
```cpp
@@ -2873,6 +2875,25 @@ void PGMap::get_health_checks(
      d.detail.swap(detail);
    }
  }

  // SNAPTRIMQ_SLOW
```
PG_BIG_SNAPTRIMQ ?
PG_SLOW_SNAP_TRIMMING?
Changed to PG_SLOW_SNAP_TRIMMING.
src/common/options.cc
Outdated
```cpp
@@ -1324,6 +1324,10 @@ std::vector<Option> get_global_options() {
    .set_safe()
    .set_description("in which level of parent bucket the reporters are counted"),

    Option("mon_osd_snap_trim_queue_warn_on", Option::TYPE_INT, Option::LEVEL_ADVANCED)
    .set_default(32768)
    .set_description("Warn when snap trim queue length for at least one PG crosses this value, as this is indicator of snap trimmer not keeping up, wasting disk space"),
```
it would probably be better to use set_long_description instead.
I used both.
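A sketch of what using both could look like, following the builder style in src/common/options.cc (the wording here is illustrative, not the exact merged text):

```cpp
Option("mon_osd_snap_trim_queue_warn_on", Option::TYPE_INT, Option::LEVEL_ADVANCED)
.set_default(32768)
.set_description("Warn when snap trim queue for any PG crosses this value")
.set_long_description("A long snap trim queue indicates that the snap trimmer "
                      "is not keeping up and space from deleted snapshots is "
                      "not being reclaimed fast enough; any PG whose queue "
                      "exceeds this value triggers a health warning."),
```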
(force-pushed from 2239c1d to da81fb8)
@liewegas updated.
src/mon/PGMap.cc
Outdated
```cpp
  }
}
if (snaptrimq_exceeded) {
  list<string> detail;
```
not used?
could move it up a level and enumerate specific pgs (up to the max) that have big queues
(force-pushed from baf9c33 to 885e347)
If the new option "mon osd snap trim queue warn on" is set to a value larger than 0 (32768 by default), the cluster will go into HEALTH_WARN state once any pg has a snap trim queue longer than that value. This can be used as an indicator of the snap trimmer not keeping up and disk space not being reclaimed fast enough. The warning message will tell how many pgs are affected. Signed-off-by: Piotr Dałek <piotr.dalek@corp.ovh.com>
(force-pushed from 885e347 to 8412a65)
@liewegas updated:
```cpp
    continue;
  }
if (detail.size() < max) {
  detail.push_back("...more pgs affected");
```
we're not very consistent, but maybe "NNN more pgs affected"?
User already knows how many pgs in total are affected, no need to complicate their lives further.
```cpp
{
  ostringstream ss;
  ss << "longest queue on pg " << *longest_q_pg << " at " << longest_queue;
  detail.push_back(ss.str());
```
+1
looks great!
```cpp
@@ -2873,6 +2875,50 @@ void PGMap::get_health_checks(
      d.detail.swap(detail);
    }
  }

  // PG_SLOW_SNAP_TRIMMING
  if (!pg_stat.empty() && cct->_conf->mon_osd_snap_trim_queue_warn_on > 0) {
```
nit: might want to use get_val
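For reference, the typed-accessor style being suggested looks roughly like this (a sketch; the exact template parameter depends on the option's declared type):

```cpp
// Hypothetical: fetch the option via the typed accessor
// instead of the legacy struct field.
int64_t warn_on = cct->_conf->get_val<int64_t>("mon_osd_snap_trim_queue_warn_on");
```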
```cpp
@@ -286,6 +286,7 @@ OPTION(mon_inject_sync_get_chunk_delay, OPT_DOUBLE) // inject N second delay on
OPTION(mon_osd_force_trim_to, OPT_INT) // force mon to trim maps to this point, regardless of min_last_epoch_clean (dangerous)
OPTION(mon_mds_force_trim_to, OPT_INT) // force mon to trim mdsmaps to this point (dangerous)
OPTION(mon_mds_skip_sanity, OPT_BOOL) // skip safety assertions on FSMap (in case of bugs where we want to continue anyway)
OPTION(mon_osd_snap_trim_queue_warn_on, OPT_INT)
```
@branch-predictor please avoid adding new options to legacy_config_opts.h, unless you plan to backport this change.
For this one - I do plan to backport it to jewel.
Also FWIW until we have a concise alternative to legacy_config_opts.h (for int types) I don't have a problem with adding new ones. (It would be nice to sort out an alternative, though.)
@branch-predictor for some reason the needs-backport label is gone. You might want to prepare a backport for it or file a ticket so we don't forget this change.
Fields state, purged_snaps and snaptrimq_len are new to Mimic. Reorder them so that the newest field (snaptrimq_len) comes before the two others and uses a different encoding version, so that pull request ceph#19520 can be backported to Luminous without breaking Luminous -> Mimic upgrade. This also changes the encoding/decoding version back to 24, as both state and purged_snaps were added post-Luminous and pre-Mimic, so we can push them into a single struct version and keep snaptrimq_len in another version. Signed-off-by: Piotr Dałek <piotr.dalek@corp.ovh.com>
@tchaikov It was agreed on ceph-devel ML to drop the needs-backport label in favor of using the tracker mechanism. This PR did not get a tracker originally, so I opened one. UPDATE: deleted in favor of @branch-predictor 's trackers that were already open
@branch-predictor I went ahead and assigned the backport issues to you:
Deassign yourself if that's not right? UPDATE: deleted in favor of @branch-predictor 's trackers that were already open
@smithfarm it had a tracker # before: http://tracker.ceph.com/issues/22448 along with backports (http://tracker.ceph.com/issues/22449 and http://tracker.ceph.com/issues/22450)
@branch-predictor Got it, thanks!
http://tracker.ceph.com/issues/22448
We observed unexplained, constant disk space usage increase on a few of our prod clusters. At first we thought that it was because of customers abusing them, but that wasn't it. Then we thought that images were constantly being filled with data, but space usage reported by Ceph wasn't consistent with the filesystem. After further digging, we realized that snap trim queues for some PGs were in 250k-element territory... We increased the snap trimmer frequency and the number of parallel snap trim ops, and disk space usage finally started to drop.
This pull request adds:
With that, the above situation could be noticed way earlier.
Signed-off-by: Piotr Dałek piotr.dalek@corp.ovh.com