
osd, pg, mgr: make snap trim queue problems visible #19520

Merged
merged 3 commits into ceph:master from bp-snap-trimq-visibility on Jan 10, 2018

Conversation

branch-predictor
Contributor

@branch-predictor commented Dec 14, 2017

http://tracker.ceph.com/issues/22448

We observed an unexplained, constant increase in disk space usage on a few of our prod clusters. At first we thought it was customers abusing them, but that wasn't it. Then we thought that images were constantly being filled with data, but the space usage reported by Ceph wasn't consistent with the filesystems. After further digging, we realized that the snap trim queues for some PGs were in 250k-element territory... We increased the snap trimmer frequency and the number of parallel snap trim ops, and disk space usage finally started to drop.

This pull request adds:

  • a convenient way to find the snap trim queue length for any single PG (via pg query)
  • an instant way to find snap trim queue lengths for all PGs (via pg dump), which can be used to feed monitoring
  • a health warning when snaptrimq_len hits a configured threshold; it can be disabled by setting the threshold to 0

With that, the situation described above could have been noticed much earlier.

Signed-off-by: Piotr Dałek <piotr.dalek@corp.ovh.com>

One can just parse the snap_trimq string, but that's much more
expensive than just reading an unsigned int.

Signed-off-by: Piotr Dałek <piotr.dalek@corp.ovh.com>
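
A minimal sketch of the per-PG part of the idea (the helper name and JSON keys here are illustrative, assuming the queue is an interval_set dumped through ceph::Formatter):

#include "common/Formatter.h"

// Illustrative sketch, not the exact PR diff: dump the queue length as a plain
// unsigned next to the existing human-readable snap_trimq string, so `pg query`
// consumers can read a number instead of parsing the interval list.
template <typename SnapTrimQueue>   // e.g. interval_set<snapid_t>
void dump_snap_trim_queue(ceph::Formatter *f, const SnapTrimQueue &q)
{
  f->dump_stream("snap_trimq") << q;             // existing string form
  f->dump_unsigned("snap_trimq_len", q.size());  // new: cheap integer form
}
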
That way it will be unnecessary to go through all pgs separately
to find pgs with excessively long snap trim queues. And we don't need
to share the snap trim queues themselves, which may be large.
As snap trim queues tend to be short, and anything above 50 000
I consider absurdly large, snaptrimq_len is capped at 2^32 to
save space in pg_stat_t.

Signed-off-by: Piotr Dałek <piotr.dalek@corp.ovh.com>
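
A minimal sketch of the capping described above (the helper name and types are assumptions, not the PR's exact code):

#include <algorithm>
#include <cstdint>
#include <limits>

// Clamp a potentially larger queue length into the 32-bit pg_stat_t field;
// a queue anywhere near this bound already means the trimmer is hopelessly
// behind, so losing precision there doesn't matter.
inline uint32_t capped_snaptrimq_len(uint64_t queue_len)
{
  return static_cast<uint32_t>(
    std::min<uint64_t>(queue_len, std::numeric_limits<uint32_t>::max()));
}
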
@songbaisen

good job, good eyes man. 😄

Member

@liewegas left a comment

This is great!

src/mon/PGMap.cc Outdated
if (snaptrimq_exceeded) {
  list<string> detail;
  stringstream ss;
  ss << "snap trim queue for " << snaptrimq_exceeded << " pg(s) hit the " << snapthreshold << " point. ";
Member

Suggested change:

  ss << "snap trim queue for " << snaptrimq_exceeded << " pg(s) >= " << snapthreshold << " (mon_osd_snap_trim_queue_warn_on)";

@liewegas liewegas added the core label Dec 15, 2017
src/mon/PGMap.cc Outdated
@@ -2873,6 +2875,25 @@ void PGMap::get_health_checks(
d.detail.swap(detail);
}
}

// SNAPTRIMQ_SLOW
Member

PG_BIG_SNAPTRIMQ ?

Member

PG_SLOW_SNAP_TRIMMING?

Contributor Author

Changed to PG_SLOW_SNAP_TRIMMING.

@@ -1324,6 +1324,10 @@ std::vector<Option> get_global_options() {
.set_safe()
.set_description("in which level of parent bucket the reporters are counted"),

Option("mon_osd_snap_trim_queue_warn_on", Option::TYPE_INT, Option::LEVEL_ADVANCED)
.set_default(32768)
.set_description("Warn when snap trim queue length for at least one PG crosses this value, as this is indicator of snap trimmer not keeping up, wasting disk space"),
Contributor

it would probably be better to use set_long_description, instead.

Contributor Author

I used both.
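
Presumably the resulting option then looks roughly like this (the exact wording and its split between the short and the long description is a guess, not the merged text):

Option("mon_osd_snap_trim_queue_warn_on", Option::TYPE_INT, Option::LEVEL_ADVANCED)
.set_default(32768)
.set_description("Warn when snap trim queue for any PG grows beyond this value")
.set_long_description("A long snap trim queue for at least one PG indicates that the snap trimmer is not keeping up and disk space is not being reclaimed fast enough"),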

@branch-predictor
Contributor Author

@liewegas updated.

src/mon/PGMap.cc Outdated
}
}
if (snaptrimq_exceeded) {
list<string> detail;
Member

not used?

Member

could move it up a level and enumerate specific pgs (up to the max) that have big queues

@branch-predictor force-pushed the bp-snap-trimq-visibility branch 3 times, most recently from baf9c33 to 885e347 on December 19, 2017 08:44
If the new option "mon osd snap trim queue warn on" is set to a value larger
than 0 (32768 by default), the cluster will go into HEALTH_WARN state
once any pg has a snap trim queue larger than that value. This can
be used as an indicator of the snap trimmer not keeping up and disk space
not being reclaimed fast enough. The warning message will tell how many
pgs are affected.

Signed-off-by: Piotr Dałek <piotr.dalek@corp.ovh.com>
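
Put together, the monitor-side check has roughly this shape (a simplified sketch assembled from the commit message and the review snippets below, not the verbatim merged code):

// Simplified sketch of the PG_SLOW_SNAP_TRIMMING check in PGMap::get_health_checks:
// count PGs whose queue exceeds the threshold, remember the worst offender and
// emit a bounded number of per-PG detail lines.
if (!pg_stat.empty() && cct->_conf->mon_osd_snap_trim_queue_warn_on > 0) {
  uint32_t snaptrimq_exceeded = 0;
  uint32_t longest_queue = 0;
  const pg_t *longest_q_pg = nullptr;
  list<string> detail;
  const size_t max = 50;  // illustrative cap on detail lines
  for (auto& p : pg_stat) {
    uint32_t len = p.second.snaptrimq_len;
    if (len >= cct->_conf->mon_osd_snap_trim_queue_warn_on) {
      snaptrimq_exceeded++;
      if (len > longest_queue) {
        longest_queue = len;
        longest_q_pg = &p.first;
      }
      if (detail.size() < max) {
        ostringstream ss;
        ss << "snap trim queue for pg " << p.first << " at " << len;
        detail.push_back(ss.str());
      }
    }
  }
  // ...then emit the summary ("snap trim queue for N pg(s) >= threshold"),
  // an "...more pgs affected" line if truncated, the longest-queue line and
  // the tuning hint, and attach everything as a HEALTH_WARN check.
}
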
@branch-predictor
Contributor Author

@liewegas updated:

$ bin/ceph -c ceph.conf health detail                                     
*** DEVELOPER MODE: setting PATH, PYTHONPATH and LD_LIBRARY_PATH ***                                                   
2017-12-19 08:51:05.927 7f4e5e91b700 -1 WARNING: all dangerous and experimental features are enabled.                  
2017-12-19 08:51:05.955 7f4e5e91b700 -1 WARNING: all dangerous and experimental features are enabled.                  
HEALTH_WARN nosnaptrim flag(s) set; snap trim queue for 64 pg(s) >= 2 (mon_osd_snap_trim_queue_warn_on)                                                                     
OSDMAP_FLAGS nosnaptrim flag(s) set                                                                                    
PG_SLOW_SNAP_TRIMMING snap trim queue for 64 pg(s) >= 2 (mon_osd_snap_trim_queue_warn_on)                                                                      
    snap trim queue for pg 1.3f at 10                                                                                  
    snap trim queue for pg 1.3e at 10                                                                                  
    snap trim queue for pg 1.3d at 10                                                                                  
    snap trim queue for pg 1.3c at 10                                                                                  
    snap trim queue for pg 1.3b at 10                                                                                  
    snap trim queue for pg 1.3a at 10                                                                                  
    snap trim queue for pg 1.39 at 10                                                                                  
    snap trim queue for pg 1.38 at 10                                                                                  
    snap trim queue for pg 1.37 at 10                                                                                  
    snap trim queue for pg 1.36 at 10                                                                                  
    snap trim queue for pg 1.35 at 10                                                                                  
    snap trim queue for pg 1.34 at 10                                                                                  
    snap trim queue for pg 1.33 at 10                                                                                  
    snap trim queue for pg 1.32 at 10                                                                                  
    snap trim queue for pg 1.31 at 10                                                                                  
    snap trim queue for pg 1.30 at 10                                                                                  
    snap trim queue for pg 1.2f at 10                                                                                  
    snap trim queue for pg 1.2e at 10                                                                                  
    snap trim queue for pg 1.2d at 10                                                                                  
    snap trim queue for pg 1.2c at 10                                                                                  
    snap trim queue for pg 1.2b at 10                                                                                  
    snap trim queue for pg 1.2a at 10                                                                                  
    snap trim queue for pg 1.29 at 10                                                                                  
    snap trim queue for pg 1.28 at 10                                                                                  
    snap trim queue for pg 1.27 at 10                                                                                  
    snap trim queue for pg 1.26 at 10                                                                                  
    snap trim queue for pg 1.25 at 10                                                                                  
    snap trim queue for pg 1.24 at 10                                                                                  
    snap trim queue for pg 1.f at 10                                                                                   
    snap trim queue for pg 1.e at 10                                                                                   
    snap trim queue for pg 1.d at 10                                                                                   
    snap trim queue for pg 1.c at 10                                                                                   
    snap trim queue for pg 1.b at 10                                                                                   
    snap trim queue for pg 1.a at 10                                                                                   
    snap trim queue for pg 1.9 at 10                                                                                   
    snap trim queue for pg 1.8 at 10                                                                                   
    snap trim queue for pg 1.7 at 10                                                                                   
    snap trim queue for pg 1.6 at 10                                                                                   
    snap trim queue for pg 1.1 at 10                                                                                   
    snap trim queue for pg 1.0 at 10                                                                                   
    snap trim queue for pg 1.2 at 10                                                                                   
    snap trim queue for pg 1.3 at 10                                                                                   
    snap trim queue for pg 1.4 at 10                                                                                   
    snap trim queue for pg 1.5 at 10                                                                                   
    snap trim queue for pg 1.10 at 10                                                                                  
    snap trim queue for pg 1.11 at 10                                                                                  
    snap trim queue for pg 1.12 at 10                                                                                  
    snap trim queue for pg 1.13 at 10                                                                                  
    snap trim queue for pg 1.14 at 10                                                                                  
    ...more pgs affected                                                                                               
    longest queue on pg 1.23 at 10                                                                                     
    try decreasing "osd snap trim sleep" and/or increasing "osd pg max concurrent snap trims".                         

continue;
}
if (detail.size() < max) {
detail.push_back("...more pgs affected");
Member

we're not very consistent, but maybe "NNN more pgs affected"?

Contributor Author

The user already knows how many pgs are affected in total, no need to complicate their lives further.

{
ostringstream ss;
ss << "longest queue on pg " << *longest_q_pg << " at " << longest_queue;
detail.push_back(ss.str());
Member

+1

Member

@liewegas left a comment

looks great!

@@ -2873,6 +2875,50 @@ void PGMap::get_health_checks(
d.detail.swap(detail);
}
}

// PG_SLOW_SNAP_TRIMMING
if (!pg_stat.empty() && cct->_conf->mon_osd_snap_trim_queue_warn_on > 0) {
Contributor

nit: might want to use get_val
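
i.e. something along these lines, assuming the templated config accessor available at the time (illustrative, not the merged code):

auto warn_threshold = cct->_conf->get_val<int64_t>("mon_osd_snap_trim_queue_warn_on");
if (!pg_stat.empty() && warn_threshold > 0) {
  // ...
}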

@@ -286,6 +286,7 @@ OPTION(mon_inject_sync_get_chunk_delay, OPT_DOUBLE) // inject N second delay on
OPTION(mon_osd_force_trim_to, OPT_INT) // force mon to trim maps to this point, regardless of min_last_epoch_clean (dangerous)
OPTION(mon_mds_force_trim_to, OPT_INT) // force mon to trim mdsmaps to this point (dangerous)
OPTION(mon_mds_skip_sanity, OPT_BOOL) // skip safety assertions on FSMap (in case of bugs where we want to continue anyway)
OPTION(mon_osd_snap_trim_queue_warn_on, OPT_INT)
Contributor

@branch-predictor please avoid adding new options to legacy_config_opts.h, unless you plan to backport this change.

Contributor Author

For this one - I do plan to backport it to jewel.

Member

Also FWIW until we have a concise alternative to legacy_config_opts.h (for int types) I don't have a problem with adding new ones. (It would be nice to sort out an alternative, though.)

@tchaikov tchaikov merged commit 1dc26fb into ceph:master Jan 10, 2018
@tchaikov
Contributor

@branch-predictor for some reason the needs-backport label is gone. you might want to prepare a backport for it or file a ticket so we don't forget this change.

branch-predictor added a commit to ovh/ceph that referenced this pull request Jan 16, 2018
Fields state, purged_snaps and snaptrimq_len are new to Mimic.
Reorder them so that the newest field (snaptrimq_len) comes before
the other two and uses a different encoding version, so that pull request
ceph#19520 can be backported to Luminous
without breaking Luminous -> Mimic upgrades.
This also changes the encoding/decoding version back to 24: both state
and purged_snaps were added post-Luminous and pre-Mimic, so they can be
pushed into a single struct version while snaptrimq_len keeps its own
version.

Signed-off-by: Piotr Dałek <piotr.dalek@corp.ovh.com>
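
For illustration, the reordering amounts to a version-gated encode/decode along these lines (struct version 24 comes from the commit message; the compat version, the gate for snaptrimq_len and the placement of the pre-existing fields are assumptions, not the verbatim pg_stat_t code):

void pg_stat_t::encode(bufferlist &bl) const
{
  ENCODE_START(24, 22, bl);        // compat version here is illustrative
  // ... pre-existing Luminous-era fields ...
  ::encode(snaptrimq_len, bl);     // new, kept in its own (older) struct
                                   // version so it can be backported to Luminous
  ::encode(state, bl);             // post-Luminous, pre-Mimic
  ::encode(purged_snaps, bl);      // post-Luminous, pre-Mimic
  ENCODE_FINISH(bl);
}

void pg_stat_t::decode(bufferlist::iterator &bl)
{
  DECODE_START(24, bl);
  // ... pre-existing fields ...
  if (struct_v >= 23)              // gate version illustrative
    ::decode(snaptrimq_len, bl);
  if (struct_v >= 24) {
    ::decode(state, bl);
    ::decode(purged_snaps, bl);
  }
  DECODE_FINISH(bl);
}
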
p-se pushed a commit to p-se/ceph that referenced this pull request Jan 25, 2018
p-se pushed a commit to p-se/ceph that referenced this pull request Jan 31, 2018
@smithfarm
Contributor

smithfarm commented Feb 1, 2018

@tchaikov It was agreed on ceph-devel ML to drop the needs-backport label in favor of using the tracker mechanism. This PR did not get a tracker originally, so I opened one: http://tracker.ceph.com/issues/22853

UPDATE: deleted in favor of @branch-predictor's trackers that were already open

@smithfarm
Contributor

smithfarm commented Feb 1, 2018

@branch-predictor I went ahead and assigned the backport issues to you:

Deassign yourself if that's not right?

UPDATE: deleted in favor of @branch-predictor's trackers that were already open

@branch-predictor
Contributor Author

@smithfarm
Contributor

@branch-predictor Got it, thanks!

p-se pushed a commit to p-se/ceph that referenced this pull request Feb 5, 2018
cache-nez pushed a commit to cache-nez/ceph that referenced this pull request Feb 6, 2018
@branch-predictor branch-predictor deleted the bp-snap-trimq-visibility branch May 27, 2019 09:06