osd, pg, mgr: make snap trim queue problems visible #19520
Conversation
One can just parse the snap_trimq string, but that's much more expensive than just reading an unsigned int. Signed-off-by: Piotr Dałek <piotr.dalek@corp.ovh.com>
That way it will be unnecessary to go through all pgs separately to find pgs with excessively long snap trim queues. And we don't need to share the snap trim queues themselves, which may be large. As snap trim queues tend to be short, and I consider anything above 50 000 absurdly large, snaptrimq_len is capped at 2^32 to save space in pg_stat_t. Signed-off-by: Piotr Dałek <piotr.dalek@corp.ovh.com>
good job, good eyes man. 😄
This is great!
src/mon/PGMap.cc
Outdated
```cpp
if (snaptrimq_exceeded) {
  list<string> detail;
  stringstream ss;
  ss << "snap trim queue for " << snaptrimq_exceeded << " pg(s) hit the " << snapthreshold << " point. ";
```
Suggested change:

```cpp
ss << "snap trim queue for " << snaptrimq_exceeded << " pg(s) >= " << snapthreshold << " (mon_osd_snap_trim_queue_warn_on)";
```
src/mon/PGMap.cc
Outdated
```cpp
@@ -2873,6 +2875,25 @@ void PGMap::get_health_checks(
      d.detail.swap(detail);
    }
  }

  // SNAPTRIMQ_SLOW
```
PG_BIG_SNAPTRIMQ ?
PG_SLOW_SNAP_TRIMMING?
Changed to PG_SLOW_SNAP_TRIMMING.
src/common/options.cc
Outdated
```cpp
@@ -1324,6 +1324,10 @@ std::vector<Option> get_global_options() {
    .set_safe()
    .set_description("in which level of parent bucket the reporters are counted"),

    Option("mon_osd_snap_trim_queue_warn_on", Option::TYPE_INT, Option::LEVEL_ADVANCED)
    .set_default(32768)
    .set_description("Warn when snap trim queue length for at least one PG crosses this value, as this is indicator of snap trimmer not keeping up, wasting disk space"),
```
it would probably be better to use set_long_description instead.
I used both.
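A sketch of what using both could look like, following the builder style in src/common/options.cc (the wording here is illustrative, not the exact merged text):

```cpp
Option("mon_osd_snap_trim_queue_warn_on", Option::TYPE_INT, Option::LEVEL_ADVANCED)
.set_default(32768)
.set_description("Warn when snap trim queue for any PG crosses this value")
.set_long_description("A long snap trim queue indicates that the snap trimmer "
                      "is not keeping up and space from deleted snapshots is "
                      "not being reclaimed fast enough; any PG whose queue "
                      "exceeds this value triggers a health warning."),
```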
(force-pushed from 2239c1d to da81fb8)
@liewegas updated.
src/mon/PGMap.cc
Outdated
```cpp
  }
}
if (snaptrimq_exceeded) {
  list<string> detail;
```
not used?
could move it up a level and enumerate specific pgs (up to the max) that have big queues
(force-pushed from baf9c33 to 885e347)
If the new option "mon osd snap trim queue warn on" is set to a value larger than 0 (32768 by default), the cluster will go into HEALTH_WARN state once any pg has a snap trim queue longer than that value. This can be used as an indicator of the snap trimmer not keeping up and disk space not being reclaimed fast enough. The warning message will tell how many pgs are affected. Signed-off-by: Piotr Dałek <piotr.dalek@corp.ovh.com>
(force-pushed from 885e347 to 8412a65)
@liewegas updated:
```cpp
    continue;
  }
if (detail.size() < max) {
  detail.push_back("...more pgs affected");
```
we're not very consistent, but maybe "NNN more pgs affected"?
User already knows how many pgs in total are affected, no need to complicate their lives further.
```cpp
{
  ostringstream ss;
  ss << "longest queue on pg " << *longest_q_pg << " at " << longest_queue;
  detail.push_back(ss.str());
```
+1
looks great!
```cpp
@@ -2873,6 +2875,50 @@ void PGMap::get_health_checks(
      d.detail.swap(detail);
    }
  }

  // PG_SLOW_SNAP_TRIMMING
  if (!pg_stat.empty() && cct->_conf->mon_osd_snap_trim_queue_warn_on > 0) {
```
nit: might want to use get_val
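For reference, the typed-accessor style being suggested looks roughly like this (a sketch; the exact template parameter depends on the option's declared type):

```cpp
// Hypothetical: fetch the option via the typed accessor
// instead of the legacy struct field.
int64_t warn_on = cct->_conf->get_val<int64_t>("mon_osd_snap_trim_queue_warn_on");
```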
```cpp
@@ -286,6 +286,7 @@ OPTION(mon_inject_sync_get_chunk_delay, OPT_DOUBLE) // inject N second delay on
OPTION(mon_osd_force_trim_to, OPT_INT) // force mon to trim maps to this point, regardless of min_last_epoch_clean (dangerous)
OPTION(mon_mds_force_trim_to, OPT_INT) // force mon to trim mdsmaps to this point (dangerous)
OPTION(mon_mds_skip_sanity, OPT_BOOL) // skip safety assertions on FSMap (in case of bugs where we want to continue anyway)
OPTION(mon_osd_snap_trim_queue_warn_on, OPT_INT)
```
@branch-predictor please avoid adding new options to legacy_config_opts.h, unless you plan to backport this change.
For this one - I do plan to backport it to jewel.
Also FWIW until we have a concise alternative to legacy_config_opts.h (for int types) I don't have a problem with adding new ones. (It would be nice to sort out an alternative, though.)
@branch-predictor for some reason the needs-backport label is gone. You might want to prepare a backport for it or file a ticket so we don't forget this change.
Fields state, purged_snaps and snaptrimq_len are new to Mimic. Reorder them so that the newest field (snaptrimq_len) comes before the two others and uses a different encoding version, so that pull request ceph#19520 can be backported to Luminous without breaking Luminous -> Mimic upgrade. This also changes the encoding/decoding version back to 24, as both state and purged_snaps were added post-Luminous and pre-Mimic, so we can push them into a single struct version and keep snaptrimq_len in another version. Signed-off-by: Piotr Dałek <piotr.dalek@corp.ovh.com>
@tchaikov It was agreed on ceph-devel ML to drop the needs-backport label in favor of using the tracker mechanism. This PR did not get a tracker originally, so I opened one. UPDATE: deleted in favor of @branch-predictor 's trackers that were already open
@branch-predictor I went ahead and assigned the backport issues to you:
Deassign yourself if that's not right? UPDATE: deleted in favor of @branch-predictor 's trackers that were already open
@smithfarm it had a tracker # before: http://tracker.ceph.com/issues/22448 along with backports (http://tracker.ceph.com/issues/22449 and http://tracker.ceph.com/issues/22450)
@branch-predictor Got it, thanks!
http://tracker.ceph.com/issues/22448
We observed unexplained, constant disk space usage increase on a few of our prod clusters. At first we thought that it was because of customers abusing them, but that wasn't it. Then we thought that images were constantly being filled with data, but space usage reported by Ceph wasn't consistent with the filesystem. After further digging, we realized that snap trim queues for some PGs were in 250k-element territory... We increased the snap trimmer frequency and the number of parallel snap trim ops, and disk space usage finally started to drop.
This pull request adds:
With that, the above situation could be noticed way earlier.
Signed-off-by: Piotr Dałek piotr.dalek@corp.ovh.com