
mon/OSDMonitor: implement cluster pg limit #17427

Merged
merged 6 commits into ceph:master from wip-pg-num-limits on Sep 19, 2017

Conversation

@liewegas commented Sep 1, 2017

Prevent total pg count from exceeding a max per osd value. Guard pg create, pg_num change, and size change.

gnit:build (wip-pg-num-limits) 03:50 PM $ bin/ceph osd pool create foo 66
pool 'foo' created
gnit:build (wip-pg-num-limits) 03:50 PM $ bin/ceph osd pool set foo size 4
Error ERANGE:  pg_num 66 size 4 would mean 462 total pgs, which exceeds max 400 (mon_pg_warn_max_per_osd 200 * num_in_osds 2)
gnit:build (wip-pg-num-limits) 03:50 PM $ bin/ceph osd pool set foo size 4
set pool 1 size to 4
gnit:build (wip-pg-num-limits) 03:50 PM $ bin/ceph osd pool set foo pg_num 99
set pool 1 pg_num to 99
gnit:build (wip-pg-num-limits) 03:51 PM $ bin/ceph osd pool set foo pg_num 133
set pool 1 pg_num to 133
gnit:build (wip-pg-num-limits) 03:51 PM $ bin/ceph osd pool set foo pg_num 200
Error ERANGE: pool id 1 pg_num 200 size 4 would mean 800 total pgs, which exceeds max 600 (mon_pg_warn_max_per_osd 200 * num_in_osds 3)
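
(For orientation: the arithmetic behind these ERANGE messages is just the projected cluster-wide PG total - pg_num * size for the pool being changed, plus every other pool's pg_num * size - compared against mon_pg_warn_max_per_osd * num_in_osds. A minimal standalone sketch of that calculation; the function names are illustrative, not Ceph's internal API.)

#include <cstdint>
#include <iostream>

// Projected cluster-wide PG count if one pool moves to the given pg_num/size
// while every other pool stays unchanged (illustrative model only).
int64_t projected_pg_count(int64_t other_pools_pgs, int64_t pg_num, int64_t size) {
  return other_pools_pgs + pg_num * size;
}

bool would_exceed_limit(int64_t projected, int64_t max_pgs_per_osd, int64_t num_in_osds) {
  return projected > max_pgs_per_osd * num_in_osds;
}

int main() {
  // Mirrors the last rejection above: only pool foo, pg_num 200, size 4,
  // mon_pg_warn_max_per_osd 200, 3 "in" osds -> 800 > 600.
  int64_t projected = projected_pg_count(0, 200, 4);
  std::cout << "projected=" << projected
            << " exceeds=" << would_exceed_limit(projected, 200, 3) << "\n";
  return 0;
}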

@@ -1477,6 +1477,11 @@ std::vector<Option> get_global_options() {
.set_default(65536)
.set_description(""),

Option("mon_max_pool_pg_num_per_osd", Option::TYPE_INT, Option::LEVEL_ADVANCED)
.set_default(100)
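
(The hunk is cut off here; a plausible completion, following the .set_default()/.set_description() builder pattern visible just above. The description string is a guess, and per the discussion below the option is later renamed to mon_max_pgs_per_osd with a higher default of 200.)

Option("mon_max_pool_pg_num_per_osd", Option::TYPE_INT, Option::LEVEL_ADVANCED)
.set_default(100)
.set_description("max number of pgs per osd that any one pool may consume"),  // wording illustrative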
Member

This seems low - could make it 300 to match the warning on existing pools (mon_pg_warn_max_per_osd)... maybe just use that option for this check instead of adding a new one.

@jdurgin commented Sep 2, 2017

I guess the other place PGs per OSD could change would be when changing pool size - could add the same check there

@liewegas commented Sep 2, 2017

Yeah... it kind of seems like it's better to let them get around this limitation by increasing pool size than it is to make it hard to increase the replica count.

int OSDMonitor::check_pg_num(int pg_num, int size, ostream *ss)
{
  int64_t max_pgs_per_osd = g_conf->mon_pg_warn_max_per_osd;
  int64_t max_pgs = max_pgs_per_osd * osdmap.get_num_osds();
Member

counting only up+in osds might be useful if a bunch of osds are not running but not removed from crush

  int64_t max_pgs_per_osd = g_conf->mon_pg_warn_max_per_osd;
  int64_t max_pgs = max_pgs_per_osd * osdmap.get_num_osds();
  int64_t max_pg_num = MAX(1, max_pgs / size);
  if (pg_num > max_pg_num) {
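
(Pieced together from the two fragments above, a standalone model of this initial, per-pool version of the check - stand-in types, not OSDMonitor's actual code - which is what the next comments push back on, since it ignores PGs in other pools.)

#include <algorithm>
#include <cstdint>
#include <ostream>

struct OSDMapModel {                  // stand-in for the real OSDMap
  int num_osds = 0;
  int get_num_osds() const { return num_osds; }
};

// Per-pool version of the check: caps this pool's pg_num at
// (max_pgs_per_osd * num_osds) / size, ignoring every other pool.
int check_pg_num_model(int pg_num, int size, int64_t max_pgs_per_osd,
                       const OSDMapModel& osdmap, std::ostream* ss) {
  int64_t max_pgs = max_pgs_per_osd * osdmap.get_num_osds();
  int64_t max_pg_num = std::max<int64_t>(1, max_pgs / size);
  if (pg_num > max_pg_num) {
    *ss << "pg_num " << pg_num << " size " << size
        << " would exceed the cap of " << max_pg_num << " pgs for this pool";
    return -34;  // i.e. -ERANGE
  }
  return 0;
}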
Member

shouldn't we check total PGs, rather than just those in this pool?

Member

might get confusing with multiple crush roots, I guess we could account for those as well...

@jdurgin commented Sep 2, 2017

pool size changes are a lot less likely too, so I'm ok without the check there

@liewegas changed the title from "mon/OSDMonitor: implement mon_max_pool_pg_num_per_osd limit" to "mon/OSDMonitor: implement cluster pg limit" on Sep 6, 2017
@liewegas commented Sep 6, 2017

@jdurgin updated to look at total pgs and to also guard pool size changes!

@@ -6052,6 +6086,10 @@ int OSDMonitor::prepare_command_pool_set(map<string,cmd_vartype> &cmdmap,
ss << "pool size must be between 1 and 10";
return -EINVAL;
}
int r = check_pg_num(-1, p.get_pg_num(), n, &ss);
Member

pool instead of -1

Member

oh it doesn't affect the result though, nvm

Member

oh yes it does, since we pass the new size in

Member Author

fixed
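
(What the fix amounts to: the call site passes the pool's own id so check_pg_num can exclude that pool's current pg_num * old_size from the baseline before adding pg_num * new_size back in; with -1, nothing is excluded and the pool is effectively counted twice. A small illustrative model, not the actual call site.)

#include <cstdint>
#include <map>

struct PoolModel { int pg_num; int size; };

// Projected cluster-wide PG total for a change to one pool: skip that pool's
// current pg_num*size, then add it back with the new values.  Passing -1 as
// changing_pool would skip nothing and double count the pool.
int64_t projected_cluster_pgs(const std::map<int64_t, PoolModel>& pools,
                              int64_t changing_pool,
                              int new_pg_num, int new_size) {
  int64_t total = 0;
  for (const auto& [id, p] : pools) {
    if (id == changing_pool)
      continue;
    total += int64_t(p.pg_num) * p.size;
  }
  return total + int64_t(new_pg_num) * new_size;
}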

int OSDMonitor::check_pg_num(int64_t pool, int pg_num, int size, ostream *ss)
{
  int64_t max_pgs_per_osd = g_conf->mon_pg_warn_max_per_osd;
  int64_t max_pgs = max_pgs_per_osd * osdmap.get_num_in_osds();
Member

the 'make check' result suggests we should use MAX(num_in_osds, 1).

for the mon_pg_warn_max_per_osd == 0 case, which someone might have used to disable the warning, maybe we should default to 200 instead, since there's a nice message explaining it if that limit is hit here.

Member Author

I used max with 3, since that seems like the minimum useful cluster size. (And it fixes the standalone tests, which create pools before any osds exist). Sound ok?

Member Author

Yeah we can do that. In that case I think we should rename the config option to be more like mon_max_pgs_per_osd (remove the 'warn' part)?

Member

Yeah, 3 seems fine, and renaming the config option is a good idea too.
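
(Summing up where the review lands - total PGs across all pools, "in" OSD count floored at 3, and the option renamed to mon_max_pgs_per_osd - a standalone model of the resulting check; stand-in types and names, not the merged code verbatim.)

#include <algorithm>
#include <cstdint>
#include <map>
#include <ostream>

struct PoolModel { int pg_num; int size; };

// Total-PG version of the check: projected cluster-wide PGs (all pools, with
// the changing pool counted at its new pg_num/size) compared against
// mon_max_pgs_per_osd times the number of "in" osds, floored at 3.
int check_pg_num_model(int64_t pool, int pg_num, int size,
                       int64_t mon_max_pgs_per_osd, int num_in_osds,
                       const std::map<int64_t, PoolModel>& pools,
                       std::ostream* ss) {
  int64_t max_pgs = mon_max_pgs_per_osd * std::max(num_in_osds, 3);
  int64_t projected = int64_t(pg_num) * size;
  for (const auto& [id, p] : pools) {
    if (id != pool)
      projected += int64_t(p.pg_num) * p.size;
  }
  if (projected > max_pgs) {
    *ss << "pg_num " << pg_num << " size " << size << " would mean "
        << projected << " total pgs, which exceeds max " << max_pgs;
    return -34;  // i.e. -ERANGE
  }
  return 0;
}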

@@ -346,6 +346,7 @@ class OSDMonitor : public PaxosService {
const string &erasure_code_profile,
unsigned *stripe_width,
ostream *ss);
int check_pg_num(int64_t pool, int pg_num, int size, ostream* ss);
Contributor

mark this method const if you please.
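
(i.e. the suggestion is to const-qualify the declaration above, roughly as below, the rationale presumably being that the check only reads monitor state.)

int check_pg_num(int64_t pool, int pg_num, int size, ostream* ss) const;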

@liewegas

Tests out okay. I think we need to decide what the final config option is going to be, since this'll get backported - or should we keep the current config option for luminous?

This is 2x the recommended target (100 per OSD).

Signed-off-by: Sage Weil <sage@redhat.com>
Check total pg count for the cluster vs osd count and max pgs per osd
before allowing pool creation, pg_num change, or pool size change.

"in" OSDs are the ones we distribute data too, so this should be the right
count to use.  (Whether they happen to be up or down at the moment is
incidental.)

If the user really wants to create the pool, they can change the
configurable limit.

Signed-off-by: Sage Weil <sage@redhat.com>
Signed-off-by: Sage Weil <sage@redhat.com>
This runs afoul of the new max pg per osd limit.

Signed-off-by: Sage Weil <sage@redhat.com>
Signed-off-by: Sage Weil <sage@redhat.com>
Fiddling with pgp_num doesn't help with TOO_MANY_PGS.

Signed-off-by: Sage Weil <sage@redhat.com>
@liewegas merged commit 6767f84 into ceph:master on Sep 19, 2017
@liewegas deleted the wip-pg-num-limits branch on September 19, 2017 17:57
@liewegas

backport #17814
