
mon/OSDMonitor: implement cluster pg limit #17427

Merged
merged 6 commits into ceph:master from wip-pg-num-limits on Sep 19, 2017

Conversation

@liewegas commented Sep 1, 2017

Prevent total pg count from exceeding a max per osd value. Guard pg create, pg_num change, and size change.

gnit:build (wip-pg-num-limits) 03:50 PM $ bin/ceph osd pool create foo 66
pool 'foo' created
gnit:build (wip-pg-num-limits) 03:50 PM $ bin/ceph osd pool set foo size 4
Error ERANGE:  pg_num 66 size 4 would mean 462 total pgs, which exceeds max 400 (mon_pg_warn_max_per_osd 200 * num_in_osds 2)
gnit:build (wip-pg-num-limits) 03:50 PM $ bin/ceph osd pool set foo size 4
set pool 1 size to 4
gnit:build (wip-pg-num-limits) 03:50 PM $ bin/ceph osd pool set foo pg_num 99
set pool 1 pg_num to 99
gnit:build (wip-pg-num-limits) 03:51 PM $ bin/ceph osd pool set foo pg_num 133
set pool 1 pg_num to 133
gnit:build (wip-pg-num-limits) 03:51 PM $ bin/ceph osd pool set foo pg_num 200
Error ERANGE: pool id 1 pg_num 200 size 4 would mean 800 total pgs, which exceeds max 600 (mon_pg_warn_max_per_osd 200 * num_in_osds 3)
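
(For orientation: the arithmetic behind these ERANGE messages is just the projected cluster-wide PG total - pg_num * size for the pool being changed, plus every other pool's pg_num * size - compared against mon_pg_warn_max_per_osd * num_in_osds. A minimal standalone sketch of that calculation; the function names are illustrative, not Ceph's internal API.)

#include <cstdint>
#include <iostream>

// Projected cluster-wide PG count if one pool moves to the given pg_num/size
// while every other pool stays unchanged (illustrative model only).
int64_t projected_pg_count(int64_t other_pools_pgs, int64_t pg_num, int64_t size) {
  return other_pools_pgs + pg_num * size;
}

bool would_exceed_limit(int64_t projected, int64_t max_pgs_per_osd, int64_t num_in_osds) {
  return projected > max_pgs_per_osd * num_in_osds;
}

int main() {
  // Mirrors the last rejection above: only pool foo, pg_num 200, size 4,
  // mon_pg_warn_max_per_osd 200, 3 "in" osds -> 800 > 600.
  int64_t projected = projected_pg_count(0, 200, 4);
  std::cout << "projected=" << projected
            << " exceeds=" << would_exceed_limit(projected, 200, 3) << "\n";
  return 0;
}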

@@ -1477,6 +1477,11 @@ std::vector<Option> get_global_options() {
.set_default(65536)
.set_description(""),

Option("mon_max_pool_pg_num_per_osd", Option::TYPE_INT, Option::LEVEL_ADVANCED)
.set_default(100)
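
(The hunk is cut off here; a plausible completion, following the .set_default()/.set_description() builder pattern visible just above. The description string is a guess, and per the discussion below the option is later renamed to mon_max_pgs_per_osd with a higher default of 200.)

Option("mon_max_pool_pg_num_per_osd", Option::TYPE_INT, Option::LEVEL_ADVANCED)
.set_default(100)
.set_description("max number of pgs per osd that any one pool may consume"),  // wording illustrative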
Member

This seems low - could make it 300 to match the warning on existing pools (mon_pg_warn_max_per_osd)... maybe just use that option for this check instead of adding a new one.

@jdurgin commented Sep 2, 2017

I guess the other place PGs per OSD could change would be when changing pool size - could add the same check there

@liewegas commented Sep 2, 2017

Yeah... it kind of seems like it's better to let them get around this limitation by increasing pool size than it is to make it hard to increase the replica count.

int OSDMonitor::check_pg_num(int pg_num, int size, ostream *ss)
{
  int64_t max_pgs_per_osd = g_conf->mon_pg_warn_max_per_osd;
  int64_t max_pgs = max_pgs_per_osd * osdmap.get_num_osds();
Member

counting only up+in osds might be useful if a bunch of osds are not running but not removed from crush

  int64_t max_pgs_per_osd = g_conf->mon_pg_warn_max_per_osd;
  int64_t max_pgs = max_pgs_per_osd * osdmap.get_num_osds();
  int64_t max_pg_num = MAX(1, max_pgs / size);
  if (pg_num > max_pg_num) {
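
(Pieced together from the two fragments above, a standalone model of this initial, per-pool version of the check - stand-in types, not OSDMonitor's actual code - which is what the next comments push back on, since it ignores PGs in other pools.)

#include <algorithm>
#include <cstdint>
#include <ostream>

struct OSDMapModel {                  // stand-in for the real OSDMap
  int num_osds = 0;
  int get_num_osds() const { return num_osds; }
};

// Per-pool version of the check: caps this pool's pg_num at
// (max_pgs_per_osd * num_osds) / size, ignoring every other pool.
int check_pg_num_model(int pg_num, int size, int64_t max_pgs_per_osd,
                       const OSDMapModel& osdmap, std::ostream* ss) {
  int64_t max_pgs = max_pgs_per_osd * osdmap.get_num_osds();
  int64_t max_pg_num = std::max<int64_t>(1, max_pgs / size);
  if (pg_num > max_pg_num) {
    *ss << "pg_num " << pg_num << " size " << size
        << " would exceed the cap of " << max_pg_num << " pgs for this pool";
    return -34;  // i.e. -ERANGE
  }
  return 0;
}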
Member

shouldn't we check total PGs, rather than just those in this pool?

Member

might get confusing with multiple crush roots, I guess we could account for those as well...

@jdurgin commented Sep 2, 2017

pool size changes are a lot less likely too, so I'm ok without the check there

@liewegas changed the title from "mon/OSDMonitor: implement mon_max_pool_pg_num_per_osd limit" to "mon/OSDMonitor: implement cluster pg limit" on Sep 6, 2017
@liewegas commented Sep 6, 2017

@jdurgin updated to look at total pgs and to also guard pool size changes!

@@ -6052,6 +6086,10 @@ int OSDMonitor::prepare_command_pool_set(map<string,cmd_vartype> &cmdmap,
ss << "pool size must be between 1 and 10";
return -EINVAL;
}
int r = check_pg_num(-1, p.get_pg_num(), n, &ss);
Member

pool instead of -1

Member

oh it doesn't affect the result though, nvm

Member

oh yes it does, since we pass the new size in

Member Author

fixed
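
(What the fix amounts to: the call site passes the pool's own id so check_pg_num can exclude that pool's current pg_num * old_size from the baseline before adding pg_num * new_size back in; with -1, nothing is excluded and the pool is effectively counted twice. A small illustrative model, not the actual call site.)

#include <cstdint>
#include <map>

struct PoolModel { int pg_num; int size; };

// Projected cluster-wide PG total for a change to one pool: skip that pool's
// current pg_num*size, then add it back with the new values.  Passing -1 as
// changing_pool would skip nothing and double count the pool.
int64_t projected_cluster_pgs(const std::map<int64_t, PoolModel>& pools,
                              int64_t changing_pool,
                              int new_pg_num, int new_size) {
  int64_t total = 0;
  for (const auto& [id, p] : pools) {
    if (id == changing_pool)
      continue;
    total += int64_t(p.pg_num) * p.size;
  }
  return total + int64_t(new_pg_num) * new_size;
}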

int OSDMonitor::check_pg_num(int64_t pool, int pg_num, int size, ostream *ss)
{
  int64_t max_pgs_per_osd = g_conf->mon_pg_warn_max_per_osd;
  int64_t max_pgs = max_pgs_per_osd * osdmap.get_num_in_osds();
Member

the 'make check' result suggests we should use MAX(num_in_osds, 1).

for the mon_pg_warn_max_per_osd == 0 case, which someone might have used to disable the warning, maybe we should default to 200 instead, since there's a nice message explaining it if that limit is hit here.

Member Author

I used max with 3, since that seems like the minimum useful cluster size. (And it fixes the standalone tests, which create pools before any osds exist). Sound ok?

Member Author

Yeah we can do that. In that case I think we should rename the config option to be more like mon_max_pgs_per_osd (remove the 'warn' part)?

Member

Yeah, 3 seems fine, and renaming the config option is a good idea too.
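
(Summing up where the review lands - total PGs across all pools, "in" OSD count floored at 3, and the option renamed to mon_max_pgs_per_osd - a standalone model of the resulting check; stand-in types and names, not the merged code verbatim.)

#include <algorithm>
#include <cstdint>
#include <map>
#include <ostream>

struct PoolModel { int pg_num; int size; };

// Total-PG version of the check: projected cluster-wide PGs (all pools, with
// the changing pool counted at its new pg_num/size) compared against
// mon_max_pgs_per_osd times the number of "in" osds, floored at 3.
int check_pg_num_model(int64_t pool, int pg_num, int size,
                       int64_t mon_max_pgs_per_osd, int num_in_osds,
                       const std::map<int64_t, PoolModel>& pools,
                       std::ostream* ss) {
  int64_t max_pgs = mon_max_pgs_per_osd * std::max(num_in_osds, 3);
  int64_t projected = int64_t(pg_num) * size;
  for (const auto& [id, p] : pools) {
    if (id != pool)
      projected += int64_t(p.pg_num) * p.size;
  }
  if (projected > max_pgs) {
    *ss << "pg_num " << pg_num << " size " << size << " would mean "
        << projected << " total pgs, which exceeds max " << max_pgs;
    return -34;  // i.e. -ERANGE
  }
  return 0;
}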

@@ -346,6 +346,7 @@ class OSDMonitor : public PaxosService {
const string &erasure_code_profile,
unsigned *stripe_width,
ostream *ss);
int check_pg_num(int64_t pool, int pg_num, int size, ostream* ss);
Contributor

mark this method const if you please.
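
(i.e. the suggestion is to const-qualify the declaration above, roughly as below, the rationale presumably being that the check only reads monitor state.)

int check_pg_num(int64_t pool, int pg_num, int size, ostream* ss) const;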

@liewegas

Tests out okay. I think we need to decide what the final config option is going to be, since this'll get backported - or should we keep the current config option for luminous?

This is 2x the recommended target (100 per OSD).

Signed-off-by: Sage Weil <sage@redhat.com>
Check total pg count for the cluster vs osd count and max pgs per osd
before allowing pool creation, pg_num change, or pool size change.

"in" OSDs are the ones we distribute data too, so this should be the right
count to use.  (Whether they happen to be up or down at the moment is
incidental.)

If the user really wants to create the pool, they can change the
configurable limit.

Signed-off-by: Sage Weil <sage@redhat.com>
Signed-off-by: Sage Weil <sage@redhat.com>
This runs afoul of the new max pg per osd limit.

Signed-off-by: Sage Weil <sage@redhat.com>
Signed-off-by: Sage Weil <sage@redhat.com>
Fiddling with pgp_num doesn't help with TOO_MANY_PGS.

Signed-off-by: Sage Weil <sage@redhat.com>
@liewegas merged commit 6767f84 into ceph:master on Sep 19, 2017
@liewegas deleted the wip-pg-num-limits branch on September 19, 2017 17:57
@liewegas

backport #17814
