
mon,osd: explicitly remap some pgs #13984

Merged
merged 12 commits into ceph:master from liewegas:wip-osdmap-remap Mar 28, 2017

Conversation

@liewegas
Member

liewegas commented Mar 15, 2017

TODO:

  • add osdmaptool args for --remap
  • cli tests
  • configurable and lower cap on max iterations
  • thrashing tests

The pg_remap structure currently has no users (only pg_remap_items is used).

before:

ID WEIGHT  REWEIGHT SIZE  USE   AVAIL  %USE  VAR  PGS TYPE NAME
-1 8.00000        - 2979G 2303G   675G 77.32 1.00   - root default
-2 8.00000        - 2979G 2303G   675G 77.32 1.00   -     host gnit
 0 1.00000  1.00000  372G  287G 86495M 77.32 1.00  69         osd.0
 1 1.00000  1.00000  372G  287G 86495M 77.32 1.00  54         osd.1
 2 1.00000  1.00000  372G  287G 86495M 77.32 1.00  47         osd.2
 3 1.00000  1.00000  372G  287G 86495M 77.32 1.00  53         osd.3
 4 1.00000  1.00000  372G  287G 86495M 77.32 1.00  61         osd.4
 5 1.00000  1.00000  372G  287G 86495M 77.32 1.00  55         osd.5
 6 1.00000  1.00000  372G  287G 86495M 77.32 1.00  63         osd.6
 7 1.00000  1.00000  372G  287G 86495M 77.32 1.00  54         osd.7

ran:

ceph osd getmap -o om
osdmaptool om --remap c
. c
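
(osdmaptool --remap writes ceph CLI commands realizing the computed remap entries into the file c; sourcing it with '. c' then applies them to the cluster.)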

result:

ID WEIGHT  REWEIGHT SIZE  USE   AVAIL  %USE  VAR  PGS TYPE NAME
-1 8.00000        - 2979G 2303G   675G 77.32 1.00   - root default
-2 8.00000        - 2979G 2303G   675G 77.32 1.00   -     host gnit
 0 1.00000  1.00000  372G  287G 86494M 77.32 1.00  57         osd.0
 1 1.00000  1.00000  372G  287G 86494M 77.32 1.00  57         osd.1
 2 1.00000  1.00000  372G  287G 86494M 77.32 1.00  57         osd.2
 3 1.00000  1.00000  372G  287G 86494M 77.32 1.00  57         osd.3
 4 1.00000  1.00000  372G  287G 86494M 77.32 1.00  57         osd.4
 5 1.00000  1.00000  372G  287G 86494M 77.32 1.00  57         osd.5
 6 1.00000  1.00000  372G  287G 86495M 77.32 1.00  57         osd.6
 7 1.00000  1.00000  372G  287G 86494M 77.32 1.00  57         osd.7

@liewegas liewegas force-pushed the liewegas:wip-osdmap-remap branch from 3a6bb2c to 130fb55 Mar 15, 2017

@liewegas liewegas changed the title from mon,osd: explicitly remap some pgs to WIP mon,osd: explicitly remap some pgs Mar 15, 2017

@liewegas liewegas force-pushed the liewegas:wip-osdmap-remap branch 7 times, most recently from f5fbfee to 53b366c Mar 15, 2017

@markhpc

Member

markhpc commented Mar 22, 2017

This is absolutely beautiful, Sage! Great job!

@jupiturliu

Contributor

jupiturliu commented Mar 22, 2017

Well done, Sage.

@liewegas liewegas force-pushed the liewegas:wip-osdmap-remap branch from dbebb69 to 57abaec Mar 22, 2017

@liewegas

Member Author

liewegas commented Mar 22, 2017

This is close to being mergeable. The primary limitation of the current remapper is that it does not remap to balance across hosts (or any buckets, really), only across the leaves within the tree. The algorithm needs to be a little bit smarter to handle that. The OSDMap format is not affected, though, just the offline tool that currently makes use of it.

@liewegas liewegas force-pushed the liewegas:wip-osdmap-remap branch 5 times, most recently from 19fd43b to 4e83665 Mar 22, 2017

@liewegas liewegas changed the title from WIP mon,osd: explicitly remap some pgs to mon,osd: explicitly remap some pgs Mar 23, 2017

@liewegas liewegas force-pushed the liewegas:wip-osdmap-remap branch from 4e83665 to edeef84 Mar 23, 2017

liewegas added some commits Mar 13, 2017

osd/OSDMap: clean up primary selection
_pg_to_raw_osds was selecting a primary, but it was generally
getting thrown out since _raw_to_up_osds was picking a *new*
primary after filtering out down and nonexistent osds.  Add a
_pick_primary helper to do it and call that explicitly in the
callers where it can be used by all.

Signed-off-by: Sage Weil <sage@redhat.com>
osd/OSDMap: add pg_remap and pg_remap_items
The first simply forces a pg to map to a particular value.  The second
replaces specific OSDs with specific other OSDs in the CRUSH mapping.

Signed-off-by: Sage Weil <sage@redhat.com>
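
For reference, the shapes of the two structures are roughly as follows. This is a simplified sketch, not the actual OSDMap declarations (which use ceph's mempool-backed containers and the real pg_t type):

#include <cstdint>
#include <map>
#include <utility>
#include <vector>

// Stand-in for ceph's pg_t (pool id, pg seed), for this sketch only.
using pg_t = std::pair<int64_t, uint32_t>;

// pg_remap: force a pg to map to an explicit vector of OSDs,
// replacing the CRUSH result wholesale.
std::map<pg_t, std::vector<int32_t>> pg_remap;

// pg_remap_items: per-pg list of (from, to) pairs; wherever 'from'
// appears in the CRUSH result for that pg, substitute 'to'.
std::map<pg_t, std::vector<std::pair<int32_t, int32_t>>> pg_remap_items;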
mon/OSDMonitor: add 'osd [rm-]pg-remap[-items] ...' commands
Add commands to add and remove pg_remap mappings.

Require that a mon config option is set before this is allowed so that
users don't inadvertently prevent older clients from interacting with the
cluster.

Signed-off-by: Sage Weil <sage@redhat.com>
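
Based on the command names above, usage would look something like this (the pg id and OSD ids here are made up for illustration):

ceph osd pg-remap-items 1.7 6 0     # wherever CRUSH maps pg 1.7 to osd.6, use osd.0
ceph osd rm-pg-remap-items 1.7      # remove the mapping again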
crush: make get_immediate_parent const
Signed-off-by: Sage Weil <sage@redhat.com>
osd/OSDMap: ignore pg_remap[_items] mappings when target osd is out
If the remap target is fully out, ignore the mapping.

Do not do a probabilistic 'osd_weight' rejection; pg-remap is intended to
avoid probabilistic mappings.

Signed-off-by: Sage Weil <sage@redhat.com>
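
The intent boils down to a guard like the following sketch (a hypothetical free function, not the PR's code; in an OSDMap, weight 0 means fully out):

#include <vector>

// Honor a remap target only if the OSD exists and is not marked out.
// Unlike normal CRUSH placement there is deliberately no probabilistic
// roll against intermediate osd_weight values, since pg-remap exists
// precisely to avoid probabilistic mappings.
bool remap_target_usable(int osd, const std::vector<unsigned>& osd_weight) {
  return osd >= 0 &&
         osd < (int)osd_weight.size() &&
         osd_weight[osd] != 0;  // 0 == fully out
}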

@liewegas liewegas force-pushed the liewegas:wip-osdmap-remap branch from edeef84 to 4a7b44c Mar 25, 2017

@ghost


ghost commented Mar 25, 2017

Does the remapping need to be redone when a device is added/removed?

@liewegas
Member Author

liewegas commented Mar 26, 2017

The remapping can be done at any time, just like reweight-by-utilization.

The OSDMap code only uses pg_remap_items, which is resistant to other cluster changes... if the CRUSH topology/layout changes significantly, the remap entries degrade into no-ops and you just get what CRUSH did.

The idea is that the offline osdmaptool process that's there now would eventually be moved into a periodic process run in ceph-mgr, maybe daily or something. It also works by running N iterations... so it could also just run one iteration every 5 minutes and only run a second if the first one found an improvement that could be made... something like that.

@liewegas liewegas requested review from gregsfortytwo and jdurgin Mar 26, 2017

@liewegas

Member Author

liewegas commented Mar 26, 2017

I think this is ready to merge. Future todo:

  • improve crush try_remap method to rebalance across hosts (or any other intervening crush bucket types)
  • add something to mgr that does this incrementally online instead of offline

auto q = pg_remap_items.find(pg);
if (q != pg_remap_items.end()) {
// NOTE: this approach does not allow a bidirectional swap,

@jdurgin

jdurgin Mar 28, 2017

Member

It also wouldn't handle a degenerate EC case, where the same OSD appears more than once - e.g. [1, 0, 2, -1, 1] - or can that only happen in the acting remappings?

@liewegas

liewegas Mar 28, 2017

Author Member

Yeah, CRUSH will never do dups like that. Only pg_temp (maybe?).

@@ -1481,6 +1540,19 @@ int OSDMap::apply_incremental(const Incremental &inc)
(*primary_temp)[p->first] = p->second;
}

for (auto& p : inc.new_pg_remap) {

@jdurgin

jdurgin Mar 28, 2017

Member

Just looking at the data structure, removing the old entries before adding the new ones looks fragile, since a new remap of the same pg would be erased if we were also erasing an old remap of that pg. I see later commits only use addition or removal of a remap for a given pg, though, so this isn't a bug as is.

@liewegas

liewegas Mar 28, 2017

Author Member

FWIW this mirrors how we do {new,old}_blacklist and _pools. But not erasure_code_profiles.

We should probably be asserting in encode if the sets are not mutually exclusive and treat that as a bug in the monitor.
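
Something along these lines (a hypothetical helper, not code from this PR) would catch it at encode time:

#include <cassert>

// Hypothetical sanity check: a pg should never appear in both the new
// and old remap sets of the same Incremental, since otherwise the
// apply order silently decides which one wins.
template <typename Map, typename Set>
void assert_disjoint(const Map& new_entries, const Set& old_entries) {
  for (const auto& p : new_entries)
    assert(old_entries.count(p.first) == 0);
}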


c.set_max_devices(20);

//JSONFormatter jf(true);

@jdurgin

jdurgin Mar 28, 2017

Member

minor: extra comment

@liewegas

liewegas Mar 28, 2017

Author Member

Added the matching dump (also commented out). Copy-pasta from another test, but it's helpful when writing the test.

@jdurgin

Member

jdurgin commented Mar 28, 2017

Looks great overall - just a couple of comments.

liewegas added some commits Mar 15, 2017

crush: implement try_remap_rule
Simulate a CRUSH mapping but try to identify alternative OSD
choices (based on an underfull list and overfull set) that still
respect the CRUSH rule constraints.

Signed-off-by: Sage Weil <sage@redhat.com>
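
The substitution idea, reduced to a toy sketch (the real try_remap_rule re-checks the CRUSH rule's constraints, e.g. failure domains, which this version deliberately skips):

#include <algorithm>
#include <set>
#include <vector>

// Walk the raw mapping; whenever we hit an overfull OSD, try to swap
// in an underfull one that is not already present in the mapping.
std::vector<int> try_remap_sketch(const std::vector<int>& raw,
                                  const std::set<int>& overfull,
                                  std::vector<int> underfull) {
  std::vector<int> out = raw;
  for (auto& osd : out) {
    if (!overfull.count(osd) || underfull.empty())
      continue;
    int candidate = underfull.front();
    if (std::find(out.begin(), out.end(), candidate) == out.end()) {
      osd = candidate;  // record (old, new) as a pg_remap_items pair
      underfull.erase(underfull.begin());
    }
  }
  return out;
}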
osd/OSDMap: implement remap_pgs
Run a specified number of iterations trying to install new
pg_remap_items entries that improve the PG distribution.

Signed-off-by: Sage Weil <sage@redhat.com>
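
In outline, the greedy loop behaves like this toy version (hypothetical, operating on bare PG counts instead of a real OSDMap; in the real code each 'move' is a pg_remap_items entry found via crush try_remap_rule):

#include <algorithm>
#include <map>

// Repeatedly move one PG from the fullest OSD to the emptiest one
// until the counts are as flat as integers allow, or until we run out
// of iterations.
void balance_sketch(std::map<int, int>& pgs_per_osd, int max_iterations) {
  auto cmp = [](const auto& a, const auto& b) { return a.second < b.second; };
  for (int i = 0; i < max_iterations; ++i) {
    auto lo = std::min_element(pgs_per_osd.begin(), pgs_per_osd.end(), cmp);
    auto hi = std::max_element(pgs_per_osd.begin(), pgs_per_osd.end(), cmp);
    if (hi->second - lo->second <= 1)
      break;  // converged
    --hi->second;
    ++lo->second;
  }
}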
osd/OSDMap: clean_pg_remaps
Helper to clean out no-op pg_remap cruft:

- pg_remap entries that match the raw result
- pg_remap_items entries where the 'from' part of the (from,to)
  set isn't present.

Signed-off-by: Sage Weil <sage@redhat.com>
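
A minimal sketch of the two no-op tests (hypothetical free functions; the real helper works directly on the OSDMap's own tables):

#include <algorithm>
#include <utility>
#include <vector>

// A pg_remap entry is cruft if it just reproduces the raw CRUSH result.
bool remap_is_noop(const std::vector<int>& remap, const std::vector<int>& raw) {
  return remap == raw;
}

// A pg_remap_items entry is cruft if none of its 'from' OSDs appear in
// the raw result, so no (from, to) pair can ever fire.
bool remap_items_are_noop(const std::vector<std::pair<int, int>>& items,
                          const std::vector<int>& raw) {
  return std::none_of(items.begin(), items.end(), [&](const auto& it) {
    return std::find(raw.begin(), raw.end(), it.first) != raw.end();
  });
}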
test/cli/osdmaptool: fix tests
Signed-off-by: Sage Weil <sage@redhat.com>
test/cli/osdmaptool/remap.t: add --remap test
Signed-off-by: Sage Weil <sage@redhat.com>
osdmaptool: add --remap to run the remap_pgs() method
Output ceph CLI commands to realize the new remap entries.

Signed-off-by: Sage Weil <sage@redhat.com>
qa/tasks/thrashosds,ceph_manager: thrash pg_remap[_items]
Signed-off-by: Sage Weil <sage@redhat.com>

@liewegas liewegas force-pushed the liewegas:wip-osdmap-remap branch from 6165ac7 to 2a08cbb Mar 28, 2017

@liewegas liewegas merged commit 35b60ae into ceph:master Mar 28, 2017

3 checks passed

Signed-off-by: all commits in this PR are signed
Unmodified Submodules: submodules for project are unmodified
default: Build finished.

@liewegas liewegas deleted the liewegas:wip-osdmap-remap branch Mar 28, 2017
