mgr/balancer: mgr module to automatically balance PGs across OSDs #16272

liewegas · 2017-07-11T20:28:57Z

- back off when degraded, unknown, inactive
- throttle against misplaced ratio
- upmap (luminous+)
- crush-legacy (compat with pre-luminous)
- crush (luminous+)
- osd_weights (legacy osd weight-based approach) (probably not worth doing this!)
- phase out balance optimizations from other modes (e.g., phase out osd_weight if we are optimizing crush weights)
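
The first two items (back off when unhealthy, throttle against the misplaced ratio) can be sketched roughly as below. This is only an illustration, not the code in this PR: `PGSummary`, `should_optimize`, and the 0.05 default threshold are hypothetical names and values chosen for the example.

```python
from dataclasses import dataclass
from typing import Tuple


@dataclass
class PGSummary:
    """Hypothetical snapshot of PG state ratios, each in 0.0..1.0."""
    unknown: float = 0.0
    inactive: float = 0.0
    degraded: float = 0.0
    misplaced: float = 0.0


def should_optimize(pgs: PGSummary, max_misplaced: float = 0.05) -> Tuple[bool, str]:
    """Decide whether a balancing step is allowed right now."""
    # Back off entirely while any PGs are unknown, inactive, or degraded.
    for name in ("unknown", "inactive", "degraded"):
        if getattr(pgs, name) > 0.0:
            return False, "some PGs are %s" % name
    # Throttle: do not add more data movement while too much is already misplaced.
    if pgs.misplaced >= max_misplaced:
        return False, "misplaced ratio at or above threshold"
    return True, "ok"
```

A periodic loop would call a check like this on each wake-up and skip the optimization step whenever it returns `False`.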

liewegas · 2017-07-11T20:30:17Z

There's enough here that this actually works! The real win is the 'crush-legacy' mode.

tchaikov · 2017-07-12T11:19:45Z

src/pybind/mgr/balancer/module.py

+                'prefix': 'osd rm-pg-upmap-items',
+                'format': 'json',
+                'pgid': pgid,
+            }), 'foo')


The tag ("foo") is used to identify the command upon completion; see MonCommandCompletion::finish() and PyModules::notify_all(). If we don't care about it, we can probably just pass an empty string here. But we could also pass something more meaningful, like "upmap", in case we want to check it in the future.
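
To illustrate why a meaningful tag can be useful, here is a toy model of tag-based completion tracking. `CompletionTracker` and its methods are hypothetical stand-ins written for this example; they are not the mgr's actual API, and the command completes immediately instead of asynchronously.

```python
class CompletionTracker:
    """Hypothetical stand-in for the command-completion plumbing."""

    def __init__(self):
        self.finished = []  # (tag, return_code) pairs, in completion order

    def send_command(self, command, tag):
        # A real module would hand `command` to the monitor asynchronously;
        # this toy version completes it immediately with return code 0.
        self.notify("command", tag, 0)

    def notify(self, notify_type, tag, rc):
        # The tag is the only thing linking a completion back to its request.
        if notify_type == "command":
            self.finished.append((tag, rc))

    def results_for(self, tag):
        # Collect the return codes of every completed command sent with `tag`.
        return [rc for t, rc in self.finished if t == tag]
```

With a tag such as "upmap", `results_for("upmap")` picks out exactly the completions of the upmap commands; with a placeholder tag shared by unrelated commands, that distinction is lost.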

tchaikov · 2017-07-12T11:26:50Z

src/pybind/mgr/mgr_module.py

+    def __init__(self, handle):
+        self._handle = handle
+
+    def get_epoch(self):


nit, could be

    @property
    def epoch(self):
        return ceph_osdmap.get_epoch(self._handle)
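
A self-contained sketch of what the property form looks like from the caller's side. `FakeBackend` is a hypothetical stub that only mimics the single `get_epoch(handle)` call used in the suggestion; the real `ceph_osdmap` binding is not reproduced here.

```python
class FakeBackend:
    """Hypothetical stub for the extension module behind the wrapper."""

    @staticmethod
    def get_epoch(handle):
        return handle["epoch"]


ceph_osdmap = FakeBackend  # assumption: stands in for the real binding


class OSDMap:
    def __init__(self, handle):
        self._handle = handle

    @property
    def epoch(self):
        # Read-only attribute access instead of an explicit getter call.
        return ceph_osdmap.get_epoch(self._handle)
```

Callers then write `osdmap.epoch` rather than `osdmap.get_epoch()`, and an accidental assignment to `osdmap.epoch` raises `AttributeError`.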

ghost · 2017-08-17T09:51:15Z

src/pybind/mgr/balancer/module.py

+                # adjust/normalize by weight
+                adjusted = float(v) / target[k] / float(num)
+                dev += (avg - adjusted) * (avg - adjusted)
+            stddev = math.sqrt(dev / float(max(num - 1, 1)))


Could we keep the actual deviation here? We have small numbers, and having them reflect the actual difference will make things easier to read and easier to use to reach the optimum.
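
One way to read this suggestion is to report each OSD's deviation in units of PGs rather than as a normalized ratio. The sketch below is an interpretation under that assumption, not the module's code; `deviations`, `pg_counts`, and `target` are hypothetical names, and `target` is assumed to hold per-OSD fractions that sum to 1.0.

```python
import math


def deviations(pg_counts, target):
    """Return per-OSD actual deviation (in PGs) and its sample stddev.

    pg_counts: {osd: number of PGs currently mapped to it}
    target:    {osd: fraction of the total it should hold}, summing to 1.0
    """
    total = sum(pg_counts.values())
    # Actual deviation: PGs held above (+) or below (-) the OSD's fair share.
    actual = {osd: pg_counts[osd] - target[osd] * total for osd in pg_counts}
    num = len(actual)
    # The mean deviation is zero when the targets sum to 1.0, so the sum of
    # squares divided by (num - 1) is the sample variance.
    stddev = math.sqrt(sum(d * d for d in actual.values()) / max(num - 1, 1))
    return actual, stddev
```

With two equally weighted OSDs holding 12 and 8 PGs, the deviations are +2 and -2 PGs, which reads directly as "move two PGs".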

ghost · 2017-08-17T09:56:44Z

src/pybind/mgr/balancer/module.py

+        self.log.debug('pools %s' % pools)
+        self.log.debug('pool_rule %s' % pool_rule)
+
+        # get expected distributions by root


This should perhaps support multiple roots per rule, by looping over the rules instead of the roots?
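
A rough sketch of the rule-first loop being suggested, using plain dictionaries. `pools_by_root`, `pool_rule`, and `rule_roots` are hypothetical here; in particular, the mapping from a rule to its list of roots is assumed to be available already.

```python
from collections import defaultdict


def pools_by_root(pool_rule, rule_roots):
    """Group pools under every root their rule draws from.

    pool_rule:  {pool_name: rule_id}
    rule_roots: {rule_id: [root_name, ...]}  (a rule may have several roots)
    """
    roots = defaultdict(list)
    for pool, rule in sorted(pool_rule.items()):
        # Loop over the rule's roots rather than assuming exactly one.
        for root in rule_roots.get(rule, []):
            roots[root].append(pool)
    return dict(roots)
```

A pool whose rule takes from two roots then appears under both, so the expected distribution for each root can account for it.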

ghost · 2017-08-17T10:08:04Z

src/pybind/mgr/balancer/module.py

            total_did += did
            left -= did
            if left <= 0:
                break
        self.log.info('prepared %d/%d changes' % (total_did, max_iterations))

-        incdump = inc.dump()
-        self.log.debug('resulting inc is %s' % incdump)
+    def do_crush_compat(self):


I don't get why compat needs a specific strategy. If the conditions are right (i.e., mostly one rule per pool), the crushmap can be converted to a pre-luminous compatible format. If not, there is no way to rebalance so that it can be exported, and manual intervention is required.

jcsp · 2017-08-28T14:11:25Z

The fix for the MonCommandCompletion crash is #17308 -- anyone testing this PR should pull that patch in too


* score lies in [0, 1), 0 being perfect distribution
* use shifted and scaled cdf of normal distribution to prioritize highly over-weighted device.
* consider only over-weighted devices to calculate score

Signed-off-by: Spandan Kumar Sahu <spandankumarsahu@gmail.com>
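
A minimal sketch of a score with the properties listed in this commit message. It is not the committed implementation: `balance_score` and the choice of relative excess as the input to the CDF are assumptions made for illustration. Both inputs are assumed to be per-OSD fractions that each sum to 1.0.

```python
import math


def balance_score(actual, target):
    """Score in [0, 1); 0 means the actual shares match the targets.

    actual, target: {osd: fraction of the total}, each summing to 1.0
    """
    score = 0.0
    for osd, want in target.items():
        have = actual.get(osd, 0.0)
        if want <= 0.0 or have <= want:
            continue  # only over-weighted devices contribute
        rel_excess = (have - want) / want
        # The normal CDF maps [0, inf) onto [0.5, 1). Shifting and scaling it,
        # 2 * Phi(x) - 1 == erf(x / sqrt(2)), maps the same range onto [0, 1),
        # so heavily over-weighted devices count for more.
        score += want * math.erf(rel_excess / math.sqrt(2.0))
    return score
```

Because at least one device must be under-weighted whenever another is over-weighted, the target fractions that contribute sum to less than 1, which keeps the result below 1.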


liewegas · 2017-09-06T21:54:22Z

@jcsp the 'upmap' mode is fully functional. I'm thinking it makes sense to merge this now in master, and later (when the more useful crush-compat mode is working) backport it to luminous.

I'm thinking it'd be good to get the mgr glue bits in sooner rather than later, though. But some of that probably needs some rework to align with the native object stuff you have in flight?

jcsp · 2017-09-07T13:18:38Z

@liewegas I'm completely happy for this to merge to master (luminous should wait, as you say), it makes my life easier with the other branches

liewegas added feature mgr labels Jul 11, 2017

liewegas requested review from jcsp and tchaikov July 11, 2017 20:29

tchaikov reviewed Jul 12, 2017


jcsp mentioned this pull request Jul 17, 2017

mgr: reweight analysis tool #16361

Closed

liewegas force-pushed the wip-balancer branch 2 times, most recently from 21c47e8 to a26ba1b Compare July 27, 2017 14:09

liewegas force-pushed the wip-balancer branch 2 times, most recently from 51bfe83 to 930ca8b Compare August 4, 2017 22:06

ghost reviewed Aug 17, 2017


ghost mentioned this pull request Aug 21, 2017

[WIP] replace the balancing python module for python-crush #17110

Closed

liewegas and others added 14 commits September 6, 2017 16:45

mgr/PyModules: add 'pg_status' dump

85b5b80

This is summary info, same as what's in 'ceph status'.

Signed-off-by: Sage Weil <sage@redhat.com>

mgr/PyModules: add 'pg_dump' get

bfb9286

Signed-off-by: Sage Weil <sage@redhat.com>

mgr: add trivial OSDMap wrapper class

2ef0051

Signed-off-by: Sage Weil <sage@redhat.com>

pybind/mgr/mgr_module: add default arg to get_config

39c42dd

Signed-off-by: Sage Weil <sage@redhat.com>

pybind/mgr/balancer: add balancer module

0d9685c

- wake up every minute
- back off when unknown, inactive, degraded
- throttle against misplaced ratio
- apply some optimization step
- initially implement 'upmap' only

Signed-off-by: Sage Weil <sage@redhat.com>

pybind/mgr/balancer: do upmap by pool, in random order

028a66d

Signed-off-by: Sage Weil <sage@redhat.com>

crush/CrushWrapper: refactor get_rule_weight_osd_map to work with roots too

69454e0

Allow us to specify a root node in the hierarchy instead of a rule. This way we can use it in conjunction with find_takes().

Signed-off-by: Sage Weil <sage@redhat.com>

crush/CrushWrapper: fix output arg for find_{takes,roots}()

60b9cfa

Signed-off-by: Sage Weil <sage@redhat.com>

crush/CrushWrapper: rule_has_take

ef140de

Signed-off-by: Sage Weil <sage@redhat.com>

mgr/PyOSDMap: get_crush, find_takes, get_take_weight_osd_map

3b8a276

These let us identify distinct CRUSH hierarchies that rules distribute data over, and create relative weight maps for the OSDs they map to.

Signed-off-by: Sage Weil <sage@redhat.com>

mgr/PyOSDMap: OSDMap.map_pool_pgs_up, CRUSHMap.get_item_name

a928bf6

Signed-off-by: Sage Weil <sage@redhat.com>

pybind/mgr/balancer: rough framework

d5e5c68

Signed-off-by: Sage Weil <sage@redhat.com>

pybind/mgr/balancer: make 'crush-compat' sort of work

7a00e02

Signed-off-by: Sage Weil <sage@redhat.com>

pybind/mgr/balancer: make auto mode work

ef1a3be

(with upmap at least)

Signed-off-by: Sage Weil <sage@redhat.com>

liewegas force-pushed the wip-balancer branch from 1ac2a87 to ef1a3be Compare September 6, 2017 20:47

liewegas changed the title ~~WIP: pybind/mgr/balancer: mgr module to automatically balance PGs across OSDs~~ mgr/balancer: mgr module to automatically balance PGs across OSDs Sep 6, 2017

liewegas added needs-qa wip-sage-testing labels Sep 7, 2017

liewegas merged commit 5b89608 into ceph:master Sep 8, 2017

liewegas commented Jul 11, 2017 • edited by liupan1111
