mgr/balancer: mgr module to automatically balance PGs across OSDs #16272
liewegas
commented
Jul 11, 2017
edited by liupan1111
- back off when degraded, unknown, inactive
- throttle against misplaced ratio
- upmap (luminous+)
- crush-legacy (compat with pre-luminous)
- crush (luminous+)
- osd_weights (legacy osd weight-based approach) (probably not worth doing this!)
- phase out balance optimizations from other modes (e.g., phase out osd_weight if we are optimizing crush weights)
There's enough here that this actually works! The real win is the 'crush-legacy' mode.
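The back-off and throttle items in the list above can be sketched as a single gate that runs before any optimization step. This is only an illustration: the function name `should_optimize`, the shape of `pg_summary`, and the `max_misplaced` default of 0.05 are assumptions, not taken from this PR.

```python
def should_optimize(pg_summary, misplaced_ratio, max_misplaced=0.05):
    """Decide whether a balancing step may run.

    pg_summary: hypothetical dict mapping a PG state name to a PG count.
    misplaced_ratio: fraction of objects currently misplaced (0.0 to 1.0).
    max_misplaced: assumed throttle threshold, chosen for illustration.
    """
    # Back off while any PG is in a state that signals trouble.
    for state in ("unknown", "inactive", "degraded"):
        if pg_summary.get(state, 0) > 0:
            return False, "backing off: some PGs are %s" % state
    # Throttle: do not pile more data movement on top of existing movement.
    if misplaced_ratio >= max_misplaced:
        return False, "throttling: misplaced ratio %.3f >= %.3f" % (
            misplaced_ratio, max_misplaced)
    return True, "ok to optimize"
```

A caller would check the first element of the returned tuple and log the second, for example `ok, reason = should_optimize({"degraded": 3}, 0.0)`.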
'prefix': 'osd rm-pg-upmap-items',
'format': 'json',
'pgid': pgid,
}), 'foo')
The tag ("foo") is used to identify the command upon completion; see MonCommandCompletion::finish() and PyModules::notify_all(). If we don't care about it, we can probably just pass an empty string here. But we could also pass something more meaningful, like "upmap", in case we want to check it in the future.
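To illustrate why a meaningful tag can be useful, here is a toy model of tag-based completion. `ToyMgr` and all of its methods are hypothetical stand-ins written for this example; they are not the real ceph-mgr interface, and the real completion path goes through the C++ functions named above.

```python
import json


class ToyMgr:
    """Hypothetical stand-in for the mgr command plumbing."""

    def __init__(self):
        self._pending = []
        self.finished_tags = []

    def send_command(self, command_json, tag):
        # Queue the command; the tag travels with it.
        self._pending.append((json.loads(command_json), tag))

    def run_pending(self):
        # Pretend every command succeeds, then report completion by tag.
        while self._pending:
            _command, tag = self._pending.pop(0)
            self.notify("command", tag)

    def notify(self, notify_type, notify_id):
        # An empty tag means the sender does not care about completion.
        if notify_type == "command" and notify_id:
            self.finished_tags.append(notify_id)


mgr = ToyMgr()
mgr.send_command(json.dumps({
    "prefix": "osd rm-pg-upmap-items", "format": "json", "pgid": "1.0",
}), "upmap")
mgr.send_command(json.dumps({
    "prefix": "osd rm-pg-upmap-items", "format": "json", "pgid": "1.1",
}), "")
mgr.run_pending()
```

In this toy model only the command tagged "upmap" is recorded on completion; the one sent with an empty tag is ignored.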
def __init__(self, handle):
    self._handle = handle

def get_epoch(self):
nit, could be
    @property
    def epoch(self):
        return ceph_osdmap.get_epoch(self._handle)
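A runnable sketch of the property-based variant follows. The `ceph_osdmap` object here is a fake stand-in for the native binding (an assumption made so the example is self-contained), and the dict used as a handle is likewise illustrative.

```python
class _FakeOsdmapBinding:
    """Hypothetical stand-in for the native ceph_osdmap binding."""

    @staticmethod
    def get_epoch(handle):
        return handle["epoch"]


ceph_osdmap = _FakeOsdmapBinding


class OSDMap:
    def __init__(self, handle):
        self._handle = handle

    @property
    def epoch(self):
        # Read-only attribute access instead of a get_epoch() call.
        return ceph_osdmap.get_epoch(self._handle)


osdmap = OSDMap({"epoch": 42})
current = osdmap.epoch  # no parentheses at the call site
```

Callers then write `osdmap.epoch` rather than `osdmap.get_epoch()`, and assigning to `osdmap.epoch` raises `AttributeError` because no setter is defined.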
21c47e8 to a26ba1b
51bfe83 to 930ca8b
src/pybind/mgr/balancer/module.py
Outdated
# adjust/normalize by weight
adjusted = float(v) / target[k] / float(num)
dev += (avg - adjusted) * (avg - adjusted)
stddev = math.sqrt(dev / float(max(num - 1, 1)))
Could we keep the actual deviation here? We have small numbers, and having them reflect the actual difference will make things easier to read and easier to use to reach the optimum.
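The difference between the two views can be sketched as follows. The first function mirrors the weight-adjusted computation in the snippet under review, under the assumptions that `target` maps each OSD to its expected share (shares summing to 1) and that `avg` is the total divided by the number of OSDs. The second function is one possible reading of "actual deviation": the raw count minus the expected count. Both function names are illustrative.

```python
import math


def weight_adjusted_stddev(count, target):
    """Standard deviation of weight-normalized counts, as in the snippet."""
    total = sum(count.values())
    num = max(len(target), 1)
    avg = total / num  # assumption: average is total / number of OSDs
    dev = 0.0
    for k, v in count.items():
        # adjust/normalize by weight
        adjusted = v / target[k] / num if target[k] else 0.0
        dev += (avg - adjusted) * (avg - adjusted)
    return math.sqrt(dev / max(num - 1, 1))


def actual_deviation(count, target):
    """Per-OSD difference between the observed and expected count."""
    total = sum(count.values())
    return {k: v - total * target[k] for k, v in count.items()}
```

With `count = {0: 30, 1: 30}` and `target = {0: 0.25, 1: 0.75}`, the second function reports that OSD 0 holds 15 more than its share and OSD 1 holds 15 fewer, which maps directly onto how many PGs would need to move.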
src/pybind/mgr/balancer/module.py
Outdated
self.log.debug('pools %s' % pools)
self.log.debug('pool_rule %s' % pool_rule)

# get expected distributions by root
Maybe this should support multiple roots per rule by looping over the rules instead of the roots?
src/pybind/mgr/balancer/module.py
Outdated
total_did += did
left -= did
if left <= 0:
    break
self.log.info('prepared %d/%d changes' % (total_did, max_iterations))

incdump = inc.dump()
self.log.debug('resulting inc is %s' % incdump)
def do_crush_compat(self):
I don't get why compat needs a specific strategy. If the conditions are right (i.e., mostly one rule per pool), the crushmap can be converted to a pre-luminous compatible format. If not, there is no way to rebalance so that it can be exported, and manual intervention is required.
The fix for the MonCommandCompletion crash is #17308 -- anyone testing this PR should pull that patch in too
This is summary info, same as what's in 'ceph status'. Signed-off-by: Sage Weil <sage@redhat.com>
Signed-off-by: Sage Weil <sage@redhat.com>
Signed-off-by: Sage Weil <sage@redhat.com>
Signed-off-by: Sage Weil <sage@redhat.com>
- wake up every minute
- back off when unknown, inactive, degraded
- throttle against misplaced ratio
- apply some optimization step
- initially implement 'upmap' only
Signed-off-by: Sage Weil <sage@redhat.com>
Signed-off-by: Sage Weil <sage@redhat.com>
…ts too Allow us to specify a root node in the hierarchy instead of a rule. This way we can use it in conjunction with find_takes(). Signed-off-by: Sage Weil <sage@redhat.com>
Signed-off-by: Sage Weil <sage@redhat.com>
Signed-off-by: Sage Weil <sage@redhat.com>
These let us identify distinct CRUSH hierarchies that rules distribute data over, and create relative weight maps for the OSDs they map to. Signed-off-by: Sage Weil <sage@redhat.com>
Signed-off-by: Sage Weil <sage@redhat.com>
Signed-off-by: Sage Weil <sage@redhat.com>
Signed-off-by: Sage Weil <sage@redhat.com>
* score lies in [0, 1), 0 being perfect distribution
* use shifted and scaled cdf of normal distribution to prioritize highly over-weighted device.
* consider only over-weighted devices to calculate score
Signed-off-by: Spandan Kumar Sahu <spandankumarsahu@gmail.com>
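One way to realize a score with these properties is sketched below. It relies on the identity 2Φ(x) − 1 = erf(x/√2), where Φ is the standard normal CDF, which maps x ≥ 0 into [0, 1). The function name, the choice of x as the relative excess over the average, and the weighted-mean aggregation are assumptions made for illustration and may differ from the commit itself.

```python
import math


def overweight_score(count, target):
    """Return a score that is 0 for a perfect distribution and
    approaches 1 as over-weighted devices get further from the average.

    count: dict of device id -> observed amount (e.g. PGs).
    target: dict of device id -> expected share, shares summing to 1.
    """
    total = sum(count.values())
    num = max(len(target), 1)
    avg = total / num
    if avg == 0:
        return 0.0
    score = 0.0
    sum_weight = 0.0
    for k, v in count.items():
        if not target.get(k):
            continue
        adjusted = v / target[k] / num
        # Only over-weighted devices contribute to the score.
        if adjusted > avg:
            x = (adjusted - avg) / avg
            # Shifted and scaled normal CDF: 2*Phi(x) - 1 == erf(x / sqrt(2)).
            score += target[k] * math.erf(x / math.sqrt(2.0))
            sum_weight += target[k]
    # Weighted mean over the over-weighted devices (illustrative choice).
    return score / sum_weight if sum_weight else 0.0
```

Because erf saturates quickly, a device far above its share pushes the score close to 1, while small excesses contribute little.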
(with upmap at least) Signed-off-by: Sage Weil <sage@redhat.com>
1ac2a87 to ef1a3be
@jcsp the 'upmap' mode is fully functional. I'm thinking it makes sense to merge this now in master, and later (when the more useful crush-compat mode is working) backport it to luminous. I'm thinking it'd be good to get the mgr glue bits in sooner rather than later, though. But some of that probably needs some rework to align with the native object stuff you have in flight?
@liewegas I'm completely happy for this to merge to master (luminous should wait, as you say), it makes my life easier with the other branches