pybind/mgr/balancer: fix pool-deletion vs auto-optimization race #20706
src/pybind/mgr/balancer/module.py
-        self.poolids = [p['pool'] for p in self.osdmap_dump.get('pools', [])]
+        osd_poolids = [p['pool'] for p in self.osdmap_dump.get('pools', [])]
+        pg_poolids = [p['poolid'] for p in pg_dump.get('pool_stats', [])]
+        self.poolids = [p for p in osd_poolids if p in pg_poolids]
Would be a bit neater (and perhaps more efficient) to do `set(osd_poolids) & set(pg_poolids)`.
This is still a work-in-progress and I have some other bugs to hunt 😂
Will fix in next version \o/
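For reference, the set-based variant suggested in the review could look roughly like the sketch below. The helper name `common_pool_ids` and the sample dumps are made up for illustration; only the `pools`/`pool` and `pool_stats`/`poolid` keys come from the diff above. Note that a set intersection does not preserve the OSDMap ordering that the list comprehension keeps, so this sketch sorts the result to stay deterministic.

```python
def common_pool_ids(osdmap_dump, pg_dump):
    """Return the sorted pool IDs that appear in both dumps.

    Hypothetical helper: the real module stores the result on self.poolids.
    """
    osd_poolids = {p['pool'] for p in osdmap_dump.get('pools', [])}
    pg_poolids = {p['poolid'] for p in pg_dump.get('pool_stats', [])}
    # Set intersection avoids the O(n*m) membership scan of the list version.
    return sorted(osd_poolids & pg_poolids)


# Made-up sample data: pool 5 exists in the OSDMap dump but not in the PG dump.
osdmap_dump = {'pools': [{'pool': 1}, {'pool': 5}, {'pool': 7}]}
pg_dump = {'pool_stats': [{'poolid': 1}, {'poolid': 7}]}
print(common_pool_ids(osdmap_dump, pg_dump))
```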
This patch fixes the error below:

```
  File "/usr/lib/ceph/mgr/balancer/module.py", line 722, in optimize
    return self.do_crush_compat(plan)
  File "/usr/lib/ceph/mgr/balancer/module.py", line 781, in do_crush_compat
    pe = self.calc_eval(ms, plan.pools)
  File "/usr/lib/ceph/mgr/balancer/module.py", line 570, in calc_eval
    objects_by_osd[osd] += ms.pg_stat[pgid]['num_objects']
KeyError: ('5.1b',)
```

The root cause is that the balancer collects cluster information from two separate maps (OSDMap and PGMap), so there is a small window in which the pool statistics can diverge. E.g.:

1) auto-optimization begins
2) get osdmap
3) a pool is gone (deleted by admin); pg_dump refreshed
4) get pg_dump (balancer now holds both the newest pg_dump and an obsolete osdmap)
5) execute optimization; balancer complains that some PGs are missing in the pg_dump map

Fix the above problem by tracking only pools that exist in both maps.

Signed-off-by: xie xingguo <xie.xingguo@zte.com.cn>
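The race described in the commit message can be reproduced in miniature. The sketch below is not the balancer's code: `objects_by_osd`, the dictionary shapes, and the sample values are all simplified assumptions chosen to mimic a stale OSDMap paired with a fresher PG dump.

```python
from collections import defaultdict


def objects_by_osd(pg_to_osds, pg_stat, pool_ids):
    """Sum num_objects per OSD for PGs whose pool is in pool_ids.

    Hypothetical, simplified stand-in for the accounting done in calc_eval.
    """
    totals = defaultdict(int)
    for pgid, osds in pg_to_osds.items():
        pool = int(pgid.split('.')[0])
        if pool not in pool_ids:
            continue
        for osd in osds:
            # Raises KeyError if pgid is known to the OSDMap but not the PG dump.
            totals[osd] += pg_stat[pgid]['num_objects']
    return dict(totals)


# Stale OSDMap view: pool 5 still has a PG mapped to OSDs.
stale_pg_to_osds = {'1.0': [0, 1], '5.1b': [1, 2]}
# Fresh PG dump view: pool 5 was deleted, so its PG stats are gone.
fresh_pg_stat = {'1.0': {'num_objects': 10}}
osd_pool_ids = {1, 5}
pg_pool_ids = {1}

# Trusting the OSDMap alone hits the missing PG.
try:
    objects_by_osd(stale_pg_to_osds, fresh_pg_stat, osd_pool_ids)
    raced = False
except KeyError:
    raced = True

# Restricting to pools present in both maps skips the deleted pool.
safe = objects_by_osd(stale_pg_to_osds, fresh_pg_stat, osd_pool_ids & pg_pool_ids)
print(raced, safe)
```

Filtering on the intersection narrows the set of pools the optimizer looks at for this round; the deleted pool simply drops out once the OSDMap catches up.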
@liewegas I hit another balancer issue. Mind taking a look at this one too? Thanks!
Thanks @tchaikov !