pybind/mgr/balancer: fix pool-deletion vs auto-optimization race #20706

Merged
merged 1 commit into from Mar 8, 2018

Member

@xiexingguo xiexingguo commented Mar 5, 2018

This patch fixes the error below:

File "/usr/lib/ceph/mgr/balancer/module.py", line 722, in optimize
  return self.do_crush_compat(plan)
File "/usr/lib/ceph/mgr/balancer/module.py", line 781, in do_crush_compat
  pe = self.calc_eval(ms, plan.pools)
File "/usr/lib/ceph/mgr/balancer/module.py", line 570, in calc_eval
  objects_by_osd[osd] += ms.pg_stat[pgid]['num_objects']
KeyError: ('5.1b',)

The root cause is that the balancer collects cluster
information from two separate maps (OSDMap and PGMap), so
there is a small window in which the pool statistics in the
two maps can diverge. E.g.:

  1. auto-optimization begins
  2. get osdmap
  3. a pool is gone (deleted by admin); pg_dump refreshed
  4. get pg_dump (the balancer now holds both the newest pg_dump
    and an obsolete osdmap)
  5. execute optimization; the balancer complains that some PGs are
    missing from the pg_dump map

Fix the above problem by tracking only pools that exist in both maps.

Signed-off-by: xie xingguo <xie.xingguo@zte.com.cn>
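
The race and the fix described above can be illustrated with a minimal sketch. The two snapshots below are hand-written and hypothetical; only the keys `pools`, `pool`, `pool_stats`, `poolid` and `num_objects` appear in this PR, while `pg_stats`, `pgid`, the `pg_stat` lookup table and the helper `count_objects` are assumptions made for illustration. This is not the balancer module's actual code.

```python
# Hypothetical snapshots; real dumps carry many more fields.
osdmap_dump = {"pools": [{"pool": 1}, {"pool": 5}]}  # stale: still lists pool 5
pg_dump = {
    "pool_stats": [{"poolid": 1}],                   # fresh: pool 5 is gone
    "pg_stats": [{"pgid": "1.0", "num_objects": 12}],
}

# Assumed lookup table keyed by PG id.
pg_stat = {p["pgid"]: p for p in pg_dump["pg_stats"]}


def count_objects(pgids):
    """Sum num_objects over the given PG ids; raises KeyError for a missing PG."""
    return sum(pg_stat[pgid]["num_objects"] for pgid in pgids)


# PG ids derived from the stale osdmap, so a PG of the deleted pool is included.
stale_pgids = ["1.0", "5.1b"]
try:
    count_objects(stale_pgids)
    raised = False
except KeyError:
    raised = True  # same failure mode as the traceback above

# Keep only pools present in both maps, then only PGs belonging to those pools.
osd_poolids = [p["pool"] for p in osdmap_dump.get("pools", [])]
pg_poolids = [p["poolid"] for p in pg_dump.get("pool_stats", [])]
poolids = [p for p in osd_poolids if p in pg_poolids]
safe_pgids = [pgid for pgid in stale_pgids if int(pgid.split(".")[0]) in poolids]
total = count_objects(safe_pgids)
```

With the filter in place, the PG from the deleted pool is never looked up, so the evaluation proceeds on the pools both maps agree on.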

@xiexingguo xiexingguo changed the title pybind/mgr/balancer: fix pool-deletion vs auto-optimization race wip: pybind/mgr/balancer: fix pool-deletion vs auto-optimization race Mar 5, 2018
@xiexingguo xiexingguo added the DNM label Mar 5, 2018
- self.poolids = [p['pool'] for p in self.osdmap_dump.get('pools', [])]
+ osd_poolids = [p['pool'] for p in self.osdmap_dump.get('pools', [])]
+ pg_poolids = [p['poolid'] for p in pg_dump.get('pool_stats', [])]
+ self.poolids = [p for p in osd_poolids if p in pg_poolids]
Contributor

It would be a bit neater (and perhaps more efficient) to use set(osd_poolids) & set(pg_poolids).
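
For comparison, a small sketch of the two approaches, using made-up pool ids. The variable names `by_list`, `by_set` and `ordered` are illustrative only. One difference worth keeping in mind is that a set does not preserve the order of the osdmap list, so the result needs sorting if order matters to the caller.

```python
osd_poolids = [1, 2, 5, 7]  # hypothetical pool ids from the osdmap
pg_poolids = [1, 2, 7, 9]   # hypothetical pool ids from pg_dump

# List comprehension: keeps osdmap order, but each `in` test scans pg_poolids.
by_list = [p for p in osd_poolids if p in pg_poolids]

# Set intersection: hash-based membership, result is unordered.
by_set = set(osd_poolids) & set(pg_poolids)

# Sort if a deterministic order is needed downstream.
ordered = sorted(by_set)
```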

Member Author

This is still a work-in-progress and I have some other bugs to hunt 😂
Will fix in next version \o/

@xiexingguo xiexingguo removed the DNM label Mar 6, 2018
@xiexingguo xiexingguo changed the title wip: pybind/mgr/balancer: fix pool-deletion vs auto-optimization race pybind/mgr/balancer: fix pool-deletion vs auto-optimization race Mar 6, 2018
@xiexingguo xiexingguo requested a review from liewegas March 6, 2018 05:30
@xiexingguo
Member Author

@liewegas I hit another balancer issue. Mind taking a look at this one too? Thanks!

@tchaikov
Contributor

tchaikov commented Mar 8, 2018

http://pulpito.ceph.com/kchai-2018-03-08_12:59:37-rados-wip-kefu-testing-2018-03-08-1932-distro-basic-smithi/

@tchaikov tchaikov merged commit ecc64b0 into ceph:master Mar 8, 2018
@xiexingguo
Member Author

Thanks @tchaikov !

@xiexingguo xiexingguo deleted the wip-balancer-03 branch March 9, 2018 00:35
4 participants