pybind/mgr/balancer: fix pool-deletion vs auto-optimization race #20706

Merged: tchaikov merged 1 commit into ceph:master from xiexingguo:wip-balancer-03 on Mar 8, 2018

@xiexingguo (Member) commented Mar 5, 2018

This patch fixes the error below:

File "/usr/lib/ceph/mgr/balancer/module.py", line 722, in optimize
  return self.do_crush_compat(plan)
File "/usr/lib/ceph/mgr/balancer/module.py", line 781, in do_crush_compat
  pe = self.calc_eval(ms, plan.pools)
File "/usr/lib/ceph/mgr/balancer/module.py", line 570, in calc_eval
  objects_by_osd[osd] += ms.pg_stat[pgid]['num_objects']
KeyError: ('5.1b',)

The root cause is that the balancer collects cluster information
from two separate maps (OSDMap and PGMap), so there is a small
window in which the pool statistics in the two can diverge. E.g.:

  1. auto-optimization begins
  2. get osdmap
  3. a pool is gone (deleted by admin); pg_dump refreshed
  4. get pg_dump (balancer now holds both the newest pg_dump
    and an obsolete osdmap)
  5. execute optimization; balancer complains that some PGs are
    missing from the pg_dump map.

Fix the above problem by tracking only pools that exist in both maps.

Signed-off-by: xie xingguo <xie.xingguo@zte.com.cn>
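
The sequence above can be reproduced in miniature. This is only a sketch under stated assumptions: the dict shapes are simplified stand-ins for the real dumps, and `total_objects`, `pgids` and `pg_stat` are hypothetical names for illustration, not the module's actual code. It shows the lookup failing when pool IDs come from the stale OSDMap alone, and succeeding once only pools present in both maps are tracked.

```python
# Hypothetical, simplified stand-ins for the two dumps; the real structures
# carry many more fields.
osdmap_dump = {"pools": [{"pool": 1}, {"pool": 5}]}  # fetched before the deletion
pg_dump = {"pool_stats": [{"poolid": 1}]}            # fetched after pool 5 is gone
pg_stat = {"1.0": {"num_objects": 10}, "1.1": {"num_objects": 7}}

# PGs as the stale OSDMap still describes them (pool 5 included).
pgids = ["1.0", "1.1", "5.1b"]


def total_objects(pgids, pg_stat, poolids):
    """Sum num_objects over the PGs whose pool is in poolids."""
    total = 0
    for pgid in pgids:
        pool = int(pgid.split(".")[0])  # a PG id has the form "<pool>.<hex seed>"
        if pool in poolids:
            total += pg_stat[pgid]["num_objects"]
    return total


# Before the fix: pool ids come from the OSDMap alone.
stale_poolids = [p["pool"] for p in osdmap_dump["pools"]]
try:
    total_objects(pgids, pg_stat, stale_poolids)
    missing = None
except KeyError as err:
    missing = err.args[0]  # the PG that the newer pg_dump no longer has

# After the fix: only pools present in both maps are tracked.
pg_poolids = [p["poolid"] for p in pg_dump["pool_stats"]]
common_poolids = [p for p in stale_poolids if p in pg_poolids]
total = total_objects(pgids, pg_stat, common_poolids)
```

In this toy setup the first call trips over PG `5.1b`, while the filtered call simply skips the deleted pool.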

@xiexingguo xiexingguo force-pushed the xiexingguo:wip-balancer-03 branch from eefd365 to 861cbee Mar 5, 2018

@xiexingguo xiexingguo changed the title pybind/mgr/balancer: fix pool-deletion vs auto-optimization race wip: pybind/mgr/balancer: fix pool-deletion vs auto-optimization race Mar 5, 2018

@xiexingguo xiexingguo added the DNM label Mar 5, 2018

- self.poolids = [p['pool'] for p in self.osdmap_dump.get('pools', [])]
+ osd_poolids = [p['pool'] for p in self.osdmap_dump.get('pools', [])]
+ pg_poolids = [p['poolid'] for p in pg_dump.get('pool_stats', [])]
+ self.poolids = [p for p in osd_poolids if p in pg_poolids]

@jcsp (Contributor) commented Mar 5, 2018

Would be a bit neater (and perhaps more efficient) to do a set(osd_poolids) & set(pg_poolids)
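
For comparison, a small sketch of both forms on made-up pool IDs (the values are assumptions for illustration only). One difference worth keeping in mind: the list comprehension preserves the order of `osd_poolids`, while a set intersection has no defined order, so the sketch sorts the result. Whether the module relies on that order is not stated in this thread.

```python
# Made-up pool ids, for illustration only.
osd_poolids = [1, 2, 5, 7]   # from the (possibly stale) OSDMap
pg_poolids = [7, 1, 2]       # from the newer pg_dump

# Form used in the patch: one linear scan of pg_poolids per OSDMap pool,
# result keeps the order of osd_poolids.
by_list = [p for p in osd_poolids if p in pg_poolids]

# Suggested form: hash-based intersection; sorted() restores a stable order.
by_set = sorted(set(osd_poolids) & set(pg_poolids))
```

Both forms drop pool 5, which exists only in the OSDMap.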

@xiexingguo (Author, Member) commented Mar 6, 2018

This is still a work-in-progress and I have some other bugs to hunt 😂
Will fix in next version \o/


@xiexingguo xiexingguo force-pushed the xiexingguo:wip-balancer-03 branch from 861cbee to a57b803 Mar 6, 2018

@xiexingguo xiexingguo removed the DNM label Mar 6, 2018

@xiexingguo xiexingguo changed the title wip: pybind/mgr/balancer: fix pool-deletion vs auto-optimization race pybind/mgr/balancer: fix pool-deletion vs auto-optimization race Mar 6, 2018

@xiexingguo xiexingguo requested a review from liewegas Mar 6, 2018

@xiexingguo (Member, Author) commented Mar 7, 2018

@liewegas I hit another balancer issue. Mind taking a look at this one too? Thanks!

@tchaikov (Contributor) commented Mar 8, 2018

http://pulpito.ceph.com/kchai-2018-03-08_12:59:37-rados-wip-kefu-testing-2018-03-08-1932-distro-basic-smithi/

  • the swift failure was caused by #20419
  • I reran osd-pool-create.sh multiple times locally and was not able to reproduce that failure.

@tchaikov tchaikov merged commit ecc64b0 into ceph:master Mar 8, 2018

5 checks passed

  • Docs: build check: OK - docs built
  • Signed-off-by: all commits in this PR are signed
  • Unmodified Submodules: submodules for project are unmodified
  • make check: make check succeeded
  • make check (arm64): make check succeeded

@xiexingguo (Member, Author) commented Mar 9, 2018

Thanks @tchaikov !

@xiexingguo xiexingguo deleted the xiexingguo:wip-balancer-03 branch Mar 9, 2018
