luminous: mon, osd: fix potential collided *Up Set* after PG remapping #20829

Merged: 3 commits merged into ceph:luminous from xiexingguo:backport-pr-20653 on Apr 11, 2018

Conversation

@xiexingguo (Member) commented Mar 10, 2018

The mgr balancer module basically performs optimizations based on
snapshots of the OSDMap taken at certain moments, which turns out to be
the culprit of data loss, since in upmap mode it can sometimes produce
bad PG mapping results.
For example:
1) original cluster topology:

-5       2.00000     host host-a
 0   ssd 1.00000         osd.0       up  1.00000 1.00000
 1   ssd 1.00000         osd.1       up  1.00000 1.00000
-7       2.00000     host host-b
 2   ssd 1.00000         osd.2       up  1.00000 1.00000
 3   ssd 1.00000         osd.3       up  1.00000 1.00000
-9       2.00000     host host-c
 4   ssd 1.00000         osd.4       up  1.00000 1.00000
 5   ssd 1.00000         osd.5       up  1.00000 1.00000

2) the mgr balancer applies an optimization for PG 3.f:

            pg-upmap-items[3.f : 1->4]
3.f [1 3] + -------------------------> [4 3]

3) osd.3 is out/reweighted etc., so the original crush mapping of 3.f changes
   (while pg-upmap-items does not):

            pg-upmap-items[3.f : 1->4]
3.f [1 5] + -------------------------> [4 5]

4) we are now mapping PG 3.f to two OSDs (osd.4 & osd.5) on the same host
   (host-c); see the sketch below.
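
To make the failure mode concrete, here is a minimal, self-contained sketch of how a stale pg-upmap-items entry rewrites a freshly computed CRUSH up set. It is an illustration only, not the OSDMap::_apply_upmap implementation; the OSD ids and the 1->4 pair come from the example above, everything else is assumed.

```cpp
// Minimal sketch (not the Ceph implementation): a pg-upmap-items entry
// is a list of [from, to] replacement pairs applied to the up set that
// CRUSH just computed.
#include <iostream>
#include <utility>
#include <vector>

using osd_id_t = int;

std::vector<osd_id_t> apply_upmap_items(
    std::vector<osd_id_t> up,
    const std::vector<std::pair<osd_id_t, osd_id_t>>& items) {
  for (const auto& [from, to] : items) {
    for (auto& osd : up) {
      if (osd == from)
        osd = to;  // substitute, keeping the position in the up set
    }
  }
  return up;
}

int main() {
  // pg-upmap-items[3.f : 1->4], recorded when CRUSH mapped 3.f to [1, 3].
  std::vector<std::pair<osd_id_t, osd_id_t>> items = {{1, 4}};

  // After osd.3 is out/reweighted, CRUSH maps 3.f to [1, 5] instead.
  std::vector<osd_id_t> crush_up = {1, 5};

  // The stale entry still fires and yields [4, 5]: both OSDs live on
  // host-c, so the failure-domain guarantee is silently broken.
  for (auto osd : apply_upmap_items(crush_up, items))
    std::cout << "osd." << osd << " ";
  std::cout << "\n";  // prints: osd.4 osd.5
}
```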

Fix the above problem by putting a guard procedure in place before these
*unsafe* upmap remappings are finally encoded into the OSDMap.
If any of them turns out to be inappropriate, we can simply cancel it,
since the balancer can re-calculate and re-generate it later if necessary.
A sketch of the check follows.
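
The guard boils down to a duplicate-parent check over the post-upmap up set: if two OSDs in the resulting up set resolve to the same failure-domain parent, the pending remapping is dropped. The sketch below shows only that idea under simplified assumptions; parent_of() is a hypothetical stand-in for the CRUSH ancestor lookup, not the actual maybe_remove_pg_upmaps code.

```cpp
// Hedged sketch of the guard idea (not the actual maybe_remove_pg_upmaps
// code). parent_of() stands in for resolving an OSD's failure-domain
// ancestor (e.g. its host bucket) via CRUSH.
#include <cassert>
#include <functional>
#include <set>
#include <vector>

using osd_id_t = int;

// True if the remapping should be cancelled because two OSDs in the
// resulting up set share the same failure-domain parent.
bool up_set_collides(const std::vector<osd_id_t>& up,
                     const std::function<int(osd_id_t)>& parent_of) {
  std::set<int> parents;
  for (auto osd : up) {
    if (!parents.insert(parent_of(osd)).second)
      return true;  // same parent seen twice -> collided up set
  }
  return false;
}

int main() {
  // Host bucket ids taken from the example topology:
  // host-a = -5, host-b = -7, host-c = -9.
  auto parent_of = [](osd_id_t osd) {
    if (osd <= 1) return -5;
    if (osd <= 3) return -7;
    return -9;
  };
  assert(up_set_collides({4, 5}, parent_of));   // both on host-c: cancel
  assert(!up_set_collides({4, 3}, parent_of));  // distinct hosts: keep
  return 0;
}
```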

Fixes: http://tracker.ceph.com/issues/23118
Signed-off-by: xie xingguo <xie.xingguo@zte.com.cn>
(cherry picked from commit 5fa467e)

Conflicts:
	src/crush/CrushWrapper.cc

drop the **const** qualifier of CrushWrapper::_get_leaves from master
For testing purposes, we may create pools whose failure domain is the OSD.
In this case get_parent_of_type() will always return 0, all parents of
these PGs' *up sets* are considered collided, and hence the relevant
pg_upmap/pg_upmap_items entries are incorrectly removed during the
maybe_remove_pg_upmaps() procedure.

Fix the above problem by skipping these PGs, and also be more specific
when we are unable to load the parent of a specified PG (see the sketch
below).

Signed-off-by: xie xingguo <xie.xingguo@zte.com.cn>
(cherry picked from commit f9a35aa)
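
For illustration, here is a hedged sketch of how that skip could look. The UpmapVerdict enum, check_up_set() and the optional-returning parent lookup are assumptions made for this example, not the CrushWrapper/OSDMap API.

```cpp
// Sketch of the extra guard behaviour (assumed semantics, not the Ceph
// source): when the failure domain is the OSD itself, the parent-of-type
// lookup yields 0 for every OSD, so every pair would look collided.
// Skip the check in that case, and report a specific error when a parent
// genuinely cannot be resolved.
#include <functional>
#include <iostream>
#include <optional>
#include <set>
#include <vector>

using osd_id_t = int;

enum class UpmapVerdict { keep, cancel, skip_check };

UpmapVerdict check_up_set(
    const std::vector<osd_id_t>& up,
    // Hypothetical lookup: the failure-domain parent bucket id, 0 meaning
    // "no parent of that type" (failure domain is the OSD), std::nullopt
    // meaning the lookup itself failed.
    const std::function<std::optional<int>(osd_id_t)>& parent_of) {
  std::set<int> parents;
  for (auto osd : up) {
    auto parent = parent_of(osd);
    if (!parent) {
      std::cerr << "unable to load parent of osd." << osd << "\n";
      return UpmapVerdict::skip_check;  // be specific, don't cancel blindly
    }
    if (*parent == 0)
      return UpmapVerdict::skip_check;  // failure domain is the OSD itself
    if (!parents.insert(*parent).second)
      return UpmapVerdict::cancel;      // two OSDs under the same parent
  }
  return UpmapVerdict::keep;
}

int main() {
  // Failure domain "osd": the lookup returns 0 for every OSD, so with the
  // fix the PG is skipped instead of having its upmap entries removed.
  auto osd_domain = [](osd_id_t) { return std::optional<int>(0); };
  bool skipped = check_up_set({4, 5}, osd_domain) == UpmapVerdict::skip_check;
  std::cout << (skipped ? "skipped" : "checked") << "\n";
}
```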
@xiexingguo added this to the luminous milestone Mar 10, 2018
In quite a lot of mgr/balancer related tests I can observe the following
logged error:
```
2018-03-03 08:14:00.359946 7fdda18cc700 -1 maybe_remove_pg_upmaps unable to load crush-rule of pg 5.b
```
which turns out to be a *pool-deletion* vs *balancer-auto-injected-upmap-changes*
race issue.
The root cause is that we don't properly clean up the *pending*
pg_upmap/pg_upmap_items changes injected by the mgr/balancer when the
corresponding pool is gone, and hence the above problem can be fixed by
(see the sketch below):
1. clean up any pending upmap changes too if the corresponding pool is gone
2. re-check pending pool removal queue before we can safely apply any new upmap changes

Signed-off-by: xie xingguo <xie.xingguo@zte.com.cn>
(cherry picked from commit bd828fc)
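
A minimal sketch of the two-part fix, assuming simplified stand-in structures; PendingUpmaps, pg_id_t and the method names are inventions for this example, not the monitor's real pending-change bookkeeping.

```cpp
// Sketch only: pending balancer-injected upmap changes keyed by PG are
// dropped when their pool is queued for deletion, and new changes are
// refused for pools already pending removal.
#include <cstdint>
#include <iostream>
#include <map>
#include <set>
#include <tuple>
#include <utility>
#include <vector>

struct pg_id_t {
  int64_t pool;
  uint32_t seed;
  bool operator<(const pg_id_t& o) const {
    return std::tie(pool, seed) < std::tie(o.pool, o.seed);
  }
};

struct PendingUpmaps {
  std::map<pg_id_t, std::vector<std::pair<int, int>>> new_pg_upmap_items;
  std::set<int64_t> pending_pool_removals;

  // 1. When a pool is queued for deletion, drop any pending upmap
  //    changes that reference it.
  void queue_pool_removal(int64_t pool) {
    pending_pool_removals.insert(pool);
    for (auto it = new_pg_upmap_items.begin();
         it != new_pg_upmap_items.end();) {
      if (it->first.pool == pool)
        it = new_pg_upmap_items.erase(it);
      else
        ++it;
    }
  }

  // 2. Re-check the pending removal queue before accepting a new
  //    balancer-injected upmap change.
  bool try_add_upmap(const pg_id_t& pg,
                     std::vector<std::pair<int, int>> items) {
    if (pending_pool_removals.count(pg.pool))
      return false;  // pool is going away; refuse the change
    new_pg_upmap_items[pg] = std::move(items);
    return true;
  }
};

int main() {
  PendingUpmaps pending;
  pending.try_add_upmap({5, 0xb}, {{1, 4}});  // accepted while pool 5 exists
  pending.queue_pool_removal(5);              // pool 5 is being deleted
  // A late balancer injection for pool 5 is now refused (prints 0).
  std::cout << pending.try_add_upmap({5, 0xc}, {{2, 3}}) << "\n";
}
```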
@theanalyst (Member)
@xiexingguo can you add the ticket number and the original PR this backport is from?

@xiexingguo (Member, Author)
@theanalyst The first commit includes the original ticket number.

@yuriw (Contributor) commented Apr 6, 2018

@liewegas (Member) commented Apr 9, 2018

does not look good:

2018-04-06T23:06:35.044 INFO:tasks.ceph.mgr.x.smithi118.stderr:*** Caught signal (Segmentation fault) **
2018-04-06T23:06:35.044 INFO:tasks.ceph.mgr.x.smithi118.stderr: in thread 7f84692bd700 thread_name:balancer
2018-04-06T23:06:35.045 INFO:tasks.ceph.mgr.x.smithi118.stderr: ceph version 12.2.4-372-g3ec4869 (3ec4869d8a1d6ef06aa5698b52347ce41aa0d068) luminous (stable)
2018-04-06T23:06:35.045 INFO:tasks.ceph.mgr.x.smithi118.stderr: 1: (()+0x40c194) [0x557093897194]
2018-04-06T23:06:35.045 INFO:tasks.ceph.mgr.x.smithi118.stderr: 2: (()+0x11390) [0x7f8478180390]
2018-04-06T23:06:35.045 INFO:tasks.ceph.mgr.x.smithi118.stderr: 3: (OSDMap::_apply_upmap(pg_pool_t const&, pg_t, std::vector >*) const+0x1af) [0x5570939c551f]
2018-04-06T23:06:35.045 INFO:tasks.ceph.mgr.x.smithi118.stderr: 4: (OSDMap::_pg_to_up_acting_osds(pg_t const&, std::vector >*, int*, std::vector >*, int*, bool) const+0x257) [0x5570939c9f97]
2018-04-06T23:06:35.045 INFO:tasks.ceph.mgr.x.smithi118.stderr: 5: (OSDMap::calc_pg_upmaps(CephContext*, float, int, std::set, std::allocator > const&, OSDMap::Incremental*)+0x367) [0x5570939db527]
2018-04-06T23:06:35.045 INFO:tasks.ceph.mgr.x.smithi118.stderr: 6: (()+0x2efef8) [0x55709377aef8]
2018-04-06T23:06:35.045 INFO:tasks.ceph.mgr.x.smithi118.stderr: 7: (PyEval_EvalFrameEx()+0x8a51) [0x7f847925c971]
2018-04-06T23:06:35.046 INFO:tasks.ceph.mgr.x.smithi118.stderr: 8: (PyEval_EvalCodeEx()+0x85c) [0x7f847939205c]
2018-04-06T23:06:35.046 INFO:tasks.ceph.mgr.x.smithi118.stderr: 9: (PyEval_EvalFrameEx()+0x6ffd) [0x7f847925af1d]
2018-04-06T23:06:35.046 INFO:tasks.ceph.mgr.x.smithi118.stderr: 10: (PyEval_EvalFrameEx()+0x7124) [0x7f847925b044]
2018-04-06T23:06:35.046 INFO:tasks.ceph.mgr.x.smithi118.stderr: 11: (PyEval_EvalFrameEx()+0x7124) [0x7f847925b044]
2018-04-06T23:06:35.046 INFO:tasks.ceph.mgr.x.smithi118.stderr: 12: (PyEval_EvalCodeEx()+0x85c) [0x7f847939205c]
2018-04-06T23:06:35.046 INFO:tasks.ceph.mgr.x.smithi118.stderr: 13: (()+0x13e370) [0x7f84792e8370]
2018-04-06T23:06:35.046 INFO:tasks.ceph.mgr.x.smithi118.stderr: 14: (PyObject_Call()+0x43) [0x7f84792bb273]
2018-04-06T23:06:35.046 INFO:tasks.ceph.mgr.x.smithi118.stderr: 15: (()+0x1853ac) [0x7f847932f3ac]
2018-04-06T23:06:35.046 INFO:tasks.ceph.mgr.x.smithi118.stderr: 16: (PyObject_Call()+0x43) [0x7f84792bb273]
2018-04-06T23:06:35.046 INFO:tasks.ceph.mgr.x.smithi118.stderr: 17: (PyObject_CallMethod()+0xf4) [0x7f84792bc444]
2018-04-06T23:06:35.047 INFO:tasks.ceph.mgr.x.smithi118.stderr: 18: (PyModuleRunner::serve()+0x5c) [0x5570937780dc]
2018-04-06T23:06:35.047 INFO:tasks.ceph.mgr.x.smithi118.stderr: 19: (PyModuleRunner::PyModuleRunnerThread::entry()+0x1b8) [0x5570937788e8]
2018-04-06T23:06:35.047 INFO:tasks.ceph.mgr.x.smithi118.stderr: 20: (()+0x76ba) [0x7f84781766ba]
2018-04-06T23:06:35.047 INFO:tasks.ceph.mgr.x.smithi118.stderr: 21: (clone()+0x6d) [0x7f84771e241d]
2018-04-06T23:06:35.047 INFO:tasks.ceph.mgr.x.smithi118.stderr:2018-04-06 23:06:35.044306 7f84692bd700 -1 *** Caught signal (Segmentation fault) **

@xiexingguo do you mind looking?
/a/yuriw-2018-04-06_21:35:03-rados-wip-yuri2-testing-2018-04-06-1946-luminous-distro-basic-smithi/2365151

@xiexingguo (Member, Author)
> @xiexingguo do you mind looking?

Sure, I'm on it.

@xiexingguo (Member, Author) commented Apr 10, 2018

@liewegas I believe the above test failure is not related to this change or #20840.
See #21325

@yuriw (Contributor) commented Apr 10, 2018

@yuriw (Contributor) left a comment

👍

Reviewed-by: Sage Weil <sage@redhat.com>

@yuriw merged commit 719e13c into ceph:luminous on Apr 11, 2018
@xiexingguo deleted the backport-pr-20653 branch Apr 12, 2018 00:42