osd/PG: restart recovery if NotRecovering and unfound found #18974

liewegas · 2017-11-16T21:05:47Z

If we are in recovery_unfound state waiting for unfound objects, and we
find them, we need to restart the recovery reservation process so that we
can recover. Do this by queueing a DoRecovery() event instead of calling
queue_recovery() (which won't do anything since we're not in the
recovering|backfilling pg states).

Make the parent Active state ignore DoRecovery so that if we are already
in some phase of recovery/backfill the event gets ignored. It is already
handled by the other important substates that care, like Clean (for
repair's benefit).

I'm not sure why states like Activating are paying attention to this event...

Fixes: http://tracker.ceph.com/issues/22145
Signed-off-by: Sage Weil <sage@redhat.com>
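
The event routing described above can be sketched roughly as follows. This is a hand-rolled illustration, not boost::statechart and not the code in src/osd/PG.cc. The names on_do_recovery, handles_do_recovery, active_discards and check_sketch are hypothetical, and the assumption that NotRecovering and Clean both move to WaitLocalRecoveryReserved on DoRecovery is made only for the sake of the example. The Crashed outcome mirrors the backtrace later in this thread, where an unhandled event travels from WaitRemoteRecoveryReserved through Active and Primary.

```cpp
// Hand-rolled illustration only: this is neither boost::statechart nor the
// actual Ceph code. It models how a DoRecovery event is resolved for a PG
// whose current state is one of the substates of Active.
#include <cassert>

enum class State {
  Activating,
  Clean,
  NotRecovering,
  WaitLocalRecoveryReserved,
  WaitRemoteRecoveryReserved,
  Crashed,
};

// Substates that react to DoRecovery themselves (assumed for this sketch).
bool handles_do_recovery(State s) {
  return s == State::NotRecovering || s == State::Clean;
}

// Returns the state after DoRecovery is delivered to `current`.
// `active_discards` models whether the parent Active state discards the
// event; if it does not, the event keeps bubbling up and ends in Crashed.
State on_do_recovery(State current, bool active_discards) {
  if (handles_do_recovery(current)) {
    // The substate handles the event: restart the reservation process.
    return State::WaitLocalRecoveryReserved;
  }
  if (active_discards) {
    // The parent swallows the event; the substate is unchanged.
    return current;
  }
  // Nobody handled the event.
  return State::Crashed;
}

// Small self-check covering the three cases discussed in the description.
void check_sketch() {
  assert(on_do_recovery(State::NotRecovering, true) ==
         State::WaitLocalRecoveryReserved);
  assert(on_do_recovery(State::WaitRemoteRecoveryReserved, true) ==
         State::WaitRemoteRecoveryReserved);
  assert(on_do_recovery(State::WaitRemoteRecoveryReserved, false) ==
         State::Crashed);
}
```

Calling check_sketch() exercises the three cases: a PG in NotRecovering restarts reservation, a PG already in a recovery phase is left alone when the parent discards the event, and the same PG ends in Crashed when nothing handles it.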

badone

Besides the rogue tab LGTM

badone · 2017-11-16T22:14:18Z

qa/suites/rados/singleton-nomsgr/all/recovery-unfound-found.yaml

+    conf:
+      osd:
+        osd recovery sleep: .1
+	osd objectstore: filestore


Did a tab slip in here?

yep, fixed!

xiexingguo

lgtm also

xiexingguo · 2017-11-17T00:39:54Z

I'm not sure why states like Activating are paying attention to this event...

+1

tchaikov · 2017-11-17T09:53:50Z

qa/suites/rados/upgrade/jewel-x-singleton/o

@@ -0,0 +1,6 @@
+[HANDLER_OUTPUT] 


@liewegas teuthology-suite panics when it sees this file: its filename does not end with .yaml, and the directory would be empty without it, so the test matrix built for this directory ends up empty:

Traceback (most recent call last):
  File "/home/kchai/teuthology/virtualenv/bin/teuthology-suite", line 11, in <module>
    load_entry_point('teuthology', 'console_scripts', 'teuthology-suite')()
  File "/home/kchai/teuthology/scripts/suite.py", line 137, in main
    return teuthology.suite.main(args)
  File "/home/kchai/teuthology/teuthology/suite/__init__.py", line 88, in main
    run.prepare_and_schedule()
  File "/home/kchai/teuthology/teuthology/suite/run.py", line 309, in prepare_and_schedule
    num_jobs = self.schedule_suite()
  File "/home/kchai/teuthology/teuthology/suite/run.py", line 476, in schedule_suite
    build_matrix(suite_path, subset=self.args.subset)
  File "/home/kchai/teuthology/teuthology/suite/build_matrix.py", line 50, in build_matrix
    mat, first, matlimit = _get_matrix(path, subset)
  File "/home/kchai/teuthology/teuthology/suite/build_matrix.py", line 60, in _get_matrix
    mat = _build_matrix(path, mincyclicity=outof)
  File "/home/kchai/teuthology/teuthology/suite/build_matrix.py", line 122, in _build_matrix
    fn)
  File "/home/kchai/teuthology/teuthology/suite/build_matrix.py", line 122, in _build_matrix
    fn)
  File "/home/kchai/teuthology/teuthology/suite/build_matrix.py", line 131, in _build_matrix
    return matrix.Sum(item, submats)
  File "/home/kchai/teuthology/teuthology/suite/matrix.py", line 225, in __init__
    "Sum requires non-empty _submats"
AssertionError: Sum requires non-empty _submats

tchaikov

please remove qa/suites/rados/upgrade/jewel-x-singleton/o

comments addressed.

tchaikov · 2017-11-18T02:58:02Z

http://pulpito.ceph.com/kchai-2017-11-17_09:57:45-rados-wip-kefu-testing-2017-11-17-1613-distro-basic-smithi/

quite a few tests failed, with backtrace like

ceph version 13.0.0-3301-gc1e33f9 (c1e33f94784096fb4fd6761b8ff933c732ffaf3a) mimic (dev)
 1: (()+0xafc1a4) [0x55fd474e51a4]
 2: (()+0x11390) [0x7fb3f1635390]
 3: (gsignal()+0x38) [0x7fb3f05d0428]
 4: (abort()+0x16a) [0x7fb3f05d202a]
 5: (ceph::__ceph_assert_fail(char const*, char const*, int, char const*)+0x28e) [0x55fd475283fe]
 6: (PG::RecoveryState::Crashed::Crashed(boost::statechart::state<PG::RecoveryState::Crashed, PG::RecoveryState::RecoveryMachine, boost::mpl::list<mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na>, (boost::statechart::history_mode)0>::my_context)+0x135) [0x55fd46fba1e5]
 7: (()+0x612e66) [0x55fd46ffbe66]
 8: (boost::statechart::simple_state<PG::RecoveryState::Primary, PG::RecoveryState::Started, PG::RecoveryState::Peering, (boost::statechart::history_mode)0>::react_impl(boost::statechart::event_base const&, void const*)+0x16e) [0x55fd470383be]
 9: (boost::statechart::simple_state<PG::RecoveryState::Active, PG::RecoveryState::Primary, PG::RecoveryState::Activating, (boost::statechart::history_mode)0>::react_impl(boost::statechart::event_base const&, void const*)+0x1c1) [0x55fd470363d1]
 10: (boost::statechart::simple_state<PG::RecoveryState::WaitRemoteRecoveryReserved, PG::RecoveryState::Active, boost::mpl::list<mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na>, (boost::statechart::history_mode)0>::react_impl(boost::statechart::event_base const&, void const*)+0x73) [0x55fd47035323]
 11: (boost::statechart::state_machine<PG::RecoveryState::RecoveryMachine, PG::RecoveryState::Initial, std::allocator<void>, boost::statechart::null_exception_translator>::process_event(boost::statechart::event_base const&)+0x69) [0x55fd4700f579]
 12: (PG::process_peering_event(PG::RecoveryCtx*)+0x455) [0x55fd46ff6255]
 13: (OSD::process_peering_events(std::__cxx11::list<PG*, std::allocator<PG*> > const&, ThreadPool::TPHandle&)+0x48d) [0x55fd46f264bd]
 14: (ThreadPool::BatchWorkQueue<PG>::_void_process(void*, ThreadPool::TPHandle&)+0x27) [0x55fd46f8aa47]
 15: (ThreadPool::worker(ThreadPool::WorkThread*)+0xdb9) [0x55fd4752f179]
 16: (ThreadPool::WorkThread::entry()+0x10) [0x55fd47530280]

for example, see /a/kchai-2017-11-17_09:57:45-rados-wip-kefu-testing-2017-11-17-1613-distro-basic-smithi/1857122

i am removing this change from my test batch.

xiexingguo · 2017-11-18T03:31:28Z

src/osd/PG.cc

-      got_missing)
-    pg->queue_recovery();
+  if (got_missing) {
+    post_event(DoRecovery());


Seems we still need some sanity checking here; there are multiple substates of Active that cannot handle the DoRecovery() event properly...

dzafman · 2018-01-15

Yes, see commit 64047e1

This issue came up in testing pull request #19850 which should merge soon.

tchaikov

i am pretty sure it's buggy.

liewegas · 2017-11-20T03:33:28Z

I forgot to add DoRecovery to the list of state events; fixed!

fixed

gregsfortytwo · 2017-11-21T01:53:47Z

"I'm not sure why states like Activating are paying attention to this event..."

is that a question? Seems like it should be answered when fiddling with where the event is emitted.

tchaikov · 2017-11-22T06:44:56Z

tchaikov · 2017-11-22T06:56:07Z

src/osd/PG.cc

-    pg->queue_recovery();
+  if (got_missing) {
+    post_event(DoRecovery());
+    return discard_event();


@liewegas we can drop this line, as discard_event() will always be called.

tchaikov

modulo the nit, lgtm.

liewegas · 2017-11-29T18:45:12Z

fixed nit

disaster123 · 2018-01-15T15:20:44Z

Please backport to luminous - Thanks!

liewegas added bug-fix core labels Nov 16, 2017

badone approved these changes Nov 16, 2017

liewegas force-pushed the wip-22145 branch from 31e46aa to 6c7e0ea Compare November 16, 2017 22:17

liewegas added the needs-qa label Nov 16, 2017

xiexingguo approved these changes Nov 17, 2017

tchaikov added the wip-kefu-testing label Nov 17, 2017

tchaikov reviewed Nov 17, 2017

tchaikov previously requested changes Nov 17, 2017

tchaikov removed needs-qa wip-kefu-testing labels Nov 17, 2017

liewegas force-pushed the wip-22145 branch from 6c7e0ea to f9e8157 Compare November 17, 2017 13:19

tchaikov added needs-qa wip-kefu-testing labels Nov 17, 2017

tchaikov mentioned this pull request Nov 18, 2017

msg/async/dpdk: rebase to spdk/dpdk #18927

Closed

tchaikov removed the wip-kefu-testing label Nov 18, 2017

tchaikov self-requested a review November 18, 2017 02:59

xiexingguo reviewed Nov 18, 2017

tchaikov previously requested changes Nov 18, 2017

tchaikov removed the needs-qa label Nov 18, 2017

liewegas force-pushed the wip-22145 branch from f9e8157 to fa3bac8 Compare November 20, 2017 03:32

liewegas added 2 commits November 19, 2017 21:32

osd/PG: document state hierarchy

e2a75c9

Signed-off-by: Sage Weil <sage@redhat.com>

qa/suites/rados: test for recovery_unfound bug

25b7965

See http://tracker.ceph.com/issues/22145 Signed-off-by: Sage Weil <sage@redhat.com>

liewegas force-pushed the wip-22145 branch from fa3bac8 to da7acb3 Compare November 20, 2017 03:33

liewegas added the needs-qa label Nov 20, 2017

tchaikov added the wip-kefu-testing label Nov 21, 2017

tchaikov reviewed Nov 22, 2017

tchaikov approved these changes Nov 22, 2017

tchaikov removed needs-qa wip-kefu-testing labels Nov 22, 2017

liewegas added the needs-qa label Nov 29, 2017

liewegas force-pushed the wip-22145 branch from da7acb3 to 4cfe31c Compare November 29, 2017 18:45

liewegas merged commit 27e06ff into ceph:master Nov 29, 2017

liewegas deleted the wip-22145 branch November 29, 2017 18:48

dzafman mentioned this pull request Jan 22, 2018

luminous: miscounting degraded objects and PG stuck in recovery_unfound #20055

Merged
