Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

osd/PG: restart recovery if NotRecovering and unfound found #18974

Merged
merged 3 commits into from Nov 29, 2017

Conversation

Projects
None yet
7 participants
@liewegas
Copy link
Member

commented Nov 16, 2017

If we are in recovery_unfound state waiting for unfound objects, and we
find them, we need to restart the recovery reservation process so that we
can recover. Do this by queueing DoRecover() event instead of calling
queue_recovery() (which won't do anything since we're not in
recoverying|backfilling pg states).

Make the parent Active state ignore DoRecovery so that if we are already
in some phase of recovery/backfill the event gets ignored. It is already
handled by the other important substates that care, like Clean (for
repair's benefit).

I'm not sure why states like Activating are paying attention tot his vevent...

Fixes: http://tracker.ceph.com/issues/22145
Signed-off-by: Sage Weil sage@redhat.com

@badone

badone approved these changes Nov 16, 2017

Copy link
Contributor

left a comment

Besides the rogue tab LGTM

conf:
osd:
osd recovery sleep: .1
osd objectstore: filestore

This comment has been minimized.

Copy link
@badone

badone Nov 16, 2017

Contributor

Did a tab slip in here?

This comment has been minimized.

Copy link
@liewegas

liewegas Nov 16, 2017

Author Member

yep, fixed!

This comment has been minimized.

Copy link
@badone

badone Nov 16, 2017

Contributor

Good to go

@liewegas liewegas force-pushed the liewegas:wip-22145 branch from 31e46aa to 6c7e0ea Nov 16, 2017

@liewegas liewegas added the needs-qa label Nov 16, 2017

@xiexingguo
Copy link
Member

left a comment

lgtm also

@xiexingguo

This comment has been minimized.

Copy link
Member

commented Nov 17, 2017

I'm not sure why states like Activating are paying attention tot his vevent...

+1

@@ -0,0 +1,6 @@
[HANDLER_OUTPUT]

This comment has been minimized.

Copy link
@tchaikov

tchaikov Nov 17, 2017

Contributor

@liewegas teuthology-suite panics at seeing this file, as its filename is not ended with .yaml and this directory will be empty without this file. which renders the test matrix representing this directory an empty one:

Traceback (most recent call last):
  File "/home/kchai/teuthology/virtualenv/bin/teuthology-suite", line 11, in <module>
    load_entry_point('teuthology', 'console_scripts', 'teuthology-suite')()
  File "/home/kchai/teuthology/scripts/suite.py", line 137, in main
    return teuthology.suite.main(args)
  File "/home/kchai/teuthology/teuthology/suite/__init__.py", line 88, in main
    run.prepare_and_schedule()
  File "/home/kchai/teuthology/teuthology/suite/run.py", line 309, in prepare_and_schedule
    num_jobs = self.schedule_suite()
  File "/home/kchai/teuthology/teuthology/suite/run.py", line 476, in schedule_suite
    build_matrix(suite_path, subset=self.args.subset)
  File "/home/kchai/teuthology/teuthology/suite/build_matrix.py", line 50, in build_matrix
    mat, first, matlimit = _get_matrix(path, subset)
  File "/home/kchai/teuthology/teuthology/suite/build_matrix.py", line 60, in _get_matrix
    mat = _build_matrix(path, mincyclicity=outof)
  File "/home/kchai/teuthology/teuthology/suite/build_matrix.py", line 122, in _build_matrix
    fn)
  File "/home/kchai/teuthology/teuthology/suite/build_matrix.py", line 122, in _build_matrix
    fn)
  File "/home/kchai/teuthology/teuthology/suite/build_matrix.py", line 131, in _build_matrix
    return matrix.Sum(item, submats)
  File "/home/kchai/teuthology/teuthology/suite/matrix.py", line 225, in __init__
    "Sum requires non-empty _submats"
AssertionError: Sum requires non-empty _submats

This comment has been minimized.

Copy link
@liewegas

liewegas Nov 17, 2017

Author Member

fixed!

@tchaikov
Copy link
Contributor

left a comment

please remove qa/suites/rados/upgrade/jewel-x-singleton/o

@tchaikov

This comment has been minimized.

Copy link
Contributor

commented Nov 18, 2017

http://pulpito.ceph.com/kchai-2017-11-17_09:57:45-rados-wip-kefu-testing-2017-11-17-1613-distro-basic-smithi/

quite a few tests failed, with backtrace like

ceph version 13.0.0-3301-gc1e33f9 (c1e33f94784096fb4fd6761b8ff933c732ffaf3a) mimic (dev)
 1: (()+0xafc1a4) [0x55fd474e51a4]
 2: (()+0x11390) [0x7fb3f1635390]
 3: (gsignal()+0x38) [0x7fb3f05d0428]
 4: (abort()+0x16a) [0x7fb3f05d202a]
 5: (ceph::__ceph_assert_fail(char const*, char const*, int, char const*)+0x28e) [0x55fd475283fe]
 6: (PG::RecoveryState::Crashed::Crashed(boost::statechart::state<PG::RecoveryState::Crashed, PG::RecoveryState::RecoveryMachine, boost::mpl:
:list<mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_:
:na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na>, (boost::statechart::history_mode)0>::my_context)+0x135) [0x55fd46fba1e5]
 7: (()+0x612e66) [0x55fd46ffbe66]
 8: (boost::statechart::simple_state<PG::RecoveryState::Primary, PG::RecoveryState::Started, PG::RecoveryState::Peering, (boost::statechart::
history_mode)0>::react_impl(boost::statechart::event_base const&, void const*)+0x16e) [0x55fd470383be]
 9: (boost::statechart::simple_state<PG::RecoveryState::Active, PG::RecoveryState::Primary, PG::RecoveryState::Activating, (boost::statechart
::history_mode)0>::react_impl(boost::statechart::event_base const&, void const*)+0x1c1) [0x55fd470363d1]
 10: (boost::statechart::simple_state<PG::RecoveryState::WaitRemoteRecoveryReserved, PG::RecoveryState::Active, boost::mpl::list<mpl_::na, mp
l_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl
_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na>, (boost::statechart::history_mode)0>::react_impl(boost::statechart::event_base const&, void co
nst*)+0x73) [0x55fd47035323]
 11: (boost::statechart::state_machine<PG::RecoveryState::RecoveryMachine, PG::RecoveryState::Initial, std::allocator<void>, boost::statechart::null_exception_translator>::process_event(boost::statechart::event_base const&)+0x69) [0x55fd4700f579]
 12: (PG::process_peering_event(PG::RecoveryCtx*)+0x455) [0x55fd46ff6255]
 13: (OSD::process_peering_events(std::__cxx11::list<PG*, std::allocator<PG*> > const&, ThreadPool::TPHandle&)+0x48d) [0x55fd46f264bd]
 14: (ThreadPool::BatchWorkQueue<PG>::_void_process(void*, ThreadPool::TPHandle&)+0x27) [0x55fd46f8aa47]
 15: (ThreadPool::worker(ThreadPool::WorkThread*)+0xdb9) [0x55fd4752f179]
 16: (ThreadPool::WorkThread::entry()+0x10) [0x55fd47530280]

for example, see /a/kchai-2017-11-17_09:57:45-rados-wip-kefu-testing-2017-11-17-1613-distro-basic-smithi/1857122

i am removing this change from my test batch.

@tchaikov tchaikov self-requested a review Nov 18, 2017

got_missing)
pg->queue_recovery();
if (got_missing) {
post_event(DoRecovery());

This comment has been minimized.

Copy link
@xiexingguo

xiexingguo Nov 18, 2017

Member

Seems we still needs some sanity checking here, there are multiple substates of Active can not handle DoRecovery() event properly...

This comment has been minimized.

Copy link
@dzafman

dzafman Jan 15, 2018

Member

Yes, see commit 64047e1

This issue came up in testing pull request #19850 which should merge soon.

@tchaikov
Copy link
Contributor

left a comment

i am pretty sure it's buggy.

@tchaikov tchaikov removed the needs-qa label Nov 18, 2017

@liewegas liewegas force-pushed the liewegas:wip-22145 branch from f9e8157 to fa3bac8 Nov 20, 2017

liewegas added some commits Nov 16, 2017

osd/PG: document state hierarchy
Signed-off-by: Sage Weil <sage@redhat.com>

@liewegas liewegas force-pushed the liewegas:wip-22145 branch from fa3bac8 to da7acb3 Nov 20, 2017

@liewegas

This comment has been minimized.

Copy link
Member Author

commented Nov 20, 2017

I forgot to add DoRecovery to the list of state events; fixed!

@liewegas liewegas added the needs-qa label Nov 20, 2017

fixed

@gregsfortytwo

This comment has been minimized.

Copy link
Member

commented Nov 21, 2017

"I'm not sure why states like Activating are paying attention tot his vevent..."

is that a question? Seems like it should be answered when fiddling with where the event is emitted.

pg->queue_recovery();
if (got_missing) {
post_event(DoRecovery());
return discard_event();

This comment has been minimized.

Copy link
@tchaikov

tchaikov Nov 22, 2017

Contributor

@liewegas we can drop this line. as discard_event() will always be called.

@tchaikov
Copy link
Contributor

left a comment

modulo the nit, lgtm.

osd/PG: restart recovery if NotRecovering and unfound found
If we are in recovery_unfound state waiting for unfound objects, and we
find them, we need to restart the recovery reservation process so that we
can recover.  Do this by queueing DoRecover() event instead of calling
queue_recovery() (which won't do anything since we're not in
recoverying|backfilling pg states).

Make the parent Active state ignore DoRecovery so that if we are already
in some phase of recovery/backfill the event gets ignored.  It is already
handled by the other important substates that care, like Clean (for
repair's benefit).

I'm not sure why states like Activating are paying attention tot his vevent...

Fixes: http://tracker.ceph.com/issues/22145
Signed-off-by: Sage Weil <sage@redhat.com>

@liewegas liewegas added the needs-qa label Nov 29, 2017

@liewegas liewegas force-pushed the liewegas:wip-22145 branch from da7acb3 to 4cfe31c Nov 29, 2017

@liewegas

This comment has been minimized.

Copy link
Member Author

commented Nov 29, 2017

fixed nit

@liewegas liewegas merged commit 27e06ff into ceph:master Nov 29, 2017

2 of 5 checks passed

Docs: build check Docs: building
Details
make check running make check
Details
make check (arm64) running make check
Details
Signed-off-by all commits in this PR are signed
Details
Unmodified Submodules submodules for project are unmodified
Details

@liewegas liewegas deleted the liewegas:wip-22145 branch Nov 29, 2017

@disaster123

This comment has been minimized.

Copy link

commented Jan 15, 2018

Please backport to luminous - Thanks!

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
You can’t perform that action at this time.