Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

osd/PG: restart recovery if NotRecovering and unfound found #18974

Merged
merged 3 commits into from Nov 29, 2017

Conversation

liewegas
Copy link
Member

If we are in recovery_unfound state waiting for unfound objects, and we
find them, we need to restart the recovery reservation process so that we
can recover. Do this by queueing DoRecover() event instead of calling
queue_recovery() (which won't do anything since we're not in
recoverying|backfilling pg states).

Make the parent Active state ignore DoRecovery so that if we are already
in some phase of recovery/backfill the event gets ignored. It is already
handled by the other important substates that care, like Clean (for
repair's benefit).

I'm not sure why states like Activating are paying attention tot his vevent...

Fixes: http://tracker.ceph.com/issues/22145
Signed-off-by: Sage Weil sage@redhat.com

Copy link
Contributor

@badone badone left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Besides the rogue tab LGTM

conf:
osd:
osd recovery sleep: .1
osd objectstore: filestore
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Did a tab slip in here?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

yep, fixed!

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Good to go

Copy link
Member

@xiexingguo xiexingguo left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

lgtm also

@xiexingguo
Copy link
Member

I'm not sure why states like Activating are paying attention tot his vevent...

+1

@@ -0,0 +1,6 @@
[HANDLER_OUTPUT]
Copy link
Contributor

@tchaikov tchaikov Nov 17, 2017

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@liewegas teuthology-suite panics at seeing this file, as its filename is not ended with .yaml and this directory will be empty without this file. which renders the test matrix representing this directory an empty one:

Traceback (most recent call last):
  File "/home/kchai/teuthology/virtualenv/bin/teuthology-suite", line 11, in <module>
    load_entry_point('teuthology', 'console_scripts', 'teuthology-suite')()
  File "/home/kchai/teuthology/scripts/suite.py", line 137, in main
    return teuthology.suite.main(args)
  File "/home/kchai/teuthology/teuthology/suite/__init__.py", line 88, in main
    run.prepare_and_schedule()
  File "/home/kchai/teuthology/teuthology/suite/run.py", line 309, in prepare_and_schedule
    num_jobs = self.schedule_suite()
  File "/home/kchai/teuthology/teuthology/suite/run.py", line 476, in schedule_suite
    build_matrix(suite_path, subset=self.args.subset)
  File "/home/kchai/teuthology/teuthology/suite/build_matrix.py", line 50, in build_matrix
    mat, first, matlimit = _get_matrix(path, subset)
  File "/home/kchai/teuthology/teuthology/suite/build_matrix.py", line 60, in _get_matrix
    mat = _build_matrix(path, mincyclicity=outof)
  File "/home/kchai/teuthology/teuthology/suite/build_matrix.py", line 122, in _build_matrix
    fn)
  File "/home/kchai/teuthology/teuthology/suite/build_matrix.py", line 122, in _build_matrix
    fn)
  File "/home/kchai/teuthology/teuthology/suite/build_matrix.py", line 131, in _build_matrix
    return matrix.Sum(item, submats)
  File "/home/kchai/teuthology/teuthology/suite/matrix.py", line 225, in __init__
    "Sum requires non-empty _submats"
AssertionError: Sum requires non-empty _submats

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

fixed!

Copy link
Contributor

@tchaikov tchaikov left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

please remove qa/suites/rados/upgrade/jewel-x-singleton/o

@tchaikov
Copy link
Contributor

http://pulpito.ceph.com/kchai-2017-11-17_09:57:45-rados-wip-kefu-testing-2017-11-17-1613-distro-basic-smithi/

quite a few tests failed, with backtrace like

ceph version 13.0.0-3301-gc1e33f9 (c1e33f94784096fb4fd6761b8ff933c732ffaf3a) mimic (dev)
 1: (()+0xafc1a4) [0x55fd474e51a4]
 2: (()+0x11390) [0x7fb3f1635390]
 3: (gsignal()+0x38) [0x7fb3f05d0428]
 4: (abort()+0x16a) [0x7fb3f05d202a]
 5: (ceph::__ceph_assert_fail(char const*, char const*, int, char const*)+0x28e) [0x55fd475283fe]
 6: (PG::RecoveryState::Crashed::Crashed(boost::statechart::state<PG::RecoveryState::Crashed, PG::RecoveryState::RecoveryMachine, boost::mpl:
:list<mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_:
:na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na>, (boost::statechart::history_mode)0>::my_context)+0x135) [0x55fd46fba1e5]
 7: (()+0x612e66) [0x55fd46ffbe66]
 8: (boost::statechart::simple_state<PG::RecoveryState::Primary, PG::RecoveryState::Started, PG::RecoveryState::Peering, (boost::statechart::
history_mode)0>::react_impl(boost::statechart::event_base const&, void const*)+0x16e) [0x55fd470383be]
 9: (boost::statechart::simple_state<PG::RecoveryState::Active, PG::RecoveryState::Primary, PG::RecoveryState::Activating, (boost::statechart
::history_mode)0>::react_impl(boost::statechart::event_base const&, void const*)+0x1c1) [0x55fd470363d1]
 10: (boost::statechart::simple_state<PG::RecoveryState::WaitRemoteRecoveryReserved, PG::RecoveryState::Active, boost::mpl::list<mpl_::na, mp
l_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl
_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na>, (boost::statechart::history_mode)0>::react_impl(boost::statechart::event_base const&, void co
nst*)+0x73) [0x55fd47035323]
 11: (boost::statechart::state_machine<PG::RecoveryState::RecoveryMachine, PG::RecoveryState::Initial, std::allocator<void>, boost::statechart::null_exception_translator>::process_event(boost::statechart::event_base const&)+0x69) [0x55fd4700f579]
 12: (PG::process_peering_event(PG::RecoveryCtx*)+0x455) [0x55fd46ff6255]
 13: (OSD::process_peering_events(std::__cxx11::list<PG*, std::allocator<PG*> > const&, ThreadPool::TPHandle&)+0x48d) [0x55fd46f264bd]
 14: (ThreadPool::BatchWorkQueue<PG>::_void_process(void*, ThreadPool::TPHandle&)+0x27) [0x55fd46f8aa47]
 15: (ThreadPool::worker(ThreadPool::WorkThread*)+0xdb9) [0x55fd4752f179]
 16: (ThreadPool::WorkThread::entry()+0x10) [0x55fd47530280]

for example, see /a/kchai-2017-11-17_09:57:45-rados-wip-kefu-testing-2017-11-17-1613-distro-basic-smithi/1857122

i am removing this change from my test batch.

got_missing)
pg->queue_recovery();
if (got_missing) {
post_event(DoRecovery());
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Seems we still needs some sanity checking here, there are multiple substates of Active can not handle DoRecovery() event properly...

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, see commit 64047e1

This issue came up in testing pull request #19850 which should merge soon.

Copy link
Contributor

@tchaikov tchaikov left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

i am pretty sure it's buggy.

Signed-off-by: Sage Weil <sage@redhat.com>
@liewegas
Copy link
Member Author

I forgot to add DoRecovery to the list of state events; fixed!

@gregsfortytwo
Copy link
Member

"I'm not sure why states like Activating are paying attention tot his vevent..."

is that a question? Seems like it should be answered when fiddling with where the event is emitted.

src/osd/PG.cc Outdated
pg->queue_recovery();
if (got_missing) {
post_event(DoRecovery());
return discard_event();
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@liewegas we can drop this line. as discard_event() will always be called.

Copy link
Contributor

@tchaikov tchaikov left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

modulo the nit, lgtm.

If we are in recovery_unfound state waiting for unfound objects, and we
find them, we need to restart the recovery reservation process so that we
can recover.  Do this by queueing DoRecover() event instead of calling
queue_recovery() (which won't do anything since we're not in
recoverying|backfilling pg states).

Make the parent Active state ignore DoRecovery so that if we are already
in some phase of recovery/backfill the event gets ignored.  It is already
handled by the other important substates that care, like Clean (for
repair's benefit).

I'm not sure why states like Activating are paying attention tot his vevent...

Fixes: http://tracker.ceph.com/issues/22145
Signed-off-by: Sage Weil <sage@redhat.com>
@liewegas
Copy link
Member Author

fixed nit

@liewegas liewegas merged commit 27e06ff into ceph:master Nov 29, 2017
@liewegas liewegas deleted the wip-22145 branch November 29, 2017 18:48
@disaster123
Copy link

Please backport to luminous - Thanks!

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
7 participants