osd/PG: restart recovery if NotRecovering and unfound found #18974
Besides the rogue tab LGTM
conf:
  osd:
    osd recovery sleep: .1
    osd objectstore: filestore
Did a tab slip in here?
yep, fixed!
Good to go
lgtm also
+1
@@ -0,0 +1,6 @@
[HANDLER_OUTPUT]
@liewegas teuthology-suite fails when it sees this file, because its filename does not end with .yaml.
Without this file the directory would be empty, so the test matrix built for this directory ends up empty:
Traceback (most recent call last):
  File "/home/kchai/teuthology/virtualenv/bin/teuthology-suite", line 11, in <module>
    load_entry_point('teuthology', 'console_scripts', 'teuthology-suite')()
  File "/home/kchai/teuthology/scripts/suite.py", line 137, in main
    return teuthology.suite.main(args)
  File "/home/kchai/teuthology/teuthology/suite/__init__.py", line 88, in main
    run.prepare_and_schedule()
  File "/home/kchai/teuthology/teuthology/suite/run.py", line 309, in prepare_and_schedule
    num_jobs = self.schedule_suite()
  File "/home/kchai/teuthology/teuthology/suite/run.py", line 476, in schedule_suite
    build_matrix(suite_path, subset=self.args.subset)
  File "/home/kchai/teuthology/teuthology/suite/build_matrix.py", line 50, in build_matrix
    mat, first, matlimit = _get_matrix(path, subset)
  File "/home/kchai/teuthology/teuthology/suite/build_matrix.py", line 60, in _get_matrix
    mat = _build_matrix(path, mincyclicity=outof)
  File "/home/kchai/teuthology/teuthology/suite/build_matrix.py", line 122, in _build_matrix
    fn)
  File "/home/kchai/teuthology/teuthology/suite/build_matrix.py", line 122, in _build_matrix
    fn)
  File "/home/kchai/teuthology/teuthology/suite/build_matrix.py", line 131, in _build_matrix
    return matrix.Sum(item, submats)
  File "/home/kchai/teuthology/teuthology/suite/matrix.py", line 225, in __init__
    "Sum requires non-empty _submats"
AssertionError: Sum requires non-empty _submats
fixed!
please remove qa/suites/rados/upgrade/jewel-x-singleton/o
quite a few tests failed, with backtrace like
for example, see /a/kchai-2017-11-17_09:57:45-rados-wip-kefu-testing-2017-11-17-1613-distro-basic-smithi/1857122. i am removing this change from my test batch.
got_missing)
pg->queue_recovery();
if (got_missing) {
post_event(DoRecovery());
It seems we still need some sanity checking here: there are multiple substates of Active that cannot handle the DoRecovery() event properly...
Yes, see commit 64047e1
This issue came up in testing pull request #19850 which should merge soon.
i am pretty sure it's buggy.
Signed-off-by: Sage Weil <sage@redhat.com>
See http://tracker.ceph.com/issues/22145 Signed-off-by: Sage Weil <sage@redhat.com>
I forgot to add DoRecovery to the list of state events; fixed!
"I'm not sure why states like Activating are paying attention to this event..." Is that a question? It seems like it should be answered when fiddling with where the event is emitted.
src/osd/PG.cc
Outdated
pg->queue_recovery();
if (got_missing) {
post_event(DoRecovery());
return discard_event();
@liewegas we can drop this line, as discard_event() will always be called.
modulo the nit, lgtm.
fixed nit
Please backport to luminous - Thanks!
If we are in recovery_unfound state waiting for unfound objects, and we
find them, we need to restart the recovery reservation process so that we
can recover. Do this by queueing a DoRecovery() event instead of calling
queue_recovery() (which won't do anything since we're not in
recovering|backfilling pg states).

Make the parent Active state ignore DoRecovery so that if we are already
in some phase of recovery/backfill the event gets ignored. It is already
handled by the other important substates that care, like Clean (for
repair's benefit).

I'm not sure why states like Activating are paying attention to this event...

Fixes: http://tracker.ceph.com/issues/22145
Signed-off-by: Sage Weil <sage@redhat.com>
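
The dispatch rule described in the commit message can be illustrated with a toy model. This is only a minimal sketch and not the code in src/osd/PG.cc: `ToyPG`, `State`, `substate_do_recovery`, `post_do_recovery`, and `found_unfound_objects` are hypothetical names, and the real state machine is built on boost::statechart with many more states. The assumption here is that a substate that cares about DoRecovery moves into the reservation phase, while any other substate falls through to a parent-level reaction that discards the event.

```cpp
#include <cassert>

// Hypothetical, simplified stand-ins for a few PG recovery substates.
enum class State { NotRecovering, Clean, WaitLocalRecoveryReserved, Recovering };

struct ToyPG {
  State state = State::NotRecovering;
  int discarded = 0;  // number of DoRecovery events dropped by the parent

  // Substate reaction: returns true if the current substate consumed the event.
  bool substate_do_recovery() {
    switch (state) {
      case State::NotRecovering:
      case State::Clean:
        // Restart the reservation process so recovery can proceed.
        state = State::WaitLocalRecoveryReserved;
        return true;
      default:
        return false;  // this substate has no reaction of its own
    }
  }

  // Parent ("Active"-like) fallback: ignore the event when no substate took it.
  void post_do_recovery() {
    if (!substate_do_recovery()) {
      ++discarded;
    }
  }

  // Models the patched call site: post an event instead of queueing recovery.
  void found_unfound_objects(bool got_missing) {
    if (got_missing) {
      post_do_recovery();
    }
  }
};
```

With this model, calling `found_unfound_objects(true)` on a PG in `NotRecovering` moves it to `WaitLocalRecoveryReserved`, while the same call on a PG that is already `Recovering` leaves the state alone and only increments `discarded`.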