osd: handle inconsistent hash info during backfill and deep scrub gracefully #43239
Instead fail the pull so it will try to recover from other shards (or will mark the object missing). Fixes: https://tracker.ceph.com/issues/48959 Signed-off-by: Mykola Golub <mgolub@suse.com>
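The control flow described in this commit message can be illustrated with a small standalone sketch. This is not the ECBackend code: `ShardRead`, `PullResult`, `plan_pull` and the `needed` parameter are hypothetical names made up for illustration, and the only assumption is that a shard with inconsistent hash info is skipped as a source rather than triggering an assertion.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical view of one shard that could serve as a recovery source.
struct ShardRead {
  int shard_id;
  bool hinfo_consistent;  // false when the shard's hash info fails validation
};

enum class PullOutcome { Recovered, Missing };

struct PullResult {
  PullOutcome outcome;
  std::vector<int> sources;  // shards chosen as sources; empty when Missing
};

// Pick `needed` source shards, skipping any shard whose hash info is
// inconsistent. If too few usable shards remain, report the object as
// missing instead of aborting.
PullResult plan_pull(const std::vector<ShardRead>& candidates,
                     std::size_t needed) {
  assert(needed > 0);
  std::vector<int> sources;
  for (const ShardRead& shard : candidates) {
    if (!shard.hinfo_consistent) {
      continue;  // fail this source only; keep trying the other shards
    }
    sources.push_back(shard.shard_id);
    if (sources.size() == needed) {
      return {PullOutcome::Recovered, sources};
    }
  }
  return {PullOutcome::Missing, {}};
}
```

The point of the sketch is only that a bad shard degrades the result (fewer sources, or a missing object) rather than crashing the process.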
@neha-ojha If you like the idea, I can add a teuthology test (similar to the backfill_toofull test from #42964) that covers the backfill scenario.
I think I have found a problematic case in If |
any test that helps reproduce the bug is always welcome! |
762b42b to 29f5edc
Jenkins failure [1] was related to my last commit. It turned out it is a normal situation when I am also going to add the test. |
jenkins test signed |
@neha-ojha I have added the test. Here are some limited teuthology runs with [1] https://pulpito.ceph.com/trociny-2021-09-23_14:58:16-rados-wip-mgolub-testing-distro-basic-smithi/ |
Previously, on error, we stored HashInfo(ec_impl->get_chunk_count()) in the cache and returned it, which could, for example, propagate to non-primary shards and introduce inconsistency. The function's `checks` flag is replaced with a `create` flag, which has a clearer meaning here. In be_deep_scrub, get_hash_info is still called with the second argument false (i.e. `create=false`, where previously it was `checks=false`); this is intentional. Fixes: https://tracker.ceph.com/issues/48959 Signed-off-by: Mykola Golub <mgolub@suse.com>
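A minimal sketch of the behavior change described in this commit message, not the actual ECBackend::get_hash_info(). `HashInfoCache`, `AttrState` and the simplified `HashInfo` are hypothetical stand-ins, and the exact semantics of `create` are an assumption here: it is taken to mean "build a fresh entry when no hash info is stored". The part taken from the commit message is that an error no longer leaves a freshly constructed HashInfo in the cache.

```cpp
#include <cassert>
#include <cstddef>
#include <map>
#include <memory>
#include <string>
#include <vector>

// Hypothetical stand-in for the real hash info: one hash slot per chunk.
struct HashInfo {
  explicit HashInfo(std::size_t chunk_count) : chunk_hashes(chunk_count, 0) {}
  std::vector<unsigned> chunk_hashes;
};
using HashInfoRef = std::shared_ptr<HashInfo>;

// Hypothetical outcome of reading the stored hash info for an object.
enum class AttrState { Missing, Corrupt, Valid };

class HashInfoCache {
 public:
  explicit HashInfoCache(std::size_t chunk_count) : chunk_count_(chunk_count) {
    assert(chunk_count > 0);
  }

  // Returns the cached entry if present. On a corrupt attribute, returns
  // nullptr and caches nothing, so a bogus entry cannot spread further.
  // A missing attribute yields a fresh entry only when `create` is true
  // (assumed meaning of the flag).
  HashInfoRef get_hash_info(const std::string& oid, AttrState state,
                            bool create) {
    auto it = cache_.find(oid);
    if (it != cache_.end()) {
      return it->second;
    }
    if (state == AttrState::Corrupt) {
      return nullptr;
    }
    if (state == AttrState::Missing && !create) {
      return nullptr;
    }
    // Simplification: a Valid attribute would be decoded here; the sketch
    // just builds an empty entry of the right size.
    auto ref = std::make_shared<HashInfo>(chunk_count_);
    cache_.emplace(oid, ref);
    return ref;
  }

  bool cached(const std::string& oid) const { return cache_.count(oid) > 0; }

 private:
  std::size_t chunk_count_;
  std::map<std::string, HashInfoRef> cache_;
};
```

In this sketch a caller that passes `create=false`, as the commit message says be_deep_scrub does, gets a null result for an object with no stored hash info and has to handle that case explicitly.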
Our users reported a case when it was still possible to introduce hinfo inconsistency on the non-primary osds when
So I updated my patch and made |
I reran it through the limited rados suite subset after the last modification: https://pulpito.ceph.com/trociny-2021-09-29_05:47:51-rados-wip-mgolub-testing-distro-basic-smithi/ There are a couple of failures, but they do not look related.
jenkins test make check |
@trociny Have you also verified that, without your patch, the test fails with the same symptom as https://tracker.ceph.com/issues/48959?
https://pulpito.ceph.com/trociny-2021-10-01_06:28:15-rados-master-distro-basic-smithi/ |
The tests failed due to the osd crash with the backtrace:
Note that it is a bit different from what was reported in [1]: the failed assertion and backtrace are the same, but the error that caused
@trociny thanks for the detailed explanation of the bug and how you went about fixing it!
Your fix makes sense to me (at least in the first pass). I'd also like to get @athanatos's opinion on this and we need thorough teuthology testing on this. I'll run it through a broader rados suite and see how that goes.
Hi, @neha-ojha @ronen-fr @athanatos |
@knkonishi Ah, https://tracker.ceph.com/issues/48959, but we don't have a root cause. Was it just one occurrence on your cluster? |
This PR seems perfect! Since it mainly touches the ec code, a backport should be entirely possible.
@athanatos Thanks for your quick response and approval!
Yes, this just happened once. |
(late to the party, but ...) @trociny, @neha-ojha : I suspect that the change in ECBackend::get_hash_info() is the cause of a I will create a PR to fix the test. |
PR ceph#43239 has modified ECBackend::get_hash_info() behavior. Modified the standalone scrub test to match. Signed-off-by: Ronen Friedman <rfriedma@redhat.com>
tests: modify osd-scrub-repair to match PR #43239 changes Reviewed-by: Mykola Golub <mgolub@suse.com>
Fixes: https://tracker.ceph.com/issues/48959