
osd: handle inconsistent hash info during backfill and deep scrub gracefully #43239

Merged
merged 4 commits on Oct 14, 2021

Conversation

trociny
Contributor

@trociny trociny commented Sep 21, 2021

Fixes: https://tracker.ceph.com/issues/48959

Checklist

  • References tracker ticket
  • Updates documentation if necessary
  • Includes tests for new functionality or reproducer for bug

Show available Jenkins commands
  • jenkins retest this please
  • jenkins test classic perf
  • jenkins test crimson perf
  • jenkins test signed
  • jenkins test make check
  • jenkins test make check arm64
  • jenkins test submodules
  • jenkins test dashboard
  • jenkins test dashboard cephadm
  • jenkins test api
  • jenkins test docs
  • jenkins render docs
  • jenkins test ceph-volume all
  • jenkins test ceph-volume tox

Instead fail the pull so it will try to recover from other
shards (or will mark the object missing).

Fixes: https://tracker.ceph.com/issues/48959
Signed-off-by: Mykola Golub <mgolub@suse.com>
Signed-off-by: Mykola Golub <mgolub@suse.com>
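
For illustration, the idea in this commit can be sketched as standalone C++. Everything below (HashInfo, PullResult, finish_pull, the toy digest) is a simplified stand-in rather than the real Ceph types or the actual ECBackend recovery code: when the data pulled for a shard does not match the recorded hash info, the pull is reported as failed so recovery can fall back to another shard or mark the object missing, instead of hitting an assert.

```cpp
// Standalone illustration only: HashInfo, PullResult, etc. are simplified
// stand-ins, not the real Ceph classes or the real ECBackend interface.
#include <cstdint>
#include <iostream>
#include <optional>
#include <vector>

struct HashInfo {
  std::vector<uint32_t> chunk_hashes;  // one expected digest per shard
};

enum class PullResult { Completed, FailedTryOtherShard };

// Toy digest of the chunk bytes we actually received for this shard.
static uint32_t digest(const std::vector<uint8_t>& chunk) {
  uint32_t h = 0;
  for (uint8_t b : chunk) h = h * 131 + b;  // illustration only
  return h;
}

// Instead of asserting when the recorded hash info is inconsistent,
// report the pull as failed so the caller can retry from other shards
// (or eventually mark the object missing).
PullResult finish_pull(const std::optional<HashInfo>& hinfo,
                       unsigned shard,
                       const std::vector<uint8_t>& chunk) {
  if (!hinfo || shard >= hinfo->chunk_hashes.size()) {
    std::cerr << "missing/short hash info, failing pull\n";
    return PullResult::FailedTryOtherShard;
  }
  if (digest(chunk) != hinfo->chunk_hashes[shard]) {
    std::cerr << "hash mismatch on shard " << shard << ", failing pull\n";
    return PullResult::FailedTryOtherShard;
  }
  return PullResult::Completed;
}

int main() {
  HashInfo hi{{digest({1, 2, 3}), digest({4, 5, 6})}};
  // Shard 1 returns corrupted data: the pull fails instead of asserting.
  auto r = finish_pull(hi, 1, {9, 9, 9});
  std::cout << (r == PullResult::Completed ? "completed" : "retry elsewhere")
            << "\n";
}
```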
@trociny
Contributor Author

trociny commented Sep 21, 2021

@neha-ojha If you like the idea, I can add a teuthology test (similar to the backfill_toofull test from #42964) that covers the backfill scenario.

@trociny
Contributor Author

trociny commented Sep 21, 2021

I think I have found a problematic case in ECBackend::get_hash_info that may be responsible for the inconsistency being introduced initially in our case (see my last commit).

If store->stat returns with an error for some reason, we create an "empty" hash info, which may propagate from the primary osd during backfill to the backfilling non-primary osd, introducing the inconsistency which may cause issues later.
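
As a standalone sketch of the pattern described above (FakeStore, the cache and get_hash_info_old are simplified stand-ins, not Ceph's ObjectStore API or the real ECBackend::get_hash_info): on any stat error the old code still cached and handed out a freshly constructed, effectively empty hash info.

```cpp
// Standalone illustration of the problematic pattern only; FakeStore,
// HashInfo and get_hash_info_old are simplified stand-ins, not Ceph's API.
#include <cerrno>
#include <cstdint>
#include <iostream>
#include <map>
#include <string>
#include <vector>

struct HashInfo {
  explicit HashInfo(unsigned chunk_count) : chunk_hashes(chunk_count, 0) {}
  std::vector<uint32_t> chunk_hashes;
  uint64_t total_chunk_size = 0;
};

struct FakeStore {
  // 0 on success, negative errno on failure (-EIO ~ medium error,
  // -ENOENT if the object does not exist).
  int stat(const std::string& oid, uint64_t* size) const {
    auto it = sizes.find(oid);
    if (it == sizes.end()) return -ENOENT;
    if (bad.count(oid)) return -EIO;
    *size = it->second;
    return 0;
  }
  std::map<std::string, uint64_t> sizes;
  std::map<std::string, int> bad;
};

std::map<std::string, HashInfo> cache;

// Old (problematic) behaviour: on *any* stat error we still cache and
// return an "empty" HashInfo, which can later be shipped to other shards.
HashInfo* get_hash_info_old(const FakeStore& store, const std::string& oid,
                            unsigned chunk_count) {
  auto it = cache.find(oid);
  if (it != cache.end()) return &it->second;

  HashInfo hinfo(chunk_count);
  uint64_t size = 0;
  if (store.stat(oid, &size) >= 0) {
    // ...normally the hinfo_key xattr would be read and decoded here...
    hinfo.total_chunk_size = size;
  }
  // BUG (as described above): even on error we cache and hand out a
  // partially initialized hinfo instead of reporting the failure.
  return &cache.emplace(oid, hinfo).first->second;
}

int main() {
  FakeStore store;
  store.sizes["obj"] = 4096;
  store.bad["obj"] = 1;  // simulate a medium error on the primary's disk
  HashInfo* h = get_hash_info_old(store, "obj", 3);
  std::cout << "got hinfo with total_chunk_size=" << h->total_chunk_size
            << " despite the stat error\n";
}
```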

@neha-ojha
Member

@neha-ojha If you like the idea, I can add a teuthology test (similar to the backfill_toofull test from #42964) that covers the backfill scenario.

any test that helps reproduce the bug is always welcome!

@neha-ojha neha-ojha self-requested a review September 21, 2021 19:32
@trociny trociny force-pushed the wip-48959 branch 2 times, most recently from 762b42b to 29f5edc Compare September 22, 2021 05:38
@trociny
Contributor Author

trociny commented Sep 22, 2021

The Jenkins failure [1] was related to my last commit. It turned out that it is a normal situation when stat returns -ENOENT (the hash info does not exist on disk yet): we insert an "empty" hash info into the cache and return a reference to it, and it is properly initialized later. In this case the checks boolean param is false. So I updated my last patch not to fail when stat fails but checks is false.

I am also going to add the test.

[1] https://jenkins.ceph.com/job/ceph-api/24839/consoleFull

@trociny
Contributor Author

trociny commented Sep 22, 2021

jenkins test signed

1 similar comment
@tchaikov
Contributor

jenkins test signed

@trociny
Contributor Author

trociny commented Sep 23, 2021

Previously, in the case of an error we stored HashInfo(ec_impl->get_chunk_count())
in the cache and returned it, which could e.g. propagate to
non-primary shards, introducing inconsistency.

The function's `checks` flag is replaced with a `create` flag,
which has a clearer meaning here.

In be_deep_scrub, get_hash_info is still called with the second
argument false (i.e. `create=false`, where previously it was
`checks=false`); this is intentional.

Fixes: https://tracker.ceph.com/issues/48959
Signed-off-by: Mykola Golub <mgolub@suse.com>
Signed-off-by: Mykola Golub <mgolub@suse.com>
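
A rough standalone sketch of the revised decision logic described in the commit message above (the names and signatures are simplified stand-ins, not the real Ceph code): a stat failure is tolerated only when it is -ENOENT and the caller asked for creation; anything else is reported as a failure instead of fabricating an empty hinfo.

```cpp
// Simplified stand-in for the revised decision logic; not the real
// ECBackend::get_hash_info signature or types.
#include <cerrno>
#include <iostream>
#include <optional>
#include <vector>

struct HashInfo {
  explicit HashInfo(unsigned chunk_count) : chunk_hashes(chunk_count, 0) {}
  std::vector<uint32_t> chunk_hashes;
};

// stat_result: 0 on success, negative errno on failure.
// create: true on the write path (the object may legitimately not exist
//         yet); false e.g. on the deep-scrub path, which must not create.
std::optional<HashInfo> get_hash_info_new(int stat_result, bool create,
                                          unsigned chunk_count) {
  if (stat_result >= 0) {
    // ...read and decode the hinfo_key xattr here (omitted)...
    return HashInfo(chunk_count);
  }
  if (stat_result == -ENOENT && create) {
    // Object does not exist yet: handing out a fresh hinfo is expected.
    return HashInfo(chunk_count);
  }
  // Any other error (or -ENOENT without create): fail instead of
  // fabricating an empty hinfo that could propagate to other shards.
  return std::nullopt;
}

int main() {
  std::cout << std::boolalpha
            << bool(get_hash_info_new(0, false, 3)) << "\n"        // true
            << bool(get_hash_info_new(-ENOENT, true, 3)) << "\n"   // true
            << bool(get_hash_info_new(-ENOENT, false, 3)) << "\n"  // false
            << bool(get_hash_info_new(-EIO, true, 3)) << "\n";     // false
}
```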
@trociny
Contributor Author

trociny commented Sep 29, 2021

I think I have found a problematic case in ECBackend::get_hash_info that may be responsible for the inconsistency being introduced initially in our case (see my last commit).

If store->stat returns with an error for some reason, we create an "empty" hash info, which may propagate from the primary osd during backfill to the backfilling non-primary osd, introducing the inconsistency which may cause issues later.

Our users reported a case where it was still possible to introduce hinfo inconsistency on the non-primary osds when store->stat returned with an error, even with my patch applied, using the following steps:

1) make `store->stat` in `ECBackend::get_hash_info` return an error for a particular object (e.g. by modifying the code or by injecting the error with the stap tool);
2) issue a command to update metadata for the object:
    rados setxattr -p {pool} {object} testkey VALUE
3) check the hinfo_key attribute for this object with ceph-objectstore-tool and observe that it is invalid (partially initialized) on all non-primary osds.

So I updated my patch and made ECBackend::get_hash_info fail for any error. The only valid error is when stat returns -ENOENT (the object does not exist), which is expected when the create flag is set. This means that for a scenario like the above, the primary osd will crash in ECBackend::start_rmw ("get_hash_info returned a null pointer and there is no way to recover") instead of propagating the invalid hinfo.
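
A minimal sketch of the caller-side contract that follows from this (simplified, hypothetical names; the real code logs "get_hash_info returned a null pointer and there is no way to recover" and aborts rather than calling std::abort directly): a null result from get_hash_info is treated as unrecoverable instead of being propagated to other shards.

```cpp
// Simplified, hypothetical sketch of the caller-side contract; not the
// real ECBackend::start_rmw code.
#include <cstdlib>
#include <iostream>
#include <optional>

struct HashInfo { unsigned chunk_count = 0; };

// Hypothetical stand-in for the lookup that may now fail on stat errors.
std::optional<HashInfo> get_hash_info(bool stat_failed) {
  if (stat_failed) return std::nullopt;  // any unexpected error -> failure
  return HashInfo{3};
}

void start_write(bool stat_failed) {
  auto hinfo = get_hash_info(stat_failed);
  if (!hinfo) {
    // Rather than shipping a partially initialized hinfo to the other
    // shards, the primary gives up on the operation here.
    std::cerr << "get_hash_info returned null; no way to recover\n";
    std::abort();
  }
  std::cout << "proceeding with " << hinfo->chunk_count << " chunks\n";
}

int main() {
  start_write(false);   // normal path
  // start_write(true); // would abort, as the updated primary now does
}
```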

@trociny
Contributor Author

trociny commented Sep 29, 2021

I reran it through a limited rados suite subset after the last modification:

https://pulpito.ceph.com/trociny-2021-09-29_05:47:51-rados-wip-mgolub-testing-distro-basic-smithi/
https://pulpito.ceph.com/trociny-2021-09-28_18:51:21-rados-wip-mgolub-testing-distro-basic-smithi/

There were a couple of failures, but they do not look related.

@neha-ojha
Member

jenkins test make check

@neha-ojha
Member

@neha-ojha I have added the test.

Here are some limited teuthology runs with --filter ec-inconsistent-hinfo [1] and without filter to check for a regression [2, 3].

[1] https://pulpito.ceph.com/trociny-2021-09-23_14:58:16-rados-wip-mgolub-testing-distro-basic-smithi/
[2] https://pulpito.ceph.com/trociny-2021-09-22_14:10:47-rados-wip-mgolub-testing-distro-basic-smithi/
[3] https://pulpito.ceph.com/trociny-2021-09-23_10:52:44-rados-wip-mgolub-testing-distro-basic-smithi/

@trociny have you also verified the test fails with the same symptom as https://tracker.ceph.com/issues/48959 without your patch?

@trociny
Contributor Author

trociny commented Oct 1, 2021

@trociny have you also verified the test fails with the same symptom as https://tracker.ceph.com/issues/48959 without your patch?

https://pulpito.ceph.com/trociny-2021-10-01_06:28:15-rados-master-distro-basic-smithi/

@trociny
Contributor Author

trociny commented Oct 1, 2021

https://pulpito.ceph.com/trociny-2021-10-01_06:28:15-rados-master-distro-basic-smithi/

The tests failed due to an osd crash with the following backtrace:

/home/jenkins-build/build/workspace/ceph-dev-build/ARCH/x86_64/AVAILABLE_ARCH/x86_64/AVAILABLE_DIST/centos8/DIST/centos8/MACHINE_SIZE/gigantic/release/17.0.0-8011-g6939ea03/rpm/el8/BUILD/ceph-17.0.0-8011-g6939ea03/src/osd/ECBackend.cc: 608: FAILED ceph_assert(op.hinfo)

 ceph version 17.0.0-8011-g6939ea03 (6939ea034a21745448742eb1ff6186336a4fd48f) quincy (dev)
 1: (ceph::__ceph_assert_fail(char const*, char const*, int, char const*)+0x152) [0x56279cd3bdf4]
 2: ceph-osd(+0x5d2015) [0x56279cd3c015]
 3: (ECBackend::continue_recovery_op(ECBackend::RecoveryOp&, RecoveryMessages*)+0x1b3c) [0x56279d23579c]
 4: (ECBackend::run_recovery_op(PGBackend::RecoveryHandle*, int)+0x441) [0x56279d236d01]
 5: (PrimaryLogPG::recover_backfill(unsigned long, ThreadPool::TPHandle&, bool*)+0x26f9) [0x56279cfc3b69]
 6: (PrimaryLogPG::start_recovery_ops(unsigned long, ThreadPool::TPHandle&, unsigned long*)+0x10e3) [0x56279cfc7763]
 7: (OSD::do_recovery(PG*, unsigned int, unsigned long, ThreadPool::TPHandle&)+0x2f5) [0x56279ce42845]
 8: (ceph::osd::scheduler::PGRecovery::run(OSD*, OSDShard*, boost::intrusive_ptr<PG>&, ThreadPool::TPHandle&)+0x1d) [0x56279d104b3d]
 9: (OSD::ShardedOpWQ::_process(unsigned int, ceph::heartbeat_handle_d*)+0xac8) [0x56279ce60fe8]
 10: (ShardedThreadPool::shardedthreadpool_worker(unsigned int)+0x5c4) [0x56279d54ff94]
 11: (ShardedThreadPool::WorkThreadSharded::entry()+0x14) [0x56279d550a84]
 12: /lib64/libpthread.so.0(+0x814a) [0x7f419b04814a]
 13: clone()

Note, it is a bit different from what was reported in [1]: the failed assertion and backtrace are the same, but the error that caused get_hash_info to return null is different: in the reported case it was because the hinfo was invalid on disk, while in my case it was just removed. I could make my test behave exactly the same way (i.e. insert an invalid hinfo instead of just removing it), but I don't think it is worth complicating the test.

[1] https://tracker.ceph.com/issues/48959

Member

@neha-ojha neha-ojha left a comment


@trociny thanks for the detailed explanation of the bug and how you went about fixing it!

Your fix makes sense to me (at least on a first pass). I'd also like to get @athanatos's opinion on this, and we need thorough teuthology testing. I'll run it through a broader rados suite and see how that goes.

@knkonishi

Hi, @neha-ojha @ronen-fr @athanatos
Our customer actually had to suspend service due to this problem.
Without this fix, the customer remains at risk that this problem will recur and cause a service outage.
A recurrence of this problem would require a lot of recovery effort and would affect many end users of the storage service.
We would appreciate it if you could review the patch and fix this problem as soon as possible.

@athanatos
Contributor

athanatos commented Oct 14, 2021

@knkonishi That's a bit odd. I'm not aware of any outstanding bugs that result in this inconsistency; can you point me at the one you are experiencing?

Ah, https://tracker.ceph.com/issues/48959, but we don't have a root cause. Was it just one occurrence on your cluster?

Contributor

@athanatos athanatos left a comment


This PR seems perfect! Since it mainly touches the ec code, a backport should be entirely possible.

@knkonishi

@athanatos Thanks for your quick response and approval!

Ah, https://tracker.ceph.com/issues/48959, but we don't have a root cause. Was it just one occurrence on your cluster?

Yes, this just happened once.
In our customer's case, there were medium errors on the primary OSD's disk.
We assume that store->stat() returned an error due to these medium errors.

@ronen-fr
Contributor

ronen-fr commented Oct 19, 2021

(late to the party, but ...)

@trociny, @neha-ojha: I suspect that the change in ECBackend::get_hash_info() is the cause of a standalone/scrub/osd-scrub-repair.sh failure. The test expects an 'ec_size_error', but now it seems that the reported error is 'read_error'.

I will create a PR to fix the test.

ronen-fr added a commit to ronen-fr/ceph that referenced this pull request Oct 20, 2021
PR ceph#43239 has modified ECBackend::get_hash_info() behavior.
Modified the standalone scrub test to match.

Signed-off-by: Ronen Friedman <rfriedma@redhat.com>
ronen-fr added a commit that referenced this pull request Oct 20, 2021
tests: modify osd-scrub-repair to match PR #43239 changes

Reviewed-by: Mykola Golub <mgolub@suse.com>
trociny pushed a commit to trociny/ceph that referenced this pull request Oct 20, 2021
PR ceph#43239 has modified ECBackend::get_hash_info() behavior.
Modified the standalone scrub test to match.

Signed-off-by: Ronen Friedman <rfriedma@redhat.com>
(cherry picked from commit 52e9fa1)
trociny pushed a commit to trociny/ceph that referenced this pull request Oct 20, 2021
PR ceph#43239 has modified ECBackend::get_hash_info() behavior.
Modified the standalone scrub test to match.

Signed-off-by: Ronen Friedman <rfriedma@redhat.com>
(cherry picked from commit 52e9fa1)
trociny pushed a commit to trociny/ceph that referenced this pull request Oct 20, 2021
PR ceph#43239 has modified ECBackend::get_hash_info() behavior.
Modified the standalone scrub test to match.

Signed-off-by: Ronen Friedman <rfriedma@redhat.com>
(cherry picked from commit 52e9fa1)
trociny pushed a commit to trociny/ceph that referenced this pull request Oct 26, 2021
PR ceph#43239 has modified ECBackend::get_hash_info() behavior.
Modified the standalone scrub test to match.

Signed-off-by: Ronen Friedman <rfriedma@redhat.com>
(cherry picked from commit 52e9fa1)
Waadkh7 pushed a commit to rhcs-dashboard/ceph that referenced this pull request Nov 23, 2021
PR ceph#43239 has modified ECBackend::get_hash_info() behavior.
Modified the standalone scrub test to match.

Signed-off-by: Ronen Friedman <rfriedma@redhat.com>