
osd: handle inconsistent hash info during backfill and deep scrub gracefully #43239

Merged
merged 4 commits on Oct 14, 2021

Conversation

trociny
Contributor

@trociny trociny commented Sep 21, 2021

Fixes: https://tracker.ceph.com/issues/48959

Checklist

  • References tracker ticket
  • Updates documentation if necessary
  • Includes tests for new functionality or reproducer for bug

Show available Jenkins commands
  • jenkins retest this please
  • jenkins test classic perf
  • jenkins test crimson perf
  • jenkins test signed
  • jenkins test make check
  • jenkins test make check arm64
  • jenkins test submodules
  • jenkins test dashboard
  • jenkins test dashboard cephadm
  • jenkins test api
  • jenkins test docs
  • jenkins render docs
  • jenkins test ceph-volume all
  • jenkins test ceph-volume tox

Instead fail the pull so it will try to recover from other
shards (or will mark the object missing).

Fixes: https://tracker.ceph.com/issues/48959
Signed-off-by: Mykola Golub <mgolub@suse.com>
Signed-off-by: Mykola Golub <mgolub@suse.com>
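
For illustration, the idea in this commit can be sketched as standalone C++. Everything below (HashInfo, PullResult, finish_pull, the toy digest) is a simplified stand-in rather than the real Ceph types or the actual ECBackend recovery code: when the data pulled for a shard does not match the recorded hash info, the pull is reported as failed so recovery can fall back to another shard or mark the object missing, instead of hitting an assert.

```cpp
// Standalone illustration only: HashInfo, PullResult, etc. are simplified
// stand-ins, not the real Ceph classes or the real ECBackend interface.
#include <cstdint>
#include <iostream>
#include <optional>
#include <vector>

struct HashInfo {
  std::vector<uint32_t> chunk_hashes;  // one expected digest per shard
};

enum class PullResult { Completed, FailedTryOtherShard };

// Toy digest of the chunk bytes we actually received for this shard.
static uint32_t digest(const std::vector<uint8_t>& chunk) {
  uint32_t h = 0;
  for (uint8_t b : chunk) h = h * 131 + b;  // illustration only
  return h;
}

// Instead of asserting when the recorded hash info is inconsistent,
// report the pull as failed so the caller can retry from other shards
// (or eventually mark the object missing).
PullResult finish_pull(const std::optional<HashInfo>& hinfo,
                       unsigned shard,
                       const std::vector<uint8_t>& chunk) {
  if (!hinfo || shard >= hinfo->chunk_hashes.size()) {
    std::cerr << "missing/short hash info, failing pull\n";
    return PullResult::FailedTryOtherShard;
  }
  if (digest(chunk) != hinfo->chunk_hashes[shard]) {
    std::cerr << "hash mismatch on shard " << shard << ", failing pull\n";
    return PullResult::FailedTryOtherShard;
  }
  return PullResult::Completed;
}

int main() {
  HashInfo hi{{digest({1, 2, 3}), digest({4, 5, 6})}};
  // Shard 1 returns corrupted data: the pull fails instead of asserting.
  auto r = finish_pull(hi, 1, {9, 9, 9});
  std::cout << (r == PullResult::Completed ? "completed" : "retry elsewhere")
            << "\n";
}
```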
@trociny
Contributor Author

trociny commented Sep 21, 2021

@neha-ojha If you like the idea, I can add a teuthology test (similar to the backfill_toofull test from #42964) that covers the backfill scenario.

@trociny
Contributor Author

trociny commented Sep 21, 2021

I think I have found a problematic case in ECBackend::get_hash_info that may be responsible for the inconsistency being introduced initially in our case (see my last commit).

If store->stat returns with an error for some reason, we create an "empty" hash info, which may propagate from the primary osd during backfill to the backfilling non-primary osd, introducing the inconsistency which may cause issues later.
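
As a standalone sketch of the pattern described above (FakeStore, the cache and get_hash_info_old are simplified stand-ins, not Ceph's ObjectStore API or the real ECBackend::get_hash_info): on any stat error the old code still cached and handed out a freshly constructed, effectively empty hash info.

```cpp
// Standalone illustration of the problematic pattern only; FakeStore,
// HashInfo and get_hash_info_old are simplified stand-ins, not Ceph's API.
#include <cerrno>
#include <cstdint>
#include <iostream>
#include <map>
#include <string>
#include <vector>

struct HashInfo {
  explicit HashInfo(unsigned chunk_count) : chunk_hashes(chunk_count, 0) {}
  std::vector<uint32_t> chunk_hashes;
  uint64_t total_chunk_size = 0;
};

struct FakeStore {
  // 0 on success, negative errno on failure (-EIO ~ medium error,
  // -ENOENT if the object does not exist).
  int stat(const std::string& oid, uint64_t* size) const {
    auto it = sizes.find(oid);
    if (it == sizes.end()) return -ENOENT;
    if (bad.count(oid)) return -EIO;
    *size = it->second;
    return 0;
  }
  std::map<std::string, uint64_t> sizes;
  std::map<std::string, int> bad;
};

std::map<std::string, HashInfo> cache;

// Old (problematic) behaviour: on *any* stat error we still cache and
// return an "empty" HashInfo, which can later be shipped to other shards.
HashInfo* get_hash_info_old(const FakeStore& store, const std::string& oid,
                            unsigned chunk_count) {
  auto it = cache.find(oid);
  if (it != cache.end()) return &it->second;

  HashInfo hinfo(chunk_count);
  uint64_t size = 0;
  if (store.stat(oid, &size) >= 0) {
    // ...normally the hinfo_key xattr would be read and decoded here...
    hinfo.total_chunk_size = size;
  }
  // BUG (as described above): even on error we cache and hand out a
  // partially initialized hinfo instead of reporting the failure.
  return &cache.emplace(oid, hinfo).first->second;
}

int main() {
  FakeStore store;
  store.sizes["obj"] = 4096;
  store.bad["obj"] = 1;  // simulate a medium error on the primary's disk
  HashInfo* h = get_hash_info_old(store, "obj", 3);
  std::cout << "got hinfo with total_chunk_size=" << h->total_chunk_size
            << " despite the stat error\n";
}
```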

@neha-ojha
Member

@neha-ojha If you like the idea, I can add a teuthology test (similar to the backfill_toofull test from #42964) that covers the backfill scenario.

any test that helps reproduce the bug is always welcome!

@neha-ojha neha-ojha self-requested a review September 21, 2021 19:32
@trociny trociny force-pushed the wip-48959 branch 2 times, most recently from 762b42b to 29f5edc Compare September 22, 2021 05:38
@trociny
Contributor Author

trociny commented Sep 22, 2021

The Jenkins failure [1] was related to my last commit. It turned out that it is a normal situation when stat returns -ENOENT (the hash info does not exist on disk yet): we insert an "empty" hash info into the cache and return a reference to it, and it is properly initialized later. In this case the checks boolean param is false. So I updated my last patch not to fail when stat fails but checks is false.

I am also going to add the test.

[1] https://jenkins.ceph.com/job/ceph-api/24839/consoleFull

@trociny
Contributor Author

trociny commented Sep 22, 2021

jenkins test signed

1 similar comment
@tchaikov
Contributor

jenkins test signed

@trociny
Contributor Author

trociny commented Sep 23, 2021

Previously, in the case of an error we stored HashInfo(ec_impl->get_chunk_count())
in the cache and returned it, which could e.g. propagate to
non-primary shards, introducing inconsistency.

The function's `checks` flag is replaced with a `create` flag,
which has a clearer meaning here.

In be_deep_scrub, get_hash_info is still called with the second
argument false (i.e. `create=false`, where previously it was
`checks=false`); this is intentional.

Fixes: https://tracker.ceph.com/issues/48959
Signed-off-by: Mykola Golub <mgolub@suse.com>
Signed-off-by: Mykola Golub <mgolub@suse.com>
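
A rough standalone sketch of the revised decision logic described in the commit message above (the names and signatures are simplified stand-ins, not the real Ceph code): a stat failure is tolerated only when it is -ENOENT and the caller asked for creation; anything else is reported as a failure instead of fabricating an empty hinfo.

```cpp
// Simplified stand-in for the revised decision logic; not the real
// ECBackend::get_hash_info signature or types.
#include <cerrno>
#include <iostream>
#include <optional>
#include <vector>

struct HashInfo {
  explicit HashInfo(unsigned chunk_count) : chunk_hashes(chunk_count, 0) {}
  std::vector<uint32_t> chunk_hashes;
};

// stat_result: 0 on success, negative errno on failure.
// create: true on the write path (the object may legitimately not exist
//         yet); false e.g. on the deep-scrub path, which must not create.
std::optional<HashInfo> get_hash_info_new(int stat_result, bool create,
                                          unsigned chunk_count) {
  if (stat_result >= 0) {
    // ...read and decode the hinfo_key xattr here (omitted)...
    return HashInfo(chunk_count);
  }
  if (stat_result == -ENOENT && create) {
    // Object does not exist yet: handing out a fresh hinfo is expected.
    return HashInfo(chunk_count);
  }
  // Any other error (or -ENOENT without create): fail instead of
  // fabricating an empty hinfo that could propagate to other shards.
  return std::nullopt;
}

int main() {
  std::cout << std::boolalpha
            << bool(get_hash_info_new(0, false, 3)) << "\n"        // true
            << bool(get_hash_info_new(-ENOENT, true, 3)) << "\n"   // true
            << bool(get_hash_info_new(-ENOENT, false, 3)) << "\n"  // false
            << bool(get_hash_info_new(-EIO, true, 3)) << "\n";     // false
}
```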
@trociny
Contributor Author

trociny commented Sep 29, 2021

I think I have found a problematic case in ECBackend::get_hash_info that may be responsible for the inconsistency being introduced initially in our case (see my last commit).

If store->stat returns with an error for some reason, we create an "empty" hash info, which may propagate from the primary osd during backfill to the backfilling non-primary osd, introducing the inconsistency which may cause issues later.

Our users reported a case where it was still possible to introduce hinfo inconsistency on the non-primary osds when store->stat returned with an error, even with my patch applied, using the following steps:

1) make `store->stat` in `ECBackend::get_hash_info` return an error for a particular object (e.g. by modifying the code or by injecting the error with the stap tool);
2) issue a command to update metadata for the object:
    rados setxattr -p {pool} {object} testkey VALUE
3) check the hinfo_key attribute for this object with ceph-objectstore-tool and observe that it is invalid (partially initialized) on all non-primary osds.

So I updated my patch and made ECBackend::get_hash_info fail for any error. The only valid error is when stat returns -ENOENT (the object does not exist), which is expected when the create flag is set. This means that for a scenario like the above, the primary osd will crash in ECBackend::start_rmw ("get_hash_info returned a null pointer and there is no way to recover") instead of propagating the invalid hinfo.
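
A minimal sketch of the caller-side contract that follows from this (simplified, hypothetical names; the real code logs "get_hash_info returned a null pointer and there is no way to recover" and aborts rather than calling std::abort directly): a null result from get_hash_info is treated as unrecoverable instead of being propagated to other shards.

```cpp
// Simplified, hypothetical sketch of the caller-side contract; not the
// real ECBackend::start_rmw code.
#include <cstdlib>
#include <iostream>
#include <optional>

struct HashInfo { unsigned chunk_count = 0; };

// Hypothetical stand-in for the lookup that may now fail on stat errors.
std::optional<HashInfo> get_hash_info(bool stat_failed) {
  if (stat_failed) return std::nullopt;  // any unexpected error -> failure
  return HashInfo{3};
}

void start_write(bool stat_failed) {
  auto hinfo = get_hash_info(stat_failed);
  if (!hinfo) {
    // Rather than shipping a partially initialized hinfo to the other
    // shards, the primary gives up on the operation here.
    std::cerr << "get_hash_info returned null; no way to recover\n";
    std::abort();
  }
  std::cout << "proceeding with " << hinfo->chunk_count << " chunks\n";
}

int main() {
  start_write(false);   // normal path
  // start_write(true); // would abort, as the updated primary now does
}
```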

@trociny
Contributor Author

trociny commented Sep 29, 2021

I reran it through a limited rados suite subset after the last modification:

https://pulpito.ceph.com/trociny-2021-09-29_05:47:51-rados-wip-mgolub-testing-distro-basic-smithi/
https://pulpito.ceph.com/trociny-2021-09-28_18:51:21-rados-wip-mgolub-testing-distro-basic-smithi/

There were a couple of failures, but they do not look related.

@neha-ojha
Member

jenkins test make check

@neha-ojha
Member

@neha-ojha I have added the test.

Here are some limited teuthology runs with --filter ec-inconsistent-hinfo [1] and without filter to check for a regression [2, 3].

[1] https://pulpito.ceph.com/trociny-2021-09-23_14:58:16-rados-wip-mgolub-testing-distro-basic-smithi/
[2] https://pulpito.ceph.com/trociny-2021-09-22_14:10:47-rados-wip-mgolub-testing-distro-basic-smithi/
[3] https://pulpito.ceph.com/trociny-2021-09-23_10:52:44-rados-wip-mgolub-testing-distro-basic-smithi/

@trociny have you also verified the test fails with the same symptom as https://tracker.ceph.com/issues/48959 without your patch?

@trociny
Contributor Author

trociny commented Oct 1, 2021

@trociny have you also verified the test fails with the same symptom as https://tracker.ceph.com/issues/48959 without your patch?

https://pulpito.ceph.com/trociny-2021-10-01_06:28:15-rados-master-distro-basic-smithi/

@trociny
Contributor Author

trociny commented Oct 1, 2021

https://pulpito.ceph.com/trociny-2021-10-01_06:28:15-rados-master-distro-basic-smithi/

The tests failed due to an osd crash with the following backtrace:

/home/jenkins-build/build/workspace/ceph-dev-build/ARCH/x86_64/AVAILABLE_ARCH/x86_64/AVAILABLE_DIST/centos8/DIST/centos8/MACHINE_SIZE/gigantic/release/17.0.0-8011-g6939ea03/rpm/el8/BUILD/ceph-17.0.0-8011-g6939ea03/src/osd/ECBackend.cc: 608: FAILED ceph_assert(op.hinfo)

 ceph version 17.0.0-8011-g6939ea03 (6939ea034a21745448742eb1ff6186336a4fd48f) quincy (dev)
 1: (ceph::__ceph_assert_fail(char const*, char const*, int, char const*)+0x152) [0x56279cd3bdf4]
 2: ceph-osd(+0x5d2015) [0x56279cd3c015]
 3: (ECBackend::continue_recovery_op(ECBackend::RecoveryOp&, RecoveryMessages*)+0x1b3c) [0x56279d23579c]
 4: (ECBackend::run_recovery_op(PGBackend::RecoveryHandle*, int)+0x441) [0x56279d236d01]
 5: (PrimaryLogPG::recover_backfill(unsigned long, ThreadPool::TPHandle&, bool*)+0x26f9) [0x56279cfc3b69]
 6: (PrimaryLogPG::start_recovery_ops(unsigned long, ThreadPool::TPHandle&, unsigned long*)+0x10e3) [0x56279cfc7763]
 7: (OSD::do_recovery(PG*, unsigned int, unsigned long, ThreadPool::TPHandle&)+0x2f5) [0x56279ce42845]
 8: (ceph::osd::scheduler::PGRecovery::run(OSD*, OSDShard*, boost::intrusive_ptr<PG>&, ThreadPool::TPHandle&)+0x1d) [0x56279d104b3d]
 9: (OSD::ShardedOpWQ::_process(unsigned int, ceph::heartbeat_handle_d*)+0xac8) [0x56279ce60fe8]
 10: (ShardedThreadPool::shardedthreadpool_worker(unsigned int)+0x5c4) [0x56279d54ff94]
 11: (ShardedThreadPool::WorkThreadSharded::entry()+0x14) [0x56279d550a84]
 12: /lib64/libpthread.so.0(+0x814a) [0x7f419b04814a]
 13: clone()

Note, it is a bit different from what was reported in [1]: the failed assertion and backtrace are the same, but the error that caused get_hash_info to return null is different: in the reported case it was because the hinfo was invalid on disk, while in my case it was just removed. I could make my test behave exactly the same way (i.e. insert an invalid hinfo instead of just removing it), but I don't think it is worth complicating the test.

[1] https://tracker.ceph.com/issues/48959

Member

@neha-ojha neha-ojha left a comment


@trociny thanks for the detailed explanation of the bug and how you went about fixing it!

Your fix makes sense to me (at least on a first pass). I'd also like to get @athanatos's opinion on this, and we need thorough teuthology testing. I'll run it through a broader rados suite and see how that goes.

@knkonishi

Hi, @neha-ojha @ronen-fr @athanatos
Our customer actually had to suspend service due to this problem.
Without this fix, the customer remains at risk that this problem will recur and cause a service outage.
A recurrence of this problem would require a lot of recovery effort and would affect many end users of the storage service.
We would appreciate it if you could review the patch and fix this problem as soon as possible.

@athanatos
Contributor

athanatos commented Oct 14, 2021

@knkonishi That's a bit odd. I'm not aware of any outstanding bugs that result in this inconsistency; can you point me at the one you are experiencing?

Ah, https://tracker.ceph.com/issues/48959, but we don't have a root cause. Was it just one occurrence on your cluster?

Contributor

@athanatos athanatos left a comment


This PR seems perfect! Since it mainly touches the ec code, a backport should be entirely possible.

@knkonishi

@athanatos Thanks for your quick response and approval!

Ah, https://tracker.ceph.com/issues/48959, but we don't have a root cause. Was it just one occurrence on your cluster?

Yes, this just happened once.
In our customer's case, there were medium errors on the primary OSD's disk.
We assume that store->stat() returned an error due to these medium errors.

@ronen-fr
Contributor

ronen-fr commented Oct 19, 2021

(late to the party, but ...)

@trociny, @neha-ojha: I suspect that the change in ECBackend::get_hash_info() is the cause of a standalone/scrub/osd-scrub-repair.sh failure. The test expects an 'ec_size_error', but now it seems that the reported error is 'read_error'.

I will create a PR to fix the test.

ronen-fr added a commit to ronen-fr/ceph that referenced this pull request Oct 20, 2021
PR ceph#43239 has modified ECBackend::get_hash_info() behavior.
Modified the standalone scrub test to match.

Signed-off-by: Ronen Friedman <rfriedma@redhat.com>
ronen-fr added a commit that referenced this pull request Oct 20, 2021
tests: modify osd-scrub-repair to match PR #43239 changes

Reviewed-by: Mykola Golub <mgolub@suse.com>
trociny pushed a commit to trociny/ceph that referenced this pull request Oct 20, 2021
PR ceph#43239 has modified ECBackend::get_hash_info() behavior.
Modified the standalone scrub test to match.

Signed-off-by: Ronen Friedman <rfriedma@redhat.com>
(cherry picked from commit 52e9fa1)
trociny pushed a commit to trociny/ceph that referenced this pull request Oct 20, 2021
PR ceph#43239 has modified ECBackend::get_hash_info() behavior.
Modified the standalone scrub test to match.

Signed-off-by: Ronen Friedman <rfriedma@redhat.com>
(cherry picked from commit 52e9fa1)
trociny pushed a commit to trociny/ceph that referenced this pull request Oct 20, 2021
PR ceph#43239 has modified ECBackend::get_hash_info() behavior.
Modified the standalone scrub test to match.

Signed-off-by: Ronen Friedman <rfriedma@redhat.com>
(cherry picked from commit 52e9fa1)
trociny pushed a commit to trociny/ceph that referenced this pull request Oct 26, 2021
PR ceph#43239 has modified ECBackend::get_hash_info() behavior.
Modified the standalone scrub test to match.

Signed-off-by: Ronen Friedman <rfriedma@redhat.com>
(cherry picked from commit 52e9fa1)
Waadkh7 pushed a commit to rhcs-dashboard/ceph that referenced this pull request Nov 23, 2021
PR ceph#43239 has modified ECBackend::get_hash_info() behavior.
Modified the standalone scrub test to match.

Signed-off-by: Ronen Friedman <rfriedma@redhat.com>