
[DNM: TOO RISKY] nautilus: osd/ECBackend: don't remove in-flight backfill when missing is not primary #41293

Closed
wants to merge 4 commits

Conversation

trociny
Contributor

@trociny trociny commented May 12, 2021

First commit:

This is a direct commit to nautilus branch to recreate the
behaviour we have post-nautilus, after refactoring made in
8a8947d, which seems to be
correct.

The in-flight backfill prevents updating of "backfill complete"
position, which remains on the object before the missing oid. So
when pg is re-peered it retries backfill from this position
instead of entering the clean state.

Fixes: https://tracker.ceph.com/issues/50747
Signed-off-by: Mykola Golub <mgolub@suse.com>


Second commit:

backport tracker: https://tracker.ceph.com/issues/50792

backport of: #41270

parent tracker: https://tracker.ceph.com/issues/50351
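
For illustration only, a minimal sketch of the pinning behaviour the first commit describes; the type and helper names are assumptions (ObjectId stands in for hobject_t), not the actual Ceph code. The "backfill complete" position can only advance up to the earliest object still listed in backfills_in_flight, so leaving the failed oid in that set keeps the position just before the missing object:

    #include <algorithm>
    #include <set>
    #include <string>

    using ObjectId = std::string;  // stand-in for hobject_t

    // Hypothetical helper: the position may only advance up to the earliest
    // object that is still being backfilled.
    ObjectId backfill_complete_pos(const std::set<ObjectId>& backfills_in_flight,
                                   const ObjectId& scan_pos) {
      if (backfills_in_flight.empty())
        return scan_pos;                                        // nothing pending: advance freely
      return std::min(*backfills_in_flight.begin(), scan_pos);  // pinned at the earliest in-flight oid
    }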

@trociny trociny added this to the nautilus milestone May 12, 2021
@trociny
Contributor Author

trociny commented May 12, 2021

It will also require the backport of #41270

In PrimaryLogPG::on_failed_pull, we unconditionally remove soid
from recovering list, but remove it from backfills_in_flight only
when the backfill source is the primary osd.

Fixes: https://tracker.ceph.com/issues/50351
Signed-off-by: Mykola Golub <mgolub@suse.com>
(cherry picked from commit 9b78e00)
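
As an illustration of the guard this commit message describes (the signature and names below are assumptions, not the verbatim PrimaryLogPG::on_failed_pull interface): the oid is always dropped from the recovering list, but stays in backfills_in_flight unless the missing shard is on the primary.

    #include <set>
    #include <string>

    using ObjectId = std::string;  // stand-in for hobject_t

    // Hedged sketch only: parameter names are hypothetical.
    void on_failed_pull_sketch(std::set<ObjectId>& recovering,
                               std::set<ObjectId>& backfills_in_flight,
                               const ObjectId& soid,
                               bool missing_on_primary) {
      recovering.erase(soid);               // unconditional, as before
      if (missing_on_primary)
        backfills_in_flight.erase(soid);    // only when the primary itself is missing the shard
      // otherwise the in-flight entry keeps the backfill position pinned before soid
    }
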
@smithfarm
Contributor

jenkins test make check

@smithfarm smithfarm added nautilus-batch-1 nautilus point releases needs-qa labels May 13, 2021
@jdurgin jdurgin changed the base branch from nautilus to nautilus-saved May 14, 2021 21:58
@jdurgin jdurgin changed the base branch from nautilus-saved to nautilus May 14, 2021 21:58
Member

@neha-ojha neha-ojha left a comment


This approach looks reasonable to me! @athanatos I'd like to get your thoughts as well.
On a different note, it'd be great to add a test (in master and then backport) to replicate what was done manually in https://tracker.ceph.com/issues/50351, to catch future bugs in this area.

@neha-ojha neha-ojha requested a review from athanatos May 19, 2021 21:27
@athanatos
Contributor

I'm not sure about this. Why don't we call on_primary_error or similar? I'm also kind of leery of backporting something like this to nautilus. What's the actual consequence of this bug?

@trociny
Contributor Author

trociny commented May 20, 2021

@athanatos The actual consequence of this bug is that, without this change, the last_backfill position does not stop before the "unfound" object but advances, and after the backfill is finished it is at MAX. The pg enters the backfill_unfound state. But when re-peering happens, the state is reset to clean. It looks confusing to users, who believe the unfound issue is resolved while it actually is not.

With the fix the in-flight backfill prevents updating the last_backfill position, and it remains on the object before the missing oid. So when the pg is re-peered it retries backfill from this position instead of entering the clean state.

You can find more details in my emails to ceph-dev [1].

Why don't we call on_primary_error or similar?

I am not sure what you mean. When the error is on the primary, the current code behaves correctly (that is, the same as on master). It behaves differently when the missing shard is not on the primary -- we want to skip removing the oid from backfills_in_flight in this case. I am not sure how on_primary_error could be used here.

My approach looks a bit hackish, but as it is a direct commit to a maintenance branch I wanted to keep it as simple as possible.

[1] https://lists.ceph.io/hyperkitty/list/dev@ceph.io/thread/OJEZB4YEGWV3EUJPFGLG2O3VHERR5ATI/

@trociny
Contributor Author

trociny commented May 20, 2021

@neha-ojha thanks. I will work on the test and will push a separate PR for it.

@yuriw
Contributor

yuriw commented May 20, 2021

@athanatos
Contributor

athanatos commented May 20, 2021

It doesn't sound like that would prevent an upgrade. This pathway is pretty complicated, and this seems like quite a risky patch to a stable branch without a test, particularly considering that that pathway is poorly tested anyway.

@athanatos
Contributor

athanatos commented May 20, 2021

PrimaryLogPG::primary_error on nautilus appears to update the primary's missing set as on current master, so that should be ok.

I think this patch replicates the behavior on master w.r.t. backfills_in_flight. @trociny If you've carefully tested this and related scenarios, it might be ok to merge, but I'd characterize this as a very risky backport. It might actually be safer to try to backport the refactor (just the patch in question, not the whole PeeringState refactor).

@athanatos
Contributor

athanatos commented May 20, 2021

With this patch, does restarting the primary still clear the unfound state?

@neha-ojha
Member

neha-ojha commented May 20, 2021

@trociny @athanatos I agree this patch is risky (especially at this stage in nautilus) and needs more testing than one rados suite run. It would really help to add a test to reproduce the issue and confirm that this patch fixes the problem with no side effects. Also, is there a workaround that can be used for https://tracker.ceph.com/issues/50351 instead?

@trociny
Contributor Author

trociny commented May 21, 2021

It might actually be safer to try to backport the refactor (just the patch in question, not the whole PeeringState refactor).

@athanatos Although 8a8947d is cherry-pickable (just with some rather trivial conflicts), it relies on the changes introduced earlier in d33a8b8. And backporting that (or just the needed parts) looks like hell and much more risky.

With my current approach I am sure that I introduce the change (and potential issues) only for ECBackend, and only for the case when an error occurs for a non-primary shard during backfill. Backporting 8a8947d (and all the other bits it would need) would affect both the EC and Replicated backends and a much larger variety of cases.

@trociny
Contributor Author

trociny commented May 21, 2021

With this patch, does restarting the primary still clear the unfound state?

Yes. The same as on master currently. When the primary is temporarily down, another osd becomes primary. The last_backfill position is still (properly) set to an object before the "unfound" one, but as the unfound object's shard is missing on this new primary osd, the object is not returned by collection_list here and is skipped from backfilling. So the issue is not detected, the backfill finishes successfully and the pg enters the clean state. It becomes similar to what was recently discussed in this thread [1].

[1] https://lists.ceph.io/hyperkitty/list/dev@ceph.io/thread/GYZAXV2JO5JK4J3L2G4NCYVFFWRIX5VD/
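
A toy sketch of the effect described above (not the actual Ceph ObjectStore API; the helper is hypothetical): the backfill scan only enumerates objects present in the local store, so an object whose shard is missing on the new primary never enters the scan interval and is never flagged.

    #include <cstddef>
    #include <set>
    #include <string>
    #include <vector>

    using ObjectId = std::string;  // stand-in for hobject_t

    // Hypothetical scan helper: returns up to `max` locally present objects after `begin`.
    std::vector<ObjectId> scan_interval(const std::set<ObjectId>& locally_present,
                                        const ObjectId& begin, std::size_t max) {
      std::vector<ObjectId> out;
      for (auto it = locally_present.upper_bound(begin);
           it != locally_present.end() && out.size() < max; ++it)
        out.push_back(*it);   // the object with the missing shard simply never appears here
      return out;
    }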

@trociny
Contributor Author

trociny commented May 21, 2021

@trociny @athanatos I agree this patch is risky (especially at this stage in nautilus) and needs more testing than one rados suite run. It would really help to add a test to reproduce the issue and confirm that this patch fixes the problem with no side effects.

I have started working on the test. My first thought was to use as an example the tests test_backfill.sh, test_lost.sh, test_missing_unfound.sh, which I found in src/test. But they do not run on teuthology and seem not to be executed in any CI tests. So I think I have to add a new yaml to qa/suites/rados/singleton-nomsgr/all, using recovery-unfound-found.yaml as an example. Any suggestions are appreciated.

Also, is there a workaround that can be used for https://tracker.ceph.com/issues/50351 instead?

I am not sure I understand. The backport for 50351 in the nautilus branch is needed only if we introduce the post-nautilus behavior, i.e. as an addition to my first "risky" commit. Without the first commit it will not crash there, it will just enter a clean state.

@trociny
Contributor Author

trociny commented May 21, 2021

Also, is there a workaround that can be used for https://tracker.ceph.com/issues/50351 instead?

I am not sure I understand. The backport for 50351 in the nautilus branch is needed only if we introduce the post-nautilus behavior, i.e. as an addition to my first "risky" commit. Without the first commit it will not crash there, it will just enter a clean state.

Probably you meant a workaround for the backfill_unfound state reset issue (50747), not for the crash? Then a workaround is to run scrub, which will detect the unfound objects and the pg will enter the inconsistent state. But the user will not be aware of the issue (they will think it has magically resolved itself) until the scrub is run.

@athanatos
Contributor

With this patch, does restarting the primary still clear the unfound state?

Yes. The same as on master currently. When the primary is temporarily down, another osd becomes primary. The last_backfill position is still (properly) set to an object before the "unfound" one, but as the unfound object's shard is missing on this new primary osd, the object is not returned by collection_list here and is skipped from backfilling. So the issue is not detected, the backfill finishes successfully and the pg enters the clean state. It becomes similar to what was recently discussed in this thread [1].

[1] https://lists.ceph.io/hyperkitty/list/dev@ceph.io/thread/GYZAXV2JO5JK4J3L2G4NCYVFFWRIX5VD/

Ok. Once we've got this backport sorted, we probably want to revisit the fix on master. Rather than blocking backfill at the unfound object, we probably want to add the offending object to the durably recorded missing sets where it's missing and proceed with backfill. This probably means adding a force_missing message of some form.

@athanatos
Contributor

I have started working on the test. My first thought was to use as an example the tests test_backfill.sh, test_lost.sh, test_missing_unfound.sh, which I found in src/test. But they do not run on teuthology and seem not to be executed in any CI tests. So I think I have to add a new yaml to qa/suites/rados/singleton-nomsgr/all, using recovery-unfound-found.yaml as an example. Any suggestions are appreciated.

Ah, I have found useful examples in qa/tasks, like ec_lost_unfound.py.

Yeah, that's probably a good place to start.

The osd may be 0.

Signed-off-by: Mykola Golub <mgolub@suse.com>
(cherry picked from commit 93139db)
Fixes: https://tracker.ceph.com/issues/50925
Signed-off-by: Mykola Golub <mgolub@suse.com>
(cherry picked from commit 0a9f7e1)

Conflicts:
    qa/suites/rados/singleton/all/ec-backfill-unfound.yaml:
        log-whitelist instead of log-ignorelist
@trociny
Contributor Author

trociny commented May 26, 2021

I added the teuthology test, backported from #41532. Note the master PR is not merged yet, so the test might need re-backporting later.

Below are results when scheduling with --filter ec-backfill-unfound and --subset 0/5 [1] and --subset 1/5 [2].

[1] https://pulpito.ceph.com/trociny-2021-05-26_06:18:46-rados-wip-mgolub-testing-nautilus-distro-basic-smithi/
[2] https://pulpito.ceph.com/trociny-2021-05-26_06:17:04-rados-wip-mgolub-testing-nautilus-distro-basic-smithi/

@smithfarm smithfarm added nautilus-batch-1 nautilus point releases needs-qa labels Jun 4, 2021
@smithfarm
Contributor

@neha-ojha @jdurgin @athanatos Is there still hope to get this into 14.2.22?

@athanatos
Contributor

@neha-ojha Where did we end up on this one?

@neha-ojha
Member

@neha-ojha @jdurgin @athanatos Is there still hope to get this into 14.2.22?

@athanatos @smithfarm This is waiting on @trociny to compare test runs from nautilus and master to figure out why we are not seeing #41532 (comment) in nautilus. Once we have a better understanding of the root cause and whether it poses a threat to nautilus, we can make a call on this PR.

@trociny any updates? Let's also run this through the upgrade suite.

@trociny
Contributor Author

trociny commented Jun 5, 2021

@trociny any updates?

I am still in the process of investigating it. The current findings:

The crash is not filestore specific. The important condition seems to be that after a non-primary osd is stopped, we are in the "backfilling" state with that osd missing from the acting set, and then the osd boots and the peering event comes. Filestore osds seem to start faster, which is why we were able to see the crash here, while with bluestore the backfill completed before the osd booted. I was able to reproduce the issue with bluestore too by increasing the backfill time (increasing the number of objects in the pg) [1].

Looking at the crash log [2] one can see that just before the crash a PGPeeringEvent comes while the pg is in the active+undersized+degraded+remapped+backfilling state with the [1,NONE,2] acting set:

   -54> 2021-06-02T16:46:55.167+0000 7fd172a4d700 20 osd.1 op_wq(0) _process OpSchedulerItem(1.0s0 PGPeeringEvent(epoch_sent: 43 epoch_requested: 43 MLogRec from 3(1) log log((16'102016,16'102021], crt=16'102021) pi ([0,0] all_participants= intervals=) +create_info) prio 255 cost 10 e43) pg 0x56079df93000
   -53> 2021-06-02T16:46:55.167+0000 7fd172a4d700 10 osd.1 pg_epoch: 43 pg[1.0s0( v 16'102021 (16'101986,16'102021] local-lis/les=40/41 n=102021 ec=13/13 lis/c=40/35 les/c/f=41/36/0 sis=40) [1,0,2]/[1,NONE,2]p1(0) backfill=[0(1)] r=0 lpr=40 pi=[35,40)/1 crt=16'102021 lcod 16'102020 mlcod 0'0 active+undersized+degraded+remapped+backfilling rops=1 mbc={0={(1+0)=1},1={(0+0)=1},2={(0+0)=1}}] do_peering_event: epoch_sent: 43 epoch_requested: 43 MLogRec from 3(1) log log((16'102016,16'102021], crt=16'102021) pi ([0,0] all_participants= intervals=) +create_info

And when PrimaryLogPG::recover_replicas is called after this, it bypasses the if (recovery_state.get_missing_loc().is_unfound(soid)) condition [3] for the test object and aborts with the "object added to missing set for backfill, but is not in recovering" error. In the logs you can find that when PrimaryLogPG::recover_replicas was called before the PGPeeringEvent came, the is_unfound condition returned true ("still unfound" messages were logged).

So my current assumption is that when the PGPeeringEvent comes it updates missing_loc in such a way that is_unfound no longer returns true, but I have still failed to figure out where exactly this happens. It is not clear to me from the log, and it is problematic to reproduce locally. Maybe you have some ideas?

In nautilus we have a similar condition in PrimaryLogPG::recover_replicas [4], but looking at the osd log here for the filestore case, it looks like the test reproduces the conditions and the PGPeeringEvent comes while we are backfilling, yet when PrimaryLogPG::recover_replicas is called later the object is still "unfound":

2021-05-26 06:37:53.101 7f9a90c96700 20 osd.1 op_wq(0) _process 1.0s0 to_process <OpQueueItem(1.0s0 PGPeeringEvent(epoch_sent: 36 epoch_requested: 36 MLogRec from 3(1) +create_info) prio 255 cost 10 e36),OpQueueItem(1.0s0 PGOpItem(op=pg_scan(digest 1.0s0 1:f8d56238:::benchmark_data_smithi116_19237_object11416:head-MAX e 36/34) v2) prio 127 cost 33150 e36)> waiting <> waiting_peering {}
2021-05-26 06:37:53.101 7f9a90c96700 20 osd.1 op_wq(0) _process OpQueueItem(1.0s0 PGPeeringEvent(epoch_sent: 36 epoch_requested: 36 MLogRec from 3(1) +create_info) prio 255 cost 10 e36) pg 0xbfcf400
2021-05-26 06:37:53.101 7f9a90c96700 10 osd.1 pg_epoch: 36 pg[1.0s0( v 15'12248 (15'12177,15'12248] local-lis/les=34/35 n=12248 ec=12/12 lis/c 34/29 les/c/f 35/30/0 31/34/12) [1,0,2]/[1,2147483647,2]p1(0) backfill=[0(1)] r=0 lpr=34 pi=[29,34)/1 rops=1 crt=15'12248 lcod 15'12247 mlcod 0'0 active+undersized+remapped+backfilling mbc={}] do_peering_event: epoch_sent: 36 epoch_requested: 36 MLogRec from 3(1) +create_info
...
2021-05-26 06:37:53.105 7f9a94c9e700 10 osd.1 pg_epoch: 36 pg[1.0s0( v 15'12248 (15'12177,15'12248] local-lis/les=34/35 n=12248 ec=12/12 lis/c 34/29 les/c/f 35/30/0 31/34/12) [1,0,2]/[1,2147483647,2]p1(0) backfill=[0(1)] r=0 lpr=34 pi=[29,34)/1 crt=15'12248 lcod 15'12247 mlcod 0'0 active+undersized+remapped+backfilling mbc={0={(0+0)=1},1={(0+1)=1},2={(0+0)=1}}] recover_replicas: 1:ad551702:::test:head still unfound

My current assumption is that missing_loc is handled differently in nautilus on a PGPeeringEvent.

[1] https://pulpito.ceph.com/trociny-2021-06-02_13:57:51-rados-master-distro-basic-smithi/
[2] http://qa-proxy.ceph.com/teuthology/trociny-2021-06-02_13:57:51-rados-master-distro-basic-smithi/6145804/remote/smithi193/crash/posted/2021-06-02T16%3A46%3A55.174344Z_240586b1-fca5-4c32-98a6-8c5b01a87c55/
[3]

if (recovery_state.get_missing_loc().is_unfound(soid)) {

[4]
if (missing_loc.is_unfound(soid)) {
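
For reference, a toy model of the recover_replicas decision described above; only the branch structure follows the snippets at [3]/[4], while the helper names and types are hypothetical:

    #include <cassert>
    #include <set>
    #include <string>

    using ObjectId = std::string;  // stand-in for hobject_t

    enum class Action { Skip, Recover, Abort };

    Action recover_replicas_step(const ObjectId& soid, bool unfound,
                                 const std::set<ObjectId>& recovering) {
      if (unfound)
        return Action::Skip;       // logged as "still unfound" in the nautilus run
      if (!recovering.count(soid))
        return Action::Abort;      // the "object added to missing set for backfill,
                                   // but is not in recovering" abort from the master crash
      return Action::Recover;
    }

    int main() {
      std::set<ObjectId> recovering;  // the test object was never queued for recovery
      assert(recover_replicas_step("test:head", /*unfound=*/true,  recovering) == Action::Skip);
      assert(recover_replicas_step("test:head", /*unfound=*/false, recovering) == Action::Abort);
    }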

@trociny
Contributor Author

trociny commented Jun 5, 2021

So my current assumption is that when the PGPeeringEvent comes it updates missing_loc in such a way that is_unfound no longer returns true, but I have still failed to figure out where exactly this happens.

After adding some prints, I now see why is_unfound starts returning false.

PeeringState::Active::react calls ps->search_for_missing, which calls MissingLoc::add_source_info, and according to this message in the log:

   -50> 2021-06-05T11:30:25.832+0000 7f448dfab700 10 osd.1 pg_epoch: 42 pg[1.0s0( v 16'64935 (16'64850,16'64935] local-lis/les=40/41 n=64935 ec=13/13 lis/c=40/35 les/c/f=41/36/0 sis=40) [1,0,2]/[1,NONE,2]p1(0) backfill=[0(1)] r=0 lpr=40 pi=[35,40)/1 crt=16'64935 lcod 16'64934 mlcod 0'0 active+undersized+degraded+remapped+backfilling rops=1 mbc={0={(1+0)=1},1={(0+0)=1},2={(0+0)=1}}] search_for_missing 1:ad551702:::test:head 16'2 is on osd.3(1)

It adds osd.3(1) to missing_loc (although osd.3 is actually missing the object too). So when is_unfound [1] is called later, missing_loc contains two peers (`mit->second = 1(0),3(1)`) and `(*is_recoverable)(mit->second)` returns true (and `is_unfound` returns false).

I still can't understand why in the nautilus case is_unfound returns true here. In the log I see that add_source_info is called in this case too, adding osd.3(1) to missing_loc, but is_unfound still returns true:

2021-05-26 06:37:53.101 7f9a90c96700 10 osd.1 pg_epoch: 36 pg[1.0s0( v 15'12248 (15'12177,15'12248] local-lis/les=34/35 n=12248 ec=12/12 lis/c 34/29 les/c/f 35/30/0 31/34/12) [1,0,2]/[1,2147483647,2]p1(0) backfill=[0(1)] r=0 lpr=34 pi=[29,34)/1 rops=1 crt=15'12248 lcod 15'12247 mlcod 0'0 active+undersized+remapped+backfilling mbc={}] search_for_missing 1:ad551702:::test:head 15'2 is on osd.3(1)
...
2021-05-26 06:37:53.105 7f9a94c9e700 10 osd.1 pg_epoch: 36 pg[1.0s0( v 15'12248 (15'12177,15'12248] local-lis/les=34/35 n=12248 ec=12/12 lis/c 34/29 les/c/f 35/30/0 31/34/12) [1,0,2]/[1,2147483647,2]p1(0) backfill=[0(1)] r=0 lpr=34 pi=[29,34)/1 crt=15'12248 lcod 15'12247 mlcod 0'0 active+undersized+remapped+backfilling mbc={0={(0+0)=1},1={(0+1)=1},2={(0+0)=1}}] recover_replicas: 1:ad551702:::test:head still unfound

[1]

bool is_unfound(const hobject_t &hoid) const {

@trociny
Contributor Author

trociny commented Jun 6, 2021

This is waiting on @trociny to compare test runs from nautilus and master to figure out why we are not seeing #41532 (comment) in nautilus.

So the difference between master and nautilus is that when the object error (missing shards) is detected during backfill, master (properly) adds the primary osd to missing_loc (PrimaryLogPG::on_failed_pull calls recovery_state.force_object_missing, which calls missing_loc.rebuild), so after this missing_loc contains one peer (missing_loc.is_unfound properly returns true at this time). Then when the restarted osd boots and the PGPeeringEvent comes, search_for_missing -> add_source_info is called, which (improperly?) adds the booted osd to missing_loc, so it now contains 2 peers (the object is "recoverable") and missing_loc.is_unfound returns false, triggering the abort.

In nautilus, when the object error (missing shards) is detected during backfill, the primary osd is not added to missing_loc (missing_loc.is_unfound returns true because mit == missing_loc.end() is true). And when the restarted osd boots and the PGPeeringEvent comes and adds the booted osd to missing_loc, it contains only one (this) peer, so missing_loc.is_unfound still returns true.
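
To make the difference concrete, here is a toy model (assuming a k=2 EC pool; all names are hypothetical and this is not the Ceph implementation) that follows the is_unfound logic quoted earlier: an object is unfound when no locations are recorded at all, or when the recorded shards do not satisfy the recoverability predicate.

    #include <cassert>
    #include <set>
    #include <string>

    using Shard = std::string;

    // Assumption: for a k=2 EC pool, two located shards are enough to recover.
    bool is_recoverable_k2(const std::set<Shard>& locs) { return locs.size() >= 2; }

    bool is_unfound(const std::set<Shard>& locs) {
      if (locs.empty()) return true;        // the mit == missing_loc.end() case
      return !is_recoverable_k2(locs);
    }

    int main() {
      // master: force_object_missing records the primary shard 1(0),
      // then the peering event (add_source_info) adds the booted osd 3(1).
      std::set<Shard> master_locs = {"1(0)"};
      assert(is_unfound(master_locs));       // unfound before the peering event
      master_locs.insert("3(1)");
      assert(!is_unfound(master_locs));      // "recoverable" after it -> triggers the abort

      // nautilus: nothing is recorded on the failed pull; peering adds only 3(1).
      std::set<Shard> nautilus_locs;
      assert(is_unfound(nautilus_locs));
      nautilus_locs.insert("3(1)");
      assert(is_unfound(nautilus_locs));     // still unfound -> no abort, matching the logs
    }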

@yuriw
Contributor

yuriw commented Jun 7, 2021

tests passed https://trello.com/c/EPPqwRyb

@yuriw yuriw added the TESTED label Jun 7, 2021
@neha-ojha
Member

Thanks for your analysis @trociny! The problem is that though osd.3(1) is missing the object, it advertises 0 missing to the primary. This ties back to https://trello.com/c/swaxVPq8/722-osd-bluestore-persist-flags-on-objects-to-generate-an-eio-add-tests-to-exercise-it

2021-05-26T06:09:06.882+0000 7fd7f73a7700 10 osd.1 pg_epoch: 41 pg[1.0s0( v 16'34826 (16'34797,16'34826] local-lis/les=39/40 n=34826 ec=13/13 lis/c=39/34 les/c/f=40/35/0 sis=39) [1,0,2]/[1,NONE,2]p1(0) backfill=[0(1)] r=0 lpr=39 pi=[34,39)/1 crt=16'34826 lcod 16'34825 mlcod 0'0 active+undersized+degraded+remapped+backfilling rops=1 mbc={0={(1+0)=1},1={(0+0)=1},2={(0+0)=1}}] proc_replica_log for osd.3(1): 1.0s1( v 16'34826 (16'34821,16'34826] local-lis/les=37/38 n=34826 ec=13/13 lis/c=39/34 les/c/f=40/35/0 sis=39) log((16'34821,16'34826], crt=16'34826) missing(0 may_include_deletes = 1)
...
2021-05-26T06:09:06.882+0000 7fd7f73a7700 10 osd.1 pg_epoch: 41 pg[1.0s0( v 16'34826 (16'34797,16'34826] local-lis/les=39/40 n=34826 ec=13/13 lis/c=39/34 les/c/f=40/35/0 sis=39) [1,0,2]/[1,NONE,2]p1(0) backfill=[0(1)] r=0 lpr=39 pi=[34,39)/1 crt=16'34826 lcod 16'34825 mlcod 0'0 active+undersized+degraded+remapped+backfilling rops=1 mbc={0={(1+0)=1},1={(0+0)=1},2={(0+0)=1}}] search_for_missing 1:ad551702:::test:head 16'2 is on osd.3(1)

above is from /a/trociny-2021-05-26_04:56:24-rados-master-distro-basic-smithi/6135980

I appreciate your efforts in finding the root cause of the problem. This PR may work around the backfill_unfound edge case if an osd restarts, but I do not feel confident that this patch will not introduce any other problems. Since this is the last nautilus point release, we won't be able to patch any future bugs in this area. I would really like for us to invest in fixing this issue in master with a more robust solution and provide #41293 (comment) as a workaround for nautilus.

@trociny
Contributor Author

trociny commented Jun 9, 2021

@neha-ojha Thanks! I totally agree with you that trying to fix this in nautilus at this stage is high risk.

@neha-ojha neha-ojha added DNM and removed nautilus-batch-1 nautilus point releases labels Jun 9, 2021
@yuriw yuriw removed the needs-qa label Jun 9, 2021
@smithfarm smithfarm changed the title nautilus: osd/ECBackend: don't remove in-flight backfill when missing is not primary [DNM: TOO RISKY] nautilus: osd/ECBackend: don't remove in-flight backfill when missing is not primary Jul 20, 2021
@trociny trociny closed this Jul 27, 2021