[DNM: TOO RISKY] nautilus: osd/ECBackend: don't remove in-flight backfill when missing is not primary #41293
Conversation
This is a direct commit to the nautilus branch to recreate the behaviour we have post-nautilus, after the refactoring made in 8a8947d, which seems to be correct. The in-flight backfill prevents updating the "backfill complete" position, which remains on the object before the missing oid. So when the pg is re-peered it retries backfill from this position instead of entering the clean state. Fixes: https://tracker.ceph.com/issues/50747 Signed-off-by: Mykola Golub <mgolub@suse.com>
It will also require the backport of #41270
In PrimaryLogPG::on_failed_pull, we unconditionally remove soid from recovering list, but remove it from backfills_in_flight only when the backfill source is the primary osd. Fixes: https://tracker.ceph.com/issues/50351 Signed-off-by: Mykola Golub <mgolub@suse.com> (cherry picked from commit 9b78e00)
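The behavior the commit message describes can be sketched roughly as follows. This is a simplified model, not the actual Ceph implementation: `hobject_t`, `pg_shard_t`, and the `PG` state here are illustrative stand-ins for the real types.

```cpp
#include <cassert>
#include <set>
#include <string>

// Simplified stand-ins for the real Ceph types.
struct hobject_t {
  std::string oid;
  bool operator<(const hobject_t& o) const { return oid < o.oid; }
};

struct pg_shard_t {
  int osd;
  bool operator==(const pg_shard_t& o) const { return osd == o.osd; }
};

struct PG {
  pg_shard_t primary;
  std::set<hobject_t> recovering;
  std::set<hobject_t> backfills_in_flight;

  // On a failed pull, the object always leaves `recovering`, but it is
  // dropped from `backfills_in_flight` only when the missing shard is on
  // the primary; otherwise it stays there and keeps the "backfill
  // complete" position from advancing past it.
  void on_failed_pull(const pg_shard_t& from, const hobject_t& soid) {
    recovering.erase(soid);
    if (from == primary)
      backfills_in_flight.erase(soid);
  }
};
```

In this model, a failed pull from a non-primary shard leaves the object in `backfills_in_flight`, which is exactly what keeps last_backfill pinned before the missing oid.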
jenkins test make check
This approach looks reasonable to me! @athanatos I'd like to get your thoughts as well.
On a different note, it'd be great to add a test (in master and then backport) to replicate what was done manually in https://tracker.ceph.com/issues/50351, to catch future bugs in this area.
I'm not sure about this. Why don't we call on_primary_error or similar? I'm also kind of leery of backporting something like this to nautilus. What's the actual consequence of this bug?
@athanatos The actual consequence of this bug is that without this change the pg enters the clean state after re-peering, losing the backfill_unfound state. With the fix the in-flight backfill prevents updating the last_backfill position, and it remains on the object before the missing oid. So when the pg is re-peered it retries backfill from this position instead of entering the clean state. You can find more details in my emails to ceph-dev [1].
I am not sure what you mean. When the error is on the primary the current code behaves correctly (well, I mean the same as on master). It behaves differently when the missing shard is not on the primary: in that case we want to skip removing the object from backfills_in_flight. My approach looks a bit hackish, but as it is a direct commit in a maintenance branch I wanted to make it as simple as possible. [1] https://lists.ceph.io/hyperkitty/list/dev@ceph.io/thread/OJEZB4YEGWV3EUJPFGLG2O3VHERR5ATI/
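The effect described above (an object left in backfills_in_flight pins the "backfill complete" position) can be sketched like this. The function name and logic are simplified assumptions for illustration, not the real PrimaryLogPG code:

```cpp
#include <algorithm>
#include <cassert>
#include <set>
#include <string>

// last_backfill may only advance up to the smallest object still listed
// in backfills_in_flight; a retained in-flight entry therefore pins the
// position, and a re-peered pg restarts backfill from there instead of
// going clean.
std::string update_last_backfill(const std::set<std::string>& backfills_in_flight,
                                 const std::string& scanned_up_to) {
  if (backfills_in_flight.empty())
    return scanned_up_to;  // nothing pending: advance to the scan position
  // do not advance past the oldest (smallest) in-flight object
  return std::min(scanned_up_to, *backfills_in_flight.begin());
}
```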
@neha-ojha thanks. I will work on the test and will push a separate PR for it.
It doesn't sound like that would prevent an upgrade. This pathway is pretty complicated, and this seems like quite a risky patch to a stable branch without a test, particularly considering that the pathway is poorly tested anyway.
PrimaryLogPG::primary_error on nautilus appears to update the primary's missing set as on current master, so that should be ok. I think this patch replicates the behavior on master w.r.t. backfills_in_flight. @trociny If you've carefully tested this and related scenarios, it might be ok to merge, but I'd characterize this as a very risky backport. It might actually be safer to try to backport the refactor (just the patch in question, not the whole PeeringState refactor).
With this patch, does restarting the primary still clear the unfound state?
@trociny @athanatos I agree this patch is risky (especially at this stage in nautilus) and needs more testing than one rados suite run. It would really help to add a test to reproduce the issue and confirm that this patch fixes the problem with no side effects. Also, is there a workaround that can be used for https://tracker.ceph.com/issues/50351 instead?
@athanatos Although 8a8947d is cherry-pickable (just with some rather trivial conflicts), it relies on the changes introduced earlier in d33a8b8, and backporting that (or just the needed changes) looks like hell and much more risky. With my current approach I am sure that I introduce the change (and potential issues) only for ECBackend, and only for the case when the missing shard is not on the primary.
Yes. The same as it is on master currently. When the primary is temporarily down, another osd becomes primary. …

[1] https://lists.ceph.io/hyperkitty/list/dev@ceph.io/thread/GYZAXV2JO5JK4J3L2G4NCYVFFWRIX5VD/
I have started working on the test. My first thought was to use as an example the tests …
I am not sure I understand. The backport for 50351 in the nautilus branch is needed only if we introduce the post-nautilus behavior, i.e. as an addition to my first "risky" commit. Without the first commit it will not crash there, it will just enter a clean state.
Probably you meant a workaround for the backfill_unfound state reset issue (50747), not for the crash? Then a workaround is to run scrub, which will detect the unfound objects, and the pg will enter the inconsistent state. But the user will not be aware of the issue (will think it has magically resolved) until the scrub is run.
Ok. Once we've got this backport sorted, we probably want to revisit the fix on master. Rather than blocking backfill at the unfound object, we probably want to add the offending object to the durably recorded missing sets where it's missing and proceed with backfill. This probably means adding a force_missing message of some form.
Yeah, that's probably a good place to start.
The osd may be 0. Signed-off-by: Mykola Golub <mgolub@suse.com> (cherry picked from commit 93139db)
Fixes: https://tracker.ceph.com/issues/50925 Signed-off-by: Mykola Golub <mgolub@suse.com> (cherry picked from commit 0a9f7e1) Conflicts: qa/suites/rados/singleton/all/ec-backfill-unfound.yaml: log-whitelist instead of log-ignorelist
I added the teuthology test, backported from #41532. Note the master PR is not merged yet, so the test might need re-backporting later. Below are the results when scheduling with …

[1] https://pulpito.ceph.com/trociny-2021-05-26_06:18:46-rados-wip-mgolub-testing-nautilus-distro-basic-smithi/
@neha-ojha @jdurgin @athanatos Is there still hope to get this into 14.2.22? |
@neha-ojha Where did we end up on this one? |
@athanatos @smithfarm This is waiting on @trociny to compare test runs from nautilus and master to figure out why we are not seeing #41532 (comment) in nautilus. Once we have a better understanding of the root cause and whether it poses a threat to nautilus, we can make a call on this PR. @trociny any updates? Let's also run this through the upgrade suite.
I am still in the process of investigating it. The current findings: the crash is not filestore specific. The important condition seems to be that after a non-primary osd is stopped, we are in the "backfilling" state with that osd missing in the active set, and then the osd boots and the peering event comes. Filestore osds seem to start faster, which is why we were able to see the crash here, while with bluestore the backfill completed before the osd booted. And I was able to reproduce the issue with bluestore too by increasing the backfill time (increasing the number of objects in the pg) [1]. Looking at the crash log [2], one can see that just before the crash …
And when … So my current assumption is that when … In nautilus we have a similar condition in …
My current assumption is that …

[1] https://pulpito.ceph.com/trociny-2021-06-02_13:57:51-rados-master-distro-basic-smithi/

Line 13631 in 9370ac9
[4] Line 12937 in e48b581
After adding some prints, I see now why …

It adds osd.3(1) to … I still can't understand why in the nautilus case …
[1] Line 147 in 9370ac9
So the difference between master and nautilus is that when an object error (missing shards) is detected during backfill, master (properly) adds the primary osd to …, while in nautilus the primary osd is not added to …
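The master/nautilus difference summarized above might be modeled like this. The structure and names are hypothetical (the exact identifiers are elided in the discussion); the point is only that master records the missing shard so it is later advertised, while nautilus records nothing:

```cpp
#include <cassert>
#include <map>
#include <set>
#include <string>

struct MissingTracker {
  // oid -> set of osds recorded as missing that object (hypothetical name)
  std::map<std::string, std::set<int>> missing_on;

  void on_backfill_object_error(const std::string& oid, int osd,
                                bool master_behavior) {
    if (master_behavior) {
      // master: the missing shard is recorded, so the object is later
      // advertised as missing and can be handled as unfound
      missing_on[oid].insert(osd);
    }
    // nautilus: nothing is recorded here, so the osd later advertises
    // 0 missing to the primary
  }
};
```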
Tests passed: https://trello.com/c/EPPqwRyb
Thanks for your analysis @trociny! The problem is that though osd.3(1) is missing the object, it advertises 0 missing to the primary. This ties back to https://trello.com/c/swaxVPq8/722-osd-bluestore-persist-flags-on-objects-to-generate-an-eio-add-tests-to-exercise-it
The above is from /a/trociny-2021-05-26_04:56:24-rados-master-distro-basic-smithi/6135980. I appreciate your efforts in finding the root cause of the problem. This PR may work around the backfill_unfound edge case if an osd restarts, but I do not feel confident that this patch will not introduce any other problems. Since this is the last nautilus point release, we won't be able to patch any future bugs in this area. I would really like for us to invest in fixing this issue in master with a more robust solution and provide #41293 (comment) as a workaround for nautilus.
@neha-ojha Thanks! I totally agree with you that trying to fix this in nautilus at this stage is high risk.
First commit:
This is a direct commit to nautilus branch to recreate the
behaviour we have post-nautilus, after refactoring made in
8a8947d, which seems to be
correct.
The in-flight backfill prevents updating of "backfill complete"
position, which remains on the object before the missing oid. So
when pg is re-peered it retries backfill from this position
instead of entering the clean state.
Fixes: https://tracker.ceph.com/issues/50747
Signed-off-by: Mykola Golub <mgolub@suse.com>
Second commit:
backport tracker: https://tracker.ceph.com/issues/50792
backport of: #41270
parent tracker: https://tracker.ceph.com/issues/50351