New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
osd/ECTransaction: Remove incorrect asserts in generate_transactions #56924
Conversation
Back in PR ceph#11701 when EC Overwrites were added, there was significant churn in the ECTransaction code. Several asserts were added in generate_transactions, but the code changed several times and it appears some of those asserts don't make sense in the final version of the PR. This PR removes two of those asserts. The first doesn't appear to be necessary given how the surrounding conditional block changed. The second appears to be an incorrect assert that was left over from an earlier commit that no longer should be in place given the logic that was added to handle truncate (which roughly mirrors what we do in ReplicatedBackend). It's possible the assert for obc and entry are also unnecessary, however I left those in place for now. Fixes: https://tracker.ceph.com/issues/65509 Signed-off-by: Mark Nelson <mark.nelson@clyso.com>
This is currently impacting our operations, we have 2 pools inoperable because of it. It appears to be the second time this bug has taken down our cluster following a full cluster wide recovery. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yep, that looks right to me.
@athanatos - do we really need QA for changes like that? |
@ifed01 For this change specifically, no. It's the removal of an assert and so can't actually create a failure condition. However, I prefer that core contributions go through testing regardless unless there's a specific, pressing need to merge without testing. Is there such a pressing need here? This bug is very old. @yuriw Can we get this into the next testing batch? |
@athanatos I won't comment on whether or not we've reached that threshold, but here's the failure condition that Mike mentioned in slack that might inform that decision:
|
The most confounding part of this is the affected cluster has been in operation for 3 years without tripping this bug. I don't know what has changed in client behavior such that steady state operations triggered the bug but it caused a 52 hour outage until we could recover the cluster and stay healthy by forcibly disabling trim/discards from clients. |
I think we should convey also negligible-risk changes through QA, even for the sake of honoring the workflow.
OK, I added information about the workaround to the tracker. The ticket is High (but not Urgent), so let's backport this without delays but let's also have the due QA. |
@@ -193,9 +193,6 @@ void ECTransaction::generate_transactions( | |||
xattr_rollback[ECUtil::get_hinfo_key()] = old_hinfo; | |||
|
|||
if (op.is_none() && op.truncate && op.truncate->first == 0) { | |||
ceph_assert(op.truncate->first == 0); | |||
ceph_assert(op.truncate->first == | |||
op.truncate->second); | |||
ceph_assert(entry); | |||
ceph_assert(obc); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
A dozen lines below there is the if (obc)
which seems to be always taken. Maybe it's worth a follow-up?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@rzarzynski yeah, I considered the same thing, but ultimately decided to leave it for a future PR.
Hi, I was also involved in finding this. I'm trying to have it reproduced, but did not quite succeed. In the logs I find this a notable log:
The Perhaps someone has an idea what could be missing here to fully reproduce it. |
Back in PR #11701 when EC Overwrites were added, there was significant churn in the ECTransaction code. Several asserts were added in generate_transactions, but the code changed several times and it appears some of those asserts don't make sense in the final version of the PR.
This PR removes two of those asserts. The first doesn't appear to be necessary given how the surrounding conditional block changed. The second appears to be an incorrect assert that was left over from an earlier commit that no longer should be in place given the logic that was added to handle truncate (which roughly mirrors what we do in ReplicatedBackend).
It's possible the assert for entry and/or obc are also unnecessary, however I left those in place for now.
Fixes: https://tracker.ceph.com/issues/65509