-
Notifications
You must be signed in to change notification settings - Fork 2.4k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
[HUDI-3805] Delete existing corrupted requested rollback plan during rollback #5245
[HUDI-3805] Delete existing corrupted requested rollback plan during rollback #5245
Conversation
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM
6e1603b
to
40fc116
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
What would exactly corrupt the plan? Is this for HDFS where writes are not-atomic
HoodieRollbackPlan rollbackPlan; | ||
try { | ||
rollbackPlan = RollbackUtils.getRollbackPlan(metaClient, rollbackInstant); | ||
} catch (IOException e) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
would n't this go ahead and delete even if say the error was a legit IO exception? lets say cloud storage was inaccessible/connection timeout.. We should explicitly handle corruptions cleanly IMO
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
yes. probably we should catch only for the known exception and stacktrace as well.
Caused by: java.io.IOException: Not an Avro data fileat org.apache.avro.file.DataFileReader.openReader(DataFileReader.java:50)
this is the stacktrace when avro file is corrupt. we will follow up w/ the right fix.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The deletion only happens for the requested rollback plan. Any inflight rollback is not affected. The assumption here is that even if the requested rollback plan is inaccessible and deleted, it can be requested again by a new writer, which is still safe. We don't want the hanging rollback plan to block metadata table compaction.
The corruption is mainly due to writes of a rollback plan not atomic, and the job fails during that time.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
yeah. Ethan reminded me of the same discussion we had when the patch was put up. we found the existing fix as the safest option compared to other alternatives. Just to add to what Ethan has mentioned above, we do this only incase of rollback.requested and not for rollback.inflight. For inflight, its safe to re-use the plan from the rollback.requested file.
What is the purpose of the pull request
When the
instant_time.rollback.requested
file in the timeline is empty or corrupted, it cannot be parsed. When runninggetPendingRollbackInfos()
, it's going to skip that empty/corrupted requested rollback instant and the rollback instant is going to stay on the timeline forever, preventing metadata table archival.This PR fixes the problem by deleting the requested rollback plan if it is empty or corrupted.
Brief change log
BaseHoodieWriteClient::getPendingRollbackInfos()
.Verify this pull request
This change added tests for requested rollbacks, either valid or corrupted, in
TestClientRollback
. The fix is also verified by running the deltastreamer on a Hudi Table with corrupted requested rollback in the timeline. The corrupted rollback plan is deleted afterwards.Committer checklist
Has a corresponding JIRA in PR title & commit
Commit message is descriptive of the change
CI is green
Necessary doc changes done or have another open PR
For large changes, please consider breaking it into sub-tasks under an umbrella JIRA.