Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[HUDI-3805] Delete existing corrupted requested rollback plan during rollback #5245

Merged
merged 1 commit into from
Apr 7, 2022

Conversation

yihua
Copy link
Contributor

@yihua yihua commented Apr 7, 2022

What is the purpose of the pull request

When the instant_time.rollback.requested file in the timeline is empty or corrupted, it cannot be parsed. When running getPendingRollbackInfos(), it's going to skip that empty/corrupted requested rollback instant and the rollback instant is going to stay on the timeline forever, preventing metadata table archival.

This PR fixes the problem by deleting the requested rollback plan if it is empty or corrupted.

Brief change log

  • Adds logic to delete requested rollback plan if it cannot be parsed in BaseHoodieWriteClient::getPendingRollbackInfos().

Verify this pull request

This change added tests for requested rollbacks, either valid or corrupted, in TestClientRollback. The fix is also verified by running the deltastreamer on a Hudi Table with corrupted requested rollback in the timeline. The corrupted rollback plan is deleted afterwards.

Committer checklist

  • Has a corresponding JIRA in PR title & commit

  • Commit message is descriptive of the change

  • CI is green

  • Necessary doc changes done or have another open PR

  • For large changes, please consider breaking it into sub-tasks under an umbrella JIRA.

Copy link
Contributor

@nsivabalan nsivabalan left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM

@yihua yihua force-pushed the HUDI-3805-fix-empty-requested-rollback branch from 6e1603b to 40fc116 Compare April 7, 2022 03:00
@hudi-bot
Copy link

hudi-bot commented Apr 7, 2022

CI report:

Bot commands @hudi-bot supports the following commands:
  • @hudi-bot run azure re-run the last Azure build

Copy link
Member

@vinothchandar vinothchandar left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What would exactly corrupt the plan? Is this for HDFS where writes are not-atomic

HoodieRollbackPlan rollbackPlan;
try {
rollbackPlan = RollbackUtils.getRollbackPlan(metaClient, rollbackInstant);
} catch (IOException e) {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

would n't this go ahead and delete even if say the error was a legit IO exception? lets say cloud storage was inaccessible/connection timeout.. We should explicitly handle corruptions cleanly IMO

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

yes. probably we should catch only for the known exception and stacktrace as well.

Caused by: java.io.IOException: Not an Avro data fileat org.apache.avro.file.DataFileReader.openReader(DataFileReader.java:50)

this is the stacktrace when avro file is corrupt. we will follow up w/ the right fix.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Copy link
Contributor Author

@yihua yihua Jul 28, 2022

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The deletion only happens for the requested rollback plan. Any inflight rollback is not affected. The assumption here is that even if the requested rollback plan is inaccessible and deleted, it can be requested again by a new writer, which is still safe. We don't want the hanging rollback plan to block metadata table compaction.

The corruption is mainly due to writes of a rollback plan not atomic, and the job fails during that time.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

yeah. Ethan reminded me of the same discussion we had when the patch was put up. we found the existing fix as the safest option compared to other alternatives. Just to add to what Ethan has mentioned above, we do this only incase of rollback.requested and not for rollback.inflight. For inflight, its safe to re-use the plan from the rollback.requested file.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
No open projects
Status: No status
Development

Successfully merging this pull request may close these issues.

None yet

5 participants