[HUDI-5434] Fix archival in metadata table to not rely on completed rollback or clean in data table #7580

Merged
yihua merged 3 commits on Jan 11, 2023

@yihua yihua commented Dec 29, 2022

Change Logs

Before this PR, archival of the metadata table used the earliest instant of all actions in the data table's active timeline as its boundary. In the archival process, CLEAN and ROLLBACK instants are archived separately from commits (see HoodieTimelineArchiver#getCleanInstantsToArchive). Because of this, a very old completed CLEAN or ROLLBACK instant in the data table can block archival of the metadata table timeline and cause the metadata table's active timeline to grow extremely long, which degrades the performance of loading the timeline.

This PR changes archival in the metadata table so that it no longer relies on completed rollback or clean instants in the data table. The archival boundary is now the earlier of two instants in the data table's active timeline: the earliest commit (COMMIT, DELTA_COMMIT, and REPLACE_COMMIT only; when archival beyond savepoint is enabled, only non-savepoint commits are considered) and the earliest inflight instant (all actions). Metadata table instants at or after this boundary are not archived.
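
The difference between the old and new boundary can be illustrated with a minimal sketch. It is not Hudi code: the `Instant` and `Action` types, the method names, and the three-digit timestamps are hypothetical stand-ins, and it assumes that timestamps are fixed-width strings (so string order matches time order), that savepointed commits are simply skipped when archival beyond savepoint is enabled, and that only metadata table instants strictly older than the boundary are eligible for archival.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Optional;
import java.util.Set;
import java.util.stream.Collectors;
import java.util.stream.Stream;

// Simplified, hypothetical model of a timeline; none of these types are Hudi classes.
class MetadataArchivalSketch {
  enum Action { COMMIT, DELTA_COMMIT, REPLACE_COMMIT, CLEAN, ROLLBACK }

  static final class Instant {
    final String timestamp;   // assumed fixed-width, so string order == time order
    final Action action;
    final boolean completed;  // false means requested or inflight

    Instant(String timestamp, Action action, boolean completed) {
      this.timestamp = timestamp;
      this.action = action;
      this.completed = completed;
    }
  }

  static boolean isCommit(Action a) {
    return a == Action.COMMIT || a == Action.DELTA_COMMIT || a == Action.REPLACE_COMMIT;
  }

  // Old rule: the earliest instant of any action in the data table's active timeline.
  static Optional<String> oldBoundary(List<Instant> dataTimeline) {
    return dataTimeline.stream().map(i -> i.timestamp).min(String::compareTo);
  }

  // New rule: the earlier of (a) the earliest commit-type instant and
  // (b) the earliest inflight instant of any action.
  // Assumption: with archival beyond savepoint enabled, savepointed commits are skipped in (a).
  static Optional<String> newBoundary(List<Instant> dataTimeline,
                                      Set<String> savepointed,
                                      boolean archiveBeyondSavepoint) {
    Stream<String> commits = dataTimeline.stream()
        .filter(i -> isCommit(i.action))
        .filter(i -> !(archiveBeyondSavepoint && savepointed.contains(i.timestamp)))
        .map(i -> i.timestamp);
    Stream<String> inflight = dataTimeline.stream()
        .filter(i -> !i.completed)
        .map(i -> i.timestamp);
    return Stream.concat(commits, inflight).min(String::compareTo);
  }

  // Metadata table instants strictly older than the boundary are eligible for archival.
  static List<String> archivable(List<String> metadataTimestamps, Optional<String> boundary) {
    if (!boundary.isPresent()) {
      return metadataTimestamps; // assumption: no boundary means no extra filtering
    }
    return metadataTimestamps.stream()
        .filter(ts -> ts.compareTo(boundary.get()) < 0)
        .collect(Collectors.toList());
  }

  public static void main(String[] args) {
    List<Instant> data = Arrays.asList(
        new Instant("001", Action.CLEAN, true),          // very old completed clean
        new Instant("105", Action.COMMIT, true),
        new Instant("107", Action.DELTA_COMMIT, false)); // inflight
    List<String> metadata = Arrays.asList("001", "050", "104", "105", "107");

    System.out.println(archivable(metadata, oldBoundary(data)));
    System.out.println(archivable(metadata, newBoundary(data, new HashSet<String>(), false)));
  }
}
```

In this toy timeline the old rule is pinned at the completed clean `001`, so no metadata table instant qualifies and the first line printed is `[]`. The new rule ignores the completed clean and uses the commit at `105`, so the second line is `[001, 050, 104]`.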

New tests are added and an existing test around the archival in the metadata table is adjusted to verify that the archival in the metadata table does not depend on the completed rollback in the data table.

This PR is tested locally to ensure the metadata table's archival works as expected.

Impact

Shortens the active timeline of the metadata table, which improves the performance of loading it.

Risk level

low

Documentation Update

N/A

Contributor's checklist

  • Read through contributor's guide
  • Change Logs and Impact were stated clearly
  • Adequate tests were added if applicable
  • CI passed

@nsivabalan nsivabalan left a comment

Can you take another look at the special handling for archival beyond savepoint?

// for loading the timeline.
if (qualifiedEarliestInstant.isPresent()) {
instants = instants.filter(instant ->
compareTimestamps(
Contributor

I have created a Jira to test archival beyond savepoint. I guess we should not be deleting the commits that are savepointed.
https://issues.apache.org/jira/browse/HUDI-5525

Contributor Author

Sounds good. This particular logic is for the metadata table's archival only. Nevertheless, we should still thoroughly test archival beyond savepoints, which is orthogonal to this PR.

@apache apache deleted a comment from hudi-bot Jan 11, 2023

yihua commented Jan 11, 2023

CI passes

@yihua yihua merged commit 7a9aabd into apache:master Jan 11, 2023
fengjian428 pushed a commit to fengjian428/hudi that referenced this pull request Jan 31, 2023
nsivabalan pushed a commit to nsivabalan/hudi that referenced this pull request Mar 22, 2023
southernriver pushed a commit to southernriver/hudi that referenced this pull request Mar 31, 2023
(cherry picked from commit 7a9aabd)
fengjian428 pushed a commit to fengjian428/hudi that referenced this pull request Apr 5, 2023
flashJd pushed a commit to flashJd/hudi that referenced this pull request May 5, 2023
(cherry picked from commit 7a9aabd)

# Conflicts:
#	hudi-client/hudi-client-common/src/main/java/org/apache/hudi/client/HoodieTimelineArchiver.java
#	hudi-common/src/main/java/org/apache/hudi/common/table/timeline/TimelineUtils.java
#	hudi-common/src/test/java/org/apache/hudi/common/table/TestTimelineUtils.java

Zouxxyy commented Aug 2, 2023

Before this PR, the archival for the metadata table uses the earliest instant of all actions from the active timeline of the data table. In the archival process, CLEAN and ROLLBACK instants are archived separately apart from commits (check HoodieTimelineArchiver#getCleanInstantsToArchive). Because of this, a very old completed CLEAN or ROLLBACK instant in the data table can block the archive of the metadata table timeline and causes the active timeline of the metadata table to be extremely long, leading to performance issues for loading the timeline.

Why not just archive all the clean/rollback instants before the oldest commit instant in getCommitInstantsToArchive()? The current logic of HoodieTimelineArchiver#getCleanInstantsToArchive doesn't really make sense.

For example, if a user only triggers a few rollbacks (fewer than maxInstantsToKeep), these rollback instants will stay in the active timeline forever.
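
A rough sketch of the concern, under stated assumptions: the count-based rule below is a simplified model inferred from this comment, not a copy of HoodieTimelineArchiver#getCleanInstantsToArchive, and `minInstantsToKeep`, the class name, and the method names are hypothetical. The second method models the suggestion above, reading "oldest commit instant" as the oldest commit still in the active timeline.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.stream.Collectors;

// Hypothetical, simplified model; not a copy of HoodieTimelineArchiver.
class RollbackArchivalSketch {

  // Count-based rule (assumed): nothing is archived until the number of completed
  // instants of an action exceeds maxInstantsToKeep; then the oldest ones are
  // archived until minInstantsToKeep (a hypothetical parameter) remain.
  static List<String> countBased(List<String> sortedTimestamps,
                                 int minInstantsToKeep,
                                 int maxInstantsToKeep) {
    if (sortedTimestamps.size() <= maxInstantsToKeep) {
      return new ArrayList<>();
    }
    return new ArrayList<>(
        sortedTimestamps.subList(0, sortedTimestamps.size() - minInstantsToKeep));
  }

  // Suggested rule: archive every clean/rollback instant that is older than the
  // oldest commit still in the active timeline.
  static List<String> beforeOldestCommit(List<String> sortedTimestamps,
                                         String oldestActiveCommit) {
    return sortedTimestamps.stream()
        .filter(ts -> ts.compareTo(oldestActiveCommit) < 0)
        .collect(Collectors.toList());
  }

  public static void main(String[] args) {
    // Three old rollbacks, far fewer than maxInstantsToKeep = 30.
    List<String> rollbacks = Arrays.asList("001", "002", "003");

    System.out.println(countBased(rollbacks, 20, 30));
    System.out.println(beforeOldestCommit(rollbacks, "100"));
  }
}
```

With three rollbacks and a threshold of 30, the count-based rule prints `[]` no matter how old the rollbacks are, while the suggested rule prints `[001, 002, 003]` once the oldest active commit has moved past them.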

@danny0405

then these rollback instants will stay in the active timeline forever.

That's true, we should optimize the archiving of cleaning and rollback.

Zouxxyy commented Aug 3, 2023

@danny0405

That's true, we should optimize the archiving of cleaning and rollback.

I see you are working on the LSM-tree-based archive timeline. Have you done any work on this?
If not, I'd like to work on it. The current process of getInstantsToArchive is a bit complicated, so I will sort it out as a whole.

@danny0405

have you done some work on this

No.

Labels
metadata (metadata table), priority:critical (production down; pipelines stalled; need help asap)
Projects
Archived in project
Development

Successfully merging this pull request may close these issues.

[SUPPORT] Too many metadata timeline file caused by old rollback active timeline
6 participants