[HUDI-5434] Fix archival in metadata table to not rely on completed rollback or clean in data table #7580
Can you take another look at the special handling for archival beyond savepoint?
hudi-client/hudi-client-common/src/main/java/org/apache/hudi/client/HoodieTimelineArchiver.java
…ollback or clean in data table
363e7ec
to
294766f
// for loading the timeline.
if (qualifiedEarliestInstant.isPresent()) {
  instants = instants.filter(instant ->
      compareTimestamps(
I have created a Jira to test archival beyond savepoint. I guess we should not be deleting the commits that are savepointed.
https://issues.apache.org/jira/browse/HUDI-5525
Sounds good. This particular logic is for the metadata table's archival only. Nevertheless, we should still thoroughly test archival beyond savepoints, orthogonal to this PR.
…ollback or clean in data table (apache#7580) Before this change, the archival for the metadata table uses the earliest instant of all actions from the active timeline of the data table. In the archival process, CLEAN and ROLLBACK instants are archived separately apart from commits (check HoodieTimelineArchiver#getCleanInstantsToArchive). Because of this, a very old completed CLEAN or ROLLBACK instant in the data table can block the archive of the metadata table timeline and causes the active timeline of the metadata table to be extremely long, leading to performance issues for loading the timeline. This commit changes the archival in metadata table to not rely on completed rollback or clean in data table, by archiving the metadata table's instants after the earliest commit (COMMIT, DELTA_COMMIT, and REPLACE_COMMIT only, considering non-savepoint commit only if enabling archive beyond savepoint) and the earliest inflight instant (all actions) in the data table's active timeline.
…ollback or clean in data table (apache#7580) (cherry picked from commit 7a9aabd)
# Conflicts:
# hudi-client/hudi-client-common/src/main/java/org/apache/hudi/client/HoodieTimelineArchiver.java
# hudi-common/src/main/java/org/apache/hudi/common/table/timeline/TimelineUtils.java
# hudi-common/src/test/java/org/apache/hudi/common/table/TestTimelineUtils.java
Why not just archive all the clean/rollback instants before the oldest commit instant in …? For example, if a user has only a few rollbacks (fewer than …
That's true, we should optimize the archiving of cleaning and rollback.
I see you are working on LSM tree based archive timeline, have you done some work on this?
No.
Change Logs
Before this PR, archival for the metadata table used the earliest instant of all actions in the data table's active timeline. In the archival process, CLEAN and ROLLBACK instants are archived separately from commits (see HoodieTimelineArchiver#getCleanInstantsToArchive). Because of this, a very old completed CLEAN or ROLLBACK instant in the data table can block archival of the metadata table timeline and cause the metadata table's active timeline to grow extremely long, leading to performance issues when loading the timeline.
This PR changes archival in the metadata table so that it no longer relies on completed rollback or clean instants in the data table. Instead, the metadata table's instants are archived relative to the earlier of two instants in the data table's active timeline: the earliest commit (COMMIT, DELTA_COMMIT, and REPLACE_COMMIT only; when archival beyond savepoint is enabled, only non-savepoint commits are considered) and the earliest inflight instant (all actions).
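The selection rule described above can be illustrated with a minimal sketch over a simplified timeline model. It is not the code in this PR: `SketchAction`, `SketchInstant`, `ArchivalBoundarySketch`, and both method names are hypothetical, timestamps are assumed to be fixed-width strings so that string order matches time order, and the savepoint handling follows one reading of the description (savepointed commits are skipped only when archival beyond savepoint is enabled).

```java
import java.util.Arrays;
import java.util.Collections;
import java.util.Comparator;
import java.util.EnumSet;
import java.util.List;
import java.util.Optional;
import java.util.Set;

/** Hypothetical action types, named after the actions mentioned in the description. */
enum SketchAction { COMMIT, DELTA_COMMIT, REPLACE_COMMIT, CLEAN, ROLLBACK }

/** Hypothetical, simplified instant; not Hudi's HoodieInstant. */
final class SketchInstant {
    final String timestamp;     // assumed fixed-width, so string order equals time order
    final SketchAction action;
    final boolean completed;    // false means the instant is still inflight

    SketchInstant(String timestamp, SketchAction action, boolean completed) {
        this.timestamp = timestamp;
        this.action = action;
        this.completed = completed;
    }
}

final class ArchivalBoundarySketch {
    private static final Set<SketchAction> COMMIT_ACTIONS =
        EnumSet.of(SketchAction.COMMIT, SketchAction.DELTA_COMMIT, SketchAction.REPLACE_COMMIT);

    /** Old rule: earliest instant of any action, so an old completed CLEAN or ROLLBACK wins. */
    static Optional<String> earliestOfAllActions(List<SketchInstant> dataTimeline) {
        return dataTimeline.stream()
            .map(i -> i.timestamp)
            .min(Comparator.naturalOrder());
    }

    /** New rule: earliest among qualifying completed commits and inflight instants of any action. */
    static Optional<String> earliestCommitOrInflight(List<SketchInstant> dataTimeline,
                                                     Set<String> savepointed,
                                                     boolean archiveBeyondSavepoint) {
        return dataTimeline.stream()
            .filter(i -> !i.completed || isQualifyingCommit(i, savepointed, archiveBeyondSavepoint))
            .map(i -> i.timestamp)
            .min(Comparator.naturalOrder());
    }

    private static boolean isQualifyingCommit(SketchInstant i,
                                              Set<String> savepointed,
                                              boolean archiveBeyondSavepoint) {
        if (!COMMIT_ACTIONS.contains(i.action)) {
            return false; // completed CLEAN and ROLLBACK instants no longer count
        }
        // Assumption: savepointed commits are skipped only when archiving beyond savepoint.
        return !(archiveBeyondSavepoint && savepointed.contains(i.timestamp));
    }

    public static void main(String[] args) {
        List<SketchInstant> timeline = Arrays.asList(
            new SketchInstant("001", SketchAction.CLEAN, true),
            new SketchInstant("005", SketchAction.COMMIT, true),
            new SketchInstant("006", SketchAction.DELTA_COMMIT, true),
            new SketchInstant("007", SketchAction.ROLLBACK, false));
        System.out.println("old boundary: " + earliestOfAllActions(timeline).orElse("none"));
        System.out.println("new boundary: "
            + earliestCommitOrInflight(timeline, Collections.emptySet(), false).orElse("none"));
    }
}
```

In this model the old rule is pinned to the completed CLEAN instant, while the new rule moves the boundary forward to the first commit or inflight instant, which is what allows the metadata table's active timeline to shrink.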
New tests are added and an existing test around the archival in the metadata table is adjusted to verify that the archival in the metadata table does not depend on the completed rollback in the data table.
This PR is tested locally to ensure the metadata table's archival works as expected.
Impact
Makes the active timeline of the metadata table shorter and improves the performance of loading the active timeline of the metadata table.
Risk level
low
Documentation Update
N/A