[HUDI-5477] Optimize timeline loading in Hudi sync client #7561
hudi-common/src/main/java/org/apache/hudi/common/table/timeline/TimelineUtils.java
hudi-common/src/main/java/org/apache/hudi/common/table/HoodieTableMetaClient.java
40361ca to f61d948
Before this change, the Hudi archived timeline was always loaded during the metastore sync process whenever the last sync time was given. In addition, the archived timeline was not cached inside the meta client when a start instant time was given. When the archived timeline is huge, e.g., hundreds of log files in the .hoodie/archived folder, loading it from storage causes performance issues and read timeouts on cloud storage because of request rate limiting. This change improves timeline loading by (1) reading only the active timeline if the last sync time is the same as or after the start of the active timeline; and (2) caching the archived timeline in the meta client, keyed by start instant time, to avoid repeatedly loading the same archived timeline. (cherry picked from commit ab61f61)
    HoodieTableMetaClient metaClient, String exclusiveStartInstantTime) {
  HoodieActiveTimeline activeTimeline = metaClient.getActiveTimeline();
  HoodieDefaultTimeline timeline =
      activeTimeline.isBeforeTimelineStarts(exclusiveStartInstantTime)
@yihua I have a question: since rollbacks and commits are archived separately, is it possible for a very early rollback instant to remain in the active timeline, causing activeTimeline.isBeforeTimelineStarts(exclusiveStartInstantTime) to return false?
@BruceKellan You bring up a good point. So you are talking about the following case, where the active timeline looks like:
ts3.rollback, ts50.commit, ts51.commit, ts52.commit, ...
and ts49.commit and ts48.commit are archived.
If we pass in ts47 as the exclusiveStartInstantTime at this point, activeTimeline.isBeforeTimelineStarts(exclusiveStartInstantTime) returns false. However, the active timeline misses ts48.commit and ts49.commit, which are required for meta sync. This is a problem if ts48.commit or ts49.commit has partition changes.
I'll put up a fix on that. Thanks for catching it!
yes, that's it
Here's the fix: #8991
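The edge case discussed in this thread can be reproduced with a small, simplified model. This is only a sketch: the Instant class, the integer timestamps, and every method name below are hypothetical stand-ins, not Hudi classes, and the "timeline start" check is reduced to comparing against the first instant in the active list.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Simplified, hypothetical model of the active/archived timeline (not Hudi code).
class TimelineEdgeCase {
    static final class Instant {
        final int ts;
        final String action;

        Instant(int ts, String action) {
            this.ts = ts;
            this.action = action;
        }
    }

    // Assumption: "timeline start" is simply the earliest instant in the active list.
    static boolean isBeforeTimelineStarts(List<Instant> active, int ts) {
        return !active.isEmpty() && ts < active.get(0).ts;
    }

    // Commit instants strictly after the exclusive start time.
    static List<Integer> commitsAfter(List<Instant> timeline, int exclusiveStart) {
        List<Integer> result = new ArrayList<>();
        for (Instant instant : timeline) {
            if ("commit".equals(instant.action) && instant.ts > exclusiveStart) {
                result.add(instant.ts);
            }
        }
        return result;
    }

    // Active-only shortcut: the archived timeline is consulted only when the
    // start time falls before the start of the active timeline.
    static List<Integer> commitsSeenBySync(
            List<Instant> archived, List<Instant> active, int exclusiveStart) {
        List<Instant> timeline = new ArrayList<>();
        if (isBeforeTimelineStarts(active, exclusiveStart)) {
            timeline.addAll(archived);
        }
        timeline.addAll(active);
        return commitsAfter(timeline, exclusiveStart);
    }

    // ts48.commit and ts49.commit are archived.
    static List<Instant> archived() {
        return Arrays.asList(new Instant(48, "commit"), new Instant(49, "commit"));
    }

    // An early rollback (ts3) is still in the active timeline.
    static List<Instant> active() {
        return Arrays.asList(
            new Instant(3, "rollback"),
            new Instant(50, "commit"),
            new Instant(51, "commit"),
            new Instant(52, "commit"));
    }
}
```

In this model, commitsSeenBySync(archived(), active(), 47) returns [50, 51, 52]: the early rollback at ts3 makes the start check return false, so commits 48 and 49 are never read.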
…a sync (apache#8991) This commit fixes the problematic API implementation (TimelineUtils.getCommitsTimelineAfter) introduced by apache#7561.
Change Logs
Before this change, the Hudi archived timeline was always loaded during the metastore sync process whenever the last sync time was given. In addition, the archived timeline was not cached inside the meta client when a start instant time was given. When the archived timeline is huge, e.g., hundreds of log files in the .hoodie/archived folder, loading it from storage causes performance issues and read timeouts on cloud storage because of request rate limiting. Such behavior was introduced by #6662. This PR improves timeline loading by
(1) reading only the active timeline if the last sync time is the same as or after the start of the active timeline;
(2) caching the archived timeline in the meta client, keyed by start instant time, to avoid repeatedly loading the same archived timeline.
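The two points above can be illustrated with a minimal sketch. It is not the code from this PR: ArchivedTimelineCache, needsArchivedTimeline, and loadCount are hypothetical names, the loader function stands in for the expensive read of the archived folder, and instant times are assumed to be fixed-width strings that order lexicographically.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.function.Function;

// Hypothetical sketch of per-start-instant caching (not the actual meta client).
class ArchivedTimelineCache<T> {
    private final Map<String, T> cache = new HashMap<>();
    private final Function<String, T> loader;
    private int loads = 0;

    // loader stands in for the expensive read from storage.
    ArchivedTimelineCache(Function<String, T> loader) {
        this.loader = loader;
    }

    // (1) The archived timeline is needed only when the last sync time is
    // before the start of the active timeline.
    static boolean needsArchivedTimeline(String lastSyncTime, String activeTimelineStart) {
        return lastSyncTime.compareTo(activeTimelineStart) < 0;
    }

    // (2) Load at most once per start instant time; later calls reuse the result.
    T get(String startInstantTime) {
        return cache.computeIfAbsent(startInstantTime, key -> {
            loads++;
            return loader.apply(key);
        });
    }

    // Number of times the loader was actually invoked.
    int loadCount() {
        return loads;
    }
}
```

Keying the cache by start instant time means two requests with different start times are loaded independently, which trades some memory for fewer storage requests.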
Impact
This PR improves the performance of metastore sync.
Risk level
low
Documentation Update
N/A