[HUDI-5477] Optimize timeline loading in Hudi sync client #7561

Merged

yihua
Contributor

@yihua yihua commented Dec 26, 2022

Change Logs

Before this change, the Hudi archived timeline was always loaded during the metastore sync process whenever the last sync time was given. In addition, the archived timeline was not cached inside the meta client when a start instant time was given. When the archived timeline is huge (e.g., hundreds of log files in the .hoodie/archived folder), loading it from storage causes performance issues and read timeouts on cloud storage due to request rate limiting. This behavior was introduced by #6662.

This PR improves timeline loading by
(1) reading only the active timeline if the last sync time is the same as or after the start of the active timeline;
(2) caching the archived timeline in the meta client, keyed by the start instant time, to avoid unnecessary repeated loading of the same archived timeline.
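
As a rough illustration of points (1) and (2), here is a minimal, self-contained Java sketch. It does not use Hudi's classes: `TimelineLoadingSketch`, `getArchivedTimeline`, `getInstantsAfter`, the `archivedLoads` counter, and the zero-padded instant strings are all hypothetical stand-ins chosen for this example, and the actual change in this PR may be structured differently.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical stand-in for a meta client; none of these names are Hudi APIs.
class TimelineLoadingSketch {
    private final List<String> activeInstants;   // sorted, zero-padded instant times
    private final List<String> archivedInstants; // sorted, zero-padded instant times
    private final Map<String, List<String>> archivedCache = new HashMap<>();
    int archivedLoads = 0; // counts simulated reads of the archived folder

    TimelineLoadingSketch(List<String> activeInstants, List<String> archivedInstants) {
        this.activeInstants = activeInstants;
        this.archivedInstants = archivedInstants;
    }

    // (2) Cache the archived timeline per start instant time, so the same
    // range is read from storage at most once.
    List<String> getArchivedTimeline(String startTs) {
        List<String> cached = archivedCache.get(startTs);
        if (cached == null) {
            archivedLoads++; // simulated expensive storage read
            cached = new ArrayList<>();
            for (String ts : archivedInstants) {
                if (ts.compareTo(startTs) >= 0) {
                    cached.add(ts);
                }
            }
            archivedCache.put(startTs, cached);
        }
        return cached;
    }

    // (1) Touch the archived timeline only when the last sync time is
    // earlier than the start of the active timeline.
    List<String> getInstantsAfter(String lastSyncTs) {
        List<String> result = new ArrayList<>();
        boolean activeCoversRange = !activeInstants.isEmpty()
                && lastSyncTs.compareTo(activeInstants.get(0)) >= 0;
        if (!activeCoversRange) {
            for (String ts : getArchivedTimeline(lastSyncTs)) {
                if (ts.compareTo(lastSyncTs) > 0) {
                    result.add(ts);
                }
            }
        }
        for (String ts : activeInstants) {
            if (ts.compareTo(lastSyncTs) > 0) {
                result.add(ts);
            }
        }
        return result;
    }

    public static void main(String[] args) {
        TimelineLoadingSketch client = new TimelineLoadingSketch(
                Arrays.asList("050", "051", "052"),
                Arrays.asList("047", "048", "049"));
        // Last sync at the start of the active timeline: no archived read.
        System.out.println(client.getInstantsAfter("050") + " loads=" + client.archivedLoads);
        // Last sync before the active timeline: one archived read.
        System.out.println(client.getInstantsAfter("047") + " loads=" + client.archivedLoads);
        // Same start instant again: served from the cache, still one read.
        System.out.println(client.getInstantsAfter("047") + " loads=" + client.archivedLoads);
    }
}
```

In this sketch the simulated read count stays at zero while the last sync time falls within the active timeline, and repeated calls with the same start instant reuse the cached list.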

Impact

This PR improves the performance of metastore sync.

Risk level

low

Documentation Update

N/A

Contributor's checklist

  • Read through contributor's guide
  • Change Logs and Impact were stated clearly
  • Adequate tests were added if applicable
  • CI passed

@yihua yihua added meta-sync priority:critical production down; pipelines stalled; Need help asap. labels Dec 26, 2022
@yihua yihua requested a review from xushiyan December 26, 2022 22:01
@yihua yihua changed the title [HUDI-5477][DO NOTE MERGE] Optimize timeline loading in Hudi sync client [HUDI-5477][DO NOT MERGE] Optimize timeline loading in Hudi sync client Dec 27, 2022
@yihua yihua changed the title [HUDI-5477][DO NOT MERGE] Optimize timeline loading in Hudi sync client [HUDI-5477] Optimize timeline loading in Hudi sync client Dec 29, 2022
@apache apache deleted a comment from hudi-bot Dec 30, 2022
@apache apache deleted a comment from hudi-bot Dec 31, 2022
@yihua yihua force-pushed the HUDI-5477-optimize-timeline-loading-in-sync branch from 40361ca to f61d948 Compare January 4, 2023 07:37
@apache apache deleted a comment from hudi-bot Jan 4, 2023
@hudi-bot

hudi-bot commented Jan 5, 2023

CI report:

Bot commands: @hudi-bot supports the following commands:
  • @hudi-bot run azure: re-run the last Azure build

@yihua
Contributor Author

yihua commented Jan 5, 2023

CI passes. Merging this PR

@yihua yihua merged commit ab61f61 into apache:master Jan 5, 2023
XuQianJin-Stars pushed a commit that referenced this pull request Jan 31, 2023
(cherry picked from commit ab61f61)
nsivabalan pushed a commit to nsivabalan/hudi that referenced this pull request Mar 22, 2023
fengjian428 pushed a commit to fengjian428/hudi that referenced this pull request Apr 5, 2023
    HoodieTableMetaClient metaClient, String exclusiveStartInstantTime) {
  HoodieActiveTimeline activeTimeline = metaClient.getActiveTimeline();
  HoodieDefaultTimeline timeline =
      activeTimeline.isBeforeTimelineStarts(exclusiveStartInstantTime)
Contributor

@yihua I have a question: since rollbacks and commits are archived separately, is it possible that a very early rollback instant remains on the active timeline, causing activeTimeline.isBeforeTimelineStarts(exclusiveStartInstantTime) to return false?

Contributor Author

@BruceKellan You bring up a good point. You are describing the case where the active timeline looks like this:

ts3.rollback, ts50.commit, ts51.commit, ts52.commit, ...

while ts48.commit and ts49.commit have already been archived.

If we pass in ts47 as the exclusiveStartInstantTime at this point, activeTimeline.isBeforeTimelineStarts(exclusiveStartInstantTime) returns false, because ts3.rollback is still on the active timeline and is earlier than ts47. As a result, only the active timeline is read, and it is missing ts48.commit and ts49.commit, which are required for meta sync. This is a problem if ts48.commit or ts49.commit contains partition changes.

I'll put up a fix for that. Thanks for catching it!
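
To make the edge case concrete, here is a small, self-contained Java sketch that reproduces it with plain lists. It does not call Hudi: `RollbackEdgeCaseSketch`, `Instant`, `isBeforeFirstInstant`, `isBeforeFirstCommit`, and `commitsAfter` are hypothetical names, and instant times are written as zero-padded strings so that string comparison matches time order. The commit-only comparison at the end is just one conceivable guard, shown as an assumption; it is not claimed to be what the follow-up fix does.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical model of the scenario in this thread; not Hudi's API.
class RollbackEdgeCaseSketch {
    static final class Instant {
        final String time;   // zero-padded so string order equals time order
        final String action; // "commit" or "rollback"

        Instant(String time, String action) {
            this.time = time;
            this.action = action;
        }
    }

    // Naive check: compares against the first instant of ANY action type.
    static boolean isBeforeFirstInstant(List<Instant> active, String ts) {
        return !active.isEmpty() && ts.compareTo(active.get(0).time) < 0;
    }

    // Assumed alternative: compare against the first COMMIT instant only.
    static boolean isBeforeFirstCommit(List<Instant> active, String ts) {
        for (Instant instant : active) {
            if (instant.action.equals("commit")) {
                return ts.compareTo(instant.time) < 0;
            }
        }
        return true; // no commits on the active timeline: consult the archive
    }

    // Commits strictly after ts that are visible on the active timeline.
    static List<String> commitsAfter(List<Instant> active, String ts) {
        List<String> result = new ArrayList<>();
        for (Instant instant : active) {
            if (instant.action.equals("commit") && instant.time.compareTo(ts) > 0) {
                result.add(instant.time);
            }
        }
        return result;
    }

    public static void main(String[] args) {
        // ts3.rollback lingers while ts48.commit and ts49.commit are archived.
        List<Instant> active = Arrays.asList(
                new Instant("003", "rollback"),
                new Instant("050", "commit"),
                new Instant("051", "commit"),
                new Instant("052", "commit"));

        // The early rollback makes the naive check report that ts47 is covered.
        System.out.println(isBeforeFirstInstant(active, "047")); // prints false
        // Reading only the active timeline then skips 048 and 049.
        System.out.println(commitsAfter(active, "047"));         // prints [050, 051, 052]
        // The commit-only comparison flags that the archive is needed.
        System.out.println(isBeforeFirstCommit(active, "047"));  // prints true
    }
}
```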

Contributor

yes, that's it

Contributor Author

Here's the fix: #8991

yihua added a commit that referenced this pull request Jun 16, 2023
…a sync (#8991)

This commit fixes the problematic API implementation (TimelineUtils.getCommitsTimelineAfter) introduced by #7561.
yihua added a commit to yihua/hudi that referenced this pull request Aug 24, 2023
yihua added a commit to yihua/hudi that referenced this pull request Aug 24, 2023