[HUDI-6725] Support efficient completion time queries on the timeline #9565

Merged
merged 1 commit into from
Sep 5, 2023

danny0405
Contributor

Change Logs

Add a tool to query completion time efficiently on both the active and archived timelines.

Impact

none

Risk level (write none, low medium or high below)

none

Documentation Update

Describe any necessary documentation update if there is any new feature, config, or user-facing change

  • The config description must be updated if new configs are added or the default value of the configs are changed
  • Any new feature or user-facing change requires updating the Hudi website. Please create a Jira ticket, attach the
    ticket number here, and follow the instructions to make changes to the website.

Contributor's checklist

  • Read through contributor's guide
  • Change Logs and Impact were stated clearly
  • Adequate tests were added if applicable
  • CI passed

// the 'startTime' should be out of the eager loading range, switch to a lazy loading.
// This operation is resource costly.
HoodieArchivedTimeline.loadInstants(metaClient,
new EQTsFilter(startTime),
Contributor Author

Might switch to point-query API on parquet if we have that.

private final Map<String, String> startToCompletionInstantTimeMap;

/**
* The start instant time to eagerly load from, by default load last 3 days of completed instants.
Member

Is there some internal config to control last N days of completed instants?

Contributor Author

Probably in a follow-up PR. For this patch, we can first expose it as a constructor param.

Member

Please file a followup JIRA.
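
For readers following this thread, here is a minimal sketch of what "expose it as a constructor param" with a 3-day default could look like. The class name, constant, and method names are hypothetical and are not taken from the patch. The `yyyyMMddHHmmssSSS` instant format is also an assumption about how instant times are rendered.

```java
import java.time.LocalDateTime;
import java.time.format.DateTimeFormatter;

/** Hypothetical helper, not part of the patch: derives the eager-load start instant. */
final class EagerLoadStartSketch {
  // Assumption: instant times are rendered as yyyyMMddHHmmssSSS strings.
  static final DateTimeFormatter INSTANT_FORMAT = DateTimeFormatter.ofPattern("yyyyMMddHHmmssSSS");

  // Default from the review thread: eagerly load the last 3 days of completed instants.
  static final int DEFAULT_EAGER_LOAD_DAYS = 3;

  private EagerLoadStartSketch() {
  }

  /** Start instant for the default eager-load window. */
  static String defaultStartInstant(LocalDateTime now) {
    return startInstant(now, DEFAULT_EAGER_LOAD_DAYS);
  }

  /** Start instant for a caller-chosen window, which a future config could feed. */
  static String startInstant(LocalDateTime now, int days) {
    return now.minusDays(days).format(INSTANT_FORMAT);
  }
}
```

A caller would pass the resulting string as the `startInstant` constructor argument. A later config for "last N days" would only need to change the `days` value.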

if (completionTime != null) {
return Option.of(completionTime);
}
if (HoodieTimeline.compareTimestamps(startTime, GREATER_THAN, this.firstInstantOnActiveTimeline)) {
Member

I'm not following this logic. If startTime > firstInstantOnActiveTimeline, that doesn't mean the instant is still pending, right? Probably I'm missing something. Can you please explain with this example? Say firstInstantOnActiveTimeline has start time t0 and completion time t1, and another instant has start time t2 and completion time t3. Here t2 > t0, but that instant is still completed.

Contributor Author

Not following this logic. If startTime > firstInstantOnActiveTimeline, it doesn't mean that instant is still pending right.

It does mean the instant is pending. The #load method already puts all the completed instants of the active timeline into the map, so if the map does not contain startTime as a key, the instant is pending.
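
The lookup order described in this thread can be sketched as below. This is a simplified stand-in, not the Hudi implementation: the class name, the `archivedLookup` function, and the use of `java.util.Optional` are all hypothetical. It also assumes instant times are fixed-width strings, so that plain string comparison matches time order.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.Optional;
import java.util.function.Function;

/** Hypothetical stand-in for the completion time view discussed here; not the Hudi class. */
final class CompletionTimeLookupSketch {
  // Start time -> completion time for every completed instant loaded so far.
  private final Map<String, String> startToCompletion = new HashMap<>();
  // Start time of the first instant on the active timeline.
  private final String firstInstantOnActiveTimeline;
  // Hypothetical point lookup against the archived timeline (the costly path).
  private final Function<String, Optional<String>> archivedLookup;

  CompletionTimeLookupSketch(Map<String, String> eagerlyLoaded,
                             String firstInstantOnActiveTimeline,
                             Function<String, Optional<String>> archivedLookup) {
    this.startToCompletion.putAll(eagerlyLoaded);
    this.firstInstantOnActiveTimeline = firstInstantOnActiveTimeline;
    this.archivedLookup = archivedLookup;
  }

  Optional<String> getCompletionTime(String startTime) {
    // 1. Eagerly loaded (or previously cached) instants are answered from the map.
    String completionTime = startToCompletion.get(startTime);
    if (completionTime != null) {
      return Optional.of(completionTime);
    }
    // 2. Every completed instant of the active timeline is already in the map,
    //    so a missing key beyond the first active instant means "still pending".
    //    Assumption: fixed-width instant strings, so String order is time order.
    if (startTime.compareTo(firstInstantOnActiveTimeline) > 0) {
      return Optional.empty();
    }
    // 3. Otherwise the instant is outside the eager range: do the costly
    //    archived lookup once and cache the answer for later queries.
    Optional<String> archived = archivedLookup.apply(startTime);
    archived.ifPresent(time -> startToCompletion.put(startTime, time));
    return archived;
  }
}
```

In the reviewer's example, the instant with start time t2 is completed, so the eager load has already placed it in the map and step 1 answers the query. Step 2 is reached only for instants that are absent from the map.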

* The constructor.
*
* @param metaClient The table meta client.
* @param startInstant The earliest instant time to eagerly load from, by default load last 3 days of completed instants.
Member

Need to match or infer from archival frequency.

@hudi-bot

hudi-bot commented Sep 3, 2023

@danny0405 danny0405 merged commit a3eea2f into apache:master Sep 5, 2023
27 checks passed
This pull request was closed.