Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[HUDI-6397][HUDI-6759] Fixing misc bugs w/ metadata table #9546

Merged
merged 6 commits into from
Sep 6, 2023

Conversation

nsivabalan
Copy link
Contributor

@nsivabalan nsivabalan commented Aug 26, 2023

Change Logs

  1. Disabling entire MDT after enabling needs some fixes.
  2. We do have some special instants times like initialization instant times, compaction, clean etc in MDT timeline. During InstantRange check, these should be considered as valid.

Impact

  1. allows users to disable metadata using write configs cleanly.
  2. Valid instants consideration while reading from MDT is solid now. We are going to treat any special instant time (that has additional suffix compared to DT's commit time) as valid.

Especially with MDT partition initialization, the suffix is dynamic, and so we can't really find exact match. So, might have to go with total instant time length and treat all special instant times as valid ones.

In the LogRecordReader, we will first ignore any uncommitted instants. And then if its completed in MDT timeline, we check w/ the instantRange. and so its should be fine to return true for any special instant times.

Risk level (write none, low medium or high below)

medium

Documentation Update

N/A

Contributor's checklist

  • Read through contributor's guide
  • Change Logs and Impact were stated clearly
  • Adequate tests were added if applicable
  • CI passed

@nsivabalan nsivabalan changed the title [HUDI-6759] Fixing valid instants in MDT [HUDI-6397] [HUDI-6759] Fixing misc bugs w/ metadata table Aug 26, 2023
return !HoodieTableMetadata.isMetadataTable(metaClient.getBasePath())
&& !config.isMetadataTableEnabled()
&& !metaClient.getTableConfig().getMetadataPartitions().isEmpty();
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why do we check condition (3) before?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

yeah. When we upstreamed patches from uber, we had an idea that we will not allow disabling metadata table via write configs only. Once enabled, user has to go via hudi-cli to disable it.
but due to operational overhead, we are reverting this change.

@@ -153,6 +153,8 @@ static boolean isValidSuffix(String suffix) {
// This suffix and all after that are used for initialization of the various partitions. The unused suffixes lower than this value
// are reserved for future operations on the MDT.
private static final int PARTITION_INITIALIZATION_TIME_SUFFIX = 10; // corresponds to "010";
// we have max of 4 partitions (FILES, COL_STATS, BLOOM, RLI)
private static final List<String> VALID_PARTITION_INITIALIZATION_TIME_SUFFIXES = Arrays.asList("010","011","012","013");
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Should this be dynamically generated based on the number of indexes / partition in the metadata table? For example, a new index can be added and update to this list may be missed.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

for 0.14.0, since our MDT partitions are static, I did not want to fix that yet.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Makes sense. We should still track this as we may add new MDT partition in the future.

validInstantTimestamps.add(createIndexInitTimestamp(SOLO_COMMIT_TIMESTAMP, PARTITION_INITIALIZATION_TIME_SUFFIX));
metadataMetaClient.getActiveTimeline().getDeltaCommitTimeline().filterCompletedInstants()
.filter(instant -> instant.getTimestamp().startsWith(SOLO_COMMIT_TIMESTAMP))
.getInstants().forEach(instant -> validInstantTimestamps.add(instant.getTimestamp()));
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Could the suffix (from VALID_PARTITION_INITIALIZATION_TIME_SUFFIXES) be added to the completed commit, not SOLO_COMMIT_TIMESTAMP, as valid MDT initialization instant times?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

E.g., when MDT is initialized for an existing data table with completed commits.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

yeah, I also bumped into the same. its already taken care here

public static boolean isValidInstant(HoodieInstant instant) {

@hudi-bot
Copy link

hudi-bot commented Sep 6, 2023

CI report:

Bot commands @hudi-bot supports the following commands:
  • @hudi-bot run azure re-run the last Azure build

@yihua yihua changed the title [HUDI-6397] [HUDI-6759] Fixing misc bugs w/ metadata table [HUDI-6397][HUDI-6759] Fixing misc bugs w/ metadata table Sep 6, 2023
@@ -153,6 +153,8 @@ static boolean isValidSuffix(String suffix) {
// This suffix and all after that are used for initialization of the various partitions. The unused suffixes lower than this value
// are reserved for future operations on the MDT.
private static final int PARTITION_INITIALIZATION_TIME_SUFFIX = 10; // corresponds to "010";
// we have max of 4 partitions (FILES, COL_STATS, BLOOM, RLI)
private static final List<String> VALID_PARTITION_INITIALIZATION_TIME_SUFFIXES = Arrays.asList("010","011","012","013");
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Makes sense. We should still track this as we may add new MDT partition in the future.

@yihua yihua merged commit 48f89ae into apache:master Sep 6, 2023
26 of 28 checks passed
prashantwason pushed a commit that referenced this pull request Sep 13, 2023
1. This commit allows users to disable metadata using write configs cleanly.
2. Valid instants consideration while reading from MDT is solid now. We are going to treat any special instant time (that has additional suffix compared to DT's commit time) as valid.

Especially with MDT partition initialization, the suffix is dynamic, and so we can't really find exact match. So, might have to go with total instant time length and treat all special instant times as valid ones.

In the LogRecordReader, we will first ignore any uncommitted instants. And then if it's completed in MDT timeline, we check w/ the instantRange. So it should be fine to return true for any special instant times.
leosanqing pushed a commit to leosanqing/hudi that referenced this pull request Sep 13, 2023
1. This commit allows users to disable metadata using write configs cleanly.
2. Valid instants consideration while reading from MDT is solid now. We are going to treat any special instant time (that has additional suffix compared to DT's commit time) as valid.

Especially with MDT partition initialization, the suffix is dynamic, and so we can't really find exact match. So, might have to go with total instant time length and treat all special instant times as valid ones.

In the LogRecordReader, we will first ignore any uncommitted instants. And then if it's completed in MDT timeline, we check w/ the instantRange. So it should be fine to return true for any special instant times.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
Status: 🆕 New
Development

Successfully merging this pull request may close these issues.

None yet

4 participants