-
Notifications
You must be signed in to change notification settings - Fork 2.4k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
[HUDI-7758] Only consider files in Hudi partitions when initializing MDT #11219
Changes from all commits
File filter
Filter by extension
Conversations
Jump to
Diff view
Diff view
There are no files selected for viewing
Original file line number | Diff line number | Diff line change |
---|---|---|
|
@@ -2000,16 +2000,16 @@ public DirectoryInfo(String relativePath, List<StoragePathInfo> pathInfos, Strin | |
// Pre-allocate with the maximum length possible | ||
filenameToSizeMap = new HashMap<>(pathInfos.size()); | ||
|
||
// Presence of partition meta file implies this is a HUDI partition | ||
isHoodiePartition = pathInfos.stream().anyMatch(status -> status.getPath().getName().startsWith(HoodiePartitionMetadata.HOODIE_PARTITION_METAFILE_PREFIX)); | ||
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. for partioned table, if the parent path is already a Hudi partition, is it still necessary to validate the partition metadata files of the subdirectories, can we use short-circuit condition? There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. Yes, makes sense. I will add this |
||
for (StoragePathInfo pathInfo : pathInfos) { | ||
if (pathInfo.isDirectory()) { | ||
// Do not attempt to search for more subdirectories inside directories that are partitions | ||
if (!isHoodiePartition && pathInfo.isDirectory()) { | ||
// Ignore .hoodie directory as there cannot be any partitions inside it | ||
if (!pathInfo.getPath().getName().equals(HoodieTableMetaClient.METAFOLDER_NAME)) { | ||
this.subDirectories.add(pathInfo.getPath()); | ||
} | ||
} else if (pathInfo.getPath().getName().startsWith(HoodiePartitionMetadata.HOODIE_PARTITION_METAFILE_PREFIX)) { | ||
// Presence of partition meta file implies this is a HUDI partition | ||
this.isHoodiePartition = true; | ||
} else if (FSUtils.isDataFile(pathInfo.getPath())) { | ||
} else if (isHoodiePartition && FSUtils.isDataFile(pathInfo.getPath())) { | ||
// Regular HUDI data file (base file or log file) | ||
String dataFileCommitTime = FSUtils.getCommitTime(pathInfo.getPath().getName()); | ||
// Limit the file listings to files which were created by successful commits before the maxInstant time. | ||
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Can we fix the
FSUtils.isDataFile
instead? The check for log file uses the regex pattern match, we should fix the base file check to be in line with the log file.There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
And can you clarify what kind of unexpected parquets would cause issue here?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
If you expose your Hudi table as a Delta Lake table with XTable, you will have parquet files in the
_delta_log
and this will lead to a parsing issue.This is the proper way to fix the issue in my opinion. The intention of this code is to only add files that are in directories with a partition marker file. I'm worried that changing the
isDataFile
may lead to some unintended side effectsThere was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Should be okay if all the CI tests pass. Actually the
isDataFile
for base file does not make sense because the invoker always needs to consider the directory is a Hudi partition dir, let's fix it.Another concern is iterate through all the files under one partition is inefficient.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
A complete dataset includes files that both conform to and do not conform to the Hudi filename format. If the metadata table (MDT) only includes files that conform to Hudi's format, then some file data will be missing. It is not clear whether XTable has its own solution for maintaining the MDT. I think such handling should be maintained on the XTable side, not on the Hudi side. On the Hudi side, I think the MDT construction should fail and throw an exception, prompting the user to handle such anomalous files. what about you?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think that the Hudi bootstrap should only consider files that are managed by Hudi. Letting people do things with their Hudi tables is important in my opinion. This can include adding directories under a base path that are not managed by Hudi to store some metadata.
The issue here is that if you had a Hudi table without MDT and then turn it on and you happen to have any parquet files that are not managed by Hudi then you will get an error even if those files are not in a data partition directory.