[HUDI-7758] Only consider files in Hudi partitions when initializing MDT #11219

Merged (4 commits), Jul 4, 2024
@@ -461,6 +461,8 @@ public void testOnlyValidPartitionsAdded(HoodieTableType tableType) throws Excep
  // Create an empty directory which is not a partition directory (lacks partition metadata)
  final String nonPartitionDirectory = HoodieTestDataGenerator.DEFAULT_PARTITION_PATHS[0] + "-nonpartition";
  Files.createDirectories(Paths.get(basePath, nonPartitionDirectory));
+ // Write random file to assert it is not added to the view
+ Files.createFile(Paths.get(basePath, nonPartitionDirectory, "randomFile.parquet"));

  // Three directories which are partitions but will be ignored due to filter
  final String filterDirRegex = ".*-filterDir\\d|\\..*";
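
To make the filter concrete, here is a minimal sketch of which directory names the regex from the test would exclude. It assumes full-match semantics (as with `String.matches`), which may differ from how Hudi applies the filter internally. The class name, method name, and sample paths are hypothetical.

```java
import java.util.regex.Pattern;

final class FilterDirRegexSketch {
    // Regex copied from the test: names ending in "-filterDir<digit>", or names starting with a dot.
    static final Pattern FILTER = Pattern.compile(".*-filterDir\\d|\\..*");

    // Assumption of this sketch: the whole relative path must match the regex.
    static boolean isFiltered(String relativePath) {
        return FILTER.matcher(relativePath).matches();
    }

    public static void main(String[] args) {
        // Hypothetical sample paths, not taken from the PR.
        String[] samples = {"2016/03/15-filterDir1", ".hidden", "2016/03/15", "2016/03/15-nonpartition"};
        for (String sample : samples) {
            System.out.println(sample + " -> filtered: " + isFiltered(sample));
        }
    }
}
```

Under these assumptions the first two samples are filtered and the last two are not, so the "-nonpartition" directory is not excluded by the regex and has to be excluded by the partition metadata check instead.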
@@ -79,6 +79,7 @@ public class FSUtils {
  public static final Pattern LOG_FILE_PATTERN =
      Pattern.compile("^\\.(.+)_(.*)\\.(log|archive)\\.(\\d+)(_((\\d+)-(\\d+)-(\\d+))(.cdc)?)?");
  public static final Pattern PREFIX_BY_FILE_ID_PATTERN = Pattern.compile("^(.+)-(\\d+)");
+ private static final Pattern BASE_FILE_PATTERN = Pattern.compile("[a-zA-Z0-9-]+_[a-zA-Z0-9-]+_[0-9]+\\.[a-zA-Z0-9]+");

  private static final String LOG_FILE_EXTENSION = ".log";

@@ -398,7 +399,10 @@ public static String makeLogFileName(String fileId, String logFileExtension, Str

  public static boolean isBaseFile(StoragePath path) {
    String extension = getFileExtension(path.getName());
-   return HoodieFileFormat.BASE_FILE_EXTENSIONS.contains(extension);
+   if (HoodieFileFormat.BASE_FILE_EXTENSIONS.contains(extension)) {
+     return BASE_FILE_PATTERN.matcher(path.getName()).matches();
+   }
+   return false;
  }

  public static boolean isLogFile(StoragePath logPath) {
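
As a rough illustration of what the new name check accepts, the sketch below applies the same regex to a few file names. The sample names are hypothetical. The reading of the three underscore-separated parts as file ID, write token, and instant time is an assumption about Hudi's naming, not something stated in this diff.

```java
import java.util.regex.Pattern;

final class BaseFileNameSketch {
    // Same regex as BASE_FILE_PATTERN in the diff above.
    // Assumed layout: <fileId>_<writeToken>_<instantTime>.<extension>
    private static final Pattern BASE_FILE_PATTERN =
        Pattern.compile("[a-zA-Z0-9-]+_[a-zA-Z0-9-]+_[0-9]+\\.[a-zA-Z0-9]+");

    // Returns true only when the whole file name has the expected three-part shape.
    static boolean matchesBaseFileName(String fileName) {
        return BASE_FILE_PATTERN.matcher(fileName).matches();
    }

    public static void main(String[] args) {
        // Hypothetical names: one Hudi-style base file, the test's random file,
        // and a Delta-style checkpoint file such as XTable might leave in _delta_log.
        String[] samples = {
            "a1b2c3d4-0_1-0-1_20240704000000000.parquet",
            "randomFile.parquet",
            "00000000000000000010.checkpoint.parquet"
        };
        for (String sample : samples) {
            System.out.println(sample + " -> " + matchesBaseFileName(sample));
        }
    }
}
```

Only the first name matches. The other two have a base file extension but contain no underscores, so an extension-only check would have accepted them.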
@@ -2000,16 +2000,16 @@ public DirectoryInfo(String relativePath, List<StoragePathInfo> pathInfos, Strin
  // Pre-allocate with the maximum length possible
  filenameToSizeMap = new HashMap<>(pathInfos.size());

+ // Presence of partition meta file implies this is a HUDI partition
+ isHoodiePartition = pathInfos.stream().anyMatch(status -> status.getPath().getName().startsWith(HoodiePartitionMetadata.HOODIE_PARTITION_METAFILE_PREFIX));
Contributor

Can we fix FSUtils.isDataFile instead? The check for log files uses a regex pattern match, so we should fix the base file check to be in line with the log file check.

Contributor

And can you clarify what kind of unexpected Parquet files would cause an issue here?

Contributor Author

If you expose your Hudi table as a Delta Lake table with XTable, you will have parquet files in the _delta_log and this will lead to a parsing issue.

In my opinion, this is the proper way to fix the issue. The intention of this code is to add only files that are in directories with a partition marker file. I'm worried that changing the isDataFile may lead to some unintended side effects.

Contributor

> I'm worried that changing the isDataFile may lead to some unintended side effects

It should be okay if all the CI tests pass. Actually, the isDataFile check for base files does not make sense as it stands, because the caller always also has to consider whether the directory is a Hudi partition directory. Let's fix it.

Another concern is that iterating through all the files under one partition is inefficient.

Contributor

> If you expose your Hudi table as a Delta Lake table with XTable, you will have parquet files in the _delta_log and this will lead to a parsing issue.

A complete dataset includes both files that conform to the Hudi filename format and files that do not. If the metadata table (MDT) only includes files that conform to Hudi's format, then some file data will be missing. It is not clear whether XTable has its own solution for maintaining the MDT. I think such handling should be maintained on the XTable side, not on the Hudi side. On the Hudi side, I think the MDT construction should fail and throw an exception, prompting the user to handle such anomalous files. What do you think?

Contributor Author

I think that the Hudi bootstrap should only consider files that are managed by Hudi. In my opinion, it is important to let people do things with their Hudi tables. This can include adding directories under the base path that are not managed by Hudi, for example to store some metadata.

The issue here is that if you have a Hudi table without the MDT, then turn it on, and you happen to have any Parquet files that are not managed by Hudi, you will get an error even if those files are not in a data partition directory.

Contributor

For a partitioned table, if the parent path is already a Hudi partition, is it still necessary to validate the partition metadata files of the subdirectories? Can we use a short-circuit condition?

Contributor Author

Yes, that makes sense. I will add this.

  for (StoragePathInfo pathInfo : pathInfos) {
-   if (pathInfo.isDirectory()) {
+   // Do not attempt to search for more subdirectories inside directories that are partitions
+   if (!isHoodiePartition && pathInfo.isDirectory()) {
      // Ignore .hoodie directory as there cannot be any partitions inside it
      if (!pathInfo.getPath().getName().equals(HoodieTableMetaClient.METAFOLDER_NAME)) {
        this.subDirectories.add(pathInfo.getPath());
      }
-   } else if (pathInfo.getPath().getName().startsWith(HoodiePartitionMetadata.HOODIE_PARTITION_METAFILE_PREFIX)) {
-     // Presence of partition meta file implies this is a HUDI partition
-     this.isHoodiePartition = true;
-   } else if (FSUtils.isDataFile(pathInfo.getPath())) {
+   } else if (isHoodiePartition && FSUtils.isDataFile(pathInfo.getPath())) {
      // Regular HUDI data file (base file or log file)
      String dataFileCommitTime = FSUtils.getCommitTime(pathInfo.getPath().getName());
      // Limit the file listings to files which were created by successful commits before the maxInstant time.
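
The listing logic discussed in this thread can be sketched outside Hudi with plain `java.nio.file`. This is an illustration under stated assumptions, not the Hudi implementation. The marker file name and the `.hoodie` folder name are assumed values standing in for `HoodiePartitionMetadata.HOODIE_PARTITION_METAFILE_PREFIX` and `HoodieTableMetaClient.METAFOLDER_NAME`. A simple `.parquet` suffix check stands in for `FSUtils.isDataFile`. All class and method names are hypothetical.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Deque;
import java.util.List;
import java.util.stream.Collectors;
import java.util.stream.Stream;

final class PartitionListingSketch {
    // Assumed values; see the lead-in above.
    static final String PARTITION_METAFILE_PREFIX = ".hoodie_partition_metadata";
    static final String META_FOLDER_NAME = ".hoodie";

    // Walks the table and returns the names of data files found in partition directories only.
    static List<String> collectDataFiles(Path basePath) {
        List<String> dataFiles = new ArrayList<>();
        Deque<Path> pending = new ArrayDeque<>();
        pending.push(basePath);
        while (!pending.isEmpty()) {
            List<Path> entries = list(pending.pop());
            // Decide up front whether this directory is a partition (marker file present).
            boolean isPartition = entries.stream()
                .anyMatch(p -> p.getFileName().toString().startsWith(PARTITION_METAFILE_PREFIX));
            for (Path entry : entries) {
                String name = entry.getFileName().toString();
                if (Files.isDirectory(entry)) {
                    // Short-circuit: do not descend below a partition or into the metadata folder.
                    if (!isPartition && !name.equals(META_FOLDER_NAME)) {
                        pending.push(entry);
                    }
                } else if (isPartition && name.endsWith(".parquet")) {
                    // Stand-in for FSUtils.isDataFile; only files inside partitions are kept.
                    dataFiles.add(name);
                }
            }
        }
        return dataFiles;
    }

    private static List<Path> list(Path dir) {
        try (Stream<Path> stream = Files.list(dir)) {
            return stream.collect(Collectors.toList());
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    // Builds a throwaway layout: one real partition and one non-partition directory.
    static Path createDemoTable() {
        try {
            Path base = Files.createTempDirectory("hudi-sketch");
            Path partition = Files.createDirectories(base.resolve("2024"));
            Files.createFile(partition.resolve(PARTITION_METAFILE_PREFIX));
            Files.createFile(partition.resolve("f1_1-0-1_001.parquet"));
            Path nonPartition = Files.createDirectories(base.resolve("_delta_log"));
            Files.createFile(nonPartition.resolve("randomFile.parquet"));
            return base;
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        System.out.println(collectDataFiles(createDemoTable()));
    }
}
```

In the demo layout, the Parquet file under `_delta_log` is skipped because that directory has no marker file, which mirrors the case raised in the review. Computing the partition flag before the loop is what allows both the short-circuit on subdirectories and the guard on data files.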
@@ -381,7 +381,7 @@ public static void createInflightSavepoint(String basePath, String instantTime)
    createMetaFile(basePath, instantTime, HoodieTimeline.INFLIGHT_SAVEPOINT_EXTENSION);
  }

- public static void createPartitionMetaFile(String basePath, String partitionPath) throws IOException {
+ public static URI createPartitionMetaFile(String basePath, String partitionPath) throws IOException {
    Path metaFilePath;
    try {
      Path parentPath = Paths.get(new URI(basePath).getPath(), partitionPath);
@@ -390,6 +390,7 @@ public static void createPartitionMetaFile(String basePath, String partitionPath
      if (Files.notExists(metaFilePath)) {
        Files.createFile(metaFilePath);
      }
+     return metaFilePath.toUri();
    } catch (URISyntaxException e) {
      throw new HoodieException("Error creating partition meta file", e);
    }
@@ -27,6 +27,7 @@
  import org.apache.hudi.common.model.HoodieBaseFile;
  import org.apache.hudi.common.model.HoodieRecord;
  import org.apache.hudi.common.table.HoodieTableMetaClient;
+ import org.apache.hudi.common.testutils.FileCreateUtils;
  import org.apache.hudi.common.testutils.HoodieCommonTestHarness;
  import org.apache.hudi.common.testutils.HoodieTestDataGenerator;
  import org.apache.hudi.common.testutils.HoodieTestTable;
@@ -40,6 +41,7 @@
  import org.junit.jupiter.api.Test;

  import java.io.IOException;
+ import java.net.URI;
  import java.util.ArrayList;
  import java.util.Arrays;
  import java.util.Collections;
@@ -98,6 +100,8 @@ public void testConvertFilesToPartitionStatsRecords() throws Exception {
  // Generate 10 inserts for each partition and populate partitionBaseFilePairs and recordKeys.
  DATE_PARTITIONS.forEach(p -> {
    try {
+     URI partitionMetaFile = FileCreateUtils.createPartitionMetaFile(basePath, p);
+     StoragePath partitionMetadataPath = new StoragePath(partitionMetaFile);
      String fileId1 = UUID.randomUUID().toString();
      FileSlice fileSlice1 = new FileSlice(p, instant1, fileId1);
      StoragePath storagePath1 = new StoragePath(hoodieTestTable.getBaseFilePath(p, fileId1).toUri());
@@ -122,7 +126,7 @@ public void testConvertFilesToPartitionStatsRecords() throws Exception {
      fileSlice2.setBaseFile(baseFile2);
      partitionInfoList.add(new HoodieTableMetadataUtil.DirectoryInfo(
          p,
-         metaClient.getStorage().listDirectEntries(Arrays.asList(storagePath1, storagePath2)),
+         metaClient.getStorage().listDirectEntries(Arrays.asList(partitionMetadataPath, storagePath1, storagePath2)),
          instant2,
          Collections.emptySet()));
    } catch (Exception e) {