@@ -1077,6 +1077,7 @@ private List<DirectoryInfo> listAllPartitionsFromFilesystem(String initializatio
StorageConfiguration<?> storageConf = dataMetaClient.getStorageConf();
final String dirFilterRegex = dataWriteConfig.getMetadataConfig().getDirectoryFilterRegex();
StoragePath storageBasePath = dataMetaClient.getBasePath();
long totalZeroSizeFiles = 0;

while (!pathsToList.isEmpty()) {
// In each round we will list a section of directories
@@ -1096,6 +1097,7 @@ private List<DirectoryInfo> listAllPartitionsFromFilesystem(String initializatio
// If the listing reveals a directory, add it to queue. If the listing reveals a hoodie partition, add it to
// the results.
for (DirectoryInfo dirInfo : processedDirectories) {
totalZeroSizeFiles += dirInfo.getZeroSizeFileCount();
if (!dirFilterRegex.isEmpty()) {
final String relativePath = dirInfo.getRelativePath();
if (!relativePath.isEmpty() && relativePath.matches(dirFilterRegex)) {
@@ -1114,6 +1116,8 @@ private List<DirectoryInfo> listAllPartitionsFromFilesystem(String initializatio
}
}

final long zeroSizeCount = totalZeroSizeFiles;
metrics.ifPresent(m -> m.incrementMetric("bootstrap_zero_size_files", zeroSizeCount));

🤖 nit: the inline string literal "bootstrap_zero_size_files" is a bit fragile. Have you considered extracting it to a named constant (ideally alongside any other metric name constants in this class) so it's easier to discover and harder to silently misspell?

- AI-generated; verify before applying. React 👍/👎 to flag quality.
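
A minimal sketch of what the nit above suggests, assuming a simple counter-style metrics sink. `SketchMetrics`, `BootstrapMetricNames`, and `reportZeroSizeFiles` are hypothetical names used only for illustration and are not Hudi classes; only the metric string and the `incrementMetric(name, value)` call shape come from the diff.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.Optional;

// Hypothetical stand-in for the metrics object used in the diff; not a Hudi class.
final class SketchMetrics {
  private final Map<String, Long> counters = new HashMap<>();

  // Adds delta to the named counter, creating it on first use.
  void incrementMetric(String name, long delta) {
    counters.merge(name, delta, Long::sum);
  }

  long get(String name) {
    return counters.getOrDefault(name, 0L);
  }
}

// Hypothetical holder for metric names, so the string is defined exactly once.
final class BootstrapMetricNames {
  static final String BOOTSTRAP_ZERO_SIZE_FILES = "bootstrap_zero_size_files";

  private BootstrapMetricNames() {
  }
}

final class BootstrapMetricsSketch {
  // Mirrors the call in the diff, but refers to the constant instead of a literal.
  static void reportZeroSizeFiles(Optional<SketchMetrics> metrics, long zeroSizeCount) {
    metrics.ifPresent(m -> m.incrementMetric(BootstrapMetricNames.BOOTSTRAP_ZERO_SIZE_FILES, zeroSizeCount));
  }
}
```

With a constant, a misspelled metric name becomes a compile error at every call site instead of a silently separate counter.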

return partitionsToBootstrap;
}

@@ -108,6 +108,29 @@ public void testMetadataBootstrapWithExtraFiles() throws Exception {
validateMetadata(testTable);
}

@Test
public void testMetadataBootstrapSkipsZeroSizeFiles() throws Exception {
HoodieTableType tableType = COPY_ON_WRITE;
init(tableType, false);
doPreBootstrapWriteOperation(testTable, INSERT, "0000001");
doPreBootstrapWriteOperation(testTable, "0000002");
// Add a zero-size base file — bootstrap should skip it without failing.
String fileName = UUID.randomUUID().toString();
Path zeroSizeFilePath = FileCreateUtilsLegacy.getBaseFilePath(basePath, "p1", "0000003", fileName);
FileCreateUtilsLegacy.createBaseFile(basePath, "p1", "0000003", fileName, 0);

writeConfig = getWriteConfig(true, true);
initWriteConfigAndMetatableWriter(writeConfig, true);
syncTableMetadata(writeConfig);

// Delete the zero-size file before validation — it was skipped in MDT and must not
// exist on disk for the filesystem-vs-MDT consistency check to pass.
Files.delete(zeroSizeFilePath);
validateMetadata(testTable);
doWriteInsertAndUpsert(testTable);
validateMetadata(testTable);
}

@ParameterizedTest
@EnumSource(HoodieTableType.class)
public void testMetadataBootstrapInsertUpsertRollback(HoodieTableType tableType) throws Exception {
@@ -3100,6 +3100,7 @@ public static class DirectoryInfo implements Serializable {
private final List<StoragePath> subDirectories = new ArrayList<>();
// Is this a hoodie partition
private boolean isHoodiePartition = false;
private int zeroSizeFileCount = 0;

public DirectoryInfo(String relativePath, List<StoragePathInfo> pathInfos, String maxInstantTime, Set<String> pendingDataInstants) {
this(relativePath, pathInfos, maxInstantTime, pendingDataInstants, true);
@@ -3130,11 +3131,20 @@ public DirectoryInfo(String relativePath, List<StoragePathInfo> pathInfos, Strin
String dataFileCommitTime = FSUtils.getCommitTime(pathInfo.getPath().getName());
// Limit the file listings to files which were created by successful commits before the maxInstant time.
if (!pendingDataInstants.contains(dataFileCommitTime) && compareTimestamps(dataFileCommitTime, LESSER_THAN_OR_EQUALS, maxInstantTime)) {
filenameToSizeMap.put(pathInfo.getPath().getName(), pathInfo.getLength());
if (pathInfo.getLength() > 0) {
filenameToSizeMap.put(pathInfo.getPath().getName(), pathInfo.getLength());

🤖 Skipping these files in MDT means the filesystem still has them but MDT doesn't track them. Since the cleaner relies on MDT for file listing, won't these zero-size files become orphans that are never cleaned up? Have you considered whether the fix should also delete the zero-size file (or at least flag it for cleanup) so it doesn't leak forever on disk? @yihua

- AI-generated; verify before applying. React 👍/👎 to flag quality.
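
One possible direction for the "flag it for cleanup" idea above, sketched under assumptions: instead of only counting zero-size files, the listing step could remember their names so a later step can delete or report them. `ZeroSizeListingSketch` and `fromListing` are hypothetical and simplified (plain file names and lengths rather than `StoragePathInfo`); this is not the PR's `DirectoryInfo`.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical illustration only; not part of Hudi's DirectoryInfo.
final class ZeroSizeListingSketch {
  // Files with a positive length, which the metadata table would track.
  final Map<String, Long> filenameToSizeMap = new LinkedHashMap<>();
  // Zero-size files, kept by name so a later step could delete or report them.
  final List<String> zeroSizeFileNames = new ArrayList<>();

  // Splits a name-to-length listing into tracked files and flagged zero-size files.
  // Assumes no null lengths in the input map.
  static ZeroSizeListingSketch fromListing(Map<String, Long> nameToLength) {
    ZeroSizeListingSketch result = new ZeroSizeListingSketch();
    for (Map.Entry<String, Long> entry : nameToLength.entrySet()) {
      if (entry.getValue() > 0) {
        result.filenameToSizeMap.put(entry.getKey(), entry.getValue());
      } else {
        result.zeroSizeFileNames.add(entry.getKey());
      }
    }
    return result;
  }

  // Same counter the PR exposes, derived from the flagged names.
  int getZeroSizeFileCount() {
    return zeroSizeFileNames.size();
  }
}
```

Whether deleting the flagged files is safe depends on the open questions in this thread, so the sketch stops at collecting them.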


🤖 This guard lives only in the bootstrap path. If a zero-size file can appear after bootstrap (e.g., from a partially-failed write), would incremental MDT updates hit the same failure mode the rebootstrap is hitting today? It might be worth understanding the upstream root cause (what's producing the zero-size committed data files) before deciding whether the bootstrap-only guard is sufficient.

- AI-generated; verify before applying. React 👍/👎 to flag quality.

} else {
log.warn("Skipping zero-size data file during MDT bootstrap: {}", pathInfo.getPath());

🤖 Is it safe to silently drop a file referenced by a completed commit's metadata? If a prior commit/deltacommit instant in .hoodie/ lists this file (with whatever size it had recorded), MDT bootstrap will produce a state that disagrees with the active timeline's commit metadata. Worth confirming whether reconciliation paths handle that divergence.

- AI-generated; verify before applying. React 👍/👎 to flag quality.
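
The divergence raised above can be made concrete with a small check, sketched here under assumptions: given the file names a completed commit's metadata references and the file names kept after the zero-size guard, report what is referenced but no longer tracked. `CommitListingDivergenceSketch` and `missingFromListing` are hypothetical names; how Hudi's own reconciliation paths behave is not shown in this diff.

```java
import java.util.Set;
import java.util.TreeSet;

// Hypothetical helper for illustration; not a Hudi API.
final class CommitListingDivergenceSketch {
  // Returns the file names that commit metadata references but the
  // bootstrap listing did not keep (for example, skipped zero-size files).
  static Set<String> missingFromListing(Set<String> filesInCommitMetadata,
                                        Set<String> filesTrackedAfterBootstrap) {
    Set<String> missing = new TreeSet<>(filesInCommitMetadata);
    missing.removeAll(filesTrackedAfterBootstrap);
    return missing;
  }
}
```

An empty result would mean the guard dropped nothing that a completed commit still references.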

zeroSizeFileCount++;
}
}
}
}
}

public int getZeroSizeFileCount() {
return zeroSizeFileCount;
}
}

private static TypedProperties getFileGroupReaderPropertiesFromStorageConf(StorageConfiguration<?> storageConf) {