-
Notifications
You must be signed in to change notification settings - Fork 2.5k
fix(metadata-table): skip zero-size data files during MDT bootstrap #18611
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
base: master
Are you sure you want to change the base?
Changes from all commits
File filter
Filter by extension
Conversations
Jump to
Diff view
Diff view
There are no files selected for viewing
| Original file line number | Diff line number | Diff line change |
|---|---|---|
|
|
@@ -3100,6 +3100,7 @@ public static class DirectoryInfo implements Serializable { | |
| private final List<StoragePath> subDirectories = new ArrayList<>(); | ||
| // Is this a hoodie partition | ||
| private boolean isHoodiePartition = false; | ||
| private int zeroSizeFileCount = 0; | ||
|
|
||
| public DirectoryInfo(String relativePath, List<StoragePathInfo> pathInfos, String maxInstantTime, Set<String> pendingDataInstants) { | ||
| this(relativePath, pathInfos, maxInstantTime, pendingDataInstants, true); | ||
|
|
@@ -3130,11 +3131,20 @@ public DirectoryInfo(String relativePath, List<StoragePathInfo> pathInfos, Strin | |
| String dataFileCommitTime = FSUtils.getCommitTime(pathInfo.getPath().getName()); | ||
| // Limit the file listings to files which were created by successful commits before the maxInstant time. | ||
| if (!pendingDataInstants.contains(dataFileCommitTime) && compareTimestamps(dataFileCommitTime, LESSER_THAN_OR_EQUALS, maxInstantTime)) { | ||
| filenameToSizeMap.put(pathInfo.getPath().getName(), pathInfo.getLength()); | ||
| if (pathInfo.getLength() > 0) { | ||
| filenameToSizeMap.put(pathInfo.getPath().getName(), pathInfo.getLength()); | ||
|
Contributor
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. 🤖 Skipping these files in MDT means the filesystem still has them but MDT doesn't track them. Since the cleaner relies on MDT for file listing, won't these zero-size files become orphans that are never cleaned up? Have you considered whether the fix should also delete the zero-size file (or at least flag it for cleanup) so it doesn't leak forever on disk? @yihua - AI-generated; verify before applying. React 👍/👎 to flag quality.
Contributor
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. 🤖 This guard lives only in the bootstrap path. If a zero-size file can appear after bootstrap (e.g., from a partially-failed write), would incremental MDT updates hit the same failure mode the rebootstrap is hitting today? It might be worth understanding the upstream root cause (what's producing the zero-size committed data files) before deciding whether the bootstrap-only guard is sufficient. - AI-generated; verify before applying. React 👍/👎 to flag quality. |
||
| } else { | ||
| log.warn("Skipping zero-size data file during MDT bootstrap: {}", pathInfo.getPath()); | ||
|
Contributor
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. 🤖 Is it safe to silently drop a file referenced by a completed commit's metadata? If a prior - AI-generated; verify before applying. React 👍/👎 to flag quality. |
||
| zeroSizeFileCount++; | ||
| } | ||
| } | ||
| } | ||
| } | ||
| } | ||
|
|
||
| public int getZeroSizeFileCount() { | ||
| return zeroSizeFileCount; | ||
| } | ||
| } | ||
|
|
||
| private static TypedProperties getFileGroupReaderPropertiesFromStorageConf(StorageConfiguration<?> storageConf) { | ||
|
|
||
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
🤖 nit:
"bootstrap_zero_size_files"as an inline string literal is a bit fragile — have you considered extracting it to a named constant (or alongside other metric name constants in this class) so it's easier to discover and harder to silently misspell?- AI-generated; verify before applying. React 👍/👎 to flag quality.