fix(metadata-table): skip zero-size data files during MDT bootstrap#18611
fix(metadata-table): skip zero-size data files during MDT bootstrap#18611nada-attia wants to merge 2 commits into
Conversation
hudi-agent
left a comment
There was a problem hiding this comment.
🤖 This review was generated by an AI agent and may contain mistakes. Please verify any suggestions before applying.
Thanks for the contribution! This PR skips zero-size data files when building DirectoryInfo during MDT bootstrap and emits a bootstrap_zero_size_files metric. A couple of correctness questions worth double-checking in the inline comments — primarily around the divergence created between MDT and the filesystem (orphaned zero-size files), and whether the fix needs to extend beyond the bootstrap path. Please take a look at any inline comments, and this should be ready for a Hudi committer or PMC member to take it from here. One small naming nit below — the rest of the change is clean and well-structured.
| if (!pendingDataInstants.contains(dataFileCommitTime) && compareTimestamps(dataFileCommitTime, LESSER_THAN_OR_EQUALS, maxInstantTime)) { | ||
| filenameToSizeMap.put(pathInfo.getPath().getName(), pathInfo.getLength()); | ||
| if (pathInfo.getLength() > 0) { | ||
| filenameToSizeMap.put(pathInfo.getPath().getName(), pathInfo.getLength()); |
There was a problem hiding this comment.
🤖 Skipping these files in MDT means the filesystem still has them but MDT doesn't track them. Since the cleaner relies on MDT for file listing, won't these zero-size files become orphans that are never cleaned up? Have you considered whether the fix should also delete the zero-size file (or at least flag it for cleanup) so it doesn't leak forever on disk? @yihua
- AI-generated; verify before applying. React 👍/👎 to flag quality.
| if (!pendingDataInstants.contains(dataFileCommitTime) && compareTimestamps(dataFileCommitTime, LESSER_THAN_OR_EQUALS, maxInstantTime)) { | ||
| filenameToSizeMap.put(pathInfo.getPath().getName(), pathInfo.getLength()); | ||
| if (pathInfo.getLength() > 0) { | ||
| filenameToSizeMap.put(pathInfo.getPath().getName(), pathInfo.getLength()); |
There was a problem hiding this comment.
🤖 This guard lives only in the bootstrap path. If a zero-size file can appear after bootstrap (e.g., from a partially-failed write), would incremental MDT updates hit the same failure mode the rebootstrap is hitting today? It might be worth understanding the upstream root cause (what's producing the zero-size committed data files) before deciding whether the bootstrap-only guard is sufficient.
- AI-generated; verify before applying. React 👍/👎 to flag quality.
| if (pathInfo.getLength() > 0) { | ||
| filenameToSizeMap.put(pathInfo.getPath().getName(), pathInfo.getLength()); | ||
| } else { | ||
| log.warn("Skipping zero-size data file during MDT bootstrap: {}", pathInfo.getPath()); |
There was a problem hiding this comment.
🤖 Is it safe to silently drop a file referenced by a completed commit's metadata? If a prior commit/deltacommit instant in .hoodie/ lists this file (with whatever size it had recorded), MDT bootstrap will produce a state that disagrees with the active timeline's commit metadata. Worth confirming whether reconciliation paths handle that divergence.
- AI-generated; verify before applying. React 👍/👎 to flag quality.
| } | ||
|
|
||
| final long zeroSizeCount = totalZeroSizeFiles; | ||
| metrics.ifPresent(m -> m.incrementMetric("bootstrap_zero_size_files", zeroSizeCount)); |
There was a problem hiding this comment.
🤖 nit: "bootstrap_zero_size_files" as an inline string literal is a bit fragile — have you considered extracting it to a named constant (or alongside other metric name constants in this class) so it's easier to discover and harder to silently misspell?
- AI-generated; verify before applying. React 👍/👎 to flag quality.
hudi-agent
left a comment
There was a problem hiding this comment.
🤖 This review was generated by an AI agent and may contain mistakes. Please verify any suggestions before applying.
Thanks for the test update! The only change since the last pass is a tweak to TestHoodieMetadataBootstrap that deletes the zero-size file before validateMetadata so the filesystem-vs-MDT consistency check passes. Worth noting that this test workaround actually reinforces the orphan-file concern from the prior review: in production no one is deleting that zero-size file, so MDT and the filesystem will disagree indefinitely. None of the four prior comments (metric name constant nit, orphaned files, bootstrap-only guard, divergence from completed commit metadata) appear to be addressed in source yet — leaving them for the author / committer to weigh in on. Please take a look at the prior inline comments, and this should be ready for a Hudi committer or PMC member to take it from here.
Codecov Report✅ All modified and coverable lines are covered by tests. Additional details and impacted files@@ Coverage Diff @@
## master #18611 +/- ##
============================================
- Coverage 68.10% 68.03% -0.08%
- Complexity 28864 28894 +30
============================================
Files 2513 2518 +5
Lines 139909 140547 +638
Branches 17319 17421 +102
============================================
+ Hits 95291 95624 +333
- Misses 36785 37070 +285
- Partials 7833 7853 +20
Flags with carried forward coverage won't be shown. Click here to find out more.
🚀 New features to boost your workflow:
|
|
hey @nada-attia : what exactly the exception we are hitting w/ 0 byte sized files I am wondering, how do we ever clean those files up? |
Describe
During MDT (Metadata Table) bootstrap, zero-size data files are now skipped when building the
filenameToSizeMapinDirectoryInfo. Previously, zero-size files were included in the metadata table entries, which could result in corrupted metadata and cause rebootstrap job failures.The fix adds a size guard in the
DirectoryInfoconstructor so files withlength == 0are excluded and a warning is logged. Abootstrap_zero_size_filesmetric is also emitted with the total count of skipped files across all partitions during bootstrap.Closes #18612
Impact
No public API or user-facing feature change. Rebootstrap jobs that previously failed due to zero-size data files being included in MDT will now succeed. A new metric
bootstrap_zero_size_filesis emitted during bootstrap to surface the count of skipped files.Risk level (write none, low medium or high below)
Low. The change only affects the MDT bootstrap file-listing path. Zero-size files are invalid data files and should never appear in the metadata table. The rest of the bootstrap logic is unchanged.
Documentation Update
None required. No new user-facing configs or features.
Contributor's checklist