Skip to content

fix(metadata-table): skip zero-size data files during MDT bootstrap#18611

Open
nada-attia wants to merge 2 commits into
apache:masterfrom
nada-attia:nada-attia/fix-mdt-bootstrap-zero-size-files
Open

fix(metadata-table): skip zero-size data files during MDT bootstrap#18611
nada-attia wants to merge 2 commits into
apache:masterfrom
nada-attia:nada-attia/fix-mdt-bootstrap-zero-size-files

Conversation

@nada-attia
Copy link
Copy Markdown
Contributor

@nada-attia nada-attia commented Apr 27, 2026

Describe

During MDT (Metadata Table) bootstrap, zero-size data files are now skipped when building the filenameToSizeMap in DirectoryInfo. Previously, zero-size files were included in the metadata table entries, which could result in corrupted metadata and cause rebootstrap job failures.

The fix adds a size guard in the DirectoryInfo constructor so files with length == 0 are excluded and a warning is logged. A bootstrap_zero_size_files metric is also emitted with the total count of skipped files across all partitions during bootstrap.

Closes #18612

Impact

No public API or user-facing feature change. Rebootstrap jobs that previously failed due to zero-size data files being included in MDT will now succeed. A new metric bootstrap_zero_size_files is emitted during bootstrap to surface the count of skipped files.

Risk level (write none, low medium or high below)

Low. The change only affects the MDT bootstrap file-listing path. Zero-size files are invalid data files and should never appear in the metadata table. The rest of the bootstrap logic is unchanged.

Documentation Update

None required. No new user-facing configs or features.

Contributor's checklist

  • Read through contributor's guide
  • Change Logs and Impact were stated clearly
  • Adequate tests were added if applicable
  • CI passed

Copy link
Copy Markdown
Contributor

@hudi-agent hudi-agent left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🤖 This review was generated by an AI agent and may contain mistakes. Please verify any suggestions before applying.

Thanks for the contribution! This PR skips zero-size data files when building DirectoryInfo during MDT bootstrap and emits a bootstrap_zero_size_files metric. A couple of correctness questions worth double-checking in the inline comments — primarily around the divergence created between MDT and the filesystem (orphaned zero-size files), and whether the fix needs to extend beyond the bootstrap path. Please take a look at any inline comments, and this should be ready for a Hudi committer or PMC member to take it from here. One small naming nit below — the rest of the change is clean and well-structured.

if (!pendingDataInstants.contains(dataFileCommitTime) && compareTimestamps(dataFileCommitTime, LESSER_THAN_OR_EQUALS, maxInstantTime)) {
filenameToSizeMap.put(pathInfo.getPath().getName(), pathInfo.getLength());
if (pathInfo.getLength() > 0) {
filenameToSizeMap.put(pathInfo.getPath().getName(), pathInfo.getLength());
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🤖 Skipping these files in MDT means the filesystem still has them but MDT doesn't track them. Since the cleaner relies on MDT for file listing, won't these zero-size files become orphans that are never cleaned up? Have you considered whether the fix should also delete the zero-size file (or at least flag it for cleanup) so it doesn't leak forever on disk? @yihua

- AI-generated; verify before applying. React 👍/👎 to flag quality.

if (!pendingDataInstants.contains(dataFileCommitTime) && compareTimestamps(dataFileCommitTime, LESSER_THAN_OR_EQUALS, maxInstantTime)) {
filenameToSizeMap.put(pathInfo.getPath().getName(), pathInfo.getLength());
if (pathInfo.getLength() > 0) {
filenameToSizeMap.put(pathInfo.getPath().getName(), pathInfo.getLength());
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🤖 This guard lives only in the bootstrap path. If a zero-size file can appear after bootstrap (e.g., from a partially-failed write), would incremental MDT updates hit the same failure mode the rebootstrap is hitting today? It might be worth understanding the upstream root cause (what's producing the zero-size committed data files) before deciding whether the bootstrap-only guard is sufficient.

- AI-generated; verify before applying. React 👍/👎 to flag quality.

if (pathInfo.getLength() > 0) {
filenameToSizeMap.put(pathInfo.getPath().getName(), pathInfo.getLength());
} else {
log.warn("Skipping zero-size data file during MDT bootstrap: {}", pathInfo.getPath());
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🤖 Is it safe to silently drop a file referenced by a completed commit's metadata? If a prior commit/deltacommit instant in .hoodie/ lists this file (with whatever size it had recorded), MDT bootstrap will produce a state that disagrees with the active timeline's commit metadata. Worth confirming whether reconciliation paths handle that divergence.

- AI-generated; verify before applying. React 👍/👎 to flag quality.

}

final long zeroSizeCount = totalZeroSizeFiles;
metrics.ifPresent(m -> m.incrementMetric("bootstrap_zero_size_files", zeroSizeCount));
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🤖 nit: "bootstrap_zero_size_files" as an inline string literal is a bit fragile — have you considered extracting it to a named constant (or alongside other metric name constants in this class) so it's easier to discover and harder to silently misspell?

- AI-generated; verify before applying. React 👍/👎 to flag quality.

@github-actions github-actions Bot added the size:S PR with lines of changes in (10, 100] label Apr 27, 2026
Copy link
Copy Markdown
Contributor

@hudi-agent hudi-agent left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🤖 This review was generated by an AI agent and may contain mistakes. Please verify any suggestions before applying.

Thanks for the test update! The only change since the last pass is a tweak to TestHoodieMetadataBootstrap that deletes the zero-size file before validateMetadata so the filesystem-vs-MDT consistency check passes. Worth noting that this test workaround actually reinforces the orphan-file concern from the prior review: in production no one is deleting that zero-size file, so MDT and the filesystem will disagree indefinitely. None of the four prior comments (metric name constant nit, orphaned files, bootstrap-only guard, divergence from completed commit metadata) appear to be addressed in source yet — leaving them for the author / committer to weigh in on. Please take a look at the prior inline comments, and this should be ready for a Hudi committer or PMC member to take it from here.

@hudi-bot
Copy link
Copy Markdown
Collaborator

CI report:

Bot commands @hudi-bot supports the following commands:
  • @hudi-bot run azure re-run the last Azure build

@codecov-commenter
Copy link
Copy Markdown

Codecov Report

✅ All modified and coverable lines are covered by tests.
✅ Project coverage is 68.03%. Comparing base (7ae0fd9) to head (377d588).
⚠️ Report is 41 commits behind head on master.

Additional details and impacted files
@@             Coverage Diff              @@
##             master   #18611      +/-   ##
============================================
- Coverage     68.10%   68.03%   -0.08%     
- Complexity    28864    28894      +30     
============================================
  Files          2513     2518       +5     
  Lines        139909   140547     +638     
  Branches      17319    17421     +102     
============================================
+ Hits          95291    95624     +333     
- Misses        36785    37070     +285     
- Partials       7833     7853      +20     
Flag Coverage Δ
common-and-other-modules 44.35% <60.00%> (+0.05%) ⬆️
hadoop-mr-java-client 44.81% <70.00%> (-0.02%) ⬇️
spark-client-hadoop-common 48.41% <70.00%> (+<0.01%) ⬆️
spark-java-tests 48.63% <100.00%> (-0.27%) ⬇️
spark-scala-tests 44.70% <70.00%> (-0.23%) ⬇️
utilities 37.69% <70.00%> (-0.23%) ⬇️

Flags with carried forward coverage won't be shown. Click here to find out more.

Files with missing lines Coverage Δ
...hudi/metadata/HoodieBackedTableMetadataWriter.java 83.91% <100.00%> (+0.05%) ⬆️
.../apache/hudi/metadata/HoodieTableMetadataUtil.java 82.35% <100.00%> (+0.06%) ⬆️

... and 26 files with indirect coverage changes

🚀 New features to boost your workflow:
  • ❄️ Test Analytics: Detect flaky tests, report on failures, and find test suite problems.

@nsivabalan
Copy link
Copy Markdown
Contributor

hey @nada-attia : what exactly the exception we are hitting w/ 0 byte sized files

I am wondering, how do we ever clean those files up?
It would just stay forever right? bcoz cleaner is never going to run for those files.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

size:S PR with lines of changes in (10, 100]

Projects

None yet

Development

Successfully merging this pull request may close these issues.

Zero-size data files during MDT bootstrap cause rebootstrap job failures

5 participants