fix: Optimizing internal schema lookup in TableSchemaResolver #18387

Merged
nsivabalan merged 1 commit into apache:master from nsivabalan:tableSchemaResolver_InternalSchemaOpt
Mar 27, 2026
Conversation

@nsivabalan
Contributor

Describe the issue this Pull Request addresses

Optimized TableSchemaResolver.getTableInternalSchemaFromCommitMetadata() to use short-circuit evaluation when searching for the most recent schema-updating instant. The previous implementation filtered the entire timeline and then called lastInstant(), which required processing all instants. The new implementation uses getReverseOrderedInstants().filter(...).findFirst() to stop as soon as the first (most recent) matching instant is found.

Summary and Changelog

Summary:
Users with tables that have long timelines will experience faster internal schema lookups, especially when recent commits contain non-schema-updating operations (CLUSTER, COMPACT, INDEX, LOG_COMPACT).

Changelog:

  • Refactored TableSchemaResolver.getTableInternalSchemaFromCommitMetadata() to use getReverseOrderedInstants().filter(...).findFirst()
    instead of filter(...).lastInstant()
  • This enables short-circuit evaluation - the method stops immediately upon finding the first (most recent) schema-updating instant
  • Added 4 comprehensive unit tests to validate correctness and verify the short-circuit behavior
  • Added inline documentation explaining the optimization

Technical details:

  • Before: completedInstants.filter(predicate) → creates filtered timeline → lastInstant() → processes all instants
  • After: completedInstants.getReverseOrderedInstants().filter(predicate).findFirst() → stops at first match
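
The before/after difference can be sketched outside of Hudi with plain Java collections. This is a minimal illustration, not the code from this PR: `Instant`, `countingPredicate`, and `sampleTimeline` are hypothetical stand-ins, and each predicate evaluation is assumed to represent one commit metadata read.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Optional;
import java.util.Set;
import java.util.concurrent.atomic.AtomicInteger;
import java.util.function.Predicate;
import java.util.stream.Collectors;
import java.util.stream.IntStream;

final class ShortCircuitLookupSketch {

    // Hypothetical stand-in for a completed instant; not Hudi's own instant type.
    static final class Instant {
        final String timestamp;
        final String operation;

        Instant(String timestamp, String operation) {
            this.timestamp = timestamp;
            this.operation = operation;
        }
    }

    // Operations the PR description lists as non-schema-updating.
    private static final Set<String> NON_SCHEMA_OPS =
        new HashSet<>(Arrays.asList("CLUSTER", "COMPACT", "INDEX", "LOG_COMPACT"));

    // "Before" shape: filter the whole timeline, then take the last match.
    static Optional<Instant> latestEager(List<Instant> timeline, Predicate<Instant> predicate) {
        List<Instant> matches = timeline.stream().filter(predicate).collect(Collectors.toList());
        return matches.isEmpty() ? Optional.empty() : Optional.of(matches.get(matches.size() - 1));
    }

    // "After" shape: walk newest to oldest and stop at the first match.
    static Optional<Instant> latestShortCircuit(List<Instant> timeline, Predicate<Instant> predicate) {
        return IntStream.range(0, timeline.size())
            .mapToObj(i -> timeline.get(timeline.size() - 1 - i))
            .filter(predicate)
            .findFirst();
    }

    // Counts evaluations; each one is assumed to cost a commit metadata read.
    static Predicate<Instant> countingPredicate(AtomicInteger counter) {
        return instant -> {
            counter.incrementAndGet();
            return !NON_SCHEMA_OPS.contains(instant.operation);
        };
    }

    // 1000 instants, oldest first; the three most recent are clustering commits.
    static List<Instant> sampleTimeline() {
        List<Instant> timeline = new ArrayList<>();
        for (int i = 0; i < 1000; i++) {
            String operation = i >= 997 ? "CLUSTER" : "UPSERT";
            timeline.add(new Instant(String.format("%04d", i), operation));
        }
        return timeline;
    }

    public static void main(String[] args) {
        List<Instant> timeline = sampleTimeline();
        AtomicInteger eagerReads = new AtomicInteger();
        AtomicInteger lazyReads = new AtomicInteger();

        Instant eager = latestEager(timeline, countingPredicate(eagerReads)).get();
        Instant lazy = latestShortCircuit(timeline, countingPredicate(lazyReads)).get();

        System.out.println("eager: " + eager.timestamp + " after " + eagerReads.get() + " checks");
        System.out.println("short-circuit: " + lazy.timestamp + " after " + lazyReads.get() + " checks");
    }
}
```

Both methods return the same instant, which is the "no behavioral changes" property the tests in this PR are meant to guard. Only the number of predicate evaluations differs, and the gap grows with timeline length.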

Impact

Performance improvement with no behavioral changes:

  • Reduces the number of commit metadata reads required, especially beneficial for:
    • Tables with long timelines (hundreds or thousands of commits)
    • Scenarios where recent commits are non-schema-updating operations

Risk Level

low

Documentation Update

None required.

Contributor's checklist

  • Read through contributor's guide
  • Enough context is provided in the sections above
  • Adequate tests were added if applicable

@codecov-commenter

Codecov Report

✅ All modified and coverable lines are covered by tests.
✅ Project coverage is 68.36%. Comparing base (2f07364) to head (6744945).
⚠️ Report is 2 commits behind head on master.

Additional details and impacted files
@@             Coverage Diff              @@
##             master   #18387      +/-   ##
============================================
- Coverage     68.36%   68.36%   -0.01%     
- Complexity    27566    27582      +16     
============================================
  Files          2432     2433       +1     
  Lines        133175   133283     +108     
  Branches      16023    16034      +11     
============================================
+ Hits          91047    91114      +67     
- Misses        35068    35112      +44     
+ Partials       7060     7057       -3     
Flag Coverage Δ
common-and-other-modules 44.34% <0.00%> (+0.02%) ⬆️
hadoop-mr-java-client 45.15% <100.00%> (+<0.01%) ⬆️
spark-client-hadoop-common 48.58% <100.00%> (+0.02%) ⬆️
spark-java-tests 48.71% <100.00%> (+<0.01%) ⬆️
spark-scala-tests 45.37% <100.00%> (-0.02%) ⬇️
utilities 38.53% <100.00%> (-0.01%) ⬇️

Flags with carried forward coverage won't be shown.

Files with missing lines Coverage Δ
.../apache/hudi/common/table/TableSchemaResolver.java 87.64% <100.00%> (+0.06%) ⬆️

... and 26 files with indirect coverage changes

@github-actions github-actions bot added the size:M PR with lines of changes in (100, 300] label Mar 26, 2026
@hudi-bot
Collaborator

CI report:

Bot commands @hudi-bot supports the following commands:
  • @hudi-bot run azure re-run the last Azure build

@nsivabalan nsivabalan merged commit bb5abb6 into apache:master Mar 27, 2026
56 checks passed
5 participants