Skip to content

fix: Fix Scanner resource leak in HiveIncrementalPuller#18441

Merged
yihua merged 1 commit into
apache:masterfrom
linliu-code:fix_hive_leak
May 15, 2026
Merged

fix: Fix Scanner resource leak in HiveIncrementalPuller#18441
yihua merged 1 commit into
apache:masterfrom
linliu-code:fix_hive_leak

Conversation

@linliu-code
Copy link
Copy Markdown
Collaborator

@linliu-code linliu-code commented Apr 1, 2026

Describe the issue this Pull Request addresses

#18440

HiveIncrementalPuller has two related bugs in executeIncrementalSQL:

  1. Resource leak: The Scanner opened on the incremental SQL file is never closed — new Scanner(new File(...)).useDelimiter("\Z").next() leaks the file handle on every invocation.
  2. Late validation: The SQL file is validated (checks for correct source table reference and _hoodie_commit_time predicate) inside executeIncrementalSQL, which is only called after a JDBC connection has been established and a DROP TABLE IF EXISTS has already been issued. A misconfigured SQL file therefore wastes a connection and mutates state before the error is caught.

Summary and Changelog

Summary: SQL file validation now happens eagerly before any JDBC work, and the file handle is properly released after reading.

  • Extracted SQL validation into a new private validateIncrementalSQL() method and moved the call to the top of saveDelta(), before the JDBC connection is opened.
  • Wrapped Scanner in try-with-resources to guarantee the file handle is closed after reading, even on error.
  • Widened executeIncrementalSQL from private to package-private to enable unit testing the SQL template rendering logic without a live Hive server.
  • Added TestHiveIncrementalPullerExecuteSQL with three test cases: correct SQL rendering and commit-time substitution, FileNotFoundException on missing SQL file, and SQLException propagation on JDBC execution failure.

Impact

No public API or user-facing behavior change. The only observable difference is that a misconfigured incrementalSQLFile now throws before a JDBC connection is opened rather than after.

Risk Level

Low. The logic is unchanged — validation conditions are identical to before. The refactor only reorders when validation occurs (earlier) and ensures the Scanner is closed. Covered by new unit tests.

Documentation Update

Contributor's checklist

  • Read through contributor's guide
  • Enough context is provided in the sections above
  • Adequate tests were added if applicable

@linliu-code linliu-code marked this pull request as ready for review April 1, 2026 18:52
@github-actions github-actions Bot added the size:M PR with lines of changes in (100, 300] label Apr 1, 2026
@linliu-code linliu-code force-pushed the fix_hive_leak branch 2 times, most recently from 877eaad to 2e04da9 Compare April 1, 2026 19:35
@linliu-code linliu-code changed the title [BUG] Fix Scanner resource leak in HiveIncrementalPuller fix: Fix Scanner resource leak in HiveIncrementalPuller Apr 1, 2026
@linliu-code linliu-code force-pushed the fix_hive_leak branch 2 times, most recently from 85f5935 to 53ad69a Compare April 1, 2026 22:34
@hudi-bot
Copy link
Copy Markdown
Collaborator

hudi-bot commented Apr 1, 2026

CI report:

Bot commands @hudi-bot supports the following commands:
  • @hudi-bot run azure re-run the last Azure build

Copy link
Copy Markdown
Contributor

@yihua yihua left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🤖 This review was generated by an AI agent and may contain mistakes. Please verify any suggestions before applying.

Thanks for contributing! The resource leak fix and early validation are solid improvements. Just one design concern: the SQL file is now read twice (once for validation, once for execution) — an inline comment has details on how to consolidate that.

}
}

private void validateIncrementalSQL() throws IOException {
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🤖 The SQL file is now read twice — once here for validation and again in executeIncrementalSQL. Could this method return the validated SQL string so it can be passed into executeIncrementalSQL instead of re-reading? That would also eliminate a subtle TOCTOU window where the file could change between the two reads.

}

private void executeIncrementalSQL(String tempDbTable, String tempDbTablePath, Statement stmt)
void executeIncrementalSQL(String tempDbTable, String tempDbTablePath, Statement stmt)
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🤖 Now that validation was moved to validateIncrementalSQL, this method still declares throws FileNotFoundException and re-reads the file. If the validated SQL were passed in as a parameter, this method wouldn't need file access at all — making it easier to test without a real file on disk.

@yihua yihua added this to the release-1.2.0 milestone May 15, 2026
…mentalPuller

- Extract validateIncrementalSQL() and call it at the top of saveDelta(),
  before any JDBC connection is opened, so a misconfigured SQL file fails
  fast without issuing a DROP TABLE or consuming a connection slot.
- Wrap Scanner in try-with-resources to close the file handle after reading.
- Widen executeIncrementalSQL visibility to package-private to allow unit
  testing the SQL template rendering with a mocked Statement.
- Add TestHiveIncrementalPullerExecuteSQL covering: correct SQL rendering,
  file-not-found, and JDBC execution failure cases.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Copy link
Copy Markdown
Contributor

@yihua yihua left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM

@yihua yihua merged commit 21271e4 into apache:master May 15, 2026
20 of 62 checks passed
Copy link
Copy Markdown
Contributor

@hudi-agent hudi-agent left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🤖 This review was generated by an AI agent and may contain mistakes. Please verify any suggestions before applying.

Thanks for the contribution! The Scanner try-with-resources and validation-before-JDBC refactor look good and the new unit tests are a nice addition. However, the diff also includes some unrelated changes to scanForCommitTime and getLastCommitTimePulled that appear to break compilation against current master — please take a look at the inline comments. This should be ready for a Hudi committer or PMC member to take it from here once those are sorted. One naming/visibility nit below.

}
HoodieTableMetaClient metadata = HoodieTableMetaClient.builder()
.setConf(HadoopFSUtils.getStorageConfWithCopy(fs.getConf())).setBasePath(targetDataPath).build();
HoodieTableMetaClient metadata = HoodieTableMetaClient.builder().setConf(fs.getConf()).setBasePath(targetDataPath).build();
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🤖 These two hunks look like they'd break compilation against current master: HoodieTableMetaClient.Builder#setConf only accepts StorageConfiguration<?> (not the Configuration returned by fs.getConf()), and HoodieInstant has requestedTime() but no getTimestamp() method. The same applies to the HoodieInstant::getTimestamp method reference in getLastCommitTimePulled and the removal of the HadoopFSUtils import. Was this PR rebased onto an older base, or is this an intentional revert? Either way it looks unrelated to the Scanner-leak fix described in the PR and should probably be dropped from this change.

- AI-generated; verify before applying. React 👍/👎 to flag quality.

}

private void executeIncrementalSQL(String tempDbTable, String tempDbTablePath, Statement stmt)
void executeIncrementalSQL(String tempDbTable, String tempDbTablePath, Statement stmt)
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🤖 nit: could you add @VisibleForTesting here? The method was private and is now package-private purely to allow the new test to call it directly, but that intent isn't obvious to a future reader. The rest of the utilities module uses this Guava annotation consistently for the same pattern (e.g. HoodieMetadataTableValidator).

- AI-generated; verify before applying. React 👍/👎 to flag quality.

@codecov-commenter
Copy link
Copy Markdown

Codecov Report

✅ All modified and coverable lines are covered by tests.
✅ Project coverage is 54.00%. Comparing base (071b3f1) to head (b62c2f7).
⚠️ Report is 6 commits behind head on master.

❗ There is a different number of reports uploaded between BASE (071b3f1) and HEAD (b62c2f7). Click for more details.

HEAD has 32 uploads less than BASE
Flag BASE (071b3f1) HEAD (b62c2f7)
spark-scala-tests 12 0
utilities 1 0
spark-java-tests 18 0
common-and-other-modules 1 0
Additional details and impacted files
@@              Coverage Diff              @@
##             master   #18441       +/-   ##
=============================================
- Coverage     68.14%   54.00%   -14.14%     
+ Complexity    29094    12459    -16635     
=============================================
  Files          2517     1434     -1083     
  Lines        141113    72161    -68952     
  Branches      17508     8245     -9263     
=============================================
- Hits          96160    38974    -57186     
+ Misses        37046    29691     -7355     
+ Partials       7907     3496     -4411     
Flag Coverage Δ
common-and-other-modules ?
hadoop-mr-java-client 45.01% <ø> (+<0.01%) ⬆️
spark-client-hadoop-common 48.32% <ø> (-0.01%) ⬇️
spark-java-tests ?
spark-scala-tests ?
utilities ?

Flags with carried forward coverage won't be shown. Click here to find out more.
see 1863 files with indirect coverage changes

🚀 New features to boost your workflow:
  • ❄️ Test Analytics: Detect flaky tests, report on failures, and find test suite problems.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

size:M PR with lines of changes in (100, 300]

Projects

None yet

Development

Successfully merging this pull request may close these issues.

5 participants