Skip to content

fix(flink): fix the write handle close for append write#18756

Merged
danny0405 merged 1 commit into
apache:masterfrom
danny0405:close-fix
May 16, 2026
Merged

fix(flink): fix the write handle close for append write#18756
danny0405 merged 1 commit into
apache:masterfrom
danny0405:close-fix

Conversation

@danny0405
Copy link
Copy Markdown
Contributor

@danny0405 danny0405 commented May 16, 2026

Describe the issue this Pull Request addresses

AppendWriteFunctionWithBIMBufferSort can have an in-flight async flush when the Flink task is being closed. The existing shutdown path stopped accepting new async work but did not wait for already-submitted work before releasing the sort buffers, which could allow background writes to overlap with resource cleanup during failover or cancellation.

In addition, AppendWriteFunction did not close a surviving BulkInsertWriterHelper from its close() path, leaving helper-owned write resources open when a task shuts down before the normal checkpoint flush path closes them.

Summary and Changelog

This change tightens append-writer shutdown behavior so async buffer flushing completes before dependent resources are released, and any surviving writer helper is closed during function shutdown.

Working tree: harden append writer shutdown cleanup

  • Update AppendWriteFunction.close() to close writerHelper when it is still present during task shutdown.
  • Update AppendWriteFunctionWithBIMBufferSort.close() to shut down the async executor, wait for in-flight async flush work to finish, and only then release memory segment pools.
  • Extend TestAppendWriteFunction with testCloseClosesWriterHelper.
  • Add TestAppendWriteFunctionWithBIMBufferSort with testCloseWaitsForAsyncWriteBeforeClosingWriterHelper to verify close ordering when async work is still running.
  • Validate related append-buffer behavior with:
    • TestAppendWriteFunction
    • TestAppendWriteFunctionWithBIMBufferSort
    • TestAppendWriteFunctionWithBufferSort

Impact

This change affects Flink append-write task shutdown behavior only. It does not change public APIs, configuration, storage format, or normal write semantics.

The main user-facing effect is safer cleanup during cancellation/failover for append writers using bounded in-memory buffer sort, reducing the chance of resource cleanup racing with already-submitted async write work.

Risk Level

low

The change is localized to append-writer shutdown paths and adds explicit waiting during close, so the primary risk is a longer close path if async work is still in progress. This is mitigated by targeted unit coverage and successful validation with:

mvn -pl hudi-flink-datasource/hudi-flink -DskipITs -Dtest=TestAppendWriteFunction,TestAppendWriteFunctionWithBIMBufferSort,TestAppendWriteFunctionWithBufferSort test

Result: 16 tests run, 0 failures, 0 errors.

Documentation Update

none

No user-facing configuration, API, or behavior contract changed; this is an internal lifecycle and cleanup fix.

Contributor's checklist

  • Read through contributor's guide
  • Enough context is provided in the sections above
  • Adequate tests were added if applicable

Copy link
Copy Markdown
Contributor

@hudi-agent hudi-agent left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🤖 This review was generated by an AI agent and may contain mistakes. Please verify any suggestions before applying.

Thanks for the contribution! This PR hardens append-writer shutdown so async buffer flushes finish before sort buffers are released and any surviving BulkInsertWriterHelper is closed during task shutdown. The ordering (shutdown → wait for in-flight async → awaitTermination → release pools → super.close → close writerHelper) correctly avoids the race between background writes and resource cleanup, and the tests cover both new behaviors. No correctness issues found. A few style/readability suggestions in the inline comments. Please take a look, and this should be ready for a Hudi committer or PMC member to take it from here. One small naming suggestion on the hardcoded termination timeout; the rest of the changes look clean.

cc @yihua

}
// Do not release the sort buffers while an already-submitted flush is still using them.
waitForAsyncWriteCompletion();
if (!asyncWriteExecutor.awaitTermination(10, TimeUnit.MINUTES)) {
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🤖 nit: the 10 here is a magic number — could you extract it into a named constant like ASYNC_EXECUTOR_TERMINATION_TIMEOUT_MINUTES so a future reader knows this value is intentional rather than arbitrary?

- AI-generated; verify before applying. React 👍/👎 to flag quality.

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Extracting to constant is a good suggestion. Also, should it be configurable? I think that's unnecessary but will defer to your judgement.

Copy link
Copy Markdown
Member

@codope codope left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM. One minor comment.

}
// Do not release the sort buffers while an already-submitted flush is still using them.
waitForAsyncWriteCompletion();
if (!asyncWriteExecutor.awaitTermination(10, TimeUnit.MINUTES)) {
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Extracting to constant is a good suggestion. Also, should it be configurable? I think that's unnecessary but will defer to your judgement.

Copy link
Copy Markdown
Collaborator

@cshuo cshuo left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

+1

@github-actions github-actions Bot added the size:M PR with lines of changes in (100, 300] label May 16, 2026
@codecov-commenter
Copy link
Copy Markdown

Codecov Report

❌ Patch coverage is 50.00000% with 5 lines in your changes missing coverage. Please review.
✅ Project coverage is 68.16%. Comparing base (071b3f1) to head (62c6ff6).
⚠️ Report is 12 commits behind head on master.

Files with missing lines Patch % Lines
...k/append/AppendWriteFunctionWithBIMBufferSort.java 33.33% 1 Missing and 3 partials ⚠️
...g/apache/hudi/sink/append/AppendWriteFunction.java 75.00% 0 Missing and 1 partial ⚠️
Additional details and impacted files
@@             Coverage Diff              @@
##             master   #18756      +/-   ##
============================================
+ Coverage     68.14%   68.16%   +0.01%     
- Complexity    29094    29121      +27     
============================================
  Files          2517     2518       +1     
  Lines        141113   141229     +116     
  Branches      17508    17533      +25     
============================================
+ Hits          96160    96265     +105     
+ Misses        37046    37041       -5     
- Partials       7907     7923      +16     
Flag Coverage Δ
common-and-other-modules 44.42% <50.00%> (+0.01%) ⬆️
hadoop-mr-java-client 45.00% <ø> (+<0.01%) ⬆️
spark-client-hadoop-common 48.33% <ø> (ø)
spark-java-tests 48.99% <ø> (+0.01%) ⬆️
spark-scala-tests 44.90% <ø> (-0.02%) ⬇️
utilities 37.61% <ø> (-0.02%) ⬇️

Flags with carried forward coverage won't be shown. Click here to find out more.

Files with missing lines Coverage Δ
...g/apache/hudi/sink/append/AppendWriteFunction.java 90.32% <75.00%> (-1.06%) ⬇️
...k/append/AppendWriteFunctionWithBIMBufferSort.java 13.13% <33.33%> (+13.13%) ⬆️

... and 26 files with indirect coverage changes

🚀 New features to boost your workflow:
  • ❄️ Test Analytics: Detect flaky tests, report on failures, and find test suite problems.

@hudi-bot
Copy link
Copy Markdown
Collaborator

CI report:

Bot commands @hudi-bot supports the following commands:
  • @hudi-bot run azure re-run the last Azure build

@HuangZhenQiu
Copy link
Copy Markdown
Member

+1

Copy link
Copy Markdown
Contributor

@suryaprasanna suryaprasanna left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM

@danny0405 danny0405 merged commit 780595b into apache:master May 16, 2026
65 checks passed
dwshmilyss pushed a commit to dwshmilyss/hudi that referenced this pull request May 21, 2026
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

size:M PR with lines of changes in (100, 300]

Projects

None yet

Development

Successfully merging this pull request may close these issues.

8 participants