
[fix][broker] Fix race condition in ServerCnx producer/consumer async callbacks#25352

Merged
lhotari merged 2 commits into apache:master from merlimat:fix/ServerCnx-async-callback-threading
Mar 19, 2026

Conversation

@merlimat
Contributor

Flaky test failure

ServerCnxTest.testCreateProducerTimeout:2093->getResponse:2884->getResponse:2908 IO Failed to get response from socket within 10s

Summary

Fix a race condition in ServerCnx where producer/consumer lifecycle callbacks could execute out of order, causing the create/close/create producer sequence to fail intermittently.

Root Cause

The ServerCnx producers and consumers ConcurrentLongHashMap maps are created with concurrencyLevel=1, on the assumption that all accesses happen on the same Netty IO thread (ctx.executor()). However, most async callbacks in the producer and consumer lifecycle used the synchronous CompletableFuture variants (thenAccept, thenCompose, etc.) instead of the *Async variants with ctx.executor().

The issue is how CompletableFuture chaining works without an explicit executor:

  • If the upstream future is NOT yet completed when thenCompose(fn) is called, fn is queued as a dependent and runs when the completing thread calls complete(). Multiple chained stages execute in registration order since the completing thread walks the chain sequentially.

  • If the upstream future is ALREADY completed when thenCompose(fn) is called, fn runs IMMEDIATELY on the calling thread, right there in the thenCompose call. It skips the queue entirely.

This distinction causes the create/close/create producer race: both createProducer1 and createProducer2 call getOrCreateTopic() and chain thenCompose(topic -> addProducer(...)). When the topic future is not yet completed, both callbacks are queued as dependents and execute in order. But if the topic future is already completed when createProducer2 chains on it, producer2's addProducer() runs immediately inline — potentially before producer1's cleanup from the close command has finished. Producer1 wins the race to register, producer2's addProducer() fails with "already connected", and the client never gets a success response.
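The two completion behaviors described above are easy to reproduce with a plain CompletableFuture, independent of any Pulsar code. A minimal sketch (the class and thread names here are illustrative, not taken from ServerCnx):

```java
import java.util.concurrent.CompletableFuture;

public class InlineCompletionDemo {
    public static void main(String[] args) throws Exception {
        // Case 1: the future is ALREADY completed when the stage is chained,
        // so the callback runs inline on the chaining (calling) thread.
        CompletableFuture<String> done = CompletableFuture.completedFuture("topic");
        String caller = Thread.currentThread().getName();
        CompletableFuture<String> r1 = done.thenApply(t -> Thread.currentThread().getName());
        System.out.println(r1.join().equals(caller)); // true: ran inline on the caller

        // Case 2: the future is NOT yet completed, so the callback is queued
        // as a dependent and runs on whichever thread later calls complete().
        CompletableFuture<String> pending = new CompletableFuture<>();
        CompletableFuture<String> r2 = pending.thenApply(t -> Thread.currentThread().getName());
        Thread completer = new Thread(() -> pending.complete("topic"), "completer");
        completer.start();
        completer.join();
        System.out.println(r2.join()); // "completer": ran on the completing thread
    }
}
```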

Fix

Using thenComposeAsync(fn, ctx.executor()) forces fn to always be submitted to the executor's task queue, regardless of whether the future is already completed. This guarantees FIFO ordering — all stages go through the same queue, so the close handler runs before producer2's creation, and producer1's cleanup completes before producer2 tries to register.
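A minimal sketch of this guarantee, with a single-threaded executor standing in for Netty's ctx.executor() (the class and thread names are hypothetical, not Pulsar code):

```java
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;

public class AsyncVariantDemo {
    public static void main(String[] args) throws Exception {
        // Stand-in for the Netty IO thread: a single-threaded executor whose
        // task queue provides the FIFO ordering the fix relies on.
        ExecutorService ioThread = Executors.newSingleThreadExecutor(
                r -> new Thread(r, "io-thread"));

        // Even though the upstream future is already completed, thenComposeAsync
        // submits fn to the executor's queue instead of running it inline.
        CompletableFuture<String> done = CompletableFuture.completedFuture("topic");
        CompletableFuture<String> r = done.thenComposeAsync(
                t -> CompletableFuture.completedFuture(Thread.currentThread().getName()),
                ioThread);

        System.out.println(r.join()); // "io-thread": never the chaining thread
        ioThread.shutdown();
    }
}
```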

Changes

  • handleProducer chain: thenApplyAsync, thenComposeAsync, thenRunAsync, exceptionallyAsync with ctx.executor()
  • buildProducerAndAddTopic: thenAcceptAsync, thenRunAsync
  • handleSubscribe chain: thenApplyAsync, thenAcceptAsync, exceptionallyAsync
  • handleCloseProducer: thenAcceptAsync
  • safelyRemoveProducer/Consumer: whenCompleteAsync
  • ServerCnxTest: add channel.runPendingTasks() in getResponse() polling loop so EmbeddedChannel executor tasks are processed
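The test change is needed because an EmbeddedChannel's executor does not run queued tasks spontaneously; after the fix, every response is produced by a task sitting in that queue, so the polling loop must drain it. A minimal stand-in sketch of the mechanism (ManualExecutor is a simplified illustration, not Netty's actual EmbeddedChannel API):

```java
import java.util.ArrayDeque;
import java.util.Queue;
import java.util.concurrent.CompletableFuture;

public class PendingTasksDemo {
    /** Simplified stand-in for EmbeddedChannel's executor: queued tasks only
     *  run when runPendingTasks() is invoked, never on their own. */
    static class ManualExecutor implements java.util.concurrent.Executor {
        final Queue<Runnable> tasks = new ArrayDeque<>();
        public void execute(Runnable r) { tasks.add(r); }
        void runPendingTasks() { Runnable r; while ((r = tasks.poll()) != null) r.run(); }
    }

    public static void main(String[] args) {
        ManualExecutor executor = new ManualExecutor();
        CompletableFuture<String> topic = CompletableFuture.completedFuture("t");
        // With the *Async variants, the response is computed by a queued task.
        CompletableFuture<String> response =
                topic.thenApplyAsync(t -> "producer-success", executor);

        // Without draining the queue, the response never completes and a
        // polling loop times out (the flaky-test symptom above)...
        System.out.println(response.isDone()); // false
        // ...which is why getResponse() must call channel.runPendingTasks().
        executor.runPendingTasks();
        System.out.println(response.join());
    }
}
```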

Documentation

  • doc-not-needed
    (Your PR doesn't need any doc update)

Matching PR in forked repository

No response

Tip

Add the labels ready-to-test and area/test to trigger the CI.

@github-actions github-actions bot added the doc-not-needed Your PR changes do not impact docs label Mar 18, 2026
@codecov-commenter

Codecov Report

✅ All modified and coverable lines are covered by tests.
✅ Project coverage is 72.73%. Comparing base (7f6bc23) to head (cc60065).
⚠️ Report is 8 commits behind head on master.

Additional details and impacted files


@@              Coverage Diff              @@
##             master   #25352       +/-   ##
=============================================
+ Coverage     37.45%   72.73%   +35.27%     
+ Complexity    13161     2457    -10704     
=============================================
  Files          1897     1954       +57     
  Lines        150557   154768     +4211     
  Branches      17215    17718      +503     
=============================================
+ Hits          56398   112572    +56174     
+ Misses        86428    33146    -53282     
- Partials       7731     9050     +1319     
Flag Coverage Δ
inttests 25.83% <100.00%> (-0.21%) ⬇️
systests 22.54% <100.00%> (+0.04%) ⬆️
unittests 73.71% <100.00%> (+39.54%) ⬆️

Flags with carried forward coverage won't be shown.

Files with missing lines Coverage Δ
...va/org/apache/pulsar/broker/service/ServerCnx.java 71.83% <100.00%> (+28.15%) ⬆️

... and 1413 files with indirect coverage changes


Member

@lhotari lhotari left a comment


LGTM. This explains some related issues that we have seen in the past.

@lhotari lhotari merged commit 39fa6d5 into apache:master Mar 19, 2026
103 of 106 checks passed
@lhotari lhotari added this to the 4.2.0 milestone Mar 19, 2026
Technoboy- pushed a commit to Technoboy-/pulsar that referenced this pull request Mar 19, 2026
@merlimat merlimat deleted the fix/ServerCnx-async-callback-threading branch March 19, 2026 15:46