
[fix][broker] Fix race condition in ServerCnx producer/consumer async callbacks#25352

Merged
lhotari merged 2 commits into apache:master from merlimat:fix/ServerCnx-async-callback-threading
Mar 19, 2026

Conversation

@merlimat
Contributor

Flaky test failure

ServerCnxTest.testCreateProducerTimeout:2093->getResponse:2884->getResponse:2908 IO Failed to get response from socket within 10s

Summary

Fix a race condition in ServerCnx where producer/consumer lifecycle callbacks could execute out of order, causing the create/close/create producer sequence to fail intermittently.

Root Cause

The ServerCnx producers and consumers ConcurrentLongHashMap maps are created with concurrencyLevel=1, on the assumption that all accesses happen on the same Netty IO thread (ctx.executor()). However, most async callbacks in the producer and consumer lifecycle used the synchronous CompletableFuture variants (thenAccept, thenCompose, etc.) instead of the *Async variants with ctx.executor().

The issue is how CompletableFuture chaining works without an explicit executor:

  • If the upstream future is NOT yet completed when thenCompose(fn) is called, fn is queued as a dependent and runs when the completing thread calls complete(). Multiple chained stages execute in registration order since the completing thread walks the chain sequentially.

  • If the upstream future is ALREADY completed when thenCompose(fn) is called, fn runs IMMEDIATELY on the calling thread, right there in the thenCompose call. It skips the queue entirely.

This distinction causes the create/close/create producer race: both createProducer1 and createProducer2 call getOrCreateTopic() and chain thenCompose(topic -> addProducer(...)). When the topic future is not yet completed, both callbacks are queued as dependents and execute in order. But if the topic future is already completed when createProducer2 chains on it, producer2's addProducer() runs immediately inline — potentially before producer1's cleanup from the close command has finished. Producer1 wins the race to register, producer2's addProducer() fails with "already connected", and the client never gets a success response.
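The two completion behaviors described above are easy to reproduce with a plain CompletableFuture, independent of any Pulsar code. A minimal sketch (the class and thread names here are illustrative, not taken from ServerCnx):

```java
import java.util.concurrent.CompletableFuture;

public class InlineCompletionDemo {
    public static void main(String[] args) throws Exception {
        // Case 1: the future is ALREADY completed when the stage is chained,
        // so the callback runs inline on the chaining (calling) thread.
        CompletableFuture<String> done = CompletableFuture.completedFuture("topic");
        String caller = Thread.currentThread().getName();
        CompletableFuture<String> r1 = done.thenApply(t -> Thread.currentThread().getName());
        System.out.println(r1.join().equals(caller)); // true: ran inline on the caller

        // Case 2: the future is NOT yet completed, so the callback is queued
        // as a dependent and runs on whichever thread later calls complete().
        CompletableFuture<String> pending = new CompletableFuture<>();
        CompletableFuture<String> r2 = pending.thenApply(t -> Thread.currentThread().getName());
        Thread completer = new Thread(() -> pending.complete("topic"), "completer");
        completer.start();
        completer.join();
        System.out.println(r2.join()); // "completer": ran on the completing thread
    }
}
```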

Fix

Using thenComposeAsync(fn, ctx.executor()) forces fn to always be submitted to the executor's task queue, regardless of whether the future is already completed. This guarantees FIFO ordering — all stages go through the same queue, so the close handler runs before producer2's creation, and producer1's cleanup completes before producer2 tries to register.
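A minimal sketch of this guarantee, with a single-threaded executor standing in for Netty's ctx.executor() (the class and thread names are hypothetical, not Pulsar code):

```java
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;

public class AsyncVariantDemo {
    public static void main(String[] args) throws Exception {
        // Stand-in for the Netty IO thread: a single-threaded executor whose
        // task queue provides the FIFO ordering the fix relies on.
        ExecutorService ioThread = Executors.newSingleThreadExecutor(
                r -> new Thread(r, "io-thread"));

        // Even though the upstream future is already completed, thenComposeAsync
        // submits fn to the executor's queue instead of running it inline.
        CompletableFuture<String> done = CompletableFuture.completedFuture("topic");
        CompletableFuture<String> r = done.thenComposeAsync(
                t -> CompletableFuture.completedFuture(Thread.currentThread().getName()),
                ioThread);

        System.out.println(r.join()); // "io-thread": never the chaining thread
        ioThread.shutdown();
    }
}
```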

Changes

  • handleProducer chain: thenApplyAsync, thenComposeAsync, thenRunAsync, exceptionallyAsync with ctx.executor()
  • buildProducerAndAddTopic: thenAcceptAsync, thenRunAsync
  • handleSubscribe chain: thenApplyAsync, thenAcceptAsync, exceptionallyAsync
  • handleCloseProducer: thenAcceptAsync
  • safelyRemoveProducer/Consumer: whenCompleteAsync
  • ServerCnxTest: add channel.runPendingTasks() in getResponse() polling loop so EmbeddedChannel executor tasks are processed
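The test change is needed because an EmbeddedChannel's executor does not run queued tasks spontaneously; after the fix, every response is produced by a task sitting in that queue, so the polling loop must drain it. A minimal stand-in sketch of the mechanism (ManualExecutor is a simplified illustration, not Netty's actual EmbeddedChannel API):

```java
import java.util.ArrayDeque;
import java.util.Queue;
import java.util.concurrent.CompletableFuture;

public class PendingTasksDemo {
    /** Simplified stand-in for EmbeddedChannel's executor: queued tasks only
     *  run when runPendingTasks() is invoked, never on their own. */
    static class ManualExecutor implements java.util.concurrent.Executor {
        final Queue<Runnable> tasks = new ArrayDeque<>();
        public void execute(Runnable r) { tasks.add(r); }
        void runPendingTasks() { Runnable r; while ((r = tasks.poll()) != null) r.run(); }
    }

    public static void main(String[] args) {
        ManualExecutor executor = new ManualExecutor();
        CompletableFuture<String> topic = CompletableFuture.completedFuture("t");
        // With the *Async variants, the response is computed by a queued task.
        CompletableFuture<String> response =
                topic.thenApplyAsync(t -> "producer-success", executor);

        // Without draining the queue, the response never completes and a
        // polling loop times out (the flaky-test symptom above)...
        System.out.println(response.isDone()); // false
        // ...which is why getResponse() must call channel.runPendingTasks().
        executor.runPendingTasks();
        System.out.println(response.join());
    }
}
```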

Documentation

  • doc-not-needed
    (Your PR doesn't need any doc update)

Matching PR in forked repository

No response

Tip

Add the labels ready-to-test and area/test to trigger the CI.

@github-actions github-actions bot added the doc-not-needed Your PR changes do not impact docs label Mar 18, 2026
@codecov-commenter

Codecov Report

✅ All modified and coverable lines are covered by tests.
✅ Project coverage is 72.73%. Comparing base (7f6bc23) to head (cc60065).
⚠️ Report is 8 commits behind head on master.

Additional details and impacted files


@@              Coverage Diff              @@
##             master   #25352       +/-   ##
=============================================
+ Coverage     37.45%   72.73%   +35.27%     
+ Complexity    13161     2457    -10704     
=============================================
  Files          1897     1954       +57     
  Lines        150557   154768     +4211     
  Branches      17215    17718      +503     
=============================================
+ Hits          56398   112572    +56174     
+ Misses        86428    33146    -53282     
- Partials       7731     9050     +1319     
Flag Coverage Δ
inttests 25.83% <100.00%> (-0.21%) ⬇️
systests 22.54% <100.00%> (+0.04%) ⬆️
unittests 73.71% <100.00%> (+39.54%) ⬆️

Flags with carried forward coverage won't be shown.

Files with missing lines Coverage Δ
...va/org/apache/pulsar/broker/service/ServerCnx.java 71.83% <100.00%> (+28.15%) ⬆️

... and 1413 files with indirect coverage changes


Member

@lhotari lhotari left a comment


LGTM. This explains some related issues that we have seen in the past.

@lhotari lhotari merged commit 39fa6d5 into apache:master Mar 19, 2026
103 of 106 checks passed
@lhotari lhotari added this to the 4.2.0 milestone Mar 19, 2026
Technoboy- pushed a commit to Technoboy-/pulsar that referenced this pull request Mar 19, 2026
@merlimat merlimat deleted the fix/ServerCnx-async-callback-threading branch March 19, 2026 15:46