Skip to content

[fix][test] Fix flaky ProducerCleanupTest timer cleanup#25864

Merged
lhotari merged 2 commits into
apache:masterfrom
oneby-wang:producer_timer_task_cleanup_test_fix
May 25, 2026
Merged

[fix][test] Fix flaky ProducerCleanupTest timer cleanup#25864
lhotari merged 2 commits into
apache:masterfrom
oneby-wang:producer_timer_task_cleanup_test_fix

Conversation

@oneby-wang
Copy link
Copy Markdown
Contributor

Motivation

ProducerCleanupTest#testAllTimerTaskShouldCanceledAfterProducerClosed is flaky because the test asserted HashedWheelTimer.pendingTimeouts() on the shared test client immediately after producer close. The producer close path cancels the send timeout and producer stats timeout, but Netty's HashedWheelTimer removes cancelled timeouts asynchronously in the timer worker. Under load, pendingTimeouts() can still report the cancelled tasks for a short period.

Example failure:

2026-05-23T08:07:31,286 - INFO  - [BookieDeathWatcher-53904:BookieServer] - BookieDeathWatcher exited loop due to uncaught exception from thread BookieDeathWatcher-53904 {}
java.lang.RuntimeException: Bookie is not running any more
	at org.apache.bookkeeper.proto.BookieServer$DeathWatcher.run(BookieServer.java:289)
2026-05-23T08:07:31,286 - INFO  - [BookieDeathWatcher-53904:BookieServer] - Shutting down BookieServer {}
2026-05-23T08:07:31,286 - INFO  - [BookieDeathWatcher-53904:BookieNettyServer] - Shutting down BookieNettyServer {}

Expected :0
Actual   :2

java.lang.AssertionError: expected [0] but found [2]
	at org.testng.Assert.fail(Assert.java:111)
	at org.testng.Assert.failNotEquals(Assert.java:1590)
	at org.testng.Assert.assertEqualsImpl(Assert.java:150)
	at org.testng.Assert.assertEquals(Assert.java:132)
	at org.testng.Assert.assertEquals(Assert.java:992)
	at org.testng.Assert.assertEquals(Assert.java:956)
	at org.testng.Assert.assertEquals(Assert.java:1002)
	at org.apache.pulsar.client.api.ProducerCleanupTest.testAllTimerTaskShouldCanceledAfterProducerClosed(ProducerCleanupTest.java:40)

Modifications

  • Use a dedicated PulsarClient in the test so the timer assertion only observes tasks created by this test.
  • Set the producer send timeout to 15 seconds. This keeps the timeout task in the initial state long enough for producer.close() to exercise the cancellation path.
  • Replace the fixed Thread.sleep(2000) and one-shot assertion with Awaitility, waiting up to 10 seconds for Netty's timer worker to clean up cancelled timeouts.

Verifying this change

  • Make sure that the change passes the CI checks.

If the box was checked, please highlight the changes

  • Dependencies (add or upgrade a dependency)
  • The public API
  • The schema
  • The default values of configurations
  • The threading model
  • The binary protocol
  • The REST endpoints
  • The admin CLI options
  • The metrics
  • Anything that affects deployment

Copy link
Copy Markdown
Member

@lhotari lhotari left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM

@lhotari lhotari added this to the 5.0.0-M1 milestone May 25, 2026
@lhotari lhotari merged commit 2e02b78 into apache:master May 25, 2026
43 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants