Fix server outbound write failure creating zombie channels by suvodeep-pyne · Pull Request #17845 · apache/pinot

suvodeep-pyne · 2026-03-09T22:16:21Z

Summary

When writeAndFlush() fails on the server (e.g. direct memory OOM), Netty half-closes the channel via shutdownOutput() but channelInactive() never fires, creating a zombie channel that accepts queries but never sends responses.
The broker sees silent timeouts and keeps routing queries to the broken channel.
Add an f.isSuccess() check to the writeAndFlush listener in InstanceRequestHandler.sendResponse(). On failure: log the error, increment the NETTY_CONNECTION_SEND_RESPONSE_EXCEPTIONS metric, and close the channel via ctx.close() to trigger proper cleanup.
Since sendErrorResponse() delegates to sendResponse(), this single change covers all outbound writes.
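
For readers skimming the description, here is a rough sketch of the failure branch in plain Java. It is not the actual Pinot or Netty code: WriteContext, ResponseSender, and FakeWriteContext are hypothetical stand-ins, CompletableFuture plays the role of the Netty write future, and an AtomicLong stands in for the server meter.

```java
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.atomic.AtomicLong;

/** Hypothetical stand-in for the Netty channel context; not the real io.netty API. */
interface WriteContext {
  CompletableFuture<Void> writeAndFlush(byte[] payload);

  void close();
}

/** Sketch of a sender that closes the channel when an outbound write fails. */
class ResponseSender {
  // Stand-in for the server meter that counts failed response writes.
  private final AtomicLong _sendResponseFailures = new AtomicLong();

  void sendResponse(WriteContext ctx, long requestId, byte[] payload) {
    ctx.writeAndFlush(payload).whenComplete((ignored, cause) -> {
      if (cause == null) {
        return; // Write succeeded, nothing to clean up.
      }
      System.err.println("Failed to send response for request: " + requestId + ", cause: " + cause);
      _sendResponseFailures.incrementAndGet();
      // Close the channel so it cannot linger half-open and keep accepting queries.
      ctx.close();
    });
  }

  long getSendResponseFailures() {
    return _sendResponseFailures.get();
  }
}

/** In-memory context whose writes succeed or fail on demand. */
class FakeWriteContext implements WriteContext {
  private final boolean _failWrites;
  private boolean _closed;

  FakeWriteContext(boolean failWrites) {
    _failWrites = failWrites;
  }

  @Override
  public CompletableFuture<Void> writeAndFlush(byte[] payload) {
    CompletableFuture<Void> future = new CompletableFuture<>();
    if (_failWrites) {
      future.completeExceptionally(new IllegalStateException("simulated write failure"));
    } else {
      future.complete(null);
    }
    return future;
  }

  @Override
  public void close() {
    _closed = true;
  }

  boolean isClosed() {
    return _closed;
  }
}
```

Routing every outbound write through one method with this branch is what lets a single change cover both normal and error responses.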

Test plan

Added testWriteFailureClosesChannel unit test that captures the write listener, invokes it with a failed future, and verifies ctx.close() is called
Existing testCancelQuery test passes (no regression)
InstanceRequestHandlerTest: 2 tests, 0 failures
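
The capture-then-invoke approach from the test plan can be illustrated without Mockito. The sketch below is only an approximation under stated assumptions: StubWriteFuture, StubContext, and HandlerUnderTest are hypothetical hand-rolled fakes, not the Netty or Pinot classes, and the recorded listener list plays the role of an ArgumentCaptor.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.Consumer;

/** Hypothetical minimal write future; stands in for Netty's ChannelFuture. */
final class StubWriteFuture {
  private final Throwable _cause;
  private final List<Consumer<StubWriteFuture>> _capturedListeners = new ArrayList<>();

  StubWriteFuture(Throwable cause) {
    _cause = cause;
  }

  boolean isSuccess() {
    return _cause == null;
  }

  Throwable cause() {
    return _cause;
  }

  // Records the listener instead of running it, similar to what a captor does in a mock-based test.
  void addListener(Consumer<StubWriteFuture> listener) {
    _capturedListeners.add(listener);
  }

  List<Consumer<StubWriteFuture>> capturedListeners() {
    return _capturedListeners;
  }
}

/** Hypothetical context that always returns the same pre-built future and counts close() calls. */
final class StubContext {
  private final StubWriteFuture _future;
  private int _closeCalls;

  StubContext(StubWriteFuture future) {
    _future = future;
  }

  StubWriteFuture writeAndFlush(Object message) {
    return _future;
  }

  void close() {
    _closeCalls++;
  }

  int closeCalls() {
    return _closeCalls;
  }
}

/** Simplified handler logic under test: close the context when the write did not succeed. */
final class HandlerUnderTest {
  private HandlerUnderTest() {
  }

  static void sendResponse(StubContext ctx, Object response) {
    ctx.writeAndFlush(response).addListener(f -> {
      if (!f.isSuccess()) {
        ctx.close();
      }
    });
  }
}
```

A test would call sendResponse(), pull the single recorded listener from capturedListeners(), invoke it with the failed future, and then check that closeCalls() is 1.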


codecov-commenter · 2026-03-09T23:17:54Z

❌ 1 Tests Failed:

Tests completed	Failed	Passed	Skipped
10979	1	10978	55

View the top 2 failed test(s) by shortest run time

org.apache.pinot.controller.helix.core.minion.PinotTaskManagerDistributedLockingTest::testCreateTaskBlocksScheduleTaskForSpecificTable

Stack Traces | 11.6s run time

expected [Could not acquire table level distributed lock for scheduled task type: TestDistributedLockTaskType, table: testTable1_OFFLINE. Another controller is likely generating tasks for this table. Please try again later.] but found [Could not acquire table level distributed lock for scheduled task type: TestDistributedLockTaskType, table: testTable2_OFFLINE. Another controller is likely generating tasks for this table. Please try again later.]

org.apache.pinot.controller.helix.core.minion.PinotTaskManagerDistributedLockingTest::testCreateTaskBlocksScheduleTaskForSpecificTable

Stack Traces | 14.5s run time

expected [Could not acquire table level distributed lock for scheduled task type: TestDistributedLockTaskType, table: testTable1_OFFLINE. Another controller is likely generating tasks for this table. Please try again later.] but found [Could not acquire table level distributed lock for scheduled task type: TestDistributedLockTaskType, table: testTable2_OFFLINE. Another controller is likely generating tasks for this table. Please try again later.]

View the full list of 1 ❄️ flaky test(s)

org.apache.pinot.integration.tests.PauselessRealtimeIngestionWithDedupIntegrationTest::setUp
Flake rate in main: 100.00% (Passed 0 times, Failed 68 times)
Stack Traces | 16.9s run time
Failed to load 5 documents; current count=4 for table=DedupTableWithReplicas_REALTIME expected [5] but found [4]


Copilot

Pull request overview

Fixes a server-side Netty edge case where outbound writeAndFlush() failures can leave channels half-closed (“zombie” channels) that still accept requests but never return responses, by explicitly closing the channel on write failure and tracking the failure via a new metric.

Changes:

Add a ChannelFuture listener success/failure branch in InstanceRequestHandler.sendResponse(); on failure, log, meter, and ctx.close().
Introduce a new server meter NETTY_CONNECTION_SEND_RESPONSE_FAILURES.
Add a unit test asserting the channel is closed when the write future completes unsuccessfully.

Reviewed changes

Copilot reviewed 3 out of 3 changed files in this pull request and generated 3 comments.

File	Description
`pinot-core/src/main/java/org/apache/pinot/core/transport/InstanceRequestHandler.java`	Close the Netty channel and record a metric when response writes fail, preventing “zombie” channels.
`pinot-common/src/main/java/org/apache/pinot/common/metrics/ServerMeter.java`	Add a new server meter for outbound send-response failures.
`pinot-core/src/test/java/org/apache/pinot/core/transport/InstanceRequestHandlerTest.java`	Add unit coverage for the failed-write listener behavior (verifies `ctx.close()`).

Copilot · 2026-03-10T00:18:41Z

pinot-core/src/main/java/org/apache/pinot/core/transport/InstanceRequestHandler.java

+        Throwable cause = f.cause();
+        LOGGER.error("Failed to send response for request: {} table: {}", requestId, tableNameWithType, cause);
+        _serverMetrics.addMeteredGlobalValue(ServerMeter.NETTY_CONNECTION_SEND_RESPONSE_FAILURES, 1);
+        ctx.close();


The PR description mentions incrementing NETTY_CONNECTION_SEND_RESPONSE_EXCEPTIONS, but the implementation adds/increments NETTY_CONNECTION_SEND_RESPONSE_FAILURES. Please align the metric name (either rename the meter for consistency with existing *_EXCEPTIONS meters, or update the PR description and any dashboards/alerts that expect the old name).

Copilot · 2026-03-10T00:18:41Z

pinot-core/src/test/java/org/apache/pinot/core/transport/InstanceRequestHandlerTest.java

+  @BeforeClass
+  public void setUp() {
+    PinotMetricUtils.init(new PinotConfiguration());
+    PinotMetricsRegistry registry = PinotMetricUtils.getPinotMetricsRegistry();
+    ServerMetrics.register(new ServerMetrics(registry));
+  }


PinotMetricUtils.init(...) performs global metrics initialization (including default registration listeners such as the JMX reporter) and ServerMetrics.register(...) mutates a global singleton. This test doesn’t assert on metrics, and InstanceRequestHandler works with the NOOP ServerMetrics by default, so this setup appears unnecessary and can leak state across the test suite. Consider removing this setup, or adding an @AfterClass cleanup (e.g., ServerMetrics.deregister() and PinotMetricUtils.cleanUp()) to avoid cross-test interference.

Copilot · 2026-03-10T00:18:41Z

pinot-core/src/test/java/org/apache/pinot/core/transport/InstanceRequestHandlerTest.java

+    ChannelFuture writeFuture = mock(ChannelFuture.class);
+    when(ctx.writeAndFlush(any())).thenReturn(writeFuture);
+
+    ArgumentCaptor<GenericFutureListener> listenerCaptor = ArgumentCaptor.forClass(GenericFutureListener.class);


In testWriteFailureClosesChannel, the test uses a raw GenericFutureListener captor and suppresses unchecked warnings. It would be safer and clearer to parameterize the captor with the expected Netty listener type (matching ChannelFuture.addListener(...)) so the compiler can enforce the correct future type and the suppression can be removed.

Suggested change

-    ArgumentCaptor<GenericFutureListener> listenerCaptor = ArgumentCaptor.forClass(GenericFutureListener.class);
+    ArgumentCaptor<GenericFutureListener<Future<Void>>> listenerCaptor =
+        ArgumentCaptor.forClass((Class<GenericFutureListener<Future<Void>>>) (Class<?>) GenericFutureListener.class);
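
As background on the double cast in the suggested change, the same idiom can be shown with standard-library types only. This is a small sketch, assuming nothing about Mockito or Netty; TypedTokens and its methods are hypothetical names. A direct cast from a raw class literal to a parameterized Class type does not compile, so the cast goes through Class<?> first. Note that the cast itself is still unchecked, so a suppression may still be needed at the cast site.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical helper showing how to obtain a parameterized Class token from a raw class literal. */
final class TypedTokens {
  private TypedTokens() {
  }

  /**
   * Returns List.class viewed as Class<List<String>>.
   * The intermediate cast to Class<?> is required for the code to compile; because of
   * erasure the final cast is unchecked, hence the suppression.
   */
  @SuppressWarnings("unchecked")
  static Class<List<String>> stringListClass() {
    return (Class<List<String>>) (Class<?>) List.class;
  }

  /** Uses a class token to cast a value to the token's static type. */
  static <T> T castTo(Class<T> type, Object value) {
    return type.cast(value);
  }
}
```

At runtime the token is still just List.class, so the compiler enforces the element type only at the call sites that use the token.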

Renaming the metric to not be treated as a realtime "exception"

6dcdc13

xiangfu0 requested review from Jackie-Jiang, Copilot, jasperjiaguo and xiangfu0 March 10, 2026 00:15

Copilot started reviewing on behalf of xiangfu0 March 10, 2026 00:15

xiangfu0 approved these changes Mar 10, 2026


Copilot AI reviewed Mar 10, 2026


deepthi912 approved these changes Mar 10, 2026


xiangfu0 merged commit 58daa99 into apache:master Mar 10, 2026
18 of 20 checks passed

suvodeep-pyne mentioned this pull request Mar 11, 2026

Fix broker write failure handling in ServerChannels #17861

Merged

5 tasks

xiangfu0 merged 2 commits into apache:master from suvodeep-pyne:spyne/fix-server-outbound-write-failure-zombie-channels
