Fix broker write failure handling in ServerChannels#17861
Fix broker write failure handling in ServerChannels#17861xiangfu0 merged 1 commit intoapache:masterfrom
Conversation
ServerChannels.sendRequestWithoutLocking() never checked writeAndFlush() success, causing silent failures: markRequestSent() called on failed writes (false positive), no channel close (zombie channels), no metrics, and queries waiting for full timeout instead of failing fast. Add f.isSuccess() check in the writeAndFlush listener: - On success: existing behavior (latency timer + markRequestSent) - On failure: log error, increment NETTY_CONNECTION_SEND_REQUEST_FAILURES metric, call markServerDown() for fast query failure, close channel Uses markServerDown() over markQueryFailed() because it is race-safe (no-op if server already responded) and the channel close also triggers channelInactive() for all in-flight queries. Wraps f.cause() in RuntimeException since markServerDown expects Exception but cause() can return Throwable (e.g. OutOfMemoryError).
Codecov Report✅ All modified and coverable lines are covered by tests. Additional details and impacted files@@ Coverage Diff @@
## master #17861 +/- ##
============================================
+ Coverage 63.26% 63.30% +0.04%
Complexity 1460 1460
============================================
Files 3190 3190
Lines 192011 192022 +11
Branches 29412 29413 +1
============================================
+ Hits 121469 121569 +100
+ Misses 61026 60924 -102
- Partials 9516 9529 +13
Flags with carried forward coverage won't be shown. Click here to find out more. ☔ View full report in Codecov by Sentry. 🚀 New features to boost your workflow:
|
There was a problem hiding this comment.
Pull request overview
This PR improves broker-to-server Netty reliability by handling outbound writeAndFlush() failures in ServerChannels, ensuring failed sends don’t get incorrectly treated as “request sent” and that broken channels are closed and observed via metrics.
Changes:
- Add
ChannelFuture.isSuccess()handling inServerChannels.ServerChannel.sendRequestWithoutLocking()to log, meter, mark the server down, and close the channel on write failure. - Introduce a new broker meter
NETTY_CONNECTION_SEND_REQUEST_FAILURESfor observability of outbound write failures. - Add unit tests covering both write-failure and write-success listener behavior (plus small test hooks in
ServerChannels).
Reviewed changes
Copilot reviewed 3 out of 3 changed files in this pull request and generated 1 comment.
| File | Description |
|---|---|
pinot-core/src/main/java/org/apache/pinot/core/transport/ServerChannels.java |
Adds write-failure handling in the Netty write listener; exposes minimal test hooks. |
pinot-core/src/test/java/org/apache/pinot/core/transport/ServerChannelsTest.java |
Adds tests validating correct behavior on write success vs failure (close channel / mark server down). |
pinot-common/src/main/java/org/apache/pinot/common/metrics/BrokerMeter.java |
Adds a new broker metric for outbound request send failures. |
You can also share your feedback on Copilot code review. Take the survey.
| _brokerMetrics.addMeteredGlobalValue(BrokerMeter.NETTY_CONNECTION_SEND_REQUEST_FAILURES, 1); | ||
| asyncQueryResponse.markServerDown(serverRoutingInstance, | ||
| new RuntimeException("Failed to send request to server: " + serverRoutingInstance, f.cause())); | ||
| _channel.close(); | ||
| } |
There was a problem hiding this comment.
In the write failure listener, calling _channel.close() uses the mutable ServerChannel._channel field, which may have been reconnected/updated by another request by the time the listener runs. That can accidentally close a newly-established healthy channel instead of the channel associated with the failed write. Capture the current channel in a local variable before writeAndFlush() (or use the channel from the future) and close that specific channel on failure.
Summary
writeAndFlush()inServerChannels.sendRequestWithoutLocking(): The write listener never checkedf.isSuccess(), so on write failuremarkRequestSent()was called anyway (false positive), no channel close occurred (zombie channel), no logging/metrics, and queries waited for full timeout instead of failing fast.f.isSuccess()check: On failure, logs the error, incrementsNETTY_CONNECTION_SEND_REQUEST_FAILURESmetric, callsmarkServerDown()for fast query failure, and closes the channel to prevent zombies. On success, existing behavior is preserved.NETTY_CONNECTION_SEND_REQUEST_FAILURESbroker metric for observability of write failures.Design Decisions
markServerDown()overmarkQueryFailed(): Race-safe — if server already responded, it's a no-op. The_channel.close()also triggerschannelInactive()→markServerDown()for all in-flight queries. Double invocation is safe (idempotent).f.cause()in RuntimeException:markServerDownexpectsExceptionbutf.cause()returnsThrowable(could beOutOfMemoryError).REQUESTS_SENT/BYTES_SENTkept outside listener: They're incremented synchronously before write completes. Moving them inside the success path would be a separate behavioral change.Test plan
testWriteFailureClosesChannelAndFailsQuery— verifies channel close, markServerDown called, markRequestSent never calledtestWriteSuccessMarksRequestSent— verifies markRequestSent called, markServerDown never called, close never called