Describe the bug
We are using helix 1.3.1 from the pinot 1.2 release.
We've observed a bug in our Pinot clusters where an expensive operation like rebalancing or heavy ingestion causes some servers to hit GC pauses that are longer than the ZK session timeout.
When this happens, the session expires and the client will reconnect. However, we observe session ID mismatches from the logs below continuously until we manually restart the server. Note that the expected + target session from these warning logs also change over time.
[2024-10-22 17:12:13.511581] WARN [HelixTaskExecutor] [ZkClient-EventThread-110-server1:2181,server2:2181,server3:2181:110] SessionId does NOT match. expected sessionId: 2003518003bd305, tgtSessionId in message: 5002ac58f6ad323, messageId: 39f20bfc-a607-44ca-8354-5e3736ebfd68
While the session ID mismatches are happening, Helix messages are treated as no-op causing the Pinot server to get into a bad state. We've traced that back to this code:
|
if (!sessionId.equals(tgtSessionId) && !tgtSessionId.equals("*")) { |
|
String warningMessage = "SessionId does NOT match. expected sessionId: " + sessionId |
|
+ ", tgtSessionId in message: " + tgtSessionId + ", messageId: " + message.getMsgId(); |
|
LOG.warn(warningMessage); |
|
reportAndRemoveMessage(message, accessor, instanceName, ProcessedMessageState.DISCARDED); |
|
_statusUpdateUtil |
|
.logWarning(message, HelixStateMachineEngine.class, warningMessage, manager); |
|
|
|
// Proactively send a session sync message from participant to controller |
|
// upon session mismatch after a new session is established |
|
if (manager.getInstanceType() == InstanceType.PARTICIPANT |
|
|| manager.getInstanceType() == InstanceType.CONTROLLER_PARTICIPANT) { |
|
if (message.getCreateTimeStamp() > manager.getSessionStartTime()) { |
|
syncSessionToController(manager); |
|
} |
|
} |
|
return true; |
|
} |
We only see 1-2 instances of the session syncing in our logs which the code does to remedy the issue but it does not seem to trigger as often as we'd expect.
[2024-10-22 02:18:59.913004] INFO [HelixTaskExecutor] [ZkClient-EventThread-111-server1:2181,server2:2181,server3:2181:111] Participant Server_st-noir-test-uswest2b-1_8098 syncs session with controller
To Reproduce
- Perform expensive rebalance or heavy ingestion on Pinot server that would trigger GC pause to exceed ZK session timeout
- Observe logs for session ID mismatches
Expected behavior
Participant is able to successfully sync its session so they no longer mismatch and server can heal itself
Describe the bug
We are using helix 1.3.1 from the pinot 1.2 release.
We've observed a bug in our Pinot clusters where an expensive operation like rebalancing or heavy ingestion causes some servers to hit GC pauses that are longer than the ZK session timeout.
When this happens, the session expires and the client will reconnect. However, we observe session ID mismatches from the logs below continuously until we manually restart the server. Note that the expected + target session from these warning logs also change over time.
While the session ID mismatches are happening, Helix messages are treated as no-op causing the Pinot server to get into a bad state. We've traced that back to this code:
helix/helix-core/src/main/java/org/apache/helix/messaging/handling/HelixTaskExecutor.java
Lines 1089 to 1106 in 3055f26
We only see 1-2 instances of the session syncing in our logs which the code does to remedy the issue but it does not seem to trigger as often as we'd expect.
To Reproduce
Expected behavior
Participant is able to successfully sync its session so they no longer mismatch and server can heal itself