Skip to content

Frequent ZK session ID mismatches after GC leading to Helix messages treated as no-op #2951

@dang-stripe

Description

@dang-stripe

Describe the bug

We are using helix 1.3.1 from the pinot 1.2 release.

We've observed a bug in our Pinot clusters where an expensive operation like rebalancing or heavy ingestion causes some servers to hit GC pauses that are longer than the ZK session timeout.

When this happens, the session expires and the client will reconnect. However, we observe session ID mismatches from the logs below continuously until we manually restart the server. Note that the expected + target session from these warning logs also change over time.

[2024-10-22 17:12:13.511581] WARN [HelixTaskExecutor] [ZkClient-EventThread-110-server1:2181,server2:2181,server3:2181:110] SessionId does NOT match. expected sessionId: 2003518003bd305, tgtSessionId in message: 5002ac58f6ad323, messageId: 39f20bfc-a607-44ca-8354-5e3736ebfd68

While the session ID mismatches are happening, Helix messages are treated as no-op causing the Pinot server to get into a bad state. We've traced that back to this code:

if (!sessionId.equals(tgtSessionId) && !tgtSessionId.equals("*")) {
String warningMessage = "SessionId does NOT match. expected sessionId: " + sessionId
+ ", tgtSessionId in message: " + tgtSessionId + ", messageId: " + message.getMsgId();
LOG.warn(warningMessage);
reportAndRemoveMessage(message, accessor, instanceName, ProcessedMessageState.DISCARDED);
_statusUpdateUtil
.logWarning(message, HelixStateMachineEngine.class, warningMessage, manager);
// Proactively send a session sync message from participant to controller
// upon session mismatch after a new session is established
if (manager.getInstanceType() == InstanceType.PARTICIPANT
|| manager.getInstanceType() == InstanceType.CONTROLLER_PARTICIPANT) {
if (message.getCreateTimeStamp() > manager.getSessionStartTime()) {
syncSessionToController(manager);
}
}
return true;
}

We only see 1-2 instances of the session syncing in our logs which the code does to remedy the issue but it does not seem to trigger as often as we'd expect.

[2024-10-22 02:18:59.913004] INFO [HelixTaskExecutor] [ZkClient-EventThread-111-server1:2181,server2:2181,server3:2181:111] Participant Server_st-noir-test-uswest2b-1_8098 syncs session with controller

To Reproduce

  1. Perform expensive rebalance or heavy ingestion on Pinot server that would trigger GC pause to exceed ZK session timeout
  2. Observe logs for session ID mismatches

Expected behavior

Participant is able to successfully sync its session so they no longer mismatch and server can heal itself

Metadata

Metadata

Assignees

No one assigned

    Labels

    bugSomething isn't working

    Type

    No type
    No fields configured for issues without a type.

    Projects

    No projects

    Milestone

    No milestone

    Relationships

    None yet

    Development

    No branches or pull requests

    Issue actions