Frequent ZK session ID mismatches after GC leading to Helix messages treated as no-op #2951

### Describe the bug
We are using Helix 1.3.1, as shipped with the Pinot 1.2 release.

We've observed a bug in our Pinot clusters where an expensive operation like rebalancing or heavy ingestion causes some servers to hit GC pauses that are longer than the ZK session timeout.

When this happens, the session expires and the client reconnects. However, the logs then show session ID mismatches like the one below, continuously, until we manually restart the server. Note that both the expected and the target session IDs in these warnings change over time.

```
[2024-10-22 17:12:13.511581] WARN [HelixTaskExecutor] [ZkClient-EventThread-110-server1:2181,server2:2181,server3:2181:110] SessionId does NOT match. expected sessionId: 2003518003bd305, tgtSessionId in message: 5002ac58f6ad323, messageId: 39f20bfc-a607-44ca-8354-5e3736ebfd68
```
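To check how the expected and target session IDs drift over time, the warning lines can be parsed and compared. The following is a minimal sketch, assuming the log format is exactly as in the excerpt above; `SessionMismatchLine` is a hypothetical helper and not part of Helix or Pinot.

```java
import java.util.Optional;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

/** Hypothetical helper (not Helix or Pinot code) that extracts IDs from the mismatch warning. */
final class SessionMismatchLine {
    // Pattern derived from the warning text in the log excerpt; an assumption about the format.
    private static final Pattern MISMATCH = Pattern.compile(
        "SessionId does NOT match\\. expected sessionId: ([^,\\s]+), "
            + "tgtSessionId in message: ([^,\\s]+), messageId: (\\S+)");

    final String expectedSessionId;
    final String targetSessionId;
    final String messageId;

    private SessionMismatchLine(String expected, String target, String messageId) {
        this.expectedSessionId = expected;
        this.targetSessionId = target;
        this.messageId = messageId;
    }

    /** Returns the parsed IDs, or empty if the line is not a mismatch warning. */
    static Optional<SessionMismatchLine> parse(String logLine) {
        Matcher m = MISMATCH.matcher(logLine);
        if (!m.find()) {
            return Optional.empty();
        }
        return Optional.of(new SessionMismatchLine(m.group(1), m.group(2), m.group(3)));
    }
}
```

Feeding each server log line through `SessionMismatchLine.parse` and recording the distinct `expectedSessionId` / `targetSessionId` pairs per timestamp would show whether the participant or the controller side is the one changing.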

While the session ID mismatches are happening, Helix messages are discarded (effectively treated as no-ops), which leaves the Pinot server in a bad state. We've traced this back to the following code: https://github.com/apache/helix/blob/3055f26aa4f040f549ca91581fc6b2ff6b275f33/helix-core/src/main/java/org/apache/helix/messaging/handling/HelixTaskExecutor.java#L1089-L1106

The code tries to remedy a mismatch by syncing the session with the controller, but we see only 1-2 instances of that sync in our logs. It does not seem to trigger as often as we'd expect.
```
[2024-10-22 02:18:59.913004] INFO [HelixTaskExecutor] [ZkClient-EventThread-111-server1:2181,server2:2181,server3:2181:111] Participant Server_st-noir-test-uswest2b-1_8098 syncs session with controller
```

### To Reproduce
1. Perform an expensive rebalance or heavy ingestion on a Pinot server, so that a GC pause exceeds the ZK session timeout
2. Check the server logs for session ID mismatches

### Expected behavior
The participant successfully syncs its session with the controller, so the session IDs no longer mismatch and the server heals itself without a manual restart.

For reference, the linked block from `HelixTaskExecutor`:

	if (!sessionId.equals(tgtSessionId) && !tgtSessionId.equals("*")) {
	  String warningMessage = "SessionId does NOT match. expected sessionId: " + sessionId
	      + ", tgtSessionId in message: " + tgtSessionId + ", messageId: " + message.getMsgId();
	  LOG.warn(warningMessage);
	  reportAndRemoveMessage(message, accessor, instanceName, ProcessedMessageState.DISCARDED);
	  _statusUpdateUtil
	      .logWarning(message, HelixStateMachineEngine.class, warningMessage, manager);

	  // Proactively send a session sync message from participant to controller
	  // upon session mismatch after a new session is established
	  if (manager.getInstanceType() == InstanceType.PARTICIPANT
	      || manager.getInstanceType() == InstanceType.CONTROLLER_PARTICIPANT) {
	    if (message.getCreateTimeStamp() > manager.getSessionStartTime()) {
	      syncSessionToController(manager);
	    }
	  }
	  return true;
	}
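
To make the conditions in this block easier to reason about, here is a minimal standalone sketch of the same decision. It is not Helix code: `SessionCheckSketch`, `Outcome`, and `classify` are hypothetical names, and what happens to a message that passes the check lies outside the quoted snippet. The point it illustrates is that, as written, a mismatched message triggers the sync only when it was created after the current session started; whether that accounts for how rarely we see the sync is an open question.

```java
/** Hypothetical model of the quoted session check; for illustration only. */
final class SessionCheckSketch {
    enum Outcome {
        PASSES_CHECK,      // session matches or target is the "*" wildcard
        DISCARD,           // mismatch, message removed, no sync sent
        DISCARD_AND_SYNC   // mismatch, message removed, sync sent to controller
    }

    /**
     * @param isParticipant true for PARTICIPANT or CONTROLLER_PARTICIPANT instances
     * @param messageCreateTime stands in for message.getCreateTimeStamp()
     * @param sessionStartTime stands in for manager.getSessionStartTime()
     */
    static Outcome classify(String sessionId, String tgtSessionId, boolean isParticipant,
                            long messageCreateTime, long sessionStartTime) {
        boolean mismatch = !sessionId.equals(tgtSessionId) && !tgtSessionId.equals("*");
        if (!mismatch) {
            return Outcome.PASSES_CHECK;
        }
        // The sync is only attempted for messages newer than the current session.
        if (isParticipant && messageCreateTime > sessionStartTime) {
            return Outcome.DISCARD_AND_SYNC;
        }
        return Outcome.DISCARD;
    }
}
```

For example, `classify("a", "b", true, 50L, 100L)` yields `DISCARD`: the message predates the session, so it is dropped without a sync.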
