We observed some realtime ingestion lag on one of our Pinot clusters. After some investigation we determined that the lag was happening on a subset of the partitions for the Kafka stream we were ingesting from.
Analyzing the logs showed that a temporary ZooKeeper connection issue triggered a cascade of InterruptedException errors, which in turn caused some segment state transitions from OFFLINE to CONSUMING to fail.
Some relevant log messages:
2021/11/30 01:55:15.334 WARN [ZKHelixManager] [HelixTaskExecutor-message_handle_STATE_TRANSITION] zkClient to [redacted] is not connected, wait for 10000ms.
Exception while executing a state transition task [redacted segment name]
...
Caused by: java.lang.RuntimeException: InterruptedException when acquiring the partitionConsumerSemaphore for segment: [redacted segment name]
2021/11/30 01:55:15.334 ERROR [HelixTask] [HelixTaskExecutor-message_handle_STATE_TRANSITION] Exception after executing a message, msgId: 76da755d-4ae3-4d61-84e6-11a946f6bffc
org.I0Itec.zkclient.exception.ZkInterruptedException: java.lang.InterruptedException
at org.apache.helix.manager.zk.zookeeper.ZkClient.acquireEventLock(ZkClient.java:1142)
...
The end result was that consumption stopped for the partitions represented by these segments that had failed state transitions.
To get the servers to start consuming from those partitions again, we had to restart the servers hosting those segments. The expectation is that Pinot should eventually recover on its own and resume consumption once the ZooKeeper connection is restored.
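To illustrate the failure mode seen in the logs, here is a minimal, self-contained sketch. It is not Pinot's code: `SemaphoreInterruptDemo` and `tryStartConsuming` are hypothetical names, and a plain `java.util.concurrent.Semaphore` stands in for the partitionConsumerSemaphore mentioned in the stack trace. It shows a thread being interrupted while it waits for the permit, so the "transition" fails even though the semaphore itself stays usable and a later retry succeeds.

```java
import java.util.concurrent.Semaphore;
import java.util.concurrent.atomic.AtomicReference;

public class SemaphoreInterruptDemo {

    // Hypothetical stand-in for an OFFLINE -> CONSUMING transition handler.
    static String tryStartConsuming(Semaphore partitionSemaphore) {
        try {
            partitionSemaphore.acquire();
            return "CONSUMING";
        } catch (InterruptedException e) {
            // Restore the interrupt flag and report the failed transition.
            Thread.currentThread().interrupt();
            return "FAILED: " + e.getClass().getSimpleName();
        }
    }

    public static String[] run() throws InterruptedException {
        Semaphore partitionSemaphore = new Semaphore(1);
        partitionSemaphore.acquire(); // the previous consumer still holds the permit

        AtomicReference<String> firstAttempt = new AtomicReference<>();
        Thread transition =
            new Thread(() -> firstAttempt.set(tryStartConsuming(partitionSemaphore)));
        transition.start();

        // Wait until the transition thread is actually blocked on the semaphore.
        while (!partitionSemaphore.hasQueuedThreads()) {
            Thread.yield();
        }
        transition.interrupt(); // simulates the interrupt cascade from the ZooKeeper issue
        transition.join();

        // The permit is released later; nothing retries unless something triggers it.
        partitionSemaphore.release();
        String retry = tryStartConsuming(partitionSemaphore);

        return new String[] {firstAttempt.get(), retry};
    }

    public static void main(String[] args) throws InterruptedException {
        String[] results = run();
        System.out.println("first attempt: " + results[0]);
        System.out.println("retry:         " + results[1]);
    }
}
```

In this sketch the first attempt reports `FAILED: InterruptedException` and the explicit retry reports `CONSUMING`. The point is only that the interrupted attempt is not retried by itself, which matches what we observed: the partition stays without a consumer until something (in our case, a server restart) triggers the transition again.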
+1 on this. We experienced this issue as well when there was a ZooKeeper issue a couple of weeks ago. We tried to reset/reload the segment, but it went back into the bad state.
Ideally, calling the segment reset endpoint on the controller should fix the problem, and you shouldn't need to restart the server. I just looked into the code for consuming segments and found the issue that prevents segment reset from doing its job. I'll create a PR for the fix soon.
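For anyone who wants to script the reset once the fix is in, here is a rough sketch of building the request with the JDK HTTP client (Java 11+). The path `/segments/{tableNameWithType}/{segmentName}/reset` is my assumption of the controller route, so please verify it against your controller's Swagger UI. The class name, controller URL, table name, and segment name below are all hypothetical placeholders.

```java
import java.net.URI;
import java.net.URLEncoder;
import java.net.http.HttpRequest;
import java.nio.charset.StandardCharsets;

public class SegmentResetRequest {

    // Percent-encode a single path segment; URLEncoder emits '+' for spaces,
    // so convert that to %20 to keep it valid inside a URL path.
    private static String encode(String value) {
        return URLEncoder.encode(value, StandardCharsets.UTF_8).replace("+", "%20");
    }

    // Builds (but does not send) a POST to the assumed segment reset route.
    public static HttpRequest build(
            String controllerBaseUrl, String tableNameWithType, String segmentName) {
        String path = "/segments/" + encode(tableNameWithType)
            + "/" + encode(segmentName) + "/reset";
        return HttpRequest.newBuilder(URI.create(controllerBaseUrl + path))
            .POST(HttpRequest.BodyPublishers.noBody())
            .build();
    }

    public static void main(String[] args) {
        // Placeholder values, not taken from the cluster in this issue.
        HttpRequest request = build(
            "http://localhost:9000",
            "myTable_REALTIME",
            "myTable__0__12__20211130T0155Z");
        System.out.println(request.method() + " " + request.uri());
    }
}
```

Sending it would be a matter of passing the request to `HttpClient.newHttpClient().send(request, HttpResponse.BodyHandlers.ofString())`. Note that, as described above, the reset currently does not bring the segment out of the bad state until the fix lands.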