Realtime consumption halted if segment state transition fails #7874

Open
jmint-stripe opened this issue Dec 6, 2021 · 2 comments

@jmint-stripe

We observed realtime ingestion lag on one of our Pinot clusters. After some investigation, we determined that the lag was confined to a subset of the partitions of the Kafka stream we were ingesting from.

The logs showed that a temporary ZooKeeper connection issue triggered a cascade of InterruptedExceptions, which in turn caused some segment state transitions from OFFLINE to CONSUMING to fail.

Some relevant log messages:

2021/11/30 01:55:15.334 WARN [ZKHelixManager] [HelixTaskExecutor-message_handle_STATE_TRANSITION] zkClient to [redacted] is not connected, wait for 10000ms.
Exception while executing a state transition task [redacted segment name]
    ...
    Caused by: java.lang.RuntimeException: InterruptedException when acquiring the partitionConsumerSemaphore for segment: [redacted segment name]
2021/11/30 01:55:15.334 ERROR [HelixTask] [HelixTaskExecutor-message_handle_STATE_TRANSITION] Exception after executing a message, msgId: 76da755d-4ae3-4d61-84e6-11a946f6bffcorg.I0Itec.zkclient.exception.ZkInterruptedException: java.lang.InterruptedException
    org.I0Itec.zkclient.exception.ZkInterruptedException: java.lang.InterruptedException
            at org.apache.helix.manager.zk.zookeeper.ZkClient.acquireEventLock(ZkClient.java:1142)
    ...

The end result was that consumption stopped for the partitions whose segments had failed state transitions.

To get the servers to resume consuming those partitions, we had to restart the servers hosting the affected segments. The expectation is that Pinot should eventually recover and resume consumption on its own once the ZooKeeper connection is restored.
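
For illustration, here is a minimal sketch of the failure mode as the log messages above suggest it. It is not Pinot's actual code: the class, method, state names, and segment name are hypothetical, and the only detail taken from the logs is that an InterruptedException during acquisition of the partition consumer semaphore is rethrown as a RuntimeException. The sketch assumes nothing retries the transition afterwards, which matches what we observed.

```java
import java.util.concurrent.Semaphore;
import java.util.concurrent.atomic.AtomicReference;

public class InterruptedTransitionSketch {
    // Hypothetical states; ERROR is just a label for "transition failed".
    enum State { OFFLINE, CONSUMING, ERROR }

    // Hypothetical stand-in for the per-partition consumer semaphore.
    private final Semaphore partitionSemaphore = new Semaphore(1);
    private final AtomicReference<State> state = new AtomicReference<>(State.OFFLINE);

    // Hypothetical OFFLINE -> CONSUMING handler.
    void becomeConsuming(String segmentName) {
        try {
            partitionSemaphore.acquire();
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt(); // preserve the interrupt flag
            throw new RuntimeException(
                "InterruptedException when acquiring the partitionConsumerSemaphore for segment: "
                    + segmentName, e);
        }
        state.set(State.CONSUMING);
    }

    // Simulates a handler thread that is interrupted while waiting for the semaphore.
    static State simulateInterruptedTransition() {
        InterruptedTransitionSketch sketch = new InterruptedTransitionSketch();
        // The permit is still held (e.g. by the previous consumer), so the handler must wait.
        sketch.partitionSemaphore.acquireUninterruptibly();

        Thread handler = new Thread(() -> {
            try {
                sketch.becomeConsuming("hypothetical_segment__0__1");
            } catch (RuntimeException e) {
                // The transition fails and, in this sketch, nothing retries it.
                sketch.state.set(State.ERROR);
            }
        });
        handler.start();
        handler.interrupt(); // stands in for the interrupt seen during the ZooKeeper disconnect
        try {
            handler.join();
        } catch (InterruptedException e) {
            throw new IllegalStateException(e);
        }
        return sketch.state.get();
    }

    public static void main(String[] args) {
        System.out.println(simulateInterruptedTransition());
    }
}
```

In this sketch the partition never reaches CONSUMING even after the cause of the interrupt goes away, which is the behavior we would like Pinot to recover from without a restart.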

@mapshen

mapshen commented Dec 8, 2021

+1 on this. We experienced this issue as well during a ZooKeeper problem a couple of weeks ago. We tried to reset/reload the segment, but it went back into the bad state.

@sajjad-moradi
Contributor

Ideally, calling the segment reset endpoint on the controller should fix the problem, and you shouldn't need to restart the server. I just looked into the code for consuming segments and found the issue that prevents segment reset from doing its job. I'll create a PR for the fix soon.
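
For reference, a rough sketch of how a reset request to the controller could be built. The path `/segments/{tableNameWithType}/{segmentName}/reset` is my understanding of the controller REST API and should be checked against your Pinot version; the controller URL, table name, and segment name below are placeholders. As noted in this thread, a reset may not bring the segment back until the fix is in.

```java
import java.net.URI;
import java.net.URLEncoder;
import java.net.http.HttpRequest;
import java.nio.charset.StandardCharsets;

public class SegmentResetRequestSketch {
    // Builds (but does not send) a POST to the assumed segment reset endpoint.
    static HttpRequest buildResetRequest(String controllerUrl,
                                         String tableNameWithType,
                                         String segmentName) {
        // Encode the path parts in case they contain reserved characters.
        String table = URLEncoder.encode(tableNameWithType, StandardCharsets.UTF_8);
        String segment = URLEncoder.encode(segmentName, StandardCharsets.UTF_8);
        // Assumed endpoint path; verify against your controller's API listing.
        URI uri = URI.create(controllerUrl + "/segments/" + table + "/" + segment + "/reset");
        return HttpRequest.newBuilder(uri)
            .POST(HttpRequest.BodyPublishers.noBody())
            .build();
    }

    public static void main(String[] args) {
        // Placeholder values; substitute your own controller, table, and segment.
        HttpRequest request = buildResetRequest(
            "http://localhost:9000", "myTable_REALTIME", "myTable__0__1__20211130T0155Z");
        System.out.println(request.method() + " " + request.uri());
    }
}
```

To actually send the request, pass it to `java.net.http.HttpClient.send` and check the response status.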
