Realtime consumption halted if segment state transition fails #7874

jmint-stripe · 2021-12-06T23:15:16Z

We observed some realtime ingestion lag on one of our Pinot clusters. After some investigation we determined that the lag was happening on a subset of the partitions for the Kafka stream we were ingesting from.

Analyzing the logs showed that a temporary ZooKeeper connection issue caused a cascade of InterruptedExceptions, which in turn caused some segment state transitions from OFFLINE to CONSUMING to fail.

Some relevant log messages:

2021/11/30 01:55:15.334 WARN [ZKHelixManager] [HelixTaskExecutor-message_handle_STATE_TRANSITION] zkClient to [redacted] is not connected, wait for 10000ms.

Exception while executing a state transition task [redacted segment name]
    ...
    Caused by: java.lang.RuntimeException: InterruptedException when acquiring the partitionConsumerSemaphore for segment: [redacted segment name]

2021/11/30 01:55:15.334 ERROR [HelixTask] [HelixTaskExecutor-message_handle_STATE_TRANSITION] Exception after executing a message, msgId: 76da755d-4ae3-4d61-84e6-11a946f6bffcorg.I0Itec.zkclient.exception.ZkInterruptedException: java.lang.InterruptedException
    org.I0Itec.zkclient.exception.ZkInterruptedException: java.lang.InterruptedException
            at org.apache.helix.manager.zk.zookeeper.ZkClient.acquireEventLock(ZkClient.java:1142)
    ...

The end result was that consumption stopped for the partitions represented by these segments that had failed state transitions.

To get the servers to resume consuming those partitions, we had to restart the servers hosting those segments. The expectation is that Pinot should eventually recover and resume consumption on its own once the ZooKeeper connection is restored.
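
For readers unfamiliar with this failure mode, here is a minimal Java sketch of interruptible semaphore acquisition with bookkeeping. It is not Pinot's code: the class `PartitionConsumerGuard` and its methods are hypothetical, and only the exception message is taken from the log above. It assumes one permit per stream partition and shows one way to ensure a permit is released only if it was actually acquired, so that an interrupted state transition leaves the permit count unchanged. Whether this matches the actual root cause is not established by the logs.

```java
import java.util.concurrent.Semaphore;
import java.util.concurrent.atomic.AtomicBoolean;

// Hypothetical illustration only; these names do not come from the Pinot codebase.
final class PartitionConsumerGuard {
  private final Semaphore partitionSemaphore;
  // Tracks whether this guard currently holds a permit.
  private final AtomicBoolean acquired = new AtomicBoolean(false);

  PartitionConsumerGuard(Semaphore partitionSemaphore) {
    this.partitionSemaphore = partitionSemaphore;
  }

  // Blocks until a permit is available. If the thread is interrupted, the
  // interrupt flag is restored and a RuntimeException is thrown, mirroring
  // the message seen in the log excerpt.
  void acquire(String segmentName) {
    try {
      partitionSemaphore.acquire();
      acquired.set(true);
    } catch (InterruptedException e) {
      Thread.currentThread().interrupt();
      throw new RuntimeException(
          "InterruptedException when acquiring the partitionConsumerSemaphore for segment: "
              + segmentName, e);
    }
  }

  // Releases the permit only if this guard holds one, so calling release()
  // after a failed acquire() (or calling it twice) cannot inflate the count.
  void release() {
    if (acquired.compareAndSet(true, false)) {
      partitionSemaphore.release();
    }
  }

  public static void main(String[] args) {
    Semaphore semaphore = new Semaphore(1);
    PartitionConsumerGuard guard = new PartitionConsumerGuard(semaphore);
    guard.acquire("exampleSegment");
    guard.release();
    guard.release(); // no-op: the permit was already returned
    System.out.println("available permits: " + semaphore.availablePermits());
  }
}
```

The design choice in this sketch is that cleanup is safe to call on any path: a transition that failed before acquiring the permit and one that failed afterwards both leave the semaphore with its original number of permits.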


mapshen · 2021-12-08T22:26:07Z

+1 on this. We experienced this issue as well during a ZooKeeper incident a couple of weeks ago. We tried to reset/reload the segment, but it went back into the bad state.

sajjad-moradi · 2021-12-09T07:51:31Z

Ideally, calling the segment reset endpoint on the controller should fix the problem, and you shouldn't need to restart the server. I just looked into the code for consuming segments and found the issue that prevents segment reset from doing its job. I'll create a PR with the fix soon.
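
As a rough sketch of what calling the reset endpoint might look like from Java (11 or later): the path `/segments/{tableNameWithType}/{segmentName}/reset` is assumed from the controller's REST API and should be checked against the Swagger UI of your Pinot version. The class name `SegmentResetRequest` and the example controller URL, table name, and segment name are all hypothetical.

```java
import java.net.URI;
import java.net.URLEncoder;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;
import java.nio.charset.StandardCharsets;

// Hypothetical helper; the endpoint path is an assumption to verify against
// the controller's API documentation for your Pinot version.
final class SegmentResetRequest {

  // Builds a POST request for the assumed segment reset endpoint.
  static HttpRequest build(String controllerBaseUrl, String tableNameWithType, String segmentName) {
    String path = "/segments/" + encode(tableNameWithType) + "/" + encode(segmentName) + "/reset";
    return HttpRequest.newBuilder(URI.create(controllerBaseUrl + path))
        .POST(HttpRequest.BodyPublishers.noBody())
        .build();
  }

  // URLEncoder targets form encoding, so '+' is rewritten to "%20" for use in a path.
  private static String encode(String value) {
    return URLEncoder.encode(value, StandardCharsets.UTF_8).replace("+", "%20");
  }

  public static void main(String[] args) throws Exception {
    if (args.length != 3) {
      System.err.println("usage: SegmentResetRequest <controllerBaseUrl> <tableNameWithType> <segmentName>");
      return;
    }
    HttpRequest request = build(args[0], args[1], args[2]);
    HttpResponse<String> response =
        HttpClient.newHttpClient().send(request, HttpResponse.BodyHandlers.ofString());
    System.out.println(response.statusCode() + " " + response.body());
  }
}
```

Example invocation, with made-up values: `java SegmentResetRequest http://localhost:9000 myTable_REALTIME myTable__0__12__20211130T0155Z`.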

Jackie-Jiang added the bug label Dec 7, 2021

sajjad-moradi mentioned this issue Dec 9, 2021

BUG FIX: semaphore issue in consuming segments #7886

Merged

ankitsultana mentioned this issue Feb 10, 2024

InterruptedException when acquiring partitonConsumerSemaphore #12390

Open
