Skip to content

Potential controller start/stop race condition #8431

@walterddr

Description

@walterddr

When a test controller is restarted (such as in PinotTaskManagerStatelessTest), the controller will be come unavailable for ~1sec until it returns back to normal.

Issue was first discovered in #8390 but this could mean that the startController command didn't return a ready controller upon finish.

see code block:

Details ``` Connected to the target VM, address: '127.0.0.1:54163', transport: 'socket' Mar 28, 2022 2:03:04 PM org.glassfish.grizzly.http.server.NetworkListener start INFO: Started listener bound to [0.0.0.0:18998] Mar 28, 2022 2:03:04 PM org.glassfish.grizzly.http.server.HttpServer start INFO: [HttpServer] Started. Mar 28, 2022 2:03:10 PM org.glassfish.grizzly.http.server.NetworkListener shutdownNow INFO: Stopped listener bound to [0.0.0.0:18998] 14:03:10.409 [HelixController-pipeline-task-PinotTaskManagerStatelessTest-(b0992338_TASK)] ERROR org.apache.helix.controller.GenericHelixController - Cluster manager: localhost_18998 is not leader for PinotTaskManagerStatelessTest. Pipeline will not be invoked Mar 28, 2022 2:03:11 PM org.glassfish.grizzly.http.server.NetworkListener start INFO: Started listener bound to [0.0.0.0:18998] Mar 28, 2022 2:03:11 PM org.glassfish.grizzly.http.server.HttpServer start INFO: [HttpServer-1] Started. 14:03:36.069 [ZkClient-EventThread-113-localhost:2191] ERROR org.apache.pinot.controller.helix.core.realtime.PinotRealtimeSegmentManager - Caught exception while processing change for path /PinotTaskManagerStatelessTest/PROPERTYSTORE/CONFIGS/TABLE java.lang.IllegalStateException: ZkClient already closed! at org.apache.helix.manager.zk.zookeeper.ZkClient.retryUntilConnected(ZkClient.java:1171) ~[helix-core-0.9.8.jar:0.9.8] at org.apache.helix.manager.zk.zookeeper.ZkClient.getChildren(ZkClient.java:725) ~[helix-core-0.9.8.jar:0.9.8] at org.apache.helix.manager.zk.zookeeper.ZkClient.getChildren(ZkClient.java:718) ~[helix-core-0.9.8.jar:0.9.8] at org.apache.helix.manager.zk.ZkBaseDataAccessor.getChildNames(ZkBaseDataAccessor.java:511) ~[helix-core-0.9.8.jar:0.9.8] at org.apache.helix.manager.zk.ZkCacheBaseDataAccessor.getChildNames(ZkCacheBaseDataAccessor.java:675) ~[helix-core-0.9.8.jar:0.9.8] at org.apache.helix.store.zk.AutoFallbackPropertyStore.getChildNames(AutoFallbackPropertyStore.java:283) ~[helix-core-0.9.8.jar:0.9.8] at org.apache.helix.manager.zk.ZkCacheBaseDataAccessor.getChildren(ZkCacheBaseDataAccessor.java:692) ~[helix-core-0.9.8.jar:0.9.8] at org.apache.helix.manager.zk.ZkCacheBaseDataAccessor.getChildren(ZkCacheBaseDataAccessor.java:681) ~[helix-core-0.9.8.jar:0.9.8] at org.apache.pinot.controller.helix.core.realtime.PinotRealtimeSegmentManager.refreshWatchers(PinotRealtimeSegmentManager.java:318) ~[classes/:?] at org.apache.pinot.controller.helix.core.realtime.PinotRealtimeSegmentManager.processPropertyStoreChange(PinotRealtimeSegmentManager.java:295) [classes/:?] at org.apache.pinot.controller.helix.core.realtime.PinotRealtimeSegmentManager.handleChildChange(PinotRealtimeSegmentManager.java:383) [classes/:?] at org.apache.helix.manager.zk.zookeeper.ZkClient$8.run(ZkClient.java:1070) [helix-core-0.9.8.jar:0.9.8] at org.apache.helix.manager.zk.zookeeper.ZkEventThread.run(ZkEventThread.java:69) [helix-core-0.9.8.jar:0.9.8] 14:03:36.076 [ZkClient-EventThread-113-localhost:2191] ERROR org.apache.helix.manager.zk.zookeeper.ZkClient - Error handling event ZkEvent[Children of /PinotTaskManagerStatelessTest/PROPERTYSTORE/CONFIGS/TABLE changed sent to org.apache.pinot.controller.helix.core.realtime.PinotRealtimeSegmentManager@4ebfdbb4] java.lang.IllegalStateException: ZkClient already closed! at org.apache.helix.manager.zk.zookeeper.ZkClient.retryUntilConnected(ZkClient.java:1171) ~[helix-core-0.9.8.jar:0.9.8] at org.apache.helix.manager.zk.zookeeper.ZkClient.getChildren(ZkClient.java:725) ~[helix-core-0.9.8.jar:0.9.8] at org.apache.helix.manager.zk.zookeeper.ZkClient.getChildren(ZkClient.java:718) ~[helix-core-0.9.8.jar:0.9.8] at org.apache.helix.manager.zk.ZkBaseDataAccessor.getChildNames(ZkBaseDataAccessor.java:511) ~[helix-core-0.9.8.jar:0.9.8] at org.apache.helix.manager.zk.ZkCacheBaseDataAccessor.getChildNames(ZkCacheBaseDataAccessor.java:675) ~[helix-core-0.9.8.jar:0.9.8] at org.apache.helix.store.zk.AutoFallbackPropertyStore.getChildNames(AutoFallbackPropertyStore.java:283) ~[helix-core-0.9.8.jar:0.9.8] at org.apache.helix.manager.zk.ZkCacheBaseDataAccessor.getChildren(ZkCacheBaseDataAccessor.java:692) ~[helix-core-0.9.8.jar:0.9.8] at org.apache.helix.manager.zk.ZkCacheBaseDataAccessor.getChildren(ZkCacheBaseDataAccessor.java:681) ~[helix-core-0.9.8.jar:0.9.8] at org.apache.pinot.controller.helix.core.realtime.PinotRealtimeSegmentManager.refreshWatchers(PinotRealtimeSegmentManager.java:318) ~[classes/:?] at org.apache.pinot.controller.helix.core.realtime.PinotRealtimeSegmentManager.processPropertyStoreChange(PinotRealtimeSegmentManager.java:295) ~[classes/:?] at org.apache.pinot.controller.helix.core.realtime.PinotRealtimeSegmentManager.handleChildChange(PinotRealtimeSegmentManager.java:383) ~[classes/:?] at org.apache.helix.manager.zk.zookeeper.ZkClient$8.run(ZkClient.java:1070) ~[helix-core-0.9.8.jar:0.9.8] at org.apache.helix.manager.zk.zookeeper.ZkEventThread.run(ZkEventThread.java:69) [helix-core-0.9.8.jar:0.9.8] Mar 28, 2022 2:03:36 PM org.glassfish.grizzly.http.server.NetworkListener shutdownNow INFO: Stopped listener bound to [0.0.0.0:18998] 14:03:36.416 [HelixController-pipeline-task-PinotTaskManagerStatelessTest-(767f1028_TASK)] ERROR org.apache.helix.controller.GenericHelixController - Cluster manager: localhost_18998 is not leader for PinotTaskManagerStatelessTest. Pipeline will not be invoked

===============================================
Default Suite
Total tests run: 1, Failures: 0, Skips: 0

Disconnected from the target VM, address: '127.0.0.1:54163', transport: 'socket'

Process finished with exit code 0

</details>

Metadata

Metadata

Assignees

No one assigned

    Labels

    No labels
    No labels

    Type

    No type

    Projects

    No projects

    Milestone

    No milestone

    Relationships

    None yet

    Development

    No branches or pull requests

    Issue actions