
Will some PathChildrenCacheEvents be missed after the connection to ZooKeeper is re-established? #7893

@viongpanzi

hi, all~

We have a problem!

The information about our prod cluster:

version: 0.13.0
number of segments: more than 6 million
GC: G1 (a single full GC takes more than 120 seconds)
incremental poll is enabled

After each full GC (which takes more than 120 seconds), the leading coordinator's connection to ZooKeeper times out and is dropped. The other coordinator soon becomes the leader, and it goes into a full GC of its own after polling all data segments from the metadata store. Its connection to ZooKeeper is then dropped as well, and the two coordinators are trapped in this loop. However, if we restart both coordinator services, they work well for days.

To find the cause, we used MAT (Eclipse Memory Analyzer Tool) to analyze a heap dump from one of the two coordinators. It reports the following:

[image: MAT heap analysis report]

After tracing the call stack to zNodes and checking the coordinator service logs, we found some log entries about ZooKeeper node events that look suspicious:

09/Jun/2019 20:49:42,970 [ServerInventoryView-0] WARN  org.apache.druid.curator.inventory.CuratorInventoryManager - CuratorInventoryManager: Exception while getting data for node /druid/segments/host:8101/host:8101_indexer-executor__default_tier_2019-06-09T18:45:43.225Z_e8206828c4ba4a5799956bde201eceb60
org.apache.zookeeper.KeeperException$NoNodeException: KeeperErrorCode = NoNode for /druid/segments/host:8101/host:8101_indexer-executor__default_tier_2019-06-09T18:45:43.225Z_e8206828c4ba4a5799956bde201eceb60
        at org.apache.zookeeper.KeeperException.create(KeeperException.java:114) ~[zookeeper-3.4.11.jar:3.4.11-37e277162d567b55a07d1755f0b31c32e93c01a0]
        at org.apache.zookeeper.KeeperException.create(KeeperException.java:54) ~[zookeeper-3.4.11.jar:3.4.11-37e277162d567b55a07d1755f0b31c32e93c01a0]
        at org.apache.zookeeper.ZooKeeper.getData(ZooKeeper.java:1215) ~[zookeeper-3.4.11.jar:3.4.11-37e277162d567b55a07d1755f0b31c32e93c01a0]
        at org.apache.curator.framework.imps.GetDataBuilderImpl$4.call(GetDataBuilderImpl.java:327) ~[curator-framework-4.1.0.jar:4.1.0]
        at org.apache.curator.framework.imps.GetDataBuilderImpl$4.call(GetDataBuilderImpl.java:316) ~[curator-framework-4.1.0.jar:4.1.0]
        at org.apache.curator.connection.StandardConnectionHandlingPolicy.callWithRetry(StandardConnectionHandlingPolicy.java:64) ~[curator-client-4.1.0.jar:?]
        at org.apache.curator.RetryLoop.callWithRetry(RetryLoop.java:100) ~[curator-client-4.1.0.jar:?]
        at org.apache.curator.framework.imps.GetDataBuilderImpl.pathInForeground(GetDataBuilderImpl.java:313) ~[curator-framework-4.1.0.jar:4.1.0]
        at org.apache.curator.framework.imps.GetDataBuilderImpl.forPath(GetDataBuilderImpl.java:304) ~[curator-framework-4.1.0.jar:4.1.0]
        at org.apache.curator.framework.imps.GetDataBuilderImpl$1.forPath(GetDataBuilderImpl.java:107) ~[curator-framework-4.1.0.jar:4.1.0]
        at org.apache.curator.framework.imps.GetDataBuilderImpl$1.forPath(GetDataBuilderImpl.java:67) ~[curator-framework-4.1.0.jar:4.1.0]
        at org.apache.druid.curator.inventory.CuratorInventoryManager.getZkDataForNode(CuratorInventoryManager.java:177) [druid-server-0.13.0-ad.jar:0.13.0-ad]
        at org.apache.druid.curator.inventory.CuratorInventoryManager.access$400(CuratorInventoryManager.java:58) [druid-server-0.13.0-ad.jar:0.13.0-ad]
        at org.apache.druid.curator.inventory.CuratorInventoryManager$ContainerCacheListener$InventoryCacheListener.childEvent(CuratorInventoryManager.java:402) [druid-server-0.13.0-ad.jar:0.13.0-ad]
        at org.apache.curator.framework.recipes.cache.PathChildrenCache$5.apply(PathChildrenCache.java:538) [curator-recipes-4.1.0.jar:4.1.0]
        at org.apache.curator.framework.recipes.cache.PathChildrenCache$5.apply(PathChildrenCache.java:532) [curator-recipes-4.1.0.jar:4.1.0]
        at org.apache.curator.framework.listen.ListenerContainer$1.run(ListenerContainer.java:93) [curator-framework-4.1.0.jar:4.1.0]
        at org.apache.curator.shaded.com.google.common.util.concurrent.MoreExecutors$DirectExecutor.execute(MoreExecutors.java:435) [curator-client-4.1.0.jar:?]
        at org.apache.curator.framework.listen.ListenerContainer.forEach(ListenerContainer.java:85) [curator-framework-4.1.0.jar:4.1.0]
        at org.apache.curator.framework.recipes.cache.PathChildrenCache.callListeners(PathChildrenCache.java:530) [curator-recipes-4.1.0.jar:4.1.0]
        at org.apache.curator.framework.recipes.cache.EventOperation.invoke(EventOperation.java:35) [curator-recipes-4.1.0.jar:4.1.0]
        at org.apache.curator.framework.recipes.cache.PathChildrenCache$9.run(PathChildrenCache.java:808) [curator-recipes-4.1.0.jar:4.1.0]
        at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511) [?:1.8.0_131]
        at java.util.concurrent.FutureTask.run(FutureTask.java:266) [?:1.8.0_131]
        at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511) [?:1.8.0_131]
        at java.util.concurrent.FutureTask.run(FutureTask.java:266) [?:1.8.0_131]
        at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142) [?:1.8.0_131]
        at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617) [?:1.8.0_131]
        at java.lang.Thread.run(Thread.java:748) [?:1.8.0_131]
09/Jun/2019 20:49:42,970 [ServerInventoryView-0] INFO  org.apache.druid.curator.inventory.CuratorInventoryManager - CuratorInventoryManager: Ignoring event: Type - CHILD_UPDATED , Path - /druid/segments/host:8101/host:8101_indexer-executor__default_tier_2019-06-09T18:45:43.225Z_e8206828c4ba4a5799956bde201eceb60 , Version - 4

Can some PathChildrenCacheEvents be missed after the connection to ZooKeeper is lost and then re-established? If not, how can the exception above be explained, where the coordinator handles a CHILD_UPDATED event for a node that does not exist?
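
For discussion, here is a minimal sketch of one ordering that could produce the same pair of log lines without any event being lost: an update event is queued, the listener thread stalls (for example during a long GC pause), the node is deleted, and only then does the listener try to read the node's data. This is plain Java with no Curator or ZooKeeper dependency. `FakeZk`, `StaleEventSketch`, the event strings, and the path are all hypothetical stand-ins, and the sketch is an assumption about what may be happening, not a confirmed description of how PathChildrenCache or CuratorInventoryManager behave.

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.Queue;

public class StaleEventSketch {

    /** Hypothetical in-memory stand-in for a ZooKeeper subtree: path -> data. */
    static final class FakeZk {
        private final Map<String, String> nodes = new HashMap<>();

        void put(String path, String data) { nodes.put(path, data); }

        void delete(String path) { nodes.remove(path); }

        /** Returns null where a real client would throw NoNodeException. */
        String getData(String path) { return nodes.get(path); }
    }

    /** Replays the assumed ordering and returns what the listener would log. */
    static List<String> run() {
        FakeZk zk = new FakeZk();
        Queue<String> pendingEvents = new ArrayDeque<>();
        List<String> log = new ArrayList<>();
        String path = "/segments/server-a/batch-0"; // hypothetical path

        // 1. The node exists and is updated; an event is queued for the listener.
        zk.put(path, "v1");
        zk.put(path, "v2");
        pendingEvents.add("CHILD_UPDATED " + path);

        // 2. The listener thread is stalled, and the node is deleted meanwhile.
        zk.delete(path);
        pendingEvents.add("CHILD_REMOVED " + path);

        // 3. The listener resumes and drains the queue in order.
        while (!pendingEvents.isEmpty()) {
            String event = pendingEvents.poll();
            String[] parts = event.split(" ", 2);
            if (parts[0].equals("CHILD_UPDATED") && zk.getData(parts[1]) == null) {
                // The update event is stale: the node is already gone.
                log.add("ignored " + event + " (no node)");
            } else {
                log.add("applied " + event);
            }
        }
        return log;
    }

    public static void main(String[] args) {
        run().forEach(System.out::println);
    }
}
```

In this sketch the stale update is ignored and the removal is still delivered afterwards, so the final state is consistent. Whether the removal event is always delivered after a real reconnect is exactly the open question above.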
