Skip to content

[Bug] [Master-server] Master-server often disconnect with zookeeper due to server down #11635

@stalary

Description

@stalary

Search before asking

  • I had searched in the issues and found no similar issues.

What happened

Master-server often disconnect with zookeeper due to server down
log

[WARN] 2022-08-25 12:19:25.150 +0800 org.apache.zookeeper.ClientCnxn:[1108] - [WorkflowInstance-0][TaskInstance-0] - Client session timed out, have not heard from server in 20010ms for sessionid 0x1017e2ca6e10058
[INFO] 2022-08-25 12:19:44.757 +0800 org.apache.zookeeper.ClientCnxn:[1156] - [WorkflowInstance-0][TaskInstance-0] - Client session timed out, have not heard from server in 20010ms for sessionid 0x1017e2ca6e10058, closing socket connection and attempting reconnect
[INFO] 2022-08-25 12:19:44.859 +0800 org.apache.curator.framework.state.ConnectionStateManager:[251] - [WorkflowInstance-0][TaskInstance-0] - State change: SUSPENDED
[WARN] 2022-08-25 12:19:44.861 +0800 org.apache.dolphinscheduler.plugin.registry.zookeeper.ZookeeperConnectionStateListener:[52] - [WorkflowInstance-0][TaskInstance-0] - Registry suspended
[WARN] 2022-08-25 12:19:44.862 +0800 org.apache.dolphinscheduler.server.master.registry.MasterConnectionStateListener:[48] - [WorkflowInstance-0][TaskInstance-0] - registry connection state is SUSPENDED, ready to retry connection
[INFO] 2022-08-25 12:19:46.203 +0800 org.apache.zookeeper.ClientCnxn:[1025] - [WorkflowInstance-0][TaskInstance-0] - Opening socket connection to server development-7/192.168.120.17:2181. Will not attempt to authenticate using SASL (unknown error)
[INFO] 2022-08-25 12:19:46.204 +0800 org.apache.zookeeper.ClientCnxn:[879] - [WorkflowInstance-0][TaskInstance-0] - Socket connection established to development-7/192.168.120.17:2181, initiating session
[WARN] 2022-08-25 12:19:46.206 +0800 org.apache.zookeeper.ClientCnxn:[1285] - [WorkflowInstance-0][TaskInstance-0] - Unable to reconnect to ZooKeeper service, session 0x1017e2ca6e10058 has expired
[WARN] 2022-08-25 12:19:46.206 +0800 org.apache.curator.ConnectionState:[376] - [WorkflowInstance-0][TaskInstance-0] - Session expired event received
[INFO] 2022-08-25 12:19:46.206 +0800 org.apache.zookeeper.ClientCnxn:[1154] - [WorkflowInstance-0][TaskInstance-0] - Unable to reconnect to ZooKeeper service, session 0x1017e2ca6e10058 has expired, closing socket connection
[INFO] 2022-08-25 12:19:46.207 +0800 org.apache.zookeeper.ZooKeeper:[442] - [WorkflowInstance-0][TaskInstance-0] - Initiating client connection, connectString=192.168.120.17:2181 sessionTimeout=30000 watcher=org.apache.curator.ConnectionState@16d41725
[INFO] 2022-08-25 12:19:46.207 +0800 org.apache.curator.framework.state.ConnectionStateManager:[251] - [WorkflowInstance-0][TaskInstance-0] - State change: LOST
[INFO] 2022-08-25 12:19:46.208 +0800 org.apache.zookeeper.ClientCnxn:[522] - [WorkflowInstance-0][TaskInstance-0] - EventThread shut down for session: 0x1017e2ca6e10058
[WARN] 2022-08-25 12:19:46.208 +0800 org.apache.dolphinscheduler.plugin.registry.zookeeper.ZookeeperConnectionStateListener:[44] - [WorkflowInstance-0][TaskInstance-0] - Registry disconnected
[INFO] 2022-08-25 12:19:46.209 +0800 org.apache.zookeeper.ClientCnxn:[1025] - [WorkflowInstance-0][TaskInstance-0] - Opening socket connection to server development-7/192.168.120.17:2181. Will not attempt to authenticate using SASL (unknown error)
[WARN] 2022-08-25 12:19:46.209 +0800 org.apache.dolphinscheduler.server.master.registry.MasterConnectionStateListener:[56] - [WorkflowInstance-0][TaskInstance-0] - registry connection state is DISCONNECTED, ready to stop myself
[INFO] 2022-08-25 12:19:46.209 +0800 org.apache.zookeeper.ClientCnxn:[879] - [WorkflowInstance-0][TaskInstance-0] - Socket connection established to development-7/192.168.120.17:2181, initiating session
[INFO] 2022-08-25 12:19:46.321 +0800 org.apache.zookeeper.ClientCnxn:[1299] - [WorkflowInstance-0][TaskInstance-0] - Session establishment complete on server development-7/192.168.120.17:2181, sessionid = 0x1017e2ca6e1005f, negotiated timeout = 30000
[INFO] 2022-08-25 12:19:46.321 +0800 org.apache.curator.framework.state.ConnectionStateManager:[251] - [WorkflowInstance-0][TaskInstance-0] - State change: RECONNECTED
[WARN] 2022-08-25 12:19:46.371 +0800 org.apache.dolphinscheduler.server.master.dispatch.host.LowerWeightHostManager:[136] - [WorkflowInstance-0][TaskInstance-0] - worker 192.168.120.17:1234 in work group default have not received the heartbeat
[WARN] 2022-08-25 12:19:46.488 +0800 org.apache.dolphinscheduler.server.master.MasterServer:[121] - [WorkflowInstance-0][TaskInstance-0] - MasterServer is already stopped, current cause: i was judged to death, release resources and stop myself
[WARN] 2022-08-25 12:19:46.489 +0800 org.apache.dolphinscheduler.server.master.MasterServer:[121] - [WorkflowInstance-0][TaskInstance-0] - MasterServer is already stopped, current cause: i was judged to death, release resources and stop myself
[WARN] 2022-08-25 12:19:46.489 +0800 org.apache.dolphinscheduler.server.master.MasterServer:[121] - [WorkflowInstance-0][TaskInstance-0] - MasterServer is already stopped, current cause: i was judged to death, release resources and stop myself
[WARN] 2022-08-25 12:19:46.490 +0800 org.apache.dolphinscheduler.server.master.MasterServer:[121] - [WorkflowInstance-0][TaskInstance-0] - MasterServer is already stopped, current cause: i was judged to death, release resources and stop myself
[INFO] 2022-08-25 12:19:49.209 +0800 org.apache.dolphinscheduler.server.master.MasterServer:[133] - [WorkflowInstance-0][TaskInstance-0] - Master server is stopping, current cause : registry connection state is DISCONNECTED, stop myself
[INFO] 2022-08-25 12:19:49.212 +0800 org.quartz.core.QuartzScheduler:[585] - [WorkflowInstance-0][TaskInstance-0] - Scheduler DolphinScheduler_$_development-71661396192695 paused.
[INFO] 2022-08-25 12:19:49.220 +0800 org.eclipse.jetty.server.AbstractConnector:[381] - [WorkflowInstance-0][TaskInstance-0] - Stopped ServerConnector@16bd7ae1{HTTP/1.1, (http/1.1)}{0.0.0.0:5679}
[INFO] 2022-08-25 12:19:49.220 +0800 org.eclipse.jetty.server.session:[149] - [WorkflowInstance-0][TaskInstance-0] - node0 Stopped scavenging
[INFO] 2022-08-25 12:19:49.221 +0800 org.eclipse.jetty.server.handler.ContextHandler.application:[2347] - [WorkflowInstance-0][TaskInstance-0] - Destroying Spring FrameworkServlet 'dispatcherServlet'
[INFO] 2022-08-25 12:19:49.222 +0800 org.eclipse.jetty.server.handler.ContextHandler:[1153] - [WorkflowInstance-0][TaskInstance-0] - Stopped o.s.b.w.e.j.JettyEmbeddedWebAppContext@7f2b584b{application,/,[file:///tmp/jetty-docbase.5679.7084582183858887588/],STOPPED}
[INFO] 2022-08-25 12:19:49.226 +0800 org.apache.dolphinscheduler.server.master.rpc.MasterRPCServer:[109] - [WorkflowInstance-0][TaskInstance-0] - Closing Master RPC Server...
[INFO] 2022-08-25 12:19:49.228 +0800 org.apache.dolphinscheduler.remote.NettyRemotingServer:[212] - [WorkflowInstance-0][TaskInstance-0] - netty server closed
[INFO] 2022-08-25 12:19:49.228 +0800 org.apache.dolphinscheduler.server.master.rpc.MasterRPCServer:[111] - [WorkflowInstance-0][TaskInstance-0] - Closed Master RPC Server...
[INFO] 2022-08-25 12:19:49.229 +0800 org.springframework.scheduling.quartz.SchedulerFactoryBean:[845] - [WorkflowInstance-0][TaskInstance-0] - Shutting down Quartz Scheduler
[INFO] 2022-08-25 12:19:49.229 +0800 org.quartz.core.QuartzScheduler:[666] - [WorkflowInstance-0][TaskInstance-0] - Scheduler DolphinScheduler_$_development-71661396192695 shutting down.
[INFO] 2022-08-25 12:19:49.229 +0800 org.quartz.core.QuartzScheduler:[585] - [WorkflowInstance-0][TaskInstance-0] - Scheduler DolphinScheduler_$_development-71661396192695 paused.
[INFO] 2022-08-25 12:19:49.234 +0800 org.quartz.core.QuartzScheduler:[740] - [WorkflowInstance-0][TaskInstance-0] - Scheduler DolphinScheduler_$_development-71661396192695 shutdown complete.
[INFO] 2022-08-25 12:19:49.234 +0800 org.apache.dolphinscheduler.server.master.runner.MasterSchedulerBootstrap:[117] - [WorkflowInstance-0][TaskInstance-0] - Master schedule bootstrap stopping...
[WARN] 2022-08-25 12:19:49.234 +0800 org.apache.dolphinscheduler.server.master.processor.queue.StateEventResponseService:[125] - [WorkflowInstance-0][TaskInstance-0] - State event loop service interrupted, will stop this loop
java.lang.InterruptedException: null
	at java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject.reportInterruptAfterWait(AbstractQueuedSynchronizer.java:2014)
	at java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject.await(AbstractQueuedSynchronizer.java:2048)
	at java.util.concurrent.LinkedBlockingQueue.take(LinkedBlockingQueue.java:442)
	at org.apache.dolphinscheduler.server.master.processor.queue.StateEventResponseService$StateEventResponseWorker.run(StateEventResponseService.java:121)
[INFO] 2022-08-25 12:19:49.234 +0800 org.apache.dolphinscheduler.server.master.runner.MasterSchedulerBootstrap:[118] - [WorkflowInstance-0][TaskInstance-0] - Master schedule bootstrap stopped...
[INFO] 2022-08-25 12:19:49.235 +0800 org.apache.dolphinscheduler.server.master.processor.queue.StateEventResponseService:[132] - [WorkflowInstance-0][TaskInstance-0] - State event loop service stopped
[INFO] 2022-08-25 12:19:49.237 +0800 org.apache.dolphinscheduler.server.master.registry.MasterRegistryClient:[226] - [WorkflowInstance-0][TaskInstance-0] - Master node : 192.168.120.17:5678 unRegistry to register center.
[INFO] 2022-08-25 12:19:49.237 +0800 org.apache.dolphinscheduler.server.master.registry.MasterRegistryClient:[228] - [WorkflowInstance-0][TaskInstance-0] - MasterServer heartbeat executor shutdown
[INFO] 2022-08-25 12:19:49.240 +0800 org.apache.curator.framework.imps.CuratorFrameworkImpl:[955] - [WorkflowInstance-0][TaskInstance-0] - backgroundOperationsLoop exiting
[INFO] 2022-08-25 12:19:49.240 +0800 org.apache.zookeeper.ClientCnxn:[522] - [WorkflowInstance-0][TaskInstance-0] - EventThread shut down for session: 0x1017e2ca6e1005f
[INFO] 2022-08-25 12:19:49.240 +0800 org.apache.zookeeper.ZooKeeper:[693] - [WorkflowInstance-0][TaskInstance-0] - Session: 0x1017e2ca6e1005f closed
[INFO] 2022-08-25 12:19:49.241 +0800 org.apache.dolphinscheduler.server.master.processor.queue.TaskEventService:[126] - [WorkflowInstance-0][TaskInstance-0] - StateEventResponseWorker stopped
[INFO] 2022-08-25 12:19:49.245 +0800 com.zaxxer.hikari.HikariDataSource:[350] - [WorkflowInstance-0][TaskInstance-0] - DolphinScheduler - Shutdown initiated...
[INFO] 2022-08-25 12:19:49.250 +0800 com.zaxxer.hikari.HikariDataSource:[352] - [WorkflowInstance-0][TaskInstance-0] - DolphinScheduler - Shutdown completed.

What you expected to happen

run normal.

How to reproduce

start master-server register with zookeeper, wait some time.

Anything else

No response

Version

3.0.0

Are you willing to submit PR?

  • Yes I am willing to submit a PR!

Code of Conduct

Metadata

Metadata

Assignees

No one assigned

    Labels

    backendbugSomething isn't working

    Type

    No type

    Projects

    No projects

    Milestone

    No milestone

    Relationships

    None yet

    Development

    No branches or pull requests

    Issue actions