[Bug] [Server] Fault tolerance of dependent nodes leads to stuck #9873

@brave-lee

Search before asking

  • I have searched the issues and found no similar issues.

What happened

[INFO] 2022-05-04 06:37:50.756 org.apache.dolphinscheduler.server.master.runner.WorkflowExecuteThread:[380] - work flow 1129 task 9904 state:NEED_FAULT_TOLERANCE
[INFO] 2022-05-04 06:37:50.757 org.apache.dolphinscheduler.server.master.runner.WorkflowExecuteThread:[1278] - add task to stand by list, task name:dep_dwd_1d_dag, task id:9904, task code:5196592535840
[INFO] 2022-05-04 06:37:50.758 org.apache.dolphinscheduler.service.process.ProcessService:[1080] - start submit task : dep_dwd_1d_dag, instance id:1129, state: RUNNING_EXECUTION
[INFO] 2022-05-04 06:37:50.761 org.apache.dolphinscheduler.service.process.ProcessService:[1093] - end submit task to db successfully:9904 dep_dwd_1d_dag state:SUBMITTED_SUCCESS complete, instance id:1129 state: RUNNING_EXECUTION
[INFO] 2022-05-04 06:37:50.767 org.apache.dolphinscheduler.server.master.runner.WorkflowExecuteThread:[1292] - remove task from stand by list, id: 9904 name:dep_dwd_1d_dag
[WARN] 2022-05-04 06:38:03.004 com.zaxxer.hikari.pool.PoolBase:[184] - DolphinScheduler - Failed to validate connection com.mysql.jdbc.JDBC4Connection@4eba99b2 (No operations allowed after connection closed.). Possibly consider using a shorter maxLifetime value.
[INFO] 2022-05-04 06:47:50.342 org.apache.dolphinscheduler.server.master.runner.FailoverExecuteThread:[68] - failover execute started
[INFO] 2022-05-04 06:47:50.344 org.apache.dolphinscheduler.server.master.runner.FailoverExecuteThread:[74] - need failover hosts:[host:port]
[INFO] 2022-05-04 06:47:50.349 org.apache.dolphinscheduler.server.master.registry.MasterRegistryClient:[424] - start master[host:port] failover, process list size:2
[INFO] 2022-05-04 06:47:50.351 org.apache.dolphinscheduler.server.master.registry.MasterRegistryClient:[442] - failover task instance id: 9904, process instance id: 1129
[INFO] 2022-05-04 06:47:51.352 org.apache.dolphinscheduler.service.log.LogClientService:[117] - view log path /opt/dolphinscheduler/logs/5196620942752_3/1129/9904.log
[ERROR] 2022-05-04 06:47:51.352 org.apache.dolphinscheduler.common.utils.LoggerUtils:[117] - read file error
java.io.FileNotFoundException: /opt/dolphinscheduler/logs/5196620942752_3/1129/9904.log (No such file or directory)
at java.io.FileInputStream.open0(Native Method)
at java.io.FileInputStream.open(FileInputStream.java:195)
at java.io.FileInputStream.<init>(FileInputStream.java:138)
at java.io.FileInputStream.<init>(FileInputStream.java:93)
at org.apache.dolphinscheduler.common.utils.LoggerUtils.readWholeFileContent(LoggerUtils.java:111)
at org.apache.dolphinscheduler.service.log.LogClientService.viewLog(LogClientService.java:123)
at org.apache.dolphinscheduler.server.utils.ProcessUtils.killYarnJob(ProcessUtils.java:190)
at org.apache.dolphinscheduler.server.master.registry.MasterRegistryClient.failoverTaskInstance(MasterRegistryClient.java:486)
at org.apache.dolphinscheduler.server.master.registry.MasterRegistryClient.failoverMaster(MasterRegistryClient.java:443)
at org.apache.dolphinscheduler.server.master.runner.FailoverExecuteThread.run(FailoverExecuteThread.java:80)
[INFO] 2022-05-04 06:47:51.353 org.apache.dolphinscheduler.remote.NettyRemotingClient:[390] - netty client closed
[INFO] 2022-05-04 06:47:51.353 org.apache.dolphinscheduler.service.log.LogClientService:[74] - logger client closed
[INFO] 2022-05-04 06:47:51.356 org.apache.dolphinscheduler.server.master.registry.MasterRegistryClient:[456] - master[host:port] failover end, useTime:1008ms
[INFO] 2022-05-04 06:47:51.803 org.apache.dolphinscheduler.server.master.runner.EventExecuteService:[127] - handle process instance : 1129 , events count:1
[INFO] 2022-05-04 06:47:51.803 org.apache.dolphinscheduler.server.master.runner.EventExecuteService:[130] - already exists handler process size:0
[INFO] 2022-05-04 06:47:51.803 org.apache.dolphinscheduler.server.master.runner.WorkflowExecuteThread:[301] - process event: State Event :key: null type: TASK_STATE_CHANGE executeStatus: NEED_FAULT_TOLERANCE task instance id: 9904 process instance id: 1129 context: null
[INFO] 2022-05-04 06:47:51.804 org.apache.dolphinscheduler.server.master.runner.WorkflowExecuteThread:[380] - work flow 1129 task 9904 state:NEED_FAULT_TOLERANCE
[INFO] 2022-05-04 06:47:51.805 org.apache.dolphinscheduler.server.master.runner.WorkflowExecuteThread:[1278] - add task to stand by list, task name:dep_dwd_1d_dag, task id:9904, task code:5196592535840
[INFO] 2022-05-04 06:47:51.806 org.apache.dolphinscheduler.service.process.ProcessService:[1080] - start submit task : dep_dwd_1d_dag, instance id:1129, state: RUNNING_EXECUTION
[INFO] 2022-05-04 06:47:51.809 org.apache.dolphinscheduler.service.process.ProcessService:[1093] - end submit task to db successfully:9904 dep_dwd_1d_dag state:SUBMITTED_SUCCESS complete, instance id:1129 state: RUNNING_EXECUTION
[INFO] 2022-05-04 06:47:51.814 org.apache.dolphinscheduler.server.master.runner.WorkflowExecuteThread:[1292] - remove task from stand by list, id: 9904 name:dep_dwd_1d_dag
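
The stack trace above shows the failover path (`failoverTaskInstance` -> `killYarnJob` -> `viewLog` -> `readWholeFileContent`) trying to read a task log file that does not exist for this dependent task. The log alone does not show whether this is what leaves the workflow stuck, since failover still reports "failover end" afterwards. Independently of that, a guard of roughly the following shape would avoid the `FileNotFoundException`. This is only a minimal sketch: `SafeLogReader` and `readOrEmpty` are hypothetical names for illustration and are not part of DolphinScheduler.

```java
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;

// Hypothetical helper for illustration only; not DolphinScheduler code.
final class SafeLogReader {
    private SafeLogReader() {}

    /**
     * Returns the whole file content as UTF-8 text, or an empty string
     * when the path is null, is not a regular file, or cannot be read.
     */
    static String readOrEmpty(Path logPath) {
        // A task that never wrote a log file has nothing to read,
        // so treat a missing file as "no content" instead of an error.
        if (logPath == null || !Files.isRegularFile(logPath)) {
            return "";
        }
        try {
            return new String(Files.readAllBytes(logPath), StandardCharsets.UTF_8);
        } catch (IOException e) {
            // The file may disappear between the check and the read.
            return "";
        }
    }
}
```

A caller that scans the log for YARN application ids could then treat an empty string as "nothing to kill" and skip that step for tasks without a log file.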

What you expected to happen

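After fault tolerance is triggered, the dependent task should be resubmitted and continue to run normally instead of getting stuck.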

How to reproduce

Trigger fault tolerance (master failover) while a dependent node is running, as in the log above.

Anything else

No response

Version

2.0.5

Are you willing to submit PR?

  • Yes, I am willing to submit a PR!

Metadata

Labels

Stale, bug (Something isn't working)
