Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[Bug] [Master] Stopping a workflow does not update task status correctly #13244

Closed
2 of 3 tasks
wcmolin opened this issue Dec 21, 2022 · 2 comments
Closed
2 of 3 tasks
Labels
bug Something isn't working
Milestone

Comments

@wcmolin
Copy link
Contributor

wcmolin commented Dec 21, 2022

Search before asking

  • I had searched in the issues and found no similar issues.

What happened

Sometimes it happens that after stopping the workflow, the status of the killed task is still running

image

image

This is a screenshot of the task configuration:
image

What you expected to happen

Killed task status shows as killed

How to reproduce

I think the reason is that the status update of the workflow and task is independent. The workflow is set to the stopped state before the task is completely stopped, and then deleted from the processInstanceMap, so that the taskResponse cannot be processed.

The code must be changed to reproduce as much as possible.

Modify line 175 of org.apache.dolphinscheduler.server.master.processor.queue.TaskResponseService and add the following code

if (null != taskResponsePersistThread) {
    if (taskResponsePersistThread.addEvent(taskResponseEvent))...    
} else {
    logger.error("task response persist thread is null");
}

Modify line 118 of org.apache.dolphinscheduler.server.worker.processor.TaskKillProcessor and add the following code

private Pair<Boolean, List<String>> doKill(TaskKillRequestCommand killCommand) {
    try {
        Thread.sleep(2000);
    } catch (InterruptedException e) {
        throw new RuntimeException(e);
    }

The reason for this is to simulate another site to remove the instance in processInstanceMap.

if (taskResponsePersistThread.eventSize() == 0) {
    if (!processInstanceMap.containsKey(taskResponsePersistThread.getProcessInstanceId())) {
        processTaskResponseMap.remove(taskResponsePersistThread.getProcessInstanceId());
        logger.info("remove process instance: {}", taskResponsePersistThread.getProcessInstanceId());
    }
    continue;
}

This will cause the instance in processTaskResponseMap to be removed in the following code, resulting in TaskResponsePersistThread being empty and printing

task response persist thread is null

master log:

[INFO] 2022-12-21 20:44:27.372 org.apache.dolphinscheduler.server.master.processor.CacheProcessor:[52] - received command : CacheExpireCommand{CacheType=PROCESS_DEFINITION, cacheKey=7960963905984}
[INFO] 2022-12-21 20:44:27.390 org.apache.dolphinscheduler.server.master.processor.CacheProcessor:[70] - cache evict, type:processDefinition, key:7960963905984
[INFO] 2022-12-21 20:44:28.363 org.apache.dolphinscheduler.server.master.runner.MasterSchedulerService:[255] - find command 8, slot:0 :
[INFO] 2022-12-21 20:44:28.363 org.apache.dolphinscheduler.server.master.runner.MasterSchedulerService:[208] - find one command: id: 8, type: START_PROCESS
[INFO] 2022-12-21 20:44:28.482 org.apache.dolphinscheduler.server.master.runner.MasterSchedulerService:[221] - handle command end, command 8 process 8 start...
[INFO] 2022-12-21 20:44:28.650 org.apache.dolphinscheduler.server.master.runner.WorkflowExecuteThread:[1370] - add task to stand by list, task name:sleep-20s, task id:0, task code:7960945419712
[INFO] 2022-12-21 20:44:28.685 org.apache.dolphinscheduler.service.process.ProcessService:[1088] - start submit task : sleep-20s, instance id:8, state: RUNNING_EXECUTION
[INFO] 2022-12-21 20:44:28.803 org.apache.dolphinscheduler.service.process.ProcessService:[1101] - end submit task to db successfully:8 sleep-20s state:SUBMITTED_SUCCESS complete, instance id:8 state: RUNNING_EXECUTION  
[INFO] 2022-12-21 20:44:28.823 org.apache.dolphinscheduler.server.master.runner.task.CommonTaskProcessor:[164] - master submit success, task id:8, task name:sleep-20s, process id:8
[INFO] 2022-12-21 20:44:28.824 org.apache.dolphinscheduler.server.master.runner.WorkflowExecuteThread:[1384] - remove task from stand by list, id: 8 name:sleep-20s
[INFO] 2022-12-21 20:44:29.043 org.apache.dolphinscheduler.server.master.processor.TaskAckProcessor:[71] - taskAckCommand : TaskExecuteAckCommand{taskInstanceId=8, startTime=Wed Dec 21 20:44:29 CST 2022, host='192.168.20.62:1234', status=1, logPath='/Users/molin/dev/github/dolphinscheduler/logs/7960963905984_2/8/8.log', executePath='/tmp/dolphinscheduler/exec/process/7960944102080/7960963905984_2/8/8', processInstanceId='0'}
[INFO] 2022-12-21 20:44:29.043 org.apache.dolphinscheduler.server.master.processor.TaskAckProcessor:[71] - taskAckCommand : TaskExecuteAckCommand{taskInstanceId=8, startTime=Wed Dec 21 20:44:29 CST 2022, host='192.168.20.62:1234', status=1, logPath='/Users/molin/dev/github/dolphinscheduler/logs/7960963905984_2/8/8.log', executePath='/tmp/dolphinscheduler/exec/process/7960944102080/7960963905984_2/8/8', processInstanceId='8'}
[ERROR] 2022-12-21 20:44:29.044 org.apache.dolphinscheduler.server.master.processor.queue.TaskResponseService:[176] - task response persist thread is null
[INFO] 2022-12-21 20:44:33.368 org.apache.dolphinscheduler.server.master.processor.queue.TaskResponseService:[226] - already exists handler process size:0
[INFO] 2022-12-21 20:44:33.703 org.apache.dolphinscheduler.server.master.processor.queue.TaskResponseService:[233] - persist events 8 succeeded.
[INFO] 2022-12-21 20:44:42.226 org.apache.dolphinscheduler.server.master.processor.StateEventProcessor:[75] - received command : State Event :key: 8-0-8-0 type: PROCESS_STATE_CHANGE executeStatus: READY_STOP task instance id: 0 process instance id: 8 context: null
[INFO] 2022-12-21 20:44:42.975 org.apache.dolphinscheduler.server.master.runner.EventExecuteService:[127] - handle process instance : 8 , events count:1
[INFO] 2022-12-21 20:44:42.980 org.apache.dolphinscheduler.server.master.runner.EventExecuteService:[130] - already exists handler process size:0
[INFO] 2022-12-21 20:44:42.982 org.apache.dolphinscheduler.server.master.runner.WorkflowExecuteThread:[310] - process event: State Event :key: 8-0-8-0 type: PROCESS_STATE_CHANGE executeStatus: READY_STOP task instance id: 0 process instance id: 8 context: null
[INFO] 2022-12-21 20:44:42.983 org.apache.dolphinscheduler.server.master.runner.WorkflowExecuteThread:[497] - process:8 state RUNNING_EXECUTION change to READY_STOP
[INFO] 2022-12-21 20:44:43.013 org.apache.dolphinscheduler.server.master.runner.WorkflowExecuteThread:[1422] - kill called on process instance id: 8, num: 1
[INFO] 2022-12-21 20:44:43.094 org.apache.dolphinscheduler.server.master.runner.task.CommonTaskProcessor:[209] - master kill taskInstance name :sleep-20s taskInstance id:8
[INFO] 2022-12-21 20:44:43.095 org.apache.dolphinscheduler.server.master.runner.WorkflowExecuteThread:[310] - process event: State Event :key: null type: PROCESS_STATE_CHANGE executeStatus: STOP task instance id: 0 process instance id: 8 context: null
[INFO] 2022-12-21 20:44:43.095 org.apache.dolphinscheduler.server.master.runner.WorkflowExecuteThread:[497] - process:8 state READY_STOP change to STOP
[INFO] 2022-12-21 20:44:43.107 org.apache.dolphinscheduler.server.master.runner.WorkflowExecuteThread:[1320] - work flow process instance [id: 8, name:test_task_status_error-2-20221221204428415], state change from READY_STOP to STOP, cmd type: STOP
[INFO] 2022-12-21 20:44:43.307 org.apache.dolphinscheduler.server.master.runner.EventExecuteService:[139] - process instance 8 finished.
[INFO] 2022-12-21 20:44:43.385 org.apache.dolphinscheduler.server.master.processor.queue.TaskResponseService:[222] - remove process instance: 8
[INFO] 2022-12-21 20:44:43.610 org.apache.dolphinscheduler.server.master.processor.TaskResponseProcessor:[72] - received command : TaskExecuteResponseCommand{taskInstanceId=8, processInstanceId=8, status=9, startTime=null, endTime=Wed Dec 21 20:44:43 CST 2022, host=null, logPath=null, executePath=null, processId=2407, appIds='', varPool=[]}
[ERROR] 2022-12-21 20:44:43.613 org.apache.dolphinscheduler.server.master.processor.queue.TaskResponseService:[176] - task response persist thread is null

worker log:

[INFO] 2022-12-21 20:44:28.882 org.apache.dolphinscheduler.server.worker.processor.TaskExecuteProcessor:[110] - received command : TaskExecuteRequestCommand{taskExecutionContext='{"taskInstanceId":8,"taskName":"sleep-20s","firstSubmitTime":"2022-12-21 20:44:28","startTime":null,"endTime":null,"taskType":"SHELL","host":null,"executePath":null,"logPath":null,"taskJson":null,"processId":0,"processDefineCode":7960963905984,"processDefineVersion":2,"appIds":null,"processInstanceId":8,"scheduleTime":null,"globalParams":null,"executorId":1,"cmdTypeIfComplement":0,"tenantCode":"molin","queue":"default","projectCode":7960944102080,"taskParams":"{\"resourceList\":[],\"localParams\":[],\"rawScript\":\"echo \\\"start\\\"\\nsleep 20\\necho \\\"end\\\"\",\"dependence\":{},\"conditionResult\":{\"successNode\":[],\"failedNode\":[]},\"waitStartTimeout\":{},\"switchResult\":{}}","envFile":null,"environmentConfig":null,"definedParams":null,"taskAppId":null,"taskTimeoutStrategy":null,"taskTimeout":2147483647,"workerGroup":"default","delayTime":0,"currentExecutionStatus":null,"resources":{},"sqlTaskExecutionContext":{"warningGroupId":0,"connectionParams":null,"udfFuncTenantCodeMap":null,"defaultFS":null},"dataxTaskExecutionContext":{"dataSourceId":0,"sourcetype":0,"sourceConnectionParams":null,"dataTargetId":0,"targetType":0,"targetConnectionParams":null},"dependenceTaskExecutionContext":null,"sqoopTaskExecutionContext":{"dataSourceId":0,"sourcetype":0,"sourceConnectionParams":null,"dataTargetId":0,"targetType":0,"targetConnectionParams":null},"varPool":null,"dryRun":0,"paramsMap":null,"procedureTaskExecutionContext":{"connectionParams":null}}'}
[INFO] 2022-12-21 20:44:28.919 org.apache.dolphinscheduler.common.utils.FileUtils:[121] - create dir success /tmp/dolphinscheduler/exec/process/7960944102080/7960963905984_2/8/8
[INFO] 2022-12-21 20:44:28.921 org.apache.dolphinscheduler.server.worker.runner.TaskExecuteThread:[128] - task instance local execute path : /tmp/dolphinscheduler/exec/process/7960944102080/7960963905984_2/8/8
[INFO] 2022-12-21 20:44:29.038 org.apache.dolphinscheduler.server.worker.runner.TaskExecuteThread:[146] - the task begins to execute. task instance id: 8
[INFO] 2022-12-21 20:44:29.075 TaskLogLogger-class org.apache.dolphinscheduler.plugin.task.shell.ShellTask:[83] - shell task params {"resourceList":[],"localParams":[],"rawScript":"echo \"start\"\nsleep 20\necho \"end\"","dependence":{},"conditionResult":{"successNode":[],"failedNode":[]},"waitStartTimeout":{},"switchResult":{}}
[INFO] 2022-12-21 20:44:29.098 TaskLogLogger-class org.apache.dolphinscheduler.plugin.task.shell.ShellTask:[137] - raw script : echo "start"
sleep 20
echo "end"
[INFO] 2022-12-21 20:44:29.099 TaskLogLogger-class org.apache.dolphinscheduler.plugin.task.shell.ShellTask:[138] - task execute path : /tmp/dolphinscheduler/exec/process/7960944102080/7960963905984_2/8/8
[INFO] 2022-12-21 20:44:29.100 TaskLogLogger-class org.apache.dolphinscheduler.plugin.task.shell.ShellTask:[86] - tenantCode user:molin, task dir:8_8
[INFO] 2022-12-21 20:44:29.100 TaskLogLogger-class org.apache.dolphinscheduler.plugin.task.shell.ShellTask:[91] - create command file:/tmp/dolphinscheduler/exec/process/7960944102080/7960963905984_2/8/8/8_8.command
[INFO] 2022-12-21 20:44:29.101 TaskLogLogger-class org.apache.dolphinscheduler.plugin.task.shell.ShellTask:[117] - command : #!/bin/sh
BASEDIR=$(cd `dirname $0`; pwd)
cd $BASEDIR
source /etc/profile
/tmp/dolphinscheduler/exec/process/7960944102080/7960963905984_2/8/8/8_8_node.sh
[INFO] 2022-12-21 20:44:29.115 TaskLogLogger-class org.apache.dolphinscheduler.plugin.task.shell.ShellTask:[287] - task run command: sudo -u molin sh /tmp/dolphinscheduler/exec/process/7960944102080/7960963905984_2/8/8/8_8.command
[INFO] 2022-12-21 20:44:29.117 TaskLogLogger-class org.apache.dolphinscheduler.plugin.task.shell.ShellTask:[178] - process start, process id is: 2407
[INFO] 2022-12-21 20:44:30.121 TaskLogLogger-class org.apache.dolphinscheduler.plugin.task.shell.ShellTask:[66] -  -> welcome to use bigdata scheduling system...
	start
[INFO] 2022-12-21 20:44:43.101 org.apache.dolphinscheduler.server.worker.processor.TaskKillProcessor:[96] - received kill command : TaskKillRequestCommand{taskInstanceId=8}
[INFO] 2022-12-21 20:44:43.550 org.apache.dolphinscheduler.server.worker.processor.TaskKillProcessor:[145] - process id:2407, cmd:sudo -u molin kill -9 02408 02414 02415
[INFO] 2022-12-21 20:44:43.598 TaskLogLogger-class org.apache.dolphinscheduler.plugin.task.shell.ShellTask:[202] - process has exited, execute path:/tmp/dolphinscheduler/exec/process/7960944102080/7960963905984_2/8/8, processId:2407 ,exitStatusCode:137 ,processWaitForStatus:true ,processExitValue:137
[INFO] 2022-12-21 20:44:43.599 org.apache.dolphinscheduler.server.worker.runner.TaskExecuteThread:[206] - task instance id : 8,task final status : KILL
[INFO] 2022-12-21 20:44:43.602 org.apache.dolphinscheduler.server.worker.runner.TaskExecuteThread:[264] - develop mode is: false
[INFO] 2022-12-21 20:44:43.608 org.apache.dolphinscheduler.server.worker.runner.TaskExecuteThread:[282] - exec local path: /tmp/dolphinscheduler/exec/process/7960944102080/7960963905984_2/8/8 cleared.
[INFO] 2022-12-21 20:44:43.617 org.apache.dolphinscheduler.server.worker.processor.TaskKillProcessor:[194] - view log host : 192.168.20.62,logPath : /Users/molin/dev/github/dolphinscheduler/logs/7960963905984_2/8/8.log
[INFO] 2022-12-21 20:44:43.617 org.apache.dolphinscheduler.service.log.LogClientService:[117] - view log path /Users/molin/dev/github/dolphinscheduler/logs/7960963905984_2/8/8.log
[INFO] 2022-12-21 20:44:43.623 org.apache.dolphinscheduler.remote.NettyRemotingClient:[390] - netty client closed
[INFO] 2022-12-21 20:44:43.630 org.apache.dolphinscheduler.service.log.LogClientService:[74] - logger client closed
[INFO] 2022-12-21 20:44:44.149 TaskLogLogger-class org.apache.dolphinscheduler.plugin.task.shell.ShellTask:[60] - FINALIZE_SESSION

Anything else

No response

Version

2.0.x

Are you willing to submit PR?

  • Yes I am willing to submit a PR!

Code of Conduct

@wcmolin wcmolin added bug Something isn't working Waiting for reply Waiting for reply labels Dec 21, 2022
@github-actions
Copy link

Thank you for your feedback, we have received your issue, Please wait patiently for a reply.

  • In order for us to understand your request as soon as possible, please provide detailed information、version or pictures.
  • If you haven't received a reply for a long time, you can join our slack and send your question to channel #troubleshooting

@wcmolin
Copy link
Contributor Author

wcmolin commented Dec 21, 2022

@JinyLeeChina @zwZjut can you help me to look at this problem?

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
bug Something isn't working
Projects
None yet
Development

No branches or pull requests

4 participants