Skip to content

Remove StateCheck for SubProcess/DependTask#14324

Closed
ORuteMa wants to merge 4 commits intoapache:3.1.8-preparefrom
ORuteMa:remove_improper_interval_for_state_check
Closed

Remove StateCheck for SubProcess/DependTask#14324
ORuteMa wants to merge 4 commits intoapache:3.1.8-preparefrom
ORuteMa:remove_improper_interval_for_state_check

Conversation

@ORuteMa
Copy link
Contributor

@ORuteMa ORuteMa commented Jun 12, 2023

Remove StateCheck for SubProcess/DependTask which will cause task submission madness in the default state-wheel-interval configuration

Close #14262

In dev I found pr #14242 has already fix this.
However, the description of PR14242 misses an important part: the statecheck can actually result in a large number of SubProcess/DependTask scenarios where the workflow cannot run successfully due to excessive task submission. This issue should be fixed in version 3.1.8, otherwise 3.1.x is not available with default parameters in such a scenario, state-wheel-interval is default 5 which will cause 200 statecheck task submitted in a second.

What's worse is that once the task is success, the previous state events are all in exception, cause slower event handling.
image

From my pratice, it can be stuck for 10+ hours.

cc @ruanwenjun @zhuangchong

Purpose of the pull request

Fix task submission madness in 3.1.x

Brief change log

Remove StateCheck for SubProcess/DependTask

Verify this pull request

This pull request is code cleanup without any test coverage.

…mission madness in the default state-wheel-interval configuration
@SbloodyS SbloodyS added first time contributor First-time contributor 3.1.x for 3.1.x version labels Jun 12, 2023
@SbloodyS SbloodyS added this to the 3.1.8 milestone Jun 12, 2023
@SbloodyS SbloodyS added the bug Something isn't working label Jun 12, 2023
Copy link
Member

@SbloodyS SbloodyS left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Please change your target branch to dev.

@ORuteMa
Copy link
Contributor Author

ORuteMa commented Jun 12, 2023

Please change your target branch to dev.

In dev this pr #14242 has already fix it. This pr actually solved a more serious problem than an improvement, it was not realized that it needed to be cherry-picked into version 3.1.x.

@SbloodyS
Copy link
Member

cc @ruanwenjun

@codecov-commenter
Copy link

codecov-commenter commented Jun 13, 2023

Codecov Report

❗ No coverage uploaded for pull request base (3.1.8-prepare@77a967c). Click here to learn what that means.
The diff coverage is n/a.

❗ Current head 0a3dd6d differs from pull request most recent head 6a90719. Consider uploading reports for the commit 6a90719 to get more accurate results

@@               Coverage Diff                @@
##             3.1.8-prepare   #14324   +/-   ##
================================================
  Coverage                 ?   38.24%           
  Complexity               ?     3968           
================================================
  Files                    ?      999           
  Lines                    ?    36827           
  Branches                 ?     4255           
================================================
  Hits                     ?    14086           
  Misses                   ?    21128           
  Partials                 ?     1613           

📣 We’re building smart automated test selection to slash your CI/CD build times. Learn more

@sonarqubecloud
Copy link

SonarCloud Quality Gate failed.    Quality Gate failed

Bug E 18 Bugs
Vulnerability B 9 Vulnerabilities
Security Hotspot E 1 Security Hotspot
Code Smell A 757 Code Smells

27.7% 27.7% Coverage
2.5% 2.5% Duplication

@ORuteMa
Copy link
Contributor Author

ORuteMa commented Jun 13, 2023

@SbloodyS E2E failed, pls retry.

@zhuangchong
Copy link
Contributor

@ORuteMa The dev branch removes this part of the code because the delay queue detection status scheme is introduced, but the 3.1.8 branch does not have this scheme, so the above status detection scheme needs to be retained.

@ORuteMa
Copy link
Contributor Author

ORuteMa commented Jun 27, 2023

@ORuteMa The dev branch removes this part of the code because the delay queue detection status scheme is introduced, but the 3.1.8 branch does not have this scheme, so the above status detection scheme needs to be retained.

From my practice, subprocess works fine after I remove this part of code. It may have other scheme to ensure async task.

@ORuteMa
Copy link
Contributor Author

ORuteMa commented Jun 28, 2023

I will figout out what exacly going on here, provide a more appropriate solution for 3.1.x.

@Mukvin
Copy link
Contributor

Mukvin commented Jun 29, 2023

@ORuteMa I also encounter this bug.
All task in a workflow are success but one which stuck on "SUBMIT SUCCESS", and I already cherrp-pick the remove pr: https://github.com/apache/dolphinscheduler/pull/14242 at branch 3.1.5-release.
please help me out, thx.

@ORuteMa
Copy link
Contributor Author

ORuteMa commented Jun 29, 2023

@ORuteMa I also encounter this bug. All task in a workflow are success but one which stuck on "SUBMIT SUCCESS", and I already cherrp-pick the remove pr: https://github.com/apache/dolphinscheduler/pull/14242 at branch 3.1.5-release. please help me out, thx.

The stuck one is a subProcess or dependentTask?

@Mukvin
Copy link
Contributor

Mukvin commented Jun 29, 2023

@ORuteMa Pls help me, I hava the same issue, but when I cherry-pick the

The stuck one is a subProcess or dependentTask?

@ORuteMa
Thank you for your reply, I found that it is same as #14262, and I also found that the task in task list ui will be the status "SUBMIT SUCCESS", and it stuck.
It can not dispatch to run automatically?

ORuteMa added 2 commits July 3, 2023 19:32
…task submission madness in the default state-wheel-interval configuration"

This reverts commit 6a90719.
@ORuteMa
Copy link
Contributor Author

ORuteMa commented Jul 3, 2023

@ORuteMa Pls help me, I hava the same issue, but when I cherry-pick the

The stuck one is a subProcess or dependentTask?

@ORuteMa Thank you for your reply, I found that it is same as #14262, and I also found that the task in task list ui will be the status "SUBMIT SUCCESS", and it stuck. It can not dispatch to run automatically?

Sorry for my late reply. You should revert the pr from dev, and adjust the config master.state-wheel-interval to 3000, try again.

@ORuteMa
Copy link
Contributor Author

ORuteMa commented Jul 3, 2023

@ORuteMa The dev branch removes this part of the code because the delay queue detection status scheme is introduced, but the 3.1.8 branch does not have this scheme, so the above status detection scheme needs to be retained.

I offer another solution maybe more appropriate, pls have a look.

@fuchanghai
Copy link
Member

hi @zhuangchong @ORuteMa
我也遇到了这种情况,但是我觉得这个轮询时间需要根据各自的系统资源来配置,master 会轮询由之分配的所有符合条件的任务,我们可以将这个时间设置为,worker 中的最大启动数量 乘以worker 的数量,再除以master 数量,再乘以每个event 被执行完的时间,再乘以2/3

I also encountered this situation, but I think this polling time needs to be configured according to their respective system resources. The master will poll all eligible tasks assigned by it. We can set this time as the maximum number of startups in the worker multiplied by the number of workers, divided by the number of masters, multiplied by the time each event is executed, and multiplied by 2/3

@ORuteMa
Copy link
Contributor Author

ORuteMa commented Jul 22, 2023

hi @zhuangchong @ORuteMa 我也遇到了这种情况,但是我觉得这个轮询时间需要根据各自的系统资源来配置,master 会轮询由之分配的所有符合条件的任务,我们可以将这个时间设置为,worker 中的最大启动数量 乘以worker 的数量,再除以master 数量,再乘以每个event 被执行完的时间,再乘以2/3

I also encountered this situation, but I think this polling time needs to be configured according to their respective system resources. The master will poll all eligible tasks assigned by it. We can set this time as the maximum number of startups in the worker multiplied by the number of workers, divided by the number of masters, multiplied by the time each event is executed, and multiplied by 2/3

hi @fuchanghai 我觉得策略可能稍显复杂,比如这里的event执行完的时间就不好定义,worker最大数量也是个问题。无论如何,默认5ms是非常不合理的。而且statecheck的事件我看下来只有dependtask和subprocess需要,它们的检查应该不需要并发,所以这里我用一个独立的队列并且保证事件不重复去规避这个问题,这样用户无论怎么调整执行间隔都不会导致事件大量提交的问题。

The strategy may be a bit complicated, for example, the time for the event to finish executing is not well defined here, and the maximum number of workers is also a problem. In any case, the default 5ms is very unreasonable. And statecheck event I see down only dependtask and subprocess need, their checks should not need concurrent, so here I use a separate queue and ensure that the event does not repeat to avoid this problem, so that the user no matter how to adjust the execution interval will not lead to a large number of events submitted to the problem.

@zhuangchong zhuangchong deleted the branch apache:3.1.8-prepare August 7, 2023 03:43
@zhuangchong zhuangchong closed this Aug 7, 2023
@ORuteMa
Copy link
Contributor Author

ORuteMa commented Aug 24, 2023

@zhuangchong 请问下为何close掉这个pr呢,这样3.1.x版本还是遗留这个问题

@ORuteMa
Copy link
Contributor Author

ORuteMa commented Sep 6, 2023

@zhuangchong Should I cherry-pick this pr to 3.1.9-prepare?

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

3.1.x for 3.1.x version backend bug Something isn't working document first time contributor First-time contributor

Projects

None yet

Development

Successfully merging this pull request may close these issues.

6 participants