[Fix-10854] Fix database restart may lost task instance status by ruanwenjun · Pull Request #10866 · apache/dolphinscheduler

ruanwenjun · 2022-07-09T09:37:32Z

I have tested this PR. Note that some problems remain: when the database restarts, we may still lose some status updates. The cause is that our state handling is not yet idempotent. Finishing a state can take several steps (clear the map, update the database, and so on), so we need to split the handling into steps. If one step fails, we retry later, and on retry we need to know which step failed so that we can recover from that step only.
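The step-wise recovery described in this comment could look roughly like the sketch below. It is not code from this PR: `FinishStep`, `StepAction`, `FinishProgress`, and `simulate` are hypothetical names, and the progress marker lives in memory only, whereas a real master would have to keep it somewhere that survives the failure being handled.

```java
import java.util.ArrayList;
import java.util.EnumMap;
import java.util.List;
import java.util.Map;

// Illustration only: every name here is hypothetical, none comes from the PR.
class StepwiseFinishDemo {

    /** Ordered steps that together make up "finish a task instance". */
    enum FinishStep { UPDATE_DB, CLEAR_MAP, NOTIFY_WORKFLOW, DONE }

    /** One step; it may throw, for example while the database is restarting. */
    interface StepAction {
        void run() throws Exception;
    }

    /** Remembers the next step to run, so a retry resumes instead of starting over. */
    static final class FinishProgress {
        private FinishStep next = FinishStep.UPDATE_DB;

        FinishStep next() {
            return next;
        }

        /** Runs the remaining steps in order; returns false at the first failure. */
        boolean advance(Map<FinishStep, StepAction> actions) {
            while (next != FinishStep.DONE) {
                try {
                    actions.get(next).run();
                } catch (Exception e) {
                    return false; // `next` still points at the failed step
                }
                next = FinishStep.values()[next.ordinal() + 1];
            }
            return true;
        }
    }

    /** Simulates `failures` consecutive failures of `flaky` and logs what ran. */
    static List<String> simulate(FinishStep flaky, int failures) {
        List<String> log = new ArrayList<>();
        int[] remaining = {failures};
        Map<FinishStep, StepAction> actions = new EnumMap<>(FinishStep.class);
        for (FinishStep step : FinishStep.values()) {
            if (step == FinishStep.DONE) {
                continue;
            }
            actions.put(step, () -> {
                if (step == flaky && remaining[0]-- > 0) {
                    log.add(step + " failed");
                    throw new IllegalStateException("simulated failure in " + step);
                }
                log.add(step + " ok");
            });
        }
        FinishProgress progress = new FinishProgress();
        while (!progress.advance(actions)) {
            log.add("retry from " + progress.next());
        }
        return log;
    }

    public static void main(String[] args) {
        simulate(FinishStep.UPDATE_DB, 1).forEach(System.out::println);
    }
}
```

Because `advance` only moves `next` forward after a step succeeds, steps that already completed are never repeated on retry, which is the property the comment asks for.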

Purpose of the pull request

close #10854

Brief change log

Add TaskEventHandler to handle worker events
When handling a worker event fails, roll back the taskInstanceStatus
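A minimal sketch of the rollback idea in the change log, under stated assumptions: `TaskEventHandler` and `TaskEventHandleException` are names that appear in this PR, but the signatures, the `TaskInstance` stand-in, the `TaskInstanceDao` interface, and `tryHandle` below are all hypothetical and simplified for illustration.

```java
// Illustration only. TaskEventHandler and TaskEventHandleException are names
// from this PR, but the signatures and bodies below are assumptions.
class RollbackOnFailureDemo {

    /** Minimal stand-in for the cached task instance. */
    static final class TaskInstance {
        String state;
        String host;

        TaskInstance(String state, String host) {
            this.state = state;
            this.host = host;
        }

        /** Copies every field of `source` into this instance. */
        void copyFrom(TaskInstance source) {
            this.state = source.state;
            this.host = source.host;
        }
    }

    /** Stand-in for the DAO; throws while the database is unavailable. */
    interface TaskInstanceDao {
        void update(TaskInstance taskInstance);
    }

    static final class TaskEventHandleException extends Exception {
        TaskEventHandleException(String message, Throwable cause) {
            super(message, cause);
        }
    }

    interface TaskEventHandler {
        void handleTaskEvent(TaskInstance cached, String newState, String workerHost)
                throws TaskEventHandleException;
    }

    static final class RunningEventHandler implements TaskEventHandler {
        private final TaskInstanceDao dao;

        RunningEventHandler(TaskInstanceDao dao) {
            this.dao = dao;
        }

        @Override
        public void handleTaskEvent(TaskInstance cached, String newState, String workerHost)
                throws TaskEventHandleException {
            // Snapshot the in-memory state before touching it.
            TaskInstance snapshot = new TaskInstance(cached.state, cached.host);
            try {
                cached.state = newState;
                cached.host = workerHost;
                dao.update(cached);
            } catch (RuntimeException e) {
                // The write failed: restore the cache so it matches the database.
                cached.copyFrom(snapshot);
                throw new TaskEventHandleException("Update task instance failed, rolled back", e);
            }
        }
    }

    /** Returns true if the event was applied, false if it was rolled back. */
    static boolean tryHandle(TaskEventHandler handler, TaskInstance cached,
                             String newState, String workerHost) {
        try {
            handler.handleTaskEvent(cached, newState, workerHost);
            return true;
        } catch (TaskEventHandleException e) {
            return false;
        }
    }
}
```

Restoring from a snapshot keeps the cached task instance consistent with what the database last accepted, so a later retry of the same event starts from a known state.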

Verify this pull request

codecov-commenter · 2022-07-09T14:08:55Z

Codecov Report

Merging #10866 (d879bfe) into dev (3f69ec8) will decrease coverage by 0.22%.
The diff coverage is 0.98%.

@@             Coverage Diff              @@
##                dev   #10866      +/-   ##
============================================
- Coverage     40.60%   40.37%   -0.23%     
- Complexity     4829     4843      +14     
============================================
  Files           915      933      +18     
  Lines         36436    36759     +323     
  Branches       4000     4025      +25     
============================================
+ Hits          14794    14842      +48     
- Misses        20160    20433     +273     
- Partials       1482     1484       +2

| Impacted Files | Coverage Δ |
|---|---|
| .../dolphinscheduler/common/enums/StateEventType.java | `0.00% <0.00%> (ø)` |
| ...e/dolphinscheduler/common/enums/TaskEventType.java | `0.00% <0.00%> (ø)` |
| .../dolphinscheduler/dao/utils/TaskInstanceUtils.java | `0.00% <0.00%> (ø)` |
| ...e/dolphinscheduler/server/master/MasterServer.java | `0.00% <0.00%> (ø)` |
| ...ler/server/master/event/TaskDelayEventHandler.java | `0.00% <0.00%> (ø)` |
| .../server/master/event/TaskDispatchEventHandler.java | `0.00% <0.00%> (ø)` |
| ...uler/server/master/event/TaskEventHandleError.java | `0.00% <0.00%> (ø)` |
| .../server/master/event/TaskEventHandleException.java | `0.00% <0.00%> (ø)` |
| ...r/master/event/TaskRejectByWorkerEventHandler.java | `0.00% <0.00%> (ø)` |
| ...er/server/master/event/TaskResultEventHandler.java | `0.00% <0.00%> (ø)` |
| ... and 32 more | |

Legend: Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Last update 3f69ec8...d879bfe.

ruanwenjun · 2022-07-10T09:31:23Z

@caishunfeng This PR is ready to review, please take a look.

sonarqubecloud · 2022-07-10T09:44:10Z

SonarCloud Quality Gate failed.

0 Bugs
0 Vulnerabilities
0 Security Hotspots
21 Code Smells

1.3% Coverage
8.4% Duplication

caishunfeng

LGTM overall, some nits.

caishunfeng · 2022-07-11T01:37:36Z

...er/src/main/java/org/apache/dolphinscheduler/server/master/event/TaskResultEventHandler.java

+        if (!taskInstanceOptional.isPresent()) {
+            sendAckToWorker(taskEvent);
+            throw new TaskEventHandleError(
+                "Handle task result event error, cannot find the taskInstance from cache, will discord this event");


Suggested change

-                "Handle task result event error, cannot find the taskInstance from cache, will discord this event");
+                "Handle task result event error, cannot find the taskInstance from cache, will discare this event");

caishunfeng · 2022-07-11T01:49:49Z

.../main/java/org/apache/dolphinscheduler/server/master/event/WorkflowEventHandleException.java

+
+package org.apache.dolphinscheduler.server.master.event;
+
+public class WorkflowEventHandleException extends Exception {


Please add some comments.

ruanwenjun · 2022-07-11T01:57:20Z

I will fix this in another PR
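The `TaskResultEventHandler` hunk reviewed above throws `TaskEventHandleError` and drops the event when the task instance is missing from the cache. The sketch below illustrates one way a dispatcher could treat the two exception types listed in the coverage report. That `TaskEventHandleException` means "retry later" is an assumption made here, not something this page confirms, and `Handler`, `drain`, and the string events are hypothetical.

```java
import java.util.ArrayList;
import java.util.Deque;
import java.util.List;

// Illustration only: the exception names come from this PR, the semantics of
// TaskEventHandleException (retryable) and everything else are assumptions.
class EventDispatchDemo {

    /** Unrecoverable: the event is dropped. */
    static final class TaskEventHandleError extends Exception {
        TaskEventHandleError(String message) {
            super(message);
        }
    }

    /** Assumed recoverable: the event stays queued and is retried. */
    static final class TaskEventHandleException extends Exception {
        TaskEventHandleException(String message) {
            super(message);
        }
    }

    interface Handler {
        void handle(String event) throws TaskEventHandleError, TaskEventHandleException;
    }

    /** Processes the queue for at most `maxAttempts` tries; returns the discarded events. */
    static List<String> drain(Deque<String> queue, Handler handler, int maxAttempts) {
        List<String> discarded = new ArrayList<>();
        for (int attempt = 0; attempt < maxAttempts && !queue.isEmpty(); attempt++) {
            String event = queue.peek();
            try {
                handler.handle(event);
                queue.remove(); // handled successfully
            } catch (TaskEventHandleError e) {
                discarded.add(event); // cannot recover, drop it
                queue.remove();
            } catch (TaskEventHandleException e) {
                // Keep the event at the head of the queue; the next iteration retries it.
            }
        }
        return discarded;
    }
}
```

Separating the two types lets the handler decide per failure whether retrying can ever help, for example a missing cache entry versus a database that is temporarily down.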

…correct issue (apache#17)

* [Fix-10842] Fix master/worker failover will cause status incorrect (apache#10839)
* Fix master failover will not update task instance status
* Add some failover log
* Fix worker failover will rerun task more than once
* Fix workflowInstance failover may rerun already success taskInstance

(cherry picked from commit 3f69ec8)

* [Fix-10854] Fix database restart may lost task instance status (apache#10866)
* Fix database update error doesn't rollback the task instance status
* Fix database error may cause workflow dead with running status

(cherry picked from commit f639a2e)


ruanwenjun requested review from SbloodyS and caishunfeng as code owners July 9, 2022 09:37

github-actions bot added the backend label Jul 9, 2022

ruanwenjun marked this pull request as draft July 9, 2022 09:37

ruanwenjun force-pushed the dev_wenjun_fixDatabaseFailed branch 3 times, most recently from 25fb177 to 9d475fe Compare July 9, 2022 13:54

SbloodyS assigned ruanwenjun Jul 9, 2022

SbloodyS added the bug Something isn't working label Jul 9, 2022

SbloodyS added this to the 3.0.0-release milestone Jul 9, 2022

ruanwenjun force-pushed the dev_wenjun_fixDatabaseFailed branch 4 times, most recently from 350de5b to b007369 Compare July 9, 2022 16:18

ruanwenjun marked this pull request as ready for review July 9, 2022 16:19

Fix database update error doesn't rollback the task instance status

dd7d41f

ruanwenjun force-pushed the dev_wenjun_fixDatabaseFailed branch from b007369 to dd7d41f Compare July 9, 2022 16:51

Fix database error may cause workflow dead with running status

d879bfe

ruanwenjun changed the title ~~[Fix-10854] Fix database update error doesn't rollback the task instance status~~ [Fix-10854] Fix database restart may lost task instance status Jul 10, 2022

caishunfeng approved these changes Jul 11, 2022


ruanwenjun merged commit f639a2e into apache:dev Jul 11, 2022

ruanwenjun deleted the dev_wenjun_fixDatabaseFailed branch July 11, 2022 01:57

ruanwenjun merged 2 commits into apache:dev from ruanwenjun:dev_wenjun_fixDatabaseFailed
