Skip to content

[Fix-10854] Fix database restart may lost task instance status#10866

Merged
ruanwenjun merged 2 commits intoapache:devfrom
ruanwenjun:dev_wenjun_fixDatabaseFailed
Jul 11, 2022
Merged

[Fix-10854] Fix database restart may lost task instance status#10866
ruanwenjun merged 2 commits intoapache:devfrom
ruanwenjun:dev_wenjun_fixDatabaseFailed

Conversation

@ruanwenjun
Copy link
Member

@ruanwenjun ruanwenjun commented Jul 9, 2022

I have tested this PR, and need to point out, that right now there are still exist some problems when the database restarts, we may still lose some status, this is caused by right now our state handle is not idempotent, we need to split the state such when we finished, we may need to do many step, clear map, update db, xx, when we failed in a step, we will retry next time, and when we retry, we need to know we failed on which step, then just recover on this step.

Purpose of the pull request

close #10854

Brief change log

  • Add TaskEventHandler to handler worker events
  • When worker event handle error will rollback the taskInstanceStatus

Verify this pull request

@ruanwenjun ruanwenjun marked this pull request as draft July 9, 2022 09:37
@ruanwenjun ruanwenjun force-pushed the dev_wenjun_fixDatabaseFailed branch 3 times, most recently from 25fb177 to 9d475fe Compare July 9, 2022 13:54
@SbloodyS SbloodyS added the bug Something isn't working label Jul 9, 2022
@SbloodyS SbloodyS added this to the 3.0.0-release milestone Jul 9, 2022
@codecov-commenter
Copy link

codecov-commenter commented Jul 9, 2022

Codecov Report

Merging #10866 (d879bfe) into dev (3f69ec8) will decrease coverage by 0.22%.
The diff coverage is 0.98%.

@@             Coverage Diff              @@
##                dev   #10866      +/-   ##
============================================
- Coverage     40.60%   40.37%   -0.23%     
- Complexity     4829     4843      +14     
============================================
  Files           915      933      +18     
  Lines         36436    36759     +323     
  Branches       4000     4025      +25     
============================================
+ Hits          14794    14842      +48     
- Misses        20160    20433     +273     
- Partials       1482     1484       +2     
Impacted Files Coverage Δ
.../dolphinscheduler/common/enums/StateEventType.java 0.00% <0.00%> (ø)
...e/dolphinscheduler/common/enums/TaskEventType.java 0.00% <0.00%> (ø)
.../dolphinscheduler/dao/utils/TaskInstanceUtils.java 0.00% <0.00%> (ø)
...e/dolphinscheduler/server/master/MasterServer.java 0.00% <0.00%> (ø)
...ler/server/master/event/TaskDelayEventHandler.java 0.00% <0.00%> (ø)
.../server/master/event/TaskDispatchEventHandler.java 0.00% <0.00%> (ø)
...uler/server/master/event/TaskEventHandleError.java 0.00% <0.00%> (ø)
.../server/master/event/TaskEventHandleException.java 0.00% <0.00%> (ø)
...r/master/event/TaskRejectByWorkerEventHandler.java 0.00% <0.00%> (ø)
...er/server/master/event/TaskResultEventHandler.java 0.00% <0.00%> (ø)
... and 32 more

Continue to review full report at Codecov.

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update 3f69ec8...d879bfe. Read the comment docs.

@ruanwenjun ruanwenjun force-pushed the dev_wenjun_fixDatabaseFailed branch 4 times, most recently from 350de5b to b007369 Compare July 9, 2022 16:18
@ruanwenjun ruanwenjun marked this pull request as ready for review July 9, 2022 16:19
@ruanwenjun ruanwenjun force-pushed the dev_wenjun_fixDatabaseFailed branch from b007369 to dd7d41f Compare July 9, 2022 16:51
@ruanwenjun
Copy link
Member Author

@caishunfeng This PR is ready to review, please take a look.

@sonarqubecloud
Copy link

SonarCloud Quality Gate failed.    Quality Gate failed

Bug A 0 Bugs
Vulnerability A 0 Vulnerabilities
Security Hotspot A 0 Security Hotspots
Code Smell A 21 Code Smells

1.3% 1.3% Coverage
8.4% 8.4% Duplication

@ruanwenjun ruanwenjun changed the title [Fix-10854] Fix database update error doesn't rollback the task instance status [Fix-10854] Fix database restart may lost task instance status Jul 10, 2022
Copy link
Contributor

@caishunfeng caishunfeng left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM overall, some nip.

if (!taskInstanceOptional.isPresent()) {
sendAckToWorker(taskEvent);
throw new TaskEventHandleError(
"Handle task result event error, cannot find the taskInstance from cache, will discord this event");
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
"Handle task result event error, cannot find the taskInstance from cache, will discord this event");
"Handle task result event error, cannot find the taskInstance from cache, will discare this event");


package org.apache.dolphinscheduler.server.master.event;

public class WorkflowEventHandleException extends Exception {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

please add some comments

@ruanwenjun ruanwenjun merged commit f639a2e into apache:dev Jul 11, 2022
@ruanwenjun ruanwenjun deleted the dev_wenjun_fixDatabaseFailed branch July 11, 2022 01:57
@ruanwenjun
Copy link
Member Author

I will fix this in another PR

ruanwenjun added a commit to ruanwenjun/dolphinscheduler that referenced this pull request Jul 12, 2022
…correct issue (apache#17)

* [Fix-10842] Fix master/worker failover will cause status incorrect (apache#10839)

* Fix master failover will not update task instance status
* Add some failover log
* Fix worker failover will rerun task more than once
* Fix workflowInstance failover may rerun already success taskInstance

(cherry picked from commit 3f69ec8)

* [Fix-10854] Fix database restart may lost task instance status (apache#10866)

* Fix database update error doesn't rollback the task instance status

* Fix database error may cause workflow dead with running status

(cherry picked from commit f639a2e)
ruanwenjun added a commit that referenced this pull request Jul 19, 2022
* Fix database update error doesn't rollback the task instance status

* Fix database error may cause workflow dead with running status

(cherry picked from commit f639a2e)
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

backend bug Something isn't working

Projects

None yet

Development

Successfully merging this pull request may close these issues.

[Bug] [Master] When database restart or close, the workflow instance status will be incorrect.

4 participants