[BUG] Potential race condition in Flyte Propeller #3582

Closed
2 tasks done
pablocasares opened this issue Apr 11, 2023 · 3 comments · Fixed by flyteorg/flyteadmin#551, flyteorg/flytepropeller#553 or flyteorg/flytepropeller#574
Labels: bug (Something isn't working)

Comments

@pablocasares

Describe the bug

I am monitoring multiple workflows containing subworkflows running in parallel. I'm using Propeller v1.1.70.

Some of the executions fail with this error.

CausedByError: Failed to propagate Abort for workflow. Error: 0: [SystemError] system error, caused by: rpc error: code = PermissionDenied desc = Cannot abort an already terminate workflow execution.

One of the subworkflows is intended to fail under certain conditions. When it fails, Propeller tries to abort the remaining running subworkflows. Sometimes they are aborted properly, but other times Propeller receives the PermissionDenied error above from Flyte Admin.

This looks like a race condition in Propeller: it tries to abort a workflow that is already in a terminal state. When Propeller checks the status of the remaining subworkflows, they are still reported as "running", but by the time the abort call is made they have already moved to a terminal status. In the case I checked, the failing subworkflow (the one that triggers the abort of the rest) and the successful one finished 3 ms apart, so I think the successful one is reported as running when Propeller checks it but is actually Succeeded when the abort call is executed. Maybe these lines are relevant to the issue: https://github.com/flyteorg/flytepropeller/blob/master/pkg/controller/nodes/task/handler.go#L795-L825
(currentPhase might change when p.Abort is called)
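To make the suspected sequence concrete, here is a minimal sketch of the check-then-act gap in Python. It is not Propeller code: `FakeAdmin`, `check_then_abort`, and the `between_check_and_abort` hook are hypothetical names invented for this illustration, and the hook only simulates the sibling finishing in the window between the status check and the abort call.

```python
from enum import Enum


class Phase(Enum):
    RUNNING = "RUNNING"
    SUCCEEDED = "SUCCEEDED"
    ABORTED = "ABORTED"


class AlreadyTerminalError(Exception):
    """Stand-in for the 'Cannot abort an already terminate workflow execution' rejection."""


class FakeAdmin:
    """Hypothetical in-memory stand-in for the service that owns execution state."""

    def __init__(self):
        self.phases = {}

    def get_phase(self, execution_id):
        return self.phases[execution_id]

    def abort(self, execution_id):
        # Aborting anything that is no longer RUNNING is rejected.
        if self.phases[execution_id] is not Phase.RUNNING:
            raise AlreadyTerminalError(execution_id)
        self.phases[execution_id] = Phase.ABORTED


def check_then_abort(admin, execution_id, between_check_and_abort=lambda: None):
    """Abort only if the execution looked RUNNING at check time.

    The hook runs in the gap between the check and the abort call, which is
    where a concurrent state change can land.
    """
    observed = admin.get_phase(execution_id)
    between_check_and_abort()
    if observed is Phase.RUNNING:
        admin.abort(execution_id)
    return observed


admin = FakeAdmin()
admin.phases["sibling"] = Phase.RUNNING


def sibling_finishes():
    # Simulates the sibling subworkflow succeeding a few milliseconds later.
    admin.phases["sibling"] = Phase.SUCCEEDED


try:
    check_then_abort(admin, "sibling", between_check_and_abort=sibling_finishes)
    outcome = "aborted"
except AlreadyTerminalError:
    outcome = "rejected: already terminal"

print(outcome)
```

The caller acted on a stale "running" observation, so the abort is rejected even though nothing is actually wrong with the sibling.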

Please check the attached screenshots to see how different executions of the same code produce different results.

Eventually, the parent workflow (the one containing the subworkflows) fails with this error:

RuntimeExecutionError: max number of system retry attempts [51/50] exhausted.

This error increases the number of calls made to FlyteAdmin and also increases the metric associated with the PermissionDenied error.

Please do not hesitate to ask for further information if needed.

Expected behavior

FlytePropeller should not retry aborting a node that is already in a terminal state, and the parent workflow should record that node's terminal status (sometimes the node is shown as running in the parent even though it shows as succeeded when you open the subworkflow).
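A rough sketch of the behavior I would expect, written in Python for brevity. All names here (`FakeAdmin`, `abort_idempotently`, `AlreadyTerminalError`) are hypothetical and are not taken from Propeller or FlyteAdmin; the point is only that an "already terminal" rejection is treated as success, and the phase recorded for the node is whatever the execution actually ended in.

```python
class AlreadyTerminalError(Exception):
    """Stand-in for a rejection because the execution is already terminal."""


TERMINAL_PHASES = {"SUCCEEDED", "FAILED", "ABORTED"}


class FakeAdmin:
    """Hypothetical in-memory stand-in for the service that owns execution state."""

    def __init__(self, phases):
        self.phases = dict(phases)
        self.abort_calls = 0

    def get_phase(self, execution_id):
        return self.phases[execution_id]

    def abort(self, execution_id):
        self.abort_calls += 1
        if self.phases[execution_id] in TERMINAL_PHASES:
            raise AlreadyTerminalError(execution_id)
        self.phases[execution_id] = "ABORTED"


def abort_idempotently(admin, execution_id):
    """Abort an execution, treating 'already terminal' as success.

    Returns the phase the parent should record for this node.
    """
    try:
        admin.abort(execution_id)
    except AlreadyTerminalError:
        # The execution finished first, so there is nothing left to abort.
        # Swallow the error instead of surfacing it as a retryable failure.
        pass
    return admin.get_phase(execution_id)


admin = FakeAdmin({"still-running": "RUNNING", "just-finished": "SUCCEEDED"})
node_phases = {
    execution_id: abort_idempotently(admin, execution_id)
    for execution_id in ("still-running", "just-finished")
}
print(node_phases)
```

With this shape, each sibling gets exactly one abort call, and the one that had already succeeded is recorded as succeeded in the parent rather than left as running.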

Additional context to reproduce

No response

Screenshots

image

image

image

Are you sure this issue hasn't been raised already?

  • Yes

Have you read the Code of Conduct?

  • Yes
@pablocasares added the bug and untriaged labels on Apr 11, 2023
@welcome

welcome bot commented Apr 11, 2023

Thank you for opening your first issue here! 🛠

@kumare3
Contributor

kumare3 commented Apr 11, 2023

Thank you for the issue. We will take a look ASAP.

@andresgomezfrr
Contributor

Hey @kumare3 & @EngHabu! This week we upgraded to Flyte 1.6.1, which includes this patch, and ran some tests, but the error is still there. It is not exactly the same error, but the behavior is the same.

StackTrace

Workflow[blablabla] failed. RuntimeExecutionError: max number of system retry attempts [51/50] exhausted. Last known status message: Workflow[blablabla] failed. CausedByError: Failed to propagate Abort for workflow. Error: 0: [SystemError] system error, caused by: EventAlreadyInTerminalStateError: conflicting events; destination: ABORTED, caused by [rpc error: code = FailedPrecondition desc = Cannot abort an already terminate workflow execution]
1: [SystemError] system error, caused by: EventAlreadyInTerminalStateError: conflicting events; destination: ABORTED, caused by [rpc error: code = FailedPrecondition desc = Cannot abort an already terminate workflow execution]
2: [SystemError] system error, caused by: EventAlreadyInTerminalStateError: conflicting events; destination: FAILED, caused by [rpc error: code = FailedPrecondition desc = Cannot abort an already terminate workflow execution]
3: 0: [SystemError] system error, caused by: EventAlreadyInTerminalStateError: conflicting events; destination: ABORTED, caused by [rpc error: code = FailedPrecondition desc = Cannot abort an already terminate workflow execution]
1: [SystemError] system error, caused by: EventAlreadyInTerminalStateError: conflicting events; destination: ABORTED, caused by [rpc error: code = FailedPrecondition desc = Cannot abort an already terminate workflow execution]
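In this trace the "already terminal" signal is no longer the outermost error; it sits one level down, wrapped in a system error. A minimal Python sketch of why a check on the outer error alone would miss it follows. The classes are stand-ins named after what appears in the trace, and `is_already_terminal` and `build_chain` are hypothetical helpers, not Flyte APIs; I am assuming the nesting is outer system error, then EventAlreadyInTerminalStateError, then the rpc error.

```python
class RpcError(Exception):
    """Hypothetical stand-in for the rpc error at the bottom of the trace."""

    def __init__(self, code, desc):
        super().__init__(f"rpc error: code = {code} desc = {desc}")
        self.code = code


class EventAlreadyInTerminalStateError(Exception):
    """Stand-in named after the wrapper that appears in the trace."""


class WorkflowSystemError(Exception):
    """Stand-in for the outer '[SystemError] system error' wrapper."""


def is_already_terminal(err):
    """Return True if any error in the cause chain says the target is terminal."""
    while err is not None:
        if isinstance(err, EventAlreadyInTerminalStateError):
            return True
        err = err.__cause__
    return False


def build_chain(middle_cls):
    """Build outer -> middle -> rpc, mirroring the nesting assumed above."""
    rpc = RpcError(
        "FailedPrecondition",
        "Cannot abort an already terminate workflow execution",
    )
    middle = middle_cls("conflicting events; destination: ABORTED")
    middle.__cause__ = rpc
    outer = WorkflowSystemError("system error")
    outer.__cause__ = middle
    return outer


terminal = build_chain(EventAlreadyInTerminalStateError)
unrelated = build_chain(RuntimeError)
print(is_already_terminal(terminal), is_already_terminal(unrelated))
```

If the abort path only inspects the top-level error type, every one of the wrapped errors above would still be counted as a system failure and retried until the retry budget runs out.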