Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Store failed execution in flyteadmin #4390

Merged
merged 5 commits into from
Nov 28, 2023

Conversation

iaroslav-ciupin
Copy link
Contributor

@iaroslav-ciupin iaroslav-ciupin commented Nov 9, 2023

Tracking issue

Docs link

Describe your changes

When creating an execution fails, store execution model in database with corresponding error.

Check all the applicable boxes

  • I updated the documentation accordingly.
  • All new and existing tests passed.
  • All commits are signed-off.

Setup Process

Screenshots

Note to reviewers

Related PRs

Copy link

codecov bot commented Nov 9, 2023

Codecov Report

Attention: 11 lines in your changes are missing coverage. Please review.

Comparison is base (268feae) 59.24% compared to head (9fd80e1) 59.74%.

Files Patch % Lines
...teadmin/pkg/repositories/transformers/execution.go 64.51% 11 Missing ⚠️
Additional details and impacted files
@@            Coverage Diff             @@
##           master    #4390      +/-   ##
==========================================
+ Coverage   59.24%   59.74%   +0.49%     
==========================================
  Files         618      636      +18     
  Lines       52334    53870    +1536     
==========================================
+ Hits        31007    32184    +1177     
- Misses      18859    19154     +295     
- Partials     2468     2532      +64     
Flag Coverage Δ
unittests 59.74% <79.24%> (+0.49%) ⬆️

Flags with carried forward coverage won't be shown. Click here to find out more.

☔ View full report in Codecov by Sentry.
📢 Have feedback on the report? Share it here.

@iaroslav-ciupin iaroslav-ciupin force-pushed the handle-task-validation-error branch 4 times, most recently from a64362b to 0c2ade1 Compare November 10, 2023 18:23
@iaroslav-ciupin iaroslav-ciupin changed the title handle task validation error wip Handle ExecutionRuntimeError in flyteadmin CreateExecution Nov 10, 2023

// ExecutionRuntimeError is a special error that can be returned by plugins denoting that
// execution failed during runtime and should still be saved to database
type ExecutionRuntimeError struct {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is this specific to Create calls, and if so, should it be named accordingly, i.e., CreateExecutionRuntimeError? Otherwise I imagine most execution errors will be runtime errors (conceptually).

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't think it has to be specific to Create calls, word "Runtime" means it happened while actually executing, no matter that execution was triggered.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Instead of creating a new error type here and checking for errors.As, does it make sense to add a code to the FlyteAdmin error code for "BuildFailure" or something and just wrapping the build error here. It seems a little weird to me that were just checking if the error contains an flyteidl.core.ExecutionError and using that to indicate the we failed to correctly build the workflow.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

sorry, I'm not following, what do you actually suggest to change? can you elaborate in more details. In Union Cloud, we have our own executor plugin which flyteadmin invokes during execution creation. Its shown in this PR. I have wrapped flyteidl.core.ExecutionError because this is what needed to store failed execution in ExecutionClosure in database. And I think its responsibility of plugin to set explicitly code, message and kind from this flyteidl.core.ExecutionError rather than flyteadmin deducing it from obtained error.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

IIUC the goal here is that every time the build process fails we write a record to the DB indicating that the build process fails. In this scenario, we are only performing this failure maintenance when the error contains an flyteidlcore.ExecutionError. This covers the case in your PR because you are explicitly throwing the correct error type. Alternatively, we could wrap this error similar to how we're doing elsewhere with a code that indicates a build error. Then every time that the build process fails we persist it in the DB.

There are plenty of other ways to handle this as well, but just wondering if we care about all failure scenarios, or just the one where you're explicitly throwing (which may be more accurately captured using error codes rather than types). If this is not the intended behavior, then disregard my comments. Also, I don't feel strongly about this, no need to over analyze.

Copy link
Contributor Author

@iaroslav-ciupin iaroslav-ciupin Nov 17, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I refactored this to store all failed executions in database. Please take a look and let me know what do you think.

DISCLAIMER: I am not experienced with flyteadmin codebase, so I may not understand all the consequences of storing ALL failed executions. For example, are there any clients expecting to receive an immediate error when execution is invalid? This change now has bigger scope than my initial intent of just storing task validation errors, so don't expect me to understand all the implications of PR at least at this moment 😄

@iaroslav-ciupin iaroslav-ciupin marked this pull request as ready for review November 10, 2023 19:06
EngHabu
EngHabu previously approved these changes Nov 16, 2023
@iaroslav-ciupin iaroslav-ciupin changed the title Handle ExecutionRuntimeError in flyteadmin CreateExecution Store failed execution in flyteadmin Nov 17, 2023
@dosubot dosubot bot added the size:XL This PR changes 500-999 lines, ignoring generated files. label Nov 27, 2023
EngHabu
EngHabu previously approved these changes Nov 27, 2023
Signed-off-by: Iaroslav Ciupin <iaroslav@union.ai>
@iaroslav-ciupin iaroslav-ciupin enabled auto-merge (squash) November 27, 2023 17:04
auto-merge was automatically disabled November 27, 2023 20:21

Pull Request is not mergeable

@iaroslav-ciupin iaroslav-ciupin enabled auto-merge (squash) November 27, 2023 21:15
EngHabu
EngHabu previously approved these changes Nov 27, 2023
@dosubot dosubot bot added size:L This PR changes 100-499 lines, ignoring generated files. and removed size:XL This PR changes 500-999 lines, ignoring generated files. labels Nov 28, 2023
@iaroslav-ciupin iaroslav-ciupin merged commit 4c1d993 into master Nov 28, 2023
45 checks passed
@iaroslav-ciupin iaroslav-ciupin deleted the handle-task-validation-error branch November 28, 2023 15:30
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
size:L This PR changes 100-499 lines, ignoring generated files.
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants