fix(propeller): treat K8s BadRequest/Invalid as permanent failure instead of retrying #7041
Open
themavik wants to merge 1 commit into flyteorg:master from
…tead of retrying

The return statement for BadRequest/Invalid errors was commented out, causing these errors to fall through to the generic system error path and be retried indefinitely. Validating webhook rejections and invalid resource specs are not transient; retrying them wastes resources and delays user feedback.

Uncomment the return so BadRequest/Invalid immediately transitions to PhasePermanentFailure with a clear "BadTaskFormat" reason. The sibling code in flyteplugins/go/tasks/plugins/array/k8s/subtask.go already handles this correctly (uncommented).

Closes flyteorg#6531

Made-with: Cursor
Signed-off-by: Avik Kumar <avikkumar2004@gmail.com>
a4f978a to eb12b58
Summary
Uncomment the `PhaseInfoFailure` return for `BadRequest`/`Invalid` K8s errors in `plugin_manager.go` so they are treated as permanent failures instead of being retried indefinitely.

Problem
When the Kubernetes API server returns a `BadRequest` (400) or `Invalid` (422) error during resource creation (e.g., an admission webhook rejection), FlytePropeller logs the error but falls through to the generic system error handler. This causes the task to be retried indefinitely, wasting resources and delaying user feedback. The return statement on line 264 of `plugin_manager.go` was commented out.
Without the return, execution falls through to line 271, which wraps the error as a generic system error and returns `UnknownTransition`, causing infinite retries.

Root Cause
The sibling code in `flyteplugins/go/tasks/plugins/array/k8s/subtask.go` handles the same case correctly (the return is not commented out), confirming this was an oversight.

Fix
Uncomment the return statement so `BadRequest`/`Invalid` errors immediately transition to `PhasePermanentFailure` with a clear `"BadTaskFormat"` reason. Added a unit test (`jobBadRequest`) that verifies the fix.

Testing
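The behavior the new test checks can be sketched with a fake client injected behind a small interface. This is a simplified model under stated assumptions, not Flyte's API: `creator`, `fakeClient`, `statusError`, `launch`, and the phase constants are hypothetical names, and the string reasons stand in for Kubernetes status reasons. With the early return in place, `BadRequest`/`Invalid` map straight to a permanent failure with reason `"BadTaskFormat"`, while any other error still takes the system-error path.

```go
package main

import (
	"errors"
	"fmt"
)

// statusError is a hypothetical stand-in for a Kubernetes API status error.
type statusError struct{ reason string }

func (e *statusError) Error() string { return "status error: " + e.reason }

// creator is a hypothetical narrow interface over "create this resource",
// so a test can inject a fake client instead of talking to a cluster.
type creator interface {
	Create(name string) error
}

// fakeClient fails every Create call with a preconfigured error (nil means success).
type fakeClient struct{ err error }

func (f fakeClient) Create(string) error { return f.err }

type phase string

const (
	phaseRunning          phase = "Running"
	phaseUnknown          phase = "Unknown"
	phasePermanentFailure phase = "PermanentFailure"
)

// launch models the fixed behavior: BadRequest/Invalid return a permanent
// failure immediately; any other error is surfaced as a system error,
// which a caller would treat as retryable.
func launch(c creator, name string) (phase, string, error) {
	err := c.Create(name)
	if err == nil {
		return phaseRunning, "", nil
	}
	var se *statusError
	if errors.As(err, &se) && (se.reason == "BadRequest" || se.reason == "Invalid") {
		return phasePermanentFailure, "BadTaskFormat", nil
	}
	return phaseUnknown, "", fmt.Errorf("failed to create resource: %w", err)
}

func main() {
	cases := []struct {
		name string
		err  error
	}{
		{"badRequest", &statusError{reason: "BadRequest"}},
		{"invalid", &statusError{reason: "Invalid"}},
		{"unavailable", &statusError{reason: "ServiceUnavailable"}},
	}
	for _, tc := range cases {
		p, reason, err := launch(fakeClient{err: tc.err}, tc.name)
		fmt.Printf("%s: phase=%s reason=%q systemErr=%v\n", tc.name, p, reason, err != nil)
	}
}
```

Keeping the client behind an interface is what lets a test force a specific API error without a live cluster, the same idea as the fake client in the new unit test.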
Added a `TestK8sTaskExecutor_Handle_LaunchResource/jobBadRequest` test that creates a fake client returning `k8serrors.NewBadRequest(...)` and verifies that the transition phase is `PhasePermanentFailure`.

Closes #6531