fix(propeller): treat K8s BadRequest/Invalid as permanent failure instead of retrying #7041

Open
themavik wants to merge 1 commit into flyteorg:master from themavik:fix/6531-badrequest-permanent-failure

themavik commented Mar 18, 2026

Summary

  • Uncomment the PhaseInfoFailure return for BadRequest/Invalid K8s errors in plugin_manager.go so they are treated as permanent failures instead of being retried indefinitely

Problem

When a Kubernetes API server returns a BadRequest (400) or Invalid (422) error during resource creation (e.g., an admission webhook rejection), FlytePropeller logs the error but falls through to the generic system error handler. This causes the task to be retried indefinitely, wasting resources and delaying user feedback.

The return statement on line 264 was commented out:

} else if k8serrors.IsBadRequest(err) || k8serrors.IsInvalid(err) {
    logger.Errorf(ctx, "Badly formatted resource for plugin [%s], err %s", e.id, err)
    // return pluginsCore.DoTransition(pluginsCore.PhaseInfoFailure("BadTaskFormat", err.Error(), nil)), nil
}

Without the return, execution falls through to line 271, which wraps the error as a generic system error and returns UnknownTransition, so the task is retried indefinitely.
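The fall-through can be illustrated with a minimal, standard-library-only sketch. It is not the code in plugin_manager.go: `apiError`, `isBadRequestOrInvalid`, `Outcome`, and `handleCreateError` are hypothetical names, and the status-code check only approximates the intent of `k8serrors.IsBadRequest` and `k8serrors.IsInvalid`. The `returnEarly` flag models whether the return statement is present.

```go
package main

import (
	"errors"
	"fmt"
	"net/http"
)

// apiError is a hypothetical stand-in for a Kubernetes API status error.
// It carries only the HTTP status code returned by the API server.
type apiError struct {
	Code int
	Msg  string
}

func (e *apiError) Error() string { return fmt.Sprintf("%d: %s", e.Code, e.Msg) }

// isBadRequestOrInvalid approximates the intent of k8serrors.IsBadRequest and
// k8serrors.IsInvalid: a 400 or 422 means the request itself is wrong.
func isBadRequestOrInvalid(err error) bool {
	var ae *apiError
	if !errors.As(err, &ae) {
		return false
	}
	return ae.Code == http.StatusBadRequest || ae.Code == http.StatusUnprocessableEntity
}

// Outcome is a simplified, hypothetical transition result.
type Outcome string

const (
	PermanentFailure Outcome = "PermanentFailure"
	Unknown          Outcome = "Unknown"
)

// handleCreateError sketches the error branch. With returnEarly set to false
// it models the commented-out return: the error is recognized but still
// reaches the generic system-error path at the bottom.
func handleCreateError(err error, returnEarly bool) (Outcome, error) {
	if isBadRequestOrInvalid(err) && returnEarly {
		// Permanent failure: reported as a phase, not as a system error.
		return PermanentFailure, nil
	}
	// Generic path: the caller sees a system error and may retry.
	return Unknown, fmt.Errorf("system error: %w", err)
}
```

Calling `handleCreateError(&apiError{Code: 400, Msg: "denied"}, false)` yields `Unknown` with a non-nil error, while passing `true` yields `PermanentFailure` with a nil error. In this sketch that difference is what separates a retryable outcome from a terminal one.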

Root Cause

The sibling code in flyteplugins/go/tasks/plugins/array/k8s/subtask.go handles the same case with the return in place (not commented out), which indicates that the commented-out return in plugin_manager.go was an oversight.

Fix

Uncomment the return statement so BadRequest/Invalid errors immediately transition to PhasePermanentFailure with a clear "BadTaskFormat" reason. Added a unit test (jobBadRequest) that verifies the fix.

Testing

  • Added the TestK8sTaskExecutor_Handle_LaunchResource/jobBadRequest subtest, which creates a fake client returning k8serrors.NewBadRequest(...) and verifies that the transition phase is PhasePermanentFailure
  • Existing tests remain unchanged

Closes #6531

…tead of retrying

The return statement for BadRequest/Invalid errors was commented out,
causing these errors to fall through to the generic system error path
and be retried indefinitely.  Validating webhook rejections and invalid
resource specs are not transient — retrying them wastes resources and
delays user feedback.

Uncomment the return so BadRequest/Invalid immediately transitions to
PhasePermanentFailure with a clear "BadTaskFormat" reason.

The sibling code in flyteplugins/go/tasks/plugins/array/k8s/subtask.go
already handles this correctly (uncommented).

Closes flyteorg#6531

Made-with: Cursor
Signed-off-by: Avik Kumar <avikkumar2004@gmail.com>
themavik force-pushed the fix/6531-badrequest-permanent-failure branch from a4f978a to eb12b58 on March 18, 2026 18:11
Development

Successfully merging this pull request may close these issues.

[BUG] Flyte Propeller treats a BadRequest code from K8s when launching a resource as a retryable failure
