Skip to content

feat(connector): handle TaskExecution_RETRYABLE_FAILED phase#7409

Merged
pingsutw merged 2 commits into
mainfrom
kevin/connector-handle-retryable-failed
May 21, 2026
Merged

feat(connector): handle TaskExecution_RETRYABLE_FAILED phase#7409
pingsutw merged 2 commits into
mainfrom
kevin/connector-handle-retryable-failed

Conversation

@pingsutw
Copy link
Copy Markdown
Member

Tracking issue

Why are the changes needed?

The webapi connector plugin currently only handles TaskExecution_FAILED and TaskExecution_ABORTED as failure phases. The TaskExecution_RETRYABLE_FAILED phase (value 8 in flyteidl2/core/execution.proto) is silently ignored, falling through to the default branch which returns an "unknown execution phase" system error. As a result, a connector that legitimately reports a retryable failure surfaces as a system error to propeller and the task's retry policy is never honored.

What changes were proposed in this pull request?

In flyteplugins/go/tasks/plugins/webapi/connector/plugin.go:

  1. ResourceWrapper.IsTerminal() now treats TaskExecution_RETRYABLE_FAILED as terminal alongside SUCCEEDED/FAILED/ABORTED, so the framework skips an extra Get round-trip to the connector once it has returned this phase.
  2. The phase switch in Status maps TaskExecution_RETRYABLE_FAILED to core.PhaseInfoRetryableFailure(...), so propeller retries per the task's retry policy instead of marking the node as permanently failed.

How was this patch tested?

Built locally with go build ./flyteplugins/go/tasks/plugins/webapi/connector/.... No existing tests covered the previous default-branch behavior; happy to add a unit test for both IsTerminal() and the phase mapping if reviewers prefer.

Labels

  • added

Setup process

N/A

Screenshots

N/A

Check all the applicable boxes

  • I updated the documentation accordingly.
  • All new and existing tests passed.
  • All commits are signed-off.

Related PRs

Stack

Docs link

Map the RETRYABLE_FAILED phase returned by connectors to
PhaseInfoRetryableFailure so propeller honors the task's retry policy
instead of treating the failure as permanent, and treat it as terminal
in IsTerminal() to avoid an extra network call to the connector once
it has returned.

Signed-off-by: Kevin Su <pingsutw@apache.org>
Copilot AI review requested due to automatic review settings May 21, 2026 20:00
Copy link
Copy Markdown
Contributor

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

This PR updates the WebAPI connector task plugin to correctly recognize and translate the TaskExecution_RETRYABLE_FAILED phase from connector responses, ensuring Propeller treats it as a retryable task failure (honoring task retry policy) instead of surfacing an “unknown execution phase” system error.

Changes:

  • Treat TaskExecution_RETRYABLE_FAILED as terminal in ResourceWrapper.IsTerminal() to avoid extra connector Get polling once the phase is reached.
  • Map TaskExecution_RETRYABLE_FAILED to core.PhaseInfoRetryableFailure(...) in Status() so the failure is retryable.
Comments suppressed due to low confidence (2)

flyteplugins/go/tasks/plugins/webapi/connector/plugin.go:366

  • The new TaskExecution_RETRYABLE_FAILEDPhaseInfoRetryableFailure mapping isn’t covered by the existing plugin_test.go status-phase tests (which already cover FAILED/ABORTED). Please add a test that asserts Status() returns pluginsCore.PhaseRetryableFailure for this phase so the retry behavior remains protected.
	case flyteIdl.TaskExecution_FAILED:
		return core.PhaseInfoFailure(errorCode, fmt.Sprintf("failed to run the job: %s", resource.Message), taskInfo), nil
	case flyteIdl.TaskExecution_RETRYABLE_FAILED:
		return core.PhaseInfoRetryableFailure(errorCode, fmt.Sprintf("failed to run the job: %s", resource.Message), taskInfo), nil

flyteplugins/go/tasks/plugins/webapi/connector/plugin.go:366

  • Both TaskExecution_FAILED and TaskExecution_RETRYABLE_FAILED currently produce the same reason string prefix ("failed to run the job: …"), which makes it hard to tell from logs/events whether the failure is retryable vs permanent. Consider making the retryable-failure reason explicitly distinguishable (e.g., include "retryable" in the message).
	case flyteIdl.TaskExecution_FAILED:
		return core.PhaseInfoFailure(errorCode, fmt.Sprintf("failed to run the job: %s", resource.Message), taskInfo), nil
	case flyteIdl.TaskExecution_RETRYABLE_FAILED:
		return core.PhaseInfoRetryableFailure(errorCode, fmt.Sprintf("failed to run the job: %s", resource.Message), taskInfo), nil

💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

Comment thread flyteplugins/go/tasks/plugins/webapi/connector/plugin.go
Add a Status subtest asserting RETRYABLE_FAILED maps to
PhaseRetryableFailure with the expected error message, and a
table-driven TestResourceWrapper_IsTerminal that pins the terminal
phase set (SUCCEEDED, FAILED, RETRYABLE_FAILED, ABORTED).

Signed-off-by: Kevin Su <pingsutw@apache.org>
@pingsutw pingsutw self-assigned this May 21, 2026
@pingsutw pingsutw merged commit 11838ce into main May 21, 2026
21 checks passed
@pingsutw pingsutw deleted the kevin/connector-handle-retryable-failed branch May 21, 2026 21:11
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants