Skip to content

Conversation

@fg91
Copy link
Member

@fg91 fg91 commented Jan 12, 2026

Tracking issue

Fixes #4466

This is truly my arch nemesis in terms of Flyte issues. I reported this problem more than 2 years ago and it was discussed multiple times in the contributors' sync. Our platform users have run into this problem dozens of times, always leaving them profoundly confused and frustrated.

Why are the changes needed?

When a dynamic workflow is malformed in a way that is not detected on the flytekit level but only when compiling the workflow in flytepropeller, such as ...

@dataclass
class Foo:
    a: int = 1

@task
def foo(a: str = 1): 
    print(a)

@task
def bar() -> Foo: 
    return Foo()


@dynamic
def sub_wf():
    foo(a=bar()) # Pass instance of Foo instead of string

... the user is presented with this confusing view in flyteconsole, showing a succeeded attempt, failed node, and running workflow:

Screenshot 2026-01-12 at 17 56 29

Only ~1.5h later, the workflow fails with

RuntimeExecutionError: max number of system retry attempts [51/50] exhausted. Last known status message: 0: [User] malformed dynamic workflow ...

What changes were proposed in this pull request?

The discussions in the issue and in the contributors' sync always revolved around changing flyteconsole to show the underlying error to the user in the UI, see e.g. here.

I realized that this is actually not a UI issue but rather a bug in the dynamic handler logic:

  • The dynamic workflow cannot be built due to a user error so the node transitions to the failing phase.
  • The Abort and Finalize methods are called but both try to build the dynamic workflow another time - which again fails.
  • Not being able to build the dynamic workflow makes Abort and Finalize return an error.
  • This error is categorized as a transient system error so that the "system error retry logic" kicks in.
  • This prevents the workflow from fully failing until the system retry budget is exhausted.
  • Therefore, the workflow fails only ~1.5h later with max number of system retry attempts [51/50] exhausted ... malformed dynamic workflow ...

This PR modifies Abort and Finalize to not return an error when the dynamic workflow is malformed. This way, the workflow can be aborted and finalized immediately and the user is presented with there underlying user error right away.

How was this patch tested?

  • Ran propeller with the proposed changes and the example above. The workflow immediately fails and shows the underlying error to the user:

    Screenshot 2026-01-12 at 18 10 43
  • Added unit tests for the fixed behaviour.

Check all the applicable boxes

  • I updated the documentation accordingly.
  • All new and existing tests passed.
  • All commits are signed-off.

…succeeded + running' Schroedinger stage

Signed-off-by: Fabio Grätz <fabio@cusp.ai>
@fg91 fg91 added the fixed For any bug fixes label Jan 12, 2026
}
}

func createNodeContext(t assert.TestingT, ttype string, finalOutput storage.DataReference) *nodeMocks.NodeExecutionContext {
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This helper is not new, I just moved it out of TestDynamicNodeTaskNodeHandler_Finalize to be able to use it in the new TestDynamicNodeTaskNodeHandler_Abort. For that I added the argument t assert.TestingT which was captured from the outer scope before.

@codecov
Copy link

codecov bot commented Jan 12, 2026

Codecov Report

❌ Patch coverage is 33.33333% with 2 lines in your changes missing coverage. Please review.
✅ Project coverage is 56.96%. Comparing base (d6df4be) to head (a330d00).
⚠️ Report is 1 commits behind head on master.

Files with missing lines Patch % Lines
...epropeller/pkg/controller/nodes/dynamic/handler.go 33.33% 1 Missing and 1 partial ⚠️
Additional details and impacted files
@@            Coverage Diff             @@
##           master    #6854      +/-   ##
==========================================
+ Coverage   56.94%   56.96%   +0.02%     
==========================================
  Files         929      929              
  Lines       58150    58152       +2     
==========================================
+ Hits        33114    33128      +14     
+ Misses      21993    21982      -11     
+ Partials     3043     3042       -1     
Flag Coverage Δ
unittests-datacatalog 53.51% <ø> (ø)
unittests-flyteadmin 53.14% <ø> (ø)
unittests-flytecopilot 43.06% <ø> (ø)
unittests-flytectl 64.05% <ø> (+0.05%) ⬆️
unittests-flyteidl 75.71% <ø> (ø)
unittests-flyteplugins 60.13% <ø> (ø)
unittests-flytepropeller 53.63% <33.33%> (+0.06%) ⬆️
unittests-flytestdlib 63.27% <ø> (-0.03%) ⬇️

Flags with carried forward coverage won't be shown. Click here to find out more.

☔ View full report in Codecov by Sentry.
📢 Have feedback on the report? Share it here.

🚀 New features to boost your workflow:
  • ❄️ Test Analytics: Detect flaky tests, report on failures, and find test suite problems.

Signed-off-by: Fabio Grätz <fabio@cusp.ai>
@fg91 fg91 force-pushed the fg91/fix-succeeded-but-failed-dynamic branch from c233125 to 2335bf5 Compare January 19, 2026 11:33
Copy link
Contributor

@EngHabu EngHabu left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Oh.. nice find!

@Sovietaced Sovietaced merged commit 575c0af into master Jan 19, 2026
47 of 49 checks passed
@Sovietaced Sovietaced deleted the fg91/fix-succeeded-but-failed-dynamic branch January 19, 2026 17:49
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

fixed For any bug fixes

Projects

None yet

Development

Successfully merging this pull request may close these issues.

[UI Feature] Propagate dynamic workflow error messages from flytepropeller to UI

4 participants