Fix non-deterministic workflow digest and oversized error messages #7211

Open
andresgomezfrr wants to merge 2 commits into flyteorg:master from andresgomezfrr:fix/truncate-diff-error-message
@andresgomezfrr andresgomezfrr commented Apr 15, 2026

Tracking issue

Closes #7212
Related to #4780

Why are the changes needed?

When re-registering a workflow with the same version, FlyteAdmin computes a digest of the compiled workflow to check if the structure is identical. Two bugs cause this to fail for large workflows:

Bug 1: Non-deterministic digest (compiler)
ValidateWorkflow in workflow_compiler.go iterates wf.Nodes (a Go map) in two places without sorting keys. Because Go map iteration order is randomized, the same workflow can produce different CompiledWorkflowClosure outputs across compilations, which leads to different digests. FlyteAdmin then incorrectly takes the "different structure" code path instead of returning ALREADY_EXISTS.
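
To illustrate why ordering matters for the digest, here is a minimal sketch. It assumes a hypothetical `digestOf` helper that hashes a list of edge strings in the order given; this is a stand-in for hashing a serialized closure, not FlyteAdmin's actual digest function.

```go
package main

import (
	"crypto/sha256"
	"encoding/hex"
	"fmt"
	"strings"
)

// digestOf hashes the edges in exactly the order they are given.
// Hypothetical helper for illustration only.
func digestOf(edges []string) string {
	sum := sha256.Sum256([]byte(strings.Join(edges, "\n")))
	return hex.EncodeToString(sum[:])
}

func main() {
	a := []string{"n0->n1", "n0->n2"}
	b := []string{"n0->n2", "n0->n1"} // same edges, different order

	// The two slices describe the same graph, but the digests differ
	// because the hashed bytes differ.
	fmt.Println(digestOf(a) == digestOf(b))
}
```

Any order-sensitive serialization behaves this way, so an unordered map walk feeding the compiled output is enough to change the digest between runs.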

Bug 2: Oversized error message (errors)
The "different structure" code path computes a jsondiff of two large compiled workflow closures and includes the full diff in the gRPC error description. For workflows with many dependencies (e.g. 400+ JARs), this diff exceeds gRPC's default 4MB MaxSendMsgSize. gRPC-Go silently rejects the response at the transport layer, sending RST_STREAM INTERNAL_ERROR to the client with no server-side log.

The client sees "INTERNAL: RST_STREAM closed stream. HTTP/2 error code: INTERNAL_ERROR" instead of a useful error.

What changes were proposed in this pull request?

Commit 1: Truncate diff in error messages (flyteadmin/pkg/errors/errors.go)

  • Cap error messages at 3MB (leaving room for gRPC framing) in all three *ExistsDifferentStructureError functions
  • When truncated, append ... [diff truncated — exceeded gRPC max message size]
  • Safety net that prevents RST_STREAM regardless of diff size
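
The truncation described above could look roughly like the following sketch. The function name `truncateMessage`, its parameters, and the short suffix used in `main` are assumptions for illustration; the actual change in errors.go may be structured differently. The sketch also backs up to a UTF-8 rune boundary so the cut does not split a multi-byte character, which is a choice made here and not something the PR description states.

```go
package main

import (
	"fmt"
	"unicode/utf8"
)

// truncateMessage returns msg unchanged if it fits in maxBytes. Otherwise it
// cuts msg so that the result, including suffix, is at most maxBytes long
// (provided suffix itself fits in maxBytes).
// Hypothetical helper, not the function added by the PR.
func truncateMessage(msg string, maxBytes int, suffix string) string {
	if len(msg) <= maxBytes {
		return msg
	}
	cut := maxBytes - len(suffix)
	if cut < 0 {
		cut = 0
	}
	// Back up to the start of a rune so the output stays valid UTF-8.
	for cut > 0 && !utf8.RuneStart(msg[cut]) {
		cut--
	}
	return msg[:cut] + suffix
}

func main() {
	// Small numbers for readability; the PR describes a 3MB cap.
	out := truncateMessage("xxxxxxxxxxxxxxxxxxxx", 16, "... [truncated]")
	fmt.Println(len(out), out)
}
```

Reserving space for the suffix inside the cap keeps the final message within the limit even after the marker is appended.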

Commit 2: Sort node IDs for deterministic compilation (flytepropeller/pkg/compiler/workflow_compiler.go)

  • Sort wf.Nodes map keys before iterating in ValidateWorkflow
  • Ensures identical workflows produce identical CompiledWorkflowClosure and therefore identical digests
  • Root cause fix: re-registration of identical workflows correctly returns ALREADY_EXISTS
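
The sorted-iteration pattern can be sketched as follows. The map shape and the helper names `sortedKeys` and `collectEdges` are hypothetical and stand in for the compiler's real node and edge types.

```go
package main

import (
	"fmt"
	"sort"
)

// sortedKeys returns the map's keys in ascending order.
func sortedKeys(m map[string][]string) []string {
	keys := make([]string, 0, len(m))
	for k := range m {
		keys = append(keys, k)
	}
	sort.Strings(keys)
	return keys
}

// collectEdges walks the nodes in sorted key order, so the resulting edge
// list is identical on every run for the same input map.
func collectEdges(downstream map[string][]string) []string {
	var edges []string
	for _, from := range sortedKeys(downstream) {
		for _, to := range downstream[from] {
			edges = append(edges, from+"->"+to)
		}
	}
	return edges
}

func main() {
	nodes := map[string][]string{
		"n2": {"end"},
		"n0": {"n1", "n2"},
		"n1": {"end"},
	}
	fmt.Println(collectEdges(nodes)) // [n0->n1 n0->n2 n1->end n2->end]
}
```

Ranging over `downstream` directly would yield the same set of edges but in an order that can change between runs.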

How did you test it?

  • All existing tests pass: go test ./pkg/errors/ and go test ./pkg/compiler/...
  • Reproduced the original issue on a production FlyteAdmin instance by calling CreateWorkflow with a modified template (same version) — confirmed RST_STREAM INTERNAL_ERROR
  • Same call with identical template returns ALREADY_EXISTS (correct behavior)
  • CreateTask always returns ALREADY_EXISTS (task digests are already deterministic — no map iteration in task compilation)

When re-registering a workflow/task/launch plan with a different
structure, FlyteAdmin computes a jsondiff of the old and new specs
and includes it in the gRPC error message. For large workflows
(e.g. with hundreds of JAR dependencies), this diff can exceed
gRPC's default 4MB MaxSendMsgSize, causing the transport to reject
the response with RST_STREAM INTERNAL_ERROR — silently, with no
server-side log.

Cap error messages at 3MB (leaving room for gRPC framing) across
all three *ExistsDifferentStructureError functions: task, workflow,
and launch plan.

This is a safety net. The root cause of spurious digest mismatches
for identical workflows is non-deterministic map iteration in
CompileWorkflow (wf.Nodes), which should be fixed separately.

Signed-off-by: Andres Gomez Ferrer <andresg@spotify.com>

The ValidateWorkflow function iterates wf.Nodes (a Go map) in two
places without sorting keys. Since Go map iteration order is
randomized, the same workflow can produce different compiled outputs
across compilations. This causes FlyteAdmin's digest comparison to
fail for structurally identical workflows, triggering the
"different structure" code path with an oversized jsondiff error.

Sort node IDs before iterating to ensure deterministic edge
ordering in the CompiledWorkflowClosure. This makes identical
workflows produce identical digests, so re-registration correctly
returns ALREADY_EXISTS.

Signed-off-by: Andres Gomez Ferrer <andresg@spotify.com>
codecov bot commented Apr 15, 2026

Codecov Report

❌ Patch coverage is 88.88889% with 2 lines in your changes missing coverage. Please review.
✅ Project coverage is 56.96%. Comparing base (c7419a8) to head (1b37203).

| Files with missing lines | Patch % | Lines |
|---|---|---|
| flyteadmin/pkg/errors/errors.go | 75.00% | 2 Missing ⚠️ |

Additional details and impacted files
@@            Coverage Diff             @@
##           master    #7211      +/-   ##
==========================================
+ Coverage   56.95%   56.96%   +0.01%     
==========================================
  Files         931      931              
  Lines       58234    58246      +12     
==========================================
+ Hits        33166    33179      +13     
+ Misses      22017    22016       -1     
  Partials     3051     3051              
| Flag | Coverage Δ |
|---|---|
| unittests-datacatalog | 53.51% <ø> (ø) |
| unittests-flyteadmin | 53.14% <75.00%> (+<0.01%) ⬆️ |
| unittests-flytecopilot | 43.06% <ø> (ø) |
| unittests-flytectl | 64.14% <ø> (+0.04%) ⬆️ |
| unittests-flyteidl | 75.71% <ø> (ø) |
| unittests-flyteplugins | 60.17% <ø> (ø) |
| unittests-flytepropeller | 53.73% <100.00%> (+0.02%) ⬆️ |
| unittests-flytestdlib | 62.58% <ø> (ø) |

Development

Successfully merging this pull request may close these issues.

[BUG] Workflow re-registration fails with RST_STREAM due to non-deterministic digest and oversized error