Fix non-deterministic workflow digest and oversized error messages#7211
Open
andresgomezfrr wants to merge 2 commits intoflyteorg:masterfrom
Open
Fix non-deterministic workflow digest and oversized error messages#7211andresgomezfrr wants to merge 2 commits intoflyteorg:masterfrom
andresgomezfrr wants to merge 2 commits intoflyteorg:masterfrom
Conversation
When re-registering a workflow/task/launch plan with a different structure, FlyteAdmin computes a jsondiff of the old and new specs and includes it in the gRPC error message. For large workflows (e.g. with hundreds of JAR dependencies), this diff can exceed gRPC's default 4MB MaxSendMsgSize, causing the transport to reject the response with RST_STREAM INTERNAL_ERROR — silently, with no server-side log. Cap error messages at 3MB (leaving room for gRPC framing) across all three *ExistsDifferentStructureError functions: task, workflow, and launch plan. This is a safety net. The root cause of spurious digest mismatches for identical workflows is non-deterministic map iteration in CompileWorkflow (wf.Nodes), which should be fixed separately. Signed-off-by: Andres Gomez Ferrer <andresg@spotify.com>
The ValidateWorkflow function iterates wf.Nodes (a Go map) in two places without sorting keys. Since Go map iteration order is randomized, the same workflow can produce different compiled outputs across compilations. This causes FlyteAdmin's digest comparison to fail for structurally identical workflows, triggering the "different structure" code path with an oversized jsondiff error. Sort node IDs before iterating to ensure deterministic edge ordering in the CompiledWorkflowClosure. This makes identical workflows produce identical digests, so re-registration correctly returns ALREADY_EXISTS. Signed-off-by: Andres Gomez Ferrer <andresg@spotify.com>
2 tasks
Codecov Report❌ Patch coverage is
Additional details and impacted files@@ Coverage Diff @@
## master #7211 +/- ##
==========================================
+ Coverage 56.95% 56.96% +0.01%
==========================================
Files 931 931
Lines 58234 58246 +12
==========================================
+ Hits 33166 33179 +13
+ Misses 22017 22016 -1
Partials 3051 3051
Flags with carried forward coverage won't be shown. Click here to find out more. ☔ View full report in Codecov by Sentry. 🚀 New features to boost your workflow:
|
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
Tracking issue
Closes #7212
Related to #4780
Why are the changes needed?
When re-registering a workflow with the same version, FlyteAdmin computes a digest of the compiled workflow to check if the structure is identical. Two bugs cause this to fail for large workflows:
Bug 1: Non-deterministic digest (compiler)
ValidateWorkflowinworkflow_compiler.goiterateswf.Nodes(a Go map) in two places without sorting keys. Since Go map iteration is randomized, the same workflow produces differentCompiledWorkflowClosureoutputs across compilations, leading to different digests. FlyteAdmin then incorrectly takes the "different structure" code path instead of returningALREADY_EXISTS.Bug 2: Oversized error message (errors)
The "different structure" code path computes a
jsondiffof two large compiled workflow closures and includes the full diff in the gRPC error description. For workflows with many dependencies (e.g. 400+ JARs), this diff exceeds gRPC's default 4MBMaxSendMsgSize. gRPC-Go silently rejects the response at the transport layer, sendingRST_STREAM INTERNAL_ERRORto the client with no server-side log.The client sees
INTERNAL: RST_STREAM closed stream. HTTP/2 error code: INTERNAL_ERRORinstead of any useful error.What changes were proposed in this pull request?
Commit 1: Truncate diff in error messages (
flyteadmin/pkg/errors/errors.go)*ExistsDifferentStructureErrorfunctions... [diff truncated — exceeded gRPC max message size]Commit 2: Sort node IDs for deterministic compilation (
flytepropeller/pkg/compiler/workflow_compiler.go)wf.Nodesmap keys before iterating inValidateWorkflowCompiledWorkflowClosureand therefore identical digestsALREADY_EXISTSHow did you test it?
go test ./pkg/errors/andgo test ./pkg/compiler/...CreateWorkflowwith a modified template (same version) — confirmedRST_STREAM INTERNAL_ERRORALREADY_EXISTS(correct behavior)CreateTaskalways returnsALREADY_EXISTS(task digests are already deterministic — no map iteration in task compilation)