[BUG] Handle auto refresh cache race condition #5406

Merged
merged 4 commits into from
May 24, 2024

Conversation

pvditt
Contributor

@pvditt pvditt commented May 22, 2024

Tracking issue

Potentially closes: #5335

Why are the changes needed?

Propeller v1.12.0 introduced a bug in which child/external workflow status was not propagated back up to the parent workflow.

Not able to repro exactly. The current theory is that there's a race condition in which an item can be in the processing set (which was introduced in the new Flyte release) while not being in the workqueue. Because of the check if item, ok := value.(Item); !ok || (ok && !item.IsTerminal() && !w.processing.Contains(k)) { (in enqueueBatches), such an item would no longer get added to the workqueue and so would never be re-synced.

Why we think this happens:

  • the item (workflow) is still in the LruCache as we keep getting status for it in GetStatus.

  • If the item were not in the cache, it would get re-added to the workqueue. If the item were in the workqueue, it would be included in the syncItem process that's triggered in auto_refresh's sync. Sync grabs batches off the workqueue.

  • enqueueBatches adds items to the workqueue. An item only gets added to the workqueue if it's not in processing among other conditions.

  • gorm logs indicate that admin is not getting GetExecution requests for the child workflow whose status is not updating.

  • the addition of the processing sync.set was the only change that stood out between Flyte 1.11 and 1.12.
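
The suspected failure mode can be illustrated with a small, single-goroutine model. This is only a sketch under stated assumptions: `sketchCache`, `item`, and `dropFromQueue` are hypothetical names, plain maps and a slice stand in for the LRU cache, the processing set, and the workqueue, and the enqueue condition is a simplified form of the one quoted above. It does not reproduce the actual race, only its presumed end state.

```go
package main

import "fmt"

// item is a hypothetical stand-in for a cached entry; only the terminal flag matters here.
type item struct{ terminal bool }

// sketchCache is a simplified model of the auto-refresh cache (not the flytestdlib type).
type sketchCache struct {
	items      map[string]item     // stand-in for the LRU cache
	processing map[string]struct{} // stand-in for the processing set
	queue      []string            // stand-in for the workqueue
}

func newSketchCache() *sketchCache {
	return &sketchCache{
		items:      map[string]item{},
		processing: map[string]struct{}{},
	}
}

// enqueueBatches queues every non-terminal item that is not already marked as
// processing, and returns how many items it queued.
func (c *sketchCache) enqueueBatches() (added int) {
	for id, it := range c.items {
		_, inProcessing := c.processing[id]
		if !it.terminal && !inProcessing {
			c.processing[id] = struct{}{}
			c.queue = append(c.queue, id)
			added++
		}
	}
	return added
}

// dropFromQueue simulates the suspected end state of the race: the item leaves
// the workqueue without its processing mark being cleared.
func (c *sketchCache) dropFromQueue() { c.queue = nil }

func main() {
	c := newSketchCache()
	c.items["child-wf"] = item{}

	fmt.Println(c.enqueueBatches()) // 1: the first pass queues the item
	c.dropFromQueue()
	fmt.Println(c.enqueueBatches()) // 0: still marked as processing, so it is skipped
}
```

Once the item is marked as processing but absent from the queue, no later pass of `enqueueBatches` in this model can queue it again, which matches the observation that admin stops receiving GetExecution requests for the child workflow.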

What changes were proposed in this pull request?

We want to keep the processing optimization to reduce the overhead of adding duplicate items to the workqueue.

We swap out the processing set in favor of a map in which the keys are the same as the set's and the values are timestamps of when each item was added to processing. We then check how long the item has been in processing: if an item has been in processing for 10 sync periods, we "evict" it from processing so that it gets re-added to the workqueue.

How was this patch tested?

  • added a simple unit test for the inProcessing expiration check
  • ran a workflow launching external wf -> ensured that status was propagated to the parent.

Setup process

Screenshots

Check all the applicable boxes

  • I updated the documentation accordingly.
  • All new and existing tests passed.
  • All commits are signed-off.

Related PRs

Docs link

Signed-off-by: Paul Dittamo <pvdittamo@gmail.com>

codecov bot commented May 22, 2024

Codecov Report

Attention: Patch coverage is 92.30769%, with 1 line in your changes missing coverage. Please review.

Project coverage is 61.10%. Comparing base (ba3647f) to head (56870aa).

Files Patch % Lines
flytestdlib/cache/auto_refresh.go 92.30% 1 Missing ⚠️
Additional details and impacted files
@@           Coverage Diff           @@
##           master    #5406   +/-   ##
=======================================
  Coverage   61.10%   61.10%           
=======================================
  Files         793      793           
  Lines       51156    51164    +8     
=======================================
+ Hits        31257    31264    +7     
- Misses      17027    17028    +1     
  Partials     2872     2872           
Flag Coverage Δ
unittests-datacatalog 69.31% <ø> (ø)
unittests-flyteadmin 58.90% <ø> (ø)
unittests-flytecopilot 17.79% <ø> (ø)
unittests-flytectl 68.31% <ø> (ø)
unittests-flyteidl 79.30% <ø> (ø)
unittests-flyteplugins 61.94% <ø> (ø)
unittests-flytepropeller 57.32% <ø> (ø)
unittests-flytestdlib 65.80% <92.30%> (+0.04%) ⬆️


Signed-off-by: Paul Dittamo <pvdittamo@gmail.com>
@pvditt pvditt requested review from hamersaw and pingsutw May 22, 2024 07:13
Member

@pingsutw pingsutw left a comment

LGTM, just one question

flytestdlib/cache/auto_refresh.go (review thread resolved)
Signed-off-by: Paul Dittamo <pvdittamo@gmail.com>
Signed-off-by: Paul Dittamo <pvdittamo@gmail.com>
@pvditt pvditt requested a review from pingsutw May 24, 2024 19:55
@pvditt pvditt enabled auto-merge (squash) May 24, 2024 20:05
@pvditt pvditt merged commit d04cf66 into master May 24, 2024
50 checks passed
@pvditt pvditt deleted the bug/external-workflow-status-propagation branch May 24, 2024 20:19
@andresgomezfrr
Contributor

Is there any plan to create a fix release with this fix?

@pablocasares

Thank you for working on this. Once the next release is available I can test this and I'll report back if the issue is solved.

@pvditt
Contributor Author

pvditt commented May 30, 2024

Is there any plan to create a fix release with this fix?

@andresgomezfrr Yes, we are validating a new release end of this week and barring any issues will get an official release out next week.

@pvditt
Contributor Author

pvditt commented Jun 13, 2024

@andresgomezfrr @pablocasares there's an RC that contains this fix. I'm unsure when a final release containing this change will be made. I'll ping when that happens.

Development

Successfully merging this pull request may close these issues.

[BUG] Subworkflow status is not reported to the parent workflow
4 participants