-
Notifications
You must be signed in to change notification settings - Fork 3.2k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
fix: Fixed memoization is unchecked after mutex synchronization. Fixes #11219 #11456
Conversation
workflow/controller/operator.go
Outdated
// If memoization is on, check if node output exists in cache | ||
if processedTmpl.Memoize != nil { | ||
// Apply memoize only when node is nil or node is using synchronization (mutex) that acquiring the lock. | ||
// This additional condition will make the cache correctly works after synchronization.\ |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm thinking maybe modify the comment to something like:
// Check memoization cache if the node is about to be created, or was created in the past but is only now allowed to run due to acquiring a lock
What do you think?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
That sounds reasonable. As both the nested conditions actually means that, I'll take your opinion and set that comment on that if statement.
workflow/controller/operator.go
Outdated
@@ -1912,6 +1868,7 @@ func (woc *wfOperationCtx) executeTemplate(ctx context.Context, nodeName string, | |||
// unexpected behavior and is a bug. | |||
panic("bug: GetLockName should not return an error after a call to TryAcquire") | |||
} | |||
woc.log.Infof("Not Acquire lockname: %s", lockName) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Suggestion: "Could not acquire lock named: %s".
workflow/controller/operator.go
Outdated
@@ -1922,10 +1879,70 @@ func (woc *wfOperationCtx) executeTemplate(ctx context.Context, nodeName string, | |||
return nil, err | |||
} | |||
} | |||
// Set this value to check that this node is using synchronization, and acquire the lock. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Suggestion: "and has acquired the lock"
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks!
workflow/controller/operator.go
Outdated
func (woc *wfOperationCtx) updateAsCacheNode(node *wfv1.NodeStatus, memStat *wfv1.MemoizationStatus) *wfv1.NodeStatus { | ||
node.MemoizationStatus = memStat | ||
|
||
woc.wf.Status.Nodes[node.ID] = *node |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This should use woc.wf.Status.Nodes.Set
instead. It was introduced super recently in #11451
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
More safety. thanks.
workflow/controller/operator.go
Outdated
node.FinishedAt = metav1.Time{Time: time.Now().UTC()} | ||
node.MemoizationStatus = memStat | ||
|
||
woc.wf.Status.Nodes[node.ID] = *node |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Same as above, use the Set() function, or ideally do what Julie suggested.
job2CacheHit = node.MemoizationStatus.Hit | ||
} | ||
} | ||
assert.False(t, job1CacheHit) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
nit: Why not just assert in the case above (line 1134), it feels more readble.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Nice catch. It seems really weird... why did I do that? :)
…argoproj#11219 Signed-off-by: shmruin <meme_hm@naver.com>
@Joibel once you re-review and approve as well, I'll merge this |
thanks for the contribution! |
…argoproj#11219 (argoproj#11456) Signed-off-by: shmruin <meme_hm@naver.com>
…argoproj#11219 (argoproj#11456) Signed-off-by: shmruin <meme_hm@naver.com>
…argoproj#11219 (argoproj#11456) Signed-off-by: shmruin <meme_hm@naver.com>
Fixes #11219
Motivation
As it is described in issue #11219 , currently when using Mutex synchronization with Memoization cache, it seems working not correctly.
When running like the example in verification, two jobs related to same mutex and memozation works like this.
This means mutex synchronization is OK but memoization cache doesn't have an effect.
However, the correct way I think how it works to be is,
I think in this way, mutex will be more usable with memoization.
Modifications
I just change the order of check & update/initialize logic in
executeTemplate
inoperator.go
Currently it works like below.
This is why memoization cache never set to 'NO' to all nodes when mutex is applied.
Two (or can be more) nodes sharing same mutex key will pass the memoization check stage, which is of course not hit, and waiting for key acquired at the next synchronization checking stage.
When acquiring or waiting for the key, the are already initialized with memoization as not hit.
I twisted this behaviour to check synchronization first and then memoization. When synchronization is applied, and when it acquires the key, it goes down to the memoization check logic if the template own it.
Verification
I added a test in
operator_concurrency_test.go
or you can check it with the case below.Mutex synchronization should work and memoization of job-1 is NO and job-2 is YES, which immediately make job-2 as succeed.