Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

fix: Fixed memoization is unchecked after mutex synchronization. Fixes #11219 #11456

Merged
merged 1 commit into from
Aug 7, 2023

Conversation

shmruin
Copy link
Contributor

@shmruin shmruin commented Jul 26, 2023

Fixes #11219

Motivation

As it is described in issue #11219 , currently when using Mutex synchronization with Memoization cache, it seems working not correctly.
When running like the example in verification, two jobs related to same mutex and memozation works like this.

  • job-1 runs first (acquire the key)
  • when job-1 runs end, job-2 runs after that (no matter what memoization is)

This means mutex synchronization is OK but memoization cache doesn't have an effect.
However, the correct way I think how it works to be is,

  • job-1 runs first, and write the memoization cache.
  • when job-1 runs end, job-2 trys to run (acquire the key), but check the memoization.
  • as memoization cache already be written, job-2 just succeed and completed immediately.

I think in this way, mutex will be more usable with memoization.

Modifications

I just change the order of check & update/initialize logic in executeTemplate in operator.go
Currently it works like below.

  1. Check if node is NIL and using memoization, and if so, initialize the nodes with memoization features on them.
  2. Check if synchronization is added, if so, mark lock/unlock to the node. which is related to synchronization managing.

This is why memoization cache never set to 'NO' to all nodes when mutex is applied.
Two (or can be more) nodes sharing same mutex key will pass the memoization check stage, which is of course not hit, and waiting for key acquired at the next synchronization checking stage.
When acquiring or waiting for the key, the are already initialized with memoization as not hit.

I twisted this behaviour to check synchronization first and then memoization. When synchronization is applied, and when it acquires the key, it goes down to the memoization check logic if the template own it.

스크린샷 2023-07-27 오전 3 32 40
스크린샷 2023-07-27 오전 3 33 02

Verification

I added a test in operator_concurrency_test.go or you can check it with the case below.
Mutex synchronization should work and memoization of job-1 is NO and job-2 is YES, which immediately make job-2 as succeed.

apiVersion: argoproj.io/v1alpha1
kind: Workflow
metadata:
  generateName: example-steps-simple
spec:
  entrypoint: main

  templates:
    - name: main
      steps:
        - - name: job-1
            template: sleep
            arguments:
              parameters:
                - name: sleep_duration
                  value: 30
          - name: job-2
            template: sleep
            arguments:
              parameters:
                - name: sleep_duration
                  value: 15

    - name: sleep
      synchronization:
        mutex:
          name: mutex-example-steps-simple
      inputs:
        parameters:
          - name: sleep_duration
      script:
        image: alpine:latest
        command: [/bin/sh]
        source: |
          echo "Sleeping for {{ inputs.parameters.sleep_duration }}"
          sleep {{ inputs.parameters.sleep_duration }}
      memoize:
        key: "memo-key-1"
        cache:
          configMap:
            name: cache-example-steps-simple

// If memoization is on, check if node output exists in cache
if processedTmpl.Memoize != nil {
// Apply memoize only when node is nil or node is using synchronization (mutex) that acquiring the lock.
// This additional condition will make the cache correctly works after synchronization.\
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm thinking maybe modify the comment to something like:
// Check memoization cache if the node is about to be created, or was created in the past but is only now allowed to run due to acquiring a lock
What do you think?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

That sounds reasonable. As both the nested conditions actually means that, I'll take your opinion and set that comment on that if statement.

@@ -1912,6 +1868,7 @@ func (woc *wfOperationCtx) executeTemplate(ctx context.Context, nodeName string,
// unexpected behavior and is a bug.
panic("bug: GetLockName should not return an error after a call to TryAcquire")
}
woc.log.Infof("Not Acquire lockname: %s", lockName)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggestion: "Could not acquire lock named: %s".

@@ -1922,10 +1879,70 @@ func (woc *wfOperationCtx) executeTemplate(ctx context.Context, nodeName string,
return nil, err
}
}
// Set this value to check that this node is using synchronization, and acquire the lock.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggestion: "and has acquired the lock"

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks!

workflow/controller/operator.go Show resolved Hide resolved
func (woc *wfOperationCtx) updateAsCacheNode(node *wfv1.NodeStatus, memStat *wfv1.MemoizationStatus) *wfv1.NodeStatus {
node.MemoizationStatus = memStat

woc.wf.Status.Nodes[node.ID] = *node
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This should use woc.wf.Status.Nodes.Set instead. It was introduced super recently in #11451

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

More safety. thanks.

node.FinishedAt = metav1.Time{Time: time.Now().UTC()}
node.MemoizationStatus = memStat

woc.wf.Status.Nodes[node.ID] = *node
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Same as above, use the Set() function, or ideally do what Julie suggested.

job2CacheHit = node.MemoizationStatus.Hit
}
}
assert.False(t, job1CacheHit)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit: Why not just assert in the case above (line 1134), it feels more readble.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nice catch. It seems really weird... why did I do that? :)

@juliev0
Copy link
Contributor

juliev0 commented Aug 3, 2023

@Joibel once you re-review and approve as well, I'll merge this

@shmruin
Copy link
Contributor Author

shmruin commented Aug 7, 2023

@juliev0 , @Joibel , Could you merge this PR if this is OK?

@juliev0 juliev0 merged commit 143d0f5 into argoproj:master Aug 7, 2023
23 checks passed
@juliev0
Copy link
Contributor

juliev0 commented Aug 7, 2023

thanks for the contribution!

jaen pushed a commit to jaen/argo-workflows that referenced this pull request Aug 12, 2023
jaen pushed a commit to jaen/argo-workflows that referenced this pull request Aug 12, 2023
shmruin added a commit to shmruin/argo-workflows that referenced this pull request Aug 15, 2023
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Mutex check happens after memoization cache check and lock info missing from UI
4 participants