fix: Fix inconsistent persitence of cron workflow objects #4639

simster7 · 2020-12-03T14:01:52Z

Maybe fixes: #4558

What I think the issue is

There are two places in the code where we read a cron workflow from the informer:

When building a context when processing a cron workflow CRD. The cron workflow is then stored and used by Run at the next scheduled runtime:
https://github.com/argoproj/argo/blob/1212df4d19dd18045fd0aded7fd1dc5726f7d5c5/workflow/cron/controller.go#L124
When running syncAll:
https://github.com/argoproj/argo/blob/1212df4d19dd18045fd0aded7fd1dc5726f7d5c5/workflow/cron/controller.go#L216

I believe that there is a chacne a cron workflow will be scheduled at the same time syncAll is called. When this happens syncAll will pick up a cron workflow from the informer that is identical to the one about to be use to schedule a workflow in Run. Run processes the cron workflow, schedules a new workflow, updates LastScheduledTime to the current time, and patches the cron workflow in the cluster. At the same time, syncAll runs and since it never modifies LastScheduledTime, it patches the cron workflow in the cluster to the previous (and now incorrect) LastScheduledTime. As soon as this happens, the cron workflow controller picks up the most recent cron workflow (patched by SyncAll) and process it. Since it sees that the (incorrect) LastScheduledTime implies an execution was missed, it schedules a new workflow.

The main issue here is that we use patch to update the cron workflow (as changed in #4294). Since we use patch, there is no code that checks if we are trying to update an object with stale and outdated information.

Why I think this solution may work

Using update instead of patch guarantees that we won't update the cron workflow in the cluster with information that is now stale. Moreover, this approach allows us to run reapplyUpdate to attempt to reconcile the differences if the scenario above does indeed happen.

Moreover, I believe the reservation from #4294 does not apply:

The ResourceVersion checks made the assumption that the corn operation context is used only once. It is wrong, it is used many times - each time Run is invoked.

I believe this is not true. It is true that Run would use the same origCronWf every time Run is called (i.e. every time the cron workflow is scheduled). However, since we create an entirely new context every time the cron workflow object in the cluster is modified (which includes new executions), we can guarantee that origCronWf will always be the most recent version that is present in the cluster.

The only exception to this is when the cron workflow in the cluster was modified while Run was being executed. In which case there is even a stronger argument to using update instead of patch, in order to recognize this has happened.

Why I believe #4294 introduced this bug

From #4294:

This change instead uses Patch to update the cron workflow status. I believe that, unlike for a workflow, this is safe because all the status fields can be patched safely.

For the reasons above, I don't believe this is true.

Would using deterministic workflow names fix this issue?

Maybe in most cases. However, there is an edge case: After the patch is made with stale information and the controller tries to scheduled the "missed" workflow, it will fail because a workflow with that (deterministic) name will already exist. However, it is possible that the original workflow was completed and delete from the cluster in that time, in which case the "missed" workflow will still be scheduled.

Although I believe that using deterministic names is a hacky solution to this problem, I now do believe that we should use them as a whole regardless. To that end I have opened: #4638

What is holding me back from fully endorsing this solution

I would like to know why #4294 was done originally. I seem to recall a problem with cron workflows and mismatched resourceVersions. If that is the case, I think we should explore fixing that issue with a different approach. I would like @alexec to comment on this

…goproj#4294)" This reverts commit e54bf81. Signed-off-by: Simon Behar <simbeh7@gmail.com>

Signed-off-by: Simon Behar <simbeh7@gmail.com>

alexec · 2020-12-03T16:24:42Z

The ResourceVersion checks made the assumption that the corn operation context is used only once. It is wrong, it is used many times - each time Run is invoked.

This is correct. I assumed that because it appears very similar in concept to the woc, it was the same in this critical area.

I'd note that when I changed to patch - I was attempting to fix other bugs.

I'm not sure I agreed with your comment that deterministic names are "hacky" - this solution is lifted directly from the core CronJob code base. You're correct that this solution assumes that all cron jobs must run at least 1m, which will often be false. So it is not the correct solution. They may have chosen to do this so that the names had this additional information in them.

You are correct that use of "patch" removes an important check. That check is resource version, not last scheduled time. I think we should do a check on that instead. Is that possible?

E.g.

In-memory KeyLock the cron job.
Get the latest version.
Check the last schedule time - abort if the time we're planning to schedule.
Create workflow
Update cron job - no need for re-apply - should never happen
Unlock the KeyLock

I don't think we based the functional behaviour on the resource version check mechanism - that's simply not what it is designed for.

sarabala1979 · 2020-12-04T04:11:39Z

@simster7 I would suggest that before the cronworkflow gets scheduled, it's better to check whether it has already scheduled or not. This will avoid scheduling same cronworkflow multiple times in any scenario or usecase.

jobCtx, _ :=cc.cron.Load(key.(string))

	if jobCtx == nil {
		err = cc.cron.AddJob(key.(string), cronSchedule, cronWorkflowOperationCtx)
		if err != nil {
			logCtx.WithError(err).Error("could not schedule CronWorkflow")
			return true
		}
	}

Signed-off-by: Simon Behar <simbeh7@gmail.com>

simster7 · 2020-12-04T13:44:50Z

I would suggest that before the cronworkflow gets scheduled, it's better to check whether it has already scheduled or not. This will avoid scheduling same cronworkflow multiple times in any scenario or usecase.

This approach does not work, the contexts are ephemeral and the fact that a context exists doesn't mean a job has been scheduled and vice versa

sarabala1979 · 2020-12-04T16:24:32Z

I would suggest that before the cronworkflow gets scheduled, it's better to check whether it has already scheduled or not. This will avoid scheduling same cronworkflow multiple times in any scenario or usecase.

This approach does not work, the contexts are ephemeral and the fact that a context exists doesn't mean a job has been scheduled and vice versa

FYI...
The load function is looking at Entries from cron and if entry is nil or not a context. it will return an error. As long as Key is unique for every reconciliation. We can check the key has been scheduled already or not

func (f *cronFacade) Load(key string) (*cronWfOperationCtx, error) {
	f.mu.Lock()
	defer f.mu.Unlock()
	entryID, ok := f.entryIDs[key]
	if !ok {
		return nil, fmt.Errorf("entry ID for %s not found", key)
	}
	entry := f.cron.Entry(entryID).Job
	cwoc, ok := entry.(*cronWfOperationCtx)
	if !ok {
		return nil, fmt.Errorf("job entry ID for %s was not a *cronWfOperationCtx, was %v", key, reflect.TypeOf(entry))
	}
	return cwoc, nil
}

simster7 · 2020-12-07T18:12:29Z

Closing in favor of #4659

simster7 added 2 commits December 3, 2020 07:29

Revert "fix(controller): Patch rather than update cron workflows. (ar…

3d73dfa

…goproj#4294)" This reverts commit e54bf81. Signed-off-by: Simon Behar <simbeh7@gmail.com>

minor

14e8ac0

Signed-off-by: Simon Behar <simbeh7@gmail.com>

simster7 changed the title ~~Revert cron~~ fix: Fix inconsistent persitence of cron workflow objects Dec 3, 2020

fix test

f56e8c7

Signed-off-by: Simon Behar <simbeh7@gmail.com>

simster7 closed this Dec 7, 2020

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

fix: Fix inconsistent persitence of cron workflow objects #4639

fix: Fix inconsistent persitence of cron workflow objects #4639

simster7 commented Dec 3, 2020 •

edited

Loading

alexec commented Dec 3, 2020 •

edited

Loading

sarabala1979 commented Dec 4, 2020 •

edited

Loading

simster7 commented Dec 4, 2020

sarabala1979 commented Dec 4, 2020 •

edited

Loading

simster7 commented Dec 7, 2020

fix: Fix inconsistent persitence of cron workflow objects #4639

fix: Fix inconsistent persitence of cron workflow objects #4639

Conversation

simster7 commented Dec 3, 2020 • edited Loading

alexec commented Dec 3, 2020 • edited Loading

sarabala1979 commented Dec 4, 2020 • edited Loading

simster7 commented Dec 4, 2020

sarabala1979 commented Dec 4, 2020 • edited Loading

simster7 commented Dec 7, 2020

simster7 commented Dec 3, 2020 •

edited

Loading

alexec commented Dec 3, 2020 •

edited

Loading

sarabala1979 commented Dec 4, 2020 •

edited

Loading

sarabala1979 commented Dec 4, 2020 •

edited

Loading