Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

fix: Fix inconsistent persitence of cron workflow objects #4639

Closed
wants to merge 3 commits into from

Conversation

simster7
Copy link
Member

@simster7 simster7 commented Dec 3, 2020

Maybe fixes: #4558

What I think the issue is

There are two places in the code where we read a cron workflow from the informer:

  1. When building a context when processing a cron workflow CRD. The cron workflow is then stored and used by Run at the next scheduled runtime:
    https://github.com/argoproj/argo/blob/1212df4d19dd18045fd0aded7fd1dc5726f7d5c5/workflow/cron/controller.go#L124
  2. When running syncAll:
    https://github.com/argoproj/argo/blob/1212df4d19dd18045fd0aded7fd1dc5726f7d5c5/workflow/cron/controller.go#L216

I believe that there is a chacne a cron workflow will be scheduled at the same time syncAll is called. When this happens syncAll will pick up a cron workflow from the informer that is identical to the one about to be use to schedule a workflow in Run. Run processes the cron workflow, schedules a new workflow, updates LastScheduledTime to the current time, and patches the cron workflow in the cluster. At the same time, syncAll runs and since it never modifies LastScheduledTime, it patches the cron workflow in the cluster to the previous (and now incorrect) LastScheduledTime. As soon as this happens, the cron workflow controller picks up the most recent cron workflow (patched by SyncAll) and process it. Since it sees that the (incorrect) LastScheduledTime implies an execution was missed, it schedules a new workflow.

The main issue here is that we use patch to update the cron workflow (as changed in #4294). Since we use patch, there is no code that checks if we are trying to update an object with stale and outdated information.

Why I think this solution may work

Using update instead of patch guarantees that we won't update the cron workflow in the cluster with information that is now stale. Moreover, this approach allows us to run reapplyUpdate to attempt to reconcile the differences if the scenario above does indeed happen.

Moreover, I believe the reservation from #4294 does not apply:

The ResourceVersion checks made the assumption that the corn operation context is used only once. It is wrong, it is used many times - each time Run is invoked.

I believe this is not true. It is true that Run would use the same origCronWf every time Run is called (i.e. every time the cron workflow is scheduled). However, since we create an entirely new context every time the cron workflow object in the cluster is modified (which includes new executions), we can guarantee that origCronWf will always be the most recent version that is present in the cluster.

The only exception to this is when the cron workflow in the cluster was modified while Run was being executed. In which case there is even a stronger argument to using update instead of patch, in order to recognize this has happened.

Why I believe #4294 introduced this bug

From #4294:

This change instead uses Patch to update the cron workflow status. I believe that, unlike for a workflow, this is safe because all the status fields can be patched safely.

For the reasons above, I don't believe this is true.

Would using deterministic workflow names fix this issue?

Maybe in most cases. However, there is an edge case: After the patch is made with stale information and the controller tries to scheduled the "missed" workflow, it will fail because a workflow with that (deterministic) name will already exist. However, it is possible that the original workflow was completed and delete from the cluster in that time, in which case the "missed" workflow will still be scheduled.

Although I believe that using deterministic names is a hacky solution to this problem, I now do believe that we should use them as a whole regardless. To that end I have opened: #4638

What is holding me back from fully endorsing this solution

I would like to know why #4294 was done originally. I seem to recall a problem with cron workflows and mismatched resourceVersions. If that is the case, I think we should explore fixing that issue with a different approach. I would like @alexec to comment on this

…goproj#4294)"

This reverts commit e54bf81.

Signed-off-by: Simon Behar <simbeh7@gmail.com>
Signed-off-by: Simon Behar <simbeh7@gmail.com>
@simster7 simster7 changed the title Revert cron fix: Fix inconsistent persitence of cron workflow objects Dec 3, 2020
@alexec
Copy link
Contributor

alexec commented Dec 3, 2020

The ResourceVersion checks made the assumption that the corn operation context is used only once. It is wrong, it is used many times - each time Run is invoked.

This is correct. I assumed that because it appears very similar in concept to the woc, it was the same in this critical area.

I'd note that when I changed to patch - I was attempting to fix other bugs.

I'm not sure I agreed with your comment that deterministic names are "hacky" - this solution is lifted directly from the core CronJob code base. You're correct that this solution assumes that all cron jobs must run at least 1m, which will often be false. So it is not the correct solution. They may have chosen to do this so that the names had this additional information in them.

You are correct that use of "patch" removes an important check. That check is resource version, not last scheduled time. I think we should do a check on that instead. Is that possible?

E.g.

  • In-memory KeyLock the cron job.
  • Get the latest version.
  • Check the last schedule time - abort if the time we're planning to schedule.
  • Create workflow
  • Update cron job - no need for re-apply - should never happen
  • Unlock the KeyLock

I don't think we based the functional behaviour on the resource version check mechanism - that's simply not what it is designed for.

@sarabala1979
Copy link
Member

sarabala1979 commented Dec 4, 2020

@simster7 I would suggest that before the cronworkflow gets scheduled, it's better to check whether it has already scheduled or not. This will avoid scheduling same cronworkflow multiple times in any scenario or usecase.

jobCtx, _ :=cc.cron.Load(key.(string))

	if jobCtx == nil {
		err = cc.cron.AddJob(key.(string), cronSchedule, cronWorkflowOperationCtx)
		if err != nil {
			logCtx.WithError(err).Error("could not schedule CronWorkflow")
			return true
		}
	}

Signed-off-by: Simon Behar <simbeh7@gmail.com>
@simster7
Copy link
Member Author

simster7 commented Dec 4, 2020

I would suggest that before the cronworkflow gets scheduled, it's better to check whether it has already scheduled or not. This will avoid scheduling same cronworkflow multiple times in any scenario or usecase.

This approach does not work, the contexts are ephemeral and the fact that a context exists doesn't mean a job has been scheduled and vice versa

@sarabala1979
Copy link
Member

sarabala1979 commented Dec 4, 2020

I would suggest that before the cronworkflow gets scheduled, it's better to check whether it has already scheduled or not. This will avoid scheduling same cronworkflow multiple times in any scenario or usecase.

This approach does not work, the contexts are ephemeral and the fact that a context exists doesn't mean a job has been scheduled and vice versa

FYI...
The load function is looking at Entries from cron and if entry is nil or not a context. it will return an error. As long as Key is unique for every reconciliation. We can check the key has been scheduled already or not

func (f *cronFacade) Load(key string) (*cronWfOperationCtx, error) {
	f.mu.Lock()
	defer f.mu.Unlock()
	entryID, ok := f.entryIDs[key]
	if !ok {
		return nil, fmt.Errorf("entry ID for %s not found", key)
	}
	entry := f.cron.Entry(entryID).Job
	cwoc, ok := entry.(*cronWfOperationCtx)
	if !ok {
		return nil, fmt.Errorf("job entry ID for %s was not a *cronWfOperationCtx, was %v", key, reflect.TypeOf(entry))
	}
	return cwoc, nil
}

@simster7
Copy link
Member Author

simster7 commented Dec 7, 2020

Closing in favor of #4659

@simster7 simster7 closed this Dec 7, 2020
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Cronworkflow still trigger twice after upgraded to v2.12.0-rc1
3 participants