fix: Fix inconsistent persistence of cron workflow objects #4639
Conversation
…goproj#4294)" This reverts commit e54bf81. Signed-off-by: Simon Behar <simbeh7@gmail.com>
This is correct. I assumed that because it appears very similar in concept to the woc, it was the same in this critical area. I'd note that when I changed to patch, I was attempting to fix other bugs.

I'm not sure I agreed with your comment that deterministic names are "hacky" - this solution is lifted directly from the core CronJob code base. You're correct that this solution assumes that all cron jobs must run at least 1m apart, which will often be false, so it is not the correct solution on its own. They may have chosen to do this so that the names carry this additional information.

You are correct that use of "patch" removes an important check. That check is the resource version, not the last scheduled time. I think we should do a check on that instead. Is that possible? E.g.:
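A minimal sketch of what such a resource-version check could look like (hypothetical wiring; cronWfClient and its Update signature are assumptions, and only the Conflict-on-stale-resourceVersion behaviour is standard Kubernetes):

// Hypothetical sketch, not code from this PR. An update carries the
// resourceVersion the object was read with, and the API server rejects the
// write with a Conflict error if that version is stale -- exactly the check
// that a blind patch skips.
// import apierrors "k8s.io/apimachinery/pkg/api/errors"
_, err := cronWfClient.Update(cronWf)
if apierrors.IsConflict(err) {
	// Our copy is stale: something else wrote the object since we read it.
	// Re-fetch and reconcile instead of overwriting LastScheduledTime.
}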
I don't think we based the functional behaviour on the resource-version check mechanism - that's simply not what it is designed for.
@simster7 I would suggest that before the CronWorkflow gets scheduled, it's better to check whether it has already been scheduled or not. This will avoid scheduling the same CronWorkflow multiple times in any scenario or use case.

jobCtx, _ := cc.cron.Load(key.(string))
if jobCtx == nil {
	err = cc.cron.AddJob(key.(string), cronSchedule, cronWorkflowOperationCtx)
	if err != nil {
		logCtx.WithError(err).Error("could not schedule CronWorkflow")
		return true
	}
}
This approach does not work: the contexts are ephemeral, and the fact that a context exists doesn't mean a job has been scheduled, and vice versa.
FYI:

// Load returns the cronWfOperationCtx registered for key, or an error if no
// entry exists or the stored job has an unexpected type.
func (f *cronFacade) Load(key string) (*cronWfOperationCtx, error) {
	f.mu.Lock()
	defer f.mu.Unlock()
	entryID, ok := f.entryIDs[key]
	if !ok {
		return nil, fmt.Errorf("entry ID for %s not found", key)
	}
	entry := f.cron.Entry(entryID).Job
	cwoc, ok := entry.(*cronWfOperationCtx)
	if !ok {
		return nil, fmt.Errorf("job entry ID for %s was not a *cronWfOperationCtx, was %v", key, reflect.TypeOf(entry))
	}
	return cwoc, nil
}
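To make the error contract concrete, a hypothetical caller (not from the code base) would look like the following -- and note that even a successful Load only proves an entry exists in the facade at this instant, per the point above:

if _, err := cc.cron.Load(key); err != nil {
	// Either no entry is registered for key, or the stored job had an
	// unexpected type; neither outcome proves whether a workflow was
	// actually scheduled.
}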
Closing in favor of #4659
Maybe fixes: #4558
What I think the issue is
There are two places in the code where we read a cron workflow from the informer:
1. Run, at the next scheduled runtime: https://github.com/argoproj/argo/blob/1212df4d19dd18045fd0aded7fd1dc5726f7d5c5/workflow/cron/controller.go#L124
2. syncAll: https://github.com/argoproj/argo/blob/1212df4d19dd18045fd0aded7fd1dc5726f7d5c5/workflow/cron/controller.go#L216
I believe that there is a chance a cron workflow will be scheduled at the same time syncAll is called. When this happens, syncAll will pick up a cron workflow from the informer that is identical to the one about to be used to schedule a workflow in Run. Run processes the cron workflow, schedules a new workflow, updates LastScheduledTime to the current time, and patches the cron workflow in the cluster. At the same time, syncAll runs and, since it never modifies LastScheduledTime, it patches the cron workflow in the cluster back to the previous (and now incorrect) LastScheduledTime. As soon as this happens, the cron workflow controller picks up the most recent cron workflow (patched by syncAll) and processes it. Since it sees that the (incorrect) LastScheduledTime implies an execution was missed, it schedules a new workflow.

The main issue here is that we use patch to update the cron workflow (as changed in #4294). Since we use patch, there is no code that checks whether we are trying to update an object with stale and outdated information.
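To make the failure mode concrete, here is a hypothetical sketch of the kind of merge patch involved (the field name, client shape, and variables are assumptions, not the controller's actual code). A patch body carries no resourceVersion, so the API server applies it even when the caller's copy of the object is stale:

// Hypothetical sketch: the write syncAll effectively performs. If Run has
// already advanced lastScheduledTime, this silently rewinds it, because a
// merge patch never says "apply only if my copy is still current".
// import ("encoding/json"; "k8s.io/apimachinery/pkg/types")
body, _ := json.Marshal(map[string]interface{}{
	"status": map[string]interface{}{
		"lastScheduledTime": staleTime, // the old value from syncAll's copy
	},
})
_, err := cronWfClient.Patch(name, types.MergePatchType, body)
// err is nil: nothing signalled that we just clobbered fresher data.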
Why I think this solution may work
Using update instead of patch guarantees that we won't update the cron workflow in the cluster with information that is now stale. Moreover, this approach allows us to run reapplyUpdate to attempt to reconcile the differences if the scenario above does indeed happen.

Additionally, I believe the reservation from #4294 does not apply:
I believe this is not true. It is true that Run would use the same origCronWf every time Run is called (i.e. every time the cron workflow is scheduled). However, since we create an entirely new context every time the cron workflow object in the cluster is modified (which includes new executions), we can guarantee that origCronWf will always be the most recent version present in the cluster.

The only exception to this is when the cron workflow in the cluster was modified while Run was being executed, in which case there is an even stronger argument for using update instead of patch, in order to recognize that this has happened.
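As an illustration of the update-then-reconcile pattern argued for here, a sketch using client-go's RetryOnConflict helper (the typed-client calls are assumptions, and this is not the PR's actual reapplyUpdate implementation):

// Hypothetical sketch: advance LastScheduledTime against the freshest copy.
// A Conflict from Update means something wrote concurrently; RetryOnConflict
// re-runs the closure, which re-reads before re-applying the change.
// import "k8s.io/client-go/util/retry"
err := retry.RetryOnConflict(retry.DefaultRetry, func() error {
	fresh, err := cronWfClient.Get(name, metav1.GetOptions{})
	if err != nil {
		return err
	}
	fresh.Status.LastScheduledTime = &now
	_, err = cronWfClient.Update(fresh)
	return err // a Conflict here triggers another attempt with fresh data
})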
Why I believe #4294 introduced this bug
From #4294:
For the reasons above, I don't believe this is true.
Would using deterministic workflow names fix this issue?
Maybe in most cases. However, there is an edge case: after the patch is made with stale information and the controller tries to schedule the "missed" workflow, it will fail because a workflow with that (deterministic) name will already exist. However, it is possible that the original workflow was completed and deleted from the cluster in that time, in which case the "missed" workflow will still be scheduled.

Although I believe that using deterministic names is a hacky solution to this particular problem, I now believe that we should adopt them regardless. To that end I have opened: #4638
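For reference, a sketch of the CronJob-style deterministic naming discussed above, mirroring the upstream controller's minute-granularity time hash (illustrative only, not the code proposed in #4638):

// Two attempts to schedule the same tick produce the same name and collide
// on creation instead of double-running -- which is also why the scheme
// assumes no schedule fires more often than once per minute.
func deterministicName(cronWfName string, scheduledTime time.Time) string {
	return fmt.Sprintf("%s-%d", cronWfName, scheduledTime.Unix()/60)
}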
What is holding me back from fully endorsing this solution
I would like to know why #4294 was done originally. I seem to recall a problem with cron workflows and mismatched resourceVersions. If that is the case, I think we should explore fixing that issue with a different approach. I would like @alexec to comment on this.