[armada-scheduler] Avoid retrying jobs on the same node #2343
Conversation
This PR adds the functionality to avoid nodes when retrying jobs. It will only avoid a node if:
- The lease was returned
- The runAttempted flag was set on the lease return

This is so that if the lease was returned for reasons that didn't involve the node, we don't avoid the node. The node avoidance is implemented using affinities, by setting a node anti-affinity (a sketch of the shape follows).
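A minimal sketch of what setting such a node anti-affinity looks like with client-go types; the function name matches the one reviewed below, but the term-handling details here are an assumption rather than this PR's exact implementation:

import (
	"errors"

	v1 "k8s.io/api/core/v1"
)

// AddNodeAntiAffinity adds a NotIn requirement for the given node label to
// every required node selector term, so the pod is not scheduled back onto
// nodes carrying labelValue.
func AddNodeAntiAffinity(affinity *v1.Affinity, labelName string, labelValue string) error {
	if affinity == nil {
		return errors.New("provided affinity is nil")
	}
	if affinity.NodeAffinity == nil {
		affinity.NodeAffinity = &v1.NodeAffinity{}
	}
	if affinity.NodeAffinity.RequiredDuringSchedulingIgnoredDuringExecution == nil {
		affinity.NodeAffinity.RequiredDuringSchedulingIgnoredDuringExecution = &v1.NodeSelector{}
	}
	selector := affinity.NodeAffinity.RequiredDuringSchedulingIgnoredDuringExecution
	if len(selector.NodeSelectorTerms) == 0 {
		selector.NodeSelectorTerms = append(selector.NodeSelectorTerms, v1.NodeSelectorTerm{})
	}
	// The requirement must be added to every term, because terms are ORed.
	for i := range selector.NodeSelectorTerms {
		selector.NodeSelectorTerms[i].MatchExpressions = append(
			selector.NodeSelectorTerms[i].MatchExpressions,
			v1.NodeSelectorRequirement{
				Key:      labelName,
				Operator: v1.NodeSelectorOpNotIn,
				Values:   []string{labelValue},
			},
		)
	}
	return nil
}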
Codecov Report
Patch coverage:

@@           Coverage Diff            @@
##           master    #2343    +/-   ##
==========================================
+ Coverage   58.97%   59.23%   +0.25%
==========================================
  Files         225      226       +1
  Lines       28470    28781     +311
==========================================
+ Hits        16790    17048     +258
- Misses      10388    10426      +38
- Partials     1292     1307      +15
func AddNodeAntiAffinity(affinity *v1.Affinity, labelName string, labelValue string) {
	if affinity == nil {
why is it ok to silently ignore a nil here?
internal/scheduler/scheduler.go
Outdated
@@ -273,13 +285,39 @@ func (s *Scheduler) syncState(ctx context.Context) ([]*jobdb.Job, error) {
		jobsToUpdateById[jobId] = job
	}

	// Ensure queued jobs being retried have node anti affinities for any nodes they have already run on
	for _, job := range jobsToUpdateById {
Can this be simplified? We seem to iterate over jobsToUpdateById twice and do much the same thing in each case.
It feels like we could iterate once and apply the anti-affinity in one place.
internal/scheduler/scheduler.go
Outdated
	// Jobs with no active runs are queued.
	for _, job := range jobsToUpdateById {
		// Determine if the job needs to be re-queued.
		// TODO: If we have an explicit queued message, we can remove this code.
		run := job.LatestRun()
		desiredQueueState := false
-		requeueJob := run != nil && run.Returned() && job.NumReturned() <= s.maxLeaseReturns
+		requeueJob := run != nil && run.Returned() && job.NumAttempts() < s.maxAttemptedRuns
		if requeueJob && !job.Queued() && run.Returned() && run.RunAttempted() {
Can we simplify this? To work out if the job is going to fall into this, we've got 6 bools to check!
To tie this in with my previous comment: could we not have a single loop and do something like:
if job.Queued || requeueJob {
// Ensure affinity logic is set
}
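A rough sketch of the single-loop shape being suggested here; addAvoidNodeAffinities and WithQueued are illustrative names, not the PR's actual helpers:

// One pass over the jobs: decide whether the job should be (re)queued, and
// apply the anti-affinity in a single place for every job that ends up queued.
for id, job := range jobsToUpdateById {
	run := job.LatestRun()
	requeueJob := run != nil && run.Returned() && run.RunAttempted() &&
		job.NumAttempts() < s.maxAttemptedRuns
	if requeueJob {
		job = job.WithQueued(true) // hypothetical jobdb setter
	}
	if job.Queued() {
		// Ensure the anti-affinity covers every node the job has already run on.
		if err := s.addAvoidNodeAffinities(job); err != nil { // hypothetical helper
			return nil, err
		}
	}
	jobsToUpdateById[id] = job
}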
internal/scheduler/scheduler.go
Outdated
if lastRun.Returned() {
	errorMessage := fmt.Sprintf("Maximum number of attempts (%d) reached - this job will no longer be retried", s.maxAttemptedRuns)
	if job.NumAttempts() < s.maxAttemptedRuns {
		errorMessage = fmt.Sprintf("Job was attempeted %d times, and has been tried once on all nodes it can run on - this job will no longer be retried", job.NumAttempts())
Suggested change:
-		errorMessage = fmt.Sprintf("Job was attempeted %d times, and has been tried once on all nodes it can run on - this job will no longer be retried", job.NumAttempts())
+		errorMessage = fmt.Sprintf("Job was attempted %d times, and has been tried once on all nodes it can run on - this job will no longer be retried", job.NumAttempts())
if podSchedulingRequirement == nil {
	return nil, false, fmt.Errorf("no pod scheduling requirement found for job %s", job.GetId())
}
isSchedulable, _ := s.submitChecker.CheckPodRequirements(podSchedulingRequirement)
This is a bit dodgy. I think this function is being called from syncState, which means that all replicas of the scheduler are executing this regardless of whether they are leader. Previously all the code called by this function was deterministic, so we could guarantee that all replicas would end up in the same state. This code breaks that assumption, which means we could lose a job. For example:
- Leader executes this code and determines the job is ok to requeue
- Follower executes this code and determines the job should fail (and marks it as such)
- Leader leases the job to the executor
- Follower becomes leader and fails the job
- Job run is now orphaned

TBF this whole bit of code is awkward, because the point of syncState is to make the internal in-memory model look like the database. The re-queuing of jobs doesn't fit nicely into that, because we're not just taking the state from the db, but rather using scheduler-specific business logic to modify the state, and at that point you run the risk that different scheduler instances may make different decisions. I've gone over a couple of different ways of sorting this, and the best I can come up with is:
- Always requeue in syncState, but leave the job run as failed
- Move the requeuing logic into generateUpdateMessages, which can fail the job if it determines it shouldn't be requeued

I think this works, because it means that only the leader will be checking whether something can be requeued.
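A hedged sketch of that split; the method and helper names here are hypothetical and the real Armada signatures differ:

// syncState stays deterministic: every replica requeues unconditionally and
// leaves the last run marked failed, so all replicas converge on one state.
// Only the leader, when generating update messages, decides the job's fate.
func (s *Scheduler) decideRequeueOrFail(job *jobdb.Job) ([]*armadaevents.EventSequence_Event, error) {
	lastRun := job.LatestRun()
	if lastRun == nil || !lastRun.Returned() {
		return nil, nil
	}
	if job.NumAttempts() < s.maxAttemptedRuns {
		return s.requeueEvents(job) // hypothetical: emits a requeue event
	}
	return s.failJobEvents(job) // hypothetical: emits a job-errors event
}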
I agree with this and we can move it around.
Is there a reason we don't send a requeue message in generateUpdateMessages, and remove all the business logic from syncState?
Then only the leader can decide if we requeue or fail the job, and the others just get their state from the db.
I think a requeue message is probably the best solution; it would also make life easier for e.g. lookout, which could then mark the state correctly without having to guess what the scheduler has decided.
The only slightly tricky bit is what to store in the schedulerDb, as a bool won't be enough. I think an int giving the number of times it has been requeued should be enough, but I haven't thought through this enough to be sure.
I do wonder if the requeue message could include the updated scheduling requirements. That would be nice from the PoV of ensuring that all schedulers agree on what nodes are being excluded.
Are you envisioning that we'd conditionally exclude nodes?
Currently it's either exclude all nodes we've tried on or fail, so it shouldn't be ambiguous; however I'm happy to be proven wrong.
Although including the scheduling requirements on the requeue message would work and be pretty flexible.
So you're right: at the moment the logic is unambiguous, so each scheduler can determine which nodes to exclude purely from the runs. My thinking behind suggesting an explicit update to the scheduling requirements is mainly based around the feeling that the fewer decisions non-leader schedulers make, the less chance there is of them getting out of sync. There are a few secondary benefits of this approach too, e.g. something like lookout could now show excluded nodes, but I'm not sure these are compelling enough to make a change.
The other thing I'm thinking of is how we integrate this with the hash-based scheduling optimization, which should go a long way toward mitigating the performance cost of scheduleMany as discussed in #2319. Ideally we don't always compute hashes in syncState, because this would involve computing len(queue) hashes every time the leader fails over. Instead, I was hoping to calculate them in the ingester, and a requeued message with updated schedulingRequirements would be useful for this, I think.
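A sketch of how such a hash could be computed once in the ingester, assuming the scheduling info is a protobuf message; Armada's schedulerobjects types are gogo-generated, so the exact marshaling call may differ:

import (
	"crypto/sha256"

	"google.golang.org/protobuf/proto"
)

// schedulingInfoHash returns a stable fingerprint of a scheduling-info message
// so schedulers can cheaply detect whether the requirements have changed.
// Deterministic marshaling matters: the default proto encoding does not
// guarantee a stable byte order across processes.
func schedulingInfoHash(info proto.Message) ([32]byte, error) {
	b, err := proto.MarshalOptions{Deterministic: true}.Marshal(info)
	if err != nil {
		return [32]byte{}, err
	}
	return sha256.Sum256(b), nil
}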
I think that is all reasonable and won't be a lot more work.
In terms of implementation, would the plan be to just update the scheduling_info column on the job table on each requeue message that changes the scheduling requirements?
- Or would we want to record these updates elsewhere?
- One pain point here is working out when the scheduling_info has changed in a cheap way; maybe this would be a reason to implement the hash in this PR.
Would you be happy if I updated this PR to:
- Add a requeue event
- Move all requeue logic to generateUpdateMessages (i.e. send a requeue event) and simplify syncState

Unsure if I want to also add a scheduling info hash in this PR; it depends on the answer to the above.
> Add a requeue event
> Move all requeue logic to generateUpdateMessages (i.e. send a requeue event) and simplify syncState

That seems good. My suggestion for working out whether the scheduling info has changed would be to add a seq no to the scheduling info (both the table and the in-mem representation). Every time you generate a new scheduling info, increment the seq no and include it in the requeue event. The ingester should only overwrite the scheduling info if the seq num is higher (I think you can do this relatively cheaply with an ON CONFLICT); likewise the in-memory representation should only be overwritten if the seq num is higher.
With regard to hashing, I think a tentative plan would be:
- Have a hash col that lives alongside the scheduling info and is versioned with it
- Scheduler ingester calculates and inserts the initial hash on the submit message
- Requeue message needs to include a hash (the leader scheduler will have to calculate the hash when it requeues). The logic for updating the hash (both for the ingester and for follower schedulers) is identical to updating the scheduling info.

If you can get the hashing in that would be great, but if not it shouldn't be too hard to get it in later.
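A minimal sketch of the ON CONFLICT guard being described, assuming a scheduling_info_version column along the lines of the schema change further down; the real statement would also carry the table's other NOT NULL columns:

// Upsert that only lets scheduling info move forward: on event replay, a
// stale (lower-versioned) update becomes a no-op.
const upsertSchedulingInfo = `
INSERT INTO jobs (job_id, scheduling_info, scheduling_info_version)
VALUES ($1, $2, $3)
ON CONFLICT (job_id) DO UPDATE SET
    scheduling_info         = excluded.scheduling_info,
    scheduling_info_version = excluded.scheduling_info_version
WHERE jobs.scheduling_info_version < excluded.scheduling_info_version`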
Hi @d80tb7, so I've made the changes.
Functionally this works; I know there are some points that need tidying. However, before I finish off, I just wanted your overall opinion on whether I'm butchering things too much.
Also, I've included JobSchedulingInfo in the JobRequeued event; however, for now I've imported it from our internal schedulerobjects proto. We could make a duplicate proto and convert between them? Or move JobSchedulingInfo. It is a bit awkward.
	Errors: []*armadaevents.Error{runError},
requeueJob := lastRun.Returned() && job.NumAttempts() < s.maxAttemptedRuns

if requeueJob && lastRun.RunAttempted() {
When we become leader (either on startup or through failover) we pass all jobs to this function. In this case we will get jobs that have already been requeued falling into this code branch. In that case I think the code will do the right thing, i.e. it'll end up with the jobs queued with the correct anti-affinity set, but it might be worth either writing a test to be sure, or seeing if we can already detect the job is in the correct state and short circuit.
Yes, I think it'll do the correct thing; however I think it'll end up sending the event back through the loop.
We probably could just short circuit by checking if the job is already queued: if it is, skip. If it isn't, that implies it hasn't been processed.
If we can avoid sending a large number of redundant messages, that would be good. Is it enough to check that the job is already queued, or do we also need to check that the scheduling info is as we expect? I think you're right that rechecking the queued state is enough, but I think I'd need to stare at this a bit more to convince myself.
I'm relatively confident it will be sufficient to check it is queued; however, we could also validate the affinity is as we expect.
However, this would only matter if we managed to get into an unexpected state, but maybe that isn't a bad thing to guard against.
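A sketch of the short circuit being discussed, reusing the jobdb accessors quoted elsewhere in this PR; jobsToRequeue is an illustrative name:

for _, job := range jobsToRequeue {
	// A job that is already queued was processed on a previous pass (or by a
	// previous leader), so emitting another requeue event would just send a
	// redundant message back through the loop.
	if job.Queued() {
		continue
	}
	// ... emit the requeue event with the updated scheduling requirements ...
}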
@@ -216,7 +248,14 @@ func (c *InstructionConverter) handleJobRunErrors(jobRunErrors *armadaevents.JobRunErrors
		JobID: jobId,
		Error: bytes,
	}
-	markRunsFailed[runId] = &JobRunFailed{LeaseReturned: runError.GetPodLeaseReturned() != nil}
+	runAttempted := false
should this default to true?
Probably, yes.
It is a bit awkward because the value only matters if the lease was returned, so its value is somewhat irrelevant for other cases.
In practice it probably doesn't matter too much, although I'd argue that in general it's safer to assume the run was attempted if we can't determine whether it was.
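A sketch of the safer default being suggested; the GetPodLeaseReturned getter comes from the diff above, while the RunAttempted names on the lease-returned message and on JobRunFailed are assumptions:

// Default to true: if we cannot determine whether the run was attempted,
// it is safer to assume it was, so the node gets avoided on retry.
runAttempted := true
if leaseReturned := runError.GetPodLeaseReturned(); leaseReturned != nil {
	runAttempted = leaseReturned.GetRunAttempted() // assumed field
}
markRunsFailed[runId] = &JobRunFailed{
	LeaseReturned: runError.GetPodLeaseReturned() != nil,
	RunAttempted:  runAttempted, // assumed field
}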
Yeah, this all looks good. I made a few comments, none of them serious I don't think. Even the comment about the reordering of events in the ingester I don't think will cause any issues. Re the proto, I think for now that's fine. My personal view, which admittedly isn't quite crystallized yet, is something like:
Sadly sqlc doesn't support dynamic queries, so I had to write a dynamic query manually. This means we don't need to do N update statements.
… rest client
No other proto is considered public in that folder currently
@@ -26,6 +30,8 @@ CREATE TABLE jobs (
    submit_message bytea NOT NULL,
    -- JobSchedulingInfo message stored as a proto buffer.
    scheduling_info bytea NOT NULL,
+   -- The current version of the JobSchedulingInfo, used to prevent JobSchedulingInfo state going backwards on even replay
Suggested change:
-   -- The current version of the JobSchedulingInfo, used to prevent JobSchedulingInfo state going backwards on even replay
+   -- The current version of the JobSchedulingInfo, used to prevent JobSchedulingInfo state going backwards even on replay
func AddNodeAntiAffinity(affinity *v1.Affinity, labelName string, labelValue string) error {
	if affinity == nil {
		return fmt.Errorf("failed to add not anti affinity, as provided affinity is nil")
Suggested change:
-		return fmt.Errorf("failed to add not anti affinity, as provided affinity is nil")
+		return errors.Errorf("failed to add node anti affinity, as provided affinity is nil")
newSchedulingInfo.Version = job.JobSchedulingInfo().Version + 1
podSchedulingRequirement := PodRequirementFromJobSchedulingInfo(newSchedulingInfo)
if podSchedulingRequirement == nil {
	return nil, fmt.Errorf("no pod scheduling requirement found for job %s", job.GetId())
Suggested change:
-	return nil, fmt.Errorf("no pod scheduling requirement found for job %s", job.GetId())
+	return nil, errors.Errorf("no pod scheduling requirement found for job %s", job.GetId())
}
podSchedulingRequirement := PodRequirementFromJobSchedulingInfo(schedulingInfoWithNodeAntiAffinity)
if podSchedulingRequirement == nil {
	return nil, false, fmt.Errorf("no pod scheduling requirement found for job %s", job.GetId())
Suggested change:
-	return nil, false, fmt.Errorf("no pod scheduling requirement found for job %s", job.GetId())
+	return nil, false, errors.Errorf("no pod scheduling requirement found for job %s", job.GetId())
}

if requeueJob {
	job = job.WithUpdatedRun(lastRun.WithFailed(false))
Do we need to do this? I previously had to set lastRun.Failed = false because I was doing this in syncState and I needed to prevent the job from being failed later. I think now we are handling the requeuing in the correct place, so we can indeed treat the run as failed.
runError := jobRunErrors[lastRun.Id()]
job = job.WithFailed(true).WithQueued(false)
if lastRun.Returned() {
	errorMessage := fmt.Sprintf("Maximum number of attempts (%d) reached - this job will no longer be retried", s.maxAttemptedRuns)
Suggested change:
-	errorMessage := fmt.Sprintf("Maximum number of attempts (%d) reached - this job will no longer be retried", s.maxAttemptedRuns)
+	errorMessage := errors.Sprintf("Maximum number of attempts (%d) reached - this job will no longer be retried", s.maxAttemptedRuns)
if lastRun.Returned() {
	errorMessage := fmt.Sprintf("Maximum number of attempts (%d) reached - this job will no longer be retried", s.maxAttemptedRuns)
	if job.NumAttempts() < s.maxAttemptedRuns {
		errorMessage = fmt.Sprintf("Job was attempeted %d times, and has been tried once on all nodes it can run on - this job will no longer be retried", job.NumAttempts())
Suggested change:
-		errorMessage = fmt.Sprintf("Job was attempeted %d times, and has been tried once on all nodes it can run on - this job will no longer be retried", job.NumAttempts())
+		errorMessage = errors.Sprintf("Job was attempeted %d times, and has been tried once on all nodes it can run on - this job will no longer be retried", job.NumAttempts())
argMarkers := make([]string, 0, len(o))

currentIndex := 1
for key, value := range o {
This might be simpler as a pgx.Batch query (https://github.com/jackc/pgx/blob/master/batch_test.go#L30). I suspect the performance won't be much worse either.
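A minimal sketch of the pgx.Batch approach, assuming pgx v4 (adjust the import for v5) and an illustrative jobs table with job_id and queued columns:

import (
	"context"

	"github.com/jackc/pgx/v4"
)

// updateJobsQueuedState queues one UPDATE per job and sends them in a single
// round trip, instead of issuing N separate statements.
func updateJobsQueuedState(ctx context.Context, tx pgx.Tx, queuedByJobId map[string]bool) error {
	batch := &pgx.Batch{}
	for jobId, queued := range queuedByJobId {
		batch.Queue("UPDATE jobs SET queued = $1 WHERE job_id = $2", queued, jobId)
	}
	results := tx.SendBatch(ctx, batch)
	defer results.Close()
	// Drain one result per queued statement, surfacing the first error.
	for range queuedByJobId {
		if _, err := results.Exec(); err != nil {
			return err
		}
	}
	return nil
}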
		return errors.WithStack(err)
	}
case UpdateJobQueuedState:
	args := make([]interface{}, 0, len(o)*3)
As above, I think this could be a pgx.Batch query.
Raised some comments, but nothing super important.