-
Notifications
You must be signed in to change notification settings - Fork 86
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Roll forward schduler_server refactor #5918
Conversation
enterprise/server/scheduling/scheduler_server/scheduler_server.go
Outdated
Show resolved
Hide resolved
enterprise/server/scheduling/scheduler_server/scheduler_server.go
Outdated
Show resolved
Hide resolved
cancel() | ||
if err != nil { | ||
log.CtxWarningf(ctx, "EnqueueTaskReservation via scheduler target %q failed: %s", node.schedulerHostPort, err) | ||
time.Sleep(schedulerEnqueueTaskReservationFailureSleep) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Should we move this sleep outside the func and sleep whenever it returns false? (I think it would protect us against a hot CPU loop if something is borked.)
Alternatively we could return (bool, error)
and abort the scheduling loop if a "fatal" error is returned. (non-fatal errors would return (false, nil)
)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
That seems like a reasonable goal. I went ahead and did the (bool, error)
approach because that let me preserve the existing behavior as-is, always sleeping on false would change the behavior by introducing a new sleep in a few cases which I'd like to avoid in this PR. LMK what you think.
b3412d1
to
8d873da
Compare
8d873da
to
2641018
Compare
Whoops, forgot to re-request review, but Brandon if you could take another look some time (not urgent), I'd appreciate it. Not planning to merge this in this week's release as I'd rather give it a little time in dev. |
return false // no sleep | ||
} | ||
_, err := node.handle.EnqueueTaskReservation(ctx, request) | ||
if err != nil { | ||
log.CtxInfof(ctx, "Failed to enqueue task on connected executor: %s", err) | ||
} | ||
return err == nil // no sleep |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I found these "// no sleep" comments confusing because these return values don't actually control whether we sleep or not - the "nil" error value returned on line 1792 means that we don't sleep. I think it would be clearer if we had this func return (bool, error)
then always return nil
for the error to avoid sleeping (for now)?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
sg, done
…nected executor while the command is running, which can tickle the scheduling bug from the previous attempt at this change.
This reverts commit 7fe6d2a.
Co-authored-by: Brandon Duffany <brandon@buildbuddy.io>
Co-authored-by: Brandon Duffany <brandon@buildbuddy.io>
5bee5e2
to
0b1904e
Compare
This is a revert of 5639 with a fix for the bug mentioned in there. I put a bit of effort into writing a test for the bug and was able to tickle it with a simple test and throwing a
time.Sleep()
into the production code, because it relies on a specific sequence of events that are difficult to trigger organically. That's obviously not submittable, but I've included it in the git history of this branch so you can take a look or play around with it if you'd like.To fix the bug, I got rid of the node ring and replaced it with a slice from which nodes are removed after enqueue attempts are made, and refreshed that slice of nodes each time it's empty.
Here's the diff versus the previous iteration (useful for reviewing scheduler_server): 4331112...4d10242
Related issues: N/A