Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Roll forward schduler_server refactor #5918

Merged
merged 14 commits into from
Mar 14, 2024
Merged

Roll forward schduler_server refactor #5918

merged 14 commits into from
Mar 14, 2024

Conversation

iain-macdonald
Copy link
Contributor

This is a revert of 5639 with a fix for the bug mentioned in there. I put a bit of effort into writing a test for the bug and was able to tickle it with a simple test and throwing a time.Sleep() into the production code, because it relies on a specific sequence of events that are difficult to trigger organically. That's obviously not submittable, but I've included it in the git history of this branch so you can take a look or play around with it if you'd like.

To fix the bug, I got rid of the node ring and replaced it with a slice from which nodes are removed after enqueue attempts are made, and refreshed that slice of nodes each time it's empty.

Here's the diff versus the previous iteration (useful for reviewing scheduler_server): 4331112...4d10242

Related issues: N/A

cancel()
if err != nil {
log.CtxWarningf(ctx, "EnqueueTaskReservation via scheduler target %q failed: %s", node.schedulerHostPort, err)
time.Sleep(schedulerEnqueueTaskReservationFailureSleep)
Copy link
Member

@bduffany bduffany Feb 16, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Should we move this sleep outside the func and sleep whenever it returns false? (I think it would protect us against a hot CPU loop if something is borked.)

Alternatively we could return (bool, error) and abort the scheduling loop if a "fatal" error is returned. (non-fatal errors would return (false, nil))

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

That seems like a reasonable goal. I went ahead and did the (bool, error) approach because that let me preserve the existing behavior as-is, always sleeping on false would change the behavior by introducing a new sleep in a few cases which I'd like to avoid in this PR. LMK what you think.

@iain-macdonald
Copy link
Contributor Author

Whoops, forgot to re-request review, but Brandon if you could take another look some time (not urgent), I'd appreciate it. Not planning to merge this in this week's release as I'd rather give it a little time in dev.

Comment on lines 1801 to 1807
return false // no sleep
}
_, err := node.handle.EnqueueTaskReservation(ctx, request)
if err != nil {
log.CtxInfof(ctx, "Failed to enqueue task on connected executor: %s", err)
}
return err == nil // no sleep
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I found these "// no sleep" comments confusing because these return values don't actually control whether we sleep or not - the "nil" error value returned on line 1792 means that we don't sleep. I think it would be clearer if we had this func return (bool, error) then always return nil for the error to avoid sleeping (for now)?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

sg, done

@iain-macdonald iain-macdonald merged commit 0bcfd5d into master Mar 14, 2024
19 checks passed
@iain-macdonald iain-macdonald deleted the abcdefghi branch March 14, 2024 21:34
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

2 participants