fix: commands (TensorBoards, notebooks, etc.) should not be preempted [DET-4157] #1346
@mackrorysd I have left one comment. Also, can you run the E2E GPU tests for this PR before merging it?
I have one last comment. Otherwise, it looks very good!
🚢 ship it!
Description
Commands (TensorBoards, notebooks, etc.) are interactive, so preempting them disrupts end users more visibly than preempting other tasks.
Implementing this requires changes to two aspects of the scheduler. First, the scheduler simply does not send the ReleaseResources message to such tasks. (Closely related: commands also ignore this message unless something changes CanTerminate to true, and nothing does right now.) Second, the scheduler takes non-preemptible tasks into account when allocating slots, so that a group cannot lose a slot it needs for them and the system cannot oversubscribe itself.
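For reviewers who want a concrete picture, here is a minimal sketch of the two changes. It is not the code in this patch: `Task`, `tasksToRelease`, `pinnedSlots`, and `clampAllocation` are hypothetical names invented for illustration, and only `CanTerminate` and the ReleaseResources message come from the description above.

```go
package main

import "fmt"

// Task is a hypothetical, simplified stand-in for a scheduler task.
type Task struct {
	Name         string
	Slots        int
	CanTerminate bool // false means the task must not be preempted
}

// tasksToRelease picks the tasks that would be sent a ReleaseResources
// message in order to free at least `needed` slots. Tasks with
// CanTerminate == false are never selected (aspect 1).
func tasksToRelease(running []Task, needed int) []Task {
	var out []Task
	freed := 0
	for _, t := range running {
		if freed >= needed {
			break
		}
		if !t.CanTerminate {
			continue // non-preemptible: skip it
		}
		out = append(out, t)
		freed += t.Slots
	}
	return out
}

// pinnedSlots counts the slots held by non-preemptible tasks in a group.
func pinnedSlots(running []Task) int {
	total := 0
	for _, t := range running {
		if !t.CanTerminate {
			total += t.Slots
		}
	}
	return total
}

// clampAllocation keeps a group's allocation from dropping below the
// slots its non-preemptible tasks already hold (aspect 2).
func clampAllocation(proposed int, running []Task) int {
	if pinned := pinnedSlots(running); proposed < pinned {
		return pinned
	}
	return proposed
}

func main() {
	running := []Task{
		{Name: "notebook", Slots: 1, CanTerminate: false},
		{Name: "trial-a", Slots: 2, CanTerminate: true},
		{Name: "trial-b", Slots: 1, CanTerminate: true},
	}
	release := tasksToRelease(running, 2)
	fmt.Println(len(release), release[0].Name) // 1 trial-a
	fmt.Println(clampAllocation(0, running))   // 1
}
```

The clamp is what keeps the system from oversubscribing: if a group's allocation could shrink below its pinned slots, the scheduler would hand those slots to another group while the notebook still occupies them.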
Test Plan
I've added unit tests that reproduce the original problem, including subtleties discovered during implementation. Some existing unit tests needed CanTerminate added to their task definitions, since it defaults to false, which triggers the new behavior.
Commentary
This patch prevents these commands from being preempted using the existing CanTerminate flag. Because of Go's zero-value initialization, this means all tasks are now non-preemptible by default. This feels backwards - I would actually prefer to replace this with a field called NonPreemptible (depending on how reviewers feel about it) and invert the logic.
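To illustrate the zero-value concern, here is a small sketch comparing the two field polarities. Both struct types and the `Preemptible` methods are hypothetical and exist only for this comparison; `NonPreemptible` is the field name proposed above, not something in the current code.

```go
package main

import "fmt"

// TaskWithCanTerminate mirrors the current flag polarity (hypothetical type).
type TaskWithCanTerminate struct {
	Name         string
	CanTerminate bool
}

// TaskWithNonPreemptible mirrors the proposed inverted flag (hypothetical type).
type TaskWithNonPreemptible struct {
	Name           string
	NonPreemptible bool
}

// Preemptible reports whether the scheduler may preempt the task.
func (t TaskWithCanTerminate) Preemptible() bool { return t.CanTerminate }

// Preemptible reports whether the scheduler may preempt the task.
func (t TaskWithNonPreemptible) Preemptible() bool { return !t.NonPreemptible }

func main() {
	// Neither literal sets the flag, so both fields take Go's zero value (false).
	a := TaskWithCanTerminate{Name: "trial"}
	b := TaskWithNonPreemptible{Name: "trial"}
	fmt.Println(a.Preemptible(), b.Preemptible()) // false true
}
```

With the inverted field, a task definition that omits the flag stays preemptible, so only commands would need to opt out explicitly.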
I also made garbage collection non-preemptible - I'm not sure if that's desirable or not.
Checklist
docs/release-notes/. See Release Note for details.