-
Notifications
You must be signed in to change notification settings - Fork 121
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
[YUNIKORN-2] gang scheduling shim side implementation #219
Conversation
This PR updates the app CRD types to include more info about gangs, such as node-selector and tolerations.
…ng scheduling (#201) This PR adds the docker files and scripts to build docker images for the simulated job, such a job requires to get a configurable number of instances before it is able to execute. These images will be used in e2e testing and performance evaluation scenarios.
* [YUNIKORN-453] add taskGroup info to apps/tasks This PR addresses the following tasks: - Add taskGroups to apps/tasks - Add Placeholder and PlaceholderManager to manage placeholders - Add Reserving state in apps to handle taskGroups - Related UTs
Co-authored-by: Weiwei Yang <wyang@cloudera.com>
…cleans up orphan placeholders (#208)
hi @kingamarton please help to review. Since all the patches have been previously reviewed before committing to the YUNIKORN-2 branch. I think we can do a light review for the merge. Thanks. |
Codecov Report
@@ Coverage Diff @@
## master #219 +/- ##
==========================================
+ Coverage 57.78% 59.72% +1.94%
==========================================
Files 32 35 +3
Lines 2859 3136 +277
==========================================
+ Hits 1652 1873 +221
- Misses 1138 1181 +43
- Partials 69 82 +13
Continue to review full report at Codecov.
|
There is a test failure in the Travis, can you please check if it can be something related to the changes, or not? |
Hmm, yes, that was the basicSchedulingTest, it seems to work fine in the last run. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Overall, not it looks good. I just added two minor comments.
{Name: string(events.UpdateReservation), | ||
Src: []string{states.Reserving}, | ||
Dst: states.Reserving}, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
For me it a bit strange that we have this event here in the state machine, but there is no state transition associated with it.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
hi @kingamarton pls take a look at the SM definition here:
fsm.Callbacks{
...
string(events.UpdateReservation): app.onReservationStateChange,
...
},
this is called every time when a task is bound to a node, we check if the app gets enough placeholders; when it gets enough, in onReservationStateChange()
it will trigger a state transition from Reserving
to Running
.
func (app *Application) onReservationStateChange(event *fsm.Event) {
...
if desireCounts.Equals(actualCounts) {
ev := NewRunApplicationEvent(app.applicationID)
dispatcher.Dispatch(ev)
}
}
hi @kingamarton I don't think the e2e test was failed because of this PR. There seems to be some intermittent issue happening on the latest codebase. I've observed similar failures at #220 (purely e2e test on latest codebase) and #221. I suggest getting this one merged and work on fixing the intermittent issue in a separate JIRA, as that may take some time. Does that make sense? Thanks |
This is a PR opened for extra review before merging into the master branch.