Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[YUNIKORN-2] gang scheduling shim side implementation #219

Merged
merged 17 commits into from
Jan 20, 2021
Merged

Conversation

yangwwei
Copy link
Contributor

This is a PR opened for extra review before merging into the master branch.

yangwwei and others added 14 commits October 27, 2020 13:54
This PR updates the app CRD types to include more info about gangs, such as
node-selector and tolerations.
…ng scheduling (#201)

This PR adds the docker files and scripts to build docker images for the simulated job,
such a job requires to get a configurable number of instances before it is able to execute.
These images will be used in e2e testing and performance evaluation scenarios.
* [YUNIKORN-453] add taskGroup info to apps/tasks

This PR addresses the following tasks:
  - Add taskGroups to apps/tasks
  - Add Placeholder and PlaceholderManager to manage placeholders
  - Add Reserving state in apps to handle taskGroups
  - Related UTs
Co-authored-by: Weiwei Yang <wyang@cloudera.com>
@yangwwei
Copy link
Contributor Author

hi @kingamarton please help to review. Since all the patches have been previously reviewed before committing to the YUNIKORN-2 branch. I think we can do a light review for the merge. Thanks.

@yangwwei yangwwei added this to the 0.10.0 milestone Jan 14, 2021
@codecov
Copy link

codecov bot commented Jan 14, 2021

Codecov Report

Merging #219 (12ecf11) into master (eb888df) will increase coverage by 1.94%.
The diff coverage is 76.00%.

Impacted file tree graph

@@            Coverage Diff             @@
##           master     #219      +/-   ##
==========================================
+ Coverage   57.78%   59.72%   +1.94%     
==========================================
  Files          32       35       +3     
  Lines        2859     3136     +277     
==========================================
+ Hits         1652     1873     +221     
- Misses       1138     1181      +43     
- Partials       69       82      +13     
Impacted Files Coverage Δ
pkg/cache/application_events.go 53.06% <0.00%> (-7.41%) ⬇️
pkg/common/events/states.go 0.00% <0.00%> (ø)
pkg/common/utils/utils.go 25.00% <ø> (ø)
pkg/cache/application.go 76.74% <59.25%> (-2.31%) ⬇️
pkg/cache/placeholder_manager.go 77.94% <77.94%> (ø)
pkg/common/utils/gang_utils.go 81.53% <81.53%> (ø)
pkg/cache/task.go 74.40% <82.22%> (+2.77%) ⬆️
pkg/appmgmt/general/general.go 59.47% <83.33%> (+2.03%) ⬆️
pkg/common/si_helper.go 63.15% <90.00%> (+1.36%) ⬆️
pkg/cache/context.go 41.08% <100.00%> (+0.13%) ⬆️
... and 5 more

Continue to review full report at Codecov.

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update eb888df...12ecf11. Read the comment docs.

deployments/examples/gang/cmd/gangDeploy.sh Show resolved Hide resolved
deployments/examples/gang/cmd/gangDeploy.sh Show resolved Hide resolved
deployments/examples/gang/cmd/gangDeploy.sh Show resolved Hide resolved
pkg/cache/application.go Show resolved Hide resolved
pkg/cache/application.go Show resolved Hide resolved
pkg/cache/placeholder_manager.go Show resolved Hide resolved
pkg/cache/placeholder_manager.go Show resolved Hide resolved
pkg/cache/placeholder_manager.go Show resolved Hide resolved
pkg/cache/placeholder_manager.go Show resolved Hide resolved
pkg/common/utils/gang_utils.go Show resolved Hide resolved
@kingamarton
Copy link
Contributor

There is a test failure in the Travis, can you please check if it can be something related to the changes, or not?

pkg/shim/scheduler.go Outdated Show resolved Hide resolved
pkg/cache/placeholder_manager.go Outdated Show resolved Hide resolved
pkg/cache/placeholder_manager.go Show resolved Hide resolved
pkg/common/utils/gang_utils.go Show resolved Hide resolved
pkg/common/utils/gang_utils.go Outdated Show resolved Hide resolved
pkg/common/utils/gang_utils.go Show resolved Hide resolved
@yangwwei
Copy link
Contributor Author

There is a test failure in the Travis, can you please check if it can be something related to the changes, or not?

Hmm, yes, that was the basicSchedulingTest, it seems to work fine in the last run.
it might be an intermittent error, let's see what happens in the next run.

Copy link
Contributor

@kingamarton kingamarton left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Overall, not it looks good. I just added two minor comments.

Comment on lines +93 to +95
{Name: string(events.UpdateReservation),
Src: []string{states.Reserving},
Dst: states.Reserving},
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

For me it a bit strange that we have this event here in the state machine, but there is no state transition associated with it.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

hi @kingamarton pls take a look at the SM definition here:

fsm.Callbacks{
  ...
  string(events.UpdateReservation):      app.onReservationStateChange,
  ...
},

this is called every time when a task is bound to a node, we check if the app gets enough placeholders; when it gets enough, in onReservationStateChange() it will trigger a state transition from Reserving to Running.

func (app *Application) onReservationStateChange(event *fsm.Event) {
         ...
	if desireCounts.Equals(actualCounts) {
		ev := NewRunApplicationEvent(app.applicationID)
		dispatcher.Dispatch(ev)
	}
}

pkg/cache/placeholder_manager.go Outdated Show resolved Hide resolved
@yangwwei
Copy link
Contributor Author

hi @kingamarton I don't think the e2e test was failed because of this PR. There seems to be some intermittent issue happening on the latest codebase. I've observed similar failures at #220 (purely e2e test on latest codebase) and #221. I suggest getting this one merged and work on fixing the intermittent issue in a separate JIRA, as that may take some time. Does that make sense? Thanks

@yangwwei yangwwei merged commit 12ecf11 into master Jan 20, 2021
@yangwwei yangwwei deleted the YUNIKORN-2 branch January 20, 2021 08:06
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
3 participants