Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Support for variation-aware scheduler #517

Merged
merged 12 commits into from Oct 4, 2019
Merged

Conversation

@tpatki
Copy link
Member

tpatki commented Sep 20, 2019

Initial design for #514.

Needs extensive error checks and additional tests, but posting for early review and feedback. A simple example has been tested and verified.

Note: shall fix indentation, comments, etc in the final PR.

@tpatki tpatki requested review from dongahn and SteVwonder Sep 20, 2019
@tpatki tpatki force-pushed the tpatki:var_aware branch 3 times, most recently from 0570a14 to 046f831 Sep 20, 2019
@codecov-io

This comment has been minimized.

Copy link

codecov-io commented Sep 20, 2019

Codecov Report

Merging #517 into master will increase coverage by 0.14%.
The diff coverage is 86.2%.

Impacted file tree graph

@@            Coverage Diff             @@
##           master     #517      +/-   ##
==========================================
+ Coverage   76.23%   76.38%   +0.14%     
==========================================
  Files          58       59       +1     
  Lines        6165     6221      +56     
==========================================
+ Hits         4700     4752      +52     
- Misses       1465     1469       +4
Impacted Files Coverage Δ
resource/policies/dfu_match_policy_factory.cpp 100% <100%> (+28.57%) ⬆️
resource/utilities/resource-query.cpp 50.27% <100%> (+0.56%) ⬆️
resource/policies/dfu_match_policy_factory.hpp 100% <100%> (ø) ⬆️
resource/modules/resource_match.cpp 75.72% <100%> (+0.05%) ⬆️
resource/policies/dfu_match_var_aware.cpp 84% <84%> (ø)
resource/libjobspec/jobspec.cpp 85.6% <0%> (-1.14%) ⬇️
resource/policies/dfu_match_locality.cpp 4.54% <0%> (+4.54%) ⬆️

Continue to review full report at Codecov.

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update 64de044...8014d6f. Read the comment docs.

@dongahn

This comment has been minimized.

Copy link
Contributor

dongahn commented Sep 20, 2019

Thanks @tpatki!

I took a brief look at this looking at where your changes are inserted and and structures etc. This seems like a great start.

The only initial feedback is: at least for user visible names like the option argument for --match-policy of resource-query, I prefer variation instead of var_aware.

@dongahn

This comment has been minimized.

Copy link
Contributor

dongahn commented Sep 20, 2019

Also, now that I merged your set-/get-property support you can safely rebase.

@tpatki

This comment has been minimized.

Copy link
Member Author

tpatki commented Sep 20, 2019

Thanks @dongahn for the early feedback. I need to add more tests for (1) missing property values, (2) erroneous inputs where stoi fails, etc. I will change the name to variation instead of var_aware.

I also have the example from quartz that we had tested earlier, it has variation data on 2417 nodes with 5 performance classes and ~200 jobs from quartz queue. I also have a helper script that plots graphs and compares the lowest ID first, highest ID first and variation-aware algorithms for figure of merit.

Do you want me to add this quartz test as part of our production code or should I just commit it to a new local branch so we can work on analysis later?

@dongahn

This comment has been minimized.

Copy link
Contributor

dongahn commented Sep 20, 2019

I also have the example from quartz that we had tested earlier, it has variation data on 2417 nodes with 5 performance classes and ~200 jobs from quartz queue. I also have a helper script that plots graphs and compares the lowest ID first, highest ID first and variation-aware algorithms for figure of merit.

Cool.

Do you want me to add this quartz test as part of our production code or should I just commit it to a new local branch so we can work on analysis later?

I think these should stay in a research branch for our analysis later; our repo probably want to stay lean and mean as much as possible :-)

@tpatki

This comment has been minimized.

Copy link
Member Author

tpatki commented Sep 20, 2019

I think these should stay in a research branch for our analysis later; our repo probably want to stay lean and mean as much as possible :-)

@dongahn: Agreed, just wanted to check. I'll create a new research branch in my fork based on the recent updates and maintain these there. Thanks!

@tpatki tpatki force-pushed the tpatki:var_aware branch from 046f831 to 2868cac Sep 21, 2019
@tpatki tpatki marked this pull request as ready for review Sep 21, 2019
@tpatki tpatki force-pushed the tpatki:var_aware branch from 2868cac to d8a61af Sep 21, 2019
@dongahn

This comment has been minimized.

Copy link
Contributor

dongahn commented Sep 24, 2019

@tpatki: Thanks Tapasya. I haven't looked at your logic yet but here are my initial set of change requests which will help me and @SteVwonder to review your PR further. This PR is relatively large but currently only have two commits. If they can be further splitter into topical commits, this would be very helpful. One suggestion is:

The first feature commit can be split as the following:

  1. commit that includes your policy selection bug fix (thank you again)!
  2. commit that includes dfu_match_var_aware.hpp|cpp
  3. commit that integrate these files into the build system: resoure/Makefile.am
  4. commit that integrates this into the factory class: dfu_match_policy_factory.hpp|cpp
  5. commit that integrates this into the resource-query: resource-query.cpp
  6. commit that integrates this into the resource match: resource_match.cpp

For testing commit, I would split it as the following:

  1. Commit small.graphml on its own
  2. Commit with cmds files
  3. Commit with output files
  4. Commit with jobspec files
  5. Commit that introduces t3014-resource-var-aware.t
  6. Commit that introduces t4007-match-var-aware.t
  7. Commit t/Makefile.am to tie those two sharness test scripts into the build system

I believe each commit can benefit from more detailed commit messages. Each commit has a specific topic, so hopefully it wouldn't be too painful to add a description. The most detail will be needed for your commit that contains changes in dfu_match_var_aware.hpp|cpp and perhaps the sharness tests.

@dongahn

This comment has been minimized.

Copy link
Contributor

dongahn commented Sep 24, 2019

@SteVwonder's schedutil change went in. This needs to be debased as well.

@tpatki tpatki force-pushed the tpatki:var_aware branch from d8a61af to aec8673 Sep 26, 2019
@tpatki

This comment has been minimized.

Copy link
Member Author

tpatki commented Sep 26, 2019

Something went wrong. Redoing this.

@tpatki tpatki force-pushed the tpatki:var_aware branch from aec8673 to f2c4728 Sep 26, 2019
@tpatki

This comment has been minimized.

Copy link
Member Author

tpatki commented Sep 26, 2019

@dongahn, @SteVwonder: Split the commits are requested and this is ready for review.
Can't select @milroy from the list, he's not a part of the flux-framework team I think...

@tpatki tpatki requested a review from milroy Sep 26, 2019
@tpatki tpatki changed the title WIP: support for variation-aware scheduler Support for variation-aware scheduler Sep 30, 2019
@dongahn

This comment has been minimized.

Copy link
Contributor

dongahn commented Sep 30, 2019

@tpatki: Thanks. I will review this soon (probably tonight).

resource/modules/resource_match.cpp Show resolved Hide resolved
resource/Makefile.am Show resolved Hide resolved
t/data/resource/jobspecs/var_aware/job_1N.yaml Outdated Show resolved Hide resolved
t/data/resource/jobspecs/var_aware/job_1N.yaml Outdated Show resolved Hide resolved
t/t3014-resource-var-aware.t Outdated Show resolved Hide resolved
t/t4007-match-var-aware.t Show resolved Hide resolved
resource/policies/dfu_match_var_aware.cpp Outdated Show resolved Hide resolved
resource/policies/dfu_match_var_aware.hpp Outdated Show resolved Hide resolved
resource/policies/dfu_match_var_aware.cpp Outdated Show resolved Hide resolved
@dongahn

This comment has been minimized.

Copy link
Contributor

dongahn commented Oct 1, 2019

@tpatki: This PR looks pretty good!

I left my review comments in-line to your code and testing inputs. Most of the comments are trivial, but you should look at the ones about the main algorithm and see if you agree.

In addition, I generally find that your commit messages are a bit terse. I'd suggest you augment some of the commit messages (not on the title, but on the comment message body) with a bit more details. When someone other than us look at the log messages later for bug hunting of some sort, I am sure they will appreciate they could do that only by reading commit messages.

Once these issues are addressed, I believe this PR can be merged.

@dongahn

This comment has been minimized.

Copy link
Contributor

dongahn commented Oct 1, 2019

@tpatki: PR #523 has been merged. So when you update your PR, please make sure you rebase to the upstream master.

@tpatki

This comment has been minimized.

Copy link
Member Author

tpatki commented Oct 1, 2019

@dongahn: sounds good! Will update this PR soon (in the next day).

@tpatki tpatki force-pushed the tpatki:var_aware branch from f2c4728 to 3d690d1 Oct 4, 2019
@tpatki

This comment has been minimized.

Copy link
Member Author

tpatki commented Oct 4, 2019

@dongahn: This is ready for your review/merging, addressed all the changes you requested.

@dongahn

This comment has been minimized.

Copy link
Contributor

dongahn commented Oct 4, 2019

@tapasya: looks great to me. If you want to address the one last nit (a leading white space before stoi and add std namespace), let me know.

@tpatki

This comment has been minimized.

Copy link
Member Author

tpatki commented Oct 4, 2019

@dongahn: fixed the whitespace and std::stoi issue. Thanks for the thorough and detailed review of this! :)

@dongahn

This comment has been minimized.

Copy link
Contributor

dongahn commented Oct 4, 2019

LGTM! Please squash the last commit with the original amd let me know. I will merge.

tpatki added 11 commits Oct 3, 2019
The policy relies on the use of the "perf_class" property for the resource.
This "perf_class" is provided by the user through the set-property interface.
The range for perf_class is 0 through 9999.
A lower value indicates better efficiency of the resource.
As a result, fold::less is used for scoring.
In the case where there is a missing "perf_class", we assume the worst
efficiency, which we designate as "perf_class" of 9999.

style: fix whitespace for stoi in dfu_match_var_aware.cpp.
There are 5 tests for resource-query. These are the test command files.
Tests 1-4 have 4 jobs and test the match allocate.
Test 1 checks for correct inputs and verifies the allocations.
Tests 2,3,4 test for missing, invalid, and out of range inputs for perf_class.
These three tests assume the worst-efficiency for the resources that have
incorrect input, and thus use a perf_class of 9999 for them.

Test 5 has 8 jobs and tests for match allocate-orelse-reserve.
These are the expected outputs for the test commands for resource_query in
t/data/resource/commands/var_aware.
These are the jobspec files used for test commands of resource_query in
t/data/resource/commands/var_aware/. There are 3 jobspec files: 1 node,
2 node and 4 node. Each has a different duration to allow for both match-
allocate and allocate-orelse-reserve tests.
This test case checks for match allocate, match allocate-orelse-reserve,
and missing, invalid and out of range inputs for perf_class.
This test case tests usage with resource-match and tests set-property, job
submission, and allocation with variation-aware policy.
@tpatki tpatki force-pushed the tpatki:var_aware branch from 69a93c0 to 8014d6f Oct 4, 2019
@tpatki

This comment has been minimized.

Copy link
Member Author

tpatki commented Oct 4, 2019

@dongahn: Squashed the last commit in to original variation-aware policy files. Thanks!

@dongahn

This comment has been minimized.

Copy link
Contributor

dongahn commented Oct 4, 2019

Great work. Thank you! Merging.

(Just FYI next time when you squash, you probably don't need to include the description of the squashing commit because the file was newly introduced and from the perspective of the commit history that problem doesn't really exist :-)

@dongahn dongahn merged commit adea512 into flux-framework:master Oct 4, 2019
3 checks passed
3 checks passed
codecov/patch 86.2% of diff hit (target 76.23%)
Details
codecov/project 76.38% (+0.14%) compared to 64de044
Details
continuous-integration/travis-ci/pr The Travis CI build passed
Details
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
3 participants
You can’t perform that action at this time.