-
Notifications
You must be signed in to change notification settings - Fork 40
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Add EASY/HYBRID/CONSERVATIVE policies #504
Conversation
It is likely a test case will fail on one of the travis config. So I will work on that before our review. It would be best if this PR is reviewed last anyway. |
Codecov Report
@@ Coverage Diff @@
## master #504 +/- ##
==========================================
+ Coverage 75.61% 75.95% +0.33%
==========================================
Files 60 63 +3
Lines 6210 6442 +232
==========================================
+ Hits 4696 4893 +197
- Misses 1514 1549 +35
Continue to review full report at Codecov.
|
Rebased to the master after PR #502 and forced a push. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
{ | ||
job_lifecycle_t state = job_lifecycle_t::INIT; | ||
|
||
if (id < 0 || at < 0) { | ||
if (id < 0 || now < 0 || at < 0) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
might also be useful to check for and error out when at < now
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Sure. Thanks.
@SteVwonder: just a heads-up. I will get to this today unless something else pops up. |
Add MATCH_ALLOCATE_W_SATISFIABILITY. This operation attempts to allocate first. And if succeeds, it returns the matching info as before. If fails, however, it sets the scheduled time to a point as late as possible within the time box of the planner and try the match. Then, if it can't still find matching resources at that scheduled point, the jobspec is deemed unsatisfiable. This satisfiability query must not change the visible state of the graph data store. To support this, overload the update method that only cleans up the state changes created by the satisfiability checking walk. dfu_impl_t::schedule() (and therefore also run()) returns this information with -1 return code with errno=ENODEV. This is different than EBUSY, which is set when dfu_impl_t::schedule() (and therefore also run()) cannot find the matching resources at a given scheduled point (e.g., now). Cleaned up how various errnos are propagated in dfu_impl.cpp to simplify how errno=ENODEV and errno=EBUSY are determined at dfu_impl_t::schedule(). Update in-line documentation with error numbers for the public dfu_traverser_t::run() method.
Decompose queue_policy_fcfs_t<reapi_type>::run_sched_loop() into two orthogonal methods: cancel_completed_jobs() and allocate_jobs().
6cdeff4
to
b267c45
Compare
@SteVwonder: I rebased this PR to the current master and addressed all of your (excellent) review items. If CI turns green, this can use your second review. The only changes you will need to review are the ones marked with |
OK. Travis CI just turned green! |
LGTM! Squash away and I'll push the button. |
Problem: Upcoming queuing policies (EASY, HYBRID and CONSERVATIVE) will have to reserve and cancel jobs many times as part of their backfill algorithms. If a cancelled job remains in the resource matching service module, a subsequent allocate_orelse_reserve on the same job will immediately fail with ENOENT. Completely remove jobs from the resource module when they are successfully cancelled (via cancel callback). Adjust our cancel tests in t/t4003-cancel-info.t with this new cancel semantics. Also adjust how performance statstics are computed via the stat callback. We now explicitly track the number of jobs that have been matched instead of inferring it by looking up the number of jobs that is in the resource module.
Problem: We use time 0 as base time for allocate or allocate_orelse_reserve. While it was handy to test one schedule loop within our resource infrastructure, this doesn't allow for correct implementations for plan-based scheduling algorithms including EAST and CONSERVATIVE. For these algorithms, we need to move forward the base time as the wallclock time moves forward. Take the current time via gettimeofday() upon receiving each match-allocate request within resource. If the matched time is equal to the current time, we deem this case to be "allocated." If the matched time is greater than the current time, we deem this to be "reserved," instead.
Use the ISO 8601 format to display the scheduled times reported by the flux-resource command. Now that we use gettimeofday() as base time for scheduling, this command can no longer display "Now". ISO 8601 provides a more human readable time representation than the epoch time reported by gettimeofday().
Add support to pass parameters pertaining to the queuing policy into the base class of queue policy layer. Each parameter is a key-value pair specified as key=value. Multiple parameters can be passed in threeo different ways: 1) call the set_params method multiple times with a single key-value pair; 2) call the set_params method with multiple comma-delimited key-value key-value pairs (e.g., reservation-depth=3,other-policy-param1=true); ; or 3) something in between. Introduce apply_params() method into the the base class of queue policy layer. This method must be called to effect the parameters that have been passed so far. Note that this is a virtual method which can be overriden by derived queue policy classes. This way dervied classes can customize how to enforce their parameter setting. If not overriden, the default apply_params() is a no-op.
Add support for queue-policy=<policy> where only fcfs policy is added. Add support for policy-params as well. The key-value pair(s) passed via queue-params will control general queuing behaviors (such as queue-depth=k in the future).
Add an implementation of EASY backfilling queuing policy. EASY backfilling: Only the 1st waiting job is considered for allocation, with a guaranteed starting time. When this 1st job cannot start right away because its requested amount of resources is not available, the algorithm browses the list of waiting jobs to find candidates for backfilling. These candidates are jobs that can start immediately, but without delaying the 1st job of the list. Introduce queue_policy_bf_base.hpp and queue_policy_bf_base_impl.hpp that implement core algorithms for a majority of backfillling strategies. Then, queue_policy_easy.hpp and queue_policy_easy_impl.hpp implement an EASY backfilling algorithm by deriving from this base class while specifializing it with reservation-depth=1.
CONSERVATIVE backfilling: A less aggressive alternative to EASY with similar performance. It determines an allocation for each job when it enters the system. Then a job can be a candidate to backfillingif and only if it can begin its execution immediately without delaying any of the other pre-allocated jobs. HYBRID backfilling: An algorithm lies in between EASY and CONSERVATIVE backfilling strategies. It allows for limiting the number of waiting jobs that must have a starting time guarantee to be K where K is configurable. K must be greater than 1 and less than the system max. Implement HYBRID in queue_policy_hybrid.hpp and queue_policy_bybrid_impl.hpp and CONSERVATIVE in queue_policy_conservative.hpp and queue_policy_conservative_impl.hpp. Derived from the backfill base class while specifializing it with reservation-depth=system-max or reservation-depth=K where K is configurable.
When queue-params=queue-depth=<K> is specified as part of qmanager module load commandline, run_sched_loop() in each different queuing policy classes will not look beyond Kth job in their pending job queue. Improve scheduling performance in exchange for potentially less effective backfilling. Set reservation depth to queue depth when bigger.
Problem: Max values are set relatively low compared to the practices used by other schedulers. Max queue depth is now one million and max reservation depth is now 100000.
Generate a bunch of differently sized and timed jobs using flux-jobspec srun. Test whether those jobs are scheduled in the order defined by different backfilling strategies: EASY, HYBRID and CONSERVATIVE.
Add hwloc-whitelist=node,core,gpu into resource's rc1 script so that our graph scheduler can build and operate on a significantly trimmed graph for scheduling (thus fast).
With the default hwloc-whitelist support within resource's rc1, pu will not even be built into the graph populated out of hwloc. So prune-filter=ALL:pu won't do much in helping reducing tree walks for matching. Change this to prune-filters=ALL:core as core will be present in most cases and by default it will be at the fringes of the tree graph built with hwloc.
Add t1004-qmanager-optimize.t to populate test cases pertaining to queue optimization. Add cases for queue-depth optimization for backfilling policies: EASY, HYBRID and CONSERVATIVE.
t/t1003-qmanager-policy.t misses a double amphesand at the end of one intermediate statements.
Fix a new/free mismatch detected by Valgrind.
Problem: We currently set the queue depth and reservation depth parameters to the default values when the parameter values given by users are greater than the MAX values. This can be non-intuitive. Set them to the max values instead as this behavior would be less confusing to the users.
b267c45
to
a93d023
Compare
@SteVwonder: I squashed most of them but decide to keep two commits on their own. I thought it would be reasonable to leave a trail about these changes. If you are okay with these change, this PR is ready to go in once Travis turns green. |
OK. It is green now. |
LGTM! Thanks @dongahn! |
Copy flux-core's t5000-valgrind.t script, since a few things have changed. Adjust for flux-sched: - use SHARNESS_BUILD_DIRECTORY, not FLUX_BUILD_DIR - reference config.h at top level of build tree - drop unused BROKER variable Fixes flux-framework#504
Copy flux-core's t5000-valgrind.t script, since a few things have changed. Adjust for flux-sched: - use SHARNESS_BUILD_DIRECTORY, not FLUX_BUILD_DIR - reference config.h at top level of build tree - drop unused BROKER variable Fixes flux-framework#504
This PR adds various backfill queuing polices and queue-depth support for optimization:
Add EASY/HYBRID/CONSERVATIVE policies
Introduce queue_policy_bf_base.hpp and queue_policy_bf_base_impl.hpp that implement core algorithms for a majority of backfillling strategies. Derive from this class to implement EASY/HYBRID/CONSERVATIVE policies.
Add queue-depth support for backfill policies
This PR builds on PR #502 and PR #503 so please don't merge it before those PRs land. Resolve #476, #482, and #498.