qmanager: defer jobs with unmatchable constraints #1188

Merged
merged 8 commits into flux-framework:master from park-constraint-blocked-jobs on May 6, 2024

Conversation

@trws (Member) commented Apr 27, 2024

This is the in-progress PR for the constraint job blocking problem. Full description below, but we're still lacking two things I really want to have:

  1. A test that will surface this if it comes back. Other than looking at timing, this is very hard to make an integration test for; a unit test would be easier, and I'm trying to come up with something.
  2. A fix for the stats collection logic so that it includes failed matches, both their count and their timing, since that gap is part of what helped this bug hide so well.

problem: Jobs with constraints that can't be matched because nodes are
down or drained are currently considered every time we enter the
scheduling loop. If they reach the head of the queue, which is likely
because we currently only configure one sched queue, they get
re-considered over and over even though they cannot run. This greatly
slows down scheduling and can cause severe blocking, with delays of up
to 20 seconds observed for a single submission. This change also
exposed a bug in the duration calculation for duration=0.0 jobs, which
used to be set to the full duration of the graph rather than the
remaining duration of the graph.

solution: Add a new "m_blocked" member to the qmanager base which holds
jobs whose alloc_orelse_reserve returns EBUSY. This state can only
happen when the job is blocked by a constraint requiring a node in an
unusable state. The jobs in m_blocked are moved back into m_pending
(ready to be considered) by the notify callback. Currently they are
moved regardless of which status change occurred (even a node being
drained causes them to move), but moving them back afterward is a
relatively small cost and simplifies the logic considerably. The
duration for duration=0.0 jobs is now set to the remaining time of the
graph rather than the total time when the job metadata is built.
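
A minimal sketch of the deferral pattern described above, assuming
hypothetical names (job_t, block_job, notify_resource_status_change);
the real types and signatures in queue_policy_base.hpp may differ:

    // Sketch only: the m_blocked deferral pattern, not the actual Fluxion code.
    #include <cstdint>
    #include <map>
    #include <memory>

    struct job_t; // opaque job record (hypothetical)

    class queue_policy_sketch {
    protected:
        std::map<uint64_t, std::shared_ptr<job_t>> m_pending; // ready to consider
        std::map<uint64_t, std::shared_ptr<job_t>> m_blocked; // EBUSY: constraint
                                                              // needs unusable node

        void block_job (uint64_t id)
        {
            // alloc_orelse_reserve returned EBUSY: park the job so the
            // scheduling loop stops reconsidering it on every pass
            auto it = m_pending.find (id);
            if (it != m_pending.end ()) {
                m_blocked.emplace (id, it->second);
                m_pending.erase (it);
            }
        }

    public:
        void notify_resource_status_change ()
        {
            // any node status change makes blocked jobs eligible again;
            // re-blocking them later is cheap and keeps the logic simple
            m_pending.merge (m_blocked);
        }
    };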

@trws force-pushed the park-constraint-blocked-jobs branch from 9a54116 to 0c792c1 on April 29, 2024 02:22
@trws requested review from milroy and grondo on April 29, 2024 02:22
@trws (Member, Author) commented Apr 29, 2024

I see the coverage failure and will see what I can do about that, but that's going to take a few days. It looks like the uploader we're using is deprecated, and the actual error is because the uploader uses a gcov version that is too old for the gcc on bookworm, but for some reason it only matters for qmanager. 🤷

@grondo (Contributor) commented Apr 29, 2024

We switched to codecov-action@v4 in flux-core awhile ago. I can attempt to do an update here as well (along with the other deprecated actions), then we can rebase this one.

@trws (Member, Author) commented Apr 29, 2024

That would be awesome if you could, @grondo!

@grondo force-pushed the park-constraint-blocked-jobs branch from 0c792c1 to 7fea593 on April 30, 2024 01:38
@grondo (Contributor) left a comment

This LGTM. I'm not that familiar with the code in question, but these look like quite clever fixes!

I'm not sure if you came up with any way to test the deferral of unreservable jobs due to other jobs with no time limit, though. Would it be enough to submit a job without a time limit that uses all resources, submit another job, then cancel the first job and ensure the second is started? (I'm sure I'm oversimplifying; apologies if so.)

Resolved review threads (outdated): resource/traversers/dfu_impl.hpp, t/python/t10001-resourcegraph.py
@trws (Member, Author) commented Apr 30, 2024

I think I have a test to use; I just have to incorporate it into sharness. Thankfully the unreservable aspect is an easy test: submit two jobs requiring the same resource with no time limit, the first one running at least until the second is considered, then wait for both. If they both succeed, it's all good.

I'm not sure how to test the performance regression case reliably. Maybe with the stats updates @milroy is adding we could watch the number of failed matches and see if it's repeatedly trying to schedule when it can't? We may have to revisit that one.

@trws (Member, Author) commented Apr 30, 2024

OK, I added the test I was using locally basically as-is. I didn't see anywhere it clearly fit, but if anyone knows a spot, I can merge it into some other file.

@milroy (Member) left a comment

I found a few nits which should be quick to fix.

I do have some questions about the behavior of some of the checks in this PR. I don't think they warrant changes, but I'd like to understand the behavior a bit better before the PR gets merged.

Resolved review threads (outdated): qmanager/policies/base/queue_policy_base.hpp, qmanager/policies/queue_policy_bf_base_impl.hpp, resource/modules/resource_match.cpp, qmanager/modules/qmanager.cpp, resource/traversers/dfu_impl.hpp, resource/traversers/dfu.cpp
@trws force-pushed the park-constraint-blocked-jobs branch 5 times, most recently from 8f8c5bf to 4b75098 on April 30, 2024 17:47
@trws (Member, Author) commented Apr 30, 2024

OK, I think this is all cleaned up. Here are the highlights:

  1. @milroy was completely right: after the other changes there was no need for the extra ==ENOENT check or the change to the duration logic, so I removed those.
  2. Rather than trying to fix up the Python script shebang, I wrapped the test script in an invocation of flux env so that the flux-side Python path is always added.
  3. Formatting nits applied; I plan to come back and get clang-format and clang-tidy working on this repo so I stop missing things like that.
  4. The commit adding the test now actually runs the test (slightly important 🤦).

@trws force-pushed the park-constraint-blocked-jobs branch from 4b75098 to 790f16b on April 30, 2024 18:10
@milroy (Member) commented on this code on May 1, 2024:

        // no schedulable point found even at the end of the time, return EBUSY
        errno = EBUSY;
        return -1;
    }
    if (*at < 0 or *at >= graph_end) {

I don't think this really matters, but *at == graph_end satisfies both the if condition above and this one.

Suggested change:
-    if (*at < 0 or *at >= graph_end) {
+    if (*at < 0 or *at > graph_end) {

@milroy (Member) commented May 1, 2024

This looks cleaner and easier to understand; thanks! I also like that you renamed the ov variable. Fluxion has too many inscrutable variables.

I'm wondering if there is a way to include a test for jobs with unsatisfiable constraints being reconsidered many times. You could submit a few jobs that require a down node and then undrain the node. After the stats update PR #1187 gets merged, you could then check the number of failed matches.

@trws (Member, Author) commented May 1, 2024 via email

@trws force-pushed the park-constraint-blocked-jobs branch from abd0f6e to d422100 on May 1, 2024 16:16
@milroy (Member) commented May 1, 2024

> In fact I think my existing test would work, it just needs some extra jobs added and a stats check.

The stats PR got merged. I agree that just adding some extra jobs should be sufficient to test this.

@trws force-pushed the park-constraint-blocked-jobs branch 4 times, most recently from b4c8be7 to 8fd75e0 on May 3, 2024 03:38
@trws (Member, Author) commented May 3, 2024

OK, the test has been extended. We'll have to see how well it holds. As far as I understand the system, it should reliably hit 10, but it's possible it's partially timing-dependent. If we find that it's unreliable, we may want to adjust the test to check <= some bound.

@trws force-pushed the park-constraint-blocked-jobs branch from 8fd75e0 to 29063a7 on May 3, 2024 16:43
@trws (Member, Author) commented May 3, 2024

@milroy, any chance I could convince you to take a (hopefully) last pass over here? I think this is about where we need it.

@milroy (Member) commented May 6, 2024

@trws: yes, looking now.

@trws force-pushed the park-constraint-blocked-jobs branch from 29063a7 to 89f38b7 on May 6, 2024 17:34
@milroy (Member) left a comment

A few commit messages need to be cleaned up, but once that's done I think this PR is ready to merge.

Great job fixing this tricky problem!

Resolved review threads: qmanager/modules/qmanager.cpp, CMakePresets.json (two threads)
trws added 8 commits May 6, 2024 13:10
problem: Jobs with constraints that can't be matched because nodes are
down or drained are currently considered every time we enter the
scheduling loop.  If they reach the head of the queue, which is likely
because we currently only configure one sched queue, they get
re-considered over and over even though they cannot run.  This greatly
slows down scheduling and can cause severe blocking, with delays of up
to 20 seconds observed for a single submission.

solution: Add a new "m_blocked" member to the qmanager base which holds
jobs whose alloc_orelse_reserve returns EBUSY.  This state can only
happen when the job is blocked by a constraint requiring a node in an
unusable state.  The jobs in m_blocked are moved back into m_pending
(ready to be considered) by the notify callback.  Currently they are
moved regardless of which status change occurred (even a node being
drained causes them to move), but moving them back afterward is a
relatively small cost and simplifies the logic considerably.  The
duration for duration=0.0 jobs is now set to the remaining time of the
graph rather than the total time when the job metadata is built.

problem: "ov" is meaningless and impenetrable

solution: expand it to overhead

problem: some parsers of the presets format break with includes that
don't exist

solution: remove them

problem: cmake defaults to the Debug build type, which turns off many
optimizations by default, making profiling and, importantly, some
debugging tasks much harder.

solution: use the RelWithDebInfo build type, which more closely matches
the default used by autoconf in flux-core, and should be used by
default.

problem: jobs that are satisfiable but unreservable, in queues that use
reservation, would fail with either a match failure or an EINVAL return
from resource.  This seems to happen only when `schedule` returns -1
with an errno of ENOENT and the `at` value equal to the end of the
graph.

solution: rather than making that EINVAL, it now passes out EBUSY,
which causes the new code for blocked constraints to pull the job out
of pending and place it in blocked.  To allow the job to proceed, we
move blocked jobs back into the pending queue whenever a job is removed
from the queue due to completion or cancellation.  We probably consider
them more than necessary, but this seems to solve the issue and still
preserves the performance improvement.
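
A minimal sketch of that errno translation, assuming a hypothetical
schedule_point() stand-in for the traverser's internal search; the real
dfu traverser code differs in names and structure:

    // Sketch only: illustrates the ENOENT-at-graph-end -> EBUSY translation.
    #include <cerrno>
    #include <cstdint>

    // hypothetical stub: fails with ENOENT and *at == graph_end when no
    // reservable point exists before the end of the planner graph
    static int schedule_point (int64_t *at, int64_t graph_end)
    {
        *at = graph_end;
        errno = ENOENT;
        return -1;
    }

    static int try_alloc_orelse_reserve (int64_t *at, int64_t graph_end)
    {
        if (schedule_point (at, graph_end) < 0) {
            if (errno == ENOENT && *at == graph_end) {
                // satisfiable but currently unreservable: report EBUSY so
                // qmanager parks the job in m_blocked instead of failing
                // it with EINVAL as before
                errno = EBUSY;
            }
            return -1;
        }
        return 0;
    }
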
problem: we need to ensure we don't end up with unreservable jobs
treated as unsatisfiable again

solution: add a test that runs three jobs that are only allowed to run
on the same resource, each of which takes all remaining time.  This
forces jobs two and three to be unreservable until an earlier job
completes.  It's technically sensitive to timing, but in this
constrained case I would be surprised if it proves to be flaky.  We
should still set something up for sharness that lets jobs wait on a
trigger over a domain socket so we can easily make this kind of test
fully deterministic.

problem: while the sched PYTHONPATH is set reliably by a mix of cmake
and maybe-installtest, the flux PYTHONPATH is not

solution: wrap the test in an invocation of `flux env` to set the
PYTHONPATH appropriately, allowing the Python shebang in Python tests
to succeed

fixes flux-framework#1172

problem: we didn't have a test to reproduce the issue with blocked jobs
being constantly reconsidered

solution: use the new failed-match stats support: after the fix this
test should produce no more than 10 failed matches, and it fairly
deterministically produces 14 if the issue comes back.
@trws force-pushed the park-constraint-blocked-jobs branch from 89f38b7 to 80457fa on May 6, 2024 20:10
@trws added the merge-when-passing label (mergify.io: merge PR automatically once CI passes) on May 6, 2024
codecov bot commented May 6, 2024

Codecov Report

All modified and coverable lines are covered by tests ✅

❗ No coverage uploaded for pull request base (master@a4eb20a).

❗ Current head 29063a7 differs from pull request most recent head 80457fa. Consider uploading reports for the commit 80457fa to get more accurate results.

Additional details and impacted files
@@           Coverage Diff            @@
##             master   #1188   +/-   ##
========================================
  Coverage          ?   74.0%           
========================================
  Files             ?     102           
  Lines             ?   14611           
  Branches          ?       0           
========================================
  Hits              ?   10822           
  Misses            ?    3789           
  Partials          ?       0           
Files Coverage Δ
qmanager/modules/qmanager.cpp 73.4% <100.0%> (ø)
qmanager/policies/base/queue_policy_base.hpp 74.3% <100.0%> (ø)
qmanager/policies/queue_policy_bf_base_impl.hpp 89.1% <100.0%> (ø)
resource/modules/resource_match.cpp 69.3% <100.0%> (ø)
resource/traversers/dfu.cpp 87.9% <100.0%> (ø)
resource/traversers/dfu_impl.hpp 94.7% <100.0%> (ø)

@mergify bot merged commit a082acf into flux-framework:master on May 6, 2024
21 checks passed
@trws deleted the park-constraint-blocked-jobs branch on May 6, 2024 20:20