qmanager: defer jobs with unmatchable constraints #1188

Merged
merged 8 commits into flux-framework:master from park-constraint-blocked-jobs on May 6, 2024

Conversation

@trws (Member) commented Apr 27, 2024

This is the in-progress PR for the constraint job blocking problem. Full description below, but we're still lacking two things I really want to have:

  1. A test that will surface this if it comes back. Other than looking at timing, this is very hard to make an integration test for; a unit test would be easier, and I'm trying to come up with something.
  2. A fix for the stats collection logic so that it includes failed matches, both their count and their timing, since that gap is part of what helped this bug hide so well.

problem: Jobs with constraints that can't be matched because nodes are
down or drained are currently considered every time we enter the
scheduling loop. If they reach the head of the queue, which is likely
because we currently only configure one sched queue, they get
re-considered over and over even though they cannot run. This greatly
slows down scheduling and can cause severe blocking, with delays of up
to 20 seconds observed for a single submission. This change also
exposed a bug in the duration calculation for duration=0.0 jobs, which
used to be set to the full duration of the graph rather than the
remaining duration of the graph.

solution: Add a new "m_blocked" member to the qmanager base which holds
jobs whose alloc_orelse_reserve returns EBUSY. This state can only
happen when the job is blocked by a constraint requiring a node in an
unusable state. The jobs in m_blocked are moved back into m_pending
(ready to be considered) by the notify callback. Currently they are
moved regardless of which status change occurred (even a node being
drained causes them to move), but moving them back afterward is a
relatively small cost and simplifies the logic considerably. The
duration for duration=0.0 jobs is now set to the remaining time of the
graph rather than the total time when the job metadata is built.
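
A minimal sketch of the deferral pattern described above, assuming
hypothetical names (job_t, block_job, notify_resource_status_change);
the real types and signatures in queue_policy_base.hpp may differ:

    // Sketch only: the m_blocked deferral pattern, not the actual Fluxion code.
    #include <cstdint>
    #include <map>
    #include <memory>

    struct job_t; // opaque job record (hypothetical)

    class queue_policy_sketch {
    protected:
        std::map<uint64_t, std::shared_ptr<job_t>> m_pending; // ready to consider
        std::map<uint64_t, std::shared_ptr<job_t>> m_blocked; // EBUSY: constraint
                                                              // needs unusable node

        void block_job (uint64_t id)
        {
            // alloc_orelse_reserve returned EBUSY: park the job so the
            // scheduling loop stops reconsidering it on every pass
            auto it = m_pending.find (id);
            if (it != m_pending.end ()) {
                m_blocked.emplace (id, it->second);
                m_pending.erase (it);
            }
        }

    public:
        void notify_resource_status_change ()
        {
            // any node status change makes blocked jobs eligible again;
            // re-blocking them later is cheap and keeps the logic simple
            m_pending.merge (m_blocked);
        }
    };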

@trws force-pushed the park-constraint-blocked-jobs branch from 9a54116 to 0c792c1 on April 29, 2024 02:22
@trws requested review from milroy and grondo on April 29, 2024 02:22
@trws (Member, Author) commented Apr 29, 2024

I see the coverage failure and will see what I can do about that, but that's going to take a few days. It looks like the uploader we're using is deprecated, and the actual error is because the uploader uses a gcov version that is too old for the gcc on bookworm, but for some reason it only matters for qmanager. 🤷

@grondo (Contributor) commented Apr 29, 2024

We switched to codecov-action@v4 in flux-core awhile ago. I can attempt to do an update here as well (along with the other deprecated actions), then we can rebase this one.

@trws (Member, Author) commented Apr 29, 2024

That would be awesome if you could, @grondo!

@grondo force-pushed the park-constraint-blocked-jobs branch from 0c792c1 to 7fea593 on April 30, 2024 01:38
@grondo (Contributor) left a comment

This LGTM. I'm not that familiar with the code in question, but these look like quite clever fixes!

I'm not sure if you came up with any way to test the deferral of unreservable jobs due to other jobs with no time limit, though. Would it be enough to submit a job without a time limit that uses all resources, submit another job, then cancel the first job and ensure the second is started? (I'm sure I'm oversimplifying; apologies if so.)

Resolved review threads (outdated): resource/traversers/dfu_impl.hpp, t/python/t10001-resourcegraph.py
@trws (Member, Author) commented Apr 30, 2024

I think I have a test to use; I just have to incorporate it into sharness. Thankfully the unreservable aspect is an easy test: submit two jobs requiring the same resource with no time limit, the first one running at least until the second is considered, then wait for both. If they both succeed, it's all good.

I'm not sure how to test the performance regression case reliably. Maybe with the stats updates @milroy is adding we could watch the number of failed matches and see if it's repeatedly trying to schedule when it can't? We may have to revisit that one.

@trws (Member, Author) commented Apr 30, 2024

OK, I added the test I was using locally basically as-is. I didn't see anywhere it clearly fit, but if anyone knows a spot, I can merge it into some other file.

@milroy (Member) left a comment

I found a few nits which should be quick to fix.

I do have some questions about the behavior of some of the checks in this PR. I don't think they warrant changes, but I'd like to understand the behavior a bit better before the PR gets merged.

Resolved review threads (outdated): qmanager/policies/base/queue_policy_base.hpp, qmanager/policies/queue_policy_bf_base_impl.hpp, resource/modules/resource_match.cpp, qmanager/modules/qmanager.cpp, resource/traversers/dfu_impl.hpp, resource/traversers/dfu.cpp
@trws force-pushed the park-constraint-blocked-jobs branch 5 times, most recently from 8f8c5bf to 4b75098 on April 30, 2024 17:47
@trws (Member, Author) commented Apr 30, 2024

OK, I think this is all cleaned up. Here are the highlights:

  1. @milroy was completely right: after the other changes there was no need for the extra ==ENOENT check or the change to the duration logic, so I removed those.
  2. Rather than trying to fix up the Python script shebang, I wrapped the test script in an invocation of flux env so that the flux-side Python path is always added.
  3. Formatting nits applied; I plan to come back and get clang-format and clang-tidy working on this repo so I stop missing things like that.
  4. The commit adding the test now actually runs the test (slightly important 🤦).

@trws force-pushed the park-constraint-blocked-jobs branch from 4b75098 to 790f16b on April 30, 2024 18:10
@milroy (Member) commented on this code on May 1, 2024:

        // no schedulable point found even at the end of the time, return EBUSY
        errno = EBUSY;
        return -1;
    }
    if (*at < 0 or *at >= graph_end) {

I don't think this really matters, but *at == graph_end satisfies both the if condition above and this one.

Suggested change:
-    if (*at < 0 or *at >= graph_end) {
+    if (*at < 0 or *at > graph_end) {

@milroy (Member) commented May 1, 2024

This looks cleaner and easier to understand; thanks! I also like that you renamed the ov variable. Fluxion has too many inscrutable variables.

I'm wondering if there is a way to include a test for jobs with unsatisfiable constraints being reconsidered many times. You could submit a few jobs that require a down node and then undrain the node. After the stats update PR #1187 gets merged, you could then check the number of failed matches.

@trws (Member, Author) commented May 1, 2024 via email

@trws force-pushed the park-constraint-blocked-jobs branch from abd0f6e to d422100 on May 1, 2024 16:16
@milroy (Member) commented May 1, 2024

> In fact I think my existing test would work, it just needs some extra jobs added and a stats check.

The stats PR got merged. I agree that just adding some extra jobs should be sufficient to test this.

@trws force-pushed the park-constraint-blocked-jobs branch 4 times, most recently from b4c8be7 to 8fd75e0 on May 3, 2024 03:38
@trws (Member, Author) commented May 3, 2024

OK, the test has been extended. We'll have to see how well it holds. As far as I understand the system, it should reliably hit 10, but it's possible it's partially timing-dependent. If we find that it's unreliable, we may want to adjust the test to check <= some bound.

@trws force-pushed the park-constraint-blocked-jobs branch from 8fd75e0 to 29063a7 on May 3, 2024 16:43
@trws (Member, Author) commented May 3, 2024

@milroy, any chance I could convince you to take a (hopefully) last pass over here? I think this is about where we need it.

@milroy (Member) commented May 6, 2024

@trws: yes, looking now.

@trws force-pushed the park-constraint-blocked-jobs branch from 29063a7 to 89f38b7 on May 6, 2024 17:34
@milroy (Member) left a comment

A few commit messages need to be cleaned up, but once that's done I think this PR is ready to merge.

Great job fixing this tricky problem!

Resolved review threads: qmanager/modules/qmanager.cpp, CMakePresets.json (two threads)
trws added 8 commits May 6, 2024 13:10
problem: Jobs with constraints that can't be matched because nodes are
down or drained are currently considered every time we enter the
scheduling loop.  If they reach the head of the queue, which is likely
because we currently only configure one sched queue, they get
re-considered over and over even though they cannot run.  This greatly
slows down scheduling and can cause severe blocking, with delays of up
to 20 seconds observed for a single submission.

solution: Add a new "m_blocked" member to the qmanager base which holds
jobs whose alloc_orelse_reserve returns EBUSY.  This state can only
happen when the job is blocked by a constraint requiring a node in an
unusable state.  The jobs in m_blocked are moved back into m_pending
(ready to be considered) by the notify callback.  Currently they are
moved regardless of which status change occurred (even a node being
drained causes them to move), but moving them back afterward is a
relatively small cost and simplifies the logic considerably.  The
duration for duration=0.0 jobs is now set to the remaining time of the
graph rather than the total time when the job metadata is built.

problem: "ov" is meaningless and impenetrable

solution: expand it to overhead

problem: some parsers of the presets format break with includes that
don't exist

solution: remove them

problem: cmake defaults to the Debug build type, which turns off many
optimizations by default, making profiling and, importantly, some
debugging tasks much harder.

solution: use the RelWithDebInfo build type, which more closely matches
the default used by autoconf in flux-core, and should be used by
default.

problem: jobs that are satisfiable but unreservable, in queues that use
reservation, would fail with either a match failure or an EINVAL return
from resource.  This seems to happen only when `schedule` returns -1
with an errno of ENOENT and the `at` value equal to the end of the
graph.

solution: rather than making that EINVAL, it now passes out EBUSY,
which causes the new code for blocked constraints to pull the job out
of pending and place it in blocked.  To allow the job to proceed, we
move blocked jobs back into the pending queue whenever a job is removed
from the queue due to completion or cancellation.  We probably consider
them more than necessary, but this seems to solve the issue and still
preserves the performance improvement.
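
A minimal sketch of that errno translation, assuming a hypothetical
schedule_point() stand-in for the traverser's internal search; the real
dfu traverser code differs in names and structure:

    // Sketch only: illustrates the ENOENT-at-graph-end -> EBUSY translation.
    #include <cerrno>
    #include <cstdint>

    // hypothetical stub: fails with ENOENT and *at == graph_end when no
    // reservable point exists before the end of the planner graph
    static int schedule_point (int64_t *at, int64_t graph_end)
    {
        *at = graph_end;
        errno = ENOENT;
        return -1;
    }

    static int try_alloc_orelse_reserve (int64_t *at, int64_t graph_end)
    {
        if (schedule_point (at, graph_end) < 0) {
            if (errno == ENOENT && *at == graph_end) {
                // satisfiable but currently unreservable: report EBUSY so
                // qmanager parks the job in m_blocked instead of failing
                // it with EINVAL as before
                errno = EBUSY;
            }
            return -1;
        }
        return 0;
    }
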
problem: we need to ensure we don't end up with unreservable jobs
treated as unsatisfiable again

solution: add a test that runs three jobs that are only allowed to run
on the same resource, each of which takes all remaining time.  This
forces jobs two and three to be unreservable until an earlier job
completes.  It's technically sensitive to timing, but in this
constrained case I would be surprised if it proves to be flaky.  We
should still set something up for sharness that lets jobs wait on a
trigger over a domain socket so we can easily make this kind of test
fully deterministic.

problem: while the sched PYTHONPATH is set reliably by a mix of cmake
and maybe-installtest, the flux PYTHONPATH is not

solution: wrap the test in an invocation of `flux env` to set the
PYTHONPATH appropriately, allowing the Python shebang in Python tests
to succeed

fixes flux-framework#1172

problem: we didn't have a test to reproduce the issue with blocked jobs
being constantly reconsidered

solution: use the new failed-match stats support: after the fix this
test should produce no more than 10 failed matches, and it fairly
deterministically produces 14 if the issue comes back.
@trws force-pushed the park-constraint-blocked-jobs branch from 89f38b7 to 80457fa on May 6, 2024 20:10
@trws added the merge-when-passing label (mergify.io: merge PR automatically once CI passes) on May 6, 2024
codecov bot commented May 6, 2024

Codecov Report

All modified and coverable lines are covered by tests ✅

❗ No coverage uploaded for pull request base (master@a4eb20a).

❗ Current head 29063a7 differs from pull request most recent head 80457fa. Consider uploading reports for the commit 80457fa to get more accurate results.

Additional details and impacted files
@@           Coverage Diff            @@
##             master   #1188   +/-   ##
========================================
  Coverage          ?   74.0%           
========================================
  Files             ?     102           
  Lines             ?   14611           
  Branches          ?       0           
========================================
  Hits              ?   10822           
  Misses            ?    3789           
  Partials          ?       0           
Files Coverage Δ
qmanager/modules/qmanager.cpp 73.4% <100.0%> (ø)
qmanager/policies/base/queue_policy_base.hpp 74.3% <100.0%> (ø)
qmanager/policies/queue_policy_bf_base_impl.hpp 89.1% <100.0%> (ø)
resource/modules/resource_match.cpp 69.3% <100.0%> (ø)
resource/traversers/dfu.cpp 87.9% <100.0%> (ø)
resource/traversers/dfu_impl.hpp 94.7% <100.0%> (ø)

@mergify bot merged commit a082acf into flux-framework:master on May 6, 2024
21 checks passed
@trws deleted the park-constraint-blocked-jobs branch on May 6, 2024 20:20