Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

planner: ensure result in planner_avail_resources_at #1038

Merged
merged 2 commits into from
Jun 23, 2023

Conversation

trws
Copy link
Member

@trws trws commented Jun 20, 2023

During restart, we can end up in a situation where a time before the
planner's base is requested, in this case negative time, so there's no
earlier time to return from get_state. It feels like get_state should
always return some kind of state, but it's not clear how to do that
and avoid accidentally pretending there are no resource available. This
avoids asking for that invalid state by checking the precondition in the
planner, if an at before the plan_start is requested it's treated as
an invalid argument.

Fixes #1035 (comment)

@trws trws requested review from grondo and milroy June 20, 2023 17:34
@trws
Copy link
Member Author

trws commented Jun 20, 2023

Hold off on this for a minute. It works, but I think it works for the wrong reason. I'm pretty sure the planner is being asked for available resources at a negative time point.

During restart, we can end up in a situation where a time before the
planner's base is requested, in this case negative time, so there's no
earlier time to return from get_state. It feels like get_state should
always return *some kind of state*, but it's not clear how to do that
and avoid accidentally pretending there are no resource available. This
avoids asking for that invalid state by checking the precondition in the
planner, if an `at` before the `plan_start` is requested it's treated as
an invalid argument.
@trws
Copy link
Member Author

trws commented Jun 20, 2023

Ok, this version is better. I'm pretty sure there's still more to do to ensure planner always handles this right, but at least this case is addressed for now.

Copy link
Contributor

@grondo grondo left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Testing shows this works and the reproducer that showed Fluxion couldn't restart with running jobs + queues now passes.

I went ahead added the script as an issues reproducer along with a new test t5100-issues-test-driver.t (a copy of t4000-issues-test-driver.t from flux-core).

Mind if we cherry pick that commit and add it to this PR?
f6d0367

Problem: There is no reproducer for issue flux-framework#1035: fluxion can't restart
with queues enabled.

Add a new test driver for issue reproducers:

 t5100-issues-test-driver.t

Then add a reproducer script for flux-framework#1035 to the t/issues subdirectory.
@grondo
Copy link
Contributor

grondo commented Jun 21, 2023

I went ahead and pushed the testsuite commit on top here @trws. Thanks!

@codecov
Copy link

codecov bot commented Jun 21, 2023

Codecov Report

Merging #1038 (61cb555) into master (c65026c) will decrease coverage by 0.1%.
The diff coverage is n/a.

❗ Current head 61cb555 differs from pull request most recent head cf4de65. Consider uploading reports for the commit cf4de65 to get more accurate results

@@           Coverage Diff            @@
##           master   #1038     +/-   ##
========================================
- Coverage    74.4%   74.3%   -0.1%     
========================================
  Files          86      86             
  Lines        9434    9434             
========================================
- Hits         7021    7017      -4     
- Misses       2413    2417      +4     

see 1 file with indirect coverage changes

@trws trws added the merge-when-passing mergify.io - merge PR automatically once CI passes label Jun 23, 2023
@mergify mergify bot merged commit 32f74d6 into flux-framework:master Jun 23, 2023
18 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
merge-when-passing mergify.io - merge PR automatically once CI passes
Projects
None yet
Development

Successfully merging this pull request may close these issues.

fluxion can't restart with queues enabled
2 participants