Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

job-manager: fix for job priority not reset after a duplicate urgency update #4941

Merged
merged 2 commits into from Feb 15, 2023

Conversation

grondo
Copy link
Contributor

@grondo grondo commented Feb 15, 2023

This PR fixes the issue described in #4940, and adds some tests to the hold suite of sharness tests to set urgency to hold already held jobs to ensure we don't see this particular issue again.

Problem: The job manager prioritization code contains an optimization
that avoids posting a priority event if the priority did not change
after reprioritization. However, this causes jobs that are in PRIORITY
state to get stuck there, since they normally transition to SCHED
as a result of the priority event. This occurs if a job's urgency is
updated to the same value, e.g. if a held job is held again.

Force a priority event to be emitted in states other than SCHED, even
if the priority is unchanged. This allows jobs kicked back to PRIORITY
state to transition to SCHED, even if a duplicate urgency value is set.

Fixes flux-framework#4940
@grondo grondo changed the title job-manager: fix jobs stuck in PRIORITY when urgency is set to a duplicate value job-manager: fix for job priority not reset after a duplicate urgency update Feb 15, 2023
Comment on lines 37 to 43
test_expect_success 'job-manager: hold job 3, 5 again (issue #4940)' '
flux job urgency $(cat job3.id) 16 &&
flux job urgency $(cat job3.id) 16 &&
flux job urgency $(cat job3.id) 16 &&
flux job urgency $(cat job3.id) 16 &&
flux job urgency $(cat job5.id) 16
'
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

should be urgency 0/hold instead of 16?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

yep! 🤦 nice catch!

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks. Fixed that blunder and force pushed the result.

Problem: None of the job manager job "hold" tests have a reproducer
for holding an already held job as described in issue flux-framework#4940.

Sprinkle some `flux job urgency id 0` in the tests on already held
jobs to act as a reproducer for flux-framework#4940.
Copy link
Member

@chu11 chu11 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Lgtm!

@grondo
Copy link
Contributor Author

grondo commented Feb 15, 2023

Thanks! I've set MWP.

@mergify mergify bot merged commit 33af335 into flux-framework:master Feb 15, 2023
@grondo grondo deleted the issue#4940 branch February 16, 2023 03:07
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

2 participants