testsuite: fix lingering processes after make check #5769

Merged
merged 1 commit into from Mar 20, 2024

Contributor

@grondo grondo commented Mar 5, 2024

This PR attempts to fix or work around an issue that I keep seeing: a couple of orphaned instances that stick around after make -j N check. A quick investigation showed that the instances all had a sleep job stuck in CLEANUP with the scheduler unloaded, and all were at instance level 1 or deeper (so they were batch jobs run within the test instance).

I wasn't really able to make a reproducer for this issue, but terminating the affected instances more cleanly seems to resolve the problem (at least in my testing), so it seems like a good idea to do that.

Member

@garlick garlick left a comment

OK with me!

I hadn't noticed this on my system, FWIW.

Contributor Author

grondo commented Mar 5, 2024

Well this change is causing hangs in CI now, so obviously this wasn't a sufficient or good workaround. More analysis needed.

Problem: There are often a couple orphaned Flux instances that remain
after a run of `make -j N check`. The instances appear to be blocked
in rc3 with a sleep job in CLEANUP which can never exit because the
scheduler module has already been unloaded. The root cause of this
situation is unknown, but probably the affected tests can be improved
to reduce the probability of the issue.

The affected tests include t2800-jobs-recursive.t and t2802-uri-cmd.t,
each of which starts a nested instance of one or more levels with
sleep jobs.

Attempt to terminate the affected batch jobs cleanly rather than
leaving them to the test instance rc3 script. This should reduce or
eliminate stray processes after running make check or the individual
tests.
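
As an illustration of what "terminating the batch jobs cleanly" could look like at the end of a test, here is a minimal sketch. It is not the change made in this PR: the helper name `terminate_batch_job` and the 60 second timeout are hypothetical, and it assumes the job ID of the nested batch instance was saved when the job was submitted.

```shell
# Sketch only: hypothetical helper, not part of the flux-core testsuite.
# Stops a nested batch instance before the enclosing test instance
# reaches rc3, so no sleep job is left behind in CLEANUP.
terminate_batch_job() {
    jobid=$1
    # Ask the nested instance to shut down; ignore failure if it is
    # already gone.
    flux shutdown "$jobid" 2>/dev/null || true
    # Cancel the job in case the shutdown request did not end it;
    # ignore failure if the job is already inactive.
    flux cancel "$jobid" 2>/dev/null || true
    # Wait (up to the assumed 60s) for the job to reach the clean event.
    flux job wait-event -t 60 "$jobid" clean
}
```

A test would call `terminate_batch_job $id` as its last step, where `$id` is the value printed by `flux batch` when the job was submitted.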
Contributor Author

grondo commented Mar 20, 2024

OK. The hang came from a use of flux proxy that was prompting for a password. Fixed, and I will set MWP (merge-when-passing).


codecov bot commented Mar 20, 2024

Codecov Report

Merging #5769 (ffa89a0) into master (1844972) will increase coverage by 0.00%.
The diff coverage is n/a.

Additional details and impacted files
@@           Coverage Diff           @@
##           master    #5769   +/-   ##
=======================================
  Coverage   83.34%   83.34%           
=======================================
  Files         509      509           
  Lines       82485    82485           
=======================================
+ Hits        68744    68747    +3     
+ Misses      13741    13738    -3     

see 8 files with indirect coverage changes

@mergify mergify bot merged commit eaaf293 into flux-framework:master Mar 20, 2024
35 checks passed
@grondo grondo deleted the make-check-orphans branch March 20, 2024 13:32