On Fedora rawhide tests have a tendency to hang for no obvious reason #1405
Comments
I've found a reason:
But WHY?!
Well,
Also working:
So, stopping and starting the process also "unhangs" it.
Something is inconsistent or crazy! Maybe it is me!?
I'm glad I pasted the "T" result above. I'd think I dreamt it otherwise.
Now a simple reproducer:
This (eventually) hangs.
There's nothing entirely obvious from attaching
Running under
1397: Fix Fedora rawhide build in CI r=wmww a=AlanGriffiths
Partial fix for Fedora rawhide build in CI. (Fixes: #1400, fixes: #1399)
1. Drop dependency that's unsatisfiable on Fedora rawhide (I think it's redundant)
2. Add a missing dependency
3. Clear WAYLAND_DISPLAY from the test environment
4. Disable a couple of tests that try to frig with /dev/random
That leaves us with a build, but hanging tests (#1405) but that can be fixed later.
Co-authored-by: Alan Griffiths <alan@octopull.co.uk>
OK, I've a nasty hack that appears to prevent this problem manifesting. Which means I have a theory of what's going wrong. (It is racy code.) I'll try to create a tidy solution tomorrow.
I'm going to park this and come back to it another day. Where I've got to:
There's a couple of places in the system that use
The frequency of failures can be significantly reduced (failing in 10s of iterations => failing in 100s of iterations), for example by calling
This is highly suggestive that our use of
We can get out of this failure state by sending SIGSTOP & SIGCONT, or by suspending the process from the terminal with ^Z and bringing it back to the foreground
I also note that we don't seem to use more than one thread in the pool, so maybe this is all overengineered.
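For anyone who wants to script the SIGSTOP/SIGCONT workaround rather than doing it by hand, here is a minimal Python sketch. The `nudge` helper and the stand-in child process are made up for illustration and are not part of Mir. It only shows the signal sequence and says nothing about why that unsticks the test.

```python
import os
import signal
import subprocess
import sys
import time


def nudge(pid: int, pause: float = 0.1) -> None:
    """Stop and then resume a process, much like ^Z followed by fg."""
    os.kill(pid, signal.SIGSTOP)
    time.sleep(pause)  # leave the process stopped briefly before resuming
    os.kill(pid, signal.SIGCONT)


if __name__ == "__main__":
    # Hypothetical stand-in child; in practice pass the pid of the hung test.
    child = subprocess.Popen(
        [sys.executable, "-c", "import time; time.sleep(0.5)"]
    )
    nudge(child.pid)
    print("exit code:", child.wait())
```

If the hung test carries on after `nudge(pid)`, that matches the behaviour described above.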
I wonder if this could also be the cause of occasional failures on Ubuntu CI?
I couldn't leave it alone! Tried on 20.04...
Clearly, we leak file handles in here - would be nice if that's related.
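One rough way to watch for a handle leak like this on Linux is to count the entries under `/proc/<pid>/fd` between iterations. The sketch below assumes a Linux `/proc` filesystem, and the helper name `open_fd_count` is invented for illustration. It is not how the numbers above were obtained.

```python
import os


def open_fd_count(pid="self"):
    """Number of open file descriptors for a process (Linux /proc only).

    The listing itself briefly opens one descriptor, so compare counts
    taken the same way rather than reading the absolute value.
    """
    return len(os.listdir(f"/proc/{pid}/fd"))


if __name__ == "__main__":
    before = open_fd_count()
    r, w = os.pipe()  # deliberately hold two extra descriptors
    print("grew by:", open_fd_count() - before)
    os.close(r)
    os.close(w)
```

A count that keeps climbing across test iterations would point at a leak.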
Was bors a bit eager there, or does the LTTNG thing actually fix the Fedora hangs?
Now you ask, I realise it was me that was a bit eager - I tried reproducing with #1496, and couldn't. What I failed to do is confirm that I could still reproduce without it. Either way, this is fixed.
Happens to miral-test, mir_acceptance_tests, mir_integration_tests and probably others.
If you attach gdb to the process, there's nothing obviously wrong, and the test continues without problems if you "continue" or detach (or quit).
I suspect this is the reason for timeouts in CI.
To reproduce:
May take a minute or two before it hangs.
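A generic way to catch an intermittent hang like this is to rerun the test in a loop and treat any run that exceeds a timeout as hung. This is only a sketch under assumptions: `run_until_hang` is a made-up helper, and the command, timeout and run count are placeholders, not the command used in this report.

```python
import subprocess
import sys


def run_until_hang(cmd, timeout_s, max_runs):
    """Rerun cmd; return the 1-based run number that timed out, or None."""
    for run in range(1, max_runs + 1):
        try:
            # Note: subprocess.run kills the child when the timeout expires,
            # so this detects a hang but does not leave it around for gdb.
            subprocess.run(
                cmd,
                timeout=timeout_s,
                check=False,
                stdout=subprocess.DEVNULL,
                stderr=subprocess.DEVNULL,
            )
        except subprocess.TimeoutExpired:
            return run
    return None


if __name__ == "__main__":
    # Placeholder command; substitute the test binary under investigation.
    hung_on = run_until_hang([sys.executable, "-c", "pass"], timeout_s=30, max_runs=5)
    print("hung on run:", hung_on)
```

To inspect the hung process instead of killing it, the loop would need `subprocess.Popen` with a manual wait.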