
On Fedora rawhide tests have a tendency to hang for no obvious reason #1405

Closed
AlanGriffiths opened this issue Apr 3, 2020 · 12 comments · Fixed by #1496

Comments

@AlanGriffiths
Contributor

Happens to miral-test, mir_acceptance_tests, mir_integration_tests and probably others.

If you attach gdb to the process, there's nothing obviously wrong, and the test continues without problems if you "continue" or detach (or quit).

I suspect this is the reason for timeouts in CI.

To reproduce:

while bin/miral-test; do :; done

May take a minute or two before it hangs.

@AlanGriffiths
Contributor Author

I've found a reason:

[alan@localhost mir]$ ps -eaf | grep miral
alan       15035   12756  3 10:22 pts/0    00:00:00 cmake-build-debug/bin/miral-test.bin
alan       15278   15228  0 10:23 pts/1    00:00:00 grep --color=auto miral
[alan@localhost mir]$ ps -o state= -p 15035
T
        T    stopped by job control signal

But WHY?!

@AlanGriffiths
Contributor Author

Well, kill -SIGCONT <pid> doesn't restart the test, but closing and opening the lid does.

@AlanGriffiths AlanGriffiths self-assigned this Apr 6, 2020
@AlanGriffiths
Contributor Author

> Well, kill -SIGCONT <pid> doesn't restart the test, but closing and opening the lid does.

Also working:

killall -SIGSTOP cmake-build-debug/bin/miral-test.bin; killall -SIGCONT cmake-build-debug/bin/miral-test.bin

So, stopping and starting the process also "unhangs" it.

@AlanGriffiths
Contributor Author

Something is inconsistent or crazy! Maybe it is me!?

$ ps -o state=,cmd=,pid= -a | grep miral | grep -v grep
S cmake-build-debug/bin/miral   31508
          S    interruptible sleep (waiting for an event to complete)

I'm glad I pasted the "T" result above. I'd think I dreamt it otherwise.

@AlanGriffiths
Contributor Author

Now a simple reproducer:

while cmake-build-debug/bin/miral-test --gtest_filter=TestServer.connect_client_works; do :; done

This (eventually) hangs.

$ ps -o state= -o command:37= -o pid= -C miral-test.bin
S cmake-build-debug/bin/miral-test.bin    93315

There's nothing entirely obvious from attaching gdb.
Attaching strace and then sending SIGSTOP/SIGCONT:

$ strace --attach 93315 & (killall -SIGSTOP cmake-build-debug/bin/miral-test.bin; killall -SIGCONT cmake-build-debug/bin/miral-test.bin)
[1] 93358
strace: Process 93315 attached
futex(0x7ff2777fe9d0, FUTEX_WAIT, 93325, NULL) = ? ERESTARTSYS (To be restarted if SA_RESTART is set)
--- SIGSTOP {si_signo=SIGSTOP, si_code=SI_USER, si_pid=93360, si_uid=1000} ---
--- stopped by SIGSTOP ---
futex(0x7ff2777fe9d0, FUTEX_WAIT, 93325, NULL) = 0
close(32)                               = 0
close(31)                               = 0
close(30)                               = 0
close(25)                               = 0
close(29)                               = 0
munmap(0x7ff286b05000, 135976)          = 0
clock_nanosleep(CLOCK_REALTIME, 0, {tv_sec=0, tv_nsec=100000000}, [alan@localhost mir]$ 0x7ffcec2877a0) = 0
brk(NULL)                               = 0x2326000
brk(0x2347000)                          = 0x2347000
write(3, "\1\0\0\0\0\0\0\0", 8)         = 8
futex(0x7ff28000f730, FUTEX_WAKE_PRIVATE, 1) = 1
futex(0x23224bc, FUTEX_WAIT_BITSET_PRIVATE, 0, {tv_sec=41437, tv_nsec=314920382}, FUTEX_BITSET_MATCH_ANY) = 0
munmap(0x7ff284ac7000, 8392704)         = 0
futex(0x2322468, FUTEX_WAKE_PRIVATE, 1) = 0
write(1, "\33[0;32m[       OK ] \33[mTestServe"..., 67) = 67
write(1, "\33[0;32m[----------] \33[m1 test fr"..., 65) = 65
write(1, "\33[0;32m[----------] \33[mGlobal te"..., 57) = 57
write(1, "\33[0;32m[==========] \33[m1 test fr"..., 71) = 71
write(1, "\33[0;32m[  PASSED  ] \33[m1 test.\n", 31) = 31
rt_sigprocmask(SIG_SETMASK, ~[RTMIN RT_1], [], 8) = 0
rt_sigprocmask(SIG_SETMASK, [], NULL, 8) = 0
rt_sigprocmask(SIG_BLOCK, ~[RTMIN RT_1], [], 8) = 0
membarrier(MEMBARRIER_CMD_PRIVATE_EXPEDITED, 0) = 0
membarrier(MEMBARRIER_CMD_PRIVATE_EXPEDITED, 0) = 0
rt_sigprocmask(SIG_SETMASK, [], NULL, 8) = 0
rt_sigprocmask(SIG_SETMASK, ~[RTMIN RT_1], [], 8) = 0
rt_sigprocmask(SIG_SETMASK, [], NULL, 8) = 0
getpid()                                = 93315
rt_sigprocmask(SIG_SETMASK, ~[RTMIN RT_1], [], 8) = 0
rt_sigprocmask(SIG_SETMASK, [], NULL, 8) = 0
rt_sigprocmask(SIG_BLOCK, ~[RTMIN RT_1], [], 8) = 0
membarrier(MEMBARRIER_CMD_PRIVATE_EXPEDITED, 0) = 0
membarrier(MEMBARRIER_CMD_PRIVATE_EXPEDITED, 0) = 0
rt_sigprocmask(SIG_SETMASK, [], NULL, 8) = 0
rt_sigprocmask(SIG_SETMASK, ~[RTMIN RT_1], [], 8) = 0
rt_sigprocmask(SIG_SETMASK, [], NULL, 8) = 0
rt_sigprocmask(SIG_SETMASK, ~[RTMIN RT_1], [], 8) = 0
rt_sigprocmask(SIG_SETMASK, [], NULL, 8) = 0
rt_sigprocmask(SIG_BLOCK, ~[RTMIN RT_1], [], 8) = 0
membarrier(MEMBARRIER_CMD_PRIVATE_EXPEDITED, 0) = 0
membarrier(MEMBARRIER_CMD_PRIVATE_EXPEDITED, 0) = 0
rt_sigprocmask(SIG_SETMASK, [], NULL, 8) = 0
rt_sigprocmask(SIG_SETMASK, ~[RTMIN RT_1], [], 8) = 0
rt_sigprocmask(SIG_SETMASK, [], NULL, 8) = 0
rt_sigprocmask(SIG_SETMASK, ~[RTMIN RT_1], [], 8) = 0
rt_sigprocmask(SIG_SETMASK, [], NULL, 8) = 0
rt_sigprocmask(SIG_BLOCK, ~[RTMIN RT_1], [], 8) = 0
membarrier(MEMBARRIER_CMD_PRIVATE_EXPEDITED, 0) = 0
membarrier(MEMBARRIER_CMD_PRIVATE_EXPEDITED, 0) = 0
rt_sigprocmask(SIG_SETMASK, [], NULL, 8) = 0
rt_sigprocmask(SIG_SETMASK, ~[RTMIN RT_1], [], 8) = 0
rt_sigprocmask(SIG_SETMASK, [], NULL, 8) = 0
rt_sigprocmask(SIG_SETMASK, ~[RTMIN RT_1], [], 8) = 0
rt_sigprocmask(SIG_SETMASK, [], NULL, 8) = 0
rt_sigprocmask(SIG_SETMASK, ~[RTMIN RT_1], [], 8) = 0
rt_sigprocmask(SIG_SETMASK, [], NULL, 8) = 0
rt_sigprocmask(SIG_SETMASK, ~[RTMIN RT_1], [], 8) = 0
rt_sigprocmask(SIG_SETMASK, [], NULL, 8) = 0
rt_sigprocmask(SIG_SETMASK, ~[RTMIN RT_1], [], 8) = 0
rt_sigprocmask(SIG_SETMASK, [], NULL, 8) = 0
rt_sigprocmask(SIG_SETMASK, ~[RTMIN RT_1], [], 8) = 0
rt_sigprocmask(SIG_SETMASK, [], NULL, 8) = 0
rt_sigprocmask(SIG_BLOCK, ~[RTMIN RT_1], [], 8) = 0
membarrier(MEMBARRIER_CMD_PRIVATE_EXPEDITED, 0) = 0
membarrier(MEMBARRIER_CMD_PRIVATE_EXPEDITED, 0) = 0
rt_sigprocmask(SIG_SETMASK, [], NULL, 8) = 0
rt_sigprocmask(SIG_SETMASK, ~[RTMIN RT_1], [], 8) = 0
rt_sigprocmask(SIG_SETMASK, [], NULL, 8) = 0
exit_group(0)                           = ?
+++ exited with 0 +++

@AlanGriffiths
Contributor Author

The hang doesn't reproduce when running under strace or gdb.

bors bot added a commit that referenced this issue Apr 9, 2020
1397: Fix Fedora rawhide build in CI r=wmww a=AlanGriffiths

Partial fix for Fedora rawhide build in CI. (Fixes: #1400, fixes: #1399)

1. Drop dependency that's unsatisfiable on Fedora rawhide (I think it's redundant)
2. Add a missing dependency
3. Clear WAYLAND_DISPLAY from the test environment
4. Disable a couple of tests that try to frig with /dev/random

That leaves us with a build, but hanging tests (#1405); that can be fixed later.

Co-authored-by: Alan Griffiths <alan@octopull.co.uk>
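
For anyone reading item 3 ("Clear WAYLAND_DISPLAY from the test environment"): below is a minimal sketch, assuming gtest and a made-up class name, of how such a clean-up could look. #1397 may well implement it differently.

// Sketch only (hypothetical, not the actual change in #1397): unset
// WAYLAND_DISPLAY before any test runs, so tests can't accidentally talk
// to a developer's running Wayland compositor.
#include <cstdlib>
#include <gtest/gtest.h>

namespace
{
struct ClearWaylandDisplay : testing::Environment
{
    void SetUp() override
    {
        unsetenv("WAYLAND_DISPLAY");   // a missing variable is not an error here
    }
};

// gtest takes ownership of the pointer and runs SetUp() before the first test.
[[maybe_unused]] auto const registered =
    testing::AddGlobalTestEnvironment(new ClearWaylandDisplay);
}
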
@AlanGriffiths
Contributor Author

OK, I've a nasty hack that appears to prevent this problem from manifesting, which means I have a theory of what's going wrong. (It is racy code.)

I'll try to create a tidy solution tomorrow.

@AlanGriffiths
Contributor Author

AlanGriffiths commented Apr 17, 2020

I'm going to park this and come back to it another day.

Where I've got to:

  1. The problem is in md::ThreadedDispatcher.
  2. It can be reproduced under the debugger with
    2.1. miral-test --gtest_filter=TestServer.connect_client_works --gtest_repeat=1000
  3. The failure scenario is:
    3.1. the thread in threadpool is waiting on poll();
    3.2. ~ThreadedDispatcher() calls thread_exiter->terminate_all_threads_async()
    3.3. poll() doesn't return, the threadpool thread doesn't exit; and,
    3.4. ~ThreadedDispatcher() hangs on join()

There are a couple of places in the system that use md::ThreadedDispatcher, and both can exhibit this failure mode.

The frequency of failures can be significantly reduced (failing in 10s of iterations => failing in 100s of iterations), for example by calling terminate_all_threads_async() twice. (Or by running under strace or the debugger.)

This is highly suggestive that our use of eventfd() is racy, but I've yet to spot why.

We can get out of this failure state by sending SIGSTOP & SIGCONT, or by suspending the process from the terminal with ^Z and bringing it back to the foreground with fg.

I also note that we don't seem to use more than one thread in the pool, so maybe this is all overengineered.
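
For illustration, here is a minimal C++ sketch of the poll()/eventfd shutdown pattern described in 3.1-3.4. This is not Mir's ThreadedDispatcher (the class name is made up and error handling is omitted); it only shows where each step sits. In this shape, join() can only hang if the eventfd wakeup never reaches the worker's poll(), which is the kind of race suspected above.

// Sketch only: a worker blocked in poll() is asked to exit by a write to an
// eventfd; the destructor then joins the worker.
#include <poll.h>
#include <sys/eventfd.h>
#include <unistd.h>
#include <cstdint>
#include <thread>

class SketchDispatcher   // hypothetical name, for illustration
{
public:
    explicit SketchDispatcher(int dispatchee_fd)
        : shutdown_fd{eventfd(0, EFD_CLOEXEC)},
          worker{[this, dispatchee_fd]
          {
              pollfd fds[2] = {{dispatchee_fd, POLLIN, 0}, {shutdown_fd, POLLIN, 0}};
              for (;;)
              {
                  poll(fds, 2, -1);                     // 3.1: worker waits here
                  if (fds[1].revents & POLLIN) return;  // told to exit
                  // ...otherwise dispatch whatever is ready on dispatchee_fd...
              }
          }}
    {
    }

    ~SketchDispatcher()
    {
        std::uint64_t const one = 1;
        [[maybe_unused]] auto const n =
            write(shutdown_fd, &one, sizeof one);  // 3.2: the "terminate" wakeup
        worker.join();                             // 3.4: hangs if 3.3 happens
        close(shutdown_fd);
    }

private:
    int shutdown_fd;      // initialised before the worker thread starts
    std::thread worker;
};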

@AlanGriffiths
Contributor Author

I wonder if this could also be the cause of occasional failures on Ubuntu CI?

@AlanGriffiths AlanGriffiths removed their assignment Apr 17, 2020
@AlanGriffiths
Contributor Author

I couldn't leave it alone! I tried on Ubuntu 20.04...

$ miral-test --gtest_filter=TestServer.connect_client_works --gtest_repeat=1000
...
Repeating all tests (iteration 1000) . . .

Note: Google Test filter = TestServer.connect_client_works
[==========] Running 1 test from 1 test suite.
[----------] Global test environment set-up.
[----------] 1 test from TestServer
[ RUN      ] TestServer.connect_client_works
ERROR: /home/alan/display_server/mir/src/common/dispatch/multiplexing_dispatchable.cpp(80): Throw in function mir::dispatch::MultiplexingDispatchable::MultiplexingDispatchable()
Dynamic exception type: boost::wrapexcept<std::system_error>
std::exception::what: Failed to create epoll monitor: Too many open files

unknown file: Failure
C++ exception with description "Failed to start server thread" thrown in SetUp().
[  FAILED  ] TestServer.connect_client_works (20100 ms)
[----------] 1 test from TestServer (20100 ms total)

[----------] Global test environment tear-down
[==========] 1 test from 1 test suite ran. (20100 ms total)
[  PASSED  ] 0 tests.
[  FAILED  ] 1 test, listed below:
[  FAILED  ] TestServer.connect_client_works

 1 FAILED TEST

Clearly, we leak file handles here - it would be nice if that turns out to be related.
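
A small sketch, assuming Linux's /proc filesystem, of one way to watch that leak: count the entries in /proc/self/fd after each iteration and see whether the number keeps climbing. The helper below is made up for illustration and is not part of the Mir tests.

// Count this process's open file descriptors via /proc/self/fd.
#include <cstddef>
#include <filesystem>
#include <iostream>

std::size_t open_fd_count()
{
    std::size_t count = 0;
    for ([[maybe_unused]] auto const& entry :
         std::filesystem::directory_iterator{"/proc/self/fd"})
    {
        ++count;
    }
    // The iterator itself holds one fd, so compare deltas between
    // iterations rather than absolute values.
    return count;
}

int main()
{
    std::cout << "open fds: " << open_fd_count() << '\n';
}

A count that grows by a fixed amount per --gtest_repeat iteration would point at per-test setup/teardown rather than a one-off.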

@RAOF
Contributor

RAOF commented Jun 1, 2020

Was bors a bit eager there, or does the LTTNG thing actually fix the Fedora hangs?

@AlanGriffiths
Contributor Author

> Was bors a bit eager there, or does the LTTNG thing actually fix the Fedora hangs?

Now you ask, I realise it was me that was a bit eager - I tried reproducing with #1496 and couldn't. What I failed to do was confirm that I could still reproduce without it. Either way, this is fixed.
