
On Fedora rawhide tests have a tendency to hang for no obvious reason #1405

Closed
AlanGriffiths opened this issue Apr 3, 2020 · 12 comments · Fixed by #1496

Comments

@AlanGriffiths
Contributor

Happens to miral-test, mir_acceptance_tests, mir_integration_tests and probably others.

If you attach gdb to the process, there's nothing obviously wrong, and the test continues without problems if you "continue" or detach (or quit).

I suspect this is the reason for timeouts in CI.

To reproduce:

while bin/miral-test; do :; done

May take a minute or two before it hangs.

@AlanGriffiths
Contributor Author

I've found a reason:

[alan@localhost mir]$ ps -eaf | grep miral
alan       15035   12756  3 10:22 pts/0    00:00:00 cmake-build-debug/bin/miral-test.bin
alan       15278   15228  0 10:23 pts/1    00:00:00 grep --color=auto miral
[alan@localhost mir]$ ps -o state= -p 15035
T
        T    stopped by job control signal

But WHY?!

@AlanGriffiths
Contributor Author

Well, kill -SIGCONT <pid> doesn't restart the test, but closing and opening the lid does.

@AlanGriffiths AlanGriffiths self-assigned this Apr 6, 2020
@AlanGriffiths
Contributor Author

> Well, kill -SIGCONT <pid> doesn't restart the test, but closing and opening the lid does.

Also working:

killall -SIGSTOP cmake-build-debug/bin/miral-test.bin; killall -SIGCONT cmake-build-debug/bin/miral-test.bin

So, stopping and starting the process also "unhangs" it.

@AlanGriffiths
Contributor Author

Something is inconsistent or crazy! Maybe it is me!?

$ ps -o state=,cmd=,pid= -a | grep miral | grep -v grep
S cmake-build-debug/bin/miral   31508
          S    interruptible sleep (waiting for an event to complete)

I'm glad I pasted the "T" result above. I'd think I dreamt it otherwise.

@AlanGriffiths
Contributor Author

Now a simple reproducer:

while cmake-build-debug/bin/miral-test --gtest_filter=TestServer.connect_client_works; do :; done

This (eventually) hangs.

$ ps -o state= -o command:37= -o pid= -C miral-test.bin
S cmake-build-debug/bin/miral-test.bin    93315

There's nothing entirely obvious from attaching gdb.
Attaching strace and then sending SIGSTOP/SIGCONT:

$ strace --attach 93315 & (killall -SIGSTOP cmake-build-debug/bin/miral-test.bin; killall -SIGCONT cmake-build-debug/bin/miral-test.bin)
[1] 93358
strace: Process 93315 attached
futex(0x7ff2777fe9d0, FUTEX_WAIT, 93325, NULL) = ? ERESTARTSYS (To be restarted if SA_RESTART is set)
--- SIGSTOP {si_signo=SIGSTOP, si_code=SI_USER, si_pid=93360, si_uid=1000} ---
--- stopped by SIGSTOP ---
futex(0x7ff2777fe9d0, FUTEX_WAIT, 93325, NULL) = 0
close(32)                               = 0
close(31)                               = 0
close(30)                               = 0
close(25)                               = 0
close(29)                               = 0
munmap(0x7ff286b05000, 135976)          = 0
clock_nanosleep(CLOCK_REALTIME, 0, {tv_sec=0, tv_nsec=100000000}, [alan@localhost mir]$ 0x7ffcec2877a0) = 0
brk(NULL)                               = 0x2326000
brk(0x2347000)                          = 0x2347000
write(3, "\1\0\0\0\0\0\0\0", 8)         = 8
futex(0x7ff28000f730, FUTEX_WAKE_PRIVATE, 1) = 1
futex(0x23224bc, FUTEX_WAIT_BITSET_PRIVATE, 0, {tv_sec=41437, tv_nsec=314920382}, FUTEX_BITSET_MATCH_ANY) = 0
munmap(0x7ff284ac7000, 8392704)         = 0
futex(0x2322468, FUTEX_WAKE_PRIVATE, 1) = 0
write(1, "\33[0;32m[       OK ] \33[mTestServe"..., 67) = 67
write(1, "\33[0;32m[----------] \33[m1 test fr"..., 65) = 65
write(1, "\33[0;32m[----------] \33[mGlobal te"..., 57) = 57
write(1, "\33[0;32m[==========] \33[m1 test fr"..., 71) = 71
write(1, "\33[0;32m[  PASSED  ] \33[m1 test.\n", 31) = 31
rt_sigprocmask(SIG_SETMASK, ~[RTMIN RT_1], [], 8) = 0
rt_sigprocmask(SIG_SETMASK, [], NULL, 8) = 0
rt_sigprocmask(SIG_BLOCK, ~[RTMIN RT_1], [], 8) = 0
membarrier(MEMBARRIER_CMD_PRIVATE_EXPEDITED, 0) = 0
membarrier(MEMBARRIER_CMD_PRIVATE_EXPEDITED, 0) = 0
rt_sigprocmask(SIG_SETMASK, [], NULL, 8) = 0
rt_sigprocmask(SIG_SETMASK, ~[RTMIN RT_1], [], 8) = 0
rt_sigprocmask(SIG_SETMASK, [], NULL, 8) = 0
getpid()                                = 93315
rt_sigprocmask(SIG_SETMASK, ~[RTMIN RT_1], [], 8) = 0
rt_sigprocmask(SIG_SETMASK, [], NULL, 8) = 0
rt_sigprocmask(SIG_BLOCK, ~[RTMIN RT_1], [], 8) = 0
membarrier(MEMBARRIER_CMD_PRIVATE_EXPEDITED, 0) = 0
membarrier(MEMBARRIER_CMD_PRIVATE_EXPEDITED, 0) = 0
rt_sigprocmask(SIG_SETMASK, [], NULL, 8) = 0
rt_sigprocmask(SIG_SETMASK, ~[RTMIN RT_1], [], 8) = 0
rt_sigprocmask(SIG_SETMASK, [], NULL, 8) = 0
rt_sigprocmask(SIG_SETMASK, ~[RTMIN RT_1], [], 8) = 0
rt_sigprocmask(SIG_SETMASK, [], NULL, 8) = 0
rt_sigprocmask(SIG_BLOCK, ~[RTMIN RT_1], [], 8) = 0
membarrier(MEMBARRIER_CMD_PRIVATE_EXPEDITED, 0) = 0
membarrier(MEMBARRIER_CMD_PRIVATE_EXPEDITED, 0) = 0
rt_sigprocmask(SIG_SETMASK, [], NULL, 8) = 0
rt_sigprocmask(SIG_SETMASK, ~[RTMIN RT_1], [], 8) = 0
rt_sigprocmask(SIG_SETMASK, [], NULL, 8) = 0
rt_sigprocmask(SIG_SETMASK, ~[RTMIN RT_1], [], 8) = 0
rt_sigprocmask(SIG_SETMASK, [], NULL, 8) = 0
rt_sigprocmask(SIG_BLOCK, ~[RTMIN RT_1], [], 8) = 0
membarrier(MEMBARRIER_CMD_PRIVATE_EXPEDITED, 0) = 0
membarrier(MEMBARRIER_CMD_PRIVATE_EXPEDITED, 0) = 0
rt_sigprocmask(SIG_SETMASK, [], NULL, 8) = 0
rt_sigprocmask(SIG_SETMASK, ~[RTMIN RT_1], [], 8) = 0
rt_sigprocmask(SIG_SETMASK, [], NULL, 8) = 0
rt_sigprocmask(SIG_SETMASK, ~[RTMIN RT_1], [], 8) = 0
rt_sigprocmask(SIG_SETMASK, [], NULL, 8) = 0
rt_sigprocmask(SIG_SETMASK, ~[RTMIN RT_1], [], 8) = 0
rt_sigprocmask(SIG_SETMASK, [], NULL, 8) = 0
rt_sigprocmask(SIG_SETMASK, ~[RTMIN RT_1], [], 8) = 0
rt_sigprocmask(SIG_SETMASK, [], NULL, 8) = 0
rt_sigprocmask(SIG_SETMASK, ~[RTMIN RT_1], [], 8) = 0
rt_sigprocmask(SIG_SETMASK, [], NULL, 8) = 0
rt_sigprocmask(SIG_SETMASK, ~[RTMIN RT_1], [], 8) = 0
rt_sigprocmask(SIG_SETMASK, [], NULL, 8) = 0
rt_sigprocmask(SIG_BLOCK, ~[RTMIN RT_1], [], 8) = 0
membarrier(MEMBARRIER_CMD_PRIVATE_EXPEDITED, 0) = 0
membarrier(MEMBARRIER_CMD_PRIVATE_EXPEDITED, 0) = 0
rt_sigprocmask(SIG_SETMASK, [], NULL, 8) = 0
rt_sigprocmask(SIG_SETMASK, ~[RTMIN RT_1], [], 8) = 0
rt_sigprocmask(SIG_SETMASK, [], NULL, 8) = 0
exit_group(0)                           = ?
+++ exited with 0 +++

@AlanGriffiths
Contributor Author

The hang doesn't reproduce when running under strace or gdb.

bors bot added a commit that referenced this issue Apr 9, 2020
1397: Fix Fedora rawhide build in CI r=wmww a=AlanGriffiths

Partial fix for Fedora rawhide build in CI. (Fixes: #1400, fixes: #1399)

1. Drop dependency that's unsatisfiable on Fedora rawhide (I think it's redundant)
2. Add a missing dependency
3. Clear WAYLAND_DISPLAY from the test environment
4. Disable a couple of tests that try to frig with /dev/random

That leaves us with a build, but hanging tests (#1405); that can be fixed later.

Co-authored-by: Alan Griffiths <alan@octopull.co.uk>
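
For anyone reading item 3 ("Clear WAYLAND_DISPLAY from the test environment"): below is a minimal sketch, assuming gtest and a made-up class name, of how such a clean-up could look. #1397 may well implement it differently.

// Sketch only (hypothetical, not the actual change in #1397): unset
// WAYLAND_DISPLAY before any test runs, so tests can't accidentally talk
// to a developer's running Wayland compositor.
#include <cstdlib>
#include <gtest/gtest.h>

namespace
{
struct ClearWaylandDisplay : testing::Environment
{
    void SetUp() override
    {
        unsetenv("WAYLAND_DISPLAY");   // a missing variable is not an error here
    }
};

// gtest takes ownership of the pointer and runs SetUp() before the first test.
[[maybe_unused]] auto const registered =
    testing::AddGlobalTestEnvironment(new ClearWaylandDisplay);
}
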
@AlanGriffiths
Contributor Author

OK, I've a nasty hack that appears to prevent this problem from manifesting, which means I have a theory of what's going wrong. (It is racy code.)

I'll try to create a tidy solution tomorrow.

@AlanGriffiths
Contributor Author

AlanGriffiths commented Apr 17, 2020

I'm going to park this and come back to it another day.

Where I've got to:

  1. The problem is in md::ThreadedDispatcher.
  2. It can be reproduced under the debugger with
    2.1. miral-test --gtest_filter=TestServer.connect_client_works --gtest_repeat=1000
  3. The failure scenario is:
    3.1. the thread in threadpool is waiting on poll();
    3.2. ~ThreadedDispatcher() calls thread_exiter->terminate_all_threads_async()
    3.3. poll() doesn't return, the threadpool thread doesn't exit; and,
    3.4. ~ThreadedDispatcher() hangs on join()

There are a couple of places in the system that use md::ThreadedDispatcher, and both can exhibit this failure mode.

The frequency of failures can be significantly reduced (failing in 10s of iterations => failing in 100s of iterations), for example by calling terminate_all_threads_async() twice. (Or by running under strace or the debugger.)

This is highly suggestive that our use of eventfd() is racy, but I've yet to spot why.

We can get out of this failure state by sending SIGSTOP & SIGCONT, or by suspending the process from the terminal with ^Z and bringing it back to the foreground with fg.

I also note that we don't seem to use more than one thread in the pool, so maybe this is all overengineered.
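
For illustration, here is a minimal C++ sketch of the poll()/eventfd shutdown pattern described in 3.1-3.4. This is not Mir's ThreadedDispatcher (the class name is made up and error handling is omitted); it only shows where each step sits. In this shape, join() can only hang if the eventfd wakeup never reaches the worker's poll(), which is the kind of race suspected above.

// Sketch only: a worker blocked in poll() is asked to exit by a write to an
// eventfd; the destructor then joins the worker.
#include <poll.h>
#include <sys/eventfd.h>
#include <unistd.h>
#include <cstdint>
#include <thread>

class SketchDispatcher   // hypothetical name, for illustration
{
public:
    explicit SketchDispatcher(int dispatchee_fd)
        : shutdown_fd{eventfd(0, EFD_CLOEXEC)},
          worker{[this, dispatchee_fd]
          {
              pollfd fds[2] = {{dispatchee_fd, POLLIN, 0}, {shutdown_fd, POLLIN, 0}};
              for (;;)
              {
                  poll(fds, 2, -1);                     // 3.1: worker waits here
                  if (fds[1].revents & POLLIN) return;  // told to exit
                  // ...otherwise dispatch whatever is ready on dispatchee_fd...
              }
          }}
    {
    }

    ~SketchDispatcher()
    {
        std::uint64_t const one = 1;
        [[maybe_unused]] auto const n =
            write(shutdown_fd, &one, sizeof one);  // 3.2: the "terminate" wakeup
        worker.join();                             // 3.4: hangs if 3.3 happens
        close(shutdown_fd);
    }

private:
    int shutdown_fd;      // initialised before the worker thread starts
    std::thread worker;
};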

@AlanGriffiths
Contributor Author

I wonder if this could also be the cause of occasional failures on Ubuntu CI?

@AlanGriffiths AlanGriffiths removed their assignment Apr 17, 2020
@AlanGriffiths
Contributor Author

I couldn't leave it alone! I tried on Ubuntu 20.04...

$ miral-test --gtest_filter=TestServer.connect_client_works --gtest_repeat=1000
...
Repeating all tests (iteration 1000) . . .

Note: Google Test filter = TestServer.connect_client_works
[==========] Running 1 test from 1 test suite.
[----------] Global test environment set-up.
[----------] 1 test from TestServer
[ RUN      ] TestServer.connect_client_works
ERROR: /home/alan/display_server/mir/src/common/dispatch/multiplexing_dispatchable.cpp(80): Throw in function mir::dispatch::MultiplexingDispatchable::MultiplexingDispatchable()
Dynamic exception type: boost::wrapexcept<std::system_error>
std::exception::what: Failed to create epoll monitor: Too many open files

unknown file: Failure
C++ exception with description "Failed to start server thread" thrown in SetUp().
[  FAILED  ] TestServer.connect_client_works (20100 ms)
[----------] 1 test from TestServer (20100 ms total)

[----------] Global test environment tear-down
[==========] 1 test from 1 test suite ran. (20100 ms total)
[  PASSED  ] 0 tests.
[  FAILED  ] 1 test, listed below:
[  FAILED  ] TestServer.connect_client_works

 1 FAILED TEST

Clearly, we leak file handles here - it would be nice if that turns out to be related.
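
A small sketch, assuming Linux's /proc filesystem, of one way to watch that leak: count the entries in /proc/self/fd after each iteration and see whether the number keeps climbing. The helper below is made up for illustration and is not part of the Mir tests.

// Count this process's open file descriptors via /proc/self/fd.
#include <cstddef>
#include <filesystem>
#include <iostream>

std::size_t open_fd_count()
{
    std::size_t count = 0;
    for ([[maybe_unused]] auto const& entry :
         std::filesystem::directory_iterator{"/proc/self/fd"})
    {
        ++count;
    }
    // The iterator itself holds one fd, so compare deltas between
    // iterations rather than absolute values.
    return count;
}

int main()
{
    std::cout << "open fds: " << open_fd_count() << '\n';
}

A count that grows by a fixed amount per --gtest_repeat iteration would point at per-test setup/teardown rather than a one-off.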

@RAOF
Contributor

RAOF commented Jun 1, 2020

Was bors a bit eager there, or does the LTTNG thing actually fix the Fedora hangs?

@AlanGriffiths
Contributor Author

> Was bors a bit eager there, or does the LTTNG thing actually fix the Fedora hangs?

Now you ask, I realise it was me that was a bit eager - I tried reproducing with #1496 and couldn't. What I failed to do was confirm that I could still reproduce without it. Either way, this is fixed.
