Running the nrunner with tests nested in other frameworks hangs [99.0 regression] #5521

eskultety · 2022-11-15T16:23:53Z

Describe the bug
When running the upstream libvirt TCK suite with avocado 99.0 some of the tests will hang until a timeout is reached in Avocado. Very specifically the tests that hang are bash scripts wrapped by the TCK framework, to illustrate the situation with frameworks in libvirt:

avocado-> libvirt-tck framework -> nwfilter test framework -> Bash scripts
                     |
                     +-> native TCK Perl tests

The native TCK Perl tests run fine, the problem occurs only when running the double wrapped Bash scripts.
An example of a problematic test:

$ sudo avocado --config avocado.config run --job-results-dir /avocado ./scripts/nwfilter/050-apply-verify-host.t --job-timeout 65s
JOB ID     : 26c45d41811cccb65db216133003a8632cc412fd
JOB LOG    : /avocado/job-2022-11-15T15.30-26c45d4/job.log
 (1/1) ./scripts/nwfilter/050-apply-verify-host.t: STARTED

RESULTS    : PASS 0 | ERROR 0 | FAIL 0 | SKIP 1 | WARN 0 | INTERRUPT 0 | CANCEL 0
JOB TIME   : 65.31 s

The problem doesn't occur in Avocado 98.0

Steps to reproduce

Install libvirt RPMs (you can use fresh RPMs from here
Install libvirt perl RPMs (preferably use the fresh RPMs from here
install Avocado 99.0
Clone libvirt-tck repo
from the cloned repo run sudo avocado --config avocado.config run --job-results-dir /avocado ./scripts/nwfilter/050-apply-verify-host.t --job-timeout 65s

Expected behavior
All jobs in the test suite finish execution, i.e. don't hang

Current behavior
The job above hangs until the timeout is reached

System information (please complete the following information):

OS: doesn't matter (I used CentOS Stream 9)
Avocado version: 99.0
Avocado installation method: pip, git

Additional information
I bisected the issue to the following commit: 0a8178d

I only bisected it once, so take ^this with a grain of salt. FWIW I used this stupid bisect script which yielded the commit above:

#!/bin/bash

AVOCADO_DIR=/home/gitlab-runner/avocado.git
TCK_DIR=/tmp/libvirt-tck
pushd . &>/dev/null

cd $AVOCADO_DIR
sudo python3 setup.py install --prefix=/usr

cd $TCK_DIR
sudo avocado --config avocado.config run --job-results-dir /avocado ./scripts/nwfilter/050-apply-verify-host.t --job-timeout 65s
rc=$?
if [ $rc -ne 0 ]; then
    for n in $(sudo virsh list --name);do sudo virsh destroy $n; done
    sudo virsh nwfilter-undefine tck-testcase
    sudo virsh nwfilter-undefine nwfiltertestfilter
fi

sudo rm -f /usr/bin/avocado*
sudo rm -rf /usr/lib/python3.9/site-packages/avocado*

popd &>/dev/null

exit $rc

The text was updated successfully, but these errors were encountered:

clebergnu · 2022-11-16T14:52:28Z

Hi @eskultety, thanks for the very detailed reporting!

To attempt to rule out libvirt-tck specific details, I wrote the following "mock TAP test":

#!/bin/sh
sleep 10&
cat <<EOF
1..1
ok 1 /fake/test
EOF

The goal was to check if Avocado was waiting or not for child processes created by tests (sleep 10 in this case). So far, I have not seen a difference in behavior from 98.0 and 99.0 in that aspect (they both wait on the children processes, sleep in this case). 98.0 used to use communicate() while 99.0 uses poll().

I'll continue to investigate this, but it certainly opens up a separate discussion (apart from this issue still under investigation) that I'd like to get your feedback on:

Do you anticipate situation where a test will spawn processes that will live beyond the test, and that should not be waited on by the test runner?

Second, do you believe it's possible that those specific libvirt-tck tests are doing something along these lines, and, across executions with different Avocado versions, some executions needed to spawn these processes and others didn't?

eskultety · 2022-11-17T14:20:20Z

Hi @eskultety, thanks for the very detailed reporting!

To attempt to rule out libvirt-tck specific details, I wrote the following "mock TAP test":
#!/bin/sh
sleep 10&
cat <<EOF
1..1
ok 1 /fake/test
EOF
The goal was to check if Avocado was waiting or not for child processes created by tests (sleep 10 in this case). So far, I have not seen a difference in behavior from 98.0 and 99.0 in that aspect (they both wait on the children processes, sleep in this case). 98.0 used to use communicate() while 99.0 uses poll().

I'll continue to investigate this, but it certainly opens up a separate discussion (apart from this issue still under investigation) that I'd like to get your feedback on:

Do you anticipate situation where a test will spawn processes that will live beyond the test, and that should not be waited on by the test runner?

No, that would be IMO spurious behavior, all resources used by the test suite should be cleaned afterwards with no remnants.

Second, do you believe it's possible that those specific libvirt-tck tests are doing something along these lines, and, across executions with different Avocado versions, some executions needed to spawn these processes and others didn't?

Hmm, not sure I fully understand the question, but let me answer "No, I don't believe it is" simply because I think we would have caught such behavior already as these tests have not been changed for a very long time. However, at the same time, let me say that it would help us a lot if we could get any kind of output from those tests, IOW if you run the 050-apply-verify-host.t via TCK, then you'll see the progress of the test as the bash script is printing out the results - it's an absolute nightmare to debug, but you at least know where the problem occurred and so you can sleep the process in the right moment and investigate. That's not the case when run through Avocado, no stdout nor stderr files are present when the test hangs and debug.log is empty, so even if I wanted to help you in the investigation in parallel, I don't really know where to start looking because I have no idea at which step the test hung since there's no output.

clebergnu · 2022-11-17T21:08:57Z

OK, I have definitely found the root cause of this issue, and it's actually linked to your debugging complaint. I'll follow up with a plan to resolve resolve it.

clebergnu · 2022-11-21T15:14:29Z

@eskultety an update here. The core issue is indeed a difference between communicate() and poll(), but no wrt children processes. It has to do with a thread that is started by communicate() but not by poll() that drains the stdout and stderr (a PIPE in this case).

The ./scripts/nwfilter/050-apply-verify-host.t test actually outputs more than the default PIPE size (64kb in my experience). Python 3.10 and later allows for the configuration of the PIPE size, but even if that was easily available across all Python versions, I don't think it'd be the best solution.

Because of the debugging aspects you mentioned, I'm looking into making stdout and stderr streamable (from the lower level avocado-runner-{exec-test,tap} into avocado (run) proper). I've written a rough PoC that covers the avocado-runner-{exec-test,tap} layer only. The avocado (run) side is missing in that PoC, and my actual proposal will probably be quite different.

Anyway, that's only to keep you in the loop and let you know that I aim for both the bugfix and the "streamable" logs to be in for version 100.0.

Most of the information that generic executables (exec-test) tests generate will be done with their STDOUT and STDERR. Avocado, since 0a8178d, inadvertently reduced the size capacity when keeping and storing the STDOUT and STDERR. Still, limits will always exist, either in memory buffer sizes, network buffer sizes, or even filesystem file size limits or overall storage resources. This change increases considerably the limits of an "exec-test" by using, when available, the given "output_dir" parameter to a runnable or task. Whenever a test is run under "avocado run", a task will be used, and a task always has an "output_dir", so, unless one is running "avocado-runner-exec-test" manually and *not* providing an output_dir, the STDOUT and STDERR limits are now the filesystem limits. The tests document the existing limits: * 64kib when using PIPEs * Considerably larger limits (filesystem ones) when using an output_dir It also "documents" a new behavior introduced here that produces an error if an output_dir is attempted to be reused, to avoid overwriting unintended stdout/stderr files. Fixes: avocado-framework#5521 Signed-off-by: Cleber Rosa <crosa@redhat.com>

eskultety added bug triage labels Nov 15, 2022

clebergnu self-assigned this Nov 15, 2022

richtja added this to Long Term (Next Q) Backlog in Avocado Kanban via automation Nov 28, 2022

richtja added this to the #100 (the 100) milestone Nov 28, 2022

richtja removed the triage label Nov 28, 2022

richtja moved this from Long Term (Next Q) Backlog to In progress in Avocado Kanban Dec 6, 2022

clebergnu mentioned this issue Dec 21, 2022

exec-test: allow the use of runnable's output dir for stdout/stderr #5565

Closed

clebergnu mentioned this issue Dec 22, 2022

exec-test: allow the use of runnable's output dir for stdout/stderr [v2] #5566

Closed

clebergnu mentioned this issue Dec 22, 2022

exec-test: allow the use of runnable's output dir for stdout/stderr [v2] #5567

Merged

richtja closed this as completed in #5567 Dec 22, 2022

Avocado Kanban automation moved this from In progress to Done #100(the 100) Dec 22, 2022

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Running the nrunner with tests nested in other frameworks hangs [99.0 regression] #5521

Running the nrunner with tests nested in other frameworks hangs [99.0 regression] #5521

eskultety commented Nov 15, 2022 •

edited

clebergnu commented Nov 16, 2022

eskultety commented Nov 17, 2022

clebergnu commented Nov 17, 2022

clebergnu commented Nov 21, 2022

Running the nrunner with tests nested in other frameworks hangs [99.0 regression] #5521

Running the nrunner with tests nested in other frameworks hangs [99.0 regression] #5521

Comments

eskultety commented Nov 15, 2022 • edited

clebergnu commented Nov 16, 2022

eskultety commented Nov 17, 2022

clebergnu commented Nov 17, 2022

clebergnu commented Nov 21, 2022

eskultety commented Nov 15, 2022 •

edited