Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Running the nrunner with tests nested in other frameworks hangs [99.0 regression] #5521

Closed
eskultety opened this issue Nov 15, 2022 · 4 comments · Fixed by #5567
Closed

Running the nrunner with tests nested in other frameworks hangs [99.0 regression] #5521

eskultety opened this issue Nov 15, 2022 · 4 comments · Fixed by #5567
Assignees
Labels

Comments

@eskultety
Copy link

eskultety commented Nov 15, 2022

Describe the bug
When running the upstream libvirt TCK suite with avocado 99.0 some of the tests will hang until a timeout is reached in Avocado. Very specifically the tests that hang are bash scripts wrapped by the TCK framework, to illustrate the situation with frameworks in libvirt:

avocado-> libvirt-tck framework -> nwfilter test framework -> Bash scripts
                     |
                     +-> native TCK Perl tests

The native TCK Perl tests run fine, the problem occurs only when running the double wrapped Bash scripts.
An example of a problematic test:

$ sudo avocado --config avocado.config run --job-results-dir /avocado ./scripts/nwfilter/050-apply-verify-host.t --job-timeout 65s
JOB ID     : 26c45d41811cccb65db216133003a8632cc412fd
JOB LOG    : /avocado/job-2022-11-15T15.30-26c45d4/job.log
 (1/1) ./scripts/nwfilter/050-apply-verify-host.t: STARTED

RESULTS    : PASS 0 | ERROR 0 | FAIL 0 | SKIP 1 | WARN 0 | INTERRUPT 0 | CANCEL 0
JOB TIME   : 65.31 s

The problem doesn't occur in Avocado 98.0

Steps to reproduce

  1. Install libvirt RPMs (you can use fresh RPMs from here
  2. Install libvirt perl RPMs (preferably use the fresh RPMs from here
  3. install Avocado 99.0
  4. Clone libvirt-tck repo
  5. from the cloned repo run sudo avocado --config avocado.config run --job-results-dir /avocado ./scripts/nwfilter/050-apply-verify-host.t --job-timeout 65s

Expected behavior
All jobs in the test suite finish execution, i.e. don't hang

Current behavior
The job above hangs until the timeout is reached

System information (please complete the following information):

  • OS: doesn't matter (I used CentOS Stream 9)
  • Avocado version: 99.0
  • Avocado installation method: pip, git

Additional information
I bisected the issue to the following commit: 0a8178d

I only bisected it once, so take ^this with a grain of salt. FWIW I used this stupid bisect script which yielded the commit above:

#!/bin/bash

AVOCADO_DIR=/home/gitlab-runner/avocado.git
TCK_DIR=/tmp/libvirt-tck
pushd . &>/dev/null

cd $AVOCADO_DIR
sudo python3 setup.py install --prefix=/usr

cd $TCK_DIR
sudo avocado --config avocado.config run --job-results-dir /avocado ./scripts/nwfilter/050-apply-verify-host.t --job-timeout 65s
rc=$?
if [ $rc -ne 0 ]; then
    for n in $(sudo virsh list --name);do sudo virsh destroy $n; done
    sudo virsh nwfilter-undefine tck-testcase
    sudo virsh nwfilter-undefine nwfiltertestfilter
fi

sudo rm -f /usr/bin/avocado*
sudo rm -rf /usr/lib/python3.9/site-packages/avocado*

popd &>/dev/null

exit $rc

@clebergnu
Copy link
Contributor

Hi @eskultety, thanks for the very detailed reporting!

To attempt to rule out libvirt-tck specific details, I wrote the following "mock TAP test":

#!/bin/sh
sleep 10&
cat <<EOF
1..1
ok 1 /fake/test
EOF

The goal was to check if Avocado was waiting or not for child processes created by tests (sleep 10 in this case). So far, I have not seen a difference in behavior from 98.0 and 99.0 in that aspect (they both wait on the children processes, sleep in this case). 98.0 used to use communicate() while 99.0 uses poll().

I'll continue to investigate this, but it certainly opens up a separate discussion (apart from this issue still under investigation) that I'd like to get your feedback on:

Do you anticipate situation where a test will spawn processes that will live beyond the test, and that should not be waited on by the test runner?

Second, do you believe it's possible that those specific libvirt-tck tests are doing something along these lines, and, across executions with different Avocado versions, some executions needed to spawn these processes and others didn't?

@eskultety
Copy link
Author

Hi @eskultety, thanks for the very detailed reporting!

To attempt to rule out libvirt-tck specific details, I wrote the following "mock TAP test":

#!/bin/sh
sleep 10&
cat <<EOF
1..1
ok 1 /fake/test
EOF

The goal was to check if Avocado was waiting or not for child processes created by tests (sleep 10 in this case). So far, I have not seen a difference in behavior from 98.0 and 99.0 in that aspect (they both wait on the children processes, sleep in this case). 98.0 used to use communicate() while 99.0 uses poll().

I'll continue to investigate this, but it certainly opens up a separate discussion (apart from this issue still under investigation) that I'd like to get your feedback on:

Do you anticipate situation where a test will spawn processes that will live beyond the test, and that should not be waited on by the test runner?

No, that would be IMO spurious behavior, all resources used by the test suite should be cleaned afterwards with no remnants.

Second, do you believe it's possible that those specific libvirt-tck tests are doing something along these lines, and, across executions with different Avocado versions, some executions needed to spawn these processes and others didn't?

Hmm, not sure I fully understand the question, but let me answer "No, I don't believe it is" simply because I think we would have caught such behavior already as these tests have not been changed for a very long time. However, at the same time, let me say that it would help us a lot if we could get any kind of output from those tests, IOW if you run the 050-apply-verify-host.t via TCK, then you'll see the progress of the test as the bash script is printing out the results - it's an absolute nightmare to debug, but you at least know where the problem occurred and so you can sleep the process in the right moment and investigate. That's not the case when run through Avocado, no stdout nor stderr files are present when the test hangs and debug.log is empty, so even if I wanted to help you in the investigation in parallel, I don't really know where to start looking because I have no idea at which step the test hung since there's no output.

@clebergnu
Copy link
Contributor

OK, I have definitely found the root cause of this issue, and it's actually linked to your debugging complaint. I'll follow up with a plan to resolve resolve it.

@clebergnu
Copy link
Contributor

@eskultety an update here. The core issue is indeed a difference between communicate() and poll(), but no wrt children processes. It has to do with a thread that is started by communicate() but not by poll() that drains the stdout and stderr (a PIPE in this case).

The ./scripts/nwfilter/050-apply-verify-host.t test actually outputs more than the default PIPE size (64kb in my experience). Python 3.10 and later allows for the configuration of the PIPE size, but even if that was easily available across all Python versions, I don't think it'd be the best solution.

Because of the debugging aspects you mentioned, I'm looking into making stdout and stderr streamable (from the lower level avocado-runner-{exec-test,tap} into avocado (run) proper). I've written a rough PoC that covers the avocado-runner-{exec-test,tap} layer only. The avocado (run) side is missing in that PoC, and my actual proposal will probably be quite different.

Anyway, that's only to keep you in the loop and let you know that I aim for both the bugfix and the "streamable" logs to be in for version 100.0.

@richtja richtja added this to Long Term (Next Q) Backlog in Avocado Kanban via automation Nov 28, 2022
@richtja richtja added this to the #100 (the 100) milestone Nov 28, 2022
@richtja richtja removed the triage label Nov 28, 2022
@richtja richtja moved this from Long Term (Next Q) Backlog to In progress in Avocado Kanban Dec 6, 2022
clebergnu added a commit to clebergnu/avocado that referenced this issue Dec 21, 2022
Most of the information that generic executables (exec-test) tests
generate will be done with their STDOUT and STDERR.

Avocado, since 0a8178d, inadvertently reduced the size capacity when
keeping and storing the STDOUT and STDERR.  Still, limits will always
exist, either in memory buffer sizes, network buffer sizes, or even
filesystem file size limits or overall storage resources.

This change increases considerably the limits of an "exec-test" by
using, when available, the given "output_dir" parameter to a runnable
or task.  Whenever a test is run under "avocado run", a task will be
used, and a task always has an "output_dir", so, unless one is running
"avocado-runner-exec-test" manually and *not* providing an output_dir,
the STDOUT and STDERR limits are now the filesystem limits.

The tests document the existing limits:

 * 64kib when using PIPEs
 * Considerably larger limits (filesystem ones) when using an
   output_dir

It also "documents" a new behavior introduced here that produces an
error if an output_dir is attempted to be reused, to avoid overwriting
unintended stdout/stderr files.

Fixes: avocado-framework#5521
Signed-off-by: Cleber Rosa <crosa@redhat.com>
clebergnu added a commit to clebergnu/avocado that referenced this issue Dec 22, 2022
Most of the information that generic executables (exec-test) tests
generate will be done with their STDOUT and STDERR.

Avocado, since 0a8178d, inadvertently reduced the size capacity when
keeping and storing the STDOUT and STDERR.  Still, limits will always
exist, either in memory buffer sizes, network buffer sizes, or even
filesystem file size limits or overall storage resources.

This change increases considerably the limits of an "exec-test" by
using, when available, the given "output_dir" parameter to a runnable
or task.  Whenever a test is run under "avocado run", a task will be
used, and a task always has an "output_dir", so, unless one is running
"avocado-runner-exec-test" manually and *not* providing an output_dir,
the STDOUT and STDERR limits are now the filesystem limits.

The tests document the existing limits:

 * 64kib when using PIPEs
 * Considerably larger limits (filesystem ones) when using an
   output_dir

It also "documents" a new behavior introduced here that produces an
error if an output_dir is attempted to be reused, to avoid overwriting
unintended stdout/stderr files.

Fixes: avocado-framework#5521
Signed-off-by: Cleber Rosa <crosa@redhat.com>
clebergnu added a commit to clebergnu/avocado that referenced this issue Dec 22, 2022
Most of the information that generic executables (exec-test) tests
generate will be done with their STDOUT and STDERR.

Avocado, since 0a8178d, inadvertently reduced the size capacity when
keeping and storing the STDOUT and STDERR.  Still, limits will always
exist, either in memory buffer sizes, network buffer sizes, or even
filesystem file size limits or overall storage resources.

This change increases considerably the limits of an "exec-test" by
using, when available, the given "output_dir" parameter to a runnable
or task.  Whenever a test is run under "avocado run", a task will be
used, and a task always has an "output_dir", so, unless one is running
"avocado-runner-exec-test" manually and *not* providing an output_dir,
the STDOUT and STDERR limits are now the filesystem limits.

The tests document the existing limits:

 * 64kib when using PIPEs
 * Considerably larger limits (filesystem ones) when using an
   output_dir

It also "documents" a new behavior introduced here that produces an
error if an output_dir is attempted to be reused, to avoid overwriting
unintended stdout/stderr files.

Fixes: avocado-framework#5521
Signed-off-by: Cleber Rosa <crosa@redhat.com>
clebergnu added a commit to clebergnu/avocado that referenced this issue Dec 22, 2022
Most of the information that generic executables (exec-test) tests
generate will be done with their STDOUT and STDERR.

Avocado, since 0a8178d, inadvertently reduced the size capacity when
keeping and storing the STDOUT and STDERR.  Still, limits will always
exist, either in memory buffer sizes, network buffer sizes, or even
filesystem file size limits or overall storage resources.

This change increases considerably the limits of an "exec-test" by
using, when available, the given "output_dir" parameter to a runnable
or task.  Whenever a test is run under "avocado run", a task will be
used, and a task always has an "output_dir", so, unless one is running
"avocado-runner-exec-test" manually and *not* providing an output_dir,
the STDOUT and STDERR limits are now the filesystem limits.

The tests document the existing limits:

 * 64kib when using PIPEs
 * Considerably larger limits (filesystem ones) when using an
   output_dir

It also "documents" a new behavior introduced here that produces an
error if an output_dir is attempted to be reused, to avoid overwriting
unintended stdout/stderr files.

Fixes: avocado-framework#5521
Signed-off-by: Cleber Rosa <crosa@redhat.com>
clebergnu added a commit to clebergnu/avocado that referenced this issue Dec 22, 2022
Most of the information that generic executables (exec-test) tests
generate will be done with their STDOUT and STDERR.

Avocado, since 0a8178d, inadvertently reduced the size capacity when
keeping and storing the STDOUT and STDERR.  Still, limits will always
exist, either in memory buffer sizes, network buffer sizes, or even
filesystem file size limits or overall storage resources.

This change increases considerably the limits of an "exec-test" by
using, when available, the given "output_dir" parameter to a runnable
or task.  Whenever a test is run under "avocado run", a task will be
used, and a task always has an "output_dir", so, unless one is running
"avocado-runner-exec-test" manually and *not* providing an output_dir,
the STDOUT and STDERR limits are now the filesystem limits.

The tests document the existing limits:

 * 64kib when using PIPEs
 * Considerably larger limits (filesystem ones) when using an
   output_dir

It also "documents" a new behavior introduced here that produces an
error if an output_dir is attempted to be reused, to avoid overwriting
unintended stdout/stderr files.

Fixes: avocado-framework#5521
Signed-off-by: Cleber Rosa <crosa@redhat.com>
clebergnu added a commit to clebergnu/avocado that referenced this issue Dec 22, 2022
Most of the information that generic executables (exec-test) tests
generate will be done with their STDOUT and STDERR.

Avocado, since 0a8178d, inadvertently reduced the size capacity when
keeping and storing the STDOUT and STDERR.  Still, limits will always
exist, either in memory buffer sizes, network buffer sizes, or even
filesystem file size limits or overall storage resources.

This change increases considerably the limits of an "exec-test" by
using, when available, the given "output_dir" parameter to a runnable
or task.  Whenever a test is run under "avocado run", a task will be
used, and a task always has an "output_dir", so, unless one is running
"avocado-runner-exec-test" manually and *not* providing an output_dir,
the STDOUT and STDERR limits are now the filesystem limits.

The tests document the existing limits:

 * 64kib when using PIPEs
 * Considerably larger limits (filesystem ones) when using an
   output_dir

It also "documents" a new behavior introduced here that produces an
error if an output_dir is attempted to be reused, to avoid overwriting
unintended stdout/stderr files.

Fixes: avocado-framework#5521
Signed-off-by: Cleber Rosa <crosa@redhat.com>
clebergnu added a commit to clebergnu/avocado that referenced this issue Dec 22, 2022
Most of the information that generic executables (exec-test) tests
generate will be done with their STDOUT and STDERR.

Avocado, since 0a8178d, inadvertently reduced the size capacity when
keeping and storing the STDOUT and STDERR.  Still, limits will always
exist, either in memory buffer sizes, network buffer sizes, or even
filesystem file size limits or overall storage resources.

This change increases considerably the limits of an "exec-test" by
using, when available, the given "output_dir" parameter to a runnable
or task.  Whenever a test is run under "avocado run", a task will be
used, and a task always has an "output_dir", so, unless one is running
"avocado-runner-exec-test" manually and *not* providing an output_dir,
the STDOUT and STDERR limits are now the filesystem limits.

The tests document the existing limits:

 * 64kib when using PIPEs
 * Considerably larger limits (filesystem ones) when using an
   output_dir

It also "documents" a new behavior introduced here that produces an
error if an output_dir is attempted to be reused, to avoid overwriting
unintended stdout/stderr files.

Fixes: avocado-framework#5521
Signed-off-by: Cleber Rosa <crosa@redhat.com>
Avocado Kanban automation moved this from In progress to Done #100(the 100) Dec 22, 2022
maramsmurthy pushed a commit to maramsmurthy/avocado that referenced this issue Sep 6, 2023
Most of the information that generic executables (exec-test) tests
generate will be done with their STDOUT and STDERR.

Avocado, since 0a8178d, inadvertently reduced the size capacity when
keeping and storing the STDOUT and STDERR.  Still, limits will always
exist, either in memory buffer sizes, network buffer sizes, or even
filesystem file size limits or overall storage resources.

This change increases considerably the limits of an "exec-test" by
using, when available, the given "output_dir" parameter to a runnable
or task.  Whenever a test is run under "avocado run", a task will be
used, and a task always has an "output_dir", so, unless one is running
"avocado-runner-exec-test" manually and *not* providing an output_dir,
the STDOUT and STDERR limits are now the filesystem limits.

The tests document the existing limits:

 * 64kib when using PIPEs
 * Considerably larger limits (filesystem ones) when using an
   output_dir

It also "documents" a new behavior introduced here that produces an
error if an output_dir is attempted to be reused, to avoid overwriting
unintended stdout/stderr files.

Fixes: avocado-framework#5521
Signed-off-by: Cleber Rosa <crosa@redhat.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment