Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

test stdout is sometimes missing #4608

Closed
ola-rozenfeld opened this issue Feb 8, 2018 · 27 comments
Closed

test stdout is sometimes missing #4608

ola-rozenfeld opened this issue Feb 8, 2018 · 27 comments
Assignees
Labels
P1 I'll work on this now. (Assignee required) team-Remote-Exec Issues and PRs for the Execution (Remote) team type: bug

Comments

@ola-rozenfeld
Copy link
Contributor

There is a race condition in test-setup.sh which will cause the test output to occasionally be (partially or fully) missing, from either the test log or from the test.xml, or both.

Sample Tensorflow tests revealed the stdout missing from xml in as much as 40% of the cases, although usual incidence for small tests is ~1%. Internal issue: b/65977241.

@ulfjack
Copy link
Contributor

ulfjack commented Feb 9, 2018

After a lot of offline discussion, we think that the principled fix is to split test execution and test.xml generation (only to be executed if the test did not generate a test.xml file). I have a patch for that, but unfortunately, it's causing problems elsewhere. I have a patch for that, except it's causing problems in yet another place.

As a short-term solution, we're going to put in a Linux-specific fix, which is the most critical case right now.

bazel-io pushed a commit that referenced this issue Feb 9, 2018
Progress on #4608.

PiperOrigin-RevId: 185126689
bazel-io pushed a commit that referenced this issue Feb 12, 2018
…here "tail --pid" is supported.

The solutions aren't mine, the new test was taken from Ola's unknown commit and the way to avoid race condition courtesy of sethkoehler@

Mitigates #4608 for compatible Linux systems.

TESTED=manual scripts and new test case.
RELNOTES: None
PiperOrigin-RevId: 185374273
katre pushed a commit that referenced this issue Feb 12, 2018
…here "tail --pid" is supported.

The solutions aren't mine, the new test was taken from Ola's unknown commit and the way to avoid race condition courtesy of sethkoehler@

Mitigates #4608 for compatible Linux systems.

TESTED=manual scripts and new test case.
RELNOTES: None
PiperOrigin-RevId: 185374273
bazel-io pushed a commit that referenced this issue Feb 13, 2018
*** Reason for rollback ***

#4625

What I thought was a short fix is turning into a long hunt, so I better roll this back to get the build green again.

I'm not yet 100% certain what the interactions are, but there's a chance that it's back to the drawing board.

*** Original change description ***

Fixing test-setup.sh occasionally missing stdout/stderr, on systems where "tail --pid" is supported.

The solutions aren't mine, the new test was taken from Ola's unknown commit and the way to avoid race condition courtesy of sethkoehler@

Mitigates #4608 for compatible Linux systems.

TESTED=manual scripts and new test case.
RELNOTES: None
PiperOrigin-RevId: 185482604
bazel-io pushed a commit that referenced this issue Feb 20, 2018
…ess in a sub-shell.

Apparently, nested background processes interfere with SIGINT handling in bash. I don't 100% understand why and how, but I do have a small bash script that demonstrates the problem: script A that spawns a background process, sends it a SIGINT, and verifies it was received. The script works, *unless* run in the background by a process B; this extra layer of backgrounding cause process A's logic to stop working. See experimental/users/olaola/shell/ for examples. See also https://stackoverflow.com/questions/48847722/nested-background-processes-and-sigint-handling

*** Original change description ***

Fixing test-setup.sh occasionally missing stdout/stderr, on systems where "tail --pid" is supported.

The solutions aren't mine, the new test was taken from Ola's unknown commit and the way to avoid race condition courtesy of sethkoehler@

Mitigates #4608 for compatible Linux systems.

TESTED=presubmits, manual shell tests on new bazel
RELNOTES: None
PiperOrigin-RevId: 186312008
katre pushed a commit that referenced this issue Feb 20, 2018
…here "tail --pid" is supported.

The solutions aren't mine, the new test was taken from Ola's unknown commit and the way to avoid race condition courtesy of sethkoehler@

Mitigates #4608 for compatible Linux systems.

TESTED=manual scripts and new test case.
RELNOTES: None
PiperOrigin-RevId: 185374273
katre pushed a commit that referenced this issue Feb 20, 2018
*** Reason for rollback ***

#4625

What I thought was a short fix is turning into a long hunt, so I better roll this back to get the build green again.

I'm not yet 100% certain what the interactions are, but there's a chance that it's back to the drawing board.

*** Original change description ***

Fixing test-setup.sh occasionally missing stdout/stderr, on systems where "tail --pid" is supported.

The solutions aren't mine, the new test was taken from Ola's unknown commit and the way to avoid race condition courtesy of sethkoehler@

Mitigates #4608 for compatible Linux systems.

TESTED=manual scripts and new test case.
RELNOTES: None
PiperOrigin-RevId: 185482604
katre pushed a commit that referenced this issue Feb 20, 2018
…ess in a sub-shell.

Apparently, nested background processes interfere with SIGINT handling in bash. I don't 100% understand why and how, but I do have a small bash script that demonstrates the problem: script A that spawns a background process, sends it a SIGINT, and verifies it was received. The script works, *unless* run in the background by a process B; this extra layer of backgrounding cause process A's logic to stop working. See experimental/users/olaola/shell/ for examples. See also https://stackoverflow.com/questions/48847722/nested-background-processes-and-sigint-handling

*** Original change description ***

Fixing test-setup.sh occasionally missing stdout/stderr, on systems where "tail --pid" is supported.

The solutions aren't mine, the new test was taken from Ola's unknown commit and the way to avoid race condition courtesy of sethkoehler@

Mitigates #4608 for compatible Linux systems.

TESTED=presubmits, manual shell tests on new bazel
RELNOTES: None
PiperOrigin-RevId: 186312008
@katre
Copy link
Member

katre commented Feb 22, 2018

Is the remaining work on this a release blocker for 0.11.0?

@ola-rozenfeld
Copy link
Contributor Author

No, as long as 9fe02aa is in 0.11.0, we're good. the remaining work will take a few more weeks.

@katre
Copy link
Member

katre commented Feb 22, 2018

That's fine, I am removing the release blocker label.

katre pushed a commit that referenced this issue Feb 28, 2018
…here "tail --pid" is supported.

The solutions aren't mine, the new test was taken from Ola's unknown commit and the way to avoid race condition courtesy of sethkoehler@

Mitigates #4608 for compatible Linux systems.

TESTED=manual scripts and new test case.
RELNOTES: None
PiperOrigin-RevId: 185374273
katre pushed a commit that referenced this issue Feb 28, 2018
…ess in a sub-shell.

Apparently, nested background processes interfere with SIGINT handling in bash. I don't 100% understand why and how, but I do have a small bash script that demonstrates the problem: script A that spawns a background process, sends it a SIGINT, and verifies it was received. The script works, *unless* run in the background by a process B; this extra layer of backgrounding cause process A's logic to stop working. See experimental/users/olaola/shell/ for examples. See also https://stackoverflow.com/questions/48847722/nested-background-processes-and-sigint-handling

*** Original change description ***

Fixing test-setup.sh occasionally missing stdout/stderr, on systems where "tail --pid" is supported.

The solutions aren't mine, the new test was taken from Ola's unknown commit and the way to avoid race condition courtesy of sethkoehler@

Mitigates #4608 for compatible Linux systems.

TESTED=presubmits, manual shell tests on new bazel
RELNOTES: None
PiperOrigin-RevId: 186312008
philwo pushed a commit that referenced this issue Mar 6, 2018
…here "tail --pid" is supported.

The solutions aren't mine, the new test was taken from Ola's unknown commit and the way to avoid race condition courtesy of sethkoehler@

Mitigates #4608 for compatible Linux systems.

TESTED=manual scripts and new test case.
RELNOTES: None
PiperOrigin-RevId: 185374273
philwo pushed a commit that referenced this issue Mar 6, 2018
…ess in a sub-shell.

Apparently, nested background processes interfere with SIGINT handling in bash. I don't 100% understand why and how, but I do have a small bash script that demonstrates the problem: script A that spawns a background process, sends it a SIGINT, and verifies it was received. The script works, *unless* run in the background by a process B; this extra layer of backgrounding cause process A's logic to stop working. See experimental/users/olaola/shell/ for examples. See also https://stackoverflow.com/questions/48847722/nested-background-processes-and-sigint-handling

*** Original change description ***

Fixing test-setup.sh occasionally missing stdout/stderr, on systems where "tail --pid" is supported.

The solutions aren't mine, the new test was taken from Ola's unknown commit and the way to avoid race condition courtesy of sethkoehler@

Mitigates #4608 for compatible Linux systems.

TESTED=presubmits, manual shell tests on new bazel
RELNOTES: None
PiperOrigin-RevId: 186312008
@ulfjack ulfjack assigned ulfjack and unassigned agoulti Mar 7, 2018
@ulfjack
Copy link
Contributor

ulfjack commented Mar 7, 2018

Assigning to myself for the long-term fix.

@Reflexe
Copy link
Contributor

Reflexe commented Apr 4, 2018

I looked into the issue a bit with strace. Looked like bash is not reliable at catching SIGINT at all - it just keeps blocking it on some occasions with rt_sigprocmask(2).

I'm not very familiar with Bazel internals but is there a good reason for using bash for this task? (Sometimes the short way is actually the long way).

@agoulti
Copy link

agoulti commented Apr 4, 2018

Thanks for looking into it!

The "short way" implementation is in the works.
The long-term fix that Ulf is working on is exactly about removing this functionality from the Bash script and moving it into Bazel itself.

Unfortunately, some non-trivial refactoring is involved so it's not coming as fast as we'd like - that's the main reason for the stop-gap fix in 9fe02aa

@buchgr
Copy link
Contributor

buchgr commented Sep 11, 2018

Do you have a CL number of the change that needs to be rolled forward? Will you take care of this or should I?

@ulfjack
Copy link
Contributor

ulfjack commented Sep 11, 2018

cl/207084179 needs to be rolled back (i.e., the original change is cl/205629237 and needs to be rolled forward). I already have a rollback of the rollback, but we were unable to determine how the change could have caused the problems that led to the rollback. @meisterT was pretty confident that the original change caused a problem. We later found another change that caused a similar problem, but it's unclear if that was the same problem.

@ulfjack
Copy link
Contributor

ulfjack commented Sep 11, 2018

There was an intermediate change by Janak which may have resolved the underlying problem - cl/210731120.

@ulfjack
Copy link
Contributor

ulfjack commented Sep 11, 2018

It's unlikely that I'll have any time this week to work on it.

@ulfjack
Copy link
Contributor

ulfjack commented Oct 24, 2018

So, the rollback of the rollback is in, guarded with a flag: f2b260c

Unfortunately, @meisterT was completely correct that the change was broken. Fortunately, we managed to track it down and I just submitted a fix: 9de2ea5

(See the description for a long explanation of what was going wrong.)

If that second change sticks, then we can move forward with flipping the flag introduced in the first change and backing out the temporary code. That should happen some time next week. Assuming that that also sticks, i.e., we didn't overlook something else, we're done with the second-level Yak shaving.

That means we can go up one level and flip --experimental_split_xml_generation to true. If next week looks good, then we can tackle that the week after. If that also sticks, then we can clean up the temporary code for that, and finally close this bug.

@ulfjack
Copy link
Contributor

ulfjack commented Nov 20, 2018

Well, it took me a bit longer but I have a patch to flip the flag, as well as a patch to back out the temporary code. They should get submitted this week.

@ulfjack
Copy link
Contributor

ulfjack commented Nov 20, 2018

To clarify: I already flipped the flag for all Google builds and we have not observed any problems. My patch is to flip the flag in Bazel as well, and then to back out the temporary code.

bazel-io pushed a commit that referenced this issue Nov 20, 2018
@ulfjack
Copy link
Contributor

ulfjack commented Nov 23, 2018

Temporary code removed in b7ae7b0. Next up is to flip the --experimental_split_xml_generation flag.

bazel-io pushed a commit that referenced this issue Nov 29, 2018
Otherwise the test name is not generated correctly. This is caught by
our existing integration tests, and I'm adding an assertion to our unit
tests as well.

Progress on #4608.

PiperOrigin-RevId: 223338264
bazel-io pushed a commit that referenced this issue Nov 30, 2018
Progress on #4608.

PiperOrigin-RevId: 223497462
@ulfjack
Copy link
Contributor

ulfjack commented Jan 11, 2019

This got delayed due to me being on vacation for all of December. Unfortunately, I just found yet another issue. As far as I can tell, our code to replace invalid xml characters doesn't actually work right now (at least with some versions of perl). On the plus side, I can now confirm that the upcoming replacement is equivalent to what we're doing right now.

@ulfjack
Copy link
Contributor

ulfjack commented Jan 11, 2019

Make that two issues - there's still a bug in the code for which I have a fix, and the current XML character replacement function doesn't work, which I want to replace anyway.

Ok, make that three issues - the test doesn't actually test whether XML character replacement works because it's broken.

@ulfjack
Copy link
Contributor

ulfjack commented Jan 11, 2019

Actually, four - the script is silently ignoring the non-zero exit code from perl.

bazel-io pushed a commit that referenced this issue Jan 14, 2019
test-setup.sh was setting a trap to call write_xml_output_file, but
there's no test.log.xml if split xml generation is enabled.

Add a test case for this and comment out a part of the test case that is
not actually working (the test passes because it ignores the exit code
from perl).

Progress on #4608.

PiperOrigin-RevId: 229167973
@buchgr buchgr removed their assignment Jan 16, 2019
@buchgr buchgr added P2 We'll consider working on this in future. (Assignee optional) team-Remote-Exec Issues and PRs for the Execution (Remote) team and removed P1 I'll work on this now. (Assignee required) category: misc > testing labels Jan 16, 2019
@ulfjack ulfjack added P1 I'll work on this now. (Assignee required) and removed P2 We'll consider working on this in future. (Assignee optional) labels Jan 21, 2019
@ulfjack
Copy link
Contributor

ulfjack commented Jan 25, 2019

The good news is that I fixed most of the issues mentioned above, at least those that are critical for rolling this out.

However, I found yet another issue with signal handling, which apparently isn't working correctly.

bazel-io pushed a commit that referenced this issue Feb 1, 2019
If we don't set a trap here, then bash ignores the signal, and the test
process also does not receive the signal, so the test runner has no
chance of writing a test.xml output.

However, the behavior of trap forwarding the signal to the subprocess is
not at all documented in the bash documentation, and also inconsistent
with the behavior reported in #7119.

There is a similar problem in the Java stub template reported in #6338.

This may or may not be progress on #4608.

PiperOrigin-RevId: 232035930
@ulfjack
Copy link
Contributor

ulfjack commented Feb 1, 2019

It looks like there may still be an issue with remote execution.

bazel-io pushed a commit that referenced this issue Feb 4, 2019
It is possible for Bazel to run test spawns that generate no output files (only stdout/stderr). Leave any such validation to the higher level.

This was failing with --experimental_split_xml_generation.

Progress on #4608.

PiperOrigin-RevId: 232295958
@ulfjack
Copy link
Contributor

ulfjack commented Feb 4, 2019

It's in! It's amazing! (But don't celebrate yet, it could still be rolled back...)

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
P1 I'll work on this now. (Assignee required) team-Remote-Exec Issues and PRs for the Execution (Remote) team type: bug
Projects
None yet
Development

No branches or pull requests

8 participants