test stdout is sometimes missing #4608

ola-rozenfeld · 2018-02-08T16:31:06Z

There is a race condition in test-setup.sh that occasionally causes the test output to be partially or fully missing from the test log, from test.xml, or from both.

Sample TensorFlow tests showed stdout missing from the xml in as many as 40% of cases, although the usual incidence for small tests is ~1%. Internal issue: b/65977241.

ulfjack · 2018-02-09T10:08:12Z

After a lot of offline discussion, we think that the principled fix is to split test execution and test.xml generation (only to be executed if the test did not generate a test.xml file). I have a patch for that, but unfortunately, it's causing problems elsewhere. I have a patch for that, except it's causing problems in yet another place.

As a short-term solution, we're going to put in a Linux-specific fix, which is the most critical case right now.

Progress on #4608. PiperOrigin-RevId: 185126689

…here "tail --pid" is supported. The solutions aren't mine, the new test was taken from Ola's unknown commit and the way to avoid race condition courtesy of sethkoehler@ Mitigates #4608 for compatible Linux systems. TESTED=manual scripts and new test case. RELNOTES: None PiperOrigin-RevId: 185374273
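For readers unfamiliar with the option mentioned above: the following is a minimal sketch of what `tail --pid` does, not the code from the change itself. It assumes GNU coreutils `tail` and `sleep`; the `marker` file and the `writer` process are hypothetical stand-ins for a process that is still producing output.

```shell
#!/bin/bash
# Assumes GNU coreutils tail, which supports --pid.

marker="$(mktemp)"

# Stand-in for a slow writer: it finishes its output after a short delay.
( sleep 0.3; echo done > "$marker" ) &
writer=$!

# Block until the writer process has exited. -f follows /dev/null (which
# never grows), --pid makes tail stop once that process is gone, and -s sets
# the polling interval in seconds.
tail --pid="$writer" -s 0.1 -f /dev/null

contents="$(cat "$marker")"
echo "marker contents: $contents"
rm -f "$marker"
```

Unlike the `wait` builtin, `tail --pid` can also wait for a process that is not a child of the current shell, which is presumably why it is useful here.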

*** Reason for rollback *** #4625 What I thought was a short fix is turning into a long hunt, so I better roll this back to get the build green again. I'm not yet 100% certain what the interactions are, but there's a chance that it's back to the drawing board. *** Original change description *** Fixing test-setup.sh occasionally missing stdout/stderr, on systems where "tail --pid" is supported. The solutions aren't mine, the new test was taken from Ola's unknown commit and the way to avoid race condition courtesy of sethkoehler@ Mitigates #4608 for compatible Linux systems. TESTED=manual scripts and new test case. RELNOTES: None PiperOrigin-RevId: 185482604

…ess in a sub-shell. Apparently, nested background processes interfere with SIGINT handling in bash. I don't 100% understand why and how, but I do have a small bash script that demonstrates the problem: script A that spawns a background process, sends it a SIGINT, and verifies it was received. The script works, *unless* run in the background by a process B; this extra layer of backgrounding causes process A's logic to stop working. See experimental/users/olaola/shell/ for examples. See also https://stackoverflow.com/questions/48847722/nested-background-processes-and-sigint-handling *** Original change description *** Fixing test-setup.sh occasionally missing stdout/stderr, on systems where "tail --pid" is supported. The solutions aren't mine, the new test was taken from Ola's unknown commit and the way to avoid race condition courtesy of sethkoehler@ Mitigates #4608 for compatible Linux systems. TESTED=presubmits, manual shell tests on new bazel RELNOTES: None PiperOrigin-RevId: 186312008
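One likely contributor to the behavior described above is a POSIX rule: when job control is off (the default in scripts), a shell starts asynchronous commands with SIGINT and SIGQUIT ignored, and bash cannot trap or reset signals that were ignored when it started. The sketch below is not the script from experimental/users/olaola/shell/; it is a Linux-only illustration that reads the `SigIgn` mask from `/proc`, and the `probe` snippet is a hypothetical helper.

```shell
#!/bin/bash
# Linux-only: relies on the SigIgn field of /proc/<pid>/status.

# Hypothetical probe: prints "ignored" if the bash process running it has
# SIGINT set to SIG_IGN, and "not-ignored" otherwise. SIGINT is signal 2,
# which is bit 0x2 of the hexadecimal SigIgn mask.
probe='
while read -r key value; do
  [[ $key == SigIgn: ]] && mask=$value
done < /proc/$$/status
if (( 0x${mask: -1} & 2 )); then echo ignored; else echo not-ignored; fi
'

out="$(mktemp)"

# Foreground child: inherits whatever disposition this script itself has.
foreground="$(bash -c "$probe")"

# Background child: with job control off, the shell starts asynchronous
# commands with SIGINT ignored.
bash -c "$probe" > "$out" &
wait "$!"
background="$(cat "$out")"
rm -f "$out"

echo "foreground: $foreground"
echo "background: $background"
```

The background line should report `ignored`. The foreground line depends on how the script itself was launched, which is the nesting effect: a script that was started in the background already has SIGINT ignored, and passes that on to everything it runs.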

katre · 2018-02-22T20:45:08Z

Is the remaining work on this a release blocker for 0.11.0?

ola-rozenfeld · 2018-02-22T21:06:38Z

No, as long as 9fe02aa is in 0.11.0, we're good. The remaining work will take a few more weeks.

katre · 2018-02-22T21:07:06Z

That's fine, I am removing the release blocker label.

ulfjack · 2018-03-07T09:23:09Z

Assigning to myself for the long-term fix.

Reflexe · 2018-04-04T15:56:07Z

I looked into the issue a bit with strace. It looks like bash is not reliable at catching SIGINT at all: on some occasions it just keeps the signal blocked with rt_sigprocmask(2).

I'm not very familiar with Bazel internals, but is there a good reason for using bash for this task? (Sometimes the short way is actually the long way.)

agoulti · 2018-04-04T16:11:21Z

Thanks for looking into it!

The "short way" implementation is in the works.
The long-term fix that Ulf is working on is exactly about removing this functionality from the Bash script and moving it into Bazel itself.

Unfortunately, some non-trivial refactoring is involved, so it's not coming as fast as we'd like. That's the main reason for the stop-gap fix in 9fe02aa.

buchgr · 2018-09-11T12:10:55Z

Do you have a CL number of the change that needs to be rolled forward? Will you take care of this or should I?

ulfjack · 2018-09-11T12:30:21Z

cl/207084179 needs to be rolled back (i.e., the original change is cl/205629237 and needs to be rolled forward). I already have a rollback of the rollback, but we were unable to determine how the change could have caused the problems that led to the rollback. @meisterT was pretty confident that the original change caused a problem. We later found another change that caused a similar problem, but it's unclear if that was the same problem.

ulfjack · 2018-09-11T12:32:05Z

There was an intermediate change by Janak which may have resolved the underlying problem - cl/210731120.

ulfjack · 2018-09-11T12:32:55Z

It's unlikely that I'll have any time this week to work on it.

ulfjack · 2018-10-24T15:15:12Z

So, the rollback of the rollback is in, guarded with a flag: f2b260c

Unfortunately, @meisterT was completely correct that the change was broken. Fortunately, we managed to track it down and I just submitted a fix: 9de2ea5

(See the description for a long explanation of what was going wrong.)

If that second change sticks, then we can move forward with flipping the flag introduced in the first change and backing out the temporary code. That should happen some time next week. Assuming that that also sticks, i.e., we didn't overlook something else, we're done with the second-level Yak shaving.

That means we can go up one level and flip --experimental_split_xml_generation to true. If next week looks good, then we can tackle that the week after. If that also sticks, then we can clean up the temporary code for that, and finally close this bug.

ulfjack · 2018-11-20T08:13:58Z

Well, it took me a bit longer but I have a patch to flip the flag, as well as a patch to back out the temporary code. They should get submitted this week.

ulfjack · 2018-11-20T08:15:02Z

To clarify: I already flipped the flag for all Google builds and we have not observed any problems. My patch is to flip the flag in Bazel as well, and then to back out the temporary code.

Progress on #4608. PiperOrigin-RevId: 222205065

ulfjack · 2018-11-23T09:57:00Z

Temporary code removed in b7ae7b0. Next up is to flip the --experimental_split_xml_generation flag.

Otherwise the test name is not generated correctly. This is caught by our existing integration tests, and I'm adding an assertion to our unit tests as well. Progress on #4608. PiperOrigin-RevId: 223338264

Progress on #4608. PiperOrigin-RevId: 223497462

ulfjack · 2019-01-11T14:52:45Z

This got delayed due to me being on vacation for all of December. Unfortunately, I just found yet another issue. As far as I can tell, our code to replace invalid xml characters doesn't actually work right now (at least with some versions of perl). On the plus side, I can now confirm that the upcoming replacement is equivalent to what we're doing right now.

ulfjack · 2019-01-11T14:54:44Z

Make that two issues - there's still a bug in the code for which I have a fix, and the current XML character replacement function doesn't work, which I want to replace anyway.

Ok, make that three issues - the test doesn't actually test whether XML character replacement works because it's broken.

ulfjack · 2019-01-11T14:59:44Z

Actually, four - the script is silently ignoring the non-zero exit code from perl.

test-setup.sh was setting a trap to call write_xml_output_file, but there's no test.log.xml if split xml generation is enabled. Add a test case for this and comment out a part of the test case that is not actually working (the test passes because it ignores the exit code from perl). Progress on #4608. PiperOrigin-RevId: 229167973

ulfjack · 2019-01-25T15:18:23Z

The good news is that I fixed most of the issues mentioned above, at least those that are critical for rolling this out.

However, I found yet another issue with signal handling, which apparently isn't working correctly.

If we don't set a trap here, then bash ignores the signal, and the test process also does not receive the signal, so the test runner has no chance of writing a test.xml output. However, the behavior of trap forwarding the signal to the subprocess is not at all documented in the bash documentation, and also inconsistent with the behavior reported in #7119. There is a similar problem in the Java stub template reported in #6338. This may or may not be progress on #4608. PiperOrigin-RevId: 232035930
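Since the commit message above notes that implicit forwarding by `trap` is undocumented, here is a minimal sketch of the explicit alternative: a wrapper that traps SIGTERM and sends it on to its child by PID. This is not the code in test-setup.sh; `run_and_forward_term`, `child_script`, and the files under `workdir` are hypothetical names used only for illustration.

```shell
#!/bin/bash
# Hypothetical helper, not taken from test-setup.sh: run a command in the
# background and explicitly forward SIGTERM to it.
run_and_forward_term() {
  local child="" forwarded=0 status
  # Install the trap first so a signal cannot kill the wrapper outright.
  trap 'forwarded=1; [ -n "$child" ] && kill -TERM "$child" 2>/dev/null' TERM
  "$@" &
  child=$!
  wait "$child"; status=$?
  # A trapped signal interrupts wait early, so wait again for the real status.
  if (( forwarded )); then
    wait "$child"; status=$?
  fi
  trap - TERM
  return "$status"
}

workdir="$(mktemp -d)"

# Stand-in for a test binary: records SIGTERM, then exits cleanly.
child_script='
trap "echo got-term > \"$1/signal\"; exit 0" TERM
touch "$1/ready"
while :; do sleep 0.1; done
'

( run_and_forward_term bash -c "$child_script" _ "$workdir" ) &
wrapper=$!

# Wait until the child has installed its trap, then signal only the wrapper.
until [ -e "$workdir/ready" ]; do sleep 0.05; done
kill -TERM "$wrapper"
wait "$wrapper"

result="$(cat "$workdir/signal")"
echo "child saw: $result"
rm -rf "$workdir"
```

The sketch uses SIGTERM rather than SIGINT on purpose: a background child of a non-interactive shell starts with SIGINT ignored, so forwarding SIGINT this way would have no visible effect on a bash child.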

ulfjack · 2019-02-01T23:38:26Z

It looks like there may still be an issue with remote execution.

It is possible for Bazel to run test spawns that generate no output files (only stdout/stderr). Leave any such validation to the higher level. This was failing with --experimental_split_xml_generation. Progress on #4608. PiperOrigin-RevId: 232295958

ulfjack · 2019-02-04T22:58:59Z

It's in! It's amazing! (But don't celebrate yet, it could still be rolled back...)

ola-rozenfeld added the type: bug label Feb 8, 2018

katre added the Release blocker, P1 (I'll work on this now; assignee required), and category: misc > testing labels Feb 8, 2018

katre mentioned this issue Feb 8, 2018

Release - February 2018 - Target RC date: 2018-02-01 - name: 0.11.0 #3959

Closed

ulfjack assigned agoulti Feb 9, 2018

bazel-io pushed a commit that referenced this issue Feb 9, 2018

Simplify BinTools setup for integration tests

9f6995a

Progress on #4608. PiperOrigin-RevId: 185126689

katre mentioned this issue Feb 12, 2018

test_interrupt_kills_child failing in 0.11.0rc3 #4625

Closed

katre removed the release blocker label Feb 22, 2018

ulfjack assigned ulfjack and unassigned agoulti Mar 7, 2018

benjaminp mentioned this issue Mar 8, 2018

undocumented usage of perl for cc_test #4691

Closed

bazel-io pushed a commit that referenced this issue Nov 20, 2018

Flip the default for incompatible_use_per_action_file_cache in Bazel

bff6b11

Progress on #4608. PiperOrigin-RevId: 222205065

bazel-io pushed a commit that referenced this issue Nov 30, 2018

Remove --incompatible_use_per_action_file_cache

88c4410

Progress on #4608. PiperOrigin-RevId: 223497462

buchgr removed their assignment Jan 16, 2019

buchgr added the P2 (We'll consider working on this in future; assignee optional) and team-Remote-Exec (Issues and PRs for the Execution (Remote) team) labels and removed the P1 (I'll work on this now; assignee required) and category: misc > testing labels Jan 16, 2019

ulfjack added the P1 (I'll work on this now; assignee required) label and removed the P2 (We'll consider working on this in future; assignee optional) label Jan 21, 2019

ulfjack mentioned this issue Jan 21, 2019

Test rules are unable to handle signals #7119

Closed

bazel-io closed this as completed in 6da8982 Feb 4, 2019

buchgr mentioned this issue Mar 21, 2019

pass platform and execution requirements to xml generating spawn #7794

Closed
