New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

undocumented usage of perl for cc_test #4691

Closed
ahippler opened this Issue Feb 23, 2018 · 86 comments

Comments

Projects
None yet
9 participants
@ahippler
Copy link
Contributor

ahippler commented Feb 23, 2018

Description of the problem / feature request:

cc_test uses a inline perl script for failed tests.

perl -CSDA -pe's/[^\x9\xA\xD\x20-\x{D7FF}\x{E000}-\x{FFFD}\x{10000}-\x{10FFFF}]+/?/g;' "$1" \

Feature requests: what underlying problem are you trying to solve with this feature?

The usage of perl is not documented.
Windows does not have Perl installed by default.

Bugs: what's the simplest, easiest way to reproduce this bug? Please provide a minimal example if possible.

  • execute a failing test on windows will print:
==================== Test output for //Test/Unit:Unit (shard 1 of 8):

[...]

external/bazel_tools/tools/test/test-setup.sh: line 152: perl: command not found
================================================================================

What operating system are you running Bazel on?

Windows 10

What's the output of bazel info release?

0.10.1

The perl script replaces invalid XML characters and invalid sequence in CDATA.
To get rid of perl bash or python could be used.

@aehlig

This comment has been minimized.

Copy link
Member

aehlig commented Feb 26, 2018

Categorizing as bug, as pulling in an documented run-time dependency can cause breakage for users not otherwise using perl. The main question is if we really need perl (then we should document it as a run-time dependency of bazel) or whether an appropriate set up can be achieved with more standard tools.

@aehlig

This comment has been minimized.

Copy link
Member

aehlig commented Feb 26, 2018

It's the actual test rules that are problematic.

@ulfjack

This comment has been minimized.

Copy link
Contributor

ulfjack commented Mar 7, 2018

Note that this is the fallback code path, so if you make sure that your test framework generates a test.xml, that should workaround the issue.

Any suggestions on what to do here? We do need to escape the characters when generating the test.xml.

@benjaminp

This comment has been minimized.

Copy link
Contributor

benjaminp commented Mar 8, 2018

Perhaps rewrite test-setup.sh in a language like C++ or Python where text processing is easy to do portably? That might also be helpful for solving subtle test-setup.sh issues like #4608. (I can't imagine bash is very fun to run on Windows either.)

Relatedly, I would like the ability to customize the test setup executable. One can usually use --run_under, but then the user loses the ability to usefully specify their own --run_under.

@ulfjack

This comment has been minimized.

Copy link
Contributor

ulfjack commented Mar 8, 2018

We discussed that, but it means that we'd unconditionally need a C++ compiler in order to run tests. We could bundle a pre-compiled binary into Bazel, but what platform would it be compiled for? For cross-compilation, you need a binary for the target platform, not the host platform. We could bundle a binary and also ship the source, but then you still need a C++ compiler in order to do cross-compilation.

We could use Python, but then we require that everyone has python installed, even if they 'only' do web or C++ development.

We also discussed moving the fallback codepath into Bazel (i.e., have the test.xml generation in Bazel itself). However, that's also causing problems with remote execution, where we now would force test.xml generation onto the local machine. Except, of course, if we also require that remote execution provides X for some value of X.

It's not impossible to solve these problems, but there's no free lunch.

For now, I'd prefer we keep doing it in shell, but maybe there's a more standard tool than perl that we can use to do the escaping? awk? sed?

@laszlocsomor laszlocsomor self-assigned this Mar 9, 2018

@benjaminp

This comment has been minimized.

Copy link
Contributor

benjaminp commented Mar 13, 2018

FWIW, Bazel doesn't work at all without a C++ compiler on the system:

$ bazel build
ERROR: in target '//external:cc_toolchain': no such package '@local_config_cc//': Traceback (most recent call last):
    File "/home/benjamin/.cache/bazel/_bazel_root/887904812217cca9bc2b9adb875daf42/external/bazel_tools/tools/cpp/cc_configure.bzl", line 42
        configure_unix_toolchain(repository_ctx, cpu_value, overriden...)
    File "/home/benjamin/.cache/bazel/_bazel_root/887904812217cca9bc2b9adb875daf42/external/bazel_tools/tools/cpp/unix_cc_configure.bzl", line 416, in configure_unix_toolchain
        find_cc(repository_ctx, overriden_tools)
    File "/home/benjamin/.cache/bazel/_bazel_root/887904812217cca9bc2b9adb875daf42/external/bazel_tools/tools/cpp/unix_cc_configure.bzl", line 407, in find_cc
        fail(("Cannot find gcc or CC%s, eithe...))
Cannot find gcc or CC, either correct your path or set the CC environment variable
@ulfjack

This comment has been minimized.

Copy link
Contributor

ulfjack commented Mar 13, 2018

That's true, and we should fix that independently of this problem. I'd rather not add another dependency on having a C++ compiler available.

@laszlocsomor

This comment has been minimized.

Copy link
Contributor

laszlocsomor commented Mar 13, 2018

That's true, and we should fix that independently of this problem.

+1

I'd rather not add another dependency on having a C++ compiler available.

I'd give another +1 if I weren't out of them already.

@ahippler

This comment has been minimized.

Copy link
Contributor

ahippler commented Mar 13, 2018

We could use Python, but then we require that everyone has python installed, even if they 'only' do web or C++ development.

python is already required as per documentation:
https://docs.bazel.build/versions/master/install-ubuntu.html#1-install-required-packages
https://docs.bazel.build/versions/master/windows.html#software-requirements

@laszlocsomor

This comment has been minimized.

Copy link
Contributor

laszlocsomor commented Mar 13, 2018

@ahippler : Like with C++, that's true and we should fix that independently of this problem. In fact it'd be best if Bazel would only require toolchains that it needs for the build, i.e. if you don't build Python rules then Bazel shouldn't require a Python installation.

Only exception (if you could call it an exception) should be Java: the JDK is always available because we bundle one with Bazel, because Bazel itself needs one in order to run.

@laszlocsomor

This comment has been minimized.

Copy link
Contributor

laszlocsomor commented May 2, 2018

Thanks to @aehlig we now know what this command is supposed to do:

It means, read the string as sequence of uni-code characters, and replace every non-empty sequence of characters in the given ranges by a question mark symbol. (That's the first line. The ranges come from the esoteric definition of which unicode characters are and are not allowed in XML documents.)
The negation (^) makes sense. The allowed unicode characters are:
x9 | xA | xD | [x20-xD7FF] | [xE000-xFFFD] | [x10000-x10FFFF]
And every non-empty sequence of other charachters is replaced by a question mark symbol.
The reason why perl was used was that the whitelist only applies after you interpreted the sequence of octets as a sequnce of (utf-8 encoded) unicode characters, and perl could do that on the fly (while the more traditional unix tools work on sequnce of octets).

@ulfjack

This comment has been minimized.

Copy link
Contributor

ulfjack commented May 14, 2018

(Note that any solution for test-setup needs to work with remote execution - this makes it very difficult to use pre-compiled binaries, because you don't know which platform the test action will actually run on.)

@laszlocsomor

This comment has been minimized.

Copy link
Contributor

laszlocsomor commented May 14, 2018

@ulfjack : True. Bazel could include a precompiled embedded binary for the host platform, and the remote execution platform author would have to provide one. Bazel would select the active one with toolchain rules. WDYT?

@laszlocsomor

This comment has been minimized.

Copy link
Contributor

laszlocsomor commented May 14, 2018

Let's not forget that the tests currently depend on Bash, so a remote execution worker would have to have Bash installed anyway.

@ulfjack

This comment has been minimized.

Copy link
Contributor

ulfjack commented May 14, 2018

That would make it much more difficult to change test-setup since all such changes would have to be rolled out to all remote execution systems.

@laszlocsomor

This comment has been minimized.

Copy link
Contributor

laszlocsomor commented May 14, 2018

Let's not forget that the tests currently depend on Bash, so a remote execution worker would have to have Bash installed anyway.

...meaning that requiring a test-setup binary would not make things worse. And we could provide a reference implementation in a GitHub repo.

@ulfjack

This comment has been minimized.

Copy link
Contributor

ulfjack commented May 14, 2018

I think one option would be to fork test-setup, and have different implementations for Linux/Mac and Windows.

@laszlocsomor

This comment has been minimized.

Copy link
Contributor

laszlocsomor commented May 14, 2018

Forking test-setup would also introduce the synchronization difficulty you alluded to.

@laszlocsomor

This comment has been minimized.

Copy link
Contributor

laszlocsomor commented May 15, 2018

(Note that any solution for test-setup needs to work with remote execution - this makes it very difficult to use pre-compiled binaries, because you don't know which platform the test action will actually run on.)

The options I see:

  • Use precompiled binary. Faster than a script and requires no runtime, but Bazel must either include the binary for all local and remote execution platforms, or the remote execution platform must provide its version of the binary and Bazel needs an appropriate toolchain rule. Need to maintain and keep in sync multiple source files.
  • Use scripts. Requires a runtime (Bash / Python / Powershell) but more portable than a precompiled binary. Need to maintain and keep in sync multiple script files.
  • Use one script. This is what we have now. Requires the same runtime (Bash / Python) on every platform (goes against #4319 ). Need to maintain just one script.

The second option seems to be the best tradeoff. WDYT?

@ulfjack

This comment has been minimized.

Copy link
Contributor

ulfjack commented May 15, 2018

I agree.

Note that I've been working on splitting test-setup into two separate steps - right now it runs the test and then generates a test.xml file if there isn't one. The way it's done is triggering a code path on Linux and MacOS that has an inherent race condition. @agoulti proposed that we split up the two parts - run the test first and then run a separate action to generate the test.xml file if the test didn't generate one. That'll fix the race condition and potentially make the test-setup script a bit simpler.

@ulfjack

This comment has been minimized.

Copy link
Contributor

ulfjack commented May 15, 2018

There's some background in #4608.

@benjaminp

This comment has been minimized.

Copy link
Contributor

benjaminp commented May 15, 2018

The options I see:
...

There's also the option I've raised before of rewriting test-setup.sh in C++, requiring a target C++ toolchain to run tests, and compiling test-setup on the fly for the target platform. For simple use cases, a precompiled binary test-setup could be included with Bazel in the embedded tools. This has the advantage of letting test-setup use native APIs (managing processes properly in bash is ironically difficult) and not duplicating too much work across platforms. I understand the desire to avoid requiring a C++ toolchain, but it seems like the need for that ends up creeping in for any non-trivial project anyway.

@laszlocsomor

This comment has been minimized.

Copy link
Contributor

laszlocsomor commented Jul 25, 2018

Thanks.

Re: technical debt: I hear you, but your point sounds rather speculative, so I'd not worry about non-x64 Windows platforms for the foreseeable future (say 1 year).

Re: compiler: since you don't want to require it for tests, which I also agree with, I think C++ is not the right language for the test wrapper on Linux. (It still seems like the right choice on Windows.)

But then I don't know what's the right choice for Linux, if only Bash is promised (without Perl). Maybe a clever sed program could help?

@laszlocsomor

This comment has been minimized.

Copy link
Contributor

laszlocsomor commented Jul 25, 2018

@nlopezgi , another question: do you know how Bazel cancels a remotely running test action in case the user presses Ctrl+C? Does Bazel dispatch this to the remote service or does it just close the connection?

@nlopezgi

This comment has been minimized.

Copy link
Member

nlopezgi commented Jul 25, 2018

re: compiler: I did not state I did not want to require it for tests, just that I was not sure its the right choice and don't want to be the one to make it w/o at least conferring with some other folks (I'll get back to you once I've confirmed). imo, if some tool is to be required for all tests, i'd rather it is the c++ compiler (instead of perl or python), but not sure what the trade-offs (effort/maintenance) between c++ vs a clever sed program would be.

re: canceling a test: not sure, you'd want to ask @ola-rozenfeld about api details

@laszlocsomor

This comment has been minimized.

Copy link
Contributor

laszlocsomor commented Jul 26, 2018

I was too optimistic with the "clever sed program". The task is to UTF-8-decode an octet-stream, test decoded characters if they fall into any of some disjoint ranges and replace them with "?", then UTF-8-encode the result again.

I'm not aware of an efficient way to do this without encoding and decoding.

@ulfjack

This comment has been minimized.

Copy link
Contributor

ulfjack commented Jul 26, 2018

You don't need to decode and re-encode - you can simply test on the utf-8 representation using seds regexp support.

@laszlocsomor

This comment has been minimized.

Copy link
Contributor

laszlocsomor commented Jul 26, 2018

How? Does sed support matching UTF-8 strings?

@ulfjack

This comment has been minimized.

Copy link
Contributor

ulfjack commented Jul 26, 2018

sed supports matching binary, and you know the utf-8 encoding, so you can check for specific utf-8 ranges, like so:

echo "ä ö ü" | LANG=C sed -e "s/[\xc0-\xdf][\x00-\xff]/?/g"

Here, I'm replacing all two-byte utf-8 sequences with a single '?' character. See https://en.wikipedia.org/wiki/UTF-8 for the multi-byte ranges.

@ulfjack

This comment has been minimized.

Copy link
Contributor

ulfjack commented Jul 26, 2018

After a lot of trial and error, I've come up with a sed script that - I think - does what we want:

cat test.log | LANG=C sed -E \
  -e 's/.*/& /g' \
  -e 's/(([\x9\xa\xd\x20-\x7f]|[\xc0-\xdf][\x80-\xbf]|[\xe0-\xec][\x80-\xbf][\x80-\xbf]|[\xed][\x80-\x9f][\x80-\xbf]|[\xee-\xef][\x80-\xbf][\x80-\xbf]|[\xf0][\x80-\x8f][\x80-\xbf][\x80-\xbf])*)./\1?/g' \
  -e 's/(.*)\?/\1/g'
@ulfjack

This comment has been minimized.

Copy link
Contributor

ulfjack commented Jul 26, 2018

First, add a single white space character (' ') to the end of each line. Second, replace all (possibly empty) sequences of legal characters followed by a character with the sequence of legal characters followed by a question mark character ('?'). Third, remove the trailing question mark character ('?') from each line.

@laszlocsomor

This comment has been minimized.

Copy link
Contributor

laszlocsomor commented Jul 26, 2018

Hats off, that's quite impressive.

I think you made a couple mistakes:

  • you need to cover the entire 2-octet UTF-8 domain (U+80 .. U+7FF), that is "e0 80 80 .. ed 9f bf", so that part of the regex should be [\xe0-\xed][\x80-\x9f][\x80-\xbf], whereas yours is split up into [\xe0-\xec][\x80-\xbf][\x80-\xbf] and [\xed][\x80-\x9f][\x80-\xbf]
  • the final allowed UTF-16 range is U+10000 .. U+10FFFF, that is "f0 80 80 80 .. f4 8f bf bf", so the regex should be [\xf0-\xf4][\x80-\x8f][\x80-\xbf][\x80-\xbf], but yours is [\xf0][\x80-\x8f][\x80-\xbf][\x80-\xbf]

Could you compare your results with mine?

@laszlocsomor

This comment has been minimized.

Copy link
Contributor

laszlocsomor commented Jul 26, 2018

Sorry, I got confused, gimme a minute to correct this.

@laszlocsomor

This comment has been minimized.

Copy link
Contributor

laszlocsomor commented Jul 26, 2018

you need to cover the entire 2-octet UTF-8 domain (U+80 .. U+7FF)

This is wrong. I meant to say: you need to cover the U+80..U+7FF (two UTF-8 octets) and U+800..U+D7FF (three UTF-8 octets) ranges. The 2-octet domain is covered correctly ([\xc0-\xdf][\x80-\xbf]), but I think the 3-octet-matching regex is wrong in your solution.

@ulfjack

This comment has been minimized.

Copy link
Contributor

ulfjack commented Jul 26, 2018

Ok, how about this:

[\x9\xa\xd\x20-\x7f]                         <--- (9,A,D,20-7F)
[\xc0-\xdf][\x80-\xbf]                       <--- (0080-07FF)
[\xe0-\xec][\x80-\xbf][\x80-\xbf]            <--- (0800-CFFF)
[\xed][\x80-\x9f][\x80-\xbf]                 <--- (D000-D7FF)
[\xf0-\xf7][\x80-\xbf][\x80-\xbf][\x80-\xbf] <--- (010000-10FFFF)
@ulfjack

This comment has been minimized.

Copy link
Contributor

ulfjack commented Jul 26, 2018

Wait, there's still one range missing. Gah!

@laszlocsomor

This comment has been minimized.

Copy link
Contributor

laszlocsomor commented Jul 26, 2018

Ah I see where I was wrong, you are right to match 0800-CFFF and match D000-D7FF separately.

@ulfjack

This comment has been minimized.

Copy link
Contributor

ulfjack commented Jul 26, 2018

Another try:

[\x9\xa\xd\x20-\x7f]                         <--- (9,A,D,20-7F)
[\xc0-\xdf][\x80-\xbf]                       <--- (0080-07FF)
[\xe0-\xec][\x80-\xbf][\x80-\xbf]            <--- (0800-CFFF)
[\xed][\x80-\x9f][\x80-\xbf]                 <--- (D000-D7FF)
[\xee][\x80-\xbf][\x80-\xbf]                 <--- (E000-EFFF)
[\xef][\x80-\xbe][\x80-\xbf]                 <--- (F000-FFEF)
[\xef][\xbf][\x80-\xbd]                      <--- (FFF0-FFFD)
[\xf0-\xf7][\x80-\xbf][\x80-\xbf][\x80-\xbf] <--- (010000-10FFFF)
@ulfjack

This comment has been minimized.

Copy link
Contributor

ulfjack commented Jul 26, 2018

Ok, I wrote a small Java program to double-check the pattern, and it returned the expected ranges:

9-a,d-d,20-d7ff,e000-fffd,10000-10ffff
@laszlocsomor

This comment has been minimized.

Copy link
Contributor

laszlocsomor commented Jul 26, 2018

I'm glad we are free now.

@ulfjack

This comment has been minimized.

Copy link
Contributor

ulfjack commented Jul 26, 2018

Ok, I have a patch which conflicts with my other changes to test-setup.sh. Both are a bit risky, and we need to pick one to be merged first.

@laszlocsomor

This comment has been minimized.

Copy link
Contributor

laszlocsomor commented Jul 26, 2018

If you have changes lined up, merge those first.
This bug has been open for a long time, it's OK to wait another day.

@ulfjack

This comment has been minimized.

Copy link
Contributor

ulfjack commented Aug 3, 2018

Patch is here: https://bazel-review.googlesource.com/c/bazel/+/68711

There are still a couple possible issues with this that we need to look into.

My primary concern is what should happen if the default charset of the current machine is NOT UTF-8. The previous Perl solution was broken as well: it did a charset conversion from the default charset to UTF-8 on input, but also from UTF-8 to the default charset on output, which can actually re-introduce illegal characters (we might want to file a bug for that, or note it on the existing bug), breaking the resulting XML.

The new code intentionally does not perform any charset conversion.

Ideally, we'd do a charset conversion from the default charset to UTF-8 before running the sed script. Note that I override the LOCALE before running sed, which probably sets the default charset to ISO-8859-1 (this needs to be double-checked!).

  • find a way to do charset conversion from the default charset to utf-8; we might be able to get sed to do that for us, but we need to prevent a charset conversion on output
  • make sure to set the default charset temporarily to ISO-8859-1 for the sed invocation
  • make sure that all copies of this code are updated
@ulfjack

This comment has been minimized.

Copy link
Contributor

ulfjack commented Aug 3, 2018

(I'm on vacation for the rest of August, and won't be able to work on this until I'm back.)

@laszlocsomor

This comment has been minimized.

Copy link
Contributor

laszlocsomor commented Aug 3, 2018

Thanks!

The previous Perl solution was broken as well: it did a charset conversion from the default charset to UTF-8 on input, but also from UTF-8 to the default charset on output, which can actually re-introduce illegal characters (we might want to file a bug for that, or note it on the existing bug), breaking the resulting XML.

Please elaborate. How and where exactly are the conversions done?

(I'm on vacation for the rest of August, and won't be able to work on this until I'm back.)

Do you intend to finish this task yourself or to appoint it to someone else (and if so, whom)?

@ulfjack

This comment has been minimized.

Copy link
Contributor

ulfjack commented Aug 3, 2018

If it's still open when I'm back, I'll do it. If someone comes in and finishes my changes, I'm happy, too.

Perl does implicit charset conversion on every read and write from a file or stdin/stdout. I believe it converts from the platform default charset to UTF-8 internally. That's my reading of the docs, anyway.

@ulfjack ulfjack assigned ulfjack and unassigned laszlocsomor Nov 26, 2018

@ulfjack

This comment has been minimized.

Copy link
Contributor

ulfjack commented Nov 26, 2018

I have a pending patch.

@Profpatsch

This comment has been minimized.

Copy link
Contributor

Profpatsch commented Dec 19, 2018

Would be awesome to see that line go. It’s the last remaining use of perl in bazel from what I can see (I already removed the other use in #5999).

Go @ulfjack! 🍰 🏃

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment