Make command substitution split on NUL #3164

Closed
imgx64 opened this issue Jun 23, 2016 · 26 comments
@imgx64

imgx64 commented Jun 23, 2016

Currently, if the output from command substitution contains a null character, there is no way to get the characters after the null. That kind of makes sense, since both argv and environment variables are null-terminated.

This makes it impossible to use find -print0 inside command substitution, forcing you to use find without -print0 (as mentioned in the docs). That is slightly wrong, in the sense that filenames containing newlines (yuck, but sadly allowed) will not be handled correctly.

I suggest making fish split command substitution output on null characters when IFS is empty. The current behavior silently discards everything after the null; if someone really wants that, it can be done with (...)[1]. I'm not sure whether this null-splitting should also be done in addition to \n when IFS is non-empty.
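The proposed rule can be sketched in a few lines (Python as pseudocode for the suggested fish behavior; the function name and the terminator handling are assumptions, not fish's actual implementation):

```python
def comsub_split(output: bytes, ifs_empty: bool) -> list:
    """Sketch of the proposal: with an empty IFS, split command
    substitution output on NUL bytes instead of silently truncating
    at the first NUL; otherwise keep splitting on newlines."""
    sep = b"\x00" if ifs_empty else b"\n"
    # Treat the separator as a terminator: one trailing sep is dropped.
    if output.endswith(sep):
        output = output[:-1]
    return output.split(sep) if output else []

# Output in the style of `find . -type f -print0`:
out = b"./normal file\x00./new\nline\x00"
print(comsub_split(out, ifs_empty=True))  # 2 elements, newline preserved
```

Note that empty output yields zero elements, which matches the `count (true)` discussion further down.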

Reproduction Steps:

~/test> touch normal\ file new\nline
~/test> count *
2 # OK, but I want to do some advanced filtering using find
~/test> count (find . -type f)
3 # nope
~/test> begin; set -l IFS; count (find . -type f); end
1 # nope
~/test> begin; set -l IFS; count (find . -type f -print0); end
1 # nope, but I think fish should be changed to make it work
~/test> begin; set -l IFS; echo (find . -type f -print0); end # let's see what it outputs
./new
line
# "normal file" is missing, which is not too surprising, but irritating

Expected behavior:

~/test> begin; set -l IFS; count (find . -type f -print0); end
2 # The two arguments should be new\nline and "normal file"

Additional information:

This is possibly related to #159.
A workaround would be using a loop and read --null (#1694). But it's cumbersome and I love fish's simplicity. If I wanted to write awkward loops to handle filenames with newlines, I would still be using bash.


Fish version: fish, version 2.3.0
Operating system: Cygwin, Fedora 23
Terminal or terminal emulator: Windows's cmd.exe window, Konsole

@krader1961 krader1961 added this to the fish-future milestone Jun 23, 2016
@krader1961
Contributor

Your proposal is simple, and the change in behavior is very unlikely to cause any problems in practice, partly because the current behavior, as you noted, is to discard anything after the first null byte. Besides, how likely is it that anyone currently using an empty IFS for its side effect of not splitting on newlines also expects it to keep embedded nulls, given that that doesn't work? I like it.

@ridiculousfish
Member

ridiculousfish commented Jun 23, 2016

IFS is legacy. I would rather see this implemented by piping to string split \0

@krader1961
Contributor

So you'd say this is a bug 😄:

$ echo -e "abc\x00def\x00ghi" | string split \0
a
b
c

@ridiculousfish
Member

Yes, but one we cannot fix as long as our builtins take their arguments as null terminated strings.

@krader1961
Contributor

I wonder if implementing the proposed IFS behavior first might be easier than changing the builtins to operate on wcstring objects rather than c-style strings. I doubt anyone working on the project disagrees that the latter should be done sooner rather than later. But the question is whether we can implement the IFS solution with very little work versus the presumably non-trivial work to switch from char * to wcstring.

@imgx64
Author

imgx64 commented Jun 27, 2016

Yeah, the best way to fix this long-term is making string split magical (I think I've seen this feature called "string superpowers"). But my suggestion is simple and hopefully doesn't break anything. Would an IFS-based PR be accepted if I submitted it? I haven't touched C++ in 9 years, but I think I can try my hand at it.

On a side note, I don't like handling builtin arguments differently because it would break consistency. I would prefer string split --null instead of string split \0.

@faho
Member

faho commented Jun 27, 2016

On a side note, I don't like handling builtin arguments differently because it would break consistency. I would prefer string split --null instead of string split \0.

What's inconsistent here? string split will split on whatever argument has been given. \0 is our notation for a NULL-byte, so string split \0 should split on NULL, just like string split a should split on a.

Yes, it's not like read, but read is a different thing, so different arguments are to be expected.

@imgx64
Author

imgx64 commented Jun 27, 2016

A non-builtin has no way to accept a \0 in its arguments, so why should builtins have a different calling convention (so to speak)?

@imgx64
Author

imgx64 commented Jun 27, 2016

I made the change locally, and I think it's more confusing than it's worth. With this change, set IFS; count (true) prints 0 instead of 1, and simply adding (...)[1] gives an "Array index out of bounds" error instead of the old behavior (which may be a bug; what do you think?).

What would be a reasonable output for set IFS; count (true), 1 or 0?

@faho
Member

faho commented Jun 27, 2016

What would be a reasonable output for set IFS; count (true), 1 or 0?

0 of course, like it already does without messing with IFS. true doesn't print any output, so there's nothing to split.

@ridiculousfish
Member

ridiculousfish commented Jun 27, 2016

I can see an argument for IFS splitting on NUL if you set IFS to '\0'. But if IFS is set to empty, it shouldn't split at all. I also think IFS is sort of legacy and we don't want to encourage its use.

One approach to fix this today is to introduce a split0 command to string, e.g. count (find . -type f | string split0). I think this is better than messing with IFS.

@krader1961
Contributor

See my comment on PR #3174 but in short I've changed my mind and would prefer we not overload IFS with yet another behavior.

@anordal
Contributor

anordal commented Jul 12, 2016

string split \0 should split on NUL, just like string split a should split on a.

A non-builtin has no way to accept a \0 in its arguments, why should builtins have a different calling convention (or so to speak)?

Specifically, \0 is indistinguishable from the empty string:

> string length \0
0
> string length ''
0

The question is how ok it is to let string split '' split on NUL.

@krader1961
Contributor

The question is how ok it is to let string split '' split on NUL.

Seems like a bad idea to me for much the same reasons as overloading IFS. We should either allow specifying \0 as a char to split on or, if that's going to be too difficult given the pervasive use of C style strings in the code, then introduce a string split0 subcommand per @ridiculousfish's suggestion.

@cben
Contributor

cben commented Dec 17, 2017

I want a way to express do-something (find ... -print0) without setting IFS.

I think if the following was default command substitution behavior, it might cover both this and #159:

  • iff any NUL is present, split on NUL and only on NUL.
  • otherwise, split on newlines.

For some reason I long thought that was the default behavior of fish, but I see that's not the case.
Could such auto-NUL support be considered for 3.0 (#4154)?

I'm not sure how exactly this would interact with modifying IFS (and frankly I don't care :).


Why not | string split? The hard part is not passing the NUL as an argument to string split (e.g. split0 is a perfectly pragmatic workaround); it's preventing its output from being split on newlines:

> string escape (echo foo\nbar\ baz0second0 | string split 0)
foo
'bar baz'
second
''

IIRC, preventing normal splitting was part of the "superpowers" that have been discussed.
But I doubt superpowers will arrive soon, and they go against some fish principles, whereas command substitution supporting NUL more or less allows any command/function to attain similar powers.

Plus, it's less to write.


The best current solution, as mentioned in the OP, is a repeated read into a variable:

> printf 'foo\nbar baz\0second\0' | while read -z out; set outs $outs $out; end; string escape $outs
foo\nbar\ baz
second

This could be simplified by adding some read -z --all mode that reads all "lines" into an array (unlike read --array which reads one line and splits it).
Still, I'd like to keep the "shape" of command substitution rather than read into var; use var.

@faho
Member

faho commented Dec 17, 2017

Could such auto NUL support be considered for 3.0 (#4154)?

That's quite a clever idea. One possible issue here is that, if there is only one "thing" to print and the program does not print a trailing NUL (which means no NULs will show up in the output), this would still split on newlines.

I mean I just checked that my find (from GNU findutils) prints one, but what about other implementations? Or other things?

At that point we're back to (find -print0; printf '\0'), which would do the wrong thing when confronted with an implementation that behaves "correctly", and which is quite unwieldy and magical (unless you've read the manual or followed this issue).

@cben
Contributor

cben commented Dec 18, 2017

Good point, but all the NUL-outputting commands I've ever seen use them as a \n replacement: as terminators, not separators. I'm really not concerned about that in practice.
BTW, I want splitting on \0 to expect a trailing \0 and not return an empty string at the end.

Another use for appending \0 would then be #159, assuming input contains no \0:

alias whole "cat; printf '\0'"
set whole_output (ls | whole)

See, with this you can implement any splitting mode using regular commands!
I think at some point a magic yield was being discussed; this is a good non-magic approximation.

It does return an array instead of a single string if the input contained \0. read could avoid this, but even if you can stuff a \0 into a var (?), I don't see anything you could currently do with such a var that wouldn't drop everything past the first \0... I'm willing to accept that \0 can't usefully exist inside a shell, only on stdin/stdout...
(The way to handle arbitrary data is some form of quoting/escaping, e.g. set whole_output (ls | string escape) assuming #4605 is fixed.)

@faho
Member

faho commented Dec 18, 2017

It does return array instead of single string if input contained \0.

Note: for ls (or any other command that prints file or path names), this is impossible. On Unix, pathnames aren't allowed to contain \x00, and filenames additionally aren't allowed \x2f (ASCII "/"). Every other byte is allowed. That's where the idea of using NULs comes from: it's literally the only safe separator for paths.

I'm willing to accept that \0 can't usefully exist inside a shell, only on stdin/stdout...

For variables, I think it'd be technically possible for fish to allow them inside fish, but once exported they'd be lost (since the length information would be gone). As a command-line argument, however, this is a lost cause: arguments are passed as C-style strings without any additional length information (the int argc, char **argv thing), so there's no way to read past a NUL safely. For fish to use them, we'd have to rework our builtins to retain that info (e.g. receive their arguments in a format that carries a length), and even then only our builtins could use them.
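The argv limitation can be illustrated without any fish at all. A hypothetical helper showing what an execve()'d program sees when handed a byte string containing a NUL:

```python
def as_c_string(arg: bytes) -> bytes:
    """A C-style string ends at the first NUL, so a program receiving
    this value via argv sees only the bytes before it; the rest of the
    data is silently lost."""
    return arg.split(b"\x00", 1)[0]

print(as_c_string(b"new\x00line"))  # b'new'
```

(CPython itself sidesteps the problem by refusing such arguments outright: passing a string with an embedded NUL to subprocess raises ValueError instead of truncating silently.)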

Good point, but all the nul-outputting commands I've ever seen use them as \n replacement, as terminators not separators. I'm very not concerned about that in practice.

I've never really found the "terminator"/"separator" distinction clear enough. Also, someone once tried to defend Windows notepad.exe's behavior regarding \n with it, so I'm kind of lukewarm on the distinction. Anyway... what I'd like to see here is some data.

Have you tested find -print0 on macOS? Or locate? Or realpath? How about hg? If we determine that at least the majority of commonly-used tools do the "right" thing here, we can talk about this. If it fixes find on linux but breaks everything else, it's not much of an improvement.

@cben
Contributor

cben commented Dec 25, 2017

I don't have a Mac, so I can't say anything about that.
Any other Unixes we care about? I can try experimenting with open-source ones... BSD? Android? ChromeOS?
I can find a Windows machine if necessary, though both Cygwin and the Ubuntu subsystem run GNU tools, so I don't expect differences there. Are there any native Windows programs to check that support NUL-delimited output?

I can try to look into line-oriented languages like awk/sed/perl -p configured for NUL record separators.

faho added a commit to faho/fish-shell that referenced this issue Dec 27, 2017
Currently, there is no easy way to use e.g. `find -print0` in command substitutions.

What this does is check if a NUL is in the comsub output and then split on that instead of newline,
which makes `something (find -print0)` just work.

The one fly in the ointment is if there is a command that uses NULs to separate its output,
but doesn't print a trailing NUL if there is just one "thing" to print.
That would cause us to fall back to splitting on newlines, which might do the wrong thing.

So far, everything I've tested (`find`, `git config -z`) seemed to print a trailing NUL.

This implements a proposal by @cben.

Fixes fish-shell#3164.
@faho
Member

faho commented Dec 27, 2017

@cben: I've just implemented your proposal - which turned out to be quite easy. So let's test it a bit!

@cben
Contributor

cben commented Dec 31, 2017

Incomplete list of tools to check (from man -K NUL):

I'm listing things that take NUL on input, but I won't test most of them.
The -0,--null notation lists flag synonyms.
Everything below was tested on Linux (Fedora 27).
An unset checkbox merely means I didn't test it.

  • fish history search -z,--null — terminated
  • fish read -z,--null as well as other shells
  • bash read -d '' reads till NUL. Errors and doesn't assign var if EOF before NUL!
    by default it supports backslash-escaping, interacts somewhat weirdly with -d, (backslash-newline still gets swallowed), and backslash-NUL causes following text till next NUL to get lost. read -r -d '' disables escaping.
  • bash readarray -d '' arr worked for me interactively but not in pipe?! It does read last part if not terminated with NUL. It normally keeps the delimiter char but not visible with NUL.
  • ruby -0
  • GNU awk, other awks
  • GNU sed -z,--null-data, other seds
  • perl -0, perl -ln0e
  • dos2unix,unix2dos,mac2unix,unix2mac -0
  • Git (not all tested but I think the picture is clear):
    • git commit -z,--null
    • git ls-files -z — terminated
    • git ls-tree -z — terminated
      • git mktree -z (reads output of ls-tree -z)
    • git apply -z
    • git check-attr -z — terminated
      • git check-attr -z --stdin
    • git check-ignore -z --stdin
    • git checkout-index -z --stdin
    • git diff -z --numstat (and some other flag combos) — terminated
      • git diff-files -z, git diff-index -z, git diff-tree -z
    • git rev-list --header — terminated
    • git grep -z,--null — terminated
    • git show -z — terminated
    • git update-index -z --stdin
    • git update-ref -z — terminated
    • git status -z — terminated
    • git log -z
    • git config -z,--null
  • bzr ls -0,--null
  • Mercurial:
    • hg files -0,--print0
    • hg grep -0,--print0
    • hg locate -0,--print0
    • hg status -0,--print0
    • hg purge -0,--print0
  • rsync -0,--from0
  • GNU coreutils:
    • basename -z,--zero — terminated
    • dirname -z,--zero — terminated
    • du -0,--null — terminated
    • du --files0-from=F
    • env -0,--null — terminated
    • cpio -0,--null
    • xargs -0,--null
    • tar --null
    • printenv -0,--null
    • readlink -z,--zero
    • realpath -z,--zero
    • comm -z,--zero-terminated
    • cut -z,--zero-terminated
    • numfmt -z,--zero-terminated
    • paste -z,--zero-terminated
    • head -z,--zero-terminated
    • tail -z,--zero-terminated
    • id -z,--zero — terminated
    • join -z,--zero-terminated
    • shuf -z,--zero-terminated
    • grep -Z,--null -l — terminated
    • grep -z,--null-data
    • locate -0,--null — terminated
    • sort -z,--zero-terminated
    • sort --files0-from=F
    • wc --files0-from=F
    • file -0,--print0
    • split -t,--separator='\0'
    • uniq -z,--zero-terminated
    • find -print0 — terminated
  • Other tars?
    • gpg-tar --null --files-from=...
    • bsdtar --null (with -I or -T)
  • xz --files0=F
  • ag -l -0,--null,--print0
  • ack --print0
  • cfv --list0
  • lsof -F0 — terminated
  • bwrap --args FD "Parse nul-separated arguments for file descriptor"
  • lslogins -z,--print0
  • msgexec 0 "outputs the translation, followed by a null byte."
  • ostree-ls --nul-filenames-only
  • zsh completions machinery uses NUL in couple places, don't care :-)

✅ So far, all the tools I checked emit a NUL on the last output line.
I'm definitely NOT going to test all, but there are some more I want...
I'll also make some sort of resource out of this on Stack Overflow :)

P.S. POSIX, discussing the motivation for find -exec, mentions that find -print0 "was considered here, but not adopted. Using a null terminator meant that any utility that was going to process find's -print0 output had to add a new option to parse the null terminators it would now be reading." :-)

@faho
Member

faho commented Dec 31, 2017

P.S. POSIX, discussing the motivation for find -exec, mentions that find -print0 "was considered here, but not adopted. Using a null terminator meant that any utility that was going to process find's -print0 output had to add a new option to parse the null terminators it would now be reading." :-)

Yes. That's one of the larger failings of POSIX, I'd argue, and a major bummer.


There are a bunch of tools in there that use NUL delimiters in their input, which is irrelevant here. For instance fish's read, head, tail, paste, xargs and all the --files0-from things.

I've taken the liberty of marking the tools I've checked.

The one that is going to be weird is GNU's file -0. The output looks like

filename\0Description\n
filename\0Description\n

So you would have to split on both, or you'd get the Description of the first file mixed in with the name of the second file.

file -00 will also use a NUL after the description.

I don't think there's anything we can do about that, and I wouldn't assume that many people actually parse the output of file.
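Assuming the record format quoted above (filename, NUL, description, newline), a parser does indeed have to split on both separators; a hypothetical sketch:

```python
def parse_file0(output: bytes) -> dict:
    """Parse `file -0`-style records: name NUL description NL."""
    records = {}
    for line in output.split(b"\n"):
        if line:
            name, _, desc = line.partition(b"\x00")
            records[name] = desc
    return records

out = b"a.txt\x00ASCII text\nb.png\x00PNG image data\n"
print(parse_file0(out))  # {b'a.txt': b'ASCII text', b'b.png': b'PNG image data'}
```

Splitting such output on NUL alone would glue each description to the following filename, which is exactly the mix-up described above.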


I think the picture is clear: We haven't found a utility that breaks the assumption that NUL is used as "the terminator", i.e. that it is printed even for one thing.

What helps us here is the weirdness of bash's read:

bash read -d '' reads till NUL. Errors and doesn't assign var if EOF before NUL!

That means that tools that want to work with bash need to print a trailing NUL, so we got a bunch of testing for free.

@cben
Contributor

cben commented Dec 31, 2017

There is another risk:
A program with normal newline-terminated output that is capable of representing NUL characters internally (easy if it's written in a modern language) could get a NUL sneaked into the middle of one of its output lines.
Then fish would suddenly parse it all wrong!

first line\n
second line\n
"Foo\0bar", he said\n
fourth line\n

-> [first line\nsecond line\n"Foo, bar", he said\nfourth line\n]

I don't know how to quantify this risk.
The main defense is that fish currently already parses it wrong, dropping the text after the NUL, but that's not nearly as bad: [first line, second line, "Foo, fourth line]

@faho
Member

faho commented Dec 31, 2017

could get a NUL sneaked in middle of one of its output lines.

At that point, something is already wrong. There is no way to read that correctly, because it is incorrect.

dropping text after the NUL, but it's not nearly as bad

Well, both are wrong. But I'd argue that the current behavior is worse, because it's dropping text. The new behavior doesn't drop anything but the NUL.

It's also worth noting that bash loudly drops NULs. When you do foo=$(find . -print0), you'll get "Warning: command substitution: ignored null byte in input", and foo will be set to the output, with the NULs skipped (so a\0b becomes ab).

And that means two things: One, the proposed behavior for fish results in the same string if you string join "" it together.

And two: The only reason I know about bash's behavior is that I explicitly tested it. I've never seen that message before, because anecdotally at least, it just does not happen that "NUL sneaked in".
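The join equivalence with bash's NUL-stripping can be checked mechanically; a small Python sketch:

```python
raw = b"a\x00b"                          # command output with an embedded NUL
bash_result = raw.replace(b"\x00", b"")  # bash: NUL bytes are skipped
fish_pieces = raw.split(b"\x00")         # proposed fish: split on NUL
# Joining the fish pieces with "" reproduces exactly what bash stores.
assert b"".join(fish_pieces) == bash_result == b"ab"
```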

easy if written in modern languages

How exactly? I mean, sure, I can do print("something\0") in Python (does that count as a "modern" language?), but that's something I explicitly chose to do. "Modern" languages do their best to hide NULs, because the details of how C-style strings work are a weird combination of tricky and boring. And they already can't really have a NUL "sneaking in", because you can't use one in e.g. filenames.


zsh's behavior is also of interest here. The same foo=$(find . -print0) results in $foo actually being set to the output of find . -print0, including NULs, without any error or anything. When you try to use that variable with the builtin echo (and I'd imagine any other builtin), you'll actually get it reconstructed perfectly, including NULs. But when you then pass it to command echo, you'll get it only up to the first NUL (exclusive).

So there's no indication that anything is wrong, and there isn't anything wrong, until you pass it to an external command. It's possible there are options to configure this, I haven't checked.

mksh behaves like bash without the error.

@IsaacOscar

IsaacOscar commented Mar 11, 2018

This proposal seems to suggest that we split on \n and \0. I think it would make life much easier for fish script writers like myself if we only (and always) split on \0. You can always translate newlines to \0 (e.g. by piping to tr '\n' '\0').

If we are concerned about backwards compatibility, we can add an implicit | tr '\n' '\0' to the end of each command substitution whenever IFS is set. If we also remove IFS from the read builtin, one could just put set -e IFS in their config.fish and opt out of the backwards compatibility.

The advantage of this is that we can always losslessly store the result of a command in a variable, e.g.

function print_0
    set s $status
    if test (count $argv) -ge 1
        printf "%s" $argv[1]
        for a in $argv[2..-1]
            printf "\0%s" $a
        end
    end
    return $s
end

set v (cat /my-file)
print_0 $v > /my-file2
diff /my-file /my-file2  # should return 0 for any possible files

Unfortunately, this requires that a trailing \0 be treated as a separate empty string.

An additional benefit of my proposal is that you can now use print_0 (cmd) for arbitrary commands, and it will produce the same output and exit status as running begin; cmd; end, which is much uglier, especially when used in pipes or as part of more complicated commands.

@ridiculousfish
Member

84b7c2b adds support for string split0, which splits on NUL (and not newlines), and string join0 for the reverse. I think this is the right way to work with NUL-delimited output.
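A sketch of the split0/join0 semantics in Python (pseudocode only; the assumption that a trailing NUL acts as a terminator rather than producing an empty element follows the discussion above, not the commit itself):

```python
def split0(data: bytes) -> list:
    """Split on NUL only; a single trailing NUL is a terminator."""
    if data.endswith(b"\x00"):
        data = data[:-1]
    return data.split(b"\x00") if data else []

def join0(items) -> bytes:
    """Inverse: terminate every element with a NUL."""
    return b"".join(i + b"\x00" for i in items)

paths = [b"normal file", b"new\nline"]
assert split0(join0(paths)) == paths  # lossless round trip
```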

@faho faho modified the milestones: fish-future, fish-3.0 Sep 15, 2018
@github-actions github-actions bot locked as resolved and limited conversation to collaborators Apr 16, 2020