Support one-to-many piping in the pipeline syntax #500

Open · xiaq opened this issue Oct 11, 2017 · 9 comments

@xiaq
Member

xiaq commented Oct 11, 2017

This issue is filed from #485, which asks for the ability to pipe both the stdout and the stderr of a command to different commands. That issue was closed because the functionality is now possible with the low-level run-parallel and pipe builtins, but no new syntax was introduced.

This issue discusses the possibility of extending the pipeline syntax to support such a pipeline configuration. Citing @mqudsi's comment, it is not easy to come up with an unambiguous syntax for this:

The primary issue with working on both is parsing intent. Presuming stream redirection operators like 1>| and 2>|, how the command

./foo 1>| bar 2>| bar2

is parsed is ambiguous. Is the stderr redirection meant to apply to ./foo or to the output of bar? This can only be solved by using brackets:

./foo 1>| bar 2>| bar2
./foo 1>| { bar 2>| bar2 }

A comment about this: I am not sure whether @mqudsi proposes that ./foo 1>| bar 2>| bar2 mean "pipe stdout of foo to bar, and stderr of foo to bar2", but if that is the case, it is quite counter-intuitive. Traditional pipelines always work in a linear fashion, so it is tempting to interpret this as "pipe stdout of foo to bar, and stderr of bar to bar2".

The syntax for the pipeline should prioritize linear pipelines and make non-linear pipelines more explicit.


Traditionally, this functionality is implemented with process substitution:

foo > >(bar) 2> >(bar2)

However, process substitution relies on support for either the /dev/fd filesystem or named FIFOs. This is backwards: named FIFOs or /dev/fd are indeed needed when the process substitution is used as a command argument, but when it is used in a redirection, the same functionality is entirely implementable with plain, unnamed pipes.

In fact, in Elvish it is already possible to do this, except that you have to manage the lifecycle of pipes manually:

pout = (pipe)
perr = (pipe)
run-parallel {
  foo > $pout 2> $perr   # producer: stdout to one pipe, stderr to the other
  pwclose $pout          # close the write ends so the readers see EOF
  pwclose $perr
} {
  bar < $pout            # consumes foo's stdout
  prclose $pout
} {
  bar2 < $perr           # consumes foo's stderr
  prclose $perr
}

Note that in the first function passed to run-parallel, foo > $pout 2> $perr resembles the process substitution version. This is expected.


Now for brainstorming a new syntax!

I think this is a bad idea, but a very intuitive syntax can look like this:

foo | bar
   2| bar2

I have changed the 2>| proposed by @mqudsi to 2| for terseness. As with 2>, there must be no space between the 2 and the |.

With longer pipelines you will need to line them up:

foo | bar | quux
   2| bar2 # applies to stderr of foo
         2| quux2 # applies to stderr of bar

This syntax really takes whitespace-dependent syntax to the extreme. Again I don't think it's a good idea.

Another idea is to support putting markers on commands in a pipeline, so that they can be referred to later on. Here I use ^name as both the marker and the reference, but we will likely need separate syntax for the two:

foo ^f | bar ^b | quux
   ^f 2| bar2
            ^b 2| quux2

The parser can work by looking beyond the pipeline on the first line and, as long as subsequent lines start with a marker, adding them to that pipeline.

@zzamboni
Contributor

zzamboni commented Oct 12, 2017

@xiaq First of all, I must say that this is a great feature. I have wished for parallel stdout/stderr pipelines for a long time, and it's amazing that Elvish already supports this. And I also learned about pipe and friends in the process - great work!

About the syntax - honestly I think the following should be clear enough:

cmd | stdout-consumer 2| stderr-consumer

Lambdas are already allowed in pipes, so disambiguation could be done easily; e.g., to my eyes, the following is clear:

cmd | { cmd2 | cmd3 } 2| { cmd4 | cmd4 2| cmd5 }

However, I understand your point about the linearity of pipes. If you want to make it explicit, why not introduce a new builtin (pipesplit comes to mind) that works like this:

pipesplit cmd stdout-consumer stderr-consumer

This could internally map directly into the structure you described with run-parallel.

@zzamboni
Contributor

zzamboni commented Oct 12, 2017

It just dawned on me that this can be implemented as a function, and it works:

fn pipesplit [l1 l2 l3]{
  pout = (pipe)
  perr = (pipe)
  run-parallel {
    $l1 > $pout 2> $perr
    pwclose $pout
    pwclose $perr
  } {
    $l2 < $pout
    prclose $pout
  } {
    $l3 < $perr
    prclose $perr
  }
}

E.g.:

> pipesplit { echo stdout-test; echo stderr-test >&2 } { echo STDOUT: (cat) } { echo STDERR: (cat) }
STDOUT: stdout-test
STDERR: stderr-test
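
A minimal variation on the same idea (just a sketch, using only the builtins above, with a hypothetical name stderr-to) diverts only stderr and leaves stdout alone:

fn stderr-to [producer consumer]{
  perr = (pipe)
  run-parallel {
    $producer 2> $perr   # producer's stdout is untouched
    pwclose $perr
  } {
    $consumer < $perr    # consumer reads the producer's stderr
    prclose $perr
  }
}

E.g. stderr-to { foo } { bar2 } roughly corresponds to the hypothetical foo 2| bar2, with foo's stdout going wherever it normally would. If the whole call is itself placed in a pipeline, though, the consumer's stdout is also connected to the next stage, which is one more argument for dedicated syntax.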

@xiaq
Member Author

xiaq commented Oct 12, 2017

@zzamboni Right, it is already implementable as a function :)

cmd | stdout-consumer 2| stderr-consumer might seem clear for such simple pipelines, but we need a clear rule for longer pipelines. For example, what about foo | bar 2| lorem 2| ipsum? Whose stderr does ipsum get?

@zzamboni
Contributor

zzamboni commented Oct 12, 2017

@xiaq the rule could be "maximum one | and one 2| per level", and force the user to disambiguate. For example, foo | bar 2| lorem 2| ipsum would give an error because there are 2 stderr-pipeline symbols at the same level. The user would need to specify using lambdas whether he meant foo | bar 2| { lorem 2| ipsum } or foo | { bar 2| lorem } 2| ipsum, both of which would be valid according to the rule.

@zzamboni
Contributor

By using lambdas, the user could choose to use indentation to clarify things, as in some of your suggestions above, but the meaning would be clear to Elvish from the nesting, not from the indentation. E.g.:

foo `
| {
  bar 2| lorem
} `
2| ipsum

@zakukai

zakukai commented Apr 6, 2020

I think using lambdas to disambiguate the proposed multiple-pipeline syntax would not actually achieve the desired result. The expectation with a pipe is that it connects the command immediately to its left to the command immediately to its right. So:

~> A | B 2| C

This really would have to connect B's stderr, not A's, to C. Working around that with lambdas doesn't get us far IMO:

~> { A 2| C } | B      # B gets a combination of stdout and value streams from A and C
~> { A | B } 2| C      # C gets a combination of stderr from A and B

Getting around this just with lambdas in the pipeline would require preventing the undesired output of B or C from reaching the other tool that's being connected to A's other output:

~> { A | B 2> /dev/null } 2| C

...And of course that's not great either: to get B's stderr as part of the whole pipeline's combined stderr, you would need to dup stderr:

~> { { A | B 2>&3 } 2| C } 3>&2

I know we already have a solution in the form of a user-defined function ("pipesplit" shown above) - and it's actually a good solution, but I want to explore the possibilities a bit with regard to syntax:

Consider: the real benefit of redirecting to a process substitution, instead of piping, is that it uses redirection syntax rather than pipeline syntax. Redirections do not chain; instead they stack on the last command in the pipeline, so it's relatively straightforward to attach several of them to a single command.
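
To make the contrast concrete (a trivial sketch, not specific to any proposal here): both redirections below stack on foo, whereas each | in a pipeline only connects the two adjacent commands.

foo > out.txt 2> err.txt   # both redirections apply to foo
foo | bar | quux           # each | only connects adjacent commands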

So we could do something like this:

fn pipe-command [cmd]{
  local:p = (pipe)
  { $cmd < $p; prclose $p } &
  put $p
}
~> A stderr>(pipe-command { C }) | B

Personally I like that better, at least expressively speaking, than "pipesplit". It doesn't use /dev/fd (because Elvish supports file objects that can be used in redirections), and it's easier to expand to truly "many" pipelines rather than just one or two.

Of course, there are problems with the implementation of "pipe-command" above:

  • The write end of the pipe isn't closed until the pipe gets garbage-collected. (Maybe we could have an atomic "redirect to an open file and close it"?)
  • The pipe object is retained until "C" terminates. (see above about atomic redirect-and-close)
  • "C" can therefore hang: If "C" won't terminate until it gets EOF on its input, that won't occur until the write end of the pipe is closed, which won't happen until the pipe gets GC'ed, which won't happen... until "C" terminates.
  • The read end of the pipe is held open until "C" terminates: that means if "C" closes its input, the pipe buffer could fill up and "A" could hang trying to write to stderr (rather than getting SIGPIPE and probably terminating)
  • "C" isn't part of the pipeline job: The job won't wait for "C" to terminate and "C" won't receive suspend or interrupt signals issued to the job's process group, etc.

(Personally I think it would really be preferable to have the two ends of the pipe as separate objects. Relying on GC to clean up a pipe isn't an ideal situation, of course, since it's not immediate - but the two ends are usually handled separately, so it doesn't really make sense IMO to bind them together.
It should also be noted that some of the issues outlined above apply to "pipesplit" as well: processes in the "pipeline" can't signal each other by closing their individual pipe ends, because the shell holds those pipe ends open until the processes terminate.)

So building on this idea, I think a good answer could be to add syntax that works like "pipe-command" above, except better-managed: the shell doesn't retain the read end of the pipe at all, avoiding both the deadlock that keeps the pipe from being GC'ed and the hang where A fills up the pipe buffer when "C" closes its input but doesn't terminate:

~> put (| X)  # "Pipe into X": Run X as a background job, yield a pipe to its input
▶ <pipe{-1 10}>
~> put (Y |)  # "Pipe from Y": Like above but it captures output instead
▶ <pipe{11 -1}>
~> A stderr> (| C) | B   # Connect A's stderr to a pipe into C...

It still doesn't solve the problem of C not being part of the pipeline job, unfortunately. C will run in the background until it self-terminates. But when A terminates, the write end of the pipe to C should be closed (possibly after a GC?), so if C terminates on EOF of its input, it would terminate at that point.

Alternately, if the feature were only allowed as part of a pipeline, "C" could be part of the pipeline job, and the shell would be in a better position to manage the lifetime of the pipe:

~> A stderr>| C | B

Not sure about the syntax (I at least like it better than "A > >(C)"...) but you get the idea: it's a pipe, but with a syntax that behaves like a redirect so that multiple of them would all apply to A rather than chaining.

@xiaq
Member Author

xiaq commented Apr 29, 2020

@zakukai I really like your syntax exploration and enjoyed reading it. Thanks for sharing your insights.

IIUC, the (| X) and (Y |) syntax you described actually has almost the same semantics as process substitution, except that they evaluate to pipe objects instead of filenames. As you said, the advantage is not depending on /dev/fd or named pipes, but it has the disadvantage that they can no longer be used as arguments to external commands. For example, the classical use case of process substitution, diff <(sort file1) <(sort file2), cannot be satisfied by this.

I echo your sentiment that A stderr>| C | B has the best trade-off so far. It has the visual of both a redirection, which means it stacks on a command, and a pipe, which hints at the parallelism and at the fact that the shell is managing the pipes under the hood.

@zakukai

zakukai commented Apr 30, 2020

Yes, the (| X) syntax, as I used it in my post, is kind of like process substitution, but evaluating to a pipe object. I think it's also very similar to what in Bash or Korn Shell we would call a "coprocess", in that it launches the job and yields a pipe to talk to it - but it's used in this case with cleanup semantics that would make it useful as a temporary, inline part of another command.

But the purpose of it wasn't to be a general-purpose replacement for process substitution; rather, my observation was that the real contribution of process substitution in forming this pipeline was that it allowed us to use redirection syntax, rather than pipeline syntax, to form a pipeline.

A hypothetical syntax like that could work as a general-purpose replacement for process substitution, if we added another piece to do the following:

  1. Attach that file object to the command being run (as a redirection to an unused FD, essentially)
  2. Insert into the command arguments a filename that identifies that numbered file descriptor (i.e. a /dev/fd path)

A mechanism like that could also apply more generally to other file objects - except that /dev/fd is kind of a terrible mess of a feature (at least on Linux) and does not work generally for other file objects (particularly sockets, which on Linux simply do not work on /dev/fd - this "broke" /dev/stdout in recent versions of Korn Shell, where pipelines were implemented with socket pairs). Given the issues surrounding /dev/fd personally I'm not too inclined to build features around it.

So why go through all this to create an alternate syntax for a problem that can already be solved with process substitution? Basically I think that process substitution is really ugly syntax when it's combined with redirection to turn it back into a pipeline:

cmd1 > >(cmd2)           # Yuck! And we need that space between the two '>' characters:
cmd1 >>(cmd2)            # Easily confused with command substitution in Elvish!
cmd1 --outfile >(cmd2)   # "Actual use case" of process substitution is not so bad...  Though it still looks like redirection-to-command-substitution in Elvish

And, more generally, I want to explore how existing shell idioms could be transformed (for the better, one would hope) in a shell whose feature set may provide fundamentally better ways of doing some of these things.

I like the general applicability of (| command), but the reason I also explored a more restrictive version that could only be used within a pipeline is that it makes it clearer that the stderr-consumer process should be part of the job, rather than a job all its own:

# If we allow (| C) to be used in isolation:
stderr_consumer=(| C)    # C is run as a background job
A 2>$stderr_consumer | B  # Job returns when A and B terminate - C may still be running

# Combining the two lines:
A 2>(| C) | B      # Is C part of the job or is it a background job all its own?

# Alternately, if (| C), or some alternate syntax for the concept, may only be used as part of a pipeline:
x=(| C)    # This wouldn't work
A 2>(| C) | B   # C is part of the job, shell will wait for all three processes to terminate and all will be in the same process group for job control purposes

Looking at it again I think A 2>|C | B is a bit confusing because it looks like the output of C is going to go into B. I'm not sure any of my concepts address that very well, honestly.

@L-as

L-as commented Aug 8, 2020

Would it be possible to just detect when something like foo > >(bar) 2> >(bar2) can be done purely with unnamed pipes instead and just do it with unnamed pipes then?
