support `console` pool to stream output of long-running commands #68

keith · 2023-06-30T17:31:10Z

In llvm there are a lot of build targets like check-lld which run a large suite of lit tests. It's useful to see the progress of these commands since they can be very long running, and if something fails in the middle you might want to ctrl-c out of the build and triage that specific failure rather than wait for them all to be done.

The text was updated successfully, but these errors were encountered:

dae · 2023-07-04T01:39:36Z

On a semi-related note, I have a couple of long-running commands that I'd prefer to see status output on (things like invoking a build in a separate project to get its build outputs, and invoking package managers that may take time to download+install things). What kept me from using console with ninja is the fact that it also serializes access, so the build becomes less parallel if the output is shown from multiple commands.

I presume that the serialization was done to avoid interleaved output from multiple commands? Would a pool like console that allowed such interleaving (line buffered) make sense?

As n2 shows each running command on a separate line, one alternative to streaming the lines with scrolling would be to have the last line output by a given command to be shown to the right of the running command - even if it were only updated once a second, that would give some indication of how things are progressing. But I guess it would not be helpful in cases where a long-running task printed errors and then printed other text.

Not an essential feature in my case anyway - n2 showing what tasks are currently running has already been a big improvement over ninja, where some task had stalled but NINJA_STATUS was showing some other command that happened to end afterwards.

evmar · 2023-07-04T01:41:48Z

Yeah, my thinking here we could maybe show the most recent N lines from commands as they are running (and then their full output when they complete). The main question is whether it can be done in a way that isn’t visually overwhelming.

Sarcasm · 2023-07-04T22:40:25Z

The console pool in ninja has to run serially because stdin access is also possible, I think.

Mixing output of multiple commands or truncating it does not work well with some tools that overwrite lines in-place, like ninja/n2.
Personnally, I use the ninja console pool for sphinx-build output, which has some sort of inline progress output, if stdout or stderr is a terminal, it will print this kind of progress bar:

reading sources... [  6%] release/v1.0\r
reading sources... [ 13%] release/v1.1\r
reading sources... [ 20%] release/v1.2\r
reading sources... [ 26%] release/v1.3\r
reading sources... [ 33%] release/v2.0\r
[...]\r
reading sources... [100%] release/index
[...]
dumping object inventory... done
build succeeded.

The HTML pages are in [...]/build/doc/html.

If stdin access is ignored and only process output (stdout, stderr) is sufficient I think it's possible to have a nice linear log, a command at a time.
The first of the commands that is int the console pool can write to the terminal, and the others can be buffered.

IMHO interleaving or truncating the outputs of the console pool would make it less useful.

Related reading "redo, buildroot, and serializing parallel logs":

https://apenwarr.ca/log/20181106

dae · 2023-07-04T23:47:39Z

I guess there are a couple of different use cases and trade-offs here. The one I was focusing on was seeing the most up-to-date output from a command, to e.g. see if it has stalled. If output were delayed until completion so it could be displayed linearly, this case wouldn't be covered so well.

For commands where you wanted the output non-interleaved, that's already the status-quo, isn't it? My thinking was that there'd be a way to opt-in to immediate output for a command, while otherwise keeping things as they are. If the full output from the command were buffered, it could be reprinted in full if the command failed, which would make debugging issues easier (or alternatively, logged to a file in the failure case)

evmar · 2023-07-26T12:34:55Z

Thought about this more, thanks for the comments here.

I think the Ninja behavior is:

command output is buffered, and printed when commands finished
'console' commands serialize and have direct access to tty

I had thought I might try something clever around running console commands in parallel but I think existing ninja projects are likely too sensitive to work with it. In particular I think I don't want to get into pseudotty behavior.

(I think the one tweak I will attempt is for non-console commands, while they are running, I might display their most recent line of output in the temporary progress.)

While a task is running, we gather the last line of its (buffered) output and display it in the fancy progress. Part of #68.

evmar · 2023-07-26T13:43:47Z

Looks like this:

https://asciinema.org/a/kZGBQLft3LdZwqSJRblZZShmx

(The subcommand here print the literal string print1: 1 etc)

dae · 2023-07-27T04:00:23Z

For adding better insight into long-running subtasks, this is a nice improvement!

recording.mp4

Normally, I run commands with a wrapper that hides all their output unless there's an error. To test out this new behaviour, I turned that off, but now the full output of the command is printed as it completes, even in the success case, which makes the build output somewhat messy. It would be great to have some sort of option like a build variable that could suppress the aggregated output in the success case, as while I love the extra insight during build, I'd miss the clean output if I had to decide on only one of the two.

One other minor thing: when a build step invokes n2 in a subproject, n2 prints the descriptions of each step without any leading indent, so it becomes a bit hard to distinguish between the currently-running step in the outer invocation, and the step that's running in the inner invocation - for example ts:congrats is an inner step here:

Would one or two spaces before the last output line, truncating the last one or two characters, make sense?

As requested on #68.

evmar · 2023-07-28T18:57:05Z

Would one or two spaces before the last output line, truncating the last one or two characters, make sense?

Good idea, done.

evmar · 2023-07-28T18:58:39Z

It would be great to have some sort of option like a build variable that could suppress the aggregated output in the success case, as while I love the extra insight during build, I'd miss the clean output if I had to decide on only one of the two.

You're in a fairly unique situation in that most people only have .ninja files generated by a system they don't control. I like the idea though, let me think about it. (I recall in Ninja we had a lot of circular discussions around different verbosity-related command-line flags to either show task output or not, show task command lines or not, etc.)

dae · 2023-08-23T01:36:11Z

I was eager to try out the new functionality, so I've hacked in some options in a local branch. Only the Pretty progress path has been updated:

ankitects@5620260

One curious thing I noticed after being able to see the last line of output is that sometimes there's a fairly long period between when the command has finished, and when it disappears from the list of tasks. For example, 'build:minilints' in the following video. Can you think of a reason why that might be happening?

recording.mp4

dae · 2023-08-29T01:28:22Z

I've been noticing this happen fairly frequently - some commands, whether they have any output or not, appear to run for much longer than they should, and don't appear in top - so I presume something is a bit funky with the cleaning up of finished commands at the moment.

evmar · 2023-08-29T13:41:22Z

Do the stuck tasks ever finish?

Two guesses are:

Somehow the bookkeeping in the UI is off, where the "task finished" message gets treated wrong. The relevant UI code is here. This code does reuse ids so maybe there's some race* between a terminated task and a new one claiming the id? The id is released in the call to .wait and the above code is invoked just after though, in here. Edit: this is wrong, they are id'd by unique build ids and there is no reuse.
Alternatively, we're failing to track when the process actually finishes. This is where we block on the subprocess output/termination.

To judge if it's the second one you might be able to look at (via like ps) whether the subprocess is really still present when things are stuck.

* Not a race in the threadsy sense, but rather a logic error related to the order of events.

evmar · 2023-08-29T13:58:02Z

(I previously wrote that I might have been reusing task ids and getting the bookkeeping wrong, but that was a misread of the code. n2 uses unique types for each of the various "id" integers flying around so I think it's unlikely I'm mixing these up, but ...)

dae · 2023-08-29T14:09:17Z

The tasks do eventually complete, and it seems to coincide with the completion of some other long-running task. As far as I can tell, the tasks are no longer in the process list (well, top - I'll check ps next time) after their last output, so it just seems to be the status not updating. It happens with a moderate degree of frequency, but is sporadic - the next time it happens I'll see if I can get get it to trigger more reproducibly.

From looking at #68.

I think this was a real bug. Its symptom would be that after a task produced output, if nothing else happened, we might wait a while before updating the UI. From looking at #68, but won't fix it.

Ninja does this. Not sure if it helps with #68.

Possibly could help with #68?

evmar · 2023-08-29T17:12:22Z

I pushed 3 small fixes in this area that might help, but I'm not sure about any of them.

I think if the bug is related to the process lurking around too long, it would likely be lurking in some sort of hung (like waiting to write to a pipe) or zombie state, so it wouldn't show up at the top of top. I would instead expect the process to still exist but just be waiting.

dae · 2023-08-29T22:57:28Z

Thanks, I tried again with the latest changes. Looking at ps when I see a task that has printed its last line but appears stuck, I see one or more zombie processes:

dae 2319878 0.0 0.0 0 0 pts/1 Z+ 08:08 0:00 [sh]

I added some logging to run_command() try get better insight into what's going on, and it appears the read() call hangs for a while after the process has printed its final output:

1693349032 nextest: begin open
1693349032 nextest: close
1693349032 nextest: block to read
1693349032 nextest read '�[0m�[0m�[1m�[36m    Blocking�[0m waiting for file lock on package cache
'
1693349032 nextest: block to read
1693349032 nextest read '�[0m�[0m�[1m�[36m    Blocking�[0m waiting for file lock on package cache
'
1693349032 nextest: block to read
1693349032 nextest read '�[0m�[0m�[1m�[32m   Compiling�[0m anki_proto v0'
1693349032 nextest: block to read
1693349032 nextest read '.0.0 (/home/dae/Work/desktop/anki/rslib/proto)
'
1693349032 nextest: block to read
1693349033 nextest read '�[0m�[0m�[1m�[32m   Compiling�[0m anki v'
1693349033 nextest: block to read
1693349033 nextest read '0.0.0 (/home/dae/Work/desktop/anki/rslib)
'
1693349033 nextest: block to read
1693349035 nextest read '�[0m�[0m�[1m�[32m   Compiling�[0m linkchecker v0.0.0 ('
1693349035 nextest: block to read
1693349035 nextest read '/home/dae/Work/desktop/anki/rslib/linkchecker)
'
1693349035 nextest: block to read
1693349036 nextest read '�[0m�[0m�[1m�[32m   Compiling�[0m'
1693349036 nextest: block to read
1693349036 nextest read ' anki-sync-server v0.0.0 (/home/dae/Work/desktop/anki/rslib/sync)
�[0m�[0m�[1m�[32m   Compiling�[0m rsbridge v0.0.0 ('
1693349036 nextest: block to read
1693349036 nextest read '/home/dae/Work/desktop/anki/pylib/rsbridge)
'
1693349036 nextest: block to read
1693349036 nextest read '�[0m�[0m�[1m�[32m    Finished�[0m '
1693349036 nextest: block to read
1693349036 nextest read 'test [unoptimized + debuginfo] target(s) in 4.42s
'
1693349036 nextest: block to read
1693349036 nextest read '�[32;1m    Starting�[0m �[1m262�[0m tests across �[1m17�[0m binaries
'
1693349036 nextest: block to read
1693349037 nextest read '------------
�[32;1m     Summary�[0m [   0.702s] �[1m262�[0m tests run: �[1m262�[0m �[32;1mpassed�[0m, �[1m0�[0m �[33;1mskipped�[0m
'
1693349037 nextest: block to read
1693349045 nextest: got 0
1693349045 nextest: waitpid start
1693349045 nextest: waitpid end

I've figured out a way to reproduce this fairly consistently, so if you have any suggestions on things I should try, please let me know.

dae · 2023-08-29T23:16:42Z

Could it be that the pipes for one spawned process are leaking into other spawned processes? That would explain why a command seems to unblock when another happens to complete, and with the following change, the issue seems to disappear:

    for fd in 3..1024 {
        actions.addclose(fd)?;
    }
    // actions.addclose(pipe[0])?;
    // actions.addclose(pipe[1])?;

We want to open all our fds using CLOEXEC so they're not inherited by child processes. All Rust builtin APIs do this already, but we used libc::pipe() for our pipe. To fix this, we use pipe2() on non-Mac (which allows passing the flag) and Mac-specific POSIX_SPAWN_CLOEXEC_DEFAULT otherwise. Fixes leak found in #68.

evmar · 2023-08-30T06:00:43Z

Great find! I dug a bit and wrote I think a proper fix in 12652a1, hopefully.

evmar · 2023-08-30T06:03:39Z

At least on this Mac, running the build in tests/manual/prints I would see the processes inheriting stuff like

bash    23396 evmar    1   PIPE 0x908089607cfefd0b    16384                     ->0x656addaecb5f1c95
bash    23396 evmar    2   PIPE 0x908089607cfefd0b    16384                     ->0x656addaecb5f1c95
bash    23396 evmar    4   PIPE 0xe05bffa6d41dad18    16384                     ->0xf57f30ba54da4219
bash    23396 evmar    5   PIPE 0xf57f30ba54da4219    16384                     ->0xe05bffa6d41dad18
bash    23396 evmar    6   PIPE 0xf93745033fa400af    16384                     ->0x9a377a882197cb70
bash    23396 evmar    7   PIPE 0x9a377a882197cb70    16384                     ->0xf93745033fa400af

and after the patch they only have pipe fds 0/1/2.

dae · 2023-08-30T22:39:54Z

Thanks for looking into it; I can confirm it's all working smoothly now, and CLOEXEC is much more elegant than closing 1000 FDs.

dae · 2023-09-05T01:22:55Z

Every time I run a build, being able to see the latest line of output makes me smile :-) It's such a nice improvement.

Have you had any thoughts yet on whether output-suppressing options might be suitable for inclusion / what they might look like?

evmar · 2024-01-08T19:38:51Z

Sorry for dropping this thread! I think we have a few different ideas mixed in here and I wanna make sure I address all of them.

I think for @keith's original request for check-lld the current behavior is maybe now sufficient.
For @Sarcasm's sphinx-build output, I suspect that because n2 buffers the output, sphinx doesn't believe it's connected to a console, so n2's behavior doesn't work. I think support for a true console pool where n2 would suppress its own status updates while the subtask ran (or, possibly, using a pty, yuck) would address that.
For @dae's final remark about output-suppression, I will file a new bug.

evmar · 2024-01-08T19:52:56Z

I built check-lld and it looks pretty good to me:

evmar changed the title ~~n2 doesn't stream test output~~ support console pool to stream output of long-running commands Jul 4, 2023

evmar added a commit that referenced this issue Jul 26, 2023

show last line of output from tasks

dffbd7e

While a task is running, we gather the last line of its (buffered) output and display it in the fancy progress. Part of #68.

evmar added a commit that referenced this issue Jul 28, 2023

indent output from tasks in progress

599e162

As requested on #68.

evmar added a commit that referenced this issue Aug 29, 2023

clearer task started/finished api

f99ee6a

From looking at #68.

evmar added a commit that referenced this issue Aug 29, 2023

mark dirty on task output

84c14c0

I think this was a real bug. Its symptom would be that after a task produced output, if nothing else happened, we might wait a while before updating the UI. From looking at #68, but won't fix it.

evmar added a commit that referenced this issue Aug 29, 2023

open /dev/null over stdin in subprocesses

fc0f93b

Ninja does this. Not sure if it helps with #68.

evmar added a commit that referenced this issue Aug 29, 2023

close pipe before waitpid

4ba54d1

Possibly could help with #68?

evmar mentioned this issue Jan 8, 2024

proper console pool support #100

Open

evmar mentioned this issue Jan 8, 2024

better control over subcommand output #101

Open

evmar closed this as completed Jan 8, 2024

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

support `console` pool to stream output of long-running commands #68

support `console` pool to stream output of long-running commands #68

keith commented Jun 30, 2023

dae commented Jul 4, 2023

evmar commented Jul 4, 2023

Sarcasm commented Jul 4, 2023

dae commented Jul 4, 2023

evmar commented Jul 26, 2023

evmar commented Jul 26, 2023

dae commented Jul 27, 2023

evmar commented Jul 28, 2023

evmar commented Jul 28, 2023

dae commented Aug 23, 2023 •

edited

dae commented Aug 29, 2023

evmar commented Aug 29, 2023 •

edited

evmar commented Aug 29, 2023

dae commented Aug 29, 2023 •

edited

evmar commented Aug 29, 2023

dae commented Aug 29, 2023

dae commented Aug 29, 2023

evmar commented Aug 30, 2023

evmar commented Aug 30, 2023

dae commented Aug 30, 2023

dae commented Sep 5, 2023

evmar commented Jan 8, 2024

evmar commented Jan 8, 2024

support console pool to stream output of long-running commands #68

support console pool to stream output of long-running commands #68

Comments

keith commented Jun 30, 2023

dae commented Jul 4, 2023

evmar commented Jul 4, 2023

Sarcasm commented Jul 4, 2023

dae commented Jul 4, 2023

evmar commented Jul 26, 2023

evmar commented Jul 26, 2023

dae commented Jul 27, 2023

evmar commented Jul 28, 2023

evmar commented Jul 28, 2023

dae commented Aug 23, 2023 • edited

dae commented Aug 29, 2023

evmar commented Aug 29, 2023 • edited

evmar commented Aug 29, 2023

dae commented Aug 29, 2023 • edited

evmar commented Aug 29, 2023

dae commented Aug 29, 2023

dae commented Aug 29, 2023

evmar commented Aug 30, 2023

evmar commented Aug 30, 2023

dae commented Aug 30, 2023

dae commented Sep 5, 2023

evmar commented Jan 8, 2024

evmar commented Jan 8, 2024

support `console` pool to stream output of long-running commands #68

support `console` pool to stream output of long-running commands #68

dae commented Aug 23, 2023 •

edited

evmar commented Aug 29, 2023 •

edited

dae commented Aug 29, 2023 •

edited