Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

support console pool to stream output of long-running commands #68

Closed
keith opened this issue Jun 30, 2023 · 23 comments
Closed

support console pool to stream output of long-running commands #68

keith opened this issue Jun 30, 2023 · 23 comments

Comments

@keith
Copy link

keith commented Jun 30, 2023

In llvm there are a lot of build targets like check-lld which run a large suite of lit tests. It's useful to see the progress of these commands since they can be very long running, and if something fails in the middle you might want to ctrl-c out of the build and triage that specific failure rather than wait for them all to be done.

@evmar evmar changed the title n2 doesn't stream test output support console pool to stream output of long-running commands Jul 4, 2023
@dae
Copy link
Contributor

dae commented Jul 4, 2023

On a semi-related note, I have a couple of long-running commands that I'd prefer to see status output on (things like invoking a build in a separate project to get its build outputs, and invoking package managers that may take time to download+install things). What kept me from using console with ninja is the fact that it also serializes access, so the build becomes less parallel if the output is shown from multiple commands.

I presume that the serialization was done to avoid interleaved output from multiple commands? Would a pool like console that allowed such interleaving (line buffered) make sense?

As n2 shows each running command on a separate line, one alternative to streaming the lines with scrolling would be to have the last line output by a given command to be shown to the right of the running command - even if it were only updated once a second, that would give some indication of how things are progressing. But I guess it would not be helpful in cases where a long-running task printed errors and then printed other text.

Not an essential feature in my case anyway - n2 showing what tasks are currently running has already been a big improvement over ninja, where some task had stalled but NINJA_STATUS was showing some other command that happened to end afterwards.

@evmar
Copy link
Owner

evmar commented Jul 4, 2023

Yeah, my thinking here we could maybe show the most recent N lines from commands as they are running (and then their full output when they complete). The main question is whether it can be done in a way that isn’t visually overwhelming.

@Sarcasm
Copy link

Sarcasm commented Jul 4, 2023

The console pool in ninja has to run serially because stdin access is also possible, I think.

Mixing output of multiple commands or truncating it does not work well with some tools that overwrite lines in-place, like ninja/n2.
Personnally, I use the ninja console pool for sphinx-build output, which has some sort of inline progress output, if stdout or stderr is a terminal, it will print this kind of progress bar:

reading sources... [  6%] release/v1.0\r
reading sources... [ 13%] release/v1.1\r
reading sources... [ 20%] release/v1.2\r
reading sources... [ 26%] release/v1.3\r
reading sources... [ 33%] release/v2.0\r
[...]\r
reading sources... [100%] release/index
[...]
dumping object inventory... done
build succeeded.

The HTML pages are in [...]/build/doc/html.

If stdin access is ignored and only process output (stdout, stderr) is sufficient I think it's possible to have a nice linear log, a command at a time.
The first of the commands that is int the console pool can write to the terminal, and the others can be buffered.

IMHO interleaving or truncating the outputs of the console pool would make it less useful.

Related reading "redo, buildroot, and serializing parallel logs":

@dae
Copy link
Contributor

dae commented Jul 4, 2023

I guess there are a couple of different use cases and trade-offs here. The one I was focusing on was seeing the most up-to-date output from a command, to e.g. see if it has stalled. If output were delayed until completion so it could be displayed linearly, this case wouldn't be covered so well.

For commands where you wanted the output non-interleaved, that's already the status-quo, isn't it? My thinking was that there'd be a way to opt-in to immediate output for a command, while otherwise keeping things as they are. If the full output from the command were buffered, it could be reprinted in full if the command failed, which would make debugging issues easier (or alternatively, logged to a file in the failure case)

@evmar
Copy link
Owner

evmar commented Jul 26, 2023

Thought about this more, thanks for the comments here.

I think the Ninja behavior is:

  • command output is buffered, and printed when commands finished
  • 'console' commands serialize and have direct access to tty

I had thought I might try something clever around running console commands in parallel but I think existing ninja projects are likely too sensitive to work with it. In particular I think I don't want to get into pseudotty behavior.

(I think the one tweak I will attempt is for non-console commands, while they are running, I might display their most recent line of output in the temporary progress.)

evmar added a commit that referenced this issue Jul 26, 2023
While a task is running, we gather the last line of its (buffered) output
and display it in the fancy progress.

Part of #68.
@evmar
Copy link
Owner

evmar commented Jul 26, 2023

Looks like this:

https://asciinema.org/a/kZGBQLft3LdZwqSJRblZZShmx

(The subcommand here print the literal string print1: 1 etc)

@dae
Copy link
Contributor

dae commented Jul 27, 2023

For adding better insight into long-running subtasks, this is a nice improvement!

recording.mp4

Normally, I run commands with a wrapper that hides all their output unless there's an error. To test out this new behaviour, I turned that off, but now the full output of the command is printed as it completes, even in the success case, which makes the build output somewhat messy. It would be great to have some sort of option like a build variable that could suppress the aggregated output in the success case, as while I love the extra insight during build, I'd miss the clean output if I had to decide on only one of the two.

One other minor thing: when a build step invokes n2 in a subproject, n2 prints the descriptions of each step without any leading indent, so it becomes a bit hard to distinguish between the currently-running step in the outer invocation, and the step that's running in the inner invocation - for example ts:congrats is an inner step here:

2023-07-27T13:48:36,903363719+10:00

Would one or two spaces before the last output line, truncating the last one or two characters, make sense?

evmar added a commit that referenced this issue Jul 28, 2023
@evmar
Copy link
Owner

evmar commented Jul 28, 2023

Would one or two spaces before the last output line, truncating the last one or two characters, make sense?

Good idea, done.

@evmar
Copy link
Owner

evmar commented Jul 28, 2023

It would be great to have some sort of option like a build variable that could suppress the aggregated output in the success case, as while I love the extra insight during build, I'd miss the clean output if I had to decide on only one of the two.

You're in a fairly unique situation in that most people only have .ninja files generated by a system they don't control. I like the idea though, let me think about it. (I recall in Ninja we had a lot of circular discussions around different verbosity-related command-line flags to either show task output or not, show task command lines or not, etc.)

@dae
Copy link
Contributor

dae commented Aug 23, 2023

I was eager to try out the new functionality, so I've hacked in some options in a local branch. Only the Pretty progress path has been updated:

ankitects@5620260

One curious thing I noticed after being able to see the last line of output is that sometimes there's a fairly long period between when the command has finished, and when it disappears from the list of tasks. For example, 'build:minilints' in the following video. Can you think of a reason why that might be happening?

recording.mp4

@dae
Copy link
Contributor

dae commented Aug 29, 2023

I've been noticing this happen fairly frequently - some commands, whether they have any output or not, appear to run for much longer than they should, and don't appear in top - so I presume something is a bit funky with the cleaning up of finished commands at the moment.

@evmar
Copy link
Owner

evmar commented Aug 29, 2023

Do the stuck tasks ever finish?

Two guesses are:

  1. Somehow the bookkeeping in the UI is off, where the "task finished" message gets treated wrong. The relevant UI code is here. This code does reuse ids so maybe there's some race* between a terminated task and a new one claiming the id? The id is released in the call to .wait and the above code is invoked just after though, in here. Edit: this is wrong, they are id'd by unique build ids and there is no reuse.
  2. Alternatively, we're failing to track when the process actually finishes. This is where we block on the subprocess output/termination.

To judge if it's the second one you might be able to look at (via like ps) whether the subprocess is really still present when things are stuck.

* Not a race in the threadsy sense, but rather a logic error related to the order of events.

@evmar
Copy link
Owner

evmar commented Aug 29, 2023

(I previously wrote that I might have been reusing task ids and getting the bookkeeping wrong, but that was a misread of the code. n2 uses unique types for each of the various "id" integers flying around so I think it's unlikely I'm mixing these up, but ...)

@dae
Copy link
Contributor

dae commented Aug 29, 2023

The tasks do eventually complete, and it seems to coincide with the completion of some other long-running task. As far as I can tell, the tasks are no longer in the process list (well, top - I'll check ps next time) after their last output, so it just seems to be the status not updating. It happens with a moderate degree of frequency, but is sporadic - the next time it happens I'll see if I can get get it to trigger more reproducibly.

evmar added a commit that referenced this issue Aug 29, 2023
evmar added a commit that referenced this issue Aug 29, 2023
I think this was a real bug.  Its symptom would be that after a task produced
output, if nothing else happened, we might wait a while before updating the UI.

From looking at #68, but won't fix it.
evmar added a commit that referenced this issue Aug 29, 2023
Ninja does this.  Not sure if it helps with #68.
evmar added a commit that referenced this issue Aug 29, 2023
Possibly could help with #68?
@evmar
Copy link
Owner

evmar commented Aug 29, 2023

I pushed 3 small fixes in this area that might help, but I'm not sure about any of them.

I think if the bug is related to the process lurking around too long, it would likely be lurking in some sort of hung (like waiting to write to a pipe) or zombie state, so it wouldn't show up at the top of top. I would instead expect the process to still exist but just be waiting.

@dae
Copy link
Contributor

dae commented Aug 29, 2023

Thanks, I tried again with the latest changes. Looking at ps when I see a task that has printed its last line but appears stuck, I see one or more zombie processes:

dae 2319878 0.0 0.0 0 0 pts/1 Z+ 08:08 0:00 [sh]

I added some logging to run_command() try get better insight into what's going on, and it appears the read() call hangs for a while after the process has printed its final output:

1693349032 nextest: begin open
1693349032 nextest: close
1693349032 nextest: block to read
1693349032 nextest read '�[0m�[0m�[1m�[36m    Blocking�[0m waiting for file lock on package cache
'
1693349032 nextest: block to read
1693349032 nextest read '�[0m�[0m�[1m�[36m    Blocking�[0m waiting for file lock on package cache
'
1693349032 nextest: block to read
1693349032 nextest read '�[0m�[0m�[1m�[32m   Compiling�[0m anki_proto v0'
1693349032 nextest: block to read
1693349032 nextest read '.0.0 (/home/dae/Work/desktop/anki/rslib/proto)
'
1693349032 nextest: block to read
1693349033 nextest read '�[0m�[0m�[1m�[32m   Compiling�[0m anki v'
1693349033 nextest: block to read
1693349033 nextest read '0.0.0 (/home/dae/Work/desktop/anki/rslib)
'
1693349033 nextest: block to read
1693349035 nextest read '�[0m�[0m�[1m�[32m   Compiling�[0m linkchecker v0.0.0 ('
1693349035 nextest: block to read
1693349035 nextest read '/home/dae/Work/desktop/anki/rslib/linkchecker)
'
1693349035 nextest: block to read
1693349036 nextest read '�[0m�[0m�[1m�[32m   Compiling�[0m'
1693349036 nextest: block to read
1693349036 nextest read ' anki-sync-server v0.0.0 (/home/dae/Work/desktop/anki/rslib/sync)
�[0m�[0m�[1m�[32m   Compiling�[0m rsbridge v0.0.0 ('
1693349036 nextest: block to read
1693349036 nextest read '/home/dae/Work/desktop/anki/pylib/rsbridge)
'
1693349036 nextest: block to read
1693349036 nextest read '�[0m�[0m�[1m�[32m    Finished�[0m '
1693349036 nextest: block to read
1693349036 nextest read 'test [unoptimized + debuginfo] target(s) in 4.42s
'
1693349036 nextest: block to read
1693349036 nextest read '�[32;1m    Starting�[0m �[1m262�[0m tests across �[1m17�[0m binaries
'
1693349036 nextest: block to read
1693349037 nextest read '------------
�[32;1m     Summary�[0m [   0.702s] �[1m262�[0m tests run: �[1m262�[0m �[32;1mpassed�[0m, �[1m0�[0m �[33;1mskipped�[0m
'
1693349037 nextest: block to read
1693349045 nextest: got 0
1693349045 nextest: waitpid start
1693349045 nextest: waitpid end

I've figured out a way to reproduce this fairly consistently, so if you have any suggestions on things I should try, please let me know.

@dae
Copy link
Contributor

dae commented Aug 29, 2023

Could it be that the pipes for one spawned process are leaking into other spawned processes? That would explain why a command seems to unblock when another happens to complete, and with the following change, the issue seems to disappear:

    for fd in 3..1024 {
        actions.addclose(fd)?;
    }
    // actions.addclose(pipe[0])?;
    // actions.addclose(pipe[1])?;

evmar added a commit that referenced this issue Aug 30, 2023
We want to open all our fds using CLOEXEC so they're not inherited by
child processes.  All Rust builtin APIs do this already, but we used
libc::pipe() for our pipe.

To fix this, we use pipe2() on non-Mac (which allows passing the flag)
and Mac-specific POSIX_SPAWN_CLOEXEC_DEFAULT otherwise.

Fixes leak found in #68.
@evmar
Copy link
Owner

evmar commented Aug 30, 2023

Great find! I dug a bit and wrote I think a proper fix in 12652a1, hopefully.

@evmar
Copy link
Owner

evmar commented Aug 30, 2023

At least on this Mac, running the build in tests/manual/prints I would see the processes inheriting stuff like

bash    23396 evmar    1   PIPE 0x908089607cfefd0b    16384                     ->0x656addaecb5f1c95
bash    23396 evmar    2   PIPE 0x908089607cfefd0b    16384                     ->0x656addaecb5f1c95
bash    23396 evmar    4   PIPE 0xe05bffa6d41dad18    16384                     ->0xf57f30ba54da4219
bash    23396 evmar    5   PIPE 0xf57f30ba54da4219    16384                     ->0xe05bffa6d41dad18
bash    23396 evmar    6   PIPE 0xf93745033fa400af    16384                     ->0x9a377a882197cb70
bash    23396 evmar    7   PIPE 0x9a377a882197cb70    16384                     ->0xf93745033fa400af

and after the patch they only have pipe fds 0/1/2.

@dae
Copy link
Contributor

dae commented Aug 30, 2023

Thanks for looking into it; I can confirm it's all working smoothly now, and CLOEXEC is much more elegant than closing 1000 FDs.

@dae
Copy link
Contributor

dae commented Sep 5, 2023

Every time I run a build, being able to see the latest line of output makes me smile :-) It's such a nice improvement.

Have you had any thoughts yet on whether output-suppressing options might be suitable for inclusion / what they might look like?

@evmar
Copy link
Owner

evmar commented Jan 8, 2024

Sorry for dropping this thread! I think we have a few different ideas mixed in here and I wanna make sure I address all of them.

  • I think for @keith's original request for check-lld the current behavior is maybe now sufficient.
  • For @Sarcasm's sphinx-build output, I suspect that because n2 buffers the output, sphinx doesn't believe it's connected to a console, so n2's behavior doesn't work. I think support for a true console pool where n2 would suppress its own status updates while the subtask ran (or, possibly, using a pty, yuck) would address that.
  • For @dae's final remark about output-suppression, I will file a new bug.

@evmar
Copy link
Owner

evmar commented Jan 8, 2024

I built check-lld and it looks pretty good to me:

image

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

4 participants