Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Ensure that --on-process-exit is race-proof #7210

Closed
ridiculousfish opened this issue Jul 19, 2020 · 6 comments
Closed

Ensure that --on-process-exit is race-proof #7210

ridiculousfish opened this issue Jul 19, 2020 · 6 comments
Milestone

Comments

@ridiculousfish
Copy link
Member

In #7207 we hypothesized that --on-process-exit may have a race. For code like so:

cmd &
function done --on-process-exit $last_pid
         echo "All done"
end

does anything guarantee that cmd won't be reaped before the event handler is installed? In practice it seems to work, but we may just be getting lucky.

This tracks either making this pattern reliable (with a test), or providing something better.

@ridiculousfish ridiculousfish changed the title Make --on-process-exit race-proof Ensure that --on-process-exit is race-proof Jul 19, 2020
@ridiculousfish ridiculousfish added this to the fish-future milestone Jul 19, 2020
@faho
Copy link
Member

faho commented Jul 19, 2020

does anything guarantee that cmd won't be reaped before the event handler is installed? In practice it seems to work, but we may just be getting lucky.

Note: The more general problem is that nothing keeps the "event" around.

E.g. if you did something between the background process and defining the function (or calling wait), fish would reap the process and leave no trace around that can be waited on.

sleep 1 &
set -l pid  $last_pid
sleep 2 # to ensure the backgrounded sleep has already quit

wait $pid # will error with "Could not find a job with process id"

However, I'm not sure how to solve this, because... well, how long would we keep the data (that it existed, the $status) around? We can't very well do "until it's waited on", because that's not the typical thing to do in a shellscript, so we'd keep it indefinitely.

So do we keep these things for a certain time (five minutes?), or do we require a marker to ensure it can be waited on (the & syntax is bad and doesn't allow something like this), like our weird and awkward and basically unused block builtin?

Or am I entirely missing something?

@ridiculousfish
Copy link
Member Author

ridiculousfish commented Jul 19, 2020

I think wait does not provide the $status of the exited job, am I wrong?

One solution would be to assign a sequential, non-recycled ID to every job. If you call wait with such an ID, and there is no such running process, it must have already exited. This is how --on-exit caller is implemented. But then we'd have three separate notions (pid, job ID, sequential ID) and that seems excessive.

Another possibility is to allow the user to name backgrounded jobs. Here the user names it mysleep:

 sleep 10 &mysleep
 wait mysleep

We would keep the data for mysleep around until it is waited upon.

Before we get too far along, I'd like to understand how other shells solve this.

@faho
Copy link
Member

faho commented Jul 21, 2020

Something similar applies to disown. See 5b706fa, where I used disown $last_pid to be sure.

The gist is that a bare disown applies to the last active job, so if it's run after the job that was meant has finished (e.g. because it forks another process off and exits), it disowns another job.

E.g.

sleep 5m &

command true &

disown # will (most likely) disown the *sleep*, not the *true*

Adding the disown to the same commandline or function seems to help, and I can't seem to trigger it in a script, even with full job-control (job not yet reaped?).

(I'm having trouble understanding how this all should work because job control always felt like this weird tacked-on feature to me)

@hexagonrecursion
Copy link

#7210 (comment)

However, I'm not sure how to solve this, because... well, how long would we keep the data (that it existed, the $status) around? We can't very well do "until it's waited on", because that's not the typical thing to do in a shellscript, so we'd keep it indefinitely.

#7210 (comment)

Before we get too far along, I'd like to understand how other shells solve this.

UNIX syscalls

We all know how unix (and all its descendants) avoid the race condition: zombies. They do not consider it a memory leak in the kernel but rather a resource leak in the user code.

bash

If you wait for just any random number, you will most likely get an error

~ $ wait 5000
bash: wait: pid 5000 is not a child of this shell

If you wait for the pid of background job, you get no error. This happens even if you already waited for the job. I have not looked at the code so I do not know if bash ever forgets a background job.

~ $ sleep 1s & X=$!; sleep 2s; wait $X
[1] 30759
[1]+  Done                    sleep 1s
~ $ wait $X
~ $ wait $X
~ $ wait $X
~ $ wait $X
~ $ wait $X

There is a cap for the "memory leak": pid_max

According to this there are never more then 4194304 unique pids.

faho added a commit that referenced this issue Jan 9, 2021
As mentioned in 5b706fa, bare
`disown` has a problem: It disowns the last *existing* job.

Unfortunately, it's easy to see cases where that won't happen:

    sleep 5m &
    /bin/true & # will exit immediately
    disown # will most likely disown *sleep*, not true

So what we do is to pass $last_pid.

In help especially this is likely to occur because many graphical
browsers fork immediately to avoid blocking the terminal (we only
added the backgrounding and disown because some weren't).

Note that it's *possible* this doesn't occur if used in the same
function, but I don't want to rely on those semantics.

It might be worth doing this as the default - see #7210.
@ridiculousfish
Copy link
Member Author

zsh at least solves this by separately interrogating active jobs, and also a "saved background status" (see getbgstatus).

My approach is going to be to wrap up the "stuff you need to wait for a job" into a a little type that can outlive the job, called its "wait handle." The job can produce its wait handle on demand, and when we reap a job we save the wait handle if the job exited in the background. This allows both active and reaped jobs to be handled with some uniformity.

ridiculousfish added a commit to ridiculousfish/fish-shell that referenced this issue May 16, 2021
This switches builtin_wait from waiting on jobs in the active job list, to
waiting on the wait handles. The wait handles may be either derived from
the job list itself, or from saved wait handles from jobs that exited in
the background.

Fixes fish-shell#7210
@zanchey zanchey modified the milestones: fish-future, fish 3.3.0 May 16, 2021
@ridiculousfish
Copy link
Member Author

I realized that my fix "only" addressed wait and not on-process-exit

@github-actions github-actions bot locked as resolved and limited conversation to collaborators May 23, 2022
Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Labels
None yet
Projects
None yet
Development

No branches or pull requests

4 participants