Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Unblock process deletion in worker runtime #8479

Open
wants to merge 1 commit into
base: master
Choose a base branch
from
Open
Show file tree
Hide file tree
Changes from all commits
Commits
File filter

Filter by extension

Filter by extension

Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
4 changes: 2 additions & 2 deletions worker/runtime/process.go
Original file line number Diff line number Diff line change
Expand Up @@ -48,14 +48,14 @@ func (p *Process) Wait() (int, error) {
return 0, fmt.Errorf("proc closeio: %w", err)
}

p.process.IO().Wait()

_, err = p.process.Delete(context.Background())
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@chenbh do you mean why should we delete it here? My knowledge is limited here about linux programming TBO. Not sure how it suppose to work.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

My understanding of what happens is:

  1. User clicks the abort button
  2. Web node handles the request and the step context gets cancelled
  3. Cancelling the context causes invokes Stop(false) on the worker which will try to gracefully kill the process
  4. The killer will first try to gracefully stop the process (by sending a SIGINT/SIGTERM), but if that doesn't work it sends a SIGKILL
  5. Even if the resource ignores the SIGINT/SIGTERM and refuse to self-terminate, then it should've been killed by the SIGKILL
  6. Because the process has been killed, calling p.process.IO().Wait() eventually returns (because you know, the process is killed and the IO is closed)
  7. We can then p.process.Delete() to free up any remaining OS resources reserved by the process

But from the #8462, it looks like the the resource is misbehaving by ignoring the SIGINT/SIGTERM, but for some reason the SIGKILL wasn't issued, which allowed the process to continue executing and jamming up the worker

Copy link
Contributor Author

@xtremerui xtremerui Jul 27, 2022

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

pool resource hasn't been changed for a long time. The change comes from the big refactor in Concourse v7.5.

6. Because the process has been killed, calling p.process.IO().Wait() eventually returns (because you know, the process is killed and the IO is closed)

The problem is on the last part. During my investigation, I cancel the build while monitoring the process in the container.
Before cancelling,

PID   USER     TIME  COMMAND
    1 root      0:00 /tmp/gdn-init
    7 root      0:00 {out} /bin/sh /opt/resource/out /tmp/build/put
   19 root      0:00 ssh-agent
   32 root      0:00 /opt/go/out /tmp/build/put
   49 root      0:00 bash
   60 root      0:00 ps aux

After cancelling, my ssh session got kicked out and i have to ssh back to see:

PID   USER     TIME  COMMAND
    1 root      0:00 /tmp/gdn-init
   19 root      0:00 ssh-agent
   32 root      0:00 /opt/go/out /tmp/build/put
  156 root      0:00 bash
  167 root      0:00 ps aux

worker logs after cancelling:

{"timestamp":"2022-07-27T16:44:44.198530100Z","level":"debug","source":"worker","message":"worker.garden.garden-server.stop.stopping","data":{"handle":"2faeb592-4850-48e7-6969-680c045582ba","session":"1.4.27"}}
...
{"timestamp":"2022-07-27T16:44:54.213367800Z","level":"info","source":"worker","message":"worker.garden.garden-server.stop.stopped","data":{"handle":"2faeb592-4850-48e7-6969-680c045582ba","session":"1.4.27"}}
{"timestamp":"2022-07-27T16:44:54.215373100Z","level":"debug","source":"worker","message":"worker.garden.connection.closed","data":{"local_addr":{"IP":"127.0.0.1","Port":7777,"Zone":""},"remote_addr":{"IP":"127.0.0.1","Port":37256,"Zone":""},"session":"1.5"}}
{"timestamp":"2022-07-27T16:44:54.226306800Z","level":"info","source":"worker","message":"worker.garden.garden-server.run.exited","data":{"handle":"2faeb592-4850-48e7-6969-680c045582ba","id":"4ecd375f-6844-4f99-6128-6199b7f02bb3","session":"1.4.20","status":137}}

you can see the process was actually killed (v7.4 behaves the same), and the worker logs shows garden stopped the container, but the IO was still running and outputing those ....

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@chenbh I have attached a pipeline for reproducible steps and acceptance.

// ignore "not found" errors - the process was already deleted
if err != nil && !errors.Is(err, errdefs.ErrNotFound) {
return 0, fmt.Errorf("delete process: %w", err)
}

p.process.IO().Wait()

return int(status.ExitCode()), nil
}

Expand Down
6 changes: 0 additions & 6 deletions worker/runtime/process_test.go
Original file line number Diff line number Diff line change
Expand Up @@ -50,7 +50,6 @@ func (s *ProcessSuite) TestWaitStatusErr() {

func (s *ProcessSuite) TestProcessWaitDeleteError() {
s.ch <- *containerd.NewExitStatus(0, time.Now(), nil)
s.containerdProcess.IOReturns(s.io)

expectedErr := errors.New("status-err")
s.containerdProcess.DeleteReturns(nil, expectedErr)
Expand All @@ -73,11 +72,6 @@ func (s *ProcessSuite) TestProcessWaitBlocksUntilIOFinishes() {
s.ch <- *containerd.NewExitStatus(0, time.Now(), nil)
s.containerdProcess.IOReturns(s.io)

s.io.WaitStub = func() {
// ensure Wait() is called before Delete() which cancels IO
s.Equal(0, s.containerdProcess.DeleteCallCount())
}

_, err := s.process.Wait()
s.NoError(err)

Expand Down