Skip to content

Commit

Permalink
test: do not spawn goroutines to wait for canceled context in `RunCom…
Browse files Browse the repository at this point in the history
…mandContext`

[ upstream commit cf14b6a ]

It was observed while running `gops stack` for the Ginkgo test suite locally,
that we were leaking goroutines that were getting stuck while waiting for SSH
sessions to finish. We accrued over 1000 of these per K8s CI run:

```
16 unique stack traces
960 occurences. Sample stack trace:
github.com/cilium/cilium/vendor/golang.org/x/crypto/ssh.(*Session).Wait(0xc00098e000, 0x230f79c, 0x1a)
        /Users/ianvernon/go/src/github.com/cilium/cilium/vendor/golang.org/x/crypto/ssh/session.go:403 +0x57
github.com/cilium/cilium/test/helpers.(*SSHClient).RunCommandContext.func1(0x2502600, 0xc000542280, 0xc00098e000, 0xc0001aa780, 0xc000268220)
        /Users/ianvernon/go/src/github.com/cilium/cilium/test/helpers/ssh_command.go:262 +0x1cc
created by github.com/cilium/cilium/test/helpers.(*SSHClient).RunCommandContext
        /Users/ianvernon/go/src/github.com/cilium/cilium/test/helpers/ssh_command.go:253 +0x13c
```

This example shows that there were over 960 goroutines stuck on `session.Wait()`.

Whenever we run a command via SSH, we call `runCommand`. When we call
`runCommand`, it calls `session.Run`, which calls `session.Start()` and
`session.Wait()`. I observed that  that calling `Wait()` on a session which
already has had `Run` invoked will never return, even if we try
to call `session.Close()` before invoking `session.Wait()`. This indicates
that our logic for trying to kill the session if the context which is provided
to `RunCommandContext` is canceled is flawed, as waiting for the session to
finish before closing it will block infinitely.

I enabled debug mode for the SSH library we use (`golang.org/x/crypto/ssh`), and
I see that the session receives an EOF message *before* we even try to close the
session:

```
>------- session started, and session.Run() invoked

2019/06/05 08:16:59 send global(2): ssh.channelOpenMsg{ChanType:"session", PeersID:0x2, PeersWindow:0x200000, MaxPacketSize:0x8000, TypeSpecificData:[]uint8(nil)}
2019/06/05 08:16:59 decoding(2): 91 &ssh.channelOpenConfirmMsg{PeersID:0x2, MyID:0x0, MyWindow:0x0, MaxPacketSize:0x8000, TypeSpecificData:[]uint8{}} - 17 bytes
2019/06/05 08:16:59 send(2): ssh.channelRequestMsg{PeersID:0x0, Request:"exec", WantReply:true, RequestSpecificData:[]uint8{0x0, 0x0, 0x0, 0x6d, 0x6b, 0x75, 0x62, 0x65, 0x63, 0x74, 0x6c, 0x20, 0x65, 0x78, 0x65, 0x63, 0x20, 0x2d, 0x6e, 0x20, 0x6b, 0x75, 0x62, 0x65, 0x2d, 0x73, 0x79, 0x73, 0x74, 0x65, 0x6d, 0x20, 0x63, 0x69, 0x6c, 0x69, 0x75, 0x6d, 0x2d, 0x62, 0x63, 0x63, 0x6d, 0x34, 0x20, 0x2d, 0x2d, 0x20, 0x63, 0x69, 0x6c, 0x69, 0x75, 0x6d, 0x20, 0x73, 0x74, 0x61, 0x74, 0x75, 0x73, 0x20, 0x2d, 0x6f, 0x20, 0x6a, 0x73, 0x6f, 0x6e, 0x70, 0x61, 0x74, 0x68, 0x3d, 0x27, 0x7b, 0x2e, 0x63, 0x6c, 0x75, 0x73, 0x74, 0x65, 0x72, 0x2e, 0x6e, 0x6f, 0x64, 0x65, 0x73, 0x5b, 0x2a, 0x5d, 0x2e, 0x70, 0x72, 0x69, 0x6d, 0x61, 0x72, 0x79, 0x2d, 0x61, 0x64, 0x64, 0x72, 0x65, 0x73, 0x73, 0x2e, 0x2a, 0x7d, 0x27}}
2019/06/05 08:16:59 decoding(2): 93 &ssh.windowAdjustMsg{PeersID:0x2, AdditionalBytes:0x200000} - 9 bytes
2019/06/05 08:16:59 decoding(2): 99 &ssh.channelRequestSuccessMsg{PeersID:0x2} - 5 bytes

>-------  EOF sent on channel (not by us; we have not closed the sesion yet!

2019/06/05 08:16:59 send(2): ssh.channelEOFMsg{PeersID:0x0}
2019/06/05 08:16:59 decoding(2): data packet - 181 bytes
2019/06/05 08:16:59 send(2): ssh.windowAdjustMsg{PeersID:0x0, AdditionalBytes:0xac}
2019/06/05 08:16:59 decoding(2): 96 &ssh.channelEOFMsg{PeersID:0x2} - 5 bytes
2019/06/05 08:16:59 decoding(2): 98 &ssh.channelRequestMsg{PeersID:0x2, Request:"exit-status", WantReply:false, RequestSpecificData:[]uint8{0x0, 0x0, 0x0, 0x0}} - 25 bytes
2019/06/05 08:16:59 decoding(2): 97 &ssh.channelCloseMsg{PeersID:0x2} - 5 bytes
2019/06/05 08:16:59 send(2): ssh.channelCloseMsg{PeersID:0x0}

>------- we try to close the session, and receive the following error

failed to close session: EOF
```

It appears that we cannot close the session, since an EOF has already been sent
for it. I am not exactly sure where this comes from. I've posted an issue /
question in the GitHub repository for golang:
golang/go#32453 . Our attempts to send signals (SIGHUP
and SIGINT) are met by this same EOF error as well; there is no point on
waiting for the session to finish in this case, so just try to close it
and move on, and not leak goroutines that will be stuck forever. When running
`gops` now against the Gingko test suite, we no longer accrue a ton of these
goroutines blocked on `session.Wait()` - the biggest # of occcurrences for
"similar" goroutines is at most 2 in a stack trace captured below, for example:

```
$ ../contrib/scripts/consolidate_go_stacktrace.py stack9.out  | head -n 15
14 unique stack traces
2 occurences. Sample stack trace:
internal/poll.runtime_pollWait(0x3f14d50, 0x72, 0xc000a6cad0)
	/usr/local/go/src/runtime/netpoll.go:173 +0x66
internal/poll.(*pollDesc).wait(0xc00043c218, 0x72, 0xffffffffffffff00, 0x24e7ac0, 0x3225738)
	/usr/local/go/src/internal/poll/fd_poll_runtime.go:85 +0x9a
...
```

Signed-off by: Ian Vernon <ian@cilium.io>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
  • Loading branch information
Ian Vernon authored and tgraf committed Jun 6, 2019
1 parent a049140 commit 911eb3f
Showing 1 changed file with 7 additions and 29 deletions.
36 changes: 7 additions & 29 deletions test/helpers/ssh_command.go
Expand Up @@ -23,7 +23,6 @@ import (
"net"
"os"
"strconv"
"sync"
"time"

"github.com/kevinburke/ssh_config"
Expand Down Expand Up @@ -250,40 +249,19 @@ func (client *SSHClient) RunCommandContext(ctx context.Context, cmd *SSHCommand)
if err != nil {
return err
}
defer session.Close()

stdin, err := session.StdinPipe()
if err != nil {
log.Errorf("Could not get stdin %s", err)
}

var wg sync.WaitGroup
wg.Add(1)
go func() {
select {
case <-ctx.Done():
_, err := stdin.Write([]byte{3})
if err != nil {
log.Errorf("write ^C error: %s", err)
}
err = session.Wait()
if err != nil {
log.Errorf("wait error: %s", err)
}
if err = session.Signal(ssh.SIGHUP); err != nil {
log.Errorf("failed to kill command: %s", err)
}
if err = session.Close(); err != nil {
log.Errorf("failed to close session: %s", err)
}
defer func() {
if closeErr := session.Close(); closeErr != nil {
log.WithError(closeErr).Error("failed to close session")
}
wg.Done()
}()
err = runCommand(session, cmd)
select {
case <-ctx.Done():
// Wait until the ssh session is stopped
wg.Wait()
log.Warning("sending SIGHUP to session due to canceled context")
if err = session.Signal(ssh.SIGHUP); err != nil {
log.Errorf("failed to kill command when context is canceled: %s", err)
}
return ctx.Err()
default:
return err
Expand Down

0 comments on commit 911eb3f

Please sign in to comment.