Commit
This commit does not belong to any branch on this repository, and may belong to a fork outside of the repository.
test: do not spawn goroutines to wait for canceled context in `RunCom…
…mandContext` [ upstream commit cf14b6a ] It was observed while running `gops stack` for the Ginkgo test suite locally, that we were leaking goroutines that were getting stuck while waiting for SSH sessions to finish. We accrued over 1000 of these per K8s CI run: ``` 16 unique stack traces 960 occurences. Sample stack trace: github.com/cilium/cilium/vendor/golang.org/x/crypto/ssh.(*Session).Wait(0xc00098e000, 0x230f79c, 0x1a) /Users/ianvernon/go/src/github.com/cilium/cilium/vendor/golang.org/x/crypto/ssh/session.go:403 +0x57 github.com/cilium/cilium/test/helpers.(*SSHClient).RunCommandContext.func1(0x2502600, 0xc000542280, 0xc00098e000, 0xc0001aa780, 0xc000268220) /Users/ianvernon/go/src/github.com/cilium/cilium/test/helpers/ssh_command.go:262 +0x1cc created by github.com/cilium/cilium/test/helpers.(*SSHClient).RunCommandContext /Users/ianvernon/go/src/github.com/cilium/cilium/test/helpers/ssh_command.go:253 +0x13c ``` This example shows that there were over 960 goroutines stuck on `session.Wait()`. Whenever we run a command via SSH, we call `runCommand`. When we call `runCommand`, it calls `session.Run`, which calls `session.Start()` and `session.Wait()`. I observed that that calling `Wait()` on a session which already has had `Run` invoked will never return, even if we try to call `session.Close()` before invoking `session.Wait()`. This indicates that our logic for trying to kill the session if the context which is provided to `RunCommandContext` is canceled is flawed, as waiting for the session to finish before closing it will block infinitely. I enabled debug mode for the SSH library we use (`golang.org/x/crypto/ssh`), and I see that the session receives an EOF message *before* we even try to close the session: ``` >------- session started, and session.Run() invoked 2019/06/05 08:16:59 send global(2): ssh.channelOpenMsg{ChanType:"session", PeersID:0x2, PeersWindow:0x200000, MaxPacketSize:0x8000, TypeSpecificData:[]uint8(nil)} 2019/06/05 08:16:59 decoding(2): 91 &ssh.channelOpenConfirmMsg{PeersID:0x2, MyID:0x0, MyWindow:0x0, MaxPacketSize:0x8000, TypeSpecificData:[]uint8{}} - 17 bytes 2019/06/05 08:16:59 send(2): ssh.channelRequestMsg{PeersID:0x0, Request:"exec", WantReply:true, RequestSpecificData:[]uint8{0x0, 0x0, 0x0, 0x6d, 0x6b, 0x75, 0x62, 0x65, 0x63, 0x74, 0x6c, 0x20, 0x65, 0x78, 0x65, 0x63, 0x20, 0x2d, 0x6e, 0x20, 0x6b, 0x75, 0x62, 0x65, 0x2d, 0x73, 0x79, 0x73, 0x74, 0x65, 0x6d, 0x20, 0x63, 0x69, 0x6c, 0x69, 0x75, 0x6d, 0x2d, 0x62, 0x63, 0x63, 0x6d, 0x34, 0x20, 0x2d, 0x2d, 0x20, 0x63, 0x69, 0x6c, 0x69, 0x75, 0x6d, 0x20, 0x73, 0x74, 0x61, 0x74, 0x75, 0x73, 0x20, 0x2d, 0x6f, 0x20, 0x6a, 0x73, 0x6f, 0x6e, 0x70, 0x61, 0x74, 0x68, 0x3d, 0x27, 0x7b, 0x2e, 0x63, 0x6c, 0x75, 0x73, 0x74, 0x65, 0x72, 0x2e, 0x6e, 0x6f, 0x64, 0x65, 0x73, 0x5b, 0x2a, 0x5d, 0x2e, 0x70, 0x72, 0x69, 0x6d, 0x61, 0x72, 0x79, 0x2d, 0x61, 0x64, 0x64, 0x72, 0x65, 0x73, 0x73, 0x2e, 0x2a, 0x7d, 0x27}} 2019/06/05 08:16:59 decoding(2): 93 &ssh.windowAdjustMsg{PeersID:0x2, AdditionalBytes:0x200000} - 9 bytes 2019/06/05 08:16:59 decoding(2): 99 &ssh.channelRequestSuccessMsg{PeersID:0x2} - 5 bytes >------- EOF sent on channel (not by us; we have not closed the sesion yet! 2019/06/05 08:16:59 send(2): ssh.channelEOFMsg{PeersID:0x0} 2019/06/05 08:16:59 decoding(2): data packet - 181 bytes 2019/06/05 08:16:59 send(2): ssh.windowAdjustMsg{PeersID:0x0, AdditionalBytes:0xac} 2019/06/05 08:16:59 decoding(2): 96 &ssh.channelEOFMsg{PeersID:0x2} - 5 bytes 2019/06/05 08:16:59 decoding(2): 98 &ssh.channelRequestMsg{PeersID:0x2, Request:"exit-status", WantReply:false, RequestSpecificData:[]uint8{0x0, 0x0, 0x0, 0x0}} - 25 bytes 2019/06/05 08:16:59 decoding(2): 97 &ssh.channelCloseMsg{PeersID:0x2} - 5 bytes 2019/06/05 08:16:59 send(2): ssh.channelCloseMsg{PeersID:0x0} >------- we try to close the session, and receive the following error failed to close session: EOF ``` It appears that we cannot close the session, since an EOF has already been sent for it. I am not exactly sure where this comes from. I've posted an issue / question in the GitHub repository for golang: golang/go#32453 . Our attempts to send signals (SIGHUP and SIGINT) are met by this same EOF error as well; there is no point on waiting for the session to finish in this case, so just try to close it and move on, and not leak goroutines that will be stuck forever. When running `gops` now against the Gingko test suite, we no longer accrue a ton of these goroutines blocked on `session.Wait()` - the biggest # of occcurrences for "similar" goroutines is at most 2 in a stack trace captured below, for example: ``` $ ../contrib/scripts/consolidate_go_stacktrace.py stack9.out | head -n 15 14 unique stack traces 2 occurences. Sample stack trace: internal/poll.runtime_pollWait(0x3f14d50, 0x72, 0xc000a6cad0) /usr/local/go/src/runtime/netpoll.go:173 +0x66 internal/poll.(*pollDesc).wait(0xc00043c218, 0x72, 0xffffffffffffff00, 0x24e7ac0, 0x3225738) /usr/local/go/src/internal/poll/fd_poll_runtime.go:85 +0x9a ... ``` Signed-off by: Ian Vernon <ian@cilium.io> Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
- Loading branch information