task: don't `close()` io before `cancel()`
#8643
Hi @laurazard. Thanks for your PR. I'm waiting for a containerd member to verify that this patch is reasonable to test. If it is, they should reply with Once the patch is verified, the new status will be reflected by the I understand the commands that are listed here. Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository. |
@laurazard: GitHub didn't allow me to request PR reviews from the following users: corhere. Note that only containerd members and repo collaborators can review this PR, and authors cannot review their own PRs. In response to this: Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository. |
/ok-to-test
OH! I should've noticed that. I guess we might need a slightly more complex solution here. Closing task io before waiting doesn't seem correct to me, but maybe we can change Other than that, it still seems incorrect (albeit maybe harmless) to always call Line 338 in f3a0793
Close() shouldn't be called before a successful task delete)
Investigating #5621 (and the fix in 55faa5e), it seems to me that adding the
Since this happens only on restore, we can follow the thread and we find that: The specific containerd/pkg/cri/server/restart.go Line 218 in 4281a95
containerd/pkg/cri/io/container_io.go Lines 76 to 98 in 4281a95
and the Wait() being called is containerd/pkg/cri/io/container_io.go Line 227 in 4281a95
closer there comes from containerd/pkg/cri/io/helpers.go Lines 98 to 144 in 4281a95
Wait() is containerd/pkg/cri/io/helpers.go Line 62 in 4281a95
The wg there is referenced in containerd/pkg/cri/io/container_io.go Line 107 in 4281a95
which gets called in containerd/pkg/cri/server/restart.go – which fits in with our theory.
containerd/pkg/cri/io/container_io.go Lines 107 to 134 in 4281a95
we can see then that the likely culprit here is some locking on containerd/pkg/cri/io/container_io.go Line 112 in 4281a95
containerd/pkg/cri/io/container_io.go Line 117 in 4281a95
Locking the containerd/pkg/cri/io/helpers_windows.go Lines 60 to 74 in 4281a95
It's likely that this is only an issue in this case since we're "re-attaching" existing IO for stopped tasks on Windows, and those pipes are somehow broken and blocking. This happens around Lines 407 to 413 in f92e576
Calling |
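To make the suspected failure mode concrete, here is a minimal sketch, not containerd code. It assumes only what the investigation above describes: `Wait()` blocks on a `sync.WaitGroup` that is released when the copy goroutines finish, and a pipe that never delivers EOF keeps those goroutines alive until the reader is closed. The names `waitReturns` and `blockedUntilClosed` are hypothetical, and `io.Pipe` stands in for the re-attached Windows named pipe.

```go
package main

import (
	"fmt"
	"io"
	"sync"
	"time"
)

// waitReturns reports whether wg.Wait() returns within d.
func waitReturns(wg *sync.WaitGroup, d time.Duration) bool {
	done := make(chan struct{})
	go func() {
		wg.Wait()
		close(done)
	}()
	select {
	case <-done:
		return true
	case <-time.After(d):
		return false
	}
}

// blockedUntilClosed copies from a pipe whose writer never writes or closes
// (an assumption standing in for the broken pipe of a restored task) and
// reports whether Wait returned before and after closing the reader.
func blockedUntilClosed() (before, after bool) {
	r, _ := io.Pipe()

	var wg sync.WaitGroup
	wg.Add(1)
	go func() {
		defer wg.Done()
		io.Copy(io.Discard, r) // blocks until EOF or until r is closed
	}()

	before = waitReturns(&wg, 100*time.Millisecond)
	r.Close() // unblocks the pending Read, so the copy goroutine exits
	after = waitReturns(&wg, time.Second)
	return before, after
}

func main() {
	before, after := blockedUntilClosed()
	fmt.Println("Wait returned before Close:", before)
	fmt.Println("Wait returned after Close:", after)
}
```

In this sketch, waiting only completes once the reader has been closed, which matches the reasoning for keeping a `Close()` before `Wait()` on Windows.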
02c4cdb to c841b0e
// io.Wait locks for restored tasks on Windows unless we call
// io.Close first (https://github.com/containerd/containerd/issues/5621)
// in other cases, preserve the contract and let IO finish before closing
if t.client.runtime == fmt.Sprintf("%s.%s", plugin.RuntimePlugin, "windows") {
Looks good to me. `Delete` should be called after the task has been stopped. The kernel or shim will ensure that all processes created by this task are killed. We just need to ensure that `t.io.Cancel()` won't close the fifo file (REF: #8334).
/cc @AkihiroSuda @fuweid
// in other cases, preserve the contract and let IO finish before closing
if t.client.runtime == fmt.Sprintf("%s.%s", plugin.RuntimePlugin, "windows") {
	t.io.Close()
}
t.io.Cancel()
maybe we should add a comment like "Cancel is used to cancel the goroutine which is still in the fifo-opening state (the container hasn't been started yet). It's not trying to stop the pipe, because the pipe should be stopped on the shim side. Otherwise we might lose data from the container."
That sounds good! From this and the other PRs, it looks like there's a lot of confusion around what `Cancel` does/should do. I'll add the comment.
Thanks @laurazard ! The comment can be handled by the follow-up.
Oop, I was a bit too fast, I've added it in this commit. Lmk if you think that looks good @fuweid
The contract for `cio/io.go/IO` states that a call to `Close()` will always be preceded by a call to `Cancel()` - https://github.com/containerd/containerd/blob/f3a07934b49bf142925a5913e3e19f3528eda0d2/cio/io.go#L59 which isn't being held up here.

Furthermore, the call to `Close()` here makes the subsequent `Wait()` moot, and causes issues to consumers (see: moby/moby#45689)

It seems from https://github.com/containerd/containerd/blob/f3a07934b49bf142925a5913e3e19f3528eda0d2/task.go#L338 that the `Close()` should be called there, the call removed in this commit is unnecessary/erroneous.

We leave the `Close()` call on Windows only since this was introduced in containerd#5974 to address containerd#5621.

Signed-off-by: Laura Brehm <laurabrehm@hey.com>
c841b0e to 34a93a0
LGTM!
Thanks LGTM
The contract for `cio/io.go/IO` states that a call to `Close()` will always be preceded by a call to `Cancel()` - containerd/cio/io.go, Line 59 in f3a0793.

Furthermore, the call to `Close()` here makes the subsequent `Wait()` moot, and causes issues to consumers (see: moby/moby#45689). Basically, if we close before `Wait()`ing, there will be times where the pipes get closed while task IO (stdin/stdout/whatever) is still being piped, so we lose task output.

It seems from containerd/task.go (Line 338 in f3a0793) that `Close()` should be called there; the call removed in this commit is unnecessary/erroneous.
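The ordering argument above can be sketched with a small, self-contained program. This is an illustration under stated assumptions, not the containerd code: `pipeIO`, `newPipeIO` and `drain` are hypothetical names, and the type only mimics the `Cancel`/`Wait`/`Close` shape discussed in this PR, with an `io.Pipe` standing in for the task's output pipe.

```go
package main

import (
	"bytes"
	"fmt"
	"io"
	"sync"
)

// pipeIO is a hypothetical stand-in for a task IO implementation: it copies
// the task's output pipe into a buffer on a background goroutine.
type pipeIO struct {
	r   *io.PipeReader
	out bytes.Buffer
	wg  sync.WaitGroup
}

func newPipeIO(r *io.PipeReader) *pipeIO {
	p := &pipeIO{r: r}
	p.wg.Add(1)
	go func() {
		defer p.wg.Done()
		io.Copy(&p.out, r) // ends on EOF or when r is closed
	}()
	return p
}

func (p *pipeIO) Cancel()      {} // nothing pending to abort in this sketch
func (p *pipeIO) Wait()        { p.wg.Wait() }
func (p *pipeIO) Close() error { return p.r.Close() }

// drain runs a fake task that writes three lines and exits, then tears down
// its IO. With closeFirst set, the pipe is closed before Wait, so the copy
// can be cut short and output may be lost.
func drain(closeFirst bool) string {
	r, w := io.Pipe()
	p := newPipeIO(r)

	go func() {
		defer w.Close() // task exit: the copy goroutine sees EOF
		for i := 1; i <= 3; i++ {
			if _, err := fmt.Fprintf(w, "line %d\n", i); err != nil {
				return // reader was closed underneath the task
			}
		}
	}()

	if closeFirst {
		p.Close()
	}
	p.Cancel()
	p.Wait() // let IO finish
	p.Close()
	return p.out.String()
}

func main() {
	fmt.Printf("wait then close: %d bytes\n", len(drain(false)))
	fmt.Printf("close then wait: %d bytes\n", len(drain(true)))
}
```

With `closeFirst` false, `Wait` returns only after the copy has seen EOF, so all three lines are captured. With `closeFirst` true the amount captured depends on scheduling, which is the kind of lost output described above.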