
[WIP] Add timeout for container/sandbox recover. #884

Closed

Conversation

Random-Liu
Member

@Random-Liu Random-Liu commented Aug 24, 2018

Although we don't want it to, containerd-shim sometimes hangs, e.g. containerd/containerd#2438. We should definitely root-cause and fix that.

However, on the other hand, we should make sure the system remains functional when some containerd-shims hang. For example, `ctr task ls` will hang if a single containerd-shim hangs. --> We should probably fix that. @crosbymichael

Luckily, the CRI plugin in most cases doesn't handle multiple containers at a time, which ensures that a single container failure won't block other containers.

However, the one case in which the CRI plugin does handle multiple containers at a time is the recovery logic. This means that if a containerd-shim hangs, the CRI plugin won't be able to restart. This is super bad.

This PR adds a timeout to the per-container/sandbox recovery logic, so that even if one container hangs, we just skip loading it and continue dealing with the other containers.
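Here is a minimal, self-contained sketch of the idea (the `loadContainer` stand-in, the container IDs, and the 2-second timeout are all illustrative, not the exact code in this PR):

```go
package main

import (
	"context"
	"fmt"
	"time"
)

// loadTimeout bounds how long recovery waits on a single container.
// (Illustrative value, not necessarily what the PR uses.)
const loadTimeout = 2 * time.Second

// loadContainer stands in for the CRI plugin's per-container recovery,
// which can block indefinitely when the container's shim hangs.
func loadContainer(ctx context.Context, id string) error {
	delay := 100 * time.Millisecond
	if id == "hung-container" { // simulate one stuck shim
		delay = time.Hour
	}
	select {
	case <-time.After(delay):
		return nil
	case <-ctx.Done():
		return ctx.Err() // context.DeadlineExceeded on timeout
	}
}

func main() {
	for _, id := range []string{"good-1", "hung-container", "good-2"} {
		ctx, cancel := context.WithTimeout(context.Background(), loadTimeout)
		err := loadContainer(ctx, id)
		cancel()
		if err != nil {
			// Skip the bad container instead of failing the whole restart.
			fmt.Printf("skipping container %q: %v\n", id, err)
			continue
		}
		fmt.Printf("recovered container %q\n", id)
	}
}
```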


Signed-off-by: Lantao Liu <lantaol@google.com>
@Random-Liu Random-Liu added this to the v1.0 milestone Aug 24, 2018
@Random-Liu
Member Author

We should cherry-pick this into all supported branches.

@crosbymichael
Member

LGTM

Member

@mikebrow mikebrow left a comment


/LGTM

@yujuhong
Member

LGTM

One question: if the load request times out, how will the container be represented in the CRI plugin -- does CRI show the existence of the container, and what state will it be in?

@Random-Liu
Member Author

@yujuhong Currently it won't show the container, but I think it would be better to show the container in the Unknown state. Let me see whether showing a bad container will cause any issues.

BTW, how does kubelet deal with the unknown state? I remember it triggers a pod sync, and kubelet will try to start a new one.
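For reference, the CRI API's `ContainerState` enum does include `CONTAINER_UNKNOWN`, so a container that failed to load could in principle be surfaced that way. A hypothetical sketch (using the current `k8s.io/cri-api` v1 types; in 2018 the CRI plugin used v1alpha2 types vendored from kubernetes, and the helper below is made up for illustration):

```go
package main

import (
	"fmt"

	runtime "k8s.io/cri-api/pkg/apis/runtime/v1"
)

// containerStatusForUnloadable is hypothetical: it shows how a container
// whose recovery timed out could still be reported to the kubelet, in the
// unknown state, instead of being hidden entirely.
func containerStatusForUnloadable(id string) *runtime.ContainerStatus {
	return &runtime.ContainerStatus{
		Id:      id,
		State:   runtime.ContainerState_CONTAINER_UNKNOWN,
		Reason:  "RecoveryTimeout",
		Message: "failed to load container within the recovery timeout",
	}
}

func main() {
	fmt.Println(containerStatusForUnloadable("hung-container").State)
}
```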

@yujuhong
Member

Yes, kubelet will try to kill the container in the unknown state and start a new one.

@Random-Liu
Member Author

Random-Liu commented Aug 25, 2018

I found this PR doesn't actually help, because:

  1. There is no gRPC boundary between the CRI plugin and containerd, so nothing handles the timeout;
  2. The context will eventually be passed down to containerd-shim, and ideally the RPC client would handle the timeout. However, ttrpc doesn't support timeouts yet (ttrpc: end to end timeout support, containerd/ttrpc#3).

Given that, I'll leave this PR here for now... We need a better timeout solution to make the system more reliable. I filed an issue: containerd/containerd#2578.
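A rough illustration of point 2 (`blockingShimCall` is a stand-in, not a real ttrpc API): a `context.WithTimeout` deadline only helps if something between the caller and the shim actually observes `ctx.Done()`, and with ttrpc not supporting deadlines at the time, nothing did:

```go
package main

import (
	"context"
	"fmt"
	"time"
)

// blockingShimCall stands in for a ttrpc request to a hung shim. It never
// checks ctx, so a deadline on ctx cannot make it return early.
func blockingShimCall(ctx context.Context) error {
	time.Sleep(time.Hour) // the hung shim never answers
	return nil
}

func main() {
	ctx, cancel := context.WithTimeout(context.Background(), 2*time.Second)
	defer cancel()

	// ctx expires after 2 seconds, but because nothing on the path to the
	// shim observes ctx.Done(), this call still blocks for the full hour.
	err := blockingShimCall(ctx)
	fmt.Println(err) // effectively never reached
}
```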

@Random-Liu Random-Liu changed the title Add timeout for container/sandbox recover. [WIP] Add timeout for container/sandbox recover. Aug 25, 2018
@Random-Liu Random-Liu removed this from the v1.0 milestone Aug 25, 2018
@Random-Liu Random-Liu closed this Sep 10, 2018
@Random-Liu Random-Liu deleted the add-timeout-for-recover branch September 10, 2018 18:48