
[WIP] Add timeout for container/sandbox recover. #884

Closed

Conversation

Random-Liu
Member

@Random-Liu Random-Liu commented Aug 24, 2018

Although we don't want it to, containerd-shim sometimes hangs, e.g. containerd/containerd#2438. We should definitely root-cause and fix that.

However, on the other hand, we should make sure the system remains functional when some containerd-shims hang. For example, `ctr task ls` will hang if a single containerd-shim hangs. --> We should probably fix that. @crosbymichael

Luckily, the CRI plugin in most cases doesn't handle multiple containers at a time, which ensures that a single container failure won't block other containers.

However, the one case in which the CRI plugin does handle multiple containers at a time is the recovery logic. This means that if a containerd-shim hangs, the CRI plugin won't be able to restart. This is super bad.

This PR adds a timeout to the per-container/sandbox recovery logic, so that even if one container hangs, we just skip loading it and continue dealing with the other containers.
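Here is a minimal, self-contained sketch of the idea (the `loadContainer` stand-in, the container IDs, and the 2-second timeout are all illustrative, not the exact code in this PR):

```go
package main

import (
	"context"
	"fmt"
	"time"
)

// loadTimeout bounds how long recovery waits on a single container.
// (Illustrative value, not necessarily what the PR uses.)
const loadTimeout = 2 * time.Second

// loadContainer stands in for the CRI plugin's per-container recovery,
// which can block indefinitely when the container's shim hangs.
func loadContainer(ctx context.Context, id string) error {
	delay := 100 * time.Millisecond
	if id == "hung-container" { // simulate one stuck shim
		delay = time.Hour
	}
	select {
	case <-time.After(delay):
		return nil
	case <-ctx.Done():
		return ctx.Err() // context.DeadlineExceeded on timeout
	}
}

func main() {
	for _, id := range []string{"good-1", "hung-container", "good-2"} {
		ctx, cancel := context.WithTimeout(context.Background(), loadTimeout)
		err := loadContainer(ctx, id)
		cancel()
		if err != nil {
			// Skip the bad container instead of failing the whole restart.
			fmt.Printf("skipping container %q: %v\n", id, err)
			continue
		}
		fmt.Printf("recovered container %q\n", id)
	}
}
```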


Signed-off-by: Lantao Liu <lantaol@google.com>
@Random-Liu Random-Liu added this to the v1.0 milestone Aug 24, 2018
@Random-Liu
Member Author

We should cherry-pick this into all supported branches.

@crosbymichael
Member

LGTM

Member

@mikebrow mikebrow left a comment


/LGTM

@yujuhong
Member

LGTM

One question: if the load request times out, how will the container be represented in the CRI plugin -- does CRI show the existence of the container, and what state will it be in?

@Random-Liu
Member Author

@yujuhong Currently it won't show the container, but I think it would be better to show the container in the Unknown state. Let me see whether showing a bad container will cause any issues.

BTW, how does kubelet deal with the unknown state? I remember it triggers a pod sync, and kubelet will try to start a new one.
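For reference, the CRI API's `ContainerState` enum does include `CONTAINER_UNKNOWN`, so a container that failed to load could in principle be surfaced that way. A hypothetical sketch (using the current `k8s.io/cri-api` v1 types; in 2018 the CRI plugin used v1alpha2 types vendored from kubernetes, and the helper below is made up for illustration):

```go
package main

import (
	"fmt"

	runtime "k8s.io/cri-api/pkg/apis/runtime/v1"
)

// containerStatusForUnloadable is hypothetical: it shows how a container
// whose recovery timed out could still be reported to the kubelet, in the
// unknown state, instead of being hidden entirely.
func containerStatusForUnloadable(id string) *runtime.ContainerStatus {
	return &runtime.ContainerStatus{
		Id:      id,
		State:   runtime.ContainerState_CONTAINER_UNKNOWN,
		Reason:  "RecoveryTimeout",
		Message: "failed to load container within the recovery timeout",
	}
}

func main() {
	fmt.Println(containerStatusForUnloadable("hung-container").State)
}
```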

@yujuhong
Member

Yes, kubelet will try to kill the container in the unknown state and start a new one.

@Random-Liu
Member Author

Random-Liu commented Aug 25, 2018

I found this PR doesn't actually help, because:

  1. There is no gRPC boundary between the CRI plugin and containerd, so nothing handles the timeout;
  2. The context will eventually be passed down to containerd-shim, and ideally the RPC client would handle the timeout. However, ttrpc doesn't support timeouts yet (ttrpc: end to end timeout support, containerd/ttrpc#3).

Given that, I'll leave this PR here for now... We need a better timeout solution to make the system more reliable. I filed an issue: containerd/containerd#2578.
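A rough illustration of point 2 (`blockingShimCall` is a stand-in, not a real ttrpc API): a `context.WithTimeout` deadline only helps if something between the caller and the shim actually observes `ctx.Done()`, and with ttrpc not supporting deadlines at the time, nothing did:

```go
package main

import (
	"context"
	"fmt"
	"time"
)

// blockingShimCall stands in for a ttrpc request to a hung shim. It never
// checks ctx, so a deadline on ctx cannot make it return early.
func blockingShimCall(ctx context.Context) error {
	time.Sleep(time.Hour) // the hung shim never answers
	return nil
}

func main() {
	ctx, cancel := context.WithTimeout(context.Background(), 2*time.Second)
	defer cancel()

	// ctx expires after 2 seconds, but because nothing on the path to the
	// shim observes ctx.Done(), this call still blocks for the full hour.
	err := blockingShimCall(ctx)
	fmt.Println(err) // effectively never reached
}
```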

@Random-Liu Random-Liu changed the title Add timeout for container/sandbox recover. [WIP] Add timeout for container/sandbox recover. Aug 25, 2018
@Random-Liu Random-Liu removed this from the v1.0 milestone Aug 25, 2018
@Random-Liu Random-Liu closed this Sep 10, 2018
@Random-Liu Random-Liu deleted the add-timeout-for-recover branch September 10, 2018 18:48