CRI-O should be smarter about context timeouts #4221

Closed
haircommander opened this issue Sep 24, 2020 · 4 comments · Fixed by #4394
@haircommander (Member)

Description
On nodes under load, we run into issues where ContainerCreate and RunPodSandbox requests time out. The most common causes are i/o load, network creation taking a long time, and SELinux relabelling of volumes taking a while.

Right now, ContainerCreate and RunPodSandbox run to completion, but check at the end whether the request timed out. If so, we remove the resource we just created and try again. We do this because the kubelet works as follows:

1. The kubelet makes a request.
2. CRI-O handles the request, reserving the resource name the kubelet asked for.
3. The kubelet's request times out, and it begins polling CRI-O, continuously trying to create the resource it first asked for.
   - Note: it does not bump the attempt number, as that would cause CRI-O to simultaneously try to create multiple copies of the same container, further adding load to the already overloaded node.
4. Eventually, CRI-O learns the kubelet timed out and cleans up after itself.
   - Note: we clean up here so the next request will pass. If we didn't, the kubelet would say "give me the same container" and CRI-O would say "I already created that container!", even though the kubelet's original request failed.
5. The kubelet finally asks for the same resource again and, one day, assuming the node's load drops, CRI-O succeeds in creating the container and returns it to the kubelet.

Notice that the kubelet asks for the same resource the whole time (name, attempt, sandbox, etc.), and yet CRI-O can end up creating it an unbounded number of times (if the request that is actually handling the creation keeps timing out), all while returning "name is reserved" errors to the kubelet's polling requests in the meantime.

We should do better.

Instead of removing the resource upon finding out we timed out, CRI-O should put that newly created resource into a cache and report an error to the kubelet. When the kubelet eventually re-requests that resource (as it has been doing since the timeout), we should pull the resource from the cache and return it, without creating it a second time.
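To make the cache idea concrete, here is a minimal sketch of what such a cache could look like in Go; the names (`cachedResource`, `resourceCache`, and their fields) are illustrative placeholders, not existing CRI-O types:

```go
package resourcecache

import (
	"sync"
	"time"
)

// cachedResource pairs a resource that was fully created, but whose request
// timed out before it could be returned, with the request that produced it.
type cachedResource struct {
	id        string      // the container or sandbox ID that was created
	attempt   uint32      // attempt number from the CRI metadata
	request   interface{} // the original CreateContainer/RunPodSandbox request
	createdAt time.Time   // when the resource finished being created
}

// resourceCache is a minimal name -> cachedResource map guarded by a mutex.
// CRI-O would keep one of these for sandboxes and one for containers.
type resourceCache struct {
	mu    sync.Mutex
	items map[string]cachedResource
}

func newResourceCache() *resourceCache {
	return &resourceCache{items: make(map[string]cachedResource)}
}

// put remembers a created-but-unreturned resource under the name the kubelet
// asked for, so a retried request can pick it up later.
func (c *resourceCache) put(name string, r cachedResource) {
	c.mu.Lock()
	defer c.mu.Unlock()
	c.items[name] = r
}
```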

There are a couple of gotchas we need to handle:

- What if the kubelet bumps the attempt number? If the kubelet asks to create attempt 0, and the api-server then tells it to create a new version, CRI-O should abandon attempt 0.
- What if the kubelet gives up on that container? CRI-O should abandon the old container, as it's no longer needed.
- What if CRI-O is restarted before the container is correctly returned? CRI-O should be prepared to clean up that container if the kubelet doesn't ask for it.

So here's the proposed solution:
At the end of a CreateContainer or RunPodSandbox request, before marking the newly created resource as successfully created, CRI-O should check whether the request timed out. If it did, it should save the resource in a ResourceCache; there will be one ResourceCache for pods and one for containers. CRI-O should not report these resources as created (in response to a PodSandboxList or ContainerList request), as the kubelet doesn't yet consider them created.
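A hedged sketch of that end-of-request check, extending the hypothetical `resourceCache` above (the `finishCreate` helper is invented for illustration and assumes the `context`, `fmt`, and `time` packages are imported):

```go
// finishCreate is called after the sandbox/container has been fully built.
// If the request context has already expired, the resource is parked in the
// cache instead of being marked as created, and an error is returned so the
// kubelet knows this attempt did not succeed.
func finishCreate(ctx context.Context, cache *resourceCache, name, id string, attempt uint32, req interface{}) (string, error) {
	if ctx.Err() != nil {
		// The kubelet has given up on this request: keep the resource
		// around for the retry, but do not report it as created (it must
		// not show up in list responses yet).
		cache.put(name, cachedResource{
			id:        id,
			attempt:   attempt,
			request:   req,
			createdAt: time.Now(),
		})
		return "", fmt.Errorf("request for %q timed out, resource cached for retry: %w", name, ctx.Err())
	}
	// The request is still live: the caller can mark the resource created
	// and return it to the kubelet as usual.
	return id, nil
}
```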

When the kubelet asks for the resource again, CRI-O should verify the request is the same (a deep comparison against the Request object) and, if so, mark the value in the ResourceCache as created, move it into CRI-O's state, and remove it from the cache.
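A sketch of that retry path, again extending the hypothetical `resourceCache`; `reflect.DeepEqual` is only a stand-in for whatever comparison of the CRI request objects is ultimately chosen:

```go
// takeIfSameRequest returns the cached resource for name if, and only if,
// the retried request deep-compares equal to the one that produced it. The
// caller is then responsible for marking the resource created in CRI-O's
// state; the cache entry is removed once it matches.
func (c *resourceCache) takeIfSameRequest(name string, req interface{}) (string, bool) {
	c.mu.Lock()
	defer c.mu.Unlock()
	r, ok := c.items[name]
	if !ok || !reflect.DeepEqual(r.request, req) {
		// Either nothing is cached under this name, or the kubelet is now
		// asking for something different: fall through to a normal create.
		return "", false
	}
	delete(c.items, name)
	return r.id, true
}
```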

If the resource is not asked for again within a certain amount of time (a couple of minutes?), CRI-O should clean it up and drop it from the cache.
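And a sketch of that time-based cleanup; the thirty-second sweep interval and the caller-supplied `cleanup` callback are placeholders, since real cleanup would also have to remove the underlying sandbox or container, not just the cache entry:

```go
// sweep periodically drops cached resources the kubelet never came back for.
// cleanup is expected to remove the orphaned sandbox/container itself; here
// it is just a callback so the sketch stays self-contained.
func (c *resourceCache) sweep(ttl time.Duration, cleanup func(id string)) {
	for range time.Tick(30 * time.Second) {
		c.mu.Lock()
		for name, r := range c.items {
			if time.Since(r.createdAt) > ttl {
				delete(c.items, name)
				cleanup(r.id)
			}
		}
		c.mu.Unlock()
	}
}
```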

This issue is for discussing other possible gotchas and failure modes. This will be a fairly substantial change to CRI-O's state machine, so we need to make sure there are no cases where we confuse the kubelet.

@haircommander (Member, Author)

kubernetes-sigs/cri-tools#666 adds deadlines to runp and create, allowing us to test different cases

@haircommander haircommander self-assigned this Sep 24, 2020
@saschagrunert (Member) commented Sep 24, 2020

Generally agree 👍 Thanks for pointing the steps out in such a detailed way!

> Note: it does not bump the attempt number, as that would cause CRI-O to simultaneously try to create multiple copies of the same container, further adding load to the already overloaded node.

"Attempt" is a confusing naming here. Even if the RunPodSandboxRequest returns an error the attempt will not incremented by the kubelet. The "documentation" states that attempt refers to any sandbox modification, so it's more a "modified indicator" rather than a plain attempt.

There is probably a misconception about the attempt here. Changing it would certainly break the API. But if the attempt were incremented more often, we could handle it in a saner way on the runtime side.

@haircommander (Member, Author)

Ah, that's a good clarifying point, though I think this still works in our context. CRI-O knows the difference between a failed attempt and a timed-out attempt, so we can cache only in the timeout case and discard in the failure case. We still check the attempt number, as it protects us against the sequence attempted request -> timed-out request -> changed request, where we need to clear our cache. Is my understanding correct there?
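As a rough illustration of that policy (extending the hypothetical `resourceCache` sketch above, with `errors` imported; none of this is actual CRI-O code): only a deadline-expired context caches the resource, any other failure discards it, and a retry with a different attempt number invalidates the cached entry.

```go
// shouldCache reports whether a finished create should be parked in the
// cache: creation itself succeeded, but the caller's context hit its
// deadline. Any genuine creation failure means the resource gets cleaned up
// as before.
func shouldCache(ctx context.Context, createErr error) bool {
	return createErr == nil && errors.Is(ctx.Err(), context.DeadlineExceeded)
}

// dropIfAttemptChanged invalidates a cached resource when the kubelet
// re-requests the same name with a bumped attempt number, i.e. the sequence
// attempted request -> timed-out request -> changed request.
func (c *resourceCache) dropIfAttemptChanged(name string, newAttempt uint32) {
	c.mu.Lock()
	defer c.mu.Unlock()
	if r, ok := c.items[name]; ok && r.attempt != newAttempt {
		delete(c.items, name)
	}
}
```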

@saschagrunert (Member)

Yes, this sounds correct to me.

Noting to myself that we can probably fix/rephrase this by moving the CRI to beta.
