CRI-O should be smarter about context timeouts #4221
Comments
kubernetes-sigs/cri-tools#666 adds deadlines to runp and create, allowing us to test different cases.
Generally agree 👍 Thanks for pointing the steps out in such a detailed way!
"Attempt" is a confusing naming here. Even if the I probably see a misconception of the attempt here. Changing it would break the API for sure. But if the attempt would increase "more often", then we could handle it in a more sane way on the runtime side. |
Ah, that's a good clarifying point, though I think that still works in our context. CRI-O knows the difference between a failed attempt and a timed-out attempt, so we can cache only in the timeout case and discard in the failed case. We still check the "attempt number" as it protects us against the sequence of
Yes, this sounds correct to me. Noting to myself that we can probably fix/rephrase this by moving the CRI to beta.
Description
On nodes under load, we run into issues where CreateContainer and RunPodSandbox requests time out. The most common causes are I/O load, network creation taking a long time, and SELinux relabelling of volumes taking a while.
Right now, CreateContainer and RunPodSandbox run to completion, but check at the end whether we timed out. If so, we remove the resource we just created and try again (a sketch of this flow follows the list below). We do this because the kubelet works as follows:
Kubelet makes a request
CRI-O handles the request, reserving the resource name the kubelet asked for
The kubelet's request times out, and it begins polling CRI-O, continuously trying to create the resource it first requested.
note, it does not bump the attempt number, as this would cause CRI-O to simultaneously try to create multiples of the same container, further adding load to the already overloaded node
Eventually, CRI-O learns the kubelet timed out, and cleans up after itself
note: we clean up here so the next request will pass. if we didn't, the kubelet would say "give me the same container" and CRI-O would say "I already created that container!", even though the kubelet's original request failed
Kubelet finally asks for the same resource again, and one day, assuming the node's load reduces, CRI-O succeeds in creating the container and returns it to the kubelet.
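To make that flow concrete, here is a small standalone Go sketch of the pattern (names like buildContainer and createWithLateCheck are made up for illustration; this is not CRI-O's actual code): the work runs to completion, the handler only then notices the caller's deadline has already expired, and it throws the result away.

```go
package main

import (
	"context"
	"errors"
	"fmt"
	"time"
)

// buildContainer stands in for the expensive work (network setup, SELinux
// relabelling, etc.); it keeps running even after the caller's deadline passes.
func buildContainer() string {
	time.Sleep(10 * time.Millisecond)
	return "ctr-1"
}

// createWithLateCheck mirrors the current behavior: run to completion, notice
// the deadline already expired, undo the work, and return an error so the
// kubelet retries the identical request from scratch.
func createWithLateCheck(ctx context.Context) (string, error) {
	id := buildContainer()
	if errors.Is(ctx.Err(), context.DeadlineExceeded) {
		fmt.Println("kubelet timed out; removing", id, "so the next attempt can reserve the name again")
		return "", ctx.Err()
	}
	return id, nil
}

func main() {
	ctx, cancel := context.WithTimeout(context.Background(), time.Nanosecond)
	defer cancel()
	if _, err := createWithLateCheck(ctx); err != nil {
		fmt.Println("create failed:", err) // the kubelet polls again, and the cycle can repeat
	}
}
```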
Notice that the kubelet is asking for the same resource the whole time (name, attempt, sandbox, etc.), and yet CRI-O can end up creating it an infinite number of times (if the request that is actually handling the creation keeps timing out), while returning "name is reserved" errors in the meantime.
We should do better.
Instead of removing the resource upon finding out we timed out, CRI-O should put that newly created resource into a cache and report an error to the kubelet. When the kubelet eventually re-requests that resource (as it would have been doing since the timeout), we should pull the resource from the cache and return it, without creating it a second time.
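A minimal standalone sketch of that proposed flow (hypothetical names; a plain map stands in for the real cache, and a real implementation would need locking): on timeout the finished resource is stashed instead of deleted, and the retried request is served from the stash without a second create.

```go
package main

import (
	"context"
	"errors"
	"fmt"
	"time"
)

// pending holds resources that finished building after the kubelet's deadline
// had already expired, keyed by the reserved name (single-goroutine sketch).
var pending = map[string]string{}

// create is the proposed flow: if an earlier timed-out request already built
// this resource, hand it back; otherwise build it, and on timeout stash it
// instead of deleting it.
func create(ctx context.Context, name string, build func() string) (string, error) {
	if id, ok := pending[name]; ok {
		delete(pending, name)
		return id, nil // no second create
	}
	id := build()
	if errors.Is(ctx.Err(), context.DeadlineExceeded) {
		pending[name] = id // keep the work we already did
		return "", ctx.Err()
	}
	return id, nil
}

func main() {
	build := func() string { time.Sleep(10 * time.Millisecond); return "ctr-1" }

	// First request: the work outlives the (very short) kubelet deadline.
	ctx, cancel := context.WithTimeout(context.Background(), time.Nanosecond)
	defer cancel()
	if _, err := create(ctx, "pod_ctr_0", build); err != nil {
		fmt.Println("first request:", err) // kubelet sees a timeout and will retry
	}

	// The retried, identical request is served from the cache.
	id, err := create(context.Background(), "pod_ctr_0", build)
	fmt.Println("retry:", id, err)
}
```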
There are a couple of gotchas we need to handle:
What if the kubelet bumps the attempt number?
If the kubelet asks to create attempt 0, and then the api-server tells it to create a new version, CRI-O should abandon attempt 0.
What if the kubelet gives up on that container?
CRI-O should abandon the old container, as it's no longer needed.
What if CRI-O is restarted before the container is correctly returned?
CRI-O should be prepared to clean up that container if the kubelet doesn't ask for it.
So here's the proposed solution:
At the end of a CreateContainer or RunPodSandbox request, before marking that newly created resource as successfully created, CRI-O should check whether the request timed out. If it did, it should save the resource in a ResourceCache (see the sketch after these steps). There will be one ResourceCache for pods and one for containers. CRI-O should not report these resources as created (in a PodSandboxList or ContainerList request), as the kubelet doesn't yet think that resource has been created.
When the kubelet asks for the resource again, CRI-O should verify the request is the same (a deep comparison against the Request object) and, if so, mark the entry in the ResourceCache as created, move it into CRI-O's state, and remove it from the cache.
If the resource is not asked for after a certain amount of time (a couple of minutes?), CRI-O should clean up the resource from the cache.
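Here is one possible shape for such a ResourceCache, sketched with hypothetical names and types (the stored entry keeps the original request for comparison and a timer for stale cleanup; a real implementation would compare the actual protobuf Request, e.g. with proto.Equal, rather than the plain strings used here):

```go
package main

import (
	"fmt"
	"reflect"
	"sync"
	"time"
)

// cachedResource is one timed-out-but-finished resource awaiting the kubelet's retry.
type cachedResource struct {
	request interface{} // the original CreateContainer/RunPodSandbox request
	id      string      // the resource that was actually created
	timer   *time.Timer // fires if the kubelet never re-asks for it
}

// ResourceCache holds finished resources the kubelet does not yet know about.
// There would be one instance for sandboxes and one for containers.
type ResourceCache struct {
	mu       sync.Mutex
	items    map[string]*cachedResource // keyed by reserved name
	staleTTL time.Duration              // e.g. a couple of minutes
	cleanup  func(id string)            // destroys a resource nobody asked for again
}

func NewResourceCache(ttl time.Duration, cleanup func(string)) *ResourceCache {
	return &ResourceCache{items: map[string]*cachedResource{}, staleTTL: ttl, cleanup: cleanup}
}

// Put stores a resource whose request timed out. If the kubelet never retries,
// the TTL timer removes and destroys it.
func (c *ResourceCache) Put(name string, req interface{}, id string) {
	c.mu.Lock()
	defer c.mu.Unlock()
	cr := &cachedResource{request: req, id: id}
	cr.timer = time.AfterFunc(c.staleTTL, func() {
		c.mu.Lock()
		delete(c.items, name)
		c.mu.Unlock()
		c.cleanup(id)
	})
	c.items[name] = cr
}

// Get returns the cached resource only if the retried request matches the
// original one; a mismatched request evicts and destroys the stale entry instead.
func (c *ResourceCache) Get(name string, req interface{}) (string, bool) {
	c.mu.Lock()
	defer c.mu.Unlock()
	cr, ok := c.items[name]
	if !ok {
		return "", false
	}
	cr.timer.Stop()
	delete(c.items, name)
	if !reflect.DeepEqual(cr.request, req) {
		c.cleanup(cr.id)
		return "", false
	}
	return cr.id, true
}

func main() {
	cache := NewResourceCache(2*time.Minute, func(id string) { fmt.Println("destroying stale", id) })
	cache.Put("pod_ctr_0", "request-v0", "ctr-1")
	if id, ok := cache.Get("pod_ctr_0", "request-v0"); ok {
		fmt.Println("returning", id, "to the kubelet without recreating it")
	}
}
```

The mismatch path in Get would also cover the attempt-bump gotcha above: a request that no longer matches the cached one evicts and destroys the stale entry instead of returning it.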
This issue is for discussion of possible other gotchas and failure modes. This will be a pretty substantial change to CRI-O's state machine, so we need to make sure there are no cases where we confuse the kubelet.