
Arbitrary image pinning support (i.e. to exclude from kubelet's image GC) #6930

Closed
ruiwen-zhao opened this issue May 11, 2022 · 19 comments
Labels: area/cri Container Runtime Interface (CRI), kind/feature

@ruiwen-zhao
Member

ruiwen-zhao commented May 11, 2022

What is the problem you're trying to solve

Containerd users should be able to pin (i.e. to exclude from kubelet's image GC) arbitrary images.

There is an existing PR (#6456) that pins the sandbox image. However, we may want to generalize this functionality and allow users to pin arbitrary images. I am creating this issue to discuss and (hopefully) finalize the expected behavior.

Describe the solution you'd like

We have three options here:

Option 1: containerd takes “pinning requests” through the PullImageRequest.Image.Annotations field (see the sketch after this list).
This is suggested in https://github.com/containerd/containerd/pull/6456/files#r803279310
Pros:
No extra user-facing API/flag/config needed. Users can pin arbitrary images by specifying pod annotations (because kubelet copies pod annotations over to PullImageRequest.Image.Annotations).
Cons:
Anyone who can modify a pod spec can pin images. This might be too much privilege for pod owners, and may be especially bad for multi-tenant clusters.

Option 2: containerd takes “pinning requests” through an additional field in the CRI API.
Pros:
Limited access to the image-pinning ability (depending on how k8s exposes the functionality).
Cons:
Kubelet and CRI changes are required.
K8s needs to figure out how to expose this functionality to end users.

Option 3: containerd exposes the option through the containerd config (which toml section is TBD).
Pros:
Limited access to the image-pinning ability: only users who have access to the containerd config would be able to pin images.
No CRI/k8s change needed. Containerd will have sole authority to decide which images to pin.
Cons:
K8s users/admins will not be able to decide which images to pin.
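
To make Option 1 concrete, here is a minimal Go sketch of the request shape involved, using the CRI v1 types. The annotation key is purely hypothetical, since no key has been agreed on:

```go
package main

import (
	runtime "k8s.io/cri-api/pkg/apis/runtime/v1"
)

// buildPullRequest sketches the Option 1 flow: kubelet copies pod annotations
// into PullImageRequest.Image.Annotations, and the CRI plugin could interpret
// a well-known (here: hypothetical) key as a request to pin the image.
func buildPullRequest(image string, podAnnotations map[string]string) *runtime.PullImageRequest {
	return &runtime.PullImageRequest{
		Image: &runtime.ImageSpec{
			Image:       image,
			Annotations: podAnnotations, // e.g. {"example.io/pin-image": "true"} (hypothetical key)
		},
	}
}
```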

I propose Option 3 here because it is relatively simple (it avoids the back and forth between kubelet and containerd) and it limits the privilege of pinning images. However, I am open to discussion and other options.
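
For Option 3, a rough sketch of what the Go side of such a config could look like. The field name and TOML section are hypothetical, since they are left TBD above:

```go
package criconfig

// ImageGCConfig is a hypothetical addition to the CRI plugin configuration:
// any image matching this list would be reported to kubelet with its pinned
// bit set, excluding it from kubelet's image GC.
type ImageGCConfig struct {
	// PinnedImages is a list of image references to always report as pinned.
	PinnedImages []string `toml:"pinned_images" json:"pinnedImages"`
}
```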

Additional context

No response

@ruiwen-zhao
Member Author

cc @mikebrow @adisky

@AkihiroSuda AkihiroSuda changed the title Arbitrary image pinning support Arbitrary image pinning support (i.e. to exclude from kubelet's image GC) May 12, 2022
@ruiwen-zhao
Member Author

It seems like kubernetes/kubernetes#109936 and its corresponding PR are trying to achieve pinning of arbitrary images by adding a new kubelet config, thereby avoiding relying on containerd to provide the list of images to be pinned.

@mikebrow
Member

mikebrow commented May 23, 2022

we have a gc in kubelet, and one in containerd, and user/admins have tools ..

we definitely need to work together on this and other image services in the k8s.io namespace

@mikebrow
Member

IMO there should be levels of priority for image retention in the GC process, beyond the least-recently-used/size metric used today. But that's a different issue from pinning.

IMO pin is a recommendation not to delete an image during GC (last to remove within a level, if you have levels).

@mikebrow
Member

Locking images against removal for security or other administrative reasons.. that would be different from pin.. though it depends on the definition of pin? If the pin were tied to a security process, that would be interesting.

@ruiwen-zhao
Member Author

we have a gc in kubelet, and one in containerd, and user/admins have tools ..

we definitely need to work together on this and other image services in the k8s.io namespace

Yep. But for the purposes of this issue, I am trying to limit the scope to generalizing the pinned field to support arbitrary images. The only GC in scope is the one in kubelet.

@ruiwen-zhao
Member Author

kubernetes/kubernetes#109936 brought up kubernetes/kubernetes@d3ae0a3#diff-583fcb2b2b9fed64e8ac45392565a924918b4a64a7d4b66df0d4bccd298e8769R1259

maybe it should be more than a recommendation?

I am ok with keeping it as a recommendation, if the contract around pinned field clearly says so.

@BenTheElder
Contributor

FWIW, in side-loaded image scenarios we desire to have 100% never-GCed images (consider pause) that may not have ever been pulled and may not be pullable (airgapped etc).

KIND preloads core images and so does GKE; I wouldn't be surprised if others do as well. For those images, GCing them is counterproductive, and the kubelet GC is rather naive.

@dmcgowan
Member

KIND preloads core images and so does GKE; I wouldn't be surprised if others do as well. For those images, GCing them is counterproductive, and the kubelet GC is rather naive.

In this scenario is the preloading for warming the cache or because the image may not otherwise be available to pull? From a cleanup perspective, if the space is needed, they should be collectible. Having a naive implementation when they should be prioritized or scored in some way makes that complicated.

For images not needed for running pods, the recommended way is not using the k8s.io namespace. For those other use cases, I would also like to explore alternatives to pulling from a registry, such as getting the content directly from another containerd namespace or loading the image from a static location (such as an OCI image on disk). I am not a fan of "ignore garbage collection" flags on content.

@BenTheElder
Contributor

BenTheElder commented May 24, 2022

In this scenario is the preloading for warming the cache or because the image may not otherwise be available to pull?

For kind, the former; we may be running an unpushed development build of Kubernetes.
For GKE on GCP, the latter, but I know other products also have the former.

Other reasons they may not be pullable for KIND users:

  • k8s.gcr.io is not reachable behind the Great Firewall
  • the developer is on an aircraft or otherwise without internet connectivity at the moment

Today we're essentially forced to just disable the GC entirely, and users must periodically create a new cluster (which may do zero runtime pulling) or perform "GC" themselves. This works well in many environments, but is less than ideal.

From a cleanup perspective, if the space is needed, they should be collectible. Having a naive implementation when they should be prioritized or scored in some way makes that complicated

Er - kubelet makes the selection today, and kubelet's selection is naive. And kubelet is just trying to free up disk by any means, but doesn't have knobs for this (it has one deprecated knob for pause).

For images not needed for running pods, the recommended way is not using k8s.io namespace.

I don't think that's being discussed here, only CRI.

For those other use cases, I would also like to explore alternatives to pulling from a registry, such as getting the content directly from another containerd namespace or loading the image from a static location (such as OCI image on disk). I am not a fan of "ignore garbage collection" flags on content.

Storing it in an OCI image on disk just increases the space necessary and still effectively creates a "don't GC this" flag, in a roundabout way?

Storing directly in containerd is more compact than also storing an intermediate OCI image on the target machine.

I'm not sure about the "getting content directly from another containerd namespace approach", can you elaborate on what that should look like?

The cluster administrator should know what they're doing when images are marked.

@harshanarayana

harshanarayana commented May 24, 2022

Here are my two cents from previous disasters we have had to deal with because of image GC:

if the space is needed, they should be collectible

Indeed. However, there should still be a way to retain certain cluster-critical images that, when missing, can take the entire cluster down. Take the following examples.

  1. In our cluster, all DNS resolution goes via CoreDNS, i.e. the host forwards all requests to CoreDNS, which forwards them upstream. In this case, if the CoreDNS pod image got GC'ed for whatever reason, there would be no way to pull that image without first restoring the DNS config on the host by temporarily hacking around /etc/resolv.conf and then putting it back in its old clean state.
  2. You are running an HA airgapped cluster with a Docker registry built into the cluster (i.e. the registry is a pod running in the cluster) because the customer doesn't have their own registry. In cases like this, if the registry image itself gets GC'ed, you have no way to pull other images on the cluster without restoring the registry first, and that would need some kind of external tooling / remediation workflow to be built, which can be less than ideal in some cases.

such as getting the content directly from another containerd namespace or loading the image from a static location

Other than the duplication that @BenTheElder mentioned, this also involves having to split the images into individual tarballs via a docker save-like mechanism so that only selected images can be loaded, whereas we generally prefer doing a bulk save and then splitting it using the split command to get better behavior for saving images.

For images not needed for running pods, the recommended way is not using k8s.io namespace

This is not always going to be possible, right? We have a collection of images that correspond to a set of Kubernetes Jobs that run in response to an event. In such cases these images can't be kept in a different namespace. And since these images won't have a running container, the GC will happily clean them up, which is less than ideal for running jobs. Also, a separate namespace would mean the image needs to be side-loaded first and then moved to the right namespace when required or during remediation, which means there is still going to be duplication of artifacts on the system. Keeping two namespaces for images would be an additional overhead on top of that.

@ruiwen-zhao
Member Author

I am also putting our case here and why we need to exclude images from GC:

We preload the GPU driver installer images on nodes with GPUs attached; more specifically, we preload different driver installer images onto different nodes based on hardware type. To make this easier for our customers, we give the same name to all the driver installer images regardless of their versions, so that customers don't need to worry about versions and can simply kubectl apply -f the same DaemonSet YAML.

Since we use the same name for all images, we cannot make the images pullable: the nodes won't be able to tell which exact image to pull based on the name.

On the other hand, the driver installer images are only used in an initContainer, so they stop being used (and become targets of GC) once the installation finishes.

@fuweid fuweid added the area/cri Container Runtime Interface (CRI) label Jun 24, 2022
@adisky
Contributor

adisky commented Jan 11, 2023

@ruiwen-zhao have we reached any conclusion here? IMO the easy/secure way to support exempting arbitrary images from GC is a config.toml option. We may not want to give all users access to pin images via annotations.

Also, one more thought: what is the need to involve containerd here? We could directly provide the images to kubelet via its config. One possible reason could be that in air-gapped environments images are preloaded via containerd, so it would be easier for the user to pin them via containerd.
There could be one more way: users can set labels on images via
sudo ctr -n k8s.io images label <image-name> io.cri-containerd.pinned=pinned
(or programmatically, as sketched below). This should work after #7944 gets merged, without doing anything additional.
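
The same label can also be set with the containerd Go client; a minimal sketch, with a placeholder image name:

```go
package main

import (
	"context"
	"log"

	containerd "github.com/containerd/containerd"
	"github.com/containerd/containerd/namespaces"
)

func main() {
	client, err := containerd.New("/run/containerd/containerd.sock")
	if err != nil {
		log.Fatal(err)
	}
	defer client.Close()

	// Images used by Kubernetes live in the k8s.io namespace.
	ctx := namespaces.WithNamespace(context.Background(), "k8s.io")

	img, err := client.ImageService().Get(ctx, "registry.example.com/driver:latest") // placeholder
	if err != nil {
		log.Fatal(err)
	}
	if img.Labels == nil {
		img.Labels = map[string]string{}
	}
	img.Labels["io.cri-containerd.pinned"] = "pinned"

	// Update only the pinned label, leaving all other image fields untouched.
	if _, err := client.ImageService().Update(ctx, img, "labels.io.cri-containerd.pinned"); err != nil {
		log.Fatal(err)
	}
}
```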

@ruiwen-zhao
Member Author

I don't think we have reached any conclusion yet.

I think containerd is involved here mostly because:

a. in CRI v1, kubelet expects the pinned field to be set by the container runtime, and
b. users can delete images directly in the container runtime, so if containerd is not involved, it is possible for an image to be protected from kubelet GC but still get deleted.
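
To illustrate (a), a small Go sketch of the runtime side of that contract, using the CRI v1 types: the runtime fills in the pinned bit when reporting images, and kubelet's image GC skips anything reported as pinned (values are placeholders):

```go
package main

import (
	runtime "k8s.io/cri-api/pkg/apis/runtime/v1"
)

// pinnedImage sketches what a runtime would return from ImageStatus or
// ListImages for a pinned image: the runtime, not kubelet, sets Pinned,
// and kubelet's image GC then skips the image.
func pinnedImage(id, ref string) *runtime.Image {
	return &runtime.Image{
		Id:       id,
		RepoTags: []string{ref},
		Pinned:   true, // excluded from kubelet image GC
	}
}
```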

@xiaoyaozi5566

This is exactly our pain point. We don't want images to be GC'ed in an airgapped environment. You can GC unused containers and snapshots because they can be recreated. Can we at least provide an option to disable GC on images?

@samuelkarp
Member

Can we at least provide an option to disable GC on images?

containerd itself doesn't do any GC of images, so that would be a question of modifying the kubelet.

@xiaoyaozi5566

Right, the images were GCed by kubelet garbage collector.

@samuelkarp
Member

Arbitrary pinning is available via sudo ctr -n k8s.io images label <image-name> io.cri-containerd.pinned=pinned since containerd v1.7.14 and v1.6.30.
