
Proposal: "STATUS" verb #859

Closed
squeed opened this issue Sep 8, 2021 · 26 comments

@squeed
Member

squeed commented Sep 8, 2021

Sometimes plugins know that they can't accept any CNI requests.

Possible causes:

  • They're a shim to a daemon, and that daemon is down.
  • They're out of IP addresses
  • Their configuration file is somehow invalid or broken

Right now, CRI (Kubernetes)-flavored runtimes need to report network status, but they do so exclusively by checking for the existence of a CNI configuration file. It would be useful to provide more information to the end user / administrator.

What if we added a "STATUS" verb? It would take a CNI configuration, but not be called in the context of a container.

The result should differentiate between permanent and temporary errors (which we already have some support for via error codes).
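To make the proposal concrete, here is a rough sketch (not spec wording) of how a daemon-backed shim plugin might answer such a verb: read the network configuration from stdin, probe the daemon, and print a CNI error object on failure. The socket path, the payload shape, and the use of error code 11 ("try again later") are illustrative assumptions.

```go
// Hypothetical sketch of a shim plugin handling a STATUS request.
package main

import (
	"encoding/json"
	"fmt"
	"io"
	"net"
	"os"
	"time"
)

// cniError mirrors the error object plugins already print on stdout.
type cniError struct {
	CNIVersion string `json:"cniVersion"`
	Code       uint   `json:"code"`
	Msg        string `json:"msg"`
	Details    string `json:"details,omitempty"`
}

const daemonSocket = "/run/example-cni/daemon.sock" // illustrative path

func main() {
	if os.Getenv("CNI_COMMAND") != "STATUS" {
		return // ADD/DEL/CHECK handling elided
	}
	// STATUS would take the network configuration on stdin but carry no
	// container context; this sketch reads and discards it.
	io.Copy(io.Discard, os.Stdin)

	// The plugin is only healthy if its backing daemon answers.
	conn, err := net.DialTimeout("unix", daemonSocket, 2*time.Second)
	if err != nil {
		out, _ := json.Marshal(cniError{
			CNIVersion: "1.1.0",
			Code:       11, // "try again later": a temporary condition
			Msg:        "plugin daemon is not running",
			Details:    err.Error(),
		})
		fmt.Println(string(out))
		os.Exit(1)
	}
	conn.Close()
	// Success: exit 0 with no output.
}
```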

@squeed squeed changed the title Proposal: plugin-wide "STATUS" verb Proposal: "STATUS" verb Sep 8, 2021
@squeed
Member Author

squeed commented Sep 8, 2021

Wow, looks like @danwinship had the same idea in #628.

@squeed
Member Author

squeed commented Sep 8, 2021

Two interesting observations from the CNI meeting:

IP address exhaustion is funny, because DEL will still succeed even if ADD won't. So, a STATUS error of "no available IPs" should not keep the runtime from executing DELs, but should block ADDs.

It would be interesting if there was a way to return a STATUS response on ADD or DEL failure, so the runtime knows that everything is broken without having to also poll STATUS.
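A minimal sketch of that runtime-side asymmetry, assuming a hypothetical decoded STATUS error (the cniStatusError type and the gating function are invented for illustration): a failed STATUS blocks new ADDs but never DELs.

```go
// gateRequests decides which verbs to allow given the last STATUS outcome.
package main

import "log"

// cniStatusError is a stand-in for the decoded CNI error object.
type cniStatusError struct {
	Code uint
	Msg  string
}

func gateRequests(statusErr *cniStatusError) (allowAdd, allowDel bool) {
	if statusErr == nil {
		return true, true
	}
	// An "out of IPs" (or any other) STATUS failure only blocks new ADDs;
	// DELs free resources and must keep flowing.
	log.Printf("STATUS failed (code %d): %s; blocking ADD, allowing DEL",
		statusErr.Code, statusErr.Msg)
	return false, true
}

func main() {
	add, del := gateRequests(&cniStatusError{Code: 11, Msg: "no available IPs"})
	log.Printf("allowAdd=%v allowDel=%v", add, del)
}
```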

@squeed
Member Author

squeed commented Sep 8, 2021

Other considerations:

We need to add some wording around timing. Plugins have to handle ADD even after they return a failure in STATUS (with a reasonable error message). Runtimes should avoid calling ADD if a STATUS has failed.

@mars1024
Member

How can a stateless plugin record status efficiently, without memory or a cache?
If the status is recorded on disk, then what's the difference between ADD and STATUS? I mean, ADD can also reject the request based on status, and the upstream consumers would have no need to change their behavior.

@danwinship
Contributor

Wow, looks like @danwinship had the same idea in #628.

Yeah, but I don't think that's the right answer any more. I mean, yes, CNI should have a way to indicate "I'm out of IPs". But for the case of Kubernetes network plugins, we should just accept that this is a more complicated problem than CNI network plugin readiness, and deal with this at the Kubernetes / CRI level. Eg, kubelet could have a mode in which it decides that the network is ready if and only if there is a host-network pod running on the node with the label "networking.k8s.io/network-plugin" which is Ready. (openshift/origin#24906)

@matthewdupre
Member

I'm not a huge fan of using pod readiness to solely indicate plugin status: we've found it painful to have a pod's readiness depend on the state of other pods, and in some cases a network plugin may need multiple components to be functioning in order to be ready.

@jayunit100
Contributor

jayunit100 commented Sep 15, 2021

These are both valuable but different... I think Dan's solution is orthogonal to this, right? ...

  • Dan's original proposed solution was at the 'node level'. I guess that sort of addresses Matt's question of what node readiness means...
  • Casey's here is stateless, i.e. it's a return value that has more granular info...
    Both seem to have value.

With Dan's proposal, CNI providers can give vendor-specific suggestions/heuristics on the state of things that might not be reflected in individual ADD/DEL calls...

With Casey's proposal, individual runtimes can cascade structured information up the kubelet -> CRI -> CNI API chain of command to be used in a more modular way / broader context (e.g. metrics, monitoring, retry calibration, or whatever).

@danwinship
Contributor

I'm not a huge fan of using pod readiness to solely indicate plugin status: we've found it painful to have a pod's readiness depend on the state of other pods, and in some cases a network plugin may need multiple components to be functioning in order to be ready.

I don't see why it needs to be painful... it seems like something somewhere in the system must know whether or not the network is ready, and you can just expose that via a trivial http server on the "main" pod.

I think dans solution is orthogonal to this, right?

Well, Casey mentioned "They're a shim to a daemon, and that daemon is down." and "Their configuration file is somehow invalid or broken" as use cases, and I'm saying I don't think CNI is the right level to solve those problems (unless people want to implement a CNI 2.0 that is totally different from the current spec).

The CNI spec works well when your plugin is fully installed before anyone needs to use it, and remains fully available until after everyone is done with it. It's really not well designed to handle the case of plugins that are installed after the CNI-consuming system starts up, and which might occasionally go away and then come back (eg due to being upgraded).

@mars1024
Member

mars1024 commented Sep 22, 2021

Well, Casey mentioned "They're a shim to a daemon, and that daemon is down." and "Their configuration file is somehow invalid or broken" as use cases, and I'm saying I don't think CNI is the right level to solve those problems (unless people want to implement a CNI 2.0 that is totally different from the current spec).

The CNI spec works well when your plugin is fully installed before anyone needs to use it, and remains fully available until after everyone is done with it. It's really not well designed to handle the case of plugins that are installed after the CNI-consuming system starts up, and which might occasionally go away and then come back (eg due to being upgraded).

Yes, the problem is not solved by just exposing a "STATUS" verb on plugins; before that, a new CNI 2.0 is needed that runs as a daemon and handles plugin registration, periodic status checks, and dynamic configuration.

Then the "STATUS" verb will work well under the new spec.

@squeed
Member Author

squeed commented Jan 31, 2022

I don't see why it needs to be painful... it seems like something somewhere in the system must know whether or not the network is ready, and you can just expose that via a trivial http server on the "main" pod.

I think that's what's nice about a theoretical STATUS verb. We already have a protocol for runtimes to talk to network plugins - creating a new API just for readiness seems a bit unnecessary. Instead, you can put that logic in your plugin binary. If it makes sense internally for that to be an HTTP call, then great. It doesn't actually require any commitments to the initialization status of your plugin.

I agree that it doesn't cover all possible cases with Kubernetes (i.e., the chicken-egg problem w.r.t. dropping configuration files and binaries), but, honestly, I'm having a hard time thinking of any that aren't covered by the "drop a configuration file on disk" initial gate.

Besides, there could totally be daemonless CNI plugins, especially in a magical IPv6 world (someday soon, I hope). I don't think we want to restrict ourselves to needing a Kubernetes pod to signify readiness.

@squeed
Member Author

squeed commented Jan 31, 2022

One more thought: another advantage of adding a STATUS verb is that it also works with chained plugins.

@danwinship
Contributor

Besides, there could totally be daemonless CNI plugins

My point is that I wasn't talking about CNI plugins in #628, I was talking about Kubernetes networking. At one point in the distant past, when everyone used the stock kube-proxy and NetworkPolicy didn't exist, it was possible to set up Kubernetes networking using a plugin that communicated with the rest of Kubernetes solely via CNI. This is no longer the case, and it never will be again.

So when we have a problem with Kubernetes network plugins that does not apply to CNI plugins such as the ones in containernetworking/plugins/main, then that suggests that CNI is not the right level to try to solve the problem at, because it's not a problem with CNI, it's a problem with Kubernetes networking, of which CNI is just one piece.

(Though again, I agree that CNI totally needs a way to say "I am out of IP addresses".)

@squeed
Member Author

squeed commented Jan 31, 2022

Ah. I see your point.

Another way of framing this issue might be "what are the decisions I might make that are informed by network status?"

Asking that question from the Kubernetes perspective (but remaining higher-level), I can think of a few:

  • Should I send traffic to / via this node? (Scope here is undefined, but may be container-specific or node-wide)
  • Can I start pods?

It seems like we actually need two separate mechanisms to inform these decision points. So, perhaps, the spec wording of STATUS should be very explicit about only covering the latter case.

(As an aside, NotReady on the k8s Node object is horribly overloaded. Is it just for scheduling? Should it inform eviction? It causes nodes to be removed from Cloud LBs, but doesn't affect EndpointSlices. The API we're working with here is just too coarse.)

@danwinship
Contributor

danwinship commented Feb 15, 2022

OK, there are a few pieces to this:

  1. There are certain failure modes which are well-known to implementors but which CNI has no way of expressing to the runtime. (eg, "the plugin is not running", "I'm out of IPs".)

    • This can be addressed, at least in part, with new error codes.
    • Alternatively/additionally, we could add a freeform-but-machine-readable "reason" to CNI errors, as with Kubernetes Conditions, so you could say {"code": 11, "reason": "PluginNotReady", "msg": ..., "details": ...} or {"code": 11, "reason": "OutOfIPs", ...} or whatever, where certain "reason" values would have agreed-upon semantics. (See the sketch after this list.)
  2. All current runtime/CNI communication is in the context of a specific pod, so the runtime has no way to ask the plugin about host-level problems, and plugins have no way to (unambiguously) indicate host-level problems.

    • Especially, at startup time, on a host with no pre-existing pods, the runtime cannot ask the CNI plugins anything without actually trying to create a pod.
    • Adding a new verb is the easiest way to address this. Casey had suggested STATUS, but it seems like it would be better to have a name that explicitly clarifies that it's not a pod-level operation (ie, unlike every other verb) so maybe HOSTSTATUS?
  3. Sometimes a CNI plugin chain is able to know pre-emptively that an ADD would fail because of a host-level problem (eg, you are using host-local IPAM and it has used up all of its IPs), and it would be cool if it could communicate this to the runtime, so that the runtime could prevent pods from being scheduled/started when they're guaranteed to fail.

    • In the host-local IPAM case, the plugin could theoretically indicate to the runtime "I can support exactly 254 simultaneous pods on this host, so just do the math, okay?".
    • Other plugins might not be able to know for certain ahead of time how many pods they'll be able to support, but they might be able to recognize after doing a successful ADD that they have used up all of their resources, at which point they could tell the runtime "the next ADD will fail, unless you do a DEL first". (EDIT: or rather, every ADD/DEL response would include some code indicating "I believe I can add at least one more pod" / "I believe I can't add any more pods" / "I don't know whether or not I can add more pods".)
    • In other cases, the plugin might not actually notice immediately when it is out of resources, but it might be relatively easy for it to check if an ADD would succeed without actually doing an ADD, so in theory the runtime could poll it with HOSTSTATUS to figure out the current state.
    • (But in other cases, testing whether an ADD would fail might be expensive or unreliable: eg, when using dhcp IPAM, there's no way to know if more IPs are available without actually trying to reserve one, and then even if you determine that an IP is available, it's possible that another node will use it first. So in that case the plugin would probably want to tell the runtime "I don't know"...)
  4. (This is starting to get farther away from the original ideas, but it's something else I'm thinking about that's sort of related so...) In general, a host may be able to support only certain numbers of certain kinds of pods. Eg, it may have a certain number of IPs or a certain number of SR-IOV VFs, or more abstractly, you might have some set of pods that declare that they need 1G of guaranteed bandwidth, and a node might declare that it can support 10 of them, but no more.

    • Kubernetes uses node.status.capacity and pod.spec.resources.requests for some of this, but it's not consistent. (eg, resources are used for SR-IOV VFs, but not for pod IP addresses). Also, there is no direct interaction there with CNI. Also, in order for the accounting to work correctly, the user has to request the resources (eg, an SR-IOV VF) themself; there is no way for the CNI plugin to spontaneously claim a resource that the user didn't request (eg, when doing OVN offload when you want all pods to get SR-IOV VFs).
    • See also: Network Profiles, which is something similar that some other people are thinking about.
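As referenced in point 1 above, here is a sketch of what such an error-plus-reason object could look like. The Reason field and its values are hypothetical, taken directly from the examples in that bullet; the code/msg/details fields match the existing CNI error object.

```go
package main

import (
	"encoding/json"
	"fmt"
)

// errorWithReason is a hypothetical extension of the existing CNI error
// object with a Kubernetes-Condition-style "reason" carrying agreed-upon values.
type errorWithReason struct {
	CNIVersion string `json:"cniVersion"`
	Code       uint   `json:"code"`   // e.g. 11 = "try again later"
	Reason     string `json:"reason"` // e.g. "PluginNotReady", "OutOfIPs"
	Msg        string `json:"msg"`
	Details    string `json:"details,omitempty"`
}

func main() {
	out, _ := json.Marshal(errorWithReason{
		CNIVersion: "1.0.0",
		Code:       11,
		Reason:     "OutOfIPs",
		Msg:        "no addresses left in the configured range",
	})
	fmt.Println(string(out))
}
```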

@squeed
Member Author

squeed commented May 4, 2022

FYI, I've just thought of another potential use for Status: detecting that a node's configuration is untenable and reporting that back to the end-user, before any pods are scheduled.

So, what if we were to return a status with multiple fields? Something like

  • Containers can be created
  • Containers can be deleted
  • The node's network is functional

Perhaps answers are optional...
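A purely speculative sketch of what such a multi-field result could look like; the field names are invented here, and pointers make each answer optional (nil meaning "the plugin doesn't know").

```go
package main

import "fmt"

// statusResult is a speculative multi-field STATUS payload.
type statusResult struct {
	CanAdd        *bool `json:"canAdd,omitempty"`        // containers can be created
	CanDel        *bool `json:"canDel,omitempty"`        // containers can be deleted
	NodeNetworkOK *bool `json:"nodeNetworkOK,omitempty"` // the node's network is functional
}

func main() {
	yes, no := true, false
	// Example: the plugin can still clean up, but cannot create new containers;
	// it declines to answer the node-network question.
	s := statusResult{CanAdd: &no, CanDel: &yes}
	fmt.Printf("canAdd=%v canDel=%v nodeNetworkOK=%v\n", *s.CanAdd, *s.CanDel, s.NodeNetworkOK)
}
```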

@squeed
Member Author

squeed commented Nov 7, 2022

Picking this back up, as I keep running into use cases for this.

@danwinship as always, your analysis is spot on. There is a large scope for what "Status" could mean, and we need to leave whatever design we choose open to evolution.

I would really like to add "countable" resource support to CNI, but I think that's better tied in with a larger rethink in CNI 2.0. Unless we can come up with a simple expression that fits in the existing CNI 1.0 model, I'm loath to block any sort of "status" progress.

At the end of the day, "Status" would be intended to be informational; it would not be a violation to see a "failed" status response and immediately call ADD. It would be silly, but allowed.

The two status "flags" we care about are:

  • Can we process ADDs?
  • Are packets (probably) flowing?

Do we think it's useful to also add a third flag, given that DEL should "always" succeed?

  • Can we process DELs?

@squeed squeed added this to the CNI v1.5 milestone Nov 7, 2022
@squeed squeed modified the milestones: CNI v1.5, CNI v1.1 Jan 9, 2023
@aojea

aojea commented Mar 9, 2023

/cc

@aojea

aojea commented Jun 5, 2023

Especially, at startup time, on a host with no pre-existing pods, the runtime cannot ask the CNI plugins anything without actually trying to create a pod.

There is a networkReady condition in the kubelet

https://github.com/kubernetes/kubernetes/blob/34abd36e88e50d913af9413f9353515deb7f09d4/pkg/kubelet/kubelet.go#L2873-L2876

that is part of the CRI API

https://github.com/kubernetes/kubernetes/blob/bba9833c398230f68cd30ff23e22f303e898e6a9/staging/src/k8s.io/cri-api/pkg/apis/runtime/v1/constants.go#L26

You don't need to create Pods. Currently, the networkReady state is based on the existence of a CNI config file, but with the STATUS verb we can have more reliable information... I agree with you that this is a host-level or plugin-level check.
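A sketch of how a CRI runtime might feed a STATUS outcome into that networkReady condition. Here, runtimeCondition just mirrors the shape of the CRI RuntimeCondition message rather than importing the real API, and statusErr stands in for the decoded error from running the plugin with CNI_COMMAND=STATUS.

```go
package main

import (
	"errors"
	"fmt"
)

// runtimeCondition mirrors the shape of the CRI RuntimeCondition message.
type runtimeCondition struct {
	Type    string
	Status  bool
	Reason  string
	Message string
}

// networkReadyCondition maps a CNI STATUS result onto the NetworkReady condition.
func networkReadyCondition(statusErr error) runtimeCondition {
	cond := runtimeCondition{Type: "NetworkReady", Status: true}
	if statusErr != nil {
		cond.Status = false
		cond.Reason = "CNIStatusCheckFailed" // illustrative reason string
		cond.Message = statusErr.Error()
	}
	return cond
}

func main() {
	fmt.Printf("%+v\n", networkReadyCondition(nil))
	fmt.Printf("%+v\n", networkReadyCondition(errors.New("plugin daemon is not running")))
}
```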

@maiqueb

maiqueb commented Jun 5, 2023

@squeed does this issue have any relation to CNI 2.0 ?

@squeed squeed mentioned this issue Jun 19, 2023
@squeed
Member Author

squeed commented Jul 25, 2023

An update from the most recent meeting: plugins need a way to know if they can rely on STATUS. If not, they need to wait until they are ready before writing their CNI configuration file.

This is the same issue as discussed in #927 -- so we will have the discussion there.

@danwinship
Contributor

More discussion of kubelet-startup-network-readiness: kubernetes/kubernetes#120486

@aojea

aojea commented Oct 2, 2023

/assign

@squeed
Member Author

squeed commented Dec 4, 2023

Support for status has merged.

@squeed squeed closed this as completed Dec 4, 2023
@aojea

aojea commented Dec 5, 2023

wohoo

@aojea

aojea commented Dec 5, 2023

Support for status has merged.

@squeed what are the next steps? Implementing it in libcni?

@squeed
Member Author

squeed commented Dec 5, 2023

@aojea the next step is to cut a -rc and implement this in the plugins. Assuming there are no issues found, we can tag v1.1.0 and be done :)

I hope to cut the rc next week.
