API Server (and clients) becomes unresponsive with too many CRDs #2649
Breadcrumbs for this possibly related issue: kubernetes/kubernetes#101755
A few questions:
I don't believe disabling API groups will be a sustainable fix long term - we need to make sure the API server can handle hundreds or thousands of CRDs. It is important that we start a conversation around this upstream ASAP.
Just to add more context from our earlier offline discussions with @ulucinar. I think there are two problems we're facing:
In my opinion, we will need to take action on both problems, but selective API installation is the one we can implement at low cost; it addresses both problems for most users, and we will need it even if we do the pooling. The first problem can always be encountered, even with all the fixes we can think of in place, and users need the freedom to decide how much of that problem they are willing to bear depending on their needs. Then I believe we can work on creation pooling, something like

EDIT: @negz I just saw your new comment after submitting this.
@muvaf I'm not convinced that the second part of what you've mentioned - users have to install more CRDs than they need - is a real problem. If there were no performance downsides, would it have any other meaningful impacts? Presumably the main downside would be 'clutter' in OpenAPI discovery and

If we are indeed prioritising this mostly because it's a quick win for performance while we look into other options, could we do it without making a change to the
Hi @negz, I also think that at some point in time the total number of CRDs in a cluster might become an "official" scalability dimension, i.e., be explicitly mentioned in the scalability thresholds document as a dimension. But we may choose not to act proactively on this, especially because we don't know what such a limit would be and we have no observations of a negative impact on API call latencies (apart from the saturated cases). We don't expect the high number of CRDs in a cluster to affect API server call latency SLIs, as long as we do not saturate the CPU (as was the case in @muvaf's and @turkenh's initial experiences when they tried to install 700+ CRDs from

One issue, though, with implementing a throttling mechanism in the Crossplane package revision controller is that it will only be available for our reconciler; any other Kubernetes clients registering CRDs on top of the hundreds we have already registered with throttling in place will still be susceptible to the discussed issues.

Regarding the controller-runtime client throttling, it looks like what we have so far observed is related to API discovery. My understanding is that

*: Here, I'm talking about an SLI not on the response times of the
We have discussed implementing a filtering mechanism not via the
@ulucinar I agree - can you please raise that issue and reference it here?
Definitely. We should be providing input to this process. We should be advocating for the API server to be able to handle our use cases, rather than waiting for the API machinery folks to tell us what the limit will be. 😄 If it can't natively, we're going to need to fork it and that would not be a great outcome.
That's a fair point - though I imagine we're a bit of an edge case in both the frequency and number of CRDs that we install. i.e. I imagine most other projects would very infrequently be adding <10 CRDs and thus hopefully not be as impacted as e.g. a Crossplane provider. This makes me wonder how impactful the OpenAPI issue will really be based on when folks typically install CRDs. If folks are typically installing CRDs at cluster bootstrapping time this may be an inconvenience but otherwise a non-issue. If folks are installing CRDs on existing clusters that are doing real work it seems like more of a potentially user-impacting issue.
Ah - I was wondering where this configuration lived. Nice find. I suspect tuning the above option will have a big impact. I see that there's a
Ah, sorry I missed that discussion. It's true that we don't want

I would feel comfortable making this feature a fully fledged part of the v1

Let me know if it would be helpful to set up a call to discuss some of this synchronously.
TBH, I had assumed that even if we do not use all those CRDs, having them installed costs us some API server and controller CPU and/or memory. I tested kubectl behavior and it didn't seem like a big deal, so if @ulucinar's experiments show that this isn't the case even at scale, and the only problems thousands of CRDs cause are discovery client throttling and install-time saturation, I can walk back from making this a first-class configuration on the Provider v1 API today. I feel like we will need it for various reasons at some point, because having, say, 1500 CRDs while you use 100 of them will cause some runtime problems down the road, but that doesn't seem to be a problem we have today. So, I'm open to implementing the pooling/throttling in CRD installation of
@negz @hasheddan since you're more involved with upstream folks, do you know how they manage the process of adding new alpha fields on stable APIs?
FWIW, we're planning to roll with a set of default CRDs to be installed in Terrajet providers, so that you get to install the provider with CRDs (roughly 100 to 200 of them) even if you don't provide this config and there is no throttling in the package manager.
@ulucinar, @hasheddan and I met this morning to discuss this synchronously. We're agreed that the primary motivation behind adding this feature now is to alleviate performance issues we've observed when there are hundreds of CRDs installed. In summary, and in order of importance, those issues are:
In addition to these issues, we think it may be useful to be able to selectively install only certain APIs, but we don't yet have any concrete use cases or user feedback to support our intuition.

The worst symptom we've observed is that installing hundreds of CRDs can lock up a

The second symptom we've observed is that the time for a new CRD to become usable increases as more CRDs exist in an API server. We believe this is because the API server recomputes its entire OpenAPI spec each time a new CRD is installed, and recomputing the spec requires processing all CRDs. We've observed that it can take 20 minutes to install a large provider, then ~70 minutes to install a second large provider (where 'large' means having hundreds of CRDs).

The final symptom we've observed is that 'discovery' takes a long time when there are many CRDs installed. API server clients use a process known as discovery to determine what APIs the API server supports. This process involves 'walking the tree' of API server endpoints, which can involve hundreds of requests when hundreds of CRDs are installed. This can trigger client-side rate limiting; REST clients are often capped at (for example) 20rps with 30rps burst. Discovery typically happens at client startup, so in practice this symptom primarily means slower client startup and potentially noisy logs (clients often log when they limit themselves).

We believe there are several possible remediations:
I'd like us to hold off on moving forward with the remediation this issue proposes until we've raised an issue to determine how receptive Kubernetes folks are to this being fixed upstream. I'm also happy for us to move forward with rate limiting CRD installs, since that remediation has less of a direct UX implication (it doesn't require users to configure anything to avoid performance issues). I'm okay moving forward with the remediation this issue proposes if we do find that it's our only option; i.e. that upstream will not fix this issue at all or in a timely fashion and we find that we cannot tolerate the symptoms we're seeing in the meantime.
@muvaf my primary experience with this was assisting with promoting

Also wanted to ask more generally: have we considered just breaking these large providers into smaller, group-based providers? It would require no code changes, just some different flags in the package build process to only include some CRDs, and maybe setting a flag on the entrypoint of the controller image.
Opened an upstream issue to hopefully initiate a discussion here: |
@hasheddan I think if we see that pooling the creations doesn't solve the problem, we'll come back to selective installation and consider both options. There are some caveats to both approaches; for example, we need to handle different
I have done some further experiments to assess the effectiveness of throttling CRD creation by limiting the number of CRDs in the not-established state. The idea is:
Here, batch size is a parameter. I ran 4 experiments in parallel on 4 different GKE clusters; the experiments differ only in their batch sizes. The batch sizes tried were 1, 10, 50 and 100. The total number of CRDs to be registered is 658 in all experiments (CRDs generated for

Even in the most conservative case, where we create a single CRD at a time (batch size 1), I have observed logs similar to:
and in the GKE console I observed that all experiment clusters transitioned into the "repairing" state. It took ~47 min to register all 658 CRDs with batch size 1; this also includes cluster repairing time.

In parallel, I also took a look at how CRDs acquire the

My understanding (although it might be inaccurate or plain wrong) is that a CRD is put into
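For illustration, here is a minimal Go sketch of the batching idea described above - create a batch of CRDs, wait for every CRD in the batch to report the Established condition, then move on to the next batch. This is my own reconstruction under stated assumptions (function names, package name, and the one-second polling interval are arbitrary), not the actual experiment code:

```go
package crdbatch

import (
	"context"
	"time"

	apiextensionsv1 "k8s.io/apiextensions-apiserver/pkg/apis/apiextensions/v1"
	apiextensionsclient "k8s.io/apiextensions-apiserver/pkg/client/clientset/clientset"
	"k8s.io/apimachinery/pkg/api/errors"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
)

// ApplyInBatches creates CRDs batchSize at a time, waiting for every CRD in a
// batch to reach the Established condition before starting the next batch.
func ApplyInBatches(ctx context.Context, cs apiextensionsclient.Interface,
	crds []*apiextensionsv1.CustomResourceDefinition, batchSize int) error {

	for start := 0; start < len(crds); start += batchSize {
		end := start + batchSize
		if end > len(crds) {
			end = len(crds)
		}
		batch := crds[start:end]

		// Create all CRDs in the current batch.
		for _, crd := range batch {
			_, err := cs.ApiextensionsV1().CustomResourceDefinitions().Create(ctx, crd, metav1.CreateOptions{})
			if err != nil && !errors.IsAlreadyExists(err) {
				return err
			}
		}

		// Poll until every CRD in this batch reports Established=True.
		for _, crd := range batch {
			for {
				got, err := cs.ApiextensionsV1().CustomResourceDefinitions().Get(ctx, crd.Name, metav1.GetOptions{})
				if err == nil && isEstablished(got) {
					break
				}
				select {
				case <-ctx.Done():
					return ctx.Err()
				case <-time.After(time.Second):
				}
			}
		}
	}
	return nil
}

// isEstablished reports whether the CRD has the Established condition set to True.
func isEstablished(crd *apiextensionsv1.CustomResourceDefinition) bool {
	for _, c := range crd.Status.Conditions {
		if c.Type == apiextensionsv1.Established && c.Status == apiextensionsv1.ConditionTrue {
			return true
		}
	}
	return false
}
```

With a batch size of 1 this degenerates into fully serial creation, which corresponds to the most conservative experiment described above.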
I'm repeating the |
Both clusters (confirmation for the
Just found an upstream bug tracking the client-side throttling bits of this issue - kubernetes/kubectl#1126. There's a PR open to bump the client-side discovery burst up from 100 to 150rps, but more broadly there's a debate about just removing the limits and letting the API server priority and fairness handle it. |
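To make the discovery cost concrete, here's a minimal client-go sketch (my own illustration, not code from the linked issue or from kubectl) that performs the same kind of discovery walk kubectl does. Each advertised group/version is fetched separately, so hundreds of CRD-backed groups translate into hundreds of requests, and the client-side limiter starts inserting the delays that show up as "client-side throttling" log messages:

```go
package main

import (
	"fmt"
	"log"

	"k8s.io/client-go/discovery"
	"k8s.io/client-go/tools/clientcmd"
)

func main() {
	// Load kubeconfig using the default loading rules, similar to kubectl.
	cfg, err := clientcmd.NewNonInteractiveDeferredLoadingClientConfig(
		clientcmd.NewDefaultClientConfigLoadingRules(),
		&clientcmd.ConfigOverrides{},
	).ClientConfig()
	if err != nil {
		log.Fatal(err)
	}

	dc, err := discovery.NewDiscoveryClientForConfig(cfg)
	if err != nil {
		log.Fatal(err)
	}

	// ServerGroupsAndResources walks every API group/version the server
	// advertises - roughly one request per group/version. With hundreds of
	// CRDs this fans out into hundreds of requests and the client-side
	// rate limiter begins delaying them.
	groups, resources, err := dc.ServerGroupsAndResources()
	if err != nil {
		// Discovery can partially fail; log and continue with what we have.
		log.Printf("partial discovery failure: %v", err)
	}
	fmt.Printf("%d API groups, %d group/version resource lists\n", len(groups), len(resources))
}
```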
Per my comment at kubernetes/kubernetes#105932 (comment) I don't believe that we were really 'crashing' the GKE API server in @ulucinar's tests. Rather what I believe was happening is that the control plane (which is not redundant for zonal clusters) was temporarily going offline to resize to accommodate so many CRDs. Supporting evidence at:
I can't reproduce the issue @ulucinar saw when using a regional GKE cluster or an EKS cluster (both of which have redundant control planes and can thus resize themselves without a temporary control plane outage). |
Hrm - second guessing that now. I'm yet to see a regional cluster go into the repairing state but I'm reliably able to get a (regional, v1.21) GKE cluster to exhibit degraded performance (including returning etcd errors at least once) by applying 2,000 CRDs without any rate limiting. |
Summarizing my tests from today: I tested on an EKS cluster for the first time, and it seems pretty resilient to the issue. You need to get up to around 3,000 CRDs created consecutively before performance issues start to appear, and it will happily accept 2,000 CRDs all applied at once. I'm guessing their control planes are fairly powerful. They are also definitely running multiple API server replicas - in some cases I saw new replicas coming and going during my tests, presumably in response to the increased load.

Unfortunately I also repeated my tests on GKE several times, and could not reproduce the success I had yesterday. Despite managing to get up to ~1,200 CRDs without issue yesterday, today GKE clusters - even regional ones - consistently exhibit various kinds of errors while attempting to connect to the API server using
In all cases I saw all kinds of crazy errors, from etcd leaders changing to the API server reporting that there was no such kind as

In each test I used a different cluster, but one created using the same

I also happened to (accidentally) test creating 4,000 CRDs on a

This all means that unfortunately I'm not feeling confident about there being any way to accommodate installing several very large providers simultaneously - e.g. Terrajet-generated AWS, Azure, and GCP providers would together be around 2,000 CRDs.
It seems like in kubernetes/kubectl#1126 there's consensus that the rate limiting in

I've raised kubernetes-sigs/controller-runtime#1707 to discuss whether a similar change to controllers would make sense. Sounds like they're amenable to the change.
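As a stopgap on the controller side, the QPS/burst a controller-runtime based process uses can already be raised by mutating the `rest.Config` before building the manager; controller-runtime only applies its ~20 QPS / 30 burst defaults when those fields are left at zero. This is a generic sketch of that workaround (the numbers are arbitrary and this is not Crossplane's actual setup code):

```go
package main

import (
	"log"

	ctrl "sigs.k8s.io/controller-runtime"
)

func main() {
	cfg := ctrl.GetConfigOrDie()

	// If left at zero, controller-runtime defaults these to roughly
	// 20 QPS / 30 burst, which is easy to exhaust during discovery when
	// hundreds of CRDs are installed. Raising them trades extra API server
	// load for fewer client-side throttling delays.
	cfg.QPS = 100
	cfg.Burst = 200

	mgr, err := ctrl.NewManager(cfg, ctrl.Options{})
	if err != nil {
		log.Fatal(err)
	}

	if err := mgr.Start(ctrl.SetupSignalHandler()); err != nil {
		log.Fatal(err)
	}
}
```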
Executive Summary
Sadly I'm feeling convinced that at the moment the only way to reliably avoid the API server and

Rate Limiting Experiments

GKE continues to be unable to scale to 2,000 CRDs (regardless of how we rate limit) without becoming unresponsive. API discovery begins to suffer from ~20-second-long client-side rate limiting with as few as 200 CRDs in the cluster. The most successful strategy I've found so far for GKE is batches of 50 CRDs spread 30 seconds apart. EKS, on the other hand, will happily accept 2,000 CRDs applied all at once with no rate limiting at all and no immediately discernible performance degradation, but will then exhibit client-side rate limiting of up to six minutes before some kubectl commands will complete.

In my experience I've noticed that CPU consumption seems to drop off about an hour after a huge number of CRDs are applied; presumably this is how long the API server takes to recompute its OpenAPI schema over and over again. Memory consumption doesn't seem to drop unless the API server process restarts. The API server has a special clause to load many CRDs more efficiently at startup as compared to those same CRDs being added at runtime.

Impending Upstream Fixes

Both of the key upstream issues we're facing (OpenAPI processing and kubectl discovery rate limiting) have PRs open (kubernetes/kube-openapi#251, kubernetes/kubernetes#105520). I would expect these fixes to be merged imminently, but there are no guarantees. If they're candidates for patch releases it's possible fixes will be available within a month per the patch release schedule. I've reached out for clarification on whether the folks working on the issues expect them to be backported and available as patch releases, or whether they'll need to wait until the next minor release.

If the upstream fixes do indeed become available as patch releases, I personally would feel comfortable requiring that Crossplane users be on the latest patch release of a supported version of Kubernetes in order to support large providers. Asking users to update to the latest minor version of Kubernetes seems like a taller order and not something many enterprises would be able to do easily.

Options for Reducing Installed CRDs

Personally I still don't buy that there's any real value in reducing the number of CRDs (used or not) in the system except to work around these performance issues, but it seems like said performance issues alone may force our hand.
Of the two reduction avenues I'm aware of (smaller providers vs filtering what CRDs are enabled for a provider) I support the approach @hasheddan proposed above. It sounds like we'd need to work through a few things technically to make it work though, as @muvaf mentioned. For example:
I feel the benefits of smaller providers (vs filtering) are:
Of course, neither smaller providers nor filtering providers will actually fix the scalability issues - they'll just reduce the likelihood that folks will run into them, so ultimately we are going to need to continue working with upstream Kubernetes to ensure the API server can meet our needs.
The general consensus among the folks working on this (i.e. @ulucinar, @muvaf, @luebken, and myself) is that we're not likely to get these cherry-picked in time for the patch releases that will be released on Nov 17th:
|
Just left a comment in the upstream PR to signal the direction we would like to proceed with:
Quoting @apelisse on Kubernetes Slack:
He's helped us by creating the following kube-openapi branches:
These branches correspond to the kube-openapi commits Kubernetes 1.20, 1.21, and 1.22 are currently using. We will now proceed by:
This will reduce the scope of the k/k cherry-picks back down to only including kubernetes/kube-openapi#251.
The current status is that the scope of changes for release-1.20 and release-1.21 had included a dependency bump, which we had to revert. Now they both have minimal scope, just like the release-1.22 one. The PR that bumps master is merged, meaning the fix is guaranteed to be included in 1.23. The patch release PRs are:
Once we get approved-for-merge for 1.20 and 1.21 as well, we'll ping the release managers to see if they can include them in the respective patch releases.
All the patch release PRs are merged. Now it's guaranteed that the fix will be included in the following releases of Kubernetes:
I think we can close this issue once the kubectl PR kubernetes/kubernetes#106016 is merged as well. |
Here are kind node images built from the master & active release k/k branches containing the lazy marshaling behavior: |
All branches now contain the OpenAPI aggregated spec lazy marshaling changes. Here are
|
The sig-cli folks ended up merging kubernetes/kubernetes#105520, which bumped up rate limits quite a lot, rather than disabling client-side rate limiting. Specifically:
|
We may want to leave this issue open to track further upstream improvements; notably there's still an appetite to remove the client-side rate limiting (on discovery at least) in kubectl and also to improve how discovery is cached. |
@negz I think we should close this issue, as the problems it describes are fixed, i.e. clusters and clients no longer become unresponsive with too many CRDs. If there are other problems, a new, more specific issue could make more sense.
I'm closing this now. Feel free to re-open if you experience the described problems in the released versions.
@muvaf I just tested the newly merged client update with the increased burst and QPS that @negz referred to, but I still ran into client-side throttling - and that on a cluster with <300 CRDs.
@jonnylangefeld thanks for reporting! Either client-side or server-side throttling is expected during the cache calls, but the part we were interested in fixing was the client becoming unresponsive, i.e. 6 minutes of waiting for a simple
A simple
In the other post I describe how
I did a bit of digging and found out why the original upstream PR kubernetes/kubernetes#105520 didn't work: kubernetes/kubernetes#105520 (comment). Maybe give my fix a thumbs up to make

It's only one step on the way to at least not always running into rate limits. To actually optionally disable the discovery cache, I opened a separate issue: kubernetes/kubernetes#107130.
This is a real blocker for us. We're on 1.22.4, so applying the ~650 CRDs of provider-jet-azure is quite fast with the lazy marshaling change. On the client side, however, it's not just kubectl that is quite unusable with the throttling, but also other API clients like management dashboards (we're using Rancher), which get extremely slow. Can we maybe revisit the idea of reducing installed CRDs?
The fix I posted in the comment below will make it into kubectl 1.24, so that’ll be a slight improvement for kubectl. Any other client is still affected. Basically anything using client-go. Unfortunately a real solution can only be achieved through a server change as well. |
Hey there,

Very aware of lots of problems with discovery, the large number of requests, and how the cache is not invalidated properly.

Thanks
Hi @apelisse, |
@apelisse We're happy to help if we can. Is there somewhere we can find your thoughts so far? |
What problem are you facing?
As part of the ongoing Terrajet-based providers effort, we have observed that registering 100s of CRDs has some performance impacts on the cluster. Observations from two sets of experiments are described here and here. As discussed in the results of the first experiment, the Kubernetes scalability thresholds document currently does not consider the number of CRDs (per cluster) as a dimension. However, sig-api-machinery folks suggest a maximum limit of 500 CRDs, and the reason for this suggested limit is not API call latency SLOs but rather, as we also identified in our experiments, the latency in OpenAPI spec publishing. As the results of the second experiment demonstrate, the marginal cost of adding a new CRD increases as more and more CRDs exist in the cluster.

Although not officially considered a scalability dimension yet, it looks like we need to be careful about the number of CRDs we install in a cluster. With Terrajet-based providers we would like to be able to ship 100s of CRDs per provider package. Currently, for the initial releases of these providers, we are including only a small subset (less than 10% of the total count) of all the managed resources we can generate. We would like to be able to ship all supported resources in a provider package.
How could Crossplane help solve your problem?
As part of the `v1.Provider`, `v1.ProviderRevision`, `v1.Configuration`, and `v1.ConfigurationRevision` specs, we could have a new configuration field that specifies the GVKs of the package objects to be installed onto the cluster. To remain backwards-compatible, if this new API is not used, all objects defined in the package manifest get installed. If the new field has a non-zero value, only the selected objects get installed. To make the UX around installing packages with the new configuration field easier, we could also add new options to the `install provider` and `install configuration` commands of the Crossplane `kubectl` plugin.

As described in the experiment results, when a large number of CRDs are registered in a relatively short period of time, the background task that prepares the OpenAPI spec published from the `/openapi/v2` endpoint may cause a spike in the API server's CPU utilization, and this may saturate the CPU resources allocated to the API server. Similar to what the controller-runtime client does, we may also consider implementing a throttling mechanism in the package revision controller to prevent this. However, because of the reasons discussed above and in the experiment results (especially the latency introduced in OpenAPI spec publishing), it looks like we will need a complementary mechanism like the one suggested above in addition to a throttling implementation in the package revision reconciler.
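To make the proposal a bit more tangible, here is a purely hypothetical Go sketch of what such a field could look like on the package spec types. The field name `packageContents`, the selector shape, and the wildcard semantics are all my own illustration, not an agreed design:

```go
// Hypothetical sketch only - not the actual Crossplane v1 types.
package v1

// GVKSelector identifies a set of package objects by group, version and kind.
// In this sketch, empty fields act as wildcards.
type GVKSelector struct {
	Group   string `json:"group,omitempty"`
	Version string `json:"version,omitempty"`
	Kind    string `json:"kind,omitempty"`
}

// PackageSpec is a trimmed-down stand-in for the existing Provider/Configuration
// spec, extended with the proposed selective-installation field.
type PackageSpec struct {
	// Package is the OCI image reference of the package to install.
	Package string `json:"package"`

	// PackageContents, when non-empty, restricts installation to objects in
	// the package that match at least one selector. When omitted, every
	// object in the package manifest is installed, which preserves today's
	// backwards-compatible behaviour.
	// +optional
	PackageContents []GVKSelector `json:"packageContents,omitempty"`
}
```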