API Server (and clients) becomes unresponsive with too many CRDs #2649

Closed
ulucinar opened this issue Oct 21, 2021 · 53 comments

@ulucinar
Contributor

ulucinar commented Oct 21, 2021

What problem are you facing?

As part of the ongoing Terrajet-based providers effort, we have observed that registering hundreds of CRDs has performance impacts on the cluster. Observations from two sets of experiments are described here and here. As discussed in the results of the first experiment, the Kubernetes scalability thresholds document currently does not consider the number of CRDs per cluster as a dimension. However, sig-api-machinery folks suggest a maximum limit of 500 CRDs. The reason for this suggested limit is not API call latency SLOs but rather, as we also identified in our experiments, the latency in OpenAPI spec publishing. As the results of the second experiment demonstrate, the marginal cost of adding a new CRD increases as more and more CRDs exist in the cluster.

Although not yet officially considered a scalability dimension, it looks like we need to be careful about the number of CRDs we install in a cluster. With Terrajet-based providers, we would like to be able to ship hundreds of CRDs per provider package. Currently, for the initial releases of these providers, we are including only a small subset (less than 10% of the total count) of all the managed resources we can generate. We would like to be able to ship all supported resources in a provider package.

How could Crossplane help solve your problem?

As part of the v1.Provider, v1.ProviderRevision, v1.Configuration and v1.ConfigurationRevision specs, we could have a new configuration field that specifies which GVKs of the package objects should be installed onto the cluster. To remain backwards compatible, if this new API is not used, all objects defined in the package manifest get installed. If the new field has a non-zero value, only the selected objects get installed. In order to make the UX around installing packages using the new configuration field easier, we can also add new options for the install provider and install configuration commands of the Crossplane kubectl plugin.
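
For illustration only, here is a minimal sketch (in Go, since the package APIs are Go types) of what such a filtering field might look like. The field name and shape are hypothetical, not an agreed design:

// Hypothetical sketch: a filter that could be embedded in the package spec
// types (e.g. v1.Provider and v1.Configuration). Names are illustrative only.
type PackageObjectFilter struct {
	// APIGroups selects which API groups from the package should be
	// installed. Leaving it empty (the zero value) keeps today's
	// behaviour: every object in the package manifest gets installed.
	APIGroups []string `json:"apiGroups,omitempty"`
}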

As described in the experiment results, when a large number of CRDs is registered in a relatively short period of time, the background task that prepares the OpenAPI spec to be published from the /openapi/v2 endpoint may cause a spike in the API server's CPU utilization, and this may saturate the CPU resources allocated to the API server. Similar to what the controller-runtime client does, we may also consider implementing a throttling mechanism in the package revision controller to prevent this. However, because of the reasons discussed above and in the experiment results (especially the latency introduced in OpenAPI spec publishing), it looks like we will need a complementary mechanism like the one suggested above in addition to a throttling implementation in the package revision reconciler.

@ulucinar ulucinar added the enhancement label Oct 21, 2021
@negz
Member

negz commented Oct 21, 2021

Breadcrumbs for this possibly related issue: kubernetes/kubernetes#101755

@negz
Member

negz commented Oct 22, 2021

A few questions:

  1. Do we think this is the same issue as kube-apiserver memory consumption during CRD creation kubernetes/kubernetes#101755?
  2. If not, have we raised an upstream issue tracking what we're seeing?
  3. I've previously been told we'd address the issues we were seeing by rate limiting CRD installs - not filtering them. Why are we not taking that approach?
  4. I've heard out of band that we're also seeing issues with client side throttling (probably in REST mapper discovery) when there are too many CRDs. Is this tracked anywhere? Could it be covered by this issue?

I don't believe disabling API groups will be a sustainable fix long term - we need to make sure the API server can handle hundreds or thousands of CRDs. It is important that we start a conversation around this upstream ASAP.

@muvaf
Member

muvaf commented Oct 22, 2021

Just to add more context from our earlier offline discussions with @ulucinar . I think there are two problems we're facing:

  • Having more than 500 CRDs in a cluster causes problems during installation, even breaking some clusters (like GKE) for a while, as well as at runtime by making the discovery client make thousands of calls.
    • One way or another, even without Terrajet providers, it's very easy to run into this by installing more than one Crossplane provider.
    • Even though it's not suggested to go over 500, I believe we should make as much effort as we can on our side to be able to manage that many CRDs, i.e. creation pooling in the package manager, and follow up upstream to see whether there could be a way to improve things.
  • Users have to install more than 500 CRDs even if they need only, say, 50.
    • We have many users running provider-aws in production, which has under 100 CRDs. If they wanted to use the Terrajet-based provider, they would have to install 700 more CRDs and bear the cost.
    • Even if we can install that many CRDs, it's not a good UX to force all of them to be installed, because there are implications not only at installation time but also for client behavior, like the discovery client making thousands of calls to find a kind unnecessarily.

In my opinion, we will need to take action on both problems, but selective API installation is the one we can implement with low cost: it can address both problems for most users, and we will need it even if we do the pooling. The first problem can always be encountered, even with all the fixes we can think of in place, and users need the freedom to decide how much of that problem they would like to bear depending on their needs.

Then I believe we can work on creation pooling, something like: list all CRDs and make sure no more than 10 are in the "establishing" state; if fewer, create the next CRD. In the long run, apiserver changes regarding OpenAPI spec publishing and possibly discovery client improvements will make things better. Nevertheless, I believe we'll still need both of these mechanisms in Crossplane, because in the Crossplane ecosystem having thousands of CRDs will not be uncommon anymore, so all of these solutions will help to some extent.
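
A rough sketch of that pooling idea, using the apiextensions clientset (illustrative only; the threshold and error handling are assumptions, not an agreed design):

package crdinstall

import (
	"context"
	"time"

	apiextensionsv1 "k8s.io/apiextensions-apiserver/pkg/apis/apiextensions/v1"
	clientset "k8s.io/apiextensions-apiserver/pkg/client/clientset/clientset"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
)

const maxEstablishing = 10 // illustrative threshold

// isEstablished reports whether a CRD has the Established=True condition.
func isEstablished(crd *apiextensionsv1.CustomResourceDefinition) bool {
	for _, c := range crd.Status.Conditions {
		if c.Type == apiextensionsv1.Established && c.Status == apiextensionsv1.ConditionTrue {
			return true
		}
	}
	return false
}

// createThrottled creates CRDs one by one, pausing while too many of the
// CRDs already in the cluster have not yet become Established.
func createThrottled(ctx context.Context, c clientset.Interface, crds []*apiextensionsv1.CustomResourceDefinition) error {
	for _, crd := range crds {
		for {
			existing, err := c.ApiextensionsV1().CustomResourceDefinitions().List(ctx, metav1.ListOptions{})
			if err != nil {
				return err
			}
			pending := 0
			for i := range existing.Items {
				if !isEstablished(&existing.Items[i]) {
					pending++
				}
			}
			if pending < maxEstablishing {
				break
			}
			time.Sleep(time.Second)
		}
		if _, err := c.ApiextensionsV1().CustomResourceDefinitions().Create(ctx, crd, metav1.CreateOptions{}); err != nil {
			return err
		}
	}
	return nil
}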


EDIT: @negz I just saw your new comment after submitting this.

@negz
Member

negz commented Oct 23, 2021

@muvaf I'm not convinced the second part of what you've mentioned - users have to install more CRDs than they need - is a real problem. If there were no performance downsides would it have any other meaningful impacts?

Presumably the main downside would be 'clutter' in OpenAPI discovery and kubectl get crds, but that seems like a very minor concern. Especially since if we do start having folks filter the set of installed CRDs those folks are presumably going to have to remember to go update their provider as they begin to need API groups that they didn't think they would before.

If we are indeed prioritising this mostly because it's a quick win for performance while we look into other options, could we do it without making a change to the Provider API? e.g. Add the flag to provider binaries and advise folks to use a ControllerConfig to set that flag if they're hitting performance issues, until we get long term fixes in place?

@ulucinar
Contributor Author

ulucinar commented Oct 25, 2021

Hi @negz,
Regarding points 1 and 2 above, I believe it makes sense to bring the high CPU utilization issue to the attention of the community in addition to the memory issue. I have found this discussion, which looks related to the case that concerns us. We can also get a better understanding of whether the sig-api-machinery and/or sig-scalability folks have any planned limits on the number of CRDs in a cluster, and whether they are actually planning to define this as a scalability threshold. And if so, what order of magnitude the prospective limit will be, e.g., will it really be something close to 500 as mentioned here? It may very well turn out to be a much higher limit, as my understanding is that the real concern they have, when mentioning a max 500 CRD limit as a scalability target for GA, is the cost of OpenAPI spec publishing. If they are planning to define any SLI, for example for the /openapi/v2 endpoint*, it may also impose a limit on the number of CRDs in a cluster. So my question is: why do they mention a max limit of 500 CRDs per cluster as a scalability target? I think we need some clarification around this.

I also think that at some point in time, total number of CRDs in a cluster might become an "official" scalability dimension, i.e., be explicitly mentioned in the scalability thresholds document as a dimension. But we may choose not to act proactively about this, especially because we don't know what such a limit would be and we have no observations of a negative impact on API call latencies (apart from the saturated cases).

We don't expect the high number of CRDs in a cluster to affect API server call latency SLIs, as long as we do not saturate the CPU (as was the case in @muvaf's and @turkenh's initial experiences when they tried to install 700+ CRDs from provider-tf-aws) or the memory. But for the saturation issue, we have an alternative solution as mentioned above, i.e., implementing throttling on our side.

One issue, though, with implementing a throttling mechanism in the Crossplane package revision controller is that it will only apply to our reconciler; any other Kubernetes clients registering CRDs on top of the hundreds we have already registered with throttling in place will still be susceptible to the discussed issues.

Regarding the controller-runtime client throttling, it looks like what we have observed so far is related to API discovery. My understanding is that kubectl (v1.21.x) configures a hardcoded cache TTL of 10m here for its file system cache. And for certain commands, like api-resources and api-versions, the discovery client cache is also explicitly invalidated.

*: Here I'm not talking about an SLI on the response times of the /openapi/v2 endpoint, but rather something like an SLI that captures the interval from when the API server accepts a new CRD with an OpenAPI schema to the time the aggregated OpenAPI spec published from that endpoint contains the new CRD's schema.

@ulucinar
Contributor Author

If we are indeed prioritising this mostly because it's a quick win for performance while we look into other options, could we do it without making a change to the Provider API? e.g. Add the flag to provider binaries and advise folks to use a ControllerConfig to set that flag if they're hitting performance issues, until we get long term fixes in place?

We have discussed implementing a filtering mechanism not via the Provider.pkg API but maybe via the ControllerConfig API with @muvaf and @hasheddan, but my understanding is that we do not want to treat ControllerConfig as a catch-all API. That has been the reasoning behind extending the Provider.pkg API.

@negz
Member

negz commented Oct 25, 2021

I believe it makes sense to bring the high CPU utilization issue to the attention of the community in addition to the memory issue.

@ulucinar I agree - can you please raise that issue and reference it here?

We can also get a better understanding of whether the sig-api-machinery and/or sig-scalability folks have any planned limits on the number of CRDs in a cluster, and whether they are actually planning to define this as a scalability threshold.

Definitely. We should be providing input to this process. We should be advocating for the API server to be able to handle our use cases, rather than waiting for the API machinery folks to tell us what the limit will be. 😄 If it can't natively, we're going to need to fork it and that would not be a great outcome.

One issue, though, with implementing a throttling mechanism in the Crossplane package revision controller is that it will only apply to our reconciler; any other Kubernetes clients registering CRDs on top of the hundreds we have already registered with throttling in place will still be susceptible to the discussed issues.

That's a fair point - though I imagine we're a bit of an edge case in both the frequency and number of CRDs that we install. i.e. I imagine most other projects would very infrequently be adding <10 CRDs and thus hopefully not be as impacted as e.g. a Crossplane provider.

This makes me wonder how impactful the OpenAPI issue will really be based on when folks typically install CRDs. If folks are typically installing CRDs at cluster bootstrapping time this may be an inconvenience but otherwise a non-issue. If folks are installing CRDs on existing clusters that are doing real work it seems like more of a potentially user-impacting issue.

Regarding the controller-runtime client throttling, it looks like what we have observed so far is related to API discovery. My understanding is that kubectl (v1.21.x) configures a hardcoded cache TTL of 10m here for its file system cache. And for certain commands, like api-resources and api-versions, the discovery client cache is also explicitly invalidated.

https://github.com/kubernetes/kubernetes/blob/0fb71846/staging/src/k8s.io/cli-runtime/pkg/genericclioptions/config_flags.go#L111

Ah - I was wondering where this configuration lived. Nice find. I suspect tuning the above option will have a big impact. I see that there's a WithDiscoveryBurst option, but I'm unsure whether that's exposed anywhere (e.g. as a flag). Perhaps we can make a case for simply bumping the default burst up from 100 upstream? Of course this would only affect kubectl, but perhaps if we could have kubectl and controller-runtime increase their default discovery client rate limits we'd be fixing the problem for 80% of clients?
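
As a concrete illustration of what raising those client-side limits looks like for a client-go consumer today, here is a minimal Go sketch; the numbers and helper are assumptions for the sketch, not settled defaults:

package discoverytuning

import (
	"k8s.io/client-go/discovery"
	ctrl "sigs.k8s.io/controller-runtime"
)

// newDiscoveryClient builds a discovery client with raised client-side rate
// limits, so that discovery against a cluster with hundreds of CRDs is less
// likely to self-throttle.
func newDiscoveryClient() (*discovery.DiscoveryClient, error) {
	cfg, err := ctrl.GetConfig()
	if err != nil {
		return nil, err
	}
	cfg.QPS = 50    // client-go's default is much lower
	cfg.Burst = 300 // allow discovery's burst of requests through
	return discovery.NewDiscoveryClientForConfig(cfg)
}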

We have discussed implementing a filtering mechanism not via the Provider.pkg API but maybe via the ControllerConfig API with @muvaf and @hasheddan, but my understanding is that we do not want to treat ControllerConfig as a catch-all API. That has been the reasoning behind extending the Provider.pkg API.

Ah, sorry I missed that discussion. It's true that we don't want ControllerConfig to become a catch-all, but that doesn't rule it out for this feature in my mind.

I would feel comfortable making this feature a fully fledged part of the v1 Provider API if we were confident it was a long term thing that many folks would be choosing to do, but I'm not convinced that's the case. It seems to me that filtering the set of installed API groups is the quickest fix we have available to a problem we're seeing, but perhaps not the best fix long term. Put otherwise, would we still make this change if we could fix the API server performance issues and client rate limiting upstream? If the answer to that question is 'no' or 'probably not', then I would feel more comfortable with this not being a feature we had to commit to at v1. Using ControllerConfig (at least for now) is a good compromise in my mind; we can unblock ourselves while only committing to this as a v1alpha1 feature.

Let me know if it would be helpful to set up a call to discuss some of this synchronously.

@muvaf
Member

muvaf commented Oct 25, 2021

We don't expect the high number of CRDs in a cluster to affect API server call latency SLIs, as long as we do not saturate the CPU (as was the case in @muvaf's and @turkenh's initial experiences when they tried to install 700+ CRDs from provider-tf-aws) or the memory. But for the saturation issue, we have an alternative solution as mentioned above, i.e., implementing throttling on our side.

TBH, I had assumed that even if we do not use all those CRDs, having them installed costs us some apiserver and controller CPU and/or memory. I tested kubectl behavior and it didn't seem like a big deal. If @ulucinar's experiments show that this isn't the case even at scale, and the only problems thousands of CRDs cause are discovery client throttling and install-time saturation, I can walk back from having this be a first-class configuration on the Provider v1 API today.

I feel like we will need it for various reasons at some point, because having, say, 1,500 CRDs while you use 100 of them will cause some problems at runtime down the road, but it doesn't seem like a problem we have today. So I'm open to either implementing the pooling/throttling in the CRD installation of the PackageRevision controller today instead of this feature, OR merging this feature as alpha in some way (discussed below) and then, once we have the pooling and don't see any problems at runtime, removing it. Leaning towards the second option.

Using ControllerConfig (at least for now) is a good compromise in my mind; we can unblock ourselves while only committing to this as a v1alpha1 feature.

@negz ControllerConfig is tailored around affecting only the controller, but this list affects the CRDs that get applied too, so I'm not sure it really fits there. It seems like the only concern about where this will live at this point is API versioning. I wonder what upstream folks do when they add a new field to Pod, for example. I think I've seen usage of a special annotation first, like alpha.pkg.crossplane.io/api-groups: "*.ec2.tf.aws.crossplane.io,<some regex>,<another regex>", which is something we used to have in provider-aws for endpoint configuration and I quite liked it. In some cases feature flags are used, but I'm not sure if that's applied to schema changes.

@hasheddan since you're more involved with upstream folks, do you know how they manage the process of adding new alpha fields on stable APIs?

@muvaf
Member

muvaf commented Oct 26, 2021

FWIW, we're planning to roll with a set of default CRDs to be installed in Terrajet providers, so that you get a provider with a manageable number of CRDs (roughly 100 to 200) even if you don't provide this config and there is no throttling in the package manager.

@negz
Member

negz commented Oct 26, 2021

@ulucinar, @hasheddan and I met this morning to discuss this synchronously.

We're agreed that the primary motivation behind adding this feature now is to alleviate performance issues we've observed when there are hundreds of CRDs installed. In summary, and in order of importance, those issues are:

  1. Processing the OpenAPI spec of many new CRDs at once can starve the API server, making it unresponsive.
  2. The more CRDs exist, the longer it takes a new CRD to become 'established' (usable) due to OpenAPI processing.
  3. When many CRDs exist, API server clients can take longer to perform discovery.

In addition to these issues, we think it may be useful to be able to selectively install only certain APIs, but we don't yet have any concrete use cases or user feedback to support our intuition.

The worst symptom we've observed is that installing hundreds of CRDs can lock up a kind cluster. This happens due to CPU starvation - recomputing the OpenAPI spec prevents the API server from handling requests. So far we've only seen this happen on kind on Intel Macs - notably M1 Macs manage to process the CRDs without noticeable performance degradation. @ulucinar is currently testing whether 'production' API servers (e.g. GKE) are affected.

The second symptom we've observed is that the time for a new CRD to become usable increases as more CRDs exist in an API server. We believe this is because the API server recomputes its entire OpenAPI spec each time a new CRD is installed, and recomputing the spec requires processing all CRDs. We've observed that it can take 20 minutes to install a large provider, then ~70 minutes to install a second large provider (where 'large' means having hundreds of CRDs).

The final symptom we've observed is that 'discovery' takes a long time when there are many CRDs installed. API server clients use a process known as discovery to determine what APIs the API server supports. This process involves 'walking the tree' of API server endpoints, which can involve hundreds of requests when hundreds of CRDs are installed. This can trigger client-side rate limiting; REST clients are often capped at (for example) 20rps with 30rps burst. Discovery typically happens at client startup, so in practice this symptom primarily means slower client startup and potentially noisy logs (clients often log when they limit themselves).
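
To make the 'walking the tree' part concrete, here is a small client-go sketch (illustrative only; names are assumptions) of the discovery pass a client performs - with non-aggregated discovery, each group/version in the result corresponds to at least one HTTP request:

package discoverywalk

import (
	"fmt"

	"k8s.io/client-go/discovery"
	"k8s.io/client-go/tools/clientcmd"
)

// countDiscoveryWork performs a full discovery pass and prints how many API
// groups and group/version resource lists were fetched from the server.
func countDiscoveryWork(kubeconfig string) error {
	cfg, err := clientcmd.BuildConfigFromFlags("", kubeconfig)
	if err != nil {
		return err
	}
	dc, err := discovery.NewDiscoveryClientForConfig(cfg)
	if err != nil {
		return err
	}
	groups, resourceLists, err := dc.ServerGroupsAndResources()
	if err != nil {
		return err
	}
	fmt.Printf("%d API groups, %d group/version resource lists fetched\n", len(groups), len(resourceLists))
	return nil
}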

We believe there are several possible remediations:

  1. What this issue proposes. Make it possible to filter the number of CRDs installed to those that are needed.
  2. Limit the rate at which CRDs are installed - this will only alleviate the saturation issue.
  3. Work with upstream Kubernetes to improve the performance (or at least fairness) of the API server.
  4. Work with upstream Kubernetes to increase discovery rate limits (and increase them in our clients).

I'd like us to hold off on moving forward with the remediation this issue proposes until we've raised an issue to determine how receptive Kubernetes folks are to this being fixed upstream. I'm also happy for us to move forward with rate limiting CRD installs, since that remediation has less of a direct UX implication (it doesn't require users to configure anything to avoid performance issues).

I'm okay moving forward with the remediation this issue proposes if we do find that it's our only option; i.e. that upstream will not fix this issue at all or in a timely fashion and we find that we cannot tolerate the symptoms we're seeing in the mean time.

@negz
Member

negz commented Oct 26, 2021

@ulucinar is currently testing whether 'production' API servers (e.g. GKE) are affected.

I'll let @ulucinar provide more details after he has slept, but in short - they are. We're finding that installing ~680 CRDs at once reliably takes a GKE control plane offline.

@hasheddan
Member

@hasheddan since you're more involved with upstream folks, do you know how they manage the process of adding new alpha fields on stable APIs?

@muvaf my primary experience with this was assisting with promoting seccompProfile fields from alpha to stable. You can see the change we made in this section of the v1.20 docs. In short, annotations are typically used for alpha features, which could theoretically be used for the case discussed in this issue as well.

Also wanted to ask more generally, have we considered just breaking these large providers into smaller, group-based providers? It would require no code changes, just some different flags in the package build process to only include some CRDs and maybe set some flag on the entrypoint of the controller image.

@ulucinar
Contributor Author

Opened an upstream issue to hopefully initiate a discussion here:
kubernetes/kubernetes#105932

@muvaf
Member

muvaf commented Oct 27, 2021

Also wanted to ask more generally, have we considered just breaking these large providers into smaller, group-based providers?

@hasheddan I think if we see that pooling the creations doesn't solve the problem, we'll come back to selective installation and consider both options. There are some caveats to both approaches; for example, we need to handle different Providers installing the same ProviderConfig CRD. And in both cases, I believe we'll need throttling if the user decides that they actually need 1,000 CRDs, whether filtered or installed as different providers under one Configuration.

@ulucinar
Contributor Author

ulucinar commented Oct 27, 2021

I have done some further experiments to assess the effectiveness of throttling CRD creation by limiting the number of CRDs in a not-yet-established state. The idea is:

  1. Create a batch of CRDs
  2. Wait until all CRDs in that batch acquire the Established condition
  3. Repeat with the next batch of CRDs

Here, batch size is a parameter. I have done 4 experiments in parallel on 4 different GKE clusters. The experiments differ only in their batch sizes. The batch sizes tried are 1, 10, 50 and 100. The total number of CRDs to be registered is 658 in all experiments (the CRDs generated for provider-tf-azure).

Even in the most conservative case where we create a single CRD (batch size 1) at a time, I have observed logs similar to:

...
Applying index 315
customresourcedefinition.apiextensions.k8s.io/kubernetesclusternodepools.kubernetes.azure2.tf.crossplane.io created
Checking kubernetesclusternodepools.kubernetes.azure2.tf.crossplane.io
customresourcedefinition.apiextensions.k8s.io/kubernetesclusternodepools.kubernetes.azure2.tf.crossplane.io condition met
>Establish for kubernetesclusternodepools.kubernetes.azure2.tf.crossplane.io, start=Wed Oct 27 16:40:52 +03 2021, end=Wed Oct 27 16:40:53 +03 2021
>Batch (315, 316) started at: Wed Oct 27 16:40:51 +03 2021, ended at: Wed Oct 27 16:40:53 +03 2021
Applying index 316
The connection to the server ... was refused - did you specify the right host or port?
Retrying...
The connection to the server ... was refused - did you specify the right host or port?
Retrying...
The connection to the server ... was refused - did you specify the right host or port?
Retrying...
...

and in the GKE console I have observed that all experiment clusters transitioned into "repairing" state.

It took ~47 min to register all of the 658 CRDs with batch size 1. This also includes cluster repairing time.

In parallel, I also took a look at how CRDs acquire the Established condition. Although I did not really delve into its details, it looks like it's not related to OpenAPI spec publishing.

My understanding (although it might be inaccurate or plain wrong) is that a CRD is put into the EstablishingController's queue once its name is accepted. The EstablishingController then processes items from its rate-limited queue, adding the Established condition:
https://github.com/kubernetes/kubernetes/blob/0fb71846df9babb6012a7fce22e2533e9d795baa/staging/src/k8s.io/apiextensions-apiserver/pkg/controller/establish/establishing_controller.go#L131

@ulucinar
Contributor Author

I'm repeating the batch_size=1 experiment and in parallel conducting a new experiment in which we do not check the Established condition but instead just sleep 1 second before create requests.

@ulucinar
Contributor Author

I'm repeating the batch_size=1 experiment and in parallel conducting a new experiment in which we do not check the Established condition but instead just sleep 1 second before create requests.

Both clusters (the batch_size=1 confirmation run and the sleep-based run) have transitioned into the "Reconciling" (repairing) state:

alper-scaling-test-cluster-batch1  us-east1-c  1.20.10-gke.301  35.196.175.224  e2-standard-4  1.20.10-gke.301  3          RECONCILING
alper-scaling-test-cluster-sleep1  us-east1-c  1.20.10-gke.301  35.243.172.198  e2-standard-4  1.20.10-gke.301  3          RECONCILING

@negz negz changed the title Support Selective Processing of Crossplane Package Objects API Server becomes unresponsive when too many CRDs are installed Oct 27, 2021
@negz
Member

negz commented Oct 28, 2021

Just found an upstream bug tracking the client-side throttling bits of this issue - kubernetes/kubectl#1126. There's a PR open to bump the client-side discovery burst up from 100 to 150rps, but more broadly there's a debate about just removing the limits and letting the API server priority and fairness handle it.

@negz
Member

negz commented Oct 29, 2021

Per my comment at kubernetes/kubernetes#105932 (comment) I don't believe that we were really 'crashing' the GKE API server in @ulucinar's tests. Rather what I believe was happening is that the control plane (which is not redundant for zonal clusters) was temporarily going offline to resize to accommodate so many CRDs. Supporting evidence at:

  • I can't reproduce the issue @ulucinar saw when using a regional GKE cluster or an EKS cluster (both of which have redundant control planes and can thus resize themselves without a temporary control plane outage).

@negz
Member

negz commented Oct 29, 2021

Per my comment at kubernetes/kubernetes#105932 (comment) I don't believe that we were really 'crashing' the GKE API server in @ulucinar's tests.

Hrm - second guessing that now. I'm yet to see a regional cluster go into the repairing state but I'm reliably able to get a (regional, v1.21) GKE cluster to exhibit degraded performance (including returning etcd errors at least once) by applying 2,000 CRDs without any rate limiting.

@negz
Member

negz commented Oct 29, 2021

Summarizing my tests from today:

I tested on an EKS cluster for the first time, and it seems pretty resilient to the issue. You need to get up to around 3,000 CRDs created consecutively before it starts to see performance issues, and it will happily accept 2,000 CRDs all applied at once. I'm guessing their control planes are fairly powerful. They also are definitely running multiple API server replicas - in some cases I saw new replicas coming and going during my tests presumably in response to the increased load.

Unfortunately I also repeated my tests on GKE several times, and could not reproduce the success I had yesterday. Despite managing to get up to ~1,200 CRDs without issue yesterday, today GKE clusters - even regional ones - consistently exhibit various different kinds of errors while attempting to connect to the API server using kubectl after around 500 CRDs are created. I've tried three times today:

  • Once I attempted to create 2,000 CRDs without any rate limiting.
  • Twice I attempted to create 2,000 CRDs in batches of 100 with 60 second pauses between batches.

In all cases I saw all kinds of crazy errors, from etcd leaders changing to the API server reporting that there was no such kind as CustomResourceDefinition, to HTTP 401 Unauthorized despite my credentials working on subsequent requests. In some cases the clusters went into the 'repairing' state while I was seeing the errors, in others they did not (as far as I noticed, at least).

In each test I used a different cluster, but one created using the same Composition as the cluster I tested on yesterday. The only thing I can think of that was different between the cluster I tested on yesterday and the ones I created today was that yesterday's cluster was running a few more pods (e.g. Crossplane), and was older having been online for 2-3 weeks.

I also happened to (accidentally) test creating 4,000 CRDs on a kind cluster running on my e2-custom-4-16640 GCP development VM. The CRDs created successfully but discovery began to take upward of 10 minutes - i.e. 10 minute pauses before kubectl commands would run.

This all means that unfortunately I'm not feeling confident about there being any way to accommodate installing several very large providers simultaneously - e.g. Terrajet generated AWS, Azure, and GCP providers would together be around 2,000 CRDs.

@negz negz changed the title API Server becomes unresponsive when too many CRDs are installed API Server (and clients) becomes unresponsive with too many CRDs Oct 29, 2021
@negz
Member

negz commented Oct 29, 2021

It seems like in kubernetes/kubectl#1126 there's consensus that the rate limiting in kubectl will be removed and they'll instead defer to APF. I believe this should alleviate the discovery rate limiting issue - we'd be subject to configurable server side request queueing instead.

I've raised kubernetes-sigs/controller-runtime#1707 to discuss whether a similar change to controllers would make sense. Sounds like they're amenable to the change.

@negz
Member

negz commented Oct 29, 2021

Executive Summary

  • Rate limiting CRD installs doesn't reliably alleviate the problems we're seeing.
  • Upstream fixes are in progress for the problems we're seeing and may be available as patch releases in ~1 month.
  • I support breaking Terrajet into smaller providers rather than adding filtering support.

Sadly I'm feeling convinced that at the moment the only way to reliably avoid the API server and kubectl performance issues we're seeing - apart from waiting for the upstream fixes - is to avoid creating 'too many' CRDs. It seems like what constitutes too many varies depending on the resources available to the control plane so there's not a reliable number we can cap it at, but ~300 is my conservative estimate.

Rate Limiting Experiments

GKE continues to be unable to scale to 2,000 CRDs (regardless of how we rate limit) without becoming unresponsive. API discovery begins to suffer from ~20 second long client-side rate limiting with as few as 200 CRDs in the cluster. The most successful strategy I've found so far for GKE is batches of 50 CRDs spread 30 seconds apart.

EKS, on the other hand, will happily accept 2,000 CRDs applied all at once with no rate limiting at all and no other immediately discernible performance degradation, but will then exhibit client-side rate limiting of up to six minutes before some kubectl commands will complete.

In my experience kind does seem to work (I got it up to 4,000 CRDs) but uses a huge amount of resources (2 cores, GBs of memory) which is not a great experience for someone wanting to try out Crossplane for the first time on their laptop or workstation.

I've noticed that CPU consumption seems to drop off about an hour after a huge number of CRDs are applied; presumably this is how long the API server takes to recompute its OpenAPI schema over and over again. Memory consumption doesn't seem to drop unless the API server process restarts. The API server has a special clause to load many CRDs more efficiently at startup as compared to those same CRDs being added at runtime.

Impending Upstream Fixes

Both of the key upstream issues we're facing (OpenAPI processing and kubectl discovery rate limiting) have PRs open (kubernetes/kube-openapi#251, kubernetes/kubernetes#105520). I would expect these fixes to be merged imminently but there are no guarantees. If they're candidates for patch releases it's possible fixes will be available within a month per the patch release schedule. I've reached out for clarification around whether the folks working on the issues expect them to be backported and available as patch releases, or whether they'll need to wait until the next minor release.

If the upstream fixes do indeed become available as patch releases I personally would feel comfortable requiring that Crossplane users be on a supported version of Kubernetes at the latest patch release in order to support large providers. Asking users to update to the latest minor version of Kubernetes seems like a taller order and not something many enterprises would be able to do easily.

Options for Reducing Installed CRDs

Personally I still don't buy that there's any real value in reducing the number of CRDs (used or not) in the system except to work around these performance issues, but it seems like said performance issues alone may force our hand.

have we considered just breaking these large providers into smaller, group-based providers? It would require no code changes, just some different flags in the package build process to only include some CRDs and maybe set some flag on the entrypoint of the controller image.

Of the two reduction avenues I'm aware of (smaller providers vs filtering what CRDs are enabled for a provider) I support the approach @hasheddan proposed above. It sounds like we'd need to work through a few things technically to make it work though, as @muvaf mentioned. For example:

  • Do we keep one ProviderConfig per provider (now group scoped) or try to share across providers?
  • What do we do with cross resource references, which would now mostly be to other providers (i.e. API groups)?

I feel the benefits of smaller providers (vs filtering) are:

  • It requires no changes to upstream Crossplane, decoupling it from the Crossplane review and release cycle.
  • Given the goal is to reduce the ratio of installed to actively used CRDs it seems (IMO) much more intuitive to "install the providers for the functionality you need". The alternative feels more to me like requiring users to carefully and actively avoid installing too many CRDs.
  • It's not something we need to commit to; for example we could return to larger providers (if so desired) or simply not worry about providers growing larger once the upstream issues were fixed.

Of course, neither smaller providers nor filtering providers will actually fix the scalability issues - they'll just reduce the likelihood that folks will run into them so ultimately we are going to need to continue working with upstream Kubernetes to ensure the API server can meet our needs.

@negz
Member

negz commented Nov 8, 2021

The general consensus among the folks working on this (i.e. @ulucinar, @muvaf, @luebken, and myself) is that we're not likely to get these cherry-picked in time for the patch releases that will be released on Nov 17th:

  • The PR that needs cherry-picking is not yet reviewed or merged (but is fairly uncontroversial).
  • The deadline to get those cherry-picks merged is the 12th (this Friday).
  • The 11th is a public holiday for many folks here in the US.
  • We've been advised by @hasheddan (sig-release team emeritus) that the closer you get to the cherry-pick deadline, the harder a sell it is to cherry-pick a PR.

@ulucinar
Contributor Author

ulucinar commented Nov 8, 2021

Just left a comment in the upstream PR to signal the direction we would like to proceed with:
kubernetes/kubernetes#106181 (comment)

@negz
Member

negz commented Nov 8, 2021

Quoting @apelisse on Kubernetes Slack:

If you can get this done quickly enough to get your cherry-picks in before cherry-pick freeze, I'll happily approve. I'm mostly focusing on the main release code-freeze though so we won't have time to push that ourselves

He's helped us by creating the following kube-openapi branches:

These branches correspond to the kube-openapi commits Kubernetes 1.20, 1.21, and 1.22 are currently using. We will now proceed by:

  1. Opening PRs to cherry-pick Lazy marshaling for OpenAPI v2 spec kubernetes/kube-openapi#251 to each of the above branches.
  2. Opening PRs to update the k/k release-1.20, release-1.21, and release-1.22 branches to the latest commit of their corresponding k/kube-openapi release branches.

This will reduce the scope of the k/k cherry-picks back down to only including kubernetes/kube-openapi#251.

@muvaf
Member

muvaf commented Nov 10, 2021

The current status is that the scope of changes for release-1.20 and release-1.21 had included a dependency bump, which we had to revert. Now they both have minimal scope, just like the release-1.22 one.

The PR that bumps master is merged, meaning the fix is guaranteed to be included in 1.23.

The patch release PRs are:

Once we get approved-for-merge for 1.20 and 1.21 as well, we'll ping release manager to see if they can include them in the respective patch releases.

@muvaf
Member

muvaf commented Nov 11, 2021

All the patch release PRs are merged. Now it's guaranteed that the fix will be included in the following releases of Kubernetes:

  • 1.23 (December 7th)
  • 1.22.4 (November 17th)
  • 1.21.7 (November 17th)
  • 1.20.13 (November 17th)

I think we can close this issue once the kubectl PR kubernetes/kubernetes#106016 is merged as well.

@ulucinar
Contributor Author

Here are kind node images built from the master & active release k/k branches containing the lazy marshaling behavior:
master: ulucinar/node-amd64:master-lazy-marshal
1.22: ulucinar/node-amd64:1.22-lazy-marshal
1.21: ulucinar/node-amd64:1.21-lazy-marshal
1.20: ulucinar/node-amd64:1.20-lazy-marshal

@ulucinar
Contributor Author

All branches now contain the OpenAPI aggregated spec lazy marshaling changes. Here are kind images for them:

ulucinar/node-amd64:v1.23.0-beta.0
ulucinar/node-amd64:v1.22.4
ulucinar/node-amd64:v1.21.7
ulucinar/node-amd64:v1.20.13

kind has also released an official node image for the release-1.22 branch:

kindest/node:v1.22.4

@negz
Member

negz commented Nov 22, 2021

The sig-cli folks ended up merging kubernetes/kubernetes#105520, which bumped up rate limits quite a lot, rather than disabling client-side rate limiting. Specifically:

  • The discovery request rate was bumped to 50rps, up from 5.
  • The discovery client now allows bursts of up to 300rps, up from 100.

@negz
Member

negz commented Nov 22, 2021

We may want to leave this issue open to track further upstream improvements; notably there's still an appetite to remove the client-side rate limiting (on discovery at least) in kubectl and also to improve how discovery is cached.

@muvaf
Member

muvaf commented Nov 24, 2021

@negz I think we should close this issue as the problems it's describing are fixed, i.e. clusters and clients do not become unresponsive with too many CRDs anymore. If there are other problems, a new issue that is more specific could make more sense.

@muvaf
Member

muvaf commented Dec 1, 2021

I'm closing this now. Feel free to re-open if you experience described problems in the released versions.

@muvaf muvaf closed this as completed Dec 1, 2021
@jonnylangefeld
Contributor

@muvaf I just tested the newly merged client update with the increased burst and QPS that @negz referred to, but I still ran into client-side throttling. And that on a cluster with <300 CRDs.
All details on the original upstream PR: kubernetes/kubernetes#105520 (comment)

@muvaf
Member

muvaf commented Dec 10, 2021

@jonnylangefeld thanks for reporting! Either client-side or server-side throttling is expected during the cache calls, but the part we were interested in fixing was the client becoming unresponsive, i.e. waiting 6 minutes for a simple GET call, or timing out. How long does it take for calls to complete in your case?

@jonnylangefeld
Contributor

A simple GET request via a raw API call wasn't the issue before though, right? That was always fast. Only if you do a kubectl get does discovery run, and that is what took long. Here's a comparison of the two:

time kubectl get pods -n default
kubectl get pods -n default  0.50s user 1.19s system 11% cpu 14.174 total

time kubectl get --raw  "/api/v1/namespaces/default/pods"
kubectl get --raw "/api/v1/namespaces/default/pods"  0.06s user 0.03s system 23% cpu 0.398 total

In the other post I describe how kubectl get makes 170 requests to the API server if the cache doesn't exist and 4 requests if it does (time kubectl get --raw "/api/v1/namespaces/default/pods" is only 1 request).

@apelisse

There are improvements related to this that I'd love to see done @seans3 @justinsb

@jonnylangefeld
Contributor

I did a bit of digging and found out why the original upstream PR kubernetes/kubernetes#105520 didn't work: kubernetes/kubernetes#105520 (comment).

Maybe give my fix a thumbs up to make kubectl more usable again when controllers like crossplane are installed?

It's only one step on the way to at least not always run into rate limits. To actually optionally disable the discovery cache I opened a separate issue: kubernetes/kubernetes#107130.

@lippertmarkus

lippertmarkus commented Feb 4, 2022

This is a real blocker for us. We're on 1.22.4, so applying the ~650 CRDs of provider-jet-azure is quite fast with the lazy marshaling change. On the client side, however, it's not just kubectl that is quite unusable with the throttling, but also other API clients like management dashboards (we're using Rancher), which get extremely slow.

Can we maybe revisit the idea of reducing installed CRDs?

@jonnylangefeld
Contributor

The fix I posted in the comment below will make it into kubectl 1.24, so that’ll be a slight improvement for kubectl. Any other client is still affected. Basically anything using client-go.
Check out this blog post for an in-depth explanation of the issue: https://jonnylangefeld.com/blog/the-kubernetes-discovery-cache-blessing-and-curse.

Unfortunately a real solution can only be achieved through a server change as well.
But if we want to keep building controllers we’ll have to fix this issue.

@apelisse

apelisse commented Feb 4, 2022

Hey there,

Very aware of lots of problems with discovery, the large number of requests, and how the cache is not invalidated properly.
@Jefftree and I have been designing a possible solution but we have limited cycles for implementation. Happy to help someone implement some changes if you have some bandwidth available. Also happy to discuss the problem in more depth to make sure we address it properly.

Thanks

@ulucinar
Contributor Author

ulucinar commented Feb 4, 2022

Hi @apelisse,
I'm also investigating these issues and their implications. Would love to share ideas, gain a deeper understanding and contribute to the implementations. Thank you for your support!

@negz
Member

negz commented Feb 7, 2022

@apelisse We're happy to help if we can. Is there somewhere we can find your thoughts so far?
