API Server (and clients) becomes unresponsive with too many CRDs #2649

@ulucinar

Description

What problem are you facing?

As part of the ongoing Terrajet-based providers effort, we have observed that registering hundreds of CRDs has a performance impact on the cluster. Observations from two sets of experiments are described here and here. As discussed in the results of the first experiment, the Kubernetes scalability thresholds document does not currently treat the number of CRDs per cluster as a dimension. However, sig-api-machinery folks suggest a maximum of 500 CRDs. The reason for this suggested limit is not API call latency SLOs but, as we also identified in our experiments, the latency of OpenAPI spec publishing. As the results of the second experiment demonstrate, the marginal cost of adding a new CRD increases as more CRDs exist in the cluster.

Although it is not yet officially considered a scalability dimension, it looks like we need to be careful about the number of CRDs we install in a cluster. With Terrajet-based providers, we would like to be able to ship hundreds of CRDs per provider package. Currently, for the initial releases of these providers, we include only a small subset (less than 10% of the total count) of the managed resources we can generate. We would like to be able to ship all supported resources in a provider package.

How could Crossplane help solve your problem?

As part of the v1.Provider, v1.ProviderRevision, v1.Configuration and v1.ConfigurationRevision specs, we could have a new configuration field that specifies the GVKs of the package objects to be installed onto the cluster. To remain backward compatible, if this new field is not set, all objects defined in the package manifest get installed. If the field has a non-zero value, only the selected objects get installed. To make it easier to install packages with the new configuration field, we can also add new options to the install provider and install configuration commands of the Crossplane kubectl plugin.
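
To make the proposed selection semantics concrete, here is a minimal sketch in Go. It is not Crossplane code: the `GVK` and `PackageObject` types and the `SelectObjects` function are hypothetical stand-ins, and it assumes each package object can be keyed by a single GVK. It only illustrates the rule that an empty selection installs everything while a non-empty selection installs only the matching objects.

```go
package main

import "fmt"

// GVK is a hypothetical stand-in for a group/version/kind triple.
type GVK struct {
	Group, Version, Kind string
}

// PackageObject is a hypothetical, simplified entry of a package manifest.
type PackageObject struct {
	GVK  GVK
	Name string
}

// SelectObjects returns the objects that should be installed.
// An empty selection keeps the current behavior: install everything.
func SelectObjects(objs []PackageObject, selected []GVK) []PackageObject {
	if len(selected) == 0 {
		return objs
	}
	want := make(map[GVK]struct{}, len(selected))
	for _, g := range selected {
		want[g] = struct{}{}
	}
	var out []PackageObject
	for _, o := range objs {
		if _, ok := want[o.GVK]; ok {
			out = append(out, o)
		}
	}
	return out
}

func main() {
	// Illustrative data only; the group and kinds are made up.
	objs := []PackageObject{
		{GVK: GVK{"ec2.example.org", "v1alpha1", "Instance"}, Name: "instances"},
		{GVK: GVK{"ec2.example.org", "v1alpha1", "VPC"}, Name: "vpcs"},
		{GVK: GVK{"s3.example.org", "v1alpha1", "Bucket"}, Name: "buckets"},
	}
	selected := []GVK{{"s3.example.org", "v1alpha1", "Bucket"}}
	for _, o := range SelectObjects(objs, selected) {
		fmt.Println(o.Name)
	}
}
```

Treating the zero value as "install all" means existing Provider and Configuration objects that never set the field would keep behaving as they do today.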

As described in the experiment results, when a large number of CRDs is registered in a relatively short period of time, the background task that prepares the OpenAPI spec served from the /openapi/v2 endpoint may cause a spike in the API server's CPU utilization, which may saturate the CPU resources allocated to the API server. Similar to what the controller-runtime client does, we may also consider implementing a throttling mechanism in the package revision controller to prevent this. However, for the reasons discussed above and in the experiment results (especially the latency introduced in OpenAPI spec publishing), it looks like we will need a complementary mechanism like the one suggested above in addition to throttling in the package revision reconciler.
