
Bump context deadline for ProviderRevision controller to 3 mins #2570

Merged
merged 1 commit on Sep 10, 2021

Conversation

turkenh
Member

@turkenh turkenh commented Sep 10, 2021

Description of your changes

This PR bumps the context deadline for the ProviderRevision controller from 1 min to 3 mins. The current deadline is not sufficient when the provider package contains a high number of CRDs. Even though it does not make sense to bump the deadline indefinitely to support an arbitrary number of CRDs, supporting around 1000 CRDs seems like a reasonable target, given that provider-tf-aws contains around 750.

I have experimented with installing turkenh/provider-tf-aws:daf1e9f7-2 with this change a couple of times and observed that it installs in around 2 mins. I believe 3 mins should be a good value, leaving some room for different environments and around 250 more CRDs (to reach our rough target of 1000).
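
For context, here is a minimal Go sketch of how such a per-reconcile deadline is typically applied; the names below are illustrative, not the exact Crossplane code:

package revision

import (
	"context"
	"time"

	"sigs.k8s.io/controller-runtime/pkg/reconcile"
)

// reconcileTimeout bounds a single ProviderRevision reconcile, which includes
// establishing control of every CRD shipped in the provider package.
const reconcileTimeout = 3 * time.Minute

// Reconciler is a placeholder for the revision reconciler (client,
// establisher, logger, etc. omitted).
type Reconciler struct{}

func (r *Reconciler) Reconcile(ctx context.Context, req reconcile.Request) (reconcile.Result, error) {
	// Derive a context that is cancelled once the deadline elapses, so a
	// package with an unexpectedly large number of CRDs cannot hold a
	// worker forever; with ~750 CRDs the previous 1 min budget was too tight.
	ctx, cancel := context.WithTimeout(ctx, reconcileTimeout)
	defer cancel()
	// ... fetch the ProviderRevision and apply its objects using ctx ...
	return reconcile.Result{}, nil
}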

Fixes #2564

I have:

  • Read and followed Crossplane's contribution process.
  • Run make reviewable to ensure this PR is ready for review.
  • Added backport release-x.y labels to auto-backport this PR if necessary.

How has this code been tested

Install the provider package referenced in the issue description:

kubectl crossplane install provider turkenh/provider-tf-aws:daf1e9f7-2
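
Then, purely as an illustration, the corresponding revision can be watched until it reports healthy:

kubectl get providerrevisions.pkg.crossplane.io -w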

Fixes crossplane#2564

Signed-off-by: Hasan Turken <turkenh@gmail.com>
@ulucinar
Contributor

Hi @turkenh,
Was provider-tf-aws successfully started after the CRD registrations by the revision controller? In my experience with provider-tf-azure, as I mentioned, it was in a crash loop due to API server timeouts.

@ulucinar
Contributor

Another question: although I do not think it's currently needed, will we consider another approach for registering CRDs in the revision controller?

@turkenh
Member Author

turkenh commented Sep 10, 2021

Hi @turkenh,
Was provider-tf-aws successfully started after the CRD registrations by the revision controller? In my experience with provider-tf-azure, as I mentioned, it was in a crash loop due to API server timeouts.

@ulucinar yes, I didn't observe any issues with the controllers (5/5 tries), except for the following throttling log (this instance has been running for 15 mins now):

I0910 09:35:07.449360       1 request.go:665] Waited for 1.0767866s due to client-side throttling, not priority and fairness, request: GET:https://10.96.0.1:443/apis/apiextensions.crossplane.io/v1beta1?timeout=32s
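
As a side note, that log line comes from client-go's client-side rate limiter rather than API Priority and Fairness. Purely as an illustration (not something this PR touches), the relevant knobs live on the client's *rest.Config from k8s.io/client-go/rest, whose defaults of 5 QPS / 10 burst are easy to exceed when discovery has to walk hundreds of API groups; newClientConfig below is a hypothetical helper:

package clientconfig // hypothetical package, for illustration only

import "k8s.io/client-go/rest"

// newClientConfig returns an in-cluster config with the client-side rate
// limits raised so that discovery across many API groups is not throttled.
// The values are illustrative, not a recommendation.
func newClientConfig() (*rest.Config, error) {
	cfg, err := rest.InClusterConfig()
	if err != nil {
		return nil, err
	}
	cfg.QPS = 50
	cfg.Burst = 100
	return cfg, nil
}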

@ulucinar
Contributor

Hi @turkenh,
Was provider-tf-aws successfully started after the CRD registrations by the revision controller? In my experience with provider-tf-azure, as I mentioned, it was in a crash loop due to API server timeouts.

@ulucinar yes, I didn't observe any issues with the controllers (5/5 tries), except for the following throttling log (this instance has been running for 15 mins now):

I0910 09:35:07.449360       1 request.go:665] Waited for 1.0767866s due to client-side throttling, not priority and fairness, request: GET:https://10.96.0.1:443/apis/apiextensions.crossplane.io/v1beta1?timeout=32s

Interesting, thanks for sharing your observation. Is your cluster also a Kind cluster? I will give provider-tf-azure a retry.

@turkenh
Member Author

turkenh commented Sep 10, 2021

Interesting, thanks for sharing your observation. Is your cluster also a Kind cluster? I will give provider-tf-azure a retry.

Yes, it was a kind cluster.
Now testing with a GKE (autopilot) cluster, let's see.

@turkenh
Member Author

turkenh commented Sep 10, 2021

GKE (autopilot) is performing much worse. The API server stopped responding to some requests while the package was being installed:

kubectl describe providerrevisions.pkg.crossplane.io turkenh-provider-tf-aws-1206561c4d26

I0910 13:03:11.292739   26899 request.go:645] Throttling request took 1.19845629s, request: GET:https://34.68.113.240/apis/wafregional.aws.tf.crossplane.io/v1alpha1?timeout=32s
I0910 13:03:21.492224   26899 request.go:645] Throttling request took 11.397248344s, request: GET:https://34.68.113.240/apis/apiextensions.crossplane.io/v1beta1?timeout=32s
Error from server: etcdserver: request timed out

And now observing these events with this change:

Spec:
  Desired State:                  Active
  Ignore Crossplane Constraints:  false
  Image:                          turkenh/provider-tf-aws:daf1e9f7-2
  Package Pull Policy:            IfNotPresent
  Revision:                       1
  Skip Dependency Resolution:     false
Events:
  Type     Reason             Age                From                                         Message
  ----     ------             ----               ----                                         -------
  Normal   BindClusterRole    10m (x2 over 10m)  rbac/providerrevision.pkg.crossplane.io      Bound system ClusterRole to provider ServiceAccount(s)
  Normal   ApplyClusterRoles  10m (x3 over 10m)  rbac/providerrevision.pkg.crossplane.io      Applied RBAC ClusterRoles
  Warning  SyncPackage        9m40s              packages/providerrevision.pkg.crossplane.io  cannot establish control of object: Post "https://10.65.0.1:443/apis/apiextensions.k8s.io/v1/customresourcedefinitions?dryRun=All": context canceled
  Warning  SyncPackage        6m6s               packages/providerrevision.pkg.crossplane.io  cannot establish control of object: Post "https://10.65.0.1:443/apis/apiextensions.k8s.io/v1/customresourcedefinitions": context deadline exceeded
  Normal   BindClusterRole    9s (x2 over 9s)    rbac/providerrevision.pkg.crossplane.io      Bound system ClusterRole to provider ServiceAccount(s)
  Normal   ApplyClusterRoles  5s (x2 over 6s)    rbac/providerrevision.pkg.crossplane.io      Applied RBAC ClusterRoles
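
For anyone reproducing this, a quick (illustrative) way to see how far CRD installation got before the deadline was hit is to count the provider's CRDs that made it into the cluster:

kubectl get crds -o name | grep -c 'aws.tf.crossplane.io'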

@turkenh
Member Author

turkenh commented Sep 10, 2021

It is the same for a standard (not autopilot) GKE cluster.

I think we broke something on the GKE side; my clusters are no longer responding and are stuck in this state 😕

[Screenshot: Screen Shot 2021-09-10 at 14 03 12, GKE console view of the affected clusters]

So, we definitely need to understand what is really going on with this number of CRD types.

@turkenh
Member Author

turkenh commented Sep 10, 2021

However, bumping the context deadline helped on my local kind cluster, and I think it still makes sense to bump it to give enough time to install that many CRDs as long as the API server keeps working.

I am just wondering if we should give some more wiggle room and set it to 5 mins instead? 🤔

@ulucinar
Contributor

ulucinar commented Sep 10, 2021

However, bumping the context deadline helped on my local kind cluster, and I think it still makes sense to bump it to give enough time to install that many CRDs as long as the API server keeps working.

I am just wondering if we should give some more wiggle room and set it to 5 mins instead? 🤔

Consider also the case where multiple Terraform-based providers are provisioned on a single cluster for a multi-cloud scenario :P But I think installing every CRD we have generated for a provider is not the way to go: it looks like we are putting too much stress on the API server (something to be investigated), and it's not efficient. Some way of selectively installing CRDs and starting only the associated controllers sounds like the approach we should take.

Member

@muvaf muvaf left a comment


As we have discussed, this is definitely not the complete answer to the problem of installing CRDs on the order of hundreds. Increasing the timeout doesn't address all concerns, like dealing with slow API servers or installing multiple providers with thousands of CRDs. But bumping it from 1 to 3 minutes is a sane increase IMO that will get us to a place where you can at least install ~500 CRDs in most clusters. Even when the real solutions come in, like sharding or maybe a fix in the API server itself, a context timeout of 1 minute could still be a problem. So we're just moving one stone out of the way without much compromise, unlike bumping it by 10+ minutes.

Thanks @turkenh !

@github-actions

Successfully created backport PR #2571 for release-1.4.


Successfully merging this pull request may close these issues.

Installing packages with many CRDs causes reconciler to exceed context deadline