
Bump context deadline for ProviderRevision controller to 3 mins #2570

Merged
merged 1 commit on Sep 10, 2021

Conversation

turkenh
Member

@turkenh turkenh commented Sep 10, 2021

Description of your changes

This PR bumps the context deadline for the ProviderRevision controller from 1 min to 3 mins. The current deadline is not sufficient when the provider package contains a high number of CRDs. Even though it does not make sense to bump the deadline indefinitely to support an arbitrary number of CRDs, supporting around 1000 CRDs seems like a reasonable target, given that provider-tf-aws contains around 750.

I have experimented with installing turkenh/provider-tf-aws:daf1e9f7-2 with this change a couple of times and observed that it installs in around 2 mins. I believe 3 mins should be a good value, leaving some room for different environments and around 250 more CRDs (to reach our rough target of 1000).
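
For context, here is a minimal Go sketch of how such a per-reconcile deadline is typically applied; the names below are illustrative, not the exact Crossplane code:

package revision

import (
	"context"
	"time"

	"sigs.k8s.io/controller-runtime/pkg/reconcile"
)

// reconcileTimeout bounds a single ProviderRevision reconcile, which includes
// establishing control of every CRD shipped in the provider package.
const reconcileTimeout = 3 * time.Minute

// Reconciler is a placeholder for the revision reconciler (client,
// establisher, logger, etc. omitted).
type Reconciler struct{}

func (r *Reconciler) Reconcile(ctx context.Context, req reconcile.Request) (reconcile.Result, error) {
	// Derive a context that is cancelled once the deadline elapses, so a
	// package with an unexpectedly large number of CRDs cannot hold a
	// worker forever; with ~750 CRDs the previous 1 min budget was too tight.
	ctx, cancel := context.WithTimeout(ctx, reconcileTimeout)
	defer cancel()
	// ... fetch the ProviderRevision and apply its objects using ctx ...
	return reconcile.Result{}, nil
}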

Fixes #2564

I have:

  • Read and followed Crossplane's contribution process.
  • Run make reviewable to ensure this PR is ready for review.
  • Added backport release-x.y labels to auto-backport this PR if necessary.

How has this code been tested

Install the provider package referenced in the issue description:

kubectl crossplane install provider turkenh/provider-tf-aws:daf1e9f7-2
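
Then, purely as an illustration, the corresponding revision can be watched until it reports healthy:

kubectl get providerrevisions.pkg.crossplane.io -w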

Fixes crossplane#2564

Signed-off-by: Hasan Turken <turkenh@gmail.com>
@ulucinar
Contributor

Hi @turkenh,
Was provider-tf-aws successfully started after the CRD registrations by the revision controller? In my experience with provider-tf-azure, as I mentioned, it was in a crash loop due to API server timeouts.

@ulucinar
Contributor

Another question: although I do not think it's currently needed, will we consider another approach for registering CRDs in the revision controller?

@turkenh
Member Author

turkenh commented Sep 10, 2021

Hi @turkenh,
Was provider-tf-aws successfully started after the CRD registrations by the revision controller? In my experience with provider-tf-azure, as I mentioned, it was in a crash loop due to API server timeouts.

@ulucinar yes, I didn't observe any issues with the controllers (5/5 tries), except for the following throttling log (this instance has been running for 15 mins now):

I0910 09:35:07.449360       1 request.go:665] Waited for 1.0767866s due to client-side throttling, not priority and fairness, request: GET:https://10.96.0.1:443/apis/apiextensions.crossplane.io/v1beta1?timeout=32s
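
As a side note, that log line comes from client-go's client-side rate limiter rather than API Priority and Fairness. Purely as an illustration (not something this PR touches), the relevant knobs live on the client's *rest.Config from k8s.io/client-go/rest, whose defaults of 5 QPS / 10 burst are easy to exceed when discovery has to walk hundreds of API groups; newClientConfig below is a hypothetical helper:

package clientconfig // hypothetical package, for illustration only

import "k8s.io/client-go/rest"

// newClientConfig returns an in-cluster config with the client-side rate
// limits raised so that discovery across many API groups is not throttled.
// The values are illustrative, not a recommendation.
func newClientConfig() (*rest.Config, error) {
	cfg, err := rest.InClusterConfig()
	if err != nil {
		return nil, err
	}
	cfg.QPS = 50
	cfg.Burst = 100
	return cfg, nil
}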

@ulucinar
Contributor

Hi @turkenh,
Was provider-tf-aws successfully started after the CRD registrations by the revision controller? In my experience with provider-tf-azure, as I mentioned, it was in a crash loop due to API server timeouts.

@ulucinar yes, I didn't observe any issues with the controllers (5/5 tries), except for the following throttling log (this instance has been running for 15 mins now):

I0910 09:35:07.449360       1 request.go:665] Waited for 1.0767866s due to client-side throttling, not priority and fairness, request: GET:https://10.96.0.1:443/apis/apiextensions.crossplane.io/v1beta1?timeout=32s

Interesting, thanks for sharing your observation. Is your cluster also a Kind cluster? I will give provider-tf-azure a retry.

@turkenh
Member Author

turkenh commented Sep 10, 2021

Interesting, thanks for sharing your observation. Is your cluster also a Kind cluster? I will give provider-tf-azure a retry.

Yes, it was a kind cluster.
Now testing with a GKE (autopilot) cluster, let's see.

@turkenh
Member Author

turkenh commented Sep 10, 2021

GKE (autopilot) is performing much worse. The API server stopped responding to some requests while the package was being installed:

kubectl describe providerrevisions.pkg.crossplane.io turkenh-provider-tf-aws-1206561c4d26

I0910 13:03:11.292739   26899 request.go:645] Throttling request took 1.19845629s, request: GET:https://34.68.113.240/apis/wafregional.aws.tf.crossplane.io/v1alpha1?timeout=32s
I0910 13:03:21.492224   26899 request.go:645] Throttling request took 11.397248344s, request: GET:https://34.68.113.240/apis/apiextensions.crossplane.io/v1beta1?timeout=32s
Error from server: etcdserver: request timed out

And now observing these events with this change:

Spec:
  Desired State:                  Active
  Ignore Crossplane Constraints:  false
  Image:                          turkenh/provider-tf-aws:daf1e9f7-2
  Package Pull Policy:            IfNotPresent
  Revision:                       1
  Skip Dependency Resolution:     false
Events:
  Type     Reason             Age                From                                         Message
  ----     ------             ----               ----                                         -------
  Normal   BindClusterRole    10m (x2 over 10m)  rbac/providerrevision.pkg.crossplane.io      Bound system ClusterRole to provider ServiceAccount(s)
  Normal   ApplyClusterRoles  10m (x3 over 10m)  rbac/providerrevision.pkg.crossplane.io      Applied RBAC ClusterRoles
  Warning  SyncPackage        9m40s              packages/providerrevision.pkg.crossplane.io  cannot establish control of object: Post "https://10.65.0.1:443/apis/apiextensions.k8s.io/v1/customresourcedefinitions?dryRun=All": context canceled
  Warning  SyncPackage        6m6s               packages/providerrevision.pkg.crossplane.io  cannot establish control of object: Post "https://10.65.0.1:443/apis/apiextensions.k8s.io/v1/customresourcedefinitions": context deadline exceeded
  Normal   BindClusterRole    9s (x2 over 9s)    rbac/providerrevision.pkg.crossplane.io      Bound system ClusterRole to provider ServiceAccount(s)
  Normal   ApplyClusterRoles  5s (x2 over 6s)    rbac/providerrevision.pkg.crossplane.io      Applied RBAC ClusterRoles
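
For anyone reproducing this, a quick (illustrative) way to see how far CRD installation got before the deadline was hit is to count the provider's CRDs that made it into the cluster:

kubectl get crds -o name | grep -c 'aws.tf.crossplane.io'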

@turkenh
Member Author

turkenh commented Sep 10, 2021

It is the same for a standard (not autopilot) GKE cluster.

I think we broke something on the GKE side; my clusters are no longer responding and are stuck in this state 😕

[Screenshot: Screen Shot 2021-09-10 at 14 03 12, GKE console view of the affected clusters]

So, we definitely need to understand what is really going on with this number of CRD types.

@turkenh
Member Author

turkenh commented Sep 10, 2021

However, bumping the context deadline helped on my local kind cluster, and I think it still makes sense to bump it to give enough time to install that many CRDs as long as the API server keeps working.

I am just wondering if we should give some more wiggle room and set it to 5 mins instead? 🤔

@ulucinar
Contributor

ulucinar commented Sep 10, 2021

However, bumping the context deadline helped on my local kind cluster, and I think it still makes sense to bump it to give enough time to install that many CRDs as long as the API server keeps working.

I am just wondering if we should give some more wiggle room and set it to 5 mins instead? 🤔

Consider also the case where multiple Terraform-based providers are provisioned on a single cluster for a multi-cloud scenario :P But I think installing every CRD we have generated for a provider is not the way to go: it looks like we are putting too much stress on the API server (something to be investigated), and it's not efficient. Some way of selectively installing CRDs and starting only the associated controllers sounds like the approach we should take.

Member

@muvaf muvaf left a comment


As we have discussed, this is definitely not the complete answer to the problem of installing CRDs on the order of hundreds. Increasing the timeout doesn't address all concerns, like dealing with slow API servers or installing multiple providers with thousands of CRDs. But bumping it from 1 to 3 minutes is a sane increase IMO that will get us to a place where you can at least install ~500 CRDs in most clusters. Even when the real solutions come in, like sharding or maybe a fix in the API server itself, a context timeout of 1 minute could still be a problem. So we're just moving one stone out of the way without much compromise, unlike bumping it by 10+ minutes.

Thanks @turkenh !

@github-actions

Successfully created backport PR #2571 for release-1.4.


Successfully merging this pull request may close these issues.

Installing packages with many CRDs causes reconciler to exceed context deadline