
Memory Spikes in Crossplane During Upgrades in Large Cluster Deployments #5272

Open
abelhoula opened this issue Jan 22, 2024 · 7 comments
Labels: bug, package, performance

@abelhoula

What happened?

We are experiencing memory spikes during the upgrade of Crossplane from v1.11.2 to v1.14.5 in large Kubernetes clusters. Our environment details are as follows:

  • Kubernetes Version: GKE 1.27
  • Provider: Kubernetes provider
  • Number of Objects: more than 4000 Managed Resources (MRs) & more than 1600 Claims

During the upgrade process, the Crossplane controller attempts to reconcile all objects simultaneously, leading to memory spikes. This issue is more pronounced in larger clusters. Importantly, this behavior has not been observed in our smaller clusters.

Additionally, we noticed that increasing the memory limit from 900 MB to 2 GB resolves the issue (OOMKilled). After the upgrade completes, memory consumption returns to normal.

Proposed Solutions:

  • Throttle Upgrade Rate: Introduce mechanisms to throttle the rate at which objects are reconciled during the upgrade, so the controller is not overwhelmed by processing too many objects simultaneously.
  • Batch Processing: If possible, divide the set of objects into batches and process each batch separately. This would prevent the controller from processing all objects in a single reconciliation pass, mitigating the memory spike. A rough sketch of what such throttling could look like follows this list.
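For illustration only, here is a minimal sketch of the kind of throttling proposed above, written against controller-runtime. The setupThrottled helper, the watched type, and the concrete limits are hypothetical and not how Crossplane currently wires its controllers; exact rate-limiter names vary slightly across controller-runtime/client-go versions:

```go
package example

import (
	"time"

	"golang.org/x/time/rate"
	corev1 "k8s.io/api/core/v1"
	"k8s.io/client-go/util/workqueue"
	ctrl "sigs.k8s.io/controller-runtime"
	"sigs.k8s.io/controller-runtime/pkg/controller"
	"sigs.k8s.io/controller-runtime/pkg/reconcile"
)

// setupThrottled (hypothetical) wires a reconciler with a cap on parallel
// reconciles and a global token-bucket rate limiter, so that a restart, which
// re-queues every object, cannot reconcile everything at once.
func setupThrottled(mgr ctrl.Manager, r reconcile.Reconciler) error {
	return ctrl.NewControllerManagedBy(mgr).
		For(&corev1.ConfigMap{}). // placeholder type; Crossplane watches its own XR/claim kinds
		WithOptions(controller.Options{
			// Reconcile at most 2 objects in parallel.
			MaxConcurrentReconciles: 2,
			// Allow at most 10 reconciles/second overall (burst 100), on top of
			// the usual per-item exponential backoff.
			RateLimiter: workqueue.NewMaxOfRateLimiter(
				workqueue.NewItemExponentialFailureRateLimiter(time.Second, 5*time.Minute),
				&workqueue.BucketRateLimiter{Limiter: rate.NewLimiter(rate.Limit(10), 100)},
			),
		}).
		Complete(r)
}
```

MaxConcurrentReconciles bounds peak parallelism, while the bucket limiter spreads the flood of objects re-queued at startup over time instead of processing them in one burst.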

How can we reproduce it?

  • Set up a Kubernetes cluster: Deploy a GKE 1.27 cluster with the Kubernetes provider for Crossplane, and create more than 4000 Managed Resources & more than 1600 Claims.
  • Apply memory limits: Make sure the memory limit for the Crossplane pod is set to 900 Mi.
  • Initiate the Crossplane upgrade: Trigger the upgrade of Crossplane to v1.14.5.
  • Observe memory utilization during the upgrade: Monitor the memory utilization of the Crossplane pod as the upgrade progresses (for example with the sketch below).
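One way to monitor the pod's memory while the upgrade runs is to poll the metrics API. Below is a small Go sketch; metrics-server being installed, the crossplane-system namespace, and the app=crossplane label selector are assumptions about a default install and may need adjusting:

```go
package main

import (
	"context"
	"fmt"
	"time"

	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/client-go/tools/clientcmd"
	metricsv "k8s.io/metrics/pkg/client/clientset/versioned"
)

func main() {
	// Build a client from the local kubeconfig.
	cfg, err := clientcmd.BuildConfigFromFlags("", clientcmd.RecommendedHomeFile)
	if err != nil {
		panic(err)
	}
	mc := metricsv.NewForConfigOrDie(cfg)

	for {
		pods, err := mc.MetricsV1beta1().PodMetricses("crossplane-system").
			List(context.TODO(), metav1.ListOptions{LabelSelector: "app=crossplane"})
		if err != nil {
			fmt.Println("metrics not available:", err)
			time.Sleep(15 * time.Second)
			continue
		}
		// Print the memory usage of every Crossplane container every 15 seconds.
		for _, p := range pods.Items {
			for _, c := range p.Containers {
				fmt.Printf("%s %s/%s memory=%s\n",
					time.Now().Format(time.RFC3339), p.Name, c.Name, c.Usage.Memory().String())
			}
		}
		time.Sleep(15 * time.Second)
	}
}
```

Running this in a second terminal during the upgrade makes the spike and the subsequent return to normal easy to capture.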

What environment did it happen in?

Crossplane version: v1.14.5

Cloud provider - Provider Kubernetes
Kubernetes version (use kubectl version) - 1.27
Kubernetes distribution (e.g. Tectonic, GKE, OpenShift) - GKE

@abelhoula added the bug label Jan 22, 2024
@btwseeu78

Thanks for the really detailed notes. We really need something for doing reconciles in chunks; this issue is there for us as well.
We have seen that Gatekeeper has the ability to do it in chunks. It would be a really good feature.

@phisco
Contributor

phisco commented Jan 22, 2024

@btwseeu78 any reference to Gatekeeper's docs mentioning this? At a quick glance I could only find the chunk size for Audits here; is that what you were referring to?

@abelhoula
Author

During Crossplane startup, the traces show normal listing and watching of objects, but there is a stream error on *v1.Secret that may be contributing to the memory spikes during upgrades (a minimal sketch of this list/watch pattern follows the traces below).

I0123 10:02:55.244941       1 trace.go:236] Trace[1298330381]: "Reflector ListAndWatch" name:k8s.io/client-go@v0.28.3/tools/cache/reflector.go:229 (23-Jan-2024 10:02:35.738) (total time: 19506ms):
Trace[1298330381]: ---"Objects listed" error:<nil> 19504ms (10:02:55.242)
Trace[1298330381]: [19.506445969s] [19.506445969s] END

I0123 10:02:56.740233       1 trace.go:236] Trace[1488299156]: "Reflector ListAndWatch" name:k8s.io/client-go@v0.28.3/tools/cache/reflector.go:229 (23-Jan-2024 10:02:34.347) (total time: 22393ms):
Trace[1488299156]: ---"Objects listed" error:<nil> 21499ms (10:02:55.846)
Trace[1488299156]: [22.393027556s] [22.393027556s] END

I0123 10:02:56.941924       1 trace.go:236] Trace[839564159]: "Reflector ListAndWatch" name:k8s.io/client-go@v0.28.3/tools/cache/reflector.go:229 (23-Jan-2024 10:02:34.539) (total time: 22401ms):
Trace[839564159]: ---"Objects listed" error:<nil> 22399ms (10:02:56.939)
Trace[839564159]: [22.401877025s] [22.401877025s] END

I0123 10:03:00.240082       1 trace.go:236] Trace[251027572]: "Reflector ListAndWatch" name:k8s.io/client-go@v0.28.3/tools/cache/reflector.go:229 (23-Jan-2024 10:02:48.338) (total time: 11901ms):
Trace[251027572]: ---"Objects listed" error:<nil> 11900ms (10:03:00.239)
Trace[251027572]: [11.901022092s] [11.901022092s] END

I0123 10:03:02.343144       1 trace.go:236] Trace[2074234589]: "Reflector ListAndWatch" name:k8s.io/client-go@v0.28.3/tools/cache/reflector.go:229 (23-Jan-2024 10:02:48.439) (total time: 13903ms):
Trace[2074234589]: ---"Objects listed" error:<nil> 13902ms (10:03:02.342)
Trace[2074234589]: [13.903369954s] [13.903369954s] END

I0123 10:03:04.745854       1 trace.go:236] Trace[883204833]: "Reflector ListAndWatch" name:k8s.io/client-go@v0.28.3/tools/cache/reflector.go:229 (23-Jan-2024 10:02:45.145) (total time: 19600ms):
Trace[883204833]: ---"Objects listed" error:<nil> 19597ms (10:03:04.742)
Trace[883204833]: [19.600764286s] [19.600764286s] END

I0123 10:03:07.641299       1 trace.go:236] Trace[1490995012]: "Reflector ListAndWatch" name:k8s.io/client-go@v0.28.3/tools/cache/reflector.go:229 (23-Jan-2024 10:02:45.144) (total time: 22497ms):
Trace[1490995012]: ---"Objects listed" error:<nil> 22494ms (10:03:07.638)
Trace[1490995012]: [22.497001796s] [22.497001796s] END

I0123 10:03:13.840521       1 trace.go:236] Trace[117155820]: "Reflector ListAndWatch" name:k8s.io/client-go@v0.28.3/tools/cache/reflector.go:229 (23-Jan-2024 10:02:48.538) (total time: 25301ms):
Trace[117155820]: ---"Objects listed" error:<nil> 25299ms (10:03:13.838)
Trace[117155820]: [25.301687697s] [25.301687697s] END

W0123 10:03:34.238618       1 reflector.go:535] k8s.io/client-go@v0.28.3/tools/cache/reflector.go:229: failed to list *v1.Secret: stream error when reading response body, may be caused by closed connection. Please retry. Original error: stream error: stream ID 75; INTERNAL_ERROR; received from peer

I0123 10:03:34.238715       1 trace.go:236] Trace[1654916232]: "Reflector ListAndWatch" name:k8s.io/client-go@v0.28.3/tools/cache/reflector.go:229 (23-Jan-2024 10:02:30.642) (total time: 63595ms):
Trace[1654916232]: ---"Objects listed" error:stream error when reading response body, may be caused by closed connection. Please retry. Original error: stream error: stream ID 75; INTERNAL_ERROR; received from peer 63595ms (10:03:34.238)
Trace[1654916232]: [1m3.595882683s] [1m3.595882683s] END

I0123 10:03:34.238739       1 reflector.go:147] k8s.io/client-go@v0.28.3/tools/cache/reflector.go:229: Failed to watch *v1.Secret: failed to list *v1.Secret: stream error when reading response body, may be caused by closed connection. Please retry. Original error: stream error: stream ID 75; INTERNAL_ERROR; received from peer

I0123 10:04:30.844676       1 trace.go:236] Trace[512972156]: "Reflector ListAndWatch" name:k8s.io/client-go@v0.28.3/tools/cache/reflector.go:229 (23-Jan-2024 10:03:35.238) (total time: 55606ms):
Trace[512972156]: [55.606511491s] [55.606511491s] END
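For context on the traces above: each reflector line is client-go's informer machinery doing an initial LIST of a type followed by a WATCH. A standalone sketch of that pattern against Secrets (the type whose list fails above), assuming a local kubeconfig and shown purely to illustrate where the startup memory goes:

```go
package main

import (
	"context"
	"fmt"
	"time"

	corev1 "k8s.io/api/core/v1"
	"k8s.io/client-go/informers"
	"k8s.io/client-go/kubernetes"
	"k8s.io/client-go/tools/cache"
	"k8s.io/client-go/tools/clientcmd"
)

func main() {
	cfg, err := clientcmd.BuildConfigFromFlags("", clientcmd.RecommendedHomeFile)
	if err != nil {
		panic(err)
	}
	cs := kubernetes.NewForConfigOrDie(cfg)

	// The informer's reflector performs the ListAndWatch seen in the traces: one
	// full LIST of every Secret to fill the cache, then a WATCH for changes.
	factory := informers.NewSharedInformerFactory(cs, 10*time.Minute)
	secrets := factory.Core().V1().Secrets().Informer()
	secrets.AddEventHandler(cache.ResourceEventHandlerFuncs{
		AddFunc: func(obj interface{}) {
			s := obj.(*corev1.Secret)
			fmt.Printf("cached secret %s/%s\n", s.Namespace, s.Name)
		},
	})

	ctx, cancel := context.WithCancel(context.Background())
	defer cancel()
	factory.Start(ctx.Done())
	// The initial LIST holds every object in memory at once; if it is interrupted
	// (as with the *v1.Secret stream error above) it is retried from scratch,
	// which is one way a restart can drive a memory spike.
	factory.WaitForCacheSync(ctx.Done())
}
```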

@haarchri
Contributor

With Crossplane 1.14 the validation webhook is enabled by default; wonder if it's related. Can you disable the validation webhook and check the spike again during a restart? WDYT?

@abelhoula
Author

After disabling the validation webhook, the memory issue persists, and I've noticed a significant increase in goroutines. I conducted a detailed comparison of memory usage between Crossplane versions (v1.11.1 & v1.14.5) and observed that the increase persists even during normal operation, rather than being isolated to the restart or upgrade process.

In addition, I've identified a recurring error in the Crossplane controller logs:
E0126 09:06:07.217898 1 request.go:1116] Unexpected error when reading response body: context canceled
E0126 09:26:29.742646 1 request.go:1116] Unexpected error when reading response body: context canceled
E0126 09:27:45.717104 1 request.go:1116] Unexpected error when reading response body: context canceled
E0126 09:32:16.143825 1 request.go:1116] Unexpected error when reading response body: context canceled
E0126 09:37:55.916520 1 request.go:1116] Unexpected error when reading response body: context canceled
E0126 09:41:29.121660 1 request.go:1116] Unexpected error when reading response body: context canceled
E0126 09:42:43.415709 1 request.go:1116] Unexpected error when reading response body: context canceled
E0126 09:48:34.124928 1 request.go:1116] Unexpected error when reading response body: context canceled
E0126 09:48:34.124993 1 request.go:1116] Unexpected error when reading response body: context canceled
E0126 09:48:34.125068 1 request.go:1116] Unexpected error when reading response body: context canceled
E0126 09:57:56.528134 1 request.go:1116] Unexpected error when reading response body: context canceled
E0126 10:02:14.807933 1 request.go:1116] Unexpected error when reading response body: context canceled
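For context, those request.go:1116 lines come from client-go failing to read a response body after the request's context was canceled. A self-contained sketch that can produce the same message, assuming a reachable cluster and a local kubeconfig:

```go
package main

import (
	"context"
	"fmt"
	"time"

	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/client-go/kubernetes"
	"k8s.io/client-go/tools/clientcmd"
)

func main() {
	cfg, err := clientcmd.BuildConfigFromFlags("", clientcmd.RecommendedHomeFile)
	if err != nil {
		panic(err)
	}
	cs := kubernetes.NewForConfigOrDie(cfg)

	// Cancel the request's context while a large LIST response may still be
	// streaming; client-go then reports "Unexpected error when reading response
	// body: context canceled" from request.go, as in the lines above.
	ctx, cancel := context.WithCancel(context.Background())
	go func() {
		time.Sleep(500 * time.Millisecond)
		cancel()
	}()
	_, err = cs.CoreV1().Secrets("").List(ctx, metav1.ListOptions{})
	fmt.Println(err)
}
```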

Crossplane is currently managing approximately 620 claims in this cluster. In the previous version, memory consumption peaked at 346 MiB, while in the new version it has increased to 890 MiB.


For goroutines, the older version had around 730, whereas the new version shows an increase to 873.


@abelhoula
Author

@haarchri After downgrading from Crossplane version v1.14.5 to version v1.13.2, I observed a significant decrease in memory consumption from 1 GiB to 390 MiB, reverting to levels observed before the upgrade. Additionally, the previously encountered context canceled errors were not observed in the older version.

@abelhoula
Author

@smileisak I've reviewed our setup and can confirm that we primarily utilize the Kubernetes provider and have not configured the same service account for multiple providers.
