Memory Spikes in Crossplane During Upgrades in Large Cluster Deployments #5272
Thanks for the really generous notes. We really need something for doing reconciliation in chunks; this issue affects us as well.
@btwseeu78 any reference to Gatekeeper's docs mentioning this? At a quick glance I could only find the chunk size for Audits here; is that what you were referring to?
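For context, this appears to refer to Gatekeeper's audit option that pages its LIST calls instead of fetching everything at once. A hedged sketch of enabling it (the flag `--audit-chunk-size` and the Helm value `auditChunkSize` are taken from the upstream Gatekeeper docs and chart; verify against your chart version):

```sh
# Make the Gatekeeper audit controller issue paginated LIST requests of
# 500 objects each instead of one cluster-wide LIST. Value name is from
# the upstream gatekeeper Helm chart; verify for your version.
helm upgrade gatekeeper gatekeeper/gatekeeper \
  --namespace gatekeeper-system \
  --reuse-values \
  --set auditChunkSize=500
```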
During Crossplane startup, traces show normal listing and watching of objects, followed by a stream error while listing *v1.Secret:

```
I0123 10:02:55.244941 1 trace.go:236] Trace[1298330381]: "Reflector ListAndWatch" name:k8s.io/client-go@v0.28.3/tools/cache/reflector.go:229 (23-Jan-2024 10:02:35.738) (total time: 19506ms):
Trace[1298330381]: ---"Objects listed" error:<nil> 19504ms (10:02:55.242)
Trace[1298330381]: [19.506445969s] [19.506445969s] END
I0123 10:02:56.740233 1 trace.go:236] Trace[1488299156]: "Reflector ListAndWatch" name:k8s.io/client-go@v0.28.3/tools/cache/reflector.go:229 (23-Jan-2024 10:02:34.347) (total time: 22393ms):
Trace[1488299156]: ---"Objects listed" error:<nil> 21499ms (10:02:55.846)
Trace[1488299156]: [22.393027556s] [22.393027556s] END
I0123 10:02:56.941924 1 trace.go:236] Trace[839564159]: "Reflector ListAndWatch" name:k8s.io/client-go@v0.28.3/tools/cache/reflector.go:229 (23-Jan-2024 10:02:34.539) (total time: 22401ms):
Trace[839564159]: ---"Objects listed" error:<nil> 22399ms (10:02:56.939)
Trace[839564159]: [22.401877025s] [22.401877025s] END
I0123 10:03:00.240082 1 trace.go:236] Trace[251027572]: "Reflector ListAndWatch" name:k8s.io/client-go@v0.28.3/tools/cache/reflector.go:229 (23-Jan-2024 10:02:48.338) (total time: 11901ms):
Trace[251027572]: ---"Objects listed" error:<nil> 11900ms (10:03:00.239)
Trace[251027572]: [11.901022092s] [11.901022092s] END
I0123 10:03:02.343144 1 trace.go:236] Trace[2074234589]: "Reflector ListAndWatch" name:k8s.io/client-go@v0.28.3/tools/cache/reflector.go:229 (23-Jan-2024 10:02:48.439) (total time: 13903ms):
Trace[2074234589]: ---"Objects listed" error:<nil> 13902ms (10:03:02.342)
Trace[2074234589]: [13.903369954s] [13.903369954s] END
I0123 10:03:04.745854 1 trace.go:236] Trace[883204833]: "Reflector ListAndWatch" name:k8s.io/client-go@v0.28.3/tools/cache/reflector.go:229 (23-Jan-2024 10:02:45.145) (total time: 19600ms):
Trace[883204833]: ---"Objects listed" error:<nil> 19597ms (10:03:04.742)
Trace[883204833]: [19.600764286s] [19.600764286s] END
I0123 10:03:07.641299 1 trace.go:236] Trace[1490995012]: "Reflector ListAndWatch" name:k8s.io/client-go@v0.28.3/tools/cache/reflector.go:229 (23-Jan-2024 10:02:45.144) (total time: 22497ms):
Trace[1490995012]: ---"Objects listed" error:<nil> 22494ms (10:03:07.638)
Trace[1490995012]: [22.497001796s] [22.497001796s] END
I0123 10:03:13.840521 1 trace.go:236] Trace[117155820]: "Reflector ListAndWatch" name:k8s.io/client-go@v0.28.3/tools/cache/reflector.go:229 (23-Jan-2024 10:02:48.538) (total time: 25301ms):
Trace[117155820]: ---"Objects listed" error:<nil> 25299ms (10:03:13.838)
Trace[117155820]: [25.301687697s] [25.301687697s] END
W0123 10:03:34.238618 1 reflector.go:535] k8s.io/client-go@v0.28.3/tools/cache/reflector.go:229: failed to list *v1.Secret: stream error when reading response body, may be caused by closed connection. Please retry. Original error: stream error: stream ID 75; INTERNAL_ERROR; received from peer
I0123 10:03:34.238715 1 trace.go:236] Trace[1654916232]: "Reflector ListAndWatch" name:k8s.io/client-go@v0.28.3/tools/cache/reflector.go:229 (23-Jan-2024 10:02:30.642) (total time: 63595ms):
Trace[1654916232]: ---"Objects listed" error:stream error when reading response body, may be caused by closed connection. Please retry. Original error: stream error: stream ID 75; INTERNAL_ERROR; received from peer 63595ms (10:03:34.238)
Trace[1654916232]: [1m3.595882683s] [1m3.595882683s] END
I0123 10:03:34.238739 1 reflector.go:147] k8s.io/client-go@v0.28.3/tools/cache/reflector.go:229: Failed to watch *v1.Secret: failed to list *v1.Secret: stream error when reading response body, may be caused by closed connection. Please retry. Original error: stream error: stream ID 75; INTERNAL_ERROR; received from peer
I0123 10:04:30.844676 1 trace.go:236] Trace[512972156]: "Reflector ListAndWatch" name:k8s.io/client-go@v0.28.3/tools/cache/reflector.go:229 (23-Jan-2024 10:03:35.238) (total time: 55606ms):
Trace[512972156]: [55.606511491s] [55.606511491s] END
```
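The failing call above is a full, unpaginated LIST of *v1.Secret by the reflector. As a rough gauge of how large that list actually is, the same request can be issued with client-side pagination (kubectl's `--chunk-size` maps to the API server's limit/continue parameters); this is purely a diagnostic sketch, not a fix:

```sh
# Count cluster-wide Secrets using paginated LIST requests of 500 objects
# each, mirroring the resource the reflector fails to list in one response.
kubectl get secrets --all-namespaces --chunk-size=500 -o name | wc -l
```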
With Crossplane 1.14 the validation webhook is enabled by default; I wonder if it's related. Can you disable the validation webhook and check the spike again during restart? WDYT?
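A hedged sketch of that experiment, assuming an installation via the official Helm chart (the `webhooks.enabled` value is from the upstream chart; verify against your chart version):

```sh
# Turn off the validation webhook that 1.14 enables by default, then
# watch whether the memory spike still occurs on restart.
helm upgrade crossplane crossplane-stable/crossplane \
  --namespace crossplane-system \
  --reuse-values \
  --set webhooks.enabled=false
```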
After disabling the validation webhook, the memory issue persists, and I've noticed a significant increase in goroutines. Specifically, I conducted a detailed analysis of memory usage between Crossplane versions v1.11.1 and v1.14.5 and observed that the increase persists even during normal operation, rather than being isolated to the restart or upgrade process. In addition, I've identified a recurring error in the Crossplane controller logs.

Crossplane is currently managing approximately 620 claims in this cluster. In the previous version, memory consumption peaked at 346 MiB, while in the new version it has increased to 890 MiB. For goroutines, the older version had around 730, whereas the new version exhibits an increase to 873.
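For anyone wanting to reproduce that comparison, a minimal sketch, assuming the core Crossplane pod exposes Prometheus metrics on port 8080 (the port and metric names below are standard Go/controller-runtime defaults, not confirmed from this issue):

```sh
# Forward the (assumed) metrics port of the core Crossplane deployment,
# then read the Go runtime gauges behind the comparison above:
# goroutine count and resident memory.
kubectl --namespace crossplane-system port-forward deployment/crossplane 8080:8080 &
curl -s http://localhost:8080/metrics \
  | grep -E '^(go_goroutines|process_resident_memory_bytes)'
```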
@haarchri After downgrading from Crossplane v1.14.5 to v1.13.2, I observed a significant decrease in memory consumption, from 1 GiB to 390 MiB, reverting to levels observed before the upgrade. Additionally, the previously encountered error no longer appears.
@smileisak I've reviewed our setup and can confirm that we primarily utilize the Kubernetes provider and have not configured the same service account for multiple providers.
What happened?
We are experiencing memory spikes during the upgrade of Crossplane from v1.11.2 to v1.14.5 in large Kubernetes clusters; our environment details are listed in the environment section below.
During the upgrade process, the Crossplane controller attempts to reconcile all objects simultaneously, leading to memory spikes. This issue is more pronounced in larger clusters. Importantly, this behavior has not been observed in our smaller clusters.
*(Screenshot: memory usage of the Crossplane pod spiking during the upgrade.)*
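A simple way to watch for the spike while the upgrade rolls out (assumes metrics-server is installed and a default `crossplane-system` install; purely illustrative):

```sh
# Poll per-container memory of the Crossplane pods every 5 seconds
# while the new version rolls out (requires metrics-server).
watch -n 5 kubectl top pod --namespace crossplane-system --containers
```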
Additionally, we noticed that increasing the memory limits from 900 MB to 2 GB resolves the OOMKilled crashes; after the upgrade completes, memory consumption returns to normal.
*(Screenshot: memory consumption returning to normal after raising the limits and completing the upgrade.)*
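A hedged sketch of that workaround via the Helm chart (the `resourcesCrossplane` value names are from the upstream chart; verify against your chart version):

```sh
# Temporarily raise the core Crossplane memory limit for the upgrade,
# then lower it again once memory settles back to normal.
helm upgrade crossplane crossplane-stable/crossplane \
  --namespace crossplane-system \
  --version 1.14.5 \
  --reuse-values \
  --set resourcesCrossplane.limits.memory=2Gi
```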
Proposed Solutions:
Reconcile objects in chunks rather than attempting to reconcile everything at once on startup, similar to Gatekeeper's audit chunk size.
How can we reproduce it?
Upgrade Crossplane from v1.11.2 to v1.14.5 in a large cluster (ours manages roughly 620 claims through the Kubernetes provider) and observe the controller's memory usage during startup.
What environment did it happen in?
Crossplane version: v1.14.5
Cloud provider: Provider Kubernetes
Kubernetes version (use kubectl version): v1.27
Kubernetes distribution (e.g. Tectonic, GKE, OpenShift): GKE