Reduce memory footprint for GCP admission component #143

Closed
timuthy opened this issue Jul 28, 2020 · 21 comments
Labels: area/cost, area/robustness, component/gardener, kind/enhancement, kind/roadmap, platform/gcp, priority/2, status/closed

@timuthy (Member) commented Jul 28, 2020

How to categorize this issue?

/area robustness
/area cost
/priority critical
/platform gcp

What would you like to be added:
Since the validation of cloud provider secrets was introduced (#112), the GCP admission plugin's memory footprint has increased considerably, mainly because of the added caches for Shoots, SecretBindings and Secrets.

The required memory depends on the K8s cluster, or rather the number/size of the stored resources, but we've observed increases from ~20Mb to ~500Mb. Considering that such an admission component has multiple replicas and is only responsible for one provider, the runtime costs are too high.

Hence, we should try to:

  1. Disable caching for all objects and analyse the consequences, i.e. whether the cost per admission request is still acceptable.
  2. If 1.) is not feasible, use the API Reader to read Secrets, because we expect Secrets to be the most frequently occurring resource kind in the Garden cluster (a sketch follows below).
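For illustration, a minimal sketch of option 2, reading Secrets via controller-runtime's uncached API reader; the validator type and function names are purely illustrative and not the actual admission-gcp code:

package validator

import (
	"context"

	corev1 "k8s.io/api/core/v1"
	"k8s.io/apimachinery/pkg/types"
	"sigs.k8s.io/controller-runtime/pkg/client"
	"sigs.k8s.io/controller-runtime/pkg/manager"
)

// shootValidator reads Secrets via the manager's API reader, i.e. directly from
// the API server, so no informer cache for Secrets is built up in memory.
type shootValidator struct {
	apiReader client.Reader
}

func NewShootValidator(mgr manager.Manager) *shootValidator {
	return &shootValidator{apiReader: mgr.GetAPIReader()}
}

func (v *shootValidator) getCloudProviderSecret(ctx context.Context, namespace, name string) (*corev1.Secret, error) {
	secret := &corev1.Secret{}
	// Each admission request now pays for one live GET against the garden cluster,
	// trading a bit of latency for a much smaller heap.
	if err := v.apiReader.Get(ctx, types.NamespacedName{Namespace: namespace, Name: name}, secret); err != nil {
		return nil, err
	}
	return secret, nil
}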
@timuthy timuthy added the kind/enhancement Enhancement, improvement, extension label Jul 28, 2020
@gardener-robot gardener-robot added area/cost Cost related area/robustness Robustness, reliability, resilience related platform/gcp Google cloud platform/infrastructure priority/critical Needs to be resolved soon, because it impacts users negatively labels Jul 28, 2020
@ialidzhikov (Member)

/assign @ialidzhikov

@ialidzhikov (Member) commented Aug 28, 2020

Thank you @timuthy for opening this issue.

The avg memory usage now seems to be stable at ~80Mb, with spikes to 100Mb. These numbers are from a landscape with ~10000 Secrets and ~7000 SecretBindings.

[Screenshot 2020-08-28 at 11 50 41]

I can give option 2 a try:

If 1.) is not feasible, use the API Reader to read Secrets, because we expect Secrets to be the most frequently occurring resource kind in the Garden cluster.

and see whether it improves the memory usage.

@gardener-robot gardener-robot added the lifecycle/stale Nobody worked on this for 6 months (will further age) label Oct 28, 2020
@ialidzhikov ialidzhikov removed the lifecycle/stale Nobody worked on this for 6 months (will further age) label Jan 18, 2021
@gardener-robot gardener-robot added priority/2 Priority (lower number equals higher priority) and removed priority/critical Needs to be resolved soon, because it impacts users negatively labels Mar 8, 2021
@ialidzhikov (Member)

I had a look at this issue today and tried to outline future improvements.

  • I checked whether the admission-gcp component is using the application/vnd.kubernetes.protobuf content type. According to the protobuf proposal, using protobuf can decrease CPU, memory and network usage:

    Experiments have demonstrated a 10x reduction in CPU use during serialization and deserialization, a 2x reduction in size in bytes on the wire, and a 6-9x reduction in the amount of objects created on the heap during serialization.

    But it seems that admission-gcp is already using the protobuf content type today (I think, by chance?) thanks to the init() in the github.com/gardener/gardener/pkg/client/kubernetes pkg that calls apiutil.AddToProtobufScheme(protobufSchemeBuilder.AddToScheme):

    https://github.com/gardener/gardener/blob/53847edc6c7ffa7dfd4bf54ace3739a1d15c66c8/pkg/client/kubernetes/client.go#L52-L61

  • Another possible option that theoretically should reduce the memory usage is to use the label/field selector on the controller-runtime cache (ref ✨ Add SelectorsByObject option to cache kubernetes-sigs/controller-runtime#1435) and to filter Shoots by .spec.provider.type (see the sketch below). This would require gardener-apiserver to support a field selector on .spec.provider.type.
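For illustration, a minimal sketch of the SelectorsByObject option (controller-runtime >= v0.9); it assumes gardener-apiserver supported such a field selector on .spec.provider.type, which it currently does not:

package admission

import (
	gardencorev1beta1 "github.com/gardener/gardener/pkg/apis/core/v1beta1"
	"k8s.io/apimachinery/pkg/fields"
	ctrl "sigs.k8s.io/controller-runtime"
	"sigs.k8s.io/controller-runtime/pkg/cache"
	"sigs.k8s.io/controller-runtime/pkg/manager"
)

func newManager() (manager.Manager, error) {
	return ctrl.NewManager(ctrl.GetConfigOrDie(), manager.Options{
		// Only Shoots with spec.provider.type=gcp would end up in the informer cache;
		// Shoots of all other providers would be filtered out on the server side.
		NewCache: cache.BuilderWithOptions(cache.Options{
			SelectorsByObject: cache.SelectorsByObject{
				&gardencorev1beta1.Shoot{}: {
					Field: fields.OneTermEqualSelector("spec.provider.type", "gcp"),
				},
			},
		}),
	})
}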


My last change on this issue was to disable the cache for Secrets (#253 (comment)) and it had a positive effect, reducing the memory usage to ~ avg 118Mi. When I checked the memory usage on a busy landscape today, it was again ~ avg 224Mi, which is surprising to me as I don't think anything has changed code-wise on the admission-gcp side.

[Screenshot 2021-09-07 at 19 22 48]

@timuthy (Member, Author) commented Sep 8, 2021

Another possible option that theoretically should reduce the memory usage is to use the label/field selector on the controller-runtime cache (ref ✨ Add SelectorsByObject option to cache kubernetes-sigs/controller-runtime#1435) and to filter Shoots by .spec.provider.type. This would require gardener-apiserver to support a field selector on .spec.provider.type.

I guess this won't hurt but in the end I don't think that it will save much memory, especially if we use protobuf, right?

My last change on this issue was to disable the cache for Secrets (#253 (comment)) and it had a positive effect, reducing the memory usage to ~ avg 118Mi. When I checked the memory usage on a busy landscape today, it was again ~ avg 224Mi, which is surprising to me as I don't think anything has changed code-wise on the admission-gcp side.

Does it make sense to use profiling (gardener/gardener#4568) here?

@timebertt (Member)

I guess this won't hurt but in the end I don't think that it will save much memory, especially if we use protobuf, right?

Hmm, why do you think so? The shoot resource can grow quite large and has to be stored somewhere in memory, which is independent of the used transport protocol.
Filtering out all irrelevant shoots (i.e., spec.provider.type!=gcp) will still save quite some memory, I assume.
Am I missing something?

@timuthy (Member, Author) commented Sep 8, 2021

Hmm, why do you think so?

I checked the rough size of Shoots on our biggest landscape and yes, it's a number, but I cannot deduce that this will save memory in the ~100Mi range. Of course, you cannot really compare the size of a JSON representation with the size needed in memory, but it should still give a direction, right?

Please note that I never claimed filtering out the Shoots can be disregarded. I was rather wondering whether it will save a considerable amount of the memory increase that we have been experiencing for quite a while.

@timebertt (Member)

Got your point. Well, I guess we just have to try and see 😄

@ialidzhikov (Member)

After giving this issue more thought, I came up with the following proposal:
gardenlet maintains a label (provider.shoot.gardener.cloud/<type>=true) on the cloud provider Secret: it adds the label when the Secret is used by at least one Shoot of the given provider type, and removes it again when no Shoots of that provider reference the Secret anymore. This label allows admission components to specify an objectSelector in their webhooks that matches only Secrets related to Shoots of the given provider. In this way the admission components no longer need to list (and cache) SecretBindings and Shoots to determine whether a Secret is used by any Shoot. A reference POC/implementation can be found in ialidzhikov/gardener@37b9528 and ialidzhikov@18c5920.
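For illustration, a hedged sketch of what a webhook registration with such an objectSelector could look like; the webhook name is illustrative, the label key is the proposed one, and ClientConfig, SideEffects etc. are omitted for brevity:

package admission

import (
	admissionregistrationv1 "k8s.io/api/admissionregistration/v1"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
)

// gcpSecretWebhook only receives Secrets labeled as being in use by at least one
// GCP Shoot; all other Secrets never reach admission-gcp at all.
func gcpSecretWebhook() admissionregistrationv1.ValidatingWebhook {
	failurePolicy := admissionregistrationv1.Fail
	return admissionregistrationv1.ValidatingWebhook{
		Name:          "secrets.gcp.admission.gardener.cloud", // illustrative name
		FailurePolicy: &failurePolicy,
		ObjectSelector: &metav1.LabelSelector{
			MatchLabels: map[string]string{
				"provider.shoot.gardener.cloud/gcp": "true", // proposed label
			},
		},
		Rules: []admissionregistrationv1.RuleWithOperations{{
			Operations: []admissionregistrationv1.OperationType{
				admissionregistrationv1.Create,
				admissionregistrationv1.Update,
			},
			Rule: admissionregistrationv1.Rule{
				APIGroups:   []string{""},
				APIVersions: []string{"v1"},
				Resources:   []string{"secrets"},
			},
		}},
	}
}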
This approach has one drawback: an end user can bypass the webhook validation by first removing the label from the Secret and then updating the Secret with invalid data. But considering that the validation is currently best effort anyway (mainly checking for required fields and, where possible, the data itself), I think this approach should generally be fine and would help us in most cases. If we have concerns about this drawback, we can add additional "prerequisite checks" for the cloudprovider Secret itself in the extension controllers and flag missing/invalid fields in the Secret data with the appropriate error code.
Despite the drawback I like the proposal. Of course, second opinions are always welcome. 👀

@rfranzke (Member) commented Sep 8, 2021

Thanks @ialidzhikov!
One comment:

gardenlet maintains a label (provider.shoot.gardener.cloud/<type>=true) on the cloud provider Secret: it adds the label when the Secret is used by at least one Shoot of the given provider type, and removes it again when no Shoots of that provider reference the Secret anymore.

This should probably be the GCM since it's in no way seed-specific and can be handled centrally.

@ialidzhikov (Member) commented Sep 8, 2021

This should probably be the GCM since it's in no way seed-specific and can be handled centrally.

I thought about it but faced the following problem: if it is a new controller in GCM that watches Shoots and maintains the cloud provider Secret label, then when GCM is down it will lose track of the Shoots that were deleted in the meantime (=> it fails to remove the label from the Secret). Of course, this can be handled by adding another finalizer to the Shoot metadata, but I was not quite sure whether we want to do this. I can think again and try to find something better.

@rfranzke (Member) commented Sep 8, 2021

Good point, indeed. Yeah, the additional finalizer does not sound optimal. Still, I can imagine that having such logic in gardenlet might lead to other challenges w.r.t. seed authorization and multiple gardenlets competing for the same Secret, no? I haven't thought about it much, but from a first gut feeling, putting such logic in GCM would feel more natural (if we find good ways to overcome the mentioned problems, of course). Let's see whether there'll be ideas about it.

@timebertt (Member)

+1 for the proposal from @ialidzhikov, I like it :)
and +1 for putting the controller into GCM

then when GCM is down it will lose track of the Shoots that were deleted in the meantime

We should be able to construct a controller that does not rely on watch events alone (controllers should generally be able to compensate for restarts), even without using finalizers.
The controller builds the current state of the world (CSW), triggered by watch events (which are also received on startup). From that it can calculate the desired state of the world (DSW), i.e. which Secret should carry which labels.
On startup, the add watch event for each Secret triggers a reconciliation for every Secret in the system, which then decides based on the DSW which labels to add or remove. A rough sketch follows below.
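For illustration, a rough sketch of such a reconciler (not an actual GCM implementation); it assumes the proposed label key, the Gardener v1beta1 field types of that time (e.g. spec.secretBindingName as a plain string), and simplified SecretRef namespace handling:

package secretlabeler

import (
	"context"
	"strings"

	corev1 "k8s.io/api/core/v1"
	"sigs.k8s.io/controller-runtime/pkg/client"
	"sigs.k8s.io/controller-runtime/pkg/reconcile"

	gardencorev1beta1 "github.com/gardener/gardener/pkg/apis/core/v1beta1"
)

const labelPrefix = "provider.shoot.gardener.cloud/" // proposed label key prefix

type Reconciler struct {
	Client client.Client
}

// Reconcile recomputes the desired provider labels of one Secret from all Shoots
// and SecretBindings currently in the system (the DSW) and converges the Secret's
// labels towards it. Because every Secret gets an add event on startup, labels that
// went stale while the controller was down are repaired without finalizers.
func (r *Reconciler) Reconcile(ctx context.Context, req reconcile.Request) (reconcile.Result, error) {
	secret := &corev1.Secret{}
	if err := r.Client.Get(ctx, req.NamespacedName, secret); err != nil {
		return reconcile.Result{}, client.IgnoreNotFound(err)
	}

	// Desired state: provider types of all Shoots whose SecretBinding references this Secret.
	desired := map[string]bool{}
	secretBindings := &gardencorev1beta1.SecretBindingList{}
	if err := r.Client.List(ctx, secretBindings); err != nil {
		return reconcile.Result{}, err
	}
	for _, binding := range secretBindings.Items {
		if binding.SecretRef.Name != secret.Name || binding.SecretRef.Namespace != secret.Namespace {
			continue
		}
		shoots := &gardencorev1beta1.ShootList{}
		if err := r.Client.List(ctx, shoots, client.InNamespace(binding.Namespace)); err != nil {
			return reconcile.Result{}, err
		}
		for _, shoot := range shoots.Items {
			if shoot.Spec.SecretBindingName == binding.Name {
				desired[labelPrefix+shoot.Spec.Provider.Type] = true
			}
		}
	}

	// Converge: drop stale provider labels, add missing ones.
	patch := client.MergeFrom(secret.DeepCopy())
	if secret.Labels == nil {
		secret.Labels = map[string]string{}
	}
	for key := range secret.Labels {
		if strings.HasPrefix(key, labelPrefix) && !desired[key] {
			delete(secret.Labels, key)
		}
	}
	for key := range desired {
		secret.Labels[key] = "true"
	}
	return reconcile.Result{}, r.Client.Patch(ctx, secret, patch)
}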

@ialidzhikov (Member) commented Sep 9, 2021

When I take a step back, I think the issue boils down to the fact that there is no "free" way (by free I mean without additional requests/watches) to determine the provider type of a cloud provider Secret. Currently it is really convenient that the provider type is part of the Shoot and CloudProfile specs; this allows extension admission controllers to pick only the requests that match their provider type. I think the part that contributed to the memory increase in the admission controller is the part that determines the cloud provider Secret's type and, in general, whether the Secret is in use by a Shoot of the given provider type (of course, we can try to enable profiling as suggested by @timuthy to confirm/reject this).
If SecretBinding is used only to reference a cloud provider Secret (currently I cannot point to another use case), then in an ideal world we could have the provider type as part of the SecretBinding spec:

apiVersion: core.gardener.cloud/v1beta1
kind: SecretBinding
metadata:
  name: my-provider-account
  namespace: garden-dev
provider:
  type: gcp
secretRef:
  name: my-provider-account

Pros:

  • We can forbid creation of a SecretBinding that references an invalid cloud provider Secret. Currently this validation happens on Shoot creation, where we check whether the Shoot references a valid cloud provider Secret.
  • It can help us to tackle this issue. GCM can easily maintain the SecretBinding provider.type as a label on the referenced Secret and, as shared above, extension admission controllers can use an objectSelector to match only Secrets with the given provider type.
  • In general, other components like the Gardener dashboard can benefit from such a field. Currently, while showing Secrets, the Gardener dashboard is "naive": it adds a cloudprofile.garden.sapcloud.io/name label to the SecretBinding on Secret creation from the dashboard, and then relies on this label to show the Secret and its type + CloudProfile name. For example, if you remove this label from the SecretBinding, the dashboard will stop showing this Secret. More details can be found in dashboard#781 ("cloudprofile.garden.sapcloud.io/name label should not be mandatory"). In this case the dashboard could also show the SecretBinding and use its provider.type.

Cons:

  • Not clear how to introduce such a change in a backwards-compatible manner. Also not clear how to tackle potential cases where a single Secret and SecretBinding is used for multiple cloud providers.

    Implementation proposal:

    1. Add new optional provider.type field to the SecretBinding resource.
    2. Introduce validation for the newly introduced field to contain only valid values.
    3. Add a new controller in GCM that will maintain the provider type of each SecretBinding based on its current usage.
      • If the SecretBinding is used by multiple provider types, generate an appropriate event on the SecretBinding and do not set the provider type field. For such cases, Gardener operators can contact the end user and resolve the case by splitting the SecretBinding.
    4. Adapt existing clients to specify the new field.
      • Gardener dashboard - the dashboard can start setting the new field when it is running against a Gardener API server >= vX.Y.Z
      • Other clients
    5. Once the migration controller has done its job and all SecretBindings are provider-specific, introduce a feature gate that forbids creating a SecretBinding without provider.type.
      • This will allow Gardener operators to roll out the backwards-incompatible validation in a controlled manner.
      • Take care to promote the feature gate - Alpha -> Beta -> GA.
    6. Adapt the existing SecretBinding controller to maintain its provider type (if set) as a label on the referenced Secret.
    7. Enhance our admission components to rely on the provider label in the referenced Secret.
  • Makes the SecretBinding resource coupled to the cloud provider Secret and "denies" potential usage of the SecretBinding resource for non-cloud-provider Secrets.

    • This can be worked around by using corev1.SecretReference instead.

@ialidzhikov (Member)

I updated my previous comment #143 (comment) with an implementation proposal on how to introduce provider.type on SecretBinding. The proposal consists of several stages to make sure that the backwards-incompatible API change is rolled out in a graceful and controlled manner.
Let me know what you think in general about the idea of making the SecretBinding resource provider-specific, and also what you think about the proposal.

@rfranzke (Member)

I like the proposal @ialidzhikov!

And not clear how to tackle potential cases where a single Secret and SecretBinding is used for multiple cloud providers.
If the SecretBinding is used by multiple provider types, generate an appropriate event on the SecretBinding and do not set the provider type field. For such cases, Gardener operators can contact the end user and resolve the case by splitting the SecretBinding.

Could we use a list of providers instead of only one?

Introduce validation for the newly introduced field to contain only valid values.

What are valid values?

Adapt the existing SecretBinding controller to maintain its provider type (if set) as a label on the referenced Secret.

How would the label be named?

@ialidzhikov (Member) commented Sep 10, 2021

Could we use a list of providers instead of only one?

My current assumption is that in general it is not a good practice to reuse a single Secret. At least I would vote for separation of concerns, and in a personal setup I wouldn't put such sensitive data into a single Secret. Let me know if there are cases where we recommend this approach. I agree with you in general that it is kind of an incompatible change. On the other hand, we could use the opportunity to impose the good practice if we really only have exceptional cases of misuse. So, I'd better check how many folks depend on reusing a single cloud provider Secret for multiple provider types.

What are valid values?

Good point. Initially I had in mind checking whether the provider type is a registered one (similar to what we do for Shoot and Seed). But I see that we allow, for example, a CloudProfile to be created with a non-registered provider type. We'd better follow the CloudProfile approach and not introduce a validation for this.
Another validation we could introduce is whether the Shoot provider.type matches the SecretBinding provider.type (a sketch follows below).
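For illustration, a small sketch of that check; it assumes the proposed provider.type field exists on the SecretBinding spec, and the function name is illustrative:

package validation

import (
	"fmt"

	gardencorev1beta1 "github.com/gardener/gardener/pkg/apis/core/v1beta1"
)

// validateShootProviderMatchesSecretBinding rejects Shoots whose provider type
// differs from the (proposed) provider type of the referenced SecretBinding.
func validateShootProviderMatchesSecretBinding(shoot *gardencorev1beta1.Shoot, binding *gardencorev1beta1.SecretBinding) error {
	if binding.Provider == nil || binding.Provider.Type == "" {
		// Provider type not (yet) maintained on the SecretBinding, nothing to check.
		return nil
	}
	if binding.Provider.Type != shoot.Spec.Provider.Type {
		return fmt.Errorf("shoot provider type %q does not match SecretBinding provider type %q",
			shoot.Spec.Provider.Type, binding.Provider.Type)
	}
	return nil
}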

How would the label be named?

I had in mind provider.shoot.gardener.cloud/<type>=true, but I'm of course open to suggestions.

@rfranzke (Member)

/component gardener
/roadmap internal
/milestone 2021-Q4

@ialidzhikov (Member)

With #396, on a not-that-busy landscape the avg memory utilization of admission-gcp dropped from ~40MiB to ~21MiB.

[Screenshot 2022-03-23 at 8 59 37]

Let's see what the effect will be on busy landscapes where the admission-gcp memory usage is > 200MiB.

@ialidzhikov (Member)

On a large landscape with the rollout of admission-gcp@v1.22.0 the memory usage dropped significantly - from 200-250MiB to 40-60MiB:

[Screenshot 2022-05-12 at 18 13 52]

We can say that the issue is fixed for admission-gcp. However, I will keep it open to track the progress of the other admission components that have to be adapted in a similar way.

@rfranzke (Member)

Very nice, well done @ialidzhikov

@ialidzhikov (Member)

The other admission components have been adapted in a similar way. Hence, we can close this issue.

/close

@gardener-robot gardener-robot added the status/closed Issue is closed (either delivered or triaged) label May 30, 2022
@gardener-robot gardener-robot added the kind/roadmap Roadmap BLI label Mar 23, 2023