
Random delay in creating or fetching new secrets in namespace (between 4m-7m) #2837

Open
oranerez opened this issue Nov 2, 2023 · 14 comments
Labels: kind/bug, triage/support

Comments


oranerez commented Nov 2, 2023

Describe the bug
When creating a new namespace, we observed that the ClusterSecretStore is not creating/fetching secrets based on refreshInterval (set to 10s); instead there is a delay of roughly 4 to 7 minutes before the secrets appear.
The ExternalSecret manifest is created almost right away, but the actual creation of the Secret after fetching it from AWS Secrets Manager takes several minutes (please see screenshot): some minutes pass before refreshTime is populated, which is what indicates the secret was fetched and created locally.

In the ESO log, the following line takes the same amount of time to be reported:
{"level":"info","ts":1698923181.4201677,"logger":"controllers.ExternalSecret","msg":"reconciled secret","ExternalSecret":{"name":"artifactory-docker-registry","namespace":"oran-test9"}}

Until then the log is quite stable, with no indication that any change is being made:
{"level":"info","ts":1698923177.8425815,"logger":"provider.aws","msg":"using aws session","region":"us-east-1","external id":"","credentials":{}} {"level":"info","ts":1698923177.8426237,"logger":"provider.aws.secretsmanager","msg":"fetching secret map","key":"dev-us-east-1/secret/secret-name1"} {"level":"info","ts":1698923177.8426342,"logger":"provider.aws.secretsmanager","msg":"fetching secret value","key":"dev-us-east-1/secret/secret-name2","version":"AWSCURRENT","value":"SECRET"} {"level":"info","ts":1698923179.8481295,"logger":"provider.aws","msg":"using aws session","region":"us-east-1","external id":"","credentials":{}} {"level":"info","ts":1698923179.8481748,"logger":"provider.aws.secretsmanager","msg":"fetching secret map","key":"dev-us-east-1/secret/secret-name3"} {"level":"info","ts":1698923179.8481805,"logger":"provider.aws.secretsmanager","msg":"fetching secret value","key":"dev-us-east-1/global/playwright-credentials","version":"AWSCURRENT","value":"SECRET"} {"level":"info","ts":1698923180.2435653,"logger":"provider.aws","msg":"using aws session","region":"us-east-1","external id":"","credentials":{}} {"level":"info","ts":1698923180.2436075,"logger":"provider.aws.secretsmanager","msg":"fetching secret map","key":"dev-us-east-1/secret/secret-name4"} {"level":"info","ts":1698923180.2436125,"logger":"provider.aws.secretsmanager","msg":"fetching secret value","key":"dev-us-east-1/mongo/mongo4-user-and-password","version":"AWSCURRENT","value":"SECRET"} {"level":"info","ts":1698923180.4427361,"logger":"provider.aws","msg":"using aws session","region":"us-east-1","external id":"","credentials":{}} {"level":"info","ts":1698923180.44278,"logger":"provider.aws.secretsmanager","msg":"fetching secret map","key":"dev-us-east-1/secret/secret-name5"} {"level":"info","ts":1698923180.4427857,"logger":"provider.aws.secretsmanager","msg":"fetching secret value","key":"dev-us-east-1/global/alertmanager-pagerduty","version":"AWSCURRENT","value":"SECRET"} {"level":"info","ts":1698923180.6423898,"logger":"provider.aws","msg":"using aws session","region":"us-east-1","external id":"","credentials":{}} {"level":"info","ts":1698923180.6424341,"logger":"provider.aws.secretsmanager","msg":"fetching secret map","key":"dev-us-east-1/secret/secret-name6"} {"level":"info","ts":1698923180.6424391,"logger":"provider.aws.secretsmanager","msg":"fetching secret value","key":"dev-us-east-1/secret/secret-name7","version":"AWSCURRENT","value":"SECRET"} {"level":"info","ts":1698923180.844502,"logger":"provider.aws","msg":"using aws session","region":"us-east-1","external id":"","credentials":{}} {"level":"info","ts":1698923180.8445442,"logger":"provider.aws.secretsmanager","msg":"fetching secret map","key":"dev-us-east-1/secret/secret-name8"} {"level":"info","ts":1698923180.844549,"logger":"provider.aws.secretsmanager","msg":"fetching secret value","key":"dev-us-east-1/automation-tests-data-platform","version":"AWSCURRENT","value":"SECRET"}

[screenshot: ExternalSecret objects with refreshTime only populated several minutes after creation]

To Reproduce
Set up ESO v0.9.7 (using the helm-chart) as follows:
[screenshot: helm-chart values used to deploy ESO v0.9.7]
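(The exact values are only in the screenshot; purely as a hedged sketch, an install along these lines would match the described setup - the resource figures and names here are assumptions, not the real configuration:)

# values.yaml sketch for the external-secrets helm chart (illustrative, not the actual values from the screenshot)
# helm repo add external-secrets https://charts.external-secrets.io
# helm install external-secrets external-secrets/external-secrets -n external-secrets --create-namespace --version 0.9.7 -f values.yaml
installCRDs: true      # install the ExternalSecret/SecretStore CRDs together with the chart
replicaCount: 1        # assumption; with leader election only one replica reconciles anyway
resources:             # assumed requests; the real ones are not shown in the issue
  requests:
    cpu: 100m
    memory: 128Mi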

Create the ClusterSecretStore (we are using Terraform for that one) as follows:
[screenshot: ClusterSecretStore definition managed via Terraform]
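(The store itself is only shown in the screenshot; as a rough sketch, a ClusterSecretStore for AWS Secrets Manager looks something like the following - the name and the IRSA-style auth are assumptions, not the actual definition:)

apiVersion: external-secrets.io/v1beta1
kind: ClusterSecretStore
metadata:
  name: aws-secrets-manager         # illustrative name
spec:
  provider:
    aws:
      service: SecretsManager
      region: us-east-1             # region taken from the logs above
      auth:
        jwt:                        # assuming IRSA-based auth; the real store may authenticate differently
          serviceAccountRef:
            name: external-secrets
            namespace: external-secrets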

We are using ArgoCD and a local helm-chart repository to bring up all applications under the new namespace, which in turn creates all the manifests, including the ExternalSecret resources that instruct ESO to create the local secrets (the secret provider in our scenario is AWS Secrets Manager).
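For reference, the ExternalSecret resources generated this way look roughly like the sketch below (names are taken from the logs in this issue; the real manifests are templated by the helm chart and may differ):

apiVersion: external-secrets.io/v1beta1
kind: ExternalSecret
metadata:
  name: artifactory-docker-registry
  namespace: oran-test9                        # one of the freshly created namespaces
spec:
  refreshInterval: 10s                         # the interval mentioned in the bug description
  secretStoreRef:
    kind: ClusterSecretStore
    name: aws-secrets-manager                  # illustrative store name
  target:
    name: artifactory-docker-registry          # the Kubernetes Secret ESO should create
  dataFrom:
    - extract:
        key: global/artifactory/docker-registry   # illustrative path in AWS Secrets Manager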

Kubernetes version: EKS 1.26
ESO version: 0.9.7

Expected behavior
Creation of new secrets locally by ESO, fetched from AWS Secrets Manager, should happen based on refreshInterval (or at least in some predictable way). As of now it takes place seemingly at random (between 4 and 7 minutes), which delays the creation of new namespaces and impacts our test automation mechanism.

Screenshots
Attached inline

Additional context
N/A

oranerez added the kind/bug label on Nov 2, 2023
moolen (Member) commented Nov 2, 2023

Hey 👋, thanks for your report!

From what I can see: the ExternalSecret had a lastTransitionTime at 11:06 (where it presumably was successfully synced for the first time). The logs start at 11:06 (after the first transition time).

  1. Can you provide the logs between 11:01 and 11:06? That's where it should have been reconciling.
  2. If you have a chance, can you re-run the experiment with loglevel=debug? That will provide a bit more context.
  3. Do you have metrics for the relevant ExternalSecret/SecretStores available?

It would be great if you could capture the state of the ExternalSecret/SecretStore in a broken condition and in a working condition. That would help immensely with the debugging.

So, what could be the case is that the ExternalSecret may be running into an exponential backoff, or ESO's queue was too long and it was not able to process the resources in time (though 6 min would be insanely long for that).

oranerez (Author) commented Nov 6, 2023

Hi @moolen,
Regarding Q1 & Q2 - I have attached a new set of logs in debugging loglevel so it can be seen from the beginning till the creation and updated secrets data (had to mask most of the dat ain the log with reference to secrets name and keep only the following secret unmasked - artifactory-docker-registry taken from path global/artifactory/docker-registry in ASM.
Regarding Q3 we are not using any metrics or expose any from within ESO.
It took the secret more than 7 minutes to get updated after manifest already created for ExternalSecrets (see attached screenshot).
image
I hope the details can now shade the light a bit more regarding the issue facing using ESO.

logs for new running session with debug log level:
external-secrets.log

moolen (Member) commented Nov 6, 2023

I took a quick look at the provided information:

Small remark: you have to look at the status.conditions[].lastTransitionTime - that shows when it first got into Ready=true state. The refreshTime shows the last time it has been refreshed.
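As a hedged illustration, the two timestamps live in the ExternalSecret status roughly like this (values here are illustrative):

status:
  refreshTime: "2023-11-06T08:49:36Z"             # last time the secret was refreshed
  conditions:
    - type: Ready
      status: "True"
      reason: SecretSynced
      lastTransitionTime: "2023-11-06T08:42:24Z"  # when it first became Ready=true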

It took a little less than 4 minutes for it to be reconciled.
Looking at the logs, I can see it was reconciled at 08:42:24Z:

$ grep "artifactory-docker-registry" external-secrets.log | grep "oran-test9"
{"level":"info","ts":1699260144.8180609,"logger":"controllers.ExternalSecret","msg":"reconciled secret","ExternalSecret":{"name":"artifactory-docker-registry","namespace":"oran-test9"}}
{"level":"debug","ts":1699260144.8180957,"logger":"events","msg":"Updated Secret","type":"Normal","object":{"kind":"ExternalSecret","namespace":"oran-test9","name":"artifactory-docker-registry","uid":"bb0ccba4-60ce-4652-b1c9-984d675c3b50","apiVersion":"external-secrets.io/v1beta1","resourceVersion":"2437963201"},"reason":"Updated"}
{"level":"debug","ts":1699260360.4635084,"logger":"controllers.ExternalSecret","msg":"reconciled secret","ExternalSecret":{"name":"artifactory-docker-registry","namespace":"oran-test9"}}
{"level":"debug","ts":1699260360.4635434,"logger":"events","msg":"Updated Secret","type":"Normal","object":{"kind":"ExternalSecret","namespace":"oran-test9","name":"artifactory-docker-registry","uid":"bb0ccba4-60ce-4652-b1c9-984d675c3b50","apiVersion":"external-secrets.io/v1beta1","resourceVersion":"2437972595"},"reason":"Updated"}
{"level":"debug","ts":1699260576.6623328,"logger":"controllers.ExternalSecret","msg":"reconciled secret","ExternalSecret":{"name":"artifactory-docker-registry","namespace":"oran-test9"}}
{"level":"debug","ts":1699260576.6623802,"logger":"events","msg":"Updated Secret","type":"Normal","object":{"kind":"ExternalSecret","namespace":"oran-test9","name":"artifactory-docker-registry","uid":"bb0ccba4-60ce-4652-b1c9-984d675c3b50","apiVersion":"external-secrets.io/v1beta1","resourceVersion":"2437981883"},"reason":"Updated"}

That leaves us with two assumptions:

  1. this particular ExternalSecret resource was not enqueued by controller-runtime due to high load (probable)
  2. this particular ExternalSecret took too long to reconcile (e.g. due to network timeout, congestion, etc.)

ESO only logs once it has finished reconciling (in both the error and non-error cases).

Both assumptions can be proven/disproven by looking at the metrics, see here. The interesting part is (1): workqueue_depth and (2): reconcile latency.
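(If metrics are not exposed yet, one hedged way to turn them on via the helm chart is sketched below; the value names should be verified against the chart version in use:)

# values.yaml sketch for exposing controller metrics (verify field names against your chart version)
serviceMonitor:
  enabled: true          # create a Prometheus ServiceMonitor for the controller
metrics:
  service:
    enabled: true        # expose the metrics port as a Service
# Controller-runtime metrics of interest:
#   workqueue_depth                            - how many items are waiting to be reconciled
#   controller_runtime_reconcile_time_seconds  - per-reconcile latency histogram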


guyelia commented Nov 30, 2023

Hey @moolen, I'm from @oranerez's team.
I'm attaching the metrics we see in the environment. First, workqueue_depth:
[screenshot: workqueue_depth metric, roughly 1,500 items in the queue]
Latency:
[screenshot: reconcile latency]
Reconciles per minute:
[screenshot: reconciles per minute]

It looks like we cannot reconcile the secrets in time; I know we have a LOT of them.
Is this something we can scale? The pod resources are low (<1 CPU, <1 GB RAM); maybe scale out?

Thanks again!

moolen (Member) commented Nov 30, 2023

Increasing --concurrent should help you fix the symptoms, but the issue seems to be the high reconcile time. It looks like most of the ExternalSecret resources are not reconciled successfully.

Increase --concurrent slowly; otherwise you may hammer the upstream API and run into rate limits.

See: https://external-secrets.io/latest/api/controller-options/
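One hedged way to pass that flag through the helm chart is via its extraArgs map (assuming the chart renders each entry as --key=value; adjust to however ESO is deployed):

# values.yaml sketch
extraArgs:
  concurrent: "4"    # start small (e.g. 2-4) and watch for provider rate limits before going higher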

gusfcarvalho added the triage/support label on Nov 30, 2023
@Sartigan

We can't increase the --concurrent too much (up to 4 in our case) because the API Server throttles the operator.

Is there anything else that can be done to resolve the issue? Our cluster contains 250-300 ExternalSecrets and we also experience a delay of up to 4 minutes before the secret is created.

However, a different cluster with 80 ExternalSecrets has its secrets created instantly.

Skarlso (Contributor) commented Jan 10, 2024

Stupid question, but can we increase the ESO replica count? @moolen?

moolen (Member) commented Jan 10, 2024

ESO runs with leader election, so only a single replica will be reconciling. @Sartigan, can you provide more information on how you set up ESO (helm values etc.)? It would probably be best to create a separate issue (or are you with the OP?).

Are you sure it is server-side? We also have client-side throttling, which can be changed with --client-qps and --client-burst. If it is server-side, you may need to talk to your cluster management team or your managed service provider.

What kind of APIs do you use in the ExternalSecret? dataFrom.extract, dataFrom.find? What creationPolicy do you use, is it merge? That also has an effect on ESO's behaviour and the kind of access patterns it creates.

You can give --enable-secrets-caching or --enable-configmaps-caching a shot; that changes the access pattern towards the API server (but increases memory consumption). In your cluster the difference could be significant.
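Continuing the extraArgs sketch from earlier, those flags would look roughly like this (the numbers are illustrative starting points, not recommendations):

# values.yaml sketch
extraArgs:
  enable-secrets-caching: "true"       # cache Secrets in the controller instead of hitting the API server on every read
  enable-configmaps-caching: "true"    # same for ConfigMaps, if they are referenced
  client-qps: "50"                     # illustrative; raises the client-side rate limit
  client-burst: "100"                  # illustrative; raises the client-side burst limit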

@Sartigan

--enable-secrets-caching worked perfectly for us. I appreciate the support.

moolen (Member) commented Jan 11, 2024

Wait, did that resolve all the problems?!
I'm trying to understand this, maybe it would be reasonable to change the default for this? 🤔

@Sartigan

Our problem was the same as the OP's: when a cluster has a high number of ExternalSecret resources, workqueue_depth becomes high and every new ExternalSecret takes about 3-4 minutes to generate its secret.

Tweaking --concurrent was supposed to solve that, but it barely did anything for us.

Adding --enable-secrets-caching fixed it for us. workqueue_depth is now practically zero; our RAM consumption went up (to around 2Gi), but that's acceptable, and all new ExternalSecrets generate their secrets instantly.


This issue is stale because it has been open 90 days with no activity. Remove stale label or comment or this will be closed in 30 days.

github-actions bot added the Stale label on Apr 18, 2024

pre commented Apr 25, 2024

We are experiencing the same slowness -- we didn't try with --enable-secrets-caching yet, but that's a good tip!



pre commented Apr 26, 2024

In our case, we are synchronizing an image pull secret and a couple of other static secrets to all namespaces. We have ~800 namespaces, each having one ExternalSecret. We also tried having only a single ClusterExternalSecret and annotating the namespaces, but it didn't perform any better.
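For reference, a ClusterExternalSecret that fans a static secret out to selected namespaces looks roughly like the sketch below (the selector, names, and remote key are illustrative, not the exact resource used here):

apiVersion: external-secrets.io/v1beta1
kind: ClusterExternalSecret
metadata:
  name: image-pull-secret              # illustrative name
spec:
  namespaceSelector:
    matchLabels:
      sync-pull-secret: "true"         # illustrative label that opts a namespace in
  refreshTime: 1h                      # how often the set of matching namespaces is re-evaluated
  externalSecretSpec:                  # the ExternalSecret stamped into each matching namespace
    refreshInterval: 1h                # the secret is static, so a long interval is fine
    secretStoreRef:
      kind: ClusterSecretStore
      name: aws-secrets-manager        # illustrative store name
    target:
      name: image-pull-secret          # the Secret created in each namespace
    dataFrom:
      - extract:
          key: global/registry-credentials   # illustrative path in the provider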

The Secret synchronization speed and the workqueue depth have become an issue when all the ExternalSecrets are due to be refreshed at the same time.

With leader election, only a single replica of external-secrets-operator processes the workqueue. This obviously becomes a bottleneck.

In our use case the synced secret is static and will rarely change (rotation period is a matter of months).

By disabling leader election and having concurrent:1, multiple replicas were able to process the queue without throttling messages in the logs. 
EDIT: Disabling leader election was not a correct fix. It resulted in multiple replicas handling the same ExternalSecret at the same time, where the latter operation would sometimes fail (not always) and mark the ExternalSecret with an error state.

We settled on the following, with RAM usage at ~100Mi across 800 namespaces:

      - args:
        - '--enable-leader-election=true' 
        - '--enable-secrets-caching=true'
        - '--concurrent=5'

In our case, concurrent:10 with leader election caused throttling messages to appear in the log and didn't improve the total speed.

It seems --enable-secrets-caching=true is helping with the delay. Without this setting, the single leader external-secrets replica runs at ~100-150 millicores while the other replica idles. With this setting, much less CPU processing is needed and the ExternalSecret resource queue is drained faster.


github-actions bot removed the Stale label on May 15, 2024