
Random delay in creating or fetching new secrets in namespace (between 4m-7m) #2837

Open
oranerez opened this issue Nov 2, 2023 · 14 comments
Labels: kind/bug, triage/support

Comments


oranerez commented Nov 2, 2023

Describe the bug
When creating a new namespace, we observed that the ClusterSecretStore is not creating/fetching secrets based on refreshInterval (set to 10s); instead there is a delay of roughly 4 to 7 minutes before the secrets appear.
The ExternalSecret manifest is created almost right away, but the actual creation of the Secret after fetching it from AWS Secrets Manager takes several minutes (please see screenshot): some minutes pass before refreshTime is populated, which is what indicates the secret was fetched and created locally.

In the ESO log, the following line takes the same amount of time to be reported:
{"level":"info","ts":1698923181.4201677,"logger":"controllers.ExternalSecret","msg":"reconciled secret","ExternalSecret":{"name":"artifactory-docker-registry","namespace":"oran-test9"}}

Until then the log is quite stable, with no indication that any change is being made:
{"level":"info","ts":1698923177.8425815,"logger":"provider.aws","msg":"using aws session","region":"us-east-1","external id":"","credentials":{}} {"level":"info","ts":1698923177.8426237,"logger":"provider.aws.secretsmanager","msg":"fetching secret map","key":"dev-us-east-1/secret/secret-name1"} {"level":"info","ts":1698923177.8426342,"logger":"provider.aws.secretsmanager","msg":"fetching secret value","key":"dev-us-east-1/secret/secret-name2","version":"AWSCURRENT","value":"SECRET"} {"level":"info","ts":1698923179.8481295,"logger":"provider.aws","msg":"using aws session","region":"us-east-1","external id":"","credentials":{}} {"level":"info","ts":1698923179.8481748,"logger":"provider.aws.secretsmanager","msg":"fetching secret map","key":"dev-us-east-1/secret/secret-name3"} {"level":"info","ts":1698923179.8481805,"logger":"provider.aws.secretsmanager","msg":"fetching secret value","key":"dev-us-east-1/global/playwright-credentials","version":"AWSCURRENT","value":"SECRET"} {"level":"info","ts":1698923180.2435653,"logger":"provider.aws","msg":"using aws session","region":"us-east-1","external id":"","credentials":{}} {"level":"info","ts":1698923180.2436075,"logger":"provider.aws.secretsmanager","msg":"fetching secret map","key":"dev-us-east-1/secret/secret-name4"} {"level":"info","ts":1698923180.2436125,"logger":"provider.aws.secretsmanager","msg":"fetching secret value","key":"dev-us-east-1/mongo/mongo4-user-and-password","version":"AWSCURRENT","value":"SECRET"} {"level":"info","ts":1698923180.4427361,"logger":"provider.aws","msg":"using aws session","region":"us-east-1","external id":"","credentials":{}} {"level":"info","ts":1698923180.44278,"logger":"provider.aws.secretsmanager","msg":"fetching secret map","key":"dev-us-east-1/secret/secret-name5"} {"level":"info","ts":1698923180.4427857,"logger":"provider.aws.secretsmanager","msg":"fetching secret value","key":"dev-us-east-1/global/alertmanager-pagerduty","version":"AWSCURRENT","value":"SECRET"} {"level":"info","ts":1698923180.6423898,"logger":"provider.aws","msg":"using aws session","region":"us-east-1","external id":"","credentials":{}} {"level":"info","ts":1698923180.6424341,"logger":"provider.aws.secretsmanager","msg":"fetching secret map","key":"dev-us-east-1/secret/secret-name6"} {"level":"info","ts":1698923180.6424391,"logger":"provider.aws.secretsmanager","msg":"fetching secret value","key":"dev-us-east-1/secret/secret-name7","version":"AWSCURRENT","value":"SECRET"} {"level":"info","ts":1698923180.844502,"logger":"provider.aws","msg":"using aws session","region":"us-east-1","external id":"","credentials":{}} {"level":"info","ts":1698923180.8445442,"logger":"provider.aws.secretsmanager","msg":"fetching secret map","key":"dev-us-east-1/secret/secret-name8"} {"level":"info","ts":1698923180.844549,"logger":"provider.aws.secretsmanager","msg":"fetching secret value","key":"dev-us-east-1/automation-tests-data-platform","version":"AWSCURRENT","value":"SECRET"}

[screenshot: ExternalSecret objects with refreshTime only populated several minutes after creation]

To Reproduce
Set up ESO v0.9.7 (using the helm-chart) as follows:
[screenshot: helm-chart values used to deploy ESO v0.9.7]
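(The exact values are only in the screenshot; purely as a hedged sketch, an install along these lines would match the described setup - the resource figures and names here are assumptions, not the real configuration:)

# values.yaml sketch for the external-secrets helm chart (illustrative, not the actual values from the screenshot)
# helm repo add external-secrets https://charts.external-secrets.io
# helm install external-secrets external-secrets/external-secrets -n external-secrets --create-namespace --version 0.9.7 -f values.yaml
installCRDs: true      # install the ExternalSecret/SecretStore CRDs together with the chart
replicaCount: 1        # assumption; with leader election only one replica reconciles anyway
resources:             # assumed requests; the real ones are not shown in the issue
  requests:
    cpu: 100m
    memory: 128Mi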

Create the ClusterSecretStore (we are using Terraform for that one) as follows:
[screenshot: ClusterSecretStore definition managed via Terraform]
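(The store itself is only shown in the screenshot; as a rough sketch, a ClusterSecretStore for AWS Secrets Manager looks something like the following - the name and the IRSA-style auth are assumptions, not the actual definition:)

apiVersion: external-secrets.io/v1beta1
kind: ClusterSecretStore
metadata:
  name: aws-secrets-manager         # illustrative name
spec:
  provider:
    aws:
      service: SecretsManager
      region: us-east-1             # region taken from the logs above
      auth:
        jwt:                        # assuming IRSA-based auth; the real store may authenticate differently
          serviceAccountRef:
            name: external-secrets
            namespace: external-secrets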

We are using ArgoCD and a local helm-chart repository to bring up all applications under the new namespace, which in turn creates all the manifests, including the ExternalSecret resources that instruct ESO to create the local secrets (the secret provider in our scenario is AWS Secrets Manager).
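For reference, the ExternalSecret resources generated this way look roughly like the sketch below (names are taken from the logs in this issue; the real manifests are templated by the helm chart and may differ):

apiVersion: external-secrets.io/v1beta1
kind: ExternalSecret
metadata:
  name: artifactory-docker-registry
  namespace: oran-test9                        # one of the freshly created namespaces
spec:
  refreshInterval: 10s                         # the interval mentioned in the bug description
  secretStoreRef:
    kind: ClusterSecretStore
    name: aws-secrets-manager                  # illustrative store name
  target:
    name: artifactory-docker-registry          # the Kubernetes Secret ESO should create
  dataFrom:
    - extract:
        key: global/artifactory/docker-registry   # illustrative path in AWS Secrets Manager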

Kubernetes version: EKS 1.26
ESO version: 0.9.7

Expected behavior
Creation of new secrets locally by ESO, fetched from AWS Secrets Manager, should happen based on refreshInterval (or at least in some predictable way). As of now it takes place seemingly at random (between 4 and 7 minutes), which delays the creation of new namespaces and impacts our test automation mechanism.

Screenshots
Attached inline

Additional context
N/A

oranerez added the kind/bug label on Nov 2, 2023
moolen (Member) commented Nov 2, 2023

Hey 👋, thanks for your report!

From what I can see: the ExternalSecret had a lastTransitionTime at 11:06 (where it presumably was successfully synced for the first time). The logs start at 11:06 (after the first transition time).

  1. Can you provide the logs between 11:01 and 11:06? That's where it should have been reconciling.
  2. If you have a chance, can you re-run the experiment with loglevel=debug? That will provide a bit more context.
  3. Do you have metrics for the relevant ExternalSecret/SecretStores available?

It would be great if you could capture the state of the ExternalSecret/SecretStore in a broken condition and in a working condition. That would help immensely with the debugging.

So, what could be the case is that the ExternalSecret may be running into an exponential backoff, or ESO's queue was too long and it was not able to process the resources in time (though 6 min would be insanely long for that).

oranerez (Author) commented Nov 6, 2023

Hi @moolen,
Regarding Q1 & Q2 - I have attached a new set of logs in debugging loglevel so it can be seen from the beginning till the creation and updated secrets data (had to mask most of the dat ain the log with reference to secrets name and keep only the following secret unmasked - artifactory-docker-registry taken from path global/artifactory/docker-registry in ASM.
Regarding Q3 we are not using any metrics or expose any from within ESO.
It took the secret more than 7 minutes to get updated after manifest already created for ExternalSecrets (see attached screenshot).
image
I hope the details can now shade the light a bit more regarding the issue facing using ESO.

logs for new running session with debug log level:
external-secrets.log

moolen (Member) commented Nov 6, 2023

I took a quick look at the provided information:

Small remark: you have to look at the status.conditions[].lastTransitionTime - that shows when it first got into Ready=true state. The refreshTime shows the last time it has been refreshed.
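As a hedged illustration, the two timestamps live in the ExternalSecret status roughly like this (values here are illustrative):

status:
  refreshTime: "2023-11-06T08:49:36Z"             # last time the secret was refreshed
  conditions:
    - type: Ready
      status: "True"
      reason: SecretSynced
      lastTransitionTime: "2023-11-06T08:42:24Z"  # when it first became Ready=true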

It took a little less than 4 minutes for it to be reconciled.
Looking at the logs, I can see it was reconciled at 08:42:24Z:

$ grep "artifactory-docker-registry" external-secrets.log | grep "oran-test9"
{"level":"info","ts":1699260144.8180609,"logger":"controllers.ExternalSecret","msg":"reconciled secret","ExternalSecret":{"name":"artifactory-docker-registry","namespace":"oran-test9"}}
{"level":"debug","ts":1699260144.8180957,"logger":"events","msg":"Updated Secret","type":"Normal","object":{"kind":"ExternalSecret","namespace":"oran-test9","name":"artifactory-docker-registry","uid":"bb0ccba4-60ce-4652-b1c9-984d675c3b50","apiVersion":"external-secrets.io/v1beta1","resourceVersion":"2437963201"},"reason":"Updated"}
{"level":"debug","ts":1699260360.4635084,"logger":"controllers.ExternalSecret","msg":"reconciled secret","ExternalSecret":{"name":"artifactory-docker-registry","namespace":"oran-test9"}}
{"level":"debug","ts":1699260360.4635434,"logger":"events","msg":"Updated Secret","type":"Normal","object":{"kind":"ExternalSecret","namespace":"oran-test9","name":"artifactory-docker-registry","uid":"bb0ccba4-60ce-4652-b1c9-984d675c3b50","apiVersion":"external-secrets.io/v1beta1","resourceVersion":"2437972595"},"reason":"Updated"}
{"level":"debug","ts":1699260576.6623328,"logger":"controllers.ExternalSecret","msg":"reconciled secret","ExternalSecret":{"name":"artifactory-docker-registry","namespace":"oran-test9"}}
{"level":"debug","ts":1699260576.6623802,"logger":"events","msg":"Updated Secret","type":"Normal","object":{"kind":"ExternalSecret","namespace":"oran-test9","name":"artifactory-docker-registry","uid":"bb0ccba4-60ce-4652-b1c9-984d675c3b50","apiVersion":"external-secrets.io/v1beta1","resourceVersion":"2437981883"},"reason":"Updated"}

That leaves us with two assumptions:

  1. this particular ExternalSecret resource was not enqueued by controller-runtime due to high load (probable)
  2. this particular ExternalSecret took too long to reconcile (e.g. due to network timeout, congestion, etc.)

ESO only logs once it has finished reconciling (in both the error and non-error cases).

Both assumptions can be proven/disproven by looking at the metrics, see here. The interesting part is (1): workqueue_depth and (2): reconcile latency.
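(If metrics are not exposed yet, one hedged way to turn them on via the helm chart is sketched below; the value names should be verified against the chart version in use:)

# values.yaml sketch for exposing controller metrics (verify field names against your chart version)
serviceMonitor:
  enabled: true          # create a Prometheus ServiceMonitor for the controller
metrics:
  service:
    enabled: true        # expose the metrics port as a Service
# Controller-runtime metrics of interest:
#   workqueue_depth                            - how many items are waiting to be reconciled
#   controller_runtime_reconcile_time_seconds  - per-reconcile latency histogram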


guyelia commented Nov 30, 2023

Hey @moolen, I'm from @oranerez's team.
I'm attaching the metrics we see in the environment. First, workqueue_depth:
[screenshot: workqueue_depth metric, roughly 1,500 items in the queue]
Latency:
[screenshot: reconcile latency]
Reconciles per minute:
[screenshot: reconciles per minute]

It looks like we cannot reconcile the secrets in time; I know we have a LOT of them.
Is this something we can scale? The pod resources are low (<1 CPU, <1 GB RAM); maybe scale out?

Thanks again!

moolen (Member) commented Nov 30, 2023

Increasing --concurrent should help you fix the symptoms, but the issue seems to be the high reconcile time. It looks like most of the ExternalSecret resources are not reconciled successfully.

Increase --concurrent slowly; otherwise you may hammer the upstream API and run into rate limits.

See: https://external-secrets.io/latest/api/controller-options/
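One hedged way to pass that flag through the helm chart is via its extraArgs map (assuming the chart renders each entry as --key=value; adjust to however ESO is deployed):

# values.yaml sketch
extraArgs:
  concurrent: "4"    # start small (e.g. 2-4) and watch for provider rate limits before going higher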

gusfcarvalho added the triage/support label on Nov 30, 2023
@Sartigan

We can't increase the --concurrent too much (up to 4 in our case) because the API Server throttles the operator.

Is there anything else that can be done to resolve the issue? Our cluster contains 250-300 ExternalSecrets and we also experience a delay of up to 4 minutes before the secret is created.

However, a different cluster with 80 ExternalSecrets has its secrets created instantly.

Skarlso (Contributor) commented Jan 10, 2024

Stupid question, but can we increase the ESO replica count? @moolen?

moolen (Member) commented Jan 10, 2024

ESO runs with leader election, so only a single replica will be reconciling. @Sartigan, can you provide more information on how you set up ESO (helm values etc.)? It would probably be best to create a separate issue (or are you with the OP?).

Are you sure it is server-side? We also have client-side throttling, which can be changed with --client-qps and --client-burst. If it is server-side, you may need to talk to your cluster management team or your managed service provider.

What kind of APIs do you use in the ExternalSecret? dataFrom.extract, dataFrom.find? What creationPolicy do you use, is it merge? That also has an effect on ESO's behaviour and the kind of access patterns it creates.

You can give --enable-secrets-caching or --enable-configmaps-caching a shot; that changes the access pattern towards the API server (but increases memory consumption). In your cluster the difference could be significant.
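Continuing the extraArgs sketch from earlier, those flags would look roughly like this (the numbers are illustrative starting points, not recommendations):

# values.yaml sketch
extraArgs:
  enable-secrets-caching: "true"       # cache Secrets in the controller instead of hitting the API server on every read
  enable-configmaps-caching: "true"    # same for ConfigMaps, if they are referenced
  client-qps: "50"                     # illustrative; raises the client-side rate limit
  client-burst: "100"                  # illustrative; raises the client-side burst limit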

@Sartigan

--enable-secrets-caching worked perfectly for us. I appreciate the support.

moolen (Member) commented Jan 11, 2024

Wait, did that resolve all the problems?!
I'm trying to understand this, maybe it would be reasonable to change the default for this? 🤔

@Sartigan

Our problem was the same as the OP's: when a cluster has a high number of ExternalSecret resources, workqueue_depth becomes high and every new ExternalSecret takes about 3-4 minutes to generate its secret.

Tweaking --concurrent was supposed to solve that, but it barely did anything for us.

Adding --enable-secrets-caching fixed it for us. workqueue_depth is now practically zero; our RAM consumption went up (to around 2Gi), but that's acceptable, and all new ExternalSecrets generate their secrets instantly.


This issue is stale because it has been open 90 days with no activity. Remove stale label or comment or this will be closed in 30 days.

github-actions bot added the Stale label on Apr 18, 2024

pre commented Apr 25, 2024

We are experiencing the same slowness -- we didn't try with --enable-secrets-caching yet, but that's a good tip!



pre commented Apr 26, 2024

In our case, we are synchronizing an image pull secret and a couple of other static secrets to all namespaces. We have ~800 namespaces, each having one ExternalSecret. We also tried having only a single ClusterExternalSecret and annotating the namespaces, but it didn't perform any better.
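For reference, a ClusterExternalSecret that fans a static secret out to selected namespaces looks roughly like the sketch below (the selector, names, and remote key are illustrative, not the exact resource used here):

apiVersion: external-secrets.io/v1beta1
kind: ClusterExternalSecret
metadata:
  name: image-pull-secret              # illustrative name
spec:
  namespaceSelector:
    matchLabels:
      sync-pull-secret: "true"         # illustrative label that opts a namespace in
  refreshTime: 1h                      # how often the set of matching namespaces is re-evaluated
  externalSecretSpec:                  # the ExternalSecret stamped into each matching namespace
    refreshInterval: 1h                # the secret is static, so a long interval is fine
    secretStoreRef:
      kind: ClusterSecretStore
      name: aws-secrets-manager        # illustrative store name
    target:
      name: image-pull-secret          # the Secret created in each namespace
    dataFrom:
      - extract:
          key: global/registry-credentials   # illustrative path in the provider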

The Secret synchronization speed and the workqueue depth have become an issue when all the ExternalSecrets are due to be refreshed at the same time.

With leader election, only a single replica of external-secrets-operator processes the workqueue. This obviously becomes a bottleneck.

In our use case the synced secret is static and will rarely change (rotation period is a matter of months).

By disabling leader election and having concurrent:1, multiple replicas were able to process the queue without throttling messages in the logs. 
EDIT: Disabling leader election was not a correct fix. It resulted in multiple replicas handling the same ExternalSecret at the same time, where the latter operation would sometimes fail (not always) and mark the ExternalSecret with an error state.

We settled on the following, with RAM usage at ~100Mi across 800 namespaces:

      - args:
        - '--enable-leader-election=true' 
        - '--enable-secrets-caching=true'
        - '--concurrent=5'

In our case, concurrent:10 with leader election caused throttling messages to appear in the log and didn't improve the total speed.

It seems --enable-secrets-caching=true is helping with the delay. Without this setting, the single leader external-secrets replica runs at ~100-150 millicores while the other replica idles. With this setting, much less CPU processing is needed and the ExternalSecret resource queue is drained faster.


github-actions bot removed the Stale label on May 15, 2024