This repository has been archived by the owner on Jul 26, 2022. It is now read-only.

status check went boom #826

Closed
djfinnoy opened this issue Aug 23, 2021 · 13 comments

@djfinnoy

Possibly same issue as #362.

Running v8.3.0, installed via Helm chart, on GKE (1.20.8-gke.900).
The cluster gets bootstrapped by an automated pipeline, and external-secrets is expected to retrieve a single secret from Google Secret Manager.
It works about 50% of the time.

When it doesn't work, I'm seeing the following lines repeated in the pod logs:

{"level":50,"message_time":"2021-08-23T08:31:59.073Z","pid":18,"hostname":"external-secrets-kubernetes-external-secrets-b44fd8967-dgplh","payload":{},"msg":"status check went boom for oauth2-proxy/sm-oauth-github"}
{"level":30,"message_time":"2021-08-23T08:32:59.123Z","pid":18,"hostname":"external-secrets-kubernetes-external-secrets-b44fd8967-dgplh","msg":"No watch event for 60000 ms, restarting watcher for *"}
{"level":30,"message_time":"2021-08-23T08:32:59.123Z","pid":18,"hostname":"external-secrets-kubernetes-external-secrets-b44fd8967-dgplh","msg":"Stopping watch stream for namespace * due to event: END"}
{"level":30,"message_time":"2021-08-23T08:32:59.131Z","pid":18,"hostname":"external-secrets-kubernetes-external-secrets-b44fd8967-dgplh","msg":"starting poller for oauth2-proxy/sm-oauth-github"}
{"level":50,"message_time":"2021-08-23T08:32:59.131Z","pid":18,"hostname":"external-secrets-kubernetes-external-secrets-b44fd8967-dgplh","payload":{},"msg":"status check went boom for oauth2-proxy/sm-oauth-github"

Deleting the pod solves the problem, but I have not checked whether the problem re-occurs.
I never saw this problem before upgrading from 6.2.0 to 8.3.0.

@Sukovitsyn

I have the same problem in OpenShift bare metal cluster.
Anybody have a workaround to avoid this?

@derekwilliamsliquidx

derekwilliamsliquidx commented Oct 28, 2021

I have the same issue in Azure Kubernetes 1.20.7.

I had a similar issue in 6.4.0, and this persists in 8.3.2

Kubernetes External Secrets Version: 8.3.2
Vault Cluster backed by postgres: 1.7.3

@derekwilliamsliquidx

This happened to me again at 18:36. Vault cluster was just fine and I could log in. External secrets simply stopped polling. This is affecting production systems. Please advise.

@derekwilliamsliquidx

This happened again at 10:01am

{"level":30,"message_time":"2021-10-27T18:20:53.906Z","pid":17,"hostname":"kubernetes-external-secrets-69c544c96d-tkncr","msg":"No watch event for 60000 ms, restarting watcher for *"}

@Flydiverny
Member

Flydiverny commented Oct 31, 2021

> This happened again at 10:01am
>
> {"level":30,"message_time":"2021-10-27T18:20:53.906Z","pid":17,"hostname":"kubernetes-external-secrets-69c544c96d-tkncr","msg":"No watch event for 60000 ms, restarting watcher for *"}

I'm not following; this particular log line is expected even in a healthy cluster or setup. The operator watches an HTTP stream, and for some reason those streams are not being cleanly closed towards the Kubernetes API, so as a precaution the operator restarts the watch every 60 seconds (or whatever is configured). Otherwise the operator could hang forever, not knowing whether nothing happened or whether no new events will ever be published on the stream. For more background, see #362.
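
For illustration, here's a minimal Node.js sketch of that watchdog pattern. This is not the actual KES source; the function names are hypothetical and the hard-coded 60000 ms timeout simply mirrors the value seen in the logs.

```js
// Hypothetical sketch: a watch stream guarded by a restart timer.
const WATCH_TIMEOUT_MS = 60000

function watchWithRestart (logger, startWatcher, stopWatcher) {
  let deadline

  const resetDeadline = () => {
    clearTimeout(deadline)
    deadline = setTimeout(() => {
      logger.info(`No watch event for ${WATCH_TIMEOUT_MS} ms, restarting watcher for *`)
      stopWatcher() // ends the HTTP watch stream ("Stopping watch stream ... due to event: END")
      watchWithRestart(logger, startWatcher, stopWatcher) // open a fresh stream
    }, WATCH_TIMEOUT_MS)
  }

  resetDeadline()
  startWatcher((event) => {
    // Any event from the Kubernetes API proves the stream is alive,
    // so push the restart deadline forward.
    resetDeadline()
    // ... handle the ADDED/MODIFIED/DELETED event here
  })
}
```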

@Flydiverny
Member

> Possibly same issue as #362.
>
> Running v8.3.0, installed via Helm chart, on GKE (1.20.8-gke.900). The cluster gets bootstrapped by an automated pipeline, and external-secrets is expected to retrieve a single secret from Google Secret Manager. It works about 50% of the time.
>
> When it doesn't work, I'm seeing the following lines repeated in the pod logs:
>
> {"level":50,"message_time":"2021-08-23T08:31:59.073Z","pid":18,"hostname":"external-secrets-kubernetes-external-secrets-b44fd8967-dgplh","payload":{},"msg":"status check went boom for oauth2-proxy/sm-oauth-github"}
> {"level":30,"message_time":"2021-08-23T08:32:59.123Z","pid":18,"hostname":"external-secrets-kubernetes-external-secrets-b44fd8967-dgplh","msg":"No watch event for 60000 ms, restarting watcher for *"}
> {"level":30,"message_time":"2021-08-23T08:32:59.123Z","pid":18,"hostname":"external-secrets-kubernetes-external-secrets-b44fd8967-dgplh","msg":"Stopping watch stream for namespace * due to event: END"}
> {"level":30,"message_time":"2021-08-23T08:32:59.131Z","pid":18,"hostname":"external-secrets-kubernetes-external-secrets-b44fd8967-dgplh","msg":"starting poller for oauth2-proxy/sm-oauth-github"}
> {"level":50,"message_time":"2021-08-23T08:32:59.131Z","pid":18,"hostname":"external-secrets-kubernetes-external-secrets-b44fd8967-dgplh","payload":{},"msg":"status check went boom for oauth2-proxy/sm-oauth-github"}
>
> Deleting the pod solves the problem, but I have not checked whether the problem re-occurs. I never saw this problem before upgrading from 6.2.0 to 8.3.0.

Hard to tell what's going wrong from this, and it's odd that there's no error being logged, since this particular log line (status check went boom) does try to log an error 😕
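
One possible explanation for the empty "payload":{} (an assumption based on the pino-style log format, not something confirmed from the KES code): an Error's message and stack are non-enumerable properties, so serializing an Error object under a plain key yields an empty object unless a serializer is configured for that key. A small sketch with a hypothetical error:

```js
const pino = require('pino')

// Without a serializer for "payload", an Error logged under that key
// serializes to {} because its message/stack are non-enumerable.
const plainLogger = pino()
plainLogger.error(
  { payload: new Error('backend lookup failed') },
  'status check went boom for oauth2-proxy/sm-oauth-github'
)
// => ..."payload":{},"msg":"status check went boom for oauth2-proxy/sm-oauth-github"

// With pino's standard error serializer attached to that key,
// the type, message and stack would be preserved instead.
const loggerWithSerializer = pino({
  serializers: { payload: pino.stdSerializers.err }
})
loggerWithSerializer.error(
  { payload: new Error('backend lookup failed') },
  'status check went boom for oauth2-proxy/sm-oauth-github'
)
// => ..."payload":{"type":"Error","message":"backend lookup failed","stack":"..."}
```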

@derekwilliamsliquidx

> > This happened again at 10:01am
> >
> > {"level":30,"message_time":"2021-10-27T18:20:53.906Z","pid":17,"hostname":"kubernetes-external-secrets-69c544c96d-tkncr","msg":"No watch event for 60000 ms, restarting watcher for *"}
>
> I'm not following; this particular log line is expected even in a healthy cluster or setup. The operator watches an HTTP stream, and for some reason those streams are not being cleanly closed towards the Kubernetes API, so as a precaution the operator restarts the watch every 60 seconds (or whatever is configured). Otherwise the operator could hang forever, not knowing whether nothing happened or whether no new events will ever be published on the stream. For more background, see #362.

Yes, this is the last log line before KES simply stops upserting. There are NO errors output for me to provide you. The poller simply stops polling. This is happening DAILY at this point. We need a fix or I'm going to have to stop using this engine.

@Nick-Triller
Contributor

I also observed this issue in bare metal Kubernetes clusters with Vault. Some secrets are no longer updated, with no error besides "status check went boom". This also happened once with a completely new ExternalSecret: it was not updated at all (and there was no error message in the custom resource).

Kubernetes version: v1.20.11
KES version: 8.3.0
Vault version: 1.7.1+ent

@Flydiverny
Member

I wish I had an easy fix 😄
However, I can only advise looking into ESO: https://github.com/external-secrets/external-secrets
See #864

@moolen
Member

moolen commented Dec 2, 2021

Can someone provide metrics for KES around the time it stops polling? I'm particularly interested in CPU/memory/netstat and nodejs_*-related metrics. It's odd that the event loop seems to stop 🤷.
Let me find some time to look into that. I think we can improve a little on the observability side.
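
For anyone who wants to poke at this before a release, below is a rough sketch of exposing nodejs_* runtime metrics with prom-client (the library whose default metrics use the nodejs_* prefix). prom-client is an assumption here, and the port and path are illustrative, not the KES defaults.

```js
const http = require('http')
const client = require('prom-client')

// Registers the default Node.js runtime metrics: event loop lag,
// heap usage, active handles/requests, GC timings, etc.
client.collectDefaultMetrics()

// Serve them on a plain HTTP endpoint for Prometheus to scrape.
http.createServer(async (req, res) => {
  if (req.url === '/metrics') {
    res.setHeader('Content-Type', client.register.contentType)
    res.end(await client.register.metrics())
  } else {
    res.statusCode = 404
    res.end()
  }
}).listen(9127) // illustrative port
```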

@moolen
Member

moolen commented Dec 7, 2021

I couldn't find an issue using the AWS provider. I let it run for 3 days with 1/10/100/500 Kind=ExternalSecret resources. CPU usage was on the high side, but it was syncing properly.
I opened a PR (see above) that exports additional nodejs_* runtime metrics, which will be helpful for debugging this issue. Please apply the patch manually or wait for a release.
Please provide some additional information, such as how many ExternalSecrets you have and your POLLING_INTERVAL, and please attach strace to the process in the container so we might see where it's stuck.

@github-actions

github-actions bot commented Mar 8, 2022

This issue is stale because it has been open 90 days with no activity. Remove stale label or comment or this will be closed in 30 days.

@github-actions github-actions bot added the Stale label Mar 8, 2022
@github-actions

github-actions bot commented Apr 7, 2022

This issue was closed because it has been stalled for 30 days with no activity.

@github-actions github-actions bot closed this as completed Apr 7, 2022