Operator may crash on the first start #3321

Open
barkbay opened this issue Jun 26, 2020 · 1 comment

Comments

@barkbay barkbay commented Jun 26, 2020

When the operator is deployed for the first time, the Secret that contains the webhook certificate is populated by the operator itself (if the webhook certificate is managed by ECK, which is the default).

Unfortunately, it can take some time for the content of the Secret to be propagated into the container. A wait loop was introduced in #2312, but my testing of 1.2.0-bc2 suggests that the current timeout (30 seconds) is too low.
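For readers unfamiliar with this kind of wait loop, here is a minimal sketch of polling for a file with a timeout. It is not the code from #2312: the function name `waitForFile`, the error value, and the demo in `main` (which simulates slow propagation by writing a temporary file after a short delay) are all assumptions made for illustration.

```go
package main

import (
	"errors"
	"fmt"
	"os"
	"path/filepath"
	"time"
)

// errTimeout is returned when the file does not appear before the deadline.
// (Illustrative name, not taken from ECK.)
var errTimeout = errors.New("timed out waiting for file")

// waitForFile checks every interval whether path exists and gives up once
// timeout has elapsed. Hypothetical helper, not the ECK implementation.
func waitForFile(path string, timeout, interval time.Duration) error {
	deadline := time.Now().Add(timeout)
	for {
		_, err := os.Stat(path)
		if err == nil {
			return nil // file is present on the filesystem
		}
		if !errors.Is(err, os.ErrNotExist) {
			return err // unexpected error, e.g. permission denied
		}
		if time.Now().After(deadline) {
			return fmt.Errorf("%w: %s (waited %s)", errTimeout, path, timeout)
		}
		time.Sleep(interval)
	}
}

func main() {
	// Simulate delayed Secret propagation: the file shows up after 300ms.
	dir, err := os.MkdirTemp("", "serving-certs")
	if err != nil {
		fmt.Println("cannot create temp dir:", err)
		return
	}
	defer os.RemoveAll(dir)
	certPath := filepath.Join(dir, "tls.crt")

	go func() {
		time.Sleep(300 * time.Millisecond)
		_ = os.WriteFile(certPath, []byte("dummy"), 0o600)
	}()

	if err := waitForFile(certPath, 2*time.Second, 100*time.Millisecond); err != nil {
		fmt.Println("certificate not available:", err)
		return
	}
	fmt.Println("certificate file present")
}
```

With this shape, the only tunable is the timeout, which is why a propagation delay longer than the configured value makes the operator give up on first start.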

I did a few tests to understand what timeout value would be acceptable (ECK version: 1.2.0-bc2 on v1.15.12-gke.2):

  • 79 seconds:
{"log.level":"info","@timestamp":"2020-06-25T16:49:01.036Z","log.logger":"manager","message":"Polling for the webhook certificate to be available","service.version":"1.2.0-d53bc36a","service.type":"eck","ecs.version":"1.4.0","path":"/tmp/k8s-webhook-server/serving-certs/tls.crt"}
{"log.level":"debug","@timestamp":"2020-06-25T16:49:01.036Z","log.logger":"manager","message":"Webhook certificate file not present on filesystem yet","service.version":"1.2.0-d53bc36a","service.type":"eck","ecs.version":"1.4.0","path":"/tmp/k8s-webhook-server/serving-certs/tls.crt"}
...
{"log.level":"debug","@timestamp":"2020-06-25T16:50:20.037Z","log.logger":"manager","message":"Webhook certificate file present on filesystem","service.version":"1.2.0-d53bc36a","service.type":"eck","ecs.version":"1.4.0","path":"/tmp/k8s-webhook-server/serving-certs/tls.crt"}
  • 90 seconds:
{"log.level":"info","@timestamp":"2020-06-26T06:17:34.985Z","log.logger":"manager","message":"Polling for the webhook certificate to be available","service.version":"1.2.0-d53bc36a","service.type":"eck","ecs.version":"1.4.0","path":"/tmp/k8s-webhook-server/serving-certs/tls.crt"}
{"log.level":"debug","@timestamp":"2020-06-26T06:17:34.985Z","log.logger":"manager","message":"Webhook certificate file not present on filesystem yet","service.version":"1.2.0-d53bc36a","service.type":"eck","ecs.version":"1.4.0","path":"/tmp/k8s-webhook-server/serving-certs/tls.crt"}
...
{"log.level":"debug","@timestamp":"2020-06-26T06:19:04.985Z","log.logger":"manager","message":"Webhook certificate file present on filesystem","service.version":"1.2.0-d53bc36a","service.type":"eck","ecs.version":"1.4.0","path":"/tmp/k8s-webhook-server/serving-certs/tls.crt"}
  • 80 seconds:
{"log.level":"info","@timestamp":"2020-06-26T06:22:56.395Z","log.logger":"manager","message":"Polling for the webhook certificate to be available","service.version":"1.2.0-d53bc36a","service.type":"eck","ecs.version":"1.4.0","path":"/tmp/k8s-webhook-server/serving-certs/tls.crt"}
{"log.level":"debug","@timestamp":"2020-06-26T06:22:56.395Z","log.logger":"manager","message":"Webhook certificate file not present on filesystem yet","service.version":"1.2.0-d53bc36a","service.type":"eck","ecs.version":"1.4.0","path":"/tmp/k8s-webhook-server/serving-certs/tls.crt"}

{"log.level":"debug","@timestamp":"2020-06-26T06:24:16.396Z","log.logger":"manager","message":"Webhook certificate file present on filesystem","service.version":"1.2.0-d53bc36a","service.type":"eck","ecs.version":"1.4.0","path":"/tmp/k8s-webhook-server/serving-certs/tls.crt"}
{"log.level":"debug","@timestamp":"2020-06-26T06:24:16.396Z","log.logger":"dynamic-enqueue-request","message":"Adding new handler registration","service.version":"1.2.0-d53bc36a","service.type":"eck","ecs.version":"1.4.0","key":"-owner","current_registrations":{}}
  • 66 seconds:
{"log.level":"info","@timestamp":"2020-06-26T06:31:36.446Z","log.logger":"manager","message":"Polling for the webhook certificate to be available","service.version":"1.2.0-d53bc36a","service.type":"eck","ecs.version":"1.4.0","path":"/tmp/k8s-webhook-server/serving-certs/tls.crt"}
{"log.level":"debug","@timestamp":"2020-06-26T06:31:36.446Z","log.logger":"manager","message":"Webhook certificate file not present on filesystem yet","service.version":"1.2.0-d53bc36a","service.type":"eck","ecs.version":"1.4.0","path":"/tmp/k8s-webhook-server/serving-certs/tls.crt"}
...
{"log.level":"debug","@timestamp":"2020-06-26T06:32:42.446Z","log.logger":"manager","message":"Webhook certificate file present on filesystem","service.version":"1.2.0-d53bc36a","service.type":"eck","ecs.version":"1.4.0","path":"/tmp/k8s-webhook-server/serving-certs/tls.crt"}
  • A first option would be to increase the timeout to something like 90 to 100 seconds, but that seems like a high value to me.

  • A second option would be to update the Pod that is running the operator by using the MarkPodsAsUpdated function. Another benefit is that the certificate would also be propagated faster when it is renewed. However, it means that the operator must be able to update its own Pod (see #496 and #568 for more context).
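To make the second option more concrete, here is a rough sketch of the kind of change it implies: writing a fresh value into an annotation on the operator's own Pod, so that the Pod object is updated and the kubelet has a reason to sync it (and its mounted Secret) sooner. This is not the MarkPodsAsUpdated implementation. The annotation key `example.com/last-touched` and the helper `buildTouchPatch` are hypothetical, and the sketch only builds the JSON merge patch body; sending it would require a Kubernetes client and permission to patch Pods.

```go
package main

import (
	"encoding/json"
	"fmt"
	"time"
)

// touchAnnotation is a hypothetical annotation key used only in this sketch.
const touchAnnotation = "example.com/last-touched"

// buildTouchPatch returns a JSON merge patch body that sets touchAnnotation
// to value. Applying it changes the Pod's metadata without touching its spec.
func buildTouchPatch(value string) ([]byte, error) {
	patch := map[string]interface{}{
		"metadata": map[string]interface{}{
			"annotations": map[string]string{
				touchAnnotation: value,
			},
		},
	}
	return json.Marshal(patch)
}

func main() {
	// A timestamp is a convenient value because it differs on every call.
	patch, err := buildTouchPatch(time.Now().UTC().Format(time.RFC3339Nano))
	if err != nil {
		fmt.Println("cannot build patch:", err)
		return
	}
	fmt.Println(string(patch))
}
```

The same patch could be reapplied whenever the certificate is renewed, which is where the faster propagation on renewal would come from.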

@pebrc pebrc commented Jun 26, 2020

I fear that secret propagation times depend not only on the k8s cluster version but also on the size of the cluster, the number of secrets, or other API resources. So any timeout we pick might fail in some cases.

That makes the second approach you suggested more compelling, i.e. marking the operator Pod as updated to speed up secret propagation. If I am not mistaken, the operator already has the necessary permissions (?)
