
Readiness probe is timing out #2259

Closed
l0x-c0d3z opened this issue Dec 12, 2019 · 3 comments


@l0x-c0d3z

Bug Report

What did you do?

I'm running a three-node cluster with write-heavy loads. I sometimes run queries on a pretty large dataset (~600 GB).

What did you expect to see?

The nodes stay up and running, and Kibana keeps working.

What did you see instead? Under which circumstances?

Soon after the initial query, Kibana stops working and every request starts to fail. I get the following in the Kibana log:

{"type":"log","@timestamp":"2019-12-12T17:56:27Z","tags":["security","error"],"pid":6,"message":"Error registering Kibana Privileges with Elasticsearch for kibana-.kibana: No Living connections"}

When I check the pods, they are marked as not ready.

pod/noscit-production-logs-es-basic-0         0/1     Running   0          28h
pod/noscit-production-logs-es-basic-1         0/1     Running   0          28h
pod/noscit-production-logs-es-basic-2         0/1     Running   0          31h

I have events like this on my pods:

  Warning  Unhealthy  19s (x15444 over 28h)  kubelet, node-002  Readiness probe failed:

However, all my curl requests to the node seem to work fine.

Because no node is considered ready, the service reports no endpoints:

kubectl describe svc noscit-production-logs-es-http
Name:              noscit-production-logs-es-http
Namespace:         noscit-production
Labels:            common.k8s.elastic.co/type=elasticsearch
                   elasticsearch.k8s.elastic.co/cluster-name=noscit-production-logs
Annotations:       <none>
Selector:          common.k8s.elastic.co/type=elasticsearch,elasticsearch.k8s.elastic.co/cluster-name=noscit-production-logs
Type:              ClusterIP
IP:                x.x.x.x
Port:              https  9200/TCP
TargetPort:        9200/TCP
Endpoints:         
Session Affinity:  None
Events:            <none>

I tracked it down to the probe script (/mnt/elastic-internal/scripts/readiness-probe-script.sh). It performs a curl request with a 3-second timeout, and it seems pretty obvious that I'm hitting that timeout.

[root@noscit-production-logs-es-basic-1 elasticsearch]# time /mnt/elastic-internal/scripts/readiness-probe-script.sh ; echo $?

real	0m3.006s
user	0m0.024s
sys	0m0.000s
1

I quickly confirmed that this was indeed the case:

[root@noscit-production-logs-es-basic-1 elasticsearch]# time curl -k ${BASIC_AUTH} $ENDPOINT
10.244.1.36  65 99 55 2.24 3.52 4.47 dilm * noscit-production-logs-es-basic-2
10.244.2.148 63 99 49 7.99 6.43 5.38 dilm - noscit-production-logs-es-basic-1
10.244.0.132 63 99 44 3.99 4.77 5.13 dilm - noscit-production-logs-es-basic-0

real	0m3.069s
user	0m0.020s
sys	0m0.000s
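
For reference, bounding curl the same way the script does reproduces the failure (the flags here are only an approximation of what the ECK script actually runs):

# same request as above, but capped at 3 seconds like the probe script's curl
time curl -sk --max-time 3 ${BASIC_AUTH} $ENDPOINT ; echo $?
# when the limit is hit, curl exits with 28 (operation timed out)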

Environment

  • ECK version: 1.0
  • Kubernetes information: On premise, 1.14
Client Version: version.Info{Major:"1", Minor:"15", GitVersion:"v1.15.3", GitCommit:"2d3c76f9091b6bec110a5e63777c332469e0cba2", GitTreeState:"clean", BuildDate:"2019-08-19T11:13:54Z", GoVersion:"go1.12.9", Compiler:"gc", Platform:"linux/amd64"}
Server Version: version.Info{Major:"1", Minor:"14", GitVersion:"v1.14.0", GitCommit:"641856db18352033a0d96dbc99153fa3b27298e5", GitTreeState:"clean", BuildDate:"2019-03-25T15:45:25Z", GoVersion:"go1.12.1", Compiler:"gc", Platform:"linux/amd64"}

  • Resource definition:
apiVersion: elasticsearch.k8s.elastic.co/v1beta1
kind: Elasticsearch
metadata:
  name: noscit-production-logs
  labels:
    helm.sh/chart: elasticsearch-0.1.0
    app.kubernetes.io/name: elasticsearch
    app.kubernetes.io/managed-by: Helm
    app.kubernetes.io/instance: elastic-production
    app.kubernetes.io/component: elasticsearch
spec:
  version: 7.4.0
  nodeSets:
  - name: basic
    count: 3
    podTemplate:
      metadata:
        labels:
          helm.sh/chart: elasticsearch-0.1.0
          app.kubernetes.io/name: elasticsearch
          app.kubernetes.io/managed-by: ECK
          app.kubernetes.io/instance: elastic-production
          app.kubernetes.io/component: elasticsearch
      spec:
        containers:
        - name: elasticsearch
          env:
          - name: ES_JAVA_OPTS
            value: -Xms7g -Xmx7g
          resources:
            limits:
              cpu: 4
              memory: 8Gi
            requests:
              cpu: 2
              memory: 8Gi
    config:
      node.master: true
      node.data: true
      node.ingest: true
      xpack.security.authc.realms:
        native.native1:
          order: 1
    volumeClaimTemplates:
    - metadata:
        name: elasticsearch-data
        labels:
          helm.sh/chart: elasticsearch-0.1.0
          app.kubernetes.io/name: elasticsearch
          app.kubernetes.io/managed-by: Helm
          app.kubernetes.io/instance: elastic-production
          app.kubernetes.io/component: elasticsearch
      spec:
        accessModes:
        - ReadWriteOnce
        resources:
          requests:
            storage: 1000Gi
        storageClassName: local-slow

I feel we should either:

  • make the timeout bigger
  • make the timeout configurable

Any ideas for workarounds would be greatly appreciated.

@sebgl
Contributor

sebgl commented Dec 12, 2019

Relates #2248

@anyasabo
Contributor

real 0m3.006s

So close! 😞

Thanks for bringing this up. I opened #2260 to allow it to be configurable. In the meantime, you should be able to override the readiness probe in the podTemplate. The easiest solution may be to simply use a TCP check as described here:
https://kubernetes.io/docs/tasks/configure-pod-container/configure-liveness-readiness-startup-probes/#define-a-tcp-liveness-probe

which means some pods may join the service before they are ready, but that seems to be an improvement over the existing situation; a rough sketch of what that override could look like is below. If you have any issues or further questions, please let us know.
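
Something along these lines in the nodeSet's podTemplate should do it (untested, and the probe parameters are just example values; the important part is the tcpSocket check on port 9200, and the container name elasticsearch matches the one in your manifest):

spec:
  nodeSets:
  - name: basic
    count: 3
    podTemplate:
      spec:
        containers:
        - name: elasticsearch
          readinessProbe:
            # plain TCP check on the HTTP port instead of the default exec script,
            # so readiness no longer depends on the 3-second curl inside that script
            tcpSocket:
              port: 9200
            initialDelaySeconds: 10
            periodSeconds: 10
            timeoutSeconds: 5
            failureThreshold: 3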

@anyasabo
Contributor

The above-referenced PR has been merged and should land in the next release. In the meantime the above-mentioned workaround exists, so I'll go ahead and close this. Feel free to re-open if there are further questions/comments though.
