
Readiness probe is timing out #2259

Closed
l0x-c0d3z opened this issue Dec 12, 2019 · 3 comments


@l0x-c0d3z

Bug Report

What did you do?

I'm running a three-node cluster with write-heavy loads. I sometimes run queries on a pretty large dataset (~600 GB).

What did you expect to see?

The nodes stay up and running, and Kibana keeps working.

What did you see instead? Under which circumstances?

Soon after the initial query, Kibana stops working and every request starts to fail. I get the following in the Kibana log:

{"type":"log","@timestamp":"2019-12-12T17:56:27Z","tags":["security","error"],"pid":6,"message":"Error registering Kibana Privileges with Elasticsearch for kibana-.kibana: No Living connections"}

When I check the pods, they are marked as not ready.

pod/noscit-production-logs-es-basic-0         0/1     Running   0          28h
pod/noscit-production-logs-es-basic-1         0/1     Running   0          28h
pod/noscit-production-logs-es-basic-2         0/1     Running   0          31h

I have events like this on my pods:

  Warning  Unhealthy  19s (x15444 over 28h)  kubelet, node-002  Readiness probe failed:

However, all my curl requests to the node seem to work fine.

Because no node is considered ready, the service reports no endpoints:

kubectl describe svc noscit-production-logs-es-http
Name:              noscit-production-logs-es-http
Namespace:         noscit-production
Labels:            common.k8s.elastic.co/type=elasticsearch
                   elasticsearch.k8s.elastic.co/cluster-name=noscit-production-logs
Annotations:       <none>
Selector:          common.k8s.elastic.co/type=elasticsearch,elasticsearch.k8s.elastic.co/cluster-name=noscit-production-logs
Type:              ClusterIP
IP:                x.x.x.x
Port:              https  9200/TCP
TargetPort:        9200/TCP
Endpoints:         
Session Affinity:  None
Events:            <none>

I tracked it down to the probe script (/mnt/elastic-internal/scripts/readiness-probe-script.sh). It performs a curl request with a 3-second timeout, and it seems pretty obvious that I'm hitting that timeout.

[root@noscit-production-logs-es-basic-1 elasticsearch]# time /mnt/elastic-internal/scripts/readiness-probe-script.sh ; echo $?

real	0m3.006s
user	0m0.024s
sys	0m0.000s
1

I quickly confirmed that this was indeed the case:

[root@noscit-production-logs-es-basic-1 elasticsearch]# time curl -k ${BASIC_AUTH} $ENDPOINT
10.244.1.36  65 99 55 2.24 3.52 4.47 dilm * noscit-production-logs-es-basic-2
10.244.2.148 63 99 49 7.99 6.43 5.38 dilm - noscit-production-logs-es-basic-1
10.244.0.132 63 99 44 3.99 4.77 5.13 dilm - noscit-production-logs-es-basic-0

real	0m3.069s
user	0m0.020s
sys	0m0.000s
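
For reference, bounding curl the same way the script does reproduces the failure (the flags here are only an approximation of what the ECK script actually runs):

# same request as above, but capped at 3 seconds like the probe script's curl
time curl -sk --max-time 3 ${BASIC_AUTH} $ENDPOINT ; echo $?
# when the limit is hit, curl exits with 28 (operation timed out)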

Environment

  • ECK version: 1.0
  • Kubernetes information: On premise, 1.14
Client Version: version.Info{Major:"1", Minor:"15", GitVersion:"v1.15.3", GitCommit:"2d3c76f9091b6bec110a5e63777c332469e0cba2", GitTreeState:"clean", BuildDate:"2019-08-19T11:13:54Z", GoVersion:"go1.12.9", Compiler:"gc", Platform:"linux/amd64"}
Server Version: version.Info{Major:"1", Minor:"14", GitVersion:"v1.14.0", GitCommit:"641856db18352033a0d96dbc99153fa3b27298e5", GitTreeState:"clean", BuildDate:"2019-03-25T15:45:25Z", GoVersion:"go1.12.1", Compiler:"gc", Platform:"linux/amd64"}

  • Resource definition:
apiVersion: elasticsearch.k8s.elastic.co/v1beta1
kind: Elasticsearch
metadata:
  name: noscit-production-logs
  labels:
    helm.sh/chart: elasticsearch-0.1.0
    app.kubernetes.io/name: elasticsearch
    app.kubernetes.io/managed-by: Helm
    app.kubernetes.io/instance: elastic-production
    app.kubernetes.io/component: elasticsearch
spec:
  version: 7.4.0
  nodeSets:
  - name: basic
    count: 3
    podTemplate:
      metadata:
        labels:
          helm.sh/chart: elasticsearch-0.1.0
          app.kubernetes.io/name: elasticsearch
          app.kubernetes.io/managed-by: ECK
          app.kubernetes.io/instance: elastic-production
          app.kubernetes.io/component: elasticsearch
      spec:
        containers:
        - name: elasticsearch
          env:
          - name: ES_JAVA_OPTS
            value: -Xms7g -Xmx7g
          resources:
            limits:
              cpu: 4
              memory: 8Gi
            requests:
              cpu: 2
              memory: 8Gi
    config:
      node.master: true
      node.data: true
      node.ingest: true
      xpack.security.authc.realms:
        native.native1:
          order: 1
    volumeClaimTemplates:
    - metadata:
        name: elasticsearch-data
        labels:
          helm.sh/chart: elasticsearch-0.1.0
          app.kubernetes.io/name: elasticsearch
          app.kubernetes.io/managed-by: Helm
          app.kubernetes.io/instance: elastic-production
          app.kubernetes.io/component: elasticsearch
      spec:
        accessModes:
        - ReadWriteOnce
        resources:
          requests:
            storage: 1000Gi
        storageClassName: local-slow

I feel we should either:

  • make the timeout bigger
  • make the timeout configurable

Any ideas for workarounds would be greatly appreciated.

@sebgl
Contributor

sebgl commented Dec 12, 2019

Relates #2248

@anyasabo
Contributor

real 0m3.006s

So close! 😞

Thanks for bringing this up. I opened #2260 to allow it to be configurable. In the meantime, you should be able to override the readiness probe in the podTemplate. The easiest solution may be to simply use a TCP check as described here:
https://kubernetes.io/docs/tasks/configure-pod-container/configure-liveness-readiness-startup-probes/#define-a-tcp-liveness-probe

which means some pods may join the service before they are ready, but that seems to be an improvement over the existing situation; a rough sketch of what that override could look like is below. If you have any issues or further questions, please let us know.
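
Something along these lines in the nodeSet's podTemplate should do it (untested, and the probe parameters are just example values; the important part is the tcpSocket check on port 9200, and the container name elasticsearch matches the one in your manifest):

spec:
  nodeSets:
  - name: basic
    count: 3
    podTemplate:
      spec:
        containers:
        - name: elasticsearch
          readinessProbe:
            # plain TCP check on the HTTP port instead of the default exec script,
            # so readiness no longer depends on the 3-second curl inside that script
            tcpSocket:
              port: 9200
            initialDelaySeconds: 10
            periodSeconds: 10
            timeoutSeconds: 5
            failureThreshold: 3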

@anyasabo
Contributor

The above-referenced PR has been merged and should land in the next release. In the meantime the above-mentioned workaround exists, so I'll go ahead and close this. Feel free to re-open if there are further questions/comments though.
