
Help needed, A how-to around istio "0.5.1" + cilium "v1.0.0-rc5" #3010

Closed
brant4test opened this issue Mar 5, 2018 · 18 comments
Labels: kind/bug (This is a bug in the Cilium logic.), kind/community-report (This was reported by a user in the Cilium community, eg via Slack.)

brant4test (Author) commented Mar 5, 2018

Goal: integrate Istio and Cilium with minimal maintenance cost for future upgrades.

I'm following this guide to create a stack of kubespray “master 5aeaa24” + istio "0.5.1" + cilium "v1.0.0-rc5".
https://github.com/kubernetes-incubator/kubespray/tree/master/contrib/terraform/aws

Stack creation was successful (see the Ansible PLAY RECAP below), but the cilium pods then seem to get re-created frequently (pod names change, cilium-*****, and the new pods go into Pending), causing the application services to flap as well.

PLAY RECAP ******************************************************************************************************************************************************************************************************
kubernetes-devtesta-etcd0  : ok=206  changed=51   unreachable=0    failed=0   
kubernetes-devtesta-etcd1  : ok=203  changed=51   unreachable=0    failed=0   
kubernetes-devtesta-etcd2  : ok=203  changed=51   unreachable=0    failed=0   
kubernetes-devtesta-master0 : ok=335  changed=111  unreachable=0    failed=0   
kubernetes-devtesta-master1 : ok=299  changed=93   unreachable=0    failed=0   
kubernetes-devtesta-master2 : ok=299  changed=93   unreachable=0    failed=0   
kubernetes-devtesta-worker0 : ok=301  changed=86   unreachable=0    failed=0   
kubernetes-devtesta-worker1 : ok=267  changed=75   unreachable=0    failed=0   
kubernetes-devtesta-worker2 : ok=267  changed=75   unreachable=0    failed=0   
kubernetes-devtesta-worker3 : ok=267  changed=75   unreachable=0    failed=0   
localhost                  : ok=5    changed=1    unreachable=0    failed=0   

Friday 02 March 2018  19:01:25 +0000 (0:00:00.029)       0:29:55.833 ********** 
=============================================================================== 
kubernetes/secrets : Check certs | check if a cert already exists on node ------------------------------------------------------------------------------------------------------------------------------- 41.82s
etcd : Configure | Join member(s) to etcd-events cluster one at a time ---------------------------------------------------------------------------------------------------------------------------------- 40.35s
etcd : Configure | Join member(s) to etcd cluster one at a time ----------------------------------------------------------------------------------------------------------------------------------------- 40.34s
download : container_download | Download containers if pull is required or told to always pull (all nodes) ---------------------------------------------------------------------------------------------- 35.87s
download : container_download | Download containers if pull is required or told to always pull (all nodes) ---------------------------------------------------------------------------------------------- 35.43s
download : container_download | Download containers if pull is required or told to always pull (all nodes) ---------------------------------------------------------------------------------------------- 27.42s
download : container_download | Download containers if pull is required or told to always pull (all nodes) ---------------------------------------------------------------------------------------------- 20.36s
etcd : Configure | Join member(s) to etcd cluster one at a time ----------------------------------------------------------------------------------------------------------------------------------------- 20.23s
etcd : Configure | Join member(s) to etcd-events cluster one at a time ---------------------------------------------------------------------------------------------------------------------------------- 20.21s
download : container_download | Download containers if pull is required or told to always pull (all nodes) ---------------------------------------------------------------------------------------------- 19.16s
download : container_download | Download containers if pull is required or told to always pull (all nodes) ---------------------------------------------------------------------------------------------- 18.33s
kubernetes/node : write the kubecfg (auth) file for kubelet --------------------------------------------------------------------------------------------------------------------------------------------- 15.67s
bootstrap-os : Bootstrap | Install pip ------------------------------------------------------------------------------------------------------------------------------------------------------------------ 14.68s
gather facts from all instances ------------------------------------------------------------------------------------------------------------------------------------------------------------------------- 14.47s
kubernetes/preinstall : Create kubernetes directories --------------------------------------------------------------------------------------------------------------------------------------------------- 14.41s
download : container_download | Download containers if pull is required or told to always pull (all nodes) ---------------------------------------------------------------------------------------------- 12.26s
download : container_download | Download containers if pull is required or told to always pull (all nodes) ---------------------------------------------------------------------------------------------- 12.15s
download : container_download | Download containers if pull is required or told to always pull (all nodes) ---------------------------------------------------------------------------------------------- 11.62s
kubernetes/node : install | Copy kubelet from hyperkube container --------------------------------------------------------------------------------------------------------------------------------------- 11.60s
kubernetes/node : Enable bridge-nf-call tables ---------------------------------------------------------------------------------------------------------------------------------------------------------- 11.22s

$ kubectl -n kube-system logs cilium-8645d

...

level=warning msg="etcd watcher received error" error="etcdserver: mvcc: required revision has been compacted" revision=0 watcher=cilium/state/identities/v1/id
level=warning msg="etcd watcher received error" error="etcdserver: mvcc: required revision has been compacted" revision=0 watcher=cilium/state/identities/v1/id
level=warning msg="etcd watcher received error" error="etcdserver: mvcc: required revision has been compacted" revision=0 watcher=cilium/state/identities/v1/id
level=warning msg="etcd watcher received error" error="etcdserver: mvcc: required revision has been compacted" revision=0 watcher=cilium/state/identities/v1/id
level=warning msg="etcd watcher received error" error="etcdserver: mvcc: required revision has been compacted" revision=0 watcher=cilium/state/identities/v1/id
level=warning msg="etcd watcher received error" error="etcdserver: mvcc: required revision has been compacted" revision=0 watcher=cilium/state/identities/v1/id
level=warning msg="etcd watcher received error" error="etcdserver: mvcc: required revision has been compacted" revision=0 watcher=cilium/state/identities/v1/id

General Information

  • Cilium version (run cilium version)
$ kubectl exec -it cilium-5mn87 -n kube-system -- bash
root@ip-10-250-208-195:~# cilium version
Client: 1.0.0-rc5 1eb7ae1a0 2018-02-27T20:50:47+00:00 go version go1.9 linux/amd64
Daemon: 1.0.0-rc5 1eb7ae1a0 2018-02-27T20:50:47+00:00 go version go1.9 linux/amd64
  • Kernel version (run uname -a)
core@ip-10-250-202-243 ~ $ uname -a
Linux ip-10-250-202-243.ec2.internal 4.14.19-coreos #1 SMP Wed Feb 14 03:18:05 UTC 2018 x86_64 Intel(R) Xeon(R) CPU E5-2686 v4 @ 2.30GHz GenuineIntel GNU/Linux

  • Orchestration system version in use (e.g. kubectl version, Mesos, ...)
$ kubectl version
Client Version: version.Info{Major:"1", Minor:"8", GitVersion:"v1.8.0", GitCommit:"6e937839ac04a38cac63e6a7a306c5d035fe7b0a", GitTreeState:"clean", BuildDate:"2017-09-28T22:57:57Z", GoVersion:"go1.8.3", Compiler:"gc", Platform:"linux/amd64"}
Server Version: version.Info{Major:"1", Minor:"9", GitVersion:"v1.9.3+coreos.0", GitCommit:"f588569ed1bd4a6c986205dd0d7b04da4ab1a3b6", GitTreeState:"clean", BuildDate:"2018-02-10T01:42:55Z", GoVersion:"go1.9.2", Compiler:"gc", Platform:"linux/amd64"}
  • Link to relevant artifacts (policies, deployment scripts, ...)
https://github.com/istio/istio/releases/tag/0.5.1

How to reproduce the issue

kubespray/contrib/terraform/aws$ terraform apply -var-file=credentials.tfvars -var 'loadbalancer_apiserver_address=*.*.*.*'
kubespray$ ansible-playbook -i ./inventory/hosts ./cluster.yml -e ansible_ssh_user=core -e bootstrap_os=coreos -b --become-user=root --flush-cache -e ansible_user=core
istio-0.5.1$ kubectl apply -f install/kubernetes/istio.yaml

Feature Requests

A how-to around istio "0.5.1" + cilium "v1.0.0-rc5".

Thanks! :)

tgraf added the kind/community-report label Mar 5, 2018
tgraf (Member) commented Mar 5, 2018

@brant4test You mentioned that the cilium pods are restarting. Can you attach the full logs and also the output of kubectl -n kube-system describe pod cilium?

brant4test (Author) commented Mar 5, 2018

Stack size and layout are shown in the screenshot below. By the way, do you think aws_etcd_size = "t2.medium" is enough?
(screenshot: stack size and layout)

tgraf (Member) commented Mar 5, 2018

Investigating similar issues:

tgraf added a commit that referenced this issue Mar 5, 2018
etcd does not support watching on a compacted revision and will error out.
Fortunately etcd tells us the minimum compact revision that we can watch,
therefore, recreate the watcher with the provided minimum revision.

Fixes: #3010

Signed-off-by: Thomas Graf <thomas@cilium.io>
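
To illustrate the approach described in that commit message, here is a minimal Go sketch using the etcd clientv3 API; the package, the function name watchPrefix, and the import path are illustrative assumptions, not Cilium's actual code:

// Illustrative sketch only (not Cilium's implementation): restart an etcd
// watch at the minimum compact revision once etcd reports that the requested
// revision has been compacted.
package kvstore

import (
	"context"
	"log"

	"github.com/coreos/etcd/clientv3" // etcd 3.x-era import path (assumed)
)

// watchPrefix watches all keys under prefix starting at revision rev and
// re-creates the watcher whenever etcd reports a compaction.
func watchPrefix(ctx context.Context, cli *clientv3.Client, prefix string, rev int64) {
	for ctx.Err() == nil {
		wch := cli.Watch(ctx, prefix, clientv3.WithPrefix(), clientv3.WithRev(rev))
		for wresp := range wch {
			if wresp.CompactRevision != 0 {
				// The requested revision was compacted away. etcd reports the
				// oldest revision still available, so resume the watch there.
				rev = wresp.CompactRevision
				break
			}
			if err := wresp.Err(); err != nil {
				log.Printf("etcd watcher received error: %v", err)
				break
			}
			for _, ev := range wresp.Events {
				log.Printf("%s %q", ev.Type, ev.Kv.Key)
				rev = ev.Kv.ModRevision + 1 // resume after the last event seen
			}
		}
	}
}

The key detail is that a watch canceled because of compaction carries the oldest available revision in WatchResponse.CompactRevision, so the watcher can resume from there instead of erroring out repeatedly as in the logs above.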
tgraf (Member) commented Mar 5, 2018

Reported/asked upstream etcd-io/etcd#9386

tgraf added a commit that referenced this issue Mar 6, 2018
etcd does not support watching on a compacted revision and will error out. Do a
fresh get on the latest revision and restart the watcher. In order to continue
maintaining the proper  order of events, a local cache is introduced.

The ListDone signal is only emitted once at the beginning.

On ReList, deletion events are sent for keys which can no longer be found.

Fixes: #3010

Signed-off-by: Thomas Graf <thomas@cilium.io>
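
A rough Go sketch of that re-list strategy follows; the Event type, the cache map, and the name relistAndResync are illustrative assumptions, not Cilium's actual types:

// Illustrative sketch only (not Cilium's implementation) of the "fresh get +
// local cache" recovery: list the prefix at the current revision, emit
// deletion events for cached keys that have disappeared, and return the
// revision at which the watcher should be restarted.
package kvstore

import (
	"context"

	"github.com/coreos/etcd/clientv3" // etcd 3.x-era import path (assumed)
)

type EventType int

const (
	EventUpsert EventType = iota // key created or modified
	EventDelete                  // key deleted while events were being lost
)

type Event struct {
	Type  EventType
	Key   string
	Value []byte
}

// relistAndResync refreshes the local cache from a fresh etcd Get and returns
// the revision to resume watching from.
func relistAndResync(ctx context.Context, cli *clientv3.Client, prefix string,
	cache map[string][]byte, out chan<- Event) (nextRev int64, err error) {

	resp, err := cli.Get(ctx, prefix, clientv3.WithPrefix())
	if err != nil {
		return 0, err
	}

	seen := make(map[string]struct{}, len(resp.Kvs))
	for _, kv := range resp.Kvs {
		key := string(kv.Key)
		seen[key] = struct{}{}
		cache[key] = kv.Value
		out <- Event{Type: EventUpsert, Key: key, Value: kv.Value}
	}

	// Keys still cached but missing from the fresh list were deleted while
	// the watcher was broken; synthesize deletion events for them.
	for key := range cache {
		if _, ok := seen[key]; !ok {
			delete(cache, key)
			out <- Event{Type: EventDelete, Key: key}
		}
	}

	// Restart the watch just after the revision the list was served at.
	return resp.Header.Revision + 1, nil
}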
tgraf assigned tgraf and unassigned rlenglet Mar 6, 2018
tgraf added the kind/bug label Mar 6, 2018
tgraf added a commit that referenced this issue Mar 6, 2018 (same commit message as above)
brant4test (Author) commented

@tgraf Great! So the "cilium v1.0.0-rc5 pods restart with istio 0.5.1" issue has been solved? :)

rlenglet (Contributor) commented Mar 6, 2018

@brant4test You should be able to run the current Cilium versions in combination with any Istio version (without mTLS) without modifying the Istio proxy images.

tgraf (Member) commented Mar 6, 2018

@brant4test The etcd errors you have seen should also be resolved correctly if you are using the image cilium/cilium:latest. We will put out a new release in a couple of days as well.

tgraf added a commit that referenced this issue Mar 7, 2018 (same commit message as above)
brant4test (Author) commented

Attached: kubectl -n kube-system describe pod cilium.txt

$ kubectl get -n kube-system pods
NAME                                                     READY     STATUS    RESTARTS   AGE
cilium-89hpn                                             1/1       Running   0          40m
cilium-9jk6w                                             1/1       Running   0          40m
cilium-g92xl                                             0/1       Pending   0          5s
cilium-hznms                                             1/1       Running   0          40m
cilium-qlt6c                                             1/1       Running   0          2m
cilium-sqznj                                             1/1       Running   0          11m
cilium-z5dmg                                             0/1       Pending   0          2s
elasticsearch-logging-v1-776b8b856c-94zq5                1/1       Running   0          1m
elasticsearch-logging-v1-776b8b856c-hs9pr                1/1       Running   0          9m
fluentd-es-v1.22-2d92d                                   1/1       Running   0          39m
fluentd-es-v1.22-2ph5k                                   1/1       Running   0          39m
fluentd-es-v1.22-5f4wr                                   1/1       Running   0          39m
fluentd-es-v1.22-jj7km                                   1/1       Running   0          39m
fluentd-es-v1.22-lxd74                                   1/1       Running   0          39m
fluentd-es-v1.22-rx9mk                                   1/1       Running   0          39m
fluentd-es-v1.22-sqtf5                                   1/1       Running   0          39m
kibana-logging-57d98b74f9-k94fw                          1/1       Running   0          10m
kube-apiserver-ip-10-250-197-168.ec2.internal            1/1       Running   0          40m
kube-apiserver-ip-10-250-201-190.ec2.internal            1/1       Running   0          40m
kube-apiserver-ip-10-250-219-206.ec2.internal            1/1       Running   0          41m
kube-controller-manager-ip-10-250-197-168.ec2.internal   1/1       Running   0          42m
kube-controller-manager-ip-10-250-201-190.ec2.internal   1/1       Running   0          42m
kube-controller-manager-ip-10-250-219-206.ec2.internal   1/1       Running   0          42m
kube-dns-79d99cdcd5-7fnsv                                3/3       Running   0          7m
kube-dns-79d99cdcd5-dcxrd                                3/3       Running   0          39m
kube-proxy-ip-10-250-197-168.ec2.internal                1/1       Running   0          41m
kube-proxy-ip-10-250-201-190.ec2.internal                1/1       Running   0          41m
kube-proxy-ip-10-250-202-15.ec2.internal                 1/1       Running   0          41m
kube-proxy-ip-10-250-207-112.ec2.internal                1/1       Running   0          41m
kube-proxy-ip-10-250-210-236.ec2.internal                1/1       Running   0          41m
kube-proxy-ip-10-250-215-204.ec2.internal                1/1       Running   0          41m
kube-proxy-ip-10-250-219-206.ec2.internal                1/1       Running   0          41m
kube-scheduler-ip-10-250-197-168.ec2.internal            1/1       Running   0          42m
kube-scheduler-ip-10-250-201-190.ec2.internal            1/1       Running   0          42m
kube-scheduler-ip-10-250-219-206.ec2.internal            1/1       Running   0          42m
kubedns-autoscaler-5564b5585f-pr28c                      1/1       Running   0          39m
kubernetes-dashboard-69cb58d748-m5fmt                    1/1       Running   1          39m
nginx-proxy-ip-10-250-202-15.ec2.internal                1/1       Running   0          41m
nginx-proxy-ip-10-250-207-112.ec2.internal               1/1       Running   0          41m
nginx-proxy-ip-10-250-210-236.ec2.internal               1/1       Running   0          41m
nginx-proxy-ip-10-250-215-204.ec2.internal               1/1       Running   0          41m
tiller-deploy-5b48764ff7-9jk7x                           1/1       Running   0          39m

ianvernon (Member) commented

@brant4test I'm seeing the following in the text file you attached:

Events:
  Type     Reason   Age   From                                     Message
  ----     ------   ----  ----                                     -------
  Warning  Evicted  0s    kubelet, ip-10-250-215-204.ec2.internal  The node was low on resource: [DiskPressure].

brant4test (Author) commented Mar 7, 2018

@tgraf It's now based on a stack of kubespray “master 5aeaa24” + istio "0.5.1" + cilium "v1.0.0-rc6".
Things are much better/more stable, but cilium pods still sometimes end up in Pending status,
and I cannot get logs from the Pending cilium pods because they are gone so quickly.

$ kubectl -n kube-system logs cilium-xphxw
Error from server (NotFound): pods "cilium-xphxw" not found

Any tips? Thanks!

brant4test (Author) commented

@ianvernon Thanks for your reply. I've seen that before, but I have no clue what leads to DiskPressure. Any suggestions? Thanks!

ianvernon (Member) commented

@brant4test This is one of the NodeConditions described in the Kubernetes documentation:

Available disk space and inodes on either the node’s root filesystem or image filesystem has satisfied an eviction threshold

tgraf (Member) commented Mar 7, 2018

@brant4test Can you ssh into the node in question and check whether there is a large file named cilium-envoy.log in /var/run/cilium?

brant4test (Author) commented Mar 8, 2018

@ianvernon Thanks a lot for your help. From the doc you linked, I think this is the reason why the DaemonSet pods keep restarting.

[DaemonSet](https://kubernetes.io/docs/tasks/administer-cluster/out-of-resource/#soft-eviction-thresholds)

It is never desired for kubelet to evict a DaemonSet Pod, since the Pod is immediately recreated and rescheduled back to the same node.

At the moment, the kubelet has no ability to distinguish a Pod created from DaemonSet versus any other object. If/when that information is available, the kubelet could pro-actively filter those Pods from the candidate set of Pods provided to the eviction strategy.

In general, it is strongly recommended that DaemonSet not create BestEffort Pods to avoid being identified as a candidate Pod for eviction. Instead DaemonSet should ideally launch Guaranteed Pods.

@tgraf It seems no cilium-envoy.log exists, and the good news is that all nodes and pods are stable now.

ip-10-250-208-159 cilium # pwd
/var/run/cilium
ip-10-250-208-159 cilium # ls -lah
total 4.0K
drwxr-xr-x.  3 root root 160 Mar  7 22:37 .
drwxr-xr-x. 27 root root 700 Mar  7 22:37 ..
-rw-r-----.  1 root root   2 Mar  7 22:37 cilium.pid
srw-rw----.  1 root 1000   0 Mar  7 22:37 cilium.sock
prw-------.  1 root root   0 Mar  8 02:39 events.sock
srw-rw----.  1 root 1000   0 Mar  7 22:37 health.sock
srw-rw----.  1 root 1000   0 Mar  7 22:37 monitor.sock
drwxr-x---. 21 root root 540 Mar  8 02:40 state

My only remaining concern is the two Warning events below. Are they also issues, or can they be ignored? Thanks!

$ kubectl describe po/cilium-sp4hp -n kube-system
Name:           cilium-sp4hp
Namespace:      kube-system
Node:           ip-10-250-195-4.ec2.internal/10.250.195.4
Start Time:     Wed, 07 Mar 2018 22:37:40 +0000
Labels:         controller-revision-hash=1212936907
                k8s-app=cilium
                kubernetes.io/cluster-service=true
                pod-template-generation=1
Annotations:    scheduler.alpha.kubernetes.io/critical-pod=
                scheduler.alpha.kubernetes.io/tolerations=[{"key":"dedicated","operator":"Equal","value":"master","effect":"NoSchedule"}]
Status:         Running
IP:             10.250.195.4
Controlled By:  DaemonSet/cilium
Containers:
  cilium-agent:
    Container ID:  docker://2ef6063069cbc27a080d7c788473a46826c2a54953399ccb1b86ac354d672cf2
    Image:         docker.io/cilium/cilium:v1.0.0-rc6
    Image ID:      docker-pullable://cilium/cilium@sha256:35ac3e2c5c7e7b0d5a61fd2bfa62ee5c14090e2c5887ff8e8ffafc0230f15ff0
    Port:          <none>
    Command:
      cilium-agent
    Args:
      --debug=$(CILIUM_DEBUG)
      -t
      vxlan
      --kvstore
      etcd
      --kvstore-opt
      etcd.config=/var/lib/etcd-config/etcd.config
      --disable-ipv4=$(DISABLE_IPV4)
    State:          Running
      Started:      Wed, 07 Mar 2018 22:37:41 +0000
    Ready:          True
    Restart Count:  0
    Liveness:       exec [cilium status] delay=120s timeout=1s period=10s #success=1 #failure=10
    Readiness:      exec [cilium status] delay=5s timeout=1s period=5s #success=1 #failure=3
    Environment:
      K8S_NODE_NAME:   (v1:spec.nodeName)
      CILIUM_DEBUG:   <set to the key 'debug' of config map 'cilium-config'>         Optional: false
      DISABLE_IPV4:   <set to the key 'disable-ipv4' of config map 'cilium-config'>  Optional: false
    Mounts:
      /etc/cilium/certs from cilium-certs (ro)
      /host/etc/cni/net.d from etc-cni-netd (rw)
      /host/opt/cni/bin from cni-path (rw)
      /sys/fs/bpf from bpf-maps (rw)
      /var/lib/etcd-config from etcd-config-path (ro)
      /var/run/cilium from cilium-run (rw)
      /var/run/docker.sock from docker-socket (ro)
      /var/run/secrets/kubernetes.io/serviceaccount from cilium-token-ffs74 (ro)
Conditions:
  Type           Status
  Initialized    True 
  Ready          True 
  PodScheduled   True 
Volumes:
  cilium-run:
    Type:  HostPath (bare host directory volume)
    Path:  /var/run/cilium
  bpf-maps:
    Type:  HostPath (bare host directory volume)
    Path:  /sys/fs/bpf
  docker-socket:
    Type:  HostPath (bare host directory volume)
    Path:  /var/run/docker.sock
  cni-path:
    Type:  HostPath (bare host directory volume)
    Path:  /opt/cni/bin
  etc-cni-netd:
    Type:  HostPath (bare host directory volume)
    Path:  /etc/cni/net.d
  cilium-certs:
    Type:  HostPath (bare host directory volume)
    Path:  /etc/cilium/certs
  etcd-config-path:
    Type:      ConfigMap (a volume populated by a ConfigMap)
    Name:      cilium-config
    Optional:  false
  cilium-token-ffs74:
    Type:        Secret (a volume populated by a Secret)
    SecretName:  cilium-token-ffs74
    Optional:    false
QoS Class:       BestEffort
Node-Selectors:  <none>
Tolerations:     CriticalAddonsOnly
                 node-role.kubernetes.io/master:NoSchedule
                 node.cloudprovider.kubernetes.io/uninitialized=true:NoSchedule
                 node.kubernetes.io/disk-pressure:NoSchedule
                 node.kubernetes.io/memory-pressure:NoSchedule
                 node.kubernetes.io/not-ready:NoExecute
                 node.kubernetes.io/unreachable:NoExecute
Events:
  Type     Reason     Age                From                                   Message
  ----     ------     ----               ----                                   -------
  Warning  Unhealthy  12m (x3 over 14m)  kubelet, ip-10-250-195-4.ec2.internal  Liveness probe failed: context deadline exceeded
  Warning  Unhealthy  11m (x3 over 14m)  kubelet, ip-10-250-195-4.ec2.internal  Readiness probe failed: context deadline exceeded

tgraf (Member) commented Mar 8, 2018

@brant4test

It seems no cilium-envoy.log exists, and the good news is that all nodes and pods are stable now.

Great! We saw some instances of excessive log spamming that we have addressed in the meantime. I wanted to make sure that you were not affected by this.

My only remaining concern is the two Warning events below. Are they also issues, or can they be ignored?

If Cilium is healthy then it is a false negative, but we should not ignore them. I noticed that the timeout is only 1 second. We will look into this and get back to you.

brant4test (Author) commented

@tgraf Will Cilium 1.0.0-rc7 fix these warning events?
