
Help needed, A how-to around istio "0.5.1" + cilium "v1.0.0-rc5" #3010

Closed
brant4test opened this issue Mar 5, 2018 · 18 comments
Labels: kind/bug (This is a bug in the Cilium logic.), kind/community-report (This was reported by a user in the Cilium community, eg via Slack.)

brant4test (Author) commented Mar 5, 2018

Goal: integrate Istio and Cilium with minimal maintenance cost for future upgrades.

I'm following this guide to create a stack of kubespray “master 5aeaa24” + istio "0.5.1" + cilium "v1.0.0-rc5".
https://github.com/kubernetes-incubator/kubespray/tree/master/contrib/terraform/aws

Stack creation was successful (see the Ansible PLAY RECAP below), but the cilium pods then seem to get re-created frequently (pod names change, cilium-*****, and the new pods go into Pending), causing the application services to flap as well.

PLAY RECAP ******************************************************************************************************************************************************************************************************
kubernetes-devtesta-etcd0  : ok=206  changed=51   unreachable=0    failed=0   
kubernetes-devtesta-etcd1  : ok=203  changed=51   unreachable=0    failed=0   
kubernetes-devtesta-etcd2  : ok=203  changed=51   unreachable=0    failed=0   
kubernetes-devtesta-master0 : ok=335  changed=111  unreachable=0    failed=0   
kubernetes-devtesta-master1 : ok=299  changed=93   unreachable=0    failed=0   
kubernetes-devtesta-master2 : ok=299  changed=93   unreachable=0    failed=0   
kubernetes-devtesta-worker0 : ok=301  changed=86   unreachable=0    failed=0   
kubernetes-devtesta-worker1 : ok=267  changed=75   unreachable=0    failed=0   
kubernetes-devtesta-worker2 : ok=267  changed=75   unreachable=0    failed=0   
kubernetes-devtesta-worker3 : ok=267  changed=75   unreachable=0    failed=0   
localhost                  : ok=5    changed=1    unreachable=0    failed=0   

Friday 02 March 2018  19:01:25 +0000 (0:00:00.029)       0:29:55.833 ********** 
=============================================================================== 
kubernetes/secrets : Check certs | check if a cert already exists on node ------------------------------------------------------------------------------------------------------------------------------- 41.82s
etcd : Configure | Join member(s) to etcd-events cluster one at a time ---------------------------------------------------------------------------------------------------------------------------------- 40.35s
etcd : Configure | Join member(s) to etcd cluster one at a time ----------------------------------------------------------------------------------------------------------------------------------------- 40.34s
download : container_download | Download containers if pull is required or told to always pull (all nodes) ---------------------------------------------------------------------------------------------- 35.87s
download : container_download | Download containers if pull is required or told to always pull (all nodes) ---------------------------------------------------------------------------------------------- 35.43s
download : container_download | Download containers if pull is required or told to always pull (all nodes) ---------------------------------------------------------------------------------------------- 27.42s
download : container_download | Download containers if pull is required or told to always pull (all nodes) ---------------------------------------------------------------------------------------------- 20.36s
etcd : Configure | Join member(s) to etcd cluster one at a time ----------------------------------------------------------------------------------------------------------------------------------------- 20.23s
etcd : Configure | Join member(s) to etcd-events cluster one at a time ---------------------------------------------------------------------------------------------------------------------------------- 20.21s
download : container_download | Download containers if pull is required or told to always pull (all nodes) ---------------------------------------------------------------------------------------------- 19.16s
download : container_download | Download containers if pull is required or told to always pull (all nodes) ---------------------------------------------------------------------------------------------- 18.33s
kubernetes/node : write the kubecfg (auth) file for kubelet --------------------------------------------------------------------------------------------------------------------------------------------- 15.67s
bootstrap-os : Bootstrap | Install pip ------------------------------------------------------------------------------------------------------------------------------------------------------------------ 14.68s
gather facts from all instances ------------------------------------------------------------------------------------------------------------------------------------------------------------------------- 14.47s
kubernetes/preinstall : Create kubernetes directories --------------------------------------------------------------------------------------------------------------------------------------------------- 14.41s
download : container_download | Download containers if pull is required or told to always pull (all nodes) ---------------------------------------------------------------------------------------------- 12.26s
download : container_download | Download containers if pull is required or told to always pull (all nodes) ---------------------------------------------------------------------------------------------- 12.15s
download : container_download | Download containers if pull is required or told to always pull (all nodes) ---------------------------------------------------------------------------------------------- 11.62s
kubernetes/node : install | Copy kubelet from hyperkube container --------------------------------------------------------------------------------------------------------------------------------------- 11.60s
kubernetes/node : Enable bridge-nf-call tables ---------------------------------------------------------------------------------------------------------------------------------------------------------- 11.22s

$ kubectl -n kube-system logs cilium-8645d

...

level=warning msg="etcd watcher received error" error="etcdserver: mvcc: required revision has been compacted" revision=0 watcher=cilium/state/identities/v1/id
level=warning msg="etcd watcher received error" error="etcdserver: mvcc: required revision has been compacted" revision=0 watcher=cilium/state/identities/v1/id
level=warning msg="etcd watcher received error" error="etcdserver: mvcc: required revision has been compacted" revision=0 watcher=cilium/state/identities/v1/id
level=warning msg="etcd watcher received error" error="etcdserver: mvcc: required revision has been compacted" revision=0 watcher=cilium/state/identities/v1/id
level=warning msg="etcd watcher received error" error="etcdserver: mvcc: required revision has been compacted" revision=0 watcher=cilium/state/identities/v1/id
level=warning msg="etcd watcher received error" error="etcdserver: mvcc: required revision has been compacted" revision=0 watcher=cilium/state/identities/v1/id
level=warning msg="etcd watcher received error" error="etcdserver: mvcc: required revision has been compacted" revision=0 watcher=cilium/state/identities/v1/id

General Information

  • Cilium version (run cilium version)
$ kubectl exec -it cilium-5mn87 -n kube-system -- bash
root@ip-10-250-208-195:~# cilium version
Client: 1.0.0-rc5 1eb7ae1a0 2018-02-27T20:50:47+00:00 go version go1.9 linux/amd64
Daemon: 1.0.0-rc5 1eb7ae1a0 2018-02-27T20:50:47+00:00 go version go1.9 linux/amd64
  • Kernel version (run uname -a)
core@ip-10-250-202-243 ~ $ uname -a
Linux ip-10-250-202-243.ec2.internal 4.14.19-coreos #1 SMP Wed Feb 14 03:18:05 UTC 2018 x86_64 Intel(R) Xeon(R) CPU E5-2686 v4 @ 2.30GHz GenuineIntel GNU/Linux

  • Orchestration system version in use (e.g. kubectl version, Mesos, ...)
$ kubectl version
Client Version: version.Info{Major:"1", Minor:"8", GitVersion:"v1.8.0", GitCommit:"6e937839ac04a38cac63e6a7a306c5d035fe7b0a", GitTreeState:"clean", BuildDate:"2017-09-28T22:57:57Z", GoVersion:"go1.8.3", Compiler:"gc", Platform:"linux/amd64"}
Server Version: version.Info{Major:"1", Minor:"9", GitVersion:"v1.9.3+coreos.0", GitCommit:"f588569ed1bd4a6c986205dd0d7b04da4ab1a3b6", GitTreeState:"clean", BuildDate:"2018-02-10T01:42:55Z", GoVersion:"go1.9.2", Compiler:"gc", Platform:"linux/amd64"}
  • Link to relevant artifacts (policies, deployment scripts, ...)
https://github.com/istio/istio/releases/tag/0.5.1

How to reproduce the issue

kubespray/contrib/terraform/aws$ terraform apply -var-file=credentials.tfvars -var 'loadbalancer_apiserver_address=*.*.*.*'
kubespray$ ansible-playbook -i ./inventory/hosts ./cluster.yml -e ansible_ssh_user=core -e bootstrap_os=coreos -b --become-user=root --flush-cache -e ansible_user=core
istio-0.5.1$ kubectl apply -f install/kubernetes/istio.yaml

Feature Requests

A how-to around istio "0.5.1" + cilium "v1.0.0-rc5".

Thanks! :)

tgraf added the kind/community-report label Mar 5, 2018
tgraf (Member) commented Mar 5, 2018

@brant4test You mentioned that the cilium pods are restarting. Can you attach the full logs and also the output of kubectl -n kube-system describe pod cilium?

brant4test (Author) commented Mar 5, 2018

Stack size and layout are shown in the screenshot below. By the way, do you think aws_etcd_size = "t2.medium" is enough?
(screenshot: stack size and layout)

tgraf (Member) commented Mar 5, 2018

Investigating similar issues:

tgraf added a commit that referenced this issue Mar 5, 2018
etcd does not support watching on a compacted revision and will error out.
Fortunately etcd tells us the minimum compact revision that we can watch,
therefore, recreate the watcher with the provided minimum revision.

Fixes: #3010

Signed-off-by: Thomas Graf <thomas@cilium.io>
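
To illustrate the approach described in that commit message, here is a minimal Go sketch using the etcd clientv3 API; the package, the function name watchPrefix, and the import path are illustrative assumptions, not Cilium's actual code:

// Illustrative sketch only (not Cilium's implementation): restart an etcd
// watch at the minimum compact revision once etcd reports that the requested
// revision has been compacted.
package kvstore

import (
	"context"
	"log"

	"github.com/coreos/etcd/clientv3" // etcd 3.x-era import path (assumed)
)

// watchPrefix watches all keys under prefix starting at revision rev and
// re-creates the watcher whenever etcd reports a compaction.
func watchPrefix(ctx context.Context, cli *clientv3.Client, prefix string, rev int64) {
	for ctx.Err() == nil {
		wch := cli.Watch(ctx, prefix, clientv3.WithPrefix(), clientv3.WithRev(rev))
		for wresp := range wch {
			if wresp.CompactRevision != 0 {
				// The requested revision was compacted away. etcd reports the
				// oldest revision still available, so resume the watch there.
				rev = wresp.CompactRevision
				break
			}
			if err := wresp.Err(); err != nil {
				log.Printf("etcd watcher received error: %v", err)
				break
			}
			for _, ev := range wresp.Events {
				log.Printf("%s %q", ev.Type, ev.Kv.Key)
				rev = ev.Kv.ModRevision + 1 // resume after the last event seen
			}
		}
	}
}

The key detail is that a watch canceled because of compaction carries the oldest available revision in WatchResponse.CompactRevision, so the watcher can resume from there instead of erroring out repeatedly as in the logs above.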
tgraf (Member) commented Mar 5, 2018

Reported/asked upstream etcd-io/etcd#9386

tgraf added a commit that referenced this issue Mar 6, 2018
etcd does not support watching on a compacted revision and will error out. Do a
fresh get on the latest revision and restart the watcher. In order to continue
maintaining the proper  order of events, a local cache is introduced.

The ListDone signal is only emitted once at the beginning.

On ReList, deletion events are sent for keys which can no longer be found.

Fixes: #3010

Signed-off-by: Thomas Graf <thomas@cilium.io>
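
A rough Go sketch of that re-list strategy follows; the Event type, the cache map, and the name relistAndResync are illustrative assumptions, not Cilium's actual types:

// Illustrative sketch only (not Cilium's implementation) of the "fresh get +
// local cache" recovery: list the prefix at the current revision, emit
// deletion events for cached keys that have disappeared, and return the
// revision at which the watcher should be restarted.
package kvstore

import (
	"context"

	"github.com/coreos/etcd/clientv3" // etcd 3.x-era import path (assumed)
)

type EventType int

const (
	EventUpsert EventType = iota // key created or modified
	EventDelete                  // key deleted while events were being lost
)

type Event struct {
	Type  EventType
	Key   string
	Value []byte
}

// relistAndResync refreshes the local cache from a fresh etcd Get and returns
// the revision to resume watching from.
func relistAndResync(ctx context.Context, cli *clientv3.Client, prefix string,
	cache map[string][]byte, out chan<- Event) (nextRev int64, err error) {

	resp, err := cli.Get(ctx, prefix, clientv3.WithPrefix())
	if err != nil {
		return 0, err
	}

	seen := make(map[string]struct{}, len(resp.Kvs))
	for _, kv := range resp.Kvs {
		key := string(kv.Key)
		seen[key] = struct{}{}
		cache[key] = kv.Value
		out <- Event{Type: EventUpsert, Key: key, Value: kv.Value}
	}

	// Keys still cached but missing from the fresh list were deleted while
	// the watcher was broken; synthesize deletion events for them.
	for key := range cache {
		if _, ok := seen[key]; !ok {
			delete(cache, key)
			out <- Event{Type: EventDelete, Key: key}
		}
	}

	// Restart the watch just after the revision the list was served at.
	return resp.Header.Revision + 1, nil
}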
tgraf assigned tgraf and unassigned rlenglet Mar 6, 2018
tgraf added the kind/bug label Mar 6, 2018
tgraf added a commit that referenced this issue Mar 6, 2018 (same commit message as above)
brant4test (Author) commented

@tgraf Great! So the "cilium v1.0.0-rc5 pods restart with istio 0.5.1" issue has been solved? :)

rlenglet (Contributor) commented Mar 6, 2018

@brant4test You should be able to run the current Cilium versions in combination with any Istio version (without mTLS) without modifying the Istio proxy images.

tgraf (Member) commented Mar 6, 2018

@brant4test The etcd errors you have seen should also be resolved correctly if you are using the image cilium/cilium:latest. We will put out a new release in a couple of days as well.

tgraf added a commit that referenced this issue Mar 7, 2018 (same commit message as above)
brant4test (Author) commented

Attached: kubectl -n kube-system describe pod cilium.txt

$ kubectl get -n kube-system pods
NAME                                                     READY     STATUS    RESTARTS   AGE
cilium-89hpn                                             1/1       Running   0          40m
cilium-9jk6w                                             1/1       Running   0          40m
cilium-g92xl                                             0/1       Pending   0          5s
cilium-hznms                                             1/1       Running   0          40m
cilium-qlt6c                                             1/1       Running   0          2m
cilium-sqznj                                             1/1       Running   0          11m
cilium-z5dmg                                             0/1       Pending   0          2s
elasticsearch-logging-v1-776b8b856c-94zq5                1/1       Running   0          1m
elasticsearch-logging-v1-776b8b856c-hs9pr                1/1       Running   0          9m
fluentd-es-v1.22-2d92d                                   1/1       Running   0          39m
fluentd-es-v1.22-2ph5k                                   1/1       Running   0          39m
fluentd-es-v1.22-5f4wr                                   1/1       Running   0          39m
fluentd-es-v1.22-jj7km                                   1/1       Running   0          39m
fluentd-es-v1.22-lxd74                                   1/1       Running   0          39m
fluentd-es-v1.22-rx9mk                                   1/1       Running   0          39m
fluentd-es-v1.22-sqtf5                                   1/1       Running   0          39m
kibana-logging-57d98b74f9-k94fw                          1/1       Running   0          10m
kube-apiserver-ip-10-250-197-168.ec2.internal            1/1       Running   0          40m
kube-apiserver-ip-10-250-201-190.ec2.internal            1/1       Running   0          40m
kube-apiserver-ip-10-250-219-206.ec2.internal            1/1       Running   0          41m
kube-controller-manager-ip-10-250-197-168.ec2.internal   1/1       Running   0          42m
kube-controller-manager-ip-10-250-201-190.ec2.internal   1/1       Running   0          42m
kube-controller-manager-ip-10-250-219-206.ec2.internal   1/1       Running   0          42m
kube-dns-79d99cdcd5-7fnsv                                3/3       Running   0          7m
kube-dns-79d99cdcd5-dcxrd                                3/3       Running   0          39m
kube-proxy-ip-10-250-197-168.ec2.internal                1/1       Running   0          41m
kube-proxy-ip-10-250-201-190.ec2.internal                1/1       Running   0          41m
kube-proxy-ip-10-250-202-15.ec2.internal                 1/1       Running   0          41m
kube-proxy-ip-10-250-207-112.ec2.internal                1/1       Running   0          41m
kube-proxy-ip-10-250-210-236.ec2.internal                1/1       Running   0          41m
kube-proxy-ip-10-250-215-204.ec2.internal                1/1       Running   0          41m
kube-proxy-ip-10-250-219-206.ec2.internal                1/1       Running   0          41m
kube-scheduler-ip-10-250-197-168.ec2.internal            1/1       Running   0          42m
kube-scheduler-ip-10-250-201-190.ec2.internal            1/1       Running   0          42m
kube-scheduler-ip-10-250-219-206.ec2.internal            1/1       Running   0          42m
kubedns-autoscaler-5564b5585f-pr28c                      1/1       Running   0          39m
kubernetes-dashboard-69cb58d748-m5fmt                    1/1       Running   1          39m
nginx-proxy-ip-10-250-202-15.ec2.internal                1/1       Running   0          41m
nginx-proxy-ip-10-250-207-112.ec2.internal               1/1       Running   0          41m
nginx-proxy-ip-10-250-210-236.ec2.internal               1/1       Running   0          41m
nginx-proxy-ip-10-250-215-204.ec2.internal               1/1       Running   0          41m
tiller-deploy-5b48764ff7-9jk7x                           1/1       Running   0          39m

ianvernon (Member) commented

@brant4test I'm seeing the following in the text file you attached:

Events:
  Type     Reason   Age   From                                     Message
  ----     ------   ----  ----                                     -------
  Warning  Evicted  0s    kubelet, ip-10-250-215-204.ec2.internal  The node was low on resource: [DiskPressure].

brant4test (Author) commented Mar 7, 2018

@tgraf It's now based on a stack of kubespray “master 5aeaa24” + istio "0.5.1" + cilium "v1.0.0-rc6".
Things are much better/more stable, but cilium pods still sometimes end up in Pending status,
and I cannot get logs from the Pending cilium pods because they are gone so quickly.

$ kubectl -n kube-system logs cilium-xphxw
Error from server (NotFound): pods "cilium-xphxw" not found

Any tips? Thanks!

brant4test (Author) commented

@ianvernon Thanks for your reply. I've seen that before, but I have no clue what leads to DiskPressure. Any suggestions? Thanks!

ianvernon (Member) commented

@brant4test This is one of the NodeConditions described in the Kubernetes documentation:

Available disk space and inodes on either the node’s root filesystem or image filesystem has satisfied an eviction threshold

tgraf (Member) commented Mar 7, 2018

@brant4test Can you ssh into the node in question and check whether there is a large file named cilium-envoy.log in /var/run/cilium?

brant4test (Author) commented Mar 8, 2018

@ianvernon Thanks a lot for your help. From the doc you linked, I think this is the reason why the DaemonSet pods keep restarting.

[DaemonSet](https://kubernetes.io/docs/tasks/administer-cluster/out-of-resource/#soft-eviction-thresholds)

It is never desired for kubelet to evict a DaemonSet Pod, since the Pod is immediately recreated and rescheduled back to the same node.

At the moment, the kubelet has no ability to distinguish a Pod created from DaemonSet versus any other object. If/when that information is available, the kubelet could pro-actively filter those Pods from the candidate set of Pods provided to the eviction strategy.

In general, it is strongly recommended that DaemonSet not create BestEffort Pods to avoid being identified as a candidate Pod for eviction. Instead DaemonSet should ideally launch Guaranteed Pods.

@tgraf It seems no cilium-envoy.log exists, and the good news is that all nodes and pods are stable now.

ip-10-250-208-159 cilium # pwd
/var/run/cilium
ip-10-250-208-159 cilium # ls -lah
total 4.0K
drwxr-xr-x.  3 root root 160 Mar  7 22:37 .
drwxr-xr-x. 27 root root 700 Mar  7 22:37 ..
-rw-r-----.  1 root root   2 Mar  7 22:37 cilium.pid
srw-rw----.  1 root 1000   0 Mar  7 22:37 cilium.sock
prw-------.  1 root root   0 Mar  8 02:39 events.sock
srw-rw----.  1 root 1000   0 Mar  7 22:37 health.sock
srw-rw----.  1 root 1000   0 Mar  7 22:37 monitor.sock
drwxr-x---. 21 root root 540 Mar  8 02:40 state

My only remaining concern is the two Warning events below. Are they also issues, or can they be ignored? Thanks!

$ kubectl describe po/cilium-sp4hp -n kube-system
Name:           cilium-sp4hp
Namespace:      kube-system
Node:           ip-10-250-195-4.ec2.internal/10.250.195.4
Start Time:     Wed, 07 Mar 2018 22:37:40 +0000
Labels:         controller-revision-hash=1212936907
                k8s-app=cilium
                kubernetes.io/cluster-service=true
                pod-template-generation=1
Annotations:    scheduler.alpha.kubernetes.io/critical-pod=
                scheduler.alpha.kubernetes.io/tolerations=[{"key":"dedicated","operator":"Equal","value":"master","effect":"NoSchedule"}]
Status:         Running
IP:             10.250.195.4
Controlled By:  DaemonSet/cilium
Containers:
  cilium-agent:
    Container ID:  docker://2ef6063069cbc27a080d7c788473a46826c2a54953399ccb1b86ac354d672cf2
    Image:         docker.io/cilium/cilium:v1.0.0-rc6
    Image ID:      docker-pullable://cilium/cilium@sha256:35ac3e2c5c7e7b0d5a61fd2bfa62ee5c14090e2c5887ff8e8ffafc0230f15ff0
    Port:          <none>
    Command:
      cilium-agent
    Args:
      --debug=$(CILIUM_DEBUG)
      -t
      vxlan
      --kvstore
      etcd
      --kvstore-opt
      etcd.config=/var/lib/etcd-config/etcd.config
      --disable-ipv4=$(DISABLE_IPV4)
    State:          Running
      Started:      Wed, 07 Mar 2018 22:37:41 +0000
    Ready:          True
    Restart Count:  0
    Liveness:       exec [cilium status] delay=120s timeout=1s period=10s #success=1 #failure=10
    Readiness:      exec [cilium status] delay=5s timeout=1s period=5s #success=1 #failure=3
    Environment:
      K8S_NODE_NAME:   (v1:spec.nodeName)
      CILIUM_DEBUG:   <set to the key 'debug' of config map 'cilium-config'>         Optional: false
      DISABLE_IPV4:   <set to the key 'disable-ipv4' of config map 'cilium-config'>  Optional: false
    Mounts:
      /etc/cilium/certs from cilium-certs (ro)
      /host/etc/cni/net.d from etc-cni-netd (rw)
      /host/opt/cni/bin from cni-path (rw)
      /sys/fs/bpf from bpf-maps (rw)
      /var/lib/etcd-config from etcd-config-path (ro)
      /var/run/cilium from cilium-run (rw)
      /var/run/docker.sock from docker-socket (ro)
      /var/run/secrets/kubernetes.io/serviceaccount from cilium-token-ffs74 (ro)
Conditions:
  Type           Status
  Initialized    True 
  Ready          True 
  PodScheduled   True 
Volumes:
  cilium-run:
    Type:  HostPath (bare host directory volume)
    Path:  /var/run/cilium
  bpf-maps:
    Type:  HostPath (bare host directory volume)
    Path:  /sys/fs/bpf
  docker-socket:
    Type:  HostPath (bare host directory volume)
    Path:  /var/run/docker.sock
  cni-path:
    Type:  HostPath (bare host directory volume)
    Path:  /opt/cni/bin
  etc-cni-netd:
    Type:  HostPath (bare host directory volume)
    Path:  /etc/cni/net.d
  cilium-certs:
    Type:  HostPath (bare host directory volume)
    Path:  /etc/cilium/certs
  etcd-config-path:
    Type:      ConfigMap (a volume populated by a ConfigMap)
    Name:      cilium-config
    Optional:  false
  cilium-token-ffs74:
    Type:        Secret (a volume populated by a Secret)
    SecretName:  cilium-token-ffs74
    Optional:    false
QoS Class:       BestEffort
Node-Selectors:  <none>
Tolerations:     CriticalAddonsOnly
                 node-role.kubernetes.io/master:NoSchedule
                 node.cloudprovider.kubernetes.io/uninitialized=true:NoSchedule
                 node.kubernetes.io/disk-pressure:NoSchedule
                 node.kubernetes.io/memory-pressure:NoSchedule
                 node.kubernetes.io/not-ready:NoExecute
                 node.kubernetes.io/unreachable:NoExecute
Events:
  Type     Reason     Age                From                                   Message
  ----     ------     ----               ----                                   -------
  Warning  Unhealthy  12m (x3 over 14m)  kubelet, ip-10-250-195-4.ec2.internal  Liveness probe failed: context deadline exceeded
  Warning  Unhealthy  11m (x3 over 14m)  kubelet, ip-10-250-195-4.ec2.internal  Readiness probe failed: context deadline exceeded

tgraf (Member) commented Mar 8, 2018

@brant4test

It seems no cilium-envoy.log exists, and the good news is that all nodes and pods are stable now.

Great! We saw some instances of excessive log spamming that we have addressed in the meantime. I wanted to make sure that you were not affected by this.

My only remaining concern is the two Warning events below. Are they also issues, or can they be ignored?

If Cilium is healthy then it is a false negative, but we should not ignore them. I noticed that the timeout is only 1 second. We will look into this and get back to you.

brant4test (Author) commented

@tgraf Will Cilium 1.0.0-rc7 fix these warning events?
