
Clustermesh doesn't show external endpoints in Azure #8849

Closed
ghost opened this issue Aug 9, 2019 · 11 comments · Fixed by #8904
Labels: kind/bug (This is a bug in the Cilium logic.), kind/community-report (This was reported by a user in the Cilium community, e.g. via Slack.)
ghost commented Aug 9, 2019

Support

I'm deploying a clustermesh using AKS Engine and have installed Cilium on two different clusters. Following the clustermesh installation guide, everything looks correct: nodes are listed, the status is healthy, and no errors appear in the etcd-operator log. However, I cannot reach external endpoints; the example app always answers from the current cluster.

Following the troubleshooting guide, I found in the agents' debuginfo that no external endpoints are declared. Each cluster has one master and two worker nodes. I attach the node list and status from both clusters and can provide additional logs if required.

Any help would be appreciated.

Cluster 1

```
$ kubectl -n kube-system exec -it cilium-vg8sm -- cilium node list
Name                              IPv4 Address   Endpoint CIDR    IPv6 Address   Endpoint CIDR
cluster1/k8s-cilium2-29734124-0   172.18.2.5     192.168.1.0/24
cluster1/k8s-cilium2-29734124-1   172.18.2.4     10.4.0.0/16
cluster1/k8s-master-29734124-0    172.18.1.239   10.239.0.0/16
cluster2/k8s-cilium2-14610979-0   172.18.2.6     192.168.2.0/24
cluster2/k8s-cilium2-14610979-1   172.18.2.7     10.7.0.0/16
cluster2/k8s-master-14610979-0    172.18.2.239   10.239.0.0/16
```

```
$ kubectl -n kube-system exec -it cilium-vg8sm -- cilium status
KVStore:                Ok   etcd: 1/1 connected: https://cilium-etcd-client.kube-system.svc:2379 - 3.3.11
ContainerRuntime:       Ok   docker daemon: OK
Kubernetes:             Ok   1.15 (v1.15.1) [linux/amd64]
Kubernetes APIs:        ["CustomResourceDefinition", "cilium/v2::CiliumNetworkPolicy", "core/v1::Endpoint", "core/v1::Namespace", "core/v1::Node", "core/v1::Pods", "core/v1::Service", "networking.k8s.io/v1::NetworkPolicy"]
Cilium:                 Ok   OK
NodeMonitor:            Disabled
Cilium health daemon:   Ok
IPv4 address pool:      10/65535 allocated from 10.4.0.0/16
Controller Status:      48/48 healthy
Proxy Status:           OK, ip 10.4.0.1, port-range 10000-20000
Cluster health:         6/6 reachable   (2019-08-09T10:11:22Z)
```

Cluster 2

```
$ kubectl -n kube-system exec -it cilium-rl8gt -- cilium node list
Name                              IPv4 Address   Endpoint CIDR    IPv6 Address   Endpoint CIDR
cluster1/k8s-cilium2-29734124-0   172.18.2.5     192.168.1.0/24
cluster1/k8s-cilium2-29734124-1   172.18.2.4     10.4.0.0/16
cluster1/k8s-master-29734124-0    172.18.1.239   10.239.0.0/16
cluster2/k8s-cilium2-14610979-0   172.18.2.6     192.168.2.0/24
cluster2/k8s-cilium2-14610979-1   172.18.2.7     10.7.0.0/16
cluster2/k8s-master-14610979-0    172.18.2.239   10.239.0.0/16
```

```
$ kubectl -n kube-system exec -it cilium-rl8gt -- cilium status
KVStore:                Ok   etcd: 1/1 connected: https://cilium-etcd-client.kube-system.svc:2379 - 3.3.11
ContainerRuntime:       Ok   docker daemon: OK
Kubernetes:             Ok   1.15 (v1.15.1) [linux/amd64]
Kubernetes APIs:        ["CustomResourceDefinition", "cilium/v2::CiliumNetworkPolicy", "core/v1::Endpoint", "core/v1::Namespace", "core/v1::Node", "core/v1::Pods", "core/v1::Service", "networking.k8s.io/v1::NetworkPolicy"]
Cilium:                 Ok   OK
NodeMonitor:            Disabled
Cilium health daemon:   Ok
IPv4 address pool:      10/65535 allocated from 10.7.0.0/16
Controller Status:      48/48 healthy
Proxy Status:           OK, ip 10.7.0.1, port-range 10000-20000
Cluster health:         6/6 reachable   (2019-08-09T10:40:39Z)
```

@tgraf tgraf added kind/community-report This was reported by a user in the Cilium community, eg via Slack. kind/bug This is a bug in the Cilium logic. labels Aug 12, 2019

tgraf commented Aug 12, 2019

@juan-daishogroup Based on the output you posted, the clustermesh control plane is up and running, so it seems to be a problem with global service correlation.

Are you able to post the debuginfo output of both clusters?

It would also be useful to see `[...] exec -ti cilium-xxx -- cilium kvstore get --recursive cilium/`. You should see the global service in there. If not, something may be wrong with cilium-operator synchronizing global services into the kvstore.
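As a concrete illustration of that check, the sketch below greps a hypothetical kvstore dump for global-service keys. The pod name and key values are placeholders; the real dump comes from the `kubectl exec ... cilium kvstore get` command above.

```shell
# Sketch only: the real check is
#   kubectl -n kube-system exec -ti <cilium-pod> -- cilium kvstore get --recursive cilium/
# Below, a hypothetical list of kvstore keys stands in for that output.
sample_keys='cilium/state/nodes/v1/cluster1/k8s-cilium2-29734124-0
cilium/state/services/v1/default/rebel-base'

# A working operator produces at least one key under cilium/state/services/v1/
matches=$(printf '%s\n' "$sample_keys" | grep -c '^cilium/state/services/v1/')
if [ "$matches" -ge 1 ]; then
  echo "global services present in kvstore"
else
  echo "no service keys: operator is not syncing"
fi
```

If the `services/v1` subtree is empty while the `nodes/v1` subtree is populated, the clustermesh control plane is fine and the operator's service sync is the suspect.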

@tgraf tgraf self-assigned this Aug 12, 2019

ghost commented Aug 12, 2019

Thanks for the assistance.

I've attached the debuginfo output for both clusters. I am a bit concerned about a recurring error:

```
Error: Unable to open /sys/fs/bpf/tc/globals/cilium_ct4_461: Unable to get object /sys/fs/bpf/tc/globals/cilium_ct4_461: no such file or directory
```

I don't see any errors in the etcd-operator or operator logs.

Any guidance is welcome.

Debug info:

Cluster 1: https://pastebin.com/8PKneWaP

Cluster 2: https://pastebin.com/uELutbry


tgraf commented Aug 12, 2019

> I've attached the debuginfo output for both clusters. I am a bit concerned about a recurring error.

This is just the debuginfo script trying to scrape per-endpoint conntrack tables, which are not enabled. It's a cosmetic problem only; the debuginfo output can be a bit raw, as it is intended for developers.

Based on the debuginfo, no external endpoints are visible, as you guessed. You need to check the kvstore and see whether they are synchronized correctly. It's basically this step from the troubleshooting guide:

  • When using global services, ensure that the cilium-operator deployment is running and healthy. It is responsible for propagating Kubernetes services into the kvstore. You can validate the correct functioning of the operator by running:

```
cilium kvstore get --recursive cilium/state/services/v1/
```

An entry must exist for each global service.

You can confirm the correct propagation of service backend endpoints by running `cilium service list` and `cilium bpf lb list` to see the complete mapping of service IPs to backend/pod IPs.
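To make that cross-check concrete, here is a hedged sketch. The service entry below is a hypothetical one-line stand-in for `cilium service list` output; in this mesh, a healthy global service should list backends from both clusters' pod CIDRs (10.4.0.0/16 for cluster 1 and 10.7.0.0/16 for cluster 2).

```shell
# Hypothetical rendering of one global-service entry: frontend IP
# followed by its backend pod IPs from both clusters.
svc='10.0.51.185:80 => 10.4.1.23:80 10.7.2.14:80'

# Backends from cluster 1 (10.4.0.0/16) and cluster 2 (10.7.0.0/16):
echo "$svc" | grep -q '10\.4\.' && echo "cluster1 backend present"
echo "$svc" | grep -q '10\.7\.' && echo "cluster2 backend present"
```

If only local-CIDR backends ever show up, the remote cluster's endpoints are not being merged into the service, which matches the symptom reported here.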


ghost commented Aug 12, 2019

Well, it actually seems that the information is not being propagated correctly. Is there any log output or additional check you can suggest to help find the failure point?

Thanks for the support!


tgraf commented Aug 13, 2019

@juan-daishogroup Yes:

  • Make sure you are running the operator with --synchronize-k8s-services enabled.
  • Check the operator logs to see whether it can connect to the k8s apiserver.
  • Check for the log message Starting to synchronize k8s services to kvstore in the operator logs.
  • If it is still unclear, enable --debug=true in the operator and provide the logs.
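A minimal way to run the log check from the list above; the log line here is a hypothetical excerpt, and real logs would come from something like `kubectl -n kube-system logs deploy/cilium-operator`:

```shell
# Hypothetical operator log excerpt; with --synchronize-k8s-services enabled
# and a working apiserver connection, this message should appear on startup.
logs='level=info msg="Starting to synchronize k8s services to kvstore" subsys=cilium-operator'

if echo "$logs" | grep -q 'Starting to synchronize k8s services to kvstore'; then
  echo "operator service sync started"
else
  echo "sync message missing: check operator flags and kvstore config"
fi
```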


tgraf commented Aug 14, 2019

@juan-daishogroup Issue has been identified, fix is coming.

tgraf added a commit that referenced this issue Aug 14, 2019

```
The kvstore settings were not correctly derived from the ConfigMap which
resulted in all kvstore functionality to always be disabled.

Fixes: da0c527 ("operator: Don't attempt to connect to kvstore if disabled")
Fixes: #8849

Reported-by: @juan-daishogroup
Reported-by: Ara Zarifian
Signed-off-by: Thomas Graf <thomas@cilium.io>
```

tgraf commented Aug 14, 2019

Fix is coming via #8904. In addition to running the fixed operator, the following must be added to the operator Deployment:

```diff
+        - name: CILIUM_KVSTORE
+          valueFrom:
+            configMapKeyRef:
+              key: kvstore
+              name: cilium-config
+              optional: true
+        - name: CILIUM_KVSTORE_OPT
+          valueFrom:
+            configMapKeyRef:
+              key: kvstore-opt
+              name: cilium-config
+              optional: true
```
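For orientation, the added entries sit under the operator container's `env:` list in the Deployment. A sketch of the surrounding structure, where the container name is an illustrative placeholder:

```yaml
# Sketch only: placement of the new env vars in the cilium-operator Deployment.
spec:
  template:
    spec:
      containers:
        - name: cilium-operator        # illustrative container name
          env:
            - name: CILIUM_KVSTORE
              valueFrom:
                configMapKeyRef:
                  key: kvstore
                  name: cilium-config
                  optional: true
            - name: CILIUM_KVSTORE_OPT
              valueFrom:
                configMapKeyRef:
                  key: kvstore-opt
                  name: cilium-config
                  optional: true
```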

@tgraf tgraf added this to the 1.6.0 milestone Aug 14, 2019

ghost commented Aug 14, 2019

Awesome. I will try it out.


tgraf commented Aug 14, 2019

There is a build with the fix: `cilium/operator:kvstore-fix`


tgraf commented Aug 14, 2019

The operator deployment must either:

  1. set --kvstore and --kvstore-opt via args:, or
  2. have the following in the cilium-config ConfigMap:

```
kvstore: etcd
kvstore-opt: '{"etcd.config": "/var/lib/etcd-config/etcd.config"}'
```
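The second option as a complete ConfigMap fragment; only the two data keys come from the comment above, the rest is standard ConfigMap boilerplate:

```yaml
apiVersion: v1
kind: ConfigMap
metadata:
  name: cilium-config
  namespace: kube-system
data:
  kvstore: etcd
  kvstore-opt: '{"etcd.config": "/var/lib/etcd-config/etcd.config"}'
```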


ghost commented Aug 14, 2019

I can confirm the fix is working. Thanks for the quick turnaround.

@ghost ghost closed this as completed Aug 14, 2019
brb pushed a commit that referenced this issue Aug 15, 2019

```
[ upstream commit a37ff0f ]

The kvstore settings were not correctly derived from the ConfigMap which
resulted in all kvstore functionality to always be disabled.

Fixes: da0c527 ("operator: Don't attempt to connect to kvstore if disabled")
Fixes: #8849

Reported-by: @juan-daishogroup
Reported-by: Ara Zarifian
Signed-off-by: Thomas Graf <thomas@cilium.io>
Signed-off-by: Martynas Pumputis <m@lambda.lt>
```
ianvernon pushed a commit that referenced this issue Aug 15, 2019

```
[ upstream commit a37ff0f ]

The kvstore settings were not correctly derived from the ConfigMap which
resulted in all kvstore functionality to always be disabled.

Fixes: da0c527 ("operator: Don't attempt to connect to kvstore if disabled")
Fixes: #8849

Reported-by: @juan-daishogroup
Reported-by: Ara Zarifian
Signed-off-by: Thomas Graf <thomas@cilium.io>
Signed-off-by: Martynas Pumputis <m@lambda.lt>
```