
Cilium in DSR mode: Can't reach nodeport service outside cluster from remote node in self managed AWS cluster #26407

Open
teclone opened this issue Jun 21, 2023 · 23 comments
Labels
area/loadbalancing: Impacts load-balancing and Kubernetes service implementations
feature/dsr: Relates to Cilium's Direct-Server-Return feature for KPR.
kind/bug: This is a bug in the Cilium logic.
kind/community-report: This was reported by a user in the Cilium community, e.g. via Slack.
sig/datapath: Impacts bpf/ or low-level forwarding details, including map management and monitor messages.

Comments

@teclone

teclone commented Jun 21, 2023

Is there an existing issue for this?

  • I have searched the existing issues

What happened?

I cannot access a NodePort nginx service from outside the cluster through nodePublicIpAddress:nodePort on any node that does not host the pod locally. The connection always times out when the backend pod is on a remote node.

I am running Kubernetes 1.27.3 with Cilium v1.14.0-snapshot.4 in a self-managed dual-stack k8s cluster installed using kubeadm on AWS. There are two EC2 worker nodes and one control plane node, all in the same AWS region and availability zone (us-east-1a).

I deployed Cilium in strict kube-proxy replacement mode and also disabled IP source/destination checks in AWS for all three nodes.

Here is the output of kubectl -n kube-system exec ds/cilium -- cilium status --verbose. It shows that all nodes and endpoints are reachable.

[ec2-user@ip-172-31-15-25 ~]$ kubectl -n kube-system exec ds/cilium -- cilium status --verbose
Defaulted container "cilium-agent" out of: cilium-agent, cilium-monitor, config (init), mount-cgroup (init), apply-sysctl-overwrites (init), mount-bpf-fs (init), clean-cilium-state (init), install-cni-binaries (init)
KVStore:                Ok   Disabled
Kubernetes:             Ok   1.27 (v1.27.3) [linux/arm64]
Kubernetes APIs:        ["EndpointSliceOrEndpoint", "cilium/v2::CiliumClusterwideNetworkPolicy", "cilium/v2::CiliumEndpoint", "cilium/v2::CiliumNetworkPolicy", "cilium/v2::CiliumNode", "cilium/v2alpha1::CiliumCIDRGroup", "core/v1::Namespace", "core/v1::Pods", "core/v1::Service", "networking.k8s.io/v1::NetworkPolicy"]
KubeProxyReplacement:   Strict   [ens5 ipv4 ipv6 (Direct Routing)]
Host firewall:          Disabled
CNI Chaining:           none
Cilium:                 Ok   1.14.0-snapshot.4 (v1.14.0-snapshot.4-6c8db759)
NodeMonitor:            Listening for events on 2 CPUs with 64x4096 of shared memory
Cilium health daemon:   Ok   
IPAM:                   IPv4: 3/254 allocated from 10.10.0.0/24, IPv6: 3/18446744073709551614 allocated from fd10:800b:444f:2b00::/64
Allocated addresses:
  10.10.0.184 (router)
  10.10.0.187 (health)
  10.10.0.73 (kube-system/coredns-6fccb86bcb-5n4bj)
  fd10:800b:444f:2b00::26e3 (kube-system/coredns-6fccb86bcb-5n4bj)
  fd10:800b:444f:2b00::50be (health)
  fd10:800b:444f:2b00::f003 (router)
IPv4 BIG TCP:           Disabled
IPv6 BIG TCP:           Disabled
BandwidthManager:       Disabled
Host Routing:           BPF
Masquerading:           BPF       [ens5]   10.10.0.0/16 [IPv4: Enabled, IPv6: Enabled]
Clock Source for BPF:   jiffies   [100 Hz]
Controller Status:      24/24 healthy
  Name                                  Last success   Last error   Count   Message
  cilium-health-ep                      45s ago        never        0       no error   
  dns-garbage-collector-job             49s ago        never        0       no error   
  endpoint-1837-regeneration-recovery   never          never        0       no error   
  endpoint-572-regeneration-recovery    never          never        0       no error   
  endpoint-891-regeneration-recovery    never          never        0       no error   
  endpoint-gc                           49s ago        never        0       no error   
  ipcache-inject-labels                 46s ago        10m48s ago   0       no error   
  k8s-heartbeat                         19s ago        never        0       no error   
  link-cache                            1s ago         never        0       no error   
  metricsmap-bpf-prom-sync              4s ago         never        0       no error   
  resolve-identity-1837                 46s ago        never        0       no error   
  resolve-identity-572                  30s ago        never        0       no error   
  resolve-identity-891                  45s ago        never        0       no error   
  sync-host-ips                         46s ago        never        0       no error   
  sync-lb-maps-with-k8s-services        10m46s ago     never        0       no error   
  sync-policymap-1837                   10s ago        never        0       no error   
  sync-policymap-572                    10s ago        never        0       no error   
  sync-policymap-891                    10s ago        never        0       no error   
  sync-to-k8s-ciliumendpoint (1837)     6s ago         never        0       no error   
  sync-to-k8s-ciliumendpoint (572)      10s ago        never        0       no error   
  sync-to-k8s-ciliumendpoint (891)      5s ago         never        0       no error   
  sync-utime                            46s ago        never        0       no error   
  template-dir-watcher                  never          never        0       no error   
  write-cni-file                        10m49s ago     never        0       no error   
Proxy Status:            OK, ip 10.10.0.184, 0 redirects active on ports 10000-20000, Envoy: embedded
Global Identity Range:   min 256, max 65535
Hubble:                  Ok   Current/Max Flows: 3911/4095 (95.51%), Flows/s: 6.01   Metrics: Disabled
KubeProxyReplacement Details:
  Status:                 Strict
  Socket LB:              Enabled
  Socket LB Tracing:      Enabled
  Socket LB Coverage:     Full
  Devices:                ens5 ipv4 ipv6 (Direct Routing)
  Mode:                   DSR
  Backend Selection:      Maglev (Table Size: 65521)
  Session Affinity:       Enabled
  Graceful Termination:   Enabled
  NAT46/64 Support:       Disabled
  XDP Acceleration:       Disabled
  Services:
  - ClusterIP:      Enabled
  - NodePort:       Enabled (Range: 30000-32767) 
  - LoadBalancer:   Enabled 
  - externalIPs:    Enabled 
  - HostPort:       Enabled
BPF Maps:   dynamic sizing: on (ratio: 0.002500)
  Name                          Size
  Auth                          524288
  Non-TCP connection tracking   65536
  TCP connection tracking       131072
  Endpoint policy               65535
  IP cache                      512000
  IPv4 masquerading agent       16384
  IPv6 masquerading agent       16384
  IPv4 fragmentation            8192
  IPv4 service                  65536
  IPv6 service                  65536
  IPv4 service backend          65536
  IPv6 service backend          65536
  IPv4 service reverse NAT      65536
  IPv6 service reverse NAT      65536
  Metrics                       1024
  NAT                           131072
  Neighbor table                131072
  Global policy                 16384
  Session affinity              65536
  Sock reverse NAT              65536
  Tunnel                        65536
Encryption:                                  Disabled        
Cluster health:                              3/3 reachable   (2023-06-21T17:25:31Z)
  Name                                       IP              Node        Endpoints
  master-node (localhost)                    internalIPv4    reachable   reachable
  worker-1                                   internalIPv4    reachable   reachable
  worker-2                                   internalIPv4    reachable   reachable

However, I can curl the nginx service from any of the three nodes internally using each node's IP address and port:

# this works from any node internally
curl nodeInternalIp:nodePort
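
For contrast, here is a sketch of the failing external test (placeholder names; nodePublicIp is the node's public address and nodePort is the port assigned to the service):

# this times out unless the target node is the one hosting the nginx pod
curl -m 10 http://nodePublicIp:nodePort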

Cilium Version

v1.14.0-snapshot.4

Kernel Version

6.1.29-50.88.amzn2023.aarch64

Kubernetes Version

1.27.3

Sysdump

cilium-sysdump-20230621-174034.zip

Relevant log output

No response

Anything else?

Here is the deployment file

apiVersion: apps/v1
kind: Deployment
metadata:
  name: my-nginx
spec:
  selector:
    matchLabels:
      run: my-nginx
  replicas: 1
  template:
    metadata:
      labels:
        run: my-nginx
    spec:
      containers:
      - name: my-nginx
        image: nginx
        ports:
        - containerPort: 80

Here is the command that I used to expose the service:

kubectl expose deployment my-nginx --type=NodePort --port=80
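
For reference, a quick way to confirm which nodePort was assigned to the service (a sketch; the 31345 value mentioned later in this thread came from a check like this):

kubectl get svc my-nginx -o jsonpath='{.spec.ports[0].nodePort}'
# or inspect the generated Service object in full
kubectl get svc my-nginx -o yaml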

Code of Conduct

  • I agree to follow this project's Code of Conduct
@teclone teclone added kind/bug This is a bug in the Cilium logic. kind/community-report This was reported by a user in the Cilium community, eg via Slack. needs/triage This issue requires triaging to establish severity and next steps. labels Jun 21, 2023
@ti-mo ti-mo added sig/datapath Impacts bpf/ or low-level forwarding details, including map management and monitor messages. release-blocker/1.14 This issue will prevent the release of the next version of Cilium. and removed needs/triage This issue requires triaging to establish severity and next steps. labels Jun 22, 2023
@margamanterola
Member

Hi @teclone, thanks for the report. You mentioned that this is happening with snapshot.4, did you test with any other Cilium versions before? (snapshot.3 or even Cilium 1.13).

@teclone
Author

teclone commented Jun 22, 2023

Hi @margamanterola, yes, I did. I tested with Cilium versions 1.13.3, 1.13.4, and snapshot.3 as well, and I also tested with Kubernetes 1.27.2.

I tried various installations. I also tried using kube-router as the pod networking plugin alongside Cilium (that is, no direct routing, but rather BGP from kube-router), as outlined in this Cilium guide: https://docs.cilium.io/en/stable/network/kube-router/

Everything worked perfectly in all trials, except that I was not able to reach the NodePort service from outside the cluster via the remote nodes.

I also tried with an Ubuntu installation instead of Amazon Linux 2023. Same result.

@margamanterola
Member

Sorry, @teclone, but it's not clear to me from your message in which cases it failed or didn't fail. Are you saying that it also failed in all of those environments, or that it didn't fail in those? Is there any environment where it didn't fail?

@teclone
Author

teclone commented Jun 22, 2023

Ah sorry, @margamanterola, it failed in all the cases. What I meant was that all the installations looked fine when I printed

kubectl -n kube-system exec ds/cilium -- cilium status --verbose

just like the one in this issue report.

@margamanterola
Member

Alright, in that case it's not a 1.14 regression. I'll retitle.

@margamanterola margamanterola changed the title Cilium v1.14.0-snapshot.4 in DSR mode: Can't reach nodeport service outside cluster from remote node in self managed AWS cluster Cilium in DSR mode: Can't reach nodeport service outside cluster from remote node in self managed AWS cluster Jun 22, 2023
@margamanterola
Member

From where are you trying to reach the service? I see mentions of whether or not the pod is running on the node, but you also say from outside the cluster, which is confusing. Can you clarify from which locations it works and from which it doesn't? What response do you get when it works and when it doesn't?

@teclone
Author

teclone commented Jun 22, 2023

@margamanterola I tried to reach the service from two places: outside the cluster and inside the cluster. By outside the cluster, I mean from the public internet via my browser, using the node's public IP address.

Test 1: enter the public IP address of the node that has the pod running locally on it, plus the service nodePort. Here is the output screen:
[Screenshot 2023-06-22 at 14 37 59]

Test 2: enter the public IP address of either of the other two nodes (worker 2 and the master), plus the service nodePort.
Error: connection timed out.

By inside the cluster, I mean when I SSH into each of the nodes and curl the service using that node's internal IP address plus the port. I receive a valid response, basically a 200 response code with the HTML page.

The screenshot below is the curl request executed on the master control plane node. The IP address is the internal IP address of the master control plane node. The nginx service is running on nodePort 31345.

[Screenshot 2023-06-22 at 14 46 12]

I hope this clears up the confusion.

@margamanterola
Member

Try using Hubble or cilium monitor to debug what's going on with your packets. It's likely they're going back via the wrong path.

@margamanterola margamanterola removed the release-blocker/1.14 This issue will prevent the release of the next version of Cilium. label Jun 22, 2023
@teclone
Author

teclone commented Jun 22, 2023

Try using Hubble or cilium monitor to debug what's going on with your packets. It's likely they're going back via the wrong path.

Hi @margamanterola, it is not clear to me what to look out for in cilium monitor; I see a lot of logs when I execute

kubectl -n kube-system exec ds/cilium -- cilium monitor

What exactly do you mean by going back via the wrong path? How do I check for this? I would really appreciate any troubleshooting help I can get.

@margamanterola
Member

margamanterola commented Jun 23, 2023

Hi @teclone! I was able to reproduce behavior similar to what you are seeing by setting up KIND with a misconfiguration of native routing. I believe your problem is very likely also due to misconfiguration. My findings:

In order to use DSR, Cilium needs to be configured to use native routing (also called "direct routing" in parts of the documentation). To reproduce the behavior in KiND, what I did was to set autoDirectNodeRoutes: false. By doing that, I was telling Cilium that packets for the PodCIDRs would be able to get natively routed, without actually setting up the native routes.

Then I used curl to reach the service on each nodePort and ran cilium monitor in the different Cilium pods. I saw that if I ran cilium monitor on the same node where the nginx pod was running, I could see the traffic go by successfully when connecting to that same node, but nothing when trying to curl the "broken" node. When running cilium monitor on the "broken" node, I saw this:

<- network flow 0x346bddf0 , identity unknown->unknown state unknown ifindex eth0 orig-ip 0.0.0.0: 172.18.0.1:45640 -> 172.18.0.3:30033 tcp SYN
xx drop (Service backend not found) flow 0x346bddf0 to endpoint 0, ifindex 76, file bpf_host.c:879, , identity world->unknown: 172.18.0.1:45640 -> 172.18.0.3:30033 tcp SYN

So, Cilium was not able to redirect the traffic because the node was not able to reach the backend.

After that, I reconfigured the cluster with autoDirectNodeRoutes: true (which tells Cilium to create the pod routes as needed). In that case I was able to properly reach the service on both nodes.

Now, in your case, you are using AWS. I believe that what you need to do is use the ENI IPAM mode, as documented here:
https://docs.cilium.io/en/stable/network/concepts/routing/#aws-eni-datapath
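
For reference, a minimal sketch of what an ENI-mode installation could look like, with Helm values assumed from the linked documentation (verify the exact flag names against your Cilium version; ens5 is the device from the cilium status output above):

helm install cilium cilium/cilium --version 1.14.0-snapshot.4 \
  --namespace kube-system \
  --set eni.enabled=true \
  --set ipam.mode=eni \
  --set egressMasqueradeInterfaces=ens5 \
  --set tunnel=disabled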

If this solves the issue for you, please close the bug. Thanks

@teclone
Author

teclone commented Jun 24, 2023

Hi @margamanterola, thank you for taking the time to investigate, but I actually installed Cilium with autoDirectNodeRoutes set to true.

However, I did not use the AWS ENI IPAM mode because the number of AWS IP addresses per node is very limited; it is not a scalable approach for me.

Please, can we set up a time to test this together, so I can show you what is going on? Remember that I am able to access the service from any node using the nodes' internal IP addresses.

Why do the internal IP addresses of the nodes work, but not their external IP addresses, except for the node that hosts the pod?

@margamanterola
Member

Hi @teclone,

First, I'm sorry to say that I can't give you one-on-one support. If you want that level of support you might want to talk to a vendor that offers it.

I do believe that the issue you are seeing is due to a misconfiguration in your cluster. Configuring native routing can be tricky, and the recommended way when using AWS is to use the AWS ENI mode.

As I mentioned above, you can try running cilium monitor on the agents running on the different nodes and see if you can find where and why the packets are being dropped. To make this simpler, what I did was store the output in a file and then view the file with a text editor, where I could search for the ports that I was interested in.
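
A sketch of that workflow, assuming the nodePort from the earlier test (31345) and a placeholder pod name for the Cilium agent on the "broken" node:

# capture only drops from the agent on the node that fails, then stop with Ctrl-C
kubectl -n kube-system exec cilium-xxxxx -- cilium monitor --type drop > monitor.log
# search the capture for the service port you are testing
grep 31345 monitor.log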

@teclone
Author

teclone commented Jun 26, 2023

Hi @margamanterola, this is not a misconfiguration on my part; I followed the documentation carefully.

I have tried cilium monitor as you suggested. I can see in the logs that the request reached the agent on the node that hosts the pod from the other nodes; below are the log lines.

-> network flow 0x3506c3df , identity unknown->unknown state reply ifindex 0 orig-ip 0.0.0.0: 172.31.15.25:31165 -> 197.210.55.103:14731 tcp SYN, ACK
-> network flow 0xaefc16ba , identity unknown->unknown state reply ifindex 0 orig-ip 0.0.0.0: 172.31.15.25:31165 -> 197.210.226.251:18769 tcp SYN, ACK
-> network flow 0xb5e3c06c , identity unknown->unknown state reply ifindex 0 orig-ip 0.0.0.0: 172.31.15.25:31165 -> 197.210.226.251:36918 tcp SYN, ACK
-> network flow 0x2223a415 , identity unknown->unknown state reply ifindex 0 orig-ip 0.0.0.0: 172.31.15.25:31165 -> 197.210.55.103:14731 tcp SYN, ACK
-> network flow 0xa2e16735 , identity unknown->unknown state reply ifindex 0 orig-ip 0.0.0.0: 172.31.15.25:31165 -> 197.210.226.251:18769 tcp SYN, ACK

I am not sure what to make of these logs.

Below are the options I passed to Helm while installing Cilium. Correct me if there is anything wrong with them.

helm repo add cilium https://helm.cilium.io/
SEED=$(head -c12 /dev/urandom | base64 -w0)
helm install cilium cilium/cilium --version 1.14.0-snapshot.4 \
  --namespace kube-system \
  --set k8sServiceHost=hostIP \
  --set k8sServicePort=6443 \
  --set kubeProxyReplacement=strict \
  --set tunnel=disabled \
  --set loadBalancer.mode=dsr \
  --set loadBalancer.algorithm=maglev \
  --set maglev.tableSize=65521 \
  --set ipv6.enabled=true \
  --set bpf.masquerade=true \
  --set enableIPv6Masquerade=true \
  --set autoDirectNodeRoutes=true \
  --set ipv4NativeRoutingCIDR=10.10.0.0/16 \
  --set ipv6NativeRoutingCIDR=fd10:800b:444f:2b00::/56 \
  --set ipam.mode=kubernetes \
  --set monitor.enabled=true \
  --set endpointRoutes.enabled=true
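
For what it's worth, a quick sanity check (a sketch) that autoDirectNodeRoutes actually installed the per-node PodCIDR routes: on each node you would expect a route for every remote node's PodCIDR via that node's internal IP, within the native-routing CIDRs set above.

# run on each node
ip route | grep 10.10.
ip -6 route | grep fd10:800b:444f:2b00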

@teclone
Author

teclone commented Jul 2, 2023

Hi @margamanterola, any update from you on the log I shared above?

@margamanterola
Member

I suggest you give another read to:
https://docs.cilium.io/en/stable/network/kubernetes/kubeproxy-free/#direct-server-return-dsr

In particular:

Note that usage of DSR mode might not work in some public cloud provider environments due to the Cilium-specific IP options that could be dropped by an underlying network fabric.
[...]
Also, in some public cloud provider environments, which implement a source / destination IP address checking (e.g. AWS), the checking has to be disabled in order for the DSR mode to work.
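
For reference, disabling that check per instance can be done with the AWS CLI along these lines (the instance ID is a placeholder; it has to be repeated for every node):

aws ec2 modify-instance-attribute --instance-id i-0123456789abcdef0 --no-source-dest-check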

As I mentioned, I can't really provide 1:1 support. I tried to point you in the right direction, but it seems you need additional support to configure your service and for that you should engage with a vendor, rather than continue this engagement through a bug report.

@julianwiedmann julianwiedmann added the area/loadbalancing Impacts load-balancing and Kubernetes service implementations label Aug 30, 2023
@JamesHawkinss

@teclone I'm able to reproduce the issue you're experiencing when using OVH Dedicated Servers connected at Layer 2 using their vRack. Sounds like we're having an identical issue - the nodes are able to curl the NodePort pod from each other, but I'm only able to access it externally by using the IP of the node running the pod.

I'm assuming that you haven't been able to find a fix for this?

@ninja-

ninja- commented Nov 12, 2023

If there's ever progress on that I would be interested to hear, especially since I might be working with OVH vRack on my next project. However, I can recommend that @JamesHawkinss and @teclone run the backend on each node (like nginx) and use externalTrafficPolicy: Local, which usually works much better for on-prem environments (see the sketch below)...
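
A sketch of what that could look like for the my-nginx service from this issue, assuming the backend is deployed on every node (e.g. as a DaemonSet):

# with externalTrafficPolicy: Local, each node only forwards to its local backend pods
kubectl patch svc my-nginx -p '{"spec":{"externalTrafficPolicy":"Local"}}'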

@carnerito
Contributor

I have the same issue when running Cilium in DSR mode with Geneve as described here.

@giorio94
Member

giorio94 commented Jan 8, 2024

I think I've just hit the same issue mentioned here, which seems specific to DSR + KPR + BPF masquerade disabled.

Reproduced on a two-node kind cluster, with Cilium (v1.15.0-rc.0) configured with:

bpf:
  masquerade: false
kube-proxy-replacement: strict
tunnelProtocol: geneve
loadBalancer:
  mode: dsr
  dsrDispatch: geneve

The NodePort service is reachable when targeting the node hosting the backend pod, but not when targeting the other node. In that case, curl 172.18.0.6:31852 returns: curl: (56) Recv failure: Connection reset by peer. Running tcpdump on the node hosting the backend highlights that the second response is not SNATted correctly (9898 is the port the server is listening on, 172.18.0.5 the IP of the hosting node):

$ tcpdump -ni eth0 port 9898 or port 31852
17:54:46.972737 IP 172.18.0.6.31852 > 172.18.0.1.55830: Flags [S.], seq 239868844, ack 3429881298, win 64308, options [mss 1410,sackOK,TS val 708178691 ecr 830692500,nop,wscale 7], length 0
17:54:46.972818 IP 172.18.0.5.9898 > 172.18.0.1.55830: Flags [.], ack 3429881378, win 502, options [nop,nop,TS val 708178691 ecr 830692500], length 0
17:54:46.972824 IP 172.18.0.1.55830 > 172.18.0.5.9898: Flags [R], seq 3429881378, win 0, length 0
17:54:47.173255 IP 172.18.0.6.31852 > 172.18.0.1.55830: Flags [R], seq 239868845, win 0, length 0

By contrast, the same issue does not reproduce when BPF masquerade is enabled.
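
For anyone trying to reproduce this, the configuration above roughly corresponds to Helm values along these lines (a sketch; names taken from the Cilium Helm chart and worth double-checking against v1.15):

helm install cilium cilium/cilium --version 1.15.0-rc.0 \
  --namespace kube-system \
  --set kubeProxyReplacement=strict \
  --set bpf.masquerade=false \
  --set tunnelProtocol=geneve \
  --set loadBalancer.mode=dsr \
  --set loadBalancer.dsrDispatch=geneve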

@teclone
Author

teclone commented Jan 9, 2024

@margamanterola please, might this be worth another look now?

@withinboredom

Let me know if I am hijacking, but after searching multiple issues, this is the closest one to mine. However, I'm using MetalLB to provide a public IP. With IPv4, I see exactly this issue when receiving traffic via the external IP; IPv6 works just fine though.


This issue has been automatically marked as stale because it has not
had recent activity. It will be closed if no further activity occurs.

@github-actions github-actions bot added the stale The stale bot thinks this issue is old. Add "pinned" label to prevent this from becoming stale. label Mar 12, 2024
@withinboredom

I really wish the issue-hiding bot didn't exist. It doesn't make issues magically stop existing.

@giorio94 giorio94 removed the stale The stale bot thinks this issue is old. Add "pinned" label to prevent this from becoming stale. label Mar 13, 2024
@julianwiedmann julianwiedmann added the feature/dsr Relates to Cilium's Direct-Server-Return feature for KPR. label May 3, 2024