Cilium in DSR mode: Can't reach nodeport service outside cluster from remote node in self managed AWS cluster #26407
Comments
Hi @teclone, thanks for the report. You mentioned that this is happening with snapshot.4; did you test with any other Cilium versions before (snapshot.3, or even Cilium 1.13)?
Hi @margamanterola, yes, I did. I tested with Cilium versions 1.13.3, 1.13.4, and snapshot.3 as well, and I also tested with Kubernetes 1.27.2 across various installations. I also tried using kube-router as the pod networking plugin alongside Cilium (that is, no direct routing, but BGP from kube-router), as outlined in this Cilium guide: https://docs.cilium.io/en/stable/network/kube-router/ Everything worked perfectly in all trials except that I am not able to reach the NodePort service from outside the cluster via the remote nodes. I also tried an Ubuntu installation instead of Amazon Linux 2023, with the same result.
Sorry, @teclone, but it's not clear to me from your message in which cases it failed or didn't fail. Are you saying that it also failed in all of those environments, or that it didn't fail in those? Is there any environment where it didn't fail?
Ah sorry, @margamanterola, it failed in all the cases. What I meant was that all the installations looked fine according to `kubectl -n kube-system exec ds/cilium -- cilium status --verbose`, just like the output in this issue report.
Alright, in that case it's not a 1.14 regression. I'll retitle.
From where are you trying to reach the service? I see mentions of whether or not the pod is running on the nodes, but you also say "from outside the cluster", which is confusing. Can you clarify from which locations it works and from which it doesn't? What response do you get when it works and when it doesn't?
@margamanterola I tried to reach the service from two places: outside the cluster and inside the cluster.
By outside the cluster, I mean from the public internet via my browser, using a node's public IP address.
Test 1: enter the public IP address of the node that has the pod running locally on it + the service nodePort; here is the output screen.
Test 2: enter the public IP address of either of the other two nodes (worker 2 and master 1) + the service nodePort.
By inside the cluster, I mean SSHing into each of the nodes and curling the service using each node's internal IP address + port. I receive a valid response, basically a 200 response code with the HTML page. The screenshot below is the curl request executed inside the master control plane node; the IP address is the internal IP address of my master control plane node. The nginx service is running on nodePort 31345.
I hope that clears up the confusion.
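For reference, the three checks described above boil down to something like this (the IP placeholders are mine; only the nodePort 31345 comes from the report):

```
# Test 1: public IP of the node hosting the pod -- responds
curl -m 5 http://<pod-node-public-ip>:31345/

# Test 2: public IP of the other nodes (worker 2, master 1) -- times out
curl -m 5 http://<worker2-public-ip>:31345/
curl -m 5 http://<master1-public-ip>:31345/

# From inside the cluster: internal IP of any node -- returns 200
curl -m 5 http://<node-internal-ip>:31345/
```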
Try using `cilium monitor` on the nodes to see what happens to the packets: whether they arrive at all, and whether the replies go back the wrong path.
Hi @margamanterola, it is not clear to me what to look out for in `cilium monitor`; I see a lot of logs when I execute it.
What exactly do you mean by going back the wrong path? How do I check for this? I would really appreciate any troubleshooting help I can get.
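One way to narrow this down (a sketch based on the suggestion above, not something posted in the original thread; the interface name and nodePort are placeholders):

```
# Show only dropped packets on each node's Cilium agent
kubectl -n kube-system exec ds/cilium -- cilium monitor --type drop

# On the node hosting the backend pod, check whether replies actually
# leave with the expected source address (with DSR the reply goes
# straight back to the client rather than through the ingress node)
sudo tcpdump -ni eth0 'tcp port 31345'
```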
Hi @teclone! I was able to get similar behavior to what you are seeing by setting up kind with a misconfiguration with regard to native routing. I believe your problem is very likely also due to misconfiguration.
My findings: in order to use DSR, Cilium needs to be configured to use native routing (also called "direct routing" in parts of the documentation). To reproduce the behavior in kind, what I did was to set
The
I used curl to reach the service on each nodeport and ran
So, Cilium was not able to redirect the traffic because the node was not able to reach the backend. After that, I reconfigured the cluster setting
Now, in your case, you are using AWS. I believe that what you need to do is use the ENI IPAM mode, as documented here:
If this solves the issue for you, please close the bug. Thanks
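For reference, the ENI-mode installation from the Cilium AWS guide looks roughly like this; treat it as a sketch from memory rather than the exact command in the linked documentation:

```
helm install cilium cilium/cilium --version 1.14.0-snapshot.4 \
  --namespace kube-system \
  --set eni.enabled=true \
  --set ipam.mode=eni \
  --set egressMasqueradeInterfaces=eth0 \
  --set tunnel=disabled
```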
Hi @margamanterola, thank you for taking the time to investigate, but I actually installed Cilium with `autoDirectNodeRoutes` set to true. However, I did not use the AWS ENI IPAM mode because the number of AWS IP addresses per node is very limited; it is not a scalable approach for me. Please can we set up a time to test this together, so I can show you what is going on? Remember that I am able to access the service from any node using the nodes' internal IP addresses. Why do the internal IP addresses of the nodes work, but their external IP addresses do not, except for the node that hosts the pod?
Hi @teclone, first, I'm sorry to say that I can't give you one-on-one support. If you want that level of support, you might want to talk to a vendor that provides it. I do believe that the issue you are seeing is due to a misconfiguration in your cluster. Configuring native routing can be tricky, and the recommended way when using AWS is the AWS ENI mode. As I mentioned above, you can try running `cilium monitor` to see what is happening to the traffic.
Hi @margamanterola, this is not a misconfiguration on my part; I followed the documentation properly. I have tried `cilium monitor` as you suggested. I can see in the logs that the request reached the agent on the node that hosts the pod from the other nodes; below is the log line.
I am not sure what to make of the log. Below are the options I passed to Helm while installing Cilium; please correct me if there is anything wrong with them.
helm repo add cilium https://helm.cilium.io/
SEED=$(head -c12 /dev/urandom | base64 -w0)
helm install cilium cilium/cilium --version 1.14.0-snapshot.4 \
--namespace kube-system \
--set k8sServiceHost=hostIP \
--set k8sServicePort=6443 \
--set kubeProxyReplacement=strict \
--set tunnel=disabled \
--set loadBalancer.mode=dsr \
--set loadBalancer.algorithm=maglev \
--set maglev.tableSize=65521 \
--set ipv6.enabled=true \
--set bpf.masquerade=true \
--set enableIPv6Masquerade=true \
--set autoDirectNodeRoutes=true \
--set ipv4NativeRoutingCIDR=10.10.0.0/16 \
--set ipv6NativeRoutingCIDR=fd10:800b:444f:2b00::/56 \
--set ipam.mode=kubernetes \
--set monitor.enabled=true \
--set endpointRoutes.enabled=true
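Not part of the original report, but a quick way to double-check how these options were picked up is to grep the verbose status output (the field names in the pattern are my guess at the interesting ones):

```
kubectl -n kube-system exec ds/cilium -- cilium status --verbose | \
  grep -E 'KubeProxyReplacement|Routing|Masquerading|DSR'
```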
Hi @margamanterola, any update from you on the log I shared above?
I suggest you give another read of:
In particular:
As I mentioned, I can't really provide 1:1 support. I tried to point you in the right direction, but it seems you need additional support to configure your service, and for that you should engage with a vendor rather than continue this engagement through a bug report.
@teclone I'm able to reproduce the issue you're experiencing when using OVH Dedicated Servers connected at Layer 2 using their vRack. Sounds like we're having an identical issue - the nodes are able to curl the NodePort pod from each other, but I'm only able to access it externally by using the IP of the node running the pod. I'm assuming that you haven't been able to find a fix for this?
If there's ever progress on that I would be interested to hear, especially as I might be working with OVH vRack on my next project. However, I can recommend that @JamesHawkinss and @teclone run the backend on every node (e.g. nginx) and use `externalTrafficPolicy: Local`, which usually works much better for on-prem environments...
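A minimal sketch of that suggestion, assuming the NodePort service from the report is simply named `nginx` in the default namespace (both assumptions):

```
# Only answer on nodes that host a backend pod; the client source IP is
# also preserved, which avoids the cross-node return path entirely.
kubectl patch svc nginx -p '{"spec":{"externalTrafficPolicy":"Local"}}'

# Then make sure every node actually runs a backend, e.g. by turning the
# nginx Deployment into a DaemonSet (or scheduling one replica per node).
```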
I have the same issue when running Cilium in DSR mode with Geneve as described here.
I think I've just hit the same issue mentioned here, which seems specific to DSR + KPR + BPF masquerade disabled. Reproduced on a two-node kind cluster, with Cilium (v1.15.0-rc.0) configured with:
The nodeport service is reachable when targeting the node hosting the backend pod, but not when targeting the other node. In that case,
The same issue, however, does not reproduce when BPF masquerade is enabled.
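For anyone comparing setups, the single value the comment above identifies would be toggled like this (a sketch only; the rest of the reproduction config is not shown in the thread):

```
# Reported: the problem goes away with BPF masquerading enabled
helm upgrade cilium cilium/cilium --version 1.15.0-rc.0 \
  --namespace kube-system \
  --reuse-values \
  --set bpf.masquerade=true
```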
@margamanterola please, might this be worth another look now?
Let me know if I am hijacking, but after searching multiple issues, this is the closest one to my issue. However, I'm using metallb to provide a public IP. With IPv4, I see exactly this issue when getting traffic via the external IP. IPv6 works just fine though.
This issue has been automatically marked as stale because it has not had recent activity.
I really wish the issue-hiding bot didn't exist. It doesn't make issues magically stop existing.
Is there an existing issue for this?
What happened?
I cannot access a nodePort nginx service from outside the cluster through `nodePublicIpAddress:nodePort` on any of the nodes that do not have the pod running locally on them. The connection always times out when the backend pod is on a remote node.
I am running Kubernetes 1.27.3 with Cilium v1.14.0-snapshot.4 in a self-managed dual-stack k8s cluster installed using kubeadm in AWS. There are two EC2 worker nodes and 1 control plane node, all in the same AWS region and availability zone (us-east-1a). I deployed Cilium in strict kube-proxy replacement mode and also disabled IP source/destination checks in AWS for all 3 nodes.
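For context, disabling the source/destination check is normally done per instance, roughly like this (the instance ID is a placeholder; this is an illustration, not the exact command used here):

```
aws ec2 modify-instance-attribute \
  --instance-id <instance-id> \
  --no-source-dest-check
```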
Here is the output of `kubectl -n kube-system exec ds/cilium -- cilium status --verbose`. It shows that all nodes and endpoints are reachable.
However, I can curl the nginx service from any of the 3 nodes internally, using each node's internal IP address and port:
# this works from any node internally
curl nodeInternalIp:nodePort
Cilium Version
v1.14.0-snapshot.4
Kernel Version
6.1.29-50.88.amzn2023.aarch64
Kubernetes Version
1.27.3
Sysdump
cilium-sysdump-20230621-174034.zip
Relevant log output
No response
Anything else?
Here is the deployment file
Here is the command that I used to expose the service
Code of Conduct