Prefer k8s Node IP for SNAT IP (IPV{4,6}_NODEPORT) when multiple IP addrs are set to BPF NodePort device #12988

Closed
networkop opened this issue Aug 27, 2020 · 29 comments · Fixed by #13223
Assignees
Labels
kind/bug This is a bug in the Cilium logic. kind/community-report This was reported by a user in the Cilium community, eg via Slack.

Comments

@networkop
Contributor

Bug report

When Cilium is configured as a kube-proxy replacement, it fails to masquerade the source IP of pods when the target pod is in the hostNetwork of one of the k8s nodes.

General Information

  • Cilium version
Client: 1.8.2 aa42034f0 2020-07-23T15:02:39-07:00 go version go1.14.6 linux/amd64
Daemon: 1.8.2 aa42034f0 2020-07-23T15:02:39-07:00 go version go1.14.6 linux/amd64
  • Kernel version
Linux primary-external-workers-uksouth-machine-node-0 5.4.0-1022-azure #22-Ubuntu SMP Fri Jul 10 06:14:37 UTC 2020 x86_64 x86_64 x86_64 GNU/Linux
  • Orchestration system version in use
$ kubectl version

Client Version: version.Info{Major:"1", Minor:"18", GitVersion:"v1.18.8", GitCommit:"9f2892aab98fe339f3bd70e3c470144299398ace", GitTreeState:"clean", BuildDate:"2020-08-13T16:12:48Z", GoVersion:"go1.13.15", Compiler:"gc", Platform:"linux/amd64"}
Server Version: version.Info{Major:"1", Minor:"18", GitVersion:"v1.18.8", GitCommit:"9f2892aab98fe339f3bd70e3c470144299398ace", GitTreeState:"clean", BuildDate:"2020-08-13T16:04:18Z", GoVersion:"go1.13.15", Compiler:"gc", Platform:"linux/amd64"}

How to reproduce the issue

  1. Using kubespray (or kubeadm manually), build a cluster on Azure with the cloud provider disabled. The API-server pods run as static pods in the host OS namespace on the 3 controller nodes.
k describe svc kubernetes
Name:              kubernetes
Namespace:         default
Labels:            component=apiserver
                   provider=kubernetes
Annotations:       <none>
Selector:          <none>
Type:              ClusterIP
IP:                10.0.128.1
Port:              https  443/TCP
TargetPort:        6443/TCP
Endpoints:         10.0.255.10:6443,10.0.255.6:6443,10.0.255.7:6443
Session Affinity:  None
Events:            <none>
  2. Deploy a test pod on any worker node and attempt to connect to the API server ClusterIP. This results in timeouts:
bash-5.0# nc -zv -w 5 10.0.128.1 443
nc: connect to 10.0.128.1 port 443 (tcp) timed out: Operation in progress

Additional details

I have a test pod (10.0.6.223) trying to connect to the API server on 10.0.128.1:443. When doing a tcpdump on the underlying node, I cannot see SNAT'ed packets.

root@primary-external-workers-uksouth-machine-node-0:~# tcpdump -i any -n 'host 10.0.255.10 or host 10.0.255.6 or 10.0.255.7'
14:01:30.463349 IP 10.0.6.223.35832 > 10.0.255.7.6443: Flags [S], seq 3752066639, win 64860, options [mss 1410,sackOK,TS val 1253644246 ecr 0,nop,wscale 7], length 0
14:01:34.687357 IP 10.0.6.223.35832 > 10.0.255.7.6443: Flags [S], seq 3752066639, win 64860, options [mss 1410,sackOK,TS val 1253648470 ecr 0,nop,wscale 7], length 0

Since the source IP of the pod is not masqueraded, it gets dropped by Azure's networking stack.

According to @brb's comment, it should have been translated at TC egress; however, I don't see that happening (see the tcpdump above).

As soon as I change enable-bpf-masquerade to "false" and restart the agent on the node, connectivity gets restored.

I'll keep the environment up and running for a while, so I'm happy to collect any additional logs/outputs.

Here's the cilium configmap.

apiVersion: v1
kind: ConfigMap
metadata:
  name: cilium-config
  namespace: kube-system
data:
  # Identity allocation mode selects how identities are shared between cilium
  # nodes by setting how they are stored. The options are "crd" or "kvstore".
  # - "crd" stores identities in kubernetes as CRDs (custom resource definition).
  #   These can be queried with:
  #     kubectl get ciliumid
  # - "kvstore" stores identities in a kvstore, etcd or consul, that is
  #   configured below. Cilium versions before 1.6 supported only the kvstore
  #   backend. Upgrades from these older cilium versions should continue using
  #   the kvstore by commenting out the identity-allocation-mode below, or
  #   setting it to "kvstore".
  identity-allocation-mode: crd

  # If you want to run cilium in debug mode change this value to true
  debug: "false"

  # Enable IPv4 addressing. If enabled, all endpoints are allocated an IPv4
  # address.
  enable-ipv4: "true"

  # Enable IPv6 addressing. If enabled, all endpoints are allocated an IPv6
  # address.
  enable-ipv6: "false"
  enable-bpf-clock-probe: "true"

  # If you want cilium monitor to aggregate tracing for packets, set this level
  # to "low", "medium", or "maximum". The higher the level, the less packets
  # that will be seen in monitor output.
  monitor-aggregation: medium

  # The monitor aggregation interval governs the typical time between monitor
  # notification events for each allowed connection.
  #
  # Only effective when monitor aggregation is set to "medium" or higher.
  monitor-aggregation-interval: 5s

  # The monitor aggregation flags determine which TCP flags, upon the first
  # observation, cause monitor notifications to be generated.
  #
  # Only effective when monitor aggregation is set to "medium" or higher.
  monitor-aggregation-flags: all
  # bpf-policy-map-max specifies the maximum number of entries in the endpoint
  # policy map (per endpoint)
  bpf-policy-map-max: "16384"
  # Specifies the ratio (0.0-1.0) of total system memory to use for dynamic
  # sizing of the TCP CT, non-TCP CT, NAT and policy BPF maps.
  bpf-map-dynamic-size-ratio: "0.0025"

  # Pre-allocation of map entries allows per-packet latency to be reduced, at
  # the expense of up-front memory allocation for the entries in the maps. The
  # default value below will minimize memory usage in the default installation;
  # users who are sensitive to latency may consider setting this to "true".
  #
  # This option was introduced in Cilium 1.4. Cilium 1.3 and earlier ignore
  # this option and behave as though it is set to "true".
  #
  # If this value is modified, then during the next Cilium startup the restore
  # of existing endpoints and tracking of ongoing connections may be disrupted.
  # This may lead to policy drops or a change in loadbalancing decisions for a
  # connection for some time. Endpoints may need to be recreated to restore
  # connectivity.
  #
  # If this option is set to "false" during an upgrade from 1.3 or earlier to
  # 1.4 or later, then it may cause one-time disruptions during the upgrade.
  preallocate-bpf-maps: "false"

  # Regular expression matching compatible Istio sidecar istio-proxy
  # container image names
  sidecar-istio-proxy-image: "cilium/istio_proxy"

  # Encapsulation mode for communication between nodes
  # Possible values:
  #   - disabled
  #   - vxlan (default)
  #   - geneve
  tunnel: vxlan

  # Name of the cluster. Only relevant when building a mesh of clusters.
  cluster-name: default

  # wait-bpf-mount makes the init container wait until the bpf filesystem is mounted
  wait-bpf-mount: "false"

  masquerade: "true"
  # (Michael) workaround for https://github.com/cilium/cilium/issues/12699
  enable-bpf-masquerade: "false"
  enable-xt-socket-fallback: "true"
  install-iptables-rules: "true"
  auto-direct-node-routes: "false"
  kube-proxy-replacement:  "strict"
  enable-health-check-nodeport: "true"
  node-port-bind-protection: "true"
  enable-auto-protect-node-port-range: "true"
  enable-session-affinity: "true"
  enable-endpoint-health-checking: "true"
  enable-well-known-identities: "false"
  enable-remote-node-identity: "true"
  operator-api-serve-addr: "127.0.0.1:9234"
  # Enable Hubble gRPC service.
  enable-hubble: "true"
  # UNIX domain socket for Hubble server to listen to.
  hubble-socket-path:  "/var/run/cilium/hubble.sock"
  # An additional address for Hubble server to listen to (e.g. ":4244").
  hubble-listen-address: ""
  ipam: "kubernetes"
  disable-cnp-status-updates: "true"
  
  # (Michael) workaround for https://github.com/cilium/cilium/issues/10627
  blacklist-conflicting-routes: "false"
@networkop networkop added the kind/bug This is a bug in the Cilium logic. label Aug 27, 2020
@aditighag aditighag added kind/community-report This was reported by a user in the Cilium community, eg via Slack. needs/triage This issue requires triaging to establish severity and next steps. labels Aug 27, 2020
@brb brb self-assigned this Aug 27, 2020
@aditighag
Member

CC @brb

@brb
Member

brb commented Aug 27, 2020

@networkop Thanks for the issue. Two things - could you run cilium bpf ipcache get 10.0.255.7 from the node which runs the client pod, and could you provide a sysdump (https://docs.cilium.io/en/v1.8/troubleshooting/#automatic-log-state-collection)?

@networkop
Contributor Author

sure, here's the bpf ipcache output first

$ cilium bpf ipcache get 10.0.255.7
10.0.255.7 maps to identity 6 0 0.0.0.0

@networkop
Contributor Author

@brb where can I upload the sysdump?

@networkop
Contributor Author

oh wow, didn't know about GH's drag-and-drop. Here it is: cilium-sysdump-20200827-160052.zip

@brb
Member

brb commented Aug 28, 2020

@networkop Could you try without tunneling? You can do that by setting auto-direct-node-routes: true and tunnel: disabled in the ConfigMap and then restarting all cilium-agent pods. Also, you need to make sure that the underlying Azure network won't drop packets from/to an unknown subnet (i.e., the Pod CIDR).

@networkop
Contributor Author

ok, I'll try. Do you want me to collect any logs or just verify that it works?

@brb
Member

brb commented Aug 28, 2020

Thanks. Just to verify that it works.

@networkop
Contributor Author

I've done the test; however, I still don't have end-to-end connectivity. In addition to changing the flags above, I also had to set native-routing-cidr to stop the agent from crashing. I set it to the same value as the pod CIDR (in my case 10.0.0.0/18).

Additionally, in Azure I enabled "ip forwarding" on the NIC of the worker node I was testing, to stop it from dropping unknown packets.

The only new thing I've noticed is that the trace now looks a bit more like what you expected in the beginning:

15:14:57.811789 IP 10.0.6.173.50440 > 10.0.255.10.6443: Flags [S], seq 3237958959, win 65280, options [mss 1360,sackOK,TS val 2709627184 ecr 0,nop,wscale 7], length 0
15:14:59.162251 IP 10.0.255.4.54334 > 10.0.255.10.6443: Flags [S], seq 2787558301, win 64240, options [mss 1460,sackOK,TS val 560436112 ecr 0,nop,wscale 7], length 0
15:14:59.163323 IP 10.0.255.10.6443 > 10.0.255.4.54334: Flags [S.], seq 4113293485, ack 2787558302, win 65160, options [mss 1418,sackOK,TS val 2268648707 ecr 560436112,nop,wscale 7], length 0
15:14:59.163377 IP 10.0.255.4.54334 > 10.0.255.10.6443: Flags [.], ack 1, win 502, options [nop,nop,TS val 560436113 ecr 2268648707], length 0
15:14:59.163417 IP 10.0.255.4.54334 > 10.0.255.10.6443: Flags [P.], seq 1:518, ack 1, win 502, options [nop,nop,TS val 560436113 ecr 2268648707], length 517

So it looks like there is a SNAT'ed packet; however, I can't understand why the translated session keeps sending packets when I don't even see a SYN-ACK in the pod. So maybe it's a red herring.

@brb
Member

brb commented Aug 28, 2020

Hmm. Would it be possible to get access to your Azure cluster (I'm martynas on Cilium's public Slack)?

@Mengkzhaoyun

Same bug here, marking it.

@brb
Member

brb commented Aug 30, 2020

@Mengkzhaoyun What is your setup and configuration?

@Mengkzhaoyun

I deployed cilium-dev:v1.9.0-rc0 and kubernetes v1.18.8 without kube-proxy (a 3-master HA setup with kube-vip).
When I tested the deployment on 3 Ubuntu 20.04 Hyper-V VMs on Windows 10, I found that my Kubernetes dashboard pod could not access https://kube-apiserver:6443 (the apiserver service endpoint); the error was an access timeout.

From a bash shell in the dashboard pod, I cannot curl another host's apiserver, but from a shell on the pod's VM host, curl https://kube-apiserver.othervm.local:6443/version succeeds.

After I changed enable-bpf-masquerade to false, my Kubernetes dashboard pod works correctly.
Here is my config:

{
	"auto-direct-node-routes": "false",
	"bpf-lb-map-max": "65536",
	"bpf-map-dynamic-size-ratio": "0.0025",
	"bpf-policy-map-max": "16384",
	"cluster-name": "default",
	"cluster-pool-ipv4-cidr": "10.0.0.0/11",
	"cluster-pool-ipv4-mask-size": "24",
	"debug": "false",
	"disable-cnp-status-updates": "true",
	"disable-envoy-version-check": "true",
	"enable-auto-protect-node-port-range": "true",
	"enable-bpf-clock-probe": "true",
	"enable-bpf-masquerade": "false",
	"enable-endpoint-health-checking": "true",
	"enable-external-ips": "true",
	"enable-health-check-nodeport": "true",
	"enable-host-port": "true",
	"enable-host-reachable-services": "true",
	"enable-hubble": "true",
	"enable-ipv4": "true",
	"enable-ipv6": "false",
	"enable-node-port": "true",
	"enable-remote-node-identity": "true",
	"enable-session-affinity": "true",
	"enable-well-known-identities": "false",
	"enable-xt-socket-fallback": "true",
	"hubble-listen-address": ":4244",
	"hubble-socket-path": "/var/run/cilium/hubble.sock",
	"identity-allocation-mode": "crd",
	"install-iptables-rules": "true",
	"ipam": "cluster-pool",
	"k8s-require-ipv4-pod-cidr": "true",
	"k8s-require-ipv6-pod-cidr": "false",
	"kube-proxy-replacement": "partial",
	"masquerade": "true",
	"monitor-aggregation": "medium",
	"monitor-aggregation-flags": "all",
	"monitor-aggregation-interval": "5s",
	"node-port-bind-protection": "true",
	"node-port-range": "20,65000",
	"operator-api-serve-addr": "127.0.0.1:9234",
	"preallocate-bpf-maps": "false",
	"sidecar-istio-proxy-image": "cilium/istio_proxy",
	"tunnel": "vxlan",
	"wait-bpf-mount": "false"
}

@brb
Member

brb commented Aug 30, 2020

@Mengkzhaoyun Thanks. Do you run cilium on the master nodes?

@Mengkzhaoyun

Yes, all 3 master nodes run the cilium agent.

@brb
Member

brb commented Aug 31, 2020

@Mengkzhaoyun Can you please ping me (martynas) on Cilium's public Slack?

@travisghansen
Contributor

@brb I'm pretty sure this is related to the issue I was last discussing with you guys as well, where I had to revert to iptables.

@brb
Member

brb commented Sep 2, 2020

@networkop @Mengkzhaoyun OK, so what happens in your case is that the wrong IPv4 addr is picked for SNAT, as the devices selected for NodePort have more than one IPv4 addr. Currently, cilium-agent supports only one per bpf_host.o instance, and it does not check which IP addr is used to communicate between k8s nodes (aka the k8s Node IP). The fix would be to prefer the k8s Node IP when selecting that single IP addr.

If any of you wants to work on the fix (a good opportunity to familiarize yourself with Cilium internals), I'm happy to guide you.
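
To make the preference concrete, here is a minimal Go sketch of the selection rule described above. pickNodePortAddr and its signature are hypothetical, purely for illustration; they are not the actual Cilium code.

import "net"

// pickNodePortAddr is a hypothetical helper illustrating the rule: prefer the
// IP that k8s advertises for the node (its Node IP) whenever the NodePort
// device carries it, and only otherwise fall back to the first global-scope
// address found on the device (the current behaviour, which can pick the
// "wrong" address when a device has several).
func pickNodePortAddr(deviceAddrs []net.IP, k8sNodeIP net.IP) net.IP {
	for _, addr := range deviceAddrs {
		if k8sNodeIP != nil && addr.Equal(k8sNodeIP) {
			return addr // the address used for node-to-node traffic wins
		}
	}
	if len(deviceAddrs) > 0 {
		return deviceAddrs[0] // fallback: first global address on the device
	}
	return nil
}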

@brb brb removed the needs/triage This issue requires triaging to establish severity and next steps. label Sep 2, 2020
@brb brb changed the title Failing to communicate with hostNetwork pods with enable-bpf-masquerade Prefer k8s Node IP for SNAT IP (IPV{4,6}_NODEPORT) when multiple IP addrs are set to BPF NodePort device Sep 2, 2020
@networkop
Contributor Author

I'm happy to pick it up @brb

@brb
Member

brb commented Sep 2, 2020

The fix might be simple: in the function https://github.com/cilium/cilium/blob/master/pkg/node/address.go#L128 you need to pass the k8s Node IPv{4,6} addr to the invocations of firstGlobalV{4,6}Addr(...) as the second parameter.
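
In other words, the suggested change is roughly the following at the IPv4 call site (a sketch only; the IPv6 call site is analogous, and the actual patch appears later in this thread):

// in InitNodePortAddrs, per selected device:
// before
ip, err := firstGlobalV4Addr(device, nil)
// after: prefer the address k8s advertises for this node
ip, err := firstGlobalV4Addr(device, GetK8sNodeIP())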

@Mengkzhaoyun

Adding the CILIUM_IPV4_NODE env var to cilium-agent can temporarily work around the bug.

        env:
        - name: K8S_NODE_NAME
          valueFrom:
            fieldRef:
              apiVersion: v1
              fieldPath: spec.nodeName
        - name: CILIUM_IPV4_NODE
          valueFrom:
            fieldRef:
              apiVersion: v1
              fieldPath: spec.nodeName 

@travisghansen
Contributor

@Mengkzhaoyun that makes my deployment crash, as the nodeName is a hostname... are you aware of another property that provides the IP directly?

level=fatal msg="Invalid IPv4 node address" ipAddr=<node hostname> subsys=daemon

@travisghansen
Contributor

Here's a better way, I suppose:

        - name: CILIUM_IPV4_NODE
          valueFrom:
            fieldRef:
              apiVersion: v1
              fieldPath: status.podIP

It didn't help my scenario, so I must be hitting a different bug altogether.

@brb
Member

brb commented Sep 3, 2020

JFYI: CILIUM_IPV4_NODE doesn't have anything to do with fixing the issue.

@brb
Member

brb commented Sep 4, 2020

@brb I'm pretty sure this is related to the issue I was last discussing with you guys as well, where I had to revert to iptables.

As per the discussion offline - it's not related.

@networkop
Contributor Author

Here's a brief summary of the issues I was hitting:

  1. The main issue was caused by the fact that one of the k8s nodes had multiple outgoing interfaces. Specifically, the interface used to communicate with the affected nodes was a wireguard interface, which doesn't have an L2 (MAC) address and is currently not supported until #12317 ("NodePort fails when using IP attached to interface without HW address") gets resolved.

  2. The secondary issue was that the cilium agent picks the wrong SNAT IP for traffic leaving the primary (non-wg) interface. This interface was a bond interface with 1 public and 1 private IP. Cilium was picking the public IP, ignoring the fact that the k8s node had its InternalIP set to the private IP. The following patch fixed this issue for me:

diff --git a/pkg/node/address.go b/pkg/node/address.go
index a5fa3dadd..7c438a9d8 100644
--- a/pkg/node/address.go
+++ b/pkg/node/address.go
@@ -129,10 +129,14 @@ func InitNodePortAddrs(devices []string) error {
        if option.Config.EnableIPv4 {
                ipv4NodePortAddrs = make(map[string]net.IP, len(devices))
                for _, device := range devices {
-                       ip, err := firstGlobalV4Addr(device, nil)
+                       ip, err := firstGlobalV4Addr(device, GetK8sNodeIP())
                        if err != nil {
                                return fmt.Errorf("Failed to determine IPv4 of %s for NodePort", device)
                        }
+                       log.WithFields(logrus.Fields{
+                               "device": device,
+                               "ip":     ip,
+                       }).Info("Pinning node IPs")
                        ipv4NodePortAddrs[device] = ip
                }
        }
@@ -140,7 +144,7 @@ func InitNodePortAddrs(devices []string) error {
        if option.Config.EnableIPv6 {
                ipv6NodePortAddrs = make(map[string]net.IP, len(devices))
                for _, device := range devices {
-                       ip, err := firstGlobalV6Addr(device, nil)
+                       ip, err := firstGlobalV6Addr(device, GetK8sNodeIP())
                        if err != nil {
                                return fmt.Errorf("Failed to determine IPv6 of %s for NodePort", device)
                        }
diff --git a/pkg/node/address_linux.go b/pkg/node/address_linux.go
index a05a3f38d..bfa16fc73 100644
--- a/pkg/node/address_linux.go
+++ b/pkg/node/address_linux.go
@@ -86,7 +86,7 @@ retryScope:
        }
 
        if len(ipsPublic) != 0 {
-               if hasPreferred && ip.IsPublicAddr(preferredIP) {
+               if hasPreferred {
                        return preferredIP, nil
                }

  3. Finally, the last issue was me (PBKAC). I started replicating the issue in the default namespace, which also happened to have an egress network policy. cilium monitor -t drop helped identify that.

Thanks @brb for the support. Let me know if you want me to open a PR with the above fix, or whether it's ok to just close this issue for now.

@Mengkzhaoyun

@networkop, good job, it works.

@aanm
Member

aanm commented Sep 14, 2020

@networkop can you push a PR with that fix? That would be great!

@networkop
Contributor Author

PR is ready for review 👍
