Prefer k8s Node IP for SNAT IP (IPV{4,6}_NODEPORT) when multiple IP addrs are set to BPF NodePort device #12988

Closed
networkop opened this issue Aug 27, 2020 · 29 comments · Fixed by #13223
Assignees
Labels
kind/bug This is a bug in the Cilium logic. kind/community-report This was reported by a user in the Cilium community, eg via Slack.

Comments

@networkop
Contributor

Bug report

When Cilium is configured as a kube-proxy replacement, it fails to masquerade the source IP of pods when the target pod is in the hostNetwork of one of the k8s nodes.

General Information

  • Cilium version
Client: 1.8.2 aa42034f0 2020-07-23T15:02:39-07:00 go version go1.14.6 linux/amd64
Daemon: 1.8.2 aa42034f0 2020-07-23T15:02:39-07:00 go version go1.14.6 linux/amd64
  • Kernel version
Linux primary-external-workers-uksouth-machine-node-0 5.4.0-1022-azure #22-Ubuntu SMP Fri Jul 10 06:14:37 UTC 2020 x86_64 x86_64 x86_64 GNU/Linux
  • Orchestration system version in use
$ kubectl version

Client Version: version.Info{Major:"1", Minor:"18", GitVersion:"v1.18.8", GitCommit:"9f2892aab98fe339f3bd70e3c470144299398ace", GitTreeState:"clean", BuildDate:"2020-08-13T16:12:48Z", GoVersion:"go1.13.15", Compiler:"gc", Platform:"linux/amd64"}
Server Version: version.Info{Major:"1", Minor:"18", GitVersion:"v1.18.8", GitCommit:"9f2892aab98fe339f3bd70e3c470144299398ace", GitTreeState:"clean", BuildDate:"2020-08-13T16:04:18Z", GoVersion:"go1.13.15", Compiler:"gc", Platform:"linux/amd64"}

How to reproduce the issue

  1. Using kubespray (or kubeadm manually), build a cluster on Azure with the cloud provider disabled. The API-server pods run as static pods in the host OS namespace on the 3 controller nodes.
k describe svc kubernetes
Name:              kubernetes
Namespace:         default
Labels:            component=apiserver
                   provider=kubernetes
Annotations:       <none>
Selector:          <none>
Type:              ClusterIP
IP:                10.0.128.1
Port:              https  443/TCP
TargetPort:        6443/TCP
Endpoints:         10.0.255.10:6443,10.0.255.6:6443,10.0.255.7:6443
Session Affinity:  None
Events:            <none>
  2. Deploy a test pod on any worker node and attempt to connect to the API server ClusterIP. This results in timeouts:
bash-5.0# nc -zv -w 5 10.0.128.1 443
nc: connect to 10.0.128.1 port 443 (tcp) timed out: Operation in progress

Additional details

I have a test pod (10.0.6.223) trying to connect to the API server on 10.0.128.1:443. When doing a tcpdump on the underlying node, I cannot see SNAT'ed packets.

root@primary-external-workers-uksouth-machine-node-0:~# tcpdump -i any -n 'host 10.0.255.10 or host 10.0.255.6 or 10.0.255.7'
14:01:30.463349 IP 10.0.6.223.35832 > 10.0.255.7.6443: Flags [S], seq 3752066639, win 64860, options [mss 1410,sackOK,TS val 1253644246 ecr 0,nop,wscale 7], length 0
14:01:34.687357 IP 10.0.6.223.35832 > 10.0.255.7.6443: Flags [S], seq 3752066639, win 64860, options [mss 1410,sackOK,TS val 1253648470 ecr 0,nop,wscale 7], length 0

Since the source IP of the pod is not masqueraded, it gets dropped by Azure's networking stack.

According to @brb's comment, it should have been translated at TC egress; however, I don't see that happening (see the tcpdump above).

As soon as I change enable-bpf-masquerade to "false" and restart the agent on the node, connectivity gets restored.

I'll keep the environment up and running for a while, so I'm happy to collect any additional logs/outputs.

Here's the cilium configmap.

apiVersion: v1
kind: ConfigMap
metadata:
  name: cilium-config
  namespace: kube-system
data:
  # Identity allocation mode selects how identities are shared between cilium
  # nodes by setting how they are stored. The options are "crd" or "kvstore".
  # - "crd" stores identities in kubernetes as CRDs (custom resource definition).
  #   These can be queried with:
  #     kubectl get ciliumid
  # - "kvstore" stores identities in a kvstore, etcd or consul, that is
  #   configured below. Cilium versions before 1.6 supported only the kvstore
  #   backend. Upgrades from these older cilium versions should continue using
  #   the kvstore by commenting out the identity-allocation-mode below, or
  #   setting it to "kvstore".
  identity-allocation-mode: crd

  # If you want to run cilium in debug mode change this value to true
  debug: "false"

  # Enable IPv4 addressing. If enabled, all endpoints are allocated an IPv4
  # address.
  enable-ipv4: "true"

  # Enable IPv6 addressing. If enabled, all endpoints are allocated an IPv6
  # address.
  enable-ipv6: "false"
  enable-bpf-clock-probe: "true"

  # If you want cilium monitor to aggregate tracing for packets, set this level
  # to "low", "medium", or "maximum". The higher the level, the less packets
  # that will be seen in monitor output.
  monitor-aggregation: medium

  # The monitor aggregation interval governs the typical time between monitor
  # notification events for each allowed connection.
  #
  # Only effective when monitor aggregation is set to "medium" or higher.
  monitor-aggregation-interval: 5s

  # The monitor aggregation flags determine which TCP flags, upon the first
  # observation, cause monitor notifications to be generated.
  #
  # Only effective when monitor aggregation is set to "medium" or higher.
  monitor-aggregation-flags: all
  # bpf-policy-map-max specifies the maximum number of entries in the endpoint
  # policy map (per endpoint)
  bpf-policy-map-max: "16384"
  # Specifies the ratio (0.0-1.0) of total system memory to use for dynamic
  # sizing of the TCP CT, non-TCP CT, NAT and policy BPF maps.
  bpf-map-dynamic-size-ratio: "0.0025"

  # Pre-allocation of map entries allows per-packet latency to be reduced, at
  # the expense of up-front memory allocation for the entries in the maps. The
  # default value below will minimize memory usage in the default installation;
  # users who are sensitive to latency may consider setting this to "true".
  #
  # This option was introduced in Cilium 1.4. Cilium 1.3 and earlier ignore
  # this option and behave as though it is set to "true".
  #
  # If this value is modified, then during the next Cilium startup the restore
  # of existing endpoints and tracking of ongoing connections may be disrupted.
  # This may lead to policy drops or a change in loadbalancing decisions for a
  # connection for some time. Endpoints may need to be recreated to restore
  # connectivity.
  #
  # If this option is set to "false" during an upgrade from 1.3 or earlier to
  # 1.4 or later, then it may cause one-time disruptions during the upgrade.
  preallocate-bpf-maps: "false"

  # Regular expression matching compatible Istio sidecar istio-proxy
  # container image names
  sidecar-istio-proxy-image: "cilium/istio_proxy"

  # Encapsulation mode for communication between nodes
  # Possible values:
  #   - disabled
  #   - vxlan (default)
  #   - geneve
  tunnel: vxlan

  # Name of the cluster. Only relevant when building a mesh of clusters.
  cluster-name: default

  # wait-bpf-mount makes the init container wait until the bpf filesystem is mounted
  wait-bpf-mount: "false"

  masquerade: "true"
  # (Michael) workaround for https://github.com/cilium/cilium/issues/12699
  enable-bpf-masquerade: "false"
  enable-xt-socket-fallback: "true"
  install-iptables-rules: "true"
  auto-direct-node-routes: "false"
  kube-proxy-replacement:  "strict"
  enable-health-check-nodeport: "true"
  node-port-bind-protection: "true"
  enable-auto-protect-node-port-range: "true"
  enable-session-affinity: "true"
  enable-endpoint-health-checking: "true"
  enable-well-known-identities: "false"
  enable-remote-node-identity: "true"
  operator-api-serve-addr: "127.0.0.1:9234"
  # Enable Hubble gRPC service.
  enable-hubble: "true"
  # UNIX domain socket for Hubble server to listen to.
  hubble-socket-path:  "/var/run/cilium/hubble.sock"
  # An additional address for Hubble server to listen to (e.g. ":4244").
  hubble-listen-address: ""
  ipam: "kubernetes"
  disable-cnp-status-updates: "true"
  
  # (Michael) workaround for https://github.com/cilium/cilium/issues/10627
  blacklist-conflicting-routes: "false"
@networkop networkop added the kind/bug This is a bug in the Cilium logic. label Aug 27, 2020
@aditighag aditighag added kind/community-report This was reported by a user in the Cilium community, eg via Slack. needs/triage This issue requires triaging to establish severity and next steps. labels Aug 27, 2020
@brb brb self-assigned this Aug 27, 2020
@aditighag
Member

CC @brb

@brb
Member

brb commented Aug 27, 2020

@networkop Thanks for the issue. Two things - could you run cilium bpf ipcache get 10.0.255.7 from the node which runs the client pod, and could you provide a sysdump (https://docs.cilium.io/en/v1.8/troubleshooting/#automatic-log-state-collection)?

@networkop
Contributor Author

sure, here's the bpf ipcache output first

$ cilium bpf ipcache get 10.0.255.7
10.0.255.7 maps to identity 6 0 0.0.0.0

@networkop
Contributor Author

@brb where can I upload the sysdump?

@networkop
Contributor Author

oh wow, didn't know about GH's drag-and-drop. Here it is: cilium-sysdump-20200827-160052.zip

@brb
Member

brb commented Aug 28, 2020

@networkop Could you try without tunneling? You can do that by setting auto-direct-node-routes: true and tunnel: disabled in the ConfigMap and then restarting all cilium-agent pods. Also, you need to make sure that the underlying Azure network won't drop packets from/to an unknown subnet (i.e., the Pod CIDR).

@networkop
Contributor Author

ok, I'll try. Do you want me to collect any logs or just verify that it works?

@brb
Member

brb commented Aug 28, 2020

Thanks. Just to verify that it works.

@networkop
Contributor Author

I've done the test; however, I still don't have end-to-end connectivity. In addition to changing the flags above, I also had to set native-routing-cidr to stop the agent from crashing. I set it to the same value as the pod CIDR (in my case 10.0.0.0/18).

Additionally, in Azure I enabled "ip forwarding" on the NIC of the worker node I was testing, to stop it from dropping unknown packets.

The only new thing I've noticed is that the trace now looks a bit more like what you expected in the beginning:

15:14:57.811789 IP 10.0.6.173.50440 > 10.0.255.10.6443: Flags [S], seq 3237958959, win 65280, options [mss 1360,sackOK,TS val 2709627184 ecr 0,nop,wscale 7], length 0
15:14:59.162251 IP 10.0.255.4.54334 > 10.0.255.10.6443: Flags [S], seq 2787558301, win 64240, options [mss 1460,sackOK,TS val 560436112 ecr 0,nop,wscale 7], length 0
15:14:59.163323 IP 10.0.255.10.6443 > 10.0.255.4.54334: Flags [S.], seq 4113293485, ack 2787558302, win 65160, options [mss 1418,sackOK,TS val 2268648707 ecr 560436112,nop,wscale 7], length 0
15:14:59.163377 IP 10.0.255.4.54334 > 10.0.255.10.6443: Flags [.], ack 1, win 502, options [nop,nop,TS val 560436113 ecr 2268648707], length 0
15:14:59.163417 IP 10.0.255.4.54334 > 10.0.255.10.6443: Flags [P.], seq 1:518, ack 1, win 502, options [nop,nop,TS val 560436113 ecr 2268648707], length 517

So it looks like there is a SNAT'ed packet; however, I can't understand why the translated session keeps sending packets when I don't even see a SYN-ACK in the pod. So maybe it's a red herring.

@brb
Member

brb commented Aug 28, 2020

Hmm. Would it be possible to get access to your Azure cluster (I'm martynas on Cilium's public Slack)?

@Mengkzhaoyun

Same bug here, marking it.

@brb
Member

brb commented Aug 30, 2020

@Mengkzhaoyun What is your setup and configuration?

@Mengkzhaoyun

I deployed cilium-dev:v1.9.0-rc0 and kubernetes v1.18.8 without kube-proxy (a 3-master HA setup with kube-vip).
When I tested the deployment on 3 Ubuntu 20.04 Hyper-V VMs on Windows 10, I found that my Kubernetes dashboard pod could not access https://kube-apiserver:6443 (the apiserver service endpoint); the error was an access timeout.

From a bash shell in the dashboard pod, I cannot curl another host's apiserver, but from a shell on the pod's VM host, curl https://kube-apiserver.othervm.local:6443/version succeeds.

After I changed enable-bpf-masquerade to false, my Kubernetes dashboard pod works correctly.
Here is my config:

{
	"auto-direct-node-routes": "false",
	"bpf-lb-map-max": "65536",
	"bpf-map-dynamic-size-ratio": "0.0025",
	"bpf-policy-map-max": "16384",
	"cluster-name": "default",
	"cluster-pool-ipv4-cidr": "10.0.0.0/11",
	"cluster-pool-ipv4-mask-size": "24",
	"debug": "false",
	"disable-cnp-status-updates": "true",
	"disable-envoy-version-check": "true",
	"enable-auto-protect-node-port-range": "true",
	"enable-bpf-clock-probe": "true",
	"enable-bpf-masquerade": "false",
	"enable-endpoint-health-checking": "true",
	"enable-external-ips": "true",
	"enable-health-check-nodeport": "true",
	"enable-host-port": "true",
	"enable-host-reachable-services": "true",
	"enable-hubble": "true",
	"enable-ipv4": "true",
	"enable-ipv6": "false",
	"enable-node-port": "true",
	"enable-remote-node-identity": "true",
	"enable-session-affinity": "true",
	"enable-well-known-identities": "false",
	"enable-xt-socket-fallback": "true",
	"hubble-listen-address": ":4244",
	"hubble-socket-path": "/var/run/cilium/hubble.sock",
	"identity-allocation-mode": "crd",
	"install-iptables-rules": "true",
	"ipam": "cluster-pool",
	"k8s-require-ipv4-pod-cidr": "true",
	"k8s-require-ipv6-pod-cidr": "false",
	"kube-proxy-replacement": "partial",
	"masquerade": "true",
	"monitor-aggregation": "medium",
	"monitor-aggregation-flags": "all",
	"monitor-aggregation-interval": "5s",
	"node-port-bind-protection": "true",
	"node-port-range": "20,65000",
	"operator-api-serve-addr": "127.0.0.1:9234",
	"preallocate-bpf-maps": "false",
	"sidecar-istio-proxy-image": "cilium/istio_proxy",
	"tunnel": "vxlan",
	"wait-bpf-mount": "false"
}

@brb
Member

brb commented Aug 30, 2020

@Mengkzhaoyun Thanks. Do you run cilium on the master nodes?

@Mengkzhaoyun

Yes, all 3 master nodes run the cilium agent.

@brb
Member

brb commented Aug 31, 2020

@Mengkzhaoyun Can you please ping me (martynas) on Cilium's public Slack?

@travisghansen
Contributor

@brb I'm pretty sure this is related to the issue I was last discussing with you guys as well, where I had to revert to iptables.

@brb
Member

brb commented Sep 2, 2020

@networkop @Mengkzhaoyun OK, so what happens in your case is that the wrong IPv4 addr is picked for SNAT, as the devices selected for NodePort have more than one IPv4 addr. Currently, cilium-agent supports only one per bpf_host.o instance, and it does not check which IP addr is used to communicate between k8s nodes (aka the k8s Node IP). The fix would be to prefer the k8s Node IP when selecting that single IP addr.

If any of you wants to work on the fix (a good opportunity to familiarize yourself with Cilium internals), I'm happy to guide you.
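
To make the preference concrete, here is a minimal Go sketch of the selection rule described above. pickNodePortAddr and its signature are hypothetical, purely for illustration; they are not the actual Cilium code.

import "net"

// pickNodePortAddr is a hypothetical helper illustrating the rule: prefer the
// IP that k8s advertises for the node (its Node IP) whenever the NodePort
// device carries it, and only otherwise fall back to the first global-scope
// address found on the device (the current behaviour, which can pick the
// "wrong" address when a device has several).
func pickNodePortAddr(deviceAddrs []net.IP, k8sNodeIP net.IP) net.IP {
	for _, addr := range deviceAddrs {
		if k8sNodeIP != nil && addr.Equal(k8sNodeIP) {
			return addr // the address used for node-to-node traffic wins
		}
	}
	if len(deviceAddrs) > 0 {
		return deviceAddrs[0] // fallback: first global address on the device
	}
	return nil
}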

@brb brb removed the needs/triage This issue requires triaging to establish severity and next steps. label Sep 2, 2020
@brb brb changed the title Failing to communicate with hostNetwork pods with enable-bpf-masquerade Prefer k8s Node IP for SNAT IP (IPV{4,6}_NODEPORT) when multiple IP addrs are set to BPF NodePort device Sep 2, 2020
@networkop
Contributor Author

I'm happy to pick it up @brb

@brb
Member

brb commented Sep 2, 2020

The fix might be simple: in the function https://github.com/cilium/cilium/blob/master/pkg/node/address.go#L128 you need to pass the k8s Node IPv{4,6} addr to the invocations of firstGlobalV{4,6}Addr(...) as the second parameter.
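
In other words, the suggested change is roughly the following at the IPv4 call site (a sketch only; the IPv6 call site is analogous, and the actual patch appears later in this thread):

// in InitNodePortAddrs, per selected device:
// before
ip, err := firstGlobalV4Addr(device, nil)
// after: prefer the address k8s advertises for this node
ip, err := firstGlobalV4Addr(device, GetK8sNodeIP())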

@Mengkzhaoyun

Adding the CILIUM_IPV4_NODE env var to cilium-agent can temporarily work around the bug.

        env:
        - name: K8S_NODE_NAME
          valueFrom:
            fieldRef:
              apiVersion: v1
              fieldPath: spec.nodeName
        - name: CILIUM_IPV4_NODE
          valueFrom:
            fieldRef:
              apiVersion: v1
              fieldPath: spec.nodeName 

@travisghansen
Contributor

@Mengkzhaoyun that makes my deployment crash, as the nodeName is a hostname... are you aware of another property that provides the IP directly?

level=fatal msg="Invalid IPv4 node address" ipAddr=<node hostname> subsys=daemon

@travisghansen
Contributor

Here's a better way, I suppose:

        - name: CILIUM_IPV4_NODE
          valueFrom:
            fieldRef:
              apiVersion: v1
              fieldPath: status.podIP

It didn't help my scenario, so I must be hitting a different bug altogether.

@brb
Member

brb commented Sep 3, 2020

JFYI: CILIUM_IPV4_NODE doesn't have anything to do with fixing the issue.

@brb
Member

brb commented Sep 4, 2020

@brb I'm pretty sure this is related to the issue I was last discussing with you guys as well, where I had to revert to iptables.

As per the discussion offline - it's not related.

@networkop
Contributor Author

Here's a brief summary of the issues I was hitting:

  1. The main issue was caused by the fact that one of the k8s nodes had multiple outgoing interfaces. Specifically, the interface used to communicate with the affected nodes was a wireguard interface, which doesn't have an L2 (MAC) address and is currently not supported until #12317 ("NodePort fails when using IP attached to interface without HW address") gets resolved.

  2. The secondary issue was that the cilium agent picks the wrong SNAT IP for traffic leaving the primary (non-wg) interface. This interface was a bond interface with 1 public and 1 private IP. Cilium was picking the public IP, ignoring the fact that the k8s node had its InternalIP set to the private IP. The following patch fixed this issue for me:

diff --git a/pkg/node/address.go b/pkg/node/address.go
index a5fa3dadd..7c438a9d8 100644
--- a/pkg/node/address.go
+++ b/pkg/node/address.go
@@ -129,10 +129,14 @@ func InitNodePortAddrs(devices []string) error {
        if option.Config.EnableIPv4 {
                ipv4NodePortAddrs = make(map[string]net.IP, len(devices))
                for _, device := range devices {
-                       ip, err := firstGlobalV4Addr(device, nil)
+                       ip, err := firstGlobalV4Addr(device, GetK8sNodeIP())
                        if err != nil {
                                return fmt.Errorf("Failed to determine IPv4 of %s for NodePort", device)
                        }
+                       log.WithFields(logrus.Fields{
+                               "device": device,
+                               "ip":     ip,
+                       }).Info("Pinning node IPs")
                        ipv4NodePortAddrs[device] = ip
                }
        }
@@ -140,7 +144,7 @@ func InitNodePortAddrs(devices []string) error {
        if option.Config.EnableIPv6 {
                ipv6NodePortAddrs = make(map[string]net.IP, len(devices))
                for _, device := range devices {
-                       ip, err := firstGlobalV6Addr(device, nil)
+                       ip, err := firstGlobalV6Addr(device, GetK8sNodeIP())
                        if err != nil {
                                return fmt.Errorf("Failed to determine IPv6 of %s for NodePort", device)
                        }
diff --git a/pkg/node/address_linux.go b/pkg/node/address_linux.go
index a05a3f38d..bfa16fc73 100644
--- a/pkg/node/address_linux.go
+++ b/pkg/node/address_linux.go
@@ -86,7 +86,7 @@ retryScope:
        }
 
        if len(ipsPublic) != 0 {
-               if hasPreferred && ip.IsPublicAddr(preferredIP) {
+               if hasPreferred {
                        return preferredIP, nil
                }

  3. Finally, the last issue was me (PBKAC). I started replicating the issue in the default namespace, which also happened to have an egress network policy. cilium monitor -t drop helped identify that.

Thanks @brb for the support. Let me know if you want me to open a PR with the above fix, or whether it's ok to just close this issue for now.

@Mengkzhaoyun

@networkop, good job, it works.

@aanm
Member

aanm commented Sep 14, 2020

@networkop can you push a PR with that fix? That would be great!

@networkop
Contributor Author

PR is ready for review 👍
