
LRP stops performing NAT for existing sockets after temporary k8s API-server unavailability #31988

Open
dglushenok opened this issue Apr 16, 2024 · 8 comments
Labels
area/lrp Impacts Local Redirect Policy. kind/bug This is a bug in the Cilium logic. kind/community-report This was reported by a user in the Cilium community, eg via Slack. needs/triage This issue requires triaging to establish severity and next steps. sig/datapath Impacts bpf/ or low-level forwarding details, including map management and monitor messages.

Comments

@dglushenok

Is there an existing issue for this?

  • I have searched the existing issues

What happened?

We are using NodeLocal DNSCache with a Local Redirect Policy:

apiVersion: "cilium.io/v2"
kind: CiliumLocalRedirectPolicy
metadata:
  name: "nodelocaldns"
  namespace: kube-system
spec:
  redirectFrontend:
    serviceMatcher:
      serviceName: kube-dns
      namespace: kube-system
  redirectBackend:
    localEndpointSelector:
      matchLabels:
        k8s-app: node-local-dns
    toPorts:
      - port: "53"
        name: dns
        protocol: UDP
      - port: "53"
        name: dns-tcp
        protocol: TCP

Some of our services open a UDP socket to the DNS service at startup and never reopen it. For example:

k8s-node # nsenter -t 3866824 -n ss -na | grep ':53 '
udp    ESTAB      0       0                                         10.10.23.25:57229                              10.10.23.193:53

Here 10.10.23.193 is the address of the local node-local-dns instance, even though the service connected to 10.11.0.2, the address of the kube-dns service (as read from /etc/resolv.conf).

When tracing the client service, we can see that replies come from 10.11.0.2 (as if from kube-dns). So LRP performs NAT and everything works well:

k8s-node # nsenter -t 3866824 -n strace -f -e recvfrom -p 3866824 2>&1 | grep 'htons(53)'
[pid 3866912] recvfrom(95, "\340\270\205\200\0\1\0\2\0\0\0\0\nstatistics\4back\7k8s"..., 4097, 0, {sa_family=AF_INET, sin_port=htons(53), sin_addr=inet_addr("10.11.0.2")}, [16]) = 216
[pid 3866912] recvfrom(95, "|\0\205\200\0\1\0\1\0\0\0\0\nstatistics\4back\7k8s"..., 4097, 0, {sa_family=AF_INET, sin_port=htons(53), sin_addr=inet_addr("10.11.0.2")}, [16]) = 159
[pid 3866914] recvfrom(95, "\216\305\205\200\0\1\0\1\0\0\0\0\nstatistics\4back\7k8s"..., 4097, 0, {sa_family=AF_INET, sin_port=htons(53), sin_addr=inet_addr("10.11.0.2")}, [16]) = 159

This works until the k8s node temporarily loses network access to the k8s master. After that, the service receives DNS replies not from 10.11.0.2 but from 10.10.23.193 (on the same socket, which was opened to 10.11.0.2). That triggers the DNS anti-spoofing check in the client library (c-ares), which simply drops replies from the "wrong" address.
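For illustration, here is a minimal Python sketch (not c-ares code, and not part of the original report) of that kind of source check: the reply is accepted only if it arrives from the address the socket was opened to. The resolver IP 10.11.0.2 is the kube-dns service IP from the report, and the query bytes are a pre-encoded A lookup for example.com.

#!/usr/bin/env python3
# Minimal sketch of the source-address check described above
# (modeled on the c-ares behaviour; not c-ares itself).
import socket
from binascii import unhexlify

NS_IP = "10.11.0.2"  # kube-dns service IP from the report (example value)
# A query for "example.com" (type A), pre-encoded as a raw DNS message
DNS_QUERY = unhexlify("AAAA01000001000000000000076578616d706c6503636f6d0000010001")

sock = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
sock.settimeout(5)          # don't block forever in this sketch
sock.connect((NS_IP, 53))
sock.send(DNS_QUERY)

reply, reply_addr = sock.recvfrom(4096)
if reply_addr[0] != NS_IP:
    # The condition hit after API-server recovery: the reply arrives from
    # the node-local-dns pod IP instead of the service IP, so an
    # anti-spoofing client discards it.
    print(f"dropping reply: expected {NS_IP}, got {reply_addr[0]}")
else:
    print(f"reply accepted from {reply_addr[0]}")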

On k8s versions prior to 1.27 it was easy to reproduce by temporarily disabling access to the k8s master using iptables. After upgrading to 1.27, disabling access for 5 minutes is no longer enough to reproduce the problem, while longer unavailability leads to pod relocation. Either way, after the k8s API server becomes unavailable on its own and then returns, services start seeing DNS replies from an unexpected address (the following example is from another pod):

[pid 176072] recvfrom(78, "x\204\205\200\0\1\0\2\0\0\0\0\nstatistics\4back\7k8s"..., 4097, 0, {sa_family=AF_INET, sin_port=htons(53), sin_addr=inet_addr("10.10.138.73")}, [16]) = 216
[pid 176072] <... recvfrom resumed>"\347\273\205\200\0\1\0\1\0\0\0\0\nstatistics\4back\7k8s"..., 4097, 0, {sa_family=AF_INET, sin_port=htons(53), sin_addr=inet_addr("10.10.138.73")}, [16]) = 159

On the Cilium side we can see the following logs:

2024-04-16T09:21:38.035952849Z level=warning msg="Network status error received, restarting client connections" error="an error on the server (\"[+]ping ok\\n[+]log ok\\n[-]etcd failed: reason withheld\\n[+]kms-provider-0 ok\\n[+]poststarthook/start-kube-apiserver-admission-initializer ok\\n[+]poststarthook/generic-apiserver-start-informers ok\\n[+]poststarthook/priority-and-fairness-config-consumer ok\\n[+]poststarthook/priority-and-fairness-filter ok\\n[+]poststarthook/storage-object-count-tracker-hook ok\\n[+]poststarthook/start-apiextensions-informers ok\\n[+]poststarthook/start-apiextensions-controllers ok\\n[+]poststarthook/crd-informer-synced ok\\n[+]poststarthook/start-system-namespaces-controller ok\\n[+]poststarthook/bootstrap-controller ok\\n[+]poststarthook/rbac/bootstrap-roles ok\\n[+]poststarthook/scheduling/bootstrap-system-priority-classes ok\\n[+]poststarthook/priority-and-fairness-config-producer ok\\n[+]poststarthook/start-cluster-authentication-info-controller ok\\n[+]poststarthook/start-kube-apiserver-identity-lease-controller ok\\n[+]poststarthook/start-deprecated-kube-apiserver-identity-lease-garbage-collector ok\\n[+]poststarthook/start-kube-apiserver-identity-lease-garbage-collector ok\\n[+]poststarthook/start-legacy-token-tracking-controller ok\\n[+]poststarthook/aggregator-reload-proxy-client-cert ok\\n[+]poststarthook/start-kube-aggregator-informers ok\\n[+]poststarthook/apiservice-registration-controller ok\\n[+]poststarthook/apiservice-status-available-controller ok\\n[+]poststarthook/kube-apiserver-autoregistration ok\\n[+]autoregister-completion ok\\n[+]poststarthook/apiservice-openapi-controller ok\\n[+]poststarthook/apiservice-openapiv3-controller ok\\n[+]poststarthook/apiservice-discovery-controller ok\\nhealthz check failed\") has prevented the request from succeeding (get healthz.meta.k8s.io)" subsys=k8s
2024-04-16T09:22:08.775241170Z level=info msg="Restored service" k8sSvcID=kube-system/kube-dns subsys=redirectpolicy
2024-04-16T09:22:08.776152153Z level=error msg="Local redirect service for policy (kube-system/nodelocaldns) with frontend (10.11.0.2:53) not deleted" error="<nil>" subsys=redirectpolicy
2024-04-16T09:22:08.776161183Z level=info msg="Restored service" k8sSvcID=kube-system/kube-dns subsys=redirectpolicy
2024-04-16T09:22:08.776174248Z level=info msg="Restored service" k8sSvcID=kube-system/kube-dns subsys=redirectpolicy

Cilium Version

Client: 1.12.9 e0bb30a 2023-04-17T23:54:19+02:00 go version go1.18.10 linux/amd64
Daemon: 1.12.9 e0bb30a 2023-04-17T23:54:19+02:00 go version go1.18.10 linux/amd64

Kernel Version

5.4.0-171-generic #189-Ubuntu SMP Fri Jan 5 14:23:02 UTC 2024 x86_64 x86_64 x86_64 GNU/Linux

Kubernetes Version

Server Version: version.Info{Major:"1", Minor:"27", GitVersion:"v1.27.3", GitCommit:"25b4e43193bcda6c7328a6d147b1fb73a33f1598", GitTreeState:"clean", BuildDate:"2023-06-14T09:47:40Z", GoVersion:"go1.20.5", Compiler:"gc", Platform:"linux/amd64"}

Regression

No response

Sysdump

Error: unknown command "sysdump" for "cilium"

Relevant log output

No response

Anything else?

No response

Cilium Users Document

  • Are you a user of Cilium? Please add yourself to the Users doc

Code of Conduct

  • I agree to follow this project's Code of Conduct
@dglushenok dglushenok added kind/bug This is a bug in the Cilium logic. kind/community-report This was reported by a user in the Cilium community, eg via Slack. needs/triage This issue requires triaging to establish severity and next steps. labels Apr 16, 2024
@aditighag aditighag added the sig/datapath Impacts bpf/ or low-level forwarding details, including map management and monitor messages. label Apr 16, 2024
@aditighag
Member

What's your kube-proxy replacement setting on the cluster?

@dglushenok
Author

Strict, according to status:

# cilium status
KVStore:                               Ok   Disabled
Kubernetes:                            Ok   1.27 (v1.27.3) [linux/amd64]
Kubernetes APIs:                       ["cilium/v2::CiliumClusterwideNetworkPolicy", "cilium/v2::CiliumEndpoint", "cilium/v2::CiliumLocalRedirectPolicy", "cilium/v2::CiliumNetworkPolicy", "cilium/v2::CiliumNode", "core/v1::Namespace", "core/v1::Node", "core/v1::Pods", "core/v1::Service", "discovery/v1::EndpointSlice", "networking.k8s.io/v1::NetworkPolicy"]
KubeProxyReplacement:                  Strict   [eth0 172.23.87.26]
Host firewall:                         Disabled
CNI Chaining:                          none
Cilium:                                Ok   1.12.9 (v1.12.9-e0bb30a)
NodeMonitor:                           Listening for events on 8 CPUs with 64x4096 of shared memory
Cilium health daemon:                  Ok
IPAM:                                  IPv4: 13/254 allocated from 10.10.22.0/24,
BandwidthManager:                      EDT with BPF [CUBIC] [eth0]
Host Routing:                          Legacy
Masquerading:                          IPTables [IPv4: Enabled, IPv6: Disabled]
Controller Status:                     63/63 healthy
Proxy Status:                          OK, ip 10.10.22.22, 0 redirects active on ports 10000-20000
Global Identity Range:                 min 256, max 65535
Hubble:                                Ok   Current/Max Flows: 4095/4095 (100.00%), Flows/s: 146.37   Metrics: Ok
Encryption:                            Disabled
Cluster health:                        14/15 reachable   (2024-04-16T17:50:46Z)
  Name                                 IP                Node          Endpoints
  mk8s-master-catvtnbtjlsoh40mmpvg-c   172.23.85.29      unreachable   reachable
#

@aditighag aditighag added the area/lrp Impacts Local Redirect Policy. label Apr 17, 2024
@aditighag
Member

aditighag commented Apr 17, 2024

You are running into the socket-lb limitation with respect to stale backends. This issue is a dup of #31012.

Edit: Sorry, I might not have understood the report entirely, so let's check if it's the same issue.

This works until the k8s node temporarily loses network access to the k8s master. After that, the service receives DNS replies not from 10.11.0.2 but from 10.10.23.193 (on the same socket, which was opened to 10.11.0.2). That triggers the DNS anti-spoofing check in the client library (c-ares), which simply drops replies from the "wrong" address.

Can you reword this? I'm a bit lost as to what problem you are seeing.

@aditighag aditighag added kind/duplicate There is another issue which contains additional details. and removed kind/duplicate There is another issue which contains additional details. labels Apr 17, 2024
@dglushenok
Author

dglushenok commented Apr 18, 2024

The k8s API server can become temporarily unavailable due to network issues or problems on the k8s master itself. After availability is restored, Cilium notices it (we see "Network status error received, restarting client connections" in the logs), and that somehow breaks NAT for the LRP: the client still gets replies, but from the node-local-dns IP address instead of the kube-dns one. If the client verifies that a UDP reply was sent from the IP address the socket was originally opened to, that verification fails and the DNS reply gets dropped.

So it does not look like the same issue as the one you mentioned (the node-local-dns pod is not restarted in our case).

@aditighag
Member

After availability is restored, Cilium notices it (we see "Network status error received, restarting client connections" in the logs), and that somehow breaks NAT for the LRP: the client still gets replies, but from the node-local-dns IP address instead of the kube-dns one.

So you are seeing issues on the reverse NAT path?

Can you capture two sysdumps: one before the problem occurs and another while you are seeing the connectivity issue?

@dglushenok
Author

So you are seeing issues on the reverse NAT path?

It looks so.

Can you capture two sysdumps: one before the problem occurs and another while you are seeing the connectivity issue?

Yes, but it will take some time (I'll try to reproduce the problem in a clean debugging environment).

@dglushenok
Author

I've uploaded the requested sysdumps:
cilium-sysdump-good-20240423-183307.zip
cilium-sysdump-bad-20240423-190046.zip

The problem was reproduced on a freshly deployed k8s cluster. The steps:

  1. Run the following script in a pod to print the source address of each DNS reply (a self-checking variant is sketched after step 2):
#!/usr/bin/env python3

import socket
from binascii import unhexlify
from datetime import datetime
from time import sleep

# A query for "example.com" (type A), pre-encoded as a raw DNS message
dns_req_msg = "AAAA01000001000000000000076578616d706c6503636f6d0000010001"

# Take the first nameserver from /etc/resolv.conf
ns_ip = None
with open("/etc/resolv.conf") as fd:
    for line in fd:
        if line.startswith("nameserver"):
            ns_ip = line.split()[1]
            break
if ns_ip is None:
    raise SystemExit("no nameserver found in /etc/resolv.conf")

sock = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
sock.connect((ns_ip, 53))

# Send the same query once per second and print the source address of each reply
while True:
    sock.send(unhexlify(dns_req_msg))
    reply, reply_addr = sock.recvfrom(4096)
    print(f"{datetime.now()} - {reply_addr[0]}")
    sleep(1)
  2. Temporarily block access to the master on the node running the pod:
date;
iptables -I OUTPUT 1 -d 10.0.0.6 -j REJECT;
iptables -I OUTPUT 1 -d 158.160.162.56 -j REJECT;
sleep 300;
date;
iptables -D OUTPUT -d 10.0.0.6 -j REJECT;
iptables -D OUTPUT -d 158.160.162.56 -j REJECT;
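
For convenience, here is a self-checking variant of the step-1 script (a sketch added for illustration, not part of the original report): it prints a MISMATCH marker as soon as the reply source address stops matching the nameserver the socket was opened to, which is the condition that trips the c-ares anti-spoofing check.

#!/usr/bin/env python3
# Self-checking variant of the step-1 script: flag replies whose source
# address no longer matches the nameserver the socket was opened to.

import socket
from binascii import unhexlify
from datetime import datetime
from time import sleep

dns_req_msg = "AAAA01000001000000000000076578616d706c6503636f6d0000010001"

# First nameserver from /etc/resolv.conf (raises if none is present)
with open("/etc/resolv.conf") as fd:
    ns_ip = next(line.split()[1] for line in fd if line.startswith("nameserver"))

sock = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
sock.connect((ns_ip, 53))

while True:
    sock.send(unhexlify(dns_req_msg))
    reply, reply_addr = sock.recvfrom(4096)
    status = "OK" if reply_addr[0] == ns_ip else f"MISMATCH (expected {ns_ip})"
    print(f"{datetime.now()} - {reply_addr[0]} - {status}")
    sleep(1)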

The output of the blocking script (time of blocking and time of unblocking):

Tue 23 Apr 2024 03:54:28 PM UTC
Tue 23 Apr 2024 03:59:28 PM UTC

The output from the Python script around the time of the source IP change:

2024-04-23 16:00:07.573901 - 172.17.0.2
2024-04-23 16:00:08.575238 - 172.17.0.2
2024-04-23 16:00:09.576558 - 172.17.0.2
2024-04-23 16:00:10.577879 - 172.16.7.246
2024-04-23 16:00:11.579178 - 172.16.7.246
2024-04-23 16:00:12.580559 - 172.16.7.246

The logs from cilium-agent at the same time:

2024-04-23T15:59:25.666634302Z level=warning msg="Network status error received, restarting client connections" error="Get \"https://10.0.0.6/healthz\": dial tcp 10.0.0.6:443: connect: connection refused" subsys=k8s
2024-04-23T15:59:28.002734019Z level=warning msg="github.com/cilium/cilium/pkg/k8s/watchers/cilium_network_policy.go:155: failed to list *v2.CiliumNetworkPolicy: Get \"https://10.0.0.6/apis/cilium.io/v2/ciliumnetworkpolicies?resourceVersion=196457\": dial tcp 10.0.0.6:443: connect: connection refused" subsys=klog
2024-04-23T15:59:28.002767117Z level=error msg=k8sError error="github.com/cilium/cilium/pkg/k8s/watchers/cilium_network_policy.go:155: Failed to watch *v2.CiliumNetworkPolicy: failed to list *v2.CiliumNetworkPolicy: Get \"https://10.0.0.6/apis/cilium.io/v2/ciliumnetworkpolicies?resourceVersion=196457\": dial tcp 10.0.0.6:443: connect: connection refused" subsys=k8s
2024-04-23T15:59:29.068578937Z level=info msg="Local Redirect Policy updates are not handled" subsys=k8s-watcher
2024-04-23T16:00:10.155041297Z level=info msg="Restored service" k8sSvcID=kube-system/kube-dns subsys=redirectpolicy
2024-04-23T16:00:10.155088840Z level=info msg="Restored service" k8sSvcID=kube-system/kube-dns subsys=redirectpolicy
2024-04-23T16:00:10.155093554Z level=error msg="Local redirect service for policy (kube-system/node-local-dns) with frontend (172.17.0.2:53) not deleted" error="<nil>" subsys=redirectpolicy
2024-04-23T16:00:10.155097808Z level=info msg="Restored service" k8sSvcID=kube-system/kube-dns subsys=redirectpolicy

@Elegant996

Elegant996 commented May 13, 2024

I can also confirm this in my cluster with kube-proxy replacement set to strict. Happy to assist with logs as well. Restarting the NodeLocal DNSCache pods appears to fix this, but it's not ideal.
