Return-traffic for LB hairpin using loopback address as destination #23913

Closed
2 tasks done
sypakine opened this issue Feb 21, 2023 · 23 comments
Labels
area/loadbalancing: Impacts load-balancing and Kubernetes service implementations
kind/bug: This is a bug in the Cilium logic.
kind/community-report: This was reported by a user in the Cilium community, eg via Slack.
needs/triage: This issue requires triaging to establish severity and next steps.
sig/datapath: Impacts bpf/ or low-level forwarding details, including map management and monitor messages.

Comments

@sypakine

Is there an existing issue for this?

  • I have searched the existing issues

What happened?

tl;dr: for LB hairpin traffic, the return traffic may not have its connection state restored (it is neither SNATed to the service IP nor DNATed to the backend address).

Observations

Example working flow:

-> endpoint 544 flow 0xdd389eee , identity 38392->38392 state established ifindex gke41dc9a27c6a orig-ip 169.254.42.1: 169.254.42.1:50304 -> 10.53.3.69:8443 tcp SYN
-> endpoint 544 flow 0x489ab7d9 , identity 38392->38392 state new ifindex gke41dc9a27c6a orig-ip 10.0.138.63: 10.0.138.63:443 -> 10.53.3.69:50304 tcp SYN, ACK
-> endpoint 544 flow 0x489ab7d9 , identity world->38392 state established ifindex 0 orig-ip 10.0.138.63: 10.0.138.63:443 -> 10.53.3.69:50304 tcp SYN, ACK
-> endpoint 544 flow 0xdd389eee , identity 38392->38392 state established ifindex gke41dc9a27c6a orig-ip 169.254.42.1: 169.254.42.1:50304 -> 10.53.3.69:8443 tcp ACK
...
(through to FIN/ACK)

Example flow which encountered the issue:

-> endpoint 544 flow 0x365671e4 , identity 38392->38392 state new ifindex gke41dc9a27c6a orig-ip 169.254.42.1: 169.254.42.1:50332 -> 10.53.3.69:8443 tcp SYN
-> stack flow 0xfef7f129 , identity 38392->world state reply ifindex 0 orig-ip 0.0.0.0: 10.53.3.69:8443 -> 169.254.42.1:50332 tcp SYN, ACK
-> endpoint 544 flow 0x2eca841d , identity 38392->38392 state established ifindex gke41dc9a27c6a orig-ip 169.254.42.1: 169.254.42.1:50332 -> 10.53.3.69:8443 tcp SYN
...  (retries until TCP connection fails)

The failure signature appears exclusively for flows that are initially detected as new rather than established (see signature screenshot below).

I am unable to reproduce this issue on demand. I've tried inducing it by regenerating endpoints, flushing the conntrack table, etc. I'm looking for insight into the issue or next steps for debugging.

Context

An nginx ingress LB pod communicates with an internal LB address that is served by the same nginx pod. In this scenario, the source address of the outgoing flow is changed to the loopback address (ref: https://github.com/cilium/cilium/blob/master/bpf/lib/lb.h#L1705-L1721).

For this LB loopback scenario, the outgoing flow is translated on its way to the backend, and the return traffic must be reverse-translated; when that reverse translation does not occur, the return traffic does not reach the client (see the sketch below).
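To make the expected translations concrete, below is a minimal, self-contained C sketch of the idea (this is not Cilium source; the struct, function names, and constants are made up for illustration, and the addresses are taken from the flows above): the forward path SNATs the client to the loopback address and DNATs the VIP to the chosen backend, and the return path has to undo both using state recorded when the connection was created.

#include <stdint.h>
#include <stdio.h>

#define IPV4_LOOPBACK 0xA9FE2A01u /* 169.254.42.1 in host byte order */

/* Illustrative per-connection NAT state; this is not Cilium's CT entry. */
struct hairpin_state {
    uint32_t client_ip;     /* original source (the nginx pod itself) */
    uint32_t service_ip;    /* the LB VIP the client connected to */
    uint32_t backend_ip;    /* backend chosen by the LB (the same pod here) */
    uint16_t rev_nat_index; /* 0 would mean "cannot reverse", as seen in this issue */
};

/* Forward direction: client -> VIP is rewritten to loopback -> backend. */
static void hairpin_forward(struct hairpin_state *st, uint32_t backend,
                            uint32_t *saddr, uint32_t *daddr)
{
    st->client_ip = *saddr;
    st->service_ip = *daddr;
    st->backend_ip = backend;
    *saddr = IPV4_LOOPBACK; /* SNAT so the reply is forced back through the datapath */
    *daddr = backend;       /* DNAT from the VIP to the selected backend */
}

/* Return direction: backend -> loopback must become VIP -> client. */
static int hairpin_reverse(const struct hairpin_state *st,
                           uint32_t *saddr, uint32_t *daddr)
{
    if (st->rev_nat_index == 0)
        return -1;           /* no reverse-NAT context: the reply is not rewritten */
    *saddr = st->service_ip; /* restore the VIP as the reply's source */
    *daddr = st->client_ip;  /* deliver the reply to the original client address */
    return 0;
}

int main(void)
{
    struct hairpin_state st = { .rev_nat_index = 16 };
    uint32_t saddr = 0x0A350345u; /* 10.53.3.69, the pod IP from the flows above */
    uint32_t daddr = 0x0A008A3Fu; /* 10.0.138.63, the LB address from the flows above */

    /* Hairpin: the LB picks the client pod itself as the backend. */
    hairpin_forward(&st, saddr, &saddr, &daddr);
    printf("forward: %#x -> %#x\n", (unsigned)saddr, (unsigned)daddr);

    /* The reply leaves the backend towards the loopback address. */
    uint32_t reply_saddr = st.backend_ip;
    uint32_t reply_daddr = IPV4_LOOPBACK;
    if (hairpin_reverse(&st, &reply_saddr, &reply_daddr) == 0)
        printf("return:  %#x -> %#x\n", (unsigned)reply_saddr, (unsigned)reply_daddr);
    return 0;
}

If hairpin_reverse bails out because there is no reverse-NAT context, the reply keeps its loopback destination and never reaches the client, which matches the failing flows above.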

Cilium Version

Client: 1.11.4 c18daa70f5 2022-03-10T13:43:39-08:00 go version go1.17.9 linux/amd64
Daemon: 1.11.4 c18daa70f5 2022-03-10T13:43:39-08:00 go version go1.17.9 linux/amd64

Kernel Version

Linux gke-<cluster_name>-default-pool- 5.10.147+ #1 SMP Thu Nov 10 10:14:30 UTC 2022 x86_64 x86_64 x86_64 GNU/Linux

Kubernetes Version

Server Version: version.Info{Major:"1", Minor:"23", GitVersion:"v1.23.14-gke.1800", GitCommit:"1eab5b8da4acab130c72aea21eb7ed3e96523ca2", GitTreeState:"clean", BuildDate:"2022-12-07T09:32:46Z", GoVersion:"go1.17.13b7", Compiler:"gc", Platform:"linux/amd64"}

Sysdump

Can acquire if required.

Relevant log output

-> endpoint 544 flow 0x365671e4 , identity 38392->38392 state new ifindex gke41dc9a27c6a orig-ip 169.254.42.1: 169.254.42.1:50332 -> 10.53.3.69:8443 tcp SYN
-> stack flow 0xfef7f129 , identity 38392->world state reply ifindex 0 orig-ip 0.0.0.0: 10.53.3.69:8443 -> 169.254.42.1:50332 tcp SYN, ACK
-> endpoint 544 flow 0x2eca841d , identity 38392->38392 state established ifindex gke41dc9a27c6a orig-ip 169.254.42.1: 169.254.42.1:50332 -> 10.53.3.69:8443 tcp SYN
...  (retries until TCP connection fails)

Anything else?

cilium monitor output from a successful case (return traffic is DNATed to backend):

[screenshot: successful case]

cilium monitor output from a failure case (return traffic is NOT NATed):

[screenshot: failure case]

Signature:
[screenshot: failure signature]

Code of Conduct

  • I agree to follow this project's Code of Conduct
@sypakine sypakine added kind/bug This is a bug in the Cilium logic. kind/community-report This was reported by a user in the Cilium community, eg via Slack. needs/triage This issue requires triaging to establish severity and next steps. labels Feb 21, 2023
@pchaigno pchaigno added the sig/datapath Impacts bpf/ or low-level forwarding details, including map management and monitor messages. label Feb 21, 2023
@anfernee
Contributor

anfernee commented Apr 7, 2023

The problem is the missing RevNAT value for the reply packet:

root@gke-repro-default-pool-b4998d8a-bh5n:/home/cilium# cilium bpf ct list global | grep "54513"
TCP OUT 10.128.15.203:80 -> 10.12.1.27:54513 service expires=18142194 RxPackets=0 RxBytes=234 RxFlagsSeen=0x00 LastRxReport=0 TxPackets=0 TxBytes=0 TxFlagsSeen=0x02 LastTxReport=18141960 Flags=0x0000 [ ] RevNAT=16 SourceSecurityID=0 IfIndex=0
TCP OUT 10.12.1.27:54513 -> 10.128.15.203:80 expires=18226335 RxPackets=8 RxBytes=958 RxFlagsSeen=0x1b LastRxReport=18137237 TxPackets=46 TxBytes=3453 TxFlagsSeen=0x1b LastTxReport=18141960 Flags=0x0018 [ LBLoopback SeenNonSyn ] RevNAT=16 SourceSecurityID=61069 IfIndex=0
TCP IN 169.254.42.1:54513 -> 10.12.1.27:80 expires=18142194 RxPackets=10 RxBytes=740 RxFlagsSeen=0x02 LastRxReport=18141960 TxPackets=9 TxBytes=666 TxFlagsSeen=0x12 LastTxReport=18141960 Flags=0x0000 [ ] RevNAT=0 SourceSecurityID=61069 IfIndex=0

The third entry has RevNAT=0, so the reply path couldn't find the original source IP to translate back to. Not sure why that happens though.
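For context, a hedged sketch (not Cilium's actual map code) of why RevNAT=0 is fatal for the reply: the value is used as an index into a reverse-NAT table holding the original service address, and index 0 gives the lookup nothing to return.

#include <stddef.h>
#include <stdint.h>

/* Hypothetical reverse-NAT record: the original service address/port that the
 * reply's source should be rewritten back to. In Cilium this state lives in
 * BPF maps; a plain array is used here purely to illustrate the lookup. */
struct revnat_record {
    uint32_t service_ip;
    uint16_t service_port;
};

static struct revnat_record revnat_table[64];

/* Index 0 acts as "no reverse NAT recorded": a CT entry carrying RevNAT=0
 * (like the third entry above) gives this lookup nothing to return, so the
 * reply cannot be rewritten back to the service address. */
static const struct revnat_record *revnat_lookup(uint16_t rev_nat_index)
{
    if (rev_nat_index == 0 ||
        rev_nat_index >= sizeof(revnat_table) / sizeof(revnat_table[0]))
        return NULL;
    return &revnat_table[rev_nat_index];
}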

@anfernee
Contributor

A good example looks like this:

TCP IN 169.254.42.1:53516 -> 10.12.0.20:80 expires=16847939 RxPackets=12 RxBytes=986 RxFlagsSeen=0x1b LastRxReport=16847900 TxPackets=5 TxBytes=553 TxFlagsSeen=0x1b LastTxReport=16847900 Flags=0x001b [ RxClosing TxClosing LBLoopback SeenNonSyn ] RevNAT=17 SourceSecurityID=13794 IfIndex=0

while a bad example looks like this: [ ] RevNAT=0 vs [ RxClosing TxClosing LBLoopback SeenNonSyn ] RevNAT=17

Seems like after LB translation, connection tracking didn't correctly relate the new packet to the original service with the right rev_nat index. Not exactly sure how this happens, but we know that when it does, the stale CT entry stays around and any connection trying to reuse that source port will time out.

@sypakine
Author

sypakine commented Apr 12, 2023

I have a turn-key reproduction for GKE. I am now trying to reproduce on OSS master/v1.13 using kind/metallb, but can't get the agent configured to use the loopback IP. What is a minimal agent configuration to ensure DISABLE_LOOPBACK_LB is not defined?

@sypakine
Author

The problem is that the loopback conntrack entries (TCP OUT and TCP IN) have inconsistent closing states:

TCP OUT 169.254.42.1:5555 -> 10.32.2.250:80 expires=24226972 (remaining: 21600 sec(s)) RxPackets=0 RxBytes=0 RxFlagsSeen=0x00 LastRxReport=0 TxPackets=6 TxBytes=479 TxFlagsSeen=0x1b LastTxReport=24142596 Flags=0x001a [ TxClosing LBLoopback SeenNonSyn ] RevNAT=18 SourceSecurityID=332 IfIndex=0
---
TCP IN 169.254.42.1:5555 -> 10.32.2.250:80 expires=24142636 (remaining: 10 sec(s)) RxPackets=6 RxBytes=479 RxFlagsSeen=0x1b LastRxReport=24142596 TxPackets=5 TxBytes=553 TxFlagsSeen=0x1b LastTxReport=24142597 Flags=0x001b [ RxClosing TxClosing LBLoopback SeenNonSyn ] RevNAT=18 SourceSecurityID=332 IfIndex=0

TCP OUT is missing the RxClosing state, which means the entry lifetime is not set to CT_CLOSE_TIMEOUT when the TCP connection is closed (ref: https://sourcegraph.com/github.com/cilium/cilium/-/blob/bpf/lib/conntrack.h?L271-274#tab=history).

TCP IN has the RxClosing and TxClosing states, so it sets the lifetime to the closing value.

The conntrack entry pair now has two different lifetimes. Once the TCP IN entry expires, it becomes eligible for GC. After it is GC'ed, the connection 5-tuple remains in this broken state for the remainder of the TCP OUT entry's lifetime.
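As a rough model of the lifetime handling described above (an approximation only, with placeholder constants; not the actual bpf/lib/conntrack.h code):

#include <stdbool.h>
#include <stdint.h>

/* Placeholder timeouts in seconds; the real values are configurable
 * (e.g. bpf-ct-timeout-regular-tcp, roughly 6h by default). */
#define CT_LIFETIME_TCP  21600
#define CT_SYN_TIMEOUT      60
#define CT_CLOSE_TIMEOUT    10

struct ct_entry_model {
    bool rx_closing;
    bool tx_closing;
    bool seen_non_syn;
    uint32_t expires; /* absolute time in seconds, for this sketch */
};

/* The short close timeout only applies once BOTH closing bits are set. An
 * entry that never sees RxClosing (the TCP OUT entry above) keeps the long
 * established lifetime, so the pair expires at very different times. */
static void ct_update_timeout_model(struct ct_entry_model *e, uint32_t now)
{
    uint32_t timeout = e->seen_non_syn ? CT_LIFETIME_TCP : CT_SYN_TIMEOUT;

    if (e->rx_closing && e->tx_closing)
        timeout = CT_CLOSE_TIMEOUT;
    e->expires = now + timeout;
}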

The TCP IN entry is only created in the special loopback case (ref: https://sourcegraph.com/github.com/cilium/cilium/-/blob/bpf/lib/conntrack.h?L961-989).

Any connection that reuses the 5-tuple will perform a ct lookup for the TCP IN entry, fail, and then create an entry missing the loopback and revnat context:

TCP OUT 169.254.42.1:5555 -> 10.32.2.250:80 expires=24209444 (remaining: 21600 sec(s)) RxPackets=0 RxBytes=0 RxFlagsSeen=0x00 LastRxReport=0 TxPackets=15 TxBytes=1145 TxFlagsSeen=0x1b LastTxReport=24208724 Flags=0x0018 [ LBLoopback SeenNonSyn ] RevNAT=18 SourceSecurityID=332 IfIndex=0
---
TCP IN 169.254.42.1:5555 -> 10.32.2.250:80 expires=24209444 (remaining: 21600 sec(s)) RxPackets=9 RxBytes=666 RxFlagsSeen=0x02 LastRxReport=24208724 TxPackets=16 TxBytes=1168 TxFlagsSeen=0x12 LastTxReport=24208724 Flags=0x0010 [ SeenNonSyn ] RevNAT=0 SourceSecurityID=2 IfIndex=0

One workaround is to lower the bpf-ct-timeout-regular-tcp lifetime value to reduce the delta between the lifetimes of the two entries:

kubectl patch cm -n kube-system cilium-config --patch '
data:
  bpf-ct-timeout-regular-tcp: "180s"
'

This reduces the window in which this issue can be encountered.

@anfernee
Contributor

anfernee commented May 17, 2023

Side note (not directly related): the cilium bpf ct list output doesn't differentiate OUT and SERVICE entries:

	if k.Flags&TUPLE_F_IN != 0 {
		sb.WriteString(fmt.Sprintf("%s IN %s %d:%d ",
			k.NextHeader.String(), addrDest, k.SourcePort,
			k.DestPort),
		)
	} else {
		sb.WriteString(fmt.Sprintf("%s OUT %s %d:%d ",
			k.NextHeader.String(), addrDest, k.DestPort,
			k.SourcePort),
		)
	}

@aojea
Contributor

aojea commented May 18, 2023

/cc

@aojea
Contributor

aojea commented May 23, 2023

It seems this was solved with #22972

I cannot reproduce it on master, but I think I can with the previous commit 80af06e:

KUBEPROXY_MODE="none" make kind
make kind-image

helm upgrade -i cilium ./install/kubernetes/cilium \
    --wait \
    --namespace kube-system \
    --set k8sServiceHost="kind-control-plane" \
    --set k8sServicePort="6443" \
    --set debug.enabled=true \
    --set pprof.enabled=true \
    --set enableIPv4Masquerade=false \
    --set enableIPv6Masquerade=false \
    --set hostFirewall.enabled=false \
    --set socketLB.enabled=true \
    --set socketLB.hostNamespaceOnly=true \
    --set kubeProxyReplacement=strict \
    --set nodeinit.enabled=true \
    --set ipam.mode=kubernetes \
    --set ipv4.enabled=true \
    --set ipv4NativeRoutingCIDR=10.244.0.0/16 \
    --set ipv6.enabled=false \
    --set image.override="localhost:5000/cilium/cilium-dev:local" \
    --set image.pullPolicy=Never \
    --set operator.image.override="localhost:5000/cilium/operator-generic:local" \
    --set operator.image.pullPolicy=Never \
    --set operator.image.suffix="" \
    --set securityContext.privileged=true \
    --set gatewayAPI.enabled=false

Install one Pod and expose it with one Service

apiVersion: apps/v1
kind: Deployment
metadata:
  name: server-deployment
  labels:
    app: MyApp
spec:
  replicas: 1
  selector:
    matchLabels:
      app: MyApp
  template:
    metadata:
      labels:
        app: MyApp
    spec:
      containers:
      - name: agnhost
        image: k8s.gcr.io/e2e-test-images/agnhost:2.39
        args:
          - netexec
          - --http-port=80
        ports:
        - containerPort: 80
---
apiVersion: v1
kind: Service
metadata:
  name: myservice
spec:
  type: ClusterIP
  selector:
    app: MyApp
  ports:
    - protocol: TCP
      port: 80
      targetPort: 80

Once the pod is running, get the Service ClusterIP and exec into the pod to run a curl against the ClusterIP:

# curl 10.96.97.87/hostname
server-deployment-64f84c8b7f-tx92

On the anetd pod we should see the asymmetry in the timeouts for the link-local address:

TCP OUT 169.254.42.1:57240 -> 10.244.1.128:80 expires=17793634 (remaining: 21594 sec(s)) RxPackets=0 RxBytes=0 RxFlagsSeen=0x00 LastRxReport=0 TxPackets=6 TxBytes=480 TxFlagsSeen=0x1b LastTxReport=17772541 Flags=0x001a [ TxClosing LBLoopback SeenNonSyn ] RevNAT=5 SourceSecurityID=15428 IfIndex=0
TCP IN 169.254.42.1:57240 -> 10.244.1.128:80 expires=17772550 (remaining: 4 sec(s)) RxPackets=6 RxBytes=480 RxFlagsSeen=0x1b LastRxReport=17772541 TxPackets=5 TxBytes=522 TxFlagsSeen=0x1b LastTxReport=17772541 Flags=0x001b [ RxClosing TxClosing LBLoopback SeenNonSyn ] RevNAT=5 SourceSecurityID=15428 IfIndex=0
ICMP OUT 169.254.42.1:0 -> 10.244.1.128:0 related expires=17772599 (remaining: 54 sec(s)) RxPackets=0 RxBytes=0 RxFlagsSeen=0x00 LastRxReport=0 TxPackets=1 TxBytes=74 TxFlagsSeen=0x02 LastTxReport=17772541 Flags=0x0018 [ LBLoopback SeenNonSyn ] RevNAT=5 SourceSecurityID=15428 IfIndex=0

@aojea
Contributor

aojea commented May 24, 2023

If I run with the patch from #22972, the timers seem correct now:

 cilium bpf ct list global -d | grep 169
TCP IN 169.254.42.1:55450 -> 10.244.1.128:80 expires=17832164 (remaining: 21586 sec(s)) RxPackets=0 RxBytes=0 RxFlagsSeen=0x00 LastRxReport=0 TxPackets=5 TxBytes=526 TxFlagsSeen=0x1b LastTxReport=17811071 Flags=0x001a [ TxClosing LBLoopback SeenNonSyn ] RevNAT=5 SourceSecurityID=15428 IfIndex=0
TCP OUT 169.254.42.1:55450 -> 10.244.1.128:80 expires=17832164 (remaining: 21586 sec(s)) RxPackets=0 RxBytes=0 RxFlagsSeen=0x00 LastRxReport=0 TxPackets=6 TxBytes=480 TxFlagsSeen=0x1b LastTxReport=17811071 Flags=0x001a [ TxClosing LBLoopback SeenNonSyn ] RevNAT=5 SourceSecurityID=15428 IfIndex=0
ICMP OUT 169.254.42.1:0 -> 10.244.1.128:0 related expires=17811129 (remaining: 47 sec(s)) RxPackets=0 RxBytes=0 RxFlagsSeen=0x00 LastRxReport=0 TxPackets=1 TxBytes=74 TxFlagsSeen=0x02 LastTxReport=17811071 Flags=0x0018 [ LBLoopback SeenNonSyn ] RevNAT=5 SourceSecurityID=15428 IfIndex=0

@sypakine
Author

Confirmed that fixes the issue.

Note: #22972 does change the CT states -- the Rx closing FIN(/ACK) is missed by these conntrack entries, so they remain in the CT map for the TCP (SYN seen) lifetime:

# cilium bpf ct list global -d | grep :55555
TCP IN 169.254.42.1:55555 -> 10.244.1.10:80 expires=18068339 (remaining: 21594 sec(s)) RxPackets=0 RxBytes=0 RxFlagsSeen=0x00 LastRxReport=0 TxPackets=5 TxBytes=523 TxFlagsSeen=0x1b LastTxReport=18047246 Flags=0x001a [ TxClosing LBLoopback SeenNonSyn ] RevNAT=5 SourceSecurityID=8717 IfIndex=0

TCP OUT 169.254.42.1:55555 -> 10.244.1.10:80 expires=18068339 (remaining: 21594 sec(s)) RxPackets=0 RxBytes=0 RxFlagsSeen=0x00 LastRxReport=0 TxPackets=6 TxBytes=481 TxFlagsSeen=0x1b LastTxReport=18047246 Flags=0x001a [ TxClosing LBLoopback SeenNonSyn ] RevNAT=5 SourceSecurityID=8717 IfIndex=0

TCP OUT 10.96.183.246:80 -> 10.244.1.10:55555 service expires=18068339 (remaining: 21594 sec(s)) RxPackets=0 RxBytes=7 RxFlagsSeen=0x00 LastRxReport=0 TxPackets=0 TxBytes=0 TxFlagsSeen=0x1b LastTxReport=18047246 Flags=0x0012 [ TxClosing SeenNonSyn ] RevNAT=5 SourceSecurityID=0 IfIndex=0

@aojea
Contributor

aojea commented May 24, 2023

Confirmed that fixes the issue.

I think we can close it then

Thanks

/close

@aojea
Contributor

aojea commented Jun 9, 2023

/reopen

@aditighag can you please reopen? It seems that with a certain configuration the issue can be reproduced; we are trying to figure out exactly what that combination is.

Thanks

@aanm aanm reopened this Jun 9, 2023
@aojea
Contributor

aojea commented Jun 11, 2023

This is easy to repro in the affected scenarios with the reproducer in #23913 (comment) by execing into the Pod and running a while loop that curls the Service IP. It eventually hangs, and seems to match the explanation in #23913 (comment):

while true; do curl 10.52.7.201/clientip; done
....
curl: (28) Failed to connect to 10.52.7.201 port 80 after 131064 ms: Operation timed out
169.254.42.1:60202curl: (28) Failed to connect to 10.52.7.201 port 80 after 131059 ms: Operation timed out

The SYN arrives but the SYN-ACK never gets back, because the CT entry that is found cannot reverse the NAT.

@aojea
Contributor

aojea commented Jun 11, 2023

TCP IN has the RxClosing and TxClosing states, so it sets the lifetime to the closing value.

I see that setting RxClosing is the expected behavior for Services, but it seems this behavior no longer applies to "loopback" services: they do not set RxClosing, and that way we don't have the problem

#23913 (comment)

@joestringer is this expected, i.e. that "loopback" services don't set the RxClosing state in the CT entry?

#23913 (comment)

It's not clear to me whether it is deliberate that we skip updating some known state on the TCP close handshake.

@sypakine
Author

sypakine commented Jun 12, 2023

We've identified the trigger. When endpointRoutes and hostLegacyRouting are enabled, the state of the TCP IN conntrack entry for a closed TCP session differs.

helm:
endpointRoutes.enabled=true && bpf.hostLegacyRouting=true

agent flags:
--enable-endpoint-routes='true' && --enable-host-legacy-routing='true'

With the above settings, the TCP FIN/ACK sets the RxClosing state for the TCP IN ct entry (output from main branch):

TCP IN 169.254.42.1:55555 -> 10.244.0.29:80 expires=16989756 (remaining: 9 sec(s)) RxPackets=6 RxBytes=489 RxFlagsSeen=0x1b LastRxReport=16989747 TxPackets=5 TxBytes=468 TxFlagsSeen=0x1b LastTxReport=16989747 Flags=0x001b [ RxClosing TxClosing LBLoopback SeenNonSyn ] RevNAT=2 SourceSecurityID=50098 IfIndex=0
TCP OUT 169.254.42.1:55555 -> 10.244.0.29:80 expires=17010840 (remaining: 21599 sec(s)) RxPackets=0 RxBytes=0 RxFlagsSeen=0x00 LastRxReport=0 TxPackets=6 TxBytes=489 TxFlagsSeen=0x1b LastTxReport=16989747 Flags=0x001a [ TxClosing LBLoopback SeenNonSyn ] RevNAT=2 SourceSecurityID=50098 IfIndex=0

Otherwise (without those settings):

TCP IN 169.254.42.1:55555 -> 10.244.0.73:80 expires=17010840 (remaining: 21599 sec(s)) RxPackets=0 RxBytes=0 RxFlagsSeen=0x00 LastRxReport=0 TxPackets=5 TxBytes=468 TxFlagsSeen=0x1b LastTxReport=17345307 Flags=0x001a [ TxClosing LBLoopback SeenNonSyn ] RevNAT=5 SourceSecurityID=3072 IfIndex=0
TCP OUT 169.254.42.1:55555 -> 10.244.0.73:80 expires=17010840 (remaining: 21599 sec(s)) RxPackets=0 RxBytes=0 RxFlagsSeen=0x00 LastRxReport=0 TxPackets=6 TxBytes=489 TxFlagsSeen=0x1b LastTxReport=17345307 Flags=0x001a [ TxClosing LBLoopback SeenNonSyn ] RevNAT=5 SourceSecurityID=3072 IfIndex=


Reproduction:

This state diff is reproducible on main:

KUBEPROXY_MODE="none" make kind && \
make kind-image && \
helm upgrade -i cilium ./install/kubernetes/cilium \
  --wait \
  --namespace kube-system \
  --set k8sServiceHost="kind-control-plane" \
  --set k8sServicePort="6443" \
  --set debug.enabled=true \
  --set pprof.enabled=true \
  --set enableIPv4Masquerade=false \
  --set enableIPv6Masquerade=false \
  --set hostFirewall.enabled=false \
  --set socketLB.hostNamespaceOnly=true \
  --set kubeProxyReplacement=strict \
  --set nodeinit.enabled=true \
  --set ipam.mode=kubernetes \
  --set ipv4.enabled=true \
  --set ipv4NativeRoutingCIDR=10.244.0.0/16 \
  --set ipv6.enabled=false \
  --set image.override="localhost:5000/cilium/cilium-dev:local" \
  --set image.pullPolicy=Never \
  --set operator.image.override="localhost:5000/cilium/operator-generic:local" \
  --set operator.image.pullPolicy=Never \
  --set operator.image.suffix="" \
  --set securityContext.privileged=true \
  --set gatewayAPI.enabled=false \
  --set socketLB.enabled=false \
  --set bpf.hostLegacyRouting=true \
  --set endpointRoutes.enabled=true

alias kk="kubectl --context kind-kind"
kk --context kind-kind apply -f - <<EOF
apiVersion: v1
kind: Pod
metadata:
  name: server
  namespace: default
  labels:
    app: server
spec:
  containers:
  - name: agnhost
    image: k8s.gcr.io/e2e-test-images/agnhost:2.39
    args:
    - netexec
    - --http-port=80
    ports:
    - containerPort: 80
---
apiVersion: v1
kind: Service
metadata:
  name: server
spec:
  type: ClusterIP
  selector:
    app: server
  ports:
    - protocol: TCP
      port: 80
      targetPort: 80
EOF

time while ! [[ "$( kk get svc server -ojsonpath='{.spec.clusterIP}' | xargs )" != "" ]]; do echo -n "."; sleep 1; done
clusterip=$( kk get svc server -ojsonpath='{.spec.clusterIP}' )

time while ! [[ "$( kk get pod server -ojsonpath='{.status.conditions[?(@.type=="Ready")].status}' )" =~ "True" ]]; do echo -n "."; sleep 1; done
kk exec server -- curl --no-keepalive --connect-timeout 5 --local-port 55555 -v ${clusterip}:80/hostname

host=$( kk get pod server \
    -ojsonpath='{ .spec.nodeName }' )
agent=$( kk get pod -n kube-system \
    -l k8s-app=cilium \
    --field-selector=spec.nodeName=${host} \
    -ojsonpath='{.items[].metadata.name}' )
kk exec -it -n kube-system ${agent} -- cilium bpf ct list global -d | grep ":55555" | grep "169.254.42.1"

@joestringer
Member

I see that setting RxClosing is the expected behavior for Services, but it seems this behavior no longer applies to "loopback" services: they do not set RxClosing, and that way we don't have the problem

#23913 (comment)

@joestringer is this expected, i.e. that "loopback" services don't set the RxClosing state in the CT entry?

Very broadly, without looking at the details, I'd expect the transmit / receive closing bits to be set based on the perspective of the local entry upon egress or ingress of the corresponding packet. So when the pod transmits a TCP FIN A->B I'd expect TxClosing to be set for the A->B connection at A, then when B receives it, it'd set RxClosing at B, then B should respond with the FIN-ACK and that would set TxClosing at B, the packet is routed back and received at A, where it would set RxClosing.

When it comes to LB hairpin I would then posit that this should happen the same, but for connections A->C¹ and C->B, then B->C and C->A.

Footnotes

  1. Where C is the hairpin IP
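A tiny model of that expectation (illustrative only, not Cilium code): after a full FIN exchange, both peers' entries should end up with both closing bits set; in the hairpin case reported here, one entry apparently never gets RxClosing.

#include <stdbool.h>
#include <stdio.h>

/* Minimal model of a peer's CT entry, just for the closing bits. */
struct peer_ct {
    bool rx_closing; /* a FIN/RST was seen arriving on this connection */
    bool tx_closing; /* a FIN/RST was seen leaving on this connection */
};

static void send_fin(struct peer_ct *sender, struct peer_ct *receiver)
{
    sender->tx_closing = true;   /* FIN observed on egress at the sender */
    receiver->rx_closing = true; /* ...and on ingress at the receiver */
}

int main(void)
{
    struct peer_ct a = {0}, b = {0};

    send_fin(&a, &b); /* A -> B (via C): FIN */
    send_fin(&b, &a); /* B -> A (via C): FIN/ACK */

    /* Expected end state: all four bits set. In the buggy hairpin case the
     * entry tracking A's side never observes the returning FIN, so its
     * rx_closing stays false and the long timeout applies. */
    printf("A: rx=%d tx=%d  B: rx=%d tx=%d\n",
           a.rx_closing, a.tx_closing, b.rx_closing, b.tx_closing);
    return 0;
}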

@aojea
Contributor

aojea commented Jun 14, 2023

the packet is routed back and received at A, where it would set RxClosing.

sweet, @sypakine this clarifies everything, so the bug is clear now: in LB hairpin, across all the tests and permutations we did, we never saw A setting RxClosing.

It seems that depending on which features are enabled, the conntrack states for LB hairpin can be the same or different. The conntrack states drive the entry timeout, and if the two entries differ you'll eventually hit this bug.

@sypakine
Author

Yes, this is the observation I made @ #23913 (comment):

TCP OUT is missing the RxClosing state, which means the entry lifetime is not set to CT_CLOSE_TIMEOUT when the TCP connection is closed (ref: https://sourcegraph.com/github.com/cilium/cilium/-/blob/bpf/lib/conntrack.h?L271-274#tab=history).

It seems this flow dodged a wider issue because all CT entries fail to set the RxClosing state. The side effect is that the stale CT entries linger (potentially for ~6h, or longer with a non-default TCP lifetime).

@joestringer
Member

Nice find! Even if this doesn't always trigger a connectivity failure, the additional conntrack entries hanging around are not ideal.

@julianwiedmann
Member

julianwiedmann commented Jun 26, 2023

Not sure I follow all the debugging results so far, but looks like you're on the right track 👍.

For the loopback case, we should end up having three CT entries:

  1. CT_SERVICE, tracking Pod_IP -> Cluster_IP
  2. CT_EGRESS, tracking IPV4_LOOPBACK <-> Pod_IP (the client's PoV)
  3. CT_INGRESS, also tracking IPV4_LOOPBACK <-> Pod_IP (the backend's PoV)

CT_SERVICE only sees the requests. But the other two entries should ideally be updated for both directions of the connection. From the entries' packet statistics, it should be easy to check whether we're missing some updates.

The underlying problem is that for the loopback case, the CT_INGRESS entry already gets created while creating the CT_EGRESS entry (to set .loopback and .rev_nat_index). It would be much more robust if the to-container path could handle this (as usual), when the first request flows towards the backend.
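Roughly, the creation path being described could be modeled like this (a sketch of the idea only; the names are made up and this is not the bpf/lib/conntrack.h code):

#include <stdbool.h>
#include <stdint.h>

/* Simplified view of the per-direction CT entries discussed above. */
struct loopback_ct_entry {
    bool loopback;          /* LBLoopback flag */
    uint16_t rev_nat_index; /* needed to reverse the translation on the reply */
};

/* In the loopback case the ingress entry is created eagerly, alongside the
 * egress entry, so .loopback and .rev_nat_index are in place before the reply
 * arrives. If the ingress entry later expires on its own (shorter lifetime)
 * and is recreated by the normal ingress path, the recreated entry lacks this
 * context, which is the broken state shown earlier in this issue. */
static void create_loopback_pair(struct loopback_ct_entry *egress,
                                 struct loopback_ct_entry *ingress,
                                 uint16_t rev_nat_index)
{
    egress->loopback = true;
    egress->rev_nat_index = rev_nat_index;

    ingress->loopback = true;
    ingress->rev_nat_index = rev_nat_index;
}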

FYI, I've parked some WIP cleanups in https://github.com/julianwiedmann/cilium/tree/1.15-bpf-loopback-cleanups that I hope to land in v1.15. Maybe this makes it a bit easier to understand the code flow, and develop a fix on-top.

@julianwiedmann
Member

This is easy to repro in the affected scenarios with the reproducer in #23913 (comment) by execing into the Pod and running a while loop that curls the Service IP. It eventually hangs, and seems to match the explanation in #23913 (comment):

while true; do curl 10.52.7.201/clientip; done
....
curl: (28) Failed to connect to 10.52.7.201 port 80 after 131064 ms: Operation timed out
169.254.42.1:60202curl: (28) Failed to connect to 10.52.7.201 port 80 after 131059 ms: Operation timed out

The SYN arrives but the SYN-ACK never gets back, because the CT entry that is found cannot reverse the NAT.

👋 hi @aojea, is there still interest in resolving this particular issue (missing CT_INGRESS entry for loopback connections, due to different expiry times)?

If so, #27602 would benefit from testing / contributing tests that reproduce the issue.

@aojea
Contributor

aojea commented Aug 22, 2023

If so, #27602 would benefit from testing / contributing tests that reproduce the issue.

This is super confusing: I cannot reproduce it on head (without the patch), with the patch, or in 1.14.0 ... however, 1.14.1 is completely broken.

I have inconsistent results with different versions; I need to find a good way to consistently repro this.

@julianwiedmann julianwiedmann added the area/loadbalancing Impacts load-balancing and Kubernetes service implementations label Aug 30, 2023
@julianwiedmann
Member

The missing RevNAT should be fixed now with #27602.

@sypakine
Author

sypakine commented Sep 7, 2023

Thanks for all the work on this @julianwiedmann, @aditighag, @aojea, and @anfernee!
