
bpf: lxc: defer CT_INGRESS entry creation for loopback connections #27602

Merged

Conversation

@julianwiedmann (Member) commented Aug 21, 2023

Currently the CT_INGRESS entry for a loopback connection is created as soon as the first request leaves the client, at the same time that from-container creates the CT_EGRESS entry (see the loopback handling in ct_create4() for details).

This is unusual - we would normally just create the CT_INGRESS entry as the first packet passes through to-container into the backend. But for loopback connections it is needed (see the sketch below), so that
1.) to-container finds a CT entry with .loopback set, and thus skips network policy enforcement even for the first packet, and
2.) the CT entry has its rev_nat_index field populated, and thus can RevNAT replies in from-container.

This approach conflicts with the fact that loopback replies skip the client's to-container path (to avoid network policy enforcement).

Once the loopback connection is closed, the backend's from-container path observes the FIN / RST, and __ct_lookup4() updates the CT_INGRESS entry's lifetime to CT_CLOSE_TIMEOUT. But the client's to-container path never observes the FIN / RST, so the CT_EGRESS entry's lifetime remains at CT_CONNECTION_LIFETIME_TCP. The CT_INGRESS entry therefore expires earlier, and may be garbage-collected while the CT_EGRESS entry stays in place.
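The resulting lifetime skew, sketched with the default timeouts suggested by the CT dumps further down in this thread (again illustrative names, not the real __ct_lookup4() code):

/* Illustrative sketch of the lifetime skew; the constants mirror the
 * defaults visible in the CT dumps below (~21600s TCP lifetime, ~10s
 * close timeout), and the helper is a simplified stand-in. */
#include <stdbool.h>
#include <stdint.h>

#define CT_CONNECTION_LIFETIME_TCP  21600  /* seconds */
#define CT_CLOSE_TIMEOUT               10  /* seconds */

struct ct_dir_sketch {
    bool     saw_fin_or_rst;
    uint32_t lifetime;  /* seconds until the entry expires */
};

/* The backend's from-container path sees the FIN/RST and shortens the
 * CT_INGRESS lifetime; the client's to-container path is bypassed for
 * loopback replies, so the CT_EGRESS entry keeps the full TCP lifetime
 * and can outlive CT_INGRESS by hours once the GC runs. */
static void refresh_lifetime(struct ct_dir_sketch *dir)
{
    dir->lifetime = dir->saw_fin_or_rst ? CT_CLOSE_TIMEOUT
                                        : CT_CONNECTION_LIFETIME_TCP;
}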

If the same loopback connection is subsequently re-opened, the client's from-container path finds the CT_EGRESS entry and thus will not call ct_create4(). Consequently the CT_INGRESS entry is not created, and the backend will not apply the loopback-specific handling described above. Inbound requests are potentially dropped (due to network policy), and/or replies are not RevNATed.

Fix this up by letting the backend path create its CT_INGRESS entry as usual. It just needs a bit of detection code in its CT_NEW handling to understand that the first packet belongs to a .loopback connection, and populate its own CT_INGRESS entry accordingly.
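A minimal sketch of that detection (invented names, not the exact bpf_lxc.c change; in particular, taking the rev_nat_index from the client's CT_EGRESS state is an assumption made for the illustration):

/* Illustrative sketch of the CT_NEW detection on the backend's
 * to-container path; all names are stand-ins for the real datapath code. */
#include <stdbool.h>
#include <stdint.h>

struct ct_state_sketch {
    bool     loopback;
    uint16_t rev_nat_index;
};

/* Called from the CT_NEW branch: a source address equal to the loopback
 * SNAT address (e.g. 169.254.42.1) marks the first inbound packet of a
 * loopback connection. */
static bool ct_new_handle_loopback(uint32_t saddr, uint32_t ipv4_loopback,
                                   const struct ct_state_sketch *client_egress,
                                   struct ct_state_sketch *ct_ingress)
{
    if (saddr != ipv4_loopback || !client_egress)
        return false;  /* normal CT_NEW handling applies */

    /* Populate the new CT_INGRESS entry so that policy is skipped and
     * replies can still be RevNATed. */
    ct_ingress->loopback = true;
    ct_ingress->rev_nat_index = client_egress->rev_nat_index;
    return true;
}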

Fix connectivity issues caused by a missing conntrack entry when a service pod connects to itself via its ClusterIP.

@julianwiedmann added the kind/bug, sig/datapath, release-note/bug, feature/conntrack and area/loadbalancing labels on Aug 21, 2023
@julianwiedmann (Member, Author)

/test

@julianwiedmann force-pushed the 1.15-bpf-lxc-loopback-ct-fix branch 2 times, most recently from ba45517 to 45ce1bd on August 21, 2023 07:14
@julianwiedmann (Member, Author)

/test

@aojea (Contributor) commented Aug 21, 2023

Can you please add me as a reviewer? I will make some time this week to review and test this.

@aojea (Contributor) commented Aug 22, 2023

Surprisingly, I cannot reproduce it either with or without this patch; however, the CT entries are different.

Without the patch:

cilium bpf ct list global -d | grep  41746
TCP IN 169.254.42.1:41746 -> 10.244.0.70:80 expires=1600072 (remaining: 21442 sec(s)) RxPackets=0 RxBytes=0 RxFlagsSeen=0x00 LastRxReport=0 TxPackets=21 TxBytes=2109 TxFlagsSeen=0x1b LastTxReport=1578472 Flags=0x001a [ TxClosing LBLoopback SeenNonSyn ] RevNAT=6 SourceSecurityID=48450 IfIndex=0
TCP OUT 169.254.42.1:41746 -> 10.244.0.70:80 expires=1600072 (remaining: 21442 sec(s)) RxPackets=0 RxBytes=0 RxFlagsSeen=0x00 LastRxReport=0 TxPackets=30 TxBytes=2445 TxFlagsSeen=0x1b LastTxReport=1578472 Flags=0x001a [ TxClosing LBLoopback SeenNonSyn ] RevNAT=6 SourceSecurityID=48450 IfIndex=0
TCP OUT 10.96.234.161:80 -> 10.244.0.70:41746 service expires=1600072 (remaining: 21442 sec(s)) RxPackets=0 RxBytes=9 RxFlagsSeen=0x00 LastRxReport=0 TxPackets=0 TxBytes=0 TxFlagsSeen=0x1b LastTxReport=1578472 Flags=0x0012 [ TxClosing SeenNonSyn ] RevNAT=6 SourceSecurityID=0 IfIndex=0

With this patch:

TCP IN 169.254.42.1:41746 -> 10.244.0.70:80 expires=1580663 (remaining: -10 sec(s)) RxPackets=6 RxBytes=489 RxFlagsSeen=0x1b LastRxReport=1580653 TxPackets=4 TxBytes=407 TxFlagsSeen=0x1b LastTxReport=1580653 Flags=0x001b [ RxClosing TxClosing LBLoopback SeenNonSyn ] RevNAT=6 SourceSecurityID=48450 IfIndex=0
TCP OUT 169.254.42.1:41746 -> 10.244.0.70:80 expires=1602253 (remaining: 21580 sec(s)) RxPackets=0 RxBytes=0 RxFlagsSeen=0x00 LastRxReport=0 TxPackets=6 TxBytes=489 TxFlagsSeen=0x1b LastTxReport=1580653 Flags=0x001a [ TxClosing LBLoopback SeenNonSyn ] RevNAT=6 SourceSecurityID=48450 IfIndex=0
TCP OUT 10.96.234.161:80 -> 10.244.0.70:41746 service expires=1602253 (remaining: 21580 sec(s)) RxPackets=0 RxBytes=9 RxFlagsSeen=0x00 LastRxReport=0 TxPackets=0 TxBytes=0 TxFlagsSeen=0x1b LastTxReport=1580653 Flags=0x0012 [ TxClosing SeenNonSyn ] RevNAT=6 SourceSecurityID=0 IfIndex=0

bpf/bpf_lxc.c (inline review thread; outdated, resolved)
@aojea (Contributor) commented Aug 23, 2023

OK @julianwiedmann, I figured out the problem I was having when trying to reproduce it.

I need to start from a clean state each time I try a new version; this means flushing the conntrack table and installing a new pod (I do not know how to force the BPF recompilation :/).

With this patch I observe the "wrong" behavior, asymmetry in the timers (though RxClosing is set correctly): it sets the remaining time of the TCP_IN entry to 5s, but the TCP_OUT entry to 21595s.

cilium bpf ct list global -d | grep 43488
TCP OUT 10.96.77.183:80 -> 10.244.1.160:43488 service expires=1700265 (remaining: 21595 sec(s)) RxPackets=0 RxBytes=13 RxFlagsSeen=0x00 LastRxReport=0 TxPackets=0 TxBytes=0 TxFlagsSeen=0x1b LastTxReport=1678662 Flags=0x0012 [ TxClosing SeenNonSyn ] RevNAT=5 SourceSecurityID=0 IfIndex=0
TCP IN 169.254.42.1:43488 -> 10.244.1.160:80 expires=1678675 (remaining: 5 sec(s)) RxPackets=15 RxBytes=1254 RxFlagsSeen=0x1b LastRxReport=1678662 TxPackets=15 TxBytes=1881 TxFlagsSeen=0x1b LastTxReport=1678662 Flags=0x001b [ RxClosing TxClosing LBLoopback SeenNonSyn ] RevNAT=5 SourceSecurityID=48450 IfIndex=0
TCP OUT 169.254.42.1:43488 -> 10.244.1.160:80 expires=1700265 (remaining: 21595 sec(s)) RxPackets=0 RxBytes=0 RxFlagsSeen=0x00 LastRxReport=0 TxPackets=15 TxBytes=1254 TxFlagsSeen=0x1b LastTxReport=1678662 Flags=0x001a [ TxClosing LBLoopback SeenNonSyn ] RevNAT=5 SourceSecurityID=48450 IfIndex=0

With versions v1.14.0 and v1.14.1 the timers look OK, but the RxClosing state is not present:

cilium bpf ct list global -d | grep 43488
TCP OUT 10.96.239.241:80 -> 10.244.1.167:43488 service expires=1699206 (remaining: 21596 sec(s)) RxPackets=0 RxBytes=11 RxFlagsSeen=0x00 LastRxReport=0 TxPackets=0 TxBytes=0 TxFlagsSeen=0x1b LastTxReport=1677602 Flags=0x0012 [ TxClosing SeenNonSyn ] RevNAT=6 SourceSecurityID=0 IfIndex=0
TCP OUT 169.254.42.1:43488 -> 10.244.1.167:80 expires=1699206 (remaining: 21596 sec(s)) RxPackets=0 RxBytes=0 RxFlagsSeen=0x00 LastRxReport=0 TxPackets=15 TxBytes=1257 TxFlagsSeen=0x1b LastTxReport=1677602 Flags=0x001a [ TxClosing LBLoopback SeenNonSyn ] RevNAT=6 SourceSecurityID=48450 IfIndex=0
TCP IN 169.254.42.1:43488 -> 10.244.1.167:80 expires=1699206 (remaining: 21596 sec(s)) RxPackets=0 RxBytes=0 RxFlagsSeen=0x00 LastRxReport=0 TxPackets=16 TxBytes=1955 TxFlagsSeen=0x1b LastTxReport=1677602 Flags=0x001a [ TxClosing LBLoopback SeenNonSyn ] RevNAT=6 SourceSecurityID=48450 IfIndex=0

This is how you can reproduce the issue. The problem is that it is not easy to automate, as it depends on the timers being incorrect, so that you eventually get one of these entries that has expired in one direction but not in the other; that may take some time, O(minutes).

  1. Create a kind cluster without kube-proxy and build the Cilium image: KUBEPROXY_MODE="none" make kind && make kind-image

  2. Install Cilium. I use the following values, but for this case I assume any configuration with kube-proxy replacement in strict mode should be enough:

helm upgrade -i cilium ./install/kubernetes/cilium \
    --wait \
    --namespace kube-system \
    --set k8sServiceHost="kind-control-plane" \
    --set k8sServicePort="6443" \
    --set debug.enabled=true \
    --set pprof.enabled=true \
    --set enableIPv4Masquerade=false \
    --set enableIPv6Masquerade=false \
    --set hostFirewall.enabled=false \
    --set socketLB.enabled=true \
    --set socketLB.hostNamespaceOnly=true \
    --set kubeProxyReplacement=strict \
    --set nodeinit.enabled=true \
    --set ipam.mode=kubernetes \
    --set routingMode=native \
    --set ipv4.enabled=true \
    --set ipv4NativeRoutingCIDR=10.244.0.0/16 \
    --set ipv6.enabled=false \
    --set image.override="localhost:5000/cilium/cilium-dev:local" \
    --set image.pullPolicy=Never \
    --set operator.image.override="localhost:5000/cilium/operator-generic:local" \
    --set operator.image.pullPolicy=Never \
    --set operator.image.suffix="" \
    --set securityContext.privileged=true \
    --set gatewayAPI.enabled=false

Now these are the steps to reproduce. Every time you install a new Cilium version you have to start from here:
make kind-image to build a Cilium image for the currently checked-out commit
kubectl -n kube-system rollout restart ds/cilium to reload the Cilium agent

Install a Pod that runs a web server, with a Service that references it:

apiVersion: apps/v1
kind: Deployment
metadata:
  name: server-deployment
  labels:
    app: MyApp
spec:
  replicas: 1
  selector:
    matchLabels:
      app: MyApp
  template:
    metadata:
      labels:
        app: MyApp
    spec:
      nodeName: kind-worker
      containers:
      - name: agnhost
        image: httpd:2
        ports:
        - containerPort: 80
---
apiVersion: v1
kind: Service
metadata:
  name: myservice
spec:
  type: ClusterIP
  selector:
    app: MyApp
  clusterIP: 10.96.7.7
  ports:
    - protocol: TCP
      port: 80
      targetPort: 80

The pod will always run on the node kind-worker for simplicity, and the Service has its ClusterIP hardcoded to 10.96.7.7. I run 100000 requests to be sure all ephemeral ports are exercised, and then grep for one of them:

kubectl get pods
NAME                                 READY   STATUS    RESTARTS   AGE
server-deployment-5d9f6fdf8b-ggbvm   1/1     Running   0          12s
kubectl exec -it server-deployment-5d9f6fdf8b-ggbvm -- ab -n 60000 -c 100 http://10.96.7.7/
This is ApacheBench, Version 2.3 <$Revision: 1903618 $>
Copyright 1996 Adam Twiss, Zeus Technology Ltd, http://www.zeustech.net/
Licensed to The Apache Software Foundation, http://www.apache.org/

Benchmarking 10.96.7.7 (be patient)
Completed 6000 requests
Completed 12000 requests
Completed 18000 requests
Completed 24000 requests
Completed 30000 requests

In another console, log into the corresponding Cilium agent to get the conntrack entries:

docker exec -it kind-worker bash
crictl exec -it $(crictl ps | grep cilium-agent | awk '{print $1}') bash
 cilium bpf ct list global -d | grep 43488
TCP OUT 10.96.77.183:80 -> 10.244.1.160:43488 service expires=1700265 (remaining: 21595 sec(s)) RxPackets=0 RxBytes=13 RxFlagsSeen=0x00 LastRxReport=0 TxPackets=0 TxBytes=0 TxFlagsSeen=0x1b LastTxReport=1678662 Flags=0x0012 [ TxClosing SeenNonSyn ] RevNAT=5 SourceSecurityID=0 IfIndex=0
TCP IN 169.254.42.1:43488 -> 10.244.1.160:80 expires=1678675 (remaining: 5 sec(s)) RxPackets=15 RxBytes=1254 RxFlagsSeen=0x1b LastRxReport=1678662 TxPackets=15 TxBytes=1881 TxFlagsSeen=0x1b LastTxReport=1678662 Flags=0x001b [ RxClosing TxClosing LBLoopback SeenNonSyn ] RevNAT=5 SourceSecurityID=48450 IfIndex=0
TCP OUT 169.254.42.1:43488 -> 10.244.1.160:80 expires=1700265 (remaining: 21595 sec(s)) RxPackets=0 RxBytes=0 RxFlagsSeen=0x00 LastRxReport=0 TxPackets=15 TxBytes=1254 TxFlagsSeen=0x1b LastTxReport=1678662 Flags=0x001a [ TxClosing LBLoopback SeenNonSyn ] RevNAT=5 SourceSecurityID=48450 IfIndex=0

@aojea (Contributor) commented Aug 23, 2023

I talked with @sypakine and he still sees the issue in v1.14.0. It turns out that the behavior depends on the flags; with the following config, v1.14.0 still shows asymmetric values:

KUBEPROXY_MODE="none" make kind && \
make kind-image && \
helm upgrade -i cilium ./install/kubernetes/cilium \
  --wait \
  --namespace kube-system \
  --set k8sServiceHost="kind-control-plane" \
  --set k8sServicePort="6443" \
  --set debug.enabled=true \
  --set pprof.enabled=true \
  --set enableIPv4Masquerade=false \
  --set enableIPv6Masquerade=false \
  --set hostFirewall.enabled=false \
  --set socketLB.hostNamespaceOnly=true \
  --set kubeProxyReplacement=strict \
  --set nodeinit.enabled=true \
  --set ipam.mode=kubernetes \
  --set ipv4.enabled=true \
  --set ipv4NativeRoutingCIDR=10.244.0.0/16 \
  --set ipv6.enabled=false \
  --set image.override="localhost:5000/cilium/cilium-dev:local" \
  --set image.pullPolicy=Never \
  --set operator.image.override="localhost:5000/cilium/operator-generic:local" \
  --set operator.image.pullPolicy=Never \
  --set operator.image.suffix="" \
  --set securityContext.privileged=true \
  --set gatewayAPI.enabled=false \
  --set socketLB.enabled=false \
  --set bpf.hostLegacyRouting=true \
  --set endpointRoutes.enabled=true

@julianwiedmann (Member, Author)

> With this patch I observe the "wrong" behavior, asymmetry in the timers (though RxClosing is set correctly): it sets the remaining time of the TCP_IN entry to 5s, but the TCP_OUT entry to 21595s.

Yep, not trying to fix the timers 😁. That would require fixing up all the path symmetry for loopback traffic. Bigger topic.

But if the TCP_IN entry expires & gets GC-collected (while the TCP_OUT stays around), you now should see that a new connection re-creates the required TCP_IN entry.

> it turns out that the behavior depends on the flags; with the following config, v1.14.0 still shows asymmetric values

ack, endpointRoutes.enabled=true is not a good combination for loopback connections right now. It doesn't play well with the bypass logic.

@aojea (Contributor) commented Aug 24, 2023

> Yep, not trying to fix the timers 😁. That would require fixing up all the path symmetry for loopback traffic. Bigger topic.
> But if the TCP_IN entry expires & gets GC-collected (while the TCP_OUT stays around), you now should see that a new connection re-creates the required TCP_IN entry.

Understood, cc @sypakine. I'll fall back to the while true; do curl http://serviceip; done loop that eventually triggered this problem.

@aojea (Contributor) commented Aug 24, 2023

I no longer hit the problem using the while loop for the localhost case, neither with 1.14.0 nor with this patch. This is an example of the entries:

With 1.14.0:

 cilium bpf ct list global -d | grep 43350
TCP IN 10.96.7.7:80 -> 10.244.1.190:43350 expires=1787276 (remaining: 21581 sec(s)) RxPackets=8 RxBytes=898 RxFlagsSeen=0x1b LastRxReport=1765676 TxPackets=0 TxBytes=0 TxFlagsSeen=0x00 LastTxReport=0 Flags=0x0011 [ RxClosing SeenNonSyn ] RevNAT=0 SourceSecurityID=2 IfIndex=0
TCP IN 169.254.42.1:43350 -> 10.244.1.190:80 expires=1765686 (remaining: -9 sec(s)) RxPackets=12 RxBytes=954 RxFlagsSeen=0x1b LastRxReport=1765676 TxPackets=9 TxBytes=972 TxFlagsSeen=0x1b LastTxReport=1765676 Flags=0x001b [ RxClosing TxClosing LBLoopback SeenNonSyn ] RevNAT=6 SourceSecurityID=27633 IfIndex=0
TCP OUT 10.96.7.7:80 -> 10.244.1.190:43350 service expires=1787276 (remaining: 21581 sec(s)) RxPackets=0 RxBytes=12 RxFlagsSeen=0x00 LastRxReport=0 TxPackets=0 TxBytes=0 TxFlagsSeen=0x1b LastTxReport=1765676 Flags=0x0012 [ TxClosing SeenNonSyn ] RevNAT=6 SourceSecurityID=0 IfIndex=0
TCP OUT 169.254.42.1:43350 -> 10.244.1.190:80 expires=1787276 (remaining: 21581 sec(s)) RxPackets=0 RxBytes=0 RxFlagsSeen=0x00 LastRxReport=0 TxPackets=12 TxBytes=954 TxFlagsSeen=0x1b LastTxReport=1765676 Flags=0x001a [ TxClosing LBLoopback SeenNonSyn ] RevNAT=6 SourceSecurityID=27633 IfIndex=0


With the patch:
TCP IN 10.96.7.7:80 -> 10.244.1.190:43350 expires=1787966 (remaining: 21595 sec(s)) RxPackets=4 RxBytes=450 RxFlagsSeen=0x1b LastRxReport=1766366 TxPackets=0 TxBytes=0 TxFlagsSeen=0x00 LastTxReport=0 Flags=0x0011 [ RxClosing SeenNonSyn ] RevNAT=0 SourceSecurityID=2 IfIndex=0
TCP IN 169.254.42.1:43350 -> 10.244.1.190:80 expires=1766376 (remaining: 5 sec(s)) RxPackets=6 RxBytes=477 RxFlagsSeen=0x1b LastRxReport=1766366 TxPackets=4 TxBytes=450 TxFlagsSeen=0x1b LastTxReport=1766366 Flags=0x001b [ RxClosing TxClosing LBLoopback SeenNonSyn ] RevNAT=6 SourceSecurityID=27633 IfIndex=0
TCP OUT 10.96.7.7:80 -> 10.244.1.190:43350 service expires=1787966 (remaining: 21595 sec(s)) RxPackets=0 RxBytes=12 RxFlagsSeen=0x00 LastRxReport=0 TxPackets=0 TxBytes=0 TxFlagsSeen=0x1b LastTxReport=1766366 Flags=0x0012 [ TxClosing SeenNonSyn ] RevNAT=6 SourceSecurityID=0 IfIndex=0
TCP OUT 169.254.42.1:43350 -> 10.244.1.190:80 expires=1787966 (remaining: 21595 sec(s)) RxPackets=0 RxBytes=0 RxFlagsSeen=0x00 LastRxReport=0 TxPackets=6 TxBytes=477 TxFlagsSeen=0x1b LastTxReport=1766366 Flags=0x001a [ TxClosing LBLoopback SeenNonSyn ] RevNAT=6 SourceSecurityID=27633 IfIndex=0

I always hit it before #23913 (comment)

bpf/bpf_lxc.c (inline review thread, resolved)
@sypakine

> With this patch I observe the "wrong" behavior, asymmetry in the timers (though RxClosing is set correctly): it sets the remaining time of the TCP_IN entry to 5s, but the TCP_OUT entry to 21595s.
>
> Yep, not trying to fix the timers 😁. That would require fixing up all the path symmetry for loopback traffic. Bigger topic.
>
> But if the TCP_IN entry expires & gets GC-collected (while the TCP_OUT stays around), you now should see that a new connection re-creates the required TCP_IN entry.
>
> it turns out that the behavior depends on the flags; with the following config, v1.14.0 still shows asymmetric values
>
> ack, endpointRoutes.enabled=true is not a good combination for loopback connections right now. It doesn't play well with the bypass logic.

I verified this fix. It does re-create the appropriate TCP_IN entry.

Repro:

gh repo clone cilium/cilium cilium-oss && cd !$ && gh pr checkout 27602
KUBEPROXY_MODE="none" make kind && \
make kind-image && \
helm upgrade -i cilium ./install/kubernetes/cilium \
  --wait \
  --namespace kube-system \
  --set k8sServiceHost="kind-control-plane" \
  --set k8sServicePort="6443" \
  --set debug.enabled=true \
  --set pprof.enabled=true \
  --set enableIPv4Masquerade=false \
  --set enableIPv6Masquerade=false \
  --set hostFirewall.enabled=false \
  --set socketLB.hostNamespaceOnly=true \
  --set kubeProxyReplacement=strict \
  --set nodeinit.enabled=true \
  --set ipam.mode=kubernetes \
  --set ipv4.enabled=true \
  --set ipv4NativeRoutingCIDR=10.244.0.0/16 \
  --set ipv6.enabled=false \
  --set image.override="localhost:5000/cilium/cilium-dev:local" \
  --set image.pullPolicy=Never \
  --set operator.image.override="localhost:5000/cilium/operator-generic:local" \
  --set operator.image.pullPolicy=Never \
  --set operator.image.suffix="" \
  --set securityContext.privileged=true \
  --set gatewayAPI.enabled=false \
  --set socketLB.enabled=false \
  --set bpf.hostLegacyRouting=true \
  --set endpointRoutes.enabled=true

alias kk="kubectl --context kind-kind"

kk --context kind-kind apply -f - <<EOF
apiVersion: v1
kind: Pod
metadata:
  name: server
  namespace: default
  labels:
    app: server
spec:
  containers:
  - name: agnhost
    image: k8s.gcr.io/e2e-test-images/agnhost:2.39
    args:
    - netexec
    - --http-port=80
    ports:
    - containerPort: 80
---
apiVersion: v1
kind: Service
metadata:
  name: server-svc
spec:
  type: ClusterIP
  selector:
    app: server
  ports:
    - protocol: TCP
      port: 80
      targetPort: 80
EOF

time while ! [[ "$( kk get svc server-svc -ojsonpath='{.spec.clusterIP}' | xargs )" != "" ]]; do echo -n "."; sleep 1; done
clusterip=$( kk get svc server-svc -ojsonpath='{.spec.clusterIP}' )

time while ! [[ "$( kk get pod server -ojsonpath='{.status.conditions[?(@.type=="Ready")].status}' )" =~ "True" ]]; do echo -n "."; sleep 1; done

host=$( kk get pod server \
    -ojsonpath='{ .spec.nodeName }' )
agent=$( kk get pod -n kube-system \
    -l k8s-app=cilium \
    --field-selector=spec.nodeName=${host} \
    -ojsonpath='{.items[].metadata.name}' )

watch "kubectl --context kind-kind exec -it -n kube-system ${agent} -c cilium-agent -- cilium bpf ct list global -d | grep ':55555'"

# new terminal
kk exec server -- curl --no-keepalive --connect-timeout 5 --local-port 55555 -v ${clusterip}:80/hostname

# wait for TCP_IN entry to timeout
kk exec server -- curl --no-keepalive --connect-timeout 5 --local-port 55555 -v ${clusterip}:80/hostname

@julianwiedmann (Member, Author)

(pushed a slightly updated version with more comments, and tests for the expected CT entries)

@julianwiedmann (Member, Author)

/test

@aojea (Contributor) commented Aug 25, 2023

OK, now I have a consistent way to reproduce it and verify that this patch works. I don't know how it can be easily automated, since it depends on lowering the value of --conntrack-gc-interval so that it reaps the dead entry.

The steps to repro are exactly the same as #27602 (comment); in this case I just patch the cilium-config ConfigMap to set a value of 60 seconds and restart the agents to pick up the new config:

 kubectl -n kube-system get cm cilium-config -o yaml | grep conntrack-gc-interval
  conntrack-gc-interval: 1m0s

So, once you create the container you make the request against the service and it hairpins, creating the conntrack entries, which I think by default have a 15-second timeout (I use the same trick as @sypakine so we have a well-known port to grep: curl --no-keepalive --connect-timeout 5 --local-port 55555 -v ${clusterip}:80/hostname).

bpf ct list global -d | grep 169.254.42.1:55555
TCP IN 169.254.42.1:55555 -> 10.244.1.70:80 expires=1851690 (remaining: 8 sec(s)) RxPackets=6 RxBytes=477 RxFlagsSeen=0x1b LastRxReport=1851680 TxPackets=4 TxBytes=452 TxFlagsSeen=0x1b LastTxReport=1851680 Flags=0x001b [ RxClosing TxClosing LBLoopback SeenNonSyn ] RevNAT=6 SourceSecurityID=27633 IfIndex=0
TCP OUT 169.254.42.1:55555 -> 10.244.1.70:80 expires=1873280 (remaining: 21598 sec(s)) RxPackets=0 RxBytes=0 RxFlagsSeen=0x00 LastRxReport=0 TxPackets=6 TxBytes=477 TxFlagsSeen=0x1b LastTxReport=1851680 Flags=0x001a [ TxClosing LBLoopback SeenNonSyn ] RevNAT=6 SourceSecurityID=27633 IfIndex=0

Now, after 60 seconds, the GC will delete that entry:

# cilium bpf ct list global -d | grep 169.254.42.1:55555
TCP OUT 169.254.42.1:55555 -> 10.244.1.70:80 expires=1873280 (remaining: 21535 sec(s)) RxPackets=0 RxBytes=0 RxFlagsSeen=0x00 LastRxReport=0 TxPackets=6 TxBytes=477 TxFlagsSeen=0x1b LastTxReport=1851680 Flags=0x001a [ TxClosing LBLoopback SeenNonSyn ] RevNAT=6 SourceSecurityID=27633 IfIndex=0

Without this patch the connection times out:

bash-5.0# curl --no-keepalive --connect-timeout 5 --local-port 55555 10.96.7.7
NOW: 2023-08-25 13:28:16.025963709 +0000 UTC m=+79359.922380945
bash-5.0# curl --no-keepalive --connect-timeout 5 --local-port 55555 10.96.7.7
curl: (28) Connection timeout after 5001 ms

With the patch a new entry is created:

cilium bpf ct list global -d | grep 169.254.42.1:55555
TCP IN 169.254.42.1:55555 -> 10.244.1.70:80 expires=1851771 (remaining: 8 sec(s)) RxPackets=6 RxBytes=477 RxFlagsSeen=0x1b LastRxReport=1851761 TxPackets=4 TxBytes=452 TxFlagsSeen=0x1b LastTxReport=1851761 Flags=0x001b [ RxClosing TxClosing LBLoopback SeenNonSyn ] RevNAT=6 SourceSecurityID=27633 IfIndex=0
TCP OUT 169.254.42.1:55555 -> 10.244.1.70:80 expires=1873361 (remaining: 21598 sec(s)) RxPackets=0 RxBytes=0 RxFlagsSeen=0x00 LastRxReport=0 TxPackets=12 TxBytes=954 TxFlagsSeen=0x1b LastTxReport=1851761 Flags=0x001a [ TxClosing LBLoopback SeenNonSyn ] RevNAT=6 SourceSecurityID=27633 IfIndex=0

I wish I knew how to create a test for this. Ideally we would have some way of forcing garbage collection of the conntrack entries; the alternative is to go to the BPF map and forcefully delete the entry.

/lgtm

@aditighag (Member) left a comment

Thanks for the detailed description, and for making the effort to untangle the loopback datapath. I made an initial pass and am still mulling over the fix. Curious to know your thoughts about my alternate approach.

In a nutshell, this PR updates the egress logic that previously also created an ingress entry for the loopback case, and instead lets the ingress path create a CT entry if the corresponding egress entry exists. The main issue here is that the datapath can't update the ingress entry's GC timeout when it updates the egress entry, so there is inherently an asymmetry here.
From that perspective the change makes sense to me, but it's not ideal that the CT logic is so tightly coupled with policy enforcement.

> If the same loopback connection is subsequently re-opened, the client's from-container path finds the CT_EGRESS entry and thus will not call ct_create4().

Just for the sake of discussion, can it check for an ingress entry, and create one if it doesn't exist? On a related note, we have logic to select a new backend after some pre-configured time has passed (see ct_entry_expired_rebalance). The same logic should apply to loopback connections as well. Wdyt?

bpf/bpf_lxc.c (inline review thread, resolved)
@julianwiedmann added the needs-backport/1.14 label on Aug 28, 2023
@julianwiedmann added the backport/author label on Sep 4, 2023
@julianwiedmann added the backport-pending/1.14 label and removed the needs-backport/1.14 label on Sep 4, 2023
@maintainer-s-little-helper bot moved this from Needs backport from main to Backport pending to v1.14 in 1.14.2 on Sep 4, 2023
@maintainer-s-little-helper bot added this to Backport pending to v1.12 in 1.12.14 on Sep 6, 2023
@qmonnet added the needs-backport/1.13 label on Sep 6, 2023
@maintainer-s-little-helper bot added this to Needs backport from main in 1.13.7 on Sep 6, 2023
@qmonnet mentioned this pull request on Sep 6, 2023
@julianwiedmann added the backport-done/1.14 label and removed the backport-pending/1.14 label on Sep 7, 2023
@maintainer-s-little-helper bot moved this from Backport pending to v1.14 to Backport done to v1.14 in 1.14.2 on Sep 7, 2023
@qmonnet added the backport-pending/1.13 label and removed the needs-backport/1.13 label on Sep 7, 2023
@maintainer-s-little-helper bot moved this from Needs backport from main to Backport pending to v1.13 in 1.13.7 on Sep 7, 2023
@julianwiedmann added the backport-done/1.13 label and removed the backport-pending/1.13 label on Sep 8, 2023
@maintainer-s-little-helper bot moved this from Backport pending to v1.13 to Backport done to v1.13 in 1.13.7 on Sep 8, 2023
@qmonnet added the backport-done/1.12 label and removed the backport-pending/1.12 label on Sep 8, 2023
@maintainer-s-little-helper bot moved this from Backport pending to v1.12 to Backport done to v1.12 in 1.12.14 on Sep 8, 2023
Labels
area/loadbalancing: Impacts load-balancing and Kubernetes service implementations
backport/author: The backport will be carried out by the author of the PR.
backport-done/1.12: The backport for Cilium 1.12.x for this PR is done.
backport-done/1.13: The backport for Cilium 1.13.x for this PR is done.
backport-done/1.14: The backport for Cilium 1.14.x for this PR is done.
feature/conntrack
kind/bug: This is a bug in the Cilium logic.
ready-to-merge: This PR has passed all tests and received consensus from code owners to merge.
release-note/bug: This PR fixes an issue in a previous release of Cilium.
sig/datapath: Impacts bpf/ or low-level forwarding details, including map management and monitor messages.
sig/policy: Impacts whether traffic is allowed or denied based on user-defined policies.
Projects
No open projects
1.12.14
Backport done to v1.12
1.13.7
Backport done to v1.13
1.14.2
Backport done to v1.14

5 participants