
v1.9 backports 2021-11-26 #18025

Merged: qmonnet merged 4 commits into cilium:v1.9 from pr/v1.9-backport-2021-11-26 on Nov 29, 2021

Conversation

@qmonnet (Member) commented Nov 26, 2021

Once this PR is merged, you can update the PR labels via:

$ for pr in 17865 17916; do contrib/backporting/set-labels.py $pr done 1.9; done

pchaigno and others added 4 commits November 26, 2021 15:48
[ upstream commit 573159d ]

Note this commit was applied before as 7ef59aa and reverted due to a
broken test in commit 346713b.

When endpoint routes are enabled, we attach a BPF program on the way to
the container and add a Linux route to the lxc interface. So when coming
from bpf_network with IPsec, we should use that route to go directly to
the lxc device and its attached BPF program.

In contrast, when endpoint routes are disabled, we run the BPF program for
ingress pod policies from cilium_host, via a tail call in bpf_host.
Therefore, in that case, we need to jump from bpf_network to cilium_host
first, to follow the correct path to the lxc interface.

That's what commit 287f49c ("cilium: encryption, fix redirect when
endpoint routes enabled") attempted to implement for when endpoint
routes are enabled. Its goal was to go directly from bpf_network to the
stack in that case, to use the per-endpoint Linux routes to the lxc
device. That commit, however, implements a no-op change:
ENABLE_ENDPOINT_ROUTES is defined as a per-endpoint setting, but then
used in bpf_network, which is not tied to any endpoint. In practice,
that means the macro is defined in the ep_config.h header files used by
bpf_lxc, whereas bpf_network (where the macro is used) relies on
the node_config.h header file.

The fix is therefore simple: we need to define ENABLE_ENDPOINT_ROUTES as
a global config, written in node_config.h.
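
For illustration, here is a minimal Go sketch of the idea behind the fix; the
function and writer names are hypothetical, not Cilium's actual datapath
config API. The point is that the define has to land in the node-scoped
header included by bpf_network, not only in the per-endpoint one:

    // Hypothetical sketch: emit node-wide defines into node_config.h so
    // that programs not tied to an endpoint (like bpf_network) see them.
    package main

    import (
    	"fmt"
    	"io"
    	"os"
    )

    func writeNodeConfig(w io.Writer, endpointRoutesEnabled bool) {
    	if endpointRoutesEnabled {
    		// Before the fix, this define only appeared in the per-endpoint
    		// ep_config.h consumed by bpf_lxc, so bpf_network never saw it.
    		fmt.Fprintln(w, "#define ENABLE_ENDPOINT_ROUTES 1")
    	}
    }

    func main() {
    	writeNodeConfig(os.Stdout, true)
    }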

To reproduce the bug and validate the fix, I deployed Cilium on GKE (where
endpoint routes are enabled by default) with:

    helm install cilium ./cilium --namespace kube-system \
        --set nodeinit.enabled=true \
        --set nodeinit.reconfigureKubelet=true \
        --set nodeinit.removeCbrBridge=true \
        --set cni.binPath=/home/kubernetes/bin \
        --set gke.enabled=true \
        --set ipam.mode=kubernetes \
        --set nativeRoutingCIDR=$NATIVE_CIDR \
        --set nodeinit.restartPods=true \
        --set image.repository=docker.io/pchaigno/cilium-dev \
        --set image.tag=fix-ipsec-ep-routes \
        --set operator.image.repository=quay.io/cilium/operator \
        --set operator.image.suffix="-ci" \
        --set encryption.enabled=true \
        --set encryption.type=ipsec

I then deployed the below manifest and attempted a curl request from pod
client to the service echo-a.

    apiVersion: apps/v1
    kind: Deployment
    metadata:
      name: echo-a
      labels:
        name: echo-a
    spec:
      replicas: 1
      selector:
        matchLabels:
          name: echo-a
      template:
        metadata:
          labels:
            name: echo-a
        spec:
          containers:
          - name: echo-a-container
            image: quay.io/cilium/json-mock:v1.3.0
            imagePullPolicy: IfNotPresent
            env:
            - name: PORT
              value: "8080"
            ports:
            - containerPort: 8080
            readinessProbe:
              timeoutSeconds: 7
              exec:
                command:
                - curl
                - -sS
                - --fail
                - --connect-timeout
                - "5"
                - -o
                - /dev/null
                - localhost:8080
    ---
    apiVersion: v1
    kind: Service
    metadata:
      name: echo-a
      labels:
        name: echo-a
    spec:
      type: ClusterIP
      ports:
      - name: http
        port: 8080
      selector:
        name: echo-a
    ---
    apiVersion: "cilium.io/v2"
    kind: CiliumNetworkPolicy
    metadata:
      name: "l3-rule"
    spec:
      endpointSelector:
        matchLabels:
          name: client
      ingress:
      - fromEndpoints:
        - matchLabels:
            name: echo-a
    ---
    apiVersion: v1
    kind: Pod
    metadata:
      name: client
      labels:
        name: client
    spec:
      affinity:
        podAntiAffinity:
          requiredDuringSchedulingIgnoredDuringExecution:
          - labelSelector:
              matchExpressions:
              - key: app.kubernetes.io/name
                operator: In
                values:
                - echo-a
            topologyKey: kubernetes.io/hostname
      containers:
      - name: netperf
        args:
        - sleep
        - infinity
        image: cilium/netperf

Fixes: 287f49c ("cilium: encryption, fix redirect when endpoint routes enabled")
Signed-off-by: Paul Chaignon <paul@cilium.io>
Signed-off-by: Quentin Monnet <quentin@isovalent.com>
[ upstream commit 3ffe49e ]

With the previous patch, when IPsec and endpoint routes are enabled,
packets flow directly from bpf_network to bpf_lxc via the Linux stack,
instead of going through bpf_host. However, we noticed that, when L7
policies are applied, connectivity fails between the proxy and the
destination:

    43.808: cilium-test/client2-6dd75b74c6-9nxhx:45704 -> cilium-test/echo-other-node-697d5d69b7-lglxf:8080 from-endpoint FORWARDED (TCP Flags: SYN)
    43.808: cilium-test/client2-6dd75b74c6-9nxhx:45704 -> cilium-test/echo-other-node-697d5d69b7-lglxf:8080 UNKNOWN 5 (TCP Flags: SYN)
    43.808: cilium-test/client2-6dd75b74c6-9nxhx:45704 -> cilium-test/echo-other-node-697d5d69b7-lglxf:8080 to-proxy FORWARDED (TCP Flags: SYN)
    43.808: cilium-test/echo-other-node-697d5d69b7-lglxf:8080 -> cilium-test/client2-6dd75b74c6-9nxhx:45704 from-proxy FORWARDED (TCP Flags: SYN, ACK)
    43.808: cilium-test/echo-other-node-697d5d69b7-lglxf:8080 -> cilium-test/client2-6dd75b74c6-9nxhx:45704 to-endpoint FORWARDED (TCP Flags: SYN, ACK)
    43.808: cilium-test/client2-6dd75b74c6-9nxhx:45704 -> cilium-test/echo-other-node-697d5d69b7-lglxf:8080 from-endpoint FORWARDED (TCP Flags: ACK)
    43.808: cilium-test/client2-6dd75b74c6-9nxhx:45704 -> cilium-test/echo-other-node-697d5d69b7-lglxf:8080 to-proxy FORWARDED (TCP Flags: ACK)
    43.810: cilium-test/client2-6dd75b74c6-9nxhx:45704 -> cilium-test/echo-other-node-697d5d69b7-lglxf:8080 from-endpoint FORWARDED (TCP Flags: ACK, PSH)
    43.810: cilium-test/client2-6dd75b74c6-9nxhx:45704 -> cilium-test/echo-other-node-697d5d69b7-lglxf:8080 to-proxy FORWARDED (TCP Flags: ACK, PSH)
    43.810: cilium-test/echo-other-node-697d5d69b7-lglxf:8080 -> cilium-test/client2-6dd75b74c6-9nxhx:45704 from-proxy FORWARDED (TCP Flags: ACK)
    43.810: cilium-test/echo-other-node-697d5d69b7-lglxf:8080 -> cilium-test/client2-6dd75b74c6-9nxhx:45704 to-endpoint FORWARDED (TCP Flags: ACK)
    43.810: cilium-test/client2-6dd75b74c6-9nxhx:45704 -> cilium-test/echo-other-node-697d5d69b7-lglxf:8080 http-request FORWARDED (HTTP/1.1 GET http://10.240.0.55:8080/)
    43.812: cilium-test/client2-6dd75b74c6-9nxhx:45704 -> cilium-test/echo-other-node-697d5d69b7-lglxf:8080 from-network FORWARDED (TCP Flags: SYN)
    43.812: cilium-test/client2-6dd75b74c6-9nxhx:45704 -> cilium-test/echo-other-node-697d5d69b7-lglxf:8080 from-stack FORWARDED (TCP Flags: SYN)
    43.812: cilium-test/client2-6dd75b74c6-9nxhx:45704 -> cilium-test/echo-other-node-697d5d69b7-lglxf:8080 to-endpoint FORWARDED (TCP Flags: SYN)
    43.812: cilium-test/echo-other-node-697d5d69b7-lglxf:8080 -> cilium-test/client2-6dd75b74c6-9nxhx:45704 from-endpoint FORWARDED (TCP Flags: SYN, ACK)
    43.812: cilium-test/echo-other-node-697d5d69b7-lglxf:8080 -> cilium-test/client2-6dd75b74c6-9nxhx:45704 to-stack FORWARDED (TCP Flags: SYN, ACK)
    43.813: cilium-test/echo-other-node-697d5d69b7-lglxf:8080 -> cilium-test/client2-6dd75b74c6-9nxhx:45704 from-network FORWARDED (TCP Flags: SYN, ACK)
    44.827: cilium-test/client2-6dd75b74c6-9nxhx:45704 -> cilium-test/echo-other-node-697d5d69b7-lglxf:8080 from-network FORWARDED (TCP Flags: SYN)
    44.827: cilium-test/client2-6dd75b74c6-9nxhx:45704 -> cilium-test/echo-other-node-697d5d69b7-lglxf:8080 from-stack FORWARDED (TCP Flags: SYN)
    44.827: cilium-test/client2-6dd75b74c6-9nxhx:45704 -> cilium-test/echo-other-node-697d5d69b7-lglxf:8080 to-endpoint FORWARDED (TCP Flags: SYN)

We can see the TCP handshake between the client and proxy, followed by an
attempt to perform the TCP handshake between the proxy and server. That
second part fails, as the SYN+ACK packets sent by the server never seem to
reach the proxy; they are dropped somewhere after the from-network
observation point.

At the same time, we can see the IPsec error counter XfrmInNoPols
increasing on the client node. This indicates that SYN+ACK packets were
dropped after decryption, because the XFRM state used for decryption
doesn't match any XFRM policy. The XFRM state used for decryption is:

    src 0.0.0.0 dst 10.240.0.18
    proto esp spi 0x00000003 reqid 1 mode tunnel
    replay-window 0
    mark 0xd00/0xf00 output-mark 0xd00/0xf00
    aead rfc4106(gcm(aes)) 0x3ec3e84ac118c3baf335392e7a4ea24ee3aecb2b 128
    anti-replay context: seq 0x0, oseq 0x0, bitmap 0x00000000
    sel src 0.0.0.0/0 dst 0.0.0.0/0

And it should match the following XFRM policy:

    src 0.0.0.0/0 dst 10.240.0.0/16
    dir in priority 0
    mark 0xd00/0xf00
    tmpl src 0.0.0.0 dst 10.240.0.18
        proto esp reqid 1 mode tunnel

After the packet is decrypted, however, we hit the following iptables rule
because we're going to the proxy.

    -A CILIUM_PRE_mangle -m socket --transparent -m comment --comment "cilium: any->pod redirect proxied traffic to host proxy" -j MARK --set-xmark 0x200/0xffffffff

As a result, the packet mark is set to 0x200 and we don't match the 0xd00
packet mark of the XFRM policy anymore. The packet is therefore dropped
with XfrmInNoPols.

To avoid this, we can simply mark the XFRM policy optional when endpoint
routes are enabled, in the same way we do for tunneling.
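
As a rough sketch of what "optional" means at the netlink level (assumed
shapes using the vishvananda/netlink package, not Cilium's exact code): a
template with Optional set corresponds to "level use" in iproute2 terms, so
a packet whose 0xd00 mark was rewritten by the proxy redirect rule no longer
fails the policy check. The IPs are the example values from above.

    package ipsec

    import (
    	"net"

    	"github.com/vishvananda/netlink"
    )

    // updateInPolicy mirrors the idea of the fix: when endpoint routes are
    // enabled, mark the IN policy template optional ("level use") so that
    // proxy-redirected packets remarked to 0x200 are not dropped with
    // XfrmInNoPols.
    func updateInPolicy(endpointRoutes bool) error {
    	_, src, _ := net.ParseCIDR("0.0.0.0/0")
    	_, dst, _ := net.ParseCIDR("10.240.0.0/16")

    	pol := &netlink.XfrmPolicy{
    		Dir:  netlink.XFRM_DIR_IN,
    		Src:  src,
    		Dst:  dst,
    		Mark: &netlink.XfrmMark{Value: 0xd00, Mask: 0xf00},
    		Tmpls: []netlink.XfrmPolicyTmpl{{
    			Src:   net.ParseIP("0.0.0.0"),
    			Dst:   net.ParseIP("10.240.0.18"),
    			Proto: netlink.XFRM_PROTO_ESP,
    			Mode:  netlink.XFRM_MODE_TUNNEL,
    			Reqid: 1,
    		}},
    	}
    	if endpointRoutes {
    		pol.Tmpls[0].Optional = 1 // "level use": don't drop on template mismatch
    	}
    	return netlink.XfrmPolicyUpdate(pol)
    }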

Fixes: 287f49c ("cilium: encryption, fix redirect when endpoint routes enabled")
Signed-off-by: Paul Chaignon <paul@cilium.io>
Signed-off-by: Quentin Monnet <quentin@isovalent.com>
[ upstream commit b385f0f ]

This patch modifies a forward policy update introduced by
0b52fd7 so that the template source
matches the source, which is 0.0.0.0/0 (wildcard).

Above modification addresses an issue of intermittent packet drops, as
discussed in detail below.

During an investigation of intermittent packet drops in AKS (kernel
5.4.0-1061-azure) with IPsec enabled, XfrmInTmplMismatch errors in
/proc/net/xfrm_stat increased as packets were dropped.

Tracing revealed that the packets were dropped due to an __xfrm_policy_check
failure when the packet was redirected from eth0 (after decryption) to the LXC
device of a pod.

Further investigation attributed the drops to changes in the forwarding
policy. Specifically, the forwarding policy would change as:

 src 0.0.0.0/0 dst 10.240.0.0/16
        dir fwd priority 2975
-       tmpl src 0.0.0.0 dst 10.240.0.19
+       tmpl src 10.240.0.19 dst 10.240.0.61
                proto esp reqid 1 mode tunnel
                level use

And back:

 src 0.0.0.0/0 dst 10.240.0.0/16
        dir fwd priority 2975
-       tmpl src 10.240.0.19 dst 10.240.0.61
+       tmpl src 0.0.0.0 dst 10.240.0.19
                proto esp reqid 1 mode tunnel
                level use

The above change was caused by func (n *linuxNodeHandler) enableIPsec(newNode
*nodeTypes.Node) in pkg/datapath/linux/node.go. Modifying the code to avoid
changing the policy eliminated the packet drops.

There are two places where the XFRM policy is updated in enableIPsec():
  (1) inside UpsertIPsecEndpoint() when an IN policy is specified (as happens if newNode.IsLocal())
  (2) in enableIPsec() itself, as introduced by 0b52fd7

For example, adding log messages to IpSecReplacePolicyFwd and UpsertIPsecEndpoint produced:
 level=info msg="IpSecReplacePolicyFwd: src=0.0.0.0/0 dst=10.240.0.61/16 tmplSrc=10.240.0.19/16 tmplDst=10.240.0.61/16" subsys=ipsec
 level=info msg="UpsertIPsecEndpoint: local:10.240.0.19/16 remote:0.0.0.0/0 fowrard:10.240.0.19/16" subsys=ipsec
 level=info msg="IpSecReplacePolicyFwd: src=0.0.0.0/0 dst=10.240.0.19/16 tmplSrc=0.0.0.0/0 tmplDst=10.240.0.19/16" subsys=ipsec
 level=info msg="UpsertIPsecEndpoint: exit" subsys=ipsec

Additional testing revealed that the update resulting in a template with
tmplSrc=10.240.0.19/16 was the culprit for the packet drops. Making the
template source match the source, which is a wildcard, in update (2)
eliminated the packet drops.
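
A sketch of the corrected update (2), again with assumed shapes via the
vishvananda/netlink package rather than the exact enableIPsec() code: the fwd
policy template source stays the 0.0.0.0 wildcard, matching the policy's
src 0.0.0.0/0 selector, instead of flapping to a concrete node IP.

    package ipsec

    import (
    	"net"

    	"github.com/vishvananda/netlink"
    )

    // updateFwdPolicy keeps the template source a wildcard so the fwd policy
    // no longer alternates between tmpl src 0.0.0.0 and a node IP, which
    // intermittently triggered XfrmInTmplMismatch drops.
    func updateFwdPolicy(nodeIP net.IP, podCIDR *net.IPNet) error {
    	_, wildcard, _ := net.ParseCIDR("0.0.0.0/0")
    	pol := &netlink.XfrmPolicy{
    		Dir:      netlink.XFRM_DIR_FWD,
    		Src:      wildcard,
    		Dst:      podCIDR,
    		Priority: 2975, // example priority from the diff above
    		Tmpls: []netlink.XfrmPolicyTmpl{{
    			Src:      net.ParseIP("0.0.0.0"), // wildcard, matching the policy source
    			Dst:      nodeIP,
    			Proto:    netlink.XFRM_PROTO_ESP,
    			Mode:     netlink.XFRM_MODE_TUNNEL,
    			Reqid:    1,
    			Optional: 1, // "level use", as in the diff above
    		}},
    	}
    	return netlink.XfrmPolicyUpdate(pol)
    }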

Signed-off-by: Kornilios Kourtis <kornilios@isovalent.com>
Signed-off-by: Quentin Monnet <quentin@isovalent.com>
[ upstream commit 690c112 ]

A version of the classic closure with concurrency gotcha. Bind the value
of cmd in the loop to a new variable to address the issue.
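
For readers unfamiliar with the gotcha, a self-contained Go illustration (not
the patched code itself): every goroutine closes over the same loop variable,
so without the rebinding they can all observe its final value.

    package main

    import (
    	"fmt"
    	"sync"
    )

    func main() {
    	cmds := []string{"status", "version", "observe"}
    	var wg sync.WaitGroup
    	for _, cmd := range cmds {
    		cmd := cmd // the fix: rebind cmd to a fresh variable per iteration
    		wg.Add(1)
    		go func() {
    			defer wg.Done()
    			fmt.Println(cmd) // without the rebinding, may print "observe" three times
    		}()
    	}
    	wg.Wait()
    }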

Signed-off-by: Robin Hahling <robin.hahling@gw-computing.net>
Signed-off-by: Quentin Monnet <quentin@isovalent.com>
@qmonnet qmonnet requested a review from a team as a code owner November 26, 2021 15:50
@qmonnet qmonnet added the backport/1.9 and kind/backports labels Nov 26, 2021
@qmonnet qmonnet requested a review from kkourt November 26, 2021 15:52
@qmonnet qmonnet requested a review from rolinh November 26, 2021 15:52
@qmonnet (Member, Author) commented Nov 26, 2021

/test-backport-1.9

@rolinh (Member) left a comment

LGTM for my patch 👍

@kkourt (Contributor) left a comment

LGTM, thanks!

@qmonnet (Member, Author) commented Nov 29, 2021

GKE failure seems unrelated - cilium pods were stuck in ImagePullBackOff, making BeforeAll checks fail.
/test-gke

@qmonnet (Member, Author) commented Nov 29, 2021

GKE tests are in fact broken on v1.9 at the moment (#17797 for context and WIP fix). All other tests are passing, so I'll go ahead and merge the PR.

@qmonnet qmonnet merged commit befda9c into cilium:v1.9 Nov 29, 2021
@qmonnet qmonnet deleted the pr/v1.9-backport-2021-11-26 branch November 29, 2021 13:00