
v1.10 backports 2021-11-23 #17985

Merged: 12 commits merged into cilium:v1.10 on Nov 30, 2021

Conversation

gandro
Member

@gandro gandro commented Nov 23, 2021

Once this PR is merged, you can update the PR labels via:

$ for pr in 17550 17868 17865 17824 17916 17906 17845 17869; do contrib/backporting/set-labels.py $pr done 1.10; done

twpayne and others added 2 commits November 23, 2021 17:35
[ upstream commit ef16ed7 ]

Refs cilium#17401.

Signed-off-by: Tom Payne <tom@isovalent.com>
Signed-off-by: Sebastian Wicki <sebastian@isovalent.com>
[ upstream commit f6434eb ]

Signed-off-by: Paul Chaignon <paul@cilium.io>
Signed-off-by: Sebastian Wicki <sebastian@isovalent.com>
@gandro gandro requested a review from a team as a code owner November 23, 2021 17:23
@gandro gandro added the backport/1.10 and kind/backports (This PR provides functionality previously merged into master.) labels Nov 23, 2021
pchaigno and others added 10 commits November 23, 2021 18:35
[ upstream commit 4a10eaa ]

Move several checks from bpf_lxc into a helper function,
is_cluster_destination().

Signed-off-by: Paul Chaignon <paul@cilium.io>
Signed-off-by: Sebastian Wicki <sebastian@isovalent.com>
[ upstream commit 573159d ]

Note this commit was applied before as 7ef59aa and reverted due to a
broken test in commit 346713b.

When endpoint routes are enabled, we attach a BPF program on the way to
the container and add a Linux route to the lxc interface. So when coming
from bpf_network with IPsec, we should use that route to go directly to
the lxc device and its attached BPF program.

In contrast, when endpoint routes are disabled, we run the BPF program for
ingress pod policies from cilium_host, via a tail call in bpf_host.
Therefore, in that case, we need to jump from bpf_network to cilium_host
first, to follow the correct path to the lxc interface.

That's what commit 287f49c ("cilium: encryption, fix redirect when
endpoint routes enabled") attempted to implement for when endpoint
routes are enabled. Its goal was to go directly from bpf_network to the
stack in that case, so as to use the per-endpoint Linux routes to the lxc
device. That commit, however, implemented a no-op change:
ENABLE_ENDPOINT_ROUTES is defined as a per-endpoint setting, but then
used in bpf_network, which is not tied to any endpoint. In practice,
that means the macro is defined in the ep_config.h header files used by
bpf_lxc, whereas bpf_network (where the macro is used) relies on
the node_config.h header file.

The fix is therefore simple: we need to define ENABLE_ENDPOINT_ROUTES as
a global config, written in node_config.h.
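To illustrate the shape of the fix, here is a minimal, hypothetical Go sketch of how the agent could emit such a node-wide define into node_config.h; the package, function, and field names are illustrative and not the actual Cilium code.

    package config

    import (
        "fmt"
        "io"
    )

    // NodeConfig is a minimal stand-in for the agent's node configuration.
    type NodeConfig struct {
        EnableEndpointRoutes bool
    }

    // WriteNodeConfig emits #define lines consumed by every datapath program,
    // including bpf_network, which only includes node_config.h.
    func WriteNodeConfig(w io.Writer, cfg NodeConfig) error {
        if cfg.EnableEndpointRoutes {
            // Emitting the define here (instead of only in the per-endpoint
            // ep_config.h) makes it visible to bpf_network as well.
            if _, err := fmt.Fprintln(w, "#define ENABLE_ENDPOINT_ROUTES 1"); err != nil {
                return err
            }
        }
        return nil
    }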

To reproduce the bug and validate the fix, I deployed Cilium on GKE (where
endpoint routes are enabled by default) with:

    helm install cilium ./cilium --namespace kube-system \
        --set nodeinit.enabled=true \
        --set nodeinit.reconfigureKubelet=true \
        --set nodeinit.removeCbrBridge=true \
        --set cni.binPath=/home/kubernetes/bin \
        --set gke.enabled=true \
        --set ipam.mode=kubernetes \
        --set nativeRoutingCIDR=$NATIVE_CIDR \
        --set nodeinit.restartPods=true \
        --set image.repository=docker.io/pchaigno/cilium-dev \
        --set image.tag=fix-ipsec-ep-routes \
        --set operator.image.repository=quay.io/cilium/operator \
        --set operator.image.suffix="-ci" \
        --set encryption.enabled=true \
        --set encryption.type=ipsec

I then deployed the below manifest and attempted a curl request from pod
client to the service echo-a.

    metadata:
      name: echo-a
      labels:
        name: echo-a
    spec:
      template:
        metadata:
          labels:
            name: echo-a
        spec:
          containers:
          - name: echo-a-container
            env:
            - name: PORT
              value: "8080"
            ports:
            - containerPort: 8080
            image: quay.io/cilium/json-mock:v1.3.0
            imagePullPolicy: IfNotPresent
            readinessProbe:
              timeoutSeconds: 7
              exec:
                command:
                - curl
                - -sS
                - --fail
                - --connect-timeout
                - "5"
                - -o
                - /dev/null
                - localhost:8080
      selector:
        matchLabels:
          name: echo-a
      replicas: 1
    apiVersion: apps/v1
    kind: Deployment
    ---
    metadata:
      name: echo-a
      labels:
        name: echo-a
    spec:
      ports:
      - name: http
        port: 8080
      type: ClusterIP
      selector:
        name: echo-a
    apiVersion: v1
    kind: Service
    ---
    apiVersion: "cilium.io/v2"
    kind: CiliumNetworkPolicy
    metadata:
      name: "l3-rule"
    spec:
      endpointSelector:
        matchLabels:
          name: client
      ingress:
      - fromEndpoints:
        - matchLabels:
            name: echo-a
    ---
    apiVersion: v1
    kind: Pod
    metadata:
      name: client
      labels:
        name: client
    spec:
      affinity:
        podAntiAffinity:
          requiredDuringSchedulingIgnoredDuringExecution:
          - labelSelector:
              matchExpressions:
              - key: app.kubernetes.io/name
                operator: In
                values:
                - echo-a
            topologyKey: kubernetes.io/hostname
      containers:
      - name: netperf
        args:
        - sleep
        - infinity
        image: cilium/netperf

Fixes: 287f49c ("cilium: encryption, fix redirect when endpoint routes enabled")
Signed-off-by: Paul Chaignon <paul@cilium.io>
Signed-off-by: Sebastian Wicki <sebastian@isovalent.com>
[ upstream commit 3ffe49e ]

With the previous patch, when IPsec and endpoint routes are enabled,
packets flow directly from bpf_network to bpf_lxc via the Linux stack,
instead of going through bpf_host. However, we noticed that, when L7
policies are applied, connectivity fails between the proxy and the
destination:

    43.808: cilium-test/client2-6dd75b74c6-9nxhx:45704 -> cilium-test/echo-other-node-697d5d69b7-lglxf:8080 from-endpoint FORWARDED (TCP Flags: SYN)
    43.808: cilium-test/client2-6dd75b74c6-9nxhx:45704 -> cilium-test/echo-other-node-697d5d69b7-lglxf:8080 UNKNOWN 5 (TCP Flags: SYN)
    43.808: cilium-test/client2-6dd75b74c6-9nxhx:45704 -> cilium-test/echo-other-node-697d5d69b7-lglxf:8080 to-proxy FORWARDED (TCP Flags: SYN)
    43.808: cilium-test/echo-other-node-697d5d69b7-lglxf:8080 -> cilium-test/client2-6dd75b74c6-9nxhx:45704 from-proxy FORWARDED (TCP Flags: SYN, ACK)
    43.808: cilium-test/echo-other-node-697d5d69b7-lglxf:8080 -> cilium-test/client2-6dd75b74c6-9nxhx:45704 to-endpoint FORWARDED (TCP Flags: SYN, ACK)
    43.808: cilium-test/client2-6dd75b74c6-9nxhx:45704 -> cilium-test/echo-other-node-697d5d69b7-lglxf:8080 from-endpoint FORWARDED (TCP Flags: ACK)
    43.808: cilium-test/client2-6dd75b74c6-9nxhx:45704 -> cilium-test/echo-other-node-697d5d69b7-lglxf:8080 to-proxy FORWARDED (TCP Flags: ACK)
    43.810: cilium-test/client2-6dd75b74c6-9nxhx:45704 -> cilium-test/echo-other-node-697d5d69b7-lglxf:8080 from-endpoint FORWARDED (TCP Flags: ACK, PSH)
    43.810: cilium-test/client2-6dd75b74c6-9nxhx:45704 -> cilium-test/echo-other-node-697d5d69b7-lglxf:8080 to-proxy FORWARDED (TCP Flags: ACK, PSH)
    43.810: cilium-test/echo-other-node-697d5d69b7-lglxf:8080 -> cilium-test/client2-6dd75b74c6-9nxhx:45704 from-proxy FORWARDED (TCP Flags: ACK)
    43.810: cilium-test/echo-other-node-697d5d69b7-lglxf:8080 -> cilium-test/client2-6dd75b74c6-9nxhx:45704 to-endpoint FORWARDED (TCP Flags: ACK)
    43.810: cilium-test/client2-6dd75b74c6-9nxhx:45704 -> cilium-test/echo-other-node-697d5d69b7-lglxf:8080 http-request FORWARDED (HTTP/1.1 GET http://10.240.0.55:8080/)
    43.812: cilium-test/client2-6dd75b74c6-9nxhx:45704 -> cilium-test/echo-other-node-697d5d69b7-lglxf:8080 from-network FORWARDED (TCP Flags: SYN)
    43.812: cilium-test/client2-6dd75b74c6-9nxhx:45704 -> cilium-test/echo-other-node-697d5d69b7-lglxf:8080 from-stack FORWARDED (TCP Flags: SYN)
    43.812: cilium-test/client2-6dd75b74c6-9nxhx:45704 -> cilium-test/echo-other-node-697d5d69b7-lglxf:8080 to-endpoint FORWARDED (TCP Flags: SYN)
    43.812: cilium-test/echo-other-node-697d5d69b7-lglxf:8080 -> cilium-test/client2-6dd75b74c6-9nxhx:45704 from-endpoint FORWARDED (TCP Flags: SYN, ACK)
    43.812: cilium-test/echo-other-node-697d5d69b7-lglxf:8080 -> cilium-test/client2-6dd75b74c6-9nxhx:45704 to-stack FORWARDED (TCP Flags: SYN, ACK)
    43.813: cilium-test/echo-other-node-697d5d69b7-lglxf:8080 -> cilium-test/client2-6dd75b74c6-9nxhx:45704 from-network FORWARDED (TCP Flags: SYN, ACK)
    44.827: cilium-test/client2-6dd75b74c6-9nxhx:45704 -> cilium-test/echo-other-node-697d5d69b7-lglxf:8080 from-network FORWARDED (TCP Flags: SYN)
    44.827: cilium-test/client2-6dd75b74c6-9nxhx:45704 -> cilium-test/echo-other-node-697d5d69b7-lglxf:8080 from-stack FORWARDED (TCP Flags: SYN)
    44.827: cilium-test/client2-6dd75b74c6-9nxhx:45704 -> cilium-test/echo-other-node-697d5d69b7-lglxf:8080 to-endpoint FORWARDED (TCP Flags: SYN)

We can see the TCP handshake between the client and proxy, followed by an
attempt to perform the TCP handshake between the proxy and server. That
second part fails, as the SYN+ACK packets sent by the server never seem to
reach the proxy; they are dropped somewhere after the from-network
observation point.

At the same time, we can see the IPsec error counter XfrmInNoPols
increasing on the client node. This indicates that SYN+ACK packets were
dropped after decryption, because the XFRM state used for decryption
doesn't match any XFRM policy. The XFRM state used for decryption is:

    src 0.0.0.0 dst 10.240.0.18
    proto esp spi 0x00000003 reqid 1 mode tunnel
    replay-window 0
    mark 0xd00/0xf00 output-mark 0xd00/0xf00
    aead rfc4106(gcm(aes)) 0x3ec3e84ac118c3baf335392e7a4ea24ee3aecb2b 128
    anti-replay context: seq 0x0, oseq 0x0, bitmap 0x00000000
    sel src 0.0.0.0/0 dst 0.0.0.0/0

And it should match the following XFRM policy:

    src 0.0.0.0/0 dst 10.240.0.0/16
    dir in priority 0
    mark 0xd00/0xf00
    tmpl src 0.0.0.0 dst 10.240.0.18
        proto esp reqid 1 mode tunnel

After the packet is decrypted, however, we hit the following iptables rule
because the packet is headed to the proxy.

    -A CILIUM_PRE_mangle -m socket --transparent -m comment --comment "cilium: any->pod redirect proxied traffic to host proxy" -j MARK --set-xmark 0x200/0xffffffff

As a result, the packet mark is set to 0x200 and we don't match the 0xd00
packet mark of the XFRM policy anymore. The packet is therefore dropped
with XfrmInNoPols.

To avoid this, we can simply mark the XFRM policy optional when endpoint
routes are enabled, in the same way we do for tunneling.
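For illustration only, here is a minimal sketch of an ingress XFRM policy whose template is marked optional ("level use" in ip-xfrm terms). It assumes the github.com/vishvananda/netlink package; the addresses, mark, and reqid are the placeholder values from the dumps above, and this is not the actual Cilium code.

    package ipsec

    import (
        "net"

        "github.com/vishvananda/netlink"
    )

    // upsertOptionalInPolicy installs an IN policy whose template is optional,
    // so a decrypted packet whose mark was rewritten by the proxy-redirect
    // iptables rule is no longer dropped with XfrmInNoPols.
    func upsertOptionalInPolicy() error {
        _, wildcard, _ := net.ParseCIDR("0.0.0.0/0")
        _, podCIDR, _ := net.ParseCIDR("10.240.0.0/16")

        policy := &netlink.XfrmPolicy{
            Src:  wildcard,
            Dst:  podCIDR,
            Dir:  netlink.XFRM_DIR_IN,
            Mark: &netlink.XfrmMark{Value: 0xd00, Mask: 0xf00},
            Tmpls: []netlink.XfrmPolicyTmpl{{
                Src:   net.ParseIP("0.0.0.0"),
                Dst:   net.ParseIP("10.240.0.18"),
                Proto: netlink.XFRM_PROTO_ESP,
                Mode:  netlink.XFRM_MODE_TUNNEL,
                Reqid: 1,
                // Optional corresponds to "level use" in ip-xfrm output:
                // the template is not mandatory for the policy to match.
                Optional: 1,
            }},
        }
        return netlink.XfrmPolicyUpdate(policy)
    }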

Fixes: 287f49c ("cilium: encryption, fix redirect when endpoint routes enabled")
Signed-off-by: Paul Chaignon <paul@cilium.io>
Signed-off-by: Sebastian Wicki <sebastian@isovalent.com>
[ upstream commit b385f0f ]

This patch modifies a forward policy update introduced by
0b52fd7 so that the template source
matches the policy source, which is 0.0.0.0/0 (wildcard).

The above modification addresses an issue of intermittent packet drops, as
discussed in detail below.

During an investigation of intermittent packet drops in AKS (kernel
5.4.0-1061-azure) with IPsec enabled, XfrmInTmplMismatch errors in
/proc/net/xfrm_stat increased as packets were dropped.

Tracing revealed that the packets were dropped due to an __xfrm_policy_check
failure when the packet was redirected from eth0 (after decryption) to the LXC
device of a pod.

Further investigation attributed the drops to changes in the forward
policy. Specifically, the policy would change as follows:

 src 0.0.0.0/0 dst 10.240.0.0/16
        dir fwd priority 2975
-       tmpl src 0.0.0.0 dst 10.240.0.19
+       tmpl src 10.240.0.19 dst 10.240.0.61
                proto esp reqid 1 mode tunnel
                level use

And back:

 src 0.0.0.0/0 dst 10.240.0.0/16
        dir fwd priority 2975
-       tmpl src 10.240.0.19 dst 10.240.0.61
+       tmpl src 0.0.0.0 dst 10.240.0.19
                proto esp reqid 1 mode tunnel
                level use

The above change was caused by func (n *linuxNodeHandler) enableIPsec(newNode
*nodeTypes.Node) in pkg/datapath/linux/node.go. Modifying the code to avoid
changing the policy eliminated the packet drops.

There are two places where the XFRM policy is updated in enableIPsec():
  (1) inside UpsertIPsecEndpoint(), when an IN policy is specified (as happens if newNode.IsLocal())
  (2) in enableIPsec() itself, as introduced by 0b52fd7

For example, adding log messages to IpSecReplacePolicyFwd and UpsertIPsecEndpoint produced:
 level=info msg="IpSecReplacePolicyFwd: src=0.0.0.0/0 dst=10.240.0.61/16 tmplSrc=10.240.0.19/16 tmplDst=10.240.0.61/16" subsys=ipsec
 level=info msg="UpsertIPsecEndpoint: local:10.240.0.19/16 remote:0.0.0.0/0 fowrard:10.240.0.19/16" subsys=ipsec
 level=info msg="IpSecReplacePolicyFwd: src=0.0.0.0/0 dst=10.240.0.19/16 tmplSrc=0.0.0.0/0 tmplDst=10.240.0.19/16" subsys=ipsec
 level=info msg="UpsertIPsecEndpoint: exit" subsys=ipsec

Additional testing revealed that the update resulting in a template
with tmplSrc=10.240.0.19/16 was the culprit for the packet drops.
Making the template source match the policy source, which is a wildcard,
in update (2) eliminated the packet drops.

Signed-off-by: Kornilios Kourtis <kornilios@isovalent.com>
Signed-off-by: Sebastian Wicki <sebastian@isovalent.com>
[ upstream commit dc40b2c ]

In Cilium v1.10, we disabled kube-proxy-replacement by default but left
BPF masquerading enabled. Since the latter requires the former, the
default installation results in a warning.

This commit fixes the warning by disabling BPF masquerading as well on
new v1.10+ deployments.

Fixes: 5412142 ("install: Disable kube-proxy-replacement by default")
Signed-off-by: Paul Chaignon <paul@cilium.io>
Signed-off-by: Sebastian Wicki <sebastian@isovalent.com>
[ upstream commit 690c112 ]

A version of the classic closure-capture gotcha with concurrency. Binding
the value of cmd in the loop to a new variable addresses the issue.
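For context, a generic illustration of the gotcha and its fix (not the code touched by this commit):

    package main

    import (
        "fmt"
        "sync"
    )

    func main() {
        cmds := []string{"observe", "status", "list"}

        var wg sync.WaitGroup
        for _, cmd := range cmds {
            cmd := cmd // rebind: each goroutine captures its own copy
            wg.Add(1)
            go func() {
                defer wg.Done()
                // Without the rebinding above, all goroutines would likely
                // observe the loop variable's final value ("list").
                fmt.Println("running", cmd)
            }()
        }
        wg.Wait()
    }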

Signed-off-by: Robin Hahling <robin.hahling@gw-computing.net>
Signed-off-by: Sebastian Wicki <sebastian@isovalent.com>
[ upstream commit d60cdef ]

BPF masquerading for IPv6 isn't supported yet, so we should fail early with
a fatal error if the user asks for both BPF and IPv6 masquerading. They can
use iptables-based masquerading for IPv6 instead.

Since we enable BPF-based masquerading in all tests with 4.19+ kernels,
we also need to disable IPv6 masquerading there. That should be fine
since we rarely rely on IPv6 masquerading anyway.
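As a rough sketch of the fail-fast check described above (the flag names are modelled on Cilium's configuration but are illustrative, not the exact upstream identifiers):

    package main

    import "log"

    // Config is a minimal stand-in for the agent configuration.
    type Config struct {
        EnableBPFMasquerade  bool
        EnableIPv6Masquerade bool
    }

    // validateMasquerade fails early when both BPF and IPv6 masquerading are
    // requested, since BPF masquerading does not support IPv6 yet.
    func validateMasquerade(c Config) {
        if c.EnableBPFMasquerade && c.EnableIPv6Masquerade {
            log.Fatal("BPF-based masquerading does not support IPv6 yet; " +
                "use iptables-based masquerading for IPv6 instead")
        }
    }

    func main() {
        validateMasquerade(Config{EnableBPFMasquerade: true, EnableIPv6Masquerade: false})
    }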

Signed-off-by: Paul Chaignon <paul@cilium.io>
Signed-off-by: Sebastian Wicki <sebastian@isovalent.com>
[ upstream commit e7de7f2 ]

The current understanding is that we want to remove these rules all the
time, as they can cause issues with features such as egress gateway.

Signed-off-by: Bruno M. Custódio <brunomcustodio@gmail.com>
Signed-off-by: Sebastian Wicki <sebastian@isovalent.com>
[ upstream commit fcd19dc ]

Currently, all pod traffic matching the source IP and destination CIDR of
an egress policy is forwarded to an egress gateway.

This means we also incorrectly forward reply traffic from connections
destined to a pod to an egress gateway, breaking said connections.

This commit fixes this by adding an additional check to make sure reply
traffic (i.e. traffic from connections not originating from a pod) is
excluded from the egress gateway logic.

Fixes: cilium#17866
Signed-off-by: Gilberto Bertin <gilberto@isovalent.com>
Signed-off-by: Sebastian Wicki <sebastian@isovalent.com>
[ upstream commit c117362 ]

to make its purpose more explicit

Signed-off-by: Gilberto Bertin <gilberto@isovalent.com>
Signed-off-by: Sebastian Wicki <sebastian@isovalent.com>
@gandro gandro force-pushed the pr/v1.10-backport-2021-11-23 branch from 264e9f0 to bb61a3a Compare November 23, 2021 17:37
Member

@rolinh rolinh left a comment

LGTM for my patch.

@gandro gandro reopened this Nov 24, 2021
@gandro
Member Author

gandro commented Nov 24, 2021

/test-backport-1.10

Job 'Cilium-PR-K8s-1.19-kernel-5.4' hit: #17069 (92.24% similarity)

Job 'Cilium-PR-K8s-1.19-kernel-5.4' failed and has not been observed before, so may be related to your PR:

Test Name: [empty]

Failure Output:
If it is a flake, comment /mlh new-flake Cilium-PR-K8s-1.19-kernel-5.4 so I can create a new GitHub issue to track it.


Member

@jibi jibi left a comment

My PR looks good, thanks!

@gandro
Member Author

gandro commented Nov 25, 2021

/ci-eks-1.10

@gandro
Member Author

gandro commented Nov 25, 2021

/test-1.16-netnext

@gandro
Member Author

gandro commented Nov 25, 2021

/test-1.19-5.4

Contributor

@bmcustodio bmcustodio left a comment

My commit LGTM!

Member

@pchaigno pchaigno left a comment

👍

@gandro
Member Author

gandro commented Nov 29, 2021

14:34:48  K8sUpdates 
14:34:48    Tests upgrade and downgrade from a Cilium stable image to master
14:34:48    /home/jenkins/workspace/Cilium-PR-K8s-1.19-kernel-5.4/src/github.com/cilium/cilium/test/ginkgo-ext/scopes.go:518
14:34:48  14:34:46 STEP: Running BeforeAll block for EntireTestsuite K8sUpdates
14:34:48  14:34:46 STEP: Ensuring the namespace kube-system exists
14:34:48  14:34:46 STEP: WaitforPods(namespace="kube-system", filter="-l k8s-app=cilium-test-logs")
14:34:48  14:34:46 STEP: WaitforPods(namespace="kube-system", filter="-l k8s-app=cilium-test-logs") => <nil>
14:34:56  14:34:55 STEP: Deleting Cilium and CoreDNS...
14:35:02  14:35:02 STEP: Waiting for pods to be terminated..
14:35:02  14:35:02 STEP: Cleaning Cilium state (bb61a3a29287e1b93f1f6e89f5f06b4dbf50ed90)
14:35:02  14:35:02 STEP: Cleaning up Cilium components
14:35:05  14:35:05 STEP: Waiting for Cilium to become ready
14:35:18  14:35:17 STEP: Cleaning Cilium state (v1.9)
14:35:18  14:35:17 STEP: Cleaning up Cilium components
14:35:28  14:35:26 STEP: Waiting for Cilium to become ready
14:39:41  FAIL: Timed out after 241.838s.
14:39:41  Cilium "1.9-dev" did not become ready in time
14:39:41  Expected
14:39:41      <*errors.errorString | 0xc000196230>: {
15:17:24       
15:17:24  Ginkgo ran 1 suite in 1h17m5.878348097s
15:17:24  Test Suite Failed

@gandro
Member Author

gandro commented Nov 29, 2021

/test-1.16-netnext

@gandro
Member Author

gandro commented Nov 29, 2021

/test-1.19-5.4

@gandro
Member Author

gandro commented Nov 29, 2021

Hit exactly the same issues 🤔

https://jenkins.cilium.io/job/Cilium-PR-K8s-1.16-net-next/2074/
https://jenkins.cilium.io/job/Cilium-PR-K8s-1.19-kernel-5.4/1364/

But again, both the VM provisioning error during import and Cilium 1.9-dev not getting ready should be unrelated to this branch.

@gandro
Member Author

gandro commented Nov 29, 2021

/test-1.16-netnext

@gandro
Member Author

gandro commented Nov 29, 2021

/test-1.19-5.4

@qmonnet
Member

qmonnet commented Nov 29, 2021

17:30:32  + cd /home/jenkins/workspace/Cilium-PR-K8s-1.16-net-next/src/github.com/cilium/cilium/test
17:30:32  + ./archive_test_results.sh
17:30:32  ./archive_test_results.sh: line 3: cd: /home/jenkins/workspace/Cilium-PR-K8s-1.16-net-next/test: No such file or directory

Looks like something went wrong on the infra

@qmonnet
Member

qmonnet commented Nov 29, 2021

/test-1.16-netnext

@gandro
Member Author

gandro commented Nov 30, 2021

Yet another net-next provisioning failure: https://jenkins.cilium.io/job/Cilium-PR-K8s-1.16-net-next/2082/

@gandro
Member Author

gandro commented Nov 30, 2021

/test-1.16-netnext

@qmonnet qmonnet merged commit 8417a82 into cilium:v1.10 Nov 30, 2021
Labels
kind/backports (This PR provides functionality previously merged into master.)

8 participants