config: Fix incorrect packet path with IPsec and endpoint routes #17000
Merged
When endpoint routes are enabled, we attach a BPF program on the way to the container and add a Linux route to the lxc interface. So when coming from `bpf_network` with IPsec, we should use that route to go directly to the lxc device and its attached BPF program.

In contrast, when endpoint routes are disabled, we run the BPF program for ingress pod policies from cilium_host, via a tail call in `bpf_host`. Therefore, in that case, we need to jump from `bpf_network` to cilium_host first, to follow the correct path to the lxc interface.

That's what commit 287f49c ("cilium: encryption, fix redirect when endpoint routes enabled") attempted to implement for the case where endpoint routes are enabled. Its goal was to go directly from `bpf_network` to the stack in that case, so that the per-endpoint Linux routes to the lxc device are used. That commit, however, is effectively a no-op: `ENABLE_ENDPOINT_ROUTES` is defined as a per-endpoint setting, but then used in `bpf_network`, which is not tied to any endpoint. In practice, the macro is defined in the `ep_config.h` header files used by `bpf_lxc`, whereas `bpf_network` (where the macro is used) relies on the `node_config.h` header file.

The fix is therefore simple: we need to define `ENABLE_ENDPOINT_ROUTES` as a global config, written in `node_config.h`.
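To check on a node which generated header actually carries the macro, a minimal shell sketch could search the agent's state directory. It assumes Cilium's default state directory `/var/run/cilium/state`, with the node-wide header at `globals/node_config.h` and the per-endpoint `ep_config.h` headers in per-endpoint subdirectories; those paths and the helper names `find_endpoint_routes_define` and `node_config_has_endpoint_routes` are illustrative assumptions, not part of this PR.

```shell
#!/usr/bin/env bash
# Sketch only: the default state directory below is an ASSUMPTION based on
# Cilium's usual layout; pass another directory as the first argument.

# List every header under DIR that contains "#define ENABLE_ENDPOINT_ROUTES".
find_endpoint_routes_define() {
  local dir="${1:-/var/run/cilium/state}"
  # -r: recurse, -l: print matching file names only, --include: headers only.
  grep -rlE --include='*.h' '^#define ENABLE_ENDPOINT_ROUTES([[:space:]]|$)' "$dir" 2>/dev/null | sort
}

# Succeed only if the node-wide header (the one bpf_network is built against)
# defines the macro; fail otherwise.
node_config_has_endpoint_routes() {
  local dir="${1:-/var/run/cilium/state}"
  grep -qE '^#define ENABLE_ENDPOINT_ROUTES([[:space:]]|$)' "$dir/globals/node_config.h" 2>/dev/null
}
```

Run inside the Cilium agent pod. Before this fix, we would expect only per-endpoint `ep_config.h` files to be listed and `node_config_has_endpoint_routes` to fail, even with endpoint routes enabled; after it, `node_config.h` should appear in the list as well.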
To reproduce the bug and validate the fix, I deployed Cilium on GKE (where endpoint routes are enabled by default) with:

```shell
helm install cilium ./cilium --namespace kube-system \
  --set nodeinit.enabled=true \
  --set nodeinit.reconfigureKubelet=true \
  --set nodeinit.removeCbrBridge=true \
  --set cni.binPath=/home/kubernetes/bin \
  --set gke.enabled=true \
  --set ipam.mode=kubernetes \
  --set nativeRoutingCIDR=$NATIVE_CIDR \
  --set nodeinit.restartPods=true \
  --set image.repository=docker.io/pchaigno/cilium-dev \
  --set image.tag=fix-ipsec-ep-routes \
  --set operator.image.repository=quay.io/cilium/operator \
  --set operator.image.suffix="-ci" \
  --set encryption.enabled=true \
  --set encryption.type=ipsec
```

I then deployed the manifest below and attempted a `curl` request from pod `client` to the `echo-a` service.

```yaml
metadata:
  name: echo-a
  labels:
    name: echo-a
spec:
  template:
    metadata:
      labels:
        name: echo-a
    spec:
      containers:
      - name: echo-a-container
        env:
        - name: PORT
          value: "8080"
        ports:
        - containerPort: 8080
        image: quay.io/cilium/json-mock:v1.3.0
        imagePullPolicy: IfNotPresent
        readinessProbe:
          timeoutSeconds: 7
          exec:
            command:
            - curl
            - -sS
            - --fail
            - --connect-timeout
            - "5"
            - -o
            - /dev/null
            - localhost:8080
  selector:
    matchLabels:
      name: echo-a
  replicas: 1
apiVersion: apps/v1
kind: Deployment
---
metadata:
  name: echo-a
  labels:
    name: echo-a
spec:
  ports:
  - name: http
    port: 8080
  type: ClusterIP
  selector:
    name: echo-a
apiVersion: v1
kind: Service
---
apiVersion: "cilium.io/v2"
kind: CiliumNetworkPolicy
metadata:
  name: "l3-rule"
spec:
  endpointSelector:
    matchLabels:
      name: client
  ingress:
  - fromEndpoints:
    - matchLabels:
        name: echo-a
---
apiVersion: v1
kind: Pod
metadata:
  name: client
  labels:
    name: client
spec:
  affinity:
    podAntiAffinity:
      requiredDuringSchedulingIgnoredDuringExecution:
      - labelSelector:
          matchExpressions:
          - key: app.kubernetes.io/name
            operator: In
            values:
            - echo-a
        topologyKey: kubernetes.io/hostname
  containers:
  - name: netperf
    args:
    - sleep
    - infinity
    image: cilium/netperf
```

Fixes: 287f49c ("cilium: encryption, fix redirect when endpoint routes enabled")
Signed-off-by: Paul Chaignon <paul@cilium.io>
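The reproduction request can be scripted as a small shell function. This is a sketch assuming the manifest above was applied in the current `kubectl` namespace, that cluster DNS resolves the `echo-a` Service name, and that the `client` image provides `curl`; the function name `repro_curl` and the `DRY_RUN` variable are hypothetical helpers. The `curl` options mirror the readiness probe in the manifest.

```shell
#!/usr/bin/env bash
# Sketch: curl the echo-a Service (port 8080) from the client Pod.
# ASSUMPTIONS: current kubectl namespace holds the manifest's objects,
# cluster DNS resolves "echo-a", and the client image ships curl.
repro_curl() {
  # Build the command as an array so it can be printed or executed verbatim.
  local cmd=(kubectl exec client -- curl -sS --fail --connect-timeout 5 -o /dev/null http://echo-a:8080)
  if [ "${DRY_RUN:-0}" = 1 ]; then
    # Dry run: only print the command that would be executed.
    printf '%s\n' "${cmd[*]}"
  else
    "${cmd[@]}"
  fi
}
```

With `DRY_RUN=1`, the function only prints the command. Otherwise, a non-zero exit status means the request failed or timed out, which is what we would expect with the buggy image and not with the fixed one.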
pchaigno added the kind/bug, release-note/bug, area/encryption, and needs-backport/1.10 labels on Jul 26, 2021
test-me-please
Previous run failed due to Docker Hub rate limiting: https://github.com/cilium/cilium/actions/runs/1069119537.
ci-l4lb
nathanjsweet approved these changes on Jul 28, 2021
christarazi approved these changes on Jul 28, 2021
pchaigno added the ready-to-merge label on Jul 28, 2021
vadorovsky approved these changes on Jul 29, 2021
maintainer-s-little-helper bot removed the ready-to-merge label on Jul 29, 2021
christarazi added a commit to cilium/cilium-cli that referenced this pull request on Aug 4, 2021
In the Cilium datapath, the identity "world" is a special case. If traffic cannot be identified, then the datapath falls back to assigning it as "world". Having only "allow-all" in the connectivity test will mask failures in which we have datapath bugs that incorrectly assign traffic as "world", but the traffic is still allowed. One such case is cilium/cilium#17000. This commit splits the "allow-all" test into two: "allow-all" and "allow-all-except-world", where world traffic is *not* allowed. This should cover the datapath special case.
Signed-off-by: Chris Tarazi <chris@isovalent.com>
Ah, OK, yes you're right. I was confused between this and
@jrajahalme Yes.
christarazi added a commit to cilium/cilium-cli that referenced this pull request on Aug 10, 2021
In the Cilium datapath, the identity "world" is a special case. If traffic cannot be identified, then the datapath falls back to assigning it as "world". Having only "allow-all" in the connectivity test will mask failures in which we have datapath bugs that incorrectly assign traffic as "world", but the traffic is still allowed. One such case is cilium/cilium#17000. This commit replaces the "allow-all" test with "allow-all-except-world" (and unmanaged), thereby covering the datapath special case. We don't want to allow unmanaged traffic either, because that could also mask underlying datapath bugs.
Signed-off-by: Chris Tarazi <chris@isovalent.com>
christarazi added a commit to cilium/cilium-cli that referenced this pull request on Aug 10, 2021
In the Cilium datapath, the identity "world" is a special case. If traffic cannot be identified, then the datapath falls back to assigning it as "world". Having only "allow-all" in the connectivity test will mask failures in which we have datapath bugs that incorrectly assign traffic as "world", but the traffic is still allowed. One such case is cilium/cilium#17000. This commit replaces the "allow-all" test with "allow-all-except-world" (and unmanaged), thereby covering the datapath special case. We don't want to allow unmanaged traffic either, because that could also mask underlying datapath bugs, such as a delay in the propagation of identities.
Signed-off-by: Paul Chaignon <paul@cilium.io>
Signed-off-by: Chris Tarazi <chris@isovalent.com>
maintainer-s-little-helper bot moved this from Needs backport from master to Backport pending to v1.10 in 1.10.4 on Aug 10, 2021
maintainer-s-little-helper bot moved this from Needs backport from master to Backport pending to v1.9 in 1.9.10 on Aug 10, 2021
I'm removing this PR from backports until we've figured out #17022.
christarazi added a commit to cilium/cilium-cli that referenced this pull request on Jan 26, 2022
christarazi added a commit to cilium/cilium-cli that referenced this pull request on Jan 26, 2022
christarazi added a commit to cilium/cilium-cli that referenced this pull request on Feb 2, 2022
christarazi added a commit to cilium/cilium-cli that referenced this pull request on Feb 9, 2022
jibi pushed a commit to cilium/cilium-cli that referenced this pull request on Feb 15, 2022
aditighag pushed a commit to aditighag/cilium-cli that referenced this pull request on Apr 21, 2023
michi-covalent pushed a commit to michi-covalent/cilium that referenced this pull request on May 30, 2023
Labels: area/encryption, kind/bug, release-note/bug
Fixes: #15048
/cc @jrfastab