Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

v1.10 backports 2022-02-14 #18801

Merged
merged 1 commit into from Feb 16, 2022
Merged

v1.10 backports 2022-02-14 #18801

merged 1 commit into from Feb 16, 2022

Conversation

jibi
Copy link
Member

@jibi jibi commented Feb 14, 2022

Once this PR is merged, you can update the PR labels via:

$ for pr in 18777; do contrib/backporting/set-labels.py $pr done 1.10; done

or with

$ make add-label branch=v1.10 issues=18777

[ upstream commit f6a4104 ]

This patch updates the Cilium logic for handling remote node identity
updates to ensure that when Cilium's '--enable-remote-node-identity'
flag is configured, each Cilium node will consistently consider all
other nodes as having the "remote-node" identity. This fixes an issue
where users reported policy drops from remote nodes -> pods, even though
the policy appeared to allow this. The issue was limited to kvstore
configurations of Cilium, and does not affect configurations where CRDs
are used for sharing information within the cluster.

For background:

When Cilium starts up, it locally scans for IP addresses associated with
the node, and updates its own IPcache to associate those IPs with the
"host" identity. Additionally, it will also publish this information to
other nodes so that they can make policy decisions regarding traffic
coming from IPs attached to nodes in the cluster.

Before commit 7bf60a5 ("nodediscovery: Fix local host identity
propagation"), Cilium would propagate the identity "remote-node" as part
of these updates to other nodes. After that commit, it would propagate
the identity "host" as part of these updates to other nodes.

When receiving these updates, Cilium would trust the identity directly
and push IP->Identity mappings like this into the datapath, regardless
of whether the '--enable-remote-node-identity' setting was configured or
not. As such, when the above commit changed the behaviour, it triggered
a change in policy handling behaviour.

The '--enable-remote-node-identity' flag was initially introduced to
allow the security domain of remote nodes in the cluster to be
considered differently vs. the local host. This can be important as
Kubernetes defines that the host should always have access to pods on
the node, so if all nodes are considered the same as the "host", this
can represent a larger open policy surface for pods than necessary in a
zero trust environment.

Given the potential security implications of this setting, at the time
that it was introduced, we introduced mitigations both in the control
plane and in the data plane. Whenever the datapath is configured with
--enable-remote-node-identity=true, it will also distrust any reports
that peer node identities are "host", even if the ipcache itself reports
this. In this situation, the datapath does not accept that the traffic
is from the "host". Rather, it demotes the identity of the traffic to
considering it as part of the "world". The motivation behind this is
that allowing "world" is a very permissive policy, so if the user is OK
with allowing "world" traffic then it is likely that they will be OK
with accepting any traffic like this which purports to be coming from a
"host" in the cluster.

As a result of the above conditions, users running in kvstore mode who
upgraded from earlier Cilium versions to 1.9.12, 1.10.6 or 1.11.0 (and
other releases up until this patch is released as part of an official
version) could observe traffic drops for traffic from nodes in the
cluster towards pods on other nodes in the cluster. Hubble would report
that the traffic is coming "from the world" (identity=2), despite having
a source address of another node in the cluster.

We considered multiple approaches to solving this issue:
A) Revert the commit that introduced the issue (see GH-18763).
   * Evidently, by this point there are multiple other codepaths relying
     on the internal storage of the local node's identity as Host, which
     would make this more difficult.
B) Ensure that the kvstore propagation code propagates the current node's
   identity as "remote-node", as other nodes may expect.
   * In cases of versions with mixed knowledge of remote-node-identity
     (for instance during upgrade), then newer nodes could end up
     propagating the new identity, but old nodes would not understand
     how to calculate policy with this identity in consideration, so
     this could result in similar sorts of policy drops during upgrade.
C) In the case when --enable-remote-node-identity=true, ensure that when
   Cilium receives updates from peer nodes, it demotes the "host"
   identity reported by peer nodes down to "remote-node" for the
   associated IP addresses. This way, the impact of the flag is limited
   to the way that the current node configures itself only. If the
   datapath is then informed (via ipcache) that thes IPs correspond to
   "remote-node", then the policy will be correctly assessed.

This commit takes approach (C).

Fixes: 7bf60a5 ("nodediscovery: Fix local host identity propagation")
Co-authored-by: André Martins <andre@cilium.io>
Signed-off-by: Joe Stringer <joe@cilium.io>
Signed-off-by: Gilberto Bertin <gilberto@isovalent.com>
@jibi jibi requested a review from a team as a code owner February 14, 2022 16:39
@maintainer-s-little-helper maintainer-s-little-helper bot added backport/1.10 kind/backports This PR provides functionality previously merged into master. labels Feb 14, 2022
@jibi
Copy link
Member Author

jibi commented Feb 14, 2022

/test-backport-1.10

Job 'Cilium-PR-K8s-1.17-kernel-4.9' failed and has not been observed before, so may be related to your PR:

Click to show.

Test Name

K8sServicesTest Checks service across nodes with L7 policy Tests NodePort with L7 Policy

Failure Output

FAIL: Request from k8s1 to service http://10.109.74.203:10080 failed

If it is a flake, comment /mlh new-flake Cilium-PR-K8s-1.17-kernel-4.9 so I can create a new GitHub issue to track it.

@jibi
Copy link
Member Author

jibi commented Feb 15, 2022

/test-upstream-k8s

@jibi
Copy link
Member Author

jibi commented Feb 15, 2022

/test-1.17-4.9

@jibi jibi added the ready-to-merge This PR has passed all tests and received consensus from code owners to merge. label Feb 16, 2022
@jibi jibi merged commit 4359883 into v1.10 Feb 16, 2022
@jibi jibi deleted the pr/v1.10-backport-2022-02-14 branch February 16, 2022 09:52
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
kind/backports This PR provides functionality previously merged into master. ready-to-merge This PR has passed all tests and received consensus from code owners to merge.
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

2 participants