Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

ipcache: Fix node-to-node encryption with kube-apiserver #19318

Closed

Conversation

gandro
Copy link
Member

@gandro gandro commented Apr 4, 2022

The kube-apiserver reserved identity requires us to create a
IPCache entry based on the endpoints of the default/kubernetes K8s
service.

This kube-apiserver IPCache entry will overwrite any existing remote
host entries. The entry also blocks the creation of remote host entries
for the same IP (via source.AllowsOverwrite).

When the kube-apiserver entry is created, we check if there is an
existing remote node entry, and if so, copy over the IPSec keypair
information into the newly created kube-apiserver entry. If however the
order is reversed (we first create the kube-apiserver entry before the
remote node is discovered), then the IPSec keypair is never updated.
This in turn means that connectivity to any remote nodes which also run
the kube-apiserver is broken when node-to-node encryption is enabled.

This commit works around the issue by adding the IPKeyPair before the
labels are injected. This allows InjectLabels to look up the pair
whenever the entry is created or updated.

This work is also needed for the upcoming WireGuard node-to-node
encryption.

--

(Edit: This has been partially addressed) @joestringer @christarazi I would highly value your feedback on this, as this is rather hacky at the moment. I think an alternative (maybe cleaner?) approach would be to bring in the IPSec key pair via UpsertMetadata/TriggerLabelInjection or something, but it would require us to pass in the encrypt key and host IP to those methods.

@gandro gandro added release-note/bug This PR fixes an issue in a previous release of Cilium. area/encryption Impacts encryption support such as IPSec, WireGuard, or kTLS. labels Apr 4, 2022
@gandro gandro requested a review from a team as a code owner April 4, 2022 12:40
@gandro gandro force-pushed the pr/gandro/fix-node-to-nodeencryption branch from 37cfbbd to 180ac10 Compare April 4, 2022 13:28
@christarazi christarazi added kind/bug This is a bug in the Cilium logic. kind/regression This functionality worked fine before, but was broken in a newer release of Cilium. labels Apr 4, 2022
@gandro gandro force-pushed the pr/gandro/fix-node-to-nodeencryption branch from 180ac10 to f9179b3 Compare April 5, 2022 10:32
@gandro gandro requested a review from a team April 5, 2022 10:32
@gandro
Copy link
Member Author

gandro commented Apr 5, 2022

Updated the PR based on a suggestion by Chris. Doing some manual testing as well.

Comment on lines -449 to -454
// Upsert() will return true if the ipcache entry is owned by
// the source of the node update that triggered this node
// update (kvstore, k8s, ...) The datapath is only updated if
// that source of truth is updated.
if err != nil {
dpUpdate = false
Copy link
Member Author

@gandro gandro Apr 5, 2022

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I believe this is somewhat safe to remove, as we also have a source overwrite check here:
https://github.com/cilium/cilium/blob/f9179b3b351d55ee54a5fa9538751dcac44b17ca/pkg/node/manager/manager.go#L470-L477

But given the asynchronous nature of TriggerLabelInjection, it would become much harder to learn if the upsert succeeded.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I presume you're referring to the removal of the dpUpdate logic?

From what I understand, the idea is that we first try to inject an ipcache entry, which could plausibly fail (for instance if the node update is for the local node, where we already knew about the local node locally via NodeDiscovery), so then in that case we don't need to go and update the datapath and invoke nh.NodeUpdate(...) / nh.NodeAdd(...) below.

So I think the result is that all of the NodeManager subscribers (hubble and datapath, where datapath does the route updates, tunnel map updates, ipsec/wireguard config etc.) will now receive updates even if the ipcache suggests that nothing needs to change. Hopefully those subscribers are mostly idempotent anyway?

IMHO the previous logic where ipsAdded was extended or not based on the success of the ipcache modification seemed a bit odd, that doesn't change whether the node's IP is changing or not. 🤷

@gandro gandro force-pushed the pr/gandro/fix-node-to-nodeencryption branch from f9179b3 to a287086 Compare April 5, 2022 16:52
@gandro gandro requested a review from a team April 5, 2022 16:52
@christarazi christarazi marked this pull request as draft April 5, 2022 16:56
@gandro gandro force-pushed the pr/gandro/fix-node-to-nodeencryption branch 3 times, most recently from 7cb562a to 6fef76a Compare April 7, 2022 13:54
@gandro gandro force-pushed the pr/gandro/fix-node-to-nodeencryption branch from 6fef76a to bd49a90 Compare April 14, 2022 14:23
@@ -41,18 +41,20 @@ func TestInjectLabels(t *testing.T) {
// Upsert node labels to the kube-apiserver to validate that the CIDR ID is
// deallocated and the kube-apiserver reserved ID is associated with this
// IP now.
IPIdentityCache.UpsertMetadata("10.0.0.4", labels.LabelRemoteNode)
IPIdentityCache.UpsertMetadata("10.0.0.4", labels.LabelRemoteNode, source.CustomResource)
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

note: I replaced most sources here to source.CustomResource to mimic how the real invocation will look like. The local node (source.Local) should never get the kubeapiserver identity.

@gandro gandro force-pushed the pr/gandro/fix-node-to-nodeencryption branch from bd49a90 to 1259275 Compare April 14, 2022 15:44
@gandro
Copy link
Member Author

gandro commented Apr 14, 2022

/test

Copy link
Member

@joestringer joestringer left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I didn't look too closely at the manager_test.go changes but overall the PR LGTM.

})

m.upsertIntoIDMD(ipAddrStr, remoteHostIdentity, n.Source)
m.ipcache.UpsertAuxiliary(ipAddrStr, tunnelIP, key)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Where's the corresponding removal of Auxiliary data during the update (for the old IP, if any)?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ah, I guess we just rely on the current ipcache.Delete() calls to clean these up. That's fine for now.

Comment on lines -449 to -454
// Upsert() will return true if the ipcache entry is owned by
// the source of the node update that triggered this node
// update (kvstore, k8s, ...) The datapath is only updated if
// that source of truth is updated.
if err != nil {
dpUpdate = false
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I presume you're referring to the removal of the dpUpdate logic?

From what I understand, the idea is that we first try to inject an ipcache entry, which could plausibly fail (for instance if the node update is for the local node, where we already knew about the local node locally via NodeDiscovery), so then in that case we don't need to go and update the datapath and invoke nh.NodeUpdate(...) / nh.NodeAdd(...) below.

So I think the result is that all of the NodeManager subscribers (hubble and datapath, where datapath does the route updates, tunnel map updates, ipsec/wireguard config etc.) will now receive updates even if the ipcache suggests that nothing needs to change. Hopefully those subscribers are mostly idempotent anyway?

IMHO the previous logic where ipsAdded was extended or not based on the success of the ipcache modification seemed a bit odd, that doesn't change whether the node's IP is changing or not. 🤷

@@ -146,7 +164,7 @@ func (ipc *IPCache) InjectLabels(src source.Source) error {
// cluster), or CIDR IDs for kube-apiservers deployed outside of the
// cluster.
// Also, any new identity should be upserted, regardless.
if hasKubeAPIServerLabel || isNew {
if id.ID.IsReservedNodeIdentity() || hasKubeAPIServerLabel || isNew {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I have a vague suspicion that this is the line that's causing the failure:

\n\t   ipcache-inject-labels                       4m5s ago       7s ago       22      failed to inject labels into ipcache: failed to upsert fd04::11 into ipcache with identity 1: unable to overwrite source \"local\" with source \"kvstore\"   \n\t   k8s-heartbeat                               11s ago        never        0       no error    

Basically, this function can be called at almost any time, perhaps multiple times, to evaluate the entire metadata map and then attempt to apply the changes described in this code. However, so far the node's entries (except for the node with the kube-apiserver) are primarily managed by directly ipcache.Upsert(). When the source of the information may be set to kvstore here (because the current agent learned about its own existence via updates from the kvstore), it will go into this conditional statement (due to the IsReservedNodeIdentity() change here), then attempt to upsert an entry into the IPCache with something like {IP X has identity 1, and the source of this info is the kvstore}. However, because NodeDiscovery module would have already injected this IP/ID into the IPCache before with a higher-priority source (Local, ie directly observed from the kernel), so now when this code runs to update with the kvstore-sourced info, the upsert will fail... and then that failure gets returned back out to the controller. Once the controller fails enough times, the test will check this and fail out purely because of the failing controller.

The whole source thing is really difficult to work with right now, since it assumes that there is only one source of information for any particular piece of information, which is simply untrue. Chris and I have been working on a rework of the IPCache that we hope to post soon which will help alleviate these sorts of issues, but that doesn't really help you much in the mean time. Hmm.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think that your goal here is that when an update comes in from the NodeManager, you want the TriggerLabelInjection() to re-evaluate the Auxiliary data (ie encrypt key) and update any existing entries with that info.

You might be able to get away with this by basically keeping this line as-is, but then having another statement similar but differently below the hasKubeAPIServerLabel case inside this conditional statement, where you look up the existing entry's source and use that iff the entry corresponds to a reserved node identity. It's a bit ugly but we're about to rewrite this code anyway, so if it's just enough to solve your bug then it may be a reasonable option.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks a lot for the detailed feedback and review. My original intent was to keep the Upsert source via UpsertLabels (a832573), but as you pointed out that doesn't seem to work in all cases.

Reusing the existing entry's source I think sounds indeed like an easy workaround, I'll give that a try.

gandro added 2 commits May 3, 2022 10:43
This adds a new method to update the encryption IP and Key pair without
a full upsert. This will be used in a subsequent commit when entries are
created via InjectLabels, as it will read the ipToHostIPCache map when
upserting entries.

Signed-off-by: Sebastian Wicki <sebastian@isovalent.com>
The kube-apiserver reserved identity requires us to create a IPCache
entry based on the endpoints of the `default/kubernetes` K8s service.

This kube-apiserver  IPCache entry will overwrite any existing remote
host entries. The entry also blocks the creation of remote host entries
for the same IP (via `source.AllowsOverwrite`).

When the kube-apiserver entry is created, we check if there is an
existing remote node entry, and if so, copy over the IPSec keypair
information into the newly created kube-apiserver entry. If however the
order is reversed (we first create the kube-apiserver entry before the
remote node is discovered), then the IPSec keypair is never updated.
This in turn means that connectivity to any remote nodes which also run
the kube-apiserver is broken when node-to-node encryption is enabled.

This commit works around the issue by adding the IPKeyPair before the
labels are injected. This allows InjectLabels to look up the pair
whenever the entry is created or updated. We also update InjectLabels to
always upsert entries for reserved node identities, to ensure new
IPCache entries are created for any discovered (remote or local) nodes.
The user-space IPCache does check if an upsert actually modifies the
existing entry and will return early if not, so no unnecessary BPF map
updates should be triggered for this.

The node manager unit test is also updated to use the real IPCache
instead of a mock one to ensure that the interaction between the two
packages works as expected.

Signed-off-by: Sebastian Wicki <sebastian@isovalent.com>
@gandro gandro force-pushed the pr/gandro/fix-node-to-nodeencryption branch from 1259275 to e77ec4f Compare May 3, 2022 13:30
@gandro
Copy link
Member Author

gandro commented May 3, 2022

/test

@gandro
Copy link
Member Author

gandro commented May 4, 2022

Seems like this approach breaks host connectivity in tunnel mode. Not really sure yet why, I don't see anything obviously wrong with IPCache in the sysdumps. 🤔

christarazi added a commit that referenced this pull request Mar 27, 2023
This commit moves the node/manager package to use the new asynchronous
IPCache API. Instead of directly performing Upserts and Delete on the
various node IPs (InternalIP, ExternalIP, HealthIPs etc), we now
associate each node IP with the corresponding labels. The CEW identity
is now also determined by the node's labels, rather than its numeric
identity.

This also fixes an issue where concurrent use of the synchronous and
asynchronous API would lead to the encryption key for the kube-apiserver
node being lost (c.f. #19318).

While we are at it, change the test to use netip types instead of
net.IP.

Co-authored-by: Chris Tarazi <chris@isovalent.com>
Signed-off-by: Sebastian Wicki <sebastian@isovalent.com>
Signed-off-by: Chris Tarazi <chris@isovalent.com>
christarazi added a commit that referenced this pull request Apr 11, 2023
This commit moves the node/manager package to use the new asynchronous
IPCache API. Instead of directly performing Upserts and Delete on the
various node IPs (InternalIP, ExternalIP, HealthIPs etc), we now
associate each node IP with the corresponding labels. The CEW identity
is now also determined by the node's labels, rather than its numeric
identity.

This also fixes an issue where concurrent use of the synchronous and
asynchronous API would lead to the encryption key for the kube-apiserver
node being lost (c.f. #19318).

While we are at it, change the test to use netip types instead of
net.IP.

Co-authored-by: Chris Tarazi <chris@isovalent.com>
Signed-off-by: Sebastian Wicki <sebastian@isovalent.com>
Signed-off-by: Chris Tarazi <chris@isovalent.com>
christarazi added a commit that referenced this pull request Apr 13, 2023
This commit moves the node/manager package to use the new asynchronous
IPCache API. Instead of directly performing Upserts and Delete on the
various node IPs (InternalIP, ExternalIP, HealthIPs etc), we now
associate each node IP with the corresponding labels. The CEW identity
is now also determined by the node's labels, rather than its numeric
identity.

This also fixes an issue where concurrent use of the synchronous and
asynchronous API would lead to the encryption key for the kube-apiserver
node being lost (c.f. #19318).

While we are at it, change the test to use netip types instead of
net.IP.

Co-authored-by: Chris Tarazi <chris@isovalent.com>
Signed-off-by: Sebastian Wicki <sebastian@isovalent.com>
Signed-off-by: Chris Tarazi <chris@isovalent.com>
christarazi added a commit that referenced this pull request Apr 14, 2023
This commit moves the node/manager package to use the new asynchronous
IPCache API. Instead of directly performing Upserts and Delete on the
various node IPs (InternalIP, ExternalIP, HealthIPs etc), we now
associate each node IP with the corresponding labels. The CEW identity
is now also determined by the node's labels, rather than its numeric
identity.

This also fixes an issue where concurrent use of the synchronous and
asynchronous API would lead to the encryption key for the kube-apiserver
node being lost (c.f. #19318).

While we are at it, change the test to use netip types instead of
net.IP.

Co-authored-by: Chris Tarazi <chris@isovalent.com>
Signed-off-by: Sebastian Wicki <sebastian@isovalent.com>
Signed-off-by: Chris Tarazi <chris@isovalent.com>
christarazi added a commit that referenced this pull request Apr 19, 2023
This commit moves the node/manager package to use the new asynchronous
IPCache API. Instead of directly performing Upserts and Delete on the
various node IPs (InternalIP, ExternalIP, HealthIPs etc), we now
associate each node IP with the corresponding labels. The CEW identity
is now also determined by the node's labels, rather than its numeric
identity.

This also fixes an issue where concurrent use of the synchronous and
asynchronous API would lead to the encryption key for the kube-apiserver
node being lost (c.f. #19318).

While we are at it, change the test to use netip types instead of
net.IP.

Co-authored-by: Chris Tarazi <chris@isovalent.com>
Signed-off-by: Sebastian Wicki <sebastian@isovalent.com>
Signed-off-by: Chris Tarazi <chris@isovalent.com>
christarazi added a commit that referenced this pull request May 3, 2023
This commit moves the node/manager package to use the new asynchronous
IPCache API. Instead of directly performing Upserts and Delete on the
various node IPs (InternalIP, ExternalIP, HealthIPs etc), we now
associate each node IP with the corresponding labels. The CEW identity
is now also determined by the node's labels, rather than its numeric
identity.

This also fixes an issue where concurrent use of the synchronous and
asynchronous API would lead to the encryption key for the kube-apiserver
node being lost (c.f. #19318).

While we are at it, change the test to use netip types instead of
net.IP.

Co-authored-by: Chris Tarazi <chris@isovalent.com>
Signed-off-by: Sebastian Wicki <sebastian@isovalent.com>
Signed-off-by: Chris Tarazi <chris@isovalent.com>
christarazi added a commit that referenced this pull request May 5, 2023
This commit moves the node/manager package to use the new asynchronous
IPCache API. Instead of directly performing Upserts and Delete on the
various node IPs (InternalIP, ExternalIP, HealthIPs etc), we now
associate each node IP with the corresponding labels. The CEW identity
is now also determined by the node's labels, rather than its numeric
identity.

This also fixes an issue where concurrent use of the synchronous and
asynchronous API would lead to the encryption key for the kube-apiserver
node being lost (c.f. #19318).

While we are at it, change the test to use netip types instead of
net.IP.

Co-authored-by: Chris Tarazi <chris@isovalent.com>
Signed-off-by: Sebastian Wicki <sebastian@isovalent.com>
Signed-off-by: Chris Tarazi <chris@isovalent.com>
christarazi added a commit that referenced this pull request May 10, 2023
This commit moves the node/manager package to use the new asynchronous
IPCache API. Instead of directly performing Upserts and Delete on the
various node IPs (InternalIP, ExternalIP, HealthIPs etc), we now
associate each node IP with the corresponding labels. The CEW identity
is now also determined by the node's labels, rather than its numeric
identity.

This also fixes an issue where concurrent use of the synchronous and
asynchronous API would lead to the encryption key for the kube-apiserver
node being lost (c.f. #19318).

While we are at it, change the test to use netip types instead of
net.IP.

Co-authored-by: Chris Tarazi <chris@isovalent.com>
Signed-off-by: Sebastian Wicki <sebastian@isovalent.com>
Signed-off-by: Chris Tarazi <chris@isovalent.com>
christarazi added a commit that referenced this pull request May 11, 2023
This commit moves the node/manager package to use the new asynchronous
IPCache API. Instead of directly performing Upserts and Delete on the
various node IPs (InternalIP, ExternalIP, HealthIPs etc), we now
associate each node IP with the corresponding labels. The CEW identity
is now also determined by the node's labels, rather than its numeric
identity.

This also fixes an issue where concurrent use of the synchronous and
asynchronous API would lead to the encryption key for the kube-apiserver
node being lost (c.f. #19318).

While we are at it, change the test to use netip types instead of
net.IP.

Co-authored-by: Chris Tarazi <chris@isovalent.com>
Signed-off-by: Sebastian Wicki <sebastian@isovalent.com>
Signed-off-by: Chris Tarazi <chris@isovalent.com>
christarazi added a commit that referenced this pull request May 16, 2023
This commit moves the node/manager package to use the new asynchronous
IPCache API. Instead of directly performing Upserts and Delete on the
various node IPs (InternalIP, ExternalIP, HealthIPs etc), we now
associate each node IP with the corresponding labels. The CEW identity
is now also determined by the node's labels, rather than its numeric
identity.

This also fixes an issue where concurrent use of the synchronous and
asynchronous API would lead to the encryption key for the kube-apiserver
node being lost (c.f. #19318).

While we are at it, change the test to use netip types instead of
net.IP.

Co-authored-by: Chris Tarazi <chris@isovalent.com>
Signed-off-by: Sebastian Wicki <sebastian@isovalent.com>
Signed-off-by: Chris Tarazi <chris@isovalent.com>
christarazi added a commit that referenced this pull request May 17, 2023
This commit moves the node/manager package to use the new asynchronous
IPCache API. Instead of directly performing Upserts and Delete on the
various node IPs (InternalIP, ExternalIP, HealthIPs etc), we now
associate each node IP with the corresponding labels. The CEW identity
is now also determined by the node's labels, rather than its numeric
identity.

This also fixes an issue where concurrent use of the synchronous and
asynchronous API would lead to the encryption key for the kube-apiserver
node being lost (c.f. #19318).

While we are at it, change the test to use netip types instead of
net.IP.

Co-authored-by: Chris Tarazi <chris@isovalent.com>
Signed-off-by: Sebastian Wicki <sebastian@isovalent.com>
Signed-off-by: Chris Tarazi <chris@isovalent.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
area/encryption Impacts encryption support such as IPSec, WireGuard, or kTLS. kind/bug This is a bug in the Cilium logic. kind/regression This functionality worked fine before, but was broken in a newer release of Cilium. release-note/bug This PR fixes an issue in a previous release of Cilium.
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

3 participants