Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

v1.15 Backports 2024-01-12 #30230

Merged
merged 42 commits into from
Jan 15, 2024
Merged

v1.15 Backports 2024-01-12 #30230

merged 42 commits into from
Jan 15, 2024

Conversation

jibi
Copy link
Member

@jibi jibi commented Jan 12, 2024

Once this PR is merged, a GitHub action will update the labels of these PRs:

 29942 29971 29999 29917 29946 30033 30115 30006 30124 29986 28910 29976 29249 29518 30017 30109 29894 30110 29400 28947 29877 29778 29745 29893 30104 30166 30194 28698 30134 28974 30209 30176

renovate bot and others added 30 commits January 12, 2024 13:53
[ upstream commit 4ad5718 ]

Signed-off-by: renovate[bot] <bot@renovateapp.com>
[ upstream commit fc7b72f ]

#29349 added complexity tests for 6.1.
But we also want to select the relevant features for the build tests.

Signed-off-by: Julian Wiedmann <jwi@isovalent.com>
Signed-off-by: Gilberto Bertin <jibi@cilium.io>
[ upstream commit f8a7616 ]

We want to mimic GetBPFCPU(), which selects mcpu=v3 for 5.10+ kernels.

Signed-off-by: Julian Wiedmann <jwi@isovalent.com>
Signed-off-by: Gilberto Bertin <jibi@cilium.io>
[ upstream commit 3faa67b ]

The device reloader introduced a very similar variant of the race it was
created to avoid. Specifically, if the node addressing table was updated
after a device change, but before the device reloader re-queriyed the
table, the reloader would miss the update.

This results in node-port services potentially not being available on
newly added devices, even though runtime device detection is enabled.

Fixes: 8901eab (daemon: add a device reloader that isn't as racy)

Signed-off-by: David Bimmler <david.bimmler@isovalent.com>
Signed-off-by: Gilberto Bertin <jibi@cilium.io>
[ upstream commit 1f68e26 ]

The loopback address mask value should be '0xff000000':

https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/tree/include/linux/in.h?h=v6.7-rc6#n38

	static inline bool ipv4_is_loopback(__be32 addr)
	{
		return (addr & htonl(0xff000000)) == htonl(0x7f000000);
	}

Fixes: 50c947d ("bpf: refine wild card lookup for node port services from host")

Signed-off-by: Haiyue Wang <haiyue.wang@intel.com>
Signed-off-by: Gilberto Bertin <jibi@cilium.io>
[ upstream commit 69823aa ]

This enhances the BGP GetRoutes API to allow retrieving advertised
BGP routes for all configured peers, without the need for specifying
an exact peer address. The peer address is included in the BgpRoute
response in case that the source of the route is adj-rib-in or adj-rib-out.

Signed-off-by: Rastislav Szabo <rastislav.szabo@isovalent.com>
Signed-off-by: Gilberto Bertin <jibi@cilium.io>
[ upstream commit 0c051b5 ]

This enhances the `cilium-dbg bgp routes` command to allow listing advertised
routes without the need to specify an exact peer IP - in that case, advertised
routes of all peers will be listed.

Also adds defaulting of table type and AFI / SAFI, to allow simple use with even less typing.

Signed-off-by: Rastislav Szabo <rastislav.szabo@isovalent.com>
Signed-off-by: Gilberto Bertin <jibi@cilium.io>
[ upstream commit 87f1106 ]

Includes cilium-dbg commands for dumping IPv4 and IPv6 advertised BGP routes
of each configured BGP peer in the bugtool.

Signed-off-by: Rastislav Szabo <rastislav.szabo@isovalent.com>
Signed-off-by: Gilberto Bertin <jibi@cilium.io>
[ upstream commit f323319 ]

Relates: #28836
Signed-off-by: Tam Mach <tam.mach@cilium.io>
Signed-off-by: Gilberto Bertin <jibi@cilium.io>
[ upstream commit d5d9648 ]

It's unused since commit 7eb7f22 ("EndpointSelector: replacing
LabelArray with k8s LabelSelector").

Signed-off-by: Tobias Klauser <tobias@cilium.io>
Signed-off-by: Gilberto Bertin <jibi@cilium.io>
[ upstream commit 684c58e ]

Use the predefined const.

Signed-off-by: Tobias Klauser <tobias@cilium.io>
Signed-off-by: Gilberto Bertin <jibi@cilium.io>
[ upstream commit 1281f1c ]

Allocate a label map of the correct known size.

name       old time/op    new time/op    delta
NewFrom-8     638ns ± 8%      71ns ± 6%   -88.87%  (p=0.000 n=10+10)

name       old alloc/op   new alloc/op   delta
NewFrom-8      624B ± 0%        0B       -100.00%  (p=0.000 n=10+10)

name       old allocs/op  new allocs/op  delta
NewFrom-8      2.00 ± 0%      0.00       -100.00%  (p=0.000 n=10+10)

Signed-off-by: Tobias Klauser <tobias@cilium.io>
Signed-off-by: Gilberto Bertin <jibi@cilium.io>
[ upstream commit 8a421e7 ]

This is to make sure that Gateway is updated whenever there is change in
GRPCRoute.

Relates: #28654
Signed-off-by: Tam Mach <tam.mach@cilium.io>
Signed-off-by: Gilberto Bertin <jibi@cilium.io>
[ upstream commit 4a1bc1b ]

Following a bug on the v1.13 branch in the way we update the list of
ipset entries in the node manager, add a test to ensure that entries are
added or removed as we expect on node updates.

This is a forward port of the test added first to branch v1.13, commit
9ec163aff6f ("pkg/node/manager: Add test to validate ipset entries"),
although the test has been reworked to use a mock ipset manager instead
of checking the real ipset (which required privileges).

Signed-off-by: Quentin Monnet <quentin@isovalent.com>
Signed-off-by: Gilberto Bertin <jibi@cilium.io>
[ upstream commit cbe3037 ]

With iptable-based masquerading, we use some iptables rules to exclude
certain flows from masquerading. These rules rely on IP sets (managed
with ipset). There are some cases where we create entries and
immediately remove them afterwards.

The issue only occurs for the extra Kubernetes IPs) not the
CiliumInternalIP) on the node, and only with both node encryption
disabled and --enable-remote-node-identity=false.

Here is a trimmed version of the previous code to follow the logics.
When we update an existing node, and that the configuration is such that
nodeAddressSkipsIPCache() returns true, then we create an ipset entry
and immediately delete it after. Follow the numeroted comments in the
code:

    func (m *manager) legacyNodeIpBehavior() bool {
            if m.conf.RemoteNodeIdentitiesEnabled() {
                    return false
            }
            if m.conf.NodeEncryptionEnabled() {
                    return false
            }
            return true
    }

    func (m *manager) nodeAddressSkipsIPCache(address nodeTypes.Address) bool {
            return m.legacyNodeIpBehavior() && address.Type != addressing.NodeCiliumInternalIP
    }

    func (m *manager) NodeUpdated(n nodeTypes.Node) {
            for _, address := range n.IPAddresses {
                    // 1) We add the IP to the ipset
                    if m.conf.NodeIpsetNeeded() && address.Type == addressing.NodeInternalIP {
                            m.ipsetMgr.AddToNodeIpset(address.IP)
                    }

                    // 2) Say we skip the ipcache update here based on config...
                    if m.nodeAddressSkipsIPCache(address) {
                            continue
                    }
                    ...

                    // 3) ... Then we don't add the IP to nodeIPsAdded, even though we added it to the ipset
                    nodeIPsAdded = append(nodeIPsAdded, prefix)
            }
            ...

            entry, oldNodeExists := m.nodes[nodeIdentifier]
            if oldNodeExists {
                    ...

                    // 4) We call m.removeNodeFromIPCache(), passing nodeIPsAdded
                    m.removeNodeFromIPCache(oldNode, resource, nodeIPsAdded, healthIPsAdded, ingressIPsAdded)
                    ...
            }
            ...
    }

    func (m *manager) removeNodeFromIPCache(oldNode nodeTypes.Node, resource ipcacheTypes.ResourceID,
            nodeIPsAdded, healthIPsAdded, ingressIPsAdded []netip.Prefix) {
            ...

            for _, address := range oldNode.IPAddresses {
                    oldPrefix := ip.IPToNetPrefix(address.IP)

                    // 5) oldPrefix is not in nodeIPsAdded, we don't loop yet, we carry on to 6)
                    if slices.Contains(nodeIPsAdded, oldPrefix) {
                            continue
                    }

                    // 6) We remove the ipset entry that we just added
                    if m.conf.NodeIpsetNeeded() && address.Type == addressing.NodeInternalIP {
                            m.ipsetMgr.RemoveFromNodeIpset(address.IP)
                    }

                    if m.nodeAddressSkipsIPCache(address) {
                            continue
                    }
                    ...
            }
            ...
    }

This looks incorrect. If we just created the entry, we should keep it.
Worse, if the node exists, and we try to create the ipset entry, then
the ipset invocation will be a no-op because ipset won't add a duplicate
entry, but then removing the entry just after that will remove the
initial entry from node creation and may impact masquerading on the
node.

How do we change this? There are a comple of things that we can
consider:

  - We can move 3) towards 1), before 2). In other words: we can update
    nodeIPsAdded as soon as we insert the ipset entry, making this list
    consistent with the ipset entries we created. However, this is
    incorrect, because if we skip the ipcache entry insertion in 2),
    then the list in nodeIPsAdded is inconsistent regarding the ipcache.

  - Another option would be to move 1) towards 3), after 2). In other
    words, insert the new ipset entry only if we do not skip the ipcache
    update, thus tying the creation of the ipset and ipcache entries
    together. However, this would change the current behaviour, and is
    not the desirable outcome: we want to skip masquerading for the
    NodeInternalIP (by adding the corresponding entry to the ipset)
    regardless of whether remote node identities are set or not (thanks
    to Paul Chaignon for the clarification here).

  - Given that we want to preserve the current dissociation between the
    insertion of the ipset and ipcache entries, we need a separate list.

Therefore, this commit addresses the issue by adding a separate slice to
record the ipset entries we created during the node update, to avoid
removing any of them even if we skip the ipcache update.

Historically:

  - Commit 49cb220 ("iptables: Don't masquerade traffic to cluster
    nodes") introduced the ipset entry creation. It's fine because we
    wouldn't ever remove the ipset entries on node updates, so we could
    1) add the ipset entry, 2) skipIPCache() and avoid updating
    ipsAdded, since we didn't use ipsAdded for ipset operations.

  - Commit d5e5bf3 ("node/manager: Remove ipset config from
    previous node state") introduced the issue, because it adds the
    ipset entry removal to NodeUpdated(), regardless of whether we just
    added the entry.

  - Commit e7e4abb ("node/manager: Only remove old IPs if they
    weren't already added") attempts to fix this, but is broken because
    1) the slice lookup never succeed (fixed in f2a4312
    ("nodemanager: Move to async IPCache API")) and 2) ipsAdded may or
    may not contain the relevant entries, depending on whether we
    skipped the ipcache operations.

Found by code inspection, while investigating CI flakes on branch v1.13
due to the slice lookups on ipsAdded never succeeding and the IP not
being excluded from masquerading as we expected.

Fixes: d5e5bf3 ("node/manager: Remove ipset config from previous node state")
Signed-off-by: Quentin Monnet <quentin@isovalent.com>
Signed-off-by: Gilberto Bertin <jibi@cilium.io>
[ upstream commit bf42e1f ]

Signed-off-by: Tamilmani <tamanoha@microsoft.com>
Signed-off-by: Gilberto Bertin <jibi@cilium.io>
[ upstream commit 88d4230 ]

When the cilium-envoy ServiceMonitor was last updated, the required
lookup for .annotation was left out.

Signed-off-by: cornfeedhobo <cornfeedhobo@fuzzlabs.org>
Signed-off-by: Gilberto Bertin <jibi@cilium.io>
[ upstream commit 2f3e3ec ]

This commit adds the app.kubernetes.io/name label to cilium-envoy
ServiceMonitor, matching the convention in cilium-operator.

Signed-off-by: cornfeedhobo <cornfeedhobo@fuzzlabs.org>
Signed-off-by: Gilberto Bertin <jibi@cilium.io>
[ upstream commit 5931a80 ]

The CEP and CES stores getter was originally meant for Hubble, but it is
not used anymore, so it is possible to clean it up.

Signed-off-by: Fabio Falzoi <fabio.falzoi@isovalent.com>
Signed-off-by: Gilberto Bertin <jibi@cilium.io>
[ upstream commit 56f4f94 ]

Use a releasable Resource[CiliumSlimEndpoint] to refactor the agent CEP
watcher. Like the previous implementation, the underlying informer is
stopped as soon as the subscription to the events stream is canceled.

Signed-off-by: Fabio Falzoi <fabio.falzoi@isovalent.com>
Signed-off-by: Gilberto Bertin <jibi@cilium.io>
[ upstream commit b0164d4 ]

Use a releasable Resource[CiliumEndpointSlice] to refactor the agent CES
watcher. Like the previous implementation, the underlying informer is
stopped as soon as the subscription to the events stream is canceled.

Signed-off-by: Fabio Falzoi <fabio.falzoi@isovalent.com>
Signed-off-by: Gilberto Bertin <jibi@cilium.io>
[ upstream commit 9c336fb ]

CiliumEndpoint and CiliumEndpointSlice resources are backed by a
releasable resource. To allow stopping the underlying informer when it
is not needed anymore, the cleanup routine should release the reference
to the store after collecting the stale CEPs.

Also, this keeps the CiliumEndpoint cleanup routine fully working even
when the agent is configured to use an external key-value store. In that
scenario, the routine starts the k8s informer to list the CEPs or CESs
and, once all the stale CEPs have been collected, releases the resource
to stop the informer.

Signed-off-by: Fabio Falzoi <fabio.falzoi@isovalent.com>
Signed-off-by: Gilberto Bertin <jibi@cilium.io>
[ upstream commit ae64d83 ]

Signed-off-by: Nico Vibert <nicolas.vibert@isovalent.com>
Signed-off-by: Gilberto Bertin <jibi@cilium.io>
[ upstream commit 20975c9 ]

This re-creates similar behavior as in v1.14 - FQDN requests are
serialized on a per-name basis. This was removed as part of the larger
fqdn refactor but, it turns out, that was premature. The difference
between v1.14 and now is that this locks per-name, rather than per-IP.
It preserves the prime-sized count of bucketed locks.

This is needed because we release the lock on the NameManager before
policy updates have been fully pushed out to all endpoints. If another
endpoint requests the same name, it may have the DNS response returned
to it before policy updates are complete. However, the NameManager will
detect the noop case, take a shortcut, and release the DNS response
before policy updates are potentially complete.

For precise details, see the comment in fqdn.go.

Fixes: #30014
Fixes: 507259f
Signed-off-by: Casey Callendrello <cdc@isovalent.com>
Signed-off-by: Gilberto Bertin <jibi@cilium.io>
[ upstream commit a661456 ]

Currently, errors during writing the status in the defer function
aren't returned to the caller (controller-runtime). Hence, a
successful reconciliation (error = nil) but a failure during
writing the status would result in a non-updated status (because
it wouldn't result in an additional reconciliation).

To avoid having to introduce named return params (to let the defer
function modifying the returned error), this commit gets rid of the
defer statement at all and replaced it with explicit calls to a newly
introduced function `handleReconcileErrorWithStatus` that tries to
update the resource status and reflects any errors during updating the
status in the returned error.

Signed-off-by: Marco Hofstetter <marco.hofstetter@isovalent.com>
Signed-off-by: Gilberto Bertin <jibi@cilium.io>
[ upstream commit eed73af ]

This small fix prevents allocating a local identity for IPs with names
that are not selected by a toFQDN selector. Without this change, an
identity is allocated for every IP included in an intercepted DNS
response. For the (common) case where all DNS requests are proxied, this
could potentially lead to a waste of resources and thus a performance
regression.

Previously (v1.14 and before), we did not allocate identities for
un-selected IPs. That was inadvertently changed in #29036.

Signed-off-by: Casey Callendrello <cdc@isovalent.com>
Signed-off-by: Gilberto Bertin <jibi@cilium.io>
[ upstream commit e71407b ]

This change adds a check in the auth manager not to do authentication
for either IDs if they are reserved.
This could have caused failed handshakes to happen should any of these
entities be allowed by policy but with mutual auth enabled.
These IDs do are not able to ever complete a handshake as they are not
generated by design.

This commit also replaces all IDs used in tests to be high enough
that they do not conflict with reserve IDs.

Signed-off-by: Maartje Eyskens <maartje@eyskens.me>
Signed-off-by: Gilberto Bertin <jibi@cilium.io>
[ upstream commit b193daa ]

This sets the default toleration for SPIRE agent to be allowed on the
control plane nodes.
This allows Cilium Agent on these nodes to get attested by SPIRE for
Mutual Authentication to work.

Signed-off-by: Maartje Eyskens <maartje.eyskens@isovalent.com>
Signed-off-by: Gilberto Bertin <jibi@cilium.io>
[ upstream commit cddb591 ]

Signed-off-by: Chris Tarazi <chris@isovalent.com>
Signed-off-by: Gilberto Bertin <jibi@cilium.io>
[ upstream commit 5d0757d ]

In case the requirements for BPF masquerade are not met, fail the daemon
initialization instead of silently fall back to iptables based
masquerading.

This is needed because the iptables cell does not support runtime
configuration changes.

Fixes: a6a2b73 ("iptables: Add a cell for iptables config manager")

Signed-off-by: Fabio Falzoi <fabio.falzoi@isovalent.com>
Signed-off-by: Gilberto Bertin <jibi@cilium.io>
Copy link
Member

@mhofstetter mhofstetter left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks @jibi !

Copy link
Member

@giorio94 giorio94 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

My commit looks good, thanks!

@jibi
Copy link
Member Author

jibi commented Jan 12, 2024

/test-backport-1.15

@jibi jibi marked this pull request as ready for review January 12, 2024 14:10
@jibi jibi requested review from a team as code owners January 12, 2024 14:10
@jibi jibi requested a review from learnitall January 12, 2024 14:10
Copy link
Member

@tklauser tklauser left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

My changes look good, thanks Jibi!

Copy link
Member

@bimmlerd bimmlerd left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🚀 my PR LGTM

Copy link
Member

@meyskens meyskens left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

My changes look good, thanks!

Copy link
Contributor

@rastislavs rastislavs left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

My changes LGTM, thanks!

Copy link
Member

@pippolo84 pippolo84 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

My changes look good, thanks!

Copy link
Contributor

@haiyuewa haiyuewa left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

My changes look good, thanks!

Copy link
Member

@sayboras sayboras left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks and looks good for my commits

@dylandreimerink dylandreimerink merged commit abc942f into v1.15 Jan 15, 2024
213 of 214 checks passed
@dylandreimerink dylandreimerink deleted the pr/v1.15-backport-2024-01-12 branch January 15, 2024 08:55
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
backport/1.15 This PR represents a backport for Cilium 1.15.x of a PR that was merged to main. kind/backports This PR provides functionality previously merged into master.
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet