
v1.13 backports 2022-12-21 #22822

Merged
merged 56 commits into v1.13 from pr/v1.13-backport-2022-12-21 on Dec 22, 2022

Conversation

joestringer
Member

v1.13 backports 2022-12-21

Skipped due to conflicts; it seems some datapath testing PRs are not tagged for backport:

Once this PR is merged, you can update the PR labels via:

$ for pr in 22627 21797 22477 22676 22553 22673 22252 21600 22726 22333 22418 22656 22721 22679 22654 22700 22619 22518 22772 22796 22821 22600 21822; do contrib/backporting/set-labels.py $pr done 1.13; done

squeed and others added 30 commits December 21, 2022 16:34
[ upstream commit bdc23f5 ]

These tests deploy the standalone agent in a funny way: using the
existing Helm chart, but not configuring it fully. This breaks the
config-resolver, which expects to be able to reach a functioning
in-cluster apiserver.

So, just pass the correct apiserver to the config resolver.

Signed-off-by: Casey Callendrello <cdc@isovalent.com>
Signed-off-by: Joe Stringer <joe@cilium.io>
[ upstream commit 0f2f3fe ]

Add some missing helper functions to implement the v3 backend map (a
rough sketch of their shape follows the list below):
1. A new AddrCluster constructor, AddrClusterFrom
2. A new AddrCluster method to extract the ClusterID
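
For illustration only, here is a minimal, self-contained Go sketch of the shape such helpers could take (this is not Cilium's actual code; the field layout and names are assumptions):

```go
package cmtypes

import "net/netip"

// AddrCluster pairs an IP address with the ID of the cluster it belongs to,
// so that the same IP in two different clusters maps to two distinct keys.
type AddrCluster struct {
	addr      netip.Addr
	clusterID uint32
}

// AddrClusterFrom constructs an AddrCluster from an address and a cluster ID.
func AddrClusterFrom(addr netip.Addr, clusterID uint32) AddrCluster {
	return AddrCluster{addr: addr, clusterID: clusterID}
}

// ClusterID extracts the cluster ID from the AddrCluster.
func (ac AddrCluster) ClusterID() uint32 {
	return ac.clusterID
}
```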

Signed-off-by: Yutaro Hayakawa <yutaro.hayakawa@isovalent.com>
Signed-off-by: Joe Stringer <joe@cilium.io>
[ upstream commit c6e53ea ]

Extend the current backend map values (struct lb4_backend and struct
lb6_backend) to contain a new cluster_id field. With this field, we can
distinguish two backends which have the same IP address but belong to
different clusters. Since we no longer have a spare padding bit in either
struct, we need to extend the structs and bump the map version as well.
This commit also contains migration code from the v2 backend map to the
v3 backend map.
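
As a rough illustration, a hypothetical Go-side mirror of such a value is shown below (the field names, types, and padding are assumptions; the authoritative layout is the C struct lb4_backend in the BPF datapath):

```go
package lbmap

// Backend4ValueV3 sketches an IPv4 backend map value for a v3 map that
// carries a cluster ID next to the existing address and port fields.
type Backend4ValueV3 struct {
	Address   [4]byte // backend IPv4 address
	Port      uint16  // backend L4 port
	Proto     uint8   // L4 protocol
	Flags     uint8   // backend flags
	ClusterID uint8   // new in v3: which cluster the backend belongs to
	Pad       [3]byte // explicit padding, since no spare padding bit was left
}
```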

Signed-off-by: Yutaro Hayakawa <yutaro.hayakawa@isovalent.com>
Signed-off-by: Joe Stringer <joe@cilium.io>
[ upstream commit 4d7acac ]

Currently, when a cilium-agent in eni/alibabacloud mode initializes
router info, it might encounter the following fatal error:

```
level=fatal msg="Error while creating daemon" error="failed to create router info invalid ip: " subsys=daemon
```

The gateway IP of the routing info is derived from the CIDR of the subnet,
but eni.Subnet.CIDR in InstanceMap is set as empty after ENI creation.
In normal cases it will be filled by a later resyncTrigger after maintainIPPool.
But if another goroutine (a periodic resync or the pool maintainer of another
node) happens to sync the local InstanceMap cache to k8s, cilium-agent would be
informed of that ENI IP pool with an empty CIDR, and router IP allocation would
fail fatally due to the empty gateway IP.

This patch fixes this by filling the CIDR right after ENI creation.

Signed-off-by: Jaff Cheng <jaff.cheng.sh@gmail.com>
Signed-off-by: Joe Stringer <joe@cilium.io>
[ upstream commit ced77b3 ]

In the L4LB mode, the PERM L2 entries are managed by users. So, in this
mode Cilium should not mess with them.

Signed-off-by: Martynas Pumputis <m@lambda.lt>
Signed-off-by: Joe Stringer <joe@cilium.io>
[ upstream commit 857060b ]

Add a new type, CiliumClusterConfig, which represents a cluster
configuration. This will be serialized and stored in the kvstore at
clustermesh-apiserver startup time. Later on, the cilium-agent on each
node reads it when connecting to new clusters. A rough sketch of the
idea follows the list below.

The current use case for this is getting the ClusterID at connect time, but by
exposing the cluster configuration, we can also do some useful validation,
such as:

- Make sure the cluster ID does not conflict with existing clusters.
- Make sure the new cluster doesn't have any capability mismatch.
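
As a minimal sketch of what storing such a configuration could look like (the field set, key path, and helper names below are assumptions for illustration, not Cilium's actual API):

```go
package clustermesh

import "encoding/json"

// CiliumClusterConfig sketches a per-cluster configuration blob written to the
// kvstore by clustermesh-apiserver and read by cilium-agent at connect time.
type CiliumClusterConfig struct {
	ID uint32 `json:"id"`
}

// clusterConfigKey returns an assumed kvstore key for a cluster's config.
func clusterConfigKey(clusterName string) string {
	return "cilium/cluster-config/" + clusterName
}

// marshalClusterConfig serializes the config for storage in the kvstore.
func marshalClusterConfig(c CiliumClusterConfig) ([]byte, error) {
	return json.Marshal(c)
}
```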

Signed-off-by: Yutaro Hayakawa <yutaro.hayakawa@isovalent.com>
Signed-off-by: Joe Stringer <joe@cilium.io>
[ upstream commit b24973e ]

Add helper functions to set/get cluster information in the kvstore.

Signed-off-by: Yutaro Hayakawa <yutaro.hayakawa@isovalent.com>
Signed-off-by: Joe Stringer <joe@cilium.io>
[ upstream commit 5e5a26e ]

Implement basic connect-time validation using CiliumClusterConfig.
clustermesh-apiserver is modified to set the local CiliumClusterConfig at
startup, and cilium-agent is modified to get the CiliumClusterConfig
of remote clusters and validate it.

Signed-off-by: Yutaro Hayakawa <yutaro.hayakawa@isovalent.com>
Signed-off-by: Joe Stringer <joe@cilium.io>
[ upstream commit 398cf5e ]

For compatibility with an old Cilium that doesn't support the cluster
configuration feature, we should be able to connect to a remote
cluster even if the configuration is missing. Test it.

Signed-off-by: Yutaro Hayakawa <yutaro.hayakawa@isovalent.com>
Signed-off-by: Joe Stringer <joe@cilium.io>
[ upstream commit 303d2e7 ]

This is to make sure the HTTP retry was done as part of the test.

Fixes: #21710
Fixes: #21993

Signed-off-by: Tam Mach <tam.mach@cilium.io>
Signed-off-by: Joe Stringer <joe@cilium.io>
[ upstream commit 25fa14c ]

This allows us to stress / benchmark the main callback function of the
DNS proxy within Cilium. This callback is called on each DNS request and
response made by each endpoint managed by Cilium.

For this commit, we are only considering the DNS response path because
it is the path which contains the most work to be done. This includes
policy calculation, identity allocation, and ipcache upsert.

It is useful to establish baseline performance numbers so that we can
compare against changes to this code path. In the following commits, a
mutex will be added to this path to fix a race condition during parallel
DNS response handling.

Baseline:

```
$ go test -v -tags integration_tests ./daemon/cmd -check.b -check.bmem -check.f DaemonFQDNSuite.Benchmark_notifyOnDNSMsg -test.benchtime 100x
...
PASS: fqdn_test.go:180: DaemonFQDNSuite.Benchmark_notifyOnDNSMsg           20000             96078 ns/op        36567 B/op         482 allocs/op
```

Signed-off-by: Chris Tarazi <chris@isovalent.com>
Signed-off-by: Joe Stringer <joe@cilium.io>
[ upstream commit 604eb21 ]

It is possible for parallel DNS requests handled by separate goroutines
(controlled by --dnsproxy-concurrency-limit) to race if they are for the
same DNS name, i.e. both goroutines handling cilium.io.

Here is a diagram for this race:

```
             G1                                    G2

T0 --> NotifyOnDNSMsg()               NotifyOnDNSMsg()            <-- T0

T1 --> UpdateGenerateDNS()            UpdateGenerateDNS()         <-- T1

T2 ----> mutex.Lock()                 +---------------------------+
                                      |No identities need updating|
T3 ----> mutex.Unlock()               +---------------------------+

T4 --> UpsertGeneratedIdentities()    UpsertGeneratedIdentities() <-- T4

T5 ---->  Upsert()                    DNS released back to pod    <-- T5
                                                   |
T6 --> DNS released back to pod                    |
             |                                     |
             |                                     |
             v                                     v
      Traffic flows fine                   Leads to policy drop
```

Note how G2 releases the DNS msg back to the pod at T5 because
UpdateGenerateDNS() was a no-op. It's a no-op because G1 had executed
UpdateGenerateDNS() first at T1 and performed the necessary identity
allocation for the response IPs. Due to G1 performing all the work
first, G2 executes T4 also as a no-op and releases the msg back to the
pod at T5 before G1 would at T6.

The above results in a policy drop because it is assumed that when the
DNS msg is released back to the pod / endpoint, then the ipcache has
been updated with the proper IP <-> identity mapping. In other words,
because G2 releases the DNS msg back to the pod / endpoint, the app
attempts to use the IPs returned in the DNS response, but the ipcache
lookup fails in the datapath because the IP <-> identity mappings don't
exist. In essence, this is a race between DNS requests to the ipcache.

To fix, we create a critical section when handling the DNS response IPs.
Allocating identities and associating them to IPs in the ipcache is now
atomic because of the critical section. This is especially important
when, as we've already seen, multiple DNS requests are in-flight for the
same name (i.e. cilium.io).

The critical section is implemented using a fixed-size array of
mutexes. Why? Because the other options (listed below) all have flaws
which may result in drops.

Other options considered:

1) A map of mutexes keyed on DNS name
2) A map of mutexes keyed on set of DNS response IPs
3) A map of mutexes keyed on endpoint ID
4) A slice of mutexes with hash based on individual IPs from the set of
   DNS response IPs

(1) could cause a race between the set of DNS response IPs from `a.com` and
the set of DNS response IPs from `b.com`, especially if the sets are the
same.
(2) could race between a super / subset of DNS response IPs, i.e. only
an intersection of sets would have protection. Any IPs that are not in
the intersection are vulnerable to race.
(3) could cause a race between different endpoints querying the same
name, i.e. two endpoints querying cilium.io.
(4) could work, but this approach is not ideal as the slice scales
with the number of unique IPs seen by the DNS proxy, which raises memory
consumption concerns.

Therefore, we opt for a fixed-size array of mutexes where we hash the
individual IPs from the set of DNS response IPs. For example, say we
attempt to resolve cilium.io and the response IPs are A and B.

  hash(A) % mod = 8
  hash(B) % mod = 3

  where mod is --dnsproxy-lock-count, i.e. size of array

What's returned is a slice containing the above array indices that's
sorted in ascending order. This is the order in which the mutexes must
be locked and unlocked. To be specific, [3, 8] is the sorted result,
and it means we acquire & release the mutex at index 3 and then do the
same for the mutex at index 8.

What this does is essentially serialize parallel DNS requests which
involve the same IP <-> identity mappings in the ipcache.
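
The locking scheme can be sketched in a few lines of Go. The following is an illustrative, self-contained example (not Cilium's actual dnsproxy code; the FNV hash and the names are assumptions) that hashes a response's IPs onto a fixed-size mutex array and acquires the resulting indices in ascending order:

```go
package main

import (
	"fmt"
	"hash/fnv"
	"net/netip"
	"sort"
	"sync"
)

const lockCount = 128 // analogous to the --dnsproxy-lock-count default

var locks [lockCount]sync.Mutex

// lockIndices hashes each response IP onto the mutex array and returns the
// de-duplicated indices in ascending order, the order in which to lock them.
func lockIndices(ips []netip.Addr) []int {
	seen := make(map[int]struct{})
	for _, ip := range ips {
		h := fnv.New32a()
		h.Write(ip.AsSlice())
		seen[int(h.Sum32())%lockCount] = struct{}{}
	}
	idx := make([]int, 0, len(seen))
	for i := range seen {
		idx = append(idx, i)
	}
	sort.Ints(idx)
	return idx
}

func main() {
	ips := []netip.Addr{
		netip.MustParseAddr("104.198.14.52"),
		netip.MustParseAddr("2606:4700::1"),
	}
	idx := lockIndices(ips)
	for _, i := range idx { // acquire in ascending index order
		locks[i].Lock()
	}
	fmt.Println("identity allocation + ipcache upsert happen here, indices:", idx)
	for _, i := range idx { // release
		locks[i].Unlock()
	}
}
```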

It's possible that the hashes of different IPs collide (map to the same
mutex), but this is OK because it's overprotecting, rather than
underprotecting, which the other options above suffer from.
Overprotecting results in a small performance hit because we are
potentially serializing completely different / unrelated DNS requests,
i.e. no intersection in the response IP sets.

For --dnsproxy-lock-count, a default of 128 was chosen as a reasonable
balance between memory usage and hash collisions for most deployments
of Cilium. An increased lock count will result in more memory usage, but
faster DNS processing, as there are fewer IP hash collisions and
therefore fewer mutexes to acquire. A decreased lock count will save
memory at the cost of slower DNS processing, as IP hash collisions are
more likely, and this tends towards behavior similar to a single,
shared mutex. Users who need to tune this value because they have many
parallel DNS requests (1000s) can make the trade-off of using more
memory to avoid hash collisions. It's worth mentioning that the memory
usage is for storing an instance of a mutex, which is 64 bits as of Go
1.19 [1].
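
As a quick sanity check on that memory figure (assuming Go 1.19's sync.Mutex layout of two 32-bit words), the default lock array costs about 1 KiB:

```go
package main

import (
	"fmt"
	"sync"
	"unsafe"
)

func main() {
	var locks [128]sync.Mutex // default --dnsproxy-lock-count
	// sync.Mutex is 8 bytes as of Go 1.19, so 128 locks take 128 * 8 = 1024 bytes.
	fmt.Println(unsafe.Sizeof(locks[0]), unsafe.Sizeof(locks))
}
```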

Benchmark run comparing the code before this commit (no-lock), the
fixed-size array reduced to a single element to simulate a single mutex
(single-lock), and this commit (multiple-locks):

```
$ go test -v -tags integration_tests ./daemon/cmd -check.b -check.bmem -check.f DaemonFQDNSuite.Benchmark_notifyOnDNSMsg -test.benchtime 100x -test.count 8
```

```
$ benchstat no-lock.txt single-lock.txt multiple-locks.txt
name \ time/op    no-lock.txt  single-lock.txt  multiple-locks.txt
_notifyOnDNSMsg   93.2µs ± 9%     179.6µs ± 3%        143.2µs ±28%

name \ alloc/op   no-lock.txt  single-lock.txt  multiple-locks.txt
_notifyOnDNSMsg   36.5kB ± 2%      35.3kB ± 0%         33.8kB ± 1%

name \ allocs/op  no-lock.txt  single-lock.txt  multiple-locks.txt
_notifyOnDNSMsg      485 ± 1%         474 ± 0%            453 ± 1%
```

Reproducing the bug and performance testing on a real cluster included
running the 3 different modes and comparing the average processing time
of the DNS proxy code. The cluster was 2 nodes with 32 replicas of the
following workload:

```
  # websites.txt contains different common URLs, some duplicated.
  while [[ true ]]; do
    curl --parallel --parallel-immediate --parallel-max 5 --config websites.txt
    sleep 0.25s
  done
```

No lock:        4.61ms, 18.3ms
Single lock:    7.42ms, 18.3ms
Multiple locks: 5.25ms, 17.9ms

[1]: https://cs.opensource.google/go/go/+/refs/tags/go1.19.3:src/sync/mutex.go;l=34

Co-authored-by: Michi Mutsuzaki <michi@isovalent.com>
Signed-off-by: Chris Tarazi <chris@isovalent.com>
Signed-off-by: Joe Stringer <joe@cilium.io>
[ upstream commit 9eb0bfb ]

Optimize policy processing by precomputing redirect types for the policy
when it is ready, to avoid multiple scans of the policy filters later.

Signed-off-by: Jarno Rajahalme <jarno@isovalent.com>
Signed-off-by: Joe Stringer <joe@cilium.io>
[ upstream commit 38589e7 ]

Add a sticky parser type for CiliumEnvoyConfig CRDs. This will be used
for policy-based redirect to custom Envoy listeners.

While the CRD parser type is redirected to Envoy, it is generally
handled by a custom listener, which may not perform HTTP policy
enforcement. Thus this parser type is incompatible with HTTP rules.

Signed-off-by: Jarno Rajahalme <jarno@isovalent.com>
Signed-off-by: Joe Stringer <joe@cilium.io>
[ upstream commit 27861ae ]

Signed-off-by: Jarno Rajahalme <jarno@isovalent.com>
Signed-off-by: Joe Stringer <joe@cilium.io>
[ upstream commit 7b01459 ]

Listener config can fail if we don't wait for clusters to be configured
first, so wait for clusters to be configured if both clusters and
listeners are configured at the same time.

While a failure like this was only seen once in local testing of the custom
listener support in a CNP, a similar failure could also happen for Cilium
Ingress and Gateway API.

Signed-off-by: Jarno Rajahalme <jarno@isovalent.com>
Signed-off-by: Joe Stringer <joe@cilium.io>
[ upstream commit 0036e24 ]

Support injecting the Cilium Network filter also into TcpProxy filter
chains. This enables SNI and other L3/L4 policy enforcement features
on custom Envoy listeners applied via Cilium Envoy Config resources whose
Envoy filter chains use the TcpProxy filter, in addition to the Http
Connection Manager filter.

Signed-off-by: Jarno Rajahalme <jarno@isovalent.com>
Signed-off-by: Joe Stringer <joe@cilium.io>
[ upstream commit 41fe8c8 ]

Tell envoy.ParseResources() if the resources are intended for an L7 load
balancer. If so, the Envoy filter chain is configured to not use the original
source address and to use the socket mark as the source endpoint ID. This was
the existing behavior for all CEC CRDs.

When the CEC CRD does not contain a frontend service that is redirected
to the defined listener, the new 'isL7LB' flag is passed as
'false'. In this case the Envoy filter chain is configured to allow use
of the original source address and to mark the upstream socket with the
numeric source security identity. This is also the way in which Cilium
uses Envoy listener filter chains when enforcing policy for HTTP, for
example. In this mode L4 LB is assumed to have taken place, if applicable,
before traffic is redirected to the Envoy listener.

Prior to this change, CEC CRDs were mostly usable only for L7 LB use cases,
such as Cilium Ingress. After this change, CEC CRDs without a Service
redirect can be used in place of Cilium policy enforcement filter chains,
if desired.

Signed-off-by: Jarno Rajahalme <jarno@isovalent.com>
Signed-off-by: Joe Stringer <joe@cilium.io>

[ upstream commit a679c40 ]

Specify proxy port as egress (or not-ingress) in test cases as the
datapath currently supports ProxyTypeCRD only for L7 LB which is always
on the egress path.

Signed-off-by: Jarno Rajahalme <jarno@isovalent.com>
Signed-off-by: Joe Stringer <joe@cilium.io>
[ upstream commit aff0655 ]

Add a new no-op CRDRedirect type to be used with Envoy listeners defined
in CEC CRDs. In this case the listeners already exist and new Envoy
Listener resources do not need to be created for them.

This is needed for the forthcoming policy feature where policy can refer
to a Listener defined in CEC CRD.

Signed-off-by: Jarno Rajahalme <jarno@isovalent.com>
Signed-off-by: Joe Stringer <joe@cilium.io>
[ upstream commit 06ed0a4 ]

Find the CRD proxy port by name instead of by type. This is needed to enable
CEC CRD defined listeners to be used in CNPs. Prior to this, CRD proxy
ports did not use this code path, which is only called from endpoint
policy updates, so there was no need to find CRD proxy ports by name.

Signed-off-by: Jarno Rajahalme <jarno@isovalent.com>
Signed-off-by: Joe Stringer <joe@cilium.io>
[ upstream commit ede26fb ]

Move resourceQualifiedName() to policy/api and export it so that it can
be used in policy as well as in envoy package.

Signed-off-by: Jarno Rajahalme <jarno@isovalent.com>
Signed-off-by: Joe Stringer <joe@cilium.io>
[ upstream commit 9159609 ]

Add a `listener` field to CNP and CCNP that causes traffic to the
specified port(s) to be redirected to the named Envoy listener. If the
listener does not exist, the traffic is not allowed at all.

When `listener.envoyConfig.kind` is left out, it defaults to the namespaced
`CiliumEnvoyConfig` for rules in namespaced policies (CNP) or to the
cluster-scoped `CiliumClusterwideEnvoyConfig` for rules in cluster-scoped
policies (CCNP). Namespaced policies can also refer to cluster-scoped
listeners with an explicit `listener.envoyConfig.kind:
CiliumClusterwideEnvoyConfig`. Cluster-scoped policies cannot refer to
namespaced listeners.

Endpoint policies are regenerated whenever Envoy listeners change to
update potential listener redirections in the bpf policy maps.

Signed-off-by: Jarno Rajahalme <jarno@isovalent.com>
Signed-off-by: Joe Stringer <joe@cilium.io>
[ upstream commit 527f5fd ]

This makes it much easier to track down which code path the executor
logs are coming from, so that when debugging issues, we can focus on the
relevant code path.

Signed-off-by: Chris Tarazi <chris@isovalent.com>
Signed-off-by: Joe Stringer <joe@cilium.io>
[ upstream commit bf3fd5f ]

Signed-off-by: Chris Tarazi <chris@isovalent.com>
Signed-off-by: Joe Stringer <joe@cilium.io>
[ upstream commit dab8723 ]

Previously, 11cb4d0 assumed that 137 was the exit code when a
process exits due to a SIGKILL. However, upon reading the Go source
code as of 1.20rc1, this is not the case: -1 is set as the exit code
for all processes terminated by signals [1].

Fixes: 11cb4d0 ("test: Keep trying exec if killed")
Fixes: #22570

[1]:
https://github.com/golang/go/blob/go1.20rc1/src/os/exec_posix.go#L128-L130

Signed-off-by: Chris Tarazi <chris@isovalent.com>
Signed-off-by: Joe Stringer <joe@cilium.io>
[ upstream commit e49ab12 ]

When endpoint routes are enabled, we should enforce ingress policies at
the destination lxc interface, to which a BPF program will be attached.
Nevertheless, today, for packets coming from the overlay, we enforce
ingress policies twice: once at the overlay interface (e.g. cilium_vxlan)
and a second time at the lxc device.

This is happening for two reasons:
  1. bpf_overlay is not aware of the endpoint routes settings so it
     doesn't even know that it's not responsible for enforcing ingress
     policies.
  2. We have a flag to force the enforcement of ingress policies at the
     source in this case. This flag exists for historic reasons that are
     not valid anymore.

A separate patch will fix reason 2 above. This commit fixes reason 1
by telling bpf_overlay to *not* enforce ingress policies when endpoint
routes are enabled.

Note that we do not support the case where some endpoints have endpoint
routes enabled and others don't. If we did, additional logic would be
required.

Fixes: 3179a47 ("datapath: Support enable-endpoint-routes with encapsulation")
Signed-off-by: Paul Chaignon <paul@cilium.io>
Signed-off-by: Joe Stringer <joe@cilium.io>
[ upstream commit 3d2ceaf ]

The previous commit changed the packet handling on the overlay->lxc
path to fix a bug. More precisely, when endpoint routes are
enabled, we no longer enforce ingress policies on both the overlay and the
lxc devices but only on the latter.

However, as a consequence of that patch, we don't go through the
policy-only program in bpf_lxc and we therefore changed the way the
packet is transmitted between overlay and lxc devices in some cases. As
a summary of the changes made in the previous patch, consider the following
table for the path overlay -> lxc.

Before the previous patch:
| Endpoint routes | Enforcement     | Path                 |
|-----------------|-----------------|----------------------|
| Enable          | overlay AND lxc | bpf_redirect if KPR; |
|                 |                 | stack otherwise      |
| Disabled        | overlay         | bpf_redirect         |

Now:
| Endpoint routes | Enforcement | Path         |
|-----------------|-------------|--------------|
| Enable          | lxc         | bpf_redirect |
| Disabled        | overlay     | bpf_redirect |

The previous patch intended to fix the enforcement to avoid the double
policy enforcement, but it also changed the packet path in case endpoint
routes are enabled.

This patch fixes this by adding the same exception we have in
bpf_lxc to the logic in l3.h. Hence, with the current patch, the
table looks like:
| Endpoint routes | Enforcement | Path                 |
|-----------------|-------------|----------------------|
| Enable          | lxc         | bpf_redirect if KPR; |
|                 |             | stack otherwise      |
| Disabled        | overlay     | bpf_redirect         |

I've kept this in a separate commit from the previous one in an attempt to
split up the logic and more clearly show the deltas.

Signed-off-by: Paul Chaignon <paul@cilium.io>
Signed-off-by: Joe Stringer <joe@cilium.io>
[ upstream commit 843c072 ]

The encryption test details -
cilium/cilium-cli#1241.

Signed-off-by: Martynas Pumputis <m@lambda.lt>
Signed-off-by: Joe Stringer <joe@cilium.io>
[ upstream commit 884ccc1 ]

This generates the CiliumNodeConfig type, a new way to set
configuration overrides on a set of nodes.

Signed-off-by: Casey Callendrello <cdc@isovalent.com>
Signed-off-by: Joe Stringer <joe@cilium.io>
@joestringer
Member Author

I'm not sure how #22656 is going to interact with the v1.13 CI, but just in case I've also posted #22823. cc @squeed

@joestringer
Member Author

joestringer commented Dec 21, 2022

/test-backport-1.13

Job 'Cilium-PR-K8s-1.21-kernel-4.9' failed:

Test Name: K8sDatapathServicesTest Checks E/W loadbalancing (ClusterIP, NodePort from inside cluster, etc) with L4 policy Tests NodePort with L4 Policy

Failure Output: FAIL: Request from testclient-6xfrm pod to service tftp://[fd04::12]:31948/hello failed

If it is a flake and a GitHub issue doesn't already exist to track it, comment /mlh new-flake Cilium-PR-K8s-1.21-kernel-4.9 so I can create one.

Job 'Cilium-PR-K8s-1.16-kernel-4.9' failed:

Test Name: K8sKafkaPolicyTest Kafka Policy Tests KafkaPolicies

Failure Output: FAIL: Unable to restart unmanaged pods with 'kubectl -n kube-system delete pods coredns-8cfc78c54-5cgdk': Exitcode: 1

If it is a flake and a GitHub issue doesn't already exist to track it, comment /mlh new-flake Cilium-PR-K8s-1.16-kernel-4.9 so I can create one.

Member

@brb left a comment

My changes LGTM, thanks!

test/l4lb,nat46x64: Replace Kind/Helm with DinD #22653 (@brb)

I will do a manual backport of these.

Member

@gandro left a comment

My commit (and the Hubble changes) looks good, thanks!

Member

@sayboras left a comment

Looks good for my commit, thanks 💯

@joestringer
Member Author

CI triage:

@joestringer joestringer merged commit 698cb2a into v1.13 Dec 22, 2022
@joestringer joestringer deleted the pr/v1.13-backport-2022-12-21 branch December 22, 2022 02:25
Labels
backport/1.13: This PR represents a backport for Cilium 1.13.x of a PR that was merged to main.
kind/backports: This PR provides functionality previously merged into master.