v1.9 backports 2020-12-14 #14405

Merged
merged 10 commits into v1.9 from pr/v1.9-backport-2020-12-14 on Dec 15, 2020

Conversation

@joestringer (Member) commented Dec 15, 2020

Skipped due to conflicts, PTAL:

Once this PR is merged, you can update the PR labels via:

    $ for pr in 14297 14313 14300 14286 14325 14299 14294 14375 14378; do contrib/backporting/set-labels.py $pr done 1.9; done

aditighag and others added 10 commits December 14, 2020 16:09
[ upstream commit 8917d33 ]

Signed-off-by: Aditi Ghag <aditi@cilium.io>
Signed-off-by: Joe Stringer <joe@cilium.io>
[ upstream commit 478f409 ]

On some occasions the metric `k8s_event_lag_seconds` could be reported
as an overflown value such as `9223372036854775807`. This commit fixes
that by checking whether the calculated value is negative and only
setting the metric for positive lag times.
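
A minimal sketch of the guard described above, assuming the Prometheus client library; the gauge variable, function, and package names are illustrative stand-ins, not Cilium's actual code:

    package metrics // illustrative package name

    import (
        "time"

        "github.com/prometheus/client_golang/prometheus"
    )

    // eventLag is a stand-in for the k8s_event_lag_seconds gauge.
    var eventLag = prometheus.NewGauge(prometheus.GaugeOpts{
        Name: "k8s_event_lag_seconds",
        Help: "Lag between Kube API server pod events and the agent.",
    })

    // reportEventLag only exports non-negative lag values; a negative
    // duration (e.g. due to clock skew) would otherwise be exported as
    // an overflown value such as 9223372036854775807.
    func reportEventLag(podEventTime time.Time) {
        if lag := time.Since(podEventTime); lag >= 0 {
            eventLag.Set(lag.Seconds())
        }
    }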

Fixes: 4e29130 ("pkg/endpoint: calculate Kube API-Server lag from pod events")
Signed-off-by: André Martins <andre@cilium.io>
Signed-off-by: Joe Stringer <joe@cilium.io>
…rbage-collection

[ upstream commit 1ef686b ]

Signed-off-by: John Watson <johnw@planetscale.com>
Signed-off-by: Joe Stringer <joe@cilium.io>
[ upstream commit 7ba7edc ]

Our fromCIDR+toPorts tests have been flaking for a while. In particular,
the check on the number of policy verdicts found in the output of cilium
monitor sometimes fails. Those failures happen because we read cilium
monitor's output too soon, before it has a chance to drain the ring
buffer (see #12596 for a detailed analysis of the logs).

This commit fixes it by repeatedly checking the output against our
regular expression until it succeeds or times out, similarly to what is
already done in K8sDatapathConfig MonitorAggregation.
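
A simplified sketch of the retry-until-match pattern described above; the function and parameter names are hypothetical and do not correspond to the actual helpers in the Cilium test suite:

    package retry // illustrative

    import (
        "fmt"
        "regexp"
        "time"
    )

    // waitForVerdicts re-reads the monitor output until the expected
    // number of policy verdicts matches the regular expression, or the
    // timeout expires, instead of checking the output only once.
    func waitForVerdicts(getOutput func() string, re *regexp.Regexp, want int, timeout time.Duration) error {
        deadline := time.Now().Add(timeout)
        for {
            if len(re.FindAllString(getOutput(), -1)) >= want {
                return nil
            }
            if time.Now().After(deadline) {
                return fmt.Errorf("timed out waiting for %d policy verdicts", want)
            }
            time.Sleep(time.Second)
        }
    }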

Fixes: #12596
Fixes: 94eb491 ("test: Add policy test for ingress FromCIDR+ToPorts")
Fixes: 74be0b2 ("test: fromCIDR+toPorts host policy")
Signed-off-by: Paul Chaignon <paul@cilium.io>
Signed-off-by: Joe Stringer <joe@cilium.io>
[ upstream commit 002bd9c ]

Suggested-by: Chris Tarazi <chris@isovalent.com>
Signed-off-by: Paul Chaignon <paul@cilium.io>
Signed-off-by: Joe Stringer <joe@cilium.io>
[ upstream commit 431eda4 ]

As per the microk8s documentation (https://microk8s.io/docs/addons), add-ons
are enabled with `microk8s enable <addon>` - note the missing "." between
`microk8s` and the add-on command. Also tested this locally:

`microk8s enable cilium`

Signed-off-by: Philipp Gniewosz <philipp.gniewosz@posteo.de>
Signed-off-by: Joe Stringer <joe@cilium.io>
[ upstream commit d117d50 ]

Previously, a race condition between Kubernetes initialization logic and
the IPAM / address handling / NodeDiscovery logic could cause a crash on
startup in rare circumstances due to the k8s subsystem handling a policy
update event before the initialization was complete.

André pointed out that InitK8sSubsystem() occurs asynchronously to
initialize and begin handling k8s events, and this is initiated prior to
the allocation / initialization of the IP handling subsystem. Looking at
the stacktrace, we see that the "index out of range" error is triggered
from handling a policyAdd event, through the ObjectCache initialization,
through to the headerfile write / nodeconfig fetch logic, through to
logic which would serialize the node IP addressing information into the
node config file, subsequently to be used by other datapath
initialization logic:

    panic: runtime error: index out of range [3] with length 0

    goroutine 145 [running]:
    encoding/binary.bigEndian.Uint32(...)
        /usr/local/go/src/encoding/binary/binary.go:112
    github.com/cilium/cilium/pkg/byteorder.HostSliceToNetwork(...)
        /go/src/github.com/cilium/cilium/pkg/byteorder/byteorder.go:134 +0x1ee
    github.com/cilium/cilium/pkg/datapath/linux/config.(*HeaderfileWriter).WriteNodeConfig(...)
        /go/src/github.com/cilium/cilium/pkg/datapath/linux/config/config.go:108 +0x5baf
    ...
    github.com/cilium/cilium/pkg/datapath/loader.(*Loader).Reinitialize(...)
        /go/src/github.com/cilium/cilium/pkg/datapath/loader/base.go:187 +0x2c0
    github.com/cilium/cilium/daemon/cmd.(*Daemon).policyAdd(...)
        /go/src/github.com/cilium/cilium/daemon/cmd/policy.go:285 +0x1ff7
    github.com/cilium/cilium/daemon/cmd.(*PolicyAddEvent).Handle(...)
        /go/src/github.com/cilium/cilium/daemon/cmd/policy.go:217 +0x5c
    ...
    created by github.com/cilium/cilium/pkg/eventqueue.(*EventQueue).Run
        /go/src/github.com/cilium/cilium/pkg/eventqueue/eventqueue.go:240 +0x62

This occurs when converting the IPv4 loopback IP to network byte order,
and the index out of range implies that the IP address is not yet
initialized. The IP loopback address is initialized from
Daemon.allocateIPs() -> node.SetIPv4Loopback(...).

I had initially considered extending EventQueue to allow an
"initialization prerequisite" to be configured in the form of a channel.
This would require the daemon to create a channel for ip initialization
and either pass it down through the stack or pass it via the Daemon
object through to the datapath initialization logic and the endpoint add
logic (including host endpoint add), then into the endpoint EventQueue
initialization routines, which would keep a reference to the channel
around and wait on the close of the channel when EventQueue.Run() is
called. This would ensure that the "wait" is asynchronous, and hence
shouldn't slow down any of the ongoing initialization logic, however
it's a little more invasive than I'd like for a bugfix.

Looking a little closer around this logic, I realized that
loader.Reinitialize() consumes information from NodeDiscovery.
NodeDiscovery is initialized even later than allocateIPs(), and the
NodeDiscovery object has a convenient "Registered" channel which is
closed when the Daemon completes NodeDiscovery initialization. So, if we
wait on that channel, we could ensure all prerequisite IP addressing
information would be initialized prior to datapath generation.

From there, it's just a question of whether this path is guaranteed to
be handled asynchronously or whether there are some other paths that
call Reinitialize() that could potentially end up causing some kind of
deadlock. That does not appear possible upon initial inspection:

      File        Function     Line
    0 policy.go   <global>     286 if (err := d.Datapath().Loader().Reinitialize(d.ctx, d, d.mtuConfig.GetDeviceMTU(), d.Datapath(), d.l7Proxy); err != nil {
    1 policy.go   <global>     632 if (err := d.Datapath().Loader().Reinitialize(d.ctx, d, d.mtuConfig.GetDeviceMTU(), d.Datapath(), d.l7Proxy); err != nil {
    2 loader.go   <global>      34 Reinitialize(ctx context.Context, o BaseProgramOwner, deviceMTU int , iptMgr IptablesManager, p Proxy) error
    3 config.go   Handle        90 if (err := d.Datapath().Loader().Reinitialize(d.ctx, d, d.mtuConfig.GetDeviceMTU(), d.Datapath(), d.l7Proxy); err != nil {
    4 daemon.go   init         228 if (err := d.Datapath().Loader().Reinitialize(d.ctx, d, d.mtuConfig.GetDeviceMTU(), d.Datapath(), d.l7Proxy); err != nil {
    5 daemon.go   func         795 if (err := d.Datapath().Loader().Reinitialize(d.ctx, d, d.mtuConfig.GetDeviceMTU(), d.Datapath(), d.l7Proxy); err != nil {
    6 datapath.go Reinitialize 131 func (f *fakeLoader) Reinitialize(ctx context.Context, o datapath.BaseProgramOwner, deviceMTU int , iptMgr datapath.IptablesManager, p
                                   datapath.Proxy) error {
    7 base.go     Reinitialize 171 func (l *Loader) Reinitialize(ctx context.Context, o datapath.BaseProgramOwner, deviceMTU int , iptMgr datapath.IptablesManager, p
                                   datapath.Proxy) error {

0. policyAdd (From EventQueue)
1. policyDelete (From EventQueue)
2. type Loader interface
3. Daemon /config API (API only served after initialization is complete)
4. Daemon.init() (Occurs after StartDiscovery())
5. TriggerReloadWithoutCompile (Triggered from ipcache GC controller)
6. fakeLoader (unit tests)
7. Loader (function definition)

So, by blocking on node discovery when fetching LocalConfig(), we are
able to ensure that IP addressing configuration is properly initialized
prior to using it from datapath Reinitialize().
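
A simplified sketch of that approach; the type and field names below are modelled on the description above rather than copied from the Cilium source:

    package discovery // illustrative

    // LocalNodeConfiguration stands in for the datapath node config
    // that carries the node's IP addressing information.
    type LocalNodeConfiguration struct {
        // IP addressing fields populated during node discovery.
    }

    // NodeDiscovery is a simplified stand-in for the node discovery
    // subsystem described above.
    type NodeDiscovery struct {
        // Registered is closed once node discovery has completed,
        // i.e. after the IP addressing information is initialized.
        Registered  chan struct{}
        localConfig LocalNodeConfiguration
    }

    // LocalConfig blocks until node discovery has finished, so callers
    // such as the datapath Reinitialize() path never observe
    // uninitialized IP addressing information.
    func (n *NodeDiscovery) LocalConfig() *LocalNodeConfiguration {
        <-n.Registered
        return &n.localConfig
    }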

Fixes: #14245

Reported-by: Jorge Arco <jorge.arcoma@gmail.com>
Signed-off-by: Joe Stringer <joe@cilium.io>
[ upstream commit 26f318d ]

Support for ResourceQuotas (specifically for GKE) was added in #13878.

Signed-off-by: Sebastian Wicki <sebastian@isovalent.com>
Signed-off-by: Joe Stringer <joe@cilium.io>
[ upstream commit b5c04ee ]

If a CCNP contains neither a spec nor specs, the embedded CNP is nil,
and accessing those fields will cause a nil pointer exception. To
prevent this we should check for this embedded CNP struct before
accessing its fields.
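
An illustrative nil guard; the struct shapes are simplified stand-ins for the real CRD types:

    package policy // illustrative

    import "fmt"

    // Rule is a placeholder for a policy rule.
    type Rule struct{}

    // CiliumNetworkPolicy may carry a single Spec or a list of Specs.
    type CiliumNetworkPolicy struct {
        Spec  *Rule
        Specs []*Rule
    }

    // CiliumClusterwideNetworkPolicy embeds a CNP that is nil when the
    // CCNP carries neither a spec nor specs.
    type CiliumClusterwideNetworkPolicy struct {
        *CiliumNetworkPolicy
    }

    // rulesOf checks the embedded CNP before accessing its fields.
    func rulesOf(ccnp *CiliumClusterwideNetworkPolicy) ([]*Rule, error) {
        if ccnp.CiliumNetworkPolicy == nil {
            return nil, fmt.Errorf("CCNP contains neither a spec nor specs")
        }
        rules := ccnp.Specs
        if ccnp.Spec != nil {
            rules = append(rules, ccnp.Spec)
        }
        return rules, nil
    }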

Fixes: 9d7082e ("policy: add k8s controller to daemon for clusterwide policies")
Signed-off-by: André Martins <andre@cilium.io>
Signed-off-by: Joe Stringer <joe@cilium.io>
[ upstream commit 27cd5d2 ]

Similarly to a node name, a cluster name may contain dots (.). In such a
case, dots need to be replaced by dashes (-) so that the TLS server name
matches the certificate server's identity.

This patch is similar to the way node names containing dots are treated.
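
A minimal sketch of the substitution described above; the function and package names are illustrative:

    package tlsutil // illustrative

    import "strings"

    // tlsServerName replaces dots in a cluster name with dashes so that
    // the resulting TLS server name matches the certificate's identity,
    // in the same way node names containing dots are handled.
    func tlsServerName(clusterName string) string {
        return strings.ReplaceAll(clusterName, ".", "-")
    }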

Signed-off-by: Robin Hahling <robin.hahling@gw-computing.net>
Signed-off-by: Joe Stringer <joe@cilium.io>
@joestringer requested a review from a team as a code owner on December 15, 2020 00:12
@joestringer added the backport/1.9 and kind/backports (This PR provides functionality previously merged into master.) labels on Dec 15, 2020
@joestringer (Member, Author) commented:

test-backport-1.9

@aditighag (Member) left a comment:

LGTM for my change.

@rolinh (Member) left a comment:

LGTM for my changes as well.

@pchaigno (Member) left a comment:

👍 for my PR.

@gandro (Member) left a comment:

Good for my commit!

@aanm (Member) left a comment:

LGTM for my changes

@joestringer merged commit 5251a49 into v1.9 on Dec 15, 2020
@joestringer deleted the pr/v1.9-backport-2020-12-14 branch on December 15, 2020 19:11
Labels
kind/backports This PR provides functionality previously merged into master.