v1.9 backports 2020-12-14 #14405
Conversation
[ upstream commit 8917d33 ] Signed-off-by: Aditi Ghag <aditi@cilium.io> Signed-off-by: Joe Stringer <joe@cilium.io>
[ upstream commit 478f409 ] On some occasions the metric `k8s_event_lag_seconds` could be reported as an overflown value such as `9223372036854775807`. This commit fixes that by checking whether the calculated lag is negative and only setting the metric for positive times. Fixes: 4e29130 ("pkg/endpoint: calculate Kube API-Server lag from pod events") Signed-off-by: André Martins <andre@cilium.io> Signed-off-by: Joe Stringer <joe@cilium.io>
…rbage-collection [ upstream commit 1ef686b ] Signed-off-by: John Watson <johnw@planetscale.com> Signed-off-by: Joe Stringer <joe@cilium.io>
[ upstream commit 7ba7edc ] Our fromCIDR+toPorts tests have been flaking for a while. In particular, the check on the number of policy verdicts found in the output of cilium monitor sometimes fails. Those failures happen because we read cilium monitor's output too soon, before it has had a chance to drain the ring buffer (see #12596 for a detailed analysis of the logs). This commit fixes the flake by repeatedly checking the output against our regular expression until it succeeds or times out, similarly to what is already done in K8sDatapathConfig MonitorAggregation. Fixes: #12596 Fixes: 94eb491 ("test: Add policy test for ingress FromCIDR+ToPorts") Fixes: 74be0b2 ("test: fromCIDR+toPorts host policy") Signed-off-by: Paul Chaignon <paul@cilium.io> Signed-off-by: Joe Stringer <joe@cilium.io>
[ upstream commit 002bd9c ] Suggested-by: Chris Tarazi <chris@isovalent.com> Signed-off-by: Paul Chaignon <paul@cilium.io> Signed-off-by: Joe Stringer <joe@cilium.io>
[ upstream commit 431eda4 ] As per the microk8s documentation (https://microk8s.io/docs/addons), add-ons are enabled with `microk8s enable <addon>` - note the missing "." between `microk8s` and `enable`. Also tested locally with `microk8s enable cilium`. Signed-off-by: Philipp Gniewosz <philipp.gniewosz@posteo.de> Signed-off-by: Joe Stringer <joe@cilium.io>
[ upstream commit d117d50 ] Previously, a race condition between the Kubernetes initialization logic and the IPAM / address handling / NodeDiscovery logic could, in rare circumstances, cause a crash on startup because the k8s subsystem handled a policy update event before initialization was complete.

André pointed out that InitK8sSubsystem() runs asynchronously to initialize and begin handling k8s events, and that it is initiated prior to the allocation / initialization of the IP handling subsystem. Looking at the stacktrace, we see that the "index out of range" error is triggered from handling a policyAdd event, through the ObjectCache initialization, through the headerfile write / nodeconfig fetch logic, down to the logic which serializes the node IP addressing information into the node config file, subsequently used by other datapath initialization logic:

```
panic: runtime error: index out of range [3] with length 0

goroutine 145 [running]:
encoding/binary.bigEndian.Uint32(...)
        /usr/local/go/src/encoding/binary/binary.go:112
github.com/cilium/cilium/pkg/byteorder.HostSliceToNetwork(...)
        /go/src/github.com/cilium/cilium/pkg/byteorder/byteorder.go:134 +0x1ee
github.com/cilium/cilium/pkg/datapath/linux/config.(*HeaderfileWriter).WriteNodeConfig(...)
        /go/src/github.com/cilium/cilium/pkg/datapath/linux/config/config.go:108 +0x5baf
...
github.com/cilium/cilium/pkg/datapath/loader.(*Loader).Reinitialize(...)
        /go/src/github.com/cilium/cilium/pkg/datapath/loader/base.go:187 +0x2c0
github.com/cilium/cilium/daemon/cmd.(*Daemon).policyAdd(...)
        /go/src/github.com/cilium/cilium/daemon/cmd/policy.go:285 +0x1ff7
github.com/cilium/cilium/daemon/cmd.(*PolicyAddEvent).Handle(...)
        /go/src/github.com/cilium/cilium/daemon/cmd/policy.go:217 +0x5c
...
created by github.com/cilium/cilium/pkg/eventqueue.(*EventQueue).Run
        /go/src/github.com/cilium/cilium/pkg/eventqueue/eventqueue.go:240 +0x62
```

The panic occurs when converting the IPv4 loopback IP to network byte order; the index out of range implies that the IP address is not yet initialized. The loopback address is initialized from Daemon.allocateIPs() -> node.SetIPv4Loopback(...).

I had initially considered extending EventQueue to allow an "initialization prerequisite" to be configured in the form of a channel. This would require the daemon to create a channel for IP initialization and either pass it down through the stack or pass it via the Daemon object through to the datapath initialization logic and the endpoint add logic (including host endpoint add), then into the endpoint EventQueue initialization routines, which would keep a reference to the channel and wait on its close when EventQueue.Run() is called. This would ensure that the "wait" is asynchronous and hence shouldn't slow down any of the ongoing initialization logic; however, it's a little more invasive than I'd like for a bugfix.

Looking a little closer at this logic, I realized that loader.Reinitialize() consumes information from NodeDiscovery. NodeDiscovery is initialized even later than allocateIPs(), and the NodeDiscovery object has a convenient "Registered" channel which is closed when the Daemon completes NodeDiscovery initialization. So, if we wait on that channel, we can ensure all prerequisite IP addressing information is initialized prior to datapath generation. From there, it's just a question of whether this path is guaranteed to be handled asynchronously, or whether some other paths that call Reinitialize() could potentially end up causing some kind of deadlock.
That does not appear possible upon initial inspection:

```
#  File         Function       Line
0  policy.go    <global>       286   if err := d.Datapath().Loader().Reinitialize(d.ctx, d, d.mtuConfig.GetDeviceMTU(), d.Datapath(), d.l7Proxy); err != nil {
1  policy.go    <global>       632   if err := d.Datapath().Loader().Reinitialize(d.ctx, d, d.mtuConfig.GetDeviceMTU(), d.Datapath(), d.l7Proxy); err != nil {
2  loader.go    <global>        34   Reinitialize(ctx context.Context, o BaseProgramOwner, deviceMTU int, iptMgr IptablesManager, p Proxy) error
3  config.go    Handle          90   if err := d.Datapath().Loader().Reinitialize(d.ctx, d, d.mtuConfig.GetDeviceMTU(), d.Datapath(), d.l7Proxy); err != nil {
4  daemon.go    init           228   if err := d.Datapath().Loader().Reinitialize(d.ctx, d, d.mtuConfig.GetDeviceMTU(), d.Datapath(), d.l7Proxy); err != nil {
5  daemon.go    func           795   if err := d.Datapath().Loader().Reinitialize(d.ctx, d, d.mtuConfig.GetDeviceMTU(), d.Datapath(), d.l7Proxy); err != nil {
6  datapath.go  Reinitialize   131   func (f *fakeLoader) Reinitialize(ctx context.Context, o datapath.BaseProgramOwner, deviceMTU int, iptMgr datapath.IptablesManager, p datapath.Proxy) error {
7  base.go      Reinitialize   171   func (l *Loader) Reinitialize(ctx context.Context, o datapath.BaseProgramOwner, deviceMTU int, iptMgr datapath.IptablesManager, p datapath.Proxy) error {
```

0. policyAdd (from EventQueue)
1. policyDelete (from EventQueue)
2. type Loader interface
3. Daemon /config API (API only served after initialization is complete)
4. Daemon.init() (occurs after StartDiscovery())
5. TriggerReloadWithoutCompile (triggered from ipcache GC controller)
6. fakeLoader (unit tests)
7. Loader (function definition)

So, by blocking on node discovery when fetching LocalConfig(), we ensure that IP addressing configuration is properly initialized prior to using it from datapath Reinitialize(). Fixes: #14245 Reported-by: Jorge Arco <jorge.arcoma@gmail.com> Signed-off-by: Joe Stringer <joe@cilium.io>
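The "block on a close-once channel as an initialization prerequisite" pattern described in this commit message can be sketched as follows. The types and method names here are illustrative stand-ins, not Cilium's actual NodeDiscovery API; the point is that closing a channel publishes "initialization complete" to any number of waiters, and a later fetch simply blocks on it.

```go
package main

import (
	"fmt"
	"sync"
	"time"
)

// nodeDiscovery mimics the pattern above: Registered is closed exactly
// once when discovery (and hence IP addressing) is initialized.
// Illustrative shapes only, not Cilium's real types.
type nodeDiscovery struct {
	Registered chan struct{}
	once       sync.Once
}

func newNodeDiscovery() *nodeDiscovery {
	return &nodeDiscovery{Registered: make(chan struct{})}
}

// completeRegistration is safe to call multiple times; sync.Once
// guarantees the channel is closed exactly once.
func (n *nodeDiscovery) completeRegistration() {
	n.once.Do(func() { close(n.Registered) })
}

// localConfig blocks until registration completes, so callers (e.g. a
// datapath Reinitialize) never observe half-initialized address state.
func (n *nodeDiscovery) localConfig() string {
	<-n.Registered
	return "node config with initialized IPs"
}

func main() {
	nd := newNodeDiscovery()
	go func() {
		time.Sleep(50 * time.Millisecond) // simulate allocateIPs() etc.
		nd.completeRegistration()
	}()
	fmt.Println(nd.localConfig()) // blocks until registration is done
}
```

This is safe for the callers enumerated above because every reader blocks on the same channel close, and the close happens on a path that is guaranteed to run during startup.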
[ upstream commit b5c04ee ] If a CCNP contains neither a spec nor specs, the embedded CNP is nil, and accessing those fields causes a nil pointer dereference. To prevent this we should check for the embedded CNP struct before accessing its fields. Fixes: 9d7082e ("policy: add k8s controller to daemon for clusterwide policies") Signed-off-by: André Martins <andre@cilium.io> Signed-off-by: Joe Stringer <joe@cilium.io>
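The guard this commit adds follows a common Go pattern: check an embedded pointer for nil before dereferencing its fields. The struct shapes below are simplified stand-ins for the CCNP/CNP types, not the real CRD definitions.

```go
package main

import "fmt"

// networkPolicy is a simplified stand-in for the embedded CNP, with an
// optional Spec and an optional list of Specs.
type networkPolicy struct {
	Spec  *string
	Specs []string
}

// clusterwidePolicy is a simplified stand-in for a CCNP that embeds an
// optional network policy via a pointer.
type clusterwidePolicy struct {
	Embedded *networkPolicy
}

// isEmpty guards the embedded pointer before touching its fields,
// avoiding the nil-pointer dereference the commit fixes.
func (c *clusterwidePolicy) isEmpty() bool {
	if c.Embedded == nil {
		return true // no spec and no specs: nothing to dereference
	}
	return c.Embedded.Spec == nil && len(c.Embedded.Specs) == 0
}

func main() {
	fmt.Println((&clusterwidePolicy{}).isEmpty()) // no panic on nil embed
	spec := "allow-dns"
	withSpec := &clusterwidePolicy{Embedded: &networkPolicy{Spec: &spec}}
	fmt.Println(withSpec.isEmpty())
}
```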
[ upstream commit 27cd5d2 ] Similarly to a node name, a cluster name may contain dots (.). In such a case, dots need to be replaced by dashes (-) so that the TLS server name matches the certificate server's identity. This patch is similar to the way node names containing dots are treated. Signed-off-by: Robin Hahling <robin.hahling@gw-computing.net> Signed-off-by: Joe Stringer <joe@cilium.io>
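The dot-to-dash normalization described above can be sketched with a small helper. The `tlsServerName` function and the `<node>.<cluster>` name layout are assumptions for illustration; consult the actual Cilium code for the real name construction.

```go
package main

import (
	"fmt"
	"strings"
)

// tlsServerName builds a TLS server name from node and cluster names,
// replacing dots with dashes in both so the resulting name can match
// the certificate's server identity. The "<node>.<cluster>" layout is
// a hypothetical example, not Cilium's exact scheme.
func tlsServerName(nodeName, clusterName string) string {
	n := strings.ReplaceAll(nodeName, ".", "-")
	c := strings.ReplaceAll(clusterName, ".", "-")
	return n + "." + c
}

func main() {
	// A node name and a cluster name that both contain dots:
	fmt.Println(tlsServerName("node-1.internal", "prod.eu.west"))
}
```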
test-backport-1.9
LGTM for my change.
LGTM for my changes as well.
👍 for my PR.
Good for my commit!
LGTM for my changes
Skipped due to conflicts, PTAL:
Once this PR is merged, you can update the PR labels via: