pkg/endpoint: calculate Kube API-Server lag from pod events #13702

aanm · 2020-10-22T14:59:03Z

Since Cilium receives CNI events when a pod is created, Cilium can
calculate the lag for kube-apiserver events by taking the time the
ADD event for that pod was received and subtracting the time the CNI
event for that pod was received.

Signed-off-by: André Martins <andre@cilium.io>

Fixes: #13679

Add metric 'cilium_k8s_event_lag_seconds' for calculated lag of Kubernetes events

aanm · 2020-10-22T15:00:27Z

test-me-please

christarazi

LGTM, a few minor changes.

pkg/k8s/watchers/pod.go

christarazi · 2020-10-22T16:34:56Z

pkg/metrics/metrics.go

+			EventLagK8s = prometheus.NewGauge(prometheus.GaugeOpts{
+				Namespace:   Namespace,
+				Name:        "event_lag_seconds",
+				Help:        "Lag (computed value) for Kubernetes events",
+				ConstLabels: prometheus.Labels{"source": LabelEventSourceK8s},
+			})


Have you run this through promtool?

Also, I think we can add more context to the description of this metric. Since this is the lag between receiving the CNI ADD and then getting the pod in the cache, let's add that so it is clear for readers.

I haven't but our smoke tests do :)

aanm · 2020-10-22T16:58:41Z

test-me-please

christarazi · 2020-10-22T17:09:42Z

pkg/metrics/metrics.go

+			EventLagK8s = prometheus.NewGauge(prometheus.GaugeOpts{
+				Namespace:   Namespace,
+				Name:        "event_lag_seconds",
+				Help:        "Lag (computed value) for Kubernetes events",
+				ConstLabels: prometheus.Labels{"source": LabelEventSourceK8s},
+			})


Let's update this description per my other comment as soon as this passes CI.

(Actually we might want to rename the variable if we update the description to be specific to CNI ADD -> pod cache lag time.)

joestringer

Yeah, I think it would be nice to make it more specific than just k8s_event_lag_seconds, for example k8s_pod_create_latency_seconds or something?

Also, how exactly do restored endpoints impact the metric? I get the impression that the time is set during restore but I don't follow exactly how the metric might be updated/influenced by restored endpoints.

Otherwise LGTM.

aanm · 2020-10-22T19:28:41Z

> Yeah, I think it would be nice to make it more specific than just k8s_event_lag_seconds, for example k8s_pod_create_latency_seconds or something?
>
> Also, how exactly do restored endpoints impact the metric? I get the impression that the time is set during restore but I don't follow exactly how the metric might be updated/influenced by restored endpoints.
>
> Otherwise LGTM.

@joestringer I think it's still fair to assume the latency of Kubernetes starts to "count" as soon as we restore an endpoint.

pchaigno · 2020-10-22T19:42:15Z

But the K8s watcher won't receive anything for the restored endpoints because the corresponding pods already exist, right? So the create time for restored endpoints probably doesn't matter unless we have a very serious lag and Cilium manages to restart before the k8s event is received.

aanm · 2020-10-23T08:09:55Z

> But the K8s watcher won't receive anything for the restored endpoints because the corresponding pods already exist, right? So the create time for restored endpoints probably doesn't matter unless we have a very serious lag and Cilium manages to restart before the k8s event is received.

The watchers receive all pods regardless of the state of the local endpoints. I didn't check, but I think the watcher receives the pods before we restore.

pchaigno · 2020-10-23T09:32:01Z

> > But the K8s watcher won't receive anything for the restored endpoints because the corresponding pods already exist, right? So the create time for restored endpoints probably doesn't matter unless we have a very serious lag and Cilium manages to restart before the k8s event is received.
>
> The watchers receive all pods regardless of the state of the local endpoints. I didn't check, but I think the watcher receives the pods before we restore.

Indeed, we are blocking on the initial update of pods from K8s in InitK8sSubsystem() before continuing with the restoration of endpoints.

aanm added the release-note/minor and needs-backport/1.7 labels Oct 22, 2020

aanm requested a review from a team as a code owner October 22, 2020 14:59

aanm requested a review from a team October 22, 2020 14:59

aanm requested a review from a team as a code owner October 22, 2020 14:59

maintainer-s-little-helper bot added this to Needs backport from master in 1.9.0-rc3 Oct 22, 2020

maintainer-s-little-helper bot added this to Needs backport from master in 1.7.11 Oct 22, 2020

maintainer-s-little-helper bot added this to Needs backport from master in 1.8.5 Oct 22, 2020

pchaigno approved these changes Oct 22, 2020

christarazi requested changes Oct 22, 2020

aanm force-pushed the pr/correlate-delay-k8s-events branch from 19a682b to 3d5440f Compare October 22, 2020 16:58

aanm requested a review from christarazi October 22, 2020 17:05

christarazi approved these changes Oct 22, 2020

joestringer approved these changes Oct 22, 2020

christarazi added the dont-merge/blocked label Oct 23, 2020

aanm force-pushed the pr/correlate-delay-k8s-events branch from 3d5440f to b134bc9 Compare October 23, 2020 09:13

aanm removed the dont-merge/blocked label Oct 23, 2020

aanm merged commit 4e29130 into cilium:master Oct 23, 2020

christarazi mentioned this pull request Oct 23, 2020

v1.7 backports 2020-10-23 #13739

Merged

christarazi added backport-pending/1.7 and removed needs-backport/1.7 labels Oct 23, 2020

maintainer-s-little-helper bot moved this from Needs backport from master to Backport pending to v1.7 in 1.7.11 Oct 23, 2020

aanm added backport-done/1.7 and removed backport-pending/1.7 labels Oct 24, 2020

maintainer-s-little-helper bot moved this from Backport pending to v1.7 to Backport done to v1.7 in 1.7.11 Oct 24, 2020

gandro mentioned this pull request Oct 26, 2020

v1.9 backports 2020-10-26 #13751

Merged

gandro added backport-pending/1.9 and removed needs-backport/1.9 labels Oct 26, 2020

maintainer-s-little-helper bot moved this from Needs backport from master to Backport pending to v1.9 in 1.9.0-rc3 Oct 26, 2020

gandro added backport-done/1.9 and removed backport-pending/1.9 labels Oct 27, 2020

maintainer-s-little-helper bot moved this from Backport pending to v1.9 to Backport done to v1.9 in 1.9.0-rc3 Oct 27, 2020

christarazi mentioned this pull request Oct 27, 2020

v1.8 backports 2020-10-27 #13788

Merged

christarazi added backport-pending/1.8 and removed needs-backport/1.8 labels Oct 27, 2020

maintainer-s-little-helper bot moved this from Needs backport from master to Backport pending to v1.8 in 1.8.5 Oct 27, 2020

christarazi mentioned this pull request Oct 28, 2020

Prepare for release v1.7.11 #13790

Merged

jrajahalme added backport-done/1.8 and removed backport-pending/1.8 labels Oct 28, 2020

maintainer-s-little-helper bot moved this from Backport pending to v1.8 to Backport done to v1.8 in 1.8.5 Oct 28, 2020

christarazi mentioned this pull request Oct 28, 2020

Prepare for release v1.8.5 #13803

Merged

joestringer mentioned this pull request Nov 3, 2020

Prepare for release v1.9.0-rc3 #13864

Merged
