Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

pkg/endpoint: calculate Kube API-Server lag from pod events #13702

Merged
merged 1 commit into from
Oct 23, 2020

Conversation

aanm
Copy link
Member

@aanm aanm commented Oct 22, 2020

Since Cilium receives CNI events when a pod is created, Cilium can
calculate the lag for kube-apiserver events by checking the time an
ADD event for that Pod was received and subtracting by the time the CNI
event for that pod was received.

Signed-off-by: André Martins andre@cilium.io

Fixes: #13679

Add metric 'cilium_k8s_event_lag_seconds' for calculated lag of Kubernetes events

@aanm aanm added release-note/minor This PR changes functionality that users may find relevant to operating Cilium. needs-backport/1.7 labels Oct 22, 2020
@aanm aanm requested a review from a team as a code owner October 22, 2020 14:59
@aanm aanm requested a review from a team October 22, 2020 14:59
@aanm aanm requested a review from a team as a code owner October 22, 2020 14:59
@maintainer-s-little-helper maintainer-s-little-helper bot added this to Needs backport from master in 1.9.0-rc3 Oct 22, 2020
@maintainer-s-little-helper maintainer-s-little-helper bot added this to Needs backport from master in 1.7.11 Oct 22, 2020
@maintainer-s-little-helper maintainer-s-little-helper bot added this to Needs backport from master in 1.8.5 Oct 22, 2020
@aanm
Copy link
Member Author

aanm commented Oct 22, 2020

test-me-please

Copy link
Member

@christarazi christarazi left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM, a few minor changes.

pkg/k8s/watchers/pod.go Outdated Show resolved Hide resolved
pkg/k8s/watchers/pod.go Outdated Show resolved Hide resolved
Comment on lines 794 to 799
EventLagK8s = prometheus.NewGauge(prometheus.GaugeOpts{
Namespace: Namespace,
Name: "event_lag_seconds",
Help: "Lag (computed value) for Kubernetes events",
ConstLabels: prometheus.Labels{"source": LabelEventSourceK8s},
})
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Have you run this through promtool?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Also, I think we can add more context to the description of this metrics. Since this is the lag between receiving the CNI ADD and then getting it in the cache, let's add that so it is clear for readers.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I haven't but our smoke tests do :)

@aanm
Copy link
Member Author

aanm commented Oct 22, 2020

test-me-please

@aanm aanm force-pushed the pr/correlate-delay-k8s-events branch from 19a682b to 3d5440f Compare October 22, 2020 16:58
@aanm aanm requested a review from christarazi October 22, 2020 17:05
Comment on lines 794 to 799
EventLagK8s = prometheus.NewGauge(prometheus.GaugeOpts{
Namespace: Namespace,
Name: "event_lag_seconds",
Help: "Lag (computed value) for Kubernetes events",
ConstLabels: prometheus.Labels{"source": LabelEventSourceK8s},
})
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Let's update this description per my other comment as soon as this passes CI.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

(Actually we might want to rename the variable if we update the description to be specific to CNI ADD -> pod cache lag time.)

Copy link
Member

@joestringer joestringer left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah, I think it would be nice to make it more specific than just k8s_event_lag_seconds, for example k8s_pod_create_latency_seconds or something?

Also, how exactly do restored endpoints impact the metric? I get the impression that the time is set during restore but I don't follow exactly how the metric might be updated/influenced by restored endpoints.

Otherwise LGTM.

@aanm
Copy link
Member Author

aanm commented Oct 22, 2020

Yeah, I think it would be nice to make it more specific than just k8s_event_lag_seconds, for example k8s_pod_create_latency_seconds or something?

Also, how exactly do restored endpoints impact the metric? I get the impression that the time is set during restore but I don't follow exactly how the metric might be updated/influenced by restored endpoints.

Otherwise LGTM.

@joestringer I think it's still fair to assume the latency of Kubernetes starts to "count" as soon we restore an endpoint.

@pchaigno
Copy link
Member

But the K8s watcher won't receive anything for the restored endpoints because the corresponding pods already exist, right? So the create time for restored endpoints probably doesn't matter unless we have a very serious lag and Cilium manages to restart before the k8s event is received.

@christarazi christarazi added the dont-merge/blocked Another PR must be merged before this one. label Oct 23, 2020
@aanm
Copy link
Member Author

aanm commented Oct 23, 2020

But the K8s watcher won't receive anything for the restored endpoints because the corresponding pods already exist, right? So the create time for restored endpoints probably doesn't matter unless we have a very serious lag and Cilium manages to restart before the k8s event is received.

The watchers receives all pods regardless of the state of the local endpoints. I didn't check but I think the watcher receives the pods before we restore.

Since Cilium receives CNI events when a pod is created, Cilium can
calculate the lag for kube-apiserver events by checking the time an
ADD event for that Pod was received and subtracting by the time the CNI
event for that pod was received.

Signed-off-by: André Martins <andre@cilium.io>
@aanm aanm force-pushed the pr/correlate-delay-k8s-events branch from 3d5440f to b134bc9 Compare October 23, 2020 09:13
@aanm aanm removed the dont-merge/blocked Another PR must be merged before this one. label Oct 23, 2020
@pchaigno
Copy link
Member

But the K8s watcher won't receive anything for the restored endpoints because the corresponding pods already exist, right? So the create time for restored endpoints probably doesn't matter unless we have a very serious lag and Cilium manages to restart before the k8s event is received.

The watchers receives all pods regardless of the state of the local endpoints. I didn't check but I think the watcher receives the pods before we restore.

Indeed, we are blocking on the initial update of pods from K8s in InitK8sSubsystem() before continuing with the restoration of endpoints.

@maintainer-s-little-helper maintainer-s-little-helper bot moved this from Needs backport from master to Backport pending to v1.7 in 1.7.11 Oct 23, 2020
@maintainer-s-little-helper maintainer-s-little-helper bot moved this from Needs backport from master to Backport pending to v1.7 in 1.7.11 Oct 23, 2020
@maintainer-s-little-helper maintainer-s-little-helper bot moved this from Backport pending to v1.7 to Backport done to v1.7 in 1.7.11 Oct 24, 2020
@maintainer-s-little-helper maintainer-s-little-helper bot moved this from Backport pending to v1.7 to Backport done to v1.7 in 1.7.11 Oct 24, 2020
@maintainer-s-little-helper maintainer-s-little-helper bot moved this from Needs backport from master to Backport pending to v1.9 in 1.9.0-rc3 Oct 26, 2020
@maintainer-s-little-helper maintainer-s-little-helper bot moved this from Backport pending to v1.9 to Backport done to v1.9 in 1.9.0-rc3 Oct 27, 2020
@maintainer-s-little-helper maintainer-s-little-helper bot moved this from Needs backport from master to Backport pending to v1.8 in 1.8.5 Oct 27, 2020
@maintainer-s-little-helper maintainer-s-little-helper bot moved this from Backport pending to v1.8 to Backport done to v1.8 in 1.8.5 Oct 28, 2020
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
release-note/minor This PR changes functionality that users may find relevant to operating Cilium.
Projects
No open projects
1.7.11
Backport done to v1.7
1.8.5
Backport done to v1.8
1.9.0-rc3
Backport done to v1.9
Development

Successfully merging this pull request may close these issues.

Correlate Kubernetes events with CNI events
6 participants