pkg/endpoint: calculate Kube API-Server lag from pod events #13702
test-me-please
LGTM, a few minor changes.
pkg/metrics/metrics.go
Outdated
    EventLagK8s = prometheus.NewGauge(prometheus.GaugeOpts{
        Namespace:   Namespace,
        Name:        "event_lag_seconds",
        Help:        "Lag (computed value) for Kubernetes events",
        ConstLabels: prometheus.Labels{"source": LabelEventSourceK8s},
    })
Have you run this through promtool?
Also, I think we can add more context to the description of this metric. Since this is the lag between receiving the CNI ADD and then seeing the pod in the cache, let's say so explicitly so it is clear for readers.
I haven't but our smoke tests do :)
test-me-please
19a682b to 3d5440f
pkg/metrics/metrics.go
Outdated
    EventLagK8s = prometheus.NewGauge(prometheus.GaugeOpts{
        Namespace:   Namespace,
        Name:        "event_lag_seconds",
        Help:        "Lag (computed value) for Kubernetes events",
        ConstLabels: prometheus.Labels{"source": LabelEventSourceK8s},
    })
Let's update this description per my other comment as soon as this passes CI.
(Actually we might want to rename the variable if we update the description to be specific to CNI ADD -> pod cache lag time.)
Yeah, I think it would be nice to make it more specific than just k8s_event_lag_seconds, for example k8s_pod_create_latency_seconds or something?
Also, how exactly do restored endpoints impact the metric? I get the impression that the time is set during restore but I don't follow exactly how the metric might be updated/influenced by restored endpoints.
Otherwise LGTM.
@joestringer I think it's still fair to assume the latency of Kubernetes starts to "count" as soon as we restore an endpoint.
But the K8s watcher won't receive anything for the restored endpoints because the corresponding pods already exist, right? So the create time for restored endpoints probably doesn't matter unless we have a very serious lag and Cilium manages to restart before the k8s event is received.
The watcher receives all pods regardless of the state of the local endpoints. I didn't check, but I think the watcher receives the pods before we restore.
3d5440f to b134bc9
Indeed, we are blocking on the initial update of pods from K8s in |
Since Cilium receives CNI events when a pod is created, Cilium can
calculate the lag for kube-apiserver events by taking the time an
ADD event for that pod was received and subtracting the time the CNI
event for that pod was received.
Signed-off-by: André Martins <andre@cilium.io>
Fixes: #13679