Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Export metrics from agent #2127

Merged
merged 5 commits into from
Nov 29, 2017
Merged

Export metrics from agent #2127

merged 5 commits into from
Nov 29, 2017

Conversation

raybejjani
Copy link
Contributor

@raybejjani raybejjani commented Nov 23, 2017

The agent can export prometheus metrics if configured. This isn't complete but illustrates the basic parts of the metrics collection and export.

Things to consider

  1. Should we serve metrics under the agent API?
    This might be bad, since we'd be mixing concerns. I also don't know if prometheus can collect from a socket without a custom collector.
    Some of the things we want to collect in Expose runtime metrics for use in monitoring system #1935 aren't amenable as metrics, but make for decent statistics to present. I'll likely add a cilium metrics command that ships back all the metrics we collect, along with the things in Expose runtime metrics for use in monitoring system #1935 that aren't easy to export.

  2. It seems like there are annotations to tell k8s to slurp up the metrics from a pod. I will try to include that in our examples.

  3. We collect the go runtime metrics. The might end up being too much noise.

  4. I need to figure out if we can tag all metrics from a node by a node-id of some sort. This might be bad for prometheus, however, since it creates a new date series per label (well, per dynamic one, the static ones may not).

Sample metrics (in the prometheus text format, I think)

vagrant@cilium-k8s-master:~/go/src/github.com/cilium/cilium$ curl -s localhost:9090/metrics | grep cilium
# HELP cilium_endpoints Number of endpoints managed by this agent
# TYPE cilium_endpoints gauge
cilium_endpoints 2
# HELP cilium_endpoints_regenerating Number of endpoints currently regenerating
# TYPE cilium_endpoints_regenerating gauge
cilium_endpoints_regenerating 0
# HELP cilium_endpoints_regenerations Count of all endpoint regenerations that have completed, tagged by outcome
# TYPE cilium_endpoints_regenerations counter
cilium_endpoints_regenerations{outcome="succeeded"} 6
# HELP cilium_policies Number of policies currently loaded
# TYPE cilium_policies gauge
cilium_policies 0
# HELP cilium_policies_import_errors Number of times a policy import has failed
# TYPE cilium_policies_import_errors counter
cilium_policies_import_errors 0
# HELP cilium_policies_max_revision Highest policy revision number in the agent
# TYPE cilium_policies_max_revision counter
cilium_policies_max_revision 4

Partially fixes: #1935

cilium-agent collects and serves prometheus metrics

How To Test
Run agent with --prometheus-serve-addr ":9090" (this will bind to all interfaces, you can pass in an address directly"

@raybejjani raybejjani added area/daemon Impacts operation of the Cilium daemon. release-note/minor This PR changes functionality that users may find relevant to operating Cilium. wip labels Nov 23, 2017
@raybejjani raybejjani requested a review from a team November 23, 2017 15:32
@raybejjani raybejjani requested review from a team as code owners November 23, 2017 15:32
@raybejjani raybejjani requested a review from a team November 23, 2017 15:32
@raybejjani raybejjani requested a review from a team as a code owner November 23, 2017 15:32
log "github.com/sirupsen/logrus"
)

func Enable(addr string) error {

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

exported function Enable should have comment or be unexported

// ConstLabels: prometheus.Labels{"node": "a node ID, from k8s?",
})

NumEndpointsRegenerating = prometheus.NewGauge(prometheus.GaugeOpts{

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

exported var NumEndpointsRegenerating should have comment or be unexported

// This is the namespace presented to prometheus and should be scoped to us
Namespace = "cilium"

// Endpoints

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

comment on exported var NumEndpoints should be of the form "NumEndpoints ..."

)

var (
// This is the namespace presented to prometheus and should be scoped to us

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

comment on exported var Namespace should be of the form "Namespace ..."

log "github.com/sirupsen/logrus"
)

func Enable(addr string) error {

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

exported function Enable should have comment or be unexported

// ConstLabels: prometheus.Labels{"node": "a node ID, from k8s?",
})

NumEndpointsRegenerating = prometheus.NewGauge(prometheus.GaugeOpts{

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

exported var NumEndpointsRegenerating should have comment or be unexported

// This is the namespace presented to prometheus and should be scoped to us
Namespace = "cilium"

// Endpoints

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

comment on exported var NumEndpoints should be of the form "NumEndpoints ..."

)

var (
// This is the namespace presented to prometheus and should be scoped to us

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

comment on exported var Namespace should be of the form "Namespace ..."

log "github.com/sirupsen/logrus"
)

func Enable(addr string) error {

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

exported function Enable should have comment or be unexported

// ConstLabels: prometheus.Labels{"node": "a node ID, from k8s?",
})

NumEndpointsRegenerating = prometheus.NewGauge(prometheus.GaugeOpts{

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

exported var NumEndpointsRegenerating should have comment or be unexported

// This is the namespace presented to prometheus and should be scoped to us
Namespace = "cilium"

// Endpoints

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

comment on exported var NumEndpoints should be of the form "NumEndpoints ..."

)

var (
// This is the namespace presented to prometheus and should be scoped to us

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

comment on exported var Namespace should be of the form "Namespace ..."

Help: "Number of endpoints currently regenerating",
})

// CountEndpointsRegeneration is a count of the number of times any endpoint

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

comment on exported var CountEndpointsRegenerations should be of the form "CountEndpointsRegenerations ..."

var (
registry = prometheus.NewRegistry()

// This is the namespace presented to prometheus and should be scoped to us

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

comment on exported var Namespace should be of the form "Namespace ..."

@raybejjani
Copy link
Contributor Author

raybejjani commented Nov 27, 2017

Righto. I have, finally, tested this with a real prometheus deployment. The basic metrics export works and should be reviewed. I will follow up with a PR that includes docs, example prometheus configuration and code changes that might come up.

For the numbered points I mention above

  1. Should we serve metrics under the agent API?

After some offline discussion, the answer is no.

  1. It seems like there are annotations to tell k8s to slurp up the metrics from a pod. I will try to include that in our examples.

This is less automatic but it boils down to: prometheus can be configured to match k8s nodes/pods/deploys etc. and collect metrics from matching entities. In our case, this is easy to setup when running as a DaemonSet but a little more complicated as a host-agent. In both cases, the prometheus configuration can further tag/annotate these metrics.

  1. We collect the go runtime metrics. The might end up being too much noise.

I disabled this since it cannot be namespaced. This might be better if toggled by the debug flag but I'll leave it disabled for now.

  1. I need to figure out if we can tag all metrics from a node by a node-id of some sort.

This isn't necessary since the prometheus collector can/will tag these metrics. This also leaves this decision up to the user.

@@ -258,6 +259,7 @@ func (d *Daemon) EnableK8sWatcher(reSyncPeriod time.Duration) error {
reSyncPeriod,
cache.ResourceEventHandlerFuncs{
AddFunc: func(obj interface{}) {
metrics.SetTSValue(metrics.LastK8sEventTS, time.Now())
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm not terribly happy adding this to each and every one of these functions. I'm not sure if there is a better place for it. It can go in the actual handler function but it seemed best to do it closest to the event it captured (if we can't copy the objects we still want to record that we saw an event).

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think that's fine. We may have a generic controller soon to handle all events where this could be moved to eventually.

@cilium cilium deleted a comment from houndci-bot Nov 27, 2017
@cilium cilium deleted a comment from houndci-bot Nov 27, 2017
@cilium cilium deleted a comment from houndci-bot Nov 27, 2017
@@ -113,3 +113,9 @@ github.com/facebookgo/pidfile f242e2999868dcd267a2b86e49ce1f9cf9e15b16
github.com/facebookgo/atomicfile 2de1f203e7d5e386a6833233882782932729f27e
github.com/cpuguy83/go-md2man 1d903dcb749992f3741d744c0f8376b4bd7eb3e1
github.com/russross/blackfriday 6d1ef893fcb01b4f50cb6e57ed7df3e2e627b6b2
github.com/prometheus/client_golang v0.9.0-pre1
github.com/prometheus/common 2e54d0b93cba2fd133edc32211dcc32c06ef72ca
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It was really surprising but the prometheus internal dependencies seemed to not have versions that matched the client. I'm not sure how best to track this but I picked master on the day I committed this :/

// NumPolicies is the number of policies loaded into the agent
NumPolicies = prometheus.NewGauge(prometheus.GaugeOpts{
Namespace: Namespace,
Name: "policies",
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I suggest to just name this policy to be consistent, we never call it policies.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

done. I also renamed endpoint and event to be consistent all around :)

@@ -258,6 +259,7 @@ func (d *Daemon) EnableK8sWatcher(reSyncPeriod time.Duration) error {
reSyncPeriod,
cache.ResourceEventHandlerFuncs{
AddFunc: func(obj interface{}) {
metrics.SetTSValue(metrics.LastK8sEventTS, time.Now())
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think that's fine. We may have a generic controller soon to handle all events where this could be moved to eventually.

func SetTSValue(c prometheus.Gauge, ts time.Time) {
// Build time in seconds since the epoch. Prometheus only takes floating
// point values, however, and urges times to be in seconds
scaled_ts := float64(ts.UnixNano()) / float64(1000000000)

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

don't use underscores in Go names; var scaled_ts should be scaledTs

@raybejjani
Copy link
Contributor Author

Ok, redone. I added event tracking under one metrics for k8s, containerd and API (aka cli). I renamed some of the variables and metric names to be consistent with <general concept><thing recorded><specific stuff>. Alas, this pulled in more reviewers :/

Copy link
Member

@aanm aanm left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This looks really cool. I hope adding more metrics will be as easy as it looked while reading this PR :)

Small note, perhaps we should start thinking on having methods to increment the policy revision and to everything in the Prometheus increment and the policy revision increment in a single place.

@@ -0,0 +1,45 @@
package metrics
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Missing copywrite header

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

fixed

metrics.EndpointCountRegenerating.Inc()
defer func() {
metrics.EndpointCountRegenerating.Dec()
tags := map[bool]string{true: "succeeded", false: "failed"}
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why not use a simply if statement?

if retErr != nil {
   metrics.CountEndpointsRegenerations.WithLabelValues("failed").Inc()
} else {
   metrics.CountEndpointsRegenerations.WithLabelValues("succeeded").Inc()
}

Do you need to set up a constant for succeeded and failed or not necessary?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Good point. I made constants for the values... I'm not sure if I should also make constants for the field names...

Signed-off-by: Ray Bejjani <ray@covalent.io>
We use the prometheus client to present a pull-based metrics endpoint.
The default collector also exports go runtime information.

Signed-off-by: Ray Bejjani <ray@covalent.io>
// LabelValueOutcomeSuccess is used as a successful outcome of an operation
LabelValueOutcomeSuccess = "success"

// LabelValueOutcomeSuccess is used as an unsuccessful outcome of an operation

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

comment on exported var LabelValueOutcomeFail should be of the form "LabelValueOutcomeFail ..."

Signed-off-by: Ray Bejjani <ray@covalent.io>
Signed-off-by: Ray Bejjani <ray@covalent.io>
Signed-off-by: Ray Bejjani <ray@covalent.io>
@raybejjani
Copy link
Contributor Author

An increment function is a good idea. I don't remember why, but I did think about it and decided not to try in this PR. It might be that the increments are done in different locking contexts but that should still be ok. I can make a followup PR to do that.

@raybejjani raybejjani requested a review from aanm November 29, 2017 12:19
@raybejjani raybejjani added the ready-to-merge This PR has passed all tests and received consensus from code owners to merge. label Nov 29, 2017
@tgraf
Copy link
Member

tgraf commented Nov 29, 2017

yay!

@tgraf tgraf merged commit ae769bd into master Nov 29, 2017
@tgraf tgraf deleted the metrics branch November 29, 2017 16:02
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
area/daemon Impacts operation of the Cilium daemon. ready-to-merge This PR has passed all tests and received consensus from code owners to merge. release-note/minor This PR changes functionality that users may find relevant to operating Cilium.
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Expose runtime metrics for use in monitoring system
4 participants