Add metrics output to docker #25820

Merged
merged 1 commit into docker:master from crosbymichael:prom on Oct 27, 2016
@crosbymichael
Member

This adds metrics support to docker via the prometheus output. There is a new /metrics endpoint where the prometheus handler is installed.

This adds metrics to the daemon package for basic container, image, and daemon operations.

The client package being used is located here:

https://github.com/docker/go-metrics

This PR is the beginning; many more packages need to be instrumented, but help/suggestions from the individual subsystem maintainers are needed to collect relevant and useful information.
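For orientation, a minimal sketch of how that wiring looks with docker/go-metrics (the namespace, metric, and address below are illustrative, not the daemon's actual configuration):

package main

import (
    "log"
    "net/http"

    metrics "github.com/docker/go-metrics"
)

func main() {
    // Create a namespace, add a metric to it, then register the namespace so
    // the Prometheus handler exports it.
    ns := metrics.NewNamespace("engine", "demo", nil)
    starts := ns.NewCounter("starts", "The number of times the demo has started")
    metrics.Register(ns)
    starts.Inc()

    // metrics.Handler() serves the Prometheus text exposition format.
    http.Handle("/metrics", metrics.Handler())
    log.Fatal(http.ListenAndServe("127.0.0.1:9323", nil)) // address is illustrative
}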

@jhorwit2 jhorwit2 and 1 other commented on an outdated diff Aug 18, 2016
daemon/metrics.go
+ "image": c.Config.Image,
+ })
+
+ var (
+ cpuTime = ns.NewLabeledGauge("cpu_usage", "The number of nanoseconds a container uses in cpu time", metrics.Nanoseconds, "type")
+ memoryUsage = ns.NewLabeledGauge("memory_usage", "The number of bytes a container uses in memory", metrics.Bytes, "type")
+ memoryFailCount = ns.NewGauge("memory_failcount", "The number of times the container has reached the memory limit", metrics.Total)
+ pidCount = ns.NewLabeledGauge("pids", "The number of pids running inside a container", metrics.Total, "type")
+ networkRxPackets = ns.NewLabeledGauge("network_rx", "The number of packets received by the container", metrics.Total, "type", "interface")
+ networkTxPackets = ns.NewLabeledGauge("network_tx", "The number of packets transmitted by the container", metrics.Total, "type", "interface")
+ networkRxBytes = ns.NewLabeledGauge("network_rx", "The number of bytes received by the container", metrics.Bytes, "interface")
+ networkTxBytes = ns.NewLabeledGauge("network_tx", "The number of bytes transmitted by the container", metrics.Bytes, "interface")
+ )
+ metrics.Register(ns)
+
+ update := daemon.subscribeToContainerStats(c)
@jhorwit2
jhorwit2 Aug 18, 2016 Contributor

Thoughts on subscribing to the health events for this container and keeping counters? This way you can track/alert containers that are flapping.

@crosbymichael
crosbymichael Aug 22, 2016 Member

Sounds like a good idea. I was trying to keep the scope small at this point to make sure we have the basics monitored before adding more metrics. This is just the first metrics PR, we will expand the scope after to other packages/etc.

@jhorwit2 jhorwit2 and 2 others commented on an outdated diff Aug 18, 2016
daemon/daemon.go
@@ -668,6 +687,7 @@ func (daemon *Daemon) shutdownContainer(c *container.Container) error {
}
c.WaitStop(-1 * time.Second)
+ daemonActions.WithValues("shutdown_container").UpdateSince(start)
@jhorwit2
jhorwit2 Aug 18, 2016 Contributor

Why only include a counter on create & start but not shutdown or delete?

@LK4D4
LK4D4 Sep 26, 2016 Contributor

Yup, I agree. Might be useful to track container balance.

@brian-brazil
brian-brazil Oct 11, 2016

If you want to track how many containers are on each machine, add a gauge to keep track. Doing it off counters can be slightly inaccurate.
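A rough sketch of that approach (the gauge name and call sites are illustrative, not part of this PR, and it assumes go-metrics' Gauge exposes Inc/Dec like the Prometheus gauge it wraps; ns is the engine/daemon namespace from the diff above):

containersRunning := ns.NewGauge("containers_running", "The number of containers currently running", metrics.Total)

// on a successful start:
containersRunning.Inc()
// when the container exits or is removed:
containersRunning.Dec()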

@jhorwit2 jhorwit2 and 2 others commented on an outdated diff Aug 18, 2016
daemon/health.go
@@ -156,6 +157,7 @@ func monitor(d *Daemon, c *container.Container, stop chan struct{}, probe probe)
ctx, cancelProbe := context.WithTimeout(context.Background(), probeTimeout)
results := make(chan *types.HealthcheckResult)
go func() {
+ healthCheckCounter.Inc()
@jhorwit2
jhorwit2 Aug 18, 2016 Contributor

Thoughts on breaking this up into healthy & unhealthy result counters? You can still get the current value, but also get a little more detailed info if you want it. This metric/comment is kinda moot though if you add health status counters to each container because you can just aggregate.

@crosbymichael
crosbymichael Aug 22, 2016 Member

Sounds good. Do you want to try to implement this and see how it looks? I'm not sure the best way to structure that but if you have an idea what you are looking for then let me know.

@jhorwit2
jhorwit2 Aug 23, 2016 Contributor

The end of handleProbeResult seems like a good spot.

Can add

if h.Status == types.Healthy {
    healthyCheckCounter.Inc()
} else {
    unhealthyCheckCounter.Inc()
}

to the end of the method.

@brian-brazil
brian-brazil Oct 11, 2016

The convention is to have a total and a failed counter, it's easier to work with.
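In other words, roughly (a sketch of that convention with hypothetical counter names, placed at the end of handleProbeResult as suggested above):

healthChecksCounter.Inc() // always count the check
if h.Status != types.Healthy {
    healthChecksFailedCounter.Inc() // additionally count failures
}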

@jhorwit2 jhorwit2 and 2 others commented on an outdated diff Aug 18, 2016
daemon/metrics.go
+ "cpus",
+ "graph_driver",
+ "kernel",
+ "memory",
+ "os",
+ )
+ containerCounter = ns.NewLabeledCounter("container", "The total number of containers processed", "action")
+ healthCheckCounter = ns.NewCounter("health_check", "The total number of health checks executed")
+ imageActions = ns.NewLabeledTimer("image_actions", "The number of seconds it takes to process each image action", "action")
+ metrics.Register(ns)
+}
+
+func (daemon *Daemon) startMetrics(c *container.Container) {
+ ns := metrics.NewNamespace("container", "stat", metrics.Labels{
+ "id": c.ID,
+ "image": c.Config.Image,
@jhorwit2
jhorwit2 Aug 18, 2016 Contributor

Why not merge container labels into this? Seems useful for querying.

@crosbymichael
crosbymichael Aug 22, 2016 Member

It is useful, but I worry about overloading the metrics with anything that people place on a container. If we are writing a lot of datapoints and containers have a large number of labels with large values, this could slow down a lot of things.

How do you think we could add container labels to metrics without it getting out of control?

@jhorwit2
jhorwit2 Aug 23, 2016 Contributor

In theory, yeah, a large # of labels w/ large values could become an issue across a lot of containers, but is that actually a common thing? It's not something I have encountered.

@stevvooe
stevvooe Aug 27, 2016 Contributor

@crosbymichael @jhorwit2 Check out how cadvisor does this: https://gist.github.com/stevvooe/0760f84b3c5242af8c65b3832794b816.

We probably only need to output the labels on a single metric and then can join on container id.

@jhorwit2
jhorwit2 Aug 27, 2016 Contributor

I'm not following. Cadvisor includes the container labels on every metric for a given container. That's where I got my original comment about merging them in. What do you mean by "join on container id"?

@stevvooe
stevvooe Aug 27, 2016 Contributor

@jhorwit2 There is a trick in prometheus where you can join a metric based on some parameter. In this case, we increment a static counter that contains the metadata (labels, name, etc.) and the container id. The rest of the metrics we simply emit with the container id. When you want to know about labels and counts, you join the metadata with the container id. This approach is generally applicable to most data models.

This allows us to reduce the volume of data being emitted per container while still binding rich metadata to the metrics.
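A sketch of that join trick, using the plain Prometheus Go client for illustration (go-metrics wraps the same primitives; every name and label value here is made up):

package main

import "github.com/prometheus/client_golang/prometheus"

var (
    // A constant "info"-style metric carries the per-container metadata once,
    // keyed by container id.
    containerInfo = prometheus.NewGaugeVec(prometheus.GaugeOpts{
        Name: "container_info",
        Help: "Metadata about a container",
    }, []string{"id", "image", "name"})

    // The actual measurements carry only the container id.
    containerMemoryUsageBytes = prometheus.NewGaugeVec(prometheus.GaugeOpts{
        Name: "container_memory_usage_bytes",
        Help: "The number of bytes a container uses in memory",
    }, []string{"id"})
)

func main() {
    prometheus.MustRegister(containerInfo, containerMemoryUsageBytes)

    containerInfo.WithLabelValues("abc123", "redis:3.2", "cache").Set(1)
    containerMemoryUsageBytes.WithLabelValues("abc123").Set(64 << 20)

    // In PromQL the metadata is joined back on id when needed, e.g.:
    //   container_memory_usage_bytes * on(id) group_left(image, name) container_info
}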

@jhorwit2 jhorwit2 and 2 others commented on an outdated diff Aug 18, 2016
daemon/metrics.go
+)
+
+var (
+ containerActions metrics.LabeledTimer
+ imageActions metrics.LabeledTimer
+ containerCounter metrics.LabeledCounter
+ daemonActions metrics.LabeledTimer
+ networkActions metrics.LabeledTimer
+ engineVersion metrics.LabeledGauge
+ healthCheckCounter metrics.Counter
+)
+
+func init() {
+ ns := metrics.NewNamespace("engine", "daemon", metrics.Labels{
+ "version": dockerversion.Version,
+ "commit": dockerversion.GitCommit,
@jhorwit2
jhorwit2 Aug 18, 2016 edited Contributor

Add daemon name?

edit: this would be useful on all the namespaces.

@crosbymichael
crosbymichael Aug 22, 2016 Member

The daemon name can be placed in the scrape config for prometheus. I would probably not want to add it to the metrics.

@stevvooe
stevvooe Aug 27, 2016 Contributor

I think we don't need to put these on every metric. We just increment a counter once with these then join on instance, which is unique to the process.

@jhorwit2
jhorwit2 Aug 29, 2016 Contributor

@stevvooe going off the prometheus label join trick you mentioned, couldn't you omit version/commit on the namespace & add it to the "version" gauge a few lines below? That way you aren't emitting these 2 values on every engine/daemon metric.

@jhorwit2
Contributor

Thoughts on adding an expvar collector if the daemon is in debug mode?

❤️ this PR! 😃

@jhorwit2 jhorwit2 referenced this pull request in docker/swarmkit Aug 18, 2016
Open

Proposal: Service/Task Stats #1284

@stevvooe stevvooe commented on the diff Aug 19, 2016
vendor/src/github.com/docker/go-metrics/timer.go
+func StartTimer(timer Timer) (done func()) {
+ start := time.Now()
+ return func() {
+ timer.Update(time.Since(start))
+ }
+}
+
+// Timer is a metric that allows collecting the duration of an action in seconds
+type Timer interface {
+ // Update records an observation, duration, and converts to the target
+ // units.
+ Update(duration time.Duration)
+
+ // UpdateSince will add the duration from the provided starting time to the
+ // timer's summary with the precision that was used in creation of the timer
+ UpdateSince(time.Time)
@stevvooe
stevvooe Aug 19, 2016 Contributor

StartTimer costs a closure allocation. This interface is the most performant, but simple stuff can use the sugar.
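For reference, the two usage patterns being contrasted (containerActions is the labeled timer from daemon/metrics.go; "create" is an illustrative action):

// sugar: one closure allocation per call
done := metrics.StartTimer(containerActions.WithValues("create"))
defer done()

// direct: no allocation, as used in the daemon code in this PR
start := time.Now()
// ... do the work ...
containerActions.WithValues("create").UpdateSince(start)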

@crosbymichael
Member

@xbglowx you might be interested in this PR

@cpuguy83
Contributor

So cool design LGTM

@discordianfish
Contributor

@crosbymichael Ha, nice! You might want to look into #9130 for what naming I proposed. That follows the prometheus best practices better. If in doubt, see: https://prometheus.io/docs/practices/naming/
Would be really great if those metrics followed the best practices.

@fabxc You might be interested in this as well.

@stevvooe
Contributor

Would be really great if those metrics followed the best practices.

@discordianfish We've read through this in detail, a few times now, and we think we are following them. It would be more constructive if you could identify the areas where we don't follow best practices or suggest how making them different could help out.

@jhorwit2 jhorwit2 and 1 other commented on an outdated diff Aug 28, 2016
daemon/metrics.go
+ networkActions = ns.NewLabeledTimer("network_actions", "The number of seconds it takes to process each network action", "action")
+ engineVersion = ns.NewLabeledGauge("version", "The version and commit information for the engine process", metrics.Unit("info"),
+ "architecture",
+ "cpus",
+ "graph_driver",
+ "kernel",
+ "memory",
+ "os",
+ )
+ containerCounter = ns.NewLabeledCounter("container", "The total number of containers processed", "action")
+ healthCheckCounter = ns.NewCounter("health_check", "The total number of health checks executed")
+ imageActions = ns.NewLabeledTimer("image_actions", "The number of seconds it takes to process each image action", "action")
+ metrics.Register(ns)
+}
+
+func (daemon *Daemon) startMetrics(c *container.Container) {
@jhorwit2
jhorwit2 Aug 28, 2016 Contributor

Not sure if you are aware, but this isn't called anywhere.

@crosbymichael
crosbymichael Aug 31, 2016 Member

Ya, I removed it for now until we have multiple registries.

@crosbymichael crosbymichael added this to the 1.13.0 milestone Aug 31, 2016
@LK4D4 LK4D4 was assigned by crosbymichael Aug 31, 2016
@discordianfish
Contributor

@stevvooe I referred to my PR for specific naming suggestions. Besides that, from quickly skimming through:

  • Standardize on seconds everywhere
  • Use self-explanatory metric names (not network_tx but network_tx_packets for example, you could also try to use similar/same names as in node_exporter or container_exporter)

Since we're migrating away from Docker, I can't really put much more time in this though. Still, if you have specific questions I'm happy to answer.

@icecrime
Member
icecrime commented Sep 9, 2016

Ping @LK4D4: PTAL!

@stevvooe stevvooe and 2 others commented on an outdated diff Sep 13, 2016
daemon/metrics.go
+ "os",
+ )
+ containerCounter = ns.NewLabeledCounter("container", "The total number of containers processed", "action")
+ healthCheckCounter = ns.NewCounter("health_check", "The total number of health checks executed")
+ imageActions = ns.NewLabeledTimer("image_actions", "The number of seconds it takes to process each image action", "action")
+ metrics.Register(ns)
+}
+
+func (daemon *Daemon) startMetrics(c *container.Container) {
+ ns := metrics.NewNamespace("container", "stat", metrics.Labels{
+ "id": c.ID,
+ "image": c.Config.Image,
+ })
+
+ var (
+ cpuTime = ns.NewLabeledGauge("cpu_usage", "The number of nanoseconds a container uses in cpu time", metrics.Nanoseconds, "type")
@stevvooe
stevvooe Sep 13, 2016 Contributor

@crosbymichael The recommendation from @discordianfish was to standardize on using seconds. Should we update go-metrics to follow that standard?

@crosbymichael
crosbymichael Sep 13, 2016 Member

For this, everyone expects cpu time in nanoseconds; that is what all stat and cgroup calls return. I don't know if it makes sense to do it here or not. What do you think?

@stevvooe
stevvooe Sep 13, 2016 Contributor

I'm not a huge fan of the units type. I don't think people will take time to make a considered choice in practice, leaving us with inconsistent counters throughout the project.

However, I do see your point.

In this case, I am really not sure, since this is actually a gauge (may need to be a counter). We don't really get a guaranteed conversion here, since it will still be incremented as a float.

@dqminh
dqminh Sep 27, 2016 Contributor

Yah, I think this should be a counter.
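If this were standardised on seconds as a counter, the conversion would be a single division at observation time; a rough sketch (the stats field path and the metric it feeds are illustrative):

// cgroups report cumulative CPU time in nanoseconds; expose it in seconds.
nanos := stats.CPUStats.CPUUsage.TotalUsage
cpuSeconds := float64(nanos) / float64(time.Second)
// ...then either Set() it on a gauge or Add() the delta to a counter.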

@LK4D4
Contributor
LK4D4 commented Sep 26, 2016

@crosbymichael I wonder how heavy all the registrations in init are. Reexec is still used in the overlay2 storage driver, in chrootarchive, and in multiple places in libnetwork.

@LK4D4
Contributor
LK4D4 commented Sep 27, 2016

OK, seems like it is quite heavy, but that's the supposed way to use the prometheus client.
@crosbymichael this needs a rebase.

@crosbymichael
Member

Rebased and placed this feature behind the experimental flag.

@LK4D4 for the initialization, these counters are not heavy at all and have no out-of-process calls or allocations. They are just initializing types, and they are small.

@LK4D4
Contributor
LK4D4 commented Sep 29, 2016

@crosbymichael DockerDaemonSuite.TestBuildOnDisabledBridgeNetworkDaemon appears to be real.

@crosbymichael
Member

It looks like a panic in the network code is causing the tests to fail:

2016-09-30 16:43:08.394541 I | http: panic serving @: assignment to entry in nil map
goroutine 27300 [running]:
net/http.(*conn).serve.func1(0xc420bf9d00)
    /usr/local/go/src/net/http/server.go:1491 +0x12a
panic(0x161caa0, 0xc421064720)
    /usr/local/go/src/runtime/panic.go:458 +0x243
github.com/docker/docker/daemon.(*Daemon).updateNetworkSettings(0xc420368540, 0xc4211cd860, 0x2274000, 0xc421300a80, 0xc420176240, 0xc421064700, 0x0)
    /go/src/github.com/docker/docker/daemon/container_operations.go:275 +0x26d
github.com/docker/docker/daemon.(*Daemon).updateNetworkConfig(0xc420368540, 0xc4211cd860, 0x2274000, 0xc421300a80, 0xc420176240, 0x2274001, 0xc421300a80, 0x0)
    /go/src/github.com/docker/docker/daemon/container_operations.go:632 +0x1f0
github.com/docker/docker/daemon.(*Daemon).connectToNetwork(0xc420368540, 0xc4211cd860, 0xc4208b7b55, 0xc, 0xc420176240, 0x1, 0x0, 0x0)
    /go/src/github.com/docker/docker/daemon/container_operations.go:667 +0x17f
github.com/docker/docker/daemon.(*Daemon).ConnectToNetwork(0xc420368540, 0xc4211cd860, 0xc4208b7b55, 0xc, 0xc420176240, 0x0, 0xc42146fd40)
    /go/src/github.com/docker/docker/daemon/container_operations.go:937 +0x321
github.com/docker/docker/daemon.(*Daemon).ConnectContainerToNetwork(0xc420368540, 0xc4210646a8, 0x6, 0xc4208b7b55, 0xc, 0xc420176240, 0x30, 0xc4208eb980)
    /go/src/github.com/docker/docker/daemon/network.go:343 +0x88
github.com/docker/docker/api/server/router/network.(*networkRouter).postNetworkConnect(0xc4207421b0, 0x226dbe0, 0xc4208eb980, 0x226a560, 0xc420bf36c0, 0xc4215ac1e0, 0xc4208eb6b0, 0xc421064601, 0xc4208eb980)
    /go/src/github.com/docker/docker/api/server/router/network/network_routes.go:117 +0x1e1
github.com/docker/docker/api/server/router/network.(*networkRouter).(github.com/docker/docker/api/server/router/network.postNetworkConnect)-fm(0x226dbe0, 0xc4208eb980, 0x226a560, 0xc420bf36c0, 0xc4215ac1e0, 0xc4208eb6b0, 0x19, 0x2259ce0)
    /go/src/github.com/docker/docker/api/server/router/network/network.go:38 +0x69
github.com/docker/docker/api/server/middleware.VersionMiddleware.WrapHandler.func1(0x226dbe0, 0xc4208eb890, 0x226a560, 0xc420bf36c0, 0xc4215ac1e0, 0xc4208eb6b0, 0x2, 0x2255820)
    /go/src/github.com/docker/docker/api/server/middleware/version.go:49 +0x754
github.com/docker/docker/api/server/middleware.UserAgentMiddleware.WrapHandler.func1(0x226db60, 0xc420012d90, 0x226a560, 0xc420bf36c0, 0xc4215ac1e0, 0xc4208eb6b0, 0x0, 0x0)
    /go/src/github.com/docker/docker/api/server/middleware/user_agent.go:45 +0x272
github.com/docker/docker/pkg/authorization.(*Middleware).WrapHandler.func1(0x226db60, 0xc420012d90, 0x226a560, 0xc420bf36c0, 0xc4215ac1e0, 0xc4208eb6b0, 0x0, 0xc421132a10)
    /go/src/github.com/docker/docker/pkg/authorization/middleware.go:34 +0x7ad
github.com/docker/docker/api/server/middleware.DebugRequestMiddleware.func1(0x226db60, 0xc420012d90, 0x226a560, 0xc420bf36c0, 0xc4215ac1e0, 0xc4208eb6b0, 0xc421132bb0, 0x8)
    /go/src/github.com/docker/docker/api/server/middleware/debug.go:53 +0x5bb
github.com/docker/docker/api/server.(*Server).makeHTTPHandler.func1(0x226a560, 0xc420bf36c0, 0xc4215ac1e0)
    /go/src/github.com/docker/docker/api/server/server.go:139 +0xc5
net/http.HandlerFunc.ServeHTTP(0xc420889960, 0x226a560, 0xc420bf36c0, 0xc4215ac1e0)
    /usr/local/go/src/net/http/server.go:1726 +0x44
github.com/gorilla/mux.(*Router).ServeHTTP(0xc42013a870, 0x226a560, 0xc420bf36c0, 0xc4215ac1e0)
    /go/src/github.com/docker/docker/vendor/src/github.com/gorilla/mux/mux.go:103 +0x255
github.com/docker/docker/api/server.(*routerSwapper).ServeHTTP(0xc4208c9320, 0x226a560, 0xc420bf36c0, 0xc4215ac1e0)
    /go/src/github.com/docker/docker/api/server/router_swapper.go:29 +0x70
net/http.serverHandler.ServeHTTP(0xc420446280, 0x226a560, 0xc420bf36c0, 0xc4215ac1e0)
    /usr/local/go/src/net/http/server.go:2202 +0x7d
net/http.(*conn).serve(0xc420bf9d00, 0x226c020, 0xc42083a840)
    /usr/local/go/src/net/http/server.go:1579 +0x4b7
created by net/http.(*Server).Serve
@crosbymichael
Member

Fixed the networking issues; they were caused by a bad rebase.

@LK4D4
Contributor
LK4D4 commented Sep 30, 2016

@crosbymichael Fail on experimental looks to be valid :(

20:53:20 FAIL: docker_cli_external_graphdriver_unix_test.go:340: DockerExternalGraphdriverSuite.TestExternalGraphDriver
20:53:20 
20:53:20 [da5a9ea1936cc] waiting for daemon to start
20:53:20 [da5a9ea1936cc] daemon started
20:53:20 [da5a9ea1936cc] exiting daemon
20:53:20 [da5a9ea1936cc] waiting for daemon to start
20:53:20 [da5a9ea1936cc] daemon started
20:53:20 [da5a9ea1936cc] exiting daemon
20:53:20 docker_cli_external_graphdriver_unix_test.go:341:
20:53:20     s.testExternalGraphDriver("test-external-graph-driver", "spec", c)
20:53:20 docker_cli_external_graphdriver_unix_test.go:381:
20:53:20     c.Assert(s.ec[ext].stats, check.Equals, 3)
20:53:20 ... obtained int = 5
20:53:20 ... expected int = 3
20:53:20 
20:53:23 
daemon/events/metrics.go
+
+func init() {
+ ns := metrics.NewNamespace("engine", "events", metrics.Labels{
+ "version": dockerversion.Version,
@Rucknar
Rucknar Oct 8, 2016 edited

@beorn7 @juliusv Please provide guidance on how to avoid including commit and version on every metric.

(from @stevvooe)

@juliusv
juliusv Oct 9, 2016

The "Target Labels should be Minimal" section from the http://www.robustperception.io/target-labels-are-for-life-not-just-for-christmas/ post particularly applies here (as these labels here are constant throughout the lifetime of the process, they wouldn't create some of the other mentioned problems).

@stevvooe
Contributor
stevvooe commented Oct 8, 2016

After discussing with the prometheus folks (@beorn7 @juliusv), we thought it would be best to bind the metrics on a separate port so that the engine metrics can be exported separately from the engine api.

@crosbymichael
Member

@stevvooe yes that is a good idea. I was thinking about that as well

@stevvooe stevvooe referenced this pull request Oct 11, 2016
Open

metrics: prometheus integration road map #27307

1 of 4 tasks complete
@brian-brazil

I think you've chosen the wrong instrumentation point here given all the duplicate code and metrics being used across files. Try instrumenting whatever is calling all these functions.

daemon/metrics.go
+ "version": dockerversion.Version,
+ "commit": dockerversion.GitCommit,
+ })
+ containerActions = ns.NewLabeledTimer("container_actions", "The number of seconds it takes to process each container action", "action")
@brian-brazil
brian-brazil Oct 11, 2016

What type of Prometheus metric is this? Prometheus client libraries don't have anything called a "LabeledTimer".

@stevvooe
stevvooe Oct 11, 2016 Contributor

It's a histogram on the backend.

daemon/metrics.go
+ networkActions = ns.NewLabeledTimer("network_actions", "The number of seconds it takes to process each network action", "action")
+ engineVersion = ns.NewLabeledGauge("version", "The version and commit information for the engine process", metrics.Unit("info"),
+ "architecture",
+ "cpus",
@brian-brazil
brian-brazil Oct 11, 2016 edited

This should be a gauge on its own, as you'll want to do math with it. Same with memory.

@crosbymichael
crosbymichael Oct 18, 2016 Member

Can you explain the tradeoff of splitting these out into separate metrics vs too many metrics? Is there such a thing as too many metrics?

It makes sense to me to split these out but I just want to understand more on the other side where we have to start considering whether or not to add a metric in the future.

@juliusv
juliusv Oct 18, 2016 edited

In this case, it's not about multiple metrics vs. one, but about having the number of CPUs as a label value vs. having it as a sample value. You can't do math in PromQL on a label value, as it's just a string (e.g. a foo_metric{cpus="4"} series with a constant 1 dummy sample value), but you can do math with the number of CPUs in the sample value. For example, you could divide CPU usage ratios by the number of CPUs on a system.

@crosbymichael
crosbymichael Oct 18, 2016 Member

But does it make sense to add these metrics for a machine or host's cpu and memory? Is that valuable enough for adding a new metric?

@juliusv
juliusv Oct 18, 2016

I would think so, yeah. It depends a bit on what you want to use the data for in the end. A metric without extra dimensions has very little cost, but is potentially useful. For example, having memory and CPU as metrics would allow you to ask "which nodes have >2 CPUs" or "which nodes have <4GB of memory". Or other math based on those values.
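A sketch of what that looks like on both ends (engineCpus is the gauge from this diff; the source of the count and the PromQL metric names in the comment are illustrative, and this assumes the go-metrics Gauge exposes Set):

// Export the CPU count as a sample value rather than a label value.
engineCpus.Set(float64(runtime.NumCPU()))

// PromQL can then do arithmetic with it, for example:
//   sum by (instance) (rate(container_cpu_usage_seconds_total[5m])) / on(instance) engine_cpus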

daemon/metrics.go
+ "memory",
+ "os",
+ )
+ containerCounter = ns.NewLabeledCounter("container", "The total number of containers processed", "action")
@brian-brazil
brian-brazil Oct 11, 2016

containers_processed_total?

@stevvooe
stevvooe Oct 12, 2016 Contributor

Counters add that automatically.

This will be <namespace>_container_total.

daemon/metrics.go
+ "os",
+ )
+ containerCounter = ns.NewLabeledCounter("container", "The total number of containers processed", "action")
+ healthCheckCounter = ns.NewCounter("health_check", "The total number of health checks executed")
@brian-brazil
brian-brazil Oct 11, 2016

Counters end in _total, and plural would read better

daemon/metrics.go
+ )
+ containerCounter = ns.NewLabeledCounter("container", "The total number of containers processed", "action")
+ healthCheckCounter = ns.NewCounter("health_check", "The total number of health checks executed")
+ imageActions = ns.NewLabeledTimer("image_actions", "The number of seconds it takes to process each image action", "action")
@brian-brazil
brian-brazil Oct 11, 2016

Put the unit (seconds) in the metric name for clarity

@stevvooe
stevvooe Oct 12, 2016 Contributor

That's automatic for timers.

@@ -0,0 +1,21 @@
+package events
@brian-brazil
brian-brazil Oct 11, 2016 edited

Creating a separate file for metrics is a smell, metrics should be treated like logging and live in the same file as what they're instrumenting.

Similarly that the same metric is being updated in multiple files is a smell, indicating that you're not instrumenting at the right place in your code. Either you shouldn't be using a label, or you should move the instrumentation up the call stack so each metric only lives in one file.

@stevvooe
stevvooe Oct 18, 2016 Contributor

What is the inferred pungency here?

Are you saying that metrics should be representative of the call tree? How do we reconcile implementation versus the exported API that metrics invariably become?

@brian-brazil
brian-brazil Oct 25, 2016

Are you saying that metrics should be representative of the call tree?

I'm saying that metrics should usually be right beside what they're instrumenting, and that a given metric should almost always live in exactly one file.

For this particular file I think it should be merged into events.go. This saves someone trying to understand the code having to jump around files.

My stance is based on years of trying to debug code that looked just like this. It gets really frustrating when you're already trying to understand a bug in code spread across 10s of functions in 10s of files, to then have to open up another set of files to find out what the available metrics are and another set of files again to see which confusing behaviour this team's metrics wrapper has.

How do we reconcile implementation versus the exported API that metrics invariably become?

I'd expect metrics to change over time as changes are made to the code. A handful are likely to end up highly depended on, so some future compromise might be required if the choice is made to make them an official API.

For Prometheus itself our stability guarantees don't include the metrics, as that'd be too constraining and effectively prevent significant development as any major internal change would likely become a breaking change to metrics. Given Docker is under rapid development, you may wish to take a similar stance.

daemon/events/metrics.go
+ "version": dockerversion.Version,
+ "commit": dockerversion.GitCommit,
+ })
+ eventsCounter = ns.NewCounter("events", "The number of events logged")
@brian-brazil
brian-brazil Oct 11, 2016

Counters end in _total by convention.

events_logged_total would be a better name.

daemon/start.go
@@ -178,6 +180,9 @@ func (daemon *Daemon) containerStart(container *container.Container, checkpoint
return fmt.Errorf("%s", errDesc)
}
+ containerActions.WithValues("start").UpdateSince(start)
+ containerCounter.WithValues("start").Inc()
@brian-brazil
brian-brazil Oct 11, 2016

This is probably redundant with containerActions; presuming that's a summary or histogram, it already tracks this as the _count.

@crosbymichael
Member

Thanks for the feedback. I'll work on updating this code today and let you all know when it's pushed for another review.

@thaJeztah
Member

@crosbymichael did you update?

@jhowardmsft
Contributor

Has anyone confirmed this can effectively be used on Windows?

@juliusv
juliusv commented Oct 21, 2016

@jhowardmsft Not a Docker expert, but at least from the Prometheus metrics side of it, there should be nothing OS-specific here. The Prometheus server that can scrape and store these metrics can also be run on Windows.

@crosbymichael
Member

@jhowardmsft yes, there should not be anything platform-specific with this, and it's serving the metrics API on a separate TCP port, so there shouldn't be any issues there for Windows.

@crosbymichael
Member

Pushed an update that puts the metrics api on a separate port. It has no default address and must be explicitly specified if the user wants metrics.

This update addresses all of the existing feedback and should be ready for another pass of reviews.

cmd/dockerd/metrics.go
+ }
+ go func() {
+ if err := http.ListenAndServe(addr, metrics.Handler()); err != nil {
+ logrus.Errorf("serve metrics api: %s", err)
@brian-brazil
brian-brazil Oct 25, 2016

Should this error be returned? I would expect something like the port already being in use to prevent startup.
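One way to surface that at startup is to bind the listener synchronously and only serve in the goroutine, so a port clash fails fast (a sketch of the suggestion, not the code in this PR; it uses net.Listen plus http.Serve):

func startMetricsServer(addr string) error {
    l, err := net.Listen("tcp", addr) // fails immediately, e.g. if the port is in use
    if err != nil {
        return err
    }
    go func() {
        if err := http.Serve(l, metrics.Handler()); err != nil {
            logrus.Errorf("serve metrics api: %s", err)
        }
    }()
    return nil
}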

daemon/health.go
@@ -159,6 +160,7 @@ func monitor(d *Daemon, c *container.Container, stop chan struct{}, probe probe)
go func() {
result, err := probe.run(ctx, d, c)
if err != nil {
+ healthCheckCounter.WithValues("failed").Inc()
@brian-brazil
brian-brazil Oct 25, 2016

It's best to have one counter for total and another counter for failures. It's easier to work with that way as a) it's then just a straight division and b) you don't have to worry about missing timeseries.

See https://prometheus.io/docs/practices/instrumentation/#avoid-missing-metrics

+
+func init() {
+ ns := metrics.NewNamespace("engine", "daemon", nil)
+ containerActions = ns.NewLabeledTimer("container_actions", "The number of seconds it takes to process each container action", "action")
@brian-brazil
brian-brazil Oct 25, 2016

These should be initialised with all possible values for 'action' to avoid weirdness. This is usually done by calling containerActions.WithLabelValues("xxx").

@crosbymichael
crosbymichael Oct 25, 2016 Member

That makes sense when you only have one label, but when you have multiple labels with multiple values, are you supposed to initialize all possible combinations?

@brian-brazil
brian-brazil Oct 26, 2016

If you know all possible combinations and they are reasonably bounded, you should initialise them all. In this case they should be initialised.

The fun cases are things like HTTP response codes, where in the general case there's 80ish label values which would likely end up with performance issues in the medium/long term. In that case we're left handing the user the problems associated with labels not being present, which can lead to subtle bugs. For example the original health_checks in this PR would have broken common usages if there were no successes or no failures (missing divided by something is missing).
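A sketch of that initialisation for the bounded "action" label (the action list is illustrative, and this assumes WithValues creates the child series the way client_golang's WithLabelValues does):

// Touch every known label value once at registration time so the series
// exist at zero before the first real observation.
for _, action := range []string{"create", "start", "stop", "delete"} {
    containerActions.WithValues(action)
    containerCounter.WithValues(action)
}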

daemon/metrics.go
+ "os",
+ )
+ engineCpus = ns.NewGauge("engine_cpus", "The number of cpus that the host system of the engine has", metrics.Total)
+ engineMemory = ns.NewGauge("engine_memory", "The number of bytes of memory that the host system of the engine has", metrics.Bytes)
@brian-brazil
brian-brazil Oct 25, 2016

The units of bytes should be mentioned in the metric name.

daemon/metrics.go
+ )
+ engineCpus = ns.NewGauge("engine_cpus", "The number of cpus that the host system of the engine has", metrics.Total)
+ engineMemory = ns.NewGauge("engine_memory", "The number of bytes of memory that the host system of the engine has", metrics.Bytes)
+ healthCheckCounter = ns.NewLabeledCounter("health_checks", "The total number of health checks executed", "action")
@brian-brazil
brian-brazil Oct 25, 2016

Action doesn't make sense as a label here as it's an output rather than an input. As noted below this should be health_checks and health_checks_failed.

@crosbymichael
crosbymichael Oct 25, 2016 Member

Why? Just stating something does not help someone learn; you have to explain why in this case it does not make sense but in others it does. What makes this different?

@juliusv
juliusv Oct 25, 2016 edited

I'm not @brian-brazil, but I can try to answer. I'm not sure whether your question is about his first or second sentence, so let me try both:

  • The health_checks metric is dimensionally split by the health check result (failed or passed). So semantically, action doesn't describe that dimension well, whereas result or outcome or similar would.
  • It's usually better to split up total and failed counters into separate metrics instead of differentiating the outcome with a label. That's especially easier to work with in PromQL computations. For example, to calculate the ratio of bad health checks with two separate metrics (per engine): failed_health_checks_total / health_checks_total (elements on the left and right side of the operation will be automatically matched on their identical label sets to produce a labeled output set). With a single metric and a differentiating label, you couldn't just say health_checks_total{result="failed"} / health_checks_total{result="passed"}, because each side of the division is a labeled vector, and now you'd have to explicitly ignore the result label for the purposes of this vector matching, like: health_checks_total{result="failed"} / ignoring(result) health_checks_total{result="passed"}. So it all becomes more cumbersome.
daemon/metrics.go
+ "graph_driver", "kernel",
+ "os",
+ )
+ engineCpus = ns.NewGauge("engine_cpus", "The number of cpus that the host system of the engine has", metrics.Total)
@stevvooe
stevvooe Oct 25, 2016 Contributor

Nit: IMHO, the units here should be cpus rather than total. Do we still use "total" in this case?

cc @juliusv @brian-brazil

@juliusv
juliusv Oct 25, 2016

Yeah, this is just a gauge with the number of CPUs this engine has, right? Then the current name is good. _total is only for counters, not for gauges.

@brian-brazil
brian-brazil Oct 25, 2016

cpus or cores would be the unit here, depending on which ye use.

@stevvooe
Contributor

Replying inline, since GitHub is being weird:

It gets really frustrating when you're already trying to understand a bug in code spread across 10s of functions in 10s of files, to then have to open up another set of files to find out what the available metrics are and another set of files again to see which confusing behaviour this team's metrics wrapper has.

I agree with your point but it will be easier to keep the number of metrics under control if they aren't littered about. We'll try both models and see what works in practice for this project.

Sorry, terrible question: do you have tooling setup? This is super obnoxious without something like vim-go.

For Prometheus itself our stability guarantees don't include the metrics, as that'd be too constraining and effectively prevent significant development as any major internal change would likely become a breaking change to metrics. Given Docker is under rapid development, you may wish to take a similar stance.

Does that become a problem for users instrumenting prometheus itself? This project would be a lot easier if people didn't expect stability in the first few releases, but I'm afraid, even with documented caveats, there will be those who will rely on metric naming.

@stevvooe
Contributor

LGTM

I have a few questions but let's handle those offline. This meets the requirements and gets the ball rolling.

@brian-brazil

I agree with your point but it will be easier to keep the number of metrics under control if they aren't littered about. We'll try both models and see what works in practice for this project.

I wouldn't worry about keeping the number of metrics under control. Labelless metrics are effectively free (e.g. the health checks here); it's labeled metrics that tend to bite you over time as they gain in cardinality.

Sorry, terrible question: do you have tooling setup? This is super obnoxious without something like vim-go.

Most of my experience was in C++ actually, with very good tooling for reading code. Vim-go isn't sufficient for this sort of use case, you need an actual IDE. Having 10 browser tabs open at once, at least 4 visible at a time was on the low end of where this sort of debugging typically takes you - and that's before I got to reverse engineering the obfuscating abstractions on top of common libraries such as instrumentation.

Does that become a problem for users instrumenting prometheus itself?

It hasn't been a problem yet for Prometheus. We've yet to have to break any of the "important" metrics.

This project would be a lot easier if people didn't expect stability in the first few releases, but I'm afraid, even with documented caveats, there will be those who will rely on metric naming.

Such is life, unfortunately. I expect ye're going to have it a lot tougher than we do, so it'd be good to clarify the guarantees.

@stevvooe
Contributor

Vim-go isn't sufficient for this sort of use case, you need an actual IDE.

I'd recommend you give it another look. It does most things an IDE can. VSCode is also an option if you're struggling with Go code navigation. Anyways, it allows me to bounce around from file to file without these issues. I haven't tried it in a while, but SourceGraph can help with Go navigation on GitHub.

@crosbymichael
Member

oh no, not the editor debate ;)

@brian-brazil

I'm exclusively a vim user for writing code as it happens. SourceGraph looks like the sort of tool I want when doing debugging, thanks. Doesn't directly support opening links in new tabs though via middle click, which'd slow me down a fair bit when debugging.

daemon/events/metrics.go
+
+func init() {
+ ns := metrics.NewNamespace("engine", "events", nil)
+ eventsCounter = ns.NewCounter("events", "The number of events logged")
@brian-brazil
brian-brazil Oct 26, 2016

Is this going to end up as engine_events_events_total? I'd personally avoid the stutter.

@juliusv
juliusv Oct 26, 2016

Apologies if I'm misunderstanding something, but from reading the code in https://github.com/docker/go-metrics/blob/master/namespace.go, it does seem like it would end up being named engine_events_events_total due to the namespace above including the events string as a subsystem name already. Prometheus glues together the namespace, subsystem, and metric name given in the counter opts constructed in https://github.com/docker/go-metrics/blob/master/namespace.go#L61-L69 with an underscore.

The namespace definition here seems problematic for the other metrics as well. For example, you'd get engine_events_engine_cpus, which should probably just be engine_cpus (same for engine_events_engine_memory)?
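The gluing described above is the Prometheus client's BuildFQName; a tiny illustration of where the stutter comes from:

package main

import (
    "fmt"

    "github.com/prometheus/client_golang/prometheus"
)

func main() {
    // Namespace, subsystem, and metric name are joined with underscores, so a
    // subsystem of "events" plus a metric named "events" stutters.
    fmt.Println(prometheus.BuildFQName("engine", "events", "events")) // engine_events_events
    // go-metrics then appends the _total suffix for counters.
}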

@crosbymichael
crosbymichael Oct 26, 2016 Member

@juliusv ya, you are correct on the events part, but not on engine_events_engine_memory; the new GitHub review UI is mashing things together, making it look like they are in the same file, but they are not. I'll fix the above problem.

@juliusv
juliusv Oct 26, 2016

Btw., this lack of grokability+grepability is why some people like @brian-brazil argue against generating metric names from multiple name components (namespace, component, metric name, suffix) spread over the code base. I personally started out with the multi-component approach in the past, but nowadays I prefer the single-string approach - it's just simpler to reason about and find things. But that's your call, of course. I recognize your motivation to enforce common prefixes/suffixes throughout the Docker code base, which is valid too.

@juliusv
juliusv Oct 26, 2016

@crosbymichael Ah, damn you, GitHub :)

daemon/metrics.go
+ engineCpus = ns.NewGauge("engine_cpus", "The number of cpus that the host system of the engine has", metrics.Unit("cpus"))
+ engineMemory = ns.NewGauge("engine_memory", "The number of bytes of memory that the host system of the engine has", metrics.Bytes)
+ healthChecksPassedCounter = ns.NewCounter("health_checks_passed", "The total number of health checks executed")
+ healthChecksFailedCounter = ns.NewCounter("health_checks_failed", "The total number of health checks executed")
@brian-brazil
brian-brazil Oct 26, 2016

It's better to have failed and total, rather than passed and failed. This is because the ratio is usually what you want, and f / t is simpler than f / (p + f).

The descriptions also need updating.

cmd/dockerd/metrics.go
+ metrics "github.com/docker/go-metrics"
+)
+
+func startMetricsServer(addr string, d *daemon.Daemon) error {
@LK4D4
LK4D4 Oct 27, 2016 Contributor

d is unused

cmd/dockerd/daemon.go
@@ -248,6 +248,12 @@ func (cli *DaemonCli) start(opts daemonOptions) (err error) {
return fmt.Errorf("Error starting daemon: %v", err)
}
+ if d.HasExperimental() && cli.Config.MetricsAddress != "" {
@LK4D4
LK4D4 Oct 27, 2016 Contributor

let's split it a little and return an error if MetricsAddress is not empty on non-experimental.
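That is, roughly (a sketch of the suggested split; it drops the unused daemon argument noted above, and the error message is illustrative):

if cli.Config.MetricsAddress != "" {
    if !d.HasExperimental() {
        return fmt.Errorf("metrics-addr is only supported when experimental is enabled")
    }
    if err := startMetricsServer(cli.Config.MetricsAddress); err != nil {
        return err
    }
}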

@crosbymichael crosbymichael Add basic prometheus support
This adds a metrics package that creates additional metrics.  Add the
metrics endpoint to the docker api server under `/metrics`.

Signed-off-by: Michael Crosby <crosbymichael@gmail.com>

Add metrics to daemon package

Signed-off-by: Michael Crosby <crosbymichael@gmail.com>

api: use standard way for metrics route

Also add "type" query parameter

Signed-off-by: Alexander Morozov <lk4d4@docker.com>

Convert timers to ms

Signed-off-by: Michael Crosby <crosbymichael@gmail.com>
3343d23
@LK4D4
LK4D4 approved these changes Oct 27, 2016 View changes
@LK4D4
Contributor
LK4D4 commented Oct 27, 2016

LGTM

@LK4D4
Contributor
LK4D4 commented Oct 27, 2016

ping @thaJeztah

@thaJeztah
Member

Since this is a new API, we need a new section in the docs for this. Let's do so in a follow up before 1.13 is released

/cc @mstanleyjones

@thaJeztah thaJeztah merged commit 33474a1 into docker:master Oct 27, 2016

5 checks passed

docker/dco-signed All commits signed
experimental Jenkins build Docker-PRs-experimental 25556 has succeeded
janky Jenkins build Docker-PRs 34155 has succeeded
vendor Jenkins build Docker-PRs-vendor 2237 has succeeded
windowsRS1 Jenkins build Docker-PRs-WoW-RS1 5037 has succeeded
@thaJeztah
Member

opened #27843 for tracking docs

@crosbymichael crosbymichael deleted the crosbymichael:prom branch Oct 28, 2016
@outcoldman
Contributor

Am I right that in the current implementation only prometheus can be used for collecting metrics? Why not have extensibility as in log drivers? The current implementation is really tied to the prometheus client library, without any abstractions.

@brian-brazil

Am I right that in the current implementation only prometheus can be used for collecting metrics?

For clarity I define "collection" to mean instrumentation, and "ingestion" as the process of getting the collected metric data into a monitoring system.

In that context, the answer is yes as I understand things on the Docker side.

Why not have extensibility as in log drivers? The current implementation is really tied to the prometheus client library, without any abstractions.

The challenge is that metric instrumentation is nowhere near as standardised as logs. Indeed, Prometheus client libraries can be viewed as one standardisation option that allows for extensibility, as they are designed to be an open ecosystem.

You could add an abstraction at the instrumentation level, as I believe you are suggesting. You'd likely end up with a lowest-common-denominator library, which would lose you the benefits of things like labels and floating-point numbers, take a performance hit, and not be idiomatic for any monitoring system. In short, you'd gain abstraction at the cost of good instrumentation. I personally do not believe that is a good tradeoff, and have never seen it work out well when attempted.

With Prometheus client libraries there's abstraction at the ingestion level. The Prometheus approach is to instrument full-on with Prometheus client libraries, and then you use a small shim (called a "bridge") to output to other systems like Graphite (currently in code review to be included out of the box for the Go library, already exists for Java/Python), New Relic, InfluxDB, CloudWatch etc. There's no need for a user to run a Prometheus server in such a scenario, see https://www.robustperception.io/exporting-to-graphite-with-the-prometheus-python-client/ for example. There are other ways you can plumb that too.

Prometheus client libraries put all the smarts in the server rather than the instrumentation and have a data model that's on the powerful end of the scale, so it's almost always possible to automagically convert to something sensible in other monitoring/instrumentation systems. The reverse is unfortunately rarely true.
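For illustration, the shape of such a bridge in Go: the client library exposes everything registered through its Gatherer interface, and a bridge periodically gathers and re-emits the snapshot (pushToOtherSystem is hypothetical):

package main

import (
    "log"
    "time"

    "github.com/prometheus/client_golang/prometheus"
    dto "github.com/prometheus/client_model/go"
)

// pushToOtherSystem stands in for a Graphite/InfluxDB/CloudWatch sink.
func pushToOtherSystem(mfs []*dto.MetricFamily) error { return nil }

func main() {
    for range time.Tick(15 * time.Second) {
        // Gather snapshots every registered metric family.
        mfs, err := prometheus.DefaultGatherer.Gather()
        if err != nil {
            log.Printf("gather: %v", err)
            continue
        }
        if err := pushToOtherSystem(mfs); err != nil {
            log.Printf("push: %v", err)
        }
    }
}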

@peterbourgon
Contributor

You could add an abstraction at the instrumentation level . . .

Self-plug: one candidate is go-kit/kit/metrics, which provides exactly this abstraction and suffers none of the drawbacks you mention. I'm not necessarily advocating for it in this circumstance—it seems exceedingly unlikely that Docker should or would use a different metrics backend than Prometheus—but at least the option is there.

@thaJeztah thaJeztah referenced this pull request in docker/for-mac Dec 20, 2016
Open

Unable to use experimental daemon metrics API #1058
