This repository has been archived by the owner on Aug 2, 2021. It is now read-only.

Evaluate and choose monitoring components #177

Closed
gbalint opened this issue Dec 20, 2017 · 9 comments

Comments

@gbalint

gbalint commented Dec 20, 2017

We should probably use InfluxDB as the backend, but this decision can still change.
It is also an open question which protocol to use: InfluxDB, statsd, or Graphite.

@holisticode
Contributor

While #183 describes what metrics will be collected, this issue is about the infrastructure for metrics - libraries, protocols, databases, visualizations.

@holisticode
Contributor

go-ethereum already has a metrics implementation. It uses rcrowley/go-metrics, the Go port of Coda Hale's Metrics library: https://github.com/dropwizard/metrics.

According to this comment: ethereum/go-ethereum#15481 (comment), it should be fairly easy to use this library for swarm-specific metrics. Following this path, several client libraries are available for this implementation to format the metrics, store them, and make them available to visualization engines and frameworks.
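
For reference, a minimal sketch of how counters and timers are registered and updated with rcrowley/go-metrics; the metric names here are made up for illustration, not actual swarm identifiers:

```go
package main

import (
	"time"

	metrics "github.com/rcrowley/go-metrics"
)

func main() {
	// Register (or fetch) a counter and a timer in the default registry.
	requests := metrics.GetOrRegisterCounter("swarm.network.requests", metrics.DefaultRegistry)
	retrieveTime := metrics.GetOrRegisterTimer("swarm.network.retrieve.time", metrics.DefaultRegistry)

	requests.Inc(1)

	// Time a placeholder operation.
	retrieveTime.Time(func() {
		time.Sleep(10 * time.Millisecond)
	})

	// Values keep accumulating in the registry until a reporter exports them;
	// this is the behaviour referred to below as "aggregating forever".
	_ = requests.Count()
}
```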

Nevertheless, concerns have been voiced about the suitability of this stack for swarm.
@nonsense :

statsd by default resets all measurements for the flush interval, whereas go-metrics/codahale keeps aggregating forever, which is not cool

This appears to question whether the available implementation in go-ethereum can be used. So a first decision needs to be made: how severe is this aggregation issue in the existing go-ethereum library? Is it worthwhile, desirable, and justified to introduce a new library because of the aggregation problem?

Also, it is not clear whether this aggregation issue is due to the library itself (rcrowley/go-metrics) or to the Graphite client used to transport metrics to an InfluxDB instance (https://github.com/cyberdelia/go-metrics-graphite).

In fact, a dedicated library exists for writing to InfluxDB: https://github.com/vrischmann/go-metrics-influxdb. If the aggregation problem is inside the metrics library itself, using the latter most probably changes nothing. If the problem is in the Graphite client, then the InfluxDB reporter may be a solution. The proposed stack could then be rcrowley/go-metrics - InfluxDB - Grafana (for visualization).

This comment by @nonsense in fact suggests that the problem is in the graphite client, not the base library itself:

https://github.com/syntaqx/go-metrics-datadog - here is a reporter which is resetting the Counters

This reporter is also a client for the rcrowley/go-metrics implementation, so if it resets the counters, that suggests it addresses the aggregation issue.
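
To make the distinction concrete, here is a minimal sketch of the "resetting reporter" approach that such a client can take on top of rcrowley/go-metrics: read each counter once per flush interval, emit its value, then clear it. The emit target and interval are placeholders.

```go
package main

import (
	"fmt"
	"time"

	metrics "github.com/rcrowley/go-metrics"
)

// reportAndReset emits every counter in the registry once per interval,
// then clears it so the next interval starts from zero.
func reportAndReset(r metrics.Registry, interval time.Duration) {
	for range time.Tick(interval) {
		r.Each(func(name string, m interface{}) {
			if c, ok := m.(metrics.Counter); ok {
				fmt.Printf("%s=%d\n", name, c.Count()) // stand-in for sending to a backend
				c.Clear()
			}
		})
	}
}

func main() {
	c := metrics.GetOrRegisterCounter("example.requests", metrics.DefaultRegistry)
	c.Inc(3)
	go reportAndReset(metrics.DefaultRegistry, time.Second)
	time.Sleep(2 * time.Second)
}
```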

@holisticode
Contributor

holisticode commented Dec 27, 2017

dogstatsd doesn't look very flexible for us. It seems it can only be used as SaaS, so no custom installation? https://www.datadoghq.com/product/. It also requires a gostatsd process to be running.

@nonsense how can we check whether or not we run into the "aggregation" problem with https://github.com/vrischmann/go-metrics-influxdb?

@nonsense
Contributor

nonsense commented Dec 28, 2017

Also, it is not clear whether this aggregation issue is due to the library itself (rcrowley/go-metrics) or to the Graphite client used to transport metrics to an InfluxDB instance (https://github.com/cyberdelia/go-metrics-graphite).

The problem is in the rcrowley/go-metrics library - you only ever add measurements to the Registry, and never reset it. Resetting should happen at reporting time in my opinion, which in the case of rcrowley/go-metrics would mean instantiating a new, empty Registry.

"The dogstatsd doesn't look very flexible for us." - It is quite flexible, and can be used with any backend that supports statsd. Both collectd and telegraf have plugins that enable them to receive statsd messages, so we could use one of these agents on the machines (we already have them, actually).

The way I see it, we have two options:

  1. Continue using rcrowley/go-metrics and live with the aggregation problem. This means we will have trouble getting accurate measurements for Timers, and we will have to introduce hacks to get accurate Counters. Accurate Counters can be achieved by hacking the Reporter, as already shown. Accurate Timers - I don't know how to do that yet with this library - we can try instantiating a new Registry every time we report, essentially introducing a flush.

  2. Use the statsd protocol and a simple statsd client from the Go code (for example https://github.com/DataDog/datadog-go/blob/master/statsd/statsd.go); a minimal client sketch follows this list. This way the Go code reports stats to a local (on-machine) agent, which is responsible for the aggregation. We already have such an agent on the machines, because we care about infrastructure metrics anyway. Note that aggregation here happens per flush interval, unlike the aggregation that happens over the process lifetime under the current rcrowley/go-metrics.
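
As an illustration of option 2, here is a minimal sketch using the DataDog Go statsd client; the agent address and metric names are assumptions, not actual swarm configuration:

```go
package main

import (
	"log"
	"time"

	"github.com/DataDog/datadog-go/statsd"
)

func main() {
	// Report to a local statsd-compatible agent (e.g. telegraf or collectd).
	client, err := statsd.New("127.0.0.1:8125")
	if err != nil {
		log.Fatal(err)
	}
	defer client.Close()

	// Counter: the agent sums increments per flush interval.
	client.Incr("swarm.network.requests", nil, 1)

	// Timer: report the duration of a placeholder operation.
	start := time.Now()
	time.Sleep(10 * time.Millisecond)
	client.Timing("swarm.network.retrieve.time", time.Since(start), nil, 1)
}
```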

I believe option 2 is better, because we can track more fine-grained counters and timers.

We can approach the decision in the following way:

Define the meaningful measurements we want to take in terms of timers, counters, gauges, etc., introduce them in the code, and report them with both option 1 and option 2. Changing the code to report metrics via two different clients is not difficult. Once we have the initial set of measurements and have experimented with how we collect and visualize them, it should be easier to make a decision.

@holisticode
Contributor

So it appears that dogstatsd does fit our requirements per se, but at the cost of an additional running process. This is already the case for swarm-gateways.net, but not for a normal node.

The metrics system should allow a normal swarm node to run easily, with minimum additional software installation requirements.

@lmars

lmars commented Dec 29, 2017

I'd like to provide some insight, but I'm not sure I understand this "aggregate problem".

So with a counter, don't you just keep incrementing it and then calculate the stats you want from that? For example, if I want to know how many requests I served in the last hour, I'd subtract the value from 1h ago from the current value. If instead the counter is reset every 10s, then I have to add them all up? What if some of the stats get lost (it is UDP, right)? Then I get an inaccurate count. There seems to be more opportunity for me to lose stats if they keep getting reset.

@gbalint gbalint moved this from To do to Backlog in Swarm Core - Sprint planning Jan 2, 2018
@gbalint gbalint moved this from Backlog to In progress in Swarm Core - Sprint planning Jan 8, 2018
@holisticode
Contributor

The chosen stack is:

gopkg.in/alexcesaro/statsd.v2 (statsd client in swarm go code)
|
statsd daemon (telegraf)
|
influxdb
|
grafana
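
A minimal sketch of reporting a counter and a timer through this stack with gopkg.in/alexcesaro/statsd.v2; the agent address, prefix, and metric names are placeholders, not the actual swarm configuration:

```go
package main

import (
	"log"
	"time"

	"gopkg.in/alexcesaro/statsd.v2"
)

func main() {
	// Send metrics to the local telegraf statsd listener.
	c, err := statsd.New(
		statsd.Address("127.0.0.1:8125"),
		statsd.Prefix("swarm"),
	)
	if err != nil {
		log.Fatal(err)
	}
	defer c.Close()

	// Counter: telegraf aggregates per flush interval before writing to InfluxDB.
	c.Increment("network.requests")

	// Timer around a placeholder operation.
	t := c.NewTiming()
	time.Sleep(10 * time.Millisecond)
	t.Send("network.retrieve.time")
}
```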

@nonsense
Contributor

nonsense commented Jan 9, 2018

@holisticode thanks! I will sync with Peter tomorrow, so that we can see what the geth team thinks of this, as I imagine we might want to add instrumentation to some of the shared go-ethereum / swarm code where it might be missing. Will keep you posted.

@nonsense
Contributor

In light of the amount of code and features already built on go-metrics in go-ethereum, we decided to implement a custom timer type (ResettingTimer) in go-metrics and fork the library, so that we stay backward-compatible with the existing features.

ResettingTimer essentially implements what a statsd timer does.

So the stack becomes:

go-ethereum with go-metrics
|
influxdb
|
grafana
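
A minimal sketch of how such a ResettingTimer would be used from go-ethereum's metrics package; the metric name is a placeholder, and the exact constructor name follows the forked go-metrics at the time, so treat it as an assumption:

```go
package main

import (
	"time"

	"github.com/ethereum/go-ethereum/metrics"
)

func main() {
	// A ResettingTimer behaves like a statsd timer: its recorded values are
	// cleared on every report/flush instead of aggregating over the whole
	// process lifetime. Passing nil uses the default registry.
	t := metrics.GetOrRegisterResettingTimer("swarm.network.retrieve.time", nil)

	start := time.Now()
	time.Sleep(10 * time.Millisecond) // placeholder operation
	t.UpdateSince(start)
}
```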

@gbalint gbalint moved this from In progress to Done in Swarm Core - Sprint planning Jan 16, 2018