This repository has been archived by the owner on Aug 2, 2021. It is now read-only.

Evaluate and choose monitoring components #177

Closed
gbalint opened this issue Dec 20, 2017 · 9 comments

Comments

@gbalint

gbalint commented Dec 20, 2017

We should probably use InfluxDB as the backend, but this decision can still change.
It is also an open question which protocol to use: InfluxDB, statsd, or Graphite.

@holisticode
Contributor

While #183 describes what metrics will be collected, this issue is about the infrastructure for metrics - libraries, protocols, databases, visualizations.

@holisticode
Contributor

go-ethereum already has a metrics implementation. It uses rcrowley/go-metrics, the Go port of Coda Hale's Metrics library: https://github.com/dropwizard/metrics.

According to this comment: ethereum/go-ethereum#15481 (comment), it should be fairly easy to use this library for swarm-specific metrics. Following this path, several client libraries are available for this implementation to format the metrics, store them, and make them available to visualization engines and frameworks.
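
For reference, a minimal sketch of how counters and timers are registered and updated with rcrowley/go-metrics; the metric names here are made up for illustration, not actual swarm identifiers:

```go
package main

import (
	"time"

	metrics "github.com/rcrowley/go-metrics"
)

func main() {
	// Register (or fetch) a counter and a timer in the default registry.
	requests := metrics.GetOrRegisterCounter("swarm.network.requests", metrics.DefaultRegistry)
	retrieveTime := metrics.GetOrRegisterTimer("swarm.network.retrieve.time", metrics.DefaultRegistry)

	requests.Inc(1)

	// Time a placeholder operation.
	retrieveTime.Time(func() {
		time.Sleep(10 * time.Millisecond)
	})

	// Values keep accumulating in the registry until a reporter exports them;
	// this is the behaviour referred to below as "aggregating forever".
	_ = requests.Count()
}
```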

Nevertheless, concerns have been voiced about the suitability of this stack for swarm.
@nonsense :

statsd by default resets all measurements for the flush interval, whereas go-metrics/codahale keeps aggregating forever, which is not cool

This appears to question whether the available implementation in go-ethereum can be used. So a first decision needs to be made: how severe is this aggregation issue in the existing go-ethereum library? Is it worthwhile, desirable, and justified to introduce a new library because of the aggregation problem?

Also, it is not clear whether this aggregation issue is due to the library itself (rcrowley/go-metrics) or to the Graphite client used to transport metrics to an InfluxDB instance (https://github.com/cyberdelia/go-metrics-graphite).

In fact, a dedicated library exists for writing to InfluxDB: https://github.com/vrischmann/go-metrics-influxdb. If the aggregation problem is inside the metrics library itself, using the latter most probably changes nothing. If the problem is in the Graphite client, then the InfluxDB reporter may be a solution. The proposed stack could then be rcrowley/go-metrics - InfluxDB - Grafana (for visualization).

This comment by @nonsense in fact suggests that the problem is in the graphite client, not the base library itself:

https://github.com/syntaqx/go-metrics-datadog - here is a reporter which is resetting the Counters

This reporter is also a client for the rcrowley/go-metrics implementation, so if it resets the counters, that suggests it addresses the aggregation issue.
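
To make the distinction concrete, here is a minimal sketch of the "resetting reporter" approach that such a client can take on top of rcrowley/go-metrics: read each counter once per flush interval, emit its value, then clear it. The emit target and interval are placeholders.

```go
package main

import (
	"fmt"
	"time"

	metrics "github.com/rcrowley/go-metrics"
)

// reportAndReset emits every counter in the registry once per interval,
// then clears it so the next interval starts from zero.
func reportAndReset(r metrics.Registry, interval time.Duration) {
	for range time.Tick(interval) {
		r.Each(func(name string, m interface{}) {
			if c, ok := m.(metrics.Counter); ok {
				fmt.Printf("%s=%d\n", name, c.Count()) // stand-in for sending to a backend
				c.Clear()
			}
		})
	}
}

func main() {
	c := metrics.GetOrRegisterCounter("example.requests", metrics.DefaultRegistry)
	c.Inc(3)
	go reportAndReset(metrics.DefaultRegistry, time.Second)
	time.Sleep(2 * time.Second)
}
```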

@holisticode
Contributor

holisticode commented Dec 27, 2017

dogstatsd doesn't look very flexible for us. It seems it can only be used as SaaS, so no custom installation? https://www.datadoghq.com/product/. It also requires a gostatsd process to be running.

@nonsense how can we check whether or not we run into the "aggregation" problem with https://github.com/vrischmann/go-metrics-influxdb?

@nonsense
Contributor

nonsense commented Dec 28, 2017

Also, it is not clear whether this aggregation issue is due to the library itself (rcrowley/go-metrics) or to the Graphite client used to transport metrics to an InfluxDB instance (https://github.com/cyberdelia/go-metrics-graphite).

The problem is in the rcrowley/go-metrics library - you only ever add measurements to the Registry, and never reset it. Resetting should happen at reporting time in my opinion, which in the case of rcrowley/go-metrics would mean instantiating a new, empty Registry.

"The dogstatsd doesn't look very flexible for us." - It is quite flexible, and can be used with any backend that supports statsd. Both collectd and telegraf have plugins that enable them to receive statsd messages, so we could use one of these agents on the machines (we already have them, actually).

The way I see it, we have two options:

  1. Continue using rcrowley/go-metrics and live with the aggregation problem. This means we will have trouble getting accurate measurements for Timers, and we will have to introduce hacks to get accurate Counters. Accurate Counters can be achieved by hacking the Reporter, as already shown. Accurate Timers - I don't know how to do that yet with this library - we can try instantiating a new Registry every time we report, essentially introducing a flush.

  2. Use the statsd protocol and a simple statsd client from the Go code (for example https://github.com/DataDog/datadog-go/blob/master/statsd/statsd.go); a minimal client sketch follows this list. This way the Go code reports stats to a local (on-machine) agent, which is responsible for the aggregation. We already have such an agent on the machines, because we care about infrastructure metrics anyway. Note that aggregation here happens per flush interval, unlike the aggregation that happens over the process lifetime under the current rcrowley/go-metrics.
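
As an illustration of option 2, here is a minimal sketch using the DataDog Go statsd client; the agent address and metric names are assumptions, not actual swarm configuration:

```go
package main

import (
	"log"
	"time"

	"github.com/DataDog/datadog-go/statsd"
)

func main() {
	// Report to a local statsd-compatible agent (e.g. telegraf or collectd).
	client, err := statsd.New("127.0.0.1:8125")
	if err != nil {
		log.Fatal(err)
	}
	defer client.Close()

	// Counter: the agent sums increments per flush interval.
	client.Incr("swarm.network.requests", nil, 1)

	// Timer: report the duration of a placeholder operation.
	start := time.Now()
	time.Sleep(10 * time.Millisecond)
	client.Timing("swarm.network.retrieve.time", time.Since(start), nil, 1)
}
```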

I believe option 2 is better, because we can track more fine-grained counters and timers.

We can approach the decision in the following way:

Define the meaningful measurements we want to take in terms of timers, counters, gauges, etc., introduce them in the code, and report them with both option 1 and option 2. Changing the code to report metrics via two different clients is not difficult. Once we have the initial set of measurements and have experimented with how we collect and visualize them, it should be easier to make a decision.

@holisticode
Contributor

So it appears that dogstatsd does fit our requirements per se, but at the cost of an additional running process. This is already the case for swarm-gateways.net, but not for a normal node.

The metrics system should allow a normal swarm node to run easily, with minimum additional software installation requirements.

@lmars

lmars commented Dec 29, 2017

I'd like to provide some insight, but I'm not sure I understand this "aggregate problem".

So with a counter, don't you just keep incrementing it and then calculate the stats you want from that? For example, if I want to know how many requests I served in the last hour, I'd subtract the value from 1h ago from the current value. If instead the counter is reset every 10s, then I have to add them all up? What if some of the stats get lost (it is UDP, right)? Then I get an inaccurate count. There seems to be more opportunity for me to lose stats if they keep getting reset.

@gbalint gbalint moved this from To do to Backlog in Swarm Core - Sprint planning Jan 2, 2018
@gbalint gbalint moved this from Backlog to In progress in Swarm Core - Sprint planning Jan 8, 2018
@holisticode
Contributor

The chosen stack is:

gopkg.in/alexcesaro/statsd.v2 (statsd client in swarm go code)
|
statsd daemon (telegraf)
|
influxdb
|
grafana
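
A minimal sketch of reporting a counter and a timer through this stack with gopkg.in/alexcesaro/statsd.v2; the agent address, prefix, and metric names are placeholders, not the actual swarm configuration:

```go
package main

import (
	"log"
	"time"

	"gopkg.in/alexcesaro/statsd.v2"
)

func main() {
	// Send metrics to the local telegraf statsd listener.
	c, err := statsd.New(
		statsd.Address("127.0.0.1:8125"),
		statsd.Prefix("swarm"),
	)
	if err != nil {
		log.Fatal(err)
	}
	defer c.Close()

	// Counter: telegraf aggregates per flush interval before writing to InfluxDB.
	c.Increment("network.requests")

	// Timer around a placeholder operation.
	t := c.NewTiming()
	time.Sleep(10 * time.Millisecond)
	t.Send("network.retrieve.time")
}
```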

@nonsense
Contributor

nonsense commented Jan 9, 2018

@holisticode thanks! I will sync with Peter tomorrow, so that we can see what the geth team thinks of this, as I imagine we might want to add instrumentation to some of the shared go-ethereum / swarm code where it might be missing. Will keep you posted.

@nonsense
Contributor

In light of the amount of code and features already built on go-metrics in go-ethereum, we decided to implement a custom timer type (ResettingTimer) in go-metrics and fork the library, so that we stay backward-compatible with the existing features.

ResettingTimer essentially implements what a statsd timer does.

So the stack becomes:

go-ethereum with go-metrics
|
influxdb
|
grafana
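
A minimal sketch of how such a ResettingTimer would be used from go-ethereum's metrics package; the metric name is a placeholder, and the exact constructor name follows the forked go-metrics at the time, so treat it as an assumption:

```go
package main

import (
	"time"

	"github.com/ethereum/go-ethereum/metrics"
)

func main() {
	// A ResettingTimer behaves like a statsd timer: its recorded values are
	// cleared on every report/flush instead of aggregating over the whole
	// process lifetime. Passing nil uses the default registry.
	t := metrics.GetOrRegisterResettingTimer("swarm.network.retrieve.time", nil)

	start := time.Now()
	time.Sleep(10 * time.Millisecond) // placeholder operation
	t.UpdateSince(start)
}
```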

@gbalint gbalint moved this from In progress to Done in Swarm Core - Sprint planning Jan 16, 2018