swarm: initial instrumentation #15969

Conversation

@nonsense (Member) commented Jan 25, 2018

This PR is joint work with @holisticode. It is based on #15910 and should be reviewed/merged after #15910.

It adds configuration for a metrics backend within swarm and periodically flushes measurements from the DefaultRegistry to it.
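For reference, a minimal sketch of what such a periodic flush can look like (not the code from this PR): it walks metrics.DefaultRegistry on a ticker and hands each value to a backend. The reportLoop/flushToBackend names and the 10-second interval are illustrative only; the PR itself flushes to the configured backend (e.g. InfluxDB).

```go
// Illustrative sketch only, not the code from this PR: the general shape of a
// loop that periodically walks metrics.DefaultRegistry and hands each value to
// a backend. flushToBackend, reportLoop and reportInterval are hypothetical names.
package main

import (
	"log"
	"time"

	"github.com/ethereum/go-ethereum/metrics"
)

const reportInterval = 10 * time.Second // hypothetical flush period

func flushToBackend(name string, value float64) {
	// Stand-in for writing a point to InfluxDB (or any other backend).
	log.Printf("metric %s = %f", name, value)
}

func reportLoop(quit <-chan struct{}) {
	ticker := time.NewTicker(reportInterval)
	defer ticker.Stop()
	for {
		select {
		case <-ticker.C:
			// Walk every metric registered in the default registry.
			metrics.DefaultRegistry.Each(func(name string, m interface{}) {
				switch v := m.(type) {
				case metrics.Counter:
					flushToBackend(name, float64(v.Count()))
				case metrics.Gauge:
					flushToBackend(name, float64(v.Value()))
				}
			})
		case <-quit:
			return
		}
	}
}

func main() {
	quit := make(chan struct{})
	go reportLoop(quit)
	metrics.GetOrRegisterCounter("example.count", metrics.DefaultRegistry).Inc(1)
	time.Sleep(2 * reportInterval) // let the loop flush a couple of times
	close(quit)
}
```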

We could include an optional metrics backend (InfluxDB + Grafana + dashboards) setup in the repo and docs, similar to https://github.com/janos/swarm-test-cluster/tree/master/swarm-test-cluster (which is based on Terraform and Docker). A developer/user of swarm would then only have to run make to set up the metrics backend, and it would be configured out of the box to receive measurements from the process periodically.

@lmars (Contributor) commented Jan 25, 2018

> We could include an optional metrics backend (InfluxDB + Grafana + dashboards) setup in the repo

Not necessarily in this PR, but we should update the dev environment to include this, so I can just run swarm/dev/run.sh and have a Swarm cluster with a metrics backend. I know there was talk of switching from bash + docker to terraform, so perhaps do that at the same time?

@@ -75,6 +75,24 @@ func NewTimer(name string) metrics.Timer {
return metrics.GetOrRegisterTimer(name, metrics.DefaultRegistry)
}

// NewResettingTimer create a new ResettingTimer, either a real one of a NOP stub depending
Contributor

- NewResettingTimer create a new ResettingTimer, either a real one of a NOP stub depending
+ NewResettingTimer creates a new ResettingTimer, either a real one or a NOP stub depending

Member Author

Fixed.

swarm/api/api.go Outdated
log.Trace(fmt.Sprintf("Resolving : %v", uri.Addr))

// if the URI is immutable, check if the address is a hash
isHash := hashMatcher.MatchString(uri.Addr)
if uri.Immutable() || uri.DeprecatedImmutable() {
if !isHash {
apiResolveFail.Inc(1)
Contributor

I don't think it is correct to log this as a resolve failure, as no resolving was attempted (the user just provided bad input).

Contributor

I finally agree on this one and will remove it.

@@ -93,6 +117,7 @@ func (self *Api) Resolve(uri *URI) (storage.Key, error) {
// if DNS is not configured, check if the address is a hash
if self.dns == nil {
if !isHash {
apiResolveFail.Inc(1)
Contributor

Again is this really a resolve failure? I'd expect the resolve failure metric to tell me how many times Swarm tried to actually resolve a name but failed. Perhaps my expectation is incorrect though?

Contributor

My expectation was to capture how many times a resolve was attempted, in the sense of requested (a count), and how many of those attempts failed, answering the question: "How many times did people request a resolution, and how many of those requests failed?" Both expectations (mine and yours) can lead to misinterpretations; we should just settle on one. I believe it makes sense to keep them for now, and then remove such metrics as we go when we realize we are not gaining anything from them. Still, if people think we should remove ones like this one right away, I am ok with it.
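A minimal sketch of that pattern, for illustration only: the counter names and the lookup stub are hypothetical, and it uses the registry-level GetOrRegisterCounter helper that the NewTimer/NewCounter wrappers in this diff delegate to.

```go
// Illustrative sketch of the "count attempts, count failures" pattern
// described above; counter names and the lookup helper are made up.
package main

import (
	"errors"
	"fmt"

	"github.com/ethereum/go-ethereum/metrics"
)

var (
	apiResolveCount = metrics.GetOrRegisterCounter("api.resolve.count", metrics.DefaultRegistry) // every attempt
	apiResolveFail  = metrics.GetOrRegisterCounter("api.resolve.fail", metrics.DefaultRegistry)  // failed attempts only
)

// lookup stands in for the actual ENS/DNS resolution call.
func lookup(name string) (string, error) {
	return "", errors.New("resolver not configured")
}

func resolve(name string) (string, error) {
	apiResolveCount.Inc(1)
	addr, err := lookup(name)
	if err != nil {
		apiResolveFail.Inc(1)
		return "", err
	}
	return addr, nil
}

func main() {
	if _, err := resolve("theswarm.eth"); err != nil {
		fmt.Println("resolve failed:", err)
	}
	fmt.Println("attempts:", apiResolveCount.Count(), "failures:", apiResolveFail.Count())
}
```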

Contributor

> I believe it makes sense to keep them for now, and then remove such metrics as we go when we realize we are not gaining anything from them

I think it should be the other way around. Just add metrics we know are useful, with the freedom to add more useful ones in the future.

In this specific case, if I wanted to know how many resolve failures I'm getting, I'd also want to know which domains are failing, in which case I'd probably just look for errors in my logs (which have much more context than just a single number).

Contributor

When looking at failures, specifically in a production environment but also in a staging/test environment, I would structure my workflow by first creating alarms and visualizing error situations in a general manner, e.g. alarms that trigger at certain thresholds. Only then, if the counters/graphs show unacceptable levels or alarms fire, would I start the debugging process of looking at the logs for why the alarms were triggered or why there is an unusually/unexpectedly high number of errors. I believe this is a valid argument for keeping this counter. Nevertheless, if it's still maintained that the counter itself is of no value, I will proceed to remove it.

swarm/api/api.go Outdated
@@ -336,18 +391,23 @@ func (self *Api) AppendFile(mhash, path, fname string, existingSize int64, conte
}

func (self *Api) BuildDirectoryTree(mhash string, nameresolver bool) (key storage.Key, manifestEntryMap map[string]*manifestTrieEntry, err error) {
apiBuildDirTreeCount.Inc(1)
Contributor

Do you think this is a useful metric? I'm not so sure.

Member Author

We've been focusing on setting up the library, the reporter and the infrastructure around metrics. I agree that not all of the measurements currently are useful and meaningful - there will definitely be iterations on them. At the same time I am happy to fix those that are immediately known to be not useful in this PR. :)

Contributor

I have been developing mostly web REST APIs in previous years. Each such REST API is usually there because it serves a specific function and a user requirement/use case. From this point of view, I interpreted this function as something like a public API function, in which case we may be interested to know how many times each of these is being called. If my interpretation is incorrect, in that this API call isn't very relevant, is perhaps semantically not a user-facing API endpoint, or is maybe even obsolete/deprecated, I am happy for it to be removed.

Contributor

Given you are questioning whether this is useful, we should just drop it unless we decide it is useful. I personally don't think it is a useful metric.

Contributor

I was not really questioning the usefulness of the metric, but rather asking for clarification about the role, or maybe the relevance, of the API itself in terms of its (public) use case. I will proceed to remove it, but I would also welcome better documentation on such supposedly public API functions, in order to understand why they are there and their relevance for users of the API.

"github.com/ethereum/go-ethereum/swarm/api"
)

//templateMap holds a mapping of an HTTP error code to a template
var templateMap map[int]*template.Template

//metrics variables
var (
htmlCounter = metrics.NewCounter("api.http.errorpage.html.count")
Contributor

Do we need to be this granular? Should we not just log the count of 5xx errors?

Contributor

Thinking about it, it makes the most sense to me to actually track how many requests were made for JSON and how many for HTML. This is currently missing, as I couldn't identify a specific spot where this would be easy to add. For the time being this could be kept, as I expect 99.9% to be HTML anyway (but that's exactly why this could be interesting to measure, as I don't actually know).

Contributor

If you want to count the amount of HTML responses, check http.ResponseWriter.Header for the Content-Type containing text/html just before returning a response in ServeHTTP.

Contributor

The latter actually does not seem very practical, as the current code for ServeHTTP returns in many different places.
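One way around the many return points, sketched below with hypothetical names and counters (not code from this PR), is to wrap the handler and classify the response by its Content-Type header once, after the inner ServeHTTP has run:

```go
// Illustrative sketch: wrap any http.Handler and count responses by the
// Content-Type the inner handler set. Counter names are made up.
package main

import (
	"net/http"
	"strings"

	"github.com/ethereum/go-ethereum/metrics"
)

var (
	htmlResponseCount = metrics.GetOrRegisterCounter("api.http.response.html.count", metrics.DefaultRegistry)
	jsonResponseCount = metrics.GetOrRegisterCounter("api.http.response.json.count", metrics.DefaultRegistry)
)

// countContentTypes classifies each response after the wrapped handler ran,
// so the handler itself can return from as many places as it likes.
func countContentTypes(next http.Handler) http.Handler {
	return http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
		next.ServeHTTP(w, r)
		ct := w.Header().Get("Content-Type")
		switch {
		case strings.Contains(ct, "text/html"):
			htmlResponseCount.Inc(1)
		case strings.Contains(ct, "application/json"):
			jsonResponseCount.Inc(1)
		}
	})
}

func main() {
	mux := http.NewServeMux()
	mux.HandleFunc("/", func(w http.ResponseWriter, r *http.Request) {
		w.Header().Set("Content-Type", "text/html")
		w.Write([]byte("<html>hello</html>"))
	})
	http.ListenAndServe("127.0.0.1:8500", countContentTypes(mux)) // port is illustrative
}
```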

@@ -298,6 +305,13 @@ func (self *TreeChunker) hashChunk(hasher SwarmHash, job *hashJob, chunkC chan *
job.parentWg.Done()

if chunkC != nil {
//NOTE: this increases the chunk count even if the local node already has this chunk;
//on file upload the node will increase this counter even if the same file has already been uploaded
//So it should be evaluated whether it is worth keeping this counter
Contributor

I would definitely find this misleading. This is probably one of the most crucial metrics, we should make sure it is as accurate as possible.

Contributor

Agreed. I expect many changes with the rewrite and the syncer rewrite, so I thought it's good to have the metric in the code now, even though we will probably need to update it later anyway; wherever we move it now, it is probably wrong.

//So it should be evaluated whether it is worth keeping this counter
//and/or actually better track when the chunk is Put to the local database
//(which may question the need for disambiguation when a completely new chunk has been created
//and/or a chunk is being put to the local DB; for chunk tracking it may be worth distinguishing
Contributor

We should probably have a cache hit / miss counter for chunks, so we know how many requests we are managing to serve locally, and how many are hitting the network (both Get and Put).
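A minimal sketch of such hit/miss counters (the store type, its fields, and the counter names are made up for illustration): increment a hit counter when the chunk is served locally and a miss counter when it has to go to the network.

```go
// Illustrative only: a minimal shape of the chunk cache hit/miss counters
// described above. exampleStore and the counter names are invented here.
package main

import (
	"errors"
	"fmt"

	"github.com/ethereum/go-ethereum/metrics"
)

var (
	chunkCacheHit  = metrics.GetOrRegisterCounter("storage.chunk.cache.hit", metrics.DefaultRegistry)
	chunkCacheMiss = metrics.GetOrRegisterCounter("storage.chunk.cache.miss", metrics.DefaultRegistry)
)

type exampleStore struct {
	local map[string][]byte // stand-in for the in-memory/local store layer
}

func (s *exampleStore) fetchFromNetwork(key []byte) ([]byte, error) {
	return nil, errors.New("network retrieval not implemented in this sketch")
}

func (s *exampleStore) Get(key []byte) ([]byte, error) {
	if data, ok := s.local[string(key)]; ok {
		chunkCacheHit.Inc(1) // served locally
		return data, nil
	}
	chunkCacheMiss.Inc(1) // has to hit the network
	return s.fetchFromNetwork(key)
}

func main() {
	s := &exampleStore{local: map[string][]byte{"known": []byte("chunk data")}}
	s.Get([]byte("known"))
	s.Get([]byte("unknown"))
	fmt.Println("hits:", chunkCacheHit.Count(), "misses:", chunkCacheMiss.Count())
}
```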

Contributor

My assumption was that the memstore.get counters are the counters for chunks served by the cache.

swarm/swarm.go Outdated
startTime time.Time
metricsTimeout = 5 * time.Second
startCounter = metrics.NewCounter("stack,start")
stopCounter = metrics.NewCounter("stack,stop")
Contributor

These should be stack.start and stack.stop.

Member Author

I think those are actually redundant. @holisticode what do you think?

Contributor

@nonsense in the sense that you already have metrics measuring start/stop events at process level? If that is so, then I agree it may be redundant.

Member Author

@holisticode we have the uptime gauge, and we also track docker daemon stats. We can infer from them how often swarm crashes for example.

swarm/swarm.go Outdated

go func() {
for _ = range ticker.C {
self.sendMetrics()
Contributor

This loop should exit if the node is stopped.

Contributor

I chose a "lazy" solution here, as I concluded we'll be collecting those metrics for the complete duration of a swarm process run. As far as I understand, stopping a swarm node means shutting down the process, in which case the process is killed and this loop dies with it. This may be a bad assumption, in which case I will add an exit path.
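For reference, a minimal sketch of the exit path being discussed, with hypothetical names (metricsLoop, updateGauges, quit): select on a quit channel alongside the ticker so the goroutine returns when the node is stopped instead of leaking.

```go
// Illustrative sketch of the exit path being discussed; not code from this PR.
package main

import (
	"fmt"
	"time"
)

func metricsLoop(interval time.Duration, quit <-chan struct{}) {
	ticker := time.NewTicker(interval)
	defer ticker.Stop()
	for {
		select {
		case <-ticker.C:
			updateGauges()
		case <-quit: // closed when the node shuts down
			return
		}
	}
}

func updateGauges() {
	// Stand-in for refreshing uptime/db-size gauges before the next flush.
	fmt.Println("gauges updated")
}

func main() {
	quit := make(chan struct{})
	go metricsLoop(100*time.Millisecond, quit)
	time.Sleep(350 * time.Millisecond)
	close(quit) // simulate node stop; the loop exits instead of running forever
}
```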

swarm/swarm.go Outdated
}()
}

func (self *Swarm) sendMetrics() {
Contributor

This method name is misleading; it doesn't seem to be sending the metrics anywhere, just updating some metrics?

Member Author

Fixed.

@holisticode (Contributor)

Builds fail due to incomplete vendoring of github.com/influxdata/influxdb, it seems?

@@ -426,3 +444,15 @@ func (self *Kademlia) String() string {
rows = append(rows, "=========================================================================")
return strings.Join(rows, "\n")
}

//We have to build up the array of counters for each index
func (self *Kademlia) initMetricsVariables() {
Contributor

@lmars please review this func, added as per your own review comments. I want to make sure I am correctly initializing stuff based on the kad table size and indexes. Thanks!
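The body of initMetricsVariables is not shown in this excerpt; the sketch below only illustrates the idea of one counter per kademlia bucket index, with an illustrative maxProx and made-up counter names. Sizing the slice so that every index is valid, inclusive of the last one, is the point (cf. the later "off-by-one index in metrics containers" commit in this PR).

```go
// Hypothetical sketch only: pre-allocate one counter per proximity bin so
// every bucket index can be incremented safely.
package main

import (
	"fmt"

	"github.com/ethereum/go-ethereum/metrics"
)

const maxProx = 16 // illustrative number of proximity bins, not the real value

var bucketAddIndexCount = make([]metrics.Counter, maxProx+1)

func initMetricsVariables() {
	for i := 0; i <= maxProx; i++ { // one counter per bucket index, inclusive
		bucketAddIndexCount[i] = metrics.GetOrRegisterCounter(
			fmt.Sprintf("network.kademlia.bucket.%d.count", i), metrics.DefaultRegistry)
	}
}

func main() {
	initMetricsVariables()
	bucketAddIndexCount[maxProx].Inc(1) // the last index is valid too
	fmt.Println(len(bucketAddIndexCount), "bucket counters initialised")
}
```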

@nonsense force-pushed the fork-go-metrics-for-resetting-timer-swarm-instrumentation branch 2 times, most recently from 14b6b63 to 2f67745, on February 21, 2018 11:52
@nonsense force-pushed the fork-go-metrics-for-resetting-timer-swarm-instrumentation branch 2 times, most recently from f11b3f9 to 53c8785, on February 23, 2018 10:59
@nonsense force-pushed the fork-go-metrics-for-resetting-timer-swarm-instrumentation branch from 53c8785 to e891d75 on February 23, 2018 11:45
@gbalint added this to the 1.8.2 milestone Feb 23, 2018
@gbalint merged commit dcca613 into ethereum:master Feb 23, 2018
prestonvanloon pushed a commit to prestonvanloon/go-ethereum that referenced this pull request Apr 2, 2018
* swarm: initial instrumentation with go-metrics

* swarm: initialise metrics collection and add ResettingTimer to HTTP requests

* swarm: update metrics flags names. remove redundant Timer.

* swarm: rename method for periodically updating gauges

* swarm: finalise metrics after feedback

* swarm/network: always init kad metrics containers

* swarm/network: off-by-one index in metrics containers

* swarm, metrics: resolved conflicts
mariameda pushed a commit to NiluPlatform/go-nilu that referenced this pull request Aug 23, 2018