swarm: initial instrumentation #15969

Conversation

@nonsense (Member) commented Jan 25, 2018

This PR is joint work with @holisticode. It is based on #15910 and should be reviewed/merged after #15910.

It adds configuration for a metrics backend within swarm and periodically flushes measurements from the DefaultRegistry to it.
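For reference, a minimal sketch of what such a periodic flush can look like (not the code from this PR): it walks metrics.DefaultRegistry on a ticker and hands each value to a backend. The reportLoop/flushToBackend names and the 10-second interval are illustrative only; the PR itself flushes to the configured backend (e.g. InfluxDB).

```go
// Illustrative sketch only, not the code from this PR: the general shape of a
// loop that periodically walks metrics.DefaultRegistry and hands each value to
// a backend. flushToBackend, reportLoop and reportInterval are hypothetical names.
package main

import (
	"log"
	"time"

	"github.com/ethereum/go-ethereum/metrics"
)

const reportInterval = 10 * time.Second // hypothetical flush period

func flushToBackend(name string, value float64) {
	// Stand-in for writing a point to InfluxDB (or any other backend).
	log.Printf("metric %s = %f", name, value)
}

func reportLoop(quit <-chan struct{}) {
	ticker := time.NewTicker(reportInterval)
	defer ticker.Stop()
	for {
		select {
		case <-ticker.C:
			// Walk every metric registered in the default registry.
			metrics.DefaultRegistry.Each(func(name string, m interface{}) {
				switch v := m.(type) {
				case metrics.Counter:
					flushToBackend(name, float64(v.Count()))
				case metrics.Gauge:
					flushToBackend(name, float64(v.Value()))
				}
			})
		case <-quit:
			return
		}
	}
}

func main() {
	quit := make(chan struct{})
	go reportLoop(quit)
	metrics.GetOrRegisterCounter("example.count", metrics.DefaultRegistry).Inc(1)
	time.Sleep(2 * reportInterval) // let the loop flush a couple of times
	close(quit)
}
```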

We could include an optional metrics backend (InfluxDB + Grafana + dashboards) setup in the repo and docs, similar to https://github.com/janos/swarm-test-cluster/tree/master/swarm-test-cluster (which is based on Terraform and Docker). A developer/user of swarm would then only have to run make to set up the metrics backend, and it would be configured out of the box to receive measurements from the process periodically.

@lmars (Contributor) commented Jan 25, 2018

> We could include an optional metrics backend (InfluxDB + Grafana + dashboards) setup in the repo

Not necessarily in this PR, but we should update the dev environment to include this, so I can just run swarm/dev/run.sh and have a Swarm cluster with a metrics backend. I know there was talk of switching from bash + docker to terraform, so perhaps do that at the same time?

@@ -75,6 +75,24 @@ func NewTimer(name string) metrics.Timer {
return metrics.GetOrRegisterTimer(name, metrics.DefaultRegistry)
}

// NewResettingTimer create a new ResettingTimer, either a real one of a NOP stub depending
Contributor

- NewResettingTimer create a new ResettingTimer, either a real one of a NOP stub depending
+ NewResettingTimer creates a new ResettingTimer, either a real one or a NOP stub depending

Member Author

Fixed.

swarm/api/api.go Outdated
log.Trace(fmt.Sprintf("Resolving : %v", uri.Addr))

// if the URI is immutable, check if the address is a hash
isHash := hashMatcher.MatchString(uri.Addr)
if uri.Immutable() || uri.DeprecatedImmutable() {
if !isHash {
apiResolveFail.Inc(1)
Contributor

I don't think it is correct to log this as a resolve failure, as no resolving was attempted (the user just provided bad input).

Contributor

I finally agree on this one and will remove it.

@@ -93,6 +117,7 @@ func (self *Api) Resolve(uri *URI) (storage.Key, error) {
// if DNS is not configured, check if the address is a hash
if self.dns == nil {
if !isHash {
apiResolveFail.Inc(1)
Contributor

Again is this really a resolve failure? I'd expect the resolve failure metric to tell me how many times Swarm tried to actually resolve a name but failed. Perhaps my expectation is incorrect though?

Contributor

My expectation was to capture how many times a resolve was attempted, in the sense of requested (a count), and how many of those attempts failed, answering the question: "How many times did people request a resolution, and how many of those requests failed?" Both expectations (mine and yours) can lead to misinterpretations; we should just settle on one. I believe it makes sense to keep them for now, and then remove such metrics as we go when we realize we are not gaining anything from them. Still, if people think we should remove ones like this one right away, I am ok with it.
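A minimal sketch of that pattern, for illustration only: the counter names and the lookup stub are hypothetical, and it uses the registry-level GetOrRegisterCounter helper that the NewTimer/NewCounter wrappers in this diff delegate to.

```go
// Illustrative sketch of the "count attempts, count failures" pattern
// described above; counter names and the lookup helper are made up.
package main

import (
	"errors"
	"fmt"

	"github.com/ethereum/go-ethereum/metrics"
)

var (
	apiResolveCount = metrics.GetOrRegisterCounter("api.resolve.count", metrics.DefaultRegistry) // every attempt
	apiResolveFail  = metrics.GetOrRegisterCounter("api.resolve.fail", metrics.DefaultRegistry)  // failed attempts only
)

// lookup stands in for the actual ENS/DNS resolution call.
func lookup(name string) (string, error) {
	return "", errors.New("resolver not configured")
}

func resolve(name string) (string, error) {
	apiResolveCount.Inc(1)
	addr, err := lookup(name)
	if err != nil {
		apiResolveFail.Inc(1)
		return "", err
	}
	return addr, nil
}

func main() {
	if _, err := resolve("theswarm.eth"); err != nil {
		fmt.Println("resolve failed:", err)
	}
	fmt.Println("attempts:", apiResolveCount.Count(), "failures:", apiResolveFail.Count())
}
```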

Contributor

> I believe it makes sense to keep them for now, and then remove such metrics as we go when we realize we are not gaining anything from them

I think it should be the other way around. Just add metrics we know are useful, with the freedom to add more useful ones in the future.

In this specific case, if I wanted to know how many resolve failures I'm getting, I'd also want to know which domains are failing, in which case I'd probably just look for errors in my logs (which have much more context than just a single number).

Contributor

When looking at failures, specifically in a production environment but also in a staging/test environment, I would structure my workflow by first creating alarms and visualizing error situations in a general manner, e.g. alarms that trigger at certain thresholds. Only then, if the counters/graphs show unacceptable levels or alarms fire, would I start the debugging process of looking at the logs for why the alarms were triggered or why there is an unusually/unexpectedly high number of errors. I believe this is a valid argument for keeping this counter. Nevertheless, if it's still maintained that the counter itself is of no value, I will proceed to remove it.

swarm/api/api.go Outdated
@@ -336,18 +391,23 @@ func (self *Api) AppendFile(mhash, path, fname string, existingSize int64, conte
}

func (self *Api) BuildDirectoryTree(mhash string, nameresolver bool) (key storage.Key, manifestEntryMap map[string]*manifestTrieEntry, err error) {
apiBuildDirTreeCount.Inc(1)
Contributor

Do you think this is a useful metric? I'm not so sure.

Member Author

We've been focusing on setting up the library, the reporter and the infrastructure around metrics. I agree that not all of the measurements currently are useful and meaningful - there will definitely be iterations on them. At the same time I am happy to fix those that are immediately known to be not useful in this PR. :)

Contributor

I have been developing mostly web REST APIs in previous years. Each such REST API is usually there because it serves a specific function and a user requirement/use case. From this point of view, I interpreted this function as something like a public API function, in which case we may be interested to know how many times each of these is being called. If my interpretation is incorrect, in that this API call isn't very relevant, is perhaps semantically not a user-facing API endpoint, or is maybe even obsolete/deprecated, I am happy for it to be removed.

Contributor

Given you are questioning whether this is useful, we should just drop it unless we decide it is useful. I personally don't think it is a useful metric.

Contributor

I was not really questioning the usefulness of the metric, but rather asking for clarification about the role, or maybe the relevance, of the API itself in terms of its (public) use case. I will proceed to remove it, but I would also welcome better documentation on such supposedly public API functions, in order to understand why they are there and their relevance for users of the API.

"github.com/ethereum/go-ethereum/swarm/api"
)

//templateMap holds a mapping of an HTTP error code to a template
var templateMap map[int]*template.Template

//metrics variables
var (
htmlCounter = metrics.NewCounter("api.http.errorpage.html.count")
Contributor

Do we need to be this granular? Should we not just log the count of 5xx errors?

Contributor

Thinking about it, it makes the most sense to me to actually track how many requests were made for JSON and how many for HTML. This is currently missing, as I couldn't identify a specific spot where this would be easy to add. For the time being this could be kept, as I expect 99.9% to be HTML anyway (but that's exactly why this could be interesting to measure, as I don't actually know).

Contributor

If you want to count the amount of HTML responses, check http.ResponseWriter.Header for the Content-Type containing text/html just before returning a response in ServeHTTP.

Contributor

The latter actually does not seem very practical, as the current code for ServeHTTP returns in many different places.
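One way around the many return points, sketched below with hypothetical names and counters (not code from this PR), is to wrap the handler and classify the response by its Content-Type header once, after the inner ServeHTTP has run:

```go
// Illustrative sketch: wrap any http.Handler and count responses by the
// Content-Type the inner handler set. Counter names are made up.
package main

import (
	"net/http"
	"strings"

	"github.com/ethereum/go-ethereum/metrics"
)

var (
	htmlResponseCount = metrics.GetOrRegisterCounter("api.http.response.html.count", metrics.DefaultRegistry)
	jsonResponseCount = metrics.GetOrRegisterCounter("api.http.response.json.count", metrics.DefaultRegistry)
)

// countContentTypes classifies each response after the wrapped handler ran,
// so the handler itself can return from as many places as it likes.
func countContentTypes(next http.Handler) http.Handler {
	return http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
		next.ServeHTTP(w, r)
		ct := w.Header().Get("Content-Type")
		switch {
		case strings.Contains(ct, "text/html"):
			htmlResponseCount.Inc(1)
		case strings.Contains(ct, "application/json"):
			jsonResponseCount.Inc(1)
		}
	})
}

func main() {
	mux := http.NewServeMux()
	mux.HandleFunc("/", func(w http.ResponseWriter, r *http.Request) {
		w.Header().Set("Content-Type", "text/html")
		w.Write([]byte("<html>hello</html>"))
	})
	http.ListenAndServe("127.0.0.1:8500", countContentTypes(mux)) // port is illustrative
}
```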

@@ -298,6 +305,13 @@ func (self *TreeChunker) hashChunk(hasher SwarmHash, job *hashJob, chunkC chan *
job.parentWg.Done()

if chunkC != nil {
//NOTE: this increases the chunk count even if the local node already has this chunk;
//on file upload the node will increase this counter even if the same file has already been uploaded
//So it should be evaluated whether it is worth keeping this counter
Contributor

I would definitely find this misleading. This is probably one of the most crucial metrics, we should make sure it is as accurate as possible.

Contributor

Agreed. I expect many changes with the rewrite and the syncer rewrite, so I thought it's good to have the metric in the code now, even though we will probably need to update it later anyway; wherever we move it now, it is probably wrong.

//So it should be evaluated whether it is worth keeping this counter
//and/or actually better track when the chunk is Put to the local database
//(which may question the need for disambiguation when a completely new chunk has been created
//and/or a chunk is being put to the local DB; for chunk tracking it may be worth distinguishing
Contributor

We should probably have a cache hit / miss counter for chunks, so we know how many requests we are managing to serve locally, and how many are hitting the network (both Get and Put).
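A minimal sketch of such hit/miss counters (the store type, its fields, and the counter names are made up for illustration): increment a hit counter when the chunk is served locally and a miss counter when it has to go to the network.

```go
// Illustrative only: a minimal shape of the chunk cache hit/miss counters
// described above. exampleStore and the counter names are invented here.
package main

import (
	"errors"
	"fmt"

	"github.com/ethereum/go-ethereum/metrics"
)

var (
	chunkCacheHit  = metrics.GetOrRegisterCounter("storage.chunk.cache.hit", metrics.DefaultRegistry)
	chunkCacheMiss = metrics.GetOrRegisterCounter("storage.chunk.cache.miss", metrics.DefaultRegistry)
)

type exampleStore struct {
	local map[string][]byte // stand-in for the in-memory/local store layer
}

func (s *exampleStore) fetchFromNetwork(key []byte) ([]byte, error) {
	return nil, errors.New("network retrieval not implemented in this sketch")
}

func (s *exampleStore) Get(key []byte) ([]byte, error) {
	if data, ok := s.local[string(key)]; ok {
		chunkCacheHit.Inc(1) // served locally
		return data, nil
	}
	chunkCacheMiss.Inc(1) // has to hit the network
	return s.fetchFromNetwork(key)
}

func main() {
	s := &exampleStore{local: map[string][]byte{"known": []byte("chunk data")}}
	s.Get([]byte("known"))
	s.Get([]byte("unknown"))
	fmt.Println("hits:", chunkCacheHit.Count(), "misses:", chunkCacheMiss.Count())
}
```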

Contributor

My assumption was that the memstore.get counters are the counters for chunks served by the cache.

swarm/swarm.go Outdated
startTime time.Time
metricsTimeout = 5 * time.Second
startCounter = metrics.NewCounter("stack,start")
stopCounter = metrics.NewCounter("stack,stop")
Contributor

These should be stack.start and stack.stop.

Member Author

I think those are actually redundant. @holisticode what do you think?

Contributor

@nonsense in the sense that you already have metrics measuring start/stop events at process level? If that is so, then I agree it may be redundant.

Member Author

@holisticode we have the uptime gauge, and we also track docker daemon stats. We can infer from them how often swarm crashes for example.

swarm/swarm.go Outdated

go func() {
for _ = range ticker.C {
self.sendMetrics()
Contributor

This loop should exit if the node is stopped.

Contributor

I chose a "lazy" solution here, as I concluded we'll be collecting those metrics for the complete duration of a swarm process run. As far as I understand, stopping a swarm node means shutting down the process, in which case the process is killed and this loop dies with it. This may be a bad assumption, in which case I will add an exit path.
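For reference, a minimal sketch of the exit path being discussed, with hypothetical names (metricsLoop, updateGauges, quit): select on a quit channel alongside the ticker so the goroutine returns when the node is stopped instead of leaking.

```go
// Illustrative sketch of the exit path being discussed; not code from this PR.
package main

import (
	"fmt"
	"time"
)

func metricsLoop(interval time.Duration, quit <-chan struct{}) {
	ticker := time.NewTicker(interval)
	defer ticker.Stop()
	for {
		select {
		case <-ticker.C:
			updateGauges()
		case <-quit: // closed when the node shuts down
			return
		}
	}
}

func updateGauges() {
	// Stand-in for refreshing uptime/db-size gauges before the next flush.
	fmt.Println("gauges updated")
}

func main() {
	quit := make(chan struct{})
	go metricsLoop(100*time.Millisecond, quit)
	time.Sleep(350 * time.Millisecond)
	close(quit) // simulate node stop; the loop exits instead of running forever
}
```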

swarm/swarm.go Outdated
}()
}

func (self *Swarm) sendMetrics() {
Contributor

This method name is misleading; it doesn't seem to be sending the metrics anywhere, just updating some metrics?

Member Author

Fixed.

@holisticode (Contributor)

Builds fail due to incomplete vendoring of github.com/influxdata/influxdb, it seems?

@@ -426,3 +444,15 @@ func (self *Kademlia) String() string {
rows = append(rows, "=========================================================================")
return strings.Join(rows, "\n")
}

//We have to build up the array of counters for each index
func (self *Kademlia) initMetricsVariables() {
Contributor

@lmars please review this func, added as per your own review comments. I want to make sure I am correctly initializing stuff based on the kad table size and indexes. Thanks!
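The body of initMetricsVariables is not shown in this excerpt; the sketch below only illustrates the idea of one counter per kademlia bucket index, with an illustrative maxProx and made-up counter names. Sizing the slice so that every index is valid, inclusive of the last one, is the point (cf. the later "off-by-one index in metrics containers" commit in this PR).

```go
// Hypothetical sketch only: pre-allocate one counter per proximity bin so
// every bucket index can be incremented safely.
package main

import (
	"fmt"

	"github.com/ethereum/go-ethereum/metrics"
)

const maxProx = 16 // illustrative number of proximity bins, not the real value

var bucketAddIndexCount = make([]metrics.Counter, maxProx+1)

func initMetricsVariables() {
	for i := 0; i <= maxProx; i++ { // one counter per bucket index, inclusive
		bucketAddIndexCount[i] = metrics.GetOrRegisterCounter(
			fmt.Sprintf("network.kademlia.bucket.%d.count", i), metrics.DefaultRegistry)
	}
}

func main() {
	initMetricsVariables()
	bucketAddIndexCount[maxProx].Inc(1) // the last index is valid too
	fmt.Println(len(bucketAddIndexCount), "bucket counters initialised")
}
```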

@nonsense force-pushed the fork-go-metrics-for-resetting-timer-swarm-instrumentation branch 2 times, most recently from 14b6b63 to 2f67745, on February 21, 2018 11:52
@nonsense force-pushed the fork-go-metrics-for-resetting-timer-swarm-instrumentation branch 2 times, most recently from f11b3f9 to 53c8785, on February 23, 2018 10:59
@nonsense force-pushed the fork-go-metrics-for-resetting-timer-swarm-instrumentation branch from 53c8785 to e891d75 on February 23, 2018 11:45
@gbalint added this to the 1.8.2 milestone Feb 23, 2018
@gbalint merged commit dcca613 into ethereum:master Feb 23, 2018
prestonvanloon pushed a commit to prestonvanloon/go-ethereum that referenced this pull request Apr 2, 2018
* swarm: initial instrumentation with go-metrics

* swarm: initialise metrics collection and add ResettingTimer to HTTP requests

* swarm: update metrics flags names. remove redundant Timer.

* swarm: rename method for periodically updating gauges

* swarm: finalise metrics after feedback

* swarm/network: always init kad metrics containers

* swarm/network: off-by-one index in metrics containers

* swarm, metrics: resolved conflicts
mariameda pushed a commit to NiluPlatform/go-nilu that referenced this pull request Aug 23, 2018