
transient stats prototype #3504

Merged 1 commit into cockroachdb:master on Dec 30, 2015
Conversation

@tbg (Member) commented Dec 21, 2015

This is a prototype for the collection of transient performance statistics.
It does not yet expose many "useful" runtime statistics, but is meant to lay
the groundwork for doing so across different parts of the system.

We were previously using the go-metrics package, which provides the concept
of a Registry bundling different metrics. Unfortunately, various limitations,
a certain amount of interface bloat (which made it annoying to provide custom
implementations), and various design choices made it unwieldy to work with
without upstream adaptations.
Instead, this introduces the util/metric package, which replaces the registry
functionality and provides light wrapper implementations around metrics.
Each registry object owns its metrics' goroutines and terminates them using
a closer channel; this allows for easy integration with util.Stopper.

Metric registries can be nested and in fact we do so, keeping one "global"
metrics registry and various others in it. In particular Node- and Store-level
metrics have individual registries. This is useful because it allows their
metrics to be stored to their corresponding time series with a simple iteration
over a registry, with correct names both globally and locally.

For now, we wrap Counter, Gauge and EWMA types and provide a windowed
histogram (based on Gil Tene's HDRHistograms). Others (such as reservoir-backed
histograms) can be added as required.


@petermattis (Collaborator)

@mrtracy definitely needs to take a look at this.


Review status: 0 of 18 files reviewed at latest revision, 4 unresolved discussions, some commit checks failed.


server/status.go, line 94 [r1] (raw file):
I know we've used transient to describe these stats, but it doesn't feel quite right to me. How about metrics (which is the term for the supporting package) or stats?


util/metric/metric.go, line 41 [r1] (raw file):
You can register (Add) a Registry with itself? The parameters deserve names.

After looking at the rest of the code, I'm wondering why the name isn't attached to the Iterable. Would you ever add the same Iterable with a different name? Or perhaps the second argument should be a new Metric interface:

type Metric interface {
  Iterable
  Name() string
}

That would also resolve the issue with Registry itself not being a metric.


util/metric/metric.go, line 265 [r1] (raw file):
curSum looks to be protected by mu. I'd move it below the mu definition and add a comment that mu protects all the fields below it.


util/metric/metric.go, line 326 [r1] (raw file):
For all of the Register functions, I would have placed the Registry parameter first. Not sure if I have a cogent reason for this, but it seems the primary action is calling Registry.Add with one or more metrics created via the remaining arguments.


Comments from the review on Reviewable.io

@bdarnell (Contributor)

Review status: 0 of 18 files reviewed at latest revision, 16 unresolved discussions, some commit checks failed.


server/node.go, line 472 [r1] (raw file):
Won't every (non-test) call to CallComplete need to do this time.Now().Sub(start) calculation? Maybe CallComplete should take a start time instead of a duration and do the math itself.


server/status/feed_test.go, line 305 [r1] (raw file):
The constant d could be local to this function.


server/status/monitor.go, line 59 [r1] (raw file):
I suggest using - instead of . as the separator in metric names, because we'll probably want to match them with regular expressions at some point and the escaping will be annoying.


server/status/monitor.go, line 217 [r1] (raw file):
This pairing of a counter and a rate is common; in other systems I've used the output of a rate was a superset of the output of a counter so you could use the rate on its own.


util/metric/metric.go, line 19 [r1] (raw file):
This seems kind of excessive. At Google the standard is (or used to be) minute, ten minute, and hour.


util/metric/metric.go, line 54 [r1] (raw file):
s/registry/metaRegistry/?


util/metric/metric.go, line 91 [r1] (raw file):
s/metrics/registry/? This naming is confusing.


util/metric/metric.go, line 97 [r1] (raw file):
It would be nice to avoid background goroutines altogether here. Each metric could track its last-updated time and catch up on its periodic computations when updated or read.


util/metric/metric.go, line 177 [r1] (raw file):
I don't like that %s shows up in the caller-supplied names. I would like to just append the time scale at the end, but I see that you put it in the middle in some places. Is that important?


util/metric/metric.go, line 179 [r1] (raw file):
What makes this latency-specific (as opposed to a more generic RegisterHistograms like you have for RegisterEWMAS)?


util/metric/metric.go, line 263 [r1] (raw file):
I don't like EWMA as a public name. The name should be more about the semantics than the implementation. Rate?


util/metric/metric.go, line 326 [r1] (raw file):
Me too. I would actually make all of the Register functions methods of the Registry interface unless it is possible to use a nil Registry.


util/metric/metric.go, line 327 [r1] (raw file):
Why is this different from the list of scales in RegisterLatency?


@tbg (Member, Author) commented Dec 23, 2015

Review status: 0 of 18 files reviewed at latest revision, 15 unresolved discussions.


server/node.go, line 472 [r1] (raw file):
Yeah, I think that's a good idea. @mrtracy and I talked about removing the feed portion. I'll get to that in that step.


server/status.go, line 94 [r1] (raw file):
renamed to metrics.


server/status/monitor.go, line 59 [r1] (raw file):
I'm following the naming scheme for time series here. Happy to break it if you think the two don't have to correspond and it's worth it.


server/status/monitor.go, line 217 [r1] (raw file):
I don't quite follow. Could you reword/elaborate?


util/metric/metric.go, line 19 [r1] (raw file):
works for me. I'm going to change it to that.


util/metric/metric.go, line 41 [r1] (raw file):
I want to be able to nest registries. For example, each Store has an associated registry with metrics cr.store.<name>. This registry is part of the metaRegistry which will render the name as cr.store.<name>.<store_id>.
I'm not ever planning to make use of that nesting further, but it seems preferable to having two types of registries.
There's no Name() method since I haven't required one. Where would you want to use it?


util/metric/metric.go, line 54 [r1] (raw file):
actually s/registries/iterables (i.e. typically metrics or other registries)/ is what I wanted.


util/metric/metric.go, line 91 [r1] (raw file):
I updated the comment.


util/metric/metric.go, line 97 [r1] (raw file):
that's a good idea. I'll do it in a separate PR if that's ok with you.


util/metric/metric.go, line 177 [r1] (raw file):
This yoga is to have the naming correspond to the time series names, where it's cr.store.<metric>.<id>. I could split that up into supplying a suffix and a prefix, but that seemed silly. I think there's value in sticking to one general format for naming metrics, but let me know if you think differently/have other ideas.


util/metric/metric.go, line 179 [r1] (raw file):
It measures in milliseconds with 2 digit precision and caps at one minute, other than that nothing. The idea was mostly that everyone should have a clear idea which one to use for getting a latency histogram (and there might be other histograms at some point, a reservoir-backed one could make more sense to sample heavy hitters or something like that - not that I would have something specific in mind).


util/metric/metric.go, line 263 [r1] (raw file):
Sounds good. I'll make the change.


util/metric/metric.go, line 265 [r1] (raw file):
good catch, done.


util/metric/metric.go, line 326 [r1] (raw file):
I removed the Registry interface (not useful with all the methods added) and went with Ben's suggestion. I stripped Register from all methods since register.Counter(...) should be descriptive enough.
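The resulting constructor-method style, sketched (this is an assumed shape, not the actual code; the real constructors also handle naming and more metric types):

```go
package main

import "fmt"

// Counter is a minimal all-time count.
type Counter struct{ n int64 }

func (c *Counter) Inc(d int64) { c.n += d }

func (c *Counter) Count() int64 { return c.n }

// Registry is a concrete struct; metric constructors hang off it
// directly, so callers write registry.Counter("name") rather than
// metric.RegisterCounter(registry, "name").
type Registry struct{ metrics map[string]*Counter }

func NewRegistry() *Registry { return &Registry{metrics: map[string]*Counter{}} }

// Counter creates, registers, and returns a counter in one step.
func (r *Registry) Counter(name string) *Counter {
	c := &Counter{}
	r.metrics[name] = c
	return c
}

func main() {
	reg := NewRegistry()
	calls := reg.Counter("calls.success")
	calls.Inc(1)
	fmt.Println(reg.metrics["calls.success"].Count()) // 1
}
```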


util/metric/metric.go, line 327 [r1] (raw file):
Just playing around. Changed it to {1m, 10m, 1h}.


@petermattis (Collaborator)

Review status: 0 of 18 files reviewed at latest revision, 13 unresolved discussions.


server/status/monitor.go, line 59 [r1] (raw file):
The %s in these strings is odd, especially because it is needed for some metric types and not for others. I think it would be more obvious if the sub-metrics of something like Rate had fixed names that were appended to the parent metric name.


server/status/monitor.go, line 217 [r1] (raw file):
A rate should internally maintain a count of the number of times Add was called (using a Counter metric). This would remove the need for a separate numSuccess metric here and numError metric below.
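Folding the counter in might look roughly like this (a sketch; the real Rates helper also carries the per-timescale moving averages):

```go
package main

import "fmt"

// rates bundles an all-time counter with per-timescale averages, so
// callers never need a separate success/error Counter next to it.
type rates struct {
	count int64
	// ... per-timescale moving averages would live here ...
}

// Add records a value into every timescale and bumps the counter.
func (r *rates) Add(v int64) {
	r.count += v
	// ... feed v into the moving averages ...
}

// Count returns the all-time total.
func (r *rates) Count() int64 { return r.count }

func main() {
	success := &rates{}
	success.Add(1)
	success.Add(1)
	fmt.Println(success.Count()) // 2
}
```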


util/metric/metric.go, line 41 [r1] (raw file):
My thought was that the Name() method would replace the format passed to Registry.Add, but perhaps that wouldn't work right with nested registries. I wasn't considering that a possibility. When I've used metrics packages like this in the past there is only a single registry (often global) and it is up to the user to provide hierarchy to the names.


@tbg (Member, Author) commented Dec 25, 2015

Review status: 0 of 18 files reviewed at latest revision, 13 unresolved discussions.


server/status/monitor.go, line 59 [r1] (raw file):
Rates is not a metric type but a helper which creates a bunch of metrics, same for Latency. I'll simply append here, it's probably about as clear. Then the format string will only appear when adding a registry within another, so basically nowhere.


server/status/monitor.go, line 217 [r1] (raw file):
I've kept those mostly because they're included in test coverage. Before I make changes here, do we actually want to keep all-time counts? Agreed that they can be folded into Rate quite easily, but I wouldn't bother putting a Counter metric inside of a Rate - I'd just absorb it.


util/metric/metric.go, line 41 [r1] (raw file):
That wouldn't work (smoothly), and with our use cases here I think we should stick to having registries within others (take a look at OnStartNode in server/status/monitor.go or NewStoreStatusMonitor - trying to remap everything ad-hoc is going to be worse). We won't make extensive use of it, but it nicely abstracts away these oddities and in the end, we do use a global registry primarily but we allow localized parts of the system to pretend that they do the same when in fact they're on a subset.


@tbg (Member, Author) commented Dec 25, 2015

Review status: 0 of 19 files reviewed at latest revision, 13 unresolved discussions.


util/metric/metric.go, line 97 [r1] (raw file):
Actually made the change, was easy enough. Looks much better.


@tamird (Contributor) commented Dec 26, 2015

Reviewed 5 of 18 files at r1, 4 of 10 files at r2, 10 of 10 files at r3.
Review status: all files reviewed at latest revision, 17 unresolved discussions, some commit checks failed.


sql/executor.go, line 235 [r3] (raw file):

defer func(start time.Time) {
  e.latency.RecordValue(time.Since(start).Nanoseconds())
}(time.Now())

util/metric/registry.go, line 46 [r3] (raw file):
s/implement/implements/


util/metric/registry.go, line 48 [r3] (raw file):
no error on double-add?


util/metric/registry.go, line 69 [r3] (raw file):
might be worth adding var _ json.Marshaler = &Registry{}


@bdarnell (Contributor)

Review status: 15 of 19 files reviewed at latest revision, 5 unresolved discussions.


server/status/monitor.go, line 59 [r1] (raw file):
Metrics and time series should follow the same naming scheme. I think it would be a good idea to separately change both of them from . to -, but that shouldn't be tied in to this change.


server/status/monitor.go, line 217 [r1] (raw file):
Yes, we want to keep all-time counts. They're the most important part (of a rate-like metric), because they are the cleanest thing to aggregate. The moving averages computed on a single node are just for convenience; for aggregate stats we discard the per-node derived values and recompute based on the global sum.


util/metric/metric.go, line 179 [r1] (raw file):
You mean it can't measure anything smaller than a millisecond or larger than a minute, and records two significant digits in that range? I think we'll often want sub-millisecond precision. We may need to have more than one "standard" histogram configuration with these limitations. How well do these histograms aggregate? The old util/metrics implementation had another histogram type that was designed for aggregation; I can't remember what its range/precision limitations were.


@tbg (Member, Author) commented Dec 27, 2015

Review status: 15 of 19 files reviewed at latest revision, 5 unresolved discussions.


server/status/monitor.go, line 59 [r1] (raw file):
ok, filed #3530.


server/status/monitor.go, line 217 [r1] (raw file):
ok, I'll fold the counters into the rates.


util/metric/metric.go, line 179 [r1] (raw file):
Yes, that's (close to) what it is now.
I looked at the original util/metrics code - it's similar (though this one here is more versatile). The histograms are just bins, so they aggregate well.
What would you consider sane defaults (and/or histogram versions) for measuring latency, in terms of the maximum we care about and the precision that we want? The "native" base unit is ns, but that's probably too much. .1ms? .01ms? Memory is a bit of a concern if we should wind up with many many wide-range and very precise histograms (I think ~100-200k for one is expected; there's a back-of-envelope computation to get the precise size if we care).
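The claim that bin-based histograms aggregate well comes down to this: merging two histograms is just element-wise addition of their bucket counts. A toy fixed-width-bin version (real HDR histograms use exponentially sized buckets; this is a deliberately simplified sketch):

```go
package main

import "fmt"

const numBins = 8

// histogram is a toy histogram with fixed-width bins over [0, numBins*10).
type histogram struct {
	bins [numBins]int64
}

func (h *histogram) Record(v int64) {
	i := v / 10
	if i >= numBins {
		i = numBins - 1 // clamp, like HDR's highest trackable value
	}
	h.bins[i]++
}

// merge demonstrates why bin histograms aggregate cleanly across nodes:
// the combined histogram is just the sum of the bucket counts.
func merge(a, b *histogram) *histogram {
	var out histogram
	for i := range out.bins {
		out.bins[i] = a.bins[i] + b.bins[i]
	}
	return &out
}

func main() {
	var node1, node2 histogram
	node1.Record(5)
	node1.Record(25)
	node2.Record(5)
	total := merge(&node1, &node2)
	fmt.Println(total.bins[0], total.bins[2]) // 2 1
}
```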


@tbg (Member, Author) commented Dec 28, 2015

Review status: 8 of 19 files reviewed at latest revision, 5 unresolved discussions.


server/status/monitor.go, line 217 [r1] (raw file):
Done, PTAL.


util/metric/metric.go, line 179 [r1] (raw file):
I touched this up (in particular, removed the base unit - it was supposed to save some memory, but shouldn't matter for now): clarified comment, and simply storing ns with two digits precision.



@tbg tbg force-pushed the transient_stats branch 2 times, most recently from b2c8370 to adce909 Compare December 28, 2015 13:54
@petermattis (Collaborator)

LGTM


Review status: 8 of 22 files reviewed at latest revision, 5 unresolved discussions, some commit checks failed.


util/metric/metric.go, line 126 [r5] (raw file):
Is it expensive to make this copy?


@tbg (Member, Author) commented Dec 28, 2015

Review status: 8 of 22 files reviewed at latest revision, 5 unresolved discussions, some commit checks failed.


util/metric/metric.go, line 126 [r5] (raw file):
Shouldn't be too bad. It's mostly cloning a slice. I've slightly optimized the locking (doing the import outside the locked scope) just for the heck of it. Each and MarshalJSON don't copy, so they can be used for some purposes. Are you expecting this to be called a lot?


@petermattis (Collaborator)

Review status: 8 of 22 files reviewed at latest revision, 5 unresolved discussions, some commit checks failed.


util/metric/metric.go, line 126 [r5] (raw file):
I was just curious what Export and Import are doing.


@bdarnell (Contributor)

LGTM


Review status: 8 of 22 files reviewed at latest revision, 4 unresolved discussions.


server/status/monitor.go, line 58 [r6] (raw file):
The trailing dot is awkward for the same reason the %s was. I think the metrics package should be responsible for the dot too. (it's also inconsistent: you use a trailing dot here but not later for registry.Latency("sql.latency")).


@tbg (Member, Author) commented Dec 28, 2015

Review status: 8 of 22 files reviewed at latest revision, 4 unresolved discussions.


server/status/monitor.go, line 58 [r6] (raw file):
Not quite sure where you're coming from. Seems weird that the registry should know about our naming scheme (and lock it in). The parameter being passed here is called prefix, so the way it's used seems spot on for me. These metrics helpers (Rates and Latency) can't work exactly the same as individual metrics (where you can just pass the full name) since they create multiple metrics. The usage here makes it obvious what will happen while your preference will be more surprising, so I prefer to keep it that way unless there's something I'm missing.
Thanks for pointing out sql.latency, that's an oversight and I've added the missing dot.


@bdarnell (Contributor)

Review status: 8 of 22 files reviewed at latest revision, 4 unresolved discussions.


server/status/monitor.go, line 58 [r6] (raw file):
It's far from obvious that our convention requires a trailing dot in the argument to some methods of Registry, while forbidding it in others. This is a mistake we will make again and again. I'd rather make Registry aware of our naming scheme than create this opportunity for error.



@tbg tbg force-pushed the transient_stats branch 2 times, most recently from b59dbb9 to 1553899 Compare December 29, 2015 13:54
@tbg (Member, Author) commented Dec 29, 2015

Reviewed 2 of 18 files at r1, 2 of 10 files at r2, 4 of 10 files at r3, 12 of 13 files at r5, 1 of 1 files at r6, 1 of 1 files at r7.
Review status: 19 of 22 files reviewed at latest revision, 4 unresolved discussions.


server/status/monitor.go, line 58 [r6] (raw file):
I went ahead and simply dropped the dot (that's what I initially had, and btw also the reason the dot was missing from sql.latency). The resulting metric names still make sense, it doesn't pollute the package and usage is non-ambiguous for our use case. Does that work for you?



@@ -276,8 +279,14 @@ func TestNodeStatusRecorder(t *testing.T) {
generateStoreData(2, "capacity.available", 100, 75),

// Node stats.
generateNodeData(1, "calls.success", 100, 2),
generateNodeData(1, "calls.error", 100, 1),
generateNodeData(1, "exec.successcount", 100, 2),
Contributor comment on the diff:
These names need a delimiter, and I still think the metric package should add one. It doesn't need to be the same delimiter if that would alleviate your concern about codifying our naming conventions: exec.success@count, exec.success@1h.

@mrtracy (Contributor) commented Dec 30, 2015

Added a few comments, although I do approve of the direction. I finally had a chance to dig into the go-metrics Histogram, and it does seem incorrect; while exponential decay seems more appealing in principle than the hard cliffs of the windowed histogram, the exponential system in go-metrics can probabilistically lose max/min samples, which are very important. This might be something to eventually communicate back to the go-metrics team.

I think your commit message is no longer correct, as ticking does not require a goroutine in your latest iteration.

LGTM


Review status: 19 of 22 files reviewed at latest revision, 7 unresolved discussions, some commit checks failed.


server/status/monitor.go, line 59 [r1] (raw file):
My only question is: why do so many other naming schemes for metrics rely on '.' as a separator? Is it simply an otherwise thoughtless decision based on emulating previous designs?


server/status/monitor.go, line 58 [r6] (raw file):
I think that it's totally appropriate to have the registry aware of our naming scheme; organization seems like a primary concern of the registry object. It should be fine to "lock in" our own naming scheme: we are unlikely to need to deviate from it, and if we did, we could just modify the Registry, which is not a large piece of code.


server/status/recorder.go, line 196 [r8] (raw file):
As I understand it, you're going to want to record Max + ValueAtQuantile() for some set of quantiles. I don't know if that varies between histograms, though, or if we always want to record the same quantiles.

Given the windowed nature of this, I don't know if count makes a lot of sense. This could be paired with a rate or cumulative counter if that information is needed.


server/status/recorder_test.go, line 282 [r8] (raw file):
Agreed, it should be simple for the metrics package to assemble these hierarchical names. I don't understand the concern with codifying our own naming scheme; this is an internal package that nobody else is using, we're the only ones accessing these metrics, and as far as I can tell every other metrics package adheres to an extremely similar '.'-based naming convention.


util/metric/registry.go, line 103 [r8] (raw file):
I understand what you're going for here; that said, the time series system is designed to eventually have rollups itself, and that's probably the best place to manage this. The values being recorded from the 1M histogram could be rolled up by the time series system to get the equivalents of the 10M and 1H scales.

Even without support for rollups, you probably at least want support for recording at variable time scales before adding this; all time series are currently recorded every ten seconds, but that's going to be of extremely dubious value for a 1H sample window, especially when old sample windows will be dropped on precise 15 minute boundaries.


@tbg (Member, Author) commented Dec 30, 2015

Review status: 19 of 22 files reviewed at latest revision, 7 unresolved discussions, some commit checks failed.


server/status/monitor.go, line 58 [r6] (raw file):
Don't disagree in principle, it just triggers my OCD that 99.999% of the naming scheme should be in our code (leaving a perfectly general-purpose metrics library) and then on the last meter we go ahead and break any other hypothetical user's naming scheme for what seems like a weak reason. But, I'll just turn off my OCD for a short period of time and make the change.


server/status/recorder.go, line 196 [r8] (raw file):
yeah, Count doesn't make much sense. Probably some fixed quantiles would be good. Given that this is for time series, it definitely needs to be the same set of quantiles between invocations for the same Histogram (so we can't use the p-iteration, which is usually the one that gives "good" sample points). Updating to that effect.


server/status/recorder_test.go, line 282 [r8] (raw file):
"this is an internal package that nobody else is using" is probably true, but I've looked around and the metrics package I'd want to use elsewhere doesn't exist. Not saying I want to put nontrivial effort into making another one but my other comment above applies.
If the registry took care of all the delimiting/hierarchy stuff, I think your point would be valid. But it doesn't do that, and if it did, the interface would be much fatter and not necessarily better, seeing that we also need to share the names of all the generated time series with the ui, which is hard if they're only puzzled together at runtime by some auxiliary package. Note from my other comment, though, that the issue is off the table for now since I went ahead and added a separator.


util/metric/registry.go, line 103 [r8] (raw file):
Good point about variable times. I'm not in favor of making the metric timescales-centric to the point where they don't make sense on their own (I'd like to be able to use them ad-hoc in the block writer, for example), though rollups would be nice to get cluster-wide latency histograms. Also, not everything contained in the metric needs to go to time series. I'm not sure what the right interface is between the two, but it looks safe to figure it out after this lands.
How would we store the histogram for the rollup? It looks like we'd have to export all of its state and at the end the code that makes sense of the state is in hdrhistogram. I don't immediately see how this would work. Could you create an issue with your initial thoughts?


@tbg (Member, Author) commented Dec 30, 2015

thanks for the review everyone. @mrtracy I'm going to leave this open for you since I took care of the TODO about recording histograms.

@bdarnell (Contributor)

Review status: 15 of 22 files reviewed at latest revision, 7 unresolved discussions.


server/status/monitor.go, line 59 [r1] (raw file):
My guess is that . as separator is especially appealing in javascript, since exec.error.count could be a direct reference into an appropriately-structured JSON object. And even in other languages it's possible to construct structures that match variable names in code to the names in the metric system, while a separator like - forces you to handle metric names as strings that differ from local variable names (in most languages).

Googling for "metric naming schemes" turns up three different schemes in the top four results: librato and graphite use ., new relic uses /, and prometheus uses _.


@mrtracy (Contributor) commented Dec 30, 2015

LGTM


Review status: 15 of 22 files reviewed at latest revision, 7 unresolved discussions.


util/metric/registry.go, line 103 [r8] (raw file):
I'll make an issue, for now you can just leave a TODO here.



This is a prototype for the collection of transient performance statistics.
It does not yet expose "useful" runtime statistics, but is meant to lay the
groundwork for doing so across different parts of the system.

We were previously using the `go-metrics` package, which provides the concept
of a `Registry` bundling different metrics. Unfortunately, various limitations,
a certain amount of interface bloat (which made it annoying to provide custom
implementations), and various design choices made it unwieldy to work with
without upstream adaptations.
Instead, this introduces the `util/metric` package, which replaces the registry
functionality and provides light wrapper implementations around metrics.

Metric registries can be nested and in fact we do so, keeping one "global"
metrics registry and various others in it. In particular Node- and Store-level
metrics have individual registries. This is useful because it allows their
metrics to be stored to their corresponding time series with a simple iteration
over a registry, with correct names both globally and locally.

For now, we wrap `Counter`, `Gauge` and `EWMA` types and provide a windowed
histogram (based on Gil Tene's `HDRHistogram`s). Others (such as reservoir-backed
histograms) can be added as required.
tbg added a commit that referenced this pull request Dec 30, 2015
@tbg tbg merged commit 7041ad3 into cockroachdb:master Dec 30, 2015
@tbg tbg deleted the transient_stats branch December 30, 2015 19:21
pav-kv pushed a commit to pav-kv/cockroach that referenced this pull request Mar 5, 2024
Give a human being some insight into how we might have gotten to this
state based on feedback from cockroachdb#3504.