stats: add unit support to histogram #8484

troshko111 · 2019-10-04T01:22:51Z

Signed-off-by: Taras Roshko troshko@netflix.com

Description: Add unit to the histogram type so that the stat sinks can disambiguate among various types of histograms (some Metric stores support timer vs non-timer histograms and generally speaking there already are histograms measuring bytes and microseconds, but it's all assumed to be millis on flush
Risk Level: Low
Testing: Unit tests
Docs Changes: N/A
Release Notes: Included
Fixes: #8164

include/envoy/stats/histogram.h

test/mocks/stats/mocks.h

test/mocks/stats/mocks.cc

include/envoy/stats/timespan.h

troshko111 · 2019-10-08T01:04:39Z

@jmarantz
Release builds are failing due to stat size checking tests:

StatsThreadLocalStoreTestNoFixture.MemoryWithoutTlsFakeSymbolTable and StatsThreadLocalStoreTestNoFixture.MemoryWithTlsFakeSymbolTable are 16 bytes off

ClusterMemoryTestRunner.MemoryLargeClusterSizeWithFakeSymbolTable and
ClusterMemoryTestRunner.MemoryLargeClusterSizeWithRealSymbolTable are 32 bytes off

I don't know exactly how these are computed but a small size difference is expected, let me know if I can adjust the asserts.

Other things I'd like your opinion on:

HistogramCompletableTimespan is kind of ugly with these asserts, one thing I was thinking of is to introduce a method on histogram like quantity() returning an enum with values Unspecified, Information, Time and assert on quantity == Time instead of the switch in the ctor.
Multiple null histograms cached in fields in ThreadLocalStoreImpl vs enum to null histogram map (this requires either a copy ctor on the null histogram class or a move ctor which seemed like a bad trade off so I stored them in fields in this PR)
I'd like to update the metrics service to send the unit along with the bucket data, this is a forward/backward compatible change and should not break anyone.

jmarantz · 2019-10-08T02:08:11Z

Yes if you are adding some stat data of value, like an enum to a histogram, that seems totally worth it and you should just bump the limit.

You can also test locally with 'blaze test -c opt --test_env=ENVOY_MEMORY_TEST_EXACT=true' so you can get the test working locally before pushing to CI.

I will start reviewing your PR; sorry for the delay.

jmarantz

flushing comments. This looks amazing, thanks for doing this! I do have a deprecation policy question and a bunch of code nits so far.

jmarantz · 2019-10-08T02:20:59Z

include/envoy/event/dispatcher.h

I see how this is a positive change. However I wonder if this might be a breaking change for monitoring setups for some deployments. How will we message this out and manage this transition?

I've been thinking about this change before making it and came to a conclusion that however nasty this may be, the alternative is no better (potentially worse).

This is a breaking change as people definitely have alerts set up for these. If left as-is, statsd change is not possible (automatically append the actually correct symbol) and Metric sinks supporting generic histograms vs timers will report milli-milliseconds, micro-milliseconds and other nonsense (they would take values in seconds and scale using metric prefixes and display the scaled value so the metric name suffix will mismatch). It's also simply put messy. Some of the reported histograms right now are rather confusing already (Redis command latency is reported in millis or micros depending on a config value) so alerting on these is error-prone at best.

Having said that we may create a table of old -> new metric names for statsd and for non-statsd sinks so people can do a low effort find & replace instead of figuring it out? This can be published with release itself as a pre-read.

Thoughts?

@htuch @alyssawilk should this be considered an API change and versioned as such?

To sanity check, this is changing the variable names for all statsd variables? I definitely think that's something that folks should be able to migrate to carefully and not have to fix as they pull HEAD.
Ideally it's the sort of thing we could "runtime" guard and do away with but I don't think it's the sort of thing you can flip live. Given that I'd lean towards a config option which we deprecate to (eventually) avoid code cruft

@alyssawilk this affects histogram names, I've compiled the changes to make this easier to understand:

Histogram name changes for non-statsd sinks like metric service (unit symbol suffix dropped if present)

Group Old New

Cluster upstream_cx_length_ms upstream_cx_length

Cluster upstream_cx_connect_ms upstream_cx_connect

Dispatcher poll_delay_us poll_delay

Dispatcher loop_duration_us loop_duration

Dubbo request_time_ms request_time

Dynamo upstream_rq_time same

Dynamo upstream_rq_time_{i}xx same

Dynamo upstream_rq_time_unknown same

HTTPConnMan downstream_rq_time same

HTTPConnMan downstream_cx_length_ms downstream_cx_length

Listener downstream_cx_length_ms downstream_cx_length

Mongo time_ms time

Mongo reply_size same

Mongo reply_time_ms reply_time

Redis latency same

Redis upstream_rq_time same

Server initialization_time_ms initialization_time

Thrift request_time_ms request_time

UserAgent downstream_cx_length_ms downstream_cx_length

ZK <opcode>_latency same

ZK connect_response_latency same

Histogram name changes for statsd sinks (unit symbol suffix appended where absent)

Group Old New

Cluster upstream_cx_length_ms same

Cluster upstream_cx_connect_ms same

Dispatcher poll_delay_us same

Dispatcher loop_duration_us same

Dubbo request_time_ms same

Dynamo upstream_rq_time upstream_rq_time_ms

Dynamo upstream_rq_time_{i}xx upstream_rq_time_{i}xx_ms

Dynamo upstream_rq_time_unknown upstream_rq_time_unknown_ms

HTTPConnMan downstream_rq_time downstream_rq_time_ms

HTTPConnMan downstream_cx_length_ms same

Listener downstream_cx_length_ms same

Mongo time_ms same

Mongo reply_size reply_size_b

Mongo reply_time_ms same

Redis latency latency_us or latency_ms (configurable)

Redis upstream_rq_time upstream_rq_time_ms

Server initialization_time_ms same

Thrift request_time_ms same

UserAgent downstream_cx_length_ms same

ZK <opcode>_latency <opcode>_latency_ms

ZK connect_response_latency connect_response_latency_ms

Sure, I think it's great to require good name hygiene for new stats, I just don't think it's worth the end-user pain to change the names of the existing ones (though I agree it would be nice).

ok so to recap:

Keep all existing names as-is.

Do not autosuffix in statsd.

Have a check_format rule ensuring no new histograms are suffixed but whitelist all existing.

I will address the remaining comments and send 1 commit to undo the renames so it's easier to review.

The only significant downside to this is that statsd assumes everything is millis and right now the Redis latency histogram is either millis or micros but does not include any suffix whatsoever, same for Mongo reply size. A bit messy.

I will leave the statsd to the sink maintainers, but recommendation is that since statsd assumes all histograms are millis, all other time values should be scaled to millis on flush unless they are suffixed. This is the minimally invasive fix to reinstall sanity. This way things like Redis latency would be scaled to millis and things like poll_delay_us would be reported as-is but it would be clear on the UI it's micros (due to the name). And for all new stats they can be assumed to be millis, no suffixed, scaled to millis if necessary.

A little late to the discussion, but I agree that this is too disruptive. However, stats are not (today) part of the Envoy API, so they don't live by its versioning guidelines. Is there a way we can add versioning to the stats interface to allow it to evolve over time? Should we provide a canonical proto definition that will allow it to move alongside the API major versions, or perhaps on its own clock?

I overall agree this is a breaking change and probably can't be released just like this but I'd really welcome any strategy for dealing with metric name migrations as little things here and there keep being added over time and at some point consistency and clarity is lost in the name of backward compatibility and at that point it's hard to say what causes more bugs - vague metrics and wrong alerts set up based on misinterpretation or broken alerts due to the name changes. It's not a pressing problem but would be cool if there was a solution at some point.

Created #8594 to track the suffix thing.

include/envoy/stats/histogram.h

include/envoy/stats/timespan.h

source/common/stats/histogram_impl.h

source/extensions/filters/network/dubbo_proxy/stats.h

source/extensions/filters/network/redis_proxy/command_splitter_impl.h

source/extensions/filters/network/thrift_proxy/stats.h

source/extensions/stat_sinks/common/statsd/statsd.cc

source/extensions/stat_sinks/common/statsd/statsd.h

source/server/server.h

test/common/stats/stat_test_utility.cc

test/mocks/stats/mocks.h

troshko111 · 2019-10-09T02:13:56Z

Thanks a bunch for taking the time to review this @jmarantz!

I think I addressed the comments (minus the release messaging and some questions I had).

Let me know what your thoughts are on the questions 1-3, I think we should at least update the metric service contract to send the unit.

include/envoy/stats/timespan.h

jmarantz · 2019-10-09T02:25:05Z

Oh sorry I missed the 3 questions:

HistogramCompletableTimespan is kind of ugly with these asserts, one thing I was thinking of is to introduce a method on histogram like quantity() returning an enum with values Unspecified, Information, Time and assert on quantity == Time instead of the switch in the ctor.

I didn't notice asserts in there. Maybe I'm missing something?

Multiple null histograms cached in fields in ThreadLocalStoreImpl vs enum to null histogram map (this requires either a copy ctor on the null histogram class or a move ctor which seemed like a bad trade off so I stored them in fields in this PR)

I think what you have is OK, but I wanted to take one more shot at simplifying the null histograms. Can you just change the asserts to allow for the case of a null histogram? Say, by having an isNull() pure method in the Histogram interface that you can check?

I'd like to update the metrics service to send the unit along with the bucket data, this is a forward/backward compatible change and should not break anyone.

ok sounds fine to me then.

My main blocker is I'd like to hear from @envoyproxy/senior-maintainers about the breaking changes to monitoring and how to message that and/or phase it.

jmarantz · 2019-10-09T20:25:24Z

Leaving the names as-is sounds good to me. You could whitelist any existing names that are in violation; there's a few similar patterns in check_format.py.

source/common/stats/timespan_impl.cc

troshko111 · 2019-10-09T23:25:11Z

I didn't notice asserts in there. Maybe I'm missing something?

Oh sorry I missed the 3 questions:

HistogramCompletableTimespan is kind of ugly with these asserts, one thing I was thinking of is to introduce a method on histogram like quantity() returning an enum with values Unspecified, Information, Time and assert on quantity == Time instead of the switch in the ctor.

I didn't notice asserts in there. Maybe I'm missing something?

Multiple null histograms cached in fields in ThreadLocalStoreImpl vs enum to null histogram map (this requires either a copy ctor on the null histogram class or a move ctor which seemed like a bad trade off so I stored them in fields in this PR)

I think what you have is OK, but I wanted to take one more shot at simplifying the null histograms. Can you just change the asserts to allow for the case of a null histogram? Say, by having an isNull() pure method in the Histogram interface that you can check?

I'd like to update the metrics service to send the unit along with the bucket data, this is a forward/backward compatible change and should not break anyone.

ok sounds fine to me then.

My main blocker is I'd like to hear from @envoyproxy/senior-maintainers about the breaking changes to monitoring and how to message that and/or phase it.

Tagged you directly in the diff to explain what I was thinking in more detail.
Simplified the null histograms as per your suggestion (I was thinking about similar approach but was not sure you'd be fine with a virtual marker, I made it more generally useful so it does not look like a hack, let me know if that's what you meant)
Will push a separate commit to plumb through the metric service

Finally as per Matt's suggestion I'm reverting the name changes altogether and will leave a TODO in statsd for the maintainers to consider scaling unsuffixed histograms reported not in millis as these are a usability disaster for the users (statsd thinks it's millis, users think it's millis, the name has no indication it's not, but it's not). This will ensure backward-compat. Will tune the check_format to give everything existing a pass and bark about any future suffixes.

troshko111 · 2019-10-10T23:32:28Z

macOS test failure seems intermittent (a timeout), other than that I think I've addressed everything. Will not be changing metric service as it uses Prometheus proto contract so out of scope.

include/envoy/stats/histogram.h

source/common/stats/timespan_impl.cc

source/extensions/filters/network/redis_proxy/command_splitter_impl.h

source/extensions/stat_sinks/common/statsd/statsd.cc

test/extensions/stats_sinks/common/statsd/statsd_test.cc

mattklein123

Thanks looks great, just one small comment.

/wait

source/common/stats/timespan_impl.cc

mattklein123

Thanks!

mattklein123 · 2019-10-17T21:49:59Z

Sorry looks like it needs a master merge.

/wait

Signed-off-by: Taras Roshko <troshko@netflix.com>

- Move unit symbol and suffix to utils and add a test - Rename isolated store alloc functors - Add a statsd "ms" comment - Add a format check for histogram suffix - Add TODO Signed-off-by: Taras Roshko <troshko@netflix.com>

- Refactor timespan into an interface and the implementation Signed-off-by: Taras Roshko <troshko@netflix.com>

- Simplify null histogram - Simplify timespan interface - Init mock histogram unit - Fix python formatter format Signed-off-by: Taras Roshko <troshko@netflix.com>

Signed-off-by: Taras Roshko <troshko@netflix.com>

- Check format for hist suffixes excluding the pre-existing ones Signed-off-by: Taras Roshko <troshko@netflix.com>

- Add a TODO to statsd sink Signed-off-by: Taras Roshko <troshko@netflix.com>

Signed-off-by: Taras Roshko <troshko@netflix.com>

- Revert misformatted comments in histogram.h - Add the issue # to the statsd TODOs - Remove redundant Unspecified setting for mocks Signed-off-by: Taras Roshko <troshko@netflix.com>

- Remove active() from histogram.h and make it a Null unit instead - Make elapsed() return chrono::millis - Use NOT_REACHED_GCOVR_EXCL_LINE for unreachable code - Extract histogram type assertion into a method ensureTimeHistogram() - Make timespan methods const - Remove TODOs from statsd, leave as notes Signed-off-by: Taras Roshko <troshko@netflix.com>

- Use a release assert - Add an actionable error message Signed-off-by: Taras Roshko <troshko@netflix.com>

mattklein123 · 2019-10-17T23:04:30Z

@troshko111 friendly request to please never force push, just merge master and push commits. Thank you!

mattklein123

Thanks!

troshko111 · 2019-10-17T23:11:53Z

@troshko111 friendly request to please never force push, just merge master and push commits. Thank you!

Ah my bad! I thought merge conflicts were ok to rebase, I now see that GitHub makes it painful to review when the history changes (does not store snapshots to compare to), sorry for that.

troshko111 requested review from jmarantz, lizan, snowp and zuercher as code owners October 4, 2019 01:22

troshko111 changed the title ~~stats: add unit support to histogram #8164~~ stats: add unit support to histogram Oct 4, 2019