Skip to content

Conversation

@troshko111
Copy link
Contributor

Signed-off-by: Taras Roshko troshko@netflix.com

Description: Add unit to the histogram type so that the stat sinks can disambiguate among various types of histograms (some Metric stores support timer vs non-timer histograms and generally speaking there already are histograms measuring bytes and microseconds, but it's all assumed to be millis on flush
Risk Level: Low
Testing: Unit tests
Docs Changes: N/A
Release Notes: Included
Fixes: #8164

@troshko111 troshko111 changed the title stats: add unit support to histogram #8164 stats: add unit support to histogram Oct 4, 2019
@jmarantz jmarantz self-assigned this Oct 4, 2019
@troshko111
Copy link
Contributor Author

@jmarantz
Release builds are failing due to stat size checking tests:

StatsThreadLocalStoreTestNoFixture.MemoryWithoutTlsFakeSymbolTable and StatsThreadLocalStoreTestNoFixture.MemoryWithTlsFakeSymbolTable are 16 bytes off

ClusterMemoryTestRunner.MemoryLargeClusterSizeWithFakeSymbolTable and
ClusterMemoryTestRunner.MemoryLargeClusterSizeWithRealSymbolTable are 32 bytes off

I don't know exactly how these are computed but a small size difference is expected, let me know if I can adjust the asserts.

Other things I'd like your opinion on:

  1. HistogramCompletableTimespan is kind of ugly with these asserts, one thing I was thinking of is to introduce a method on histogram like quantity() returning an enum with values Unspecified, Information, Time and assert on quantity == Time instead of the switch in the ctor.
  2. Multiple null histograms cached in fields in ThreadLocalStoreImpl vs enum to null histogram map (this requires either a copy ctor on the null histogram class or a move ctor which seemed like a bad trade off so I stored them in fields in this PR)
  3. I'd like to update the metrics service to send the unit along with the bucket data, this is a forward/backward compatible change and should not break anyone.

@jmarantz
Copy link
Contributor

jmarantz commented Oct 8, 2019

Yes if you are adding some stat data of value, like an enum to a histogram, that seems totally worth it and you should just bump the limit.

You can also test locally with 'blaze test -c opt --test_env=ENVOY_MEMORY_TEST_EXACT=true' so you can get the test working locally before pushing to CI.

I will start reviewing your PR; sorry for the delay.

Copy link
Contributor

@jmarantz jmarantz left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

flushing comments. This looks amazing, thanks for doing this! I do have a deprecation policy question and a bunch of code nits so far.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I see how this is a positive change. However I wonder if this might be a breaking change for monitoring setups for some deployments. How will we message this out and manage this transition?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I've been thinking about this change before making it and came to a conclusion that however nasty this may be, the alternative is no better (potentially worse).

This is a breaking change as people definitely have alerts set up for these. If left as-is, statsd change is not possible (automatically append the actually correct symbol) and Metric sinks supporting generic histograms vs timers will report milli-milliseconds, micro-milliseconds and other nonsense (they would take values in seconds and scale using metric prefixes and display the scaled value so the metric name suffix will mismatch). It's also simply put messy. Some of the reported histograms right now are rather confusing already (Redis command latency is reported in millis or micros depending on a config value) so alerting on these is error-prone at best.

Having said that we may create a table of old -> new metric names for statsd and for non-statsd sinks so people can do a low effort find & replace instead of figuring it out? This can be published with release itself as a pre-read.

Thoughts?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@htuch @alyssawilk should this be considered an API change and versioned as such?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

To sanity check, this is changing the variable names for all statsd variables? I definitely think that's something that folks should be able to migrate to carefully and not have to fix as they pull HEAD.
Ideally it's the sort of thing we could "runtime" guard and do away with but I don't think it's the sort of thing you can flip live. Given that I'd lean towards a config option which we deprecate to (eventually) avoid code cruft

Copy link
Contributor Author

@troshko111 troshko111 Oct 9, 2019

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@alyssawilk this affects histogram names, I've compiled the changes to make this easier to understand:

Histogram name changes for non-statsd sinks like metric service (unit symbol suffix dropped if present)

Group Old New
Cluster upstream_cx_length_ms upstream_cx_length
Cluster upstream_cx_connect_ms upstream_cx_connect
Dispatcher poll_delay_us poll_delay
Dispatcher loop_duration_us loop_duration
Dubbo request_time_ms request_time
Dynamo upstream_rq_time same
Dynamo upstream_rq_time_{i}xx same
Dynamo upstream_rq_time_unknown same
HTTPConnMan downstream_rq_time same
HTTPConnMan downstream_cx_length_ms downstream_cx_length
Listener downstream_cx_length_ms downstream_cx_length
Mongo time_ms time
Mongo reply_size same
Mongo reply_time_ms reply_time
Redis latency same
Redis upstream_rq_time same
Server initialization_time_ms initialization_time
Thrift request_time_ms request_time
UserAgent downstream_cx_length_ms downstream_cx_length
ZK <opcode>_latency same
ZK connect_response_latency same

Histogram name changes for statsd sinks (unit symbol suffix appended where absent)

Group Old New
Cluster upstream_cx_length_ms same
Cluster upstream_cx_connect_ms same
Dispatcher poll_delay_us same
Dispatcher loop_duration_us same
Dubbo request_time_ms same
Dynamo upstream_rq_time upstream_rq_time_ms
Dynamo upstream_rq_time_{i}xx upstream_rq_time_{i}xx_ms
Dynamo upstream_rq_time_unknown upstream_rq_time_unknown_ms
HTTPConnMan downstream_rq_time downstream_rq_time_ms
HTTPConnMan downstream_cx_length_ms same
Listener downstream_cx_length_ms same
Mongo time_ms same
Mongo reply_size reply_size_b
Mongo reply_time_ms same
Redis latency latency_us or latency_ms (configurable)
Redis upstream_rq_time upstream_rq_time_ms
Server initialization_time_ms same
Thrift request_time_ms same
UserAgent downstream_cx_length_ms same
ZK <opcode>_latency <opcode>_latency_ms
ZK connect_response_latency connect_response_latency_ms

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sure, I think it's great to require good name hygiene for new stats, I just don't think it's worth the end-user pain to change the names of the existing ones (though I agree it would be nice).

Copy link
Contributor Author

@troshko111 troshko111 Oct 9, 2019

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

ok so to recap:

  1. Keep all existing names as-is.
  2. Do not autosuffix in statsd.
  3. Have a check_format rule ensuring no new histograms are suffixed but whitelist all existing.

I will address the remaining comments and send 1 commit to undo the renames so it's easier to review.

The only significant downside to this is that statsd assumes everything is millis and right now the Redis latency histogram is either millis or micros but does not include any suffix whatsoever, same for Mongo reply size. A bit messy.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I will leave the statsd to the sink maintainers, but recommendation is that since statsd assumes all histograms are millis, all other time values should be scaled to millis on flush unless they are suffixed. This is the minimally invasive fix to reinstall sanity. This way things like Redis latency would be scaled to millis and things like poll_delay_us would be reported as-is but it would be clear on the UI it's micros (due to the name). And for all new stats they can be assumed to be millis, no suffixed, scaled to millis if necessary.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

A little late to the discussion, but I agree that this is too disruptive. However, stats are not (today) part of the Envoy API, so they don't live by its versioning guidelines. Is there a way we can add versioning to the stats interface to allow it to evolve over time? Should we provide a canonical proto definition that will allow it to move alongside the API major versions, or perhaps on its own clock?

Copy link
Contributor Author

@troshko111 troshko111 Oct 11, 2019

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I overall agree this is a breaking change and probably can't be released just like this but I'd really welcome any strategy for dealing with metric name migrations as little things here and there keep being added over time and at some point consistency and clarity is lost in the name of backward compatibility and at that point it's hard to say what causes more bugs - vague metrics and wrong alerts set up based on misinterpretation or broken alerts due to the name changes. It's not a pressing problem but would be cool if there was a solution at some point.

Created #8594 to track the suffix thing.

@troshko111
Copy link
Contributor Author

Thanks a bunch for taking the time to review this @jmarantz!

I think I addressed the comments (minus the release messaging and some questions I had).

Let me know what your thoughts are on the questions 1-3, I think we should at least update the metric service contract to send the unit.

@jmarantz
Copy link
Contributor

jmarantz commented Oct 9, 2019

Oh sorry I missed the 3 questions:

HistogramCompletableTimespan is kind of ugly with these asserts, one thing I was thinking of is to introduce a method on histogram like quantity() returning an enum with values Unspecified, Information, Time and assert on quantity == Time instead of the switch in the ctor.

I didn't notice asserts in there. Maybe I'm missing something?

Multiple null histograms cached in fields in ThreadLocalStoreImpl vs enum to null histogram map (this requires either a copy ctor on the null histogram class or a move ctor which seemed like a bad trade off so I stored them in fields in this PR)

I think what you have is OK, but I wanted to take one more shot at simplifying the null histograms. Can you just change the asserts to allow for the case of a null histogram? Say, by having an isNull() pure method in the Histogram interface that you can check?

I'd like to update the metrics service to send the unit along with the bucket data, this is a forward/backward compatible change and should not break anyone.

ok sounds fine to me then.

My main blocker is I'd like to hear from @envoyproxy/senior-maintainers about the breaking changes to monitoring and how to message that and/or phase it.

@mattklein123 mattklein123 self-assigned this Oct 9, 2019
@jmarantz
Copy link
Contributor

jmarantz commented Oct 9, 2019

Leaving the names as-is sounds good to me. You could whitelist any existing names that are in violation; there's a few similar patterns in check_format.py.

@troshko111
Copy link
Contributor Author

I didn't notice asserts in there. Maybe I'm missing something?

Oh sorry I missed the 3 questions:

HistogramCompletableTimespan is kind of ugly with these asserts, one thing I was thinking of is to introduce a method on histogram like quantity() returning an enum with values Unspecified, Information, Time and assert on quantity == Time instead of the switch in the ctor.

I didn't notice asserts in there. Maybe I'm missing something?

Multiple null histograms cached in fields in ThreadLocalStoreImpl vs enum to null histogram map (this requires either a copy ctor on the null histogram class or a move ctor which seemed like a bad trade off so I stored them in fields in this PR)

I think what you have is OK, but I wanted to take one more shot at simplifying the null histograms. Can you just change the asserts to allow for the case of a null histogram? Say, by having an isNull() pure method in the Histogram interface that you can check?

I'd like to update the metrics service to send the unit along with the bucket data, this is a forward/backward compatible change and should not break anyone.

ok sounds fine to me then.

My main blocker is I'd like to hear from @envoyproxy/senior-maintainers about the breaking changes to monitoring and how to message that and/or phase it.

  1. Tagged you directly in the diff to explain what I was thinking in more detail.
  2. Simplified the null histograms as per your suggestion (I was thinking about similar approach but was not sure you'd be fine with a virtual marker, I made it more generally useful so it does not look like a hack, let me know if that's what you meant)
  3. Will push a separate commit to plumb through the metric service

Finally as per Matt's suggestion I'm reverting the name changes altogether and will leave a TODO in statsd for the maintainers to consider scaling unsuffixed histograms reported not in millis as these are a usability disaster for the users (statsd thinks it's millis, users think it's millis, the name has no indication it's not, but it's not). This will ensure backward-compat. Will tune the check_format to give everything existing a pass and bark about any future suffixes.

@troshko111
Copy link
Contributor Author

troshko111 commented Oct 10, 2019

macOS test failure seems intermittent (a timeout), other than that I think I've addressed everything. Will not be changing metric service as it uses Prometheus proto contract so out of scope.

Copy link
Member

@mattklein123 mattklein123 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks looks great, just one small comment.

/wait

mattklein123
mattklein123 previously approved these changes Oct 17, 2019
Copy link
Member

@mattklein123 mattklein123 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks!

@mattklein123
Copy link
Member

Sorry looks like it needs a master merge.

/wait

Signed-off-by: Taras Roshko <troshko@netflix.com>
Signed-off-by: Taras Roshko <troshko@netflix.com>
Signed-off-by: Taras Roshko <troshko@netflix.com>
Signed-off-by: Taras Roshko <troshko@netflix.com>
Signed-off-by: Taras Roshko <troshko@netflix.com>
- Move unit symbol and suffix to utils and add a test
- Rename isolated store alloc functors
- Add a statsd "ms" comment
- Add a format check for histogram suffix
- Add TODO

Signed-off-by: Taras Roshko <troshko@netflix.com>
- Refactor timespan into an interface and the implementation

Signed-off-by: Taras Roshko <troshko@netflix.com>
- Simplify null histogram
- Simplify timespan interface
- Init mock histogram unit
- Fix python formatter format

Signed-off-by: Taras Roshko <troshko@netflix.com>
Signed-off-by: Taras Roshko <troshko@netflix.com>
- Check format for hist suffixes excluding the pre-existing ones

Signed-off-by: Taras Roshko <troshko@netflix.com>
- Add a TODO to statsd sink

Signed-off-by: Taras Roshko <troshko@netflix.com>
Signed-off-by: Taras Roshko <troshko@netflix.com>
Signed-off-by: Taras Roshko <troshko@netflix.com>
- Revert misformatted comments in histogram.h
- Add the issue # to the statsd TODOs
- Remove redundant Unspecified setting for mocks

Signed-off-by: Taras Roshko <troshko@netflix.com>
- Remove active() from histogram.h and make it a Null unit instead
- Make elapsed() return chrono::millis
- Use NOT_REACHED_GCOVR_EXCL_LINE for unreachable code
- Extract histogram type assertion into a method ensureTimeHistogram()
- Make timespan methods const
- Remove TODOs from statsd, leave as notes

Signed-off-by: Taras Roshko <troshko@netflix.com>
- Use a release assert
- Add an actionable error message

Signed-off-by: Taras Roshko <troshko@netflix.com>
@mattklein123
Copy link
Member

@troshko111 friendly request to please never force push, just merge master and push commits. Thank you!

Copy link
Member

@mattklein123 mattklein123 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks!

@troshko111
Copy link
Contributor Author

@troshko111 friendly request to please never force push, just merge master and push commits. Thank you!

Ah my bad! I thought merge conflicts were ok to rebase, I now see that GitHub makes it painful to review when the history changes (does not store snapshots to compare to), sorry for that.

@mattklein123 mattklein123 merged commit 2fe5b8a into envoyproxy:master Oct 18, 2019
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

stats: [request] expose the underlying histogram unit

5 participants