-
Notifications
You must be signed in to change notification settings - Fork 5.2k
stats: add unit support to histogram #8484
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
|
@jmarantz
I don't know exactly how these are computed but a small size difference is expected, let me know if I can adjust the asserts. Other things I'd like your opinion on:
|
|
Yes if you are adding some stat data of value, like an enum to a histogram, that seems totally worth it and you should just bump the limit. You can also test locally with 'blaze test -c opt --test_env=ENVOY_MEMORY_TEST_EXACT=true' so you can get the test working locally before pushing to CI. I will start reviewing your PR; sorry for the delay. |
jmarantz
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
flushing comments. This looks amazing, thanks for doing this! I do have a deprecation policy question and a bunch of code nits so far.
include/envoy/event/dispatcher.h
Outdated
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I see how this is a positive change. However I wonder if this might be a breaking change for monitoring setups for some deployments. How will we message this out and manage this transition?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I've been thinking about this change before making it and came to a conclusion that however nasty this may be, the alternative is no better (potentially worse).
This is a breaking change as people definitely have alerts set up for these. If left as-is, statsd change is not possible (automatically append the actually correct symbol) and Metric sinks supporting generic histograms vs timers will report milli-milliseconds, micro-milliseconds and other nonsense (they would take values in seconds and scale using metric prefixes and display the scaled value so the metric name suffix will mismatch). It's also simply put messy. Some of the reported histograms right now are rather confusing already (Redis command latency is reported in millis or micros depending on a config value) so alerting on these is error-prone at best.
Having said that we may create a table of old -> new metric names for statsd and for non-statsd sinks so people can do a low effort find & replace instead of figuring it out? This can be published with release itself as a pre-read.
Thoughts?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@htuch @alyssawilk should this be considered an API change and versioned as such?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
To sanity check, this is changing the variable names for all statsd variables? I definitely think that's something that folks should be able to migrate to carefully and not have to fix as they pull HEAD.
Ideally it's the sort of thing we could "runtime" guard and do away with but I don't think it's the sort of thing you can flip live. Given that I'd lean towards a config option which we deprecate to (eventually) avoid code cruft
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@alyssawilk this affects histogram names, I've compiled the changes to make this easier to understand:
Histogram name changes for non-statsd sinks like metric service (unit symbol suffix dropped if present)
| Group | Old | New |
|---|---|---|
| Cluster | upstream_cx_length_ms |
upstream_cx_length |
| Cluster | upstream_cx_connect_ms |
upstream_cx_connect |
| Dispatcher | poll_delay_us |
poll_delay |
| Dispatcher | loop_duration_us |
loop_duration |
| Dubbo | request_time_ms |
request_time |
| Dynamo | upstream_rq_time |
same |
| Dynamo | upstream_rq_time_{i}xx |
same |
| Dynamo | upstream_rq_time_unknown |
same |
| HTTPConnMan | downstream_rq_time |
same |
| HTTPConnMan | downstream_cx_length_ms |
downstream_cx_length |
| Listener | downstream_cx_length_ms |
downstream_cx_length |
| Mongo | time_ms |
time |
| Mongo | reply_size |
same |
| Mongo | reply_time_ms |
reply_time |
| Redis | latency |
same |
| Redis | upstream_rq_time |
same |
| Server | initialization_time_ms |
initialization_time |
| Thrift | request_time_ms |
request_time |
| UserAgent | downstream_cx_length_ms |
downstream_cx_length |
| ZK | <opcode>_latency |
same |
| ZK | connect_response_latency |
same |
Histogram name changes for statsd sinks (unit symbol suffix appended where absent)
| Group | Old | New |
|---|---|---|
| Cluster | upstream_cx_length_ms |
same |
| Cluster | upstream_cx_connect_ms |
same |
| Dispatcher | poll_delay_us |
same |
| Dispatcher | loop_duration_us |
same |
| Dubbo | request_time_ms |
same |
| Dynamo | upstream_rq_time |
upstream_rq_time_ms |
| Dynamo | upstream_rq_time_{i}xx |
upstream_rq_time_{i}xx_ms |
| Dynamo | upstream_rq_time_unknown |
upstream_rq_time_unknown_ms |
| HTTPConnMan | downstream_rq_time |
downstream_rq_time_ms |
| HTTPConnMan | downstream_cx_length_ms |
same |
| Listener | downstream_cx_length_ms |
same |
| Mongo | time_ms |
same |
| Mongo | reply_size |
reply_size_b |
| Mongo | reply_time_ms |
same |
| Redis | latency |
latency_us or latency_ms (configurable) |
| Redis | upstream_rq_time |
upstream_rq_time_ms |
| Server | initialization_time_ms |
same |
| Thrift | request_time_ms |
same |
| UserAgent | downstream_cx_length_ms |
same |
| ZK | <opcode>_latency |
<opcode>_latency_ms |
| ZK | connect_response_latency |
connect_response_latency_ms |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Sure, I think it's great to require good name hygiene for new stats, I just don't think it's worth the end-user pain to change the names of the existing ones (though I agree it would be nice).
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
ok so to recap:
- Keep all existing names as-is.
- Do not autosuffix in
statsd. - Have a
check_formatrule ensuring no new histograms are suffixed but whitelist all existing.
I will address the remaining comments and send 1 commit to undo the renames so it's easier to review.
The only significant downside to this is that statsd assumes everything is millis and right now the Redis latency histogram is either millis or micros but does not include any suffix whatsoever, same for Mongo reply size. A bit messy.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I will leave the statsd to the sink maintainers, but recommendation is that since statsd assumes all histograms are millis, all other time values should be scaled to millis on flush unless they are suffixed. This is the minimally invasive fix to reinstall sanity. This way things like Redis latency would be scaled to millis and things like poll_delay_us would be reported as-is but it would be clear on the UI it's micros (due to the name). And for all new stats they can be assumed to be millis, no suffixed, scaled to millis if necessary.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
A little late to the discussion, but I agree that this is too disruptive. However, stats are not (today) part of the Envoy API, so they don't live by its versioning guidelines. Is there a way we can add versioning to the stats interface to allow it to evolve over time? Should we provide a canonical proto definition that will allow it to move alongside the API major versions, or perhaps on its own clock?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I overall agree this is a breaking change and probably can't be released just like this but I'd really welcome any strategy for dealing with metric name migrations as little things here and there keep being added over time and at some point consistency and clarity is lost in the name of backward compatibility and at that point it's hard to say what causes more bugs - vague metrics and wrong alerts set up based on misinterpretation or broken alerts due to the name changes. It's not a pressing problem but would be cool if there was a solution at some point.
Created #8594 to track the suffix thing.
source/extensions/filters/network/redis_proxy/command_splitter_impl.h
Outdated
Show resolved
Hide resolved
|
Thanks a bunch for taking the time to review this @jmarantz! I think I addressed the comments (minus the release messaging and some questions I had). Let me know what your thoughts are on the questions 1-3, I think we should at least update the metric service contract to send the unit. |
|
Oh sorry I missed the 3 questions:
I didn't notice asserts in there. Maybe I'm missing something?
I think what you have is OK, but I wanted to take one more shot at simplifying the null histograms. Can you just change the asserts to allow for the case of a null histogram? Say, by having an isNull() pure method in the Histogram interface that you can check?
ok sounds fine to me then. My main blocker is I'd like to hear from @envoyproxy/senior-maintainers about the breaking changes to monitoring and how to message that and/or phase it. |
|
Leaving the names as-is sounds good to me. You could whitelist any existing names that are in violation; there's a few similar patterns in check_format.py. |
Finally as per Matt's suggestion I'm reverting the name changes altogether and will leave a TODO in |
|
|
source/extensions/filters/network/redis_proxy/command_splitter_impl.h
Outdated
Show resolved
Hide resolved
mattklein123
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks looks great, just one small comment.
/wait
mattklein123
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks!
|
Sorry looks like it needs a master merge. /wait |
Signed-off-by: Taras Roshko <troshko@netflix.com>
Signed-off-by: Taras Roshko <troshko@netflix.com>
Signed-off-by: Taras Roshko <troshko@netflix.com>
Signed-off-by: Taras Roshko <troshko@netflix.com>
Signed-off-by: Taras Roshko <troshko@netflix.com>
- Move unit symbol and suffix to utils and add a test - Rename isolated store alloc functors - Add a statsd "ms" comment - Add a format check for histogram suffix - Add TODO Signed-off-by: Taras Roshko <troshko@netflix.com>
- Refactor timespan into an interface and the implementation Signed-off-by: Taras Roshko <troshko@netflix.com>
- Simplify null histogram - Simplify timespan interface - Init mock histogram unit - Fix python formatter format Signed-off-by: Taras Roshko <troshko@netflix.com>
Signed-off-by: Taras Roshko <troshko@netflix.com>
- Check format for hist suffixes excluding the pre-existing ones Signed-off-by: Taras Roshko <troshko@netflix.com>
- Add a TODO to statsd sink Signed-off-by: Taras Roshko <troshko@netflix.com>
Signed-off-by: Taras Roshko <troshko@netflix.com>
Signed-off-by: Taras Roshko <troshko@netflix.com>
- Revert misformatted comments in histogram.h - Add the issue # to the statsd TODOs - Remove redundant Unspecified setting for mocks Signed-off-by: Taras Roshko <troshko@netflix.com>
- Remove active() from histogram.h and make it a Null unit instead - Make elapsed() return chrono::millis - Use NOT_REACHED_GCOVR_EXCL_LINE for unreachable code - Extract histogram type assertion into a method ensureTimeHistogram() - Make timespan methods const - Remove TODOs from statsd, leave as notes Signed-off-by: Taras Roshko <troshko@netflix.com>
- Use a release assert - Add an actionable error message Signed-off-by: Taras Roshko <troshko@netflix.com>
|
@troshko111 friendly request to please never force push, just merge master and push commits. Thank you! |
mattklein123
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks!
Ah my bad! I thought merge conflicts were ok to rebase, I now see that GitHub makes it painful to review when the history changes (does not store snapshots to compare to), sorry for that. |
Signed-off-by: Taras Roshko troshko@netflix.com
Description: Add unit to the histogram type so that the stat sinks can disambiguate among various types of histograms (some Metric stores support timer vs non-timer histograms and generally speaking there already are histograms measuring bytes and microseconds, but it's all assumed to be millis on flush
Risk Level: Low
Testing: Unit tests
Docs Changes: N/A
Release Notes: Included
Fixes: #8164