Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

MP Config properties for Histogram and Timers to extend and customize output #779

Merged
merged 11 commits into from Aug 17, 2023

Conversation

Channyboy
Copy link
Contributor

@Channyboy Channyboy commented Jul 10, 2023

Fixes #587, #674, #675, #676, #691

spec/src/main/asciidoc/rest-endpoints.adoc Outdated Show resolved Hide resolved
spec/src/main/asciidoc/rest-endpoints.adoc Outdated Show resolved Hide resolved
spec/src/main/asciidoc/rest-endpoints.adoc Outdated Show resolved Hide resolved
spec/src/main/asciidoc/rest-endpoints.adoc Outdated Show resolved Hide resolved
spec/src/main/asciidoc/rest-endpoints.adoc Outdated Show resolved Hide resolved
spec/src/main/asciidoc/rest-endpoints.adoc Outdated Show resolved Hide resolved
@Channyboy Channyboy force-pushed the v510-histoDistrSum branch 2 times, most recently from 9004c69 to 945efb7 Compare August 4, 2023 15:54
@Channyboy
Copy link
Contributor Author

@tjquinno

After discussion with the liberty team.
We're proposing to change the property mp.metrics.distribution.timer.slo to mp.metrics.distribution.timer. This removes the slo as it is unneeded as the timer metric can be used for more than just SLO calculations.

Also the mp.metrics.distribution.timer.maximum-expected-value and mp.metrics.distribution.timer.maximum-expected-value and the equivalent for the histogram property we want to reduce it to just:

  • mp.metrics.distribution.timer.max-value
  • mp.metrics.distribution.timer.min-value
  • mp.metrics.distribution.histogram.max-value
  • mp.metrics.distribution.histogram.min-value
    ^^The reason being the expected not being needed and other MP specs use max and min. Also we shouldn't really care if that original springboot was set that way.

Further more, the mp.metrics.distribution.timer.slo/ mp.metrics.distribution.timer and the mp.metrics.distribution.(max/min)-value properties we had defined that it would require a time unit (i.e. ms, s, m, h). Otherwise we would ignore it. We would like to provide a default unit of seconds if no time unit was specified.

@tjquinno
Copy link
Contributor

tjquinno commented Aug 6, 2023

@Channyboy

Thanks for tagging me on this.

A few thoughts...

  1. Regarding changing to min-value and max-value: seems reasonable--shorter with no real loss in clarity.

  2. Regarding defaulting durations/SLOs unit to seconds...

    There are other places in MP where time units default to something--for example, in fault tolerance @Timeout defaults to ms and in that case ms seems like a very natural choice given what's being timed: method invocations.

    With SLOs we have no idea in advance what a typical time range is going to be for any given timer because we don't know what sort of quantity any given timer measures (method invocations, atomic particle decay times, police and fire unit response times, government agency license application response times).

    I'm OK with--and perhaps in favor of--ignoring with a warning or even rejecting with an error any unit-less SLO setting. Requiring explicit units (and failing start-up in their absence) removes any ambiguity.

  3. For the config key for SLOs... I am inclined to keep the slo suffix. It keeps very explicit what those values correspond to which helps the config keys be more "self-documenting."

    Further, although we might not introduce them now, there could be other timer-related settings we'd want to be able to support via configuration in the future. For example, a Micrometer Timer supports the publishPercentiles method which accepts a list of values, as does slos.

    If, now in MP metrics config, we "up-level" the slo settings up to the timer level, then later we add config for publish percentiles, it could be somewhat ambiguous to users which settings the "top-level" values correspond to and it might seem to arbitrarily ascribe more importance to the SLOs than to the percentiles.

    Because slo is a short string, we're not imposing a lot more typing time to users who edit config files and we keep some clarity now and, potentially, going forward.

@Channyboy
Copy link
Contributor Author

@tjquinno
2. Regarding use of milliseconds as the default unit, that would make more sense. There is more control with milliseconds. Seconds was brought up as it was the same unit that will be used/converted-to in the Prometheus output.
Regardless of which, I think the discussion brought up that we should have a default unit instead of ignoring values.

  1. We could also try mp.metrics.distribution.timer.buckets for a more generic sounding property?

**Not sure if the publishPercentiles was just used as a quick example. The mp.metrics.distribution.percentiles applies to both Histogram and Timer metric. So for any implementation using micrometer, this config property would affect a timer being built using the publishPercentile(..) method.

@donbourne wdyt?

@Emily-Jiang
Copy link
Member

I would suggest to use mp.metrics.distribution.timer as it sounds more natural. Ending .buckets sounds weird to specify the value with a time unit.

@Channyboy Channyboy changed the title [Draft] MP Config properties for Histogram and Timers to extend and customize output MP Config properties for Histogram and Timers to extend and customize output Aug 9, 2023
@donbourne
Copy link
Member

For (2), I'm in favor of having a default (s or ms). Assuming we have a default, the worst case scenario is that the person configuring the value thinks the default unit is different than what it actually is -- but since the metrics we generate show the buckets, that should be a fairly easy thing for the ops person to spot and fix. If we didn't have a default unit then the error handling would be more complex, both for the runtime (do we not start the app if the slo is configured in a file in the app? do we not start the server if it's configured in an env var?), and for the ops person (more ways to get something wrong that you have to correct). @Channyboy , which unit do the timer bucket values in the /metrics output have (our unit default should match, IMO)?

For (3), another argument for sticking with mp.metrics.distribution.timer.slo or mp.metrics.distribution.timer.buckets is that timers can have both percentile values and bucket time values, so it's not intuitive to just ask someone to fill in the values for mp.metrics.distribution.timer. I think I slightly prefer mp.metrics.distribution.timer.buckets over mp.metrics.distribution.timer.slo, since there's no guarantee that you're timing something for the purpose of tracking an SLO, and since it's very semantically related to mp.metrics.distribution.histogram.buckets.

@Channyboy
Copy link
Contributor Author

Channyboy commented Aug 10, 2023

The buckets for a Timer in Prometheus output will be converted to seconds.

For any incompatible values (either server or app level), I wouldn't think we would prevent the runtime from starting. We would just ignore it (with a warnning/error).

@tjquinno
Copy link
Contributor

Re: 2. (default units)

I guess I will (reluctantly) step back from urging a start-up failure for unit-less settings!

Then the question is what's a good default unit. I mentioned the FT default not necessarily to advocate for that as the default for this usage but because, in the FT case, everyone knows that it is method invocations being measured and so ms seems like reasonable units for that. If we expect that most uses of timers are for measuring method invocations, then ms makes sense here as well.

Re: 3. (config key suffix for bucket/slos)

The config key(s) must unambiguously identify what is being set.

IIUC (and I could well be wrong), the Micrometer API, as an example, allows setting bucket boundary values and also, via a separate method, SLOs.

The JavaDoc for Timer says (to me at least) that setting SLOs makes sure that those values are among those (perhaps along with others?) in the set of bucket boundaries.

My point is that we need to very clear exactly what we are allowing the user to set via config, and we should use a strong key name to achieve that clarity. If we are going to interpret the settings as bucket boundaries, then buckets is a very good suffix. Although it might sound a bit informal, it's a term widely used in discussing and documenting distribution summaries so that's fine.

On the other hand, if we are going to use the configured values as parameters to the serviceLevelObjectives method then the suffix needs to reflect that.

Or would we expose both?

@Channyboy
Copy link
Contributor Author

@tjquinno @donbourne

Springboot's management.metrics.distribution.slo property uses ms as a default if no time unit is defined.

@Channyboy
Copy link
Contributor Author

@tjquinno @donbourne

Will go ahead and update PR with the change to use mp.metrics.distribution.timer.buckets and use default units of ms

@donbourne
Copy link
Member

If we expect that most uses of timers are for measuring method invocations, then ms makes sense here as well.

I expect most timings will be for methods or services, and generally ms is the right granularity for those, so I think ms makes sense. While Timers can be used to time other long-running things, I expect that's less common.

On the other hand, if we are going to use the configured values as parameters to the serviceLevelObjectives method then the suffix needs to reflect that.

while Micrometer calls these SLOs, that doesn't surface in the MP Metrics API, so I don't think we need to do things the same as Micrometer here. I'm ok with either buckets or slo. On balance I lean towards buckets because Timers don't have to be timing things for the purpose of tracking SLOs, and to show the semantic consistency with the buckets setting for Histograms.

@Channyboy
Copy link
Contributor Author

@tjquinno We're aiming to release an RC1 this friday (or sooner!) May you please review.

Copy link
Contributor

@tjquinno tjquinno left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

A couple suggested changes.

Comment on lines 121 to 122
//alpha.timer will publish histogram-buckets with maximum values of 500 milliseconds, 2 seconds, and 3 minutes while the beta.timer will publish histogram buckets with maximum values of 10 seconds, 2 minutes and 5 hours. Timer metrics that do not match will not have any histogram buckets.
mp.metrics.distribution.timer.buckets=alpha.timer=10,500ms,2s,3m;beta.timer=10s,2m,5h
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The setting for alpha.timer specifies 10 first, but the command describing the setting does not mention it.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

oh, fixed!

mp.metrics.distribution.timer.buckets=alpha.timer=10,500ms,2s,3m;beta.timer=10s,2m,5h

//any timer that matches with `alpha.*` will publish histograms buckets with maximum values of 50 seconds and 100 seconds while the alpha.test.timer which will publish a bucket with a maximum value 100 milliseconds due to precedence. Timer metrics that do not match will not have any histogram buckets.
mp.metrics.distribution.timer.buckets=alpha.*=50s, 100s;alpha.test.timer=100
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Do we want the space in the example? Does/should the spec language say anything about spaces in the settings?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

fixed!

spec/src/main/asciidoc/histogram-timer-config.adoc Outdated Show resolved Hide resolved
spec/src/main/asciidoc/histogram-timer-config.adoc Outdated Show resolved Hide resolved
spec/src/main/asciidoc/histogram-timer-config.adoc Outdated Show resolved Hide resolved
spec/src/main/asciidoc/histogram-timer-config.adoc Outdated Show resolved Hide resolved
spec/src/main/asciidoc/histogram-timer-config.adoc Outdated Show resolved Hide resolved
spec/src/main/asciidoc/histogram-timer-config.adoc Outdated Show resolved Hide resolved
spec/src/main/asciidoc/histogram-timer-config.adoc Outdated Show resolved Hide resolved
spec/src/main/asciidoc/histogram-timer-config.adoc Outdated Show resolved Hide resolved
Copy link
Member

@donbourne donbourne left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

awaiting changes on 2 previously submitted comments...(pinged @Channyboy to let him know)

Copy link
Member

@donbourne donbourne left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM!

@Channyboy
Copy link
Contributor Author

@tjquinno

@Channyboy Channyboy merged commit c77da94 into eclipse:main Aug 17, 2023
2 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Add Timed histogram
4 participants