
Add Timed histogram #587

Closed
theute opened this issue Jun 9, 2020 · 9 comments · Fixed by #779


theute commented Jun 9, 2020

It would be very useful to extend "Timed" (or another annotation) to have histogram data.

We use these metrics to calculate our SLO: any response above 2s doesn't meet our SLO, and we need to know about it. With buckets and a total count of hits we could calculate this; with the current metrics exposed by "Timed" or "SimplyTimed" we can't.

In Python, the Starlette exporter does this and it turns out to be very useful: we can tell how many requests are above or below a particular threshold. The list of buckets should be configurable.

Here are the metrics I get for a Python app that I'd love to see for my Java app (the last two metrics are the same ones exposed by "SimplyTimed"):

starlette_requests_processing_time_seconds_bucket{le="0.005",method="GET",path_template="/metrics"} 12705.0
starlette_requests_processing_time_seconds_bucket{le="0.01",method="GET",path_template="/metrics"} 12746.0
starlette_requests_processing_time_seconds_bucket{le="0.025",method="GET",path_template="/metrics"} 12785.0
starlette_requests_processing_time_seconds_bucket{le="0.05",method="GET",path_template="/metrics"} 12785.0
starlette_requests_processing_time_seconds_bucket{le="0.075",method="GET",path_template="/metrics"} 12785.0
starlette_requests_processing_time_seconds_bucket{le="0.1",method="GET",path_template="/metrics"} 12785.0
starlette_requests_processing_time_seconds_bucket{le="0.25",method="GET",path_template="/metrics"} 12785.0
starlette_requests_processing_time_seconds_bucket{le="0.5",method="GET",path_template="/metrics"} 12788.0
starlette_requests_processing_time_seconds_bucket{le="0.75",method="GET",path_template="/metrics"} 12789.0
starlette_requests_processing_time_seconds_bucket{le="1.0",method="GET",path_template="/metrics"} 12789.0
starlette_requests_processing_time_seconds_bucket{le="2.5",method="GET",path_template="/metrics"} 12789.0
starlette_requests_processing_time_seconds_bucket{le="5.0",method="GET",path_template="/metrics"} 12789.0
starlette_requests_processing_time_seconds_bucket{le="7.5",method="GET",path_template="/metrics"} 12789.0
starlette_requests_processing_time_seconds_bucket{le="10.0",method="GET",path_template="/metrics"} 12789.0
starlette_requests_processing_time_seconds_bucket{le="+Inf",method="GET",path_template="/metrics"} 12789.0
starlette_requests_processing_time_seconds_count{method="GET",path_template="/metrics"} 12789.0
starlette_requests_processing_time_seconds_sum{method="GET",path_template="/metrics"} 23.30231829994591

Thanks!
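To make the arithmetic concrete: because Prometheus-style bucket counts are cumulative, the number of requests slower than a boundary is just the total minus that bucket's count. A minimal sketch of this (the helper function and bucket values below are hypothetical, chosen to match the pasted output; this is not an API from MP Metrics or SmallRye):

```python
def requests_over_threshold(buckets, total, threshold):
    """buckets: dict mapping an upper bound (the `le` label) to its
    cumulative count. Returns how many observations fell above `threshold`,
    assuming `threshold` is one of the configured bucket boundaries."""
    if threshold not in buckets:
        raise ValueError("threshold must be an exact bucket boundary")
    return total - buckets[threshold]

# Cumulative counts taken from the exposition output above.
buckets = {0.25: 12785, 0.5: 12788, 1.0: 12789, 2.5: 12789}
total = 12789

print(requests_over_threshold(buckets, total, 0.5))  # 1 request slower than 0.5s
```

Note that this only works when the SLO threshold (2s in the reporter's case) is itself a bucket boundary, which is exactly why the list of buckets needs to be configurable.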


jmartisk commented Jun 9, 2020

With @Timed you currently get a histogram, except it's somewhat "reversed"

As I understand it, you want a histogram that tells you
"X1 requests have finished under Y1 seconds, X2 requests have finished under Y2 seconds,..", where Y1, Y2,... are user-defined buckets and X1, X2 are computed numbers

while with @Timed you currently get
"X1 percent of requests have finished under Y1 seconds, X2 percent of requests have finished under Y2 seconds", where X1, X2,.. are the buckets (pre-defined by the spec) and Y1,Y2,.. are computed numbers
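The two directions can be illustrated from the same raw observations; this sketch (illustrative only, with hypothetical durations) computes both views:

```python
import math

durations = [0.1, 0.2, 0.2, 0.3, 0.9, 1.5, 3.0]  # seconds, hypothetical

# "Threshold" view (what the reporter asks for): the count of requests
# that finished under each user-defined bucket boundary.
bounds = [0.5, 1.0, 2.0]
threshold_view = {b: sum(1 for d in durations if d <= b) for b in bounds}
print(threshold_view)  # {0.5: 4, 1.0: 5, 2.0: 6}

# "Percentile" view (what @Timed exports today): the duration under which
# a given fraction of requests finished (nearest-rank method).
def percentile(values, p):
    ordered = sorted(values)
    rank = max(1, math.ceil(p * len(ordered)))
    return ordered[rank - 1]

print(percentile(durations, 0.5))  # the median duration
```

In the first view the boundaries are user input and the counts are computed; in the second the fractions are fixed by the spec and the durations are computed, which is the "reversed" relationship described above.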

If we exported both formats of the histogram at the same time, the output would get very confusing. Perhaps we could have a switch between one format and the other, but that's still not very easy to grasp, and I'm not even sure what we would call them. Then there is the question of how to do that switch:

Adding extra fields to the metadata of just one particular metric type is not something the current MP Metrics API is really ready for: every metric type shares the same metadata type, so the field would exist for all metric types, including those where it isn't applicable, and the model would become confusing.

I can imagine having an experimental config option in SmallRye that switches ALL timers to the other histogram format. But I don't have any other idea right now for how to do this without getting too crazy.

Allowing to define the percentiles (X1,X2,..) in the current format would not be sufficient for you, I suppose?


theute commented Jun 9, 2020

Right, I saw the percentiles with "Timed", but I'm really interested in data based on threshold values and less in percentages.

For instance, with 3Scale I get this chart in Grafana, and it's quite useful: it's easy to spot the outliers, the distribution, and the load at a particular time (on lower traffic the percentiles are not very useful):

(two Grafana chart screenshots)

@jmartisk

So I can imagine adding a SmallRye-specific option (at least for now; if it proves useful it can be backported to the spec later) that switches timers (and potentially raw Histograms as a metric type too) from computing percentiles (the default) to computing counts of values that are over some thresholds.

This could be implemented as either

  1. a global configuration property that switches all timers/histograms to this. Another property could specify the thresholds, but then they would probably have to be the same for all timers/histograms, which is probably bad, especially since histograms can also use non-time-based units
  2. a boolean entry in ExtendedMetadata, so that programmatically created timers/histograms using ExtendedMetadata would be able to make use of this; we could also add an array-type field to specify the threshold values
  3. our own SmallRye-specific version of the @Timed annotation that enables this (there's no annotation for histograms, so they would need to use ExtendedMetadata directly)
  4. a combination of 2 and 3, which would work for both timers and histograms, and for annotated metrics as well as programmatically created metrics

A combination of 2 and 3 sounds best to me right now. WDYT?
Or perhaps @donbourne has a better idea for how to integrate this without turning the current MP Metrics API into a mess...


theute commented Jun 10, 2020

I could imagine that some people would want both.
Do we need a switch from one to the other?
I don't know the cost, so maybe an option to turn some of them off would work (might that have worked instead of @SimplyTimed as well?).
I think the buckets for timers could/should be defined once for the app.

I don't understand the details of your options; I'm not familiar with SmallRye at all.

Now I can think of another "feature": we time our REST calls (JAX-RS), and separating calls that end up with server errors from successful answers would also be useful; this also applies to percentiles. Calls that end with HTTP 500 aren't interesting for latency checks.

In the Google SRE Book https://landing.google.com/sre/sre-book/chapters/monitoring-distributed-systems/ :

Latency
    The time it takes to service a request. It’s important to distinguish between the latency of successful requests and the latency of failed requests. For example, an HTTP 500 error triggered due to loss of connection to a database or other critical backend might be served very quickly; however, as an HTTP 500 error indicates a failed request, factoring 500s into your overall latency might result in misleading calculations. On the other hand, a slow error is even worse than a fast error! Therefore, it’s important to track error latency, as opposed to just filtering out errors.

I can open a new ticket for this, or maybe that's beyond what SmallRye metrics would want to provide and we should have our own filter anyway...

@jmartisk

I think we could extend the @Timed annotation with an enum argument named histogramExportFormat that would have the values PERCENTILES and THRESHOLD_VALUES, plus BOTH if someone wants both formats exported at the same time. The metric metadata would receive the same field.

For the buckets, I think defining them once per app can be problematic, because histograms can also use non-time-based units, so it's not clear how the setting would apply to them. Then we'd probably need to apply the setting only to time-based histograms, and require other histograms to define their buckets on a per-metric basis...

Now I can think of another "feature": we time our REST calls (JAX-RS), and separating calls that end up with server errors from successful answers would also be useful; this also applies to percentiles. Calls that end with HTTP 500 aren't interesting for latency checks.

This should be addressed in Metrics 3.0 through #585; I hope this will satisfy your requirements, as the statistics about erroneous calls will be separated.


jbndbc commented Jun 11, 2020

I would like to second this request, since the OpenMetrics summary types currently being produced cannot be combined by, for example, Prometheus, because quantiles cannot be averaged; they can therefore only be used for metrics that make sense at the individual-instance level.
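This aggregation point can be shown with a small sketch (the instance bucket counts are hypothetical): cumulative bucket counts are plain counters, so a scraper can sum them per boundary to get a correct fleet-wide histogram, whereas no arithmetic on two pre-computed quantiles recovers the combined quantile.

```python
# Hypothetical cumulative bucket counts from two service instances.
instance_a = {0.5: 90, 1.0: 99, float("inf"): 100}
instance_b = {0.5: 10, 1.0: 20, float("inf"): 100}

# Summing per boundary yields the exact combined distribution.
merged = {le: instance_a[le] + instance_b[le] for le in instance_a}
print(merged)  # {0.5: 100, 1.0: 119, inf: 200}

# Pre-computed quantiles, by contrast, cannot be combined: instance A's
# p50 is <= 0.5s while instance B's p50 is > 1.0s, and averaging those two
# numbers does not give the true combined p50; only the raw counts do.
```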


theute commented Jun 15, 2020

For the buckets, I think defining them once per app can be problematic, because histograms can also use non-time-based units, so it's not clear how the setting would apply to them. Then we'd probably need to apply the setting only to time-based histograms, and require other histograms to define their buckets on a per-metric basis...

I don't really know :)

Now I can think of another "feature": we time our REST calls (JAX-RS), and separating calls that end up with server errors from successful answers would also be useful; this also applies to percentiles. Calls that end with HTTP 500 aren't interesting for latency checks.

This should be addressed in Metrics 3.0 through #585; I hope this will satisfy your requirements, as the statistics about erroneous calls will be separated.

I guess so, thanks!

@jmartisk

I reported smallrye/smallrye-metrics#325 because this will first be implemented as an experimental feature in an implementation before being backported to the spec; let's discuss implementation details there for now.

@ebullient

Micrometer timers can have distribution summaries or histogram buckets associated with them. Would this satisfy what you're asking for?
https://micrometer.io/docs/concepts#_timers + https://micrometer.io/docs/concepts#_histograms_and_percentiles
