
Add Timed histogram #587

Closed
theute opened this issue Jun 9, 2020 · 9 comments · Fixed by #779


theute commented Jun 9, 2020

It would be very useful to extend "Timed" (or another annotation) to have histogram data.

We use these metrics to calculate our SLO: any response above 2s doesn't meet our SLO, and we need to know about it. With buckets and a total count of hits we could calculate this; with the current metrics exposed by "Timed" or "SimplyTimed" we can't.

In Python, the Starlette exporter does this and it turns out to be very useful: we can tell how many requests are above or below a particular threshold. The list of buckets should be configurable.

Here are the metrics I get for a Python app that I'd love to see for my Java app (the last two metrics are the same ones exposed by "SimplyTimed"):

starlette_requests_processing_time_seconds_bucket{le="0.005",method="GET",path_template="/metrics"} 12705.0
starlette_requests_processing_time_seconds_bucket{le="0.01",method="GET",path_template="/metrics"} 12746.0
starlette_requests_processing_time_seconds_bucket{le="0.025",method="GET",path_template="/metrics"} 12785.0
starlette_requests_processing_time_seconds_bucket{le="0.05",method="GET",path_template="/metrics"} 12785.0
starlette_requests_processing_time_seconds_bucket{le="0.075",method="GET",path_template="/metrics"} 12785.0
starlette_requests_processing_time_seconds_bucket{le="0.1",method="GET",path_template="/metrics"} 12785.0
starlette_requests_processing_time_seconds_bucket{le="0.25",method="GET",path_template="/metrics"} 12785.0
starlette_requests_processing_time_seconds_bucket{le="0.5",method="GET",path_template="/metrics"} 12788.0
starlette_requests_processing_time_seconds_bucket{le="0.75",method="GET",path_template="/metrics"} 12789.0
starlette_requests_processing_time_seconds_bucket{le="1.0",method="GET",path_template="/metrics"} 12789.0
starlette_requests_processing_time_seconds_bucket{le="2.5",method="GET",path_template="/metrics"} 12789.0
starlette_requests_processing_time_seconds_bucket{le="5.0",method="GET",path_template="/metrics"} 12789.0
starlette_requests_processing_time_seconds_bucket{le="7.5",method="GET",path_template="/metrics"} 12789.0
starlette_requests_processing_time_seconds_bucket{le="10.0",method="GET",path_template="/metrics"} 12789.0
starlette_requests_processing_time_seconds_bucket{le="+Inf",method="GET",path_template="/metrics"} 12789.0
starlette_requests_processing_time_seconds_count{method="GET",path_template="/metrics"} 12789.0
starlette_requests_processing_time_seconds_sum{method="GET",path_template="/metrics"} 23.30231829994591

Thanks!
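To make the arithmetic concrete: because Prometheus-style bucket counts are cumulative, the number of requests slower than a boundary is just the total minus that bucket's count. A minimal sketch of this (the helper function and bucket values below are hypothetical, chosen to match the pasted output; this is not an API from MP Metrics or SmallRye):

```python
def requests_over_threshold(buckets, total, threshold):
    """buckets: dict mapping an upper bound (the `le` label) to its
    cumulative count. Returns how many observations fell above `threshold`,
    assuming `threshold` is one of the configured bucket boundaries."""
    if threshold not in buckets:
        raise ValueError("threshold must be an exact bucket boundary")
    return total - buckets[threshold]

# Cumulative counts taken from the exposition output above.
buckets = {0.25: 12785, 0.5: 12788, 1.0: 12789, 2.5: 12789}
total = 12789

print(requests_over_threshold(buckets, total, 0.5))  # 1 request slower than 0.5s
```

Note that this only works when the SLO threshold (2s in the reporter's case) is itself a bucket boundary, which is exactly why the list of buckets needs to be configurable.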


jmartisk commented Jun 9, 2020

With @Timed you currently get a histogram, except it's somewhat "reversed"

As I understand it, you want a histogram that tells you
"X1 requests have finished under Y1 seconds, X2 requests have finished under Y2 seconds,..", where Y1, Y2,... are user-defined buckets and X1, X2 are computed numbers

while with @Timed you currently get
"X1 percent of requests have finished under Y1 seconds, X2 percent of requests have finished under Y2 seconds", where X1, X2,.. are the buckets (pre-defined by the spec) and Y1,Y2,.. are computed numbers
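The two directions can be illustrated from the same raw observations; this sketch (illustrative only, with hypothetical durations) computes both views:

```python
import math

durations = [0.1, 0.2, 0.2, 0.3, 0.9, 1.5, 3.0]  # seconds, hypothetical

# "Threshold" view (what the reporter asks for): the count of requests
# that finished under each user-defined bucket boundary.
bounds = [0.5, 1.0, 2.0]
threshold_view = {b: sum(1 for d in durations if d <= b) for b in bounds}
print(threshold_view)  # {0.5: 4, 1.0: 5, 2.0: 6}

# "Percentile" view (what @Timed exports today): the duration under which
# a given fraction of requests finished (nearest-rank method).
def percentile(values, p):
    ordered = sorted(values)
    rank = max(1, math.ceil(p * len(ordered)))
    return ordered[rank - 1]

print(percentile(durations, 0.5))  # the median duration
```

In the first view the boundaries are user input and the counts are computed; in the second the fractions are fixed by the spec and the durations are computed, which is the "reversed" relationship described above.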

If we exported both formats of the histogram at the same time, the output would get very confusing. Perhaps we could have a switch between one format and the other, but that's still not very easy to grasp, and I'm not even sure what we would call them. Then there is the question of how to do that switch:

Adding extra fields to the metadata of just one particular metric type is not something the current MP Metrics API is really ready for: every metric type shares the same metadata type, so the field would exist for all metric types, including those where it isn't applicable, and the model would become confusing.

I can imagine having an experimental config option in SmallRye that switches ALL timers to the other histogram format. But I don't have any other idea right now for how to do this without getting too crazy.

Allowing to define the percentiles (X1,X2,..) in the current format would not be sufficient for you, I suppose?


theute commented Jun 9, 2020

Right, I saw the percentiles with "Timed", but I'm really interested in data based on threshold values and less in percentages.

For instance, with 3Scale I get this chart in Grafana, and it's quite useful: it's easy to spot the outliers, the distribution, and the load at a particular time (on lower traffic the percentiles are not very useful):

(two Grafana chart screenshots)

@jmartisk

So I can imagine adding a SmallRye-specific option (at least for now; if it proves useful it can be backported to the spec later) that switches timers (and potentially raw Histograms as a metric type too) from computing percentiles (the default) to computing counts of values that are over some thresholds.

This could be implemented as either

  1. a global configuration property that switches all timers/histograms to this. Another property could specify the thresholds, but then they would probably have to be the same for all timers/histograms, which is probably bad, especially since histograms can also use non-time-based units
  2. a boolean entry in ExtendedMetadata, so that programmatically created timers/histograms using ExtendedMetadata would be able to make use of this; we could also add an array-type field to specify the threshold values
  3. our own SmallRye-specific version of the @Timed annotation that enables this (there's no annotation for histograms, so they would need to use ExtendedMetadata directly)
  4. a combination of 2 and 3, which would work for both timers and histograms, and for annotated metrics as well as programmatically created metrics

A combination of 2 and 3 sounds best to me right now. WDYT?
Or perhaps @donbourne has a better idea for how to integrate this without turning the current MP Metrics API into a mess...


theute commented Jun 10, 2020

I could imagine that some people would want both.
Do we need a switch from one to the other?
I don't know the cost, so maybe an option to turn some of them off would work (might that have worked instead of @SimplyTimed as well?).
I think the buckets for timers could/should be defined once for the app.

I don't understand the details of your options; I'm not familiar with SmallRye at all.

Now I can think of another "feature": we time our REST calls (JAX-RS), and separating calls that end up with server errors from successful answers would also be useful; this also applies to percentiles. Calls that end with HTTP 500 aren't interesting for latency checks.

In the Google SRE Book https://landing.google.com/sre/sre-book/chapters/monitoring-distributed-systems/ :

Latency
    The time it takes to service a request. It’s important to distinguish between the latency of successful requests and the latency of failed requests. For example, an HTTP 500 error triggered due to loss of connection to a database or other critical backend might be served very quickly; however, as an HTTP 500 error indicates a failed request, factoring 500s into your overall latency might result in misleading calculations. On the other hand, a slow error is even worse than a fast error! Therefore, it’s important to track error latency, as opposed to just filtering out errors.

I can open a new ticket for this, or maybe that's beyond what SmallRye metrics would want to provide and we should have our own filter anyway...

@jmartisk

I think we could extend the @Timed annotation with an enum argument named histogramExportFormat that would have the values PERCENTILES and THRESHOLD_VALUES, plus BOTH if someone wants both formats exported at the same time. The metric metadata would receive the same field.

For the buckets, I think defining them once per app can be problematic, because histograms can also use non-time-based units, so it's not clear how the setting would apply to them. Then we'd probably need to apply the setting only to time-based histograms, and require other histograms to define their buckets on a per-metric basis...

Now I can think of another "feature": we time our REST calls (JAX-RS), and separating calls that end up with server errors from successful answers would also be useful; this also applies to percentiles. Calls that end with HTTP 500 aren't interesting for latency checks.

This should be addressed in Metrics 3.0 through #585; I hope this will satisfy your requirements, as the statistics about erroneous calls will be separated.


jbndbc commented Jun 11, 2020

I would like to second this request, since the OpenMetrics summary types currently being produced cannot be combined by, for example, Prometheus, because quantiles cannot be averaged; they can therefore only be used for metrics that make sense at the individual-instance level.
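This aggregation point can be shown with a small sketch (the instance bucket counts are hypothetical): cumulative bucket counts are plain counters, so a scraper can sum them per boundary to get a correct fleet-wide histogram, whereas no arithmetic on two pre-computed quantiles recovers the combined quantile.

```python
# Hypothetical cumulative bucket counts from two service instances.
instance_a = {0.5: 90, 1.0: 99, float("inf"): 100}
instance_b = {0.5: 10, 1.0: 20, float("inf"): 100}

# Summing per boundary yields the exact combined distribution.
merged = {le: instance_a[le] + instance_b[le] for le in instance_a}
print(merged)  # {0.5: 100, 1.0: 119, inf: 200}

# Pre-computed quantiles, by contrast, cannot be combined: instance A's
# p50 is <= 0.5s while instance B's p50 is > 1.0s, and averaging those two
# numbers does not give the true combined p50; only the raw counts do.
```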


theute commented Jun 15, 2020

For the buckets, I think defining them once per app can be problematic, because histograms can also use non-time-based units, so it's not clear how the setting would apply to them. Then we'd probably need to apply the setting only to time-based histograms, and require other histograms to define their buckets on a per-metric basis...

I don't really know :)

Now I can think of another "feature": we time our REST calls (JAX-RS), and separating calls that end up with server errors from successful answers would also be useful; this also applies to percentiles. Calls that end with HTTP 500 aren't interesting for latency checks.

This should be addressed in Metrics 3.0 through #585; I hope this will satisfy your requirements, as the statistics about erroneous calls will be separated.

I guess so, thanks!

@jmartisk

I reported smallrye/smallrye-metrics#325 because this will first be implemented as an experimental feature in an implementation before being backported to the spec; let's discuss implementation details there for now.

@ebullient

Micrometer timers can have distribution summaries or histogram buckets associated with them. Would this satisfy what you're asking for?
https://micrometer.io/docs/concepts#_timers + https://micrometer.io/docs/concepts#_histograms_and_percentiles
