Add a new auto_histogram aggregation for numeric fields #31828

Open · pcsanwald opened this issue Jul 5, 2018 · 21 comments

Labels: :Analytics/Aggregations, >feature, Team:Analytics, Top Ask

Comments

@pcsanwald (Contributor)

Per #28993, we also want to add support for an auto_histogram on numeric fields. We agreed that it should be a separate issue so that we would not block #28993 on implementation.

@elasticmachine (Collaborator)

Pinging @elastic/es-search-aggs

@melissachang

Here's a suggestion; what do you think?

It's easier for the user if buckets are round numbers, e.g.:

10-19:     25
20-29:     86
30-39:     83

is easier to parse than:

10-17:     25
18-25:     86
26-33:     83

So I propose:
bucket is optional.

  • If bucket is set, that number of buckets is returned.
  • If bucket is not set, the number of buckets returned will vary so that the buckets are round; it can be anywhere between 5 and 12 (a sketch of this follows the examples below).

Examples:

  • Min value = 1, max value = 20. 10 buckets: 1-2, 3-4, 5-6, 7-8, 9-10, 11-12, ...
  • Min value = 1, max value = 50. 6 buckets: 0-9, 10-19, 20-29, 30-39, 40-49, 50-59
  • Min value = 1, max value = 99. 10 buckets: 0-9, 10-19, 20-29, 30-39, ..., 90-99
  • Min value = 1, max value = 100. 11 buckets: 0-9, 10-19, 20-29, 30-39, ..., 90-99, 100-109
  • Min value = 200, max value = 300. 11 buckets: 200-209, 210-219, 220-229, 230-239, ..., 290-299, 300-309
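
To make this concrete, here is a minimal sketch (in Python, with a hypothetical helper name) of the boundary-rounding idea only: given an already-chosen round width, the first boundary is rounded down and the last boundary up to multiples of that width. It illustrates the proposal above, not how Elasticsearch would implement it, and it assumes integer values and widths.

```python
def round_buckets(min_value, max_value, width):
    """Align bucket boundaries to multiples of a round width, so buckets
    read 10-19, 20-29, ... rather than 10-17, 18-25, ..."""
    lo = (min_value // width) * width            # round the first boundary down
    hi = ((max_value // width) + 1) * width      # round the last boundary up
    return [(b, b + width - 1) for b in range(lo, hi, width)]

# round_buckets(1, 99, 10)  -> [(0, 9), (10, 19), ..., (90, 99)]   (10 buckets)
# round_buckets(1, 100, 10) -> [(0, 9), ..., (90, 99), (100, 109)] (11 buckets)
```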

FYI @rayward who filed #9572.

@colings86 (Contributor)

@melissachang what you are describing about avoiding "non-round" intervals is actually what we intend to do, and it is already done for the date version of this aggregation in #28993. The difference is that the intervals are always selected to be "round". The buckets parameter the user provides represents the maximum number of buckets they would like returned, and the aggregation will pick a "round" interval which gets as close to that number as possible but does not exceed it.

This means that an application can, for example, provide the maximum number of bars that it has space to render on a bar chart, and the aggregation will return buckets with an interval that is easy for humans to parse but whose count does not exceed the number the application is able to render.
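
A minimal sketch of the behaviour described above, assuming (as mentioned further down in this thread) that "round" intervals are 1x, 2x, or 5x a power of ten; the function name and candidate set are illustrative, not the actual implementation. It picks the finest round interval whose aligned bucket count does not exceed the requested maximum.

```python
import math

def pick_round_interval(min_value, max_value, max_buckets):
    """Choose the finest 'round' interval whose bucket count, after aligning
    the boundaries to multiples of the interval, stays <= max_buckets."""
    span = max(max_value - min_value, 1)
    exponent = math.floor(math.log10(span / max_buckets))
    # Candidate intervals: 1x, 2x, 5x powers of ten, from fine to coarse.
    candidates = [m * 10 ** e
                  for e in range(exponent, exponent + 4)
                  for m in (1, 2, 5)]
    for width in candidates:
        lo = math.floor(min_value / width) * width
        hi = (math.floor(max_value / width) + 1) * width
        if (hi - lo) / width <= max_buckets:
            return width
    return candidates[-1]   # fallback for degenerate cases (e.g. max_buckets=1)

# pick_round_interval(0, 1000, 12) -> 100 (11 buckets of width 100)
```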

@melissachang

I see, thanks.

The documentation for #28993 says "The buckets field is optional, and will default to 10 buckets if not specified." Does "rounding" occur in this case? For me, the ideal is:

  • I would like to use this feature without specifying max # buckets. (I don't want to think about how many buckets I want, I just want a reasonable default.)
  • When I don't specify max # buckets, rounding takes place and the actual # buckets may vary.

@colings86 (Contributor)

Yes, rounding also occurs for the default value, so you may get fewer than 10 buckets returned by default, but never more than 10.

@melissachang

Thanks. Is there an ETA on when this will be implemented? I would like to start using it. :)

@colings86 (Contributor)

We don't have an ETA for this, sorry, but it's great that you are excited about this feature. You can track this issue for progress; when a PR is available it will show the intended version(s) (note that until the PR is merged the intended versions may change).

@melissachang

Thanks. If I could test the future PR before it's merged, that would be great. I'd like to see how it looks on my data.

@mrec commented Aug 15, 2018

Presumably this would also follow auto_date_histogram's policy of erroring on buckets above 10k?

One (very) edge case worth considering is what happens when "buckets":1. This is obviously silly for auto_histogram, but it's not silly for other bucketed aggs like terms so I don't know if you'd want to error on it. If you don't error, though, you've got the awkward question of what to do when the value range crosses zero - normally for a distribution like that zero would always be a bucket boundary, but in this case it couldn't be.

Finally, could you confirm that your rounding logic will always round to 1x, 2x or 5x a power of ten? I'm building a manual equivalent of this feature at the moment, using a separate calibration request, and am trying to design it in such a way that we could swap it out for auto_histogram transparently when available.
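
For reference, here is a rough sketch of the kind of "separate calibration request" approach described above, using the existing stats and histogram aggregations; the price field name, aggregation names, and the 1x/2x/5x rounding are illustrative assumptions, not something from this issue.

```python
import math

# Step 1: a calibration request that only fetches min/max for the field.
calibration_request = {"size": 0, "aggs": {"calib": {"stats": {"field": "price"}}}}

def histogram_request(min_value, max_value, max_buckets):
    """Step 2: pick a 1x/2x/5x power-of-ten interval from the calibrated
    min/max and build an ordinary histogram aggregation with it."""
    span = max(max_value - min_value, 1)
    exp = math.floor(math.log10(span / max_buckets))
    interval = next(m * 10 ** e
                    for e in range(exp, exp + 4)
                    for m in (1, 2, 5)
                    if span / (m * 10 ** e) <= max_buckets)
    # Note: very old versions (e.g. 2.x) only accept integer intervals.
    return {"size": 0,
            "aggs": {"price_hist": {"histogram": {"field": "price",
                                                  "interval": interval}}}}
```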

@pcsanwald (Contributor, Author)

Hi @mrec, thanks for your comments!

We'll follow the same policy of erroring if the number of buckets required exceeds the soft limit, but unfortunately the logic is slightly more complex than throwing an error if buckets exceed 10k; the logic is here. The reason for checking and throwing this error is that our rounding could otherwise trip the soft limit for the maximum number of buckets. So, the logic is actually to error if the requested buckets exceed the soft limit for the max buckets setting divided by the max inner interval for roundings. The error message will contain the reason; the rationale for this implementation decision is in the comments on this PR.
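
In other words (a hedged paraphrase with illustrative names and values, not the actual Elasticsearch identifiers): the target bucket count is validated against the soft limit divided by the largest inner rounding multiplier, so that rounding can never push the final bucket count over the soft limit.

```python
MAX_BUCKETS_SOFT_LIMIT = 10_000   # e.g. the 10k soft limit mentioned above
MAX_INNER_INTERVAL = 7            # illustrative: largest multiplier a rounding can apply

def validate_target_buckets(buckets):
    limit = MAX_BUCKETS_SOFT_LIMIT // MAX_INNER_INTERVAL
    if buckets > limit:
        raise ValueError(
            f"buckets must be <= {limit}: soft limit {MAX_BUCKETS_SOFT_LIMIT} "
            f"divided by max inner interval {MAX_INNER_INTERVAL}")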

Currently, we don't error if "buckets":1. Let me investigate this edge case a bit more; I think opening an issue might be appropriate, but let me confirm it's an issue for auto_date_histogram first (I haven't tested with dates that could cross 0).

On rounding logic: We round in milliseconds and the current finest-grained rounding interval is SECOND_OF_MINUTE, so hopefully this addresses your concern. Rounding units and innerIntervals are built here.

@mrec commented Aug 15, 2018

Hi @pcsanwald,

> our rounding could otherwise trip the soft limit

That's fine; I understand the rationale and was already allowing for rounding in my calibrated implementation.

> On rounding logic: We round in milliseconds and the current finest-grained rounding interval is SECOND_OF_MINUTE

Erm... I think you're getting mixed up between auto_date_histogram and auto_histogram. This question is about the latter.

@pcsanwald (Contributor, Author)

@mrec my mistake, sorry. Your suggestion seems sensible, and at this point I can't think of a reason we wouldn't round to those intervals. Since I'm planning to start on this work soon, I'll keep the issue updated, and I'd be happy to discuss any cases that might cause us not to round in this manner.

@colings86 (Contributor)

For numeric auto_histogram I was thinking of rounding to 1x, 2x, and 5x powers of ten, but this isn't set in stone at this stage. We need to begin implementing, and then we can assess whether those multiples are appropriate and give enough options within a power of ten. Hopefully what we decide and what you decide will end up the same and you can swap your implementation out transparently, but we can't guarantee it at this stage.

@mrec commented Aug 15, 2018

@colings86 - understood. I'm actively expecting ES upgrades to change our bucketing for the better; at the moment we're stuck way back on 2.3.2, where histogram doesn't even support non-integral intervals. I'm just trying to avoid gratuitous differences where I can.

@colings86 (Contributor)

Sure, I understand that aim and I think it's a good one. We'll be sure to update this issue when something is settled on. 😄

@mrec commented Aug 17, 2018

Thinking about this a bit more since yesterday.

@pcsanwald: seeing what auto_date_histogram does with "buckets":1 would be interesting, but I don't think it's exactly the same situation as auto_histogram. For dates, the maximum supported interval is 100 years, so the client always has to allow for the possibility that the data won't fit - if I ask for 10 buckets and the values span two millennia, I'm going to be disappointed. For numbers this is much less obvious because, aside from the limits of double representation, intervals can get as big as they need to be, so clients may not be expecting rejection.

Semi-off-topic: extreme outliers aren't the only cause of the "won't fit" problem, but in our usage of histogram/date_histogram they've been by far the most common one. I'm not for a moment suggesting that it should be in scope for this issue, but would it be possible, even in principle, for auto_histogram to do something like Winsorizing? We have this floating around as a possible requirement, and it looks as if it might be doable with the current separate-calibration approach (by requesting e.g. the 1st and 99th percentiles as well as the 0th and 100th used to determine the full range, and doing the final histogram aggregation on a script field that clamps raw values) but not via auto_histogram.
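
A rough sketch of that Winsorizing idea under the current two-request approach (the field name, aggregation names, and interval choice are illustrative): calibrate on the 1st/99th percentiles, then run an ordinary histogram over a script that clamps raw values to that range, so extreme outliers fall into the edge buckets instead of stretching the range.

```python
# Step 1: fetch the calibration percentiles for the field.
calibration_request = {
    "size": 0,
    "aggs": {"calib": {"percentiles": {"field": "price",
                                       "percents": [0, 1, 99, 100]}}},
}

def winsorized_histogram_request(p1, p99, interval):
    """Step 2: histogram over values clamped to [p1, p99]."""
    return {
        "size": 0,
        "aggs": {"clamped": {"histogram": {
            "interval": interval,
            "script": {
                "source": "Math.min(Math.max(doc['price'].value, params.lo), params.hi)",
                "params": {"lo": p1, "hi": p99},
            },
        }}},
    }
```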

@jasontedor added the v8.0.0 label and removed the v7.0.0 label on Feb 6, 2019
@colings86 added the 7x label and removed the v8.0.0 label on Apr 12, 2019
@agirbal commented Sep 12, 2019

It seems like this ticket would focus on picking buckets automatically but make them of equal ranges? If so, it would not satisfy the main use cases of elastic/kibana#3905 and elastic/kibana#3757, which were asking for quantile/percentile-based ranges on the X-axis, right? Said differently, the ranges should not be equal (e.g. 1-9, 10-499, 500-10000) but should instead each represent an equivalent count of entries. This is really powerful, as it lets you graph some metric for your bottom 25% of users vs your top 25% of users. In the typical performance-histogram case there is a huge long tail of results, and simple auto-bucketing would just cram 99% of results into only one bucket.

As I understand from @jpountz, the only way to do this today would be the two-query approach?

@polyfractal (Contributor) commented Sep 12, 2019

> It seems like this ticket would focus on picking buckets automatically but make them of equal ranges?

Correct. There is a related open PR (from a community member) to add variable-width histograms: #42035

But I don't think that would satisfy the requirement if you need exact control over the bucket widths, e.g. #42035 will still choose its own bucket widths as it clusters/merges buckets together.

On the other hand, if you are going after the CDF in particular, I'd probably just use the percentiles aggregation and specify a bunch of percentiles to retrieve (0-100 in increments of 1 or 0.1 or whatever). Calculating those is essentially free since they are all being generated from the same data sketch.
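
A minimal sketch of that suggestion (the latency field and aggregation name are illustrative): a single percentiles aggregation with a dense percents grid effectively returns the CDF in one request.

```python
# Ask for a dense grid of percentiles in one request.
percents = [p / 10 for p in range(0, 1001)]   # 0.0, 0.1, ..., 100.0
cdf_request = {
    "size": 0,
    "aggs": {"latency_cdf": {"percentiles": {"field": "latency",
                                             "percents": percents}}},
}
```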

You could do the same with a percentile_rank to get the actual values on the x-axis, although it'd require a max first to find the upper bound that you need to request (which I admit is not as convenient).

Edit: Oh, never mind, you want the buckets so that you can run sub-aggs, right? Yeah, that would require two passes if you need to specify histogram widths like for percentiles/quantiles. Theoretically we could do it in a single-pass approximate manner (by interrogating the sketch at runtime as it accumulates data)... but I think the accuracy of that would be awful.

@ShahinSorkh

Any progress?

@nemphys commented Dec 28, 2021

Any news on this one? As already mentioned, it would be really handy for commerce applications (price ranges).

@mitar (Contributor) commented Sep 16, 2022

Isn't this available now as the variable width histogram aggregation?
