Add a new auto_histogram aggregation for numeric fields #31828

Open · pcsanwald opened this issue Jul 5, 2018 · 21 comments

Labels: :Analytics/Aggregations, >feature, Team:Analytics, Top Ask

Comments

@pcsanwald (Contributor)

Per #28993, we also want to add support for an auto_histogram on numeric fields. We agreed that it should be a separate issue so that we would not block #28993 on implementation.

@elasticmachine (Collaborator)

Pinging @elastic/es-search-aggs

@melissachang

Here's a suggestion; what do you think?

It's easier for the user if buckets are round numbers, e.g.:

10-19:     25
20-29:     86
30-39:     83

is easier to parse than:

10-17:     25
18-25:     86
26-33:     83

So I propose:
bucket is optional.

  • If bucket is set, that number of buckets is returned.
  • If bucket is not set, the number of buckets returned will vary so that the buckets are round; it can be anywhere between 5 and 12 (a sketch of this follows the examples below).

Examples:

  • Min value = 1, max value = 20. 10 buckets: 1-2, 3-4, 5-6, 7-8, 9-10, 11-12, ...
  • Min value = 1, max value = 50. 6 buckets: 0-9, 10-19, 20-29, 30-39, 40-49, 50-59
  • Min value = 1, max value = 99. 10 buckets: 0-9, 10-19, 20-29, 30-39, ..., 90-99
  • Min value = 1, max value = 100. 11 buckets: 0-9, 10-19, 20-29, 30-39, ..., 90-99, 100-109
  • Min value = 200, max value = 300. 11 buckets: 200-209, 210-219, 220-229, 230-239, ..., 290-299, 300-309
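
To make this concrete, here is a minimal sketch (in Python, with a hypothetical helper name) of the boundary-rounding idea only: given an already-chosen round width, the first boundary is rounded down and the last boundary up to multiples of that width. It illustrates the proposal above, not how Elasticsearch would implement it, and it assumes integer values and widths.

```python
def round_buckets(min_value, max_value, width):
    """Align bucket boundaries to multiples of a round width, so buckets
    read 10-19, 20-29, ... rather than 10-17, 18-25, ..."""
    lo = (min_value // width) * width            # round the first boundary down
    hi = ((max_value // width) + 1) * width      # round the last boundary up
    return [(b, b + width - 1) for b in range(lo, hi, width)]

# round_buckets(1, 99, 10)  -> [(0, 9), (10, 19), ..., (90, 99)]   (10 buckets)
# round_buckets(1, 100, 10) -> [(0, 9), ..., (90, 99), (100, 109)] (11 buckets)
```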

FYI @rayward who filed #9572.

@colings86 (Contributor)

@melissachang what you are describing about avoiding "non-round" intervals is actually what we intend to do, and it is already done for the date version of this aggregation in #28993. The difference is that the intervals are always selected to be "round". The buckets parameter the user provides represents the maximum number of buckets they would like returned, and the aggregation will pick a "round" interval which gets as close to that number as possible but does not exceed it.

This means that an application can, for example, provide the maximum number of bars that it has space to render on a bar chart, and the aggregation will return buckets with an interval that is easy for humans to parse but whose count does not exceed the number the application is able to render.
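
A minimal sketch of the behaviour described above, assuming (as mentioned further down in this thread) that "round" intervals are 1x, 2x, or 5x a power of ten; the function name and candidate set are illustrative, not the actual implementation. It picks the finest round interval whose aligned bucket count does not exceed the requested maximum.

```python
import math

def pick_round_interval(min_value, max_value, max_buckets):
    """Choose the finest 'round' interval whose bucket count, after aligning
    the boundaries to multiples of the interval, stays <= max_buckets."""
    span = max(max_value - min_value, 1)
    exponent = math.floor(math.log10(span / max_buckets))
    # Candidate intervals: 1x, 2x, 5x powers of ten, from fine to coarse.
    candidates = [m * 10 ** e
                  for e in range(exponent, exponent + 4)
                  for m in (1, 2, 5)]
    for width in candidates:
        lo = math.floor(min_value / width) * width
        hi = (math.floor(max_value / width) + 1) * width
        if (hi - lo) / width <= max_buckets:
            return width
    return candidates[-1]   # fallback for degenerate cases (e.g. max_buckets=1)

# pick_round_interval(0, 1000, 12) -> 100 (11 buckets of width 100)
```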

@melissachang

I see, thanks.

The documentation for #28993 says "The buckets field is optional, and will default to 10 buckets if not specified." Does "rounding" occur in this case? For me, the ideal is:

  • I would like to use this feature without specifying max # buckets. (I don't want to think about how many buckets I want, I just want a reasonable default.)
  • When I don't specify max # buckets, rounding takes place and the actual # buckets may vary.

@colings86 (Contributor)

Yes, rounding also occurs for the default value, so you may get fewer than 10 buckets returned by default, but never more than 10.

@melissachang

Thanks. Is there an ETA on when this will be implemented? I would like to start using it. :)

@colings86 (Contributor)

We don't have an ETA for this, sorry, but it's great that you are excited about this feature. You can track this issue for progress; when a PR is available it will show the intended version(s) (note that until the PR is merged the intended versions may change).

@melissachang

Thanks. If I could test the future PR before it's merged, that would be great. I'd like to see how it looks on my data.

@mrec commented Aug 15, 2018

Presumably this would also follow auto_date_histogram's policy of erroring on buckets above 10k?

One (very) edge case worth considering is what happens when "buckets":1. This is obviously silly for auto_histogram, but it's not silly for other bucketed aggs like terms so I don't know if you'd want to error on it. If you don't error, though, you've got the awkward question of what to do when the value range crosses zero - normally for a distribution like that zero would always be a bucket boundary, but in this case it couldn't be.

Finally, could you confirm that your rounding logic will always round to 1x, 2x or 5x a power of ten? I'm building a manual equivalent of this feature at the moment, using a separate calibration request, and am trying to design it in such a way that we could swap it out for auto_histogram transparently when available.
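
For reference, here is a rough sketch of the kind of "separate calibration request" approach described above, using the existing stats and histogram aggregations; the price field name, aggregation names, and the 1x/2x/5x rounding are illustrative assumptions, not something from this issue.

```python
import math

# Step 1: a calibration request that only fetches min/max for the field.
calibration_request = {"size": 0, "aggs": {"calib": {"stats": {"field": "price"}}}}

def histogram_request(min_value, max_value, max_buckets):
    """Step 2: pick a 1x/2x/5x power-of-ten interval from the calibrated
    min/max and build an ordinary histogram aggregation with it."""
    span = max(max_value - min_value, 1)
    exp = math.floor(math.log10(span / max_buckets))
    interval = next(m * 10 ** e
                    for e in range(exp, exp + 4)
                    for m in (1, 2, 5)
                    if span / (m * 10 ** e) <= max_buckets)
    # Note: very old versions (e.g. 2.x) only accept integer intervals.
    return {"size": 0,
            "aggs": {"price_hist": {"histogram": {"field": "price",
                                                  "interval": interval}}}}
```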

@pcsanwald (Contributor, Author)

Hi @mrec, thanks for your comments!

We'll follow the same policy of erroring if the number of buckets required exceeds the soft limit, but unfortunately the logic is slightly more complex than throwing an error if buckets exceed 10k; the logic is here. The reason for checking and throwing this error is that our rounding could otherwise trip the soft limit for the maximum number of buckets. So, the logic is actually to error if the requested buckets exceed the soft limit for the max buckets setting divided by the max inner interval for roundings. The error message will contain the reason; the rationale for this implementation decision is in the comments on this PR.
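
In other words (a hedged paraphrase with illustrative names and values, not the actual Elasticsearch identifiers): the target bucket count is validated against the soft limit divided by the largest inner rounding multiplier, so that rounding can never push the final bucket count over the soft limit.

```python
MAX_BUCKETS_SOFT_LIMIT = 10_000   # e.g. the 10k soft limit mentioned above
MAX_INNER_INTERVAL = 7            # illustrative: largest multiplier a rounding can apply

def validate_target_buckets(buckets):
    limit = MAX_BUCKETS_SOFT_LIMIT // MAX_INNER_INTERVAL
    if buckets > limit:
        raise ValueError(
            f"buckets must be <= {limit}: soft limit {MAX_BUCKETS_SOFT_LIMIT} "
            f"divided by max inner interval {MAX_INNER_INTERVAL}")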

Currently, we don't error if "buckets":1. Let me investigate this edge case a bit more; I think opening an issue might be appropriate, but let me confirm it's an issue for auto_date_histogram first (I haven't tested with dates that could cross 0).

On rounding logic: We round in milliseconds and the current finest-grained rounding interval is SECOND_OF_MINUTE, so hopefully this addresses your concern. Rounding units and innerIntervals are built here.

@mrec commented Aug 15, 2018

Hi @pcsanwald,

> our rounding could otherwise trip the soft limit

That's fine; I understand the rationale and was already allowing for rounding in my calibrated implementation.

> On rounding logic: We round in milliseconds and the current finest-grained rounding interval is SECOND_OF_MINUTE

Erm... I think you're getting mixed up between auto_date_histogram and auto_histogram. This question is about the latter.

@pcsanwald (Contributor, Author)

@mrec my mistake, sorry. Your suggestion seems sensible, and at this point I can't think of a reason we wouldn't round to those intervals. Since I'm planning to start on this work soon, I'll keep the issue updated, and I'd be happy to discuss any cases that might cause us not to round in this manner.

@colings86 (Contributor)

For numeric auto_histogram I was thinking of rounding to 1x, 2x, and 5x powers of ten, but this isn't set in stone at this stage. We need to begin implementing, and then we can assess whether those multiples are appropriate and give enough options within a power of ten. Hopefully what we decide and what you decide will end up the same and you can swap your implementation out transparently, but we can't guarantee it at this stage.

@mrec commented Aug 15, 2018

@colings86 - understood. I'm actively expecting ES upgrades to change our bucketing for the better; at the moment we're stuck way back on 2.3.2, where histogram doesn't even support non-integral intervals. I'm just trying to avoid gratuitous differences where I can.

@colings86 (Contributor)

Sure, I understand that aim and I think it's a good one. We'll be sure to update this issue when something is settled on. 😄

@mrec commented Aug 17, 2018

Thinking about this a bit more since yesterday.

@pcsanwald: seeing what auto_date_histogram does with "buckets":1 would be interesting, but I don't think it's exactly the same situation as auto_histogram. For dates, the maximum supported interval is 100 years, so the client always has to allow for the possibility that the data won't fit - if I ask for 10 buckets and the values span two millennia, I'm going to be disappointed. For numbers this is much less obvious because, aside from the limits of double representation, intervals can get as big as they need to be, so clients may not be expecting rejection.

Semi-off-topic: extreme outliers aren't the only cause of the "won't fit" problem, but in our usage of histogram/date_histogram they've been by far the most common one. I'm not for a moment suggesting that it should be in scope for this issue, but would it be possible, even in principle, for auto_histogram to do something like Winsorizing? We have this floating around as a possible requirement, and it looks as if it might be doable with the current separate-calibration approach (by requesting e.g. the 1st and 99th percentiles as well as the 0th and 100th used to determine the full range, and doing the final histogram aggregation on a script field that clamps raw values) but not via auto_histogram.
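
A rough sketch of that Winsorizing idea under the current two-request approach (the field name, aggregation names, and interval choice are illustrative): calibrate on the 1st/99th percentiles, then run an ordinary histogram over a script that clamps raw values to that range, so extreme outliers fall into the edge buckets instead of stretching the range.

```python
# Step 1: fetch the calibration percentiles for the field.
calibration_request = {
    "size": 0,
    "aggs": {"calib": {"percentiles": {"field": "price",
                                       "percents": [0, 1, 99, 100]}}},
}

def winsorized_histogram_request(p1, p99, interval):
    """Step 2: histogram over values clamped to [p1, p99]."""
    return {
        "size": 0,
        "aggs": {"clamped": {"histogram": {
            "interval": interval,
            "script": {
                "source": "Math.min(Math.max(doc['price'].value, params.lo), params.hi)",
                "params": {"lo": p1, "hi": p99},
            },
        }}},
    }
```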

@jasontedor added the v8.0.0 label and removed the v7.0.0 label on Feb 6, 2019
@colings86 added the 7x label and removed the v8.0.0 label on Apr 12, 2019
@agirbal commented Sep 12, 2019

It seems like this ticket would focus on picking buckets automatically but make them of equal ranges? If so, it would not satisfy the main use cases of elastic/kibana#3905 and elastic/kibana#3757, which were asking for quantile/percentile-based ranges on the X-axis, right? Said differently, the ranges should not be equal (e.g. 1-9, 10-499, 500-10000) but should instead each represent an equivalent count of entries. This is really powerful, as it lets you graph some metric for your bottom 25% of users vs your top 25% of users. In the typical performance-histogram case there is a huge long tail of results, and simple auto-bucketing would just cram 99% of results into only one bucket.

As I understand from @jpountz, the only way to do this today would be the two-query approach?

@polyfractal (Contributor) commented Sep 12, 2019

> It seems like this ticket would focus on picking buckets automatically but make them of equal ranges?

Correct. There is a related open PR (from a community member) to add variable-width histograms: #42035

But I don't think that would satisfy the requirement if you need exact control over the bucket widths, e.g. #42035 will still choose its own bucket widths as it clusters/merges buckets together.

On the other hand, if you are going after the CDF in particular, I'd probably just use the percentiles aggregation and specify a bunch of percentiles to retrieve (0-100 in increments of 1 or 0.1 or whatever). Calculating those is essentially free since they are all being generated from the same data sketch.
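
A minimal sketch of that suggestion (the latency field and aggregation name are illustrative): a single percentiles aggregation with a dense percents grid effectively returns the CDF in one request.

```python
# Ask for a dense grid of percentiles in one request.
percents = [p / 10 for p in range(0, 1001)]   # 0.0, 0.1, ..., 100.0
cdf_request = {
    "size": 0,
    "aggs": {"latency_cdf": {"percentiles": {"field": "latency",
                                             "percents": percents}}},
}
```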

You could do the same with a percentile_rank to get the actual values on the x-axis, although it'd require a max first to find the upper bound that you need to request (which I admit is not as convenient).

Edit: Oh, never mind, you want the buckets so that you can run sub-aggs, right? Yeah, that would require two passes if you need to specify histogram widths like for percentiles/quantiles. Theoretically we could do it in a single-pass approximate manner (by interrogating the sketch at runtime as it accumulates data)... but I think the accuracy of that would be awful.

@ShahinSorkh

Any progress?

@nemphys commented Dec 28, 2021

Any news on this one? As already mentioned, it would be really handy for commerce applications (price ranges).

@mitar (Contributor) commented Sep 16, 2022

Isn't this available now as the variable width histogram aggregation?
