[Lens][Visualize] Enables histogram for histogram fields #131379

stratoula · 2022-05-03T09:35:10Z

Summary

Adds support for histogram and range aggregations on histogram fields.
https://www.elastic.co/guide/en/elasticsearch/reference/current/search-aggregations-bucket-histogram-aggregation.html
https://www.elastic.co/guide/en/elasticsearch/reference/current/search-aggregations-bucket-range-aggregation.html

Lens

Visualize editor

How to test

Create an index with an histogram field. I used the example here. Create an agg-based visualization or a Lens visualization on this dataview.

Agg based visualization with histogram agg
Lens visualization with intervals

Important note

From ES documentation:

when executing a histogram|range aggregation over an histogram field, no sub-aggregations are allowed

Checklist

Unit or functional tests were updated or added to match the most common scenarios

elasticmachine · 2022-05-03T13:18:49Z

Pinging @elastic/kibana-vis-editors @elastic/kibana-vis-editors-external (Team:VisEditors)

ppisljar

app services code LGTM

mbondyra · 2022-05-03T14:23:04Z

It's a tough one from UX perspective. Imagine you have a chart like this (using the example data you mentioned):

and then you can add intervals operation that will immediately throw error.

I feel like we should either block the intervals in this case, or convert Unique count of my text to Count of records.

What do you think @ghudgins @flash1293? Maybe we should discuss it during weekly?

stratoula · 2022-05-03T14:41:08Z

I get your point Marta but changing to count seems weird to me. Disabling it for non count can be better but what happens if the user configured first the bucket and then the metric? I was thinking that it is how histogram works on histogram fields and it makes sense to me but I would be happy to discuss it all together.

flash1293 · 2022-05-03T15:57:01Z

What about handling the case on the Lens level like we do it for moving average without a date histogram?

markov00

Hi @stratoula I don't want to be the usual party pooper, but I have some doubts.
The ES docs about histogram says:

When using a histogram as part of an aggregation, the accuracy of the results will depend on how the histogram was constructed. It is important to consider the percentiles aggregation mode that will be used to build it. Some possibilities include:

For the T-Digest mode, the values array represents the mean centroid positions and the counts array represents the number of values that are attributed to each centroid. If the algorithm has already started to approximate the percentiles, this inaccuracy is carried over in the histogram.
For the High Dynamic Range (HDR) histogram mode, the values array represents fixed upper limits of each bucket interval, and the counts array represents the number of values that are attributed to each interval. This implementation maintains a fixed worse-case percentage error (specified as a number of significant digits), therefore the value used when generating the histogram would be the maximum accuracy you can achieve at aggregation time.

The histogram field is "algorithm agnostic" and does not store data specific to either T-Digest or HDRHistogram. While this means the field can technically be aggregated with either algorithm, in practice the user should chose one algorithm and index data in that manner (e.g. centroids for T-Digest or intervals for HDRHistogram) to ensure best accuracy.

This to me means that the values array can represent buckets in different ways than the way we are using buckets today (the bucket key represents the beginning of the bucket).
Here, the ES description clearly defines that the value array can represent not only the initial bucket, but a mean centroid, or the upper limit of each interval.

The way the histogram field is built is essential to represent it correctly. By default we/elastic-charts consider the data passed to it as a set of histogram bins where the value represent the beginning of the bin, but we don't automatically (and I think this is not possible anyway) determine what method was used to build such histogram and we can fall into untruthful representation of data.

Your example here is the proof. I think it is using an histogram of this like:

[0.1, 0.2, 0.25, 0.3, 0.35, 0.4, 0.45, 0.5]

An histogram divides the range into intervals without leaving gaps between them. But the way it is constructired is fundamental. With a simple set of value like that we can describe the following histograms:

each value represent the beginning of an equal width bin [0.1 - 0.15], [0.15 - 0.2] (missing), [0.2- 0.25]
each value represent the beginning of unequal width bins [0.1 -0.2],[0.2 - 0.5]
each value represent the centroid (based on some data) [0 - 0.18],[0.18 - 0.22],[0.22 - 0.29]note the first element can start at zero for example
each value represent the end of equal unequal width bins [0 - 0.1], [0.1 - 0.2] [0.2 - 0.25] note the first element can start at zero for example

These are just examples but each one shows a different representation and is interpreted differently. We can make this works for 1 case (each value represents the beginning of the bin) but I don't think this covers the various use cases, and in particular, this doesn't cover the two use cases specified in the docs.

If, instead, we are applying our histogram aggregation on top of histogram fields, your work is fine, but the ES implementation logic blows my mind because it looks like not even considering what was defined in the ES documentation itself.

… by bucket is given

flash1293 · 2022-05-04T08:28:02Z

@markov00 I agree that the user has to know what the numbers in the histogram actually mean to be able to use it in a meaningful way in an aggregation. However, this is a true statement for all kinds of data and not special to the histogram field type (although I agree it's extra weird).

The docs you cited just specify that - what a histogram actually means is up to how the data was indexed which is fully controlled by the user and Elasticsearch can't store meta information about it because it doesn't know. It calls out a special case about percentiles (right now we don't support hdr percentiles in Lens at all - a known limitation ), but that's kind of separate to the histogram aggregation as there is no way I know off to have the histogram aggregation interpret the histogram field in a "centroid way" and attribute the counts to multiple buckets if they are closer to the edge or something like this - seems like Elasticsearch itself is lacking this functionality and you probably shouldn't use the histogram agg on histogram fields indexed this way (or be aware of inaccuracies).

The mental model (and prime use case from my understanding) is indeed the first case you are calling out - it's also the example made in the docs for this situation: https://www.elastic.co/guide/en/elasticsearch/reference/current/search-aggregations-bucket-histogram-aggregation.html#search-aggregations-bucket-histogram-aggregation-histogram-fields

All of that being said, it's probably worth adding an in-product documentation popover for this, as it's definitely something a user can misinterpret if they are not aware of what their data means - however IMHO it's not a blocker for adding this functionality because it's inherent to the level of meta information about the data structure Lens is working with. To make a current example - it's possible to use the "Counter rate" function in a misleading way because the time series has to be broken down by all dimension fields. We don't know the dimension fields so we can't guide the user with this. If a dimension field is missed, the chart will still show something, but it won't be correct. In the same way it would be meaningless to do a percentile calculation on a counter metric field, but we can't prevent or guide the user doing this at the moment. This case seems similar to me.

markov00 · 2022-05-04T08:51:44Z

The mental model (and prime use case from my understanding) is indeed the first case you are calling out - it's also the example made in the docs for this situation: https://www.elastic.co/guide/en/elasticsearch/reference/current/search-aggregations-bucket-histogram-aggregation.html#search-aggregations-bucket-histogram-aggregation-histogram-field

@flash1293 that example to me is completely wrong: the histogram aggregation on the histogram field is not even considering the "histogram" nature of the field, but it is taking each single bin on the histogram fields, converting them to single data points (as they were single documents) and then aggregating them in new histogram bins:

"latency_histo" : {
      "values" : [1, 3, 8, 12, 15],
      "counts" : [3, 7, 23, 12, 6]
   }
"latency_histo" : {
      "values" : [1, 6, 8, 12, 14],
      "counts" : [8, 17, 8, 7, 6]
   }

"buckets": [
        {
          "key": 0.0,
          "doc_count": 18 // made by sum of bin 1,3 of the first doc + the bin 1 of the second doc
        },
        {
          "key": 5.0,
          "doc_count": 48 // made by the sum of the bin 8 of the first doc + the bin 6 and 8 of the second
        },
        {
          "key": 10.0,
          "doc_count": 25 // bin 12 an bin 12/14
        },
        {
          "key": 15.0,
          "doc_count": 6 // only bin 16 of first doc
        }
      ]

The fact that the nature of the bin (a range between two values) is completely removed from the calculation, is probably one of the most misleading features we can have in ES... happy to discuss in a separate session, but to me ES is doing it wrong and we are just releasing a feature of something that works in the wrong way.

flash1293 · 2022-05-04T08:55:59Z

@markov00 Sorry, I should have clarified - the way I interpreted this example is that the buckets of the histogram field are 1ms each, but you are right that it's not called out explicitly.

The fact that the nature of the bin (a range between two values) is completely removed from the calculation, is probably one of the most misleading features we can have in ES... happy to discuss in a separate session, but to me ES is doing it wrong and we are just releasing a feature of something that works in the wrong way.

I definitely understand your reservations here, let's discuss offline. It's a pretty "raw" feature but it definitely has it's use cases.

stratoula · 2022-05-05T09:35:46Z

@MichaelMarcialis we decided to proceed with this feature (enable support of histogram on histogram fields) but we would like to add an in-app documentation to explain to the users how intervals work with histogram fields in order to ensure that there will be no confusion.

I wonder how this in-app documentation should work. Atm we have the date_histogram documentation

I could do the same for the intervals, but do we want this to be rendered always (even if the field is not of histogram type)? Do we want this to appear as long as the user has selected the field? What do you think?

MichaelMarcialis · 2022-05-05T17:33:35Z

I could do the same for the intervals, but do we want this to be rendered always (even if the field is not of histogram type)? Do we want this to appear as long as the user has selected the field? What do you think?

@stratoula: My first instinct is to say that any in-app documentation we include for the quick functions would be ever-present and not conditional based on the field selected (especially since the field selection can happen after function selection if not using drag-and-drop). Instead, could we include descriptions within the popover that explain what will happen with both histogram type fields as well as other field types?

stratoula · 2022-05-06T05:56:33Z

Yes @MichaelMarcialis this makes sense, thanx!

kibana-ci · 2022-05-09T07:10:41Z

💚 Build Succeeded

Metrics [docs]

Async chunks

Total size of all lazy-loaded chunks that will be downloaded as the user navigates the app

id	before	after	diff
`lens`	1.1MB	1.1MB	+891.0B
`visDefaultEditor`	142.4KB	142.5KB	+22.0B
total			+913.0B

Page load bundle

Size of the bundles that are downloaded on every page load. Target size is below 100kb

id	before	after	diff
`data`	413.8KB	413.9KB	+56.0B

History

💚 Build #42943 succeeded 35f7912
💚 Build #42348 succeeded bd124ba
💔 Build #42337 failed 1c9252b
💔 Build #42321 failed 48b41b8
💚 Build #42296 succeeded f8c3ae6
💚 Build #42035 succeeded c146309

To update your PR or re-run it, just comment with:
@elasticmachine merge upstream

ghudgins · 2022-05-10T17:34:55Z

We met and decided to hold on this PR. If we can find a more tangible business case for adding this to Lens (someone who needs it more explicitly) then we can re-open this PR.

One thing we identified that would help this feature be useful for experts is the ability to set the interval width with a more detailed control instead of the slider UI. This would allow the expert user to match the histogram bucket size in their histogram fields.

Enables histogram for histogram fields

4329bd1

stratoula changed the title ~~Enables histogram for histogram fields~~ [Lens][Visualize] Enables histogram for histogram fields May 3, 2022

stratoula added 3 commits May 3, 2022 13:33

Add support for ranges agg on histogram field

8c08179

Allows auto interval for histogram fields

844ce3d

Merge branch 'main' into histogram-on-histogram

c146309

stratoula added release_note:enhancement Feature:Vis Editor Visualization editor issues Feature:Lens backport:skip This commit does not require backporting v8.3.0 Team:Visualizations Visualization editors, elastic-charts and infrastructure labels May 3, 2022

stratoula marked this pull request as ready for review May 3, 2022 13:18

stratoula requested review from a team as code owners May 3, 2022 13:18

ppisljar approved these changes May 3, 2022

View reviewed changes

Merge branch 'main' into histogram-on-histogram

f8c3ae6

markov00 reviewed May 4, 2022

View reviewed changes

Display a helper error in case a non-count aggregation or a breakdown…

48b41b8

… by bucket is given

Fix jest test

1c9252b

Refactor helpers jest test

bd124ba

stratoula added 2 commits May 6, 2022 10:25

Merge branch 'main' into histogram-on-histogram

35f7912

Merge branch 'main' into histogram-on-histogram

bef671e

ghudgins closed this May 10, 2022

stratoula reopened this Apr 22, 2024

stratoula requested a review from a team as a code owner April 22, 2024 14:42

stratoula marked this pull request as draft April 22, 2024 14:42

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

[Lens][Visualize] Enables histogram for histogram fields #131379

[Lens][Visualize] Enables histogram for histogram fields #131379

stratoula commented May 3, 2022 •

edited

Loading

elasticmachine commented May 3, 2022

ppisljar left a comment

mbondyra commented May 3, 2022

stratoula commented May 3, 2022

flash1293 commented May 3, 2022

markov00 left a comment •

edited

Loading

flash1293 commented May 4, 2022 •

edited

Loading

markov00 commented May 4, 2022

flash1293 commented May 4, 2022 •

edited

Loading

stratoula commented May 5, 2022

MichaelMarcialis commented May 5, 2022

stratoula commented May 6, 2022

kibana-ci commented May 9, 2022

ghudgins commented May 10, 2022

[Lens][Visualize] Enables histogram for histogram fields #131379

Are you sure you want to change the base?

[Lens][Visualize] Enables histogram for histogram fields #131379

Conversation

stratoula commented May 3, 2022 • edited Loading

Summary

How to test

Important note

Checklist

elasticmachine commented May 3, 2022

ppisljar left a comment

Choose a reason for hiding this comment

mbondyra commented May 3, 2022

stratoula commented May 3, 2022

flash1293 commented May 3, 2022

markov00 left a comment • edited Loading

Choose a reason for hiding this comment

flash1293 commented May 4, 2022 • edited Loading

markov00 commented May 4, 2022

flash1293 commented May 4, 2022 • edited Loading

stratoula commented May 5, 2022

MichaelMarcialis commented May 5, 2022

stratoula commented May 6, 2022

kibana-ci commented May 9, 2022

💚 Build Succeeded

Metrics [docs]

Async chunks

Page load bundle

History

ghudgins commented May 10, 2022

stratoula commented May 3, 2022 •

edited

Loading

markov00 left a comment •

edited

Loading

flash1293 commented May 4, 2022 •

edited

Loading

flash1293 commented May 4, 2022 •

edited

Loading