Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[Lens][Visualize] Enables histogram for histogram fields #131379

Draft
wants to merge 10 commits into
base: main
Choose a base branch
from

Conversation

stratoula
Copy link
Contributor

@stratoula stratoula commented May 3, 2022

Summary

Closes #112391
Closes #112390

Adds support for histogram and range aggregations on histogram fields.
https://www.elastic.co/guide/en/elasticsearch/reference/current/search-aggregations-bucket-histogram-aggregation.html
https://www.elastic.co/guide/en/elasticsearch/reference/current/search-aggregations-bucket-range-aggregation.html

Lens
image

Visualize editor
image

How to test

Create an index with an histogram field. I used the example here. Create an agg-based visualization or a Lens visualization on this dataview.

  • Agg based visualization with histogram agg
  • Lens visualization with intervals

Important note

From ES documentation:

when executing a histogram|range aggregation over an histogram field, no sub-aggregations are allowed

Checklist

@stratoula stratoula changed the title Enables histogram for histogram fields [Lens][Visualize] Enables histogram for histogram fields May 3, 2022
@stratoula stratoula added release_note:enhancement Feature:Vis Editor Visualization editor issues Feature:Lens backport:skip This commit does not require backporting v8.3.0 Team:Visualizations Visualization editors, elastic-charts and infrastructure labels May 3, 2022
@stratoula stratoula marked this pull request as ready for review May 3, 2022 13:18
@stratoula stratoula requested review from a team as code owners May 3, 2022 13:18
@elasticmachine
Copy link
Contributor

Pinging @elastic/kibana-vis-editors @elastic/kibana-vis-editors-external (Team:VisEditors)

Copy link
Member

@ppisljar ppisljar left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

app services code LGTM

@mbondyra
Copy link
Contributor

mbondyra commented May 3, 2022

It's a tough one from UX perspective. Imagine you have a chart like this (using the example data you mentioned):
Screenshot 2022-05-03 at 16 21 24

and then you can add intervals operation that will immediately throw error.
Screenshot 2022-05-03 at 16 22 42

I feel like we should either block the intervals in this case, or convert Unique count of my text to Count of records.

What do you think @ghudgins @flash1293? Maybe we should discuss it during weekly?

@stratoula
Copy link
Contributor Author

I get your point Marta but changing to count seems weird to me. Disabling it for non count can be better but what happens if the user configured first the bucket and then the metric? I was thinking that it is how histogram works on histogram fields and it makes sense to me but I would be happy to discuss it all together.

@flash1293
Copy link
Contributor

What about handling the case on the Lens level like we do it for moving average without a date histogram?
Screenshot 2022-05-03 at 17 56 37

Copy link
Member

@markov00 markov00 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hi @stratoula I don't want to be the usual party pooper, but I have some doubts.
The ES docs about histogram says:

When using a histogram as part of an aggregation, the accuracy of the results will depend on how the histogram was constructed. It is important to consider the percentiles aggregation mode that will be used to build it. Some possibilities include:

For the T-Digest mode, the values array represents the mean centroid positions and the counts array represents the number of values that are attributed to each centroid. If the algorithm has already started to approximate the percentiles, this inaccuracy is carried over in the histogram.
For the High Dynamic Range (HDR) histogram mode, the values array represents fixed upper limits of each bucket interval, and the counts array represents the number of values that are attributed to each interval. This implementation maintains a fixed worse-case percentage error (specified as a number of significant digits), therefore the value used when generating the histogram would be the maximum accuracy you can achieve at aggregation time.

The histogram field is "algorithm agnostic" and does not store data specific to either T-Digest or HDRHistogram. While this means the field can technically be aggregated with either algorithm, in practice the user should chose one algorithm and index data in that manner (e.g. centroids for T-Digest or intervals for HDRHistogram) to ensure best accuracy.

This to me means that the values array can represent buckets in different ways than the way we are using buckets today (the bucket key represents the beginning of the bucket).
Here, the ES description clearly defines that the value array can represent not only the initial bucket, but a mean centroid, or the upper limit of each interval.

The way the histogram field is built is essential to represent it correctly. By default we/elastic-charts consider the data passed to it as a set of histogram bins where the value represent the beginning of the bin, but we don't automatically (and I think this is not possible anyway) determine what method was used to build such histogram and we can fall into untruthful representation of data.

Your example here is the proof. I think it is using an histogram of this like:

[0.1, 0.2, 0.25, 0.3, 0.35, 0.4, 0.45, 0.5]

image

An histogram divides the range into intervals without leaving gaps between them. But the way it is constructired is fundamental. With a simple set of value like that we can describe the following histograms:

  • each value represent the beginning of an equal width bin [0.1 - 0.15], [0.15 - 0.2] (missing), [0.2- 0.25]
  • each value represent the beginning of unequal width bins [0.1 -0.2],[0.2 - 0.5]
  • each value represent the centroid (based on some data) [0 - 0.18],[0.18 - 0.22],[0.22 - 0.29]note the first element can start at zero for example
  • each value represent the end of equal unequal width bins [0 - 0.1], [0.1 - 0.2] [0.2 - 0.25] note the first element can start at zero for example

These are just examples but each one shows a different representation and is interpreted differently. We can make this works for 1 case (each value represents the beginning of the bin) but I don't think this covers the various use cases, and in particular, this doesn't cover the two use cases specified in the docs.

If, instead, we are applying our histogram aggregation on top of histogram fields, your work is fine, but the ES implementation logic blows my mind because it looks like not even considering what was defined in the ES documentation itself.

@flash1293
Copy link
Contributor

flash1293 commented May 4, 2022

@markov00 I agree that the user has to know what the numbers in the histogram actually mean to be able to use it in a meaningful way in an aggregation. However, this is a true statement for all kinds of data and not special to the histogram field type (although I agree it's extra weird).

The docs you cited just specify that - what a histogram actually means is up to how the data was indexed which is fully controlled by the user and Elasticsearch can't store meta information about it because it doesn't know. It calls out a special case about percentiles (right now we don't support hdr percentiles in Lens at all - a known limitation ), but that's kind of separate to the histogram aggregation as there is no way I know off to have the histogram aggregation interpret the histogram field in a "centroid way" and attribute the counts to multiple buckets if they are closer to the edge or something like this - seems like Elasticsearch itself is lacking this functionality and you probably shouldn't use the histogram agg on histogram fields indexed this way (or be aware of inaccuracies).

The mental model (and prime use case from my understanding) is indeed the first case you are calling out - it's also the example made in the docs for this situation: https://www.elastic.co/guide/en/elasticsearch/reference/current/search-aggregations-bucket-histogram-aggregation.html#search-aggregations-bucket-histogram-aggregation-histogram-fields

All of that being said, it's probably worth adding an in-product documentation popover for this, as it's definitely something a user can misinterpret if they are not aware of what their data means - however IMHO it's not a blocker for adding this functionality because it's inherent to the level of meta information about the data structure Lens is working with. To make a current example - it's possible to use the "Counter rate" function in a misleading way because the time series has to be broken down by all dimension fields. We don't know the dimension fields so we can't guide the user with this. If a dimension field is missed, the chart will still show something, but it won't be correct. In the same way it would be meaningless to do a percentile calculation on a counter metric field, but we can't prevent or guide the user doing this at the moment. This case seems similar to me.

@markov00
Copy link
Member

markov00 commented May 4, 2022

The mental model (and prime use case from my understanding) is indeed the first case you are calling out - it's also the example made in the docs for this situation: https://www.elastic.co/guide/en/elasticsearch/reference/current/search-aggregations-bucket-histogram-aggregation.html#search-aggregations-bucket-histogram-aggregation-histogram-field

@flash1293 that example to me is completely wrong: the histogram aggregation on the histogram field is not even considering the "histogram" nature of the field, but it is taking each single bin on the histogram fields, converting them to single data points (as they were single documents) and then aggregating them in new histogram bins:

"latency_histo" : {
      "values" : [1, 3, 8, 12, 15],
      "counts" : [3, 7, 23, 12, 6]
   }
"latency_histo" : {
      "values" : [1, 6, 8, 12, 14],
      "counts" : [8, 17, 8, 7, 6]
   }
"buckets": [
        {
          "key": 0.0,
          "doc_count": 18 // made by sum of bin 1,3 of the first doc + the bin 1 of the second doc
        },
        {
          "key": 5.0,
          "doc_count": 48 // made by the sum of the bin 8 of the first doc + the bin 6 and 8 of the second
        },
        {
          "key": 10.0,
          "doc_count": 25 // bin 12 an bin 12/14
        },
        {
          "key": 15.0,
          "doc_count": 6 // only bin 16 of first doc
        }
      ]

The fact that the nature of the bin (a range between two values) is completely removed from the calculation, is probably one of the most misleading features we can have in ES... happy to discuss in a separate session, but to me ES is doing it wrong and we are just releasing a feature of something that works in the wrong way.

@flash1293
Copy link
Contributor

flash1293 commented May 4, 2022

@markov00 Sorry, I should have clarified - the way I interpreted this example is that the buckets of the histogram field are 1ms each, but you are right that it's not called out explicitly.

The fact that the nature of the bin (a range between two values) is completely removed from the calculation, is probably one of the most misleading features we can have in ES... happy to discuss in a separate session, but to me ES is doing it wrong and we are just releasing a feature of something that works in the wrong way.

I definitely understand your reservations here, let's discuss offline. It's a pretty "raw" feature but it definitely has it's use cases.

@stratoula
Copy link
Contributor Author

@MichaelMarcialis we decided to proceed with this feature (enable support of histogram on histogram fields) but we would like to add an in-app documentation to explain to the users how intervals work with histogram fields in order to ensure that there will be no confusion.

I wonder how this in-app documentation should work. Atm we have the date_histogram documentation
image

I could do the same for the intervals, but do we want this to be rendered always (even if the field is not of histogram type)? Do we want this to appear as long as the user has selected the field? What do you think?

@MichaelMarcialis
Copy link
Contributor

I could do the same for the intervals, but do we want this to be rendered always (even if the field is not of histogram type)? Do we want this to appear as long as the user has selected the field? What do you think?

@stratoula: My first instinct is to say that any in-app documentation we include for the quick functions would be ever-present and not conditional based on the field selected (especially since the field selection can happen after function selection if not using drag-and-drop). Instead, could we include descriptions within the popover that explain what will happen with both histogram type fields as well as other field types?

@stratoula
Copy link
Contributor Author

Yes @MichaelMarcialis this makes sense, thanx!

@kibana-ci
Copy link
Collaborator

💚 Build Succeeded

Metrics [docs]

Async chunks

Total size of all lazy-loaded chunks that will be downloaded as the user navigates the app

id before after diff
lens 1.1MB 1.1MB +891.0B
visDefaultEditor 142.4KB 142.5KB +22.0B
total +913.0B

Page load bundle

Size of the bundles that are downloaded on every page load. Target size is below 100kb

id before after diff
data 413.8KB 413.9KB +56.0B

History

To update your PR or re-run it, just comment with:
@elasticmachine merge upstream

@ghudgins
Copy link

We met and decided to hold on this PR. If we can find a more tangible business case for adding this to Lens (someone who needs it more explicitly) then we can re-open this PR.

One thing we identified that would help this feature be useful for experts is the ability to set the interval width with a more detailed control instead of the slider UI. This would allow the expert user to match the histogram bucket size in their histogram fields.

@ghudgins ghudgins closed this May 10, 2022
@stratoula stratoula reopened this Apr 22, 2024
@stratoula stratoula requested a review from a team as a code owner April 22, 2024 14:42
@stratoula stratoula marked this pull request as draft April 22, 2024 14:42
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
backport:skip This commit does not require backporting Feature:Lens Feature:Vis Editor Visualization editor issues release_note:enhancement Team:Visualizations Visualization editors, elastic-charts and infrastructure v8.3.0
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Allow histogram aggregation on histogram field type [Lens] Allow histograms on histogram fields
9 participants