Add support for "missing" to all bucket aggregations #5324

Closed
roytmana opened this issue Mar 3, 2014 · 61 comments

@roytmana

roytmana commented Mar 3, 2014

NEED: In many (if not the majority of) cases, when you present users with business analytics, they want to see numbers for the complete data set. No matter how you aggregate, the result should cover the same data with the same number of documents. The inability to handle "missing" values excludes those documents from analysis, making the analyzed data set incomplete and the grand totals dependent on which field(s) the aggregation is done on. It is impossible to explain to users why the lower-level totals do not add up to the upper-level ones!

WORKAROUND: Currently, field-based bucket aggregations (terms, range, etc.) have no way to aggregate missing values. The only way is to use a missing aggregation on the same level and the same field as the terms aggregation itself. That is easy enough with one-level aggregations, but with 2-3 levels the number of "missing" aggregations (and the complete lower-level aggregations that must be repeated inside them) mushrooms very quickly, to the point that the query is huge, convoluted, and not debuggable. It may affect performance as well. The fetched data also needs heavy post-processing to extract the multi-level aggregation buckets from under the various "missing" elements and put them inline with the regular aggregation values. Below is a simple query that does a 2-level aggregation with just one sum metric.

PROPOSAL: I would suggest that any aggregation operating on a field should have a missing option. If a missing config is specified, the aggregation should accumulate missing values under that value and honor any nested aggregations within. It should never assume a value like 0 or _missing, since that may clash with actual keys. If it is not specified, the aggregation should skip missing values as it does now.

This approach is entirely compatible with existing logic and gives developers complete control over whether to aggregate missing values and under what key. When it is not needed (and not specified), there will be no performance overhead. But when it is needed, it will work faster, as we would not need to run the missing aggregation and the aggregations under it separately (the same goes for an "other" aggregation).

To be honest, I would love to see the same handling for "other": documents that have not been included in the aggregation due to the aggregation size constraints. Again, the same rationale applies: the ability to slice the complete data set regardless of aggregation structure. It is just as needed as "missing" and just as troublesome to calculate, but I could understand if you did not add it, as it may not be compatible with your algorithms. But PLEASE PLEASE add "missing" handling at least.

{
      "total": {
        "sum": {
          "field": "money.totals.obligationTotal"
        }
      },
      "missing": {
        "missing": {
          "field": "division"
        },
        "aggs": {
          "total": {
            "sum": {
              "field": "money.totals.obligationTotal"
            }
          },
          "missing": {
            "missing": {
              "field": "fy"
            }
          },
          "group": {
            "terms": {
              "field": "fy",
              "order": { "_term": "asc" }
            },
            "aggs": {
              "total": {
                "sum": {
                  "field": "money.totals.obligationTotal"
                }
              }
            }
          }
        }
      },
      "group": {
        "terms": {
          "field": "division",
          "order": { "_term": "asc" },
          "size": 100
        },
        "aggs": {
          "total": {
            "sum": {
              "field": "money.totals.obligationTotal"
            }
          },
          "missing": {
            "missing": {
              "field": "fy"
            },
            "aggs": {
              "total": {
                "sum": {
                  "field": "money.totals.obligationTotal"
                }
              }
            }
          },
          "group": {
            "terms": {
              "field": "fy",
              "order": { "_term": "asc" }
            },
            "aggs": {
              "total": {
                "sum": {
                  "field": "money.totals.obligationTotal"
                }
              }
            }
          }
        }
      }
    }
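To put a number on how quickly this workaround grows, here is a minimal Python sketch that generates the same terms + missing pattern for any number of grouping levels and counts the aggregations in the result. The helper names (`workaround_aggs`, `count_aggs`) and the third grouping field `quarter` are hypothetical, made up for illustration; only `division`, `fy`, and `money.totals.obligationTotal` come from the query above. Unlike the query above, the sketch repeats the `total` sum under every missing bucket.

```python
def workaround_aggs(group_fields, metric_field):
    """Build the nested terms + missing workaround, outermost field first."""
    aggs = {"total": {"sum": {"field": metric_field}}}
    if not group_fields:
        return aggs
    field, rest = group_fields[0], group_fields[1:]
    # The regular buckets and the "missing" bucket each need their own copy
    # of the complete lower-level tree.
    aggs["group"] = {
        "terms": {"field": field, "order": {"_term": "asc"}},
        "aggs": workaround_aggs(rest, metric_field),
    }
    aggs["missing"] = {
        "missing": {"field": field},
        "aggs": workaround_aggs(rest, metric_field),
    }
    return aggs


def count_aggs(aggs):
    """Count every aggregation in the tree, including nested ones."""
    return sum(1 + count_aggs(body.get("aggs", {})) for body in aggs.values())


# "quarter" is a hypothetical third level, used only to show the growth.
all_fields = ["division", "fy", "quarter"]
for depth in (1, 2, 3):
    tree = workaround_aggs(all_fields[:depth], "money.totals.obligationTotal")
    print(depth, "level(s):", count_aggs(tree), "aggregations")
```

Under these assumptions the request holds 5, 13, and 29 aggregations for 1, 2, and 3 levels, so each extra level roughly doubles the query.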

cc @uboness, @jpountz

@roytmana
Author

Hello @uboness, @jpountz
Any chance for this to make it to 1.2?

@uboness
Contributor

uboness commented Mar 30, 2014

@roytmana we're considering adding it, though can't promise it for 1.2 (we're currently working on several other enhancements/features for aggs, so we'll see if it'll fit)

@j0hnsmith

+1

@d1nsh

d1nsh commented Jul 31, 2014

Any plans for picking this up? I am having the issue as well, and I currently have a workaround using multi search requests. It is OK so far since I only have a couple of aggregation levels, but I can see this getting messy if I go deeper.

@cn081

cn081 commented Sep 17, 2014

Any news on this issue?

@yeroc

yeroc commented Sep 18, 2014

I just ran into this -- we're trying to migrate from the deprecated facets to aggregations and are finding this a real gap in the supposed new and better way of doing things.

@bradvido
Contributor

bradvido commented Oct 1, 2014

I'm trying to move to aggregations and this is really holding us back. Forced to use facets for now...

@roytmana
Author

roytmana commented Oct 1, 2014

Same here - lack of support for _missing as a bucket as well as lack of _other as a bucket (however inefficient it may be it still will be better than a convoluted multi-step process I would be forced to use to calculate it myself) prevents me from moving from facets to aggs

@j0hnsmith

Deprecating facets without a viable solution via aggregations is madness. If/when this happens, some of us won't be able to update to the new ES version.

@roytmana
Author

roytmana commented Oct 2, 2014

Exactly my feeling. I commented on the deprecation change request a while ago, and it looks like the ES team does not feel this way. They consider facets a legacy that makes it harder to move the product forward. It is understandable but very unpleasant for people who do heavy-duty (and generic) analytics with facets. I wish they looked at aggs from a productivity standpoint and considered how well they suit traditional data mart style applications, which typically operate on a consistent dataset regardless of grouping, rollups etc. This is where MISSING and OTHER are really handy...

I also wish they would look at metrics that are expressions against other metrics in query aggs...

@jpountz
Contributor

jpountz commented Oct 3, 2014

In facets, other used to represent the number of values that didn't make it to the top terms. With aggregations, we tried to make everything document-based (as opposed to value-based). Computing the number of other documents is not possible in the general case. For example, let's imagine that you have 2 documents which have a tag field. Document 1 has tags: [ "red", "blue", "green" ] and document 2 has tags: ["red"]. So overall we have the following counts: { red: 2, blue: 1, green: 1}. Now if we run a terms aggregation with size: 1, we get {red: 2}. What should other be? red already matches all documents, so there are really no other documents. We might want to return how many documents match other terms (blue and green), but we cannot just return the sum: it is the same document that has blue and green as a tag, so there is only 1 other document that has other terms.

In the end, we can't return the document count for other terms. However, we could imagine returning the sum of the document counts for other terms, would that be enough? Note that in the single-valued case, it would be equal to the number of documents that have another term, the issue that I described above only occurs with multi-valued fields.
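The two-document example can be replayed in a few lines of Python. This is only a client-side sketch of the counting argument, not how the terms aggregation is implemented; the variable names are made up for illustration.

```python
from collections import Counter

# The two documents from the example, keyed by document id.
docs = {1: ["red", "blue", "green"], 2: ["red"]}

# Document count per term: each document counts once per distinct tag.
term_counts = Counter(tag for tags in docs.values() for tag in set(tags))

# Terms aggregation with size: 1 keeps only the most frequent term.
top_terms = {term for term, _ in term_counts.most_common(1)}
other_terms = set(term_counts) - top_terms

# Sum of the document counts of the terms that were cut off.
sum_of_other_buckets = sum(term_counts[t] for t in other_terms)

# Documents that carry at least one of the cut-off terms.
docs_with_other_term = sum(1 for tags in docs.values() if other_terms & set(tags))

# Documents that are not covered by any returned bucket.
docs_outside_top = sum(1 for tags in docs.values() if not top_terms & set(tags))

print(dict(term_counts))
print("sum of other buckets:", sum_of_other_buckets)
print("documents with another term:", docs_with_other_term)
print("documents outside the top terms:", docs_outside_top)
```

The three numbers come out as 2, 1, and 0, which is why the sum of the other buckets' counts cannot be read as a document count once a field is multi-valued.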

Here is a suggested format for the response:

{
    "aggregations": {
        "colors": {
            "buckets": [
                {
                    "key": "red",
                    "doc_count": 5
                },
                {
                    "key": "green",
                    "doc_count": 3
                }
            ],
            "sum_of_other_buckets": {
                "doc_count": 3
            }
        }
    }
}

sum_of_other_buckets is a bit long but it aims at making clear that it is the sum of the document counts for other buckets and not the number of documents that have another term.

Something else that might be possible would be to return sub aggregations for the other terms, but it would suffer from the same issues in the multi-valued case, and I would like to keep it for later as it would require significant work.

@clintongormley

+1

@rashidkpc

The ability to sub-aggregate is absolutely necessary. I think the proposal to have the "other" bucket mean something slightly different is ok as long as we document its issue with multi-valued fields. As Adrien mentioned, this is not an issue with single valued properties.

However, I think it's important to treat the "other" bucket as equivalent to the regular buckets. Aggregations have a defined nested structure, and breaking that will harm the ability to process them in a cleanly recursive fashion. I'd rather see the other bucket be added to the bucket array, with perhaps a customizable key as suggested by the original proposal, something like:

Request:

{
    "aggs": {
        "colors": {
            "terms": {
                "field": "color",
                "other_bucket": "__myCustomKey__"
            }
        }
    }
}

Response:

{
    "aggregations": {
        "colors": {
            "buckets": [
                {
                    "key": "red",
                    "doc_count": 5
                },
                {
                    "key": "green",
                    "doc_count": 3
                },
                {
                    "key": "__myCustomKey__",
                    "doc_count": 3
                }
            ]
        }
    }
}

@bradvido
Contributor

bradvido commented Oct 3, 2014

sum_of_other_buckets makes sense and its name clearly describes what it is. Maybe documents_in_other_buckets is clearer?
This would be nice to have, and I'd bet that most use-cases aren't on multi-valued fields, so the gotcha won't apply (but definitely needs to be documented).

@roytmana
Author

roytmana commented Oct 3, 2014

Like @rashidkpc, I would love it if _other were a bucket aggregation, not just a metric. My primary interest is to be able to calculate metrics for the _other bucket, not just counts. Say my analytics shows a $ sales breakdown by store. I would like to be able to show sales for the top 20 stores and then lump all other sales into _other, so the total rollup for the entire company does not depend on the number of "visible" buckets. But if we could also support bucket sub aggs within the _other bucket, it would be fantastic.

For a single-valued field, the logic of _missing and _other is rather clear, and it is the most common case. For multivalued fields it is not so clear, as @jpountz noted. Maybe if ES provided an _other bucket and let me pick the metrics within it and decide how to interpret it, that would cover a more generic use case?

There could be a bucket aggregation called Distinct that takes the parent document set and de-duplicates it, so that any metrics within such a bucket would not double-count.

My example above would not work very well with multivalued field as it will be double-counting $ but so it will be double-counting if I tried to roll up visible buckets

But I would like to say that the single-value use case is arguably the more important one to have a complete and very productive implementation for (in my mind that would be support for _missing and _other as buckets). If I need to do analytics like in my example on a multivalued data element, I should probably structure my data so that each value of the "multi" is a document carrying its fraction of the $, or accept double-counting in some shape or form.

My current solution for _other (even with facets, since facets do not support _other on anything but count) is to calculate the same metric for the entire dataset and then subtract the sum of the metric for the "visible" facets. That of course does not work for multi-valued fields. but I can
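A minimal sketch of that subtraction, assuming a single-valued field and buckets that each carry a sum metric named `total` in the `{"value": ...}` shape. The helper name, store names, and amounts are invented for illustration.

```python
def other_bucket_value(grand_total, visible_buckets, metric="total"):
    """Metric value for everything outside the visible buckets.

    Only meaningful for single-valued fields: with multi-valued fields the
    visible buckets may count the same document more than once.
    """
    visible = sum(bucket[metric]["value"] for bucket in visible_buckets)
    return grand_total - visible


# Illustrative numbers only (not real data).
grand_total = 1000  # the same sum computed over the entire data set
visible_buckets = [
    {"key": "store_a", "doc_count": 12, "total": {"value": 400}},
    {"key": "store_b", "doc_count": 7, "total": {"value": 250}},
]

print("_other:", other_bucket_value(grand_total, visible_buckets))
```

With these made-up numbers the _other value is 350. The subtraction has to be repeated at every level of a nested result, which is where the post-processing cost comes from.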

@jpountz
Contributor

jpountz commented Oct 3, 2014

@rashidkpc I'm concerned that it requires knowing a term that doesn't exist among your documents, otherwise there could be a collision. This might not always be easy?

We might be able to do something about sub aggregations for the other bucket but this requires much more work so I would like to do it in several steps and start with just the count (however the format should allow for adding data for sub-buckets in the future).

@Asimov4
Contributor

Asimov4 commented Mar 29, 2015

+1

@cameronkerrnz

+1. If my ELK stack can't help me to solidly point the finger at something --- because signals get weakened if I can't tell how significant the top-N things are --- then it becomes much harder to use Kibana (4) effectively. It effectively makes me want to keep Kibana 3 around.

@Nickology

+1

@oliviervaussy

+1 for the "other_bucket"

@roytmana
Author

Looks like a lot of people have the same need as I do (bucketing on _missing and _other).
I posted this issue initially because it was a critical and pretty obvious need for my use case (which is a generic traditional analytics/data mart/data slicing and visualization system). Sadly, even though the issue has one of the highest ratings thanks to all your +1s :-)) after all these months it has not even got a planned version assigned to it...
I personally can't get off the now deprecated facets without it.

@jpountz
Contributor

jpountz commented Apr 21, 2015

> Sadly, even though the issue has one of the highest ratings thanks to all your +1s :-)) after all these months it has not even got a planned version assigned to it...

The reason why this feature has not been implemented entirely is that it is very challenging. Your complaint feels like we are leaving users in a dead end but I don't think it's true. As a follow-up of this issue we added the ability to get the document count for other buckets (#8213) and for documents that miss a value for the field, it is still possible to use the missing aggregation. I agree it might not be as user-friendly as getting a missing bucket but on the other hand there are very valid reasons to not have the same sub aggregations for documents that miss a value and we do not want to clutter the API with options for every possible use-case.

In addition, we are currently exploring new ways to work on top of aggregations in #9876. It is not clear yet whether it will help on this issue but it will at least open new doors.

@roytmana
Author

@jpountz I appreciate the update, and I realize that it is a complex matter, but it is very much needed for this rather common use case (whether for developers using ES or for ELK users).

I am not really complaining. I am just making an observation that it has been a year since I submitted the issue, it has seen fairly high interest from the user community, and yet it has not received a target version number, so it is quite possible that it will not get implemented (for what could be very valid technical or strategic reasons). If that's so, I personally would like to know it so I can concentrate on finding a solution. As you said, missing + total aggregations together with some transformation logic would allow calculating _other and _missing, but for complex scenarios of UI-driven dynamic multilevel aggregations, such queries quickly become really huge and complicated, and a lot more expensive than even an un-optimized built-in solution.

So if the consensus of the ES Team is that implementing _other and _missing buckets is not feasible, I would like to know it and start working on a solution that expands a concise "logical" query (defining which aggs should take _missing and/or _other into account) into a large "physical" ES one, and then transforms the results to calculate and inject _other and _missing buckets on all levels of the result tree transparently.

Thank you
Alex

@kcberg

kcberg commented Apr 22, 2015

Really need this too and it's the biggest barrier in moving to kibana 4 IMO. +1

@u238

u238 commented Apr 27, 2015

I think that without that functionality the "top N" feature is really not that useful.

+1

@aelagon

aelagon commented May 6, 2015

+1

jpountz added a commit to jpountz/elasticsearch that referenced this issue May 7, 2015
Most aggregations (terms, histogram, stats, percentiles, geohash-grid) now
support a new `missing` option which defines the value to consider when a
field does not have a value. This can be handy if you eg. want a terms
aggregation to handle the same way documents that have "N/A" or no value
for a `tag` field.

This works in a very similar way to the `missing` option on the `sort`
element.

One known issue is that this option sometimes cannot make the right decision
in the unmapped case: it needs to replace all values with the `missing` value
but might not know what kind of values source should be produced (numerics,
strings, geo points?). For this reason, we might want to add an `unmapped_type`
option in the future like we did for sorting.

Related to elastic#5324
@kyle-stachowiak-relativity

+1

jpountz added a commit to jpountz/elasticsearch that referenced this issue May 15, 2015
Related to elastic#5324
@jpountz
Contributor

jpountz commented May 21, 2015

Update: Now that most aggregations support a missing option (#11042) and that the terms aggregation returns counts for the other bucket (#8213), I think the only remaining question is whether we should add an other bucket.
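For comparison with the original workaround, here is a sketch of how the same two-level rollup might be requested with the `missing` option, written as a Python dict so it can be generated per level. The placeholder keys (`"N/A"` for `division`, `-1` for `fy`) are assumptions; they must be values that do not occur in the data and that fit each field's mapping. The helper name `aggs_with_missing` is hypothetical.

```python
import json


def aggs_with_missing(levels, metric_field):
    """Build nested terms aggregations that bucket missing values in place.

    levels: list of (field, placeholder) pairs, outermost first. The
    placeholder is the key under which documents without a value are counted.
    """
    aggs = {"total": {"sum": {"field": metric_field}}}
    if levels:
        (field, placeholder), rest = levels[0], levels[1:]
        aggs["group"] = {
            "terms": {"field": field, "missing": placeholder},
            "aggs": aggs_with_missing(rest, metric_field),
        }
    return aggs


# Placeholder keys are assumptions, not values taken from the issue.
query = {
    "size": 0,
    "aggs": aggs_with_missing(
        [("division", "N/A"), ("fy", -1)],
        "money.totals.obligationTotal",
    ),
}

print(json.dumps(query, indent=2))
```

Each level adds one terms aggregation and one sum, so the request grows linearly with depth instead of doubling.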

While I'm pretty happy with the way we deal with missing, I'm concerned about adding an other bucket. One idea could be to make it optional, but I don't like adding options that feel unnecessary. Besides I have no idea whether the default should be true or false. On the other hand if we enforce it to be computed, I'm afraid of the multi-valued case because either we try to make it behave like a regular bucket and count documents only once which would be costly, or we just merge other buckets naively and then the multi-valued case would be wrong/confusing because some documents could be counted several times. This is the reason why the other count is currently labeled as sum_other_doc_count to make clear it is the sum of the doc counts of other buckets and not the document count that an other bucket would have.
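A sketch of how a client could read these numbers back, assuming a terms aggregation result that carries `buckets` and `sum_other_doc_count`. The counts reuse the illustrative numbers from the response format suggested earlier in this thread, and the helper name is hypothetical.

```python
def covered_doc_count(terms_result):
    """Doc counts of the returned buckets plus sum_other_doc_count.

    For a single-valued field this is the number of documents that fell into
    some bucket. For a multi-valued field it can exceed that number, because
    one document may be counted in several buckets.
    """
    visible = sum(bucket["doc_count"] for bucket in terms_result["buckets"])
    return visible + terms_result["sum_other_doc_count"]


# Illustrative result only (not real data).
sample = {
    "buckets": [
        {"key": "red", "doc_count": 5},
        {"key": "green", "doc_count": 3},
    ],
    "sum_other_doc_count": 3,
}

print("counted in some bucket:", covered_doc_count(sample))
```

For the sample above the total is 11. Comparing it with the total hit count is one way to notice that a field is multi-valued or that some documents have no value.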

@Kallin

Kallin commented May 21, 2015

My main reason for +1 this issue was that I wanted an other bucket for filter aggs. I found it inconvenient to have to duplicate the entire filter agg (which can be very complex) within a parallel filter agg that has an enclosing 'not' filter to get the remainder bucket. Afaik that's still not taken care of, so no simple way to get the remainder/other bucket for a filter agg.

@jpountz
Contributor

jpountz commented May 21, 2015

Indeed filters is a different story, and I think we could implement it easily. I opened #11289 to track it separately.

@roytmana
Author

I would say "other" is very useful as well, particularly when people implement data-mart-like systems with elasticsearch. They tend to present a complete dataset rollup to the user, where "other" is needed to have the complete dataset representation. Also, traditional data marts with star-like data models avoid multivalued relations, modeling them as multiple fact "tables" instead, so the issue with counting "other" multiple times would not be especially relevant to these use cases. And when people facet on a multivalued field, they need to understand the multiplicity issue, just as they need to understand that the sum of the counts of its buckets is the number of values, not the number of docs.

I would like "other" to be a fully supported bucket, enabled only if the user explicitly asks for it by providing a key value for it. I think the single-value case, where it works as intuitively expected, is so useful in itself as to justify this feature; the multivalued case would need to be understood and dealt with by each elastic user.

Just my 2c

@anshaw

anshaw commented May 27, 2015

+1

@elwerene

+1

@dev-shubh

+1

@ymost

ymost commented Jul 23, 2015

+1

@jpountz
Contributor

jpountz commented Jul 23, 2015

This is actually now implemented via #11042 and will be available in elasticsearch 2.0.

@jpountz jpountz closed this as completed Jul 23, 2015
@jpountz
Contributor

jpountz commented Jul 23, 2015

Hmm, just remembered that in spite of the title, this issue was not only about missing but also about an other bucket. I consider the missing bucket issue solved, and the other bucket half-solved given that the filters aggregation now supports an other bucket and that the terms aggregation can return the sum of the doc counts for "other" buckets. If you would like to keep the discussion going about the other bucket, please open a new issue.
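A sketch of the filters aggregation with an other bucket, written as a Python dict. It assumes the `other_bucket_key` parameter of the filters aggregation; the aggregation name, the filter names, the `level` field, and the helper `expected_bucket_keys` are hypothetical.

```python
import json

# Named filters plus a catch-all bucket for documents matching none of them.
filters_body = {
    "other_bucket_key": "other_messages",
    "filters": {
        "errors": {"term": {"level": "error"}},
        "warnings": {"term": {"level": "warning"}},
    },
}

request = {"size": 0, "aggs": {"messages": {"filters": filters_body}}}


def expected_bucket_keys(body):
    """Bucket keys a client should expect back: one per named filter,
    plus the other bucket when a key for it was requested."""
    keys = list(body["filters"])
    if "other_bucket_key" in body:
        keys.append(body["other_bucket_key"])
    return keys


print(json.dumps(request, indent=2))
print(expected_bucket_keys(filters_body))
```

This avoids duplicating the whole filter set inside a negated filter just to get the remainder.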

@funbrain

funbrain commented Mar 26, 2018

Hi there,

Let me ask one thing. My scenario is this:

Let's say I send 3 requests to system A and expect 3 responses from it.
But in reality I got only 2 responses.
One response was dropped.
Probably the user closed the browser while system A was still processing...

eg raw msg:

2017-12-19 09:25:50,207 (null) 27 Info : ASystem : TESTPAGE :: elkko5zhjzyigt4it1wjmxpw :: Request : 9.0764764000000881
2017-12-19 09:25:50,207 (null) 27 Info : ASystem : TESTPAGE :: elkko5zhjzyigt4it1wjmxpw :: Response: 9.0764764000000881

2017-12-19 09:26:50,207 (null) 27 Info : ASystem : TESTPAGE :: elkko5zhjzyigt4it1wjmxpw :: Request : 9.0764764000000882

2017-12-19 09:27:50,207 (null) 27 Info : ASystem : TESTPAGE :: elkko5zhjzyigt4it1wjmxpw :: Request : 9.0764764000000883
2017-12-19 09:27:50,207 (null) 27 Info : ASystem : TESTPAGE :: elkko5zhjzyigt4it1wjmxpw :: Response: 9.0764764000000883

The system A logs above are streamed to logstash.
Logstash then extracts some fields using grok patterns.

eg fields:

log_timestamp: 2017-12-19 09:27:50,207
log_level: Info
log_type: REQUEST *if log msg have Request then log_type: REQUEST
log_type: RESPONSE *if log msg have Response then log_type: RESPONSE
ref_id: 764764000000883
version: 9.0

In the Request and Response logs above, there is no response log for the second request.
My question is: how can I query elasticsearch for the count of missing response logs (dropouts)?

Is there anybody who can help me out?
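One way to sanity-check the expected answer outside elasticsearch is to pair requests and responses on the client side. This is only a sketch over the raw lines quoted above, not an elasticsearch query; the regular expression and the helper name `dropped_requests` are assumptions about the log format, and the whole trailing token is used as the correlation id.

```python
import re

# Matches ":: Request : <id>" and ":: Response: <id>" at the end of a line.
LINE = re.compile(r":: (Request|Response)\s*: (\S+)$")


def dropped_requests(lines):
    """Return the ids that have a Request line but no Response line."""
    requests, responses = set(), set()
    for line in lines:
        match = LINE.search(line.strip())
        if not match:
            continue  # not a request/response line
        kind, ref = match.groups()
        (requests if kind == "Request" else responses).add(ref)
    return sorted(requests - responses)


raw = """
2017-12-19 09:25:50,207 (null) 27 Info : ASystem : TESTPAGE :: elkko5zhjzyigt4it1wjmxpw :: Request : 9.0764764000000881
2017-12-19 09:25:50,207 (null) 27 Info : ASystem : TESTPAGE :: elkko5zhjzyigt4it1wjmxpw :: Response: 9.0764764000000881
2017-12-19 09:26:50,207 (null) 27 Info : ASystem : TESTPAGE :: elkko5zhjzyigt4it1wjmxpw :: Request : 9.0764764000000882
2017-12-19 09:27:50,207 (null) 27 Info : ASystem : TESTPAGE :: elkko5zhjzyigt4it1wjmxpw :: Request : 9.0764764000000883
2017-12-19 09:27:50,207 (null) 27 Info : ASystem : TESTPAGE :: elkko5zhjzyigt4it1wjmxpw :: Response: 9.0764764000000883
""".splitlines()

missing = dropped_requests(raw)
print(len(missing), "dropped:", missing)
```

For the five sample lines this reports one dropped request, the one logged at 09:26:50.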
