Make it possible to configure missing values. #11042

Merged 1 commit on May 15, 2015

jpountz
Contributor

@jpountz jpountz commented May 7, 2015

Most aggregations (terms, histogram, stats, percentiles, geohash-grid) now
support a new missing option which defines the value to use when a
field does not have a value. This can be handy if, e.g., you want a terms
aggregation to treat documents that have "N/A" and documents that have no
value for a tag field the same way.

This works in a very similar way to the missing option on the sort
element.
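
As a minimal sketch of the option described above, the following builds a search body for a terms aggregation that treats documents without a `tag` value as if they had "N/A". The aggregation name `tags` is an illustrative assumption; only the `missing` key is what this pull request introduces.

```python
import json

# Hypothetical names: the aggregation name "tags" is illustrative only.
request_body = {
    "size": 0,  # only aggregation results are wanted, no hits
    "aggs": {
        "tags": {
            "terms": {
                "field": "tag",
                # Documents with no `tag` value are counted as if they had "N/A".
                "missing": "N/A",
            }
        }
    },
}

# Serialize to the JSON that would be sent as the search request body.
print(json.dumps(request_body, indent=2))
```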

One known issue is that this option sometimes cannot make the right decision
in the unmapped case: it needs to replace all values with the missing value
but might not know what kind of values source should be produced (numerics,
strings, geo points?). For this reason, we might want to add an unmapped_type
option in the future like we did for sorting.

Related to #5324

@jpountz
Contributor Author

jpountz commented May 7, 2015

While the API proposal here is different from the one proposed on #5324, I think it could address most use-cases and even be more generic. For instance, in some cases you might want to have a dedicated bucket for documents that are missing a value, and all that you would have to do is pass a value which doesn't exist in the index (e.g. _missing, but the choice is free). In other cases, however, it might make sense to put documents that are missing a value into an existing bucket. I think a good example of that would be the N/A value for a tag field: documents that don't have a value for the tag field and documents that have this value should really be treated the same.
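
The two use-cases above can be illustrated with a toy model. This is a sketch in plain Python, not Elasticsearch code: `terms_buckets` is a hypothetical helper that assumes single-valued fields and simply counts documents per value, substituting `missing` when a document has no value.

```python
from collections import Counter

def terms_buckets(docs, field, missing=None):
    """Toy model of a terms aggregation: count documents per value of `field`."""
    counts = Counter()
    for doc in docs:
        value = doc.get(field)
        if value is None:
            if missing is None:
                continue  # no `missing` configured: the document is ignored
            value = missing  # otherwise count it under the configured value
        counts[value] += 1
    return dict(counts)

docs = [{"tag": "a"}, {"tag": "N/A"}, {}, {"tag": "a"}, {}]

# A value that does not exist in the data gives a dedicated bucket.
dedicated = terms_buckets(docs, "tag", missing="_missing")
# An existing value merges the documents into that bucket.
merged = terms_buckets(docs, "tag", missing="N/A")

print(dedicated)
print(merged)
```

With `missing="_missing"` the two empty documents get their own bucket, while with `missing="N/A"` they are added to the existing "N/A" bucket.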

Also I like that we would have a consistent behaviour in all aggregations that support this parameter (i.e. all aggregations that work on top of a field or script, except the missing aggregation), which would be consistent with sorting as well.


==== Missing value

The `missing` parameter defines how documents that miss a value should be treated.

"that are missing a value"

@clintongormley

Nice work!

@jpountz
Contributor Author

jpountz commented May 11, 2015

Thanks @clintongormley for helping fix the docs, I pushed a new commit.

@@ -123,3 +123,26 @@ settings and filter the returned buckets based on a `min_doc_count` setting (by
bucket that matches documents and the last one are returned). This histogram also supports the `extended_bounds`
setting, which enables extending the bounds of the histogram beyond the data itself (to read more on why you'd want to
do that please refer to the explanation <<search-aggregations-bucket-histogram-aggregation-extended-bounds,here>>).

==== Missing value
Contributor

would this section not fit better in the general aggregations section since it affects (almost) every aggregation and is the same syntax for them all?

Contributor Author

I made it similar to other features like script support. While this duplicates the documentation effort, it also has the benefit of showing an example in context (also note that examples try to be meaningful to the aggregation whenever possible)

Contributor

Ok, that makes sense

@colings86
Contributor

@jpountz left a couple of minor comments

@colings86
Contributor

LGTM

jpountz added a commit that referenced this pull request May 15, 2015
Aggs: Make it possible to configure missing values.
@jpountz jpountz merged commit bf599d6 into elastic:master May 15, 2015
@jpountz jpountz deleted the feature/aggs_missing branch May 15, 2015 14:33
@kevinkluge kevinkluge removed the review label May 15, 2015
@clintongormley clintongormley changed the title from "Aggs: Make it possible to configure missing values." to "Make it possible to configure missing values." Jun 6, 2015
@mrfelton

Can't get this working for the life of me. Is this in 1.6? I can't find any documentation on this feature at https://www.elastic.co/

Parse Failure [Unknown key for a VALUE_STRING in [campaign_term]: [missing].]]

{
  "size": 0,
  "query": {
    "filtered": {
      "query": {
        "query_string": {
          "query": "_type:Subscription",
          "analyze_wildcard": true
        }
      }
    }
  },
  "aggs": {
    "2": {
      "date_histogram": {
        "field": "date",
        "interval": "1M",
        "pre_zone_adjust_large_interval": true,
        "min_doc_count": 1
      },
      "aggs": {
        "campaign_term": {
          "terms": {
            "field": "context.campaign.term",
            "size": 0,
            "missing": "hr-openers"
          }
        }
      }
    }
  }
}

@colings86
Contributor

@mrfelton this will be available from 2.0 onwards. The documentation for it is available on the master branch of the docs. There is a new section for each agg called 'Missing Values'. For example: https://www.elastic.co/guide/en/elasticsearch/reference/master/search-aggregations-metrics-avg-aggregation.html#_missing_value

@GrahamHannington

In the meantime, before 2.0, and with apologies if this has already been covered: can you specify a script in the Kibana "JSON input" field that dynamically replaces a missing field value with zero? (And can someone point me to detailed documentation of what can be specified in that field? My Google-fu has failed there, too.)

@GrahamHannington

Suppose I have an Elasticsearch document with no "grade" field; the "grade" field is missing.

Suppose I have another document with a "grade" field explicitly specified as null:

"grade": null

Will the new-for-2.0 missing option apply to both documents?

@jpountz
Contributor Author

jpountz commented Aug 25, 2015

Missing and null will be considered the same by default, unless you configure a null_value in your mappings.
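
To make the distinction concrete, here is a toy model in plain Python (not Elasticsearch code). The helper `indexed_value` is hypothetical and assumes a single-valued field: an absent field never has a value, while an explicit null is replaced by `null_value` only if one is configured in the mappings.

```python
def indexed_value(doc, field, null_value=None):
    """Toy model: the value an aggregation would see for `field` in `doc`."""
    if field not in doc:
        return None        # absent field: no value, so `missing` applies
    if doc[field] is None:
        return null_value  # explicit null: replaced if a null_value is mapped
    return doc[field]

absent = {}
explicit_null = {"grade": None}

# Without a null_value, both documents have no value and look the same.
print(indexed_value(absent, "grade"), indexed_value(explicit_null, "grade"))

# With a null_value of -1, only the explicit null gets a value.
print(indexed_value(absent, "grade", null_value=-1),
      indexed_value(explicit_null, "grade", null_value=-1))
```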

Regarding scripting, you can indeed do that in 1.x by running the aggregation on a script (likely with a bit of a performance/memory usage hit) that would check whether the list of values is empty.

@GrahamHannington

Thanks, @jpountz . Re:

you can indeed [replace a null/missing field value with zero] in 1.x by running the aggregation of a script

Could you please either spoonfeed me (cringe, sorry) the appropriate contents of the Kibana JSON Input field, or point me to detailed documentation for specifying the contents of that field? I can write "if x is null, then set x to 0" in a few programming languages, but I lack the experience and detailed documentation I need to do that in this context (such as the surrounding JSON, the specific syntax and variable names).

@clintongormley

@GrahamHannington

Not sure about the Kibana side, but here's an example (with groovy dynamic scripting) which will replace missing values with -1:

DELETE t

POST t/t/
{
  "num": 1
}
POST t/t/
{
  "num": 2
}
POST t/t/
{

}

GET t/_search?size=0
{
  "aggs": {
    "nums": {
      "histogram": {
        "interval": 1,
        "script": "doc['num'][0] == null ? -1 : doc['num'].value"
      }
    }
  }
}

You could use the expression language instead, but be aware that it doesn't support nulls, so you can't distinguish null from 0. If zeroes aren't important you can do:

GET t/_search?size=0
{
  "aggs": {
    "nums": {
      "histogram": {
        "interval": 1,
        "script": "doc['num'].value || -1",
        "lang": "expression"
      }
    }
  }
}

@GrahamHannington

Thanks again, @jpountz .

I need the average (avg) aggregation to include documents with null or missing field values in its count, and treat those null or missing field values as zero. Otherwise, I get (what I consider to be) skewed averages.

For example, suppose I have the following five Elasticsearch documents, where Tn is a timestamp value, and grade is the name of a field on which I want to perform an average calculation:

Timestamp grade
T1 null or missing
T2 10
T3 null or missing
T4 10
T5 null or missing

Currently, when I use an average aggregation in a visualization, a bucket that includes T1 - T5 shows the average grade as 10:

(10 + 10) / 2 = 10

(that is, it skips the documents with null or missing grade)

whereas I want it to show 4 (to include the documents with null or missing grade, and treat grade as zero):

(0 + 10 + 0 + 10 + 0) / 5 = 4
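
The two calculations can be reproduced with a small sketch in plain Python. The function `average` is a hypothetical helper, not an Elasticsearch API; `None` stands for a null or missing grade.

```python
def average(values, missing=None):
    """Average per-document values; None marks a null or missing field."""
    if missing is not None:
        # Substitute the configured value for documents without a grade.
        values = [missing if v is None else v for v in values]
    else:
        # Default behaviour: documents without a grade are skipped.
        values = [v for v in values if v is not None]
    return sum(values) / len(values) if values else None

grades = [None, 10, None, 10, None]  # T1..T5

print(average(grades))             # 10.0
print(average(grades, missing=0))  # 4.0
```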

However, I have so far been unable to trap null field values via the Kibana JSON Input field.

I suspect (I could be wrong) that what is happening is that Kibana (more specifically, Elasticsearch; but I'm doing all of this through the Kibana user interface) skips the documents with null or missing field values, and so those documents never "reach" the JSON Input field value.

I can use the following JSON Input field value to override the values of fields that are present (say, replace 10 with 20):

{ "script": "10 ? 20 : _value" }

but the following has no effect:

{ "script": "null ? 20 : _value" }

Similarly, neither does this, possibly unfaithfully transcribed from your suggestion (much appreciated, thank you):

{ "script" : "doc['a'][0] == null ? 0 : doc['a'].value" }

I'd appreciate some more advice here. I'd like to have a workaround (before 2.0 arrives) for these skewed averages that doesn't involve re-loading the (currently, deliberately "sparse") data with explicit zero field values, even if that workaround involves a performance hit on large data sets (as I imagine this script-based approach would; so far, I've only tested it on very small indices).

@jpountz
Contributor Author

jpountz commented Sep 2, 2015

Unfortunately, this can't be done today because Kibana requires you to configure a field and then merges the agg definition with the value in the JSON Input, which makes Elasticsearch run the script on every value instead of every document.
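
The difference between the two modes can be sketched with a toy model in plain Python. This is not Kibana or Elasticsearch code; both runner functions are hypothetical and assume single-valued fields. A script that runs once per value is never called for documents that have no value, so it has no chance to substitute a default.

```python
def run_value_script(docs, field, script):
    """Call `script` once per existing value (field plus script)."""
    results = []
    for doc in docs:
        value = doc.get(field)
        if value is not None:  # documents without a value are skipped
            results.append(script(value))
    return results

def run_document_script(docs, script):
    """Call `script` once per document (script only, no field)."""
    return [script(doc) for doc in docs]

docs = [{}, {"grade": 10}, {"grade": None}, {"grade": 10}, {}]

# The null check inside the value script never fires.
per_value = run_value_script(docs, "grade", lambda v: 0 if v is None else v)
# A per-document script sees every document and can substitute zero.
per_doc = run_document_script(
    docs, lambda d: 0 if d.get("grade") is None else d["grade"]
)

print(per_value)
print(per_doc)
```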

@lmath

lmath commented Jun 1, 2021

We came across this feature of configuring missing values while looking at the Terms Aggregation docs and were excited to use it with rollup search, but it doesn't seem like this feature is available yet for rollup search. @polyfractal we were wondering if you might know whether configuring missing values is available for rollup search, or if there is some other way to search for missing values?
