Sum aggregation not ignoring missing values #71582

mbarretta · 2021-04-12T18:58:36Z

Elasticsearch version (bin/elasticsearch --version):
7.12.0

OS version (uname -a if on a Unix-like system):
OSX

Description of the problem including expected versus actual behavior:
The sum aggregation by default should ignore documents that are missing values for the field being summed. Instead, it looks like it defaults to a 0 value. The min and max aggregations have the same default behavior, but correctly ignore missing fields and return null

Steps to reproduce:

#create index w/ doc
POST sum_test/_doc?refresh
{
  "foo": "bar"
}

#run min and sum aggs on non-existent "baz" field
GET sum_test/_search?size=0
{
  "aggs": {
    "min": {
      "min": {
        "field": "baz"
      }
    },
    "sum": {
      "sum": {
        "field": "baz"
      }
    }
  }
}

Results showing correct handling of the missing baz field by min (i.e. a null), but incorrect handling by sum (0)

...
  "aggregations" : {
    "min" : {
      "value" : null
    },
    "sum" : {
      "value" : 0.0
    }
  }
...

Provide logs (if relevant):

The text was updated successfully, but these errors were encountered:

elasticmachine · 2021-04-20T15:14:58Z

Pinging @elastic/es-analytics-geo (Team:Analytics)

nik9000 · 2021-04-20T15:57:16Z

I'm not sure how longs its been like this. I can look into it. If we decide this is the right behavior we should at least document it. Folks don't always expect it.

nik9000 · 2021-04-20T19:06:12Z

Elasticsearch 1.7's sum aggregation returns 0 when aggregating on empty fields. I think we probably shouldn't change it now. But we should make sure its documented.

It'd be super reasonable if the `sum` aggregation returned `null` when run on an unmapped field. Or if the query filtered out all results. But it doesn't. It returns `0`. Which is also reasonable! Its just different from what other reasonable systems like Postgresql do. This adds a note to with an anchor we can link to. Folks ask about this a fair bit. Closes elastic#71582

kspurrier · 2021-04-20T19:59:38Z

With that being the case, is there any way to have a Kibana line or bar type visualization aggregate data without plotting a point at 0 when all of the values being aggregated over the bucket are null?

mbarretta · 2021-04-20T20:04:32Z

@nik9000 I agree that it doesn't make sense to change it now if it's been this way for so long. That said, per @kspurrier, can we have a setting (or value for the existing missing) setting) that gives null vs 0 for missing values?

kspurrier · 2021-04-20T20:15:55Z

From the PR associated, Postgres is referenced which is a darn good example. Postgres can be reasonable in getting away with something like this because one can easily just use COALESCE. If there's an equivalent workaround (in the other direction) with ES / Kibana, that'd be awesome in the docs.

Example:
create table es_is_not_a_db (id integer, name varchar(100), rank varchar(100), failures integer);
insert into es_is_not_a_db (id, name, rank) values (1, 'Joan', 'Boss');
insert into es_is_not_a_db (id, name, rank) values (1, 'Joe', 'Worker Bee');
insert into es_is_not_a_db (id, name, rank) values (1, 'Bob', 'Worker Bee');
select COUNT(failures), SUM(failures), COALESCE(sum(failures), 0) from es_is_not_a_db;

Returns:

count	sum	coalesce
0		0

nik9000 · 2021-04-20T21:43:26Z

@nik9000 I agree that it doesn't make sense to change it now if it's been this way for so long. That said, per @kspurrier, can we have a setting (or value for the existing missing) setting) that gives null vs 0 for missing values?

I recall that being trickier to do than it sounded the last time folks talked about it. Could you link to an issue this is trying to solve?

If there's an equivalent workaround (in the other direction) with ES / Kibana, that'd be awesome in the docs.

There are kind of three cases where we do a different thing than postgresql with sum. Probably more, but three I can think of. But here goes:

If the field is unmapped we return 0. PostgresQL returns an error. I expect kibana won't want to make these sorts of requests because it won't see the field in the field_caps API.
If no documents match the request we'll give 0 and PostgresQL will return null. You'd have to check the doc_count in the bucket the sum is inside - if its 0 you can do null style stuff.
You only ever sum null or an empty arrays. You can use "missing": "NaN" to get NaN. It ain't great, but its a thing.

nik9000 · 2021-04-20T21:50:48Z

You only ever sum null or an empty arrays. You can use "missing": "NaN" to get NaN. It ain't great, but its a thing.

Wait, that isn't right. That'll get you NaN when any documents are missing the value.

nik9000 · 2021-04-20T22:02:22Z

Wait, that isn't right. That'll get you NaN when any documents are missing the value.

You'd need to run another aggregation to count docs that have missing values. It'd pretty terrible frankly. It looks like:

curl -uelastic:password -HContent-Type:application/json -XDELETE localhost:9200/test1
curl -uelastic:password -HContent-Type:application/json -XDELETE localhost:9200/test2
curl -uelastic:password -HContent-Type:application/json -XPUT localhost:9200/test1/_doc/1?refresh -d'{"a": "a", "some_missing": 1.0}'
curl -uelastic:password -HContent-Type:application/json -XPUT localhost:9200/test2/_doc/2?refresh -d'{"a": "a"}'
curl -uelastic:password -HContent-type:application/json -XPOST localhost:9200/test*/_search?pretty -d'{
  "aggs": {
    "a": {
      "terms": {
        "field": "a.keyword"
      },
      "aggs": {
        "min": {
          "min": {
            "field": "some_missing"
          }
        },
        "missing": {
          "missing": {
            "field": "some_missing"
          }
        },
        "sum": {
          "sum": {
            "field": "some_missing"
          }
        }
      }
    }
  }
}'

I suspect folks don't typically encounter this because they treat our fields like they are non-nullable and single valued. Do you bump into this sort of thing frequently? documents with uneven fields that encode meaning in the absence of that field?

kspurrier · 2021-04-20T23:16:35Z

Thank you. That explanation definitely helps me, and I think it may give me a way around some of the issues that may more specifically just be limitations around how we're using Kibana out of the box for more general visualizations of data that aren't the classic use case.

For us lately, the absence of data tends to be real and we don't want to see it plotted in Kibana visualizations when doing things like applying filters with Sum aggregations.

I've recently been working with visualizing some agile metrics from tasking data where we want to see things like burn down with projections where we ideally want the line to start beyond x-axis item 0+ where we have data. If I handle all aggregations prior to Kibana, it isn't too tough to make the visualization look nice, but as soon as users ask about multiple filters using sums, I don't currently have a good answer for them and they dislike the line starting from 0 and ramping up to the true starting point every time. Something like the Jira visualization here isn't a bad example

nik9000 · 2021-04-21T12:30:39Z

It seems like there's some interest in changing this to null like you asked. Check out the PR. Its a common case and just coalescing nulls to 0 if you want 0 is a good solution. If we'd realized it when we first made it we'd have returned null. We just disagree some on whether its worth building a whole migration path. If we had all the time in the world I'd be for it. But we'll talk more and see where we land.

Since the SQL `SUM` function behaves as expected, elastic#45251 can be closed. As soon as elastic#71582 is resolved, we can go back to using the `sum` aggregation instead of `stats`.

Since the SQL `SUM` function behaves as expected, #45251 can be closed. As soon as #71582 is resolved, we can go back to using the `sum` aggregation instead of `stats`.

mbarretta added >bug needs:triage Requires assignment of a team area label labels Apr 12, 2021

dnhatn added the :Analytics/Aggregations Aggregations label Apr 20, 2021

elasticmachine added the Team:Analytics Meta label for analytical engine team (ESQL/Aggs/Geo) label Apr 20, 2021

dnhatn removed the needs:triage Requires assignment of a team area label label Apr 20, 2021

nik9000 mentioned this issue Apr 20, 2021

Document that sum(nothing) is 0 #71978

Closed

Luegg mentioned this issue Jun 22, 2021

SQL: SUM of multiple 0 values returns NULL #45251

Closed

Luegg mentioned this issue Jun 22, 2021

Reference relevant issue in ReplaceSumWithStats #74396

Merged

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Sum aggregation not ignoring missing values #71582

Sum aggregation not ignoring missing values #71582

mbarretta commented Apr 12, 2021 •

edited

elasticmachine commented Apr 20, 2021

nik9000 commented Apr 20, 2021

nik9000 commented Apr 20, 2021

kspurrier commented Apr 20, 2021

mbarretta commented Apr 20, 2021

kspurrier commented Apr 20, 2021 •

edited

nik9000 commented Apr 20, 2021

nik9000 commented Apr 20, 2021

nik9000 commented Apr 20, 2021

kspurrier commented Apr 20, 2021

nik9000 commented Apr 21, 2021

Sum aggregation not ignoring missing values #71582

Sum aggregation not ignoring missing values #71582

Comments

mbarretta commented Apr 12, 2021 • edited

elasticmachine commented Apr 20, 2021

nik9000 commented Apr 20, 2021

nik9000 commented Apr 20, 2021

kspurrier commented Apr 20, 2021

mbarretta commented Apr 20, 2021

kspurrier commented Apr 20, 2021 • edited

nik9000 commented Apr 20, 2021

nik9000 commented Apr 20, 2021

nik9000 commented Apr 20, 2021

kspurrier commented Apr 20, 2021

nik9000 commented Apr 21, 2021

mbarretta commented Apr 12, 2021 •

edited

kspurrier commented Apr 20, 2021 •

edited