Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Sum aggregation not ignoring missing values #71582

Open
mbarretta opened this issue Apr 12, 2021 · 11 comments
Open

Sum aggregation not ignoring missing values #71582

mbarretta opened this issue Apr 12, 2021 · 11 comments
Labels
:Analytics/Aggregations Aggregations >bug Team:Analytics Meta label for analytical engine team (ESQL/Aggs/Geo)

Comments

@mbarretta
Copy link
Contributor

mbarretta commented Apr 12, 2021

Elasticsearch version (bin/elasticsearch --version):
7.12.0

OS version (uname -a if on a Unix-like system):
OSX

Description of the problem including expected versus actual behavior:
The sum aggregation by default should ignore documents that are missing values for the field being summed. Instead, it looks like it defaults to a 0 value. The min and max aggregations have the same default behavior, but correctly ignore missing fields and return null

Steps to reproduce:

#create index w/ doc
POST sum_test/_doc?refresh
{
  "foo": "bar"
}

#run min and sum aggs on non-existent "baz" field
GET sum_test/_search?size=0
{
  "aggs": {
    "min": {
      "min": {
        "field": "baz"
      }
    },
    "sum": {
      "sum": {
        "field": "baz"
      }
    }
  }
}

Results showing correct handling of the missing baz field by min (i.e. a null), but incorrect handling by sum (0)

...
  "aggregations" : {
    "min" : {
      "value" : null
    },
    "sum" : {
      "value" : 0.0
    }
  }
...

Provide logs (if relevant):

@mbarretta mbarretta added >bug needs:triage Requires assignment of a team area label labels Apr 12, 2021
@dnhatn dnhatn added the :Analytics/Aggregations Aggregations label Apr 20, 2021
@elasticmachine elasticmachine added the Team:Analytics Meta label for analytical engine team (ESQL/Aggs/Geo) label Apr 20, 2021
@elasticmachine
Copy link
Collaborator

Pinging @elastic/es-analytics-geo (Team:Analytics)

@dnhatn dnhatn removed the needs:triage Requires assignment of a team area label label Apr 20, 2021
@nik9000
Copy link
Member

nik9000 commented Apr 20, 2021

I'm not sure how longs its been like this. I can look into it. If we decide this is the right behavior we should at least document it. Folks don't always expect it.

@nik9000
Copy link
Member

nik9000 commented Apr 20, 2021

Elasticsearch 1.7's sum aggregation returns 0 when aggregating on empty fields. I think we probably shouldn't change it now. But we should make sure its documented.

nik9000 added a commit to nik9000/elasticsearch that referenced this issue Apr 20, 2021
It'd be super reasonable if the `sum` aggregation returned `null` when
run on an unmapped field. Or if the query filtered out all results. But
it doesn't. It returns `0`. Which is also reasonable! Its just different
from what other reasonable systems like Postgresql do.

This adds a note to with an anchor we can link to. Folks ask about this
a fair bit.

Closes elastic#71582
@kspurrier
Copy link

With that being the case, is there any way to have a Kibana line or bar type visualization aggregate data without plotting a point at 0 when all of the values being aggregated over the bucket are null?

@mbarretta
Copy link
Contributor Author

@nik9000 I agree that it doesn't make sense to change it now if it's been this way for so long. That said, per @kspurrier, can we have a setting (or value for the existing missing) setting) that gives null vs 0 for missing values?

@kspurrier
Copy link

kspurrier commented Apr 20, 2021

From the PR associated, Postgres is referenced which is a darn good example. Postgres can be reasonable in getting away with something like this because one can easily just use COALESCE. If there's an equivalent workaround (in the other direction) with ES / Kibana, that'd be awesome in the docs.

Example:
create table es_is_not_a_db (id integer, name varchar(100), rank varchar(100), failures integer);
insert into es_is_not_a_db (id, name, rank) values (1, 'Joan', 'Boss');
insert into es_is_not_a_db (id, name, rank) values (1, 'Joe', 'Worker Bee');
insert into es_is_not_a_db (id, name, rank) values (1, 'Bob', 'Worker Bee');
select COUNT(failures), SUM(failures), COALESCE(sum(failures), 0) from es_is_not_a_db;

Returns:

count sum coalesce
0   0

@nik9000
Copy link
Member

nik9000 commented Apr 20, 2021

@nik9000 I agree that it doesn't make sense to change it now if it's been this way for so long. That said, per @kspurrier, can we have a setting (or value for the existing missing) setting) that gives null vs 0 for missing values?

I recall that being trickier to do than it sounded the last time folks talked about it. Could you link to an issue this is trying to solve?

If there's an equivalent workaround (in the other direction) with ES / Kibana, that'd be awesome in the docs.

There are kind of three cases where we do a different thing than postgresql with sum. Probably more, but three I can think of. But here goes:

  1. If the field is unmapped we return 0. PostgresQL returns an error. I expect kibana won't want to make these sorts of requests because it won't see the field in the field_caps API.
  2. If no documents match the request we'll give 0 and PostgresQL will return null. You'd have to check the doc_count in the bucket the sum is inside - if its 0 you can do null style stuff.
  3. You only ever sum null or an empty arrays. You can use "missing": "NaN" to get NaN. It ain't great, but its a thing.

@nik9000
Copy link
Member

nik9000 commented Apr 20, 2021

You only ever sum null or an empty arrays. You can use "missing": "NaN" to get NaN. It ain't great, but its a thing.

Wait, that isn't right. That'll get you NaN when any documents are missing the value.

@nik9000
Copy link
Member

nik9000 commented Apr 20, 2021

Wait, that isn't right. That'll get you NaN when any documents are missing the value.

You'd need to run another aggregation to count docs that have missing values. It'd pretty terrible frankly. It looks like:

curl -uelastic:password -HContent-Type:application/json -XDELETE localhost:9200/test1
curl -uelastic:password -HContent-Type:application/json -XDELETE localhost:9200/test2
curl -uelastic:password -HContent-Type:application/json -XPUT localhost:9200/test1/_doc/1?refresh -d'{"a": "a", "some_missing": 1.0}'
curl -uelastic:password -HContent-Type:application/json -XPUT localhost:9200/test2/_doc/2?refresh -d'{"a": "a"}'
curl -uelastic:password -HContent-type:application/json -XPOST localhost:9200/test*/_search?pretty -d'{
  "aggs": {
    "a": {
      "terms": {
        "field": "a.keyword"
      },
      "aggs": {
        "min": {
          "min": {
            "field": "some_missing"
          }
        },
        "missing": {
          "missing": {
            "field": "some_missing"
          }
        },
        "sum": {
          "sum": {
            "field": "some_missing"
          }
        }
      }
    }
  }
}'

I suspect folks don't typically encounter this because they treat our fields like they are non-nullable and single valued. Do you bump into this sort of thing frequently? documents with uneven fields that encode meaning in the absence of that field?

@kspurrier
Copy link

Thank you. That explanation definitely helps me, and I think it may give me a way around some of the issues that may more specifically just be limitations around how we're using Kibana out of the box for more general visualizations of data that aren't the classic use case.

For us lately, the absence of data tends to be real and we don't want to see it plotted in Kibana visualizations when doing things like applying filters with Sum aggregations.

I've recently been working with visualizing some agile metrics from tasking data where we want to see things like burn down with projections where we ideally want the line to start beyond x-axis item 0+ where we have data. If I handle all aggregations prior to Kibana, it isn't too tough to make the visualization look nice, but as soon as users ask about multiple filters using sums, I don't currently have a good answer for them and they dislike the line starting from 0 and ramping up to the true starting point every time. Something like the Jira visualization here isn't a bad example

@nik9000
Copy link
Member

nik9000 commented Apr 21, 2021

It seems like there's some interest in changing this to null like you asked. Check out the PR. Its a common case and just coalescing nulls to 0 if you want 0 is a good solution. If we'd realized it when we first made it we'd have returned null. We just disagree some on whether its worth building a whole migration path. If we had all the time in the world I'd be for it. But we'll talk more and see where we land.

Luegg pushed a commit to Luegg/elasticsearch that referenced this issue Jun 22, 2021
Since the SQL `SUM` function behaves as expected, elastic#45251 can be closed.
As soon as elastic#71582 is resolved, we can go back to using the
`sum` aggregation instead of `stats`.
Luegg pushed a commit that referenced this issue Jun 23, 2021
Since the SQL `SUM` function behaves as expected, #45251 can be closed.
As soon as #71582 is resolved, we can go back to using the
`sum` aggregation instead of `stats`.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
:Analytics/Aggregations Aggregations >bug Team:Analytics Meta label for analytical engine team (ESQL/Aggs/Geo)
Projects
None yet
Development

Successfully merging a pull request may close this issue.

5 participants