Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Terms aggregations shows partial results with terms aggregation over invalid field #44909

Closed
greentruff opened this issue Jul 26, 2019 · 5 comments · Fixed by #44963
Closed
Assignees

Comments

@greentruff
Copy link

greentruff commented Jul 26, 2019

Elasticsearch version (bin/elasticsearch --version): 6.8.1
Works as expected in 6.4.3.
Unexpected behavior for versions 6.5.3 to 6.8.1

Plugins installed: Default plugins in docker image provided Elastic

JVM version (java -version): 11.0

OS version (uname -a if on a Unix-like system):

Description of the problem including expected versus actual behavior:
Aggregations with a term referrencing a min aggregation over an invalid field does not return results per bucket. Other aggregations are not affected.

Steps to reproduce:

The following script reproduces the issue on a local default ES instance:

import requests

ES = 'http://localhost:9200'

## Set up
requests.delete(ES + '/parrot')

requests.put(ES + '/parrot')

requests.post(ES + '/parrot/doc', json={ "a": 1, "b": 2 })
requests.post(ES + '/parrot/doc', json={ "a": 1, "b": 2 })
requests.post(ES + '/parrot/doc', json={ "a": 2, "b": 3 })

requests.post(ES + '/parrot/_refresh')

resp = requests.get(ES + '/parrot/_search', json={
    "size":0,
    "timeout":"1m",
    "query":{
        "constant_score":{
            "filter":{
                "bool":{
                    "filter":[{"exists":{"field":"a","boost":1.0}}],
                    "adjust_pure_negative":True,"boost":1.0
                }
            },"boost":1.0
        }
    },
    "aggregations":{
        "v__max:max":{"max":{"field":"invalid"}},
        "a":{"terms":{"field":"a","show_term_doc_count_error":False,"order":[{"v__max:max":"asc"},{"_key":"asc"}]},
        "aggregations":{"v__max:max":{"max":{"field":"invalid"}}}}
    }
})
print(resp.json())

resp = requests.get(ES + '/parrot/_search', json={
    "size":0,
    "timeout":"1m",
    "query":{
        "constant_score":{
            "filter":{
                "bool":{
                    "filter":[{"exists":{"field":"a","boost":1.0}}],
                    "adjust_pure_negative":True,"boost":1.0
                }
            },"boost":1.0
        }
    },
    "aggregations":{
        "v__min:min":{"min":{"field":"invalid"}},
        "a":{"terms":{"field":"a","show_term_doc_count_error":False,"order":[{"v__min:min":"asc"},{"_key":"asc"}]},
        "aggregations":{"v__min:min":{"min":{"field":"invalid"}}}}
    }
})
print(resp.json())

Actual behavior:

{
  'took':369,
  'timed_out':False,
  '_shards':{
    'total':5,
    'successful':5,
    'skipped':0,
    'failed':0
  },
  'hits':{
    'total':3,
    'max_score':0.0,
    'hits':[

    ]
  },
  'aggregations':{
    'a':{
      'doc_count_error_upper_bound':0,
      'sum_other_doc_count':0,
      'buckets':[
        {
          'key':1,
          'doc_count':2,
          'v__max:max':{
            'value':None
          }
        },
        {
          'key':2,
          'doc_count':1,
          'v__max:max':{
            'value':None
          }
        }
      ]
    },
    'v__max:max':{
      'value':None
    }
  }
}
{
  'took':22,
  'timed_out':False,
  '_shards':{
    'total':5,
    'successful':5,
    'skipped':0,
    'failed':0
  },
  'hits':{
    'total':3,
    'max_score':0.0,
    'hits':[

    ]
  },
  'aggregations':{
    'a':{
      'doc_count_error_upper_bound':0,
      'sum_other_doc_count':0,
      'buckets':[

      ]
    },
    'v__min:min':{
      'value':None
    }
  }
}

The query with min returns no buckets.

Expected behavior:

{'took': 256, 'timed_out': False, '_shards': {'total': 5, 'successful': 5, 'skipped': 0, 'failed': 0}, 'hits': {'total': 3, 'max_score': 0.0, 'hits': []}, 'aggregations': {'a': {'doc_count_error_upper_bound': 0, 'sum_other_doc_count': 0, 'buckets': [{'key': 1, 'doc_count': 2, 'v__max:max': {'value': None}}, {'key': 2, 'doc_count': 1, 'v__max:max': {'value': None}}]}, 'v__max:max': {'value': None}}}
{'took': 156, 'timed_out': False, '_shards': {'total': 5, 'successful': 5, 'skipped': 0, 'failed': 0}, 'hits': {'total': 3, 'max_score': 0.0, 'hits': []}, 'aggregations': {'a': {'doc_count_error_upper_bound': 0, 'sum_other_doc_count': 0, 'buckets': [{'key': 1, 'doc_count': 2, 'v__min:min': {'value': None}}, {'key': 2, 'doc_count': 1, 'v__min:min': {'value': None}}]}, 'v__min:min': {'value': None}}}
@elasticmachine
Copy link
Collaborator

Pinging @elastic/es-analytics-geo

@Hohol
Copy link
Contributor

Hohol commented Jul 29, 2019

I think I've found the cause of this bug.

if (parent != null) {
return LeafBucketCollector.NO_OP_COLLECTOR;
} else {
// we have no parent and the values source is empty so we can skip collecting hits.
throw new CollectionTerminatedException();
}

if (parent == null) {
return LeafBucketCollector.NO_OP_COLLECTOR;
} else {
// we have no parent and the values source is empty so we can skip collecting hits.
throw new CollectionTerminatedException();
}

Looks like the second snippet is incorrect.

MinAggregator and MaxAggregators classes are weird. I believe they should be almost identical, but actually, there are many differences.
I think they should be refactored to get rid of code duplication and any unwanted differences.

I can work on this if someone from Elastic team approves.

jimczi added a commit to jimczi/elasticsearch that referenced this issue Jul 29, 2019
This commit fixes a bug when a deferred aggregator tries to early terminate the collection. In such case the CollectionTerminatedException is not caught and
the search fails on the shard. This change makes sure that we catch the exception in order to continue the deferred collection on the next leaf.

Fixes elastic#44909
@jimczi
Copy link
Contributor

jimczi commented Jul 29, 2019

Thanks for reporting @greentruff . I opened #44963 to fix the bug, @Hohol these snippets are correct, what's missing is the handling of the CollectionTerminatedException in the deferring collectors (see #44963).

@jimczi jimczi added the >bug label Jul 29, 2019
@jimczi jimczi self-assigned this Jul 29, 2019
@polyfractal
Copy link
Contributor

@Hohol Regarding the code differences, the reason they look so different is due to some early termination optimization. If possible, both aggregators attempt to use the BKD to lookup the min or max because that is a lot faster than iterating over all the documents to collect the values.

The BKD tree sorts the values ascending, so the min is at the left-most leaf in the tree. The min aggregator walks the tree leaves until it finds the first non-deleted document, then exits the BKD intersection and returns the value.

In contrast, the max aggregator has a comparably more difficult job. BKD intersections proceed from least-to-greatest, and we don't want to walk the whole tree. So the Max agg only inspects leaves that contain the maximum value for the segment (e.g. the last leaf), and then scans through those values to see which is the largest and also not deleted. So the max agg is more heuristic in nature, if all the docs are deleted in the last leaf it will fall back to iterating over all the values to find the max.

Thus, the differences in how the code looks :) On the surface they look similar, but due to how the tree is structured it introduces some subtle differences.

Pretty sure these details are at least mostly right, Jim can correct me if I got anything grossly wrong :)

@Hohol
Copy link
Contributor

Hohol commented Jul 29, 2019

Thanks for the explanation!

jimczi added a commit that referenced this issue Jul 30, 2019
This commit fixes a bug when a deferred aggregator tries to early terminate the collection. In such case the CollectionTerminatedException is not caught and
the search fails on the shard. This change makes sure that we catch the exception in order to continue the deferred collection on the next leaf.

Fixes #44909
jimczi added a commit that referenced this issue Jul 30, 2019
This commit fixes a bug when a deferred aggregator tries to early terminate the collection. In such case the CollectionTerminatedException is not caught and
the search fails on the shard. This change makes sure that we catch the exception in order to continue the deferred collection on the next leaf.

Fixes #44909
jimczi added a commit that referenced this issue Jul 30, 2019
This commit fixes a bug when a deferred aggregator tries to early terminate the collection. In such case the CollectionTerminatedException is not caught and
the search fails on the shard. This change makes sure that we catch the exception in order to continue the deferred collection on the next leaf.

Fixes #44909
jimczi added a commit that referenced this issue Jul 30, 2019
This commit fixes a bug when a deferred aggregator tries to early terminate the collection. In such case the CollectionTerminatedException is not caught and
the search fails on the shard. This change makes sure that we catch the exception in order to continue the deferred collection on the next leaf.

Fixes #44909
jkakavas pushed a commit that referenced this issue Jul 31, 2019
This commit fixes a bug when a deferred aggregator tries to early terminate the collection. In such case the CollectionTerminatedException is not caught and
the search fails on the shard. This change makes sure that we catch the exception in order to continue the deferred collection on the next leaf.

Fixes #44909
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging a pull request may close this issue.

6 participants