Filtering after a top_hits have been applied or bucket selector on buckets, which can select based on buckets which has top_hits applied #94967

mat013 · 2023-04-01T12:05:56Z

Description

I am unsure whether this is a bug (I could not find anything in the documentation) or a new feature request. I have asked in various forums and searched online, but could not find an answer to my problem.

I have a use case (which I imagine is not that unique), which is as follows:

I am trying to create a query in Elasticsearch that retrieves the latest document for each group and then filters those documents on some criteria.

As an example:

Say the following documents are indexed in myindex in elasticsearch:

POST /myindex/_bulk
{ "index":{} }
{ "objid": 1, "ident":"group1","version":1, "chdate": 1, "field1" : 1}
{ "index":{} }
{ "objid": 2, "ident":"group1","version":2, "chdate": 2, "field1" : 0}
{ "index":{} }
{ "objid": 3, "ident":"group1","version":2, "chdate": 3, "field1" : 1}
{ "index":{} }
{ "objid": 4, "ident":"group1","version":2, "chdate": 4, "field1" : 0}
{ "index":{} }
{ "objid": 5, "ident":"group1","version":3, "chdate": 1, "field1" : 0}

For each combination of ident and version, I would like to select the document with the highest chdate, and then return only those selected documents that have field1 set to x.

If x is 0, the documents with objid 4 and 5 should be returned.
If x is 1, the document with objid 1 should be returned.
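The intended semantics can be sketched outside Elasticsearch in a few lines of Python. This is only an illustration of the requirement, not a proposed implementation: the function name `latest_matching` is hypothetical, and the documents are copied from the bulk request above.

```python
# Sample documents copied from the bulk request in this issue.
DOCS = [
    {"objid": 1, "ident": "group1", "version": 1, "chdate": 1, "field1": 1},
    {"objid": 2, "ident": "group1", "version": 2, "chdate": 2, "field1": 0},
    {"objid": 3, "ident": "group1", "version": 2, "chdate": 3, "field1": 1},
    {"objid": 4, "ident": "group1", "version": 2, "chdate": 4, "field1": 0},
    {"objid": 5, "ident": "group1", "version": 3, "chdate": 1, "field1": 0},
]


def latest_matching(docs, x):
    """Pick the document with the highest chdate per (ident, version),
    then keep only the picked documents whose field1 equals x."""
    latest = {}
    for doc in docs:
        key = (doc["ident"], doc["version"])
        # Remember the document with the highest chdate seen so far for this group.
        if key not in latest or doc["chdate"] > latest[key]["chdate"]:
            latest[key] = doc
    # The filter is applied after the per-group selection, not before it.
    return sorted(
        (d for d in latest.values() if d["field1"] == x),
        key=lambda d: d["objid"],
    )
```

Note that the order of the two steps matters: filtering on field1 before selecting the latest document would, for x = 1, also return objid 3, which is not the latest document of its group.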

Initially I tried the following query:

{
  "size": 0,
  "aggs": {
    "by_ident": {
      "terms": {
        "field": "ident.keyword",
        "size": 10
      },
      "aggs": {
        "by_version": {
          "terms": {
            "field": "version",
            "size": 10000
          },
          "aggs": {
            "by_latest": {
              "top_hits": {
                "sort": [{ "chdate": { "order": "desc" } }],
                "size": 1
              }
            }
          }
        }
      }
    }
  }
}

However, it is not possible to apply a filter to the top hits afterwards.

ChatGPT suggested using a bucket selector:

{
  "size": 0,
  "aggs": {
    "ident": {
      "terms": { "field": "ident" },
      "aggs": {
        "version": {
          "terms": { "field": "version" },
          "aggs": {
            "top_hits_agg": {
              "top_hits": {
                "size": 1,
                "sort": [{ "chdate": { "order": "desc" } }]
              }
            },
            "field1_filter": {
              "bucket_selector": {
                "buckets_path": {
                  "hits": "top_hits_agg.hits.hits",
                  "field1": "top_hits_agg.hits.hits._source.field1"
                },
                "script": { "source": "params.field1 == 0" }
              }
            }
          }
        }
      }
    }
  }
}

However it fails with: Validation Failed: 1: No aggregation found for path [top_hits_agg.hits.hits._source.field1]

So what I would like to suggest is one of the following:

1. Add a filtering block to top_hits.
2. Make bucket_selector able to select on a top_hits bucket.

Personally I think option 2 would be more widely applicable; however, option 1 might be easier to comprehend when somebody has to maintain the query later on.


elasticsearchmachine · 2023-04-03T13:39:04Z

Pinging @elastic/es-analytics-geo (Team:Analytics)

craigtaverner · 2023-04-03T16:35:37Z

Hi @mat013. I tried your last query and get the same error as you because the bucket_selector is not at the same level as the top hits aggregation. However, even if I move it to the correct level, it does not work and I get the error:

buckets_path must reference either a number value or a single value numeric metric aggregation, got: [InternalTopHits] at aggregation [by_latest]

Looking at the documentation it appears that the bucket selector only works for metric buckets with numerical values, while your buckets contain a hits array with complete documents. So it appears what you are trying is not currently supported. I think you already suspected this, by adding the enhancement label instead of the bug label to this issue. I'll discuss this enhancement with others to see if there is any chance it might be considered.

In the meantime, I assume you can handle this by filtering in the client code?
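A minimal sketch of such client-side filtering, assuming the response of the first query in this issue (a `by_ident` terms aggregation containing `by_version`, which contains a `by_latest` top_hits of size 1) has already been parsed into a Python dict. The function name `filter_top_hits` is hypothetical.

```python
def filter_top_hits(response, field, value):
    """Return the _source of each bucket's top hit whose `field` equals `value`.

    `response` is assumed to be the parsed JSON body of a search that used the
    by_ident > by_version > by_latest (top_hits) aggregation from this issue.
    """
    matches = []
    for ident_bucket in response["aggregations"]["by_ident"]["buckets"]:
        for version_bucket in ident_bucket["by_version"]["buckets"]:
            # top_hits nests its results under hits.hits, like a normal search.
            for hit in version_bucket["by_latest"]["hits"]["hits"]:
                if hit["_source"].get(field) == value:
                    matches.append(hit["_source"])
    return matches
```

Because the filtering happens after the response arrives, every bucket's top hit is still transferred to the client, including the ones that are then discarded.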

mat013 · 2023-04-04T16:53:50Z

I can get by for the simple cases, but not for the more advanced ones, so I look forward to a solution.

I'll see if I can find some alternative solutions in the meantime. Have a nice day :)

craigtaverner · 2023-04-05T09:10:30Z

I have found that while this is not possible for top_hits, it is possible for top_metrics. What this does is instead of returning the complete hit it returns only the metrics you specify, but they are now structured in a way that is understood by the bucket_selector. Since the hit is missing, you can specify all the fields in the document as metrics to keep all the same information, just structured differently. This query worked for me on your sample data above:

GET /myindex/_search
{
  "size": 0,
  "aggs": {
    "by_ident": {
      "terms": {
        "field": "ident",
        "size": 10
      },
      "aggs": {
        "by_version": {
          "terms": {
            "field": "version",
            "size": 10000
          },
          "aggs": {
            "by_latest": {
              "top_metrics": {
                "metrics": [
                  {"field": "objid"},
                  {"field": "ident"},
                  {"field": "version"},
                  {"field": "chdate"},
                  {"field": "field1"}
                ],
                "sort": {
                  "chdate": {
                    "order": "desc"
                  }
                }
              }
            },
            "field1_filter": {
              "bucket_selector": {
                "buckets_path": {
                  "field1": "by_latest.field1"
                },
                "script": "params.field1 == 0"
              }
            }
          }
        }
      }
    }
  }
}

And the result had buckets that looked like this:

            "buckets": [
              {
                "key": 2,
                "doc_count": 3,
                "by_latest": {
                  "top": [
                    {
                      "sort": [
                        4
                      ],
                      "metrics": {
                        "objid": 4,
                        "ident": "group1",
                        "version": 2,
                        "chdate": 4,
                        "field1": 0
                      }
                    }
                  ]
                }
              },
              {
                "key": 3,
                "doc_count": 1,
                "by_latest": {
                  "top": [
                    {
                      "sort": [
                        1
                      ],
                      "metrics": {
                        "objid": 5,
                        "ident": "group1",
                        "version": 3,
                        "chdate": 1,
                        "field1": 0
                      }
                    }
                  ]
                }
              }
            ]

mat013 · 2023-04-05T09:24:29Z

Thanks a lot. Obviously the reality is more detailed, so the toy example I provided will not scale well to our documents (which have more than 100 attributes). It is definitely interesting, though, as I can do it in two stages: first find the matching documents and take their objid, then search for them afterwards. So I would appreciate it if this enhancement suggestion could remain under evaluation, to see whether it can be implemented in a simpler manner as originally suggested.

In my case field1 will not be a number but a string or keyword, and version and chdate are timestamps.
So I have to test this with the structure I am using, but I will definitely have a go with this.
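The two-stage idea could look roughly like the sketch below. It assumes the first stage is the top_metrics query from the previous comment (so each surviving bucket carries `by_latest.top[].metrics.objid`), and that the second stage fetches the full documents with a `terms` query on objid. The helper names `collect_objids` and `build_followup_query` are hypothetical.

```python
def collect_objids(response):
    """Stage 1: gather objid values from the buckets kept by bucket_selector."""
    objids = []
    for ident_bucket in response["aggregations"]["by_ident"]["buckets"]:
        for version_bucket in ident_bucket["by_version"]["buckets"]:
            # top_metrics returns its entries under "top", each with "metrics".
            for top in version_bucket["by_latest"]["top"]:
                objids.append(top["metrics"]["objid"])
    return objids


def build_followup_query(objids):
    """Stage 2: build a search body that fetches the full documents by objid."""
    return {
        "size": len(objids),
        "query": {"terms": {"objid": objids}},
    }
```

With this split, the top_metrics aggregation would only need to list objid and the field used by the bucket_selector, rather than all attributes of the document.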

Have a nice day

kkrik-es · 2023-05-15T13:33:31Z

Fixed in #95828.

mat013 added the >enhancement and needs:triage labels Apr 1, 2023

craigtaverner added the :Analytics/Aggregations and Team:Analytics labels and removed the needs:triage label Apr 3, 2023

craigtaverner self-assigned this Apr 3, 2023

This was referenced Apr 5, 2023:

Meta: Refactor pipeline aggregations #82808 (Open)

Allow bucket_selector to be used for TopHits and TopMetrics aggs #73429 (Closed)

martijnvg mentioned this issue May 3, 2023:

Extend TopMetrics to support BucketSelector #95720 (Closed)

martijnvg mentioned this issue May 12, 2023:

Support value retrieval in top_hits #95828 (Merged)

martijnvg assigned kkrik-es May 12, 2023

kkrik-es closed this as completed May 15, 2023

cjbottaro mentioned this issue Nov 28, 2023:

Support value retrieval in top_hits opensearch-project/OpenSearch#11372 (Closed)
