Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Filtering after a top_hits have been applied or bucket selector on buckets, which can select based on buckets which has top_hits applied #94967

Closed
mat013 opened this issue Apr 1, 2023 · 6 comments
Assignees
Labels
:Analytics/Aggregations Aggregations >enhancement Team:Analytics Meta label for analytical engine team (ESQL/Aggs/Geo)

Comments

@mat013
Copy link

mat013 commented Apr 1, 2023

Description

I am unsure whether it is a bug (because I could not find anything in the documentation) or it new feature, and I have tried to ask various forums and googled to see if I could come up with an answer to my problem.

I have a use case (which I imagine is not that unique), which is as follows:

I am trying to create a query in elasticsearch, which is able to retrieve the latest documents for each group and then filter on some criteria on those documents which has been found.

As an example:

Say the following documents are indexed in myindex in elasticsearch:

POST /myindex/_bulk
{ "index":{} }
{ "objid": 1, "ident":"group1","version":1, "chdate": 1, "field1" : 1}
{ "index":{} }
{ "objid": 2, "ident":"group1","version":2, "chdate": 2, "field1" : 0}
{ "index":{} }
{ "objid": 3, "ident":"group1","version":2, "chdate": 3, "field1" : 1}
{ "index":{} }
{ "objid": 4, "ident":"group1","version":2, "chdate": 4, "field1" : 0}
{ "index":{} }
{ "objid": 5, "ident":"group1","version":3, "chdate": 1, "field1" : 0}

I would like to find all documents, which has field1 set to x for the document with the highest chdate, for each ident and version, where the selected document has field1 set to x.

In a case where x is 0 then the documents, which has objid 4 and 5 should be returned In a case where x is 1 then the documents, which has objid 1 should be returned

Initially I tried to do following query

{ "size": 0, "aggs": { "by_ident": { "terms": { "field": "ident.keyword", "size": 10 }, "aggs": { "by_version": { "terms": { "field": "version", "size": 10000 }, "aggs": { "by_latest": { "top_hits": { "sort": [{ "chdate": { "order": "desc" } }], "size": 1 } } } } } } } }

However it is not possible to apply a filter afterwards to the top hits

ChatGPT suggested to use a bucket selector

{ "size": 0, "aggs": { "ident": { "terms": { "field": "ident" }, "aggs": { "version": { "terms": { "field": "version" }, "aggs": { "top_hits_agg": { "top_hits": { "size": 1, "sort": [ { "chdate": { "order": "desc" } } ] } }, "field1_filter": { "bucket_selector": { "buckets_path": { "hits": "top_hits_agg.hits.hits", "field1": "top_hits_agg.hits.hits._source.field1" }, "script": { "source": "params.field1 == 0" } } } } } } } } }

However it fails with: Validation Failed: 1: No aggregation found for path [top_hits_agg.hits.hits._source.field1]

So what I would like to suggest is one of the following:

  1. Add a filtering block to top_hits
  2. Make bucket_selector able to select a top hits bucket.

Personally I think option 2 will be more applicable, however option 1 would maybe more easier to comprehend when somebody has to maintain the query later on.

@mat013 mat013 added >enhancement needs:triage Requires assignment of a team area label labels Apr 1, 2023
@craigtaverner craigtaverner added :Analytics/Aggregations Aggregations Team:Analytics Meta label for analytical engine team (ESQL/Aggs/Geo) and removed needs:triage Requires assignment of a team area label labels Apr 3, 2023
@craigtaverner craigtaverner self-assigned this Apr 3, 2023
@elasticsearchmachine
Copy link
Collaborator

Pinging @elastic/es-analytics-geo (Team:Analytics)

@craigtaverner
Copy link
Contributor

Hi @mat013. I tried your last query and get the same error as you because the bucket_selector is not at the same level as the top hits aggregation. However, even if I move it to the correct level, it does not work and I get the error:

buckets_path must reference either a number value or a single value numeric metric aggregation, got: [InternalTopHits] at aggregation [by_latest]

Looking at the documentation it appears that the bucket selector only works for metric buckets with numerical values, while your buckets contain a hits array with complete documents. So it appears what you are trying is not currently supported. I think you already suspected this, by adding the enhancement label instead of the bug label to this issue. I'll discuss this enhancement with others to see if there is any chance it might be considered.

In the meantime, I assume you can handle this by filtering in the client code?

@mat013
Copy link
Author

mat013 commented Apr 4, 2023

I can get by for the simple cases... not for the more advance... so I will look forward if there will come a solution.

I'll see if I can find some alternative solutions in the meanwhile. Have a nice day :)

@craigtaverner
Copy link
Contributor

craigtaverner commented Apr 5, 2023

I have found that while this is not possible for top_hits, it is possible for top_metrics. What this does is instead of returning the complete hit it returns only the metrics you specify, but they are now structured in a way that is understood by the bucket_selector. Since the hit is missing, you can specify all the fields in the document as metrics to keep all the same information, just structured differently. This query worked for me on your sample data above:

GET /myindex/_search
{
  "size": 0,
  "aggs": {
    "by_ident": {
      "terms": {
        "field": "ident",
        "size": 10
      },
      "aggs": {
        "by_version": {
          "terms": {
            "field": "version",
            "size": 10000
          },
          "aggs": {
            "by_latest": {
              "top_metrics": {
                "metrics": [
                  {"field": "objid"},
                  {"field": "ident"},
                  {"field": "version"},
                  {"field": "chdate"},
                  {"field": "field1"}
                ],
                "sort": {
                  "chdate": {
                    "order": "desc"
                  }
                }
              }
            },
            "field1_filter": {
              "bucket_selector": {
                "buckets_path": {
                  "field1": "by_latest.field1"
                },
                "script": "params.field1 == 0"
              }
            }
          }
        }
      }
    }
  }
}

And the result had buckets that looked like this:

            "buckets": [
              {
                "key": 2,
                "doc_count": 3,
                "by_latest": {
                  "top": [
                    {
                      "sort": [
                        4
                      ],
                      "metrics": {
                        "objid": 4,
                        "ident": "group1",
                        "version": 2,
                        "chdate": 4,
                        "field1": 0
                      }
                    }
                  ]
                }
              },
              {
                "key": 3,
                "doc_count": 1,
                "by_latest": {
                  "top": [
                    {
                      "sort": [
                        1
                      ],
                      "metrics": {
                        "objid": 5,
                        "ident": "group1",
                        "version": 3,
                        "chdate": 1,
                        "field1": 0
                      }
                    }
                  ]
                }
              }
            ]

@mat013
Copy link
Author

mat013 commented Apr 5, 2023

Thanks a lot. Obviously the reality is more detailed, so the toy example I provided will not scale really well for a document (in our case has more than 100 attributes), but it is definitely interesting as I can then do it in two stages first finding the one and take the objid and then search them out afterwards... So I would appreciate if this enhancement suggestion is still under evaluation whether it can be implemented in a more simple manners as originally suggested

In my case field1 will not be a number but a string or keyword, and version and chdate is timestamps
So I have to test this with the structure I am using but I will definitly have a go with this.

Have a nice day

@kkrik-es
Copy link
Contributor

Fixed in #95828.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
:Analytics/Aggregations Aggregations >enhancement Team:Analytics Meta label for analytical engine team (ESQL/Aggs/Geo)
Projects
None yet
Development

No branches or pull requests

4 participants