Metric Selector aggregation #48069

Closed
polyfractal opened this issue Oct 15, 2019 · 2 comments · Fixed by #51155

Comments

@polyfractal
Contributor

I'd like to propose an aggregation that "selects" a metric from a document according to some kind of ordering criteria on a second field. For example, you may want the most recent latency value within a date_histogram bucket: in this case, the "metric" is the latency field, and the ordering criteria is timestamp DESC, size: 1.

This is a fairly common use-case which is difficult to accomplish today. top_hits can give you the information, but it fetches an entire document and is not compatible with pipeline aggregations. It is also fairly expensive if many values/documents are being fetched. You can sometimes get the required information with clever usages of other aggs (like a max agg, or scripting) to pull out the document you're looking for, but they are fragile and hacky approaches.

The WeightedAvg agg added support for multiple ValuesSources, so a "metric selector" should not be too difficult to implement.

All naming is tentative, open to better suggestions! :)

Request Syntax

GET _search
{
  "aggs": {
    "timeline": {
      "date_histogram": {
        "field": "date",
        "calendar_interval": "hour"
      },
      "aggs": {
        "most_recent": {
          "metric_selector": {
            "metric": {
              "field": "latency",
              // "script": ...
              // "format": ...,
              // "value_type": ...
            },
            "sort": {
              "field": "date",
              // "script": ...
              // "format": ...,
              // "value_type": ...
            },
            "order": "asc | desc",
            "size": 1,
            "multi_value_mode": "min | max | sum | avg"
          }
        }
      }
    }
  }
}
| Parameter | Description | Required? |
| --- | --- | --- |
| metric | The metric field that we wish to extract from a document | Required |
| sort | The field that we wish to sort and select the metric by | Required |
| order | How the sort field should be ordered: ascending or descending | Required |
| size | The number of <sort, metric> tuples that should be returned | Optional, default: 1 |
| multi_value_mode | How multi-valued metric fields should be collapsed into a single value | Optional, default: avg |
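
To make the multi_value_mode parameter concrete, here is a minimal Python sketch of how a multi-valued metric field could be collapsed into one number. The helper name `collapse` and its signature are hypothetical and only illustrate the four proposed modes; this is not the prototype's code.

```python
from statistics import mean

# Hypothetical mapping from the proposed "multi_value_mode" names
# to the function that collapses a list of values into one number.
_COLLAPSERS = {"min": min, "max": max, "sum": sum, "avg": mean}


def collapse(values, mode="avg"):
    """Collapse the metric values of one document into a single value."""
    if not values:
        raise ValueError("document has no metric values")
    if mode not in _COLLAPSERS:
        raise ValueError(f"unknown multi_value_mode: {mode}")
    return _COLLAPSERS[mode](values)
```

With the proposed default of avg, a document whose latency field holds 10, 20 and 60 would contribute a single value of 30.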

Response

{
  "aggregations" : {
    "timeline" : {
      "buckets" : [
        {
          "key_as_string" : "2019-01-01T05:00:00.000Z",
          "key" : 1546318800000,
          "doc_count" : 3,
          "most_recent" : [
            {
              "sort": 1546340340000,
              "sort_as_string": "2019-01-01T05:59:00.000Z",
              "value": 123
            },
            {
              "sort": 1546338600000,
              "sort_as_string": "2019-01-01T05:30:00.000Z",
              "value": 19
            }
          ]
        },
        {
          "key_as_string" : "2019-01-01T06:00:00.000Z",
          "key" : 1546322400000,
          "doc_count" : 1,
          "most_recent" : [
            {
              "sort": 1546341000000,
              "sort_as_string": "2019-01-01T06:10:00.000Z",
              "value": 9999
            },
            {
              "sort": 1546340700000,
              "sort_as_string": "2019-01-01T06:05:00.000Z",
              "value": 2233
            }
          ]
        }
      ]
    }
  }
}

Note how the sort values are ordered descending within each bucket, with a single metric value returned for each sort value. There may be 1000 documents in a bucket, but unlike other aggregations, this one returns n individual values taken from the documents themselves. If there are ties, multiple objects will share the same sort value.
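
The per-bucket selection described here can be sketched roughly as follows in Python. The function name `top_metrics` and the input shape (a list of `(sort_value, metric_value)` pairs for one bucket) are assumptions made for illustration; they are not part of any Elasticsearch API.

```python
import heapq


def top_metrics(docs, size=1, order="desc"):
    """Return up to `size` {sort, value} entries for a single bucket.

    `docs` is an iterable of (sort_value, metric_value) pairs, one per
    document, with the metric already collapsed to a single number.
    """
    # nlargest returns results in descending order, nsmallest in ascending.
    pick = heapq.nlargest if order == "desc" else heapq.nsmallest
    best = pick(size, docs, key=lambda pair: pair[0])
    return [{"sort": sort_value, "value": value} for sort_value, value in best]
```

Only `size` entries are kept per bucket no matter how many documents the bucket holds, which is what keeps this cheaper than fetching whole documents.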

Misc

  • I have a crude prototype which demonstrates the feasibility.
  • We will need some kind of limit on size to prevent abuse. It should be fairly easy to track in a breaker, so that might be sufficient. I would feel better if there was a hard/soft limit though :) Like top_hits, this should be used to fetch a handful of values, not an entire index.
  • We should support sorting on non-numeric fields too (keyword, etc).
  • I'm less sure we need to support non-numeric metrics. I think starting with numerics-only is fine
  • We can probably optimize the no-parent scenario with a BKD lookup similar to how min/max work today. Not necessary for the first iteration
  • As long as we only support asc/desc (e.g. the min or max values of a field), we shouldn't run into top-n accuracy issues like the terms agg can have. Each shard will always send its n min/max values and the coordinator will assemble a global min/max list. It might be that all top n values have the same sort key and others are omitted, but this is not incorrect since we are displaying individual results and not grouping.
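
To illustrate the last point, here is a rough Python sketch of the coordinator step, assuming each shard returns its own top-n list already sorted in the requested order. The name `merge_shards` and the `(sort_value, metric_value)` pair shape are hypothetical.

```python
import heapq
from itertools import islice


def merge_shards(shard_results, size, order="desc"):
    """Merge per-shard top-n lists into one global top-n list.

    Each element of `shard_results` is a list of (sort_value, metric_value)
    pairs that the shard has already sorted in `order`.
    """
    merged = heapq.merge(
        *shard_results,
        key=lambda pair: pair[0],
        reverse=(order == "desc"),
    )
    # Any globally top-n entry is also in its own shard's top n, so taking
    # the first `size` merged entries gives the exact global result.
    return list(islice(merged, size))
```

Because every shard contributes its n best entries, the merged list is exact, unlike a terms agg where per-shard counts can miss a globally frequent term.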

/cc @costin @colings86

@elasticmachine
Collaborator

Pinging @elastic/es-analytics-geo (:Analytics/Aggregations)

@polyfractal
Contributor Author

Potential naming idea: top_metric or similar, to parallel top_hits. That both signals the similar functionality and avoids confusion with bucket_selector, which is quite different.
