
Feature request: Aggregation to produce buckets with a fixed number of documents in them #50120

Open
IceCreamYou opened this issue Dec 12, 2019 · 7 comments
Labels
:Analytics/Aggregations Aggregations >enhancement stalled Team:Analytics Meta label for analytical engine team (ESQL/Aggs/Geo)

Comments


IceCreamYou commented Dec 12, 2019

I would like to create an NP chart, and I can't currently find a way to do so with Elasticsearch.

An NP chart is a line chart in which the value of each point is the percentage of a fixed-size sample of items that meet some criterion. For example, if my data looks like this:

[
	{ date: '2019-01-01', failed: true },
	{ date: '2019-01-01', failed: true },
	{ date: '2019-01-02', failed: false },
	{ date: '2019-01-04', failed: false },
	{ date: '2019-01-05', failed: false },
	{ date: '2019-01-08', failed: true },
	{ date: '2019-01-08', failed: false },
	{ date: '2019-01-08', failed: false },
]

Then I want to write a histogram like this:

aggs: {
	np_chart: {
		fixed_size_buckets: {
			max_bucket_count: 10,
			max_bucket_documents: 3,
			sort: [{
				date: {
					order: 'asc',
				},
			}],
		},
		aggs: {
			failed_count: {
				filter: {
					term: {
						'failed': true,
					},
				},
			},
		},
	},
},

Which should return buckets like this:

[
	{
		key: ...,
		key_as_string: '2019-01-01',
		doc_count: 3,
		failed_count: { doc_count: 2 },
	},
	{
		key: ...,
		key_as_string: '2019-01-04',
		doc_count: 3,
		failed_count: { doc_count: 1 },
	},
	{
		key: ...,
		key_as_string: '2019-01-08',
		doc_count: 2,
		failed_count: { doc_count: 0 },
	},
]

Obviously, with this few documents I could load them into memory and bucket them manually, but I'd like to have up to a few thousand documents per bucket, and that's too much to process that way.
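For reference, here's a minimal sketch of that manual fallback in Python with the elasticsearch-py client. The index name (builds), the client URL, and BUCKET_SIZE are all made up for illustration:

from elasticsearch import Elasticsearch

es = Elasticsearch("http://localhost:9200")
BUCKET_SIZE = 3  # stands in for max_bucket_documents

# Pull every matching document, sorted by date; only viable for small data sets.
resp = es.search(
    index="builds",
    body={
        "size": 10000,
        "sort": [{"date": {"order": "asc"}}],
        "_source": ["date", "failed"],
    },
)
docs = [hit["_source"] for hit in resp["hits"]["hits"]]

# Chunk the sorted documents into groups of BUCKET_SIZE and count
# failures per group, mimicking the buckets described above.
buckets = []
for i in range(0, len(docs), BUCKET_SIZE):
    group = docs[i:i + BUCKET_SIZE]
    buckets.append({
        "key_as_string": group[0]["date"],
        "doc_count": len(group),
        "failed_count": {"doc_count": sum(1 for d in group if d["failed"])},
    })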

There's a variant of the NP chart where instead of dividing all the documents into groups of N, we first make a date histogram and then take a random sample of N documents from each day. The proposal above would support both cases.

I think this is different from the other requests for variable-width histograms I've been able to find, but please correct me if this has been proposed elsewhere.

@pgomulka pgomulka added the :Analytics/Aggregations Aggregations label Dec 12, 2019
@elasticmachine
Collaborator

Pinging @elastic/es-analytics-geo (:Analytics/Aggregations)

@polyfractal
Contributor

Hm, interesting. How do you decide which documents make it into the bucket? E.g. if you specify max_bucket_documents: 3 but there are 1000 docs that match, which three make it into the bucket? Random? Best "scoring" in some way?

We have two sampler aggs which might work for you: Sampler aggregation and Diversified sampler aggregation. Sampler selects n documents (per shard) and sub-aggs of the sampler only see those documents.

Diversified is similar, except it limits the number of documents per "value" in a field. E.g. you set max_docs_per_value: 3 over color field, and sub-aggs will only receive three documents for each color.
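For concreteness, a quick sketch of both samplers in Python (the builds index and color field are hypothetical; failed comes from the example above):

from elasticsearch import Elasticsearch

es = Elasticsearch("http://localhost:9200")

resp = es.search(
    index="builds",  # hypothetical index name
    body={
        "size": 0,
        "aggs": {
            # Sampler: sub-aggs see at most shard_size docs per shard.
            "sample": {
                "sampler": {"shard_size": 200},
                "aggs": {
                    "failed_count": {"filter": {"term": {"failed": True}}},
                },
            },
            # Diversified sampler: additionally caps how many sampled
            # docs may share the same value of the color field.
            "diverse_sample": {
                "diversified_sampler": {
                    "shard_size": 200,
                    "field": "color",
                    "max_docs_per_value": 3,
                },
                "aggs": {
                    "failed_count": {"filter": {"term": {"failed": True}}},
                },
            },
        },
    },
)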

@IceCreamYou
Author

How do you decide which documents make it into the bucket?

In the example I provided, the sort parameter defines that. Think of it like making a query where each bucket corresponds to one page of the response.

What I want to do is take a population of documents, sort them, and divide them into buckets with an equal number of documents in each bucket. The sampler aggregation just shrinks the population; it does not support dividing the population into equal-sized buckets. It is also important for the statistical properties of an NP chart that each bucket be exactly the same size (apart from the remainder), and it looks like the sampler aggregation's shard_size property only limits the number of documents evaluated per shard rather than directly controlling the size of the returned bucket. Also, in my particular use case I am actually working inside a nested aggregation, and I'm not sure how a sampler aggregation would know how to sort child documents in that scenario.
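To make the paging analogy concrete, here's a rough search_after sketch in Python (hypothetical names again; note the tie-breaker comment, which is exactly the gap a native aggregation would need to close):

from elasticsearch import Elasticsearch

es = Elasticsearch("http://localhost:9200")
PAGE = 3  # stands in for max_bucket_documents

buckets, search_after = [], None
while True:
    body = {
        "size": PAGE,
        # "_doc" breaks ties only within a shard; a real implementation
        # would need a globally meaningful tie-breaker.
        "sort": [{"date": {"order": "asc"}}, "_doc"],
    }
    if search_after is not None:
        body["search_after"] = search_after
    resp = es.search(index="builds", body=body)  # hypothetical index
    hits = resp["hits"]["hits"]
    if not hits:
        break
    group = [h["_source"] for h in hits]
    buckets.append({
        "key_as_string": group[0]["date"],
        "doc_count": len(group),
        "failed_count": {"doc_count": sum(1 for d in group if d["failed"])},
    })
    search_after = hits[-1]["sort"]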

@polyfractal
Contributor

Ah gotcha, I overlooked the sort parameter. Makes sense, although I think there are still some issues to resolve around breaking "ties". E.g. if you sort on date and want 10 documents per bucket, you might still have >10 documents with the same date. So we'd probably need some kind of tie-breaker like document ID or something as well.

In any case, I agree it'd be nice to have this kind of fixed-size-bucket histogram. I'm not sure we will be able to support this at the moment, though: currently the aggregation framework can only do a single pass over the data.

It has some facility to merge buckets as it proceeds (à la the auto_date_histogram; a similar method will be used by the variable-width histo, essentially a form of clustering). But I think this would require the opposite: splitting buckets once it finds new candidates that should reside inside an existing bucket which is already "full". That's fine on a single shard, but once we ship the results off to the coordinator and start merging shard results, we can no longer split a bucket, because the sub-agg data is only mergeable, not splittable (e.g. you can't split an avg after it has been calculated).

We are discussing how to implement multiple passes, which I think would be one way to implement this: a first pass to find the extent/bounds of the data and summarize how it is distributed, and a second pass to calculate the actual aggs.

Another approach could be a layer on top of the composite agg, which rolls up buckets as it pages through the results and returns the final result to the user.
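A rough sketch of that composite-based layer in Python, with hypothetical names. One caveat baked in: composite buckets can't be split, so a single date with more than the target number of documents will overshoot the bucket size:

from elasticsearch import Elasticsearch

es = Elasticsearch("http://localhost:9200")
TARGET = 1000  # desired documents per rolled-up bucket

def roll_up(group):
    # Collapse a run of composite buckets into one fixed-size-ish bucket.
    return {
        "key": group[0]["key"]["date"],
        "doc_count": sum(b["doc_count"] for b in group),
        "failed_count": sum(b["failed"]["doc_count"] for b in group),
    }

rolled_up, pending, after_key = [], [], None
while True:
    composite = {"size": 100, "sources": [{"date": {"terms": {"field": "date"}}}]}
    if after_key is not None:
        composite["after"] = after_key
    resp = es.search(
        index="builds",  # hypothetical index name
        body={
            "size": 0,
            "aggs": {
                "by_date": {
                    "composite": composite,
                    "aggs": {"failed": {"filter": {"term": {"failed": True}}}},
                },
            },
        },
    )
    agg = resp["aggregations"]["by_date"]
    for bucket in agg["buckets"]:
        pending.append(bucket)
        if sum(b["doc_count"] for b in pending) >= TARGET:
            rolled_up.append(roll_up(pending))
            pending = []
    after_key = agg.get("after_key")
    if after_key is None:
        break
if pending:  # leftover partial bucket (the "remainder")
    rolled_up.append(roll_up(pending))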

@IceCreamYou
Author

Regarding ties, I imagine it'd work the same way as all the other sorting Elasticsearch does, IIRC defaulting to document ID as you suggested. In fact, I'd be okay if the only supported ordering were document ID, since my particular application would be sorting by date/time anyway and document ID is a reasonable proxy.

Understood about the cross-shard ordering issue. One possible approach would be to ignore the issue and just return potentially overlapping buckets. It wouldn't be ideal, obviously, but it might be close enough. A more complex but presumably more accurate variant on that idea could be to add a shard_factor parameter such that each shard produces buckets of max_bucket_documents / shard_factor documents and the coordinator merges them until max_bucket_documents is reached. So for example if max_bucket_documents = 1000 and shard_factor = 10, each shard could produce buckets of 100 and then the coordinator could merge 10 shard-buckets to produce returned buckets.
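As a toy illustration of the shard_factor merge (shapes invented; real shard responses carry serialized sub-agg state rather than plain counts, which is the hard part):

import heapq

def merge_shard_buckets(shard_results, shard_factor):
    # shard_results: one list per shard of sub-buckets already sorted
    # by key, each covering max_bucket_documents / shard_factor docs,
    # shaped like {"key": ..., "doc_count": ..., "failed_count": ...}.
    merged = list(heapq.merge(*shard_results, key=lambda b: b["key"]))
    out = []
    for i in range(0, len(merged), shard_factor):
        group = merged[i:i + shard_factor]
        out.append({
            "key": group[0]["key"],
            "doc_count": sum(b["doc_count"] for b in group),
            "failed_count": sum(b["failed_count"] for b in group),
        })
    return out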

@rjernst rjernst added the Team:Analytics Meta label for analytical engine team (ESQL/Aggs/Geo) label May 4, 2020
@not-napoleon
Member

Blocked by #50863 - multi-pass aggregation support

@not-napoleon
Member

At the review meeting today, we also looked at #50386 (also stalled on multi-pass aggregations). That sounds like a very similar use case to this one; we might be able to address both asks at the same time.
