
Introduce incremental reduction of TopDocs #23946

Merged
s1monw merged 3 commits into elastic:master on Apr 10, 2017

Conversation

@s1monw s1monw commented Apr 6, 2017

This commit adds support for incremental top N reduction if the number of
expected shards in the search request is high enough. The changes here
also clean up more code in SearchPhaseController to make the separation
between values that are the same on each search result and values that
are per response clearer. The reduced search phase result no longer holds on to an
arbitrary shard result just to obtain values like `from`, `size` or sort values;
these are now cleanly encapsulated.

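For context, here is a minimal sketch of what incremental top N reduction means (illustrative only, not the SearchPhaseController implementation; it leans on Lucene's TopDocs.merge, and the class name and buffer threshold are assumptions):

import java.util.ArrayList;
import java.util.List;

import org.apache.lucene.search.ScoreDoc;
import org.apache.lucene.search.TopDocs;

// Sketch: buffer per-shard top docs and collapse the buffer with a partial merge
// once it fills up, so memory is bounded by the buffer size rather than by the
// number of shards. The real code also has to keep per-shard indices straight
// across partial merges.
class IncrementalTopDocsReducer {
    private final int topN;        // from + size requested by the search
    private final int bufferSize;  // hypothetical threshold, e.g. 512 expected shards
    private final List<TopDocs> buffer = new ArrayList<>();
    private int numReducePhases = 0;

    IncrementalTopDocsReducer(int topN, int bufferSize) {
        this.topN = topN;
        this.bufferSize = bufferSize;
    }

    // called once per shard result
    void consume(TopDocs shardTopDocs) {
        buffer.add(shardTopDocs);
        if (buffer.size() >= bufferSize) {
            // partial (incremental) reduce: only the merged top N survives
            TopDocs merged = TopDocs.merge(topN, buffer.toArray(new TopDocs[0]));
            buffer.clear();
            buffer.add(merged);
            numReducePhases++;
        }
    }

    // final reduce over whatever is left in the buffer
    ScoreDoc[] reduce() {
        numReducePhases++;
        return TopDocs.merge(topN, buffer.toArray(new TopDocs[0])).scoreDocs;
    }

    int numReducePhases() {
        return numReducePhases;
    }
}
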
@s1monw s1monw added the :Search/Search, >enhancement, review, v5.4.0 and v6.0.0-alpha1 labels Apr 6, 2017
@s1monw s1monw requested review from jpountz and jimczi April 6, 2017 15:13
try {
// the search context should inherit the default timeout
assertThat(contextWithDefaultTimeout.timeout(), equalTo(TimeValue.timeValueSeconds(5)));
} finally {

s1monw (author):

these tests were annoyingly slow since they waited for timeouts while shards were still locked - this shaved 10 seconds off the test

@jimczi jimczi left a comment

And that's how you incrementally introduced the incremental reduction of search results!
Looks great!

* @param bufferdAggs a list of pre-collected / buffered aggregations. if this list is non-null all aggregations have been consumed
* @param bufferedAggs a list of pre-collected / buffered aggregations. if this list is non-null all aggregations have been consumed
* from all non-null query results.
* @param bufferedAggs a list of pre-collected / buffered top docs. if this list is non-null all top docs have been consumed

Review comment:
nit: bufferedTopDocs

// the top docs sort fields used to sort the score docs, <code>null</code> if the results are not sorted
final SortField[] sortField;
// <code>true</code> iff the result score docs is sorted
final boolean isSorted;

Review comment:
nit: hasFields ? score docs are always sorted ;)

s1monw (author):
I renamed it to isSortedByField since this is really what it is :)

@jpountz jpountz left a comment

I left some comments, mostly about readability, but I like the change in general. In many cases where you get many shard results, I think most of them would be empty, so I'm wondering whether we should optimize for that case (in another PR).

if (size != -1) {
final ScoreDoc[] mergedScoreDocs = mergeTopDocs(topDocs, size, ignoreFrom ? 0 : from);
final boolean hasNoHits = groupedCompletionSuggestions.isEmpty() && topDocs.isEmpty();
if (hasNoHits == false) {

Review comment:
can we avoid the double negation that makes things a bit harder to read by calling the var hasHits?
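
For example (hypothetical rewrite of the snippet above, just inverting the flag as suggested):

final ScoreDoc[] mergedScoreDocs = mergeTopDocs(topDocs, size, ignoreFrom ? 0 : from);
final boolean hasHits = groupedCompletionSuggestions.isEmpty() == false || topDocs.isEmpty() == false;
if (hasHits) {
    // same body as before, no double negation in the condition
}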

if (results.isEmpty()) {
return EMPTY_DOCS;
return null;

Review comment:
out of curiosity, why was it an issue to return EMPTY_DOCS?

s1monw (author):
because the return value is again a TopDocs and not a ScoreDoc[], so `private static final ScoreDoc[] EMPTY_DOCS = new ScoreDoc[0];` wouldn't cut it

List<InternalAggregations> bufferdAggs, int numReducePhases) {
List<InternalAggregations> bufferedAggs,
List<TopDocs> bufferedTopDocs, TopDocsStats topDocsStats, int numReducePhases,
boolean isScrollRequest) {

Review comment:
can you fix the indentation here? all parameters do not seem to start on the same column

@@ -204,23 +209,35 @@ private static long optionalSum(long left, long right) {
}
}
}
return scoreDocs;
final boolean isSorted;

Review comment:
is it what is called isSortedByField elsewhere?

static class SortedTopDocs {
static final SortedTopDocs EMPTY = new SortedTopDocs(EMPTY_DOCS, false, null);
final ScoreDoc[] scoreDocs;
final boolean sorted;

Review comment:
is it what is called isSortedByField elsewhere?
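
For reference, a sketch of the holder with the rename applied consistently (the fields mirror the snippets above; the constructor and comments are assumptions):

static class SortedTopDocs {
    static final SortedTopDocs EMPTY = new SortedTopDocs(EMPTY_DOCS, false, null);
    final ScoreDoc[] scoreDocs;
    // true iff the score docs are sorted by sort fields rather than by score
    final boolean isSortedByField;
    // the sort fields used to sort the score docs, null if the results are sorted by score
    final SortField[] sortFields;

    SortedTopDocs(ScoreDoc[] scoreDocs, boolean isSortedByField, SortField[] sortFields) {
        this.scoreDocs = scoreDocs;
        this.isSortedByField = isSortedByField;
        this.sortFields = sortFields;
    }
}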

s1monw commented Apr 7, 2017

@jpountz I pushed a new commit

s1monw commented Apr 7, 2017

In many cases where you get many shard results, I think most of them would be empty, so I'm wondering whether we should optimize for that case (in another PR).

can you elaborate on this a bit? I am not sure I am following.

@s1monw s1monw merged commit 1f40f8a into elastic:master Apr 10, 2017
@s1monw s1monw deleted the incrementally_reduce_top_n branch April 10, 2017 07:37

jpountz commented Apr 10, 2017

I was just thinking about the fact that adding more reductions can make results less accurate, eg. for terms aggregations. Yet, I expect that incremental reduction would be especially useful in cases where you have time-based indices (eg. one per day). But if you set a date filter that only matches documents in a limited number of indices (eg. the last 30 days while you have a year's worth of data), most shard responses would be empty? So it is a bit disappointing to trade accuracy because of shard responses that do not contribute anything to the final (merged) response?

s1monw commented Apr 10, 2017

So it is a bit disappointing to trade accuracy because of shard responses that do not contribute anything to the final (merged) response?

Oh I see what you mean: today if we get a result we don't check whether it had any hits at all, and in such a case we can just skip it (not buffer it). Is that what you mean? That is low-hanging fruit I guess...
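
Sketched on top of the reducer example further up (hypothetical, not what this PR implements), skipping empty shard results before they are buffered would look roughly like:

void consume(TopDocs shardTopDocs) {
    if (shardTopDocs == null || shardTopDocs.scoreDocs.length == 0) {
        return; // nothing to contribute: skip buffering, no accuracy traded away
    }
    buffer.add(shardTopDocs);
    // ... partial reduce once the buffer is full, as before
}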

s1monw added a commit that referenced this pull request Apr 10, 2017

jpountz commented Apr 10, 2017

Is that what you mean?

Yes. I'm wondering whether there might be issues with the min_doc_count:0 option, since an empty result set could still create non-empty aggregations, but hopefully we'll find a way to work around this.

final boolean hasTopDocs = source == null || source.size() != 0;

if (isScrollRequest == false && (hasAggs || hasTopDocs)) {
// no incremental reduce if scroll is used - we only hit a single shard or sometimes more...

Review comment:
@s1monw Would you mind explaining why we cannot use incremental reduce if scroll is used? This confuses me.
