Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add an option to early terminate document collection when searching/counting #6885

Merged
merged 1 commit into from Jul 23, 2014

Conversation

Projects
None yet
4 participants
@areek
Copy link
Contributor

commented Jul 15, 2014

The idea is to add an option which will let the user control the number of matched documents collected per shard before scoring (if applicable). This will be helpful for exists-type functionality or when not all matched document has to be scored.

Closes #6876

@areek

This comment has been minimized.

Copy link
Contributor Author

commented Jul 15, 2014

TODO:

  • replace all usage of Lucene.ExistsCollector with the new Lucene.EarlyTerminatingCountCollector to take advantage of early termination.
  • use threshold for collation in PhraseSuggester
@s1monw

View changes

src/main/java/org/elasticsearch/action/count/CountRequest.java Outdated
@@ -240,6 +259,10 @@ public void writeTo(StreamOutput out) throws IOException {
out.writeOptionalString(preference);
out.writeBytesReference(source);
out.writeStringArray(types);

if (out.getVersion().onOrAfter(Version.V_1_4_0)) {
out.writeVInt(threshold);

This comment has been minimized.

Copy link
@s1monw

s1monw Jul 15, 2014

Contributor

if you use vint you can't write negative values :)

This comment has been minimized.

Copy link
@s1monw

s1monw Jul 15, 2014

Contributor

I think you can just go with threshold + 1 on the write side and threshold - 1 on the read side

This comment has been minimized.

Copy link
@areek

areek Jul 15, 2014

Author Contributor

I think it makes sense to just change the default to 0, it does not really make sense for the user to set it to 0. maybe even throw an exception when they want to set it.

@s1monw

View changes

src/main/java/org/elasticsearch/action/count/CountRequest.java Outdated
/**
* Upon reaching the threshold count, the count request will early terminate
*/
public CountRequest threshold(int threshold) {

This comment has been minimized.

Copy link
@s1monw

s1monw Jul 15, 2014

Contributor

maybe make sure it's positive

@s1monw

View changes

src/main/java/org/elasticsearch/action/count/ShardCountRequest.java Outdated
@@ -99,6 +108,12 @@ public void readFrom(StreamInput in) throws IOException {
}
}
nowInMillis = in.readVLong();

if (in.getVersion().onOrAfter(Version.V_1_4_0)) {
threshold = in.readVInt();

This comment has been minimized.

Copy link
@s1monw

s1monw Jul 15, 2014

Contributor

see my comment about negative values above?

@s1monw

View changes

src/test/java/org/elasticsearch/count/simple/SimpleCountTests.java Outdated
SETTING_NUMBER_OF_SHARDS, 1,
SETTING_NUMBER_OF_REPLICAS, 0).get();
ensureGreen();
client().prepareIndex("test", "type1", "1").setSource("field", "2010-01-05T02:00").execute().actionGet();

This comment has been minimized.

Copy link
@s1monw

s1monw Jul 15, 2014

Contributor

maybe add more docs like between(2, max) and use indexRandom() so you get holes in the segments ie. del docs. Then you can can randomly set the threshold between(2, max-1) etc.

This comment has been minimized.

Copy link
@areek

areek Jul 15, 2014

Author Contributor

Thats a great idea, will do!

@s1monw

This comment has been minimized.

Copy link
Contributor

commented Jul 15, 2014

left some comments - I like it...

@areek

This comment has been minimized.

Copy link
Contributor Author

commented Jul 16, 2014

Updated PR:

  • incorporated feedback
  • replaced Lucene.ExistsCollector usage with the new EarlyTerminatingCountCollector (this might speed things up a little bit in the Percolator service)

TODO:

  • use threshold param in PhraseSuggester collation.

It seems like there is no equivalent of MultiSearchRequest and family for CountRequest. To use threshold for PhraseSuggester collation, it would be ideal to use CountRequest (for setting up the threshold param) and also use something similar to MultiSearchRequest for CountRequest (to dispatch requests for all the generated phrases).

Another way would be to somehow add the count-specific param threshold to the SearchRequest. Thoughts?

}

@Override
public void setScorer(Scorer scorer) throws IOException {
this.exists = false;

This comment has been minimized.

Copy link
@martijnvg

martijnvg Jul 16, 2014

Member

+1 on removing this

@martijnvg

View changes

src/main/java/org/elasticsearch/percolator/PercolatorService.java Outdated
@@ -446,14 +446,13 @@ public ReduceResult reduce(List<PercolateShardResponse> shardResults) {
@Override
public PercolateShardResponse doPercolate(PercolateShardRequest request, PercolateContext context, boolean isNested) {
long count = 0;
Lucene.ExistsCollector collector = new Lucene.ExistsCollector();
Lucene.EarlyTerminatingCountCollector collector = Lucene.createExistsCollector();

This comment has been minimized.

Copy link
@martijnvg

martijnvg Jul 16, 2014

Member

Also +1 on using EarlyTerminatingCountCollector in percolator. The in-memory index only has one segment, but in the case when percolating a document with nested inner objects multiple in-memory indices are wrapped with a SlowCompositeReaderWrapper and in that case early terminating is beneficial.

This comment has been minimized.

Copy link
@martijnvg

martijnvg Jul 16, 2014

Member

This comment is incorrect, when the document to be percolated contains nested objects or not, there is always one segment... in any case the use of this collector is good for code reuse purposes.

This comment has been minimized.

Copy link
@areek

areek Jul 16, 2014

Author Contributor

thanks @martijnvg for clarifying this!

@areek areek self-assigned this Jul 16, 2014

@areek

This comment has been minimized.

Copy link
Contributor Author

commented Jul 18, 2014

I have added a new terminate_after_count parameter to search and count API, which represents the number of documents to collect per shard. Upon reaching that number, the query will early terminate.

The search request with the new param looks as follows:

curl -XGET 'localhost:9200/_search' -d '
  { 
    terminate_after_count: 1, 
    "query" : ...
  }'

or

curl -XGET 'localhost:9200/_search?terminate_after_count=1' -d '
  { 
    "query" : ...
  }'

and for Count API, the request is as follows:

curl -XGET 'localhost:9200/_count?terminate_after_count=1' -d '
  { 
    "query" : ...
  }'

Both the Search & Count Response will have an added field terminated_after_count which will be true if any of the shards actually early terminated.

@areek areek changed the title Implement early termination for Count API, if threshold is provided Search & Count: Add early termination of document collection option Jul 18, 2014

@areek

This comment has been minimized.

Copy link
Contributor Author

commented Jul 18, 2014

@clintongormley any thoughts on the naming/api for this?

@areek

This comment has been minimized.

Copy link
Contributor Author

commented Jul 18, 2014

Updated PR:

  • add support for early termination to the Search API
  • use early termination in collation for PhraseSuggester
  • Added tests
  • Added terminated_after_count to Search and Count response, to indicate if any of the shards actually early terminated query exec. or not

TODO:

  • add docs for new option in Search API. (not exactly sure where would the best place to add it though)
  • add more tests for new option in Search API

@areek areek added the review label Jul 18, 2014

@areek areek changed the title Search & Count: Add early termination of document collection option Search & Count: Add an option to early terminate document collection Jul 18, 2014

@clintongormley

This comment has been minimized.

Copy link
Member

commented Jul 18, 2014

@areek perhaps terminate_after, just to keep it short and simple. Wondering if the response should be true/false, or if it would be better to return terminated_after: 1000 (and leave out the key if it wasn't terminated early)

@areek

This comment has been minimized.

Copy link
Contributor Author

commented Jul 18, 2014

@clintongormley the problem with terminate_after is that its a little ambiguous, users might think it is a timeout or what not, rather than a threshold to doc collection. Though terminate_after is much nicer then terminate_after_count.
Regarding the response, I don't see the value in specifying the terminated_after count, as if the early termination did happen then this will always be the same as the input param. Specifying it also might confuse users as this is a threshold/shard, it might report num_shards*terminate_after_count hits/counts while having terminated_after: terminate_after_count in the resp.
I think terminate_after is a better name but a little ambiguous (might just roll with it) and for the response I will leave terminated_after out if it was false.
Thoughts?

@clintongormley

This comment has been minimized.

Copy link
Member

commented Jul 19, 2014

@areek i agree with you about the terminate_after count. True/false does make more sense. But if the user does specify terminate_after, I think the flag should be in the response. Perhaps terminated? or terminated_early? Neither of those are perfect, but terminated_after feels like it is missing something...

@areek

This comment has been minimized.

Copy link
Contributor Author

commented Jul 22, 2014

Updated PR:

  • The new request param has been renamed to terminate_after from terminate_after_count.
  • Added documentation
  • The response for both search & count api looks like the following:
  {
    "took": 1,
    "timed_out": false,
    "terminated_early": true,
    "_shards": {
       ...
    },
    "hits": {
       ...
    }
}

The terminated_early will only be visible when the user explicitly sets terminate_after param in the request

@areek

This comment has been minimized.

Copy link
Contributor Author

commented Jul 22, 2014

@s1monw It would be awesome to have this reviewed.

@s1monw

View changes

src/main/java/org/elasticsearch/action/count/CountRequest.java Outdated
return this;
}

public int terminateAfterCount() {

This comment has been minimized.

Copy link
@s1monw

s1monw Jul 22, 2014

Contributor

we usually call the getter also terminateAfter

This comment has been minimized.

Copy link
@areek

areek Jul 22, 2014

Author Contributor

oops refactoring miss. will fix

searcher.search(query, filter, collector);
}
} catch (EarlyTerminationException e) {
// early termination

This comment has been minimized.

Copy link
@s1monw

s1monw Jul 22, 2014

Contributor

can we just return true here?

This comment has been minimized.

Copy link
@areek

areek Jul 22, 2014

Author Contributor

will do

@s1monw

View changes

src/main/java/org/elasticsearch/common/lucene/Lucene.java Outdated
delegate.collect(doc);
}
count++;
if (count >= maxCountHits) {

This comment has been minimized.

Copy link
@s1monw

s1monw Jul 22, 2014

Contributor

mayb just do if (++count >=

This comment has been minimized.

Copy link
@areek

areek Jul 22, 2014

Author Contributor

will change

@s1monw

View changes

src/main/java/org/elasticsearch/common/lucene/Lucene.java Outdated
}

@Override
public void setNextReader(AtomicReaderContext context) throws IOException {
public void setNextReader(AtomicReaderContext atomicReaderContext) throws IOException {
if (delegate != null) {

This comment has been minimized.

Copy link
@s1monw

s1monw Jul 22, 2014

Contributor

hmm can we make a noop collector instad of checking for null all the time?

This comment has been minimized.

Copy link
@areek

areek Jul 22, 2014

Author Contributor

Thats a good idea, will have a noop collector

for (AtomicArray.Entry<? extends QuerySearchResultProvider> entry : queryResults) {
QuerySearchResult result = entry.value.queryResult();
if (result.searchTimedOut()) {
timedOut = true;
}
if (result.terminatedEarly() != null) {

This comment has been minimized.

Copy link
@s1monw

s1monw Jul 22, 2014

Contributor

I actually wonder if we should have a MaybeBoolean class that implements ToXContent and can do these kind of merge operations ie similar of haskel maybe

This comment has been minimized.

Copy link
@areek

areek Jul 23, 2014

Author Contributor

Did not really get the part about ToXContent here. could you elaborate on that?

I was thinking something like:

    class MaybeBoolean {

        private Boolean value;

        public MaybeBoolean(Boolean initialValue) {
            this.value = initialValue;
        }
         // called in the loop
        public void merge(Boolean mergeValue) {
            if (this.value == null && mergeValue != null) {
                this.value = mergeValue;
            } else if (mergeValue != null) {
                this.value |= mergeValue;
            }
        }

        // final value
        public Boolean getValue() {
            return value;
        }
    }

This comment has been minimized.

Copy link
@s1monw

s1monw Jul 23, 2014

Contributor

+1 to that but we don't have do that - lets just stick to what we have

@s1monw

This comment has been minimized.

Copy link
Contributor

commented Jul 22, 2014

I left a couple of comments but it looks great so far

@s1monw s1monw removed the review label Jul 22, 2014

@s1monw

View changes

docs/reference/search/count.asciidoc Outdated
@@ -63,6 +63,9 @@ query.

|default_operator |The default operator to be used, can be `AND` or
`OR`. Defaults to `OR`.

|terminate_after |The maximum count for each shard, upon reaching which the query

This comment has been minimized.

Copy link
@s1monw

s1monw Jul 23, 2014

Contributor

we should add coming[1.4.0] here I guess?

@s1monw

View changes

docs/reference/search/request-body.asciidoc Outdated
@@ -62,6 +62,10 @@ that point when expired. Defaults to no timeout.
`query_and_fetch`. Defaults to `query_then_fetch`. See
<<search-request-search-type,_Search Type_>> for
more details on the different types of search that can be performed.

|`terminate_after` |The maximum number of documents to collect for

This comment has been minimized.

Copy link
@s1monw

s1monw Jul 23, 2014

Contributor

we should add coming[1.4.0] here I guess?

@s1monw

View changes

docs/reference/search/uri-request.asciidoc Outdated
@@ -82,6 +82,10 @@ scores and return them as part of each hit.
within the specified time value and bail with the hits accumulated up to
that point when expired. Defaults to no timeout.

|`terminate_after` |The maximum number of documents to collect for

This comment has been minimized.

Copy link
@s1monw

s1monw Jul 23, 2014

Contributor

we should add coming[1.4.0] here I guess?

@clintongormley

This comment has been minimized.

Copy link
Member

commented Jul 23, 2014

@areek Docs look good, I'd just add a mention of the key in the response when the request has been terminated early

@s1monw

This comment has been minimized.

Copy link
Contributor

commented Jul 23, 2014

did another review and this LGTM too

@areek

This comment has been minimized.

Copy link
Contributor Author

commented Jul 23, 2014

Mentioned the added key terminated_early in response when terminate_after is set. I will merge this soon.

Search & Count: Add option to early terminate doc collection
Allow users to control document collection termination, if a specified terminate_after number is
set. Upon setting the newly added parameter, the response will include a boolean terminated_early
flag, indicating if the document collection for any shard terminated early.

closes #6876

@areek areek merged commit 5487c56 into elastic:master Jul 23, 2014

@areek areek deleted the areek:enhancement/6876 branch Jul 23, 2014

@clintongormley clintongormley changed the title Search & Count: Add an option to early terminate document collection Search: Add an option to early terminate document collection when searching/counting Sep 10, 2014

Mpdreamz added a commit to elastic/elasticsearch-net that referenced this pull request Dec 11, 2014

gmarz added a commit to elastic/elasticsearch-net that referenced this pull request Dec 11, 2014

Mpdreamz added a commit that referenced this pull request Dec 17, 2014

Mpdreamz added a commit that referenced this pull request Dec 17, 2014

@clintongormley clintongormley changed the title Search: Add an option to early terminate document collection when searching/counting Add an option to early terminate document collection when searching/counting Jun 6, 2015

mute pushed a commit to mute/elasticsearch that referenced this pull request Jul 29, 2015

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
You can’t perform that action at this time.