Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Improve percolate query performance by not verifying certain candidate matches #18696

Closed

Conversation

Projects
None yet
3 participants
@martijnvg
Copy link
Member

commented Jun 2, 2016

We can skip the MemoryIndex verification for candidate matches that we know upfront are always an actual match and we don't care about scores (so percolate query is wrapped in bool filter clause or constant_score query). Skipping the MemoryIndex verification can boost performance as that is the most expensive step during the executing of the percolate query.

At index time during query term extraction we store in a numeric docvalues field wether this candidate match is verified. So if during pre selecting this query matches we check if this doc values field is set and if so report it as a match.

A number queries that can skip MemoryIndex verification step:

  • TermQuery
  • TermsQuery
  • BooleanQuery with only should clauses, no minimum_should_match set and at least one should clause is a verified query.
  • BooleanQuery with a single must clause that is a verified query.

@martijnvg martijnvg force-pushed the martijnvg:percolator_verified_candidate_match branch Jun 13, 2016

@jpountz

View changes

modules/percolator/src/main/java/org/elasticsearch/percolator/ExtractQueryTermsService.java Outdated
terms.addAll(extractQueryTerms(clause.getQuery()));
Result subResult = extractQueryTerms(clause.getQuery());
if (mnsm <= 1 && numProhibitedClauses == 0 && subResult.verified) {
verified = true;

This comment has been minimized.

Copy link
@jpountz

jpountz Jun 15, 2016

Contributor

this looks wrong: verified will be true if any of the subs is verified, instead it should be true by default and then switched to false if any of the subs is not verified?

@jpountz

View changes

modules/percolator/src/main/java/org/elasticsearch/percolator/ExtractQueryTermsService.java Outdated
}
return terms;
return new Result(disjuncts.size() <= 1, terms);

This comment has been minimized.

Copy link
@jpountz

jpountz Jun 15, 2016

Contributor

it should be the same logic as disjunctions?

@jpountz

View changes

modules/percolator/src/main/java/org/elasticsearch/percolator/ExtractQueryTermsService.java Outdated
@@ -281,10 +305,31 @@ public static Query createQueryTermsQuery(IndexReader indexReader, String queryM
return new TermsQuery(extractedTerms);
}

This comment has been minimized.

Copy link
@jpountz

jpountz Jun 15, 2016

Contributor

do we want to add a case for MatchAllDocsQuery now that we are adding this ability to skip the MemoryIndex?

@jpountz

View changes

modules/percolator/src/main/java/org/elasticsearch/percolator/ExtractQueryTermsService.java Outdated
* For example the query contained an unsupported query (e.g. WildcardQuery).
* @param fieldType The field type for the query metadata field
* @param verifiedCandidateField The field used to mark if query extraction was accurate and candidate matches don't
* need to be verified by {@link org.apache.lucene.index.memory.MemoryIndex} at query time.
*/
public static void extractQueryTerms(Query query, ParseContext.Document document, String queryTermsFieldField,

This comment has been minimized.

Copy link
@jpountz

jpountz Jun 15, 2016

Contributor

maybe we should use an indexed field rather than a doc values field, this should compress better and would be as efficient at searc time?

in particular I am thinking we could use the same field to indicate that extraction was failed, partial (needs MemoryIndex) or complete (does not need MemoryIndex) using 3 different values?

This comment has been minimized.

Copy link
@martijnvg

martijnvg Jun 16, 2016

Author Member

in particular I am thinking we could use the same field to indicate that extraction was failed, partial (needs MemoryIndex) or complete (does not need MemoryIndex) using 3 different values?

+1 Good idea.

@jpountz

This comment has been minimized.

Copy link
Contributor

commented Jun 15, 2016

I left some comments, but I like it!

@martijnvg martijnvg force-pushed the martijnvg:percolator_verified_candidate_match branch 2 times, most recently Jun 16, 2016

@jpountz

View changes

modules/percolator/src/main/java/org/elasticsearch/percolator/ExtractQueryTermsService.java Outdated
if (query instanceof MatchAllDocsQuery) {
// we need to emit an empty term otherwise a percolator query with this query may never end up as a
// candidate match:
return new Result(true, new Term("", ""));

This comment has been minimized.

Copy link
@jpountz

jpountz Jun 17, 2016

Contributor

this hack looks a bit dangerous to me, maybe we need a special extraction status for it rather than using the empty term?

This comment has been minimized.

Copy link
@martijnvg

martijnvg Jun 17, 2016

Author Member

agreed

@jpountz

View changes

modules/percolator/src/main/java/org/elasticsearch/percolator/ExtractQueryTermsService.java Outdated
// candidate match:
return new Result(true, new Term("", ""));
} else if (query instanceof MatchNoDocsQuery) {
return new Result(false);

This comment has been minimized.

Copy link
@jpountz

jpountz Jun 17, 2016

Contributor

should it be new Result(true)? extrcation is complete?

This comment has been minimized.

Copy link
@jpountz

jpountz Jun 17, 2016

Contributor

then I think we would not need canIgnore anymore?

This comment has been minimized.

Copy link
@martijnvg

martijnvg Jun 17, 2016

Author Member

It is true that extraction is complete, but we use the verified boolean also as an indication to skip memory index verification and emit if a doc is a match. In the case that a percolator query contains just a MatchNoDocsQuery (happens if match query text analysis doesn't emit any token) then this would kind of be an illegal indication.

Maybe complete should have two outcomes? complete_match and complete_no_match?

@jpountz

View changes

modules/percolator/src/main/java/org/elasticsearch/percolator/ExtractQueryTermsService.java Outdated
}
if (bestClause != null) {
return bestClause;
// 1 required clause, no prohibited clause, maybe some optional clauses, so it is verified:
boolean verified = mnsm <= 1 && numProhibitedClauses == 0 && numRequiredClauses == 1;

This comment has been minimized.

Copy link
@jpountz

jpountz Jun 17, 2016

Contributor

should it be mnsm == 0 rather than mnsm <= 1? if there is one MUST and one should clause and msm == 1 then this is actually a conjunction?

This comment has been minimized.

Copy link
@martijnvg

martijnvg Jun 17, 2016

Author Member

true, because we have not taken into account any of the optional clauses at this point, we can't make this assumption here. I'll change it to mnsm == 0.

@jpountz

View changes

modules/percolator/src/main/java/org/elasticsearch/percolator/ExtractQueryTermsService.java Outdated
Set<Term> terms = new HashSet<>();
for (BooleanClause clause : clauses) {
if (clause.isProhibited()) {
// we don't need to remember the things that do *not* match...
continue;
}
terms.addAll(extractQueryTerms(clause.getQuery()));
Result subResult = extractQueryTerms(clause.getQuery());
if (subResult.canIgnore()) {

This comment has been minimized.

Copy link
@jpountz

jpountz Jun 17, 2016

Contributor

I don't think we need it?

This comment has been minimized.

Copy link
@martijnvg

martijnvg Jun 17, 2016

Author Member

I added this because so that MatchNoDocsQuery as part of bigger queries are just ignored and if they stand on their own we never mistake it for a match.

@jpountz

View changes

modules/percolator/src/main/java/org/elasticsearch/percolator/ExtractQueryTermsService.java Outdated

List<Term> extractedTerms = new ArrayList<>();
extractedTerms.add(new Term(unknownQueryField));
Collections.addAll(extractedTerms, optionalTerms);
// to include percolator queries with match_all queries as candidate matches:

This comment has been minimized.

Copy link
@jpountz

jpountz Jun 17, 2016

Contributor

we can avoid this hack if we use a special status for match_all?

@jpountz

View changes

modules/percolator/src/main/java/org/elasticsearch/percolator/PercolatorFieldMapper.java Outdated
@@ -76,15 +75,15 @@ public PercolatorFieldMapper build(BuilderContext context) {
context.path().add(name());
KeywordFieldMapper extractedTermsField = createExtractQueryFieldBuilder(EXTRACTED_TERMS_FIELD_NAME, context);
((PercolatorFieldType) fieldType).queryTermsField = extractedTermsField.fieldType();
KeywordFieldMapper unknownQueryField = createExtractQueryFieldBuilder(UNKNOWN_QUERY_FIELD_NAME, context);
((PercolatorFieldType) fieldType).unknownQueryField = unknownQueryField.fieldType();
KeywordFieldMapper ExtractionResultField = createExtractQueryFieldBuilder(EXTRACTION_RESULT_FIELD_NAME, context);

This comment has been minimized.

Copy link
@jpountz

jpountz Jun 17, 2016

Contributor

s/ExtractionResultField/extractionResultField/

@jpountz

This comment has been minimized.

Copy link
Contributor

commented Jun 17, 2016

I just did another round.

@jpountz

View changes

modules/percolator/src/main/java/org/elasticsearch/percolator/ExtractQueryTermsService.java Outdated
}
if (result.alwaysMatch) {
document.add(new Field(extractionResultField, EXTRACTION_ALWAYS_MATCH, fieldType));
}

This comment has been minimized.

Copy link
@jpountz

jpountz Jun 17, 2016

Contributor

should we do:

+        if (result.alwaysMatch) {
+            document.add(new Field(extractionResultField, EXTRACTION_ALWAYS_MATCH, fieldType));
+        } else if (result.verified) {
+            document.add(new Field(extractionResultField, EXTRACTION_COMPLETE, fieldType));
+        } else {
+            document.add(new Field(extractionResultField, EXTRACTION_PARTIAL, fieldType));
+        }

It seems weird otherwise that a doc would have both EXTRACTION_ALWAYS_MATCH and another term about whether terms could be extracted?

This comment has been minimized.

Copy link
@martijnvg

martijnvg Jun 17, 2016

Author Member

I think we can do this, but this does mean that in PercolateQuery.java line 87 we would need to use a TermsQuery instead of a TermQuery for the verifiedQueriesQuery so that we skip memory index verification for percolator queries containing only match_all queries.

This comment has been minimized.

Copy link
@jpountz

jpountz Jun 20, 2016

Contributor

Oh I see. I like it better the current way better then. I was confused by the fact that you could have both ALWAYS and PARTIAL in the same doc, maybe we could add an assertion that it never happers.

@jpountz

View changes

modules/percolator/src/main/java/org/elasticsearch/percolator/ExtractQueryTermsService.java Outdated
this.terms = terms;
this.verified = verified;
this.alwaysMatch = false;
}

This comment has been minimized.

Copy link
@jpountz

jpountz Jun 20, 2016

Contributor

Can we reduce the number of constructors and make them call one another. Also I think it would be nice to have some assertions like eg. it does not make sense to have verified = false and alwaysMatch = true?

@jpountz

View changes

modules/percolator/src/main/java/org/elasticsearch/percolator/ExtractQueryTermsService.java Outdated
if (hasRequiredClauses) {
UnsupportedQueryException uqe = null;
if (numRequiredClauses > 0) {
boolean alwaysMatch = true;

This comment has been minimized.

Copy link
@jpountz

jpountz Jun 20, 2016

Contributor

I think the current impl would return alwaysMatch=true if there is a MUST clause with a match_all query and a MUST_NOT clause, which is wrong?

@jpountz

View changes

modules/percolator/src/main/java/org/elasticsearch/percolator/PercolateQuery.java Outdated
public void extractQueryTermsQuery(String extractedTermsFieldName, String unknownQueryFieldname) throws IOException {
public void extractQueryTermsQuery(String extractedTermsFieldName, String extractionResultField,
boolean nestedMemoryIndex) throws IOException {
if (nestedMemoryIndex) {

This comment has been minimized.

Copy link
@jpountz

jpountz Jun 20, 2016

Contributor

I think I understand why the nested case does not work but can you leave a comment?

@jpountz

View changes

modules/percolator/src/main/java/org/elasticsearch/percolator/PercolateQuery.java Outdated
return new BaseScorer(this, approximation, queries, percolatorIndexSearcher) {

boolean notExhausted = scorer != null;
DocIdSetIterator iterator = notExhausted ? scorer.iterator() : null;

This comment has been minimized.

Copy link
@jpountz

jpountz Jun 20, 2016

Contributor

can we rename it to something more explicit, like verifiedDocs or verifiedDocsIterator ?

This comment has been minimized.

Copy link
@jpountz

jpountz Jun 20, 2016

Contributor

or maybe even better: I think you can simlify the logic below by using Lucene.asSequentialAccessBits (that we also use in function_score and filter(s) aggs) to check whether a doc matches the scorer.

This comment has been minimized.

Copy link
@martijnvg

martijnvg Jun 20, 2016

Author Member

👍 great idea! I didn't like the code to begin with.

@martijnvg martijnvg force-pushed the martijnvg:percolator_verified_candidate_match branch Jun 21, 2016

@martijnvg

This comment has been minimized.

Copy link
Member Author

commented Jun 21, 2016

@jpountz Thx! I've updated and rebased the PR. The last 4 commits are actually new.

@martijnvg martijnvg force-pushed the martijnvg:percolator_verified_candidate_match branch Jun 21, 2016

@jpountz

View changes

modules/percolator/src/main/java/org/elasticsearch/percolator/ExtractQueryTermsService.java Outdated
}

static Function<Query, Result> spanNotQuery() {
return query -> extractQueryTerms(((SpanNotQuery) query).getInclude());

This comment has been minimized.

Copy link
@jpountz

jpountz Jun 22, 2016

Contributor

this does not look right, we also need to make sure to set verified to false?

This comment has been minimized.

Copy link
@martijnvg

martijnvg Jun 23, 2016

Author Member

this worked out, because spanTermQuery() always returned a result with verified set to false

@jpountz

View changes

modules/percolator/src/main/java/org/elasticsearch/percolator/ExtractQueryTermsService.java Outdated
static Function<Query, Result> spanTermQuery() {
return query -> {
Term term = ((SpanTermQuery) query).getTerm();
return new Result(false, Collections.singleton(term));

This comment has been minimized.

Copy link
@jpountz

jpountz Jun 22, 2016

Contributor

should span term and span or return true for verified?

This comment has been minimized.

Copy link
@martijnvg

martijnvg Jun 23, 2016

Author Member

Yes, that makes sense, since on its own it is just a term query. I'll also update the other span query methods to explicitly set verified to false.

@jpountz

View changes

modules/percolator/src/main/java/org/elasticsearch/percolator/ExtractQueryTermsService.java Outdated
numProhibitedClauses++;
}
if (clause.isScoring()) {
numOptionalClauses++;

This comment has been minimized.

Copy link
@jpountz

jpountz Jun 22, 2016

Contributor

clause.isScoring is also true for MUST clauses, so I think the naming is confusing

This comment has been minimized.

Copy link
@martijnvg

martijnvg Jun 23, 2016

Author Member

naming is correct, my understanding of clause.isScoring() is not :)

@jpountz

View changes

modules/percolator/src/main/java/org/elasticsearch/percolator/ExtractQueryTermsService.java Outdated
try {
temp = extractQueryTerms(clause.getQuery());
if (temp.noTerms) {
continue;

This comment has been minimized.

Copy link
@jpountz

jpountz Jun 22, 2016

Contributor

temp.noTerms && temp.verified? otherwise we would have issues with MUST clauses containing match_no_docs queries?

@jpountz

View changes

modules/percolator/src/main/java/org/elasticsearch/percolator/ExtractQueryTermsService.java Outdated
if (clause.isProhibited()) {
// we don't need to remember the things that do *not* match...
continue;
if (clause.isScoring()) {

This comment has been minimized.

Copy link
@jpountz

jpountz Jun 22, 2016

Contributor

should be clause.getOccur() == SHOULD

This comment has been minimized.

Copy link
@jpountz

jpountz Jun 22, 2016

Contributor

probably means we need to beef up tests a bit more?

This comment has been minimized.

Copy link
@martijnvg

martijnvg Jun 23, 2016

Author Member

The reason did work out is that if there were must clauses this else would not have been executed, the above if clause would have been executed. Tests would have failed if we got here and there were must clauses in the bq.

I agree that this looks confusing and we should use clause.getOccur() == SHOULD instead.

@jpountz

View changes

modules/percolator/src/main/java/org/elasticsearch/percolator/ExtractQueryTermsService.java Outdated
}
final Set<Term> terms;
final boolean verified;
final boolean noTerms;

This comment has been minimized.

Copy link
@jpountz

jpountz Jun 22, 2016

Contributor

naive question but can we nuke noTerms and use an empty set of terms?

This comment has been minimized.

Copy link
@martijnvg

martijnvg Jun 23, 2016

Author Member

Like we discussed yesterday, we should remove the support for MatchAllDocsQuery as it complicated this code too much already. The wins are small as most percolator queries will not have this query. We should focus on the big wins or easy optimizations.

@jpountz

This comment has been minimized.

Copy link
Contributor

commented Jun 22, 2016

I left some more comments.

martijnvg added some commits Jun 2, 2016

percolator: Don't verify candidate matches if we know queries match.
If we don't care about scoring then for certain we can be certain if they are a candidate match that they will match.
Verifying these queries with the MemoryIndex can be skipped.
removed doc values field and changed unknownQuery field into extracti…
…on_result field that can have 3 different values depending on the outcome of query term extraction: `failed`, `partial` and `complete`.
support match_all queries
The match_all query is tricky, since it matches with all docs, but no terms can be extracted from it.
So for this reason with add an empty extract term, add percolate time we then also always add this empty term.
So at index time we safely include the match_all query in for example disjunct queries and if a percolate
query only contains a match_all query we can safely skip the MemoryIndex verification.
removed match_all hack
match no docs is always verified

@martijnvg martijnvg force-pushed the martijnvg:percolator_verified_candidate_match branch to 9e67355 Jun 23, 2016

}

/**
* Creates a boolean query with a should clause for each term on all fields of the specified index reader.

This comment has been minimized.

Copy link
@jpountz

jpountz Jun 24, 2016

Contributor

maybe update the docs to say this is a terms query rather than a bool

}
Result() {
this.verified = true;
this.terms = Collections.emptySet();

This comment has been minimized.

Copy link
@jpountz

jpountz Jun 24, 2016

Contributor

I'd vote for removing this constructor and change the caller to provide explicit arguments since it is not obvious what the defaulh parameter values should be

This comment has been minimized.

Copy link
@martijnvg

martijnvg Jun 24, 2016

Author Member

+1

@jpountz

This comment has been minimized.

Copy link
Contributor

commented Jun 24, 2016

LGTM

@martijnvg

This comment has been minimized.

Copy link
Member Author

commented Jun 24, 2016

thanks @jpountz!

@martijnvg

This comment has been minimized.

Copy link
Member Author

commented Jun 24, 2016

Pushed via to master via: 599a548

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
You can’t perform that action at this time.