LUCENE-8995: TopSuggestDocsCollector#collect should be able to signal rejection #913

Open · wants to merge 12 commits into master
lucene/core/src/java/org/apache/lucene/util/fst/Util.java (5 changes: 0 additions & 5 deletions)

@@ -460,11 +460,6 @@ public TopResults<T> search() throws IOException {
         continue;
       }

-      if (results.size() == topN-1 && maxQueueDepth == topN) {
-        // Last path -- don't bother w/ queue anymore:
-        queue = null;

Member commented:

Whoa, was this opto breaking something? I guess if this final path is filtered out, we still need the queue? Have you run the suggest benchmarks to see if removing this opto hurt performance?


@cbuescher (Contributor, Author) commented on Oct 16, 2019:

As far as I understand, this optimization assumes we surely accept (and collect) the path later in L516's acceptResult(), which always seems to be the case for collectors that don't reject; but if the collector that is eventually called via NRTSuggester's acceptResult() chooses to reject this option, we were losing expected results. This surfaced in the prefix completion tests I added. @jimczi might be able to explain this a bit better than me.

> Have you run the suggest benchmarks to see if removing this opto hurt performance?

No, where are they and how can I run them?

-      }
-
       // We take path and find its "0 output completion",
       // ie, just keep traversing the first arc with
       // NO_OUTPUT that we can find, since this must lead
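
For context, here is a minimal, self-contained sketch of why nulling the queue on the supposed last path breaks rejecting collectors. This is not the actual Util.TopNSearcher code; all names are illustrative:

    import java.util.ArrayList;
    import java.util.Deque;
    import java.util.List;
    import java.util.function.Predicate;

    class TopNSketch {
      // Collect up to topN paths, consulting acceptResult before keeping each one.
      static <T> List<T> searchTopN(Deque<T> queue, int topN, Predicate<T> acceptResult) {
        List<T> results = new ArrayList<>();
        while (results.size() < topN && queue != null && !queue.isEmpty()) {
          T path = queue.pop();
          // The removed optimization effectively did "queue = null" here once
          // results.size() == topN - 1, assuming the final path is surely accepted...
          if (acceptResult.test(path)) {
            results.add(path);
          }
          // ...but if acceptResult rejects the path, the queue is still needed
          // to supply a replacement candidate; with queue == null the loop
          // would end one result short.
        }
        return results;
      }
    }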

NRTSuggester.java

@@ -39,6 +39,7 @@
 import org.apache.lucene.util.fst.PairOutputs;
 import org.apache.lucene.util.fst.PositiveIntOutputs;
 import org.apache.lucene.util.fst.Util;
+import org.apache.lucene.util.fst.Util.TopResults;

 import static org.apache.lucene.search.suggest.document.NRTSuggester.PayLoadProcessor.parseSurfaceForm;

@@ -200,23 +201,30 @@ protected boolean acceptResult(Util.FSTPath<Pair<Long, BytesRef>> path) {
       if (!scorer.accept(docID, acceptDocs)) {
         return false;
       }
+      boolean duplicateSurfaceForm = false;
+      boolean collected = false;
       if (collector.doSkipDuplicates()) {
         // now record that we've seen this surface form:
         char[] key = new char[spare.length()];
         System.arraycopy(spare.chars(), 0, key, 0, spare.length());
         if (collector.seenSurfaceForms.contains(key)) {
           // we already collected a higher scoring document with this key, in this segment:
-          return false;
+          duplicateSurfaceForm = true;
+        } else {
+          collector.seenSurfaceForms.add(key);
         }
-        collector.seenSurfaceForms.add(key);
       }
-      try {
-        float score = scorer.score(decode(path.output.output1), path.boost);
-        collector.collect(docID, spare.toCharsRef(), path.context, score);
-        return true;
-      } catch (IOException e) {
-        throw new RuntimeException(e);
-      }
+
+      // only try collecting if we didn't already detect a duplicate surface form
+      // (only possible when collector.doSkipDuplicates() is true)
+      if (duplicateSurfaceForm == false) {
+        try {
+          float score = scorer.score(decode(path.output.output1), path.boost);
+          collected = collector.collect(docID, spare.toCharsRef(), path.context, score);
+        } catch (IOException e) {
+          throw new RuntimeException(e);
+        }
+      }
+      return collected;
     }
   };

@@ -239,7 +247,11 @@ protected boolean acceptResult(Util.FSTPath<Pair<Long, BytesRef>> path) {
       }
       // hits are also returned by search()
       // we do not use it, instead collect at acceptResult
-      searcher.search();
+      TopResults<Pair<Long,BytesRef>> results = searcher.search();
+      if (results.isComplete == false) {
+        collector.notComplete();
+      }
+      return;
       // search admissibility is not guaranteed
       // see comment on getMaxTopNSearcherQueueSize
       // assert search.isComplete;

TopSuggestDocsCollector.java

@@ -63,6 +63,8 @@ public class TopSuggestDocsCollector extends SimpleCollector {
   /** Document base offset for the current Leaf */
   protected int docBase;

+  private boolean isComplete = true;
+
   /**
    * Sole constructor
    *
@@ -116,7 +118,7 @@ protected void doSetNextReader(LeafReaderContext context) throws IOException {
    * NOTE: collection at the leaf level is guaranteed to be in
    * descending order of score
    */
-  public void collect(int docID, CharSequence key, CharSequence context, float score) throws IOException {
+  public boolean collect(int docID, CharSequence key, CharSequence context, float score) throws IOException {
     SuggestScoreDoc current = new SuggestScoreDoc(docBase + docID, key, context, score);
     if (current == priorityQueue.insertWithOverflow(current)) {
       // if the current SuggestScoreDoc has overflown from pq,
@@ -125,6 +127,7 @@ public void collect(int docID, CharSequence key, CharSequence context, float score) throws IOException {
       // TODO: reuse the overflow instance?
       throw new CollectionTerminatedException();
     }
+    return true;
   }

/**
@@ -201,4 +204,19 @@ public void collect(int doc) throws IOException {
   public ScoreMode scoreMode() {
     return ScoreMode.COMPLETE;
   }

+  /**
+   * Returns true if the collector exhausted all possibilities to collect results
+   */
+  public boolean isComplete() {
+    return this.isComplete;
+  }
+
+  /**
+   * Call to signal that during collection at least one segment might have returned incomplete
+   * results, e.g. because of too many rejections
+   */
+  void notComplete() {
+    this.isComplete = false;
+  }
}
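
A usage illustration of the new API (hypothetical caller code, not part of this PR; suggestIndexSearcher and query are assumed to exist, and the docID % 2 check is a stand-in for an application-specific rejection rule):

    TopSuggestDocsCollector collector = new TopSuggestDocsCollector(5, false) {
      @Override
      public boolean collect(int docID, CharSequence key, CharSequence context, float score) throws IOException {
        if (docID % 2 == 0) {   // stand-in for an application-specific veto
          return false;         // signal rejection back to the suggester
        }
        return super.collect(docID, key, context, score);
      }
    };
    suggestIndexSearcher.suggest(query, collector);
    if (collector.isComplete() == false) {
      // at least one segment may have exhausted its candidate paths before
      // topN results were accepted, so fewer suggestions may be returned
    }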

@@ -17,10 +17,15 @@
 package org.apache.lucene.search.suggest.document;

 import java.io.IOException;
+import java.util.ArrayList;
+import java.util.HashSet;
+import java.util.List;
 import java.util.Objects;
+import java.util.Set;

 import org.apache.lucene.analysis.Analyzer;
 import org.apache.lucene.analysis.MockAnalyzer;
+import org.apache.lucene.analysis.MockSynonymAnalyzer;
 import org.apache.lucene.analysis.MockTokenFilter;
 import org.apache.lucene.analysis.MockTokenizer;
 import org.apache.lucene.document.Document;
@@ -253,6 +258,61 @@ public void testDocFiltering() throws Exception {
     iw.close();
   }

+  /**
+   * Test that the correct number of documents is collected when using a collector that also rejects documents.
+   */
+  public void testCollectorThatRejects() throws Exception {
+    // use a synonym analyzer to get multiple paths to the same suggested document; this mock adds "dog" as a synonym for "dogs"
+    Analyzer analyzer = new MockSynonymAnalyzer();
+    RandomIndexWriter iw = new RandomIndexWriter(random(), dir, iwcWithSuggestField(analyzer, "suggest_field"));
+    List<Entry> expectedResults = new ArrayList<Entry>();
+
+    for (int docCount = 10; docCount > 0; docCount--) {
+      Document document = new Document();
+      String value = "ab" + docCount + " dogs";
+      document.add(new SuggestField("suggest_field", value, docCount));
+      expectedResults.add(new Entry(value, docCount));
+      iw.addDocument(document);
+    }
+
+    if (rarely()) {
+      iw.commit();
+    }
+
+    DirectoryReader reader = iw.getReader();
+    SuggestIndexSearcher indexSearcher = new SuggestIndexSearcher(reader);
+
+    PrefixCompletionQuery query = new PrefixCompletionQuery(analyzer, new Term("suggest_field", "ab"));
+    int topN = 5;
+
+    // use a TopSuggestDocsCollector that rejects results with duplicate docIds
+    TopSuggestDocsCollector collector = new TopSuggestDocsCollector(topN, false) {
+
+      private Set<Integer> seenDocIds = new HashSet<>();
+
+      @Override
+      public boolean collect(int docID, CharSequence key, CharSequence context, float score) throws IOException {
+        int globalDocId = docID + docBase;
+        boolean collected = false;
+        if (seenDocIds.contains(globalDocId) == false) {

Member commented:

Hmm why would the suggester even send the same docID multiple times? There is only one suggestion (SuggestField) per indexed document in this test ... maybe improve the test to assert that sometimes we reject and sometimes we don't?

@cbuescher (Contributor, Author) commented:

The collector is called multiple times with the same docID because of the MockSynonymAnalyzer used in the test setup, which adds "dog" for "dogs", so each document has two completion paths. This collector is meant to deduplicate them. I added a note explaining this. This is a simplified version of the behaviour we observe in elastic/elasticsearch#46445.
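
To make the two paths concrete (an illustration of the comment above, not output from the test):

    surface form "ab7 dogs"
      analyzed path 1: "ab7 dogs"
      analyzed path 2: "ab7 dog"    (MockSynonymAnalyzer adds "dog" for "dogs")

Both paths lead to the same docID, so collect() is invoked twice for it and the second call is rejected.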

+          super.collect(docID, key, context, score);
+          seenDocIds.add(globalDocId);
+          collected = true;
+        }
+        return collected;
+      }
+    };
+
+    indexSearcher.suggest(query, collector);
+    assertSuggestions(collector.get(), expectedResults.subList(0, topN).toArray(new Entry[0]));
+
+    // TODO expecting true here, why false?

@jimczi (Contributor) commented on Oct 2, 2019:

It seems that NRTSuggesterBuilder#maxAnalyzedPathsPerOutput is not computed correctly. From what I understand, it records the number of suggestions with the same analyzed form, but the comment says it should be the highest number of analyzed paths we saw for any input surface form. So imo this is a bug; it's not exactly related to this change, so we should probably open a new issue for it.

Member commented:

+1 to open a new issue for this.

Indeed, the maxAnalyzedPathsPerOutput comment in NRTSuggester.java seems to claim it's the max number of analyzed forms for a surface form (input), e.g. a graph token stream would create multiple analyzed forms.

But what NRTSuggesterBuilder.java seems to actually compute is the maximum number of unique surface forms that analyze to the same analyzed form.

And so I think the admissibility of the search is in question -- we use this to size the queue "properly", but it's not clear that works today?

Let's definitely open a new issue ... this stuff is hard to think about!
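
To make the two readings concrete (illustrative example, not taken from the code):

    what the comment claims:   one surface form, several analyzed forms
      "dogs" -> {"dogs", "dog"}      (2 analyzed paths for 1 input)
    what the builder computes: several surface forms, one analyzed form
      {"dogs", "dog"} -> "dog"       (2 inputs sharing 1 analyzed form)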

Contributor commented:

I'll open an issue. I also wonder if we shouldn't rely on the fact that the top suggest collector will early-terminate anyway: whenever we expect rejections (because of deleted docs or because we deduplicate suggestions per doc) we could set the queue size to its maximum value (5000). Currently we have different heuristics that try to pick a sensible value automatically, but there is no guarantee of admissibility. For instance, if we want to deduplicate by document id we should ensure that the queue size is greater than topN*maxAnalyzedValuesPerDoc, and we'd need to compute that value at index time.
I may be completely off, but it would be interesting to see the effect of setting the queue size to its maximum value on all searches. That way admissibility is easier to reason about and we don't need to correlate it with the choice made by the heuristic.
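
As a worked example of the sizing concern, under the assumptions in the comment above: with topN = 5 and maxAnalyzedValuesPerDoc = 2, deduplicating by document id can reject up to one path per accepted document, so the searcher may need to examine topN * maxAnalyzedValuesPerDoc = 10 paths; any queue smaller than that cannot guarantee 5 accepted results, hence no admissibility.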

+    assertFalse(collector.isComplete());
+
+    reader.close();
+    iw.close();
+  }

   public void testAnalyzerDefaults() throws Exception {
     Analyzer analyzer = new MockAnalyzer(random(), MockTokenizer.WHITESPACE, true, MockTokenFilter.ENGLISH_STOPSET);
     CompletionAnalyzer completionAnalyzer = new CompletionAnalyzer(analyzer);