LUCENE-8995: TopSuggestDocsCollector#collect should be able to signal rejection #913
base: master
```java
@@ -17,10 +17,15 @@
package org.apache.lucene.search.suggest.document;

import java.io.IOException;
import java.util.ArrayList;
import java.util.HashSet;
import java.util.List;
import java.util.Objects;
import java.util.Set;

import org.apache.lucene.analysis.Analyzer;
import org.apache.lucene.analysis.MockAnalyzer;
import org.apache.lucene.analysis.MockSynonymAnalyzer;
import org.apache.lucene.analysis.MockTokenFilter;
import org.apache.lucene.analysis.MockTokenizer;
import org.apache.lucene.document.Document;
```
```java
@@ -253,6 +258,61 @@ public void testDocFiltering() throws Exception {
    iw.close();
  }

  /**
   * Test that the correct number of documents is collected when using a collector
   * that also rejects documents.
   */
  public void testCollectorThatRejects() throws Exception {
    // use a synonym analyzer to have multiple paths to the same suggested document;
    // this mock adds "dog" as a synonym for "dogs"
    Analyzer analyzer = new MockSynonymAnalyzer();
    RandomIndexWriter iw = new RandomIndexWriter(random(), dir, iwcWithSuggestField(analyzer, "suggest_field"));
    List<Entry> expectedResults = new ArrayList<Entry>();

    for (int docCount = 10; docCount > 0; docCount--) {
      Document document = new Document();
      String value = "ab" + docCount + " dogs";
      document.add(new SuggestField("suggest_field", value, docCount));
      expectedResults.add(new Entry(value, docCount));
      iw.addDocument(document);
    }

    if (rarely()) {
      iw.commit();
    }

    DirectoryReader reader = iw.getReader();
    SuggestIndexSearcher indexSearcher = new SuggestIndexSearcher(reader);

    PrefixCompletionQuery query = new PrefixCompletionQuery(analyzer, new Term("suggest_field", "ab"));
    int topN = 5;

    // use a TopSuggestDocsCollector that rejects results with duplicate docIds
    TopSuggestDocsCollector collector = new TopSuggestDocsCollector(topN, false) {

      private Set<Integer> seenDocIds = new HashSet<>();

      @Override
      public boolean collect(int docID, CharSequence key, CharSequence context, float score) throws IOException {
        int globalDocId = docID + docBase;
        boolean collected = false;
        if (seenDocIds.contains(globalDocId) == false) {
```
**Review comment:** Hmm, why would the suggester even send the same docID?

**Reply:** The collector is called multiple times with the same docID because of the MockSynonymAnalyzer used in the test setup, which adds "dog" as a synonym for "dogs", so each document has two completion paths. This collector is meant to de-duplicate them. I added a note explaining this. This is a simplified version of the behaviour we observe in elastic/elasticsearch#46445.
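The situation described in the reply (one document reachable through several analyzed paths, with the collector rejecting the repeats) can be modeled without any Lucene dependencies. A minimal, self-contained sketch — all names and the hit layout are hypothetical, not the actual suggester API:

```java
import java.util.ArrayList;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

// Hypothetical stand-in for the suggester/collector interaction: the synonym
// "dog"/"dogs" gives every document two analyzed paths, so the same doc id is
// offered twice and the de-duplicating collector must reject the repeat.
public class DedupCollectorSketch {

    // hits are (docId, pathId) pairs in the order a suggester might emit them
    static List<Integer> collectTopN(int[][] hits, int topN) {
        Set<Integer> seen = new HashSet<>();
        List<Integer> collected = new ArrayList<>();
        for (int[] hit : hits) {
            if (collected.size() == topN) {
                break;                  // enough distinct documents collected
            }
            if (seen.add(hit[0])) {     // add() returns false for a duplicate
                collected.add(hit[0]);  // accepted: first path for this doc
            }                           // otherwise rejected: later path, same doc
        }
        return collected;
    }

    public static void main(String[] args) {
        int[][] hits = { {1, 0}, {1, 1}, {2, 0}, {2, 1}, {3, 0}, {3, 1} };
        System.out.println(collectTopN(hits, 2)); // [1, 2]
    }
}
```

This mirrors the test's anonymous collector: the `seen` set plays the role of `seenDocIds`, and the boolean return of `Set.add` stands in for the accept/reject signal.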
```java
          super.collect(docID, key, context, score);
          seenDocIds.add(globalDocId);
          collected = true;
        }
        return collected;
      }
    };

    indexSearcher.suggest(query, collector);
    assertSuggestions(collector.get(), expectedResults.subList(0, topN).toArray(new Entry[0]));
```
```java
    // TODO expecting true here, why false?
    assertFalse(collector.isComplete());

    reader.close();
    iw.close();
  }
```

**Review comment:** It seems that `NRTSuggesterBuilder#maxAnalyzedPathsPerOutput` is not computed correctly. From what I understand, it records the number of suggestions with the same analyzed form, but the comment says that it should be the highest number of analyzed paths we saw for any input surface form. So IMO this is a bug; it's not exactly related to this change, so we should probably open a new issue for it.

**Reply:** +1 to open a new issue for this. Indeed, […]. But what it seems to actually be computing, in […], is […]. And so I think the admissibility of the search is in question — we use this to size the queue "properly", but it's not clear that works today? Let's definitely open a new issue ... this stuff is hard to think about!

**Reply:** I'll open an issue. I also wonder if we shouldn't rely on the fact that the top suggest collector will also early terminate, so whenever we expect rejection (because of deleted docs or because we deduplicate suggestions per doc) we could set the queue size to its maximum value (5000). Currently we have different heuristics that try to pick a sensible value automatically, but there is no guarantee of admissibility. For instance, if we want to deduplicate by document id, we should ensure that the queue size is greater than […]
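The admissibility concern in this thread can be illustrated with a toy model (entirely hypothetical, no Lucene types): if each document contributes two analyzed paths and the collector rejects duplicates, a traversal budget of exactly `topN` queue entries can run out before `topN` distinct documents are found, while a budget of `pathsPerDoc * topN` suffices for this path grouping.

```java
import java.util.HashSet;
import java.util.Set;

// Toy model (hypothetical, no Lucene code): with pathsPerDoc analyzed paths
// per document, a budget of exactly topN slots can exhaust before topN
// distinct documents are found, while pathsPerDoc * topN slots are enough.
public class QueueSizingSketch {

    static int distinctDocsFound(int pathsPerDoc, int numDocs, int budget) {
        Set<Integer> seen = new HashSet<>();
        int consumed = 0;
        // paths arrive grouped per document: doc 0's paths, then doc 1's, ...
        for (int doc = 0; doc < numDocs && consumed < budget; doc++) {
            for (int p = 0; p < pathsPerDoc && consumed < budget; p++) {
                consumed++;      // each traversed path occupies one queue slot
                seen.add(doc);   // duplicate paths are rejected by the collector
            }
        }
        return seen.size();
    }

    public static void main(String[] args) {
        int topN = 5;
        System.out.println(distinctDocsFound(2, 10, topN));      // 3: under-sized queue
        System.out.println(distinctDocsFound(2, 10, 2 * topN));  // 5: admissible here
    }
}
```

The worst case in this model is the grouping shown above (all of a document's paths adjacent and highly ranked), which is why a queue sized only to `topN` cannot guarantee `topN` distinct results once rejection is possible.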
```java
  public void testAnalyzerDefaults() throws Exception {
    Analyzer analyzer = new MockAnalyzer(random(), MockTokenizer.WHITESPACE, true, MockTokenFilter.ENGLISH_STOPSET);
    CompletionAnalyzer completionAnalyzer = new CompletionAnalyzer(analyzer);
```
**Review comment:** Whoa, was this opto breaking something? I guess if this final path is filtered out, we still need the queue? Have you run the suggest benchmarks to see if removing this opto hurt performance?

**Reply:** As far as I understand, this optimization assumes we surely accept (and collect) the path later in `acceptResult()` (L516), which always seems to be the case for collectors that don't reject. But if the collector that is eventually called via `NRTSuggester`'s `acceptResult()` chooses to reject this option, we were losing expected results. This surfaced in the prefix completion tests I added. @jimczi might be able to explain this a bit better than me.

No, where are they and how can I run them?
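The failure mode described in that reply can be sketched with a toy model (hypothetical, not the actual `NRTSuggester` code): a search that stops traversing once `topN` paths have been *offered*, on the assumption that every offered path is accepted, returns fewer than `topN` results as soon as the collector is allowed to reject.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Set;

// Toy model of the removed optimization: with "opto" the search stops after n
// paths have been offered (assuming all were collected); without it, it keeps
// going until n paths were actually accepted by the (rejecting) collector.
public class EarlyAcceptSketch {

    static List<Integer> search(int[] paths, Set<Integer> rejected, int n, boolean opto) {
        List<Integer> out = new ArrayList<>();
        int offered = 0;
        for (int p : paths) {
            if (opto ? offered == n : out.size() == n) {
                break;              // opto assumes every offered path was accepted
            }
            offered++;
            if (!rejected.contains(p)) {
                out.add(p);         // collector accepted this path
            }
        }
        return out;
    }

    public static void main(String[] args) {
        int[] paths = {1, 2, 3, 4, 5};
        System.out.println(search(paths, Set.of(2), 3, true));  // [1, 3] - one result lost
        System.out.println(search(paths, Set.of(2), 3, false)); // [1, 3, 4]
    }
}
```

In the real code the rejection comes from the pluggable `collect()` return value added in this PR, which is why the optimization's "surely accepted" assumption no longer holds.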