New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
LUCENE-8995: TopSuggestDocsCollector#collect should be able to signal rejection #913
base: master
Are you sure you want to change the base?
Conversation
… rejection Currently the suggestion collectors collect method has no way of signaling back to the caller whether the matched completion it received was indeed collected. While this is currenty always the case for the default TopSuggestDocsCollector, there are implementations that overwrite this with custom collection logic that also deduplicates based on e.g. docID and/or context. Currently if those completions are rejected, the calling TopNSearcher has no way of noting these rejections and might therefore terminate early, missing on completions that should be returned.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The change looks good @cbuescher . I wonder if we should also add a way to set a custom queue size from the suggest collector, currently the queue size is increased if there are deleted docs and if a filter is set so maybe we should do the same if the collector is expected to filter hits ?
I also think that this pr reveals a bug in the way we compute the max analyzed path per output, I am not sure what was the original intent so maybe @mikemccand can shed some lights here ?
lucene/suggest/src/java/org/apache/lucene/search/suggest/document/NRTSuggester.java
Outdated
Show resolved
Hide resolved
lucene/suggest/src/java/org/apache/lucene/search/suggest/document/TopSuggestDocsCollector.java
Outdated
Show resolved
Hide resolved
indexSearcher.suggest(query, collector); | ||
assertSuggestions(collector.get(), expectedResults.subList(0, topN).toArray(new Entry[0])); | ||
|
||
// TODO expecting true here, why false? |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
it seems that NRTSuggesterBuilder#maxAnalyzedPathsPerOutput is not computed correctly. From what I understand it records the number of suggestions with the same analyzed form but the comment says that it should be the highest number of analyzed paths we saw for any input surface form. So imo this is a bug, it's not exactly related to this change so we should probably open a new issue for this.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
+1 to open a new issue for this.
Indeed, maxAnalyzedPathsPerOutput
comment in NRTSuggester.java
seems to claim it's the max number of analyzed forms for a surface form (input), e.g. a graph tokenstream would create multiple analyzed forms.
But what it seems to actually be computing, in NRTSuggesterBuilder.java
is actually the maximum number of unique surface forms that analyze to the same analyzed form.
And so I think the admissibility of the search is in question -- we use this to size the queue "properly" but it's not clear that works today?
Let's definitely open a new issue ... this stuff is hard to think about!
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'll open an issue. I also wonder if we shouldn't rely on the fact that the top suggest collector will also early terminate so whenever we expect rejection (because of deleted docs or because we deduplicate on suggestions/doc) we could set the queue size to its maximum value (5000). Currently we have different heuristics that tries to pick a sensitive value automatically but there is no guarantee of admissibility. For instance if we want to deduplicate by document id we should ensure that the queue size is greater than topN*maxAnalyzedValuesPerDoc
and we'd need to compute this value at index time.
I may be completely off but it would be interesting to see the effects of setting the queue size to its maximum value on all search. This way the admissibility is easier to reason about and we don't need to correlate it with the choice made by the heuristic.
lucene/suggest/src/java/org/apache/lucene/search/suggest/document/NRTSuggester.java
Outdated
Show resolved
Hide resolved
...ne/suggest/src/test/org/apache/lucene/search/suggest/document/TestPrefixCompletionQuery.java
Show resolved
Hide resolved
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I left some minor comments but the change looks good to me.
/** | ||
* returns true if the collector clearly exhausted all possibilities to collect results | ||
*/ | ||
boolean isComplete() { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Is this needed now that we provide the information in the TopSuggestDocs
?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Not necessary, but since we need the underlying flag and its currently private, I thought its okay to have at least a package private getter. Or does it bloat the class too much? Happy to remove either way, wdyt?
lucene/suggest/src/java/org/apache/lucene/search/suggest/document/NRTSuggester.java
Outdated
Show resolved
Hide resolved
lucene/suggest/src/java/org/apache/lucene/search/suggest/document/TopSuggestDocs.java
Outdated
Show resolved
Hide resolved
lucene/suggest/src/java/org/apache/lucene/search/suggest/document/TopSuggestDocs.java
Outdated
Show resolved
Hide resolved
lucene/suggest/src/java/org/apache/lucene/search/suggest/document/TopSuggestDocs.java
Outdated
Show resolved
Hide resolved
@jimczi thanks for the review, pushed a commit adressing most of the nits and left two questions. Happy to go either way with both of them, just want to hear your opinion. |
@@ -460,11 +460,6 @@ public void addStartPaths(FST.Arc<T> node, T startOutput, boolean allowEmptyStri | |||
continue; | |||
} | |||
|
|||
if (results.size() == topN-1 && maxQueueDepth == topN) { | |||
// Last path -- don't bother w/ queue anymore: | |||
queue = null; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Whoa, was this opto breaking something? I guess if this final path is filtered out, we still need the queue? Have you run the suggest benchmarks to see if removing this opto hurt performance?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
As far as I understand this optimization assumes we surely accept (and collect) the path later in L516s acceptResult(), which always seems to be the case for collectors that don't reject, but if the collector that is eventually called via NRTSuggesters acceptResult() chooses to reject this option, we were losing expected results. This surfaced in the prefix completion tests I added. @jimczi might be able to explain this a bit better than me.
Have you run the suggest benchmarks to see if removing this opto hurt performance?
No, where are they and how can I run them?
lucene/suggest/src/java/org/apache/lucene/search/suggest/document/TopSuggestDocsCollector.java
Outdated
Show resolved
Hide resolved
lucene/suggest/src/java/org/apache/lucene/search/suggest/document/TopSuggestDocs.java
Outdated
Show resolved
Hide resolved
} else { | ||
return TopSuggestDocs.EMPTY; | ||
} | ||
} | ||
|
||
/** | ||
* Indicates if the list of results is complete or not. Might be <code>false</code> if the {@link TopNSearcher} rejected | ||
* too many of the queued results. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Isn't it sometimes false
if the TopNSearcher
's queue filled up and it chose not to pursue some low-scoring partial paths?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The admissibility of the search is computed from the reject count so a value of false
means that we exhausted all the paths but we had to reject all of them so the topN is truncated. It's hard to follow the full logic but it should be ok as long as it is ok to return less than the topN when there are more rejections than the queue size can handle ?
lucene/suggest/src/java/org/apache/lucene/search/suggest/document/TopSuggestDocsCollector.java
Show resolved
Hide resolved
public boolean collect(int docID, CharSequence key, CharSequence context, float score) throws IOException { | ||
int globalDocId = docID + docBase; | ||
boolean collected = false; | ||
if (seenDocIds.contains(globalDocId) == false) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Hmm why would the suggester even send the same docID
multiple times? There is only one suggestion (SuggestField
) per indexed document in this test ... maybe improve the test to assert that sometimes we reject and sometimes we don't?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The collector is called multiple times with the same docID because of the MockSynonymAnalyzer used in the test setup which adds "dog" for "dogs", so each document has two completion paths. This collector is meant to de-duplicate this. I added a note explaining this. This is a simplified version of the behaviour we observe in elastic/elasticsearch#46445.
lucene/suggest/src/test/org/apache/lucene/search/suggest/document/TestTopSuggestDocs.java
Outdated
Show resolved
Hide resolved
indexSearcher.suggest(query, collector); | ||
TopSuggestDocs suggestions = collector.get(); | ||
assertSuggestions(suggestions, expectedResults.subList(0, topN).toArray(new Entry[0])); | ||
assertTrue(suggestions.isComplete()); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Is it possible to concoct a full test case (invoking suggester) where isComplete()
returns false?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This will happen if the estimated queue size in NRTSuggester#lookup
is not large enough to accout for all rejected documents, e.g. when in this particular test we try to get the top 5 of only 5 documents. In that case the queue size heuristic in NRTSuggester#getMaxTopNSearcherQueueSize
will only size to queue to 7 (topN + numDocs/2), which is less than the number of topN + rejections, so the TopResults returned will have the isComplete
flag set.
I can add that case to the existing test if this helps.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I extended the existing test to the case where try getting the top 10. In this case the queue would have a max depth of 15, the reject count it 9.
Currently the suggestion collectors collect method has no way of signaling back
to the caller whether the matched completion it received was indeed collected.
While this is currenty always the case for the default TopSuggestDocsCollector,
there are implementations that overwrite this with custom collection logic that
also deduplicates based on e.g. docID and/or context. Currently if those
completions are rejected, the calling TopNSearcher has no way of noting these
rejections and might therefore terminate early, missing on completions that
should be returned.