LUCENE-8995: TopSuggestDocsCollector#collect should be able to signal rejection #913

cbuescher · 2019-10-01T14:54:58Z

Currently the suggestion collectors collect method has no way of signaling back
to the caller whether the matched completion it received was indeed collected.
While this is currenty always the case for the default TopSuggestDocsCollector,
there are implementations that overwrite this with custom collection logic that
also deduplicates based on e.g. docID and/or context. Currently if those
completions are rejected, the calling TopNSearcher has no way of noting these
rejections and might therefore terminate early, missing on completions that
should be returned.

… rejection Currently the suggestion collectors collect method has no way of signaling back to the caller whether the matched completion it received was indeed collected. While this is currenty always the case for the default TopSuggestDocsCollector, there are implementations that overwrite this with custom collection logic that also deduplicates based on e.g. docID and/or context. Currently if those completions are rejected, the calling TopNSearcher has no way of noting these rejections and might therefore terminate early, missing on completions that should be returned.

jimczi

The change looks good @cbuescher . I wonder if we should also add a way to set a custom queue size from the suggest collector, currently the queue size is increased if there are deleted docs and if a filter is set so maybe we should do the same if the collector is expected to filter hits ?
I also think that this pr reveals a bug in the way we compute the max analyzed path per output, I am not sure what was the original intent so maybe @mikemccand can shed some lights here ?

lucene/suggest/src/java/org/apache/lucene/search/suggest/document/NRTSuggester.java

lucene/suggest/src/java/org/apache/lucene/search/suggest/document/TopSuggestDocsCollector.java

jimczi · 2019-10-02T07:32:12Z

...ne/suggest/src/test/org/apache/lucene/search/suggest/document/TestPrefixCompletionQuery.java

+    indexSearcher.suggest(query, collector);
+    assertSuggestions(collector.get(), expectedResults.subList(0, topN).toArray(new Entry[0]));
+
+    // TODO expecting true here, why false?


it seems that NRTSuggesterBuilder#maxAnalyzedPathsPerOutput is not computed correctly. From what I understand it records the number of suggestions with the same analyzed form but the comment says that it should be the highest number of analyzed paths we saw for any input surface form. So imo this is a bug, it's not exactly related to this change so we should probably open a new issue for this.

+1 to open a new issue for this.

Indeed, maxAnalyzedPathsPerOutput comment in NRTSuggester.java seems to claim it's the max number of analyzed forms for a surface form (input), e.g. a graph tokenstream would create multiple analyzed forms.

But what it seems to actually be computing, in NRTSuggesterBuilder.java is actually the maximum number of unique surface forms that analyze to the same analyzed form.

And so I think the admissibility of the search is in question -- we use this to size the queue "properly" but it's not clear that works today?

Let's definitely open a new issue ... this stuff is hard to think about!

I'll open an issue. I also wonder if we shouldn't rely on the fact that the top suggest collector will also early terminate so whenever we expect rejection (because of deleted docs or because we deduplicate on suggestions/doc) we could set the queue size to its maximum value (5000). Currently we have different heuristics that tries to pick a sensitive value automatically but there is no guarantee of admissibility. For instance if we want to deduplicate by document id we should ensure that the queue size is greater than topN*maxAnalyzedValuesPerDoc and we'd need to compute this value at index time.
I may be completely off but it would be interesting to see the effects of setting the queue size to its maximum value on all search. This way the admissibility is easier to reason about and we don't need to correlate it with the choice made by the heuristic.

lucene/suggest/src/java/org/apache/lucene/search/suggest/document/NRTSuggester.java

...ne/suggest/src/test/org/apache/lucene/search/suggest/document/TestPrefixCompletionQuery.java

…or rejections

jimczi

I left some minor comments but the change looks good to me.

jimczi · 2019-10-09T07:51:42Z

lucene/suggest/src/java/org/apache/lucene/search/suggest/document/TopSuggestDocsCollector.java

+  /**
+   * returns true if the collector clearly exhausted all possibilities to collect results
+   */
+  boolean isComplete() {


Is this needed now that we provide the information in the TopSuggestDocs ?

Not necessary, but since we need the underlying flag and its currently private, I thought its okay to have at least a package private getter. Or does it bloat the class too much? Happy to remove either way, wdyt?

lucene/suggest/src/java/org/apache/lucene/search/suggest/document/NRTSuggester.java

lucene/suggest/src/java/org/apache/lucene/search/suggest/document/TopSuggestDocs.java

cbuescher · 2019-10-09T09:27:43Z

@jimczi thanks for the review, pushed a commit adressing most of the nits and left two questions. Happy to go either way with both of them, just want to hear your opinion.

mikemccand · 2019-10-15T14:21:57Z

lucene/core/src/java/org/apache/lucene/util/fst/Util.java

@@ -460,11 +460,6 @@ public void addStartPaths(FST.Arc<T> node, T startOutput, boolean allowEmptyStri
          continue;
        }

-        if (results.size() == topN-1 && maxQueueDepth == topN) {
-          // Last path -- don't bother w/ queue anymore:
-          queue = null;


Whoa, was this opto breaking something? I guess if this final path is filtered out, we still need the queue? Have you run the suggest benchmarks to see if removing this opto hurt performance?

As far as I understand this optimization assumes we surely accept (and collect) the path later in L516s acceptResult(), which always seems to be the case for collectors that don't reject, but if the collector that is eventually called via NRTSuggesters acceptResult() chooses to reject this option, we were losing expected results. This surfaced in the prefix completion tests I added. @jimczi might be able to explain this a bit better than me.

Have you run the suggest benchmarks to see if removing this opto hurt performance?

No, where are they and how can I run them?

lucene/suggest/src/java/org/apache/lucene/search/suggest/document/TopSuggestDocsCollector.java

lucene/suggest/src/java/org/apache/lucene/search/suggest/document/TopSuggestDocs.java

mikemccand · 2019-10-15T14:38:40Z

lucene/suggest/src/java/org/apache/lucene/search/suggest/document/TopSuggestDocs.java

    } else {
      return TopSuggestDocs.EMPTY;
    }
  }

+  /**
+   * Indicates if the list of results is complete or not. Might be <code>false</code> if the {@link TopNSearcher} rejected
+   * too many of the queued results.


Isn't it sometimes false if the TopNSearcher's queue filled up and it chose not to pursue some low-scoring partial paths?

The admissibility of the search is computed from the reject count so a value of false means that we exhausted all the paths but we had to reject all of them so the topN is truncated. It's hard to follow the full logic but it should be ok as long as it is ok to return less than the topN when there are more rejections than the queue size can handle ?

lucene/suggest/src/java/org/apache/lucene/search/suggest/document/TopSuggestDocsCollector.java

mikemccand · 2019-10-15T14:43:55Z

...ne/suggest/src/test/org/apache/lucene/search/suggest/document/TestPrefixCompletionQuery.java

+      public boolean collect(int docID, CharSequence key, CharSequence context, float score) throws IOException {
+          int globalDocId = docID + docBase;
+          boolean collected = false;
+          if (seenDocIds.contains(globalDocId) == false) {


Hmm why would the suggester even send the same docID multiple times? There is only one suggestion (SuggestField) per indexed document in this test ... maybe improve the test to assert that sometimes we reject and sometimes we don't?

The collector is called multiple times with the same docID because of the MockSynonymAnalyzer used in the test setup which adds "dog" for "dogs", so each document has two completion paths. This collector is meant to de-duplicate this. I added a note explaining this. This is a simplified version of the behaviour we observe in elastic/elasticsearch#46445.

lucene/suggest/src/test/org/apache/lucene/search/suggest/document/TestTopSuggestDocs.java

mikemccand · 2019-10-15T14:47:01Z

...ne/suggest/src/test/org/apache/lucene/search/suggest/document/TestPrefixCompletionQuery.java

+    indexSearcher.suggest(query, collector);
+    TopSuggestDocs suggestions = collector.get();
+    assertSuggestions(suggestions, expectedResults.subList(0, topN).toArray(new Entry[0]));
+    assertTrue(suggestions.isComplete());


Is it possible to concoct a full test case (invoking suggester) where isComplete() returns false?

This will happen if the estimated queue size in NRTSuggester#lookup is not large enough to accout for all rejected documents, e.g. when in this particular test we try to get the top 5 of only 5 documents. In that case the queue size heuristic in NRTSuggester#getMaxTopNSearcherQueueSize will only size to queue to 7 (topN + numDocs/2), which is less than the number of topN + rejections, so the TopResults returned will have the isComplete flag set.
I can add that case to the existing test if this helps.

I extended the existing test to the case where try getting the top 10. In this case the queue would have a max depth of 15, the reject count it 9.

cbuescher changed the title ~~LUCENE-8995: TopSuggestDocsCollector#collect should be able to signal…~~ LUCENE-8995: TopSuggestDocsCollector#collect should be able to signal rejection Oct 1, 2019

jimczi reviewed Oct 2, 2019

View reviewed changes

Christoph Büscher added 4 commits October 2, 2019 11:25

Add TopSuggestDocs flag indicating that the search is complete

8739bb0

Move surface form deduplication logic

d42a88c

Make queue larger for collectors that can reject

3ffef37

Add larger test with rejections

3b4b816

jimczi reviewed Oct 2, 2019

View reviewed changes

lucene/suggest/src/java/org/apache/lucene/search/suggest/document/NRTSuggester.java Outdated Show resolved Hide resolved

...ne/suggest/src/test/org/apache/lucene/search/suggest/document/TestPrefixCompletionQuery.java Show resolved Hide resolved

Christoph Büscher added 2 commits October 2, 2019 22:28

Apply increased queue size heuristic only once for filters or collect…

78c40a2

…or rejections

Reduce test size to stay below max queue size

5632320

This was referenced Oct 7, 2019

Missing Completion suggestions when multiple suggestion paths lead to same document elastic/elasticsearch#46445

Open

Completion suggest returns erratic results when skip_duplicates=true elastic/elasticsearch#47269

Open

jimczi reviewed Oct 9, 2019

View reviewed changes

Christoph Büscher added 2 commits October 9, 2019 11:19

Adressing review comments

bb1ece6

Add second ctor for bwc

e72ac79

cbuescher commented Oct 9, 2019

View reviewed changes

lucene/suggest/src/java/org/apache/lucene/search/suggest/document/TopSuggestDocs.java Outdated Show resolved Hide resolved

Add ctor javadoc

e51f262

mikemccand reviewed Oct 15, 2019

View reviewed changes

Christoph Büscher added 2 commits October 16, 2019 12:16

Adressing review comments

81bc00c

Extend test to case where isComplete() returns false

2f3a1b9

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

LUCENE-8995: TopSuggestDocsCollector#collect should be able to signal rejection #913

LUCENE-8995: TopSuggestDocsCollector#collect should be able to signal rejection #913

cbuescher commented Oct 1, 2019 •

edited

jimczi left a comment

jimczi Oct 2, 2019 •

edited

mikemccand Oct 15, 2019

jimczi Oct 16, 2019

jimczi left a comment

jimczi Oct 9, 2019

cbuescher Oct 9, 2019

cbuescher commented Oct 9, 2019

mikemccand Oct 15, 2019

cbuescher Oct 16, 2019 •

edited

mikemccand Oct 15, 2019

jimczi Oct 16, 2019

mikemccand Oct 15, 2019

cbuescher Oct 16, 2019

mikemccand Oct 15, 2019

cbuescher Oct 16, 2019

cbuescher Oct 16, 2019

LUCENE-8995: TopSuggestDocsCollector#collect should be able to signal rejection #913

Are you sure you want to change the base?

LUCENE-8995: TopSuggestDocsCollector#collect should be able to signal rejection #913

Conversation

cbuescher commented Oct 1, 2019 • edited

jimczi left a comment

Choose a reason for hiding this comment

jimczi Oct 2, 2019 • edited

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

jimczi left a comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

cbuescher commented Oct 9, 2019

Choose a reason for hiding this comment

cbuescher Oct 16, 2019 • edited

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

cbuescher commented Oct 1, 2019 •

edited

jimczi Oct 2, 2019 •

edited

cbuescher Oct 16, 2019 •

edited