
LUCENE-8995: TopSuggestDocsCollector#collect should be able to signal rejection #913

Status: Open — wants to merge 12 commits into master
Conversation

@cbuescher (Contributor) commented Oct 1, 2019:

Currently the suggestion collector's collect method has no way of signaling back
to the caller whether the matched completion it received was indeed collected.
While this is currently always the case for the default TopSuggestDocsCollector,
there are implementations that override it with custom collection logic that
also deduplicates, e.g. based on docID and/or context. Currently, if those
completions are rejected, the calling TopNSearcher has no way of noticing the
rejections and might therefore terminate early, missing completions that
should be returned.
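The intended contract can be sketched roughly as follows. This is a hypothetical model, not the actual Lucene API: `RejectionAwareSearch`, `search`, and the `accept` predicate are illustrative stand-ins. The point is that collect() returns whether the hit was kept, and the caller counts rejections instead of assuming every call succeeded.

```java
import java.util.List;
import java.util.function.IntPredicate;

// Hypothetical sketch of the proposed contract, not actual Lucene code:
// the collector reports via its return value whether a hit was kept, so
// the caller can keep exploring paths instead of terminating early.
class RejectionAwareSearch {
    /**
     * Pull candidate docIDs until topN are truly collected or the candidates
     * are exhausted. `accept` stands in for collector-specific logic that may
     * reject a hit (e.g. deduplication on docID).
     * Returns {collected, rejected}.
     */
    static int[] search(List<Integer> docIDs, IntPredicate accept, int topN) {
        int collected = 0;
        int rejected = 0;
        for (int docID : docIDs) {
            if (collected >= topN) {
                break; // safe to stop: topN hits were actually collected
            }
            if (accept.test(docID)) {
                collected++; // collect() returned true: hit was kept
            } else {
                rejected++;  // collect() signaled a rejection
            }
        }
        return new int[] {collected, rejected};
    }
}
```

Without the boolean return, the caller would count the two rejected duplicates as collected hits and stop early.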

@cbuescher cbuescher changed the title LUCENE-8995: TopSuggestDocsCollector#collect should be able to signal… LUCENE-8995: TopSuggestDocsCollector#collect should be able to signal rejection Oct 1, 2019
@jimczi (Contributor) left a comment:

The change looks good @cbuescher. I wonder if we should also add a way to set a custom queue size from the suggest collector; currently the queue size is increased if there are deleted docs and if a filter is set, so maybe we should do the same if the collector is expected to filter hits?
I also think that this PR reveals a bug in the way we compute the max analyzed paths per output. I am not sure what the original intent was, so maybe @mikemccand can shed some light here?

indexSearcher.suggest(query, collector);
assertSuggestions(collector.get(), expectedResults.subList(0, topN).toArray(new Entry[0]));

// TODO expecting true here, why false?
@jimczi (Contributor) commented Oct 2, 2019:

It seems that NRTSuggesterBuilder#maxAnalyzedPathsPerOutput is not computed correctly. From what I understand, it records the number of suggestions with the same analyzed form, but the comment says that it should be the highest number of analyzed paths we saw for any input surface form. So IMO this is a bug; it's not exactly related to this change, so we should probably open a new issue for it.

Member:

+1 to open a new issue for this.

Indeed, the maxAnalyzedPathsPerOutput comment in NRTSuggester.java seems to claim it's the max number of analyzed forms for a surface form (input), e.g. a graph token stream would create multiple analyzed forms.

But what it seems to actually be computing, in NRTSuggesterBuilder.java, is the maximum number of unique surface forms that analyze to the same analyzed form.

And so I think the admissibility of the search is in question -- we use this to size the queue "properly" but it's not clear that works today?

Let's definitely open a new issue ... this stuff is hard to think about!

Contributor:

I'll open an issue. I also wonder whether we should stop relying on the top suggest collector to early-terminate: whenever we expect rejections (because of deleted docs or because we deduplicate on suggestions/doc) we could set the queue size to its maximum value (5000). Currently we have different heuristics that try to pick a sensible value automatically, but there is no guarantee of admissibility. For instance, if we want to deduplicate by document id, we should ensure that the queue size is greater than topN*maxAnalyzedValuesPerDoc, and we'd need to compute this value at index time.
I may be completely off, but it would be interesting to see the effects of setting the queue size to its maximum value on all searches. This way the admissibility is easier to reason about and we don't need to correlate it with the choice made by the heuristic.
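That admissibility bound can be sketched as back-of-envelope arithmetic. This is a hypothetical helper, not the actual NRTSuggester heuristic (which lives in NRTSuggester#getMaxTopNSearcherQueueSize); the 5000 cap is the maximum queue size mentioned in this thread.

```java
// Hypothetical sketch of the admissibility bound discussed above, not the
// actual NRTSuggester heuristic. 5000 is the maximum queue size mentioned
// in this thread.
class QueueSizing {
    static final int MAX_QUEUE_SIZE = 5000;

    // When the collector deduplicates by docID, up to
    // (maxAnalyzedPathsPerDoc - 1) extra paths per kept document may be
    // rejected, so an admissible queue needs room for
    // topN * maxAnalyzedPathsPerDoc entries (capped at the maximum).
    static int admissibleQueueSize(int topN, int maxAnalyzedPathsPerDoc) {
        long needed = (long) topN * maxAnalyzedPathsPerDoc;
        return (int) Math.min(needed, MAX_QUEUE_SIZE);
    }
}
```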

@jimczi (Contributor) left a comment:

I left some minor comments but the change looks good to me.

/**
* returns true if the collector clearly exhausted all possibilities to collect results
*/
boolean isComplete() {
Contributor:

Is this needed now that we provide the information in the TopSuggestDocs?

@cbuescher (author):

Not necessary, but since we need the underlying flag and it's currently private, I thought it's okay to have at least a package-private getter. Or does it bloat the class too much? Happy to remove it either way, wdyt?

@cbuescher (author) commented:
@jimczi thanks for the review, I pushed a commit addressing most of the nits and left two questions. Happy to go either way with both of them, just want to hear your opinion.

@@ -460,11 +460,6 @@ public void addStartPaths(FST.Arc<T> node, T startOutput, boolean allowEmptyStri
continue;
}

if (results.size() == topN-1 && maxQueueDepth == topN) {
// Last path -- don't bother w/ queue anymore:
queue = null;
Member:

Whoa, was this opto breaking something? I guess if this final path is filtered out, we still need the queue? Have you run the suggest benchmarks to see if removing this opto hurt performance?

@cbuescher (author) commented Oct 16, 2019:

As far as I understand, this optimization assumes we surely accept (and collect) the path later in acceptResult() (L516), which always seems to be the case for collectors that don't reject. But if the collector that is eventually called via NRTSuggester's acceptResult() chooses to reject this option, we were losing expected results. This surfaced in the prefix completion tests I added. @jimczi might be able to explain this a bit better than me.

Have you run the suggest benchmarks to see if removing this opto hurt performance?

No, where are they and how can I run them?

} else {
return TopSuggestDocs.EMPTY;
}
}

/**
* Indicates if the list of results is complete or not. Might be <code>false</code> if the {@link TopNSearcher} rejected
* too many of the queued results.
Member:

Isn't it sometimes false if the TopNSearcher's queue filled up and it chose not to pursue some low-scoring partial paths?

Contributor:

The admissibility of the search is computed from the reject count, so a value of false means that we exhausted all the paths but had to reject all of them, so the topN is truncated. It's hard to follow the full logic, but it should be ok as long as it is acceptable to return fewer than topN results when there are more rejections than the queue size can handle?

public boolean collect(int docID, CharSequence key, CharSequence context, float score) throws IOException {
int globalDocId = docID + docBase;
boolean collected = false;
if (seenDocIds.contains(globalDocId) == false) {
Member:

Hmm why would the suggester even send the same docID multiple times? There is only one suggestion (SuggestField) per indexed document in this test ... maybe improve the test to assert that sometimes we reject and sometimes we don't?

@cbuescher (author):

The collector is called multiple times with the same docID because of the MockSynonymAnalyzer used in the test setup which adds "dog" for "dogs", so each document has two completion paths. This collector is meant to de-duplicate this. I added a note explaining this. This is a simplified version of the behaviour we observe in elastic/elasticsearch#46445.
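A minimal sketch of such a deduplicating collector (illustrative, not the exact test code; the signature mirrors the one quoted below) shows why the boolean return matters:

```java
import java.util.HashSet;
import java.util.Set;

// Illustrative sketch, not the exact test code: because the synonym analyzer
// produces two analyzed paths per document ("dogs" and "dog"), collect() is
// called twice with the same docID and must reject the second call.
class DedupByDocIdCollector {
    private final Set<Integer> seenDocIds = new HashSet<>();

    /** Returns true only the first time a given docID is collected. */
    boolean collect(int docID, CharSequence key, float score) {
        return seenDocIds.add(docID); // Set#add returns false for duplicates
    }
}
```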

indexSearcher.suggest(query, collector);
TopSuggestDocs suggestions = collector.get();
assertSuggestions(suggestions, expectedResults.subList(0, topN).toArray(new Entry[0]));
assertTrue(suggestions.isComplete());
Member:

Is it possible to concoct a full test case (invoking suggester) where isComplete() returns false?

@cbuescher (author):

This will happen if the estimated queue size in NRTSuggester#lookup is not large enough to account for all rejected documents, e.g. when in this particular test we try to get the top 5 of only 5 documents. In that case the queue size heuristic in NRTSuggester#getMaxTopNSearcherQueueSize will only size the queue to 7 (topN + numDocs/2), which is less than topN + rejections, so the TopResults returned will have the isComplete flag set to false.
I can add that case to the existing test if it helps.

@cbuescher (author):

I extended the existing test to the case where we try getting the top 10. In this case the queue has a max depth of 15, and the reject count is 9.
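The arithmetic behind both cases can be sketched like this, using the topN + numDocs/2 heuristic quoted above (hypothetical helper names, not the actual NRTSuggester API):

```java
// Hypothetical helpers modeling the arithmetic above, not the actual
// NRTSuggester API. The heuristic sizes the queue to topN + numDocs/2.
class IncompleteCheck {
    static int heuristicQueueSize(int topN, int numDocs) {
        return topN + numDocs / 2;
    }

    /** Results may be incomplete once collected + rejected paths exceed the queue. */
    static boolean mayBeIncomplete(int topN, int numDocs, int rejections) {
        return heuristicQueueSize(topN, numDocs) < topN + rejections;
    }
}
```

For topN=5 over 5 docs the queue gets 7 slots, and for topN=10 over 10 docs it gets 15 slots with 9 rejections, so both runs can exceed the queue and report an incomplete result.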
