Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Phrase Suggester: Suggesting on very frequent words can cause request failures #34282

Closed
nik9000 opened this issue Oct 3, 2018 · 3 comments · Fixed by #34312
Closed

Phrase Suggester: Suggesting on very frequent words can cause request failures #34282

nik9000 opened this issue Oct 3, 2018 · 3 comments · Fixed by #34312
Assignees
Labels
>bug :Search Relevance/Suggesters "Did you mean" and suggestions as you type Team:Search Relevance Meta label for the Search Relevance team in Elasticsearch

Comments

@nik9000
Copy link
Member

nik9000 commented Oct 3, 2018

I have a stack trace that looks like:

Caused by: java.lang.IllegalArgumentException: Fractional absolute document frequencies are not allowed
    at org.apache.lucene.search.spell.DirectSpellChecker.setThresholdFrequency(DirectSpellChecker.java:182)
    at org.elasticsearch.search.suggest.phrase.DirectCandidateGenerator.drawCandidates(DirectCandidateGenerator.java:131)
    at org.elasticsearch.search.suggest.phrase.MultiCandidateGeneratorWrapper.drawCandidates(MultiCandidateGeneratorWrapper.java:52)

I do not have and cannot get the index that causes this failure. But it looks to me like the failure is caused by this series of events:

  1. DirectCandidateGenerator#thresholdFrequency spits out a frequency that is bigger than Integer.MAX_VALUE. This looks to be possible using the default configuration for common words like "the" when the corpus is a couple of million documents and each document is large, like, say, as big as a wikipedia page.
  2. We call DirectSpellChecker#setThresholdFrequency with that number. The JVM helpfully casts the long returned by step 1 into a float, losing precision but keeping the magnitude of the number largely intact.
  3. Lucene attempts to validate that the float is either less than 0 or a whole number. The "is it a whole number" check looks like thresholdFrequency != (int) thresholdFrequency. That will consider floats that don't fit into ints as not whole numbers. Most of the time, anyway.

There is a work around: set "suggest_mode": "always". We'll skip the math and just pick 0 for the frequency. Which is both less than one and whole number so Lucene is quite happy with it.

It looks like we should either clamp the value to Integer.MAX_VALUE in Elasticsearch or Lucene should use something else to check for fractional numbers.

@nik9000 nik9000 added >bug :Search Relevance/Suggesters "Did you mean" and suggestions as you type labels Oct 3, 2018
@elasticmachine
Copy link
Collaborator

Pinging @elastic/es-search-aggs

@jimczi
Copy link
Contributor

jimczi commented Oct 4, 2018

It looks like we should either clamp the value to Integer.MAX_VALUE in Elasticsearch or Lucene should use something else to check for fractional numbers.

I think there is a discrepancy in our usage of the Lucene suggester. DirectSpellChecker#suggestSimilar uses the docFreq (which is an int) of the candidate to compare with the threshold computed by DirectCandidateGenerator#thresholdFrequency. However in Elasticsearch we compute the threshold from the totalTermFreq and not the docFreq so this explains why the value can be greater than Integer.MAX_VALUE. This also means that we filter some terms that should not be filtered because we don't use the same metric (docFreq vs totalTermFreq). I think we should "just" switch to docFreq when computing the threshold.

@nik9000
Copy link
Member Author

nik9000 commented Oct 4, 2018

I think regardless of how we compute the threshold we should clamp it to Integer values so we don't trigger this.

@jimczi jimczi self-assigned this Oct 4, 2018
jimczi added a commit to jimczi/elasticsearch that referenced this issue Oct 4, 2018
The `term` and `phrase` suggesters have different options to filter candidates
based on their frequencies. The `popular` mode for instance filters candidate
terms that occur in less docs than the original term. However when we compute this threshold
we use the total term frequency of a term instead of the document frequency. This is not inline
with the actual filtering which is always based on the document frequency. This change fixes
this discrepancy and clarifies the meaning of the different frequencies in use in the suggesters.
It also ensures that the threshold doesn't overflow the maximum allowed value (Integer.MAX_VALUE).

Closes elastic#34282
jimczi added a commit that referenced this issue Oct 19, 2018
The `term` and `phrase` suggesters have different options to filter candidates
based on their frequencies. The `popular` mode for instance filters candidate
terms that occur in less docs than the original term. However when we compute this threshold
we use the total term frequency of a term instead of the document frequency. This is not inline
with the actual filtering which is always based on the document frequency. This change fixes
this discrepancy and clarifies the meaning of the different frequencies in use in the suggesters.
It also ensures that the threshold doesn't overflow the maximum allowed value (Integer.MAX_VALUE).

Closes #34282
jimczi added a commit that referenced this issue Oct 19, 2018
This change ensures that the term frequency threshold computed by the term/phrase
suggesters doesn't overflow the maximum allowed value (Integer.MAX_VALUE).

Closes #34282
Relates #34312
kcm pushed a commit that referenced this issue Oct 30, 2018
The `term` and `phrase` suggesters have different options to filter candidates
based on their frequencies. The `popular` mode for instance filters candidate
terms that occur in less docs than the original term. However when we compute this threshold
we use the total term frequency of a term instead of the document frequency. This is not inline
with the actual filtering which is always based on the document frequency. This change fixes
this discrepancy and clarifies the meaning of the different frequencies in use in the suggesters.
It also ensures that the threshold doesn't overflow the maximum allowed value (Integer.MAX_VALUE).

Closes #34282
@javanna javanna added the Team:Search Relevance Meta label for the Search Relevance team in Elasticsearch label Jul 12, 2024
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
>bug :Search Relevance/Suggesters "Did you mean" and suggestions as you type Team:Search Relevance Meta label for the Search Relevance team in Elasticsearch
Projects
None yet
Development

Successfully merging a pull request may close this issue.

4 participants