-
Notifications
You must be signed in to change notification settings - Fork 982
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
hunspell: allow for faster dictionary iteration during 'suggest' by using more memory (opt-in) #11893
Conversation
…sing more memory (opt-in)
|
||
char[] excludeFlags = dictionary.allNonSuggestibleFlags(); | ||
FlagEnumerator.Lookup flagLookup = dictionary.flagLookup; | ||
IntPredicate isSuggestible = formId -> !flagLookup.hasAnyFlag(formId, excludeFlags); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is basically the inlined EntryFilter class with one for loop removed
Math.max(1, word.length() - MAX_ROOT_LENGTH_DIFF), | ||
word.length() + MAX_ROOT_LENGTH_DIFF, | ||
(rootChars, forms) -> { | ||
(rootChars, formSupplier) -> { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Instead of IntsRef
we now pass a Supplier<IntsRef>
to allow for lazier deserialization and exclusion of unnecessary information (morphological data ids)
while (roots.size() > MAX_ROOTS) { | ||
roots.poll(); | ||
IntsRef forms = formSupplier.get(); | ||
for (int i = 0; i < forms.length; i++) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Here's the loop from the inlined EntryFilter
, it's now a bit simpler.
if (forms.ints.length < dataLength) { | ||
forms.ints = new int[dataLength]; | ||
} | ||
readForms(forms, in, dataLength); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Form id reading is moved into LazyFormReader
@@ -54,7 +55,8 @@ class WordStorage { | |||
private static final int COLLISION_MASK = 0x40; | |||
private static final int SUGGESTIBLE_MASK = 0x20; | |||
private static final int MAX_STORED_LENGTH = SUGGESTIBLE_MASK - 1; | |||
|
|||
private final int maxEntryLength; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
maxEntryLength is used to allow passing Integer.MAX_VALUE as the maximal interesting length (to iterate the whole dictionary) without allocating an array of that size
@@ -54,7 +55,8 @@ class WordStorage { | |||
private static final int COLLISION_MASK = 0x40; | |||
private static final int SUGGESTIBLE_MASK = 0x20; | |||
private static final int MAX_STORED_LENGTH = SUGGESTIBLE_MASK - 1; | |||
|
|||
private final int maxEntryLength; | |||
private final boolean hasCustomMorphData; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
hasCustomMorphData is used to skip morphological data ids (if present) during the data deserialization, so that the cached flattened data is 2x smaller. That data isn't used during iteration anyway.
@@ -86,7 +86,7 @@ public void de() throws Exception { | |||
|
|||
@Test | |||
public void de_suggest() throws Exception { | |||
checkSuggestionPerformance("de", 100); | |||
checkSuggestionPerformance("de", 150); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We can now suggest for more words in reasonable time, yay!
With the cache, about 2x memory is used (~850MB for ~190 dictionaries). The caching gives me about 1.5x speedup for en/ru/de. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Left a comment or two. As far as I understand, it's fine (and, again - I'm not really that familiar with Hunspell...).
lucene/analysis/common/src/java/org/apache/lucene/analysis/hunspell/Hunspell.java
Outdated
Show resolved
Hide resolved
lucene/analysis/common/src/java/org/apache/lucene/analysis/hunspell/Hunspell.java
Outdated
Show resolved
Hide resolved
…led and SuggestibleEntryCache
I've realized that the entry cache should have a longer lifetime than |
* {@link Suggester} documentation), but using more memory. With this option, the dictionary | ||
* entries are stored as fast-to-iterate plain words instead of highly compressed prefix trees. | ||
*/ | ||
public Suggester withSuggestibleEntryCache() { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I like the Suggester as a separate entity but this bit seems a bit awkward to me - it's a chicken and egg (constructor accepting SuggestibleEntryCache, but SuggestibleEntryCache is created via non-static method on the suggester). If the cache's lifecycle can be longer than the Suggester's then perhaps it should just be left as an explicit argument for a public constructor, without the static factory/ modifier method on the Suggester? Then folks who need the cache can compile it (for the same dictionary) and pass it to one or more suggesters, however they like.
Which also makes me think that buildCache should accept the dictionary reference and then a sanity-check could be added in the suggester to make sure the cache and the suggester's dictionary are the same... Or even modify the constructor to accept just the entry cache, which would piggyback the dictionary reference consistently.
Finally - I'm not a native speaker, but wouldn't it be simpler to call the cache just an EntryCache (or a SuggestionsCache)? Suggestible sounds a bit awkward to my (slavic) ear, but maybe it's just me.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thank you!
I've tried to avoid making the cache also a publicly-accessible API because I don't think that clients need such detail. The cache's lifecycle can be longer than the one of the checkCanceled
runnable (which is now bound to Hunspell
unfortunately, but it turned out that we need to create it afresh on each spell
/suggest
call), not necessarily than the suggester's lifecycle. So far I envision clients creating a single suggester for a dictionary, customizing it in the way they need, and keeping it in the memory.
If the cache remains private API, renaming it to EntryCache
seems fine to me. If it should be made public, I'd leave something about suggestion there. But SuggestionCache
means that suggestions themselves are cached, while here I cache some intermediate results.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ok, makes sense. Thanks for explaining this to me - looking at the diff only, I didn't get the whole picture. I still feel like that static mutating method is a bit awkward but if it's an experimental API then it's fine to change it as you go - no problem there.
As for naming - if nobody from the English part of the world expresses their opinion - leave it as it is. I really don't feel like I'm in the position to have the final say here :)
…ows a SuggestionTimeoutException
…sing more memory (opt-in) (#11893) hunspell: allow for faster dictionary iteration during 'suggest' by using more memory (opt-in)
No description provided.