Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

hunspell: allow for faster dictionary iteration during 'suggest' by using more memory (opt-in) #11893

Merged
merged 8 commits into from
Nov 9, 2022

Conversation

donnerpeter
Copy link
Contributor

No description provided.


char[] excludeFlags = dictionary.allNonSuggestibleFlags();
FlagEnumerator.Lookup flagLookup = dictionary.flagLookup;
IntPredicate isSuggestible = formId -> !flagLookup.hasAnyFlag(formId, excludeFlags);
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is basically the inlined EntryFilter class with one for loop removed

Math.max(1, word.length() - MAX_ROOT_LENGTH_DIFF),
word.length() + MAX_ROOT_LENGTH_DIFF,
(rootChars, forms) -> {
(rootChars, formSupplier) -> {
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Instead of IntsRef we now pass a Supplier<IntsRef> to allow for lazier deserialization and exclusion of unnecessary information (morphological data ids)

while (roots.size() > MAX_ROOTS) {
roots.poll();
IntsRef forms = formSupplier.get();
for (int i = 0; i < forms.length; i++) {
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Here's the loop from the inlined EntryFilter, it's now a bit simpler.

if (forms.ints.length < dataLength) {
forms.ints = new int[dataLength];
}
readForms(forms, in, dataLength);
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Form id reading is moved into LazyFormReader

@@ -54,7 +55,8 @@ class WordStorage {
private static final int COLLISION_MASK = 0x40;
private static final int SUGGESTIBLE_MASK = 0x20;
private static final int MAX_STORED_LENGTH = SUGGESTIBLE_MASK - 1;

private final int maxEntryLength;
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

maxEntryLength is used to allow passing Integer.MAX_VALUE as the maximal interesting length (to iterate the whole dictionary) without allocating an array of that size

@@ -54,7 +55,8 @@ class WordStorage {
private static final int COLLISION_MASK = 0x40;
private static final int SUGGESTIBLE_MASK = 0x20;
private static final int MAX_STORED_LENGTH = SUGGESTIBLE_MASK - 1;

private final int maxEntryLength;
private final boolean hasCustomMorphData;
Copy link
Contributor Author

@donnerpeter donnerpeter Oct 31, 2022

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

hasCustomMorphData is used to skip morphological data ids (if present) during the data deserialization, so that the cached flattened data is 2x smaller. That data isn't used during iteration anyway.

@@ -86,7 +86,7 @@ public void de() throws Exception {

@Test
public void de_suggest() throws Exception {
checkSuggestionPerformance("de", 100);
checkSuggestionPerformance("de", 150);
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We can now suggest for more words in reasonable time, yay!

@donnerpeter
Copy link
Contributor Author

With the cache, about 2x memory is used (~850MB for ~190 dictionaries). The caching gives me about 1.5x speedup for en/ru/de.

Copy link
Contributor

@dweiss dweiss left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Left a comment or two. As far as I understand, it's fine (and, again - I'm not really that familiar with Hunspell...).

@donnerpeter
Copy link
Contributor Author

I've realized that the entry cache should have a longer lifetime than checkCanceled, so I moved the suggester stuff into a separate class (where I also plan to add more optimizations). I'd appreciate comments on the new APIs.

* {@link Suggester} documentation), but using more memory. With this option, the dictionary
* entries are stored as fast-to-iterate plain words instead of highly compressed prefix trees.
*/
public Suggester withSuggestibleEntryCache() {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I like the Suggester as a separate entity but this bit seems a bit awkward to me - it's a chicken and egg (constructor accepting SuggestibleEntryCache, but SuggestibleEntryCache is created via non-static method on the suggester). If the cache's lifecycle can be longer than the Suggester's then perhaps it should just be left as an explicit argument for a public constructor, without the static factory/ modifier method on the Suggester? Then folks who need the cache can compile it (for the same dictionary) and pass it to one or more suggesters, however they like.

Which also makes me think that buildCache should accept the dictionary reference and then a sanity-check could be added in the suggester to make sure the cache and the suggester's dictionary are the same... Or even modify the constructor to accept just the entry cache, which would piggyback the dictionary reference consistently.

Finally - I'm not a native speaker, but wouldn't it be simpler to call the cache just an EntryCache (or a SuggestionsCache)? Suggestible sounds a bit awkward to my (slavic) ear, but maybe it's just me.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thank you!

I've tried to avoid making the cache also a publicly-accessible API because I don't think that clients need such detail. The cache's lifecycle can be longer than the one of the checkCanceled runnable (which is now bound to Hunspell unfortunately, but it turned out that we need to create it afresh on each spell/suggest call), not necessarily than the suggester's lifecycle. So far I envision clients creating a single suggester for a dictionary, customizing it in the way they need, and keeping it in the memory.

If the cache remains private API, renaming it to EntryCache seems fine to me. If it should be made public, I'd leave something about suggestion there. But SuggestionCache means that suggestions themselves are cached, while here I cache some intermediate results.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ok, makes sense. Thanks for explaining this to me - looking at the diff only, I didn't get the whole picture. I still feel like that static mutating method is a bit awkward but if it's an experimental API then it's fine to change it as you go - no problem there.

As for naming - if nobody from the English part of the world expresses their opinion - leave it as it is. I really don't feel like I'm in the position to have the final say here :)

@donnerpeter donnerpeter merged commit f7417d5 into apache:main Nov 9, 2022
@donnerpeter donnerpeter deleted the suggestibleEntryCache branch November 9, 2022 07:21
donnerpeter added a commit that referenced this pull request Jan 13, 2023
…sing more memory (opt-in) (#11893)

hunspell: allow for faster dictionary iteration during 'suggest' by using more memory (opt-in)
@rmuir rmuir added this to the 9.5.0 milestone Jan 17, 2023
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

3 participants