hunspell: allow for faster dictionary iteration during 'suggest' by using more memory (opt-in) #11893

donnerpeter · 2022-10-31T13:35:50Z

No description provided.

…sing more memory (opt-in)

donnerpeter · 2022-10-31T13:37:01Z

lucene/analysis/common/src/java/org/apache/lucene/analysis/hunspell/GeneratingSuggester.java

+
+    char[] excludeFlags = dictionary.allNonSuggestibleFlags();
+    FlagEnumerator.Lookup flagLookup = dictionary.flagLookup;
+    IntPredicate isSuggestible = formId -> !flagLookup.hasAnyFlag(formId, excludeFlags);


This is basically the inlined EntryFilter class with one for loop removed

donnerpeter · 2022-10-31T13:38:03Z

lucene/analysis/common/src/java/org/apache/lucene/analysis/hunspell/GeneratingSuggester.java

        Math.max(1, word.length() - MAX_ROOT_LENGTH_DIFF),
        word.length() + MAX_ROOT_LENGTH_DIFF,
-        (rootChars, forms) -> {
+        (rootChars, formSupplier) -> {


Instead of IntsRef we now pass a Supplier<IntsRef> to allow for lazier deserialization and exclusion of unnecessary information (morphological data ids)

donnerpeter · 2022-10-31T13:38:33Z

lucene/analysis/common/src/java/org/apache/lucene/analysis/hunspell/GeneratingSuggester.java

-          while (roots.size() > MAX_ROOTS) {
-            roots.poll();
+          IntsRef forms = formSupplier.get();
+          for (int i = 0; i < forms.length; i++) {


Here's the loop from the inlined EntryFilter, it's now a bit simpler.

donnerpeter · 2022-10-31T13:40:09Z

lucene/analysis/common/src/java/org/apache/lucene/analysis/hunspell/WordStorage.java

-          if (forms.ints.length < dataLength) {
-            forms.ints = new int[dataLength];
-          }
-          readForms(forms, in, dataLength);


Form id reading is moved into LazyFormReader

donnerpeter · 2022-10-31T13:41:08Z

lucene/analysis/common/src/java/org/apache/lucene/analysis/hunspell/WordStorage.java

@@ -54,7 +55,8 @@ class WordStorage {
  private static final int COLLISION_MASK = 0x40;
  private static final int SUGGESTIBLE_MASK = 0x20;
  private static final int MAX_STORED_LENGTH = SUGGESTIBLE_MASK - 1;
-
+  private final int maxEntryLength;


maxEntryLength is used to allow passing Integer.MAX_VALUE as the maximal interesting length (to iterate the whole dictionary) without allocating an array of that size

donnerpeter · 2022-10-31T13:42:05Z

lucene/analysis/common/src/java/org/apache/lucene/analysis/hunspell/WordStorage.java

@@ -54,7 +55,8 @@ class WordStorage {
  private static final int COLLISION_MASK = 0x40;
  private static final int SUGGESTIBLE_MASK = 0x20;
  private static final int MAX_STORED_LENGTH = SUGGESTIBLE_MASK - 1;
-
+  private final int maxEntryLength;
+  private final boolean hasCustomMorphData;


hasCustomMorphData is used to skip morphological data ids (if present) during the data deserialization, so that the cached flattened data is 2x smaller. That data isn't used during iteration anyway.

donnerpeter · 2022-10-31T13:44:02Z

lucene/analysis/common/src/test/org/apache/lucene/analysis/hunspell/TestPerformance.java

@@ -86,7 +86,7 @@ public void de() throws Exception {

  @Test
  public void de_suggest() throws Exception {
-    checkSuggestionPerformance("de", 100);
+    checkSuggestionPerformance("de", 150);


We can now suggest for more words in reasonable time, yay!

donnerpeter · 2022-10-31T13:44:32Z

With the cache, about 2x memory is used (~850MB for ~190 dictionaries). The caching gives me about 1.5x speedup for en/ru/de.

dweiss

Left a comment or two. As far as I understand, it's fine (and, again - I'm not really that familiar with Hunspell...).

lucene/analysis/common/src/java/org/apache/lucene/analysis/hunspell/Hunspell.java

…led and SuggestibleEntryCache

donnerpeter · 2022-11-04T10:39:04Z

I've realized that the entry cache should have a longer lifetime than checkCanceled, so I moved the suggester stuff into a separate class (where I also plan to add more optimizations). I'd appreciate comments on the new APIs.

dweiss · 2022-11-04T11:02:45Z

lucene/analysis/common/src/java/org/apache/lucene/analysis/hunspell/Suggester.java

+   * {@link Suggester} documentation), but using more memory. With this option, the dictionary
+   * entries are stored as fast-to-iterate plain words instead of highly compressed prefix trees.
+   */
+  public Suggester withSuggestibleEntryCache() {


I like the Suggester as a separate entity but this bit seems a bit awkward to me - it's a chicken and egg (constructor accepting SuggestibleEntryCache, but SuggestibleEntryCache is created via non-static method on the suggester). If the cache's lifecycle can be longer than the Suggester's then perhaps it should just be left as an explicit argument for a public constructor, without the static factory/ modifier method on the Suggester? Then folks who need the cache can compile it (for the same dictionary) and pass it to one or more suggesters, however they like.

Which also makes me think that buildCache should accept the dictionary reference and then a sanity-check could be added in the suggester to make sure the cache and the suggester's dictionary are the same... Or even modify the constructor to accept just the entry cache, which would piggyback the dictionary reference consistently.

Finally - I'm not a native speaker, but wouldn't it be simpler to call the cache just an EntryCache (or a SuggestionsCache)? Suggestible sounds a bit awkward to my (slavic) ear, but maybe it's just me.

Thank you!

I've tried to avoid making the cache also a publicly-accessible API because I don't think that clients need such detail. The cache's lifecycle can be longer than the one of the checkCanceled runnable (which is now bound to Hunspell unfortunately, but it turned out that we need to create it afresh on each spell/suggest call), not necessarily than the suggester's lifecycle. So far I envision clients creating a single suggester for a dictionary, customizing it in the way they need, and keeping it in the memory.

If the cache remains private API, renaming it to EntryCache seems fine to me. If it should be made public, I'd leave something about suggestion there. But SuggestionCache means that suggestions themselves are cached, while here I cache some intermediate results.

Ok, makes sense. Thanks for explaining this to me - looking at the diff only, I didn't get the whole picture. I still feel like that static mutating method is a bit awkward but if it's an experimental API then it's fine to change it as you go - no problem there.

As for naming - if nobody from the English part of the world expresses their opinion - leave it as it is. I really don't feel like I'm in the position to have the final say here :)

…ows a SuggestionTimeoutException

…sing more memory (opt-in) (#11893) hunspell: allow for faster dictionary iteration during 'suggest' by using more memory (opt-in)

hunspell: allow for faster dictionary iteration during 'suggest' by u…

4cdf474

…sing more memory (opt-in)

donnerpeter requested review from dweiss and rmuir October 31, 2022 13:35

donnerpeter commented Oct 31, 2022

View reviewed changes

dweiss approved these changes Oct 31, 2022

View reviewed changes

lucene/analysis/common/src/java/org/apache/lucene/analysis/hunspell/Hunspell.java Outdated Show resolved Hide resolved

lucene/analysis/common/src/java/org/apache/lucene/analysis/hunspell/Hunspell.java Outdated Show resolved Hide resolved

donnerpeter added 3 commits October 31, 2022 19:29

make the cache volatile

bacb17e

introduce Suggester to separate the lifecycles of Hunspell+checkCance…

689d9ee

…led and SuggestibleEntryCache

update CHANGES.txt

4486a23

donnerpeter added 2 commits November 4, 2022 11:42

tidy

fbc5962

make ecjLint happier

93d909a

dweiss reviewed Nov 4, 2022

View reviewed changes

donnerpeter added 2 commits November 7, 2022 09:08

restore providing partial results when the client's checkCanceled thr…

5e73e7a

…ows a SuggestionTimeoutException

tidy

058a912

donnerpeter merged commit f7417d5 into apache:main Nov 9, 2022

donnerpeter deleted the suggestibleEntryCache branch November 9, 2022 07:21

donnerpeter added a commit that referenced this pull request Jan 13, 2023

hunspell: allow for faster dictionary iteration during 'suggest' by u…

462f0cf

…sing more memory (opt-in) (#11893) hunspell: allow for faster dictionary iteration during 'suggest' by using more memory (opt-in)

rmuir added this to the 9.5.0 milestone Jan 17, 2023

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

hunspell: allow for faster dictionary iteration during 'suggest' by using more memory (opt-in) #11893

hunspell: allow for faster dictionary iteration during 'suggest' by using more memory (opt-in) #11893

donnerpeter commented Oct 31, 2022

donnerpeter Oct 31, 2022

donnerpeter Oct 31, 2022

donnerpeter Oct 31, 2022

donnerpeter Oct 31, 2022

donnerpeter Oct 31, 2022

donnerpeter Oct 31, 2022 •

edited

Loading

donnerpeter Oct 31, 2022

donnerpeter commented Oct 31, 2022

dweiss left a comment

donnerpeter commented Nov 4, 2022

dweiss Nov 4, 2022

donnerpeter Nov 4, 2022

dweiss Nov 4, 2022

hunspell: allow for faster dictionary iteration during 'suggest' by using more memory (opt-in) #11893

hunspell: allow for faster dictionary iteration during 'suggest' by using more memory (opt-in) #11893

Conversation

donnerpeter commented Oct 31, 2022

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

donnerpeter Oct 31, 2022 • edited Loading

Choose a reason for hiding this comment

Choose a reason for hiding this comment

donnerpeter commented Oct 31, 2022

dweiss left a comment

Choose a reason for hiding this comment

donnerpeter commented Nov 4, 2022

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

donnerpeter Oct 31, 2022 •

edited

Loading