
Lucene's FST Builder should have a simpler "knob" to trade off memory/CPU required against minimality #12542

Closed
mikemccand opened this issue Sep 7, 2023 · 9 comments


@mikemccand (Member)

Description

[Spinoff from this comment about the cool approach Tantivy's FST implementation uses to limit memory during construction. See also this awesome detailed blog post.]

The most RAM/CPU costly part of constructing an FST is recording all suffixes seen so far so that when you see a new suffix, if it was already seen before, you can share the previous suffix. To guarantee a minimal (smallest number of states) FST, you must record every such suffix. But this is crazy costly when compiling many keys, and in practice you can accept loss of minimality in exchange for trimming how many / which suffixes you store.
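The bookkeeping behind this can be sketched roughly as follows (a toy illustration with made-up names, not Lucene's actual NodeHash): to guarantee minimality, every frozen suffix must be remembered so that an identical later suffix can reuse the same node, which is why the structure grows with the number of distinct suffixes.

```java
import java.util.HashMap;
import java.util.Map;

// Toy sketch of why minimality is RAM-hungry (hypothetical names, not
// Lucene's actual NodeHash). Every frozen suffix is remembered so an
// identical later suffix maps to the same node; the map grows with the
// number of distinct suffixes seen.
class SuffixRegistry {
  private final Map<String, Integer> frozen = new HashMap<>();
  private int nextNodeId = 0;

  /** Returns the node id for this suffix, reusing a prior one if seen before. */
  int freeze(String suffix) {
    return frozen.computeIfAbsent(suffix, s -> nextNodeId++);
  }

  int size() {
    return frozen.size();
  }
}
```

Dropping entries from such a map never breaks correctness; it only means some identical suffixes get duplicate nodes, i.e. the FST is no longer minimal.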

Lucene's FSTCompiler.Builder has three hairy integer options for this (minSuffixCount1, minSuffixCount2, and sharedMaxTailLength), but 1) nobody really knows exactly what these do, and 2) they are horribly indirect and heavily quantized ways to tune RAM/CPU: you don't know how much RAM/CPU you are really saving.

Whereas the approach in Tantivy's FST implementation (which was originally forked from this implementation by Andrew Gallant) is a simple bounded HashMap keeping only the commonly reused suffixes. One could then tune how large this cache is (Andrew suggests ~10K entries is large enough in practice) against how minimal you really need your FST to be.

I think we should replace the three confusing integers with this approach, which requires only a single integer. We could even make the single integer a bound on RAM required to make it even more clearly meaningful to users. We should experiment with some "typical" corpora in how Lucene uses FSTs (encode synonyms, terms, etc.) to find a good default. Today Lucene defaults to "save everything so you get the truly minimal FST".
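As a sketch of the single-knob idea (hypothetical class, neither Lucene nor Tantivy API), an access-ordered LinkedHashMap gives a bounded LRU cache of suffixes in just a few lines:

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Hedged sketch of a bounded suffix cache (hypothetical class, not
// actual Lucene or Tantivy code). The single "knob" is maxEntries: in
// access order, LinkedHashMap evicts the least recently used suffix
// once the bound is exceeded. An evicted suffix simply gets re-frozen
// as a duplicate node later, trading minimality for bounded RAM.
class BoundedSuffixCache extends LinkedHashMap<String, Integer> {
  private final int maxEntries;

  BoundedSuffixCache(int maxEntries) { // e.g. ~10K, per Andrew Gallant's suggestion
    super(16, 0.75f, true);            // accessOrder=true => LRU iteration order
    this.maxEntries = maxEntries;
  }

  @Override
  protected boolean removeEldestEntry(Map.Entry<String, Integer> eldest) {
    return size() > maxEntries;
  }
}
```

A RAM bound instead of an entry bound would just divide the budget by an estimated per-entry cost before constructing the cache.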

@dweiss (Contributor)

dweiss commented Sep 7, 2023

I like it. These options we currently have are not even expert level, they're God-level...

@madrob (Contributor)

madrob commented Sep 7, 2023

What's the impact of having a non-minimal FST? Longer query times? Is that something that gets dwarfed by having multiple segments anyway? Maybe different merge policies have different defaults - when using tiered merges we can have some slack and when merging everything down to a single segment we probably should take the time to ensure minimality anyway.

@mikemccand (Member, Author)

Non-minimal FST means the index is a wee bit bigger, and perhaps lookups through the FST are a bit slower since we must have more bytes hot / fewer bytes cache-local. But it's likely these effects are minuscule relative to the RAM savings during construction. We can test empirically to see the tradeoff curves.

@mikemccand (Member, Author)

Digging into this a bit, I think I found some silly performance bugs in our current FST impl:

I'll try to get the LRU hash working, but if that takes too long, we should separately fix these performance bugs (if I'm right that these are really bugs!).

@dweiss (Contributor)

dweiss commented Sep 8, 2023

With regard to automata/ FSTs - they're nearly the same thing, conceptually. Automata are logically transducers producing a constant epsilon value (no value). This knowledge can be used to make them smaller but they're the same animal, really.

The root of the automaton/FST difference in Lucene is historically the source of the code: the brics package for constructing arbitrary automata, and the Daciuk/Mihov algorithm implementing minimal (at least at first) FST construction directly from sorted data (no intermediate "unoptimized" transitions).

> Non-minimal FST means the index is a wee bit bigger, and perhaps lookups through the FST are a bit slower since we must have more bytes hot / fewer bytes cache-local. But it's likely these effects are minuscule relative to the RAM savings during construction.

If we allow non-optimal "transition graphs" then in theory we could also build FSTs in parallel: just build them independently from different prefix blocks. Yes, it wouldn't be optimal, but if we don't care then so be it.

@mikemccand (Member, Author)

> We seem to create a PagedGrowableWriter with page size 128 MB here, meaning even when building a small FST, we are allocating at least 128 MB pages?

OK this was really freaking me out overnight (allocating 128 MB array even for building the tiniest of FSTs), so I dug deeper, and it is a false alarm!

It turns out that PagedGrowableWriter, via its parent class AbstractPagedMutable, will allocate a "just big enough" final page, instead of the full 128 MB page size. And it will reallocate whenever the NodeHash resizes to a larger array. There is also some sneaky power-of-2 mod trickery that ensures that the final page, even on indefinite rehashing, is always sized to exactly a power of 2. And a real if statement to enforce it. Phew!
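The two tricks described above can be sketched like so (hypothetical helper names, not the real PagedGrowableWriter / AbstractPagedMutable code):

```java
// Sketch of the addressing/allocation tricks described above
// (hypothetical helpers, not the real PagedGrowableWriter code).
class PagedSizing {
  // With a power-of-2 table size, the hash probe's modulo reduces to a
  // cheap bit mask: for non-negative hashes, hash % size == hash & (size - 1).
  static int slot(long hash, int tableSize) {
    return (int) (hash & (tableSize - 1));
  }

  // The final page is allocated "just big enough" rather than always
  // the full page size (e.g. 1 << 27 entries).
  static int lastPageLength(long totalSize, int pageSize) {
    int rem = (int) (totalSize % pageSize);
    return rem == 0 ? pageSize : rem;
  }
}
```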

I'll open a separate tiny PR to address the wrong bitsRequired during rehash -- that's just a smallish performance bug when building biggish FSTs.

mikemccand added a commit to mikemccand/lucene that referenced this issue Sep 9, 2023
was too small, causing excess/wasted reallocations.  This is just a
performance bug, especially impacting larger FSTs, but likely a small
overall impact even then.

I also reduced the page size during rehashing from 1 GB (1 << 30) back
down to the initial 128 MB (1 << 27) created on init.

Relates apache#12542
@mikemccand (Member, Author)

While I was talking to @sokolovm at Community Over Code 2023, he suggested another idea here: instead of a (RAM-hungry) hash table, couldn't we use the growing FST itself to look up suffixes?

If we added the reversed (incoming) transitions to each FST node then we could do such a suffix lookup in reverse of the normal forward-only FST lookup. Maybe we could keep these additional reverse transitions in a separate RAM-efficient structure, used only during construction? (Because the written FST only needs the forward-only lookup.)

@mikemccand (Member, Author)

I've merged the change into main! I'll let it bake for some time (week or two?) and if all looks good, backport to 9.x.

mikemccand added a commit that referenced this issue Nov 20, 2023
…0.0 -> 9.9.0 on bulk backport of recent FST improvements
@mikemccand (Member, Author)

Backported to 9.9.0 -- closing.

mikemccand added this to the 9.9.0 milestone Nov 20, 2023
slow-J pushed a commit to slow-J/lucene that referenced this issue Nov 20, 2023
… move CHANGES.txt entry from 10.0 -> 9.9.0 on bulk backport of recent FST improvements