Lucene's FST Builder should have a simpler "knob" to trade off memory/CPU required against minimality #12542
I like it. These options we currently have are not even expert level, they're God-level...
What's the impact of having a non-minimal FST? Longer query times? Is that something that gets dwarfed by having multiple segments anyway? Maybe different merge policies should have different defaults: when using tiered merges we can have some slack, but when merging everything down to a single segment we probably should take the time to ensure minimality anyway.
A non-minimal FST means the index is a wee bit bigger, and perhaps lookups through the FST are a bit slower since we must keep more bytes hot / fewer bytes cache-local. But these effects are likely minuscule relative to the RAM savings during construction. We can test empirically to see the tradeoff curves.
Digging into this a bit, I think I found some silly performance bugs in our current FST impl:
I'll try to get the LRU hash working, but if that takes too long, we should separately fix these performance bugs (if I'm right that these are really bugs!).
With regard to automata vs. FSTs: they're nearly the same thing, conceptually. Automata are logically transducers producing a constant epsilon value (no value). This knowledge can be used to make them smaller, but they're the same animal, really. The root of the automaton/FST split in Lucene is historical and comes down to the source of the code: the brics package for constructing arbitrary automata, and the Daciuk/Mihov algorithm implementing minimal (at least at first) FST construction directly from sorted data (no intermediate "unoptimized" transitions).
If we allow non-optimal "transition graphs" then in theory we could also build FSTs in parallel: just build them independently from different prefix blocks. Yes, it wouldn't be optimal, but if we don't care then so be it.
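The "prefix block" idea in the comment above could look roughly like this sketch (the class name `PrefixBlocks` and the first-character partitioning scheme are illustrative assumptions, not anything in Lucene): split the sorted key stream by leading character, feed each block to an independent builder in parallel, then stitch the sub-FST roots under a new start state. The result is correct but not minimal, since suffixes are never shared across blocks.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.SortedMap;
import java.util.TreeMap;

// Hypothetical sketch, not Lucene code: partition sorted keys into
// independent prefix blocks. Each block could then feed its own FST
// builder on its own thread; no suffix sharing happens across blocks,
// so the stitched result trades minimality for parallel construction.
public class PrefixBlocks {
  static SortedMap<Character, List<String>> partition(List<String> sortedKeys) {
    SortedMap<Character, List<String>> blocks = new TreeMap<>();
    for (String key : sortedKeys) {
      blocks.computeIfAbsent(key.charAt(0), c -> new ArrayList<>()).add(key);
    }
    return blocks; // each block can feed an independent FST builder
  }
}
```

Because the input is sorted, each block's keys stay sorted too, which is exactly what a Daciuk/Mihov-style builder needs.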
OK this was really freaking me out overnight (allocating 128 MB array even for building the tiniest of FSTs), so I dug deeper, and it is a false alarm! It turns out that I'll open a separate tiny PR to address the wrong |
was too small, causing excess/wasted reallocations. This is just a performance bug, especially impacting larger FSTs, but likely a small overall impact even then. I also reduced the page size during rehashing from 1 GB (1 << 30) back down to the initial 128 MB (1 << 27) created on init. Relates apache#12542
Talking to @sokolovm at Community Over Code 2023, he suggested another idea here: instead of a (RAM hungry) hash table, couldn't we use the growing FST itself to look up suffixes? If we added the reversed (incoming) transitions to each FST node then we could do such a suffix lookup in reverse of the normal forward-only FST lookup. Maybe we could keep these additional reverse transitions in a separate RAM-efficient structure, used only during construction? (Because the written FST only needs the forward-only lookup.)
I've merged the change into |
Backported to 9.9.0 -- closing. |
move CHANGES.txt entry from 10.0 -> 9.9.0 on bulk backport of recent FST improvements
Description
[Spinoff from this comment about the cool approach Tantivy's FST implementation uses to limit memory during construction. See also this awesome detailed blog post.]
The most RAM/CPU costly part of constructing an FST is recording all suffixes seen so far so that when you see a new suffix, if it was already seen before, you can share the previous suffix. To guarantee a minimal (smallest number of states) FST, you must record every such suffix. But this is crazy costly when compiling many keys, and in practice you can accept loss of minimality in exchange for trimming how many / which suffixes you store.
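The suffix-recording cost described above can be illustrated with a toy sketch (not Lucene code; the map-of-suffix-strings representation is a deliberate simplification of what a real builder freezes into nodes): every suffix of every key must be remembered so a later identical suffix can reuse the already-built state, and that map only ever grows.

```java
import java.util.HashMap;
import java.util.Map;

// Toy illustration (not Lucene's implementation) of why minimality
// requires remembering every suffix: each suffix of each sorted key is
// registered in a map; a repeated suffix reuses the existing state, but
// the map can only shrink if we accept losing some sharing -- i.e.
// losing minimality.
public class SuffixSharingDemo {
  /** Returns {distinct suffixes remembered, suffix lookups that hit}. */
  static int[] countSuffixes(String[] sortedKeys) {
    Map<String, Integer> seen = new HashMap<>(); // suffix -> state id
    int nextState = 0;
    int shared = 0;
    for (String key : sortedKeys) {
      for (int i = 0; i < key.length(); i++) {
        String suffix = key.substring(i);
        if (seen.containsKey(suffix)) {
          shared++;                      // reuse the existing state
        } else {
          seen.put(suffix, nextState++); // must remember it forever
        }
      }
    }
    return new int[] {seen.size(), shared};
  }
}
```

For the keys "nation", "station", "vacation", the common tail "ation" (and its sub-suffixes) is reused repeatedly, but the map still retains all 11 distinct suffixes: with millions of keys this is where the construction RAM goes.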
Lucene's `FSTCompiler.Builder` has three hairy integer options for this (`minSuffixCount1`, `minSuffixCount2`, and `sharedMaxTailLength`), but 1) nobody really knows exactly what these do, and 2) they are horribly indirect and heavily quantized ways to tune RAM/CPU: you don't know how much RAM/CPU you are really saving.

Whereas the approach in Tantivy's FST implementation (which was forked originally from this implementation by Andrew Gallant) is a simple bounded `HashMap` keeping only the commonly reused suffixes. One could then tune how large this map is (Andrew suggests ~10K entries is large enough in practice) against how minimal you really need your FST to be.

I think we should replace the three confusing integers with this approach, which requires only a single integer. We could even make that single integer a bound on RAM required, to make it even more clearly meaningful to users. We should experiment with some "typical" corpora in how Lucene uses FSTs (encoding synonyms, terms, etc.) to find a good default. Today Lucene defaults to "save everything so you get the truly minimal FST".