
Lucene's FST Builder should have a simpler "knob" to trade off memory/CPU required against minimality #12542

Closed
mikemccand opened this issue Sep 7, 2023 · 9 comments


@mikemccand (Member)

Description

[Spinoff from this comment about the cool approach Tantivy's FST implementation uses to limit memory during construction. See also this awesome detailed blog post.]

The most RAM/CPU costly part of constructing an FST is recording all suffixes seen so far so that when you see a new suffix, if it was already seen before, you can share the previous suffix. To guarantee a minimal (smallest number of states) FST, you must record every such suffix. But this is crazy costly when compiling many keys, and in practice you can accept loss of minimality in exchange for trimming how many / which suffixes you store.
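The bookkeeping behind this can be sketched roughly as follows (a toy illustration with made-up names, not Lucene's actual NodeHash): to guarantee minimality, every frozen suffix must be remembered so that an identical later suffix can reuse the same node, which is why the structure grows with the number of distinct suffixes.

```java
import java.util.HashMap;
import java.util.Map;

// Toy sketch of why minimality is RAM-hungry (hypothetical names, not
// Lucene's actual NodeHash). Every frozen suffix is remembered so an
// identical later suffix maps to the same node; the map grows with the
// number of distinct suffixes seen.
class SuffixRegistry {
  private final Map<String, Integer> frozen = new HashMap<>();
  private int nextNodeId = 0;

  /** Returns the node id for this suffix, reusing a prior one if seen before. */
  int freeze(String suffix) {
    return frozen.computeIfAbsent(suffix, s -> nextNodeId++);
  }

  int size() {
    return frozen.size();
  }
}
```

Dropping entries from such a map never breaks correctness; it only means some identical suffixes get duplicate nodes, i.e. the FST is no longer minimal.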

Lucene's FSTCompiler.Builder has three hairy integer options for this (minSuffixCount1, minSuffixCount2, and sharedMaxTailLength), but 1) nobody really knows exactly what these do, and 2) they are horribly indirect and heavily quantized ways to tune RAM/CPU: you don't know how much RAM/CPU you are really saving.

Whereas the approach in Tantivy's FST implementation (which was originally forked from this implementation by Andrew Gallant) is a simple bounded HashMap keeping only the commonly reused suffixes. One could then tune how large this cache is (Andrew suggests ~10K entries is large enough in practice) against how minimal you really need your FST to be.

I think we should replace the three confusing integers with this approach, which requires only a single integer. We could even make the single integer a bound on RAM required to make it even more clearly meaningful to users. We should experiment with some "typical" corpora in how Lucene uses FSTs (encode synonyms, terms, etc.) to find a good default. Today Lucene defaults to "save everything so you get the truly minimal FST".
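As a sketch of the single-knob idea (hypothetical class, neither Lucene nor Tantivy API), an access-ordered LinkedHashMap gives a bounded LRU cache of suffixes in just a few lines:

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Hedged sketch of a bounded suffix cache (hypothetical class, not
// actual Lucene or Tantivy code). The single "knob" is maxEntries: in
// access order, LinkedHashMap evicts the least recently used suffix
// once the bound is exceeded. An evicted suffix simply gets re-frozen
// as a duplicate node later, trading minimality for bounded RAM.
class BoundedSuffixCache extends LinkedHashMap<String, Integer> {
  private final int maxEntries;

  BoundedSuffixCache(int maxEntries) { // e.g. ~10K, per Andrew Gallant's suggestion
    super(16, 0.75f, true);            // accessOrder=true => LRU iteration order
    this.maxEntries = maxEntries;
  }

  @Override
  protected boolean removeEldestEntry(Map.Entry<String, Integer> eldest) {
    return size() > maxEntries;
  }
}
```

A RAM bound instead of an entry bound would just divide the budget by an estimated per-entry cost before constructing the cache.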

@dweiss (Contributor)

dweiss commented Sep 7, 2023

I like it. These options we currently have are not even expert level, they're God-level...

@madrob (Contributor)

madrob commented Sep 7, 2023

What's the impact of having a non-minimal FST? Longer query times? Is that something that gets dwarfed by having multiple segments anyway? Maybe different merge policies have different defaults - when using tiered merges we can have some slack and when merging everything down to a single segment we probably should take the time to ensure minimality anyway.

@mikemccand (Member, Author)

Non-minimal FST means the index is a wee bit bigger, and perhaps lookups through the FST are a bit slower since we must have more bytes hot / fewer bytes cache-local. But it's likely these effects are minuscule relative to the RAM savings during construction. We can test empirically to see the tradeoff curves.

@mikemccand (Member, Author)

Digging into this a bit, I think I found some silly performance bugs in our current FST impl:

I'll try to get the LRU hash working, but if that takes too long, we should separately fix these performance bugs (if I'm right that these are really bugs!).

@dweiss (Contributor)

dweiss commented Sep 8, 2023

With regard to automata/ FSTs - they're nearly the same thing, conceptually. Automata are logically transducers producing a constant epsilon value (no value). This knowledge can be used to make them smaller but they're the same animal, really.

The root of the automaton/FST difference in Lucene is historically the source of the code: the brics package for constructing arbitrary automata, and the Daciuk/Mihov algorithm implementing minimal (at least at first) FST construction directly from sorted data (no intermediate "unoptimized" transitions).

> Non-minimal FST means the index is a wee bit bigger, and perhaps lookups through the FST are a bit slower since we must have more bytes hot / fewer bytes cache-local. But it's likely these effects are minuscule relative to the RAM savings during construction.

If we allow non-optimal "transition graphs" then in theory we could also build FSTs in parallel: just build them independently from different prefix blocks. Yes, it wouldn't be optimal, but if we don't care then so be it.

@mikemccand (Member, Author)

> We seem to create a PagedGrowableWriter with page size 128 MB here, meaning even when building a small FST, we are allocating at least 128 MB pages?

OK this was really freaking me out overnight (allocating 128 MB array even for building the tiniest of FSTs), so I dug deeper, and it is a false alarm!

It turns out that PagedGrowableWriter, via its parent class AbstractPagedMutable, will allocate a "just big enough" final page, instead of the full 128 MB page size. And it will reallocate whenever the NodeHash resizes to a larger array. There is also some sneaky power-of-2 mod trickery that ensures that the final page, even on indefinite rehashing, is always sized to exactly a power of 2. And a real if statement to enforce it. Phew!
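The two tricks described above can be sketched like so (hypothetical helper names, not the real PagedGrowableWriter / AbstractPagedMutable code):

```java
// Sketch of the addressing/allocation tricks described above
// (hypothetical helpers, not the real PagedGrowableWriter code).
class PagedSizing {
  // With a power-of-2 table size, the hash probe's modulo reduces to a
  // cheap bit mask: for non-negative hashes, hash % size == hash & (size - 1).
  static int slot(long hash, int tableSize) {
    return (int) (hash & (tableSize - 1));
  }

  // The final page is allocated "just big enough" rather than always
  // the full page size (e.g. 1 << 27 entries).
  static int lastPageLength(long totalSize, int pageSize) {
    int rem = (int) (totalSize % pageSize);
    return rem == 0 ? pageSize : rem;
  }
}
```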

I'll open a separate tiny PR to address the wrong bitsRequired during rehash -- that's just a smallish performance bug when building biggish FSTs.

mikemccand added a commit to mikemccand/lucene that referenced this issue Sep 9, 2023
was too small, causing excess/wasted reallocations.  This is just a
performance bug, especially impacting larger FSTs, but likely a small
overall impact even then.

I also reduced the page size during rehashing from 1 GB (1 << 30) back
down to the initial 128 MB (1 << 27) created on init.

Relates apache#12542
@mikemccand (Member, Author)

While I was talking to @sokolovm at Community Over Code 2023, he suggested another idea here: instead of a (RAM-hungry) hash table, couldn't we use the growing FST itself to look up suffixes?

If we added the reversed (incoming) transitions to each FST node then we could do such a suffix lookup in reverse of the normal forward-only FST lookup. Maybe we could keep these additional reverse transitions in a separate RAM-efficient structure, used only during construction? (Because the written FST only needs the forward-only lookup.)

@mikemccand (Member, Author)

I've merged the change into main! I'll let it bake for some time (week or two?) and if all looks good, backport to 9.x.

mikemccand added a commit that referenced this issue Nov 20, 2023
…0.0 -> 9.9.0 on bulk backport of recent FST improvements
@mikemccand (Member, Author)

Backported to 9.9.0 -- closing.

mikemccand added this to the 9.9.0 milestone Nov 20, 2023
slow-J pushed a commit to slow-J/lucene that referenced this issue Nov 20, 2023
… move CHANGES.txt entry from 10.0 -> 9.9.0 on bulk backport of recent FST improvements