
Bound the RAM used by the NodeHash (sharing common suffixes) during FST compilation #12633

Merged: 29 commits into apache:main, Oct 20, 2023

Conversation

@mikemccand (Member)

This is a first attempt (not committable yet! work in progress! many nocommits!) to use a bounded NodeHash when finding common suffixes during FST compilation. It uses a simple double-barrel LRU cache to hold the NodeHash entries.
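
For anyone unfamiliar with the double-barrel idea, here is a minimal standalone sketch (names invented for illustration; the PR's actual code stores packed node addresses, not boxed longs): two hash-table "barrels", where a lookup hit in the old barrel is promoted into the new one, and filling the new barrel discards the old one wholesale, giving cheap approximate-LRU eviction:

```java
import java.util.HashMap;
import java.util.Map;

/** Illustrative double-barrel LRU cache: recently used entries survive barrel swaps. */
class DoubleBarrelLRUSketch {
  private Map<Long, Long> primary = new HashMap<>(); // "new" barrel
  private Map<Long, Long> fallback = new HashMap<>(); // "old" barrel
  private final int maxSizePerBarrel;

  DoubleBarrelLRUSketch(int maxSizePerBarrel) {
    this.maxSizePerBarrel = maxSizePerBarrel;
  }

  Long get(long key) {
    Long value = primary.get(key);
    if (value == null) {
      value = fallback.get(key);
      if (value != null) {
        put(key, value); // hit in the old barrel: promote to the new one
      }
    }
    return value;
  }

  void put(long key, long value) {
    primary.put(key, value);
    if (primary.size() >= maxSizePerBarrel) {
      // Swap barrels: entries never promoted out of the old barrel are dropped
      // in O(1), approximating LRU eviction without per-entry bookkeeping.
      fallback = primary;
      primary = new HashMap<>();
    }
  }
}
```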

I created a simple tool, IndexToFST, to iterate all terms from a luceneutil benchy index and build an FST from them, and test_all_sizes.py to run this tool multiple times with different NodeHash sizes and gather the resulting metrics.

Relates #12542

@mikemccand (Member Author)

Here are the results from running test_all_sizes.py then results_to_md.py:

| NodeHash size | FST (MB) | RAM (MB) | FST build time (sec) |
|---:|---:|---:|---:|
| 0 | 577.4 | 0.0 | 35.2 |
| 4 | 586.5 | 0.0 | 43.2 |
| 8 | 587.0 | 0.0 | 46.4 |
| 16 | 585.2 | 0.0 | 44.8 |
| 32 | 582.0 | 0.0 | 45.9 |
| 64 | 578.8 | 0.0 | 45.4 |
| 128 | 573.0 | 0.0 | 45.9 |
| 256 | 563.6 | 0.0 | 46.1 |
| 512 | 551.2 | 0.0 | 45.4 |
| 1024 | 537.5 | 0.0 | 45.7 |
| 2048 | 523.4 | 0.0 | 46.0 |
| 4096 | 509.5 | 0.1 | 45.6 |
| 8192 | 495.8 | 0.1 | 45.2 |
| 16384 | 481.8 | 0.2 | 46.3 |
| 32768 | 461.1 | 0.5 | 45.2 |
| 65536 | 447.2 | 1.0 | 45.7 |
| 131072 | 432.4 | 2.0 | 46.3 |
| 262144 | 418.6 | 4.0 | 46.3 |
| 524288 | 402.4 | 8.0 | 46.9 |
| 1048576 | 391.0 | 16.0 | 50.0 |
| 2097152 | 380.8 | 32.0 | 55.2 |
| 4194304 | 371.4 | 64.0 | 58.3 |
| 8388608 | 362.5 | 128.0 | 59.9 |
| 16777216 | 356.1 | 256.0 | 59.3 |
| 33554432 | 351.4 | 512.0 | 57.3 |
| 67108864 | 350.2 | 1024.0 | 52.6 |
| 134217728 | 350.2 | 2048.0 | 49.2 |
| 268435456 | 350.2 | 4096.0 | 48.4 |
| 536870912 | 350.2 | 8192.0 | 46.9 |
| 1073741824 | 350.2 | 16384.0 | 44.5 |

One WTF (wow that's funny) is why a NodeHash size of 0 (no suffix sharing) creates a smaller FST than the tiny NodeHash sizes do: the curve should be monotonic, since the NodeHash only enables sharing of suffixes. Maybe the loss of locality of the FST suffix nodes causes later references to them to take more bytes? Confusing.

Another observation: it takes quite a few MB of RAM to bring the final FST size close-ish to its optimal / minimal size (350.2 MB).

It's also curious how the FST build time grows with a larger NodeHash -- maybe this is just the added cost of maintaining/cycling the double-barrel hash (and promoting entries from the "old" to the "new" barrel)?

I will try soonish to post a similar table from main (unbounded NodeHash), generated by tuning main's god-like knobs for controlling RAM usage during FST compilation, for comparison with this approach.

@mikemccand (Member Author)

For comparison, this is how the curve (RAM required during construction vs final FST size) looks on trunk, using the god-like parameters as best I could. I sorted the results in descending FST (MB) order, but note that this results in a confusing mix of the first three columns (the god-like parameters).

I'll try to turn these two tables into a single chart comparing RAM used and final FST size and maybe build time:

| Share suffix? | Share non-singleton? | Max tail length | FST (MB) | RAM (MB) | Build time (sec) |
|:---|:---|---:|---:|---:|---:|
| True | True | 1 | 584.3 | 0.0 | 30.5 |
| True | False | 1 | 584.3 | 0.0 | 30.5 |
| True | True | 0 | 577.4 | 0.0 | 30.1 |
| True | False | 0 | 577.4 | 0.0 | 30.2 |
| False | True | 0 | 577.4 | 0.0 | 29.9 |
| False | False | 0 | 577.4 | 0.0 | 29.8 |
| True | False | 2 | 576.4 | 0.0 | 31.4 |
| True | True | 2 | 569.9 | 14.5 | 32.4 |
| True | False | 3 | 523.7 | 3.6 | 32.2 |
| True | True | 3 | 509.7 | 29.0 | 34.1 |
| True | False | 4 | 486.2 | 7.0 | 32.9 |
| True | True | 4 | 468.9 | 58.0 | 36.7 |
| True | False | 5 | 456.1 | 14.0 | 34.1 |
| True | True | 5 | 437.6 | 56.0 | 37.8 |
| True | False | 6 | 435.4 | 28.0 | 35.3 |
| True | False | 7 | 419.8 | 58.0 | 36.6 |
| True | True | 6 | 416.1 | 116.0 | 41.2 |
| True | False | 8 | 408.1 | 56.0 | 37.6 |
| True | True | 7 | 399.8 | 116.0 | 41.7 |
| True | False | 9 | 398.8 | 116.0 | 39.9 |
| True | False | 10 | 392.1 | 116.0 | 40.5 |
| True | True | 8 | 387.8 | 112.0 | 43.3 |
| True | False | 11 | 387.2 | 116.0 | 40.7 |
| True | False | 12 | 383.7 | 112.0 | 41.2 |
| True | False | 13 | 381.0 | 112.0 | 41.8 |
| True | False | 14 | 379.1 | 112.0 | 42.1 |
| True | True | 9 | 378.2 | 112.0 | 45.2 |
| True | False | 15 | 377.6 | 232.0 | 44.4 |
| True | False | 16 | 376.5 | 232.0 | 44.0 |
| True | False | 17 | 375.6 | 232.0 | 45.0 |
| True | False | 18 | 374.9 | 232.0 | 44.5 |
| True | False | 19 | 374.3 | 232.0 | 45.6 |
| True | False | 20 | 373.9 | 232.0 | 44.7 |
| True | False | 2147483647 | 371.8 | 224.0 | 46.2 |
| True | True | 10 | 371.3 | 232.0 | 49.5 |
| True | True | 11 | 366.2 | 232.0 | 48.3 |
| True | True | 12 | 362.6 | 232.0 | 50.0 |
| True | True | 13 | 359.8 | 232.0 | 49.3 |
| True | True | 14 | 357.8 | 224.0 | 49.0 |
| True | True | 15 | 356.3 | 224.0 | 49.2 |
| True | True | 16 | 355.1 | 224.0 | 50.2 |
| True | True | 17 | 354.2 | 224.0 | 50.7 |
| True | True | 18 | 353.5 | 224.0 | 50.8 |
| True | True | 19 | 352.9 | 224.0 | 51.4 |
| True | True | 20 | 352.4 | 224.0 | 51.8 |
| True | True | 2147483647 | 350.2 | 464.0 | 56.9 |

@dweiss (Contributor) commented on Oct 9, 2023

I didn't get into all the details but I think this looks good. Your questions are indeed intriguing - I can't provide any explanation off the top of my head, really.

@mikemccand (Member Author)

Translating/merging the above two tables into a graph:

*(chart: RAM used during construction vs final FST size, PR vs main)*

Some observations:

  • The PR is mostly better at using less RAM to make the same size FST, yay!

  • It is a smoother, more predictable, monotonic tradeoff: the larger the NodeHash size, the smaller the FST. Whereas on main, using the god-like parameters, it's more dicey/spiky/unpredictable. It's like you are the co-pilot trying to land a 747 alone using only toothpicks.

  • At the "spend all the RAM necessary to get a truly minimal FST" end (the right of the chart) the PR looks like it uses a bit more RAM than main. I think I can improve on this by not wastefully using long[] but rather one of Lucene's many cool bit-packing dynamic/growable array thingys, like main does for its NodeHash. Or maybe @msokolov's idea to somehow do a reversed suffix lookup against the growing FST. I'll try that.

  • Bang for the buck tapers off as you'd expect: the early MBs of RAM you spend have a bigger payoff in reducing the FST size, while each later MB has less and less impact. This is nice 80/20-like behavior...

  • With the PR, you unfortunately cannot easily say "give me a minimal FST at all costs", like you can with main today. You'd have to keep trying larger and larger NodeHash sizes until the final FST size gets no smaller. I don't really like this regression -- I'll think about how to somehow keep that capability in the PR. E.g. we would want to use this option when compiling FSTs for Kuromoji, or users may want this when compiling synonym maps.

@gf2121 (Contributor) commented on Oct 10, 2023

> With the PR, you unfortunately cannot easily say "give me a minimal FST at all costs", like you can with main today. You'd have to keep trying larger and larger NodeHash sizes until the final FST size gets no smaller.

If we replace long[] with a growable array that starts small and grows smoothly up to nodeHashSize, can we just pass a big nodeHashSize (e.g. 1L << 63) or a constant NO_LIMIT to get a minimal FST?

```java
@@ -99,31 +87,23 @@ public class FSTCompiler<T> {
   * tuning and tweaking, see {@link Builder}.
   */
  public FSTCompiler(FST.INPUT_TYPE inputType, Outputs<T> outputs) {
```

Contributor: Unrelated to this PR, just a thought: would it be more maintainable in the long run if we had only a single way to build the FST, through the Builder?

@mikemccand (Member Author): Yeah, that's a good point -- I'm not sure why we have a public ctor on this class. I'll see if I can remove it, maybe as a follow-on PR.

```java
    this.fst = fst;
    this.in = in;
  }

  /**
   * Compares an unfrozen node (UnCompiledNode) with a frozen node at byte location address
   * (long), returning true if they are equal.
   */
  private boolean nodesEqual(FSTCompiler.UnCompiledNode<T> node, long address) throws IOException {
    fst.readFirstRealTargetArc(address, scratchArc, in);
```
Contributor: So does this mean we still need to read from the in-progress FST? I'm wondering whether it would be possible to read only from the cache; then we could decouple the NodeHash from the FST.

@mikemccand (Member Author): I think we should do that (fully decouple NodeHash from the FST's byte[]) as a follow-on issue? This change is tricky enough :)

@mikemccand (Member Author)

> With the PR, you unfortunately cannot easily say "give me a minimal FST at all costs", like you can with main today. You'd have to keep trying larger and larger NodeHash sizes until the final FST size gets no smaller.
>
> If we replace long[] with a growable array that starts small and grows smoothly up to nodeHashSize, can we just pass a big nodeHashSize (e.g. 1L << 63) or a constant NO_LIMIT to get a minimal FST?

I love this idea! This way the hash grows to consume only as much RAM as needed, up until the specified limit, at which point it begins pruning to cap the RAM usage at the limit.
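
A hypothetical sketch of that policy (all names invented for illustration, not this PR's code): on each insert, grow the table while doubling stays under the RAM budget, and fall back to barrel-swap pruning once it would not:

```java
/** Hypothetical grow-then-prune policy for a RAM-bounded hash (names invented). */
class BoundedHashPolicy {
  long count; // entries currently stored
  long tableSize = 16; // current table capacity (power of two)
  final long ramLimitBytes;

  BoundedHashPolicy(long ramLimitBytes) {
    this.ramLimitBytes = ramLimitBytes;
  }

  void maybeGrowOrPrune() {
    if (count < tableSize * 2 / 3) {
      return; // under the target load factor: nothing to do
    }
    if (ramBytesUsedFor(2 * tableSize) <= ramLimitBytes) {
      grow(2 * tableSize); // under budget: behave exactly like an unbounded hash
    } else {
      swapBarrels(); // at budget: drop the old barrel, pruning LRU suffixes
    }
  }

  long ramBytesUsedFor(long size) {
    return size * Long.BYTES; // simple long[]-based estimate
  }

  void grow(long newSize) {
    tableSize = newSize; // real code would rehash all entries here
  }

  void swapBarrels() {
    // discard the old barrel; see the double-barrel LRU sketch earlier in this thread
  }
}
```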

@mikemccand (Member Author)

Thanks for the suggestions @dungba88! I took the approach you suggested, with a few more pushed commits just now. Despite the increase in nocommits I think this is actually close! I like this new approach:

  • It uses the same mutable, packed, growable (in both size and bits-per-value) writer thingy (PagedGrowableWriter) that NodeHash uses on main
  • But now the FSTCompiler (and its Builder) takes an option to set a limit on the size (count of suffix entries) of the NodeHash. I plan to change this to a ramMB limit instead....
  • If you set a massive limit (Long.MAX_VALUE), every suffix is stored (as compactly as on main today) and you get a minimal FST.
  • If you set a lower limit and the NodeHash hits it, it begins pruning the least-recently-used suffixes, and you get a somewhat compressed FST. The larger the limit, the more RAM is used, and the closer to minimal your FST is.

I tested again on all terms from the wikimediumall index:

| NodeHash size | FST (MB) | RAM (MB) | Build time (sec) |
|---:|---:|---:|---:|
| 4 | 585.8 | 0.0 | 110.0 |
| 8 | 587.0 | 0.0 | 74.7 |
| 16 | 586.3 | 0.0 | 60.1 |
| 32 | 583.7 | 0.0 | 52.5 |
| 64 | 580.4 | 0.0 | 46.5 |
| 128 | 575.9 | 0.0 | 44.0 |
| 256 | 568.0 | 0.0 | 42.6 |
| 512 | 556.6 | 0.0 | 41.8 |
| 1024 | 543.2 | 0.0 | 42.4 |
| 2048 | 529.3 | 0.0 | 40.9 |
| 4096 | 515.2 | 0.0 | 41.0 |
| 8192 | 501.5 | 0.1 | 40.8 |
| 16384 | 488.2 | 0.1 | 40.3 |
| 32768 | 474.0 | 0.2 | 41.5 |
| 65536 | 453.0 | 0.5 | 42.0 |
| 131072 | 439.0 | 0.9 | 41.6 |
| 262144 | 424.2 | 1.8 | 41.5 |
| 524288 | 408.9 | 3.6 | 41.7 |
| 1048576 | 396.0 | 7.3 | 42.3 |
| 2097152 | 384.4 | 14.5 | 44.1 |
| 4194304 | 375.0 | 29.0 | 48.0 |
| 8388608 | 365.9 | 58.0 | 51.5 |
| 16777216 | 358.6 | 116.0 | 52.4 |
| 33554432 | 352.7 | 232.0 | 52.7 |
| 67108864 | 350.2 | 448.0 | 52.9 |
| 134217728 | 350.2 | 464.0 | 56.5 |
| 268435456 | 350.2 | 464.0 | 56.6 |
| 536870912 | 350.2 | 464.0 | 56.1 |
| 1073741824 | 350.2 | 464.0 | 55.7 |

Rendered as a graph vs main:

*(graph: final FST size vs RAM used during construction, PR vs main)*

It's less RAM than the previous long[] approach thanks to the packing done by PagedGrowableWriter.
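
For reference, PagedGrowableWriter is a real Lucene packed-ints structure; here is a rough usage sketch of why it packs tighter than long[] (the constructor arguments are illustrative, not necessarily what this PR passes):

```java
import org.apache.lucene.util.packed.PackedInts;
import org.apache.lucene.util.packed.PagedGrowableWriter;

public class PackedTableDemo {
  public static void main(String[] args) {
    // Each slot is stored in only as many bits as the largest value written so
    // far requires, growing bits-per-value on demand instead of a flat 64 bits.
    PagedGrowableWriter table =
        new PagedGrowableWriter(
            /* size */ 16,
            /* pageSize */ 1 << 27,
            /* startBitsPerValue */ 8,
            PackedInts.COMPACT); // favor minimal memory over lookup speed
    table.set(0, 42); // node addresses stay small early in the build, so entries stay small too
    System.out.println(table.get(0) + " stored in ~" + table.ramBytesUsed() + " bytes total");
  }
}
```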

```java
    assert node != 0;

    // confirm frozen hash and unfrozen hash are the same
    assert hash(node) == hash : "mismatch frozenHash=" + hash(node) + " vs hash=" + hash;
```
Contributor: Not necessarily related to this CR:

This seems to be only for the assertion, but hash(long) requires reading the FST and is thus one obstacle to decoupling the NodeHash and FST. I'm wondering if it would make sense to let fst.addNode (it should be fstCompiler.addNode() now) return the node value instead of the node address.

The 2nd and final obstacle is nodesEqual. It seems that would require the cache table to store the node value & address instead of just the address. Maybe we could build an LRU cache from nodeAddress to nodeValue? Storing values would mean more heap for small FSTs though.

@mikemccand (Member Author):

> This seems to be only for the assertion, but hash(long) requires reading the FST and is thus one obstacle to decoupling the NodeHash and FST. I'm wondering if it would make sense to let fst.addNode (it should be fstCompiler.addNode() now) return the node value instead of the node address.

This assertion is quite important for now, because we have two completely different hash(...) implementations: one reads from an already-frozen node written into the growing FST (a byte[]), and the other reads from an unfrozen node.

> The 2nd and final obstacle is nodesEqual. It seems that would require the cache table to store the node value & address instead of just the address. Maybe we could build an LRU cache from nodeAddress to nodeValue? Storing values would mean more heap for small FSTs though.

Hmm, if the hash table stores the full node value byte[] (instead of a reference into the growing FST byte[]), I think we would still need a nodesEqual method to compare unfrozen and frozen nodes? Or maybe we could always freeze the pending unfrozen node to do our comparisons, but often that would be wasted work since the same node (suffix) already exists in the NodeHash?

@mikemccand (Member Author)

OK, I switched the accounting to approximate RAM usage of the NodeHash, which is more intuitive for users. It behaves monotonically / smoothly: the more RAM you give to the suffixes stored in the NodeHash, the smaller the resulting FST:

*(graph: final FST size vs approximate NodeHash RAM limit, PR vs main)*

If you pass Double.POSITIVE_INFINITY (or any sufficiently large number) then we will store all suffixes and the resulting FST is minimal.
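
As a usage sketch (assuming the Builder exposes a setter matching the suffixRAMLimitMB ctor parameter shown later in this review; the input type and outputs here are just an example):

```java
import org.apache.lucene.util.fst.FST;
import org.apache.lucene.util.fst.FSTCompiler;
import org.apache.lucene.util.fst.PositiveIntOutputs;

// Cap the suffix-sharing NodeHash at ~32 MB; pass Double.POSITIVE_INFINITY
// to keep every suffix and get a truly minimal FST, or 0 to disable sharing.
FSTCompiler<Long> fstCompiler =
    new FSTCompiler.Builder<>(FST.INPUT_TYPE.BYTE1, PositiveIntOutputs.getSingleton())
        .suffixRAMLimitMB(32)
        .build();
```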

I think this is nearly ready -- I'll next clean up the remaining nocommits, and downgrade some to TODOs.

@mikemccand changed the title from "[WIP] first cut at bounding the NodeHash size during FST compilation" to "Bound the RAM used by the NodeHash (sharing common suffixes) during FST compilation" on Oct 18, 2023
@mikemccand marked this pull request as ready for review on October 18, 2023 at 15:38
@mikemccand (Member Author)

OK, I think this is ready -- I removed/downgraded all nocommits, added a CHANGES entry, and rebased onto latest main. Tests and precommit passed for me (at least once).

I set the default RAM limit to 32 MB, which should be plenty to get close (within 10%? at least for "real" terms) to the true minimal FST.

I think we could backport this to 9.x. There is an API break in FSTCompiler, but this class is marked @lucene.experimental so we are free to change the API. Also, this API is so uber-expert that likely very few users rely on it, and those that do would likely welcome "limit by RAM" instead of "limit by un-understandable parameters". But we should let this bake for a while in main before thinking about backporting...

@gf2121 (Contributor) left a comment:

Thanks @mikemccand! Great work!

I left some minor comments/questions; otherwise this looks good to me :)

@gf2121 (Contributor) left a comment:

LGTM, thanks @mikemccand! This is really a huge improvement in param friendliness :)

@mikemccand (Member Author)

Thanks @gf2121 -- I agree! It is so much more intuitive to tell the FST compiler how much RAM it can use to make as minimal an FST as it can. This means we can build bigger FSTs with less worry about heap sizes, and once we get "stream to disk" working, FST building in a fixed amount of RAM will be truly possible. Thank you to Tantivy's FST implementation for inspiring these changes!

I'll leave this for another day or so, and then merge at first only to main and let it cook for a while before backporting to 9.x.

```java
      boolean doShareSuffix,
      boolean doShareNonSingletonNodes,
      int shareMaxTailLength,
      double suffixRAMLimitMB, // pass 0 to disable suffix compression/trie; larger values create
```

Contributor: @param in the method javadoc instead? This is not easy to read with the spotless truncation.

@mikemccand (Member Author): Thanks @bruno-roustant -- I agree it looks awful :) -- but I think I'll just remove this wimpy comment. These private ctor parameters are better documented on the Builder setters.

```java
   * bounded by the number of unique suffixes. If you pass a value smaller than the builder would
   * use, the least recently used suffixes will be discarded, thus reducing suffix sharing and
   * creating a non-minimal FST. In this case, the larger the limit, the closer the FST will be to
   * its true minimal size, with diminishing returns as you increasea the limit. Pass {@code 0} to
```

Contributor: "increasea" -- typo.

```java
  // there -- we would not need any long per entry -- we'd be able to start at the FST end node and
  // work backwards from the transitions

  // TODO: couldn't we prune natrually babck until we see a transition with an output? it's highly
```

Contributor: "natrually babck" -- typo.

```java
@@ -110,8 +110,10 @@ protected long baseRamBytesUsed() {
  public long ramBytesUsed() {
    long bytesUsed = RamUsageEstimator.alignObjectSize(baseRamBytesUsed());
    bytesUsed += RamUsageEstimator.alignObjectSize(RamUsageEstimator.shallowSizeOf(subMutables));
    // System.out.println("abstract.ramBytesUsed:");
```

Contributor: Do we keep these commented prints?

@mikemccand (Member Author): I don't think we have a general rule :) I'll remove these ones.

```java
@@ -99,20 +184,18 @@ private long hash(FSTCompiler.UnCompiledNode<T> node) {
        h += 17;
      }
    }
    // System.out.println("  ret " + (h&Integer.MAX_VALUE));
```

Contributor: Do we need to mask with Long.MAX_VALUE below, since we mask anyway with the table mask?

Instead, we should multiply by the golden constant BitMixer#PHI_C64 (make it public). This really makes a difference in the evenness of the value distribution; it is one of the secrets of the HPPC hashing. By applying it, we get multiple advantages:

  • lookup should be improved (fewer hash collisions)
  • we can try to rehash at 3/4 occupancy, because performance should not be impacted until that point
  • on a hash collision, we can probe linearly with pos = pos + 1 instead of quadratically (lines 95 and 327); this may avoid some memory cache misses

(same for the other hash method)
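
A sketch of the suggested mixing step (PHI_C64 is HPPC's 64-bit golden-ratio constant, 0x9E3779B97F4A7C15L; the class and method names here are illustrative, not Lucene's actual code):

```java
class HashMixSketch {
  // Multiplying by the golden-ratio constant spreads entropy across all bits,
  // so masking down to the table size still yields a well-distributed slot.
  static final long PHI_C64 = 0x9E3779B97F4A7C15L; // HPPC BitMixer's 64-bit golden ratio

  static long firstSlot(long hash, long mask) {
    return (hash * PHI_C64) & mask; // no Long.MAX_VALUE mask needed before this
  }

  static long nextSlot(long slot, long mask) {
    return (slot + 1) & mask; // linear probe: sequential access, fewer cache misses
  }
}
```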

@mikemccand (Member Author):

> Do we need to mask with Long.MAX_VALUE below, since we mask anyway with the table mask?

You're right, this is pointless -- I'll remove it from both hash functions -- then we preserve that top (sign) bit for the following change:

> Instead, we should multiply by the golden constant BitMixer#PHI_C64 (make it public).

Whoa, this sounds awesome! I was wondering if we could improve the simplistic hashing here ... I'll open a spinoff issue with this idea. Sounds like low-hanging hashing fruit!

@mikemccand (Member Author):

> I'll open a spinoff issue with this idea. Sounds like low-hanging hashing fruit!

I opened #12704. Thanks @bruno-roustant!

@bruno-roustant (Contributor) left a comment:

I love the graphs!

@mikemccand (Member Author)

I'll also confirm Test2BFST still passes ... soon this test will no longer require a 35 GB heap to run!

Pushed commits (messages truncated by GitHub):

* …util wikipedia index and build an FST from them
* …; when half of the double barrel is full, allocate new primary hash at full size to save cost of continuously rehashing for a large FST
* … JVM; fix bogus assert (uncovered by Test2BFST); add TODO to Test2BFST anticipating building massive FSTs in small bounded RAM
@mikemccand (Member Author)

Test2BFST passed!

```
The slowest tests (exceeding 500 ms) during this run:
  2993.94s Test2BFST.test (:lucene:core)
The slowest suites (exceeding 1s) during this run:
  2994.06s Test2BFST (:lucene:core)

BUILD SUCCESSFUL in 49m 58s
246 actionable tasks: 78 executed, 168 up-to-date
```

That last check failed because of formatting -- I re-tidy'd and I think it's ready -- I'll push shortly.

@mikemccand merged commit afb2a60 into apache:main on Oct 20, 2023. 4 checks passed.
@mikemccand added a commit that referenced this pull request on Nov 20, 2023: …ST compilation (#12633)

* tweak comments; change if to switch
* remove old SOPs, minor comment styling, fixed silly performance bug on rehash using the wrong bitsRequired (count vs node)
* first raw cut; some nocommits added; some tests fail
* tests pass!
* fix silly fallback hash bug
* remove SOPs; add some temporary debugging metrics
* add temporary tool to test FST performance across differing NodeHash sizes
* remove (now deleted) shouldShareNonSingletonNodes call from Lucene90BlockTreeTermsWriter
* add simple tool to render results table to GitHub MD
* add simple temporary tool to iterate all terms from a provided luceneutil wikipedia index and build an FST from them
* first cut at using packed ints for hash table again
* add some nocommits; tweak test_all_sizes.py to new RAM usage approach; when half of the double barrel is full, allocate new primary hash at full size to save cost of continuously rehashing for a large FST
* switch to limit suffix hash by RAM usage not count (more intuitive for users); clean up some stale nocommits
* switch to more intuitive approximate RAM (MB) limit for allowed size of NodeHash
* nuke a few nocommits; a few more remain
* remove DO_PRINT_HASH_RAM
* no more FST pruning
* remove final nocommit: randomly change allowed NodeHash suffix RAM size in TestFSTs.testRealTerms
* remove SOP
* tidy
* delete temp utility tools
* remove dead (FST pruning) code
* add CHANGES entry; fix one missed fst.addNode -> fstCompiler.addNode during merge conflict resolution
* remove a mal-formed nocommit
* fold PR feedback
* fold feedback
* add gradle help test details on how to specify heap size for the test JVM; fix bogus assert (uncovered by Test2BFST); add TODO to Test2BFST anticipating building massive FSTs in small bounded RAM
* suppress sysout checks for Test2BFSTs; add helpful comment showing how to run it directly
* tidy