
Optimize FST suffix sharing for block tree index #12702

Closed
gf2121 opened this issue Oct 20, 2023 · 4 comments · Fixed by #12722

gf2121 commented Oct 20, 2023

Description

Today our block tree index encodes floorData as an FST output. The floor data is guaranteed to be stored within a single arc in the FST (never prefix-shared) because the fp is encoded before it. As a suffix, floor data can rarely be shared either. I wonder if we can somehow avoid adding floor data outputs into NodeHash?

Out of curiosity, I tried completely disabling suffix sharing in the block tree index, which resulted in only a 1.47% increase in total .tip size for wikimediumall:

baseline    candidate   diff
27188690    27588186    +1.47%
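To illustrate why floor data defeats suffix sharing, here is a toy sketch (not Lucene's actual NodeHash implementation; `freeze` and the byte payloads are invented for illustration): nodes are deduplicated by hashing their outgoing transitions, so an arc that carries a unique output can never match a previously registered node, yet it still takes up an entry in the hash table.

```python
# Toy model of FST suffix sharing: a node is identified by its set of
# outgoing transitions (label, output) -> child. Identical nodes are
# deduplicated through a registry dict, standing in for NodeHash.

def freeze(transitions, registry):
    """Return a canonical node id for these transitions, reusing an
    existing node when an identical one was registered before."""
    key = tuple(sorted(transitions.items()))
    if key not in registry:
        registry[key] = len(registry)  # new unique node
    return registry[key]

registry = {}
end = freeze({}, registry)  # the shared final node

# Two terms ending in the same suffix with NO output on the last arc:
# the final arcs hash identically, so one node serves both terms.
a = freeze({("g", b""): end}, registry)
b = freeze({("g", b""): end}, registry)
assert a == b  # suffix shared

# Same suffix, but each last arc carries a distinct output, as an arc
# holding fp + floorData would: no sharing is possible, yet each node
# still occupies an entry in the registry.
c = freeze({("g", b"\x05floor1"): end}, registry)
d = freeze({("g", b"\x07floor2"): end}, registry)
assert c != d  # unique outputs defeat suffix sharing
```

In this model, registering arcs like `c` and `d` only grows the hash table without ever producing a match, which is the RAM cost this issue is asking about.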

mikemccand commented Oct 20, 2023

> The floor data is guaranteed to be stored within a single arc in the FST (never prefix-shared) because the fp is encoded before it.

But won't the leading bytes of fp be shared in that prefix (since we switched to MSB vLong encoding)?
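That point can be sketched with a hand-rolled encoder (illustrative only, not Lucene's exact encoding routine): with most-significant-group-first vLong encoding, two nearby file pointers differ only in their trailing byte(s), so their leading bytes are identical and can participate in prefix sharing along the FST path. The concrete fp values below are arbitrary.

```python
def msb_vlong(v: int) -> bytes:
    """Encode v in 7-bit groups, most significant group first
    (continuation bit set on every byte except the last)."""
    groups = []
    while True:
        groups.append(v & 0x7F)
        v >>= 7
        if v == 0:
            break
    groups.reverse()  # emit high-order groups first
    return bytes(g | 0x80 for g in groups[:-1]) + bytes([groups[-1]])

fp1 = msb_vlong(27188690)
fp2 = msb_vlong(27188695)  # a nearby file pointer
# The two encodings share every byte except the last one:
assert fp1[:-1] == fp2[:-1]
```

With LSB-first encoding the low-order (most variable) bits would come first, so consecutive file pointers would diverge in their very first byte and nothing could be prefix-shared.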

> Out of curiosity, I tried completely disabling suffix sharing in the block tree index, which resulted in only a 1.47% increase in total .tip size for wikimediumall.

This is impressive! I would have expected a worse impact.

This is likely because BlockTree essentially stores a prefix-ish trie in RAM, not the full terms dictionary, so the suffixes are mostly dropped from the FST index and left in the term blocks stored separately. Note that we also do some interesting compression of the packed suffix byte[] in the on-disk blocks.

> I wonder if we can somehow avoid adding floor data outputs into NodeHash?

I'm curious: are there any floorData outputs in NodeHash (shared suffixes) at all today in the BlockTree terms index?

On the "limit how much RAM FST Compiler is allowed to use to share suffixes" PR I also tested fully disabling NodeHash (no suffix sharing) when storing all wikimediumall index body terms in an FST, but found a much bigger increase (65%: 350.2 MB -> 577.4 MB), because there the suffixes themselves are stored in the FST.

Similarly, if we explore experimental codecs that hold all terms in an FST, now possible / reasonable since the FST is off-heap, sharing the suffixes will be important.

@mikemccand

> Out of curiosity, I tried completely disabling suffix sharing in the block tree index, which resulted in only a 1.47% increase in total .tip size for wikimediumall.

Maybe we should just disable suffix sharing when building the BlockTree terms index FST? That would sidestep all of the extra RAM NodeHash would otherwise have to allocate.
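As a toy illustration of this tradeoff (plain Python, not Lucene; `count_nodes` is a made-up helper), building a trie over a few terms with and without suffix-node deduplication shows what sharing normally buys, and how unique per-term outputs, like fp + floorData on the final arc, erase most of that benefit:

```python
def count_nodes(terms, share_suffixes, outputs=None):
    """Build a trie over `terms` and return its node count. If
    `share_suffixes`, structurally identical sub-tries are deduplicated
    (a stand-in for NodeHash). `outputs` optionally maps each term to a
    payload carried on its final arc (a stand-in for fp + floorData)."""
    registry, counter = {}, [0]

    def build(items):
        # items: list of (remaining_suffix, payload) pairs
        children = {}
        for suffix, payload in items:
            if suffix:
                # the payload rides only on the term's last arc
                arc_out = payload if len(suffix) == 1 else None
                children.setdefault((suffix[0], arc_out), []).append(
                    (suffix[1:], payload))
        key = tuple(sorted(((label, build(rest))
                            for label, rest in children.items()), key=str))
        if share_suffixes and key in registry:
            return registry[key]  # identical sub-trie seen before
        counter[0] += 1
        if share_suffixes:
            registry[key] = counter[0]
        return counter[0]

    payloads = outputs or {t: None for t in terms}
    build([(t, payloads[t]) for t in sorted(terms)])
    return counter[0]

terms = ["walked", "walking", "talked", "talking"]
# Plain terms: the "ed"/"ing" suffix sub-tries are shared.
assert count_nodes(terms, True) < count_nodes(terms, False)
# A unique output on every final arc blocks most of that sharing.
unique = {t: f"fp:{i}" for i, t in enumerate(terms)}
assert count_nodes(terms, True, unique) > count_nodes(terms, True)
```

When every term's final arc carries a distinct output, deduplication only recovers the trivially shared final state, which mirrors why disabling sharing costs so little .tip size here while still paying NodeHash's full RAM overhead.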


gf2121 commented Oct 24, 2023

> Maybe we should just disable suffix sharing when building the BlockTree terms index FST? It would sidestep the whole extra RAM that NodeHash will have to allocate?

+1. I'll confirm the index/search performance later.


gf2121 commented Oct 25, 2023

On wikimediumall:

Queries (no obvious change):

                            TaskQPS baseline      StdDevQPS my_modified_version      StdDev                Pct diff p-value
                       OrHighLow      193.70      (2.3%)      190.83      (2.7%)   -1.5% (  -6% -    3%) 0.060
                         Respell       27.35      (2.0%)       27.00      (1.9%)   -1.3% (  -5% -    2%) 0.040
                      AndHighLow      345.65      (3.3%)      341.71      (4.2%)   -1.1% (  -8% -    6%) 0.341
                          Fuzzy2       32.56      (2.1%)       32.27      (2.2%)   -0.9% (  -5% -    3%) 0.201
                      OrHighHigh       17.79      (3.0%)       17.65      (3.0%)   -0.8% (  -6% -    5%) 0.427
                    OrNotHighLow      226.12      (1.9%)      224.78      (2.4%)   -0.6% (  -4% -    3%) 0.388
                       OrHighMed       38.91      (3.7%)       38.69      (6.0%)   -0.6% (  -9% -    9%) 0.718
                          Fuzzy1       41.03      (2.0%)       40.83      (2.2%)   -0.5% (  -4% -    3%) 0.489
                        PKLookup       99.00      (3.3%)       98.67      (3.2%)   -0.3% (  -6% -    6%) 0.749
            HighIntervalsOrdered        0.38      (3.4%)        0.38      (3.4%)   -0.3% (  -6% -    6%) 0.777
                 MedSloppyPhrase        9.42      (3.5%)        9.43      (3.7%)    0.0% (  -6% -    7%) 0.972
                      HighPhrase       38.18      (3.1%)       38.20      (4.4%)    0.1% (  -7% -    7%) 0.963
                       LowPhrase       25.50      (2.2%)       25.52      (2.9%)    0.1% (  -4% -    5%) 0.918
                   OrNotHighHigh      140.05      (3.3%)      140.22      (4.9%)    0.1% (  -7% -    8%) 0.926
                       MedPhrase        9.47      (2.0%)        9.48      (5.1%)    0.1% (  -6% -    7%) 0.903
                     MedSpanNear        1.11      (1.1%)        1.11      (1.9%)    0.3% (  -2% -    3%) 0.573
                     AndHighHigh       19.72      (2.0%)       19.82      (2.3%)    0.5% (  -3% -    4%) 0.463
                    OrHighNotLow      235.04      (3.2%)      236.29      (4.3%)    0.5% (  -6% -    8%) 0.654
                    OrNotHighMed      119.76      (2.7%)      120.43      (4.9%)    0.6% (  -6% -    8%) 0.657
             LowIntervalsOrdered        2.79      (4.6%)        2.81      (4.3%)    0.6% (  -7% -    9%) 0.677
                         Prefix3       65.42      (4.1%)       65.91      (3.7%)    0.8% (  -6% -    8%) 0.543
                         MedTerm      220.66      (3.4%)      222.69      (7.0%)    0.9% (  -9% -   11%) 0.596
                      AndHighMed       39.16      (2.2%)       39.53      (5.9%)    0.9% (  -6% -    9%) 0.507
                     LowSpanNear        4.14      (1.7%)        4.18      (2.3%)    0.9% (  -2% -    5%) 0.136
                HighSloppyPhrase        3.61      (2.8%)        3.65      (3.6%)    1.0% (  -5% -    7%) 0.354
                 LowSloppyPhrase        1.63      (4.8%)        1.64      (5.3%)    1.0% (  -8% -   11%) 0.543
                        Wildcard       67.75      (2.8%)       68.43      (1.9%)    1.0% (  -3% -    5%) 0.187
             MedIntervalsOrdered        3.50      (6.6%)        3.54      (6.2%)    1.1% ( -10% -   14%) 0.579
                    OrHighNotMed      118.45      (3.3%)      120.05      (4.4%)    1.4% (  -6% -    9%) 0.272
                   OrHighNotHigh      153.43      (4.0%)      156.00      (4.9%)    1.7% (  -6% -   10%) 0.235
                    HighSpanNear        6.82      (3.3%)        6.96      (2.7%)    2.0% (  -3% -    8%) 0.035
                        HighTerm      220.89      (3.7%)      227.25      (5.2%)    2.9% (  -5% -   12%) 0.044
                         LowTerm      218.76      (5.6%)      226.67      (9.8%)    3.6% ( -11% -   20%) 0.153

Indexing (slightly faster):

Baseline

Elapsed time: 7751587 ms
        Time in JIT compilation: 27898 ms
        Time in Young Generation GC: 5825 ms (1141 collections)
        Time in Old Generation GC: 392 ms (2 collections)
Garbage Generated in Young Generation: 0.0 MiB
Garbage Generated in Survivor Generation: 0.0 MiB
Garbage Generated in Old Generation: 0.0 MiB
Average CPU Load: 111.12316/800


Candidate

Statistics Ended at Wed Oct 25 17:57:22 CST 2023
Elapsed time: 7591086 ms
        Time in JIT compilation: 28882 ms
        Time in Young Generation GC: 5433 ms (1127 collections)
        Time in Old Generation GC: 390 ms (2 collections)
Garbage Generated in Young Generation: 0.0 MiB
Garbage Generated in Survivor Generation: 0.0 MiB
Garbage Generated in Old Generation: 0.0 MiB
Average CPU Load: 111.06987/800
