
Optimize FST suffix sharing for block tree index #12702

Closed
gf2121 opened this issue Oct 20, 2023 · 4 comments · Fixed by #12722

gf2121 commented Oct 20, 2023

Description

Today our block tree index encodes floorData as an FST output. The floor data is guaranteed to be stored within a single arc in the FST (never prefix-shared) because the fp is encoded before it. As a suffix, floor data can rarely be shared either. I wonder if we can somehow avoid adding floor data outputs into NodeHash?

Out of curiosity, I tried completely disabling suffix sharing in the block tree index, which resulted in only a 1.47% increase in total .tip size for wikimediumall:

baseline    candidate   diff
27188690    27588186    +1.47%
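To illustrate why floor data defeats suffix sharing, here is a toy sketch (not Lucene's actual NodeHash implementation; `freeze` and the byte payloads are invented for illustration): nodes are deduplicated by hashing their outgoing transitions, so an arc that carries a unique output can never match a previously registered node, yet it still takes up an entry in the hash table.

```python
# Toy model of FST suffix sharing: a node is identified by its set of
# outgoing transitions (label, output) -> child. Identical nodes are
# deduplicated through a registry dict, standing in for NodeHash.

def freeze(transitions, registry):
    """Return a canonical node id for these transitions, reusing an
    existing node when an identical one was registered before."""
    key = tuple(sorted(transitions.items()))
    if key not in registry:
        registry[key] = len(registry)  # new unique node
    return registry[key]

registry = {}
end = freeze({}, registry)  # the shared final node

# Two terms ending in the same suffix with NO output on the last arc:
# the final arcs hash identically, so one node serves both terms.
a = freeze({("g", b""): end}, registry)
b = freeze({("g", b""): end}, registry)
assert a == b  # suffix shared

# Same suffix, but each last arc carries a distinct output, as an arc
# holding fp + floorData would: no sharing is possible, yet each node
# still occupies an entry in the registry.
c = freeze({("g", b"\x05floor1"): end}, registry)
d = freeze({("g", b"\x07floor2"): end}, registry)
assert c != d  # unique outputs defeat suffix sharing
```

In this model, registering arcs like `c` and `d` only grows the hash table without ever producing a match, which is the RAM cost this issue is asking about.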

mikemccand commented Oct 20, 2023

> The floor data is guaranteed to be stored within a single arc in the FST (never prefix-shared) because the fp is encoded before it.

But won't the leading bytes of fp be shared in that prefix (since we switched to MSB vLong encoding)?
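That point can be sketched with a hand-rolled encoder (illustrative only, not Lucene's exact encoding routine): with most-significant-group-first vLong encoding, two nearby file pointers differ only in their trailing byte(s), so their leading bytes are identical and can participate in prefix sharing along the FST path. The concrete fp values below are arbitrary.

```python
def msb_vlong(v: int) -> bytes:
    """Encode v in 7-bit groups, most significant group first
    (continuation bit set on every byte except the last)."""
    groups = []
    while True:
        groups.append(v & 0x7F)
        v >>= 7
        if v == 0:
            break
    groups.reverse()  # emit high-order groups first
    return bytes(g | 0x80 for g in groups[:-1]) + bytes([groups[-1]])

fp1 = msb_vlong(27188690)
fp2 = msb_vlong(27188695)  # a nearby file pointer
# The two encodings share every byte except the last one:
assert fp1[:-1] == fp2[:-1]
```

With LSB-first encoding the low-order (most variable) bits would come first, so consecutive file pointers would diverge in their very first byte and nothing could be prefix-shared.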

> Out of curiosity, I tried completely disabling suffix sharing in the block tree index, which resulted in only a 1.47% increase in total .tip size for wikimediumall.

This is impressive! I would have expected a worse impact.

This is likely because BlockTree essentially stores a prefix-ish trie in RAM, not the full terms dictionary, so the suffixes are mostly dropped from the FST index and left in the term blocks stored separately. Note that we also do some interesting compression of the packed suffix byte[] in the on-disk blocks.

> I wonder if we can somehow avoid adding floor data outputs into NodeHash?

I'm curious: are there any floorData outputs in NodeHash (shared suffixes) at all today in the BlockTree terms index?

On the "limit how much RAM FST Compiler is allowed to use to share suffixes" PR I also tested fully disabling NodeHash (no suffix sharing) when storing all wikimediumall index body terms in an FST, but found a much bigger increase (65%: 350.2 MB -> 577.4 MB), because there the suffixes themselves are stored in the FST.

Similarly, if we explore experimental codecs that hold all terms in an FST, now possible / reasonable since the FST is off-heap, sharing the suffixes will be important.

@mikemccand

> Out of curiosity, I tried completely disabling suffix sharing in the block tree index, which resulted in only a 1.47% increase in total .tip size for wikimediumall.

Maybe we should just disable suffix sharing when building the BlockTree terms index FST? That would sidestep all of the extra RAM NodeHash would otherwise have to allocate.
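As a toy illustration of this tradeoff (plain Python, not Lucene; `count_nodes` is a made-up helper), building a trie over a few terms with and without suffix-node deduplication shows what sharing normally buys, and how unique per-term outputs, like fp + floorData on the final arc, erase most of that benefit:

```python
def count_nodes(terms, share_suffixes, outputs=None):
    """Build a trie over `terms` and return its node count. If
    `share_suffixes`, structurally identical sub-tries are deduplicated
    (a stand-in for NodeHash). `outputs` optionally maps each term to a
    payload carried on its final arc (a stand-in for fp + floorData)."""
    registry, counter = {}, [0]

    def build(items):
        # items: list of (remaining_suffix, payload) pairs
        children = {}
        for suffix, payload in items:
            if suffix:
                # the payload rides only on the term's last arc
                arc_out = payload if len(suffix) == 1 else None
                children.setdefault((suffix[0], arc_out), []).append(
                    (suffix[1:], payload))
        key = tuple(sorted(((label, build(rest))
                            for label, rest in children.items()), key=str))
        if share_suffixes and key in registry:
            return registry[key]  # identical sub-trie seen before
        counter[0] += 1
        if share_suffixes:
            registry[key] = counter[0]
        return counter[0]

    payloads = outputs or {t: None for t in terms}
    build([(t, payloads[t]) for t in sorted(terms)])
    return counter[0]

terms = ["walked", "walking", "talked", "talking"]
# Plain terms: the "ed"/"ing" suffix sub-tries are shared.
assert count_nodes(terms, True) < count_nodes(terms, False)
# A unique output on every final arc blocks most of that sharing.
unique = {t: f"fp:{i}" for i, t in enumerate(terms)}
assert count_nodes(terms, True, unique) > count_nodes(terms, True)
```

When every term's final arc carries a distinct output, deduplication only recovers the trivially shared final state, which mirrors why disabling sharing costs so little .tip size here while still paying NodeHash's full RAM overhead.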


gf2121 commented Oct 24, 2023

> Maybe we should just disable suffix sharing when building the BlockTree terms index FST? It would sidestep the whole extra RAM that NodeHash will have to allocate?

+1. I'll confirm the index/search performance later.


gf2121 commented Oct 25, 2023

On wikimediumall:

Queries (no obvious change):

                            TaskQPS baseline      StdDevQPS my_modified_version      StdDev                Pct diff p-value
                       OrHighLow      193.70      (2.3%)      190.83      (2.7%)   -1.5% (  -6% -    3%) 0.060
                         Respell       27.35      (2.0%)       27.00      (1.9%)   -1.3% (  -5% -    2%) 0.040
                      AndHighLow      345.65      (3.3%)      341.71      (4.2%)   -1.1% (  -8% -    6%) 0.341
                          Fuzzy2       32.56      (2.1%)       32.27      (2.2%)   -0.9% (  -5% -    3%) 0.201
                      OrHighHigh       17.79      (3.0%)       17.65      (3.0%)   -0.8% (  -6% -    5%) 0.427
                    OrNotHighLow      226.12      (1.9%)      224.78      (2.4%)   -0.6% (  -4% -    3%) 0.388
                       OrHighMed       38.91      (3.7%)       38.69      (6.0%)   -0.6% (  -9% -    9%) 0.718
                          Fuzzy1       41.03      (2.0%)       40.83      (2.2%)   -0.5% (  -4% -    3%) 0.489
                        PKLookup       99.00      (3.3%)       98.67      (3.2%)   -0.3% (  -6% -    6%) 0.749
            HighIntervalsOrdered        0.38      (3.4%)        0.38      (3.4%)   -0.3% (  -6% -    6%) 0.777
                 MedSloppyPhrase        9.42      (3.5%)        9.43      (3.7%)    0.0% (  -6% -    7%) 0.972
                      HighPhrase       38.18      (3.1%)       38.20      (4.4%)    0.1% (  -7% -    7%) 0.963
                       LowPhrase       25.50      (2.2%)       25.52      (2.9%)    0.1% (  -4% -    5%) 0.918
                   OrNotHighHigh      140.05      (3.3%)      140.22      (4.9%)    0.1% (  -7% -    8%) 0.926
                       MedPhrase        9.47      (2.0%)        9.48      (5.1%)    0.1% (  -6% -    7%) 0.903
                     MedSpanNear        1.11      (1.1%)        1.11      (1.9%)    0.3% (  -2% -    3%) 0.573
                     AndHighHigh       19.72      (2.0%)       19.82      (2.3%)    0.5% (  -3% -    4%) 0.463
                    OrHighNotLow      235.04      (3.2%)      236.29      (4.3%)    0.5% (  -6% -    8%) 0.654
                    OrNotHighMed      119.76      (2.7%)      120.43      (4.9%)    0.6% (  -6% -    8%) 0.657
             LowIntervalsOrdered        2.79      (4.6%)        2.81      (4.3%)    0.6% (  -7% -    9%) 0.677
                         Prefix3       65.42      (4.1%)       65.91      (3.7%)    0.8% (  -6% -    8%) 0.543
                         MedTerm      220.66      (3.4%)      222.69      (7.0%)    0.9% (  -9% -   11%) 0.596
                      AndHighMed       39.16      (2.2%)       39.53      (5.9%)    0.9% (  -6% -    9%) 0.507
                     LowSpanNear        4.14      (1.7%)        4.18      (2.3%)    0.9% (  -2% -    5%) 0.136
                HighSloppyPhrase        3.61      (2.8%)        3.65      (3.6%)    1.0% (  -5% -    7%) 0.354
                 LowSloppyPhrase        1.63      (4.8%)        1.64      (5.3%)    1.0% (  -8% -   11%) 0.543
                        Wildcard       67.75      (2.8%)       68.43      (1.9%)    1.0% (  -3% -    5%) 0.187
             MedIntervalsOrdered        3.50      (6.6%)        3.54      (6.2%)    1.1% ( -10% -   14%) 0.579
                    OrHighNotMed      118.45      (3.3%)      120.05      (4.4%)    1.4% (  -6% -    9%) 0.272
                   OrHighNotHigh      153.43      (4.0%)      156.00      (4.9%)    1.7% (  -6% -   10%) 0.235
                    HighSpanNear        6.82      (3.3%)        6.96      (2.7%)    2.0% (  -3% -    8%) 0.035
                        HighTerm      220.89      (3.7%)      227.25      (5.2%)    2.9% (  -5% -   12%) 0.044
                         LowTerm      218.76      (5.6%)      226.67      (9.8%)    3.6% ( -11% -   20%) 0.153

Indexing (slightly faster):

Baseline

Elapsed time: 7751587 ms
        Time in JIT compilation: 27898 ms
        Time in Young Generation GC: 5825 ms (1141 collections)
        Time in Old Generation GC: 392 ms (2 collections)
Garbage Generated in Young Generation: 0.0 MiB
Garbage Generated in Survivor Generation: 0.0 MiB
Garbage Generated in Old Generation: 0.0 MiB
Average CPU Load: 111.12316/800


Candidate

Statistics Ended at Wed Oct 25 17:57:22 CST 2023
Elapsed time: 7591086 ms
        Time in JIT compilation: 28882 ms
        Time in Young Generation GC: 5433 ms (1127 collections)
        Time in Old Generation GC: 390 ms (2 collections)
Garbage Generated in Young Generation: 0.0 MiB
Garbage Generated in Survivor Generation: 0.0 MiB
Garbage Generated in Old Generation: 0.0 MiB
Average CPU Load: 111.06987/800
