Random access term dictionary #12688

Open
wants to merge 61 commits into
base: main
from

Conversation

Tony-X
Contributor

@Tony-X Tony-X commented Oct 16, 2023

Description

Related issue #12513

Opening this PR early to avoid one massive diff in one shot

  • Encode (term type, local ord) in FST
  • Encode/Decode term states with bitpacking/unpacking.

TODO:

  • Implement bit-packing and unpacking for each term type
  • Implement the PostingsFormat
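
As a rough illustration of encoding (term type, local ord) as one value, the sketch below packs both into a single long that could serve as an FST output. The class name, the 3-bit type field (enough for 8 term types), and the bit layout are assumptions made for this example, not necessarily the encoding used in the PR.

```java
/** Hypothetical helper: packs a term type and a per-type ordinal into one long. */
final class TermTypeAndOrd {
  // Assumption: 8 term types, so 3 bits are enough for the type.
  static final int TYPE_BITS = 3;
  static final long TYPE_MASK = (1L << TYPE_BITS) - 1;
  // Largest ordinal that still fits next to the type bits in a non-negative long.
  static final long MAX_ORD = (1L << (63 - TYPE_BITS)) - 1;

  /** Stores the type in the low bits and the ordinal in the remaining high bits. */
  static long pack(int termType, long ord) {
    if (termType < 0 || termType > TYPE_MASK) {
      throw new IllegalArgumentException("termType out of range: " + termType);
    }
    if (ord < 0 || ord > MAX_ORD) {
      throw new IllegalArgumentException("ord out of range: " + ord);
    }
    return (ord << TYPE_BITS) | termType;
  }

  static int termType(long packed) {
    return (int) (packed & TYPE_MASK);
  }

  static long ord(long packed) {
    return packed >>> TYPE_BITS;
  }
}
```

Keeping the type in the low bits means small ordinals produce small packed values, whichever term type they belong to.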

@mikemccand
Member

I'll try to review this soon -- it sounds compelling @Tony-X! I like how it is inspired by Tantivy's term dictionary format (which holds all terms + their metadata in RAM).

Also, with the upcoming ability to cleanly limit how much RAM the FSTCompiler is allowed to use to reduce the size of the FST, this approach becomes more feasible. Without that change, the FST compilation might easily use excessive RAM during indexing when merging large segments.

Member

@mikemccand mikemccand left a comment

I like this start! I left a few comments.

I'm really curious how big the FST will be if you encode realistic terms. Maybe use the new IndexToFST tool in luceneutil? It pulls all unique terms from a Lucene index, and then builds an FST from them. I wrote it (and used it) for the "limit RAM in FSTCompiler" PR.

@bruno-roustant
Contributor

I'll also try to review!
On the bit packing subject, I have some handy generic code (not in Lucene yet) to write and read variable size bits. Tell me if you are interested.

@Tony-X
Contributor Author

Tony-X commented Oct 20, 2023

Thanks @bruno-roustant! If you're okay with sharing it, feel free to post it here.

I'm in the process of baking my own specific implementation (which internally uses a single long as bit buffer), but I might absorb some interesting ideas from your impl.

Tony Xu added 8 commits October 26, 2023 14:32
Motivation: we will need to deal with encoding `IntBlockTermState` for
different types of terms. Instead of having a dedicated class for each term type,
which would be 8 types in total, we can spell out the individual components of
`IntBlockTermState` and then implement a codec which works with a composition of
those components. This way we can have a single implementation of the codec and
construct the composition (really just an array of components) per term type.
TermStateCodecImpl implements TermStateCodec, which supports encoding a block
of IntBlockTermState and decoding within that block at a given index.
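
A minimal sketch of the "composition of components" idea from the commit message above. `State` is a simplified stand-in for `IntBlockTermState` with only three fields, and `Component`, `DOC_ONLY`, `DOC_AND_FREQ`, and `bitWidths` are hypothetical names chosen for this example; the PR's actual classes may look different.

```java
import java.util.function.ToLongFunction;

final class ComponentSketch {
  /** Simplified stand-in for IntBlockTermState with only three fields (assumption). */
  static final class State {
    final long docStartFP;
    final int docFreq;
    final long totalTermFreq;

    State(long docStartFP, int docFreq, long totalTermFreq) {
      this.docStartFP = docStartFP;
      this.docFreq = docFreq;
      this.totalTermFreq = totalTermFreq;
    }
  }

  /** One encodable piece of a term state: a name plus a way to read its value. */
  static final class Component {
    final String name;
    final ToLongFunction<State> getter;

    Component(String name, ToLongFunction<State> getter) {
      this.name = name;
      this.getter = getter;
    }
  }

  static final Component DOC_START_FP = new Component("docStartFP", s -> s.docStartFP);
  static final Component DOC_FREQ = new Component("docFreq", s -> s.docFreq);
  static final Component TOTAL_TERM_FREQ = new Component("totalTermFreq", s -> s.totalTermFreq);

  // A term type is described by the ordered array of components it needs.
  static final Component[] DOC_ONLY = {DOC_START_FP, DOC_FREQ};
  static final Component[] DOC_AND_FREQ = {DOC_START_FP, DOC_FREQ, TOTAL_TERM_FREQ};

  /** Bits needed per component so that every (non-negative) value in the block fits. */
  static int[] bitWidths(Component[] components, State[] block) {
    int[] widths = new int[components.length];
    for (int i = 0; i < components.length; i++) {
      long max = 0;
      for (State s : block) {
        max = Math.max(max, components[i].getter.applyAsLong(s));
      }
      widths[i] = 64 - Long.numberOfLeadingZeros(max);
    }
    return widths;
  }
}
```

With this shape, one generic codec can loop over whichever component array a term type supplies instead of needing one class per term type.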
Member

@nknize nknize left a comment

bump

Member

Looks like this is only used by tests? Maybe move to the tests package? I'm also curious every time I see a new bit packer as we do this a lot throughout the code. Is there some reuse from another class impl maybe? PackedInts? DataOutput#writeVLong? Do we need it?

Contributor Author

Hey Knize, thanks for reviewing.

This is not test-only. It is used by the TestTermStateCodecImpl. I'm in the process of building the real compact bit packer.

I'm also curious every time I see a new bit packer as we do this a lot throughout the code. Is there some reuse from another class impl maybe? PackedInts? DataOutput#writeVLong? Do we need it?

I did search through the code base and couldn't find something I can use. The goal here is to pack a sequence of values that have different bitwidths. We can't use PackedInts as it requires values to have the same bitwidth. We can't use VLong either since we aim to write fixed-size records, so that we can do random access.

More detailed discussion can be found in this email thread: https://lists.apache.org/list?dev@lucene.apache.org:lte=1M:packedInts
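
For illustration only, here is a small sketch of packing values of different bitwidths back to back, using a single long as the bit buffer as mentioned earlier in this thread. The class name, the least-significant-bit-first layout, and the 56-bit per-value limit are assumptions of this example, not the PR's implementation.

```java
import java.io.ByteArrayOutputStream;

/** Hypothetical packer: appends values of varying bit widths, LSB first. */
final class VarWidthBitPacker {
  private final ByteArrayOutputStream out = new ByteArrayOutputStream();
  private long buffer;      // pending bits, least-significant bit first
  private int bufferedBits; // always < 8 between calls

  /** Appends the low bitWidth bits of value; the value must fit in bitWidth bits. */
  void add(long value, int bitWidth) {
    // With at most 7 pending bits, 56 more bits still fit in the 64-bit buffer.
    if (bitWidth < 0 || bitWidth > 56) {
      throw new IllegalArgumentException("bitWidth must be in [0, 56]: " + bitWidth);
    }
    if ((value >>> bitWidth) != 0) {
      throw new IllegalArgumentException(value + " does not fit in " + bitWidth + " bits");
    }
    buffer |= value << bufferedBits;
    bufferedBits += bitWidth;
    // Flush every complete byte so the buffer never overflows.
    while (bufferedBits >= 8) {
      out.write((int) (buffer & 0xFF));
      buffer >>>= 8;
      bufferedBits -= 8;
    }
  }

  /** Flushes any partial last byte (zero padded) and returns the packed bytes. */
  byte[] finish() {
    if (bufferedBits > 0) {
      out.write((int) (buffer & 0xFF));
      buffer = 0;
      bufferedBits = 0;
    }
    return out.toByteArray();
  }
}
```

If every record in a block is written with the same sequence of widths, all records occupy the same number of bits, which is what makes index-based random access possible.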

Member

I'm in the process of building the real compact bit packer.

+1. I'm assuming that will be added to this PR?

The goal here is to pack a sequence of values that have different bitwidths... We can't use VLong either since we aim to write fixed-size records, so that we can do random access

Hmm.. I'll have to look deeper at this. The reason I ask is because I did a similar bit packing w/ "random access" when serializing the ShapeDocValues binary tree and it feels like we often re-implement this logic in different forms for different use cases. Can we generalize this and lean out our code base to make it more useable and readable?

Contributor Author

+1. I'm assuming that will be added to this PR?

Yes! This PR can be large, so I took the advice from @mikemccand to open it early to avoid a massive diff in one shot. The goal of the PR is to have a fully functional PostingsFormat (or Codec).

The reason I ask is because I did a similar bit packing w/ "random access" when serializing the ShapeDocValues binary tree and it feels like we often re-implement this logic in different forms for different use cases. Can we generalize this and lean out our code base to make it more useable and readable?

Not sure if this code does the same thing. I could be wrong, but by a quick glance it seems to me it encodes values with variable length (VInt, VLong). Maybe the random-access is achieved in different ways?

Here in this PR, the use case is: I have a bunch of bits in the form of a byte[] that represents a block of records that all have the same size (measured in bits, but the size can be > 64, so we can't use PackedInts). Since they are of the same size, we can randomly access any record by its index and read the bits at [index * size, (index+1) * size).
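
To make the index * size addressing concrete, here is a minimal sketch of reading one field of one fixed-size record out of a packed byte[]. The class and method names and the least-significant-bit-first layout are assumptions of this example. It reads one bit at a time for clarity; a real decoder would read whole bytes or longs.

```java
/** Hypothetical reader for a block of equally sized, bit-packed records. */
final class FixedRecordReader {
  /** Reads bitWidth (0..64) bits starting at absolute bit position bitOffset, LSB first. */
  static long readBits(byte[] data, long bitOffset, int bitWidth) {
    long result = 0;
    for (int i = 0; i < bitWidth; i++) {
      long pos = bitOffset + i;
      // Select the byte that holds this bit, then the bit within that byte.
      int bit = (data[(int) (pos >>> 3)] >>> (int) (pos & 7)) & 1;
      result |= ((long) bit) << i;
    }
    return result;
  }

  /** Decodes field number `field` of record number `index`; all records share `widths`. */
  static long readField(byte[] data, int[] widths, int index, int field) {
    int recordBits = 0;
    for (int w : widths) {
      recordBits += w;
    }
    // Every record has the same size, so record `index` starts at index * recordBits.
    long offset = (long) index * recordBits;
    for (int f = 0; f < field; f++) {
      offset += widths[f];
    }
    return readBits(data, offset, widths[field]);
  }
}
```

No per-record offset table is needed: the start of any record follows from its index and the shared record size.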

I do agree that we should seek opportunities to unify. But for now since this is under sandbox, I'll make it specific to this implementation.

Member

Since they are of the same size...

That's the difference. In your use case the records (blocks) are guaranteed to be the same size, whereas in the serialized tree case the records (tree nodes) are not guaranteed to be the same size. This is by design to ensure the resulting docvalue disk consumption is as efficient (small) as possible.

...by a quick glance it seems to me it encodes values with variable length (VInt, VLong). Maybe the random-access is achieved in different ways?

Yes to variable length encoding. The "random-ness" isn't purely random in that traversal of the serialized tree is DFS. Because the tree nodes are variable size, the serialized array includes copious "book-keeping" in the form of "sizeOf" values. During DFS traversal the first "sizeOf" value provides the size of the entire left tree. Pruning the left tree just means we skip that many bytes to get to the right tree, and this continues recursively. In practice we don't expect to ever "back up" in our DFS traversal, so there is only a rewind method that simply resets the offset values to 0.
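
A toy sketch of the "skip sizeOf bytes to prune the left tree" idea. The node layout used here (value, left size, right size, then the left and right subtrees, all as 4-byte ints) is invented for this example and is not the actual ShapeDocValues format, which uses variable-length encoding.

```java
import java.nio.ByteBuffer;

/** Hypothetical pre-order serialized binary search tree with size book-keeping. */
final class SerializedTreeSketch {
  // Assumed node header: [int value][int leftSizeBytes][int rightSizeBytes].
  static final int HEADER_BYTES = 3 * Integer.BYTES;

  /** Looks up target while only ever moving forward through the array. */
  static boolean contains(byte[] tree, int target) {
    ByteBuffer buf = ByteBuffer.wrap(tree);
    int offset = 0;
    int remaining = tree.length; // size in bytes of the subtree starting at offset
    while (remaining > 0) {
      int value = buf.getInt(offset);
      int leftSize = buf.getInt(offset + Integer.BYTES);
      int rightSize = buf.getInt(offset + 2 * Integer.BYTES);
      if (target == value) {
        return true;
      }
      if (target < value) {
        // Descend into the left subtree, which starts right after the header.
        offset += HEADER_BYTES;
        remaining = leftSize;
      } else {
        // Prune the whole left subtree by skipping its serialized size.
        offset += HEADER_BYTES + leftSize;
        remaining = rightSize;
      }
    }
    return false;
  }
}
```

The traversal never backs up, which matches the observation that a rewind to offset 0 is the only reset that is needed.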

Seems the two use cases are subtly different but I could see roughly 80% overlap in the implementation. I'd love to efficiently encapsulate this logic for the next contributor that wants a random serialized traversal mechanism without a ridiculous amount of java object overhead. Sounds like @bruno-roustant had the same need? Could be a good follow on progress PR.

Member

@mikemccand mikemccand left a comment

Looking good @Tony-X! I'm very curious how the FST behaves in this terms dict ... it should be quite compact compared to our existing "whole terms dict in an FST" impls because this one just stores the packed long term type + ordinal in the FST and derefs out to the detailed metadata for this term.

It seems like you have the low level encode/decode working? So all that remains is to hook that up with the Codec components that read/write the terms dict ... then you can test the Codec by passing -Dtests.codec=<name> and Lucene will run all test cases using your Codec.

}
if (ord > MAX_ORD) {
throw new IllegalArgumentException(
"Input ord is too large for TermType: "
Member

More user friendly message here too?

Contributor Author

+1

Contributor Author

Did you mean to spell out the input ord?

Member

Well, maybe something like "Can only index XXX unique terms, but saw YYY terms for field ZZZ" or so? (If that's what this error message really means to the user).

Contributor Author

I see. That's a good point. I'll keep that in mind when implementing the main FieldsConsumer!

@Tony-X
Contributor Author

Tony-X commented Nov 3, 2023

It seems like you have the low level encode/decode working? So all that remains is to hook that up with the Codec components that read/write the terms dict ... then you can test the Codec by passing -Dtests.codec=<name> and Lucene will run all test cases using your Codec.

Thanks for the tips! Yes, almost there. I'm working on the real compact bitpacker and unpacker. I still need to implement the PostingsFormat afterwards. Do you think I need to implement a new Codec?

@Tony-X
Contributor Author

Tony-X commented Nov 6, 2023

Just realized that we have lucene99 Codec out! I'll update the code to reflect that as this posting format aims to work with the latest Codec.

@mikemccand
Member

This is reasonable as the terms index (FST) holds all the terms.

+1, nice!

Fuzzy/Wildcard/Prefix queries got much slower

This is also expected because currently I used the default implementation provided by TermsEnum which does not take advantage of the FST. With an optimized implementation I expect it to at least be on-par and slightly better because the FST holds information about all terms, whereas the current BlockTreeTerms only holds prefixes.

OK this makes sense, and it is a (sad) measure of how slow the emulated (on top of seekCeil) .intersect TermsEnum is. Once you have an optimized version it should likely be faster than block tree since it can intersect all suffixes instead of scanning byte[] suffixes in the term block and re-testing each.

HighTermTitleSort and HighTermMonthSort got about 4.5% ~ 10% less throughput

I don't quite understand why term lookup could affect sorting on a DV field

This is odd. Though, the HighTermMonthSort QPS is so crazy high as to not really be trustworthy -- likely BMW is kicking in and saving tons of work.

AndHighLow got slower

Am I missing some optimization opportunity for low freq terms?

Hmm maybe pulsing? Are we still inlining single-occurrence terms directly into the terms dict with your new terms dict?

Tony Xu added 8 commits December 14, 2023 22:31
Profiling shows lots of allocations to build a name for such a slice
…c/transitions

FST nodes have different variants. For nodes that are not variable-length encoded, we can look up
a target label more efficiently.

Similarly, for FSAs the TransitionAccessor allows access to a list of [min, max] ranges in order, on which
we can perform a binary search to advance to the applicable transitions for a given target
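
A minimal sketch of the binary search over sorted [min, max] ranges described in the commit message above. It uses plain int arrays rather than Lucene's automaton classes, and the class and method names are hypothetical.

```java
/** Hypothetical helper over sorted, non-overlapping [min, max] label ranges. */
final class TransitionSearchSketch {
  /**
   * Returns the index of the first range at or after `from` whose max is >= target,
   * or -1 if every remaining range ends before target.
   */
  static int advance(int[] maxs, int from, int target) {
    int lo = from;
    int hi = maxs.length - 1;
    int result = -1;
    while (lo <= hi) {
      int mid = (lo + hi) >>> 1;
      if (maxs[mid] >= target) {
        result = mid; // candidate; keep looking for an earlier one
        hi = mid - 1;
      } else {
        lo = mid + 1; // this range ends before target, skip it
      }
    }
    return result;
  }

  /** True if range idx actually contains label (advance may land on a later range). */
  static boolean accepts(int[] mins, int[] maxs, int idx, int label) {
    return idx >= 0 && mins[idx] <= label && label <= maxs[idx];
  }
}
```

When `advance` lands on a range that starts after the target, the caller knows the label has no transition and can skip ahead on the other side of the intersection.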
@Tony-X
Contributor Author

Tony-X commented Dec 15, 2023

Since the first working version, I have iterated through a list of profiling-guided allocation optimizations, as they stood out quite obviously from the merged JFR reports (thanks to luceneutil!).

Some of them come from my code that implements the term dictionary data lookup, and a few of them are at a more general Lucene level. I want to highlight the general issues I see from this work, and maybe we can have separate issues to improve them!

Here is the first heap profile comparison (search-only, no indexing).

Candidate Heap
17.50%        24440M        java.lang.Long#valueOf()
10.09%        14096M        jdk.internal.misc.Unsafe#allocateUninitializedArray()
6.87%         9594M         org.apache.lucene.facet.taxonomy.IntTaxonomyFacets#initializeValueCounters()
4.40%         6140M         org.apache.lucene.facet.sortedset.SortedSetDocValuesFacetCounts#countOneSegmentNHLD()
...
main
13.65%        11898M        org.apache.lucene.facet.taxonomy.IntTaxonomyFacets#initializeValueCounters()
9.26%         8071M         org.apache.lucene.util.FixedBitSet#<init>()
6.70%         5836M         org.apache.lucene.facet.sortedset.SortedSetDocValuesFacetCounts#countOneSegmentNHLD()
6.60%         5751M         org.apache.lucene.util.ArrayUtil#growExact()
5.21%         4541M         org.apache.lucene.facet.FacetsConfig#stringToPath()
4.69%         4090M         org.apache.lucene.util.DocIdSetBuilder$Buffer#<init>()

FST doesn't play nicely with primitive types (I know, this is more or less a Java issue)

24440M java.lang.Long#valueOf(): a huge amount of allocations. The cause is obvious: the FST implementation is generic over its output type, and in my case T is Long. So for a trivial long add or subtract, the implementation would allocate an object. Not only is that wasteful, but from a perf perspective a primitive add costs less than 1 CPU cycle vs. calling the allocator, which is easily tens if not hundreds of cycles.
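
The boxing cost can be sketched in isolation. `Outputs<T>` below is a simplified stand-in written for this example, not Lucene's actual FST API. With T = Long, each `add` unboxes, adds, and re-boxes through `Long.valueOf`, which returns a cached instance only for small values (the -128 to 127 range is guaranteed) and otherwise typically allocates unless the JIT can eliminate the allocation.

```java
/** Hypothetical illustration of generic (boxed) vs primitive output accumulation. */
final class BoxingSketch {
  /** Simplified stand-in for a generic output algebra (assumption). */
  interface Outputs<T> {
    T add(T prefix, T output);

    T zero();
  }

  static final Outputs<Long> BOXED =
      new Outputs<Long>() {
        @Override
        public Long add(Long prefix, Long output) {
          // Auto-unboxes both operands, then boxes the sum via Long.valueOf.
          return prefix + output;
        }

        @Override
        public Long zero() {
          return 0L;
        }
      };

  /** Generic accumulation along a path: one boxed result per arc. */
  static <T> T sumGeneric(Outputs<T> outputs, T[] arcOutputs) {
    T acc = outputs.zero();
    for (T o : arcOutputs) {
      acc = outputs.add(acc, o);
    }
    return acc;
  }

  /** Primitive-specialized accumulation: no objects involved. */
  static long sumPrimitive(long[] arcOutputs) {
    long acc = 0;
    for (long o : arcOutputs) {
      acc += o;
    }
    return acc;
  }
}
```

Both methods compute the same sum; the difference is only in how many short-lived Long objects the generic path may create on a hot lookup path.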

For this work, I forked the FST class and manually templated it with long just to see how much difference it makes. Here is a diff in heap profile and bench results before and after.

Before 

PERCENT       HEAP SAMPLES  STACK                                                                                                                              
25.97%        32791M        java.lang.Long#valueOf()
7.58%         9571M         org.apache.lucene.facet.taxonomy.IntTaxonomyFacets#initializeValueCounters()
5.13%         6482M         org.apache.lucene.util.FixedBitSet#<init>()
4.90%     
....

After
PERCENT       HEAP SAMPLES  STACK
8.44%         7988M         org.apache.lucene.facet.taxonomy.IntTaxonomyFacets#initializeValueCounters()
7.17%         6788M         org.apache.lucene.util.FixedBitSet#<init>()
6.22%         5886M         org.apache.lucene.facet.sortedset.SortedSetDocValuesFacetCounts#countOneSegmentNHLD()
5.89%         5577M         org.apache.lucene.util.ArrayUtil#growExact()

Bench diff

Before

                            Task    QPS baseline      StdDev    QPS my_modified_version      StdDev                Pct diff p-value
                        Wildcard       11.61      (2.7%)        2.40      (0.6%)  -79.4% ( -80% -  -78%) 0.000
                          Fuzzy1       78.17      (0.7%)       27.16      (0.9%)  -65.3% ( -66% -  -64%) 0.000
                         Respell       29.09      (0.6%)       10.91      (0.8%)  -62.5% ( -63% -  -61%) 0.000
                          Fuzzy2       47.80      (0.6%)       18.50      (1.2%)  -61.3% ( -62% -  -59%) 0.000
                         Prefix3      765.08      (3.1%)      463.94      (0.9%)  -39.4% ( -42% -  -36%) 0.000
               HighTermTitleSort       98.48      (2.0%)       90.62      (2.2%)   -8.0% ( -11% -   -3%) 0.000
           BrowseMonthTaxoFacets        3.89     (29.2%)        3.62      (0.9%)   -6.9% ( -28% -   32%) 0.293
                 LowSloppyPhrase       22.73      (6.5%)       22.35      (6.9%)   -1.7% ( -14% -   12%) 0.432
                         LowTerm      365.47      (3.6%)      359.62      (2.9%)   -1.6% (  -7% -    5%) 0.121
                        HighTerm      398.57      (5.1%)      393.16      (4.7%)   -1.4% ( -10% -    8%) 0.380
                 MedSloppyPhrase       10.63      (3.6%)       10.51      (3.7%)   -1.1% (  -8% -    6%) 0.339
                         MedTerm      422.73      (4.2%)      418.60      (4.0%)   -1.0% (  -8% -    7%) 0.451
            MedTermDayTaxoFacets       14.84      (2.6%)       14.71      (2.5%)   -0.8% (  -5% -    4%) 0.296
                HighSloppyPhrase       12.41      (3.1%)       12.33      (3.1%)   -0.7% (  -6% -    5%) 0.487
            HighTermTitleBDVSort        6.88      (3.3%)        6.84      (3.5%)   -0.6% (  -7% -    6%) 0.599
                       LowPhrase       58.15      (2.9%)       57.85      (2.8%)   -0.5% (  -6% -    5%) 0.567
       BrowseDayOfYearSSDVFacets        3.24      (0.4%)        3.23      (0.5%)   -0.3% (  -1% -    0%) 0.042
                       MedPhrase       26.19      (3.1%)       26.11      (3.2%)   -0.3% (  -6% -    6%) 0.775
                    OrNotHighMed      185.23      (3.9%)      184.73      (3.3%)   -0.3% (  -7% -    7%) 0.813
          OrHighMedDayTaxoFacets        3.82      (3.3%)        3.81      (3.2%)   -0.3% (  -6% -    6%) 0.796
                    OrHighNotLow      194.98      (5.1%)      194.51      (4.6%)   -0.2% (  -9% -   10%) 0.875
                    OrHighNotMed      337.15      (4.4%)      336.53      (3.8%)   -0.2% (  -7% -    8%) 0.888
                          IntNRQ       67.60      (0.9%)       67.55      (1.0%)   -0.1% (  -1% -    1%) 0.783
                     MedSpanNear        9.85      (1.4%)        9.84      (2.1%)   -0.1% (  -3% -    3%) 0.906
                   OrNotHighHigh      205.12      (4.1%)      205.01      (3.9%)   -0.1% (  -7% -    8%) 0.967
        AndHighHighDayTaxoFacets        6.35      (1.5%)        6.34      (1.7%)   -0.0% (  -3% -    3%) 0.932
           BrowseMonthSSDVFacets        3.29      (0.8%)        3.29      (0.7%)   -0.0% (  -1% -    1%) 0.887
     BrowseRandomLabelSSDVFacets        2.30      (0.8%)        2.30      (1.0%)    0.0% (  -1% -    1%) 0.919
                     LowSpanNear       16.41      (2.6%)       16.42      (2.7%)    0.1% (  -5% -    5%) 0.931
                      HighPhrase       77.12      (3.0%)       77.20      (3.6%)    0.1% (  -6% -    6%) 0.923
         AndHighMedDayTaxoFacets       39.64      (1.2%)       39.68      (1.0%)    0.1% (  -2% -    2%) 0.742
     BrowseRandomLabelTaxoFacets        3.19      (1.6%)        3.19      (1.1%)    0.1% (  -2% -    2%) 0.728
            BrowseDateTaxoFacets        3.73      (0.7%)        3.74      (0.5%)    0.3% (   0% -    1%) 0.157
                     AndHighHigh       27.08      (1.3%)       27.15      (3.0%)    0.3% (  -4% -    4%) 0.718
       BrowseDayOfYearTaxoFacets        3.76      (0.6%)        3.77      (0.5%)    0.3% (   0% -    1%) 0.072
           HighTermDayOfYearSort      224.01      (2.1%)      224.81      (2.1%)    0.4% (  -3% -    4%) 0.592
                    HighSpanNear        6.09      (2.7%)        6.11      (3.1%)    0.4% (  -5% -    6%) 0.683
            HighIntervalsOrdered        8.08      (3.3%)        8.11      (3.4%)    0.4% (  -6% -    7%) 0.705
                      TermDTSort      103.29      (4.4%)      103.83      (3.1%)    0.5% (  -6% -    8%) 0.669
             MedIntervalsOrdered       33.12      (4.4%)       33.29      (4.6%)    0.5% (  -8% -    9%) 0.702
             LowIntervalsOrdered       10.06      (3.9%)       10.12      (3.6%)    0.6% (  -6% -    8%) 0.609
                      AndHighMed       73.71      (2.2%)       74.18      (2.5%)    0.6% (  -3% -    5%) 0.394
                       OrHighMed       71.44      (2.7%)       71.98      (3.3%)    0.7% (  -5% -    6%) 0.429
            BrowseDateSSDVFacets        0.96      (4.8%)        0.97      (5.7%)    0.9% (  -9% -   11%) 0.601
                   OrHighNotHigh      308.82      (4.0%)      311.53      (3.7%)    0.9% (  -6% -    8%) 0.470
                       OrHighLow      404.69      (3.0%)      408.63      (3.5%)    1.0% (  -5% -    7%) 0.348
                      OrHighHigh       20.44      (4.7%)       20.73      (7.2%)    1.4% ( -10% -   13%) 0.469
                    OrNotHighLow      381.28      (1.8%)      388.18      (2.1%)    1.8% (  -2% -    5%) 0.004
               HighTermMonthSort     2500.04      (2.2%)     2554.91      (4.3%)    2.2% (  -4% -    8%) 0.042
                      AndHighLow      668.12      (3.1%)      692.04      (3.9%)    3.6% (  -3% -   10%) 0.001
                        PKLookup      140.25      (2.0%)      168.53      (1.9%)   20.2% (  15% -   24%) 0.000

After

                            Task    QPS baseline      StdDev    QPS my_modified_version      StdDev                Pct diff p-value
                        Wildcard       54.96      (2.6%)       10.43      (0.5%)  -81.0% ( -82% -  -80%) 0.000
                         Respell       45.54      (1.0%)       16.74      (0.7%)  -63.2% ( -64% -  -62%) 0.000
                          Fuzzy1       46.41      (1.2%)       17.26      (1.0%)  -62.8% ( -64% -  -61%) 0.000
                         Prefix3      121.65      (2.4%)       55.57      (0.9%)  -54.3% ( -56% -  -52%) 0.000
                          Fuzzy2       32.33      (1.2%)       15.79      (1.1%)  -51.2% ( -52% -  -49%) 0.000
               HighTermTitleSort       95.24      (2.1%)       87.04      (1.9%)   -8.6% ( -12% -   -4%) 0.000
     BrowseRandomLabelSSDVFacets        2.37      (7.1%)        2.33      (4.8%)   -1.7% ( -12% -   10%) 0.374
           BrowseMonthSSDVFacets        3.34      (7.3%)        3.29      (0.6%)   -1.5% (  -8% -    6%) 0.362
                      TermDTSort      120.57      (2.3%)      119.02      (3.4%)   -1.3% (  -6% -    4%) 0.163
                      OrHighHigh       19.13      (5.6%)       18.92      (2.7%)   -1.1% (  -8% -    7%) 0.430
                     AndHighHigh       22.04      (5.1%)       21.87      (3.0%)   -0.8% (  -8% -    7%) 0.555
                      AndHighMed       55.06      (3.0%)       54.79      (2.1%)   -0.5% (  -5% -    4%) 0.546
                    HighSpanNear        3.29      (1.6%)        3.28      (1.7%)   -0.5% (  -3% -    2%) 0.346
            HighIntervalsOrdered        0.65      (1.8%)        0.65      (2.0%)   -0.5% (  -4% -    3%) 0.433
           HighTermDayOfYearSort      282.86      (2.0%)      281.57      (2.6%)   -0.5% (  -4% -    4%) 0.533
             MedIntervalsOrdered       16.36      (1.5%)       16.29      (1.5%)   -0.4% (  -3% -    2%) 0.369
                       OrHighMed       68.27      (2.9%)       67.99      (1.8%)   -0.4% (  -5% -    4%) 0.598
                     MedSpanNear        3.22      (1.0%)        3.21      (1.4%)   -0.4% (  -2% -    2%) 0.317
                HighSloppyPhrase        9.59      (2.5%)        9.57      (2.6%)   -0.3% (  -5% -    4%) 0.733
           BrowseMonthTaxoFacets        3.64      (2.4%)        3.63      (1.8%)   -0.2% (  -4% -    4%) 0.756
             LowIntervalsOrdered       14.66      (0.9%)       14.63      (1.5%)   -0.2% (  -2% -    2%) 0.633
            MedTermDayTaxoFacets       15.56      (2.7%)       15.54      (3.9%)   -0.2% (  -6% -    6%) 0.879
         AndHighMedDayTaxoFacets       18.70      (1.4%)       18.67      (3.7%)   -0.2% (  -5% -    5%) 0.864
                     LowSpanNear        4.39      (1.1%)        4.38      (1.4%)   -0.1% (  -2% -    2%) 0.728
          OrHighMedDayTaxoFacets        5.38      (3.5%)        5.38      (5.4%)   -0.1% (  -8% -    9%) 0.945
        AndHighHighDayTaxoFacets        7.06      (1.6%)        7.06      (3.0%)   -0.1% (  -4% -    4%) 0.924
                 LowSloppyPhrase        7.16      (1.4%)        7.15      (1.6%)   -0.1% (  -2% -    2%) 0.891
                 MedSloppyPhrase      128.54      (1.9%)      128.56      (2.2%)    0.0% (  -4% -    4%) 0.979
                         LowTerm      417.80      (3.3%)      418.01      (2.7%)    0.1% (  -5% -    6%) 0.958
                       LowPhrase      125.59      (4.0%)      125.77      (3.1%)    0.1% (  -6% -    7%) 0.900
                       OrHighLow      313.22      (2.1%)      313.72      (2.2%)    0.2% (  -4% -    4%) 0.817
            BrowseDateTaxoFacets        3.73      (0.7%)        3.74      (0.7%)    0.2% (  -1% -    1%) 0.470
       BrowseDayOfYearTaxoFacets        3.76      (0.7%)        3.76      (0.7%)    0.2% (  -1% -    1%) 0.457
                         MedTerm      384.57      (4.6%)      385.44      (3.6%)    0.2% (  -7% -    8%) 0.863
                   OrHighNotHigh      255.07      (4.3%)      256.05      (4.3%)    0.4% (  -7% -    9%) 0.778
                       MedPhrase       11.17      (3.0%)       11.21      (2.6%)    0.4% (  -5% -    6%) 0.658
                        HighTerm      361.26      (5.1%)      362.86      (4.2%)    0.4% (  -8% -   10%) 0.764
     BrowseRandomLabelTaxoFacets        3.19      (1.5%)        3.20      (0.6%)    0.5% (  -1% -    2%) 0.203
                   OrNotHighHigh      205.38      (4.0%)      206.35      (4.0%)    0.5% (  -7% -    8%) 0.712
                    OrNotHighLow      317.96      (1.7%)      319.48      (2.1%)    0.5% (  -3% -    4%) 0.428
                      HighPhrase       47.91      (3.8%)       48.15      (3.3%)    0.5% (  -6% -    7%) 0.661
            BrowseDateSSDVFacets        0.97      (6.9%)        0.98      (6.7%)    0.5% ( -12% -   15%) 0.801
                    OrHighNotLow      185.96      (4.9%)      187.04      (5.0%)    0.6% (  -8% -   11%) 0.710
       BrowseDayOfYearSSDVFacets        3.21      (2.1%)        3.23      (0.9%)    0.6% (  -2% -    3%) 0.225
            HighTermTitleBDVSort        5.83      (3.7%)        5.87      (4.0%)    0.7% (  -6% -    8%) 0.584
                    OrNotHighMed      516.84      (2.5%)      520.76      (2.5%)    0.8% (  -4% -    5%) 0.334
                          IntNRQ       29.24      (3.0%)       29.50      (4.1%)    0.9% (  -6% -    8%) 0.425
                    OrHighNotMed      268.45      (4.4%)      270.92      (4.2%)    0.9% (  -7% -    9%) 0.501
               HighTermMonthSort     2498.46      (4.8%)     2590.43      (3.7%)    3.7% (  -4% -   12%) 0.007
                      AndHighLow      747.94      (2.1%)      775.60      (4.0%)    3.7% (  -2% -   10%) 0.000
                        PKLookup      141.68      (2.0%)      177.85      (1.5%)   25.5% (  21% -   29%) 0.000

@Tony-X
Contributor Author

Tony-X commented Dec 15, 2023

A non-trivial amount of allocations for ... building IndexInput slice descriptions!?

jdk.internal.misc.Unsafe#allocateUninitializedArray(). It was not trivial to find out why this shows up. But again, with the raw JFR report we can analyze the call tree. It turns out that in the buildSlice() implementation of MemorySegmentIndexInput we call IndexInput#getFullSliceDescription(), which creates a new String. And allocateUninitializedArray is called to allocate the bytes for the String.

AFAIK, the description is only used for debugging and tracking purposes. I didn't expect it to cause that much allocation. So I made a change to pass null when building the description, and those allocations are gone.
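
As an aside, passing null is not the only way to avoid this cost. The sketch below shows a different technique, deferring the string construction until something actually asks for it. This is not what the PR does, and `Slice`, `slice`, and the description format are hypothetical names for this example, not Lucene's API. Note that the capturing lambda is itself a small object, so this trades a String plus its backing array for one lightweight allocation.

```java
import java.util.function.Supplier;

/** Hypothetical illustration of building a debug description lazily. */
final class LazyDescriptionSketch {
  // Counts how often a description string is really built (for demonstration only).
  static int built = 0;

  static final class Slice {
    private final Supplier<String> description;

    Slice(Supplier<String> description) {
      this.description = description;
    }

    @Override
    public String toString() {
      // The string is only concatenated when somebody asks for it.
      return description.get();
    }
  }

  /** Creates a slice without eagerly formatting its description. */
  static Slice slice(String parent, long offset, long length) {
    return new Slice(
        () -> {
          built++;
          return parent + " [slice=" + offset + ":" + (offset + length) + "]";
        });
  }
}
```

On a hot path that creates many slices and rarely prints them, the formatting work then happens only for the few slices that end up in an error message or log line.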

Before 

PERCENT       HEAP SAMPLES  STACK
10.39%        12103M        java.lang.Long#valueOf()
9.91%         11543M        org.apache.lucene.facet.taxonomy.IntTaxonomyFacets#initializeValueCounters()
8.91%         10383M        jdk.internal.misc.Unsafe#allocateUninitializedArray()

After
PERCENT       HEAP SAMPLES  STACK
25.97%        32791M        java.lang.Long#valueOf()
7.58%         9571M         org.apache.lucene.facet.taxonomy.IntTaxonomyFacets#initializeValueCounters()
5.13%         6482M         org.apache.lucene.util.FixedBitSet#<init>()

@Tony-X
Contributor Author

Tony-X commented Dec 15, 2023

Here is the even more interesting stuff. After all those allocation optimizations, I also implemented the on-paper more "efficient" algorithm to intersect the FST and the FSA for Terms.intersect(), which leverages the sorted nature of the FST arcs and FSA transitions from a given state (so at least we can binary search to advance with some skipping). FSTs in some cases use direct addressing, which is exploited too. As a side note, it also uncovered a bug in the NFA impl, which is tracked here: #12906.

But that change is not moving the needle at all for the Wildcard and Prefix3 search tasks.

Before 
                            Task    QPS baseline      StdDev    QPS my_modified_version      StdDev                Pct diff p-value
                        Wildcard       62.85      (1.8%)       12.66      (0.6%)  -79.9% ( -80% -  -78%) 0.000
                          Fuzzy2       55.12      (1.0%)       18.92      (0.8%)  -65.7% ( -66% -  -64%) 0.000
                          Fuzzy1       61.20      (0.8%)       22.55      (0.8%)  -63.2% ( -64% -  -62%) 0.000
                         Respell       31.11      (0.8%)       11.95      (0.6%)  -61.6% ( -62% -  -60%) 0.000
                         Prefix3      135.69      (2.0%)       65.06      (0.7%)  -52.1% ( -53% -  -50%) 0.000
               HighTermTitleSort      119.58      (0.9%)      111.03      (1.7%)   -7.2% (  -9% -   -4%) 0.000
                          IntNRQ       22.25      (1.1%)       21.87      (1.5%)   -1.7% (  -4% -    0%) 0.000
                      HighPhrase       25.82      (3.6%)       25.55      (3.2%)   -1.1% (  -7% -    5%) 0.318
                       MedPhrase        7.41      (2.4%)        7.35      (2.2%)   -0.8% (  -5% -    3%) 0.259
                     LowSpanNear        8.81      (1.9%)        8.74      (2.1%)   -0.8% (  -4% -    3%) 0.202
          OrHighMedDayTaxoFacets        3.86      (5.8%)        3.83      (4.8%)   -0.8% ( -10% -   10%) 0.636
                      TermDTSort      100.75      (2.9%)       99.98      (2.0%)   -0.8% (  -5% -    4%) 0.336
            HighIntervalsOrdered        6.07      (2.1%)        6.03      (2.4%)   -0.7% (  -5% -    3%) 0.342
             MedIntervalsOrdered       45.89      (2.0%)       45.61      (2.4%)   -0.6% (  -4% -    3%) 0.389
                    HighSpanNear       10.73      (1.0%)       10.66      (1.5%)   -0.6% (  -3% -    2%) 0.165
           HighTermDayOfYearSort      206.09      (1.8%)      204.93      (1.9%)   -0.6% (  -4% -    3%) 0.338
             LowIntervalsOrdered        8.39      (2.3%)        8.37      (2.5%)   -0.3% (  -5% -    4%) 0.654
                     MedSpanNear       66.00      (1.3%)       65.81      (1.9%)   -0.3% (  -3% -    2%) 0.574
                         MedTerm      322.61      (4.7%)      321.89      (4.5%)   -0.2% (  -9% -    9%) 0.878
         AndHighMedDayTaxoFacets       22.62      (1.0%)       22.58      (1.2%)   -0.2% (  -2% -    2%) 0.617
                       LowPhrase       48.52      (1.3%)       48.46      (1.4%)   -0.1% (  -2% -    2%) 0.745
                         LowTerm      403.54      (2.9%)      403.22      (2.4%)   -0.1% (  -5% -    5%) 0.923
     BrowseRandomLabelTaxoFacets        3.20      (0.7%)        3.20      (0.9%)   -0.0% (  -1% -    1%) 0.905
        AndHighHighDayTaxoFacets        8.06      (1.3%)        8.06      (1.6%)    0.0% (  -2% -    2%) 0.962
       BrowseDayOfYearTaxoFacets        3.76      (0.6%)        3.76      (0.6%)    0.0% (  -1% -    1%) 0.859
           BrowseMonthTaxoFacets        3.62      (1.0%)        3.62      (1.0%)    0.1% (  -1% -    2%) 0.866
                   OrHighNotHigh      156.16      (6.0%)      156.26      (5.9%)    0.1% ( -11% -   12%) 0.972
            BrowseDateTaxoFacets        3.73      (0.6%)        3.73      (0.6%)    0.1% (  -1% -    1%) 0.722
                   OrNotHighHigh      144.55      (5.1%)      144.68      (4.7%)    0.1% (  -9% -   10%) 0.957
            MedTermDayTaxoFacets       17.57      (2.8%)       17.59      (2.8%)    0.2% (  -5% -    5%) 0.863
       BrowseDayOfYearSSDVFacets        3.22      (0.9%)        3.23      (0.7%)    0.2% (  -1% -    1%) 0.424
                        HighTerm      401.13      (5.4%)      402.30      (5.7%)    0.3% ( -10% -   12%) 0.868
           BrowseMonthSSDVFacets        3.27      (0.7%)        3.28      (1.0%)    0.3% (  -1% -    2%) 0.202
                     AndHighHigh       20.87      (2.8%)       20.94      (2.5%)    0.4% (  -4% -    5%) 0.670
                HighSloppyPhrase        6.58      (3.3%)        6.61      (3.4%)    0.4% (  -6% -    7%) 0.727
                      AndHighMed       89.34      (1.8%)       89.76      (1.4%)    0.5% (  -2% -    3%) 0.355
            BrowseDateSSDVFacets        0.95      (2.8%)        0.95      (4.0%)    0.5% (  -6% -    7%) 0.656
                    OrNotHighLow      420.17      (2.2%)      422.34      (2.2%)    0.5% (  -3% -    4%) 0.452
                 LowSloppyPhrase        2.89      (2.4%)        2.91      (1.9%)    0.6% (  -3% -    4%) 0.369
                    OrNotHighMed      219.50      (4.5%)      221.02      (4.1%)    0.7% (  -7% -    9%) 0.611
                 MedSloppyPhrase       10.44      (2.2%)       10.52      (1.4%)    0.7% (  -2% -    4%) 0.222
                    OrHighNotLow      288.48      (5.4%)      290.70      (5.7%)    0.8% (  -9% -   12%) 0.663
                       OrHighMed       53.25      (3.6%)       53.66      (3.5%)    0.8% (  -6% -    8%) 0.488
            HighTermTitleBDVSort        2.77      (7.2%)        2.79      (7.0%)    0.9% ( -12% -   16%) 0.699
                    OrHighNotMed      270.38      (5.7%)      272.88      (5.4%)    0.9% (  -9% -   12%) 0.601
                      OrHighHigh       20.86      (5.1%)       21.08      (5.2%)    1.1% (  -8% -   12%) 0.519
                       OrHighLow      220.40      (4.2%)      223.52      (5.6%)    1.4% (  -8% -   11%) 0.367
     BrowseRandomLabelSSDVFacets        2.32      (4.3%)        2.38      (7.6%)    2.3% (  -9% -   14%) 0.240
                      AndHighLow      395.82      (2.9%)      405.36      (2.8%)    2.4% (  -3% -    8%) 0.008
               HighTermMonthSort     2375.71      (3.7%)     2555.72      (5.0%)    7.6% (  -1% -   16%) 0.000
                        PKLookup      140.60      (1.8%)      178.02      (1.3%)   26.6% (  23% -   30%) 0.000


After
                            Task QPS baseline      StdDev QPS my_modified_version      StdDev                Pct diff p-value
                        Wildcard       37.77      (2.7%)        5.70      (1.3%)  -84.9% ( -86% -  -83%) 0.000
                         Prefix3       52.27      (2.7%)       22.71      (1.8%)  -56.6% ( -59% -  -53%) 0.000
                          Fuzzy1       59.96      (1.6%)       54.80      (2.3%)   -8.6% ( -12% -   -4%) 0.000
               HighTermTitleSort      106.17      (2.1%)      101.30      (1.5%)   -4.6% (  -7% -   -1%) 0.000
                          Fuzzy2       33.40      (1.3%)       32.14      (1.6%)   -3.8% (  -6% -    0%) 0.000
                         MedTerm      273.84      (5.4%)      265.92      (9.7%)   -2.9% ( -17% -   12%) 0.245
                        HighTerm      349.66      (5.2%)      341.63      (8.9%)   -2.3% ( -15% -   12%) 0.320
                         LowTerm      356.23      (3.1%)      350.27      (4.3%)   -1.7% (  -8% -    5%) 0.156
                 LowSloppyPhrase        4.47      (2.1%)        4.44      (4.6%)   -0.8% (  -7% -    6%) 0.492
                    HighSpanNear        8.12      (2.1%)        8.06      (2.5%)   -0.7% (  -5% -    4%) 0.331
                 MedSloppyPhrase       31.05      (3.6%)       30.83      (4.1%)   -0.7% (  -8% -    7%) 0.559
             MedIntervalsOrdered        3.88      (3.3%)        3.86      (3.3%)   -0.7% (  -7% -    6%) 0.519
                     MedSpanNear        8.94      (1.2%)        8.88      (1.6%)   -0.7% (  -3% -    2%) 0.126
             LowIntervalsOrdered        7.40      (3.3%)        7.35      (3.4%)   -0.7% (  -7% -    6%) 0.537
                     LowSpanNear       29.33      (2.0%)       29.15      (2.2%)   -0.6% (  -4% -    3%) 0.374
                HighSloppyPhrase        6.68      (3.6%)        6.64      (3.5%)   -0.5% (  -7% -    6%) 0.624
            MedTermDayTaxoFacets        9.14      (3.0%)        9.11      (8.2%)   -0.3% ( -11% -   11%) 0.861
                       HighPhrase      115.62      (3.9%)      115.24      (4.0%)   -0.3% (  -7% -    7%) 0.798
                     AndHighHigh       13.95      (4.2%)       13.92      (4.5%)   -0.3% (  -8% -    8%) 0.847
           BrowseMonthSSDVFacets        3.30      (0.8%)        3.29      (1.0%)   -0.3% (  -2% -    1%) 0.377
                      AndHighMed       85.40      (2.0%)       85.18      (2.1%)   -0.2% (  -4% -    3%) 0.695
                          IntNRQ       16.65      (4.1%)       16.63      (3.8%)   -0.1% (  -7% -    8%) 0.914
     BrowseRandomLabelSSDVFacets        2.30      (0.9%)        2.30      (1.1%)   -0.1% (  -2% -    1%) 0.754
                      OrHighHigh       24.99      (6.1%)       24.97      (5.3%)   -0.1% ( -10% -   12%) 0.957
        AndHighHighDayTaxoFacets        2.29      (2.8%)        2.29      (2.4%)   -0.0% (  -5% -    5%) 0.977
         AndHighMedDayTaxoFacets       40.17      (1.4%)       40.20      (1.4%)    0.1% (  -2% -    2%) 0.872
          OrHighMedDayTaxoFacets        3.15      (3.9%)        3.15      (3.4%)    0.1% (  -6% -    7%) 0.946
                       LowPhrase       30.23      (2.5%)       30.26      (2.4%)    0.1% (  -4% -    5%) 0.911
                    OrNotHighLow      201.78      (2.9%)      202.03      (3.0%)    0.1% (  -5% -    6%) 0.896
     BrowseRandomLabelTaxoFacets        3.20      (3.2%)        3.21      (4.1%)    0.1% (  -6% -    7%) 0.899
            HighIntervalsOrdered        0.42      (1.9%)        0.42      (1.6%)    0.2% (  -3% -    3%) 0.699
                    OrNotHighMed      235.49      (5.5%)      236.01      (4.7%)    0.2% (  -9% -   11%) 0.892
           BrowseMonthTaxoFacets        3.62      (1.0%)        3.63      (1.0%)    0.2% (  -1% -    2%) 0.477
                   OrNotHighHigh      329.77      (4.9%)      330.79      (5.5%)    0.3% (  -9% -   11%) 0.851
                       MedPhrase       35.79      (3.4%)       35.90      (3.4%)    0.3% (  -6% -    7%) 0.771
                      TermDTSort      112.10      (3.2%)      112.45      (3.4%)    0.3% (  -6% -    7%) 0.763
            BrowseDateSSDVFacets        0.97      (7.1%)        0.98     (10.0%)    0.4% ( -15% -   18%) 0.897
       BrowseDayOfYearSSDVFacets        3.21      (2.2%)        3.22      (1.6%)    0.4% (  -3% -    4%) 0.525
           HighTermDayOfYearSort      235.24      (2.1%)      236.16      (1.6%)    0.4% (  -3% -    4%) 0.512
                       OrHighMed       70.60      (3.3%)       70.99      (2.7%)    0.5% (  -5% -    6%) 0.571
                      AndHighLow      370.31      (3.2%)      372.60      (3.4%)    0.6% (  -5% -    7%) 0.559
            HighTermTitleBDVSort        5.53      (4.1%)        5.56      (4.5%)    0.6% (  -7% -    9%) 0.648
                    OrHighNotLow      263.18      (5.6%)      264.95      (6.2%)    0.7% ( -10% -   13%) 0.717
                    OrHighNotMed      222.41      (5.8%)      224.06      (5.8%)    0.7% ( -10% -   13%) 0.688
                   OrHighNotHigh      233.04      (5.5%)      234.89      (5.8%)    0.8% (  -9% -   12%) 0.657
                       OrHighLow      463.17      (3.0%)      466.91      (3.1%)    0.8% (  -5% -    7%) 0.403
       BrowseDayOfYearTaxoFacets        3.77      (0.6%)        3.84      (8.9%)    1.9% (  -7% -   11%) 0.342
            BrowseDateTaxoFacets        3.74      (0.6%)        3.81      (8.8%)    1.9% (  -7% -   11%) 0.332
               HighTermMonthSort     2350.73      (4.0%)     2477.32      (4.5%)    5.4% (  -2% -   14%) 0.000
                         Respell       30.81      (1.5%)       34.87      (1.7%)   13.2% (   9% -   16%) 0.000
                        PKLookup      141.03      (1.9%)      177.54      (2.0%)   25.9% (  21% -   30%) 0.000

I modified the benchmark task file to run only Wildcard, to understand where the time is spent.

My version

image

mainline

image

So we can see that most of the time is spent actually reading FST arcs and FSA transitions. My intuitive explanation for why this is slower than BlockTree is that it has worse locality in its data access pattern (@mikemccand maybe you can shed some light). Here are some relevant factors:

  • The FST is larger because it contains all terms, so there are more arcs to visit. BlockTree (main) uses the FST only to index prefixes.
  • When binary-searching or directly addressing arcs/transitions, the target position is somewhat random.
  • The FST bytes are read backwards (probably less of an issue on modern hardware if the reads are sequential).
  • BlockTree reads bytes sequentially at a given node, and the terms are sorted, too.

Just out of curiosity I altered my code to load the FST on-heap to compare with the default off-heap option. It did not help much with Wildcard but PKLookup got substantially faster!

The PKLookup task is a great proxy for FST performance, as both versions of the code visit the exact same number of arcs.


Off heap
                        Wildcard       47.56      (1.7%)       10.13      (0.4%)  -78.7% ( -79% -  -77%) 0.000
                        PKLookup      136.03      (2.4%)      147.93      (2.3%)    8.8% (   3% -   13%) 0.000

on heap
                        Wildcard       37.11      (1.5%)        8.35      (0.3%)  -77.5% ( -78% -  -76%) 0.000
                        PKLookup      136.04      (3.3%)      269.60      (9.0%)   98.2% (  83% -  114%) 0.000
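
As a sanity check on the numbers above, the Pct diff column can be recomputed from the two mean QPS columns. The sketch below assumes Pct diff is the plain relative change of the candidate over the baseline; `PctDiff` and `pctDiff` are hypothetical names, not part of luceneutil. Because the table prints rounded means, a recomputed value can differ from the printed one in the last digit.

```java
// Hypothetical helper (not part of luceneutil): recompute "Pct diff"
// from the baseline and candidate mean QPS values shown in the table.
final class PctDiff {

  /** Relative change of the candidate over the baseline, in percent. */
  static double pctDiff(double baselineQps, double candidateQps) {
    return (candidateQps - baselineQps) / baselineQps * 100.0;
  }

  public static void main(String[] args) {
    // On-heap rows from the table above.
    System.out.printf("Wildcard: %.1f%%%n", pctDiff(37.11, 8.35));
    System.out.printf("PKLookup: %.1f%%%n", pctDiff(136.04, 269.60));
  }
}
```

For the on-heap rows this gives roughly -77.5% for Wildcard and 98.2% for PKLookup, matching the reported column.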

@mikemccand
Member

These are very interesting results @Tony-X! I'll try to give a deeper response soon, but one idea that jumped out about Wildcard is that BlockTree takes advantage of commonSuffixBytes (or something like it) somewhere? This is a BytesRef that is non-empty when all strings matched by the Automaton share some common suffix, as would happen with a wildcard query like ab*cd (cd would be the common suffix).

Maybe BlockTree is able to use this opto more effectively than the "all terms in FST" approach? But I think you could implement such an opto too, maybe: just find the one node (or maybe more than one) whose suffix is the common suffix, and fast-check somehow?
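
To make the idea concrete, here is a minimal sketch of a common-suffix pre-filter on plain byte arrays. It is not taken from BlockTree or from this PR, and `CommonSuffixCheck` and `endsWith` are hypothetical names. The assumption is that the common suffix has already been computed for the query (cd for ab*cd). A term that fails the check cannot match, while a term that passes still has to be run through the full automaton.

```java
import java.nio.charset.StandardCharsets;

// Hypothetical sketch: names are illustrative, not Lucene APIs.
final class CommonSuffixCheck {

  /** Returns true if {@code term} ends with {@code suffix}, compared byte-wise. */
  static boolean endsWith(byte[] term, byte[] suffix) {
    if (suffix.length > term.length) {
      return false; // term is too short to contain the suffix
    }
    int offset = term.length - suffix.length;
    for (int i = 0; i < suffix.length; i++) {
      if (term[offset + i] != suffix[i]) {
        return false;
      }
    }
    return true;
  }

  public static void main(String[] args) {
    // Assumed common suffix for the wildcard query ab*cd.
    byte[] suffix = "cd".getBytes(StandardCharsets.UTF_8);
    for (String t : new String[] {"abcd", "abxxcd", "abxx", "d"}) {
      boolean candidate = endsWith(t.getBytes(StandardCharsets.UTF_8), suffix);
      System.out.println(t + " -> " + (candidate ? "run automaton" : "skip"));
    }
  }
}
```

The check costs at most one pass over the suffix bytes, so it only pays off when the full automaton run is noticeably more expensive than that comparison.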

@mikemccand
Member

The PKLookup gains are astounding!

Especially interesting are the off-heap -> on-heap gains for that task. We are somehow paying a high price for going through Lucene's IO APIs instead of a byte[]-backed ByteBuffer?

@Tony-X
Contributor Author

Tony-X commented Dec 19, 2023

Thanks @mikemccand for taking a look! I see the getCommonSuffixBytesRef method from Automaton.

I wonder if it is really applicable to the FST, i.e. is it guaranteed that there exists one and only one state that all valid output paths sharing the same suffix go through? Or, in other words, how many sub-graphs of the FST represent the same suffix?

private final FSTCompiler<Long> fstCompiler;

TermsIndexBuilder() throws IOException {
fstCompiler =
Contributor

@dungba88 dungba88 Dec 29, 2023

Just FYI, you can plug the indexOutput directly into the FSTCompiler (passing it from RandomAccessTermsDictWriter) and make the FST writing entirely off-heap (besides saving heap, it also eliminates the time taken to write to the on-heap DataOutput). I have some example PRs at:

Contributor

Note that the FST metadata and data cannot be written to the same file, but it seems you already separated metaOutput and indexOutput so that should be fine.

This PR has not had activity in the past 2 weeks, labeling it as stale. If the PR is waiting for review, notify the dev@lucene.apache.org list. Thank you for your contribution!


5 participants