Refactor ByteBlockPool so it is just a "shift/mask big array" #12625

Merged: iverase merged 12 commits on Oct 18, 2023

Contributor

@iverase iverase commented Oct 6, 2023

While working in the code base I stumbled upon this TODO and this issue, so I tried to simplify ByteBlockPool so that it does not contain any term-specific logic.

The result is that I moved all the hairy allocSlice stuff into static methods in TermsHashPerField and introduced a BytesRefBlockPool to encapsulate the BytesRefHash write/read logic.

I updated javadocs accordingly.
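
For readers unfamiliar with the phrase in the title, the following is a minimal sketch of what a "shift/mask big array" could look like: fixed-size pages whose size is a power of two, so a global offset splits into a page index (shift) and an in-page position (mask). The class `PagedByteArray` and its methods are hypothetical names for illustration only and are not Lucene's ByteBlockPool API; the block size of 2^15 bytes is an assumption chosen for the sketch.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical sketch of a "shift/mask big array"; not Lucene's ByteBlockPool. */
final class PagedByteArray {
  // Assumed block size: a power of two, so an offset splits with a shift and a mask.
  static final int BLOCK_SHIFT = 15;
  static final int BLOCK_SIZE = 1 << BLOCK_SHIFT;
  static final int BLOCK_MASK = BLOCK_SIZE - 1;

  private final List<byte[]> blocks = new ArrayList<>();
  private long length = 0; // number of bytes appended so far

  /** Appends one byte and returns its global offset. */
  long append(byte b) {
    int blockIndex = (int) (length >> BLOCK_SHIFT);
    if (blockIndex == blocks.size()) {
      blocks.add(new byte[BLOCK_SIZE]); // allocate the next page lazily
    }
    blocks.get(blockIndex)[(int) (length & BLOCK_MASK)] = b;
    return length++;
  }

  /** Reads the byte stored at a global offset. */
  byte get(long offset) {
    if (offset < 0 || offset >= length) {
      throw new IndexOutOfBoundsException("offset=" + offset);
    }
    return blocks.get((int) (offset >> BLOCK_SHIFT))[(int) (offset & BLOCK_MASK)];
  }

  long length() {
    return length;
  }
}
```

Nothing in this sketch knows about terms, slices, or BytesRef, which is the separation of concerns the description above is aiming for.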

@mikemccand
Member

While working in the code base I stumbled upon this TODO

Yay, TODO mining for the win :) We should scour our TODOs more often!

Member

@mikemccand mikemccand left a comment

On a quick pass this looks great! We are pulling the hairy BytesRef packing logic apart from the "just store a paged byte[]" logic. Thank you @iverase! Have you tested with luceneutil to see if there's any performance change?
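
To make "BytesRef packing logic" concrete, here is a small sketch of one way to pack variable-length byte strings into a single byte store using a 1- or 2-byte length prefix. This is an illustration under assumptions, not the PR's code: the class `PackedBytesSketch`, its methods, and the exact prefix encoding are hypothetical, and a flat growable `byte[]` stands in for the paged storage.

```java
import java.util.Arrays;

/** Hypothetical sketch: length-prefixed byte strings packed into one byte[]. */
final class PackedBytesSketch {
  private byte[] buf = new byte[64];
  private int upto = 0; // next free position in buf

  /** Appends value with a 1- or 2-byte length prefix and returns its start offset. */
  int add(byte[] value) {
    int len = value.length;
    if (len >= (1 << 15)) {
      throw new IllegalArgumentException("value too long: " + len);
    }
    int needed = upto + 2 + len;
    if (needed > buf.length) {
      buf = Arrays.copyOf(buf, Math.max(needed, buf.length * 2));
    }
    int start = upto;
    if (len < 128) {
      buf[upto++] = (byte) len; // high bit clear: 1-byte prefix
    } else {
      buf[upto++] = (byte) (0x80 | (len & 0x7f)); // high bit set: low 7 bits of length
      buf[upto++] = (byte) (len >> 7); // remaining 8 bits of length
    }
    System.arraycopy(value, 0, buf, upto, len);
    upto += len;
    return start;
  }

  /** Decodes the prefix at start and returns a copy of the stored bytes. */
  byte[] get(int start) {
    int len;
    int pos;
    if ((buf[start] & 0x80) == 0) {
      len = buf[start];
      pos = start + 1;
    } else {
      len = (buf[start] & 0x7f) | ((buf[start + 1] & 0xff) << 7);
      pos = start + 2;
    }
    return Arrays.copyOfRange(buf, pos, pos + len);
  }
}
```

Keeping this encoding in its own class means the underlying store only ever sees raw bytes and offsets.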

@iverase
Contributor Author

iverase commented Oct 9, 2023

I ran luceneutil for wikimedium10m and I don't think it shows any slowdown (I find the output hard to understand):

                            Task  QPS baseline  StdDev  QPS my_modified_version  StdDev                Pct diff  p-value
     BrowseRandomLabelTaxoFacets       48.70     (42.5%)       41.90     (42.5%)  -14.0% ( -69% -  123%) 0.299
       BrowseDayOfYearTaxoFacets       37.40     (26.4%)       34.50     (27.0%)   -7.8% ( -48% -   62%) 0.358
            BrowseDateTaxoFacets       37.26     (26.5%)       34.40     (27.1%)   -7.7% ( -48% -   62%) 0.365
     BrowseRandomLabelSSDVFacets       19.63     (11.3%)       19.23      (9.2%)   -2.0% ( -20% -   20%) 0.535
                    OrHighNotLow      548.43      (4.7%)      539.05      (3.2%)   -1.7% (  -9% -    6%) 0.180
                        HighTerm      809.05      (4.7%)      797.24      (3.5%)   -1.5% (  -9% -    7%) 0.269
                    OrNotHighLow     1288.89      (4.7%)     1270.45      (4.2%)   -1.4% (  -9% -    7%) 0.312
                        PKLookup      326.62      (3.2%)      322.04      (3.8%)   -1.4% (  -8% -    5%) 0.205
           BrowseMonthSSDVFacets       23.98      (9.0%)       23.67      (2.1%)   -1.3% ( -11% -   10%) 0.526
                   OrHighNotHigh      490.49      (4.3%)      484.16      (4.5%)   -1.3% (  -9% -    7%) 0.355
                    OrHighNotMed      548.16      (4.5%)      543.09      (3.6%)   -0.9% (  -8% -    7%) 0.477
                    HighSpanNear       27.03      (2.6%)       26.78      (2.0%)   -0.9% (  -5% -    3%) 0.210
                   OrNotHighHigh      477.00      (3.3%)      473.00      (3.0%)   -0.8% (  -6% -    5%) 0.397
                      TermDTSort      279.13      (3.8%)      276.83      (4.7%)   -0.8% (  -8% -    7%) 0.539
                        Wildcard      681.29      (2.7%)      676.44      (2.2%)   -0.7% (  -5% -    4%) 0.368
                          Fuzzy2      122.34      (2.6%)      121.56      (2.0%)   -0.6% (  -5% -    4%) 0.384
                         Respell      108.58      (2.4%)      107.89      (1.8%)   -0.6% (  -4% -    3%) 0.343
                      AndHighLow     1251.02      (4.3%)     1243.31      (3.4%)   -0.6% (  -7% -    7%) 0.611
                       MedPhrase      745.27      (2.6%)      740.73      (2.4%)   -0.6% (  -5% -    4%) 0.444
           HighTermDayOfYearSort      635.99      (3.1%)      632.70      (2.9%)   -0.5% (  -6% -    5%) 0.589
                         MedTerm     1336.93      (3.7%)     1330.69      (3.7%)   -0.5% (  -7% -    7%) 0.692
                     AndHighHigh       34.01      (1.8%)       33.85      (1.2%)   -0.5% (  -3% -    2%) 0.337
                         LowTerm     1157.65      (4.6%)     1152.62      (4.0%)   -0.4% (  -8% -    8%) 0.751
         AndHighMedDayTaxoFacets       66.07      (1.5%)       65.80      (2.0%)   -0.4% (  -3% -    3%) 0.440
                          Fuzzy1       99.74      (2.8%)       99.32      (2.2%)   -0.4% (  -5% -    4%) 0.595
        AndHighHighDayTaxoFacets       52.67      (1.4%)       52.51      (1.6%)   -0.3% (  -3% -    2%) 0.515
                      HighPhrase       53.17      (2.6%)       53.01      (2.4%)   -0.3% (  -5% -    4%) 0.706
                     MedSpanNear       56.71      (2.3%)       56.55      (1.8%)   -0.3% (  -4% -    3%) 0.669
                      AndHighMed      197.01      (1.6%)      196.51      (1.7%)   -0.3% (  -3% -    3%) 0.629
                    OrNotHighMed      552.27      (2.8%)      551.27      (2.3%)   -0.2% (  -5% -    5%) 0.824
                HighSloppyPhrase        3.98      (4.4%)        3.98      (5.8%)   -0.2% (  -9% -   10%) 0.919
                 LowSloppyPhrase      246.31      (6.5%)      245.91      (8.1%)   -0.2% ( -13% -   15%) 0.944
            MedTermDayTaxoFacets       90.86      (1.7%)       90.71      (1.6%)   -0.2% (  -3% -    3%) 0.755
                      OrHighHigh       64.09      (7.0%)       63.99      (5.2%)   -0.2% ( -11% -   12%) 0.935
                       LowPhrase       30.38      (2.4%)       30.33      (2.6%)   -0.1% (  -4% -    4%) 0.856
            BrowseDateSSDVFacets        5.28     (18.5%)        5.28     (19.0%)   -0.1% ( -31% -   45%) 0.983
                         Prefix3      357.47      (3.0%)      357.07      (2.9%)   -0.1% (  -5% -    5%) 0.904
                     LowSpanNear       48.78      (3.2%)       48.79      (3.1%)    0.0% (  -6% -    6%) 0.978
               HighTermMonthSort     4113.83      (3.3%)     4115.73      (6.2%)    0.0% (  -9% -    9%) 0.976
                       OrHighLow      816.30      (4.7%)      817.94      (4.6%)    0.2% (  -8% -   10%) 0.892
          OrHighMedDayTaxoFacets        3.63      (6.9%)        3.64      (7.7%)    0.3% ( -13% -   15%) 0.907
                 MedSloppyPhrase      150.12      (6.4%)      150.96      (7.2%)    0.6% ( -12% -   15%) 0.796
               HighTermTitleSort      314.87      (4.9%)      317.15      (4.1%)    0.7% (  -7% -   10%) 0.616
                       OrHighMed      309.56      (6.8%)      312.09      (3.5%)    0.8% (  -8% -   11%) 0.633
                          IntNRQ       83.15     (17.7%)       83.86     (16.9%)    0.9% ( -28% -   43%) 0.876
            HighTermTitleBDVSort       32.41      (2.7%)       32.71      (5.3%)    0.9% (  -6% -    9%) 0.486
       BrowseDayOfYearSSDVFacets       23.79      (8.8%)       24.04      (8.6%)    1.0% ( -15% -   20%) 0.708
             MedIntervalsOrdered      119.02     (11.9%)      120.29     (10.4%)    1.1% ( -19% -   26%) 0.764
             LowIntervalsOrdered      164.17     (11.5%)      166.07      (9.6%)    1.2% ( -17% -   25%) 0.730
            HighIntervalsOrdered       11.13     (15.5%)       11.33     (13.4%)    1.8% ( -23% -   36%) 0.696
           BrowseMonthTaxoFacets       36.32     (30.0%)       39.04     (23.2%)    7.5% ( -35% -   86%) 0.377
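
As a reading aid for the table above, the sketch below shows how the "Pct diff" column appears to relate to the two QPS columns: it matches the relative change of the modified version's mean QPS against the baseline. How luceneutil actually computes this column and the bracketed range is not stated in this thread, so treat the formula as an assumption that merely agrees with the rows shown. The class and method names are hypothetical, and the 0.05 threshold is just the conventional significance level.

```java
/** Hypothetical helper for reading one benchmark row; not part of luceneutil. */
final class BenchRowSketch {
  /** Relative change of the candidate's mean QPS versus the baseline, in percent. */
  static double pctDiff(double baselineQps, double candidateQps) {
    return 100.0 * (candidateQps - baselineQps) / baselineQps;
  }

  /** Conventional reading: only a p-value below alpha suggests a real difference. */
  static boolean looksSignificant(double pValue, double alpha) {
    return pValue < alpha;
  }
}
```

For the first row, `pctDiff(48.70, 41.90)` is about -13.96, which rounds to the reported -14.0%. The smallest p-value in the table above is 0.180, so under the conventional 0.05 threshold none of these differences would count as significant.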

@mikemccand
Member

I ran luceneutil for wikimedium10m and I don't think it shows any slowdown (I find the output hard to understand):

Hmm, surprisingly noisy, especially for the biggest regression and anti-regression. Did this run for the full 20 iterations? Maybe re-run with wikimediumall?

Actually, sorry, this change should only impact indexing, I think? We only use this (complex) memory store to store interleaved slices of postings lists?

I think it's fine to leave this to the nightly benchmarks -- let's watch after pushing if they catch any slowdown. This is otherwise a rote refactoring (separating the logic of packed BytesRef from the paged byte[] storage).

@iverase
Contributor Author

iverase commented Oct 10, 2023

I ran wikimediumall and it is still a bit noisy:

                            Task  QPS baseline  StdDev  QPS my_modified_version  StdDev                Pct diff  p-value
     BrowseRandomLabelSSDVFacets        5.34      (9.9%)        5.08      (9.5%)   -4.9% ( -22% -   16%) 0.112
           BrowseMonthSSDVFacets        6.25     (10.3%)        5.98     (11.2%)   -4.3% ( -23% -   19%) 0.201
                          IntNRQ       74.13     (15.2%)       70.96     (16.8%)   -4.3% ( -31% -   32%) 0.398
       BrowseDayOfYearSSDVFacets        6.24     (10.0%)        5.98     (10.9%)   -4.2% ( -22% -   18%) 0.208
                      TermDTSort      192.52      (5.3%)      186.26      (7.6%)   -3.3% ( -15% -   10%) 0.118
            HighTermTitleBDVSort        7.68      (7.3%)        7.43      (7.5%)   -3.2% ( -16% -   12%) 0.168
            BrowseDateSSDVFacets        1.45     (11.8%)        1.40     (16.2%)   -3.1% ( -27% -   28%) 0.482
            HighIntervalsOrdered        4.65     (18.0%)        4.51     (15.4%)   -3.1% ( -30% -   36%) 0.555
                      OrHighHigh       35.11      (9.2%)       34.06     (10.6%)   -3.0% ( -20% -   18%) 0.340
                         Prefix3      339.69      (4.8%)      329.55      (9.2%)   -3.0% ( -16% -   11%) 0.198
                       OrHighLow      549.00      (5.6%)      532.81      (6.7%)   -2.9% ( -14% -    9%) 0.131
                 MedSloppyPhrase       19.19      (6.3%)       18.67      (9.4%)   -2.7% ( -17% -   13%) 0.283
             MedIntervalsOrdered       34.54     (17.0%)       33.61     (15.1%)   -2.7% ( -29% -   35%) 0.596
                       OrHighMed      251.55      (7.5%)      244.84      (9.1%)   -2.7% ( -17% -   15%) 0.312
                   OrNotHighHigh      362.40      (3.0%)      353.04      (7.9%)   -2.6% ( -13% -    8%) 0.173
                    OrNotHighMed      497.20      (3.4%)      484.74      (6.8%)   -2.5% ( -12% -    7%) 0.140
                      AndHighLow      852.22      (3.3%)      831.23      (7.3%)   -2.5% ( -12% -    8%) 0.170
           HighTermDayOfYearSort      283.45      (2.9%)      276.71      (7.5%)   -2.4% ( -12% -    8%) 0.188
                    OrHighNotLow      513.51      (4.1%)      501.37      (7.7%)   -2.4% ( -13% -    9%) 0.226
                      AndHighMed       68.50      (3.0%)       66.90      (7.5%)   -2.3% ( -12% -    8%) 0.192
                 LowSloppyPhrase       25.08      (4.0%)       24.50      (8.4%)   -2.3% ( -14% -   10%) 0.263
                   OrHighNotHigh      332.84      (3.2%)      325.15      (7.5%)   -2.3% ( -12% -    8%) 0.207
             LowIntervalsOrdered        3.87     (13.4%)        3.78     (12.6%)   -2.2% ( -24% -   27%) 0.588
                    OrHighNotMed      429.43      (4.2%)      420.06      (8.2%)   -2.2% ( -13% -   10%) 0.289
                    OrNotHighLow      596.57      (2.5%)      583.76      (6.5%)   -2.1% ( -10% -    6%) 0.167
               HighTermTitleSort      143.63      (2.2%)      140.62      (6.9%)   -2.1% ( -10% -    7%) 0.194
                     AndHighHigh       32.01      (2.6%)       31.34      (7.3%)   -2.1% ( -11% -    8%) 0.228
                       LowPhrase       38.27      (1.7%)       37.51      (7.5%)   -2.0% ( -11% -    7%) 0.245
                       MedPhrase      167.44      (2.2%)      164.14      (8.0%)   -2.0% ( -11% -    8%) 0.288
                         LowTerm      770.36      (4.6%)      755.91      (6.1%)   -1.9% ( -12% -    9%) 0.274
            MedTermDayTaxoFacets       18.15      (2.8%)       17.81      (6.8%)   -1.9% ( -11% -    7%) 0.258
                HighSloppyPhrase        6.57      (3.4%)        6.45      (8.0%)   -1.8% ( -12% -    9%) 0.354
                     LowSpanNear       86.59      (1.8%)       85.10      (6.8%)   -1.7% ( -10% -    6%) 0.272
                      HighPhrase      184.92      (2.1%)      181.75      (6.9%)   -1.7% ( -10% -    7%) 0.288
         AndHighMedDayTaxoFacets       46.77      (2.5%)       45.97      (7.6%)   -1.7% ( -11% -    8%) 0.338
                    HighSpanNear        6.19      (1.6%)        6.09      (7.0%)   -1.7% ( -10% -    7%) 0.301
                         Respell       76.88      (1.4%)       75.61      (7.3%)   -1.7% ( -10% -    7%) 0.321
                     MedSpanNear       33.16      (1.9%)       32.64      (7.0%)   -1.6% ( -10% -    7%) 0.335
                          Fuzzy2       30.94      (1.8%)       30.48      (6.9%)   -1.5% ( -10% -    7%) 0.344
                         MedTerm      788.96      (4.4%)      777.38      (8.7%)   -1.5% ( -13% -   12%) 0.500
                          Fuzzy1       72.99      (1.9%)       71.92      (5.9%)   -1.5% (  -9% -    6%) 0.293
        AndHighHighDayTaxoFacets       12.31      (1.8%)       12.14      (7.6%)   -1.4% ( -10% -    8%) 0.413
               HighTermMonthSort     4103.69      (4.1%)     4050.87      (6.7%)   -1.3% ( -11% -    9%) 0.462
                        PKLookup      276.09      (2.2%)      272.79      (6.5%)   -1.2% (  -9% -    7%) 0.439
                        HighTerm      802.38      (3.4%)      796.62      (7.0%)   -0.7% ( -10% -    9%) 0.678
                        Wildcard       70.43      (3.9%)       70.02      (8.7%)   -0.6% ( -12% -   12%) 0.784
          OrHighMedDayTaxoFacets        2.99      (8.1%)        3.01      (8.9%)    0.5% ( -15% -   18%) 0.857
     BrowseRandomLabelTaxoFacets        6.11      (4.5%)        6.16      (8.3%)    0.8% ( -11% -   14%) 0.715
            BrowseDateTaxoFacets        6.80      (5.0%)        6.85      (8.7%)    0.8% ( -12% -   15%) 0.720
       BrowseDayOfYearTaxoFacets        6.84      (5.2%)        6.90      (9.1%)    0.8% ( -12% -   15%) 0.729
           BrowseMonthTaxoFacets        9.65     (31.4%)       10.50     (28.6%)    8.7% ( -39% -  100%) 0.358

There have been some changes that might affect the Lucene benchmarks, so I will wait until the waters are calmer to push this change and monitor the benchmarks.

Contributor

@stefanvodita stefanvodita left a comment

Hi @iverase! I was working on something similar in #12506. There are a few differences between our PRs.
You address the BytesRef methods, while I addressed a few other clean-up items.
We both moved the slice functionality out of ByteBlockPool. What do you think of making a class like ByteSlicePool to separate concerns from other TermsHashPerField functionality?

Overall, we could do the slices and BytesRef in this PR and I'll try to rebase #12506 after or we can try to merge the two PRs now.

*/
// pkg private for access by tests
static int newSlice(ByteBlockPool bytePool, final int size, final int level) {
assert LEVEL_SIZE_ARRAY[level] == size;
Contributor

I'm not sure I understand this - why pass in both size and level if we assert that one can be deduced from the other?

Contributor Author

I think it is just a micro-optimization to avoid looking up the array twice. This is a copy-paste of the original code.
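
The exchange above can be illustrated with a small sketch of level-based slice allocation, in which the caller has already looked up the size for its own space check and passes it along with the level so the callee does not repeat the lookup. Everything here is hypothetical: the class `SliceLevelsSketch`, the `CAPACITY` constant, the method `nextSlice`, and the table values are assumptions for illustration, and only the names `newSlice` and `LEVEL_SIZE_ARRAY` come from the snippet under review.

```java
/** Hypothetical sketch of level-based slice allocation; not the PR's implementation. */
final class SliceLevelsSketch {
  // Assumed tables: level -> following level, and level -> slice size in bytes.
  static final int[] NEXT_LEVEL_ARRAY = {1, 2, 3, 4, 5, 6, 7, 8, 9, 9};
  static final int[] LEVEL_SIZE_ARRAY = {5, 14, 20, 30, 40, 40, 80, 80, 120, 200};
  static final int CAPACITY = 1 << 15; // assumed size of the imaginary buffer

  private int upto = 0; // next free offset in the imaginary buffer

  /** Reserves size bytes; the assert only checks that size and level agree. */
  int newSlice(int size, int level) {
    assert LEVEL_SIZE_ARRAY[level] == size;
    int start = upto;
    upto += size;
    return start;
  }

  /** Allocates the slice that follows a slice at currentLevel. */
  int nextSlice(int currentLevel) {
    int newLevel = NEXT_LEVEL_ARRAY[currentLevel];
    int newSize = LEVEL_SIZE_ARRAY[newLevel]; // single lookup, reused below
    if (upto + newSize > CAPACITY) {
      throw new IllegalStateException("buffer full");
    }
    return newSlice(newSize, newLevel);
  }
}
```

With this shape, the size is technically redundant, as the reviewer notes, but passing it saves the second array read inside `newSlice` when assertions are disabled.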

@iverase iverase merged commit 17ea6d5 into apache:main Oct 18, 2023
4 checks passed
@iverase iverase deleted the ByteBlockPool branch October 18, 2023 06:10
@iverase iverase added this to the 9.9.0 milestone Oct 18, 2023
iverase added a commit that referenced this pull request Oct 18, 2023
Moved all the hairy allocSlice stuff as static method in TermsHashPerField and I introduce a BytesRefBlockPool to
 encapsulate of the BytesRefHash write/read logic.
@mikemccand
Member

What do you think of making a class like ByteSlicePool to separate concerns from other TermsHashPerField functionality?

This sounds compelling to me. The byte slicing/interleaving is complex enough to be self contained and separated from both ByteBlockPool (low level block storage) and TermsHashPerField (in-memory postings hash).

Overall, we could do the slices and BytesRef in this PR and I'll try to rebase #12506 after or we can try to merge the two PRs now.

Hmm it looks like this PR was merged already, so I guess @stefanvodita we should continue your ideas in #12506.

Very exciting that two devs are suddenly interested in improving this complex low level code -- thanks!

@stefanvodita
Contributor

I've rebased #12506. I like having a separate class for slice allocation, but if there's disagreement over that, I can put the code back in TermsHashPerField.

@mikemccand
Member

Thanks @stefanvodita -- I'll try to have a look soon at your rebased PR #12506.

And thank you for gracefully handling the "two people made very similar changes" situation :)

This happens often in open source, but it's actually a good thing since you get two very different perspectives and the final solution is best of both.
