Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Write MSB VLong for better outputs sharing in block tree index #12631

Merged
merged 18 commits into from
Oct 10, 2023

Conversation

gf2121
Copy link
Contributor

@gf2121 gf2121 commented Oct 7, 2023

closes #12620

@gf2121
Copy link
Contributor Author

gf2121 commented Oct 8, 2023

With this change, The total size of tip reduced ~14% for wikimediumall.

  baseline candidate diff
_32_Lucene90_0.tip 5166889 4385234 -15.13%
_65_Lucene90_0.tip 5192450 4402679 -15.21%
_98_Lucene90_0.tip 5555704 4721674 -15.01%
_cb_Lucene90_0.tip 5591898 4761395 -14.85%
_fe_Lucene90_0.tip 5549684 4718536 -14.98%
_fp_Lucene90_0.tip 776845 716283 -7.80%
_g0_Lucene90_0.tip 775054 715176 -7.73%
_gb_Lucene90_0.tip 745250 685730 -7.99%
_gm_Lucene90_0.tip 738168 678853 -8.04%
_gx_Lucene90_0.tip 880015 809570 -8.00%
_gy_Lucene90_0.tip 151737 142043 -6.39%
_gz_Lucene90_0.tip 133664 125057 -6.44%
_h0_Lucene90_0.tip 116959 109360 -6.50%
_h1_Lucene90_0.tip 120897 112915 -6.60%
_h2_Lucene90_0.tip 111570 104185 -6.62%
sum 31606784 27188690 -13.98%

@gf2121 gf2121 requested a review from mikemccand October 8, 2023 12:50
@mikemccand
Copy link
Member

sum | 31606784 | 27188690 | -13.98%

WHOA, wow! This is a massive gain for such a tiny change :) I'll try to review soon! Nice to revisit ancient TODOs in the source code ... :)

Copy link
Member

@mikemccand mikemccand left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This looks awesome! What a small change yielding such a big win!

We should watch the nightly luceneutil benchy PKLookup task to see any CPU performance impact from this change, or run the benchmark before pushing (I can help run it).

/** Current terms format. */
public static final int VERSION_CURRENT = VERSION_START;
public static final int VERSION_CURRENT = VERSION_MSB_VLONG_OUTPUT;
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Add a comment explaining what changed, and link to the GitHub issue?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I wrote some comments on VERSION_MSB_VLONG_OUTPUT. I prefer to leave comments there, as VERSION_CURRENT could probably link to other version in feature. WDYT?

Issue added :)

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Oh, sorry, yes - VERSION_MSB_VLONG_OUTPUT is the right place for the comment! Thanks.

int LSBVLongBytes = (Long.SIZE - Long.numberOfLeadingZeros(i) - 1) / 7 + 1;
i <<= Long.SIZE - LSBVLongBytes * 7;
for (int j = 1; j < LSBVLongBytes; j++) {
scratchBytes.writeByte((byte) (((i >>> 57) & 0x7FL) | 0x80));
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Maybe @uschindler can help review this heavy math :)

@mikemccand
Copy link
Member

I kicked off a luceneutil run ... I'll post results here soonish.

@mikemccand
Copy link
Member

luceneutil results on wikimediumall look good -- looks like all noise (even for PKLookup), or, any signal (change) is very low, making the ~15% reduction very much worth it.

                            Task    QPS base      StdDevQPS msb_vlong      StdDev                Pct diff p-value                                                                                           
           BrowseMonthTaxoFacets        8.34      (4.4%)        8.24      (3.6%)   -1.2% (  -8% -    7%) 0.344                                                                                              
            HighIntervalsOrdered       11.37      (3.9%)       11.24      (4.5%)   -1.2% (  -9% -    7%) 0.382                                                                                              
                       OrHighLow      394.28      (1.6%)      390.47      (2.1%)   -1.0% (  -4% -    2%) 0.099                                                                                              
                     MedSpanNear       24.41      (3.9%)       24.23      (3.9%)   -0.8% (  -8% -    7%) 0.537                                                                                              
             MedIntervalsOrdered        9.96      (4.0%)        9.89      (4.3%)   -0.7% (  -8% -    7%) 0.602                                                                                              
                     LowSpanNear       13.82      (3.2%)       13.73      (3.2%)   -0.6% (  -6% -    5%) 0.526                                                                                              
                     AndHighHigh       97.28      (1.9%)       96.71      (1.2%)   -0.6% (  -3% -    2%) 0.259                                                                                              
     BrowseRandomLabelSSDVFacets        9.22      (4.5%)        9.17      (4.3%)   -0.6% (  -8% -    8%) 0.688                                                                                              
             LowIntervalsOrdered       28.78      (4.0%)       28.64      (4.1%)   -0.5% (  -8% -    7%) 0.720                                                                                              
                        PKLookup      250.58      (0.9%)      249.44      (1.0%)   -0.5% (  -2% -    1%) 0.118                                                                                              
                          IntNRQ      315.77      (2.5%)      314.47      (2.1%)   -0.4% (  -4% -    4%) 0.570                                                                                              
       BrowseDayOfYearSSDVFacets       12.75      (7.3%)       12.70     (10.5%)   -0.4% ( -16% -   18%) 0.887                                                                                              
                      HighPhrase       55.21      (3.5%)       55.00      (2.8%)   -0.4% (  -6% -    6%) 0.712                                                                                              
                       MedPhrase       77.25      (2.0%)       77.04      (1.9%)   -0.3% (  -4% -    3%) 0.658                                                                                              
                HighSloppyPhrase        6.90      (3.9%)        6.89      (3.4%)   -0.2% (  -7% -    7%) 0.841                                                                                              
                       LowPhrase      463.60      (1.4%)      462.90      (1.1%)   -0.2% (  -2% -    2%) 0.701                                                                                              
                    OrNotHighMed      424.68      (2.6%)      424.06      (2.5%)   -0.1% (  -5% -    5%) 0.857                                                                                              
                      AndHighLow      680.76      (1.3%)      679.77      (1.6%)   -0.1% (  -3% -    2%) 0.757                                                                                              
                    HighSpanNear       27.40      (1.5%)       27.37      (1.3%)   -0.1% (  -2% -    2%) 0.787                                                                                              
                      AndHighMed      200.30      (1.3%)      200.12      (1.3%)   -0.1% (  -2% -    2%) 0.826                                                                                              
                    OrNotHighLow      622.80      (1.2%)      622.27      (1.1%)   -0.1% (  -2% -    2%) 0.816                                                                                              
                 LowSloppyPhrase        7.73      (2.9%)        7.73      (3.0%)   -0.1% (  -5% -    6%) 0.951                                                                                              
               HighTermMonthSort     4175.08      (1.1%)     4173.92      (0.9%)   -0.0% (  -1% -    1%) 0.929                                                                                              
                      TermDTSort      268.34      (2.0%)      268.37      (2.3%)    0.0% (  -4% -    4%) 0.988                                                                                              
           HighTermDayOfYearSort      779.14      (1.1%)      779.57      (0.9%)    0.1% (  -1% -    2%) 0.867                                                                                              
                        Wildcard      163.23      (1.1%)      163.33      (1.3%)    0.1% (  -2% -    2%) 0.867                                                                                              
            BrowseDateTaxoFacets        8.27      (4.4%)        8.28      (4.2%)    0.1% (  -8% -    9%) 0.936                                                                                              
                         Respell       44.69      (0.7%)       44.74      (0.9%)    0.1% (  -1% -    1%) 0.642                                                                                              
                 MedSloppyPhrase       37.55      (2.1%)       37.62      (1.8%)    0.2% (  -3% -    4%) 0.792                                                                                              
                      OrHighHigh       44.44      (2.6%)       44.52      (3.3%)    0.2% (  -5% -    6%) 0.845                                                                                              
       BrowseDayOfYearTaxoFacets        8.22      (4.2%)        8.24      (4.1%)    0.2% (  -7% -    8%) 0.888                                                                                              
               HighTermTitleSort      146.90      (2.9%)      147.22      (2.5%)    0.2% (  -5% -    5%) 0.796                                                                                              
         AndHighMedDayTaxoFacets       47.61      (2.0%)       47.72      (1.5%)    0.2% (  -3% -    3%) 0.679                                                                                              
                   OrNotHighHigh      479.52      (3.7%)      480.64      (3.5%)    0.2% (  -6% -    7%) 0.837                                                                                              
                    OrHighNotMed      675.60      (4.5%)      677.32      (4.3%)    0.3% (  -8% -    9%) 0.856                                                                                              
           BrowseMonthSSDVFacets       13.96     (10.1%)       14.01     (12.1%)    0.3% ( -19% -   25%) 0.928                                                                                              
                         Prefix3      282.37      (1.2%)      283.46      (1.2%)    0.4% (  -2% -    2%) 0.321                                                                                              
                          Fuzzy2       60.02      (0.6%)       60.27      (0.8%)    0.4% (  -1% -    1%) 0.068                                                                                              
                          Fuzzy1       48.28      (1.3%)       48.50      (1.0%)    0.5% (  -1% -    2%) 0.210                                                                                              
        AndHighHighDayTaxoFacets       11.30      (1.1%)       11.35      (1.5%)    0.5% (  -2% -    3%) 0.264                                                                                              
                       OrHighMed      179.94      (1.7%)      180.80      (2.2%)    0.5% (  -3% -    4%) 0.446                                                                                              
            HighTermTitleBDVSort        6.77      (2.6%)        6.80      (2.4%)    0.5% (  -4% -    5%) 0.541                                                                                              
                         LowTerm      732.59      (1.7%)      736.71      (2.5%)    0.6% (  -3% -    4%) 0.404                                                                                              
     BrowseRandomLabelTaxoFacets        7.44      (2.8%)        7.49      (2.9%)    0.6% (  -4% -    6%) 0.485                                                                                              
                    OrHighNotLow      683.54      (5.7%)      687.86      (5.7%)    0.6% ( -10% -   12%) 0.726                                                                                              
                   OrHighNotHigh      352.91      (5.1%)      355.64      (5.2%)    0.8% (  -9% -   11%) 0.635                                                                                              
            MedTermDayTaxoFacets       39.46      (1.9%)       39.79      (2.5%)    0.8% (  -3% -    5%) 0.249                                                                                              
          OrHighMedDayTaxoFacets        2.64      (3.3%)        2.66      (3.1%)    0.8% (  -5% -    7%) 0.412                                                                                              
                         MedTerm      655.15      (4.7%)      660.94      (6.2%)    0.9% (  -9% -   12%) 0.614                                                                                              
                        HighTerm      683.48      (5.0%)      689.63      (6.3%)    0.9% (  -9% -   12%) 0.617                                                                                              
            BrowseDateSSDVFacets        2.17      (1.1%)        2.19      (3.1%)    1.3% (  -2% -    5%) 0.082 

Copy link
Contributor

@jpountz jpountz left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nice!

Copy link
Member

@mikemccand mikemccand left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

A few more small comments! Thanks @gf2121!

@@ -460,6 +460,19 @@ public String toString() {
return "BLOCK: prefix=" + brToString(prefix);
}

private static void writeMSBVLong(long l, DataOutput scratchBytes) throws IOException {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Could we please add a randomized test that asserts that randomly encoding & decoding (non-negative?) long values always returns the same long / does not fail?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Also, could you add a comment pointing to the corresponding read method matching this write method? E.g. /** Encodes long value to variable length byte[], in MSB order. Use FieldReader#readMSBVlong to decode. */

}
}

private static long readMSBVLong(DataInput in) throws IOException {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Could you add a private javadoc linking to the write method? E.g.

/** Decodes a variable length byte[] in MSB order back to long, as written by Lucene90BlockTreeTermsWriter.writeMSBVLong. Use FieldReader#readMSBVlong to decode. */

Copy link
Member

@mikemccand mikemccand left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks @gf2121! Looks great :)

@gf2121
Copy link
Contributor Author

gf2121 commented Oct 10, 2023

@jpountz @mikemccand Thanks a lot for the great suggestions and benchmark !

@gf2121 gf2121 merged commit 4f01de2 into apache:main Oct 10, 2023
4 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Write VLong in opposite order for better outputs sharing in the FST
3 participants