Remove patching for doc blocks. #12741

slow-J · 2023-10-31T16:00:32Z

We are still keeping PFOR for positions only.
This is a partial revert of #69 which brings back ForDeltaUtil.

Starting this as a draft PR since creating the Lucene99PostingsFormat brings a lot of change.

Also pending some more benchmarking.
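
For readers less familiar with the two encodings, here is a minimal sketch of how the per-block bit width differs between FOR and a patched (PFOR-style) scheme. The class name, the method names and the `maxExceptions` parameter are hypothetical, and this is not the code in ForDeltaUtil or PForUtil. It only illustrates why a few large outliers inflate a FOR block, and why patching avoids that at the cost of extra work at decode time.

```java
import java.util.Arrays;

// Illustrative only: hypothetical names, not Lucene's implementation.
final class ForVsPforSketch {

  /** Number of bits needed to represent v (0 for v == 0). */
  static int bitsRequired(long v) {
    return 64 - Long.numberOfLeadingZeros(v);
  }

  /** FOR: every value in the block is packed with the width of the largest value. */
  static int forBitsPerValue(long[] block) {
    long or = 0;
    for (long v : block) {
      or |= v;
    }
    return bitsRequired(or);
  }

  /**
   * PFOR-style: set aside up to maxExceptions of the largest values (they would be
   * patched in separately) and size the block for the largest remaining value.
   */
  static int patchedBitsPerValue(long[] block, int maxExceptions) {
    long[] sorted = block.clone();
    Arrays.sort(sorted);
    int idx = Math.max(0, sorted.length - 1 - maxExceptions);
    return bitsRequired(sorted[idx]);
  }
}
```

With a block such as `{3, 3, 3, 1000}`, `forBitsPerValue` returns 10, while `patchedBitsPerValue` with one allowed exception returns 2.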


jpountz

It looks good to me in general. Can you also split Lucene90PostingsFormat into a Lucene90PostingsFormat that is read-only and a Lucene90RWPostingsFormat that is only available for testing? You can check out Lucene95RWHnswVectorsFormat for a recent example of how file formats get split into a read-only implementation and a test-only read-write implementation.
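
As a rough illustration of the read-only / test-only read-write split suggested above, here is a sketch using hypothetical stand-in classes. None of these are Lucene types, and a real PostingsFormat has a very different API. The only point is the shape: the production class refuses to write, and a subclass kept in test code re-enables writing.

```java
import java.util.Map;

// Hypothetical stand-ins, not Lucene classes.

/** Production variant: can still read old data but refuses to write it. */
class SketchReadOnlyFormat {
  protected final Map<String, String> segments;

  SketchReadOnlyFormat(Map<String, String> segments) {
    this.segments = segments;
  }

  String read(String name) {
    return segments.get(name);
  }

  void write(String name, String data) {
    throw new UnsupportedOperationException("Old formats can't be used for writing");
  }
}

/** Test-only variant: re-enables writing so tests can still produce old-format data. */
class SketchRWFormat extends SketchReadOnlyFormat {
  SketchRWFormat(Map<String, String> segments) {
    super(segments);
  }

  @Override
  void write(String name, String data) {
    segments.put(name, data);
  }
}
```

Keeping the write path out of the production class means new indexes cannot be created in the old format by accident, while backward-compatibility tests can still generate data to read back.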

lucene/CHANGES.txt

jpountz · 2023-10-31T17:02:45Z

lucene/core/src/java/org/apache/lucene/codecs/lucene99/ForDeltaUtil.java

+  }
+
+  /** Skip a sequence of 128 longs. */
+  void skip(DataInput in) throws IOException {


It looks like we don't need this method as it's only used for tests.

Thanks, removed in latest commit.

jpountz · 2023-10-31T17:03:49Z

lucene/core/src/java/org/apache/lucene/codecs/lucene99/ForDeltaUtil.java

+    if (bitsPerValue == 0) {
+      prefixSumOfOnes(longs, base);
+    } else {
+      forUtil.decodeAndPrefixSum(bitsPerValue, in, base, longs);


Should we inline this other method into this class? It's a bit awkward to have the prefix sum logic in ForUtil rather than ForDeltaUtil?

Oh maybe it's for convenience because this other class is generated and not this one?

I think it's the convenience; otherwise we would have to duplicate about 650 lines of code from ForUtil (all of decode1 through decode24).
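
For context, a minimal sketch of what the two branches in the snippet above compute. I am assuming that `bitsPerValue == 0` marks a block whose deltas are all 1, so that the decoded values are simply `base + 1, base + 2, ...`. The class name and the loop-based bodies are hypothetical and much simpler than the generated ForUtil code.

```java
// Illustrative only: plain loops, not the generated decode routines.
final class PrefixSumSketch {

  /** All deltas are 1: fill longs with base + 1, base + 2, ... */
  static void prefixSumOfOnes(long[] longs, long base) {
    for (int i = 0; i < longs.length; i++) {
      longs[i] = base + i + 1;
    }
  }

  /** General case: turn already-decoded deltas into absolute values, in place. */
  static void prefixSum(long[] deltas, long base) {
    long sum = base;
    for (int i = 0; i < deltas.length; i++) {
      sum += deltas[i];
      deltas[i] = sum;
    }
  }
}
```

For example, `prefixSum` on deltas `{1, 2, 3}` with base 10 yields `{11, 13, 16}`.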

lucene/core/src/java/org/apache/lucene/codecs/lucene99/PForUtil.java

Also: * Change to Changes.txt * Removal of dead code which was only used in unit tests * Removal of test code from PForUtil

slow-J · 2023-10-31T20:07:27Z

Thanks for the suggestion, I added Lucene90RWPostingsFormat in latest commit and made Lucene90PostingsFormat read-only.

gf2121

Thanks @slow-J ! I left some minor comments about additional 90 -> 99 refactoring.

...kward-codecs/src/java/org/apache/lucene/backward_codecs/lucene90/Lucene90PostingsFormat.java

lucene/core/src/java/org/apache/lucene/codecs/lucene99/Lucene99PostingsFormat.java

lucene/core/src/java/org/apache/lucene/codecs/lucene99/Lucene99SkipReader.java

lucene/core/src/java/org/apache/lucene/codecs/lucene99/Lucene99PostingsFormat.java

Co-authored-by: gf2121 <52390227+gf2121@users.noreply.github.com>

slow-J · 2023-11-01T11:15:04Z

Thanks @slow-J ! I left some minor comments about additional 90 -> 99 refactoring.

Thanks @gf2121 , committed all the suggestions.

mikemccand · 2023-11-02T08:41:22Z

Thanks for tackling this / persisting @slow-J, especially the glorious fun experience of having to "bump" the Codec version ;) A nice rite-of-passage in this Lucene world!

slow-J · 2023-11-02T12:05:29Z

Thanks @mikemccand and yes, the codec version bump is the majority of this change :D

jpountz · 2023-11-03T14:51:57Z

For reference, I'm interested in taking advantage of the fact we're changing the codec anyway to look into other smaller changes, like switching tail postings from vints to group-varint, or better aligning blocks and skip lists so that BlockDocsEnum#advance doesn't need to check whether docBufferUpto == BLOCK_SIZE to decode a new block and could do it directly under the target > nextSkipDoc check.
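
To make the group-varint idea concrete, here is a small sketch of one common layout: a tag byte holding four 2-bit lengths, followed by 1 to 4 bytes per value. The class and method names are hypothetical, and the exact byte and bit order that Lucene would use is an assumption here, not something stated in this thread.

```java
import java.io.ByteArrayOutputStream;

// Illustrative group-varint for exactly four ints; layout details are assumptions.
final class GroupVarintSketch {

  /** Encodes four ints as one tag byte followed by 1-4 little-endian bytes per value. */
  static byte[] encode4(int[] values) {
    int[] lengths = new int[4];
    int tag = 0;
    for (int i = 0; i < 4; i++) {
      // "| 1" makes zero take one byte instead of zero bytes.
      int bits = 32 - Integer.numberOfLeadingZeros(values[i] | 1);
      lengths[i] = (bits + 7) / 8;
      tag |= (lengths[i] - 1) << (i * 2); // 2 bits per value: byte length minus 1
    }
    ByteArrayOutputStream out = new ByteArrayOutputStream();
    out.write(tag);
    for (int i = 0; i < 4; i++) {
      for (int b = 0; b < lengths[i]; b++) {
        out.write((values[i] >>> (8 * b)) & 0xFF);
      }
    }
    return out.toByteArray();
  }

  /** Decodes the four ints written by encode4. */
  static int[] decode4(byte[] in) {
    int tag = in[0] & 0xFF;
    int pos = 1;
    int[] values = new int[4];
    for (int i = 0; i < 4; i++) {
      int length = ((tag >>> (i * 2)) & 3) + 1;
      int v = 0;
      for (int b = 0; b < length; b++) {
        v |= (in[pos++] & 0xFF) << (8 * b);
      }
      values[i] = v;
    }
    return values;
  }
}
```

Unlike vints, the length of every value is known from the tag byte up front, so the decoder does not have to test a continuation bit on each byte.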

mikemccand

Thank you @slow-J -- what a big change this turned out to be.

I left some minor comments that can be resolved later. I think given how many files this is touching we should merge it sooner rather than later... I'll merge later today if there are no concerns otherwise.

mikemccand · 2023-11-06T12:06:57Z

lucene/core/src/java/org/apache/lucene/codecs/lucene99/Lucene99PostingsFormat.java

+ *         <li>SkipDatum --&gt; DocSkip, DocFPSkip, &lt;PosFPSkip, PosBlockOffset, PayLength?,
+ *             PayFPSkip?&gt;?, ImpactLength, &lt;CompetitiveFreqDelta, CompetitiveNormDelta?&gt;
+ *             <sup>ImpactCount</sup>, SkipChildLevelPointer?
+ *         <li>PackedDocDeltaBlock, PackedFreqBlock --&gt; {@link PackedInts PackedInts}


Hmm maybe separate out these two to clarify that the PackedDocDeltaBlock does not use patching, but the PackedFreqBlock does?

mikemccand · 2023-11-06T12:08:13Z

lucene/core/src/java/org/apache/lucene/codecs/lucene99/Lucene99PostingsFormat.java

+ *   <dd><b>Frequencies and Skip Data</b>
+ *       <p>The .doc file contains the lists of documents which contain each term, along with the
+ *       frequency of the term in that document (except when frequencies are omitted: {@link
+ *       IndexOptions#DOCS}). It also saves skip data to the beginning of each packed or VInt block,


Huh, I had thought skip data was saved at the end of each term's postings? And, the skip data is not stored per block, but rather once for the entire postings list?

(This is a pre-existing issue -- we can fix it separately).

Opened PR for the 2 javadoc comments: #12776

* Change Postings back to using FOR in Lucene99PostingsFormat
  We are still keeping PFOR for positions only. This is a partial revert of #69 which brings back ForDeltaUtil.
* fix merge commit
* Add forgotten forDeltaUtil calls to reader
* Addressing comments: adding Lucene90RWPostingsFormat + more
  Also:
  * Change to Changes.txt
  * Removal of dead code which was only used in unit tests
  * Removal of test code from PForUtil
* Changes.txt edit in right place now
* Apply suggestions from code review: `90 -> 99 refactoring`
  Co-authored-by: gf2121 <52390227+gf2121@users.noreply.github.com>
* Remove decodeTo32 from ForUtil and regenerate
---------
Co-authored-by: gf2121 <52390227+gf2121@users.noreply.github.com>

slow-J · 2023-11-06T17:02:59Z

Thanks Mike and all reviewers!

mikemccand · 2023-11-06T17:16:37Z

Thank you @slow-J!

Addressing the last comments from apache#12741

IndexDiskUsageAnalyzer needs adjusting after apache/lucene#12741

Clean-up from adding the Lucene99PostingsFormat in apache#12741 These test cases were moved to Lucene99 dir and I forgot to copy the unmodified versions for the backward_codecs.lucene90

…2781) Clean-up from adding the Lucene99PostingsFormat in #12741 These test cases were moved to Lucene99 dir and I forgot to copy the unmodified versions for the backward_codecs.lucene90

IndexDiskUsageAnalyzer and IndexDiskUsageAnalyzerTests, as well as CompletionFieldMapper, CompletionFieldMapperTests and CompletionStatsCacheTests need adjusting after apache/lucene#12741 , to refer to the latest postings format. KuromojiTokenizerFactory needs adjusting after apache/lucene#12390

jpountz · 2023-11-10T07:19:22Z

Nightly benchmarks just caught up with this change; it's not obvious that there is a speedup.

gf2121 · 2023-11-10T07:53:59Z

FYI this great view could make it easier to see the impact of changes in a single day for all tasks. It seems some count tasks got a bit happier, with a small p-value.

slow-J · 2023-11-10T22:24:25Z

I think that it's a little hard to tell with 1 datapoint due to noise. It seems to be trending upwards in the BooleanQuery graphs, but I agree that it's not obvious that there is a noticeable speedup...

jpountz · 2023-11-13T09:46:20Z

Thanks both, I pushed an annotation, it should show up tomorrow. I had high expectations based on preliminary results from #12696 (comment) where AndHighMed had a reproducible 3-4% speedup, so I was expecting nightlies to show it too. @slow-J I'm curious if you had a chance to run benchmarks on this PR, did it also show a speedup?

slow-J · 2023-11-13T11:16:31Z

I ran a new luceneutil benchmark on Saturday with my commit 8ae598b (using Lucene99PostingsFormat) as candidate and the commit's parent as baseline (using Lucene90PostingsFormat).

Other benchmark variables for transparency:

Java 19
Ec2 instance: m5.12xlarge.
Disabled JFR

                            Task QPS baseline      StdDev QPS my_modified_version      StdDev                Pct diff p-value
           BrowseMonthTaxoFacets        3.98      (6.7%)        3.92      (2.3%)   -1.7% (  -9% -    7%) 0.286
       BrowseDayOfYearSSDVFacets        3.29      (1.6%)        3.27      (1.1%)   -0.7% (  -3% -    2%) 0.126
            BrowseDateSSDVFacets        1.01      (4.8%)        1.00      (5.6%)   -0.6% ( -10% -   10%) 0.695
                        PKLookup      151.00      (2.5%)      150.26      (2.6%)   -0.5% (  -5% -    4%) 0.543
           BrowseMonthSSDVFacets        3.44      (1.6%)        3.43      (1.5%)   -0.5% (  -3% -    2%) 0.321
                         LowTerm      345.74      (3.0%)      344.39      (3.1%)   -0.4% (  -6% -    5%) 0.689
                        Wildcard      142.48      (1.9%)      142.00      (1.8%)   -0.3% (  -3% -    3%) 0.569
                         Prefix3     1056.96      (4.0%)     1054.07      (3.2%)   -0.3% (  -7% -    7%) 0.812
     BrowseRandomLabelSSDVFacets        2.55      (7.2%)        2.54      (7.2%)   -0.1% ( -13% -   15%) 0.975
       BrowseDayOfYearTaxoFacets        3.96      (0.7%)        3.96      (0.7%)   -0.0% (  -1% -    1%) 0.856
            BrowseDateTaxoFacets        3.94      (0.7%)        3.94      (0.7%)   -0.0% (  -1% -    1%) 0.912
     BrowseRandomLabelTaxoFacets        3.41      (0.8%)        3.41      (0.9%)   -0.0% (  -1% -    1%) 0.977
                          Fuzzy1       69.71      (0.9%)       69.74      (0.8%)    0.0% (  -1% -    1%) 0.902
                   OrHighNotHigh      205.50      (4.6%)      205.67      (5.3%)    0.1% (  -9% -   10%) 0.958
                          Fuzzy2       58.39      (0.8%)       58.44      (0.6%)    0.1% (  -1% -    1%) 0.688
                    OrHighNotLow      265.57      (5.4%)      266.05      (5.9%)    0.2% ( -10% -   12%) 0.921
            HighIntervalsOrdered        5.32      (4.5%)        5.33      (4.1%)    0.2% (  -8% -    9%) 0.879
            HighTermTitleBDVSort        7.81      (3.1%)        7.83      (2.7%)    0.3% (  -5% -    6%) 0.756
                         Respell       34.00      (1.3%)       34.10      (1.1%)    0.3% (  -2% -    2%) 0.451
            MedTermDayTaxoFacets       16.00      (3.8%)       16.04      (3.8%)    0.3% (  -7% -    8%) 0.801
                    OrHighNotMed      255.88      (5.0%)      257.07      (5.2%)    0.5% (  -9% -   11%) 0.774
        AndHighHighDayTaxoFacets        2.83      (3.5%)        2.84      (3.4%)    0.5% (  -6% -    7%) 0.658
                         MedTerm      374.43      (4.9%)      376.51      (5.6%)    0.6% (  -9% -   11%) 0.738
                        HighTerm      472.29      (4.8%)      474.92      (5.7%)    0.6% (  -9% -   11%) 0.738
                     MedSpanNear        5.01      (4.3%)        5.04      (4.6%)    0.6% (  -7% -    9%) 0.690
               HighTermMonthSort     2670.49      (4.1%)     2689.05      (3.1%)    0.7% (  -6% -    8%) 0.549
             LowIntervalsOrdered        6.65      (4.2%)        6.70      (4.3%)    0.7% (  -7% -    9%) 0.584
                      OrHighHigh       22.54      (1.2%)       22.71      (2.4%)    0.7% (  -2% -    4%) 0.204
          OrHighMedDayTaxoFacets        1.58      (4.6%)        1.59      (3.5%)    0.8% (  -6% -    9%) 0.523
                          IntNRQ       27.99      (2.4%)       28.25      (3.9%)    0.9% (  -5% -    7%) 0.371
                   OrNotHighHigh      244.47      (3.7%)      246.97      (4.5%)    1.0% (  -6% -    9%) 0.432
             MedIntervalsOrdered        3.14      (3.7%)        3.18      (3.9%)    1.0% (  -6% -    9%) 0.391
                       OrHighMed       43.19      (1.3%)       43.65      (1.7%)    1.1% (  -1% -    4%) 0.025
                    HighSpanNear        6.73      (2.7%)        6.80      (3.3%)    1.2% (  -4% -    7%) 0.223
                       LowPhrase      255.75      (2.1%)      259.11      (2.0%)    1.3% (  -2% -    5%) 0.043
           HighTermDayOfYearSort      253.38      (4.1%)      257.57      (2.2%)    1.7% (  -4% -    8%) 0.112
                     AndHighHigh       15.70      (1.1%)       15.98      (2.5%)    1.8% (  -1% -    5%) 0.004
                       MedPhrase       14.60      (2.4%)       14.89      (2.2%)    1.9% (  -2% -    6%) 0.009
                       OrHighLow      340.15      (2.3%)      346.84      (2.6%)    2.0% (  -2% -    6%) 0.010
                      HighPhrase      105.17      (2.3%)      107.28      (2.0%)    2.0% (  -2% -    6%) 0.004
                HighSloppyPhrase        1.97      (5.2%)        2.01      (3.6%)    2.0% (  -6% -   11%) 0.152
               HighTermTitleSort      118.65      (5.4%)      121.16      (2.4%)    2.1% (  -5% -   10%) 0.112
                 MedSloppyPhrase        2.73      (5.0%)        2.79      (3.1%)    2.2% (  -5% -   10%) 0.092
                 LowSloppyPhrase        9.48      (3.7%)        9.69      (1.7%)    2.3% (  -3% -    8%) 0.013
                      AndHighMed       47.32      (1.3%)       48.52      (2.8%)    2.5% (  -1% -    6%) 0.000
                     LowSpanNear       17.05      (2.0%)       17.49      (2.1%)    2.6% (  -1% -    6%) 0.000
                    OrNotHighMed      292.37      (3.0%)      300.28      (3.3%)    2.7% (  -3% -    9%) 0.007
         AndHighMedDayTaxoFacets       21.13      (1.8%)       21.74      (1.5%)    2.9% (   0% -    6%) 0.000
                      TermDTSort      124.22      (4.8%)      127.91      (1.3%)    3.0% (  -3% -    9%) 0.008
                      AndHighLow      477.93      (3.1%)      494.29      (2.5%)    3.4% (  -2% -    9%) 0.000
                    OrNotHighLow      424.91      (2.3%)      444.79      (2.8%)    4.7% (   0% -   10%) 0.000

and I am still seeing a speedup, although the AndHighMed gain was 2.5%.
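
As a quick sanity check on how to read the table, the "Pct diff" column appears to be the relative change in mean QPS from baseline to candidate. That reading is my assumption, but it matches the rows above. A tiny sketch with a hypothetical helper:

```java
// Hypothetical helper: relative change in QPS, in percent.
final class PctDiffSketch {
  static double pctDiff(double baselineQps, double candidateQps) {
    return 100.0 * (candidateQps - baselineQps) / baselineQps;
  }
}
```

For AndHighMed this gives 100 * (48.52 - 47.32) / 47.32, which is about 2.5%, and for OrNotHighLow 100 * (444.79 - 424.91) / 424.91, which is about 4.7%.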

mikemccand · 2023-11-13T13:54:46Z

I ran a new luceneutil benchmark on Saturday with my commit 8ae598b (using Lucene99PostingsFormat) as candidate and the commit's parent as baseline (using Lucene90PostingsFormat).

Is this wikimediumall or wikibigall?

slow-J · 2023-11-13T14:23:00Z

I ran a new luceneutil benchmark on Saturday with my commit 8ae598b (using Lucene99PostingsFormat) as candidate and the commit's parent as baseline (using Lucene90PostingsFormat).

Is this wikimediumall or wikibigall?

Should have specified, it's wikimediumall


s1monw · 2023-12-19T10:48:00Z

I wanted to give my $0.02 on this. I am not convinced that a 2% change on a benchmark warrants a 6.2k SLoC addition to such an important codebase. I think the differences in terms of performance between FOR and PFOR vary a lot across benchmarks and are heavily dependent on what your index looks like, how big it is. I would even argue that the space savings PFOR was bringing in (about 5%) might make a bigger difference in terms of performance depending on the size of the index and your hardware.
I don't wanna go that far and ask for a revert of this change but I think we need to look closer in the future if the rather questionable improvements warrant a change like this or if such a change should rather be an optional postings format rather than the default.

slow-J added 3 commits October 31, 2023 15:40

Change Postings back to using FOR in Lucene99PostingsFormat

c8575cc

We are still keeping PFOR for positions only. This is a partial revert of apache#69 which brings back ForDeltaUtil.

fix merge commit

7d1d31e

Add forgotten forDeltaUtil calls to reader

0ac9863

slow-J mentioned this pull request Oct 31, 2023

Adding option to codec to disable patching in Lucene's PFOR encoding #12696

Closed

jpountz reviewed Oct 31, 2023


jpountz mentioned this pull request Oct 31, 2023

Use max BPV encoding in postings if doc buffer size less than ForUtil.BLOCK_SIZE #12717

Closed

Addressing comments: adding Lucene90RWPostingsFormat + more

12415c7

Also: * Change to Changes.txt * Removal of dead code which was only used in unit tests * Removal of test code from PForUtil

Changes.txt edit in right place now

0d75c9b

gf2121 reviewed Nov 1, 2023


Apply suggestions from code review: 90 -> 99 refactoring

819ca4c

Co-authored-by: gf2121 <52390227+gf2121@users.noreply.github.com>

Remove decodeTo32 from ForUtil and regenerate

c994d2b

slow-J marked this pull request as ready for review November 2, 2023 12:05

slow-J mentioned this pull request Nov 2, 2023

Explore partially decoding blocks (within-block skipping) #12749

Open

Merge remote-tracking branch 'origin' into remove_patch_postings

5a68c66

mikemccand approved these changes Nov 6, 2023


mikemccand merged commit 8ae598b into apache:main Nov 6, 2023
4 checks passed

slow-J deleted the remove_patch_postings branch November 6, 2023 16:58

slow-J added a commit to slow-J/lucene that referenced this pull request Nov 6, 2023

javadocs cleanup in Lucene99PostingsFormat

6fe1773

Addressing the last comments from apache#12741

This was referenced Nov 6, 2023

javadocs cleanup in Lucene99PostingsFormat #12776

Merged

Update postings format to Lucene99 + add annot mikemccand/luceneutil#243

Closed

Try turning off patching in Lucene's PFOR encoding Tony-X/search-benchmark-game#46

Closed

javanna added a commit to javanna/elasticsearch that referenced this pull request Nov 7, 2023

Fix compile error

4b014cc

IndexDiskUsageAnalyzer needs adjusting after apache/lucene#12741

javanna added a commit to javanna/elasticsearch that referenced this pull request Nov 7, 2023

Fix compile error

83bbea2

IndexDiskUsageAnalyzer needs adjusting after apache/lucene#12741

javanna added a commit to javanna/elasticsearch that referenced this pull request Nov 7, 2023

Fix compile error

8272824

IndexDiskUsageAnalyzer needs adjusting after apache/lucene#12741

javanna mentioned this pull request Nov 7, 2023

Fix compile errors elastic/elasticsearch#101874

Merged

slow-J mentioned this pull request Nov 8, 2023

Re-adding the backward_codecs.lucene90 TestPForUtil + TestForUtil #12781

Merged

mikemccand pushed a commit that referenced this pull request Nov 13, 2023

javadocs cleanup in Lucene99PostingsFormat (#12776)

fcf6878

Addressing the last comments from #12741

mikemccand pushed a commit that referenced this pull request Nov 13, 2023

javadocs cleanup in Lucene99PostingsFormat (#12776)

654e5ab

Addressing the last comments from #12741

ChrisHegarty mentioned this pull request Dec 5, 2023

Disk usage has increased in the nightly Logging benchmark elastic/elasticsearch#103002

Closed

This was referenced Dec 11, 2023

Write a HOWTO migrate Codec format version #12918

Open

Writing a HOWTO migrate codec version #12930

Open

vladak mentioned this pull request Dec 15, 2023

lucene 9.9.0 oracle/opengrok#4505

Merged
