
Split taxonomy arrays across chunks #12995

Merged (7 commits) on Jan 22, 2024

Conversation

@msfroh (Contributor) commented Jan 6, 2024

Description

Taxonomy ordinals are added in an append-only way.

Instead of reallocating a single big array when loading new taxonomy ordinals and copying all the values from the previous arrays over individually, we can keep blocks of ordinals and reuse blocks from the previous arrays.

Resolves #12989
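For readers who want to see the shape of the idea, here is a minimal, hedged sketch of chunked growth. It is not the PR's actual implementation: the class name, the grow/get helpers, and the chunk size (CHUNK_SIZE_BITS = 13) are illustrative assumptions. It only demonstrates the technique from the description: full chunks from the previous array are reused as-is, and only the new tail is allocated.

public final class ChunkedGrowthSketch {
  private static final int CHUNK_SIZE_BITS = 13;              // assumed value for illustration
  private static final int CHUNK_SIZE = 1 << CHUNK_SIZE_BITS; // 8192
  private static final int CHUNK_MASK = CHUNK_SIZE - 1;

  /** Grows a chunked array from oldSize to newSize (append-only, so newSize >= oldSize). */
  static int[][] grow(int[][] oldChunks, int oldSize, int newSize) {
    int newChunkCount = (newSize >> CHUNK_SIZE_BITS) + 1;
    int[][] newChunks = new int[newChunkCount][];
    int fullOldChunks = oldSize >> CHUNK_SIZE_BITS; // chunks that are already full and never rewritten
    // Share the full chunks with the previous array instead of copying their contents.
    System.arraycopy(oldChunks, 0, newChunks, 0, fullOldChunks);
    // Allocate the remaining chunks; only the last one may be partially sized.
    for (int i = fullOldChunks; i < newChunkCount - 1; i++) {
      newChunks[i] = new int[CHUNK_SIZE];
    }
    newChunks[newChunkCount - 1] = new int[newSize & CHUNK_MASK];
    // Copy over the old partially filled chunk, if there was one.
    if (fullOldChunks < oldChunks.length && oldChunks[fullOldChunks] != null) {
      int[] partial = oldChunks[fullOldChunks];
      System.arraycopy(partial, 0, newChunks[fullOldChunks], 0, partial.length);
    }
    return newChunks;
  }

  /** Reads an ordinal's value: chunk index via shift, offset within the chunk via mask. */
  static int get(int[][] chunks, int ordinal) {
    return chunks[ordinal >> CHUNK_SIZE_BITS][ordinal & CHUNK_MASK];
  }
}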

    }

    tr.close();
    indexDir.close();
  }

  private static void assertArrayEquals(int[] expected, ParallelTaxonomyArrays.IntArray actual) {
Contributor:

I'm wondering if it's worth implementing equals for ChunkedArray. No strong opinion one way or the other.

Contributor Author:

This was the only place where we assert anything about the full contents of the ChunkedArray. I would lean toward not adding equals to ChunkedArray, since it's not something we would normally want to call, IMO. (In general, it's going to be pretty expensive.)
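To make the alternative concrete, here is a hedged sketch of what the test-side comparison could look like, kept in the test instead of adding equals() to ChunkedArray. It assumes ParallelTaxonomyArrays.IntArray exposes length() and get(int) and that JUnit 4 assertions are in scope, as in Lucene's test framework; it is an illustration, not the PR's exact code.

  private static void assertArrayEquals(int[] expected, ParallelTaxonomyArrays.IntArray actual) {
    // Compare lengths first, then every element; this avoids a ChunkedArray.equals()
    // that would rarely be wanted elsewhere and would be expensive in general.
    assertEquals(expected.length, actual.length());
    for (int i = 0; i < expected.length; i++) {
      assertEquals("mismatch at ordinal " + i, expected[i], actual.get(i));
    }
  }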

    if (copyFrom.initializedChildren) {
      initChildrenSiblings(copyFrom);
    }
  }

  private static int[][] allocateChunkedArray(int size) {
    int chunkCount = size / CHUNK_SIZE + 1;
Contributor:

Should we do the +1 only if lastChunkSize != 0?

Contributor Author:

In practice, we always have size >= 1 (because the TaxonomyWriter writes the root ordinal on startup) -- but relying on that behavior may bite us. I'll add a check for the size == 0 case.

Contributor:

I should have been more detailed in my comment. I didn't think the size == 0 case specifically would be an issue. I was wondering more generally if we really want the empty array at the end in cases where size is a multiple of CHUNK_SIZE (i.e. lastChunkSize == 0). And I think it's reasonable to do it that way - it sticks with this contract of having the last array be mutable in a way that the others aren't.

My preferred solution right now would be to remove the special case for size == 0 and remove the other if condition in the method, since that condition is always met.

Contributor:

I missed this thread that was still pending. @msfroh - what do you think, can we simplify this method?

Contributor Author:

Got it!

I handled the multiple of CHUNK_SIZE case without allocating an empty array. I also added a unit test that specifically flexes ChunkedIntArray.

(And I cleaned up a few more cases where I was still dividing/moduloing instead of bit-shifting and masking.)
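For reference, a small sketch of the divide/modulo vs. shift/mask equivalence being used here; the constant values are illustrative assumptions, and the equivalence holds only because CHUNK_SIZE is a power of two and ordinals are non-negative.

final class ChunkAddressing {
  static final int CHUNK_SIZE_BITS = 13;              // assumed value for illustration
  static final int CHUNK_SIZE = 1 << CHUNK_SIZE_BITS; // 8192
  static final int CHUNK_MASK = CHUNK_SIZE - 1;       // 0x1FFF

  // Equivalent to ordinal / CHUNK_SIZE for ordinal >= 0: which chunk holds the ordinal.
  static int chunkIndex(int ordinal) {
    return ordinal >> CHUNK_SIZE_BITS;
  }

  // Equivalent to ordinal % CHUNK_SIZE for ordinal >= 0: the offset within that chunk.
  static int indexInChunk(int ordinal) {
    return ordinal & CHUNK_MASK;
  }
}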

@stefanvodita (Contributor):

As far as testing, can we add some unit tests that allocate more than one chunk and exercise the new functionality?
Should we also run some benchmarks to understand if there's any sort of performance regression?

@msfroh (Contributor Author) commented Jan 9, 2024

> As far as testing, can we add some unit tests that allocate more than one chunk and exercise the new functionality? Should we also run some benchmarks to understand if there's any sort of performance regression?

I can take care of the first part.

@stefanvodita, do you mind running the Lucene benchmarks against this change to see how it performs?

@stefanvodita (Contributor):

The results are in. I don't see any significant p-values (< 0.05).
python3 src/python/localrun.py -source wikimediumall -r

                            TaskQPS baseline      StdDevQPS my_modified_version      StdDev                Pct diff p-value
                       MedPhrase       27.55      (7.2%)       27.08      (5.8%)   -1.7% ( -13% -   12%) 0.408
                          IntNRQ       19.20     (10.6%)       18.89      (7.1%)   -1.6% ( -17% -   18%) 0.566
           BrowseMonthTaxoFacets        9.67      (8.9%)        9.52     (10.7%)   -1.6% ( -19% -   19%) 0.613
     BrowseRandomLabelTaxoFacets        4.01     (25.1%)        3.97     (26.0%)   -1.0% ( -41% -   66%) 0.902
                HighSloppyPhrase       28.64      (3.5%)       28.41      (2.4%)   -0.8% (  -6% -    5%) 0.406
                      AndHighLow      325.06      (3.8%)      322.67      (7.0%)   -0.7% ( -11% -   10%) 0.680
     BrowseRandomLabelSSDVFacets        2.92      (1.8%)        2.90      (2.6%)   -0.6% (  -4% -    3%) 0.403
       BrowseDayOfYearSSDVFacets        4.08      (3.8%)        4.06      (3.7%)   -0.6% (  -7% -    7%) 0.625
                         Prefix3      142.86      (5.0%)      142.32      (5.4%)   -0.4% ( -10% -   10%) 0.819
            HighTermTitleBDVSort        7.40      (6.3%)        7.38      (6.4%)   -0.3% ( -12% -   13%) 0.862
                 LowSloppyPhrase       11.47      (1.9%)       11.44      (1.7%)   -0.3% (  -3% -    3%) 0.657
                     LowSpanNear        6.86      (2.0%)        6.85      (1.8%)   -0.2% (  -3% -    3%) 0.754
                      HighPhrase       67.07      (3.1%)       66.94      (3.1%)   -0.2% (  -6% -    6%) 0.850
                       LowPhrase        7.77      (3.2%)        7.76      (3.4%)   -0.1% (  -6% -    6%) 0.888
                    HighSpanNear        3.81      (3.0%)        3.80      (2.8%)   -0.1% (  -5% -    5%) 0.875
            HighIntervalsOrdered        1.61      (2.5%)        1.61      (2.4%)   -0.1% (  -4% -    4%) 0.853
                      AndHighMed       76.92      (6.7%)       76.83      (8.0%)   -0.1% ( -13% -   15%) 0.959
                 MedSloppyPhrase       15.30      (1.9%)       15.29      (1.7%)   -0.1% (  -3% -    3%) 0.850
                      OrHighHigh       20.33      (7.3%)       20.31      (6.4%)   -0.1% ( -12% -   14%) 0.961
             LowIntervalsOrdered        7.12      (2.5%)        7.12      (2.4%)   -0.0% (  -4% -    4%) 0.955
         AndHighMedDayTaxoFacets       11.03      (2.2%)       11.04      (2.2%)    0.0% (  -4% -    4%) 0.959
                    OrNotHighMed      216.20      (2.6%)      216.28      (3.0%)    0.0% (  -5% -    5%) 0.967
             MedIntervalsOrdered       23.70      (1.8%)       23.71      (1.6%)    0.0% (  -3% -    3%) 0.944
                     MedSpanNear       13.33      (1.9%)       13.34      (1.7%)    0.1% (  -3% -    3%) 0.910
        AndHighHighDayTaxoFacets        7.73      (3.5%)        7.74      (3.2%)    0.1% (  -6% -    7%) 0.906
            MedTermDayTaxoFacets       15.49      (2.2%)       15.54      (1.6%)    0.3% (  -3% -    4%) 0.590
               HighTermTitleSort      121.92      (1.8%)      122.33      (1.6%)    0.3% (  -2% -    3%) 0.533
                      TermDTSort      142.74      (2.1%)      143.26      (3.4%)    0.4% (  -5% -    5%) 0.683
                        PKLookup      172.65      (3.0%)      173.29      (2.8%)    0.4% (  -5% -    6%) 0.687
                         Respell       43.69      (2.3%)       43.87      (2.7%)    0.4% (  -4% -    5%) 0.613
           HighTermDayOfYearSort      210.29      (2.8%)      211.25      (3.1%)    0.5% (  -5% -    6%) 0.626
                     AndHighHigh       23.87     (11.4%)       23.99     (11.7%)    0.5% ( -20% -   26%) 0.888
            BrowseDateSSDVFacets        1.19      (5.2%)        1.19      (4.2%)    0.6% (  -8% -   10%) 0.698
                        Wildcard      186.59      (2.6%)      187.68      (2.9%)    0.6% (  -4% -    6%) 0.505
                       OrHighMed       90.56      (5.3%)       91.09      (4.9%)    0.6% (  -9% -   11%) 0.715
                   OrNotHighHigh      354.86      (3.6%)      356.98      (3.9%)    0.6% (  -6% -    8%) 0.611
                   OrHighNotHigh      306.95      (3.8%)      308.88      (4.2%)    0.6% (  -7% -    9%) 0.620
                    OrHighNotMed      363.76      (3.9%)      366.46      (4.0%)    0.7% (  -6% -    8%) 0.548
            BrowseDateTaxoFacets        4.97     (27.3%)        5.00     (27.3%)    0.8% ( -42% -   76%) 0.929
       BrowseDayOfYearTaxoFacets        5.06     (26.8%)        5.09     (26.9%)    0.8% ( -41% -   74%) 0.927
                    OrHighNotLow      244.26      (5.3%)      246.17      (5.3%)    0.8% (  -9% -   12%) 0.641
           BrowseMonthSSDVFacets        4.38      (3.9%)        4.41      (3.1%)    0.8% (  -5% -    8%) 0.479
                       OrHighLow      264.51      (5.4%)      266.91      (4.5%)    0.9% (  -8% -   11%) 0.568
               HighTermMonthSort     2111.04      (1.9%)     2131.88      (3.2%)    1.0% (  -4% -    6%) 0.236
                    OrNotHighLow      543.72      (2.9%)      549.12      (2.3%)    1.0% (  -4% -    6%) 0.229
                          Fuzzy2       51.05      (2.7%)       51.61      (2.6%)    1.1% (  -4% -    6%) 0.191
                          Fuzzy1       66.48      (3.0%)       67.31      (3.1%)    1.3% (  -4% -    7%) 0.199
                         MedTerm      508.60      (7.4%)      515.34      (7.2%)    1.3% ( -12% -   17%) 0.568
                        HighTerm      399.43      (7.2%)      405.35      (7.0%)    1.5% ( -11% -   16%) 0.509
          OrHighMedDayTaxoFacets        3.92      (4.2%)        3.98      (3.4%)    1.6% (  -5% -    9%) 0.194
                         LowTerm      350.89      (8.4%)      358.77      (9.4%)    2.2% ( -14% -   21%) 0.427

@stefanvodita (Contributor) left a comment:

@msfroh - I left just one more nitpicky comment. Can you also add a CHANGES entry? I would be happy to merge the PR after.

    int chunkCount = size >> CHUNK_SIZE_BITS;
    int fullChunkCount;
    int lastChunkSize = size & CHUNK_MASK;
    if (lastChunkSize == 0) {
Contributor:

Thank you for persisting while we're iterating over this method.

Since fullChunkCount is assigned chunkCount on both branches, why not do this:

int fullChunkCount = chunkCount;
if (lastChunkSize != 0) {
  chunkCount++;
}

On a higher level, I think I still wasn't specific enough in my previous comment. I didn't mind that we would sometimes have an empty array at the end if size was a multiple of CHUNK_SIZE, but we had if-statements that didn't seem to me like they were adding something. In this case, I prefer spending those extra bytes if we can make the code simpler. If you think the implementation we already have is better, we can keep it, but here is my preferred solution written out if you want to consider it:

private static int[][] allocateChunkedArray(int size, int startFrom) {
    int chunkCount = (size >> CHUNK_SIZE_BITS) + 1;
    int[][] array = new int[chunkCount][];
    for (int i = startFrom; i < chunkCount - 1; i++) {
        array[i] = new int[CHUNK_SIZE];
    }
    array[chunkCount - 1] = new int[size & CHUNK_MASK];
    return array;
}
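As a quick illustration of the shapes this version produces (assuming CHUNK_SIZE = 8192 and startFrom = 0): allocateChunkedArray(5, 0) yields { int[5] }, allocateChunkedArray(10000, 0) yields { int[8192], int[1808] }, and allocateChunkedArray(8192, 0) yields { int[8192], int[0] }. An exact multiple of CHUNK_SIZE simply ends with an empty trailing chunk, which preserves the contract that only the last chunk is partial, without any extra branching.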

Contributor Author:

Oh, that's much cleaner, thank you!

Commits:

- Taxonomy ordinals are added in an append-only way. Instead of reallocating a single big array when loading new taxonomy ordinals and copying all the values from the previous arrays over individually, we can keep blocks of ordinals and reuse blocks from the previous arrays.
- Especially avoid allocating new chunks that we're going to immediately discard.
- Avoid redundant lookup of parent ordinal. Also added a couple of comments explaining why we're reading a value and immediately overwriting it.
- Simplify allocateChunkedArray and add CHANGES entry. Also, I decreased the depth of facet labels used in TestTaxonomyCombined::testThousandsOfCategories, to reduce execution time.

@msfroh (Contributor Author) commented Jan 20, 2024

> @msfroh - I left just one more nitpicky comment. Can you also add a CHANGES entry? I would be happy to merge the PR after.

Done!

I also tweaked the bounds for testThousandsOfCategories, since it would take ~3.5 seconds on my (relatively powerful) work computer with a chain length of 3-6. Dropping to a length of 2-4 cut the time to 1.25 seconds.

@stefanvodita (Contributor):

Thank you! I think the PR is in good shape. I'll leave it up for a couple days and then merge if no one else comments.

stefanvodita merged commit 2a0b7f2 into apache:main on Jan 22, 2024. 4 checks passed.
stefanvodita pushed a commit that referenced this pull request Jan 22, 2024