
Split taxonomy arrays across chunks #12995

Merged (7 commits) on Jan 22, 2024

Conversation

@msfroh (Contributor) commented Jan 6, 2024

Description

Taxonomy ordinals are added in an append-only way.

Instead of reallocating a single big array when loading new taxonomy ordinals and copying all the values from the previous arrays over individually, we can keep blocks of ordinals and reuse blocks from the previous arrays.

Resolves #12989
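For readers who want to see the shape of the idea, here is a minimal, hedged sketch of chunked growth. It is not the PR's actual implementation: the class name, the grow/get helpers, and the chunk size (CHUNK_SIZE_BITS = 13) are illustrative assumptions. It only demonstrates the technique from the description: full chunks from the previous array are reused as-is, and only the new tail is allocated.

public final class ChunkedGrowthSketch {
  private static final int CHUNK_SIZE_BITS = 13;              // assumed value for illustration
  private static final int CHUNK_SIZE = 1 << CHUNK_SIZE_BITS; // 8192
  private static final int CHUNK_MASK = CHUNK_SIZE - 1;

  /** Grows a chunked array from oldSize to newSize (append-only, so newSize >= oldSize). */
  static int[][] grow(int[][] oldChunks, int oldSize, int newSize) {
    int newChunkCount = (newSize >> CHUNK_SIZE_BITS) + 1;
    int[][] newChunks = new int[newChunkCount][];
    int fullOldChunks = oldSize >> CHUNK_SIZE_BITS; // chunks that are already full and never rewritten
    // Share the full chunks with the previous array instead of copying their contents.
    System.arraycopy(oldChunks, 0, newChunks, 0, fullOldChunks);
    // Allocate the remaining chunks; only the last one may be partially sized.
    for (int i = fullOldChunks; i < newChunkCount - 1; i++) {
      newChunks[i] = new int[CHUNK_SIZE];
    }
    newChunks[newChunkCount - 1] = new int[newSize & CHUNK_MASK];
    // Copy over the old partially filled chunk, if there was one.
    if (fullOldChunks < oldChunks.length && oldChunks[fullOldChunks] != null) {
      int[] partial = oldChunks[fullOldChunks];
      System.arraycopy(partial, 0, newChunks[fullOldChunks], 0, partial.length);
    }
    return newChunks;
  }

  /** Reads an ordinal's value: chunk index via shift, offset within the chunk via mask. */
  static int get(int[][] chunks, int ordinal) {
    return chunks[ordinal >> CHUNK_SIZE_BITS][ordinal & CHUNK_MASK];
  }
}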

    }

    tr.close();
    indexDir.close();
  }

  private static void assertArrayEquals(int[] expected, ParallelTaxonomyArrays.IntArray actual) {
Contributor:

I'm wondering if it's worth implementing equals for ChunkedArray. No strong opinion one way or the other.

Contributor Author:

This was the only place where we assert anything about the full contents of the ChunkedArray. I would lean toward not adding equals to ChunkedArray, since it's not something we would normally want to call, IMO. (In general, it's going to be pretty expensive.)
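To make the alternative concrete, here is a hedged sketch of what the test-side comparison could look like, kept in the test instead of adding equals() to ChunkedArray. It assumes ParallelTaxonomyArrays.IntArray exposes length() and get(int) and that JUnit 4 assertions are in scope, as in Lucene's test framework; it is an illustration, not the PR's exact code.

  private static void assertArrayEquals(int[] expected, ParallelTaxonomyArrays.IntArray actual) {
    // Compare lengths first, then every element; this avoids a ChunkedArray.equals()
    // that would rarely be wanted elsewhere and would be expensive in general.
    assertEquals(expected.length, actual.length());
    for (int i = 0; i < expected.length; i++) {
      assertEquals("mismatch at ordinal " + i, expected[i], actual.get(i));
    }
  }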

    if (copyFrom.initializedChildren) {
      initChildrenSiblings(copyFrom);
    }
  }

  private static int[][] allocateChunkedArray(int size) {
    int chunkCount = size / CHUNK_SIZE + 1;
Contributor:

Should we do the +1 only if lastChunkSize != 0?

Contributor Author:

In practice, we always have size >= 1 (because the TaxonomyWriter writes the root ordinal on startup) -- but relying on that behavior may bite us. I'll add a check for the size == 0 case.

Contributor:

I should have been more detailed in my comment. I didn't think the size == 0 case specifically would be an issue. I was wondering more generally if we really want the empty array at the end in cases where size is a multiple of CHUNK_SIZE (i.e. lastChunkSize == 0). And I think it's reasonable to do it that way - it sticks with this contract of having the last array be mutable in a way that the others aren't.

My preferred solution right now would be to remove the special case for size == 0 and remove the other if condition in the method, since that condition is always met.

Contributor:

I missed this thread that was still pending. @msfroh - what do you think, can we simplify this method?

Contributor Author:

Got it!

I handled the multiple of CHUNK_SIZE case without allocating an empty array. I also added a unit test that specifically flexes ChunkedIntArray.

(And I cleaned up a few more cases where I was still dividing/moduloing instead of bit-shifting and masking.)
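For reference, a small sketch of the divide/modulo vs. shift/mask equivalence being used here; the constant values are illustrative assumptions, and the equivalence holds only because CHUNK_SIZE is a power of two and ordinals are non-negative.

final class ChunkAddressing {
  static final int CHUNK_SIZE_BITS = 13;              // assumed value for illustration
  static final int CHUNK_SIZE = 1 << CHUNK_SIZE_BITS; // 8192
  static final int CHUNK_MASK = CHUNK_SIZE - 1;       // 0x1FFF

  // Equivalent to ordinal / CHUNK_SIZE for ordinal >= 0: which chunk holds the ordinal.
  static int chunkIndex(int ordinal) {
    return ordinal >> CHUNK_SIZE_BITS;
  }

  // Equivalent to ordinal % CHUNK_SIZE for ordinal >= 0: the offset within that chunk.
  static int indexInChunk(int ordinal) {
    return ordinal & CHUNK_MASK;
  }
}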

@stefanvodita (Contributor):

As far as testing, can we add some unit tests that allocate more than one chunk and exercise the new functionality?
Should we also run some benchmarks to understand if there's any sort of performance regression?

@msfroh (Contributor Author) commented Jan 9, 2024

> As far as testing, can we add some unit tests that allocate more than one chunk and exercise the new functionality? Should we also run some benchmarks to understand if there's any sort of performance regression?

I can take care of the first part.

@stefanvodita, do you mind running the Lucene benchmarks against this change to see how it performs?

@stefanvodita (Contributor):

The results are in. I don't see any significant p-values (< 0.05).
python3 src/python/localrun.py -source wikimediumall -r

                            TaskQPS baseline      StdDevQPS my_modified_version      StdDev                Pct diff p-value
                       MedPhrase       27.55      (7.2%)       27.08      (5.8%)   -1.7% ( -13% -   12%) 0.408
                          IntNRQ       19.20     (10.6%)       18.89      (7.1%)   -1.6% ( -17% -   18%) 0.566
           BrowseMonthTaxoFacets        9.67      (8.9%)        9.52     (10.7%)   -1.6% ( -19% -   19%) 0.613
     BrowseRandomLabelTaxoFacets        4.01     (25.1%)        3.97     (26.0%)   -1.0% ( -41% -   66%) 0.902
                HighSloppyPhrase       28.64      (3.5%)       28.41      (2.4%)   -0.8% (  -6% -    5%) 0.406
                      AndHighLow      325.06      (3.8%)      322.67      (7.0%)   -0.7% ( -11% -   10%) 0.680
     BrowseRandomLabelSSDVFacets        2.92      (1.8%)        2.90      (2.6%)   -0.6% (  -4% -    3%) 0.403
       BrowseDayOfYearSSDVFacets        4.08      (3.8%)        4.06      (3.7%)   -0.6% (  -7% -    7%) 0.625
                         Prefix3      142.86      (5.0%)      142.32      (5.4%)   -0.4% ( -10% -   10%) 0.819
            HighTermTitleBDVSort        7.40      (6.3%)        7.38      (6.4%)   -0.3% ( -12% -   13%) 0.862
                 LowSloppyPhrase       11.47      (1.9%)       11.44      (1.7%)   -0.3% (  -3% -    3%) 0.657
                     LowSpanNear        6.86      (2.0%)        6.85      (1.8%)   -0.2% (  -3% -    3%) 0.754
                      HighPhrase       67.07      (3.1%)       66.94      (3.1%)   -0.2% (  -6% -    6%) 0.850
                       LowPhrase        7.77      (3.2%)        7.76      (3.4%)   -0.1% (  -6% -    6%) 0.888
                    HighSpanNear        3.81      (3.0%)        3.80      (2.8%)   -0.1% (  -5% -    5%) 0.875
            HighIntervalsOrdered        1.61      (2.5%)        1.61      (2.4%)   -0.1% (  -4% -    4%) 0.853
                      AndHighMed       76.92      (6.7%)       76.83      (8.0%)   -0.1% ( -13% -   15%) 0.959
                 MedSloppyPhrase       15.30      (1.9%)       15.29      (1.7%)   -0.1% (  -3% -    3%) 0.850
                      OrHighHigh       20.33      (7.3%)       20.31      (6.4%)   -0.1% ( -12% -   14%) 0.961
             LowIntervalsOrdered        7.12      (2.5%)        7.12      (2.4%)   -0.0% (  -4% -    4%) 0.955
         AndHighMedDayTaxoFacets       11.03      (2.2%)       11.04      (2.2%)    0.0% (  -4% -    4%) 0.959
                    OrNotHighMed      216.20      (2.6%)      216.28      (3.0%)    0.0% (  -5% -    5%) 0.967
             MedIntervalsOrdered       23.70      (1.8%)       23.71      (1.6%)    0.0% (  -3% -    3%) 0.944
                     MedSpanNear       13.33      (1.9%)       13.34      (1.7%)    0.1% (  -3% -    3%) 0.910
        AndHighHighDayTaxoFacets        7.73      (3.5%)        7.74      (3.2%)    0.1% (  -6% -    7%) 0.906
            MedTermDayTaxoFacets       15.49      (2.2%)       15.54      (1.6%)    0.3% (  -3% -    4%) 0.590
               HighTermTitleSort      121.92      (1.8%)      122.33      (1.6%)    0.3% (  -2% -    3%) 0.533
                      TermDTSort      142.74      (2.1%)      143.26      (3.4%)    0.4% (  -5% -    5%) 0.683
                        PKLookup      172.65      (3.0%)      173.29      (2.8%)    0.4% (  -5% -    6%) 0.687
                         Respell       43.69      (2.3%)       43.87      (2.7%)    0.4% (  -4% -    5%) 0.613
           HighTermDayOfYearSort      210.29      (2.8%)      211.25      (3.1%)    0.5% (  -5% -    6%) 0.626
                     AndHighHigh       23.87     (11.4%)       23.99     (11.7%)    0.5% ( -20% -   26%) 0.888
            BrowseDateSSDVFacets        1.19      (5.2%)        1.19      (4.2%)    0.6% (  -8% -   10%) 0.698
                        Wildcard      186.59      (2.6%)      187.68      (2.9%)    0.6% (  -4% -    6%) 0.505
                       OrHighMed       90.56      (5.3%)       91.09      (4.9%)    0.6% (  -9% -   11%) 0.715
                   OrNotHighHigh      354.86      (3.6%)      356.98      (3.9%)    0.6% (  -6% -    8%) 0.611
                   OrHighNotHigh      306.95      (3.8%)      308.88      (4.2%)    0.6% (  -7% -    9%) 0.620
                    OrHighNotMed      363.76      (3.9%)      366.46      (4.0%)    0.7% (  -6% -    8%) 0.548
            BrowseDateTaxoFacets        4.97     (27.3%)        5.00     (27.3%)    0.8% ( -42% -   76%) 0.929
       BrowseDayOfYearTaxoFacets        5.06     (26.8%)        5.09     (26.9%)    0.8% ( -41% -   74%) 0.927
                    OrHighNotLow      244.26      (5.3%)      246.17      (5.3%)    0.8% (  -9% -   12%) 0.641
           BrowseMonthSSDVFacets        4.38      (3.9%)        4.41      (3.1%)    0.8% (  -5% -    8%) 0.479
                       OrHighLow      264.51      (5.4%)      266.91      (4.5%)    0.9% (  -8% -   11%) 0.568
               HighTermMonthSort     2111.04      (1.9%)     2131.88      (3.2%)    1.0% (  -4% -    6%) 0.236
                    OrNotHighLow      543.72      (2.9%)      549.12      (2.3%)    1.0% (  -4% -    6%) 0.229
                          Fuzzy2       51.05      (2.7%)       51.61      (2.6%)    1.1% (  -4% -    6%) 0.191
                          Fuzzy1       66.48      (3.0%)       67.31      (3.1%)    1.3% (  -4% -    7%) 0.199
                         MedTerm      508.60      (7.4%)      515.34      (7.2%)    1.3% ( -12% -   17%) 0.568
                        HighTerm      399.43      (7.2%)      405.35      (7.0%)    1.5% ( -11% -   16%) 0.509
          OrHighMedDayTaxoFacets        3.92      (4.2%)        3.98      (3.4%)    1.6% (  -5% -    9%) 0.194
                         LowTerm      350.89      (8.4%)      358.77      (9.4%)    2.2% ( -14% -   21%) 0.427

@stefanvodita (Contributor) left a comment:

@msfroh - I left just one more nitpicky comment. Can you also add a CHANGES entry? I would be happy to merge the PR after.

    int chunkCount = size >> CHUNK_SIZE_BITS;
    int fullChunkCount;
    int lastChunkSize = size & CHUNK_MASK;
    if (lastChunkSize == 0) {
Contributor:

Thank you for persisting while we're iterating over this method.

Since fullChunkCount is assigned chunkCount on both branches, why not do this:

int fullChunkCount = chunkCount;
if (lastChunkSize != 0) {
  chunkCount++;
}

On a higher level, I think I still wasn't specific enough in my previous comment. I didn't mind that we would sometimes have an empty array at the end if size was a multiple of CHUNK_SIZE, but we had if-statements that didn't seem to me like they were adding something. In this case, I prefer spending those extra bytes if we can make the code simpler. If you think the implementation we already have is better, we can keep it, but here is my preferred solution written out if you want to consider it:

private static int[][] allocateChunkedArray(int size, int startFrom) {
    int chunkCount = (size >> CHUNK_SIZE_BITS) + 1;
    int[][] array = new int[chunkCount][];
    for (int i = startFrom; i < chunkCount - 1; i++) {
        array[i] = new int[CHUNK_SIZE];
    }
    array[chunkCount - 1] = new int[size & CHUNK_MASK];
    return array;
}
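As a quick illustration of the shapes this version produces (assuming CHUNK_SIZE = 8192 and startFrom = 0): allocateChunkedArray(5, 0) yields { int[5] }, allocateChunkedArray(10000, 0) yields { int[8192], int[1808] }, and allocateChunkedArray(8192, 0) yields { int[8192], int[0] }. An exact multiple of CHUNK_SIZE simply ends with an empty trailing chunk, which preserves the contract that only the last chunk is partial, without any extra branching.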

Contributor Author:

Oh, that's much cleaner, thank you!

Commits:

- Taxonomy ordinals are added in an append-only way. Instead of reallocating a single big array when loading new taxonomy ordinals and copying all the values from the previous arrays over individually, we can keep blocks of ordinals and reuse blocks from the previous arrays.
- Especially avoid allocating new chunks that we're going to immediately discard.
- Avoid redundant lookup of parent ordinal. Also added a couple of comments explaining why we're reading a value and immediately overwriting it.
- Simplify allocateChunkedArray and add CHANGES entry. Also, I decreased the depth of facet labels used in TestTaxonomyCombined::testThousandsOfCategories, to reduce execution time.

@msfroh (Contributor Author) commented Jan 20, 2024

> @msfroh - I left just one more nitpicky comment. Can you also add a CHANGES entry? I would be happy to merge the PR after.

Done!

I also tweaked the bounds for testThousandsOfCategories, since it would take ~3.5 seconds on my (relatively powerful) work computer with a chain length of 3-6. Dropping to a length of 2-4 cut the time to 1.25 seconds.

@stefanvodita (Contributor):

Thank you! I think the PR is in good shape. I'll leave it up for a couple days and then merge if no one else comments.

stefanvodita merged commit 2a0b7f2 into apache:main on Jan 22, 2024. 4 checks passed.
stefanvodita pushed a commit that referenced this pull request Jan 22, 2024