-
Notifications
You must be signed in to change notification settings - Fork 975
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Split taxonomy arrays across chunks #12995
Conversation
lucene/facet/src/java/org/apache/lucene/facet/taxonomy/directory/TaxonomyIndexArrays.java
Outdated
Show resolved
Hide resolved
lucene/facet/src/java/org/apache/lucene/facet/taxonomy/directory/TaxonomyIndexArrays.java
Outdated
Show resolved
Hide resolved
} | ||
|
||
tr.close(); | ||
indexDir.close(); | ||
} | ||
|
||
private static void assertArrayEquals(int[] expected, ParallelTaxonomyArrays.IntArray actual) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm wondering if it's worth implementing equals
for ChunkedArray
. No strong opinion one way or the other.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This was the only place where we assert anything about the full contents of the ChunkedArray
. I would lean toward not adding equals
to ChunkedArray
, since it's not something we would normally want to call, IMO. (In general, it's going to be pretty expensive.)
if (copyFrom.initializedChildren) { | ||
initChildrenSiblings(copyFrom); | ||
} | ||
} | ||
|
||
private static int[][] allocateChunkedArray(int size) { | ||
int chunkCount = size / CHUNK_SIZE + 1; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Should we do the +1
only if lastChunkSize != 0
?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
In practice, we always have size >= 1
(because the TaxonomyWriter
writes the root ordinal on startup) -- but relying on that behavior may bite us. I'll add a check for the size == 0
case.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I should have been more detailed in my comment. I didn't think the size == 0
case specifically would be an issue. I was wondering more generally if we really want the empty array at the end in cases where size
is a multiple of CHUNK_SIZE
(i.e. lastChunkSize == 0
). And I think it's reasonable to do it that way - it sticks with this contract of having the last array be mutable in a way that the others aren't.
My preferred solution right now would be to remove the special case for size == 0
and remove the other if condition in the method, since it's never not met.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I missed this thread that was still pending. @msfroh - what do you think, can we simplify this method?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Got it!
I handled the multiple of CHUNK_SIZE
case without allocating an empty array. I also added a unit test that specifically flexes ChunkedIntArray.
(And I cleaned up a few more cases where I was still dividing/moduloing instead of bit-shifting and masking.)
lucene/facet/src/java/org/apache/lucene/facet/taxonomy/directory/TaxonomyIndexArrays.java
Outdated
Show resolved
Hide resolved
lucene/facet/src/java/org/apache/lucene/facet/taxonomy/directory/TaxonomyIndexArrays.java
Show resolved
Hide resolved
lucene/facet/src/java/org/apache/lucene/facet/taxonomy/directory/TaxonomyIndexArrays.java
Outdated
Show resolved
Hide resolved
lucene/facet/src/java/org/apache/lucene/facet/taxonomy/directory/TaxonomyIndexArrays.java
Outdated
Show resolved
Hide resolved
lucene/facet/src/java/org/apache/lucene/facet/taxonomy/directory/TaxonomyIndexArrays.java
Outdated
Show resolved
Hide resolved
lucene/facet/src/java/org/apache/lucene/facet/taxonomy/directory/TaxonomyIndexArrays.java
Outdated
Show resolved
Hide resolved
As far as testing, can we add some unit tests that allocate more than one chunk and exercise the new functionality? |
I can take care of the first part. @stefanvodita, do you mind running the Lucene benchmarks against this change to see how it performs? |
The results are in. I don't see any significant p-values (< 0.05).
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@msfroh - I left just one more nitpicky comment. Can you also add a CHANGES entry? I would be happy to merge the PR after.
int chunkCount = size >> CHUNK_SIZE_BITS; | ||
int fullChunkCount; | ||
int lastChunkSize = size & CHUNK_MASK; | ||
if (lastChunkSize == 0) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thank you for persisting while we're iterating over this method.
Since fullChunkCount
is assigned chunkCount
on both branches, why not do this:
int fullChunkCount = chunkCount;
if (lastChunkSize != 0) {
chunkCount++;
}
On a higher level, I think I still wasn't specific enough in my previous comment. I didn't mind that we would sometimes have an empty array at the end if size
was a multiple of CHUNK_SIZE
, but we had if-statements that didn't seem to me like they were adding something. In this case, I prefer spending those extra bytes if we can make the code simpler. If you think the implementation we already have is better, we can keep it, but here is my preferred solution written out if you want to consider it:
private static int[][] allocateChunkedArray(int size, int startFrom) {
int chunkCount = (size >> CHUNK_SIZE_BITS) + 1;
int[][] array = new int[chunkCount][];
for (int i = startFrom; i < chunkCount - 1; i++) {
array[i] = new int[CHUNK_SIZE];
}
array[chunkCount - 1] = new int[size & CHUNK_MASK];
return array;
}
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Oh, that's much cleaner, thank you!
Taxonomy ordinals are added in an append-only way. Instead of reallocating a single big array when loading new taxonomy ordinals and copying all the values from the previous arrays over individually, we can keep blocks of ordinals and reuse blocks from the previous arrays.
Especially avoid allocating new chunks that we're going to immediately disard.
Avoid redundant lookup of parent ordinal. Also added a couple of comments explaining why we're reading a value and immediately overwriting it.
Simplify allocateChunkedArray and add CHANGES entry. Also, I decreased the depth of facet labels used in TestTaxonomyCombined::testThousandsOfCategories, to reduce execution time.
35442aa
to
136a56e
Compare
Done! I also tweaked the bounds for |
Thank you! I think the PR is in good shape. I'll leave it up for a couple days and then merge if no one else comments. |
Taxonomy ordinals are added in an append-only way. Instead of reallocating a single big array when loading new taxonomy ordinals and copying all the values from the previous arrays over individually, we can keep blocks of ordinals and reuse blocks from the previous arrays.
Description
Taxonomy ordinals are added in an append-only way.
Instead of reallocating a single big array when loading new taxonomy ordinals and copying all the values from the previous arrays over individually, we can keep blocks of ordinals and reuse blocks from the previous arrays.
Resolves #12989