
Clean up ByteBlockPool #12506

Merged — 11 commits merged into apache:main on Nov 6, 2023
Conversation

@stefanvodita (Contributor) commented on Aug 12, 2023

There are several changes here. The ones that change public API:

  1. Slice functionality is moved out from ByteBlockPool to its own class. There's probably more we can do here, but separating the slices from the block pool seems like a step in the right direction.
  2. There was a comment asking not to modify the buffers array outside ByteBlockPool. I've made it private instead and readable through a getter to enforce that.
  3. The various setBytesRef methods are consolidated. The offsets they worked with were all ints, so I've changed the ones that were declared as longs.

Functionality changes:

  1. Calling newSlice with a size larger than a block will throw an exception.
  2. Decoding a length larger than the remaining buffer space in setBytesRef will trip an assertion.
  3. ramBytesUsed has become simpler, but produces the same result.
  4. ByteBlockPool.Allocator.recycleByteBlocks is removed.
  5. An exception is thrown if the buffer index in ByteBlockPool.setBytesRef overflows.

Other than that, comments and javadocs are updated or expanded.

Closes #6675.
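For context, the slice functionality moving into ByteSlicePool allocates slices of increasing size as a stream grows. A minimal sketch of that growth policy — the level arrays mirror Lucene's scheme as I recall it, so treat the exact values as an assumption:

```java
// Sketch of the slice growth policy behind ByteSlicePool (values assumed from
// Lucene's scheme): each level has a fixed slice size, and a full slice is
// continued by a new slice at the next level, capping out at the top level.
class SliceLevels {
    // Slice size in bytes for each level.
    static final int[] LEVEL_SIZE = {5, 14, 20, 30, 40, 40, 80, 80, 120, 200};
    // Level of the continuation slice for each level; the last level repeats.
    static final int[] NEXT_LEVEL = {1, 2, 3, 4, 5, 6, 7, 8, 9, 9};

    /** Total bytes occupied by the first n slices of a single stream. */
    static int sliceSizeSum(int n) {
        int level = 0;
        int total = 0;
        for (int i = 0; i < n; i++) {
            total += LEVEL_SIZE[level];
            level = NEXT_LEVEL[level];
        }
        return total;
    }
}
```

A stream that stays small pays only a 5-byte slice, while long streams converge to 200-byte slices, which keeps many short interleaved streams compact.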

* Resets the pool to its initial state, reusing the first buffer and filling all buffers with
* {@code 0} bytes before they are reused or passed to {@link Allocator#recycleByteBlocks(byte[][],
* int, int)}. Calling {@link ByteBlockPool#nextBuffer()} is not needed after reset.
stefanvodita (Contributor, Author):

I wonder if we really need to zero out the buffers that will be recycled. Maybe it makes more sense to have the recycling method do it if necessary?

Member:

+1 to make this cleaner (who zeros). Why does byte slicing even require pre-zero'd buffers?

Maybe open a follow-on issue for this? This change is already great.

stefanvodita (Contributor, Author):

Right, there's enough going on in this PR already. I opened an issue to look into this separately: #12734
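The zeroing question above comes down to the in-band end-of-slice convention. A sketch of how I understand it (an assumption from memory of TermsHashPerField, not the actual code): payload bytes in a fresh slice are all zero and the last byte holds a non-zero level marker, so a writer detects the slice boundary by hitting a non-zero byte — which only works if blocks start out zero-filled.

```java
// Sketch of the in-band slice-boundary convention (assumed, not Lucene's
// actual code): a slice carved from a zero-filled block gets a non-zero level
// marker in its final byte; everything before that marker is writable payload.
class SliceMarkers {
    /** Mark the end of a slice; the encoding (assumed) keeps the marker non-zero. */
    static void markSliceEnd(byte[] block, int sliceStart, int sliceSize, int level) {
        block[sliceStart + sliceSize - 1] = (byte) (16 | level);
    }

    /** Bytes writable before the boundary marker is hit. */
    static int usableBytes(byte[] block, int sliceStart) {
        int i = sliceStart;
        while (block[i] == 0) {
            i++;
        }
        return i - sliceStart;
    }
}
```

Under this convention, a recycled block that still contained stale non-zero data would be misread as a slice boundary, which would explain the zero-filling on reset.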

@stefanvodita (Contributor, Author):

I noticed the failing checks on this PR, but I haven't been able to reproduce them. They appear to be related to the nested javadoc tags I had introduced. I've removed them now; hopefully that satisfies the checks.

1. Move slice functionality to a separate class, ByteSlicePool.
2. Add exception case if the requested slice size is larger than the
   block size.
3. Make pool buffers private instead of relying on a comment asking not to modify them.
4. Consolidate setBytesRef methods with int offsets.
5. Simplify ramBytesUsed.
6. Update and expand comments and javadoc.
@stefanvodita (Contributor, Author):

The last commit is a large rebase + conflict resolution after #12625 got merged. What this PR does hasn't really changed.

@mikemccand (Member):

Thanks @stefanvodita -- I'll try to have a look soon! And thank you for gracefully handling the "two people made very similar changes" situation :)

This happens often in open source, but it's actually a good thing, since you get two very different perspectives and the final solution is the best of both.

@mikemccand (Member) left a comment:

This looks great, thanks @stefanvodita! I left a few minor comments...

I wonder if this will impact indexing performance ... I don't think we need to test before pushing, but let's watch the nightly run after this is merged to see if there was any impact?

@@ -46,6 +65,7 @@ protected Allocator(int blockSize) {

public abstract void recycleByteBlocks(byte[][] blocks, int start, int end);

// TODO: This is not used. Can we remove?
Member:

+1 to remove now! This is an internal API -- we are free to suddenly change it.

*/
void setBytesRef(BytesRefBuilder builder, BytesRef result, long offset, int length) {
void setBytesRef(BytesRefBuilder builder, BytesRef result, int offset, int length) {
Member:

Hmm why the change from long -> int? Previously we were able to address 32 + BYTE_BLOCK_SHIFT bits of address space using long? Or did that fail to work somewhere?

stefanvodita (Contributor, Author):

I missed that the purpose of the long was to increase the address space. I'll change this back.

result.length = length;

int bufferIndex = (int) (offset >> BYTE_BLOCK_SHIFT);
Member:

If we stick with long we should change this to Math.toIntExact to catch overflow?

stefanvodita (Contributor, Author):

Good idea!
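The overflow-checked index computation agreed on above could look like this sketch (BYTE_BLOCK_SHIFT = 15, i.e. 32 KB blocks, is Lucene's constant as I recall; treat the value as an assumption):

```java
// Sketch of the overflow-safe buffer index computation: keep long offsets for
// the larger address space, but fail loudly if the derived index exceeds int.
class BufferIndex {
    static final int BYTE_BLOCK_SHIFT = 15; // assumed: 32 KB blocks

    static int bufferIndex(long offset) {
        // Math.toIntExact throws ArithmeticException on overflow, unlike a
        // plain (int) cast, which would silently truncate.
        return Math.toIntExact(offset >> BYTE_BLOCK_SHIFT);
    }
}
```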

size += RamUsageEstimator.shallowSizeOf(buffers);
for (byte[] buf : buffers) {
if (buf == buffer) {
Member:

Why did the previous code special case buffer? Isn't buffer a full sized block?

stefanvodita (Contributor, Author):

I haven't figured out why buffer was a special case. Even if it were not a full block, would that make a difference if we're calling RamUsageEstimator#sizeOfObject on it in both cases?
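For reference, the simplified accounting with no special case amounts to something like this sketch (the helper below stands in for RamUsageEstimator, which is assumed away here):

```java
// Sketch of ramBytesUsed with no special case for the current buffer: the
// shallow size of the buffers array plus every allocated block, uniformly.
class PoolRam {
    /** buffers may have trailing nulls for slots not yet allocated. */
    static long ramBytesUsed(byte[][] buffers, long shallowArraySize) {
        long size = shallowArraySize;
        for (byte[] buf : buffers) {
            if (buf != null) {
                size += buf.length; // stand-in for RamUsageEstimator sizing
            }
        }
        return size;
    }
}
```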


/**
* Class that Posting and PostingVector use to write byte streams into shared fixed-size byte[]
* arrays. The idea is to allocate slices of increasing lengths For example, the first slice is 5
Member:

Period after lengths -- of increasing lengths. For example,

package org.apache.lucene.util;

/**
* Class that Posting and PostingVector use to write byte streams into shared fixed-size byte[]
Member:

Maybe say write interleaved byte streams?

*
* @lucene.internal
*/
public class ByteSlicePool {
Member:

final?

Contributor:

maybe move this class to org.apache.lucene.index and be package private?

public class ByteSlicePool {
/**
* The underlying structure consists of fixed-size blocks. We overlay variable-length slices on
* top. Each slice is contiguous in memory, i.e. it does not strddle multiple blocks.
Member:

strddle -> straddle

import org.apache.lucene.tests.util.TestUtil;

public class TestByteSlicePool extends LuceneTestCase {
public void testAllocKnowSizeSlice() {
Member:

Know -> Known?

}
}

public void testAllocLargeSlice() {
Member:

Could we maybe add a randomized test that writes a random number of random length interleaves streams, then reads them back and confirms the streams match?
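A toy version of that check, written against a self-contained mini pool rather than the real ByteSlicePool API (every name below is invented for the sketch; the real test would go through the pool's slice methods and LuceneTestCase's random()):

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Minimal stand-in for a slice pool: multiple streams carve fixed-size slices
// out of one shared block, so slices of different streams interleave in memory.
// Reading a stream back walks its slices in order and reassembles the bytes.
class MiniSlicePool {
    private static final int SLICE_SIZE = 8; // fixed here; the real pool grows slices
    private byte[] block = new byte[64];
    private int next = 0; // next free position in the shared block
    private final List<List<Integer>> sliceStarts = new ArrayList<>();
    private final List<Integer> usedInLast = new ArrayList<>();

    int newStream() {
        sliceStarts.add(new ArrayList<>());
        usedInLast.add(SLICE_SIZE); // forces a slice allocation on first write
        return sliceStarts.size() - 1;
    }

    void write(int stream, byte b) {
        int used = usedInLast.get(stream);
        if (used == SLICE_SIZE) { // current slice is full: carve a new one
            if (next + SLICE_SIZE > block.length) {
                block = Arrays.copyOf(block, block.length * 2);
            }
            sliceStarts.get(stream).add(next);
            next += SLICE_SIZE;
            used = 0;
        }
        List<Integer> starts = sliceStarts.get(stream);
        block[starts.get(starts.size() - 1) + used] = b;
        usedInLast.set(stream, used + 1);
    }

    byte[] readAll(int stream) {
        List<Integer> starts = sliceStarts.get(stream);
        if (starts.isEmpty()) {
            return new byte[0];
        }
        int total = (starts.size() - 1) * SLICE_SIZE + usedInLast.get(stream);
        byte[] out = new byte[total];
        for (int i = 0; i < starts.size(); i++) {
            int len = (i == starts.size() - 1) ? usedInLast.get(stream) : SLICE_SIZE;
            System.arraycopy(block, starts.get(i), out, i * SLICE_SIZE, len);
        }
        return out;
    }
}
```

The randomized version would pick a random stream and a random write length each iteration, mirror every byte into a shadow list per stream, and assert the pool's read-back equals the shadow copy.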

@iverase (Contributor) commented Oct 24, 2023

I like the introduction of ByteSlicePool, but I wonder if the naming is right: it does not feel like a generic slicer class but one very tied to the format used by TermsHashPerField. I'm just wondering if we can have a more descriptive name. (That is one of the reasons I kept it inside TermsHashPerField.)

If we keep this class, let's remove the static method I introduced.

@stefanvodita (Contributor, Author):

Thanks for the review @iverase! I’m putting together a new revision. Do you have a name suggestion for ByteSlicePool? I don’t really have a better idea. By putting it in a separate class, I was also trying to account for the possibility that it would find other uses in the future, but maybe that's not likely.

@stefanvodita (Contributor, Author):

I’ve integrated most of the suggestions. There’s just the matter of the name for the slice pool class and the right package for it. I don’t have a strong opinion on this. Maybe we move it to org.apache.lucene.index and keep the name. It goes well with ByteSliceReader.

@mikemccand (Member):

+1 to move to oal.index, and make it package private if possible? ByteSlicePool name sounds good to me :) Naming is the hardest part!

@stefanvodita (Contributor, Author):

Thanks @mikemccand! I just pushed a commit that does that move.

@stefanvodita (Contributor, Author):

@mikemccand @iverase - what do you think, is this PR ready?

@iverase (Contributor) left a comment:

changes LGTM

@mikemccand (Member) left a comment:

Looks great, thank you @stefanvodita for giving some love to this long-neglected, quite hairy class.

I'll merge today.

@mikemccand mikemccand merged commit 6aed68f into apache:main Nov 6, 2023
4 checks passed
mikemccand pushed a commit that referenced this pull request Nov 6, 2023
* Clean up ByteBlockPool

1. Move slice functionality to a separate class, ByteSlicePool.
2. Add exception case if the requested slice size is larger than the
   block size.
3. Make pool buffers private instead of the comment asking not to modify it.
4. Consolidate setBytesRef methods with int offsets.
5. Simplify ramBytesUsed.
6. Update and expand comments and javadoc.

* Revert to long offsets in ByteBlockPool; Clean-up

* Remove slice functionality from TermsHashPerField

* First working test

* [Broken] Test interleaving slices

* Fixed randomized test

* Remove redundant tests

* Tidy

* Add CHANGES

* Move ByteSlicePool to oal.index
mikemccand added a commit that referenced this pull request Nov 6, 2023
…backport where Java 11 doesn't have this API

Successfully merging this pull request may close these issues.

ByteBlockPool's documentation is completely useless [LUCENE-5613]
3 participants