New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
ARROW-14718: [Java] loadValidityBuffer should avoid allocating memory when input is not null and there are only null or non-null values #11709
Conversation
|
cc @liyafan82 @emkornfield could you help reviewing this? thanks. |
@sunchao The change looks reasonable. Let's see if the checks pass after rerunning. |
@sunchao it looks like the CI failures might be legitimate? Not sure if this is exposing bugs in underlying code or if this is the cause. |
@@ -329,7 +329,8 @@ public static ArrowBuf loadValidityBuffer(final ArrowFieldNode fieldNode, | |||
final int valueCount = fieldNode.getLength(); | |||
ArrowBuf newBuffer = null; | |||
/* either all NULLs or all non-NULLs */ | |||
if (fieldNode.getNullCount() == 0 || fieldNode.getNullCount() == valueCount) { | |||
if (sourceValidityBuffer == null && (fieldNode.getNullCount() == 0 || |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
the sourceValidity buffer will can only be null if getNullCount == 0. It isn't clear to me why this code chose to always allocate a new buffer in this case as opposed to reusing the available one(I would have to trace it).
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
sorry to be clearer. The spec only allows for the bitmap to be null if all values are present, not if all values are null.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks @emkornfield , yea I missed that. Let me fix it.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It turned out this is because the ORC adapter wraps null validity buffer from native side with OrcMemoryJniWrapper
, which then gets converted into ArrowBuf
with memoryAddress
equals to 0 (i.e., it wraps a null buffer).
With the change in this PR, the Java side will think the ArrowBuf
is valid and thus will skip allocating new buffer. However it is actually invalid and thus will cause a series of issues.
I added an additional check of the ArrowBuf
's memory address, to make sure it is indeed a null validity buffer.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
OK, found another issue with IPC between C++ and Java. In this case even though the source validity buffer is null from the C++ side, when deserializing it from the IPC RecordBatch
, it is just a slice derived from the original message body (the IPC format seems to require validity buffer to always be present, even if all values are not-null in the Arrow array). Therefore, it won't be null or having a null memory address.
To fix, I changed the check to be:
sourceValidityBuffer == null || sourceValidityBuffer.capacity() == 0;
The latter indicates that the buffer is empty.
The CI is green now. @emkornfield @liyafan82 could you review this again? thanks! |
fieldNode = new ArrowFieldNode(1024, numNulls); | ||
try (ArrowBuf src = allocator.buffer(128)) { | ||
for (int i = 0; i < numNulls; i++) { | ||
BitVectorHelper.setBit(src, i); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Since the buffer content is undefined, the other bits need to be set explicitly?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Sure, I can zero out the buffer first.
BitVectorHelper.setBit(src, i); | ||
} | ||
try (ArrowBuf dst = BitVectorHelper.loadValidityBuffer(fieldNode, src, allocator)) { | ||
assertEquals(128, allocator.getAllocatedMemory()); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We can assert that src
and dst
refer to the same underlying buffer?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We can't compare by reference since dst
is a different ArrowBuf
object (see ReferenceManager.retain
). I added a comparison on memory address of these two.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM
Merged to master, thanks @liyafan82 and @emkornfield for the review! |
Benchmark runs are scheduled for baseline = 36367fe and contender = 902b541. 902b541 is a master commit associated with this PR. Results will be available as each benchmark for each run completes. |
Currently in
BitVectorHelper.loadValidityBuffer
, we always allocate memory when the source vector contains only null or non-null values. However, as the format also allows allocating validity buffer even if all values are null or not-null, the method should also consider whether the input validity buffer is null or not, and avoiding allocating new buffer when it is latter.