Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

ARROW-14718: [Java] loadValidityBuffer should avoid allocating memory when input is not null and there are only null or non-null values #11709

Closed
wants to merge 6 commits into from

Conversation

sunchao
Copy link
Member

@sunchao sunchao commented Nov 16, 2021

Currently in BitVectorHelper.loadValidityBuffer, we always allocate memory when the source vector contains only null or non-null values. However, as the format also allows allocating validity buffer even if all values are null or not-null, the method should also consider whether the input validity buffer is null or not, and avoiding allocating new buffer when it is latter.

@github-actions
Copy link

@github-actions
Copy link

⚠️ Ticket has not been started in JIRA, please click 'Start Progress'.

@sunchao
Copy link
Member Author

sunchao commented Nov 16, 2021

cc @liyafan82 @emkornfield could you help reviewing this? thanks.

@liyafan82
Copy link
Contributor

cc @liyafan82 @emkornfield could you help reviewing this? thanks.

@sunchao The change looks reasonable. Let's see if the checks pass after rerunning.

@emkornfield
Copy link
Contributor

@sunchao it looks like the CI failures might be legitimate? Not sure if this is exposing bugs in underlying code or if this is the cause.

@@ -329,7 +329,8 @@ public static ArrowBuf loadValidityBuffer(final ArrowFieldNode fieldNode,
final int valueCount = fieldNode.getLength();
ArrowBuf newBuffer = null;
/* either all NULLs or all non-NULLs */
if (fieldNode.getNullCount() == 0 || fieldNode.getNullCount() == valueCount) {
if (sourceValidityBuffer == null && (fieldNode.getNullCount() == 0 ||
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

the sourceValidity buffer will can only be null if getNullCount == 0. It isn't clear to me why this code chose to always allocate a new buffer in this case as opposed to reusing the available one(I would have to trace it).

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

sorry to be clearer. The spec only allows for the bitmap to be null if all values are present, not if all values are null.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks @emkornfield , yea I missed that. Let me fix it.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It turned out this is because the ORC adapter wraps null validity buffer from native side with OrcMemoryJniWrapper, which then gets converted into ArrowBuf with memoryAddress equals to 0 (i.e., it wraps a null buffer).

With the change in this PR, the Java side will think the ArrowBuf is valid and thus will skip allocating new buffer. However it is actually invalid and thus will cause a series of issues.

I added an additional check of the ArrowBuf's memory address, to make sure it is indeed a null validity buffer.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

OK, found another issue with IPC between C++ and Java. In this case even though the source validity buffer is null from the C++ side, when deserializing it from the IPC RecordBatch, it is just a slice derived from the original message body (the IPC format seems to require validity buffer to always be present, even if all values are not-null in the Arrow array). Therefore, it won't be null or having a null memory address.

To fix, I changed the check to be:

sourceValidityBuffer == null || sourceValidityBuffer.capacity() == 0;

The latter indicates that the buffer is empty.

@sunchao sunchao marked this pull request as draft December 8, 2021 21:48
@sunchao sunchao marked this pull request as ready for review December 9, 2021 20:16
@sunchao
Copy link
Member Author

sunchao commented Dec 9, 2021

The CI is green now. @emkornfield @liyafan82 could you review this again? thanks!

fieldNode = new ArrowFieldNode(1024, numNulls);
try (ArrowBuf src = allocator.buffer(128)) {
for (int i = 0; i < numNulls; i++) {
BitVectorHelper.setBit(src, i);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Since the buffer content is undefined, the other bits need to be set explicitly?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sure, I can zero out the buffer first.

BitVectorHelper.setBit(src, i);
}
try (ArrowBuf dst = BitVectorHelper.loadValidityBuffer(fieldNode, src, allocator)) {
assertEquals(128, allocator.getAllocatedMemory());
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We can assert that src and dst refer to the same underlying buffer?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We can't compare by reference since dst is a different ArrowBuf object (see ReferenceManager.retain). I added a comparison on memory address of these two.

Copy link
Contributor

@liyafan82 liyafan82 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM

@sunchao sunchao closed this in 902b541 Dec 14, 2021
@sunchao sunchao deleted the ARROW-14718 branch December 14, 2021 18:37
@sunchao
Copy link
Member Author

sunchao commented Dec 14, 2021

Merged to master, thanks @liyafan82 and @emkornfield for the review!

@ursabot
Copy link

ursabot commented Dec 14, 2021

Benchmark runs are scheduled for baseline = 36367fe and contender = 902b541. 902b541 is a master commit associated with this PR. Results will be available as each benchmark for each run completes.
Conbench compare runs links:
[Finished ⬇️0.0% ⬆️0.0%] ec2-t3-xlarge-us-east-2
[Finished ⬇️0.0% ⬆️0.0%] ursa-i9-9960x
[Finished ⬇️0.4% ⬆️0.04%] ursa-thinkcentre-m75q
Supported benchmarks:
ec2-t3-xlarge-us-east-2: Supported benchmark langs: Python. Runs only benchmarks with cloud = True
ursa-i9-9960x: Supported benchmark langs: Python, R, JavaScript
ursa-thinkcentre-m75q: Supported benchmark langs: C++, Java

@jorgecarleitao
Copy link
Member

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

5 participants