Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[Java] The offset buffer of empty BaseVariableWidthVector should not be empty when exposed through C Data Interface #40038

Closed
viirya opened this issue Feb 11, 2024 · 18 comments · Fixed by #40043

Comments

@viirya
Copy link
Member

viirya commented Feb 11, 2024

Describe the bug, including details regarding any error messages, version, and platform.

We encountered an error when exchanging string array from Java to Rust through Arrow C data interface. At Rust side, it complains that the buffer at position 1 (offset buffer) is null. After tracing down and some debugging, it looks like the issue is Java Arrow BaseVariableWidthVector class assigns an empty offset buffer if the array is empty (value count 0).

According to Arrow spec for variable size binary layout:

The offsets buffer contains length + 1 signed integers ...

So for an empty string array, its offset buffer should be a buffer with one element (generally it is 0).

Component(s)

Java

viirya added a commit to viirya/arrow that referenced this issue Feb 11, 2024
@vibhatha
Copy link
Collaborator

@viirya Thanks for reporting this issue. Planning to make a PR?

@viirya
Copy link
Member Author

viirya commented Feb 12, 2024

@vibhatha Yea, I'm working on a fix locally. But it causes a few tests failed now. Still looking into fixing the tests.

@vibhatha
Copy link
Collaborator

Wonderful!

@viirya
Copy link
Member Author

viirya commented Feb 12, 2024

It looks like both cases are acceptable (empty or one single zero value element):

https://lists.apache.org/thread/w7g1zfqrjxx0bvrct0mt5zwxvdnc9nob

Close this for now.

@viirya viirya closed this as completed Feb 12, 2024
@viirya viirya reopened this Feb 12, 2024
@viirya
Copy link
Member Author

viirya commented Feb 12, 2024

Per more discussions in the PR, we probably need to fix C data interface of Java Arrow to properly export empty offset buffer for var-size arrays.

viirya added a commit to viirya/arrow that referenced this issue Feb 12, 2024
… variable-size layout should not be empty"

This reverts commit 5eb34e1.
@vibhatha
Copy link
Collaborator

Is there a sample code that we could use to reproduce the issue and probably look for a fix?

@viirya
Copy link
Member Author

viirya commented Feb 13, 2024

Hmm, the code producing the issue is complicated. We execute TPCDS query in Spark/DataFusion and pass the results as Arrow vectors through C Data interface to Rust (arrow-rs).

I saw there are some tests in c module RoundtripTest . I'm looking to have a reproducer test there. But I cannot build c module on Mac...

@viirya
Copy link
Member Author

viirya commented Feb 13, 2024

Ah, after deeper debugging today, I found we were misled by sun.misc.Unsafe.allocateMemory API document.

Allocates a new block of native memory, of the given size in bytes. ... The resulting native pointer will never be zero,...

But as I debugged it today, it actually returns zero value pointer. So it causes the Rust arrow think of it a NULL pointer.

And, there is an update to the API document:
openjdk/jdk@209d87a

It is changed to:

The resulting native pointer will be zero if and only if the requested size is zero.

Above is easily to verify by a simple test in vector module, for example:

 @Test
  public void testEmptyBuffer() throws Exception {
    try (ListVector empty = ListVector.empty("empty", allocator)) {
      Assert.assertTrue("memory shouldn't zero", empty.getOffsetBuffer().memoryAddress() != 0);
    }
  }

It will pass in default run, but fail in unsafe run.

So once UnsafeAllocationManager is used , these empty buffers could be treated as NULL pointers actually when they are exported.

cc @lidavidm @pitrou

@viirya
Copy link
Member Author

viirya commented Feb 13, 2024

Actually we have RoundtripTest in c module which tests empty ListVector. But we don't have c unit tests with unsafe memory allocation manager like vector module. So it is not caught.

@viirya
Copy link
Member Author

viirya commented Feb 13, 2024

I am going to add unsafe run for c module in the PR #40043 too.

@lidavidm
Copy link
Member

You could statically allocate a buffer in the allocator itself to represent the zero size buffer (in fact doesn't the allocator already do this?)

@viirya
Copy link
Member Author

viirya commented Feb 13, 2024

Zero size buffer? Why we need zero size buffer? Do you mean one zero value buffer?

@lidavidm
Copy link
Member

Er, weren't we talking about this?

But as I debugged it today, it actually returns zero value pointer. So it causes the Rust arrow think of it a NULL pointer.

@lidavidm
Copy link
Member

I don't see why the allocator would give NULL for an allocation of size 4 or 8...

@viirya
Copy link
Member Author

viirya commented Feb 13, 2024

Er, weren't we talking about this?

As you said, we already have it now. But for C Data Interface, as we discussed yesterday, it is not valid under C Data Interface. I think now we are going to send one value (zero) buffer instead of zero size buffer, isn't?

I don't see why the allocator would give NULL for an allocation of size 4 or 8...

To clarify, we currently send a zero size buffer (i.e, empty) buffer through C Data Interface (not size 4 or 8)

For the reason, why the allocator give NULL for an allocation of empty buffer, see my previous comment: #40038 (comment).

In short, for the empty buffer allocated in UnsafeAllocationManager, which is a NULL pointer, because MemoryUtil.UNSAFE.allocateMemory(0) returns NULL pointer for 0 bytes allocation.

@viirya
Copy link
Member Author

viirya commented Feb 13, 2024

Let me try to summarize the issue:

  1. Java Arrow uses empty buffer internally for offset buffer (valid, for historical reason). For C Data Interface, it is not valid. We are going to (should) send one value (zero) buffer at the moment.
  2. Currently the empty buffer sent by Java Arrow would be NULL pointer if unsafe allocator is used. It is because UNSAFE allocateMemory API returns a NULL pointer for zero bytes allocation.

@lidavidm
Copy link
Member

Ok. (1) sounds fine. (2) is also fine. That's explicitly allowed so long as the buffer size really is 0. It won't be relevant after (1) is fixed, right?

@viirya
Copy link
Member Author

viirya commented Feb 13, 2024

Yea, if we don't send empty buffer as offset buffer, it is fine.

@viirya viirya changed the title [Java] The offset buffer of empty BaseVariableWidthVector should not be empty [Java] The offset buffer of empty BaseVariableWidthVector should not be empty when exposed through C Data Interface Feb 22, 2024
lidavidm pushed a commit that referenced this issue Apr 2, 2024
…out through C Data Interface (#40043)

### Rationale for this change

We encountered an error when exchanging string array from Java to Rust through Arrow C data interface. At Rust side, it complains that the buffer at position 1 (offset buffer) is null. After tracing down and some debugging, it looks like the issue is Java Arrow `BaseVariableWidthVector` class assigns an empty offset buffer if the array is empty (value count 0).

According to Arrow [spec](https://arrow.apache.org/docs/format/Columnar.html#variable-size-binary-layout) for variable size binary layout:

> The offsets buffer contains length + 1 signed integers ...

So for an empty string array, its offset buffer should be a buffer with one element (generally it is `0`).

### What changes are included in this PR?

This patch replaces current empty offset buffer in variable-size layout vector classes when exporting arrays through C Data Interface.

### Are these changes tested?

Added test cases.

### Are there any user-facing changes?

No

* Closes: #40038

Authored-by: Liang-Chi Hsieh <viirya@gmail.com>
Signed-off-by: David Li <li.davidm96@gmail.com>
@lidavidm lidavidm added this to the 16.0.0 milestone Apr 2, 2024
tolleybot pushed a commit to tmct/arrow that referenced this issue May 2, 2024
…ze layout through C Data Interface (apache#40043)

### Rationale for this change

We encountered an error when exchanging string array from Java to Rust through Arrow C data interface. At Rust side, it complains that the buffer at position 1 (offset buffer) is null. After tracing down and some debugging, it looks like the issue is Java Arrow `BaseVariableWidthVector` class assigns an empty offset buffer if the array is empty (value count 0).

According to Arrow [spec](https://arrow.apache.org/docs/format/Columnar.html#variable-size-binary-layout) for variable size binary layout:

> The offsets buffer contains length + 1 signed integers ...

So for an empty string array, its offset buffer should be a buffer with one element (generally it is `0`).

### What changes are included in this PR?

This patch replaces current empty offset buffer in variable-size layout vector classes when exporting arrays through C Data Interface.

### Are these changes tested?

Added test cases.

### Are there any user-facing changes?

No

* Closes: apache#40038

Authored-by: Liang-Chi Hsieh <viirya@gmail.com>
Signed-off-by: David Li <li.davidm96@gmail.com>
tolleybot pushed a commit to tmct/arrow that referenced this issue May 4, 2024
…ze layout through C Data Interface (apache#40043)

### Rationale for this change

We encountered an error when exchanging string array from Java to Rust through Arrow C data interface. At Rust side, it complains that the buffer at position 1 (offset buffer) is null. After tracing down and some debugging, it looks like the issue is Java Arrow `BaseVariableWidthVector` class assigns an empty offset buffer if the array is empty (value count 0).

According to Arrow [spec](https://arrow.apache.org/docs/format/Columnar.html#variable-size-binary-layout) for variable size binary layout:

> The offsets buffer contains length + 1 signed integers ...

So for an empty string array, its offset buffer should be a buffer with one element (generally it is `0`).

### What changes are included in this PR?

This patch replaces current empty offset buffer in variable-size layout vector classes when exporting arrays through C Data Interface.

### Are these changes tested?

Added test cases.

### Are there any user-facing changes?

No

* Closes: apache#40038

Authored-by: Liang-Chi Hsieh <viirya@gmail.com>
Signed-off-by: David Li <li.davidm96@gmail.com>
rok pushed a commit to tmct/arrow that referenced this issue May 8, 2024
…ze layout through C Data Interface (apache#40043)

### Rationale for this change

We encountered an error when exchanging string array from Java to Rust through Arrow C data interface. At Rust side, it complains that the buffer at position 1 (offset buffer) is null. After tracing down and some debugging, it looks like the issue is Java Arrow `BaseVariableWidthVector` class assigns an empty offset buffer if the array is empty (value count 0).

According to Arrow [spec](https://arrow.apache.org/docs/format/Columnar.html#variable-size-binary-layout) for variable size binary layout:

> The offsets buffer contains length + 1 signed integers ...

So for an empty string array, its offset buffer should be a buffer with one element (generally it is `0`).

### What changes are included in this PR?

This patch replaces current empty offset buffer in variable-size layout vector classes when exporting arrays through C Data Interface.

### Are these changes tested?

Added test cases.

### Are there any user-facing changes?

No

* Closes: apache#40038

Authored-by: Liang-Chi Hsieh <viirya@gmail.com>
Signed-off-by: David Li <li.davidm96@gmail.com>
rok pushed a commit to tmct/arrow that referenced this issue May 8, 2024
…ze layout through C Data Interface (apache#40043)

### Rationale for this change

We encountered an error when exchanging string array from Java to Rust through Arrow C data interface. At Rust side, it complains that the buffer at position 1 (offset buffer) is null. After tracing down and some debugging, it looks like the issue is Java Arrow `BaseVariableWidthVector` class assigns an empty offset buffer if the array is empty (value count 0).

According to Arrow [spec](https://arrow.apache.org/docs/format/Columnar.html#variable-size-binary-layout) for variable size binary layout:

> The offsets buffer contains length + 1 signed integers ...

So for an empty string array, its offset buffer should be a buffer with one element (generally it is `0`).

### What changes are included in this PR?

This patch replaces current empty offset buffer in variable-size layout vector classes when exporting arrays through C Data Interface.

### Are these changes tested?

Added test cases.

### Are there any user-facing changes?

No

* Closes: apache#40038

Authored-by: Liang-Chi Hsieh <viirya@gmail.com>
Signed-off-by: David Li <li.davidm96@gmail.com>
vibhatha pushed a commit to vibhatha/arrow that referenced this issue May 25, 2024
…ze layout through C Data Interface (apache#40043)

### Rationale for this change

We encountered an error when exchanging string array from Java to Rust through Arrow C data interface. At Rust side, it complains that the buffer at position 1 (offset buffer) is null. After tracing down and some debugging, it looks like the issue is Java Arrow `BaseVariableWidthVector` class assigns an empty offset buffer if the array is empty (value count 0).

According to Arrow [spec](https://arrow.apache.org/docs/format/Columnar.html#variable-size-binary-layout) for variable size binary layout:

> The offsets buffer contains length + 1 signed integers ...

So for an empty string array, its offset buffer should be a buffer with one element (generally it is `0`).

### What changes are included in this PR?

This patch replaces current empty offset buffer in variable-size layout vector classes when exporting arrays through C Data Interface.

### Are these changes tested?

Added test cases.

### Are there any user-facing changes?

No

* Closes: apache#40038

Authored-by: Liang-Chi Hsieh <viirya@gmail.com>
Signed-off-by: David Li <li.davidm96@gmail.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
3 participants